Ceph performance: benchmark and optimization
Objective
The objective of this test is to showcase the maximum performance achievable in a Ceph cluster (in particular, CephFS) with the INTEL SSDPEYKX040T8 NVMe drives. To avoid accusations of vendor cheating, the industry-standard IO500 benchmark is used to evaluate the performance of the whole storage setup.
Spoiler: even though only a 5-node Ceph cluster is used, and therefore the results cannot be submitted officially (at least 10 nodes would be required), the achieved benchmark score would already be enough to enter the top 40 of the 2019 list of best-performing 10-node storage systems.
Hardware
A lab consisting of seven Supermicro servers connected to a 100Gbps network was provided to croit. Six of the servers had the following specs:
- Model: SSG-1029P-NES32R
- Base board: X11DSF-E
- CPU: 2x Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Turbo frequencies up to 3.70 GHz), 96 virtual cores total
- RAM: 8x Micron Technology 36ASF4G72PZ-2G9E2 32GB DDR4 DIMMs, i.e. 256 GB total, configured at 2400 MT/s but capable of 2933 MT/s
- Storage: 8x INTEL SSDPEYKX040T8 NVMe drives, 4TB each
- Auxiliary storage: a 64GB SATA SSD
- Onboard network: 2x Intel Corporation Ethernet Controller 10G X550T [8086:1563]
- PCIe Ethernet card: Mellanox Technologies MT27700 Family [ConnectX-4] [15b3:1013] with two QSFP ports capable of 100 Gbps each
Also, there was one server with a different configuration:
- Model: SSG-1029P-NMR36L
- Base board: X11DSF-E
- CPU: Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz (Turbo frequencies up to 3.70 GHz), 80 virtual cores total
- RAM: 8x SK Hynix HMA84GR7CJR4N-WM 32GB DDR4 DIMMs, i.e. 256 GB total, configured at 2666 MT/s but capable of 2933 MT/s
- Storage: 32x SAMSUNG MZ4LB3T8HALS NVMe drives, 3.84 TB each (unused)
- Auxiliary storage: a 64GB SATA SSD
- Onboard network: 2x Intel Corporation Ethernet Controller 10G X550T [8086:1563]
- PCIe Ethernet card: Mellanox Technologies MT27700 Family [ConnectX-4] [15b3:1013] with two QSFP ports capable of 100 Gbps each
On all servers, the onboard network was used only for IPMI and management purposes. Of the two 100GbE ports on each Mellanox network card, only one was connected, using a direct-attach copper cable, to an SSE-C3632S switch running Cumulus Linux 4.2.
Software
During the tests, the SSG-1029P-NMR36L server was used as a croit management server, and as a host to run the benchmark on. As it was (rightly) suspected that a single 100Gbps link would not be enough to reveal the performance of the cluster, one of the SSG-1029P-NES32R servers was also dedicated to a client role. On both of these servers, Debian 10.5 was installed. The kernel was installed from the Debian “backports” repository, in order to get the latest improvements in the cephfs client.
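For reference, pulling the kernel from backports on Debian 10 amounts to something like the following (the repository line and package name are the standard Debian ones, not copied from the original setup):
echo 'deb http://deb.debian.org/debian buster-backports main' > /etc/apt/sources.list.d/backports.list
apt update
apt install -t buster-backports linux-image-amd64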
The remaining five SSG-1029P-NES32R servers were used for the Ceph cluster (with Ceph 14.2.9), by means of net-booting them from the management node. The kernel version was 4.19.
Ceph Cluster
Five servers were participating in the Ceph cluster. On three servers, the small SATA SSD was used for a MON disk. On each NVMe drive, one OSD was created. On each server, an MDS (a Ceph component responsible for cephfs metadata operations) was provisioned. In order to parallelize metadata operations where possible (i.e. in the “easy” parts of the benchmark), four MDS servers were marked as active, and the remaining one functioned as a standby. It should be noted that it is still debated whether a configuration with multiple active MDS servers is OK for production Ceph clusters.
On the client nodes, the kernel cephfs client was used, via this line in /etc/fstab:
:/ /mnt/cephfs ceph name=admin,_netdev 0 0
All clients and servers were in one L2 network segment, with 10.10.49.0/24 network. Passwordless ssh was set up, because this is a requirement for OpenMPI, and the IO500 benchmark uses OpenMPI for running the workers in a parallel and distributed way.
Initially, there was an intention to compare the benchmark results to another Ceph cluster, which had 6 OSDs per node. Therefore, two NVMe drives on each host were set aside by assigning a separate device class to them. However, this intention never materialized. Still, most of the benchmarking was done with only 6 NVMe OSDs per node.
The available storage was organized into three pools: cephfs_metadata (64 PGs), cephfs_data (512 PGs), and rbd_benchmark (also 512 PGs). So, while the total number of PGs per OSD was close to the ideal, cephfs worked with fewer PGs in the data pool than one would normally use in this case (i.e. 1024). The theory here is that too few PGs would result in data imbalance (which we don’t really care about here), while too many PGs could create performance problems.
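croit normally creates these pools through its UI; as a rough sketch, the equivalent plain Ceph commands would look like this (the CRUSH rule that pins the pools to the chosen device class is omitted):
ceph osd pool create cephfs_metadata 64
ceph osd pool create cephfs_data 512
ceph fs new cephfs cephfs_metadata cephfs_data
ceph osd pool create rbd_benchmark 512
rbd pool init rbd_benchmark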
IO500 Benchmark
IO500 is a storage benchmark administered by the Virtual Institute for I/O. It measures both the bandwidth and the IOPS figures of a cluster-based filesystem in different scenarios, and derives the final score as a geometric mean of the performance metrics obtained in all phases of the test. During each phase, several copies of either the “ior” tool (for bandwidth testing) or the “mdtest” tool (for testing the performance of various metadata operations) are executed and the results are combined. There is also a phase based on a parallel “find”-like program. As an implementation detail, MPI is used for orchestration.
The sequence in which the phases are executed is organized in such a way that the ior-based and mdtest-based phases are mostly interleaved with each other.
Bandwidth Testing Phases
Bandwidth testing is done with the “ior” tool. There are two “difficulty” levels (easy and hard), and for each level, the bandwidth is measured for writing and, separately, for reading. The bandwidth figure reported at the end of the benchmark is the geometric mean of all four tests.
Writing is done, by default, for 5 minutes on both difficulty levels, using POSIX primitives for filesystem access. In the easy mode, one file is used per ior process, and the writes are done sequentially with a 256 KiB transfer size. In the hard mode, all processes write to interleaved parts of the same file, using the weird 47008-byte transfer size and “jumpy” linear access (lseek() over the bytes to be written by the other processes, write 47008 bytes, repeat the cycle). In both cases, each process does just one fsync() call at the end, i.e., works with essentially unlimited queue depth.
For CephFS with multiple clients, the “hard” I/O pattern is indeed hard: each write results in a partial modification of a RADOS object that was previously touched by another client and could be touched by another client concurrently, and therefore the write has to be performed atomically. The lack of fsync() calls except at the end does not help either: fsync() guarantees that the data hits the stable storage, but what this test cares about is data coherency between clients, which is a completely different thing and cannot be turned off in POSIX-compliant filesystems. Therefore, even though the test shows results in GiB/s, it is mostly affected by the latency of communications between the clients and the metadata servers.
For reading, both the easy and the hard tests use the same files as previously written, access them in the same ways, and verify that the data in the files matches the expected signatures.
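For illustration only, stand-alone ior invocations that approximate the two write patterns described above could look like this (these are not the exact flags that IO500 generates, and the sizes are placeholders):
# "easy": one file per process, sequential 256 KiB transfers, fsync at the end
mpirun -np 4 ior -a POSIX -w -e -F -t 256k -b 8g -o /mnt/cephfs/ior_easy/ior_file_easy
# "hard": all processes interleaved in one shared file, 47008-byte transfers
mpirun -np 4 ior -a POSIX -w -e -t 47008 -b 47008 -s 100000 -o /mnt/cephfs/ior_hard/ior_file_hard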
IOPS Testing Phases
Almost all of the IOPS testing is done with the “mdtest” tool. Unlike ior, mdtest creates a lot of files and stresses metadata operations.
Just as with the bandwidth test, there are two “difficulty” levels that the tests run at: easy and hard. For each difficulty, numerous test files (up to one million) are created by each process in the “create” phase, then all the files are examined in the “stat” phase of the test, and then deleted in the “delete” phase. At the end of each phase, the “sync” command is executed and its runtime is also accounted for.
The hard test differs from the easy one in the following aspects:
- In the easy test, each file is empty; in the hard test, each file gets 3901 bytes written during the “create” phase and read back later;
- In the easy test, each process gets a unique working directory; in the hard test, all processes share one directory.
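For illustration only, mdtest invocations roughly corresponding to the two modes could look like this (not the exact flags IO500 generates; the item count is a placeholder):
# "easy": empty files, each process in its own working directory
mpirun -np 4 mdtest -n 1000000 -F -L -u -d /mnt/cephfs/mdtest_easy
# "hard": 3901-byte files, all processes in one shared directory
mpirun -np 4 mdtest -n 1000000 -F -w 3901 -e 3901 -d /mnt/cephfs/mdtest_hard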
Raw Storage Performance
Croit comes with a built-in fio-based benchmark that serves to evaluate the raw performance of the disk drives in database applications. The benchmark, under the hood, runs this command for various values (from 1 to 16) of the number of parallel jobs:
fio --filename=/dev/XXX --direct=1 --fsync=1 --rw=write --bs=4k --numjobs=YYY --iodepth=1 --runtime=60 --time_based --group_reporting --name=4k-sync-write-YYY
There is no reason not to use this benchmark as “at least something” that characterizes the NVMe drives and serves to objectively compare their performance under e.g. various BIOS settings.
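In other words, the whole sweep can be reproduced with a simple loop around the command above (the device name is an example):
for jobs in 1 2 4 8 16; do
    fio --filename=/dev/nvme0n1 --direct=1 --fsync=1 --rw=write --bs=4k \
        --numjobs=$jobs --iodepth=1 --runtime=60 --time_based \
        --group_reporting --name=4k-sync-write-$jobs
done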
One of the immediate findings was that different servers had different performance, especially in the 1-job variant of the benchmark, ranging between 78K and 91K IOPS. The 16-job figure was more consistent, showing only variations between 548K (strangely, on the server that was the fastest in the 1-job benchmark) and 566K IOPS.
The reason for this variation was originally believed to lie in the non-identical BIOS settings initially present on the servers in the “CPU Configuration” menu. Indeed, talking to high-performance storage is, by itself, a CPU-intensive activity: in this case, 30% CPU was consumed by fio, and 12% by the “kworker/4:1H-kblockd” kernel thread. Therefore, it is plausible that allowing the CPU to reach as high a clock frequency as possible would help.
Counter-intuitively, setting the “Power Technology” parameter in the “Advanced Power Management Configuration” area of the BIOS to “Disable” is the wrong thing to do. It would lock the CPU frequency to the highest non-turbo state, i.e., to 2.10 GHz, and make the 3.70 GHz turbo frequency unavailable. With that setting, the fio storage benchmark yielded only 66K IOPS, which is simply too low.
Another option in this BIOS area decides whether the BIOS or the OS controls the energy-performance bias. If control is given to the BIOS, there is a setting that tells it what to do, and for the “Manual” setting of the “Power Technology” parameter, there are many options for fine-tuning C-, P-, and T-states. As far as NVMe performance goes, we found no significant difference between the “Energy Efficient” and “Custom” power technologies, provided that the energy-performance-bias control, as well as control over the C-, P-, and T-states of the CPU, is given to the OS. In both cases, the NVMe drives initially benchmarked at 89K-91K IOPS. Unfortunately, some later tuning (we do not know what exactly) destroyed this achievement, and the end result is, again, inconsistent performance between 84K and 87K write IOPS for a single thread.
Well, at least from this point on, the BIOS settings became consistent on all servers - see the list below. The idea behind these settings is to give as much control over the CPU state as possible to the OS instead of the hardware or the BIOS. croit has also found that hyper-threading, LLC Prefetch, and Extended APIC did not significantly affect the storage performance. Therefore, all CPU features were left enabled.
- Hyper-Threading [ALL]: Enable
- Cores Enabled: 0 (all)
- Monitor/Mwait: Auto
- Execute Disable Bit: Enable
- Intel Virtualization Technology: Enable
- PPIN Control: Unlock/Enable
- Hardware Prefetcher: Enable
- Adjacent Cache Prefetch: Enable
- DCU Streamer Prefetcher: Enable
- DCU IP Prefetcher: Enable
- LLC Prefetch: Enable
- Extended APIC: Enable
- AES-NI: Enable
- Power Technology: Custom
- Power Performance Tuning: OS Controls EPB
- SpeedStep (P-States): Enable
- EIST PSD Function: HW_ALL
- Turbo Mode: Enable
- Hardware P-States: Disable
- Autonomous Core C-State: Disable
- CPU C6 Report: Auto
- Enhanced Halt State (C1E): Enable
- Package C State: Auto
- Software Controlled T-States: Enable
Network Performance
The Mellanox adapter is able to reach 85+ Gbps throughput without any tuning, but multiple TCP streams are necessary for this. Setting the net.core.rmem_max sysctl to a huge value would further improve the achievable throughput to 94 Gbit/s, but wouldn’t improve the benchmark score, so it wasn’t done.
To demonstrate the excellent throughput, we ran iperf (version 2.0.12) on two hosts, as follows:
On the “server”:
iperf -s -p 9999
On the “client”:
iperf -c 10.10.49.33 -p 9999
where 10.10.49.33 is the server’s IP address.
With a single TCP stream (as illustrated above), iperf indicated 39.5 Gbit/s of throughput. To use four streams simultaneously, the “-P 4” parameters were added on the client side, which brought the throughput up to 87.8 Gbit/s.
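That is, the multi-stream client invocation was simply:
iperf -c 10.10.49.33 -p 9999 -P 4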
Throughput is good, but remember that some of the IO500 tests are effectively latency tests, so latency needs to be optimized as well. For latency testing, we used two objective benchmarks:
- Measuring 4k-sized write IOPS on an RBD device;
- Just pinging another host.
The RBD benchmark command (which is even more aggressive than what IO500 does) is:
fio --ioengine=rbd --pool=rbd_benchmark --rbdname=rbd0 --direct=1 --fsync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=4k-sync-write-1
Without any tuning, it reaches a mere 441 IOPS - a 200+ times reduction compared to the raw storage.
One of the limiting factors here is the CPU clock speed. The default “powersave” CPU frequency governor in the Linux kernel keeps the CPU clock low until it sees that the workload is too tough. And in this case, the workload is probably still “too easy” (27%) and is not seen as a sufficient reason to ramp up the frequency - which can be confirmed by running something like
grep MHz /proc/cpuinfo | sort | tail -n 4
at the same time as fio.
Once the CPU frequency governor is changed to “performance” both on the client and on the Ceph OSD nodes (cpupower frequency-set -g performance), the situation improves: 2369 IOPS.
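A small loop over ssh is enough to apply the governor everywhere (the host list is illustrative):
for host in 10.10.49.2 10.10.49.33 10.10.49.21 10.10.49.22 10.10.49.23 10.10.49.24 10.10.49.25; do
    ssh "$host" cpupower frequency-set -g performance
done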
Optimizing Network Latency
As already mentioned, the IO500 benchmark is sensitive to network latency. Without any tuning, the latency (as reported by the “ping” command) is 0.178 ms, which means that, during the whole request-response cycle, 0.356 ms are just wasted. The ping time is doubled here because there are two hops where latency matters: from the client to the primary OSD, and from the primary OSD to the secondary OSDs. There are 2369 such cycles per second in the fio benchmark from the previous section, so each cycle lasts 0.422 ms on average. Therefore, it would appear that reducing latency is very important.
It turns out that the CPU load is low enough that its cores “take naps” by going into power-saving C-states. The deepest such state is C6, and, according to “cpupower idle-info”, it takes 0.133 ms to transition out of it. The next states are C1E, C1, and “CPUIDLE CORE POLL IDLE” (which does not save any power), and all of them take less than 0.01 ms to get out of. Therefore, the next tuning step was to disable the C6 state. The command to do so, “cpupower idle-set -D 11”, actually means “disable all idle states that take more than 0.011 ms to get out of”. Result: the ping time dropped to 0.054 ms, but the fio benchmark produced only 2079 IOPS - worse than before. That’s probably because cores not in C6 reduce the maximum frequency available to the CPU, and, with this fio benchmark, reaching the highest possible frequency is actually more important.
Still, as we will see later, disabling C6 is beneficial for the overall IO500 score.
Actually Running the IO500 Benchmark
The source code for the IO500 benchmark was obtained from https://github.com/VI4IO/io500-app. The code from the “io500-isc20” branch (which, at that time, pointed to commit 46e0e53) could not be compiled because of improper use of “extern” variables. Fortunately, a bugfix was available in the master branch of the same repository. Therefore, all the benchmarking was done with commit 20efd24. We are aware that a new IO500 release was published on October 7, 2020, but, for consistency, we continued with the commit mentioned above.
The main script for the benchmark is named “io500.sh”. At the top of the file, there is an “io500_mpiargs” variable that is, by default, set to “-np 2”, which means “run two processes locally”. To test that distributed operation works at all, this variable was changed to
-np 4 -mca btl ^openib -mca btl_tcp_if_include 10.10.49.0/24 -oversubscribe -H 10.10.49.2,10.10.49.33 --allow-run-as-root
so that two processes are started on each of the two client nodes.
The “-mca btl ^openib” parameter excludes InfiniBand from the list of transports to be tried by OpenMPI. This is necessary because the Mellanox network adapter supports InfiniBand in theory, but InfiniBand has not been configured in this cluster. The benchmark does not need to send a large amount of data between the workers, so a fallback to TCP is acceptable.
The “-mca btl_tcp_if_include 10.10.49.0/24” parameter specifies the network for OpenMPI to use during the benchmark. Without this parameter, OpenMPI would sometimes choose the docker0 interface on one of the hosts as the primary one, and attempt to connect to 172.17.0.1 from the other nodes, which fails.
The benchmark runs all its phases twice, once using a shell script for coordination, and once using a C program. These two “drivers” print the results in slightly different formats, but quantitatively, there is not much difference. For this reason, only the benchmark scores reported by the shell-based driver are mentioned below.
The benchmark also needs a configuration file. An extensively commented example, config-full.ini, is provided together with the benchmark sources. However, only a few options were needed: the directory where IO500 should write data, an optional script to drop caches, and, for debugging, a way to shorten the benchmark duration. It does not make sense to perform a full run of the benchmark on every attempt while searching for the best options; therefore, the stonewall timer was set to 30 seconds.
[global]
datadir = /mnt/cephfs/default
drop-caches = TRUE
drop-caches-cmd = /usr/local/bin/drop_caches
[debug]
stonewall-time = 30
The contents of the /usr/local/bin/drop_caches script are:
#!/bin/sh
echo 3 > /proc/sys/vm/drop_caches
ssh 10.10.49.33 echo 3 ">" /proc/sys/vm/drop_caches
To Disable C6 or Not?
The 30-second version of the benchmark runs OK and produces this report on a completely untuned cluster (apart from the OS tuning described above), with the C6 idle state enabled (which is the default):
[RESULT] BW phase 1 ior_easy_write 6.021 GiB/s : time 35.83 seconds
[RESULT] BW phase 2 ior_hard_write 0.068 GiB/s : time 43.69 seconds
[RESULT] BW phase 3 ior_easy_read 5.144 GiB/s : time 46.86 seconds
[RESULT] BW phase 4 ior_hard_read 0.219 GiB/s : time 13.52 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 10.334 kiops : time 32.09 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 5.509 kiops : time 45.68 seconds
[RESULT] IOPS phase 3 find 123.770 kiops : time 4.71 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 31.086 kiops : time 10.67 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 30.733 kiops : time 8.19 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 4.868 kiops : time 68.13 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 5.734 kiops : time 43.88 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 3.443 kiops : time 75.07 seconds
[SCORE] Bandwidth 0.822726 GiB/s : IOPS 12.6286 kiops : TOTAL 3.22333
With C6 disabled, the “ior_hard_write” test gets a significant boost, but the majority of the other tests get worse results. Still, the overall score is slightly improved, both due to bandwidth and due to IOPS:
[RESULT] BW phase 1 ior_easy_write 5.608 GiB/s : time 35.97 seconds
[RESULT] BW phase 2 ior_hard_write 0.101 GiB/s : time 36.17 seconds
[RESULT] BW phase 3 ior_easy_read 4.384 GiB/s : time 47.43 seconds
[RESULT] BW phase 4 ior_hard_read 0.223 GiB/s : time 16.30 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 10.614 kiops : time 31.73 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 4.884 kiops : time 43.06 seconds
[RESULT] IOPS phase 3 find 157.530 kiops : time 3.47 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 26.136 kiops : time 12.88 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 30.081 kiops : time 6.99 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 5.122 kiops : time 65.74 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 7.689 kiops : time 27.35 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 3.382 kiops : time 64.18 seconds
[SCORE] Bandwidth 0.86169 GiB/s : IOPS 13.0773 kiops : TOTAL 3.35687
This is not really surprising: the ior_easy_write step is bottlenecked on 100% CPU being consumed by the “ior” process, and the clock speed available with zero cores in C6 is lower than it would be otherwise. To avoid this CPU core saturation, a separate test with 4 processes per host (instead of 2) was performed.
Results with C6 enabled:
[RESULT] BW phase 1 ior_easy_write 7.058 GiB/s : time 38.88 seconds
[RESULT] BW phase 2 ior_hard_write 0.074 GiB/s : time 39.40 seconds
[RESULT] BW phase 3 ior_easy_read 7.933 GiB/s : time 34.78 seconds
[RESULT] BW phase 4 ior_hard_read 0.172 GiB/s : time 16.97 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 11.416 kiops : time 34.38 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 5.492 kiops : time 43.10 seconds
[RESULT] IOPS phase 3 find 169.540 kiops : time 3.71 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 41.339 kiops : time 9.50 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 47.345 kiops : time 5.00 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 8.997 kiops : time 43.63 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 9.854 kiops : time 24.02 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 3.213 kiops : time 75.66 seconds
[SCORE] Bandwidth 0.919144 GiB/s : IOPS 16.6569 kiops : TOTAL 3.91281
Results with C6 disabled:
[RESULT] BW phase 1 ior_easy_write 5.983 GiB/s : time 39.96 seconds
[RESULT] BW phase 2 ior_hard_write 0.100 GiB/s : time 37.91 seconds
[RESULT] BW phase 3 ior_easy_read 7.413 GiB/s : time 31.65 seconds
[RESULT] BW phase 4 ior_hard_read 0.232 GiB/s : time 16.26 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 9.793 kiops : time 35.57 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 4.845 kiops : time 36.70 seconds
[RESULT] IOPS phase 3 find 147.360 kiops : time 3.57 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 50.768 kiops : time 6.86 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 50.125 kiops : time 3.55 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 7.763 kiops : time 44.87 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 13.135 kiops : time 13.54 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 3.699 kiops : time 50.04 seconds
[SCORE] Bandwidth 1.00608 GiB/s : IOPS 16.918 kiops : TOTAL 4.12563
The conclusion stays the same: the advice to disable C6 is not universal. It helps in some “hard” parts of the benchmark but hurts others. Still, it improves the overall score a little bit in both cases, so C6 is disabled in the remaining benchmarks.
Tuning the MDSs
The short version of the test ran smoothly. However, during the full benchmark with the default 300-second stonewall timer (specifically, during the “mdtest_hard_write” phase), some MDS health warnings were observed:
- MDSs behind on trimming
- MDSs report oversized cache
- Clients failing to respond to cache pressure
Also, the final score was lower, specifically due to the mdtest-based “stat” tests:
[RESULT] BW phase 1 ior_easy_write 5.442 GiB/s : time 314.02 seconds
[RESULT] BW phase 2 ior_hard_write 0.099 GiB/s : time 363.64 seconds
[RESULT] BW phase 3 ior_easy_read 7.838 GiB/s : time 215.95 seconds
[RESULT] BW phase 4 ior_hard_read 0.231 GiB/s : time 155.15 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 11.423 kiops : time 431.68 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 5.518 kiops : time 328.02 seconds
[RESULT] IOPS phase 3 find 120.880 kiops : time 55.76 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 0.866 kiops : time 5694.08 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 2.072 kiops : time 873.55 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 1.972 kiops : time 2500.54 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 1.925 kiops : time 940.46 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 3.304 kiops : time 549.71 seconds
[SCORE] Bandwidth 0.99279 GiB/s : IOPS 4.51093 kiops : TOTAL 2.11622
Clearly, the MDSs were overwhelmed by the metadata load created by the benchmark. Therefore, a decision was made to increase the “mds cache memory limit” Ceph parameter to 12884901888 (12 GB) from the default value of 1 GB. The exact value was chosen as 2x the peak reported cache size from the health warning.
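Assuming the centralized configuration database is used (croit may apply the same setting through its own configuration management), the change boils down to a single command:
ceph config set mds mds_cache_memory_limit 12884901888
This almost restores the score seen in the short version of the benchmark: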
[RESULT] BW phase 1 ior_easy_write 5.274 GiB/s : time 322.44 seconds
[RESULT] BW phase 2 ior_hard_write 0.105 GiB/s : time 348.94 seconds
[RESULT] BW phase 3 ior_easy_read 7.738 GiB/s : time 217.92 seconds
[RESULT] BW phase 4 ior_hard_read 0.239 GiB/s : time 153.87 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 10.692 kiops : time 429.36 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 5.318 kiops : time 324.34 seconds
[RESULT] IOPS phase 3 find 211.550 kiops : time 29.85 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 44.120 kiops : time 104.05 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 29.881 kiops : time 57.72 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 6.993 kiops : time 656.42 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 9.773 kiops : time 176.46 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 2.949 kiops : time 586.78 seconds
[SCORE] Bandwidth 1.0071 GiB/s : IOPS 15.4197 kiops : TOTAL 3.94071
Eliminating the Client-Side Bottleneck
Increasing the number of workers per host beyond 4 did not result in any improvement of the IO500 score, and did not increase the CPU usage by the OSDs. In this scenario, the likely bottleneck during the “ior” phases is the single client-side “kworker/u194:0+flush-ceph-1” thread, which consumed 70-80% of a CPU core. There is only one such thread because all worker processes on a host share a single cephfs mount.
The obvious solution is to mount cephfs multiple times, in such a way that the mounts do not share a single “kworker/u194:0+flush-ceph-1” thread. The mount option that creates a separate client instance is called “noshare”. Here is its description from the “mount.ceph” manual page:
Create a new client instance, instead of sharing an existing instance of a client mounting the same cluster.
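For example, a second, independent client instance could be mounted via an fstab line like this (the second mount point is arbitrary):
:/ /mnt/cephfs2 ceph name=admin,noshare,_netdev 0 0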
However, the IO500 benchmark is not designed to access the storage via multiple mount points. To circumvent this restriction, multiple LXC containers were created on each client host, with a separate cephfs mount for each container. For networking, the “macvlan” backend was used, in order to directly expose each container to the 100Gbps network without additional overhead due to routing. On each host itself, a separate macvlan interface was added, so that containers could communicate with the host. Here is the relevant part of /etc/network/interfaces:
iface enp216s0f0np0 inet manual
up /sbin/ip link add link enp216s0f0np0 name enp216s0f0mv0 type macvlan mode bridge
allow-hotplug enp216s0f0mv0
iface enp216s0f0mv0 inet static
address 10.10.49.2/24
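The container side of the networking is not shown above; as a sketch, the per-container LXC network configuration (LXC 3.x key names; the address is an example) could look like this:
lxc.net.0.type = macvlan
lxc.net.0.macvlan.mode = bridge
lxc.net.0.link = enp216s0f0np0
lxc.net.0.flags = up
lxc.net.0.ipv4.address = 10.10.49.3/24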
It is not possible to flush Linux caches from within a container; therefore, the “/usr/local/bin/drop_caches” script in a container had to be modified as follows:
#!/bin/sh
ssh 10.10.49.2 echo 3 ">" /proc/sys/vm/drop_caches
ssh 10.10.49.33 echo 3 ">" /proc/sys/vm/drop_caches
That is, it would connect to both client hosts via ssh and flush caches there the usual way.
The optimal results were obtained with four containers per host, and three or four worker processes per container. The mpirun arguments that prescribe this are: “-np 32 -oversubscribe -H 10.10.49.3,10.10.49.4,... --allow-run-as-root”. In fact, with a 30-second stonewall timer, some phases of the benchmark, e.g. “mdtest_hard_stat”, became too short (less than 4 seconds) for the results to be reliable and reproducible. Therefore, from this point on, the benchmark was run with a 300-second stonewall timer.
Here are the benchmark results for three processes per container:
[RESULT] BW phase 1 ior_easy_write 11.307 GiB/s : time 372.62 seconds
[RESULT] BW phase 2 ior_hard_write 0.383 GiB/s : time 352.78 seconds
[RESULT] BW phase 3 ior_easy_read 15.144 GiB/s : time 277.94 seconds
[RESULT] BW phase 4 ior_hard_read 0.931 GiB/s : time 145.10 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 14.032 kiops : time 472.96 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 9.891 kiops : time 313.09 seconds
[RESULT] IOPS phase 3 find 231.190 kiops : time 42.10 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 55.821 kiops : time 118.89 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 79.348 kiops : time 39.03 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 9.597 kiops : time 691.54 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 16.702 kiops : time 185.42 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 8.507 kiops : time 366.75 seconds
[SCORE] Bandwidth 2.79532 GiB/s : IOPS 25.7583 kiops : TOTAL 8.48544
With four processes per container, some of the benchmark results (notably, “mdtest_hard_read”) look better, but others look worse, and the cluster definitely feels overloaded (the reported median latency comes close to 0.5s). So it is debatable whether it’s better to use three or four worker processes per container here. Here are the benchmark results:
[RESULT] BW phase 1 ior_easy_write 11.459 GiB/s : time 350.88 seconds
[RESULT] BW phase 2 ior_hard_write 0.373 GiB/s : time 354.69 seconds
[RESULT] BW phase 3 ior_easy_read 16.011 GiB/s : time 250.28 seconds
[RESULT] BW phase 4 ior_hard_read 0.930 GiB/s : time 142.02 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 15.894 kiops : time 441.59 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 10.690 kiops : time 313.34 seconds
[RESULT] IOPS phase 3 find 174.440 kiops : time 59.44 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 55.944 kiops : time 125.46 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 66.896 kiops : time 50.07 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 9.371 kiops : time 748.99 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 19.120 kiops : time 175.18 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 9.948 kiops : time 340.54 seconds
[SCORE] Bandwidth 2.82402 GiB/s : IOPS 25.8226 kiops : TOTAL 8.53953
This is quite a serious load: in the Ceph cluster, each OSD’s CPU consumption peaked at ~400% during the “ior_easy_write” phase. This is also the first time that the total client throughput was observed to exceed the capacity of a single 100Gbps link.
Scaling the MDSs
The MDS tuning performed in the previous sections has quite obviously reached its limit: during both the “mdtest_easy_delete” and “mdtest_hard_delete” phases, the CPU usage of each active ceph-mds process comes close to 100%. In other words, with only four active MDSs, that’s about as fast as the cluster can go in these metadata-intensive benchmarks.
But who said that with five physical servers, one is limited to four active MDSs and one standby? Only the croit UI. The official Ceph documentation mentions multiple MDS daemons per node as a possible solution for MDS bottlenecks:
Even if a single MDS daemon is unable to fully utilize the hardware, it may be desirable later on to start more active MDS daemons on the same node to fully utilize the available cores and memory. Additionally, it may become clear with workloads on the cluster that performance improves with multiple active MDS on the same node rather than over-provisioning a single MDS.
In fact, we found that even two MDSs per host are not enough to avoid the bottleneck. So we deployed 20 MDSs in total - four per host - and made 16 of them active. Here is the manual procedure that we used.
On each host, one has to (temporarily) save the admin keyring as /etc/ceph/ceph.client.admin.keyring. After that, execute the following commands, substituting the correct hostname instead of croit-host-25 and making up unique names for the new MDSs.
# sudo -u ceph -s /bin/bash
$ mkdir /var/lib/ceph/mds/ceph-croit-host-25a
$ ceph-authtool --create-keyring /var/lib/ceph/mds/ceph-croit-host-25a/keyring --gen-key -n mds.croit-host-25a
$ ceph auth add mds.croit-host-25a osd "allow rwx" mds "allow" mon "allow profile mds" -i /var/lib/ceph/mds/ceph-croit-host-25a/keyring
$ cat /var/lib/ceph/mds/ceph-croit-host-25a/keyring
As a result, one has three new valid keyrings per host, one for each future MDS. Inside the croit container, one has to insert them into the database - something that the UI does not allow.
$ mysql croit
MariaDB [croit]> insert into service values(NULL, 5, 'mds', NULL, 'croit-host-25a', '[mds.croit-host-25a]\nkey = AQA9jn1f7oyKDRAACIJ0Kj8e8D1C4z4p8hU2WA==\n', 'enabled', NOW(), NULL);
[and the remaining keys]
Then croit starts the extra MDS daemons on each of the cluster nodes, and these extra daemons even survive a rolling reboot.
Finally, to make them active, the following command is needed:
ceph fs set cephfs max_mds 16
Benchmark results have improved, as expected (this is with four worker processes per container):
[RESULT] BW phase 1 ior_easy_write 13.829 GiB/s : time 325.93 seconds
[RESULT] BW phase 2 ior_hard_write 0.373 GiB/s : time 365.60 seconds
[RESULT] BW phase 3 ior_easy_read 16.204 GiB/s : time 278.80 seconds
[RESULT] BW phase 4 ior_hard_read 0.920 GiB/s : time 148.24 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 25.154 kiops : time 594.93 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 13.227 kiops : time 316.73 seconds
[RESULT] IOPS phase 3 find 391.840 kiops : time 48.88 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 153.305 kiops : time 97.61 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 93.870 kiops : time 44.63 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 20.886 kiops : time 716.49 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 28.205 kiops : time 148.53 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 10.496 kiops : time 401.73 seconds
[SCORE] Bandwidth 2.96175 GiB/s : IOPS 42.959 kiops : TOTAL 11.2798
It looks like four MDSs per host are indeed the maximum that the cluster can usefully have. We do not see periods anymore when they all consume 100% of the CPU for a long time. The typical figures are now 120% + 80% + 60% + 20%.
“Optimizations” That Didn’t Work Out
Extra OSDs
One could, in theory, do the same kind of extra provisioning with OSDs. In fact, in the past, it was a common recommendation to provision several (2-4) OSDs per SSD. Modern Ceph has a different solution to offer: built-in OSD sharding, controlled by the “osd op num shards” and “osd op num threads per shard” parameters. We investigated both options, but found that neither of them improves the overall benchmark score. The only metric that is consistently improved is “mdtest_hard_stat”, but this is offset by other components of the score becoming worse.
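For reference, these knobs could be set like this (the values are purely illustrative, and the OSDs typically have to be restarted for them to take effect):
ceph config set osd osd_op_num_shards 16
ceph config set osd osd_op_num_threads_per_shard 2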
Tuning OSD Memory Target
Bluestore OSDs use their own cache, and not the Linux page cache, for storing copies of data that might be needed again. Generally, increasing the “osd memory target” tunable from its default value (3 GB in croit) is supposed to improve performance, because the OSDs would then serve more data from the cache and less from the disks. But this particular benchmark is designed to make caches as ineffective as possible: it reads data only after having written all of it. Therefore, and we have confirmed it experimentally, increasing the OSD memory target has no effect on the final score.
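For completeness, raising the limit would again be a single command (the value shown, 6 GB, is just an example):
ceph config set osd osd_memory_target 6442450944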
Tuning Network MTU
All the benchmarking so far has been conducted with the default MTU (1500) and socket buffer sizes. However, it is commonly advised to set the MTU to 9000 on high-speed networks, and our 100Gbps network certainly qualifies. Also, it is suggested to increase the limits and defaults for socket buffer sizes.
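For reference, the change itself is a one-liner per node (the interface name is taken from the earlier snippet; the switch ports have to be configured for jumbo frames as well):
ip link set dev enp216s0f0np0 mtu 9000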
We started by increasing the MTU to 9000, both on the OSD nodes and on the clients. After the change, the performance of the “ior_easy_write” phase indeed improved, to 15.163 GiB/s. However, we were not able to finish the benchmark, because of the “MDS behind on trimming” and “MDS slow ops” health warnings and the resulting blacklisting of the client. Then one MDS went into read-only mode, which usually happens when the filesystem is somehow corrupted.
We went on and retested the network throughput using iperf. It was worse than before. With MTU 1500 and the default socket buffer sizes, we were able to reach 87 Gbit/s using four concurrent connections. With MTU 9000, we got only 77 Gbit/s. Apparently, the network card can efficiently perform TCP segmentation offload when sending, and the corresponding aggregation when receiving, only with MTU 1500.
The MTU tuning was undone, and the filesystem, unfortunately, had to be destroyed and recreated.
Tuning TCP Parameters
We tried to allow the Linux kernel to auto-tune TCP buffers to larger values, both on the OSDs and the clients, using the following sysctls:
net.core.rmem_max=33554432
net.core.wmem_max=33554432
net.ipv4.tcp_rmem=4096 65536 33554432
net.ipv4.tcp_wmem=4096 131072 33554432
The result, again, was a score reduction:
[RESULT] BW phase 1 ior_easy_write 12.905 GiB/s : time 351.74 seconds
[RESULT] BW phase 2 ior_hard_write 0.382 GiB/s : time 354.09 seconds
[RESULT] BW phase 3 ior_easy_read 16.459 GiB/s : time 275.82 seconds
[RESULT] BW phase 4 ior_hard_read 0.926 GiB/s : time 145.97 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 24.030 kiops : time 637.19 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 12.289 kiops : time 331.97 seconds
[RESULT] IOPS phase 3 find 252.270 kiops : time 76.87 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 95.420 kiops : time 160.47 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 97.223 kiops : time 41.96 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 16.660 kiops : time 919.06 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 15.179 kiops : time 268.76 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 8.265 kiops : time 497.80 seconds
[SCORE] Bandwidth 2.9435 GiB/s : IOPS 33.1105 kiops : TOTAL 9.87222
Just like any other bad tuning attempt, these settings were undone.
Increasing MDS Log Max Segments
To prevent the “MDSs behind on trimming” health warning that we had seen with MTU = 9000, we tried to set the “mds log max segments” option to 1280 (approximately 2x the peak value of num_segments from the warning). This severely degraded the mdtest performance and was therefore undone as well.
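The option was applied in the same way as the MDS cache size (assuming the centralized configuration database):
ceph config set mds mds_log_max_segments 1280
Here is the benchmark report with MTU = 1500: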
[RESULT] BW phase 1 ior_easy_write 12.733 GiB/s : time 335.69 seconds
[RESULT] BW phase 2 ior_hard_write 0.374 GiB/s : time 348.14 seconds
[RESULT] BW phase 3 ior_easy_read 17.026 GiB/s : time 256.40 seconds
[RESULT] BW phase 4 ior_hard_read 0.932 GiB/s : time 139.58 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 26.622 kiops : time 575.94 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 12.984 kiops : time 328.36 seconds
[RESULT] IOPS phase 3 find 341.760 kiops : time 57.34 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 107.865 kiops : time 142.15 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 103.463 kiops : time 41.21 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 14.181 kiops : time 1081.19 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 15.855 kiops : time 268.90 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 8.702 kiops : time 493.00 seconds
[SCORE] Bandwidth 2.94821 GiB/s : IOPS 35.5994 kiops : TOTAL 10.2447
Eliminating the KSWAPD Bottleneck
During the IOR phases of the benchmark, one could see the “kswapd[01]” kernel threads eating 100% CPU on the clients. The “kswapd” threads, one per NUMA node, keep the Linux memory subsystem happy about the number of free pages. In particular, they try to reclaim page cache. So, it looks like these benchmark phases produce too many dirty or cached pages. The benchmark offers an option (“posix.odirect = True” in the [global] section of the configuration file, or in phase-specific sections) to bypass the page cache by means of the O_DIRECT flag. However, it actually makes the score of each IOR phase much worse, i.e. does exactly the opposite of the intended effect.
As an attempt to make more CPU time (actually, more CPUs) available for the task of page cache reclamation, a set of fake NUMA nodes was created on both clients, using the “numa=fake=8” kernel command line argument.
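On a Debian client booting via GRUB, the parameter can be added roughly as follows (stock Debian paths; a reboot is required):
# append numa=fake=8 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub,
# e.g. GRUB_CMDLINE_LINUX_DEFAULT="quiet numa=fake=8", then:
update-grub
reboot
Result: this did not help the IOR phases (so kswapd was not a bottleneck, after all), but hurt “find” and most of the mdtest-based phases.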
[RESULT] BW phase 1 ior_easy_write 13.674 GiB/s : time 342.55 seconds
[RESULT] BW phase 2 ior_hard_write 0.381 GiB/s : time 355.32 seconds
[RESULT] BW phase 3 ior_easy_read 16.784 GiB/s : time 278.14 seconds
[RESULT] BW phase 4 ior_hard_read 0.925 GiB/s : time 146.32 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 23.718 kiops : time 578.62 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 11.818 kiops : time 326.18 seconds
[RESULT] IOPS phase 3 find 261.800 kiops : time 67.14 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 101.016 kiops : time 135.85 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 69.404 kiops : time 55.54 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 18.087 kiops : time 758.75 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 26.474 kiops : time 145.60 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 9.933 kiops : time 390.60 seconds
[SCORE] Bandwidth 2.99932 GiB/s : IOPS 35.3655 kiops : TOTAL 10.2991
So, this had to be undone as well.
Doing Nothing
Doing nothing is not supposed to hurt. It is not supposed to help either, so it is a great way to test reproducibility of benchmark results. Therefore, with all the bad optimizations undone, we would expect to regain the high score we have previously seen. Let’s check:
[RESULT] BW phase 1 ior_easy_write 13.359 GiB/s : time 455.21 seconds
[RESULT] BW phase 2 ior_hard_write 0.376 GiB/s : time 338.87 seconds
[RESULT] BW phase 3 ior_easy_read 16.634 GiB/s : time 366.70 seconds
[RESULT] BW phase 4 ior_hard_read 0.934 GiB/s : time 136.30 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 24.508 kiops : time 606.63 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 12.352 kiops : time 329.28 seconds
[RESULT] IOPS phase 3 find 288.930 kiops : time 65.53 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 109.363 kiops : time 135.95 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 98.175 kiops : time 41.43 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 18.731 kiops : time 793.72 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 26.504 kiops : time 153.46 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 10.293 kiops : time 397.73 seconds
[SCORE] Bandwidth 2.9722 GiB/s : IOPS 38.4718 kiops : TOTAL 10.6933
Not quite there (mostly because of “find”), but at least still better than all misguided optimization attempts discussed so far. So, the conclusions are still valid.
Benchmarking the Full Cluster
Remember that we set aside two NVMe drives on each node, so that they did not participate in the benchmark. Now is the time to bring them back and thus, perhaps, prove that the underlying storage is at least part of the bottleneck.
[RESULT] BW phase 1 ior_easy_write 12.982 GiB/s : time 391.63 seconds
[RESULT] BW phase 2 ior_hard_write 0.383 GiB/s : time 349.43 seconds
[RESULT] BW phase 3 ior_easy_read 16.991 GiB/s : time 304.82 seconds
[RESULT] BW phase 4 ior_hard_read 0.923 GiB/s : time 145.14 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 23.997 kiops : time 600.49 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 11.300 kiops : time 335.35 seconds
[RESULT] IOPS phase 3 find 345.570 kiops : time 52.66 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 113.218 kiops : time 127.28 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 99.407 kiops : time 38.12 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 19.261 kiops : time 748.14 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 28.119 kiops : time 134.76 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 10.485 kiops : time 364.02 seconds
[SCORE] Bandwidth 2.97168 GiB/s : IOPS 39.5523 kiops : TOTAL 10.8414
Well, the improvement is insignificant. On the other hand, with only four OSDs per node (even if we strategically choose them so that the NUMA nodes of their NVMe drives match that of the network card), the score is slightly reduced, because the NVMe drives are overloaded during the IOR phases:
[RESULT] BW phase 1 ior_easy_write 10.775 GiB/s : time 344.10 seconds
[RESULT] BW phase 2 ior_hard_write 0.374 GiB/s : time 359.87 seconds
[RESULT] BW phase 3 ior_easy_read 16.431 GiB/s : time 224.47 seconds
[RESULT] BW phase 4 ior_hard_read 0.925 GiB/s : time 145.59 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 25.422 kiops : time 641.73 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 12.040 kiops : time 324.63 seconds
[RESULT] IOPS phase 3 find 374.070 kiops : time 54.06 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 101.753 kiops : time 160.33 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 78.571 kiops : time 49.74 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 19.180 kiops : time 850.58 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 25.747 kiops : time 151.80 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 9.840 kiops : time 399.77 seconds
[SCORE] Bandwidth 2.79783 GiB/s : IOPS 38.1085 kiops : TOTAL 10.3257
So, while the amount of storage does affect performance, it does so only slightly. This is an argument for the theory that the real bottleneck is elsewhere.
Changing the Number of PGs
As we have already mentioned, the cephfs_data pool was created with fewer PGs (512) than normally recommended (1024) for a cluster of this size. Just for testing, we resized the pool to 1024 PGs. The result is a very slight decrease in the score:
[RESULT] BW phase 1 ior_easy_write 13.373 GiB/s : time 349.24 seconds
[RESULT] BW phase 2 ior_hard_write 0.376 GiB/s : time 361.25 seconds
[RESULT] BW phase 3 ior_easy_read 16.557 GiB/s : time 282.76 seconds
[RESULT] BW phase 4 ior_hard_read 0.912 GiB/s : time 148.76 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 20.689 kiops : time 793.92 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 11.058 kiops : time 344.35 seconds
[RESULT] IOPS phase 3 find 333.310 kiops : time 60.70 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 143.516 kiops : time 114.45 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 104.899 kiops : time 36.30 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 19.006 kiops : time 864.21 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 23.765 kiops : time 160.23 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 9.188 kiops : time 417.02 seconds
[SCORE] Bandwidth 2.95176 GiB/s : IOPS 38.4369 kiops : TOTAL 10.6516
This is with 8 OSDs per node.
Upgrading to Octopus
As a final part of our testing, we installed a nightly version of the croit container and used it to upgrade the cluster to Ceph Octopus (version 15.2.5). This does result in some differences:
[RESULT] BW phase 1 ior_easy_write 13.656 GiB/s : time 347.14 seconds
[RESULT] BW phase 2 ior_hard_write 0.332 GiB/s : time 356.28 seconds
[RESULT] BW phase 3 ior_easy_read 16.740 GiB/s : time 282.90 seconds
[RESULT] BW phase 4 ior_hard_read 0.790 GiB/s : time 149.64 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 26.457 kiops : time 436.90 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 12.974 kiops : time 335.61 seconds
[RESULT] IOPS phase 3 find 413.790 kiops : time 38.46 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 117.764 kiops : time 98.16 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 90.990 kiops : time 47.85 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 20.957 kiops : time 551.56 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 25.622 kiops : time 169.94 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 9.646 kiops : time 453.91 seconds
[SCORE] Bandwidth 2.7817 GiB/s : IOPS 40.9341 kiops : TOTAL 10.6708
Less bandwidth in the “hard” tests, more IOPS in “find” and “mdtest_easy_write”, but the final score is the same.
We have also retested MTU 9000 while on Ceph Octopus. This time, the cluster has survived the benchmark, and the positive effect on the “ior_easy_write” phase has been confirmed, but the overall score has decreased:
[RESULT] BW phase 1 ior_easy_write 15.608 GiB/s : time 343.82 seconds
[RESULT] BW phase 2 ior_hard_write 0.333 GiB/s : time 356.22 seconds
[RESULT] BW phase 3 ior_easy_read 14.657 GiB/s : time 368.92 seconds
[RESULT] BW phase 4 ior_hard_read 0.783 GiB/s : time 151.37 seconds
[RESULT] IOPS phase 1 mdtest_easy_write 25.044 kiops : time 557.53 seconds
[RESULT] IOPS phase 2 mdtest_hard_write 11.682 kiops : time 326.95 seconds
[RESULT] IOPS phase 3 find 394.750 kiops : time 45.04 seconds
[RESULT] IOPS phase 4 mdtest_easy_stat 121.566 kiops : time 114.85 seconds
[RESULT] IOPS phase 5 mdtest_hard_stat 91.986 kiops : time 41.52 seconds
[RESULT] IOPS phase 6 mdtest_easy_delete 19.932 kiops : time 700.49 seconds
[RESULT] IOPS phase 7 mdtest_hard_read 21.683 kiops : time 176.15 seconds
[RESULT] IOPS phase 8 mdtest_hard_delete 8.189 kiops : time 471.37 seconds
[SCORE] Bandwidth 2.77798 GiB/s : IOPS 38.2382 kiops : TOTAL 10.3065
So, we can conclude that 11.27 is the highest score that we could obtain in the IO500 benchmark on this cluster, and 10.84 is the highest score that we could obtain reproducibly.
Remaining Bottlenecks
A benchmark report is incomplete without an explanation of why the test cannot go faster, that is, without identifying the bottlenecks that cannot be removed.
The “ior_easy_write” phase is unique because it is bottlenecked, or almost bottlenecked, on multiple factors: with fewer OSDs, it would be disk performance, and with MTU 1500, it is the network throughput (80 Gbit/s on the OSDs). Also, on the clients, CPU use by kswapd might be a problem (but we could not confirm that it actually is one).
Several phases (“ior_hard_write”, “ior_hard_read”, “mdtest_hard_write”) are positively affected by disabling C6, that is, they are sensitive to network latency. Even more of them (“mdtest_easy_write”, “mdtest_easy_stat”, “mdtest_hard_stat”, “mdtest_easy_delete”, “mdtest_hard_read”) cause the CPU usage by the MDSs to spike up to 100%, which, because an MDS is mostly single-threaded, suggests an MDS bottleneck. For two phases (“find” and “mdtest_hard_delete”) we could not pinpoint the exact bottleneck, but they are positively affected by the addition of extra MDSs, so that might be it.
Final Words
Ceph has many tunables. However, only one of them (mds cache memory limit) was needed here. The rest of the work was all about tuning the Linux OS, planning the cluster, and creating enough clients.