This section, like the previous sections, covers the new HDMD module built on one or more pairs of PowerEdge R650 NVMe servers, each with 10 direct-attached NVMe devices. It documents metadata performance of the new HDMD NVMe module with PowerEdge R750 NVMe servers for data. Metadata performance of the new HDMD NVMe module with ME-based storage for data will be documented in the next update of the pixstor solution, which uses the new PowerVault ME5 array.
Metadata performance was measured with MDtest version 3.3.0, using OpenMPI 4.1.4rc1 to run the benchmark over the 16 compute nodes. The tests varied from a single thread up to 512 threads. The benchmark was used for files only (no directory metadata), measuring the number of create, stat, read, and remove operations the solution can handle.
Several HDMD-on-NVMe modules (R650 NVMe pairs) can be used to increase the number of files supported (inodes), and each additional pair of servers increases metadata performance. A possible exception is stat operations (and reads of empty files): because their rates are already very high, the CPUs become a bottleneck and performance does not continue to increase.
The following command was used to run the benchmark, where the Threads variable is the number of threads used (1 to 512, incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using the round-robin method to spread them homogeneously across the 16 compute nodes. As with the IOR benchmark, the maximum number of threads was limited to 512 because the 16 compute nodes do not have enough cores for more than 640 threads; beyond that point, context switching could affect the results, reporting a number lower than the real performance of the solution.
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --map-by node --mca btl_openib_allow_ib 1 --oversubscribe --prefix /usr/mpi/gcc/openmpi-4.1.2a1 /usr/local/bin/mdtest -v -d /mmfs1/perf/mdtest -P -i 1 -b $Directories -z 1 -L -I 1024 -u -t -F
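The following fragment is a minimal sketch of how the my_hosts.$Threads files could be generated, spreading ranks round-robin across the 16 compute nodes. The hostnames (node01 to node16) are placeholders for illustration, not the hostnames of the testbed:

#!/bin/bash
# Hypothetical hostfile generator: writes one line per MPI rank,
# assigning ranks round-robin across 16 placeholder compute nodes.
NODES=(node{01..16})
for Threads in 1 2 4 8 16 32 64 128 256 512; do
    : > my_hosts.$Threads                       # truncate or create the hostfile
    for ((i = 0; i < Threads; i++)); do
        echo "${NODES[i % ${#NODES[@]}]}" >> my_hosts.$Threads
    done
done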
Because the total number of IOPS, the number of files per directory, and the number of threads can affect performance results, we kept the total number of files fixed at 2 Mi files (2^21 = 2,097,152) and the number of files per directory fixed at 1024, and varied the number of directories as the number of threads changed, as shown in the following table and in the short derivation after it:
Number of threads | Number of directories per thread | Total number of files
--- | --- | ---
1 | 2048 | 2,097,152
2 | 1024 | 2,097,152
4 | 512 | 2,097,152
8 | 256 | 2,097,152
16 | 128 | 2,097,152
32 | 64 | 2,097,152
64 | 32 | 2,097,152
128 | 16 | 2,097,152
256 | 8 | 2,097,152
512 | 4 | 2,097,152
1024 | 2 | 2,097,152
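The directory counts in the table follow from keeping the product of threads, directories per thread, and files per directory constant. As a sketch, the $Directories value passed to MDtest can be derived for any thread count as follows:

# Threads x Directories x 1024 files/directory = 2^21 total files
Directories=$(( 2**21 / (Threads * 1024) ))    # for example, Threads=64 gives 32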
Note that the scale chosen for the graph is logarithmic with base 10, to allow comparing operations that differ by several orders of magnitude; otherwise, some of the operations would appear as a flat line close to 0 on a linear graph. A log graph with base 2 might seem more appropriate because the number of threads increases in powers of 2. Such a graph looks similar, but people tend to perceive and remember numbers based on powers of 10 better.
The system shows good results for stat operations, reaching a peak value of 10.5M op/s at 64 threads. Read operations reached a peak of 4.3M op/s, also at 64 threads (note that empty files are being read). Remove operations attained a maximum of 419.1K op/s and create operations a peak of 252.5K op/s, both at 64 threads. Stat and read operations show more variability, but once they reach their peak values, performance does not drop below 4.5M op/s for stat operations and 2.3M op/s for read operations. Create and remove operations show less variability: create operations increase steadily with the number of threads up to their peak, and remove operations decrease slowly after reaching theirs.
Because these numbers are for a metadata module with a single PowerEdge R650 NVMe metadata pair, performance increases with each additional PowerEdge R650 NVMe pair; however, we cannot assume a linear increase for all operations. Unless a file is small enough to fit entirely inside its inode, data targets on other devices are used to store small files, limiting performance to some degree.
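As an illustrative check only, the following fragment counts files in the benchmark directory that are larger than an assumed in-inode data capacity and would therefore spill onto data targets. The 3584-byte threshold assumes a 4 KiB inode size, which is not specified in this document:

# Hypothetical helper: count files too large for data-in-inode placement.
# THRESHOLD approximates the in-inode data capacity (inode size minus
# metadata overhead) and must be adjusted to the actual inode size in use.
THRESHOLD=3584
find /mmfs1/perf/mdtest -type f -size +${THRESHOLD}c | wc -l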