pixstor solution with capacity expansion and high demand meta-data
This benchmark was performed on the initial Large configuration plus four ME484 expansions, that is, two PowerEdge R750 servers connected to four ME4084 arrays and four ME484 expansions (one behind each ME4084 array), with the optional high demand metadata (HDMD) module (two PowerEdge R750 servers using a single ME4024 array).
Sequential N clients to N files performance was measured with IOzone version 3.492. The tests that we ran varied from a single thread up to 1024 threads, and the results of the capacity-expanded solution (4 x ME4084 arrays plus 4 x ME484 expansions) are contrasted with the large-size solution (4 x ME4084 arrays).
We used files large enough to minimize caching effects, with a total data size of 8 TiB, more than twice the total memory of the servers and clients combined. Note that GPFS uses the page pool tunable to set the maximum amount of memory used for caching data, regardless of how much RAM is installed and free (it was set to 32 GiB on clients and 96 GiB on servers to allow I/O optimizations). While other Dell HPC solutions use a 1 MiB block size for large sequential transfers, GPFS was formatted with a block size of 8 MiB; therefore, that value or its multiples should be used in the benchmark for optimal performance. A block size of 8 MiB might seem too large and likely to waste space when using small files, but GPFS uses subblock allocation to prevent that situation. In the current configuration, each block was subdivided into 512 subblocks of 16 KiB each.
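As a quick sanity check, the block size and subblock size can be confirmed directly on the file system with the mmlsfs command, as in the following sketch; the device name mmfs1 is an assumption taken from the mount point used later in this document.

# Sketch: confirm the block and subblock sizes of the GPFS file system (device name assumed).
mmlsfs mmfs1 -B    # Block size: expected 8388608 bytes (8 MiB)
mmlsfs mmfs1 -f    # Minimum fragment (subblock) size: expected 16384 bytes (16 KiB)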
The following commands were used to run the benchmark for write operations and read operations, where the Threads variable is the number of threads used (1 to 1024, incremented in powers of 2), and threadlist is the file that allocated each thread on a different node, using the round-robin method to spread them homogeneously across the 16 compute nodes. The FileSize variable is the result of 8192 (GiB) divided by Threads, so that the total data size is shared evenly among all the threads used. A transfer size of 16 MiB was used for this performance characterization.
./iozone -i0 -c -e -w -r 16M -s ${FileSize}G -t $Threads -+n -+m ./threadlist
./iozone -i1 -c -e -w -r 16M -s ${FileSize}G -t $Threads -+n -+m ./threadlist
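For reference, a wrapper script of the kind used to sweep the thread counts might look like the following sketch. The node names (node01 to node16), file system path, IOzone path, and threadlist layout are illustrative assumptions, not details taken from the validated configuration.

#!/bin/bash
# Sketch: sweep the sequential N-to-N IOzone tests across thread counts (powers of 2).
# Assumptions: client hostnames node01..node16, GPFS mounted at /mmfs1, iozone at /usr/local/bin/iozone.
NODES=(node{01..16})
for Threads in 1 2 4 8 16 32 64 128 256 512 1024; do
    # Divide the 8 TiB (8192 GiB) total data size evenly among the threads
    FileSize=$((8192 / Threads))
    # Build the -+m file: one line per thread, assigned round robin across the 16 nodes.
    # Each line uses the IOzone distributed format: <hostname> <working directory> <iozone path>
    : > ./threadlist
    for ((i = 0; i < Threads; i++)); do
        echo "${NODES[i % 16]} /mmfs1/perftest /usr/local/bin/iozone" >> ./threadlist
    done
    ./iozone -i0 -c -e -w -r 16M -s ${FileSize}G -t $Threads -+n -+m ./threadlist
    ./iozone -i1 -c -e -w -r 16M -s ${FileSize}G -t $Threads -+n -+m ./threadlist
done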
From the results, we see that performance rises quickly with the number of clients used and then reaches a plateau that remains stable up to the maximum number of threads that IOzone allows; therefore, large file sequential performance is stable even for 1024 concurrent clients. Note that both read and write performance benefited from doubling the number of drives: write performance only slightly, and read performance significantly. Starting at four threads, the maximum read performance was limited by the bandwidth of the ME4084 controllers used on the storage nodes. The maximum write performance increased to 21.6 GB/s, reached at 8 and 128 threads, which is closer to the ME4 arrays' maximum specification (22 GB/s).
Remember that the preferred mode of operation for GPFS is scattered, and the solution was formatted to use it. In this mode, blocks are allocated in a pseudorandom fashion from the moment the file system is created, spreading data across the whole surface of each HDD. The obvious disadvantage is a lower initial maximum performance, but that performance remains essentially constant regardless of how much space is used on the file system. This is in contrast to other parallel file systems that initially use the outer tracks, which can hold more data (sectors) per disk revolution and therefore deliver the highest performance the HDDs can provide; as the system uses more space, inner tracks with less data per revolution are used, with the consequent reduction in performance.
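If needed, the allocation mode of an existing file system can be verified in a similar way; again, mmfs1 is an assumed device name.

# Sketch: verify that the file system uses the scatter block allocation type (device name assumed).
mmlsfs mmfs1 -j    # Block allocation type: expected scatter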
Sequential N clients to a single shared file performance was measured with IOR version 3.3.0, using OpenMPI v4.1.2a1 to run the benchmark over the 16 compute nodes. The tests that we ran varied from a single thread up to 512 threads because there are not enough cores for 1024 threads (the 16 clients have a total of 16 x 2 x 20 = 640 cores). Also, oversubscription overhead seemed to affect the IOzone results at 1024 threads.
Caching effects were minimized by setting the GPFS page pool tunable to 32 GiB on the clients and 96 GiB on the servers, and by using a total data size of 8 TiB, more than twice the RAM of the servers and clients combined. A transfer size of 16 MiB was used for this performance characterization. The previous performance test section provides a more complete explanation.
The following commands were used to run the benchmark, where the Threads variable is the number of threads used (1 to 512, incremented in powers of 2), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using the round-robin method to spread them homogeneously across the 16 compute nodes. The FileSize variable is the result of 8192 (GiB) divided by Threads, so that the total data size is shared evenly among all the threads used.
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /usr/mpi/gcc/openmpi-4.1.2a1 /usr/local/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/ior/tst.file -w -s 1 -t 16m -b ${FileSize}G
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /usr/mpi/gcc/openmpi-4.1.2a1 /usr/local/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/ior/tst.file -r -s 1 -t 16m -b ${FileSize}G
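The round-robin host files referenced above can be generated with a short script such as the following sketch; the node names are placeholders, and the one-hostname-per-line layout (each occurrence of a host contributing one slot) is an assumption about how the files were built.

#!/bin/bash
# Sketch: build the my_hosts.$Threads files, spreading ranks round robin across 16 nodes.
# Assumption: client hostnames are node01..node16; each listed occurrence of a host adds one slot.
NODES=(node{01..16})
for Threads in 1 2 4 8 16 32 64 128 256 512; do
    : > my_hosts.$Threads
    for ((i = 0; i < Threads; i++)); do
        echo "${NODES[i % 16]}" >> my_hosts.$Threads
    done
done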
From the results, we see again that the extra drives greatly benefit read performance and only marginally benefit write performance. Performance again rises quickly with the number of clients used and then reaches a plateau that is stable for both reads and writes up to the maximum number of threads used in this test. Note that the maximum read performance was 27.9 GB/s at 16 threads, with the ME4 controllers as the bottleneck. The maximum write performance of 21.7 GB/s was also reached at 16 threads and then plateaued.
Random N clients to N files performance was measured with IOzone version 3.492. The tests that we ran varied from four threads up to 512 threads, using 4 KiB blocks to emulate small-block traffic. Lower thread counts were not used because they provide little information about maximum sustained performance, and execution time can take several days for a single data point (IOzone does not offer an option to run random read and write operations separately). The reason for the low random read performance is that there is not enough I/O pressure to schedule read operations: the combined effect of the mq-deadline I/O scheduler behavior on the Linux operating system and the internal ME4 array controller software delays read operations until a threshold is met.
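The scheduler in use on a given data drive can be confirmed from sysfs, as in the following sketch; sdX is a placeholder for an actual block device on the storage servers.

# Sketch: show the active I/O scheduler for a block device (placeholder device name).
# The scheduler currently in effect is shown in square brackets, for example [mq-deadline].
cat /sys/block/sdX/queue/scheduler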
We set the GPFS page pool tunable to 16 GiB on the clients and 32 GiB on the servers, and used files twice that size to minimize caching effects. The Sequential IOzone Performance section provides a complete explanation of why this is effective on GPFS.
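For reference, the page pool is a regular per-node GPFS configuration parameter; a hedged sketch of how such values are typically applied with mmchconfig is shown below, where the node lists are assumptions and would need to match the actual cluster.

# Sketch: set the GPFS page pool on clients and servers (node lists are assumed placeholders).
# Add -i to apply the change immediately instead of waiting for the next GPFS restart.
mmchconfig pagepool=16G -N client01,client02
mmchconfig pagepool=32G -N server01,server02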
The following commands were used to run the benchmark for random I/O: the first command creates the test files sequentially (they are kept with the -w flag), and the second performs the random read and write operations on them. The Threads variable is the number of threads used (4 to 512, incremented in powers of 2), and threadlist is the file that allocated each thread on a different node, using the round-robin method to spread them homogeneously across the 16 compute nodes:
./iozone -i0 -c -e -w -r 16M -s ${Size}G -t $Threads -+n -+m ./threadlist
./iozone -i2 -O -w -r 4K -s ${Size}G -t $Threads -+n -+m ./threadlist
From the results, we see that write performance starts at a high value of 23.1K IOps at four threads and remains at roughly that level until it rises suddenly at 256 threads, reaching a peak of 34.2K IOps at 512 threads. This behavior was not expected, and more testing is needed to understand the reason; until then, it is safer to use the peak from pixstor 5 when sizing the solution.
Read performance starts at 322 IOps at four threads and increases almost linearly with the number of clients used (note that the number of threads is doubled for each data point), reaching the maximum performance of 32.33K IOps at 512 threads. Using more threads than the number of cores available on the current 16 compute nodes (640) might incur additional context switching.
While characterizing the solution with pixstor 5, results for empty files showed that metadata performance, as measured with MDtest version 3.3.0 and OpenMPI v4.0.1, was approximately the same for the system with or without the ME484 expansions. The reason is that empty files are created entirely inside inodes, so the storage nodes and ME4084 arrays are not used at all. Furthermore, using small files of 4 KiB that do not fit completely inside inodes showed that results with or without the expansions differ only marginally. Therefore, only the characterization previously performed without the ME484 expansions (the metadata performance with MDtest) is included in this document. Instead of duplicating those results, metadata performance on NVMe devices was characterized using a pair of PowerEdge R650 servers with 10 devices each. The new HDMD module is three times denser, has greater performance, and is more cost effective due to the per-drive licensing price (24 drives on the ME4024 array compared to 20 drives per pair of PowerEdge R650 servers). See New NVMe Tier for information about this characterization and the rest of the NVMe work.
The current solution delivered good performance, which is expected to remain stable regardless of how much of the file system space is used (because the system was formatted in scattered mode), as shown in the following table. The solution scales linearly in capacity and performance as more storage node modules are added, and a similar performance increase can be expected from the optional HDMD module.
Benchmark | Peak write | Peak read | Sustained write | Sustained read
Large Sequential N clients to N files | 21.6 GB/s | 26.9 GB/s | 21.1 GB/s | 25.7 GB/s
Large Sequential N clients to single shared file | 21.7 GB/s | 27.9 GB/s | 21 GB/s | 27.7 GB/s
Random Small blocks N clients to N files | 27.4K IOps | 32.3K IOps | 27.4K IOps | 32.3K IOps