PowerEdge R650 Sequential IOzone Performance N clients to N files
We measured sequential N clients to N files performance with IOzone version 3.492. The tests ranged from a single thread up to 1024 threads.
We used files large enough to minimize caching effects, with a total data size of 8 TiB, twice the total memory of the servers (four R650 nodes) and clients combined. In GPFS, the pagepool tunable sets the maximum amount of memory used for caching data, regardless of how much RAM is installed and free; it was set to 32 GiB on clients and 96 GiB on servers to allow I/O optimizations. While other Dell HPC solutions use a 1 MiB block size for large sequential transfers, this GPFS file system was formatted with a block size of 8 MiB, so the benchmark should use that value or its multiples for optimal performance. A block size of 8 MiB might seem too large and likely to waste space with small files, but GPFS uses subblock allocation to prevent that situation. In the current configuration, each block was subdivided into 512 subblocks of 16 KiB each.
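These values can be confirmed on the cluster with the mmlsfs command. The following is a minimal sketch; the device name gpfs0 is an assumption, not the name used in this study.

# List the block size and the minimum fragment (subblock) size; the device
# name gpfs0 is hypothetical.
mmlsfs gpfs0 -B    # expected: 8388608 bytes (8 MiB)
mmlsfs gpfs0 -f    # expected: 16384 bytes (16 KiB); 8 MiB / 16 KiB = 512 subblocks per block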
The following commands were used to run the benchmark for write and read operations, where the Threads variable is the number of threads used (1 to 1024, incremented in powers of 2) and threadlist is the file that allocates each thread to a different node, using round-robin placement to spread the threads evenly across the 16 compute nodes. The FileSize variable is 8192 GiB divided by Threads, so that the total data size is split evenly among all threads. A transfer size of 16 MiB was used for this performance characterization. (A sketch of the full sweep appears after the commands.)
./iozone -i0 -c -e -w -r 16M -s ${FileSize}G -t $Threads -+n -+m ./threadlist
./iozone -i1 -c -e -w -r 16M -s ${FileSize}G -t $Threads -+n -+m ./threadlist
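For reference, the complete sweep can be scripted as shown below. This is a minimal sketch under stated assumptions: the compute nodes are named node01 through node16, the shared working directory is /mnt/pixstor, and iozone lives at /usr/bin/iozone; none of these names come from this study.

#!/bin/bash
# Illustrative sweep: 1 to 1024 threads in powers of 2, 8 TiB total per run.
# Each line of the -+m file is: hostname  working-directory  iozone-path.
for Threads in 1 2 4 8 16 32 64 128 256 512 1024; do
    FileSize=$((8192 / Threads))          # GiB per thread; total stays at 8 TiB
    > ./threadlist                        # rebuild the thread placement file
    for i in $(seq 0 $((Threads - 1))); do
        printf "node%02d /mnt/pixstor /usr/bin/iozone\n" $(( (i % 16) + 1 )) >> ./threadlist
    done
    ./iozone -i0 -c -e -w -r 16M -s ${FileSize}G -t $Threads -+n -+m ./threadlist  # write
    ./iozone -i1 -c -e -w -r 16M -s ${FileSize}G -t $Threads -+n -+m ./threadlist  # read
done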
From the results, we see that read performance reaches a plateau of approximately 180 GB/s at 32 threads and peaks at 180.8 GB/s with 64 threads, a considerable increase over the previous generation of PowerEdge R640 servers with Gen 3 NVMe, which plateaued at about 80 GB/s for the same number of NVMe nodes (four).
Write performance reached a plateau of approximately 40 GB/s at 16 threads, with a peak of 40.2 GB/s at 16, 32, and 512 threads. Write performance might look low compared to read performance; however, two factors should be considered:
Both read and write results are stable once the plateau is reached, which is favorable behavior: the servers do not exhibit a drop in performance as the number of simultaneous clients accessing different files increases. As a future test, because IOzone is limited to a maximum of 1024 threads, IOR can be used to find the limit with respect to simultaneous clients/files (after adding more clients, to avoid context switching within the clients affecting performance). A sketch of such a run follows.
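As a hedged illustration only, an IOR run in file-per-process mode (one file per MPI task, matching the N clients to N files pattern) could look like the following; the task count, host file, mount point, and output path are assumptions, not values from this study.

# Illustrative IOR N-to-N run beyond IOzone's 1024-thread limit; all paths
# and counts are hypothetical.
mpirun -np 2048 --hostfile ./hosts \
    ior -a POSIX -F -e -w -r -t 16m -b 4g -o /mnt/pixstor/ior_testfile
# -F     one file per process (N clients to N files)
# -e     fsync after write, matching iozone's -e option
# -t/-b  16 MiB transfers; 4 GiB per task keeps the total at 8 TiB for 2048 tasks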