Sequential N clients to a single shared file performance was measured with IOR version 3.3.0, using OpenMPI 4.1.5rc2 to run the benchmark over the eight compute nodes. The tests varied from a single thread up to 512 threads; 1024 threads were not tested because the eight clients do not have enough cores (4 x 2 x 32 + 4 x 2 x 64 = 768 cores in total).
We minimized caching effects by setting the GPFS pagepool tunable to 12 GiB on the clients and 128 GiB on the servers, and by using a total data size of 8 TiB, twice the combined RAM of the servers and clients. We used a transfer size of 16 MiB for this performance characterization. For a complete explanation, see PowerEdge R760 Sequential IOzone Performance N clients to N files.
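For reference, the pagepool value can be changed with the standard Spectrum Scale mmchconfig command. The following is a minimal sketch, assuming node classes named client_nodes and server_nodes (illustrative names, not from this study):

mmchconfig pagepool=12G -N client_nodes
mmchconfig pagepool=128G -N server_nodes
mmdiag --config | grep pagepool

The last command verifies the effective value on the node where it runs; depending on the Spectrum Scale release, a pagepool change may require restarting the GPFS daemon on the affected nodes before it takes effect.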
The following commands were used to run the benchmark, where the Threads variable is the number of threads used (1 to 512, incremented in powers of two) and my_hosts.$Threads is the corresponding hostfile, which allocates each thread to a different node in round-robin fashion to spread them evenly across the eight compute nodes. The FileSize variable is set to 8192 GiB divided by the number of threads, so that the total data size is split evenly among all threads (see the sketch after the commands).
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /usr/mpi/gcc/openmpi-4.1.2a1 /usr/local/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/ior/tst.file -w -s 1 -t 16m -b ${FileSize}G
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /usr/mpi/gcc/openmpi-4.1.2a1 /usr/local/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/ior/tst.file -r -s 1 -t 16m -b ${FileSize}G
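For clarity, the following is a minimal driver sketch (our illustration, not the exact script used for this characterization; node names are hypothetical placeholders) showing how the my_hosts.$Threads files and the FileSize variable described above can be generated before invoking the commands:

#!/bin/bash
# Illustrative wrapper; node names are placeholders, not the study's hosts.
NODES=(node01 node02 node03 node04 node05 node06 node07 node08)
for Threads in 1 2 4 8 16 32 64 128 256 512; do
    # Build the hostfile, round-robining threads across the eight nodes.
    : > my_hosts.$Threads
    for ((i = 0; i < Threads; i++)); do
        echo "${NODES[i % 8]}" >> my_hosts.$Threads
    done
    # Split the 8 TiB (8192 GiB) total data size evenly among the threads.
    FileSize=$((8192 / Threads))
    # ... run the write and read mpirun commands shown above ...
done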
Figure 19. N to 1 sequential performance
From the results, we see that read performance rises with the number of threads, peaks at 180.6 GB/s at 256 threads, and then plateaus at approximately 180 GB/s; the anomalously low value at 64 threads requires more scrutiny. Write performance reaches a plateau of approximately 65 GB/s at 16 threads, with a peak of 77.4 GB/s at 512 threads. Therefore, large single-shared-file sequential performance is stable even with 512 concurrent threads. The peak performance of the N-to-N and N-to-1 benchmarks is similar, as expected of a parallel file system with efficient locking for a single shared file.