PowerEdge R650 Sequential IOR Performance N clients to 1 file
Sequential N clients to a single shared file performance was measured with IOR version 3.3.0, using OpenMPI 4.1.4rc1 to run the benchmark over the 16 compute nodes. The tests varied from a single thread up to 512 threads because there are not enough cores for 1024 threads (the 16 clients have a total of 16 x 2 x 20 = 640 cores); in addition, oversubscription overhead slightly affected results at 1024 threads.
We minimized caching by setting the GPFS page pool tunable to 32 GiB on the clients and 96 GiB on the servers, and by using a total data size of 8 TiB, twice the combined RAM of the servers (four R650 nodes) and the clients. We used a transfer size of 16 MiB for this performance characterization. The previous performance test section provides a complete explanation.
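For reference, the page pool is a per-node GPFS (Spectrum Scale) setting that is normally changed with mmchconfig; the node-class names below are hypothetical, and the exact steps (including whether a daemon restart or the -i option is needed for the change to take effect) depend on the pixstor/Spectrum Scale release, so treat this only as a sketch of the tuning described above.

mmchconfig pagepool=32G -N client_nodes   # clients: 32 GiB page pool (hypothetical node class)
mmchconfig pagepool=96G -N server_nodes   # servers: 96 GiB page pool (hypothetical node class)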
The following commands were used to run the benchmark, where the Threads variable is the number of threads used (1 to 512, increasing in powers of two), and my_hosts.$Threads is the corresponding file that allocates each thread on a different node, using the round-robin method to spread them evenly across the 16 compute nodes. The FileSize variable is set to 8192 GiB divided by Threads, so that the total data size is divided evenly among all the threads used.
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /usr/mpi/gcc/openmpi-4.1.2a1 /usr/local/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/ior/tst.file -w -s 1 -t 16m -b ${FileSize}G
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /usr/mpi/gcc/openmpi-4.1.2a1 /usr/local/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/ior/tst.file -r -s 1 -t 16m -b ${FileSize}G
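As an illustration only, the two commands above can be wrapped in a small driver loop; the loop below is a sketch that assumes the my_hosts.$Threads files have already been created as described above, and it is not part of the published procedure.

for Threads in 1 2 4 8 16 32 64 128 256 512; do
    FileSize=$(( 8192 / Threads ))   # split the 8 TiB (8192 GiB) total evenly, in GiB per thread
    # Write phase (-w); -k keeps the file so the read phase can reuse it
    mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /usr/mpi/gcc/openmpi-4.1.2a1 /usr/local/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/ior/tst.file -w -s 1 -t 16m -b ${FileSize}G
    # Read phase (-r) against the same shared file
    mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /usr/mpi/gcc/openmpi-4.1.2a1 /usr/local/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/ior/tst.file -r -s 1 -t 16m -b ${FileSize}G
done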
From the results, we see that performance rises quickly with the number of clients used and then reaches a plateau that is semi-stable for read operations (at approximately 64 threads) and stable for write operations (at 16 threads) up to the maximum number of threads used. Therefore, large single-shared-file sequential performance is stable even for 512 concurrent threads. Note that the maximum read performance was 180.7 GB/s at 512 threads, and the maximum write performance was 41.8 GB/s at 128 threads. Read performance takes longer to reach its peak value than in the N-N tests, since using a single shared file adds locking overhead that seems to affect read performance. However, write performance should be affected even more by locking on a single shared file, and yet its peak (41.8 GB/s) is higher than the N-N peak (40.2 GB/s). This result might be due to MPI plus IOR accesses being more efficient for write operations than IOzone, or to a reason that is not obvious; more investigation of this peculiar behavior is needed.