16G PowerEdge Platform BIOS Characterization for HPC with AMD Genoa 9354
Fri, 30 Jun 2023 13:44:52 -0000
With the release of 4th Gen AMD EPYC 9004 CPUs (code-named “Genoa”), Dell PowerEdge servers have been refreshed to support these latest processors. In this blog, we present the results of a study evaluating the performance of HPC synthetic benchmarks with AMD EPYC 9354 processors on Dell’s latest PowerEdge servers: the dual-socket 1U R6625 and the dual-socket 2U R7625.
Architecture
AMD Genoa is based on the new Zen4 microarchitecture, built with 5nm fabrication technology. Major changes from its predecessor, the AMD EPYC 7003 series (code-named “Milan”), include support for DDR5 memory at speeds up to 4800 MT/s and for PCIe Gen5. Genoa supports up to 96 cores per socket, and the L2 cache per core is doubled to 1 MB. Zen4 also adds support for the AVX-512 instruction set; its implementation uses 256-bit data paths, so a 512-bit instruction is executed over two cycles. Instructions per cycle (IPC) have improved as well.
Benchmark hardware and software configuration
Table 1. Test bed system configuration used for this benchmark study
Platform | Dell PowerEdge R6625 / R7625 |
Processor | AMD EPYC 9354 |
Cores | 32 cores/socket |
Base Frequency | 3.25 GHz |
Turbo Clock | Up to 3.8 GHz |
TDP | 280 W |
Configurable TDP | 240-300 W |
L1 Cache | 64 KB per core |
L2 Cache | 1 MB per core |
L3 Cache | 256 MB (shared) |
Memory | 24 x 32 GB DIMMs, 4800 MT/s |
Interconnect | NVIDIA Mellanox NDR 400 |
Operating System | RHEL 8.6 |
Linux Kernel | 4.18.0-372.9.1 |
BIOS/CPLD | 1.1.3/1.1.3 |
OFED | MLNX_OFED_LINUX-5.7-1.0.2.0 |
BIOS Workload Profile | HPC Profile |
Compiler | AOCC 4.0.0 and AOCL 4.0 |
OpenMPI | 4.1.5 |
Turbo Boost | ON |
Recommended BIOS optimizations
We tested different combinations of BIOS options in this study to understand the potential performance improvements in synthetic benchmarks. We found that setting the BIOS workload profile to “HPCProfile” gives the best performance on HPC synthetic benchmarks.
The workload profile option is located under System Profile Settings in the BIOS. It is a collection of multiple BIOS options recommended for HPC workloads. This setting can be changed with the RACADM CLI tool. Use the following command to enable “HPCProfile” and power cycle the system through racadm.
racadm set bios.sysprofilesettings.WorkloadProfile HpcProfile && sudo racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW -e TIME_NA
Once the system is up, use the command below to verify that the setting is enabled.
racadm get bios.sysprofilesettings.WorkloadProfile
It should show the workload profile as HPCProfile. Note that changing any BIOS setting on top of “HPCProfile” resets this parameter to “Not Configured”, while the other settings applied by “HPCProfile” remain intact.
We have studied the impact of different BIOS options on top of “HPCProfile”. All the performance numbers mentioned in this blog are with workload profile set to “HPCProfile”.
Table 2. Synthetic benchmarks application details
S.No. | Application Name | Version Used |
High-Performance Linpack (HPL) | ||
v7.1 |
We used prebuilt AMD-optimized binaries for the HPL, STREAM, and HPCG benchmarks, which are tuned for AMD’s Zen4 architecture. The OSU Micro Benchmarks were compiled with the AOCC 4.0 compiler. Benchmark information and performance numbers are given in the following section.
Benchmark performance results
HPL: This benchmark solves a random system of linear equations in double precision (64-bit) on distributed systems. It reports the floating-point execution rate of the system.
In the HPL benchmark test, we sized the problem to use 94 percent of available memory, with N=301440 and NB=384. We achieved ~3.75 TFLOPS across both sockets, which is around 113 percent efficiency relative to the theoretical peak at the base frequency of the AMD EPYC 9354 processor. We monitored the frequency throughout the benchmark run and observed that the processor sustained its turbo frequencies, which explains why the efficiency exceeds 100 percent. The average power consumption during the benchmark run was ~830 watts with the BIOS workload profile set to “HPCProfile”. We obtained the best performance-per-watt results with this option.
Figure 1. HPL performance with AMD Genoa 9354 processor on 16G PowerEdge R6625 and R7625 servers
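The HPL efficiency and problem-size figures quoted above can be cross-checked with simple arithmetic. The sketch below is illustrative only and rests on two assumptions that the text does not state: that each Zen4 core can retire 16 double-precision floating-point operations per cycle, and that the 24 x 32 GB DIMMs are counted in decimal gigabytes. The function names are hypothetical.

```python
# Assumption (not stated in the blog): a Zen4 core retires 16 double-precision
# FLOPs per cycle (two 256-bit FMA pipes x 4 doubles x 2 operations per FMA).
FLOPS_PER_CYCLE = 16


def theoretical_peak_gflops(sockets, cores_per_socket, ghz,
                            flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical double-precision peak in GFLOPS at a given clock."""
    return sockets * cores_per_socket * ghz * flops_per_cycle


def hpl_matrix_bytes(n):
    """Bytes needed for the N x N double-precision HPL matrix."""
    return 8 * n * n


# Values taken from Table 1 and the HPL paragraph.
peak_at_base = theoretical_peak_gflops(sockets=2, cores_per_socket=32, ghz=3.25)
efficiency_pct = 3750 / peak_at_base * 100  # measured ~3.75 TFLOPS

# Assumption: DIMM capacity is read as decimal gigabytes (24 x 32 GB = 768 GB).
matrix_gb = hpl_matrix_bytes(301440) / 1e9
memory_pct = matrix_gb / (24 * 32) * 100

print(f"Peak at base clock: {peak_at_base:.0f} GFLOPS")
print(f"HPL efficiency:     {efficiency_pct:.1f} %")
print(f"Matrix size:        {matrix_gb:.1f} GB ({memory_pct:.1f} % of memory)")
```

Under these assumptions the peak at base clock is 3328 GFLOPS, so ~3.75 TFLOPS lands close to the quoted 113 percent, and the matrix occupies between 94 and 95 percent of installed memory.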
STREAM: This synthetic benchmark is designed to measure sustainable memory bandwidth and a corresponding computation rate for four simple vector kernels: Copy, Scale, Add and Triad.
In the STREAM Triad test, we reached up to ~752 GB/s when using all available cores of the dual-socket server. For STREAM performance numbers on AMD Milan-based servers, see our previous blog.
Figure 2. STREAM performance with AMD Genoa 9354 processor on 16G PowerEdge R6625 and R7625 servers
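To put the STREAM Triad result in context, it can be compared with the theoretical peak memory bandwidth of the test bed. The sketch below assumes 12 DDR5 channels per socket with one DIMM per channel (consistent with 24 DIMMs across two sockets) and 8 bytes moved per transfer per channel; neither value is stated in the text, and the function name is hypothetical.

```python
def peak_memory_bandwidth_gbs(sockets, channels_per_socket, mega_transfers,
                              bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (decimal)."""
    return sockets * channels_per_socket * mega_transfers * bytes_per_transfer / 1000


# Assumption: 12 memory channels per socket, 64-bit (8-byte) data path each.
peak_bw = peak_memory_bandwidth_gbs(sockets=2, channels_per_socket=12,
                                    mega_transfers=4800)

# Measured STREAM Triad result from the text: ~752 GB/s.
stream_efficiency_pct = 752 / peak_bw * 100

print(f"Theoretical peak: {peak_bw:.1f} GB/s")
print(f"STREAM Triad:     {stream_efficiency_pct:.1f} % of peak")
```

Under these assumptions the theoretical peak is 921.6 GB/s, which puts the measured Triad bandwidth at a little over 80 percent of peak.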
HPCG: This benchmark project is an effort to create a new metric for ranking HPC systems. It is an internally I/O bound benchmark, meaning that it is limited by data movement between memory and the processor rather than by floating-point throughput, and it is intended to complement the LINPACK benchmarks.
In the HPCG benchmark, we used local subgrid dimensions of nx=ny=nz=192 to fit the problem size to our system memory. We reached ~115 GFLOPS with the AMD-optimized binary for HPCG.
Figure 3. HPCG performance with AMD Genoa 9354 processor on 16G PowerEdge R6625 and R7625 servers
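Two quick numbers help relate the HPCG run to the rest of the study: the size of the local subgrid that each MPI rank owns, and HPCG throughput as a fraction of the HPL result. The sketch below uses only values quoted in this blog; the function name is hypothetical, and the number of MPI ranks is not stated in the text, so the global problem size is deliberately not computed.

```python
def local_grid_points(nx, ny, nz):
    """Number of grid points (matrix rows) owned by a single MPI rank."""
    return nx * ny * nz


# Local subgrid dimensions from the text.
rows_per_rank = local_grid_points(192, 192, 192)

# Measured results from the text: ~115 GFLOPS (HPCG), ~3.75 TFLOPS (HPL).
hpcg_vs_hpl_pct = 115 / 3750 * 100

print(f"Rows per rank: {rows_per_rank:,}")
print(f"HPCG is {hpcg_vs_hpl_pct:.1f} % of the HPL result")
```

The small ratio is expected for a benchmark limited by data movement rather than arithmetic, which is why HPCG is reported alongside HPL rather than in place of it.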
OSU Micro Benchmarks: These micro-benchmarks are widely used for measuring and evaluating the performance of MPI operations for point-to-point, multi-pair, and collective communications between the nodes.
In the OSU benchmark, we used two nodes connected with NDR400. We measured bidirectional bandwidth, unidirectional bandwidth, message rate, and latency between these two nodes. In a dual-socket server, the socket connected to the network adapter card acts as local and the other acts as remote. We completed this test on both R6625 and R7625 servers for both remote and local latency and bandwidth. The results below were obtained on the R6625 server, using a single core per node.
The Delta label on the secondary axis represents the percentage difference between local and remote latency and bandwidth.
Figure 4. OSU Latency with AMD Genoa 9354 processor on Dell PowerEdge R6625 server
We achieved ~48 GB/s unidirectional bandwidth and ~87 GB/s of bidirectional bandwidth.
Figure 5. OSU message rate with AMD Genoa 9354 processor on Dell PowerEdge R6625 server
Figure 6. OSU bi-directional bandwidth with AMD Genoa 9354 processor on Dell PowerEdge R6625 server
Figure 7. OSU uni-directional bandwidth with AMD Genoa 9354 processor on 16G Dell PowerEdge R6625 server
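The OSU bandwidth results can be compared with the nominal line rate of the interconnect, and the Delta series in the charts can be expressed as a small helper. This is a sketch under stated assumptions: it takes NDR400 to mean a 400 Gb/s link and ignores protocol overhead, and it computes the delta relative to the local-socket value, which is a guess because the text does not say which measurement is the baseline. Function names are hypothetical.

```python
def line_rate_gbytes_per_s(gbits_per_s):
    """Convert a nominal link rate from Gb/s to GB/s."""
    return gbits_per_s / 8


def percent_delta(local, remote):
    """Percentage difference of remote vs. local.

    Assumption: the local-socket value is the baseline; the blog does not
    specify how the Delta series is normalized.
    """
    return (remote - local) / local * 100


# Assumption: NDR400 is a 400 Gb/s link; protocol overhead is ignored.
uni_peak = line_rate_gbytes_per_s(400)  # one direction
bi_peak = 2 * uni_peak                  # both directions combined

# Measured values from the text: ~48 GB/s and ~87 GB/s.
uni_pct = 48 / uni_peak * 100
bi_pct = 87 / bi_peak * 100

print(f"Unidirectional: {uni_pct:.0f} % of {uni_peak:.0f} GB/s")
print(f"Bidirectional:  {bi_pct:.0f} % of {bi_peak:.0f} GB/s")
```

Measured against these nominal rates, the unidirectional result is close to line rate, while the bidirectional result leaves more headroom.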
Conclusion and future work
We have seen a significant improvement in the performance of synthetic benchmarks on Genoa-based servers compared to earlier Milan-based servers. Setting the right BIOS parameters is important to achieve the best results on these servers. As part of our study, we tested different BIOS parameters, and our findings suggest that setting the workload profile to “HPCProfile” provides the best performance.
For future work, we plan to study performance improvements on HPC applications from different domains using these latest AMD processors and Dell PowerEdge servers.
Check back soon for the next blog.
Additional resources
Visit our website to read our previous blog on AMD Milan-based servers.