Accelerating HPC Workloads with NVIDIA A100 NVLink on Dell PowerEdge XE8545
Wed, 07 Apr 2021 18:02:56 -0000
NVIDIA A100 GPU
Three years after launching the Tesla V100 GPU, NVIDIA recently announced its latest data center GPU, the A100, built on the Ampere architecture. The A100 is available in two form factors, PCIe and SXM4, allowing GPU-to-GPU communication over PCIe or NVLink. The NVLink version is also known as the A100 SXM4 GPU and is available on the HGX A100 server board.
As you’d expect, the Innovation Lab tested the performance of the A100 GPU in a new platform. The new PowerEdge XE8545 4U server from Dell Technologies supports these GPUs with the NVLink SXM4 form factor and dual-socket AMD 3rd generation EPYC CPUs (codename Milan). This platform supports PCIe Gen 4 speed, up to 10 local drives, and up to 16 DIMM slots running at 3200 MT/s. Milan CPUs are available with up to 64 physical cores per CPU.
The PCIe version of the A100 can be housed in the PowerEdge R7525, which also supports AMD EPYC CPUs, up to 24 drives, and up to 16 DIMM slots running at 3200 MT/s. This blog compares the performance of the A100-PCIe system to the A100-SXM4 system.
Figure 1: PowerEdge XE8545 Server
A previous blog discussed the performance of the NVIDIA A100-PCIe GPU compared to its predecessor NVIDIA Tesla V100-PCIe GPU in the PowerEdge R7525 platform.
The following table shows the specifications of the NVIDIA A100 and V100 GPUs.
Table 1: NVIDIA A100 and V100 GPUs with PCIe and SXM4 form factors
Specifications compared: form factor (PCIe or SXM with NVIDIA NVLink), GPU type, GPU memory bandwidth, peak FP64 Tensor Core, GPU base clock, GPU boost clock, and max power consumption.
From Table 1, we see that the A100 offers 42 percent improved memory bandwidth and 20 to 30 percent higher double precision FLOPS when compared to the Tesla V100 GPU. While the A100-PCIe GPU consumes the same amount of power as the V100-PCIe GPU, the NVLink version of the A100 GPU consumes 25 percent more power than the V100 GPU.
How are the GPUs connected in the PowerEdge servers?
An understanding of the server architecture is helpful in determining the behavior of any application. The PowerEdge XE8545 server is an accelerator-optimized server with four A100-SXM4 GPUs connected with third-generation NVLink, as shown in the following figure.
Figure 2: PowerEdge XE8545 CPU-GPU connectivity
In the A100 GPU, each NVLink link supports a data rate of 4 x 50 Gbit/s (25 GB/s) in each direction. The total number of NVLink links increases from six in the V100 GPU to 12 in the A100 GPU, yielding 600 GB/s of total bidirectional bandwidth per GPU. Workloads that can take advantage of the higher GPU-to-GPU communication bandwidth can benefit from the NVLink links in the PowerEdge XE8545 server.
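As a quick check of the arithmetic behind the 600 GB/s figure, the following minimal sketch multiplies out the per-link rate and the link count quoted above. The function and constant names are illustrative only.

```python
# Per-GPU NVLink bandwidth, using only the figures quoted in the text.
GBIT_PER_PAIR = 50   # Gbit/s per signal pair, per direction
PAIRS_PER_LINK = 4   # signal pairs per link, per direction
A100_LINKS = 12      # NVLink links on one A100 GPU


def link_gbytes_per_direction(gbit_per_pair=GBIT_PER_PAIR, pairs=PAIRS_PER_LINK):
    """Bandwidth of one link in one direction, in GB/s (8 bits per byte)."""
    return gbit_per_pair * pairs / 8


def total_bidirectional_gbytes(links, per_direction_gbytes):
    """Aggregate bandwidth over all links, counting both directions."""
    return links * per_direction_gbytes * 2


a100_link = link_gbytes_per_direction()
a100_total = total_bidirectional_gbytes(A100_LINKS, a100_link)
print(f"per link, per direction: {a100_link} GB/s")
print(f"total bidirectional:     {a100_total} GB/s")
```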
As shown in the following figure, the PowerEdge R7525 server can accommodate up to three PCIe-based GPUs; however, the configuration chosen for this evaluation used two A100-PCIe GPUs. With this option, the GPU-to-GPU communication must flow through the AMD Infinity Fabric inter-CPU links.
Figure 3: PowerEdge R7525 CPU-GPU connectivity
The following table shows the tested configuration details:
Table 2: Test bed configuration details
- Processor: Dual AMD EPYC 7713, 64C, 2.8 GHz
- Memory: 16 x 32 GB @ 3200 MT/s and 16 x 64 GB @ 3200 MT/s
- Height of system
- GPUs: 4 x NVIDIA A100 SXM4 40 GB (PowerEdge XE8545) and 2 x NVIDIA A100 PCIe 40 GB (PowerEdge R7525)
- Operating system: Red Hat Enterprise Linux release 8.3 (Ootpa)
- MPI: OpenMPI 4.0
The following table lists the versions of the HPC applications that were used for the benchmark evaluation:
Table 3: HPC Applications used for the evaluation
High Performance Linpack
High Performance Linpack (HPL) is a standard HPC system benchmark that is used to measure the computing power of a server or cluster. It is also used as a reference benchmark by the TOP500 organization to rank supercomputers worldwide. HPL for GPU uses double precision floating point operations. There are a few parameters that are significant for the HPL benchmark, as listed below:
- N is the problem size provided as input to the benchmark and determines the size of the dense linear system (an N x N matrix) that is solved by HPL. For a GPU system, the highest HPL performance is obtained when the problem size uses as much of the GPU memory as possible without exceeding it. For this study, we used HPL compiled with NVIDIA libraries as listed in Table 3.
- NB is the block size that is used for data distribution. For this test configuration, we used an NB of 288.
- PxQ is the process grid (P rows by Q columns), and the product P x Q is equal to the total number of GPUs in the system.
- Rpeak is the theoretical peak of the system.
- Rmax is the maximum measured performance achieved on the system.
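The relationship between these parameters can be sketched as follows. This is a minimal illustration, not the procedure used in this study. It assumes the N x N double-precision matrix (8 bytes per element) dominates GPU memory use, and the fill fraction of 0.95 and the memory size passed in the example are placeholder assumptions.

```python
import math


def hpl_problem_size(num_gpus, mem_per_gpu_gib, nb, fraction=0.95):
    """Largest N (a multiple of nb) whose N x N FP64 matrix fits in
    `fraction` of the aggregate GPU memory. `fraction` is an assumption."""
    usable_bytes = num_gpus * mem_per_gpu_gib * 1024**3 * fraction
    n = math.isqrt(int(usable_bytes // 8))  # 8 bytes per FP64 element
    return (n // nb) * nb                   # round down to a multiple of NB


def process_grids(num_gpus):
    """All (P, Q) pairs with P * Q == num_gpus and P <= Q."""
    return [(p, num_gpus // p)
            for p in range(1, num_gpus + 1)
            if num_gpus % p == 0 and p <= num_gpus // p]


# Example: four 40 GiB GPUs with the NB of 288 quoted above.
n = hpl_problem_size(num_gpus=4, mem_per_gpu_gib=40, nb=288)
print("N =", n, "candidate grids:", process_grids(4))
```

In practice N is usually tuned around such an estimate, because the benchmark also needs working space beyond the matrix itself.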
Figure 4: HPL Performance on the PowerEdge R7525 and XE8545 with NVIDIA A100-40 GB
Figure 5: HPL Power Utilization on the PowerEdge XE8545 with four NVIDIA A100 GPUs and R7525 with two NVIDIA A100 GPUs
From Figure 4 and Figure 5, we can make the following observations:
- SXM4 vs PCIe: With one GPU, the NVIDIA A100-SXM4 GPU outperforms the A100-PCIe GPU by 11 percent. The higher base clock frequency of the SXM4 GPU is the predominant factor contributing to the additional performance over the PCIe GPU.
- Scalability: The PowerEdge XE8545 server with four NVIDIA A100-SXM4-40GB GPUs delivers 3.5 times higher HPL performance compared to one NVIDIA A100-SXM4-40GB GPU. On the R7525 platform, two A100-PCIe GPUs are 1.94 times faster than one. The A100 GPUs scale well on both platforms for the HPL benchmark.
- Higher Rpeak: The HPL code on A100 GPUs uses the new double-precision Tensor Cores, so the theoretical peak for each card is 19.5 TFlops, as opposed to the 9.7 TFlops of the standard FP64 units.
- Power: Figure 5 shows power consumption of a complete HPL run with PowerEdge XE8545 using 4 x A100-SXM4 GPUs and PowerEdge R7525 using 2 x A100-PCIe GPUs. This was measured with iDRAC commands, and the peak power consumption for XE8545 is 2877 Watts, while peak power consumption for R7525 is 1206 Watts.
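The scaling and Rpeak observations above reduce to simple ratios. The sketch below uses only the speedups and the per-GPU peak quoted in the list; the helper names are illustrative, and the Rpeak value counts the GPUs only and ignores any CPU contribution.

```python
def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling achieved by a multi-GPU run."""
    return speedup / num_gpus


def gpu_rpeak_tflops(num_gpus, per_gpu_tflops=19.5):
    """GPU-only theoretical peak, using the FP64 Tensor Core figure."""
    return num_gpus * per_gpu_tflops


xe8545_eff = scaling_efficiency(3.5, 4)   # four A100-SXM4 GPUs
r7525_eff = scaling_efficiency(1.94, 2)   # two A100-PCIe GPUs

print(f"XE8545 scaling efficiency: {xe8545_eff:.1%}")
print(f"R7525 scaling efficiency:  {r7525_eff:.1%}")
print(f"XE8545 GPU Rpeak: {gpu_rpeak_tflops(4)} TFlops")
print(f"R7525 GPU Rpeak:  {gpu_rpeak_tflops(2)} TFlops")
```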
High Performance Conjugate Gradient
Figure 6: HPCG Performance on the PowerEdge R7525 and PowerEdge XE8545 Servers
Unlike HPL, HPCG performance depends heavily on the memory system, and on network performance when we go beyond one server. Because the PCIe and SXM4 form factors of the A100 GPU have the same memory bandwidth, there is no variation in performance on a single node, and HPCG performance scales well on both servers.
The following figure shows the performance results for GROMACS, a molecular dynamics application, on the PowerEdge R7525 and XE8545 servers. GROMACS 2021.0 was compiled with CUDA compilers and OpenMPI, as shown in Table 3.
Figure 7: GROMACS performance on the PowerEdge R7525 and PowerEdge XE8545 Servers
The GROMACS build included thread MPI (built in with the GROMACS package). Performance results are presented using the ns/day metric. For each test, the performance was optimized by varying the number of MPI ranks and threads, number of PME ranks, and different nstlist values to obtain the best performance result.
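The tuning sweep described above can be sketched as a small generator of `gmx mdrun` command lines. The flags `-ntmpi`, `-ntomp`, `-npme`, `-nstlist`, and `-nb gpu` are standard `gmx mdrun` options, but the input file name and the candidate values below are hypothetical and are not the settings used in this study.

```python
from itertools import product


def mdrun_commands(tpr, ntmpi_values, ntomp_values, npme_values, nstlist_values):
    """Build one thread-MPI `gmx mdrun` command line per parameter combination."""
    commands = []
    for ntmpi, ntomp, npme, nstlist in product(
            ntmpi_values, ntomp_values, npme_values, nstlist_values):
        # Separate PME ranks must leave at least one particle-particle rank.
        if npme >= ntmpi:
            continue
        commands.append(
            f"gmx mdrun -s {tpr} -ntmpi {ntmpi} -ntomp {ntomp} "
            f"-npme {npme} -nstlist {nstlist} -nb gpu"
        )
    return commands


# Hypothetical sweep; real values depend on core count and GPU count.
sweep = mdrun_commands("topol.tpr", [1, 2], [8], [0, 1], [100, 200])
for cmd in sweep:
    print(cmd)
```

Each command would then be run and the reported ns/day compared to pick the best combination for a given dataset.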
With one GPU in test, the performance of the SXM4 XE8545 server is similar to the PCIe R7525. With two GPUs in test, the SXM4 XE8545 performance is up to 28 percent better than the PCIe R7525. Datasets like Water 1536 and Water 3072 demand more GPU-to-GPU communication, and for these the SXM4 form factor performs around 28 percent better. On the other hand, for datasets like LignoCellulose 3M, the two-GPU R7525 achieves the same per-GPU performance as the XE8545 with lower-power 250 W GPUs, making it the more efficient solution.
The following figure shows the performance results for LAMMPS, a molecular dynamics application, on the PowerEdge R7525 and XE8545 servers. The code was compiled with the KOKKOS package to run efficiently on NVIDIA GPUs, and Lennard Jones is the dataset that was tested with Timesteps/s as the metric for comparison.
Figure 8: LAMMPS performance on the PowerEdge R7525 and PowerEdge XE8545 Servers
With one GPU in test, the performance of the SXM4 XE8545 server is 13 percent higher than the PCIe R7525, and with two GPUs in test, a 23 percent performance improvement was measured. The PowerEdge XE8545 is at an advantage because the GPUs can communicate with each other over NVLink without the intervention of a CPU, whereas the R7525 server with two GPUs is limited by its GPU-to-GPU communication path. Another factor contributing to the better performance is the higher clock rate of the SXM4 A100 GPU.
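For reference, a KOKKOS-enabled LAMMPS run such as the Lennard-Jones test is typically launched with one MPI rank per GPU. The sketch below assembles such a command line. The `-k on g N`, `-sf kk`, and `-in` switches are standard LAMMPS command-line options, while the binary name `lmp` and the input file name `in.lj` are assumptions and may differ from the build used in this study.

```python
def lammps_kokkos_command(num_gpus, input_file="in.lj", binary="lmp"):
    """Return an mpirun argument list for a KOKKOS GPU run,
    using one MPI rank per GPU."""
    return [
        "mpirun", "-np", str(num_gpus),
        binary,
        "-k", "on", "g", str(num_gpus),  # enable KOKKOS on num_gpus GPUs
        "-sf", "kk",                     # use the kk-suffixed (KOKKOS) styles
        "-in", input_file,
    ]


for gpus in (1, 2, 4):
    print(" ".join(lammps_kokkos_command(gpus)))
```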
In this blog, we discussed the performance of NVIDIA A100 GPUs on the PowerEdge R7525 Server and the PowerEdge XE8545 Server, which is the new addition from Dell Technologies. The A100 GPU has 42 percent more memory bandwidth and higher double precision FLOPs compared to its predecessor, the V100 series GPU. For workloads which demand more GPU-to-GPU communication, the PowerEdge XE8545 server is an ideal choice. For data centers where space and power are limited, the PowerEdge R7525 server may be the right fit. The overall performance of PowerEdge XE8545 Server with four A100-SXM4 GPUs is 1.5 to 2.3 times faster than the PowerEdge R7525 server with two A100-PCIe GPUs.
In the future, we intend to evaluate the A100-80GB GPUs and NVIDIA A40 GPUs that will be available this year. We also plan to focus on a multi-node performance study with these GPUs.
Please contact your Dell sales specialist about the HPC and AI Innovation Lab if you would like to evaluate these GPU servers.
Related Blog Posts
New Frontiers—Dell EMC PowerEdge R750xa Server with NVIDIA A100 GPUs
Tue, 01 Jun 2021 20:08:09 -0000
Dell Technologies has released the new PowerEdge R750xa server, a GPU workload-based platform that is designed to support artificial intelligence, machine learning, and high-performance computing solutions. The dual socket/2U platform supports 3rd Gen Intel Xeon processors (code named Ice Lake). It supports up to 40 cores per processor, has eight memory channels per CPU, and up to 32 DDR4 DIMMs at 3200 MT/s DIMM speed. This server can accommodate up to four double-width PCIe GPUs that are located in the front left and the front right of the server.
Compared with the previous generation PowerEdge C4140 and PowerEdge R740 GPU platform options, the new PowerEdge R750xa server supports larger storage capacity, provides more flexible GPU offerings, and improves the thermal requirement.
Figure 1 PowerEdge R750xa server
The NVIDIA A100 GPUs are built on the NVIDIA Ampere architecture to enable double precision workloads. This blog evaluates the new PowerEdge R750xa server and compares its performance with the previous generation PowerEdge C4140 server.
The following table shows the specifications for the NVIDIA GPU that is discussed in this blog and compares the performance improvement from the previous generation.
Table 1 NVIDIA GPU specifications
Specifications compared: GPU memory bandwidth, peak FP64 Tensor Core, peak FP32 Tensor Core, peak mixed precision (FP16 ops/FP32), GPU base clock, GPU boost clock, and maximum power consumption.
Test bed and applications
This blog quantifies the performance improvement of the GPUs with the new PowerEdge GPU platform.
All results presented in this blog were derived from a single-node PowerEdge R750xa test bed in the Dell HPC & AI Innovation Lab. This section describes the test bed and the applications that were evaluated as part of the study. The following table provides test environment details:
Table 2 Server configuration
- Test Bed 1: Dell PowerEdge R750xa, Intel Xeon 8380, 32 x 16 GB @ 3200 MT/s, Red Hat Enterprise Linux 8.3, 4 x NVIDIA A100-PCIe-40 GB GPU
- Test Bed 2: Dell PowerEdge C4140 configuration M, Intel Xeon 6248, 16 x 16 GB @ 2933 MT/s, Red Hat Enterprise Linux 8.3, 4 x NVIDIA V100-PCIe-32 GB GPU
The following table provides information about the applications and benchmarks used:
Table 3 Benchmark and application details
- HPL: floating point compute-intensive system benchmark; problem size is more than 95% of GPU memory
- HPCG: sparse matrix calculations; problem size 512 * 512 * 288
- GROMACS: molecular dynamics application
- LAMMPS: molecular dynamics application; 29 October 2020 release
Large-Scale Atomic/Molecular Massively Parallel simulator (LAMMPS) is distributed by Sandia National Labs and the US Department of Energy. LAMMPS is open-source code that has different accelerated models for performance on CPUs and GPUs. For our test, we compiled the binary using the KOKKOS package, which runs efficiently on GPUs.
Figure 2 LAMMPS Performance on PowerEdge R750xa and PowerEdge C4140 servers
With the newer generation GPUs, single-GPU performance of this application improves 2.4 times. The overall performance from a single server doubled with the PowerEdge R750xa server and NVIDIA A100 GPUs.
GROMACS is a free and open-source parallel molecular dynamics package designed for simulations of biochemical molecules such as proteins, lipids, and nucleic acids. It is used by a wide variety of researchers, particularly for biomolecular and chemistry simulations. GROMACS supports all the usual algorithms expected from modern molecular dynamics implementation. It is open-source software with the latest versions available under the GNU Lesser General Public License (LGPL).
Figure 3 GROMACS performance on PowerEdge C4140 and R750xa servers
With the newer generation GPUs, single-GPU performance of this application improved approximately 1.5 times across the datasets. The overall performance from a single server improved 1.5 times with a PowerEdge R750xa server and NVIDIA A100 GPUs.
High-Performance Linpack (HPL) needs no introduction in the HPC arena. It is a widely used standard benchmark test in the industry.
Figure 4 HPL Performance on the PowerEdge R750xa server with A100 GPU and PowerEdge C4140 server with V100 GPU
Figure 5 Power use of the HPL running on NVIDIA GPUs
From Figure 4 and Figure 5, the following results were observed:
- Performance: For the same GPU count, the NVIDIA A100 GPU demonstrates twice the performance of the NVIDIA V100 GPU. Higher memory size, double precision FLOPS, and a newer architecture contribute to the improvement for the NVIDIA A100 GPU.
- Scalability: The PowerEdge R750xa server with four NVIDIA A100-PCIe-40 GB GPUs delivers 3.6 times higher HPL performance compared to one NVIDIA A100-PCIe-40 GB GPU. The NVIDIA A100 GPUs scale well inside the PowerEdge R750xa server for the HPL benchmark.
- Higher Rpeak: The HPL code on NVIDIA A100 GPUs uses the new double-precision Tensor Cores. The theoretical peak for each GPU is 19.5 TFlops, as opposed to 9.7 TFlops.
- Power: Figure 5 shows power consumption of a complete HPL run with the PowerEdge R750xa server using four A100-PCIe GPUs. This result was measured with iDRAC commands, and the peak power consumption was observed as 2022 Watts. Based on our previous observations, we know that the PowerEdge C4140 server consumes approximately 1800 W of power.
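Taken together, the performance and power observations above give a rough sense of relative energy efficiency. The sketch below is a back-of-the-envelope estimate only: it treats "twice the performance" and "approximately 1800 W" as exact and compares peak power rather than energy over the whole run.

```python
def relative_perf_per_watt(perf_ratio, new_power_w, old_power_w):
    """Performance-per-watt of the new system relative to the old one."""
    return perf_ratio / (new_power_w / old_power_w)


# Figures quoted in the list above (both are approximate).
r750xa_vs_c4140 = relative_perf_per_watt(2.0, 2022.0, 1800.0)
r750xa_scaling = 3.6 / 4  # four-GPU speedup divided by GPU count

print(f"Relative HPL performance per watt: {r750xa_vs_c4140:.2f}x")
print(f"Four-GPU scaling efficiency:       {r750xa_scaling:.0%}")
```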
Figure 6 Scaling GPU performance data for HPCG Benchmark
As discussed in other blogs, High Performance Conjugate Gradient (HPCG) is another standard benchmark that tests the data access patterns of sparse matrix calculations. From the graph, we see that the GPUs scale well with this benchmark, resulting in a 1.6 times performance improvement over the previous generation PowerEdge C4140 server with an NVIDIA V100 GPU.
The 72 percent improvement in memory bandwidth of the NVIDIA A100 GPU over the NVIDIA V100 GPU contributes to the performance improvement.
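Because HPCG is limited by memory bandwidth, the bandwidth ratio gives a rough upper bound on the expected speedup. The sketch below compares that bound with the observed 1.6 times improvement. The bandwidth values are assumptions taken from NVIDIA's published specifications for the A100 40 GB and V100 GPUs, not figures stated in the text of this blog.

```python
# Assumed datasheet memory bandwidths in GB/s (not stated in the text above).
A100_40GB_BW_GBS = 1555.0
V100_BW_GBS = 900.0
OBSERVED_HPCG_SPEEDUP = 1.6  # from the HPCG discussion above

bandwidth_ratio = A100_40GB_BW_GBS / V100_BW_GBS
improvement_pct = (bandwidth_ratio - 1.0) * 100.0
fraction_of_bound = OBSERVED_HPCG_SPEEDUP / bandwidth_ratio

print(f"Bandwidth ratio:       {bandwidth_ratio:.2f}x")
print(f"Bandwidth improvement: {improvement_pct:.1f} percent")
print(f"Observed speedup is {fraction_of_bound:.0%} of the bandwidth bound")
```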
In this blog, we introduced the latest generation PowerEdge R750xa platform and discussed the performance improvement over the previous generation PowerEdge C4140 server. The PowerEdge R750xa server is a good option for customers looking for an Intel Xeon Scalable CPU-based platform powered with NVIDIA GPUs.
With the newer generation PowerEdge R750xa server and NVIDIA A100 GPUs, the applications discussed in this blog show significant performance improvement.
In future blogs, we plan to evaluate NVLink bridge support, which is another important feature of the PowerEdge R750xa server and NVIDIA A100 GPUs.
HPC Application Performance on Dell PowerEdge R7525 Servers with NVIDIA A100 GPGPUs
Tue, 24 Nov 2020 17:39:49 -0000
High-Performance Linpack benchmark
High Performance Conjugate Gradient benchmark