
Accelerating HPC Workloads with NVIDIA A100 NVLink on Dell PowerEdge XE8545

Tue, 13 Apr 2021 14:25:31 -0000


Savitha Pareek

Deepthi Cherlopalle

Frank Han

NVIDIA A100 GPU

Three years after launching the Tesla V100 GPU, NVIDIA recently announced its latest data center GPU, the A100, built on the Ampere architecture. The A100 is available in two form factors, PCIe and SXM4, allowing GPU-to-GPU communication over PCIe or NVLink. The NVLink version is also known as the A100 SXM4 GPU and is available on the HGX A100 server board.

As you’d expect, the Innovation Lab tested the performance of the A100 GPU in a new platform. The new PowerEdge XE8545 4U server from Dell Technologies supports these GPUs with the NVLink SXM4 form factor and dual-socket AMD 3rd generation EPYC CPUs (codename Milan). This platform supports PCIe Gen 4 speed, up to 10 local drives, and up to 16 DIMM slots running at 3200 MT/s. Milan CPUs are available with up to 64 physical cores per CPU.

The PCIe version of the A100 can be housed in the PowerEdge R7525, which also supports AMD EPYC CPUs, up to 24 drives, and up to 16 DIMM slots running at 3200 MT/s. This blog compares the performance of the A100-PCIe system to the A100-SXM4 system.

Figure 1: PowerEdge XE8545 Server

A previous blog discussed the performance of the NVIDIA A100-PCIe GPU compared to its predecessor NVIDIA Tesla V100-PCIe GPU in the PowerEdge R7525 platform.

The following table shows the specifications of the NVIDIA A100 and V100 GPUs.

Table 1: NVIDIA A100 and V100 GPUs with PCIe and SXM form factors

Form factor	PCIe	PCIe	SXM (NVIDIA NVLink)	SXM (NVIDIA NVLink)
NVIDIA GPU	A100	V100	A100	V100
GPU architecture	Ampere	Volta	Ampere	Volta
GPU memory	40 GB	32 GB	40 GB	32 GB
GPU memory bandwidth	1555 GB/s	900 GB/s	1555 GB/s	900 GB/s
Peak FP64	9.7 TFLOPS	7 TFLOPS	9.7 TFLOPS	7.8 TFLOPS
Peak FP64 Tensor Core	19.5 TFLOPS	N/A	19.5 TFLOPS	N/A
GPU base clock	765 MHz	1230 MHz	1095 MHz	1290 MHz
GPU boost clock	1410 MHz	1380 MHz	1410 MHz	1530 MHz
NVLink speed	600 GB/s	N/A	600 GB/s	300 GB/s
Max power consumption	250 W	250 W	400 W	300 W

From Table 1, we see that the A100 offers 73 percent higher memory bandwidth (1555 GB/s compared to 900 GB/s) and 24 to 39 percent higher peak double-precision FLOPS than the Tesla V100 GPU. While the A100-PCIe GPU consumes the same amount of power as the V100-PCIe GPU, the NVLink version of the A100 GPU consumes 33 percent more power than the NVLink version of the V100 GPU (400 W compared to 300 W).
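Percentage comparisons like these depend on the baseline chosen. The following minimal Python sketch recomputes the Table 1 differences relative to the V100 value and, for reference, as a share of the A100 value. The helper names are ours for illustration and are not part of any benchmark tooling.

```python
def percent_increase(new: float, old: float) -> float:
    """Difference expressed as a percentage of the OLD (V100) value."""
    return (new - old) / old * 100.0


def share_of_new(new: float, old: float) -> float:
    """Difference expressed as a percentage of the NEW (A100) value."""
    return (new - old) / new * 100.0


# (A100 value, V100 value) pairs copied from Table 1.
table1 = {
    "Memory bandwidth (GB/s)": (1555.0, 900.0),
    "Peak FP64, PCIe (TFLOPS)": (9.7, 7.0),
    "Peak FP64, SXM (TFLOPS)": (9.7, 7.8),
    "Max power, SXM (W)": (400.0, 300.0),
}

for metric, (a100, v100) in table1.items():
    print(
        f"{metric}: +{percent_increase(a100, v100):.0f}% over V100, "
        f"or {share_of_new(a100, v100):.0f}% of the A100 value"
    )
```

The first figure answers "how much more than the V100", which is the usual meaning of an improvement; the second is smaller because it divides by the larger A100 value.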

How are the GPUs connected in the PowerEdge servers?

An understanding of the server architecture is helpful in determining the behavior of any application. The PowerEdge XE8545 server is an accelerator-optimized server with four A100-SXM4 GPUs connected with third-generation NVLink, as shown in the following figure.

Figure 2: PowerEdge XE8545 CPU-GPU connectivity

In the A100 GPU, each NVLink link supports a data rate of 4 x 50 Gbit/s (25 GB/s) in each direction. The total number of NVLink links increases from six in the V100 GPU to 12 in the A100 GPU, yielding 600 GB/s of total bidirectional bandwidth. Workloads that can take advantage of the higher GPU-to-GPU communication bandwidth can benefit from the NVLink links in the PowerEdge XE8545 server.
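The 600 GB/s figure can be checked with simple arithmetic. The sketch below assumes that each A100 NVLink connection carries four signal pairs at 50 Gbit/s per direction, as described above. The V100 per-connection values (eight pairs at 25 Gbit/s) are taken from NVIDIA's published NVLink 2.0 figures rather than from this blog, so treat them as an assumption. The function name is hypothetical.

```python
BITS_PER_BYTE = 8


def nvlink_total_gbytes_per_s(links: int, pairs_per_link: int, gbit_per_pair: float) -> float:
    """Aggregate bidirectional NVLink bandwidth of one GPU, in GB/s."""
    per_link_one_direction = pairs_per_link * gbit_per_pair / BITS_PER_BYTE
    per_link_bidirectional = 2 * per_link_one_direction
    return links * per_link_bidirectional


# A100: 12 connections, 4 signal pairs each at 50 Gbit/s per direction.
a100_total = nvlink_total_gbytes_per_s(links=12, pairs_per_link=4, gbit_per_pair=50)

# V100 (assumed NVLink 2.0 values): 6 connections, 8 pairs each at 25 Gbit/s.
v100_total = nvlink_total_gbytes_per_s(links=6, pairs_per_link=8, gbit_per_pair=25)

print(f"A100 NVLink total: {a100_total:.0f} GB/s")
print(f"V100 NVLink total: {v100_total:.0f} GB/s")
```

Both generations deliver 25 GB/s per connection in each direction; the doubling in Table 1 comes from the higher connection count.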

As shown in the following figure, the PowerEdge R7525 server can accommodate up to three PCIe-based GPUs; however, the configuration chosen for this evaluation used two A100-PCIe GPUs. With this option, GPU-to-GPU communication must flow through the AMD Infinity Fabric inter-CPU links.

Figure 3: PowerEdge R7525 CPU-GPU connectivity

Testbed details

The following table shows the tested configuration details:

Table 2: Test bed configuration details

Server	PowerEdge XE8545	PowerEdge R7525
Processor	Dual AMD EPYC 7713, 64C, 2.8 GHz
Memory	512 GB (16 x 32 GB @ 3200 MT/s)	1024 GB (16 x 64 GB @ 3200 MT/s)
Height of system	4U	2U
GPUs	4 x NVIDIA A100 SXM4 40 GB	2 x NVIDIA A100 PCIe 40 GB
Operating system	Red Hat Enterprise Linux release 8.3 (Ootpa)
Kernel	4.18.0-240.el8.x86_64
BIOS settings	Sysprofile=PerfOptimized, LogicalProcessor=Disabled, NumaNodesPerSocket=4
CUDA driver	450.51.05
CUDA Toolkit	11.1
GCC	9.2.0
MPI	OpenMPI - 4.0

The following table lists the versions of the HPC applications that were used for the benchmark evaluation:

Table 3: HPC Applications used for the evaluation

Benchmark	Details
HPL	xhpl_cuda-11.0-dyn_mkl-static_ompi-4.0.4_gcc4.8.5_7-23-20
HPCG	xhpcg-3.1_cuda_11_ompi-3.1
GROMACS	v2021
NAMD	Git-2021-03-02_Source
LAMMPS	29Oct2020 release

Benchmark evaluation

High Performance Linpack

High Performance Linpack (HPL) is a standard HPC system benchmark that measures the computing power of a server or cluster. The TOP500 organization also uses it as the reference benchmark to rank supercomputers worldwide. HPL for GPUs uses double-precision floating-point operations. The following parameters are significant for the HPL benchmark:

N is the problem size provided as input to the benchmark; it determines the size of the N x N matrix of the linear system that HPL solves. For a GPU system, the highest HPL performance is obtained when the problem size uses as much of the GPU memory as possible without exceeding it. For this study, we used HPL compiled with NVIDIA libraries as listed in Table 3.
NB is the block size used for data distribution. For this test configuration, we used an NB of 288.
P x Q is the process grid; the product of P and Q is equal to the total number of GPUs in the system.
Rpeak is the theoretical peak performance of the system.
Rmax is the maximum measured performance achieved on the system.
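To illustrate how these parameters relate, the following Python sketch estimates a problem size N that fits in GPU memory and picks a near-square process grid. It is a rough sizing aid under stated assumptions, not the procedure used for the results in this blog: the 95 percent fill fraction, the 1 GB = 10^9 bytes convention, the near-square grid heuristic, and the function names are all ours.

```python
import math

BYTES_PER_FP64 = 8


def hpl_problem_size(num_gpus: int, mem_per_gpu_gb: float, nb: int, fill: float = 0.95) -> int:
    """Largest multiple of nb such that an N x N FP64 matrix fits in
    `fill` of the combined GPU memory (1 GB taken as 10**9 bytes)."""
    usable_bytes = fill * num_gpus * mem_per_gpu_gb * 10**9
    n_max = math.isqrt(int(usable_bytes / BYTES_PER_FP64))
    return n_max // nb * nb


def process_grid(num_gpus: int) -> tuple[int, int]:
    """Most nearly square (P, Q) with P <= Q and P * Q == num_gpus."""
    p = math.isqrt(num_gpus)
    while num_gpus % p != 0:
        p -= 1
    return p, num_gpus // p


# Four 40 GB GPUs (XE8545) and two 40 GB GPUs (R7525), NB = 288 as in the text.
for gpus in (4, 2):
    n = hpl_problem_size(num_gpus=gpus, mem_per_gpu_gb=40, nb=288)
    p, q = process_grid(gpus)
    print(f"{gpus} GPUs: N = {n}, P x Q = {p} x {q}")
```

In practice, N is usually tuned downward from such an estimate to leave room for the HPL workspace and driver allocations.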

Figure 4: HPL Performance on the PowerEdge R7525 and XE8545 with NVIDIA A100-40 GB

Figure 5: HPL Power Utilization on the PowerEdge XE8545 with four NVIDIA A100 GPUs and R7525 with two NVIDIA A100 GPUs

From Figure 4 and Figure 5, we can make the following observations:

SXM4 vs PCIe: With one GPU, the NVIDIA A100-SXM4 GPU outperforms the A100-PCIe GPU by 11 percent. The higher base clock frequency of the SXM4 GPU is the predominant factor contributing to the additional performance over the PCIe GPU.
Scalability: The PowerEdge XE8545 server with four NVIDIA A100-SXM4-40GB GPUs delivers 3.5 times higher HPL performance than one NVIDIA A100-SXM4-40GB GPU. On the R7525 platform, two A100-PCIe GPUs are 1.94 times faster than one. The A100 GPUs scale well on both platforms for the HPL benchmark.
Higher Rpeak: The HPL code on A100 GPUs uses the new double-precision Tensor Cores, so the theoretical peak for each card is 19.5 TFLOPS, as opposed to 9.7 TFLOPS.
Power: Figure 5 shows the power consumption of a complete HPL run with the PowerEdge XE8545 using 4 x A100-SXM4 GPUs and the PowerEdge R7525 using 2 x A100-PCIe GPUs. Power was measured with iDRAC commands. The peak power consumption is 2877 W for the XE8545 and 1206 W for the R7525.
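The scaling and peak figures quoted in these observations can be turned into a few derived numbers. The sketch below is a back-of-the-envelope calculation only: it counts the GPU contribution to Rpeak and ignores the CPUs (our assumption), and the function names are hypothetical. It uses only values stated in this blog.

```python
FP64_TENSOR_TFLOPS_PER_A100 = 19.5  # from Table 1


def gpu_rpeak_tflops(num_gpus: int) -> float:
    """Theoretical FP64 Tensor Core peak of the GPUs only (CPUs ignored)."""
    return num_gpus * FP64_TENSOR_TFLOPS_PER_A100


def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Measured speedup over one GPU divided by the ideal (linear) speedup."""
    return speedup / num_gpus


systems = {
    # name: (GPU count, HPL speedup over one GPU, peak system power in W)
    "PowerEdge XE8545": (4, 3.5, 2877),
    "PowerEdge R7525": (2, 1.94, 1206),
}

for name, (gpus, speedup, peak_watts) in systems.items():
    print(
        f"{name}: GPU Rpeak {gpu_rpeak_tflops(gpus):.1f} TFLOPS, "
        f"scaling efficiency {scaling_efficiency(speedup, gpus):.1%}, "
        f"peak system power per GPU {peak_watts / gpus:.0f} W"
    )
```

The power-per-GPU value divides whole-system power by the GPU count, so it includes CPUs, memory, and fans and should not be read as the GPU's own draw.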

High Performance Conjugate Gradient

The TOP500 list has incorporated the High Performance Conjugate Gradient (HPCG) results as an alternate metric to assess system performance.

Figure 6: HPCG Performance on the PowerEdge R7525 and PowerEdge XE8545 Servers

Unlike HPL, HPCG performance depends heavily on the memory system and, beyond one server, on network performance. Because the PCIe and SXM4 form factors of the A100 GPU have the same memory bandwidth, there is no variation in performance between them at a single node. HPCG performance scales well on both servers.

GROMACS

The following figure shows the performance results for GROMACS, a molecular dynamics application, on the PowerEdge R7525 and XE8545 servers. GROMACS 2021.0 was compiled with CUDA compilers and Open-MPI, as shown in Table 3.

Figure 7: GROMACS performance on the PowerEdge R7525 and PowerEdge XE8545 Servers

The GROMACS build included thread MPI (built in with the GROMACS package). Performance results are presented using the ns/day metric. For each test, the performance was optimized by varying the number of MPI ranks and threads, number of PME ranks, and different nstlist values to obtain the best performance result.
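A tuning sweep of this kind can be scripted. The following Python sketch only builds the `gmx mdrun` command lines for each combination and does not run them. The `-ntmpi`, `-ntomp`, `-npme`, and `-nstlist` options are standard `mdrun` options, but the input file name and the value ranges shown are hypothetical placeholders, not the settings used for the results in this blog.

```python
from itertools import product


def mdrun_commands(tpr, ntmpi_values, ntomp_values, npme_values, nstlist_values):
    """Return one `gmx mdrun` argument list per valid parameter combination."""
    commands = []
    for ntmpi, ntomp, npme, nstlist in product(
        ntmpi_values, ntomp_values, npme_values, nstlist_values
    ):
        # Dedicated PME ranks must leave at least one rank for the
        # particle-particle work, so skip combinations that do not.
        if npme >= ntmpi:
            continue
        commands.append([
            "gmx", "mdrun",
            "-s", tpr,
            "-ntmpi", str(ntmpi),      # thread-MPI ranks
            "-ntomp", str(ntomp),      # OpenMP threads per rank
            "-npme", str(npme),        # ranks dedicated to PME
            "-nstlist", str(nstlist),  # neighbor-list update interval
        ])
    return commands


# Hypothetical sweep; replace the file name and ranges with your own.
sweep = mdrun_commands(
    tpr="water1536.tpr",
    ntmpi_values=[2, 4],
    ntomp_values=[8, 16],
    npme_values=[0, 1],
    nstlist_values=[100, 200],
)
for cmd in sweep:
    print(" ".join(cmd))
```

Each command would then be run and its reported ns/day compared to select the best configuration for a given dataset and GPU count.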

With one GPU under test, the performance of the SXM4-based XE8545 server is similar to that of the PCIe-based R7525. With two GPUs under test, the XE8545 performs up to 28 percent better than the R7525. Datasets such as Water 1536 and Water 3072 demand more GPU-to-GPU communication, and this is where the SXM4 form factor performs around 28 percent better. On the other hand, for datasets such as LignoCellulose 3M, the two-GPU R7525 achieves the same per-GPU performance as the XE8545 but with the lower-power 250 W GPU, making it the more efficient solution.

LAMMPS

The following figure shows the performance results for LAMMPS, a molecular dynamics application, on the PowerEdge R7525 and XE8545 servers. The code was compiled with the KOKKOS package to run efficiently on NVIDIA GPUs, and Lennard Jones is the dataset that was tested with Timesteps/s as the metric for comparison.

Figure 8: LAMMPS performance on the PowerEdge R7525 and PowerEdge XE8545 Servers

With one GPU under test, the performance of the SXM4-based XE8545 server is 13 percent higher than that of the PCIe-based R7525, and with two GPUs under test, a 23 percent performance improvement was measured. The PowerEdge XE8545 has an advantage because the GPUs can communicate with each other over NVLink without the intervention of a CPU, whereas the R7525 server with two GPUs is limited by its GPU-to-GPU communication path. The other factor contributing to the better performance is the higher clock rate of the SXM4 A100 GPU.

Conclusion

In this blog, we discussed the performance of NVIDIA A100 GPUs on the PowerEdge R7525 server and the PowerEdge XE8545 server, which is the new addition from Dell Technologies. Based on Table 1, the A100 GPU has 73 percent more memory bandwidth and higher double-precision FLOPS than its predecessor, the V100 series GPU. For workloads that demand more GPU-to-GPU communication, the PowerEdge XE8545 server is an ideal choice. For data centers where space and power are limited, the PowerEdge R7525 server may be the right fit. The overall performance of the PowerEdge XE8545 server with four A100-SXM4 GPUs is 1.5 to 2.3 times faster than the PowerEdge R7525 server with two A100-PCIe GPUs.

In the future, we intend to evaluate the A100-80GB GPUs and NVIDIA A40 GPUs that will be available this year. We also plan to focus on a multi-node performance study with these GPUs.

Please contact your Dell sales specialist about the HPC and AI Innovation Lab if you would like to evaluate these GPU servers.

