New Frontiers—Dell EMC PowerEdge R750xa Server with NVIDIA A100 GPUs

Tue, 01 Jun 2021 20:18:04 -0000

Deepthi Cherlopalle

Frank Han

Savitha Pareek

Dell Technologies has released the new PowerEdge R750xa server, a platform built for GPU workloads and designed to support artificial intelligence, machine learning, and high-performance computing solutions. The dual-socket 2U platform supports 3rd Gen Intel Xeon processors (code named Ice Lake) with up to 40 cores per processor, eight memory channels per CPU, and up to 32 DDR4 DIMMs at 3200 MT/s. The server can accommodate up to four double-width PCIe GPUs, which are located in the front left and the front right of the server.

Compared with the previous generation PowerEdge C4140 and PowerEdge R740 GPU platform options, the new PowerEdge R750xa server supports larger storage capacity, provides more flexible GPU offerings, and improves thermal performance.

Figure 1 PowerEdge R750xa server

The NVIDIA A100 GPUs are built on the NVIDIA Ampere architecture to enable double precision workloads. This blog evaluates the new PowerEdge R750xa server and compares its performance with the previous generation PowerEdge C4140 server.

The following table shows the specifications for the NVIDIA GPU that is discussed in this blog and compares the performance improvement from the previous generation.

Table 1 NVIDIA GPU specifications

GPU name	A100 PCIe	V100 PCIe	Improvement
GPU architecture	Ampere	Volta	-
GPU memory	40 GB	32 GB	25%
GPU memory bandwidth	1555 GB/s	900 GB/s	73%
Peak FP64	9.7 TFLOPS	7 TFLOPS	39%
Peak FP64 Tensor Core	19.5 TFLOPS	N/A	-
Peak FP32	19.5 TFLOPS	14 TFLOPS	39%
Peak FP32 Tensor Core	156 TFLOPS (312 TFLOPS*)	N/A	-
Peak mixed precision FP16 ops/FP32 accumulate	312 TFLOPS (624 TFLOPS*)	125 TFLOPS	2.5x (5x*)
GPU base clock	765 MHz	1230 MHz	-
Peak INT8	624 TOPS (1,248 TOPS*)	N/A	-
GPU boost clock	1410 MHz	1380 MHz	2.2%
NVLink speed	600 GB/s	N/A	-
Maximum power consumption	250 W	250 W	No change

*with sparsity
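The Improvement column is simple arithmetic on the two GPU columns. The following minimal sketch recomputes it so that any row can be checked. The dictionary layout and the function names are illustrative choices, not part of any NVIDIA or Dell tool. The values are copied from Table 1, without sparsity.

```python
# Spec values copied from Table 1 as (A100 PCIe, V100 PCIe), without sparsity.
SPECS = {
    "GPU memory (GB)": (40, 32),
    "GPU memory bandwidth (GB/s)": (1555, 900),
    "Peak FP64 (TFLOPS)": (9.7, 7),
    "Peak FP32 (TFLOPS)": (19.5, 14),
    "Peak FP16/FP32 mixed precision (TFLOPS)": (312, 125),
    "GPU boost clock (MHz)": (1410, 1380),
}


def improvement_percent(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


def improvement_table(specs):
    """Map each spec name to its percentage gain, rounded to one decimal."""
    return {
        name: round(improvement_percent(a100, v100), 1)
        for name, (a100, v100) in specs.items()
    }


if __name__ == "__main__":
    for name, pct in improvement_table(SPECS).items():
        print(f"{name:42s} {pct:7.1f}%")
```

A gain of 150 percent corresponds to a ratio of 2.5x, which is how the mixed precision row is usually quoted.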

Test bed and applications

This blog quantifies the performance improvement of the GPUs with the new PowerEdge GPU platform.

We derived all results presented in this blog from a single-node PowerEdge R750xa test bed in the Dell HPC & AI Innovation Lab. This section describes the test bed and the applications that were evaluated as part of the study. The following table provides test environment details:

Table 2 Server configuration

Component	Test Bed 1	Test Bed 2
Server	Dell PowerEdge R750xa	Dell PowerEdge C4140 configuration M
Processor	Intel Xeon 8380	Intel Xeon 6248
Memory	32 x 16 GB @ 3200 MT/s	16 x 16 GB @ 2933 MT/s
Operating system	Red Hat Enterprise Linux 8.3	Red Hat Enterprise Linux 8.3
GPU	4 x NVIDIA A100-PCIe-40 GB GPU	4 x NVIDIA V100-PCIe-32 GB GPU

The following table provides information about the applications and benchmarks used:

Table 3 Benchmark and application details

Application	Domain	Version	Benchmark dataset
High-Performance Linpack	Floating point compute-intensive system benchmark	xhpl_cuda-11.0-dyn_mkl-static_ompi-4.0.4_gcc4.8.5_7-23-20	Problem size is more than 95% of GPU memory
HPCG	Sparse matrix calculations	xhpcg-3.1_cuda_11_ompi-3.1	512 * 512 * 288
GROMACS	Molecular dynamics application	2020	Lignocellulose, Water 1536, Water 3072
LAMMPS	Molecular dynamics application	29 October 2020 release	Lennard Jones
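To make the HPL entry "more than 95% of GPU memory" concrete, the following sketch estimates the matrix dimension N for a given memory budget. It rests on assumptions that the source does not state: the N x N matrix is stored in double precision (8 bytes per element), 1 GB is taken as 10^9 bytes, and N is rounded down to a multiple of a hypothetical block size of 384. The function name `hpl_problem_size` is illustrative, and the actual problem sizes used in the study are not given here.

```python
import math


def hpl_problem_size(gpu_count, mem_per_gpu_gb, fill_fraction=0.95, block_size=384):
    """Estimate the largest HPL dimension N that fits the memory budget.

    Assumptions (not from the source): FP64 storage, 1 GB = 1e9 bytes, and
    N rounded down to a multiple of `block_size` (a hypothetical NB value).
    """
    budget_bytes = gpu_count * mem_per_gpu_gb * 1e9 * fill_fraction
    elements = int(budget_bytes // 8)      # 8 bytes per double
    n = math.isqrt(elements)               # N * N elements must fit the budget
    return (n // block_size) * block_size  # align N to the block size


if __name__ == "__main__":
    for gpus in (1, 2, 4):
        n = hpl_problem_size(gpus, mem_per_gpu_gb=40)
        print(f"{gpus} x A100 40 GB at 95% fill: N is about {n}")
```

Rounding down to the block size keeps the estimate within the budget. A real run would also have to leave room for workspace buffers, so this figure is only a starting point for tuning.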

LAMMPS

Large-Scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is distributed by Sandia National Labs and the US Department of Energy. LAMMPS is open-source code that has different accelerated models for performance on CPUs and GPUs. For our test, we compiled the binary using the KOKKOS package, which runs efficiently on GPUs.

Figure 2 LAMMPS Performance on PowerEdge R750xa and PowerEdge C4140 servers

With the newer generation GPUs, single-GPU performance for this application improves 2.4 times. The overall performance from a single server doubles with the PowerEdge R750xa server and NVIDIA A100 GPUs.

GROMACS

GROMACS is a free and open-source parallel molecular dynamics package designed for simulations of biochemical molecules such as proteins, lipids, and nucleic acids. It is used by a wide variety of researchers, particularly for biomolecular and chemistry simulations. GROMACS supports all the usual algorithms expected from a modern molecular dynamics implementation. The latest versions are available under the GNU Lesser General Public License (LGPL).

Figure 3 GROMACS performance on PowerEdge C4140 and R750xa servers

With the newer generation GPUs, single-GPU performance for this application improves approximately 1.5 times across the datasets. The overall performance from a single server also improves 1.5 times with the PowerEdge R750xa server and NVIDIA A100 GPUs.

High-Performance Linpack

High-Performance Linpack (HPL) needs no introduction in the HPC arena. It is a widely used standard benchmark in the industry.

Figure 4 HPL Performance on the PowerEdge R750xa server with A100 GPU and PowerEdge C4140 server with V100 GPU

Figure 5 Power use of the HPL running on NVIDIA GPUs

From Figure 4 and Figure 5, the following results were observed:

Performance: For the same GPU count, the NVIDIA A100 GPU demonstrates twice the performance of the NVIDIA V100 GPU. Larger memory, higher double-precision FLOPS, and a newer architecture contribute to the improvement for the NVIDIA A100 GPU.
Scalability: The PowerEdge R750xa server with four NVIDIA A100-PCIe-40 GB GPUs delivers 3.6 times higher HPL performance compared to one NVIDIA A100-PCIe-40 GB GPU. The NVIDIA A100 GPUs scale well inside the PowerEdge R750xa server for the HPL benchmark.
Higher Rpeak: The HPL code on NVIDIA A100 GPUs uses the new double-precision Tensor Cores. The theoretical peak for each GPU is therefore 19.5 TFLOPS, as opposed to 9.7 TFLOPS for standard FP64.
Power: Figure 5 shows power consumption of a complete HPL run with the PowerEdge R750xa server using four A100-PCIe GPUs. This result was measured with iDRAC commands, and the peak power consumption was observed as 2022 W. Based on our previous observations, the PowerEdge C4140 server consumes approximately 1800 W.
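The scalability, Rpeak, and power observations above reduce to a few lines of arithmetic, sketched below. The function names are illustrative. The Rpeak figures count GPUs only and ignore any CPU contribution, which is a simplifying assumption. The power ratio compares the measured 2022 W peak with the approximate 1800 W figure quoted for the PowerEdge C4140 server.

```python
def scaling_efficiency(speedup, gpu_count):
    """Fraction of ideal linear scaling achieved across `gpu_count` GPUs."""
    return speedup / gpu_count


def aggregate_rpeak_tflops(per_gpu_tflops, gpu_count):
    """GPU-only theoretical peak; CPU FLOPS are ignored (an assumption)."""
    return per_gpu_tflops * gpu_count


if __name__ == "__main__":
    gpus = 4
    print(f"HPL scaling efficiency: {scaling_efficiency(3.6, gpus):.0%}")
    print(f"Rpeak with FP64 Tensor Cores: {aggregate_rpeak_tflops(19.5, gpus):.1f} TFLOPS")
    print(f"Rpeak with standard FP64:     {aggregate_rpeak_tflops(9.7, gpus):.1f} TFLOPS")
    # Peak power of the R750xa run relative to the approximate C4140 figure.
    print(f"Peak power ratio (R750xa / C4140): {2022 / 1800:.2f}")
```

With the numbers from this section, a 3.6 times speedup on four GPUs corresponds to 90 percent scaling efficiency, and the GPU-only Rpeak of the four-GPU server is 78 TFLOPS when the FP64 Tensor Cores are counted.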

HPCG

Figure 6 Scaling GPU performance data for HPCG Benchmark

As discussed in other blogs, High Performance Conjugate Gradient (HPCG) is another standard benchmark that tests the data access patterns of sparse matrix calculations. The graph shows that HPCG scales well with GPU count on the PowerEdge R750xa server and delivers a 1.6 times performance improvement over the previous generation PowerEdge C4140 server with NVIDIA V100 GPUs.

The 73 percent improvement in memory bandwidth of the NVIDIA A100 GPU over the NVIDIA V100 GPU contributes to the performance improvement.
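As a rough check on the link between memory bandwidth and HPCG performance, the following sketch compares the observed 1.6 times gain with the ratio of peak memory bandwidths from Table 1. It treats HPCG as purely bandwidth bound, a simplifying assumption made only for this estimate. The function names are illustrative.

```python
def bandwidth_ratio(new_gb_per_s, old_gb_per_s):
    """Upper bound on speedup if performance tracked memory bandwidth alone."""
    return new_gb_per_s / old_gb_per_s


def realized_fraction(observed_speedup, bound):
    """Share of the bandwidth-implied speedup that was actually observed."""
    return observed_speedup / bound


if __name__ == "__main__":
    bound = bandwidth_ratio(1555, 900)  # A100 PCIe vs. V100 PCIe, from Table 1
    print(f"Bandwidth ratio: {bound:.2f}x")
    print(f"Observed HPCG gain as a share of that ratio: {realized_fraction(1.6, bound):.0%}")
```

Under this assumption, the observed improvement recovers most of the gain that the bandwidth ratio alone would predict.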

Conclusion

In this blog, we introduced the latest generation PowerEdge R750xa platform and discussed the performance improvement over the previous generation PowerEdge C4140 server. The PowerEdge R750xa server is a good option for customers looking for an Intel Xeon Scalable processor-based platform powered by NVIDIA GPUs.

With the newer generation PowerEdge R750xa server and NVIDIA A100 GPUs, the applications discussed in this blog show significant performance improvement.

Next steps

In future blogs, we plan to evaluate NVLink bridge support, which is another important feature of the PowerEdge R750xa server and NVIDIA A100 GPUs.
