HPC Application Performance on Dell PowerEdge R7525 Servers with NVIDIA A100 GPGPUs
Tue, 24 Nov 2020 17:39:49 -0000
High-Performance Linpack benchmark
High Performance Conjugate Gradient benchmark
Related Blog Posts
New Frontiers—Dell EMC PowerEdge R750xa Server with NVIDIA A100 GPUs
Tue, 01 Jun 2021 20:08:09 -0000
Dell Technologies has released the new PowerEdge R750xa server, a GPU workload-based platform that is designed to support artificial intelligence, machine learning, and high-performance computing solutions. The dual-socket, 2U platform supports 3rd Gen Intel Xeon Scalable processors (code-named Ice Lake) with up to 40 cores per processor, eight memory channels per CPU, and up to 32 DDR4 DIMMs at 3200 MT/s. The server can accommodate up to four double-width PCIe GPUs, located at the front left and front right of the server.
Compared with the previous generation PowerEdge C4140 and PowerEdge R740 GPU platform options, the new PowerEdge R750xa server supports larger storage capacity, provides more flexible GPU offerings, and improves thermal management.
Figure 1 PowerEdge R750xa server
The NVIDIA A100 GPUs are built on the NVIDIA Ampere architecture to enable double precision workloads. This blog evaluates the new PowerEdge R750xa server and compares its performance with the previous generation PowerEdge C4140 server.
The following table shows the specifications for the NVIDIA GPU that is discussed in this blog and compares the performance improvement from the previous generation.
Table 1 NVIDIA GPU specifications: GPU memory bandwidth, peak FP64 Tensor Core, peak FP32 Tensor Core, peak mixed precision (FP16 ops with FP32 accumulate), GPU base clock, GPU boost clock, and maximum power consumption
Test bed and applications
This blog quantifies the performance improvement of the GPUs with the new PowerEdge GPU platform.
All results presented in this blog were derived from a single-node PowerEdge R750xa server in the Dell HPC & AI Innovation Lab. This section describes the test bed and the applications that were evaluated as part of the study. The following table provides test environment details:
Table 2 Server configuration
Component | Test Bed 1 | Test Bed 2
Server | Dell PowerEdge R750xa | Dell PowerEdge C4140 configuration M
Processor | Intel Xeon 8380 | Intel Xeon 6248
Memory | 32 x 16 GB @ 3200 MT/s | 16 x 16 GB @ 2933 MT/s
Operating system | Red Hat Enterprise Linux 8.3 | Red Hat Enterprise Linux 8.3
GPU | 4 x NVIDIA A100-PCIe-40 GB | 4 x NVIDIA V100-PCIe-32 GB
The following table provides information about the applications and benchmarks used:
Table 3 Benchmark and application details
HPL | Floating point compute-intensive system benchmark; problem size is more than 95% of GPU memory
HPCG | Sparse matrix calculations; problem size 512 x 512 x 288
GROMACS | Molecular dynamics application
LAMMPS | Molecular dynamics application; 29 October 2020 release
The Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is distributed by Sandia National Laboratories, a US Department of Energy laboratory. LAMMPS is open-source code with different accelerated models for performance on CPUs and GPUs. For our test, we compiled the binary with the KOKKOS package, which runs efficiently on GPUs.
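KOKKOS acceleration is selected at run time through LAMMPS command-line switches. The following sketch assembles a typical launch command; the GPU count and the in.lj input deck are illustrative placeholders, not the exact settings used in this study:

```python
# Sketch of a LAMMPS launch command that enables the KOKKOS package on GPUs.
# The input deck "in.lj" and the GPU count are placeholders.
ngpus = 4
cmd = [
    "mpirun", "-np", str(ngpus),  # one MPI rank per GPU
    "lmp",                        # LAMMPS binary built with KOKKOS
    "-k", "on", "g", str(ngpus),  # turn KOKKOS on with <ngpus> GPUs
    "-sf", "kk",                  # apply the kk (KOKKOS) style suffix
    "-in", "in.lj",               # placeholder input deck
]
print(" ".join(cmd))
```

The `-sf kk` switch substitutes KOKKOS-accelerated variants of the styles in the input deck without editing the deck itself.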
Figure 2 LAMMPS Performance on PowerEdge R750xa and PowerEdge C4140 servers
With the newer generation GPUs, single-GPU performance for this application improves 2.4 times. The overall performance of a single server improved two times with the PowerEdge R750xa server and NVIDIA A100 GPUs.
GROMACS is a free and open-source parallel molecular dynamics package designed for simulations of biochemical molecules such as proteins, lipids, and nucleic acids. It is used by a wide variety of researchers, particularly for biomolecular and chemistry simulations. GROMACS supports all the usual algorithms expected from modern molecular dynamics implementation. It is open-source software with the latest versions available under the GNU Lesser General Public License (LGPL).
Figure 3 GROMACS performance on PowerEdge C4140 and R750xa servers
With the newer generation GPUs, single-GPU performance for this application improved approximately 1.5 times across the datasets. The overall performance of a single server improved 1.5 times with the PowerEdge R750xa server and NVIDIA A100 GPUs.
High-Performance Linpack (HPL) needs no introduction in the HPC arena. It is one of the most widely used standard benchmarks in the industry.
Figure 4 HPL Performance on the PowerEdge R750xa server with A100 GPU and PowerEdge C4140 server with V100 GPU
Figure 5 Power use during the HPL run on NVIDIA GPUs
From Figure 4 and Figure 5, the following results were observed:
- Performance—At each GPU count, the NVIDIA A100 GPU demonstrates twice the performance of the NVIDIA V100 GPU. Larger memory, higher double-precision FLOPS, and a newer architecture contribute to the improvement of the NVIDIA A100 GPU.
- Scalability—The PowerEdge R750xa server with four NVIDIA A100-PCIe-40 GB GPUs delivers 3.6 times higher HPL performance compared to one NVIDIA A100-PCIE-40 GB GPU. The NVIDIA A100 GPUs scale well inside the PowerEdge R750xa server for the HPL benchmark.
- Higher Rpeak—The HPL code on NVIDIA A100 GPUs uses the new double-precision Tensor Cores. The theoretical peak for each GPU is 19.5 TFLOPS with the Tensor Cores, as opposed to 9.7 TFLOPS for standard FP64.
- Power—Figure 5 shows power consumption of a complete HPL run with the PowerEdge R750xa server using four A100-PCIe GPUs. This result was measured with iDRAC commands, and the peak power consumption was observed as 2022 Watts. Based on our previous observations, we know that the PowerEdge C4140 server consumes approximately 1800 W of power.
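For reference, the server-level Rpeak behind these numbers is simple arithmetic. A minimal sketch follows; the Rmax value below is a placeholder for illustration, not a result reported in this blog:

```python
# Theoretical FP64 Tensor Core peak for the 4-GPU PowerEdge R750xa configuration.
tflops_per_a100 = 19.5           # FP64 Tensor Core peak per A100 (TFLOPS)
ngpus = 4
rpeak = tflops_per_a100 * ngpus  # 78 TFLOPS for the server

# HPL efficiency is measured Rmax divided by Rpeak; the Rmax value here
# is a placeholder, not a number reported in this blog.
rmax_placeholder = 60.0
efficiency = rmax_placeholder / rpeak
print(f"Rpeak = {rpeak} TFLOPS, efficiency = {efficiency:.1%}")
```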
Figure 6 Scaling GPU performance data for HPCG Benchmark
As discussed in other blogs, High Performance Conjugate Gradient (HPCG) is another standard benchmark that tests the data access patterns of sparse matrix calculations. The graph shows that HPCG scales well on the PowerEdge R750xa server, delivering a 1.6 times performance improvement over the previous generation PowerEdge C4140 server with NVIDIA V100 GPUs.
The 72 percent improvement in memory bandwidth of the NVIDIA A100 GPU over the NVIDIA V100 GPU contributes to the performance improvement.
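That figure follows from the published memory bandwidth specifications, roughly 1,555 GB/s for the A100-PCIe-40 GB versus 900 GB/s for the V100-PCIe (NVIDIA datasheet values, not measurements from this study):

```python
# Memory bandwidth improvement of A100 over V100, using published
# datasheet figures (GB/s), not measured values from this study.
a100_bw = 1555
v100_bw = 900
improvement = (a100_bw - v100_bw) / v100_bw
print(f"{improvement:.1%}")  # roughly the ~72 percent quoted above
```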
In this blog, we introduced the latest generation PowerEdge R750xa platform and discussed the performance improvement over the previous generation PowerEdge C4140 server. The PowerEdge R750xa server is a good option for customers looking for an Intel Xeon scalable CPU-based platform powered with NVIDIA GPUs.
With the newer generation PowerEdge R750xa server and NVIDIA A100 GPUs, the applications discussed in this blog show significant performance improvement.
In future blogs, we plan to evaluate NVLINK bridge support, which is another important feature of the PowerEdge R750xa server and NVIDIA A100 GPUs.
HPC Application Performance on Dell EMC PowerEdge R7525 Servers with the AMD MI100 Accelerator
Tue, 15 Dec 2020 14:23:27 -0000
The Dell EMC PowerEdge R7525 server supports the AMD MI100 GPU Accelerator. The server is a two-socket, 2U rack-based server that is designed to run complex workloads using highly scalable memory, I/O capacity, and network options. The system is based on the 2nd Gen AMD EPYC processor (up to 64 cores), has up to 32 DIMMs, and has PCI Express (PCIe) 4.0-enabled expansion slots. The server supports SATA, SAS, and NVMe drives and up to three double-wide 300 W accelerators.
The following figure shows the front view of the server:
Figure 1. Dell EMC PowerEdge R7525 server
The AMD Instinct™ MI100 accelerator is one of the fastest HPC GPUs on the market. It delivers higher performance for HPC applications through the following key technologies:
- AMD Compute DNA (CDNA)—Architecture optimized for compute-oriented workloads
- AMD ROCm—An Open Software Platform that includes GPU drivers, compilers, profilers, math and communication libraries, and system resource management tools
- Heterogeneous-Computing Interface for Portability (HIP)—An interface that enables developers to convert CUDA code to portable C++ so that the same source code can run on AMD GPUs
This blog focuses on the performance characteristics of a single PowerEdge R7525 server with AMD MI100-32G GPUs. We present results from the general matrix multiplication (GEMM) microbenchmarks, the LAMMPS benchmarks, and the NAMD benchmarks to showcase performance and scalability.
The following table provides the configuration details of the PowerEdge R7525 system under test (SUT):
Table 1. SUT hardware and software configurations
Processor | AMD EPYC 7502 32-core processor
Memory | 512 GB (16 x 32 GB @ 3200 MT/s)
Local disk | 2 x 1.8 TB SSD (no RAID)
Operating system | Red Hat Enterprise Linux Server 8.2
GPU | 3 x AMD MI100-PCIe-32G
BIOS options | Processor Settings > Logical Processors; NUMA node per socket
NAMD | Version: NAMD 3.0 ALPHA 6
LAMMPS (KOKKOS) benchmark | Version: LAMMPS patch_18Sep2020 + AMD patches
The following table lists the AMD MI100 GPU specifications:
Table 2. AMD MI100 PCIe GPU specification
Peak engine clock (MHz) | 1502
Peak FP64 (TFLOPS) | 11.5
Peak FP64 Tensor DGEMM (TFLOPS) | 11.5
Peak FP32 (TFLOPS) | 23.1
Peak FP32 Tensor SGEMM (TFLOPS) | 46.1
Memory size (GB) | 32
Memory ECC support | Yes
The GEMM benchmark is a simple, multithreaded dense matrix-to-matrix multiplication benchmark that can be used to test the performance of GEMM on a single GPU. The rocblas-bench binary compiled from https://github.com/ROCmSoftwarePlatform/rocBLAS was used to collect DGEMM and SGEMM results. The results of these tests reflect the performance of an ideal application that only runs matrix multiplication in the form of the peak TFLOPS that the GPU can deliver. Although GEMM benchmark results might not represent real-world application performance, it is still a good benchmark to demonstrate the performance capability of different GPUs.
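As a rough CPU-side illustration of how GEMM benchmarks arrive at their reported figures (a NumPy sketch, not the rocblas-bench methodology itself), the sustained rate is the operation count 2 x m x n x k divided by the elapsed time:

```python
# CPU sketch of how a GEMM benchmark derives FLOPS: time a dense matrix
# multiply and divide the operation count (2*m*n*k) by the elapsed time.
# GPU benchmarks such as rocblas-bench apply the same formula to GPU kernels.
import time
import numpy as np

m = n = k = 512
a = np.random.rand(m, k)
b = np.random.rand(k, n)

start = time.perf_counter()
c = a @ b                # dense double-precision GEMM
elapsed = time.perf_counter() - start

flops = 2.0 * m * n * k  # multiply-add count for GEMM
gflops = flops / elapsed / 1e9
print(f"{gflops:.1f} GFLOPS sustained")
```

In practice the kernel is run many times and the best or average time is kept, to amortize launch overhead.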
The following figure shows the observed numbers of DGEMM and SGEMM:
Figure 2. DGEMM and SGEMM peak and sustained performance for the AMD MI100-PCIe GPU
The results indicate:
- In the DGEMM (double-precision GEMM) benchmark, the theoretical peak performance of the AMD MI100 GPU is 11.5 TFLOPS, and the measured sustained performance is 7.9 TFLOPS. As shown in Table 2, the standard double-precision (FP64) peak and the FP64 Tensor DGEMM peak are both 11.5 TFLOPS. Because most real-world HPC applications are not dominated by DGEMM or other matrix operations, this high standard FP64 capability also boosts performance on non-matrix double-precision calculations.
- For FP32 Tensor operations in the SGEMM (single-precision GEMM) benchmark, the theoretical peak performance of the AMD MI100 GPU is 46.1 TFLOPS, and the measured sustained performance is approximately 30 TFLOPS.
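Dividing the sustained figures above by their theoretical peaks shows the fraction of peak throughput that the MI100 sustains on GEMM:

```python
# GEMM efficiency = sustained TFLOPS / theoretical peak TFLOPS,
# using the MI100 figures quoted above.
dgemm_eff = 7.9 / 11.5   # double precision, roughly 69% of peak
sgemm_eff = 30.0 / 46.1  # single precision (FP32 tensor), roughly 65% of peak
print(f"DGEMM: {dgemm_eff:.0%}, SGEMM: {sgemm_eff:.0%}")
```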
The LAMMPS benchmark
The Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) runs in parallel using message-passing techniques. This benchmark measures performance and scalability across multiple GPUs.
The following figure shows that the KOKKOS implementation of LAMMPS scales relatively linearly as AMD MI100 GPUs are added, across four datasets: EAM, LJ, Tersoff, and ReaxFF/C.
Figure 3. LAMMPS benchmark showing scaling of multiple AMD MI100 GPUs
The NAMD Benchmark
Nanoscale Molecular Dynamics (NAMD) is a parallel molecular dynamics system designed for simulation of large biomolecular systems. The NAMD benchmark stresses the scaling and performance aspects of the server and GPU configuration.
The following figure plots the results of the NAMD microbenchmark:
Figure 4. NAMD benchmark performance
Aggregate data across multiple GPU cards is reported because the Alpha builds of the NAMD 3.0 binary do not scale beyond a single accelerator. Three replica simulations were launched on the same server in parallel, one on each GPU. NAMD was CPU-bound in previous versions; the new 3.0 version has reduced that CPU dependence. As a result, the three-copy simulation scaled linearly, performing three times faster across all datasets.
As part of the optimization, the NAMD benchmark numbers in the following figure show the relative performance difference using different numbers of CPU cores for the STMV dataset:
Figure 5. CPU core dependency on NAMD
The AMD MI100 GPU exhibited an optimum configuration of four CPU cores per GPU.
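As an illustration, a NAMD 3.0 launch line pinned to four CPU cores and one GPU might be assembled as follows; the input file name and device index are placeholders, not the exact settings used in this study:

```python
# Sketch of a NAMD 3.0 GPU launch command using 4 CPU cores per replica.
# "stmv.namd" and the device index are placeholders.
cores_per_gpu = 4
gpu_id = 0
cmd = [
    "namd3",
    f"+p{cores_per_gpu}",  # CPU worker threads for this replica
    "+devices", str(gpu_id),  # GPU assigned to this replica
    "stmv.namd",           # placeholder input file
]
print(" ".join(cmd))
```

For the three-replica runs described above, three such commands would be launched in parallel, each with a different device index.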
The AMD MI100 accelerator delivers industry-leading performance and is well positioned on performance per dollar for both FP32 and FP64 HPC parallel codes.
- FP32 applications perform well on the AMD MI100 GPU, as shown by the SGEMM, LAMMPS, and NAMD benchmark results, by using tensor cores and native FP32 compute cores.
- FP64 applications perform well using the AMD MI100 GPU by using native FP64 compute cores.
In the future, we plan to test other HPC and Deep Learning applications. We also plan to research using “Hipify” tools to port CUDA sources to HIP.