
Molecular Dynamics Simulations with Dell EMC PowerEdge XE8545 Server and NVIDIA A100

Wed, 02 Jun 2021 19:37:48 -0000

Kihoon Yoon

Overview

Over the past decade, graphics processing units (GPUs) have become popular in scientific computing because they can exploit a high degree of parallelism. NVIDIA has a handful of life sciences applications that are optimized to run on its general-purpose GPUs. Unfortunately, these GPUs are programmed mainly with frameworks such as CUDA, OpenACC, and OpenCL. Most members of the life sciences community are not familiar with these frameworks, so few biologists or bioinformaticians can make efficient use of GPU architectures. However, GPUs have been making inroads into the molecular dynamics simulation (MDS) field, which dates back to the development of MD in the 1950s. MDS requires heavy computational work to simulate biomolecular structures or their interactions.

In this blog, we tested two MDS applications, NAMD and LAMMPS, on the Dell EMC PowerEdge XE8545 server with NVIDIA A100 GPUs. Because the XE8545 server does not support the NVIDIA V100 GPU, we can only roughly estimate the performance boost of the A100 by comparing against our previous tests.

These two applications are free and open-source parallel MD packages designed for analyzing the physical movements of atoms and molecules.

The test server configuration is summarized in the following table.

Table 1. Tested compute node configuration

Dell EMC PowerEdge XE8545
CPU	2x AMD EPYC 7713 (Milan), 64 cores each, 2.0 GHz – 3.7 GHz base-boost, TDP 225 W, 256 MB L3 cache
RAM	DDR4 1024 GB (32 x 32 GB) 3200 MT/s
Operating system	RHEL 8.3 (4.18.0-240.el8.x86_64)
Filesystem network	Mellanox InfiniBand HDR100
Filesystem	Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage
BIOS system profile	Performance Optimized
Logical processor	Disabled
Virtualization technology	Disabled
Accelerator	4 x A100-40 GB SXM4
CUDA Toolkit	11.2
OpenMPI	4.1.1
NAMD	NAMD_Git-2021-04-01_Source
LAMMPS	Stable version (29 Oct 2020)

Performance Evaluation

NAMD

Nanoscale Molecular Dynamics (NAMD) is open-source software for molecular dynamics simulation. It is written using the Charm++ parallel programming model and is designed for high-performance simulation of large biomolecular systems.

NAMD was built from the NAMD_Git-2021-04-01_Source source code with GCC 11.1 and CUDA 11.2. For our tests, we used two datasets: the 1.06 million-atom Satellite Tobacco Mosaic Virus (STMV) system and the HECBioSim 3000k-atom system, which is a pair of 1IVO and 1NQL hEGFR tetramers.

Figure 1 shows the performance of 4x A100 GPUs with the STMV dataset. NAMD uses the ++p option to specify the number of worker threads. The generic recommendation is to set it to the total number of cores minus the total number of GPUs. However, the optimal number of cores on the Milan EPYC 7003 family of processors, such as the EPYC 7713 used in the test system, does not follow this recommendation. It appears to be around 79 to 90 cores, and it depends on the data size.

The result, close to 9 nanoseconds of simulation per day (ns/day), is a significant gain over the NVIDIA V100 tests that we ran previously. It is difficult to attribute the gain solely to the new A100 GPUs, because a comparison of the 16 GB V100 on the Intel Skylake platform with the 40 GB A100 on the AMD Milan platform may not be valid.

Figure 1. Estimated simulation time per day with 4x NVIDIA A100 GPUs (STMV dataset)
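The thread-count rule described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and are not part of NAMD. The core and GPU counts come from Table 1, and the 79 to 90 range is the one observed in our tests.

```python
def generic_worker_threads(total_cores: int, num_gpus: int) -> int:
    """Generic rule quoted in the text: total cores minus total GPUs."""
    if num_gpus < 0 or total_cores <= num_gpus:
        raise ValueError("need a non-negative GPU count and more cores than GPUs")
    return total_cores - num_gpus


def sweep_thread_counts(low: int, high: int, step: int = 1) -> list[int]:
    """Candidate ++p values to benchmark, inclusive of both ends."""
    if step <= 0 or low > high:
        raise ValueError("need step > 0 and low <= high")
    return list(range(low, high + 1, step))


if __name__ == "__main__":
    total_cores = 2 * 64  # two 64-core EPYC 7713 CPUs (Table 1)
    num_gpus = 4          # 4x A100 (Table 1)
    print("generic ++p value:", generic_worker_threads(total_cores, num_gpus))
    # Range where the best STMV results appeared in this study.
    print("candidates to benchmark:", sweep_thread_counts(79, 90))
```

Because the best value depends on the data size, a sweep like this is a starting point for benchmarking rather than a fixed setting.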

The purpose of the additional test with the 3 million-atom protein tetramers is to confirm that the STMV test results are not an artifact of the relatively small icosahedral structure of STMV and of the partial simulation of assembly and disassembly processes. Figure 2 shows the ns/day plot for the 3000k-atom data. A rate of 2.1 ns/day seems to be close to the maximum performance with 64 cores.

Figure 2. Estimated simulation time per day with 4x NVIDIA A100 GPUs (HECBioSim 3000k-atom dataset)
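For readers less familiar with the ns/day metric used in Figures 1 and 2, the following sketch shows how it relates to the wall-clock time per step. The function names are hypothetical, and the timestep and step time in the usage example are assumed values chosen for illustration, not settings or measurements from this study.

```python
SECONDS_PER_DAY = 86_400
FS_PER_NS = 1_000_000  # femtoseconds in one nanosecond


def ns_per_day(timestep_fs: float, seconds_per_step: float) -> float:
    """Simulated nanoseconds per wall-clock day.

    timestep_fs: simulated time advanced by one MD step, in femtoseconds.
    seconds_per_step: measured wall-clock time for one MD step.
    """
    if timestep_fs <= 0 or seconds_per_step <= 0:
        raise ValueError("both arguments must be positive")
    steps_per_day = SECONDS_PER_DAY / seconds_per_step
    return steps_per_day * timestep_fs / FS_PER_NS


def ns_per_day_from_days_per_ns(days_per_ns: float) -> float:
    """Invert a days-per-nanosecond figure into ns/day."""
    if days_per_ns <= 0:
        raise ValueError("days_per_ns must be positive")
    return 1.0 / days_per_ns


if __name__ == "__main__":
    # Assumed values for illustration only: 2 fs timestep, 19.2 ms per step.
    print(f"{ns_per_day(2.0, 0.0192):.2f} ns/day")
```

At a fixed timestep, halving the time per step doubles ns/day, which is why the metric is a convenient way to compare hardware configurations.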

LAMMPS

Large-scale Atomic/Molecular Massively Parallel Simulator, or LAMMPS, is a classical molecular dynamics code with potentials for solid-state materials (metals and semiconductors), soft matter (biomolecules and polymers), and coarse-grained or mesoscopic systems. LAMMPS can model atoms, or it can be used as a parallel particle simulator at the atomic, meso, or continuum scale. LAMMPS runs on single processors, or in parallel using message-passing techniques and spatial decomposition of the simulation domain. For our tests, LAMMPS was built from source with GCC 11.1, OpenMPI 4.1.1, and CUDA 11.2. The 465k-atom system was selected from HECBioSim.

As shown in Figure 3, LAMMPS scales well with the number of A100 GPUs. With 4x A100 GPUs, an 8.4 ns/day simulation is achievable.

Figure 3. Estimated simulation time per day with various numbers of GPUs
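One way to quantify how well a code scales with GPU count is to compute the speedup and parallel efficiency relative to the smallest configuration. The sketch below is a generic calculation, not the method used to produce Figure 3. The function name is hypothetical, and the ns/day values in the usage example are placeholders, not measurements from this study.

```python
def scaling_table(perf_by_gpus: dict[int, float]) -> list[tuple[int, float, float]]:
    """Return (gpu_count, speedup, efficiency) rows, sorted by GPU count.

    perf_by_gpus maps a GPU count to the measured ns/day at that count.
    Speedup is relative to the smallest GPU count in the input;
    efficiency is the speedup divided by the increase in GPU count.
    """
    if not perf_by_gpus:
        raise ValueError("no measurements given")
    if any(n <= 0 or p <= 0 for n, p in perf_by_gpus.items()):
        raise ValueError("GPU counts and performance values must be positive")

    base_gpus = min(perf_by_gpus)
    base_perf = perf_by_gpus[base_gpus]

    rows = []
    for gpus in sorted(perf_by_gpus):
        speedup = perf_by_gpus[gpus] / base_perf
        efficiency = speedup / (gpus / base_gpus)
        rows.append((gpus, speedup, efficiency))
    return rows


if __name__ == "__main__":
    # Placeholder ns/day values for illustration only, NOT results from this study.
    placeholder = {1: 2.0, 2: 3.8, 4: 7.0}
    for gpus, speedup, efficiency in scaling_table(placeholder):
        print(f"{gpus} GPU(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

An efficiency close to 100% at every GPU count corresponds to the near-linear scaling that "scales well" usually implies.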

Conclusion

Although this study does not allow a direct, like-for-like comparison of the A100 and the V100, the Milan CPUs and A100 GPUs show strong synergy by combining more cores with faster GPUs. Running NAMD and LAMMPS on the XE8545 with the A100 can deliver better performance than a V100-based system.

