MD Simulation of GROMACS with AMD EPYC 7003 Series Processors on Dell EMC PowerEdge Servers

Thu, 19 Aug 2021 20:06:53 -0000

Savitha Pareek

Joseph Stanfield

Ashish K Singh

AMD recently launched its third-generation EPYC 7003 series processor family (code-named Milan). These processors build on the preceding 7002 series (Rome) generation and improve the L3 cache architecture, along with increased memory bandwidth, for workloads such as High Performance Computing (HPC).

The Dell EMC HPC and AI Innovation Lab has been evaluating these new processors with Dell EMC’s latest 15G PowerEdge servers. This blog reports our initial findings for the molecular dynamics (MD) application GROMACS.

Given the enormous health impact of the ongoing COVID-19 pandemic, researchers and scientists are working closely with the HPC and AI Innovation Lab to obtain the appropriate computing resources to improve the performance of molecular dynamics simulations. GROMACS is an extensively used application for MD simulations. We evaluated it with standard datasets on the latest AMD EPYC Milan processors (based on Zen 3 cores) in Dell EMC PowerEdge servers to get the most out of the MD simulations.

In a previous blog, Molecular Dynamic Simulation with GROMACS on AMD EPYC- ROME, we published benchmark data for a single-node and multi-node GROMACS application study on Dell EMC servers with AMD EPYC Rome processors.

The results featured in this blog come from the test bed described in the following table. We performed a single-node and multi-node application study on Milan processors, using the latest AMD stack shown in Table 1, with GROMACS 2020.4 to understand the performance improvement over the older generation processor (Rome).

Table 1: Testbed hardware and software details

Server	Dell EMC PowerEdge 2-socket servers (with AMD Milan processors)	Dell EMC PowerEdge 2-socket servers (with AMD Rome processors)
Processor: cores/socket, frequency (base – boost), default TDP, L3 cache, processor bus speed	7763 (Milan): 64, 2.45 GHz – 3.5 GHz, 280 W, 256 MB, 16 GT/s	7H12 (Rome): 64, 2.6 GHz – 3.3 GHz, 280 W, 256 MB, 16 GT/s
Processor: cores/socket, frequency (base – boost), default TDP, L3 cache, processor bus speed	7713 (Milan): 64, 2.0 GHz – 3.675 GHz, 225 W, 256 MB, 16 GT/s	7702 (Rome): 64, 2.0 GHz – 3.35 GHz, 200 W, 256 MB, 16 GT/s
Processor: cores/socket, frequency (base – boost), default TDP, L3 cache, processor bus speed	7543 (Milan): 32, 2.8 GHz – 3.7 GHz, 225 W, 256 MB, 16 GT/s	7542 (Rome): 32, 2.9 GHz – 3.4 GHz, 225 W, 128 MB, 16 GT/s
Operating system	Red Hat Enterprise Linux 8.3 (4.18.0-240.el8.x86_64)	Red Hat Enterprise Linux 7.8
Memory	DDR4 256 GB (16 GB x 16) 3200 MT/s
BIOS/CPLD	2.0.2 / 1.1.12
Interconnect	NVIDIA Mellanox HDR	NVIDIA Mellanox HDR 100

Table 2: Benchmark datasets used for GROMACS performance evaluation

Datasets	Details
Water Molecule	1536 K and 3072 K
HecBioSim	1400 K and 3000 K
Prace – Lignocellulose	3M

The following sections describe the performance evaluation for the processor stack listed in Table 1.

Rome processors compared to Milan processors (GROMACS)

Figure 1: GROMACS performance comparison with AMD Rome processors

For the performance comparisons, we selected Rome processors that are closest to their Milan counterparts in hardware features such as cache size, TDP, and base and boost frequency. For each dataset listed in Table 2, we report the maximum ns/day value attained.
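The ns/day figure used throughout this blog is the value GROMACS prints on the "Performance:" line at the end of its log file. The following is a minimal sketch of how that value could be collected and how the maximum across repeated runs could be selected. The function names and the sample log text are illustrative assumptions, not the scripts used for this study.

```python
import re
from typing import Iterable, Optional

# Matches the summary line GROMACS writes at the end of a log file, for example:
# "Performance:       12.345        1.944"   (ns/day, then hour/ns)
_PERF_LINE = re.compile(r"^Performance:\s+([0-9.]+)\s+([0-9.]+)", re.MULTILINE)


def parse_ns_per_day(log_text: str) -> Optional[float]:
    """Return the ns/day value found in GROMACS log text, or None if absent."""
    matches = _PERF_LINE.findall(log_text)
    if not matches:
        return None
    # Use the last occurrence in case several runs were appended to one log.
    return float(matches[-1][0])


def best_ns_per_day(log_texts: Iterable[str]) -> Optional[float]:
    """Return the maximum ns/day across several runs of the same dataset."""
    values = [parse_ns_per_day(text) for text in log_texts]
    values = [v for v in values if v is not None]
    return max(values) if values else None


if __name__ == "__main__":
    # Hypothetical log tail, used only to demonstrate the parser.
    sample = (
        "                 (ns/day)    (hour/ns)\n"
        "Performance:       12.345        1.944\n"
    )
    print(parse_ns_per_day(sample))
```

Taking the last match in a log and the maximum across logs mirrors the "maximum value attained" approach described above. Averaging the runs instead would be an equally reasonable choice.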

Figure 1 shows that the 32C Milan processor outperforms the 32C Rome processor by 19 percent for water 1536, by 21 percent for water 3072, and by approximately 10 to 12 percent for the HecBioSim and lignocellulose datasets. This result is due to a higher processor speed and an improved L3 cache, which lets each core access more data.

Next, with the higher-end processors, we see only a 10 percent gain for the water dataset, because it is more memory intensive. For the remaining datasets, the frequency improvement adds some further gain. Overall, the Milan processor results demonstrated a substantial performance improvement for GROMACS over Rome processors.
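The percentage gains quoted above are relative differences in ns/day between two processors on the same dataset. A minimal sketch of that arithmetic follows. The function name and the input values are illustrative assumptions, not measurements from this study.

```python
def percent_gain(new_ns_per_day: float, old_ns_per_day: float) -> float:
    """Return the percentage improvement of the new result over the old one."""
    if old_ns_per_day <= 0:
        raise ValueError("old_ns_per_day must be positive")
    return (new_ns_per_day / old_ns_per_day - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical ns/day values chosen only to illustrate the calculation.
    rome_ns_per_day = 10.0
    milan_ns_per_day = 11.9
    print(round(percent_gain(milan_ns_per_day, rome_ns_per_day), 1))
```

With these placeholder inputs the gain works out to about 19 percent, the same form as the water 1536 figure reported for the 32C processors.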

Milan processors comparison (32C processors compared to 64C processors)

Figure 2: GROMACS performance with Milan processors

Figure 2 shows performance relative to the performance obtained on the 7543 processor. For instance, for water 1536, moving from the 32C processor to a 64-core (64C) processor improves performance by 41 percent with the 7713 processor and by 57 percent with the 7763 processor. The improvement is due to the increased core count and higher CPU core frequency. We observed that GROMACS is frequency sensitive, but not to a great extent. Greater gains may be seen when running GROMACS across multiple ensemble runs or with datasets that have a higher number of atoms.

We recommend that you compare the price-to-performance ratio for your datasets before choosing a processor on the basis of higher CPU core frequency, because processors with a higher number of lower-frequency cores may provide better total performance.
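One simple way to make that comparison is to rank candidate configurations by ns/day delivered per unit of cost. The sketch below assumes that definition of price-to-performance. The class, the candidate names, and every number in it are hypothetical placeholders, not prices or results from this study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str              # hypothetical configuration label
    ns_per_day: float      # measured throughput for one dataset
    node_price_usd: float  # assumed cost of one configured server


def performance_per_dollar(candidate: Candidate) -> float:
    """Return ns/day delivered per US dollar of server cost."""
    return candidate.ns_per_day / candidate.node_price_usd


def rank_by_price_performance(candidates: List[Candidate]) -> List[Candidate]:
    """Sort candidates from best to worst ns/day per dollar."""
    return sorted(candidates, key=performance_per_dollar, reverse=True)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    options = [
        Candidate("cpu_a_32c", ns_per_day=10.0, node_price_usd=10_000.0),
        Candidate("cpu_b_64c", ns_per_day=14.0, node_price_usd=20_000.0),
    ]
    for option in rank_by_price_performance(options):
        print(option.name, performance_per_dollar(option))
```

In this made-up case the faster configuration ranks second on price-to-performance, which shows why the ranking should be repeated for each dataset of interest with real measurements and quotes.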

Multi-node study with 7713 64C processors

Figure 3: Multi-node study with 7713 64C processors

For the multi-node tests, the test bed was configured with an NVIDIA Mellanox HDR interconnect running at 200 Gbps, and each server was equipped with AMD EPYC 7713 processors. All cores in each server were used while running the benchmarks. As expected, GROMACS performance scaled close to linearly with core count up to four nodes for every dataset.
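Near-linear scaling can be quantified as speedup relative to the smallest node count, divided by the ideal speedup. The following minimal sketch shows that calculation. The function name and the ns/day values are illustrative assumptions, not the measured results behind Figure 3.

```python
from typing import Dict, Tuple


def scaling_efficiency(ns_per_day_by_nodes: Dict[int, float]) -> Dict[int, Tuple[float, float]]:
    """Map each node count to (speedup, parallel efficiency).

    Speedup is measured against the smallest node count in the input.
    Efficiency is speedup divided by the ideal (linear) speedup, so a
    value of 1.0 means perfectly linear scaling.
    """
    base_nodes = min(ns_per_day_by_nodes)
    base_perf = ns_per_day_by_nodes[base_nodes]
    result = {}
    for nodes, perf in sorted(ns_per_day_by_nodes.items()):
        speedup = perf / base_perf
        ideal_speedup = nodes / base_nodes
        result[nodes] = (speedup, speedup / ideal_speedup)
    return result


if __name__ == "__main__":
    # Hypothetical ns/day values for 1, 2, and 4 nodes.
    measured = {1: 10.0, 2: 20.0, 4: 36.0}
    for nodes, (speedup, efficiency) in scaling_efficiency(measured).items():
        print(nodes, round(speedup, 2), round(efficiency, 2))
```

An efficiency close to 1.0 at four nodes corresponds to the near-linear behavior described above. Lower values would point to communication or load-balancing overhead.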

Conclusion

For the various datasets we evaluated, GROMACS exhibited strong scaling and was compute intensive. We recommend a processor with a high core count for smaller datasets (water 1536, HecBioSim 1400); larger datasets (water 3072, lignocellulose, HecBioSim 3000) would benefit from more memory per core. Configuring the best BIOS options is important to get the best performance out of the system.

For more information and updates, follow this blog site.
