Blogs

Blogs about Dell Technologies solutions for high performance computing

WRF

WRF Performance with 3rd Generation Intel Xeon Scalable Processors On Dell EMC PowerEdge servers

Puneet Singh Ashish Kumar Singh

Tue, 21 Sep 2021 13:47:34 -0000

Many sectors, such as aviation, travel, tourism, energy, and transportation, rely heavily on timely and accurate weather predictions from weather forecast centers. These operational forecast centers use numerical weather prediction (NWP) models to predict the weather from current conditions. The Weather Research and Forecasting (WRF) model is one of the most widely used NWP systems. For the WRF model to deliver timely forecasts, a high-performance computing (HPC) cluster needs a suitable combination of robust computational resources, a high-speed network, and high-throughput storage.

In this blog, we highlight the performance improvement for WRF with Intel Ice Lake processors compared with Intel Cascade Lake processors on Dell EMC PowerEdge servers. These tests were carried out on two-socket Dell PowerEdge servers with the BIOS set to the HPC workload profile. The testbed hardware and software details are outlined in the following table:

Table 1: Testbed hardware and software details

Component | Dell EMC PowerEdge R750 server | Dell EMC PowerEdge R650 server | Dell EMC PowerEdge C6420 server | Dell EMC PowerEdge C6420 server
SKU | 8380 | 6338 | 8280 | 6252
Cores/Socket | 40 | 32 | 28 | 24
Frequency (Base-Max Turbo) | 2.30 – 3.40 GHz | 2.0 – 3.20 GHz | 2.70 – 4.0 GHz | 2.10 – 3.70 GHz
TDP | 270 W | 205 W | 205 W | 150 W
L3 Cache | 60M | 48M | 38.5M | 37.75M
Operating System | Red Hat Enterprise Linux 8.3, 4.18.0-240.22.1.el8_3.x86_64 | Red Hat Enterprise Linux 8.3, 4.18.0-240.22.1.el8_3.x86_64 | Red Hat Enterprise Linux 8.3, 4.18.0-240.el8.x86_64 | Red Hat Enterprise Linux 8.3, 4.18.0-240.el8.x86_64
Memory | 32 GB x 16 (2Rx8) 3200 MT/s | 32 GB x 16 (2Rx8) 3200 MT/s | 16 GB x 12 (2Rx8) 2933 MT/s | 16 GB x 12 (2Rx8) 2933 MT/s
BIOS/CPLD | 1.2.4/1.0.5 (R750 and R650) | 2.11.2/1.1.0 (C6420)
Interconnect | NVIDIA Mellanox HDR | NVIDIA Mellanox HDR | NVIDIA Mellanox HDR100 | NVIDIA Mellanox HDR100
Compiler | Intel parallel studio 2020 (update 4)
Datasets | conus 2.5km, new conus 2.5km, wrf_large 3km

We benchmarked WRF-V3.9.1.1 with the conus 2.5km and new conus 2.5km datasets and WRF-V4.2.2 with new conus 2.5km and wrf_large 3km datasets. The following figure shows the simulation domain for the tested datasets:

Figure 1: Domain configuration for new conus 2.5 km, conus 2.5 km, and wrf_large datasets.

The following table provides a brief description of each dataset:

Table 2:   Configuration for new conus 2.5 km, conus 2.5 km and wrf_large datasets

 

Parameter | conus 2.5 km | new conus 2.5 km | wrf_large
Run hours | 3 | 3 | 2
Resolution (m) | 2500 | 2500 | 3000
Vertical layers | 35 | 35 | 50
Grid points | 1501 x 1201 | 1901 x 1301 | 1500 x 1500
interval_seconds | 10800 | 10800 | 21600

The results were measured by averaging the WRF computation time of each timestep from the rsl.error.0000 output file. The timesteps during the file read / write (of wrfout* / wrfinput* ) were not included in the average.
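
The averaging step described above can be sketched as follows. This is a minimal illustration, not the script used for these results. It assumes that rsl.error.0000 contains the usual "Timing for main: ... elapsed seconds" lines, and that a timestep includes file I/O when a "Timing for Writing" or "Timing for processing" line directly precedes it. The regular expressions, the skip rule, and the function name are all assumptions.

```python
import re

# Assumed log line formats (not taken from the blog); adjust them to your WRF output.
MAIN_RE = re.compile(
    r"Timing for main: time \S+ on domain\s+\d+:\s+([0-9.]+) elapsed seconds"
)
IO_RE = re.compile(r"Timing for (?:Writing|processing)\b")


def mean_compute_step_time(lines):
    """Average 'Timing for main' values, skipping steps that follow an I/O timing line."""
    times = []
    io_pending = False
    for line in lines:
        if IO_RE.search(line):
            # The next 'Timing for main' entry is assumed to include this I/O time.
            io_pending = True
            continue
        match = MAIN_RE.search(line)
        if match is None:
            continue
        if not io_pending:
            times.append(float(match.group(1)))
        io_pending = False
    if not times:
        raise ValueError("no compute-only timesteps found")
    return sum(times) / len(times)


# Hypothetical excerpt used only to exercise the parser.
SAMPLE = [
    "Timing for main: time 2018-06-17_00:00:15 on domain   1:    2.00000 elapsed seconds",
    "Timing for Writing wrfout_d01_2018-06-17_00:00:30 for domain        1:    5.00000 elapsed seconds",
    "Timing for main: time 2018-06-17_00:00:30 on domain   1:    9.00000 elapsed seconds",
    "Timing for main: time 2018-06-17_00:00:45 on domain   1:    4.00000 elapsed seconds",
]
print(mean_compute_step_time(SAMPLE))
```

To use it on a real run, pass an open file handle for rsl.error.0000 instead of the sample list.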

Single node performance

The following figures show the application performance for the datasets mentioned in Table 2. In each figure, the numbers over the bars represent the performance relative to that obtained with the Intel 6252 Cascade Lake processor model. The blue and green bars represent application performance obtained with Ice Lake and Cascade Lake processors, respectively.

Figure 2: Relative performance of WRF by processor and dataset type mentioned in Table 1

WRF was compiled with the "dm + sm" configuration with AVX2 instructions and serial NetCDF support (io_form* set to 2). All available cores were subscribed during the WRF simulation runs. To optimize performance, we tested different combinations of MPI process counts and OpenMP thread counts, as well as different tiling schemes (WRF_NUM_TILES).
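
A sketch of how such a sweep could be enumerated is shown below. It lists every (MPI ranks, OpenMP threads) pair that fully subscribes a node and crosses it with a few tile counts. The 80-core figure comes from the dual-socket 8380 row of Table 1. The candidate tile counts and the helper names are hypothetical, and the blog does not state which combinations were actually run.

```python
def hybrid_layouts(total_cores):
    """Return (mpi_ranks, omp_threads) pairs whose product equals total_cores."""
    return [
        (total_cores // threads, threads)
        for threads in range(1, total_cores + 1)
        if total_cores % threads == 0
    ]


def launch_env(threads, tiles):
    """Environment variables for one trial run."""
    return {"OMP_NUM_THREADS": str(threads), "WRF_NUM_TILES": str(tiles)}


TOTAL_CORES = 2 * 40           # two sockets x 40 cores (Xeon 8380, Table 1)
CANDIDATE_TILES = (1, 2, 4, 8)  # hypothetical tile counts to sweep

trials = [
    (ranks, threads, tiles)
    for ranks, threads in hybrid_layouts(TOTAL_CORES)
    for tiles in CANDIDATE_TILES
]
for ranks, threads, tiles in trials:
    print(ranks, threads, launch_env(threads, tiles))
```

Each printed line gives the rank count to pass to the MPI launcher and the environment for that trial. The best combination is then the one with the lowest average timestep time.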

Depending on the dataset, the 8380 processor model can deliver up to 19 percent better performance than the 6338 processor model. Relative to Cascade Lake, the Ice Lake architecture has more memory channels and offers higher aggregate memory bandwidth. WRF, which is typically memory bandwidth bound, can take advantage of the additional memory bandwidth provided by Ice Lake (Table 3), and the results demonstrate up to 65 percent performance improvement over the Cascade Lake counterparts. Table 3 compares the Instructions Per Cycle (IPC) and DRAM bandwidth utilization collected with the Intel oneAPI VTune Profiler on Intel Ice Lake and Cascade Lake processors.

Table 3: Metrics collected using the Intel oneAPI VTune Profiler


Dataset | 8380 IPC | 8380 Bandwidth (GB/s) | 8280 IPC | 8280 Bandwidth (GB/s)
conus 2.5km (WRFV3) | 0.99 | 257.32 | 0.86 | 128.30
new conus 2.5km (WRFV3) | 1.57 | 192.18 | 1.48 | 120.96
new conus 2.5km (WRFV4) | 1.36 | 191.43 | 1.14 | 115.46
wrf_large (WRFV4) | 1.09 | 64.80 | 0.90 | 62.55

Intel’s Ice Lake is expected to deliver around 20 percent better IPC than the Cascade Lake model (8380 vs 8280). With the datasets covered in this blog, we found that the Intel 8380 processor reports 6 to 19 percent better IPC than the Intel 8280 processor.
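
As a worked example, the generation-over-generation ratios can be recomputed from the rounded entries in Table 3. This is only a sketch. Because the table values are rounded to two decimals, the recomputed percentages can differ by a point or two from figures derived from the unrounded profiler data.

```python
# IPC and DRAM bandwidth (GB/s) copied from Table 3 as (8380, 8280) pairs.
TABLE3 = {
    "conus 2.5km (WRFV3)": {"ipc": (0.99, 0.86), "bw": (257.32, 128.30)},
    "new conus 2.5km (WRFV3)": {"ipc": (1.57, 1.48), "bw": (192.18, 120.96)},
    "new conus 2.5km (WRFV4)": {"ipc": (1.36, 1.14), "bw": (191.43, 115.46)},
    "wrf_large (WRFV4)": {"ipc": (1.09, 0.90), "bw": (64.80, 62.55)},
}


def percent_gain(new, old):
    """Percentage by which `new` exceeds `old`."""
    return (new / old - 1.0) * 100.0


gains = {
    name: {metric: percent_gain(*pair) for metric, pair in metrics.items()}
    for name, metrics in TABLE3.items()
}
for name, gain in gains.items():
    print(f"{name}: IPC {gain['ipc']:+.1f}%, bandwidth {gain['bw']:+.1f}%")
```

The bandwidth column shows the larger spread: the conus 2.5km run roughly doubles its DRAM bandwidth on the 8380, while wrf_large barely changes.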

Figure 3 shows box-and-whisker plots of the power consumption recorded while the system was being benchmarked with the four tests shown in Figure 2. Each box indicates the spread of the central 50% of the power data, and the central line represents the median power value. The dots show the outlier power values, most of which were recorded during the initialization and finalization phases of the tests.
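
For readers unfamiliar with this plot type, the following sketch computes the quantities that a box plot displays. The power readings are hypothetical and are not measurements from this study. The 1.5 x IQR outlier rule is a common plotting default and is an assumption here, since the blog does not say which rule was used for Figure 3.

```python
import statistics


def box_stats(samples):
    """Quartiles and outliers as drawn by a typical box-and-whisker plot."""
    q1, median, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    # Assumed outlier rule: points beyond 1.5 x IQR from the box edges.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [s for s in samples if s < low or s > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical power readings in watts; the low value mimics an idle
# sample taken during initialization.
readings = [400, 410, 420, 430, 440, 450, 460, 470, 100]
stats = box_stats(readings)
print(stats)
```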

Figure 3: Power used by platform and processor type

The average operating frequency of the 8380, 6338, 8280, and 6252 processors was around 2.9, 2.5, 3.0, and 2.5 GHz, respectively, for all datasets.

Multi-node Scalability

We used eight nodes to evaluate the scalability of WRF. Each node is equipped with Intel 8380 processors, and the nodes are interconnected using the NVIDIA Mellanox HDR interconnect. The nodes used for benchmarking were connected to the same HDR switch. Table 1 provides details about the servers and software that were used for the test. The text on top of each bar in Figure 4 represents the performance of the application on two, four, and eight nodes relative to its performance on a single node.

Figure 4: Multi-node performance of WRF on an Intel 8380 processor model for datasets listed in Table 1

The scalability numbers have been rounded off to a single digit. We observed good scalability with all the datasets listed in Table 1.

Conclusions and recommendations

For WRF, Intel Ice Lake demonstrates significant performance improvement as compared with Intel Cascade Lake processors. WRF simulations scale well with the datasets described in this blog. The scalability might vary depending on the dataset being used and the node count being tested. For the best performance with WRF, the impact of the tile size, process, and threads per process should be evaluated.

LAMMPS — with Ice Lake on Dell EMC PowerEdge Servers

Savitha Pareek Joseph Stanfield Ashish Kumar Singh

Mon, 30 Aug 2021 20:08:02 -0000

3rd Generation Intel Xeon® Scalable processors (architecture code named Ice Lake) are Intel’s successor to Cascade Lake. New features include up to 40 cores per processor, eight memory channels with support for 3200 MT/s memory speed, and PCIe Gen4. The HPC and AI Innovation Lab at Dell EMC had access to a few systems, and this blog presents the results of our initial benchmarking study.

LAMMPS Overview

Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is an open-source, well-parallelized collection of packages for molecular dynamics (MD) research. It offers a broad collection of “atom styles”, force fields, and many contributed packages, including packages that accelerate force calculations on GPUs. It can simulate systems with billions of atoms.

LAMMPS can run on a single processor or in parallel, up to the largest parallel supercomputers, using some form of message passing such as the Message Passing Interface (MPI). The current source code for LAMMPS is written in C++. For more information about LAMMPS, see https://www.lammps.org/.

Objective

In this study we measure the performance of LAMMPS on different Ice Lake processor models as listed in Table 1 with a comparison to the previous generation Cascade Lake systems. Single node as well as the multi-node scalability tests were conducted.

Compilation Details

We tested the lammps-2July-2021 release of LAMMPS, built with the Intel 2020 update 5 compiler to take advantage of AVX2 and AVX512 optimizations, and with the Intel MKL FFT library. We used the default INTEL package that ships with LAMMPS, which provides well-optimized atom pair styles that exploit the vector instructions on Intel processors. The datasets used for our study are described in Table 2, along with their atom counts and run steps. The unit of performance is timesteps per second; higher is better.
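
The timesteps-per-second metric can be read from the LAMMPS log. The sketch below is one way to do it. It assumes that the log contains a "Performance:" summary line that includes a value followed by "timesteps/s". The sample text and the function name are hypothetical, and the exact layout of that line varies with the units style and the LAMMPS version.

```python
import re

# Assumed format: a line starting with "Performance:" that contains
# "<number> timesteps/s" somewhere after it.
PERF_RE = re.compile(
    r"^Performance:.*?([0-9]+(?:\.[0-9]+)?) timesteps/s", re.MULTILINE
)


def timesteps_per_second(log_text):
    """Return every timesteps/s value reported in a LAMMPS log, in order."""
    return [float(match.group(1)) for match in PERF_RE.finditer(log_text)]


# Hypothetical log excerpt used only to exercise the parser.
SAMPLE_LOG = """\
Loop time of 11.2705 on 80 procs for 7900 steps with 512000 atoms

Performance: 302806.987 tau/day, 700.942 timesteps/s
"""
print(timesteps_per_second(SAMPLE_LOG))
```

A log with several run commands yields one value per run, so the caller can decide whether to keep the last value or average them.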

Hardware and software configurations

Table 1: Hardware and Software test bed details 

 

 

Component | Dell EMC PowerEdge R750 server | Dell EMC PowerEdge R750 server | Dell EMC PowerEdge C6520 server | Dell EMC PowerEdge C6520 server | Dell EMC PowerEdge C6420 server | Dell EMC PowerEdge C6420 server
CPU model | Xeon 8380 | Xeon 8358 | Xeon 8352Y | Xeon 6330 | Xeon 8280 | Xeon 6248R
Cores/Socket | 40 | 32 | 32 | 28 | 28 | 24
Base Frequency | 2.30 GHz | 2.60 GHz | 2.20 GHz | 2.00 GHz | 2.70 GHz | 3.00 GHz
TDP | 270 W | 250 W | 205 W | 205 W | 205 W | 205 W
Operating System | Red Hat Enterprise Linux 8.3, 4.18.0-240.22.1.el8_3.x86_64
Memory | 16 GB x 16 (2Rx8) 3200 MT/s (Ice Lake servers) | 16 GB x 12 (2Rx8) 2933 MT/s (Cascade Lake servers)
BIOS/CPLD | 1.1.2/1.0.1
Interconnect | NVIDIA Mellanox HDR (Ice Lake servers) | NVIDIA Mellanox HDR100 (Cascade Lake servers)
Compiler | Intel parallel studio 2020 (update 4)
LAMMPS | 2july2021

Datasets used for performance analysis

Table 2: Description of datasets used for performance analysis

Datasets | Description | Units | Atomic Style | Atom Size | Step Size
Lennard Jones | Atomic fluid (LJ Benchmark) | lj | atomic | 512000 | 7900
Rhodo | Protein (Rhodopsin Benchmark) | real | full | 512000 | 520
Liquid crystal | Liquid Crystal w/ Gay-Berne potential | lj | ellipsoid | 524288 | 840
Eam | Copper benchmark with Embedded Atom Method | metal | atomic | 512000 | 3100
Stillinger Weber | Silicon benchmark with Stillinger-Weber | metal | atomic | 512000 | 6200
Tersoff | Silicon benchmark with Tersoff | metal | atomic | 512000 | 2420
Water | Coarse-grain water benchmark using Stillinger-Weber | real | atomic | 512000 | 2600
Polyethylene | Polyethylene benchmark with AIREBO | metal | atomic | 522240 | 550




Figure 1: Views of the datasets rendered in OVITO (scientific data visualization and analysis software for molecular and other particle-based simulation models). Subfigures 1a to 1h each show a small portion of the simulation domain for the Atomic Fluid (Lennard Jones), rhodo (protein), liquid crystal (lc), copper (eam), Stillinger-Weber (sw), Tersoff, water, and polyethylene datasets, respectively.

Table 2 and Figure 1 describe and illustrate the datasets used for the single-node and multi-node analysis. All datasets were visualized using OVITO. For the single-node performance study, all the datasets shown in Table 2 were used; for the multi-node study, only the Atomic fluid dataset was benchmarked.

Performance Analyses on Single Node  





 

 

 

Figure 2: Single node performance of LAMMPS across the datasets with Intel Ice Lake processor models. Each graph in Figure 2 is an individual subfigure, labeled (a-h) in the order shown. Subfigures 2a to 2h show the single node performance comparison across the Xeon processors, with the Xeon 6330 as the baseline, for the Atomic Fluid (Lennard Jones), rhodo (protein), liquid crystal (lc), copper (eam), Stillinger-Weber (sw), Tersoff, water, and polyethylene datasets, respectively.

Figure 2 shows the single node performance for the eight datasets (subfigures 2a to 2h) listed in Table 2 with the four Ice Lake processor models available for the evaluation of LAMMPS.

For ease of comparison across the processor models, the relative performance for each dataset is shown in a separate graph. Note that each dataset behaves differently in terms of performance, because each uses different molecular potentials and a different number of atoms. Figure 2 shows that performance increases with the core count of the processor model across the datasets used. Comparing the baseline Xeon 6330 (28C) with the Xeon 8380 (40C), we measured a 30 to 45 percent performance gain with these datasets. A fraction of this gain is due to the frequency of the processor model.

Figure 3a: Performance of LAMMPS on Cascade Lake (Xeon 6248R) in comparison to Ice Lake (Xeon 6330)

Figure 3b: Performance of LAMMPS on Cascade Lake (Xeon 8280) in comparison to Ice Lake (Xeon 8380)

Figure 3 compares the performance of the mid-bin Cascade Lake 6248R (24 cores) to the Ice Lake 6330 (28 cores), and the top-end Cascade Lake 8280 (28 cores) to the Ice Lake 8380 (40 cores). In Figure 3a, the Ice Lake 6330 is up to 30 percent faster than the 6248R. The Xeon 6330 has 16 percent more cores and 9 percent faster memory bandwidth. Figure 3b shows that the Ice Lake 8380 is up to 75 percent faster than the Xeon 8280 on single node tests, which is in line with its 42 percent additional cores and 9 percent faster memory bandwidth. These results are attributed to a higher processor speed, wherein more data can be accessed by each core.
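
The core-count and memory-speed percentages quoted above can be reproduced with a few lines of arithmetic. The core counts and memory speeds come from Table 1. The aggregate peak bandwidth estimate additionally assumes six memory channels per Cascade Lake socket and 8 bytes per transfer on a DDR4 channel. Neither figure is stated in this blog, so treat that part as an illustrative estimate.

```python
def percent_more(new, old):
    """Percentage by which `new` exceeds `old`."""
    return (new / old - 1.0) * 100.0


# Core counts and memory speeds (MT/s) from Table 1.
cores_6330_vs_6248r = percent_more(28, 24)
cores_8380_vs_8280 = percent_more(40, 28)
memory_speed_gain = percent_more(3200, 2933)


def peak_bandwidth_gbs(channels, mega_transfers, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth per socket in GB/s."""
    return channels * mega_transfers * bytes_per_transfer / 1000.0


ice_lake_bw = peak_bandwidth_gbs(8, 3200)      # eight channels per socket
cascade_lake_bw = peak_bandwidth_gbs(6, 2933)  # assumption: six channels per socket

print(f"6330 vs 6248R cores: +{cores_6330_vs_6248r:.1f}%")
print(f"8380 vs 8280 cores:  +{cores_8380_vs_8280:.1f}%")
print(f"memory speed:        +{memory_speed_gain:.1f}%")
print(f"peak bandwidth:      {ice_lake_bw:.1f} vs {cascade_lake_bw:.1f} GB/s per socket")
```

The 9 percent figure reflects the per-channel memory speed only. Under the channel-count assumption, the aggregate peak bandwidth per socket grows by considerably more than that.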

Performance Analysis on Multi-Node

To analyze scalability with strong and weak scaling, we used the Atomic fluid (LJ) dataset from the INTEL package. The job ran for 7900 steps with 512000 atoms in the simulation system.


Figure 4a: Fixed size Atomic fluid (LJ) for different problem sizes (strong scaling) w/ Xeon 8380

With strong scaling, we kept the problem size fixed and increased the number of parallel processes (Amdahl’s law). With weak scaling, we varied the system size from 512000 atoms to 4096000 atoms while increasing the number of parallel processes (Gustafson-Barsis law). The test bed included Dell EMC PowerEdge R750 servers, each with dual Ice Lake Xeon 8380 processors and an NVIDIA Networking HDR interconnect running at 200 Gbps.
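
The two laws mentioned above can be compared with a short sketch. The formulas are the textbook forms of Amdahl's law and Gustafson's law. The parallel fraction of 0.95 is a hypothetical value chosen for illustration and is not a measured property of LAMMPS.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Strong scaling: fixed problem size, serial part limits the speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


def gustafson_speedup(parallel_fraction, workers):
    """Weak scaling: the problem grows with the number of workers."""
    return (1.0 - parallel_fraction) + parallel_fraction * workers


PARALLEL_FRACTION = 0.95  # hypothetical value, for illustration only

for nodes in (1, 2, 4, 8):
    strong = amdahl_speedup(PARALLEL_FRACTION, nodes)
    weak = gustafson_speedup(PARALLEL_FRACTION, nodes)
    print(f"{nodes} node(s): Amdahl {strong:.2f}x, Gustafson {weak:.2f}x")
```

The gap between the two curves widens with node count, which is why fixed-size and scaled-size runs are reported separately.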

Figure 4a plots the fixed-size relative performance for four different problem sizes (512000, 1024000, 2048000, and 4096000 atoms) on different numbers of nodes.

The relative performance is normalized by single node performance. Hence, the single node performance for each curve is 1.00 (unity). Relative Performance for fixed size Atomic fluid was calculated by the following equation:

Relative Performance = loop time of ‘N’ node / loop time for single node

Loop time is the total wall-clock time for the simulation to run. Relative performance increases as the problem size increases, because for smaller problems the system spends more time in inter-node communication. The time spent in communication on eight nodes is 61.91%, 59.74%, 48.42%, and 45.04% for the 512000, 1024000, 2048000, and 4096000 atom systems, respectively.

Figure 4b: Scaled size Atomic fluid (LJ) with 512000 atoms per node (weak scaling) w/ Xeon 8380

Figure 4b plots the scaled-size efficiency for runs with 512000 atoms per node. Thus, a scaled-size two-node run uses 1024000 atoms, and an eight-node run uses 4096000 atoms. Relative performance for the scaled-size Atomic fluid was calculated by the following equation:

Relative Performance= loop time for ‘n’ node/ (loop time for single node * number of nodes)

Weak scaling efficiency decreases as the number of nodes increases in the investigated range, because more time is spent in MPI communication at larger node counts. The time spent in communication for the scaled-size runs on 1, 2, 4, and 8 nodes is 27.17%, 32.42%, 40.87%, and 45.04%, respectively.

Figure 5: Multi-node efficiency for Atomic Fluid (LJ) w/ Xeon 8380

Figure 5 plots the multi-node efficiency for Atomic Fluid with the Xeon 8380. The relative performance is normalized to the performance of a single node with 512000 atoms. Hence, the single node performance for 512000 atoms is 1.00 (unity). This point is taken as the baseline for the other comparisons.

Performance Rating = (loop time * number of atoms)/ (loop time for 512000 atoms on 1 node * number of nodes * 512000)

We observed that for smaller systems (those with fewer atoms), the efficiency of strong scaling decreases because the system spends more time in MPI communication, whereas for larger systems with many atoms, the efficiency of strong scaling increases because the time spent in pair-wise force calculation becomes dominant. For weak scaling, the efficiency decreases as the number of nodes increases.

Conclusion

The Ice Lake processor-based Dell EMC PowerEdge servers, with their hardware feature upgrades over Cascade Lake, demonstrate a performance gain of 50 to 70 percent for all the datasets used for benchmarking LAMMPS. Watch our blog site for updates!

GROMACS — with Ice Lake on Dell EMC PowerEdge Servers

Savitha Pareek Joseph Stanfield Ashish Kumar Singh

Mon, 30 Aug 2021 20:08:03 -0000

3rd Generation Intel Xeon® Scalable processors (architecture code named Ice Lake) are Intel’s successor to Cascade Lake. New features include up to 40 cores per processor, eight memory channels with support for 3200 MT/s memory speed, and PCIe Gen4.

The HPC and AI Innovation Lab at Dell EMC had access to a few systems and this blog presents the results of our initial benchmarking study on a popular open-source molecular dynamics application – GROningen MAchine for Chemical Simulations (GROMACS).

Molecular dynamics (MD) simulations are a popular technique for studying the atomistic behavior of any molecular system. It performs the analysis of the trajectories of atoms and molecules where the dynamics of the system progresses over time. 

At the HPC and AI Innovation Lab, we have conducted research on SARS-CoV-2, in which applications like GROMACS helped researchers identify molecules that bind to the spike protein of the virus and block it from infecting human cells. Another use case of MD simulation in medicinal biology is iterative drug design through prediction of protein-ligand docking (in this case, usually modeling the interaction between a drug and its target protein).

Overview of GROMACS

GROMACS is a versatile package for performing MD simulations, that is, simulating the Newtonian equations of motion for systems with hundreds to millions of particles. GROMACS can run on CPUs and GPUs in single-node and multi-node (cluster) configurations. It is free, open-source software released under the GNU Lesser General Public License (LGPL). Check out this page for more details on GROMACS.

Hardware and software configurations

Table 1: Hardware and Software testbed details

 

 

Component | Dell EMC PowerEdge R750 server | Dell EMC PowerEdge R750 server | Dell EMC PowerEdge C6520 server | Dell EMC PowerEdge C6520 server | Dell EMC PowerEdge C6420 server | Dell EMC PowerEdge C6420 server
SKU | Xeon 8380 | Xeon 8358 | Xeon 8352Y | Xeon 6330 | Xeon 8280 | Xeon 6252
Cores/Socket | 40 | 32 | 32 | 28 | 28 | 24
Base Frequency | 2.30 GHz | 2.60 GHz | 2.20 GHz | 2.00 GHz | 2.70 GHz | 2.10 GHz
TDP | 270 W | 250 W | 205 W | 205 W | 205 W | 150 W
L3 Cache | 60M | 48M | 48M | 42M | 38.5M | 37.75M
Operating System | Red Hat Enterprise Linux 8.3, 4.18.0-240.22.1.el8_3.x86_64
Memory | 16 GB x 16 (2Rx8) 3200 MT/s (Ice Lake servers) | 16 GB x 12 (2Rx8) 2933 MT/s (Cascade Lake servers)
BIOS/CPLD | 1.1.2/1.0.1
Interconnect | NVIDIA Mellanox HDR (Ice Lake servers) | NVIDIA Mellanox HDR100 (Cascade Lake servers)
Compiler | Intel parallel studio 2020 (update 4)
GROMACS | 2021.1

Datasets used for performance analysis

Table 2: Description of datasets used for performance analysis

 

Datasets/Download Link | Description | Electrostatics | Atoms | System Size
Water | Movement of water: simulates the motion of many water molecules in a given space and at a given temperature | Particle Mesh Ewald (PME) | 1536K | Small
Water | Same as above | Particle Mesh Ewald (PME) | 3072K | Large
HecBioSim | 1.4M atom system: a pair of hEGFR dimers of 1IVO and 1IVO | Particle Mesh Ewald (PME) | 1.5M | Small
HecBioSim | 3M atom system: a pair of hEGFR tetramers of 1IVO and 1IVO | Particle Mesh Ewald (PME) | 3M | Large
Prace – Lignocellulose | Simulates lignocellulose; the tpr was obtained from the PRACE website | Reaction Field (rf) | 3M | Large

Compilation Details

We compiled GROMACS from source (version 2021.1) using the Intel 2020 Update 5 compiler to take advantage of AVX2 and AVX512 optimizations, and the Intel MKL FFT library. The new version of GROMACS delivers a significant performance gain due to improvements in its parallelization algorithms. The GROMACS build system and the gmx mdrun tool have built-in, configurable intelligence that detects your hardware and makes effective use of it.

Objective of Benchmarking

Our objective is to quantify the performance of GROMACS using different test cases. First, we evaluate performance on the different Ice Lake processors listed in Table 1. Then, we compare the 2nd and 3rd Generation Xeon Scalable processors (Cascade Lake vs Ice Lake). Finally, we compare multi-node scalability with hyper threading enabled and disabled.

To evaluate the datasets with an appropriate metric, we added the associated high-level compiler flags, applied electrostatic load balancing (such as PME), tested with multiple ranks and with separate PME ranks, varied the nstlist value, and created a paradigm for our application (GROMACS).
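
A tuning sweep of this kind could be generated as sketched below. The mdrun options used here (-s, -ntomp, -npme, -nstlist, -noconfout) are standard gmx mdrun options. The input file name, the launcher line, the rank count, and the candidate values are hypothetical and are not the settings used for the results in this blog.

```python
import itertools
import shlex


def mdrun_command(tpr_file, mpi_ranks, omp_threads, pme_ranks, nstlist):
    """Build one gmx_mpi mdrun command line as an argument list."""
    return [
        "mpirun", "-np", str(mpi_ranks),
        "gmx_mpi", "mdrun",
        "-s", tpr_file,
        "-ntomp", str(omp_threads),
        "-npme", str(pme_ranks),    # number of separate PME ranks
        "-nstlist", str(nstlist),   # neighbor-list update interval
        "-noconfout",               # skip writing the final configuration
    ]


# Hypothetical candidate values for the sweep.
PME_RANKS = (0, 8, 16)
NSTLIST_VALUES = (40, 80, 100)

commands = [
    mdrun_command("water_1536k.tpr", 80, 1, npme, nstlist)
    for npme, nstlist in itertools.product(PME_RANKS, NSTLIST_VALUES)
]
for command in commands:
    print(shlex.join(command))
```

The script only prints the command lines, so it can be reviewed or pasted into a job script before anything is run.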

The typical time scales of the simulated systems are on the order of microseconds (µs) or nanoseconds (ns). We measure the performance of each dataset simulation in nanoseconds per day (ns/day).
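
As a worked example of the ns/day metric, the sketch below converts a step count, a timestep, and a wall-clock time into nanoseconds of simulated time per day. The input numbers are hypothetical. GROMACS reports this value itself at the end of a run, so the function is only meant to show what the metric represents.

```python
SECONDS_PER_DAY = 86400.0


def ns_per_day(steps, timestep_ps, wall_seconds):
    """Simulated nanoseconds that would be completed in 24 hours of wall time."""
    simulated_ns = steps * timestep_ps / 1000.0  # 1 ns = 1000 ps
    return simulated_ns * SECONDS_PER_DAY / wall_seconds


# Hypothetical run: 10000 steps of 2 fs (0.002 ps) finished in 40 s.
print(f"{ns_per_day(10000, 0.002, 40.0):.1f} ns/day")
```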

Performance Analyses on Single Node

Figure 1(a): Single node performance of Water 1536K and Water 3072K on Ice Lake processor model

Figure 1(b): Single node performance of Lignocellulose 3M on Ice Lake processor model

Figure 1(c): Single node performance of HecBioSim 1.4M and HecBioSim 3M on Ice Lake processor model

Figures 1(a), 1(b), and 1(c) show the single node performance analyses for the three datasets mentioned in Table 2 with the four processor models available for the evaluation of GROMACS.

Figure 2:  Relative Performance of GROMACS across the datasets with Intel Ice Lake Processor Model

For ease of comparison across the various datasets, the relative performance of the processor model has been included into a single graph. However, it is worth noting that each dataset behaves individually when performance is considered, as each uses different molecular topology input files (tpr), and configuration files.

Individual dataset performance is mentioned in Figures 1(a), 1(b), and 1(c) respectively.

Figure 2 shows that increasing the core count of the processor model increases performance, depending on the dataset used. We observe that the smaller datasets (water 1536K and HecBioSim 1400K) gain 5 to 6 percent more performance than the larger datasets (water 3072K, HecBioSim 3M, and Ligno 3M).

Next, comparing the baseline Xeon 6330 (28C) with the Xeon 8380 (40C), we found a 30 to 50 percent performance gain, depending on the dataset, as the core count increases from 28 to 40. A fraction of the gain comes from the frequency of the processor model.

 

 Performance Analyses on Cascade Lake vs Ice Lake


Figure 3(a): Performance of GROMACS on Cascade Lake (Xeon 6252) vs Ice Lake (Xeon 6330)

Figure 3(b): Performance of GROMACS on Cascade Lake (Xeon 8280) vs Ice Lake (Xeon 8380)

We ensured that the memory configuration was an appropriate fit for the datasets. To begin, we compared each processor with its previous-generation counterpart. For the performance benchmark comparisons, we selected the Cascade Lake processors closest to their Ice Lake counterparts in terms of hardware features such as cache size, TDP values, and processor base/turbo frequency, and recorded the maximum ns/day value attained by each of the datasets mentioned in Table 2.



Figure 3a shows that the Ice Lake 6330 is 50 to 75 percent faster than the 6252. The Xeon 6330 has 16 percent more cores and 9 percent faster memory bandwidth. Figure 3b shows that the Ice Lake 8380 is 50 to 65 percent faster than the Xeon 8280 on single node tests, which is in line with its 42 percent more cores and 9 percent faster memory bandwidth.

This result is attributed to a higher processor speed, wherein more data can be accessed by each core. Also, the datasets are memory intensive, and some of the gain comes from the frequency improvement. Overall, the Ice Lake processor results demonstrated a substantial performance improvement for GROMACS over Cascade Lake processors.

Performance Analysis on Multi-Node

Figure 4(a): Scalability of water 1536K with hyper threading disabled (80C) vs hyper threading enabled (160C) w/ Xeon 8380; the dotted line represents the delta between hyper threading enabled and hyper threading disabled

Figure 4(b): Scalability of water 3072K with hyper threading disabled (80C) vs hyper threading enabled (160C) w/ Xeon 8380; the dotted line represents the delta between hyper threading enabled and hyper threading disabled

Figure 4(c): Scalability of HecBioSim 1.4M with hyper threading disabled (80C) vs hyper threading enabled (160C) w/ Xeon 8380; the dotted line represents the delta between hyper threading enabled and hyper threading disabled

Figure 4(d): Scalability of HecBioSim 3M with hyper threading disabled (80C) vs hyper threading enabled (160C) w/ Xeon 8380; the dotted line represents the delta between hyper threading enabled and hyper threading disabled

Figure 4(e): Scalability of Lignocellulose 3M with hyper threading disabled (80C) vs hyper threading enabled (160C) w/ Xeon 8380; the dotted line represents the delta between hyper threading enabled and hyper threading disabled

For multi-node tests, the test bed was configured with an NVIDIA Mellanox HDR interconnect running at 200 Gbps and each server having the Ice Lake processor. We were able to achieve the expected linear performance scalability for GROMACS of up to eight nodes with hyper threading disabled and approximately 7.25X with hyper threading enabled for eight nodes, across the datasets. All cores in each server were used while running these benchmarks. The performance increases are close to linear across all the dataset types as the core count increases.
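
The scaling figures quoted above translate into parallel efficiency as follows. This sketch uses the usual definition of efficiency as speedup divided by node count. The 7.25x value is the eight-node, hyper threading enabled figure from the paragraph above, and linear scaling with hyper threading disabled corresponds to 8x.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of ideal (linear) scaling that was achieved."""
    return speedup / nodes


for label, speedup in (("hyper threading disabled", 8.0),
                       ("hyper threading enabled", 7.25)):
    efficiency = parallel_efficiency(speedup, 8)
    print(f"{label}: {efficiency:.1%} efficiency on 8 nodes")
```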

Conclusion

The Ice Lake processor-based Dell EMC PowerEdge servers, with notable hardware feature upgrades over Cascade Lake, show a performance gain of 50 to 60 percent for all the datasets used for benchmarking GROMACS. Hyper threading should be disabled for the benchmarks addressed in this blog to get better scalability above eight nodes. As the core count increases, the small datasets mentioned in this blog gain 5 to 6 percent more than the larger ones.

Watch our blog site for updates!

AI PowerEdge AMD

MD Simulation of GROMACS with AMD EPYC 7003 Series Processors on Dell EMC PowerEdge Servers

Savitha Pareek Joseph Stanfield Ashish Kumar Singh

Thu, 19 Aug 2021 20:06:53 -0000

AMD recently announced and launched its third-generation EPYC 7003 series processor family (code named Milan). These processors build upon the preceding-generation 7002 series (Rome) processors and improve the L3 cache architecture, along with increased memory bandwidth for workloads such as High Performance Computing (HPC).

The Dell EMC HPC and AI Innovation Lab has been evaluating these new processors with Dell EMC’s latest 15G PowerEdge servers and will report our initial findings for the molecular dynamics (MD) application GROMACS in this blog.

Given the enormous health impact of the ongoing COVID-19 pandemic, researchers and scientists are working closely with the HPC and AI Innovation Lab to obtain the appropriate computing resources to improve the performance of molecular dynamics simulations. Of these resources, GROMACS is an extensively used application for MD simulations. It has been evaluated with the standard datasets by combining the latest AMD EPYC Milan processors (based on Zen 3 cores) with Dell EMC PowerEdge servers to get the most out of the MD simulations.

In a previous blog, Molecular Dynamic Simulation with GROMACS on AMD EPYC- ROME, we published benchmark data for a GROMACS application study on a single node and multinode with AMD EPYC ROME based Dell EMC servers.

The results featured in this blog come from the test bed described in the following table. We performed a single-node and multi-node application study on Milan processors, using the latest AMD stack shown in Table 1, with GROMACS 2020.4 to understand the performance improvement over the older generation processor (Rome).

Table 1: Testbed hardware and software details

Server | Dell EMC PowerEdge 2-socket servers (with AMD Milan processors) | Dell EMC PowerEdge 2-socket servers (with AMD Rome processors)
Processor | 7763 (Milan) | 7H12 (Rome)
Cores/socket | 64 | 64
Frequency (Base-Boost) | 2.45 GHz – 3.5 GHz | 2.6 GHz – 3.3 GHz
Default TDP | 280 W | 280 W
L3 cache | 256 MB | 256 MB
Processor bus speed | 16 GT/s | 16 GT/s

Processor | 7713 (Milan) | 7702 (Rome)
Cores/socket | 64 | 64
Frequency | 2.0 GHz – 3.675 GHz | 2.0 GHz – 3.35 GHz
Default TDP | 225 W | 200 W
L3 cache | 256 MB | 256 MB
Processor bus speed | 16 GT/s | 16 GT/s

Processor | 7543 (Milan) | 7542 (Rome)
Cores/socket | 32 | 32
Frequency | 2.8 GHz – 3.7 GHz | 2.9 GHz – 3.4 GHz
Default TDP | 225 W | 225 W
L3 cache | 256 MB | 128 MB
Processor bus speed | 16 GT/s | 16 GT/s

Operating system | Red Hat Enterprise Linux 8.3 (4.18.0-240.el8.x86_64) | Red Hat Enterprise Linux 7.8
Memory | DDR4 256 G (16 GB x 16) 3200 MT/s
BIOS/CPLD | 2.0.2 / 1.1.12
Interconnect | NVIDIA Mellanox HDR | NVIDIA Mellanox HDR 100

Table 2: Benchmark datasets used for GROMACS performance evaluation

Datasets | Details
Water Molecule | 1536 K and 3072 K
HecBioSim | 1400 K and 3000 K
Prace – Lignocellulose | 3M

The following information describes the performance evaluation for the processor stack listed in the Table 1.


Rome processors compared to Milan processors (GROMACS)

Figure 1: GROMACS performance comparison with AMD Rome processors

For performance benchmark comparisons, we selected Rome processors that are closest to their Milan counterparts in terms of hardware features such as cache size, TDP values, and processor base/turbo frequency, and recorded the maximum ns/day value attained by each of the datasets mentioned in Table 2.

Figure 1 shows that a 32C Milan processor delivers higher performance than a 32C Rome processor (19 percent for water 1536, 21 percent for water 3072, and approximately 10 to 12 percent for the HecBioSim and lignocellulose datasets). This result is attributed to a higher processor speed and improved L3 cache, wherein more data can be accessed by each core.

Next, with the higher-end processors we see only a 10 percent gain for the water datasets, as they are more memory intensive. For the remaining datasets, some additional gain comes from the frequency improvement. Overall, the Milan processor results demonstrated a substantial performance improvement for GROMACS over Rome processors.


Milan processors comparison (32C processors compared to 64C processors)

Figure 2: GROMACS performance with Milan processors

Figure 2 shows performance relative to the performance obtained on the 7543 processor. For instance, moving from the 32C processor to a 64-core (64C) processor improves the performance of water 1536 by 41 percent (7713 processor) to 57 percent (7763 processor). The improvement is due to the increased core count and higher CPU core frequency. We observed that GROMACS is frequency sensitive, but not to a great extent. Greater gains may be seen when running GROMACS across multiple ensemble runs or running datasets with a higher number of atoms.

We recommend that you compare the price-to-performance ratio before choosing a processor with a higher CPU core frequency for your datasets, as processors with a higher number of lower-frequency cores may provide better total performance.


Multi-node study with 7713 64C processors

Figure 3: Multi-node study with 7713 64c SKUs

For multi-node tests, the test bed was configured with an NVIDIA Mellanox HDR interconnect running at 200 Gbps, and each server included an AMD EPYC 7713 processor. All cores in each server were used while running the benchmarks. We achieved the expected close-to-linear performance scalability for GROMACS on up to four nodes across all the dataset types.


Conclusion

For the various datasets we evaluated, GROMACS exhibited strong scaling and was compute intensive. We recommend a processor with a high core count for smaller datasets (water 1536, hec 1400); larger datasets (water 3072, ligno, HEC 3000) would benefit from more memory per core. Configuring the best BIOS options is important to get the best performance out of the system.

For more information and updates, follow this blog site.

PowerEdge WRF

WRF Performance with AMD EPYC 7003 Series processors On Dell EMC PowerEdge servers

Puneet Singh Joseph Stanfield Ashish Kumar Singh

Mon, 02 Aug 2021 21:17:28 -0000


The Weather Research and Forecasting (WRF) model is an open-source mesoscale weather prediction model that is predominantly used in a multi-compute node environment for atmospheric research and operational forecasts. This model performs well on the latest generation of the AMD EPYC 3rd Gen (7003 Series) processor family, code name Milan. In this blog, we highlight the performance improvement of the WRF application on the AMD Milan processors based on Dell EMC PowerEdge servers.

This blog follows up our first blog in this series, where we introduced the AMD Milan processor architecture, key BIOS tuning options, and baseline microbenchmark performance. We analyzed the performance improvement of the latest AMD EPYC Milan (7003 Series) processor-based Dell EMC PowerEdge servers compared to the second-generation AMD EPYC Rome (7002 Series) processor-based Dell EMC PowerEdge servers. The testbed hardware and software details are outlined in the following table: 

Table 1: Testbed hardware and software details

Server

Dell EMC PowerEdge 2-socket servers

(with AMD Milan Processors)

Dell EMC PowerEdge 2-socket servers

(with AMD Rome Processors)

Processor model

Cores/socket

Frequency (Base-Boost)

TDP
 L3 cache

Processor bus speed

7763 

64

2.45 GHz – 3.5 GHz

280 W

256 MB

16 GT/s 

7713 

64

2.0 GHz – 3.7 GHz

225 W

256 MB

16 GT/s 

7543 

32

2.8 GHz – 3.7 GHz

225 W

256 MB

16 GT/s 

7662 

64

2.0 GHz – 3.35 GHz

200 W

256 MB

16 GT/s

7542 

32

2.9 GHz – 3.4 GHz

225 W

128 MB

16 GT/s 

Operating system

Red Hat Enterprise Linux 8.3 (4.18.0-240.el8.x86_64)

Memory

DDR4 256G (16 GB x 16) 3200 MT/s

Interconnect

NVIDIA Mellanox HDR

BIOS/CPLD

2.2.5 / 1.1.12 (AMD 7763,AMD 7713,AMD 7543)

2.1.6 / 1.1.12 (AMD 7662)

2.1.5 / 0.10.3 (AMD 7542)

Applications

WRF v3.9.1.1, WRF v4.2.2 

Benchmark datasets

conus 2.5 km, new conus 2.5 km, wrf_large 3 km

 
 The following figure shows the domain for the tested datasets: 

     

Figure 1: Domain configuration for new conus 2.5 km, conus 2.5 km, and wrf_large datasets.

The following table provides a brief description of each dataset:

Table 2: Configuration for new conus 2.5 km, conus 2.5 km, and wrf_large datasets

Parameter | conus 2.5 km | new conus 2.5 km | wrf_large
Run hours | 3 | 3 | 2
Resolution (m) | 2500 | 2500 | 3000
Vertical layers | 35 | 35 | 50
Grid points | 1501 x 1201 | 1901 x 1301 | 1500 x 1500
interval_seconds | 10800 | 10800 | 21600

The results were measured by averaging the WRF computation time of each timestep from the rsl.error.0000 output file.
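As a minimal sketch of this measurement, the following Python snippet averages the per-timestep compute time from rsl.error.0000-style log text. It assumes the usual WRF log line format, "Timing for main: time <date> on domain <n>: <seconds> elapsed seconds". The function name and the sample log excerpt are illustrative and are not taken from the runs described in this blog.

```python
import re

# Matches lines such as:
# "Timing for main: time 2019-11-26_12:00:15 on domain   1:    1.23456 elapsed seconds"
TIMING_RE = re.compile(
    r"^\s*Timing for main: time \S+ on domain\s+\d+:\s+([0-9.]+) elapsed seconds"
)


def mean_step_time(log_text):
    """Return (mean seconds per timestep, number of timesteps) from rsl log text."""
    times = []
    for line in log_text.splitlines():
        match = TIMING_RE.match(line)
        if match:
            times.append(float(match.group(1)))
    if not times:
        raise ValueError("no 'Timing for main' lines found")
    return sum(times) / len(times), len(times)


# Hypothetical excerpt. Real files also contain "Timing for Writing ..." lines
# for output files, which the pattern above deliberately skips.
SAMPLE_LOG = """\
Timing for Writing wrfout_d01_2019-11-26_12:00:00 for domain        1:    4.00000 elapsed seconds
Timing for main: time 2019-11-26_12:00:15 on domain   1:    2.00000 elapsed seconds
Timing for main: time 2019-11-26_12:00:30 on domain   1:    1.00000 elapsed seconds
Timing for main: time 2019-11-26_12:00:45 on domain   1:    1.50000 elapsed seconds
"""

mean_seconds, steps = mean_step_time(SAMPLE_LOG)
print(f"{steps} timesteps, mean {mean_seconds:.3f} s/step")
```

In practice, the text would be read from the rsl.error.0000 file of a finished run instead of the inline sample.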

 

Single node performance

The following figures show the application performance for the datasets mentioned in Table 2. In each figure, the numbers over the bars represent the relative change in the application performance compared to the application performance obtained on the AMD 7542 Rome processor model. 

 
 

Figure 2: Relative difference in the performance of WRF by processor and dataset type mentioned in Table 1

WRF was compiled with the "dm + sm" configuration and all the available cores were subscribed during WRF simulation runs. To optimize performance, we tried different MPI process count, OpenMP thread count combinations and tiling schemes (WRF_NUM_TILES) options. For single-node tests, two MPI processes per Core Complex Die (CCD) deliver the best results for conus 2.5 km and new conus 2.5 km datasets. We used eight processes per CCD for the wrf_large dataset.
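The relationship between processes per CCD, MPI rank count, and OpenMP thread count can be sketched as follows. This is an illustration only: it assumes that every core is subscribed and that the 64-core processors are organized as eight CCDs of eight cores each. The function name and its parameters are hypothetical and are not part of WRF or any MPI library.

```python
def hybrid_layout(cores_per_socket, sockets, ccds_per_socket, procs_per_ccd):
    """Return (MPI ranks per node, OpenMP threads per rank) when every core is used."""
    cores_per_ccd, rem = divmod(cores_per_socket, ccds_per_socket)
    if rem:
        raise ValueError("cores must divide evenly across CCDs")
    threads_per_rank, rem = divmod(cores_per_ccd, procs_per_ccd)
    if rem or threads_per_rank == 0:
        raise ValueError("processes per CCD must divide the cores per CCD")
    ranks = sockets * ccds_per_socket * procs_per_ccd
    return ranks, threads_per_rank


# Assumed topology: 2 sockets, 64 cores per socket, 8 CCDs per socket.
for label, procs_per_ccd in (("conus 2.5 km", 2), ("wrf_large", 8)):
    ranks, threads = hybrid_layout(64, 2, 8, procs_per_ccd)
    print(f"{label}: {ranks} MPI ranks per node, {threads} OpenMP threads per rank")
```

Under these assumptions, two processes per CCD leaves four cores for the OpenMP threads of each rank, while eight processes per CCD amounts to one rank per core.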

Depending on the dataset, the AMD 7763 processor can deliver up to 14 percent better performance over the AMD 7543 processor. In the previous blog, we observed better performance improvements on the 32-core Milan processor model with memory bandwidth bound benchmarks like HPCG and STREAM. WRF is a memory bandwidth bound application and there is notable performance improvement in the 32-core processor model: the AMD 7543 delivers up to 26 percent better performance over the AMD 7542 processor.

From the performance that is shown in Figure 2 and the average power usage data that is shown in Figure 3, we noted that the AMD 7713 processor can deliver up to 58 percent better performance per watt than the AMD 7662 processor.

Figure 3: Power used by platform and processor type: average idle server power usage was 305 W (7542), 338 W (7662), 305 W (7543), 258 W (7713), and 272 W (7763)
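A performance-per-watt comparison of this kind divides each system's performance by its average power draw and then takes the ratio. The sketch below shows the arithmetic only. The input values are hypothetical placeholders, not measurements from this blog, and the function name is illustrative.

```python
def perf_per_watt_gain(perf_new, watts_new, perf_old, watts_old):
    """Relative gain in performance per watt of 'new' over 'old' (0.25 means 25 percent)."""
    return (perf_new / watts_new) / (perf_old / watts_old) - 1.0


# Hypothetical inputs: relative performance and average power in watts.
gain = perf_per_watt_gain(perf_new=1.2, watts_new=800.0, perf_old=1.0, watts_old=1000.0)
print(f"{gain:.0%} better performance per watt")
```

Any performance metric works here as long as higher is better and both systems use the same one, for example timesteps per second derived from the mean step time.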


Multi-node scalability

To evaluate the scalability of WRF, we used eight nodes. Each node is equipped with an AMD 7713 processor and interconnected using the NVIDIA Mellanox HDR interconnect. The nodes used for benchmarking were connected to the same HDR switch. Table 1 provides details about the server and software that were used for the test. The text on top of the line represents the relative change in the application performance (on 2, 4, and 8 nodes) with respect to the application performance obtained on the single node.


Figure 4: Multi-node performance of WRF on an AMD Milan 7713 processor for datasets listed in Table 1

The scalability numbers have been rounded off to a single digit. We observed good scalability with all the datasets listed in Table 1.
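For readers who want to reproduce this kind of scalability figure from their own runs, the following sketch computes speedup and parallel efficiency relative to the single-node result. The timings are hypothetical placeholders chosen for easy arithmetic, and the function name is illustrative.

```python
def scaling_table(step_time_by_nodes):
    """Map node count -> (speedup, parallel efficiency) relative to the smallest node count."""
    base_nodes = min(step_time_by_nodes)
    base_time = step_time_by_nodes[base_nodes]
    table = {}
    for nodes, step_time in sorted(step_time_by_nodes.items()):
        speedup = base_time / step_time
        efficiency = speedup / (nodes / base_nodes)
        table[nodes] = (speedup, efficiency)
    return table


# Hypothetical mean seconds per timestep on 1, 2, 4, and 8 nodes.
timings = {1: 8.0, 2: 4.0, 4: 2.5, 8: 1.25}
for nodes, (speedup, efficiency) in scaling_table(timings).items():
    print(f"{nodes} node(s): speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
```

An efficiency close to 100 percent at every node count corresponds to the near-linear scaling that a scalability plot aims to show.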


Conclusions and recommendations

WRF delivers better performance and performance per watt on AMD Milan processors. There is a significant performance improvement on the 32-core Milan processor model, and the WRF simulations scale well with the datasets described in this blog. However, the scalability might vary depending on the dataset being used and the node count being tested. Ensure that you test the impact of the tile size, process count, and threads per process before use.

We will continue to post new blogs on this site as updates arise.


NVIDIA PowerEdge GPU NAMD

Nanoscale Molecular Dynamics (NAMD) Performance with Dell EMC PowerEdge R750xa & NVIDIA A series GPUs

Kihoon Yoon

Thu, 22 Jul 2021 09:03:25 -0000


Overview

Over the past decade, GPUs have become popular in scientific computing because of their great ability to exploit a high degree of parallelism. NVIDIA has optimized life sciences applications to run on their general-purpose GPUs. Unfortunately, these GPUs can only be programmed with CUDA, OpenACC, or the OpenCL framework. Most of the life sciences community is not familiar with these frameworks, so few biologists or bioinformaticians can make efficient use of GPU architectures. However, GPUs have been making inroads into the molecular dynamics simulation (MDS) field since MD was developed in the 1950s. MDS requires heavy computational work to simulate biomolecular structures or their interactions.

 

In this blog, the performance of one popular MDS application, NAMD, is presented with various NVIDIA A-series GPUs such as the A100, the A10, the A30, and the A40. NAMD is a free and open-source parallel MD package designed for analyzing the physical movements of atoms and molecules.

 

Dell Technologies has released the new PowerEdge R750xa server, a GPU workload platform that is designed to support artificial intelligence, machine learning, and high-performance computing solutions. The dual socket/2U platform supports 3rd Gen Intel Xeon Scalable Processors (code named Ice Lake). It supports up to 40 cores per processor, has eight memory channels per CPU, and up to 32 DDR4 DIMMs at 3200 MT/s DIMM speed. This server can accommodate up to four double-width PCIe GPUs that are located in the front left and the front right of the server. The test server configurations are summarized in Table 1, and the specifications of tested NVIDIA GPUs are listed in Table 2.

 

Table 1: Tested compute node configuration

Test Beds

Server

Dell EMC PowerEdge R750xa

Dell EMC PowerEdge R740

CPU

Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30 GHz

Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40 GHz

Intel(R) Xeon(R) Gold 6248 CPU @ 2.50 GHz

NVIDIA GPUs

4 x A100

4 x A10

4 x A30

2 x A40

RAM

DDR4 1024 GB (32 x 32 GB) 3200 MT/s

DDR4 384 GB (24 x 16 GB) 2933 MT/s

Operating system

RHEL 8.3 (4.18.0-240.el8.x86_64)

Filesystem network

Mellanox InfiniBand HDR100

Filesystem

Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage

BIOS system profile

Performance Optimized

Logical processor

Disabled

Virtualization technology

Disabled

Cuda/Toolkit

11.2 

OpenMPI

4.1.1

NAMD

NAMD_Git-2021-04-01_Source


Table 2: Specifications of tested NVIDIA GPUs

NVIDIA GPUs

 

A100

A10

A30

A40

FP64 (TFLOPS)

9.7

Unknown

5.2

Unknown

FP64 Tensor Core (TFLOPS)

19.5

Unknown

10.3

Unknown

FP32 (TFLOPS)

19.5

31.2

10.3

37.4

Tensor Float 32 (TFLOPS)

156 | 312*

62.5 | 125*

82 | 165 *

74.8 | 149.6*

BFLOAT16 Tensor Core (TFLOPS)

312 | 624*

125 | 250*

165 | 330*

149.7 | 299.4*

FP16 Tensor Core (TFLOPS)

312 | 624*

125 | 250*

165 | 330*

149.7 | 299.4*

INT8 Tensor Core (TOPS)

624 | 1248*

250 | 500*

330 | 661*

299.3 | 598.6*

INT4 Tensor Core (TOPS)

Unknown

500 | 1,000*

661 | 1321*

598.7 | 1,197.4*

GPU memory

40 GB HBM2

24 GB GDDR6

24 GB HBM2

48 GB GDDR6

GPU memory bandwidth

1,555 GB/s

600 GB/s

933 GB/s

696 GB/s

Max Thermal Design Power (TDP)

400W

150W

165W

300W

Multi-Instance GPU

Up to 7 MIGs @ 5 GB

Unknown

4 GPU instances @ 6 GB each

2 GPU instances @ 12 GB each

1 GPU instance @ 24 GB

Unknown

Form factor

PCIe

Single-slot, full-height, full-length (FHFL)

Dual-slot, full-height, full-length (FHFL)

4.4" (H) x 10.5" (L) dual slot

Interconnect

PCIe Gen4: 

64 GB/s

PCIe Gen4: 

64 GB/s

PCIe Gen4: 

64 GB/s

 

PCIE Gen4 x 16 31.5 GB/s (bidirectional)

* With sparsity


Performance Evaluation

NAMD

NAMD was compiled from source code (NAMD_Git-2021-04-01_Source) using GCC 11.1 and CUDA 11.2. We used a test data set, the 1.06 million-atom system of Satellite Tobacco Mosaic Virus (STMV).

 

Figure 1 shows the performance of the four GPU types with the STMV dataset. The figures represent the performance changes in nanoseconds per day (ns/day) with various numbers of cores used with one, two, or four GPUs. The only valid comparison between GPUs is between the NVIDIA A100 and the A10, since those test systems were configured identically. Although the performance of NAMD is affected by the CPU clock speed, the tested systems do not differ significantly in CPU clock speed. The A10 is rated at three times the single-precision FLOPS of the A30, and the A10 performs better than the A30 on the two-GPU tests even with slightly slower CPUs. The A100 outperformed the A10 by roughly 25 percent and 16 percent on the single-GPU and two-GPU tests, respectively.

 

The results from the four-GPU tests in Figure 1 show similar performance for the different GPUs. This agrees well with our previous test results, which showed that NAMD does not scale beyond two GPUs. We can rule out the potential argument that the data size might be too small, since a 3 million-atom dataset, the HECBioSim 3000k-atom system, which is a pair of 1IVO and 1NQL hEGFR tetramers, shows similar or worse results (those results are not shown here).

 

Figure 1: NAMD performance with STMV, 1 million-atom system

As shown in Figure 1, when four GPUs were tested, all of the GPUs except the A40 reached ~9 ns/day. In terms of maximum performance, the A10 achieved the highest rate, 9.121 ns/day. However, these numbers are not true reflections of GPU performance because of the scalability limitations. Although all the four-GPU test results are similar, the A100 has better throughput than the other GPUs in the two-GPU test, as shown in Figure 2. It is also worth noting that the A10 and the A40 are not suitable for general-purpose computing because they lack double-precision support.

 

Figure 2 shows the performance comparisons among the different GPUs we tested in this study. Again, the A30 performed better than the A10 up to 16 cores. It is difficult to determine why the A30 does not perform as well with a large number of active CPU cores (20 and more).

 

Figure 2: STMV test results comparisons with two GPUs

Conclusion

The A100 shows a dominant performance and is the most capable card among the A-series GPUs. Although the A30 did not perform as well as the A10 in our test, it is another outstanding choice for versatile applications.

 

The A10 performed well compared to the A30, and it is the successor of the T4, which was the most cost-effective solution for specific applications such as genomics data analysis.

Since it is not possible to obtain accurate performance differences among the A-series GPUs from this study, further investigation is necessary to obtain a clear picture of these general-purpose GPUs.

Tuxedo Pipeline PowerEdge C6520

Tuxedo Pipeline Performance on Dell EMC PowerEdge C6520

Kihoon Yoon

Tue, 22 Jun 2021 14:54:03 -0000


Overview

Gene expression analysis is as important as identifying Single Nucleotide Polymorphism (SNP), insertion/deletion (indel), or chromosomal restructuring. Ultimately, all physiological and biochemical events depend on the final gene expression product, proteins. Although most mammals have an additional controlling layer before protein expression, knowing how many transcripts exist in a system helps to characterize the biochemical status of a cell. Ideally, technology should enable us to quantify all proteins in a cell, which would advance the progress of life science significantly; however, we are far from achieving this.

In this blog, we report the test results of one popular RNA-Seq data analysis pipeline known as the Tuxedo pipeline. The Tuxedo pipeline suite offers a set of tools for analyzing a variety of RNA-Seq data, including short-read mapping, identification of splice junctions, transcript and isoform detection, differential expression, visualizations, and quality control metrics. The tested workflow is a differentially expressed gene (DEG) analysis, and the detailed steps in the pipeline are shown in Figure 1. 

 

Figure 1: Updated Tuxedo Pipeline with Cuffquant Step

In this study, the performance of single nodes with 3rd Gen Intel® Xeon® Scalable Processors (Codename Ice Lake) and 2nd Generation Intel® Xeon® Scalable Processors (Codename Cascade Lake) on Dell EMC PowerEdge C6520 (liquid-cooled) servers and C6420 (air-cooled) servers was compared. The configurations of the test systems are summarized in Table 1.

Table 1: Tested compute node configuration

Dell EMC PowerEdge C6520 Liquid Cooled

CPU

Tested 3rd Gen Intel® Xeon® Scalable Processors:

2 x Intel® Xeon® Platinum 8358, 32 Cores, 2.60 GHz – 3.40 GHz Base-Boost, TDP 250W

2 x Intel® Xeon® Platinum 8352Y, 32 Cores, 2.20 GHz – 3.40 GHz Base-Boost, TDP 205W

 

Tested 2nd Generation Intel® Xeon® Scalable Processors:

2 x Intel® Xeon® Gold 6248, 20 Cores, 2.50 GHz – 3.90 GHz Base-Boost, TDP 150W on Dell EMC PowerEdge C6420 Air Cooled

RAM

DDR4 512 GB (16 x 32 GB) 3200 MT/s

Operating system

RHEL 8.3 (4.18.0-240.el8.x86_64)

Interconnect

Mellanox InfiniBand HDR100

File system

Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage

BIOS system profile

Performance Optimized

Logical processor

Disabled

Virtualization technology

Disabled

tophat

2.1.1

bowtie2

2.2.5

R

3.6

bioconductor-cummerbund

2.26.0

A performance study of the RNA-Seq pipeline is not trivial because the nature of the workflow requires input files that are non-identical but similar in size. Hence, 185 RNA-Seq paired-end read datasets were collected from a public data repository. All the read data files contain around 25 Million Fragments (MF) and have similar read lengths. The samples for a test are randomly selected from the pool of 185 paired-end read files. Although these test data do not have any biological meaning, their high level of noise puts the tests in a worst-case scenario.
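The random selection step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script used for the tests: the sample identifiers are hypothetical placeholders, and the fixed seed is added here only to make a given selection repeatable.

```python
import random

POOL_SIZE = 185  # number of paired-end read sets in the pool


def pick_samples(n_samples, seed=None):
    """Pick n_samples distinct sample IDs from the pool without replacement."""
    rng = random.Random(seed)
    pool = [f"sample_{i:03d}" for i in range(1, POOL_SIZE + 1)]  # hypothetical IDs
    return rng.sample(pool, n_samples)


# Draw an independent selection for each test size.
for n in (2, 4, 8):
    print(n, pick_samples(n, seed=n))
```

Sampling without replacement guarantees that no read set appears twice within one pipeline run, which matches the requirement for non-identical inputs of similar size.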

Performance Evaluation

Throughput Test – Single pipeline with more than two samples, biological and technical duplicates

Typical RNA-Seq studies consist of multiple samples, sometimes hundreds of different samples, for example, normal versus disease or untreated versus treated samples. These samples tend to have a high level of noise for biological reasons; hence, the analysis requires vigorous data preprocessing procedures.

We tested various numbers of samples (all different RNA-Seq data selected from the 185 paired-end read datasets) to see how much data can be processed by a single node. Typically, when the number of samples increases, the runtime of the Tuxedo Pipeline increases, as shown in Figure 2. Ice Lake CPUs show an overall runtime improvement of 10% or more compared to Cascade Lake 6248 CPUs.

Figure 2: Total runtime comparisons for various numbers of samples with a single compute node

Conclusion

Many additional tests are still required to obtain a better insight from Intel Ice Lake processors for the NGS data analysis area. Unfortunately, we could not push our tests over 8 samples due to the storage limitation. However, there seems to be plenty of room for a higher throughput processing of more than 8 samples together. 

NVIDIA PowerEdge

Molecular Dynamics Simulations with Dell EMC PowerEdge XE8545 Server and NVIDIA A100

Kihoon Yoon

Wed, 02 Jun 2021 19:37:48 -0000


Overview

Over the past decade, graphics processing units, or GPUs, have become popular in scientific computing because of their great ability to exploit a high degree of parallelism. NVIDIA has optimized a handful of life sciences applications to run on their general-purpose GPUs. Unfortunately, these GPUs can only be programmed with CUDA, OpenACC, or the OpenCL framework. Most members of the life sciences community are not familiar with these frameworks, and so few biologists or bioinformaticians can make efficient use of GPU architectures. However, GPUs have been making inroads into the molecular dynamics simulation (MDS) field since MD was developed in the 1950s. MDS requires heavy computational work to simulate biomolecular structures or their interactions.

In this blog, we tested two MDS applications, NAMD and LAMMPS, using the Dell EMC PowerEdge XE8545 server with NVIDIA A100 GPUs. Since the XE8545 server does not support the NVIDIA V100 GPU, we can only roughly estimate the performance boost with the A100 from our previous tests.

These two applications are free and open-source parallel MD packages designed for analyzing the physical movements of atoms and molecules.

The test server configuration is summarized in the following table.

Table 1. Tested compute node configuration

Dell EMC PowerEdge XE8545

CPU

2x 7713 (Milan), 64 Cores, 2.0 GHz – 3.7 GHz Base-Boost, TDP 225 W, 256 MB L3 Cache

RAM

DDR4 1024 GB (32 x 32 GB) 3200 MT/s

Operating system

RHEL 8.3 (4.18.0-240.el8.x86_64)

Filesystem network

Mellanox InfiniBand HDR100

Filesystem

Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage

BIOS system profile

Performance Optimized

Logical processor

Disabled

Virtualization technology

Disabled

Accelerator

4 x A100-40 GB SXM4

Cuda/Toolkit

11.2

OpenMPI

4.1.1

NAMD

NAMD_Git-2021-04-01_Source

LAMMPS

Stable version (29 Oct 2020)

Performance Evaluation

NAMD

Nanoscale Molecular Dynamics (NAMD) is open-source software for molecular dynamics simulation written using the Charm++ parallel programming model and is designed for high-performance simulation of large biomolecular systems.

NAMD was built with the NAMD_Git-2021-04-01_Source source code on GCC 11.1 and CUDA 11.2. For our tests, we used two sets of data: the 1.06 million-atom Satellite Tobacco Mosaic Virus (STMV) system, and the HECBioSim 3000k-atom system, which is a pair of 1IVO and 1NQL hEGFR tetramers.

Figure 1 shows the performance of 4x A100 GPUs with the STMV dataset. NAMD uses the ++p option to specify the number of worker threads; the recommended value is the total number of cores minus the total number of GPUs. However, the optimal number of cores for the Milan EPYC 7003 family of processors, such as the EPYC 7713 that is used in the testing system, does not follow the generic recommendation. It seems to be around 79 to 90 cores, and it depends on the data size. Close to 9 nanoseconds of simulation per day (ns/day) is a significant performance gain over the NVIDIA V100 tests that we ran previously. It is difficult to attribute the performance gain solely to the new A100 GPUs because the comparison of the 16 GB V100 on the Intel Skylake platform to the 40 GB A100 on the AMD Milan platform may not be valid.
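To make the gap between the generic recommendation and the observed optimum concrete, the short sketch below applies the "total cores minus total GPUs" rule to the tested configuration from Table 1. The function name is illustrative and is not part of NAMD.

```python
def recommended_worker_threads(total_cores, gpus):
    """Generic rule quoted above: worker threads = total cores - number of GPUs."""
    if gpus >= total_cores:
        raise ValueError("need more cores than GPUs")
    return total_cores - gpus


total_cores = 2 * 64  # two 64-core EPYC 7713 processors (Table 1)
gpus = 4              # four A100 GPUs (Table 1)
generic = recommended_worker_threads(total_cores, gpus)
observed_low, observed_high = 79, 90  # range reported in the text

print(f"generic rule: {generic} worker threads")
print(f"observed optimum: {observed_low} to {observed_high} worker threads")
```

The generic rule suggests 124 worker threads for this system, well above the observed range, which is why a sweep over thread counts is worthwhile for each dataset.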


Figure 1. Estimated simulation time per day with 4x NVIDIA A100 GPUs

The purpose of the additional test with 3 million-atom protein tetramers is to confirm that the STMV test results are not an artifact of the relatively small icosahedron structure of STMV and the partial simulation of assembly and disassembly processes. Figure 2 shows the nanosecond simulations per day plot for the 3000k-atom data. 2.1 ns/day seems to be close to the maximum performance with 64 cores.


Figure 2. Estimated simulation time per day with 4x NVIDIA A100 GPUs

LAMMPS

Large-scale Atomic/Molecular Massively Parallel Simulator, or LAMMPS, is a classical molecular dynamics code and has potentials for solid-state materials (metals and semiconductors), soft matter (biomolecules and polymers), and coarse-grained or mesoscopic systems. LAMMPS can model atoms, or can be used as a parallel particle simulator at the atomic, meso, or continuum scale. LAMMPS runs on single processors, or in parallel using message-passing techniques and spatial decomposition of the simulation domain. LAMMPS was built with GCC 11.1, OpenMPI 4.1.1, and CUDA 11.2 from the source. The 465k-atom system was selected from HECBioSim.

As shown in Figure 3, LAMMPS scales well over the number of A100s. With 4x A100 GPUs, an 8.4 ns/day simulation is achievable.


Figure 3. Estimated simulation time per day with various numbers of GPUs

Conclusion

Although it is not possible to compare the performance of the A100 and the V100 from this study, the Milan CPUs and A100 show a strong synergy between more cores with better and faster GPUs. Running NAMD and LAMMPS on the XE8545 with the A100 can deliver a better performance than a system with the V100.

NVIDIA AI PowerEdge machine learning HPC GPU

New Frontiers—Dell EMC PowerEdge R750xa Server with NVIDIA A100 GPUs

Deepthi Cherlopalle Frank Han Savith Pareek

Tue, 01 Jun 2021 20:08:09 -0000


Dell Technologies has released the new PowerEdge R750xa server, a GPU workload-based platform that is designed to support artificial intelligence, machine learning, and high-performance computing solutions. The dual socket/2U platform supports 3rd Gen Intel Xeon processors (code named Ice Lake). It supports up to 40 cores per processor, has eight memory channels per CPU, and up to 32 DDR4 DIMMs at 3200 MT/s DIMM speed. This server can accommodate up to four double-width PCIe GPUs that are located in the front left and the front right of the server.

Compared with the previous generation PowerEdge C4140 and PowerEdge R740 GPU platform options, the new PowerEdge R750xa server supports larger storage capacity, provides more flexible GPU offerings, and improves the thermal requirement.

 

 

Figure 1 PowerEdge R750xa server

The NVIDIA A100 GPUs are built on the NVIDIA Ampere architecture to enable double precision workloads. This blog evaluates the new PowerEdge R750xa server and compares its performance with the previous generation PowerEdge C4140 server.

The following table shows the specifications for the NVIDIA GPU that is discussed in this blog and compares the performance improvement from the previous generation.

Table 1 NVIDIA GPU specifications


PCIe 

Improvement

GPU name 

A100 

V100 

 

GPU architecture 

Ampere 

Volta 

-

GPU memory 

40 GB 

32 GB 

25%

GPU memory bandwidth 

1555 GB/s 

900 GB/s 

73%

Peak FP64 

9.7 TFLOPS 

7 TFLOPS 

39%

Peak FP64 Tensor Core 

19.5 TFLOPS 

N/A 

-

Peak FP32 

19.5 TFLOPS

14 TFLOPS

39%

Peak FP32 Tensor Core 

156 TFLOPS

312 TFLOPS*

N/A

-

Peak Mixed Precision

FP16 ops/ FP32

Accumulate

312 TFLOPS

624 TFLOPS*

125 TFLOPS

5x

GPU base clock 

765 MHz 

1230 MHz 

-

Peak INT8

624 TOPS

1,248 TOPS*

N/A

-

GPU Boost clock 

1410 MHz 

1380 MHz 

2.1%

NVLink speed 

600 GB/s 

N/A 

-

Maximum power consumption 

250 W 

250 W 

No change

*with sparsity
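The Improvement column can be checked with a one-line formula, (A100 - V100) / V100. The sketch below applies it to the rows of Table 1 that have values for both GPUs. The dictionary layout and function name are illustrative.

```python
def improvement_percent(new, old):
    """Percentage by which 'new' exceeds 'old'."""
    return (new - old) / old * 100.0


# (A100 value, V100 value) pairs taken from Table 1.
SPECS = {
    "GPU memory (GB)": (40, 32),
    "GPU memory bandwidth (GB/s)": (1555, 900),
    "Peak FP64 (TFLOPS)": (9.7, 7),
    "Peak FP32 (TFLOPS)": (19.5, 14),
    "GPU boost clock (MHz)": (1410, 1380),
}

for name, (a100, v100) in SPECS.items():
    print(f"{name}: {improvement_percent(a100, v100):.1f}%")
```

The same helper can be reused for any pair of specification values, provided both are expressed in the same unit.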

Test bed and applications

This blog quantifies the performance improvement of the GPUs with the new PowerEdge GPU platform.

We derived all results presented in this blog from a test bed in the Dell HPC & AI Innovation Lab that uses a single-node PowerEdge R750xa server. This section describes the test bed and the applications that were evaluated as part of the study. The following table provides test environment details:

Table 2 Server configuration

Component | Test Bed 1 | Test Bed 2
Server | Dell PowerEdge R750xa | Dell PowerEdge C4140 configuration M
Processor | Intel Xeon 8380 | Intel Xeon 6248
Memory | 32 x 16 GB @ 3200 MT/s | 16 x 16 GB @ 2933 MT/s
Operating system | Red Hat Enterprise Linux 8.3 | Red Hat Enterprise Linux 8.3
GPU | 4 x NVIDIA A100-PCIe-40 GB GPU | 4 x NVIDIA V100-PCIe-32 GB GPU

The following table provides information about the applications and benchmarks used:

Table 3 Benchmark and application details

Application | Domain | Version | Benchmark dataset
High-Performance Linpack | Floating point compute-intensive system benchmark | xhpl_cuda-11.0-dyn_mkl-static_ompi-4.0.4_gcc4.8.5_7-23-20 | Problem size is more than 95% of GPU memory
HPCG | Sparse matrix calculations | xhpcg-3.1_cuda_11_ompi-3.1 | 512 * 512 * 288
GROMACS | Molecular dynamics application | 2020 | Ligno Cellulose, Water 1536, Water 3072
LAMMPS | Molecular dynamics application | 29 October 2020 release | Lennard Jones

LAMMPS

Large-Scale Atomic/Molecular Massively Parallel simulator (LAMMPS) is distributed by Sandia National Labs and the US Department of Energy. LAMMPS is open-source code that has different accelerated models for performance on CPUs and GPUs. For our test, we compiled the binary using the KOKKOS package, which runs efficiently on GPUs.

Figure 2 LAMMPS Performance on PowerEdge R750xa and PowerEdge C4140 servers

With the newer generation GPUs, this application improves 2.4 times when comparing single GPU performance. The overall performance from a single server doubled with the PowerEdge R750xa server and NVIDIA A100 GPUs.

GROMACS

GROMACS is a free and open-source parallel molecular dynamics package designed for simulations of biochemical molecules such as proteins, lipids, and nucleic acids. It is used by a wide variety of researchers, particularly for biomolecular and chemistry simulations. GROMACS supports all the usual algorithms expected from modern molecular dynamics implementation. It is open-source software with the latest versions available under the GNU Lesser General Public License (LGPL).

Figure 3 GROMACS performance on PowerEdge C4140 and R750xa servers

With the newer generation GPUs, this application improved approximately 1.5 times across the datasets when comparing single GPU performance. The overall performance from a single server improved 1.5 times with a PowerEdge R750xa server and NVIDIA A100 GPUs.

High-Performance Linpack

High-Performance Linpack (HPL) needs no introduction in the HPC arena. It is a widely used standard benchmark in the industry.

 Figure 4 HPL Performance on the PowerEdge R750xa server with A100 GPU and PowerEdge C4140 server with V100 GPU

Figure 5 Power use of the HPL running on NVIDIA GPUs

From Figure 4 and Figure 5, the following results were observed: 

  • Performance: For each GPU count, the NVIDIA A100 GPU demonstrates twice the performance of the NVIDIA V100 GPU. Higher memory size, double precision FLOPS, and a newer architecture contribute to the improvement for the NVIDIA A100 GPU.
  • Scalability: The PowerEdge R750xa server with four NVIDIA A100-PCIe-40 GB GPUs delivers 3.6 times higher HPL performance compared to one NVIDIA A100-PCIe-40 GB GPU. The NVIDIA A100 GPUs scale well inside the PowerEdge R750xa server for the HPL benchmark.
  • Higher Rpeak: The HPL code on NVIDIA A100 GPUs uses the new double-precision Tensor cores. The theoretical peak for each GPU is 19.5 TFLOPS, as opposed to 9.7 TFLOPS.
  • Power: Figure 5 shows the power consumption of a complete HPL run with the PowerEdge R750xa server using four A100-PCIe GPUs. This result was measured with iDRAC commands, and the peak power consumption was observed as 2022 W. Based on our previous observations, we know that the PowerEdge C4140 server consumes approximately 1800 W of power.
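The scalability and Rpeak observations reduce to simple arithmetic, sketched below. The per-GPU peak values and the 3.6 times speedup come from the text above. The function names are illustrative, and the node-level Rpeak shown here counts the GPUs only, ignoring any CPU contribution, which is an assumption of this sketch.

```python
def scaling_efficiency(speedup, gpu_count):
    """Fraction of ideal linear scaling achieved with gpu_count GPUs."""
    return speedup / gpu_count


def node_rpeak_tflops(per_gpu_tflops, gpu_count):
    """GPU-only theoretical peak for one server, in TFLOPS."""
    return per_gpu_tflops * gpu_count


GPUS = 4
print(f"HPL scaling efficiency on {GPUS} GPUs: {scaling_efficiency(3.6, GPUS):.0%}")
print(f"Rpeak with FP64 Tensor Cores: {node_rpeak_tflops(19.5, GPUS):.1f} TFLOPS")
print(f"Rpeak without Tensor Cores: {node_rpeak_tflops(9.7, GPUS):.1f} TFLOPS")
```

Dividing a measured HPL result by the node Rpeak gives the HPL efficiency, so the choice of per-GPU peak (19.5 or 9.7 TFLOPS) changes the reported efficiency by roughly a factor of two.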

HPCG

Figure 6 Scaling GPU performance data for HPCG Benchmark

As discussed in other blogs, high performance conjugate gradient (HPCG) is another standard benchmark that tests the data access patterns of sparse matrix calculations. From the graph, we see that the HPCG benchmark scales well, resulting in a 1.6 times performance improvement over the previous generation PowerEdge C4140 server with an NVIDIA V100 GPU.

The approximately 73 percent improvement in memory bandwidth of the NVIDIA A100 GPU over the NVIDIA V100 GPU (Table 1) contributes to the performance improvement.

Conclusion

In this blog, we introduced the latest generation PowerEdge R750xa platform and discussed the performance improvement over the previous generation PowerEdge C4140 server. The PowerEdge R750xa server is a good option for customers looking for an Intel Xeon Scalable CPU-based platform powered with NVIDIA GPUs.

With the newer generation PowerEdge R750xa server and NVIDIA A100 GPUs, the applications discussed in this blog show significant performance improvement.

Next steps

In future blogs, we plan to evaluate NVLINK bridge support, which is another important feature of the PowerEdge R750xa server and NVIDIA A100 GPUs.

PowerEdge Intel Xeon

Processing Six Human 50x WGS per day with 3rd Gen Intel Xeon Scalable Processors

Kihoon Yoon

Mon, 24 May 2021 22:07:44 -0000

Overview

Intel® Xeon® Scalable Processors have delivered consistent and stable performance across many workload types. The new 3rd Generation Intel® Xeon® Scalable Processors, code-named Ice Lake, perform exceptionally well for a BWA-GATK pipeline. In this study, we tested two Ice Lake processors, the 8352Y and the 8358. Table 1 summarizes the test server configuration.

Table 1. Tested compute node configuration

| Component | Dell EMC PowerEdge C6520 |
| --- | --- |
| CPU | Tested 3rd Gen Intel® Xeon® Scalable Processors: |
| | 2x Intel® Xeon® Platinum 8352Y Processor, 32 cores, 2.20 GHz – 3.40 GHz Base-Boost, TDP 205 W, 48 MB L3 Cache |
| | 2x Intel® Xeon® Platinum 8358 Processor, 32 cores, 2.60 GHz – 3.40 GHz Base-Boost, TDP 250 W, 48 MB L3 Cache |
| RAM | DDR4 512G (32 GB x 12) 3200 MT/s |
| Operating system | RHEL 8.3 (4.18.0-240.22.1) |
| Filesystem network | NVIDIA Mellanox InfiniBand HDR100 |
| Filesystem | Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage |
| BIOS system profile | Performance Optimized |
| Logical processor | Disabled |
| Virtualization technology | Disabled |
| BWA | 0.7.15-r1140 |
| Sambamba | 0.7.0 |
| Samtools | 1.6 |
| GATK | 3.60-g89b7209 |

The test data was chosen from one of Illumina’s Platinum Genomes. The sample, ERR194161, was sequenced on an Illumina HiSeq 2000, was submitted by Illumina, and can be obtained from EMBL-EBI. The DNA identifier for this individual is NA12878. The description of the data on the linked website states that this sample has a >30x depth of coverage; the actual depth reaches ~53x.

Performance evaluation

Single sample performance

Table 2 summarizes the overall runtimes and the comparisons between each step for our 9-step BWA-GATK pipeline with a single sample.

The mapping and sorting step is the only step in Table 2 where we can peek at the true performance variation across different CPUs. Roughly estimated, the overall performance improvements from the 6248R (6248) to the 8352Y and 8358 are 3.8 (9.0) % and 4.8 (10.0) %, respectively. The test bed for the 6248R was a Dell EMC PowerEdge R640 server with 394 GB RAM and local storage, and the configuration details for the 6248 can be found from the embedded link.

The mapping and sorting step shows a decent ~36 % runtime reduction because BWA scales well with core count. The base recalibration step also takes advantage of the higher core count of the Ice Lake CPUs.

Table 2. BWA-GATK performance comparisons between Ice Lake and Cascade Lake

| Steps | 8352Y 32c 2.2 GHz | 8358 32c 2.6 GHz | 6248R 24c 3.0 GHz | 6248 20c 2.5 GHz |
| --- | --- | --- | --- | --- |
| Mapping and sorting | 3.23 (32) | 3.23 (32) | 5.04 (24) | 5.22 (20) |
| Mark duplicates | 1.16 (13) | 1.16 (13) | 1.14 (13) | 1.29 (13) |
| Generate realigning targets | 0.47 (32) | 0.46 (32) | 0.16 (24) | 0.42 (20) |
| Insertion and deletion realigning | 8.16 (1) | 7.97 (1) | 7.20 (1) | 7.87 (1) |
| Base recalibration | 2.06 (32) | 2.07 (32) | 2.41 (24) | 2.30 (20) |
| HaplotypeCaller | 8.01 (16) | 7.96 (16) | 8.06 (16) | 8.25 (16) |
| Genotype GVCFs | 0.01 (32) | 0.01 (32) | 0.01 (24) | 0.01 (20) |
| Variant recalibration | 0.20 (1) | 0.20 (1) | 0.19 (1) | 0.23 (1) |
| Apply variant recalibration | 0.01 (1) | 0.01 (1) | 0.01 (1) | 0.01 (1) |
| Total runtime (hours) | 23.32 | 23.07 | 24.23 | 25.61 |

Note: The number of cores used for the test is parenthesized.
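The percentages quoted for Table 2 can be checked from the runtimes themselves. The following is a minimal sketch, assuming that "improvement" means the relative reduction in runtime with respect to the older processor; the function name is illustrative.

```bash
#!/usr/bin/env bash
# Relative runtime reduction in percent, with the older CPU as baseline:
#   100 * (baseline - current) / baseline
runtime_reduction() {
    awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.1f\n", 100 * (base - cur) / base }'
}

# Runtimes in hours, taken from Table 2
echo "Mapping and sorting, 6248R -> 8352Y: $(runtime_reduction 5.04 3.23) %"
echo "Total runtime,       6248R -> 8352Y: $(runtime_reduction 24.23 23.32) %"
echo "Total runtime,       6248R -> 8358:  $(runtime_reduction 24.23 23.07) %"
```

With this definition, the mapping and sorting step comes out at about 36 %, and the total runtime at about 3.8 % and 4.8 % for the two Ice Lake processors relative to the 6248R.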

Multiple sample performances – throughput

A typical way of running an NGS pipeline is to process multiple samples on a compute node and use multiple compute nodes to maximize the throughput. However, this time the tests were performed on a single compute node due to the limited number of servers available at this moment. 

The current pipeline invokes many pipe operations in the first step to minimize the number of intermediate files written. Although this saves a day of runtime and lowers the storage usage significantly, the cost of invoking pipes is quite heavy and limits the number of samples that can be processed concurrently. Typically, a process fails silently when there are not enough resources left to start an additional process.

As shown in Table 3 for the 8352Y test, the maximum number of samples that can be processed simultaneously is around 14. Although a 14-sample test was not performed on this processor, 14 is likely the maximum because two pipelines failed in the 16-sample test. In other words, a throughput of ~6 genomes per day is achievable with the 8352Y. The 8358 also shows two failed processes when 16 samples were processed simultaneously, while its throughput reached ~7 genomes per day (Table 4).

Table 3. Throughput test for Intel® Xeon® Platinum 8352Y

Runtime with a varying number of samples

| Number of samples | 1 | 2 | 4 | 8 | 12 | 16 |
| --- | --- | --- | --- | --- | --- | --- |
| Number of samples failed | 0 | 0 | 0 | 0 | 0 | 2 |
| Mapping and sorting | 2.84 | 4.20 | 7.11 | 13.44 | 20.77 | 26.62 |
| Mark duplicates | 1.17 | 1.18 | 1.29 | 1.77 | 2.49 | 3.05 |
| Generate realigning targets | 0.46 | 0.51 | 0.52 | 0.77 | 1.09 | 1.25 |
| Insertion and deletion realigning | 7.94 | 8.04 | 8.02 | 8.00 | 8.26 | 8.11 |
| Base recalibration | 2.00 | 2.16 | 2.83 | 4.41 | 6.04 | 7.20 |
| HaplotypeCaller | 8.00 | 7.93 | 9.10 | 9.24 | 9.31 | 9.26 |
| Genotype GVCFs | 0.02 | 0.02 | 0.03 | 0.02 | 0.03 | 0.04 |
| Variant recalibration | 0.17 | 0.20 | 0.21 | 0.20 | 0.19 | 0.23 |
| Apply variant recalibration | 0.01 | 0.02 | 0.02 | 0.02 | 0.02 | 0.03 |
| Total runtime (hours) | 22.60 | 24.26 | 29.12 | 37.89 | 48.20 | 55.78 |
| Genomes per day | 1.06 | 1.98 | 3.30 | 5.07 | 5.98 | 6.02 |

Table 4. Throughput test for Intel® Xeon® Platinum 8358

Runtime with a varying number of samples

| Number of samples | 1 | 8 | 12 | 14 | 16 |
| --- | --- | --- | --- | --- | --- |
| Number of samples failed | 0 | 0 | 0 | 0 | 2 |
| Mapping and sorting | 2.67 | 11.79 | 18.26 | 22.84 | 24.34 |
| Mark duplicates | 1.16 | 1.51 | 2.18 | 2.59 | 2.65 |
| Generate realigning targets | 0.43 | 0.70 | 0.96 | 1.17 | 1.15 |
| Insertion and deletion realigning | 7.97 | 8.00 | 7.99 | 8.20 | 8.19 |
| Base recalibration | 1.94 | 4.05 | 5.65 | 6.47 | 6.56 |
| HaplotypeCaller | 8.00 | 8.21 | 8.22 | 8.24 | 8.25 |
| Genotype GVCFs | 0.02 | 0.03 | 0.03 | 0.03 | 0.02 |
| Variant recalibration | 0.18 | 0.25 | 0.14 | 0.30 | 0.30 |
| Apply variant recalibration | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 |
| Total runtime (hours) | 22.37 | 34.55 | 43.44 | 49.86 | 51.49 |
| Genomes per day | 1.07 | 5.56 | 6.63 | 6.74 | 6.53 |
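The "Genomes per day" rows in Table 3 and Table 4 are consistent with a simple relationship: the number of samples that completed successfully, scaled from the total runtime to a 24-hour day. The blog does not state this formula, so the sketch below is an inference from the reported values, and the function name is illustrative.

```bash
#!/usr/bin/env bash
# Throughput in genomes per day:
#   (samples started - samples failed) * 24 hours / total runtime in hours
genomes_per_day() {
    local samples=$1 failed=$2 total_hours=$3
    awk -v s="$samples" -v f="$failed" -v h="$total_hours" \
        'BEGIN { printf "%.2f\n", (s - f) * 24 / h }'
}

# 8352Y (Table 3): 12 samples, and 16 samples with 2 failures
echo "8352Y, 12 samples: $(genomes_per_day 12 0 48.20)"
echo "8352Y, 16 samples: $(genomes_per_day 16 2 55.78)"
# 8358 (Table 4): 14 samples
echo "8358,  14 samples: $(genomes_per_day 14 0 49.86)"
```

Counting only the successful samples explains why the 16-sample runs do not show a higher throughput than the 12-sample and 14-sample runs.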

Conclusion

The field of NGS data analysis has been moving fast in terms of data growth and data variations. The majority of the open-source applications in NGS data analysis are unable to take advantage of accelerator technology and do not scale well with the number of cores. It is time for users to think about how this problem can be tackled. One simple way to avoid it is data-level parallelization. Although deciding when to split the data is hard, it is tractable with careful interventions in an existing BWA-GATK pipeline, without diluting statistical power with a sheer number of data. If each smaller data chunk goes through an individual pipeline on its own core and the results are merged at the end, it could be possible to achieve better performance on a single sample. This performance gain could lead to higher throughput if the overall runtime reduces significantly.

Nonetheless, 3rd Generation Intel® Xeon® Scalable Processors, especially the 8352Y and 8358, are excellent choices for the highest variant calling analysis throughput and for single-sample analysis.

Intel PowerEdge HPC

Intel Ice Lake - BIOS Characterization for HPC

Puneet S Ashish KS Joseph Stanfield Tarun Singh Savitha Pareek

Mon, 24 May 2021 15:56:32 -0000

Intel recently announced the 3rd Generation Intel Xeon Scalable processors (code-named “Ice Lake”), which are based on a new 10 nm manufacturing process. This blog provides the new Ice Lake processor synthetic benchmark results and the recommended BIOS settings on Dell EMC PowerEdge servers.

Ice Lake processors offer a higher core count of up to 40 cores with a single Ice Lake 8380 processor. The Ice Lake processors have larger L3, L2, and L1 data cache than Intel’s second-generation Cascade Lake processors. These features are expected to improve performance of CPU-bound software applications. Table 1 shows the L1, L2, and L3 cache size on the 8380 processor model.

Ice Lake still supports the AVX-512 SIMD instructions, which allow for 32 DP FLOP/cycle. The upgraded Ultra Path Interconnect (UPI) link speed of 11.2 GT/s is expected to improve data movement between the sockets. In addition to core count and frequency, Ice Lake-based Dell EMC PowerEdge servers support DDR4-3200 MT/s DIMMs with eight memory channels per processor, which is expected to improve the performance of memory bandwidth-bound applications. Ice Lake processors now support up to 6 TB of memory per socket.
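These figures translate into theoretical per-socket peaks with simple arithmetic. The following is a minimal sketch, assuming the base frequency from the test bed table (sustained AVX-512 clocks are typically lower, so the result is an upper bound) and 8 bytes per transfer on each 64-bit DDR4 channel. The function names are illustrative.

```bash
#!/usr/bin/env bash
# Theoretical peak double-precision rate per socket, in GFLOPS:
#   cores x clock (GHz) x DP FLOP per cycle per core
peak_gflops() {
    awk -v c="$1" -v ghz="$2" -v fpc="$3" 'BEGIN { printf "%.1f\n", c * ghz * fpc }'
}

# Theoretical peak memory bandwidth per socket, in GB/s:
#   channels x transfer rate (MT/s) x 8 bytes per transfer / 1000
peak_mem_bw() {
    awk -v ch="$1" -v mts="$2" 'BEGIN { printf "%.1f\n", ch * mts * 8 / 1000 }'
}

# Core counts and base frequencies as listed for the 8380 and 6338 processors
echo "8380 peak DP: $(peak_gflops 40 2.30 32) GFLOPS per socket"
echo "6338 peak DP: $(peak_gflops 32 2.0 32) GFLOPS per socket"
echo "DDR4-3200, 8 channels: $(peak_mem_bw 8 3200) GB/s per socket"
```

Measured HPL and STREAM results are usually reported as a fraction of peaks like these, which is how an efficiency percentage is obtained.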

Instructions such as Vector CLMUL, VPMADD52, Vector AES, and GFNI Extensions have been optimized to improve use of vector registers. The performance of software applications in the cryptography domain is also expected to benefit. The Ice Lake processor also includes improvements to Intel Speed Select Technology (Intel SST). With Intel SST, a few cores from the total available cores can be operated at a higher base frequency, turbo frequency, or power. This blog does not address this feature.

Table 1: hwloc-ls and numactl -H command output on an Intel 8380 processor model-based server with Round Robin core enumeration (MadtCoreEnumeration) and SubNumaCluster (Sub-NUMA Cluster) set to 2-Way

hwloc-ls

numactl -H

Machine (247GB total)
  Package L#0 + L3 L#0 (60MB)
    Group0 L#0
      NUMANode L#0 (P#0 61GB)
      L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#4)
      L2 L#2 (1280KB) + L1d L#2 (48KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#8)
      L2 L#3 (1280KB) + L1d L#3 (48KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#12)
      L2 L#4 (1280KB) + L1d L#4 (48KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#16)
      L2 L#5 (1280KB) + L1d L#5 (48KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#20)
      L2 L#6 (1280KB) + L1d L#6 (48KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#24)
      L2 L#7 (1280KB) + L1d L#7 (48KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#28)
      L2 L#8 (1280KB) + L1d L#8 (48KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#32)
      L2 L#9 (1280KB) + L1d L#9 (48KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#36)
      L2 L#10 (1280KB) + L1d L#10 (48KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#40)
      L2 L#11 (1280KB) + L1d L#11 (48KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#44)
      L2 L#12 (1280KB) + L1d L#12 (48KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#48)
      L2 L#13 (1280KB) + L1d L#13 (48KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#52)
      L2 L#14 (1280KB) + L1d L#14 (48KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#56)
      L2 L#15 (1280KB) + L1d L#15 (48KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#60)
      L2 L#16 (1280KB) + L1d L#16 (48KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#64)
      L2 L#17 (1280KB) + L1d L#17 (48KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#68)
      L2 L#18 (1280KB) + L1d L#18 (48KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#72)
      L2 L#19 (1280KB) + L1d L#19 (48KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#76)
      HostBridge.
<snip>

BIOS options tested on Ice Lake processors

Table 2 provides the server details used for the performance tests. The following BIOS options were explored in the performance testing:

  • BIOS.ProcSettings.SubNumaCluster—Breaks up the LLC into disjoint clusters based on address range, with each cluster bound to a subset of the memory controllers in the system. It improves average latency to the LLC. Sub-NUMA Cluster (SNC) is disabled if NVDIMM-N is installed in the system.
  • BIOS.ProcSettings.DeadLineLlcAlloc—If enabled, fills in dead lines in LLC opportunistically.
  • BIOS.ProcSettings.LlcPrefetch—Enables and disables LLC Prefetch on all threads.
  • BIOS.ProcSettings.XptPrefetch—If enabled, enables the MS2IDI to take a read request that is being sent to the LLC and speculatively issue a copy of that read request to the memory controller.
  • BIOS.ProcSettings.UpiPrefetch—Starts the memory read early on the DDR bus. The UPI Rx path spawns a MemSpecRd to iMC directly.
  • BIOS.ProcSettings.DcuIpPrefetcher (Data Cache Unit IP Prefetcher)—Affects performance, depending on the application running on the server. This setting is recommended for High Performance Computing applications.
  • BIOS.ProcSettings.DcuStreamerPrefetcher (Data Cache Unit Streamer Prefetcher)—Affects performance, depending on the application running on the server. This setting is recommended for High Performance Computing applications.
  • BIOS.ProcSettings.ProcAdjCacheLine—When set to Enabled, optimizes the system for applications that require high utilization of sequential memory access. Disable this option for applications that require high utilization of random memory access.
  • BIOS.SysProfileSettings.SysProfile—Sets the System Profile to Performance Per Watt (DAPC), Performance Per Watt (OS), Performance, Workstation Performance, or Custom mode. When set to a mode other than Custom, the BIOS sets each option accordingly. When set to Custom, you can change setting of each option.
  • BIOS.ProcSettings.LogicalProc—Reports the logical processors. Each processor core supports up to two logical processors. When set to Enabled, the BIOS reports all logical processors. When set to Disabled, the BIOS only reports one logical processor per core. Generally, a higher processor count results in increased performance for most multithreaded workloads. The recommendation is to keep this option enabled. However, there are some floating point and scientific workloads, including HPC workloads, where disabling this feature might result in higher performance.

You can set the DeadLineLlcAlloc, LlcPrefetch, XptPrefetch, UpiPrefetch, DcuIpPrefetcher, DcuStreamerPrefetcher, ProcAdjCacheLine, and LogicalProc BIOS options to either Enabled or Disabled. You can set SubNumaCluster to either 2-Way or Disabled. The SysProfile setting can have five possible values: PerformanceOptimized, PerfPerWattOptimizedDapc, PerfPerWattOptimizedOs, PerfWorkStationOptimized, and Custom.

Table 2: Test bed hardware and software details 

| Component | Dell EMC PowerEdge R750 server | Dell EMC PowerEdge C6520 server | Dell EMC PowerEdge C6420 server | Dell EMC PowerEdge C6420 server |
| --- | --- | --- | --- | --- |
| OPN | 8380 | 6338 | 8280 | 6252 |
| Cores/Socket | 40 | 32 | 28 | 24 |
| Frequency (Base-Boost) | 2.30 – 3.40 GHz | 2.0 – 3.20 GHz | 2.70 – 4.0 GHz | 2.10 – 3.70 GHz |
| TDP | 270 W | 205 W | 205 W | 150 W |
| L3 Cache | 60 MB | 48 MB | 38.5 MB | 37.75 MB |
| Operating System | Red Hat Enterprise Linux 8.3 4.18.0-240.22.1.el8_3.x86_64 | Red Hat Enterprise Linux 8.3 4.18.0-240.22.1.el8_3.x86_64 | Red Hat Enterprise Linux 8.3 4.18.0-240.el8.x86_64 | Red Hat Enterprise Linux 8.3 4.18.0-240.el8.x86_64 |
| Memory | 16 GB x 16 (2Rx8) 3200 MT/s | 16 GB x 16 (2Rx8) 3200 MT/s | 16 GB x 12 (2Rx8) 2933 MT/s | 16 GB x 12 (2Rx8) 2933 MT/s |
| Interconnect | NVIDIA Mellanox HDR | NVIDIA Mellanox HDR | NVIDIA Mellanox HDR100 | NVIDIA Mellanox HDR100 |

BIOS/CPLD: 1.1.2/1.0.1

Compiler: Intel Parallel Studio 2020 (update 4)

Benchmark software:

  • HPL v2.3 (Parallel Studio 2020 update 4)
  • STREAM v5.10
  • HPCG v3.1 (Parallel Studio 2020 update 4)
  • OSU v5.7
  • WRF v3.9.1.1 (conus 2.5 km dataset)

The System Profile BIOS meta option sets a group of BIOS options (such as C1E, C States, and so on), each of which controls performance and power management settings, to particular values. It is also possible to set the options in this group individually to different values by using the Custom system profile.

Application performance results

Table 2 lists details about the software used for benchmarking the server. We used the precompiled HPL and HPCG binary files, which are part of the Intel Parallel Studio 2020 update 4 software bundle, for our tests. We compiled the WRF application with AVX2 support. WRF and HPCG issue many non-floating-point packed micro-operations (approximately 73 percent to 90 percent of the total packed micro-operations). They are memory-bound (and DRAM-bandwidth bound) workloads. HPL issues packed double-precision micro-operations and is a compute-bound workload.

After setting Sub-NUMA Cluster (BIOS.ProcSettings.SubNumaCluster) to 2-Way, Logical Processors (BIOS.ProcSettings.LogicalProc) to Disabled, and other settings (DeadLineLlcAlloc, LlcPrefetch, XptPrefetch, UpiPrefetch, DcuIpPrefetcher, DcuStreamerPrefetcher, ProcAdjCacheLine) to Enabled, we measured the impact of System Profile (BIOS.SysProfileSettings.SysProfile) BIOS parameters on application performance.
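As an illustration, the baseline configuration described above could be staged with the iDRAC `racadm set` command. The following is a hedged sketch that only prints the commands for review instead of running them. The attribute names are those listed in this blog; the exact value strings accepted by a given BIOS or iDRAC firmware (for example, `2-Way` or `PerformanceOptimized`) are assumptions here and should be checked on the target system. The function name is illustrative.

```bash
#!/usr/bin/env bash
# Dry run: print the racadm commands rather than executing them, so the
# list can be reviewed before it is applied to a server.
print_bios_commands() {
    local attr
    echo "racadm set BIOS.ProcSettings.SubNumaCluster 2-Way"
    echo "racadm set BIOS.ProcSettings.LogicalProc Disabled"
    for attr in DeadLineLlcAlloc LlcPrefetch XptPrefetch UpiPrefetch \
                DcuIpPrefetcher DcuStreamerPrefetcher ProcAdjCacheLine; do
        echo "racadm set BIOS.ProcSettings.${attr} Enabled"
    done
    # Assumed value string for the Performance Optimized profile
    echo "racadm set BIOS.SysProfileSettings.SysProfile PerformanceOptimized"
    # Staged BIOS changes are applied by a configuration job at the next reboot
    echo "racadm jobqueue create BIOS.Setup.1-1"
}

print_bios_commands
```

To test another System Profile, only the `SysProfile` line would change between runs while the other settings stay fixed.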

Figure 1 through Figure 4 show application performance. In each figure, the numbers over the bars represent the relative change in the application performance with respect to the application performance obtained on the Intel 6338 Ice Lake processor with the System Profile set to Performance Optimized (PO).

Note: In the figures, PO=PerformanceOptimized, PPWOD=PerfPerWattOptimizedDapc, PPWOO=PerfPerWattOptimizedOs, and PWSO=PerfWorkStationOptimized.

HPL Benchmark

 Figure 1: Relative difference in the performance of HPL by processor and Sysprofile setting

HPCG Benchmark

 Figure 2: Relative difference in the performance of HPCG by processor and Sysprofile setting

STREAM Benchmark

 Figure 3: Relative difference in the performance of STREAM by processor and Sysprofile setting

WRF Benchmark

Figure 4: Relative difference in the performance of WRF by processor and Sysprofile setting

We obtained the performance for the applications in Figure 2 through Figure 4 by fully subscribing all available cores. Depending on the processor model, we achieved 78 percent to 80 percent efficiency with HPL and STREAM benchmarks using the Performance Optimized profile.

Intel has extended the TDP of the Ice Lake processors with the top-end Intel 8380 processor at 270 W TDP. The following figure shows the power use on the systems with the applications listed in Table 2.


Note: In this figure, PO=PerformanceOptimized, PPWOD=PerfPerWattOptimizedDapc, PPWOO=PerfPerWattOptimizedOs and PWSO=PerfWorkStationOptimized

Figure 5: Power use by platform and processor type. With the Performance Optimized System Profile, average idle power usage was approximately 335 W on the PowerEdge C6520 server (Intel 6338 processor) and approximately 470 W on the PowerEdge R750 server (Intel 8380 processor).

When SNC is set to 2-Way, the system exposes four NUMA nodes. We tested the NUMA bandwidth, remote socket bandwidth, and local socket bandwidth using the STREAM TRIAD benchmark. In Figure 6, the CPU NUMA node is represented as c and the memory node is represented as m. As an example for NUMA bandwidth, the c0m0 (blue bars) test type represents the STREAM TRIAD test carried out between NUMA node 0 and memory node 0. Figure 6 shows the best bandwidth numbers obtained on varying the number of threads per test type.

Figure 6: Local and remote NUMA memory bandwidth.

Remote socket bandwidth numbers were measured between CPU nodes 0 and 1 and memory nodes 2 and 3. Local socket bandwidth numbers were measured between CPU nodes 0 and 1 and memory nodes 0 and 1. The following figure shows the performance numbers.

Figure 7: Local and remote processor bandwidth.
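One common way to run such c/m test types is to pin the benchmark with `numactl`, binding the CPUs to one NUMA node and the memory allocation to another. The following is a minimal sketch under that assumption, not the exact procedure used for the figures. It prints the commands rather than running them; the STREAM binary path and the thread count are hypothetical placeholders.

```bash
#!/usr/bin/env bash
# Dry run: print one numactl-wrapped STREAM command per test type.
# STREAM_BIN is a hypothetical path to an OpenMP build of STREAM.
STREAM_BIN=${STREAM_BIN:-./stream}

# Build the command for CPU node(s) "c" and memory node(s) "m".
stream_cmd() {
    local cpu_node=$1 mem_node=$2 threads=$3
    echo "c${cpu_node}m${mem_node}: OMP_NUM_THREADS=${threads} numactl --cpunodebind=${cpu_node} --membind=${mem_node} ${STREAM_BIN}"
}

# With SNC set to 2-Way on a two-socket server, the NUMA nodes are 0..3.
# 20 threads is a placeholder; the thread count was varied per test type.
for cpu in 0 1 2 3; do
    for mem in 0 1 2 3; do
        stream_cmd "$cpu" "$mem" 20
    done
done
```

Because `numactl` accepts comma-separated node lists, a socket-level test can be expressed the same way, for example `stream_cmd 0,1 2,3 40` for a remote socket measurement.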

Impact of BIOS options on application performance

We tested the impact of the DeadLineLlcAlloc, LlcPrefetch, XptPrefetch, UpiPrefetch, DcuIpPrefetcher, DcuStreamerPrefetcher, and ProcAdjCacheLine BIOS options with the Performance Optimized (PO) system profile. These BIOS options do not have a significant impact on the performance of the applications addressed in this blog. Therefore, we recommend that these options be set to Enabled.

Figure 8 and Figure 9 show the impact of the Sub-NUMA Cluster (SNC) BIOS option on the application performance. In each figure, the numbers over the bars represent the relative change in the application performance with respect to the application performance obtained on the Intel 6338 Ice Lake processor with the SNC feature set to Disabled.

Figure 8: HPL and HPCG performance variation by processor model with Sub-NUMA Cluster set to Disabled (SNC=D) and 2-Way (SNC=2W)

Figure 9: STREAM and WRF performance variation by processor model with Sub-NUMA Cluster set to Disabled (SNC=D) and 2-Way (SNC=2W)

The SubNumaCluster option can impact applications that are memory bandwidth-bound (for example, STREAM, HPCG, and WRF). We recommend setting the SubNumaCluster option to 2-Way because it can improve the performance of the workloads addressed in this blog by one percent to six percent, depending on the processor model and application.

InfiniBand bandwidth and message rate

The Ice Lake-based processors now support PCIe Gen 4, which allows the NVIDIA Mellanox HDR adapter cards to be used with Dell EMC PowerEdge servers. Figure 10, Figure 11, and Figure 12 show the unidirectional bandwidth, bi-directional bandwidth, and message rate test results of the OSU Benchmarks suite. The network adapter card was connected to the second socket (NUMA node 2); therefore, the local bandwidth tests were carried out with processes bound to NUMA node 2. The remote bandwidth tests were carried out with processes bound to NUMA node 0. In Figure 10 and Figure 11, the numbers in red over the orange bars represent the percentage difference between local and remote bandwidth performance numbers.

Figure 10: OSU Benchmark unidirectional bandwidth test on two servers with Intel 8380 processors and NVIDIA Mellanox HDR InfiniBand

 

Figure 11: OSU Benchmark bi-directional bandwidth test on two servers with Intel 8380 processors and NVIDIA Mellanox HDR InfiniBand

 

Figure 12: Interconnect bandwidth and message rate performance obtained between two servers having Intel 8380 processors with OSU Benchmark

On two nodes connected using the NVIDIA Mellanox ConnectX-6 HDR InfiniBand adapter cards, we achieved approximately 25 GB/s unidirectional bandwidth and a message rate of approximately 200 million messages/second—almost double the performance numbers obtained on the NVIDIA Mellanox HDR100 card.
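For context, the measured bandwidth can be compared with the nominal link rates, which are 200 Gb/s for HDR and 100 Gb/s for HDR100. The following is a minimal sketch of the unit conversion, ignoring encoding and protocol overhead; the function name is illustrative.

```bash
#!/usr/bin/env bash
# Convert a nominal link rate in gigabits per second to gigabytes per second
# (8 bits per byte), ignoring encoding and protocol overhead.
gbits_to_gbytes() {
    awk -v g="$1" 'BEGIN { printf "%.1f\n", g / 8 }'
}

echo "HDR    (200 Gb/s): $(gbits_to_gbytes 200) GB/s"
echo "HDR100 (100 Gb/s): $(gbits_to_gbytes 100) GB/s"
```

A nominal 25 GB/s for HDR, against 12.5 GB/s for HDR100, matches the roughly twofold difference reported above.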

Comparison with Cascade Lake processors

Based on the compute resource availability in our Dell EMC HPC & AI Innovation Lab, we selected the Cascade Lake processor-based servers and benchmarked them with the software listed in Table 2. Figure 13 through Figure 16 show performance results from the Intel Ice Lake and Cascade Lake processors. The numbers over the bars represent the relative change in the application performance with respect to the application performance obtained on the Intel 6252 Cascade Lake processor.

Figure 13: HPL performance on processors listed in Table 2

 

Figure 14: HPCG performance on processors listed in Table 2

 

Figure 15: STREAM TRIAD test performance on Processors listed in Table 2 

 

Figure 16: WRF performance on Processors listed in Table 2

Ice Lake delivers approximately 38 percent better performance than Cascade Lake with HPL on the top-end processor model. The memory bandwidth-bound benchmarks such as HPCG and STREAM (see Figure 14 and Figure 15) delivered 42 percent to 43 percent performance improvement over the top-end Cascade Lake processors addressed in this blog.

The average real-time power usage of the Dell EMC PowerEdge platforms (listed in Table 2) was measured with the synthetic benchmarks listed in this blog. Figure 17 compares the power usage data from the Cascade Lake and Ice Lake platforms. The number over each bar represents the relative change in power with respect to the base power measured (Intel 6252 processor in the idle state).

Figure 17: Average power usage during benchmark runs on Dell EMC PowerEdge servers (see details in Table 2)

Considering the data with the Performance Optimized profile with the respective power measurement, the applications (depending on the processor model) were unable to deliver better performance per watt on the Ice Lake platform when compared to the Cascade Lake platform.

Summary and future work

The Ice Lake processor-based Dell EMC PowerEdge servers, with notable hardware feature upgrades over Cascade Lake, show up to 47 percent performance gain for all the HPC benchmarks addressed in this blog. Hyper-threading should be Disabled for the benchmarks addressed in this blog; for other workloads, the option should be tested and enabled as appropriate. Watch this space for subsequent blogs that describe application performance studies on our new Ice Lake processor-based cluster.


containers HPC Omnia

Containerized HPC Workloads Made Easy with Omnia and Singularity

John Lockman Luke Wilson PhD

Wed, 19 May 2021 18:46:17 -0000

Maximizing application performance and system utilization has always been important for HPC users.  The libraries, compilers, and applications found on these systems are the result of heroic efforts by HPC system administrators and teams of HPC specialists who fine tune, test, and maintain optimal builds of complex hierarchies of software for users. Fortunately for both researchers and administrators, some of that burden can be relieved with the use of containers, where software solutions can be built to run reliably when moved from one computing environment to another. This includes moving from one research lab to another, or from the developer’s laptop to a production lab, or even from an on-prem data center to the cloud. 

Singularity has provided HPC system administrators and users a way to take advantage of application containerization while running on batch-scheduled systems. Singularity is a container runtime that can build containers in its own native format, as well as execute any OCI-compatible container. By default, Singularity enforces security restrictions on containers by running in user space and can preserve user identification when run through batch schedulers, providing a simple method to deploy containerized workloads on multi-user HPC environments.

The goal of Omnia is to capture best practices for HPC system deployment and use, and we recognize that those practices vary across industry and research institutions. Omnia is developed with the entire community in mind, and we aim to provide tools that help its members be productive. To this end, we recently included Singularity as an automatically installed package when deploying Slurm clusters with Omnia.

Building a Singularity-enabled cluster with Omnia

Installing a Slurm cluster with Omnia and running a Singularity job is simple. We provide a repository of Ansible playbooks that configure a pile of metal or cloud resources into a ready-to-use Slurm cluster, either by applying the Slurm role in AWX or by applying the playbook on the command line.

ansible-playbook -i inventory omnia.yaml --skip-tags kubernetes

Once the playbook has completed, users are presented with a fully functional Slurm cluster with Singularity installed. We can run a simple “hello world” example, using containers directly from Singularity Hub. Here is an example Slurm submission script to run the “Hello World” example.

#!/bin/bash
#SBATCH -J singularity_test
#SBATCH -o singularity_test.out.%J
#SBATCH -e singularity_test.err.%J
#SBATCH -t 0-00:10
#SBATCH -N 1
# pull example Singularity container
singularity pull --name hello-world.sif shub://vsoch/hello-world
# execute Singularity container
singularity exec hello-world.sif cat /etc/os-release

Executing HPC applications without installing software

The “hello world” example is great but doesn’t demonstrate running real HPC codes. Fortunately, several hardware vendors have begun to publish containers for both HPC and AI workloads, such as Intel's oneContainer and Nvidia's NGC. Nvidia NGC is a catalog of GPU-accelerated software arranged in collections, containers, and Helm charts. This free-to-use repository has the latest builds of popular software used for deep learning and simulation, with optimizations for Nvidia GPU systems. With Singularity, we can take advantage of the NGC containers on our bare-metal Slurm cluster. Starting with the LAMMPS example on the NGC website, we demonstrate how to run a standard Lennard-Jones 3D melt experiment without having to compile all the libraries and executables.

The input file for running this benchmark, in.lj.txt, can be downloaded from the Sandia National Laboratories site:

wget https://lammps.sandia.gov/inputs/in.lj.txt

Next, make a local copy of the LAMMPS container from NGC and name it lammps.sif:

singularity build lammps.sif docker://nvcr.io/hpc/lammps:29Oct2020

This example can be executed directly from the command line using srun. This example runs 8 tasks on 2 nodes with a total of 8 GPUs:

srun --mpi=pmi2 -N2 --ntasks=8 --ntasks-per-socket=2 singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd lammps.sif lmp -k on g 4 -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 -var x 8 -var y 8 -var z 8 -in /host_pwd/in.lj.txt

Alternatively, the following example Slurm submission script will permit batch execution with the same parameters as above, 8 tasks on 2 nodes with a total of 8 GPUs:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-socket=2
#SBATCH --time 00:10:00
set -e; set -o pipefail

# Build SIF, if it doesn't exist
if [[ ! -f lammps.sif ]]; then
    singularity build lammps.sif docker://nvcr.io/hpc/lammps:29Oct2020
fi
# One GPU per task: GPUs per node = total tasks / number of nodes
readonly gpus_per_node=$(( SLURM_NTASKS / SLURM_JOB_NUM_NODES ))
echo "Running Lennard Jones 8x8x8 example on ${SLURM_NTASKS} GPUS..."

srun --mpi=pmi2 \
singularity run --nv -B "${PWD}":/host_pwd --pwd /host_pwd lammps.sif lmp -k on g ${gpus_per_node} -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 -var x 8 -var y 8 -var z 8 -in /host_pwd/in.lj.txt

Containers provide a simple solution to the complex task of building optimized software to run anywhere. Researchers are no longer required to build software themselves or wait for a release to be made available at the site where they are running. Whether running on a workstation, laptop, on-prem HPC resource, or cloud environment, they can be sure they are using the same optimized version for every run.

Omnia is an open source project that makes it easy to set up a Slurm or Kubernetes environment. When we combine the simplicity of Omnia for system deployment with Nvidia NGC containers for optimized software, both researchers and system administrators can concentrate on what matters most: getting results faster.

Learn more

Learn more about Singularity containers at https://sylabs.io/singularity/. Omnia is available for download at https://github.com/dellhpc/omnia



PowerEdge HPC AMD

Tuxedo Pipeline Performance on Dell EMC PowerEdge R6525

Kihoon Yoon

Tue, 27 Apr 2021 03:48:30 -0000

Overview

Gene expression analysis is as important as identifying Single Nucleotide Polymorphism (SNP), insertion/deletion (indel), or chromosomal restructuring. Ultimately, all physiological and biochemical events depend on the final gene expression products, proteins. Although most mammals have an additional controlling layer before protein expression, knowing how many transcripts exist in a system helps to characterize the biochemical status of a cell. Ideally, technology would enable us to quantify all proteins in a cell, which would significantly advance the progress of Life Science. However, we are far from achieving this.  

This blog provides the test results of one popular RNA-Seq data analysis pipeline known as the Tuxedo pipeline (1). The Tuxedo pipeline suite offers a set of tools for analyzing a variety of RNA-Seq data, including short-read mapping, identification of splice junctions, transcript, and isoform detection, differential expression, visualizations, and quality control metrics. The tested workflow is a differentially expressed gene (DEG) analysis, and the detailed steps in the pipeline are shown in Figure 1.

Figure 1.  Updated tuxedo pipeline with cuffquant step

We carried out a single-node study with AMD EPYC 7002 series (Rome) and AMD EPYC 7003 series (Milan) processors on the Dell EMC PowerEdge R6525 server. The configurations of the test system are summarized in Table 1.

Table 1.  Tested compute node configuration

Dell EMC PowerEdge R6525
CPU

Tested AMD Milan:

2x 7763 (Milan), 64 Cores, 2.45 GHz – 3.5 GHz Base-Boost, TDP 280 W, 256 MB L3 Cache

2x 7713 (Milan), 64 Cores, 2.0 GHz – 3.7 GHz Base-Boost, TDP 225 W, 256 MB L3 Cache

7543 (Milan), 32 Cores, 2.8 GHz – 3.7 GHz Base-Boost, TDP 225 W, 256 MB L3 Cache

 

Tested AMD Rome:

7702 (Rome), 64 Cores, 2.0 GHz – 3.35 GHz Base-Boost, TDP 200 W, 256 MB L3 Cache
RAM: DDR4 256 GB (16 Gb x 16) 3200 MT/s
Operating system: RHEL 8.3 (4.18.0-240.el8.x86_64)
Interconnect: Mellanox InfiniBand HDR100
Filesystem: Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage
BIOS system profile: Performance optimized
Logical processor: Disabled
Virtualization technology: Disabled
tophat: 2.1.1
bowtie2: 2.2.5
R: 3.6
bioconductor cummerbund: 2.26.0

A performance study of an RNA-Seq pipeline is not trivial because the workflow requires input files that are non-identical yet similar in size. Hence, 185 RNA-Seq paired-end read datasets were collected from a public data repository. All the read data files contain around 25 Million Fragments (MF) and have similar read lengths. The samples for each test are randomly selected from the pool of 185 paired-end read files. Although these test data have no biological meaning, their high level of noise puts the tests in a worst-case scenario.

Performance Evaluation

Throughput test - Single pipeline with more than two samples, biological, and technical duplicates

Typical RNA-Seq studies consist of multiple samples, sometimes hundreds of different samples: normal versus disease, or untreated versus treated. These samples tend to have a high level of noise for biological reasons; hence, the analysis requires a rigorous data preprocessing procedure.

To see how much data a single node can process, we ran the pipeline with varying numbers of samples, each selected from the 185 paired-end read dataset. Typically, the runtime of the Tuxedo pipeline increases with the number of samples. However, as shown in the figure below, the runtimes of the two-sample tests on the 7713 are higher than the runtimes of the four-sample tests. The standard error from five repeated runs does not overlap with the four- and eight-sample results. Interference from other users may cause this large variance. The current testing environment, especially a shared file system designed for large capacity, is not ideal for a Next Generation Sequencing (NGS) data analysis benchmark.

Figure 2.  Runtime comparisons among various AMD EPYC 7003 Series processors. Standard error is estimated from the sample standard deviation (STDEV.S function in Excel)
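As a minimal sketch of how the error bars in Figure 2 could be derived, the snippet below computes the standard error of the mean from a sample standard deviation (the n - 1 estimator, which is what Excel's STDEV.S uses). The runtimes in `example_runs` are hypothetical placeholders, not measurements from this study.

```python
import math
import statistics

def standard_error(runtimes):
    """Standard error of the mean, using the sample (n - 1) standard deviation."""
    if len(runtimes) < 2:
        raise ValueError("need at least two repeated runs")
    sample_sd = statistics.stdev(runtimes)  # sample standard deviation
    return sample_sd / math.sqrt(len(runtimes))

# Hypothetical runtimes (hours) from five repeated runs; illustrative only.
example_runs = [10.0, 12.0, 11.0, 13.0, 9.0]
se = standard_error(example_runs)
print(f"mean = {statistics.mean(example_runs):.2f} h, standard error = {se:.3f} h")
```

Two configurations whose mean plus or minus one standard error do not overlap are the ones described in the text as non-overlapping.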

The eight-sample test results show that the AMD Milan processors perform better than the Rome processor (7702) under a higher workload.

Conclusion

Many more tests are required to gain better insight into the AMD Milan processors for NGS data analysis. Unfortunately, the tests could not exceed eight samples due to storage limitations. However, there seems to be plenty of room for higher throughput by processing more than eight samples together. In the eight-sample tests for the Tuxedo pipeline shown in Figure 2, the AMD Milan 7763 performed 20% better than the AMD Rome 7702, and the AMD Milan 7713 performed 18% better.

NVIDIA PowerEdge

Accelerating HPC Workloads with NVIDIA A100 NVLink on Dell PowerEdge XE8545

Savitha Pareek Deepthi Cherlopalle Frank Han

Wed, 07 Apr 2021 18:02:56 -0000


NVIDIA A100 GPU

Three years after launching the Tesla V100 GPU, NVIDIA recently announced its latest data center GPU, the A100, built on the Ampere architecture. The A100 is available in two form factors, PCIe and SXM4, allowing GPU-to-GPU communication over PCIe or NVLink. The NVLink version is also known as the A100 SXM4 GPU and is available on the HGX A100 server board.

As you’d expect, the Innovation Lab tested the performance of the A100 GPU in a new platform. The new PowerEdge XE8545 4U server from Dell Technologies supports these GPUs with the NVLink SXM4 form factor and dual-socket AMD 3rd generation EPYC CPUs (codename Milan). This platform supports PCIe Gen 4 speed, up to 10 local drives, and up to 16 DIMM slots running at 3200 MT/s. Milan CPUs are available with up to 64 physical cores per CPU. 

The PCIe version of the A100 can be housed in the PowerEdge R7525, which also supports AMD EPYC CPUs, up to 24 drives, and up to 16 DIMM slots running at 3200MT/s.  This blog compares the performance of the A100-PCIe system to the A100-SXM4 system. 

Figure 1: PowerEdge XE8545 Server

A previous blog discussed the performance of the NVIDIA A100-PCIe GPU compared to its predecessor NVIDIA Tesla V100-PCIe GPU in the PowerEdge R7525 platform. 

The following table shows the specifications of the NVIDIA A100 and V100 GPUs.

Table 1: NVIDIA A100 and V100 GPUs with PCIe and SXM4 form factors

Form factor | PCIe | PCIe | SXM (NVIDIA NVLink) | SXM (NVIDIA NVLink)
Type of NVIDIA | A100 | V100 | A100 | V100
GPU architecture | Ampere | Volta | Ampere | Volta
GPU memory | 40 GB | 32 GB | 40 GB | 32 GB
GPU memory bandwidth | 1555 GB/s | 900 GB/s | 1555 GB/s | 900 GB/s
Peak FP64 | 9.7 TFLOPS | 7 TFLOPS | 9.7 TFLOPS | 7.8 TFLOPS
Peak FP64 Tensor Core | 19.5 TFLOPS | N/A | 19.5 TFLOPS | N/A
GPU base clock | 765 MHz | 1230 MHz | 1095 MHz | 1290 MHz
GPU boost clock | 1410 MHz | 1380 MHz | 1410 MHz | 1530 MHz
NVLink speed | 600 GB/s | N/A | 600 GB/s | 300 GB/s
Max power consumption | 250 W | 250 W | 400 W | 300 W

From Table 1, we see that the A100 offers about 73 percent more memory bandwidth (1555 GB/s versus 900 GB/s) and roughly 24 to 39 percent higher double-precision FLOPS than the Tesla V100 GPU. While the A100-PCIe GPU consumes the same amount of power as the V100-PCIe GPU, the NVLink version of the A100 GPU consumes 33 percent more power than the corresponding V100 GPU (400 W versus 300 W).
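Percentages derived from Table 1 depend on which GPU is taken as the baseline. The following sketch uses only the Table 1 values and reports each difference both ways; the helper names are ours and purely illustrative.

```python
def percent_more(new, old):
    """Increase of `new` over `old`, relative to the older part (old is the baseline)."""
    return (new - old) / old * 100.0

def percent_of_new(new, old):
    """The same gap expressed relative to the newer part (new is the baseline)."""
    return (new - old) / new * 100.0

# (A100 value, V100 value) pairs taken from Table 1.
table1 = {
    "memory bandwidth (GB/s)": (1555, 900),
    "peak FP64, PCIe (TFLOPS)": (9.7, 7.0),
    "peak FP64, SXM (TFLOPS)": (9.7, 7.8),
    "max power, SXM (W)": (400, 300),
}

for name, (a100, v100) in table1.items():
    print(f"{name}: +{percent_more(a100, v100):.1f}% vs V100 baseline, "
          f"{percent_of_new(a100, v100):.1f}% of the A100 value")
```

With the V100 as the baseline, the memory bandwidth gain is about 73 percent; the same gap is about 42 percent of the A100 figure.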

How are the GPUs connected in the PowerEdge servers?

An understanding of the server architecture is helpful in determining the behavior of any application. The PowerEdge XE8545 server is an accelerator-optimized server with four A100-SXM4 GPUs connected with third-generation NVLink, as shown in the following figure.

Figure 2:  PowerEdge XE8545 CPU-GPU connectivity                     

In the A100 GPU, each NVLink link carries four signal pairs at 50 Gbit/s in each direction (4 x 50 Gbit/s). The total number of NVLink links increases from six in the V100 GPU to 12 in the A100 GPU, now yielding 600 GB/s total. Workloads that can take advantage of the higher GPU-to-GPU communication bandwidth can benefit from the NVLink links in the PowerEdge XE8545 server.
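The 600 GB/s figure can be checked with a little arithmetic. This sketch assumes that the quoted rate means four 50 Gbit/s signal pairs per direction on each of the 12 NVLink connections, and that the total counts both directions; the function name is hypothetical.

```python
def nvlink_total_gbytes_per_s(links, pairs_per_link=4, gbit_per_pair=50):
    """Aggregate bidirectional NVLink bandwidth in GB/s."""
    per_direction_gbit = links * pairs_per_link * gbit_per_pair
    bidirectional_gbit = 2 * per_direction_gbit  # count both directions
    return bidirectional_gbit / 8                # 8 bits per byte

a100_total = nvlink_total_gbytes_per_s(links=12)
print(f"A100 NVLink aggregate bandwidth: {a100_total:.0f} GB/s")
```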

As shown in the following figure, the PowerEdge R7525 server can accommodate up to three PCIe-based GPUs; however, the configuration chosen for this evaluation used two A100-PCIe GPUs. With this option, the GPU-to-GPU communication must flow through the AMD Infinity Fabric inter-CPU links.

Figure 3:  PowerEdge R7525 CPU-GPU connectivity

Testbed details

The following table shows the tested configuration details: 

Table 2:  Test bed configuration details

Server | PowerEdge XE8545 | PowerEdge R7525
Processor | Dual AMD EPYC 7713, 64C, 2.8 GHz
Memory | 512 GB (16 x 32 GB @ 3200 MT/s) | 1024 GB (16 x 64 GB @ 3200 MT/s)
Height of system | 4U | 2U
GPUs | 4 x NVIDIA A100 SXM4 40 GB | 2 x NVIDIA A100 PCIe 40 GB
Operating system | Red Hat Enterprise Linux release 8.3 (Ootpa)
Kernel | 4.18.0-240.el8.x86_64
BIOS settings | Sysprofile=PerfOptimized, LogicalProcessor=Disabled, NumaNodesPerSocket=4
CUDA Driver | 450.51.05
CUDA Toolkit | 11.1
GCC | 9.2.0
MPI | OpenMPI - 4.0

The following table lists the versions of the HPC applications that were used for the benchmark evaluation:

Table 3: HPC Applications used for the evaluation

Benchmark | Details
HPL | xhpl_cuda-11.0-dyn_mkl-static_ompi-4.0.4_gcc4.8.5_7-23-20
HPCG | xhpcg-3.1_cuda_11_ompi-3.1
GROMACS | v2021
NAMD | Git-2021-03-02_Source
LAMMPS | 29Oct2020 release

Benchmark evaluation

High Performance Linpack

High Performance Linpack (HPL) is a standard HPC system benchmark that is used to measure the computing power of a server or cluster. It is also used as a reference benchmark by the TOP500 org to rank supercomputers worldwide. HPL for GPU uses double precision floating point operations.  There are a few parameters that are significant for the HPL benchmark, as listed below:

  • N is the problem size provided as input to the benchmark and determines the size of the linear system (an N x N matrix) that is solved by HPL. For a GPU system, the highest HPL performance is obtained when the problem size uses as much of the GPU memory as possible without exceeding it. For this study, we used HPL compiled with NVIDIA libraries as listed in Table 3.
  • NB is the block size which is used for data distribution. For this test configuration, we used an NB of 288.
  • PxQ is the process grid size and is equal to the total number of GPUs in the system.
  • Rpeak is the theoretical peak of the system. 
  • Rmax is the maximum measured performance achieved on the system. 
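To illustrate how the parameters above interact, here is a rough sizing sketch for N. It assumes a dense N x N matrix of 8-byte double-precision values, a memory budget of 90 percent of the total GPU memory, and 1 GB = 10^9 bytes; the 90 percent fraction and the function name are our assumptions, not the values used for the runs in this blog.

```python
import math

def hpl_problem_size(num_gpus, gpu_mem_gb, nb, mem_fraction=0.9):
    """Largest N (a multiple of NB) whose N x N float64 matrix fits in the budget."""
    budget_bytes = mem_fraction * num_gpus * gpu_mem_gb * 1e9  # assumes 1 GB = 10^9 bytes
    max_n = math.isqrt(int(budget_bytes // 8))                 # 8 bytes per element
    return (max_n // nb) * nb                                  # round down to a multiple of NB

# Configurations from Table 2, with NB = 288 as stated above.
n_xe8545 = hpl_problem_size(num_gpus=4, gpu_mem_gb=40, nb=288)
n_r7525 = hpl_problem_size(num_gpus=2, gpu_mem_gb=40, nb=288)
print("XE8545 (4 GPUs): N =", n_xe8545)
print("R7525  (2 GPUs): N =", n_r7525)
```

In practice, N is then tuned around such an estimate, because the benchmark also needs GPU memory for workspace beyond the matrix itself.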

Figure 4: HPL Performance on the PowerEdge R7525 and XE8545 with NVIDIA A100-40 GB 

  

Figure 5: HPL Power Utilization on the PowerEdge XE8545 with four NVIDIA A100 GPUs and R7525 with two NVIDIA A100 GPUs

 From Figure 4 and Figure 5, we can make the following observations:

  • SXM4 vs PCIe: At 1-GPU, the NVIDIA A100-SXM4 GPU outperforms the A100-PCIe by 11 percent. The higher SXM4 GPU base clock frequency is the predominant factor contributing to the additional performance over the PCIe GPU.
  • Scalability: The PowerEdge XE8545 server with four NVIDIA A100-SXM4-40GB GPUs delivers 3.5 times higher HPL performance compared to one NVIDIA A100-SXM4-40GB GPU. On the other hand, two A100-PCIe GPUs are 1.94 times faster than one on the R7525 platform. The A100 GPUs scale well on both platforms for the HPL benchmark.
  • Higher Rpeak: The HPL code on A100 GPUs uses the new double-precision Tensor Cores, so the theoretical peak for each card is 19.5 TFLOPS, as opposed to 9.7 TFLOPS.
  • Power: Figure 5 shows power consumption of a complete HPL run with PowerEdge XE8545 using 4 x A100-SXM4 GPUs and PowerEdge R7525 using 2 x A100-PCIe GPUs. This was measured with iDRAC commands, and the peak power consumption for XE8545 is 2877 Watts, while peak power consumption for R7525 is 1206 Watts.
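The scalability and Rpeak observations above reduce to simple ratios. The sketch below uses only numbers quoted in the list and in Table 1; the function names are hypothetical.

```python
FP64_TENSOR_TFLOPS_PER_GPU = 19.5  # peak FP64 Tensor Core rate per A100 (Table 1)

def rpeak_tflops(num_gpus):
    """Theoretical peak of the GPUs in a server, counting FP64 Tensor Core throughput."""
    return num_gpus * FP64_TENSOR_TFLOPS_PER_GPU

def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling achieved when going from 1 to num_gpus GPUs."""
    return speedup / num_gpus

print(f"XE8545 Rpeak: {rpeak_tflops(4):.1f} TFLOPS, "
      f"scaling efficiency: {scaling_efficiency(3.5, 4):.1%}")
print(f"R7525  Rpeak: {rpeak_tflops(2):.1f} TFLOPS, "
      f"scaling efficiency: {scaling_efficiency(1.94, 2):.1%}")
```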

High Performance Conjugate Gradient  

The TOP500 list has incorporated the High Performance Conjugate Gradient (HPCG) results as an alternate metric to assess system performance.

 

Figure 6: HPCG Performance on the PowerEdge R7525 and PowerEdge XE8545 Servers

Unlike HPL, HPCG performance depends heavily on the memory system and network performance when we go beyond one server. Because both the PCIe and SXM4 form factors of the A100 GPUs have the same memory bandwidth, there is no variation in the performance at a single node and HPCG performance scales well on both servers.

GROMACS

The following figure shows the performance results for GROMACS, a molecular dynamics application, on the PowerEdge R7525 and XE8545 servers. GROMACS 2021.0 was compiled with CUDA compilers and Open-MPI, as shown in Table 3.

Figure 7: GROMACS performance on the PowerEdge R7525 and PowerEdge XE8545 Servers

The GROMACS build included thread MPI (built in with the GROMACS package). Performance results are presented using the ns/day metric. For each test, the performance was optimized by varying the number of MPI ranks and threads, the number of PME ranks, and the nstlist value to obtain the best performance result.

With one GPU in test, the performance of the SXM4 XE8545 server is similar to the PCIe R7525. With two GPUs in test, the SXM4 XE8545 performance is up to 28 percent better than the PCIe R7525. This comparison between the NVIDIA PCIe and SXM4 form factors shows that datasets such as Water 1536 and Water 3072, which demand more GPU-to-GPU communication, perform around 28 percent better on SXM4. On the other hand, for datasets like LignoCellulose 3M, the two-GPU R7525 achieves the same per-GPU performance as the XE8545 but with the lower-power 250 W GPU, making it the more efficient solution.

 LAMMPS

The following figure shows the performance results for LAMMPS, a molecular dynamics application, on the PowerEdge R7525 and XE8545 servers. The code was compiled with the KOKKOS package to run efficiently on NVIDIA GPUs, and Lennard Jones is the dataset that was tested with Timesteps/s as the metric for comparison.

Figure 8: LAMMPS performance on the PowerEdge R7525 and PowerEdge XE8545 Servers

With one GPU in test, the performance of the SXM4 XE8545 server is 13 percent higher than the PCIe R7525, and with two GPUs in test, a 23 percent performance improvement was measured. The PowerEdge XE8545 is at an advantage because the GPUs can communicate with each other over NVLink without the intervention of a CPU. The R7525 server with two GPUs is limited by the GPU-to-GPU communication pattern. The other factor contributing to better performance is the higher clock rate of the SXM4 A100 GPU.

Conclusion

In this blog, we discussed the performance of NVIDIA A100 GPUs on the PowerEdge R7525 Server and the PowerEdge XE8545 Server, which is the new addition from Dell Technologies. The A100 GPU has about 73 percent more memory bandwidth (1555 GB/s versus 900 GB/s) and higher double-precision FLOPS compared to its predecessor, the V100 series GPU. For workloads that demand more GPU-to-GPU communication, the PowerEdge XE8545 server is an ideal choice. For data centers where space and power are limited, the PowerEdge R7525 server may be the right fit. The overall performance of the PowerEdge XE8545 Server with four A100-SXM4 GPUs is 1.5 to 2.3 times faster than the PowerEdge R7525 server with two A100-PCIe GPUs.

In the future, we intend to evaluate the A100-80GB GPUs and NVIDIA A40 GPUs that will be available this year. We also plan to focus on a multi-node performance study with these GPUs.

Please contact your Dell sales specialist about the HPC and AI Innovation Lab if you would like to evaluate these GPU servers.  

HPC AMD

BWA-GATK Pipeline Performance on Dell EMC PowerEdge R6525 Server

Kihoon Yoon

Tue, 30 Mar 2021 18:34:08 -0000


Overview

We’ve been speculating that AMD Milan with Zen3 cores, which allows more cores to share the same L3 cache, could perform better for Next Generation Sequencing (NGS) applications. Compared with its predecessor, AMD EPYC Rome, the number of cores sharing the L3 cache doubles from 4 to 8 for the 64 core processor model. In addition, the cache (both L1 and L2) is upgraded with new prefetchers, and memory bandwidth is improved.

Since Milan and Rome share the same SP3 socket, the Dell EMC PowerEdge R6525 was selected for the case study to minimize system-to-system variations. The test server configuration is summarized in Table 1.

Table 1.  Tested compute node configuration

Dell EMC PowerEdge R6525

CPU

Tested AMD Milan:

2x 7763 (Milan), 64 Cores, 2.45GHz – 3.5GHz Base-Boost, TDP 280W, 256 MB L3 Cache

2x 7713 (Milan), 64 Cores, 2.0GHz – 3.7GHz Base-Boost, TDP 225W, 256 MB L3 Cache

7543 (Milan), 32 Cores, 2.8GHz – 3.7 GHz Base-Boost, TDP 225W, 256 MB L3 Cache

Tested AMD Rome:

7702 (Rome), 64 Cores, 2.0 GHz – 3.35 GHz Base-Boost, TDP 200W, 256 MB L3 Cache

RAM

DDR4 256G (16Gb x 16) 3200 MT/s

OS

RHEL 8.3 (4.18.0-240.el8.x86_64)

Filesystem Network

Mellanox InfiniBand HDR100

Filesystem

Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage

BIOS System Profile

Performance Optimized

Logical Processor

Disabled

Virtualization Technology

Disabled

BWA

0.7.15-r1140

Sambamba

0.7.0

Samtools

1.6

GATK

3.6-0-g89b7209

The test data was chosen from one of Illumina’s Platinum Genomes. ERR194161 was processed with Illumina HiSeq 2000 submitted by Illumina and can be obtained from EMBL-EBI. The DNA identifier for this individual is NA12878. The description of the data from the linked website shows that this sample has a >30x depth of coverage, and it reaches ~53x. 

Performance Evaluation

Characterizing steps in BWA-GATK Pipeline

In a typical BWA-GATK pipeline, there are multiple steps, and each step consists of various applications which behave distinctively. As shown in Table 2, the applications in some steps do not support multi-threading operations. These steps are problematic since there are only a few ways to improve their performance.

Table 2.  The steps in BWA-GATK pipeline and tools

Steps | Applications | Multi-threading support
Mapping & Sorting | BWA, samtools, sambamba | Yes
Mark Duplicates | Sambamba | Yes
Generate Realigning Targets | GATK RealignerTargetCreator | Yes
Insertion/Deletion Realigning | GATK IndelRealigner | No
Base Recalibration | GATK BaseRecalibrator | Yes
HaplotypeCaller | GATK HaplotypeCaller | Yes
Genotype GVCFs | GATK GenotypeGVCFs | Yes
Variant Recalibration | GATK VariantRecalibrator | No
Apply Variant Recalibration | GATK ApplyRecalibration | No

Single-threaded applications, especially the Variant Recalibration and Apply Variant Recalibration steps, show no runtime variation because their algorithms are deterministic and their inputs are small. Hence, these two steps are not reported in Figure 1. For a similar reason, Genotype GVCFs is not included in Figure 1 although it supports multi-threading. The first step, Mapping & Sorting, scales as the number of cores increases (Figure 1, (a)).

Burrows-Wheeler Aligner (BWA) is one of the most popular short sequence aligners for non-gapped aligning analysis. BWA scales well up to 32 cores, and CPU usage drops dramatically beyond that. The runtime improvement becomes marginal with more than 32 cores. Using more than 80 cores for this step is a waste of resources.

Sambamba, which is compatible with Picard, is used for marking duplicate reads. The behavior of sambamba is plotted in Figure 1, (b). Due to its highly parallelized design, the memory consumption increases as more hash tables are created for additional threads. Amazingly, a 50x human whole genome sequence (WGS) is not big enough for this well-designed software to use more than 13 cores.

After the Mark Duplicates step, the Genome Analysis Toolkit (GATK), written in Java, plays a critical role in performance measurement and in producing results. These steps do not scale at all, as shown in Figure 1, (c), (d), (e), and (f). A better approach to handle this misbehavior in multi-core and multi-socket environments will be discussed in future work.

Figure 1.  Runtimes of 7702 with various number of cores for each step. Milan CPUs also show similar behaviors.

Single Sample Performance

Socket to Socket Comparison

This test is not a fair comparison since the majority of steps will not take advantage of using all the cores except 7543 with 32 cores. However, this comparison will help to decide which CPU could be best for the throughput test.

Table 3 summarizes the overall runtimes for BWA-GATK pipeline, and it is hard to say which one is better in terms of total runtimes. A lot more tests are required to differentiate the performance differences in GATK steps. Also, the results from 7502 and 7402 were from the previous tests with different environments.

The Mapping & Sorting step is the only step in Table 3 where we can peek at the true performance variations across different CPUs. A rough estimate of the performance improvement from the 7702 to the 7763 is 7%, while the performance gain from the 7702 to the 7713 is 5%.

Surprisingly, the Base Recalibration step showed similar results as the Mapping & Sorting step, which is 8% and 3% improvement.
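The improvement figures quoted here follow from the Table 3 runtimes. As a minimal sketch, with the improvement taken as the reduction in runtime relative to the 7702 baseline (the helper name is ours):

```python
def improvement_pct(baseline_hours, new_hours):
    """Runtime reduction relative to the baseline, in percent."""
    return (baseline_hours - new_hours) / baseline_hours * 100.0

# Runtimes in hours from Table 3: {step: {CPU: runtime}}.
runtimes = {
    "Mapping & Sorting": {"7702": 2.63, "7763": 2.44, "7713": 2.49},
    "Base Recalibration": {"7702": 2.46, "7763": 2.27, "7713": 2.38},
}

for step, by_cpu in runtimes.items():
    for cpu in ("7763", "7713"):
        gain = improvement_pct(by_cpu["7702"], by_cpu[cpu])
        print(f"{step}: 7702 -> {cpu}: {gain:.1f}% faster")
```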

Table 3.  BWA-GATK performance comparisons between Milan and Rome. The number of cores used for the test is parenthesized.

Runtime (hours), with the number of cores used in parentheses

Steps | AMD 7763 64c 2.45GHz | AMD 7713 64c 2.0GHz | AMD 7543 32c 2.8GHz | AMD 7702 64c 2.0GHz | AMD 7502 32c 2.5GHz | AMD 7402 24c 3.0GHz
Mapping & Sorting | 2.44 (64) | 2.49 (64) | 3.69 (32) | 2.63 (64) | 4.68 (32) | 5.73 (24)
Mark Duplicates | 1.07 (13) | 1.10 (13) | 1.01 (13) | 1.01 (13) | 0.93 (13) | 0.94 (13)
Generate Realigning Targets | 0.55 (32) | 0.56 (32) | 0.50 (32) | 0.58 (32) | 0.45 (32) | 0.44 (32)
Insertion/Deletion Realigning | 8.73 (1) | 9.13 (1) | 7.73 (1) | 8.78 (1) | 8.30 (1) | 8.21 (1)
Base Recalibration | 2.27 (32) | 2.38 (32) | 2.17 (32) | 2.46 (32) | 2.52 (32) | 2.67 (24)
HaplotypeCaller | 10.20 (16) | 10.57 (16) | 9.15 (16) | 9.02 (16) | 9.33 (16) | 9.05 (16)
Genotype GVCFs | 0.02 (32) | 0.02 (32) | 0.01 (32) | 0.02 (32) | 0.01 (32) | 0.01 (24)
Variant Recalibration | 0.31 (1) | 0.20 (1) | 0.17 (1) | 0.12 (1) | 0.21 (1) | 0.13 (1)
Apply Variant Recalibration | 0.01 (1) | 0.01 (1) | 0.01 (1) | 0.01 (1) | 0.01 (1) | 0.01 (1)
Total Runtime (hours) | 25.59 | 26.47 | 24.44 | 24.64 | 26.46 | 27.25

Multiple Sample Performance - Throughput

A typical way of running an NGS pipeline is to process multiple samples on a compute node and use multiple compute nodes to maximize the throughput. However, this time the tests were performed on a single compute node due to the limited number of servers available at this moment.

The current pipeline invokes a large number of pipe operations in the first step to minimize the writing of intermediate files. Although this saves a day of runtime and lowers the storage usage significantly, the cost of invoking pipes is quite heavy. Hence, this limits the number of samples that can be processed concurrently. Typically, a process silently fails when there are not enough resources left to start an additional process.

However, the failures experienced during this study are quite different from the previous observations. Of 10 pipelines started on the R6525 with 2x 7763, only 6 on average run to completion with the 50x human WGS. Four pipelines failed with a broken pipe error, which suggests a problem with some sort of file operation. The BeeGFS storage used for the test is designed for high capacity, with a theoretical sequential write bandwidth of 25 GB/s. However, roughly 16 GB/s is achievable when the storage is not heavily loaded in a shared storage environment. This is not an ideal environment for any benchmark practice; however, the results here are quite helpful to see what the performance of these systems looks like in real life.

As shown in Table 4, the maximum number of samples that can be processed at the same time is around 4 or 5, and a throughput of ~4.79 50x human whole genomes per day is achievable with the current environment.

Table 4.  Throughput test for Milan 7763

Runtime (hours)

Steps | 1 Sample | 2 Samples | 4 Samples | 6 Samples | 10 Samples
Number of Samples Failed | 0 | 0 | 0 | 1 | 4
Mapping & Sorting | 2.44 | 2.91 | 4.33 | 5.86 | 8.33
Mark Duplicates | 1.07 | 1.40 | 1.69 | 1.31 | 5.51
Generate Realigning Targets | 0.55 | 0.88 | 1.77 | 0.50 | 2.07
Insertion/Deletion Realigning | 8.73 | 8.97 | 8.92 | 8.92 | 9.70
Base Recalibration | 2.27 | 2.50 | 2.79 | 3.26 | 3.67
HaplotypeCaller | 10.20 | 10.57 | 10.27 | 9.91 | 9.96
Genotype GVCFs | 0.02 | 0.11 | 0.10 | 0.10 | 0.15
Variant Recalibration | 0.31 | 0.25 | 0.20 | 0.21 | 0.36
Apply Variant Recalibration | 0.01 | 0.02 | 0.01 | 0.01 | 0.03
Total Runtime (hours) | 25.59 | 27.62 | 30.08 | 30.08 | 39.79
Genomes per day | 0.94 | 1.74 | 4.79 | 3.99 | 3.62
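The "Genomes per day" row appears to follow a simple relationship: the number of samples that completed, scaled to a 24-hour day by the total runtime. This formula is our reconstruction, not one stated in the blog; it reproduces the 1, 2, 6, and 10 sample columns of Table 4, and the 4 sample entry is left as reported.

```python
def genomes_per_day(samples_started, samples_failed, total_runtime_hours):
    """Completed samples per 24 hours for one concurrent batch (assumed formula)."""
    completed = samples_started - samples_failed
    return completed * 24.0 / total_runtime_hours

# (samples started, samples failed, total runtime in hours) from Table 4.
batches = [(1, 0, 25.59), (2, 0, 27.62), (6, 1, 30.08), (10, 4, 39.79)]

for started, failed, hours in batches:
    rate = genomes_per_day(started, failed, hours)
    print(f"{started} samples ({failed} failed): {rate:.2f} genomes per day")
```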

Conclusion

The field of NGS data analysis has been moving fast in terms of data growth and data variations. However, the community has not done much work on adopting newly available technologies such as accelerators. Instead of improving the quality of the code, the community is faced with analyzing data without multi-threaded processing, since GATK version 4 and later no longer supports multi-threading while the number of cores in a CPU increases fast.

It is time for users to think about how to tackle this problem. One simple way to avoid it is to perform data-level parallelization. Although deciding when to split the data is pretty hard, it is certainly tractable with careful interventions in an existing BWA-GATK pipeline, without diluting statistical power with a sheer number of data. If each smaller data chunk goes through an individual pipeline on each core and is merged at the end, it could be possible to achieve better performance on a single sample. This performance gain could lead to higher throughput if the overall runtime is reduced significantly.
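The scatter-gather idea can be sketched as follows. This is only an illustration of the control flow: `run_pipeline_on_chunk` is a hypothetical placeholder that returns labels instead of invoking BWA or GATK, the region names are made up, and threads stand in for the separate per-core processes a real pipeline would launch.

```python
from concurrent.futures import ThreadPoolExecutor

def split_by_region(regions, n_chunks):
    """Distribute genomic regions round-robin into at most n_chunks lists."""
    chunks = [[] for _ in range(n_chunks)]
    for i, region in enumerate(regions):
        chunks[i % n_chunks].append(region)
    return [chunk for chunk in chunks if chunk]

def run_pipeline_on_chunk(chunk):
    """Hypothetical stand-in for running the pipeline on one data chunk."""
    return [f"calls:{region}" for region in chunk]

def scatter_gather(regions, n_workers):
    """Process chunks concurrently, then merge results back into genomic order."""
    chunks = split_by_region(regions, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(run_pipeline_on_chunk, chunks))
    merged = [item for part in partial_results for item in part]
    order = {region: i for i, region in enumerate(regions)}
    return sorted(merged, key=lambda label: order[label.split(":", 1)[1]])

regions = ["chr1", "chr2", "chr3", "chr4", "chr5"]
print(scatter_gather(regions, n_workers=2))
```

The hard part, as noted above, is choosing split points that do not affect the results; the sketch sidesteps this by treating each region as independent.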

Nonetheless, the Milan 7763 and 7713 are excellent candidates to cover both current multi-threading-based pipelines and future data-level parallelization driven pipelines with more available cores.

 

HPC AMD

AMD Milan - BIOS Characterization for HPC

Puneet Singh Savitha Pareek Tarun Singh Ashish KS

Thu, 18 Mar 2021 19:27:51 -0000


With the release of the AMD EPYC 7003 Series Processors (architecture codenamed "Milan"), Dell EMC PowerEdge servers have now been upgraded to support the new features. This blog outlines the Milan Processor architecture and the recommended BIOS settings to deliver optimal HPC Synthetic benchmark performance. Upcoming blogs will focus on the application performance and characterization of the software applications from various scientific domains such as Weather Science, Molecular Dynamics, and Computational Fluid Dynamics.

AMD Milan with Zen3 cores is the successor of AMD's high-performance second generation server microprocessor (architecture codenamed "Rome"). It supports up to 64 cores at 280 W TDP and 8 DDR4 memory channels at speeds up to 3200 MT/s.

Architecture 

As with AMD Rome, AMD Milan’s 64 core Processor model has 1 I/O die and 8 compute dies (also called CCD or Core Complex Die) – OPN 32 core models may have 4 or 8 compute dies. Milan Processors have upgrades to the Cache (including new prefetchers at both L1 and L2 caches) and Memory Bandwidth which is expected to improve performance of applications requiring higher memory bandwidth.

Unlike Naples and Rome, Milan's arrangement of its CCDs has changed. Each CCD now features up to 8 cores with a unified 32 MB L3 cache, which could reduce the cache access latency within compute chiplets. Milan Processors can expose each CCD as a NUMA node by setting the “L3 cache as NUMA Domain” (from the iDRAC GUI) or BIOS.ProcSettings.CcxAsNumaDomain (using the racadm CLI) option to “Enabled”. Therefore, Milan’s 64 core dual-socket Processors with 8 CCDs per Processor will expose 16 NUMA domains per system in this setting. Here is the logical representation of the Core arrangement with NUMA Nodes per socket = 4 and CCD as NUMA = Disabled.

Figure1: Linear core enumeration on a dual-socket system, 64c per socket, NPS4 configuration on an 8 CCD Processor model

As with AMD Rome, AMD Milan Processors support the AVX2 (256-bit) instruction set, allowing 16 DP FLOP/cycle.
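The 16 DP FLOP/cycle figure gives a quick way to estimate a theoretical peak (Rpeak) for a dual-socket server. The sketch below assumes the base clock is used for the estimate, which is our assumption; the core counts and base frequencies are those listed for the 7763 and 7543 in Table 1 of this blog.

```python
def cpu_rpeak_gflops(sockets, cores_per_socket, base_ghz, flop_per_cycle=16):
    """Theoretical double-precision peak in GFLOPS at the base clock."""
    return sockets * cores_per_socket * base_ghz * flop_per_cycle

rpeak_7763 = cpu_rpeak_gflops(sockets=2, cores_per_socket=64, base_ghz=2.45)
rpeak_7543 = cpu_rpeak_gflops(sockets=2, cores_per_socket=32, base_ghz=2.8)
print(f"2x 7763: {rpeak_7763:.1f} GFLOPS")
print(f"2x 7543: {rpeak_7543:.1f} GFLOPS")
```

HPL efficiency is then the measured Rmax divided by this Rpeak.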

BIOS Options Available on AMD Milan and Tuning

Processors from both Milan and Rome generations are socket compatible, so the BIOS Options are similar across these Processor generations. Server details are mentioned in Table 1 below.

Table 1: Testbed hardware and software details

Server | Dell EMC PowerEdge 2 socket servers (with AMD Milan Processors) | Dell EMC PowerEdge 2 socket servers (with AMD Rome Processors)
OPN | 7763 (Milan) | 7H12 (Rome)
Cores/Socket | 64 | 64
Frequency (Base-Boost) | 2.45 GHz – 3.5 GHz | 2.6 GHz – 3.3 GHz
TDP | 280 W | 280 W
L3 Cache | 256 MB | 256 MB
OPN | 7713 (Milan) | 7702 (Rome)
Cores/Socket | 64 | 64
Frequency (Base-Boost) | 2.0 GHz – 3.7 GHz | 2.0 GHz – 3.35 GHz
TDP | 225 W | 200 W
L3 Cache | 256 MB | 256 MB
OPN | 7543 (Milan) | 7542 (Rome)
Cores/Socket | 32 | 32
Frequency (Base-Boost) | 2.8 GHz – 3.7 GHz | 2.9 GHz – 3.4 GHz
TDP | 225 W | 225 W
L3 Cache | 256 MB | 128 MB
Operating System | RHEL 8.3 (4.18.0-240.el8.x86_64) | RHEL 8.2 (4.18.0-193.el8.x86_64)
Memory | DDR4 256G (16Gb x 16) 3200 MT/s
BIOS / CPLD | 2.0.3 / 1.1.12 | 1.1.7
Interconnect | Mellanox HDR 200 (4X HDR) | Mellanox HDR 100

The following BIOS options were explored:

  • BIOS.SysProfileSettings.SysProfile:  This field sets the System Profile to Performance Per Watt (OS), Performance, or Custom mode. When set to a mode other than Custom, BIOS will set each option accordingly. When set to Custom, you can change setting of each option. Under Custom mode when C state is enabled, Monitor/Mwait should also be enabled.
  • BIOS.ProcSettings.L1StridePrefetcher: When set to Enabled, the Processor provides additional fetch to the data access for an individual instruction for performance tuning by controlling the L1 stride prefetcher setting.
  • BIOS.ProcSettings.L2StreamHwPrefetcher: When set to Enabled, the Processor provides advanced performance tuning by controlling the L2 stream HW prefetcher setting.
  • BIOS.ProcSettings.L2UpDownPrefetcher: When set to Enabled, the Processor uses memory access to determine whether to fetch  next or previous for all memory accesses for advanced performance tuning by controlling the L2 up/down prefetcher setting.
  • BIOS.ProcSettings.CcxAsNumaDomain: This field specifies that each CCD within the Processor will be declared as a NUMA Domain.
  • BIOS.MemSettings.MemoryInterleaving: When set to Auto, memory interleaving is supported if a symmetric memory configuration is installed. When set to Disabled, the system supports Non-Uniform Memory Access (NUMA) (asymmetric) memory configurations. Operating Systems that are NUMA-aware understand the distribution of memory in a particular system and can intelligently allocate memory in an optimal manner. Operating Systems that are not NUMA-aware could allocate memory to a Processor that is not local, resulting in a loss of performance. Die and Socket Interleaving should only be enabled for Operating Systems that are not NUMA-aware.

After setting System Profile (BIOS.SysProfileSettings.SysProfile) to PerformanceOptimized, NUMA Nodes Per Socket (NPS) to 4, and the Prefetchers (L1Region, L1Stream, L1Stride, L2Stream, L2UpDown) to “Enabled”, we measured the impact of the CcxAsNumaDomain and MemoryInterleaving BIOS parameters on application performance. We tested the performance of the applications with the following settings.

Table 2: Combinations of CCX as NUMA domain and Memory Interleaving


Setting | CCX as NUMA Domain | Memory Interleaving
Setting01 | Disabled | Disabled
Setting02 | Disabled | Auto
Setting03 | Enabled | Auto
Setting04 | Enabled | Disabled

With Setting01 and Setting02 (CCX as NUMA Domain = Disabled), the system will expose 8 NUMA nodes. With Setting03 and Setting04, there will be 16 NUMA nodes on a dual socket server with 64 core based Milan Processors.
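The NUMA node counts above can be reproduced with a small model. It assumes that with CCX as NUMA Domain enabled each CCD becomes one NUMA node, and that otherwise the NPS value alone determines the count per socket; the function and dictionary names are illustrative, not BIOS or racadm identifiers.

```python
def numa_nodes(sockets, nps, ccds_per_socket, ccx_as_numa):
    """Number of NUMA nodes the system exposes under the assumed model."""
    per_socket = ccds_per_socket if ccx_as_numa else nps
    return sockets * per_socket

# Table 2 combinations: (CCX as NUMA Domain enabled?, Memory Interleaving).
settings = {
    "Setting01": (False, "Disabled"),
    "Setting02": (False, "Auto"),
    "Setting03": (True, "Auto"),
    "Setting04": (True, "Disabled"),
}

# Dual-socket server, NPS4, 64 core Milan processors with 8 CCDs each.
for name, (ccx_as_numa, interleaving) in settings.items():
    nodes = numa_nodes(sockets=2, nps=4, ccds_per_socket=8, ccx_as_numa=ccx_as_numa)
    print(f"{name}: interleaving={interleaving}, NUMA nodes={nodes}")
```

Memory interleaving does not change the node count in this model; it is carried along only to mirror Table 2.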

Table 3: hwloc-ls and numactl -H command output on 128 core (2x 64c) server with setting01/setting02 (listed in Table 2)


Table 4: hwloc-ls and numactl -H command output on 128 core (2x 64c) server with setting03/setting04 (listed in Table 2)

Application performance is shown in Figure 2, Figure 3 and Figure 4. In each Figure, the numbers on top of the bars represent the relative change in the application performance with respect to the application performance obtained on the 7543 Processor Model with setting04 (CCXasNUMADomain=Enabled and Memory Interleaving = Disabled - green bar).

Figure 2: Relative difference in the performance of HPL by processor and BIOS settings mentioned in Table 1 and Table 2. 


Figure 3: Relative difference in the performance of HPCG by processor and BIOS settings mentioned in Table 1 and Table 2. 


Figure 4: Relative difference in the performance of STREAM by processor and BIOS settings mentioned in Table 1 and Table 2. 

HPL delivers the best performance numbers on setting02 with 82-93% efficiency depending on Processor Model, whereas STREAM and HPCG deliver better performance with setting04.

STREAM TRIAD tests generate best performance numbers at ~378 GB/s memory bandwidth across all of the 64 and 32 core Processor Models mentioned in Table 1 with efficiency up to 90%.

In Figure 4, the STREAM TRIAD performance numbers were measured by undersubscribing the server, utilizing only 16 cores per server. Figure 5 compares the performance numbers obtained with all available cores against those obtained with 16 cores per system. The numbers on top of the orange bars show the relative difference.

Figure 5: Relative difference in the memory bandwidth.

From Figure 5, we observed that by using 16 cores, the STREAM TRIAD test’s performance numbers were ~3-4% higher than the performance numbers measured by subscribing all available cores.

We carried out NUMA bandwidth tests using setting02 and setting04 mentioned in Table 2. With setting02, the system exposes a total of 8 NUMA nodes, while with setting04 it exposes a total of 16 NUMA nodes with 8 cores per NUMA node. In Figure 6 and Figure 7, the NUMA node is presented as “c” and the memory node as “m”. As an example, c0m0 represents NUMA node 0 and memory node 0. The figures show the best bandwidth numbers obtained on varying the number of threads.

Figure 6: Local and remote NUMA memory bandwidth with CCXasNUMADomain=Disabled

Figure 7: Local and remote NUMA memory bandwidth with CCXasNUMADomain=Enabled

We observed that the optimal intra socket local memory bandwidth numbers were obtained with 2 threads per NUMA node with setting02 on both 64 core and 32 core processor models. In Figure 6, with setting02 (Table 2), the intra socket local memory bandwidth at 2 threads per NUMA node can be up to 79% higher than the inter socket remote memory bandwidth. With setting02 (Figure 6), we get at least 96% higher intra socket local memory bandwidth per NUMA domain than with setting04 (Figure 7).

Impact of new Prefetch options

Milan introduces two new prefetchers for L1 cache and one for L2 Cache with a total of five prefetcher options which can be configured using BIOS. We tested combinations listed in Table 5 by keeping L1 Stream and L2 Stream prefetcher as Enabled.

Table 5: Cache Prefetchers


Setting | L1StridePrefetcher | L1RegionPrefetcher | L2UpDownPrefetcher
setting01 | Disabled | Enabled | Enabled
setting02 | Enabled | Disabled | Enabled
setting03 | Enabled | Enabled | Disabled
setting04 | Disabled | Disabled | Disabled

We found that these new prefetchers do not have significant impact on the performance of the synthetic benchmarks covered in this blog.

InfiniBand bandwidth, message rate and scalability

For multinode tests, the testbed was configured with Mellanox HDR interconnect running at 200 Gbps, with each server having the AMD 7713 Processor Model and the Preferred IO setting set to Enabled in the BIOS. With setting02 (Table 2) and Prefetchers (L1Region, L1Stream, L1Stride, L2Stream, L2UpDown) set to “Enabled”, we were able to achieve the expected linear performance scalability for the HPL and HPCG benchmarks.

Figure 8: Multinode scalability of HPL and HPCG with setting02 (Table 2) with 7713 Processor model, HDR200 Infiniband


We tested the Message Rate, Unidirectional, and Bidirectional InfiniBand bandwidth using OSU Benchmarks; the results are in Figure 9, Figure 10 and Figure 11. Except for the NUMA Nodes Per Socket setting, all other BIOS settings for these tests were the same as mentioned above. The OSU Bidirectional and Unidirectional bandwidth tests were carried out with NUMA Nodes Per Socket set to 2, and the Message Rate test was carried out with NUMA Nodes Per Socket set to 4. In Figure 9 and Figure 10, the numbers on top of the orange bars represent the percentage difference between Local and Remote bandwidth performance numbers.

Figure 9: OSU bi-directional bandwidth test on AMD 7713, HDR 200 InfiniBand

Figure 10: OSU uni-directional bandwidth test on AMD 7713, HDR 200 Infiniband

For Local Latency and Bandwidth performance numbers, the MPI process was pinned to the NUMA node 1 (closest to the HCA). For Remote Latency and Bandwidth tests, processes were pinned to NUMA node 6.

Figure 11: OSU Message rate and bandwidth performance on 2 and 4 nodes of 7713 Processor model

On 2 nodes using HDR200, we are able to achieve ~24 GB/s unidirectional bandwidth and message rate of 192 Million messages/second – almost double the performance numbers obtained on HDR100.
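To put the ~24 GB/s figure in context, it helps to convert the nominal 200 Gbps line rate into GB/s. The short sketch below does that conversion and computes the fraction of the nominal rate that was reached. It assumes a plain 8 bits per byte conversion and ignores encoding and protocol overhead; the function names are illustrative.

```python
def gbps_to_gbytes_per_s(link_gbps: float) -> float:
    """Convert a nominal link rate in Gbit/s to GB/s (8 bits per byte, no overhead)."""
    return link_gbps / 8.0


def link_efficiency(measured_gbytes_per_s: float, link_gbps: float) -> float:
    """Fraction of the nominal link rate reached by a measured bandwidth."""
    return measured_gbytes_per_s / gbps_to_gbytes_per_s(link_gbps)


nominal = gbps_to_gbytes_per_s(200)   # nominal HDR200 rate in GB/s
reached = link_efficiency(24.0, 200)  # ~24 GB/s unidirectional, as reported above
print(f"nominal {nominal} GB/s, reached {reached:.0%} of the nominal rate")
```

With the reported ~24 GB/s, this works out to roughly 96% of the 25 GB/s nominal rate.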

Comparison with Rome SKUs

In order to draw out performance improvement comparisons, we have selected Rome SKUs closest to their Milan counterparts in terms of hardware features such as Cache Size, TDP values, and Processor Base/Turbo Frequency.

Figure 12: HPL performance comparison with Rome Processor Models

 

Figure 13: HPCG performance comparison with Rome Processor Models

 

Figure 14: STREAM performance comparison with Rome Processor Models

For HPL (Figure 12), we observed that on higher end Processor Models Milan delivers 10% better performance than Rome. As expected, on the Milan platform, memory bandwidth bound applications like STREAM and HPCG (Figure 13 and Figure 14) gain 6-16% and 13-32% in performance over the Rome Processor Models covered in this blog.

Summary and Future Work  

Milan-based servers show the expected performance upgrades, especially for the memory bandwidth bound synthetic HPC benchmarks covered in this blog. Configuring the BIOS options is important in order to get the best performance out of the system. Hyper-Threading should be disabled for general-purpose HPC systems; for synthetic benchmarks not covered in this blog, the benefits of this feature should be tested and the feature enabled where appropriate.

Check back soon for subsequent blogs that describe application performance studies on our Milan Processor based cluster.

HPC AMD manufacturing

Siemens’ Simcenter STAR-CCM+ Performance with AMD EPYC 7003 Series Processors

Joshua Weage

Wed, 17 Mar 2021 15:15:19 -0000


Introduction

This blog discusses the performance of Siemens’ Simcenter STAR-CCM+ on the Dell EMC Ready Solutions for HPC Digital Manufacturing with AMD EPYC 7003 series processors. This Dell EMC Ready Solutions for HPC was designed and configured specifically for digital manufacturing workloads, where computer aided engineering (CAE) applications are critical for virtual product development. The Dell EMC Ready Solutions for HPC Digital Manufacturing uses a flexible building block approach to HPC system design, where individual building blocks can be combined to build HPC systems which are optimized for customer specific workloads and use cases.

The Dell EMC Ready Solutions for HPC Digital Manufacturing is one of many solutions in the Dell EMC HPC solution portfolio. Please visit www.dellemc.com/hpc for a comprehensive overview of the HPC solutions offered by Dell EMC.

Benchmark System Configuration

Performance benchmarking was performed using dual-socket Dell EMC PowerEdge servers with 7002 and 7003 series AMD EPYC processors. All servers were populated with two processors and one DIMM per channel memory configuration. The system configurations used for the performance benchmarking are shown in Table 1 and Table 2. The BIOS configuration used for the benchmarking systems is shown in Table 3.

Table 1.  7002 Series AMD EPYC System Configuration

Server

Dell EMC PowerEdge C6525

Processor

2x AMD EPYC 7532 32-core Processors

Memory

16x16GB 3200 MTps RDIMMs

BIOS Version

1.4.8

Operating System

Red Hat Enterprise Linux Server release 7.6

Kernel Version

3.10.0-957.27.2.el7.x86_64

 Table 2.  7003 Series AMD EPYC System Configuration

Server

Dell EMC PowerEdge R6525

Processors

2x AMD EPYC 7713 64-Core Processors

2x AMD EPYC 7543 32-Core Processors

Memory

16x16GB 3200 MTps RDIMMs

BIOS Version

2.0.1

Operating System

Red Hat Enterprise Linux Server release 8.3

Kernel Version

4.18.0-240.el8.x86_64

Table 3.  BIOS Configuration

System Profile

Performance Optimized

Logical Processor

Disabled

Virtualization Technology

Disabled

NUMA Nodes Per Socket

4

Software Versions

Application software versions are as described in Table 4.

Table 4.  Software Versions

Simcenter STAR-CCM+ 2020.3.1 mixed precision with Open MPI 4

Siemens’ Simcenter STAR-CCM+ Performance

Simcenter STAR-CCM+ is a multiphysics software application used to simulate a wide range of products and designs under a variety of conditions. The benchmarks reported here mainly use the computational fluid dynamics (CFD) and heat transfer features of STAR-CCM+. CFD applications typically scale well across multiple processor cores and servers, have modest memory capacity requirements, and typically perform minimal disk I/O while solving. However, some simulations may have greater I/O demands, such as transient analysis.

The benchmark cases from the standard STAR-CCM+ benchmark suite were evaluated on the systems. The benchmark results reported here are single-server performance results, with the benchmark run using all processor cores available in the server. STAR-CCM+ benchmark performance is measured using the Average Elapsed Time metric which is the average elapsed time per solver iteration. A smaller elapsed time represents better performance. Figure 1 shows the relative performance results for a selection of the STAR‑CCM+ benchmarks.

Figure 1.  Simcenter STAR-CCM+ Single Server Performance

 The results in Figure 1 are plotted relative to the performance of a single server configured with AMD EPYC 7532 processors. Larger values indicate better overall performance. These results show the performance improvement available with 7003 series AMD EPYC processors. The 32-core AMD EPYC 7543 processor provides good performance for these benchmarks. Per server, the 64-core AMD EPYC 7713 provides a significant performance advantage over the 32-core processors.
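Because the underlying metric is elapsed time per iteration (smaller is better), the relative performance plotted in Figure 1 is the baseline time divided by the candidate time. The sketch below illustrates the conversion; the timings are made-up placeholders, not measurements from this study.

```python
def relative_performance(baseline_seconds: float, candidate_seconds: float) -> float:
    """Speedup of a candidate system over the baseline, from average elapsed times."""
    # A smaller elapsed time is better, so the baseline goes in the numerator.
    return baseline_seconds / candidate_seconds


# Placeholder average elapsed times per solver iteration, in seconds.
baseline = 10.0    # stands in for the AMD EPYC 7532 reference server
candidate = 6.25   # stands in for a faster server
print(relative_performance(baseline, candidate))
```

A value above 1.0 means the candidate server completes each solver iteration faster than the reference server.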

Conclusion

The results presented in this blog show that 7003 series AMD EPYC processors offer a significant performance improvement for Siemens’ Simcenter STAR-CCM+ relative to 7002 series AMD EPYC processors.


PowerEdge HPC GPU AMD

HPC Application Performance on Dell EMC PowerEdge R7525 Servers with the AMD MI100 Accelerator

Frank Han Dharmesh Patel

Tue, 15 Dec 2020 14:23:27 -0000



Overview

The Dell EMC PowerEdge R7525 server supports the AMD MI100 GPU Accelerator. The server is a two-socket, 2U rack-based server that is designed to run complex workloads using highly scalable memory, I/O capacity, and network options. The system is based on the 2nd Gen AMD EPYC processor (up to 64 cores), has up to 32 DIMMs, and has PCI Express (PCIe) 4.0-enabled expansion slots. The server supports SATA, SAS, and NVMe drives and up to three double-wide 300 W accelerators.

The following figure shows the front view of the server:
 

Figure 1.  Dell EMC PowerEdge R7525 server

The AMD Instinct™ MI100 accelerator is one of the world’s fastest HPC GPUs available in the market. It offers innovations to obtain higher performance for HPC applications with the following key technologies:

  • AMD Compute DNA (CDNA)—Architecture optimized for compute-oriented workloads
  • AMD ROCm—An Open Software Platform that includes GPU drivers, compilers, profilers, math and communication libraries, and system resource management tools 
  • Heterogeneous-Computing Interface for Portability (HIP)—An interface that enables developers to convert CUDA code to portable C++ so that the same source code can run on AMD GPUs

This blog focuses on the performance characteristics of a single PowerEdge R7525 server with AMD MI100-32G GPUs. We present results from the general matrix multiplication (GEMM) microbenchmarks, the LAMMPS benchmarks, and the NAMD benchmarks to showcase performance and scalability.

The following table provides the configuration details of the PowerEdge R7525 system under test (SUT): 

Table 1.  SUT hardware and software configurations

Component

Description

Processor

AMD EPYC 7502 32-core processor

Memory

512 GB (32 GB 3200 MT/s * 16)

Local disk

2 x 1.8 TB SSD (No RAID)

Operating system

Red Hat Enterprise Linux Server 8.2

GPU

3 x AMD MI100-PCIe-32G

Driver version

3204

ROCm version

3.9

Processor Settings > Logical Processors

Disabled

System profiles

Performance

NUMA node per socket

4

NAMD benchmark

Version:  NAMD 3.0 ALPHA 6

LAMMPS (KOKKOS) benchmark

Version:  LAMMPS patch_18Sep2020+AMD patches

The following table lists the AMD MI100 GPU specifications:

Table 2.  AMD MI100 PCIe GPU specification

Component


GPU architecture

MI100

Peak Engine Clock (MHz)

1502

Stream processors

7680

Peak FP64 (TFLOPS)

11.5

Peak FP64 Tensor DGEMM (TFLOPS)

11.5

Peak FP32 (TFLOPS)

23.1

Peak FP32 Tensor SGEMM (TFLOPS)

46.1

Memory size (GB)

32

Memory ECC support

Yes

TDP (Watt)

300

GEMM Microbenchmarks

The GEMM benchmark is a simple, multithreaded dense matrix-to-matrix multiplication benchmark that can be used to test the performance of GEMM on a single GPU. The rocblas-bench binary compiled from https://github.com/ROCmSoftwarePlatform/rocBLAS was used to collect DGEMM and SGEMM results. The results of these tests reflect the performance of an ideal application that only runs matrix multiplication in the form of the peak TFLOPS that the GPU can deliver. Although GEMM benchmark results might not represent real-world application performance, it is still a good benchmark to demonstrate the performance capability of different GPUs.

The following figure shows the observed numbers of DGEMM and SGEMM:

 

Figure 2.  DGEMM and SGEMM for both AMD MI100 peak and AMD-PCIe sustained

The results indicate:

  • In the DGEMM (double-precision GEMM) benchmark, the theoretical peak performance of the AMD MI100 GPU is 11.5 TFLOPS and the measured sustained performance is 7.9 TFLOPS. As shown in Table 2, the standard double precision (FP64) theoretical peak and the FP64 tensor DGEMM peak performance are both at 11.5 TFLOPS. Because most real world HPC applications typically are not heavily implemented with DGEMM or other matrix operations, this high standard FP64 capability boosts the performance on other non-matrix double-precision math calculations.
  • For FP32 Tensor operations in the SGEMM (single-precision GEMM) benchmark, the theoretical peak performance of the AMD MI100 GPU is 46.1 TFLOPS, and the measured sustained performance is approximately 30 TFLOPS. 
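The peak numbers in Table 2 and the sustained-to-peak ratios above can be cross-checked with a few lines of Python. This is a sketch under two assumptions that are consistent with Table 2 but not stated in it: each stream processor completes one fused multiply-add (2 FLOPs) per clock in FP32, and standard FP64 runs at half the FP32 rate.

```python
def peak_tflops(stream_processors: int, clock_mhz: float, flops_per_clock: int = 2) -> float:
    """Theoretical peak in TFLOPS, assuming a fixed number of FLOPs per SP per clock."""
    return stream_processors * flops_per_clock * clock_mhz * 1e6 / 1e12


# Values from Table 2: 7680 stream processors at a 1502 MHz peak engine clock.
fp32_peak = peak_tflops(7680, 1502)
fp64_peak = fp32_peak / 2  # assumption: FP64 at half the FP32 rate

# Sustained results reported above, relative to the Table 2 peaks.
dgemm_efficiency = 7.9 / 11.5
sgemm_efficiency = 30.0 / 46.1

print(f"FP32 peak {fp32_peak:.1f} TFLOPS, FP64 peak {fp64_peak:.1f} TFLOPS")
print(f"DGEMM {dgemm_efficiency:.0%} of peak, SGEMM {sgemm_efficiency:.0%} of peak")
```

Both benchmarks sustain roughly two thirds of their respective theoretical peaks.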

The LAMMPS benchmark

The Large-Scale Atom/Molecular Massively Parallel Simulator (LAMMPS) runs threads in parallel using message-passing techniques. This benchmark measures the scalability and performance of large, parallel systems of multiple GPUs. 

The following figure shows the KOKKOS implementation of LAMMPS scaled relatively linearly as AMD MI100 GPUs were added across four datasets: EAM, LJ, Tersoff, and ReaxFF/C.

Figure 3.  LAMMPS benchmark showing scaling of multiple AMD MI100 GPUs

The NAMD Benchmark

Nanoscale Molecular Dynamics (NAMD) is a parallel molecular dynamics system designed for simulation of large biomolecular systems. The NAMD benchmark stresses the scaling and performance aspects of the server and GPU configuration.

The following figure plots the results of the NAMD microbenchmark: 

Figure 4.   NAMD benchmark performance

Aggregate performance across multiple GPU cards is reported because the Alpha builds of the NAMD 3.0 binary do not scale beyond a single accelerator. Three replica simulations were launched in parallel on the same server, one on each GPU. NAMD was CPU-bound in previous versions; the new 3.0 version has reduced the CPU dependence. As a result, the three-copy simulation scaled linearly, delivering three times the single-GPU throughput across all datasets.

As part of the optimization, the NAMD benchmark numbers in the following figure show the relative performance difference using different numbers of CPU cores for the STMV dataset:

Figure 5.  CPU core dependency on NAMD

The AMD MI100 GPU exhibited an optimum configuration of four CPU cores per GPU.

Conclusion

The AMD MI100 accelerator delivers industry-leading performance, and it is a well-positioned performance per dollar GPU for both FP32 and FP64 HPC parallel codes.

  • FP32 applications perform well using the AMD MI100 GPU based on the SGEMM, LAMMPS, and NAMD benchmarks performance by using tensor cores and native FP32 compute cores.
  • FP64 applications perform well using the AMD MI100 GPU by using native FP64 compute cores.

Next Steps

In the future, we plan to test other HPC and Deep Learning applications. We also plan to research using “Hipify” tools to port CUDA sources to HIP.

NVIDIA PowerEdge HPC GPU AMD

HPC Application Performance on Dell PowerEdge R7525 Servers with NVIDIA A100 GPGPUs

Savitha Pareek Ashish K Singh Frank Han

Tue, 24 Nov 2020 17:39:49 -0000


Overview

The Dell PowerEdge R7525 server powered with 2nd Gen AMD EPYC processors was released as part of the Dell server portfolio. It is a 2U form factor rack-mountable server that is designed for HPC workloads. Dell Technologies recently added support for NVIDIA A100 GPGPUs to the PowerEdge R7525 server, which supports up to three PCIe-based dual-width NVIDIA GPGPUs. This blog describes the single-node performance of selected HPC applications with both one- and two-NVIDIA A100 PCIe GPGPUs.

The NVIDIA Ampere A100 accelerator is one of the most advanced accelerators available in the market, supporting two form factors: 

  • PCIe version 
  • Mezzanine SXM4 version

 The PowerEdge R7525 server supports only the PCIe version of the NVIDIA A100 accelerator. 

The following table compares the NVIDIA A100 GPGPU with the NVIDIA V100S GPGPU: 


NVIDIA A100 GPGPU

NVIDIA V100S GPGPU

Form factor

SXM4

PCIe Gen4

SXM2

PCIe Gen3

GPU architecture

Ampere

Volta

Memory size

40 GB

40 GB

32 GB

32 GB

CUDA cores

6912

5120

Base clock

1095 MHz

765 MHz

1290 MHz

1245 MHz

Boost clock

1410 MHz

1530 MHz

1597 MHz

Memory clock

1215 MHz

877 MHz

1107 MHz

MIG support

Yes

No

Peak memory bandwidth

Up to 1555 GB/s

Up to 900 GB/s

Up to 1134 GB/s

Total board power

400 W

250 W

300 W

250 W

The NVIDIA A100 GPGPU brings innovations and features for HPC applications such as the following:

  • Multi-Instance GPU (MIG)—The NVIDIA A100 GPGPU can be converted into as many as seven GPU instances, which are fully isolated at the hardware level, each using their own high-bandwidth memory and cores. 
  • HBM2—The NVIDIA A100 GPGPU comes with 40 GB of high-bandwidth memory (HBM2) and delivers bandwidth up to 1555 GB/s. Memory bandwidth with the NVIDIA A100 GPGPU is 1.7 times higher than with the previous generation of GPUs. 

Server configuration

The following table shows the PowerEdge R7525 server configuration that we used for this blog:

Server

PowerEdge R7525

Processor

2nd Gen AMD EPYC 7502, 32C, 2.5Ghz

Memory

512 GB (16 x 32 GB @3200MT/s)

GPGPUs

Either of the following:

2 x NVIDIA A100 PCIe 40 GB

2 x NVIDIA V100S PCIe 32 GB

Logical processors

Disabled

Operating system

CentOS Linux release 8.1 (4.18.0-147.el8.x86_64)

CUDA

11.0 (Driver version - 450.51.05)

gcc

9.2.0

MPI

OpenMPI-3.0

HPL

hpl_cuda_11.0_ompi-4.0_ampere_volta_8-7-20

HPCG

xhpcg-3.1_cuda_11_ompi-3.1

GROMACS

v2020.4

Benchmark results

The following sections provide our benchmarks results with observations.

High-Performance Linpack benchmark

High Performance Linpack (HPL) is a standard HPC system benchmark. This benchmark measures the compute power of the entire cluster or server. For this study, we used HPL compiled with NVIDIA libraries.

The following figure shows the HPL performance comparison for the PowerEdge R7525 server  with either NVIDIA A100 or NVIDIA V100S GPGPUs:

Figure 1: HPL performance on the PowerEdge R7525 server with the NVIDIA A100 GPGPU compared to the NVIDIA V100S GPGPU

The problem size (N) is larger for the NVIDIA A100 GPGPU due to the larger capacity of GPU memory. We adjusted the block size (NB) used with the:

  • NVIDIA A100 GPGPU to 288
  • NVIDIA V100S GPGPU to 384

The AMD EPYC processors provide options for multiple NUMA combinations. We found that 4 NUMA nodes per socket (NPS=4) gave the best performance; NPS=1 and NPS=2 lowered the performance by 10 percent and 5 percent respectively. In a single PowerEdge R7525 node, the NVIDIA A100 GPGPU delivers 12 TF per card using this configuration without an NVLINK bridge. The PowerEdge R7525 server with two NVIDIA A100 GPGPUs delivers 2.3 times higher HPL performance compared to the NVIDIA V100S GPGPU configuration. This performance improvement is credited to the new double-precision Tensor Cores that accelerate FP64 math.
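The link between GPU memory capacity, problem size (N), and block size (NB) can be illustrated with a common sizing rule of thumb: fill a fraction of the available memory with the N x N double-precision matrix and round N down to a multiple of NB. This is a generic sketch, not the procedure used for the results in this blog; the 80 percent fill fraction and the use of GiB for GPU memory are assumptions.

```python
import math


def hpl_problem_size(total_mem_bytes: int, nb: int, fraction: float = 0.8) -> int:
    """Largest N (multiple of NB) whose N x N float64 matrix fits in fraction * memory."""
    # Each matrix element is an 8-byte double, so the matrix needs 8 * N * N bytes.
    n = int(math.sqrt(fraction * total_mem_bytes / 8))
    return (n // nb) * nb


GIB = 2**30
n_a100 = hpl_problem_size(2 * 40 * GIB, nb=288)   # two 40 GB A100 cards, NB=288
n_v100s = hpl_problem_size(2 * 32 * GIB, nb=384)  # two 32 GB V100S cards, NB=384
print(n_a100, n_v100s)
```

The sketch shows why the larger memory of the NVIDIA A100 GPGPU allows a larger N; the actual N used in a tuned run also depends on the HPL build and its workspace requirements.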

The following figure shows power consumption of the server while running HPL on the NVIDIA A100 GPGPU in a time series. Power consumption was measured with an iDRAC. The server reached 1038 Watts at peak due to a higher GFLOPS number.

Figure 2: Power consumption while running HPL

High Performance Conjugate Gradient benchmark

The High Performance Conjugate Gradient (HPCG)  benchmark is based on a conjugate gradient solver, in which the preconditioner is a three-level hierarchical multigrid method using the Gauss-Seidel method. 

As shown in the following figure, HPCG performs at a rate 70 percent higher with the NVIDIA A100 GPGPU due to higher memory bandwidth:  

Figure 3: HPCG performance comparison 

Due to different memory size, the problem size used to obtain the best performance on the NVIDIA A100 GPGPU was 512 x 512 x 288 and on the NVIDIA V100S GPGPU was 256 x 256 x 256. For this blog, we used NUMA per socket (NPS)=4 and we obtained results without an NVLINK bridge. These results show that applications such as HPCG, which fits into GPU memory, can take full advantage of GPU memory and benefit from the higher memory bandwidth of the NVIDIA A100 GPGPU.       

GROMACS

In addition to these two basic HPC benchmarks (HPL and HPCG), we also tested GROMACS, an HPC application. We compiled GROMACS 2020.4 with the CUDA compilers and OPENMPI, as shown in the following table: 

Figure 4: GROMACS performance with NVIDIA GPGPUs on the PowerEdge R7525 server

 The GROMACS build included thread MPI (built in with the GROMACS package). All performance numbers were captured from the output “ns/day.” We evaluated multiple MPI ranks, separate PME ranks, and different nstlist values to achieve the best performance. In addition, we used settings with the best environment variables for GROMACS at runtime. Choosing the right combination of variables avoided expensive data transfer and led to significantly better performance for these datasets.

GROMACS performance was based on a comparative analysis between NVIDIA V100S and NVIDIA A100 GPGPUs. Excerpts from our single-node multi-GPU analysis for two datasets showed a performance improvement of approximately 30 percent with the NVIDIA A100 GPGPU. This result is due to improved memory bandwidth of the NVIDIA A100 GPGPU. (For information about how the GROMACS code design enables lower memory transfer overhead, see Developer Blog: Creating Faster Molecular Dynamics Simulations with GROMACS 2020.)

Conclusion

The Dell PowerEdge R7525 server equipped with NVIDIA A100 GPGPUs shows exceptional performance improvements over servers equipped with previous versions of NVIDIA GPGPUs for applications such as HPL, HPCG, and GROMACS. Memory-bound applications such as HPCG and GROMACS can take advantage of the higher memory bandwidth available with NVIDIA A100 GPGPUs.

 

HPC

Ready Solution for HPC PixStor Storage Capacity Expansion, HDR100 Update

Mario Gallegos

Tue, 17 Nov 2020 20:30:50 -0000


Introduction

Today’s HPC environments have ever increasing demands for very high-speed storage that frequently must also provide high capacity and distributed access via several standard protocols such as NFS and SMB. These high demand HPC requirements are typically covered by Parallel File Systems that provide concurrent access to a single file or set of files from multiple nodes, very efficiently and securely distributing data to multiple LUNs across several servers using the network technology with the highest speed available.

Solution Architecture

This blog is a technology update for the use of Infiniband HDR100 on the Dell EMC Ready Solution for HPC PixStor Storage, a Parallel File System (PFS) solution for HPC environments where PowerVault ME484 EBOD arrays are used to increase the capacity of the solution. Figure 1 presents the reference architecture depicting the capacity expansion SAS additions to the existing PowerVault ME4084 storage arrays, replacing Infiniband EDR components with HDR100: ConnectX-6 HCAs and QM8700 switches. The PixStor solution includes the widespread General Parallel File System also known as Spectrum Scale as the PFS component, in addition to many other ArcaStream software components including advanced analytics, simplified administration and monitoring, efficient file search, advanced gateway capabilities and many others.

Figure 1 Reference Architecture

Solution Components

This solution was released with the latest 2nd generation Intel Xeon Scalable CPUs (Cascade Lake), and some of the servers will use the fastest RAM available (2933 MT/s). However, due to hardware availability during testing, the solution prototype used servers with 1st generation Intel Xeon Scalable CPUs (Skylake) and slower RAM to characterize the performance. Since the bottleneck of the solution is at the SAS controllers of the DellEMC PowerVault ME40x4 arrays, no significant performance disparity is expected once the Skylake CPUs and RAM are replaced with Cascade Lake CPUs and faster RAM. In addition, the solution was updated to the latest version of PixStor (5.1.3.1) that supports RHEL 7.7 and OFED 5.0-2.1.8.

Table 1 shows the list of main components for the solution where the first column has components used at release time and therefore available to customers, and the last column has the components actually used for characterizing the performance of the solution. The drives listed for data (12TB NLS) and metadata (960GB SSD), are the ones used for performance characterization, and faster drives can provide better Random IOPs and may improve create/removal metadata operations.

Finally, for completeness, the list of possible data HDDs and metadata SSDs was included, which is based on the drives supported as enumerated on the DellEMC PowerVault ME4 support matrix, available online.

Table 1 Components to be used at release time and those used in the test bed

Solution Component

Released

Test Bed

Internal Mgmt Connectivity

Dell Networking S3048-ON Gigabit Ethernet

Data Storage Subsystem

1x to 4x Dell EMC PowerVault ME4084
 1x to 4x Dell EMC PowerVault ME484 (One per ME4084)


80 – 12TB 3.5" NL SAS3 HDD drives
8 LUNs, linear 8+2 RAID 6, chunk size 512KiB.
Options: 900GB @15K, 1.2TB @10K, 1.8TB @10K,
2.4TB @10K, 4TB NLS, 8TB NLS, 12TB NLS, 16TB NLS.
4 - 960GB SAS3 SSDs for Metadata – 2x RAID 1 (or 4 - Global HDD
spares, if Optional High Demand Metadata Module is used)
       Options: 480GB, 960GB, 1.92TB and 3.84TB.

Optional High Demand Metadata Storage Subsystem

1x to 2x (max 4x) Dell EMC PowerVault ME4024
(for 4x ME4024, supported on Large config only)
Each ME4024: 12 LUNs, linear RAID 1.
24 – 960GB 2.5" SSD SAS3 drives (Options 480GB, 960GB, 1.92TB,
          3.84TB)

RAID Storage Controllers

Redundant 12 Gbps SAS

Capacity w/o Expansion

Raw: 4032 TB (3667 TiB or 3.58 PiB) with 12TB HDDs
 Formatted ~ 3072 TB (2794 TiB or 2.73 PiB)

Capacity w/Expansion

Raw: 8064 TB (7334 TiB or 7.16 PiB) with 12TB HDDs

Formatted ~ 6144 TB (5588 TiB or 5.46 PiB)

Processor

Gateway/Ngenea (R740)

2x Intel Xeon Gold 6230 2.1G, 20C/40T, 10.4GT/s, 27.5M Cache, Turbo, HT (125W) DDR4-2933

2x Intel Xeon Gold 6136 @
 3.0 GHz, 12 cores

High Demand Metadata (R740)

Storage Node (R740)

Management Node (R440)

2x Intel Xeon Gold 5220 2.2G, 18C/36T, 10.4GT/s, 24.75M Cache, Turbo, HT (125W) DDR4-2666

2x Intel Xeon Gold 5118 @ 2.30GHz, 12 cores

Memory

Gateway/Ngenea (R740)

12 x 16GiB 2933 MT/s RDIMMs
 (192 GiB)

24x 16GiB 2666 MT/s RDIMMs (384 GiB)

High Demand Metadata (R740)

Storage Node (R740)

Management Node (R440)

12 X 16GB RDIMMs, 2666 MT/s (192GiB)

12x 8GiB 2666 MT/s RDIMMs (96 GiB)

Operating System

CentOS 7.7

Kernel version

3.10.0-1062.12.1.el7.x86_64

PixStor Software

5.1.3.1

OFED Version

Mellanox OFED 5.0-2.1.8

High Performance Network Connectivity

Mellanox ConnectX-6 Dual-Port InfiniBand VPI HDR100/100 GbE, and 10 GbE

High Performance Switch

2x Mellanox QM8700 (HA – Redundant)

Local Disks (OS & Analysis/monitoring)

All servers except Management node

3x 480GB SSD SAS3 (RAID1 + HS) for OS

PERC H730P RAID controller
Management Node

3x 480GB SSD SAS3 (RAID1 + HS) for OS & Analysis/Monitoring

PERC H740P RAID controller

All servers except Management node

2x 300GB 15K SAS3 (RAID 1) for OS

PERC H330 RAID controller

Management Node

5x 300GB 15K SAS3 (RAID 5) for OS &
       Analysis/monitoring

PERC H740P RAID controller

Systems Management

iDRAC 9 Enterprise + DellEMC OpenManage
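The capacity figures in Table 1 can be reproduced with a short calculation. The sketch below assumes, as inferred from the table totals, that raw capacity counts 84 drives of 12 TB per enclosure and that formatted capacity counts the 80 data HDDs per enclosure in 8+2 RAID 6 groups (8 of every 10 drives hold data). The function and parameter names are illustrative.

```python
TB = 10**12   # decimal terabyte
TIB = 2**40   # binary tebibyte


def capacity_tb(enclosures, slots=84, data_hdds=80, hdd_tb=12, data_disks=8, parity_disks=2):
    """Return (raw, formatted) capacity in TB for a number of 84-slot enclosures."""
    raw = enclosures * slots * hdd_tb
    # 8+2 RAID 6: only data_disks out of every (data_disks + parity_disks) hold data.
    formatted = enclosures * data_hdds * hdd_tb * data_disks / (data_disks + parity_disks)
    return raw, formatted


def tb_to_tib(tb: float) -> float:
    """Convert decimal terabytes to binary tebibytes."""
    return tb * TB / TIB


for label, enclosures in (("without expansion", 4), ("with expansion", 8)):
    raw, formatted = capacity_tb(enclosures)
    print(f"{label}: raw {raw} TB = {tb_to_tib(raw):.0f} TiB, "
          f"formatted {formatted:.0f} TB = {tb_to_tib(formatted):.0f} TiB "
          f"({tb_to_tib(formatted) / 1024:.2f} PiB)")
```

Four ME4084 arrays correspond to the capacity without expansion, and adding one ME484 per ME4084 doubles the enclosure count to eight.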

 

Performance Characterization

To characterize this Ready Solution, we used the hardware specified in the last column of Table 1, including the optional High Demand Metadata Module. In order to assess the solution performance, the following benchmarks were used:

  • IOzone N to N sequential
  • IOzone random
  • IOR N to 1 sequential
  • MDtest 

For all benchmarks listed above, the test bed had the clients described in Table 2 below. Since only 16 compute nodes were available for testing, when a higher number of threads was required, those threads were equally distributed on the compute nodes (i.e. 32 threads = 2 threads per node, 64 threads = 4 threads per node, 128 threads = 8 threads per node, 256 threads = 16 threads per node, 512 threads = 32 threads per node, 1024 threads = 64 threads per node). The intention was to simulate a higher number of concurrent clients with the limited number of compute nodes available. Since the benchmarks support a high number of threads, a maximum value up to 1024 was used (specified for each test), while avoiding excessive context switching and other related side effects that can affect performance results.
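The thread distribution described above is an even split across the 16 client nodes. The following minimal sketch shows that arithmetic; the handling of thread counts that do not divide evenly (the first nodes receive one extra thread) is an assumption added for completeness.

```python
def threads_per_node(total_threads: int, nodes: int = 16) -> list:
    """Spread total_threads as evenly as possible over the given number of nodes."""
    base, extra = divmod(total_threads, nodes)
    # Assumption: when the split is uneven, the first `extra` nodes get one more thread.
    return [base + (1 if i < extra else 0) for i in range(nodes)]


for total in (32, 64, 128, 256, 512, 1024):
    print(total, "threads ->", threads_per_node(total)[0], "threads per node")
```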

Table 2 Client test bed

Number of Client nodes

16

Client node

C6320

Processors per client node

2 x Intel(R) Xeon(R) E5-2697 v4, 18 cores @ 2.30GHz

Memory per client node

12 x 16GiB 2400 MT/s RDIMMs

High Performance Adapter

Mellanox ConnectX-4 InfiniBand VPI

Operating System

CentOS 7.6

OS Kernel

3.10.0-957.10.1

PixStor Software

5.1.3.1

OFED Version

Mellanox OFED 5.0-1.0.0

Sequential IOzone Performance N clients to N files

Sequential N clients to N files performance was measured with IOzone version 3.487. Tests executed varied from single thread up to 512 threads on the capacity expanded solution (4x ME4084s + 4x ME484s); results from the EDR testing are contrasted with the HDR100 update. 

Caching effects were minimized by setting the file system tunable page pool to 8 GiB on the clients and 24 GiB on the servers and using files twice the total memory size of the clients or servers (whichever value is larger). It is important to note that for the file system, the page pool tunable sets the maximum amount of memory used by the file system for caching data, regardless of the amount of RAM installed and free. It is also important to note that while in previous DellEMC HPC solutions the block size for large sequential transfers was 1 MiB, this file system was formatted with 8 MiB blocks, and therefore that value is used in the benchmark for optimal performance. That block size may look too large and appear to waste too much space, but the file system uses subblock allocation to prevent that situation. In the current configuration, each block was subdivided into 256 subblocks of 32 KiB each.
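Two of the numbers above are easy to verify: the subblock size, and the minimum amount of data needed to defeat caching. The sketch below uses the client memory from Table 2 (16 nodes with 12 x 16 GiB each); the total server memory is a placeholder, because the number of servers is not restated here.

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30


def subblock_size(block_bytes: int, subblocks_per_block: int) -> int:
    """Size in bytes of one subblock when a block is split evenly."""
    return block_bytes // subblocks_per_block


def min_total_data(client_mem_bytes: int, server_mem_bytes: int, factor: int = 2) -> int:
    """Total data to write: `factor` times the larger of the two memory totals."""
    return factor * max(client_mem_bytes, server_mem_bytes)


print(subblock_size(8 * MIB, 256) // KIB, "KiB per subblock")

client_mem = 16 * 12 * 16 * GIB   # 16 clients with 12 x 16 GiB RDIMMs each (Table 2)
server_mem = 768 * GIB            # placeholder total for the servers, not from this blog
print(min_total_data(client_mem, server_mem) // GIB, "GiB of data in total")
```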

The following commands were used to execute the benchmark for writes and reads, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and threadlist was the file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

./iozone -i0 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist

./iozone -i1 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist
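As an illustration of how a threadlist file could be built, the sketch below assigns threads to nodes in round robin order. IOzone's -+m client file has one line per thread with the client name, a working directory on that client, and the path to the IOzone executable. The host names, working directory, and executable path used here are hypothetical placeholders, not the values used in the tests.

```python
def build_threadlist(threads, hosts, workdir, iozone_path):
    """Return one client-file line per thread, cycling through hosts in order."""
    lines = []
    for thread in range(threads):
        host = hosts[thread % len(hosts)]  # round robin over the nodes
        lines.append(f"{host} {workdir} {iozone_path}")
    return lines


if __name__ == "__main__":
    # Hypothetical names: node01..node16, /mmfs1/perftest, /usr/local/bin/iozone
    hosts = [f"node{i:02d}" for i in range(1, 17)]
    lines = build_threadlist(32, hosts, "/mmfs1/perftest", "/usr/local/bin/iozone")
    with open("threadlist", "w") as handle:
        handle.write("\n".join(lines) + "\n")
```

With 32 threads and 16 hosts, each host appears twice in the file, matching the 2 threads per node distribution described earlier.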

 

Figure 2  N to N Sequential Performance

From the results we can observe that performance rises quickly with the number of clients used and then reaches a plateau that is fairly stable up to the maximum number of threads that IOzone allows. Large file sequential performance is therefore stable, except at 512 concurrent threads (about 8% lower). Starting at 8 threads, read performance was limited by the bandwidth of the two IB HDR100 links used on the storage nodes, with a maximum of 23.8 GB/s at 32 threads. Read performance at 4 threads is considerably lower, and at high thread counts it is a bit lower compared to EDR (less than 5%), but the results are reproducible. Since the sequential N to 1 test using IOR uses the same data size and similar parameters but on a single file (adding locking overhead), the big drop in read performance at 4 threads (and, to a much smaller degree, at high thread counts) may be due to IOzone using calls that work less efficiently than the IOR calls, but more work is needed to find the reason for the different behavior.

The highest write performance of 21 GB/s was achieved at 512 threads. It is important to remember that for the PixStor file system the preferred mode of operation is scatter, and the solution was formatted to use that mode. In this mode, blocks are allocated from the very beginning of operation in a pseudo-random fashion, spreading data across the whole surface of each HDD. While the obvious disadvantage is a smaller initial maximum performance, that performance remains fairly constant regardless of how much space is used on the file system. This is in contrast to other parallel file systems that initially use the outer tracks, which can hold more data (sectors) per disk revolution and therefore deliver the highest performance the HDDs can provide; as the system uses more space, inner tracks with less data per revolution are used, with the consequent reduction in performance.

Sequential IOR Performance N clients to 1 file

Sequential N clients to a single shared file performance was measured with IOR version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from one thread up to 512 threads (since there were not enough cores for 1024 threads), and results are contrasted with the solution without the capacity expansion.

Caching effects were minimized by setting the file system page pool tunable to 8 GiB on the clients and 24 GiB on the servers and using a total data size bigger than twice the total memory size of clients or servers (whichever is larger). This benchmark used 8 MiB blocks for optimal performance. The previous performance test section has a more complete explanation of these matters.

The following commands were used to execute the benchmark for writes and reads, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads was the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -w -s 1 -t 8m -b 128G
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -r -s 1 -t 8m -b 128G 
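One possible layout for a my_hosts.$Threads file is sketched below, using the OpenMPI hostfile syntax of a host name followed by `slots=N`. The actual files used in the tests are not shown in this blog, so the host names and the choice to express the round robin result as slot counts are assumptions for illustration only.

```python
def hostfile_lines(threads, hosts):
    """Spread ranks over hosts in round robin order and emit 'host slots=N' lines."""
    counts = {host: 0 for host in hosts}
    for rank in range(threads):
        counts[hosts[rank % len(hosts)]] += 1
    # Hosts that received no rank are left out of the file.
    return [f"{host} slots={n}" for host, n in counts.items() if n > 0]


if __name__ == "__main__":
    hosts = [f"node{i:02d}" for i in range(1, 17)]  # hypothetical host names
    for threads in (1, 2, 4, 8, 16, 32, 64, 128, 256, 512):
        with open(f"my_hosts.{threads}", "w") as handle:
            handle.write("\n".join(hostfile_lines(threads, hosts)) + "\n")
```

For thread counts below 16, only the first few nodes are listed with one slot each; from 16 threads upward every node appears with an equal slot count.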

 

Figure 3  N to 1 Sequential Performance

Performance rises quickly with the number of clients used and then reaches a plateau that is fairly stable for reads and writes from 8 threads all the way to the maximum number of threads used in this test. The maximum read performance was 24.2 GB/s at 32 threads, and the bottleneck was the InfiniBand HDR100 interface, apparently at higher than line speed. Similarly, the maximum write performance of 19.9 GB/s was reached at 16 threads. An important data point is at 4 threads: even though this test uses the same data size and parameters as IOzone, with the extra burden of locking, no performance drop like the one seen with IOzone is observed.

Random small blocks IOzone Performance N clients to N files

Random N clients to N files performance was measured with IOzone 3.487. Tests executed varied from 16 threads up to 512 threads since there were not enough client cores for 1024 threads. Lower thread counts were not tested at this time because they take a very long time to execute, IOzone does not report results until the test has completed in its entirety, and the most important information tends to be the maximum IOPS that the solution can provide. Each thread used a different file, and the threads were assigned in a round robin fashion to the client nodes. This benchmark used 4 KiB blocks to emulate small block traffic.

Caching effects were minimized by setting the file system page pool tunable to 8 GiB on the clients and 24 GiB on the servers and using a total data size bigger than twice the total page pool size of clients or servers (whichever is larger). It is important to note that the page pool tunable sets the maximum amount of memory used by the file system for caching data, regardless of the amount of RAM installed and free.

 

Figure 4  N to N Random Performance

From the results we can observe that write performance starts at a high value of 23.4K IOPS and remains under 25K up to 256 threads, where it peaks at 27.4K IOPS. Read performance, on the other hand, starts at 1.3K IOPS and increases almost linearly with the number of threads used (keep in mind that the number of threads is doubled for each data point), reaching the maximum of 33.8K IOPS at 512 threads. Using more threads would require more than the 16 compute nodes, or more cores per node, to avoid loss of performance due to process context switching, data locality and similar effects. ME4 arrays require a higher IO pressure (queue or IO depth) to reach their maximum random IOPS, so this test shows a lower apparent performance; the arrays could in fact deliver more IOPS with tests like FIO that can control the IO depth per process.

Metadata performance with MDtest using empty files

Metadata performance was measured with MDtest version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from a single thread up to 512 threads. The benchmark was used for files only (no directory metadata), measuring the number of creates, stats and removes that the solution can handle, and results were contrasted with previous EDR results.

To properly evaluate the solution in comparison to other DellEMC HPC storage solutions and the previous blog results, the optional High Demand Metadata Module was used, but with a single ME4024 array, even though the large configuration tested in this work was designated to have two ME4024s.

This High Demand Metadata Module can support up to four ME4024 arrays, and it is suggested to increase the number of ME4024 arrays to four before adding another metadata module. Each additional ME4024 array is expected to increase metadata performance, except perhaps for Stat operations: since their IOPS numbers are very high, at some point the CPUs will become a bottleneck and performance will not continue to increase linearly.

The following command was used to execute the benchmark, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes. Similar to the Random IO benchmark, the maximum number of threads was limited to 512, since there are not enough cores for 1024 threads and context switching would affect the results, reporting a number lower than the real performance of the solution.

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F

Since performance results can be affected by the total number of IOPS, the number of files per directory and the number of threads, it was decided to fix the total number of files at 2 Mi files (2^21 = 2097152) and the number of files per directory at 1024, and to vary the number of directories as the number of threads changed, as shown in Table 3.

Table 3 MDtest distribution of files on directories

Number of Threads | Number of directories per thread | Total number of files
1 | 2048 | 2,097,152
2 | 1024 | 2,097,152
4 | 512 | 2,097,152
8 | 256 | 2,097,152
16 | 128 | 2,097,152
32 | 64 | 2,097,152
64 | 32 | 2,097,152
128 | 16 | 2,097,152
256 | 8 | 2,097,152
512 | 4 | 2,097,152
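The directory counts in Table 3 follow directly from the fixed totals. A minimal sketch of the calculation, with illustrative function and constant names, is:

```python
TOTAL_FILES = 2 ** 21      # 2,097,152 files in every run
FILES_PER_DIR = 1024       # value passed to mdtest with -I


def directories_per_thread(threads: int) -> int:
    """Directories each thread needs so the total file count stays fixed."""
    return TOTAL_FILES // (FILES_PER_DIR * threads)


if __name__ == "__main__":
    for threads in (1, 2, 4, 8, 16, 32, 64, 128, 256, 512):
        dirs = directories_per_thread(threads)
        print(threads, dirs, threads * dirs * FILES_PER_DIR)
```

The result for a given thread count is the value substituted for $Directories in the mdtest command.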


 

Figure 5  Metadata Performance - Empty Files

First, note that the scale chosen was logarithmic with base 10, to allow comparing operations that differ by several orders of magnitude; otherwise some of the operations would look like a flat line close to 0 on a linear graph. A log graph with base 2 could be more appropriate, since the number of threads is increased in powers of 2, but the graph would look very similar, and people tend to handle and remember numbers based on powers of 10 better.

The system gets very good results, with Stat operations reaching their peak value of 6M op/s at 256 threads. Removal operations attained their maximum of 189.7K op/s at 32 threads, and Create operations achieved their peak of 266.8K op/s at 512 threads. Stat operations have more variability, but once they reach their peak value, performance does not drop below 3M op/s. Create and Removal are more stable once they reach a plateau and remain above 160K op/s for Removal and 128K op/s for Create.

Metadata performance with MDtest using 4 KiB files

This test is almost identical to the previous one, except that small files of 4 KiB were used instead of empty files.

The following command was used to execute the benchmark, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F -w 4K -e 4K

 

Figure 6  Metadata Performance - Small files (4K)

The system gets very good results for Stat operations, reaching a peak value of almost 4.9M op/s at 512 threads. Removal operations attained their maximum of 442.7K op/s at 128 threads, and Create operations achieved their peak of 75K op/s at 512 threads, apparently without reaching a plateau yet. Stat and Removal operations have more variability, but once they reach their peak value, performance does not drop below 3.5M op/s for Stat and 315K op/s for Removal. Create and Read have less variability and keep increasing as the number of threads grows.

Since these numbers are for a metadata module with a single ME4024, performance will increase with each additional ME4024 array; however, we cannot simply assume a linear increase per array. Unless the whole file fits inside its inode, data targets on the ME4084s will be used to store part of the 4K files, limiting the performance to some degree. Since the inode size is 4 KiB and it still needs to store metadata, only files of around 3 KiB will fit inside, and any file bigger than that will use data targets.

Conclusions and Future Work

The solution has similar performance to that observed with the InfiniBand EDR technology. An overview of the performance for HDR100 can be seen in Table 4; it is expected to be stable from an empty file system until it is almost full because of the use of scatter allocation across the whole surface area of all HDDs. Furthermore, the solution scales in capacity and performance linearly as more storage node modules are added, and a similar performance increase can be expected from the optional high demand metadata module. This solution provides HPC customers with a very reliable parallel file system used by many Top 500 HPC clusters. In addition, it provides exceptional search capabilities, advanced monitoring and management. With the addition of optional gateway nodes, it allows file sharing via ubiquitous standard protocols like NFS and SMB to as many clients as needed. Finally, Ngenea nodes allow efficient access to other cost-effective storage tiers such as ECS, Isilon enterprise NAS and Cloud solutions using different protocols.

Table 4  Peak & Sustained Performance

 

Benchmark | Peak Write | Peak Read | Sustained Write | Sustained Read
Large Sequential N clients to N files | 21.0 GB/s | 23.8 GB/s | 20.5 GB/s | 23.0 GB/s
Large Sequential N clients to single shared file | 19.9 GB/s | 24.2 GB/s | 19.1 GB/s | 23.4 GB/s
Random Small blocks N clients to N files | 33.8K IOps | 27.4K IOps | 33.8K IOps | 23.0K IOps

Benchmark | Peak Performance | Sustained Performance
Metadata Create empty files | 266.8K IOps | 128K IOps
Metadata Stat empty files | 6M IOps | 3M IOps
Metadata Remove empty files | 189.7K IOps | 160K IOps
Metadata Create 4KiB files | 75K IOps | 75K IOps
Metadata Stat 4KiB files | 4.9M IOps | 3.5M IOps
Metadata Remove 4KiB files | 442.7K IOps | 315K IOps

 Performance for the gateway nodes was measured and will be reported in a new blog. Finally, high performance NVMe nodes are being tested and results will also be released in a different blog.

HPC

Ready Solutions for HPC BeeGFS High Performance Storage: HDR100 Refresh

Nirmala Sundararajan

Fri, 13 Nov 2020 21:31:14 -0000

Introduction

True to the tradition of keeping up with technology trends, the Dell EMC Ready Solutions for BeeGFS High Performance Storage, originally released in November 2019, has now been refreshed with the latest software and hardware. The base architecture of the solution remains the same. The following table lists the differences between the initially released InfiniBand EDR based solution and the current InfiniBand HDR100 based solution in terms of the software and hardware used.

Table 1.   Comparison of Hardware and Software of EDR and HDR based BeeGFS High Performance Solution 

Software | Initial Release (Nov. 2019) | Current Refresh (Nov. 2020)
Operating System | CentOS 7.6 | CentOS 8.2
Kernel version | 3.10.0-957.27.2.el7.x86_64 | 4.18.0-193.14.2.el8_2.x86_64
BeeGFS File system version | 7.1.3 | 7.2
Mellanox OFED version | 4.5-1.0.1.0 | 5.0-2.1.8.0

Hardware | Initial Release | Current Refresh
NVMe Drives | Intel P4600 1.6 TB Mixed Use | Intel P4610 3.2 TB Mixed Use
InfiniBand Adapters | ConnectX-5 Single Port EDR | ConnectX-6 Single Port HDR100
InfiniBand Switch | SB7890 InfiniBand EDR 100 Gb/s Switch - 1U (36x EDR 100 Gb/s ports) | QM8790 Quantum HDR Edge Switch - 1U (80x HDR100 100 Gb/s ports using splitter cables)

 This blog presents the architecture, updated technical specifications and the performance characterization of the upgraded high-performance solution. It also includes a comparison of the performance with respect to the previous EDR based solution.

Solution Reference Architecture

The high-level architecture of the solution remains the same as the initial release. The hardware components of the solution consist of 1x PowerEdge R640 as the management server and 6x PowerEdge R740xd servers as metadata/storage servers to host the metadata and storage targets. Each PowerEdge R740xd server is equipped with 24x Intel P4610 3.2 TB Mixed Use Express Flash drives and 2x Mellanox ConnectX-6 HDR100 adapters. Figure 1 shows the reference architecture of the solution. 

 

 

Figure 1.   Dell EMC Ready Solutions for HPC BeeGFS Storage – Reference Architecture

There are two networks: the InfiniBand network and the private Ethernet network. The management server is only connected via Ethernet to the metadata and storage servers. Each metadata and storage server has 2x links to the InfiniBand network and is connected to the private network via Ethernet. The clients have one InfiniBand link and are also connected to the private Ethernet network. For more details on the solution configuration, please refer to the blog and whitepaper on the BeeGFS High Performance Solution published at hpcatdell.com.

Hardware and Software Configuration

Tables 2 and 3 describe the hardware specifications of the management server and the metadata/storage servers respectively. Table 4 describes the versions of the software used for the solution.

Table 2.   PowerEdge R640 Configuration (Management Server)

Component | Description
Processor | 2 x Intel Xeon Gold 5218 2.3GHz, 16 cores
Memory | 12 x 8GB DDR4 2666MT/s DIMMs - 96GB
Local Disks | 6 x 300GB 15K RPM SAS 2.5in HDDs
RAID Controller | PERC H740P Integrated RAID Controller
Out of Band Management | iDRAC9 Enterprise with Lifecycle Controller

Table 3.   PowerEdge R740xd Configuration (Metadata and Storage Servers)

Component | Description
Processor | 2x Intel Xeon Platinum 8268 CPU @ 2.90GHz, 24 cores
Memory | 12 x 32GB DDR4 2933MT/s DIMMs - 384GB
BOSS Card | 2x 240GB M.2 SATA SSDs in RAID 1 for OS
Local Drives | 24x Dell Express Flash NVMe P4610 3.2 TB 2.5” U.2
InfiniBand Adapter | 2x Mellanox ConnectX-6 single port HDR100 Adapter
InfiniBand Adapter Firmware | 20.26.4300
Out of Band Management | iDRAC9 Enterprise with Lifecycle Controller

 Table 4.   Software Configuration (Metadata and Storage Servers)

Component | Description
Operating System | CentOS Linux release 8.2.2004 (Core)
Kernel version | 4.18.0-193.14.2.el8_2.x86_64
Mellanox OFED | 5.0-2.1.8.0
NVMe SSDs | VDV1DP23
OpenMPI | 4.0.3rc4
Intel Data Center Tool | v 3.0.26
BeeGFS | 7.2
Grafana | 7.1.5-1
InfluxDB | 1.8.2-1
IOzone | 3.490
MDtest | 3.3.0+dev

Performance Evaluation

The system performance was evaluated using IOzone for the sequential and random N-N tests and MDtest for the metadata tests, as described in the sections below.

Performance tests were run on a testbed with clients as described in Table 5. For test cases where the number of IO threads was greater than the number of physical IO clients, threads were distributed equally across the clients (i.e., 32 threads = 2 threads per client, ..., 1024 threads = 64 threads per client).

Table 5.   Client Configuration

Component | Description
Server model | 8x PowerEdge R840; 8x PowerEdge C6420
Processor | 4x Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, 24 cores (R840); 2x Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, 20 cores (C6420)
Memory | 24 x 16GB DDR4 2933MT/s DIMMs - 384GB (R840); 12 x 16GB DDR4 2933MT/s DIMMs - 192GB (C6420)
Operating System | Red Hat Enterprise Linux release 8.2 (Ootpa)
Kernel version | 4.18.0-193.el8.x86_64
InfiniBand Adapter | 1x ConnectX-6 single port HDR100 adapter
OFED version | 5.0-2.1.8.0

Sequential Reads and Writes N-N

The IOzone benchmark was used in the sequential read and write mode to evaluate sequential reads and writes. These tests were conducted using multiple thread counts starting at 1 thread and up to 1024 threads. At each thread count, an equal number of files were generated since this test works on one file per thread or the N-N case. The round robin algorithm was used to choose targets for file creation in a deterministic fashion. 

For all the tests, aggregate file size was 8 TB and this was equally divided among the number of threads for any given test. The aggregate file size chosen was large enough to minimize the effects of caching from the servers as well as from BeeGFS clients.

IOzone was run in a combined mode of write then read (-i 0, -i 1) to allow it to coordinate the boundaries between the operations. For this test, we used a 1 MiB record size for every run. The commands used for the sequential N-N tests are given below:

Sequential Writes and Reads: iozone -i 0 -i 1 -c -e -w -r 1m -I -s $Size -t $Thread -+n -+m /path/to/threadlist
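The per-thread file size passed as $Size follows from dividing the aggregate size by the thread count. The sketch below assumes the 8 TB aggregate is interpreted in binary units (8192 GiB) and formatted with IOzone's `g` size suffix; both choices are assumptions, since the exact values used in the tests are not listed.

```python
AGGREGATE_GIB = 8 * 1024  # assumed binary interpretation of the 8 TB aggregate


def size_per_thread(threads: int) -> str:
    """Per-thread file size as an IOzone -s argument in gigabytes."""
    if AGGREGATE_GIB % threads:
        raise ValueError("thread count must divide the aggregate size evenly")
    return f"{AGGREGATE_GIB // threads}g"


if __name__ == "__main__":
    for threads in (1, 16, 128, 1024):
        print(threads, size_per_thread(threads))
```

Because thread counts are powers of two, the division is always exact under this assumption, and the aggregate stays constant across runs.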

OS caches were also dropped or cleaned on the client nodes between iterations as well as between write and read tests by running the command:

# sync && echo 3 > /proc/sys/vm/drop_caches

The default stripe count for BeeGFS is 4. However, the chunk size and the number of targets per file can be configured on a per-directory basis. For all these tests, BeeGFS stripe size was chosen to be 2MB and stripe count was chosen to be 3 since we have three targets per NUMA zone as shown below: 

# beegfs-ctl --getentryinfo --mount=/mnt/beegfs /mnt/beegfs/benchmark --verbose
Entry type: directory
EntryID: 0-5F6417B3-1
ParentID: root
Metadata node: storage1-numa0-2 [ID: 2]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 2M
+ Number of storage targets: desired: 3
+ Storage Pool: 1 (Default)
Inode hash path: 33/57/0-5F6417B3-1
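To illustrate what a RAID0 stripe pattern with a 2M chunk size and three targets implies, the sketch below maps a file offset to the position of the target within the stripe. It is a simplified model of generic RAID0-style striping, not BeeGFS internals, and the names are illustrative.

```python
CHUNK_SIZE = 2 * 1024 * 1024   # 2M chunk size from the stripe pattern
STRIPE_COUNT = 3               # desired number of storage targets


def target_index(offset: int) -> int:
    """Position (0..STRIPE_COUNT-1) of the target holding the byte at `offset`."""
    chunk_number = offset // CHUNK_SIZE
    return chunk_number % STRIPE_COUNT


if __name__ == "__main__":
    for chunk in range(6):
        print(f"chunk {chunk} -> target {target_index(chunk * CHUNK_SIZE)}")
```

In this model, consecutive 2M chunks of a file rotate over the three targets, so a large sequential transfer keeps all three targets busy.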

 
The testing methodology and the tuning parameters used were similar to those previously described in the EDR based solution. For additional details in this regard, please refer to the whitepaper on the BeeGFS High Performance Solution.

Note

The performance characterization of the EDR based solution used 32 clients, whereas the performance characterization of the HDR100 based solution used only 16 clients. In the performance charts given below, this is indicated by 16c, which denotes 16 clients, and 32c, which denotes 32 clients. The dotted lines show the performance of the EDR based solution and the solid lines show the performance of the HDR100 based solution.

 

Figure 2.   Sequential IOzone 8 TB aggregate file size 


From Figure 2, we observe that the HDR100 peak read performance is ~131 GB/s and peak write is ~123 GB/s at 1024 threads. As per the technical specifications of the Intel P4610 3.2 TB NVMe SSDs, each drive can provide 3.2 GB/s peak read performance and 3.0 GB/s peak write performance, which allows a theoretical peak of 460.8 GB/s for reads and 432 GB/s for writes for the solution. However, in this solution, the network is the limiting factor. In the setup, we have a total of 11 InfiniBand HDR100 links for the storage servers. Each link can provide a theoretical peak performance of 12.4 GB/s, which allows an aggregate theoretical peak performance of 136.4 GB/s. The achieved peak read and write performance are 96% and 90% respectively of the theoretical peak performance.
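The figures in this paragraph can be reproduced with a few lines of arithmetic. The sketch uses only numbers stated in the text (6 servers with 24 drives each, per-drive and per-link peaks, measured peaks); the variable names are illustrative.

```python
SERVERS = 6
DRIVES_PER_SERVER = 24
DRIVE_READ_GBS = 3.2      # per-drive peak read, GB/s
DRIVE_WRITE_GBS = 3.0     # per-drive peak write, GB/s
LINKS = 11                # InfiniBand HDR100 links for the storage servers
LINK_GBS = 12.4           # theoretical peak per link, GB/s

drives = SERVERS * DRIVES_PER_SERVER
drive_read_peak = drives * DRIVE_READ_GBS
drive_write_peak = drives * DRIVE_WRITE_GBS
network_peak = LINKS * LINK_GBS

read_efficiency = 131 / network_peak    # measured peak read ~131 GB/s
write_efficiency = 123 / network_peak   # measured peak write ~123 GB/s

if __name__ == "__main__":
    print(f"drive peaks: {drive_read_peak:.1f} / {drive_write_peak:.1f} GB/s")
    print(f"network peak: {network_peak:.1f} GB/s")
    print(f"efficiency: {read_efficiency:.0%} read, {write_efficiency:.0%} write")
```

The drive aggregate is more than three times the network aggregate, which is why the network, not the NVMe devices, bounds sequential throughput.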

We observe that the peak read performance for the HDR100 based solution is slightly lower than that observed with the EDR based solution. This can be attributed to the fact that the benchmark tests were carried out using 16 clients for the HDR100 based setup while the EDR based solution used 32 clients. The improved write performance with HDR100 is due to the fact that the P4600 NVMe SSD used in the EDR based solution could provide only 1.3 GB/s for sequential writes whereas the P4610 NVMe SSD provides 3.0 GB/s peak write performance.

We also observe that the read performance is lower than writes for thread counts from 16 to 128. This is because a PCIe read operation is a Non-Posted Operation, requiring both a request and a completion, whereas a PCIe write operation is a Posted Operation that consists of a request only. A PCIe write operation is a fire and forget operation, wherein once the Transaction Layer Packet is handed over to the Data Link Layer, the operation completes. 

Read throughput is typically lower than write throughput because reads require two transactions instead of a single write for the same amount of data. PCI Express uses a split transaction model for reads. The read transaction includes the following steps:

  • The requester sends a Memory Read Request (MRR).
  • The completer sends out the acknowledgement to MRR.
  • The completer returns a Completion with Data. 

The read throughput depends on the delay between the time the read request is issued and the time the completer takes to return the data. However, when the application issues enough read requests to offset this delay, throughput is maximized. A lower throughput is measured when the requester waits for completion before issuing subsequent requests. A higher throughput is registered when multiple requests are issued to amortize the delay after the first data returns. This explains why the read performance is lower than that of the writes from 16 threads to 128 threads, and why an increased throughput is observed for the higher thread counts of 256, 512 and 1024.
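The effect of outstanding requests on read throughput can be illustrated with a simple latency-bound model: throughput grows with the number of requests in flight until the link bandwidth caps it. The payload size, round-trip latency, and link bandwidth below are hypothetical values chosen only for illustration; they are not measurements from this solution.

```python
def read_throughput(outstanding: int,
                    payload_bytes: int = 512,         # hypothetical bytes per completion
                    latency_s: float = 1e-6,          # hypothetical request-to-data delay
                    link_bytes_per_s: float = 15.75e9  # hypothetical link limit
                    ) -> float:
    """Bytes per second with `outstanding` read requests kept in flight."""
    latency_bound = outstanding * payload_bytes / latency_s
    return min(link_bytes_per_s, latency_bound)


if __name__ == "__main__":
    for outstanding in (1, 4, 16, 64):
        gbs = read_throughput(outstanding) / 1e9
        print(f"{outstanding:3d} outstanding requests -> {gbs:.2f} GB/s")
```

With a single request in flight the model is bound by latency; with enough requests in flight it saturates the link, mirroring the behavior described above for low and high thread counts.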

More details regarding PCI Express Direct Memory Access are available at https://www.intel.com/content/www/us/en/programmable/documentation/nik1412547570040.html#nik1412547565760

Random Reads and Writes N-N

IOzone was used in the random mode to evaluate random IO performance. Tests were conducted with thread counts from 8 up to 1024 threads. The Direct IO option (-I) was used to run IOzone so that all operations bypass the buffer cache and go directly to the disk. A BeeGFS stripe count of 3 and a chunk size of 2MB were used. A 4KiB request size was used on IOzone, and performance was measured in I/O operations per second (IOPS). An aggregate file size of 8 TB was selected to minimize the effects of caching. The aggregate file size was equally divided among the number of threads within any given test. The OS caches were dropped between the runs on the BeeGFS servers as well as the BeeGFS clients.

The command used for random reads and writes is given below:

iozone -i 2 -w -c -O -I -r 4K -s $Size -t $Thread -+n -+m /path/to/threadlist

Figure 3.   N-N Random Performance

Figure 3 shows that the random writes peak at ~4.3 Million IOPS at 1024 threads and the random reads peak at ~4.2 Million IOPS at 1024 threads. Both write and read performance are higher when there is a greater number of IO requests. This is because the NVMe standard supports up to 64K I/O queues and up to 64K commands per queue. This large pool of NVMe queues provides higher levels of I/O parallelism, and hence we observe IOPS exceeding 3 Million. The following table provides a comparison of the random IO performance of the P4610 and P4600 NVMe SSDs to better understand the observed results.

Table 6.  Performance Specification of Intel NVMe SSDs

Product | P4610 3.2 TB NVMe SSD | P4600 1.6 TB NVMe SSD
Random Read (100% Span) | 638000 IOPS | 559550 IOPS
Random Write (100% Span) | 222000 IOPS | 176500 IOPS
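The per-drive improvement implied by Table 6 can be expressed as a ratio. The sketch below uses only the specification values from the table; the names are illustrative.

```python
# Per-drive random IOPS specifications from Table 6.
P4610 = {"read": 638_000, "write": 222_000}
P4600 = {"read": 559_550, "write": 176_500}


def improvement(operation: str) -> float:
    """Ratio of P4610 to P4600 specification IOPS for the given operation."""
    return P4610[operation] / P4600[operation]


if __name__ == "__main__":
    for operation in ("read", "write"):
        print(f"random {operation}: {improvement(operation):.2f}x")
```

By specification, the newer drive is roughly 14% faster for random reads and 26% faster for random writes per device.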

 

Metadata Tests

The metadata performance was measured with MDtest and OpenMPI to run the benchmark over the 16 clients. The benchmark was used to measure file creates, stats, and removals performance of the solution. Since performance results can be affected by the total number of IOPS, the number of files per directory and the number of threads, a consistent number of files across tests was chosen to allow comparison. The total number of files was chosen to be ~2M in powers of two (2^21 = 2097152). The number of files per directory was fixed at 1024, and the number of directories varied as the number of threads changed. The test methodology and the directories created are similar to those described in the previous blog.

The following command was used to execute the benchmark:

mpirun -machinefile $hostlist --map-by node -np $threads ~/bin/mdtest -i 3 -b $Directories -z 1 -L -I 1024 -y -u -t -F

 

Figure 4.   Metadata Performance – Empty Files

From Figure 4, we observe that the create, removal and read performance are comparable to those obtained with the EDR based solution, whereas the Stat performance is lower by ~100K IOPS. This may be because the HDR100 based solution uses only 16 clients for performance characterization, whereas the EDR based solution used 32 clients. The file create operations reach their peak value of ~87K op/s at 512 threads. The removal and stat operations attained their maximum values at 32 threads with ~98K op/s and ~392K op/s respectively.

Conclusion

This blog presents the performance characteristics of the Dell EMC High Performance BeeGFS Storage Solution with the latest software and hardware. At the software level, the high-performance solution has now been updated with

  • CentOS 8.2.2004 as the base OS
  • BeeGFS v7.2
  • Mellanox OFED version 5.0-2.1.8.0.

At the hardware level, the solution uses

  • ConnectX-6 Single Port HDR100 adapters
  • Intel P4610 3.2 TB Mixed Use NVMe drives, and
  • Quantum switch QM8790 with 80x HDR100 100 Gb/s ports.

The performance analysis allows us to conclude that:

  • IOzone sequential read and write performance is similar to that of the EDR based solution because the network is the bottleneck here.
  • The IOzone random read and write performance is greater than the previous EDR based solution by ~ 1M IOPS because of the use of P4610 NVMe drives which provide improved random write and read performance.
  • The file create and removal performance is similar to that of the EDR based solution.
  • The file stat performance registers a 19% drop because of the use of only 16 clients in the current solution as compared to the 32 clients used in the previous EDR based solution.

References

Dell EMC Ready Solutions for HPC BeeGFS Storage - Technical White Paper  

Features of Dell EMC Ready Solutions for HPC BeeGFS Storage

Scalability of Dell EMC Ready Solutions for HPC BeeGFS Storage  

Dell EMC Ready Solutions for HPC BeeGFS High Performance Storage

 

 

 
