Blogs

Blogs about Dell Technologies solutions for high performance computing


Financial Risk Assessment using Dell PowerEdge XE9680 Rack Servers with Eight NVIDIA H100 SXM5 80GB GPUs

HB Chen

Wed, 06 Mar 2024 21:08:57 -0000


Equipped with 4th Gen Intel® Xeon® Scalable processors and NVIDIA H100 SXM5 80 GB GPUs, the Dell PowerEdge XE9680 rack server delivers unparalleled performance acceleration, energy efficiency, and the quickest return on investment.

This server provides scalable, high-performance parallel GPU computing capabilities. It also increases application processing speed and optimizes energy efficiency for both compute- and memory-intensive applications.

The PowerEdge XE9680 supports various application domains such as generative AI, deep learning, machine learning, financial computing, and traditional high-performance computing workloads. A 6U form-factor server, the XE9680 is Dell Technologies’ first 8x GPU platform and includes the following advanced features and capabilities:

  • 2x 4th Gen Intel® Xeon® Scalable processors with up to 56 cores per processor
  • 32 DDR5 DIMM slots, supporting up to 4 TB of RDIMM memory at speeds up to 4800 MT/s
  • 8x NVIDIA H100 80GB 700W SXM GPUs or 8x NVIDIA A100 80GB 500W SXM GPUs
  • Up to 10 x16 Gen5 (x16 PCIe) full-height, half-length slots
  • 8 x 2.5 in. NVMe/SAS/SATA SSD Slots (U.2) and NVMe BOSS-N1 RAID controller

STAC-A2 Benchmark Testing

Performance engineers at the Dell HPC and AI Innovation Lab recently tested the PowerEdge XE9680 server using the STAC-A2™ benchmark. This benchmark is the industry standard for testing technology stacks for compute- and memory-intensive analytics in pricing and risk management. STAC-A2™ generates performance, throughput, hardware scalability, test quality, and energy efficiency reports for any technology stack capable of handling the workload, which involves Monte Carlo estimation of Heston-based Greeks for a path-dependent, multi-asset option with early exercise. The results of the performance testing were formally verified, validated, and approved by the STAC organization for publication.
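For background, the Heston model referenced above describes an asset price whose variance is itself stochastic. In its standard textbook form (shown here for context only; it is not taken from the STAC-A2 specification), the risk-neutral dynamics are:

\[
\begin{aligned}
dS_t &= r\,S_t\,dt + \sqrt{v_t}\,S_t\,dW_t^{S},\\
dv_t &= \kappa(\theta - v_t)\,dt + \xi\sqrt{v_t}\,dW_t^{v}, \qquad d\langle W^{S}, W^{v}\rangle_t = \rho\,dt,
\end{aligned}
\]

where S_t is the asset price, v_t the instantaneous variance, kappa the mean-reversion rate, theta the long-run variance, xi the volatility of variance, and rho the correlation between the two Brownian motions. The benchmark prices the option and its Greeks by simulating many such paths across multiple correlated assets.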

Performance results

In the STAC-A2 performance benchmark test, the PowerEdge XE9680 was equipped with eight of NVIDIA's latest H100 80GB SXM5 GPUs. Dell’s HPC and AI Innovation Lab compared this result to three other recent STAC-A2 results using GPU acceleration technologies. The comparison summary is shown in the following charts and table.


Figure 1. PowerEdge XE9680 rack server with 8x NVIDIA H100 GPUs

In 2024, the PowerEdge XE9680 server claimed five new STAC-A2 pricing and risk management performance benchmark records, including:

  1. Processing speed throughput
  2. Baseline Greeks benchmarks
  3. The large Greeks benchmarks
  4. Workload processing capacity
  5. Energy efficiency

Processing speed throughput

This metric measures the ratio of options completed to elapsed time. Dell has set the record for the highest throughput in options per second.


Figure 2. Throughput of options processing speed comparison

Baseline Greeks benchmarks

This metric describes the number of seconds to compute all Greeks with five assets, 25 K paths, and 252 timesteps. Dell has set the record for the fastest warm restart run time in the Baseline Greeks benchmarks.


Figure 3. Baseline Greeks benchmarks Warm Run Time comparison

The large Greeks benchmarks

The number of seconds to compute all Greeks with ten assets, 100 K paths, and 1260 timesteps. Dell set the record for both the fastest warm restart run time and cold restart run time.


Figure 4. Large Greeks benchmarks Warm run time comparison


Figure 5. Large Greeks benchmarks Cold run time comparison

Workload processing capacity

This benchmark describes the most correlated assets and Monte Carlo paths simulated in 10 minutes. Dell achieved the highest processing capacity record for assets and number of Monte Carlo paths. 

Table 1. Workload processing capacity comparison

The table lists the rank, test SUT, number of assets, and number of Monte Carlo paths for each result; Dell shows the highest processing capacity.

Energy efficiency  

Energy efficiency measures the energy consumed, calculated as the sum of one-second watt readings over the HPORTFOLIO processing sequence and adjusted for shared hardware and software resources. Dell has set the record for the best energy efficiency in options per kWh.


Figure 6. Energy efficiency comparison

Additional resources

  1. Dell Validated Design for Financial Risk Assessment
  2. STAC-A2 Finance Price and Risk Management performance benchmark
  3. Dell with NVIDIA H100 SXM5 GPUs under STAC-A2 (derivatives risk)
  4. STAC-A2 (derivatives risk) on NVIDIA H100 80GB in HPE ProLiant XL675d Gen10 Plus
  5. Intel GPUs under STAC-A2 (derivatives risk)
  6. Oracle Cloud Infrastructure with NVIDIA SXM4 A100 GPUs under STAC-A2 (derivatives risk) 

HPC Application Performance on Dell PowerEdge R6625 with AMD EPYC Genoa

Savitha Pareek, Veena K, Miraj Naik, Prasanthi Donthireddy

Wed, 08 Nov 2023 21:09:35 -0000


The AMD EPYC 9354 Processor, when integrated into the Dell R6625 server, offers a formidable solution for high-performance computing (HPC) applications. Genoa, which is built on the Zen 4 architecture, delivers exceptional processing power and efficiency, making it a compelling choice for demanding HPC workloads. When paired with the PowerEdge R6625's robust infrastructure and scalability features, this CPU enhances server performance, enabling efficient and reliable execution of HPC applications. These features make it an ideal choice for HPC application studies and research.

At Dell Technologies, our goal is to help accelerate time to value for our customers. Dell wants customers to be able to leverage our benchmark performance and scaling studies to plan out their environments. By using our expertise, customers do not have to test different combinations of CPU, memory, and interconnect, choose the CPU with the performance sweet spot, or decide which BIOS features to tweak for the best performance and scaling. Dell wants to accelerate the setup, deployment, and tuning of HPC clusters so that customers can get to the real value: running their applications and solving complex problems, like manufacturing better products for their customers.

Testbed configuration

Benchmarking for high-performance computing applications was carried out using Dell PowerEdge 16G servers equipped with the AMD EPYC 9354 32-core processor.

Table 1. Test bed system configuration used for this benchmark study

Platform: Dell PowerEdge R6625
Processor: AMD EPYC 9354 32-Core Processor
Cores/Socket: 32 (64 total)
Base Frequency: 3.25 GHz
Max Turbo Frequency: 3.75 GHz
TDP: 280 W
L3 Cache: 256 MB
Memory: 768 GB DDR5 at 4800 MT/s
Interconnect: NVIDIA Mellanox ConnectX-7 NDR 200
Operating System: RHEL 8.6
Linux Kernel: 4.18.0-372.32.1
BIOS: 1.0.1
OFED: 5.6.2.0.9
System Profile: Performance Optimized
Compiler: AOCC 4.0.0
MPI: OpenMPI 4.1.4
Turbo Boost: ON

Application | Vertical Domain | Benchmark Datasets
OpenFOAM | Manufacturing - Computational Fluid Dynamics (CFD) | Motorbike 50M, 34M, and 20M cell mesh
Weather Research and Forecasting (WRF) | Weather and Environment | Conus 2.5KM
Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) | Molecular Dynamics | Rhodo, EAM, Stillinger-Weber, Tersoff, HECBIOSIM, and AIREBO
GROMACS | Life Sciences – Molecular Dynamics | HECBioSim Benchmarks – 3M Atoms, Water, and PRACE LignoCellulose
CP2K | Life Sciences | H2O-DFT-LS-NREP-4,6 and H2O-64-RI-MP2

Performance Scalability for HPC Application Domain

Vertical – Manufacturing | Application – OpenFOAM

OpenFOAM is an open-source computational fluid dynamics (CFD) software package renowned for its versatility in simulating fluid flows, turbulence, and heat transfer. It offers a robust framework for engineers and scientists to model complex fluid dynamics problems and conduct simulations with customizable features. In this study, we worked with OpenFOAM version 9, compiled with GCC 11.2.0 and OpenMPI 4.1.5. For successful compilation and optimization on AMD EPYC processors, additional flags such as '-O3 -znver4' were added.

The motorBike tutorial case from the simpleFoam solver category was used to evaluate the performance of the OpenFOAM package on AMD EPYC 9354 processors. Three grids of 20M, 34M, and 50M cells were generated using the blockMesh and snappyHexMesh utilities of OpenFOAM. Each run used all cores (64 cores per node), and scalability tests for all three grids were conducted from a single node up to sixteen nodes. The execution time of the steady-state simpleFoam solver was recorded as the performance number. The figure below shows the application performance for all the datasets.
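For readers who want to set up a comparable run, the following is a minimal sketch of the motorBike workflow described above. The case directory, decomposition, and rank counts are illustrative assumptions, not the exact scripts used in this study.

# Illustrative OpenFOAM motorBike workflow; paths and core counts are assumptions
cd motorBike                                        # copy of the simpleFoam motorBike tutorial case
blockMesh                                           # generate the background mesh
decomposePar                                        # split the case per decomposeParDict (e.g. 1024 subdomains)
mpirun -np 1024 snappyHexMesh -overwrite -parallel  # refine toward the target cell count (16 nodes x 64 cores)
mpirun -np 1024 simpleFoam -parallel                # steady-state solve; its elapsed time is the reported metric

The subdomain count in decomposeParDict must match the number of MPI ranks passed to mpirun.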

 

Figure 1: The scaling performance of the OpenFOAM Motorbike datasets using the AMD EPYC 9354 processor, with a focus on performance compared to a single node.

The results are normalized to the single-node result, and the scalability is depicted in Figure 1. OpenFOAM shows linear scaling from a single node to eight nodes on 9354 processors for the largest dataset (50M cells). The smaller 20M and 34M datasets scale linearly up to four nodes, with slightly reduced scaling at eight nodes. For all datasets (20M, 34M, and 50M), scalability drops off at sixteen nodes.

Satisfactory results with smaller datasets can be achieved with fewer processors and nodes, because smaller datasets do not require a higher processor count. Increasing the node count, and therefore the processor count, relative to the solver's computation time leads to more interprocessor communication, which extends the overall runtime. Consequently, higher node counts are more beneficial when handling larger datasets within OpenFOAM simulations.

Vertical – Weather and Environment | Application - WRF

The Weather Research and Forecasting model (WRF) is at the forefront of meteorological and atmospheric research, with its latest version being a testament to advancements in high-performance computing (HPC). WRF enables scientists and meteorologists to simulate and forecast complex weather patterns with unparalleled precision. In this study, we worked with WRF version 4.5, compiled with AOCC 4.0.0 and OpenMPI 4.1.4. For successful compilation and optimization on AMD EPYC processors, additional flags such as '-O3 -znver4' were added.

The dataset used in our study is CONUS v4.4, which means that the model's grid, parameters, and input data are set up to focus on weather conditions within the continental United States. This configuration is particularly useful for researchers, meteorologists, and government agencies who need high-resolution weather forecasts and simulations tailored to this specific geographic area. The configuration details, such as grid resolution, atmospheric physics schemes, and input data sources, can vary depending on the specific version of WRF and the goals of the modeling project. In this study, we predominantly adhered to the default input configuration, making minimal alterations to the source code or input file. Each run used all cores (64 cores per node) and spanned from a single node to sixteen nodes. The scalability tests were conducted and the performance metric, in seconds, was recorded.
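A typical launch for such a run is sketched below. The rank count, binary location, and the timing extraction are illustrative assumptions and would need to match the actual build and namelist.input; the field positions in the rsl log depend on the WRF version.

# Illustrative pure-MPI WRF launch on 16 nodes x 64 cores; values are assumptions
export OMP_NUM_THREADS=1
mpirun -np 1024 --map-by core --bind-to core ./wrf.exe
# Sum the per-timestep compute times reported in the rank-0 log
grep "Timing for main" rsl.error.0000 | awk '{sum += $9} END {print sum, "seconds"}'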

Figure 2: The scaling performance of the WRF CONUS dataset using the AMD EPYC 9354 processor, with a focus on performance compared to a single node.

The AOCC compiled binaries of WRF show linear scaling from a single node to sixteen nodes on 9354 processors for the new CONUS v4.4. For the best performance with WRF, the impact of the tile size, process, and threads per process should be carefully considered. Given that the application is constrained by memory and DRAM bandwidth, we have opted for the latest DDR5 4800 MT/s DRAM for our test evaluations. It is also crucial to consider the BIOS settings, particularly SubNUMA configurations, as these settings can significantly influence the performance of memory-bound applications, potentially leading to improvements ranging from one to five percent. For more detailed BIOS tuning recommendations, please see our previous blog post on optimizing BIOS settings for optimal performance. 

Vertical – Molecular Dynamics | Application - LAMMPS

Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a powerful tool for HPC. It is specifically designed to harness the immense computational capabilities of HPC clusters and supercomputers. LAMMPS allows researchers and scientists to conduct large-scale molecular dynamics simulations with remarkable efficiency and scalability. In this study, we worked with the 15 June 2023 version of LAMMPS, compiled with AOCC 4.0.0 and OpenMPI 4.1.4. For successful compilation and optimization on AMD EPYC processors, additional flags such as '-O3 -znver4' were added.

We opted for a non-default package that offers optimized atom pair styles. We also ran some benchmarks that are not supported by the default package to check their performance and scaling. Our performance metric for this benchmark is nanoseconds per day (ns/day), where a higher value is a better result.
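The commands below sketch how such runs are typically launched. The package suffix (shown here as the OPT package), input file names, and rank counts are assumptions for illustration only.

# Illustrative LAMMPS launches; suffix, inputs, and rank counts are assumptions
mpirun -np 512 lmp -sf opt -in in.eam      # EAM benchmark with optimized pair styles enabled
mpirun -np 512 lmp -in in.rhodo            # rhodopsin benchmark with default styles
grep "Performance:" log.lammps             # LAMMPS reports ns/day at the end of each run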

Two factors were considered when compiling data for comparison: the number of nodes and the core count. Figure 3 shows the performance scaling observed on the 9354 processor with 64 cores per node.

Figure 3: The scaling performance of the LAMMPS datasets using the AMD EPYC 9354 processor, with a focus on performance compared to a single node.

Figure 3 shows the scaling of the different LAMMPS datasets. We see a significant improvement in scaling as the atom count and step size increase. We tested two datasets, EAM and HECBIOSIM, with more than 3 million atoms and observed better scalability compared to the other datasets.

Vertical – Molecular Dynamics | Application - GROMACS

GROMACS, a high-performance molecular dynamics software package, is a vital tool for HPC environments. Tailored for HPC clusters and supercomputers, GROMACS specializes in simulating the intricate movements and interactions of atoms and molecules. Researchers in diverse fields, including biochemistry and biophysics, rely on its efficiency and scalability to explore complex molecular processes. GROMACS is used for its ability to harness the immense computational power of HPC, allowing scientists to conduct intricate simulations that unveil critical insights into atomic-level behaviors, from biomolecules to chemical reactions and materials. In this study, we worked with GROMACS version 2023.1, compiled with AOCC 4.0.0 and OpenMPI 4.1.4. For successful compilation and optimization on AMD EPYC processors, additional flags such as '-O3 -znver4' were added.

We've curated a range of datasets for our benchmarking assessments. First, we included "water GMX_50 1536" and "water GMX_50 3072," which represent simulations involving water molecules. These simulations are pivotal for gaining insights into solvation, diffusion, and water's behaviour in diverse conditions. Next, we incorporated "HECBIOSIM 14K" and "HECBIOSIM 30K" datasets, which were specifically chosen for their ability to investigate intricate systems and larger biomolecular structures. Lastly, we included the "PRACE Lignocellulose" dataset, which aligns with our benchmarking objectives, particularly in the context of lignocellulose research. These datasets collectively offer a diverse array of scenarios for our benchmarking assessments. 

Our performance assessment was based on the measurement of nanoseconds per day (ns/day) for each dataset, providing valuable insight into computational efficiency. Additionally, we paid careful attention to optimizing the mdrun tuning parameters (such as ntomp, dlb, tunepme, and nsteps) in every test run to ensure accurate and reliable results. We examined scalability by conducting tests spanning from a single node to sixteen nodes.
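A hedged example of such a launch is shown below; the MPI binary name, rank and thread split, and the .tpr file name are assumptions, and the tuning flags simply mirror the parameters named above.

# Illustrative GROMACS 2023.1 launch; rank/thread split and .tpr name are assumptions
mpirun -np 1024 gmx_mpi mdrun -s benchmark.tpr -ntomp 1 -dlb yes -tunepme -nsteps 10000 -resethway -noconfout
grep "Performance:" md.log    # ns/day is printed in the performance summary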

Figure 4: The scaling performance of the GROMACS datasets using the AMD EPYC 9354 processor, with a focus on performance compared to a single node.

For ease of comparison across the various datasets, the relative performance has been combined into a single graph. However, it is worth noting that each dataset behaves differently when performance is considered, as each uses different molecular topology input files (.tpr) and configuration files.

We achieved the expected performance scalability for GROMACS of up to eight nodes for the larger datasets. All cores in each server were used while running these benchmarks. The performance increases are close to linear across all the dataset types; however, there is a drop at larger node counts due to the smaller dataset sizes and the number of simulation iterations.

Vertical – Molecular Dynamics | Application – CP2K

CP2K is a versatile computational software package that covers a wide range of quantum chemistry and solid-state physics simulations, including molecular dynamics. It is not strictly limited to molecular dynamics but is a comprehensive tool for various computational chemistry and materials science tasks. While CP2K is widely used for molecular dynamics simulations, it can also perform electronic structure calculations, ab initio molecular dynamics, hybrid quantum mechanics/molecular mechanics (QM/MM) simulations, and more. In this study, we worked with CP2K version 2023.1, compiled with AOCC 4.0.0 and OpenMPI 4.1.4. For successful compilation and optimization on AMD EPYC processors, additional flags such as '-O3 -znver4' were added.

In our study focusing on high-performance computing (HPC), we utilized specific datasets optimized for computational efficiency. The first dataset, "H2O-DFT-LS-NREP-4,6," was configured for HPC simulations and calculations, emphasizing the modeling of water (H2O) using Density Functional Theory (DFT). The appended "NREP-4,6" parameter settings were fine-tuned to ensure efficient HPC performance. The second dataset, "H2O-64-RI-MP2," was exclusively crafted for HPC applications and revolved around the examination of a system comprising 64 water molecules (H2O). By employing the Resolution of Identity (RI) method in conjunction with the Møller–Plesset perturbation theory of second order (MP2), this dataset demonstrated the significant computational capabilities of HPC for conducting advanced electronic structure calculations within a high-molecule-count environment. We examined the scalability by conducting tests spanning from a single node to a total of sixteen nodes.

Figure 5: The scaling performance of the CP2K datasets using the AMD EPYC 9354 processor, with a focus on performance compared to a single node.

The datasets represent a single-point energy calculation employing linear-scaling Density Functional Theory (DFT). The system comprises 6144 atoms confined within a 39 Å^3 simulation box, which translates to a total of 2048 water molecules. To adjust the computational workload, you can modify the NREP parameter within the input file.  

Our benchmarking efforts encompass configurations involving up to 16 computational nodes. Optimal performance is achieved when using NREP4 and NREP6 in hybrid mode, which combines MPI (Message Passing Interface) and OpenMP (Open Multi-Processing). This configuration exhibits the best scaling, particularly on 4 to 8 nodes; however, scaling beyond 8 nodes does not show a strictly linear improvement. Figure 5 above depicts outcomes when using pure MPI, utilizing 64 cores with a single thread per core.
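The launch lines below sketch the two execution modes discussed above. The binary name (the psmp build is assumed for hybrid runs), the rank-per-node split, and the input/output file names are illustrative assumptions.

# Illustrative hybrid MPI+OpenMP run on 8 nodes: 16 ranks per node x 4 OpenMP threads = 64 cores per node
export OMP_NUM_THREADS=4
mpirun -np 128 --map-by ppr:16:node:PE=4 cp2k.psmp -i H2O-DFT-LS.inp -o H2O-DFT-LS-NREP4.out
# Illustrative pure-MPI run (as in Figure 5): 64 ranks per node, one thread each
export OMP_NUM_THREADS=1
mpirun -np 512 cp2k.psmp -i H2O-DFT-LS.inp -o H2O-DFT-LS-pure-mpi.out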

Conclusion 

When considering CPUs with equivalent core counts, the earlier AMD EPYC processors can deliver performance levels similar to their Genoa counterparts; however, achieving this parity may require doubling the number of nodes. To further enhance performance with AMD EPYC processors, we suggest optimizing the BIOS settings as outlined in our previous blog post and, specifically, disabling Hyper-Threading for the benchmarks discussed in this article. For other workloads, we recommend conducting comprehensive testing and, if beneficial, enabling Hyper-Threading. Additionally, for this performance study, we highly recommend the Mellanox NDR 200 interconnect for optimal results.


16G PowerEdge Platform BIOS Characterization for HPC with Intel Sapphire Rapids

Savitha Pareek, Miraj Naik, Veena K

Fri, 30 Jun 2023 13:44:52 -0000


Dell added over a dozen next-generation systems to the extensive portfolio of Dell PowerEdge 16G servers. These new systems are designed to accelerate performance and reliability for powerful computing across core data centers, large-scale public clouds, and edge locations.

The new PowerEdge servers feature rack, tower, and multi-node form factors, supporting the new 4th Gen Intel Xeon Scalable processors (formerly codenamed Sapphire Rapids). Sapphire Rapids still supports the AVX-512 SIMD instructions, which allow for 32 DP FLOPs per cycle. The upgraded Ultra Path Interconnect (UPI) link speed of 16 GT/s is expected to improve data movement between the sockets. In addition to core count and frequency, Sapphire Rapids-based Dell PowerEdge servers support DDR5 4800 MT/s RDIMMs with eight memory channels per processor, which is expected to improve the performance of memory bandwidth-bound applications.

This blog provides synthetic benchmark results and recommended BIOS settings for the Sapphire Rapids-based Dell PowerEdge Server processors. This document contains guidelines that allow the customer to optimize their application for best energy efficiency and provides memory configuration and BIOS setting recommendations for the best out-of-the-box performance and scalability on the 4th Generation of Intel® Xeon® Scalable processor families. 

Test bed hardware and software details

Table 1 and Table 2 show the test bed hardware details and synthetic application details. There were 15 BIOS options explored through application performance testing. These options can be set and unset through the Remote Access Controller Admin (RACADM) command in Linux or directly in BIOS setup mode.

Use the following command to set the “HPC Profile” to get the best synthetic benchmark results.

racadm set bios.sysprofilesettings.WorkloadProfile HpcProfile && sudo racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW -e TIME_NA

Once the system is up, use the below command to verify if the setting is enabled.

racadm get bios.sysprofilesettings.WorkloadProfile

It should show the workload profile set to HPCProfile. Please note that any changes made to BIOS settings on top of the “HPCProfile” will set this parameter to “Not Configured”, while keeping the other settings of “HPCProfile” intact.

Table 1. System details

Component | Dell PowerEdge R660 server (Air cooled) | Dell PowerEdge R760 server (Air cooled) | Dell PowerEdge C-Series (C6620) server (Direct Liquid Cooled)
SKU | 8452Y | 6430 | 8480+
Cores/Socket | 36 | 32 | 56
Base Frequency (GHz) | 2 | 1.9 | 2
TDP (W) | 300 | 270 | 350
L3 Cache | 69.12 MB | 61.44 MB | 10.75 MB
Operating System | RHEL 8.6 | RHEL 8.6 | RHEL 8.6
Memory (GB) | 1024 (64 x 16) | 1024 (64 x 16) | 512 (32 x 16)
BIOS | 1.0.1 | 1.0.1 | 1.0.1
CPLD | 1.0.1 | 1.0.1 | 1.0.1
Interconnect | NDR 400 | NDR 400 | NDR 400
Compiler | OneAPI 2023 | OneAPI 2023 | OneAPI 2023

Table 2. Synthetic benchmark applications details

Application Name | Version
High-Performance Linpack (HPL) | Pre-built binary MP_LINPACK Intel - 2.3
STREAM | STREAM 5.0
High Performance Conjugate Gradient (HPCG) | Pre-built binary from Intel oneAPI 2.3
Ohio State University (OSU) Micro Benchmarks | OSU 7.0.1

In the present study, the HPL, STREAM, and HPCG synthetic benchmarks are run on a single node; the OSU benchmark measures MPI operations and therefore requires a minimum of two nodes.

Synthetic application performance details

As shown in Table 2, four synthetic applications are tested on the test bed hardware (Table 1): HPL, STREAM, HPCG, and OSU. The performance details of each application are given below:

High Performance Linpack (HPL)

HPL helps measure the floating-point computation efficiency of a system [1]. The details of the synthetic benchmarks can be found in the previous blog on Intel Ice Lake processors.

Figure 1. Performance values of HPL application for different processor models

The N and NB sizes used for the HPL benchmark are 348484 and 384, respectively, for the Intel Sapphire Rapids 6430 and 8452Y processors, and 246144 and 384, respectively, for the 8480 processor. The difference in N sizes is due to the difference in available memory: systems with Intel 6430 and 8452Y processors are equipped with 1024 GB of memory, while the 8480 processor system has 512 GB. The performance numbers were captured with the different BIOS settings discussed above, and the difference between results is within a 1-2% delta. The results with the HPC workload BIOS profile are shown in Figure 1. The 8452Y processor performs 1.09 times better than the Intel Sapphire Rapids 6430 processor, and the 8480 processor performs 1.65 times better.

STREAM

The STREAM benchmark measures the sustainable memory bandwidth of a processor. In general, each STREAM array must be at least four times the total size of all last-level caches used in the run, or 1 million elements, whichever is larger. The STREAM array sizes used for the current study are 4×10^7 and 12×10^7 elements with full core utilization. The STREAM benchmark was also tested with the 15 BIOS combinations, and the results depicted in Figure 2 are for the HPC workload profile BIOS test case. The STREAM TRIAD results are reported in GB/s. The results show an improvement in performance compared to the Intel 3rd Generation Xeon Scalable processors, such as the 8380 and 6338. Comparing the 6430, 8452Y, and 8480 processors, the STREAM results with the 8452Y and 8480 Intel 4th Generation Xeon Scalable processors are, respectively, 1.12 and 1.24 times better than with the Intel 6430 processor.
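To illustrate how the array size is applied, the lines below sketch a typical STREAM build and run. The compiler, flags, and the 1.2×10^8-element array size are assumptions for illustration; the study itself used the oneAPI toolchain listed in Table 1.

# Illustrative STREAM build and run; compiler choice and flags are assumptions
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=120000000 -DNTIMES=20 stream.c -o stream_120m
export OMP_NUM_THREADS=$(nproc)   # full core utilization, as in the study
./stream_120m                     # best-rate bandwidth for Copy, Scale, Add, and Triad is printed in MB/s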

Figure 2. Performance values of STREAM application for different processor models

HPCG

The HPCG benchmark aims to simulate the data access patterns of applications such as sparse matrix calculations, assessing the impact of memory subsystem and internal connectivity constraints on the computing performance of High-Performance Computers, or supercomputers. The different problem sizes used in the study are 192, 256, 176, 168, and so on. Additionally, in this benchmark study, the variation in performance within different BIOS options was within 1–2 percent. Figure 3 shows the HPCG performance results for Intel Sapphire Rapids processors 6430, 8452Y and 8480. In comparison with the Intel 6430 processor, the 8452Y shows 1.02 times and the 8480 shows 1.12 times better performance. 

Figure 3. Performance values of HPCG application for different processor models

OSU Micro Benchmarks

OSU Micro Benchmarks are used for measuring the performance of MPI implementations, so we used two nodes connected with NDR200. The OSU benchmarks determine the uni-directional and bi-directional bandwidth, message rate, and latency between the nodes. The OSU benchmark was run on all three Intel processors (6430, 8452Y, and 8480) with a single core per node; the results for one of the systems (the Intel 8480 processor) are shown in Figures 4-7.
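The point-to-point tests named above are typically launched with one rank on each node, as sketched below; the hostnames and binary locations are assumptions.

# Illustrative two-node OSU runs, one rank per node; hostnames and paths are assumptions
mpirun -np 2 --host node1,node2 ./osu_latency   # point-to-point latency
mpirun -np 2 --host node1,node2 ./osu_bw        # uni-directional bandwidth
mpirun -np 2 --host node1,node2 ./osu_bibw      # bi-directional bandwidth
mpirun -np 2 --host node1,node2 ./osu_mbw_mr    # multi-pair bandwidth and message rate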

Figure 4. OSU bi-directional bandwidth chart for the C6620_8480 Intel processor

Figure 5. OSU uni-directional bandwidth chart for the C6620_8480 Intel processor

Figure 6. OSU message bandwidth/message rate chart for the C6620_8480 Intel processor

Figure 7. OSU latency chart for the C6620_8480 Intel processor

All fifteen BIOS combinations were tested; the OSU benchmark also shows similar performance with a difference within a 1-2% delta.

Conclusion

The performance comparison between the various Intel Sapphire Rapids processors (6430, 8452Y, and 8480) was done with the help of synthetic benchmark applications such as HPL, STREAM, HPCG, and OSU. Nearly 15 BIOS configurations were applied to the system, and performance values with the different benchmarks were captured to identify the best BIOS configuration. From the results, the difference in performance across all applied BIOS configurations for any benchmark is below a 3 percent delta.

Therefore, the HPC workload profile provides the best benchmark results with all the Intel Sapphire Rapids processors. Among the three Intel processors compared, the 8480 had the highest application performance, with the 8452Y in second place. The largest difference in performance between processors was found for the HPL benchmark, where the 8480 Intel Sapphire Rapids processor offers 1.65 times better results than the Intel 6430 processor.

Watch out for future application benchmark results on this blog! Visit our page for previous blogs.

 


16G PowerEdge Platform BIOS Characterization for HPC with AMD Genoa 9354

Neeraj Kumar

Fri, 30 Jun 2023 13:44:52 -0000


With the release of 4th Gen AMD EPYC 9004 CPUs (code-named “Genoa”), Dell PowerEdge servers have been refreshed to support these latest processors. In this blog, we will present the results of a study evaluating the performance of HPC synthetic benchmarks with AMD 9354 processors on Dell’s latest PowerEdge dual socket 1U R6625 server and dual socket 2U R7625 server. 

Architecture

AMD Genoa is based on the new Zen4 micro-architecture built with 5nm fabrication technology. Major changes from its predecessor AMD EPYC 7003 CPUs (code-named “Milan”) include the support for DDR5 memory at speeds up to 4800 MT/s and PCIe Gen5. It supports up to 96 cores per socket and the L2 cache per core is doubled. Zen4 adds support for the AVX-512 instruction set. The implementation in Zen4 executes AVX-512 instructions in two cycles. Also, improvements are made in instructions per cycle (IPC). 

Benchmark hardware and software configuration

Table 1. Test bed system configuration used for this benchmark study

Platform: Dell PowerEdge R6625 / R7625
Processor: AMD EPYC 9354
Cores: 32 cores/socket
Base Frequency: 3.25 GHz
Turbo Clock: Up to 3.8 GHz
TDP: 280 W
Configurable TDP: 240-300 W
L1 Cache: 64 KB per core
L2 Cache: 1 MB per core
L3 Cache: 256 MB (shared)
Memory: 32 GB x 24 DIMMs | 4800 MT/s
Interconnect: NVIDIA Mellanox NDR 400
Operating System: RHEL 8.6
Linux Kernel: 4.18.0-372.9.1
BIOS/CPLD: 1.1.3/1.1.3
OFED: MLNX_OFED_LINUX-5.7-1.0.2.0
BIOS Workload Profile: HPC Profile
Compiler: AOCC 4.0.0 and AOCL 4.0
OpenMPI: 4.1.5
Turbo Boost: ON

Recommended BIOS optimizations

We tested different combinations of BIOS options in this study to understand the potential performance improvements in synthetic benchmarks. We found that setting the workload profile in the BIOS to “HPCProfile” gives the best performance on HPC synthetic benchmarks.

This workload profile option can be found in System Profile Settings of BIOS. It is a collection of multiple BIOS options recommended for HPC workloads. This setting can be updated using the RACADM CLI tool. Use the following command to enable “HPCProfile” and reboot your system using racadm.

racadm set bios.sysprofilesettings.WorkloadProfile HpcProfile && sudo racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW -e TIME_NA

Once the system is up, use the command below to verify that the setting is enabled.

racadm get bios.sysprofilesettings.WorkloadProfile

It should show the workload profile as HPCProfile. Note that any changes made in BIOS settings on top of the “HPCProfile” will set this parameter to “Not Configured”, keeping the other settings of “HPCProfile” intact. 

We have studied the impact of different BIOS options on top of “HPCProfile”. All the performance numbers mentioned in this blog are with workload profile set to “HPCProfile”.  

Table 2. Synthetic benchmarks application details

We used prebuilt AMD Optimized binaries for HPL, Stream, and HPCG benchmarks, which are optimized for AMD’s Zen4 architecture. OSU was compiled using AOCC 4.0 compiler. Benchmark information and performance numbers are mentioned in the following section. 

Benchmark performance results

HPL: This benchmark solves a random system of linear equations in double precision (64-bit) on distributed systems. It reports the floating-point execution rate of the system.

In the HPL benchmark test, we used 94 percent of available memory for the problem size, with N=301440 and NB=384. We achieved ~3.75 TFlops of performance across the two sockets, which is around 113 percent efficiency relative to the base frequency of the AMD 9354 processor. We monitored the frequency throughout the benchmark run and observed that the processor was able to sustain its turbo frequencies, which explains the efficiency above 100 percent for this processor. The average power consumption during the benchmark run was ~830 watts with the system profile in the BIOS set to the “HPCProfile” option. We obtained the best performance-per-watt results with this option.
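As a rough illustration of how the problem dimension follows from memory, the sketch below estimates N from available memory and rounds it down to a multiple of NB. This is only back-of-the-envelope arithmetic; the exact N quoted above also reflects memory held back for the OS and runtime.

# Rough HPL problem-size estimate; values are illustrative, not the study's run script
NB=384
MEM_BYTES=$(awk '/MemAvailable/ {print $2 * 1024}' /proc/meminfo)
N=$(awk -v m="$MEM_BYTES" -v nb="$NB" 'BEGIN {
  n = int(sqrt(0.94 * m / 8))     # 8 bytes per FP64 matrix element, 94% of memory
  print int(n / nb) * nb          # round down to a multiple of NB
}')
echo "Suggested problem size: N=$N, NB=$NB"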

Figure 1. HPL performance with AMD Genoa 9354 processor on 16G PowerEdge R6625 and R7625 servers

STREAM: This synthetic benchmark is designed to measure sustainable memory bandwidth and a corresponding computation rate for four simple vector kernels: Copy, Scale, Add and Triad. 

In the STREAM TRIAD benchmark test, we were able to reach up to ~752 GB/s when utilizing all available cores of the dual-socket server. To learn more about the STREAM performance numbers on AMD Milan-based servers, please refer to our previous blog here.

Figure 2. STREAM performance with AMD Genoa 9354 processor on 16G PowerEdge R6625 and R7625 servers

HPCG: This benchmark project is an effort to create a new metric for ranking HPC systems. It is an internally I/O bound benchmark, intended to complement LINPACK benchmarks.

In the HPCG benchmark, we used nx=ny=nz=192 local sub grid dimensions to tune the problem size as per our system memory. We were able to reach ~115 Gflops of performance with AMD optimized binary for HPCG. 

Figure 3. HPCG performance with AMD Genoa 9354 processor on 16G PowerEdge R6625 and R7625 servers

OSU Micro Benchmarks: These micro-benchmarks are widely used for measuring and evaluating the performance of MPI operations for point-to-point, multi-pair, and collective communications between the nodes.

In the OSU benchmark, we used two nodes connected with NDR400. We checked bidirectional bandwidth, unidirectional bandwidth, message rate, and latency between these two nodes. In a dual-socket server, the socket connected to the network adapter card acts as local and the other acts as remote. We completed this test on both the R6625 and R7625 servers for both remote and local latency and bandwidth. The results below were obtained from the R6625 server. All the OSU results shown below were run using a single core per node.

The Delta label on the secondary axis represents the percentage difference between local and remote latency and bandwidth.

Figure 4. OSU Latency with AMD Genoa 9354 processor on Dell PowerEdge R6625 server

We achieved ~48 GB/s unidirectional bandwidth and ~87 GB/s of bidirectional bandwidth.

Figure 5. OSU message rate with AMD Genoa 9354 processor on Dell PowerEdge R6625 server

Figure 6. OSU bi-directional bandwidth with AMD Genoa 9354 processor on Dell PowerEdge R6625 server

Figure 7. OSU uni-directional bandwidth with AMD Genoa 9354 processor on 16G Dell PowerEdge R6625 server

Conclusion and future work

We have seen a significant improvement in the performance of synthetic benchmarks on Genoa-based servers compared to earlier Milan-based servers. Setting the right BIOS parameters is important to achieve the best results on these servers. As part of our study, we tested different BIOS parameters, and our findings suggest that setting the workload profile to “HPCProfile” provides the best performance.

For future work, we plan to study performance improvements on HPC applications from different domains using these latest AMD processors and Dell PowerEdge servers.

Check back soon for the next blog. 

Additional resources

Visit our website to read our previous blog on AMD Milan-based servers.


Performance Evaluation of HPC Applications on a Dell PowerEdge R650-based VMware Virtualized Cluster

Veena K, Neeraj Kumar, Miraj Naik, Rizwan Ali

Wed, 08 Feb 2023 14:45:39 -0000


Overview

High Performance Computing (HPC) solves complex computational problems by doing parallel computations on multiple computers and performing research activities through computer modeling and simulations. Traditionally, HPC is deployed on bare-metal hardware, but due to advancements in virtualization technologies, it is now possible to run HPC workloads in virtualized environments. Virtualization in HPC provides more flexibility, improves resource utilization, and enables support for multiple tenants on the same infrastructure. 

However, virtualization is an additional layer in the software stack and is often construed as impacting performance. This blog explains a performance study conducted by the Dell Technologies HPC and AI Innovation Lab in partnership with VMware. The study compares bare-metal and virtualized environments on multiple HPC workloads with Intel® Xeon® Scalable third-generation processor-based systems. Both the bare-metal and virtualized environments were deployed on the Dell HPC on Demand solution. 

    

Figure 1: Cluster Architecture

To evaluate the performance of HPC applications and workloads, we built a 32-node HPC cluster using Dell PowerEdge R650 servers as compute nodes. The Dell PowerEdge R650 is a 1U dual-socket server with Intel® Xeon® Scalable third-generation processors. The cluster was configured to use both bare-metal and virtual compute nodes (running VMware vSphere 7). Both bare-metal and virtualized nodes were attached to the same head node.

Figure 1 shows a representative network topology of this cluster. The cluster was connected to two separate physical networks. The compute nodes were spread across two sets of racks, and the cluster consisted of the following two networks: 

  • HPC Network: Dell PowerSwitch Z9332 switch connecting NVIDIA® Connect®-X6 100 GbE adapters to provide a low latency high bandwidth 100 GbE RDMA-based HPC network for the MPI-based HPC workload
  • Services Network: A separate pair of Dell PowerSwitch S5248F-ON 25 GbE based top of rack (ToR) switches for hypervisor traffic

The Virtual Machine (VM) configuration details for optimal performance settings were captured in an earlier blog. In addition to the settings noted in that blog, additional BIOS tuning options such as Snoop Hold Off, Sub-NUMA Cluster (SNC), and LLC Prefetch were also tested. Snoop Hold Off (set to 2K cycles) and SNC helped performance across most of the tested applications and microbenchmarks for both the bare-metal and virtual nodes. Enabling SNC in the server BIOS without configuring SNC correctly in the VM might result in performance degradation.

 

Bare-metal and virtualized HPC system configuration

Table 1 shows the system environment details used for the study.

Table 1: System configuration details for the bare-metal and virtual clusters

Platform: PowerEdge R650 server
Processor: Two Intel® Xeon® third-generation 6348 (28 cores @ 2.6 GHz)
Number of cores: Bare-metal: 56 cores; Virtual: 52 vCPUs (four cores reserved for ESXi)
Memory: Sixteen 32 GB DDR4 DIMMs @ 3200 MT/s; Bare-metal: all 512 GB used; Virtual: 440 GB reserved for the VM
HPC Network NIC: 100 GbE NVIDIA Mellanox ConnectX-6
Service Network NIC: 10/25 GbE NVIDIA Mellanox ConnectX-5
HPC Network Switch: Dell PowerSwitch Z9332 with OS 10.5.2.3
Service Network Switch: Dell PowerSwitch S5248F-ON
Operating system: Rocky Linux release 8.5 (Green Obsidian)
Kernel: 4.18.0-348.12.2.el8_5.x86_64
Software – MPI: Intel MPI 2021.5.0
Software – Compilers: Intel oneAPI 2022.1.1
Software – OFED: OFED 5.4.3 (Mellanox FW 22.32.20.04)
BIOS version: 1.5.5 (for both bare-metal and virtual nodes)

 

Application and benchmark details

The following table outlines the set of HPC applications used for this study from different domains, such as Computational Fluid Dynamics (CFD), weather, and life sciences. Different benchmark datasets were used for each of the applications, as detailed in Table 2.

Table 2: Application and benchmark dataset details

Application | Vertical Domain | Benchmark Dataset
WRF (v3.9.1.1) | Weather and Environment | Conus 2.5KM, Maria 3KM
OpenFOAM (version 9) | Manufacturing - Computational Fluid Dynamics (CFD) | Motorbike 20M, 34M, and 52M cell mesh
GROMACS (version 2022) | Life Sciences – Molecular Dynamics | HECBioSim Benchmarks – 3M Atoms, Lignocellulose, BenchPEP
LAMMPS (4 May 2022) | Molecular Dynamics | EAM metallic solid benchmark (1M, 3M, and 8M atoms), HECBIOSIM – 3M Atoms

 

Performance results

All the application results shown here were run on both bare-metal and virtual environments using the same binary compiled with Intel Compiler and run with Intel MPI. Multiple runs were done to ensure consistency in the performance. Basic synthetic benchmarks like High Performance Linpack (HPL), Stream, and OSU MPI Benchmarks were run to ensure that the cluster was operating efficiently before running the HPC application benchmarks. For the study, all the benchmarks were run in a consistent, optimized, and stable environment across both the bare-metal and virtual compute nodes.

Each dual-socket node with Intel® Xeon® Scalable third-generation processors (Ice Lake 6348) has 56 cores. Four cores were reserved for the virtualization hypervisor (ESXi), leaving 52 cores to run benchmarks. All the results shown here compare 56-core runs on bare-metal nodes with 52-core runs on virtual nodes.

To ensure better scaling and performance, multiple combinations of threads and MPI ranks were tried based on applications. The best results are used to show the relative speedup between both the bare-metal and virtual systems.

 

    Figure 2: Performance comparison between bare-metal and virtual nodes for WRF

 

  Figure 3: Performance comparison between bare-metal and virtual nodes for OpenFOAM

 

  Figure 4: Performance comparison between bare-metal and virtual nodes for GROMACS

 

  Figure 5: Performance comparison between bare-metal and virtual nodes for LAMMPS

The above results indicate that all the MPI applications running in a virtualized environment are close in performance to the bare-metal environment when proper tuning and optimizations are used. The performance delta, from a single node up to 32 nodes, is within 10% for all the applications, indicating no major impact on scaling.

Concurrency test

In a virtualized multitenant HPC environment, the expectation is for multiple tenants to run multiple concurrent instances of the same or different applications. To simulate this configuration, a concurrency test was conducted by making multiple copies of the same workload and running them in parallel. This test checks whether any performance degradation appears in comparison with the baseline result. To perform meaningful concurrency tests, we expanded the virtual cluster to 48 nodes by converting 16 bare-metal nodes to virtual. For the concurrency tests, the baseline is an 8-node run with no other workload running across the 48-node virtual cluster. After that, six copies of the same workload were run simultaneously across the virtual cluster, and the results were compared for all the applications.

The concurrency was tested in two ways. In the first test, all eight nodes running a single copy were placed in the same rack. In the second test, the nodes running a single job were spread across two racks to see if any performance difference was observed due to additional communication over the network.

Figures 6 to 13 capture the results of the concurrency test. As seen from the results there was no degradation observed in the performance.

 Figure 6: Concurrency Test 1 for WRF 

  Figure 7: Concurrency Test 2 for WRF

   Figure 8: Concurrency Test 1 for Open FOAM

   Figure 9: Concurrency Test 2 for Open FOAM

 

    Figure 10: Concurrency Test 1 for GROMACS

   Figure 11: Concurrency Test 2 for GROMACS

 

   Figure 12: Concurrency Test 1 for LAMMPS 

  Figure 13: Concurrency Test 2 for LAMMPS

Another set of concurrency tests was conducted by running different applications (WRF, GROMACS, and OpenFOAM) simultaneously in the virtual environment. In this test, two eight-node copies of each application were run concurrently across the virtual cluster to determine whether any performance variation occurs while running multiple parallel applications on the virtual nodes. No performance degradation was observed in this scenario either, when compared to the individual application baseline run with no other workload running on the cluster.

 

 Figure 14: Concurrency test with multiple applications running in parallel

Intel Select Solution certification

In addition to the benchmark testing, this system has been certified as an Intel® Select Solution for Simulation and Modeling. Intel Select Solutions are workload-optimized configurations that Intel benchmark-tests and verifies for performance and reliability. These solutions can be deployed easily on premises and in the cloud, providing predictability and scalability.

All Intel Select Solutions are a tailored combination of Intel data center compute, memory, storage, and network technologies that deliver predictable, trusted, and compelling performance. Each solution offers assurance that the workload will work as expected, if not better. These solutions can save individual businesses from investing the resources that might otherwise be used to evaluate, select, and purchase the hardware components to gain that assurance themselves.

The Dell HPC On Demand solution is one of a select group of prevalidated, tested solutions that combine third-generation Intel® Xeon® Scalable processors and other Intel technologies into a proven architecture. These certified solutions can reduce the time and cost of building an HPC cluster, lowering hardware costs by taking advantage of a single system for both simulation and modeling workloads.

Conclusion

Running an HPC application necessitates careful consideration for achieving optimal performance. The main objective of the current study is to use appropriate tuning to bridge the performance gap between bare-metal and virtual systems. With the right settings on the tested HPC applications (see Overview), the performance difference between virtual and bare-metal nodes for the 32 node tests is less than 10%. It is therefore possible to successfully run different HPC workloads in a virtualized environment to leverage benefits of virtualization features. The concurrency testing helped to demonstrate that running multiple applications simultaneously in the virtual nodes does not degrade performance.

Resources

To learn more about our previous work on HPC virtualization on Cascade Lake, see the Performance study of a VMware vSphere 7 virtualized HPC cluster.

Acknowledgments

The authors thank Savitha Pareek from Dell Technologies, Yuankun Fu from VMware, Steven Pritchett, and Jonathan Sommers from R Systems for their contribution in the study.


HPC Application Performance on Dell PowerEdge R7525 Servers with the AMD Instinct™ MI210 GPU

Savitha Pareek, Frank Han

Mon, 12 Sep 2022 12:11:52 -0000


PowerEdge support and performance

The PowerEdge R7525 server can support three AMD Instinct MI210 GPUs, making it ideal for HPC workloads. Furthermore, using the PowerEdge R7525 server to power AMD Instinct MI210 GPUs (built with the 2nd Gen AMD CDNA™ architecture) offers improvements in FP64 operations along with the robust capabilities of the AMD ROCm™ 5 open software ecosystem. Overall, the PowerEdge R7525 server with the AMD Instinct MI210 GPU delivers exceptional double-precision performance and a leading total cost of ownership.

 Figure 1: Front view of the PowerEdge R7525 server

We performed and observed multiple benchmarks with AMD Instinct MI210 GPUs populated in a PowerEdge R7525 server. This blog shows the performance of LINPACK and the OpenMM customizable molecular simulation libraries with the AMD Instinct MI210 GPU and compares the performance characteristics to the previous generation AMD Instinct MI100 GPU.

The following table provides the configuration details of the PowerEdge R7525 system under test (SUT): 

Table 1. SUT hardware and software configurations

Processor: AMD EPYC 7713 64-Core Processor
Memory: 512 GB
Local disk: 1.8 TB SSD
Operating system: Ubuntu 20.04.3 LTS
GPU: 3x MI210 / MI100
Driver version: 5.13.20.22.10
ROCm version: ROCm-5.1.3
Processor Settings > Logical Processors: Disabled
System profile: Performance
NUMA nodes per socket: 4
HPL: rochpl_rocm-5.1-60_ubuntu-20.04
OpenMM: 7.7.0_49

The following table contains the specifications of AMD Instinct MI210 and MI100 GPUs:

Table 2: AMD Instinct MI100 and MI210 PCIe GPU specifications

GPU | AMD Instinct MI210 | AMD Instinct MI100
Peak Engine Clock (MHz) | 1700 | 1502
Stream processors | 6656 | 7680
Peak FP64 (TFlops) | 22.63 | 11.5
Peak FP64 Tensor DGEMM (TFlops) | 45.25 | 11.5
Peak FP32 (TFlops) | 22.63 | 23.1
Peak FP32 Tensor SGEMM (TFlops) | 45.25 | 46.1
Memory size (GB) | 64 | 32
Memory type | HBM2e | HBM2
Peak Memory Bandwidth (GB/s) | 1638 | 1228
Memory ECC support | Yes | Yes
TDP (Watt) | 300 | 300

High-Performance LINPACK (HPL)

HPL measures the floating-point computing power of a system by solving a uniformly random system of linear equations in double precision (FP64) arithmetic, as shown in the following figure. The HPL binary used to collect results was compiled with ROCm 5.1.3.

Figure 2: LINPACK performance with AMD Instinct MI100 and MI210 GPUs

The following figure shows the power consumption during a single HPL run:

Figure 3: LINPACK power consumption with AMD Instinct MI100 and MI210 GPUs

We observed a significant improvement in AMD Instinct MI210 HPL performance over the AMD Instinct MI100 GPU. The single-GPU result for the MI210 is 18.2 TFLOPS, which is approximately 2.7 times higher than the MI100 result (6.75 TFLOPS). This improvement is due to the AMD CDNA2 architecture of the AMD Instinct MI210 GPU, which has been optimized for FP64 matrix and vector workloads. Also, the MI210 GPU has larger memory, so the problem size (N) used here is larger than for the AMD Instinct MI100 GPU.

As shown in Figure 2, the AMD Instinct MI210 shows almost linear scalability in HPL values on single-node, multi-GPU runs and better scalability than the previous generation AMD Instinct MI100 GPU. Both GPUs have the same TDP, with the AMD Instinct MI210 GPU delivering three times better performance, and therefore three times better performance per watt for the PowerEdge R7525 system. Figure 3 shows the power consumption characteristics over one HPL run cycle.

OpenMM

OpenMM is a high-performance toolkit for molecular simulation. It can be used as a library or as an application. It includes extensive language bindings for Python, C, C++, and even Fortran. The code is open source and actively maintained on GitHub and licensed under MIT and LGPL.

Figure 4: OpenMM double-precision performance with AMD Instinct MI100 and MI210 GPUs

Figure 5: OpenMM single-precision performance with AMD Instinct MI100 and MI210 GPUs

Figure 6: OpenMM mixed-precision performance with AMD Instinct MI100 and MI210 GPUs

We tested OpenMM with seven datasets to validate double, single, and mixed precision. We observed exceptional double precision performance with OpenMM on the AMD Instinct MI210 GPU compared to the AMD Instinct MI100 GPU. This improvement is due to the AMD CDNA2 architecture on the AMD Instinct MI210 GPU, which has been optimized for FP64 matrix and vector workloads.
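A hedged sketch of how such precision sweeps are commonly driven is shown below, using the benchmarking script that ships with OpenMM. The script location, platform name, and test selection are assumptions; the platform that maps to the ROCm stack depends on how OpenMM was built.

# Illustrative OpenMM benchmark runs; script path, platform, and test names are assumptions
python benchmark.py --platform=OpenCL --test=pme --precision=double --seconds=60
python benchmark.py --platform=OpenCL --test=pme --precision=single --seconds=60
python benchmark.py --platform=OpenCL --test=pme --precision=mixed --seconds=60
# The script prints ns/day for each run, which is the metric plotted in Figures 4 through 6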

Conclusion

The AMD Instinct MI210 GPU shows an impressive performance improvement in FP64 workloads. These workloads benefit because AMD has doubled the width of the ALUs to a full 64 bits. This change allows FP64 operations to run at full speed in the new 2nd Gen AMD CDNA architecture. Applications and workloads that are designed around FP64 operations are expected to take full advantage of the hardware.


Dell Validated Design for HPC pixstor Storage - PowerVault ME5 Update

J. Mario Gallegos

Thu, 06 Jul 2023 19:08:42 -0000


Introduction

Today’s HPC environments have increased demands for high-speed storage. Storage was becoming the bottleneck in many workloads due to higher core-count CPUs, larger and faster memory, a faster PCIe bus, and increasingly faster networks. Parallel File Systems (PFS) typically address these high-demand HPC requirements. PFS provides concurrent access to a single file or a set of files from multiple nodes, efficiently and securely distributing data to multiple LUNs across several storage servers. 

These file systems use spinning media to provide the highest capacity at the lowest cost. However, the speed and latency of spinning media often cannot keep up with the demands of many modern HPC workloads. The use of flash technology (that is, NVMe) in the form of burst buffers, faster tiers, or even fast scratch (local or distributed) can mitigate this issue. HPC pixstor Storage offers a cost-effective, high-capacity tier, with NVMe nodes as the component that addresses high-bandwidth demands and serves the optional High Demand Metadata module.

This blog is part of a series on PFS solutions for HPC environments, in particular the flexible, scalable, efficient, and reliable HPC pixstor Storage. Its focus is the upgrade of the storage nodes to the new Dell PowerVault ME5084 arrays, which provide a significant boost in performance compared to the previous generation (ME4084 array).

Note: Because arcastream changed its branding to all lowercase characters, we have modified instances of “arcastream,” “pixstor,” and “ngenea” accordingly.

Architecture

The following figure shows the architecture for the new generation of the Dell Validated Design for HPC pixstor Storage. It uses Dell PowerEdge R650, R750, and R7525 servers and the new PowerVault ME5084 storage arrays, with the pixstor 6.0 software from our partner company arcastream. 

  

Figure 1 Reference Architecture

Optional PowerVault ME484 EBOD arrays can increase the capacity of the solution as SAS additions to the PowerVault ME5084 storage arrays. The pixstor software includes the widespread General Parallel File System (GPFS), also known as Spectrum Scale, as the PFS component that is considered software-defined storage due to its flexibility and scalability. In addition, the pixstor software includes many other arcastream software components such as advanced analytics, simplified administration and monitoring, efficient file search, advanced gateway capabilities, and many others.

The main components of the pixstor solution are:

  • Management servers—PowerEdge R650 servers provide UI and CLI access for management and monitoring of the pixstor solution, as well as providing advanced search capabilities by compiling some metadata information in a database to speed up searches and prevent searches from loading the metadata network shared disks (NSDs). 
  • Storage module—The storage module is the main building block for the pixstor storage solution. Each module includes:
    • One pair of storage servers
    • One, two, or four backend storage arrays (ME5084) with optional capacity expansions (ME484)
    • Network Shared Disks (NSDs) contained in the backend storage arrays
  • Storage server (SS)—The storage server is an essential storage module component. HA pairs of PowerEdge R750 servers (failover domains) connect to ME5084 arrays using SAS 12 Gbps cables to manage data NSDs and provide access to NSDs using redundant high-speed network interfaces. For the standard pixstor configuration, these servers have the dual role of being metadata servers and managing metadata NSDs (using SSDs that replace all spare HDDs). The following figure shows the allocation of adapters for the PowerEdge R750 server:

Figure 2  PowerEdge R750 storage nodes - Slot allocation

  • Backend Storage—Backend storage is part of the storage module that stores file system data, as shown in Figure 1. The solution uses high-density 5U PowerVault ME5084 disk arrays. The following figure shows the ME5084 array with its two SAS controllers. Two SAS ports from each controller (two from A0-A3 and two from B0-B3) are connected to different HBAs in slots 1, 2, 5, and 7 on each of the storage nodes (four SAS cables per server to each ME5084 array). The ME5084 array requires twice the number of cables previously used by ME4 arrays to match ME5 performance. The SAS connector I/O of each controller (next to the RJ45 management Ethernet port) is used to connect an I/O Module (IOM) in the ME484 expansion array, using port I/O 0 (the left blue SAS port of each IOM) of the corresponding I/O Module (Controller A to I/O Module A, Controller B to I/O Module B). 

 

Figure 3  ME5084 array - Controllers and SAS ports

The following figure shows the back of the ME484 expansion array.

  • Capacity Expansion Storage—Optional PowerVault ME484 capacity expansions (shown in the following figure and inside the dotted orange square in Figure 1) are connected behind the ME5084 arrays using SAS 12 Gbps cables to expand the capacity of a storage module. For pixstor solutions, each ME5084 array is restricted to use only one ME484 expansion for performance and reliability (despite official ME5084 support for up to three ME484 expansions).

Figure 4 ME484 - I/O Module and SAS ports

  • Network Shared Disks (NSDs)—NSDs are backend block devices (that is, RAID 6 LUNs from ME5 arrays or replicated NVMe devices) that store data, metadata, or both. In the pixstor solution, file system data and metadata are stored in different NSDs. Data NSDs use spinning media (NLS SAS3 HDDs) or NVMe. Metadata NSDs use SSDs in the standard configuration or replicated NVMe devices for high metadata demands (metadata include directories, filenames, permissions, timestamps, and the location of data in other NSDs).
  • High Demand Metadata Server (HDMD)—The HDMD server is a component of the optional High Demand Metadata module (in the dotted yellow square in Figure 1). Pairs of PowerEdge R650 NVMe servers with up to 10 NVMe devices each in HA (failover domains) manage the metadata NSDs in replicated pairs and provide access using redundant high-speed network interfaces. Other supported servers (PowerEdge R750 and PowerEdge R7525 servers) can be used as NVMe nodes instead of the PowerEdge R650 server.
  • NVMe nodes—An NVMe node is the main component of the optional NVMe tier modules (in the dotted red squares in Figure 2). Pairs of the latest PowerEdge servers in HA (failover domains) provide a high-performance flash-based tier for the pixstor solution. The NVMe tier can use one of three PowerEdge server models: PowerEdge R650 servers with 10 direct-attached NVMe drives, PowerEdge R750 servers with 16 direct-attached NVMe devices, or PowerEdge R7525 servers with 24 direct-attached drives. To maintain homogeneous performance across the NVMe nodes and allow striping data across nodes in the tier, do not mix different server models in the same NVMe tier. However, multiple NVMe tiers, each with a different server model and accessed using different filesets, are supported.

    The selected PowerEdge servers support NVMe PCIe4 and PCIe3 devices. However, mixing NVMe PCIe4 devices with lower-performing PCIe3 devices is not recommended and is not supported within the same NVMe tier. Additional pairs of NVMe nodes can scale out performance and capacity for this NVMe tier. Increased capacity is provided by selecting the appropriate capacity for the NVMe devices supported on the servers or by adding more pairs of servers.

     An important difference from previous pixstor releases is that NVMesh is no longer a component of the solution. For HA purposes, an alternative based on GPFS replication of NSDs was implemented across each NVMe server HA pair, providing functionally mirrored NSDs.
  • Native client software—Native client software is installed on the clients to allow access to the file system. The file system must be mounted for access and appears as a single namespace.
  • Gateway nodes—The optional gateway nodes (in the dotted green square in Figure 1) are PowerEdge R750 servers (the same hardware as ngenea nodes but with different software) in a Samba Clustered Trivial Database (CTDB) cluster that provides NFS or SMB access to clients that do not have or cannot have the native client software installed.
  • ngenea nodes—The optional ngenea nodes (in the dotted green square in Figure 1) are PowerEdge R750 servers (the same hardware as the gateway nodes but using different software) that offer access to external storage systems (for example, object storage, cloud storage, tape libraries, and so on) allowing them to be used as another tier in the same single namespace using enterprise protocols, including cloud protocols.
  • Management switch—A PowerSwitch N2248X-ON Gigabit Ethernet switch connects the different servers and storage arrays. It is used for administration of the solution, interconnecting all the components.
  • High-speed network switch—Mellanox QM8700 switches provide high-speed access using InfiniBand (IB) HDR. For Ethernet solutions, the Mellanox SN3700 is used.

Solution components

This solution was released with the latest 3rd Generation Intel Xeon Scalable CPUs, also known as Ice Lake, and the fastest RAM available (3200 MT/s). The following table lists the main components for the solution. Some discrepancies exist between the intended BOM and the actual test hardware because only a few CPU models were made available as prerelease (production-level) hardware for our project, and they did not include the planned life-cycle model. 

The At release column lists the components planned to be used at release and available to customers with the solution. The Test bed column lists the components actually used for characterizing the performance of the solution. The drives listed for data (12 TB NLS) were used for performance characterization, but all supported HDDs and SSDs in the PowerVault ME5 Support Matrix can be used for the solution. Because the ME5 controllers are no longer the first bottleneck of the backend storage, using drives with a higher rated speed (10K, 15K, and SSDs) might provide some increase in sequential performance (a maximum of about 30 to 35 percent higher throughput is expected), can provide better random IOPS, and might improve create and remove metadata operations. For full high-speed network redundancy, two high-speed switches must be used (QM8700 for InfiniBand or SN3700 for Ethernet); each switch must have one CX6 adapter connected from each server.

The listed software components describe the versions used during the initial testing. However, these software versions might change over time in between official releases to include important fixes, support for new hardware components, or addition of important features.

The table lists possible data HDDs and SSDs, which are listed in the Dell PowerVault ME5 Support Matrix.

Table 1.              Components used at release time and in the test bed

Solution component

At release

Test bed

Internal management switch

Dell PowerSwitch N2248X-ON GbE

PowerSwitch S3048-ON

Data storage subsystem

1 x to 4 x PowerVault ME5084 arrays

2 x Dell PowerVault ME5084 arrays

Optional 4x PowerVault ME484 (one per ME5084 array)

80 x 12 TB 3.5" NL SAS3 HDD drives

Alternative options: 15K RPM: 900GB; 10K RPM: 1.2TB, 2.4 TB

SSD: 960GB, 1.92 TB, 3.84 TB; NLS: 4 TB, 8 TB, 12 TB, 16 TB, 20 TB

8 LUNs, linear 8+2 RAID 6, chunk size 512 KiB

4 - 1.92 TB (or 3.84 TB or 7.68 TB) SAS3 SSDs per ME5084 array for metadata – 2 x RAID 1 (or 4 - Global HDD spares, if optional HDMD is used)

Optional HDMD storage subsystem

One or more pairs of NVMe-tier servers

RAID storage controllers

Duplex 12 Gbps SAS

Capacity without expansion (with 12 TB HDDs)

Raw: 4032 TB (3667 TiB or 3.58 PiB)

Formatted: approximately 3072 TB (2794 TiB or 2.73 PiB)

Capacity with expansion (Large)

(12 TB HDDs)

Raw: 8064 TB (7334 TiB or 7.16 PiB)

Formatted: approximately 6144 TB (5588 TiB or 5.46 PiB)

Processor

Gateway/ngenea

2 x Intel Xeon Gold 6326 2.9 GHz, 16C/32T, 11.2GT/s, 24M Cache, Turbo, HT (185 W) DDR4-3200

2 x Intel Xeon Platinum 8352Y

2.2 GHz, 32C/64T, 11.2GT/s,

48M Cache, Turbo, HT (205 W)

DDR4-3200

Storage node

Management node

2x Intel Xeon Gold 6330 2 GHz, 28C/56T, 11.2GT/s, 42M Cache, Turbo, HT (185 W) DDR4-2933

R650 NVMe Nodes

2x Intel Xeon Gold 6354 3.00 GHz, 18C/36T, 11.2GT/s, 39M Cache, Turbo, HT (205 W) DDR4-3200

Optional High Demand Metadata

2x Intel Xeon Gold 6354 3.00 GHz, 18C/36T, 11.2GT/s, 39M Cache, Turbo, HT (205 W) DDR4-3200

R750 NVMe nodes

 

2x Intel Xeon Platinum 8352Y, 2.2 GHz, 32C/64T, 11.2GT/s, 48M Cache, Turbo, HT (205 W) DDR4-3200

R7525 NVMe nodes

2 x AMD EPYC 7302 3.0 GHz, 16C/32T, 128M L3 (155 W)

2 x AMD 7H12 2.6 GHz, 64C/64T 256M L3 (280 W)

Memory

 

Gateway/ngenea

16 x 16 GiB 3200 MT/s RDIMMs (256 GiB)

Storage node

Management node

Operating system

Red Hat Enterprise Linux 8.5

Kernel version

4.18.0-348.23.1.el8_5.x86_64

pixstor software

6.0.3.1-1

Spectrum Scale (GPFS)

Spectrum Scale (GPFS) 5.1.3-1

OFED version

Mellanox OFED 5.6-1.0.3.3

High-performance NIC

All: 2 x Dell OEM ConnectX-6 Single Port HDR VPI InfiniBand, Low Profile

Gateway and ngenea Nodes: 4x CX6 VPI adapters, 2x FS & 2x External

High-performance switch

Mellanox QM8700 HDR InfiniBand switch (Mellanox SN3700 for Ethernet solutions)

Local Disks (operating system and analysis/monitoring)

NVMe servers: BOSS-S2 with 2x M.2 240 GB in RAID 1

Other servers: 3x 480 GB SSD SAS3 (RAID 1 + HS) for operating system with PERC H345 front RAID controller

Systems management

iDRAC9 Enterprise + Dell OpenManage 10.0.1-4561

Performance Characterization

To characterize the new solution component (ME5084 array), we used the following benchmarks:

  • IOzone N to N sequential
  • IOR N to 1 sequential
  • IOzone random
  • MDtest 

A delay in the delivery of the ME5084 arrays needed to update the solution imposed an unexpected limitation: the number of ME5 prototypes available limited this work. Only two ME5084 arrays were used for the benchmark tests, which is the same as a Medium configuration. However, to compare results to the previous generation of the PowerVault array (ME4084), all IOzone and IOR results were extrapolated for a Large configuration by multiplying the results by 2. When the additional ME5084 arrays are delivered, all benchmark tests will be repeated on the Large configuration, and then again using ME484 expansions.

For these benchmarks, the test bed included the clients in the following table: 

Table 2 Client test bed

Component

Description

Number of client nodes

16

Client node

C6420

Processors per client node

11 nodes with 2 x Intel Xeon Gold 6230 20 Cores @ 2.1 GHz

5 nodes with 2 x Intel Xeon Gold 6248 20 Cores @ 2.4 GHz

Memory per client node

6230 nodes with 12 x 16 GiB 2933 MT/s RDIMMs (192 GiB)

6248 nodes with 12 x 16 GiB 2666 MT/s RDIMMs (192 GiB)

BIOS

2.8.2

Operating system

CentOS 8.4.2105

Operating system kernel

4.18.0-305.12.1.el8_4.x86_64

pixstor software

6.0.3.1-1

Spectrum Scale (GPFS)

5.1.3-0

OFED version

MLNX_OFED_LINUX-5.4-1.0.3.0

CX6 FW

8 nodes with Mellanox CX6 single port: 20.32.1010

8 nodes with Dell OEM CX6 single port: 20.31.2006

Because there were only 16 compute nodes available for testing, when a higher number of threads was required, those threads were distributed equally on the compute nodes (that is, 32 threads = 2 threads per node, 64 threads = 4 threads per node, 128 threads = 8 threads per node, 256 threads = 16 threads per node, 512 threads = 32 threads per node, 1024 threads = 64 threads per node). The intention was to simulate a higher number of concurrent clients with the limited number of compute nodes. Because the benchmarks support a high number of threads, a maximum value up to 1024 was used (specified for each test), avoiding excessive context switching and other related side effects.
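
The following minimal Python sketch illustrates this round-robin distribution (the hostnames are hypothetical placeholders, and the exact threadlist/hostfile formats that IOzone and mpirun expect are not reproduced here):

# Sketch of the round-robin mapping of benchmark threads to the 16 compute nodes.
# Hostnames are hypothetical; only the distribution logic is illustrated.
nodes = [f"node{i:03d}" for i in range(1, 17)]

for total in (32, 64, 128, 256, 512, 1024):
    # thread t lands on node t % 16, so each node ends up with total/16 threads
    mapping = [nodes[t % len(nodes)] for t in range(total)]
    print(f"{total:4d} threads -> {mapping.count(nodes[0])} threads per node")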

Sequential IOzone Performance N clients to N files

Sequential N clients to N files performance was measured with IOzone version 3.492. The tests that we ran varied from a single thread up to 1024 threads in increments of powers of 2. 

We minimized caching effects by setting the GPFS page pool tunable to 16 GiB on clients and using a total file size larger than twice the memory of servers and clients (8 TiB). Note that this GPFS tunable sets the maximum amount of memory used for caching data, regardless of the amount of RAM that is installed and free. While other Dell HPC solutions use a 1 MiB block size for large sequential transfers, GPFS was formatted here with 8 MiB blocks; therefore, that transfer size is used in the benchmark for optimal performance. The block size on the file system might seem too large and likely to waste space, but GPFS uses subblock allocation to prevent that situation. In the current configuration, each block was subdivided into 512 subblocks of 16 KiB each. 

The following commands were used to run the benchmark for read and write operations, where the $Threads variable is the number of threads used (1 to 1024, incremented in powers of 2), and threadlist is the file that assigns each thread to a different node, using the round-robin method to spread them homogeneously across the 16 compute nodes.

To avoid any possible data caching effects from the clients, the total data size of the files was more than twice the total amount of RAM that the clients and servers have. That is, because each client has 128 GiB of RAM (total 2 TiB) and each server has 256 GiB (total 1 TiB), the total amount is 3 TiB, but 8 TiB of data were used. The 8 TiB were equally divided by the number of threads used.

./iozone -i0 -c -e -w -r 8M -s ${Size}G -t $Threads -+n -+m ./threadlist

./iozone -i1 -c -e -w -r 8M -s ${Size}G -t $Threads -+n -+m ./threadlist
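
As a quick sanity check of the sizing logic described above, the short sketch below (illustrative only, not part of the benchmark scripts) computes the per-thread file size passed as ${Size} and the subblock size that results from the 8 MiB file system block:

# Illustrative arithmetic for the IOzone runs (not part of the benchmark itself).
TOTAL_DATA_GIB = 8192          # 8 TiB in total, chosen to exceed client plus server RAM
BLOCK_MIB = 8                  # GPFS file system block size
SUBBLOCKS_PER_BLOCK = 512      # GPFS subblock allocation

print(f"Subblock size: {BLOCK_MIB * 1024 // SUBBLOCKS_PER_BLOCK} KiB")   # 16 KiB

for threads in (1, 8, 64, 512, 1024):
    size_gib = TOTAL_DATA_GIB / threads   # value passed to iozone as ${Size}G
    print(f"{threads:4d} threads -> {size_gib:g} GiB file per thread")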

Figure 5  N to N sequential performance

IMPORTANT: To allow comparison of ME5 array values to those previously obtained with ME4 arrays directly on the graph, the IOzone results of the Medium configuration (two ME5084 arrays) were multiplied by 2 to estimate the performance of a Large configuration (four ME5084 arrays). 

From the results, we see that performance rises with the number of threads used and then reaches a plateau at eight threads for both read and write operations (the values at four threads are only slightly smaller). Read performance rises a little further and then decreases to a more stable value. Write performance seems to be more stable than read performance, with only a small variation around the sustained performance in the plateau. 

The maximum performance for read operations was 31.4 GB/s at 16 threads, about 34.5 percent below the specification of the ME5084 arrays (approximately 48 GB/s), and well below the performance of the HDR links (4 x 25 GB/s or 100 GB/s). Even if only one HDR link per storage server was used (ceiling speed of 50 GB/s), that value is still higher than the specification of the 4 x ME5084 arrays. Write performance peaks at 512 threads with 27.8 GB/s, but a similar value was observed at 32 threads. The maximum value was about 30.5 percent below the ME5 specifications (40 GB/s). Initial ME5 testing used raw devices with SSDs in RAID (on ME5024 arrays) and HDDs in (8+2) RAID 6 (on ME5084 arrays) and was able to reach the specifications of the controllers. Therefore, the current assumption is that seek times introduced by GPFS scattered access (random placement of 8 MiB blocks across the surface of all drives) are limiting the performance. Adding ME484 expansions can help reach performance closer to the specifications because having twice the LUNs reduces the effect of the seek time across the file system. Our next whitepaper will include performance for ME484 expansions, and benchmark tests will address this assumption.

Sequential IOR Performance N clients to 1 file

Sequential N clients to a single shared file performance was measured with IOR version 3.3.0, with OpenMPI v4.1.2A1 used to run the benchmark over the 16 compute nodes. The tests that we ran varied from one thread up to 512 threads because there were not enough cores for 1024 or more threads. This benchmark used 8 MiB blocks for optimal performance. The previous section provides a more complete explanation about why that block size was selected.

We minimized data caching effects by setting the GPFS page pool tunable to 16 GiB and the total file size to 8192 GiB to ensure that neither clients nor servers could cache any data. That 8 TiB total was divided equally by the number of threads (the $Size variable in the following code manages that value).

The following commands were used to run the benchmark for write and read operations, where the $Threads variable is the number of threads used (1 to 512 incremented in powers of two) and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using the round-robin method to spread them homogeneously across the 16 compute nodes:

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ucx --oversubscribe --prefix /usr/mpi/gcc/openmpi-4.1.2a1 --map-by node /mmfs1/bench/ior   -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -w -s 1 -t 8m -b ${Size}G 

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ucx --oversubscribe --prefix /usr/mpi/gcc/openmpi-4.1.2a1 --map-by node /mmfs1/bench/ior   -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -r -s 1 -t 8m -b ${Size}G

Figure 6  N to 1 Sequential Performance

IMPORTANT: To allow comparison of ME5 array values to those previously obtained with ME4 arrays directly on the graph, the IOzone results of the Medium configuration (two ME5084 arrays) were multiplied by 2 to estimate the performance of a Large configuration (four ME5084 arrays).

From the results, we see that read and write performance are high regardless of the implicit need for locking mechanisms because all threads access the same file. Performance rises quickly with the number of threads used and then reaches a plateau at eight threads that is relatively stable up to the maximum number of threads used on this test. Notice that the maximum read performance was 30.9 GB/s at 16 threads, but similar to sequential N to N tests, performance decreases slightly until reaching a more stable value. The maximum write performance of 23 GB/s was achieved at 32 threads and remains stable across a higher number of threads. 

Random small blocks IOzone Performance N clients to N files

Random N clients to N files performance was measured with IOzone version 3.492.

The tests that we ran varied from a single thread up to 512 threads in increments of powers of 2, because there were not enough client cores for 1024 threads. Each thread used a different file, and the threads were assigned using the round-robin method on the client nodes. This benchmark test used 4 KiB blocks to emulate small-block traffic.

We minimized caching effects by setting the GPFS page pool tunable to 4 GiB. To avoid any possible data caching effects from the clients, the total data size of the files created was again 8,192 GiB divided by the number of threads (the $Size variable in the following code manages that value). However, the actual random operations were limited to 128 GiB (4 GiB x 16 clients x 2) to save running time, which can be extremely long due to the low IOPS of NLS drives. 

./iozone -i0 -I -c -e -w -r 8M -s ${Size}G -t $Threads -+n -+m ./me5_threadlist        <= Create the files sequentially

./iozone -i2 -I -O -w -r 4k -s ${Size}G -t $Threads -+n -+m ./me5_threadlist   <= Perform the random reads and writes

Figure 7  N to N random performance

IMPORTANT: To allow comparison of ME5 array values to those previously obtained with ME4 arrays directly on the graph, the IOzone results of the Medium configuration (two ME5084 arrays) were multiplied by 2  to estimate the performance of a Large configuration (four ME5084 arrays).

From the results, we see that write performance starts at a high value of 15.2K IOPS, rises to the peak of 20.8K IOPS at four threads, and then decreases until it reaches a plateau at 16 threads (15-17K IOPS). Read performance starts low at 1.5K IOPS at 16 threads and increases steadily with the number of threads used (the number of threads is doubled for each data point) until achieving a maximum performance of 31.8K IOPS at 512 threads without reaching a plateau. Using more threads requires more than 16 compute nodes to avoid resource starvation and excessive swapping that can lower apparent performance. Because NLS HDD seek time limits maximum IOPS long before reaching the ME5 controller specification, using ME484 expansions can help to increase IOPS, and faster drives (10K, 15K, or SSDs) can help even more. However, the NVMe tier is better suited to handle extremely high IOPS requirements.
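
To put these figures in perspective, the following back-of-the-envelope sketch (assuming 320 data HDDs in the extrapolated Large configuration, that is, 80 NL SAS data drives per ME5084 array) shows why the spinning drives, rather than the ME5 controllers, are the likely limit:

# Rough per-drive estimate for the random 4 KiB results. The drive count is an
# assumption based on 4 x ME5084 arrays with 80 data HDDs each (Large configuration).
DATA_HDDS = 4 * 80
peak_read_iops = 31_800
peak_write_iops = 20_800

print(f"~{peak_read_iops / DATA_HDDS:.0f} read IOPS per NLS drive")    # ~99
print(f"~{peak_write_iops / DATA_HDDS:.0f} write IOPS per NLS drive")  # ~65
# Both values are in the range that 7.2K RPM NL-SAS drives typically sustain for
# small random I/O, consistent with seek time being the bottleneck.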

Metadata performance with MDtest

The optional HDMD module used in this test bed consisted of a single pair of PowerEdge R650 servers, each with 10 PM1735 NVMe PCIe 4 devices. Metadata performance was measured with MDtest version 3.3.0, with OpenMPI v4.1.2A1 used to run the benchmark over the 16 compute nodes. The tests that we ran varied from a single thread up to 512 threads. The benchmark was used for files only (no directory metadata), measuring the number of create, stat, read, and remove operations that the solution can handle.

Because the same High Demand Metadata NVMe module was used for previous benchmark tests of the pixstor storage solution, metadata results are similar to previous results (NVMe tier). Therefore, the studies with empty and 3 KiB files were included for completeness, but results with 4 KiB files are more relevant for this blog. Because 4 KiB files cannot fit into an inode along with the metadata information, ME5 arrays are used to store data for each file. Therefore, MDtest can also provide an approximate estimate of small-file performance for read operations and for the rest of the metadata operations using ME5 arrays.

The following command was used to run the benchmark, where the $Threads variable is the number of threads used (1 to 512 incremented in powers of two) and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using the round-robin method to spread them homogeneously across the 16 compute nodes. The file size for read  and create operations was stored in $FileSize. Like the Random IO benchmark, the maximum number of threads was limited to 512 because there are not enough cores on client nodes for 1024 threads. Context switching can affect the results, reporting a number lower than the real performance of the solution.

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /usr/mpi/gcc/openmpi-4.1.2a1 --map-by node --mca btl_openib_allow_ib 1 /mmfs1/bench/mdtest -v -P -d /mmfs1/perftest -i 1 -b $Directories -z 1 -L -I 1024 -u -t -w $FileSize -e $FileSize

Because the total number of IOPS, the number of files per directory, and the number of threads can affect the performance results, we decided to keep the total number of files fixed at 2^21 = 2,097,152, keep the number of files per directory fixed at 1024, and vary the number of directories per thread as the number of threads changed, as shown in the following table (and verified by the short sketch after it):

Table 3 MDtest distribution of files on directories

Number of threads    Number of directories per thread    Total number of files
1                    2048                                2,097,152
2                    1024                                2,097,152
4                    512                                 2,097,152
8                    256                                 2,097,152
16                   128                                 2,097,152
32                   64                                  2,097,152
64                   32                                  2,097,152
128                  16                                  2,097,152
256                  8                                   2,097,152
512                  4                                   2,097,152
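
The relationship between the values in the table is simply total files = threads x directories per thread x files per directory; the following short sketch verifies it:

# Verify the MDtest file distribution: 2^21 files in total, 1024 files per directory.
TOTAL_FILES = 2 ** 21
FILES_PER_DIR = 1024

for threads in (1, 2, 4, 8, 16, 32, 64, 128, 256, 512):
    dirs_per_thread = TOTAL_FILES // (FILES_PER_DIR * threads)
    assert threads * dirs_per_thread * FILES_PER_DIR == TOTAL_FILES
    print(f"{threads:4d} threads -> {dirs_per_thread} directories per thread")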


Figure 8  Metadata Performance – empty Files

The scale chosen was logarithmic with base 10 to allow comparing operations that differ by several orders of magnitude; otherwise, some of the operations would appear as a flat line close to 0 on a linear scale. A log graph with base 2 would be more appropriate because the number of threads is increased by powers of 2. Such a graph would look similar, but people tend to perceive and remember numbers based on powers of 10 better.

Empty files do not involve the ME5 arrays at all and only represent the performance of the PowerEdge R650 servers with NVMe drives. The system provides good results, with stat operations reaching the peak value of almost 8.6M op/s at 256 threads and then decreasing at 512 threads. Create operations reach the maximum of 239.6K op/s at 64 threads and then decrease slightly until reaching a plateau at 128 threads. Read operations attain a maximum of 3.7M op/s at 128 threads, then decrease slowly. Remove operations peak at 312.9K op/s at 64 threads, then decrease slightly and seem to reach a plateau.

Metadata Performance with 3 KiB files

Figure 9  Metadata Performance – 3 KiB Files

The scale chosen was logarithmic with base 10 to allow comparing operations that differ by several orders of magnitude; otherwise, some of the operations would appear as a flat line close to 0 on a linear scale. A log graph with base 2 would be more appropriate because the number of threads is increased by powers of 2. Such a graph would look similar, but people tend to perceive and remember numbers based on powers of 10 better. 

Note that 3 KiB files still fit completely in inodes and therefore do not involve the ME5 arrays, but only represent the performance of the PowerEdge R650 servers with NVMe drives. The system provides good results, with stat operations reaching the peak value of 9.9M op/s at 512 threads. Create operations reach the maximum of 192.2K op/s at 128 threads and seem to reach a plateau. Read operations attained a maximum of 3M op/s at 128 threads. Remove operations peaked at 298.7K op/s at 128 threads. 

Metadata Performance with 4 KiB files

Figure 10  Metadata Performance – 4 KiB Files

The scale chosen was logarithmic with base 10 to allow comparing operations that differ by several orders of magnitude; otherwise, some of the operations would appear as a flat line close to 0 on a linear scale. A log graph with base 2 would be more appropriate because the number of threads is increased by powers of 2. Such a graph would look similar, but people tend to perceive and remember numbers based on powers of 10 better.

The system provides good results, with stat operations reaching the peak value of almost 9.8M op/s at 256 threads and then decreasing at 512 threads. Create operations reach the maximum of 115.6K op/s at 128 threads and then decrease slightly until 512 threads, where the value drops to less than 40 percent of the peak. Read operations attain a maximum of 4M op/s at 256 threads, which seems too high for NLS drives (possibly implying that the file system is caching most of the data needed for these data points), and also drop suddenly at 512 threads. More work is needed to understand the sudden drop for create operations and the high read performance. Finally, remove operations peak at 286.6K op/s at 128 threads and decrease at higher thread counts.

Conclusions and future work

The new ME5 arrays provide a significant increase in performance (71 percent for read operations and 82 percent for write operations from specifications). The new arrays directly increased the performance for the pixstor solution, but not to the level expected from the specification, as seen in Table 4. Because the pixstor solution uses scattered access by default, it is expected that ME484 expansions will help get closer to the limit of the ME5 controllers.

This solution provides HPC customers with a reliable parallel file system (Spectrum Scale – also known as GPFS) that is used by many Top500 HPC clusters. In addition, it provides exceptional search capabilities without degrading performance, and advanced monitoring and management. By using standard protocols like NFS, SMB, and others, optional gateways allow file sharing to as many clients as needed. Optional ngenea nodes allow tiering of other Dell storage such as Dell PowerScale, Dell ECS, other vendors, and even cloud storage.

Table 4   Peak and sustained performance with ME5084 arrays

Benchmark

Peak performance

Sustained performance

Write

Read

Write

Read

Large Sequential N clients to N files

31.4 GB/s

27.8 GB/s

28 GB/s

26 GB/s

Large Sequential N clients to single shared file

30.9 GB/s

27.8 GB/s

27.3 GB/s

27 GB/s

Random Small blocks N clients to N files

31.8K IOPS

20.8K IOPS

15.5K IOPS

27K IOPS

Metadata Create 4 KiB files

115.6K IOPS

50K IOPS

Metadata Stat 4 KiB files

9.8M IOPS

1.4M IOPS

Metadata Remove 4 KiB files

286.7K IOPS

195K IOPS

When two additional ME5084s are added to the pixstor solution, it will be fully benchmarked as a Large configuration (four ME5084 arrays). It will also be fully benchmarked  after adding expansion arrays (four ME484 arrays). Another document will be released with this and any additional information.


Read Full Blog
  • Intel
  • PowerEdge
  • HPC
  • GPU
  • AMD

HPC Application Performance on Dell PowerEdge R750xa Servers with the AMD Instinct™ MI210 Accelerator

Frank Han Neeraj Kumar

Fri, 12 Aug 2022 16:47:40 -0000

|

Read Time: 0 minutes

Overview

The Dell PowerEdge R750xa server, powered by 3rd Generation Intel Xeon Scalable processors, is a 2U rack server that supports dual CPUs, with up to 32 DDR4 DIMMs at 3200 MT/s in eight channels per CPU. The PowerEdge R750xa server is designed to support up to four PCIe Gen 4 accelerator cards and up to eight SAS/SATA SSD or NVMe drives.

Figure 1: Front view of the PowerEdge R750xa server

The AMD Instinct™ MI210 PCIe accelerator is the latest GPU from AMD that is designed for a broad set of HPC and AI applications. It provides the following key features and technologies:

  • Built with the 2nd Gen AMD CDNA architecture with new Matrix Cores delivering improvements on FP64 operations and enabling a broad range of mixed-precision capabilities
  • 64 GB high-speed HBM2e memory bandwidth supporting highly data-intensive workloads
  • 3rd Gen AMD Infinity Fabric™ technology bringing advanced platform connectivity and scalability enabling fully connected dual P2P GPU hives through AMD Infinity Fabric™ links
  • Combined with the AMD ROCm™ 5 open software platform allowing researchers to tap the  power of the AMD Instinct™ accelerator with optimized compilers, libraries, and runtime support

This blog provides the performance characteristics of a single PowerEdge R750xa server with the AMD Instinct MI210 accelerator. It compares the performance numbers of microbenchmarks (GEMM of FP64 and FP32 and bandwidth test), HPL, and LAMMPS for both the AMD Instinct MI210 accelerator and the previous generation AMD Instinct MI100 accelerator.   

The following table provides configuration details for the PowerEdge R750xa system under test (SUT):

Table 1: SUT hardware and software configurations

Component

Description

Processor

Dual Intel Xeon Gold 6338

Memory

512 GiB (16 x 32 GiB @ 3200 MT/s)

Local disk

3.84 TB SATA 6 Gbps SSD

Operating system

Rocky Linux release 8.4 (Green Obsidian)

GPU model

4 x AMD MI210 (PCIe-64G) or 3 x AMD MI100 (PCIe-32G)

GPU driver version

5.13.20.5.1

ROCm version

5.1.3

Processor Settings > Logical Processors

Disabled

System profiles

Performance

rocm-blas-bench

5.1.3

TransferBench

5.1.3

HPL

Compiled with ROCm v5.1.3

LAMMPS (KOKKOS) 

Version: LAMMPS patch_4May2022

The following table provides the specifications of the AMD Instinct MI210 and MI100 GPUs:

Table 2: AMD Instinct MI100 and MI210 PCIe GPU specifications

GPU architecture                     AMD Instinct MI210    AMD Instinct MI100

Peak Engine Clock (MHz)              1700                  1502
Stream processors                    6656                  7680
Peak FP64 (TFlops)                   22.63                 11.5
Peak FP64 Tensor DGEMM (TFlops)      45.25                 11.5
Peak FP32 (TFlops)                   22.63                 23.1
Peak FP32 Tensor SGEMM (TFlops)      45.25                 46.1
Memory size (GB)                     64                    32
Memory type                          HBM2e                 HBM2
Peak Memory Bandwidth (GB/s)         1638                  1228
Memory ECC support                   Yes                   Yes
TDP (Watt)                           300                   300

GEMM microbenchmarks

Generic Matrix-Matrix Multiplication (GEMM) is a multithreaded dense matrix multiplication benchmark that is used to measure the performance of a single GPU. Its O(n³) computational complexity against an O(n²) memory footprint makes GEMM an ideal benchmark for measuring GPU acceleration at high efficiency, because achieving high efficiency depends on minimizing redundant memory accesses.

For this test, we compiled the rocblas-bench binary from https://github.com/ROCmSoftwarePlatform/rocBLAS to collect DGEMM (double-precision) and SGEMM (single-precision) performance numbers. 

These results only reflect the performance of matrix multiplication, and results are measured in the form of peak TFLOPS that the accelerator can deliver. These numbers can be used to compare the peak compute performance capabilities of different accelerators. However, they might not represent real-world application performance. 
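
As a rough illustration of how such TFLOPS figures are derived from a GEMM run (the matrix sizes and timing below are hypothetical, not the values used by rocblas-bench in this test):

# Hypothetical example: deriving GEMM TFLOPS from problem size and run time.
def gemm_tflops(m, n, k, seconds):
    # A GEMM of an (m x k) matrix by a (k x n) matrix performs roughly 2*m*n*k
    # floating-point operations.
    return (2 * m * n * k) / seconds / 1e12

# For example, an 8192 x 8192 x 8192 double-precision GEMM finishing in 40 ms
# would correspond to roughly 27.5 TFLOPS.
print(f"{gemm_tflops(8192, 8192, 8192, 0.040):.1f} TFLOPS")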

Figure 2 presents the performance results measured for DGEMM and SGEMM on a single GPU:

Figure 2: DGEMM and SGEMM numbers obtained on AMD Instinct MI210 and MI100 GPUs with the PowerEdge R750xa server

From the results we observed:

  • The CDNA 2 architecture from AMD, which includes second-generation Matrix Cores and faster memory, provides a significant improvement in the theoretical peak FP64 Tensor DGEMM value (45.3 TFLOPS). This result is 3.94 times better than the previous generation AMD Instinct MI100 GPU peak of 11.5 TFLOPS. The measured DGEMM value on the AMD Instinct MI210 GPU is 28.3 TFlops, which is 3.58 times better than the measured value of 7.9 TFlops on the AMD Instinct MI100 GPU.
  • For FP32 Tensor operations in the SGEMM single-precision GEMM benchmark, the theoretical peak performance of the AMD Instinct MI210 GPU is 45.23 TFLOPS, and the measured performance value is 32.2 TFLOPS. An improvement of approximately nine percent was observed in the measured value of SGEMM compared to the AMD Instinct MI100 GPU.

GPU-to-GPU bandwidth test 

This test captures the performance characteristics of buffer copying and kernel read/write operations. We collected results by using TransferBench, compiling the binary by following the procedure provided at https://github.com/ROCmSoftwarePlatform/rccl/tree/develop/tools/TransferBench. On the PowerEdge R750xa server, both the AMD Instinct MI100 and MI210 GPUs have the same GPU-to-GPU throughput, as shown in the following figure:

Figure 3: GPU-to-GPU bandwidth test with TransferBench on the PowerEdge R750xa server with AMD Instinct MI210 GPUs

High-Performance Linpack (HPL) Benchmark

HPL measures a system’s floating point computing power by solving a random system of linear equations in double precision (FP64) arithmetic. The peak FLOPS (Rpeak) is the highest number of floating-point operations that a computer can perform per second in theory. 

It  can be calculated using the following formula:

clock speed of the GPU × number of GPU cores × number of floating-point operations that the GPU performs per cycle 

Measured performance is referred to as Rmax. The ratio of Rmax to Rpeak demonstrates the HPL efficiency, which is how close the measured performance is to the theoretical peak. Several factors influence efficiency including GPU core clock speed boost and the efficiency of the software libraries.
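
A hedged worked example of the Rpeak and efficiency arithmetic follows, using the measured single-GPU HPL result discussed below and the FP64 vector peak from Table 2 as Rpeak; whether the vector or matrix peak is the appropriate reference depends on how the theoretical peak is defined for the workload:

# Hypothetical efficiency calculation for a single AMD Instinct MI210 GPU.
def hpl_efficiency(rmax_tflops, rpeak_tflops):
    # Efficiency is the ratio of measured (Rmax) to theoretical peak (Rpeak) performance.
    return rmax_tflops / rpeak_tflops

rmax = 18.2     # measured single-GPU HPL result (TFLOPS), from the discussion of Figure 4
rpeak = 22.63   # MI210 peak FP64 vector TFLOPS, from Table 2
print(f"HPL efficiency ~ {hpl_efficiency(rmax, rpeak):.0%}")   # ~80%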

The results shown in the following figure are the Rmax values, which are measured HPL numbers on AMD Instinct MI210 and AMD MI100 GPUs. The HPL binary used to collect the result was compiled with ROCm 5.1.3. 

Figure 4: HPL performance on AMD Instinct MI210 and MI100 GPUs powered with R750xa servers

The following figure shows the power consumption during a single HPL test:

   

Figure 5: System power use during one HPL test across four GPUs

Our observations include:  

  • We observed a significant improvement in the HPL performance with the AMD Instinct MI210 GPU over the AMD Instinct MI100 GPU. The performance on a single test of the AMD Instinct MI210 GPU is 18.2 TFLOPS, which is over 2.8 times higher than the AMD Instinct MI100 number of 6.4 TFLOPS. This improvement is a result of the AMD CDNA2 architecture on the AMD Instinct MI210 GPU, which has been optimized for FP64 matrix and vector workloads.
  • As shown in Figure 4, the AMD Instinct MI210 GPU provides almost linear scalability in the HPL values on single node multi-GPU runs. The AMD Instinct MI210 GPU shows better scalability compared to the previous generation AMD Instinct MI100 GPUs.
  • Both AMD Instinct MI100 and MI210 GPUs have the same TDP of 300 W, with the AMD Instinct MI210 GPU delivering 3.6 times better performance; the performance per watt from a PowerEdge R750xa server is therefore 3.6 times higher.

LAMMPS Benchmark

LAMMPS is a molecular dynamics simulation code that is a GPU bandwidth-bound application. We used the KOKKOS acceleration library implementation of LAMMPS to measure the performance of AMD Instinct MI210 GPUs.

The following figure compares the LAMMPS performance of the AMD Instinct MI210 and MI100 GPU with four different datasets: 

Figure 6: LAMMPS performance numbers on AMD Instinct MI210 and MI100 GPUs on PowerEdge R750xa servers with different datasets

Our observations include:

  • We measured an average 21 percent performance improvement on the AMD Instinct MI210 GPU compared to the AMD Instinct MI100 GPU with the PowerEdge R750xa server. Because MI100 and MI210 GPUs have different sizes of onboard GPU memory, the problem sizes of each LAMMPS dataset were adjusted to represent the best performance from each GPU. 
  • Datasets such as Tersoff, ReaxFF/C, and EAM on the AMD Instinct MI210 GPU show a 30 percent, 22 percent, and 18 percent improvement, respectively. This result is primarily because the AMD Instinct MI210 GPU comes with faster and larger HBM2e memory (64 GB) compared to the AMD Instinct MI100 GPU, which comes with HBM2 (32 GB) memory. For the LJ dataset, the improvement is smaller but still observed at 12 percent. This result is because single-precision calculations are used and the FP32 peak performance of the AMD Instinct MI210 and MI100 GPUs is at the same level.

Conclusion 

The AMD Instinct MI210 GPU shows impressive performance improvement in FP64 workloads. These workloads benefit because AMD has doubled the width of the ALUs to a full 64 bits, allowing FP64 operations to run at full speed in the new CDNA 2 architecture. Applications and workloads that can take advantage of FP64 operations are expected to benefit the most from the AMD Instinct MI210 GPU. The faster HBM2e memory bandwidth of the AMD Instinct MI210 GPU provides advantages for GPU memory-bound applications.

The PowerEdge R750xa server with AMD Instinct MI210 GPUs is a powerful compute engine, which is well suited for HPC users who need accelerated compute solutions.

Next steps 

In future work, we plan to describe benchmark results on additional HPC and deep learning applications, compare the AMD Infinity Fabric™ Link (xGMI) bridges, and show AMD Instinct MI210 performance numbers on other Dell PowerEdge servers, such as the PowerEdge R7525 server.

Read Full Blog
  • PowerEdge
  • vSphere
  • virtualization
  • HPC
  • RDMA
  • performance metrics

Performance study of a VMware vSphere 7 virtualized HPC cluster

Rizwan Ali Chris Gully Yuankun Fu

Mon, 28 Mar 2022 16:35:13 -0000

|

Read Time: 0 minutes

High Performance Computing (HPC) involves processing complex scientific and engineering problems at a high speed across a cluster of compute nodes. Performance is one of the most important features of HPC. While most HPC applications are run on bare metal servers, there has been a growing interest to run HPC applications in virtual environments. In addition to providing resiliency and redundancy for the virtual nodes, virtualization offers the flexibility to quickly instantiate a secure virtual HPC cluster. 

Most people tend to run their HPC workloads on dedicated hardware, which is often composed of server compute nodes that are interconnected by high-speed networks to maximize their performance. Alternatively, virtualization abstracts the underlying hardware and adds a software layer that emulates this hardware. With this in mind, the engineers at the Dell Technologies HPC & AI Innovation Lab and VMware conducted a performance study to compare the performance of running and scaling HPC workloads on dedicated bare metal nodes to a vSphere 7-based virtualized infrastructure. The team also tuned the physical and virtual infrastructure to achieve optimal virtual performance and share these findings and recommendations.

Performance test details

Our team evaluated tightly coupled HPC applications or message passing interface (MPI) based workloads and observed promising results. These applications consist of parallel processes (MPI ranks) that leverage multiple cores and are architected to scale computation to multiple compute servers (or VMs) to solve the complex mathematical model or scientific simulation in a timely manner. Examples of tightly coupled HPC workloads include computational fluid dynamics (CFD) used to model airflow in automotive and airplane designs, weather research and forecasting models for predicting the weather, and reservoir simulation code for oil discovery.

To evaluate the performance of these tightly coupled HPC applications, we built a 16-node HPC cluster using Dell PowerEdge R640 vSAN Ready Nodes. The Dell PowerEdge R640 is a 1U dual-socket server with Intel® Xeon® Scalable processors. The same cluster was configured as both a bare metal HPC cluster and as a virtual cluster running VMware vSphere.

Figure 1 shows a representative network topology of this cluster. The cluster was connected to two separate physical networks. We used the following components for this cluster: 

  • A Dell PowerSwitch Z9332 switch connecting NVIDIA® ConnectX®-6 100 GbE adapters to provide a low-latency, high-bandwidth 100 GbE RDMA-based HPC network for the MPI-based HPC workloads
  • A separate pair of Dell PowerSwitch S5248 25 GbE-based top-of-rack (ToR) switches for hypervisor management, VM access, and VMware vSAN networks for the virtual cluster

The VM switches provided redundancy and were connected by a virtual link trunking interconnect (VLTi). A VMware vSAN cluster was created to host the VMDKs for the virtual machines. To maximize CPU utilization, we also leveraged RDMA support for vSAN. RDMA provides direct memory access between the nodes participating in the vSAN cluster without involving the operating system or the CPU. It offers low latency, high throughput, and high IOPS that are more difficult to achieve with traditional TCP-based networks. It also enables the HPC workloads to consume more CPU for their work without impacting vSAN performance.

Figure 1: A 16-Node HPC cluster test bed

 

Figure 2: Physical adapter configuration for HPC network and service network

Table 1 describes the configuration details of the physical nodes and the network connectivity. For the virtual cluster, a single VM per node was provisioned for a total of 16 VMs or virtual compute nodes. Each VM was configured with 44 vCPUs and 144 GB of memory; the VM CPU and memory reservations were enabled, and the VM latency sensitivity was set to high. Figure 1 also provides an example of how the hosts are cabled to each fabric. One port from the NVIDIA Mellanox ConnectX-6 adapter on each host is connected to the Dell PowerSwitch Z9332 for the HPC network fabric. For the service network fabric, two ports are connected from the NVIDIA Mellanox ConnectX-4 adapter to the Dell PowerSwitch S5248 ToR switches.  

Table 1: Configuration details for the bare metal and virtual clusters

Environment

Bare Metal

Virtual

Server

PowerEdge R640 vSAN Ready Node

Processor

2 x Intel Xeon 2nd Generation 6240R

Cores

All 48 cores used

44 vCPU used

Memory

12 x 16GB @3200 MT/s
 All 192 GB used

144 GB reserved for the VM

Operating System

CentOS 8.3

Host OS: VMware vSphere 7.0u2
 Guest OS: CentOS 8.3

HPC Network NIC

100 GbE NVIDIA Mellanox Connect-X6 

Service Network NIC

10/25 GbE NVIDIA Mellanox Connect-X4

HPC Network Switch

Dell PowerSwitch Z9332F-ON

Service Network Switch

Dell PowerSwitch S5248F-ON

Table 2 shows a range of different HPC applications across multiple vertical domains along with the benchmark datasets that were used for the performance comparison. 

Table 2: Application and Benchmark Details

Application

Vertical Domain

Benchmark Dataset

OpenFOAM

Manufacturing - Computational Fluid Dynamics (CFD)

Motorbike 20M cell mesh

Weather Research and Forecasting (WRF)

Weather and Environment

Conus 2.5KM

Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)

Molecular Dynamics

EAM Metallic Solid Benchmark

GROMACS

Life Sciences – Molecular Dynamics

HECBioSim Benchmarks – 3M Atoms

Nanoscale Molecular Dynamics (NAMD)

Life Sciences – Molecular Dynamics

STMV – 1.06M Atoms

Performance Results

Figures 3 through 7 show the performance, scalability, and difference in performance for five representative HPC applications in the CFD, weather, and science domains. Each of the applications was run to scale from 1 node through 16 nodes on a bare metal and a virtual cluster. All five applications demonstrate efficient speedup when computation is scaled out to multiple systems. The relative speedup for the application is plotted (the baseline is application performance on a bare metal single node).

The results indicate that MPI application performance running in a virtualized infrastructure (with proper tuning and following best practices for latency-sensitive applications in a virtual environment) is close to performance in a bare-metal infrastructure. The single-node performance delta ranges from no difference for WRF to a maximum of 8 percent observed with LAMMPS. Similarly, as the nodes are scaled, the performance observed on the virtual nodes is comparable to that on the bare-metal infrastructure, with the largest delta being 10 percent when running LAMMPS on 16 nodes. 
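
A minimal sketch of how the relative speedup and virtualization delta shown in Figures 3 through 7 can be computed follows (the run times are hypothetical placeholders, not measured values):

# Hypothetical run times (seconds) by node count; the baseline is the
# bare-metal single-node result, matching how Figures 3 through 7 are normalized.
bare_metal = {1: 1000.0, 2: 510.0, 4: 260.0, 8: 135.0, 16: 70.0}
virtual    = {1: 1050.0, 2: 535.0, 4: 270.0, 8: 142.0, 16: 77.0}

baseline = bare_metal[1]
for nodes in sorted(bare_metal):
    speedup_bm = baseline / bare_metal[nodes]
    speedup_vm = baseline / virtual[nodes]
    delta_pct = (virtual[nodes] - bare_metal[nodes]) / bare_metal[nodes] * 100
    print(f"{nodes:2d} nodes: bare metal {speedup_bm:4.1f}x, virtual {speedup_vm:4.1f}x, delta {delta_pct:.0f}%")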

Figure 3: OpenFOAM performance comparison between virtual and bare-metal systems

 Figure 4: WRF performance comparison between virtual and bare-metal systems 

Figure 5: LAMMPS performance comparison between virtual and bare-metal systems

Figure 6: GROMACS performance comparison between virtual and bare-metal systems

Figure 7: NAMD performance comparison between virtual and bare-metal systems

Tuning for Optimal Performance

One of the key elements of achieving a viable virtualized HPC solution is the tuning best practices that allow for optimal performance. We found that a significant improvement was achieved after some minor tweaks to the out-of-box configuration. These improvements are a critical ingredient in ensuring customers can and will achieve results that enable not only the implementation of a virtual HPC environment, but also the adoption of a more cloud-ready ecosystem that provides operational efficiencies and multi-workload support.

Table 3 outlines the parameters that we found to work best for MPI applications. Given the nature of MPI for parallel communication and its heavy reliance on a low-latency network, we suggest implementing the VM Latency Sensitivity setting available in vSphere 7.0. This setting allows users to optimize the scheduling delay for latency-sensitive applications by 1) giving exclusive access to physical resources to reduce resource contention, 2) bypassing virtualization layers that are not providing value for these workloads, and 3) tuning virtualization layers to reduce any unnecessary overhead. We have also outlined below the additional physical host and hypervisor tunings that complete these best practices.

Table 3: Recommended performance tunings for tightly coupled HPC workloads

Settings

Value

Physical Server

 

BIOS Power Profile

Performance per watt (OS)

BIOS Hyper-threading

On

BIOS Node Interleaving

Off

BIOS SR-IOV

On

Hypervisor

 

ESXi Power Policy

High Performance

Virtual Machine

 

VM Latency Sensitivity

High

VM CPU Reservation

Enabled

VM Memory Reservation

Enabled

VM Sizing

Maximum VM size with CPU/memory reservation

Figure 8: Virtual Machine Configuration with the recommended tuning settings 

Figure 8 shows a snapshot of the recommended tuning settings as applied to the virtual machine used as the virtual nodes on the HPC cluster. 

Conclusion

Achieving optimal performance is a key consideration for running an HPC application. While most HPC applications enjoy the performance benefits offered by dedicated bare-metal hardware, our results indicate that with appropriate tuning the performance gap between virtual and bare-metal nodes has narrowed, making it feasible to run certain HPC applications in a virtualized environment. We also observed that these tested HPC applications demonstrate efficient speedups when computation is scaled out to multiple virtual nodes.  

Additional resources 

To learn more about our previous and ongoing work at the Dell Technologies HPC & AI Innovation Lab, see the High Performance Computing overview and the Dell Technologies Info Hub blog page for HPC solutions. 

Acknowledgements

The authors thank Martin Hilgeman from Dell Technologies, Ramesh Radhakrishnan and Michael Cui from VMware, and Martin Feyereisen for their contribution in the study.

Read Full Blog

Increased Automation, Scale, and Capability with Omnia 1.1

Luke Wilson PhD John Lockman

Mon, 15 Nov 2021 23:08:56 -0000

|

Read Time: 0 minutes

The release of Omnia version 1.0 in March of 2021 was a huge milestone for the Omnia community. It was the culmination of nearly a year of planning, conversations with customers and community members, development, and testing. Omnia version 1.0 included:

  • bare-metal provisioning with Cobbler,
  • automated Slurm and Kubernetes cluster deployment, and
  • automated Kubeflow deployment.

The Omnia project was designed to rapidly add features and evolve, and we are proud to announce the first update to Omnia just 7 months later. While version 1.0 had a ton of great features for a first release, version 1.1  turned out to be even bigger!

The Omnia Project

Omnia is an open-source, community-driven framework for deploying high-performance computing (HPC) clusters for simulation & modeling, artificial intelligence, and data analytics. By automating the entire process, Omnia reduces deployment time for these complex systems from weeks to hours.

Omnia was incubated at Dell Technologies in partnership with Intel. The project was initiated by two HPC & AI experts who needed to quickly set up proof-of-concept clusters in Dell’s HPC & AI Innovation Lab, and it has since grown into a much larger effort to create production-grade clusters on demand and at scale. Today, Omnia has thirty collaborators from nearly a dozen organizations, including five official community member organizations. The code repo has been cloned over a thousand times and has over forty thousand views! The project is off to a great start with more new features releasing regularly!

Omnia 1.1

Omnia version 1.1 includes a multitude of new features and capabilities that expand datacenter automation beyond the compute server.  This latest release sets the groundwork for Omnia to handle future exascale supercomputer deployments while simultaneously growing the set of end-user features and platforms more rapidly.

New in Omnia 1.1

  • iDRAC-based provisioning
  • PowerVault provisioning/configuration (automatically turns a PV array into an NFS file share)
  • Parallel gang scheduling for Kubernetes (for MPI and Spark jobs)
  • User authentication/management using LDAP/Kerberos
  • Automatic firmware updating for PowerEdge servers with Intel® 2nd-generation Xeon®  Scalable processors when using iDRAC for provisioning
  • Automatic configuration of Dell PowerSwitch 100Gb Ethernet and Nvidia InfiniBand switches
  • Updated AWX GUI for deploying logical clusters
  • Additional MLOps platform options (Polyaxon, in addition to the existing KubeFlow)

A brand-new control plane designed for future growth

The new control plane (formerly called the Omnia appliance) is now a full Kubernetes-based deployment with a wealth of features. The new control plane includes Dell iDRAC integration for firmware updates and OS provisioning when iDRAC Enterprise or Datacenter licenses are detected, plus automatic fallback to Cobbler-based PXE provisioning when those licenses are not available. This allows cluster administrators using Dell servers to take full advantage of their iDRAC Enterprise or Datacenter licenses while continuing to offer a fully open-source and vendor-agnostic alternative. This new Kubernetes-based control plane is the first step in providing an expandable, multi-server control plane that could be used to manage the bare-metal provisioning and deployment of thousands of compute nodes for petascale and eventual exascale systems.

Automatically detect and deploy more than just servers

The development team has also extended Omnia’s automation capability beyond compute servers. The control plane is now able to automatically detect and configure Dell EMC PowerSwitches, Nvidia/Mellanox InfiniBand switches, and Dell EMC PowerVault storage arrays. This allows users to deploy complete HPC environments using Omnia’s one-touch philosophy, with compute, network, and storage pieces ready to go! Dell EMC PowerSwitches are automatically configured for both management and fabric deployments, with automatic configuration of RoCEv2 for supported 100 Gbps Ethernet switches. Nvidia InfiniBand fabrics are automatically deployed when an InfiniBand switch is detected, with the subnet manager running on the control plane. And when the control plane detects a Dell EMC PowerVault ME4 storage array, it automatically configures the RAID, formats the array, and sets up an NFS service that can be shared by the various logical clusters in the Omnia resource pool. In less than a day, a loading dock full of servers, storage, and networking can be transformed into a functional Omnia resource pool, ready to be configured into logical Slurm and Kubernetes clusters on demand.

Automated deployment of LDAP services

Starting with version 1.1, Omnia also reduces the pain of user management. When logical Slurm clusters are created Omnia takes care of all the backend services needed for a fully functional, batch scheduled, simulation and modeling environment including Kerberos user authentication with FreeIPA. System administrators immediately have access to both a CLI and web-based interface for user management built upon well-known open-source components and standard protocols. Systems can also be configured to point to an existing LDAP service elsewhere in the data center. 

Preparing Kubernetes for HPC workloads

Interest in Kubernetes has been growing in the HPC community, especially for data science and data analytics workloads. Interest in those use cases is precisely why Omnia included the ability to deploy Kubernetes from the start. However, default configurations of Kubernetes are missing some of the key components needed to make it useful for parallel and distributed data processing. Omnia version 1.0 included the mpi-operator from the Kubeflow project, which provides custom resource definitions (CRDs) for MPI job execution. Version 1.1 now includes the spark-operator to make executing Spark jobs simpler as well. Another feature of version 1.1 is the option to use gang scheduling for Kubernetes pods through the Volcano project. This gives Kubernetes the ability to understand that all the pods in an MPI job should be scheduled simultaneously, rather than deploying pods a few at a time as resources become available.
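As a rough illustration of how the mpi-operator CRD and Volcano gang scheduling fit together, the sketch below builds a minimal MPIJob manifest and submits it with the Kubernetes Python client. The manifest fields (the CRD apiVersion, replica counts, container image, and the volcano schedulerName) are illustrative assumptions rather than the exact objects Omnia deploys; check the mpi-operator and Volcano versions installed by your Omnia release.

```python
# Minimal sketch: submit an MPI job CRD that Volcano can gang schedule.
# Assumes the Kubeflow mpi-operator and Volcano are already installed and a
# kubeconfig is available locally. The apiVersion and image are assumptions.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

mpijob = {
    "apiVersion": "kubeflow.org/v1",          # CRD version depends on the mpi-operator release
    "kind": "MPIJob",
    "metadata": {"name": "demo-mpi-job", "namespace": "default"},
    "spec": {
        "slotsPerWorker": 1,
        "mpiReplicaSpecs": {
            "Launcher": {
                "replicas": 1,
                "template": {"spec": {
                    "schedulerName": "volcano",   # ask Volcano to gang schedule the pods
                    "containers": [{
                        "name": "launcher",
                        "image": "mpioperator/mpi-pi:latest",   # hypothetical example image
                        "command": ["mpirun", "-n", "4", "/home/mpiuser/pi"],
                    }],
                }},
            },
            "Worker": {
                "replicas": 4,
                "template": {"spec": {
                    "schedulerName": "volcano",
                    "containers": [{"name": "worker", "image": "mpioperator/mpi-pi:latest"}],
                }},
            },
        },
    },
}

api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="default",
    plural="mpijobs", body=mpijob,
)
```

Because every pod in the job names Volcano as its scheduler, the job only starts once all launcher and worker pods can be placed at the same time, which is exactly the behavior MPI jobs need.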

A new platform for neural network research

Artificial intelligence research has been a central workload for Omnia. Being able to provide easy-to-deploy MLOps platforms like Kubeflow is critical to giving data scientists and AI researchers the flexibility to experiment with new neural network architectures. In addition to Kubeflow, Omnia now offers automated installation of the Polyaxon deep learning platform. Polyaxon gives neural network researchers and data science teams the ability to:

  • index and catalog experiments,
  • execute Distributed TensorFlow experiments, 
  • train MPI-enabled TensorFlow and PyTorch models, and 
  • tune/optimize models using parametric sweeps of hyperparameter values.

Even greater things are on the horizon!

Version 1.1 is a big release, but the Omnia community has even greater things planned. Soon we will be adding support for the entire line of Dell EMC PowerEdge servers with Intel® 3rd-generation Xeon® Scalable (code name “Ice Lake”) processors. Additionally, Omnia will soon be able to deploy logical clusters on top of servers provisioned with either Rocky Linux or CentOS, offering users a choice of base operating systems. Looking farther out, we are working with our customers, technology partners, and community members to bring support for creating BeeGFS filesystems on demand, deploying new user platforms like Open OnDemand, and providing better administrative interfaces for Kubernetes cluster administration through Lens. Anyone is free to look at what we’re working on (and suggest new things to try) by going to the Omnia GitHub. 

Learn More

Learn more about Omnia by visiting Omnia on GitHub.

Read the Dell Technologies solution overview on Omnia here.


  • WRF

WRF Performance with 3rd Generation Intel Xeon Scalable Processors On Dell EMC PowerEdge servers

Puneet Singh Ashish K Singh

Tue, 21 Sep 2021 13:56:11 -0000

|

Read Time: 0 minutes

Many sectors like aviation, travel, tourism, energy, and transportation heavily rely on timely and accurate weather predictions provided by weather forecast centers. These operational forecast centers make use of numerical weather prediction (NWP) models to predict the weather based on current weather conditions. Weather research and forecasting (WRF) is one of the most widely used numerical weather prediction systems for weather forecasting. A suitable combination of robust computational resources, a high-speed network, and high-throughput storage is required to achieve maximum performance on a high-performance computing (HPC) cluster so that the WRF model can deliver timely forecasts.

In this blog, we highlight the performance improvement for WRF with Intel Ice Lake processors compared to Intel Cascade Lake processors on Dell EMC PowerEdge servers. These tests were carried out on two-socket Dell PowerEdge servers with the BIOS set to the HPC workload profile. The testbed hardware and software details are outlined in the following table:

Table 1: Testbed hardware and software details

Component | Dell EMC PowerEdge R750 | Dell EMC PowerEdge R650 | Dell EMC PowerEdge C6420 | Dell EMC PowerEdge C6420
SKU | 8380 | 6338 | 8280 | 6252
Cores/Socket | 40 | 32 | 28 | 24
Frequency (Base-Max Turbo) | 2.30 - 3.40 GHz | 2.0 - 3.20 GHz | 2.70 - 4.0 GHz | 2.10 - 3.70 GHz
TDP | 270 W | 205 W | 205 W | 150 W
L3 Cache | 60 MB | 48 MB | 38.5 MB | 37.75 MB
Operating System | Red Hat Enterprise Linux 8.3 (4.18.0-240.22.1.el8_3.x86_64) | Red Hat Enterprise Linux 8.3 (4.18.0-240.22.1.el8_3.x86_64) | Red Hat Enterprise Linux 8.3 (4.18.0-240.el8.x86_64) | Red Hat Enterprise Linux 8.3 (4.18.0-240.el8.x86_64)
Memory | 32 GB x 16 (2Rx8) 3200 MT/s | 32 GB x 16 (2Rx8) 3200 MT/s | 16 GB x 12 (2Rx8) 2933 MT/s | 16 GB x 12 (2Rx8) 2933 MT/s
BIOS/CPLD | 1.2.4/1.0.5 (R750/R650) | 2.11.2/1.1.0 (C6420 servers)
Interconnect | NVIDIA Mellanox HDR | NVIDIA Mellanox HDR | NVIDIA Mellanox HDR100 | NVIDIA Mellanox HDR100
Compiler | Intel Parallel Studio 2020 (update 4) - all servers
Datasets | conus 2.5km, new conus 2.5km, wrf_large 3km - all servers

We benchmarked WRF-V3.9.1.1 with the conus 2.5km and new conus 2.5km datasets and WRF-V4.2.2 with new conus 2.5km and wrf_large 3km datasets. The following figure shows the simulation domain for the tested datasets:

Figure 1: Domain configuration for new conus 2.5 km, conus 2.5 km, and wrf_large datasets.

The following table provides a brief description of each dataset:

Table 2: Configuration for new conus 2.5 km, conus 2.5 km, and wrf_large datasets

Parameter | conus 2.5 km | new conus 2.5 km | wrf_large
Run hours | 3 | 3 | 2
Resolution (m) | 2500 | 2500 | 3000
Vertical layers | 35 | 35 | 50
Grid points | 1501 x 1201 | 1901 x 1301 | 1500 x 1500
interval_seconds | 10800 | 10800 | 21600

The results were measured by averaging the WRF computation time of each timestep from the rsl.error.0000 output file. The timesteps during the file read / write (of wrfout* / wrfinput* ) were not included in the average.
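For readers who want to reproduce this measurement, a minimal parsing sketch is shown below; it averages the per-timestep compute time reported in rsl.error.0000 and skips the I/O timing lines. The exact line format can vary between WRF versions, so treat the regular expression as an assumption to verify against your own output.

```python
# Sketch: average WRF compute time per timestep from rsl.error.0000,
# excluding the "Timing for Writing ..." (wrfout*/wrfinput* I/O) lines.
import re
import sys

def average_timestep_seconds(path="rsl.error.0000"):
    pattern = re.compile(r"Timing for main.*:\s+([\d.]+)\s+elapsed seconds")
    times = []
    with open(path) as f:
        for line in f:
            if "Timing for Writing" in line:      # skip file read/write timings
                continue
            m = pattern.search(line)
            if m:
                times.append(float(m.group(1)))
    if not times:
        raise ValueError(f"no timestep timings found in {path}")
    return sum(times) / len(times)

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "rsl.error.0000"
    print(f"average seconds per timestep: {average_timestep_seconds(path):.4f}")
```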

Single node performance

The following figures show the application performance for the datasets mentioned in Table 2. In each figure, the numbers over the bars represent the relative performance compared to the performance obtained with the Intel 6252 Cascade Lake processor model. The blue and green bars represent application performance obtained with Ice Lake and Cascade Lake processors, respectively.

Figure 2: Relative performance of WRF by processor and dataset type mentioned in Table 1

WRF was compiled with the "dm + sm" configuration with avx2 instructions and serial netcdf support (io_form* set to 2). All the available cores were subscribed during WRF simulation runs. To optimize performance, we tested different MPI process counts, OpenMP thread count combinations, and tiling schemes (WRF_NUM_TILES).
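A sweep like this can be scripted; the sketch below is one hypothetical way to do it. The executable path, the rank/tile values, and the way environment variables reach the MPI ranks are assumptions that depend on the MPI launcher and the WRF build being used.

```python
# Sketch: sweep MPI ranks, OpenMP threads, and WRF_NUM_TILES for a dm+sm WRF build.
# Assumes an "mpirun" launcher and a wrf.exe in the current run directory.
# Depending on the MPI launcher, flags such as -genvall (Intel MPI) or -x (Open MPI)
# may be needed to forward environment variables to remote ranks.
import itertools
import os
import subprocess

TOTAL_CORES = 80                      # e.g., dual Xeon 8380 (2 x 40 cores)
RANKS = [20, 40, 80]
TILES = [4, 8, 16]

for ranks, tiles in itertools.product(RANKS, TILES):
    threads = TOTAL_CORES // ranks    # keep ranks x threads equal to the core count
    env = dict(os.environ,
               OMP_NUM_THREADS=str(threads),
               WRF_NUM_TILES=str(tiles))
    log = f"wrf_np{ranks}_omp{threads}_tiles{tiles}.log"
    with open(log, "w") as out:
        subprocess.run(["mpirun", "-np", str(ranks), "./wrf.exe"],
                       env=env, stdout=out, stderr=subprocess.STDOUT, check=False)
    # Each run's rsl.error.0000 can then be parsed with the averaging script above.
```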

Depending on the dataset, the 8380 processor model can deliver up to 19 percent better performance compared to the 6338 processor model. Relative to Cascade Lake, the Ice Lake architecture has more memory channels and offers higher aggregate memory bandwidth. WRF, which is typically memory bandwidth bound, can take advantage of the additional memory bandwidth (Table 3) provided by Ice Lake and the results demonstrate up to 65 percent performance improvement over the Cascade Lake counterparts. Comparison of Instructions Per Cycle (IPC) and DRAM Bandwidth Utilization collected using Intel OneAPI Vtune profiler on Intel Ice Lake and Cascade Lake processors is shown in Table 3.

Table 3: Metrics collected using the Intel oneAPI VTune profiler

Dataset | 8380 IPC | 8380 Bandwidth (GB/s) | 8280 IPC | 8280 Bandwidth (GB/s)
conus 2.5km (WRFV3) | 0.99 | 257.32 | 0.86 | 128.30
new conus 2.5km (WRFV3) | 1.57 | 192.18 | 1.48 | 120.96
new conus 2.5km (WRFV4) | 1.36 | 191.43 | 1.14 | 115.46
wrf_large (WRFV4) | 1.09 | 64.80 | 0.90 | 62.55

Intel’s Ice Lake is expected to deliver around 20 percent better IPC than the Cascade Lake model (8380 vs 8280). With the datasets covered in this blog, we found that the Intel 8380 processor delivers 6 to 19 percent better IPC than the Intel 8280 processor.

Figure 3 shows the power consumption, using a box-and-whiskers plot, while the system was being benchmarked with the four tests shown in Figure 2. The box indicates the spread of the central 50% of the power data, and the central line represents the median power value. The dots show the outlier power values, most of which were recorded during the initialization and finalization phases of the tests.

Figure 3: Power used by platform and processor type

The average core frequencies for the 8380, 6338, 8280, and 6252 processors were around 2.9, 2.5, 3.0, and 2.5 GHz, respectively, across all datasets.

Multi-node Scalability

We used eight nodes to evaluate the scalability of WRF. Each node is equipped with Intel 8380 processors and interconnected using the NVIDIA Mellanox HDR interconnect. The nodes used for benchmarking were connected to the same HDR switch. Table 1 provides details about the server and software used for the test. The text on top of the bars in Figure 4 represents the relative performance (on two, four, and eight nodes) compared with the single-node performance.

Figure 4: Multi-node performance of WRF on an Intel 8380 processor model for datasets listed in Table 1

The scalability numbers have been rounded off to a single digit. We observed good scalability with all the datasets listed in Table 1.

Conclusions and recommendations

For WRF, Intel Ice Lake demonstrates significant performance improvement as compared with Intel Cascade Lake processors. WRF simulations scale well with the datasets described in this blog. The scalability might vary depending on the dataset being used and the node count being tested. For the best performance with WRF, the impact of the tile size, process, and threads per process should be evaluated.

  • PowerEdge

LAMMPS — with Ice Lake on Dell EMC PowerEdge Servers

Savitha Pareek Joseph Stanfield Ashish K Singh

Mon, 30 Aug 2021 21:09:22 -0000

|

Read Time: 0 minutes

3rd Generation Intel Xeon® Scalable processors (architecture code named Ice Lake) are Intel’s successor to Cascade Lake. New features include up to 40 cores per processor, eight memory channels with support for 3200 MT/s memory speed, and PCIe Gen4. The HPC and AI Innovation Lab at Dell EMC had access to a few systems, and this blog presents the results of our initial benchmarking study.

LAMMPS Overview

Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is an open-source, well-parallelized collection of packages for molecular dynamics (MD) research. LAMMPS has a rich collection of “atom styles”, force fields, and many contributed packages. LAMMPS can run on a single processor or on the largest parallel supercomputers. It also has packages that provide force calculations accelerated on GPUs. It can run simulations with billions of atoms!

LAMMPS can be run on a single processor or in parallel using some form of message passing, such as Message Passing Interface (MPI). The most current source code for LAMMPS is written in C++. For more information about LAMMPS, see the following link: https://www.lammps.org/.  

Objective

In this study, we measure the performance of LAMMPS on the different Ice Lake processor models listed in Table 1 and compare it to the previous-generation Cascade Lake systems. Both single-node and multi-node scalability tests were conducted.

Compilation Details

The LAMMPS release used for testing was lammps-2July2021, compiled with the Intel 2020 Update 5 compiler to take advantage of AVX2 and AVX512 optimizations, and with the Intel MKL FFT library. We used the INTEL package that ships with LAMMPS, which provides well-optimized atom pair styles that use the vector instructions on Intel processors. The datasets used for our study are described in Table 2, along with their atom sizes and run steps. The unit of performance is timesteps per second, and higher is better.
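For reference, the timesteps-per-second figure can be pulled from the end-of-run summary in the LAMMPS log with a short script such as the sketch below; the exact wording of the "Performance:" line varies slightly with the unit style, so the pattern is an assumption to check against your own log files.

```python
# Sketch: extract timesteps/s from a LAMMPS log (log.lammps).
# The "Performance:" line looks like, e.g.:
#   Performance: 243533.844 tau/day, 563.735 timesteps/s            (lj units)
#   Performance: 4.338 ns/day, 5.533 hours/ns, 50.205 timesteps/s   (metal/real units)
import re

def timesteps_per_second(logfile="log.lammps"):
    pat = re.compile(r"Performance:.*?([\d.]+)\s+timesteps/s")
    rates = []
    with open(logfile) as f:
        for line in f:
            m = pat.search(line)
            if m:
                rates.append(float(m.group(1)))
    return rates      # one entry per run section in the log

print(timesteps_per_second())
```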

Hardware and software configurations

Table 1: Hardware and Software test bed details 

 

 

Component | Dell EMC PowerEdge R750 | Dell EMC PowerEdge R750 | Dell EMC PowerEdge C6520 | Dell EMC PowerEdge C6520 | Dell EMC PowerEdge C6420 | Dell EMC PowerEdge C6420
CPU model | Xeon 8380 | Xeon 8358 | Xeon 8352Y | Xeon 6330 | Xeon 8280 | Xeon 6248R
Cores/Socket | 40 | 32 | 32 | 28 | 28 | 24
Base Frequency | 2.30 GHz | 2.60 GHz | 2.20 GHz | 2.00 GHz | 2.70 GHz | 3.00 GHz
TDP | 270 W | 250 W | 205 W | 205 W | 205 W | 205 W
Operating System | Red Hat Enterprise Linux 8.3 (4.18.0-240.22.1.el8_3.x86_64) - all servers
Memory | 16 GB x 16 (2Rx8) 3200 MT/s (R750/C6520 servers); 16 GB x 12 (2Rx8) 2933 MT/s (C6420 servers)
BIOS/CPLD | 1.1.2/1.0.1
Interconnect | NVIDIA Mellanox HDR (R750/C6520 servers); NVIDIA Mellanox HDR100 (C6420 servers)
Compiler | Intel Parallel Studio 2020 (update 4)
LAMMPS | 2July2021

Datasets used for performance analysis

Table 2: Description of datasets used for performance analysis

Datasets | Description | Units | Atomic Style | Atom Size | Step Size
Lennard Jones | Atomic fluid (LJ Benchmark) | lj | atomic | 512000 | 7900
Rhodo | Protein (Rhodopsin Benchmark) | real | full | 512000 | 520
Liquid crystal | Liquid Crystal with Gay-Berne potential | lj | ellipsoid | 524288 | 840
Eam | Copper benchmark with Embedded Atom Method | metal | atomic | 512000 | 3100
Stillinger-Weber | Silicon benchmark with Stillinger-Weber | metal | atomic | 512000 | 6200
Tersoff | Silicon benchmark with Tersoff | metal | atomic | 512000 | 2420
Water | Coarse-grain water benchmark using Stillinger-Weber | real | atomic | 512000 | 2600
Polyethylene | Polyethylene benchmark with AIREBO | metal | atomic | 522240 | 550




Figure 1: Image view of the datasets rendered with OVITO (a scientific data visualization and analysis package for molecular and other particle-based simulation models). Subfigures 1a-1h each show a small portion of the simulation domain for the Atomic fluid (Lennard Jones), Rhodo (protein), Liquid crystal (lc), Copper (eam), Stillinger-Weber (sw), Tersoff, Water, and Polyethylene datasets, respectively.

Table 2 and Figure 1 show the datasets used for the single-node and multi-node analysis. Visualization of all datasets was done using OVITO, a scientific data visualization and analysis package for molecular and other particle-based simulation models. For the single-node performance study, all the datasets shown in Table 2 were used; for the multi-node study, the Atomic fluid dataset was used for benchmarking.

Performance Analyses on Single Node  





 

 

Figure 2: Single-node performance of LAMMPS across the datasets with Intel Ice Lake processor models. Each graph in Figure 2 is an individual subfigure, labeled (a-h) in the order shown. Each subfigure (2a-2h) shows a single-node performance comparison across the Xeon processors, with the Xeon 6330 as the baseline, for the Atomic fluid (Lennard Jones), Rhodo (protein), Liquid crystal (lc), Copper (eam), Stillinger-Weber (sw), Tersoff, Water, and Polyethylene datasets, respectively.

Figure 2 shows the single-node performance for the eight datasets (subfigures 2a-h) listed in Table 2 with the four Ice Lake processor models available for the evaluation of LAMMPS.

For ease of comparison across the processor models, the relative performance for the datasets has been collected into a separate graph. However, it is worth noting that each dataset behaves individually when performance is considered, as each uses different molecular potentials and a different number of atoms. Figure 2 shows that increasing the core count of the processor model increases performance across the datasets used. Comparing the relative numbers for the Xeon 8380 (40C) against the baseline Xeon 6330 (28C), we measured a 30 to 45 percent performance gain with these datasets. A fraction of this boost is due to the frequency of the processor model.

Figure 3a: Performance of LAMMPS on Cascade Lake (Xeon 6248R) in comparison to Ice Lake (Xeon 6330)

Figure 3b: Performance of LAMMPS on Cascade Lake (Xeon 8280) in comparison to Ice Lake (Xeon 8380)

Figure 3 compares the performance of the mid-bin Cascade Lake 6248R (24 cores) to the Ice Lake 6330 (28 cores), and the top-end Cascade Lake 8280 (28 cores) to the Ice Lake 8380 (40 cores). From Figure 3a, the Ice Lake 6330 is up to 30 percent faster than the 6248R; the Xeon 6330 has 16 percent more cores and 9 percent faster memory bandwidth. Figure 3b shows the Ice Lake 8380 is up to 75 percent faster than the Xeon 8280 on single-node tests, which is in line with its 42 percent additional cores and 9 percent faster memory bandwidth, wherein more data can be accessed by each core.

Performance Analysis on Multi-Node

To analyze scalability with strong and weak scaling tests, we used the Atomic fluid (LJ) dataset from the INTEL package. The job ran for 7900 steps with 512000 atoms in the simulation system.


Figure 4a: Fixed size Atomic fluid (LJ) for different problem size (strong scaling) w/ Xeon 8380 

With strong scaling, we refer to a fixed problem size with an increasing number of parallel processes (Amdahl’s law), whereas with weak scaling we varied the atom count from 512000 to 4096000 atoms as the number of parallel processes increased (Gustafson-Barsis law). The test bed included Dell EMC PowerEdge R750 servers, each with dual Ice Lake Xeon 8380 processors, and an NVIDIA Networking HDR interconnect running at 200 Gbps.

Figure 4a plots the fixed-size relative performance for four different problem sizes, namely 512000, 1024000, 2048000, and 4096000 atoms, on different numbers of nodes.

The relative performance is normalized by the single-node performance. Hence, the single-node performance for each curve is 1.00 (unity). The relative performance for the fixed-size Atomic fluid runs was calculated by the following equation:

Relative Performance = loop time for a single node / loop time for 'N' nodes

Loop time is the total wall-clock time for the simulation to run. It can be observed that relative performance increases with an increase in problem size. This is because, for smaller problems, the system spends more time in inter-node communication. The time spent in communication at 8 nodes is 61.91%, 59.74%, 48.42%, and 45.04% for the 512000, 1024000, 2048000, and 4096000 atom sizes, respectively.

Figure 4b: Scaled size Atomic fluid (LJ) with 512000 atoms per node (weak scaling) w/ Xeon 8380

Figure 4b plots the scaled-size efficiency for runs with 512000 atoms per node. Thus, a scaled-size two-node run uses 1024000 atoms, and an eight-node run uses 4096000 atoms. The relative performance (weak-scaling efficiency) for the scaled-size Atomic fluid runs was calculated by the following equation:

Relative Performance = loop time for a single node / loop time for 'n' nodes

Weak-scaling efficiency decreases as the number of nodes increases in the investigated range. This is because, for a larger number of nodes, the time spent in MPI communication is larger. The time spent in communication for the scaled-size runs on 1, 2, 4, and 8 nodes is 27.17%, 32.42%, 40.87%, and 45.04%, respectively.

Figure 5: Multi-node efficiency for Atomic fluid (LJ) w/ Xeon 8380

Figure 5 plots the multi-node efficiency for the Atomic fluid dataset with the Xeon 8380. The performance rating is normalized by the single-node performance with 512000 atoms. Hence, the single-node value for 512000 atoms is 1.00 (unity), and this point is taken as the baseline for the other comparisons.

Performance Rating = (number of atoms * loop time for 512000 atoms on 1 node) / (loop time * number of nodes * 512000)

We observed that for smaller systems, such as those with fewer atoms, the efficiency of strong scaling decreases because the system spends more time in MPI communication, whereas in larger systems with many atoms the efficiency of strong scaling increases because the time spent in pair-wise force calculation becomes dominant. For weak scaling, the efficiency decreases as the number of nodes increases.
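Assuming the corrected definitions above, the arithmetic behind the three metrics can be expressed compactly as follows; the loop-time values in the example are placeholders, not measured data.

```python
# Sketch: strong-scaling speedup, weak-scaling efficiency, and atom-normalized
# performance rating computed from LAMMPS loop times (placeholder values only).
def strong_scaling_speedup(t1, tn):
    """Relative performance of an N-node run vs. the single-node run (fixed problem size)."""
    return t1 / tn

def weak_scaling_efficiency(t1, tn):
    """Efficiency when the problem grows with the node count (atoms per node held fixed)."""
    return t1 / tn

def performance_rating(tn, atoms, nodes, t1_512k, base_atoms=512000):
    """Per-node throughput normalized to the 512000-atom single-node run."""
    return (atoms * t1_512k) / (tn * nodes * base_atoms)

# Example with placeholder loop times (seconds):
print(strong_scaling_speedup(t1=100.0, tn=18.0))                       # ~5.6x on 8 nodes
print(weak_scaling_efficiency(t1=100.0, tn=135.0))                     # ~0.74
print(performance_rating(tn=135.0, atoms=4096000, nodes=8, t1_512k=100.0))
```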

Conclusion

The Ice Lake processor-based Dell EMC PowerEdge servers, with their hardware feature upgrades over Cascade Lake, demonstrate up to a 50 to 70 percent performance gain for the datasets used for benchmarking LAMMPS. Watch our blog site for updates!


GROMACS — with Ice Lake on Dell EMC PowerEdge Servers

Savitha Pareek Joseph Stanfield Ashish K Singh

Fri, 02 Dec 2022 05:33:27 -0000

|

Read Time: 0 minutes

3rd Generation Intel Xeon® Scalable processors (architecture code named Ice Lake) are Intel’s successor to Cascade Lake. New features include up to 40 cores per processor, eight memory channels with support for 3200 MT/s memory speed, and PCIe Gen4.

The HPC and AI Innovation Lab at Dell EMC had access to a few systems and this blog presents the results of our initial benchmarking study on a popular open-source molecular dynamics application – GROningen MAchine for Chemical Simulations (GROMACS).

Molecular dynamics (MD) simulation is a popular technique for studying the atomistic behavior of any molecular system. It analyzes the trajectories of atoms and molecules as the dynamics of the system progress over time.

At the HPC and AI Innovation Lab, we have conducted research on SARS-COV-2, where applications like GROMACS helped researchers identify molecules that bind to the spike protein of the virus and block it from infecting human cells. Another use case of MD simulation in medicinal biology is iterative drug design through prediction of protein-ligand docking (in this case, usually modelling a drug-to-target protein interaction).

Overview of GROMACS

GROMACS is a versatile package for performing MD simulations, such as simulating the Newtonian equations of motion for systems with hundreds to millions of particles. GROMACS can be run on CPUs and GPUs in single-node and multi-node (cluster) configurations. It is free, open-source software released under the GNU General Public License (GPL). Check out this page for more details on GROMACS.

Hardware and software configurations

Table 1: Hardware and Software testbed details

 

 

Component | Dell EMC PowerEdge R750 | Dell EMC PowerEdge R750 | Dell EMC PowerEdge C6520 | Dell EMC PowerEdge C6520 | Dell EMC PowerEdge C6420 | Dell EMC PowerEdge C6420
SKU | Xeon 8380 | Xeon 8358 | Xeon 8352Y | Xeon 6330 | Xeon 8280 | Xeon 6252
Cores/Socket | 40 | 32 | 32 | 28 | 28 | 24
Base Frequency | 2.30 GHz | 2.60 GHz | 2.20 GHz | 2.00 GHz | 2.70 GHz | 2.10 GHz
TDP | 270 W | 250 W | 205 W | 205 W | 205 W | 150 W
L3 Cache | 60 MB | 48 MB | 48 MB | 42 MB | 38.5 MB | 37.75 MB
Operating System | Red Hat Enterprise Linux 8.3 (4.18.0-240.22.1.el8_3.x86_64) - all servers
Memory | 16 GB x 16 (2Rx8) 3200 MT/s (R750/C6520 servers); 16 GB x 12 (2Rx8) 2933 MT/s (C6420 servers)
BIOS/CPLD | 1.1.2/1.0.1
Interconnect | NVIDIA Mellanox HDR (R750/C6520 servers); NVIDIA Mellanox HDR100 (C6420 servers)
Compiler | Intel Parallel Studio 2020 (update 4)
GROMACS | 2021.1

Datasets used for performance analysis

Table 2: Description of datasets used for performance analysis

 

Datasets/Download Link | Description | Electrostatics | Atoms | System Size
Water (movement of water) | Simulates the motion of many water molecules for a given box size and temperature | Particle Mesh Ewald (PME) | 1536K / 3072K | Small / Large
HecBioSim | 1.4M-atom system: a pair of hEGFR dimers of 1IVO and 1IVO; 3M-atom system: a pair of hEGFR tetramers of 1IVO and 1IVO | Particle Mesh Ewald (PME) | 1.5M / 3M | Small / Large
Prace - Lignocellulose | Simulates lignocellulose; the tpr file was obtained from the PRACE website | Reaction Field (rf) | 3M | Large

Compilation Details

We compiled GROMACS from source (version 2021.1) using the Intel 2020 Update 5 compiler to take advantage of AVX2 and AVX512 optimizations, and the Intel MKL FFT library. The new version of GROMACS has a significant performance gain due to improvements in its parallelization algorithms. The GROMACS build system and the gmx mdrun tool have built-in and configurable intelligence that detects your hardware and makes effective use of it.

Objective of Benchmarking

Our objective is to quantify the performance of GROMACS for several test cases: first we evaluate performance on the different Ice Lake processors listed in Table 1, then we compare 2nd and 3rd Gen Xeon Scalable processors (Cascade Lake vs Ice Lake), and finally we compare multi-node scalability with hyper-threading enabled and disabled.

To evaluate the datasets with an appropriate metric, we applied suitable high-level compiler flags, used electrostatic load balancing (such as PME), and tested multiple MPI rank counts, separate PME ranks, and different nstlist values to establish a tuned configuration for GROMACS.

The typical time scales of the simulated systems are on the order of microseconds (µs) or nanoseconds (ns). We measure the performance of each dataset’s simulation in nanoseconds per day (ns/day).
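As a convenience, the ns/day figure can be read programmatically from the summary that gmx mdrun writes at the end of its log. The sketch below assumes the "Performance:" line layout used by recent GROMACS releases, so verify it against your own md.log.

```python
# Sketch: read ns/day from the "Performance:" line at the end of a GROMACS md.log.
import re

def ns_per_day(logfile="md.log"):
    perf = None
    with open(logfile) as f:
        for line in f:
            if line.startswith("Performance:"):
                # Expected layout: "Performance:   <ns/day>   <hour/ns>"
                fields = re.findall(r"[\d.]+", line)
                if fields:
                    perf = float(fields[0])
    return perf   # None if the run did not finish

print(ns_per_day())
```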

Performance Analyses on Single Node

Figure 1(a): Single node performance of Water 1536K and Water 3072K on Ice Lake processor model

Figure 1(b): Single node performance of Lignocellulose 3M on Ice Lake processor model

Figure 1(c): Single node performance of HecBioSim 1.4M and HecBioSim 3M on Ice Lake processor model

Figures 1(a), 1(b), and 1(c) show the single-node performance analyses for the three datasets mentioned in Table 2 with the four processor models available for the evaluation of GROMACS.

Figure 2:  Relative Performance of GROMACS across the datasets with Intel Ice Lake Processor Model

For ease of comparison across the various datasets, the relative performance of the processor models has been collected into a single graph. However, it is worth noting that each dataset behaves individually when performance is considered, as each uses different molecular topology input files (tpr) and configuration files.

Individual dataset performance is mentioned in Figures 1(a), 1(b), and 1(c) respectively.

Figure 2 shows that increasing the core count of the processor model increases performance, with the gain depending on the dataset used. Here, we observe that the smaller datasets (water 1536K and HecBioSim 1.4M) see an additional 5 to 6 percent performance gain compared to the larger datasets (water 3072K, HecBioSim 3M, and Ligno 3M).

Next, by comparing the relative numbers for the Xeon 8380 (40C) to the baseline Xeon 6330 (28C), we found a 30 to 50 percent performance gain, depending on the dataset, as the core count increases from 28 to 40. A fraction of the gain is due to the frequency of the processor model.

 

 Performance Analyses on Cascade Lake vs Ice Lake


Figure 3(a): Performance of GROMACS on Cascade Lake (Xeon 6252) vs Ice Lake (Xeon 6330)

Figure 3(b): Performance of GROMACS on Cascade Lake (Xeon 8280) vs Ice Lake (Xeon 8380)

We ensured that the memory configuration was appropriately sized for the datasets. To begin, we compared each processor with its previous-generation counterpart. For performance benchmark comparisons, we selected the Cascade Lake processors closest to their Ice Lake counterparts in terms of hardware features such as cache size, TDP values, and processor base/turbo frequency, and recorded the maximum value attained for ns/day by each of the datasets mentioned in Table 2.



Figure 3a shows the Ice Lake 6330 is up to 50 to 75 percent faster than the 6252; the Xeon 6330 has 16 percent more cores and 9 percent faster memory bandwidth. Figure 3b shows that the Ice Lake 8380 is up to 50 to 65 percent faster than the Xeon 8280 on single-node tests, which is in line with its 42 percent more cores and 9 percent faster memory bandwidth.

This result is due to a higher processor throughput, wherein more data can be accessed by each core. The datasets are also memory intensive, and some percentage of the gain is added by the frequency improvement. Overall, the Ice Lake processor results demonstrated a substantial performance improvement for GROMACS over Cascade Lake processors.

Performance Analysis on Multi-Node
Figure 4(a): Scalability of water 1536K with hyper-threading disabled (80C) vs hyper-threading enabled (160C) w/ Xeon 8380; the dotted line represents the delta between hyper-threading enabled and disabled

Figure 4(b): Scalability of water 3072K with hyper-threading disabled (80C) vs hyper-threading enabled (160C) w/ Xeon 8380; the dotted line represents the delta between hyper-threading enabled and disabled

Figure 4(c): Scalability of HecBioSim 1.4M with hyper-threading disabled (80C) vs hyper-threading enabled (160C) w/ Xeon 8380; the dotted line represents the delta between hyper-threading enabled and disabled

Figure 4(d): Scalability of HecBioSim 3M with hyper-threading disabled (80C) vs hyper-threading enabled (160C) w/ Xeon 8380; the dotted line represents the delta between hyper-threading enabled and disabled

Figure 4(e): Scalability of Lignocellulose 3M with hyper-threading disabled (80C) vs hyper-threading enabled (160C) w/ Xeon 8380; the dotted line represents the delta between hyper-threading enabled and disabled

For multi-node tests, the test bed was configured with an NVIDIA Mellanox HDR interconnect running at 200 Gbps, with each server using the Ice Lake processor. We achieved the expected near-linear performance scalability for GROMACS up to eight nodes with hyper-threading disabled, and approximately 7.25X at eight nodes with hyper-threading enabled, across the datasets. All cores in each server were used while running these benchmarks. The performance increase is close to linear across all the dataset types as the core count increases.

Conclusion

The Ice Lake processor-based Dell EMC PowerEdge servers, with notable hardware feature upgrades over Cascade Lake, show up to a 50 to 60 percent performance gain for all the datasets used for benchmarking GROMACS. Hyper-threading should be disabled for the benchmarks addressed in this blog to get better scalability above eight nodes. The small datasets mentioned in this blog benefit by an additional 5 to 6 percent compared to the larger ones as the core count increases.

Watch our blog site for updates!

  • AI
  • PowerEdge
  • AMD

MD Simulation of GROMACS with AMD EPYC 7003 Series Processors on Dell EMC PowerEdge Servers

Savitha Pareek Joseph Stanfield Ashish K Singh

Thu, 19 Aug 2021 20:06:53 -0000

|

Read Time: 0 minutes

AMD has recently announced and launched its third-generation EPYC 7003 series processor family (code named Milan). These processors build upon the preceding-generation 7002 series (Rome) processors and improve the L3 cache architecture along with increased memory bandwidth for workloads such as High Performance Computing (HPC).

The Dell EMC HPC and AI Innovation Lab has been evaluating these new processors with Dell EMC’s latest 15G PowerEdge servers and reports our initial findings for the molecular dynamics (MD) application GROMACS in this blog.

Given the enormous health impact of the ongoing COVID-19 pandemic, researchers and scientists are working closely with the HPC and AI Innovation Lab to obtain the appropriate computing resources to improve the performance of molecular dynamics simulations. Of these resources, GROMACS is an extensively used application for MD simulations. It has been evaluated with the standard datasets by combining the latest AMD EPYC Milan processors (based on Zen 3 cores) with Dell EMC PowerEdge servers to get the most out of the MD simulations.

In a previous blog, Molecular Dynamic Simulation with GROMACS on AMD EPYC- ROME, we published benchmark data for a GROMACS application study on a single node and multinode with AMD EPYC ROME based Dell EMC servers.

The results featured in this blog come from the test bed described in the following table. We performed a single-node and multi-node application study on Milan processors, using the latest AMD stack shown in Table 1, with GROMACS 2020.4 to understand the performance improvement over the older generation processor (Rome).

Table 1: Testbed hardware and software details

Server | Dell EMC PowerEdge 2-socket servers (with AMD Milan processors) | Dell EMC PowerEdge 2-socket servers (with AMD Rome processors)

Processor | Cores/socket | Frequency (Base-Boost) | Default TDP | L3 cache | Processor bus speed
7763 (Milan) | 64 | 2.45 GHz - 3.5 GHz | 280 W | 256 MB | 16 GT/s
7H12 (Rome) | 64 | 2.6 GHz - 3.3 GHz | 280 W | 256 MB | 16 GT/s
7713 (Milan) | 64 | 2.0 GHz - 3.675 GHz | 225 W | 256 MB | 16 GT/s
7702 (Rome) | 64 | 2.0 GHz - 3.35 GHz | 200 W | 256 MB | 16 GT/s
7543 (Milan) | 32 | 2.8 GHz - 3.7 GHz | 225 W | 256 MB | 16 GT/s
7542 (Rome) | 32 | 2.9 GHz - 3.4 GHz | 225 W | 128 MB | 16 GT/s

Operating system | Red Hat Enterprise Linux 8.3 (4.18.0-240.el8.x86_64) (Milan servers); Red Hat Enterprise Linux 7.8 (Rome servers)
Memory | DDR4 256 GB (16 GB x 16) 3200 MT/s
BIOS/CPLD | 2.0.2 / 1.1.12 (Milan servers)
Interconnect | NVIDIA Mellanox HDR (Milan servers); NVIDIA Mellanox HDR100 (Rome servers)

Table 2: Benchmark datasets used for GROMACS performance evaluation

Datasets | Details
Water Molecule | 1536 K and 3072 K
HecBioSim | 1400 K and 3000 K
Prace - Lignocellulose | 3M

The following sections describe the performance evaluation for the processor stack listed in Table 1.


Rome processors compared to Milan processors (GROMACS)

Figure 1: GROMACS performance comparison with AMD Rome processors

For performance benchmark comparisons, we selected Rome processors that are closest to their Milan counterparts in terms of hardware features such as cache size, TDP values, and Processor Base/Turbo Frequency, and marked the maximum value attained for Ns/day by each of the datasets mentioned in Table 2.

Figure 1 shows that a 32C Milan processor has higher performance (19 percent for water 1536, 21 percent for water 3072, and 10 to approximately 12 percent with the HecBioSim and Lignocellulose datasets) compared to a 32C Rome processor. This result is due to a higher processor speed and improved L3 cache, wherein more data can be accessed by each core.

Next, with the higher-end processors we see only a 10 percent gain for the water datasets, as they are more memory intensive; some percentage is added by the frequency improvement for the remaining datasets. Overall, the Milan processor results demonstrated a substantial performance improvement for GROMACS over Rome processors.


Milan processors comparison (32C processors compared to 64C processors)

Figure 2: GROMACS performance with Milan processors

Figure 2 shows performance relative to the performance obtained on the 7543 processor. For instance, the performance of water 1536 improves over the 32C processor by 41 percent (7713 processor) to 57 percent (7763 processor) on the 64-core (64C) processors. The performance improvement is due to the increasing core counts and higher CPU core frequency. We observed that GROMACS is frequency sensitive, but not to a great extent. Greater gains may be seen when running GROMACS across multiple ensemble runs or running datasets with a higher number of atoms.

We recommend comparing the price-to-performance ratio before choosing a processor based on CPU core frequency alone, as processors with a higher number of lower-frequency cores may provide better total performance for a given dataset.


Multi-node study with 7713 64C processors

Figure 3: Multi-node study with 7713 64c SKUs

For multi-node tests, the test bed was configured with an NVIDIA Mellanox HDR interconnect running at 200 Gbps and each server included an AMD EPYC 7713 processor. We achieved the expected linear performance scalability for GROMACS of up to four nodes and across each of the datasets. All cores in each server were used while running the benchmarks. The performance increases are close to linear across all the dataset types as core count increases.


Conclusion

For the various datasets we evaluated, GROMACS exhibited strong scaling and was compute intensive. We recommend a processor with a high core count for smaller datasets (water 1536, HEC 1400); larger datasets (water 3072, Ligno, HEC 3000) would benefit from more memory per core. Configuring the best BIOS options is important to get the best performance out of the system.

For more information and updates, follow this blog site

  • PowerEdge
  • WRF

WRF Performance with AMD EPYC 7003 Series processors On Dell EMC PowerEdge servers

Puneet Singh Joseph Stanfield Ashish K Singh

Tue, 03 Aug 2021 14:49:25 -0000

|

Read Time: 0 minutes

The Weather Research and Forecasting (WRF) model is an open-source mesoscale weather prediction model that is predominantly used in a multi-compute node environment for atmospheric research and operational forecasts. This model performs well on the latest generation of the AMD EPYC 3rd Gen (7003 Series) processor family, code name Milan. In this blog, we highlight the performance improvement of WRF application on the AMD Milan processors based on Dell EMC PowerEdge servers.

This blog follows up our first blog in this series, where we introduced the AMD Milan processor architecture, key BIOS tuning options, and baseline microbenchmark performance. We analyzed the performance improvement of the latest AMD EPYC Milan (7003 Series) processor-based Dell EMC PowerEdge servers compared to the second-generation AMD EPYC Rome (7002 Series) processor-based Dell EMC PowerEdge servers. The testbed hardware and software details are outlined in the following table: 

Table 1: Testbed hardware and software details

Server | Dell EMC PowerEdge 2-socket servers (with AMD Milan processors) | Dell EMC PowerEdge 2-socket servers (with AMD Rome processors)

Processor model | Cores/socket | Frequency (Base-Boost) | TDP | L3 cache | Processor bus speed
7763 (Milan) | 64 | 2.45 GHz - 3.5 GHz | 280 W | 256 MB | 16 GT/s
7713 (Milan) | 64 | 2.0 GHz - 3.7 GHz | 225 W | 256 MB | 16 GT/s
7543 (Milan) | 32 | 2.8 GHz - 3.7 GHz | 225 W | 256 MB | 16 GT/s
7662 (Rome) | 64 | 2.0 GHz - 3.35 GHz | 200 W | 256 MB | 16 GT/s
7542 (Rome) | 32 | 2.9 GHz - 3.4 GHz | 225 W | 128 MB | 16 GT/s

Operating system | Red Hat Enterprise Linux 8.3 (4.18.0-240.el8.x86_64)
Memory | DDR4 256 GB (16 GB x 16) 3200 MT/s
Interconnect | NVIDIA Mellanox HDR
BIOS/CPLD | 2.2.5 / 1.1.12 (AMD 7763, AMD 7713, AMD 7543); 2.1.6 / 1.1.12 (AMD 7662); 2.1.5 / 0.10.3 (AMD 7542)
Applications | WRF v3.9.1.1, WRF v4.2.2
Benchmark datasets | conus 2.5km, new conus 2.5km, wrf_large 3km

 
 The following figure shows the domain for the tested datasets: 

     

Figure 1: Domain configuration for new conus 2.5 km, conus 2.5 km, and wrf_large datasets.

The following table provides a brief description of each dataset:

Table 2: Configuration for new conus 2.5 km, conus 2.5 km, and wrf_large datasets

Parameter | conus 2.5 km | new conus 2.5 km | wrf_large
Run hours | 3 | 3 | 2
Resolution (m) | 2500 | 2500 | 3000
Vertical layers | 35 | 35 | 50
Grid points | 1501 x 1201 | 1901 x 1301 | 1500 x 1500
interval_seconds | 10800 | 10800 | 21600

The results were measured by averaging the WRF computation time of each timestep from the rsl.error.0000 output file.

 

Single node performance

The following figures show the application performance for the datasets mentioned in Table 2. In each figure, the numbers over the bars represent the relative change in the application performance compared to the application performance obtained on the AMD 7542 Rome processor model. 

 
 

Figure 2: Relative difference in the performance of WRF by processor and dataset type mentioned in Table 1

WRF was compiled with the "dm + sm" configuration, and all the available cores were subscribed during WRF simulation runs. To optimize performance, we tried different MPI process counts, OpenMP thread counts, and tiling schemes (WRF_NUM_TILES). For single-node tests, two MPI processes per Core Complex Die (CCD) delivered the best results for the conus 2.5 km and new conus 2.5 km datasets. We used eight processes per CCD for the wrf_large dataset.

Depending on the dataset, the AMD 7763 processor can deliver up to 14 percent better performance over the AMD 7543 processor. In the previous blog, we observed better performance improvements on the 32 core Milan processor model with memory bandwidth bound benchmarks like HPCG and STREAM. WRF is a memory bandwidth bound application and there is notable performance improvement in the 32-core processor model: the AMD 7543 delivers up to 26 percent better performance over AMD 7542 processor.

From the performance that is shown in Figure 2 and the average power usage data that is shown in figure 3, we noted that the AMD 7713 processor can deliver up to 58 percent better performance per watt than the AMD 7662 processor. 

Figure 3: Power used by platform and processor type: average idle server power usage was 305 W (7542), 338 W (7662), 305 W (7543), 258 W (7713), and 272 W (7763)


Multi-node scalability

To evaluate the scalability of WRF, we used eight nodes. Each node is equipped with an AMD 7713 processor and interconnected using the NVIDIA Mellanox HDR interconnect. The nodes used for benchmarking were connected to the same HDR switch. Table 1 provides details about the server and software used for the test. The text on top of the lines represents the relative change in application performance (on 2, 4, and 8 nodes) with respect to the performance obtained on a single node.


Figure 4: Multi-node performance of WRF on an AMD Milan 7713 processor for datasets listed in Table 1

The scalability numbers have been rounded off to a single digit. We observed good scalability with all the datasets listed in Table 1.


Conclusions and recommendations

WRF delivers better performance and performance per watt on AMD Milan processors. There is a significant performance improvement on the 32 core Milan processor model and the WRF simulations scale well with the datasets described in this blog. However, the scalability might vary depending on the dataset being used and the node count being tested. Ensure that you test the impact of the tile size, process, and threads per process before use. 

We will continue to post new blogs on this site as updates arise.


  • NVIDIA
  • PowerEdge
  • GPU
  • NAMD

Nanoscale Molecular Dynamics (NAMD) Performance with Dell EMC PowerEdge R750xa & NVIDIA A series GPUs

Kihoon Yoon

Thu, 22 Jul 2021 09:03:25 -0000

|

Read Time: 0 minutes

Overview

Over the past decade, GPUs have become popular in scientific computing because of their great ability to exploit a high degree of parallelism. NVIDIA has optimized  life sciences applications to run on their general-purpose GPUs. Unfortunately, these GPUs can only be programmed with CUDA, OpenACC, or the OpenCL framework. Most of the life sciences community is not familiar with these frameworks so few biologists or bioinformaticians can make efficient use of GPU architectures. However, GPUs have been making inroads into the molecular dynamics simulation (MDS) field since MD was developed in the 1950s. MDS requires heavy computational work to simulate biomolecular structures or their interactions.

 

In this blog, the performance of one popular MDS application, NAMD, is presented with various NVIDIA A-series GPUs: the A100, A10, A30, and A40. NAMD is a free and open-source parallel MD package designed for analyzing the physical movements of atoms and molecules.

 

Dell Technologies has released the new PowerEdge R750xa server, a GPU workload platform that is designed to support artificial intelligence, machine learning, and high-performance computing solutions. The dual socket/2U platform supports 3rd Gen Intel Xeon Scalable Processors (code named Ice Lake). It supports up to 40 cores per processor, has eight memory channels per CPU, and up to 32 DDR4 DIMMs at 3200 MT/s DIMM speed. This server can accommodate up to four double-width PCIe GPUs that are located in the front left and the front right of the server. The test server configurations are summarized in Table 1, and the specifications of tested NVIDIA GPUs are listed in Table 2.

 

Table 1: Tested compute node configuration

Component | Dell EMC PowerEdge R750xa | Dell EMC PowerEdge R740
CPU | Intel Xeon Platinum 8380 @ 2.30 GHz; Intel Xeon Platinum 8360Y @ 2.40 GHz | Intel Xeon Gold 6248 @ 2.50 GHz
NVIDIA GPUs | 4 x A100, 4 x A10, 4 x A30, 2 x A40 (across the two test beds)
RAM | DDR4 1024 GB (32 x 32 GB) 3200 MT/s | DDR4 384 GB (24 x 16 GB) 2933 MT/s
Operating system | RHEL 8.3 (4.18.0-240.el8.x86_64)
Filesystem network | Mellanox InfiniBand HDR100
Filesystem | Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage
BIOS system profile | Performance Optimized
Logical processor | Disabled
Virtualization technology | Disabled
CUDA/Toolkit | 11.2
OpenMPI | 4.1.1
NAMD | NAMD_Git-2021-04-01_Source


Table 2: Specifications of tested NVIDIA GPUs

NVIDIA GPUs | A100 | A10 | A30 | A40
FP64 (TFLOPS) | 9.7 | Unknown | 5.2 | Unknown
FP64 Tensor Core (TFLOPS) | 19.5 | Unknown | 10.3 | Unknown
FP32 (TFLOPS) | 19.5 | 31.2 | 10.3 | 37.4
Tensor Float 32 (TFLOPS) | 156 / 312* | 62.5 / 125* | 82 / 165* | 74.8 / 149.6*
BFLOAT16 Tensor Core (TFLOPS) | 312 / 624* | 125 / 250* | 165 / 330* | 149.7 / 299.4*
FP16 Tensor Core (TFLOPS) | 312 / 624* | 125 / 250* | 165 / 330* | 149.7 / 299.4*
INT8 Tensor Core (TOPS) | 624 / 1248* | 250 / 500* | 330 / 661* | 299.3 / 598.6*
INT4 Tensor Core (TOPS) | Unknown | 500 / 1,000* | 661 / 1321* | 598.7 / 1,197.4*
GPU memory | 40 GB HBM2 | 24 GB GDDR6 | 24 GB HBM2 | 48 GB GDDR6
GPU memory bandwidth | 1,555 GB/s | 600 GB/s | 933 GB/s | 696 GB/s
Max Thermal Design Power (TDP) | 400 W | 150 W | 165 W | 300 W
Multi-Instance GPU | Up to 7 MIGs @ 5 GB | Unknown | 4 GPU instances @ 6 GB each; 2 GPU instances @ 12 GB each; 1 GPU instance @ 24 GB | Unknown
Form factor | PCIe | Single-slot, full-height, full-length (FHFL) | Dual-slot, full-height, full-length (FHFL) | 4.4" (H) x 10.5" (L), dual slot
Interconnect | PCIe Gen4: 64 GB/s | PCIe Gen4: 64 GB/s | PCIe Gen4: 64 GB/s | PCIe Gen4 x16: 31.5 GB/s (bidirectional)

* With sparsity


Performance Evaluation

NAMD

NAMD was compiled from source code (NAMD_Git-2021-04-01_Source) using GCC 11.1 and CUDA 11.2. We used a test data set, the 1.06 million-atom system of the Satellite Tobacco Mosaic Virus (STMV).

 

Figure 1 shows the performance of the four GPUs with the STMV dataset. The figures show the performance change in nanoseconds per day (ns/day) with various numbers of cores used with one, two, or four GPUs. The only valid comparison between the various GPUs is NVIDIA A100 versus A10, since those test systems were configured identically. Although the performance of NAMD is affected by CPU clock speed, the tested systems do not differ significantly in CPU clock speed. The A10 is rated at three times the single-precision FLOPS of the A30, and the A10 performs better than the A30 on the two-GPU tests even with slightly slower CPUs. The A100 outperformed the A10 by roughly 25 percent and 16 percent on the single- and two-GPU tests, respectively.

 

The results from the four-GPU tests in Figure 1 show similar performance for the different GPUs. This agrees well with our previous test results showing that NAMD does not scale beyond two GPUs. We can rule out the potential argument that the data size might be too small, since the 3 million-atom HECBioSim 3000k-atom system, which is a pair of 1IVO and 1NQL hEGFR tetramers, shows similar or worse results (those results are not shown here).

 

Figure 1: NAMD performance with  STMV, 1 million-atom system

As shown in Figure 1, when four GPUs were tested, all of the GPUs except the A40 reached ~9 ns/day simulations. In terms of maximum performance, the A10 achieves the highest simulation rate, 9.121 ns/day. However, these numbers are not true reflections of the performance due to the scalability limitations. Although all four-GPU test results are similar, the A100 has better throughput than the other GPUs for the two-GPU test, as shown in Figure 2. Also, it is worth noting that the A10 and the A40 are not suitable for general-purpose computing due to the lack of double-precision support.

 

Figure 2 shows the performance comparisons among the different GPUs we tested in this study. Again, the A30 performed better than the A10 up to 16 cores. It is difficult to determine why the A30 does not perform as well with a large number of active CPU cores (20 and more).

 

Figure 2: STMV test results comparisons with two GPUs

Conclusion

The A100 shows dominant performance and is the most capable card among the A-series GPUs. Although the A30 did not perform as well as the A10 in our test, it is another outstanding choice for versatile applications.

 

The A10 performed well compared to the A30, and it is the successor of the T4, which was the most cost-effective solution for specific applications such as genomics data analysis.

Since it is not possible to obtain the accurate performance differences among A-series GPUs from this study, further investigation is necessary to achieve  a clear picture of these general purpose GPUs.

  • Tuxedo Pipeline
  • PowerEdge C6520

Tuxedo Pipeline Performance on Dell EMC PowerEdge C6520

Kihoon Yoon

Tue, 22 Jun 2021 15:00:00 -0000

|

Read Time: 0 minutes

Overview

Gene expression analysis is as important as identifying Single Nucleotide Polymorphism (SNP), insertion/deletion (indel), or chromosomal restructuring. Ultimately, all physiological and biochemical events depend on the final gene expression product, proteins. Although most mammals have an additional controlling layer before protein expression, knowing how many transcripts exist in a system helps to characterize the biochemical status of a cell. Ideally, technology  should enable us to quantify all proteins in a cell, which would advance the progress of Life Science significantly; however, we are far from achieving this.  

In this blog, we report the test results of one popular RNA-Seq data analysis pipeline known as the Tuxedo pipeline. The Tuxedo pipeline suite offers a set of tools for analyzing a variety of RNA-Seq data, including short-read mapping, identification of splice junctions, transcript and isoform detection, differential expression, visualizations, and quality control metrics. The tested workflow is a differentially expressed gene (DEG) analysis, and the detailed steps in the pipeline are shown in Figure 1. 

 

Figure 1: Updated Tuxedo Pipeline with Cuffquant Step

In this study, the performance of single nodes with 3rd Gen Intel® Xeon® Scalable Processors (code name Ice Lake) on Dell EMC PowerEdge C6520 (liquid-cooled) servers was compared with 2nd Generation Intel® Xeon® Scalable Processors (code name Cascade Lake) on C6420 (air-cooled) servers. The configurations of the test systems are summarized in Table 1.

Table 1: Tested compute node configuration

Component | Details
Servers | Dell EMC PowerEdge C6520 (liquid cooled) and Dell EMC PowerEdge C6420 (air cooled)
CPU (3rd Gen Intel® Xeon® Scalable, C6520) | 2 x Intel® Xeon® Platinum 8358, 32 cores, 2.60 GHz - 3.40 GHz Base-Boost, TDP 250 W; 2 x Intel® Xeon® Platinum 8352Y, 32 cores, 2.20 GHz - 3.40 GHz Base-Boost, TDP 205 W
CPU (2nd Generation Intel® Xeon® Scalable, C6420) | 2 x Intel® Xeon® Gold 6248, 20 cores, 2.50 GHz - 3.90 GHz Base-Boost, TDP 150 W
RAM | DDR4 512 GB (16 x 32 GB) 3200 MT/s
Operating system | RHEL 8.3 (4.18.0-240.el8.x86_64)
Interconnect | Mellanox InfiniBand HDR100
File system | Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage
BIOS system profile | Performance Optimized
Logical processor | Disabled
Virtualization technology | Disabled
tophat | 2.1.1
bowtie2 | 2.2.5
R | 3.6
bioconductor-cummerbund | 2.26.0

A performance study of the RNA-Seq pipeline is not trivial because the workflow requires input files that are non-identical but similar in size. Hence, 185 RNA-Seq paired-end read data sets were collected from a public data repository. All the read data files contain around 25 Million Fragments (MF) and have similar read lengths. The samples for each test are randomly selected from the pool of 185 paired-end read files. Although these test data do not have any biological meaning, data with this level of noise certainly put the tests in a worst-case scenario.
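A sample-selection step of this kind can be sketched as follows; the directory layout and the _R1/_R2 file-naming convention are assumptions for illustration, not the lab's actual data organization.

```python
# Sketch: randomly pick N paired-end samples from a pool of FASTQ pairs.
# The directory and the _R1/_R2 naming convention are assumptions for illustration.
import glob
import random

def pick_samples(pool_dir="reads", n_samples=8, seed=42):
    r1_files = sorted(glob.glob(f"{pool_dir}/*_R1.fastq.gz"))
    random.seed(seed)                       # reproducible selection
    chosen = random.sample(r1_files, n_samples)
    return [(r1, r1.replace("_R1", "_R2")) for r1 in chosen]

for r1, r2 in pick_samples():
    print(r1, r2)
```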

Performance Evaluation

Throughput Test – Single pipeline with more than two samples, biological and technical duplicates

Typical RNA-Seq studies consist of multiple samples, sometimes hundreds of different samples, for example, normal versus disease or untreated versus treated samples. These samples tend to have a high level of noise for biological reasons; hence, the analysis requires rigorous data preprocessing procedures.

We tested various numbers of samples (all different RNA-Seq data selected from the 185 paired-end read data sets) to see how much data can be processed by a single node. Typically, as the number of samples increases, the runtime of the Tuxedo pipeline increases, as shown in Figure 2. The Ice Lake CPUs improve overall runtime by 10% or more compared to the Cascade Lake 6248 CPUs.

Figure 2: Total runtime comparisons from various number of samples with a single compute node

Conclusion

Many additional tests are still required to obtain better insight into Intel Ice Lake processors for the NGS data analysis area. Unfortunately, we could not push our tests beyond 8 samples due to a storage limitation. However, there seems to be plenty of headroom for higher-throughput processing of more than 8 samples together.

  • NVIDIA
  • PowerEdge

Molecular Dynamics Simulations with Dell EMC PowerEdge XE8545 Server and NVIDIA A100

Kihoon Yoon

Wed, 02 Jun 2021 19:37:48 -0000

|

Read Time: 0 minutes

Overview

Over the past decade, graphics processing units, or GPUs, have become popular in scientific computing because of their great ability to exploit a high degree of parallelism. NVIDIA has a handful of life sciences applications optimized and run on their general-purpose GPUs. Unfortunately, these GPUs can only be programmed with CUDA, OpenACC, and the OpenCL framework. Most members of the life sciences community are not familiar with these frameworks, and so few biologists or bioinformaticians can make efficient use of GPU architectures. However, GPUs have been making inroads into the molecular dynamics simulation (MDS) field since MD was developed in the 1950s. MDS requires heavy computational work to simulate biomolecular structures or their interactions.  

In this blog, we tested two MDS applications, NAMD and LAMMPS, using the Dell EMC PowerEdge XE8545 server with NVIDIA A100 GPUs. Since the XE8545 server does not support the NVIDIA V100 GPU, we can only roughly estimate the performance boost with the A100 relative to our previous tests.

These two applications are free and open-source parallel MD packages designed for analyzing the physical movements of atoms and molecules.

The test server configuration is summarized in the following table.

Table 1. Tested compute node configuration

Component | Dell EMC PowerEdge XE8545
CPU | 2 x AMD EPYC 7713 (Milan), 64 cores, 2.0 GHz - 3.7 GHz Base-Boost, TDP 225 W, 256 MB L3 cache
RAM | DDR4 1024 GB (32 x 32 GB) 3200 MT/s
Operating system | RHEL 8.3 (4.18.0-240.el8.x86_64)
Filesystem network | Mellanox InfiniBand HDR100
Filesystem | Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage
BIOS system profile | Performance Optimized
Logical processor | Disabled
Virtualization technology | Disabled
Accelerator | 4 x NVIDIA A100-40 GB SXM4
CUDA/Toolkit | 11.2
OpenMPI | 4.1.1
NAMD | NAMD_Git-2021-04-01_Source
LAMMPS | Stable version (29 Oct 2020)

Performance Evaluation

NAMD

Nanoscale Molecular Dynamics (NAMD) is open-source software for molecular dynamics simulation, written using the Charm++ parallel programming model and designed for high-performance simulation of large biomolecular systems.

NAMD was built from the NAMD_Git-2021-04-01_Source source code with GCC 11.1 and CUDA 11.2. For our tests, we used two data sets: the 1.06 million-atom Satellite Tobacco Mosaic Virus (STMV) system, and the HECBioSim 3000k-atom system, which is a pair of 1IVO and 1NQL hEGFR tetramers.

Figure 1 shows the performance of 4x A100 GPUs with the STMV dataset. NAMD uses the ++p option to specify the number of worker threads, which is generally recommended to equal the total number of cores minus the total number of GPUs. However, with the total core count of the Milan EPYC 7003 family of processors, such as the EPYC 7713 used in the test system, the optimal setting does not follow this generic recommendation; it appears to be around 79 to 90 worker threads, and the optimal number depends on the data size. Close to 9 nanoseconds (ns) of simulation per day is a significant performance gain over the NVIDIA V100 tests that we ran previously. It is difficult to say that the performance gain is the sole contribution of the new A100 GPUs, because the comparison of the 16 GB V100 on the Intel Skylake platform to the 40 GB A100 on the AMD Milan platform may not be valid.
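As a rough guide to the thread-count arithmetic described above, the sketch below computes the generic recommendation (total cores minus GPUs) and assembles a command line. The binary name, the +p/+devices flags, and the flag spelling (+p versus ++p, which differs by launch method) are assumptions about a multicore-CUDA NAMD build, and, as noted above, the best value on high-core-count Milan CPUs may differ from the generic rule.

```python
# Sketch: derive a NAMD worker-thread count and command line for a GPU-resident run.
# The binary name and flags (+p, +devices) assume a multicore-CUDA NAMD build.
def namd_command(total_cores=128, n_gpus=4, config="stmv.namd"):
    workers = total_cores - n_gpus            # generic recommendation: cores minus GPUs
    devices = ",".join(str(i) for i in range(n_gpus))
    return ["namd2", f"+p{workers}", "+devices", devices, config]

print(" ".join(namd_command()))
# On the dual EPYC 7713 test system the sweet spot was closer to 79-90 worker
# threads, so the value above is just a starting point to tune from.
```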


Figure 1. Estimated simulation time per day with 4x NVIDIA A100 GPUs

The purpose of the additional test with the 3 million-atom protein tetramers is to confirm that the STMV results are not an artifact of the relatively small icosahedral STMV structure and its partial simulation of assembly and disassembly processes. Figure 2 shows the nanoseconds of simulation per day for the 3000k-atom data; 2.1 ns/day appears to be close to the maximum performance with 64 cores.


Figure 2. Estimated simulation time per day with 4x NVIDIA A100 GPUs

LAMMPS

Large-scale Atomic/Molecular Massively Parallel Simulator, or LAMMPS, is a classical molecular dynamics code and has potentials for solid-state materials (metals and semiconductors), soft matter (biomolecules and polymers), and coarse-grained or mesoscopic systems. LAMMPS can model atoms, or can be used as a parallel particle simulator at the atomic, meso, or continuum scale. LAMMPS runs on single processors, or in parallel using message-passing techniques and spatial decomposition of the simulation domain. LAMMPS was built with GCC 11.1, OpenMPI 4.1.1, and CUDA 11.2 from the source. The 465k-atom system was selected from HECBioSim.
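
As a point of reference, a run of this kind typically looks like the sketch below; the rank count, input script name, and switches are illustrative (following standard LAMMPS KOKKOS usage), not the exact command behind Figure 3.

# Hypothetical 4-GPU LAMMPS run with the KOKKOS/CUDA backend, one MPI rank per GPU.
# The input script name is a placeholder for the 465k-atom HECBioSim system.
mpirun -np 4 lmp -k on g 4 -sf kk -pk kokkos neigh full comm device -in in.hecbiosim_465k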

As shown in Figure 3, LAMMPS scales well with the number of A100 GPUs. With 4x A100 GPUs, an 8.4 ns/day simulation rate is achievable. 


Figure 3. Estimated simulation time per day with various numbers of GPUs

Conclusion

Although this study cannot directly compare the performance of the A100 and the V100, the Milan CPUs and A100 GPUs show strong synergy between higher core counts and better, faster GPUs. Running NAMD and LAMMPS on the XE8545 with the A100 can deliver better performance than a system with the V100.

Read Full Blog
  • AI
  • NVIDIA
  • PowerEdge
  • machine learning
  • HPC
  • GPU

New Frontiers—Dell EMC PowerEdge R750xa Server with NVIDIA A100 GPUs

Deepthi Cherlopalle Frank Han Savitha Pareek Deepthi Cherlopalle Frank Han Savitha Pareek

Tue, 01 Jun 2021 20:18:04 -0000

|

Read Time: 0 minutes

Dell Technologies has released the new PowerEdge R750xa server, a GPU workload-based platform that is designed to support artificial intelligence, machine learning, and high-performance computing solutions. The dual socket/2U platform supports 3rd Gen Intel Xeon processors (code named Ice Lake). It supports up to 40 cores per processor, has eight memory channels per CPU, and up to 32 DDR4 DIMMs at 3200 MT/s DIMM speed. This server can accommodate up to four double-width PCIe GPUs that are located in the front left and the front right of the server.

Compared with the previous generation PowerEdge C4140 and PowerEdge R740 GPU platform options, the new PowerEdge R750xa server supports larger storage capacity, provides more flexible GPU offerings, and offers an improved thermal design.

 

 

Figure 1 PowerEdge R750xa server

The NVIDIA A100 GPUs are built on the NVIDIA Ampere architecture to enable double precision workloads. This blog evaluates the new PowerEdge R750xa server and compares its performance with the previous generation PowerEdge C4140 server.

The following table shows the specifications for the NVIDIA GPU that is discussed in this blog and compares the performance improvement from the previous generation.

Table 1 NVIDIA GPU specifications


Specification                                      A100 (PCIe)                  V100 (PCIe)     Improvement
GPU architecture                                   Ampere                       Volta           -
GPU memory                                         40 GB                        32 GB           60%
GPU memory bandwidth                               1555 GB/s                    900 GB/s        73%
Peak FP64                                          9.7 TFLOPS                   7 TFLOPS        39%
Peak FP64 Tensor Core                              19.5 TFLOPS                  N/A             -
Peak FP32                                          19.5 TFLOPS                  14 TFLOPS       39%
Peak FP32 Tensor Core                              156 TFLOPS (312 TFLOPS*)     N/A             -
Peak Mixed Precision (FP16 ops/FP32 Accumulate)    312 TFLOPS (624 TFLOPS*)     125 TFLOPS      5x
GPU base clock                                     765 MHz                      1230 MHz        -
Peak INT8                                          624 TOPS (1,248 TOPS*)       N/A             -
GPU boost clock                                    1410 MHz                     1380 MHz        2.1%
NVLink speed                                       600 GB/s                     N/A             -
Maximum power consumption                          250 W                        250 W           No change

*with sparsity

Test bed and applications

This blog quantifies the performance improvement of the GPUs with the new PowerEdge GPU platform.

Using a single node PowerEdge R750xa server in the Dell HPC & AI Innovation Lab, we derived all results presented in this blog from this test bed. This section describes the test bed and the applications that were evaluated as part of the study. The following table provides test environment details:

Table 2 Server configuration

Component            Test Bed 1                                Test Bed 2
Server               Dell PowerEdge R750xa                     Dell PowerEdge C4140 configuration M
Processor            Intel Xeon 8380                           Intel Xeon 6248
Memory               32 x 16 GB @ 3200 MT/s                    16 x 16 GB @ 2933 MT/s
Operating system     Red Hat Enterprise Linux 8.3              Red Hat Enterprise Linux 8.3
GPU                  4 x NVIDIA A100-PCIe-40 GB GPU            4 x NVIDIA V100-PCIe-32 GB GPU

The following table provides information about the applications and benchmarks used:

Table 3 Benchmark and application details

High-Performance Linpack (floating point compute-intensive system benchmark)
  Version: xhpl_cuda-11.0-dyn_mkl-static_ompi-4.0.4_gcc4.8.5_7-23-20
  Benchmark dataset: problem size is more than 95% of GPU memory

HPCG (sparse matrix calculations)
  Version: xhpcg-3.1_cuda_11_ompi-3.1
  Benchmark dataset: 512 x 512 x 288

GROMACS (molecular dynamics application)
  Version: 2020
  Benchmark datasets: Ligno Cellulose, Water 1536, Water 3072

LAMMPS (molecular dynamics application)
  Version: 29 October 2020 release
  Benchmark dataset: Lennard Jones

LAMMPS

Large-Scale Atomic/Molecular Massively Parallel simulator (LAMMPS) is distributed by Sandia National Labs and the US Department of Energy. LAMMPS is open-source code that has different accelerated models for performance on CPUs and GPUs. For our test, we compiled the binary using the KOKKOS package, which runs efficiently on GPUs.
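
For readers who want to reproduce a similar binary, the following is a rough build sketch with the KOKKOS package enabled for NVIDIA Ampere GPUs; the exact compiler versions and options of our lab build are not published here, so treat this only as an example.

# Hypothetical CMake configuration for LAMMPS with KOKKOS on CUDA (Ampere, compute capability 8.0).
cmake ../cmake \
  -D PKG_KOKKOS=on \
  -D Kokkos_ENABLE_CUDA=yes \
  -D Kokkos_ARCH_AMPERE80=yes \
  -D BUILD_MPI=yes
make -j 16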

Figure 2 LAMMPS Performance on PowerEdge R750xa and PowerEdge C4140 servers

With the newer generation GPUs, this application improves by 2.4 times in the single-GPU comparison. The overall performance from a single server improves by two times with the PowerEdge R750xa server and NVIDIA A100 GPUs.

GROMACS

GROMACS is a free and open-source parallel molecular dynamics package designed for simulations of biochemical molecules such as proteins, lipids, and nucleic acids. It is used by a wide variety of researchers, particularly for biomolecular and chemistry simulations. GROMACS supports all the usual algorithms expected from modern molecular dynamics implementation. It is open-source software with the latest versions available under the GNU Lesser General Public License (LGPL).

Figure 3 GROMACS performance on PowerEdge C4140 and R750xa servers

With the newer generation GPUs, this application improved by approximately 1.5 times across the datasets in the single-GPU comparison. The overall performance from a single server also improved by 1.5 times with the PowerEdge R750xa server and NVIDIA A100 GPUs.

High-Performance Linpack

High-Performance Linpack (HPL) needs no introduction in the HPC arena. It is one of the most widely used standard benchmarks in the industry.  

 Figure 4 HPL Performance on the PowerEdge R750xa server with A100 GPU and PowerEdge C4140 server with V100 GPU

Figure 5 Power use of the HPL running on NVIDIA GPUs

From Figure 4 and Figure 5, the following results were observed: 

  • Performance: For the same GPU count, the NVIDIA A100 GPU demonstrates twice the performance of the NVIDIA V100 GPU. Higher memory size, higher double-precision FLOPS, and a newer architecture all contribute to the improvement for the NVIDIA A100 GPU.
  • Scalability: The PowerEdge R750xa server with four NVIDIA A100-PCIe-40 GB GPUs delivers 3.6 times higher HPL performance compared to one NVIDIA A100-PCIe-40 GB GPU. The NVIDIA A100 GPUs scale well inside the PowerEdge R750xa server for the HPL benchmark.  
  • Higher Rpeak: The HPL code on NVIDIA A100 GPUs uses the new double-precision Tensor cores. The theoretical peak for each GPU is 19.5 TFlops, as opposed to 9.7 TFlops. 
  • Power: Figure 5 shows the power consumption of a complete HPL run on the PowerEdge R750xa server using four A100-PCIe GPUs. This result was measured with iDRAC commands, and the peak power consumption observed was 2022 Watts. Based on our previous observations, we know that the PowerEdge C4140 server consumes approximately 1800 W of power.

HPCG

Figure 6 Scaling GPU performance data for HPCG Benchmark

As discussed in other blogs, High Performance Conjugate Gradient (HPCG) is another standard benchmark that tests the data access patterns of sparse matrix calculations. From the graph, we see that HPCG scales well on the PowerEdge R750xa, resulting in a 1.6 times performance improvement over the previous generation PowerEdge C4140 server with an NVIDIA V100 GPU.

The 73 percent improvement in memory bandwidth of the NVIDIA A100 GPU over the NVIDIA V100 GPU contributes to this performance improvement.

Conclusion

In this blog, we introduced the latest generation PowerEdge R750xa platform and discussed the performance improvement over the previous generation PowerEdge C4140 server. The PowerEdge R750xa server is a good option for customers looking for an Intel Xeon scalable CPU-based platform powered with NVIDIA GPUs.

With the newer generation PowerEdge R750xa server and NVIDIA A100 GPUs, the applications discussed in this blog show significant performance improvement.

Next steps

In future blogs, we plan to evaluate NVLINK bridge support, which is another important feature of the PowerEdge R750xa server and NVIDIA A100 GPUs.

Read Full Blog
  • PowerEdge
  • Intel Xeon

Processing Six Human 50x WGS per day with 3rd Gen Intel Xeon Scalable Processors

Kihoon Yoon Kihoon Yoon

Mon, 24 May 2021 22:07:44 -0000

|

Read Time: 0 minutes

Overview

Intel® Xeon® Scalable Processors have been proven for consistent and stable performance for many workload types. New 3rd Generation Intel® Xeon® Scalable Processors, also known by the code name of Ice Lake perform exceptionally well for a BWA-GATK pipeline. In this study, we tested two Ice Lake processors, 8352Y and 8358, and the test server configuration is also summarized in Table 1.

Table 1. Tested compute node configuration

Dell EMC PowerEdge C6520

CPU

Tested 3rd Gen Intel® Xeon® Scalable Processors:

2x Intel® Xeon® Platinum 8352Y Processor, 32 cores, 2.20 GHz – 3.40 GHz Base-Boost, TDP 205 W, 48 MB L3 Cache

2x Intel® Xeon® Platinum 8358 Processor, 32 cores, 2.60 GHz – 3.40 GHz Base-Boost, TDP 250 W, 48 MB L3 Cache

RAM

DDR4 512G (32 GB x 12) 3200 MT/s

Operating system

RHEL 8.3 (4.18.0-240.22.1)

Filesystem network

NVIDIA Mellanox InfiniBand HDR100

Filesystem

Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage

BIOS system profile

Performance Optimized

Logical processor

Disabled

Virtualization technology

Disabled

BWA

0.7.15-r1140

Sambamba

0.7.0

Samtools

1.6

GATK

3.60-g89b7209

The test data was chosen from one of Illumina’s Platinum Genomes. ERR194161 was processed with Illumina HiSeq 2000 submitted by Illumina and can be obtained from EMBL-EBI. The DNA identifier for this individual is NA12878. The description of the data from the linked website shows that this sample has a >30x depth of coverage, and it reaches ~53x.

Performance evaluation

Single sample performance

Table 2 summarizes the overall runtimes and the comparisons between each step for our 9-step BWA-GATK pipeline with a single sample.

The mapping and sorting step is the only step in Table 2 in which we can observe the true performance variation across the different CPUs. A rough estimate of the overall performance improvement from the 6248R (6248) to the 8352Y and 8358 is 3.8 (9.0) percent and 4.8 (10.0) percent, respectively. The test bed for the 6248R was a Dell EMC PowerEdge R640 server with 394 GB of RAM and local storage, and the configuration details for the 6248 can be found from the embedded link. 

The mapping and sorting step shows a decent ~36 percent runtime reduction, thanks to the good scalability of BWA. The base recalibration step also takes advantage of the higher core count of the Ice Lake CPUs.

Table 2. BWA-GATK performance comparisons between Ice Lake and Cascade Lake

Steps (runtime in hours)             8352Y 32c     8358 32c      6248R 24c     6248 20c
                                     2.2 GHz       2.6 GHz       3.0 GHz       2.5 GHz
Mapping and sorting                  3.23 (32)     3.23 (32)     5.04 (24)     5.22 (20)
Mark duplicates                      1.16 (13)     1.16 (13)     1.14 (13)     1.29 (13)
Generate realigning targets          0.47 (32)     0.46 (32)     0.16 (24)     0.42 (20)
Insertion and deletion realigning    8.16 (1)      7.97 (1)      7.20 (1)      7.87 (1)
Base recalibration                   2.06 (32)     2.07 (32)     2.41 (24)     2.30 (20)
HaplotypeCaller                      8.01 (16)     7.96 (16)     8.06 (16)     8.25 (16)
Genotype GVCFs                       0.01 (32)     0.01 (32)     0.01 (24)     0.01 (20)
Variant recalibration                0.20 (1)      0.20 (1)      0.19 (1)      0.23 (1)
Apply variant recalibration          0.01 (1)      0.01 (1)      0.01 (1)      0.01 (1)
Total runtime (hours)                23.32         23.07         24.23         25.61

Note: The number of cores used for the test is parenthesized.

Multiple sample performances – throughput

A typical way of running an NGS pipeline is to process multiple samples on a compute node and use multiple compute nodes to maximize the throughput. However, this time the tests were performed on a single compute node due to the limited number of servers available at this moment. 

The current pipeline invokes many pipe operations in the first step to minimize the writing of intermediate files. Although this saves a day of runtime and lowers storage usage significantly, the cost of invoking the pipes is quite heavy, and it limits the number of samples that can be processed concurrently. Typically, a process fails silently when there are not enough resources left to start an additional process.
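
To make the piping concrete, a minimal sketch of the kind of piped mapping-and-sorting command this step uses is shown below. The file names, read group, and thread counts are placeholders, not the exact command from our pipeline.

# Hypothetical piped mapping-and-sorting step: BWA-MEM streams SAM output directly into
# samtools sort, so no intermediate SAM/BAM file is written to the filesystem.
bwa mem -t 32 -R "@RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA" ref.fa sample_R1.fastq.gz sample_R2.fastq.gz \
  | samtools sort -@ 8 -o sample.sorted.bam -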

As shown in Table 3 for the 8352Y tests, the maximum number of samples that can be processed simultaneously is around 14. Although a 14-sample test was not performed, 14 is likely the practical maximum because two of the pipelines failed in the 16-sample test. In other words, a throughput of roughly 6 genomes per day is achievable with the 8352Y. The 8358 also shows 2 failed processes when 16 samples are processed simultaneously, while its throughput reaches roughly 7 genomes per day (Table 4).

Table 3. Throughput test for Intel® Xeon® Platinum 8352Y

Steps (runtime in hours)             Number of samples
                                     1        2        4        8        12       16
Number of samples failed             0        0        0        0        0        2
Mapping and sorting                  2.84     4.20     7.11     13.44    20.77    26.62
Mark duplicates                      1.17     1.18     1.29     1.77     2.49     3.05
Generate realigning targets          0.46     0.51     0.52     0.77     1.09     1.25
Insertion and deletion realigning    7.94     8.04     8.02     8.00     8.26     8.11
Base recalibration                   2.00     2.16     2.83     4.41     6.04     7.20
HaplotypeCaller                      8.00     7.93     9.10     9.24     9.31     9.26
Genotype GVCFs                       0.02     0.02     0.03     0.02     0.03     0.04
Variant recalibration                0.17     0.20     0.21     0.20     0.19     0.23
Apply variant recalibration          0.01     0.02     0.02     0.02     0.02     0.03
Total runtime (hours)                22.60    24.26    29.12    37.89    48.20    55.78
Genomes per day                      1.06     1.98     3.30     5.07     5.98     6.02

Table 4. Throughput test for Intel® Xeon® Platinum 8358

Steps (runtime in hours)             Number of samples
                                     1        8        12       14       16
Number of samples failed             0        0        0        0        2
Mapping and sorting                  2.67     11.79    18.26    22.84    24.34
Mark duplicates                      1.16     1.51     2.18     2.59     2.65
Generate realigning targets          0.43     0.70     0.96     1.17     1.15
Insertion and deletion realigning    7.97     8.00     7.99     8.20     8.19
Base recalibration                   1.94     4.05     5.65     6.47     6.56
HaplotypeCaller                      8.00     8.21     8.22     8.24     8.25
Genotype GVCFs                       0.02     0.03     0.03     0.03     0.02
Variant recalibration                0.18     0.25     0.14     0.30     0.30
Apply variant recalibration          0.01     0.01     0.02     0.02     0.02
Total runtime (hours)                22.37    34.55    43.44    49.86    51.49
Genomes per day                      1.07     5.56     6.63     6.74     6.53

Conclusion

The field of NGS data analysis has been moving fast in terms of data growth and data variation. The majority of open-source applications for NGS data analysis cannot take advantage of accelerator technology and do not scale well with the number of cores. It is time for users to think about how this problem can be tackled. One simple way to avoid it is data-level parallelization. Although deciding where to split the data is hard, it is tractable with careful intervention in an existing BWA-GATK pipeline and without diluting statistical power simply by splitting the data. If each smaller data chunk goes through an individual pipeline on its own cores and the results are merged at the end, it may be possible to achieve better performance on a single sample. This performance gain could lead to higher throughput if the overall runtime is reduced significantly.

Nonetheless, 3rd Generation Intel® Xeon® Scalable Processors, especially the 8352Y and 8358, are excellent choices for achieving the highest variant-calling throughput as well as fast single-sample analysis.

Read Full Blog
  • Intel
  • PowerEdge
  • HPC

Intel Ice Lake - BIOS Characterization for HPC

Joseph Stanfield Tarun Singh Savitha Pareek Ashish K Singh Puneet Singh Joseph Stanfield Tarun Singh Savitha Pareek Ashish K Singh Puneet Singh

Tue, 25 May 2021 13:10:03 -0000

|

Read Time: 0 minutes

Intel recently announced the 3rd Generation Intel Xeon Scalable processors (code-named “Ice Lake”), which are based on a new 10 nm manufacturing process. This blog provides the new Ice Lake processor synthetic benchmark results and the recommended BIOS settings on Dell EMC PowerEdge servers.

Ice Lake processors offer a higher core count of up to 40 cores with a single Ice Lake 8380 processor. The Ice Lake processors have larger L3, L2, and L1 data cache than Intel’s second-generation Cascade Lake processors. These features are expected to improve performance of CPU-bound software applications. Table 1 shows the L1, L2, and L3 cache size on the 8380 processor model.

Ice Lake still supports the AVX-512 SIMD instructions, which allow for 32 double-precision FLOPs per cycle per core. The upgraded Ultra Path Interconnect (UPI) link speed of 11.2 GT/s is expected to improve data movement between the sockets. In addition to higher core counts and frequencies, Ice Lake-based Dell EMC PowerEdge servers support DDR4-3200 MT/s DIMMs with eight memory channels per processor, which is expected to improve the performance of memory bandwidth-bound applications. Ice Lake processors now support up to 6 TB of memory per socket.

Instructions such as Vector CLMUL, VPMADD52, Vector AES, and GFNI Extensions have been optimized to improve use of vector registers. The performance of software applications in the cryptography domain is also expected to benefit. The Ice Lake processor also includes improvements to Intel Speed Select Technology (Intel SST). With Intel SST, a few cores from the total available cores can be operated at a higher base frequency, turbo frequency, or power. This blog does not address this feature.

Table 1: hwloc-ls and numactl -H command output on an Intel 8380 processor-based server with Round Robin core enumeration (MadtCoreEnumeration) and SubNumaCluster (Sub-NUMA Cluster) set to 2-Way

hwloc-ls

numactl -H

Machine (247GB total)

  Package L#0 + L3 L#0 (60MB)

    Group0 L#0

      NUMANode L#0 (P#0 61GB)

      L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)

      L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#4)

      L2 L#2 (1280KB) + L1d L#2 (48KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#8)

      L2 L#3 (1280KB) + L1d L#3 (48KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#12)

      L2 L#4 (1280KB) + L1d L#4 (48KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#16)

      L2 L#5 (1280KB) + L1d L#5 (48KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#20)

      L2 L#6 (1280KB) + L1d L#6 (48KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#24)

      L2 L#7 (1280KB) + L1d L#7 (48KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#28)

      L2 L#8 (1280KB) + L1d L#8 (48KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#32)

      L2 L#9 (1280KB) + L1d L#9 (48KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#36)

      L2 L#10 (1280KB) + L1d L#10 (48KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#40)

      L2 L#11 (1280KB) + L1d L#11 (48KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#44)

      L2 L#12 (1280KB) + L1d L#12 (48KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#48)

      L2 L#13 (1280KB) + L1d L#13 (48KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#52)

      L2 L#14 (1280KB) + L1d L#14 (48KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#56)

      L2 L#15 (1280KB) + L1d L#15 (48KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#60)

      L2 L#16 (1280KB) + L1d L#16 (48KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#64)

      L2 L#17 (1280KB) + L1d L#17 (48KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#68)

      L2 L#18 (1280KB) + L1d L#18 (48KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#72)

      L2 L#19 (1280KB) + L1d L#19 (48KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#76)

      HostBridge.

<snip>


 

 


BIOS options tested on Ice Lake processors

Table 2 provides the server details used for the performance tests. The following BIOS options were explored in the performance testing:

  • BIOS.ProcSettings.SubNumaCluster—Breaks up the LLC into disjoint clusters based on address range, with each cluster bound to a subset of the memory controllers in the system. It improves average latency to the LLC. Sub-NUMA Cluster (SNC) is disabled if NVDIMM-N is installed in the system.
  • BIOS.ProcSettings.DeadLineLlcAlloc—If enabled, fills in dead lines in LLC opportunistically.
  • BIOS.ProcSettings.LlcPrefetch—Enables and disables LLC Prefetch on all threads.
  • BIOS.ProcSettings.XptPrefetch—If enabled, enables the MS2IDI to take a read request that is being sent to the LLC and speculatively issue a copy of that read request to the memory controller.
  • BIOS.ProcSettings.UpiPrefetch—Starts the memory read early on the DDR bus. The UPI Rx path spawns a MemSpecRd to iMC directly.
  • BIOS.ProcSettings.DcuIpPrefetcher (Data Cache Unit IP Prefetcher)—Affects performance, depending on the application running on the server. This setting is recommended for High Performance Computing applications.
  • BIOS.ProcSettings.DcuStreamerPrefetcher (Data Cache Unit Streamer Prefetcher)—Affects performance, depending on the application running on the server. This setting is recommended for High Performance Computing applications.
  • BIOS.ProcSettings.ProcAdjCacheLine—When set to Enabled, optimizes the system for applications that require high utilization of sequential memory access. Disable this option for applications that require high utilization of random memory access.
  • BIOS.SysProfileSettings.SysProfile—Sets the System Profile to Performance Per Watt (DAPC), Performance Per Watt (OS), Performance, Workstation Performance, or Custom mode. When set to a mode other than Custom, the BIOS sets each option accordingly. When set to Custom, you can change setting of each option.
  • BIOS.ProcSettings.LogicalProc—Reports the logical processors. Each processor core supports up to two logical processors. When set to Enabled, the BIOS reports all logical processors. When set to Disabled, the BIOS only reports one logical processor per core. Generally, a higher processor count results in increased performance for most multithreaded workloads. The recommendation is to keep this option enabled. However, there are some floating point and scientific workloads, including HPC workloads, where disabling this feature might result in higher performance.

You can set the DeadLineLlcAlloc, LlcPrefetch, XptPrefetch, UpiPrefetch, DcuIpPrefetcher, DcuStreamerPrefetcher, ProcAdjCacheLine, and LogicalProc BIOS options to either Enabled or Disabled. You can set the SubNumaCluster to 2-Way and Disabled. The SysProfile setting can have five possible values: PerformanceOptimized, PerfPerWattOptimizedDapc, PerfPerWattOptimizedOs, PerfWorkStationOptimized and Custom.  
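
For readers who want to reproduce these settings out of band, the sketch below shows one way to stage them with the iDRAC racadm utility. The attribute value strings are assumptions that can vary by BIOS release; verify them with racadm get BIOS.ProcSettings before applying.

# Hypothetical racadm sequence for the BIOS settings used in this study (value strings assumed).
racadm set BIOS.ProcSettings.SubNumaCluster 2WayClustering
racadm set BIOS.ProcSettings.LogicalProc Disabled
racadm set BIOS.SysProfileSettings.SysProfile PerfOptimized
# Queue a BIOS configuration job; the settings take effect on the next reboot.
racadm jobqueue create BIOS.Setup.1-1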

Table 2: Test bed hardware and software details 

Component

Dell EMC PowerEdge R750 server

Dell EMC PowerEdge C6520 server

Dell EMC PowerEdge C6420 server

Dell EMC PowerEdge C6420 server

OPN

8380

6338

8280

6252

Cores/Socket

40

32

28

24

Frequency (Base-Boost)
 

2.30 – 3.40 GHz

2.0 – 3.20 GHz

2.70 – 4.0 GHz

2.10 – 3.70 GHz

TDP

270 W

205 W

205 W

150 W

L3Cache

60M

48M

38.5M

37.75M

Operating System

Red Hat Enterprise Linux 8.3 4.18.0-240.22.1.el8_3.x86_64

Red Hat Enterprise Linux 8.3 4.18.0-240.22.1.el8_3.x86_64

Red Hat Enterprise Linux 8.3

4.18.0-240.el8.x86_64

Red Hat Enterprise Linux 8.3

4.18.0-240.el8.x86_64

Memory

16 GB x 16 (2Rx8) 3200 MT/s

16 GB x 16 (2Rx8) 3200 MT/s

16 GB x 12 (2Rx8)

2933 MT/s

16 GB x 12 (2Rx8)

2933 MT/s

BIOS/CPLD

1.1.2/1.0.1

Interconnect

NVIDIA Mellanox HDR

NVIDIA Mellanox HDR

NVIDIA Mellanox HDR100

NVIDIA Mellanox HDR100

Compiler

Intel parallel studio 2020 (update 4)

Benchmark software

  • HPL v 2.3 (parallel studio 2020 (update 4)
  • STREAM v5.10
  • HPCG v3.1 (parallel studio 2020 update 4)
  • OSU v 5.7
  • WRF v3.9.1.1 (conus 2.5 km dataset)

The system profile BIOS meta option helps to set a group of BIOS options (such as C1E, C States, and so on), each of which control performance and power management settings to a particular value. It is also possible to set these groups of BIOS options individually to a different value using the Custom system profile.

 Application performance results

Table 2 lists details about the software used for benchmarking the server. We used the precompiled HPL and HPCG binary files, which are part of Intel Parallel Studio 2020 update 4 software bundle, for our tests. We compiled the WRF application with AVX2 support. WRF and HPCG issue many nonfloating point packed micro-operations (approximately 73 percent to 90 percent of the total packed micro-operations). They are memory-bound (and DRAM-bandwidth bound) workloads. HPL issues packed double precision micro-operations and is a compute-bound workload.

After setting Sub-NUMA Cluster (BIOS.ProcSettings.SubNumaCluster) to 2-Way, Logical Processors (BIOS.ProcSettings.LogicalProc) to Disabled, and other settings (DeadLineLlcAlloc, LlcPrefetch, XptPrefetch, UpiPrefetch, DcuIpPrefetcher, DcuStreamerPrefetcher, ProcAdjCacheLine) to Enabled, we measured the impact of System Profile (BIOS.SysProfileSettings.SysProfile) BIOS parameters on application performance.

Figure 1 through Figure 4 show application performance. In each figure, the numbers over the bars represent the relative change in the application performance with respect to the application performance obtained on the Intel 6338 Ice Lake processor with the System Profile set to Performance Optimized (PO).

Note: In the figures, PO=PerformanceOptimized, PPWOD=PerfPerWattOptimizedDapc, PPWOO=PerfPerWattOptimizedOs, and PWSO=PerfWorkStationOptimized.

HPL Benchmark

 Figure 1: Relative difference in the performance of HPL by processor and Sysprofile setting

HPCG Benchmark

 Figure 2: Relative difference in the performance of HPCG by processor and Sysprofile setting

STREAM Benchmark

 Figure 3: Relative difference in the performance of STREAM by processor and Sysprofile setting

WRF Benchmark

Figure 4: Relative difference in the performance of WRF by processor and Sysprofile setting

 We obtained the performance for the applications in Figure 2 through Figure 4 by fully subscribing to all available cores. Depending on the processor model, we achieved 78 percent to 80 percent efficiency with HPL and STREAM benchmarks using the Performance Optimized profile.

Intel has extended the TDP of the Ice Lake processors with the top-end Intel 8380 processor at 270 W TDP. The following figure shows the power use on the systems with the applications listed in Table 2.


Note: In this figure, PO=PerformanceOptimized, PPWOD=PerfPerWattOptimizedDapc, PPWOO=PerfPerWattOptimizedOs and PWSO=PerfWorkStationOptimized

Figure 5: Power use by platform and processor type. Average idle power usage was approximately 335 W on the PowerEdge C6520 server (Intel 6338 processor) and approximately 470 W on the PowerEdge R750 server (Intel 8380 processor) using the Performance Optimized system profile. 

When SNC is set to 2-Way, the system exposes four NUMA nodes. We tested the NUMA bandwidth, remote socket bandwidth, and local socket bandwidth using the STREAM TRIAD benchmark. In Figure 6, the CPU NUMA node is represented as c and the memory node is represented as m. As an example for NUMA bandwidth, the c0m0 (blue bars) test type represents the STREAM TRIAD test carried out between NUMA node 0 and memory node 0. Figure 6 shows the best bandwidth numbers obtained on varying the number of threads per test type.
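
The binding pattern behind the c#m# test types can be reproduced with numactl; a minimal sketch is shown below, with the STREAM binary path and thread count as placeholders.

# Hypothetical local vs remote STREAM TRIAD runs with SNC set to 2-Way (four NUMA nodes exposed).
export OMP_NUM_THREADS=20
numactl --cpunodebind=0 --membind=0 ./stream   # c0m0: local NUMA bandwidth
numactl --cpunodebind=0 --membind=2 ./stream   # c0m2: remote-socket bandwidth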

Figure 6: Local and remote NUMA memory bandwidth.

Remote socket bandwidth was measured between CPU nodes 0 and 1 and memory nodes 2 and 3. Local bandwidth was measured between CPU nodes 0 and 1 and their local memory nodes 0 and 1. The following figure shows the performance numbers.

Figure 7: Local and remote processor bandwidth.

Impact of BIOS options on application performance

We tested the impact of the DeadLineLlcAlloc, LlcPrefetch, XptPrefetch, UpiPrefetch, DcuIpPrefetcher, DcuStreamerPrefetcher, and ProcAdjCacheLine options with the Performance Optimized (PO) system profile. These BIOS options do not have a significant impact on the performance of the applications addressed in this blog; therefore, we recommend leaving them set to Enabled.

Figure 8 and Figure 9 show the impact of the Sub-NUMA Cluster (SNC) BIOS option on the application performance. In each figure, the numbers over the bars represent the relative change in the application performance with respect to the application performance obtained on the Intel 6338 Ice Lake processor with SNC feature set to Disabled

Figure 8: HPL and HPCG performance variation by processor model with Sub-NUMA Cluster set to Disabled (SNC=D) and 2-Way (SNC=2W)

Figure 9: STREAM and WRF performance variation by processor model with Sub-NUMA Cluster set to Disabled (SNC=D) and 2-Way (SNC=2W)

The SubNumaCluster option can affect applications that are memory bandwidth-bound (for example, STREAM, HPCG, and WRF). We recommend setting SubNumaCluster to 2-Way, as it improves the workloads addressed in this blog by one percent to six percent, depending on the processor model and application.

InfiniBand bandwidth and message rate

The Ice Lake-based processors now support PCIe Gen 4, which allows NVIDIA Mellanox HDR adapter cards to be used with Dell EMC PowerEdge servers. Figure 10, Figure 11, and Figure 12 show the message rate, unidirectional, and bi-directional InfiniBand bandwidth test results from the OSU Benchmarks suite. The network adapter card was connected to the second socket (NUMA node 2); therefore, the local bandwidth tests were carried out with processes bound to NUMA node 2, and the remote bandwidth tests with processes bound to NUMA node 0. In Figure 10 and Figure 11, the numbers in red over the orange bars represent the percentage difference between the local and remote bandwidth results.
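
The local and remote binding pattern for the OSU tests can be reproduced along the lines of the sketch below; host names, binding, and launcher options are placeholders and may differ from the exact commands we used.

# Hypothetical two-node OSU bandwidth runs (one rank per node), binding the rank either to the
# NUMA node local to the HDR adapter (node 2) or to a remote NUMA node (node 0) for comparison.
mpirun -np 2 -host node1,node2 numactl --cpunodebind=2 --membind=2 ./osu_bw   # local test
mpirun -np 2 -host node1,node2 numactl --cpunodebind=0 --membind=0 ./osu_bw   # remote test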

Figure 10: OSU Benchmark unidirectional bandwidth test on two servers with Intel 8380 processors and NVIDIA Mellanox HDR InfiniBand

 

Figure 11: OSU Benchmark bi-directional bandwidth test on two servers with Intel 8380 processors and NVIDIA Mellanox HDR InfiniBand

 

Figure 12: Interconnect bandwidth and message rate performance obtained between two servers having Intel 8380 processors with OSU Benchmark

On two nodes connected using the NVIDIA Mellanox ConnectX-6 HDR InfiniBand adapter cards, we achieved approximately 25 GB/s unidirectional bandwidth and a message rate of approximately 200 million messages/second—almost double the performance numbers obtained on the NVIDIA Mellanox HDR100 card.

Comparison with Cascade Lake processors

Based on the compute resources available in our Dell EMC HPC & AI Innovation Lab, we selected Cascade Lake processor-based servers and benchmarked them with the software listed in Table 2. Figure 13 through Figure 16 show performance results from the Intel Ice Lake and Cascade Lake processors. The numbers over the bars represent the relative change in application performance with respect to the performance obtained on the Intel 6252 Cascade Lake processor.

Figure 13: HPL performance on processors listed in Table 2

 

Figure 14: HPCG performance on processors listed in Table 2

 

Figure 15: STREAM TRIAD test performance on Processors listed in Table 2 

 

 Figure 16: WRF performance on Processors listed in Table 2

Ice Lake delivers approximately 38 percent better performance than Cascade Lake with HPL on the top-end processor model. The memory bandwidth-bound benchmarks, STREAM and HPCG (see Figure 14 and Figure 15), delivered 42 percent to 43 percent performance improvement over the top-end Cascade Lake processors addressed in this blog.

The average real-time power usage of the Dell EMC PowerEdge platforms (listed in Table 2) was measured while running the synthetic benchmarks listed in this blog. Figure 17 compares the power usage data from the Cascade Lake and Ice Lake platforms. The number over each bar represents the relative change in power with respect to the base power measured (Intel 6252 processor in the idle state).

Figure 17: Average power usage during benchmark runs on Dell EMC PowerEdge servers (see details in Table 2)

Considering the data with the Performance Optimized profile with the respective power measurement, the applications (depending on the processor model) were unable to deliver better performance per watt on the Ice Lake platform when compared to the Cascade Lake platform.

Summary and future work

The Ice Lake processor-based Dell EMC PowerEdge servers, with notable hardware feature upgrades over Cascade Lake, show up to 47 percent performance gains across the HPC benchmarks addressed in this blog. Hyper-threading should be disabled for these benchmarks; for other workloads, the option should be tested and enabled as appropriate. Watch this space for subsequent blogs that describe application performance studies on our new Ice Lake processor-based cluster.


Read Full Blog
  • containers
  • HPC
  • Omnia

Containerized HPC Workloads Made Easy with Omnia and Singularity

John Lockman Luke Wilson PhD John Lockman Luke Wilson PhD

Mon, 28 Jun 2021 14:35:14 -0000

|

Read Time: 0 minutes

Maximizing application performance and system utilization has always been important for HPC users.  The libraries, compilers, and applications found on these systems are the result of heroic efforts by HPC system administrators and teams of HPC specialists who fine tune, test, and maintain optimal builds of complex hierarchies of software for users. Fortunately for both researchers and administrators, some of that burden can be relieved with the use of containers, where software solutions can be built to run reliably when moved from one computing environment to another. This includes moving from one research lab to another, or from the developer’s laptop to a production lab, or even from an on-prem data center to the cloud. 

Singularity has provided HPC system administrators and users a way to take advantage of application containerization while running on batch-scheduled systems. Singularity is a container runtime that can build containers in its own native format, as well as execute any CRI-compatible container. By default, Singularity enforces security restrictions on containers by running in user space and can preserve user identification when run through batch schedulers, providing a simple method to deploy containerized workloads on multi-user HPC environments. 

Promoting best practices for HPC system deployment and use is the goal of Omnia, and we recognize that those practices vary across industry and research institutions. Omnia is developed with the entire community in mind, and we aim to provide tools that help users be productive. To this end, we recently included Singularity as an automatically installed package when deploying Slurm clusters with Omnia.

Building a Singularity-enabled cluster with Omnia

Installing a Slurm cluster with Omnia  and running a Singularity job is simple. We provide a repository of Ansible playbooks to configure a pile of metal or cloud resources into a ready-to-use Slurm cluster by applying the Slurm role in AWX or by applying the playbook on the command line.

ansible-playbook -i inventory omnia.yaml --skip-tags  kubernetes

Once the playbook has completed, users are presented with a fully functional Slurm cluster with Singularity installed. We can run a simple “hello world” example using containers directly from Singularity Hub. Here is an example Slurm submission script to run the “Hello World” example.

#!/bin/bash
#SBATCH -J singularity_test
#SBATCH -o singularity_test.out.%J
#SBATCH -e singularity_test.err.%J
#SBATCH -t 0-00:10
#SBATCH -N 1
# pull example Singularity container
singularity pull --name hello-world.sif shub://vsoch/hello-world
# execute Singularity container
singularity exec hello-world.sif cat /etc/os-release

Executing HPC applications without installing software

The “hello world” example is great, but it doesn’t demonstrate running real HPC codes. Fortunately, several hardware vendors have begun to publish containers for both HPC and AI workloads, such as Intel's oneContainer and Nvidia's NGC. Nvidia NGC is a catalog of GPU-accelerated software arranged in collections, containers, and Helm charts. This free-to-use repository has the latest builds of popular software used for deep learning and simulation, with optimizations for Nvidia GPU systems. With Singularity, we can take advantage of the NGC containers on our bare-metal Slurm cluster. Starting with the LAMMPS example on the NGC website, we demonstrate how to run a standard Lennard-Jones 3D melt experiment without having to compile all the libraries and executables. 

The input file for running this benchmark, in.lj.txt, can be downloaded from the Sandia National Laboratory site:

wget https://lammps.sandia.gov/inputs/in.lj.txt

Next make a local copy of the lammps container from NGC and name it lammps.sif

singularity build lammps.sif docker://nvcr.io/hpc/lammps:29Oct2020

This example can be executed directly from the command line using srun. This example runs 8 tasks on 2 nodes with a total of 8 GPUs:

srun --mpi=pmi2 -N2 --ntasks=8 --ntasks-per-socket=2 singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd  lammps.sif lmp -k on g 8 -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 -var x 8 -var y 8 -var z 8 -in /host_pwd/in.lj.txt

Alternatively, the following example Slurm submission script will permit batch execution with the same parameters as above, 8 tasks on 2 nodes with a total of 8 GPUs:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-socket=2
#SBATCH --time 00:10:00
set -e; set -o pipefail

# Build SIF, if it doesn't exist
if [[ ! -f lammps.sif ]]; then
    singularity build lammps.sif docker://nvcr.io/hpc/lammps:29Oct2020
fi
readonly gpus_per_node=$(( SLURM_NTASKS / SLURM_JOB_NUM_NODES  ))
echo "Running Lennard Jones 8x4x8 example on ${SLURM_NTASKS} GPUS..."

srun --mpi=pmi2 \
singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd  lammps.sif lmp -k on g ${gpus_per_node} -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 -var x 8 -var y 8 -var z 8 -in /host_pwd/in.lj.txt

Containers provide a simple solution to the complex task of building optimized software to run anywhere. Researchers are no longer required to attempt building software themselves or wait for a release of software to be made available at the site they are running. Whether running on the workstation, laptop, on-prem HPC resource, or cloud environment they can be sure they are using the same optimized version for every run.

Omnia is an open source project that makes it easy to setup a Slurm or Kubernetes environment. When we combine the simplicity of Omnia for system deployment and Nvidia NGC containers for optimized software, both researchers and system administrators can concentrate on what matters most, getting results faster.

Learn more

Learn more about Singularity containers at https://sylabs.io/singularity/. Omnia is available for download at https://github.com/dellhpc/omnia



Read Full Blog
  • PowerEdge
  • HPC
  • AMD

Tuxedo Pipeline Performance on Dell EMC PowerEdge R6525

Kihoon Yoon Kihoon Yoon

Tue, 27 Apr 2021 03:48:30 -0000

|

Read Time: 0 minutes

Overview

Gene expression analysis is as important as identifying Single Nucleotide Polymorphism (SNP), insertion/deletion (indel), or chromosomal restructuring. Ultimately, all physiological and biochemical events depend on the final gene expression products, proteins. Although most mammals have an additional controlling layer before protein expression, knowing how many transcripts exist in a system helps to characterize the biochemical status of a cell. Ideally, technology would enable us to quantify all proteins in a cell, which would significantly advance the progress of Life Science. However, we are far from achieving this.  

This blog provides the test results of one popular RNA-Seq data analysis pipeline known as the Tuxedo pipeline (1). The Tuxedo pipeline suite offers a set of tools for analyzing a variety of RNA-Seq data, including short-read mapping, identification of splice junctions, transcript, and isoform detection, differential expression, visualizations, and quality control metrics. The tested workflow is a differentially expressed gene (DEG) analysis, and the detailed steps in the pipeline are shown in Figure 1.

Figure 1.  Updated tuxedo pipeline with cuffquant step

A single-node study with AMD EPYC 7002 series (Rome) and AMD EPYC 7003 series (Milan) processors was performed on the Dell EMC PowerEdge R6525 server. The configuration of the test system is summarized in Table 1.

Table 1.  Tested compute node configuration

Dell EMC PowerEdge R6525

CPU                         Tested AMD Milan:
                            2x 7763 (Milan), 64 Cores, 2.45 GHz – 3.5 GHz Base-Boost, TDP 280 W, 256 MB L3 Cache
                            2x 7713 (Milan), 64 Cores, 2.0 GHz – 3.7 GHz Base-Boost, TDP 225 W, 256 MB L3 Cache
                            7543 (Milan), 32 Cores, 2.8 GHz – 3.7 GHz Base-Boost, TDP 225 W, 256 MB L3 Cache
                            Tested AMD Rome:
                            7702 (Rome), 64 Cores, 2.0 GHz – 3.35 GHz Base-Boost, TDP 200 W, 256 MB L3 Cache
RAM                         DDR4 256 GB (16 GB x 16) 3200 MT/s
Operating system            RHEL 8.3 (4.18.0-240.el8.x86_64)
Interconnect                Mellanox InfiniBand HDR100
Filesystem                  Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage
BIOS system profile         Performance Optimized
Logical processor           Disabled
Virtualization technology   Disabled
tophat                      2.1.1
bowtie2                     2.2.5
R                           3.6
bioconductor cummerbund     2.26.0

A performance study of an RNA-Seq pipeline is not trivial because the nature of the workflow requires input files that are non-identical yet similar in size. Hence, 185 RNA-Seq paired-end read datasets were collected from a public data repository. All the read files contain around 25 Million Fragments (MF) and have similar read lengths. The samples for each test were randomly selected from the pool of 185 paired-end read files. Although these test data have no biological meaning, their high level of noise puts the tests in a worst-case scenario.

Performance Evaluation

Throughput test - Single pipeline with more than two samples, biological, and technical duplicates

Typical RNA-Seq studies consist of multiple samples, sometimes hundreds of different samples: normal versus disease, or untreated versus treated. These samples tend to have a high level of noise for biological reasons; hence, the analysis requires a rigorous data preprocessing procedure.

Varying numbers of samples were processed, with different RNA-Seq data selected from the 185 paired-end read datasets, to see how much data a single node can process. Typically, the runtime of the Tuxedo pipeline increases with the number of samples. However, as shown in the figure below, the runtimes of the two-sample tests with the 7713 are higher than the runtimes of the four-sample tests, and the standard error from five repeated runs does not overlap with the four- and eight-sample results. Interference from other users may explain this large variance; the current testing environment, especially a shared file system designed for large capacity, is not ideal for a Next Generation Sequencing (NGS) data analysis benchmark.

Figure 2.  Runtime comparisons among various AMD EPYC 7003 Series processors: Standard error is estimated from an estimated standard deviation based on a sample (STDDEV.S function in Excel)

The eight-sample test results show that the AMD Milan processors perform better than the Rome processor (7702) under a higher workload.

Conclusion

Many tests are still required to obtain a better insight from the AMD Milan processors for the NGS data analysis area. Unfortunately, the tests could not exceed eight samples due to storage limitations. However, there seems to be plenty of room for a higher throughput that processes more than eight samples together. AMD Milan 7763 performed 20% better than AMD Rome 7702. AMD Milan 7713 performed 18% better in eight sample tests for the Tuxedo pipeline as described in Figure 2.

Read Full Blog
  • NVIDIA
  • PowerEdge

Accelerating HPC Workloads with NVIDIA A100 NVLink on Dell PowerEdge XE8545

Savitha Pareek Deepthi Cherlopalle Frank Han Savitha Pareek Deepthi Cherlopalle Frank Han

Tue, 13 Apr 2021 14:25:31 -0000

|

Read Time: 0 minutes

NVIDIA A100 GPU

Three years after launching the Tesla V100 GPU, NVIDIA recently announced its latest  data center GPU A100, built on the Ampere architecture.  The A100 is available in two form factors, PCIe and SXM4, allowing GPU-to-GPU communication over PCIe or NVLink. The NVLink version is also known as the A100 SXM4 GPU and is available on the HGX A100 server board. 

As you’d expect, the Innovation Lab tested the performance of the A100 GPU in a new platform. The new PowerEdge XE8545 4U server from Dell Technologies supports these GPUs with the NVLink SXM4 form factor and dual-socket AMD 3rd generation EPYC CPUs (codename Milan). This platform supports PCIe Gen 4 speed, up to 10 local drives, and up to 16 DIMM slots running at 3200 MT/s. Milan CPUs are available with up to 64 physical cores per CPU. 

The PCIe version of the A100 can be housed in the PowerEdge R7525, which also supports AMD EPYC CPUs, up to 24 drives, and up to 16 DIMM slots running at 3200MT/s.  This blog compares the performance of the A100-PCIe system to the A100-SXM4 system. 

Figure 1: PowerEdge XE8545 Server

A previous blog discussed the performance of the NVIDIA A100-PCIe GPU compared to its predecessor NVIDIA Tesla V100-PCIe GPU in the PowerEdge R7525 platform. 

The following table shows the specifications of the NVIDIA A100 and V100 GPUs.

Table 1: NVIDIA A100 and V100 GPUs with PCIe and SXM4 form factors

Form factor                PCIe                          SXM (NVIDIA NVLink)
                           A100          V100            A100          V100
GPU architecture           Ampere        Volta           Ampere        Volta
GPU memory                 40 GB         32 GB           40 GB         32 GB
GPU memory bandwidth       1555 GB/s     900 GB/s        1555 GB/s     900 GB/s
Peak FP64                  9.7 TFLOPS    7 TFLOPS        9.7 TFLOPS    7.8 TFLOPS
Peak FP64 Tensor Core      19.5 TFLOPS   N/A             19.5 TFLOPS   N/A
GPU base clock             765 MHz       1230 MHz        1095 MHz      1290 MHz
GPU boost clock            1410 MHz      1380 MHz        1410 MHz      1530 MHz
NVLink speed               600 GB/s      N/A             600 GB/s      300 GB/s
Max power consumption      250 W         250 W           400 W         300 W

 From Table 1, we see that the A100 offers 42 percent improved memory bandwidth and 20 to 30 percent higher double precision FLOPS when compared to the Tesla V100 GPU. While the A100-PCIe GPU consumes the same amount of power as the V100-PCIe GPU, the NVLink version of the A100 GPU consumes 25 percent more power than the V100 GPU.  

How are the GPUs connected in the PowerEdge servers?

An understanding of the server architecture is helpful in determining the behavior of any application. The PowerEdge XE8545 server is an accelerator optimized server with four A100-SMX4 GPUs connected with third generation NVLink, as shown in the following figure.

Figure 2:  PowerEdge XE8545 CPU-GPU connectivity                     

In the A100 GPU, each NVLink link carries four signal pairs running at 50 Gbit/s per pair in each direction, that is, 25 GB/s per direction (50 GB/s bidirectional) per link. The number of NVLink links increases from six in the V100 GPU to twelve in the A100 GPU, yielding 600 GB/s of total bidirectional bandwidth (12 links x 50 GB/s). Workloads that can take advantage of the higher GPU-to-GPU communication bandwidth can benefit from the NVLink links in the PowerEdge XE8545 server. 

As shown in the following figure, the PowerEdge R7525 server can accommodate up to three PCIe-based GPUs; however the configuration chosen for this evaluation used two A100-PCIe GPUs. With this option, the GPU-to-GPU communication must flow through the AMD Infinity Fabric inter-CPU links.

Figure 3:  PowerEdge R7525 CPU-GPU connectivity

Testbed details

The following table shows the tested configuration details: 

Table 2:  Test bed configuration details

Server

PowerEdge XE8545

PowerEdge R7525

Processor

Dual AMD EPYC 7713, 64C, 2.8 GHz

Memory

512 GB

(16 x 32 GB @ 3200 MT/s)

1024 GB

(16 x 64 GB @ 3200 MT/s)

Height of system

4U

2U

GPUs

4 x NVIDIA A100 SXM4 40 GB

2 x NVIDIA A100 PCIe 40 GB

Operating system

Kernel

Red Hat Enterprise Linux release 8.3 (Ootpa)

4.18.0-240.el8.x86_64

BIOS settings

Sysprofile=PerfOptimized

LogicalProcessor=Disabled

NumaNodesPerSocket=4

CUDA Driver

CUDA Toolkit

450.51.05

11.1

GCC

9.2.0

MPI

OpenMPI - 4.0

The following table lists the version of HPC application that was used for the benchmark evaluation:

Table 3: HPC Applications used for the evaluation

Benchmark

Details

HPL

xhpl_cuda-11.0-dyn_mkl-static_ompi-4.0.4_gcc4.8.5_7-23-20

HPCG

xhpcg-3.1_cuda_11_ompi-3.1

GROMACS

v2021

NAMD

Git-2021-03-02_Source

LAMMPS

29Oct2020 release 

Benchmark evaluation

High Performance Linpack

High Performance Linpack (HPL) is a standard HPC system benchmark that is used to measure the computing power of a server or cluster. It is also used as a reference benchmark by the TOP500 org to rank supercomputers worldwide. HPL for GPU uses double precision floating point operations.  There are a few parameters that are significant for the HPL benchmark, as listed below:

  • N is the problem size provided as input to the benchmark and determines the size of the linear matrix that HPL solves. For a GPU system, the highest HPL performance is obtained when the problem size uses as much of the GPU memory as possible without exceeding it (see the sizing sketch after this list). For this study, we used HPL compiled with NVIDIA libraries as listed in Table 3.
  • NB is the block size which is used for data distribution. For this test configuration, we used an NB of 288.  
  • PxQ is the process grid; the product P x Q is equal to the total number of GPUs in the system.  
  • Rpeak is the theoretical peak of the system. 
  • Rmax is the maximum measured performance achieved on the system. 
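
The following is a minimal sizing sketch for picking N under the rule above; the 95 percent target and decimal-GB assumption are illustrative choices, not the exact values used for the published runs.

# Hypothetical HPL problem-size calculation: pick N so that the matrix (8*N*N bytes, double
# precision) fills ~95% of total GPU memory, then round down to a multiple of the block size NB.
GPU_MEM_BYTES=$((4 * 40 * 1000000000))   # 4x A100-40GB, decimal GB assumed
NB=288
N=$(awk -v m="$GPU_MEM_BYTES" -v nb="$NB" 'BEGIN {
  n = sqrt(0.95 * m / 8)        # largest N that fits in 95% of memory
  print int(n / nb) * nb        # round down to a multiple of NB
}')
echo "Suggested HPL problem size N = $N"   # roughly 137,000 for this configuration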

Figure 4: HPL Performance on the PowerEdge R7525 and XE8545 with NVIDIA A100-40 GB 

  

Figure 5: HPL Power Utilization on the PowerEdge XE8545 with four NVIDIA A100 GPUs and R7525 with two NVIDIA A100 GPUs

 From Figure 4 and Figure 5, we can make the following observations:

  • SXM4 vs PCIe: At 1 GPU, the NVIDIA A100-SXM4 GPU outperforms the A100-PCIe by 11 percent. The higher SXM4 GPU base clock frequency is the predominant factor contributing to the additional performance over the PCIe GPU.    
  • Scalability: The PowerEdge XE8545 server with four NVIDIA A100-SXM4-40GB GPUs delivers 3.5 times higher HPL performance compared to one NVIDIA A100-SXM4-40GB GPU. On the other hand, two A100-PCIe GPUs are 1.94 times faster than one on the R7525 platform. The A100 GPUs scale well on both platforms for the HPL benchmark.  
  • Higher Rpeak: The HPL code on A100 GPUs uses the new double-precision Tensor cores, so the theoretical peak for each card is 19.5 TFlops, as opposed to 9.7 TFlops. 
  • Power: Figure 5 shows power consumption of a complete HPL run with PowerEdge XE8545 using 4 x A100-SXM4 GPUs and PowerEdge R7525 using 2 x A100-PCIe GPUs. This was measured with iDRAC commands, and the peak power consumption for XE8545 is 2877 Watts, while peak power consumption for R7525 is 1206 Watts.

High Performance Conjugate Gradient  

The TOP500 list has incorporated the High Performance Conjugate Gradient (HPCG) results as an alternate metric to assess system performance.

 

Figure 6: HPCG Performance on the PowerEdge R7525 and PowerEdge XE8545 Servers

Unlike HPL, HPCG performance depends heavily on the memory system and network performance when we go beyond one server. Because both the PCIe and SXM4 form factors of the A100 GPUs have the same memory bandwidth, there is no variation in the performance at a single node and HPCG performance scales well on both servers.

GROMACS

The following figure shows the performance results for GROMACS, a molecular dynamics application, on the PowerEdge R7525 and XE8545 servers. GROMACS 2021.0 was compiled with CUDA compilers and Open-MPI, as shown in Table 3.

Figure 7: GROMACS performance on the PowerEdge R7525 and PowerEdge XE8545 servers 

The GROMACS build included thread MPI (built in with the GROMACS package). Performance results are presented using the ns/day metric. For each test, the performance was optimized by varying the number of MPI ranks and threads, the number of PME ranks, and the nstlist value to obtain the best result.
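
As an illustration of that tuning space, a minimal mdrun sketch with thread-MPI is shown below; the rank, thread, and nstlist values are placeholders rather than the tuned settings behind Figure 7.

# Hypothetical GROMACS run with the built-in thread-MPI: 4 ranks x 8 OpenMP threads, one
# dedicated PME rank, nonbonded/PME/bonded work offloaded to GPUs, and a larger nstlist.
gmx mdrun -ntmpi 4 -ntomp 8 -npme 1 -nb gpu -pme gpu -bonded gpu -nstlist 100 -s water_1536.tpr -deffnm water_1536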

With one GPU in the test, the performance of the SXM4-based XE8545 server is similar to that of the PCIe-based R7525. With two GPUs, the XE8545 performs up to 28 percent better than the R7525. Datasets like Water 1536 and Water 3072 demand more GPU-to-GPU communication, which is where the SXM4 form factor shows its roughly 28 percent advantage. On the other hand, for datasets like LignoCellulose 3M, the two-GPU R7525 achieves the same per-GPU performance as the XE8545, but with the lower-power 250 W GPU, making it the more efficient solution.

 LAMMPS

The following figure shows the performance results for LAMMPS, a molecular dynamics application, on the PowerEdge R7525 and XE8545 servers. The code was compiled with the KOKKOS package to run efficiently on NVIDIA GPUs, and Lennard Jones is the dataset that was tested with Timesteps/s as the metric for comparison.

Figure 8: LAMMPS performance on the PowerEdge R7525 and PowerEdge XE8545 servers

With one GPU in the test, the performance of the SXM4-based XE8545 server is 13 percent higher than the PCIe-based R7525, and with two GPUs, a 23 percent improvement was measured. The PowerEdge XE8545 is at an advantage because its GPUs can communicate with each other over NVLink without CPU intervention, whereas the R7525 server with two GPUs is limited by its GPU-to-GPU communication path. The higher clock rate of the SXM4 A100 GPU also contributes to the better performance. 

Conclusion

In this blog, we discussed the performance of NVIDIA A100 GPUs on the PowerEdge R7525 Server and the PowerEdge XE8545 Server, which is the new addition from Dell Technologies. The A100 GPU has 42 percent more memory bandwidth and higher double precision FLOPs compared to its predecessor, the V100 series GPU. For workloads which demand more GPU-to-GPU communication, the PowerEdge XE8545 server is an ideal choice. For data centers where space and power are limited, the PowerEdge R7525 server may be the right fit. The overall performance of PowerEdge XE8545 Server with four A100-SXM4 GPUs is 1.5 to 2.3 times faster than the PowerEdge R7525 server with two A100-PCIe GPUs. 

In the future, we intend to evaluate the A100-80GB GPUs and NVIDIA A40 GPUs that will be available this year. We also plan to focus on a multi-node performance study with these GPUs.

Please contact your Dell sales specialist about the HPC and AI Innovation Lab if you would like to evaluate these GPU servers.  

Read Full Blog
  • HPC
  • AMD

BWA-GATK Pipeline Performance on Dell EMC PowerEdge R6525 Server

Kihoon Yoon Kihoon Yoon

Tue, 30 Mar 2021 18:34:08 -0000

|

Read Time: 0 minutes

Overview

We have been speculating that AMD Milan, whose Zen3 cores allow more cores to share the same L3 cache, could perform better for Next Generation Sequencing (NGS) applications. Compared with its predecessor, AMD EPYC Rome, the number of cores sharing an L3 cache doubles from 4 to 8 for the 64-core processor model. In addition, the L1 and L2 caches are upgraded with new prefetchers, and memory bandwidth is improved.

Because Milan and Rome share the same SP3 socket, the Dell EMC PowerEdge R6525 was selected for this case study, which helps minimize system-to-system variations. The test server configuration is summarized in Table 1.

Table 1.  Tested compute node configuration

Server: Dell EMC PowerEdge R6525
CPU (tested AMD Milan):
  2x 7763 (Milan), 64 cores, 2.45 GHz – 3.5 GHz Base-Boost, TDP 280 W, 256 MB L3 cache
  2x 7713 (Milan), 64 cores, 2.0 GHz – 3.7 GHz Base-Boost, TDP 225 W, 256 MB L3 cache
  7543 (Milan), 32 cores, 2.8 GHz – 3.7 GHz Base-Boost, TDP 225 W, 256 MB L3 cache
CPU (tested AMD Rome):
  7702 (Rome), 64 cores, 2.0 GHz – 3.35 GHz Base-Boost, TDP 200 W, 256 MB L3 cache
RAM: DDR4 256 GB (16 GB x 16), 3200 MT/s
OS: RHEL 8.3 (4.18.0-240.el8.x86_64)
Network: Mellanox InfiniBand HDR100
Filesystem: Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage
BIOS System Profile: Performance Optimized
Logical Processor: Disabled
Virtualization Technology: Disabled
BWA: 0.7.15-r1140
Sambamba: 0.7.0
Samtools: 1.6
GATK: 3.6-0-g89b7209

The test data was chosen from one of Illumina’s Platinum Genomes: ERR194161, sequenced on an Illumina HiSeq 2000, submitted by Illumina, and available from EMBL-EBI. The DNA identifier for this individual is NA12878. The description on the linked website lists this sample at >30x depth of coverage, and it actually reaches ~53x.

Performance Evaluation

Characterizing steps in BWA-GATK Pipeline

In a typical BWA-GATK pipeline, there are multiple steps, and each step uses applications that behave quite differently. As shown in Table 2, the applications in some steps do not support multi-threading. These steps are problematic because there are only a few ways to improve their performance.

Table 2.  The steps in BWA-GATK pipeline and tools

Steps                          | Applications                 | Multi-threading support
Mapping & Sorting              | BWA, samtools, sambamba      | Yes
Mark Duplicates                | Sambamba                     | Yes
Generate Realigning Targets    | GATK RealignerTargetCreator  | Yes
Insertion/Deletion Realigning  | GATK IndelRealigner          | No
Base Recalibration             | GATK BaseRecalibrator        | Yes
HaplotypeCaller                | GATK HaplotypeCaller         | Yes
Genotype GVCFs                 | GATK GenotypeGVCFs           | Yes
Variant Recalibration          | GATK VariantRecalibrator     | No
Apply Variant Recalibration    | GATK ApplyRecalibration      | No

The single-threaded applications, especially in the Variant Recalibration and Apply Variant Recalibration steps, show no runtime variation because their algorithms are deterministic and their inputs are small. Hence, these two steps are not reported in Figure 1. Genotype GVCFs is also omitted from Figure 1 for a similar reason, although it supports multi-threading. The first step, Mapping & Sorting, scales as the number of cores increases (Figure 1 (a)).

Burrows-Wheeler Aligner (BWA) is one of the most popular short-read sequence aligners. BWA scales well up to 32 cores; beyond that, CPU usage drops dramatically and the runtime improvement becomes marginal. Using more than 80 cores for this step wastes resources.
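
As a rough sketch of what the Mapping & Sorting step looks like (hypothetical file names and thread counts; the production pipeline pipes more tools together, as discussed in the throughput section below):

# align paired-end reads and sort the output in one piped step, avoiding an intermediate SAM file
bwa mem -t 32 -R '@RG\tID:ERR194161\tSM:NA12878' ref.fa R1.fastq.gz R2.fastq.gz | samtools sort -@ 32 -o NA12878.sorted.bam -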

Sambamba, which is compatible with Picard, is used for marking duplicate reads. Its behavior is plotted in Figure 1 (b). Because of its highly parallel design, memory consumption grows as more hash tables are created for additional threads. Remarkably, a 50x human whole genome sequence (WGS) is not large enough for this well-designed software to make use of more than 13 cores.

After the Mark Duplicates step, the Genome Analysis Toolkit (GATK), written in Java, dominates the remaining runtime and produces the final results. These steps do not scale at all, as shown in Figure 1 (c), (d), (e), and (f). A better approach to handle this poor scaling behavior in multi-core and multi-socket environments will be discussed in future work.

Figure 1.  Runtimes of the 7702 with various numbers of cores for each step. Milan CPUs show similar behavior.

Single Sample Performance

Socket to Socket Comparison

This test is not an entirely fair comparison because, except on the 32-core 7543, the majority of steps cannot take advantage of all the available cores. However, the comparison helps decide which CPU could be best for the throughput test.

Table 3 summarizes the overall runtimes for the BWA-GATK pipeline, and it is hard to say which CPU is better in terms of total runtime. Many more tests would be required to resolve the performance differences in the GATK steps. Also, note that the 7502 and 7402 results come from previous tests run in a different environment.

The Mapping & Sorting step is the only one where the true performance variation across the different CPUs in Table 3 can be seen. A rough estimate of the performance improvement from the 7702 to the 7763 is 7 percent, while the gain from the 7702 to the 7713 is 5 percent.

Surprisingly, the Base Recalibration step shows a similar pattern to the Mapping & Sorting step, with 8 percent and 3 percent improvements, respectively.

Table 3.  BWA-GATK performance comparisons between Milan and Rome. The number of cores used for the test is parenthesized.

Steps (runtime in hours; cores used in parentheses) | AMD 7763 64c 2.45GHz | AMD 7713 64c 2.0GHz | AMD 7543 32c 2.8GHz | AMD 7702 64c 2.0GHz | AMD 7502 32c 2.5GHz | AMD 7402 24c 3.0GHz
Mapping & Sorting              | 2.44 (64)  | 2.49 (64)  | 3.69 (32) | 2.63 (64) | 4.68 (32) | 5.73 (24)
Mark Duplicates                | 1.07 (13)  | 1.10 (13)  | 1.01 (13) | 1.01 (13) | 0.93 (13) | 0.94 (13)
Generate Realigning Targets    | 0.55 (32)  | 0.56 (32)  | 0.50 (32) | 0.58 (32) | 0.45 (32) | 0.44 (32)
Insertion/Deletion Realigning  | 8.73 (1)   | 9.13 (1)   | 7.73 (1)  | 8.78 (1)  | 8.30 (1)  | 8.21 (1)
Base Recalibration             | 2.27 (32)  | 2.38 (32)  | 2.17 (32) | 2.46 (32) | 2.52 (32) | 2.67 (24)
HaplotypeCaller                | 10.20 (16) | 10.57 (16) | 9.15 (16) | 9.02 (16) | 9.33 (16) | 9.05 (16)
Genotype GVCFs                 | 0.02 (32)  | 0.02 (32)  | 0.01 (32) | 0.02 (32) | 0.01 (32) | 0.01 (24)
Variant Recalibration          | 0.31 (1)   | 0.20 (1)   | 0.17 (1)  | 0.12 (1)  | 0.21 (1)  | 0.13 (1)
Apply Variant Recalibration    | 0.01 (1)   | 0.01 (1)   | 0.01 (1)  | 0.01 (1)  | 0.01 (1)  | 0.01 (1)
Total Runtime (hours)          | 25.59      | 26.47      | 24.44     | 24.64     | 26.46     | 27.25

Multiple Sample Performance - Throughput

A typical way of running an NGS pipeline is to process multiple samples per compute node and to use multiple compute nodes to maximize throughput. However, for this study the tests were performed on a single compute node because of the limited number of servers available at the time.

The current pipeline invokes a large number of pipe operations in the first step to minimize the writing of intermediate files. Although this saves a day of runtime and lowers storage usage significantly, the cost of invoking the pipes is quite heavy, which limits the number of samples that can be processed concurrently. Typically, a process fails silently when there are not enough resources left to start an additional process.

However, the failures experienced during this study are quite different from previous observations. Of 10 pipelines started on the R6525 with 2x 7763, only 6 on average were sustained with the 50x human WGS data; four pipelines failed with broken-pipe errors, which points to a file operation. The BeeGFS storage used for the test is designed for high capacity, with a theoretical sequential write bandwidth of 25 GB/s; roughly 16 GB/s is achievable when the shared storage is not under heavy load from other users. This is not an ideal environment for benchmarking, but the results are quite helpful for seeing what the performance of these systems looks like in real life.

As shown in Table 4, the maximum number of samples that can be processed at the same time is around 4 or 5, and a throughput of ~4.79 50x human whole genomes per day is achievable in the current environment.

Table 4.  Throughput test for Milan 7763

Steps (runtime in hours)       | 1 Sample | 2 Samples | 4 Samples | 6 Samples | 10 Samples
Number of Samples Failed       | 0        | 0         | 0         | 1         | 4
Mapping & Sorting              | 2.44     | 2.91      | 4.33      | 5.86      | 8.33
Mark Duplicates                | 1.07     | 1.40      | 1.69      | 1.31      | 5.51
Generate Realigning Targets    | 0.55     | 0.88      | 1.77      | 0.50      | 2.07
Insertion/Deletion Realigning  | 8.73     | 8.97      | 8.92      | 8.92      | 9.70
Base Recalibration             | 2.27     | 2.50      | 2.79      | 3.26      | 3.67
HaplotypeCaller                | 10.20    | 10.57     | 10.27     | 9.91      | 9.96
Genotype GVCFs                 | 0.02     | 0.11      | 0.10      | 0.10      | 0.15
Variant Recalibration          | 0.31     | 0.25      | 0.20      | 0.21      | 0.36
Apply Variant Recalibration    | 0.01     | 0.02      | 0.01      | 0.01      | 0.03
Total Runtime (hours)          | 25.59    | 27.62     | 30.08     | 30.08     | 39.79
Genomes per day                | 0.94     | 1.74      | 4.79      | 3.99      | 3.62

Conclusion

The field of NGS data analysis has been moving fast in terms of data growth and data variety. However, the community has not done much work to adopt newly available technologies such as accelerators. Instead of improving the quality of the codes, the community is left analyzing the data without multi-threading, since GATK version 4 and later no longer supports multi-threading, while the number of cores in a CPU keeps increasing rapidly.

It is time for users to think about how this problem can be tackled. One simple way around it is data-level parallelization. Although deciding where to split the data is hard, it is certainly tractable with careful interventions in an existing BWA-GATK pipeline and without diluting statistical power. If each smaller data chunk goes through its own pipeline on its own cores and the results are merged at the end, better performance on a single sample could be possible; see the sketch below. That performance gain could translate into higher throughput if the overall runtime drops significantly.
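
A minimal sketch of the idea, assuming the GATK 3.x command-line syntax used in this study and hypothetical file names (the hard part, choosing split boundaries without losing variants near them, is not addressed here):

# run HaplotypeCaller per chromosome in parallel on one node, then merge the per-chromosome GVCFs
# chromosome names depend on the reference (e.g. "chr1" vs "1")
for CHR in $(seq 1 22) X Y; do
  java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fa -I NA12878.recal.bam \
       -L ${CHR} --emitRefConfidence GVCF -o NA12878.${CHR}.g.vcf &
done
wait
java -jar GenomeAnalysisTK.jar -T CombineGVCFs -R ref.fa \
     $(for CHR in $(seq 1 22) X Y; do echo --variant NA12878.${CHR}.g.vcf; done) -o NA12878.g.vcf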

Nonetheless, the Milan 7763 and 7713 are excellent candidates to cover both the current multi-threading-based pipelines and future data-level-parallel pipelines, thanks to their higher core counts.

 

Read Full Blog
  • HPC
  • AMD

AMD Milan - BIOS Characterization for HPC

Puneet Singh Savitha Pareek Tarun Singh Ashish K Singh

Tue, 30 Mar 2021 18:23:11 -0000

|

Read Time: 0 minutes

With the release of the AMD EPYC 7003 Series Processors (architecture codenamed "Milan"), Dell EMC PowerEdge servers have now been upgraded to support the new features. This blog outlines the Milan Processor architecture and the recommended BIOS settings to deliver optimal HPC Synthetic benchmark performance. Upcoming blogs will focus on the application performance and characterization of the software applications from various scientific domains such as Weather Science, Molecular Dynamics, and Computational Fluid Dynamics.

AMD Milan with Zen3 cores is the successor of AMD's high-performance second-generation server microprocessor (architecture codenamed "Rome"). It supports up to 64 cores at 280 W TDP and 8 DDR4 memory channels at speeds up to 3200 MT/s.

Architecture 

As with AMD Rome, AMD Milan’s 64 core Processor model has 1 I/O die and 8 compute dies (also called CCD or Core Complex Die) – OPN 32 core models may have 4 or 8 compute dies. Milan Processors have upgrades to the Cache (including new prefetchers at both L1 and L2 caches) and Memory Bandwidth which is expected to improve performance of applications requiring higher memory bandwidth.

Unlike Naples and Rome, Milan changes the arrangement of its CCDs. Each CCD now features up to 8 cores with a unified 32 MB L3 cache, which can reduce cache access latency within the compute chiplets. Milan Processors can expose each CCD as a NUMA node by setting the “L3 cache as NUMA Domain” option (from the iDRAC GUI) or BIOS.ProcSettings.CcxAsNumaDomain (using the racadm CLI) to “Enabled”. Therefore, a dual-socket 64-core Milan system with 8 CCDs per Processor exposes 16 NUMA domains per system in this setting. Here is the logical representation of the core arrangement with NUMA Nodes per socket = 4 and CCD as NUMA = Disabled.

Figure 1: Linear core enumeration on a dual-socket system, 64c per socket, NPS4 configuration on an 8 CCD Processor model

As with AMD Rome, AMD Milan Processors support the AVX256 instruction set allowing 16 DP FLOP/cycle.

BIOS Options Available on AMD Milan and Tuning

Processors from both Milan and Rome generations are socket compatible, so the BIOS Options are similar across these Processor generations. Server details are mentioned in Table 1 below.

Table 1: Testbed hardware and software details

Server: Dell EMC PowerEdge 2-socket servers (with AMD Milan Processors) and Dell EMC PowerEdge 2-socket servers (with AMD Rome Processors)

Milan OPN | Cores/Socket | Frequency (Base-Boost) | TDP   | L3 Cache
7763      | 64           | 2.45 GHz – 3.5 GHz     | 280 W | 256 MB
7713      | 64           | 2.0 GHz – 3.7 GHz      | 225 W | 256 MB
7543      | 32           | 2.8 GHz – 3.7 GHz      | 225 W | 256 MB

Rome OPN  | Cores/Socket | Frequency (Base-Boost) | TDP   | L3 Cache
7H12      | 64           | 2.6 GHz – 3.3 GHz      | 280 W | 256 MB
7702      | 64           | 2.0 GHz – 3.35 GHz     | 200 W | 256 MB
7542      | 32           | 2.9 GHz – 3.4 GHz      | 225 W | 128 MB

Operating System: RHEL 8.3 (4.18.0-240.el8.x86_64) on Milan; RHEL 8.2 (4.18.0-193.el8.x86_64) on Rome
Memory: DDR4 256 GB (16 GB x 16), 3200 MT/s
BIOS / CPLD: 2.0.3 / 1.1.12 on Milan; 1.1.7 on Rome
Interconnect: Mellanox HDR 200 (4X HDR) on Milan; Mellanox HDR 100 on Rome

 The following BIOS options were explored –

  • BIOS.SysProfileSettings.SysProfile:  This field sets the System Profile to Performance Per Watt (OS), Performance, or Custom mode. When set to a mode other than Custom, BIOS will set each option accordingly. When set to Custom, you can change setting of each option. Under Custom mode when C state is enabled, Monitor/Mwait should also be enabled.
  • BIOS.ProcSettings.L1StridePrefetcher: When set to Enabled, the Processor provides additional fetch to the data access for an individual instruction for performance tuning by controlling the L1 stride prefetcher setting.
  • BIOS.ProcSettings.L2StreamHwPrefetcher: When set to Enabled, the Processor provides advanced performance tuning by controlling the L2 stream HW prefetcher setting.
  • BIOS.ProcSettings.L2UpDownPrefetcher: When set to Enabled, the Processor uses the memory access pattern to determine whether to fetch the next or the previous line for all memory accesses, providing advanced performance tuning by controlling the L2 up/down prefetcher setting.
  • BIOS.ProcSettings.CcxAsNumaDomain: This field specifies that each CCD within the Processor will be declared as a NUMA Domain.
  • BIOS.MemSettings.MemoryInterleaving: When set to Auto, memory interleaving is supported if a symmetric memory configuration is installed. When set to Disabled, the system supports Non-Uniform Memory Access (NUMA) (asymmetric) memory configurations. Operating Systems that are NUMA-aware understand the distribution of memory in a particular system and can intelligently allocate memory in an optimal manner. Operating Systems that are not NUMA-aware could allocate memory to a Processor that is not local, resulting in a loss of performance. Die and Socket Interleaving should only be enabled for Operating Systems that are not NUMA-aware.

After setting the System Profile (BIOS.SysProfileSettings.SysProfile) to PerformanceOptimized, NUMA Nodes Per Socket (NPS) to 4, and the prefetchers (L1Region, L1Stream, L1Stride, L2Stream, L2UpDown) to “Enabled”, we measured the impact of the CcxAsNumaDomain and MemoryInterleaving BIOS parameters on benchmark performance. We tested the systems listed in Table 1 with the following settings.

Table 2: Combinations of CCX as NUMA domain and Memory Interleaving


           | CCX as NUMA Domain | Memory Interleaving
Setting01  | Disabled           | Disabled
Setting02  | Disabled           | Auto
Setting03  | Enabled            | Auto
Setting04  | Enabled            | Disabled

With Setting01 and Setting02 (CCX as NUMA Domain = Disabled), the system exposes 8 NUMA nodes. With Setting03 and Setting04, a dual-socket server with 64-core Milan Processors exposes 16 NUMA nodes.
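
As an illustration, Setting04 could be applied through the iDRAC with the racadm CLI roughly as follows (a sketch; a BIOS configuration job and a reboot are required for the change to take effect, and option names should be verified against the installed BIOS version):

racadm set BIOS.ProcSettings.CcxAsNumaDomain Enabled
racadm set BIOS.MemSettings.MemoryInterleaving Disabled
racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW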

Table 3: hwloc-ls and numactl -H command output on the 64c server with setting01/setting02 (listed in Table 2)


Table 4: hwloc-ls and numactl -H command output on the 128-core (2x 64c) server with setting03/setting04 (listed in Table 2)

Application performance is shown in Figure 2, Figure 3 and Figure 4. In each Figure, the numbers on top of the bars represent the relative change in the application performance with respect to the application performance obtained on the 7543 Processor Model with setting04 (CCXasNUMADomain=Enabled and Memory Interleaving = Disabled - green bar).

Figure 2: Relative difference in the performance of HPL by processor and BIOS settings mentioned in Table 1 and Table 2. 


Figure 3: Relative difference in the performance of HPCG by processor and BIOS settings mentioned in Table 1 and Table 2. 


Figure 4: Relative difference in the performance of STREAM by processor and BIOS settings mentioned in Table 1 and Table 2. 

HPL delivers the best performance numbers on setting02 with 82-93% efficiency depending on Processor Model, whereas STREAM and HPCG deliver better performance with setting04.

STREAM TRIAD tests generate best performance numbers at ~378 GB/s memory bandwidth across all of the 64 and 32 core Processor Models mentioned in Table 1 with efficiency up to 90%.

In Figure 4, the STREAM TRIAD performance numbers were measured by undersubscribing the server, using only 16 cores. Figure 5 compares the performance obtained using all available cores with that obtained using 16 cores per system. The numbers on top of the orange bars show the relative difference.

Figure 5: Relative difference in the memory bandwidth.

From Figure 5, we observed that with 16 cores the STREAM TRIAD performance numbers were ~3-4% higher than those measured by subscribing all available cores.
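
One way to run such an undersubscribed STREAM TRIAD measurement is to spread 16 OpenMP threads evenly across the NUMA domains (a sketch; the binary name follows the stock STREAM Makefile, and the placement policy is an assumption, not necessarily the exact pinning used here):

export OMP_NUM_THREADS=16
export OMP_PLACES=cores
export OMP_PROC_BIND=spread   # spread the 16 threads evenly across the NUMA domains to cover all memory channels
./stream_c.exe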

We carried out NUMA bandwidth tests using setting02 and setting04 from Table 2. With setting02, the system exposes a total of 8 NUMA nodes, while with setting04 it exposes 16 NUMA nodes with 8 cores per NUMA node. In Figure 6 and Figure 7, a NUMA node is represented as “c” and a memory node as “m”; for example, c0m0 represents NUMA node 0 and memory node 0. The best bandwidth numbers were obtained by varying the number of threads.

Figure 6: Local and remote NUMA memory bandwidth with CCXasNUMADomain=Disabled

Figure 7: Local and remote NUMA memory bandwidth with CCXasNUMADomain=Enabled

We observed that the optimal intra-socket local memory bandwidth was obtained with 2 threads per NUMA node using setting02 on both the 64-core and 32-core processor models. In Figure 6, with setting02 (Table 2), the intra-socket local memory bandwidth at 2 threads per NUMA node can be up to 79% higher than the inter-socket remote memory bandwidth. With setting02 (Figure 6) we also get at least 96% higher intra-socket local memory bandwidth per NUMA domain than with setting04 (Figure 7).

Impact of new Prefetch options

Milan introduces two new prefetchers for L1 cache and one for L2 Cache with a total of five prefetcher options which can be configured using BIOS. We tested combinations listed in Table 5 by keeping L1 Stream and L2 Stream prefetcher as Enabled.

Table 5: Cache Prefetchers


           | L1StridePrefetcher | L1RegionPrefetcher | L2UpDownPrefetcher
setting01  | Disabled           | Enabled            | Enabled
setting02  | Enabled            | Disabled           | Enabled
setting03  | Enabled            | Enabled            | Disabled
setting04  | Disabled           | Disabled           | Disabled

We found that these new prefetchers do not have significant impact on the performance of the synthetic benchmarks covered in this blog.

InfiniBand bandwidth, message rate and scalability

For multinode tests, the testbed was configured with a Mellanox HDR interconnect running at 200 Gbps, with each server using the AMD 7713 Processor Model and the Preferred IO BIOS setting set to Enabled. With setting02 (Table 2) and the prefetchers (L1Region, L1Stream, L1Stride, L2Stream, L2UpDown) set to “Enabled”, we achieved the expected linear scalability for the HPL and HPCG benchmarks.

Figure 8: Multinode scalability of HPL and HPCG with setting02 (Table 2) with 7713 Processor model, HDR200 Infiniband


We tested the message rate and the unidirectional and bidirectional InfiniBand bandwidth using the OSU benchmarks; the results are shown in Figure 9, Figure 10, and Figure 11. Except for the NUMA Nodes per socket setting, all other BIOS settings for these tests were the same as mentioned above. The OSU bidirectional and unidirectional bandwidth tests were carried out with NUMA Nodes per socket set to 2, and the message rate test was carried out with NUMA Nodes per socket set to 4. In Figure 9 and Figure 10, the numbers on top of the orange bars represent the percentage difference between the local and remote bandwidth numbers.

Figure 9: OSU bi-directional bandwidth test on AMD 7713, HDR 200 InfiniBand

Figure 10: OSU uni-directional bandwidth test on AMD 7713, HDR 200 Infiniband

For the local latency and bandwidth numbers, the MPI process was pinned to NUMA node 1 (closest to the HCA). For the remote latency and bandwidth tests, processes were pinned to NUMA node 6.
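
A local-NUMA run of the OSU bandwidth test can be launched along these lines (a sketch with hypothetical hostnames and paths; not the exact mpirun options used in this study):

# one rank per node, each pinned to NUMA node 1 (closest to the HCA); use --cpunodebind=6 for the remote case
mpirun -np 2 --host nodeA,nodeB --map-by node numactl --cpunodebind=1 --membind=1 ./osu_bw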

Figure 11: OSU Message rate and bandwidth performance on 2 and 4 nodes of 7713 Processor model

On 2 nodes using HDR200, we are able to achieve ~24 GB/s unidirectional bandwidth and message rate of 192 Million messages/second – almost double the performance numbers obtained on HDR100.

Comparison with Rome SKUs

In order to draw out performance improvement comparisons, we have selected Rome SKUs closest to their Milan counterparts in terms of hardware features such as Cache Size, TDP values, and Processor Base/Turbo Frequency.

Figure 12: HPL performance comparison with Rome Processor Models

 

Figure 13: HPCG performance comparison with Rome Processor Models

 

Figure 14: STREAM performance comparison with Rome Processor Models

For HPL (Figure 12), we observed that on the higher-end Processor Models, Milan delivers 10% better performance than Rome. As expected, memory-bandwidth-bound benchmarks such as STREAM and HPCG (Figure 13 and Figure 14) gain 6-16% and 13-32% in performance, respectively, on Milan over the Rome Processor Models covered in this blog.

Summary and Future Work  

Milan-based servers show the expected performance gains, especially for the memory-bandwidth-bound synthetic HPC benchmarks covered in this blog. Configuring the BIOS options correctly is important to get the best performance out of the system. Hyper-Threading should be disabled for general-purpose HPC systems, and its benefits should be tested, and the feature enabled, only where appropriate for workloads not covered in this blog.

Check back soon for subsequent blogs that describe application performance studies on our Milan Processor based cluster.

Read Full Blog
  • HPC
  • AMD
  • manufacturing

Siemens’ Simcenter STAR-CCM+ Performance with AMD EPYC 7003 Series Processors

Joshua Weage

Thu, 18 Mar 2021 16:39:54 -0000

|

Read Time: 0 minutes

Introduction

This blog discusses the performance of Siemens’ Simcenter STAR-CCM+ on the Dell EMC Ready Solutions for HPC Digital Manufacturing with AMD EPYC 7003 series processors. This Ready Solution was designed and configured specifically for digital manufacturing workloads, where computer-aided engineering (CAE) applications are critical for virtual product development. It uses a flexible building block approach to HPC system design, where individual building blocks can be combined to build HPC systems that are optimized for customer-specific workloads and use cases.

The Dell EMC Ready Solutions for HPC Digital Manufacturing is one of many solutions in the Dell EMC HPC solution portfolio. Please visit www.dellemc.com/hpc for a comprehensive overview of the HPC solutions offered by Dell EMC.

Benchmark System Configuration

Performance benchmarking was performed using dual-socket Dell EMC PowerEdge servers with 7002 and 7003 series AMD EPYC processors. All servers were populated with two processors and one DIMM per channel memory configuration. The system configurations used for the performance benchmarking are shown in Table 1 and Table 2. The BIOS configuration used for the benchmarking systems is shown in Table 3.

Table 1.  7002 Series AMD EPYC System Configuration

Server: Dell EMC PowerEdge C6525
Processor: 2x AMD EPYC 7532 32-core Processors
Memory: 16x 16 GB 3200 MT/s RDIMMs
BIOS Version: 1.4.8
Operating System: Red Hat Enterprise Linux Server release 7.6
Kernel Version: 3.10.0-957.27.2.el7.x86_64

 Table 2.  7003 Series AMD EPYC System Configuration

Server: Dell EMC PowerEdge R6525
Processors: 2x AMD EPYC 7713 64-core Processors, or 2x AMD EPYC 7543 32-core Processors
Memory: 16x 16 GB 3200 MT/s RDIMMs
BIOS Version: 2.0.1
Operating System: Red Hat Enterprise Linux Server release 8.3
Kernel Version: 4.18.0-240.el8.x86_64

Table 3.  BIOS Configuration

System Profile: Performance Optimized
Logical Processor: Disabled
Virtualization Technology: Disabled
NUMA Nodes Per Socket: 4

Software Versions

Application software versions are as described in Table 4.

Table 4.  Software Versions

Simcenter STAR-CCM+ 2020.3.1, mixed precision, with Open MPI 4

Siemens’ Simcenter STAR-CCM+ Performance

Simcenter STAR-CCM+ is a multiphysics software application used to simulate a wide range of products and designs under a variety of conditions. The benchmarks reported here mainly use the computational fluid dynamics (CFD) and heat transfer features of STAR-CCM+. CFD applications typically scale well across multiple processor cores and servers, have modest memory capacity requirements, and typically perform minimal disk I/O while solving. However, some simulations may have greater I/O demands, such as transient analysis.
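
For context, a single-server benchmark run is typically launched in batch mode along these lines (a sketch with a hypothetical simulation file name and core count; not the exact invocation used for these results):

starccm+ -np 128 -batch run benchmark_case.sim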

The benchmark cases from the standard STAR-CCM+ benchmark suite were evaluated on the systems. The benchmark results reported here are single-server performance results, with the benchmark run using all processor cores available in the server. STAR-CCM+ benchmark performance is measured using the Average Elapsed Time metric which is the average elapsed time per solver iteration. A smaller elapsed time represents better performance. Figure 1 shows the relative performance results for a selection of the STAR‑CCM+ benchmarks.

Figure 1.  Simcenter STAR-CCM+ Single Server Performance

 The results in Figure 1 are plotted relative to the performance of a single server configured with AMD EPYC 7532 processors. Larger values indicate better overall performance. These results show the performance improvement available with 7003 series AMD EPYC processors. The 32-core AMD EPYC 7543 processor provides good performance for these benchmarks. Per server, the 64-core AMD EPYC 7713 provides a significant performance advantage over the 32-core processors.

Conclusion

The results presented in this blog show that 7003 series AMD EPYC processors offer a significant performance improvement for Siemens’ Simcenter STAR-CCM+ relative to 7002 series AMD EPYC processors.


Read Full Blog
  • PowerEdge
  • HPC
  • GPU
  • AMD

HPC Application Performance on Dell EMC PowerEdge R7525 Servers with the AMD MI100 Accelerator

Frank Han Dharmesh Patel

Wed, 16 Dec 2020 17:34:42 -0000

|

Read Time: 0 minutes


Overview

The Dell EMC PowerEdge R7525 server supports the AMD MI100 GPU Accelerator. The server is a two-socket, 2U rack-based server that is designed to run complex workloads using highly scalable memory, I/O capacity, and network options. The system is based on the 2nd Gen AMD EPYC processor (up to 64 cores), has up to 32 DIMMs, and has PCI Express (PCIe) 4.0-enabled expansion slots. The server supports SATA, SAS, and NVMe drives and up to three double-wide 300 W accelerators.

The following figure shows the front view of the server:
 

Figure 1.  Dell EMC PowerEdge R7525 server

The AMD Instinct™ MI100 accelerator is one of the world’s fastest HPC GPUs available in the market. It offers innovations to obtain higher performance for HPC applications with the following key technologies:

  • AMD Compute DNA (CDNA)—Architecture optimized for compute-oriented workloads
  • AMD ROCm—An Open Software Platform that includes GPU drivers, compilers, profilers, math and communication libraries, and system resource management tools 
  • Heterogeneous-Computing Interface for Portability (HIP)—An interface that enables developers to convert CUDA code to portable C++ so that the same source code can run on AMD GPUs

This blog focuses on the performance characteristics of a single PowerEdge R7525 server with AMD MI100-32G GPUs. We present results from the general matrix multiplication (GEMM) microbenchmarks, the LAMMPS benchmarks, and the NAMD benchmarks to showcase performance and scalability.

The following table provides the configuration details of the PowerEdge R7525 system under test (SUT): 

Table 1.  SUT hardware and software configurations

Processor: AMD EPYC 7502 32-core processor
Memory: 512 GB (32 GB 3200 MT/s x 16)
Local disk: 2x 1.8 TB SSD (no RAID)
Operating system: Red Hat Enterprise Linux Server 8.2
GPU: 3x AMD MI100-PCIe-32G
Driver version: 3204
ROCm version: 3.9
Processor Settings > Logical Processors: Disabled
System profile: Performance
NUMA nodes per socket: 4
NAMD benchmark version: NAMD 3.0 ALPHA 6
LAMMPS (KOKKOS) benchmark version: LAMMPS patch_18Sep2020 + AMD patches

The following table lists the AMD MI100 GPU specifications:

Table 2.  AMD MI100 PCIe GPU specification

GPU architecture: MI100
Peak Engine Clock: 1502 MHz
Stream processors: 7680
Peak FP64: 11.5 TFLOPS
Peak FP64 Tensor DGEMM: 11.5 TFLOPS
Peak FP32: 23.1 TFLOPS
Peak FP32 Tensor SGEMM: 46.1 TFLOPS
Memory size: 32 GB
Memory ECC support: Yes
TDP: 300 W

GEMM Microbenchmarks

The GEMM benchmark is a simple, multithreaded dense matrix-to-matrix multiplication benchmark that can be used to test the performance of GEMM on a single GPU. The rocblas-bench binary compiled from https://github.com/ROCmSoftwarePlatform/rocBLAS was used to collect DGEMM and SGEMM results. The results of these tests reflect the performance of an ideal application that only runs matrix multiplication in the form of the peak TFLOPS that the GPU can deliver. Although GEMM benchmark results might not represent real-world application performance, it is still a good benchmark to demonstrate the performance capability of different GPUs.
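
The peak-versus-sustained numbers below were gathered with rocblas-bench; an invocation for square DGEMM and SGEMM problems looks roughly like the following (the matrix dimensions here are illustrative, not the exact sizes used):

# sustained double-precision GEMM (FP64)
./rocblas-bench -f gemm -r f64_r --transposeA N --transposeB N -m 8640 -n 8640 -k 8640
# sustained single-precision GEMM (FP32)
./rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 8640 -n 8640 -k 8640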

The following figure shows the observed numbers of DGEMM and SGEMM:

 

Figure 2.  DGEMM and SGEMM for both AMD MI100 peak and AMD-PCIe sustained

The results indicate:

  • In the DGEMM (double-precision GEMM) benchmark, the theoretical peak performance of the AMD MI100 GPU is 11.5 TFLOPS and the measured sustained performance is 7.9 TFLOPS. As shown in Table 2, the standard double-precision (FP64) theoretical peak and the FP64 tensor DGEMM peak are both 11.5 TFLOPS. Because most real-world HPC applications are not heavily built around DGEMM or other matrix operations, this high standard FP64 capability also boosts performance on non-matrix double-precision math.
  • For FP32 Tensor operations in the SGEMM (single-precision GEMM) benchmark, the theoretical peak performance of the AMD MI100 GPU is 46.1 TFLOPS, and the measured sustained performance is approximately 30 TFLOPS. 

The LAMMPS benchmark

The Large-Scale Atom/Molecular Massively Parallel Simulator (LAMMPS) runs threads in parallel using message-passing techniques. This benchmark measures the scalability and performance of large, parallel systems of multiple GPUs. 

The following figure shows that the KOKKOS implementation of LAMMPS scales relatively linearly as AMD MI100 GPUs are added, across four datasets: EAM, LJ, Tersoff, and ReaxFF/C.

Figure 3.  LAMMPS benchmark showing scaling of multiple AMD MI100 GPUs

The NAMD Benchmark

Nanoscale Molecular Dynamics (NAMD) is a parallel molecular dynamics system designed for simulation of large biomolecular systems. The NAMD benchmark stresses the scaling and performance aspects of the server and GPU configuration.

The following figure plots the results of the NAMD microbenchmark: 

Figure 4.   NAMD benchmark performance

Aggregate data from multiple GPU cards is reported because the Alpha builds of the NAMD 3.0 binary do not scale beyond a single accelerator. Three replica simulations were launched in parallel on the same server, one on each GPU. NAMD was CPU-bound in previous versions; version 3.0 has reduced that CPU dependence. As a result, the three-copy simulation scaled linearly, running three times faster across all datasets.

As part of the optimization, the NAMD benchmark numbers in the following figure show the relative performance difference using different numbers of CPU cores for the STMV dataset:

Figure 5.  CPU core dependency on NAMD

The AMD MI100 GPU exhibited an optimum configuration of four CPU cores per GPU.

Conclusion

The AMD MI100 accelerator delivers industry-leading performance, and it is a well-positioned performance per dollar GPU for both FP32 and FP64 HPC parallel codes.

  • FP32 applications perform well on the AMD MI100 GPU, as shown by the SGEMM, LAMMPS, and NAMD benchmark results, by using the matrix cores and native FP32 compute cores.
  • FP64 applications perform well using the AMD MI100 GPU by using native FP64 compute cores.

Next Steps

In the future, we plan to test other HPC and Deep Learning applications. We also plan to research using “Hipify” tools to port CUDA sources to HIP.

Read Full Blog
  • NVIDIA
  • PowerEdge
  • HPC
  • GPU
  • AMD

HPC Application Performance on Dell PowerEdge R7525 Servers with NVIDIA A100 GPGPUs

Savitha Pareek Ashish K Singh Frank Han

Tue, 24 Nov 2020 17:49:03 -0000

|

Read Time: 0 minutes

Overview

The Dell PowerEdge R7525 server powered with 2nd Gen AMD EPYC processors was released as part of the Dell server portfolio. It is a 2U form factor rack-mountable server that is designed for HPC workloads. Dell Technologies recently added support for NVIDIA A100 GPGPUs to the PowerEdge R7525 server, which supports up to three PCIe-based dual-width NVIDIA GPGPUs. This blog describes the single-node performance of selected HPC applications with both one- and two-NVIDIA A100 PCIe GPGPUs.

The NVIDIA Ampere A100 accelerator is one of the most advanced accelerators available in the market, supporting two form factors: 

  • PCIe version 
  • Mezzanine SXM4 version

 The PowerEdge R7525 server supports only the PCIe version of the NVIDIA A100 accelerator. 

The following table compares the NVIDIA A100 GPGPU with the NVIDIA V100S GPGPU: 


                      | NVIDIA A100 GPGPU (SXM4 / PCIe Gen4) | NVIDIA V100S GPGPU (SXM2 / PCIe Gen3)
GPU architecture      | Ampere                               | Volta
Memory size           | 40 GB / 40 GB                        | 32 GB / 32 GB
CUDA cores            | 6912                                 | 5120
Base clock            | 1095 MHz / 765 MHz                   | 1290 MHz / 1245 MHz
Boost clock           | 1410 MHz                             | 1530 MHz / 1597 MHz
Memory clock          | 1215 MHz                             | 877 MHz / 1107 MHz
MIG support           | Yes                                  | No
Peak memory bandwidth | Up to 1555 GB/s                      | Up to 900 GB/s / Up to 1134 GB/s
Total board power     | 400 W / 250 W                        | 300 W / 250 W

The NVIDIA A100 GPGPU brings innovations and features for HPC applications such as the following:

  • Multi-Instance GPU (MIG)—The NVIDIA A100 GPGPU can be converted into as many as seven GPU instances, which are fully isolated at the hardware level, each using their own high-bandwidth memory and cores. 
  • HBM2—The NVIDIA A100 GPGPU comes with 40 GB of high-bandwidth memory (HBM2) and delivers bandwidth up to 1555 GB/s. Memory bandwidth with the NVIDIA A100 GPGPU is 1.7 times higher than with the previous generation of GPUs. 

Server configuration

The following table shows the PowerEdge R7525 server configuration that we used for this blog:

Server: PowerEdge R7525
Processor: 2nd Gen AMD EPYC 7502, 32C, 2.5 GHz
Memory: 512 GB (16x 32 GB @ 3200 MT/s)
GPGPUs: 2x NVIDIA A100 PCIe 40 GB, or 2x NVIDIA V100S PCIe 32 GB
Logical processors: Disabled
Operating system: CentOS Linux release 8.1 (4.18.0-147.el8.x86_64)
CUDA: 11.0 (driver version 450.51.05)
gcc: 9.2.0
MPI: OpenMPI-3.0
HPL: hpl_cuda_11.0_ompi-4.0_ampere_volta_8-7-20
HPCG: xhpcg-3.1_cuda_11_ompi-3.1
GROMACS: v2020.4

Benchmark results

The following sections provide our benchmarks results with observations.

High-Performance Linpack benchmark

High Performance Linpack (HPL) is a standard HPC system benchmark. This benchmark measures the compute power of the entire cluster or server. For this study, we used HPL compiled with NVIDIA libraries.

The following figure shows the HPL performance comparison for the PowerEdge R7525 server  with either NVIDIA A100 or NVIDIA V100S GPGPUs:

Figure 1: HPL performance on the PowerEdge R7525 server with the NVIDIA A100 GPGPU compared to the NVIDIA V100S GPGPU

The problem size (N) is larger for the NVIDIA A100 GPGPU due to the larger capacity of GPU memory. We adjusted the block size (NB) used with the:

  • NVIDIA A100 GPGPU to 288
  • NVIDIA V100S GPGPU to 384

The AMD EPYC processors provide options for multiple NUMA configurations. We found that 4 NUMA domains per socket (NPS=4) gives the best result; NPS=1 and NPS=2 lower the performance by 10 percent and 5 percent, respectively. In a single PowerEdge R7525 node, the NVIDIA A100 GPGPU delivers 12 TF per card using this configuration without an NVLINK bridge. The PowerEdge R7525 server with two NVIDIA A100 GPGPUs delivers 2.3 times higher HPL performance than the NVIDIA V100S configuration. This improvement is credited to the new double-precision Tensor Cores that accelerate FP64 math.

The following figure shows the power consumption of the server in a time series while running HPL on the NVIDIA A100 GPGPUs. Power consumption was measured with the iDRAC. The server reached a peak of 1038 Watts, consistent with the higher GFLOPS achieved.

Figure 2: Power consumption while running HPL

High Performance Conjugate Gradient benchmark

The High Performance Conjugate Gradient (HPCG)  benchmark is based on a conjugate gradient solver, in which the preconditioner is a three-level hierarchical multigrid method using the Gauss-Seidel method. 

As shown in the following figure, HPCG performs at a rate 70 percent higher with the NVIDIA A100 GPGPU due to higher memory bandwidth:  

Figure 3: HPCG performance comparison 

Due to the different memory sizes, the problem size used to obtain the best performance was 512 x 512 x 288 on the NVIDIA A100 GPGPU and 256 x 256 x 256 on the NVIDIA V100S GPGPU. For this blog, we used NUMA per socket (NPS)=4 and obtained results without an NVLINK bridge. These results show that applications such as HPCG that fit into GPU memory can take full advantage of it and benefit from the higher memory bandwidth of the NVIDIA A100 GPGPU.

GROMACS

In addition to these two basic HPC benchmarks (HPL and HPCG), we also tested GROMACS, an HPC application. We compiled GROMACS 2020.4 with the CUDA compilers and OpenMPI; the results are shown in the following figure:

Figure 4: GROMACS performance with NVIDIA GPGPUs on the PowerEdge R7525 server

The GROMACS build included thread MPI (built in with the GROMACS package). All performance numbers were captured from the “ns/day” output. We evaluated multiple MPI ranks, separate PME ranks, and different nstlist values to achieve the best performance. In addition, we set the GROMACS runtime environment variables that gave the best results; choosing the right combination of variables avoided expensive data transfers and led to significantly better performance for these datasets.
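
The exact variable set is not listed in this blog; as an illustration, GROMACS 2020 exposes environment variables that keep halo exchange and PME-PP communication on the GPU, which can be combined with GPU offload flags roughly as follows (a sketch with a hypothetical input file):

export GMX_GPU_DD_COMMS=true          # GPU-direct halo exchange between domain-decomposition ranks
export GMX_GPU_PME_PP_COMMS=true      # GPU-direct PME to PP communication
export GMX_FORCE_UPDATE_DEFAULT_GPU=true
gmx mdrun -s benchmark.tpr -ntmpi 4 -ntomp 8 -nb gpu -pme gpu -npme 1 -bonded gpu -update gpu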

GROMACS performance was based on a comparative analysis between NVIDIA V100S and NVIDIA A100 GPGPUs. Excerpts from our single-node multi-GPU analysis for two datasets showed a performance improvement of approximately 30 percent with the NVIDIA A100 GPGPU. This result is due to improved memory bandwidth of the NVIDIA A100 GPGPU. (For information about how the GROMACS code design enables lower memory transfer overhead, see Developer Blog: Creating Faster Molecular Dynamics Simulations with GROMACS 2020.)

Conclusion

The Dell PowerEdge R7525 server equipped with NVIDIA A100 GPGPUs shows exceptional performance improvements over servers equipped with previous generations of NVIDIA GPGPUs for applications such as HPL, HPCG, and GROMACS. Memory-bound applications such as HPCG and GROMACS in particular can take advantage of the higher memory bandwidth available with NVIDIA A100 GPGPUs.

 

Read Full Blog
  • HPC

Ready Solution for HPC PixStor Storage Capacity Expansion, HDR100 Update

Mario Gallegos

Tue, 17 Nov 2020 21:43:49 -0000

|

Read Time: 0 minutes

Introduction

Today’s HPC environments have ever-increasing demands for very high-speed storage that frequently must also provide high capacity and distributed access through several standard protocols such as NFS and SMB. Those demands are typically covered by parallel file systems that provide concurrent access to a single file, or a set of files, from multiple nodes, distributing data very efficiently and securely across multiple LUNs on several servers using the fastest network technology available.

Solution Architecture

This blog is a technology update covering the use of InfiniBand HDR100 with the Dell EMC Ready Solution for HPC PixStor Storage, a parallel file system (PFS) solution for HPC environments in which PowerVault ME484 EBOD arrays are used to increase the capacity of the solution. Figure 1 presents the reference architecture, depicting the capacity-expansion SAS additions to the existing PowerVault ME4084 storage arrays and the replacement of the InfiniBand EDR components with HDR100: ConnectX-6 HCAs and QM8700 switches. The PixStor solution includes the widespread General Parallel File System, also known as Spectrum Scale, as the PFS component, in addition to many other ArcaStream software components such as advanced analytics, simplified administration and monitoring, efficient file search, and advanced gateway capabilities.

Figure 1 Reference Architecture

Solution Components

This solution is released with the latest 2nd Generation Intel Xeon Scalable CPUs (Cascade Lake), and some of the servers will use the fastest RAM available to them (2933 MT/s). However, due to hardware availability during testing, the solution prototype used servers with 1st Generation Intel Xeon Scalable CPUs (Skylake) and slower RAM to characterize performance. Since the bottleneck of the solution is at the SAS controllers of the Dell EMC PowerVault ME40x4 arrays, no significant performance disparity is expected once the Skylake CPUs and RAM are replaced with Cascade Lake CPUs and faster RAM. In addition, the solution was updated to the latest version of PixStor (5.1.3.1), which supports RHEL 7.7 and OFED 5.0-2.1.8.

Table 1 lists the main components of the solution; the first column shows the components used at release time (and therefore available to customers), and the last column shows the components actually used for characterizing the performance of the solution. The drives listed for data (12 TB NLS) and metadata (960 GB SSD) are the ones used for performance characterization; faster drives can provide better random IOPS and may improve create/removal metadata operations.

Finally, for completeness, the list of possible data HDDs and metadata SSDs was included, which is based on the drives supported as enumerated on the DellEMC PowerVault ME4 support matrix, available online.

Table 1 Components to be used at release time and those used in the test bed

Internal Mgmt Connectivity: Dell Networking S3048-ON Gigabit Ethernet
Data Storage Subsystem: 1x to 4x Dell EMC PowerVault ME4084, plus 1x to 4x Dell EMC PowerVault ME484 (one per ME4084)
  80x 12 TB 3.5" NL SAS3 HDDs; 8 LUNs, linear 8+2 RAID 6, chunk size 512 KiB
  HDD options: 900 GB @15K, 1.2 TB @10K, 1.8 TB @10K, 2.4 TB @10K, 4 TB NLS, 8 TB NLS, 12 TB NLS, 16 TB NLS
  4x 960 GB SAS3 SSDs for metadata as 2x RAID 1 (or 4 global HDD spares if the optional High Demand Metadata Module is used); SSD options: 480 GB, 960 GB, 1.92 TB, 3.84 TB
Optional High Demand Metadata Storage Subsystem: 1x to 2x (max 4x) Dell EMC PowerVault ME4024 (4x supported on the Large configuration only)
  Each ME4024: 12 LUNs, linear RAID 1; 24x 960 GB 2.5" SAS3 SSDs (options: 480 GB, 960 GB, 1.92 TB, 3.84 TB)
RAID Storage Controllers: Redundant 12 Gbps SAS
Capacity without Expansion: Raw 4032 TB (3667 TiB or 3.58 PiB) with 12 TB HDDs; formatted ~3072 TB (2794 TiB or 2.73 PiB)
Capacity with Expansion: Raw 8064 TB (7334 TiB or 7.16 PiB) with 12 TB HDDs; formatted ~6144 TB (5588 TiB or 5.46 PiB)
Processor, Gateway/Ngenea (R740): Released: 2x Intel Xeon Gold 6230 2.1G, 20C/40T, 10.4 GT/s, 27.5M cache, Turbo, HT (125 W), DDR4-2933; Test bed: 2x Intel Xeon Gold 6136 @ 3.0 GHz, 12 cores
Processor, High Demand Metadata (R740) / Storage Node (R740) / Management Node (R440): Released: 2x Intel Xeon Gold 5220 2.2G, 18C/36T, 10.4 GT/s, 24.75M cache, Turbo, HT (125 W), DDR4-2666; Test bed: 2x Intel Xeon Gold 5118 @ 2.30 GHz, 12 cores
Memory, Gateway/Ngenea (R740): Released: 12x 16 GiB 2933 MT/s RDIMMs (192 GiB); Test bed: 24x 16 GiB 2666 MT/s RDIMMs (384 GiB)
Memory, High Demand Metadata (R740) / Storage Node (R740) / Management Node (R440): Released: 12x 16 GB 2666 MT/s RDIMMs (192 GiB); Test bed: 12x 8 GiB 2666 MT/s RDIMMs (96 GiB)
Operating System: CentOS 7.7
Kernel version: 3.10.0-1062.12.1.el7.x86_64
PixStor Software: 5.1.3.1
OFED Version: Mellanox OFED 5.0-2.1.8
High Performance Network Connectivity: Mellanox ConnectX-6 dual-port InfiniBand VPI HDR100/100 GbE, and 10 GbE
High Performance Switch: 2x Mellanox QM8700 (HA, redundant)
Local Disks (OS and Analysis/Monitoring), Released: all servers except the Management node use 3x 480 GB SSD SAS3 (RAID 1 + HS) for OS with a PERC H730P RAID controller; the Management node uses 3x 480 GB SSD SAS3 (RAID 1 + HS) for OS and Analysis/Monitoring with a PERC H740P RAID controller
Local Disks (OS and Analysis/Monitoring), Test bed: all servers except the Management node use 2x 300 GB 15K SAS3 (RAID 1) for OS with a PERC H330 RAID controller; the Management node uses 5x 300 GB 15K SAS3 (RAID 5) for OS and Analysis/Monitoring with a PERC H740P RAID controller
Systems Management: iDRAC 9 Enterprise + Dell EMC OpenManage

 

Performance Characterization

To characterize this Ready Solution, we used the hardware specified in the last column of Table 1, including the optional High Demand Metadata Module. In order to assess the solution performance, the following benchmarks were used:

  • IOzone N to N sequential
  • IOzone random
  • IOR N to 1 sequential
  • MDtest 

For all benchmarks listed above, the test bed had the clients as described in the Table 2 below. Since the number of compute nodes available for testing was only 16, when a higher number of threads was required, those threads were equally distributed on the compute nodes (i.e. 32 threads = 2 threads per node, 64 threads = 4 threads per node, 128 threads = 8 threads per node, 256 threads =16 threads per node, 512 threads = 32 threads per node, 1024 threads = 64 threads per node). The intention was to simulate a higher number of concurrent clients with the limited number of compute nodes available. Since the benchmarks support a high number of threads, a maximum value up to 1024 was used (specified for each test), while avoiding excessive context switching and other related side effects that can affect performance results.

Table 2 Client test bed

Number of Client nodes: 16
Client node: C6320
Processors per client node: 2x Intel(R) Xeon(R) Gold E5-2697v4, 18 cores @ 2.30 GHz
Memory per client node: 12x 16 GiB 2400 MT/s RDIMMs
High Performance Adapter: Mellanox ConnectX-4 InfiniBand VPI
Operating System: CentOS 7.6
OS Kernel: 3.10.0-957.10.1
PixStor Software: 5.1.3.1
OFED Version: Mellanox OFED 5.0-1.0.0

Sequential IOzone Performance N clients to N files

Sequential N clients to N files performance was measured with IOzone version 3.487. Tests executed varied from single thread up to 512 threads on the capacity expanded solution (4x ME4084s + 4x ME484s); results from the EDR testing are contrasted with the HDR100 update. 

Caching effects were minimized by setting the file system tunable page pool to 8 GiB on the clients and 24 GiB on the servers and using files twice the total memory size of the clients or servers (whichever value is larger). It is important to note that for the file system, the page pool tunable sets the maximum amount of memory used by the file system for caching data, regardless of the amount of RAM installed and free. Also, important to note is that while in previous DellEMC HPC solutions the block size for large sequential transfers is 1 MiB, the file system was formatted with 8 MiB blocks and therefore that value is used on the benchmark for optimal performance. That may look too large and apparently waste too much space, but the file system uses subblock allocation to prevent that situation. In the current configuration, each block was subdivided in 256 subblocks of 32 KiB each. 
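
For reference, the page pool is a standard Spectrum Scale tunable; limiting it on clients and servers as described above could be done roughly as follows (a sketch; the node-class names are hypothetical, and the change normally requires a daemon restart to take effect):

mmchconfig pagepool=8G -N clientNodes     # cap client-side data cache at 8 GiB
mmchconfig pagepool=24G -N serverNodes    # cap server-side data cache at 24 GiB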

The following commands were used to execute the benchmark for writes and reads, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and threadlist was the file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

./iozone -i0 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist

./iozone -i1 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist

 

Figure 2  N to N Sequential Performance

From the results, we can observe that performance rises quickly with the number of clients and then reaches a plateau that is fairly stable until the maximum number of threads that IOzone allows is reached; large-file sequential performance is therefore stable except at 512 concurrent threads (about 8% lower). The maximum read performance of 23.8 GB/s at 32 threads was still limited by the bandwidth of the two IB HDR100 links used on the storage nodes, starting at 8 threads. Read performance at 4 threads is considerably lower, and at high thread counts it is slightly lower compared to EDR (less than 5%), but the results are reproducible. Since the sequential N-to-1 test using IOR uses the same data size and similar parameters on a single file (adding locking overhead), the large drop in read performance at 4 threads (and, to a much smaller degree, at high thread counts) may be due to IOzone using calls that work less efficiently than the IOR calls, but more work is needed to find the reason for the different behavior.

The highest write performance of 21 GB/s was achieved at 512 threads. It is important to remember that for the PixStor file system, the preferred mode of operation is scatter, and the solution was formatted to use that mode. In this mode, blocks are allocated from the very beginning of operation in a pseudo-random fashion, spreading data across the whole surface of each HDD. The obvious disadvantage is a smaller initial maximum performance, but that performance is maintained fairly constant regardless of how much space is used on the file system. This contrasts with other parallel file systems that initially use the outer tracks, which can hold more data (sectors) per disk revolution and therefore deliver the highest performance the HDDs can provide, but as the system fills, inner tracks with less data per revolution are used, with the consequent reduction of performance.
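
The allocation mode is a file-system creation option in Spectrum Scale; a sketch of how a scatter-allocated, 8 MiB block file system could be created and verified is shown below (device name, stanza file, and file system name are hypothetical):

mmcrfs gpfs1 -F nsd_stanza.txt -B 8M -j scatter   # -j scatter selects the scatter block-allocation map
mmlsfs gpfs1 -B -j                                # confirm the block size and allocation type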

Sequential IOR Performance N clients to 1 file

Sequential N clients to a single shared file performance was measured with IOR version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from one thread up to 512 threads (since there were not enough cores for 1024 threads), and results are contrasted with the solution without the capacity expansion.

Caching effects were minimized by setting the file system page pool tunable to 8 GiB on the clients and 24 GiB on the servers and using a total data size bigger than twice the total memory size of clients or servers (whichever is larger). This benchmark tests used 8 MiB blocks for optimal performance. The previous performance test section has a more complete explanation for those matters. 

The following commands were used to execute the benchmark for writes and reads, where Threads was the variable with the number of threads used (1 to 1024 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -w -s 1 -t 8m -b 128G
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -r -s 1 -t 8m -b 128G 

 

Figure 3  N to 1 Sequential Performance

Performance rises quickly with the number of clients and then reaches a plateau that is fairly stable for reads and writes from 8 threads all the way to the maximum number of threads used in this test. The maximum read performance was 24.2 GB/s at 32 threads, with the bottleneck being the InfiniBand HDR100 interface, apparently at slightly above line speed. Similarly, the maximum write performance of 19.9 GB/s was reached at 16 threads. An important data point is at 4 threads: even though this test uses the same data size and parameters as IOzone, with the extra burden of locking, no performance drop is observed here as it was for the IOzone reads.

Random small blocks IOzone Performance N clients to N files

Random N clients to N files performance was measured with IOzone 3.487. Tests executed varied from 16 threads up to 512 threads, since there were not enough client cores for 1024 threads. Lower thread counts were not tested this time because they take a very long time to execute, IOzone does not report results until a test completes in its entirety, and the most important information tends to be the maximum IOPS the solution can provide. Each thread used a different file, and threads were assigned in a round-robin fashion to the client nodes. This benchmark used 4 KiB blocks to emulate small-block traffic.

Caching effects were minimized by setting the file system page pool tunable to 8 GiB on the clients and 24 GiB on the servers and using a total data size larger than twice the total page pool size of the clients or servers (whichever is larger). It is important to note that the page pool tunable sets the maximum amount of memory used by the file system for caching data, regardless of the amount of RAM installed and free.

 

Figure 4  N to N Random Performance

From the results, we can observe that write performance starts at a high value of 23.4K IOPS, stays steadily below 25K IOPS, and peaks at 27.4K IOPS at 256 threads. Read performance, on the other hand, starts at 1.3K IOPS and increases almost linearly with the number of threads (keep in mind that the number of threads doubles at each data point), reaching the maximum performance of 33.8K IOPS at 512 threads. Using more threads would require more than the 16 compute nodes, or more cores per node, to avoid losing performance to process context switching, data locality, and similar effects. ME4 arrays require higher IO pressure (queue or IO depth) to reach their maximum random IOPS, so this test shows a lower apparent performance; the arrays could in fact deliver more IOPS when using tools such as FIO that can control the IO depth per process.

Metadata performance with MDtest using empty files

Metadata performance was measured with MDtest version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from single thread up to 512 threads. The benchmark was used for files only (no directory metadata), getting the number of creates, stats and removes that the solution can handle, and results were contrasted with previous EDR results.

To properly evaluate the solution against other Dell EMC HPC storage solutions and the previous blog results, the optional High Demand Metadata Module was used, but with a single ME4024 array, even though the large configuration tested in this work is designed to have two ME4024s.

This High Demand Metadata Module can support up to four ME4024 arrays, and it is suggested to increase the number of ME4024 arrays to four before adding another metadata module. Each additional ME4024 array is expected to increase metadata performance, except perhaps for Stat operations: since those IOPS numbers are already very high, at some point the CPUs will become the bottleneck and performance will not continue to increase linearly.

The following command was used to execute the benchmark, where Threads was the variable with the number of threads used (1 to 512, incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes. Similar to the random IO benchmark, the maximum number of threads was limited to 512, since there are not enough cores for 1024 threads and context switching would affect the results, reporting a number lower than the real performance of the solution.

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F
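As a sketch only, a round-robin my_hosts.$Threads file of this kind could be generated as follows; the node names node01 through node16 are hypothetical:

Threads=64   # example thread count
for i in $(seq 1 $Threads); do printf "node%02d\n" $(( (i-1) % 16 + 1 )); done > my_hosts.$Threads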

Since performance results can be affected by the total number of IOPs, the number of files per directory, and the number of threads, it was decided to fix the total number of files at ~2 million (2^21 = 2,097,152), fix the number of files per directory at 1024, and vary the number of directories as the number of threads changed, as shown in Table 3.

Table 3 MDtest distribution of files on directories

Number of Threads | Number of directories per thread | Total number of files
1 | 2048 | 2,097,152
2 | 1024 | 2,097,152
4 | 512 | 2,097,152
8 | 256 | 2,097,152
16 | 128 | 2,097,152
32 | 64 | 2,097,152
64 | 32 | 2,097,152
128 | 16 | 2,097,152
256 | 8 | 2,097,152
512 | 4 | 2,097,152
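The distribution above follows from fixing the total file count and the files per directory; a quick sketch of the arithmetic (directories per thread = 2,097,152 / (threads x 1024)) is:

for t in 1 2 4 8 16 32 64 128 256 512; do echo "$t threads: $(( 2097152 / (t * 1024) )) directories per thread"; done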


 

Figure 5  Metadata Performance - Empty Files

First, note that the scale chosen is logarithmic with base 10, to allow comparing operations that differ by several orders of magnitude; otherwise, some of the operations would look like a flat line close to 0 on a linear graph. A log graph with base 2 could be more appropriate, since the number of threads is increased in powers of 2, but the graph would look very similar and people tend to handle and remember numbers based on powers of 10 more easily.

The system gets very good results with Stat operations, reaching their peak value of 6M op/s at 256 threads. Removal operations attained a maximum of 189.7K op/s at 32 threads, and Create operations achieved their peak of 266.8K op/s at 512 threads. Stat operations have more variability, but once they reach their peak value, performance does not drop below 3M op/s. Create and Removal are more stable once they reach a plateau, remaining above 160K op/s for Removal and 128K op/s for Create.

Metadata performance with MDtest using 4 KiB files

This test is almost identical to the previous one, except that instead of empty files, small files of 4KiB were used.

The following command was used to execute the benchmark, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F -w 4K -e 4K

 

Figure 6  Metadata Performance - Small files (4K)

The system gets very good results for Stat operations, reaching a peak value of almost 4.9M op/s at 512 threads. Remove operations attained a maximum of 442.7K op/s at 128 threads, and Create operations achieved their peak of 75K op/s at 512 threads, apparently without reaching a plateau yet. Stat and Removal operations have more variability, but once they reach their peak value, performance does not drop below 3.5M op/s for Stats and 315K op/s for Removal. Create and Read have less variability and keep increasing as the number of threads grows.

Since these numbers are for a metadata module with a single ME4024, performance will increase with each additional ME4024 array; however, we cannot simply assume a linear increase for each additional array. Unless the whole file fits inside the inode, the data targets on the ME4084s will be used to store part of the 4K files, limiting performance to some degree. Since the inode size is 4 KiB and it still needs to store metadata, only files of around 3 KiB will fit inside an inode, and any file bigger than that will use data targets.
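As a sanity check only, on a Spectrum Scale style file system the configured inode size can be confirmed with something like the following; the device name mmfs1 is an assumption based on the mount point used in these tests:

mmlsfs mmfs1 -i    # reports the inode size in bytes (4096 would match the 4 KiB discussed above)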

Conclusions and Future Work

The solution delivers performance similar to that observed with the InfiniBand EDR technology. An overview of the HDR100 performance is shown in Table 4; it is expected to remain stable from an empty file system until it is almost full, because of the use of scatter allocation across the whole surface area of all HDDs. Furthermore, the solution scales linearly in capacity and performance as more storage node modules are added, and a similar performance increase can be expected from the optional high demand metadata module. This solution provides HPC customers with a very reliable parallel file system used by many Top 500 HPC clusters. In addition, it provides exceptional search capabilities and advanced monitoring and management. With the addition of optional gateway nodes, it allows file sharing via ubiquitous standard protocols such as NFS and SMB to as many clients as needed. Finally, Ngenea nodes allow efficient access to other cost-effective storage tiers, such as ECS, Isilon enterprise NAS, and cloud solutions, using different protocols.

Table 4  Peak & Sustained Performance

 

Benchmark | Peak Write | Peak Read | Sustained Write | Sustained Read
Large Sequential N clients to N files | 21.0 GB/s | 23.8 GB/s | 20.5 GB/s | 23.0 GB/s
Large Sequential N clients to single shared file | 19.9 GB/s | 24.2 GB/s | 19.1 GB/s | 23.4 GB/s
Random Small blocks N clients to N files | 27.4K IOps | 33.8K IOps | 23.0K IOps | 33.8K IOps

Metadata benchmark | Peak | Sustained
Create empty files | 266.8K IOps | 128K IOps
Stat empty files | 6M IOps | 3M IOps
Remove empty files | 189.7K IOps | 160K IOps
Create 4KiB files | 75K IOps | 75K IOps
Stat 4KiB files | 4.9M IOps | 3.5M IOps
Remove 4KiB files | 442.7K IOps | 315K IOps

 Performance for the gateway nodes was measured and will be reported in a new blog. Finally, high performance NVMe nodes are being tested and results will also be released in a different blog.


Ready Solutions for HPC BeeGFS High Performance Storage: HDR100 Refresh

Nirmala Sundararajan

Wed, 19 Jul 2023 19:59:24 -0000

|

Read Time: 0 minutes

Introduction

True to the tradition of keeping up with technology trends, the Dell EMC Ready Solutions for BeeGFS High Performance Storage, which was originally released in November 2019, has now been refreshed with the latest software and hardware. The base architecture of the solution remains the same. The following table lists the differences between the initially released InfiniBand EDR based solution and the current InfiniBand HDR100 based solution in terms of the software and hardware used.

Table 1.   Comparison of Hardware and Software of EDR and HDR based BeeGFS High Performance Solution 

Software | Initial Release (Nov. 2019) | Current Refresh (Nov. 2020)
Operating System | CentOS 7.6 | CentOS 8.2
Kernel version | 3.10.0-957.27.2.el7.x86_64 | 4.18.0-193.14.2.el8_2.x86_64
BeeGFS File system version | 7.1.3 | 7.2
Mellanox OFED version | 4.5-1.0.1.0 | 5.0-2.1.8.0

Hardware | Initial Release | Current Refresh
NVMe Drives | Intel P4600 1.6 TB Mixed Use | Intel P4610 3.2 TB Mixed Use
InfiniBand Adapters | ConnectX-5 Single Port EDR | ConnectX-6 Single Port HDR100
InfiniBand Switch | SB7890 InfiniBand EDR 100 Gb/s Switch - 1U (36x EDR 100 Gb/s ports) | QM8790 Quantum HDR Edge Switch - 1U (80x HDR100 100 Gb/s ports using splitter cables)

 This blog presents the architecture, updated technical specifications and the performance characterization of the upgraded high-performance solution. It also includes a comparison of the performance with respect to the previous EDR based solution.

Solution Reference Architecture

The high-level architecture of the solution remains the same as the initial release. The hardware components of the solution consist of 1x PowerEdge R640 as the management server and 6x PowerEdge R740xd servers as metadata/storage servers to host the metadata and storage targets. Each PowerEdge R740xd server is equipped with 24x Intel P4610 3.2 TB Mixed Use Express Flash drives and 2x Mellanox ConnectX-6 HDR100 adapters. Figure 1 shows the reference architecture of the solution. 

 

 

Figure 1.   Dell EMC Ready Solutions for HPC BeeGFS Storage – Reference Architecture

There are two networks: the InfiniBand network and the private Ethernet network. The management server is only connected via Ethernet to the metadata and storage servers. Each metadata and storage server has two links to the InfiniBand network and is connected to the private network via Ethernet. The clients have one InfiniBand link and are also connected to the private Ethernet network. For more details on the solution configuration, please refer to the blog and whitepaper on the BeeGFS High Performance Solution published at hpcatdell.com.

Hardware and Software Configuration

Tables 2 and 3 describe the hardware specifications of the management server and the metadata/storage servers respectively. Table 4 describes the versions of the software used for the solution.

Table 2.   PowerEdge R640 Configuration (Management Server)

Component | Description
Processor | 2x Intel Xeon Gold 5218 2.3GHz, 16 cores
Memory | 12x 8GB DDR4 2666MT/s DIMMs - 96GB
Local Disks | 6x 300GB 15K RPM SAS 2.5in HDDs
RAID Controller | PERC H740P Integrated RAID Controller
Out of Band Management | iDRAC9 Enterprise with Lifecycle Controller

Table 3.   PowerEdge R740xd Configuration (Metadata and Storage Servers)

Component | Description
Processor | 2x Intel Xeon Platinum 8268 CPU @ 2.90GHz, 24 cores
Memory | 12x 32GB DDR4 2933MT/s DIMMs - 384GB
BOSS Card | 2x 240GB M.2 SATA SSDs in RAID 1 for OS
Local Drives | 24x Dell Express Flash NVMe P4610 3.2 TB 2.5in U.2
InfiniBand Adapter | 2x Mellanox ConnectX-6 single port HDR100 Adapter
InfiniBand Adapter Firmware | 20.26.4300
Out of Band Management | iDRAC9 Enterprise with Lifecycle Controller

 Table 4.   Software Configuration (Metadata and Storage Servers)

Component | Description
Operating System | CentOS Linux release 8.2.2004 (Core)
Kernel version | 4.18.0-193.14.2.el8_2.x86_64
Mellanox OFED | 5.0-2.1.8.0
NVMe SSDs | VDV1DP23
OpenMPI | 4.0.3rc4
Intel Data Center Tool | v3.0.26
BeeGFS | 7.2
Grafana | 7.1.5-1
InfluxDB | 1.8.2-1
IOzone | 3.490
MDtest | 3.3.0+dev

Performance Evaluation

The system performance was evaluated using the following benchmarks:

  • IOzone sequential reads and writes, N clients to N files
  • IOzone random reads and writes, N clients to N files
  • MDtest metadata operations (file creates, stats, and removals)

Performance tests were run on a testbed with the clients described in Table 5. For test cases where the number of IO threads was greater than the number of physical IO clients, the threads were distributed equally across the clients (for example, 32 threads = 2 threads per client, …, 1024 threads = 64 threads per client).

Table 5.   Client Configuration

Component | Description
Server model | 8x PowerEdge R840 and 8x PowerEdge C6420
Processor | 4x Intel Xeon Platinum 8260 CPU @ 2.40GHz, 24 cores (R840); 2x Intel Xeon Gold 6248 CPU @ 2.50GHz, 20 cores (C6420)
Memory | 24x 16GB DDR4 2933MT/s DIMMs - 384GB (R840); 12x 16GB DDR4 2933MT/s DIMMs - 192GB (C6420)
Operating System | Red Hat Enterprise Linux release 8.2 (Ootpa)
Kernel version | 4.18.0-193.el8.x86_64
InfiniBand Adapter | 1x ConnectX-6 single port HDR100 adapter
OFED version | 5.0-2.1.8.0

Sequential Reads and Writes N-N

The IOzone benchmark was used in the sequential read and write mode to evaluate sequential reads and writes. These tests were conducted using multiple thread counts, starting at 1 thread and going up to 1024 threads. At each thread count, an equal number of files was generated, since this test works with one file per thread (the N-N case). The round-robin algorithm was used to choose targets for file creation in a deterministic fashion.

For all the tests, aggregate file size was 8 TB and this was equally divided among the number of threads for any given test. The aggregate file size chosen was large enough to minimize the effects of caching from the servers as well as from BeeGFS clients.

IOzone was run in a combined mode of write then read (-i 0, -i 1) to allow it to coordinate the boundaries between the operations. For this test, we used a 1MiB record size for every run. The commands used for Sequential N-N tests are given below:

Sequential Writes and Reads: iozone -i 0 -i 1 -c -e -w -r 1m -I -s $Size -t $Thread -+n -+m /path/to/threadlist

OS caches were also dropped or cleaned on the client nodes between iterations as well as between write and read tests by running the command:

# sync && echo 3 > /proc/sys/vm/drop_caches
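To issue this on every client between iterations, a parallel shell can be used; a minimal sketch with pdsh, assuming it is installed and the clients are reachable under the hypothetical names client[01-16]:

pdsh -w client[01-16] 'sync && echo 3 > /proc/sys/vm/drop_caches'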

The default stripe count for BeeGFS is 4. However, the chunk size and the number of targets per file can be configured on a per-directory basis. For all these tests, BeeGFS stripe size was chosen to be 2MB and stripe count was chosen to be 3 since we have three targets per NUMA zone as shown below: 

# beegfs-ctl --getentryinfo --mount=/mnt/beegfs /mnt/beegfs/benchmark --verbose

Entry type: directory

EntryID: 0-5F6417B3-1

ParentID: root

Metadata node: storage1-numa0-2 [ID: 2]

Stripe pattern details:

+ Type: RAID0

+ Chunksize: 2M

+ Number of storage targets: desired: 3

+ Storage Pool: 1 (Default)

Inode hash path: 33/57/0-5F6417B3-1 
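For reference, a per-directory stripe pattern like the one above can be set with beegfs-ctl; a minimal sketch, assuming the same benchmark directory:

# Set a 2 MB chunk size and 3 desired storage targets on the benchmark directory
beegfs-ctl --setpattern --chunksize=2m --numtargets=3 --mount=/mnt/beegfs /mnt/beegfs/benchmark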

 
The testing methodology and the tuning parameters used were similar to those previously described in the EDR based solution. For additional details in this regard, please refer to the whitepaper on the BeeGFS High Performance Solution.

Note

The number of clients used for the performance characterization of the EDR based solution was 32, whereas the number of clients used for the performance characterization of the HDR100 based solution was only 16. In the performance charts below, this is indicated by "16c", which denotes 16 clients, and "32c", which denotes 32 clients. The dotted lines show the performance of the EDR based solution and the solid lines show the performance of the HDR100 based solution.

 

Figure 2.   Sequential IOzone 8 TB aggregate file size 


From Figure 2, we observe that the HDR100 peak read performance is ~131 GB/s and peak write is ~123 GB/s at 1024 threads. Each drive can provide 3.2 GB/s peak read performance and 3.0 GB/s peak write performance, which allows a theoretical peak of 460.8 GB/s for reads and 432 GB/s for writes for the solution. However, in this solution the network is the limiting factor. The setup has a total of 11 InfiniBand HDR100 links for the storage servers. Each link can provide a theoretical peak performance of 12.4 GB/s, which allows an aggregate theoretical peak performance of 136.4 GB/s. The achieved peak read and write performance are 96% and 90%, respectively, of this theoretical peak.

We observe that the peak read performance for the HDR100 based solution is slightly lower than that observed with the EDR based solution. This can be attributed to the fact that the benchmark tests were carried out using 16 clients for the HDR100 based setup while the EDR based solution used 32 clients. The improved write performance with HDR100 is due to the fact that the P4600 NVMe SSD used in the EDR based solution could provide only 1.3 GB/s for sequential writes whereas the P4610 NVMe SSD provides 3.0 GB/s peak write performance.

We also observe that the read performance is lower than writes for thread counts from 16 to 128. This is because a PCIe read operation is a Non-Posted Operation, requiring both a request and a completion, whereas a PCIe write operation is a Posted Operation that consists of a request only. A PCIe write operation is a fire and forget operation, wherein once the Transaction Layer Packet is handed over to the Data Link Layer, the operation completes. 

Read throughput is typically lower than write throughput because reads require two transactions, instead of a single one for writes, for the same amount of data. PCI Express uses a split transaction model for reads. The read transaction includes the following steps:

  • The requester sends a Memory Read Request (MRR).
  • The completer sends out the acknowledgement to MRR.
  • The completer returns a Completion with Data. 

The read throughput depends on the delay between the time the read request is issued and the time the completer takes to return the data. However, when the application issues enough read requests to cover this delay, throughput is maximized. A lower throughput is measured when the requester waits for completion before issuing subsequent requests, while a higher throughput is registered when multiple requests are issued to amortize the delay after the first data returns. This explains why the read performance is lower than the write performance from 16 to 128 threads, and why increased throughput is observed for the higher thread counts of 256, 512, and 1024.

More details regarding PCI Express Direct Memory Access are available at https://www.intel.com/content/www/us/en/programmable/documentation/nik1412547570040.html#nik1412547565760

Random Reads and Writes N-N

IOzone was used in random mode to evaluate random IO performance. Tests were conducted with thread counts from 8 up to 1024 threads. The Direct IO option (-I) was used to run IOzone so that all operations bypass the buffer cache and go directly to the disks. A BeeGFS stripe count of 3 and chunk size of 2MB was used. A 4KiB request size was used in IOzone, and performance was measured in I/O operations per second (IOPS). An aggregate file size of 8 TB was selected to minimize the effects of caching. The aggregate file size was equally divided among the number of threads within any given test. The OS caches were dropped between the runs on the BeeGFS servers as well as the BeeGFS clients.

The command used for random reads and writes is given below:

iozone -i 2 -w -c -O -I -r 4K -s $Size -t $Thread -+n -+m /path/to/threadlist

Figure 3.   N-N Random Performance

Figure 3 shows that the random writes peak at ~4.3 million IOPS at 1024 threads and the random reads peak at ~4.2 million IOPS at 1024 threads. Both the write and read performance are higher when there is a greater number of IO requests. This is because the NVMe standard supports up to 64K I/O queues and up to 64K commands per queue. This large pool of NVMe queues provides higher levels of I/O parallelism, and hence we observe IOPS exceeding 3 million. The following table compares the random IO performance of the P4610 and P4600 NVMe SSDs to better explain the observed results.

Table 6.  Performance Specification of Intel NVMe SSDs

Product | P4610 3.2 TB NVMe SSD | P4600 1.6 TB NVMe SSD
Random Read (100% Span) | 638,000 IOPS | 559,550 IOPS
Random Write (100% Span) | 222,000 IOPS | 176,500 IOPS

 

Metadata Tests

The metadata performance was measured with MDtest and OpenMPI to run the benchmark over the 16 clients. The benchmark was used to measure the file create, stat, and removal performance of the solution. Since performance results can be affected by the total number of IOPs, the number of files per directory, and the number of threads, a consistent number of files across tests was chosen to allow comparison. The total number of files was chosen to be ~2M in powers of two (2^21 = 2097152). The number of files per directory was fixed at 1024, and the number of directories varied as the number of threads changed. The test methodology and the directories created are similar to those described in the previous blog.

The following command was used to execute the benchmark:

mpirun -machinefile $hostlist --map-by node -np $threads ~/bin/mdtest -i 3 -b $Directories -z 1 -L -I 1024 -y -u -t -F

 

Figure 4.   Metadata Performance – Empty Files

From Figure 4, we observe that the create, removal, and read performance are comparable to those obtained with the EDR based solution, whereas the Stat performance is lower by ~100K IOPS. This may be because the HDR100 based solution uses only 16 clients for performance characterization whereas the EDR based solution used 32 clients. The file create operations reach their peak value of ~87K op/s at 512 threads. The removal and stat operations attained their maximum values at 32 threads, with ~98K op/s and ~392K op/s respectively.

Conclusion

This blog presents the performance characteristics of the Dell EMC High Performance BeeGFS Storage Solution with the latest software and hardware. At the software level, the high-performance solution has now been updated with

  • CentOS 8.2.2004 as the base OS
  • BeeGFS v7.2
  • Mellanox OFED version 5.0-2.1.8.0.

At the hardware level, the solution uses

  • ConnectX-6 Single Port HDR100 adapters
  • Intel P4610 3.2 TB Mixed Use NVMe drives, and
  • Quantum switch QM8790 with 80x HDR100 100 Gb/s ports.

The performance analysis allows us to conclude that:

  • IOzone sequential read and write performance is similar to that of the EDR based solution because the network is the bottleneck here.
  • The IOzone random read and write performance is greater than the previous EDR based solution by ~ 1M IOPS because of the use of P4610 NVMe drives which provide improved random write and read performance.
  • The file create and removal performance is similar to that of the EDR based solution.
  • The file stat performance registers a 19% drop because of the use of only 16 clients in the current solution as compared to the 32 clients used in the previous EDR based solution.

References

Dell EMC Ready Solutions for HPC BeeGFS Storage - Technical White Paper  

Features of Dell EMC Ready Solutions for HPC BeeGFS Storage

Scalability of Dell EMC Ready Solutions for HPC BeeGFS Storage  

Dell EMC Ready Solutions for HPC BeeGFS High Performance Storage

 

 

 
