Blogs about Dell Technologies solutions for high performance computing
Wed, 08 Nov 2023 21:09:35 -0000
The AMD EPYC 9354 Processor, when integrated into the Dell R6625 server, offers a formidable solution for high-performance computing (HPC) applications. Genoa, which is built on the Zen 4 architecture, delivers exceptional processing power and efficiency, making it a compelling choice for demanding HPC workloads. When paired with the PowerEdge R6625's robust infrastructure and scalability features, this CPU enhances server performance, enabling efficient and reliable execution of HPC applications. These features make it an ideal choice for HPC application studies and research.
At Dell Technologies, our goal is to help accelerate time to value for our customers. Dell wants customers to use our benchmark performance and scaling studies to plan their environments. By drawing on our expertise, customers don’t have to spend time testing different combinations of CPU, memory, and interconnect, choosing the CPU with the sweet spot for performance, or deciding which BIOS features to tweak for the best performance and scaling. Dell wants to accelerate the setup, deployment, and tuning of HPC clusters so that customers can get to the real value: running their applications and solving complex problems, such as manufacturing better products for their customers.
Benchmarking for high-performance computing applications was carried out using Dell PowerEdge 16G servers equipped with AMD EPYC 9354 32-Core processors.
Table 1. Test bed system configuration used for this benchmark study
Platform | Dell PowerEdge R6625 |
Processor | AMD EPYC 9354 32-Core Processor |
Cores/Socket | 32 (64 total) |
Base Frequency | 3.25 GHz |
Max Turbo Frequency | 3.75 GHz |
TDP | 280 W |
L3 Cache | 256 MB |
Memory | 768 GB | DDR5 4800 MT/s |
Interconnect | NVIDIA Mellanox ConnectX-7 NDR 200 |
Operating System | RHEL 8.6 |
Linux Kernel | 4.18.0-372.32.1 |
BIOS | 1.0.1 |
OFED | 5.6.2.0.9 |
System Profile | Performance Optimized |
Compiler | AOCC 4.0.0 |
MPI | OpenMPI 4.1.4 |
Turbo Boost | ON |
Interconnect | Mellanox NDR 200 |
Application | Vertical Domain | Benchmark Datasets |
OpenFOAM | Manufacturing - Computational Fluid Dynamics (CFD) | Motorbike 50M, 34M, and 20M cell meshes |
Weather Research and Forecasting (WRF) | Weather and Environment | CONUS v4.4 |
Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) | Molecular Dynamics | Rhodo, EAM, Stillinger-Weber, Tersoff, HECBioSim, and AIREBO |
GROMACS | Life Sciences – Molecular Dynamics | HECBioSim benchmarks (3M atoms), Water, and PRACE Lignocellulose |
CP2K | Life Sciences | H2O-DFT-LS-NREP-4,6 and H2O-64-RI-MP2 |
OpenFOAM is an open-source computational fluid dynamics (CFD) software package renowned for its versatility in simulating fluid flows, turbulence, and heat transfer. It offers a robust framework for engineers and scientists to model complex fluid dynamics problems and conduct simulations with customizable features. In this study, we worked with OpenFOAM version 9, which was compiled with GCC 11.2.0 and OpenMPI 4.1.5. For successful compilation and optimization on the AMD EPYC processors, additional flags such as '-O3 -znver4' were added.
The motorBike tutorial case, which falls under the simpleFoam solver category, was used to evaluate the performance of the OpenFOAM package on AMD EPYC 9354 processors. Three grids of 20M, 34M, and 50M cells were generated using the blockMesh and snappyHexMesh utilities of OpenFOAM. Each run used all cores (64 cores per node), and the scalability tests were carried out from a single node to sixteen nodes for all three grids. The execution time of the steady-state simpleFoam solver was recorded as the performance metric. The figure below shows the application performance for all the datasets.
Figure 1: The scaling performance of the OpenFOAM Motorbike dataset using the AMD EPYC Processor, with a focus on performance compared to a single node.
The results are normalized to the single-node result, and the scalability is depicted in Figure 1. On 9354 processors, OpenFOAM scales linearly from a single node to eight nodes for the largest dataset (50M cells). For the smaller datasets with 20M and 34M cells, scaling is linear up to four nodes and drops slightly at eight nodes. At sixteen nodes, scalability is reduced for all three datasets (20M, 34M, and 50M).
Smaller datasets can achieve satisfactory results with fewer processors and nodes. As the node count, and therefore the processor count, grows, each process has less computation to do relative to the interprocessor communication it must perform, and this communication overhead limits further reductions in runtime. Consequently, higher node counts are more beneficial when handling larger datasets in OpenFOAM simulations.
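The normalization to the single-node result can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `scaling_table` is hypothetical, and the timings in the example are made-up placeholders, not measurements from this study.

```python
def scaling_table(times_by_nodes):
    """Return {nodes: (speedup, efficiency)} relative to the single-node run.

    times_by_nodes maps a node count to the solver execution time in seconds
    and must contain an entry for 1 node.
    """
    base = times_by_nodes[1]
    table = {}
    for nodes in sorted(times_by_nodes):
        speedup = base / times_by_nodes[nodes]
        # Efficiency of 1.0 means perfectly linear scaling.
        table[nodes] = (speedup, speedup / nodes)
    return table


# Hypothetical solver times in seconds (NOT results from this study).
example_times = {1: 1600.0, 2: 800.0, 4: 400.0, 8: 220.0, 16: 160.0}
for nodes, (speedup, eff) in scaling_table(example_times).items():
    print(f"{nodes:>2} nodes: speedup {speedup:5.2f}x, efficiency {eff:6.1%}")
```

A parallel efficiency that falls well below 100 percent at higher node counts is the signature of the communication overhead described above.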
The Weather Research and Forecasting model (WRF) is at the forefront of meteorological and atmospheric research, with its latest version being a testament to advancements in high-performance computing (HPC). WRF enables scientists and meteorologists to simulate and forecast complex weather patterns with unparalleled precision. In this study, we worked with WRF version 4.5, which was compiled with AOCC 4.0.0 and OpenMPI 4.1.4. For successful compilation and optimization with the AOCC compilers, additional flags such as '-O3 -znver4' were added.
The dataset used in our study is CONUS v4.4. This means that the model's grid, parameters, and input data are set up to focus on weather conditions within the continental United States. This configuration is particularly useful for researchers, meteorologists, and government agencies who need high-resolution weather forecasts and simulations tailored to this specific geographic area. The configuration details, such as grid resolution, atmospheric physics schemes, and input data sources, can vary depending on the specific version of WRF and the goals of the modeling project. In this study, we have predominantly adhered to the default input configuration, making minimal alterations or adjustments to the source code or input file. Each run was conducted with full cores (64 cores per node) and from single node to sixteen nodes. The scalability tests were conducted and the performance metric in “sec” was noted.
Figure 2: The scaling performance of the WRF CONUS dataset using the AMD EPYC 9354 processor, with a focus on performance compared to a single node.
The AOCC-compiled binaries of WRF show linear scaling from a single node to sixteen nodes on 9354 processors for the new CONUS v4.4 dataset. For the best performance with WRF, the tile size, the number of processes, and the number of threads per process should be carefully considered. Because the application is constrained by memory and DRAM bandwidth, we opted for the latest DDR5 4800 MT/s DRAM for our test evaluations. It is also crucial to consider the BIOS settings, particularly the SubNUMA configuration, as these settings can significantly influence the performance of memory-bound applications, potentially leading to improvements of one to five percent. For more detailed BIOS tuning recommendations, please see our previous blog post on optimizing BIOS settings for optimal performance.
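When exploring processes and threads per process, it can help to list every combination that uses all cores of a node exactly once. The following is a minimal sketch under that assumption; the helper name `hybrid_layouts` is hypothetical, and which layout performs best for WRF still has to be determined by testing.

```python
def hybrid_layouts(cores_per_node):
    """List (mpi_ranks_per_node, threads_per_rank) pairs that use every core once."""
    return [
        (cores_per_node // threads, threads)
        for threads in range(1, cores_per_node + 1)
        if cores_per_node % threads == 0
    ]


# 64 cores per node, as used for the runs in this study.
for ranks, threads in hybrid_layouts(64):
    print(f"{ranks:>2} ranks x {threads:>2} threads")
```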
Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a powerful tool for HPC. It is specifically designed to harness the immense computational capabilities of HPC clusters and supercomputers. LAMMPS allows researchers and scientists to conduct large-scale molecular dynamics simulations with remarkable efficiency and scalability. In this study, we worked with the 15 June 2023 version of LAMMPS, which was compiled with AOCC 4.0.0 and OpenMPI 4.1.4. For successful compilation and optimization with the AOCC compilers, additional flags such as '-O3 -znver4' were added.
We opted for the non-default package, which offers optimized atom pair styles. We also ran some benchmarks that are not supported by the default package to check their performance and scaling. Our performance metric for this benchmark is nanoseconds per day, where a higher value is better.
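For readers unfamiliar with the nanoseconds-per-day metric, it is simply the simulated time divided by the wall-clock time, scaled to one day. A minimal sketch of the conversion follows; the function name and the example step count, timestep, and wall time are illustrative assumptions, not values from our runs.

```python
SECONDS_PER_DAY = 86_400


def ns_per_day(n_steps, timestep_fs, wall_seconds):
    """Convert a molecular dynamics run into simulated nanoseconds per day."""
    simulated_ns = n_steps * timestep_fs * 1e-6  # 1 fs = 1e-6 ns
    return simulated_ns * SECONDS_PER_DAY / wall_seconds


# Hypothetical run: 50,000 steps of 2 fs that took 86.4 s of wall time.
print(f"{ns_per_day(50_000, 2.0, 86.4):.1f} ns/day")
```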
Two factors were considered when compiling data for comparison: the number of nodes and the core count. Figure 3 shows the performance improvement observed on the 9354 processor with 64 cores per node.
Figure 3: The scaling performance of the LAMMPS datasets using the AMD EPYC 9354 processor, with a focus on performance compared to a single node.
Figure 3 shows the scaling of the different LAMMPS datasets. Scaling improved significantly as we increased the atom count and step size. The two datasets with more than 3 million atoms, EAM and HECBioSim, scaled better than the other datasets.
GROMACS, a high-performance molecular dynamics software, is a vital tool for HPC environments. Tailored for HPC clusters and supercomputers, GROMACS specializes in simulating the intricate movements and interactions of atoms and molecules. Researchers in diverse fields, including biochemistry and biophysics, rely on its efficiency and scalability to explore complex molecular processes. GROMACS is used for its ability to harness the immense computational power of HPC, allowing scientists to conduct intricate simulations that unveil critical insights into atomic-level behaviors, from biomolecules to chemical reactions and materials. In this study, we worked with GROMACS version 2023.1, which was compiled with AOCC 4.0.0 and OpenMPI 4.1.4. For successful compilation and optimization with the AOCC compilers, additional flags such as '-O3 -znver4' were added.
We've curated a range of datasets for our benchmarking assessments. First, we included "water GMX_50 1536" and "water GMX_50 3072," which represent simulations involving water molecules. These simulations are pivotal for gaining insights into solvation, diffusion, and water's behaviour in diverse conditions. Next, we incorporated "HECBIOSIM 14K" and "HECBIOSIM 30K" datasets, which were specifically chosen for their ability to investigate intricate systems and larger biomolecular structures. Lastly, we included the "PRACE Lignocellulose" dataset, which aligns with our benchmarking objectives, particularly in the context of lignocellulose research. These datasets collectively offer a diverse array of scenarios for our benchmarking assessments.
Our performance assessment was based on nanoseconds per day (ns/day) for each dataset, which provides insight into computational efficiency. Additionally, we tuned the mdrun parameters (for example, ntomp, dlb, tunepme, and nsteps) in every test run to ensure accurate and reliable results. We examined scalability by running tests from a single node up to sixteen nodes.
Figure 4: The scaling performance of the GROMACS datasets using the AMD EPYC 9354 processor, with a focus on performance compared to a single node.
For ease of comparison across the various datasets, the relative performance has been combined into a single graph. However, it is worth noting that each dataset behaves differently when performance is considered, because each uses different molecular topology input files (tpr) and configuration files.
We achieved the expected performance scalability for GROMACS up to eight nodes for the larger datasets. All cores in each server were used while running these benchmarks. The performance increase is close to linear across all the dataset types; however, scaling drops off at higher node counts because of the smaller dataset sizes and the number of simulation iterations.
CP2K is a versatile computational software package that covers a wide range of quantum chemistry and solid-state physics simulations, including molecular dynamics. It is not strictly limited to molecular dynamics but is instead a comprehensive tool for various computational chemistry and materials science tasks. While CP2K is widely used for molecular dynamics simulations, it can also perform tasks like electronic structure calculations, ab initio molecular dynamics, hybrid quantum mechanics/molecular mechanics (QM/MM) simulations, and more. In this study, we worked with CP2K version 2023.1, which was compiled with AOCC 4.0.0 and OpenMPI 4.1.4. For successful compilation and optimization with the AOCC compilers, additional flags such as '-O3 -znver4' were added.
In our study focusing on high-performance computing (HPC), we utilized specific datasets optimized for computational efficiency. The first dataset, "H2O-DFT-LS-NREP-4,6," was configured for HPC simulations and calculations, emphasizing the modeling of water (H2O) using Density Functional Theory (DFT). The appended "NREP-4,6" parameter settings were fine-tuned to ensure efficient HPC performance. The second dataset, "H2O-64-RI-MP2," was exclusively crafted for HPC applications and revolved around the examination of a system comprising 64 water molecules (H2O). By employing the Resolution of Identity (RI) method in conjunction with the Møller–Plesset perturbation theory of second order (MP2), this dataset demonstrated the significant computational capabilities of HPC for conducting advanced electronic structure calculations within a high-molecule-count environment. We examined the scalability by conducting tests spanning from a single node to a total of sixteen nodes.
Figure 5: The scaling performance of the CP2K datasets using the AMD EPYC 9354 processor, with a focus on performance compared to a single node.
The datasets represent a single-point energy calculation employing linear-scaling Density Functional Theory (DFT). The system comprises 6144 atoms confined within a 39 Å^3 simulation box, which translates to a total of 2048 water molecules. To adjust the computational workload, you can modify the NREP parameter within the input file.
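To get a feel for how NREP changes the workload, the system size can be estimated with a short calculation. This sketch assumes that NREP replicates the base cell in each of the three spatial directions and that the 6144-atom system described above corresponds to NREP=4, which would imply a base cell of 32 water molecules. Both points are assumptions made for illustration, and the names below are hypothetical.

```python
BASE_WATER_MOLECULES = 32  # assumed size of the unreplicated cell (2048 / 4**3)
ATOMS_PER_WATER = 3        # two hydrogen atoms and one oxygen atom


def replicated_system(nrep):
    """Return (water_molecules, atoms) for a cell replicated nrep times per axis."""
    molecules = BASE_WATER_MOLECULES * nrep ** 3
    return molecules, molecules * ATOMS_PER_WATER


for nrep in (4, 6):
    molecules, atoms = replicated_system(nrep)
    print(f"NREP={nrep}: {molecules} water molecules, {atoms} atoms")
```

Because the system size grows with the cube of NREP, a small change in the parameter has a large effect on run time and memory use.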
Our benchmarking covers configurations of up to 16 compute nodes. Optimal performance is achieved when running NREP4 and NREP6 in hybrid mode, which combines MPI (Message Passing Interface) and OpenMP (Open Multi-Processing). This configuration exhibits the best scaling performance, particularly on 4 to 8 nodes. However, scaling beyond 8 nodes is not strictly linear. Figure 5 above depicts the results when using pure MPI, with 64 cores and a single thread per core.
When considering CPUs with equivalent core counts, the earlier AMD EPYC processors can deliver performance levels similar to their Genoa counterparts. However, achieving this performance parity may require doubling the number of nodes. To further enhance performance using AMD EPYC processors, we suggest optimizing the BIOS settings as outlined in our previous blog post and specifically disabling Hyper-threading for the benchmarks discussed in this article. For other workloads, we recommend conducting comprehensive testing and, if beneficial, enabling Hyper-threading. Additionally, for this performance study, we highly recommend the Mellanox NDR 200 interconnect for optimal results.
Fri, 30 Jun 2023 13:44:52 -0000
Dell added over a dozen next-generation systems to the extensive portfolio of Dell PowerEdge 16G servers. These new systems are designed to accelerate performance and reliability for powerful computing across core data centers, large-scale public clouds, and edge locations.
The new PowerEdge servers feature rack, tower, and multi-node form factors, supporting the new 4th-gen Intel Xeon Scalable processors (formerly codenamed Sapphire Rapids). Sapphire Rapids still supports the AVX-512 SIMD instructions, which allow for 32 DP FLOP/cycle. The upgraded Ultra Path Interconnect (UPI) link speed of 16 GT/s is expected to improve data movement between the sockets. In addition to core count and frequency, Sapphire Rapids-based Dell PowerEdge servers support DDR5 4800 MT/s RDIMMs with eight memory channels per processor, which is expected to improve the performance of memory bandwidth-bound applications.
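As a rough guide to what eight channels of DDR5 4800 MT/s can deliver, the theoretical peak memory bandwidth can be estimated as channels times transfer rate times bytes per transfer. The sketch below assumes a 64-bit (8-byte) data path per channel; the function name is hypothetical, and sustained bandwidth measured by STREAM will be lower than this peak.

```python
def peak_memory_bandwidth_gbs(channels, transfer_rate_mts,
                              bytes_per_transfer=8, sockets=1):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    bytes_per_transfer=8 assumes a 64-bit data path per memory channel.
    """
    return sockets * channels * transfer_rate_mts * bytes_per_transfer / 1000.0


per_socket = peak_memory_bandwidth_gbs(channels=8, transfer_rate_mts=4800)
dual_socket = peak_memory_bandwidth_gbs(channels=8, transfer_rate_mts=4800, sockets=2)
print(f"per socket: {per_socket:.1f} GB/s, dual socket: {dual_socket:.1f} GB/s")
```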
This blog provides synthetic benchmark results and recommended BIOS settings for the Sapphire Rapids-based Dell PowerEdge Server processors. This document contains guidelines that allow the customer to optimize their application for best energy efficiency and provides memory configuration and BIOS setting recommendations for the best out-of-the-box performance and scalability on the 4th Generation of Intel® Xeon® Scalable processor families.
Table 1 and Table 2 show the test bed hardware details and synthetic application details. Fifteen BIOS options were explored through application performance testing. These options can be set and unset with the Remote Access Controller Admin (RACADM) command in Linux or directly when the machines are in BIOS mode.
Use the following command to set the “HPC Profile” to get the best synthetic benchmark results.
racadm set bios.sysprofilesettings.WorkloadProfile HpcProfile && sudo racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW -e TIME_NA
Once the system is up, use the below command to verify if the setting is enabled.
racadm get bios.sysprofilesettings.WorkloadProfile
It should show workload profile set as HPCProfile. Please note that any changes made in BIOS settings on top of the “HPCProfile” will set this parameter to “Not Configured”, while keeping the other settings of “HPCProfile” intact.
Table 1. System details
Component | Dell PowerEdge R660 server (Air cooled) | Dell PowerEdge R760 server (Air cooled) | Dell PowerEdge C-Series (C6620) server (Direct Liquid Cooled) |
SKU | 8452Y | 6430 | 8480+ |
Cores/Socket | 36 | 32 | 56 |
Base Frequency (GHz) | 2 | 1.9 | 2 |
TDP (W) | 300 | 270 | 350 |
L3 Cache | 69.12 MB | 61.44 MB | 107.52 MB |
Operating System | RHEL 8.6 | RHEL 8.6 | RHEL 8.6 |
Memory | 1024 GB (16 x 64 GB) | 1024 GB (16 x 64 GB) | 512 GB (16 x 32 GB) |
BIOS | 1.0.1 | 1.0.1 | 1.0.1 |
CPLD | 1.0.1 | 1.0.1 | 1.0.1 |
Interconnect | NDR 400 | NDR 400 | NDR 400 |
Compiler | OneAPI 2023 | OneAPI 2023 | OneAPI 2023 |
Table 2. Synthetic benchmark applications details
Application Name | Version |
High-Performance Linpack (HPL) | Pre-built binary MP_LINPACK INTEL - 2.3 |
STREAM | |
High Performance Conjugate Gradient (HPCG) | Pre-built binary from INTEL oneAPI 2.3 |
OSU Micro-Benchmarks (Ohio State University) | |
In the present study, the HPL, STREAM, and HPCG synthetic benchmarks are run on a single node; because the OSU benchmark measures MPI operations, it requires a minimum of two nodes.
As shown in Table 2, four synthetic applications are tested on the test bed hardware (Table 1). They are HPL, STREAM, HPCG, and OSU. The details of performance of each application are given below:
HPL helps measure the floating-point computation efficiency of a system [1]. The details of the synthetic benchmarks can be found in the previous blog on Intel Ice Lake processors.
Figure 1. Performance values of HPL application for different processor models
The N and NB sizes used for the HPL benchmark are 348484 and 384, respectively, for the Intel Sapphire Rapids 6430 and 8452Y processors, and 246144 and 384, respectively, for the 8480 processor. The difference in N sizes is due to the difference in available memory: systems with Intel 6430 and 8452Y processors are equipped with 1024 GB of memory, while the 8480 processor system has 512 GB. The performance numbers were captured with different BIOS settings, as discussed above, and the difference between results is within 1-2%. The results with the HPC workload BIOS profile are shown in Figure 1. The 8452Y processor performs 1.09 times better than the Intel Sapphire Rapids 6430 processor, and the 8480 processor performs 1.65 times better.
The STREAM benchmark measures the sustainable memory bandwidth of a processor. In general, each STREAM array must be at least four times the total size of all last-level caches used in the run, or 1 million elements, whichever is larger. The STREAM array sizes used for the current study are 4×10^7 and 12×10^7 with full core utilization. The STREAM benchmark was also tested with 15 BIOS combinations, and the results depicted in Figure 2 are for the HPC workload profile BIOS test case. The STREAM TRIAD results are captured here in GB/sec. Results show improvement in performance compared to the Intel 3rd Generation Xeon Scalable processors, such as the 8380 and 6338. Comparing the 6430, 8452Y, and 8480 processors, the STREAM results with the 8452Y and 8480 Intel 4th Generation Xeon Scalable processors are, respectively, 1.12 and 1.24 times better than with the Intel 6430 processor.
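The relationship between available memory and the HPL problem size N can be sketched as follows. HPL factors a dense N x N matrix of double-precision (8-byte) values, so N is bounded by the square root of the memory budget divided by eight and is commonly rounded down to a multiple of NB. The function names and the 0.88 fraction are illustrative assumptions, and treating "GB" as 2^30 bytes is also an assumption.

```python
import math


def hpl_problem_size(memory_bytes, fraction, nb):
    """Largest N (a multiple of nb) whose N x N double matrix fits in the budget."""
    budget_elements = int(fraction * memory_bytes) // 8  # 8 bytes per double
    n_max = math.isqrt(budget_elements)
    return (n_max // nb) * nb


def hpl_memory_fraction(n, memory_bytes):
    """Fraction of memory occupied by an N x N matrix of doubles."""
    return 8 * n * n / memory_bytes


GIB = 2 ** 30
print(hpl_problem_size(512 * GIB, fraction=0.88, nb=384))
print(f"{hpl_memory_fraction(246144, 512 * GIB):.1%}")
print(f"{hpl_memory_fraction(348484, 1024 * GIB):.1%}")
```

The last two lines estimate how much of the installed memory the N values used in this study occupy under the stated assumptions.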
Figure 2. Performance values of STREAM application for different processor models
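The array sizing rule for STREAM described above can be written as a small helper. This is a sketch only: the function name is hypothetical, the elements are assumed to be 8-byte doubles, and the 64 MiB cache size in the example is a placeholder rather than the cache size of any processor in this study.

```python
def min_stream_array_elements(total_llc_bytes, element_bytes=8):
    """Minimum elements per STREAM array.

    Each array must be at least four times the combined last-level cache
    size, or 1 million elements, whichever is larger.
    """
    cache_rule = 4 * total_llc_bytes // element_bytes
    return max(cache_rule, 1_000_000)


# Hypothetical system with 64 MiB of total last-level cache.
print(min_stream_array_elements(64 * 2 ** 20))
```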
The HPCG benchmark aims to simulate the data access patterns of applications such as sparse matrix calculations, assessing the impact of memory subsystem and internal connectivity constraints on the computing performance of High-Performance Computers, or supercomputers. The different problem sizes used in the study are 192, 256, 176, 168, and so on. Additionally, in this benchmark study, the variation in performance within different BIOS options was within 1–2 percent. Figure 3 shows the HPCG performance results for Intel Sapphire Rapids processors 6430, 8452Y and 8480. In comparison with the Intel 6430 processor, the 8452Y shows 1.02 times and the 8480 shows 1.12 times better performance.
Figure 3. Performance values of HPCG application for different processor models
OSU Micro Benchmarks are used for measuring the performance of MPI implementations, so we used two nodes connected to NDR200. The OSU benchmark determines uni-directional and bi-directional bandwidth, message rate, and latency between the nodes. The OSU benchmark was run on all three Intel processors (6430, 8452Y, and 8480) with a single core per node; however, this blog shows the results for only one system (the Intel 8480 processor) in Figures 4 through 7.
Figure 4. OSU Bi-Directional bandwidth chart for C6620_8480 Intel processor
Figure 5. OSU Uni-Directional bandwidth chart for C6620_8480 Intel processor
Figure 6. OSU Message bandwidth/Message rate chart for C6620_8480 Intel processor
Figure 7. OSU Latency chart for C6620_8480 Intel processor
All fifteen BIOS combinations were tested; the OSU benchmark also shows similar performance with a difference within a 1-2% delta.
The performance of the Intel Sapphire Rapids processors (6430, 8452Y, and 8480) was compared using the synthetic benchmark applications HPL, STREAM, HPCG, and OSU. Fifteen BIOS configurations were set on the system, and performance values for the different benchmarks were captured to identify the best BIOS configuration. The results show that the difference in performance across all the BIOS configurations is below 3 percent for every benchmark.
Therefore, the HPC workload profile provides better benchmark results with all the Intel Sapphire Rapids processors. Among the three Intel processors compared, the 8480 had the highest application performance, with the 8452Y in second place. The maximum difference in performance between processors was found for the HPL benchmark, where the 8480 Intel Sapphire Rapids processor delivers 1.65 times better results than the Intel 6430 processor.
Watch out for future application benchmark results on this blog! Visit our page for previous blogs.
Fri, 30 Jun 2023 13:44:52 -0000
With the release of 4th Gen AMD EPYC 9004 CPUs (code-named “Genoa”), Dell PowerEdge servers have been refreshed to support these latest processors. In this blog, we will present the results of a study evaluating the performance of HPC synthetic benchmarks with AMD 9354 processors on Dell’s latest PowerEdge dual socket 1U R6625 server and dual socket 2U R7625 server.
AMD Genoa is based on the new Zen4 micro-architecture built with 5nm fabrication technology. Major changes from its predecessor AMD EPYC 7003 CPUs (code-named “Milan”) include the support for DDR5 memory at speeds up to 4800 MT/s and PCIe Gen5. It supports up to 96 cores per socket and the L2 cache per core is doubled. Zen4 adds support for the AVX-512 instruction set. The implementation in Zen4 executes AVX-512 instructions in two cycles. Also, improvements are made in instructions per cycle (IPC).
Table 1. Test bed system configuration used for this benchmark study
Platform | Dell PowerEdge R6625 /R7625 |
Processor | AMD EPYC 9354 |
Cores | 32 cores/socket |
Base Frequency | 3.25 GHz |
Turbo Clock | Up to 3.8 GHz |
TDP | 280 W |
Configurable TDP | 240-300 W |
L1 Cache | 64 KB per core |
L2 Cache | 1 MB per core |
L3 Cache | 256 MB (shared) |
Memory | 32 GB x 24 DIMMs | 4800 MT/s |
Interconnect | NVIDIA Mellanox NDR 400 |
Operating System | RHEL 8.6 |
Linux Kernel | 4.18.0-372.9.1 |
BIOS/CPLD | 1.1.3/1.1.3 |
OFED | MLNX_OFED_LINUX-5.7-1.0.2.0 |
BIOS Workload Profile | HPC Profile |
Compiler | AOCC 4.0.0 and AOCL 4.0 |
OpenMPI | 4.1.5 |
Turbo Boost | ON |
We tested different combinations of BIOS options in this study to understand the potential performance improvements in synthetic benchmarks. We found that setting workload profile in BIOS as “HPCProfile” will give us the best performance on HPC synthetic benchmarks.
This workload profile option can be found in System Profile Settings of BIOS. It is a collection of multiple BIOS options recommended for HPC workloads. This setting can be updated using the RACADM CLI tool. Use the following command to enable “HPCProfile” and reboot your system using racadm.
racadm set bios.sysprofilesettings.WorkloadProfile HpcProfile && sudo racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW -e TIME_NA
Once the system is up, use the command below to verify that the setting is enabled.
racadm get bios.sysprofilesettings.WorkloadProfile
It should show the workload profile as HPCProfile. Note that any changes made in BIOS settings on top of the “HPCProfile” will set this parameter to “Not Configured”, keeping the other settings of “HPCProfile” intact.
We have studied the impact of different BIOS options on top of “HPCProfile”. All the performance numbers mentioned in this blog are with workload profile set to “HPCProfile”.
Table 2. Synthetic benchmarks application details
S.No. | Application Name | Version Used |
High-Performance Linpack (HPL) | ||
v7.1 |
We used prebuilt AMD Optimized binaries for HPL, Stream, and HPCG benchmarks, which are optimized for AMD’s Zen4 architecture. OSU was compiled using AOCC 4.0 compiler. Benchmark information and performance numbers are mentioned in the following section.
HPL: This benchmark solves a random system of linear equations in double precision (64-bit) on distributed systems and reports the floating-point execution rate of the system.
In the HPL benchmark test, we used 94 percent of available memory for the problem size, with N=301440 and NB=384. We achieved ~3.75 TFLOPS across dual sockets, which is around 113 percent efficiency relative to the theoretical peak at the base frequency of the AMD 9354 processor. We monitored the frequency throughout the benchmark run and observed that the processor was able to sustain its turbo frequencies, which explains why the efficiency is above 100 percent for this processor. The average power consumption during the benchmark run was ~830 watts when the system profile in BIOS was set to the “HPCProfile” option. We obtained the best performance-per-watt results with this option.
Figure 1. HPL performance with AMD Genoa 9354 processor on 16G PowerEdge R6625 and R7625 servers
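The efficiency figure can be reproduced with a short calculation: the theoretical peak is sockets times cores per socket times base frequency times floating-point operations per cycle per core. The value of 16 double-precision FLOP per cycle per core for Zen4 is an assumption that is not stated in this blog, and the function names are hypothetical.

```python
def hpl_peak_gflops(sockets, cores_per_socket, base_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS at the base frequency."""
    return sockets * cores_per_socket * base_ghz * flops_per_cycle


def hpl_efficiency(measured_gflops, peak_gflops):
    """Measured HPL performance as a fraction of the theoretical peak."""
    return measured_gflops / peak_gflops


# 2 sockets x 32 cores at 3.25 GHz (Table 1); 16 DP FLOP/cycle/core is assumed.
peak = hpl_peak_gflops(sockets=2, cores_per_socket=32, base_ghz=3.25, flops_per_cycle=16)
print(f"peak {peak:.0f} GFLOPS, efficiency {hpl_efficiency(3750, peak):.0%}")
```

Under these assumptions, the measured ~3.75 TFLOPS works out to roughly 113 percent of the base-frequency peak, in line with the figure quoted above.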
STREAM: This synthetic benchmark is designed to measure sustainable memory bandwidth and a corresponding computation rate for four simple vector kernels: Copy, Scale, Add and Triad.
In the STREAM TRIAD benchmark test, we were able to reach up to ~752 GB/s when utilizing all available cores of the dual socket server. To learn more about the STREAM performance numbers on AMD MILAN based servers, please refer to our previous blog here.
Figure 2. STREAM performance with AMD Genoa 9354 processor on 16G PowerEdge R6625 and R7625 servers
HPCG: This benchmark project is an effort to create a new metric for ranking HPC systems. It is an internally I/O bound benchmark, intended to complement LINPACK benchmarks.
In the HPCG benchmark, we used nx=ny=nz=192 local sub grid dimensions to tune the problem size as per our system memory. We were able to reach ~115 Gflops of performance with AMD optimized binary for HPCG.
Figure 3. HPCG performance with AMD Genoa 9354 processor on 16G PowerEdge R6625 and R7625 servers
OSU Micro Benchmarks: These micro-benchmarks are widely used for measuring and evaluating the performance of MPI operations for point-to-point, multi-pair, and collective communications between the nodes.
In the OSU benchmark, we used two nodes connected with NDR400. We checked bidirectional bandwidth, unidirectional bandwidth, message rate, and latency between these two nodes. In a dual socket server, the socket connected to the network adapter card acts as local and the other acts as remote. We completed this test on both R6625 and R7625 servers for both remote and local latency and bandwidth. The results below were obtained from the R6625 server. All the OSU results shown below were run using a single core per node.
The Delta label in secondary axis represents the percentage difference between local and remote latency and bandwidth.
Figure 4. OSU Latency with AMD Genoa 9354 processor on Dell PowerEdge R6625 server
We achieved ~48 GB/s unidirectional bandwidth and ~87 GB/s of bidirectional bandwidth.
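To put these bandwidth numbers in context, they can be compared with the nominal line rate of the interconnect. The sketch below assumes that NDR400 provides 400 Gb/s in each direction and ignores protocol overhead; the function name is hypothetical.

```python
def link_utilization(measured_gbytes_per_s, line_rate_gbits_per_s, directions=1):
    """Measured bandwidth as a fraction of the nominal link rate.

    The line rate is given in gigabits per second and converted to
    gigabytes per second (8 bits per byte) before the comparison.
    """
    peak_gbytes_per_s = directions * line_rate_gbits_per_s / 8
    return measured_gbytes_per_s / peak_gbytes_per_s


print(f"unidirectional: {link_utilization(48, 400):.0%}")
print(f"bidirectional:  {link_utilization(87, 400, directions=2):.0%}")
```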
Figure 5. OSU message rate with AMD Genoa 9354 processor on Dell PowerEdge R6625 server
Figure 6. OSU bi-directional bandwidth with AMD Genoa 9354 processor on Dell PowerEdge R6625 server
Figure 7. OSU uni-directional bandwidth with AMD Genoa 9354 processor on 16G Dell PowerEdge R6625 server
We have seen a significant improvement in the performance of synthetic benchmarks using Genoa-based servers as compared to earlier Milan-based servers. Setting up the right BIOS parameters is important to achieve the best results on these servers. As part of our study, we tested different BIOS parameters, and our findings suggest that setting the workload profile to “HPCProfile” provides the best performance.
For future work, we plan to study performance improvements on HPC applications from different domains using these latest AMD processors and Dell PowerEdge servers.
Check back soon for the next blog.
Visit our website to read our previous blog on AMD Milan-based servers.
Wed, 08 Feb 2023 14:45:39 -0000
High Performance Computing (HPC) solves complex computational problems by doing parallel computations on multiple computers and performing research activities through computer modeling and simulations. Traditionally, HPC is deployed on bare-metal hardware, but due to advancements in virtualization technologies, it is now possible to run HPC workloads in virtualized environments. Virtualization in HPC provides more flexibility, improves resource utilization, and enables support for multiple tenants on the same infrastructure.
However, virtualization is an additional layer in the software stack and is often construed as impacting performance. This blog explains a performance study conducted by the Dell Technologies HPC and AI Innovation Lab in partnership with VMware. The study compares bare-metal and virtualized environments on multiple HPC workloads with Intel® Xeon® Scalable third-generation processor-based systems. Both the bare-metal and virtualized environments were deployed on the Dell HPC on Demand solution.
Figure 1: Cluster Architecture
To evaluate the performance of HPC applications and workloads, we built a 32-node HPC cluster using Dell PowerEdge R650 servers as compute nodes. The Dell PowerEdge R650 is a 1U dual-socket server with Intel® Xeon® Scalable third-generation processors. The cluster was configured to use both bare-metal and virtual compute nodes (running VMware vSphere 7). Both bare-metal and virtualized nodes were attached to the same head node.
Figure 1 shows a representative network topology of this cluster. The cluster was connected to two separate physical networks. The compute nodes were spread across two sets of racks, and the cluster consisted of the following two networks:
The Virtual Machine (VM) configuration details for optimal performance settings were captured in an earlier blog. In addition to the settings noted in the previous blog, some additional BIOS tuning options such as Snoop Hold Off, Sub-NUMA Cluster (SNC), and LLC Prefetch were also tested. Snoop Hold Off (set to 2 K cycles) and SNC helped performance across most of the tested applications and microbenchmarks for both the bare-metal and virtual nodes. Note that enabling SNC in the server BIOS without configuring SNC correctly in the VM might result in performance degradation.
Table 1 shows the system environment details used for the study.
Table 1: System configuration details for the bare-metal and virtual clusters
Machine function | Component |
Platform | PowerEdge R650 server |
Processor | Two Intel® Xeon® third Generation 6348 (28 cores @ 2.6 GHz) |
Number of cores | Bare-Metal: 56 cores; Virtual: 52 vCPUs (four cores reserved for ESXi) |
Memory | Sixteen 32 GB DDR4 DIMMs @ 3200 MT/s; Bare-Metal: all 512 GB used; Virtual: 440 GB reserved for the VM |
HPC Network NIC | 100 GbE NVIDIA Mellanox ConnectX-6 |
Service Network NIC | 10/25 GbE NVIDIA Mellanox ConnectX-5 |
HPC Network Switch | Dell PowerSwitch Z9332 with OS 10.5.2.3 |
Service Network Switch | Dell PowerSwitch S5248F-ON |
Operating system | Rocky Linux release 8.5 (Green Obsidian) |
Kernel | 4.18.0-348.12.2.el8_5.x86_64 |
Software – MPI | IntelMPI 2021.5.0 |
Software – Compilers | Intel OneAPI 2022.1.1 |
Software – OFED | OFED 5.4.3 (Mellanox FW 22.32.20.04) |
BIOS version | 1.5.5 (for both bare-metal and virtual nodes) |
The following table outlines the set of HPC applications used for this study from different domains, such as Computational Fluid Dynamics (CFD), Weather, and Life Sciences. Different benchmark datasets were used for each of the applications, as detailed in Table 2.
Table 2: Application and benchmark dataset details
Application | Vertical Domain | Benchmark Dataset |
WRF | Weather and Environment | Conus 2.5KM, Maria 3KM |
OpenFOAM | Manufacturing - Computational Fluid Dynamics (CFD) | |
GROMACS | Life Sciences – Molecular Dynamics | |
LAMMPS | Molecular Dynamics | EAM metallic Solid Benchmark (1M, 3M and 8M Atoms); HECBIOSIM – 3M Atoms |
All the application results shown here were run on both bare-metal and virtual environments using the same binary compiled with Intel Compiler and run with Intel MPI. Multiple runs were done to ensure consistency in the performance. Basic synthetic benchmarks like High Performance Linpack (HPL), Stream, and OSU MPI Benchmarks were run to ensure that the cluster was operating efficiently before running the HPC application benchmarks. For the study, all the benchmarks were run in a consistent, optimized, and stable environment across both the bare-metal and virtual compute nodes.
Each compute node, with two Intel® Xeon® Scalable third-generation processors (Ice Lake 6348), has 56 cores. On the virtual nodes, four cores were reserved for the virtualization hypervisor (ESXi), leaving the remaining 52 cores to run benchmarks. All the results shown here compare 56-core runs on bare-metal nodes with 52-core runs on virtual nodes.
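As a rough sanity check on this comparison, the core counts above imply that each virtual node runs with slightly fewer cores than its bare-metal counterpart. The following sketch only restates that arithmetic; the variable and function names are illustrative and are not part of the study.

```bash
#!/bin/sh
LC_ALL=C; export LC_ALL   # keep a dot as the decimal separator in awk output

# Core counts per node, taken from Table 1.
BARE_METAL_CORES=56   # 2 sockets x 28 cores
VIRTUAL_VCPUS=52      # 4 cores reserved for ESXi

# core_deficit_pct BARE VIRT -> percentage of cores not available to the VM
core_deficit_pct() {
    awk -v b="$1" -v v="$2" 'BEGIN { printf "%.1f\n", (b - v) / b * 100 }'
}

echo "Cores reserved for ESXi: $((BARE_METAL_CORES - VIRTUAL_VCPUS))"
echo "Core deficit on virtual nodes: $(core_deficit_pct "$BARE_METAL_CORES" "$VIRTUAL_VCPUS")%"
```

About 7 percent of the cores are unavailable to the virtual runs, which is worth keeping in mind when reading the relative results that follow.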
To ensure better scaling and performance, multiple combinations of threads and MPI ranks were tried based on applications. The best results are used to show the relative speedup between both the bare-metal and virtual systems.
Figure 2: Performance comparison between bare-metal and virtual nodes for WRF
Figure 3: Performance comparison between bare-metal and virtual nodes for OpenFOAM
Figure 4: Performance comparison between bare-metal and virtual nodes for GROMACS
Figure 5: Performance comparison between bare-metal and virtual nodes for LAMMPS
The above results indicate that all the MPI applications running in a virtualized environment are close in performance to the bare-metal environment if proper tuning and optimizations are used. The performance delta, running from a single node up to 32 nodes, is within the 10% range for all the applications. This delta shows no major impact on scaling.
In a virtualized multitenant HPC environment, multiple tenants are expected to run multiple concurrent instances of the same or different applications. To simulate this configuration, we conducted a concurrency test by making multiple copies of the same workload and running them in parallel. This test checks whether any performance degradation appears in comparison with the baseline result. To make the concurrency tests meaningful, we expanded the virtual cluster to 48 nodes by converting 16 bare-metal nodes to virtual nodes. The baseline was an 8-node run with no other workload running across the 48-node virtual cluster. After that, six copies of the same workload were run simultaneously across the virtual cluster, and the results were compared for all the applications.
The concurrency was tested in two ways. In the first test, all eight nodes running a single copy were placed in the same rack. In the second test, the nodes running a single job were spread across two racks to see if any performance difference was observed due to additional communication over the network.
Figures 6 to 13 capture the results of the concurrency tests. The results show no performance degradation.
Figure 6: Concurrency Test 1 for WRF
Figure 7: Concurrency Test 2 for WRF
Figure 8: Concurrency Test 1 for OpenFOAM
Figure 9: Concurrency Test 2 for OpenFOAM
Figure 10: Concurrency Test 1 for GROMACS
Figure 11: Concurrency Test 2 for GROMACS
Figure 12: Concurrency Test 1 for LAMMPS
Figure 13: Concurrency Test 2 for LAMMPS
Another set of concurrency tests was conducted by running different applications (WRF, GROMACS, and OpenFOAM) simultaneously in the virtual environment. In this test, two eight-node copies of each application ran concurrently across the virtual cluster to determine whether any performance variation occurs while running multiple parallel applications on the virtual nodes. No performance degradation was observed in this scenario either, when compared to the individual application baseline run with no other workload running on the cluster.
Figure 14: Concurrency test with multiple applications running in parallel
In addition to the benchmark testing, this system has been certified as an Intel® Select Solution for Simulation and Modeling. Intel Select Solutions are workload-optimized configurations that Intel benchmark-tests and verifies for performance and reliability. These solutions can be deployed easily on premises and in the cloud, providing predictability and scalability.
All Intel Select Solutions are a tailored combination of Intel data center compute, memory, storage, and network technologies that deliver predictable, trusted, and compelling performance. Each solution offers assurance that the workload will work as expected, if not better. These solutions can save individual businesses from investing the resources that might otherwise be used to evaluate, select, and purchase the hardware components to gain that assurance themselves.
The Dell HPC On Demand solution is one of a select group of prevalidated, tested solutions that combine third-generation Intel® Xeon® Scalable processors and other Intel technologies into a proven architecture. These certified solutions can reduce the time and cost of building an HPC cluster, lowering hardware costs by taking advantage of a single system for both simulation and modeling workloads.
Running an HPC application necessitates careful consideration for achieving optimal performance. The main objective of the current study is to use appropriate tuning to bridge the performance gap between bare-metal and virtual systems. With the right settings on the tested HPC applications (see Overview), the performance difference between virtual and bare-metal nodes for the 32 node tests is less than 10%. It is therefore possible to successfully run different HPC workloads in a virtualized environment to leverage benefits of virtualization features. The concurrency testing helped to demonstrate that running multiple applications simultaneously in the virtual nodes does not degrade performance.
To learn more about our previous work on HPC virtualization on Cascade Lake, see the Performance study of a VMware vSphere 7 virtualized HPC cluster.
The authors thank Savitha Pareek from Dell Technologies, Yuankun Fu from VMware, Steven Pritchett, and Jonathan Sommers from R Systems for their contribution in the study.
Mon, 12 Sep 2022 12:11:52 -0000
The PowerEdge R7525 server can support three AMD Instinct™ MI210 GPUs, which makes it ideal for HPC workloads. Furthermore, using the PowerEdge R7525 server to power AMD Instinct MI210 GPUs (built with the 2nd Gen AMD CDNA™ architecture) offers improvements on FP64 operations along with the robust capabilities of the AMD ROCm™ 5 open software ecosystem. Overall, the PowerEdge R7525 server with the AMD Instinct MI210 GPU delivers exceptional double-precision performance and leading total cost of ownership.
Figure 1: Front view of the PowerEdge R7525 server
We performed and observed multiple benchmarks with AMD Instinct MI210 GPUs populated in a PowerEdge R7525 server. This blog shows the performance of LINPACK and the OpenMM customizable molecular simulation libraries with the AMD Instinct MI210 GPU and compares the performance characteristics to the previous generation AMD Instinct MI100 GPU.
The following table provides the configuration details of the PowerEdge R7525 system under test (SUT):
Table 1. SUT hardware and software configurations
Component | Description |
Processor | AMD EPYC 7713 64-Core Processor |
Memory | 512 GB |
Local disk | 1.8 TB SSD |
Operating system | Ubuntu 20.04.3 LTS |
GPU | 3xMI210/MI100 |
Driver version | 5.13.20.22.10 |
ROCm version | ROCm-5.1.3 |
Processor Settings > Logical Processors | Disabled |
System profiles | Performance |
NUMA node per socket | 4 |
HPL | rochpl_rocm-5.1-60_ubuntu-20.04 |
OpenMM | 7.7.0_49 |
The following table contains the specifications of AMD Instinct MI210 and MI100 GPUs:
Table 2: AMD Instinct MI100 and MI210 PCIe GPU specifications
GPU architecture | AMD Instinct MI210 | AMD Instinct MI100 |
Peak Engine Clock (MHz) | 1700 | 1502 |
Stream processors | 6656 | 7680 |
Peak FP64 (TFlops) | 22.63 | 11.5 |
Peak FP64 Tensor DGEMM (TFlops) | 45.25 | 11.5 |
Peak FP32 (TFlops) | 22.63 | 23.1 |
Peak FP32 Tensor SGEMM (TFlops) | 45.25 | 46.1 |
Memory size (GB) | 64 | 32 |
Memory Type | HBM2e | HBM2 |
Peak Memory Bandwidth (GB/s) | 1638 | 1228 |
Memory ECC support | Yes | Yes |
TDP (Watt) | 300 | 300 |
HPL measures the floating-point computing power of a system by solving a uniformly random system of linear equations in double precision (FP64) arithmetic, as shown in the following figure. The HPL binary used to collect results was compiled with ROCm 5.1.3.
Figure 2: LINPACK performance with AMD Instinct MI100 and MI210 GPUs
The following figure shows the power consumption during a single HPL run:
Figure 3: LINPACK power consumption with AMD Instinct MI100 and MI210 GPUs
We observed a significant improvement in HPL performance on the AMD Instinct MI210 GPU over the AMD Instinct MI100 GPU. A single MI210 GPU delivers 18.2 TFLOPS, which is approximately 2.7 times higher than the MI100 result (6.75 TFLOPS). This improvement is due to the AMD CDNA2 architecture on the AMD Instinct MI210 GPU, which has been optimized for FP64 matrix and vector workloads. Also, the MI210 GPU has more memory, so the problem size (N) used here is larger than the one used for the AMD Instinct MI100 GPU.
As shown in Figure 2, the AMD Instinct MI210 GPU shows almost linear scalability in HPL on single-node multi-GPU runs, and it scales better than the previous-generation AMD Instinct MI100 GPU. Both GPUs have the same TDP, with the AMD Instinct MI210 GPU delivering approximately three times the performance, so the performance per watt of the PowerEdge R7525 system is also approximately three times higher. Figure 3 shows the power consumption characteristics in one HPL run cycle.
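The single-GPU figures quoted above can be related to each other and to the Peak FP64 row of Table 2 with a few lines of arithmetic. This is only an illustrative sketch: comparing against the non-tensor Peak FP64 value (rather than the tensor DGEMM peak) is an assumption made here, and the function names are hypothetical.

```bash
#!/bin/sh
LC_ALL=C; export LC_ALL   # keep a dot as the decimal separator in awk output

# Single-GPU HPL results reported above (TFLOPS) and the Peak FP64
# values from Table 2 (TFLOPS).
MI210_HPL=18.2;  MI210_PEAK=22.63
MI100_HPL=6.75;  MI100_PEAK=11.5

# ratio A B -> A / B with two decimals
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# pct_of_peak MEASURED PEAK -> measured value as a percentage of peak
pct_of_peak() {
    awk -v m="$1" -v p="$2" 'BEGIN { printf "%.1f\n", m / p * 100 }'
}

echo "MI210 vs MI100 speedup: $(ratio "$MI210_HPL" "$MI100_HPL")x"
echo "MI210 HPL as % of Peak FP64: $(pct_of_peak "$MI210_HPL" "$MI210_PEAK")"
echo "MI100 HPL as % of Peak FP64: $(pct_of_peak "$MI100_HPL" "$MI100_PEAK")"
```

Under that assumption, the MI210 result sits noticeably closer to its listed peak than the MI100 result does, which is consistent with the architectural explanation given above.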
OpenMM is a high-performance toolkit for molecular simulation. It can be used as a library or as an application. It includes extensive language bindings for Python, C, C++, and even Fortran. The code is open source and actively maintained on GitHub and licensed under MIT and LGPL.
Figure 4: OpenMM double-precision performance with AMD Instinct MI100 and MI210 GPUs
Figure 5: OpenMM single-precision performance with AMD Instinct MI100 and MI210 GPUs
Figure 6: OpenMM mixed-precision performance with AMD Instinct MI100 and MI210 GPUs
We tested OpenMM with seven datasets to validate double, single, and mixed precision. We observed exceptional double precision performance with OpenMM on the AMD Instinct MI210 GPU compared to the AMD Instinct MI100 GPU. This improvement is due to the AMD CDNA2 architecture on the AMD Instinct MI210 GPU, which has been optimized for FP64 matrix and vector workloads.
The AMD Instinct MI210 GPU shows an impressive performance improvement in FP64 workloads. These workloads benefit because AMD has doubled the width of its ALUs to a full 64 bits. This change allows FP64 operations to run at full speed in the new 2nd Gen AMD CDNA architecture. Applications and workloads that are designed to run on FP64 operations are expected to take full advantage of the hardware.
Thu, 06 Jul 2023 19:08:42 -0000
Today’s HPC environments have increased demands for high-speed storage. Storage has become the bottleneck in many workloads due to higher core-count CPUs, larger and faster memory, a faster PCIe bus, and increasingly faster networks. Parallel File Systems (PFS) typically address these high-demand HPC requirements. A PFS provides concurrent access to a single file or a set of files from multiple nodes, efficiently and securely distributing data to multiple LUNs across several storage servers.
These file systems use spinning media to provide the highest capacity at the lowest cost. However, often the speed and latency of spinning media cannot keep up with the demands of many modern HPC workloads. The use of flash technology (that is, NVMe) in the form of burst buffers, faster tiers, or even fast scratch (local or distributed) can mitigate this issue. HPC pixstor Storage offers a cost-effective, high-capacity tier and NVMe nodes as the component to address high-bandwidth demands and for the optional High Demand Metadata module.
This blog is part of a series for PFS solutions for HPC environments, in particular for the flexible, scalable, efficient, and reliable HPC pixstor Storage. Its focus is the upgrade to storage nodes using the new Dell PowerVault ME5084 arrays, which provide a significant boost in performance compared to the previous generation (ME4084 array).
Note: Because arcastream changed its branding to all lowercase characters, we have modified instances of “arcastream,” “pixstor,” and “ngenea” accordingly.
The following figure shows the architecture for the new generation of the Dell Validated Design for HPC pixstor Storage. It uses Dell PowerEdge R650, R750, and R7525 servers and the new PowerVault ME5084 storage arrays, with the pixstor 6.0 software from our partner company arcastream.
Figure 1 Reference Architecture
Optional PowerVault ME484 EBOD arrays can increase the capacity of the solution as SAS additions to the PowerVault ME5084 storage arrays. The pixstor software includes the widespread General Parallel File System (GPFS), also known as Spectrum Scale, as the PFS component that is considered software-defined storage due to its flexibility and scalability. In addition, the pixstor software includes many other arcastream software components such as advanced analytics, simplified administration and monitoring, efficient file search, advanced gateway capabilities, and many others.
The main components of the pixstor solution are:
Figure 2 PowerEdge R750 storage nodes - Slot allocation
Figure 3 ME5084 array - Controllers and SAS ports
The following figure shows the back of the ME484 expansion array.
Figure 4 ME484 - I/O Module and SAS ports
This solution was released with the latest 3rd Generation Intel Xeon Scalable CPUs, also known as Ice Lake, and the fastest RAM available (3200 MT/s). The following table lists the main components for the solution. Some discrepancies exist between the planned BOM and the actual test hardware because only a few CPU models, not including the planned life-cycle model, were available as prerelease (production-level) hardware for our project.
The At release column lists the components planned to be used at release and available to customers with the solution. The Test bed column lists the components actually used for characterizing the performance of the solution. The drives listed for data (12 TB NLS) were used for performance characterization, but all supported HDDs and SSDs in the PowerVault ME5 Support Matrix can be used for the solution. Because the ME5 controllers are no longer the first bottleneck of the backend storage, using drives with a higher rated speed (10K, 15K, and SSDs) might provide some increase in sequential performance (a maximum throughput gain of about 30 to 35 percent is expected), can provide better random IOPS, and might improve create and remove metadata operations. For full high-speed network redundancy, two high-speed switches must be used (QM8700 for IB or SN3700 for GbE), and each switch must be connected to one CX6 adapter on each server.
The listed software components describe the versions used during the initial testing. However, these software versions might change over time in between official releases to include important fixes, support for new hardware components, or addition of important features.
The table also lists the possible data HDDs and SSDs, which are taken from the Dell PowerVault ME5 Support Matrix.
Table 1. Components used at release time and in the test bed
Solution component | At release | Test bed | |
Internal management switch | Dell PowerSwitch N2248X-ON GbE | PowerSwitch S3048-ON | |
Data storage subsystem | 1 x to 4 x PowerVault ME5084 arrays | 2 x Dell PowerVault ME5084 arrays | |
Optional 4x PowerVault ME484 (one per ME5084 array) 80 – 12 TB 3.5" NL SAS3 HDD drives Alternative options: 15K RPM: 900GB; 10K RPM: 1.2TB, 2.4 TB SSD: 960GB, 1.92 TB, 3.84 TB; NLS: 4 TB, 8 TB, 12 TB, 16 TB, 20 TB 8 LUNs, linear 8+2 RAID 6, chunk size 512 KiB 4 - 1.92 TB (or 3.84 TB or 7.68 TB) SAS3 SSDs per ME5084 array for metadata – 2 x RAID 1 (or 4 - Global HDD spares, if optional HDMD is used) | |||
Optional HDMD storage subsystem | One or more pairs of NVMe-tier servers | ||
RAID storage controllers | Duplex 12 Gbps SAS | ||
Capacity without expansion (with 12 TB HDDs) | Raw: 4032 TB (3667 TiB or 3.58 PiB) Formatted: approximately 3072 TB (2794 TiB or 2.73 PiB) | ||
Capacity with expansion (Large) (12 TB HDDs) | Raw: 8064 TB (7334 TiB or 7.16 PiB) Formatted: approximately 6144 TB (5588 TiB or 5.46 PiB) | ||
Processor | Gateway/ngenea | 2 x Intel Xeon Gold 6326 2.9 GHz, 16C/32T, 11.2GT/s, 24M Cache, Turbo, HT (185 W) DDR4-3200 | 2 x Intel Xeon Platinum 8352Y 2.2 GHz, 32C/64T, 11.2GT/s, 48M Cache, Turbo, HT (205 W) DDR4-3200 |
Storage node | |||
Management node | 2x Intel Xeon Gold 6330 2 GHz, 28C/56T, 11.2GT/s, 42M Cache, Turbo, HT (185 W) DDR4-2933 | ||
R650 NVMe Nodes | 2x Intel Xeon Gold 6354 3.00 GHz, 18C/36T, 11.2GT/s, 39M Cache, Turbo, HT (205 W) DDR4-3200 | ||
Optional High Demand Metadata | 2x Intel Xeon Gold 6354 3.00 GHz, 18C/36T, 11.2GT/s, 39M Cache, Turbo, HT (205 W) DDR4-3200 | ||
R750 NVMe nodes | | 2x Intel Xeon Platinum 8352Y, 2.2 GHz, 32C/64T, 11.2GT/s, 48M Cache, Turbo, HT (205 W) DDR4-3200 | |
R7525 NVMe nodes | 2 x AMD EPYC 7302 3.0 GHz, 16C/32T, 128M L3 (155 W) | 2 x AMD 7H12 2.6 GHz, 64C/64T 256M L3 (280 W) | |
Memory | Gateway/ngenea | 16 x 16 GiB 3200 MT/s RDIMMs (256 GiB) | |
Storage node | |||
Management node | |||
Operating system | Red Hat Enterprise Linux 8.5 | ||
Kernel version | 4.18.0-348.23.1.el8_5.x86_64 | ||
pixstor software | 6.0.3.1-1 | ||
Spectrum Scale (GPFS) | Spectrum Scale (GPFS) 5.1.3-1 | ||
OFED version | Mellanox OFED 5.6-1.0.3.3 | ||
High-performance NIC | All: 2 x Dell OEM ConnectX-6 Single Port HDR VPI InfiniBand, Low Profile Gateway and ngenea Nodes: 4x CX6 VPI adapters, 2x FS & 2x External | ||
High-performance switch | All: 2 x Dell OEM ConnectX-6 Single Port HDR VPI InfiniBand, Low Profile Gateway and ngenea Nodes: 4x CX6 VPI adapters, 2x FS & 2x External | ||
Local Disks (operating system and analysis/monitoring) | NVMe servers: BOSS-S2 with 2x M.2 240 GB in RAID 1 Other servers: 3x 480 GB SSD SAS3 (RAID 1 + HS) for operating system with PERC H345 front RAID controller | ||
Systems management | iDRAC9 Enterprise + Dell OpenManage 10.0.1-4561 |
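The capacity rows in Table 1 can be reproduced from the drive layout described in the same table. The sketch below is a worked example under stated assumptions: 84 drive slots per enclosure is inferred from the ME5084 and ME484 model names, each ME484 expansion is assumed to be populated and configured like an ME5084 array, and the function names are hypothetical.

```bash
#!/bin/sh
LC_ALL=C; export LC_ALL   # keep a dot as the decimal separator in awk output

SLOTS_PER_ENCLOSURE=84   # drive slots per enclosure (assumed from model name)
LUNS_PER_ENCLOSURE=8     # 8 LUNs per array
DATA_DRIVES_PER_LUN=8    # data drives in each 8+2 RAID 6 LUN
DRIVE_TB=12              # 12 TB NL-SAS HDDs

# raw_tb ENCLOSURES -> raw capacity in TB (decimal)
raw_tb() {
    echo $(( $1 * SLOTS_PER_ENCLOSURE * DRIVE_TB ))
}

# formatted_tb ENCLOSURES -> capacity of the RAID 6 data drives in TB
formatted_tb() {
    echo $(( $1 * LUNS_PER_ENCLOSURE * DATA_DRIVES_PER_LUN * DRIVE_TB ))
}

# tb_to_tib TB -> same capacity in TiB, rounded to the nearest integer
tb_to_tib() {
    awk -v tb="$1" 'BEGIN { printf "%.0f\n", tb * 1e12 / 2^40 }'
}

# tb_to_pib TB -> same capacity in PiB with two decimals
tb_to_pib() {
    awk -v tb="$1" 'BEGIN { printf "%.2f\n", tb * 1e12 / 2^50 }'
}

# 4 enclosures = four ME5084 arrays; 8 = four ME5084 plus four ME484.
for enclosures in 4 8; do
    raw=$(raw_tb "$enclosures")
    fmt=$(formatted_tb "$enclosures")
    echo "$enclosures enclosures: raw $raw TB ($(tb_to_tib "$raw") TiB, $(tb_to_pib "$raw") PiB)," \
         "formatted $fmt TB ($(tb_to_tib "$fmt") TiB, $(tb_to_pib "$fmt") PiB)"
done
```

The difference between raw and formatted capacity in this model comes from the two parity drives in each 8+2 LUN and from the four slots per enclosure that do not hold data drives.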
To characterize the new solution component (ME5084 array), we used the following benchmarks:
A delay in the delivery of the ME5084 arrays needed for the update of the solution limited the number of ME5 prototypes available for this work. Only two ME5084 arrays were used for the benchmark tests, which is the same as a Medium configuration. However, to compare results to the previous generation of the PowerVault array (ME4084), all IOzone and IOR results were extrapolated to a Large configuration by multiplying the results by 2. When the additional ME5084 arrays are delivered, all benchmark tests will be repeated on the Large configuration, and then again using ME484 expansions.
For these benchmarks, the test bed included the clients in the following table:
Table 2 Client test bed
Component | Description |
Number of client nodes | 16 |
Client node | C6420 |
Processors per client node | 11 nodes with 2 x Intel Xeon Gold 6230 20 Cores @ 2.1 GHz; 5 nodes with 2 x Intel Xeon Gold 6248 20 Cores @ 2.4 GHz |
Memory per client node | 6230 nodes with 12 x 16 GiB 2933 MT/s RDIMMs (192 GiB); 6248 nodes with 12 x 16 GiB 2666 MT/s RDIMMs (192 GiB) |
BIOS | 2.8.2 |
Operating system | CentOS 8.4.2105 |
Operating system kernel | 4.18.0-305.12.1.el8_4.x86_64 |
pixstor software | 6.0.3.1-1 |
Spectrum Scale (GPFS) | 5.1.3-0 |
OFED version | MLNX_OFED_LINUX-5.4-1.0.3.0 |
CX6 FW | 8 nodes with Mellanox CX6 single port: 20.32.1010; 8 nodes with Dell OEM CX6 single port: 20.31.2006 |
Because there were only 16 compute nodes available for testing, when a higher number of threads was required, those threads were distributed equally on the compute nodes (that is, 32 threads = 2 threads per node, 64 threads = 4 threads per node, 128 threads = 8 threads per node, 256 threads = 16 threads per node, 512 threads = 32 threads per node, 1024 threads = 64 threads per node). The intention was to simulate a higher number of concurrent clients with the limited number of compute nodes. Because the benchmarks support a high number of threads, a maximum value up to 1024 was used (specified for each test), avoiding excessive context switching and other related side effects.
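The thread distribution described above is a plain division of the total thread count by the number of client nodes. The following minimal sketch reproduces that mapping; the function name is illustrative.

```bash
#!/bin/sh
NODES=16   # client nodes in the test bed (Table 2)

# threads_per_node TOTAL_THREADS -> threads placed on each client node
threads_per_node() {
    echo $(( $1 / NODES ))
}

for threads in 32 64 128 256 512 1024; do
    echo "$threads threads = $(threads_per_node "$threads") threads per node"
done
```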
Sequential N clients to N files performance was measured with IOzone version 3.492. The tests that we ran varied from a single thread up to 1024 threads in increments of powers of 2.
We minimized caching effects by setting the GPFS page pool tunable to 16 GiB on clients and using files larger than twice the memory size of servers and clients (8 TiB). Note that the GPFS page pool tunable sets the maximum amount of memory used for caching data, regardless of the amount of RAM that is installed and free. Whereas other Dell HPC solutions use a block size of 1 MiB for large sequential transfers, GPFS was formatted with 8 MiB blocks; therefore, that transfer size is used in the benchmark for optimal performance. The block size on the file system might seem too large and wasteful of space, but GPFS uses subblock allocation to prevent that situation. In the current configuration, each block was subdivided into 512 subblocks of 16 KiB each.
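To make the subblock arithmetic concrete, the sketch below derives the subblock size from the block size and shows how much space a small file would occupy if allocation is rounded up to whole subblocks. This is a simplified model for illustration only; it ignores metadata and any other allocation details of GPFS, and the function name is hypothetical.

```bash
#!/bin/sh
BLOCK_KIB=$((8 * 1024))       # 8 MiB file system block, expressed in KiB
SUBBLOCKS_PER_BLOCK=512
SUBBLOCK_KIB=$((BLOCK_KIB / SUBBLOCKS_PER_BLOCK))

# allocated_kib FILE_KIB -> space consumed when rounding up to whole subblocks
allocated_kib() {
    echo $(( ( ($1 + SUBBLOCK_KIB - 1) / SUBBLOCK_KIB ) * SUBBLOCK_KIB ))
}

echo "Subblock size: ${SUBBLOCK_KIB} KiB"
echo "A 4 KiB file occupies $(allocated_kib 4) KiB rather than ${BLOCK_KIB} KiB"
```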
The following commands were used to run the benchmark for write and read operations, where the $Threads variable is the number of threads used (1 to 1024 incremented in powers of 2), and threadlist is the file that assigned each thread to a node, using the round-robin method to spread the threads homogeneously across the 16 compute nodes.
To avoid any possible data caching effects from the clients, the total data size of the files was more than twice the total amount of RAM that the clients and servers have. That is, because each client has 128 GiB of RAM (total 2 TiB) and each server has 256 GiB (total 1 TiB), the total amount is 3 TiB, but 8 TiB of data were used. The 8 TiB were equally divided by the number of threads used.
./iozone -i0 -c -e -w -r 8M -s ${Size}G -t $Threads -+n -+m ./threadlist
./iozone -i1 -c -e -w -r 8M -s ${Size}G -t $Threads -+n -+m ./threadlist
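The inputs to the commands above can be prepared along the following lines. This is a hedged sketch rather than the scripts used for the study: the host names (node01 to node16), the working directory, and the path to the iozone binary are placeholders, and the file is written to a temporary directory by default instead of ./threadlist. The three-column layout (host, working directory, path to iozone) is the client-file format that iozone reads with -+m.

```bash
#!/bin/sh
TOTAL_GIB=8192                     # 8 TiB of data in total
NODES=16                           # client nodes
OUT_DIR=${OUT_DIR:-$(mktemp -d)}   # where the thread list is written

# size_per_thread_gib THREADS -> value for ${Size} in the iozone commands
size_per_thread_gib() {
    echo $(( TOTAL_GIB / $1 ))
}

# write_threadlist THREADS FILE -> one line per thread, round-robin over nodes.
# Host names and paths are placeholders, not the ones from the test bed.
write_threadlist() {
    i=0
    : > "$2"
    while [ "$i" -lt "$1" ]; do
        printf 'node%02d /mmfs1/perftest /mmfs1/bench/iozone\n' \
            $(( i % NODES + 1 )) >> "$2"
        i=$(( i + 1 ))
    done
}

Threads=32
Size=$(size_per_thread_gib "$Threads")
write_threadlist "$Threads" "$OUT_DIR/threadlist"
echo "Threads=$Threads Size=${Size}G threadlist=$OUT_DIR/threadlist"
```

Because the thread counts are powers of 2, the 8192 GiB total divides evenly, so the integer division does not lose any data.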
Figure 5 N to N sequential performance
IMPORTANT: To allow comparison of ME5 array values to those previously obtained with ME4 arrays directly on the graph, the IOzone results of the Medium configuration (two ME5084 arrays) were multiplied by 2 to estimate the performance of a Large configuration (four ME5084 arrays).
From the results, we see that performance rises with the number of threads used and then reaches a plateau at eight threads for both read and write operations (the values at four threads are slightly smaller). Read performance rises a little more and then decreases to a more stable value. Write performance seems to be more stable than read performance, with a small variation around the sustained performance in the plateau.
The maximum performance for read operations was 31.4 GB/s at 16 threads, about 34.5 percent below the specification of the ME5084 arrays (approximately 48 GB/s), and well below the performance of HDR links (4 x 25 GB/s or 100 GB/s). Even if only one HDR link per storage server was used (ceiling speed of 50 GB/s), the value is higher than the specification of the 4 x ME5084 arrays. Write performance peaks at 512 threads with 27.8 GB/s, but a similar value was observed at 32 threads. The maximum value was about 30.5 percent below the ME5 specifications (40 GB/s). Initial ME5 testing used raw devices with SSDs in RAID (on ME5024 arrays) and HDDs in (8+2) RAID 6 (on ME5084 arrays), and it was able to reach the specifications of the controllers. Therefore, the current assumption is that seek times introduced by GPFS scattered access (random placement of 8 MiB blocks across the surface of all drives) are limiting the performance. Adding ME484 expansions can help reach performance closer to the specifications because having twice the LUNs reduces the effect of the seek time across the file system. Our next whitepaper will include performance for ME484 expansions, and benchmark tests will address this assumption.
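The shortfall figures quoted above follow from comparing each measured peak with the corresponding array specification. The short sketch below repeats that calculation from the numbers in the text; the function name is illustrative.

```bash
#!/bin/sh
LC_ALL=C; export LC_ALL   # keep a dot as the decimal separator in awk output

# pct_below MEASURED SPEC -> how far MEASURED falls short of SPEC, in percent
pct_below() {
    awk -v m="$1" -v s="$2" 'BEGIN { printf "%.1f\n", (s - m) / s * 100 }'
}

echo "Read:  31.4 GB/s vs 48 GB/s -> $(pct_below 31.4 48)% below specification"
echo "Write: 27.8 GB/s vs 40 GB/s -> $(pct_below 27.8 40)% below specification"
```

With one decimal of rounding, the read value comes out at 34.6 percent, in line with the roughly 34.5 percent quoted above.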
Sequential N clients to a single shared file performance was measured with IOR version 3.3.0, with OpenMPI v4.1.2A1 used to run the benchmark over the 16 compute nodes. The tests that we ran varied from one thread up to 512 threads because there were not enough cores for 1024 or more threads. This benchmark used 8 MiB blocks for optimal performance. The previous section provides a more complete explanation about why that block size was selected.
We minimized data caching effects by setting the GPFS page pool tunable to 16 GiB and the total file size to 8192 GiB to ensure that neither clients nor servers could cache any data. That 8 TiB total was divided equally among the threads (the $Size variable in the following code manages that value).
The following commands were used to run the benchmark for write and read operations, where the $Threads variable is the number of threads used (1 to 512 incremented in powers of two) and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using the round-robin method to spread them homogeneously across the 16 compute nodes:
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ucx --oversubscribe --prefix /usr/mpi/gcc/openmpi-4.1.2a1 --map-by node /mmfs1/bench/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -w -s 1 -t 8m -b ${Size}G
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ucx --oversubscribe --prefix /usr/mpi/gcc/openmpi-4.1.2a1 --map-by node /mmfs1/bench/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -r -s 1 -t 8m -b ${Size}G
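The per-run inputs for the commands above, the $Size value and the my_hosts.$Threads file, could be generated with a loop such as the one below. This is a sketch under assumptions and not the script used in the study: the host names are placeholders, the slot count per host (the thread count divided by 16 and rounded up) is a choice made here, and the files are written to a temporary directory by default. The `host slots=N` lines follow the Open MPI hostfile syntax.

```bash
#!/bin/sh
TOTAL_GIB=8192                     # 8 TiB shared file
NODES=16                           # client nodes
OUT_DIR=${OUT_DIR:-$(mktemp -d)}   # where the hostfiles are written

# slots_per_node THREADS -> ceil(THREADS / NODES)
slots_per_node() {
    echo $(( ($1 + NODES - 1) / NODES ))
}

# write_hostfile THREADS -> creates $OUT_DIR/my_hosts.THREADS
# Host names are placeholders, not the ones from the test bed.
write_hostfile() {
    slots=$(slots_per_node "$1")
    n=1
    : > "$OUT_DIR/my_hosts.$1"
    while [ "$n" -le "$NODES" ]; do
        printf 'node%02d slots=%d\n' "$n" "$slots" >> "$OUT_DIR/my_hosts.$1"
        n=$(( n + 1 ))
    done
}

# Thread counts from 1 to 512 in powers of 2, as in the text.
Threads=1
while [ "$Threads" -le 512 ]; do
    Size=$(( TOTAL_GIB / Threads ))
    write_hostfile "$Threads"
    echo "Threads=$Threads Size=${Size}G hostfile=$OUT_DIR/my_hosts.$Threads"
    Threads=$(( Threads * 2 ))
done
```

With `--map-by node` in the mpirun commands, ranks are placed round-robin across the listed hosts, so the slot count acts only as an upper bound per host in this sketch.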
IMPORTANT: To allow comparison of ME5 array values to those previously obtained with ME4 arrays directly on the graph, the IOR results of the Medium configuration (two ME5084 arrays) were multiplied by 2 to estimate the performance of a Large configuration (four ME5084 arrays).
From the results, we see that read and write performance are high regardless of the implicit need for locking mechanisms because all threads access the same file. Performance rises quickly with the number of threads used and then reaches a plateau at eight threads that is relatively stable up to the maximum number of threads used on this test. Notice that the maximum read performance was 30.9 GB/s at 16 threads, but similar to sequential N to N tests, performance decreases slightly until reaching a more stable value. The maximum write performance of 23 GB/s was achieved at 32 threads and remains stable across a higher number of threads.
Random N clients to N files performance was measured with IOzone version 3.492. The tests that we ran varied from a single thread up to 512 threads, in increments of powers of 2, because there were not enough client cores for 1024 threads. Each thread used a different file, and the threads were assigned to the client nodes using the round-robin method. This benchmark test used 4 KiB blocks to emulate small-block traffic.
We minimized caching effects from the clients by setting the GPFS page pool tunable to 4 GiB. The total data size of the files created was again 8,192 GiB divided by the number of threads (the $Size variable in the following code was used to manage that value). However, the actual random operations were limited to 128 GiB (4 GiB x 16 clients x 2) to reduce the running time, which can be extremely long due to the low IOPS of NLS drives.
./iozone -i0 -I -c -e -w -r 8M -s ${Size}G -t $Threads -+n -+m ./me5_threadlist <= Create the files sequentially
./iozone -i2 -I -O -w -r 4k -s ${Size}G -t $Threads -+n -+m ./me5_threadlist <= Perform the random reads and writes
Figure 7 N to N random performance
IMPORTANT: To allow comparison of ME5 array values to those previously obtained with ME4 arrays directly on the graph, the IOzone results of the Medium configuration (two ME5084 arrays) were multiplied by 2 to estimate the performance of a Large configuration (four ME5084 arrays).
From the results, we see that write performance starts at a high value of 15.2K IOPS, rises to a peak of 20.8K IOPS at four threads, and then decreases until it reaches a plateau at 16 threads (15-17K IOPS). Read performance starts low at 1.5K IOPS at 16 threads and increases steadily with the number of threads used (the number of threads is doubled for each data point) until achieving a maximum performance of 31.8K IOPS at 512 threads without reaching a plateau. Using more threads requires more than 16 compute nodes to avoid resource starvation and excessive swapping that can lower apparent performance. Because the seek time of NLS HDDs limits maximum IOPS long before reaching the ME5 controller specification, using ME484 expansions can help to increase IOPS, and faster drives (10K, 15K, or SSDs) can help even more. However, the NVMe tier is better suited to handle extremely high IOPS requirements.
The optional HDMD module used in this testbed consisted of a single pair of PowerEdge R650 servers with 10 PM1735 NVMe PCIe 4 devices on each server. Metadata performance was measured with MDtest version 3.3.0, with OpenMPI v4.1.2A1 to run the benchmark over the 16 compute nodes. The tests that we ran varied from a single thread up to 512 threads. The benchmark was used for files only (no directory metadata), getting the number of create, stat, read, and remove operations that the solution can handle.
Because the same High Demand Metadata NVMe module was used for previous benchmark tests of the pixstor storage solution, metadata results are similar to previous results (NVMe tier). Therefore, the studies with empty and 3 KiB files were included for completeness, but results with 4 KiB files are more relevant for this blog. Since 4 KiB files cannot fit into an inode along with the metadata information, ME5 arrays are used to store data for each file. Therefore, MDtest can also provide an approximate estimate of small-file performance for read operations and the rest of the metadata operations using ME5 arrays.
The following command was used to run the benchmark, where the $Threads variable is the number of threads used (1 to 512 incremented in powers of two) and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using the round-robin method to spread them homogeneously across the 16 compute nodes. The file size for read and create operations was stored in $FileSize. Like the Random IO benchmark, the maximum number of threads was limited to 512 because there are not enough cores on client nodes for 1024 threads. Context switching can affect the results, reporting a number lower than the real performance of the solution.
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /usr/mpi/gcc/openmpi-4.1.2a1 --map-by node --mca btl_openib_allow_ib 1 /mmfs1/bench/mdtest -v -P -d /mmfs1/perftest -i 1 -b $Directories -z 1 -L -I 1024 -u -t -w $FileSize -e $FileSize
Because the total number of IOPS, the number of files per directory, and the number of threads can affect the performance results, we decided to keep the total number of files fixed at 2 Mi files (2^21 = 2,097,152) and the number of files per directory fixed at 1024, and to vary the number of directories as the number of threads changed, as shown in the following table:
Table 3 MDtest distribution of files on directories
Number of threads | Number of directories per thread | Total number of files |
1 | 2048 | 2,097,152 |
2 | 1024 | 2,097,152 |
4 | 512 | 2,097,152 |
8 | 256 | 2,097,152 |
16 | 128 | 2,097,152 |
32 | 64 | 2,097,152 |
64 | 32 | 2,097,152 |
128 | 16 | 2,097,152 |
256 | 8 | 2,097,152 |
512 | 4 | 2,097,152 |
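The values in Table 3 follow from simple arithmetic. The short Python sketch below reproduces them, assuming that the directory count per thread is the total number of files divided by the files per directory and by the thread count. The function name directories_per_thread is illustrative and is not part of MDtest.

```python
TOTAL_FILES = 2 ** 21        # 2,097,152 files in every run
FILES_PER_DIRECTORY = 1024   # fixed, matches the "-I 1024" option


def directories_per_thread(threads):
    """Directories each thread needs so that the total file count stays fixed."""
    total_directories = TOTAL_FILES // FILES_PER_DIRECTORY  # 2048
    if total_directories % threads != 0:
        raise ValueError("thread count must divide the total directory count")
    return total_directories // threads


if __name__ == "__main__":
    for threads in [2 ** k for k in range(10)]:  # 1, 2, 4, ..., 512
        dirs = directories_per_thread(threads)
        total = threads * dirs * FILES_PER_DIRECTORY
        print(f"{threads} | {dirs} | {total:,} |")
```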
Figure 8 Metadata Performance – empty Files
The scale chosen was logarithmic with base 10 to allow comparing operations that have differences of several orders of magnitude; otherwise, some of the operations would appear like a flat line close to 0 on a linear scale. A log graph with base 2 would be more appropriate because the number of threads is increased by powers of 2. Such a graph would look similar, but people tend to perceive and remember numbers based on powers of 10 better.
Empty files do not involve the ME5 arrays at all and only represent the performance on the PowerEdge R650 servers with NVMe drives. The system provides good results, with stat operations reaching the peak value at 256 threads with almost 8.6M op/s and then dropping at 512 threads. Create operations reach the maximum of 239.6K op/s at 64 threads and then decrease slightly until reaching a plateau at 128 threads. Read operations attain a maximum of 3.7M op/s at 128 threads, then decrease slowly. Remove operations peak at 312.9K op/s at 64 threads, then decrease slightly and seem to reach a plateau.
Figure 9 Metadata Performance – 3 KiB Files
Note that 3 KiB files still fit completely on inodes and therefore do not involve ME5 arrays, but only represent the performance on the PowerEdge R650 servers with NVMe drives. The system provides good results with stat operations reaching the peak value at 512 threads with 9.9M op/s. Create operations reach the maximum of 192.2K op/s at 128 threads and seem to reach a plateau. Read operations attained a maximum of 3M op/s at 128 threads. Remove operations peaked at 298.7K op/s at 128 threads.
With 4 KiB files, the system provides good results, with stat operations reaching the peak value at 256 threads with almost 9.8M op/s and then dropping at 512 threads. Create operations reach the maximum of 115.6K op/s at 128 threads and then decrease slightly until reaching 512 threads, where the value drops to less than 40 percent of the peak. Read operations attain a maximum of 4M IOPS at 256 threads, which seems too high for NLS drives (possibly implying that the file system is caching all the data needed for most data points), and also drop suddenly at 512 threads. More work is needed to understand the sudden drop for create operations and the high read performance. Finally, remove operations peak at 286.6K op/s at 128 threads and decrease at higher thread counts.
The new ME5 arrays provide a significant increase in performance (71 percent for read operations and 82 percent for write operations from specifications). The new arrays directly increased the performance for the pixstor solution, but not to the level expected from the specification, as seen in Table 4. Because the pixstor solution uses scattered access by default, it is expected that ME484 expansions will help get closer to the limit of the ME5 controllers.
This solution provides HPC customers with a reliable parallel file system (Spectrum Scale – also known as GPFS) that is used by many Top500 HPC clusters. In addition, it provides exceptional search capabilities without degrading performance, and advanced monitoring and management. By using standard protocols like NFS, SMB, and others, optional gateways allow file sharing to as many clients as needed. Optional ngenea nodes allow tiering of other Dell storage such as Dell PowerScale, Dell ECS, other vendors, and even cloud storage.
Table 4 Peak and sustained performance with ME5084 arrays
Benchmark | Peak write | Peak read | Sustained write | Sustained read |
Large Sequential N clients to N files | 31.4 GB/s | 27.8 GB/s | 28 GB/s | 26 GB/s |
Large Sequential N clients to single shared file | 30.9 GB/s | 27.8 GB/s | 27.3 GB/s | 27 GB/s |
Random Small blocks N clients to N files | 20.8K IOPS | 31.8K IOPS | 15.5K IOPS | 27K IOPS |
Metadata benchmark | Peak performance | Sustained performance |
Metadata Create 4 KiB files | 115.6K IOPS | 50K IOPS |
Metadata Stat 4 KiB files | 9.8M IOPS | 1.4M IOPS |
Metadata Remove 4 KiB files | 286.7K IOPS | 195K IOPS |
When two additional ME5084s are added to the pixstor solution, it will be fully benchmarked as a Large configuration (four ME5084 arrays). It will also be fully benchmarked after adding expansion arrays (four ME484 arrays). Another document will be released with this and any additional information.
Fri, 12 Aug 2022 16:47:40 -0000
Overview
The Dell PowerEdge R750xa server, powered by 3rd Generation Intel Xeon Scalable processors, is a 2U rack server that supports dual CPUs, with up to 32 DDR4 DIMMs at 3200 MT/s in eight channels per CPU. The PowerEdge R750xa server is designed to support up to four PCIe Gen 4 accelerator cards and up to eight SAS/SATA SSD or NVMe drives.
Figure 1: Front view of the PowerEdge R750xa server
The AMD Instinct™ MI210 PCIe accelerator is the latest GPU from AMD that is designed for a broad set of HPC and AI applications. It provides the following key features and technologies:
This blog provides the performance characteristics of a single PowerEdge R750xa server with the AMD Instinct MI210 accelerator. It compares the performance numbers of microbenchmarks (GEMM of FP64 and FP32 and bandwidth test), HPL, and LAMMPS for both the AMD Instinct MI210 accelerator and the previous generation AMD Instinct MI100 accelerator.
The following table provides configuration details for the PowerEdge R750xa system under test (SUT):
Table 1: SUT hardware and software configurations
Component | Description |
Processor | Dual Intel Xeon Gold 6338 |
Memory | 512 GB - 16 x 32 GiB@3200 MHz |
Local disk | 3.84 TB SATA-6GB SSD |
Operating system | Rocky Linux release 8.4 (Green Obsidian) |
GPU model | 4 x AMD MI210 (PCIe-64G) or 3 x AMD MI100 (PCIe-32G) |
GPU driver version | 5.13.20.5.1 |
ROCm version | 5.1.3 |
Processor Settings > Logical Processors | Disabled |
System profiles | Performance |
5.1.3 | |
5.1.3 | |
HPL | Compiled with ROCm v5.1.3 |
LAMMPS (KOKKOS) | Version: LAMMPS patch_4May2022 |
The following table provides the specifications of the AMD Instinct MI210 and MI100 GPUs:
Table 2: AMD Instinct MI100 and MI210 PCIe GPU specifications
GPU architecture | AMD Instinct MI210 | AMD Instinct MI100 |
Peak Engine Clock (MHz) | 1700 | 1502 |
Stream processors | 6656 | 7680 |
Peak FP64 (TFlops) | 22.63 | 11.5 |
Peak FP64 Tensor DGEMM (TFlops) | 45.25 | 11.5 |
Peak FP32 (TFlops) | 22.63 | 23.1 |
Peak FP32 Tensor SGEMM (TFlops) | 45.25 | 46.1 |
Memory size (GB) | 64 | 32 |
Memory Type | HBM2e | HBM2 |
Peak Memory Bandwidth (GB/s) | 1638 | 1228 |
Memory ECC support | Yes | Yes |
TDP (Watt) | 300 | 300 |
GEMM microbenchmarks
General Matrix-Matrix Multiplication (GEMM) is a multithreaded dense matrix multiplication benchmark that is used to measure the performance of a single GPU. The O(n^3) computational complexity of GEMM compared to its O(n^2) memory requirement makes it an ideal benchmark for measuring GPU acceleration, because achieving high efficiency depends on minimizing redundant memory access.
For this test, we compiled the rocblas-bench binary from https://github.com/ROCmSoftwarePlatform/rocBLAS to collect DGEMM (double-precision) and SGEMM (single-precision) performance numbers.
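To make the ratio of compute to memory concrete, the following Python sketch estimates the arithmetic intensity of a square GEMM. It assumes the conventional count of 2n^3 floating-point operations and an idealized case in which each of the three n x n matrices moves through memory exactly once. Real kernels move more data, so this is an upper bound and not a measurement.

```python
def gemm_arithmetic_intensity(n, bytes_per_element=8):
    """Ideal FLOPs per byte for C = A * B with square n x n matrices.

    bytes_per_element is 8 for FP64 (DGEMM) and 4 for FP32 (SGEMM).
    """
    flops = 2 * n ** 3                            # one multiply and one add per term
    bytes_moved = 3 * n ** 2 * bytes_per_element  # A, B, and C touched once (assumption)
    return flops / bytes_moved


if __name__ == "__main__":
    for n in (1024, 8192, 32768):
        print(n, gemm_arithmetic_intensity(n))
```

Because the intensity grows linearly with n (n/12 FLOPs per byte for FP64 under these assumptions), large matrices keep the GPU compute units busy as long as data already loaded is reused.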
These results only reflect the performance of matrix multiplication, and results are measured in the form of peak TFLOPS that the accelerator can deliver. These numbers can be used to compare the peak compute performance capabilities of different accelerators. However, they might not represent real-world application performance.
Figure 2 presents the performance results measured for DGEMM and SGEMM on a single GPU:
Figure 2: DGEMM and SGEMM numbers obtained on AMD Instinct MI210 and MI100 GPUs with the PowerEdge R750xa server
From the results we observed:
GPU-to-GPU bandwidth test
This test captures the performance characteristics of buffer copying and kernel read/write operations. We collected results by using TransferBench, compiling the binary by following the procedure provided at https://github.com/ROCmSoftwarePlatform/rccl/tree/develop/tools/TransferBench. On the PowerEdge R750xa server, both the AMD Instinct MI100 and MI210 GPUs have the same GPU-to-GPU throughput, as shown in the following figure:
Figure 3: GPU-to-GPU bandwidth test with TransferBench on the PowerEdge R750xa server with AMD Instinct MI210 GPUs
High-Performance Linpack (HPL) Benchmark
HPL measures a system’s floating point computing power by solving a random system of linear equations in double precision (FP64) arithmetic. The peak FLOPS (Rpeak) is the highest number of floating-point operations that a computer can perform per second in theory.
It can be calculated using the following formula:
clock speed of the GPU × number of GPU cores × number of floating-point operations that the GPU performs per cycle
Measured performance is referred to as Rmax. The ratio of Rmax to Rpeak demonstrates the HPL efficiency, which is how close the measured performance is to the theoretical peak. Several factors influence efficiency including GPU core clock speed boost and the efficiency of the software libraries.
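As a worked example of the formula, the Python sketch below computes Rpeak from the clock and stream processor counts in Table 2. The per-cycle values (2 FP64 operations per stream processor for the MI210 and 1 for the MI100) are assumptions chosen because they reproduce the peak FP64 figures in that table. The function names are illustrative.

```python
def rpeak_tflops(clock_mhz, cores, flops_per_cycle):
    """Theoretical peak in TFLOPS: clock speed x cores x operations per cycle."""
    return clock_mhz * 1e6 * cores * flops_per_cycle / 1e12


def hpl_efficiency(rmax_tflops, rpeak_tflops_value):
    """Ratio of measured Rmax to theoretical Rpeak."""
    return rmax_tflops / rpeak_tflops_value


if __name__ == "__main__":
    # Clock and stream processor counts are taken from Table 2.
    print("MI210 FP64 Rpeak:", round(rpeak_tflops(1700, 6656, 2), 2), "TFLOPS")
    print("MI100 FP64 Rpeak:", round(rpeak_tflops(1502, 7680, 1), 2), "TFLOPS")
```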
The results shown in the following figure are the Rmax values, which are measured HPL numbers on AMD Instinct MI210 and AMD MI100 GPUs. The HPL binary used to collect the result was compiled with ROCm 5.1.3.
Figure 4: HPL performance on AMD Instinct MI210 and MI100 GPUs powered with R750xa servers
The following figure shows the power consumption during a single HPL test:
Figure 5: System power use during one HPL test across four GPUs
Our observations include:
LAMMPS Benchmark
LAMMPS is a molecular dynamics simulation code that is bound by GPU memory bandwidth. We used the KOKKOS acceleration library implementation of LAMMPS to measure the performance of AMD Instinct MI210 GPUs.
The following figure compares the LAMMPS performance of the AMD Instinct MI210 and MI100 GPU with four different datasets:
Figure 6: LAMMPS performance numbers on AMD Instinct MI210 and MI100 GPUs on PowerEdge R750xa servers with different datasets
Our observations include:
Conclusion
The AMD Instinct MI210 GPU shows an impressive performance improvement in FP64 workloads. These workloads benefit because AMD has doubled the width of its ALUs to a full 64 bits, allowing FP64 operations to run at full speed in the new CDNA 2 architecture. Applications and workloads that can take advantage of FP64 operations are expected to make the most of this aspect of the AMD Instinct MI210 GPU. The higher bandwidth of the HBM2e memory of the AMD Instinct MI210 GPU provides advantages for GPU memory-bound applications.
The PowerEdge R750xa server with AMD Instinct MI210 GPUs is a powerful compute engine, which is well suited for HPC users who need accelerated compute solutions.
Next steps
In future work, we plan to describe benchmark results on additional HPC and deep learning applications, compare the AMD Infinity Fabric™ Link (xGMI) bridges, and show AMD Instinct MI210 performance numbers on other Dell PowerEdge servers, such as the PowerEdge R7525 server.
Mon, 28 Mar 2022 16:35:13 -0000
High Performance Computing (HPC) involves processing complex scientific and engineering problems at a high speed across a cluster of compute nodes. Performance is one of the most important features of HPC. While most HPC applications are run on bare metal servers, there has been a growing interest to run HPC applications in virtual environments. In addition to providing resiliency and redundancy for the virtual nodes, virtualization offers the flexibility to quickly instantiate a secure virtual HPC cluster.
Most people tend to run their HPC workloads on dedicated hardware, which is often composed of server compute nodes that are interconnected by high-speed networks to maximize their performance. Alternatively, virtualization abstracts the underlying hardware and adds a software layer that emulates this hardware. With this in mind, the engineers at the Dell Technologies HPC & AI Innovation Lab and VMware conducted a performance study to compare the performance of running and scaling HPC workloads on dedicated bare metal nodes to a vSphere 7-based virtualized infrastructure. The team also tuned the physical and virtual infrastructure to achieve optimal virtual performance and share these findings and recommendations.
Our team evaluated tightly coupled HPC applications or message passing interface (MPI) based workloads and observed promising results. These applications consist of parallel processes (MPI ranks) that leverage multiple cores and are architected to scale computation to multiple compute servers (or VMs) to solve the complex mathematical model or scientific simulation in a timely manner. Examples of tightly coupled HPC workloads include computational fluid dynamics (CFD) used to model airflow in automotive and airplane designs, weather research and forecasting models for predicting the weather, and reservoir simulation code for oil discovery.
To evaluate the performance of these tightly coupled HPC applications, we built a 16-node HPC cluster using Dell PowerEdge R640 vSAN Ready Nodes. The Dell PowerEdge R640 is a 1U dual-socket server with Intel® Xeon® Scalable processors. The same cluster was configured as both a bare metal HPC cluster and as a virtual cluster running VMware vSphere.
Figure 1 shows a representative network topology of this cluster. The cluster was connected to two separate physical networks. We used the following components for this cluster:
The VM switches provided redundancy and were connected by a virtual link trunking interconnect (VLTi). A VMware vSAN cluster was created to host the VMDKs for the virtual machines. To maximize CPU utilization, we also leveraged RDMA support for vSAN. This provides direct memory access between the nodes participating in the vSAN cluster without involving the operating system or the CPU. RDMA offers low latency, high throughput, and high IOPS that are more difficult to achieve with traditional TCP-based networks. It also enables the HPC workloads to consume more CPU for their work without impacting the vSAN performance.
Figure 1: A 16-Node HPC cluster test bed
Figure 2: Physical adapter configuration for HPC network and service network
Table 1 describes the configuration details of the physical nodes and the network connectivity. For the virtual cluster, a single VM per node was provisioned for a total of 16 VMs or virtual compute nodes. Each VM was configured with 44 vCPUs and 144 GB of memory; the VM CPU and memory reservations were enabled, and the VM latency sensitivity was set to high. Figure 1 also provides an example of how the hosts are cabled to each fabric. For the HPC network fabric, one port of the NVIDIA Mellanox ConnectX-6 adapter in each host is connected to the Dell PowerSwitch Z9332. For the service network fabric, two ports of the NVIDIA Mellanox ConnectX-4 adapter are connected to the Dell PowerSwitch S5248 ToR switches.
Table 1: Configuration details for the bare metal and virtual clusters
Environment | Bare Metal | Virtual |
Server | PowerEdge R640 vSAN Ready Node | |
Processor | 2 x Intel Xeon 2nd Generation 6240R | |
Cores | All 48 cores used | 44 vCPU used |
Memory | 12 x 16GB @3200 MT/s | 144 GB reserved for the VM |
Operating System | CentOS 8.3 | Host OS: VMware vSphere 7.0u2 |
HPC Network NIC | 100 GbE NVIDIA Mellanox ConnectX-6 | |
Service Network NIC | 10/25 GbE NVIDIA Mellanox ConnectX-4 | |
HPC Network Switch | Dell PowerSwitch Z9332F-ON | |
Service Network Switch | Dell PowerSwitch S5248F-ON |
Table 2 shows a range of different HPC applications across multiple vertical domains along with the benchmark datasets that were used for the performance comparison.
Table 2: Application and Benchmark Details
Application | Vertical Domain | Benchmark Dataset |
OpenFOAM | Manufacturing - Computational Fluid Dynamics (CFD) | |
Weather Research and Forecasting (WRF) | Weather and Environment | |
Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) | Molecular Dynamics | |
GROMACS | Life Sciences – Molecular Dynamics | HECBioSim Benchmarks – 3M Atoms |
Nanoscale Molecular Dynamics (NAMD) | Life Sciences – Molecular Dynamics | STMV – 1.06M Atoms |
Figures 3 through 7 show the performance, scalability, and difference in performance for five representative HPC applications in the CFD, weather, and science domains. Each of the applications was run to scale from 1 node through 16 nodes on a bare metal and a virtual cluster. All five applications demonstrate efficient speedup when computation is scaled out to multiple systems. The relative speedup for the application is plotted (the baseline is application performance on a bare metal single node).
The results indicate that MPI application performance running in a virtualized infrastructure (with proper tuning and following best practices for latency-sensitive applications in a virtual environment) is close to performance in a bare metal infrastructure. The single-node performance delta ranges from no difference for WRF to a maximum of 8 percent observed with LAMMPS. Similarly, as the nodes are scaled, the performance observed on the virtual nodes is comparable to that on the bare metal infrastructure, with the largest delta being 10 percent when running LAMMPS on 16 nodes.
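The two quantities discussed here can be computed as in the following Python sketch. It assumes that performance is compared through wall-clock run times, which the blog does not state explicitly. The function names and the sample timings are hypothetical and are not measured results.

```python
def relative_speedup(baseline_seconds, run_seconds):
    """Speedup of a run relative to the bare metal single-node baseline."""
    return baseline_seconds / run_seconds


def performance_delta_percent(bare_metal_seconds, virtual_seconds):
    """How much slower (in percent) the virtual run is than the bare metal run."""
    return (virtual_seconds - bare_metal_seconds) / bare_metal_seconds * 100


if __name__ == "__main__":
    # Hypothetical timings for illustration only.
    baseline = 1000.0        # bare metal, 1 node
    bare_metal_16 = 70.0     # bare metal, 16 nodes
    virtual_16 = 77.0        # virtual, 16 nodes
    print("bare metal speedup:", relative_speedup(baseline, bare_metal_16))
    print("virtual speedup:", relative_speedup(baseline, virtual_16))
    print("delta (%):", performance_delta_percent(bare_metal_16, virtual_16))
```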
Figure 3: OpenFOAM performance comparison between virtual and bare-metal systems
Figure 4: WRF performance comparison between virtual and bare-metal systems
Figure 5: LAMMPS performance comparison between virtual and bare-metal systems
Figure 6: GROMACS performance comparison between virtual and bare-metal systems
Figure 7: NAMD performance comparison between virtual and bare-metal systems
One of the key elements of achieving a viable virtualized HPC solution is the tuning best practices that allow for optimal performance. We found a significant improvement was achieved after some minor tweaks were made from the out-of-box configuration. These improvements are a critical ingredient to ensuring customers can and will achieve results that enable not only the implementation of a virtual HPC environment, but also the adoption of a more cloud-ready eco-system that provides operational efficiencies and multi-workload support.
Table 3 outlines the parameters that we found to work best for MPI applications. Given the nature of MPI for parallel communication and its heavy reliance on a low-latency network, we suggest the implementation of the VM Latency Sensitivity setting available in vSphere 7.0. This setting allows users to optimize the scheduling delay for latency sensitive applications by 1) giving exclusive access to physical resources to reduce resource contention, 2) by-passing virtualization layers that are not providing value for these workloads, and 3) tuning virtualization layers to reduce any unnecessary overhead. We have also outlined the additional physical host and hypervisor tunings that complete these best practices below.
Table 3: Recommended performance tunings for tightly coupled HPC workloads
Settings | Value |
Physical Server | |
BIOS Power Profile | Performance per watt (OS) |
BIOS Hyper-threading | On |
BIOS Node Interleaving | Off |
BIOS SR-IOV | On |
Hypervisor | |
ESXi Power Policy | High Performance |
Virtual Machine | |
VM Latency Sensitivity | High |
VM CPU Reservation | Enabled |
VM Memory Reservation | Enabled |
VM Sizing | Maximum VM size with CPU/memory reservation |
Figure 8: Virtual Machine Configuration with the recommended tuning settings
Figure 8 shows a snapshot of the recommended tuning settings as applied to the virtual machine used as the virtual nodes on the HPC cluster.
Achieving optimal performance is a key consideration for running an HPC application. While most HPC applications enjoy the performance benefits offered by dedicated bare metal hardware, our results indicate that with appropriate tuning the performance gap between virtual and bare metal nodes has narrowed, making it feasible to run certain HPC applications in a virtualized environment. We also observed that the tested HPC applications demonstrate efficient speedups when computation is scaled out to multiple virtual nodes.
To learn more about our previous and ongoing work at the Dell Technologies HPC & AI Innovation Lab, see the High Performance Computing overview and the Dell Technologies Info Hub blog page for HPC solutions.
The authors thank Martin Hilgeman from Dell Technologies, Ramesh Radhakrishnan and Michael Cui from VMware, and Martin Feyereisen for their contribution in the study.
Mon, 15 Nov 2021 23:08:56 -0000
The release of Omnia version 1.0 in March of 2021 was a huge milestone for the Omnia community. It was the culmination of nearly a year of planning, conversations with customers and community members, development, and testing. Omnia version 1.0 included:
automated Kubeflow deployment.
The Omnia project was designed to rapidly add features and evolve, and we are proud to announce the first update to Omnia just 7 months later. While version 1.0 had a ton of great features for a first release, version 1.1 turned out to be even bigger!
Omnia is an open-source, community-driven framework for deploying high-performance computing (HPC) clusters for simulation & modeling, artificial intelligence, and data analytics. By automating the entire process, Omnia reduces deployment time for these complex systems from weeks to hours.
Omnia was incubated at Dell Technologies in partnership with Intel. The project was initiated by two HPC & AI experts who needed to quickly set up proof-of-concept clusters in Dell’s HPC & AI Innovation Lab, and has since grown into a much larger effort to create production-grade clusters on demand and at scale. Today, Omnia has thirty collaborators from nearly a dozen organizations, including five official community member organizations. The code repo has been cloned over a thousand times and has over forty thousand views! The project is off to a great start, with more new features being released regularly!
Omnia version 1.1 includes a multitude of new features and capabilities that expand datacenter automation beyond the compute server. This latest release sets the groundwork for Omnia to handle future exascale supercomputer deployments while simultaneously growing the set of end-user features and platforms more rapidly.
The new control plane (formerly called the Omnia appliance) is now a full Kubernetes-based deployment with a wealth of features. The new control plane includes Dell iDRAC integration for firmware updates and OS provisioning when iDRAC Enterprise or Datacenter licenses are detected, plus automatic fallback to Cobbler-based PXE provisioning when those licenses are not available. This allows cluster administrators using Dell servers to take full advantage of their iDRAC Enterprise or Datacenter licenses while continuing to offer a fully open-source and vendor-agnostic alternative. This new Kubernetes-based control plane is the first step in providing an expandable, multi-server control plane that could be used to manage the bare-metal provisioning and deployment of thousands of compute nodes for petascale and eventual exascale systems.
The development team has also extended Omnia’s automation capability beyond compute servers. The control plane is now able to automatically detect and configure Dell EMC PowerSwitches, NVIDIA/Mellanox InfiniBand switches, and Dell EMC PowerVault storage arrays. This allows users to now deploy complete HPC environments using Omnia’s one-touch philosophy, with compute, network, and storage pieces ready to go! Dell EMC PowerSwitches are automatically configured for both management and fabric deployments, with automatic configuration of RoCEv2 for supported 100 Gbps Ethernet switches. NVIDIA InfiniBand fabrics will automatically be deployed when an InfiniBand switch is detected, with the subnet manager running on the control plane. And when the control plane detects a Dell EMC PowerVault ME4 storage array, it will automatically configure the RAID, format the array, and set up an NFS service that can be shared by the various logical clusters in the Omnia resource pool. In less than a day, a loading dock full of servers, storage, and networking can be transformed into a functional Omnia resource pool, ready to be configured into logical Slurm and Kubernetes clusters on demand.
Starting with version 1.1, Omnia also reduces the pain of user management. When logical Slurm clusters are created, Omnia takes care of all the backend services needed for a fully functional, batch-scheduled simulation and modeling environment, including Kerberos user authentication with FreeIPA. System administrators immediately have access to both a CLI and a web-based interface for user management built upon well-known open-source components and standard protocols. Systems can also be configured to point to an existing LDAP service elsewhere in the data center.
Interest in Kubernetes has been growing in the HPC community, especially for data science and data analytics workloads. Interest in those use cases is precisely why Omnia included the ability to deploy Kubernetes from the start. However, default configurations of Kubernetes are missing some of the key components needed to make it useful for parallel and distributed data processing. Omnia version 1.0 included the mpi-operator from the Kubeflow project, which provides custom resource definitions (CRDs) for MPI job execution. Version 1.1 now includes the spark-operator to make executing Spark jobs simpler as well. Another feature of version 1.1 is the option to use gang scheduling for Kubernetes pods through the Volcano project. This gives Kubernetes the ability to understand that all the pods in an MPI job should be scheduled simultaneously, rather than deploying pods a few at a time when resources become available.
Artificial intelligence research has been a central workload for Omnia. Being able to provide users with easy-to-deploy MLOps platforms like Kubeflow is critical to giving data scientists and AI researchers the flexibility to experiment with new neural network architectures. In addition to Kubeflow, Omnia now offers automated installation of the Polyaxon deep learning platform. Polyaxon gives neural network researchers and data science teams the ability to:
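The difference between the two scheduling behaviors can be pictured with a toy model. The Python sketch below is not the Volcano API or the Kubernetes scheduler; the function names and the single "free slots" number are simplifying assumptions used only to contrast all-or-nothing placement with incremental placement.

```python
def schedule_incremental(free_slots, pods_needed):
    """Start as many pods as fit right now, even if the job is incomplete."""
    return min(free_slots, pods_needed)


def schedule_gang(free_slots, pods_needed):
    """Start every pod of the job together, or none of them."""
    return pods_needed if free_slots >= pods_needed else 0


if __name__ == "__main__":
    # An MPI job with 4 ranks on a cluster that currently has 3 free slots.
    print("incremental starts:", schedule_incremental(3, 4))
    print("gang starts:", schedule_gang(3, 4))
```

In the incremental case, three pods would hold resources while waiting for the fourth rank, which is the situation that gang scheduling is meant to avoid for MPI jobs.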
Version 1.1 is a big release, but the Omnia community has even greater things planned. Soon we will be adding support for the entire line of Dell EMC PowerEdge servers with Intel® 3rd-generation Xeon® Scalable (code name “Ice Lake”) processors. Additionally, Omnia will soon be able to deploy logical clusters on top of servers provisioned with either Rocky Linux or CentOS, offering users a choice of base operating systems. Looking farther out, we are working with our customers, technology partners, and community members to bring support for creating BeeGFS filesystems on demand, deploying new user platforms like Open OnDemand, and providing better administrative interfaces for Kubernetes cluster administration through Lens. Anyone is free to look at what we’re working on (and suggest new things to try) by going to the Omnia GitHub.
Learn more about Omnia by visiting Omnia on GitHub.
Read the Dell Technologies solution overview on Omnia here.
Tue, 21 Sep 2021 13:56:11 -0000
Many sectors like aviation, travel, tourism, energy, and transportation heavily rely on timely and accurate weather predictions provided by weather forecast centers. These operational forecast centers make use of numerical weather prediction (NWP) models to predict the weather based on current weather conditions. The Weather Research and Forecasting (WRF) model is one of the most widely used numerical weather prediction systems for weather forecasting. A suitable combination of robust computational resources, a high-speed network, and high-throughput storage is required to achieve the maximum performance on a high-performance computing (HPC) cluster so that the WRF model can deliver timely forecasts.
In this blog, we highlight the performance improvement for WRF with Intel Ice Lake processors as compared with Intel Cascade Lake processors with Dell EMC PowerEdge servers. These tests were carried out on two-socket Dell PowerEdge servers by setting the BIOS option to the HPC workload profile. The testbed hardware and software details are outlined in the following table:
Table 1: Testbed hardware and software details
Component | Dell EMC PowerEdge R750 server | Dell EMC PowerEdge R650 server | Dell EMC PowerEdge C6420 server | Dell EMC PowerEdge C6420 server |
SKU | 8380 | 6338 | 8280 | 6252 |
Cores/Socket | 40 | 32 | 28 | 24 |
Frequency (Base-Max Turbo) | 2.30 – 3.40 GHz | 2.0 – 3.20 GHz | 2.70 – 4.0 GHz | 2.10 – 3.70 GHz |
TDP | 270 W | 205 W | 205 W | 150 W |
L3Cache | 60M | 48M | 38.5M | 37.75M |
Operating System | Red Hat Enterprise Linux 8.3 4.18.0-240.22.1.el8_3.x86_64 | Red Hat Enterprise Linux 8.3 4.18.0-240.22.1.el8_3.x86_64 | Red Hat Enterprise Linux 8.3 4.18.0-240.el8.x86_64 | Red Hat Enterprise Linux 8.3 4.18.0-240.el8.x86_64 |
Memory | 32 GB x 16 (2Rx8) 3200 MT/s | 32 GB x 16 (2Rx8) 3200 MT/s | 16 GB x 12 (2Rx8) 2933 MT/s | 16 GB x 12 (2Rx8) 2933 MT/s |
BIOS/CPLD | 1.2.4/1.0.5 | 2.11.2/1.1.0 | ||
Interconnect | NVIDIA Mellanox HDR | NVIDIA Mellanox HDR | NVIDIA Mellanox HDR100 | NVIDIA Mellanox HDR100 |
Compiler | Intel parallel studio 2020 (update 4) | |||
Datasets |
We benchmarked WRF-V3.9.1.1 with the conus 2.5km and new conus 2.5km datasets and WRF-V4.2.2 with new conus 2.5km and wrf_large 3km datasets. The following figure shows the simulation domain for the tested datasets:
Figure 1: Domain configuration for new conus 2.5 km, conus 2.5 km, and wrf_large datasets.
The following table provides a brief description of each dataset:
Table 2: Configuration for new conus 2.5 km, conus 2.5 km and wrf_large datasets
| conus 2.5 km | new conus 2.5 km | wrf_large |
Run hours | 3 | 3 | 2 |
Resolution(m) | 2500 | 2500 | 3000 |
Vertical layers | 35 | 35 | 50 |
Grid points | 1501 x 1201 | 1901 x 1301 | 1500 x 1500 |
interval_seconds | 10800 | 10800 | 21600 |
The results were measured by averaging the WRF computation time of each timestep from the rsl.error.0000 output file. The timesteps during the file read / write (of wrfout* / wrfinput* ) were not included in the average.
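The averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the study: it assumes the usual WRF log layout in which every timestep prints a "Timing for main ... elapsed seconds" line and every file operation prints a "Timing for Writing" or "Timing for processing" line just before the timestep it affects. The function name and the skip rule are assumptions made for this example.

```python
import re

# Per-timestep lines, e.g.
# "Timing for main: time 2019-11-26_12:00:15 on domain   1:    2.41 elapsed seconds"
MAIN_RE = re.compile(r"Timing for main.*?:\s+([0-9.]+)\s+elapsed seconds")
# Lines that report file I/O (assumed markers: history writes, input processing).
IO_RE = re.compile(r"Timing for (Writing|processing)", re.IGNORECASE)


def mean_step_time(lines):
    """Average seconds per timestep, skipping the step that follows an I/O line."""
    times = []
    skip_next = False
    for line in lines:
        if IO_RE.search(line):
            skip_next = True  # the next "main" timing includes file I/O
            continue
        match = MAIN_RE.search(line)
        if match:
            if not skip_next:
                times.append(float(match.group(1)))
            skip_next = False
    if not times:
        raise ValueError("no compute timesteps found")
    return sum(times) / len(times)
```

To use it on a real run, pass an open file handle for rsl.error.0000 to `mean_step_time`; a lower value means better performance.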
The following figures show the application performance for the datasets mentioned in Table 2. In each figure, the numbers over the bars represent the relative performance compared to the performance obtained with the Intel 6252 Cascade Lake processor model. The blue and green bars represent application performance obtained with Ice Lake and Cascade Lake processors, respectively.
Figure 2: Relative performance of WRF by processor and dataset type mentioned in Table 1
WRF was compiled with the "dm + sm" configuration with avx2 instructions and serial netcdf support (io_form* set to 2). All the available cores were subscribed during WRF simulation runs. To optimize performance, we tested different MPI process counts, OpenMP thread count combinations, and tiling schemes (WRF_NUM_TILES).
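A sweep over MPI process and OpenMP thread combinations like the one described above can be enumerated programmatically. The sketch below lists every (ranks, threads) pair that fully subscribes a node and builds an illustrative launch line for each. The launcher (`mpirun -np`), the executable path, and the tile count are assumptions for illustration; they are not the settings used in this study.

```python
def hybrid_layouts(total_cores):
    """Return (mpi_ranks, omp_threads) pairs whose product equals total_cores."""
    return [(ranks, total_cores // ranks)
            for ranks in range(1, total_cores + 1)
            if total_cores % ranks == 0]


def launch_line(ranks, threads, tiles, exe="./wrf.exe"):
    """Build an illustrative shell command for one candidate layout."""
    return (f"OMP_NUM_THREADS={threads} WRF_NUM_TILES={tiles} "
            f"mpirun -np {ranks} {exe}")


if __name__ == "__main__":
    # Example: a two-socket node with 40 cores per socket (80 cores total).
    for ranks, threads in hybrid_layouts(80):
        print(launch_line(ranks, threads, tiles=8))  # tiles=8 is a placeholder
```

Each printed line is one candidate to time; the best layout is typically dataset dependent.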
Depending on the dataset, the 8380 processor model can deliver up to 19 percent better performance than the 6338 processor model. Relative to Cascade Lake, the Ice Lake architecture has more memory channels and offers higher aggregate memory bandwidth. WRF, which is typically memory bandwidth bound, can take advantage of the additional memory bandwidth (Table 3) provided by Ice Lake, and the results demonstrate up to 65 percent performance improvement over the Cascade Lake counterparts. Table 3 compares the instructions per cycle (IPC) and DRAM bandwidth utilization collected with the Intel oneAPI VTune Profiler on Intel Ice Lake and Cascade Lake processors.
Table 3: Metrics collected using the Intel oneAPI VTune Profiler
Dataset | 8380 IPC | 8380 Bandwidth (GB/s) | 8280 IPC | 8280 Bandwidth (GB/s) |
conus 2.5km (WRFV3) | 0.99 | 257.32 | 0.86 | 128.30 |
new conus 2.5km (WRFV3) | 1.57 | 192.18 | 1.48 | 120.96 |
new conus 2.5km (WRFV4) | 1.36 | 191.43 | 1.14 | 115.46 |
wrf_large (WRFV4) | 1.09 | 64.80 | 0.90 | 62.55 |
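To put the measured bandwidth in context, the following sketch estimates the theoretical peak DRAM bandwidth of each two-socket node and the measured Ice Lake to Cascade Lake bandwidth ratio per dataset. It assumes eight DDR4-3200 channels per Ice Lake socket, six DDR4-2933 channels per Cascade Lake socket, and 8 bytes per transfer on each channel; the measured values are copied from Table 3.

```python
def peak_bandwidth_gbs(channels, mega_transfers, sockets=2, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s for a multi-socket node."""
    return channels * mega_transfers * bytes_per_transfer * sockets / 1000.0


# Assumed channel counts: 8 per Ice Lake socket, 6 per Cascade Lake socket.
ice_lake = peak_bandwidth_gbs(8, 3200)      # about 409.6 GB/s per node
cascade_lake = peak_bandwidth_gbs(6, 2933)  # about 281.6 GB/s per node

# Measured DRAM bandwidth in GB/s from Table 3: (8380, 8280).
measured = {
    "conus 2.5km (WRFV3)": (257.32, 128.30),
    "new conus 2.5km (WRFV3)": (192.18, 120.96),
    "new conus 2.5km (WRFV4)": (191.43, 115.46),
    "wrf_large (WRFV4)": (64.80, 62.55),
}
ratios = {name: icx / clx for name, (icx, clx) in measured.items()}

if __name__ == "__main__":
    print(f"peak ratio: {ice_lake / cascade_lake:.2f}")
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.2f}x measured bandwidth")
```

Under these assumptions the peak ratio is roughly 1.45, while the measured ratio varies widely by dataset, which fits the observation that the benefit from Ice Lake depends on how bandwidth bound each case is.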
Intel’s Ice Lake is expected to deliver around 20 percent better IPC than the Cascade Lake model (8380 vs 8280). With the datasets covered in this blog, we found that the Intel 8380 processor reports 6 to 19 percent better IPC than the Intel 8280 processor.
Figure 3 shows a box-and-whisker plot of the power consumption while the system was being benchmarked with the four tests shown in Figure 2. Each box indicates the spread of the central 50 percent of the power data, and the central line represents the median power value. The dots show the outlier power values, most of which were recorded during the initialization and finalization phases of the tests.
Figure 3: Power used by platform and processor type
The average operating frequency for the 8380, 6338, 8280, and 6252 processors was around 2.9, 2.5, 3.0, and 2.5 GHz, respectively, for all datasets.
We used eight nodes to evaluate the scalability of WRF. Each node is equipped with the Intel 8380 processor and interconnected using the NVIDIA Mellanox HDR interconnect. The nodes used for benchmarking were connected to the same HDR switch. Table 1 provides details about the server and software that was used for the test. The text on top of the bar in Figure 4 represents the relative performance (on two, four, and eight nodes) for the application as compared with the performance with a single node.
Figure 4: Multi-node performance of WRF on an Intel 8380 processor model for datasets listed in Table 2
The scalability numbers have been rounded off to a single digit. We observed good scalability with all the datasets listed in Table 2.
For WRF, Intel Ice Lake demonstrates significant performance improvement as compared with Intel Cascade Lake processors. WRF simulations scale well with the datasets described in this blog. The scalability might vary depending on the dataset being used and the node count being tested. For the best performance with WRF, the impact of the tile size, process, and threads per process should be evaluated.
Mon, 30 Aug 2021 21:09:22 -0000
3rd Generation Intel Xeon® Scalable processors (architecture code named Ice Lake) are Intel’s successors to Cascade Lake. New features include up to 40 cores per processor, eight memory channels with support for 3200 MT/s memory speed, and PCIe Gen4. The HPC and AI Innovation Lab at Dell EMC had access to a few systems, and this blog presents the results of our initial benchmarking study.
Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is an open-source, well-parallelized collection of packages for molecular dynamics (MD) research. LAMMPS has a nice collection of “atom styles”, force fields, and many contributed packages. LAMMPS can run on a single processor or on the largest parallel supercomputers. It also has packages that provide force calculations accelerated on GPUs. It can do simulations with billions of atoms!
LAMMPS can be run on a single processor or in parallel using some form of message passing, such as Message Passing Interface (MPI). The most current source code for LAMMPS is written in C++. For more information about LAMMPS, see the following link: https://www.lammps.org/.
In this study we measure the performance of LAMMPS on different Ice Lake processor models as listed in Table 1 with a comparison to the previous generation Cascade Lake systems. Single node as well as the multi-node scalability tests were conducted.
The LAMMPS release used for testing was lammps-2July-2021, built with the Intel 2020 update 5 compiler to take advantage of AVX2 and AVX512 optimizations, and with the Intel MKL FFT library. We used the default INTEL package, which ships with LAMMPS and provides well-optimized atom pair styles for the vector instructions on Intel processors. The datasets used for our study are described in Table 2, along with the detailed configuration of atom counts and run steps. The unit of performance is timesteps per second, and higher is better.
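As a rough illustration of the performance metric, timesteps per second can be derived from the "Loop time of ... on ... procs for ... steps with ... atoms" summary line that LAMMPS prints at the end of a run. The helper below is a sketch with an assumed function name; it reads only the first such line in the text it is given.

```python
import re

# Matches the LAMMPS run summary, e.g.
# "Loop time of 79.0 on 80 procs for 7900 steps with 512000 atoms"
LOOP_RE = re.compile(
    r"Loop time of\s+([0-9.eE+-]+)\s+on\s+(\d+)\s+procs for\s+(\d+)\s+steps "
    r"with\s+(\d+)\s+atoms"
)


def timesteps_per_second(log_text):
    """Return steps / loop time for the first run summary found in log_text."""
    match = LOOP_RE.search(log_text)
    if match is None:
        raise ValueError("no 'Loop time' summary found")
    loop_time = float(match.group(1))
    steps = int(match.group(3))
    return steps / loop_time
```

For a log that contains several runs, split the text per run before calling the helper so that each summary line is evaluated separately.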
Table 1: Hardware and Software test bed details
Component | Dell EMC PowerEdge R750 server | Dell EMC PowerEdge R750 server | Dell EMC PowerEdge C6520 server | Dell EMC PowerEdge C6520 server | Dell EMC PowerEdge C6420 server | Dell EMC PowerEdge C6420 server |
CPU model | Xeon 8380 | Xeon 8358 | Xeon 8352Y | Xeon 6330 | Xeon 8280 | Xeon 6248R |
Cores/Socket | 40 | 32 | 32 | 28 | 28 | 24 |
Base Frequency | 2.30 GHz | 2.60 GHz | 2.20 GHz | 2.00 GHz | 2.70 GHz | 3.00 GHz |
TDP | 270 W | 250 W | 205 W | 205 W | 205 W | 205 W |
Operating System | Red Hat Enterprise Linux 8.3 4.18.0-240.22.1.el8_3.x86_64 | |||||
Memory | 16 GB x 16 (2Rx8) 3200 MT/s | 16 GB x 12 (2Rx8) 2933 MT/s | ||||
BIOS/CPLD | 1.1.2/1.0.1 | |||||
Interconnect | NVIDIA Mellanox HDR | NVIDIA Mellanox HDR100 | ||||
Compiler | Intel parallel studio 2020 (update 4) | |||||
LAMMPS | 2july2021 |
Table 2: Description of datasets used for performance analysis
Datasets | Description | Units | Atomic Style | Atom Size | Step Size |
Lennard Jones | Atomic fluid (LJ Benchmark) | lj | atomic | 512000 | 7900 |
Rhodo | Protein (Rhodopsin Benchmark) | real | full | 512000 | 520 |
Liquid crystal | Liquid Crystal w/ Gay-Berne potential | lj | ellipsoid | 524288 | 840 |
Eam | Copper benchmark with Embedded Atom Method | metal | atomic | 512000 | 3100 |
Stillinger-Weber | Silicon benchmark with Stillinger-Weber | metal | atomic | 512000 | 6200 |
Tersoff | Silicon benchmark with Tersoff | metal | atomic | 512000 | 2420 |
Water | Coarse-grain water benchmark using Stillinger-Weber | real | atomic | 512000 | 2600 |
Polyethylene | Polyethylene benchmark with AIREBO | metal | atomic | 522240 | 550 |
Figure 1: Image view of datasets from OVITO (scientific data visualization and analysis software for molecular and other particle-based simulation models). Subfigures 1a to 1h each show a small portion of the simulation domain for the atomic fluid (Lennard Jones), rhodo (protein), liquid crystal (lc), copper (eam), Stillinger-Weber (sw), Tersoff, water, and polyethylene datasets, respectively.
Table 2 describes the datasets used for the single-node and multi-node analysis, and Figure 1 shows an image view of each. All datasets were visualized using OVITO. For the single-node performance study, all the datasets shown in Table 2 were used; for the multi-node study, only the atomic fluid dataset was used.
Figure 2: Single-node performance of LAMMPS across the datasets with Intel Ice Lake processor models. Each graph in Figure 2 is an individual subfigure, labeled (a-h) in the order shown. Subfigures 2a to 2h each compare single-node performance across the Xeon processors, with the Xeon 6330 as baseline, for the atomic fluid (Lennard Jones), rhodo (protein), liquid crystal (lc), copper (eam), Stillinger-Weber (sw), Tersoff, water, and polyethylene datasets, respectively.
Figure 2 shows the single-node performance for the eight datasets (subfigures 2a-h) listed in Table 2 with the four Ice Lake processor models available for evaluation of LAMMPS.
For ease of comparison across the processor models, the relative performance for each dataset is plotted in a separate graph. However, it is worth noting that each dataset behaves individually when performance is considered, as each uses different molecular potentials and has a different number of atoms. Figure 2 shows that performance increases with the core count of the processor model across the datasets used. Comparing the Xeon 8380 (40C) with the baseline Xeon 6330 (28C), we measured a 30 to 45 percent performance gain with these datasets. A fraction of this gain is due to the frequency of the processor model.
Figure 3a: Performance of LAMMPS on Cascade Lake (Xeon 6248R) in comparison to Ice Lake (Xeon 6330)
Figure 3b: Performance of LAMMPS on Cascade Lake (Xeon 8280) in comparison to Ice Lake (Xeon 8380)
Figure 3 compares the performance of the mid-bin Cascade Lake 6248R (24 cores) to the Ice Lake 6330 (28 cores), and the top-end Cascade Lake 8280 (28 cores) to the Ice Lake 8380 (40 cores). Figure 3a shows that the Ice Lake 6330 is up to 30 percent faster than the 6248R. The Xeon 6330 has 16 percent more cores and 9 percent faster memory speed (3200 MT/s versus 2933 MT/s). Figure 3b shows that the Ice Lake 8380 is up to 75 percent faster than the Xeon 8280 on single-node tests, which is in line with its 42 percent additional cores and 9 percent faster memory speed. These results are due to a higher processor speed wherein more data can be accessed by each core.
To analyze strong and weak scaling, we used the atomic fluid (LJ) dataset from the INTEL package. The job ran for 7900 steps with 512000 atoms in the simulation system.
Figure 4a: Fixed size Atomic fluid (LJ) for different problem size (strong scaling) w/ Xeon 8380
Strong scaling refers to keeping the problem size fixed while increasing the number of parallel processes (Amdahl’s law). In weak scaling, we varied the system size from 512000 atoms to 4096000 atoms along with the number of parallel processes (Gustafson-Barsis law). The test bed included Dell EMC PowerEdge R750 servers, each with dual Ice Lake Xeon 8380 processors and an NVIDIA Networking HDR interconnect running at 200 Gbps. Figure 4a plots the fixed-size relative performance for four different problem sizes (512000, 1024000, 2048000, and 4096000 atoms) on different numbers of nodes.
The relative performance is normalized by single node performance. Hence, the single node performance for each curve is 1.00 (unity). Relative Performance for fixed size Atomic fluid was calculated by the following equation:
Relative Performance = loop time of ‘N’ node / loop time for single node
Loop time is the total wall-clock time for the simulation to run. Relative performance increases with problem size because, for smaller problems, the system spends more time in inter-node communication. The time spent in communication at 8 nodes is 61.91%, 59.74%, 48.42%, and 45.04% for 512000, 1024000, 2048000, and 4096000 atoms, respectively.
Figure 4b: Scaled size Atomic fluid (LJ) with 512000 atoms per node (weak scaling) w/ Xeon 8380
Figure 4b plots the scaled-size efficiency for runs with 512000 atoms per node. Thus, a scaled-size 2-node run is for 1024000 atoms, and an 8-node run is for 4096000 atoms. Relative performance for the scaled-size atomic fluid was calculated by the following equation:
Relative Performance= loop time for ‘n’ node/ (loop time for single node * number of nodes)
Weak scaling efficiency decreases as the number of nodes increases in the investigated range, because more time is spent in MPI communication at larger node counts. The time spent in communication for the scaled-size runs on 1, 2, 4, and 8 nodes is 27.17%, 32.42%, 40.87%, and 45.04%, respectively.
Figure 5: Multi-node efficiency for Atomic Fluid (LJ) w/ Xeon 8380
Figure 5 plots the multi-node efficiency for Atomic Fluid with Xeon 8380. The relative performance is normalized by the single-node performance with 512000 atoms. Hence, the single-node performance for 512000 atoms is 1.00 (unity). This point is taken as the baseline for the other comparisons.
Performance Rating = (loop time * number of atoms)/ (loop time for 512000 atoms on 1 node * number of nodes * 512000)
We observed that for smaller systems (fewer atoms), strong scaling efficiency decreases as the system spends more time in MPI communication, whereas for larger systems (many atoms), strong scaling efficiency increases as the time spent in pair-wise force calculation becomes dominant. For weak scaling, efficiency decreases as the number of nodes increases.
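For readers who want to reproduce this kind of analysis on their own runs, the conventional textbook definitions of strong-scaling speedup and weak-scaling efficiency can be written in terms of wall-clock time as shown below. This is a sketch of the standard definitions; the normalization used for the figures in this blog may differ, and the timings in the example are hypothetical.

```python
def strong_scaling_speedup(t_single, t_nodes):
    """Speedup for a fixed problem size: single-node time / N-node time."""
    return t_single / t_nodes


def parallel_efficiency(speedup, nodes):
    """Fraction of ideal (linear) speedup achieved on the given node count."""
    return speedup / nodes


def weak_scaling_efficiency(t_single, t_nodes):
    """Efficiency when the problem grows with node count (ideal value is 1.0)."""
    return t_single / t_nodes


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, for illustration only.
    speedup = strong_scaling_speedup(100.0, 16.0)
    print(speedup, parallel_efficiency(speedup, 8))
    print(weak_scaling_efficiency(100.0, 125.0))
```

With these definitions, growing communication time shows up as a speedup below the node count in the strong-scaling case and as an efficiency below 1.0 in the weak-scaling case.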
The Ice Lake processor-based Dell EMC PowerEdge servers, with their hardware feature upgrades over Cascade Lake, demonstrate up to a 50 to 70 percent performance gain across the datasets used for benchmarking LAMMPS. Watch our blog site for updates!
Fri, 02 Dec 2022 05:33:27 -0000
3rd Generation Intel Xeon® Scalable processors (architecture code named Ice Lake) are Intel’s successors to Cascade Lake. New features include up to 40 cores per processor, eight memory channels with support for 3200 MT/s memory speed, and PCIe Gen4.
The HPC and AI Innovation Lab at Dell EMC had access to a few systems and this blog presents the results of our initial benchmarking study on a popular open-source molecular dynamics application – GROningen MAchine for Chemical Simulations (GROMACS).
Molecular dynamics (MD) simulation is a popular technique for studying the atomistic behavior of a molecular system. It analyzes the trajectories of atoms and molecules as the dynamics of the system progress over time.
At the HPC and AI Innovation Lab, we have conducted research on SARS-CoV-2 in which applications like GROMACS helped researchers identify molecules that bind to the spike protein of the virus and block it from infecting human cells. Another use case of MD simulation in medicinal biology is iterative drug design through prediction of protein-ligand docking (in this case, usually modelling the interaction between a drug and its target protein).
GROMACS is a versatile package for MD simulations; that is, it simulates the Newtonian equations of motion for systems with hundreds to millions of particles. GROMACS can be run on CPUs and GPUs in single-node and multi-node (cluster) configurations. It is free, open-source software released under the GNU Lesser General Public License (LGPL). Check out this page for more details on GROMACS.
Table 1: Hardware and Software testbed details
Component | Dell EMC PowerEdge R750 server | Dell EMC PowerEdge R750 server | Dell EMC PowerEdge C6520 server | Dell EMC PowerEdge C6520 server | Dell EMC PowerEdge C6420 server | Dell EMC PowerEdge C6420 server |
SKU | Xeon 8380 | Xeon 8358 | Xeon 8352Y | Xeon 6330 | Xeon 8280 | Xeon 6252 |
Cores/Socket | 40 | 32 | 32 | 28 | 28 | 24 |
Base Frequency | 2.30 GHz | 2.60 GHz | 2.20 GHz | 2.00 GHz | 2.70 GHz | 2.10 GHz |
TDP | 270 W | 250 W | 205 W | 205 W | 205 W | 150 W |
L3Cache | 60M | 48M | 48M | 42M | 38.5M | 37.75M |
Operating System | Red Hat Enterprise Linux 8.3 4.18.0-240.22.1.el8_3.x86_64 | |||||
Memory | 16 GB x 16 (2Rx8) 3200 MT/s | 16 GB x 12 (2Rx8) 2933 MT/s | ||||
BIOS/CPLD | 1.1.2/1.0.1 | |||||
Interconnect | NVIDIA Mellanox HDR | NVIDIA Mellanox HDR100 | ||||
Compiler | Intel parallel studio 2020 (update 4) | |||||
GROMACS | 2021.1 |
Table 2: Description of datasets used for performance analysis
Datasets/Download Link | Description | Electrostatics | Atoms | System Size |
Water | Movement of water: simulates the motion of many water molecules in a given space and temperature | Particle Mesh Ewald (PME) | 1536K | Small |
Water | | Particle Mesh Ewald (PME) | 3072K | Large |
HecBioSim | 1.4M atom system: a pair of hEGFR dimers of 1IVO and 1IVO; 3M atom system: a pair of hEGFR tetramers of 1IVO and 1IVO | Particle Mesh Ewald (PME) | 1.5M | Small |
HecBioSim | | Particle Mesh Ewald (PME) | 3M | Large |
Prace – Lignocellulose | Simulates lignocellulose; the tpr was obtained from the PRACE website | Reaction Field (rf) | 3M | Large |
We compiled GROMACS from source (version 2021.1) using the Intel 2020 Update 5 compiler to take advantage of AVX2 and AVX512 optimizations, and the Intel MKL FFT library. The new version of GROMACS has a significant performance gain due to the improvements in its parallelization algorithms. The GROMACS build system and the gmx mdrun tool have built-in and configurable intelligence that detects your hardware and makes effective use of it.
Our objective is to quantify the performance of GROMACS with different test cases. First, we evaluate performance on the different Ice Lake processors listed in Table 1. Then, we compare 2nd and 3rd Gen Xeon Scalable processors (Cascade Lake vs Ice Lake). Finally, we compare multi-node scalability with hyper threading enabled and disabled.
To find the best-performing configuration for each dataset, we applied the associated high-level compiler flags, tuned the electrostatics load balancing (such as PME), and tested multiple rank counts, separate PME ranks, and different nstlist values to arrive at a run configuration for GROMACS.
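A parameter sweep of this kind can be generated with a short script. The sketch below builds one `gmx_mpi mdrun` command line per combination of OpenMP threads per rank, separate PME ranks, and nstlist value. The flags `-ntomp`, `-npme`, `-nstlist`, `-noconfout`, and `-resethway` are standard mdrun options, but the launcher, the binary name, the input file, and the candidate values are assumptions for illustration and are not the settings used in this study.

```python
from itertools import product


def mdrun_commands(tpr, cores, threads_per_rank, pme_ranks, nstlists):
    """Build one mdrun command per (ntomp, npme, nstlist) combination."""
    commands = []
    for ntomp, npme, nstlist in product(threads_per_rank, pme_ranks, nstlists):
        if cores % ntomp:
            continue  # skip layouts that would not fill every core
        ranks = cores // ntomp
        if npme >= ranks:
            continue  # need at least one particle-particle rank (-1 means auto)
        commands.append(
            f"mpirun -np {ranks} gmx_mpi mdrun -s {tpr} "
            f"-ntomp {ntomp} -npme {npme} -nstlist {nstlist} "
            f"-noconfout -resethway"
        )
    return commands


if __name__ == "__main__":
    # Hypothetical sweep for an 80-core node and an input file named water.tpr.
    for cmd in mdrun_commands("water.tpr", 80, [1, 2, 4], [-1, 8, 16], [80, 100]):
        print(cmd)
```

Each generated command can be timed and the best ns/day value kept for the dataset.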
The typical time scales of the simulated systems are on the order of microseconds (µs) or nanoseconds (ns). We measure the simulation performance for each dataset in nanoseconds per day (ns/day).
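The ns/day metric follows directly from the number of MD steps, the integration timestep, and the wall-clock time of the run. A minimal sketch, with a hypothetical function name and example values:

```python
SECONDS_PER_DAY = 86400


def ns_per_day(steps, timestep_ps, wall_seconds):
    """Simulated nanoseconds that would be completed in 24 hours of wall time."""
    simulated_ns = steps * timestep_ps / 1000.0  # picoseconds to nanoseconds
    return simulated_ns * SECONDS_PER_DAY / wall_seconds


if __name__ == "__main__":
    # Hypothetical run: 10000 steps of 0.002 ps completed in 172.8 seconds.
    print(ns_per_day(10000, 0.002, 172.8))
```

GROMACS reports this value itself at the end of the mdrun log, so the function is mainly useful for cross-checking or for comparing against other MD codes.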
Figure 1(a): Single node performance of Water 1536K and Water 3072K on Ice Lake processor model
Figure 1(b): Single node performance of Lignocellulose 3M on Ice Lake processor model
Figure 1(c): Single node performance of HecBioSim 1.4M and HecBioSim 3M on Ice Lake processor model
Figures 1(a), (b), and (c) show the single-node performance for the three datasets mentioned in Table 2 with the four processor models available for evaluation of GROMACS.
Figure 2: Relative Performance of GROMACS across the datasets with Intel Ice Lake Processor Model
For ease of comparison across the various datasets, the relative performance of the processor model has been included into a single graph. However, it is worth noting that each dataset behaves individually when performance is considered, as each uses different molecular topology input files (tpr), and configuration files.
Individual dataset performance is mentioned in Figures 1(a), 1(b), and 1(c) respectively.
Figure 2 shows that performance increases with the core count of the processor model, depending on the dataset used. We observe that the smaller datasets (water 1536K and HecBioSim 1400K) gain 5 to 6 percent more than the larger datasets (water 3072K, HecBioSim 3M, and Ligno 3M).
Next, comparing the Xeon 8380 (40C) with the baseline Xeon 6330 (28C), we found a 30 to 50 percent performance gain, depending on the dataset, as the core count increases from 28 to 40. A fraction of the gain is due to the frequency of the processor model.
Figure 3(a): Performance of GROMACS on Cascade Lake (Xeon 6252) vs Ice Lake (Xeon 6330)
Figure 3(b): Performance of GROMACS on Cascade Lake (Xeon 8280) vs Ice Lake (Xeon 8380)
We ensured that each dataset fit in the available memory. To begin, we compared each processor with its previous-generation counterpart. For performance benchmark comparisons, we selected the Cascade Lake processors closest to their Ice Lake counterparts in terms of hardware features such as cache size, TDP values, and processor base/turbo frequency, and marked the maximum ns/day value attained by each of the datasets mentioned in Table 2.
Figure 3a shows that the Ice Lake 6330 is 50 to 75 percent faster than the 6252. The Xeon 6330 has 16 percent more cores and 9 percent faster memory speed. Figure 3b shows that the Ice Lake 8380 is 50 to 65 percent faster than the Xeon 8280 on single-node tests, which is in line with its 42 percent more cores and 9 percent faster memory speed.
This result is due to a higher processor speed, wherein more data can be accessed by each core. Also, the datasets are memory intensive, and some of the gain comes from the frequency improvement. Overall, the Ice Lake processor results demonstrated a substantial performance improvement for GROMACS over Cascade Lake processors.
Figure 4(c): Scalability of HecBioSim 1.4M with hyper threading disabled (80C) vs hyper threading enabled (160C) w/ Xeon 8380; the dotted line represents the delta between hyper threading enabled and hyper threading disabled
Figure 4(d): Scalability of HecBioSim 3M with hyper threading disabled (80C) vs hyper threading enabled (160C) w/ Xeon 8380; the dotted line represents the delta between hyper threading enabled and hyper threading disabled
Figure 4(e): Scalability of Lignocellulose 3M with hyper threading disabled (80C) vs hyper threading enabled (160C) w/ Xeon 8380; the dotted line represents the delta between hyper threading enabled and hyper threading disabled
For multi-node tests, the test bed was configured with an NVIDIA Mellanox HDR interconnect running at 200 Gbps and each server having the Ice Lake processor. We were able to achieve the expected linear performance scalability for GROMACS of up to eight nodes with hyper threading disabled and approximately 7.25X with hyper threading enabled for eight nodes, across the datasets. All cores in each server were used while running these benchmarks. The performance increases are close to linear across all the dataset types as the core count increases.
The Ice Lake processor-based Dell EMC PowerEdge servers, with notable hardware feature upgrades over Cascade Lake, show up to a 50 to 60 percent performance gain across the datasets used for benchmarking GROMACS. For the benchmarks addressed in this blog, hyper threading should be disabled to get better scalability above eight nodes. With increasing core count, the small datasets mentioned in this blog gain 5 to 6 percent more than the larger ones.
Watch our blog site for updates!
Thu, 19 Aug 2021 20:06:53 -0000
AMD has recently announced and launched its third-generation 7003 series EPYC processor family (code named Milan). These processors build upon the preceding generation 7002 series (Rome) processors and improve the L3 cache architecture along with increased memory bandwidth for workloads such as High Performance Computing (HPC).
The Dell EMC HPC and AI Innovation Lab has been evaluating these new processors with Dell EMC’s latest 15G PowerEdge servers and will report our initial findings for the molecular dynamics (MD) application GROMACS in this blog.
Given the enormous health impact of the ongoing COVID-19 pandemic, researchers and scientists are working closely with the HPC and AI Innovation Lab to obtain the appropriate computing resources to improve the performance of molecular dynamics simulations. Of these resources, GROMACS is an extensively used application for MD simulations. It has been evaluated with the standard datasets by combining the latest AMD EPYC Milan processor (based on Zen 3 cores) with Dell EMC PowerEdge servers to get the most out of the MD simulations.
In a previous blog, Molecular Dynamic Simulation with GROMACS on AMD EPYC- ROME, we published benchmark data for a GROMACS application study on a single node and multinode with AMD EPYC ROME based Dell EMC servers.
The results featured in this blog come from the test bed described in the following table. We performed a single-node and multi-node application study on Milan processors, using the latest AMD stack shown in Table 1, with GROMACS 2020.4 to understand the performance improvement over the older generation processor (Rome).
Table 1: Testbed hardware and software details
Server | Dell EMC PowerEdge 2-socket servers (with AMD Milan processors) | Dell EMC PowerEdge 2-socket servers (with AMD Rome processors) |
Processor; Cores/socket; Frequency (Base-Boost); Default TDP; L3 cache; Processor bus speed | 7763 (Milan); 64; 2.45 GHz – 3.5 GHz; 280 W; 256 MB; 16 GT/s | 7H12 (Rome); 64; 2.6 GHz – 3.3 GHz; 280 W; 256 MB; 16 GT/s |
Processor; Cores/socket; Frequency (Base-Boost); Default TDP; L3 cache; Processor bus speed | 7713 (Milan); 64; 2.0 GHz – 3.675 GHz; 225 W; 256 MB; 16 GT/s | 7702 (Rome); 64; 2.0 GHz – 3.35 GHz; 200 W; 256 MB; 16 GT/s |
Processor; Cores/socket; Frequency (Base-Boost); Default TDP; L3 cache; Processor bus speed | 7543 (Milan); 32; 2.8 GHz – 3.7 GHz; 225 W; 256 MB; 16 GT/s | 7542 (Rome); 32; 2.9 GHz – 3.4 GHz; 225 W; 128 MB; 16 GT/s |
Operating system | Red Hat Enterprise Linux 8.3 (4.18.0-240.el8.x86_64) | Red Hat Enterprise Linux 7.8 |
Memory | DDR4 256 G (16 GB x 16) 3200 MT/s | |
BIOS/CPLD | 2.0.2 / 1.1.12 | |
Interconnect | NVIDIA Mellanox HDR | NVIDIA Mellanox HDR 100 |
Table 2: Benchmark datasets used for GROMACS performance evaluation
Datasets | Details |
Water | 1536 K and 3072 K |
HecBioSim | 1400 K and 3000 K |
Prace – Lignocellulose | 3M |
The following information describes the performance evaluation for the processor stack listed in the Table 1.
Figure 1: GROMACS performance comparison with AMD Rome processors
For performance benchmark comparisons, we selected Rome processors that are closest to their Milan counterparts in terms of hardware features such as cache size, TDP values, and processor base/turbo frequency, and marked the maximum ns/day value attained by each of the datasets mentioned in Table 2.
Figure 1 shows that a 32C Milan processor delivers higher performance (19 percent for water 1536, 21 percent for water 3072, and approximately 10 to 12 percent for the HecBioSim and lignocellulose datasets) than a 32C Rome processor. This result is due to a higher processor speed and improved L3 cache, wherein more data can be accessed by each core.
Next, with the higher end processor we see only 10 percent gain with respect to the water dataset, as they are more memory intensive. Some percentage is added on due to improvement of frequency for the remaining datasets. Overall, the Milan processor results demonstrated a substantial performance improvement for GROMACS over Rome processors.
Milan processors comparison (32C processors compared to 64C processors)
Figure 2: GROMACS performance with Milan processors
Figure 2 shows performance relative to the performance obtained on the 7543 processor. For instance, moving from the 32-core (32C) processor to the 64-core (64C) processors improves the performance of water 1536 by 41 percent (7713 processor) to 57 percent (7763 processor). The performance improvement is due to the higher core count and higher CPU core frequency. We observed that GROMACS is frequency sensitive, but not to a great extent. Greater gains may be seen when running GROMACS across multiple ensemble runs or running datasets with a higher number of atoms.
We recommend that you compare the price-to-performance ratio before choosing a processor with a higher CPU core frequency for your datasets, because processors with a higher number of lower-frequency cores may provide better total performance.
Figure 3: Multi-node study with 7713 64c SKUs
For multi-node tests, the test bed was configured with an NVIDIA Mellanox HDR interconnect running at 200 Gbps and each server included an AMD EPYC 7713 processor. We achieved the expected linear performance scalability for GROMACS of up to four nodes and across each of the datasets. All cores in each server were used while running the benchmarks. The performance increases are close to linear across all the dataset types as core count increases.
For the various datasets we evaluated, GROMACS exhibited strong scaling and was compute intensive. We recommend a processor with high core count for smaller datasets (water 1536, hec 1400); larger datasets (water 3072, ligno,HEC 3000) would benefit from memory per core. Configuring the best BIOS options is important to get the best performance out of the system.
For more information and updates, follow this blog site.
Tue, 03 Aug 2021 14:49:25 -0000
The Weather Research and Forecasting (WRF) model is an open-source mesoscale weather prediction model that is predominantly used in a multi-compute node environment for atmospheric research and operational forecasts. This model performs well on the latest generation of the AMD EPYC 3rd Gen (7003 Series) processor family, code name Milan. In this blog, we highlight the performance improvement of WRF application on the AMD Milan processors based on Dell EMC PowerEdge servers.
This blog follows up our first blog in this series, where we introduced the AMD Milan processor architecture, key BIOS tuning options, and baseline microbenchmark performance. We analyzed the performance improvement of the latest AMD EPYC Milan (7003 Series) processor-based Dell EMC PowerEdge servers compared to the second-generation AMD EPYC Rome (7002 Series) processor-based Dell EMC PowerEdge servers. The testbed hardware and software details are outlined in the following table:
Table 1: Testbed hardware and software details
Server | Dell EMC PowerEdge 2-socket servers (with AMD Milan Processors) | Dell EMC PowerEdge 2-socket servers (with AMD Rome Processors) | |||
Processor model | 7763 | 7713 | 7543 | 7662 | 7542 |
Cores/socket | 64 | 64 | 32 | 64 | 32 |
Frequency (Base-Boost) | 2.45 GHz – 3.5 GHz | 2.0 GHz – 3.7 GHz | 2.8 GHz – 3.7 GHz | 2.0 GHz – 3.35 GHz | 2.9 GHz – 3.4 GHz |
TDP | 280 W | 225 W | 225 W | 200 W | 225 W |
L3 cache | 256 MB | 256 MB | 256 MB | 256 MB | 128 MB |
Processor bus speed | 16 GT/s | 16 GT/s | 16 GT/s | 16 GT/s | 16 GT/s |
Operating system | Red Hat Enterprise Linux 8.3 (4.18.0-240.el8.x86_64) | ||||
Memory | DDR4 256G (16 GB x 16) 3200 MT/s | ||||
Interconnect | NVIDIA Mellanox HDR | ||||
BIOS/CPLD | 2.2.5 / 1.1.12 (AMD 7763,AMD 7713,AMD 7543) 2.1.6 / 1.1.12 (AMD 7662) 2.1.5 / 0.10.3 (AMD 7542) | ||||
Applications | WRF v3.9.1.1, WRF v4.2.2 | ||||
Benchmark datasets |
Figure 1: Domain configuration for new conus 2.5 km, conus 2.5 km, and wrf_large datasets.
The following table provides a brief description of each dataset:
Table 2: Configuration for the new conus 2.5 km, conus 2.5 km, and wrf_large datasets
| conus 2.5 km | new conus 2.5 km | wrf_large |
Run hours | 3 | 3 | 2 |
Resolution(m) | 2500 | 2500 | 3000 |
Vertical layers | 35 | 35 | 50 |
Grid points | 1501 x 1201 | 1901 x 1301 | 1500 x 1500 |
interval_seconds | 10800 | 10800 | 21600 |
The results were measured by averaging the WRF computation time of each timestep from the rsl.error.0000 output file.
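As a rough sketch of that measurement, the snippet below averages the per-timestep compute times found in an rsl log. It assumes the usual WRF log line shape (`Timing for main ...: time <timestamp> on domain <n>: <seconds> elapsed seconds`); the function name `mean_step_time` and the sample text are illustrative and are not taken from the study.

```python
import re
from statistics import mean

# Assumed WRF log format. "[^:]*" tolerates an optional "(dt= ...)" field after
# "main". I/O lines such as "Timing for Writing ..." do not match and are skipped.
STEP_RE = re.compile(
    r"Timing for main[^:]*: time \S+ on domain\s+\d+:\s+([\d.]+) elapsed seconds"
)

def mean_step_time(log_text: str) -> float:
    """Return the average elapsed seconds per timestep found in the log text."""
    times = [float(m.group(1)) for m in STEP_RE.finditer(log_text)]
    if not times:
        raise ValueError("no 'Timing for main' lines found")
    return mean(times)

# Hypothetical excerpt, for illustration only.
sample = """\
Timing for Writing wrfout_d01_2018-06-17_00:00:00 for domain        1:    3.10000 elapsed seconds
Timing for main: time 2018-06-17_00:00:15 on domain   1:    2.00000 elapsed seconds
Timing for main: time 2018-06-17_00:00:30 on domain   1:    1.00000 elapsed seconds
"""
print(mean_step_time(sample))
```

On a real run, the same function could be applied to the contents of rsl.error.0000, for example `mean_step_time(open("rsl.error.0000").read())`.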
Single node performance
The following figures show the application performance for the datasets mentioned in Table 2. In each figure, the numbers over the bars represent the relative change in the application performance compared to the application performance obtained on the AMD 7542 Rome processor model.
Figure 2: Relative difference in the performance of WRF by processor and dataset type mentioned in Table 1
WRF was compiled with the "dm + sm" configuration and all the available cores were subscribed during WRF simulation runs. To optimize performance, we tried different combinations of MPI process count and OpenMP thread count, along with different tiling scheme (WRF_NUM_TILES) options. For single-node tests, two MPI processes per Core Complex Die (CCD) delivered the best results for the conus 2.5 km and new conus 2.5 km datasets. We used eight processes per CCD for the wrf_large dataset.
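To make the process and thread arithmetic concrete, here is a minimal sketch that derives the MPI rank count and OpenMP threads per rank from a per-CCD process count when every core is subscribed. The node shape used in the example (128 cores and 16 CCDs for a dual-socket, 64-core Milan server) is an assumption for illustration, and `hybrid_layout` is a hypothetical helper, not part of WRF.

```python
from typing import Tuple

def hybrid_layout(cores_per_node: int, ccds_per_node: int, ranks_per_ccd: int) -> Tuple[int, int]:
    """Return (MPI ranks per node, OpenMP threads per rank) with all cores in use."""
    cores_per_ccd, rem = divmod(cores_per_node, ccds_per_node)
    if rem:
        raise ValueError("cores do not divide evenly across CCDs")
    threads_per_rank, rem = divmod(cores_per_ccd, ranks_per_ccd)
    if rem:
        raise ValueError("ranks per CCD must divide the cores per CCD")
    return ccds_per_node * ranks_per_ccd, threads_per_rank

# Assumed node shape: 2 sockets x 64 cores, 8 CCDs per socket.
print(hybrid_layout(128, 16, 2))  # two processes per CCD
print(hybrid_layout(128, 16, 8))  # eight processes per CCD
```

The threads-per-rank value would typically be exported as OMP_NUM_THREADS before launching the job; the best tiling setting still has to be found by experiment.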
Depending on the dataset, the AMD 7763 processor can deliver up to 14 percent better performance than the AMD 7543 processor. In the previous blog, we observed better performance improvements on the 32-core Milan processor model with memory bandwidth bound benchmarks like HPCG and STREAM. WRF is a memory bandwidth bound application and there is a notable performance improvement in the 32-core processor model: the AMD 7543 delivers up to 26 percent better performance than the AMD 7542 processor.
From the performance that is shown in Figure 2 and the average power usage data that is shown in Figure 3, we noted that the AMD 7713 processor can deliver up to 58 percent better performance per watt than the AMD 7662 processor.
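A performance-per-watt comparison of this kind divides each system's performance by its average power draw and then takes the ratio. The sketch below shows the arithmetic with placeholder inputs; the numbers are not measurements from this study, and `perf_per_watt_gain` is a hypothetical helper.

```python
def perf_per_watt_gain(perf_new: float, watts_new: float,
                       perf_old: float, watts_old: float) -> float:
    """Fractional gain in performance per watt of the new system over the old one."""
    return (perf_new / watts_new) / (perf_old / watts_old) - 1.0

# Placeholder values: 20% faster while drawing 480 W instead of 600 W.
gain = perf_per_watt_gain(perf_new=1.2, watts_new=480.0, perf_old=1.0, watts_old=600.0)
print(f"{gain:.0%} better performance per watt")
```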
Figure 3: Power used by platform and processor type: average idle server power usage was 305 W (7542), 338 W (7662), 305 W (7543), 258 W (7713), and 272 W (7763)
Multi-node scalability
To evaluate the scalability of WRF, we used eight nodes. Each node is equipped with an AMD 7713 processor and interconnected using the NVIDIA Mellanox HDR interconnect. The nodes used for benchmarking were connected to the same HDR switch. Table 1 provides details about the server and software that were used for the test. The text on top of the line represents the relative change in the application performance (on 2, 4, and 8 nodes) with respect to the application performance obtained on a single node.
Figure 4: Multi-node performance of WRF on an AMD Milan 7713 processor for datasets listed in Table 1
The scalability numbers have been rounded off to a single digit. We observed good scalability with all the datasets listed in Table 1.
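For readers reproducing a scaling plot of this kind, the relative performance on N nodes is the single-node time divided by the N-node time, and parallel efficiency is that speedup divided by N. The sketch below uses placeholder timings, not results from this study.

```python
from typing import Tuple

def scaling(time_one_node: float, time_n_nodes: float, nodes: int) -> Tuple[float, float]:
    """Return (speedup relative to one node, parallel efficiency)."""
    speedup = time_one_node / time_n_nodes
    return speedup, speedup / nodes

# Placeholder average step times in seconds on 1 and 8 nodes.
speedup, efficiency = scaling(8.0, 1.25, 8)
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
```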
Conclusions and recommendations
WRF delivers better performance and performance per watt on AMD Milan processors. There is a significant performance improvement on the 32-core Milan processor model, and the WRF simulations scale well with the datasets described in this blog. However, the scalability might vary depending on the dataset being used and the node count being tested. Ensure that you test the impact of the tile size, process count, and threads per process before use.
We will continue to post new blogs on this site as updates arise.
Thu, 22 Jul 2021 09:03:25 -0000
Overview
Over the past decade, GPUs have become popular in scientific computing because of their great ability to exploit a high degree of parallelism. NVIDIA has optimized life sciences applications to run on their general-purpose GPUs. Unfortunately, these GPUs can only be programmed with CUDA, OpenACC, or the OpenCL framework. Most of the life sciences community is not familiar with these frameworks so few biologists or bioinformaticians can make efficient use of GPU architectures. However, GPUs have been making inroads into the molecular dynamics simulation (MDS) field since MD was developed in the 1950s. MDS requires heavy computational work to simulate biomolecular structures or their interactions.
In this blog, the performance of one popular MDS application, NAMD, is presented with various NVIDIA A-series GPUs: the A100, the A10, the A30, and the A40. NAMD is a free and open-source parallel MD package designed for analyzing the physical movements of atoms and molecules.
Dell Technologies has released the new PowerEdge R750xa server, a GPU workload platform that is designed to support artificial intelligence, machine learning, and high-performance computing solutions. The dual socket/2U platform supports 3rd Gen Intel Xeon Scalable Processors (code named Ice Lake). It supports up to 40 cores per processor, has eight memory channels per CPU, and up to 32 DDR4 DIMMs at 3200 MT/s DIMM speed. This server can accommodate up to four double-width PCIe GPUs that are located in the front left and the front right of the server. The test server configurations are summarized in Table 1, and the specifications of tested NVIDIA GPUs are listed in Table 2.
Table 1: Tested compute node configuration
Test Beds | ||||
Server | Dell EMC PowerEdge R750xa | Dell EMC PowerEdge R740 | ||
CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30 GHz | Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40 GHz | Intel(R) Xeon(R) Gold 6248 CPU @ 2.50 GHz | |
NVIDIA GPUs | 4 x A100 | 4 x A10 | 4 x A30 | 2 x A40 |
RAM | DDR4 1024 GB (32 x 32 GB) 3200 MT/s | DDR4 384 GB (24 x 16 GB) 2933 MT/s | ||
Operating system | RHEL 8.3 (4.18.0-240.el8.x86_64) | |||
Filesystem network | Mellanox InfiniBand HDR100 | |||
Filesystem | Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage | |||
BIOS system profile | Performance Optimized | |||
Logical processor | Disabled | |||
Virtualization technology | Disabled | |||
Cuda/Toolkit | 11.2 | |||
OpenMPI | 4.1.1 | |||
NAMD | NAMD_Git-2021-04-01_Source |
Table 2: Specifications of tested NVIDIA GPUs
NVIDIA GPUs | A100 | A10 | A30 | A40 |
FP64 (TFLOPS) | 9.7 | Unknown | 5.2 | Unknown |
FP64 Tensor Core (TFLOPS) | 19.5 | Unknown | 10.3 | Unknown |
FP32 (TFLOPS) | 19.5 | 31.2 | 10.3 | 37.4 |
Tensor Float 32 (TFLOPS) | 156 / 312* | 62.5 / 125* | 82 / 165* | 74.8 / 149.6* |
BFLOAT16 Tensor Core (TFLOPS) | 312 / 624* | 125 / 250* | 165 / 330* | 149.7 / 299.4* |
FP16 Tensor Core (TFLOPS) | 312 / 624* | 125 / 250* | 165 / 330* | 149.7 / 299.4* |
INT8 Tensor Core (TOPS) | 624 / 1248* | 250 / 500* | 330 / 661* | 299.3 / 598.6* |
INT4 Tensor Core (TOPS) | Unknown | 500 / 1,000* | 661 / 1321* | 598.7 / 1,197.4* |
GPU memory | 40 GB HBM2 | 24 GB GDDR6 | 24 GB HBM2 | 48 GB GDDR6 |
GPU memory bandwidth | 1,555 GB/s | 600 GB/s | 933 GB/s | 696 GB/s |
Max Thermal Design Power (TDP) | 400W | 150W | 165W | 300W |
Multi-Instance GPU | Up to 7 MIGs @ 5 GB | Unknown | 4 GPU instances @ 6 GB each 2 GPU instances @ 12 GB each 1 GPU instance @ 24 GB | Unknown |
Form factor | PCIe | Single-slot, full-height, full-length (FHFL) | Dual-slot, full-height, full-length (FHFL) | 4.4" (H) x 10.5" (L) dual slot |
Interconnect | PCIe Gen4: 64 GB/s | PCIe Gen4: 64 GB/s | PCIe Gen4: 64 GB/s | PCIe Gen4 x16 31.5 GB/s (bidirectional) |
* With sparsity
Performance Evaluation
NAMD
NAMD was compiled from source code (NAMD_Git-2021-04-01_Source) using GCC 11.1 and CUDA 11.2. We used a test data set, the 1.06 million-atom system of Satellite Tobacco Mosaic Virus (STMV).
Figure 1 shows the performance of four GPUs with the STMV dataset. The figures represent the performance changes in nanoseconds per day (ns/day) with various numbers of cores used with one, two, or four GPUs. The only valid comparison between the various GPUs is between the NVIDIA A100 and the A10, since those test systems were configured identically. Although the performance of NAMD is affected by the CPU clock speed, the tested systems do not differ significantly in CPU clock speed. The A10 is rated at three times the single-precision FLOPS of the A30, and the A10 performs better than the A30 on the two-GPU tests even with slightly slower CPUs. The A100 outperformed the A10 by roughly 25 percent and 16 percent on the single-GPU and two-GPU tests, respectively.
The results from the four-GPU tests in Figure 1 show similar performance for the different GPUs. This agrees well with our previous test results, which showed that NAMD does not scale beyond two GPUs. We can rule out the potential argument that the data size might be too small, because the 3 million-atom data set, the HECBioSim3000k-atom system (a pair of 1IVO and 1NQL hEGFR tetramers), shows similar or worse results (those results are not shown here).
Figure 1: NAMD performance with STMV, 1 million-atom system
As shown in Figure 1, when four GPUs were tested, all of the GPUs except the A40 reached ~9 ns/day. In terms of maximum performance, the A10 achieved the highest rate, 9.121 ns/day. However, these numbers are not true reflections of the performance due to the scalability limitations. Although all four-GPU test results are similar, the A100 has better throughput than the other GPUs in the two-GPU test, as shown in Figure 2. It is also worth noting that the A10 and the A40 are not suitable for general-purpose computing due to the lack of double-precision support.
Figure 2 shows the performance comparisons among the different GPUs we tested in this study. Again, the A30 performed better than the A10 up to 16 cores. It is difficult to determine why the A30 doesn't perform as well with a large number of active CPU cores (20 or more).
Figure 2: STMV test results comparisons with two GPUs
Conclusion
The A100 shows dominant performance and is the most capable card among the A-series GPUs. Although the A30 did not perform as well as the A10 in our test, it is another outstanding choice for versatile applications.
The A10 performed well compared to the A30, and it is the successor of the T4, which was the most cost-effective solution for specific applications such as genomics data analysis.
Since it is not possible to obtain the accurate performance differences among A-series GPUs from this study, further investigation is necessary to achieve a clear picture of these general purpose GPUs.
Tue, 22 Jun 2021 15:00:00 -0000
Overview
Gene expression analysis is as important as identifying Single Nucleotide Polymorphism (SNP), insertion/deletion (indel), or chromosomal restructuring. Ultimately, all physiological and biochemical events depend on the final gene expression product, proteins. Although most mammals have an additional controlling layer before protein expression, knowing how many transcripts exist in a system helps to characterize the biochemical status of a cell. Ideally, technology should enable us to quantify all proteins in a cell, which would advance the progress of Life Science significantly; however, we are far from achieving this.
In this blog, we report the test results of one popular RNA-Seq data analysis pipeline known as the Tuxedo pipeline. The Tuxedo pipeline suite offers a set of tools for analyzing a variety of RNA-Seq data, including short-read mapping, identification of splice junctions, transcript and isoform detection, differential expression, visualizations, and quality control metrics. The tested workflow is a differentially expressed gene (DEG) analysis, and the detailed steps in the pipeline are shown in Figure 1.
Figure 1: Updated Tuxedo Pipeline with Cuffquant Step
In this study, we compared the single-node performance of 3rd Gen Intel® Xeon® Scalable Processors (Codename Ice Lake) and 2nd Generation Intel® Xeon® Scalable Processors (Codename Cascade Lake) on Dell EMC PowerEdge C6520 (liquid-cooled) servers and C6420 (air-cooled) servers. The configurations of the test systems are summarized in Table 1.
Table 1: Tested compute node configuration
Dell EMC PowerEdge C6520 Liquid Cooled | |
CPU | Tested 3rd Gen Intel® Xeon® Scalable Processors: 2 x Intel® Xeon® Platinum 8358, 32 Cores, 2.60 GHz – 3.40 GHz Base-Boost, TDP 250W 2 x Intel® Xeon® Platinum 8352Y, 32 Cores, 2.20 GHz – 3.40 GHz Base-Boost, TDP 205W
Tested 2nd Generation Intel® Xeon® Scalable Processors: 2 x Intel® Xeon® Gold 6248, 20 Cores, 2.50 GHz – 3.90 GHz Base-Boost, TDP 150W on Dell EMC PowerEdge C6420 Air Cooled |
RAM | DDR4 512 GB (16 x 32 GB) 3200 MT/s |
Operating system | RHEL 8.3 (4.18.0-240.el8.x86_64) |
Interconnect | Mellanox InfiniBand HDR100 |
File system | Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage |
BIOS system profile | Performance Optimized |
Logical processor | Disabled |
Virtualization technology | Disabled |
tophat | 2.1.1 |
bowtie2 | 2.2.5 |
R | 3.6 |
bioconductor-cummerbund | 2.26.0 |
A performance study of the RNA-Seq pipeline is not trivial because the workflow requires input files that are non-identical but similar in size. Hence, 185 RNA-Seq paired-end read data sets were collected from a public data repository. All the read data files contain around 25 Million Fragments (MF) and have similar read lengths. The samples for a test are randomly selected from the pool of the 185 paired-end read files. Although these test data have no biological meaning, their high level of noise puts the tests in a worst-case scenario.
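The random selection step can be sketched as a draw without replacement from the pool of 185 read sets. The helper `pick_samples` and the use of integer indices to stand in for file names are assumptions made for illustration; the study does not describe its selection script.

```python
import random
from typing import List, Optional

POOL_SIZE = 185  # number of paired-end read sets in the pool (from the text)

def pick_samples(n: int, seed: Optional[int] = None) -> List[int]:
    """Return n distinct pool indices, drawn without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(range(POOL_SIZE), n)

print(pick_samples(8, seed=0))
```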
Performance Evaluation
Throughput Test – Single pipeline with more than two samples, biological and technical duplicates
Typical RNA-Seq studies consist of multiple samples, sometimes hundreds of different samples, for example, normal versus disease or untreated versus treated samples. These samples tend to have a high level of noise for biological reasons; hence, the analysis requires vigorous data preprocessing procedures.
We tested various numbers of samples (all different RNA-Seq data selected from the 185 paired-end read data sets) to see how much data can be processed by a single node. As expected, the runtime of the Tuxedo Pipeline increases with the number of samples, as shown in Figure 2. Ice Lake CPUs reduced the overall runtime by 10% or more compared to Cascade Lake 6248 CPUs.
Figure 2: Total runtime comparisons from various number of samples with a single compute node
Conclusion
Many additional tests are still required to obtain a better insight from Intel Ice Lake processors for the NGS data analysis area. Unfortunately, we could not push our tests over 8 samples due to the storage limitation. However, there seems to be plenty of room for a higher throughput processing of more than 8 samples together.
Wed, 02 Jun 2021 19:37:48 -0000
Over the past decade, graphics processing units, or GPUs, have become popular in scientific computing because of their great ability to exploit a high degree of parallelism. NVIDIA has a handful of life sciences applications optimized and run on their general-purpose GPUs. Unfortunately, these GPUs can only be programmed with CUDA, OpenACC, and the OpenCL framework. Most members of the life sciences community are not familiar with these frameworks, and so few biologists or bioinformaticians can make efficient use of GPU architectures. However, GPUs have been making inroads into the molecular dynamics simulation (MDS) field since MD was developed in the 1950s. MDS requires heavy computational work to simulate biomolecular structures or their interactions.
In this blog, we tested two MDS applications, NAMD and LAMMPS, using the Dell EMC PowerEdge XE8545 server with NVIDIA A100 GPUs. Since the XE8545 server does not support the NVIDIA V100 GPU, we can only roughly estimate the performance boost with the A100 from our previous tests.
These two applications are free and open-source parallel MD packages designed for analyzing the physical movements of atoms and molecules.
The test server configuration is summarized in the following table.
Dell EMC PowerEdge XE8545 | |
CPU | 2x 7713 (Milan), 64 Cores, 2.0 GHz – 3.7 GHz Base-Boost, TDP 225 W, 256 MB L3 Cache |
RAM | DDR4 1024 GB (32 x 32 GB) 3200 MT/s |
Operating system | RHEL 8.3 (4.18.0-240.el8.x86_64) |
Filesystem network | Mellanox InfiniBand HDR100 |
Filesystem | Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage |
BIOS system profile | Performance Optimized |
Logical processor | Disabled |
Virtualization technology | Disabled |
Accelerator | 4 x A100-40 GB SXM4 |
Cuda/Toolkit | 11.2 |
OpenMPI | 4.1.1 |
NAMD | NAMD_Git-2021-04-01_Source |
LAMMPS | Stable version (29 Oct 2020) |
Nanoscale Molecular Dynamics (NAMD) is open-source software for molecular dynamics simulation written using the Charm++ parallel programming model and is designed for high-performance simulation of large biomolecular systems.
NAMD was built with the NAMD_Git-2021-04-01_Source source code on GCC 11.1 and CUDA 11.2. For our tests, we used two sets of data: the 1.06 million-atom Satellite Tobacco Mosaic Virus (STMV) system, and the HECBioSim3000k-atom system, which is a pair of 1IVO and 1NQL hEGFR tetramers.
Figure 1 shows the performance of 4x A100 GPUs with the STMV dataset. NAMD uses the ++p option to specify the number of worker threads, and the generic recommendation is to set it to the total number of cores minus the total number of GPUs. However, the optimal number of cores on the Milan EPYC 7003 family of processors, such as the EPYC 7713 that is used in the testing system, does not follow the generic recommendation. It seems to be around 79 to 90 cores. The optimal number of cores depends on the data size. A performance of close to 9 nanoseconds of simulation per day (ns/day) is a significant gain over the NVIDIA V100 tests that we ran previously. However, it is difficult to attribute the gain solely to the new A100 GPUs because the comparison of the 16 GB V100 on the Intel Skylake platform to the 40 GB A100 on the AMD Milan platform may not be valid.
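For reference, the generic rule of thumb mentioned above amounts to a one-line calculation. The sketch below applies it to the tested node shape (2 x 64 cores, 4 GPUs); `generic_worker_threads` is a hypothetical helper, and the observed optimum in this study was lower than the value it returns.

```python
def generic_worker_threads(total_cores: int, gpus: int) -> int:
    """Generic starting point: one core left free per GPU."""
    if gpus >= total_cores:
        raise ValueError("need more cores than GPUs")
    return total_cores - gpus

# Tested node: two 64-core CPUs and four A100 GPUs.
print(generic_worker_threads(2 * 64, 4))
```

A sweep over smaller thread counts is still needed to find the real optimum for a given data set.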
Figure 1. Estimated simulation time per day with 4x NVIDIA A100 GPUs
The purpose of an additional test with 3 million-atom protein tetramers is to confirm that the STMV test results are not artificial due to the relatively small icosahedron structure of STMV and the partial simulation of assembly and disassembly processes. Figure 2 shows the nanosecond simulations per day plot for the 3000k-atom data. 2.1 ns/day seems to be close to the maximum performance with 64 cores.
Figure 2. Estimated simulation time per day with 4x NVIDIA A100 GPUs
Large-scale Atomic/Molecular Massively Parallel Simulator, or LAMMPS, is a classical molecular dynamics code and has potentials for solid-state materials (metals and semiconductors), soft matter (biomolecules and polymers), and coarse-grained or mesoscopic systems. LAMMPS can model atoms, or can be used as a parallel particle simulator at the atomic, meso, or continuum scale. LAMMPS runs on single processors, or in parallel using message-passing techniques and spatial decomposition of the simulation domain. LAMMPS was built with GCC 11.1, OpenMPI 4.1.1, and CUDA 11.2 from the source. The 465k-atom system was selected from HECBioSim.
As shown in Figure 3, LAMMPS scales well over the number of A100s. With 4x A100 GPUs, an 8.4 ns/day simulation is achievable.
Figure 3. Estimated simulation time per day with various numbers of GPUs
Although it is not possible to compare the performance of the A100 and the V100 from this study, the Milan CPUs and A100 show a strong synergy between more cores with better and faster GPUs. Running NAMD and LAMMPS on the XE8545 with the A100 can deliver a better performance than a system with the V100.
Tue, 01 Jun 2021 20:18:04 -0000
Dell Technologies has released the new PowerEdge R750xa server, a GPU workload-based platform that is designed to support artificial intelligence, machine learning, and high-performance computing solutions. The dual socket/2U platform supports 3rd Gen Intel Xeon processors (code named Ice Lake). It supports up to 40 cores per processor, has eight memory channels per CPU, and up to 32 DDR4 DIMMs at 3200 MT/s DIMM speed. This server can accommodate up to four double-width PCIe GPUs that are located in the front left and the front right of the server.
Compared with the previous generation PowerEdge C4140 and PowerEdge R740 GPU platform options, the new PowerEdge R750xa server supports larger storage capacity, provides more flexible GPU offerings, and improves the thermal design.
Figure 1 PowerEdge R750xa server
The NVIDIA A100 GPUs are built on the NVIDIA Ampere architecture to enable double precision workloads. This blog evaluates the new PowerEdge R750xa server and compares its performance with the previous generation PowerEdge C4140 server.
The following table shows the specifications for the NVIDIA GPU that is discussed in this blog and compares the performance improvement from the previous generation.
Table 1 NVIDIA GPU specifications
PCIe | Improvement | ||
GPU name | A100 | V100 | |
GPU architecture | Ampere | Volta | - |
GPU memory | 40 GB | 32 GB | 25% |
GPU memory bandwidth | 1555 GB/s | 900 GB/s | 73% |
Peak FP64 | 9.7 TFLOPS | 7 TFLOPS | 39% |
Peak FP64 Tensor Core | 19.5 TFLOPS | N/A | - |
Peak FP32 | 19.5 TFLOPS | 14 TFLOPS | 39% |
Peak FP32 Tensor Core | 156 TFLOPS 312 TFLOPS* | N/A | - |
Peak Mixed Precision FP16 ops/ FP32 Accumulate | 312 TFLOPS 624 TFLOPS* | 125 TFLOPS | 5x |
GPU base clock | 765 MHz | 1230 MHz | - |
Peak INT8 | 624 TOPS 1,248 TOPS* | N/A | - |
GPU Boost clock | 1410 MHz | 1380 MHz | 2.1% |
NVLink speed | 600 GB/s | N/A | - |
Maximum power consumption | 250 W | 250 W | No change |
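The Improvement column is a simple ratio of the A100 value to the V100 value. As a quick consistency check, the sketch below recomputes it from the values listed in Table 1; `improvement_pct` is a hypothetical helper, and the percentages it prints may differ slightly from the table because of rounding.

```python
def improvement_pct(new: float, old: float) -> float:
    """Percentage improvement of `new` over `old`."""
    return (new / old - 1.0) * 100.0

# (A100, V100) pairs copied from Table 1.
specs = {
    "GPU memory (GB)": (40, 32),
    "GPU memory bandwidth (GB/s)": (1555, 900),
    "Peak FP64 (TFLOPS)": (9.7, 7),
    "Peak FP32 (TFLOPS)": (19.5, 14),
    "GPU Boost clock (MHz)": (1410, 1380),
}

for name, (a100, v100) in specs.items():
    print(f"{name}: {improvement_pct(a100, v100):.1f}%")
```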
This blog quantifies the performance improvement of the GPUs with the new PowerEdge GPU platform.
We derived all results presented in this blog from a single-node PowerEdge R750xa test bed in the Dell HPC & AI Innovation Lab. This section describes the test bed and the applications that were evaluated as part of the study. The following table provides test environment details:
Table 2 Server configuration
Component | Test Bed 1 | Test Bed 2 |
Server | Dell PowerEdge R750xa | Dell PowerEdge C4140 configuration M |
Processor | Intel Xeon 8380 | Intel Xeon 6248 |
Memory | 32 x 16 GB @ 3200MT/s | 16 x 16 GB @ 2933MT/s |
Operating system | Red Hat Enterprise Linux 8.3 | Red Hat Enterprise Linux 8.3 |
GPU | 4 x NVIDIA A100-PCIe-40 GB GPU | 4 x NVIDIA V100-PCIe-32 GB GPU |
The following table provides information about the applications and benchmarks used:
Table 3 Benchmark and application details
Application | Domain | Version | Benchmark dataset |
High-Performance Linpack | Floating point compute-intensive system benchmark | xhpl_cuda-11.0-dyn_mkl-static_ompi-4.0.4_gcc4.8.5_7-23-20 | Problem size is more than 95% of GPU memory |
HPCG | Sparse matrix calculations | xhpcg-3.1_cuda_11_ompi-3.1 | 512 * 512 * 288 |
GROMACS | Molecular dynamics application | 2020 | Ligno Cellulose, Water 1536, Water 3072 |
LAMMPS | Molecular dynamics application | 29 October 2020 release | Lennard Jones |
Large-Scale Atomic/Molecular Massively Parallel simulator (LAMMPS) is distributed by Sandia National Labs and the US Department of Energy. LAMMPS is open-source code that has different accelerated models for performance on CPUs and GPUs. For our test, we compiled the binary using the KOKKOS package, which runs efficiently on GPUs.
Figure 2 LAMMPS Performance on PowerEdge R750xa and PowerEdge C4140 servers
With the newer generation GPUs, the single-GPU performance of this application improved 2.4 times. The overall performance of a single server doubled with the PowerEdge R750xa server and NVIDIA A100 GPUs.
GROMACS is a free and open-source parallel molecular dynamics package designed for simulations of biochemical molecules such as proteins, lipids, and nucleic acids. It is used by a wide variety of researchers, particularly for biomolecular and chemistry simulations. GROMACS supports all the usual algorithms expected from modern molecular dynamics implementation. It is open-source software with the latest versions available under the GNU Lesser General Public License (LGPL).
Figure 3 GROMACS performance on PowerEdge C4140 and R750xa servers
With the newer generation GPUs, the single-GPU performance of this application improved approximately 1.5 times across the datasets. The overall performance of a single server also improved 1.5 times with a PowerEdge R750xa server and NVIDIA A100 GPUs.
High-Performance Linpack (HPL) needs no introduction in the HPC arena. It is a widely used standard benchmark in the industry.
Figure 4 HPL Performance on the PowerEdge R750xa server with A100 GPU and PowerEdge C4140 server with V100 GPU
Figure 5 Power use of the HPL running on NVIDIA GPUs
From Figure 4 and Figure 5, the following results were observed:
Figure 6 Scaling GPU performance data for HPCG Benchmark
As discussed in other blogs, High Performance Conjugate Gradient (HPCG) is another standard benchmark that tests the data access patterns of sparse matrix calculations. From the graph, we see that HPCG scales well, resulting in a 1.6 times performance improvement over the previous generation PowerEdge C4140 server with an NVIDIA V100 GPU.
The 73 percent improvement in memory bandwidth of the NVIDIA A100 GPU over the NVIDIA V100 GPU contributes to the performance improvement.
In this blog, we introduced the latest generation PowerEdge R750xa platform and discussed the performance improvement over the previous generation PowerEdge C4140 server. The PowerEdge R750xa server is a good option for customers looking for an Intel Xeon scalable CPU-based platform powered with NVIDIA GPUs.
With the newer generation PowerEdge R750xa server and NVIDIA A100 GPUs, the applications discussed in this blog show significant performance improvement.
In future blogs, we plan to evaluate NVLINK bridge support, which is another important feature of the PowerEdge R750xa server and NVIDIA A100 GPUs.
Mon, 24 May 2021 22:07:44 -0000
Intel® Xeon® Scalable Processors have a proven record of consistent and stable performance across many workload types. New 3rd Generation Intel® Xeon® Scalable Processors, also known by the code name Ice Lake, perform exceptionally well for a BWA-GATK pipeline. In this study, we tested two Ice Lake processors, the 8352Y and the 8358; the test server configuration is summarized in Table 1.
Dell EMC PowerEdge C6520 | |
CPU | Tested 3rd Gen Intel® Xeon® Scalable Processors: 2x Intel® Xeon® Platinum 8352Y Processor, 32 cores, 2.20 GHz – 3.40 GHz Base-Boost, TDP 205 W, 48 MB L3 Cache 2x Intel® Xeon® Platinum 8358 Processor, 32 cores, 2.60 GHz – 3.40 GHz Base-Boost, TDP 250 W, 48 MB L3 Cache |
RAM | DDR4 512G (32 GB x 12) 3200 MT/s |
Operating system | RHEL 8.3 (4.18.0-240.22.1) |
Filesystem network | NVIDIA Mellanox InfiniBand HDR100 |
Filesystem | Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage |
BIOS system profile | Performance Optimized |
Logical processor | Disabled |
Virtualization technology | Disabled |
BWA | |
Sambamba | |
Samtools | |
GATK |
The test data was chosen from one of Illumina’s Platinum Genomes. ERR194161 was processed with Illumina HiSeq 2000 submitted by Illumina and can be obtained from EMBL-EBI. The DNA identifier for this individual is NA12878. The description of the data from the linked website shows that this sample has a >30x depth of coverage, and it reaches ~53x.
Table 2 summarizes the overall runtimes and the comparisons between each step for our 9-step BWA-GATK pipeline with a single sample.
The mapping and sorting step is the only step in Table 2 where we can see true performance variations across the different CPUs. A rough estimate of the overall performance improvement from the 6248R (6248) to the 8352Y and 8358 is 3.8 (9.0) % and 4.8 (10.0) %, respectively. The test bed for the 6248R was a Dell EMC PowerEdge R640 server with 394 GB RAM and local storage, and the configuration details for the 6248 can be found from the embedded link.
The mapping and sorting step shows a decent ~36% runtime reduction owing to the good scalability of BWA. The base recalibration step also takes advantage of the higher core count of the Ice Lake CPUs.
Steps | 8352Y 32c 2.2 GHz | 8358 32c 2.6 GHz | 6248R 24c 3.0 GHz | 6248 20c 2.5 GHz |
Mapping and sorting | 3.23 (32) | 3.23 (32) | 5.04 (24) | 5.22 (20) |
Mark duplicates | 1.16 (13) | 1.16 (13) | 1.14 (13) | 1.29 (13) |
Generate realigning targets | 0.47 (32) | 0.46 (32) | 0.16 (24) | 0.42 (20) |
Insertion and deletion realigning | 8.16 (1) | 7.97 (1) | 7.20 (1) | 7.87 (1) |
Base recalibration | 2.06 (32) | 2.07 (32) | 2.41 (24) | 2.30 (20) |
HaplotypeCaller | 8.01 (16) | 7.96 (16) | 8.06 (16) | 8.25 (16) |
Genotype GVCFs | 0.01 (32) | 0.01 (32) | 0.01 (24) | 0.01 (20) |
Variant recalibration | 0.20 (1) | 0.20 (1) | 0.19 (1) | 0.23 (1) |
Apply variant recalibration | 0.01 (1) | 0.01 (1) | 0.01 (1) | 0.01 (1) |
Total runtime (hours) | 23.32 | 23.07 | 24.23 | 25.61 |
Note: The number of cores used for the test is parenthesized.
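The overall improvement figures quoted for Table 2 can be approximately reproduced from the total runtimes. The sketch below assumes that "improvement" means the relative reduction in total runtime, which is an interpretation rather than something the text states; the results agree with the quoted percentages to within rounding.

```python
# Total runtimes in hours, copied from Table 2.
totals = {"8352Y": 23.32, "8358": 23.07, "6248R": 24.23, "6248": 25.61}

def runtime_reduction_pct(old_hours: float, new_hours: float) -> float:
    """Relative runtime reduction of the new CPU versus the old CPU, in percent."""
    return (old_hours - new_hours) / old_hours * 100.0

for new in ("8352Y", "8358"):
    for old in ("6248R", "6248"):
        pct = runtime_reduction_pct(totals[old], totals[new])
        print(f"{old} -> {new}: {pct:.1f}% shorter total runtime")
```

The same function applied to the mapping and sorting row (5.04 hours on the 6248R versus 3.23 hours on Ice Lake) gives roughly 36 percent.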
A typical way of running an NGS pipeline is to process multiple samples on a compute node and use multiple compute nodes to maximize the throughput. However, this time the tests were performed on a single compute node due to the limited number of servers available at this moment.
The current pipeline invokes many pipe operations in the first step to minimize the writing of intermediate files. Although this saves a day of runtime and lowers the storage usage significantly, the cost of invoking pipes is quite heavy. Hence, this limits the number of samples that can be processed concurrently. Typically, a process silently fails when there are not enough resources left to start an additional process.
As shown in Table 3 for the 8352Y test, the maximum number of samples that can be processed simultaneously is around 14. Although a 14-sample test was not performed, 14 is likely the maximum because two pipelines failed in the 16-sample test. In other words, a throughput of ~6 genomes per day is achievable with the 8352Y. The 8358 also shows 2 failed processes when 16 samples were processed simultaneously, while its throughput reached ~7 genomes per day (Table 4).
Steps | Runtime with a various number of samples | |||||
Number of samples | 1 | 2 | 4 | 8 | 12 | 16 |
Number of samples Failed | 0 | 0 | 0 | 0 | 0 | 2 |
Mapping and sorting | 2.84 | 4.20 | 7.11 | 13.44 | 20.77 | 26.62 |
Mark duplicates | 1.17 | 1.18 | 1.29 | 1.77 | 2.49 | 3.05 |
Generate realigning targets | 0.46 | 0.51 | 0.52 | 0.77 | 1.09 | 1.25 |
Insertion and deletion realigning | 7.94 | 8.04 | 8.02 | 8.00 | 8.26 | 8.11 |
Base recalibration | 2.00 | 2.16 | 2.83 | 4.41 | 6.04 | 7.20 |
HaplotypeCaller | 8.00 | 7.93 | 9.10 | 9.24 | 9.31 | 9.26 |
Genotype GVCFs | 0.02 | 0.02 | 0.03 | 0.02 | 0.03 | 0.04 |
Variant recalibration | 0.17 | 0.20 | 0.21 | 0.20 | 0.19 | 0.23 |
Apply variant recalibration | 0.01 | 0.02 | 0.02 | 0.02 | 0.02 | 0.03 |
Total runtime (hours) | 22.60 | 24.26 | 29.12 | 37.89 | 48.20 | 55.78 |
Genomes per day | 1.06 | 1.98 | 3.30 | 5.07 | 5.98 | 6.02 |
Table 4. Runtimes on the 8358
Steps | Runtime (hours) with various numbers of samples | |||||
Number of samples | 1 | 8 | 12 | 14 | 16 | 1 |
Number of failed samples | 0 | 0 | 0 | 0 | 2 | 0 |
Mapping and sorting | 2.67 | 11.79 | 18.26 | 22.84 | 24.34 | 2.67 |
Mark duplicates | 1.16 | 1.51 | 2.18 | 2.59 | 2.65 | 1.16 |
Generate realigning targets | 0.43 | 0.70 | 0.96 | 1.17 | 1.15 | 0.43 |
Insertion and deletion realigning | 7.97 | 8.00 | 7.99 | 8.20 | 8.19 | 7.97 |
Base recalibration | 1.94 | 4.05 | 5.65 | 6.47 | 6.56 | 1.94 |
HaplotypeCaller | 8.00 | 8.21 | 8.22 | 8.24 | 8.25 | 8.00 |
Genotype GVCFs | 0.02 | 0.03 | 0.03 | 0.03 | 0.02 | 0.02 |
Variant recalibration | 0.18 | 0.25 | 0.14 | 0.30 | 0.30 | 0.18 |
Apply variant recalibration | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.01 |
Total runtime (hours) | 22.37 | 34.55 | 43.44 | 49.86 | 51.49 | 22.37 |
Genomes per day | 1.07 | 5.56 | 6.63 | 6.74 | 6.53 | 1.07 |
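The "Genomes per day" rows are consistent with counting only the samples that completed and scaling by the total runtime. The post does not state the formula, so the sketch below is inferred from the table values, and the function name is illustrative.

```python
def genomes_per_day(samples, failed, total_runtime_hours):
    """Throughput in genomes per 24 hours, counting only samples that completed."""
    completed = samples - failed
    return completed * 24.0 / total_runtime_hours


# (samples, failed, total runtime in hours) taken from the two tables above
runs_8352y = [(1, 0, 22.60), (8, 0, 37.89), (16, 2, 55.78)]
runs_8358 = [(14, 0, 49.86), (16, 2, 51.49)]

for samples, failed, hours in runs_8352y + runs_8358:
    rate = round(genomes_per_day(samples, failed, hours), 2)
    print(f"{samples} samples, {failed} failed: {rate} genomes/day")
```

With 16 samples and two failures, only 14 genomes count toward throughput, which is why the 16-sample columns barely improve on, or fall below, the 12- and 14-sample columns.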
The field of NGS data analysis has been moving fast in terms of data growth and data variety. The majority of open-source applications for NGS data analysis cannot take advantage of accelerator technology and do not scale well with the number of cores. It is time for users to think about how to tackle this problem. One simple approach is data-level parallelization. Although deciding where to split the data is hard, it is tractable with careful interventions in an existing BWA-GATK pipeline, without diluting the statistical power that comes from the sheer volume of data. If each smaller data chunk goes through its own pipeline on a separate core and the results are merged at the end, better single-sample performance could be possible. This gain could in turn lead to higher throughput if the overall runtime drops significantly.
Nonetheless, 3rd Generation Intel® Xeon® Scalable Processors, especially the 8352Y and 8358, are excellent choices for maximizing variant calling throughput and for single-sample analysis.
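The split, process, and merge idea can be sketched as a scatter-gather pattern. This is only an illustration under stated assumptions: the equal-size split is naive (the text notes that choosing split points is the hard part), `process_chunk` is a hypothetical placeholder for a per-chunk pipeline run, and threads stand in for whatever launcher a real pipeline would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_region(start, end, n_chunks):
    """Split the half-open interval [start, end) into n_chunks nearly equal pieces."""
    step, extra = divmod(end - start, n_chunks)
    chunks, pos = [], start
    for i in range(n_chunks):
        size = step + (1 if i < extra else 0)
        chunks.append((pos, pos + size))
        pos += size
    return chunks


def process_chunk(chunk):
    """Hypothetical stand-in for running the pipeline on one chunk of data."""
    start, end = chunk
    return {"start": start, "end": end, "bases": end - start}


def scatter_gather(start, end, n_chunks, workers=4):
    """Process every chunk concurrently, then merge results in coordinate order."""
    chunks = split_region(start, end, n_chunks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_chunk, chunks))
    return sorted(results, key=lambda r: r["start"])


merged = scatter_gather(0, 1000, 3)
print(merged)
```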
Tue, 25 May 2021 13:10:03 -0000
Intel recently announced the 3rd Generation Intel Xeon Scalable processors (code-named “Ice Lake”), which are based on a new 10 nm manufacturing process. This blog provides the new Ice Lake processor synthetic benchmark results and the recommended BIOS settings on Dell EMC PowerEdge servers.
Ice Lake processors offer a higher core count of up to 40 cores with a single Ice Lake 8380 processor. The Ice Lake processors have larger L3, L2, and L1 data cache than Intel’s second-generation Cascade Lake processors. These features are expected to improve performance of CPU-bound software applications. Table 1 shows the L1, L2, and L3 cache size on the 8380 processor model.
Ice Lake still supports the AVX-512 SIMD instructions, which allow for 32 DP FLOP/cycle. The upgraded Ultra Path Interconnect (UPI) link speed of 11.2 GT/s is expected to improve data movement between the sockets. In addition to core count and frequency, Ice Lake-based Dell EMC PowerEdge servers support DDR4-3200 MT/s DIMMs with eight memory channels per processor, which is expected to improve the performance of memory bandwidth-bound applications. Ice Lake processors now support up to 6 TB of memory per socket.
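The figures in this paragraph translate into theoretical peaks with simple arithmetic. The sketch below is a back-of-the-envelope estimate, not a measured result: it uses the 8380's 40 cores and 2.30 GHz base frequency from Table 2 of this post, assumes a dual-socket server, assumes 8 bytes per DDR4 transfer per channel, and ignores the fact that sustained AVX-512 frequencies usually differ from the base frequency.

```python
def peak_dp_gflops(cores, ghz, flop_per_cycle=32, sockets=1):
    """Theoretical double-precision peak in GFLOPS."""
    return cores * ghz * flop_per_cycle * sockets


def ddr_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical memory bandwidth per socket in GB/s."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


# Intel 8380: 40 cores at 2.30 GHz base, two sockets assumed
peak_8380 = peak_dp_gflops(cores=40, ghz=2.30, sockets=2)

# Eight DDR4-3200 channels per processor
bw_per_socket = ddr_bandwidth_gbs(channels=8, mega_transfers_per_s=3200)

print(f"Peak DP: {peak_8380:.0f} GFLOPS, memory bandwidth: {bw_per_socket:.1f} GB/s per socket")
```

Benchmark efficiency is then the measured result divided by the matching theoretical peak.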
Instructions such as Vector CLMUL, VPMADD52, Vector AES, and GFNI Extensions have been optimized to improve use of vector registers. The performance of software applications in the cryptography domain is also expected to benefit. The Ice Lake processor also includes improvements to Intel Speed Select Technology (Intel SST). With Intel SST, a few cores from the total available cores can be operated at a higher base frequency, turbo frequency, or power. This blog does not address this feature.
Table 1: hwloc-ls and numactl -H command output on an Intel 8380 processor model-based server with Round Robin core enumeration (MadtCoreEnumeration) and SubNumaCluster (Sub-NUMA Cluster) set to 2-Way
hwloc-ls | numactl -H |
Machine (247GB total)
  Package L#0 + L3 L#0 (60MB)
    Group0 L#0
      NUMANode L#0 (P#0 61GB)
      L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#4)
      L2 L#2 (1280KB) + L1d L#2 (48KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#8)
      L2 L#3 (1280KB) + L1d L#3 (48KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#12)
      L2 L#4 (1280KB) + L1d L#4 (48KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#16)
      L2 L#5 (1280KB) + L1d L#5 (48KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#20)
      L2 L#6 (1280KB) + L1d L#6 (48KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#24)
      L2 L#7 (1280KB) + L1d L#7 (48KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#28)
      L2 L#8 (1280KB) + L1d L#8 (48KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#32)
      L2 L#9 (1280KB) + L1d L#9 (48KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#36)
      L2 L#10 (1280KB) + L1d L#10 (48KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#40)
      L2 L#11 (1280KB) + L1d L#11 (48KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#44)
      L2 L#12 (1280KB) + L1d L#12 (48KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#48)
      L2 L#13 (1280KB) + L1d L#13 (48KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#52)
      L2 L#14 (1280KB) + L1d L#14 (48KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#56)
      L2 L#15 (1280KB) + L1d L#15 (48KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#60)
      L2 L#16 (1280KB) + L1d L#16 (48KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#64)
      L2 L#17 (1280KB) + L1d L#17 (48KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#68)
      L2 L#18 (1280KB) + L1d L#18 (48KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#72)
      L2 L#19 (1280KB) + L1d L#19 (48KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#76)
      HostBridge
      <snip>
| |
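In the hwloc-ls listing, NUMA node 0 holds processing units P#0, P#4, and so on up to P#76, which is what Round Robin enumeration across four NUMA nodes would produce. The sketch below assumes the same stride-4 pattern continues for the other three nodes; the listing is snipped, so that part is an assumption, and the function names are illustrative.

```python
def cores_for_numa_node(node, total_cores=80, numa_nodes=4):
    """CPU IDs on one NUMA node, assuming round-robin enumeration across nodes."""
    return list(range(node, total_cores, numa_nodes))


def cpu_list(node):
    """Comma-separated CPU list, for example for use with numactl --physcpubind."""
    return ",".join(str(cpu) for cpu in cores_for_numa_node(node))


node0 = cores_for_numa_node(0)
print(node0)
print(cpu_list(1))
```

A binding tool that takes explicit CPU IDs needs this list, because consecutive IDs such as 0, 1, 2, and 3 would land on four different NUMA nodes under this enumeration.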
Table 2 provides the server details used for the performance tests. The following BIOS options were explored in the performance testing:
You can set the DeadLineLlcAlloc, LlcPrefetch, XptPrefetch, UpiPrefetch, DcuIpPrefetcher, DcuStreamerPrefetcher, ProcAdjCacheLine, and LogicalProc BIOS options to either Enabled or Disabled. You can set SubNumaCluster to either 2-Way or Disabled. The SysProfile setting has five possible values: PerformanceOptimized, PerfPerWattOptimizedDapc, PerfPerWattOptimizedOs, PerfWorkStationOptimized, and Custom.
Table 2: Test bed hardware and software details
Component | Dell EMC PowerEdge R750 server | Dell EMC PowerEdge C6520 server | Dell EMC PowerEdge C6420 server | Dell EMC PowerEdge C6420 server |
OPN | 8380 | 6338 | 8280 | 6252 |
Cores/Socket | 40 | 32 | 28 | 24 |
Frequency (Base-Boost) | 2.30 – 3.40 GHz | 2.0 – 3.20 GHz | 2.70 – 4.0 GHz | 2.10 – 3.70 GHz |
TDP | 270 W | 205 W | 205 W | 150 W |
L3 Cache | 60 MB | 48 MB | 38.5 MB | 37.75 MB |
Operating System | Red Hat Enterprise Linux 8.3 4.18.0-240.22.1.el8_3.x86_64 | Red Hat Enterprise Linux 8.3 4.18.0-240.22.1.el8_3.x86_64 | Red Hat Enterprise Linux 8.3 4.18.0-240.el8.x86_64 | Red Hat Enterprise Linux 8.3 4.18.0-240.el8.x86_64 |
Memory | 16 GB x 16 (2Rx8) 3200 MT/s | 16 GB x 16 (2Rx8) 3200 MT/s | 16 GB x 12 (2Rx8) 2933 MT/s | 16 GB x 12 (2Rx8) 2933 MT/s |
BIOS/CPLD | 1.1.2/1.0.1 | |||
Interconnect | NVIDIA Mellanox HDR | NVIDIA Mellanox HDR | NVIDIA Mellanox HDR100 | NVIDIA Mellanox HDR100 |
Compiler | Intel parallel studio 2020 (update 4) | |||
Benchmark software |
|
The system profile BIOS meta option sets a group of BIOS options (such as C1E, C States, and so on), each of which controls performance and power management settings, to particular values. It is also possible to set each option in the group individually by using the Custom system profile.
Table 2 lists details about the software used for benchmarking the server. We used the precompiled HPL and HPCG binary files, which are part of the Intel Parallel Studio 2020 update 4 software bundle, for our tests. We compiled the WRF application with AVX2 support. WRF and HPCG issue many non-floating-point packed micro-operations (approximately 73 percent to 90 percent of the total packed micro-operations). They are memory-bound (and DRAM bandwidth-bound) workloads. HPL issues packed double-precision micro-operations and is a compute-bound workload.
After setting Sub-NUMA Cluster (BIOS.ProcSettings.SubNumaCluster) to 2-Way, Logical Processors (BIOS.ProcSettings.LogicalProc) to Disabled, and other settings (DeadLineLlcAlloc, LlcPrefetch, XptPrefetch, UpiPrefetch, DcuIpPrefetcher, DcuStreamerPrefetcher, ProcAdjCacheLine) to Enabled, we measured the impact of System Profile (BIOS.SysProfileSettings.SysProfile) BIOS parameters on application performance.
Figure 1 through Figure 4 show application performance. In each figure, the numbers over the bars represent the relative change in the application performance with respect to the application performance obtained on the Intel 6338 Ice Lake processor with the System Profile set to Performance Optimized (PO).
Note: In the figures, PO=PerformanceOptimized, PPWOD=PerfPerWattOptimizedDapc, PPWOO=PerfPerWattOptimizedOs, and PWSO=PerfWorkStationOptimized.
Figure 1: Relative difference in the performance of HPL by processor and Sysprofile setting
Figure 2: Relative difference in the performance of HPCG by processor and Sysprofile setting
Figure 3: Relative difference in the performance of STREAM by processor and Sysprofile setting
Figure 4: Relative difference in the performance of WRF by processor and Sysprofile setting
We obtained the performance for the applications in Figure 2 through Figure 4 by fully subscribing all available cores. Depending on the processor model, we achieved 78 percent to 80 percent efficiency with the HPL and STREAM benchmarks using the Performance Optimized profile.
Intel has extended the TDP of the Ice Lake processors, with the top-end Intel 8380 processor at 270 W TDP. The following figure shows the power use on the systems with the applications listed in Table 2.
Note: In this figure, PO=PerformanceOptimized, PPWOD=PerfPerWattOptimizedDapc, PPWOO=PerfPerWattOptimizedOs, and PWSO=PerfWorkStationOptimized.
Figure 5: Power use by platform and processor type. With the Performance Optimized system profile, average idle power usage was approximately 335 W on the PowerEdge C6520 server (Intel 6338 processor) and approximately 470 W on the PowerEdge R750 server (Intel 8380 processor).
When SNC is set to 2-Way, the system exposes four NUMA nodes. We tested the NUMA bandwidth, remote socket bandwidth, and local socket bandwidth using the STREAM TRIAD benchmark. In Figure 6, the CPU NUMA node is represented as c and the memory node is represented as m. As an example for NUMA bandwidth, the c0m0 (blue bars) test type represents the STREAM TRIAD test carried out between NUMA node 0 and memory node 0. Figure 6 shows the best bandwidth numbers obtained on varying the number of threads per test type.
Figure 6: Local and remote NUMA memory bandwidth.
Remote socket bandwidth numbers were measured between CPU nodes 0 and 1 and memory nodes 2 and 3. Local bandwidth numbers were measured between CPU nodes 0 and 1 and memory nodes 0 and 1. The following figure shows the performance numbers.
Figure 7: Local and remote processor bandwidth.
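The c/m naming for the test types can be enumerated programmatically. The post does not say how the processes and memory were bound, so the sketch below assumes `numactl` with its `--cpunodebind` and `--membind` options as one common way to do it; `./stream` is a hypothetical binary path, and the node-to-socket map follows the text (NUMA nodes 0 and 1 on the first socket, 2 and 3 on the second).

```python
# Assumed mapping with SNC set to 2-Way on a two-socket server
SOCKET_OF_NODE = {0: 0, 1: 0, 2: 1, 3: 1}


def classify(cpu_node, mem_node):
    """Label a CPU-node / memory-node pair by how far the memory is from the cores."""
    if cpu_node == mem_node:
        return "local NUMA"
    if SOCKET_OF_NODE[cpu_node] == SOCKET_OF_NODE[mem_node]:
        return "remote NUMA, same socket"
    return "remote socket"


def numactl_command(cpu_node, mem_node, binary="./stream"):
    """Build a command line that pins threads and memory to the given nodes."""
    return f"numactl --cpunodebind={cpu_node} --membind={mem_node} {binary}"


tests = {
    f"c{c}m{m}": (classify(c, m), numactl_command(c, m))
    for c in range(4)
    for m in range(4)
}

for name, (kind, command) in tests.items():
    print(f"{name}: {kind}: {command}")
```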
We tested the impact of the DeadLineLlcAlloc, LlcPrefetch, XptPrefetch, UpiPrefetch, DcuIpPrefetcher, DcuStreamerPrefetcher, and ProcAdjCacheLine options with the Performance Optimized (PO) system profile. These BIOS options do not have a significant impact on the performance of the applications addressed in this blog. Therefore, we recommend leaving these options set to Enabled.
Figure 8 and Figure 9 show the impact of the Sub-NUMA Cluster (SNC) BIOS option on the application performance. In each figure, the numbers over the bars represent the relative change in the application performance with respect to the application performance obtained on the Intel 6338 Ice Lake processor with SNC feature set to Disabled.
Figure 8: HPL and HPCG performance variation by processor model with Sub-NUMA Cluster set to Disabled (SNC=D) and 2-Way (SNC=2W)
Figure 9: STREAM and WRF performance variation by processor model with Sub-NUMA Cluster set to Disabled (SNC=D) and 2-Way (SNC=2W)
The SubNumaCluster option can affect applications that are memory bandwidth-bound (for example, STREAM, HPCG, and WRF). We recommend setting the SubNumaCluster option to 2-Way because it can improve the performance of the workloads addressed in this blog by one percent to six percent, depending on the processor model and application.
The Ice Lake-based processors now support PCIe Gen 4, which allows the NVIDIA Mellanox HDR adapter cards to be used with Dell EMC PowerEdge servers. Figure 10, Figure 11, and Figure 12 show the message rate, unidirectional, and bidirectional InfiniBand bandwidth test results from the OSU Benchmarks suite. Because the network adapter card was connected to the second socket (NUMA node 2), the local bandwidth tests were carried out with processes bound to NUMA node 2, and the remote bandwidth tests with processes bound to NUMA node 0. In Figure 10 and Figure 11, the numbers in red over the orange bars represent the percentage difference between local and remote bandwidth performance numbers.
Figure 12: Interconnect bandwidth and message rate performance obtained between two servers having Intel 8380 processors with OSU Benchmark
On two nodes connected using the NVIDIA Mellanox ConnectX-6 HDR InfiniBand adapter cards, we achieved approximately 25 GB/s unidirectional bandwidth and a message rate of approximately 200 million messages/second—almost double the performance numbers obtained on the NVIDIA Mellanox HDR100 card.
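A quick calculation shows why PCIe Gen 4 matters for HDR and why roughly 25 GB/s is the expected ceiling. The sketch assumes an x16 slot and uses standard values that are not taken from this post: 8 GT/s (Gen 3) and 16 GT/s (Gen 4) per lane with 128b/130b encoding, and 200 Gb/s and 100 Gb/s line rates for HDR and HDR100.

```python
def pcie_bandwidth_gbs(gt_per_s, lanes=16, encoding=128 / 130):
    """Theoretical one-direction PCIe bandwidth in GB/s."""
    return gt_per_s * lanes * encoding / 8


def infiniband_line_rate_gbs(gbit_per_s):
    """InfiniBand line rate converted from Gb/s to GB/s."""
    return gbit_per_s / 8


gen3_x16 = pcie_bandwidth_gbs(8)
gen4_x16 = pcie_bandwidth_gbs(16)
hdr = infiniband_line_rate_gbs(200)
hdr100 = infiniband_line_rate_gbs(100)

print(f"PCIe Gen 3 x16: {gen3_x16:.2f} GB/s, PCIe Gen 4 x16: {gen4_x16:.2f} GB/s")
print(f"HDR: {hdr:.1f} GB/s, HDR100: {hdr100:.1f} GB/s")
```

Under these assumptions a Gen 3 x16 slot tops out below the HDR line rate, while a Gen 4 x16 slot leaves headroom above it.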
Based on the availability of compute resources in our Dell EMC HPC & AI Innovation Lab, we selected the Cascade Lake processor-based servers and benchmarked them with the software listed in Table 2. Figure 13 through Figure 16 show performance results from the Intel Ice Lake and Cascade Lake processors. The numbers over the bars represent the relative change in the application performance with respect to the application performance obtained on the Intel 6252 Cascade Lake processor.
Figure 15: STREAM TRIAD test performance on Processors listed in Table 2
Figure 16: WRF performance on Processors listed in Table 2
Ice Lake delivers approximately 38 percent better performance than Cascade Lake with HPL on the top-end processor model. The memory bandwidth-bound benchmarks such as STREAM and HPCG (see Figure 13 and Figure 14) delivered 42 percent to 43 percent performance improvement over the top-end Cascade Lake processors addressed in this blog.
The average real-time power usage of the Dell EMC PowerEdge platforms (listed in Table 2) was measured with the synthetic benchmarks listed in this blog. Figure 17 compares the power usage data from the Cascade Lake and Ice Lake platforms. The number over each bar represents the relative change in power with respect to the base power measured (Intel 6252 processor in the idle state).
Figure 17: Average power usage during benchmark runs on Dell EMC PowerEdge servers (see details in Table 2)
Considering the data with the Performance Optimized profile with the respective power measurement, the applications (depending on the processor model) were unable to deliver better performance per watt on the Ice Lake platform when compared to the Cascade Lake platform.
The Ice Lake processor-based Dell EMC PowerEdge servers, with notable hardware feature upgrades over Cascade Lake, show up to a 47 percent performance gain across the HPC benchmarks addressed in this blog. Hyper-threading should be Disabled for the benchmarks addressed in this blog; for other workloads, the option should be tested and enabled as appropriate. Watch this space for subsequent blogs that describe application performance studies on our new Ice Lake processor-based cluster.
Mon, 28 Jun 2021 14:35:14 -0000
Maximizing application performance and system utilization has always been important for HPC users. The libraries, compilers, and applications found on these systems are the result of heroic efforts by HPC system administrators and teams of HPC specialists who fine tune, test, and maintain optimal builds of complex hierarchies of software for users. Fortunately for both researchers and administrators, some of that burden can be relieved with the use of containers, where software solutions can be built to run reliably when moved from one computing environment to another. This includes moving from one research lab to another, or from the developer’s laptop to a production lab, or even from an on-prem data center to the cloud.
Singularity has provided HPC system administrators and users a way to take advantage of application containerization while running on batch-scheduled systems. Singularity is a container runtime that can build containers in its own native format, as well as execute any CRI-compatible container. By default, Singularity enforces security restrictions on containers by running in user space and can preserve user identification when run through batch schedulers, providing a simple method to deploy containerized workloads on multi-user HPC environments.
The goal of Omnia is to capture best practices for HPC system deployment and use, and we recognize that those practices vary across industry and research institutions. Omnia is developed with the entire community in mind, and we aim to provide tools that help its members be productive. To this end, we recently included Singularity as an automatically installed package when deploying Slurm clusters with Omnia.
Installing a Slurm cluster with Omnia and running a Singularity job is simple. We provide a repository of Ansible playbooks to configure a pile of metal or cloud resources into a ready-to-use Slurm cluster by applying the Slurm role in AWX or by applying the playbook on the command line.
ansible-playbook -i inventory omnia.yaml --skip-tags kubernetes
Once the playbook has completed, users are presented with a fully functional Slurm cluster with Singularity installed. We can run a simple “hello world” example, using containers directly from Singularity Hub. Here is an example Slurm submission script to run the “Hello World” example.
#!/bin/bash
#SBATCH -J singularity_test
#SBATCH -o singularity_test.out.%J
#SBATCH -e singularity_test.err.%J
#SBATCH -t 0-00:10
#SBATCH -N 1

# pull example Singularity container
singularity pull --name hello-world.sif shub://vsoch/hello-world

# execute Singularity container
singularity exec hello-world.sif cat /etc/os-release
The “hello world” example is great but doesn’t demonstrate running real HPC codes. Fortunately, several hardware vendors have begun to publish containers for both HPC and AI workloads, such as Intel's oneContainer and Nvidia's NGC. Nvidia NGC is a catalog of GPU-accelerated software arranged in collections, containers, and Helm charts. This free-to-use repository has the latest builds of popular software used for deep learning and simulation, with optimizations for Nvidia GPU systems. With Singularity, we can take advantage of the NGC containers on our bare-metal Slurm cluster. Starting with the LAMMPS example on the NGC website, we demonstrate how to run a standard Lennard-Jones 3D melt experiment without having to compile all the libraries and executables.
The input file for running this benchmark, in.lj.txt, can be downloaded from the Sandia National Laboratory site:
wget https://lammps.sandia.gov/inputs/in.lj.txt
Next, make a local copy of the LAMMPS container from NGC and name it lammps.sif:
singularity build lammps.sif docker://nvcr.io/hpc/lammps:29Oct2020
This example can be executed directly from the command line using srun. This example runs 8 tasks on 2 nodes with a total of 8 GPUs:
srun --mpi=pmi2 -N2 --ntasks=8 --ntasks-per-socket=2 singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd lammps.sif lmp -k on g 8 -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 -var x 8 -var y 8 -var z 8 -in /host_pwd/in.lj.txt
Alternatively, the following example Slurm submission script will permit batch execution with the same parameters as above, 8 tasks on 2 nodes with a total of 8 GPUs:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-socket=2
#SBATCH --time 00:10:00

set -e; set -o pipefail

# Build SIF, if it doesn't exist
if [[ ! -f lammps.sif ]]; then
    singularity build lammps.sif docker://nvcr.io/hpc/lammps:29Oct2020
fi

readonly gpus_per_node=$(( SLURM_NTASKS / SLURM_JOB_NUM_NODES ))

echo "Running Lennard Jones 8x4x8 example on ${SLURM_NTASKS} GPUS..."
srun --mpi=pmi2 \
singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd lammps.sif lmp -k on g ${gpus_per_node} -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 -var x 8 -var y 8 -var z 8 -in /host_pwd/in.lj.txt
Containers provide a simple solution to the complex task of building optimized software to run anywhere. Researchers are no longer required to attempt building software themselves or wait for a release of software to be made available at the site they are running. Whether running on the workstation, laptop, on-prem HPC resource, or cloud environment they can be sure they are using the same optimized version for every run.
Omnia is an open source project that makes it easy to set up a Slurm or Kubernetes environment. When we combine the simplicity of Omnia for system deployment with Nvidia NGC containers for optimized software, both researchers and system administrators can concentrate on what matters most: getting results faster.
Learn more about Singularity containers at https://sylabs.io/singularity/. Omnia is available for download at https://github.com/dellhpc/omnia.
Tue, 27 Apr 2021 03:48:30 -0000
Gene expression analysis is as important as identifying Single Nucleotide Polymorphism (SNP), insertion/deletion (indel), or chromosomal restructuring. Ultimately, all physiological and biochemical events depend on the final gene expression products, proteins. Although most mammals have an additional controlling layer before protein expression, knowing how many transcripts exist in a system helps to characterize the biochemical status of a cell. Ideally, technology would enable us to quantify all proteins in a cell, which would significantly advance the progress of Life Science. However, we are far from achieving this.
This blog provides the test results of one popular RNA-Seq data analysis pipeline known as the Tuxedo pipeline (1). The Tuxedo pipeline suite offers a set of tools for analyzing a variety of RNA-Seq data, including short-read mapping, identification of splice junctions, transcript and isoform detection, differential expression, visualizations, and quality control metrics. The tested workflow is a differentially expressed gene (DEG) analysis, and the detailed steps in the pipeline are shown in Figure 1.
Figure 1. Updated tuxedo pipeline with cuffquant step
We carried out a single-node study with AMD EPYC 7002 series (Rome) and AMD EPYC 7003 series (Milan) processors on the Dell EMC PowerEdge R6525 server. The configurations of the test system are summarized in Table 1.
Table 1. Tested compute node configuration
Dell EMC PowerEdge R6525 | |
---|---|
CPU | Tested AMD Milan: 2x 7763 (Milan), 64 Cores, 2.45 GHz – 3.5 GHz Base-Boost, TDP 280 W, 256 MB L3 Cache; 2x 7713 (Milan), 64 Cores, 2.0 GHz – 3.7 GHz Base-Boost, TDP 225 W, 256 MB L3 Cache; 7543 (Milan), 32 Cores, 2.8 GHz – 3.7 GHz Base-Boost, TDP 225 W, 256 MB L3 Cache. Tested AMD Rome: 7702 (Rome), 64 Cores, 2.0 GHz – 3.35 GHz Base-Boost, TDP 200 W, 256 MB L3 Cache |
RAM | DDR4 256 GB (16 GB x 16) 3200 MT/s |
Operating system | RHEL 8.3 (4.18.0-240.el8.x86_64) |
Interconnect | Mellanox InfiniBand HDR100 |
Filesystem | Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage |
BIOS system profile | Performance optimized |
Logical processor | Disabled |
Virtualization technology | Disabled |
tophat | 2.1.1 |
bowtie2 | 2.2.5 |
R | 3.6 |
bioconductor cummerbund | 2.26.0 |
A performance study of the RNA-Seq pipeline is not trivial because the workflow requires input files that are non-identical yet similar in size. Hence, 185 RNA-Seq paired-end read datasets were collected from a public data repository. All the read data files contain around 25 Million Fragments (MF) and have similar read lengths. The samples for each test were randomly selected from this pool of 185 paired-end read files. Although these test data have no biological meaning, their high level of noise puts the tests in a worst-case scenario.
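Random selection from the pool can be sketched as sampling without replacement. The post does not describe how the selection was implemented, so the 1-based indices, the seed handling, and the function name below are illustrative assumptions.

```python
import random

POOL_SIZE = 185  # number of paired-end read files in the pool, per the text


def pick_samples(n, seed=None):
    """Return n distinct 1-based sample indices drawn at random from the pool."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, POOL_SIZE + 1), n))


# Fixing the seed makes a selection repeatable across reruns of the same test
selection = pick_samples(8, seed=42)
print(selection)
```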
Typical RNA-Seq studies consist of multiple samples, sometimes hundreds: normal versus disease, or untreated versus treated. These samples tend to have a high level of noise for biological reasons; hence, the analysis requires a rigorous data preprocessing procedure.
To see how much data a single node can process, tests were run with varying numbers of samples, each using different RNA-Seq data selected from the 185 paired-end read dataset. Typically, the runtime of the Tuxedo pipeline increases with the number of samples. However, as shown in the figure below, the runtimes of the two-sample tests on the 7713 are higher than those of the four-sample tests. The standard error from five repeated runs does not overlap with the four- and eight-sample results. Interference from other users may have caused this large variance. The current testing environment, especially a shared file system designed for large capacity, is not ideal for a Next Generation Sequencing (NGS) data analysis benchmark.
Figure 2. Runtime comparisons among various AMD EPYC 7003 Series processors. Standard error is calculated from the sample standard deviation (STDDEV.S function in Excel)
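The standard error in the caption is the sample standard deviation divided by the square root of the number of runs. Python's `statistics.stdev` uses the same n-1 denominator as Excel's STDDEV.S. The runtimes below are hypothetical placeholders for five repeated runs, not measurements from this study.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean, based on the sample standard deviation."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical runtimes (hours) from five repeated runs of one test
runtimes_hours = [10.2, 9.8, 10.5, 10.1, 9.9]

mean = statistics.mean(runtimes_hours)
sem = standard_error(runtimes_hours)
print(f"mean = {mean:.2f} h, standard error = {sem:.3f} h")
```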
The eight-sample test results show that the AMD Milan processors perform better than the Rome processor (7702) under a heavier workload.
More tests are required to gain better insight into the AMD Milan processors for NGS data analysis. Unfortunately, the tests could not exceed eight samples due to storage limitations. However, there seems to be plenty of room for higher throughput by processing more than eight samples together. In the eight-sample tests for the Tuxedo pipeline, the AMD Milan 7763 performed 20% better than the AMD Rome 7702, and the AMD Milan 7713 performed 18% better, as shown in Figure 2.
Tue, 13 Apr 2021 14:25:31 -0000
Three years after launching the Tesla V100 GPU, NVIDIA recently announced its latest data center GPU A100, built on the Ampere architecture. The A100 is available in two form factors, PCIe and SXM4, allowing GPU-to-GPU communication over PCIe or NVLink. The NVLink version is also known as the A100 SXM4 GPU and is available on the HGX A100 server board.
As you’d expect, the Innovation Lab tested the performance of the A100 GPU in a new platform. The new PowerEdge XE8545 4U server from Dell Technologies supports these GPUs with the NVLink SXM4 form factor and dual-socket AMD 3rd generation EPYC CPUs (codename Milan). This platform supports PCIe Gen 4 speed, up to 10 local drives, and up to 16 DIMM slots running at 3200 MT/s. Milan CPUs are available with up to 64 physical cores per CPU.
The PCIe version of the A100 can be housed in the PowerEdge R7525, which also supports AMD EPYC CPUs, up to 24 drives, and up to 16 DIMM slots running at 3200MT/s. This blog compares the performance of the A100-PCIe system to the A100-SXM4 system.
Figure 1: PowerEdge XE8545 Server
A previous blog discussed the performance of the NVIDIA A100-PCIe GPU compared to its predecessor NVIDIA Tesla V100-PCIe GPU in the PowerEdge R7525 platform.
The following table shows the specifications of the NVIDIA A100 and V100 GPUs.
Table 1: NVIDIA A100 and V100 GPUs with PCIe and SXM4 form factors
Form factor | PCIe | SXM (NVIDIA NVLink) | ||
Type of NVIDIA | A100 | V100 | A100 | V100 |
GPU architecture | Ampere | Volta | Ampere | Volta |
GPU memory | 40 GB | 32 GB | 40 GB | 32 GB |
GPU memory bandwidth | 1555 GB/s | 900 GB/s | 1555 GB/s | 900 GB/s |
Peak FP64 | 9.7 TFLOPS | 7 TFLOPS | 9.7 TFLOPS | 7.8 TFLOPS |
Peak FP64 Tensor Core | 19.5 TFLOPS | N/A | 19.5 TFLOPS | N/A |
GPU base clock | 765 MHz | 1230 MHz | 1095 MHz | 1290 MHz |
GPU boost clock | 1410 MHz | 1380 MHz | 1410 MHz | 1530 MHz |
NVLink speed | 600 GB/s | N/A | 600 GB/s | 300 GB/s |
Max power consumption | 250 W | 250 W | 400 W | 300 W |
From Table 1, we see that the A100 offers about 73 percent more memory bandwidth (1555 GB/s versus 900 GB/s) and roughly 24 to 39 percent higher double precision FLOPS than the Tesla V100 GPU. While the A100-PCIe GPU consumes the same amount of power as the V100-PCIe GPU, the NVLink version of the A100 GPU consumes about 33 percent more power than the NVLink version of the V100 GPU (400 W versus 300 W).
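A percentage gap depends on which GPU is taken as the baseline. The sketch below recomputes the comparisons from the Table 1 values: with the V100 as the baseline, the memory bandwidth gap is about 73 percent, while the same gap expressed relative to the A100 is about 42 percent. The function names are illustrative.

```python
def percent_more(new, baseline):
    """Percentage by which `new` exceeds `baseline`."""
    return (new - baseline) / baseline * 100.0


def percent_less(old, reference):
    """Percentage by which `old` falls short of `reference`."""
    return (reference - old) / reference * 100.0


# Values from Table 1: A100 versus V100
mem_bw_gain = percent_more(1555, 900)       # memory bandwidth, GB/s
fp64_gain_pcie = percent_more(9.7, 7.0)     # peak FP64, PCIe cards, TFLOPS
fp64_gain_sxm = percent_more(9.7, 7.8)      # peak FP64, SXM cards, TFLOPS
power_gain_sxm = percent_more(400, 300)     # max power, SXM cards, W

# The same memory bandwidth gap, expressed relative to the A100
mem_bw_deficit = percent_less(900, 1555)

print(f"{mem_bw_gain:.1f} {fp64_gain_pcie:.1f} {fp64_gain_sxm:.1f} {power_gain_sxm:.1f} {mem_bw_deficit:.1f}")
```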
An understanding of the server architecture is helpful in determining the behavior of any application. The PowerEdge XE8545 server is an accelerator-optimized server with four A100-SXM4 GPUs connected with third-generation NVLink, as shown in the following figure.
Figure 2: PowerEdge XE8545 CPU-GPU connectivity
In the A100 GPU, each NVLink link supports a data rate of 4 x 50 Gbit/s in each direction. The total number of NVLink links increases from six in the V100 GPU to 12 in the A100 GPU, yielding 600 GB/s of total bandwidth. Workloads that can take advantage of the higher GPU-to-GPU communication bandwidth can benefit from the NVLink links in the PowerEdge XE8545 server.
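The 600 GB/s total follows from the per-link figures. The sketch below reads the A100 numbers from this paragraph (12 links, 4 x 50 Gbit/s in each direction per link); the V100 signaling values (8 x 25 Gbit/s per direction per link) come from general NVLink 2.0 specifications rather than from this post, and they reproduce the 300 GB/s shown in Table 1.

```python
def nvlink_total_gbs(links, pairs_per_direction, gbit_per_pair):
    """Total bidirectional NVLink bandwidth in GB/s across all links."""
    per_direction_gbs = pairs_per_direction * gbit_per_pair / 8
    return links * per_direction_gbs * 2


a100_total = nvlink_total_gbs(links=12, pairs_per_direction=4, gbit_per_pair=50)
v100_total = nvlink_total_gbs(links=6, pairs_per_direction=8, gbit_per_pair=25)

print(f"A100: {a100_total:.0f} GB/s, V100: {v100_total:.0f} GB/s")
```

The per-link bandwidth is the same 25 GB/s in each direction for both generations under these assumptions, so the doubling comes from the link count.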
As shown in the following figure, the PowerEdge R7525 server can accommodate up to three PCIe-based GPUs; however the configuration chosen for this evaluation used two A100-PCIe GPUs. With this option, the GPU-to-GPU communication must flow through the AMD Infinity Fabric inter-CPU links.
Figure 3: PowerEdge R7525 CPU-GPU connectivity
The following table shows the tested configuration details:
Table 2: Test bed configuration details
Server | PowerEdge XE8545 | PowerEdge R7525 |
Processor | Dual AMD EPYC 7713, 64C, 2.8 GHz | |
Memory | 512 GB (16 x 32 GB @ 3200 MT/s) | 1024 GB (16 x 64 GB @ 3200 MT/s) |
Height of system | 4U | 2U |
GPUs | 4 x NVIDIA A100 SXM4 40 GB | 2 x NVIDIA A100 PCIe 40 GB |
Operating system / Kernel | Red Hat Enterprise Linux release 8.3 (Ootpa) / 4.18.0-240.el8.x86_64 | |
BIOS settings | Sysprofile=PerfOptimized, LogicalProcessor=Disabled, NumaNodesPerSocket=4 | |
CUDA Driver / CUDA Toolkit | 450.51.05 / 11.1 | |
GCC | 9.2.0 | |
MPI | OpenMPI - 4.0 |
The following table lists the version of HPC application that was used for the benchmark evaluation:
Table 3: HPC Applications used for the evaluation
Benchmark | Details |
HPL | xhpl_cuda-11.0-dyn_mkl-static_ompi-4.0.4_gcc4.8.5_7-23-20 |
HPCG | xhpcg-3.1_cuda_11_ompi-3.1 |
GROMACS | v2021 |
NAMD | Git-2021-03-02_Source |
LAMMPS | 29Oct2020 release |
High Performance Linpack (HPL) is a standard HPC system benchmark that is used to measure the computing power of a server or cluster. It is also used as a reference benchmark by the TOP500 org to rank supercomputers worldwide. HPL for GPU uses double precision floating point operations. There are a few parameters that are significant for the HPL benchmark, as listed below:
Figure 4: HPL Performance on the PowerEdge R7525 and XE8545 with NVIDIA A100-40 GB
Figure 5: HPL Power Utilization on the PowerEdge XE8545 with four NVIDIA A100 GPUs and R7525 with two NVIDIA A100 GPUs
From Figure 4 and Figure 5, we can make the following observations:
The TOP500 list has incorporated the High Performance Conjugate Gradient (HPCG) results as an alternate metric to assess system performance.
Figure 6: HPCG Performance on the PowerEdge R7525 and PowerEdge XE8545 Servers
Unlike HPL, HPCG performance depends heavily on the memory system and network performance when we go beyond one server. Because both the PCIe and SXM4 form factors of the A100 GPUs have the same memory bandwidth, there is no variation in the performance at a single node and HPCG performance scales well on both servers.
The following figure shows the performance results for GROMACS, a molecular dynamics application, on the PowerEdge R7525 and XE8545 servers. GROMACS 2021.0 was compiled with CUDA compilers and Open-MPI, as shown in Table 3.
Figure 7: GROMACS performance on the PowerEdge R7525 and PowerEdge XE8545 Servers
The GROMACS build included thread MPI (built in with the GROMACS package). Performance results are presented using the ns/day metric. For each test, the performance was optimized by varying the number of MPI ranks and threads, number of PME ranks, and different nstlist values to obtain the best performance result.
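The tuning sweep described above can be sketched as a small script that enumerates candidate `gmx mdrun` command lines. This is only an illustration under stated assumptions: the core count, the option lists, and the `topol.tpr` file name are hypothetical placeholders, and the original study does not say how its sweep was scripted. The flags `-ntmpi`, `-ntomp`, `-npme`, and `-nstlist` are standard `gmx mdrun` options.

```python
from itertools import product


def mdrun_commands(total_cores, ntmpi_options, npme_options, nstlist_options,
                   tpr="topol.tpr"):
    """Yield one `gmx mdrun` command line per valid parameter combination.

    total_cores and the option lists are hypothetical inputs chosen by the
    user; they are not values taken from the benchmark study.
    """
    for ntmpi, npme, nstlist in product(ntmpi_options, npme_options,
                                        nstlist_options):
        # Skip uneven thread splits and runs that leave no rank for
        # particle-particle work.
        if total_cores % ntmpi != 0 or npme >= ntmpi:
            continue
        ntomp = total_cores // ntmpi  # OpenMP threads per thread-MPI rank
        yield (f"gmx mdrun -s {tpr} -ntmpi {ntmpi} -ntomp {ntomp} "
               f"-npme {npme} -nstlist {nstlist}")


if __name__ == "__main__":
    for cmd in mdrun_commands(64, [4, 8], [0, 1], [100, 200]):
        print(cmd)
```

Each generated command would then be run and its reported ns/day compared, keeping the best result per dataset.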
With one GPU in test, the performance of the SXM4 XE8545 server is similar to that of the PCIe R7525. With two GPUs in test, the SXM4 XE8545 performance is up to 28 percent better than the PCIe R7525. This comparison between the NVIDIA PCIe and SXM4 form factors across the two server platforms shows that datasets like Water 1536 and Water 3072, which demand more GPU-to-GPU communication, run around 28 percent faster on SXM4. On the other hand, for datasets like LignoCellulose 3M, the two-GPU R7525 achieves the same per-GPU performance as the XE8545 with the lower-power 250 W GPU, making it the more efficient solution.
The following figure shows the performance results for LAMMPS, a molecular dynamics application, on the PowerEdge R7525 and XE8545 servers. The code was compiled with the KOKKOS package to run efficiently on NVIDIA GPUs, and Lennard Jones is the dataset that was tested with Timesteps/s as the metric for comparison.
Figure 8: LAMMPS performance on the PowerEdge R7525 and PowerEdge XE8545 Servers
With one GPU in test, the performance of the SXM4 XE8545 server is 13 percent higher than the PCIe R7525, and with two GPUs in test, a 23 percent performance improvement was measured. The PowerEdge XE8545 is at an advantage because the GPUs can communicate with each other over NVLink without the intervention of a CPU. The R7525 server with two GPUs is limited by the GPU-to-GPU communication pattern. The other factor contributing to the better performance is the higher clock rate of the SXM4 A100 GPU.
In this blog, we discussed the performance of NVIDIA A100 GPUs on the PowerEdge R7525 Server and the PowerEdge XE8545 Server, which is the new addition from Dell Technologies. The A100 GPU has 42 percent more memory bandwidth and higher double precision FLOPs compared to its predecessor, the V100 series GPU. For workloads which demand more GPU-to-GPU communication, the PowerEdge XE8545 server is an ideal choice. For data centers where space and power are limited, the PowerEdge R7525 server may be the right fit. The overall performance of PowerEdge XE8545 Server with four A100-SXM4 GPUs is 1.5 to 2.3 times faster than the PowerEdge R7525 server with two A100-PCIe GPUs.
In the future, we intend to evaluate the A100-80GB GPUs and NVIDIA A40 GPUs that will be available this year. We also plan to focus on a multi-node performance study with these GPUs.
Please contact your Dell sales specialist about the HPC and AI Innovation Lab if you would like to evaluate these GPU servers.
Tue, 30 Mar 2021 18:34:08 -0000
We’ve been speculating that AMD Milan with Zen3 cores, which allows more cores to share the same L3 cache, could perform better for Next Generation Sequencing (NGS) applications. Compared with its predecessor, AMD EPYC Rome, the number of cores sharing the L3 cache is doubled from 4 to 8 for the 64 core processor model. In addition, the cache (both L1 and L2) is upgraded with new prefetchers, and memory bandwidth is improved.
Since Milan and Rome share the same SP3 socket, the Dell EMC PowerEdge R6525 was selected for the case study to minimize system-to-system variations. The test server configuration is summarized in Table 1.
Table 1. Tested compute node configuration
Dell EMC PowerEdge R6525 | |
CPU | Tested AMD Milan: 2x 7763 (Milan), 64 Cores, 2.45GHz – 3.5GHz Base-Boost, TDP 280W, 256 MB L3 Cache; 2x 7713 (Milan), 64 Cores, 2.0GHz – 3.7GHz Base-Boost, TDP 225W, 256 MB L3 Cache; 7543 (Milan), 32 Cores, 2.8GHz – 3.7 GHz Base-Boost, TDP 225W, 256 MB L3 Cache. Tested AMD Rome: 7702 (Rome), 64 Cores, 2.0 GHz – 3.35 GHz Base-Boost, TDP 200W, 256 MB L3 Cache |
RAM | DDR4 256G (16Gb x 16) 3200 MT/s |
OS | RHEL 8.3 (4.18.0-240.el8.x86_64) |
Filesystem Network | Mellanox InfiniBand HDR100 |
Filesystem | Dell EMC Ready Solutions for HPC BeeGFS High Capacity Storage |
BIOS System Profile | Performance Optimized |
Logical Processor | Disabled |
Virtualization Technology | Disabled |
BWA | 0.7.15-r1140 |
Sambamba | 0.7.0 |
Samtools | 1.6 |
GATK | 3.6-0-g89b7209 |
The test data was chosen from one of Illumina’s Platinum Genomes. ERR194161 was processed with an Illumina HiSeq 2000, was submitted by Illumina, and can be obtained from EMBL-EBI. The DNA identifier for this individual is NA12878. The description of the data on the linked website states that this sample has a >30x depth of coverage; it reaches ~53x.
In a typical BWA-GATK pipeline, there are multiple steps, and each step consists of various applications which behave distinctively. As shown in Table 2, the applications in some steps do not support multi-threading operations. These steps are problematic since there are only a few ways to improve their performance.
Table 2. The steps in BWA-GATK pipeline and tools
Steps | Applications | Multi-threading support |
Mapping & Sorting | BWA, samtools, sambamba | Yes |
Mark Duplicates | Sambamba | Yes |
Generate Realigning Targets | GATK RealignerTargetCreator | Yes |
Insertion/Deletion Realigning | GATK IndelRealigner | No |
Base Recalibration | GATK BaseRecalibrator | Yes |
HaplotypeCaller | GATK HaplotypeCaller | Yes |
Genotype GVCFs | GATK GenotypeGVCFs | Yes |
Variant Recalibration | GATK VariantRecalibrator | No |
Apply Variant Recalibration | GATK ApplyRecalibration | No |
Single-threaded applications, especially the Variant Recalibration and Apply Variant Recalibration steps, show no runtime variation because the algorithm is deterministic and the inputs for these steps are small. Hence, these two steps are not reported in Figure 1. Genotype GVCFs is also omitted from Figure 1 for a similar reason, although it supports multi-threading. The first step, Mapping & Sorting, scales as the number of cores increases (Figure 1 (a)).
Burrows-Wheeler Aligner (BWA) is one of the most popular short sequence aligners for non-gapped aligning analysis. BWA scales well up to 32 cores, and CPU usage drops dramatically beyond that. The runtime improvement becomes marginal with more than 32 cores, and using more than 80 cores for this step is a waste of resources.
Sambamba, which is compatible with Picard, is used for marking duplicate reads. The behavior of Sambamba is plotted in Figure 1 (b). Because of its highly parallelized design, memory consumption increases as more hash tables are created for additional threads. Notably, a 50x human whole genome sequence (WGS) is not large enough for this well-designed software to make use of more than 13 cores.
After the Mark Duplicates step, the Genome Analysis Toolkit (GATK), written in Java, plays a critical role in both the performance measurement and the final results. These steps do not scale at all, as shown in Figure 1 (c), (d), (e), and (f). A better approach for handling this behavior in multi-core and multi-socket environments will be discussed in future work.
This test is not a fair comparison since the majority of steps will not take advantage of all the cores, except on the 32-core 7543. However, this comparison will help to decide which CPU could be best for the throughput test.
Table 3 summarizes the overall runtimes for the BWA-GATK pipeline, and it is hard to say which CPU is better in terms of total runtime. Many more tests are required to distinguish the performance differences in the GATK steps. Also, the results for the 7502 and 7402 come from previous tests in different environments.
The Mapping & Sorting step is the only step in Table 3 where we can see true performance variations across different CPUs. A rough estimate of the performance improvement is 7% from the 7702 to the 7763 and 5% from the 7702 to the 7713.
Surprisingly, the Base Recalibration step showed similar results to the Mapping & Sorting step, with improvements of 8% and 3%, respectively.
Table 3. BWA-GATK performance comparisons between Milan and Rome. The number of cores used for the test is parenthesized.
Steps (runtime in hours) | AMD 7763 64c 2.45GHz | AMD 7713 64c 2.0GHz | AMD 7543 32c 2.8GHz | AMD 7702 64c 2.0GHz | AMD 7502 32c 2.5GHz | AMD 7402 24c 3.0GHz |
Mapping & Sorting | 2.44 (64) | 2.49 (64) | 3.69 (32) | 2.63 (64) | 4.68 (32) | 5.73 (24) |
Mark Duplicates | 1.07 (13) | 1.10 (13) | 1.01 (13) | 1.01 (13) | 0.93 (13) | 0.94 (13) |
Generate Realigning Targets | 0.55 (32) | 0.56 (32) | 0.50 (32) | 0.58 (32) | 0.45 (32) | 0.44 (32) |
Insertion/Deletion Realigning | 8.73 (1) | 9.13 (1) | 7.73 (1) | 8.78 (1) | 8.30 (1) | 8.21 (1) |
Base Recalibration | 2.27 (32) | 2.38 (32) | 2.17 (32) | 2.46 (32) | 2.52 (32) | 2.67 (24) |
HaplotypeCaller | 10.20 (16) | 10.57 (16) | 9.15 (16) | 9.02 (16) | 9.33 (16) | 9.05 (16) |
Genotype GVCFs | 0.02 (32) | 0.02 (32) | 0.01 (32) | 0.02 (32) | 0.01 (32) | 0.01 (24) |
Variant Recalibration | 0.31 (1) | 0.20 (1) | 0.17 (1) | 0.12 (1) | 0.21 (1) | 0.13 (1) |
Apply Variant Recalibration | 0.01 (1) | 0.01 (1) | 0.01 (1) | 0.01 (1) | 0.01 (1) | 0.01 (1) |
Total Runtime (hours) | 25.59 | 26.47 | 24.44 | 24.64 | 26.46 | 27.25 |
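The improvement percentages quoted for Mapping & Sorting and Base Recalibration can be reproduced from the Table 3 runtimes. The sketch below assumes that "improvement" means the runtime reduction relative to the Rome 7702 baseline; the text does not state the formula, so this definition is an assumption.

```python
def improvement_pct(baseline_hours, new_hours):
    """Runtime reduction relative to the baseline, in percent (assumed definition)."""
    return (baseline_hours - new_hours) / baseline_hours * 100.0


# Runtimes in hours, copied from Table 3.
runtimes = {
    "Mapping & Sorting": {"7763": 2.44, "7713": 2.49, "7702": 2.63},
    "Base Recalibration": {"7763": 2.27, "7713": 2.38, "7702": 2.46},
}

for step, hours in runtimes.items():
    for cpu in ("7763", "7713"):
        gain = improvement_pct(hours["7702"], hours[cpu])
        print(f"{step}: 7702 -> {cpu}: {gain:.1f}%")
```

Rounded to whole percentages, these values match the 7% and 5% (Mapping & Sorting) and 8% and 3% (Base Recalibration) figures given in the text.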
A typical way of running an NGS pipeline is to process multiple samples on a compute node and use multiple compute nodes to maximize the throughput. However, this time the tests were performed on a single compute node due to the limited number of servers available at this moment.
The current pipeline invokes a large number of pipe operations in the first step to minimize the number of intermediate files written. Although this saves a day of runtime and lowers the storage usage significantly, the cost of invoking pipes is quite heavy. Hence, this limits the number of samples that can be processed concurrently. Typically, a process silently fails when there are not enough resources left to start an additional process.
However, the failures experienced during this study are quite different from previous observations. Of 10 pipelines started on an R6525 with 2x 7763, only 6 were sustained on average with the 50x human WGS. Four pipelines failed with a broken pipe error, which suggests a problem with some sort of file operation. The BeeGFS storage used for the test is designed for high capacity, with a theoretical sequential write bandwidth of 25 GB/s. However, roughly 16 GB/s is achievable when the storage is not under heavy load in a shared storage environment. This is not an ideal environment for benchmarking; however, the results here are quite helpful for seeing what the performance of these systems looks like in real life.
As shown in Table 4, the maximum number of samples that can be processed at the same time is around 4 or 5, and a throughput of ~4.79 50x human whole genomes per day is achievable with the current environment.
Table 4. Throughput test for Milan 7763
Steps (runtimes in hours) | 1 Sample | 2 Samples | 4 Samples | 6 Samples | 10 Samples |
Number of Samples Failed | 0 | 0 | 0 | 1 | 4 |
Mapping & Sorting | 2.44 | 2.91 | 4.33 | 5.86 | 8.33 |
Mark Duplicates | 1.07 | 1.40 | 1.69 | 1.31 | 5.51 |
Generate Realigning Targets | 0.55 | 0.88 | 1.77 | 0.50 | 2.07 |
Insertion/Deletion Realigning | 8.73 | 8.97 | 8.92 | 8.92 | 9.70 |
Base Recalibration | 2.27 | 2.50 | 2.79 | 3.26 | 3.67 |
HaplotypeCaller | 10.20 | 10.57 | 10.27 | 9.91 | 9.96 |
Genotype GVCFs | 0.02 | 0.11 | 0.10 | 0.10 | 0.15 |
Variant Recalibration | 0.31 | 0.25 | 0.20 | 0.21 | 0.36 |
Apply Variant Recalibration | 0.01 | 0.02 | 0.01 | 0.01 | 0.03 |
Total Runtime (hours) | 25.59 | 27.62 | 30.08 | 30.08 | 39.79 |
Genomes per day | 0.94 | 1.74 | 4.79 | 3.99 | 3.62 |
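The "Genomes per day" row appears to follow from the number of successfully completed samples and the total runtime. The sketch below assumes throughput = (samples started − samples failed) × 24 / total runtime in hours. This formula is an inference, not something the text states: it reproduces the 1-, 2-, 6-, and 10-sample entries, but with the values shown it yields roughly 3.19 for the 4-sample column rather than the listed 4.79, so it should be treated as an approximation of how the row was derived.

```python
def genomes_per_day(samples_started, samples_failed, total_runtime_hours):
    """Completed genomes per 24 hours (assumed formula, see lead-in)."""
    completed = samples_started - samples_failed
    return completed * 24.0 / total_runtime_hours


# (samples started, samples failed, total runtime in hours) from Table 4.
columns = [(1, 0, 25.59), (2, 0, 27.62), (4, 0, 30.08), (6, 1, 30.08), (10, 4, 39.79)]

for started, failed, hours in columns:
    rate = genomes_per_day(started, failed, hours)
    print(f"{started} samples: {rate:.2f} genomes/day")
```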
The field of NGS data analysis has been moving fast in terms of data growth and data variation. However, the community has not done much work to adopt newly available technologies such as accelerators. Instead of improving code quality, the community is faced with analyzing the data without multi-threaded processing, since GATK version 4 and later no longer supports multi-threading, while the number of cores in a CPU is increasing rapidly.
It is time for users to think about how this problem can be tackled. One simple way to avoid it is data-level parallelization. Although deciding when to split the data is quite hard, it is certainly tractable with careful interventions in an existing BWA-GATK pipeline, without diluting statistical power with a sheer number of data. If each smaller data chunk goes through an individual pipeline on its own core and the results are merged at the end, it could be possible to achieve better performance on a single sample. This performance gain could lead to higher throughput if the overall runtime is reduced significantly.
Nonetheless, the Milan 7763 and 7713 are excellent candidates for both current multi-threading-based pipelines and future pipelines driven by data-level parallelization, which can use more of the available cores.
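The data-level parallelization idea (split, process each chunk independently, merge) can be sketched as a scatter-gather pattern. This is a minimal illustration, not the pipeline used in the study: `run_pipeline` is a hypothetical placeholder that only counts records, where a real worker would launch the per-chunk BWA-GATK steps as external processes, and the chunking here ignores the biological question of where it is safe to split.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Split a list into n_chunks contiguous, nearly equal pieces."""
    size, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def run_pipeline(chunk):
    """Hypothetical per-chunk pipeline; here it only counts the records."""
    return len(chunk)


def scatter_gather(records, n_workers):
    """Process each chunk independently, then merge the partial results."""
    chunks = split_into_chunks(records, n_workers)
    # Threads act only as orchestrators in this sketch; a real worker would
    # spend its time waiting on an external process.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(run_pipeline, chunks))
    return sum(partial_results)  # merge step


if __name__ == "__main__":
    print(scatter_gather(list(range(10)), 4))
```

The merge step is a plain sum only because the placeholder returns counts; merging real variant calls would need a format-aware tool.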
Tue, 30 Mar 2021 18:23:11 -0000
With the release of the AMD EPYC 7003 Series Processors (architecture codenamed "Milan"), Dell EMC PowerEdge servers have now been upgraded to support the new features. This blog outlines the Milan Processor architecture and the recommended BIOS settings to deliver optimal HPC Synthetic benchmark performance. Upcoming blogs will focus on the application performance and characterization of the software applications from various scientific domains such as Weather Science, Molecular Dynamics, and Computational Fluid Dynamics.
AMD Milan with Zen3 cores is the successor of AMD's high-performance second generation server microprocessor (architecture codenamed "Rome"). It supports up to 64 cores at 280 W TDP and 8 DDR4 memory channels at speeds up to 3200 MT/s.
As with AMD Rome, AMD Milan’s 64 core Processor model has 1 I/O die and 8 compute dies (also called CCD or Core Complex Die); OPN 32 core models may have 4 or 8 compute dies. Milan Processors have upgrades to the Cache (including new prefetchers at both L1 and L2 caches) and Memory Bandwidth, which are expected to improve the performance of applications requiring higher memory bandwidth.
Unlike in Naples and Rome, the arrangement of the CCDs has changed in Milan. Each CCD now features up to 8 cores with a unified 32 MB L3 cache, which could reduce the cache access latency within compute chiplets. Milan Processors can expose each CCD as a NUMA node by setting the “L3 cache as NUMA Domain” option (from the iDRAC GUI) or the BIOS.ProcSettings.CcxAsNumaDomain option (using the racadm CLI) to “Enabled”. Therefore, a dual-socket system with 64 core Milan Processors and 8 CCDs per Processor will expose 16 NUMA domains in this setting. Here is the logical representation of the core arrangement with NUMA Nodes per socket = 4 and CCD as NUMA = Disabled.
Figure 1: Linear core enumeration on a dual-socket system, 64c per socket, NPS4 configuration on an 8 CCD Processor model
As with AMD Rome, AMD Milan Processors support the 256-bit AVX2 instruction set, allowing 16 DP FLOP/cycle.
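The 16 DP FLOP/cycle figure translates into a theoretical peak for each Processor Model. The sketch below assumes the peak is computed at the base frequency (sockets × cores × GHz × FLOP/cycle); whether base or boost clock was used for the efficiency numbers in this blog is not stated, so treat the choice as an assumption.

```python
def peak_gflops(sockets, cores_per_socket, base_ghz, flop_per_cycle=16):
    """Theoretical double precision peak in GFLOPS at the given clock."""
    return sockets * cores_per_socket * base_ghz * flop_per_cycle


# Cores per socket and base frequencies are taken from Table 1.
for opn, cores, ghz in (("7763", 64, 2.45), ("7713", 64, 2.0), ("7543", 32, 2.8)):
    print(f"2x {opn}: {peak_gflops(2, cores, ghz):.1f} GFLOPS")
```

HPL efficiency is then the measured HPL result divided by this theoretical peak.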
Processors from both Milan and Rome generations are socket compatible, so the BIOS Options are similar across these Processor generations. Server details are mentioned in Table 1 below.
Table 1: Testbed hardware and software details
Server | Dell EMC PowerEdge 2 socket servers (with AMD Milan Processors) | Dell EMC PowerEdge 2 socket servers (with AMD Rome Processors) |
OPN, Cores/Socket, Frequency (Base-Boost), TDP, L3 Cache | 7763 (Milan), 64, 2.45GHz – 3.5GHz, 280W, 256 MB | 7H12 (Rome), 64, 2.6GHz – 3.3 GHz, 280W, 256 MB |
OPN, Cores/Socket, Frequency (Base-Boost), TDP, L3 Cache | 7713 (Milan), 64, 2.0GHz – 3.7GHz, 225W, 256 MB | 7702 (Rome), 64, 2.0 GHz – 3.35 GHz, 200W, 256 MB |
OPN, Cores/Socket, Frequency (Base-Boost), TDP, L3 Cache | 7543 (Milan), 32, 2.8GHz – 3.7 GHz, 225W, 256 MB | 7542 (Rome), 32, 2.9GHz – 3.4 GHz, 225W, 128 MB |
Operating System | RHEL 8.3 (4.18.0-240.el8.x86_64) | RHEL 8.2 (4.18.0-193.el8.x86_64) |
Memory | DDR4 256G (16Gb x 16) 3200 MT/s | |
BIOS / CPLD | 2.0.3 / 1.1.12 | 1.1.7 |
Interconnect | Mellanox HDR 200 (4X HDR) | Mellanox HDR 100 |
The following BIOS options were explored –
After setting System Profile (BIOS.SysProfileSettings.SysProfile) to PerformanceOptimized, NUMA Nodes Per Socket (NPS) to 4, and Prefetchers (L1Region, L1Stream, L1Stride, L2Stream, L2UpDown) to “Enabled”, we measured the impact of the CcxAsNumaDomain and MemoryInterleaving BIOS parameters on application performance. We tested the performance of the applications listed in Table 1 with the following settings.
Table 2: Combinations of CCX as NUMA domain and Memory Interleaving
CCX as NUMA Domain | Memory Interleaving | |
Setting01 | Disabled | Disabled |
Setting02 | Disabled | Auto |
Setting03 | Enabled | Auto |
Setting04 | Enabled | Disabled |
With Setting01 and Setting02 (CCX as NUMA Domain = Disabled), the system will expose 8 NUMA nodes. With Setting03 and Setting04, there will be 16 NUMA nodes on a dual socket server with 64 core based Milan Processors.
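The NUMA node counts above follow from simple arithmetic, sketched below. The function is a simplification written for this illustration: it assumes that with CCX as NUMA Domain enabled each CCD becomes one NUMA node, and that otherwise the NPS setting alone determines the count.

```python
def numa_nodes(sockets, nps, ccds_per_socket, ccx_as_numa):
    """Number of NUMA nodes the system exposes (simplified model)."""
    per_socket = ccds_per_socket if ccx_as_numa else nps
    return sockets * per_socket


total_cores = 2 * 64  # dual-socket, 64-core Milan Processors

for ccx_as_numa in (False, True):
    nodes = numa_nodes(sockets=2, nps=4, ccds_per_socket=8, ccx_as_numa=ccx_as_numa)
    print(f"CCX as NUMA = {ccx_as_numa}: {nodes} nodes, "
          f"{total_cores // nodes} cores per node")
```

On the test system, the actual layout can be checked with `numactl -H` or `hwloc-ls`.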
Table 3: hwloc-ls and numactl -H command output on 64c server with setting01/setting02 and (listed in Table 2)
Figure 2: Relative difference in the performance of HPL by processor and BIOS settings mentioned in Table 1 and Table 2.
Figure 3: Relative difference in the performance of HPCG by processor and BIOS settings mentioned in Table 1 and Table 2.
Figure 4: Relative difference in the performance of STREAM by processor and BIOS settings mentioned in Table 1 and Table 2.
HPL delivers the best performance numbers on setting02 with 82-93% efficiency depending on Processor Model, whereas STREAM and HPCG deliver better performance with setting04.
STREAM TRIAD tests generate best performance numbers at ~378 GB/s memory bandwidth across all of the 64 and 32 core Processor Models mentioned in Table 1 with efficiency up to 90%.
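For context, the theoretical peak memory bandwidth can be estimated from the memory configuration. The sketch below assumes 8 DDR4 channels per socket at 3200 MT/s with 8 bytes per transfer on a dual-socket server; the efficiency it prints is an approximation and may differ slightly from how the figure quoted above was derived.

```python
def peak_memory_bandwidth_gbs(sockets, channels_per_socket, mt_per_s,
                              bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1000 MB)."""
    return sockets * channels_per_socket * mt_per_s * bytes_per_transfer / 1000.0


peak = peak_memory_bandwidth_gbs(sockets=2, channels_per_socket=8, mt_per_s=3200)
measured = 378.0  # approximate STREAM TRIAD result quoted in the text, in GB/s

print(f"Theoretical peak: {peak:.1f} GB/s")
print(f"STREAM TRIAD efficiency: {measured / peak:.1%}")
```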
In Figure 4, the STREAM TRIAD performance numbers were measured by undersubscribing the server, utilizing only 16 cores. A comparison of the performance numbers obtained with all available cores and with 16 cores per system is shown in Figure 5. The numbers on top of the orange bars show the relative difference.
Figure 5: Relative difference in the memory bandwidth.
From Figure 5, we observed that by using 16 cores, the STREAM TRIAD test’s performance numbers were ~3-4% higher than the performance numbers measured by subscribing all available cores.
We carried out NUMA bandwidth tests using setting02 and setting04 mentioned in Table 2. With setting02, the system exposes a total of 8 NUMA nodes, while with setting04, the system exposes a total of 16 NUMA nodes with 8 cores per NUMA node. In Figure 6 and Figure 7, the NUMA node is presented as “c” and the memory node as “m”. As an example, c0m0 represents NUMA node 0 and memory node 0. The figures show the best bandwidth numbers obtained on varying the number of threads.
Figure 6: Local and remote NUMA memory bandwidth with CCXasNUMADomain=Disabled
Figure 7: Local and remote NUMA memory bandwidth with CCXasNUMADomain=Enabled
We observed that the optimal intra-socket local memory bandwidth numbers were obtained with 2 threads per NUMA node with setting02 on both 64 core and 32 core processor models. In Figure 6, with setting02 (Table 2), the intra-socket local memory bandwidth at 2 threads per NUMA node can be up to 79% more than the inter-socket remote memory bandwidth. With setting02 (Figure 6), we get at least 96% higher intra-socket local memory bandwidth per NUMA domain than with setting04 (Figure 7).
Milan introduces two new prefetchers for L1 cache and one for L2 Cache with a total of five prefetcher options which can be configured using BIOS. We tested combinations listed in Table 5 by keeping L1 Stream and L2 Stream prefetcher as Enabled.
Table 5: Cache Prefetchers
L1StridePrefetcher | L1RegionPrefetcher | L2UpDownPrefetcher | |
setting01 | Disabled | Enabled | Enabled |
setting02 | Enabled | Disabled | Enabled |
setting03 | Enabled | Enabled | Disabled |
setting04 | Disabled | Disabled | Disabled |
We found that these new prefetchers do not have significant impact on the performance of the synthetic benchmarks covered in this blog.
For Multinode tests, the testbed was configured with Mellanox HDR interconnect running at 200 Gbps, with each server having the AMD 7713 Processor Model and the Preferred IO setting set to Enabled in the BIOS. With setting02 (Table 2) and the Prefetchers (L1Region, L1Stream, L1Stride, L2Stream, L2UpDown) set to “Enabled”, we were able to achieve the expected linear performance scalability for the HPL and HPCG Benchmarks.
Figure 8: Multinode scalability of HPL and HPCG with setting02 (Table 2) with 7713 Processor model, HDR200 InfiniBand
We tested the Message Rate, Unidirectional, and Bidirectional InfiniBand bandwidth using OSU Benchmarks, and the results are in Figure 9, Figure 10, and Figure 11. Except for the Numa Nodes per socket setting, all other BIOS settings for these tests were the same as mentioned above. The OSU Bidirectional bandwidth and OSU Unidirectional tests were carried out with Numa Nodes per socket set to 2, and the Message rate test was carried out with Numa Nodes per socket set to 4. In Figure 9 and Figure 10, the numbers on top of the orange bars represent the percentage difference between Local and Remote bandwidth performance numbers.
Figure 9: OSU bi-directional bandwidth test on AMD 7713, HDR 200 InfiniBand
Figure 10: OSU uni-directional bandwidth test on AMD 7713, HDR 200 InfiniBand
For Local Latency and Bandwidth performance numbers, the MPI process was pinned to the NUMA node 1 (closest to the HCA). For Remote Latency and Bandwidth tests, processes were pinned to NUMA node 6.
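The blog does not say which tool was used for pinning. As one possible illustration on Linux, a process can restrict itself to the cores of a NUMA node by reading the node's CPU list from sysfs and calling `os.sched_setaffinity`. The helper names below are hypothetical, and this sketch pins CPU affinity only; memory placement would need a tool such as `numactl`.

```python
import os


def parse_cpulist(text):
    """Parse a sysfs CPU list such as '16-31,48' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_numa_node(node):
    """Pin the calling process to the cores of one NUMA node (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as handle:
        cpus = parse_cpulist(handle.read())
    os.sched_setaffinity(0, cpus)  # 0 means the calling process
    return cpus
```

For the local tests described above this would correspond to `pin_to_numa_node(1)`, and for the remote tests to `pin_to_numa_node(6)`.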
Figure 11: OSU Message rate and bandwidth performance on 2 and 4 nodes of 7713 Processor model
On 2 nodes using HDR200, we are able to achieve ~24 GB/s unidirectional bandwidth and message rate of 192 Million messages/second – almost double the performance numbers obtained on HDR100.
In order to draw out performance improvement comparisons, we have selected Rome SKUs closest to their Milan counterparts in terms of hardware features such as Cache Size, TDP values, and Processor Base/Turbo Frequency.
Figure 12: HPL performance comparison with Rome Processor Models
Figure 13: HPCG performance comparison with Rome Processor Models
Figure 14: STREAM performance comparison with Rome Processor Models
For HPL (Figure 12) we observed that, on higher end Processor Models, Milan delivers 10% better performance than Rome. As expected, on the Milan platform, memory bandwidth bound applications like STREAM and HPCG (Figure 13 and Figure 14) gain 6-16% and 13-32% in performance over the Rome Processor Models covered in this blog.
Milan-based servers show expected performance upgrades, especially for the memory bandwidth bound synthetic HPC benchmarks covered in this blog. Configuring the BIOS options is important in order to get the best performance out of the system. Hyper-Threading should be disabled for general-purpose HPC systems; the benefits of this feature should be tested, and the feature enabled as appropriate, for the synthetic benchmarks not covered in this blog.
Check back soon for subsequent blogs that describe application performance studies on our Milan Processor based cluster.
Thu, 18 Mar 2021 16:39:54 -0000
This blog discusses the performance of Siemens’ Simcenter STAR-CCM+ on the Dell EMC Ready Solutions for HPC Digital Manufacturing with AMD EPYC 7003 series processors. This Dell EMC Ready Solutions for HPC was designed and configured specifically for digital manufacturing workloads, where computer aided engineering (CAE) applications are critical for virtual product development. The Dell EMC Ready Solutions for HPC Digital Manufacturing uses a flexible building block approach to HPC system design, where individual building blocks can be combined to build HPC systems which are optimized for customer specific workloads and use cases.
The Dell EMC Ready Solutions for HPC Digital Manufacturing is one of many solutions in the Dell EMC HPC solution portfolio. Please visit www.dellemc.com/hpc for a comprehensive overview of the HPC solutions offered by Dell EMC.
Performance benchmarking was performed using dual-socket Dell EMC PowerEdge servers with 7002 and 7003 series AMD EPYC processors. All servers were populated with two processors and one DIMM per channel memory configuration. The system configurations used for the performance benchmarking are shown in Table 1 and Table 2. The BIOS configuration used for the benchmarking systems is shown in Table 3.
Table 1. 7002 Series AMD EPYC System Configuration
Server | Dell EMC PowerEdge C6525 |
Processor | 2x AMD EPYC 7532 32-core Processors |
Memory | 16x16GB 3200 MTps RDIMMs |
BIOS Version | 1.4.8 |
Operating System | Red Hat Enterprise Linux Server release 7.6 |
Kernel Version | 3.10.0-957.27.2.el7.x86_64 |
Table 2. 7003 Series AMD EPYC System Configuration
Server | Dell EMC PowerEdge R6525 |
Processors | 2x AMD EPYC 7713 64-Core Processors or 2x AMD EPYC 7543 32-Core Processors |
Memory | 16x16GB 3200 MTps RDIMMs |
BIOS Version | 2.0.1 |
Operating System | Red Hat Enterprise Linux Server release 8.3 |
Kernel Version | 4.18.0-240.el8.x86_64 |
Table 3. BIOS Configuration
System Profile | Performance Optimized |
Logical Processor | Disabled |
Virtualization Technology | Disabled |
NUMA Nodes Per Socket | 4 |
Application software versions are as described in Table 4.
Table 4. Software Versions
Simcenter STAR-CCM+ | 2020.3.1 mixed precision with Open MPI 4 |
Simcenter STAR-CCM+ is a multiphysics software application used to simulate a wide range of products and designs under a variety of conditions. The benchmarks reported here mainly use the computational fluid dynamics (CFD) and heat transfer features of STAR-CCM+. CFD applications typically scale well across multiple processor cores and servers, have modest memory capacity requirements, and typically perform minimal disk I/O while solving. However, some simulations may have greater I/O demands, such as transient analysis.
The benchmark cases from the standard STAR-CCM+ benchmark suite were evaluated on the systems. The benchmark results reported here are single-server performance results, with the benchmark run using all processor cores available in the server. STAR-CCM+ benchmark performance is measured using the Average Elapsed Time metric which is the average elapsed time per solver iteration. A smaller elapsed time represents better performance. Figure 1 shows the relative performance results for a selection of the STAR‑CCM+ benchmarks.
Figure 1. Simcenter STAR-CCM+ Single Server Performance
The results in Figure 1 are plotted relative to the performance of a single server configured with AMD EPYC 7532 processors. Larger values indicate better overall performance. These results show the performance improvement available with 7003 series AMD EPYC processors. The 32-core AMD EPYC 7543 processor provides good performance for these benchmarks. Per server, the 64-core AMD EPYC 7713 provides a significant performance advantage over the 32-core processors.
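Because a smaller Average Elapsed Time is better, a relative performance value can be obtained by dividing the baseline elapsed time by the measured one. The sketch below assumes this is how the values in Figure 1 were normalized; the elapsed times used are hypothetical placeholders, not measured results.

```python
def relative_performance(baseline_elapsed, elapsed):
    """Performance relative to the baseline; values above 1.0 are faster."""
    return baseline_elapsed / elapsed


# Hypothetical average elapsed times per solver iteration, in seconds.
baseline_7532 = 10.0
candidate = 5.0

print(f"Relative performance: {relative_performance(baseline_7532, candidate):.2f}")
```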
The results presented in this blog show that 7003 series AMD EPYC processors offer a significant performance improvement for Siemens’ Simcenter STAR-CCM+ relative to 7002 series AMD EPYC processors.
Wed, 16 Dec 2020 17:34:42 -0000
The Dell EMC PowerEdge R7525 server supports the AMD MI100 GPU Accelerator. The server is a two-socket, 2U rack-based server that is designed to run complex workloads using highly scalable memory, I/O capacity, and network options. The system is based on the 2nd Gen AMD EPYC processor (up to 64 cores), has up to 32 DIMMs, and has PCI Express (PCIe) 4.0-enabled expansion slots. The server supports SATA, SAS, and NVMe drives and up to three double-wide 300 W accelerators.
The following figure shows the front view of the server:
Figure 1. Dell EMC PowerEdge R7525 server
The AMD Instinct™ MI100 accelerator is one of the world’s fastest HPC GPUs available in the market. It offers innovations to obtain higher performance for HPC applications with the following key technologies:
This blog focuses on the performance characteristics of a single PowerEdge R7525 server with AMD MI100-32G GPUs. We present results from the general matrix multiplication (GEMM) microbenchmarks, the LAMMPS benchmarks, and the NAMD benchmarks to showcase performance and scalability.
The following table provides the configuration details of the PowerEdge R7525 system under test (SUT):
Table 1. SUT hardware and software configurations
Component | Description |
Processor | AMD EPYC 7502 32-core processor |
Memory | 512 GB (32 GB 3200 MT/s * 16) |
Local disk | 2 x 1.8 TB SSD (No RAID) |
Operating system | Red Hat Enterprise Linux Server 8.2 |
GPU | 3 x AMD MI100-PCIe-32G |
Driver version | 3204 |
ROCm version | 3.9 |
Processor Settings > Logical Processors | Disabled |
System profiles | Performance |
NUMA node per socket | 4 |
NAMD benchmark | Version: NAMD 3.0 ALPHA 6 |
LAMMPS (KOKKOS) benchmark | Version: LAMMPS patch_18Sep2020+AMD patches |
The following table lists the AMD MI100 GPU specifications:
Table 2. AMD MI100 PCIe GPU specification
Component | |
GPU architecture | MI100 |
Peak Engine Clock (MHz) | 1502 |
Stream processors | 7680 |
Peak FP64 (TFLOPS) | 11.5 |
Peak FP64 Tensor DGEMM (TFLOPS) | 11.5 |
Peak FP32 (TFLOPS) | 23.1 |
Peak FP32 Tensor SGEMM (TFLOPS) | 46.1 |
Memory size (GB) | 32 |
Memory ECC support | Yes |
TDP (Watt) | 300 |
The GEMM benchmark is a simple, multithreaded dense matrix-to-matrix multiplication benchmark that can be used to test the performance of GEMM on a single GPU. The rocblas-bench binary compiled from https://github.com/ROCmSoftwarePlatform/rocBLAS was used to collect DGEMM and SGEMM results. The results of these tests reflect the performance of an ideal application that only runs matrix multiplication in the form of the peak TFLOPS that the GPU can deliver. Although GEMM benchmark results might not represent real-world application performance, it is still a good benchmark to demonstrate the performance capability of different GPUs.
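A GEMM result is usually converted to TFLOPS by counting 2·m·n·k floating point operations for an m×k by k×n multiplication. The sketch below shows that conversion and the resulting efficiency against the 11.5 TFLOPS peak FP64 value from Table 2. The matrix size and the elapsed time are hypothetical placeholders, not rocblas-bench output.

```python
def gemm_tflops(m, n, k, seconds):
    """Sustained TFLOPS for an (m x k) by (k x n) matrix multiplication."""
    return 2.0 * m * n * k / seconds / 1e12


peak_fp64_tflops = 11.5          # peak FP64 from Table 2
m = n = k = 8192                 # hypothetical matrix dimensions
elapsed_seconds = 0.15           # hypothetical timing, not a measured value

sustained = gemm_tflops(m, n, k, elapsed_seconds)
print(f"Sustained: {sustained:.2f} TFLOPS "
      f"({sustained / peak_fp64_tflops:.0%} of peak)")
```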
The following figure shows the observed numbers of DGEMM and SGEMM:
Figure 2. DGEMM and SGEMM for both AMD MI100 peak and AMD-PCIe sustained
The results indicate:
The Large-Scale Atom/Molecular Massively Parallel Simulator (LAMMPS) runs threads in parallel using message-passing techniques. This benchmark measures the scalability and performance of large, parallel systems of multiple GPUs.
The following figure shows that the KOKKOS implementation of LAMMPS scaled relatively linearly as AMD MI100 GPUs were added, across four datasets: EAM, LJ, Tersoff, and ReaxFF/C.
Figure 3. LAMMPS benchmark showing scaling of multiple AMD MI100 GPUs
Nanoscale Molecular Dynamics (NAMD) is a parallel molecular dynamics system designed for simulation of large biomolecular systems. The NAMD benchmark stresses the scaling and performance aspects of the server and GPU configuration.
The following figure plots the results of the NAMD microbenchmark:
Figure 4. NAMD benchmark performance
Aggregate data across multiple GPU cards is preferred because the Alpha builds of the NAMD 3.0 binary do not scale beyond a single accelerator. Three replica simulations were launched on the same server in parallel, one on each GPU. NAMD was CPU-bound in previous versions; the new 3.0 version has reduced the CPU dependence. As a result, the three-copy simulation scaled linearly, delivering three times the single-GPU throughput across all datasets.
As part of the optimization, the NAMD benchmark numbers in the following figure show the relative performance difference using different numbers of CPU cores for the STMV dataset:
Figure 5. CPU core dependency on NAMD
The AMD MI100 GPU exhibited an optimum configuration of four CPU cores per GPU.
The AMD MI100 accelerator delivers industry-leading performance, and it is a well-positioned performance per dollar GPU for both FP32 and FP64 HPC parallel codes.
In the future, we plan to test other HPC and Deep Learning applications. We also plan to research using “Hipify” tools to port CUDA sources to HIP.
Tue, 24 Nov 2020 17:49:03 -0000
The Dell PowerEdge R7525 server powered with 2nd Gen AMD EPYC processors was released as part of the Dell server portfolio. It is a 2U form factor rack-mountable server that is designed for HPC workloads. Dell Technologies recently added support for NVIDIA A100 GPGPUs to the PowerEdge R7525 server, which supports up to three PCIe-based dual-width NVIDIA GPGPUs. This blog describes the single-node performance of selected HPC applications with both one- and two-NVIDIA A100 PCIe GPGPUs.
The NVIDIA Ampere A100 accelerator is one of the most advanced accelerators available in the market, supporting two form factors:
The PowerEdge R7525 server supports only the PCIe version of the NVIDIA A100 accelerator.
The following table compares the NVIDIA A100 GPGPU with the NVIDIA V100S GPGPU:
NVIDIA A100 GPGPU | NVIDIA V100S GPGPU | |||
Form factor | SXM4 | PCIe Gen4 | SXM2 | PCIe Gen3 |
GPU architecture | Ampere | Volta | ||
Memory size | 40 GB | 40 GB | 32 GB | 32 GB |
CUDA cores | 6912 | 5120 | ||
Base clock | 1095 MHz | 765 MHz | 1290 MHz | 1245 MHz |
Boost clock | 1410 MHz | 1530 MHz | 1597 MHz | |
Memory clock | 1215 MHz | 877 MHz | 1107 MHz | |
MIG support | Yes | No | ||
Peak memory bandwidth | Up to 1555 GB/s | Up to 900 GB/s | Up to 1134 GB/s | |
Total board power | 400 W | 250 W | 300 W | 250 W |
The NVIDIA A100 GPGPU brings innovations and features for HPC applications such as the following:
The following table shows the PowerEdge R7525 server configuration that we used for this blog:
Server | PowerEdge R7525 |
Processor | 2nd Gen AMD EPYC 7502, 32C, 2.5 GHz |
Memory | 512 GB (16 x 32 GB @3200MT/s) |
GPGPUs | Either of the following: 2 x NVIDIA A100 PCIe 40 GB 2 x NVIDIA V100S PCIe 32 GB |
Logical processors | Disabled |
Operating system | CentOS Linux release 8.1 (4.18.0-147.el8.x86_64) |
CUDA | 11.0 (Driver version - 450.51.05) |
gcc | 9.2.0 |
MPI | OpenMPI-3.0 |
HPL | hpl_cuda_11.0_ompi-4.0_ampere_volta_8-7-20 |
HPCG | xhpcg-3.1_cuda_11_ompi-3.1 |
GROMACS | v2020.4 |
The following sections provide our benchmark results with observations.
High Performance Linpack (HPL) is a standard HPC system benchmark. This benchmark measures the compute power of the entire cluster or server. For this study, we used HPL compiled with NVIDIA libraries.
The following figure shows the HPL performance comparison for the PowerEdge R7525 server with either NVIDIA A100 or NVIDIA V100S GPGPUs:
Figure 1: HPL performance on the PowerEdge R7525 server with the NVIDIA A100 GPGPU compared to the NVIDIA V100S GPGPU
The problem size (N) is larger for the NVIDIA A100 GPGPU due to the larger capacity of GPU memory. We adjusted the block size (NB) used with the:
The AMD EPYC processors provide options for multiple NUMA combinations. We found that four NUMA nodes per socket (NPS=4) gave the best performance; NPS=1 and NPS=2 lowered performance by 10 percent and 5 percent, respectively. In a single PowerEdge R7525 node, the NVIDIA A100 GPGPU delivers 12 TF per card using this configuration without an NVLINK bridge. The PowerEdge R7525 server with two NVIDIA A100 GPGPUs delivers 2.3 times higher HPL performance compared to the NVIDIA V100S GPGPU configuration. This performance improvement is credited to the new double-precision Tensor Cores that accelerate FP64 math.
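The link between GPU memory capacity and the HPL problem size (N) can be illustrated with a common rule of thumb: pick the largest N, rounded down to a multiple of NB, for which the N x N double-precision matrix fits in a chosen share of memory. The following is a minimal sketch under stated assumptions; the 90 percent memory share and NB=288 are hypothetical values chosen for illustration, not the settings used in this study, and the quoted GPU memory sizes are treated as GiB.

```python
import math

GIB = 2**30  # assumption: the quoted "GB" of GPU memory is treated as GiB


def hpl_problem_size(total_gpu_mem_bytes, fraction=0.9, nb=288):
    """Largest N (a multiple of nb) whose N x N FP64 matrix fits in the given memory share.

    fraction and nb are illustrative assumptions, not values from the study.
    """
    usable_bytes = int(total_gpu_mem_bytes * fraction)
    n = math.isqrt(usable_bytes // 8)  # 8 bytes per double-precision element
    return (n // nb) * nb


n_a100 = hpl_problem_size(2 * 40 * GIB)   # two A100 PCIe cards, 40 GB each
n_v100s = hpl_problem_size(2 * 32 * GIB)  # two V100S PCIe cards, 32 GB each
print(n_a100, n_v100s)
```

Under these assumptions, the larger memory of the A100 cards yields a larger N, which is consistent with the observation above.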
The following figure shows power consumption of the server while running HPL on the NVIDIA A100 GPGPU in a time series. Power consumption was measured with an iDRAC. The server reached 1038 Watts at peak due to a higher GFLOPS number.
Figure 2: Power consumption while running HPL
The High Performance Conjugate Gradient (HPCG) benchmark is based on a conjugate gradient solver, in which the preconditioner is a three-level hierarchical multigrid method using the Gauss-Seidel method.
As shown in the following figure, HPCG performs at a rate 70 percent higher with the NVIDIA A100 GPGPU due to higher memory bandwidth:
Figure 3: HPCG performance comparison
Due to different memory size, the problem size used to obtain the best performance on the NVIDIA A100 GPGPU was 512 x 512 x 288 and on the NVIDIA V100S GPGPU was 256 x 256 x 256. For this blog, we used NUMA per socket (NPS)=4 and we obtained results without an NVLINK bridge. These results show that applications such as HPCG, which fits into GPU memory, can take full advantage of GPU memory and benefit from the higher memory bandwidth of the NVIDIA A100 GPGPU.
In addition to these two basic HPC benchmarks (HPL and HPCG), we also tested GROMACS, an HPC application. We compiled GROMACS 2020.4 with the CUDA compilers and OPENMPI, as shown in the following table:
Figure 4: GROMACS performance with NVIDIA GPGPUs on the PowerEdge R7525 server
The GROMACS build included thread MPI (built in with the GROMACS package). All performance numbers were captured from the output “ns/day.” We evaluated multiple MPI ranks, separate PME ranks, and different nstlist values to achieve the best performance. In addition, we used settings with the best environment variables for GROMACS at runtime. Choosing the right combination of variables avoided expensive data transfer and led to significantly better performance for these datasets.
GROMACS performance was based on a comparative analysis between NVIDIA V100S and NVIDIA A100 GPGPUs. Excerpts from our single-node multi-GPU analysis for two datasets showed a performance improvement of approximately 30 percent with the NVIDIA A100 GPGPU. This result is due to improved memory bandwidth of the NVIDIA A100 GPGPU. (For information about how the GROMACS code design enables lower memory transfer overhead, see Developer Blog: Creating Faster Molecular Dynamics Simulations with GROMACS 2020.)
The Dell PowerEdge R7525 server equipped with NVIDIA A100 GPGPUs shows exceptional performance improvements over servers equipped with previous versions of NVIDIA GPGPUs for applications such as HPL, HPCG, and GROMACS. Memory-bound applications such as HPCG and GROMACS in particular take advantage of the higher memory bandwidth available with NVIDIA A100 GPGPUs.
Tue, 17 Nov 2020 21:43:49 -0000
Today’s HPC environments have ever-increasing demands for very high-speed storage that frequently must also provide high capacity and distributed access via several standard protocols such as NFS and SMB. These high-demand HPC requirements are typically covered by Parallel File Systems that provide concurrent access to a single file or set of files from multiple nodes, very efficiently and securely distributing data to multiple LUNs across several servers using the network technology with the highest speed available.
This blog is a technology update for the use of Infiniband HDR100 on the Dell EMC Ready Solution for HPC PixStor Storage, a Parallel File System (PFS) solution for HPC environments where PowerVault ME484 EBOD arrays are used to increase the capacity of the solution. Figure 1 presents the reference architecture depicting the capacity expansion SAS additions to the existing PowerVault ME4084 storage arrays, replacing Infiniband EDR components with HDR100: ConnectX-6 HCAs and QM8700 switches. The PixStor solution includes the widespread General Parallel File System also known as Spectrum Scale as the PFS component, in addition to many other ArcaStream software components including advanced analytics, simplified administration and monitoring, efficient file search, advanced gateway capabilities and many others.
Figure 1 Reference Architecture
This solution was released with the latest Intel Xeon 2nd generation Scalable Xeon CPUs (Cascade Lake) and some of the servers will use the fastest RAM available (2933 MT/s). However, due to hardware availability during testing, the solution prototype used servers with Intel Xeon 1st generation Scalable Xeon CPUs (Skylake) and slower RAM to characterize the performance. Since the bottleneck of the solution is at the SAS controllers of the DellEMC PowerVault ME40x4 arrays, no significant performance disparity is expected once the Skylake CPUs and RAM are replaced with Cascade Lake CPUs and faster RAM. In addition, the solution was updated to the latest version of PixStor (5.1.3.1) that supports RHEL 7.7 and OFED 5.0-2.1.8.
Table 1 shows the list of main components for the solution where the first column has components used at release time and therefore available to customers, and the last column has the components actually used for characterizing the performance of the solution. The drives listed for data (12TB NLS) and metadata (960GB SSD), are the ones used for performance characterization, and faster drives can provide better Random IOPs and may improve create/removal metadata operations.
Finally, for completeness, the list of possible data HDDs and metadata SSDs was included, which is based on the drives supported as enumerated on the DellEMC PowerVault ME4 support matrix, available online.
Table 1 Components to be used at release time and those used in the test bed
Solution Component | Released | Test Bed | |
Internal Mgmt Connectivity | Dell Networking S3048-ON Gigabit Ethernet | ||
Data Storage Subsystem | 1x to 4x Dell EMC PowerVault ME4084 | ||
Optional High Demand Metadata Storage Subsystem | 1x to 2x (max 4x) Dell EMC PowerVault ME4024 | ||
RAID Storage Controllers | Redundant 12 Gbps SAS | ||
Capacity w/o Expansion | Raw: 4032 TB (3667 TiB or 3.58 PiB) with 12TB HDDs | ||
Capacity w/Expansion | Raw: 8064 TB (7334 TiB or 7.16 PiB) with 12TB HDDs Formatted ~ 6144 TB (5588 TiB or 5.46 PiB) | ||
Processor | Gateway/Ngenea (R740) | 2x Intel Xeon Gold 6230 2.1G, 20C/40T, 10.4GT/s, 27.5M Cache, Turbo, HT (125W) DDR4-2933 | 2x Intel Xeon Gold 6136 @ |
High Demand Metadata (R740) | |||
Storage Node (R740) | |||
Management Node (R440) | 2x Intel Xeon Gold 5220 2.2G, 18C/36T, 10.4GT/s, 24.75M Cache, Turbo, HT (125W) DDR4-2666 | 2x Intel Xeon Gold 5118 @ 2.30GHz, 12 cores | |
Memory | Gateway/Ngenea (R740) | 12 x 16GiB 2933 MT/s RDIMMs | 24x 16GiB 2666 MT/s RDIMMs (384 GiB) |
High Demand Metadata (R740) | |||
Storage Node (R740) | |||
Management Node (R440) | 12 X 16GB RDIMMs, 2666 MT/s (192GiB) | 12x 8GiB 2666 MT/s RDIMMs (96 GiB) | |
Operating System | CentOS 7.7 | ||
Kernel version | 3.10.0-1062.12.1.el7.x86_64 | ||
PixStor Software | 5.1.3.1 | ||
OFED Version | Mellanox OFED 5.0-2.1.8 | ||
High Performance Network Connectivity | Mellanox ConnectX-6 Dual-Port InfiniBand VPI HDR100/100 GbE, and 10 GbE | ||
High Performance Switch | 2x Mellanox QM8700 (HA – Redundant) | ||
Local Disks (OS & Analysis/monitoring) | All servers except Management node 3x 480GB SSD SAS3 (RAID1 + HS) for OS PERC H730P RAID controller 3x 480GB SSD SAS3 (RAID1 + HS) for OS & Analysis/Monitoring PERC H740P RAID controller | All servers except Management node 2x 300GB 15K SAS3 (RAID 1) for OS PERC H330 RAID controller Management Node 5x 300GB 15K SAS3 (RAID 5) for OS & PERC H740P RAID controller | |
Systems Management | iDRAC 9 Enterprise + DellEMC OpenManage |
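The capacity figures in Table 1 mix decimal (TB) and binary (TiB, PiB) units. The following is a minimal sketch of the conversion, assuming 1 TB = 10^12 bytes, 1 TiB = 2^40 bytes, and 1 PiB = 2^50 bytes. The formatted capacity is taken here as 6144 TB, which is the value that corresponds to the quoted 5588 TiB.

```python
def tb_to_binary(tb):
    """Convert decimal terabytes to a (TiB, PiB) tuple."""
    size_bytes = tb * 10**12
    return size_bytes / 2**40, size_bytes / 2**50


capacities = [
    ("Raw, without expansion", 4032),
    ("Raw, with expansion", 8064),
    ("Formatted, with expansion", 6144),
]
for label, tb in capacities:
    tib, pib = tb_to_binary(tb)
    print(f"{label}: {tb} TB = {tib:.0f} TiB = {pib:.2f} PiB")
```

Each result matches the TiB and PiB values listed in the table.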
To characterize this Ready Solution, we used the hardware specified in the last column of Table 1, including the optional High Demand Metadata Module. In order to assess the solution performance, the following benchmarks were used:
For all benchmarks listed above, the test bed had the clients described in Table 2. Since only 16 compute nodes were available for testing, when a higher number of threads was required, those threads were equally distributed across the compute nodes (that is, 32 threads = 2 threads per node, 64 threads = 4 threads per node, 128 threads = 8 threads per node, 256 threads = 16 threads per node, 512 threads = 32 threads per node, 1024 threads = 64 threads per node). The intention was to simulate a higher number of concurrent clients with the limited number of compute nodes available. Since the benchmarks support a high number of threads, a maximum value of up to 1024 was used (specified for each test), while avoiding excessive context switching and other related side effects that can affect performance results.
Table 2 Client test bed
Number of Client nodes | 16 |
Client node | C6320 |
Processors per client node | 2 x Intel(R) Xeon(R) Gold E5-2697v4 18 Cores @ 2.30GHz |
Memory per client node | 12 x 16GiB 2400 MT/s RDIMMs |
High Performance Adapter | Mellanox ConnectX-4 InfiniBand VPI |
Operating System | CentOS 7.6 |
OS Kernel | 3.10.0-957.10.1 |
PixStor Software | 5.1.3.1 |
OFED Version | Mellanox OFED 5.0-1.0.0 |
Sequential N clients to N files performance was measured with IOzone version 3.487. Tests executed varied from single thread up to 512 threads on the capacity expanded solution (4x ME4084s + 4x ME484s); results from the EDR testing are contrasted with the HDR100 update.
Caching effects were minimized by setting the file system tunable page pool to 8 GiB on the clients and 24 GiB on the servers and using files twice the total memory size of the clients or servers (whichever value is larger). It is important to note that for the file system, the page pool tunable sets the maximum amount of memory used by the file system for caching data, regardless of the amount of RAM installed and free. Also, important to note is that while in previous DellEMC HPC solutions the block size for large sequential transfers is 1 MiB, the file system was formatted with 8 MiB blocks and therefore that value is used on the benchmark for optimal performance. That may look too large and apparently waste too much space, but the file system uses subblock allocation to prevent that situation. In the current configuration, each block was subdivided in 256 subblocks of 32 KiB each.
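To see why subblock allocation limits wasted space, the arithmetic can be sketched as follows. This is a simplified model that only rounds a file up to whole allocation units; it ignores metadata and any data stored inside the inode, so it is an illustration rather than a description of the file system's actual allocator.

```python
import math

KIB = 1024
MIB = 1024 * KIB

BLOCK_SIZE = 8 * MIB        # file system block size stated in the text
SUBBLOCKS_PER_BLOCK = 256   # subdivision stated in the text
SUBBLOCK_SIZE = BLOCK_SIZE // SUBBLOCKS_PER_BLOCK


def allocated_bytes(file_size, unit):
    """Space consumed when a file is rounded up to whole allocation units."""
    return math.ceil(file_size / unit) * unit


small_file = 4 * KIB
print(SUBBLOCK_SIZE // KIB)                                # subblock size in KiB
print(allocated_bytes(small_file, SUBBLOCK_SIZE) // KIB)   # KiB consumed with subblocks
print(allocated_bytes(small_file, BLOCK_SIZE) // KIB)      # KiB consumed with full blocks only
```

In this model a 4 KiB file occupies a single 32 KiB subblock instead of a full 8 MiB block.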
The following commands were used to execute the benchmark for writes and reads, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and threadlist was the file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.
./iozone -i0 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist
./iozone -i1 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist
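The threadlist file passed with -+m lists one client entry per thread, each line holding the client name, the working directory on that client, and the path to the IOzone executable. A round-robin file like the one described above could be generated with a short script. This is a minimal sketch: the host names, working directory, and executable path are hypothetical placeholders, not the values used in the test bed.

```python
def build_threadlist(threads, nodes, workdir, iozone_path):
    """Return one '-+m' line per thread, cycling through the nodes round-robin."""
    lines = []
    for i in range(threads):
        node = nodes[i % len(nodes)]
        lines.append(f"{node} {workdir} {iozone_path}")
    return lines


# Hypothetical host names and paths, for illustration only.
nodes = [f"node{n:02d}" for n in range(1, 17)]
lines = build_threadlist(32, nodes, "/mmfs1/perftest", "/usr/local/bin/iozone")
print("\n".join(lines[:3]))
```

With 32 threads and 16 nodes, each node appears twice in the list, which matches the 2 threads per node distribution described earlier. Writing the lines to a file named threadlist would make them usable with the commands above.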
Figure 2 N to N Sequential Performance
From the results we can observe that performance rises fast with the number of clients used and then reaches a plateau that is fairly stable until the maximum number of threads that IOzone allows is reached; large file sequential performance is therefore stable except at 512 concurrent threads (about 8% lower). The maximum read performance of 23.8 GB/s at 32 threads was still limited by the bandwidth of the two IB HDR100 links used on the storage nodes starting at 8 threads. Read performance at 4 threads is considerably lower, and at high thread counts it is a bit lower compared to EDR (less than 5%), but the results are reproducible. Since the sequential N to 1 test using IOR uses the same data size and similar parameters but on a single file (adding locking overhead), the big drop in read performance at 4 threads (and to a much smaller degree at high thread counts) may be due to IOzone using calls that work less efficiently than IOR calls, but more work is needed to find the reason for the different behavior.
The highest write performance of 21 GB/s was achieved at 512 threads. It is important to remember that for the PixStor file system, the preferred mode of operation is scatter, and the solution was formatted to use that mode. In this mode, blocks are allocated from the very beginning of operation in a pseudo-random fashion, spreading data across the whole surface of each HDD. While the obvious disadvantage is a smaller initial maximum performance, that performance remains fairly constant regardless of how much space is used on the file system. This is in contrast to other parallel file systems that initially use the outer tracks, which can hold more data (sectors) per disk revolution and therefore have the highest possible performance the HDDs can provide; as the system uses more space, inner tracks with less data per revolution are used, with the consequent reduction of performance.
Sequential N clients to a single shared file performance was measured with IOR version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from one thread up to 512 threads (since there were not enough cores for 1024 threads), and results are contrasted to the solution without the capacity expansion.
Caching effects were minimized by setting the file system page pool tunable to 8 GiB on the clients and 24 GiB on the servers and using a total data size bigger than twice the total memory size of clients or servers (whichever is larger). This benchmark used 8 MiB blocks for optimal performance. The previous performance test section has a more complete explanation of these matters.
The following commands were used to execute the benchmark for writes and reads, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -w -s 1 -t 8m -b 128G
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -r -s 1 -t 8m -b 128G
Figure 3 N to 1 Sequential Performance
Performance rises fast with the number of clients used and then reaches a plateau that is fairly stable for reads and writes from 8 threads all the way to the maximum number of threads used on this test. The maximum read performance was 24.2 GB/s at 32 threads and the bottleneck was the InfiniBand HDR100 interface, apparently at higher than line speed. Similarly, notice that the maximum write performance of 19.9 GB/s was reached at 16 threads. An important data point is at 4 threads: even though this test uses the same data size and parameters as IOzone, with the extra burden of locking, it shows no performance drop like the one observed with IOzone.
Random N clients to N files performance was measured with IOzone 3.487. Tests executed varied from 16 threads up to 512 threads, since there were not enough client cores for 1024 threads. Lower thread counts were not tested at this time because they take a very long time to execute, IOzone does not provide results until the test has completed in its entirety, and the most important information tends to be the maximum IOPS that the solution can provide. Each thread used a different file, and the threads were assigned to the client nodes in a round-robin fashion. This benchmark used 4 KiB blocks to emulate small block traffic.
Caching effects were minimized by setting the file system page pool tunable to 8 GiB on the clients and 24 GiB on the servers and using a total data size bigger than twice the total page pool size of clients or servers (whichever is larger). It is important to note that the page pool tunable sets the maximum amount of memory used by the file system for caching data, regardless of the amount of RAM installed and free.
Figure 4 N to N Random Performance
From the results we can observe that write performance starts at a high value of 23.4K IOps and remains under 25K steadily up to 256 threads where it peaks at 27.4K IOps. Read performance on the other hand starts at 1.3K IOps and increases performance almost linearly with the number of threads used (keep in mind that number of threads is doubled for each data point) and reaches the maximum performance of 33.8K IOPS at 512 threads. Using more threads would require more than the 16 compute nodes or more cores per node to avoid loss of performance due to process context switching, data locality and similar effects. ME4 arrays require a higher IO pressure (queue or IO depth) to reach their maximum random IOPS showing in this test a lower apparent performance, where the arrays could in fact deliver more IOPS when using tests like FIO that can control the IO depth per process.
Metadata performance was measured with MDtest version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from single thread up to 512 threads. The benchmark was used for files only (no directory metadata), getting the number of creates, stats and removes that the solution can handle, and results were contrasted with previous EDR results.
To properly evaluate the solution in comparison to other DellEMC HPC storage solutions and the previous blog results, the optional High Demand Metadata Module was used, but with a single ME4024 array, even though the large configuration tested in this work was designated to have two ME4024s.
This High Demand Metadata Module can support up to four ME4024 arrays, and it is suggested to increase the number of ME4024 arrays to 4, before adding another metadata module. Additional ME4024 arrays are expected to increase the Metadata performance with each additional array, except maybe for Stat operations, since the IOPS numbers are very high, at some point the CPUs will become a bottleneck and performance will not continue to increase linearly.
The following command was used to execute the benchmark, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes. Similar to the Random IO benchmark, the maximum number of threads was limited to 512, since there are not enough cores for 1024 threads and context switching would affect the results, reporting a number lower than the real performance of the solution.
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F
Since performance results can be affected by the total number of IOPs, the number of files per directory, and the number of threads, it was decided to fix the total number of files at 2 Mi files (2^21 = 2,097,152) and the number of files per directory at 1024, and to vary the number of directories as the number of threads changed, as shown in Table 3.
Table 3 MDtest distribution of files on directories
Number of Threads | Number of directories per thread | Total number of files |
1 | 2048 | 2,097,152 |
2 | 1024 | 2,097,152 |
4 | 512 | 2,097,152 |
8 | 256 | 2,097,152 |
16 | 128 | 2,097,152 |
32 | 64 | 2,097,152 |
64 | 32 | 2,097,152 |
128 | 16 | 2,097,152 |
256 | 8 | 2,097,152 |
512 | 4 | 2,097,152 |
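The directory counts in Table 3 follow directly from holding the total number of files and the files per directory constant. The following is a minimal sketch that reproduces the table, assuming the Directories variable in the mdtest command corresponds to the per-thread directory count shown in the second column.

```python
TOTAL_FILES = 2**21    # 2,097,152 files in total, as stated in the text
FILES_PER_DIR = 1024   # matches "-I 1024" in the mdtest command


def directories_per_thread(threads):
    """Directories each thread needs so that the total file count stays fixed."""
    dirs, remainder = divmod(TOTAL_FILES, FILES_PER_DIR * threads)
    if remainder:
        raise ValueError("thread count must divide the file total evenly")
    return dirs


for threads in [2**k for k in range(10)]:  # 1, 2, 4, ..., 512
    print(threads, directories_per_thread(threads))
```

Doubling the thread count halves the directories per thread, so every row of the table creates the same 2,097,152 files.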
Figure 5 Metadata Performance - Empty Files
First, note that the scale chosen was logarithmic with base 10, to allow comparing operations that differ by several orders of magnitude; otherwise some of the operations would look like a flat line close to 0 on a normal graph. A log graph with base 2 could be more appropriate, since the number of threads is increased in powers of 2, but the graph would look very similar and people tend to handle and remember numbers based on powers of 10 better.
The system gets very good results, with Stat operations reaching their peak value of 6M op/s at 256 threads. Removal operations attained a maximum of 189.7K op/s at 32 threads, and Create operations achieved their peak of 266.8K op/s at 512 threads. Stat operations have more variability, but once they reach their peak value, performance does not drop below 3M op/s. Create and Removal operations are more stable once they reach a plateau and remain above 160K op/s for Removal and 128K op/s for Create.
This test is almost identical to the previous one, except that instead of empty files, small files of 4KiB were used.
The following command was used to execute the benchmark, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.
mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F -w 4K -e 4K
Figure 6 Metadata Performance - Small files (4K)
The system gets very good results for Stat operations, which reach a peak value of almost 4.9M op/s at 512 threads. Removal operations attained a maximum of 442.7K op/s at 128 threads, and Create operations reached their peak of 75K op/s at 512 threads, apparently without reaching a plateau yet. Stat and Removal operations have more variability, but once they reach their peak value, performance does not drop below 3.5M op/s for Stat and 315K op/s for Removal. Create and Read have less variability and keep increasing as the number of threads grows.
Since these numbers are for a metadata module with a single ME4024, performance will increase for each additional ME4024 array; however, we cannot simply assume a linear increase for each additional ME4024. Unless the whole file fits inside the inode for that file, data targets on the ME4084s will be used to store part of the 4K files, limiting the performance to some degree. Since the inode size is 4 KiB and it still needs to store metadata, only files of around 3 KiB will fit inside, and any file bigger than that will use data targets.
The solution has similar performance to that observed with the InfiniBand EDR technology. An overview of the performance for HDR100 can be seen in Table 4; it is expected to be stable from an empty file system until it is almost full because of the use of scatter allocation across the whole surface area of all HDDs. Furthermore, the solution scales in capacity and performance linearly as more storage node modules are added, and a similar performance increase can be expected from the optional high demand metadata module. This solution provides HPC customers with a very reliable parallel file system used by many Top 500 HPC clusters. In addition, it provides exceptional search capabilities and advanced monitoring and management. With the addition of optional gateway nodes, it allows file sharing via ubiquitous standard protocols like NFS and SMB to as many clients as needed. Finally, Ngenea nodes allow efficient access to other cost-effective storage tiers such as ECS, Isilon enterprise NAS, and Cloud solutions using different protocols.
Table 4 Peak & Sustained Performance
| Peak Performance | Sustained Performance | ||
Write | Read | Write | Read | |
Large Sequential N clients to N files | 21.0 GB/s | 23.8 GB/s | 20.5 GB/s | 23.0 GB/s |
Large Sequential N clients to single shared file | 19.9 GB/s | 24.2 GB/s | 19.1 GB/s | 23.4 GB/s |
Random Small blocks N clients to N files | 33.8KIOps | 27.4KIOps | 33.80KIOps | 23.0KIOps |
Metadata Create empty files | 266.8K IOps | 128K IOps | ||
Metadata Stat empty files | 6M IOps | 3M IOps | ||
Metadata Remove empty files | 189.7K IOps | 160K IOps | ||
Metadata Create 4KiB files | 75K IOps | 75K IOps | ||
Metadata Stat 4KiB files | 4.9M IOps | 3.5M IOps | ||
Metadata Remove 4KiB files | 442.7K IOps | 315K IOps |
Performance for the gateway nodes was measured and will be reported in a new blog. Finally, high performance NVMe nodes are being tested and results will also be released in a different blog.
Wed, 19 Jul 2023 19:59:24 -0000
True to the tradition of keeping up with the technology trends, the Dell EMC Ready Solutions for BeeGFS High Performance Storage, that was originally released during November 2019, has now been refreshed with the latest software and hardware. The base architecture of the solution remains the same. The following table lists the differences between the initially released InfiniBand EDR based solution and the current InfiniBand HDR100 based solution in terms of the software and hardware used.
Table 1. Comparison of Hardware and Software of EDR and HDR based BeeGFS High Performance Solution
Software | Initial Release (Nov. 2019) | Current Refresh (Nov. 2020) |
Operating System | CentOS 7.6 | CentOS 8.2 |
Kernel version | 3.10.0-957.27.2.el7.x86_64 | 4.18.0-193.14.2.el8_2.x86_64 |
BeeGFS File system version | 7.1.3 | 7.2 |
Mellanox OFED version | 4.5-1.0.1.0 | 5.0-2.1.8.0 |
Hardware | Initial Release | Current Refresh |
NVMe Drives | Intel P4600 1.6 TB Mixed Use | Intel P4610 3.2 TB Mixed Use |
InfiniBand Adapters | ConnectX-5 Single Port EDR | ConnectX-6 Single Port HDR100 |
InfiniBand Switch | SB7890 InfiniBand EDR 100 Gb/s Switch -1U (36x EDR 100 Gb/s ports) | QM8790 Quantum HDR Edge Switch – 1U (80x HDR100 100 Gb/s ports using splitter cables) |
This blog presents the architecture, updated technical specifications and the performance characterization of the upgraded high-performance solution. It also includes a comparison of the performance with respect to the previous EDR based solution.
The high-level architecture of the solution remains the same as the initial release. The hardware components of the solution consist of 1x PowerEdge R640 as the management server and 6x PowerEdge R740xd servers as metadata/storage servers to host the metadata and storage targets. Each PowerEdge R740xd server is equipped with 24x Intel P4610 3.2 TB Mixed Use Express Flash drives and 2x Mellanox ConnectX-6 HDR100 adapters. Figure 1 shows the reference architecture of the solution.
Figure 1. Dell EMC Ready Solutions for HPC BeeGFS Storage – Reference Architecture
There are two networks: the InfiniBand network and the private Ethernet network. The management server is only connected via Ethernet to the metadata and storage servers. Each metadata and storage server has 2x links to the InfiniBand network and is connected to the private network via Ethernet. The clients have one InfiniBand link and are also connected to the private Ethernet network. For more details on the solution configuration, please refer to the blog and whitepaper on BeeGFS High Performance Solution published at hpcatdell.com.
Tables 2 and 3 describe the hardware specifications of the management server and the metadata/storage servers, respectively. Table 4 describes the versions of the software used for the solution.
Table 2. PowerEdge R640 Configuration (Management Server)
Component | Description |
Processor | 2 x Intel Xeon Gold 5218 2.3GHz, 16 cores |
Memory | 12 x 8GB DDR4 2666MT/s DIMMs - 96GB |
Local Disks | 6 x 300GB 15K RPM SAS 2.5in HDDs |
RAID Controller | PERC H740P Integrated RAID Controller |
Out of Band Management | iDRAC9 Enterprise with Lifecycle Controller |
Table 3. PowerEdge R740xd Configuration (Metadata and Storage Servers)
Component | Description |
Processor | 2x Intel Xeon Platinum 8268 CPU @ 2.90GHz, 24 cores |
Memory | 12 x 32GB DDR4 2933MT/s DIMMs - 384GB |
BOSS Card | 2x 240GB M.2 SATA SSDs in RAID 1 for OS |
Local Drives | 24x Dell Express Flash NVMe P4610 3.2 TB 2.5” U.2 |
InfiniBand Adapter | 2x Mellanox ConnectX-6 single port HDR100 Adapter |
InfiniBand Adapter Firmware | 20.26.4300 |
Out of Band Management | iDRAC9 Enterprise with Lifecycle Controller |
Table 4. Software Configuration (Metadata and Storage Servers)
Component | Description |
Operating System | CentOS Linux release 8.2.2004 (Core) |
Kernel version | 4.18.0-193.14.2.el8_2.x86_64 |
Mellanox OFED | 5.0-2.1.8.0 |
NVMe SSDs | VDV1DP23 |
OpenMPI | 4.0.3rc4 |
Intel Data Center Tool | v 3.0.26 |
BeeGFS | 7.2 |
Grafana | 7.1.5-1 |
InfluxDB | 1.8.2-1 |
IOzone | 3.490 |
MDtest | 3.3.0+dev |
The system performance was evaluated using the following benchmarks: IOzone for sequential and random N-N reads and writes, and MDtest for metadata operations.
Performance tests were run on a testbed with the clients described in Table 5. For test cases where the number of IO threads was greater than the number of physical IO clients, threads were distributed equally across the clients (that is, 32 threads = 2 threads per client, ..., 1024 threads = 64 threads per client).
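The equal distribution of threads over the 16 clients can be sketched as follows. This is only an illustration of the arithmetic; the function name and the handling of thread counts below 16 are assumptions, not part of the benchmark tooling.

```python
def threads_per_client(total_threads: int, num_clients: int = 16) -> int:
    """Return how many IO threads each participating client runs."""
    if total_threads <= num_clients:
        # Assumption: with fewer threads than clients, each thread
        # runs alone on its own client.
        return 1
    if total_threads % num_clients != 0:
        raise ValueError("thread count must divide evenly across clients")
    return total_threads // num_clients


for total in (32, 64, 128, 256, 512, 1024):
    print(f"{total:>4} threads -> {threads_per_client(total)} per client")
```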
Table 5. Client Configuration
Component | Description |
Server model | 8x PowerEdge R840; 8x PowerEdge C6420 |
Processor | 4x Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, 24 cores (R840); 2x Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, 20 cores (C6420) |
Memory | 24 x 16GB DDR4 2933MT/s DIMMs - 384GB (R840); 12 x 16GB DDR4 2933MT/s DIMMs - 192GB (C6420) |
Operating System | Red Hat Enterprise Linux release 8.2 (Ootpa) |
Kernel version | 4.18.0-193.el8.x86_64 |
InfiniBand Adapter | 1x ConnectX-6 single port HDR100 adapter |
OFED version | 5.0-2.1.8.0 |
The IOzone benchmark was used to evaluate sequential read and write performance. These tests were conducted using multiple thread counts, starting at 1 thread and going up to 1024 threads. At each thread count, an equal number of files was generated, since this test works on one file per thread (the N-N case). The round robin algorithm was used to choose targets for file creation in a deterministic fashion.
For all the tests, aggregate file size was 8 TB and this was equally divided among the number of threads for any given test. The aggregate file size chosen was large enough to minimize the effects of caching from the servers as well as from BeeGFS clients.
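As a rough sketch of how the per-thread file size follows from the aggregate size, the snippet below divides 8 TB by the thread count and formats the result for IOzone's `-s` option. It assumes binary units (8 TiB) and power-of-two thread counts; the helper name is hypothetical.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3
MIB = 1024 ** 2


def per_thread_size(total_threads: int, aggregate_bytes: int = 8 * TIB) -> str:
    """Return the per-thread file size as an IOzone -s argument (m or g suffix)."""
    size = aggregate_bytes // total_threads
    if size % GIB == 0:
        return f"{size // GIB}g"
    return f"{size // MIB}m"


for threads in (1, 16, 128, 1024):
    print(threads, per_thread_size(threads))
```

With these assumptions, 1024 threads each work on an 8 GiB file, and a single thread works on the full 8 TiB.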
IOzone was run in a combined mode of write then read (-i 0, -i 1) to allow it to coordinate the boundaries between the operations. For this test, we used a 1MiB record size for every run. The commands used for Sequential N-N tests are given below:
Sequential Writes and Reads: iozone -i 0 -i 1 -c -e -w -r 1m -I -s $Size -t $Thread -+n -+m /path/to/threadlist
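The file passed to `-+m` lists one line per thread with three space-delimited fields: client name, working directory on the client, and path to the IOzone executable on the client. A minimal sketch of generating such a file with threads spread round-robin over the clients is shown below. The host names, the IOzone path, and the function name are hypothetical placeholders.

```python
def build_threadlist(clients, total_threads, workdir, iozone_path):
    """Return the text of an IOzone -+m client list, one line per thread."""
    lines = []
    for i in range(total_threads):
        # Round-robin over clients so every client gets the same thread count
        # when total_threads is a multiple of len(clients).
        host = clients[i % len(clients)]
        lines.append(f"{host} {workdir} {iozone_path}")
    return "\n".join(lines) + "\n"


# Hypothetical host names and paths, for illustration only.
clients = [f"client{n:02d}" for n in range(1, 17)]
threadlist_text = build_threadlist(
    clients, 32, "/mnt/beegfs/benchmark", "/usr/bin/iozone"
)
print(threadlist_text, end="")
```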
OS caches were also dropped or cleaned on the client nodes between iterations as well as between write and read tests by running the command:
# sync && echo 3 > /proc/sys/vm/drop_caches
The default stripe count for BeeGFS is 4. However, the chunk size and the number of targets per file can be configured on a per-directory basis. For all these tests, the BeeGFS chunk size was set to 2MB and the stripe count to 3, since there are three targets per NUMA zone, as shown below:
# beegfs-ctl --getentryinfo --mount=/mnt/beegfs /mnt/beegfs/benchmark --verbose
Entry type: directory
EntryID: 0-5F6417B3-1
ParentID: root
Metadata node: storage1-numa0-2 [ID: 2]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 2M
+ Number of storage targets: desired: 3
+ Storage Pool: 1 (Default)
Inode hash path: 33/57/0-5F6417B3-1
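To make the stripe pattern above concrete, here is a simplified model of RAID0 striping with a 2 MiB chunk size over 3 targets. It is a generic illustration of how file offsets map to targets, not BeeGFS internals. The target index is the position within the file's stripe set, not a BeeGFS target ID.

```python
CHUNK_SIZE = 2 * 1024 * 1024  # 2 MiB, matching "Chunksize: 2M"
NUM_TARGETS = 3               # matching "Number of storage targets: desired: 3"


def locate(offset: int):
    """Map a file byte offset to (target index, byte offset on that target)."""
    chunk_index = offset // CHUNK_SIZE
    target = chunk_index % NUM_TARGETS       # chunks rotate over the targets
    stripe_row = chunk_index // NUM_TARGETS  # full rounds completed so far
    target_offset = stripe_row * CHUNK_SIZE + offset % CHUNK_SIZE
    return target, target_offset


MIB = 1024 * 1024
for off in (0, 2 * MIB, 5 * MIB, 6 * MIB):
    print(off // MIB, "MiB ->", locate(off))
```

In this model a 1 MiB IOzone record always falls within a single chunk, and three consecutive chunks land on three different targets.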
The testing methodology and the tuning parameters used were similar to those previously described in the EDR based solution. For additional details in this regard, please refer to the whitepaper on the BeeGFS High Performance Solution.
Note:
The number of clients used for the performance characterization of the EDR based solution is 32, whereas the number of clients used for the performance characterization of the HDR100 based solution is only 16. In the performance charts given below, this is indicated by 16c, which denotes 16 clients, and 32c, which denotes 32 clients. The dotted lines show the performance of the EDR based solution and the solid lines show the performance of the HDR100 based solution.
Figure 2. Sequential IOzone 8 TB aggregate file size
From Figure 2, we observe that the HDR100 peak read performance is ~131 GB/s and peak write is ~123 GB/s at 1024 threads. Each drive can provide 3.2 GB/s peak read performance and 3.0 GB/s peak write performance, which allows a theoretical peak of 460.8 GB/s for reads and 432 GB/s for writes for the solution. However, in this solution, the network is the limiting factor. In the setup, we have a total of 11 InfiniBand HDR100 links for the storage servers. Each link can provide a theoretical peak performance of 12.4 GB/s, which allows an aggregate theoretical peak performance of 136.4 GB/s. The achieved peak read and write performance are 96% and 90%, respectively, of the theoretical peak performance.
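The figures quoted above can be reproduced with a few lines of arithmetic. The sketch below uses only numbers stated in this blog (6 servers with 24 drives each, per-drive and per-link peaks, and the measured peaks); the variable names are illustrative.

```python
SERVERS = 6
DRIVES_PER_SERVER = 24
DRIVE_READ_GBPS = 3.2    # per-drive peak sequential read
DRIVE_WRITE_GBPS = 3.0   # per-drive peak sequential write
STORAGE_LINKS = 11       # HDR100 links serving storage
LINK_GBPS = 12.4         # theoretical peak per HDR100 link

drives = SERVERS * DRIVES_PER_SERVER
drive_read_peak = drives * DRIVE_READ_GBPS
drive_write_peak = drives * DRIVE_WRITE_GBPS
network_peak = STORAGE_LINKS * LINK_GBPS

# Measured peaks from Figure 2, expressed against the network ceiling.
read_efficiency = 131 / network_peak
write_efficiency = 123 / network_peak

print(f"drive peak: {drive_read_peak:.1f} GB/s read, {drive_write_peak:.1f} GB/s write")
print(f"network peak: {network_peak:.1f} GB/s")
print(f"efficiency: {read_efficiency:.0%} read, {write_efficiency:.0%} write")
```

Because the network ceiling is well below the aggregate drive peak, the network rather than the NVMe devices bounds sequential throughput.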
We observe that the peak read performance for the HDR100 based solution is slightly lower than that observed with the EDR based solution. This can be attributed to the fact that the benchmark tests were carried out using 16 clients for the HDR100 based setup while the EDR based solution used 32 clients. The improved write performance with HDR100 is due to the fact that the P4600 NVMe SSD used in the EDR based solution could provide only 1.3 GB/s for sequential writes whereas the P4610 NVMe SSD provides 3.0 GB/s peak write performance.
We also observe that the read performance is lower than writes for thread counts from 16 to 128. This is because a PCIe read operation is a Non-Posted Operation, requiring both a request and a completion, whereas a PCIe write operation is a Posted Operation that consists of a request only. A PCIe write operation is a fire and forget operation, wherein once the Transaction Layer Packet is handed over to the Data Link Layer, the operation completes.
Read throughput is typically lower than write throughput because reads require two transactions instead of a single write for the same amount of data. PCI Express uses a split transaction model for reads: the requester sends a memory read request, and the completer later returns a completion with the data.
The read throughput depends on the delay between the time the read request is issued and the time the completer takes to return the data. However, when the application issues enough read requests to offset this delay, throughput is maximized. A lower throughput is measured when the requester waits for completion before issuing subsequent requests. A higher throughput is registered when multiple requests are issued to amortize the delay after the first data returns. This explains why the read performance is lower than the write performance from 16 threads to 128 threads, and why read throughput increases at the higher thread counts of 256, 512 and 1024.
More details regarding PCI Express Direct Memory Access are available at https://www.intel.com/content/www/us/en/programmable/documentation/nik1412547570040.html#nik1412547565760
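The effect of outstanding read requests can be illustrated with a simple model: achievable throughput is the amount of data in flight divided by the round-trip delay, capped by the link bandwidth. The request size, delay, and link bandwidth used below are illustrative assumptions, not measurements from this solution.

```python
def read_throughput(outstanding, request_bytes, latency_s, link_bytes_per_s):
    """Estimate read throughput (bytes/s) for a number of in-flight requests."""
    in_flight_rate = outstanding * request_bytes / latency_s
    return min(in_flight_rate, link_bytes_per_s)


# Assumed values for illustration only.
REQUEST_BYTES = 512
LATENCY_S = 1e-6
LINK_BYTES_PER_S = 16e9

for n in (1, 8, 32, 64):
    gbps = read_throughput(n, REQUEST_BYTES, LATENCY_S, LINK_BYTES_PER_S) / 1e9
    print(f"{n:>2} outstanding requests -> {gbps:.2f} GB/s")
```

In this model, throughput grows with the number of outstanding requests until the link saturates, which mirrors the rise in read performance at higher thread counts.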
IOzone was used in the random mode to evaluate random IO performance. Tests were conducted with thread counts from 8 to 1024. The Direct IO option (-I) was used to run IOzone so that all operations bypass the buffer cache and go directly to the disk. A BeeGFS stripe count of 3 and a chunk size of 2MB were used. A 4KiB request size was used in IOzone, and performance was measured in I/O operations per second (IOPS). An aggregate file size of 8 TB was selected to minimize the effects of caching. The aggregate file size was equally divided among the number of threads within any given test. The OS caches were dropped between the runs on the BeeGFS servers as well as on the BeeGFS clients.
The command used for random reads and writes is given below:
iozone -i 2 -w -c -O -I -r 4K -s $Size -t $Thread -+n -+m /path/to/threadlist
Figure 3. N-N Random Performance
Figure 3 shows that the random writes peak at ~4.3 million IOPS at 1024 threads and the random reads peak at ~4.2 million IOPS at 1024 threads. Both write and read performance are higher when there is a greater number of IO requests. This is because the NVMe standard supports up to 64K I/O queues and up to 64K commands per queue. This large pool of NVMe queues provides higher levels of I/O parallelism, and hence we observe IOPS exceeding 3 million. The following table provides a comparison of the random IO performance of the P4610 and P4600 NVMe SSDs to better understand the observed results.
Table 6. Performance Specification of Intel NVMe SSDs
Product | P4610 3.2 TB NVMe SSD | P4600 1.6 TB NVMe SSD |
Random Read (100% Span) | 638000 IOPS | 559550 IOPS |
Random Write (100% Span) | 222000 IOPS | 176500 IOPS |
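For a sense of scale, the sketch below converts the measured 4 KiB IOPS into bandwidth and computes the aggregate raw-device IOPS implied by the P4610 figures in Table 6. It assumes all 144 drives (6 servers x 24 drives) serve storage. The raw aggregate is only a device-level upper bound, not a prediction of file system performance.

```python
DRIVES = 6 * 24                     # assumption: every drive serves storage
P4610_RANDOM_READ_IOPS = 638_000    # from Table 6
P4610_RANDOM_WRITE_IOPS = 222_000   # from Table 6
REQUEST_BYTES = 4 * 1024            # 4 KiB IOzone request size


def iops_to_gbps(iops, request_bytes=REQUEST_BYTES):
    """Convert an IOPS figure to GB/s (decimal) for a fixed request size."""
    return iops * request_bytes / 1e9


raw_read_iops = DRIVES * P4610_RANDOM_READ_IOPS    # 91,872,000
raw_write_iops = DRIVES * P4610_RANDOM_WRITE_IOPS  # 31,968,000

print(f"measured write peak: {iops_to_gbps(4_300_000):.1f} GB/s at 4 KiB")
print(f"measured read peak:  {iops_to_gbps(4_200_000):.1f} GB/s at 4 KiB")
print(f"raw device aggregate: {raw_read_iops:,} read IOPS, {raw_write_iops:,} write IOPS")
```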
The metadata performance was measured with MDtest, using OpenMPI to run the benchmark over the 16 clients. The benchmark was used to measure the file create, stat, and removal performance of the solution. Since performance results can be affected by the total number of IOPS, the number of files per directory, and the number of threads, a consistent number of files across tests was chosen to allow comparison. The total number of files was chosen to be ~2M in powers of two (2^21 = 2097152). The number of files per directory was fixed at 1024, and the number of directories varied as the number of threads changed. The test methodology and the directories created are similar to those described in the previous blog.
The following command was used to execute the benchmark:
mpirun -machinefile $hostlist --map-by node -np $threads ~/bin/mdtest -i 3 -b $Directories -z 1 -L -I 1024 -y -u -t -F
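The relationship between thread count and directory count can be sketched as below. It assumes that `$Directories` is the number of directories per thread, so that threads x directories x 1024 files stays at 2^21. The function name is hypothetical.

```python
TOTAL_FILES = 2 ** 21   # 2,097,152 files in every test
FILES_PER_DIR = 1024    # matches "-I 1024"


def directories_per_thread(threads: int) -> int:
    """Return the per-thread directory count that keeps the total file count fixed."""
    total_dirs = TOTAL_FILES // FILES_PER_DIR  # 2048 directories overall
    if total_dirs % threads != 0:
        raise ValueError("thread count must divide the total directory count")
    return total_dirs // threads


for threads in (1, 16, 512, 1024):
    print(f"{threads:>4} threads -> {directories_per_thread(threads)} directories each")
```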
Figure 4. Metadata Performance – Empty Files
From Figure 4, we observe that the create, removal and read performance are comparable to those obtained for the EDR based solution, whereas the stat performance is lower by ~100K IOPS. This may be because the HDR100 based solution uses only 16 clients for performance characterization, whereas the EDR based solution used 32 clients. The file create operations reach their peak value at 512 threads at ~87K op/s. The removal and stat operations attained their maximum values at 32 threads with ~98K op/s and 392 op/s, respectively.
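This blog presents the performance characteristics of the Dell EMC High Performance BeeGFS Storage Solution with the latest software and hardware. At the software level, the high-performance solution has now been updated with CentOS 8.2, Mellanox OFED 5.0-2.1.8.0, and BeeGFS 7.2.
At the hardware level, the solution uses Mellanox ConnectX-6 single port HDR100 adapters and Intel P4610 3.2 TB NVMe drives.
The performance analysis allows us to conclude that sequential throughput is limited by the network, reaching ~96% (reads) and ~90% (writes) of the theoretical network peak; that random IO peaks at ~4.3 million write IOPS and ~4.2 million read IOPS at 1024 threads; and that metadata create, removal and read performance is comparable to the EDR based solution, while stat performance is lower by ~100K IOPS.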
Dell EMC Ready Solutions for HPC BeeGFS Storage - Technical White Paper
Features of Dell EMC Ready Solutions for HPC BeeGFS Storage
Scalability of Dell EMC Ready Solutions for HPC BeeGFS Storage
Dell EMC Ready Solutions for HPC BeeGFS High Performance Storage