Home Workload Solutions High Performance Computing Blogs

Performance Evaluation of HPC Applications on a Dell PowerEdge R650-based VMware Virtualized Cluster

Wed, 08 Feb 2023 14:45:39 -0000

Read Time: 0 minutes

Veena K

Neeraj Kumar

Miraj Naik

Rizwan Ali

Overview

High Performance Computing (HPC) solves complex computational problems by doing parallel computations on multiple computers and performing research activities through computer modeling and simulations. Traditionally, HPC is deployed on bare-metal hardware, but due to advancements in virtualization technologies, it is now possible to run HPC workloads in virtualized environments. Virtualization in HPC provides more flexibility, improves resource utilization, and enables support for multiple tenants on the same infrastructure.

However, virtualization is an additional layer in the software stack and is often construed as impacting performance. This blog explains a performance study conducted by the Dell Technologies HPC and AI Innovation Lab in partnership with VMware. The study compares bare-metal and virtualized environments on multiple HPC workloads with Intel® Xeon® Scalable third-generation processor-based systems. Both the bare-metal and virtualized environments were deployed on the Dell HPC on Demand solution.

Figure 1: Cluster Architecture

To evaluate the performance of HPC applications and workloads, we built a 32-node HPC cluster using Dell PowerEdge R650 as compute nodes. Dell Power Edge R650 is a 1U dual socket server with Intel® Xeon® Scalable third-generation processors. The cluster was configured to use both bare-metal and virtual compute nodes (running VMware vSphere 7). Both bare-metal and virtualized nodes were attached to the same head node.

Figure 1 shows a representative network topology of this cluster. The cluster was connected to two separate physical networks. The compute nodes were spread across two sets of racks, and the cluster consisted of the following two networks:

HPC Network: Dell PowerSwitch Z9332 switch connecting NVIDIA® Connect®-X6 100 GbE adapters to provide a low latency high bandwidth 100 GbE RDMA-based HPC network for the MPI-based HPC workload
Services Network: A separate pair of Dell PowerSwitch S5248F-ON 25 GbE based top of rack (ToR) switches for hypervisor

The Virtual Machine (VM) configuration details for optimal performance settings were captured in an earlier blog. In addition to the settings noted in the previous blog, some additional BIOS tuning options such as Snoop Hold Off, SubNumaCluster (SNC) and LLC Prefetch settings were also tested. Snoop Hold Off (set to 2 K cycles), and SNC, helped performance across most of the tested applications and microbenchmarks for both the bare-metal and virtual nodes. Enabling SNC in the server BIOS and not configuring SNC correctly in the VM might result in performance degradation.

Bare-metal and virtualized HPC system configuration

Table 1 shows the system environment details used for the study.

Table 1: System configuration details for the bare-metal and virtual clusters

Machine function	Component
Platform	PowerEdge R650 server
Processor	Two Intel® Xeon® third Generation 6348 (28 cores @ 2.6 GHz)
Number of cores	Bare-Metal: 56 cores Virtual: 52 vCPUs (four cores reserved for ESXi)
Memory	Sixteen 32 GB DDR4 DIMMS @3200 MT/s Bare-Metal: All 512 GB used Virtual: 440 GB reserved for the VM
HPC Network NIC	100 GbE NVIDIA Mellanox Connect-X6
Service Network NIC	10/25 GbE NVIDIA Mellanox Connect-X5
HPC Network Switch	Dell PowerSwitch Z9332 with OS 10.5.2.3
Service Network Switch	Dell PowerSwitch S5248F-ON
Operating system	Rocky Linux release 8.5 (Green Obsidian)
Kernel	4.18.0-348.12.2.el8_5.x86_64
Software – MPI	IntelMPI 2021.5.0
Software – Compilers	Intel OneAPI 2022.1.1
Software – OFED	OFED 5.4.3 (Mellanox FW 22.32.20.04)
BIOS version	1.5.5 (for both bare-metal and virtual nodes)

Application and benchmark details

The following chart outlines the set of HPC applications used for this study from different domains like Computational Fluid Dynamics (CFD), Weather, and Life Sciences. Different benchmark datasets were used for each of the applications as detailed in Table 2.

Table 2: Application and benchmark dataset details

Application	Vertical Domain	Benchmark Dataset
WRF (v3.9.1.1)	Weather and Environment	Conus 2.5KM, Maria 3KM
OpenFOAM (version 9)	Manufacturing - Computational Fluid Dynamics (CFD)	Motorbike 20M, 34M and 52M cells mesh
Gromacs (version 2022)	Life Sciences – Molecular Dynamics	HECBioSim Benchmarks – 3M Atoms Lignocellulose BenchPEP
LAMMPS (4 May 2022)	Molecular Dynamics	EAM metallic Solid Benchmark (1M, 3M and 8M Atoms) HECBIOSIM – 3M Atoms

Performance results

All the application results shown here were run on both bare-metal and virtual environments using the same binary compiled with Intel Compiler and run with Intel MPI. Multiple runs were done to ensure consistency in the performance. Basic synthetic benchmarks like High Performance Linpack (HPL), Stream, and OSU MPI Benchmarks were run to ensure that the cluster was operating efficiently before running the HPC application benchmarks. For the study, all the benchmarks were run in a consistent, optimized, and stable environment across both the bare-metal and virtual compute nodes.

Intel® Xeon® Scalable third-generation processors (Ice Lake 6348) have 56 cores. Four cores were reserved for the virtualization hypervisor (ESXi) providing the remaining 52 cores to run benchmarks. All the results shown here consist of 56 core runs on bare-metal vs 52 core runs on virtual nodes.

To ensure better scaling and performance, multiple combinations of threads and MPI ranks were tried based on applications. The best results are used to show the relative speedup between both the bare-metal and virtual systems.

Figure 2: Performance comparison between bare-metal and virtual nodes for WRF

Figure 3: Performance comparison between bare-metal and virtual nodes for OpenFOAM

Figure 4: Performance comparison between bare-metal and virtual nodes for GROMACS

Figure 5: Performance comparison between bare-metal and virtual nodes for LAMMPS

The above results indicate that all the MPI applications running in a virtualized environment are close in performance to the bare-metal environment if proper tuning and optimizations are used. The performance delta, running from a single node up to 32 nodes, is within the 10% range for all the applications. This delta shows no major impact on scaling.

Concurrency test

In a virtualized multitenant HPC environment, the expectation is for multiple tenants to be running multiple concurrent instances of the same or different applications. To simulate this configuration, a concurrency test was conducted by making multiple copies of the same workload and running them in parallel. This test checks whether any performance degradation appears in comparison with the baseline run result. To do some meaningful concurrency tests, we expanded the virtual cluster to 48 nodes by converting 16 nodes of bare-metal to virtual. For the concurrency tests, the baseline is made with an 8-node run while no other workload was running across the 48-node virtual cluster. After that, six copies of the same workload were allowed to run simultaneously across the virtual cluster. Then the results are compared and depicted for all the applications.

The concurrency was tested in two ways. In the first test, all eight nodes running a single copy were placed in the same rack. In the second test, the nodes running a single job were spread across two racks to see if any performance difference was observed due to additional communication over the network.

Figures 6 to 13 capture the results of the concurrency test. As seen from the results there was no degradation observed in the performance.

Figure 6: Concurrency Test 1 for WRF

Figure 7: Concurrency Test 2 for WRF

Figure 8: Concurrency Test 1 for Open FOAM

Figure 9: Concurrency Test 2 for Open FOAM

Figure 10: Concurrency Test 1 for GROMACS

Figure 11: Concurrency Test 2 for GROMACS

Figure 12: Concurrency Test 1 for LAMMPS

Figure 13: Concurrency Test 2 for LAMMPS

Another set of concurrency tests was conducted by running different applications (WRF, GROMACS, and Open FOAM) simultaneously in the virtual environment. In this test, two eight-node copies of each application run concurrently across the virtual cluster to determine if any performance variation occurs while running multiple parallel applications in the virtual nodes. There is no performance degradation observed in this scenario also, when compared to the individual application baseline run with no other workload running on the cluster.

Figure 14: Concurrency test with multiple applications running in parallel

Intel Select Solution certification

In addition to the benchmark testing, this system has been certified as an Intel® Select Solution for Simulation and Modeling. Intel Select Solutions are workload-optimized configurations that Intel benchmark-tests and verifies for performance and reliability. These solutions can be deployed easily on premises and in the cloud, providing predictability and scalability.

All Intel Select Solutions are a tailored combination of Intel data center compute, memory, storage, and network technologies that deliver predictable, trusted, and compelling performance. Each solution offers assurance that the workload will work as expected, if not better. These solutions can save individual businesses from investing the resources that might otherwise be used to evaluate, select, and purchase the hardware components to gain that assurance themselves.

The Dell HPC On Demand solution is one of a select group of prevalidated, tested solutions that combine third-generation Intel® Xeon® Scalable processors and other Intel technologies into a proven architecture. These certified solutions can reduce the time and cost of building an HPC cluster, lowering hardware costs by taking advantage of a single system for both simulation and modeling workloads.

Conclusion

Running an HPC application necessitates careful consideration for achieving optimal performance. The main objective of the current study is to use appropriate tuning to bridge the performance gap between bare-metal and virtual systems. With the right settings on the tested HPC applications (see Overview), the performance difference between virtual and bare-metal nodes for the 32 node tests is less than 10%. It is therefore possible to successfully run different HPC workloads in a virtualized environment to leverage benefits of virtualization features. The concurrency testing helped to demonstrate that running multiple applications simultaneously in the virtual nodes does not degrade performance.

Resources

To learn more about our previous work on HPC virtualization on Cascade Lake, see the Performance study of a VMware vSphere 7 virtualized HPC cluster.

Acknowledgments

The authors thank Savitha Pareek from Dell Technologies, Yuankun Fu from VMware, Steven Pritchett, and Jonathan Sommers from R Systems for their contribution in the study.

Tags:

Environment	Bare Metal	Virtual
Server	PowerEdge R640 vSAN Ready Node
Processor	2 x Intel Xeon 2^nd Generation 6240R
Cores	All 48 cores used	44 vCPU used
Memory	12 x 16GB @3200 MT/s All 192 GB used	144 GB reserved for the VM
Operating System	CentOS 8.3	Host OS: VMware vSphere 7.0u2 Guest OS: CentOS 8.3
HPC Network NIC	100 GbE NVIDIA Mellanox Connect-X6
Service Network NIC	10/25 GbE NVIDIA Mellanox Connect-X4
HPC Network Switch	Dell PowerSwitch Z9332F-ON
Service Network Switch	Dell PowerSwitch S5248F-ON

Settings	Value
Physical Server
BIOS Power Profile	Performance per watt (OS)
BIOS Hyper-threading	On
BIOS Node Interleaving	Off
BIOS SR-IOV	On
Hypervisor
ESXi Power Policy	High Performance
Virtual Machine
VM Latency Sensitivity	High
VM CPU Reservation	Enabled
VM Memory Reservation	Enabled
VM Sizing	Maximum VM size with CPU/memory reservation

Component	Description
Processor	Dual Intel Xeon Gold 6338
Memory	512 GB - 16 x 32 GiB@3200 MHz
Local disk	3.84 TB SATA-6GB SSD
Operating system	Rocky Linux release 8.4 (Green Obsidian)
GPU model	4 x AMD MI210 (PCIe-64G) or 3 x AMD MI100 (PCIe-32G)
GPU driver version	5.13.20.5.1
ROCm version	5.1.3
Processor Settings > Logical Processors	Disabled
System profiles	Performance
rocm-blas-bench	5.1.3
TransferBench	5.1.3
HPL	Compiled with ROCm v5.1.3
LAMMPS (KOKKOS)	Version: LAMMPS patch_4May2022

GPU architecture	AMD Instinct MI210	AMD Instinct MI100
Peak Engine Clock (MHz)	1700	1502
Stream processors	6656	7680
Peak FP64 (TFlops)	22.63	11.5
Peak FP64 Tensor DGEMM (TFlops)	45.25	11.5
Peak FP32 (TFlops)	22.63	23.1
Peak FP32 Tensor SGEMM (TFlops)	45.25	46.1
Memory size (GB)	64	32
Memory Type	HBM2e	HBM2
Peak Memory Bandwidth (GB/s)	1638	1228
Memory ECC support	Yes	Yes
TDP (Watt)	300	300

Your Browser is Out of Date

Performance Evaluation of HPC Applications on a Dell PowerEdge R650-based VMware Virtualized Cluster

Overview

Bare-metal and virtualized HPC system configuration

Application and benchmark details

Performance results

Concurrency test

Intel Select Solution certification

Conclusion

Resources

Acknowledgments

Related Blog Posts

Performance study of a VMware vSphere 7 virtualized HPC cluster

Performance test details

Performance Results

Tuning for Optimal Performance

Conclusion

Additional resources

Acknowledgements

HPC Application Performance on Dell PowerEdge R750xa Servers with the AMD Instinct TM MI210 Accelerator