Porting the CUDA p2pbandwidthLatencyTest to the HIP environment on Dell PowerEdge Servers with the AMD GPU
Wed, 13 Jul 2022 14:59:25 -0000|
Read Time: 0 minutes
When writing code in CUDA, it is natural to ask if that code can be extended to other GPUs. This extension can allow the “write once, run anywhere” programming paradigm to materialize. While this programming paradigm is a lofty goal, we are in a position to achieve the benefits of porting code from CUDA (for NVIDIA GPUs) to HIP (for AMD GPUs) with little effort. This interoperability provides added value because developers do not have to rewrite code starting at the beginning. It not only saves time, but also saves system administrator efforts to run workloads on a data center depending on hardware resource availability.
This blog provides a brief overview of the AMD ROCm™ platform. It describes a use case that ports the peer-to-peer GPU bandwidth latency test (p2pbandwidthlatencytest) from CUDA to Heterogeneous-Computing Interface for Portability (HIP) to run on an AMD GPU.
Introduction to ROCm and HIP
ROCm is an open-source software platform for GPU-accelerated computing from AMD. It supports running of HPC and AI workloads across different vendors. The following figures show the core ROCm components and capabilities:
Figure 1: The ROCm libraries stack
Figure 2: The ROCm stack
ROCm is a full package of all that is needed to run different HPC and AI workloads. It includes a collection of drivers, APIs, and other GPU tools that support AMD Instinct™ GPUs as well as other accelerators. To meet the objective of running workloads on other accelerators, HIP was introduced.
HIP is AMD’s GPU programming paradigm for designing kernels on GPU hardware. It is a C++ runtime API and a programming language that serves applications on different platforms.
One of the key features of HIP is the ability to convert CUDA code to HIP, which allows running CUDA applications on AMD GPUs. When the code is ported to HIP, it is possible to run HIP code on NVIDIA GPUs by using the CUDA platform-supported compilers (HIP is C++ code and it provides headers that support translation between HIP runtime APIs to CUDA runtime APIs). HIPify refers to the tools that translate CUDA source code into HIP C++.
Introduction to the CUDA p2pbandwidthLatencyTest
The p2pbwLatencyTest determines the data transfer speed between GPUs by computing latency and bandwidth. This test is useful to quantify the communication speed between GPUs and to ensure that these GPUs can communicate.
For example, during training of large-scale data and model parallel deep learning models, it is imperative to ensure that GPUs can communicate after a deadlock or other issues while building and debugging a model. There are other use cases for this test such as BIOS configuration performance improvements, driver update performance implications, and so on.
Porting the p2pbandwidthLatencyTest
The following steps port the p2pbandwidthLatencyTest from CUDA to HIP:
- Ensure that ROCm and HIP are installed in your machine. Follow the installation instructions in the ROCm Installation Guide at:
Note: The latest version of ROCm is v5.2.0. This blog describes a scenario running with ROCm v4.5. You can run ROCm v5.x, however, it is recommended that you see the ROCm Installation Guide v5.1.3 at:
- Verify your installation by running the commands described in:
- Optionally, ensure that HIP is installed as described at:
We recommend this step to ensure that the expected outputs are produced.
- Install CUDA on your local machine to be able to convert CUDA source code to HIP.
To align version dependencies that need CUDA and LLVM +CLANG, see:
- Verify that your installation is successful by testing a sample source conversion and compilation. See the instructions at:
Clone this repo to perform the validation test. If you can run the following square.cpp program, the installation is successful:
Congratulations! You can now run the conversion process for the p2pbwLatencyTest.
- If you use the Bright Cluster Manager, load the CUDA module as follows:
module load cuda11.1/toolkit/11.1.0
Converting the p2pbwLatencyTest from CUDA to HIP
After you download the p2pbandwidthLatencyTest, convert the test from CUDA to HIP.
There are two approaches to convert CUDA to HIP:
- hipify-perl—A Perl script that uses regular expressions to convert CUDA to HIP replacements. It is useful when direct replacements can solve the porting problem. It is a naïve converter that does not check for valid CUDA code. A disadvantage of the script is that it cannot transform some constructs. For more information, see https://github.com/ROCm-Developer-Tools/HIPIFY#-hipify-perl.
- hipify-clang—A tool that translates CUDA source code into an abstract syntax tree, which is traversed by transformation matchers. After performing all the transformations, HIP output is produced. For more information, see https://github.com/ROCm-Developer-Tools/HIPIFY#-hipify-clang.
For more information about HIPify, see the HIPify Reference Guide at https://docs.amd.com/bundle/HIPify-Reference-Guide-v5.1/page/HIPify.html.
To convert the p2pbwLatencyTest from CUDA to HIP:
- Clone the CUDA sample repository and run the conversion:
git clone https://github.com/NVIDIA/cuda-samples.git cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest hipify-perl p2pBandwidthLatencyTest.cu > hip_converted.cpp hipcc hip_converted.cpp -o p2pamd.ouThe following example shows the program output:
Figure 3: Output of the CUDAP2PBandWidthLatency test run on AMD GPUs
The output must include all the GPUs. In this use case, there are three GPUs: 0, 1, 2.
- Use the rocminfo command to identify GPUs in the server and then you can use the rocm-smi command to identify the three GPUs in the server, as shown in the following figure:
Figure 4: Output of the rocm-smi command showing all three GPUs in the server
HIPify is a time-saving tool for converting CUDA code to run on AMD Instinct accelerators. Because there are consistent improvements from the AMD software team, there are regular releases in the software stack . The HIPify path is an automated way to support conversion from CUDA to a generalized framework. After your code is ported to HIP, this conversion allows for running code on different accelerators from different vendors. This feature helps to enable further developments from a common platform.
This blog showed how to convert a sample use case from CUDA to HIP using the hipify-perl tool.
Run system information
Table 1: System details
CentOS Linux 8 (Core)
Dell PowerEdge R7525
2 x AMD EPYC 7543 32-Core Processor
AMD Instinct MI210
Related Blog Posts
HPC Application Performance on Dell PowerEdge R7525 Servers with NVIDIA A100 GPGPUs
Tue, 24 Nov 2020 17:49:03 -0000|
Read Time: 0 minutes
The Dell PowerEdge R7525 server powered with 2nd Gen AMD EPYC processors was released as part of the Dell server portfolio. It is a 2U form factor rack-mountable server that is designed for HPC workloads. Dell Technologies recently added support for NVIDIA A100 GPGPUs to the PowerEdge R7525 server, which supports up to three PCIe-based dual-width NVIDIA GPGPUs. This blog describes the single-node performance of selected HPC applications with both one- and two-NVIDIA A100 PCIe GPGPUs.
The NVIDIA Ampere A100 accelerator is one of the most advanced accelerators available in the market, supporting two form factors:
- PCIe version
- Mezzanine SXM4 version
The PowerEdge R7525 server supports only the PCIe version of the NVIDIA A100 accelerator.
The following table compares the NVIDIA A100 GPGPU with the NVIDIA V100S GPGPU:
NVIDIA A100 GPGPU
NVIDIA V100S GPGPU
Peak memory bandwidth
Up to 1555 GB/s
Up to 900 GB/s
Up to 1134 GB/s
Total board power
The NVIDIA A100 GPGPU brings innovations and features for HPC applications such as the following:
- Multi-Instance GPU (MIG)—The NVIDIA A100 GPGPU can be converted into as many as seven GPU instances, which are fully isolated at the hardware level, each using their own high-bandwidth memory and cores.
- HBM2—The NVIDIA A100 GPGPU comes with 40 GB of high-bandwidth memory (HBM2) and delivers bandwidth up to 1555 GB/s. Memory bandwidth with the NVIDIA A100 GPGPU is 1.7 times higher than with the previous generation of GPUs.
The following table shows the PowerEdge R7525 server configuration that we used for this blog:
2nd Gen AMD EPYC 7502, 32C, 2.5Ghz
512 GB (16 x 32 GB @3200MT/s)
Either of the following:
2 x NVIDIA A100 PCIe 40 GB
2 x NVIDIA V100S PCIe 32 GB
CentOS Linux release 8.1 (4.18.0-147.el8.x86_64)
11.0 (Driver version - 450.51.05)
The following sections provide our benchmarks results with observations.
High-Performance Linpack benchmark
High Performance Linpack (HPL) is a standard HPC system benchmark. This benchmark measures the compute power of the entire cluster or server. For this study, we used HPL compiled with NVIDIA libraries.
The following figure shows the HPL performance comparison for the PowerEdge R7525 server with either NVIDIA A100 or NVIDIA V100S GPGPUs:
Figure1: HPL performance on the PowerEdge R7525 server with the NVIDIA A100 GPGPU compared to the NVIDIA V100SGPGPU
The problem size (N) is larger for the NVIDIA A100 GPGPU due to the larger capacity of GPU memory. We adjusted the block size (NB) used with the:
- NVIDIA A100 GPGPU to 288
- NVIDIA V100S GPGPU to 384
The AMD EPYC processors provide options for multiple NUMA combinations. We found that the best value of 4 NUMA per socket (NPS=4), with NUMA per socket 1 and 2 lower the performance by 10 percent and 5 percent respectively. In a single PowerEdge R7525 node, the NVIDIA A100 GPGPU delivers 12 TF per card using this configuration without an NVLINK bridge. The PowerEdge R7525 server with two NVIDIA A100 GPGPUs delivers 2.3 times higher HPL performance compared to the NVIDIA V100S GPGPU configuration. This performance improvement is credited to the new double-precision Tensor Cores that accelerate FP64 math.
The following figure shows power consumption of the server while running HPL on the NVIDIA A100 GPGPU in a time series. Power consumption was measured with an iDRAC. The server reached 1038 Watts at peak due to a higher GFLOPS number.
Figure2: Power consumption while running HPL
High Performance Conjugate Gradient benchmark
The High Performance Conjugate Gradient (HPCG) benchmark is based on a conjugate gradient solver, in which the preconditioner is a three-level hierarchical multigrid method using the Gauss-Seidel method.
As shown in the following figure, HPCG performs at a rate 70 percent higher with the NVIDIA A100 GPGPU due to higher memory bandwidth:
Figure 3: HPCG performance comparison
Due to different memory size, the problem size used to obtain the best performance on the NVIDIA A100 GPGPU was 512 x 512 x 288 and on the NVIDIA V100S GPGPU was 256 x 256 x 256. For this blog, we used NUMA per socket (NPS)=4 and we obtained results without an NVLINK bridge. These results show that applications such as HPCG, which fits into GPU memory, can take full advantage of GPU memory and benefit from the higher memory bandwidth of the NVIDIA A100 GPGPU.
In addition to these two basic HPC benchmarks (HPL and HPCG), we also tested GROMACS, an HPC application. We compiled GROMACS 2020.4 with the CUDA compilers and OPENMPI, as shown in the following table:
Figure4: GROMACS performance with NVIDIA GPGPUs on the PowerEdge R7525 server
The GROMACS build included thread MPI (built in with the GROMACS package). All performance numbers were captured from the output “ns/day.” We evaluated multiple MPI ranks, separate PME ranks, and different nstlist values to achieve the best performance. In addition, we used settings with the best environment variables for GROMACS at runtime. Choosing the right combination of variables avoided expensive data transfer and led to significantly better performance for these datasets.
GROMACS performance was based on a comparative analysis between NVIDIA V100S and NVIDIA A100 GPGPUs. Excerpts from our single-node multi-GPU analysis for two datasets showed a performance improvement of approximately 30 percent with the NVIDIA A100 GPGPU. This result is due to improved memory bandwidth of the NVIDIA A100 GPGPU. (For information about how the GROMACS code design enables lower memory transfer overhead, see Developer Blog: Creating Faster Molecular Dynamics Simulations with GROMACS 2020.)
The Dell PowerEdge R7525 server equipped with NVIDIA A100 GPGPUs shows exceptional performance improvements over servers equipped with previous versions of NVIDIA GPGPUs for applications such as HPL, HPCG, and GROMACS. These performance improvements for memory-bound applications such as HPCG and GROMACS can take advantage of higher memory bandwidth available with NVIDIA A100 GPGPUs.
Deep Learning Training Performance on Dell EMC PowerEdge R7525 Servers with NVIDIA A100 GPUs
Mon, 21 Jun 2021 20:03:09 -0000|
Read Time: 0 minutes
The Dell EMC PowerEdge R7525 server, which was recently released, supports NVIDIA A100 Tensor Core GPUs. It is a two-socket, 2U rack-based server that is designed to run complex workloads using highly scalable memory, I/O capacity, and network options. The system is based on the 2nd Gen AMD EPYC processor (up to 64 cores), has up to 32 DIMMs, and has PCI Express (PCIe) 4.0-enabled expansion slots. The server supports SATA, SAS, and NVMe drives and up to three double-wide 300 W or six single-wide 75 W accelerators.
The following figure shows the front view of the server:
Figure 1: Dell EMC PowerEdge R7525 server
This blog focuses on the deep learning training performance of a single PowerEdge R7525 server with two NVIDIA A100-PCIe GPUs. The results of using two NVIDIA V100S GPUs in the same PowerEdge R7525 system are presented as reference data. We also present results from the cuBLAS GEMM test and the ResNet-50 model form the MLPerf Training v0.7 benchmark.
The following table provides the configuration details of the PowerEdge R7525 system under test:
AMD EPYC 7502 32-core processor
512 GB (32 GB 3200 MT/s * 16)
2 x 1.8 TB SSD (No RAID)
RedHat Enterprise Linux Server 8.2
|Either of the following:|
Processor Settings > Logical Processors
CUDA Basic Linear Algebra
The CUDA Basic Linear Algebra (cuBLAS) library is the CUDA version of standard basic linear algebra subroutines, part of CUDA-X. NVIDIA provides the cublasMatmulBench binary, which can be used to test the performance of general matrix multiplication (GEMM) on a single GPU. The results of this test reflect the performance of an ideal application that only runs matrix multiplication in the form of the peak TFLOPS that the GPU can deliver. Although GEMM benchmark results might not represent real-world application performance, it is still a good benchmark to demonstrate the performance capability of different GPUs.
Precision formats such as FP64 and FP32 are important to HPC workloads; precision formats such as INT8 and FP16 are important for deep learning inference. We plan to discuss these observed performances in our upcoming HPC and inference blogs.
Because FP16, FP32, and TF32 precision formats are imperative to deep learning training performance, the blog focuses on these formats.
The following figure shows the results that we observed:
Figure 2: cuBLAS GEMM performance on the PowerEdge R7525 server with NVIDIA V100S-PCIe-32G and NVIDIA A100-PCIe-40G GPUs
The results include:
- For FP16, the HGEMM TFLOPs of the NVIDIA A100 GPU is 2.27 times faster than the NVIDIA V100S GPU.
- For FP32, the SGEMM TFLOPs of the NVIDIA A100 GPU is 1.3 times faster than the NVIDIA V100S GPU.
- For TF32, performance improvement is expected without code changes for deep learning applications on the new NVIDIA A100 GPUs. This expectation is because math operations are run on NVIDIA A100 Tensor Cores GPUs with the new TF32 precision format. Although TF32 reduces the precision by a small margin, it preserves the range of FP32 and strikes an excellent balance between speed and accuracy. Matrix multiplication gained a sizable boost from 13.4 TFLOPS (FP32 on the NVIDIA V100S GPU) to 86.5 TFLOPS (TF32 on the NVIDIA A100 GPU).
MLPerf Training v0.7 ResNet-50
MLPerf is a benchmarking suite that measures the performance of machine learning (ML) workloads. The MLPerf Training benchmark suite measures how fast a system can train ML models.
The following figure shows the performance results of the ResNet-50 under the MLPerf Training v0.7 benchmark:
Figure 3: MLPerf Training v0.7 ResNet-50 performance on the PowerEdge R7525 server with NVIDIA V100S-PCIe-32G and NVIDIA A100-PCIe-40G GPUs
The metric for the ResNet-50 training is the minutes that the system under test spends to train the dataset to achieve 75.9 percent accuracy. Both runs using two NVIDIA A100 GPUs and two NVIDIA V100S GPUs converged at the 40th epoch. The NVIDIA A100 run took 166 minutes to converge, which is 1.8 times faster than the NVIDIA V100S run. Regarding throughput, two NVIDIA A100 GPUs can process 5240 images per second, which is also 1.8 times faster than the two NVIDIA V100S GPUs.
The Dell EMC PowerEdge R7525 server with two NVIDIA A100-PCIe GPUs demonstrates optimal performance for deep learning training workloads. The NVIDIA A100 GPU shows a greater performance improvement over the NVIDIA V100S GPU.
To evaluate deep learning and HPC workload and application performance with the PowerEdge R7525 server powered by NVIDIA GPUs, contact the HPC & AI Innovation Lab.
We plan to provide performance studies on:
- Three NVIDIA A100 GPUs in a PowerEdge R7525 server
- Results of other deep learning models in the MLPerf Training v0.7 benchmark
- Training scalability results on multiple PowerEdge R7525 servers