
Running the MLPerf™ Inference v1.1 Benchmark on Dell EMC Systems
Fri, 24 Sep 2021 16:51:50 -0000
This blog is a guide for running the MLPerf inference v1.1 benchmark. Information about how to run the MLPerf inference v1.1 benchmark is available online at different locations. This blog provides all the steps in one place.
MLPerf is a benchmarking suite that measures the performance of Machine Learning (ML) workloads. It focuses on the most important aspects of the ML life cycle: training and inference. For more information, see Introduction to MLPerf™ Inference v1.1 Performance with Dell EMC Servers.
This blog focuses on inference setup and describes the steps to run closed data center MLPerf inference v1.1 tests on Dell Technologies servers with NVIDIA GPUs. It enables you to run the tests and reproduce the results that we observed in our HPC & AI Innovation Lab. For details about the hardware and the software stack for different systems in the benchmark, see this list of systems.
The MLPerf inference v1.1 suite contains the following benchmarks:
- Resnet50
- SSD-Resnet34
- BERT
- DLRM
- RNN-T
- 3D U-Net
Note: The BERT, DLRM, and 3D U-Net models have 99 percent (default accuracy) and 99.9 percent (high accuracy) targets.
This blog describes the steps to run all these benchmarks.
1 Getting started
A system under test consists of a defined set of hardware and software resources that will be measured for performance. The hardware resources may include processors, accelerators, memories, disks, and interconnect. The software resources may include an operating system, compilers, libraries, and drivers that significantly influence the running time of a benchmark. In this case, the system on which you clone the MLPerf repository and run the benchmark is known as the system under test (SUT).
For storage, SSD RAID or local NVMe drives are acceptable for running all the subtests without any penalty. Inference does not have strict requirements for fast-parallel storage. However, the BeeGFS or Lustre file system, the PixStor storage solution, and so on help make multiple copies of large datasets.
2 Prerequisites
Prerequisites for running the MLPerf inference v1.1 tests include:
- An x86_64 Dell EMC system
- Docker installed with the NVIDIA runtime hook
- Ampere-based NVIDIA GPUs (Turing GPUs have legacy support but are no longer maintained for optimizations.)
- NVIDIA driver version 470.xx or later
- ECC turned on (required as of MLPerf inference v1.0)
To set ECC to on, run the following command: sudo nvidia-smi --ecc-config=1
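To confirm the ECC state on every GPU before starting, you can query nvidia-smi. The following is a minimal Python sketch (not part of the MLPerf harness) that uses standard nvidia-smi query fields; adapt it to your environment as needed:
import subprocess

# Each output line looks like "0, Enabled" or "0, Disabled".
output = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,ecc.mode.current", "--format=csv,noheader"],
    text=True,
)
for line in output.strip().splitlines():
    index, mode = [field.strip() for field in line.split(",")]
    print(f"GPU {index}: ECC {mode}")
    if mode != "Enabled":
        print("  Run 'sudo nvidia-smi --ecc-config=1' and reboot for the change to take effect.")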
3 Preparing to run the MLPerf inference v1.1 benchmark
Before you can run the MLPerf inference v1.1 tests, perform the following tasks to prepare your environment.
3.1 Clone the MLPerf repository
- Clone the repository to your home directory or another acceptable path:
cd ~
git clone https://github.com/mlperf/inference_results_v1.1
- Go to the closed/DellEMC directory:
cd inference_results_v1.1/closed/DellEMC
- Create a “scratch” directory with at least 3 TB of space in which to store the models, datasets, preprocessed data, and so on:
mkdir scratch
- Export the absolute path for $MLPERF_SCRATCH_PATH with the scratch directory:
export MLPERF_SCRATCH_PATH=/home/user/inference_results_v1.1/closed/DellEMC/scratch
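Because the scratch area must hold roughly 3 TB of models, datasets, and preprocessed data, it is worth confirming the available space up front. The following is a minimal Python sketch, assuming MLPERF_SCRATCH_PATH has already been exported as shown above:
import os
import shutil

scratch = os.environ["MLPERF_SCRATCH_PATH"]
total, used, free = shutil.disk_usage(scratch)
print(f"{scratch}: {free / 1e12:.2f} TB free of {total / 1e12:.2f} TB")
if free < 3e12:
    print("Warning: less than 3 TB free; downloads and preprocessing may run out of space.")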
3.2 Set up the configuration file
The closed/DellEMC/configs directory includes an __init__.py file that lists configurations for different Dell EMC servers that were systems in the MLPerf Inference v1.1 benchmark. If necessary, modify the configs/<benchmark>/<Scenario>/__init__.py file to include the system that will run the benchmark.
Note: If your system is already present in the configuration file, there is no need to add another configuration.
In the configs/<benchmark>/<Scenario>/__init__.py file, select a similar configuration and modify it based on the current system, matching the number and type of GPUs in your system.
For this blog, we used a Dell EMC PowerEdge R7525 server with one A100 GPU as the example and chose R7525_A100_PCIE_40GBx1 as the name for this new system. Because this system is not already in the list of systems, we added the R7525_A100_PCIE_40GBx1 configuration.
Because the R7525_A100_PCIE_40GBx3 reference system is the most similar, we modified that configuration and picked Resnet50 Server as the example benchmark.
The following example shows the reference configuration for three GPUs for the Resnet50 Server benchmark in the configs/resnet50/Server/__init__.py file:
@ConfigRegistry.register(HarnessType.LWIS, AccuracyTarget.k_99, PowerSetting.MaxP)
class R7525_A100_PCIE_40GBx3(BenchmarkConfiguration):
    system = System("R7525_A100-PCIE-40GB", Architecture.Ampere, 3)
    active_sms = 100
    input_dtype = "int8"
    input_format = "linear"
    map_path = "data_maps/<dataset_name>/val_map.txt"
    precision = "int8"
    tensor_path = "${PREPROCESSED_DATA_DIR}/<dataset_name>/ResNet50/int8_linear"
    use_deque_limit = True
    deque_timeout_usec = 5742
    gpu_batch_size = 205
    gpu_copy_streams = 11
    gpu_inference_streams = 9
    server_target_qps = 91250
    use_cuda_thread_per_device = True
    use_graphs = True
    scenario = Scenario.Server
    benchmark = Benchmark.ResNet50
    start_from_device = True
This example shows the modified configuration for one GPU:
@ConfigRegistry.register(HarnessType.LWIS, AccuracyTarget.k_99, PowerSetting.MaxP)
class R7525_A100_PCIE_40GBx1(BenchmarkConfiguration):
    system = System("R7525_A100-PCIE-40GB", Architecture.Ampere, 1)
    active_sms = 100
    input_dtype = "int8"
    input_format = "linear"
    map_path = "data_maps/<dataset_name>/val_map.txt"
    precision = "int8"
    tensor_path = "${PREPROCESSED_DATA_DIR}/<dataset_name>/ResNet50/int8_linear"
    use_deque_limit = True
    deque_timeout_usec = 5742
    gpu_batch_size = 205
    gpu_copy_streams = 11
    gpu_inference_streams = 9
    server_target_qps = 30400
    use_cuda_thread_per_device = True
    use_graphs = True
    scenario = Scenario.Server
    benchmark = Benchmark.ResNet50
    start_from_device = True
We modified the queries per second (QPS) parameter (server_target_qps) to match the number of GPUs. The server_target_qps parameter scales linearly, so QPS = number of GPUs x QPS per GPU. For one GPU, we therefore set server_target_qps to 30400, in line with the expected single-GPU performance.
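The scaling rule is simple enough to express as a short calculation. The following sketch derives a starting server_target_qps for an arbitrary GPU count from the three-GPU reference configuration; treat the result only as a starting point and tune it until the Server run reports VALID:
# Reference configuration: three A100-PCIE-40GB GPUs at server_target_qps = 91250.
REFERENCE_QPS = 91250
REFERENCE_GPU_COUNT = 3

def scaled_server_target_qps(gpu_count: int) -> int:
    """Linearly scale the reference QPS to a new GPU count."""
    per_gpu_qps = REFERENCE_QPS / REFERENCE_GPU_COUNT  # roughly 30416 QPS per GPU
    return int(per_gpu_qps * gpu_count)

print(scaled_server_target_qps(1))  # ~30416; we rounded down to 30400 for the x1 config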
3.3 Add the new system
After you add the new system to the __init__.py file as shown in the preceding example, add the new system to the list of available systems. The list of available systems is in the code/common/system_list.py file. This entry informs the benchmark that a new system exists and ensures that the benchmark selects the correct configuration.
Note: If your system is already added, there is no need to add it to the code/common/system_list.py file.
Add the new system to the list of available systems in the code/common/system_list.py file.
At the end of the file, there is a class called KnownSystems. This class defines a list of SystemClass objects that describe supported systems as shown in the following example:
SystemClass(<system ID>, [<list of names reported by nvidia-smi>], [<known PCI IDs of this system>], <architecture>, [<list of known supported GPU counts>])
Where:
- For <system ID>, enter the system ID with which you want to identify this system.
- For <list of names reported by nvidia-smi>, run the nvidia-smi -L command and use the name that is returned.
- For <known PCI IDs of this system>, run the following command:
$ CUDA_VISIBLE_ORDER=PCI_BUS_ID nvidia-smi --query-gpu=gpu_name,pci.device_id --format=csv
name, pci.device_id
A100-PCIE-40GB, 0x20F110DE
The pci.device_id field is in the 0x<PCI ID>10DE format, where 10DE is the NVIDIA PCI vendor ID. Use the four hexadecimal digits between 0x and 10DE as the PCI ID for your system. In this case, it is 20F1 (see the short sketch after this list for a programmatic version).
- For <architecture>, use the architecture Enum, which is at the top of the file. In this case, A100 is Ampere architecture.
- For <list of known GPU counts>, enter the GPU counts that you want to support (for example, [1, 2, 4] to support the 1x, 2x, and 4x GPU variants of this system). Because the system_list.py file already contains a 3x variant, we only need to include the number 1 as an additional entry.
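If you want to extract the PCI ID programmatically rather than reading it off the nvidia-smi output, the following minimal Python sketch applies the same rule described above (strip the leading 0x and the trailing 10DE vendor ID):
import subprocess

output = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=gpu_name,pci.device_id", "--format=csv,noheader"],
    text=True,
)
for line in output.strip().splitlines():
    name, device_id = [field.strip() for field in line.split(",")]
    # device_id looks like 0x20F110DE: drop the "0x" prefix and the "10DE" NVIDIA vendor ID.
    pci_id = device_id[2:-4]
    print(f"{name}: PCI ID {pci_id}")  # for example, A100-PCIE-40GB: PCI ID 20F1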
Note: Because a configuration is already present for the PowerEdge R7525 server, we added the number 1 for our configuration, as shown in the following example. If your system does not exist in the system_list.py file, add the entire configuration and not just the number.
class KnownSystems:
    """ Global List of supported systems """
    # Before the addition of 1, this entry supported only the R7525_A100-PCIE-40GB x3 variant:
    # R7525_A100_PCIE_40GB = SystemClass("R7525_A100-PCIE-40GB", ["A100-PCIE-40GB"], ["20F1"], Architecture.Ampere, [3])
    # After the addition, it supports both the x1 and x3 variants:
    R7525_A100_PCIE_40GB = SystemClass("R7525_A100-PCIE-40GB", ["A100-PCIE-40GB"], ["20F1"], Architecture.Ampere, [1, 3])
    DSS8440_A100_PCIE_80GB = SystemClass("DSS8440_A100-PCIE-80GB", ["A100-PCIE-80GB"], ["20B5"], Architecture.Ampere, [10])
    DSS8440_A30 = SystemClass("DSS8440_A30", ["A30"], ["20B7"], Architecture.Ampere, [8], valid_mig_slices=[MIGSlice(1, 6), MIGSlice(2, 12), MIGSlice(4, 24)])
    R750xa_A100_PCIE_40GB = SystemClass("R750xa_A100-PCIE-40GB", ["A100-PCIE-40GB"], ["20F1"], Architecture.Ampere, [4])
    R750xa_A100_PCIE_80GB = SystemClass("R750xa_A100-PCIE-80GB", ["A100-PCIE-80GB"], ["20B5"], Architecture.Ampere, [4], valid_mig_slices=[MIGSlice(1, 10), MIGSlice(2, 20), MIGSlice(3, 40)])
Note: You must provide different configurations in the configs/resnet50/Server/__init__.py file for the x1 variant and x3 variant. In the preceding example, the R7525_A100-PCIE-40GBx3 configuration is different from the R7525_A100-PCIE-40GBx1 configuration.
3.4 Build the Docker image and required libraries
Build the Docker image and then launch an interactive container. Then, in the interactive container, build the required libraries for inferencing.
- To build the Docker image, run the make prebuild command inside the closed/DellEMC folder:
Command:
make prebuild
The following example shows sample output:
Launching Docker session
nvidia-docker run --rm -it -w /work \
 -v /home/user/article_inference_v1.1/closed/DellEMC:/work -v /home/user:/mnt//home/user \
 --cap-add SYS_ADMIN \
 -e NVIDIA_VISIBLE_DEVICES=0 \
 --shm-size=32gb \
 -v /etc/timezone:/etc/timezone:ro -v /etc/localtime:/etc/localtime:ro \
 --security-opt apparmor=unconfined --security-opt seccomp=unconfined \
 --name mlperf-inference-user -h mlperf-inference-user --add-host mlperf-inference-user:127.0.0.1 \
 --user 1002:1002 --net host --device /dev/fuse \
 -v /home/user/inference_results_v1.0/closed/DellEMC/scratch:/home/user/inference_results_v1.1/closed/DellEMC/scratch \
 -e MLPERF_SCRATCH_PATH=/home/user/inference_results_v1.0/closed/DellEMC/scratch \
 -e HOST_HOSTNAME=node009 \
 mlperf-inference:user
The Docker container is launched with all the necessary packages installed.
- Access the interactive terminal on the container.
- To build the required libraries for inferencing, run the make build command inside the interactive container:
Command
make build
The following example shows sample output:
(mlperf) user@mlperf-inference-user:/work$ make build
…….
[ 26%] Linking CXX executable /work/build/bin/harness_default
make[4]: Leaving directory '/work/build/harness'
make[4]: Leaving directory '/work/build/harness'
make[4]: Leaving directory '/work/build/harness'
[ 36%] Built target harness_bert
[ 50%] Built target harness_default
[ 55%] Built target harness_dlrm
make[4]: Leaving directory '/work/build/harness'
[ 63%] Built target harness_rnnt
make[4]: Leaving directory '/work/build/harness'
[ 81%] Built target harness_triton
make[4]: Leaving directory '/work/build/harness'
[100%] Built target harness_triton_mig
make[3]: Leaving directory '/work/build/harness'
make[2]: Leaving directory '/work/build/harness'
Finished building harness.
make[1]: Leaving directory '/work'
(mlperf) user@mlperf-inference-user:/work
The container in which you can run the benchmarks is built.
3.5 Download and preprocess validation data and models
To run the MLPerf inference v1.1, download datasets and models, and then preprocess them. MLPerf provides scripts that download the trained models. The scripts also download the dataset for benchmarks other than Resnet50, DLRM, and 3D U-Net.
For Resnet50, DLRM, and 3D U-Net, register for an account and then download the datasets manually:
- Resnet50—Download the most common image classification 2012 validation set (https://github.com/mlcommons/training/tree/master/image_classification) and extract the downloaded file to $MLPERF_SCRATCH_PATH/data/<dataset_name>/
- DLRM—Download the Criteo Terabyte dataset and extract the downloaded file to $MLPERF_SCRATCH_PATH/data/criteo/
- 3D U-Net—Download the BraTS challenge data and extract the downloaded file to $MLPERF_SCRATCH_PATH/data/BraTS/MICCAI_BraTS_2019_Data_Training
To download the models and the remaining datasets, and then preprocess them, run the following commands:
$ make download_model # Downloads models and saves to $MLPERF_SCRATCH_PATH/models
$ make download_data # Downloads datasets and saves to $MLPERF_SCRATCH_PATH/data
$ make preprocess_data # Preprocesses data and saves to $MLPERF_SCRATCH_PATH/preprocessed_data
Note: These commands download all the datasets, which might not be required if the objective is to run one specific benchmark. To run a specific benchmark rather than all the benchmarks, see the following sections for information about the specific benchmark.
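Before generating engines, it can save time to confirm that the scratch directory actually contains the models, data, and preprocessed data. The following is a minimal Python sketch based on the directory names used by the Makefile targets above:
import os

scratch = os.environ["MLPERF_SCRATCH_PATH"]
for subdir in ("models", "data", "preprocessed_data"):
    path = os.path.join(scratch, subdir)
    entries = os.listdir(path) if os.path.isdir(path) else []
    print(f"{path}: {len(entries)} entries" if entries else f"{path}: missing or empty")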
The closed/DellEMC working directory contains the following folders:
(mlperf) user@mlperf-inference-user:/work$ tree -d -L 1
.
├── build
├── code
├── compliance
├── configs
├── data_maps
├── docker
├── measurements
├── power
├── results
├── scripts
└── systems
The different folders are as follows:
- build—Logs, preprocessed data, engines, models, plugins, and so on
- code—Source code for all the benchmarks
- compliance—Passed compliance checks
- configs—Configurations that run different benchmarks for different system setups
- data_maps—Data maps for different benchmarks
- docker—Docker files to support building the container
- measurements—Measurement values for different benchmarks
- power—Files specific to power submissions (needed only for power submissions)
- results—Final result logs
- scratch—Storage for models, preprocessed data, and the dataset that is symlinked to the preceding build directory
- scripts—Support scripts
- systems—Hardware and software details of systems in the benchmark
4 Running the benchmarks
After you have performed the preceding tasks to prepare your environment, run any of the benchmarks that are required for your tests.
The Resnet50, SSD-Resnet34, and RNN-T benchmarks have 99 percent (default accuracy) targets.
The BERT, DLRM, and 3D U-Net benchmarks have 99 percent (default accuracy) and 99.9 percent (high accuracy) targets. For information about running these benchmarks, see the Running high accuracy target benchmarks section below.
If you downloaded and preprocessed all the datasets (as shown in the previous section), there is no need to do so again. Skip the download and preprocessing steps in the procedures for the following benchmarks.
NVIDIA TensorRT is the inference engine for the backend. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning applications.
4.1 Run the Resnet50 benchmark
To set up the Resnet50 dataset and model to run the inference:
- If you already downloaded and preprocessed the datasets, go to step 5.
- Download the required validation dataset (https://github.com/mlcommons/training/tree/master/image_classification).
- Extract the images to $MLPERF_SCRATCH_PATH/data/<dataset_name>/
- Run the following commands:
make download_model BENCHMARKS=resnet50
make preprocess_data BENCHMARKS=resnet50
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engines for both Offline and Server scenarios
make generate_engines RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline,Server --config_ver=default"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
The following example shows the output for a PerformanceOnly mode and displays a “VALID” result:
=======================
Perf harness results:
=======================
R7525_A100-PCIe-40GBx1_TRT-default-Server:
    resnet50: Scheduled samples per second : 30400.32 and Result is : VALID
=======================
Accuracy results:
=======================
R7525_A100-PCIe-40GBx1_TRT-default-Server:
    resnet50: No accuracy results in PerformanceOnly mode.
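When you run many scenario and accuracy combinations, it can help to pull these headline numbers out of the console output programmatically. The following is a minimal Python sketch that scans a saved copy of the harness output for the summary line shown above; the log file path is an argument you supply:
import re
import sys

# Matches, for example:
# resnet50: Scheduled samples per second : 30400.32 and Result is : VALID
pattern = re.compile(
    r"(?P<benchmark>\S+): Scheduled samples per second\s*:\s*(?P<qps>[\d.]+)"
    r"\s+and Result is\s*:\s*(?P<result>\w+)"
)

with open(sys.argv[1]) as log_file:
    for line in log_file:
        match = pattern.search(line)
        if match:
            print(match.group("benchmark"), match.group("qps"), match.group("result"))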
4.2 Run the SSD-Resnet34 benchmark
To set up the SSD-Resnet34 dataset and model to run the inference:
- If necessary, download and preprocess the dataset:
make download_model BENCHMARKS=ssd-resnet34
make download_data BENCHMARKS=ssd-resnet34
make preprocess_data BENCHMARKS=ssd-resnet34
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engines for both Offline and Server scenarios
make generate_engines RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline,Server --config_ver=default"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
4.3 Run the RNN-T benchmark
To set up the RNN-T dataset and model to run the inference:
- If necessary, download and preprocess the dataset:
make download_model BENCHMARKS=rnnt
make download_data BENCHMARKS=rnnt
make preprocess_data BENCHMARKS=rnnt
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engines for both Offline and Server scenarios
make generate_engines RUN_ARGS="--benchmarks=rnnt --scenarios=Offline,Server --config_ver=default"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
5 Running high accuracy target benchmarks
The BERT, DLRM, and 3D U-Net benchmarks have high accuracy targets.
5.1 Run the BERT benchmark
To set up the BERT dataset and model to run the inference:
- If necessary, download and preprocess the dataset:
make download_model BENCHMARKS=bert
make download_data BENCHMARKS=bert
make preprocess_data BENCHMARKS=bert
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engines for both Offline and Server scenarios and also for default and high accuracy targets.
make generate_engines RUN_ARGS="--benchmarks=bert --scenarios=Offline,Server --config_ver=default,high_accuracy"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=high_accuracy --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=high_accuracy --test_mode=AccuracyOnly"
5.2 Run the DLRM benchmark
To set up the DLRM dataset and model to run the inference:
- If you already downloaded and preprocessed the datasets, go to step 5.
- Download the Criteo Terabyte dataset.
- Extract the dataset to the $MLPERF_SCRATCH_PATH/data/criteo/ directory.
- Run the following commands:
make download_model BENCHMARKS=dlrm
make preprocess_data BENCHMARKS=dlrm
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engines for both Offline and Server scenarios and also for default and high accuracy targets.
make generate_engines RUN_ARGS="--benchmarks=dlrm --scenarios=Offline,Server --config_ver=default,high_accuracy"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=high_accuracy --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=high_accuracy --test_mode=AccuracyOnly"
5.3 Run the 3D U-Net benchmark
Note: This benchmark only has the Offline scenario.
To set up the 3D U-Net dataset and model to run the inference:
- If you already downloaded and preprocessed the datasets, go to step 5.
- Download the BraTS challenge data.
- Extract the images to the $MLPERF_SCRATCH_PATH/data/BraTS/MICCAI_BraTS_2019_Data_Training directory.
- Run the following commands:
make download_model BENCHMARKS=3d-unet
make preprocess_data BENCHMARKS=3d-unet
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engines for the Offline scenario and for default and high accuracy targets.
make generate_engines RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default,high_accuracy"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
6 Limitations and best practices for running MLPerf
Note the following limitations and best practices:
- To build the engine and run the benchmark with a single command, use the make run RUN_ARGS… shortcut. It is a valid alternative to running the make generate_engines … && make run_harness … commands separately.
- To test runs quickly, include the --fast flag in RUN_ARGS; it sets the run time to one minute. For example:
make run_harness RUN_ARGS="--fast --benchmarks=<bmname> --scenarios=<scenario> --config_ver=<cver> --test_mode=PerformanceOnly"
The benchmark runs for one minute instead of the default 10 minutes.
- If the results of a Server scenario run are “INVALID”, reduce the server_target_qps value. “INVALID” results are expected when the latency constraints are not met during the run.
- If the results of an Offline scenario run are “INVALID”, increase the offline_expected_qps value. “INVALID” Offline runs occur when the system can deliver a significantly higher QPS than the configured offline_expected_qps (a sketch of an Offline configuration appears after this list).
- If the batch size changes, rebuild the engine.
- Only the BERT, DLRM, and 3D U-Net benchmarks support high accuracy targets.
- The 3D U-Net benchmark supports only the Offline scenario.
- To run with the Triton Inference Server, pass triton (default accuracy) or high_accuracy_triton (high accuracy) in the config_ver argument.
- When running a command with RUN_ARGS, include the quotation marks; errors can occur if you omit them.
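As an illustration of the Offline tuning advice above, an Offline configuration follows the same pattern as the Server configuration shown in section 3.2. The sketch below is not a measured configuration; the offline_expected_qps value is a placeholder that only shows where the parameter is set:
@ConfigRegistry.register(HarnessType.LWIS, AccuracyTarget.k_99, PowerSetting.MaxP)
class R7525_A100_PCIE_40GBx1(BenchmarkConfiguration):
    system = System("R7525_A100-PCIE-40GB", Architecture.Ampere, 1)
    # Input, batch size, and stream settings as in the Server example in section 3.2.
    offline_expected_qps = 36000  # placeholder; increase this value if the Offline run is INVALID
    scenario = Scenario.Offline
    benchmark = Benchmark.ResNet50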
Conclusion
This blog provided the step-by-step procedures to run and reproduce closed data center MLPerf inference v1.1 results on Dell EMC servers with NVIDIA GPUs.
Related Blog Posts

Porting the CUDA p2pbandwidthLatencyTest to the HIP environment on Dell PowerEdge Servers with the AMD GPU
Wed, 13 Jul 2022 14:59:25 -0000
Introduction
When writing code in CUDA, it is natural to ask if that code can be extended to other GPUs. This extension can allow the “write once, run anywhere” programming paradigm to materialize. While this programming paradigm is a lofty goal, we are in a position to achieve the benefits of porting code from CUDA (for NVIDIA GPUs) to HIP (for AMD GPUs) with little effort. This interoperability provides added value because developers do not have to rewrite code from scratch. It not only saves developer time, but also lets system administrators run workloads in a data center on whatever hardware resources are available.
This blog provides a brief overview of the AMD ROCm™ platform. It describes a use case that ports the peer-to-peer GPU bandwidth latency test (p2pbandwidthlatencytest) from CUDA to Heterogeneous-Computing Interface for Portability (HIP) to run on an AMD GPU.
Introduction to ROCm and HIP
ROCm is an open-source software platform for GPU-accelerated computing from AMD. It supports running of HPC and AI workloads across different vendors. The following figures show the core ROCm components and capabilities:
Figure 1: The ROCm libraries stack
Figure 2: The ROCm stack
ROCm is a full package of all that is needed to run different HPC and AI workloads. It includes a collection of drivers, APIs, and other GPU tools that support AMD Instinct™ GPUs as well as other accelerators. To meet the objective of running workloads on other accelerators, HIP was introduced.
HIP is AMD’s GPU programming paradigm for designing kernels on GPU hardware. It is a C++ runtime API and a programming language that serves applications on different platforms.
One of the key features of HIP is the ability to convert CUDA code to HIP, which allows running CUDA applications on AMD GPUs. When the code is ported to HIP, it is possible to run HIP code on NVIDIA GPUs by using the CUDA platform-supported compilers (HIP is C++ code and it provides headers that support translation between HIP runtime APIs to CUDA runtime APIs). HIPify refers to the tools that translate CUDA source code into HIP C++.
Introduction to the CUDA p2pbandwidthLatencyTest
The p2pbwLatencyTest determines the data transfer speed between GPUs by computing latency and bandwidth. This test is useful to quantify the communication speed between GPUs and to ensure that these GPUs can communicate.
For example, during training of large-scale data and model parallel deep learning models, it is imperative to ensure that GPUs can communicate after a deadlock or other issues while building and debugging a model. There are other use cases for this test such as BIOS configuration performance improvements, driver update performance implications, and so on.
Porting the p2pbandwidthLatencyTest
The following steps port the p2pbandwidthLatencyTest from CUDA to HIP:
- Ensure that ROCm and HIP are installed on your machine. Follow the installation instructions in the ROCm Installation Guide at:
https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation_new.html#rocm-installation-guide-v4-5
Note: The latest version of ROCm is v5.2.0. This blog describes a scenario running with ROCm v4.5. You can run ROCm v5.x; however, it is recommended that you see the ROCm Installation Guide v5.1.3 at:
https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.1.3/page/Overview_of_ROCm_Installation_Methods.html
- Verify your installation by running the commands described in:
https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation_new.html#verifying-rocm-installation
- Optionally, ensure that HIP is installed as described at:
https://github.com/ROCm-Developer-Tools/HIP/blob/master/INSTALL.md#verify-your-installation
We recommend this step to ensure that the expected outputs are produced.
- Install CUDA on your local machine so that you can convert CUDA source code to HIP. To align the version dependencies between CUDA and LLVM + CLANG, see:
https://github.com/ROCm-Developer-Tools/HIPIFY#dependencies
- Verify that your installation is successful by testing a sample source conversion and compilation. See the instructions at:
https://github.com/ROCm-Developer-Tools/HIP/tree/master/samples/0_Intro/square#squaremd
Clone that repository and build the square.cpp sample program; if it runs, the installation is successful and you can proceed to the conversion process for the p2pbwLatencyTest.
- If you use the Bright Cluster Manager, load the CUDA module as follows:
module load cuda11.1/toolkit/11.1.0
Converting the p2pbwLatencyTest from CUDA to HIP
After you download the p2pbandwidthLatencyTest, convert the test from CUDA to HIP.
There are two approaches to convert CUDA to HIP:
- hipify-perl—A Perl script that uses regular expressions to convert CUDA to HIP replacements. It is useful when direct replacements can solve the porting problem. It is a naïve converter that does not check for valid CUDA code. A disadvantage of the script is that it cannot transform some constructs. For more information, see https://github.com/ROCm-Developer-Tools/HIPIFY#-hipify-perl.
- hipify-clang—A tool that translates CUDA source code into an abstract syntax tree, which is traversed by transformation matchers. After performing all the transformations, HIP output is produced. For more information, see https://github.com/ROCm-Developer-Tools/HIPIFY#-hipify-clang.
For more information about HIPify, see the HIPify Reference Guide at https://docs.amd.com/bundle/HIPify-Reference-Guide-v5.1/page/HIPify.html.
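To make the hipify-perl approach concrete, the following toy Python sketch mimics its regular-expression substitution strategy on a handful of runtime API names. It is only an illustration of the idea, not the actual tool; the real hipify-perl ships a far larger rename table and also rewrites headers, kernel launches, and more:
import re

# A few representative CUDA-to-HIP runtime renames.
RENAMES = {
    r"\bcudaMalloc\b": "hipMalloc",
    r"\bcudaMemcpy\b": "hipMemcpy",
    r"\bcudaMemcpyHostToDevice\b": "hipMemcpyHostToDevice",
    r"\bcudaFree\b": "hipFree",
    r"\bcudaDeviceSynchronize\b": "hipDeviceSynchronize",
}

def toy_hipify(cuda_source: str) -> str:
    """Apply naive textual renames, in the spirit of hipify-perl."""
    hip_source = cuda_source
    for pattern, replacement in RENAMES.items():
        hip_source = re.sub(pattern, replacement, hip_source)
    return hip_source

print(toy_hipify("cudaMalloc(&buf, size); cudaMemcpy(buf, host, size, cudaMemcpyHostToDevice);"))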
To convert the p2pbwLatencyTest from CUDA to HIP:
- Clone the CUDA sample repository and run the conversion:
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
hipify-perl p2pBandwidthLatencyTest.cu > hip_converted.cpp
hipcc hip_converted.cpp -o p2pamd.ou
The following example shows the program output:
Figure 3: Output of the CUDAP2PBandWidthLatency test run on AMD GPUs
The output must include all the GPUs. In this use case, there are three GPUs: 0, 1, and 2.
- Use the rocminfo command to identify the GPUs in the server, and then use the rocm-smi command to confirm that all three GPUs are visible, as shown in the following figure:
Figure 4: Output of the rocm-smi command showing all three GPUs in the server
Conclusion
HIPify is a time-saving tool for converting CUDA code to run on AMD Instinct accelerators. Because the AMD software team makes consistent improvements, the software stack sees regular releases. The HIPify path is an automated way to support conversion from CUDA to a generalized framework. After your code is ported to HIP, it can run on accelerators from different vendors, which helps enable further development from a common platform.
This blog showed how to convert a sample use case from CUDA to HIP using the hipify-perl tool.
Run system information
Table 1: System details
Component | Description |
Operating system | CentOS Linux 8 (Core) |
ROCm version | 4.5 |
CUDA version | 11.1 |
Server | Dell PowerEdge R7525 |
CPU | 2 x AMD EPYC 7543 32-Core Processor |
Accelerator | AMD Instinct MI210 |
References
- https://github.com/ROCm-Developer-Tools
- https://github.com/ROCm-Developer-Tools/HIPIFY
- https://github.com/NVIDIA/cuda-samples
- https://github.com/ROCm-Developer-Tools/HIP/
- https://docs.amd.com/
- https://rocmdocs.amd.com/en/latest/Programming_Guides/HIP-porting-guide.html
- https://docs.amd.com/bundle/HIP_API_Guide/page/modules.html
- https://www.amd.com/en/graphics/servers-solutions-rocm
- https://www.amd.com/system/files/documents/the-amd-rocm-5-open-platform-for-hpc-and-ml-workloads.pdf

MLPerf™ Inference v2.0 Edge Workloads Powered by Dell PowerEdge Servers
Fri, 06 May 2022 19:54:11 -0000
Abstract
Dell Technologies recently submitted results to the MLPerf Inference v2.0 benchmark suite. This blog examines the results of two specialty edge servers: the Dell PowerEdge XE2420 server with the NVIDIA T4 Tensor Core GPU and the Dell PowerEdge XR12 server with the NVIDIA A2 Tensor Core GPU.
Introduction
It is 6:00 am on a Saturday morning. You drag yourself out of bed, splash water on your face, brush your hair, and head to your dimly lit kitchen for a bite to eat before your morning run. Today, you have decided to explore a new part of the neighborhood because your dog’s nose needs new bushes to sniff. As you wait for your bagel to toast, you ask your voice assistant “what’s the weather like?” Within a couple of seconds, you know that you need to grab an extra layer because there is a slight chance of rain. Edge computing has saved your morning run.
Although this use case is covered in the MLPerf Mobile benchmarks, the data discussed in this blog is from the MLPerf Inference benchmark that has been run on Dell servers.
Edge computing is computing that takes place at the “edge of networks.” Edge of networks refers to where devices such as phones, tablets, laptops, smart speakers, and even industrial robots can access the rest of the network. In this case, smart speakers can perform speech-to-text recognition to offload processing that ordinarily must be accomplished in the cloud. This offloading not only improves response time but also decreases the amount of sensitive data that is sent and stored in the cloud. The scope for edge computing expands far beyond voice assistants with use cases including autonomous vehicles, 5G mobile computing, smart cities, security, and more.
The Dell PowerEdge XE2420 and PowerEdge XR12 servers are designed for edge computing workloads. The design criteria are based on real-life scenarios such as extreme heat, dust, and vibration on factory floors. However, despite these servers not being physically located in a data center, server reliability and performance are not compromised.
PowerEdge XE2420 server
The PowerEdge XE2420 server is a specialty edge server that delivers high performance in harsh environments. This server is designed for demanding edge applications such as streaming analytics, manufacturing logistics, 5G cell processing, and other AI applications. It is a short-depth, dense, dual-socket, 2U server that can handle great environmental stress on its electrical and physical components. Also, this server is ideal for low-latency and large-storage edge applications because it supports 16x DDR4 RDIMM/LR-DIMM (12 DIMMs are balanced) at up to 2933 MT/s. Importantly, this server can support the following GPU/Flash PCI card configurations:
- Up to 2 x PCIe x16, up to 300 W passive FHFL cards (for example, NVIDIA V100/s or NVIDIA RTX6000)
- Up to 4 x PCIe x8; 75 W passive (for example, NVIDIA T4 GPU)
- Up to 2 x FE1 storage expansion cards (up to 20 x M.2 drives on each)
The following figures show the PowerEdge XE2420 server (source):
Figure 1: Front view of the PowerEdge XE2420 server
Figure 2: Rear view of the PowerEdge XE2420 server
PowerEdge XR12 server
The PowerEdge XR12 server is part of a line of rugged servers that deliver high performance and reliability in extreme conditions. This server is a marine-compliant, single-socket 2U server that offers boosted services for the edge. It includes one CPU that has up to 36 x86 cores, support for accelerators, DDR4, PCIe 4.0, persistent memory and up to six drives. Also, the PowerEdge XR12 server offers 3rd Generation Intel Xeon Scalable Processors.
The following figures show the PowerEdge XR12 server (source):
Figure 3: Front view of the PowerEdge XR12 server
Figure 4: Rear view of the PowerEdge XR12 server
Performance discussion
The following figure shows the comparison of the ResNet 50 Offline performance of various server and GPU configurations, including:
- PowerEdge XE8545 server with the 80 GB A100 GPU running Multi-Instance GPU (MIG) with seven 1g.10gb compute instances per GPU
- PowerEdge XR12 server with the A2 GPU
- PowerEdge XE2420 server with the T4 and A30 GPU
Figure 5: MLPerf Inference ResNet 50 Offline performance
ResNet 50 falls under the computer vision category of applications, which includes image classification, object detection, and object classification workloads.
The MIG numbers are per card and have been divided by 28: the four physical GPU cards in the system multiplied by the seven MIG instances on each card. The non-MIG numbers are also per card.
For the ResNet 50 benchmark, the PowerEdge XE2420 server with the T4 GPU showed more than double the performance of the PowerEdge XR12 server with the A2 GPU. The PowerEdge XE8545 server with the A100 MIG showed competitive performance when compared to the PowerEdge XE2420 server with the T4 GPU. The performance delta of 12.8 percent favors the PowerEdge XE2420 system. However, the PowerEdge XE2420 server with A30 GPU card takes the top spot in this comparison as it shows almost triple the performance over the PowerEdge XE2420 server with the T4 GPU.
The following figure shows a comparison of the SSD-ResNet 34 Offline performance of the PowerEdge XE8545 server with the A100 MIG and the PowerEdge XE2420 server with the A30 GPU.
Figure 6: MLPerf Inference SSD-ResNet 34 Offline performance
The SSD-ResNet 34 model falls under the computer vision category because it performs object detection. The PowerEdge XE2420 server with the A30 GPU card performed more than three times better than the PowerEdge XE8545 server with the A100 MIG.
The following figure shows a comparison of the Recurrent Neural Network Transducers (RNNT) Offline performance of the PowerEdge XR12 server with the A2 GPU and the PowerEdge XE2420 server with the T4 GPU:
Figure 7: MLPerf Inference RNNT Offline performance
The RNNT model falls under the speech recognition category, which can be used for applications such as automatic closed captioning in YouTube videos and voice commands on smartphones. However, for speech recognition workloads, the PowerEdge XE2420 server with the T4 GPU and the PowerEdge XR12 server with the A2 GPU are closer in terms of performance. There is only a 32 percent performance delta.
The following figure shows a comparison of the BERT Offline performance of default and high accuracy runs of the PowerEdge XR12 server with the A2 GPU and the PowerEdge XE2420 server with the A30 GPU:
Figure 8: MLPerf Inference BERT Offline performance
BERT is a state-of-the-art, language-representational model for Natural Language Processing applications such as sentiment analysis. Although the PowerEdge XE2420 server with the A30 GPU shows significant performance gains, the PowerEdge XR12 server with the A2 GPU excels when performance is weighed against cost.
The following figure shows a comparison of the Deep Learning Recommendation Model (DLRM) Offline performance for the PowerEdge XE2420 server with the T4 GPU and the PowerEdge XR12 server with the A2 GPU:
Figure 9: MLPerf Inference DLRM Offline performance
DLRM uses collaborative filtering and predictive analysis-based approaches to make recommendations, based on the dataset provided. Recommender systems are extremely important in search, online shopping, and online social networks. The performance of the PowerEdge XE2420 server with the T4 GPU in the Offline scenario was 40 percent better than that of the PowerEdge XR12 server with the A2 GPU.
Despite the higher performance from the PowerEdge XE2420 server with the T4 GPU, the PowerEdge XR12 server with the A2 GPU is an excellent option for edge-related workloads. The A2 GPU is designed for high performance at the edge and consumes less power than the T4 GPU for similar workloads. Also, the A2 GPU is the more cost-effective option.
Power Discussion
It is important to budget power consumption for the critical load in a data center. The critical load includes components such as servers, routers, storage devices, and security devices. For the MLPerf Inference v2.0 submission, Dell Technologies submitted power numbers for the PowerEdge XR12 server with the A2 GPU. Figures 10 through 13 showcase the performance and power results achieved on the PowerEdge XR12 system. The blue bars are the performance results, and the green bars are the system power results. For all power submissions with the A2 GPU, Dell Technologies took the Number One claim for performance per watt for the ResNet 50, RNNT, BERT, and DLRM benchmarks.
Figure 10: MLPerf Inference v2.0 ResNet 50 power results on the Dell PowerEdge XR12 server
Figure 11: MLPerf Inference v2.0 RNNT power results on the Dell PowerEdge XR12 server
Figure 12: MLPerf Inference v2.0 BERT power results on the Dell PowerEdge XR12 server
Figure 13: MLPerf Inference v2.0 DLRM power results on the Dell PowerEdge XR12 server
Note: During our submission to MLPerf Inference v2.0 including power numbers, the PowerEdge XR12 server was not tuned for optimal performance per watt score. These results reflect the performance-optimized power consumption numbers of the server.
Conclusion
This blog takes a closer look at Dell Technologies’ MLPerf Inference v2.0 edge-related submissions. Readers can compare performance results between the Dell PowerEdge XE2420 server with the T4 GPU and the Dell PowerEdge XR12 server with the A2 GPU with other systems with different accelerators. This comparison helps readers make informed decisions about ML workloads on the edge. Performance, power consumption, and cost are the important factors to consider when planning any ML workload. Both the PowerEdge XR12 and XE2420 servers are excellent choices for Deep Learning workloads on the edge.
Appendix
SUT configuration
The following table describes the System Under Test (SUT) configurations from MLPerf Inference v2.0 submissions:
Table 1: MLPerf Inference v2.0 system configuration of the PowerEdge XE2420 and XR12 servers
Platform | PowerEdge XE2420 1x T4, TensorRT | PowerEdge XR12 1x A2, TensorRT | PowerEdge XR12 1x A2, MaxQ, TensorRT | PowerEdge XE2420 2x A30, TensorRT |
MLPerf system ID | XE2420_T4x1_edge_TRT | XR12_edge_A2x1_TRT | XR12_A2x1_TRT_MaxQ | XE2420_A30x2_TRT |
Operating system | CentOS 8.2.2004 | Ubuntu 20.04.4 | ||
CPU | Intel Xeon Gold 6238 CPU @ 2.10 GHz | Intel Xeon Gold 6312U CPU @ 2.40 GHz | Intel Xeon Gold 6252N CPU @ 2.30 GHz | |
Memory | 256 GB | 1 TB | ||
GPU | NVIDIA T4 | NVIDIA A2 | NVIDIA A30 | |
GPU form factor | PCIe | |||
GPU count | 1 | 2 | ||
Software stack | TensorRT 8.4.0 CUDA 11.6 cuDNN 8.3.2 Driver 510.47.03 DALI 0.31.0 |
Table 2: MLPerf Inference v1.1 system configuration of the PowerEdge XE8545 server
Platform | PowerEdge XE8545 4x A100-SXM-80GB-7x1g.10gb, TensorRT, Triton |
MLPerf system ID | XE8545_A100-SXM-80GB-MIG_28x1g.10gb_TRT_Triton |
Operating system | Ubuntu 20.04.2 |
CPU | AMD EPYC 7763 |
Memory | 1 TB |
GPU | NVIDIA A100-SXM-80GB (7x1g.10gb MIG) |
GPU form factor | SXM |
GPU count | 4 |
Software stack | TensorRT 8.0.2 CUDA 11.3 cuDNN 8.2.1 Driver 470.57.02 DALI 0.31.0 |