
Running the MLPerf™ Inference v1.1 Benchmark on Dell EMC Systems
Fri, 24 Sep 2021 16:51:50 -0000
This blog is a guide for running the MLPerf inference v1.1 benchmark. Information about how to run the MLPerf inference v1.1 benchmark is available online at different locations. This blog provides all the steps in one place.
MLPerf is a benchmarking suite that measures the performance of Machine Learning (ML) workloads. It focuses on the most important aspects of the ML life cycle: training and inference. For more information, see Introduction to MLPerf™ Inference v1.1 Performance with Dell EMC Servers.
This blog focuses on inference setup and describes the steps to run closed data center MLPerf inference v1.1 tests on Dell Technologies servers with NVIDIA GPUs. It enables you to run the tests and reproduce the results that we observed in our HPC & AI Innovation Lab. For details about the hardware and the software stack for different systems in the benchmark, see this list of systems.
The MLPerf inference v1.1 suite contains the following benchmarks:
- Resnet50
- SSD-Resnet34
- BERT
- DLRM
- RNN-T
- 3D U-Net
Note: The BERT, DLRM, and 3D U-Net models have 99 percent (default accuracy) and 99.9 percent (high accuracy) targets.
This blog describes the steps to run all these benchmarks.
1 Getting started
A system under test consists of a defined set of hardware and software resources that will be measured for performance. The hardware resources may include processors, accelerators, memories, disks, and interconnect. The software resources may include an operating system, compilers, libraries, and drivers that significantly influence the running time of a benchmark. In this case, the system on which you clone the MLPerf repository and run the benchmark is known as the system under test (SUT).
For storage, SSD RAID or local NVMe drives are acceptable for running all the subtests without any penalty. Inference does not have strict requirements for fast parallel storage. However, a parallel file system such as BeeGFS or Lustre, or a storage solution such as PixStor, helps when making multiple copies of large datasets.
2 Prerequisites
Prerequisites for running the MLPerf inference v1.1 tests include:
- An x86_64 Dell EMC system
- Docker installed with the NVIDIA runtime hook
- Ampere-based NVIDIA GPUs (Turing GPUs have legacy support but are no longer maintained for optimizations.)
- NVIDIA driver version 470.xx or later
- ECC turned on (required as of inference v1.0)
To set ECC to on, run the following command:
sudo nvidia-smi --ecc-config=1
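Because an ECC configuration change takes effect only after the GPU is reset or the system is rebooted, it can be useful to confirm the ECC state before starting a run. The following optional check is a minimal sketch using standard nvidia-smi query fields; it is not part of the official benchmark steps:
nvidia-smi --query-gpu=index,name,ecc.mode.current,ecc.mode.pending --format=csv   # both ECC columns should report Enabled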
3 Preparing to run the MLPerf inference v1.1 benchmark
Before you can run the MLPerf inference v1.1 tests, perform the following tasks to prepare your environment.
3.1 Clone the MLPerf repository
- Clone the repository to your home directory or another acceptable path:
cd ~
git clone https://github.com/mlperf/inference_results_v1.1
- Go to the closed/DellEMC directory:
cd inference_results_v1.1/closed/DellEMC
- Create a “scratch” directory with at least 3 TB of space in which to store the models, datasets, preprocessed data, and so on:
mkdir scratch
- Export the absolute path for $MLPERF_SCRATCH_PATH with the scratch directory:
export MLPERF_SCRATCH_PATH=/home/user/inference_results_v1.1/closed/DellEMC/scratch
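As an optional sanity check (not part of the official steps), confirm that the variable is set and that the scratch file system has enough free space before downloading the datasets:
echo $MLPERF_SCRATCH_PATH   # should print the absolute path to the scratch directory
df -h $MLPERF_SCRATCH_PATH  # confirm that at least 3 TB is available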
3.2 Set up the configuration file
The closed/DellEMC/configs directory includes an __init__.py file that lists configurations for different Dell EMC servers that were systems in the MLPerf Inference v1.1 benchmark. If necessary, modify the configs/<benchmark>/<Scenario>/__init__.py file to include the system that will run the benchmark.
Note: If your system is already present in the configuration file, there is no need to add another configuration.
In the configs/<benchmark>/<Scenario>/__init__.py file, select a similar configuration and modify it based on the current system, matching the number and type of GPUs in your system.
For this blog, we used a Dell EMC PowerEdge R7525 server with one A100 GPU as the example. We chose R7525_A100_PCIE_40GBx1 as the name for this new system. Because the R7525_A100_PCIE_40GBx1 system is not already in the list of systems, we added it as a new configuration.
Because the R7525_A100_PCIE_40GBx3 reference system is the most similar, we modified that configuration and picked Resnet50 Server as the example benchmark.
The following example shows the reference configuration for three GPUs for the Resnet50 Server benchmark in the configs/resnet50/Server/__init__.py file:
@ConfigRegistry.register(HarnessType.LWIS, AccuracyTarget.k_99, PowerSetting.MaxP)
class R7525_A100_PCIE_40GBx3(BenchmarkConfiguration):
    system = System("R7525_A100-PCIE-40GB", Architecture.Ampere, 3)
    active_sms = 100
    input_dtype = "int8"
    input_format = "linear"
    map_path = "data_maps/<dataset_name>/val_map.txt"
    precision = "int8"
    tensor_path = "${PREPROCESSED_DATA_DIR}/<dataset_name>/ResNet50/int8_linear"
    use_deque_limit = True
    deque_timeout_usec = 5742
    gpu_batch_size = 205
    gpu_copy_streams = 11
    gpu_inference_streams = 9
    server_target_qps = 91250
    use_cuda_thread_per_device = True
    use_graphs = True
    scenario = Scenario.Server
    benchmark = Benchmark.ResNet50
    start_from_device = True
This example shows the modified configuration for one GPU:
@ConfigRegistry.register(HarnessType.LWIS, AccuracyTarget.k_99, PowerSetting.MaxP)
class R7525_A100_PCIE_40GBx1(BenchmarkConfiguration):
    system = System("R7525_A100-PCIE-40GB", Architecture.Ampere, 1)
    active_sms = 100
    input_dtype = "int8"
    input_format = "linear"
    map_path = "data_maps/<dataset_name>/val_map.txt"
    precision = "int8"
    tensor_path = "${PREPROCESSED_DATA_DIR}/<dataset_name>/ResNet50/int8_linear"
    use_deque_limit = True
    deque_timeout_usec = 5742
    gpu_batch_size = 205
    gpu_copy_streams = 11
    gpu_inference_streams = 9
    server_target_qps = 30400
    use_cuda_thread_per_device = True
    use_graphs = True
    scenario = Scenario.Server
    benchmark = Benchmark.ResNet50
    start_from_device = True
We modified the queries per second (QPS) parameter (server_target_qps) to match the number of GPUs. The server_target_qps parameter scales linearly with the GPU count, so QPS = number of GPUs x QPS per GPU. For example, the three-GPU reference configuration specifies server_target_qps = 91250, or roughly 30417 QPS per GPU; for one GPU, we therefore set server_target_qps to 30400 in accordance with the expected performance of a single GPU.
3.3 Add the new system
After you add the new system to the __init__.py file as shown in the preceding example, add the new system to the list of available systems. The list of available systems is in the code/common/system_list.py file. This entry informs the benchmark that a new system exists and ensures that the benchmark selects the correct configuration.
Note: If your system is already added, there is no need to add it to the code/common/system_list.py file.
Add the new system to the list of available systems in the code/common/system_list.py file.
At the end of the file, there is a class called KnownSystems. This class defines a list of SystemClass objects that describe supported systems as shown in the following example:
SystemClass(<system ID>, [<list of names reported by nvidia-smi>], [<known PCI IDs of this system>], <architecture>, [<list of known supported GPU counts>])
Where:
- For <system ID>, enter the system ID with which you want to identify this system.
- For <list of names reported by nvidia-smi>, run the nvidia-smi -L command and use the name that is returned.
- For <known PCI IDs of this system>, run the following command:
$ CUDA_VISIBLE_ORDER=PCI_BUS_ID nvidia-smi --query-gpu=gpu_name,pci.device_id --format=csv
name, pci.device_id
A100-PCIE-40GB, 0x20F110DE
This pci.device_id field is in the 0x<PCI ID>10DE format, where 10DE is the NVIDIA PCI vendor ID. Use the four hexadecimal digits between 0x and 10DE as your PCI ID for the system. In this case, it is 20F1. (A convenience snippet for extracting this value appears at the end of this section.)
- For <architecture>, use the architecture Enum, which is at the top of the file. In this case, the A100 GPU uses the Ampere architecture.
- For <list of known GPU counts>, enter the number of GPUs of the systems that you want to support (that is, [1,2,4] if you want to support 1x, 2x, and 4x GPU variants of this system). Because we already have a 3x variant in the system_list.py file, we only need to include the number 1 as an additional entry.
Note: Because a configuration is already present for the PowerEdge R7525 server, we added the number 1 for our configuration, as shown in the following example. If your system does not exist in the system_list.py file, add the entire configuration and not just the number.
class KnownSystems:
    """ Global List of supported systems """
    # Before the addition of 1, this config only supported the R7525_A100-PCIE-40GB x3 variant:
    # R7525_A100_PCIE_40GB = SystemClass("R7525_A100-PCIE-40GB", ["A100-PCIE-40GB"], ["20F1"], Architecture.Ampere, [3])
    # After the addition, this config supports both the R7525_A100-PCIE-40GB x1 and x3 variants:
    R7525_A100_PCIE_40GB = SystemClass("R7525_A100-PCIE-40GB", ["A100-PCIE-40GB"], ["20F1"], Architecture.Ampere, [1, 3])
    DSS8440_A100_PCIE_80GB = SystemClass("DSS8440_A100-PCIE-80GB", ["A100-PCIE-80GB"], ["20B5"], Architecture.Ampere, [10])
    DSS8440_A30 = SystemClass("DSS8440_A30", ["A30"], ["20B7"], Architecture.Ampere, [8], valid_mig_slices=[MIGSlice(1, 6), MIGSlice(2, 12), MIGSlice(4, 24)])
    R750xa_A100_PCIE_40GB = SystemClass("R750xa_A100-PCIE-40GB", ["A100-PCIE-40GB"], ["20F1"], Architecture.Ampere, [4])
    R750xa_A100_PCIE_80GB = SystemClass("R750xa_A100-PCIE-80GB", ["A100-PCIE-80GB"], ["20B5"], Architecture.Ampere, [4], valid_mig_slices=[MIGSlice(1, 10), MIGSlice(2, 20), MIGSlice(3, 40)])
Note: You must provide different configurations in the configs/resnet50/Server/__init__.py file for the x1 variant and x3 variant. In the preceding example, the R7525_A100-PCIE-40GBx3 configuration is different from the R7525_A100-PCIE-40GBx1 configuration.
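As a convenience, the following optional one-liner is one way to extract the four-digit PCI ID described above; it simply strips the leading 0x and the trailing NVIDIA vendor ID (10DE) from the nvidia-smi output and is not part of the official workflow:
nvidia-smi --query-gpu=pci.device_id --format=csv,noheader | head -n 1 | sed 's/^0x//; s/10DE$//'
# example output for an A100-PCIE-40GB: 20F1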
3.4 Build the Docker image and required libraries
Build the Docker image and then launch an interactive container. Then, in the interactive container, build the required libraries for inferencing.
- To build the Docker image, run the make prebuild command inside the closed/DellEMC folder:
Command:
make prebuild
The following example shows sample output:
Launching Docker session
nvidia-docker run --rm -it -w /work \
-v /home/user/article_inference_v1.1/closed/DellEMC:/work -v /home/user:/mnt//home/user \
--cap-add SYS_ADMIN \
-e NVIDIA_VISIBLE_DEVICES=0 \
--shm-size=32gb \
-v /etc/timezone:/etc/timezone:ro -v /etc/localtime:/etc/localtime:ro \
--security-opt apparmor=unconfined --security-opt seccomp=unconfined \
--name mlperf-inference-user -h mlperf-inference-user --add-host mlperf-inference-user:127.0.0.1 \
--user 1002:1002 --net host --device /dev/fuse \
-v /home/user/inference_results_v1.1/closed/DellEMC/scratch:/home/user/inference_results_v1.1/closed/DellEMC/scratch \
-e MLPERF_SCRATCH_PATH=/home/user/inference_results_v1.1/closed/DellEMC/scratch \
-e HOST_HOSTNAME=node009 \
mlperf-inference:user
The Docker container is launched with all the necessary packages installed.
- Access the interactive terminal on the container.
- To build the required libraries for inferencing, run the make build command inside the interactive container:
Command
make build
The following example shows sample output:
(mlperf) user@mlperf-inference-user:/work$ make build
.......
[ 26%] Linking CXX executable /work/build/bin/harness_default
make[4]: Leaving directory '/work/build/harness'
make[4]: Leaving directory '/work/build/harness'
make[4]: Leaving directory '/work/build/harness'
[ 36%] Built target harness_bert
[ 50%] Built target harness_default
[ 55%] Built target harness_dlrm
make[4]: Leaving directory '/work/build/harness'
[ 63%] Built target harness_rnnt
make[4]: Leaving directory '/work/build/harness'
[ 81%] Built target harness_triton
make[4]: Leaving directory '/work/build/harness'
[100%] Built target harness_triton_mig
make[3]: Leaving directory '/work/build/harness'
make[2]: Leaving directory '/work/build/harness'
Finished building harness.
make[1]: Leaving directory '/work'
(mlperf) user@mlperf-inference-user:/work$
The container in which you can run the benchmarks is built.
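Before generating engines, it is worth confirming that the GPUs are visible from inside the container. A simple optional check, assuming the interactive session from the previous step, is:
nvidia-smi -L   # should list one entry per GPU that the benchmark is expected to use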
3.5 Download and preprocess validation data and models
To run the MLPerf inference v1.1 benchmarks, download the datasets and models, and then preprocess them. MLPerf provides scripts that download the trained models. The scripts also download the datasets for benchmarks other than Resnet50, DLRM, and 3D U-Net.
For Resnet50, DLRM, and 3D U-Net, register for an account and then download the datasets manually:
- Resnet50—Download the most common image classification 2012 validation set (https://github.com/mlcommons/training/tree/master/image_classification) and extract the downloaded file to $MLPERF_SCRATCH_PATH/data/<dataset_name>/
- DLRM—Download the Criteo Terabyte dataset and extract the downloaded file to $MLPERF_SCRATCH_PATH/data/criteo/
- 3D U-Net—Download the BraTS challenge data and extract the downloaded file to $MLPERF_SCRATCH_PATH/data/BraTS/MICCAI_BraTS_2019_Data_Training
Run the following commands to download all the models, download the datasets (except for Resnet50, DLRM, and 3D U-Net, which you download manually as described above), and then preprocess them:
$ make download_model # Downloads models and saves to $MLPERF_SCRATCH_PATH/models
$ make download_data # Downloads datasets and saves to $MLPERF_SCRATCH_PATH/data
$ make preprocess_data # Preprocesses data and saves to $MLPERF_SCRATCH_PATH/preprocessed_data
Note: These commands download all the datasets, which might not be required if the objective is to run one specific benchmark. To run a specific benchmark rather than all the benchmarks, see the following sections for information about the specific benchmark.
(mlperf) user@mlperf-inference-user:/work$ tree -d -L 1
.
├── build
├── code
├── compliance
├── configs
├── data_maps
├── docker
├── measurements
├── power
├── results
├── scripts
└── systems
The different folders are as follows:
- build—Logs, preprocessed data, engines, models, plugins, and so on
- code—Source code for all the benchmarks
- compliance—Passed compliance checks
- configs—Configurations that run different benchmarks for different system setups
- data_maps—Data maps for different benchmarks
- docker—Docker files to support building the container
- measurements—Measurement values for different benchmarks
- power—Files specific to power submission (only needed for power submissions)
- results—Final result logs
- scratch—Storage for models, preprocessed data, and the dataset that is symlinked to the preceding build directory
- scripts—Support scripts
- systems—Hardware and software details of systems in the benchmark
4 Running the benchmarks
After you have performed the preceding tasks to prepare your environment, run any of the benchmarks that are required for your tests.
The Resnet50, SSD-Resnet34, and RNN-T benchmarks have 99 percent (default accuracy) targets.
The BERT, DLRM, and 3D U-Net benchmarks have 99 percent (default accuracy) and 99.9 percent (high accuracy) targets. For information about running these benchmarks, see the Running high accuracy target benchmarks section below.
If you downloaded and preprocessed all the datasets (as shown in the previous section), there is no need to do so again. Skip the download and preprocessing steps in the procedures for the following benchmarks.
NVIDIA TensorRT is the inference engine for the backend. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning applications.
4.1 Run the Resnet50 benchmark
To set up the Resnet50 dataset and model to run the inference:
- If you already downloaded and preprocessed the datasets, go to step 5.
- Download the required validation dataset (https://github.com/mlcommons/training/tree/master/image_classification).
- Extract the images to $MLPERF_SCRATCH_PATH/data/<dataset_name>/
- Run the following commands:
make download_model BENCHMARKS=resnet50
make preprocess_data BENCHMARKS=resnet50
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engine for both Offline and Server scenario
make generate_engines RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline,Server --config_ver=default"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
The following example shows the output for a PerformanceOnly mode and displays a “VALID” result:
=======================
Perf harness results:
=======================
R7525_A100-PCIe-40GBx1_TRT-default-Server:
    resnet50: Scheduled samples per second : 30400.32 and Result is : VALID
=======================
Accuracy results:
=======================
R7525_A100-PCIe-40GBx1_TRT-default-Server:
    resnet50: No accuracy results in PerformanceOnly mode.
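The harness also writes the raw LoadGen logs under the build directory (described as “Logs, preprocessed data, engines, models, plugins, and so on” in the directory overview earlier). Assuming the typical build/logs layout, a quick way to locate the most recent summary is:
find build/logs -name "mlperf_log_summary.txt" | sort | tail -n 1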
4.2 Run the SSD-Resnet34 benchmark
To set up the SSD-Resnet34 dataset and model to run the inference:
- If necessary, download and preprocess the dataset:
make download_model BENCHMARKS=ssd-resnet34
make download_data BENCHMARKS=ssd-resnet34
make preprocess_data BENCHMARKS=ssd-resnet34
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engine for both Offline and Server scenario
make generate_engines RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline,Server --config_ver=default"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
4.3 Run the RNN-T benchmark
To set up the RNN-T dataset and model to run the inference:
- If necessary, download and preprocess the dataset:
make download_model BENCHMARKS=rnnt
make download_data BENCHMARKS=rnnt
make preprocess_data BENCHMARKS=rnnt
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engine for both Offline and Server scenario
make generate_engines RUN_ARGS="--benchmarks=rnnt --scenarios=Offline,Server --config_ver=default"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
5 Running high accuracy target benchmarks
The BERT, DLRM, and 3D U-Net benchmarks have high accuracy targets.
5.1 Run the BERT benchmark
To set up the BERT dataset and model to run the inference:
- If necessary, download and preprocess the dataset:
make download_model BENCHMARKS=bert
make download_data BENCHMARKS=bert
make preprocess_data BENCHMARKS=bert
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engine for both Offline and Server scenario and also for default and high accuracy targets.
make generate_engines RUN_ARGS="--benchmarks=bert --scenarios=Offline,Server --config_ver=default,high_accuracy"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=high_accuracy --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=high_accuracy --test_mode=AccuracyOnly"
5.2 Run the DLRM benchmark
To set up the DLRM dataset and model to run the inference:
- If you already downloaded and preprocessed the datasets, go to step 5.
- Download the Criteo Terabyte dataset.
- Extract the images to $MLPERF_SCRATCH_PATH/data/criteo/ directory.
- Run the following commands:
make download_model BENCHMARKS=dlrm
make preprocess_data BENCHMARKS=dlrm
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engine for both Offline and Server scenario and also for default and high accuracy targets.
make generate_engines RUN_ARGS="--benchmarks=dlrm --scenarios=Offline,Server --config_ver=default,high_accuracy"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=high_accuracy --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=high_accuracy --test_mode=AccuracyOnly"
5.3 Run the 3D U-Net benchmark
Note: This benchmark only has the Offline scenario.
To set up the 3D U-Net dataset and model to run the inference:
- If you already downloaded and preprocessed the datasets, go to step 5.
- Download the BraTS challenge data.
- Extract the images to the $MLPERF_SCRATCH_PATH/data/BraTS/MICCAI_BraTS_2019_Data_Training directory.
- Run the following commands:
make download_model BENCHMARKS=3d-unet
make preprocess_data BENCHMARKS=3d-unet
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engines for the Offline scenario for both default and high accuracy targets.
make generate_engines RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default,high_accuracy"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
6 Limitations and best practices for running MLPerf
Note the following limitations and best practices:
- To build the engine and run the benchmark by using a single command, use the make run RUN_ARGS… shortcut. The shortcut is a valid alternative to the make generate_engines … && make run_harness … commands (see the examples after this list).
- Include the --fast flag in RUN_ARGS to test a run quickly; it reduces the run time to one minute. For example:
make run_harness RUN_ARGS="--fast --benchmarks=<bmname> --scenarios=<scenario> --config_ver=<cver> --test_mode=PerformanceOnly"
The benchmark runs for one minute instead of the default 10 minutes.
- If the results of a Server scenario run are “INVALID”, reduce the server_target_qps value. “INVALID” results are expected when the latency constraints are not met during the run.
- If the results are “INVALID” for an Offline scenario run, increase the offline_expected_qps. “INVALID” runs for the Offline scenario occur when the system can deliver a significantly higher QPS than what is provided through the offline_expected_qps configuration.
- If the batch size changes, rebuild the engine.
- Only the BERT, DLRM, and 3D U-Net benchmarks support high accuracy targets.
- 3D U-Net has only the Offline scenario.
- To run with the Triton Inference Server, pass triton (for the default target) or high_accuracy_triton (for the high accuracy target) in the config_ver argument, as shown in the examples after this list.
- When running a command with RUN_ARGS, be aware of the quotation marks. Errors can occur if you omit the quotation marks.
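The following commands are a minimal sketch of the make run shortcut and the Triton config_ver values mentioned in this list, using the Resnet50 and BERT benchmarks from the earlier sections as examples; adjust the benchmark, scenario, and config_ver values for your system:
# build the engine and run the performance harness in one step (shortcut for generate_engines + run_harness)
make run RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
# run the BERT harness through the Triton Inference Server backend for the default and high accuracy targets
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=triton --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=high_accuracy_triton --test_mode=PerformanceOnly"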
Conclusion
This blog provided the step-by-step procedures to run and reproduce closed data center MLPerf inference v1.1 results on Dell EMC servers with NVIDIA GPUs.
Related Blog Posts

Choosing a PowerEdge Server and NVIDIA GPUs for AI Inference at the Edge
Fri, 05 May 2023 16:38:19 -0000
Dell Technologies submitted several benchmark results for the latest MLCommons™ Inference v3.0 benchmark suite. An objective was to provide information to help customers choose a favorable server and GPU combination for their workload. This blog reviews the Edge benchmark results and provides information about how to determine the best server and GPU configuration for different types of ML applications.
Results overview
For computer vision workloads, which are widely used in security systems, industrial applications, and even in self-driving cars, ResNet and RetinaNet results were submitted. ResNet is an image classification task and RetinaNet is an object detection task. The following figures show that for intensive processing, the NVIDIA A30 GPU, which is a double-wide card, provides the best performance with almost two times more images per second than the NVIDIA L4 GPU. However, the NVIDIA L4 GPU is a single-wide card that requires only 43 percent of the energy consumption of the NVIDIA A30 GPU, considering the nominal Thermal Design Power (TDP) of each GPU. This low energy consumption provides a great advantage for applications that need lower power consumption or that run in environments that are more challenging to cool. The NVIDIA L4 GPU is the replacement for the best-selling NVIDIA T4 GPU and provides twice the performance with the same form factor. Therefore, we see this card as the best option for most Edge AI workloads.
Conversely, the NVIDIA A2 GPU has the lowest price (compared to the NVIDIA A30 GPU's price), power consumption (TDP), and performance levels among the options available in the market. Therefore, if the application is compatible with this GPU, it has the potential to deliver the lowest total cost of ownership (TCO).
Figure 1: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the ResNet Offline benchmark
Figure 2: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the RetinaNet Offline benchmark
The 3D-UNet benchmark is the other computer vision image-related benchmark. It uses medical images for volumetric segmentation. We saw the same results for the default accuracy and high accuracy modes. Moreover, the NVIDIA A30 GPU delivered significantly better performance than the NVIDIA L4 GPU. However, the same considerations about energy consumption, space, and cooling capacity discussed previously apply when deciding which GPU to use for each application and use case.
Figure 3: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the 3D-UNet Offline benchmark
Another important benchmark is for BERT, which is a Natural Language Processing model that performs tasks such as question answering and text summarization. We observed similar performance differences between the NVIDIA A30, L4, T4, and A2 GPUs. The higher the value, the better.
Figure 4: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the BERT Offline benchmark
MLPerf benchmarks also include latency results, which are the time that systems take to process requests. For some use cases, this processing time can be more critical than the number of requests that can be processed per second. For example, if it takes several seconds to respond to a conversational algorithm or an object detection query that needs a real-time response, this time can be particularly impactful on the experience of the user or application.
As shown in the following figures, the NVIDIA A30 and NVIDIA L4 GPUs have similar latency results; depending on the workload, either GPU can provide the lowest latency. For customers planning to replace the NVIDIA T4 GPU or seeking a better response time for their applications, the NVIDIA L4 GPU is an excellent option. The NVIDIA A2 GPU can also be used for applications that require low latency because it performed better than the NVIDIA T4 GPU in single-stream workloads. The lower the value, the better.
Figure 4: Latency comparison of NVIDIA A30, L4, T4, and A2 GPUs for the ResNet single-stream and multistream benchmark
Figure 5: Latency comparison of NVIDIA A30, L4, T4, and A2 GPUs for the RetinaNet single-stream and multistream benchmark and the BERT single-stream benchmark
Dell Technologies submitted results to various benchmarks to help identify which configuration is the most environmentally friendly, as the data center's carbon footprint is a concern today. This concern is relevant because some edge locations have power and cooling limitations. Therefore, it is important to understand performance compared to power consumption.
The following figure shows that the NVIDIA L4 GPU has equal or better performance per watt compared to the NVIDIA A2 GPU, even with higher power consumption. For Throughput and Perf/watt values, higher is better; for Power (watt) values, lower is better.
Figure 6: NVIDIA L4 and A2 GPU power consumption comparison
Conclusion
With measured workload benchmarks on MLPerf Inference 3.0, we can conclude that all NVIDIA GPUs tested for Edge workloads have characteristics that address several use cases. Customers must evaluate size, performance, latency, power consumption, and price. Depending on the requirements of the application, one of the evaluated GPUs will provide the best result for the final use case.
Another important conclusion is that the NVIDIA L4 GPU can be considered as an exceptional upgrade for customers and applications running on NVIDIA T4 GPUs. The migration to this new GPU can help consolidate the amount of equipment, reduce the power consumption, and reduce the TCO; one NVIDIA L4 GPU can provide twice the performance of the NVIDIA T4 GPU for some workloads.
With this benchmark, Dell Technologies demonstrates its broad portfolio that provides the infrastructure for any type of customer requirement.
The following blogs provide analyses of other MLPerf™ benchmark results:
- Dell Servers Excel in MLPerf™ Inference 3.0 Performance
- Dell Technologies’ NVIDIA H100 SXM GPU submission to MLPerf™ Inference 3.0
- Empowering Enterprises with Generative AI: How Does MLPerf™ Help Support
- Comparison of Top Accelerators from Dell Technologies’ MLPerf™
References
For more information about Dell PowerEdge servers, go to the following links:
- Dell’s PowerEdge XR7620 for Telecom/Edge Compute
- Dell’s PowerEdge XR5610 for Telecom/Edge Compute
- PowerEdge XR4520c Compute Sled specification sheet
- PowerEdge XE2420 Spec Sheet
For more information about NVIDIA GPUs, go to the following links:
MLCommons™ Inference v3.0 results presented in this document are based on the following system IDs:
ID | Submitter | Availability | System |
---|---|---|---|
2.1-0005 | Dell Technologies | Available | Dell PowerEdge XE2420 (1x T4, TensorRT) |
2.1-0017 | Dell Technologies | Available | Dell PowerEdge XR4520c (1x A2, TensorRT) |
2.1-0018 | Dell Technologies | Available | Dell PowerEdge XR4520c (1x A30, TensorRT) |
2.1-0019 | Dell Technologies | Available | Dell PowerEdge XR4520c (1x A2, MaxQ, TensorRT) |
2.1-0125 | Dell Technologies | Preview | Dell PowerEdge XR5610 (1x L4, TensorRT, MaxQ) |
2.1-0126 | Dell Technologies | Preview | Dell PowerEdge XR7620 (1x L4, TensorRT) |
Table 1: MLPerf™ system IDs

Comparison of Top Accelerators from Dell Technologies’ MLPerf™ Inference v3.0 Submission
Fri, 21 Apr 2023 21:43:39 -0000
Abstract
Dell Technologies recently submitted results to MLPerf™ Inference v3.0 in the closed division. This blog highlights the NVIDIA H100 PCIe GPU and compares the results to the NVIDIA A100 PCIe GPU with the PCIe form factor held constant.
Introduction
The MLPerf Inference v3.0 submission falls under the benchmarking pillar of the MLCommons™ consortium, with the objective of making fair comparisons across server configurations. Submissions that are made to the closed division warrant an equitable comparison of the systems.
This blog highlights the closed division submissions Dell Technologies made with the NVIDIA A100 GPU using the PCIe (peripheral component interconnect express) form factor. The PCIe form factor is an interfacing standard for connecting various high-speed components in hardware such as a computer or a server. Servers include a certain number of PCIe slots in which to insert GPUs or other additional cards. Note that there are different physical configurations for the slots to indicate the number of lanes for data to travel to and from the PCIe card. The NVIDIA H100 GPU is the latest and greatest GPU, with NVIDIA AI Enterprise included; it is a dual-slot, air-cooled PCIe generation 5.0 GPU. This GPU runs at a memory bandwidth of roughly 2,000 gigabytes per second and supports up to seven Multi-Instance GPU instances at 10 gigabytes each. The NVIDIA A100 80 GB GPU is a dual-slot PCIe generation 4.0 GPU that runs at a memory bandwidth of roughly 2,000 gigabytes per second.
NVIDIA H100 PCIe GPU and NVIDIA A100 PCIe GPU comparison
In addition to making a submission with the NVIDIA A100 GPU, Dell Technologies made a submission with the NVIDIA H100 GPU. To make a fair comparison, the systems were identical and the PCIe form factor was held constant.
Platform | Dell PowerEdge R750xa (4x A100-PCIe-80GB, TensorRT) | Dell PowerEdge R750xa (4x H100-PCIe-80GB, TensorRT) |
---|---|---|
Round | V3.0 | |
MLPerf System ID | R750xa_A100_PCIe_80GBx4_TRT | R750xa_H100_PCIe_80GBx4_TRT |
Operating system | CentOS 8.2 | |
CPU | Intel Xeon Gold 6338 CPU @ 2.00 GHz | |
Memory | 1 TB | 1 TB |
GPU | NVIDIA A100-PCIe-80GB | NVIDIA H100-PCIe-80GB |
GPU form factor | PCIe | |
GPU memory configuration | HBM2e | |
GPU count | 4 | |
Software stack | TensorRT 8.6 CUDA 12.0 cuDNN 8.8.0 Driver 525.85.12 DALI 1.17.0 | TensorRT 8.6 CUDA 12.0 cuDNN 8.8.0 Driver 525.60.13 DALI 1.17.0 |
Table 1: Software stack of submissions made on NVIDIA A100 PCIe and NVIDIA H100 PCIe GPUs for MLPerf Inference v3.0 on the Dell PowerEdge R750xa server
In the following figure, the per-card numbers are normalized over the NVIDIA A100 GPU results to show a readable comparison of the GPUs on the same system. Across object detection, medical image segmentation, speech to text, and natural language processing, the latest NVIDIA H100 GPU outperforms its predecessor in all categories. Note the outstanding performance of the Dell PowerEdge R750xa server with NVIDIA H100 GPUs on the BERT benchmark in the high accuracy mode. With the advancements in generative artificial intelligence, the Dell PowerEdge R750xa server is a versatile, reliable, and high-performing platform.
Figure 1: Normalized per GPU comparison of NVIDIA A100 and NVIDIA H100 GPUs on the Dell PowerEdge R750xa server
The following figures show absolute numbers for a comparison of the NVIDIA H100 and NVIDIA A100 GPUs.
Figure 2: Per GPU comparison of NVIDIA A100 and NVIDIA H100 GPUs for RetinaNet on the PowerEdge R750xa server
Figure 3: Per GPU comparison of NVIDIA A100 and NVIDIA H100 GPUs for 3D-Unet on the PowerEdge R750xa server
Figure 4: Per GPU comparison of NVIDIA A100 and NVIDIA H100 GPUs for RNNT on the PowerEdge R750xa server
Figure 5: Per GPU comparison of NVIDIA A100 and NVIDIA H100 GPUs for BERT on the PowerEdge R750xa server
These results can be found on the MLCommons website.
Submissions made with the NVIDIA A100 PCIe GPU
In this round of submissions, Dell Technologies submitted results on the PowerEdge R750xa server packaged with four NVIDIA A100 80 GB PCIe GPUs. In previous rounds, the PowerEdge R750xa server showed outstanding performance across all the benchmarks. For a deeper dive into a previous round's submission, check out our blog from MLPerf Inference v2.0. In the previous round (MLPerf Inference v2.1), Dell Technologies submitted results on an identical system; the main difference across the two rounds is the upgraded software stack, as described in the following table:
Platform | Dell PowerEdge R750xa (4x A100-PCIe-80GB, TensorRT) | Dell PowerEdge R750xa (4x A100-PCIe-80GB, TensorRT) |
---|---|---|
Round | V3.0 | V2.1 |
MLPerf System ID | R750xa_A100_PCIe_80GBx4_TRT | |
Operating system | CentOS 8.2 | |
CPU | Intel Xeon Gold 6338 CPU @ 2.00 GHz | |
Memory | 512 GB | |
GPU | NVIDIA A100-PCIe-80GB | |
GPU form factor | PCIe | |
GPU memory configuration | HBM2e | |
GPU count | 4 | |
Software stack | TensorRT 8.6 CUDA 12.0 cuDNN 8.8.0 Driver 525.85.12 DALI 1.17.0 | TensorRT 8.4.2 CUDA 11.6 cuDNN 8.4.1 Driver 510.39.01 DALI 0.31.0 |
Table 2: Software stack for submissions made on the NVIDIA A100 PCIe GPU in MLPerf Inference v3.0 and v2.1
Comparison of PowerEdge R750xa NVIDIA A100 results from Inference v3.0 and v2.1
Object detection
The RetinaNet benchmark falls under the object detection category and uses the OpenImages dataset. The results from Inference v3.0 show a less than 0.05 percent difference in the Server scenario and a 21.53 percent difference in the Offline scenario. A potential reason for this result might be NVIDIA’s optimizations, as outlined in their technical blog.
Figure 6: RetinaNet Server and Offline results on the PowerEdge R750xa server from Inference v3.0 and Inference v2.1
Medical image segmentation
The 3D-Unet benchmark performs the KiTS 2019 kidney tumor segmentation task. Across the two rounds of submission, the PowerEdge R750xa server performed consistently well with a 0.3 percent difference in both the default and high accuracy modes.
Figure 7: 3D-UNet Offline results on the PowerEdge R750xa server from Inference v3.0 and v2.1
Speech to text
The Recurrent Neural Network Transducer (RNNT) model falls under the speech recognition category. This benchmark accepts raw audio samples and produces the corresponding character transcription. The results are within a 2.25 percent difference in the Server scenario and a 0.41 percent difference in the Offline scenario.
Figure 8: RNNT Server and Offline results on the Dell PowerEdge R750xa server from Inference v3.0 and v2.1
Natural language processing
Bidirectional Encoder Representation from Transformers (BERT) is a state-of-the-art language representational model for Natural Language Processing applications. This benchmark performs the SQuAD question answering task. The BERT benchmark consists of default and high accuracy modes for the Offline and Server scenarios. For the Server scenarios, the default mode results are within a 1.69 percent range and 3.12 percent range for the high accuracy mode. For the Offline scenarios, a similar behavior is noticeable in which the default mode results are within a 0.86 percent range and 3.65 percent range in the high accuracy mode.
Figure 9: BERT Server and Offline results on the PowerEdge R750xa server from Inference v3.0 and v2.1
Conclusion
Across the various rounds of submissions to the MLPerf Inference benchmark suite, the PowerEdge R750xa server has been a consistent top performer for machine learning tasks ranging from object detection and medical image segmentation to speech to text and natural language processing. The PowerEdge R750xa server continues to be an excellent server choice for machine learning inference workloads. Customers can take advantage of the diverse results submitted on the Dell PowerEdge R750xa server with the NVIDIA H100 GPU to make an informed decision for their specific solution needs.