
Running the MLPerf™ Inference v1.0 Benchmark on Dell EMC Systems
Fri, 24 Sep 2021 15:23:27 -0000
This blog is a guide for running the MLPerf inference v1.0 benchmark. Information about how to run the MLPerf inference v1.0 benchmark is available online at different locations. This blog provides all the steps in one place.
MLPerf is a benchmarking suite that measures the performance of Machine Learning (ML) workloads. It focuses on the most important aspects of the ML life cycle: training and inference. For more information, see Introduction to MLPerf™ Inference v1.0 Performance with Dell EMC Servers.
This blog focuses on inference setup and describes the steps to run MLPerf inference v1.0 tests on Dell Technologies servers with NVIDIA GPUs. It enables you to run the tests and reproduce the results that we observed in our HPC and AI Innovation Lab. For details about the hardware and the software stack for different systems in the benchmark, see this list of systems.
The MLPerf inference v1.0 suite contains the following models for benchmark:
- Resnet50
- SSD-Resnet34
- BERT
- DLRM
- RNN-T
- 3D U-Net
Note: The BERT, DLRM, and 3D U-Net models have 99% (default accuracy) and 99.9% (high accuracy) targets.
This blog describes steps to run all these benchmarks.
1 Getting started
A system under test consists of a defined set of hardware and software resources that will be measured for performance. The hardware resources may include processors, accelerators, memories, disks, and interconnect. The software resources may include an operating system, compilers, libraries, and drivers that significantly influence the running time of a benchmark. In this case, the system on which you clone the MLPerf repository and run the benchmark is known as the system under test (SUT).
For storage, SSD RAID or local NVMe drives are acceptable for running all the subtests without any penalty. Inference does not have strict requirements for fast-parallel storage. However, the BeeGFS or Lustre file system, the PixStor storage solution, and so on help make multiple copies of large datasets.
2 Prerequisites
Prerequisites for running the MLPerf inference v1.0 tests include:
- An x86_64 system
- Docker installed with the NVIDIA runtime hook
- Ampere-based NVIDIA GPUs (Turing GPUs include legacy support, but are no longer maintained for optimizations)
- NVIDIA Driver Version 455.xx or later
- ECC set to ON
To set ECC to ON, run the following command:
sudo nvidia-smi --ecc-config=1
3 Preparing to run the MLPerf inference v1.0
Before you can run the MLPerf inference v1.0 tests, perform the following tasks to prepare your environment.
3.1 Clone the MLPerf repository
- Clone the repository to your home directory or to another acceptable path:
cd ~
git clone https://github.com/mlcommons/inference_results_v1.0
- Go to the closed/DellEMC directory:
cd inference_results_v1.0/closed/DellEMC
- Create a “scratch” directory with at least 3 TB of space in which to store the models, datasets, preprocessed data, and so on:
mkdir scratch
- Export the absolute path for $MLPERF_SCRATCH_PATH with the scratch directory:
export MLPERF_SCRATCH_PATH=/home/user/inference_results_v1.0/closed/DellEMC/scratch
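Before starting the large downloads, it can help to confirm that the scratch filesystem really has the required space. The following is a minimal sketch (the path and the 3 TB threshold are illustrative; adjust them to your system):

```shell
# Sketch: warn early if the scratch filesystem has less than 3 TB free.
# The path is illustrative; point it at your actual scratch directory.
scratch="${MLPERF_SCRATCH_PATH:-/tmp}"
avail_kb=$(df -Pk "$scratch" | awk 'NR==2 {print $4}')
need_kb=$((3 * 1024 * 1024 * 1024))   # 3 TB expressed in 1 KB blocks
if [ "$avail_kb" -ge "$need_kb" ]; then
    echo "scratch OK: ${avail_kb} KB available"
else
    echo "scratch too small: ${avail_kb} KB available, need ${need_kb} KB"
fi
```

The -P flag keeps df output on one line per filesystem, which makes the awk extraction reliable.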
3.2 Set up the configuration file
The closed/DellEMC/configs directory includes a config.json file that lists configurations for different Dell servers that were systems in the MLPerf Inference v1.0 benchmark. If necessary, modify the configs/<benchmark>/<Scenario>/config.json file to include the system that will run the benchmark.
Note: If your system is already present in the configuration file, there is no need to add another configuration.
In the configs/<benchmark>/<Scenario>/config.json file, select a similar configuration and modify it based on the current system, matching the number and type of GPUs in your system.
For this blog, we used a Dell EMC PowerEdge R7525 server with one A100 GPU as the example. We chose R7525_A100-PCIe-40GBx1 as the name for this new system. Because this system is not already in the list of systems, we added the R7525_A100-PCIe-40GBx1 configuration.
Because the R7525_A100-PCIe-40GBx2 reference system is the most similar, we modified that configuration and picked Resnet50 Server as the example benchmark.
The following example shows the reference configuration for two GPUs for the Resnet50 Server benchmark in the configs/resnet50/Server/config.json file:
"R7525_A100-PCIe-40GBx2": { "config_ver": { }, "deque_timeout_us": 2000, "gpu_batch_size": 64, "gpu_copy_streams": 4, "gpu_inference_streams": 3, "server_target_qps": 52000, "use_cuda_thread_per_device": true, "use_graphs": true },
This example shows the modified configuration for one GPU:
"R7525_A100-PCIe-40GBx1": { "config_ver": { }, "deque_timeout_us": 2000, "gpu_batch_size": 64, "gpu_copy_streams": 4, "gpu_inference_streams": 3, "server_target_qps": 26000, "use_cuda_thread_per_device": true, "use_graphs": true },
We modified only the server_target_qps parameter to match the number of GPUs. The server_target_qps parameter scales linearly, so QPS = number of GPUs x QPS per GPU. For one GPU, we set server_target_qps to 26000, in accordance with the expected single-GPU performance.
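The linear-scaling rule can be sketched as a quick shell calculation (the per-GPU QPS below is the example value from this system; yours will differ):

```shell
# Sketch: server_target_qps scales linearly with the GPU count.
qps_per_gpu=26000        # expected QPS for one A100-PCIe-40GB (example value)
gpus=2
server_target_qps=$((gpus * qps_per_gpu))
echo "$server_target_qps"   # 52000, matching the x2 reference configuration
```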
3.3 Add the new system to the list of available systems
After you add the new system to the config.json file as shown in the preceding section, add the new system to the list of available systems. The list of available systems is in the code/common/system_list.py file. This entry indicates to the benchmark that a new system exists and ensures that the benchmark selects the correct configuration.
Note: If your system is already added, there is no need to add it to the code/common/system_list.py file.
At the end of the file, there is a class called KnownSystems. This class defines a list of SystemClass objects that describe supported systems as shown in the following example:
SystemClass(<system ID>, [<list of names reported by nvidia-smi>], [<known PCI IDs of this system>], <architecture>, [<list of known supported GPU counts>])
Where:
- For <system ID>, enter the system ID with which you want to identify this system.
- For <list of names reported by nvidia-smi>, run the nvidia-smi -L command and use the name that is returned.
- For <known PCI IDs of this system>, run the following command:
$ CUDA_VISIBLE_ORDER=PCI_BUS_ID nvidia-smi --query-gpu=gpu_name,pci.device_id --format=csv
name, pci.device_id
A100-PCIE-40GB, 0x20F110DE
The pci.device_id field is in the 0x<PCI ID>10DE format, where 10DE is the NVIDIA PCI vendor ID. Use the four hexadecimal digits between 0x and 10DE as your PCI ID for the system. In this case, it is 20F1.
- For <architecture>, use the architecture Enum, which is at the top of the file. In this case, the A100 GPU uses the Ampere architecture.
- For <list of known GPU counts>, enter the number of GPUs of the systems you want to support (that is, [1,2,4] if you want to support 1x, 2x, and 4x GPU variants of this system). Because we already have a 2x variant in the system_list.py file, we simply need to include the number 1 as an additional entry to support our system.
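The PCI ID extraction described above can be done with plain shell string trimming. This is a small sketch; the device_id value is the example reported for this system:

```shell
# Sketch: derive the system_list.py PCI ID from nvidia-smi's pci.device_id.
device_id="0x20F110DE"    # example value reported for an A100-PCIE-40GB
pci_id=${device_id#0x}    # strip the leading 0x      -> 20F110DE
pci_id=${pci_id%10DE}     # strip NVIDIA vendor 10DE  -> 20F1
echo "$pci_id"            # 20F1
```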
Note: Because a configuration is already present for the PowerEdge R7525 server, we added the number 1 for our configuration, as shown in the following example. If your system does not exist in the system_list.py file, add the entire configuration, not just the GPU count.
class KnownSystems:
    """
    Global List of supported systems
    """
    # before the addition of 1 - this config only supports R7525_A100-PCIe-40GB x2
    # R7525_A100_PCIE_40GB = SystemClass("R7525_A100-PCIe-40GB", ["A100-PCIe-40GB"], ["20F1"], Architecture.Ampere, [2])
    # after the addition - this config now supports R7525_A100-PCIe-40GB x1 and R7525_A100-PCIe-40GB x2 versions
    R7525_A100_PCIE_40GB = SystemClass("R7525_A100-PCIe-40GB", ["A100-PCIe-40GB"], ["20F1"], Architecture.Ampere, [1, 2])
    DSS8440_A100_PCIE_40GB = SystemClass("DSS8440_A100-PCIE-40GB", ["A100-PCIE-40GB"], ["20F1"], Architecture.Ampere, [10])
    DSS8440_A40 = SystemClass("DSS8440_A40", ["A40"], ["2235"], Architecture.Ampere, [10])
    R740_A100_PCIe_40GB = SystemClass("R740_A100-PCIe-40GB", ["A100-PCIE-40GB"], ["20F1"], Architecture.Ampere, [3])
    R750xa_A100_PCIE_40GB = SystemClass("R750xa_A100-PCIE-40GB", ["A100-PCIE-40GB"], ["20F1"], Architecture.Ampere, [4])
Note: You must provide different configurations in the configs/resnet50/Server/config.json for the x1 variant and x2 variant. In the preceding example, the R7525_A100-PCIe-40GBx2 configuration is different from the R7525_A100-PCIe-40GBx1 configuration.
3.4 Build the Docker image and required libraries
Build the Docker image and then launch an interactive container. Then, in the interactive container, build the required libraries for inferencing.
- To build the Docker image, run the make prebuild command inside the closed/DellEMC folder.
Command:
make prebuild
The following example shows sample output:
Launching Docker session
nvidia-docker run --rm -it -w /work \
    -v /home/user/article_inference_v1.0/closed/DellEMC:/work -v /home/user:/mnt//home/user \
    --cap-add SYS_ADMIN \
    -e NVIDIA_VISIBLE_DEVICES=0 \
    --shm-size=32gb \
    -v /etc/timezone:/etc/timezone:ro -v /etc/localtime:/etc/localtime:ro \
    --security-opt apparmor=unconfined --security-opt seccomp=unconfined \
    --name mlperf-inference-user -h mlperf-inference-user --add-host mlperf-inference-user:127.0.0.1 \
    --user 1002:1002 --net host --device /dev/fuse \
    -v /home/user/inference_results_v1.0/closed/DellEMC/scratch:/home/user/inference_results_v1.0/closed/DellEMC/scratch \
    -e MLPERF_SCRATCH_PATH=/home/user/inference_results_v1.0/closed/DellEMC/scratch \
    -e HOST_HOSTNAME=node009 \
    mlperf-inference:user
The Docker container is launched with all the necessary packages installed.
- Access the interactive terminal in the container.
- To build the required libraries for inferencing, run the make build command inside the interactive container.
Command:
make build
The following example shows sample output:
(mlperf) user@mlperf-inference-user:/work$ make build
The container is built, in which you can run the benchmarks.
…….
[ 26%] Linking CXX executable /work/build/bin/harness_default
make[4]: Leaving directory '/work/build/harness'
make[4]: Leaving directory '/work/build/harness'
make[4]: Leaving directory '/work/build/harness'
[ 36%] Built target harness_bert
[ 50%] Built target harness_default
[ 55%] Built target harness_dlrm
make[4]: Leaving directory '/work/build/harness'
[ 63%] Built target harness_rnnt
make[4]: Leaving directory '/work/build/harness'
[ 81%] Built target harness_triton
make[4]: Leaving directory '/work/build/harness'
[100%] Built target harness_triton_mig
make[3]: Leaving directory '/work/build/harness'
make[2]: Leaving directory '/work/build/harness'
Finished building harness.
make[1]: Leaving directory '/work'
(mlperf) user@mlperf-inference-user:/work
3.5 Download and preprocess validation data and models
To run the MLPerf inference v1.0 tests, download the datasets and models, and then preprocess them. MLPerf provides scripts that download the trained models. The scripts also download the datasets for benchmarks other than Resnet50, DLRM, and 3D U-Net.
For Resnet50, DLRM, and 3D U-Net, register for an account and then download the datasets manually:
- DLRM—Download the Criteo Terabyte dataset and extract the downloaded file to $MLPERF_SCRATCH_PATH/data/criteo/
- 3D U-Net—Download the BraTS challenge data and extract the downloaded file to $MLPERF_SCRATCH_PATH/data/BraTS/MICCAI_BraTS_2019_Data_Training
After downloading the Resnet50, DLRM, and 3D U-Net datasets manually, run the following commands to download the remaining datasets and all the models, and then preprocess them:
$ make download_model   # Downloads models and saves to $MLPERF_SCRATCH_PATH/models
$ make download_data    # Downloads datasets and saves to $MLPERF_SCRATCH_PATH/data
$ make preprocess_data  # Preprocess data and saves to $MLPERF_SCRATCH_PATH/preprocessed_data
Note: These commands download all the datasets, which might not be required if the objective is to run one specific benchmark. To run a specific benchmark rather than all the benchmarks, see the following sections for information about the specific benchmark.
(mlperf) user@mlperf-inference-user:/work$ tree -d -L 1
.
├── build
├── code
├── compliance
├── configs
├── data_maps
├── docker
├── measurements
├── power
├── results
├── scripts
└── systems

The different folders are as follows:
- build—Logs, preprocessed data, engines, models, plugins, and so on
- code—Source code for all the benchmarks
- compliance—Passed compliance checks
- configs—Configurations that run different benchmarks for different system setups
- data_maps—Data maps for different benchmarks
- docker—Docker files to support building the container
- measurements—Measurement values for different benchmarks
- power—Files specific to power submissions (needed only for power submissions)
- results—Final result logs
- scratch—Storage for models, preprocessed data, and the dataset that is symlinked to the preceding build directory
- scripts—Support scripts
- systems—Hardware and software details of systems in the benchmark
4.0 Running the benchmarks
After you have performed the preceding tasks to prepare your environment, run any of the benchmarks that are required for your tests.
The Resnet50, SSD-Resnet34, and RNN-T benchmarks have 99% (default accuracy) targets.
The BERT, DLRM, and 3D U-Net benchmarks have 99% (default accuracy) and 99.9% (high accuracy) targets. For information about running these benchmarks, see the Running high accuracy target benchmarks section below.
If you downloaded and preprocessed all the datasets (as shown in the previous section), there is no need to do so again. Skip the download and preprocessing steps in the procedures for the following benchmarks.
NVIDIA TensorRT is the inference engine for the backend. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning applications.
4.1 Run the Resnet50 benchmark
To set up the Resnet50 dataset and model to run the inference:
- If you already downloaded and preprocessed the datasets, go to step 5.
- Download the required validation dataset (https://github.com/mlcommons/training/tree/master/image_classification).
- Extract the images to $MLPERF_SCRATCH_PATH/data/dataset/
- Run the following commands:
make download_model BENCHMARKS=resnet50
make preprocess_data BENCHMARKS=resnet50
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engines for both the Offline and Server scenarios
make generate_engines RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline,Server --config_ver=default"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
The following example shows the output for PerformanceOnly mode and displays a “VALID“ result:
=======================
Perf harness results:
=======================
R7525_A100-PCIe-40GBx1_TRT-default-Server:
    resnet50: Scheduled samples per second : 25992.97 and Result is : VALID
=======================
Accuracy results:
=======================
R7525_A100-PCIe-40GBx1_TRT-default-Server:
    resnet50: No accuracy results in PerformanceOnly mode.
4.2 Run the SSD-Resnet34 benchmark
To set up the SSD-Resnet34 dataset and model to run the inference:
- If necessary, download and preprocess the dataset:
make download_model BENCHMARKS=ssd-resnet34
make download_data BENCHMARKS=ssd-resnet34
make preprocess_data BENCHMARKS=ssd-resnet34
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engines for both the Offline and Server scenarios
make generate_engines RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline,Server --config_ver=default"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
4.3 Run the RNN-T benchmark
To set up the RNN-T dataset and model to run the inference:
- If necessary, download and preprocess the dataset:
make download_model BENCHMARKS=rnnt
make download_data BENCHMARKS=rnnt
make preprocess_data BENCHMARKS=rnnt
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engines for both the Offline and Server scenarios
make generate_engines RUN_ARGS="--benchmarks=rnnt --scenarios=Offline,Server --config_ver=default"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
5 Running high accuracy target benchmarks
The BERT, DLRM, and 3D U-Net benchmarks have high accuracy targets.
5.1 Run the BERT benchmark
To set up the BERT dataset and model to run the inference:
- If necessary, download and preprocess the dataset:
make download_model BENCHMARKS=bert
make download_data BENCHMARKS=bert
make preprocess_data BENCHMARKS=bert
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engines for both the Offline and Server scenarios and also for default and high accuracy targets
make generate_engines RUN_ARGS="--benchmarks=bert --scenarios=Offline,Server --config_ver=default,high_accuracy"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=high_accuracy --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=high_accuracy --test_mode=AccuracyOnly"
5.2 Run the DLRM benchmark
To set up the DLRM dataset and model to run the inference:
- If you already downloaded and preprocessed the datasets, go to step 5.
- Download the Criteo Terabyte dataset.
- Extract the downloaded file to the $MLPERF_SCRATCH_PATH/data/criteo/ directory.
- Run the following commands:
make download_model BENCHMARKS=dlrm
make preprocess_data BENCHMARKS=dlrm
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engines for both the Offline and Server scenarios and also for default and high accuracy targets
make generate_engines RUN_ARGS="--benchmarks=dlrm --scenarios=Offline,Server --config_ver=default,high_accuracy"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=high_accuracy --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=high_accuracy --test_mode=AccuracyOnly"
5.3 Run the 3D U-Net benchmark
Note: This benchmark only has the Offline scenario.
To set up the 3D U-Net dataset and model to run the inference:
- If you already downloaded and preprocessed the datasets, go to step 5.
- Download the BraTS challenge data.
- Extract the images to the $MLPERF_SCRATCH_PATH/data/BraTS/MICCAI_BraTS_2019_Data_Training directory.
- Run the following commands:
make download_model BENCHMARKS=3d-unet
make preprocess_data BENCHMARKS=3d-unet
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case it generates engines for the Offline scenario only, for both default and high accuracy targets
make generate_engines RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default,high_accuracy"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
6 Limitations and Best Practices for Running MLPerf
Note the following limitations and best practices:
- To build the engine and run the benchmark with a single command, use the make run RUN_ARGS… shortcut. It is a valid alternative to the separate make generate_engines … and make run_harness … commands.
- Include the --fast flag in the RUN_ARGS command to test runs quickly. For example:
make run_harness RUN_ARGS="--fast --benchmarks=<bmname> --scenarios=<scenario> --config_ver=<cver> --test_mode=PerformanceOnly"
The benchmark runs for one minute instead of the default 10 minutes.
- If the server results are “INVALID”, reduce the server_target_qps for a Server scenario run. If the latency constraints are not met during the run, “INVALID” results are expected.
- If the results are “INVALID” for an Offline scenario run, then increase the gpu_offline_expected_qps. “INVALID” runs for Offline scenario occur when the system can deliver a significantly higher QPS than what is provided through the gpu_offline_expected_qps configuration.
- If the batch size changes, rebuild the engine.
- Only the BERT, DLRM, and 3D U-Net benchmarks support high accuracy targets.
- The 3D U-Net benchmark has only the Offline scenario.
7 Conclusion
This blog provides step-by-step procedures to run and reproduce MLPerf inference v1.0 results on Dell Technologies servers with NVIDIA GPUs.
Related Blog Posts

MLPerf™ Inference v2.0 Edge Workloads Powered by Dell PowerEdge Servers
Fri, 06 May 2022 19:54:11 -0000
Abstract
Dell Technologies recently submitted results to the MLPerf Inference v2.0 benchmark suite. This blog examines the results of two specialty edge servers: the Dell PowerEdge XE2420 server with the NVIDIA T4 Tensor Core GPU and the Dell PowerEdge XR12 server with the NVIDIA A2 Tensor Core GPU.
Introduction
It is 6:00 am on a Saturday morning. You drag yourself out of bed, splash water on your face, brush your hair, and head to your dimly lit kitchen for a bite to eat before your morning run. Today, you have decided to explore a new part of the neighborhood because your dog’s nose needs new bushes to sniff. As you wait for your bagel to toast, you ask your voice assistant “what’s the weather like?” Within a couple of seconds, you know that you need to grab an extra layer because there is a slight chance of rain. Edge computing has saved your morning run.
Although this use case is covered in the MLPerf Mobile benchmarks, the data discussed in this blog is from the MLPerf Inference benchmark that has been run on Dell servers.
Edge computing is computing that takes place at the “edge of networks.” Edge of networks refers to where devices such as phones, tablets, laptops, smart speakers, and even industrial robots can access the rest of the network. In this case, smart speakers can perform speech-to-text recognition to offload processing that ordinarily must be accomplished in the cloud. This offloading not only improves response time but also decreases the amount of sensitive data that is sent and stored in the cloud. The scope for edge computing expands far beyond voice assistants with use cases including autonomous vehicles, 5G mobile computing, smart cities, security, and more.
The Dell PowerEdge XE2420 and PowerEdge XR12 servers are designed for edge computing workloads. The design criteria are based on real-life scenarios such as extreme heat, dust, and vibration from factory floors. However, although these servers are not physically located in a data center, server reliability and performance are not compromised.
PowerEdge XE2420 server
The PowerEdge XE2420 server is a specialty edge server that delivers high performance in harsh environments. This server is designed for demanding edge applications such as streaming analytics, manufacturing logistics, 5G cell processing, and other AI applications. It is a short-depth, dense, dual-socket, 2U server that can handle great environmental stress on its electrical and physical components. Also, this server is ideal for low-latency and large-storage edge applications because it supports 16 x DDR4 RDIMM/LR-DIMM (12 DIMMs are balanced) at up to 2933 MT/s. Importantly, this server can support the following GPU/Flash PCI card configurations:
- Up to 2 x PCIe x16, up to 300 W passive FHFL cards (for example, NVIDIA V100S or NVIDIA RTX6000)
- Up to 4 x PCIe x8; 75 W passive (for example, NVIDIA T4 GPU)
- Up to 2 x FE1 storage expansion cards (up to 20 x M.2 drives on each)
The following figures show the PowerEdge XE2420 server (source):
Figure 1: Front view of the PowerEdge XE2420 server
Figure 2: Rear view of the PowerEdge XE2420 server
PowerEdge XR12 server
The PowerEdge XR12 server is part of a line of rugged servers that deliver high performance and reliability in extreme conditions. This server is a marine-compliant, single-socket 2U server that offers boosted services for the edge. It includes one CPU that has up to 36 x86 cores, support for accelerators, DDR4, PCIe 4.0, persistent memory and up to six drives. Also, the PowerEdge XR12 server offers 3rd Generation Intel Xeon Scalable Processors.
The following figures show the PowerEdge XR12 server (source):
Figure 3: Front view of the PowerEdge XR12 server
Figure 4: Rear view of the PowerEdge XR12 server
Performance discussion
The following figure shows the comparison of the ResNet 50 Offline performance of various server and GPU configurations, including:
- PowerEdge XE8545 server with the 80 GB A100 Multi-Instance GPU (MIG), configured with seven instances of the 1g.10gb memory profile (one compute instance each)
- PowerEdge XR12 server with the A2 GPU
- PowerEdge XE2420 server with the T4 and A30 GPU
Figure 5: MLPerf Inference ResNet 50 Offline performance
The ResNet 50 model falls under the computer vision category of applications because it performs image classification.
The MIG numbers are per MIG instance: the aggregate result has been divided by 28 because the systems have four physical GPU cards, each running seven instances of the MIG profile. The non-MIG numbers are also per card.
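The normalization above amounts to a simple division; as a sketch of the arithmetic:

```shell
# Sketch: per-instance MIG throughput = aggregate / (cards x instances per card).
cards=4
instances_per_card=7
divisor=$((cards * instances_per_card))
echo "$divisor"   # 28
```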
For the ResNet 50 benchmark, the PowerEdge XE2420 server with the T4 GPU showed more than double the performance of the PowerEdge XR12 server with the A2 GPU. The PowerEdge XE8545 server with the A100 MIG showed competitive performance when compared to the PowerEdge XE2420 server with the T4 GPU. The performance delta of 12.8 percent favors the PowerEdge XE2420 system. However, the PowerEdge XE2420 server with A30 GPU card takes the top spot in this comparison as it shows almost triple the performance over the PowerEdge XE2420 server with the T4 GPU.
The following figure shows a comparison of the SSD-ResNet 34 Offline performance of the PowerEdge XE8545 server with the A100 MIG and the PowerEdge XE2420 server with the A30 GPU.
Figure 6: MLPerf Inference SSD-ResNet 34 Offline performance
The SSD-ResNet 34 model falls under the computer vision category because it performs object detection. The PowerEdge XE2420 server with the A30 GPU card performed more than three times better than the PowerEdge XE8545 server with the A100 MIG.
The following figure shows a comparison of the Recurrent Neural Network Transducers (RNNT) Offline performance of the PowerEdge XR12 server with the A2 GPU and the PowerEdge XE2420 server with the T4 GPU:
Figure 7: MLPerf Inference RNNT Offline performance
The RNNT model falls under the speech recognition category, which can be used for applications such as automatic closed captioning in YouTube videos and voice commands on smartphones. For this speech recognition workload, the PowerEdge XE2420 server with the T4 GPU and the PowerEdge XR12 server with the A2 GPU are closer in terms of performance: there is only a 32 percent performance delta.
The following figure shows a comparison of the BERT Offline performance of default and high accuracy runs of the PowerEdge XR12 server with the A2 GPU and the PowerEdge XE2420 server with the A30 GPU:
Figure 8: MLPerf Inference BERT Offline performance
BERT is a state-of-the-art, language-representational model for Natural Language Processing applications such as sentiment analysis. Although the PowerEdge XE2420 server with the A30 GPU shows significant performance gains, the PowerEdge XR12 server with the A2 GPU excels when achieved performance is considered relative to the money spent.
The following figure shows a comparison of the Deep Learning Recommendation Model (DLRM) Offline performance for the PowerEdge XE2420 server with the T4 GPU and the PowerEdge XR12 server with the A2 GPU:
Figure 9: MLPerf Inference DLRM Offline performance
DLRM uses collaborative filtering and predictive analysis-based approaches to make recommendations based on the dataset provided. Recommender systems are extremely important in search, online shopping, and online social networks. The performance of the PowerEdge XE2420 server with the T4 GPU in Offline mode was 40 percent better than that of the PowerEdge XR12 server with the A2 GPU.
Despite the higher performance from the PowerEdge XE2420 server with the T4 GPU, the PowerEdge XR12 server with the A2 GPU is an excellent option for edge-related workloads. The A2 GPU is designed for high performance at the edge and consumes less power than the T4 GPU for similar workloads. Also, the A2 GPU is the more cost-effective option.
Power Discussion
It is important to budget power consumption for the critical load in a data center. The critical load includes components such as servers, routers, storage devices, and security devices. For the MLPerf Inference v2.0 submission, Dell Technologies submitted power numbers for the PowerEdge XR12 server with the A2 GPU. Figures 10 through 13 showcase the performance and power results achieved on the PowerEdge XR12 system. The blue bars are the performance results, and the green bars are the system power results. For all power submissions with the A2 GPU, Dell Technologies took the Number One claim for performance per watt for the ResNet 50, RNNT, BERT, and DLRM benchmarks.
Figure 10: MLPerf Inference v2.0 ResNet 50 power results on the Dell PowerEdge XR12 server
Figure 11: MLPerf Inference v2.0 RNNT power results on the Dell PowerEdge XR12 server
Figure 12: MLPerf Inference v2.0 BERT power results on the Dell PowerEdge XR12 server
Figure 13: MLPerf Inference v2.0 DLRM power results on the Dell PowerEdge XR12 server
Note: For our MLPerf Inference v2.0 power submission, the PowerEdge XR12 server was not tuned for an optimal performance-per-watt score. These results reflect the power consumption of a performance-optimized server configuration.
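Performance per watt, the metric behind the Number One claims above, is simply benchmark throughput divided by the average system power measured over the same run. A minimal sketch, using hypothetical numbers rather than the submitted results:

```python
def performance_per_watt(samples_per_second: float, avg_power_watts: float) -> float:
    """Throughput divided by average system power over the same run."""
    return samples_per_second / avg_power_watts

# Hypothetical values for illustration only, not actual MLPerf numbers.
offline_throughput = 4500.0  # Offline samples/s
avg_system_power = 300.0     # average system power in watts
print(performance_per_watt(offline_throughput, avg_system_power))  # 15.0
```

A higher value means more inference work per unit of energy, which is why a lower-power GPU such as the A2 can lead this metric despite lower absolute throughput.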
Conclusion
This blog takes a closer look at Dell Technologies’ MLPerf Inference v2.0 edge-related submissions. Readers can compare performance results for the Dell PowerEdge XE2420 server with the T4 GPU and the Dell PowerEdge XR12 server with the A2 GPU against other systems with different accelerators. This comparison helps readers make informed decisions about ML workloads on the edge. Performance, power consumption, and cost are the important factors to consider when planning any ML workload. Both the PowerEdge XR12 and XE2420 servers are excellent choices for Deep Learning workloads on the edge.
Appendix
SUT configuration
The following table describes the System Under Test (SUT) configurations from MLPerf Inference v2.0 submissions:
Table 1: MLPerf Inference v2.0 system configuration of the PowerEdge XE2420 and XR12 servers
Platform | PowerEdge XE2420 1x T4, TensorRT | PowerEdge XR12 1x A2, TensorRT | PowerEdge XR12 1x A2, MaxQ, TensorRT | PowerEdge XE2420 2x A30, TensorRT |
MLPerf system ID | XE2420_T4x1_edge_TRT | XR12_edge_A2x1_TRT | XR12_A2x1_TRT_MaxQ | XE2420_A30x2_TRT |
Operating system | CentOS 8.2.2004 | Ubuntu 20.04.4 | ||
CPU | Intel Xeon Gold 6238 CPU @ 2.10 GHz | Intel Xeon Gold 6312U CPU @ 2.40 GHz | Intel Xeon Gold 6252N CPU @ 2.30 GHz | |
Memory | 256 GB | 1 TB | ||
GPU | NVIDIA T4 | NVIDIA A2 | NVIDIA A30 | |
GPU form factor | PCIe | |||
GPU count | 1 | 2 | ||
Software stack | TensorRT 8.4.0 CUDA 11.6 cuDNN 8.3.2 Driver 510.47.03 DALI 0.31.0 |
Table 2: MLPerf Inference v1.1 system configuration of the PowerEdge XE8545 server
Platform | PowerEdge XE8545 4x A100-SXM-80GB-7x1g.10gb, TensorRT, Triton |
MLPerf system ID | XE8545_A100-SXM-80GB-MIG_28x1g.10gb_TRT_Triton |
Operating system | Ubuntu 20.04.2 |
CPU | AMD EPYC 7763 |
Memory | 1 TB |
GPU | NVIDIA A100-SXM-80GB (7x1g.10gb MIG) |
GPU form factor | SXM |
GPU count | 4 |
Software stack | TensorRT 8.0.2 CUDA 11.3 cuDNN 8.2.1 Driver 470.57.02 DALI 0.31.0 |

Overview of MLPerf™ Inference v2.0 Results on Dell Servers
Fri, 09 Sep 2022 15:15:36 -0000
|Read Time: 0 minutes
Dell Technologies has been an active participant in the MLCommons™ Inference benchmark submission since day one. We have completed five rounds of inference submission.
This blog provides an overview of the latest results of MLPerf Inference v2.0 closed data center, closed data center power, closed edge, and closed edge power categories on Dell servers from our HPC & AI Innovation Lab. It shows optimal inference and power (performance per watt) performance for Dell GPU-based servers (DSS 8440, PowerEdge R750xa, PowerEdge XE2420, PowerEdge XE8545, and PowerEdge XR12). The previous blog about MLPerf Inference v1.1 performance results can be found here.
What is new?
- There were 3,800 performance results for this round, compared to 1,800 performance results for v1.1. Additionally, 885 systems were submitted in v2.0 compared to 424 systems in v1.1, more than twice as many as the previous round.
- For the 3D U-Net benchmark, the dataset now used is the KiTS 2019 Kidney Tumor Segmentation set.
- Early stopping was introduced in this round to replace a deterministic minimum query count with a function that dynamically determines when further runs are not required to identify additional performance gain.
Results at a glance
Dell Technologies submitted 167 results to the various categories. The Dell team made 86 submissions to the closed data center category, 28 submissions to the closed data center power category, and 53 submissions to the closed edge category. For the closed data center category, the Dell team submitted the second most results. In fact, Dell Technologies submitted results from 17 different system configurations with the NVIDIA TensorRT and NVIDIA Triton inference engines. Among these 17 configurations, the PowerEdge XE2420 server with T4 and A30 GPUs and the PowerEdge XR12 server with the A2 GPU were two new systems that had not been submitted before. Additionally, Dell Technologies submitted to the reintroduced Multistream scenario. Only Dell Technologies submitted results for different host operating systems.
Noteworthy results
Noteworthy results include:
- The PowerEdge XE8545 and R750xa servers yield Number One results for performance per accelerator with NVIDIA A100 GPUs. The use cases for this top classification include Image Classification, Object Detection, Speech-to-text, Medical Imaging, Natural Language Processing, and Recommendation.
- The DSS 8440 server yields Number Two results for system performance for multiple benchmarks including Speech-to-text, Object Detection, Natural Language Processing, and Medical Image Segmentation use cases among all submissions.
- The PowerEdge R750xa server yields Number One results for the highest system performance for multiple benchmarks including Image Classification, Object Detection, Speech-to-text, Natural Language Processing, and Recommendation use cases among all the PCIe-based GPU servers.
- The PowerEdge XE8545 server yields Number One results for the lowest MultiStream latency with NVIDIA Multi-Instance GPU (MIG) in the edge category for the Image Classification and Object Detection use cases.
- The PowerEdge XE2420 server yields Number One results for the highest T4 GPU inference results for the Image Classification, Speech-to-text, and Recommendation use cases.
- The PowerEdge XR12 server yields Number One results for the highest performance per watt with NVIDIA A2 GPU results in power for the Image Classification, Object Detection, Speech-to-text, Natural Language Processing, and Recommendation use cases.
MLPerf Inference v2.0 benchmark results
The following graphs highlight the performance metrics for the Server and Offline scenarios across the various benchmarks from MLCommons. Dell Technologies presents these results to help customers identify options that suit their deep learning application demands. Additionally, this performance data serves as a reference point for sizing deep learning clusters. Dell Technologies strives to submit as many results as possible to help answer customer questions.
For the Server scenario, the performance metric is queries per second (QPS). For the Offline scenario, the performance metric is Offline samples per second. In general, the metrics represent throughput, and a higher throughput indicates a better result. In the following graphs, the Y axis is a logarithmically scaled axis representing throughput and the X axis represents the systems under test (SUTs) and their corresponding models. The SUTs are described in the appendix.
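To make the Offline metric concrete, it is computed as the total number of samples processed divided by the wall-clock run time. A minimal sketch with hypothetical numbers, not results from these submissions:

```python
def offline_samples_per_second(total_samples: int, elapsed_seconds: float) -> float:
    """Offline scenario metric: total samples processed divided by run time."""
    return total_samples / elapsed_seconds

# Hypothetical run: 1.2 million samples processed in 10 minutes.
print(offline_samples_per_second(1_200_000, 600.0))  # 2000.0 samples/s
```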
Figure 1 through Figure 6 show the per card performance of the various SUTs on the ResNet 50, BERT, SSD-ResNet 34, 3D U-Net, RNNT, and DLRM models respectively in the Server and Offline scenarios:
Figure 1: MLPerf Inference v2.0 ResNet 50 per card results
Figure 2: MLPerf Inference v2.0 BERT default and high accuracy per card results
Figure 3: MLPerf Inference v2.0 SSD-ResNet 34 per card results
Figure 4: MLPerf Inference v2.0 3D U-Net per card results
Figure 5: MLPerf Inference v2.0 RNNT per card results
Figure 6: MLPerf Inference v2.0 DLRM default and high accuracy per card results
Observations
The results in this blog have been officially submitted to and accepted by the MLCommons organization. These results have passed compliance tests, been peer reviewed, and adhered to the constraints enforced by MLCommons. Customers and partners can reproduce our results by following the steps to run MLPerf Inference v2.0 in its GitHub repository.
Submissions from Dell Technologies included approximately 140 performance results and 28 performance and power results. Across the various workload tasks including Image Classification, Object Detection, Medical Image Segmentation, Speech-to-text, Language Processing, and Recommendation, server performance from Dell Technologies was promising.
Dell servers performed with optimal performance and power results. They were configured with different GPUs such as:
- NVIDIA A30 Tensor Core GPU
- NVIDIA A100 (PCIe and SXM)
- NVIDIA T4 Tensor Core GPU
- NVIDIA A2 Tensor Core GPU, which is newly released
More information about performance for specific configurations that are not discussed in this blog can be found in the v1.1 or v1.0 results.
The submission included results from different inference backends such as NVIDIA TensorRT and NVIDIA Triton. The appendix provides a summary of the full hardware and software stacks.
Conclusion
This blog quantifies the performance of Dell servers in the MLPerf Inference v2.0 round of submission. Readers can use these results to make informed planning and purchasing decisions for their AI workload needs.
Appendix
Software stack
The NVIDIA Triton Inference Server is an open-source inferencing software tool that aids in the deployment and execution of AI models at scale in production. Triton not only works with all major frameworks but also with customizable backends, further enabling developers to focus on their AI development. It is a versatile tool because it supports any inference type and can be deployed on any platform including CPU, GPU, data center, cloud, or edge. Additionally, Triton supports the rapid and reliable deployment of AI models at scale by integrating well with Kubernetes, Kubeflow, Prometheus, and Grafana. Triton supports the HTTP/REST and GRPC protocols that allow remote clients to request inferencing for any model that the server manages.
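As a sketch of the HTTP/REST path, a Triton inference request body is plain JSON following the KServe v2 inference protocol. The model name and tensor names below are hypothetical, chosen only to illustrate the request shape:

```python
import json

# Hypothetical model and input tensor; the JSON layout follows the
# KServe v2 inference protocol that Triton's HTTP endpoint accepts.
model_name = "resnet50"
request = {
    "inputs": [
        {
            "name": "input__0",           # hypothetical input tensor name
            "shape": [1, 3, 224, 224],
            "datatype": "FP32",
            "data": [0.0] * (3 * 224 * 224),  # placeholder image data
        }
    ],
    "outputs": [{"name": "output__0"}],   # hypothetical output tensor name
}

# A client would POST this body to
# http://<server>:8000/v2/models/resnet50/infer
body = json.dumps(request)
print(len(request["inputs"][0]["data"]))  # 150528
```

The same request can also be issued over GRPC; client libraries such as NVIDIA's tritonclient wrap both protocols so applications do not build these payloads by hand.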
The NVIDIA TensorRT SDK delivers high-performance deep learning inference that includes an inference optimizer and runtime. It is powered by CUDA and offers a unified solution to deploy on various platforms including edge or data center. TensorRT supports the major frameworks including PyTorch, TensorFlow, ONNX, and MATLAB. It can import models trained in these frameworks by using integrated parsers. For inference, TensorRT performs orders of magnitude faster than its CPU-only counterparts.
NVIDIA MIG can partition GPUs into several instances that extend compute resources among users. MIG enables predictable performance and maximum GPU use by running jobs simultaneously on the different instances with dedicated resources for compute, memory, and memory bandwidth.
SUT configuration
The following table describes the SUT from this round of data center inference submission:
Table 1: MLPerf Inference v2.0 system configurations for DSS 8440 and PowerEdge R750xa servers
Platform | DSS 8440 10xA100 TensorRT | R750xa 4xA100 TensorRT |
MLPerf system ID | DSS8440_A100_PCIE_80GBx10_TRT | R750xa_A100_PCIE_80GBx4_TRT |
Operating system | CentOS 8.2 | |
CPU | Intel Xeon Gold 6248R CPU @ 3.00 GHz | Intel Xeon Gold 6338 CPU @ 2.00 GHz |
Memory | 768 GB | 1 TB |
GPU | NVIDIA A100 | |
GPU form factor | PCIe | |
GPU count | 10 | 4 |
Software stack | TensorRT 8.4.0 CUDA 11.6 cuDNN 8.3.2 Driver 510.39.01 DALI 0.31.0 |
Table 2: MLPerf Inference v2.0 system configurations for PowerEdge XE2420 servers
Platform | PowerEdge XE2420 1xA30 TensorRT | PowerEdge XE2420 2xA30 TensorRT | PowerEdge XE2420 1xA30 TensorRT MaxQ | PowerEdge XE2420 1xT4 TensorRT |
MLPerf system ID | XE2420_A30x1_TRT | XE2420_A30x2_TRT | XE2420_A30x1_TRT_MaxQ | XE2420_T4x1_TRT |
Operating system | Ubuntu 20.04.4 | CentOS 8.2.2004 | ||
CPU | Intel Xeon Gold 6252 CPU @ 2.10 GHz | Intel Xeon Gold 6252N CPU @ 2.30 GHz | Intel Xeon Silver 4216 CPU @ 2.10 GHz | Intel Xeon Gold 6238 CPU @ 2.10 GHz |
Memory | 1 TB | 64 GB | 256 GB | |
GPU | NVIDIA A30 | NVIDIA T4 | ||
GPU form factor | PCIe | |||
GPU count | 1 | 2 | 1 | 1 |
Software stack | TensorRT 8.4.0 CUDA 11.6 cuDNN 8.3.2 Driver 510.39.01 DALI 0.31.0 |
Table 3: MLPerf Inference v2.0 system configurations for PowerEdge XE8545 servers
Platform | PowerEdge XE8545 4xA100 TensorRT | PowerEdge XE8545 4xA100 TensorRT, Triton | PowerEdge XE8545 1xA100 MIG 1x1g.10g TensorRT |
MLPerf system ID | XE8545_A100_SXM_80GBx4_TRT | XE8545_A100_SXM_80GBx4_TRT_Triton | XE8545_A100_SXM_80GB_1xMIG_TRT |
Operating system | Ubuntu 20.04.3 | ||
CPU | AMD EPYC 7763 | ||
Memory | 1 TB | ||
GPU | NVIDIA A100-SXM-80 GB | NVIDIA A100-SXM-80 GB (1x1g.10gb MIG) | |
GPU form factor | SXM | ||
GPU count | 4 | 1 | |
Software stack | TensorRT 8.4.0 CUDA 11.6 cuDNN 8.3.2 Driver 510.47.03 DALI 0.31.0 (plus Triton 22.01 for the Triton configuration) | | |
Table 4: MLPerf Inference v2.0 system configurations for PowerEdge XR12 servers
Platform | PowerEdge XR12 1xA2 TensorRT | PowerEdge XR12 1xA2 TensorRT MaxQ |
MLPerf system ID | XR12_A2x1_TRT | XR12_A2x1_TRT_MaxQ |
Operating system | CentOS 8.2 | |
CPU | Intel Xeon Gold 6312U CPU @ 2.40 GHz | |
Memory | 256 GB | |
GPU | NVIDIA A2 | |
GPU form factor | PCIe | |
GPU count | 1 | |
Software stack | TensorRT 8.4.0 CUDA 11.6 cuDNN 8.3.2 Driver 510.39.01 DALI 0.31.0 |