Running the MLPerf Inference v0.7 Benchmark on Dell EMC Systems
MLPerf is a benchmarking suite that measures the performance of Machine Learning (ML) workloads. It focuses on the most important aspects of the ML life cycle:
- Training—The MLPerf training benchmark suite measures how fast a system can train ML models.
- Inference—The MLPerf inference benchmark measures how fast a system can perform ML inference by using a trained model in various deployment scenarios.
The MLPerf inference v0.7 suite includes benchmarks for the following models:
- Resnet50
- SSD-Resnet34
- BERT
- DLRM
- RNN-T
- 3D U-Net
Note: The BERT, DLRM, and 3D U-Net models have 99% (default accuracy) and 99.9% (high accuracy) targets.
This blog describes the steps to run MLPerf inference v0.7 tests on Dell Technologies servers with NVIDIA GPUs. It helps you run and reproduce the results that we observed in our HPC and AI Innovation Lab. For more details about the hardware and the software stack for different systems in the benchmark, see this GitHub repository.
Getting started
A system under test (SUT) consists of a defined set of hardware and software resources that are measured for performance. The hardware resources may include processors, accelerators, memories, disks, and interconnect. The software resources may include an operating system, compilers, libraries, and drivers that significantly influence the running time of a benchmark. In this case, the system on which you clone the MLPerf repository and run the benchmark is the SUT.
For storage, SSD RAID or local NVMe drives are acceptable for running all the subtests without any penalty; inference does not have strict requirements for fast parallel storage. However, a parallel file system such as BeeGFS or Lustre, or a storage solution such as PixStor, helps when making multiple copies of large datasets.
Clone the MLPerf repository
Follow these steps:
- Clone the repository to your home directory or another acceptable path:
cd ~
git clone https://github.com/mlperf/inference_results_v0.7.git
- Go to the closed/DellEMC directory:
cd inference_results_v0.7/closed/DellEMC
- Create a “scratch” directory to store the models, datasets, preprocessed data, and so on:
mkdir scratch
This scratch directory requires at least 3 TB of space.
- Export MLPERF_SCRATCH_PATH as the absolute path of the scratch directory:
export MLPERF_SCRATCH_PATH=/home/user/inference_results_v0.7/closed/DellEMC/scratch
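Before continuing, it can be worth confirming that the exported path exists and that the underlying file system has the required free space. The following are optional checks of our own, not part of the MLPerf scripts:
mkdir -p $MLPERF_SCRATCH_PATH       # create the directory if it does not already exist
df -h $MLPERF_SCRATCH_PATH          # confirm that roughly 3 TB of free space is available
echo $MLPERF_SCRATCH_PATH           # verify the exported path before launching the container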
Set up the configuration file
The closed/DellEMC/configs directory includes a config.json file that lists configurations for the Dell EMC servers that were submitted as systems in the MLPerf Inference v0.7 benchmark. If necessary, modify the configs/<benchmark>/<Scenario>/config.json file to include the system that will run the benchmark.
Note: If your system is already present in the configuration file, there is no need to add another configuration.
In the configs/<benchmark>/<Scenario>/config.json file, select a similar configuration and modify it based on the current system, matching the number and type of GPUs in your system.
For this blog, we considered our R7525 server with one A100 GPU and chose R7525_A100x1 as the name for this new system. Because the R7525_A100x1 system is not already in the list of systems, we added the R7525_A100x1 configuration.
Because the R7525_A100x2 reference system is the most similar, we modified that configuration and picked Resnet50 Server as the example benchmark.
The following example shows the reference configuration for two GPUs for the Resnet50 Server benchmark in the configs/<benchmark>/<Scenario>/config.json file:
"R7525_A100x2": { "active_sms": 100, "config_ver": { }, "deque_timeout_us": 2000, "gpu_batch_size": 64, "gpu_copy_streams": 4, "gpu_inference_streams": 3, "input_dtype": "int8", "input_format": "linear", "map_path": "data_maps/dataset/val_map.txt", "precision": "int8", "server_target_qps": 52400, "tensor_path": "${PREPROCESSED_DATA_DIR}/dataset/ResNet50/int8_linear", "use_cuda_thread_per_device": true, "use_deque_limit": true, "use_graphs": true },
This example shows the modified configuration for one GPU:
"R7525_A100x1": { "active_sms": 100, "config_ver": { }, "deque_timeout_us": 2000, "gpu_batch_size": 64, "gpu_copy_streams": 4, "gpu_inference_streams": 3, "input_dtype": "int8", "input_format": "linear", "map_path": "data_maps/dataset/val_map.txt", "precision": "int8", "server_target_qps": 26200, "tensor_path": "${PREPROCESSED_DATA_DIR}/datset/ResNet50/int8_linear", "use_cuda_thread_per_device": true, "use_deque_limit": true, "use_graphs": true },
We modified the QPS parameter (server_target_qps) to match the number of GPUs. The server_target_qps parameter scales linearly, so the total QPS = number of GPUs x QPS per GPU. For example, a single A100 GPU targets 52,400 / 2 = 26,200 QPS, half of the two-GPU value.
We only modified the server_target_qps parameter to get a baseline run first. You can also modify other parameters such as gpu_batch_size, gpu_copy_streams, and so on. We will discuss these other parameters in a future blog that describes performance tuning.
Finally, we added the modified configuration for the new R7525_A100x1 system to the configuration file at configs/resnet50/Server/config.json.
Register the new system
After you add the new system to the config.json file, register the new system in the list of available systems. The list of available systems is in the code/common/system_list.py file.
Note: If your system is already registered, there is no need to add it to the code/common/system_list.py file.
To register the system, add the new system to the list of available systems in the code/common/system_list.py file, as shown in the following:
# (system_id, gpu_name_from_driver, gpu_count)
system_list = ([
    ("R740_vT4x4", "GRID T4-16Q", 4),
    ("XE2420_T4x4", "Tesla T4", 4),
    ("DSS8440_T4x12", "Tesla T4", 12),
    ("R740_T4x4", "Tesla T4", 4),
    ("R7515_T4x4", "Tesla T4", 4),
    ("DSS8440_T4x16", "Tesla T4", 16),
    ("DSS8440_QuadroRTX8000x8", "Quadro RTX 8000", 8),
    ("DSS8440_QuadroRTX6000x10", "Quadro RTX 6000", 10),
    ("DSS8440_QuadroRTX8000x10", "Quadro RTX 8000", 10),
    ("R7525_A100x2", "A100-PCIE-40GB", 2),
    ("R7525_A100x3", "A100-PCIE-40GB", 3),
    ("R7525_QuadroRTX8000x3", "Quadro RTX 8000", 3),
    ("R7525_A100x1", "A100-PCIE-40GB", 1),
])
In the preceding example, the last line under system_list is the newly added R7525_A100x1 system. It is a tuple of the form (<system name>, <GPU name>, <GPU count>). To find the GPU name from the driver, run the nvidia-smi -L command.
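For example, on a system with a single A100 PCIe GPU, the output looks similar to the following (illustrative only; the UUID is a placeholder):
$ nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)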
Note: Ensure that you add the system configuration for every benchmark that you intend to run and add the system to the system_list.py file. Otherwise, the results might be suboptimal, the benchmark might choose the wrong system configuration, or it might not run at all because it cannot find an appropriate configuration.
Build the Docker image and required libraries
Build the Docker image and then launch an interactive container. Then, in the interactive container, build the required libraries for inferencing.
- To build the Docker image, run the following command:
make prebuild
…
Launching Docker interactive session
docker run --gpus=all --rm -ti -w /work \
    -v /home/user/inference_results_v0.7/closed/DellEMC:/work -v /home/user:/mnt/user \
    --cap-add SYS_ADMIN -e NVIDIA_MIG_CONFIG_DEVICES="all" \
    -v /etc/timezone:/etc/timezone:ro -v /etc/localtime:/etc/localtime:ro \
    --security-opt apparmor=unconfined --security-opt seccomp=unconfined \
    --name mlperf-inference-user -h mlperf-inference-user --add-host mlperf-inference-user:127.0.0.1 \
    --user 1004:1004 --net host --device /dev/fuse --cap-add SYS_ADMIN \
    -e MLPERF_SCRATCH_PATH="/home/user/inference_results_v0.7/closed/DellEMC/scratch" \
    mlperf-inference:user
(mlperf) user@mlperf-inference-user:/work$
The Docker container is launched with all the necessary packages installed.
- Access the interactive terminal on the container.
- To build the required libraries for inferencing, run the following command inside the interactive container:
(mlperf) user@mlperf-inference-user:/work$ make build
…
[ 92%] Built target harness_triton
[ 96%] Linking CXX executable /work/build/bin/harness_default
[100%] Linking CXX executable /work/build/bin/harness_rnnt
make[4]: Leaving directory '/work/build/harness'
[100%] Built target harness_default
make[4]: Leaving directory '/work/build/harness'
[100%] Built target harness_rnnt
make[3]: Leaving directory '/work/build/harness'
make[2]: Leaving directory '/work/build/harness'
Finished building harness.
make[1]: Leaving directory '/work'
(mlperf) user@mlperf-inference-user:/work$
Download and preprocess validation data and models
To run MLPerf inference v0.7, download datasets and models, and then preprocess them. MLPerf provides scripts that download the trained models. The scripts also download the dataset for benchmarks other than Resnet50, DLRM, and 3D U-Net.
For Resnet50, DLRM, and 3D U-Net, register for an account and then download the datasets manually (a staging sketch follows this list):
- DLRM—Download the Criteo Terabyte dataset and extract the downloaded file to $MLPERF_SCRATCH_PATH/data/criteo/
- 3D U-Net—Download the BraTS challenge data and extract the downloaded file to $MLPERF_SCRATCH_PATH/data/BraTS/MICCAI_BraTS_2019_Data_Training
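The following sketch shows one way to stage the manually downloaded archives. The archive names are placeholders, so substitute the files that you actually obtained:
# Create the directories that the benchmarks expect.
mkdir -p $MLPERF_SCRATCH_PATH/data/criteo
mkdir -p $MLPERF_SCRATCH_PATH/data/BraTS/MICCAI_BraTS_2019_Data_Training

# Criteo Terabyte (DLRM) -- example for one gzipped shard; repeat for each shard.
gunzip -c <criteo-shard>.gz > $MLPERF_SCRATCH_PATH/data/criteo/<criteo-shard>

# BraTS 2019 training data (3D U-Net) -- example for a zip archive.
unzip <brats-archive>.zip -d $MLPERF_SCRATCH_PATH/data/BraTS/MICCAI_BraTS_2019_Data_Training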
Run the following commands to download and preprocess all the models and all the datasets except Resnet50, DLRM, and 3D U-Net (which you download manually, as described above):
$ make download_model    # Downloads models and saves to $MLPERF_SCRATCH_PATH/models
$ make download_data     # Downloads datasets and saves to $MLPERF_SCRATCH_PATH/data
$ make preprocess_data   # Preprocesses data and saves to $MLPERF_SCRATCH_PATH/preprocessed_data
Note: These commands download all the datasets, which might not be required if the objective is to run one specific benchmark. To run a specific benchmark rather than all the benchmarks, see the following sections for information about the specific benchmark.
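For example, to prepare only the SSD-Resnet34 benchmark, the same make targets accept a BENCHMARKS variable (these per-benchmark commands are repeated in the relevant sections below):
make download_model BENCHMARKS=ssd-resnet34
make download_data BENCHMARKS=ssd-resnet34
make preprocess_data BENCHMARKS=ssd-resnet34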
After you build the libraries and preprocess the data, the working directory contains the following folders:
(mlperf) user@mlperf-inference-user:/work$ tree -d -L 1
.
├── build—Logs, preprocessed data, engines, models, plugins, and so on
├── code—Source code for all the benchmarks
├── compliance—Passed compliance checks
├── configs—Configurations that run different benchmarks for different system setups
├── data_maps—Data maps for different benchmarks
├── docker—Docker files to support building the container
├── measurements—Measurement values for different benchmarks
├── results—Final result logs
├── scratch—Storage for models, preprocessed data, and the dataset that is symlinked to the preceding build directory
├── scripts—Support scripts
└── systems—Hardware and software details of systems in the benchmark
Running the benchmarks
Run any of the benchmarks that are required for your tests.
The Resnet50, SSD-Resnet34, and RNN-T benchmarks have 99% (default accuracy) targets.
The BERT, DLRM, and 3D U-Net benchmarks have 99% (default accuracy) and 99.9% (high accuracy) targets. For information about running these benchmarks, see the Running high accuracy target benchmarks section below.
If you downloaded and preprocessed all the datasets (as shown in the previous section), there is no need to do so again. Skip the download and preprocessing steps in the procedures for the following benchmarks.
NVIDIA TensorRT is the inference engine for the backend. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning applications.
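As a quick sanity check that the inference stack inside the container is ready, you can print the TensorRT version from its Python bindings. This is a convenience check of our own; it assumes the TensorRT Python package is available in the container built by make prebuild:
# Inside the interactive container: report the TensorRT version that the harness will use.
python3 -c "import tensorrt; print(tensorrt.__version__)"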
Run the Resnet50 benchmark
To set up the Resnet50 dataset and model to run the inference:
- If you already downloaded and preprocessed the datasets, go to step 5.
- Download the required validation dataset (https://github.com/mlcommons/training/tree/master/image_classification).
- Extract the images to $MLPERF_SCRATCH_PATH/data/dataset/.
- Run the following commands:
make download_model BENCHMARKS=resnet50
make preprocess_data BENCHMARKS=resnet50
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case, it generates engines for both the Offline and Server scenarios.
make generate_engines RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline,Server --config_ver=default"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"

# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
The following output is displayed for "PerformanceOnly" mode. The following is a "VALID" result:

======================= Perf harness results: =======================
R7525_A100x1_TRT-default-Server:
     resnet50: Scheduled samples per second : 26212.91 and Result is : VALID

======================= Accuracy results: =======================
R7525_A100x1_TRT-default-Server:
     resnet50: No accuracy results in PerformanceOnly mode.
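The Server and Offline runs above are independent make invocations, so a small shell loop can cycle through the scenarios and test modes. The following is a convenience sketch of our own rather than an official Makefile target:
# Convenience sketch: run Resnet50 in both scenarios and both test modes.
for scenario in Offline Server; do
    for mode in PerformanceOnly AccuracyOnly; do
        make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=${scenario} --config_ver=default --test_mode=${mode}"
    done
done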
Run the SSD-Resnet34 benchmark
To set up the SSD-Resnet34 dataset and model to run the inference:
- If necessary, download and preprocess the dataset:
make download_model BENCHMARKS=ssd-resnet34
make download_data BENCHMARKS=ssd-resnet34
make preprocess_data BENCHMARKS=ssd-resnet34
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case, it generates engines for both the Offline and Server scenarios.
make generate_engines RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline,Server --config_ver=default"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"

# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
Run the RNN-T benchmark
To set up the RNN-T dataset and model to run the inference:
- If necessary, download and preprocess the dataset:
make download_model BENCHMARKS=rnnt
make download_data BENCHMARKS=rnnt
make preprocess_data BENCHMARKS=rnnt
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case, it generates engines for both the Offline and Server scenarios.
make generate_engines RUN_ARGS="--benchmarks=rnnt --scenarios=Offline,Server --config_ver=default"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"

# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
Running high accuracy target benchmarks
The BERT, DLRM, and 3D U-Net benchmarks have high accuracy targets.
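In practice, this means that each run_harness command in the following subsections is issued once per accuracy target by switching the --config_ver value between default and high_accuracy. A small loop, shown here as our own convenience sketch, illustrates the pattern for the BERT Offline performance run:
# Sketch: run the BERT Offline performance test for both accuracy targets.
for ver in default high_accuracy; do
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=${ver} --test_mode=PerformanceOnly"
done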
Run the BERT benchmark
To set up the BERT dataset and model to run the inference:
- If necessary, download and preprocess the dataset:
make download_model BENCHMARKS=bert
make download_data BENCHMARKS=bert
make preprocess_data BENCHMARKS=bert
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case, it generates engines for both the Offline and Server scenarios and for both the default and high accuracy targets.
make generate_engines RUN_ARGS="--benchmarks=bert --scenarios=Offline,Server --config_ver=default,high_accuracy"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=high_accuracy --test_mode=PerformanceOnly"

# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=high_accuracy --test_mode=AccuracyOnly"
Run the DLRM benchmark
To set up the DLRM dataset and model to run the inference:
- If you already downloaded and preprocessed the datasets, go to step 5.
- Download the Criteo Terabyte dataset.
- Extract the downloaded files to the $MLPERF_SCRATCH_PATH/data/criteo/ directory.
- Run the following commands:
make download_model BENCHMARKS=dlrm
make preprocess_data BENCHMARKS=dlrm
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case, it generates engines for both the Offline and Server scenarios and for both the default and high accuracy targets.
make generate_engines RUN_ARGS="--benchmarks=dlrm --scenarios=Offline,Server --config_ver=default,high_accuracy"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=high_accuracy --test_mode=PerformanceOnly"

# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=high_accuracy --test_mode=AccuracyOnly"
Run the 3D U-Net benchmark
Note: This benchmark only has the Offline scenario.
To set up the 3D U-Net dataset and model to run the inference:
- If you already downloaded and preprocessed the datasets, go to step 5.
- Download the BraTS challenge data.
- Extract the images to the $MLPERF_SCRATCH_PATH/data/BraTS/MICCAI_BraTS_2019_Data_Training directory.
- Run the following commands:
make download_model BENCHMARKS=3d-unet
make preprocess_data BENCHMARKS=3d-unet
- Generate the TensorRT engines:
# generates the TRT engines with the specified config. In this case, it generates engines for the Offline scenario and for both the default and high accuracy targets.
make generate_engines RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default,high_accuracy"
- Run the benchmark:
# run the performance benchmark
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"

# run the accuracy benchmark
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
Limitations and best practices
Note the following limitations and best practices:
- To build the engine and run the benchmark with a single command, use the make run RUN_ARGS=… shortcut. The shortcut is a valid alternative to the make generate_engines … && make run_harness … commands (see the example after this list).
- If the server results are “INVALID”, reduce the QPS. If the latency constraints are not met during the run, “INVALID” results are expected.
- If you change the batch size, rebuild the engine.
- Only the BERT, DLRM, and 3D U-Net benchmarks support high accuracy targets.
- The 3D U-Net benchmark supports only the Offline scenario.
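For example, the shortcut mentioned in the first item can build the Resnet50 engines and run the performance harness in one step; the RUN_ARGS values below mirror the ones used earlier in this blog:
# Shortcut: build engines and run the harness with a single command.
make run RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline,Server --config_ver=default --test_mode=PerformanceOnly"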
Conclusion
This blog provided the step-by-step procedures to run and reproduce MLPerf inference v0.7 results on Dell Technologies servers with NVIDIA GPUs.
Next steps
In future blogs, we will discuss techniques to improve performance.