Leela Uppuluri

I work as a Senior Systems Development Engineer, designing, developing, implementing, validating, and delivering Dell-validated designs for Artificial Intelligence that focus on deploying Machine Learning and Deep Learning workloads on Dell PowerEdge servers, switches, and storage. My hobbies include traveling, exploring food, playing board games, and being a techy geek.


LinkedIn: https://www.linkedin.com/in/leela-uppuluri-06a19147


Running the MLPerf™ Inference v1.0 Benchmark on Dell EMC Systems

Rakshith Vasudev, Frank Han, Leela Uppuluri

Fri, 24 Sep 2021 15:23:27 -0000


This blog is a guide for running the MLPerf inference v1.0 benchmark. Information about how to run the benchmark is scattered across several online locations; this blog collects all the steps in one place.

MLPerf is a benchmarking suite that measures the performance of Machine Learning (ML) workloads. It focuses on the most important aspects of the ML life cycle: training and inference. For more information, see Introduction to MLPerf™ Inference v1.0 Performance with Dell EMC Servers.

This blog focuses on inference setup and describes the steps to run MLPerf inference v1.0 tests on Dell Technologies servers with NVIDIA GPUs. It enables you to run the tests and reproduce the results that we observed in our HPC and AI Innovation Lab. For details about the hardware and the software stack for different systems in the benchmark, see this list of systems.

The MLPerf inference v1.0 suite contains the following models for benchmark:

  • Resnet50 
  • SSD-Resnet34 
  • BERT 
  • DLRM 
  • RNN-T 
  • 3D U-Net

Note: The BERT, DLRM, and 3D U-Net models have 99% (default accuracy) and 99.9% (high accuracy) targets.

This blog describes steps to run all these benchmarks.

1 Getting started

A system under test consists of a defined set of hardware and software resources that will be measured for performance. The hardware resources may include processors, accelerators, memories, disks, and interconnect. The software resources may include an operating system, compilers, libraries, and drivers that significantly influence the running time of a benchmark. In this case, the system on which you clone the MLPerf repository and run the benchmark is known as the system under test (SUT).

For storage, SSD RAID or local NVMe drives are acceptable for running all the subtests without any performance penalty, because inference does not have strict requirements for fast parallel storage. However, parallel file systems such as BeeGFS or Lustre, or the PixStor storage solution, are helpful when you need to keep multiple copies of large datasets.

2 Prerequisites

Prerequisites for running the MLPerf inference v1.0 tests include:

  • An x86_64 system
  • Docker installed with the NVIDIA runtime hook 
  • Ampere-based NVIDIA GPUs (Turing GPUs include legacy support, but are no longer maintained for optimizations)
  • NVIDIA Driver Version 455.xx or later
  • ECC set to ON
    To set ECC to ON, run the following command and then reboot the system. The new ECC mode takes effect only after a reboot:
    sudo nvidia-smi --ecc-config=1
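You can confirm the driver and ECC prerequisites on every GPU before you start by querying nvidia-smi. The following Python sketch is a hypothetical helper and is not part of the MLPerf repository. It assumes that nvidia-smi is on the PATH and that it reports the current ECC mode as "Enabled" when ECC is on. The parsing is kept separate from the subprocess call, so you can exercise it without a GPU.

```python
import subprocess
from typing import List

# Hypothetical helper: not part of the MLPerf repository.
MIN_DRIVER_MAJOR = 455  # "NVIDIA Driver Version 455.xx or later"
QUERY = ["nvidia-smi", "--query-gpu=driver_version,ecc.mode.current",
         "--format=csv,noheader"]


def query_gpus() -> str:
    """Run nvidia-smi and return one CSV line per GPU: '<driver>, <ECC mode>'."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def check_gpus(csv_text: str) -> List[str]:
    """Return the prerequisite problems found; an empty list means every GPU passes."""
    problems = []
    for index, line in enumerate(csv_text.strip().splitlines()):
        driver, ecc = (field.strip() for field in line.split(","))
        # Compare only the driver branch (major version), for example 460 in 460.32.03.
        if int(driver.split(".")[0]) < MIN_DRIVER_MAJOR:
            problems.append(f"GPU {index}: driver {driver} is older than {MIN_DRIVER_MAJOR}.xx")
        # ecc.mode.current still shows the old mode until the system is rebooted.
        if ecc != "Enabled":
            problems.append(f"GPU {index}: ECC mode is {ecc}, expected Enabled")
    return problems
```

For example, print(check_gpus(query_gpus()) or "All GPUs meet the driver and ECC prerequisites.") lists any GPU that still needs attention. Because the sketch reads ecc.mode.current, a GPU whose ECC change is still waiting for a reboot is reported as a problem.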

3 Preparing to run the MLPerf inference v1.0

Before you can run the MLPerf inference v1.0 tests, perform the following tasks to prepare your environment.

3.1 Clone the MLPerf repository 

  1. Clone the repository to your home directory or to another acceptable path:
     cd ~
     git clone https://github.com/mlcommons/inference_results_v1.0
  2. Go to the closed/DellEMC directory:
     cd inference_results_v1.0/closed/DellEMC
  3. Create a “scratch” directory with at least 3 TB of space in which to store the models, datasets, preprocessed data, and so on:
     mkdir scratch
  4. Export the absolute path of the scratch directory as $MLPERF_SCRATCH_PATH (this example assumes that the repository was cloned into /home/user):
     export MLPERF_SCRATCH_PATH=/home/user/inference_results_v1.0/closed/DellEMC/scratch
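Before you download several terabytes of data, confirm that the variable points at a usable location. This Python sketch is a hypothetical helper, not part of the repository. It checks that MLPERF_SCRATCH_PATH is set, is absolute, names an existing directory, and has at least the recommended 3 TB free. It assumes decimal terabytes (10^12 bytes).

```python
import os
import shutil

# Hypothetical helper: "at least 3 TB" interpreted as decimal terabytes of free space.
MIN_SCRATCH_BYTES = 3 * 10**12


def check_scratch(min_bytes: int = MIN_SCRATCH_BYTES) -> int:
    """Validate MLPERF_SCRATCH_PATH and return the free space in bytes."""
    path = os.environ.get("MLPERF_SCRATCH_PATH")
    if not path:
        raise RuntimeError("MLPERF_SCRATCH_PATH is not set")
    if not os.path.isabs(path):
        raise RuntimeError(f"MLPERF_SCRATCH_PATH must be an absolute path: {path}")
    if not os.path.isdir(path):
        raise RuntimeError(f"scratch directory does not exist: {path}")
    free = shutil.disk_usage(path).free
    if free < min_bytes:
        raise RuntimeError(
            f"only {free / 1e12:.2f} TB free in {path}; need {min_bytes / 1e12:.2f} TB")
    return free
```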

3.2 Set up the configuration file

The closed/DellEMC/configs directory contains a config.json file for each benchmark and scenario. Each file lists the configurations for the Dell servers that were submitted to the MLPerf Inference v1.0 benchmark. If necessary, modify the configs/<benchmark>/<Scenario>/config.json file to include the system that will run the benchmark.

Note: If your system is already present in the configuration file, there is no need to add another configuration. 

In the configs/<benchmark>/<Scenario>/config.json file, select a similar configuration and modify it based on the current system, matching the number and type of GPUs in your system.

For this blog, we used a Dell EMC PowerEdge R7525 server with one A100 GPU as the example. We chose R7525_A100-PCIe-40GBx1 as the name for this new system. Because the R7525_A100-PCIe-40GBx1 system was not already in the list of systems, we added an R7525_A100-PCIe-40GBx1 configuration.

Because the R7525_A100-PCIe-40GBx2 reference system is the most similar, we modified that configuration and picked Resnet50 Server as the example benchmark.

The following example shows the reference configuration for two GPUs for the Resnet50 Server benchmark in the configs/resnet50/Server/config.json file:

"R7525_A100-PCIe-40GBx2": {
    "config_ver": {
    },
    "deque_timeout_us": 2000,
    "gpu_batch_size": 64,
    "gpu_copy_streams": 4,
    "gpu_inference_streams": 3,
    "server_target_qps": 52000,
    "use_cuda_thread_per_device": true,
    "use_graphs": true
},

This example shows the modified configuration for one GPU:

"R7525_A100-PCIe-40GBx1": {
    "config_ver": {
    },
    "deque_timeout_us": 2000,
    "gpu_batch_size": 64,
    "gpu_copy_streams": 4,
    "gpu_inference_streams": 3,
    "server_target_qps": 26000,
    "use_cuda_thread_per_device": true,
    "use_graphs": true
},

We modified only the QPS parameter (server_target_qps) to match the number of GPUs. Because server_target_qps scales linearly with the number of GPUs, target QPS = number of GPUs × QPS per GPU. The two-GPU reference targets 52000 QPS, which is 26000 QPS per GPU, so we set server_target_qps to 26000 for one GPU.
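The arithmetic above can be captured in a small helper. This Python sketch is hypothetical; the repository does not ship it. It assumes, as this blog does, that server_target_qps scales linearly with the GPU count and that every other parameter can be copied unchanged from the reference system. You may still need to tune the result, as described in the Limitations and Best Practices section.

```python
import copy
import json


def scale_server_config(reference: dict, reference_gpus: int, target_gpus: int) -> dict:
    """Derive a Server config for target_gpus from a reference system's config.

    Hypothetical helper: assumes linear QPS scaling and copies all other keys as-is.
    """
    per_gpu_qps = reference["server_target_qps"] / reference_gpus
    scaled = copy.deepcopy(reference)  # leave the reference entry untouched
    scaled["server_target_qps"] = round(per_gpu_qps * target_gpus)
    return scaled


# Reference R7525_A100-PCIe-40GBx2 entry from configs/resnet50/Server/config.json.
reference_x2 = {
    "config_ver": {},
    "deque_timeout_us": 2000,
    "gpu_batch_size": 64,
    "gpu_copy_streams": 4,
    "gpu_inference_streams": 3,
    "server_target_qps": 52000,
    "use_cuda_thread_per_device": True,
    "use_graphs": True,
}

new_x1 = scale_server_config(reference_x2, reference_gpus=2, target_gpus=1)
print(json.dumps({"R7525_A100-PCIe-40GBx1": new_x1}, indent=4))
```

Running it prints an entry with the same values as the one-GPU configuration above.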

3.3 Add the new system to the list of available systems

After you add the new system to the config.json file as shown in the preceding section, add it to the list of available systems in the code/common/system_list.py file. This entry tells the benchmark that the new system exists and ensures that the benchmark selects the correct configuration.

Note: If your system is already added, there is no need to add it to the code/common/system_list.py file. 

At the end of the file, the KnownSystems class defines a list of SystemClass objects that describe the supported systems. Each entry has the following form:

SystemClass(<system ID>, [<list of names reported by nvidia-smi>], [<known PCI IDs of this system>], <architecture>, [<list of known GPU counts>])

Where:

  • For <system ID>, enter the system ID with which you want to identify this system.
  • For <list of names reported by nvidia-smi>, run the nvidia-smi -L command and use the name that is returned.
  • For <known PCI IDs of this system>, run the following command:
$ CUDA_DEVICE_ORDER=PCI_BUS_ID nvidia-smi --query-gpu=gpu_name,pci.device_id --format=csv
name, pci.device_id
A100-PCIE-40GB, 0x20F110DE

The pci.device_id field is in the 0x<PCI ID>10DE format, where 10DE is the NVIDIA PCI vendor ID. Use the four hexadecimal digits between 0x and 10DE as your PCI ID for the system. In this case, it is 20F1.

  • For <architecture>, use the Architecture enum defined at the top of the file. The A100 is an Ampere GPU, so this example uses Architecture.Ampere.
  • For <list of known GPU counts>, enter the GPU counts of the system variants you want to support (for example, [1, 2, 4] to support 1x, 2x, and 4x GPU variants of this system). Because the file already contains a 2x variant of this system, we only needed to add 1 to the list to support our system.
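The PCI ID extraction described above is mechanical, so it can be scripted. This Python sketch is a hypothetical helper that parses the CSV printed by the nvidia-smi command shown earlier. It assumes that the first line is the "name, pci.device_id" header and that each device ID has the 0x<PCI ID>10DE form. It raises an error for any other vendor ID rather than guessing.

```python
from typing import Dict

NVIDIA_VENDOR_ID = "10DE"  # the trailing four hex digits of pci.device_id


def pci_ids_from_nvidia_smi(csv_text: str) -> Dict[str, str]:
    """Map each GPU name to the 4-digit PCI ID used in system_list.py.

    Hypothetical helper. Expects the output of:
        nvidia-smi --query-gpu=gpu_name,pci.device_id --format=csv
    """
    ids = {}
    rows = csv_text.strip().splitlines()[1:]  # skip the "name, pci.device_id" header
    for row in rows:
        name, device_id = (field.strip() for field in row.split(","))
        digits = device_id.upper()
        if digits.startswith("0X"):
            digits = digits[2:]
        if len(digits) != 8 or not digits.endswith(NVIDIA_VENDOR_ID):
            raise ValueError(f"unexpected pci.device_id for {name}: {device_id}")
        ids[name] = digits[:4]  # keep the device ID, drop the vendor ID
    return ids
```

For the sample output above, the function returns {"A100-PCIE-40GB": "20F1"}.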

Note: Because an entry for the PowerEdge R7525 server already exists, we only added the number 1 to its GPU count list, as shown in the following example. If your system does not exist in the system_list.py file, add a complete SystemClass entry, not just the GPU count.

class KnownSystems:
    """
    Global List of supported systems
    """
    # Before adding 1, this entry supported only the R7525_A100-PCIe-40GB x2 variant:
    # R7525_A100_PCIE_40GB = SystemClass("R7525_A100-PCIe-40GB", ["A100-PCIe-40GB"], ["20F1"], Architecture.Ampere, [2])
    # After adding 1, this entry supports both the x1 and x2 variants:
    R7525_A100_PCIE_40GB = SystemClass("R7525_A100-PCIe-40GB", ["A100-PCIe-40GB"], ["20F1"], Architecture.Ampere, [1, 2])
    DSS8440_A100_PCIE_40GB = SystemClass("DSS8440_A100-PCIE-40GB", ["A100-PCIE-40GB"], ["20F1"], Architecture.Ampere, [10])
    DSS8440_A40 = SystemClass("DSS8440_A40", ["A40"], ["2235"], Architecture.Ampere, [10])
    R740_A100_PCIe_40GB = SystemClass("R740_A100-PCIe-40GB", ["A100-PCIE-40GB"], ["20F1"], Architecture.Ampere, [3])
    R750xa_A100_PCIE_40GB = SystemClass("R750xa_A100-PCIE-40GB", ["A100-PCIE-40GB"], ["20F1"], Architecture.Ampere, [4])
    # ... (remaining entries omitted; Architecture is defined at the top of the file)

Note: You must provide separate configurations in the configs/resnet50/Server/config.json file for the x1 and x2 variants. In the preceding example, the R7525_A100-PCIe-40GBx2 configuration differs from the R7525_A100-PCIe-40GBx1 configuration.
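The note above implies a consistency check: every GPU count listed in a SystemClass entry should have a matching key in each config.json file. This Python sketch is hypothetical. It assumes the naming convention seen in this blog, where the config key is the system ID followed by "x" and the GPU count (R7525_A100-PCIe-40GB plus x1 gives R7525_A100-PCIe-40GBx1). It also assumes that the config file is plain JSON.

```python
import json
from typing import Dict, Iterable, List


def load_config(path: str) -> Dict:
    """Load a configs/<benchmark>/<Scenario>/config.json file (assumed plain JSON)."""
    with open(path) as f:
        return json.load(f)


def missing_configs(system_id: str, gpu_counts: Iterable[int], config: Dict) -> List[str]:
    """Return the expected '<system ID>x<GPU count>' keys that config lacks.

    Hypothetical helper: the key convention is inferred from this blog's examples.
    """
    expected = [f"{system_id}x{count}" for count in gpu_counts]
    return [key for key in expected if key not in config]
```

For example, missing_configs("R7525_A100-PCIe-40GB", [1, 2], load_config("configs/resnet50/Server/config.json")) returns an empty list once both the x1 and x2 entries exist.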

3.4 Build the Docker image and required libraries

Build the Docker image and then launch an interactive container. Then, in the interactive container, build the required libraries for inferencing.

  1. To build the Docker image, run the make prebuild command inside the closed/DellEMC folder.

    Command:
    make prebuild 

    The following example shows sample output:

    Launching Docker session
    nvidia-docker run --rm -it -w /work \
       -v /home/user/article_inference_v1.0/closed/DellEMC:/work -v /home/user:/mnt//home/user \
       --cap-add SYS_ADMIN \
       -e NVIDIA_VISIBLE_DEVICES=0 \
       --shm-size=32gb \
       -v /etc/timezone:/etc/timezone:ro -v /etc/localtime:/etc/localtime:ro \
       --security-opt apparmor=unconfined --security-opt seccomp=unconfined \
       --name mlperf-inference-user -h mlperf-inference-user --add-host mlperf-inference-user:127.0.0.1 \
       --user 1002:1002 --net host --device /dev/fuse \
       -v /home/user/inference_results_v1.0/closed/DellEMC/scratch:/home/user/inference_results_v1.0/closed/DellEMC/scratch \
       -e MLPERF_SCRATCH_PATH=/home/user/inference_results_v1.0/closed/DellEMC/scratch \
       -e HOST_HOSTNAME=node009 \
       mlperf-inference:user

    The Docker container is launched with all the necessary packages installed.

  2. Access the interactive terminal in the container.
  3. To build the required libraries for inferencing, run the make build command inside the interactive container.

    Command:
    make build

    The following example shows sample output:

    (mlperf) user@mlperf-inference-user:/work$ make build
    …….
    [ 26%] Linking CXX executable /work/build/bin/harness_default
    make[4]: Leaving directory '/work/build/harness'
    make[4]: Leaving directory '/work/build/harness'
    make[4]: Leaving directory '/work/build/harness'
    [ 36%] Built target harness_bert
    [ 50%] Built target harness_default
    [ 55%] Built target harness_dlrm
    make[4]: Leaving directory '/work/build/harness'
    [ 63%] Built target harness_rnnt
    make[4]: Leaving directory '/work/build/harness'
    [ 81%] Built target harness_triton
    make[4]: Leaving directory '/work/build/harness'
    [100%] Built target harness_triton_mig
    make[3]: Leaving directory '/work/build/harness'
    make[2]: Leaving directory '/work/build/harness'
    Finished building harness.
    make[1]: Leaving directory '/work'
    (mlperf) user@mlperf-inference-user:/work$

    The build is complete, and you can now run the benchmarks in the container.

3.5 Download and preprocess validation data and models

To run MLPerf inference v1.0, download the datasets and models, and then preprocess them. MLPerf provides scripts that download the trained models. The scripts also download the datasets for all benchmarks other than Resnet50, DLRM, and 3D U-Net.

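For Resnet50, DLRM, and 3D U-Net, register for an account and then download the datasets manually. The Resnet50 dataset is covered in the Run the Resnet50 benchmark section. For the other two datasets: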

  • DLRM—Download the Criteo Terabyte dataset and extract the downloaded file to $MLPERF_SCRATCH_PATH/data/criteo/
  • 3D U-Net—Download the BraTS challenge data and extract the downloaded file to $MLPERF_SCRATCH_PATH/data/BraTS/MICCAI_BraTS_2019_Data_Training

Then run the following commands to download all the models and every dataset other than Resnet50, DLRM, and 3D U-Net, and to preprocess the data:

$ make download_model # Downloads models and saves to $MLPERF_SCRATCH_PATH/models
$ make download_data # Downloads datasets and saves to $MLPERF_SCRATCH_PATH/data
$ make preprocess_data # Preprocesses data and saves to $MLPERF_SCRATCH_PATH/preprocessed_data
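The comments on these commands name the scratch subdirectories that each step writes to. Together with the manual download locations listed above, they give a quick way to see which steps are still outstanding. This Python sketch is a hypothetical helper. It checks only the DLRM and 3D U-Net locations from this section, because the Resnet50 location is described separately.

```python
import os
from typing import Dict, List

# Written by make download_model, make download_data, and make preprocess_data.
SCRATCH_SUBDIRS = ["models", "data", "preprocessed_data"]

# Manually downloaded datasets and the locations given in this section.
MANUAL_DATASETS: Dict[str, str] = {
    "dlrm": os.path.join("data", "criteo"),
    "3d-unet": os.path.join("data", "BraTS", "MICCAI_BraTS_2019_Data_Training"),
}


def missing_scratch_paths(scratch_path: str) -> List[str]:
    """Return expected paths (relative to scratch_path) that do not exist yet.

    Hypothetical helper: checks directory existence only, not file contents.
    """
    expected = SCRATCH_SUBDIRS + list(MANUAL_DATASETS.values())
    return [rel for rel in expected
            if not os.path.isdir(os.path.join(scratch_path, rel))]
```

For example, missing_scratch_paths(os.environ["MLPERF_SCRATCH_PATH"]) lists any directory that a download, extraction, or preprocessing step has not yet created.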

Note: These commands download all the datasets, which might not be necessary if you only want to run one specific benchmark. To run a specific benchmark rather than all the benchmarks, see the following sections for information about that benchmark.

The closed/DellEMC directory (mounted as /work in the container) has the following structure:

(mlperf) user@mlperf-inference-user:/work$ tree -d -L 1
.
├── build
├── code
├── compliance
├── configs
├── data_maps
├── docker
├── measurements
├── power
├── results
├── scripts
└── systems

The folders contain the following:

  • build: Logs, preprocessed data, engines, models, plugins, and so on
  • code: Source code for all the benchmarks
  • compliance: Passed compliance checks
  • configs: Configurations that run different benchmarks for different system setups
  • data_maps: Data maps for different benchmarks
  • docker: Docker files to support building the container
  • measurements: Measurement values for different benchmarks
  • power: Files specific to power submissions (needed only for power submissions)
  • results: Final result logs
  • scratch: Storage for models, preprocessed data, and datasets, symlinked to the build directory
  • scripts: Support scripts
  • systems: Hardware and software details of the systems in the benchmark

4 Running the benchmarks

After you have performed the preceding tasks to prepare your environment, run any of the benchmarks that are required for your tests.

The Resnet50, SSD-Resnet34, and RNN-T benchmarks have 99% (default accuracy) targets. 

The BERT, DLRM, and 3D U-Net benchmarks have 99% (default accuracy) and 99.9% (high accuracy) targets. For information about running these benchmarks, see the Running high accuracy target benchmarks section below.   

If you downloaded and preprocessed all the datasets (as shown in the previous section), there is no need to do so again. Skip the download and preprocessing steps in the procedures for the following benchmarks. 

NVIDIA TensorRT is the inference engine for the backend. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning applications.
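Every benchmark in the following sections uses the same command pattern and varies only the benchmark name, scenario, accuracy target, and test mode. The following Python sketch is a hypothetical helper, not part of the MLPerf repository. It generates the same make commands used in this blog and encodes the rules stated here: only BERT, DLRM, and 3D U-Net have high_accuracy configurations, and 3D U-Net has only the Offline scenario.

```python
from typing import List

# Hypothetical helper encoding the rules described in this blog.
HIGH_ACCURACY = {"bert", "dlrm", "3d-unet"}  # benchmarks with 99.9% targets
OFFLINE_ONLY = {"3d-unet"}                   # 3D U-Net has no Server scenario


def scenarios_for(benchmark: str) -> List[str]:
    return ["Offline"] if benchmark in OFFLINE_ONLY else ["Offline", "Server"]


def config_vers_for(benchmark: str) -> List[str]:
    return ["default", "high_accuracy"] if benchmark in HIGH_ACCURACY else ["default"]


def engine_command(benchmark: str) -> str:
    """One generate_engines call covering every scenario and accuracy target."""
    scenarios = ",".join(scenarios_for(benchmark))
    config_vers = ",".join(config_vers_for(benchmark))
    run_args = f"--benchmarks={benchmark} --scenarios={scenarios} --config_ver={config_vers}"
    return f'make generate_engines RUN_ARGS="{run_args}"'


def harness_commands(benchmark: str, test_mode: str) -> List[str]:
    """One run_harness call per (accuracy target, scenario), in this blog's order."""
    commands = []
    for config_ver in config_vers_for(benchmark):
        for scenario in scenarios_for(benchmark):
            run_args = (f"--benchmarks={benchmark} --scenarios={scenario} "
                        f"--config_ver={config_ver} --test_mode={test_mode}")
            commands.append(f'make run_harness RUN_ARGS="{run_args}"')
    return commands
```

For example, printing engine_command("bert") followed by harness_commands("bert", "PerformanceOnly") and harness_commands("bert", "AccuracyOnly") reproduces the commands in the Run the BERT benchmark section.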

4.1 Run the Resnet50 benchmark

To set up the Resnet50 dataset and model to run the inference:

  1. If you already downloaded and preprocessed the datasets, go to step 5.
  2. Download the required validation dataset (https://github.com/mlcommons/training/tree/master/image_classification).
  3. Extract the images to $MLPERF_SCRATCH_PATH/data/dataset/ 
  4. Run the following commands:
    make download_model BENCHMARKS=resnet50
    make preprocess_data BENCHMARKS=resnet50
  5. Generate the TensorRT engines:
    # Generates the TensorRT engines with the specified configuration; in this case, engines for both the Offline and Server scenarios.
    
    make generate_engines RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline,Server --config_ver=default"
  6. Run the benchmark:
    # run the performance benchmark
    
    make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly" 
    make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
    
    # run the accuracy benchmark 
    
    make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly" 
    make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"

    The following example shows the output of a Server scenario run in PerformanceOnly mode that reports a “VALID” result:

    ======================= Perf harness results: =======================
    R7525_A100-PCIe-40GBx1_TRT-default-Server:
          resnet50: Scheduled samples per second : 25992.97 and Result is : VALID
    ======================= Accuracy results: =======================
    R7525_A100-PCIe-40GBx1_TRT-default-Server:
         resnet50: No accuracy results in PerformanceOnly mode.
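When you run many configurations, you can collect the throughput and VALID/INVALID status from the harness output instead of reading it by eye. This Python sketch is a hypothetical helper. Its regular expression assumes the exact line format shown in the sample output above, so other scenarios or versions may print a different line that needs a different pattern.

```python
import re
from typing import List, Tuple

# Matches lines such as:
#   resnet50: Scheduled samples per second : 25992.97 and Result is : VALID
PERF_LINE = re.compile(
    r"^\s*(?P<benchmark>[\w-]+): Scheduled samples per second : "
    r"(?P<qps>[\d.]+) and Result is : (?P<result>\w+)\s*$",
    re.MULTILINE,
)


def parse_perf_results(log_text: str) -> List[Tuple[str, float, str]]:
    """Return (benchmark, samples per second, result) for each matching line.

    Hypothetical helper: lines in any other format are ignored.
    """
    return [(m["benchmark"], float(m["qps"]), m["result"])
            for m in PERF_LINE.finditer(log_text)]
```

For the sample output above, parse_perf_results returns [("resnet50", 25992.97, "VALID")].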

4.2 Run the SSD-Resnet34 benchmark

To set up the SSD-Resnet34 dataset and model to run the inference:

  1. If necessary, download and preprocess the dataset:
    make download_model BENCHMARKS=ssd-resnet34
    make download_data BENCHMARKS=ssd-resnet34 
    make preprocess_data BENCHMARKS=ssd-resnet34
  2. Generate the TensorRT engines:
    # Generates the TensorRT engines with the specified configuration; in this case, engines for both the Offline and Server scenarios.
    
    make generate_engines RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline,Server --config_ver=default"
  3. Run the benchmark:
    # run the performance benchmark
    
    make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
     
    # run the accuracy benchmark
    
    make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly" 
    make run_harness RUN_ARGS="--benchmarks=ssd-resnet34 --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"

4.3 Run the RNN-T benchmark

To set up the RNN-T dataset and model to run the inference:

  1. If necessary, download and preprocess the dataset:
    make download_model BENCHMARKS=rnnt
    make download_data BENCHMARKS=rnnt 
    make preprocess_data BENCHMARKS=rnnt
  2. Generate the TensorRT engines:
    # Generates the TensorRT engines with the specified configuration; in this case, engines for both the Offline and Server scenarios.
    
    make generate_engines RUN_ARGS="--benchmarks=rnnt --scenarios=Offline,Server --config_ver=default" 
  3. Run the benchmark:
    # run the performance benchmark
    
    make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Server --config_ver=default --test_mode=PerformanceOnly" 
     
    # run the accuracy benchmark 
    
    make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly" 
    make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"         

5 Running high accuracy target benchmarks

The BERT, DLRM, and 3D U-Net benchmarks have high accuracy targets.

5.1 Run the BERT benchmark

To set up the BERT dataset and model to run the inference:

  1. If necessary, download and preprocess the dataset:
    make download_model BENCHMARKS=bert
    make download_data BENCHMARKS=bert 
    make preprocess_data BENCHMARKS=bert
  2. Generate the TensorRT engines:
    # Generates the TensorRT engines with the specified configuration; in this case, engines for both the Offline and Server scenarios and for both the default and high accuracy targets.
     
    make generate_engines RUN_ARGS="--benchmarks=bert --scenarios=Offline,Server --config_ver=default,high_accuracy"
  3. Run the benchmark:
    # run the performance benchmark
    
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=high_accuracy --test_mode=PerformanceOnly" 
     
    # run the accuracy benchmark
      
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly" 
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=default --test_mode=AccuracyOnly" 
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly" 
    make run_harness RUN_ARGS="--benchmarks=bert --scenarios=Server --config_ver=high_accuracy --test_mode=AccuracyOnly"

5.2 Run the DLRM benchmark

To set up the DLRM dataset and model to run the inference:

  1. If you already downloaded and preprocessed the datasets, go to step 5.
  2. Download the Criteo Terabyte dataset.
  3. Extract the downloaded files to the $MLPERF_SCRATCH_PATH/data/criteo/ directory.
  4. Run the following commands:
    make download_model BENCHMARKS=dlrm
    make preprocess_data BENCHMARKS=dlrm
  5. Generate the TensorRT engines:
    # Generates the TensorRT engines with the specified configuration; in this case, engines for both the Offline and Server scenarios and for both the default and high accuracy targets.
    
    make generate_engines RUN_ARGS="--benchmarks=dlrm --scenarios=Offline,Server --config_ver=default,high_accuracy"
  6. Run the benchmark:
    # run the performance benchmark
    
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=default --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=high_accuracy --test_mode=PerformanceOnly"
     
    # run the accuracy benchmark
    
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly"
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=default --test_mode=AccuracyOnly"
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"
    make run_harness RUN_ARGS="--benchmarks=dlrm --scenarios=Server --config_ver=high_accuracy --test_mode=AccuracyOnly"

5.3 Run the 3D U-Net benchmark

Note: This benchmark only has the Offline scenario.

To set up the 3D U-Net dataset and model to run the inference:

  1. If you already downloaded and preprocessed the datasets, go to step 5.
  2. Download the BraTS challenge data.
  3. Extract the downloaded files to the $MLPERF_SCRATCH_PATH/data/BraTS/MICCAI_BraTS_2019_Data_Training directory.
  4. Run the following commands:
    make download_model BENCHMARKS=3d-unet
    make preprocess_data BENCHMARKS=3d-unet
  5. Generate the TensorRT engines:
    # Generates the TensorRT engines with the specified configuration; in this case, engines for the Offline scenario (the only scenario for 3D U-Net) and for both the default and high accuracy targets.
    
    make generate_engines RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default,high_accuracy"
  6. Run the benchmark:
    # run the performance benchmark
    
    make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default --test_mode=PerformanceOnly"
    make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=high_accuracy --test_mode=PerformanceOnly"
     
    # run the accuracy benchmark 
    
    make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=default --test_mode=AccuracyOnly" 
    make run_harness RUN_ARGS="--benchmarks=3d-unet --scenarios=Offline --config_ver=high_accuracy --test_mode=AccuracyOnly"

6 Limitations and Best Practices for Running MLPerf

Note the following limitations and best practices:

  • To build the engines and run the benchmark with a single command, use the make run RUN_ARGS="…" shortcut. It is a valid alternative to running make generate_engines RUN_ARGS="…" && make run_harness RUN_ARGS="…".
  • To test quickly, include the --fast flag in RUN_ARGS. The benchmark then runs for one minute instead of the default 10 minutes. For example:
    make run_harness RUN_ARGS="--fast --benchmarks=<bmname> --scenarios=<scenario> --config_ver=<cver> --test_mode=PerformanceOnly"

  • If a Server scenario run reports “INVALID”, reduce server_target_qps. “INVALID” results are expected when the run does not meet the latency constraints.
  • If an Offline scenario run reports “INVALID”, increase gpu_offline_expected_qps. Offline runs become “INVALID” when the system can deliver a significantly higher QPS than the gpu_offline_expected_qps setting. LoadGen sizes the Offline run from the expected QPS, so a much faster system finishes before the minimum run duration.
  • If the batch size changes, rebuild the engine.
  • Only the BERT, DLRM, and 3D U-Net benchmarks support high accuracy targets.
  • 3D U-Net supports only the Offline scenario.
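The advice to reduce server_target_qps after an "INVALID" Server run can be made systematic with a binary search between a target known to be VALID and one known to be INVALID. This Python sketch is a swapped-in tuning technique, not the method used for Dell's submissions. Its is_valid callback is hypothetical: you would write it yourself to edit config.json, run make run_harness, and check the result. The sketch assumes that validity is monotonic in the target QPS. Each probe is a full benchmark run, and results can vary between runs, so confirm the final value with a full-length run without --fast.

```python
from typing import Callable


def find_max_valid_qps(is_valid: Callable[[int], bool], low: int, high: int,
                       tolerance: int = 500) -> int:
    """Return the highest server_target_qps found to give a VALID run.

    Assumes `low` is known to be VALID, `high` is known to be INVALID, and
    validity is monotonic: if a target is VALID, every lower target is too.
    Stops once the VALID/INVALID bracket is at most `tolerance` QPS wide.
    """
    if low >= high:
        raise ValueError("low must be below high")
    tolerance = max(1, tolerance)  # a zero tolerance could loop forever
    while high - low > tolerance:
        mid = (low + high) // 2
        if is_valid(mid):  # hypothetical: set the target, run the harness, parse the result
            low = mid
        else:
            high = mid
    return low
```

Each probe halves the search range. For example, narrowing a bracket of 20000 to 30000 QPS down to 500 QPS takes five runs.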

7 Conclusion

This blog provides step-by-step procedures to run and reproduce MLPerf inference v1.0 results on Dell Technologies servers with NVIDIA GPUs.




