
Dell EMC Servers Shine in MLPerf Inference v0.7 Benchmark
Mon, 02 Nov 2020 15:16:55 -0000
|Read Time: 0 minutes
As software applications and systems using Artificial Intelligence (AI) gain mainstream adoption across all industries, inference workloads for ongoing operations are becoming a larger resource consumer in the datacenter. MLPerf is a benchmark suite that is used to evaluate the performance profiles of systems for both training and inference AI tasks. In this blog we take a closer look at the recent results submitted by Dell EMC and how our various servers performed in the datacenter category.
We do this work to help customers understand which server platform makes the most sense for their use case. Dell Technologies wants to make that choice easier and reduce the evaluation work our customers have to do, so they can spend their time on the use case itself and accelerate time to value for the business.
Dell Technologies made a total of 210 submissions for MLPerf Inference v0.7 in the Datacenter category using various server platforms and accelerators. Why so many? Many customers have never run AI in their environment, the potential use cases span every industry, and in-house expertise is limited. Customers have told us they need help identifying the correct server platform for their workloads.
We’re proud of what we’ve done, but it’s still all about helping customers adopt AI. By sharing our expertise and providing guidance on infrastructure for AI, we help customers become successful and get their use case into production.
MLPerf Benchmarks
MLPerf was founded in 2018 with a goal of accelerating improvements in ML system performance. Formed as a collaboration of companies and researchers from leading educational institutions, MLPerf leverages open source code, public state-of-the-art Machine Learning (ML) models and publicly available datasets contributed to the ML community. The MLPerf suites include MLPerf Training and MLPerf Inference.
MLPerf Training measures how fast a system can train machine learning models. Training benchmarks have been defined for image classification, lightweight and heavy-weight object detection, language translation, natural language processing, recommendation and reinforcement learning. Each benchmark includes specifications for input datasets, quality targets and reference implementation models. The first round of training submissions was published on the MLPerf website in December 2018 with results submitted by Google, Intel and NVIDIA.
The MLPerf Inference suite measures how quickly a trained neural network can evaluate new data and perform forecasting or classification for a wide range of applications. MLPerf Inference includes image classification, object detection and machine translation with specific models, datasets, quality, server latency and multi-stream latency constraints. MLPerf validated and published results for MLPerf Inference v0.7 on October 21, 2020. In this blog we take a closer look at the MLPerf Inference v0.7 results submitted by Dell EMC and how the servers performed in the datacenter category.
A summary of the key highlights of the Dell EMC results is shown in Table 1. These are derived from the submitted results in the MLPerf datacenter closed category. Ranking and claims are based on Dell analysis of published MLPerf data. Per-accelerator performance is calculated by dividing the primary metric of total performance by the number of accelerators reported; a minimal sketch of this calculation follows Table 1.
Rank | Category | Specifics | Use Cases
---|---|---|---
#1 | Performance per accelerator with NVIDIA A100-PCIe | PowerEdge R7525 | Medical Imaging, Image Classification
#1 | Performance per accelerator with NVIDIA T4 GPUs | PowerEdge XE2420, PowerEdge R7525, DSS 8440 | Medical Imaging, NLP, Image Classification, Speech Recognition, Object Detection, Recommendation
#1 | Highest inference results with Quadro RTX6000 and RTX8000 | PowerEdge R7525, DSS 8440 | Medical Imaging, NLP, Image Classification, Speech Recognition, Object Detection, Recommendation
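As a minimal sketch of that per-accelerator calculation, the snippet below divides a submission's total throughput by its accelerator count. The total shown is back-computed from the 30,005 samples/s per-accelerator ResNet50 Server figure cited later in this blog, so treat it as illustrative rather than an official MLPerf entry.

```python
# Per-accelerator performance = primary metric (total system throughput) / number of accelerators.
# The total below is back-computed from the 30,005 samples/s per-accelerator figure
# cited later for the 3x A100 PowerEdge R7525 (illustrative, not an official MLPerf entry).

submission = {"system": "PowerEdge R7525 (3x NVIDIA A100-PCIe)",
              "accelerators": 3,
              "total_throughput": 90_015.0}   # samples/s, ResNet50 Server scenario

per_accelerator = submission["total_throughput"] / submission["accelerators"]
print(f'{submission["system"]}: {per_accelerator:,.0f} samples/s per accelerator')  # ~30,005
```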
Dell EMC had a total of 210 submissions for MLPerf Inference v0.7 in the Datacenter category using various Dell EMC platforms and accelerators from leading vendors. We achieved impressive results when compared to other submissions in the same class of platforms.
MLPerf Inference Categories and Dell EMC Achievements
A benchmark suite is made up of tasks or models from vision, speech, language and commerce use cases. MLPerf Inference measures how fast a system can perform ML inference by using a load generator against the System Under Test (SUT) where the trained model is deployed.
MLPerf Inference v0.7 defines three types of benchmark tests: one for datacenter systems, one for edge systems and one for mobile systems. MLPerf then defines four scenarios to enable representative testing of a wide variety of inference platforms and use cases:
- Single stream
- Multiple stream
- Server
- Offline
The single-stream and multi-stream scenarios are used only for the edge and mobile inference benchmarks. The datacenter benchmark type targets systems designed for data center deployments and requires evaluation of both the Server and Offline scenarios. The metric used in the Datacenter category is inference operations per second. In the Server scenario, the MLPerf load generator sends new queries to the SUT according to a Poisson distribution; this is representative of online AI applications such as translation or image tagging, which have variable arrival patterns driven by end-user traffic. The Offline scenario represents AI tasks performed through batch processing, such as photo categorization, where all the data is available ahead of time.
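To make the difference concrete, here is a small sketch (not the actual MLPerf LoadGen implementation) of how the two scenarios generate queries: the Offline scenario issues one query containing every sample, while the Server scenario issues single-sample queries whose arrival times follow a Poisson process.

```python
import random

def offline_queries(num_samples):
    """Offline scenario sketch: one query containing every sample, issued at time zero."""
    return [(0.0, list(range(num_samples)))]

def server_queries(num_samples, target_qps, seed=0):
    """Server scenario sketch: one sample per query, Poisson arrivals at the target QPS."""
    rng = random.Random(seed)
    t, queries = 0.0, []
    for sample_id in range(num_samples):
        t += rng.expovariate(target_qps)  # exponential inter-arrival gaps => Poisson process
        queries.append((t, [sample_id]))
    return queries

print(offline_queries(4))          # [(0.0, [0, 1, 2, 3])]
print(server_queries(4, 100.0))    # four single-sample queries with random timestamps
```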
Dell EMC published multiple results in the datacenter systems category. Details on the models, datasets and the scenarios submitted for the different datacenter benchmarks are shown in Table 2.
Area | Task | Model | Dataset | Required Scenarios
---|---|---|---|---
Vision | Image classification | ResNet50-v1.5 | ImageNet (224x224) | Server, Offline
Vision | Object detection (large) | SSD-ResNet34 | COCO (1200x1200) | Server, Offline
Vision | Medical image segmentation | 3D U-Net | BraTS 2019 (224x224x160) | Offline
Speech | Speech-to-text | RNN-T | LibriSpeech dev-clean (samples < 15 seconds) | Server, Offline
Language | Language processing | BERT | SQuAD v1.1 (max_seq_len=384) | Server, Offline
Commerce | Recommendation | DLRM | 1TB Click Logs | Server, Offline
Next we highlight some of the key performance achievements for the broad range of solutions available in the Dell EMC portfolio for inference use cases and deployments.
1. Dell EMC is #1 in total number of datacenter submissions in the closed division, including bare-metal submissions using different GPUs, Xeon CPUs and a Xilinx FPGA, as well as a virtualized submission on VMware vSphere
The closed division enables head-to-head comparisons and covers server platforms used from the Edge to private or public clouds. The Dell Technologies engineering team submitted 210 out of the total 509 results.
We remain committed to helping customers deploy inference workloads as efficiently as possible, meeting their unique requirements of power, density, budget and performance. The wide range of servers submitted by Dell Technologies is a testament to this commitment:
- The only vendor with submissions for a variety of inference solutions, leveraging GPUs, FPGAs and CPUs for the datacenter/private cloud and the Edge
- Unique in the industry by submitting results across a multitude of servers that range from mainstream servers (R740/R7525) to dense GPU-optimized servers supporting up to 16 NVIDIA GPUs (DSS8440).
- Demonstrated that customers that demand real-time inferencing at the telco or retail Edge can deploy up to 4 GPUs in a short depth NEBS-compliant PowerEdge XE2420 server.
- Demonstrated efficient Inference performance using the 2nd Gen Intel Xeon Scalable platform on the PowerEdge R640 and PowerEdge R740 platforms for customers wanting to run inference on Intel CPUs.
- Dell submissions using Xilinx U280 in PowerEdge R740 demonstrated that customers wanting low latency inference can leverage FPGA solutions.
2. Dell EMC is #1 in performance “per Accelerator” with PowerEdge R7525 and A100-PCIe for multiple benchmarks
The Dell EMC PowerEdge R7525 was purpose-built for superior accelerated performance. The MLPerf results validated leading performance across many scenarios including:
Performance Rank “Per Accelerator” | Inference Throughput | Dell EMC System
---|---|---
#1 ResNet50 (Server) | 30,005 | PowerEdge R7525 (3x NVIDIA A100-PCIe)
#1 3D-UNet-99 (Offline) | 39 | PowerEdge R7525 (3x NVIDIA A100-PCIe)
#1 3D-UNet-99.9 (Offline) | 39 | PowerEdge R7525 (3x NVIDIA A100-PCIe)
#2 DLRM-99 (Server) | 192,543 | PowerEdge R7525 (2x NVIDIA A100-PCIe)
#2 DLRM-99.9 (Server) | 192,543 | PowerEdge R7525 (2x NVIDIA A100-PCIe)
3. Dell achieved the highest inference scores with NVIDIA Quadro RTX GPUs using the DSS 8440 and R7525
Dell Technologies engineering understands that training isn't the only AI workload and that using the right technology for each job is far more cost effective. Dell is the only vendor to submit results using NVIDIA Quadro RTX6000 and RTX8000 GPUs, which provide up to 48 GB of GDDR6 memory for large inference models. The DSS 8440 with 10 Quadro RTX GPUs achieved:
- #2 and #3 highest system performance on RNN-T for Offline scenario.
The #1 ranking was delivered by a system using 8x NVIDIA A100 SXM4 GPUs, introduced in May 2020, which is a powerful platform for training state-of-the-art deep learning models. Dell Technologies took the #2 and #3 spots with the DSS 8440 server equipped with 10x NVIDIA RTX8000 GPUs and the DSS 8440 with 10x NVIDIA RTX6000 GPUs, providing better power and cost efficiency for inference workloads compared to other submissions.
4. Dell EMC claims #1 spots for NVIDIA T4 platforms with DSS 8440, XE2420 and PowerEdge R7525
Dell Technologies provides system options for customers to deploy inference workloads that match their unique requirements. Today’s accelerators vary significantly in price, performance and power consumption. For example, the NVIDIA T4 is a low profile, lower power GPU option that is widely deployed for inference due to its superior power efficiency and economic value for that use case.
The MLPerf results corroborate the exemplary inference performance of the NVIDIA T4 on Dell EMC servers. Dell EMC systems lead in performance per GPU among the 20 servers that submitted scores using NVIDIA T4 GPUs:
- #1 in performance per GPU on 3d-unet-99 and 3d-unet-99.9 Offline scenario
- #1 in performance per GPU on Bert-99 Server and Bert-99.9 Offline scenario
- #1, #2 and #3 in performance with T4 on DLRM-99 & DLRM-99.9 Server scenario
- #1 in performance per GPU on ResNet50 Offline scenario
- #1 in performance per GPU on RNN-T Server and Offline scenario
- #1 in performance per GPU on SSD-large Offline scenario
The best scores achieved for the NVIDIA T4 “Per GPU” rankings above, and the respective platforms, are shown in the following table; a sketch of how such rankings can be derived appears after the table:
Benchmark | Offline Rank | Offline Throughput | Offline System | Server Rank | Server Throughput | Server System
---|---|---|---|---|---|---
3d-unet-99 | #1 | 7.6 | XE2420 | n/a | n/a | n/a
3d-unet-99.9 | #1 | 7.6 | XE2420 | n/a | n/a | n/a
bert-99 | #3 | 449 | XE2420 | #1 | 402 | XE2420
bert-99.9 | #1 | 213 | DSS 8440 | #2 | 190 | XE2420
dlrm-99 | #2 | 35,054 | XE2420 | #1 | 32,507 | R7525
dlrm-99.9 | #2 | 35,054 | XE2420 | #1 | 32,507 | R7525
resnet | #1 | 6,285 | XE2420 | #4 | 5,663 | DSS 8440
rnnt | #1 | 1,560 | XE2420 | #1 | 1,146 | XE2420
ssd-large | #1 | 142 | XE2420 | #2 | 131 | DSS 8440
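To show how such per-GPU rankings can be derived from published data, here is a minimal, hypothetical sketch. The XE2420 total is back-computed from the 6,285 samples/s per-GPU ResNet figure above; the other two totals are made-up placeholders, not actual submission numbers.

```python
# Hypothetical sketch: rank T4-based systems on a single benchmark/scenario by
# per-GPU throughput (total throughput divided by GPU count).
# Only the XE2420 total is back-computed from the table above; the rest are placeholders.

submissions = [
    {"system": "PowerEdge XE2420 (4x T4)", "gpus": 4,  "throughput": 25_140.0},
    {"system": "PowerEdge R7525 (8x T4)",  "gpus": 8,  "throughput": 47_800.0},
    {"system": "DSS 8440 (16x T4)",        "gpus": 16, "throughput": 91_500.0},
]

ranked = sorted(submissions, key=lambda s: s["throughput"] / s["gpus"], reverse=True)
for rank, s in enumerate(ranked, start=1):
    per_gpu = s["throughput"] / s["gpus"]
    print(f'#{rank} {s["system"]}: {per_gpu:,.0f} samples/s per GPU')
```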
5. Dell is the only vendor to submit results on virtualized infrastructure with vCPUs and NVIDIA virtual GPUs (vGPU) on VMware vSphere
Customers interested in deploying inference workloads for AI on virtualized infrastructure can leverage Dell servers with VMware software to reap the benefits of virtualization.
To demonstrate efficient virtualized performance on 2nd Generation Intel Xeon Scalable processors, Dell EMC and VMware submitted results using vSphere and OpenVINO on the PowerEdge R640.
- Virtualization overhead for a single VM was observed to be minimal, and testing showed that multiple VMs could be deployed on a single server to achieve ~26% better throughput compared to a bare-metal environment.
Dell EMC has published guidance on virtualizing GPUs using DirectPath I/O, NVIDIA Virtual Compute Server (vCS) and more. Dell EMC and VMware used NVIDIA vCS virtualization software in vSphere for MLPerf Inference benchmarks on virtualized NVIDIA T4 GPUs:
- VMware vSphere using NVIDIA vCS delivers near bare metal performance for MLPerf Inference v0.7 benchmarks. The inference throughput (queries processed per second) increases linearly as the number of vGPUs attached to the VM increases.
Blogs covering these virtualized tests in greater detail are published at VMware’s performance Blog site.
This concludes our coverage of the top five highlights from the 210 submissions made by Dell EMC in the datacenter division. Next, we discuss other aspects of the GPU-optimized portfolio that are important for customers: quality and support.
Dell has the highest number of NVIDIA GPU submissions using NVIDIA NGC Ready systems
Dell GPU enabled platforms are part of NVIDIA NGC-Ready and NGC-Ready for Edge validation programs. At Dell, we understand that performance is critical, but customers are not willing to compromise quality and reliability to achieve maximum performance. Customers can confidently deploy inference and other software applications from the NVIDIA NGC catalog knowing that the Dell systems meet all the requirements set by NVIDIA to deploy customer workloads on-premises or at the Edge.
NVIDIA NGC validated configs that were used for this round of MLPerf submissions are:
- Dell EMC PowerEdge XE2420 (4x T4)
- Dell EMC DSS 8440 (10x Quadro RTX 8000)
- Dell EMC DSS 8440 (12x T4)
- Dell EMC DSS 8440 (16x T4)
- Dell EMC DSS 8440 (8x Quadro RTX 8000)
- Dell EMC PowerEdge R740 (4x T4)
- Dell EMC PowerEdge R7515 (4x T4)
- Dell EMC PowerEdge R7525 (2x A100-PCIE)
- Dell EMC PowerEdge R7525 (3x Quadro RTX 8000)
The Dell EMC portfolio can address customers' inference needs from on-premises to the edge
In this blog, we highlighted the results submitted by Dell EMC to demonstrate how our various servers performed in the datacenter category. The Dell EMC server portfolio provides many options for customers wanting to deploy AI inference in their datacenters or on the edge. We also offer a wide range of accelerator options, including multiple GPU and FPGA models, for running inference either on bare metal or virtualized infrastructure to meet specific application and deployment requirements.
Finally, we list the performance for a subset of the server platforms that we see most commonly used by customers today for running inference workloads. These rankings highlight that these platforms can support the wide range of inference use cases showcased in the MLPerf suite.
1. The Dell EMC PowerEdge XE2420 with 4x NVIDIA T4 GPUs: Ranked between #1 and #3 in 14 out of 16 benchmark categories when compared with other T4 Servers
Dell EMC PowerEdge XE2420 (4x T4) Per Accelerator Ranking*

Benchmark | Offline | Server
---|---|---
3d-unet-99 | #1 | n/a
3d-unet-99.9 | #1 | n/a
bert-99 | #3 | #1
bert-99.9 | #2 | #2
dlrm-99 | #1 | #3
dlrm-99.9 | #1 | #3
resnet | #1 |
rnnt | #1 | #1
ssd-large | #1 |
2. Dell EMC PowerEdge R7525 with 8x T4 GPUs: Ranked between #1 and #5 in 11 out of 16 benchmark categories among T4 server submissions
Dell EMC PowerEdge R7525 (8x T4) Per Accelerator Ranking*

Benchmark | Offline | Server
---|---|---
3d-unet-99 | #4 | n/a
3d-unet-99.9 | #4 | n/a
bert-99 | #4 |
dlrm-99 | #2 | #1
dlrm-99.9 | #2 | #1
rnnt | #2 | #5
ssd-large | #5 |
3. The Dell EMC PowerEdge R7525 with up to 3x A100-PCIe: Ranked between #3 and #10 in 15 out of 16 benchmark categories across all datacenter submissions
Dell EMC PowerEdge R7525 (2|3x A100-PCIe) Per Accelerator Ranking*

Benchmark | Offline | Server
---|---|---
3d-unet-99 | #4 | n/a
3d-unet-99.9 | #4 | n/a
bert-99 | #8 | #9
bert-99.9 | #7 | #8
dlrm-99 | #6 | #4
dlrm-99.9 | #6 | #4
resnet | #10 | #3
rnnt | #6 | #7
ssd-large | #10 |
4. The Dell EMC DSS 8440 with 16x T4 ranked between #3 and #7 when compared against all submissions using T4
Dell EMC DSS 8440 (16x T4)

Benchmark | Offline | Server
---|---|---
3d-unet-99 | #4 | n/a
3d-unet-99.9 | #4 | n/a
bert-99 | #6 | #4
bert-99.9 | #7 | #5
dlrm-99 | #3 | #3
dlrm-99.9 | #3 | #3
resnet | #6 | #4
rnnt | #5 | #5
ssd-large | #7 | #5
5. The Dell EMC DSS 8440 with 10x RTX6000 ranked between #2 and #6 in 14 out of 16 benchmarks when compared against all submissions
Dell EMC DSS 8440 (10x Quadro RTX6000)

Benchmark | Offline | Server
---|---|---
3d-unet-99 | #4 | n/a
3d-unet-99.9 | #4 | n/a
bert-99 | #4 | #5
bert-99.9 | #4 | #5
dlrm-99 | |
dlrm-99.9 | |
resnet | #5 | #6
rnnt | #2 | #5
ssd-large | #5 | #6
6. Dell EMC DSS 8440 with 10x RTX8000 ranked between #2 and #6 when compared against all submissions
Dell EMC DSS 8440 (10x Quadro RTX8000)

Benchmark | Offline | Server
---|---|---
3d-unet-99 | #5 | n/a
3d-unet-99.9 | #5 | n/a
bert-99 | #5 | #4
bert-99.9 | #5 | #4
dlrm-99 | #3 | #3
dlrm-99.9 | #3 | #3
resnet | #6 | #5
rnnt | #3 | #6
ssd-large | #6 | #5
Get more information on MLPerf results at www.mlperf.org and learn more about PowerEdge servers that are optimized for AI/ML/DL at www.DellTechnologies.com/Servers.
Acknowledgements: These impressive results were made possible by the work of the following Dell EMC and partner team members - Shubham Billus, Trevor Cockrell, Bagus Hanindhito (Univ. of Texas, Austin), Uday Kurkure (VMware), Guy Laporte, Anton Lokhmotov (Dividiti), Bhavesh Patel, Vilmara Sanchez, Rakshith Vasudev, Lan Vu (VMware) and Nicholas Wakou. We would also like to thank our partners – NVIDIA, Intel and Xilinx – for their help and support in the MLPerf v0.7 Inference submissions.
Related Blog Posts

Quantifying Performance of Dell EMC PowerEdge R7525 Servers with NVIDIA A100 GPUs for Deep Learning Inference
Tue, 17 Nov 2020 18:30:15 -0000
|Read Time: 0 minutes
The Dell EMC PowerEdge R7525 server delivered exceptional MLPerf Inference v0.7 results, which indicate that:
- Dell Technologies holds the #1 spot in performance per GPU with the NVIDIA A100-PCIe GPU on the DLRM-99 Server scenario
- Dell Technologies holds the #1 spot in performance per GPU with the NVIDIA A100-PCIe on the DLRM-99.9 Server scenario
- Dell Technologies holds the #1 spot in performance per GPU with the NVIDIA A100-PCIe on the ResNet-50 Server scenario
Summary
In this blog, we provide the performance numbers of our recently released Dell EMC PowerEdge R7525 server with two NVIDIA A100 GPUs across all the MLPerf Inference v0.7 benchmarks. Our results indicate that the PowerEdge R7525 server is an excellent choice for inference workloads. It delivers optimal performance for the different tasks that are in the MLPerf Inference v0.7 benchmark. These tasks include image classification, object detection, medical image segmentation, speech to text, language processing, and recommendation.
The PowerEdge R7525 server is a two-socket, 2U rack server that is designed to run workloads using flexible I/O and network configurations. The PowerEdge R7525 server features the 2nd Gen AMD EPYC processor, supports up to 32 DIMMs, has PCI Express (PCIe) Gen 4.0-enabled expansion slots, and provides a choice of network interface technologies to cover networking options.
The following figure shows the front view of the PowerEdge R7525 server:
Figure 1. Dell EMC PowerEdge R7525 server
The PowerEdge R7525 server is designed to handle demanding workloads and AI applications such as training for different kinds of models and inference for different deployment scenarios. The PowerEdge R7525 server supports various accelerators such as NVIDIA T4, NVIDIA V100S, NVIDIA RTX, and NVIDIA A100 GPUs. The following sections compare the performance of NVIDIA A100 GPUs with NVIDIA T4 and NVIDIA RTX GPUs using MLPerf Inference v0.7 as a benchmark.
The following table provides details of the PowerEdge R7525 server configuration and software environment for MLPerf Inference v0.7:
Component | Description |
Processor | AMD EPYC 7502 32-Core Processor |
Memory | 512 GB (32 GB 3200 MT/s * 16) |
Local disk | 2x 1.8 TB SSD (No RAID) |
Operating system | CentOS Linux release 8.1 |
GPU | NVIDIA A100-PCIe-40G, T4-16G, and RTX8000 |
CUDA Driver | 450.51.05 |
CUDA Toolkit | 11.0 |
Other CUDA-related libraries | TensorRT 7.2, CUDA 11.0, cuDNN 8.0.2, cuBLAS 11.2.0, libjemalloc2, cub 1.8.0, tensorrt-laboratory mlperf branch |
Other software stack | Docker 19.03.12, Python 3.6.8, GCC 5.5.0, ONNX 1.3.0, TensorFlow 1.13.1, PyTorch 1.1.0, torchvision 0.3.0, PyCUDA 2019.1, SacreBLEU 1.3.3, simplejson, OpenCV 4.1.1 |
System profiles | Performance |
For more information about how to run the benchmark, see Running the MLPerf Inference v0.7 Benchmark on Dell EMC Systems.
MLPerf Inference v0.7 performance results
The MLPerf inference benchmark measures how fast a system can perform machine learning (ML) inference using a trained model in various deployment scenarios. The following results represent the Offline and Server scenarios of the MLPerf Inference benchmark. For more information about different scenarios, models, datasets, accuracy targets, and latency constraints in MLPerf Inference v0.7, see Deep Learning Performance with MLPerf Inference v0.7 Benchmark.
In the MLPerf inference evaluation framework, the LoadGen load generator sends inference queries to the system under test, in our case, the PowerEdge R7525 server with various GPU configurations. The system under test uses a backend (for example, TensorRT, TensorFlow, or PyTorch) to perform inferencing and sends the results back to LoadGen.
MLPerf has identified four different scenarios that enable representative testing of a wide variety of inference platforms and use cases. In this blog, we discuss the Offline and Server scenario performance. The main differences between these scenarios are based on how the queries are sent and received:
- Offline—One query with all samples is sent to the system under test. The system under test can send the results back once or multiple times in any order. The performance metric is samples per second.
- Server—Queries are sent to the system under test following a Poisson distribution (to model real-world random events). One query has one sample. The performance metric is queries per second (QPS) within latency bound.
Note: The performance metrics for both the Offline and Server scenarios represent the throughput of the system.
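As a rough, hypothetical sketch of the Server-scenario constraint, the snippet below checks whether a simulated latency distribution stays within a latency bound at a target percentile. Conceptually, the reported Server result is the highest offered QPS for which such a check still passes; the exact percentiles and bounds are defined per benchmark by MLPerf, and the latencies here are simulated, not measured.

```python
import random

def meets_latency_bound(latencies_ms, bound_ms, percentile=0.99):
    """Return True if the chosen percentile of query latencies is within the bound."""
    ordered = sorted(latencies_ms)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx] <= bound_ms

# Simulated per-query latencies for a hypothetical system (not measured data).
rng = random.Random(42)
latencies = [rng.gauss(8.0, 2.0) for _ in range(10_000)]   # milliseconds

# Roughly speaking, the reported Server result is the highest offered QPS for which
# a check like this still passes at the benchmark's latency constraint.
print(meets_latency_bound(latencies, bound_ms=15.0))
```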
In all the benchmarks, two NVIDIA A100 GPUs outperform eight NVIDIA T4 GPUs and three NVIDIA RTX8000 GPUs for the following models:
- ResNet-50 image classification model
- SSD-ResNet34 object detection model
- RNN-T speech recognition model
- BERT language processing model
- DLRM recommender model
- 3D U-Net medical image segmentation model
The following graphs show PowerEdge R7525 server performance with two NVIDIA A100 GPUs, eight NVIDIA T4 GPUs, and three NVIDIA RTX8000 GPUs with 99% accuracy target benchmarks and 99.9% accuracy targets for applicable benchmarks:
- 99% accuracy (default accuracy) target benchmarks: ResNet-50, SSD-Resnet34, and RNN-T
- 99% and 99.9% accuracy (high accuracy) target benchmarks: DLRM, BERT, and 3D-Unet
99% accuracy target benchmarks
ResNet-50
The following figure shows results for the ResNet-50 model:
Figure 2. ResNet-50 Offline and Server inference performance
From the graph, we can derive per-GPU values: we divide the system throughput (across all GPUs) by the number of GPUs, because the results scale nearly linearly with GPU count.
SSD-Resnet34
The following figure shows the results for the SSD-Resnet34 model:
Figure 3. SSD-Resnet34 Offline and Server inference performance
RNN-T
The following figure shows the results for the RNN-T model:
Figure 4. RNN-T Offline and Server inference performance
99.9% accuracy target benchmarks
DLRM
The following figures show the results for the DLRM model with 99% and 99.9% accuracy:
Figure 5. DLRM Offline and Server Scenario inference performance – 99% and 99.9% accuracy
For the DLRM recommender and 3D U-Net medical image segmentation (see Figure 7) models, the 99% and 99.9% accuracy targets yield the same throughput; the 99.9% accuracy constraint is satisfied without any throughput penalty.
BERT
The following figures show the results for the BERT model with 99% and 99.9% accuracy:
Figure 6. BERT Offline and Server inference performance – 99% and 99.9% accuracy
For the BERT language processing model, two NVIDIA A100 GPUs outperform eight NVIDIA T4 GPUs and three NVIDIA RTX8000 GPUs. However, the performance of three NVIDIA RTX8000 GPUs is a little better than that of eight NVIDIA T4 GPUs.
3D U-Net
For the 3D-Unet medical image segmentation model, only the Offline scenario benchmark is available.
The following figure shows the results for the 3D U-Net model Offline scenario:
Figure 7. 3D U-Net Offline inference performance
The following table compares the throughput of two NVIDIA A100 GPUs against eight NVIDIA T4 GPUs and three NVIDIA RTX8000 GPUs for both the 99% and 99.9% accuracy targets; a sketch of how these ratios are computed follows the table:
Model | Scenario | Accuracy | 2 x A100 GPUs vs 8 x T4 GPUs | 2 x A100 GPUs vs 3 x RTX8000 GPUs
---|---|---|---|---
ResNet-50 | Offline | 99% | 5.21x | 2.10x
ResNet-50 | Server | 99% | 4.68x | 1.89x
SSD-Resnet34 | Offline | 99% | 6.00x | 2.35x
SSD-Resnet34 | Server | 99% | 5.99x | 2.21x
RNN-T | Offline | 99% | 5.55x | 2.14x
RNN-T | Server | 99% | 6.71x | 2.43x
DLRM | Offline | 99% | 6.55x | 2.52x
DLRM | Server | 99% | 5.92x | 2.47x
DLRM | Offline | 99.9% | 6.55x | 2.52x
DLRM | Server | 99.9% | 5.92x | 2.47x
BERT | Offline | 99% | 6.26x | 2.31x
BERT | Server | 99% | 6.80x | 2.72x
BERT | Offline | 99.9% | 7.04x | 2.22x
BERT | Server | 99.9% | 6.84x | 2.20x
3D U-Net | Offline | 99% | 5.05x | 2.06x
3D U-Net | Offline | 99.9% | 5.05x | 2.06x
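The ratios in the table are simple throughput quotients between configurations. The sketch below reproduces the ResNet-50 Offline row; the throughput values are illustrative placeholders chosen only so the quotients come out to the cited 5.21x and 2.10x, not the measured results.

```python
# Sketch of how the comparison ratios in the table are formed:
#   speedup = throughput of the 2x A100 configuration / throughput of the other configuration.
# Throughput values are illustrative placeholders chosen to reproduce the cited
# 5.21x and 2.10x ResNet-50 Offline ratios; they are not the measured results.

throughput = {
    "2x A100":    76_000.0,   # ResNet-50 Offline, samples/s (placeholder)
    "8x T4":      14_587.0,   # placeholder
    "3x RTX8000": 36_190.0,   # placeholder
}

baseline = throughput["2x A100"]
for config in ("8x T4", "3x RTX8000"):
    print(f"2x A100 vs {config}: {baseline / throughput[config]:.2f}x")
```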
Conclusion
With support for NVIDIA A100, NVIDIA T4, or NVIDIA RTX8000 GPUs, the Dell EMC PowerEdge R7525 server is an exceptional choice for workloads that involve deep learning inference. The higher throughput that we observed with NVIDIA A100 GPUs translates to performance gains and faster business value for inference applications.
The Dell EMC PowerEdge R7525 server with two NVIDIA A100 GPUs delivers optimal performance for various inference workloads, whether in a batch inference setting such as the Offline scenario or an online inference setting such as the Server scenario.
Next steps
In future blogs, we will discuss sizing the system (server and GPU configurations) correctly based on the type of workload (area and task).

MLPerf Inference v0.7 Benchmarks on PowerEdge R7515 Servers
Tue, 08 Dec 2020 00:14:16 -0000
|Read Time: 0 minutes
Introduction
MLPerf (https://mlperf.org) Inference is a benchmark suite for measuring how fast Machine Learning (ML) and Deep Learning (DL) systems can process input inference data and produce results using a trained model. The benchmarks belong to a diversified set of ML use cases that are popular in the industry and provide a standard for hardware platforms to perform ML-specific tasks. Hence, good performance under these benchmarks signifies a hardware setup that is well optimized for real world ML inferencing use cases.
System under Test (SUT)
- Server – Dell EMC PowerEdge R7515
- GPU – NVIDIA Tesla T4
- Framework – TensorRT™ 7.2.0.14
Dell EMC PowerEdge R7515
Table 1 Dell EMC PowerEdge R7515 technical specifications
Component | Description |
---|---|
System name | PowerEdge R7515 |
Status | Commercially available |
System type | Data center |
Number of nodes | 1 |
Host processor model name | AMD® EPYC® 7702P |
Host processors per node | 1 |
Host processor core count | 64 |
Host processor frequency | 2.00 GHz |
Host memory capacity | 256 GB DDR4, 2933 MHz |
Host storage | 3.2 TB SSD |
Host accelerator | NVIDIA Tesla T4 |
Accelerators per node | 4 |
NVIDIA Tesla T4
The NVIDIA Tesla T4, based on NVIDIA's Turing architecture, is one of the most widely used AI inference accelerators. The Tesla T4 features NVIDIA Turing Tensor cores, which enable it to accelerate many types of neural networks for images, speech, translation, and recommender systems, to name a few. The Tesla T4 supports a wide variety of precisions and accelerates all major DL and ML frameworks, including TensorFlow, PyTorch, MXNet, Chainer, and Caffe2.
Table 2 NVIDIA Tesla T4 technical specifications
Component | Description |
---|---|
GPU architecture | NVIDIA Turing |
NVIDIA Turing Tensor cores | 320 |
NVIDIA CUDA® cores | 2,560 |
Single-precision | 8.1 TFLOPS |
Mixed-precision (FP16/FP32) | 65 TFLOPS |
INT8 | 130 TOPS |
INT4 | 260 TOPS |
GPU memory | 16 GB GDDR6, 320+ GB/s |
ECC | Yes |
Interconnect bandwidth | 32 GB/s |
System interface | X16 PCIe Gen3 |
Form factor | Low profile PCIe |
Thermal solution | Passive |
Compute APIs | CUDA, NVIDIA TensorRT™, ONNX |
Power | 70W |
MLPerf Inference v0.7
The MLPerf inference benchmark measures how fast a system can perform ML inference using a trained model with new data that is provided in various deployment scenarios. Table 3 shows seven mature models that are in the official v0.7 release.
Table 3 MLPerf Inference Suite v0.7
Model | Reference application | Dataset |
---|---|---|
resnet50-v1.5 | vision/classification and detection | ImageNet (224 x 224) |
ssd-mobilenet 300 x 300 | vision/classification and detection | COCO (300 x 300) |
ssd-resnet34 1200 x 1200 | vision/classification and detection | COCO (1200 x 1200) |
bert | language | squad-1.1 |
dlrm | recommendation | Criteo Terabyte |
3d-unet | vision/medical imaging | BraTS 2019 |
rnnt | speech recognition | OpenSLR LibriSpeech Corpus |
The above models serve various critical inference applications or use cases, whose deployment contexts are known as "scenarios." Each scenario uses different metrics and demonstrates performance as it would be seen in a production environment. MLPerf Inference consists of the four evaluation scenarios shown in Table 4:
- Single-stream
- Multi-stream
- Server
- Offline
Table 4 Deployment scenarios
Scenario | Sample use case | Metrics |
---|---|---|
Single-stream | Cell phone augmented reality | Latency in ms |
Multi-stream | Multiple camera driving assistance | Number of streams |
Server | Translation sites | QPS |
Offline | Photo sorting | Inputs/s |
Results
Inference performance is measured in terms of samples and queries. A sample is a unit on which inference is run, such as an image or a sentence. A query is a set of samples that are issued to an inference system together. For a detailed explanation of the definitions, rules, and constraints of MLPerf Inference, see: https://github.com/mlperf/inference_policies/blob/master/inference_rules.adoc#constraints-for-the-closed-division
Default Accuracy refers to a configuration in which the model must achieve at least 99% of the reference model's accuracy; High Accuracy refers to a configuration in which the model must achieve at least 99.9% of the reference model's accuracy.
For the MLPerf Inference v0.7 submissions, Dell EMC used the Offline and Server scenarios because they are most representative of datacenter systems. The Offline scenario represents use cases where inference is run as a batch job (for instance, using AI for photo sorting), while the Server scenario represents interactive inference operations (such as a translation app).
MLPerf Inference results on the PowerEdge R7515
Table 5 PowerEdge R7515 inference results
System | Scenario | 3D-UNet (default) | 3D-UNet (high) | BERT (default) | BERT (high) | DLRM (default) | DLRM (high) | ResNet50 (default) | RNN-T (default) | SSD-ResNet34 (default)
---|---|---|---|---|---|---|---|---|---|---
Dell EMC R7515 (4x T4) | Offline (samples/s) | 28 | 28 | 1,708 | 715 | 126,287 | 126,287 | 23,290 | 5,712 | 535
Dell EMC R7515 (4x T4) | Server (queries/s) | n/a | n/a | 1,249 | 629 | 126,514 | 126,514 | 21,506 | 4,096 | 450
Table 5 above shows the raw performance of the R7515_T4x4 SUT in samples/s for the Offline scenario and queries/s for the Server scenario. Detailed results for this and other configurations can be found at https://mlperf.org/inference-results-0-7/
Figures 1 to 4 below show the inference capabilities of two Dell PowerEdge servers, the R7515 and the R7525. Both are 2U and are powered by AMD processors; the R7515 is single socket, and the R7525 is dual socket. The R7515 used 4x NVIDIA Tesla T4 GPUs, while the R7525 used four different configurations of three NVIDIA GPU accelerators: Tesla T4, Quadro RTX8000, and A100. Each bar indicates the relative number of inference operations completed in a set amount of time while bounded by latency constraints. The higher the bar, the higher the inference capability of the platform.
Figure 1 Offline scenario relative performance with default accuracy for six different benchmarks and five different configurations using R7515_T4x4 as a baseline
Figure 2 Offline scenario relative performance with high accuracy for six different benchmarks and five different configurations using R7515_T4x4 as a baseline
Figure 3 Server scenario relative performance with default accuracy for five different benchmarks and five different configurations using R7515_T4x4 as a baseline
Figure 4 Server scenario relative performance with high accuracy for two different benchmarks and five different configurations using R7515_T4x4 as a baseline
Figure 5 shows the relative price of each GPU configuration, using the cost of the Tesla T4 as the baseline, and the corresponding price/performance. The price/performance shown is an estimate to illustrate the value of the money spent on the GPU configurations and should not be taken as the price/performance of the entire SUT. In this case, the shorter the bar, the better.
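A minimal sketch of that normalization follows, assuming the relative GPU costs quoted in the key takeaways below (the A100x3 configuration at 3x the baseline configuration cost, and the RTX8000 at 2.22x the T4 price); the performance figures are illustrative placeholders, not measured results.

```python
# Hypothetical sketch of the price/performance normalization used for Figure 5.
#   relative price       = configuration GPU cost / baseline GPU cost
#   relative performance = configuration throughput / baseline throughput
#   price/performance    = relative price / relative performance   (lower is better)
# GPU costs follow the relative prices quoted in the key takeaways below; the
# performance figures are illustrative placeholders, not measured results.

baseline_cost = 4 * 1.0        # R7515 with 4x T4, T4 cost normalized to 1.0

configs = {
    "R7525_A100x3":    {"gpu_cost": 3 * 4.0,  "rel_performance": 4.9},   # ~3x baseline cost
    "R7525_RTX8000x3": {"gpu_cost": 3 * 2.22, "rel_performance": 2.3},   # RTX8000 = 2.22x T4 price
}

for name, cfg in configs.items():
    rel_price = cfg["gpu_cost"] / baseline_cost
    price_perf = rel_price / cfg["rel_performance"]
    # A100x3 prints ~0.61, matching the relative price/performance cited below.
    print(f"{name}: relative price {rel_price:.2f}, price/performance {price_perf:.2f}")
```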
Key Takeaways from the results
- Performance is almost linearly proportional to the number of GPU cards. Check Figures 1 to 4 and compare the performance of the R7515_T4x4 with the R7525_T4x8, or the R7525_A100x2 with the R7525_A100x3.
- The relative performance of the R7525_T4x8 is about 2.0 for most benchmarks; it has twice the number of GPUs as the reference system, confirming that GPU count has a significant impact on performance.
- The more expensive GPUs provide better price/performance. From Figure 5, the cost of the R7525_A100x3 configuration is 3x the cost of the reference configuration (R7515_T4x4), but its relative price/performance is 0.61.
- The price of the RTX8000 is 2.22x the price of the Tesla T4, as listed on the Dell website. The RTX8000 can be used in smaller quantities (3 compared to 8x T4) at a lower total cost. From Figure 5, the R7525_RTX8000x3 costs 0.8333x as much as the R7525_T4x8, and it posts better performance and price/performance.
- Generally, Dell Technologies provides server configurations with the flexibility to deploy customer inference workloads on systems that match their requirements:
- The NVIDIA T4 is a low profile, lower power GPU option that is widely deployed for inference due to its superior power efficiency and economic value.
- With 48 GB of GDDR6 memory, the NVIDIA Quadro RTX 8000 is designed to work with memory intensive workloads like creating the most complex models, building massive architectural datasets and visualizing immense data science workloads. Dell is the only vendor that submitted results using NVIDIA Quadro RTX GPUs.
- The NVIDIA A100-PCIe-40G is a powerful platform that is popularly used for training state-of-the-art deep learning models. For customers who are less budget-constrained and have heavy inference computational requirements, its high initial cost is more than offset by its better price/performance.
Conclusion
As shown in the charts above, the Dell EMC PowerEdge R7515 performed well in a wide range of benchmark scenarios. The benchmarks discussed in this paper cover diverse use cases: for instance, object detection on image data (the SSD-ResNet34 model on the COCO dataset), language processing (the BERT model on SQuAD v1.1 for machine comprehension of text), and recommendation (the DLRM model with the Criteo 1 TB click logs dataset).