
Dell EMC Servers Shine in MLPerf Inference v0.7 Benchmark
Tue, 03 Nov 2020 12:46:25 -0000
As software applications and systems using Artificial Intelligence (AI) gain mainstream adoption across all industries, inference workloads for ongoing operations are becoming a larger resource consumer in the datacenter. MLPerf is a benchmark suite that is used to evaluate the performance profiles of systems for both training and inference AI tasks. In this blog we take a closer look at the recent results submitted by Dell EMC and how our various servers performed in the datacenter category.
The reason we do this type of work is to help customers understand which server platform makes the most sense for their use case. Dell Technologies wants to make that choice easier and reduce work for our customers so they don't waste precious resources; instead, they can spend their time on the use case itself, accelerating time to value for the business.
Dell Technologies has a total of 210 submissions for MLPerf Inference v0.7 in the Datacenter category using various server platforms and accelerators. Why so many? Because many customers have never run AI in their environments, the use cases are endless across industries, and in-house expertise is limited. Customers have told us they need help identifying the correct server platform for their workloads.
We’re proud of what we’ve done, but it’s still all about helping customers adopt AI. By sharing our expertise and providing guidance on infrastructure for AI, we help customers become successful and get their use case into production.
MLPerf Benchmarks
MLPerf was founded in 2018 with a goal of accelerating improvements in ML system performance. Formed as a collaboration of companies and researchers from leading educational institutions, MLPerf leverages open source code, public state-of-the-art Machine Learning (ML) models and publicly available datasets contributed to the ML community. The MLPerf suites include MLPerf Training and MLPerf Inference.
MLPerf Training measures how fast a system can train machine learning models. Training benchmarks have been defined for image classification, lightweight and heavy-weight object detection, language translation, natural language processing, recommendation and reinforcement learning. Each benchmark includes specifications for input datasets, quality targets and reference implementation models. The first round of training submissions was published on the MLPerf website in December 2018 with results submitted by Google, Intel and NVIDIA.
The MLPerf Inference suite measures how quickly a trained neural network can evaluate new data and perform forecasting or classification for a wide range of applications. MLPerf Inference includes image classification, object detection and machine translation with specific models, datasets, quality, server latency and multi-stream latency constraints. MLPerf validated and published results for MLPerf Inference v0.7 on October 21, 2020. In this blog we take a closer look at the MLPerf Inference v0.7 results submitted by Dell EMC and how the servers performed in the datacenter category.
A summary of the key highlights of the Dell EMC results is shown in Table 1. These highlights are derived from the submitted results in the MLPerf datacenter closed category. Ranking and claims are based on Dell analysis of published MLPerf data. The per-accelerator score is calculated by dividing the primary metric of total performance by the number of accelerators reported (a short sketch of this calculation follows Table 1).
Table 1: Key highlights of the Dell EMC MLPerf Inference v0.7 results

| Rank | Category | Specifics | Use Cases |
|---|---|---|---|
| #1 | Performance per Accelerator for NVIDIA A100-PCIe | PowerEdge R7525 | Medical Imaging, Image Classification |
| #1 | Performance per Accelerator with NVIDIA T4 GPUs | PowerEdge XE2420, PowerEdge R7525, DSS 8440 | Medical Imaging, NLP, Image Classification, Speech Recognition, Object Detection, Recommendation |
| #1 | Highest inference results with Quadro RTX6000 and RTX8000 | PowerEdge R7525, DSS 8440 | Medical Imaging, NLP, Image Classification, Speech Recognition, Object Detection, Recommendation |
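To make the per-accelerator normalization concrete, here is a minimal sketch in Python. The helper function and the total-throughput figure are illustrative assumptions, not MLPerf tooling or a submitted score; only the divide-by-accelerator-count rule comes from the definition above.

```python
# Minimal sketch of the per-accelerator normalization described above.
# The total throughput below is a hypothetical input, not a submitted score.

def per_accelerator(total_throughput: float, num_accelerators: int) -> float:
    """Normalize a system-level primary metric to a per-accelerator score."""
    return total_throughput / num_accelerators

# A hypothetical 3-GPU system reporting 90,015 inferences/second in total
# would score 30,005 inferences/second per accelerator.
print(per_accelerator(90_015, 3))  # 30005.0
```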
Dell EMC had a total of 210 submissions for MLPerf Inference v0.7 in the Datacenter category using various Dell EMC platforms and accelerators from leading vendors. We achieved impressive results when compared to other submissions in the same class of platforms.
MLPerf Inference Categories and Dell EMC Achievements
A benchmark suite is made up of tasks or models from vision, speech, language and commerce use cases. MLPerf Inference measures how fast a system can perform ML inference by using a load generator against the System Under Test (SUT) where the trained model is deployed.
There are three types of benchmark tests defined in MLPerf Inference v0.7: one for datacenter systems, one for edge systems, and one for mobile systems. MLPerf then defines four scenarios to enable representative testing of a wide variety of inference platforms and use cases:
- Single stream
- Multiple stream
- Server
- Offline
The single stream and multiple stream scenarios are only used for edge and mobile inference benchmarks. The datacenter benchmark type targets systems designed for datacenter deployments and requires evaluation of both the server and offline scenarios. The metric used in the Datacenter category is inference operations per second. In the server scenario, the MLPerf load generator sends new queries to the SUT according to a Poisson distribution; this is representative of online AI applications, such as translation and image tagging, which have variable arrival patterns based on end-user traffic. Offline represents AI tasks done through batch processing, such as photo categorization, where all the data is readily available ahead of time. A sketch of both scenarios follows.
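The following is a hedged sketch of how the two datacenter scenarios exercise a system; it is not MLPerf LoadGen itself. The `sut.issue` and `sut.issue_batch` calls are hypothetical stand-ins for however a System Under Test accepts queries. Only the arrival patterns (Poisson arrivals for Server, one up-front batch for Offline) reflect the definitions above.

```python
import random

def server_scenario(sut, queries, target_qps: float):
    """Server scenario sketch: queries arrive one at a time with
    exponentially distributed gaps, i.e., a Poisson arrival process."""
    clock = 0.0
    for query in queries:
        clock += random.expovariate(target_qps)  # random inter-arrival gap
        sut.issue(query, arrival_time=clock)     # hypothetical SUT hook

def offline_scenario(sut, queries):
    """Offline scenario sketch: all samples are available up front and are
    issued as one batch, so the SUT can reorder and batch freely."""
    sut.issue_batch(queries, arrival_time=0.0)   # hypothetical SUT hook
```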
Dell EMC published multiple results in the datacenter systems category. Details on the models, datasets, and scenarios submitted for the different datacenter benchmarks are shown in Table 2.
Table 2: Datacenter benchmark tasks, models, datasets, and required scenarios

| Area | Task | Model | Dataset | Required Scenarios |
|---|---|---|---|---|
| Vision | Image classification | ResNet50-v1.5 | ImageNet (224x224) | Server, Offline |
| Vision | Object detection (large) | SSD-ResNet34 | COCO (1200x1200) | Server, Offline |
| Vision | Medical image segmentation | 3D-UNet | BraTS 2019 (224x224x160) | Offline |
| Speech | Speech-to-text | RNN-T | LibriSpeech dev-clean (samples < 15 seconds) | Server, Offline |
| Language | Language processing | BERT | SQuAD v1.1 (max_seq_len=384) | Server, Offline |
| Commerce | Recommendation | DLRM | 1TB Click Logs | Server, Offline |
Next we highlight some of the key performance achievements for the broad range of solutions available in the Dell EMC portfolio for inference use cases and deployments.
1. Dell EMC is #1 in total number of datacenter submissions in the closed division, including bare-metal submissions using different GPUs, Xeon CPUs, and a Xilinx FPGA, as well as a virtualized submission on VMware vSphere.
The closed division enables head-to-head comparisons and consists of server platforms used from the Edge to private or public clouds. The Dell Technologies engineering team submitted 210 out of the total 509 results.
We remain committed to helping customers deploy inference workloads as efficiently as possible, meeting their unique requirements of power, density, budget, and performance. The wide range of servers submitted by Dell Technologies is a testament to this commitment:
- The only vendor with submissions for a variety of inference solutions – leveraging GPU, FPGA and CPUs for the datacenter/private cloud and Edge
- Unique in the industry by submitting results across a multitude of servers that range from mainstream servers (R740/R7525) to dense GPU-optimized servers supporting up to 16 NVIDIA GPUs (DSS 8440).
- Demonstrated that customers that demand real-time inferencing at the telco or retail Edge can deploy up to 4 GPUs in a short depth NEBS-compliant PowerEdge XE2420 server.
- Demonstrated efficient Inference performance using the 2nd Gen Intel Xeon Scalable platform on the PowerEdge R640 and PowerEdge R740 platforms for customers wanting to run inference on Intel CPUs.
- Dell submissions using Xilinx U280 in PowerEdge R740 demonstrated that customers wanting low latency inference can leverage FPGA solutions.
2. Dell EMC is #1 in performance “per Accelerator” with PowerEdge R7525 and A100-PCIe for multiple benchmarks
The Dell EMC PowerEdge R7525 was purpose-built for superior accelerated performance. The MLPerf results validated leading performance across many scenarios including:
| Performance Rank “Per Accelerator” | Inference Throughput | Dell EMC System |
|---|---|---|
| #1 ResNet50 (Server) | 30,005 | PowerEdge R7525 (3x NVIDIA A100-PCIe) |
| #1 3D-UNet-99 (Offline) | 39 | PowerEdge R7525 (3x NVIDIA A100-PCIe) |
| #1 3D-UNet-99.9 (Offline) | 39 | PowerEdge R7525 (3x NVIDIA A100-PCIe) |
| #2 DLRM-99 (Server) | 192,543 | PowerEdge R7525 (2x NVIDIA A100-PCIe) |
| #2 DLRM-99.9 (Server) | 192,543 | PowerEdge R7525 (2x NVIDIA A100-PCIe) |
3. Dell achieved the highest inference scores with NVIDIA Quadro RTX GPUs using the DSS 8440 and R7525
Dell Technologies engineering understands that training isn't the only AI workload, and that using the right technology for each job is far more cost-effective. Dell is the only vendor to submit results using NVIDIA RTX6000 and RTX8000 GPUs, which provide up to 48 GB of GPU memory for large inference models. The DSS 8440 with 10x Quadro RTX GPUs achieved:
- #2 and #3 highest system performance on RNN-T for Offline scenario.
The #1 ranking was delivered using 8x NVIDIA A100 SXM4, which was introduced in May 2020 and is a powerful configuration for customers training state-of-the-art deep learning models. Dell Technologies took the #2 and #3 spots with the DSS 8440 server equipped with 10x NVIDIA RTX8000 and the DSS 8440 with 10x NVIDIA RTX6000, providing better power and cost efficiency for inference workloads compared to other submissions.
4. Dell EMC claims #1 spots for NVIDIA T4 platforms with DSS 8440, XE2420 and PowerEdge R7525
Dell Technologies provides system options for customers to deploy inference workloads that match their unique requirements. Today’s accelerators vary significantly in price, performance and power consumption. For example, the NVIDIA T4 is a low profile, lower power GPU option that is widely deployed for inference due to its superior power efficiency and economic value for that use case.
The MLPerf results corroborate the exemplary inference performance of the NVIDIA T4 on Dell EMC servers. Dell EMC systems lead in performance per GPU among the 20 servers that submitted scores using NVIDIA T4 GPUs:
- #1 in performance per GPU on 3d-unet-99 and 3d-unet-99.9 Offline scenario
- #1 in performance per GPU on Bert-99 Server and Bert-99.9 Offline scenario
- #1, #2 and #3 in performance with T4 on DLRM-99 & DLRM-99.9 Server scenario
- #1 in performance per GPU on ResNet50 Offline scenario
- #1 in performance per GPU on RNN-T Server and Offline scenario
- #1 in performance per GPU on SSD-large Offline scenario
The best per-GPU scores for the NVIDIA T4 rankings above, and the platforms that achieved them, are shown in the following table:
| Benchmark | Offline Rank | Offline Throughput | Offline System | Server Rank | Server Throughput | Server System |
|---|---|---|---|---|---|---|
| 3d-unet-99 | #1 | 7.6 | XE2420 | n/a | n/a | n/a |
| 3d-unet-99.9 | #1 | 7.6 | XE2420 | n/a | n/a | n/a |
| bert-99 | #3 | 449 | XE2420 | #1 | 402 | XE2420 |
| bert-99.9 | #1 | 213 | DSS 8440 | #2 | 190 | XE2420 |
| dlrm-99 | #2 | 35,054 | XE2420 | #1 | 32,507 | R7525 |
| dlrm-99.9 | #2 | 35,054 | XE2420 | #1 | 32,507 | R7525 |
| resnet | #1 | 6,285 | XE2420 | #4 | 5,663 | DSS 8440 |
| rnnt | #1 | 1,560 | XE2420 | #1 | 1,146 | XE2420 |
| ssd-large | #1 | 142 | XE2420 | #2 | 131 | DSS 8440 |
5. Dell is the only vendor to submit results on virtualized infrastructure with vCPUs and NVIDIA virtual GPUs (vGPU) on VMware vSphere
Customers interested in deploying inference workloads for AI on virtualized infrastructure can leverage Dell servers with VMware software to reap the benefits of virtualization.
To demonstrate efficient virtualized performance on 2nd Generation Intel Xeon Scalable processors, Dell EMC and VMware submitted results using vSphere and OpenVINO on the PowerEdge R640.
- Virtualization overhead for a single VM was observed to be minimal, and testing showed that multiple VMs deployed on a single server achieved ~26% better throughput compared to a bare-metal environment.
Dell EMC has published guidance on virtualizing GPUs using DirectPath I/O, NVIDIA Virtual Compute Server (vCS), and more. Dell EMC and VMware used NVIDIA vCS virtualization software in vSphere for MLPerf Inference benchmarks on virtualized NVIDIA T4 GPUs.
- VMware vSphere using NVIDIA vCS delivers near bare metal performance for MLPerf Inference v0.7 benchmarks. The inference throughput (queries processed per second) increases linearly as the number of vGPUs attached to the VM increases.
Blogs covering these virtualized tests in greater detail are published on VMware's performance blog.
This finishes our coverage of the top five highlights out of the 210 submissions made by Dell EMC in the datacenter division. Next, we discuss other aspects of the GPU-optimized portfolio that are important for customers: quality and support.
Dell has the highest number of NVIDIA GPU submissions using NVIDIA NGC-Ready systems
Dell GPU enabled platforms are part of NVIDIA NGC-Ready and NGC-Ready for Edge validation programs. At Dell, we understand that performance is critical, but customers are not willing to compromise quality and reliability to achieve maximum performance. Customers can confidently deploy inference and other software applications from the NVIDIA NGC catalog knowing that the Dell systems meet all the requirements set by NVIDIA to deploy customer workloads on-premises or at the Edge.
The NVIDIA NGC-validated configurations that were used for this round of MLPerf submissions are:
- Dell EMC PowerEdge XE2420 (4x T4)
- Dell EMC DSS 8440 (10x Quadro RTX 8000)
- Dell EMC DSS 8440 (12x T4)
- Dell EMC DSS 8440 (16x T4)
- Dell EMC DSS 8440 (8x Quadro RTX 8000)
- Dell EMC PowerEdge R740 (4x T4)
- Dell EMC PowerEdge R7515 (4x T4)
- Dell EMC PowerEdge R7525 (2x A100-PCIE)
- Dell EMC PowerEdge R7525 (3x Quadro RTX 8000)
The Dell EMC portfolio can address customers' inference needs from on-premises to the edge
In this blog, we highlighted the results submitted by Dell EMC to demonstrate how our various servers performed in the datacenter category. The Dell EMC server portfolio provides many options for customers wanting to deploy AI inference in their datacenters or at the edge. We also offer a wide range of accelerator options, including multiple GPU and FPGA models, for running inference either on bare metal or on virtualized infrastructure to meet specific application and deployment requirements.
Finally, we list the performance for a subset of the server platforms that we see most commonly used by customers today for running inference workloads. These rankings highlight that these platforms can support the wide range of inference use cases showcased in the MLPerf suite.
1. The Dell EMC PowerEdge XE2420 with 4x NVIDIA T4 GPUs: Ranked between #1 and #3 in 14 out of 16 benchmark categories when compared with other T4 Servers
Dell EMC PowerEdge XE2420 (4x T4) Per Accelerator Ranking*

| Benchmark | Offline | Server |
|---|---|---|
| 3d-unet-99 | #1 | n/a |
| 3d-unet-99.9 | #1 | n/a |
| bert-99 | #3 | #1 |
| bert-99.9 | #2 | #2 |
| dlrm-99 | #1 | #3 |
| dlrm-99.9 | #1 | #3 |
| resnet | #1 | - |
| rnnt | #1 | #1 |
| ssd-large | #1 | - |
2. Dell EMC PowerEdge R7525 with 8x T4 GPUs: Ranked between #1 and #5 in 11 out of 16 benchmark categories among T4 server submissions
Dell EMC PowerEdge R7525 (8x T4) Per Accelerator Ranking*

| Benchmark | Offline | Server |
|---|---|---|
| 3d-unet-99 | #4 | n/a |
| 3d-unet-99.9 | #4 | n/a |
| bert-99 | #4 | - |
| dlrm-99 | #2 | #1 |
| dlrm-99.9 | #2 | #1 |
| rnnt | #2 | #5 |
| ssd-large | #5 | - |
3. The Dell EMC PowerEdge R7525 with up to 3x A100-PCIe: Ranked between #3 and #10 in 15 out of 16 benchmark categories across all datacenter submissions
Dell EMC PowerEdge R7525 (2|3x A100-PCIe) Per Accelerator Ranking

| Benchmark | Offline | Server |
|---|---|---|
| 3d-unet-99 | #4 | n/a |
| 3d-unet-99.9 | #4 | n/a |
| bert-99 | #8 | #9 |
| bert-99.9 | #7 | #8 |
| dlrm-99 | #6 | #4 |
| dlrm-99.9 | #6 | #4 |
| resnet | #10 | #3 |
| rnnt | #6 | #7 |
| ssd-large | #10 | - |
4. The Dell EMC DSS 8440 with 16x T4 ranked between #3 and #7 when compared against all submissions using T4
Dell EMC DSS 8440 (16x T4)

| Benchmark | Offline | Server |
|---|---|---|
| 3d-unet-99 | #4 | n/a |
| 3d-unet-99.9 | #4 | n/a |
| bert-99 | #6 | #4 |
| bert-99.9 | #7 | #5 |
| dlrm-99 | #3 | #3 |
| dlrm-99.9 | #3 | #3 |
| resnet | #6 | #4 |
| rnnt | #5 | #5 |
| ssd-large | #7 | #5 |
5. The Dell EMC DSS 8440 with 10x RTX6000 ranked between #2 and #6 in 14 out of 16 benchmarks when compared against all submissions
Dell EMC DSS 8440 (10x Quadro RTX6000)

| Benchmark | Offline | Server |
|---|---|---|
| 3d-unet-99 | #4 | n/a |
| 3d-unet-99.9 | #4 | n/a |
| bert-99 | #4 | #5 |
| bert-99.9 | #4 | #5 |
| dlrm-99 | - | - |
| dlrm-99.9 | - | - |
| resnet | #5 | #6 |
| rnnt | #2 | #5 |
| ssd-large | #5 | #6 |
6. Dell EMC DSS 8440 with 10x RTX8000 ranked between #2 and #6 when compared against all submissions
Dell EMC DSS 8440 (10x Quadro RTX8000)

| Benchmark | Offline | Server |
|---|---|---|
| 3d-unet-99 | #5 | n/a |
| 3d-unet-99.9 | #5 | n/a |
| bert-99 | #5 | #4 |
| bert-99.9 | #5 | #4 |
| dlrm-99 | #3 | #3 |
| dlrm-99.9 | #3 | #3 |
| resnet | #6 | #5 |
| rnnt | #3 | #6 |
| ssd-large | #6 | #5 |
Get more information on MLPerf results at www.mlperf.org and learn more about PowerEdge servers that are optimized for AI/ML/DL at www.DellTechnologies.com/Servers.
Acknowledgements: These impressive results were made possible by the work of the following Dell EMC and partner team members - Shubham Billus, Trevor Cockrell, Bagus Hanindhito (Univ. of Texas, Austin), Uday Kurkure (VMWare), Guy Laporte, Anton Lokhmotov (Dividiti), Bhavesh Patel, Vilmara Sanchez, Rakshith Vasudev, Lan Vu (VMware) and Nicholas Wakou. We would also like to thank our partners – NVIDIA, Intel and Xilinx for their help and support in MLPerf v0.7 Inference submissions.
Related Blog Posts
Multinode Performance of Dell PowerEdge Servers with MLPerf™ Training v1.1
Mon, 07 Mar 2022 19:51:12 -0000
The Dell MLPerf v1.1 submission included multinode results. This blog showcases performance across multiple nodes on Dell PowerEdge R750xa and XE8545 servers and demonstrates that the multinode scaling performance was excellent.
The compute requirement for deep learning training is growing at a rapid pace. It is imperative to train models across multiple nodes to attain a faster time-to-solution. Therefore, it is critical to showcase the scaling performance across multiple nodes. To demonstrate to customers the performance that they can expect across multiple nodes, our v1.1 submission includes multinode results. The following figures show multinode results for PowerEdge R750xa and XE8545 systems.
Figure 1: One-, two-, four-, and eight-node results with PowerEdge R750xa Resnet50 MLPerf v1.1 scaling performance
Figure 1 shows the performance of the PowerEdge R750xa server with ResNet50 training. These numbers scale from one node to eight nodes, from four NVIDIA A100-PCIE-80GB GPUs to 32 NVIDIA A100-PCIE-80GB GPUs. We can see that the scaling is almost linear across nodes. MLPerf training requires passing Reference Convergence Points (RCPs) for compliance, and these RCP requirements prevented the 8x case from showing perfectly linear scaling. The near-linear scaling makes the PowerEdge R750xa node an excellent choice for a multinode training setup; a quick way to quantify "near linear" follows.
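As a back-of-the-envelope way to read the scaling claim, scaling efficiency can be computed as achieved speedup divided by ideal speedup. The throughput numbers in this sketch are placeholders, not the submitted MLPerf v1.1 figures.

```python
def scaling_efficiency(base_throughput: float, scaled_throughput: float,
                       scale_factor: int) -> float:
    """Fraction of ideal linear speedup achieved at a given scale."""
    return scaled_throughput / (base_throughput * scale_factor)

# Placeholder numbers: if one node sustains 1,000 images/second and eight
# nodes sustain 7,600 images/second, scaling efficiency is 95 percent.
print(f"{scaling_efficiency(1_000.0, 7_600.0, 8):.0%}")  # 95%
```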
The workload was distributed by using Singularity on PowerEdge R750xa servers. Singularity is a secure containerization solution that is primarily used in traditional HPC GPU clusters. Our submission includes Singularity setup scripts that help traditional HPC customers run workloads without the need to fully restructure their existing cluster setup. The submission also includes Slurm Docker-based scripts.
Figure 2: Multinode submission results for PowerEdge XE8545 server with BERT, MaskRCNN, Resnet50, SSD, and RNNT
Figure 2 shows the submitted performance of the PowerEdge XE8545 server with BERT, MaskRCNN, ResNet50, SSD, and RNNT training. These numbers scale from one node to two nodes, from four NVIDIA A100-SXM-80GB GPUs to eight NVIDIA A100-SXM-80GB GPUs. All GPUs operate at 500 W TDP for maximum performance. The workloads were distributed using Slurm and Docker on PowerEdge XE8545 servers. The scaling is nearly linear.
Note: The RNN-T single-node results submitted for the PowerEdge XE8545x4A100-SXM-80GB system used a different set of hyperparameters than the two-node results. After the submission, we ran the RNN-T benchmark again on the PowerEdge XE8545x4A100-SXM-80GB system with the same hyperparameters and found that the new time to converge is approximately 77.37 minutes. Because we only had the resources to update the results for the 2xXE8545x4A100-SXM-80GB system before the submission deadline, the MLCommons results show 105.6 minutes for a single-node XE8545x4A100-SXM-80GB system.
The following figure shows the adjusted representation of performance for the PowerEdge XE8545x4A100-SXM-80GB system. RNN-T provides an unverified score of 77.31 minutes[1]:
Figure 3: Revised multinode results with PowerEdge XE8545 BERT, MaskRCNN, Resnet50, SSD, and RNNT
Figure 3 shows the linear scaling abilities of the PowerEdge XE8545 server across different workloads such as BERT, MaskRCNN, ResNet, SSD, and RNNT. This linear scaling ability makes the PowerEdge XE8545 server an excellent choice to run large-scale multinode workloads.
Note: This rnnt.zip file includes log files for 10 runs that show that the averaged performance is 77.31 minutes.
Conclusion
- It is critical to measure deep learning performance across multiple nodes to assess the scalability component of training as deep learning workloads are growing rapidly.
- Our MLPerf Training v1.1 submission includes multinode results that scale nearly linearly and perform extremely well.
- Scaling numbers for the PowerEdge XE8545 and PowerEdge R750xa servers make them excellent platform choices for enabling large-scale deep learning training workloads across different areas and tasks.
[1] MLPerf v1.1 Training RNN-T; result not verified by the MLCommons™ Association. The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See http://www.mlcommons.org for more information.
Dell Servers Excel in MLPerf™ Training v2.1
Wed, 16 Nov 2022 10:07:33 -0000
Dell Technologies has completed the successful submission of MLPerf Training, which marks the seventh round of submission to MLCommons™. This blog provides an overview and highlights the performance of the Dell PowerEdge R750xa, XE8545, and DSS8440 servers that were used for the submission.
What’s new in MLPerf Training v2.1?
This round of submission does not include new benchmarks or changes in the existing benchmarks. A change is introduced in the submission compliance checker.
This round adds one-sided normalization to the checker to reduce variance in the number of steps to converge. This change means that if a result converges faster than the RCP mean within a certain range, the checker normalizes the results to the RCP mean. This normalization was not available in earlier rounds of submission.
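To make the idea concrete, the following is a hedged sketch of one-sided normalization in Python. The tolerance value and bookkeeping here are assumptions for illustration; the authoritative logic lives in the MLCommons compliance checker.

```python
def normalize_to_rcp(steps_to_converge: float, rcp_mean: float,
                     tolerance: float = 0.05) -> float:
    """One-sided normalization sketch: a run that converges faster than the
    RCP mean (within a tolerance band) is reported as the RCP mean, which
    reduces variance in steps-to-converge. Slower runs are left unchanged."""
    lower_bound = rcp_mean * (1.0 - tolerance)  # tolerance is an assumption
    if lower_bound <= steps_to_converge < rcp_mean:
        return rcp_mean  # one-sided: only fast outliers are pulled up
    return steps_to_converge
```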
What’s new in MLPerf Training v2.1 with Dell submissions?
For the Dell submission to MLPerf Training v2.1, we included:
- Improved performance with BERT and Mask R-CNN models
- Minigo submission results on Dell PowerEdge R750xa server with A100 PCIe GPUs
Overall Dell Submissions
Figure 1. Overall submissions for all Dell PowerEdge servers in MLPerf Training v2.1
Figure 1 shows our submission, in which the workloads span image classification, lightweight and heavy object detection, speech recognition, natural language processing, recommender systems, medical image segmentation, and reinforcement learning. The submissions used different NVIDIA GPUs, including the A100 (in PCIe and SXM4 form factors, with 40 GB and 80 GB VRAM) and the A30.
The Minigo result on the PowerEdge R750xa server is a first-time submission; it takes around 516 minutes to run to target quality. That submission used 4x A100 PCIe 80 GB GPUs.
Our results have increased in count from 41 to 45. This increased number of submissions helps customers see the performance of the systems using different PowerEdge servers, GPUs, and CPUs. With more results, customers can expect to see the influence of using different hardware settings that can play a vital role in time to convergence.
We secured several winning titles that demonstrate the higher performance of our systems relative to other submitters, starting with the highest number of results across all submitters. Other titles include the top position in time to converge for BERT, ResNet, and Mask R-CNN with our PowerEdge XE8545 server powered by NVIDIA A100-40GB GPUs.
Improvement in Performance for BERT and Mask R-CNN
Figure 2. Performance gains from MLPerf v2.0 to MLPerf v2.1 running BERT
Figure 2 shows the improvements seen with the PowerEdge R750xa and PowerEdge XE8545 servers with A100 GPUs from MLPerf Training v2.0 to MLPerf Training v2.1 running the BERT language model workload. The PowerEdge XE8545 server with A100-80GB has the fastest time to convergence and the highest improvement at 13.1 percent, whereas the PowerEdge XE8545 server with A100-40GB improved by 7.74 percent, followed by the PowerEdge R750xa server with A100-PCIe at 5.35 percent.
Figure 3. Performance gains from MLPerf v2.0 to MLPerf v2.1 running Mask R-CNN
Figure 3 shows the improvements seen with the PowerEdge XE8545 server with A100 GPUs. There is a 3.31 percent improvement in time to convergence with MLPerf v2.1.
For both BERT and Mask R-CNN, the improvements are software-based. These results show that software-only improvements can reduce convergence time. Customers can benefit from similar improvements without any changes in their hardware environment.
The following sections compare the performance differences between SXM and PCIe form factor GPUs.
Performance Difference Between PCIe and SXM4 Form Factor with A100 GPUs
Figure 4. SXM4 form factor compared to PCIe for BERT
Figure 5. SXM4 form factor compared to PCIe for ResNet50 v1.5
Figure 6. SXM4 form factor compared to PCIe for RNN-T
Table 1: Time to converge with PCIe and SXM4 form factors (lower is better)

| System | BERT | ResNet50 | RNN-T |
|---|---|---|---|
| R750xax4A100-PCIe-80GB | 48.95 | 61.27 | 66.19 |
| XE8545x4A100-SXM-80GB | 32.79 | 54.23 | 55.08 |
| Percentage difference | 39.54% | 12.19% | 18.32% |
Figures 4, 5, and 6 and Table 1 show that the SXM form factor is faster than the PCIe form factor for the BERT, ResNet50 v1.5, and RNN-T workloads.
The SXM form factor typically consumes more power and is faster than PCIe. For the above workloads, the minimum percentage improvement in convergence that customers can expect is in double digits, ranging from approximately 12 percent to 40 percent, depending on the workload.
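For reference, the "percentage difference" row in Table 1 is consistent with the symmetric percent-difference formula: the gap between the two convergence times divided by their mean. A quick check against the BERT column (a minimal sketch, not official MLPerf tooling):

```python
def pct_difference(a: float, b: float) -> float:
    """Symmetric percent difference: |a - b| relative to the mean of a and b."""
    return abs(a - b) / ((a + b) / 2.0) * 100.0

# BERT: R750xa (PCIe) converges in 48.95, XE8545 (SXM) in 32.79 (Table 1).
print(round(pct_difference(48.95, 32.79), 2))  # 39.54, matching Table 1
```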
Multinode Results Comparison
Multinode performance assessment is more important than ever. With the advent of large models and different parallelism techniques, customers have an ever-increasing need to find results faster. Therefore, we have submitted several multinode results to assess scaling performance.
Figure 7. BERT multinode results with PowerEdge R750xa and XE8545 servers
Figure 7 indicates multinode results from three different systems with the following configurations:
- R750xa with 4 A100-PCIe-80GB GPUs
- XE8545 with 4 A100-SXM-40GB GPUs
- XE8545 with 4 A100-SXM-80GB GPUs
Each node of the above systems has four GPUs. When the graph shows eight GPUs, the performance results are derived from two nodes. Similarly, for 16 GPUs, the results are derived from four nodes, and so on.
Figure 8. Resnet50 multinode results with R750xa and XE8545 servers
Figure 9. Mask R-CNN multinode results with R750xa and XE8545 servers
As shown in Figures 7, 8, and 9, the multinode scaling results for BERT, ResNet50, and Mask R-CNN are linear or nearly linear. This shows that Dell servers offer outstanding performance with single-node and multinode scaling.
Conclusion
The findings described in this blog show that:
- Dell servers can run all types of workloads in the MLPerf Training submission.
- Software-only enhancements reduce time to solution for our customers, as shown in our MLPerf Training v2.1 submission, and customers can expect to see improvements in their environments.
- Dell PowerEdge XE8545 and PowerEdge R750xa servers with NVIDIA A100 with PCIe and SXM4 form factors are both great selections for all deep learning models.
- PCIe-based PowerEdge R750xa servers can deliver reinforcement learning workloads in addition to other classes of workloads, such as image classification, lightweight and heavy object detection, speech recognition, natural language processing, and medical image segmentation.
- The single-node results of our submission indicate that Dell servers deliver outstanding performance, and the multinode results show well-scaled performance that helps to reduce time to solution across a distinct set of workload types. This makes Dell servers apt for both small training workloads on single nodes and large deep learning training workloads across multiple nodes.
Appendix
System Under Test
MLPerf system configurations for PowerEdge XE8545 systems
| System | Operating system | CPU | Memory | GPU | GPU form factor | GPU count | Networking | Software stack |
|---|---|---|---|---|---|---|---|---|
| XE8545x4A100-SXM-40GB, 2x, 4x, 8x, 16x, and 32x XE8545x4A100-SXM-40GB | Red Hat Enterprise Linux | AMD EPYC 7713 | 1 TB | NVIDIA A100-SXM-40GB | SXM4 | 4, 8, 16, 32, 64, 128 | 2x ConnectX-6 IB HDR 200Gb/Sec | CUDA 11.6, Driver 510.47.03, cuBLAS 11.9.2.110, cuDNN 8.4.0.27, TensorRT 8.0.3, DALI 1.5.0, NCCL 2.12.10, Open MPI 4.1.1rc1, MOFED 5.4-1.0.3.0 |
| XE8545x4A100-SXM-80GB | Ubuntu 20.04.4 | AMD EPYC 7763 | 1 TB | NVIDIA A100-SXM-80GB | SXM4 | 4 | 2x ConnectX-6 IB HDR 200Gb/Sec | CUDA 11.6, Driver 510.47.03, cuBLAS 11.9.2.110, cuDNN 8.4.0.27, TensorRT 8.0.3, DALI 1.5.0, NCCL 2.12.10, Open MPI 4.1.1rc1, MOFED 5.4-1.0.3.0 |
| 2xXE8545x4A100-SXM-80GB, 4xXE8545x4A100-SXM-80GB | Red Hat Enterprise Linux | AMD EPYC 7713 | 1 TB | NVIDIA A100-SXM-80GB | SXM4 | 8, 16 (4 per node) | 2x ConnectX-6 IB HDR 200Gb/Sec | CUDA 11.6, Driver 510.47.03, cuBLAS 11.9.2.110, cuDNN 8.4.0.27, TensorRT 8.0.3, DALI 1.5.0, NCCL 2.12.10, Open MPI 4.1.1rc1, MOFED 5.4-1.0.3.0 |
MLPerf system configurations for Dell PowerEdge R750xa servers
| | 2xR750xa_A100 | 8xR750xa_A100 |
|---|---|---|
| MLPerf System ID | 2xR750xax4A100-PCIE-80GB | 8xR750xax4A100-PCIE-80GB |
| Operating system | CentOS 8.2.2004 | CentOS 8.2.2004 |
| CPU | Intel Xeon Gold 6338 | Intel Xeon Gold 6338 |
| Memory | 512 GB | 512 GB |
| GPU | NVIDIA A100-PCIE-80GB | NVIDIA A100-PCIE-80GB |
| GPU form factor | PCIe | PCIe |
| GPU count | 8 (4 per node) | 32 (4 per node) |
| Networking | 1x ConnectX-5 IB EDR 100Gb/Sec | 1x ConnectX-5 IB EDR 100Gb/Sec |
| Software stack | CUDA 11.6, Driver 470.42.01, cuBLAS 11.9.2.110, cuDNN 8.4.0.27, TensorRT 8.0.3, DALI 1.5.0, NCCL 2.12.10, Open MPI 4.1.1rc1, MOFED 5.4-1.0.3.0 | CUDA 11.6, Driver 470.42.01, cuBLAS 11.9.2.110, cuDNN 8.4.0.27, TensorRT 8.0.3, DALI 1.5.0, NCCL 2.12.10, Open MPI 4.1.1rc1, MOFED 5.4-1.0.3.0 |
MLPerf system configurations for Dell DSS 8440 servers
| | DSS 8440 |
|---|---|
| MLPerf System ID | DSS8440x8A30-NVBRIDGE |
| Operating system | CentOS 8.2.2004 |
| CPU | Intel Xeon Gold 6248R |
| Memory | 768 GB |
| GPU | NVIDIA A30 |
| GPU form factor | PCIe |
| GPU count | 8 |
| Networking | 1x ConnectX-5 IB EDR 100Gb/Sec |
| Software stack | CUDA 11.6, Driver 510.47.03, cuBLAS 11.9.2.110, cuDNN 8.4.0.27, TensorRT 8.0.3, DALI 1.5.0, NCCL 2.12.10, Open MPI 4.1.1rc1, MOFED 5.4-1.0.3.0 |