
Dell EMC Servers Excel in MLPerf™ Training v1.0 Benchmarks
Thu, 08 Jul 2021 15:28:25 -0000
Dell Technologies has submitted MLPerf training v1.0 results. This blog provides an explanation of what is new with MLPerf training v1.0 and a high-level overview of our submissions. Results indicate that Dell EMC DSS8440 and PowerEdge XE8545 servers offer promising performance for Deep Learning training workloads across different areas.
MLCommons™ is a consortium of machine learning and deep learning experts from industry, academia, and startups, along with individual researchers. MLPerf™ Training is its community-led test suite for deep learning training. The suite measures how fast a system can train deep learning models across eight problem types:
- Image classification
- Medical image segmentation
- Light-weight object detection
- Heavy-weight object detection
- Speech recognition
- Natural language processing
- Recommendation
- Reinforcement learning
These benchmarks provide a consistent and reproducible way to measure accuracy and convergence on individual accelerators, systems, and cloud setups. In June 2021, MLCommons released the v1.0 results, the fourth round of MLPerf™ Training submissions. The following changes are new with v1.0:
- Addition of two benchmarks:
  - RNN-T: a speech recognition model. Speech recognition accepts raw audio samples and produces a corresponding text transcription. The benchmark uses the LibriSpeech dataset, which is derived from audiobooks. An example of the use of speech recognition is Google Voice Search.
  - 3D-UNet: a model for 3D medical image segmentation. It accepts 3D images that contain tumors and separates (segments) the tumor from the other parts of the image. The benchmark uses the KiTS19 dataset. An example of the use of 3D medical image segmentation is the identification of kidney tumors.
- Introduction of a uniform and more mature process for evaluation and submission:
  - Reference Convergence Points (RCP) checker to ensure hyperparameters are assessed consistently and uniformly across different submissions.
  - Other checkers such as compliance checker, system desc checker, and package checker to check the accuracy of the submission.
  - Result summarizer to provide a submission summary.
- Retirement of two language translation benchmarks from v0.7:
  - GNMT
  - Transformer
BERT serves as a replacement for language model tasks.
The following figure demonstrates the numbers from the Deep Learning v1.0 benchmarks submitted by Dell Technologies:
Figure 1: MLPerf v1.0 results from Dell Technologies
Contributions from Dell Technologies
Our submissions focused on Dell EMC DSS 8440 and Dell EMC PowerEdge XE8545 servers. The DSS 8440 server is an Intel-based, PCIe Gen3 4U server that supports up to 10 double-wide PCIe GPUs, focused on Machine Learning/Deep Learning applications such as training. The 4U PowerEdge XE8545 server supports the latest 3rd Gen AMD EPYC processors, PCIe Gen4, and the latest NVIDIA A100 Tensor Core GPUs for cutting edge machine learning workloads. Both of these system configurations are NVIDIA-Certified, which means they have been validated for best performance and optimal scalability. The submission from Dell Technologies also included multinode training entries to showcase scale-out performance.
Multinode training is important. Because training is compute intensive, models are often trained across multiple compute nodes to reduce turnaround time, so it is critical to showcase multinode performance. Dell Technologies and NVIDIA are the only submitters that submitted multinode results on GPUs. The submissions from NVIDIA run on Docker with a customized Slurm environment to optimize performance; we submitted multinode results with Singularity on our DSS 8440 servers as well as with Docker and Slurm on PowerEdge XE8545 servers. Singularity is a secure containerization solution primarily used in traditional HPC GPU clusters. Setup scripts with Singularity help traditional HPC customers run MLPerf™ Training on their cluster without the need to fully restructure their existing cluster setup.
The PowerEdge XE8545 server provides the best performing submission with an air-cooled solution for NVIDIA A100-SXM-80GB 500W GPUs. Typically, 500W GPUs in most vendors' systems are liquid cooled because of the challenges presented by the high TDP. However, Dell Technologies invested engineering and design time to solve the thermal challenge, which allows customers to avoid costly changes to a standard data center setup.
The DSS 8440 server submissions to MLPerf™ Training v1.0 using the latest generation NVIDIA A100 40 GB-PCIe GPUs show a 2.1 to 2.4 times performance increase over equivalent MLPerf™ Training v0.7 submissions using NVIDIA V100S PCIe GPUs. Dell Technologies is committed to bringing the latest performance advancements to customers as quickly as possible.
Out of 12 different organizations, Dell Technologies and NVIDIA are the only two organizations that submitted results for all eight models in the MLPerf™ training v1.0 benchmarking suite.
Next steps
As a next step, we will publish more technical blogs to provide deep dives into DSS 8440 server and PowerEdge XE8545 server results.
Related Blog Posts

Multinode Performance of Dell PowerEdge Servers with MLPerf™ Training v1.1
Mon, 07 Mar 2022 19:51:12 -0000
The Dell MLPerf v1.1 submission included multinode results. This blog showcases performance across multiple nodes on Dell PowerEdge R750xa and XE8545 servers and demonstrates that the multinode scaling performance was excellent.
The compute requirement for deep learning training is growing at a rapid pace. It is imperative to train models across multiple nodes to attain a faster time-to-solution. Therefore, it is critical to showcase the scaling performance across multiple nodes. To demonstrate to customers the performance that they can expect across multiple nodes, our v1.1 submission includes multinode results. The following figures show multinode results for PowerEdge R750xa and XE8545 systems.
Figure 1: One-, two-, four-, and eight-node results with PowerEdge R750xa Resnet50 MLPerf v1.1 scaling performance
Figure 1 shows the performance of the PowerEdge R750xa server with Resnet50 training. These numbers scale from one node to eight nodes, from four NVIDIA A100-PCIE-80GB GPUs to 32 NVIDIA A100-PCIE-80GB GPUs. We can see that the scaling is almost linear across nodes. MLPerf training requires passing Reference Convergence Points (RCP) for compliance, and this requirement kept the eight-node case from showing fully linear scaling. The near-linear scaling makes the PowerEdge R750xa node an excellent choice for a multinode training setup.
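To make "almost linear" concrete, scaling efficiency can be computed from time-to-train values. The following is a minimal sketch, not part of our submission: the function name is our own, and the sample times are hypothetical placeholders rather than measured results.

```python
def scaling_efficiency(base_minutes, base_nodes, scaled_minutes, scaled_nodes):
    """Return (speedup, efficiency) of a multinode run relative to a baseline run.

    An efficiency of 1.0 corresponds to perfectly linear scaling.
    """
    speedup = base_minutes / scaled_minutes
    ideal_speedup = scaled_nodes / base_nodes
    return speedup, speedup / ideal_speedup


# Hypothetical placeholder times in minutes, NOT measured results.
speedup, efficiency = scaling_efficiency(
    base_minutes=80.0, base_nodes=1, scaled_minutes=12.5, scaled_nodes=8
)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

With these placeholder inputs, an eight-node run that is 6.4 times faster than a single node corresponds to an efficiency of 80 percent.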
The workload was distributed by using Singularity on PowerEdge R750xa servers. Singularity is a secure containerization solution that is primarily used in traditional HPC GPU clusters. Our submission includes setup scripts with Singularity that help traditional HPC customers run workloads without the need to fully restructure their existing cluster setup. The submission also includes Slurm Docker-based scripts.
Figure 2: Multinode submission results for PowerEdge XE8545 server with BERT, MaskRCNN, Resnet50, SSD, and RNNT
Figure 2 shows the submitted performance of the PowerEdge XE8545 server with BERT, MaskRCNN, Resnet50, SSD, and RNNT training. These numbers scale from one node to two nodes, from four NVIDIA A100-SXM-80GB GPUs to eight NVIDIA A100-SXM-80GB GPUs. All GPUs operate at 500W TDP for maximum performance. The workloads were distributed using Slurm and Docker on PowerEdge XE8545 servers. The scaling is nearly linear.
Note: The RNN-T single-node results submitted for the PowerEdge XE8545x4A100-SXM-80GB system used a different set of hyperparameters than the two-node results. After the submission, we ran the RNN-T benchmark again on the PowerEdge XE8545x4A100-SXM-80GB system with the same hyperparameters and found that the new time to converge is approximately 77.37 minutes. Because we only had the resources to update the results for the 2xXE8545x4A100-SXM-80GB system before the submission deadline, the MLCommons results show 105.6 minutes for a single-node XE8545x4A100-SXM-80GB system.
The following figure shows the adjusted representation of performance for the PowerEdge XE8545x4A100-SXM-80GB system. RNN-T provides an unverified score of 77.31 minutes[1]:
Figure 3: Revised multinode results with PowerEdge XE8545 BERT, MaskRCNN, Resnet50, SSD, and RNNT
Figure 3 shows the linear scaling abilities of the PowerEdge XE8545 server across different workloads such as BERT, MaskRCNN, ResNet, SSD, and RNNT. This linear scaling ability makes the PowerEdge XE8545 server an excellent choice to run large-scale multinode workloads.
Note: This rnnt.zip file includes log files for 10 runs that show that the averaged performance is 77.31 minutes.
Conclusion
- It is critical to measure deep learning performance across multiple nodes to assess the scalability component of training as deep learning workloads are growing rapidly.
- Our MLPerf training v1.1 submission includes multinode results that scale linearly or nearly linearly and perform extremely well.
- Scaling numbers for the PowerEdge XE8545 and PowerEdge R750xa servers make them excellent platform choices for enabling large-scale deep learning training workloads across different areas and tasks.
[1] MLPerf v1.1 Training RNN-T; Result not verified by the MLCommons™ Association. The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See http://www.mlcommons.org for more information.

Dell Servers Excel in MLPerf™ Training v2.1
Wed, 16 Nov 2022 10:07:33 -0000
Dell Technologies has successfully submitted results for MLPerf Training v2.1, which marks the seventh round of submission to MLCommons™. This blog provides an overview and highlights the performance of the Dell PowerEdge R750xa, XE8545, and DSS8440 servers that were used for the submission.
What’s new in MLPerf Training v2.1?
This round of submission does not include new benchmarks or changes in the existing benchmarks. A change is introduced in the submission compliance checker.
This round adds one-sided normalization to the checker to reduce variance in the number of steps to converge. This change means that if a result converges faster than the RCP mean within a certain range, the checker normalizes the results to the RCP mean. This normalization was not available in earlier rounds of submission.
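The idea can be sketched in a few lines. This is an illustration only and not the MLCommons checker code: the function name, the `tolerance` parameter that defines the accepted range, the scaling rule, and the sample numbers are all assumptions made for this example.

```python
def normalize_one_sided(minutes, steps, rcp_mean_steps, tolerance):
    """Sketch of one-sided normalization toward the RCP mean.

    minutes        -- measured time to converge
    steps          -- steps the run needed to converge
    rcp_mean_steps -- mean steps to converge of the reference runs
    tolerance      -- hypothetical fraction defining the accepted range below the mean
    """
    if steps >= rcp_mean_steps:
        # Not faster than the reference mean: the rule is one-sided, so nothing changes.
        return minutes
    if steps < rcp_mean_steps * (1.0 - tolerance):
        # Outside the accepted range; this sketch does not model that case.
        raise ValueError("run converged faster than the range covered by this sketch")
    # Assumed rule: scale the time as if the run had needed the RCP mean number of steps.
    return minutes * rcp_mean_steps / steps


# Hypothetical numbers for illustration only.
print(normalize_one_sided(minutes=50.0, steps=3800, rcp_mean_steps=4000, tolerance=0.1))
print(normalize_one_sided(minutes=50.0, steps=4200, rcp_mean_steps=4000, tolerance=0.1))
```

In this sketch, a run that converges slightly faster than the reference mean has its time adjusted upward, while a run that converges more slowly is left unchanged.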
What’s new in MLPerf Training v2.1 with Dell submissions?
Our MLPerf Training v2.1 submission included:
- Improved performance with BERT and Mask R-CNN models
- Minigo submission results on Dell PowerEdge R750xa server with A100 PCIe GPUs
Overall Dell Submissions
Figure 1. Overall submissions for all Dell PowerEdge servers in MLPerf Training v2.1
Figure 1 shows our submission, in which the workloads span image classification, lightweight and heavy object detection, speech recognition, natural language processing, recommender systems, medical image segmentation, and reinforcement learning. The systems used different NVIDIA GPUs, including the A30 and the A100 in PCIe and SXM4 form factors with 40 GB and 80 GB of VRAM.
The Minigo result on the PowerEdge R750xa server is a first-time submission; it takes around 516 minutes to reach target quality. That system has 4x A100 PCIe 80 GB GPUs.
Our results have increased in count from 41 to 45. This increased number of submissions helps customers see the performance of the systems using different PowerEdge servers, GPUs, and CPUs. With more results, customers can expect to see the influence of using different hardware settings that can play a vital role in time to convergence.
We earned several winning titles that demonstrate the higher performance of our systems relative to other submitters, starting with the highest number of results across all submitters. Other titles include the top position in time to converge for BERT, ResNet, and Mask R-CNN with our PowerEdge XE8545 server powered by NVIDIA A100-40GB GPUs.
Improvement in Performance for BERT and Mask R-CNN
Figure 2. Performance gains from MLPerf v2.0 to MLPerf v2.1 running BERT
Figure 2 shows the improvements seen with the PowerEdge R750xa and PowerEdge XE8545 servers with A100 GPUs from MLPerf Training v2.0 to MLPerf Training v2.1 running the BERT language model workload. The PowerEdge XE8545 server with A100-80GB has the fastest time to convergence and the highest improvement at 13.1 percent, followed by the PowerEdge XE8545 server with A100-40GB at 7.74 percent and the PowerEdge R750xa server with A100-PCIe at 5.35 percent.
Figure 3. Performance gains from MLPerf v2.0 to MLPerf v2.1 running Mask R-CNN
Figure 3 shows the improvements seen with the PowerEdge XE8545 server with A100 GPUs. There is a 3.31 percent improvement in time to convergence with MLPerf v2.1.
For both BERT and Mask R-CNN, the improvements are software-based. These results show that software-only improvements can reduce convergence time. Customers can benefit from similar improvements without any changes in their hardware environment.
The following sections compare the performance differences between SXM and PCIe form factor GPUs.
Performance Difference Between PCIe and SXM4 Form Factor with A100 GPUs
Figure 4. SXM4 form factor compared to PCIe for BERT
Figure 5. SXM4 form factor compared to PCIe for Resnet50 v1.5
Figure 6. SXM4 form factor compared to PCIe for RNN-T
Table 1. Results for PCIe and SXM4 form factor systems (lower is faster)
System | BERT | Resnet50 | RNN-T |
R750xax4A100-PCIe-80GB | 48.95 | 61.27 | 66.19 |
XE8545x4A100-SXM-80GB | 32.79 | 54.23 | 55.08 |
Percentage difference | 39.54% | 12.19% | 18.32% |
Figures 4, 5, and 6 and Table 1 show that the SXM form factor is faster than the PCIe form factor for BERT, Resnet50 v1.5, and RNN-T workloads.
The SXM form factor typically consumes more power and is faster than PCIe. For the workloads above, the difference in time to convergence is in double digits, ranging from approximately 12 percent to 40 percent depending on the workload.
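The percentages in Table 1 are consistent with a symmetric percentage difference, that is, the absolute difference divided by the mean of the two values. The short sketch below reproduces them from the table values; the function names are our own. It also computes the reduction relative to the PCIe result, which is a different and smaller number, for readers who prefer that view.

```python
def percentage_difference(a, b):
    """Symmetric percentage difference: |a - b| relative to the mean of a and b."""
    return abs(a - b) / ((a + b) / 2) * 100


def percentage_reduction(baseline, improved):
    """Reduction of `improved` relative to `baseline`, in percent."""
    return (baseline - improved) / baseline * 100


# Values copied from Table 1: (R750xax4A100-PCIe-80GB, XE8545x4A100-SXM-80GB)
results = {
    "BERT": (48.95, 32.79),
    "Resnet50": (61.27, 54.23),
    "RNN-T": (66.19, 55.08),
}

for name, (pcie, sxm) in results.items():
    diff = percentage_difference(pcie, sxm)
    reduction = percentage_reduction(pcie, sxm)
    print(f"{name}: difference {diff:.2f}%, reduction vs. PCIe {reduction:.2f}%")
```

The "difference" column matches the table (39.54, 12.19, and 18.32 percent).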
Multinode Results Comparison
Multinode performance assessment is more important than ever. With the advent of large models and different parallelism techniques, customers have an ever-increasing need to find results faster. Therefore, we have submitted several multinode results to assess scaling performance.
Figure 7. BERT multinode results with PowerEdge R750xa and XE8545 servers
Figure 7 indicates multinode results from three different systems with the following configurations:
- R750xa with 4 A100-PCIe-80GB GPUs
- XE8545 with 4 A100-SXM-40GB GPUs
- XE8545 with 4 A100-SXM-80GB GPUs
Each node in the systems above has four GPUs. When the graph shows eight GPUs, the performance results are derived from two nodes. Similarly, for 16 GPUs the results are derived from four nodes, and so on.
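The same node and GPU counts can be read from the MLPerf system IDs used in this blog, such as 2xXE8545x4A100-SXM-80GB. The sketch below assumes the pattern [nodes]x[server]x[GPUs per node][GPU model], which is inferred from the IDs that appear here and is not an official specification; the function and field names are hypothetical.

```python
import re

# Assumed pattern, inferred from the IDs in this blog:
#   [<nodes>x]<server>x<GPUs per node><GPU model>
SYSTEM_ID = re.compile(r"^(?:(\d+)x)?(.+?)x(\d+)(.+)$")


def parse_system_id(system_id):
    """Split a system ID into server, node count, GPUs per node, and GPU model."""
    match = SYSTEM_ID.match(system_id)
    if match is None:
        raise ValueError(f"unrecognized system ID: {system_id}")
    nodes, server, gpus_per_node, gpu = match.groups()
    nodes = int(nodes) if nodes else 1  # no prefix means a single node
    gpus_per_node = int(gpus_per_node)
    return {
        "server": server,
        "nodes": nodes,
        "gpus_per_node": gpus_per_node,
        "gpu": gpu,
        "total_gpus": nodes * gpus_per_node,
    }


for system_id in ("XE8545x4A100-SXM-80GB", "4xXE8545x4A100-SXM-80GB",
                  "8xR750xax4A100-PCIE-80GB"):
    info = parse_system_id(system_id)
    print(system_id, "->", info["nodes"], "node(s),", info["total_gpus"], "GPUs")
```

The lazy match on the server name is what lets an ID such as R750xax4A100-PCIE-80GB resolve to server R750xa with four GPUs, even though the server name itself contains an "x".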
Figure 8. Resnet50 multinode results with R750xa and XE8545 servers
Figure 9. Mask R-CNN multinode results with R750xa and XE8545 servers
As shown in Figures 7, 8, and 9, the multinode results for BERT, Resnet50, and Mask R-CNN scale linearly or nearly linearly. This shows that Dell servers offer outstanding performance with single-node and multinode scaling.
Conclusion
The findings described in this blog show that:
- Dell servers can run all types of workloads in the MLPerf Training submission.
- Software-only enhancements reduce time to solution for our customers, as shown in our MLPerf Training v2.1 submission, and customers can expect to see improvements in their environments.
- Dell PowerEdge XE8545 and PowerEdge R750xa servers with NVIDIA A100 with PCIe and SXM4 form factors are both great selections for all deep learning models.
- PCIe-based PowerEdge R750xa servers can deliver reinforcement learning workloads in addition to other classes of workloads, such as image classification, lightweight and heavy object detection, speech recognition, natural language processing, and medical image segmentation.
- The single-node results of our submission indicate that Dell servers deliver outstanding performance, and the multinode results show well-scaled performance that helps to reduce time to solution across a distinct set of workload types. This makes Dell servers apt for both small training workloads on single nodes and large deep learning training workloads across multiple nodes.
Appendix
System Under Test
MLPerf system configurations for PowerEdge XE8545 systems
MLPerf System ID | Operating system | CPU | Memory | GPU | GPU form factor | GPU count | Networking | Software stack |
XE8545x4A100-SXM-40GB, 2xXE8545x4A100-SXM-40GB, 4xXE8545x4A100-SXM-40GB, 8xXE8545x4A100-SXM-40GB, 16xXE8545x4A100-SXM-40GB, 32xXE8545x4A100-SXM-40GB | Red Hat Enterprise Linux | AMD EPYC 7713 | 1 TB | NVIDIA A100-SXM-40GB | SXM4 | 4, 8, 16, 32, 64, 128 | 2x ConnectX-6 IB HDR 200Gb/Sec | CUDA 11.6 Driver 510.47.03 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0 |
XE8545x4A100-SXM-80GB | Ubuntu 20.04.4 | AMD EPYC 7763 | 1 TB | NVIDIA A100-SXM-80GB | SXM4 | 4 | | CUDA 11.6 Driver 510.47.03 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0 |
2xXE8545x4A100-SXM-80GB, 4xXE8545x4A100-SXM-80GB | Red Hat Enterprise Linux | AMD EPYC 7713 | 1 TB | NVIDIA A100-SXM-80GB | SXM4 | 4, 8 | | CUDA 11.6 Driver 510.47.03 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0 |
MLPerf system configurations for Dell PowerEdge R750xa servers
| 2xR750xa_A100 | 8xR750xa_A100 |
MLPerf System ID | 2xR750xax4A100-PCIE-80GB | 8xR750xax4A100-PCIE-80GB |
Operating system | CentOS 8.2.2004 | |
CPU | Intel Xeon Gold 6338 | |
Memory | 512 GB | |
GPU | NVIDIA A100-PCIE-80GB | |
GPU form factor | PCIe | |
GPU count | 4,32 | |
Networking | 1x ConnectX-5 IB EDR 100Gb/Sec | |
Software stack | CUDA 11.6 Driver 470.42.01 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0 | CUDA 11.6 Driver 470.42.01 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0 |
MLPerf system configurations for Dell DSS 8440 servers
| DSS 8440 |
MLPerf System ID | DSS8440x8A30-NVBRIDGE |
Operating system | CentOS 8.2.2004 |
CPU | Intel Xeon Gold 6248R |
Memory | 768 GB |
GPU | NVIDIA A30 |
GPU form factor | PCIe |
GPU count | 8 |
Networking | 1x ConnectX-5 IB EDR 100Gb/Sec |
Software stack | CUDA 11.6 Driver 510.47.03 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0 |