Dell Technologies has successfully submitted results to MLPerf Training v2.1, our seventh round of submissions to MLCommons™. This blog provides an overview of the submission and highlights the performance of the Dell PowerEdge R750xa, XE8545, and DSS8440 servers that were used for it.

What’s new in MLPerf Training v2.1?

This round of submission does not include new benchmarks or changes in the existing benchmarks. A change is introduced in the submission compliance checker.

This round adds one-sided normalization to the checker to reduce variance in the number of steps to converge. If a result converges faster than the reference convergence point (RCP) mean, within a certain range, the checker normalizes the result to the RCP mean. Results that converge slower than the RCP mean are not adjusted by this rule, which is why the normalization is one-sided. This normalization was not available in earlier rounds of submission.
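The idea can be sketched in a few lines of Python. This is a simplified illustration and not the MLCommons checker code: the function name, the parameters, the linear scaling by step count, and the `lower_bound_steps` cutoff that stands in for "a certain range" are all assumptions made for this example.

```python
def one_sided_normalize(time_to_train, steps_to_converge,
                        rcp_mean_steps, lower_bound_steps):
    """Illustrative sketch of one-sided normalization (hypothetical API).

    time_to_train      -- measured time for the run
    steps_to_converge  -- steps the run needed to reach target quality
    rcp_mean_steps     -- mean steps of the reference convergence points
    lower_bound_steps  -- assumed lower edge of the accepted range
    """
    if steps_to_converge < lower_bound_steps:
        # Faster than the accepted range: this sketch does not handle it.
        raise ValueError("convergence is outside the accepted range")
    if steps_to_converge >= rcp_mean_steps:
        # Slower than or equal to the RCP mean: left unchanged (one-sided).
        return time_to_train
    # Faster than the RCP mean: scale the time up to the RCP mean.
    return time_to_train * rcp_mean_steps / steps_to_converge


# A run that converged in 80 steps against an RCP mean of 100 steps
# is scaled as if it had needed 100 steps.
print(one_sided_normalize(100.0, 80, 100, 70))
```

Under these assumptions, a lucky fast-converging run no longer reports a shorter time than a typical run would, which is how the rule reduces variance between submissions.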

What’s new in MLPerf Training v2.1 with Dell submissions?

For the Dell submission to MLPerf Training v2.1, we included:

Improved performance with BERT and Mask R-CNN models
Minigo submission results on Dell PowerEdge R750xa server with A100 PCIe GPUs

Overall Dell Submissions

Figure 1. Overall submissions for all Dell PowerEdge servers in MLPerf Training v2.1

Figure 1 shows our submission, in which the workloads span image classification, lightweight and heavy object detection, speech recognition, natural language processing, recommender systems, medical image segmentation, and reinforcement learning. The systems used different NVIDIA GPUs: the A100 in PCIe and SXM4 form factors with 40 GB and 80 GB of memory, and the A30.

The Minigo result on the PowerEdge R750xa server is a first-time submission, and it takes around 516 minutes to run to target quality. That system has four A100 PCIe 80 GB GPUs.

Our results have increased in count from 41 to 45. This larger set of results helps customers compare performance across different PowerEdge servers, GPUs, and CPUs, and see how hardware choices affect time to convergence.

We earned several top positions relative to other submitters, starting with the highest number of results across all the submitters. Other titles include the top position in time to converge for BERT, ResNet, and Mask R-CNN with our PowerEdge XE8545 server powered by NVIDIA A100-40GB GPUs.

Improvement in Performance for BERT and Mask R-CNN

Figure 2. Performance gains from MLPerf v2.0 to MLPerf v2.1 running BERT

Figure 2 shows the improvements seen with the PowerEdge R750xa and PowerEdge XE8545 servers with A100 GPUs from MLPerf Training v2.0 to MLPerf Training v2.1 running the BERT language model workload. The PowerEdge XE8545 server with A100-80GB has the fastest time to convergence and the highest improvement at 13.1 percent. The PowerEdge XE8545 server with A100-40GB improves by 7.74 percent, followed by the PowerEdge R750xa server with A100-PCIe at 5.35 percent.

Figure 3. Performance gains from MLPerf v2.0 to MLPerf v2.1 running Mask R-CNN

Figure 3 shows the improvements seen with the PowerEdge XE8545 server with A100 GPUs. There is a 3.31 percent improvement in time to convergence with MLPerf v2.1.

For both BERT and Mask R-CNN, the improvements are software-based. These results show that software-only improvements can reduce convergence time. Customers can benefit from similar improvements without any changes in their hardware environment.

The following section compares the performance of SXM4 and PCIe form factor GPUs.

Performance Difference Between PCIe and SXM4 Form Factor with A100 GPUs

Figure 4. SXM4 form factor compared to PCIe for BERT

Figure 5. SXM4 form factor compared to PCIe for Resnet50 v1.5

Figure 6. SXM4 form factor compared to PCIe for RNN-T

Table 1. SXM4 form factor compared to PCIe for BERT, Resnet50 v1.5, and RNN-T

System	BERT	Resnet50	RNN-T
R750xax4A100-PCIe-80GB	48.95	61.27	66.19
XE8545x4A100-SXM-80GB	32.79	54.23	55.08
Percentage difference	39.54%	12.19%	18.32%

Figures 4, 5, and 6 and Table 1 show that the SXM4 form factor is faster than the PCIe form factor for the BERT, Resnet50 v1.5, and RNN-T workloads.

The SXM4 form factor typically consumes more power and is faster than PCIe. For these workloads, the difference in time to converge is in double digits, ranging from approximately 12 percent to 40 percent, depending on the workload.
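The percentage row in Table 1 is consistent with a symmetric percentage difference, that is, the gap between the two values divided by their average. The short Python sketch below reproduces that row from the table values. The function names are ours, and treating the values as times to converge is an assumption based on the figure captions.

```python
def percentage_difference(a, b):
    """Symmetric percentage difference: gap divided by the average."""
    return abs(a - b) / ((a + b) / 2) * 100


def reduction_vs_baseline(baseline, new):
    """Percentage reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline * 100


# Values copied from Table 1: (R750xa PCIe, XE8545 SXM)
table_1 = {
    "BERT": (48.95, 32.79),
    "Resnet50": (61.27, 54.23),
    "RNN-T": (66.19, 55.08),
}

for workload, (pcie, sxm) in table_1.items():
    print(f"{workload}: {percentage_difference(pcie, sxm):.2f}%")
# BERT: 39.54%
# Resnet50: 12.19%
# RNN-T: 18.32%
```

Note that the symmetric difference is not the same as the reduction measured against the PCIe result. Calling `reduction_vs_baseline(pcie, sxm)` on the same values gives roughly 33, 11.5, and 17 percent for BERT, Resnet50, and RNN-T respectively, so it is worth checking which definition a comparison uses.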

Multinode Results Comparison

Multinode performance assessment is more important than ever. With the advent of large models and different parallelism techniques, customers have an ever-increasing need to find results faster. Therefore, we have submitted several multinode results to assess scaling performance.

Figure 7. BERT multinode results with PowerEdge R750xa and XE8545 servers

Figure 7 shows multinode results from three different systems with the following configurations:

R750xa with 4 A100-PCIe-80GB GPUs
XE8545 with 4 A100-SXM-40GB GPUs
XE8545 with 4 A100-SXM-80GB GPUs

Each node in these systems has four GPUs. When the graph shows eight GPUs, the performance results are derived from two nodes. Similarly, for 16 GPUs the results are derived from four nodes, and so on.

Figure 8. Resnet50 multinode results with R750xa and XE8545 servers

Figure 9. Mask R-CNN multinode results with R750xa and XE8545 servers

As shown in Figures 7, 8, and 9, the multinode results for BERT, Resnet50, and Mask R-CNN scale linearly or nearly linearly. This shows that Dell servers offer outstanding performance with single-node and multinode scaling.
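One way to quantify "linear or nearly linear" is scaling efficiency: the measured speedup divided by the ideal speedup for the added GPUs. The Python sketch below is a minimal illustration. The function names are ours, and the times in the usage lines are hypothetical placeholders, not values from our submission.

```python
GPUS_PER_NODE = 4  # every node in these systems has four GPUs


def node_count(gpu_count):
    """Number of nodes behind a given GPU count on the graphs."""
    if gpu_count % GPUS_PER_NODE != 0:
        raise ValueError("GPU count must be a multiple of four")
    return gpu_count // GPUS_PER_NODE


def scaling_efficiency(base_gpus, base_minutes, gpus, minutes):
    """Measured speedup divided by ideal speedup (1.0 means linear)."""
    speedup = base_minutes / minutes
    ideal_speedup = gpus / base_gpus
    return speedup / ideal_speedup


print(node_count(8), node_count(16))  # 2 4

# Hypothetical times, for illustration only: halving the time when
# doubling the GPUs is perfectly linear scaling.
print(scaling_efficiency(4, 40.0, 8, 20.0))  # 1.0
```

An efficiency close to 1.0 at every step corresponds to the nearly straight lines in the figures, while a value that drops as nodes are added would indicate communication or input pipeline overhead.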

Conclusion

The findings described in this blog show that:

Dell servers can run all types of workloads in the MLPerf Training submission.
Software-only enhancements reduce time to solution for our customers, as shown in our MLPerf Training v2.1 submission, and customers can expect to see improvements in their environments.
Dell PowerEdge XE8545 servers with NVIDIA A100 SXM4 GPUs and PowerEdge R750xa servers with NVIDIA A100 PCIe GPUs are both great selections for all deep learning models.
PCIe-based PowerEdge R750xa servers can deliver reinforcement learning workloads in addition to other classes of workloads, such as image classification, lightweight and heavy object detection, speech recognition, natural language processing, and medical image segmentation.
The single-node results of our submission indicate that Dell servers deliver outstanding performance, and the multinode results scale well and help to reduce time to solution across a distinct set of workload types. This makes Dell servers apt for both small training workloads on single nodes and large deep learning training workloads on multiple nodes.

Appendix

System Under Test

MLPerf system configurations for PowerEdge XE8545 systems

MLPerf System ID	Operating system	CPU	Memory	GPU	GPU form factor	GPU count	Networking	Software stack
XE8545x4A100-SXM-40GB, 2xXE8545x4A100-SXM-40GB, 4xXE8545x4A100-SXM-40GB, 8xXE8545x4A100-SXM-40GB, 16xXE8545x4A100-SXM-40GB, 32xXE8545x4A100-SXM-40GB	Red Hat Enterprise Linux	AMD EPYC 7713	1 TB	NVIDIA A100-SXM-40GB	SXM4	4, 8, 16, 32, 64, 128	2x ConnectX-6 IB HDR 200Gb/Sec	CUDA 11.6 Driver 510.47.03 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0
XE8545x4A100-SXM-80GB	Ubuntu 20.04.4	AMD EPYC 7763	1 TB	NVIDIA A100-SXM-80GB	SXM4	4		CUDA 11.6 Driver 510.47.03 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0
2xXE8545x4A100-SXM-80GB, 4xXE8545x4A100-SXM-80GB	Red Hat Enterprise Linux	AMD EPYC 7713	1 TB	NVIDIA A100-SXM-80GB	SXM4	4, 8		CUDA 11.6 Driver 510.47.03 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0

MLPerf system configurations for Dell PowerEdge R750xa servers

	2xR750xa_A100	8xR750xa_A100
MLPerf System ID	2xR750xax4A100-PCIE-80GB	8xR750xax4A100-PCIE-80GB
Operating system	CentOS 8.2.2004
CPU	Intel Xeon Gold 6338
Memory	512 GB
GPU	NVIDIA A100-PCIE-80GB
GPU form factor	PCIe
GPU count	4,32
Networking	1x ConnectX-5 IB EDR 100Gb/Sec
Software stack	CUDA 11.6 Driver 470.42.01 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0	CUDA 11.6 Driver 470.42.01 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0

MLPerf system configurations for Dell DSS 8440 servers

	DSS 8440
MLPerf System ID	DSS8440x8A30-NVBRIDGE
Operating system	CentOS 8.2.2004
CPU	Intel Xeon Gold 6248R
Memory	768 GB
GPU	NVIDIA A30
GPU form factor	PCIe
GPU count	8
Networking	1x ConnectX-5 IB EDR 100Gb/Sec
Software stack	CUDA 11.6 Driver 510.47.03 cuBLAS 11.9.2.110 cuDNN 8.4.0.27 TensorRT 8.0.3 DALI 1.5.0 NCCL 2.12.10 Open MPI 4.1.1rc1 MOFED 5.4-1.0.3.0

Dell Servers Excel in MLPerf™ Training v2.1