Dell Servers Excel in MLPerf™ Training v2.1
Wed, 16 Nov 2022 10:07:33 -0000
Dell Technologies has successfully submitted results to MLPerf Training v2.1, which marks our seventh round of submission to MLCommons™. This blog provides an overview and highlights the performance of the Dell PowerEdge R750xa, XE8545, and DSS8440 servers that were used for the submission.
What’s new in MLPerf Training v2.1?
This round of submission does not include new benchmarks or changes in the existing benchmarks. A change is introduced in the submission compliance checker.
This round adds one-sided normalization to the checker to reduce variance in the number of steps to converge. If a result converges faster than the Reference Convergence Point (RCP) mean but still within a certain range, the checker normalizes the result to the RCP mean. This normalization was not available in earlier rounds of submission.
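The idea of one-sided normalization can be sketched in a few lines of Python. This is a simplified illustration only, not the MLCommons compliance checker: the function name, the `rcp_min_steps` bound that stands in for "a certain range", and the linear scaling rule are all assumptions made for this example.

```python
def normalize_to_rcp(time_to_train_min, steps_to_converge,
                     rcp_mean_steps, rcp_min_steps):
    """Hypothetical one-sided normalization of a training result.

    Results that converge at or above the RCP mean are left untouched.
    Results that converge faster than the RCP mean, but not faster than
    the assumed lower bound, are scaled up to the RCP mean.
    """
    if steps_to_converge >= rcp_mean_steps:
        # Slower than (or equal to) the RCP mean: no adjustment.
        return time_to_train_min
    if steps_to_converge < rcp_min_steps:
        # Outside the assumed acceptable range: this sketch rejects it.
        raise ValueError("result converged faster than the allowed range")
    # Faster than the RCP mean but in range: scale the time as if the
    # run had needed the RCP mean number of steps.
    return time_to_train_min * rcp_mean_steps / steps_to_converge


# Made-up numbers: a 100-minute run that converged in 900 steps
# against an RCP mean of 1000 steps.
print(round(normalize_to_rcp(100.0, 900, 1000, 800), 2))
```

The normalization is one-sided because only lucky, faster-than-reference runs are adjusted; slower runs keep their measured time.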
What’s new in MLPerf Training v2.1 with Dell submissions?
The Dell submission for MLPerf Training v2.1 included:
- Improved performance with BERT and Mask R-CNN models
- Minigo submission results on Dell PowerEdge R750xa server with A100 PCIe GPUs
Overall Dell Submissions
Figure 1. Overall submissions for all Dell PowerEdge servers in MLPerf Training v2.1
Figure 1 shows our submissions, whose workloads span image classification, lightweight and heavy object detection, speech recognition, natural language processing, recommender systems, medical image segmentation, and reinforcement learning. The submissions used different NVIDIA GPUs: the A30 and the A100, the latter in PCIe and SXM4 form factors with 40 GB and 80 GB of VRAM.
The Minigo result on the PowerEdge R750xa server is a first-time submission, and it takes around 516 minutes to run to target quality. That submission uses 4x A100 PCIe 80 GB GPUs.
Our results have increased in count from 41 to 45. This increased number of submissions helps customers see the performance of the systems using different PowerEdge servers, GPUs, and CPUs. With more results, customers can expect to see the influence of using different hardware settings that can play a vital role in time to convergence.
We earned several winning titles that demonstrate the higher performance of our systems relative to other submitters, starting with the highest number of results across all submitters. Other titles include the top position in time to converge for BERT, ResNet, and Mask R-CNN with our PowerEdge XE8545 server powered by NVIDIA A100-40GB GPUs.
Improvement in Performance for BERT and Mask R-CNN
Figure 2. Performance gains from MLPerf v2.0 to MLPerf v2.1 running BERT
Figure 2 shows the improvements from MLPerf Training v2.0 to MLPerf Training v2.1 for the PowerEdge R750xa and PowerEdge XE8545 servers with A100 GPUs running the BERT language model workload. The PowerEdge XE8545 server with A100-80GB has the fastest time to convergence and the highest improvement at 13.1 percent, followed by the PowerEdge XE8545 server with A100-40GB at 7.74 percent and the PowerEdge R750xa server with A100-PCIe at 5.35 percent.
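For readers who want to reproduce this kind of comparison on their own runs, the following Python sketch shows one common way to express a round-over-round gain as a percentage. The blog does not state the exact formula behind its figures, so the definition (reduction in time to converge relative to the older round) and the sample times are assumptions.

```python
def percent_improvement(old_minutes, new_minutes):
    """Percentage reduction in time to converge relative to the old result."""
    if old_minutes <= 0:
        raise ValueError("old_minutes must be positive")
    return (old_minutes - new_minutes) / old_minutes * 100.0


# Hypothetical times, not taken from the submission:
# 20.0 minutes in the earlier round, 18.0 minutes in the later round.
print(round(percent_improvement(20.0, 18.0), 2))
```

A positive value means the newer round converged faster; a negative value would indicate a regression.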
Figure 3. Performance gains from MLPerf v2.0 to MLPerf v2.1 running Mask R-CNN
Figure 3 shows the improvements seen with the PowerEdge XE8545 server with A100 GPUs. There is a 3.31 percent improvement in time to convergence with MLPerf v2.1.
For both BERT and Mask R-CNN, the improvements are software-based. These results show that software-only improvements can reduce convergence time. Customers can benefit from similar improvements without any changes in their hardware environment.
The following sections compare the performance differences between SXM and PCIe form factor GPUs.
Performance Difference Between PCIe and SXM4 Form Factor with A100 GPUs
Figure 4. SXM4 form factor compared to PCIe for BERT
Figure 5. SXM4 form factor compared to PCIe for ResNet50 v1.5
Figure 6. SXM4 form factor compared to PCIe for RNN-T
Figures 4, 5, and 6 and Table 1 show that the SXM form factor is faster than the PCIe form factor for the BERT, ResNet50 v1.5, and RNN-T workloads.
The SXM form factor typically consumes more power and is faster than PCIe. For the above workloads, customers can expect a double-digit percentage improvement in time to convergence, ranging from approximately 12 percent to 40 percent depending on the workload.
Multinode Results Comparison
Multinode performance assessment is more important than ever. With the advent of large models and different parallelism techniques, customers have an ever-increasing need to find results faster. Therefore, we have submitted several multinode results to assess scaling performance.
Figure 7. BERT multinode results with PowerEdge R750xa and XE8545 servers
Figure 7 shows multinode results from three different systems with the following configurations:
- R750xa with 4 A100-PCIe-80GB GPUs
- XE8545 with 4 A100-SXM-40GB GPUs
- XE8545 with 4 A100-SXM-80GB GPUs
Every node in the above systems has four GPUs. When the graph shows eight GPUs, the performance results are derived from two nodes. Similarly, for 16 GPUs the results are derived from four nodes, and so on.
Figure 8. Resnet50 multinode results with R750xa and XE8545 servers
Figure 9. Mask R-CNN multinode results with R750xa and XE8545 servers
As shown in Figures 7, 8, and 9, the multinode results for BERT, ResNet50, and Mask R-CNN scale linearly or nearly linearly. This shows that Dell servers offer outstanding performance in both single-node and multinode configurations.
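One way to quantify how close a multinode result is to linear scaling is a scaling-efficiency ratio. The Python sketch below is illustrative: the four-GPUs-per-node constant comes from the configurations listed above, while the function names and the sample times are hypothetical and not taken from the submission.

```python
GPUS_PER_NODE = 4  # every node in the listed systems has four GPUs


def node_count(total_gpus):
    """Number of nodes behind a data point with the given GPU count."""
    if total_gpus % GPUS_PER_NODE != 0:
        raise ValueError("GPU count must be a multiple of GPUS_PER_NODE")
    return total_gpus // GPUS_PER_NODE


def scaling_efficiency(base_gpus, base_minutes, scaled_gpus, scaled_minutes):
    """Measured speedup divided by ideal (linear) speedup; 1.0 is linear."""
    measured_speedup = base_minutes / scaled_minutes
    ideal_speedup = scaled_gpus / base_gpus
    return measured_speedup / ideal_speedup


# Hypothetical example: 40 minutes on one node, 21 minutes on two nodes.
print(node_count(8))
print(round(scaling_efficiency(4, 40.0, 8, 21.0), 3))
```

An efficiency near 1.0 corresponds to the "linear or nearly linear" behavior described for the figures; values well below 1.0 would suggest communication or convergence overheads.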
The findings described in this blog show that:
- Dell servers can run all types of workloads in the MLPerf Training submission.
- Software-only enhancements reduce time to solution for our customers, as shown in our MLPerf Training v2.1 submission, and customers can expect to see improvements in their environments.
- Dell PowerEdge XE8545 and PowerEdge R750xa servers with NVIDIA A100 GPUs in PCIe and SXM4 form factors are both great selections for all deep learning models.
- PCIe-based PowerEdge R750xa servers can deliver reinforcement learning workloads in addition to other classes of workloads, such as image classification, lightweight and heavy object detection, speech recognition, natural language processing, and medical image segmentation.
- The single-node results of our submission indicate that Dell servers deliver outstanding performance, and the multinode results scale well and help to reduce time to solution across a distinct set of workload types. This makes Dell servers apt for both small training workloads on single nodes and large deep learning training workloads across multiple nodes.
System Under Test
MLPerf system configurations for PowerEdge XE8545 systems
GPU form factor
2x ConnectX-6 IB HDR 200Gb/Sec
Red Hat Enterprise Linux
AMD EPYC 7713
4, 8, 16, 32, 64, 128
Open MPI 4.1.1rc1
AMD EPYC 7763
Open MPI 4.1.1rc1
Red Hat Enterprise Linux
AMD EPYC 7713
Open MPI 4.1.1rc1
MLPerf system configurations for Dell PowerEdge R750xa servers
MLPerf System ID
Intel Xeon Gold 6338
GPU form factor
1x ConnectX-5 IB EDR 100Gb/Sec
Open MPI 4.1.1rc1
Open MPI 4.1.1rc1
MLPerf system configurations Dell DSS 8440 servers
MLPerf System ID
Intel Xeon Gold 6248R
GPU form factor
1x ConnectX-5 IB EDR 100Gb/Sec
Open MPI 4.1.1rc1
Related Blog Posts
Performance of the Dell PowerEdge R750xa Server for MLPerf™ Inference v2.0
Thu, 21 Apr 2022 18:20:33 -0000
Dell Technologies recently submitted results to the MLPerf Inference v2.0 benchmark suite. The results provide information about the performance of Dell servers. This blog takes a closer look at the Dell PowerEdge R750xa server and its performance for MLPerf Inference v1.1 and v2.0.
We compare the v1.1 results with the v2.0 results. We show the performance difference between the software stack versions. We also use the PowerEdge R750xa server to demonstrate that the v1.1 results from all systems can be referenced for planning an ML workload on systems that are not available for MLPerf Inference v2.0.
PowerEdge R750xa server
Built with state-of-the-art components, the PowerEdge R750xa server is ideal for artificial intelligence (AI), machine learning (ML), and deep learning (DL) workloads. The PowerEdge R750xa server is the GPU-optimized version of the PowerEdge R750 server. It supports up to four 300 W double-width (DW) or six 75 W single-width (SW) accelerators. The GPUs are placed in the front of the PowerEdge R750xa server, allowing for better airflow management. It has up to eight available PCIe Gen4 slots and supports up to eight NVMe SSDs.
The following figures show the PowerEdge R750xa server (source):
Figure 1: Front view of the PowerEdge R750xa server
Figure 2: Rear view of the PowerEdge R750xa server
Figure 3: Top view of the PowerEdge R750xa server
The following table describes the software stack configurations from the two rounds of submission for the closed data center division:
Table 1: MLPerf Inference v1.1 and v2.0 software stacks
v1.1 software stack
v2.0 software stack
Although the software has been updated across the two rounds of submission, performance is consistent, if not better, for the v2.0 submission. For MLPerf Inference v2.0, Triton performance results can be extrapolated from MLPerf Inference v1.1, except for the 3D U-Net benchmark because its dataset changed in v2.0.
The following table describes the System Under Test (SUT) configurations from MLPerf Inference v1.1 and v2.0 of data center inference submissions:
Table 2: MLPerf Inference v1.1 and v2.0 system configuration of the PowerEdge R750xa server
v1.1 system configuration
v2.0 system configuration
R750xa 4x A100-PCIE-80GB, TensorRT
R750xa 4xA100 TensorRT
MLPerf system ID
Intel Xeon Gold 6338 CPU @ 2.00 GHz
GPU form factor
In the v1.1 round of submission, Dell Technologies submitted four different configurations on the PowerEdge R750xa server. Although the GPU count of four was maintained, Dell Technologies submitted the 40 GB and the 80 GB versions of the NVIDIA A100 GPU. Additionally, Dell Technologies submitted Multi-Instance GPU (MIG) numbers using 28 instances, each with one compute instance and the 10 GB memory profile, on the 80 GB A100 GPU. Furthermore, Dell Technologies submitted power numbers (MaxQ is a performance and power submission) for the 40 GB version of the A100 GPU and submitted with the Triton server on the 80 GB version of the A100 GPU. A discussion about the v1.1 submission by Dell Technologies can be found in this blog.
Performance comparison of the PowerEdge R750xa server for MLPerf Inference v2.0 and v1.1
ResNet50 is a 50-layer deep convolutional neural network that is made up of 48 convolution layers along with a single max pool layer and a single average pool layer. This model is used for computer vision applications including image classification, object detection, and object classification. For the ResNet 50 benchmark, the performance numbers from the v2.0 submission match the v1.1 round in the server scenario and outperform it in the offline scenario. As shown in the following figure, the v2.0 submission results are within 0.02 percent in the server scenario and outperform the previous round by 1 percent in the offline scenario:
Figure 4: MLPerf Inference v2.0 compared to v1.1 ResNet 50 per card results on the PowerEdge R750xa server
Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art language representation model for Natural Language Processing applications. This benchmark performs the SQuAD question answering task. The BERT benchmark consists of default and high accuracy modes for the offline and server scenarios. In the v2.0 round of submission, the PowerEdge R750xa server matched and slightly outperformed its performance from the previous round. In the default BERT server and offline scenarios, the extracted performance is within 0.06 and 2.33 percent respectively. In the high accuracy BERT server and offline scenarios, the extracted performance is within 0.14 and 1.25 percent respectively.
Figure 5: MLPerf Inference v2.0 compared to v1.1 BERT per card results on the PowerEdge R750xa server
The SSD-ResNet 34 model falls under the computer vision category. This benchmark performs object detection. For the SSD-ResNet 34 benchmark, the results produced in the v2.0 round of submission are within 0.14 percent for the server scenario and show a 1 percent improvement in the offline scenario.
Figure 6: MLPerf Inference v2.0 compared to v1.1 SSD-ResNet 34 per card results on the PowerEdge R750xa server
Deep Learning Recommendation Model (DLRM) is an effective benchmark for understanding workload requirements for building recommender systems. This model uses collaborative filtering and predictive analytics-based approaches to process large amounts of data. The DLRM benchmark consists of default and high accuracy modes, both containing the server and offline scenarios. For the server scenario in both the default and high accuracy modes, the v2.0 submission results are within 0.003 percent. For the offline scenario across both modes, the PowerEdge R750xa server showed a 2.62 percent performance gain.
Figure 7: MLPerf Inference v2.0 compared to v1.1 DLRM per card results on the PowerEdge R750xa server
The Recurrent Neural Network Transducer (RNNT) model falls under the speech recognition category. This benchmark accepts raw audio samples and produces the corresponding character transcription. For the RNNT benchmark, the PowerEdge R750xa server maintained similar performance, within 0.04 percent in the server mode, and showed a 1.46 percent performance gain in the offline mode.
Figure 8: MLPerf Inference v2.0 compared to v1.1 RNNT per card results on the PowerEdge R750xa server
The 3D U-Net performance numbers have changed in scale and are not comparable in a bar graph because of a change to the dataset. The new dataset for this model is the KiTS 2019 Kidney Tumor Segmentation set. Nevertheless, the PowerEdge R750xa server yielded the top results among the PCIe form factor systems that were submitted. This model falls under the computer vision category, but it specifically deals with medical image data.
Figure 4 through Figure 8 show the consistent performance of the PowerEdge R750xa server across both rounds of submission.
The following figure shows that in the offline scenarios for the benchmarks there is a small but noticeable performance improvement:
Figure 9: Performance improvement in percentage of the PowerEdge R750xa server across MLPerf Inference v2.0 and v1.1
The small percentage deltas in the server scenarios can be a result of noise, and the results are consistent with the previous round of submission.
This blog confirms the consistent performance of the Dell PowerEdge R750xa server across the MLPerf Inference v1.1 and MLPerf Inference v2.0 submissions. Because an identical system from round v1.1 performed at a consistent level for MLPerf Inference v2.0, we see that the software stack upgrades had minimal impact on performance. Therefore, the optimal results from the v1.1 round of submission can be used for making informed decisions about server performance for a specific ML workload. Because Dell Technologies submitted a diverse set of configurations in the v1.1 round of submission, customers can take advantage of many results.
The First MLPerf Inference v2.1 Performance Result on AMD EPYC™ CPU-Based PowerEdge Servers
Thu, 08 Sep 2022 17:00:38 -0000
Dell Technologies, AMD, and Deci AI recently submitted results to MLPerf Inference v2.1 in the open division. This blog showcases our first successful three-way submission and describes how the software and hardware of each party were best used to achieve optimal performance for the MLPerf BERT-Large model.
MLCommons™ is a consortium of companies whose mission is to accelerate machine learning innovation to benefit everyone. The organization focuses on benchmarking to enable the display of fair performance measurements and makes datasets open and available, since models in the benchmarks are only as good as the data. It also shares best practices to initiate standardization of sharing and communication among machine learning stakeholders.
The MLPerf Inference v2.1 submission falls under the benchmarking road map for MLCommons. Submissions made to the closed division warrant an equitable comparison of hardware platforms and software frameworks. The submissions must use the same model and optimizer as the reference implementation. Additionally, no retraining is permitted in the closed division. On the other hand, the open division promotes faster models and optimizers as it allows benchmark implementations that use a different model for the same task. Any machine learning approach is permitted if it meets the target quality. Results submitted to the open division showcase exciting technologies that are being developed.
This blog highlights an offline submission made in the open division BERT 99.9 category for the natural language processing (NLP) task. The goal of the submission was to maximize throughput while keeping the accuracy within a 0.1 percent margin of error from the baseline accuracy, which is 90.874 F1 (Stanford Question Answering Dataset (SQuAD)).
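The accuracy constraint can be made concrete with a short calculation. The Python sketch below assumes that the "99.9" category means a result must reach at least 99.9 percent of the 90.874 baseline F1; the function names are illustrative. The two sample scores are the INT8 F1 values quoted later in this blog for DeciBERT-Large (91.08) and the reference BERT-Large (90.07).

```python
BASELINE_F1 = 90.874      # baseline F1 on SQuAD, as stated in the text
TARGET_FRACTION = 0.999   # assumed meaning of the "99.9" category


def minimum_f1():
    """Lowest F1 score that still satisfies the 99.9 percent target."""
    return BASELINE_F1 * TARGET_FRACTION


def meets_target(f1_score):
    """True if the score is within the allowed margin of the baseline."""
    return f1_score >= minimum_f1()


print(round(minimum_f1(), 3))
print(meets_target(91.08))   # DeciBERT-Large INT8 F1 quoted in this blog
print(meets_target(90.07))   # reference BERT-Large INT8 F1 quoted in this blog
```

Under this assumption the threshold is roughly 90.78 F1, which is why a model that loses even one F1 point to quantization falls out of the category.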
Dell PowerEdge R7525 Server Powered with AMD EPYC™ processors
Since MLPerf benchmarking results are a showcase of the joint performance of both the software and underlying hardware, Deci AI’s optimized BERT-Large model, known as DeciBERT-Large, was run using ONNXRT on the Dell PowerEdge R7525 rack server populated with two 64-core AMD EPYC 7773X processors.
The PowerEdge R7525 rack server is a highly scalable and adaptable two-socket 2U rack server that delivers powerful performance and flexible configurations. It is ideal for traditional and emerging workloads and applications that include flash software-defined storage (SDS), virtual desktop infrastructure (VDI), and data analytics (DA) workloads. As this blog’s MLPerf submission shows, the PowerEdge R7525 rack server is also well suited for AI workloads such as deep learning inference.
The PowerEdge R7525 server is an excellent server choice for several reasons. Some of the high-level specifications to meet performance demands include up to 24 directly connected NVMe drives that support all flash AF8 vSAN Ready Nodes. The 4 TB of memory and two AMD EPYC processors enable optimal performance. Also, the PowerEdge R7525 server has maximized IOPS, storage, and memory configurations enabled by up to eight PCIe Gen4 slots. Furthermore, AMD Instinct™ MI100 and MI200 series accelerators and other double-width GPUs can provide additional levels of acceleration.
AMD EPYC processors with AMD 3D V-Cache™ Technology were launched in March 2022. This innovative new lineup of server-class AMD EPYC processors was positioned for accelerating technical computing workloads, including computational fluid dynamics (CFD), electronic design automation (EDA), and finite element analysis (FEA).
With this joint MLPerf submission, a first for AMD EPYC processors with AMD 3D V-Cache Technology, AMD demonstrates the applicability of the new AMD EPYC processors and their extra L3 cache for deep learning inference workloads.
Deci AI DeciBERT-Large Model Comparisons and Metrics
Deci AI used their proprietary AutoNAC™ (Automated Neural Architecture Construction) Engine to generate an optimized BERT-Large model, called DeciBERT-Large, tuned specifically for the underlying PowerEdge R7525 server and two 64-core AMD EPYC 7773X processors. The Deci AI algorithm reduces the reference BERT-Large model size by nearly three times, from 340 million parameters in the standard BERT-Large model down to 115 million parameters, while achieving compelling performance and accuracy.
From a memory capacity perspective, the parameter count reduction also contributes to similarly significant space savings with the DeciBERT-Large model. The ONNX DeciBERT-Large model size is 378 MB in FP32 and 95 MB in INT8 compared to 1.4 GB of the reference BERT-Large model implementation from MLCommons.
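The size figures above can be checked with simple arithmetic. The Python sketch below assumes that 1.4 GB is treated as 1400 MB (decimal units); the helper name is illustrative.

```python
def size_reduction_percent(reference_mb, optimized_mb):
    """Percentage by which the optimized model is smaller than the reference."""
    return (1.0 - optimized_mb / reference_mb) * 100.0


REFERENCE_MB = 1400.0   # 1.4 GB reference BERT-Large, assuming 1 GB = 1000 MB
DECIBERT_FP32_MB = 378.0
DECIBERT_INT8_MB = 95.0

print(round(size_reduction_percent(REFERENCE_MB, DECIBERT_FP32_MB), 1))
print(round(size_reduction_percent(REFERENCE_MB, DECIBERT_INT8_MB), 1))

# Parameter-count ratio: 340 million down to 115 million parameters.
print(round(340 / 115, 2))
```

With these assumptions the reductions come out at about 73 percent for FP32 and about 93.2 percent for INT8, and the parameter ratio is just under three.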
By pairing the optimized, smaller DeciBERT-Large model with the extended L3 cache capacity of the AMD EPYC processors with 3D V-Cache, more of the model can be stored in the cache at a time. Leveraging the additional L3 cache in this way keeps data closer to the compute cores and enables lower-latency memory accesses than DRAM.
The following tables highlight the data points collected by Deci AI on the PowerEdge R7525 server with two AMD EPYC 7773X processors. The application of the Deci AI AutoNAC algorithm to generate the DeciBERT-Large model highlights a 6.33 times improvement in FP32 performance and a 6.64 times improvement in INT8 performance, while achieving an INT8 F1 score of 91.08, which is higher than the F1 score of 90.07 of the reference BERT-Large implementation in INT8.
Table 1: BERT-Large comparisons – FP32
F1 accuracy on
ONNX runtime FP32
378 MB or 0.378 GB
0.21 better F1 accuracy
73% size reduction
6.33 times throughput improvement
Table 2: BERT-Large comparisons – INT8
F1 accuracy on SQuAD (INT8)
ONNX runtime INT8
95 MB or 0.095 GB
1.01 better F1 accuracy
93.2% size reduction
6.44 times throughput improvement
The following figure shows that Deci AI’s implementations of the BERT-Large models compiled with ONNXRT are critical in enabling competitive performance in both FP32 and INT8 precisions:
Figure 1: Performance comparison between the reference BERT-Large model implementation and Deci AI’s optimized DeciBERT-Large model
FP32 is commonly used for running deep learning models because it is the standard single-precision floating-point datatype. It consists of 32 bits, of which the first bit is the sign bit, indicating whether the value is positive or negative. The next eight bits are the exponent of the number, and the last 23 bits are the fraction, or mantissa, of the number. In other words, FP32 uses nine bits for sign and range and 23 bits for precision. The number of distinct bit patterns available to this datatype is 2^32, or roughly four billion.
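The bit layout described above can be inspected directly. The following Python sketch uses only the standard `struct` module to split a value into its sign, exponent, and mantissa fields; the function name is illustrative.

```python
import struct


def fp32_fields(value):
    """Return (sign, exponent, mantissa) of a value stored as FP32."""
    # Pack as a big-endian 32-bit float, then reinterpret as an unsigned int.
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, mantissa


print(fp32_fields(1.0))    # (0, 127, 0)
print(fp32_fields(-2.5))   # (1, 128, 2097152)
print(2 ** 32)             # number of distinct 32-bit patterns
```

For -2.5, the sign bit is set, the stored exponent 128 encodes a power of 2^1 after removing the bias, and the mantissa value 2^21 encodes the fractional part 0.25.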
INT8 has become a popular datatype for deep learning inference. Because INT8 has fewer bits and can represent far fewer values (256 compared to the roughly four billion bit patterns of FP32), INT8 compute requirements are considerably reduced compared to FP32. Typically, latencies are lower and throughputs are higher when using INT8 models compared to FP32 models. However, the increased throughput and lower latency tend to come at the cost of accuracy degradation.
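To illustrate where the accuracy loss comes from, the Python sketch below applies a generic symmetric, per-tensor INT8 quantization to a handful of made-up weights. This is a textbook scheme chosen for illustration and is not the quantization method used by Deci AI or ONNX Runtime in the submission; all names and values are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 representation."""
    return [q * scale for q in quantized]


weights = [0.02, -1.27, 0.63, 0.5, 0.004]   # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)
# The smallest weight (0.004) collapses to 0: information is lost.
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

Each restored value can differ from the original by up to half a quantization step, and small weights can vanish entirely, which is the kind of error that accumulates into lower model accuracy.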
Most MLPerf BERT-Large submissions in the 99.9 percent accuracy category use 32-bit or 16-bit quantization because 8-bit quantization is lossy and typically reduces model accuracy below the 99.9 percent threshold. For example, while applying INT8 quantization to the baseline BERT-Large model is an option that accelerates throughput from 12 FPS to 18 FPS, it no longer meets the MLPerf 99.9 percent accuracy constraints.
Deci AI AutoNAC Engine and Optimization
The Deci AI AutoNAC engine guarantees that the model designed meets the accuracy requirements set by MLPerf and pursues the most performant variation of the specific model within those constraints, allowing INT8 quantization to be leveraged for the submission.
The Deci AI AutoNAC engine begins by generating a dynamic search space that accounts for parameters such as the baseline accuracy, inference performance targets, underlying hardware, compilers, and quantization, among others. A fast and accurate multiconstraints search algorithm is initiated and creates a new model architecture that delivers the highest performance given the defined constraints.
From a computation time perspective, the AutoNAC search process is approximately three times longer than standard training, depending on the task. For example, training the DeciBERT model to perform the SQuAD NLP task requires approximately 60 GPU hours. The search for this DeciBERT model required approximately 180 GPU hours, and the computation involved was parallelized. Therefore, the computation of AutoNAC is commercially affordable for almost any organization.
In summary, Deci AI generated a model using AutoNAC that was specifically designed to deliver optimal performance within the MLPerf constraints when running on a Dell server with AMD EPYC processors with 3D V-Cache.
The following figure shows the AutoNAC optimization process:
Figure 2: Deci AI’s AutoNAC process
The Deci AI AutoNAC engine generates optimized deep learning inference models that meet customer accuracy and dataset requirements while maximizing performance. The increased performance, combined with the significant reduction in parameter count and memory size, positions Deci AI optimized models as highly efficient for a range of applications. The DeciBERT-Large model is an optimized version of the state-of-the-art BERT-Large model for NLP applications. In real-world scenarios, call centers are examples of customers that can take advantage of deep learning insights in the areas of sentiment analysis, live transcription and translation, and question answering. The DeciBERT-Large model, as developed for MLPerf v2.1 by Deci AI, can be easily tuned for a call center’s own dataset and application and deployed in production today. Doing so can improve performance, shorten time to insights, and enable the deployment of smaller optimized models with reduced compute requirements, which is particularly beneficial in power-constrained or cost-constrained environments.