Comparison of MLPerf™ Inference v1.1 Results of Dell EMC PowerEdge R7525 Servers with NVIDIA GPUs
Mon, 04 Apr 2022 11:08:17 -0000
Abstract
This blog showcases the MLPerf Inference v1.1 performance results of Dell EMC PowerEdge R7525 servers configured with NVIDIA A100 40 GB GPUs or with NVIDIA A30 GPUs. We compare the cost of a system with both types of GPUs to help you choose the best configuration for your AI inference workloads.
Introduction
MLPerf Inference v1.1 falls under the benchmarks and metrics category from MLCommons™ and serves as the industry standard for machine learning (ML) inference performance. The MLPerf benchmarking suite measures the performance of ML workloads consistently and fairly. The MLPerf Inference benchmark measures how fast a system can perform ML inference by using a pretrained model in various deployment scenarios. For a comprehensive understanding of MLPerf Inference, see this blog.
Test bed details
The systems under test (SUT) include:
- PowerEdge R7525 server that is configured with three NVIDIA A100 PCIe 40 GB (250 W, 40 GB passive, double wide, full height GPU) GPUs. All references to the PowerEdge R7525 server with A100 GPUs assume that the configuration includes three NVIDIA A100 GPUs.
- PowerEdge R7525 server that is configured with three NVIDIA A30 (165 W, 24 GB passive, double wide, full height GPU with cable) GPUs. All references to the PowerEdge R7525 server with A30 GPUs assume that the configuration includes three NVIDIA A30 GPUs.
The following figure shows the PowerEdge R7525 server:
Both systems run TensorRT, which is a library designed and developed for improved performance in inference on NVIDIA GPUs. For more information about TensorRT, see the NVIDIA documentation.
SUT configuration
The following table shows the MLPerf system configurations for the SUTs:
Table 1: SUT configuration
Platform | PowerEdge R7525 with 3 A100 PCIe 40 GB GPUs | PowerEdge R7525 with 3 A30 GPUs |
MLPerf System ID | R7525_A100-PCIE-40GBx3_TRT | R7525_A30x3_TRT |
Operating system | CentOS 8.2.2004 | |
Memory | 512 GB | 1 TB |
GPU | NVIDIA A100-PCIE-40GB | NVIDIA A30 |
GPU count | 3 | |
Software stack | TensorRT 8.0.2, CUDA 11.3, cuDNN 8.2.1, GPU Driver 470.42.01, DALI 0.31.0 | |
MLPerf Inference v1.1 results per model
ResNet50
ResNet50 is a 50-layer deep convolutional neural network that is used for many computer vision applications. This neural network addresses vanishing gradients through skip connections, which allow gradients to flow across layers in the network. For an introduction to ResNet, see Deep Residual Learning for Image Recognition.
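The skip connection idea can be sketched in a few lines of plain Python. This is a toy illustration only: `toy_layer` and `residual_block` are hypothetical names, and the "layer" is a simple scaling rather than the convolutions, batch normalization, and activations that a real ResNet uses.

```python
# Minimal sketch of a skip (residual) connection, in plain Python.
# toy_layer and residual_block are hypothetical names for illustration;
# a real ResNet block uses convolutions, batch normalization, and ReLU.

def toy_layer(x, weight):
    """Stand-in for the learned transformation F(x): scale every element."""
    return [weight * value for value in x]

def residual_block(x, weight):
    """Return F(x) + x: the input skips the layer and is added to its output."""
    transformed = toy_layer(x, weight)
    return [f + value for f, value in zip(transformed, x)]

signal = [1.0, -2.0, 0.5]
# Even when the layer contributes nothing (weight 0), the input passes through.
print(residual_block(signal, 0.0))  # [1.0, -2.0, 0.5]
print(residual_block(signal, 0.5))  # [1.5, -3.0, 0.75]
```

Because the block computes F(x) + x, its derivative with respect to x is F'(x) + 1. The added identity term gives gradients a direct path through the block, which is why skip connections help with vanishing gradients.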
We conducted four tests on this model across the two SUTs: two in the Offline scenario and two in the Server scenario. The following figure shows our ResNet50 results. The performance of the PowerEdge R7525 server with A100 GPUs across both scenarios is approximately 50 percent higher than the PowerEdge R7525 server with A30 GPUs.
Figure 1: ResNet50 results on a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs
BERT
Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art language representation model. In essence, BERT is a stack of Transformer encoders. The Transformer architecture is fast because it can process words simultaneously, and the context of words can be learned from both directions simultaneously. BERT can be used for neural machine translation, question answering, sentiment analysis, and text summarization, all of which require language understanding. BERT is trained in two phases: pretraining, in which the model learns language and context, and fine-tuning, in which BERT learns specific tasks such as question answering. For an in-depth understanding, see BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding from Google AI Language.
For this model, we conducted eight tests across our systems in which we considered the default and high accuracy modes in both the Server and Offline scenarios. In the default mode, the PowerEdge R7525 server with A100 GPUs performed 69 percent better than the PowerEdge R7525 server with A30 GPUs in the Offline scenario and 99 percent better in the Server scenario. The high accuracy mode provided similar results in which the PowerEdge R7525 server with A100 GPUs performed 72 percent better than the PowerEdge R7525 server with A30 GPUs in the Offline scenario and 96 percent better in the Server scenario. In the following figure, bert-99 refers to the default accuracy target, whereas bert-99.9 refers to the high accuracy target.
Figure 2: BERT results on a PowerEdge R7525 with A100 GPUs and a PowerEdge R7525 with A30 GPUs
SSD-ResNet34
ResNet34 is an encoder on top of Single Shot Multibox Detector (SSD) that is used to improve performance and reduce training time. As the full name suggests, SSD is a single-stage object detection model that is known for speed. For an in-depth understanding, see Small Object Detection using Context and Attention.
For this model, we conducted four tests across both of our systems. In the Offline scenario, the PowerEdge R7525 server with A100 GPUs outperformed the PowerEdge R7525 server with A30 GPUs by 74 percent. Similarly, in the Server scenario, the PowerEdge R7525 server with A100 GPUs performed 78 percent better than the PowerEdge R7525 server with A30 GPUs.
Figure 3: SSD-ResNet34 results on a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs
DLRM
DLRM, an open-source Deep Learning Recommendation Model, is available on Facebook’s PyTorch platform. The model is composed of compute-dominated multilayer perceptrons (MLPs) and relies on data parallelism to improve performance. When predicting click percentage for certain items, for example, it is aligned with randomized Las Vegas algorithms in which resources (time and memory) are used freely but the results are always correct. DLRM uses collaborative filtering and predictive analysis-based approaches to process large amounts of data. For more information about DLRM, see Deep Learning Recommendation Model for Personalization and Recommendation Systems.
For this model, we conducted eight tests across both of our systems. For the PowerEdge R7525 server with A100 GPUs, we notice a tight range with a lower and upper bound of 764,569 and 768,806 result samples per second, respectively. Also, the results produced across the default and high accuracy tests are the same for their respective systems. The initial numbers from the PowerEdge R7525 server with A30 GPUs were slightly below expectations. After the submission deadline, our team was able to extract additional performance, particularly in the Server scenario. The numbers for the PowerEdge R7525 server with A30 GPUs shown in the following figure are not the same as the numbers published on the MLCommons website. However, these numbers are valid and pass all the required compliance tests. The PowerEdge R7525 server with A30 GPUs behaved like the PowerEdge R7525 server with A100 GPUs in that the Server scenario results are slightly lower than the Offline results. The tuned numbers provided the best per card performance among all A30 GPU submissions.
Figure 4: DLRM results on a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs
RNNT
Recurrent Neural Network Transducer (RNNT) is built on recurrent neural networks, a type of neural network in which outputs are recycled as inputs for the current step. By using one-hot encoding and memory, the model can remember information through time that might be useful in time series prediction. This model uses a squashing function to learn to predict the next potential word or step to take. The result of the squashing function is always between –1 and 1, which allows neural networks to remain nonlinear and thus effective as the same values are passed through the neural network.
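As a rough illustration of a squashing function, the following sketch uses the hyperbolic tangent, which is one common choice (an assumption here; the text does not name the function). The `recurrent_step` helper and its weights are hypothetical and are not taken from the RNNT benchmark model.

```python
import math

def recurrent_step(previous_state, new_input, w_state=0.9, w_input=1.5):
    """One toy recurrent update: mix the old state with the new input, then squash.

    The weights are arbitrary illustration values, not trained parameters.
    """
    return math.tanh(w_state * previous_state + w_input * new_input)

state = 0.0
history = []
for sample in [0.2, 5.0, -8.0, 3.0]:
    # The output of one step is fed back in as the state for the next step.
    state = recurrent_step(state, sample)
    history.append(state)

# Every state stays within [-1, 1] no matter how large the inputs are.
print(all(-1.0 <= s <= 1.0 for s in history))  # True
```

The bounded output is what keeps the recycled state from growing without limit as it is fed back through the network step after step.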
For this model, we conducted four tests across both of our systems. In the Offline scenario, the PowerEdge R7525 server with A100 GPUs outperformed the PowerEdge R7525 server with A30 GPUs by 80 percent. In the Server scenario, the PowerEdge R7525 server with A100 GPUs excelled by performing 199 percent better than the PowerEdge R7525 server with A30 GPUs.
Figure 5: RNNT results on a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs
3D U-Net
3D U-Net is an elegant improvement to the sliding window approach of convolutional neural networks (CNNs) in which fewer training images can be used and more precise segmentations can be yielded. In brief, an input image goes through a contraction and expansion path (in a U-shaped architecture with skip connections) and becomes a segmentation map output. This segmentation map provides class labels for what is inside the image. For a deeper understanding of 3D U-Net's architecture, see U-Net: Convolutional Networks for Biomedical Image Segmentation.
We conducted Offline scenario tests on both systems for the default and high accuracy modes. On each system, the default and high accuracy modes yielded the same results. The PowerEdge R7525 server with A100 GPUs performed 75 percent better than the PowerEdge R7525 server with A30 GPUs.
Figure 6: 3D U-Net results on a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs
Cost Considerations
When placing an order for the PowerEdge R7525 Rack Server on the Dell Technologies website, customers are guided through the purchasing process with suggestions and requirements for their specific rack server. The PowerEdge R7525 server with three NVIDIA Ampere A100 GPUs is 1.423 times more expensive than the PowerEdge R7525 server with three NVIDIA Ampere A30 GPUs. The price difference between the two configurations is due to the more powerful GPU itself. Also, the PowerEdge R7525 server with A100 GPUs requires higher performance fans and a more powerful thermal configuration. Despite the additional options required for the PowerEdge R7525 server with A100 GPUs, comparing throughput (queries per second (QPS) in the Server mode and samples per second in the Offline mode) per dollar provides valuable insight into the performance that each dollar spent can achieve.
The following figure shows the relative performance of the two systems per dollar. If we divide the performance achieved on a system for a particular benchmark by the total cost of the system, we determine the achievable throughput per dollar spent on the system. A higher throughput per dollar indicates that greater performance can be extracted from the system per dollar spent.
Figure 7: Relative QPS per cost of a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs
In the figure, the orange line shows the normalized data of the throughput per cost of the PowerEdge R7525 server with A30 GPUs. The blue bars indicate the relative achievable performance of the PowerEdge R7525 server with A100 GPUs. For most of the benchmarks, we see an acceptable range of performance on both systems. However, the PowerEdge R7525 server with A100 GPUs unconditionally outperformed the PowerEdge R7525 server with A30 GPUs in the DLRM Server default and high accuracy modes as well as in the RNNT Server mode. Both systems perform well per dollar spent.
Note: We compiled the cost data in this section from the PowerEdge R7525 Rack Server page on the Dell Technologies website on September 7, 2021. The data might be subject to change.
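The throughput-per-dollar comparison in this section can be approximated from numbers that are quoted in this blog. The sketch below is an illustration under stated assumptions, not the calculation used to produce the figure: it treats "1.423 times more expensive" as a cost ratio of 1.423, uses the rounded percentages from the per-model sections, and the function name `relative_perf_per_dollar` is hypothetical.

```python
# Relative throughput per dollar, normalized to the A30 system (A30 = 1.0).
# Assumption: "1.423 times more expensive" means cost(A100) / cost(A30) = 1.423.
COST_RATIO_A100_TO_A30 = 1.423

def relative_perf_per_dollar(percent_better, cost_ratio=COST_RATIO_A100_TO_A30):
    """Convert 'A100 is N percent better' into A100 throughput per dollar vs. A30."""
    throughput_ratio = 1.0 + percent_better / 100.0
    return throughput_ratio / cost_ratio

# Rounded percent advantages quoted earlier in this blog for the A100 system.
quoted = {
    "BERT default Offline": 69,
    "BERT default Server": 99,
    "SSD-ResNet34 Offline": 74,
    "RNNT Server": 199,
}
for name, pct in quoted.items():
    print(f"{name}: {relative_perf_per_dollar(pct):.2f}")
```

A value above 1.0 means that the A100 system delivers more throughput per dollar than the A30 system for that benchmark. Under these assumptions, the break-even point is a 42.3 percent throughput advantage. Because the inputs are rounded, the outputs might not match the figure exactly.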
Conclusion
This blog provides a detailed comparison of performance between the Dell EMC PowerEdge R7525 server configured with three A100 GPUs and the Dell EMC PowerEdge R7525 server configured with three A30 GPUs. If your ML workload focuses on inferencing, the PowerEdge R7525 server configured with A100 GPUs might suit your needs well. However, if you are looking for a system that not only performs well, but also is more cost-effective, the PowerEdge R7525 server configured with A30 GPUs will suit those needs. Both systems performed well and are a great investment based on your ML workload requirements.
Next Steps
In future blogs, we plan to describe:
- How to run MLPerf Inference v1.1
- The PowerEdge R750xa server as a platform for inference v1.1
- The DSS8440 server as a platform for inference v1.1
- The PowerEdge R7525 server as a platform for inference v1.1
- The PowerEdge XE8545 server as a platform for inference v1.1
- Comparison of inference v1.0 performance with inference v1.1 performance
Related Blog Posts
Unveiling the Power of the PowerEdge XE9680 Server on the GPT-J Model from MLPerf™ Inference
Tue, 16 Jan 2024 18:30:32 -0000
Abstract
For the first time, the latest release of the MLPerf™ inference v3.1 benchmark includes the GPT-J model to represent large language model (LLM) performance on different systems. As a key player in the MLPerf consortium since version 0.7, Dell Technologies is back with exciting updates about the recent submission for the GPT-J model in MLPerf Inference v3.1. In this blog, we break down what these new numbers mean and present the improvements that Dell Technologies achieved with the Dell PowerEdge XE9680 server.
MLPerf inference v3.1
MLPerf inference is a standardized test for machine learning (ML) systems, allowing users to compare performance across different types of computer hardware. The test helps determine how well models, such as GPT-J, perform on various machines. Previous blogs provide a detailed MLPerf inference introduction. For in-depth details, see Introduction to MLPerf inference v1.0 Performance with Dell Servers. For step-by-step instructions for running the benchmark, see Running the MLPerf inference v1.0 Benchmark on Dell Systems. Inference version v3.1 is the seventh inference submission in which Dell Technologies has participated. The submission shows the latest system performance for different deep learning (DL) tasks and models.
Dell PowerEdge XE9680 server
The PowerEdge XE9680 server is Dell’s latest two-socket, 6U air-cooled rack server that is designed for training and inference for the most demanding ML and DL large models.
Figure 1. Dell PowerEdge XE9680 server
Key system features include:
- Two 4th Gen Intel Xeon Scalable Processors
- Up to 32 DDR5 DIMM slots
- Eight NVIDIA HGX H100 SXM 80 GB GPUs
- Up to 10 PCIe Gen5 slots to support the latest Gen5 PCIe devices and networking, enabling flexible networking design
- Up to eight U.2 SAS4/SATA SSDs (with fPERC12)/ NVMe drives (PSB direct) or up to 16 E3.S NVMe drives (PSB direct)
- A design to train and inference the most demanding ML and DL large models and run compute-intensive HPC workloads
The following figure shows a single NVIDIA H100 SXM GPU:
Figure 2. NVIDIA H100 SXM GPU
GPT-J model for inference
Language models take tokens as input and predict the probability of the next token or tokens. This method is widely used for essay generation, code development, language translation, summarization, and even understanding genetic sequences. The GPT-J model in MLPerf inference v3.1 has 6 billion parameters and performs text summarization tasks on the CNN-DailyMail dataset. The model has 28 transformer layers and a sequence length of 2048 tokens.
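To make "predict the probability of the next token" concrete, here is a minimal sketch that converts raw scores into probabilities with a softmax and picks the most likely token. The vocabulary and scores are made up for illustration and are not GPT-J output; a real model scores tens of thousands of tokens at every step.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocabulary = ["server", "GPU", "model"]  # toy vocabulary, illustration only
logits = [2.0, 1.0, 0.1]                 # made-up scores, not GPT-J output

probabilities = softmax(logits)
# Greedy choice: take the token with the highest probability.
next_token = vocabulary[probabilities.index(max(probabilities))]
print(next_token)  # server
```

Generation repeats this step: the chosen token is appended to the input, and the model scores the next position.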
Performance updates
The official MLPerf inference v3.1 results for all Dell systems are published on https://mlcommons.org/benchmarks/inference-datacenter/. The PowerEdge XE9680 system ID is 3.1-0069.
After submitting the GPT-J model, we applied the latest firmware updates to the PowerEdge XE9680 server. The following figure shows that performance improved as a result:
Figure 3. Improvement of the PowerEdge XE9680 server on GPT-J Datacenter 99 and 99.9, Server and Offline scenarios [1]
In both the 99 and 99.9 Server scenarios, the performance increased from 81.3 to 84.6. This 4.1 percent improvement showcases the server's capability under randomly arriving queries within the MLPerf-defined latency constraint. In the Offline scenarios, the performance increased by 5.3 percent, from 101.8 to 107.2. These results mean that the server is even more efficient and capable of handling batch-based LLM workloads.
Note: For PowerEdge XE9680 server configuration details, see https://github.com/mlcommons/inference_results_v3.1/blob/main/closed/Dell/systems/XE9680_H100_SXM_80GBx8_TRT.json
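The percentages quoted in this section follow from the before and after values. The short check below uses only the numbers in the text; `percent_gain` is a hypothetical helper name.

```python
def percent_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0

# Server scenarios (99 and 99.9): 81.3 before the firmware update, 84.6 after.
print(round(percent_gain(81.3, 84.6), 1))    # 4.1
# Offline scenarios: 101.8 before, 107.2 after.
print(round(percent_gain(101.8, 107.2), 1))  # 5.3
```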
Conclusion
This blog focuses on the updates of the GPT-J model in the v3.1 submission, continuing the journey of Dell’s experience with MLPerf inference. We highlighted the improvements made to the PowerEdge XE9680 server, showing Dell's commitment to pushing the limits of ML benchmarks. As technology evolves, Dell Technologies remains a leader, constantly innovating and delivering standout results.
[1] Unverified MLPerf® v3.1 Inference Closed GPT-J. Result not verified by MLCommons Association.
The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
Comparison of Top Accelerators from Dell Technologies’ MLPerf™ Inference v3.0 Submission
Fri, 21 Apr 2023 21:43:39 -0000
Abstract
Dell Technologies recently submitted results to MLPerf™ Inference v3.0 in the closed division. This blog highlights the NVIDIA H100 PCIe GPU and compares the results to the NVIDIA A100 PCIe GPU with the PCIe form factor held constant.
Introduction
The MLPerf Inference v3.0 submission falls under the benchmarking pillar of the MLCommons™ consortium, with the objective of making fair comparisons across server configurations. Submissions that are made to the closed division warrant an equitable comparison of the systems.
This blog highlights the closed division submissions Dell Technologies made with the NVIDIA A100 GPU using the PCIe (Peripheral Component Interconnect Express) form factor. PCIe is an interface standard for connecting high-speed components in hardware such as a computer or a server. Servers include a certain number of PCIe slots in which to insert GPUs or other add-in cards. Note that the slots come in different physical configurations that indicate the number of lanes available for data to travel to and from the PCIe card. The NVIDIA H100 GPU, with NVIDIA AI Enterprise included, is the latest GPU in this comparison; it is a dual-slot, air-cooled PCIe generation 5.0 GPU. This GPU provides memory bandwidth of about 2 TB/s and supports up to seven Multi-Instance GPU (MIG) instances at 10 GB each. The NVIDIA A100 80 GB GPU is a dual-slot PCIe generation 4.0 GPU with memory bandwidth of just under 2 TB/s.
NVIDIA H100 PCIe GPU and NVIDIA A100 PCIe GPU comparison
In addition to making a submission with the NVIDIA A100 GPU, Dell Technologies made a submission with the NVIDIA H100 GPU. To make a fair comparison, the systems were identical and the PCIe form factor was held constant.
Platform | Dell PowerEdge R750xa (4x A100-PCIe-80GB, TensorRT) | Dell PowerEdge R750xa (4x H100-PCIe-80GB, TensorRT) |
Round | V3.0 | |
MLPerf System ID | R750xa_A100_PCIe_80GBx4_TRT | R750xa_H100_PCIe_80GBx4_TRT |
Operating system | CentOS 8.2 | |
CPU | Intel Xeon Gold 6338 CPU @ 2.00 GHz | |
Memory | 1 TB | 1 TB |
GPU | NVIDIA A100-PCIe-80GB | NVIDIA H100-PCIe-80GB |
GPU form factor | PCIe | |
GPU memory configuration | HBM2e | |
GPU count | 4 | |
Software stack | TensorRT 8.6 CUDA 12.0 cuDNN 8.8.0 Driver 525.85.12 DALI 1.17.0 | TensorRT 8.6 CUDA 12.0 cuDNN 8.8.0 Driver 525.60.13 DALI 1.17.0 |
Table 1: Software stack of submissions made on NVIDIA A100 PCIe and NVIDIA H100 PCIe GPUs for MLPerf Inference v3.0 on the Dell PowerEdge R750xa server
In the following figure, the per card numbers are normalized over the NVIDIA A100 GPU results to show a readable comparison of the GPUs on the same system. Across object detection, medical image segmentation, and speech to text and natural language processing, the latest NVIDIA H100 GPU outperforms its predecessor in all categories. Note the outstanding performance of the Dell PowerEdge R750xa server with NVIDIA H100 GPUs with the BERT benchmark in the high accuracy mode. With the advancements in generative artificial intelligence, the Dell PowerEdge R750xa server is a versatile, reliable, and high performing platform.
Figure 1: Normalized per GPU comparison of NVIDIA A100 and NVIDIA H100 GPUs on the Dell PowerEdge R750xa server
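The normalization used for this figure can be sketched as follows: divide each system result by the GPU count to get a per-card number, then divide by the A100 per-card number so that the A100 baseline equals 1.0. The helper names are hypothetical, and the throughput values are placeholders for illustration only, not MLPerf results.

```python
GPU_COUNT = 4  # both PowerEdge R750xa systems in Table 1 use four GPUs

def per_gpu(system_throughput, gpu_count=GPU_COUNT):
    """Per-card throughput for a system-level result."""
    return system_throughput / gpu_count

def normalized_to_baseline(candidate, baseline, gpu_count=GPU_COUNT):
    """Per-card throughput of `candidate` relative to `baseline` (baseline = 1.0)."""
    return per_gpu(candidate, gpu_count) / per_gpu(baseline, gpu_count)

# Placeholder throughputs for illustration only; these are NOT MLPerf results.
placeholder_a100 = 1000.0
placeholder_h100 = 1500.0

print(normalized_to_baseline(placeholder_a100, placeholder_a100))  # 1.0
print(normalized_to_baseline(placeholder_h100, placeholder_a100))  # 1.5
```

When both systems have the same GPU count, as they do here, the per-card division cancels out. It matters when comparing systems with different numbers of GPUs.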
The following figures show absolute numbers for a comparison of the NVIDIA H100 and NVIDIA A100 GPUs.
Figure 2: Per GPU comparison of NVIDIA A100 and NVIDIA H100 GPUs for RetinaNet on the PowerEdge R750xa server
Figure 3: Per GPU comparison of NVIDIA A100 and NVIDIA H100 GPUs for 3D-Unet on the PowerEdge R750xa server
Figure 4: Per GPU comparison of NVIDIA A100 and NVIDIA H100 GPUs for RNNT on the PowerEdge R750xa server
Figure 5: Per GPU comparison of NVIDIA A100 and NVIDIA H100 GPUs for BERT on the PowerEdge R750xa server
These results can be found on the MLCommons website.
Submissions made with the NVIDIA A100 PCIe GPU
In this round of submissions, Dell Technologies submitted results on the PowerEdge R750xa server configured with four NVIDIA A100 80 GB PCIe GPUs. In previous rounds, the PowerEdge R750xa server showed outstanding performance across all the benchmarks. For a deeper dive into a previous round's submission, check out our blog from MLPerf Inference v2.0. In the previous round, MLPerf Inference v2.1, Dell Technologies submitted results on an identical system. Across the two rounds of submissions, the main difference is the upgraded software stack, as described in the following table:
Platform | Dell PowerEdge R750xa (4x A100-PCIe-80GB, TensorRT) | Dell PowerEdge R750xa (4x A100-PCIe-80GB, TensorRT) |
Round | V3.0 | V2.1 |
MLPerf System ID | R750xa_A100_PCIe_80GBx4_TRT | |
Operating system | CentOS 8.2 | |
CPU | Intel Xeon Gold 6338 CPU @ 2.00 GHz | |
Memory | 512 GB | |
GPU | NVIDIA A100-PCIe-80GB | |
GPU form factor | PCIe | |
GPU memory configuration | HBM2e | |
GPU count | 4 | |
Software stack | TensorRT 8.6 CUDA 12.0 cuDNN 8.8.0 Driver 525.85.12 DALI 1.17.0 | TensorRT 8.4.2 CUDA 11.6 cuDNN 8.4.1 Driver 510.39.01 DALI 0.31.0 |
Table 2: Software stack for submissions made on the NVIDIA A100 PCIe GPU in MLPerf Inference v3.0 and v2.1
Comparison of PowerEdge R750xa NVIDIA A100 results from Inference v3.0 and v2.1
Object detection
The RetinaNet benchmark falls under the object detection category and uses the OpenImages dataset. The results from Inference v3.0 show a less than 0.05 percent difference in the Server scenario and a 21.53 percent difference in the Offline scenario. A potential reason for this result might be NVIDIA’s optimizations, as outlined in their technical blog.
Figure 6: RetinaNet Server and Offline results on the PowerEdge R750xa server from Inference v3.0 and Inference v2.1
Medical image segmentation
The 3D-Unet benchmark performs the KiTS 2019 kidney tumor segmentation task. Across the two rounds of submission, the PowerEdge R750xa server performed consistently well with a 0.3 percent difference in both the default and high accuracy modes.
Figure 7: 3D-UNet Offline results on the PowerEdge R750xa server from Inference v3.0 and v2.1
Speech to text
The Recurrent Neural Network Transducer (RNNT) model falls under the speech recognition category. This benchmark accepts raw audio samples and produces the corresponding character transcription. Across the two rounds, the results differ by 2.25 percent in the Server scenario and by 0.41 percent in the Offline scenario.
Figure 8: RNNT Server and Offline results on the Dell PowerEdge R750xa server from Inference v3.0 and v2.1
Natural language processing
Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art language representation model for natural language processing applications. This benchmark performs the SQuAD question answering task. The BERT benchmark consists of default and high accuracy modes for the Offline and Server scenarios. For the Server scenarios, the results are within a 1.69 percent range in the default mode and within a 3.12 percent range in the high accuracy mode. The Offline scenarios show similar behavior: the results are within a 0.86 percent range in the default mode and within a 3.65 percent range in the high accuracy mode.
Figure 9: BERT Server and Offline results on the PowerEdge R750xa server from Inference v3.0 and v2.1
Conclusion
Across the various rounds of submissions to the MLPerf Inference benchmark suite, the PowerEdge R750xa server has been a consistent top performer for machine learning tasks ranging from object detection and medical image segmentation to speech to text and natural language processing. The PowerEdge R750xa server continues to be an excellent server choice for machine learning inference workloads. Customers can take advantage of the diverse results submitted on the Dell PowerEdge R750xa server with the NVIDIA H100 GPU to make an informed decision for their specific solution needs.