Abstract

This blog showcases the MLPerf Inference v1.1 performance results of Dell EMC PowerEdge R7525 servers configured with NVIDIA A100 40 GB GPUs or with NVIDIA A30 GPUs. We also compare the cost of the system with each type of GPU to help you choose the best configuration for your AI inference workloads.

Introduction

MLPerf Inference v1.1 falls under the benchmarks and metrics category from MLCommons™ and serves as the industry standard for machine learning (ML) inference performance. The MLPerf benchmarking suite measures the performance of ML workloads consistently and fairly. The MLPerf Inference benchmark measures how fast a system can perform ML inference by using a pretrained model in various deployment scenarios. For a comprehensive understanding of MLPerf Inference, see this blog.

Test bed details

The systems under test (SUT) include:

PowerEdge R7525 server that is configured with three NVIDIA A100 PCIe 40 GB (250 W, 40 GB passive, double wide, full height GPU) GPUs. All references to the PowerEdge R7525 server with A100 GPUs assume that the configuration includes three NVIDIA A100 GPUs.
PowerEdge R7525 server that is configured with three NVIDIA A30 (165 W, 24 GB passive, double wide, full height GPU with cable) GPUs. All references to the PowerEdge R7525 server with A30 GPUs assume that the configuration includes three NVIDIA A30 GPUs.

The following figure shows the PowerEdge R7525 server:

Both systems run TensorRT, which is a library designed and developed for improved performance in inference on NVIDIA GPUs. For more information about TensorRT, see the NVIDIA documentation.

SUT configuration

The following table shows the MLPerf system configurations for the SUTs:

Table 1: SUT configuration

Platform	PowerEdge R7525 with 3 A100 PCIe 40 GB GPUs	PowerEdge R7525 with 3 A30 GPUs
MLPerf System ID	R7525_A100-PCIE-40GBx3_TRT	R7525_A30x3_TRT
Operating system	CentOS 8.2.2004
Memory	512 GB	1 TB
GPU	NVIDIA A100-PCIE-40GB	NVIDIA A30
GPU count	3
Software stack	TensorRT 8.0.2, CUDA 11.3, cuDNN 8.2.1, GPU Driver 470.42.01, DALI 0.31.0

MLPerf Inference v1.1 results per model

ResNet50

ResNet50 is a 50-layer deep convolutional neural network that is used for many computer vision applications. This neural network addresses vanishing gradients by using skip connections, which allow gradients to flow directly through the layers of the network. For an introduction to ResNet, see Deep Residual Learning for Image Recognition.
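The effect of a skip connection on gradients can be illustrated with a toy calculation. The sketch below is not the ResNet50 model or its training code; it assumes a chain of scalar layers in which every layer has the same hypothetical local derivative. Without a skip connection, each layer multiplies the gradient by that derivative. With y = F(x) + x, each layer multiplies it by the derivative plus 1.

```python
def gradient_through_layers(local_grad, depth, skip):
    """Multiply per-layer derivatives along a chain of `depth` scalar layers.

    Without a skip connection, each layer contributes `local_grad`.
    With y = F(x) + x, each layer contributes `local_grad + 1`.
    The uniform `local_grad` is an assumption made for illustration only.
    """
    per_layer = local_grad + 1.0 if skip else local_grad
    grad = 1.0
    for _ in range(depth):
        grad *= per_layer
    return grad


# Hypothetical values: 50 layers, each with a small local derivative of 0.1
plain = gradient_through_layers(0.1, depth=50, skip=False)
residual = gradient_through_layers(0.1, depth=50, skip=True)
print(f"plain chain: {plain:.3e}, residual chain: {residual:.3e}")
```

In this toy setting, the plain chain's gradient shrinks toward zero while the residual chain's gradient does not. Real networks use matrices, nonlinearities, and normalization, so the numbers here are only indicative.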

We conducted four tests on this model across the two SUTs: two in the Offline scenario and two in the Server scenario. The following figure shows our ResNet50 results. The performance of the PowerEdge R7525 server with A100 GPUs across both scenarios is approximately 50 percent higher than the PowerEdge R7525 server with A30 GPUs.

Figure 1: ResNet50 results on a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs

BERT

Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art language representation model. In essence, BERT is a stack of Transformer encoders. The Transformer architecture is fast because it can process words simultaneously, and the context of words can be learned from both directions simultaneously. BERT can be used for neural machine translation, question answering, sentiment analysis, and text summarization, all of which require language understanding. BERT is trained in two phases: pretraining, in which the model learns language and context, and fine-tuning, in which BERT learns specific tasks such as question answering. For an in-depth understanding, see BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding from Google AI Language.

For this model, we conducted eight tests across our systems in which we considered the default and high accuracy modes in both the Server and Offline scenarios. In the default mode, the PowerEdge R7525 server with A100 GPUs performed 69 percent better than the PowerEdge R7525 server with A30 GPUs in the Offline scenario and 99 percent better in the Server scenario. The high accuracy mode provided similar results in which the PowerEdge R7525 server with A100 GPUs performed 72 percent better than the PowerEdge R7525 server with A30 GPUs in the Offline scenario and 96 percent better in the Server scenario. In the following figure, bert-99 refers to the default accuracy target, whereas bert-99.9 refers to the high accuracy target.

Figure 2: BERT results on a PowerEdge R7525 with A100 GPUs and a PowerEdge R7525 with A30 GPUs

SSD-ResNet34

SSD-ResNet34 uses ResNet34 as the encoder on top of the Single Shot MultiBox Detector (SSD) to improve performance and reduce training time. As the full form suggests, SSD is a single-stage object detection model that is known for speed. For an in-depth understanding, see Small Object Detection using Context and Attention.

For this model, we conducted four tests across both of our systems. In the Offline scenario, the PowerEdge R7525 server with A100 GPUs outperformed the PowerEdge R7525 server with A30 GPUs by 74 percent. Similarly, in the Server scenario, the PowerEdge R7525 server with A100 GPUs performed 78 percent better than the PowerEdge R7525 server with A30 GPUs.

Figure 3: SSD-ResNet34 results on a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs

DLRM

DLRM, an open-source Deep Learning Recommendation Model, is available on Facebook’s PyTorch platform. The model is composed of compute-dominated multilayer perceptrons (MLPs) and relies on data parallelism to improve performance. When predicting click percentage for certain items, for example, it is aligned with randomized Las Vegas algorithms in which resources (time and memory) are used freely but the results are always correct. DLRM uses collaborative filtering and predictive analytics-based approaches to process large amounts of data. For more information about DLRM, see Deep Learning Recommendation Model for Personalization and Recommendation Systems.

For this model, we conducted eight tests across both of our systems. For the PowerEdge R7525 server with A100 GPUs, we notice a tight range with a lower and upper bound of 764,569 and 768,806 result samples per second, respectively. Also, the results produced across the default and high accuracy tests are the same for their respective systems. The initial numbers from the PowerEdge R7525 server with A30 GPUs were slightly below expectations. After the submission deadline, our team was able to extract additional performance, particularly in the Server scenario. The numbers for the PowerEdge R7525 server with A30 GPUs shown in the following figure are not the same as the numbers published on the MLCommons website. However, these numbers are valid and pass all the required compliance tests. The PowerEdge R7525 server with A30 GPUs behaved like the PowerEdge R7525 server with A100 GPUs in that the Server scenario results are slightly lower than the Offline results. The tuned numbers provided the best per card performance among all A30 GPU submissions.

Figure 4: DLRM results on a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs
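To put the tight range reported for the PowerEdge R7525 server with A100 GPUs into perspective, the following sketch computes the spread between the two bounds quoted above. The only inputs are the lower and upper bounds from the text; the variable names are ours.

```python
# Lower and upper bounds (samples per second) quoted in the text for the
# PowerEdge R7525 server with A100 GPUs on DLRM
low, high = 764_569, 768_806

spread = high - low                 # absolute difference in samples per second
spread_pct = spread / low * 100     # difference relative to the lower bound

print(f"spread: {spread} samples/s ({spread_pct:.2f}% of the lower bound)")
```

The spread is 4,237 samples per second, or about 0.55 percent of the lower bound.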

RNNT

Recurrent Neural Network Transducer (RNNT) is a type of neural network in which outputs are recycled as inputs for the current step. By using one-hot encoding and memory, RNNT can remember information through time that might be useful in time series prediction. This model uses a squashing function to learn to predict the next potential word or step to take. The result of the squashing function is always between –1 and 1, which allows neural networks to remain nonlinear and thus effective as the same values are passed through the neural network.
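The recycling of outputs and the bounded squashing function described above can be sketched in a few lines. This is a minimal scalar illustration, not the MLPerf RNNT model: it assumes tanh as the squashing function, and the weights `w_x`, `w_h`, and `b` are hypothetical values chosen for the example.

```python
import math


def recurrent_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One scalar recurrent update: the previous state feeds into the new state.

    The weights are hypothetical; tanh keeps the result between -1 and 1.
    """
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_sequence(inputs, h0=0.0):
    """Apply recurrent_step over a sequence, carrying the state forward."""
    h = h0
    states = []
    for x in inputs:
        h = recurrent_step(x, h)
        states.append(h)
    return states


# Even large-magnitude inputs produce states that stay within the bounds
states = run_sequence([10.0, -25.0, 3.0, 100.0])
print(states)
```

Because every state is squashed before it is fed back, repeated passes through the same weights cannot make the state grow without bound.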

For this model, we conducted four tests across both of our systems. In the Offline scenario, the PowerEdge R7525 server with A100 GPUs outperformed the PowerEdge R7525 server with A30 GPUs by 80 percent. In the Server scenario, the PowerEdge R7525 server with A100 GPUs excelled by performing 199 percent better than the PowerEdge R7525 server with A30 GPUs.

Figure 5: RNNT results on a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs

3D U-Net

3D U-Net is an elegant improvement to the sliding window approach of convolutional neural networks (CNNs) that needs fewer training images and yields more precise segmentations. In brief, an input image goes through a contraction and expansion path (in a U-shaped architecture with skip connections) and becomes a segmentation map output. This segmentation map provides class labels for what is inside the image. For a deeper understanding of the U-Net architecture, see U-Net: Convolutional Networks for Biomedical Image Segmentation.

Across the two systems, we conducted Offline scenario tests for the default and high accuracy modes. Both modes yielded the same results on each system. The PowerEdge R7525 server with A100 GPUs performed 75 percent better than the PowerEdge R7525 server with A30 GPUs.

Figure 6: 3D U-Net results on a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs

Cost Considerations

When placing an order for the PowerEdge R7525 Rack Server on the Dell Technologies website, customers are guided through the purchasing process with suggestions and requirements for their specific rack server. The PowerEdge R7525 server with three NVIDIA Ampere A100 GPUs is 1.423 times more expensive than the PowerEdge R7525 server with three NVIDIA Ampere A30 GPUs. The price difference between the two configurations is due to the more powerful GPU itself. Also, the PowerEdge R7525 server with A100 GPUs requires higher performance fans and a more powerful thermal configuration. Given these additional options, looking at throughput (queries per second (QPS) in the Server scenario and samples per second in the Offline scenario) per dollar provides valuable insight into the performance that each dollar spent can achieve.

The following figure shows the relative performance of the two systems per dollar. Dividing the performance achieved on a system for a particular benchmark by the total cost of the system gives the achievable throughput per dollar spent on the system. A higher throughput per dollar indicates that more performance can be extracted from the system for each dollar spent.

Figure 7: Relative QPS per cost of a PowerEdge R7525 server with A100 GPUs and a PowerEdge R7525 server with A30 GPUs

In the figure, the orange line shows the normalized throughput per cost of the PowerEdge R7525 server with A30 GPUs. The blue bars indicate the relative achievable performance per cost of the PowerEdge R7525 server with A100 GPUs. For most of the benchmarks, the two systems fall within a comparable range. However, the PowerEdge R7525 server with A100 GPUs clearly outperformed the PowerEdge R7525 server with A30 GPUs in the DLRM Server default and high accuracy modes as well as in the RNNT Server mode. Both systems perform well per dollar spent.
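The per-dollar comparison can be approximated from numbers already quoted in this blog. The sketch below is our own back-of-the-envelope calculation, not the data behind Figure 7. It assumes that "1.423 times more expensive" means a cost ratio of 1.423, and it uses the rounded "percent better" figures from the results sections, so the values might differ slightly from the figure. The function and variable names are hypothetical.

```python
# Assumption: cost of the A100 system divided by cost of the A30 system
COST_RATIO_A100_TO_A30 = 1.423

# How much better (in percent) the A100 system performed, as quoted in the text
PERCENT_BETTER = {
    "bert-99 Offline": 69,
    "bert-99 Server": 99,
    "bert-99.9 Offline": 72,
    "bert-99.9 Server": 96,
    "SSD-ResNet34 Offline": 74,
    "SSD-ResNet34 Server": 78,
    "RNNT Offline": 80,
    "RNNT Server": 199,
    "3D U-Net Offline": 75,
}


def relative_throughput_per_dollar(percent_better, cost_ratio):
    """Throughput per dollar of the A100 system, normalized to the A30 system.

    A result above 1.0 means the A100 system delivers more throughput
    per dollar than the A30 system for that benchmark.
    """
    speedup = 1.0 + percent_better / 100.0
    return speedup / cost_ratio


for name, pct in PERCENT_BETTER.items():
    rel = relative_throughput_per_dollar(pct, COST_RATIO_A100_TO_A30)
    print(f"{name:22s} {rel:.2f}x the A30 system's throughput per dollar")
```

Under these assumptions, the A100 system breaks even per dollar when it is 42.3 percent faster than the A30 system. The RNNT Server result, at 199 percent better, works out to roughly 2.1 times the throughput per dollar.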

Note: We compiled the cost data in this section from the PowerEdge R7525 Rack Server page on the Dell Technologies website on September 7, 2021. The data might be subject to change.

Conclusion

This blog provides a detailed comparison of performance between the Dell EMC PowerEdge R7525 server configured with three A100 GPUs and the Dell EMC PowerEdge R7525 server configured with three A30 GPUs. If your ML workload focuses on inferencing, the PowerEdge R7525 server configured with A100 GPUs might suit your needs well. However, if you are looking for a system that not only performs well but is also more cost-effective, the PowerEdge R7525 server configured with A30 GPUs will suit those needs. Both systems performed well and are a great investment, depending on your ML workload requirements.

Next Steps

In future blogs, we plan to describe:

How to run MLPerf Inference v1.1
The PowerEdge R750xa server as a platform for inference v1.1
The DSS8440 server as a platform for inference v1.1
The PowerEdge R7525 server as a platform for inference v1.1
The PowerEdge XE8545 server as a platform for inference v1.1
Comparison of inference v1.0 performance with inference v1.1 performance

Comparison of MLPerf™ Inference v1.1 Results of Dell EMC PowerEdge R7525 Servers with NVIDIA GPUs