
Dell EMC Servers Offer Excellent Deep Learning Performance with the MLPerf™ Training v1.1 Benchmark

Wed, 01 Dec 2021 21:31:51 -0000


Frank Han

Rakshith Vasudev

Dharmesh Patel

Overview

Dell Technologies has submitted results to the MLPerf Training benchmarking suite for the fifth round. This blog provides an overview of our submissions for the latest version, v1.1. Submission results indicate that different Dell EMC servers (Dell EMC DSS8440, PowerEdge R750xa, and PowerEdge XE8545 servers) offer promising performance for deep learning workloads. These workloads span problem types such as image classification, medical image segmentation, lightweight object detection, heavyweight object detection, speech recognition, natural language processing, recommendation, and reinforcement learning.

The previous blog about MLPerf v1.0 contains an introduction to MLCommons™ and the benchmarks in the MLPerf training benchmarking suite. We recommend that you read this blog for an overview of the benchmarks. All the benchmarks and rules remain the same as for v1.0.

The following graph with an exponentially scaled y axis indicates time to converge for the servers and benchmarks in question:

Fig 1: All Dell Technologies submission results for MLPerf Training v1.1

Figure 1 shows the 51 results that Dell Technologies submitted in this round. The results cover Dell EMC DSS8440, PowerEdge R750xa, and PowerEdge XE8545 servers with various NVIDIA A100 accelerator configurations: PCIe and SXM4 form factors, 40 GB and 80 GB memory variants, and 300 W, 400 W, and 500 W TDP variants.

Note: For the hardware and software specifications of the systems in the graph, see https://github.com/mlcommons/training_results_v1.1/tree/master/Dell/systems.

The submitted benchmarks span image classification, medical image segmentation, lightweight object detection, heavyweight object detection, speech recognition, natural language processing, recommendation, and reinforcement learning. In all these areas, the Dell EMC DSS8440, PowerEdge R750xa, and PowerEdge XE8545 server performance is outstanding.

Highlights

Full coverage

Dell Technologies not only submitted the most results but also submitted comprehensive results from a single system. The PowerEdge XE8545 x 4 A100-SXM-80GB server results cover the full spectrum of benchmarked models in the MLPerf Training v1.1 suite: BERT, DLRM, Mask R-CNN, Minigo, ResNet, SSD, RNN-T, and 3D U-Net.

Multinode results

The multinode results scale linearly or nearly linearly with the number of nodes. This scaling makes Dell EMC servers in a multinode environment conducive to faster time to value. Furthermore, among the submitters with NVIDIA accelerator-based results, we are one of three that submitted multinode results.
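One common way to quantify how close a multinode result is to linear scaling is to compare the speedup in time to converge with the number of nodes. The following Python sketch illustrates that arithmetic only; the function name and the timing values are hypothetical and are not taken from the submission results.

```python
def scaling_efficiency(single_node_minutes, multi_node_minutes, node_count):
    """Return (speedup, efficiency) for a multinode time-to-converge result.

    An efficiency of 1.0 corresponds to perfectly linear scaling.
    """
    if single_node_minutes <= 0 or multi_node_minutes <= 0:
        raise ValueError("times must be positive")
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    speedup = single_node_minutes / multi_node_minutes
    efficiency = speedup / node_count
    return speedup, efficiency


# Hypothetical times to converge, for illustration only.
speedup, efficiency = scaling_efficiency(
    single_node_minutes=40.0, multi_node_minutes=10.5, node_count=4
)
print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
```

To apply this to a real result, replace the hypothetical values with the time to converge from the published single-node and multinode entries for the same benchmark and accelerator type.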

Improvements from v1.0 to v1.1

Updates for the Dell Technologies v1.1 submission include:

The v1.1 submission includes results from the PowerEdge R750xa server. The PowerEdge R750xa server offers compelling performance, well suited for artificial intelligence, machine learning, and deep learning training and inferencing workloads.
Our results include numbers for the Dell EMC DSS8440 server with 10 NVIDIA A100 80 GB GPUs. The 10-GPU results are useful because, when training is constrained to a single node, more GPUs in the server help to train the model faster.

Fig 2: Performance comparison of BERT between v1.0 and v1.1 across Dell EMC DSS8440 and PowerEdge XE8545 servers

We observed a performance improvement from v1.0 to v1.1 with the BERT model, especially with the PowerEdge XE8545 server. While many deep learning workloads were similar in performance between v1.0 and v1.1, the many results that we submitted help customers understand the performance difference across versions.

Conclusion

Our number of submissions was significant (51 submissions). They help customers observe performance with different Dell EMC servers across various configurations. A higher number of results helps customers understand server performance that enables a faster time to solution across different configuration types, benchmarks, and multinode settings.
Among the submitters with NVIDIA accelerator-based results, we are one of three that submitted multinode results. Understanding scaling performance across multiple servers is essential as deep learning compute needs continue to increase with different kinds of deep learning models and parallelism techniques.
PowerEdge XE8545 x 4 A100-SXM-80GB server results include all the models in the MLPerf v1.1 benchmark.
PowerEdge R750xa server results were published for this round; they offer excellent performance.

Next steps

In future blogs, we plan to compare the performance of NVLINK Bridged systems with non-NVLINK Bridged systems.

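Quantifying Performance of Dell EMC PowerEdge R7525 Servers with NVIDIA A100 GPUs for Deep Learning Inference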

Summary

In this blog, we provide the performance numbers of our recently released Dell EMC PowerEdge R7525 server with two NVIDIA A100 GPUs for all the MLPerf Inference v0.7 benchmarks. Our results indicate that the PowerEdge R7525 server is an excellent choice for inference workloads. It delivers optimal performance for the different tasks in the MLPerf Inference v0.7 benchmark. These tasks include image classification, object detection, medical image segmentation, speech to text, language processing, and recommendation.

The PowerEdge R7525 server is a two-socket, 2U rack server that is designed to run workloads using flexible I/O and network configurations. The PowerEdge R7525 server features the 2nd Gen AMD EPYC processor, supports up to 32 DIMMs, has PCI Express (PCIe) Gen 4.0-enabled expansion slots, and provides a choice of network interface technologies to cover networking options.

The following figure shows the front view of the PowerEdge R7525 server:

Figure 1. Dell EMC PowerEdge R7525 server

The PowerEdge R7525 server is designed to handle demanding workloads and AI applications, such as AI training for different kinds of models and inference for different deployment scenarios. The PowerEdge R7525 server supports various accelerators such as NVIDIA T4, NVIDIA V100S, NVIDIA RTX, and NVIDIA A100 GPUs. The following sections compare the performance of NVIDIA A100 GPUs with NVIDIA T4 and NVIDIA RTX GPUs using MLPerf Inference v0.7 as a benchmark.

The following table provides details of the PowerEdge R7525 server configuration and software environment for MLPerf Inference v0.7:

Component	Description
Processor	AMD EPYC 7502 32-Core Processor
Memory	512 GB (32 GB 3200 MT/s * 16)
Local disk	2x 1.8 TB SSD (No RAID)
Operating system	CentOS Linux release 8.1
GPU	NVIDIA A100-PCIe-40G, T4-16G, and RTX8000
CUDA Driver	450.51.05
CUDA Toolkit	11.0
Other CUDA-related libraries	TensorRT 7.2, CUDA 11.0, cuDNN 8.0.2, cuBLAS 11.2.0, libjemalloc2, cub 1.8.0, tensorrt-laboratory mlperf branch
Other software stack	Docker 19.03.12, Python 3.6.8, GCC 5.5.0, ONNX 1.3.0, TensorFlow 1.13.1, PyTorch 1.1.0, torchvision 0.3.0, PyCUDA 2019.1, SacreBLEU 1.3.3, simplejson, OpenCV 4.1.1
System profiles	Performance

For more information about how to run the benchmark, see Running the MLPerf Inference v0.7 Benchmark on Dell EMC Systems.

MLPerf Inference v0.7 performance results

The MLPerf inference benchmark measures how fast a system can perform machine learning (ML) inference using a trained model in various deployment scenarios. The following results represent the Offline and Server scenarios of the MLPerf Inference benchmark. For more information about different scenarios, models, datasets, accuracy targets, and latency constraints in MLPerf Inference v0.7, see Deep Learning Performance with MLPerf Inference v0.7 Benchmark.

In the MLPerf inference evaluation framework, the LoadGen load generator sends inference queries to the system under test, in our case, the PowerEdge R7525 server with various GPU configurations. The system under test uses a backend (for example, TensorRT, TensorFlow, or PyTorch) to perform inferencing and sends the results back to LoadGen.

MLPerf has identified four different scenarios that enable representative testing of a wide variety of inference platforms and use cases. In this blog, we discuss the Offline and Server scenario performance. The main differences between these scenarios are based on how the queries are sent and received:

Offline—One query with all samples is sent to the system under test. The system under test can send the results back once or multiple times in any order. The performance metric is samples per second.
Server—Queries are sent to the system under test following a Poisson distribution (to model real-world random events). One query has one sample. The performance metric is queries per second (QPS) within latency bound.

Note: The performance metrics for both the Offline and Server scenarios represent the throughput of the system.
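To make the Server scenario arrival pattern concrete, the following Python sketch generates query arrival times for a Poisson process by drawing exponentially distributed gaps between consecutive queries. It is an illustration only and is not the LoadGen implementation; the function name, the target rate, and the duration are assumptions chosen for the example.

```python
import random


def poisson_arrival_times(target_qps, duration_seconds, seed=0):
    """Return arrival times (in seconds) of single-sample queries.

    In a Poisson process with rate target_qps, the gaps between
    consecutive arrivals are exponentially distributed with mean
    1 / target_qps.
    """
    rng = random.Random(seed)
    arrivals = []
    now = 0.0
    while True:
        now += rng.expovariate(target_qps)
        if now >= duration_seconds:
            return arrivals
        arrivals.append(now)


# Hypothetical target rate and duration, for illustration only.
arrivals = poisson_arrival_times(target_qps=100.0, duration_seconds=10.0)
observed_qps = len(arrivals) / 10.0
print(f"{len(arrivals)} queries issued, observed rate {observed_qps:.1f} QPS")
```

Because the gaps are random, queries sometimes arrive in bursts. That burstiness is why the Server metric is reported as queries per second within a latency bound rather than as raw throughput.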

In all the benchmarks, two NVIDIA A100 GPUs outperform eight NVIDIA T4 GPUs and three NVIDIA RTX8000 GPUs for the following models:

ResNet-50 image classification model
SSD-ResNet34 object detection model
RNN-T speech recognition model
BERT language processing model
DLRM recommender model
3D U-Net medical image segmentation model

The following graphs show PowerEdge R7525 server performance with two NVIDIA A100 GPUs, eight NVIDIA T4 GPUs, and three NVIDIA RTX8000 GPUs with 99% accuracy target benchmarks and 99.9% accuracy targets for applicable benchmarks:

99% accuracy (default accuracy) target benchmarks: ResNet-50, SSD-Resnet34, and RNN-T
99% and 99.9% accuracy (high accuracy) target benchmarks: DLRM, BERT, and 3D U-Net

99% accuracy target benchmarks

ResNet-50

The following figure shows results for the ResNet-50 model:

Figure 2. ResNet-50 Offline and Server inference performance

From the graph, we can derive the per-GPU values. Because throughput scales linearly with the number of GPUs, we divide the system throughput (for all the GPUs combined) by the number of GPUs to get the per-GPU results.
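The per-GPU derivation is a single division. The following Python sketch shows it with a hypothetical system throughput; the function name and the numbers are assumptions for illustration and are not values read from the figure.

```python
def per_gpu_throughput(system_throughput, gpu_count):
    """Divide a system-level throughput evenly across its GPUs.

    This is only meaningful when throughput scales linearly with
    the number of GPUs, as assumed in the text.
    """
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return system_throughput / gpu_count


# Hypothetical Offline result in samples per second for a two-GPU system.
system_samples_per_second = 60000.0
per_gpu = per_gpu_throughput(system_samples_per_second, gpu_count=2)
print(f"per-GPU throughput: {per_gpu:.1f} samples/s")
```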

SSD-Resnet34

The following figure shows the results for the SSD-Resnet34 model:

Figure 3. SSD-Resnet34 Offline and Server inference performance

RNN-T

The following figure shows the results for the RNN-T model:

Figure 4. RNN-T Offline and Server inference performance

99.9% accuracy target benchmarks

DLRM

The following figures show the results for the DLRM model with 99% and 99.9% accuracy:

Figure 5. DLRM Offline and Server Scenario inference performance – 99% and 99.9% accuracy

For the DLRM recommender and 3D U-Net medical image segmentation (see Figure 7) models, both 99% and 99.9% accuracy have the same throughput. The 99.9% accuracy benchmark also satisfies the required accuracy constraints with the same throughput as that of 99%.

BERT

The following figures show the results for the BERT model with 99% and 99.9% accuracy:

Figure 6. BERT Offline and Server inference performance – 99% and 99.9% accuracy

For the BERT language processing model, two NVIDIA A100 GPUs outperform eight NVIDIA T4 GPUs and three NVIDIA RTX8000 GPUs. However, the performance of three NVIDIA RTX8000 GPUs is a little better than that of eight NVIDIA T4 GPUs.

3D U-Net

For the 3D U-Net medical image segmentation model, only the Offline scenario benchmark is available.

The following figure shows the results for the 3D U-Net model Offline scenario:

Figure 7. 3D U-Net Offline inference performance

Because the 3D U-Net medical image segmentation model has only an Offline scenario benchmark, the preceding graph shows Offline results only.

The following table compares the throughput between two NVIDIA A100 GPUs, eight NVIDIA T4 GPUs, and three NVIDIA RTX8000 GPUs with 99% accuracy target benchmarks and 99.9% accuracy targets:

Model	Scenario	Accuracy	2 x A100 GPUs vs 8 x T4 GPUs	2 x A100 GPUs vs 3 x RTX8000 GPUs
ResNet-50	Offline	99%	5.21x	2.10x
ResNet-50	Server	99%	4.68x	1.89x
SSD-Resnet34	Offline	99%	6.00x	2.35x
SSD-Resnet34	Server	99%	5.99x	2.21x
RNN-T	Offline	99%	5.55x	2.14x
RNN-T	Server	99%	6.71x	2.43x
DLRM	Offline	99%	6.55x	2.52x
DLRM	Server	99%	5.92x	2.47x
DLRM	Offline	99.9%	6.55x	2.52x
DLRM	Server	99.9%	5.92x	2.47x
BERT	Offline	99%	6.26x	2.31x
BERT	Server	99%	6.80x	2.72x
BERT	Offline	99.9%	7.04x	2.22x
BERT	Server	99.9%	6.84x	2.20x
3D U-Net	Offline	99%	5.05x	2.06x
3D U-Net	Offline	99.9%	5.05x	2.06x
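The ratios in the table compare whole systems with different GPU counts. Assuming the linear scaling described in the ResNet-50 section, a system-level ratio can be converted to an approximate per-GPU ratio. The following Python sketch does this for the ResNet-50 Offline row; the function name is hypothetical and the per-GPU figures are derived estimates, not published results.

```python
def per_gpu_ratio(system_ratio, gpus_a, gpus_b):
    """Convert a system-level throughput ratio (A over B) to a per-GPU ratio.

    Derivation: (throughput_a / gpus_a) / (throughput_b / gpus_b)
              = (throughput_a / throughput_b) * gpus_b / gpus_a
    """
    if gpus_a < 1 or gpus_b < 1:
        raise ValueError("GPU counts must be at least 1")
    return system_ratio * gpus_b / gpus_a


# ResNet-50 Offline ratios from the table: 2 x A100 vs 8 x T4 and 3 x RTX8000.
a100_vs_t4 = per_gpu_ratio(5.21, gpus_a=2, gpus_b=8)
a100_vs_rtx8000 = per_gpu_ratio(2.10, gpus_a=2, gpus_b=3)
print(f"A100 vs T4 per GPU: {a100_vs_t4:.2f}x")
print(f"A100 vs RTX8000 per GPU: {a100_vs_rtx8000:.2f}x")
```

Under this assumption, one A100 GPU delivers roughly 20.8 times the ResNet-50 Offline throughput of one T4 GPU and roughly 3.2 times that of one RTX8000 GPU. These estimates inherit the rounding of the table values.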

Conclusion

With support for NVIDIA A100, NVIDIA T4, or NVIDIA RTX8000 GPUs, the Dell EMC PowerEdge R7525 server is an exceptional choice for various workloads that involve deep learning inference. Of these options, the higher throughput that we observed with NVIDIA A100 GPUs translates to performance gains and faster business value for inference applications.

The Dell EMC PowerEdge R7525 server with two NVIDIA A100 GPUs delivers optimal performance for various inference workloads, whether in a batch inference setting such as the Offline scenario or an online inference setting such as the Server scenario.

Next steps

In future blogs, we will discuss sizing the system (server and GPU configurations) correctly based on the type of workload (area and task).


MLPerf™ Inference v4.0 Performance on Dell PowerEdge R760xa and R7615 Servers with NVIDIA L40S GPUs

Fri, 05 Apr 2024 17:41:56 -0000


Abstract

Dell Technologies recently submitted results to the MLPerf™ Inference v4.0 benchmark suite. This blog highlights Dell Technologies’ closed division submission made for the Dell PowerEdge R760xa, Dell PowerEdge R7615, and Dell PowerEdge R750xa servers with NVIDIA L40S and NVIDIA A100 GPUs.

Introduction

This blog presents conclusions about the performance improvements that are achieved on the PowerEdge R760xa and R7615 servers with the NVIDIA L40S GPU compared to the PowerEdge R750xa server with the NVIDIA A100 GPU. In the following comparisons, we held the GPU constant across the PowerEdge R760xa and PowerEdge R7615 servers to show the excellent performance of the NVIDIA L40S GPU. We also compared the PowerEdge R750xa server with the NVIDIA A100 GPU to its successor, the PowerEdge R760xa server with the NVIDIA L40S GPU.

System Under Test configuration

The following table shows the System Under Test (SUT) configuration for the PowerEdge servers.

Table 1: SUT configuration of the Dell PowerEdge R750xa, R760xa, and R7615 servers for MLPerf Inference v4.0

Server	PowerEdge R750xa	PowerEdge R760xa	PowerEdge R7615
MLPerf Version	V4.0
GPU	NVIDIA A100 PCIe 80 GB	NVIDIA L40S	NVIDIA L40S
Number of GPUs	4	4	2
MLPerf System ID	R750xa_A100_PCIe_80GBx4_TRT	R760xa_L40Sx4_TRT	R7615_L40Sx2_TRT
CPU	2 x Intel Xeon Gold 6338 CPU @ 2.00GHz	2 x Intel Xeon Platinum 8470Q	1 x AMD EPYC 9354 32-Core Processor
Memory	512 GB
Software Stack	TensorRT 9.3.0 CUDA 12.2 cuDNN 8.9.2 Driver 535.54.03 / 535.104.12 DALI 1.28.0

The following table lists the technical specifications of the NVIDIA L40S and NVIDIA A100 GPUs.

Table 2: Technical specifications of the NVIDIA A100 and NVIDIA L40S GPUs

Model	NVIDIA A100	NVIDIA A100	NVIDIA L40S
Form factor	SXM4	PCIe Gen4	PCIe Gen4
GPU architecture	Ampere	Ampere	Ada Lovelace
CUDA cores	6912	6912	18176
Memory size	80 GB	80 GB	48 GB
Memory type	HBM2e	HBM2e	GDDR6
Base clock	1275 MHz	1065 MHz	1110 MHz
Boost clock	1410 MHz	1410 MHz	2520 MHz
Memory clock	1593 MHz	1512 MHz	2250 MHz
MIG support	Yes	Yes	No
Peak memory bandwidth	2039 GB/s	1935 GB/s	864 GB/s
Total board power	500 W	300 W	350 W

Dell PowerEdge R760xa server

The PowerEdge R760xa server shines as an Artificial Intelligence (AI) workload server with its cutting-edge inferencing capabilities. This server represents the pinnacle of performance in the AI inferencing space with its processing prowess enabled by Intel Xeon Platinum processors and NVIDIA L40S GPUs. Coupled with NVIDIA TensorRT and CUDA 12.2, the PowerEdge R760xa server is positioned perfectly for any AI workload including, but not limited to, Large Language Models, computer vision, Natural Language Processing, robotics, and edge computing. Whether you are processing image recognition tasks, natural language understanding, or deep learning models, the PowerEdge R760xa server provides the computational muscle for reliable, precise, and fast results.

Figure 1: Front view of the Dell PowerEdge R760xa server

Figure 2: Top view of the Dell PowerEdge R760xa server

Dell PowerEdge R7615 server

The PowerEdge R7615 server stands out as an excellent choice for AI, machine learning (ML), and deep learning (DL) workloads due to its robust performance capabilities and optimized architecture. With its powerful processing capabilities including up to three NVIDIA L40S GPUs supported by TensorRT, this server can handle complex neural network inference and training tasks with ease. Powered by a single AMD EPYC processor, this server performs well for any demanding AI workloads.

Figure 3: Front view of the Dell PowerEdge R7615 server

Figure 4: Top view of the Dell PowerEdge R7615 server

Dell PowerEdge R750xa server

The PowerEdge R750xa server is a perfect blend of technological prowess and innovation. This server is equipped with Intel Xeon Gold processors and the latest NVIDIA GPUs. The PowerEdge R750xa server is designed for the most demanding AI, ML, and DL workloads as it is compatible with the latest NVIDIA TensorRT engine and CUDA version. With up to nine PCIe Gen4 slots and availability in a 1U or 2U configuration, the PowerEdge R750xa server is an excellent option for any demanding workload.

Figure 5: Front view of the Dell PowerEdge R750xa server

Figure 6: Top view of the Dell PowerEdge R750xa server

Performance results

Classical Deep Learning models performance

The following figure presents each result as a ratio normalized to the Dell PowerEdge R750xa server with four NVIDIA A100 GPUs. This normalization provides an easy-to-read comparison of three systems and several benchmarks.

Figure 7: Normalized NVIDIA L40S GPU performance over the PowerEdge R750xa server with four A100 GPUs

The green trendline represents the performance of the Dell PowerEdge R750xa server with four NVIDIA A100 GPUs. With a score of 1.00 for each benchmark value, the results have been divided by themselves to serve as the baseline in green for this comparison. The blue trendline represents the performance of the PowerEdge R760xa server with four NVIDIA L40S GPUs that has been normalized by dividing each benchmark result by the corresponding score achieved by the PowerEdge R750xa server. In most cases, the performance achieved on the PowerEdge R760xa server outshines the results of the PowerEdge R750xa server with NVIDIA A100 GPUs, proving the expected improvements from the NVIDIA L40S GPU. The red trendline has also been normalized over the PowerEdge R750xa server and represents the performance of the PowerEdge R7615 server with two NVIDIA L40S GPUs. It is interesting that the red line almost mimics the blue line. This result suggests that the PowerEdge R7615 server, despite having half the compute resources, still performs comparably well in most cases, showing its efficiency.
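The normalization described above divides each benchmark result by the baseline result for the same benchmark. The following Python sketch shows the calculation; the function name, benchmark labels, and throughput values are hypothetical placeholders and are not the submitted results.

```python
def normalize_to_baseline(results, baseline):
    """Divide each benchmark result by the baseline result for that benchmark.

    Benchmarks that the baseline system did not run are skipped.
    """
    return {
        name: value / baseline[name]
        for name, value in results.items()
        if name in baseline
    }


# Hypothetical throughput values, for illustration only.
baseline_system = {"benchmark_a": 1000.0, "benchmark_b": 50.0}
candidate_system = {"benchmark_a": 1200.0, "benchmark_b": 45.0}

baseline_line = normalize_to_baseline(baseline_system, baseline_system)
candidate_line = normalize_to_baseline(candidate_system, baseline_system)
print(baseline_line)
print(candidate_line)
```

The baseline normalized against itself yields 1.0 for every benchmark, which corresponds to the flat reference line in the figure. Values above 1.0 for another system indicate higher throughput than the baseline on that benchmark.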

Generative AI performance

The latest submission saw the introduction of the new Stable Diffusion XL benchmark. In the context of generative AI, stable diffusion is a text to image model that generates coherent image samples. This result is achieved gradually by refining and spreading out information throughout the generation process. Consider the example of dropping food coloring into a large bucket of water. Initially, only a small, concentrated portion of the water turns color, but gradually the coloring is evenly distributed in the bucket.

The following table shows the excellent performance of the PowerEdge R760xa server with the powerful NVIDIA L40S GPU for the GPT-J and Stable Diffusion XL benchmarks. The PowerEdge R760xa takes the top spot in GPT-J and Stable Diffusion XL when compared to other NVIDIA L40S results.

Table 3: Benchmark results for the PowerEdge R760xa server with the NVIDIA L40S GPU

Benchmark	Dell PowerEdge R760xa L40S result (Server in Queries/s and Offline in Samples/s)	Dell’s % gain to the next best non-Dell results (%)
Stable Diffusion XL Server	0.65	5.24
Stable Diffusion XL Offline	0.67	2.28
GPT-J 99 Server	12.75	4.33
GPT-J 99 Offline	12.61	1.88
GPT-J 99.9 Server	12.75	4.33
GPT-J 99.9 Offline	12.61	1.88

Conclusion

The MLPerf Inference submissions enable insightful like-for-like comparisons. This blog highlights the impressive performance of the NVIDIA L40S GPU in the Dell PowerEdge R760xa and PowerEdge R7615 servers. Both servers performed well when compared to the performance of the Dell PowerEdge R750xa server with the NVIDIA A100 GPU. The outstanding performance improvements in the NVIDIA L40S GPU coupled with the Dell PowerEdge server position Dell customers to succeed in AI workloads. With the advent of the GPT-J and Stable Diffusion XL models, the Dell PowerEdge server is well positioned to handle generative AI workloads.
