
Dell EMC Servers Shine in MLPerf Inference v0.7 Benchmark
Mon, 02 Nov 2020 15:16:55 -0000
|Read Time: 0 minutes
As software applications and systems using Artificial Intelligence (AI) gain mainstream adoption across all industries, inference workloads for ongoing operations are becoming a larger resource consumer in the datacenter. MLPerf is a benchmark suite that is used to evaluate the performance profiles of systems for both training and inference AI tasks. In this blog we take a closer look at the recent results submitted by Dell EMC and how our various servers performed in the datacenter category.
We do this work to help customers understand which server platform makes the most sense for their use case. Dell Technologies wants to make that choice easier and reduce the evaluation work our customers have to do, so they can spend their time on the use case itself and accelerate time to value for the business.
Dell Technologies made a total of 210 submissions for MLPerf Inference v0.7 in the Datacenter category using various server platforms and accelerators. Why so many? Many customers have never run AI in their environment, the potential use cases span every industry, and in-house expertise is limited. Customers have told us they need help identifying the correct server platform for their workloads.
We’re proud of what we’ve done, but it’s still all about helping customers adopt AI. By sharing our expertise and providing guidance on infrastructure for AI, we help customers become successful and get their use case into production.
MLPerf Benchmarks
MLPerf was founded in 2018 with a goal of accelerating improvements in ML system performance. Formed as a collaboration of companies and researchers from leading educational institutions, MLPerf leverages open source code, public state-of-the-art Machine Learning (ML) models and publicly available datasets contributed to the ML community. The MLPerf suites include MLPerf Training and MLPerf Inference.
MLPerf Training measures how fast a system can train machine learning models. Training benchmarks have been defined for image classification, lightweight and heavy-weight object detection, language translation, natural language processing, recommendation and reinforcement learning. Each benchmark includes specifications for input datasets, quality targets and reference implementation models. The first round of training submissions was published on the MLPerf website in December 2018 with results submitted by Google, Intel and NVIDIA.
The MLPerf Inference suite measures how quickly a trained neural network can evaluate new data and perform forecasting or classification for a wide range of applications. MLPerf Inference includes image classification, object detection and machine translation with specific models, datasets, quality, server latency and multi-stream latency constraints. MLPerf validated and published results for MLPerf Inference v0.7 on October 21, 2020. In this blog we take a closer look at the MLPerf Inference v0.7 results submitted by Dell EMC and how the servers performed in the datacenter category.
A summary of the key highlights of the Dell EMC results is shown in Table 1. These are derived from the submitted results in the MLPerf datacenter closed category. Ranking and claims are based on Dell analysis of published MLPerf data. Per-accelerator performance is calculated by dividing the primary metric of total performance by the number of accelerators reported; a minimal sketch of this calculation follows Table 1.
Rank | Category | Specifics | Use Cases
---|---|---|---
#1 | Performance per accelerator with NVIDIA A100-PCIe | PowerEdge R7525 | Medical Imaging, Image Classification
#1 | Performance per accelerator with NVIDIA T4 GPUs | PowerEdge XE2420, PowerEdge R7525, DSS 8440 | Medical Imaging, NLP, Image Classification, Speech Recognition, Object Detection, Recommendation
#1 | Highest inference results with Quadro RTX6000 and RTX8000 | PowerEdge R7525, DSS 8440 | Medical Imaging, NLP, Image Classification, Speech Recognition, Object Detection, Recommendation
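As a minimal sketch of that per-accelerator calculation, the snippet below divides a submission's total throughput by its accelerator count. The total shown is back-computed from the 30,005 samples/s per-accelerator ResNet50 Server figure cited later in this blog, so treat it as illustrative rather than an official MLPerf entry.

```python
# Per-accelerator performance = primary metric (total system throughput) / number of accelerators.
# The total below is back-computed from the 30,005 samples/s per-accelerator figure
# cited later for the 3x A100 PowerEdge R7525 (illustrative, not an official MLPerf entry).

submission = {"system": "PowerEdge R7525 (3x NVIDIA A100-PCIe)",
              "accelerators": 3,
              "total_throughput": 90_015.0}   # samples/s, ResNet50 Server scenario

per_accelerator = submission["total_throughput"] / submission["accelerators"]
print(f'{submission["system"]}: {per_accelerator:,.0f} samples/s per accelerator')  # ~30,005
```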
Dell EMC had a total of 210 submissions for MLPerf Inference v0.7 in the Datacenter category using various Dell EMC platforms and accelerators from leading vendors. We achieved impressive results when compared to other submissions in the same class of platforms.
MLPerf Inference Categories and Dell EMC Achievements
A benchmark suite is made up of tasks or models from vision, speech, language and commerce use cases. MLPerf Inference measures how fast a system can perform ML inference by using a load generator against the System Under Test (SUT) where the trained model is deployed.
MLPerf Inference v0.7 defines three types of benchmark tests: one for datacenter systems, one for edge systems and one for mobile systems. MLPerf then defines four scenarios to enable representative testing of a wide variety of inference platforms and use cases:
- Single stream
- Multiple stream
- Server
- Offline
The single-stream and multi-stream scenarios are used only for the edge and mobile inference benchmarks. The datacenter benchmark type targets systems designed for data center deployments and requires evaluation of both the Server and Offline scenarios. The metric used in the Datacenter category is inference operations per second. In the Server scenario, the MLPerf load generator sends new queries to the SUT according to a Poisson distribution; this is representative of online AI applications such as translation or image tagging, which have variable arrival patterns driven by end-user traffic. The Offline scenario represents AI tasks performed through batch processing, such as photo categorization, where all the data is available ahead of time.
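To make the difference concrete, here is a small sketch (not the actual MLPerf LoadGen implementation) of how the two scenarios generate queries: the Offline scenario issues one query containing every sample, while the Server scenario issues single-sample queries whose arrival times follow a Poisson process.

```python
import random

def offline_queries(num_samples):
    """Offline scenario sketch: one query containing every sample, issued at time zero."""
    return [(0.0, list(range(num_samples)))]

def server_queries(num_samples, target_qps, seed=0):
    """Server scenario sketch: one sample per query, Poisson arrivals at the target QPS."""
    rng = random.Random(seed)
    t, queries = 0.0, []
    for sample_id in range(num_samples):
        t += rng.expovariate(target_qps)  # exponential inter-arrival gaps => Poisson process
        queries.append((t, [sample_id]))
    return queries

print(offline_queries(4))          # [(0.0, [0, 1, 2, 3])]
print(server_queries(4, 100.0))    # four single-sample queries with random timestamps
```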
Dell EMC published multiple results in the datacenter systems category. Details on the models, datasets and the scenarios submitted for the different datacenter benchmarks are shown in Table 2.
Area | Task | Model | Dataset | Required Scenarios
---|---|---|---|---
Vision | Image classification | ResNet50-v1.5 | ImageNet (224x224) | Server, Offline
Vision | Object detection (large) | SSD-ResNet34 | COCO (1200x1200) | Server, Offline
Vision | Medical image segmentation | 3D U-Net | BraTS 2019 (224x224x160) | Offline
Speech | Speech-to-text | RNN-T | LibriSpeech dev-clean (samples < 15 seconds) | Server, Offline
Language | Language processing | BERT | SQuAD v1.1 (max_seq_len=384) | Server, Offline
Commerce | Recommendation | DLRM | 1TB Click Logs | Server, Offline
Next we highlight some of the key performance achievements for the broad range of solutions available in the Dell EMC portfolio for inference use cases and deployments.
1. Dell EMC is #1 in total number of datacenter submissions in the closed division, including bare-metal submissions using different GPUs, Xeon CPUs and a Xilinx FPGA, as well as a virtualized submission on VMware vSphere
The closed division enables head-to-head comparisons and covers server platforms used from the Edge to private or public clouds. The Dell Technologies engineering team submitted 210 out of the total 509 results.
We remain committed to helping customers deploy inference workloads as efficiently as possible, meeting their unique requirements of power, density, budget and performance. The wide range of servers submitted by Dell Technologies is a testament to this commitment:
- The only vendor with submissions for a variety of inference solutions, leveraging GPUs, FPGAs and CPUs for the datacenter/private cloud and the Edge
- Unique in the industry by submitting results across a multitude of servers that range from mainstream servers (R740/R7525) to dense GPU-optimized servers supporting up to 16 NVIDIA GPUs (DSS8440).
- Demonstrated that customers that demand real-time inferencing at the telco or retail Edge can deploy up to 4 GPUs in a short depth NEBS-compliant PowerEdge XE2420 server.
- Demonstrated efficient Inference performance using the 2nd Gen Intel Xeon Scalable platform on the PowerEdge R640 and PowerEdge R740 platforms for customers wanting to run inference on Intel CPUs.
- Dell submissions using Xilinx U280 in PowerEdge R740 demonstrated that customers wanting low latency inference can leverage FPGA solutions.
2. Dell EMC is #1 in performance “per Accelerator” with PowerEdge R7525 and A100-PCIe for multiple benchmarks
The Dell EMC PowerEdge R7525 was purpose-built for superior accelerated performance. The MLPerf results validated leading performance across many scenarios including:
Performance Rank “Per Accelerator” | Inference Throughput | Dell EMC System
---|---|---
#1 ResNet50 (Server) | 30,005 | PowerEdge R7525 (3x NVIDIA A100-PCIe)
#1 3D-UNet-99 (Offline) | 39 | PowerEdge R7525 (3x NVIDIA A100-PCIe)
#1 3D-UNet-99.9 (Offline) | 39 | PowerEdge R7525 (3x NVIDIA A100-PCIe)
#2 DLRM-99 (Server) | 192,543 | PowerEdge R7525 (2x NVIDIA A100-PCIe)
#2 DLRM-99.9 (Server) | 192,543 | PowerEdge R7525 (2x NVIDIA A100-PCIe)
3. Dell achieved the highest inference scores with NVIDIA Quadro RTX GPUs using the DSS 8440 and R7525
Dell Technologies engineering understands that training isn't the only AI workload and that using the right technology for each job is far more cost effective. Dell is the only vendor to submit results using NVIDIA Quadro RTX6000 and RTX8000 GPUs, which provide up to 48 GB of GDDR6 memory for large inference models. The DSS 8440 with 10 Quadro RTX GPUs achieved:
- #2 and #3 highest system performance on RNN-T for Offline scenario.
The #1 ranking was delivered by a system using 8x NVIDIA A100 SXM4 GPUs, introduced in May 2020, which is a powerful platform for training state-of-the-art deep learning models. Dell Technologies took the #2 and #3 spots with the DSS 8440 server equipped with 10x NVIDIA RTX8000 GPUs and the DSS 8440 with 10x NVIDIA RTX6000 GPUs, providing better power and cost efficiency for inference workloads compared to other submissions.
4. Dell EMC claims #1 spots for NVIDIA T4 platforms with DSS 8440, XE2420 and PowerEdge R7525
Dell Technologies provides system options for customers to deploy inference workloads that match their unique requirements. Today’s accelerators vary significantly in price, performance and power consumption. For example, the NVIDIA T4 is a low profile, lower power GPU option that is widely deployed for inference due to its superior power efficiency and economic value for that use case.
The MLPerf results corroborate the exemplary inference performance of the NVIDIA T4 on Dell EMC servers. Dell EMC systems lead in performance per GPU among the 20 servers that submitted scores using NVIDIA T4 GPUs:
- #1 in performance per GPU on 3d-unet-99 and 3d-unet-99.9 Offline scenario
- #1 in performance per GPU on Bert-99 Server and Bert-99.9 Offline scenario
- #1, #2 and #3 in performance with T4 on DLRM-99 & DLRM-99.9 Server scenario
- #1 in performance per GPU on ResNet50 Offline scenario
- #1 in performance per GPU on RNN-T Server and Offline scenario
- #1 in performance per GPU on SSD-large Offline scenario
The best scores achieved for the NVIDIA T4 “Per GPU” rankings above, and the respective platforms, are shown in the following table; a sketch of how such rankings can be derived appears after the table:
Benchmark | Offline Rank | Offline Throughput | Offline System | Server Rank | Server Throughput | Server System
---|---|---|---|---|---|---
3d-unet-99 | #1 | 7.6 | XE2420 | n/a | n/a | n/a
3d-unet-99.9 | #1 | 7.6 | XE2420 | n/a | n/a | n/a
bert-99 | #3 | 449 | XE2420 | #1 | 402 | XE2420
bert-99.9 | #1 | 213 | DSS 8440 | #2 | 190 | XE2420
dlrm-99 | #2 | 35,054 | XE2420 | #1 | 32,507 | R7525
dlrm-99.9 | #2 | 35,054 | XE2420 | #1 | 32,507 | R7525
resnet | #1 | 6,285 | XE2420 | #4 | 5,663 | DSS 8440
rnnt | #1 | 1,560 | XE2420 | #1 | 1,146 | XE2420
ssd-large | #1 | 142 | XE2420 | #2 | 131 | DSS 8440
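To show how such per-GPU rankings can be derived from published data, here is a minimal, hypothetical sketch. The XE2420 total is back-computed from the 6,285 samples/s per-GPU ResNet figure above; the other two totals are made-up placeholders, not actual submission numbers.

```python
# Hypothetical sketch: rank T4-based systems on a single benchmark/scenario by
# per-GPU throughput (total throughput divided by GPU count).
# Only the XE2420 total is back-computed from the table above; the rest are placeholders.

submissions = [
    {"system": "PowerEdge XE2420 (4x T4)", "gpus": 4,  "throughput": 25_140.0},
    {"system": "PowerEdge R7525 (8x T4)",  "gpus": 8,  "throughput": 47_800.0},
    {"system": "DSS 8440 (16x T4)",        "gpus": 16, "throughput": 91_500.0},
]

ranked = sorted(submissions, key=lambda s: s["throughput"] / s["gpus"], reverse=True)
for rank, s in enumerate(ranked, start=1):
    per_gpu = s["throughput"] / s["gpus"]
    print(f'#{rank} {s["system"]}: {per_gpu:,.0f} samples/s per GPU')
```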
5. Dell is the only vendor to submit results on virtualized infrastructure with vCPUs and NVIDIA virtual GPUs (vGPU) on VMware vSphere
Customers interested in deploying inference workloads for AI on virtualized infrastructure can leverage Dell servers with VMware software to reap the benefits of virtualization.
To demonstrate efficient virtualized performance on 2nd Generation Intel Xeon Scalable processors, Dell EMC and VMware submitted results using vSphere and OpenVINO on the PowerEdge R640.
- Virtualization overhead for a single VM was observed to be minimal, and testing showed that multiple VMs could be deployed on a single server to achieve ~26% better throughput compared to a bare-metal environment.
Dell EMC has published guidance on virtualizing GPUs using DirectPath I/O, NVIDIA Virtual Compute Server (vCS) and more. Dell EMC and VMware used NVIDIA vCS virtualization software in vSphere for MLPerf Inference benchmarks on virtualized NVIDIA T4 GPUs:
- VMware vSphere using NVIDIA vCS delivers near bare metal performance for MLPerf Inference v0.7 benchmarks. The inference throughput (queries processed per second) increases linearly as the number of vGPUs attached to the VM increases.
Blogs covering these virtualized tests in greater detail are published at VMware’s performance Blog site.
This concludes our coverage of the top five highlights from the 210 submissions made by Dell EMC in the datacenter division. Next, we discuss other aspects of the GPU-optimized portfolio that are important for customers: quality and support.
Dell has the highest number of NVIDIA GPU submissions using NVIDIA NGC Ready systems
Dell GPU enabled platforms are part of NVIDIA NGC-Ready and NGC-Ready for Edge validation programs. At Dell, we understand that performance is critical, but customers are not willing to compromise quality and reliability to achieve maximum performance. Customers can confidently deploy inference and other software applications from the NVIDIA NGC catalog knowing that the Dell systems meet all the requirements set by NVIDIA to deploy customer workloads on-premises or at the Edge.
NVIDIA NGC validated configs that were used for this round of MLPerf submissions are:
- Dell EMC PowerEdge XE2420 (4x T4)
- Dell EMC DSS 8440 (10x Quadro RTX 8000)
- Dell EMC DSS 8440 (12x T4)
- Dell EMC DSS 8440 (16x T4)
- Dell EMC DSS 8440 (8x Quadro RTX 8000)
- Dell EMC PowerEdge R740 (4x T4)
- Dell EMC PowerEdge R7515 (4x T4)
- Dell EMC PowerEdge R7525 (2x A100-PCIE)
- Dell EMC PowerEdge R7525 (3x Quadro RTX 8000)
The Dell EMC portfolio can address customers' inference needs from on-premises to the edge
In this blog, we highlighted the results submitted by Dell EMC to demonstrate how our various servers performed in the datacenter category. The Dell EMC server portfolio provides many options for customers wanting to deploy AI inference in their datacenters or on the edge. We also offer a wide range of accelerator options, including multiple GPU and FPGA models, for running inference either on bare metal or virtualized infrastructure to meet specific application and deployment requirements.
Finally, we list the performance for a subset of the server platforms that we see most commonly used by customers today for running inference workloads. These rankings highlight that these platforms can support the wide range of inference use cases showcased in the MLPerf suite.
1. The Dell EMC PowerEdge XE2420 with 4x NVIDIA T4 GPUs: Ranked between #1 and #3 in 14 out of 16 benchmark categories when compared with other T4 Servers
Dell EMC PowerEdge XE2420 (4x T4) Per Accelerator Ranking*

Benchmark | Offline | Server
---|---|---
3d-unet-99 | #1 | n/a
3d-unet-99.9 | #1 | n/a
bert-99 | #3 | #1
bert-99.9 | #2 | #2
dlrm-99 | #1 | #3
dlrm-99.9 | #1 | #3
resnet | #1 |
rnnt | #1 | #1
ssd-large | #1 |
2. Dell EMC PowerEdge R7525 with 8x T4 GPUs: Ranked between #1 and #5 in 11 out of 16 benchmark categories among T4 server submissions
Dell EMC PowerEdge R7525 (8x T4) Per Accelerator Ranking*

Benchmark | Offline | Server
---|---|---
3d-unet-99 | #4 | n/a
3d-unet-99.9 | #4 | n/a
bert-99 | #4 |
dlrm-99 | #2 | #1
dlrm-99.9 | #2 | #1
rnnt | #2 | #5
ssd-large | #5 |
3. The Dell EMC PowerEdge R7525 with up to 3x A100-PCIe: Ranked between #3 and #10 in 15 out of 16 benchmark categories across all datacenter submissions
Dell EMC PowerEdge R7525 (2|3x A100-PCIe) Per Accelerator Ranking*

Benchmark | Offline | Server
---|---|---
3d-unet-99 | #4 | n/a
3d-unet-99.9 | #4 | n/a
bert-99 | #8 | #9
bert-99.9 | #7 | #8
dlrm-99 | #6 | #4
dlrm-99.9 | #6 | #4
resnet | #10 | #3
rnnt | #6 | #7
ssd-large | #10 |
4. The Dell EMC DSS 8440 with 16x T4 ranked between #3 and #7 when compared against all submissions using T4
Dell EMC DSS 8440 (16x T4)

Benchmark | Offline | Server
---|---|---
3d-unet-99 | #4 | n/a
3d-unet-99.9 | #4 | n/a
bert-99 | #6 | #4
bert-99.9 | #7 | #5
dlrm-99 | #3 | #3
dlrm-99.9 | #3 | #3
resnet | #6 | #4
rnnt | #5 | #5
ssd-large | #7 | #5
5. The Dell EMC DSS 8440 with 10x RTX6000 ranked between #2 and #6 in 14 out of 16 benchmarks when compared against all submissions
Dell EMC DSS 8440 (10x Quadro RTX6000)

Benchmark | Offline | Server
---|---|---
3d-unet-99 | #4 | n/a
3d-unet-99.9 | #4 | n/a
bert-99 | #4 | #5
bert-99.9 | #4 | #5
dlrm-99 | |
dlrm-99.9 | |
resnet | #5 | #6
rnnt | #2 | #5
ssd-large | #5 | #6
6. Dell EMC DSS 8440 with 10x RTX8000 ranked between #2 and #6 when compared against all submissions
Dell EMC DSS 8440 (10x Quadro RTX8000)

Benchmark | Offline | Server
---|---|---
3d-unet-99 | #5 | n/a
3d-unet-99.9 | #5 | n/a
bert-99 | #5 | #4
bert-99.9 | #5 | #4
dlrm-99 | #3 | #3
dlrm-99.9 | #3 | #3
resnet | #6 | #5
rnnt | #3 | #6
ssd-large | #6 | #5
Get more information on MLPerf results at www.mlperf.org and learn more about PowerEdge servers that are optimized for AI/ML/DL at www.DellTechnologies.com/Servers.
Acknowledgements: These impressive results were made possible by the work of the following Dell EMC and partner team members - Shubham Billus, Trevor Cockrell, Bagus Hanindhito (Univ. of Texas, Austin), Uday Kurkure (VMware), Guy Laporte, Anton Lokhmotov (Dividiti), Bhavesh Patel, Vilmara Sanchez, Rakshith Vasudev, Lan Vu (VMware) and Nicholas Wakou. We would also like to thank our partners – NVIDIA, Intel and Xilinx – for their help and support in the MLPerf v0.7 Inference submissions.
Related Blog Posts

Quantifying Performance of Dell EMC PowerEdge R7525 Servers with NVIDIA A100 GPUs for Deep Learning Inference
Tue, 17 Nov 2020 18:30:15 -0000
|Read Time: 0 minutes
The Dell EMC PowerEdge R7525 server delivered exceptional MLPerf Inference v0.7 results, which indicate that:
- Dell Technologies holds the #1 spot in performance per GPU with the NVIDIA A100-PCIe GPU on the DLRM-99 Server scenario
- Dell Technologies holds the #1 spot in performance per GPU with the NVIDIA A100-PCIe on the DLRM-99.9 Server scenario
- Dell Technologies holds the #1 spot in performance per GPU with the NVIDIA A100-PCIe on the ResNet-50 Server scenario
Summary
In this blog, we provide the performance numbers of our recently released Dell EMC PowerEdge R7525 server with two NVIDIA A100 GPUs across all the MLPerf Inference v0.7 benchmarks. Our results indicate that the PowerEdge R7525 server is an excellent choice for inference workloads. It delivers optimal performance for the different tasks that are in the MLPerf Inference v0.7 benchmark. These tasks include image classification, object detection, medical image segmentation, speech to text, language processing, and recommendation.
The PowerEdge R7525 server is a two-socket, 2U rack server that is designed to run workloads using flexible I/O and network configurations. The PowerEdge R7525 server features the 2nd Gen AMD EPYC processor, supports up to 32 DIMMs, has PCI Express (PCIe) Gen 4.0-enabled expansion slots, and provides a choice of network interface technologies to cover networking options.
The following figure shows the front view of the PowerEdge R7525 server:
Figure 1. Dell EMC PowerEdge R7525 server
The PowerEdge R7525 server is designed to handle demanding workloads and AI applications such as training for different kinds of models and inference for different deployment scenarios. The PowerEdge R7525 server supports various accelerators such as NVIDIA T4, NVIDIA V100S, NVIDIA RTX, and NVIDIA A100 GPUs. The following sections compare the performance of NVIDIA A100 GPUs with NVIDIA T4 and NVIDIA RTX GPUs using MLPerf Inference v0.7 as a benchmark.
The following table provides details of the PowerEdge R7525 server configuration and software environment for MLPerf Inference v0.7:
Component | Description |
Processor | AMD EPYC 7502 32-Core Processor |
Memory | 512 GB (32 GB 3200 MT/s * 16) |
Local disk | 2x 1.8 TB SSD (No RAID) |
Operating system | CentOS Linux release 8.1 |
GPU | NVIDIA A100-PCIe-40G, T4-16G, and RTX8000 |
CUDA Driver | 450.51.05 |
CUDA Toolkit | 11.0 |
Other CUDA-related libraries | TensorRT 7.2, CUDA 11.0, cuDNN 8.0.2, cuBLAS 11.2.0, libjemalloc2, cub 1.8.0, tensorrt-laboratory mlperf branch |
Other software stack | Docker 19.03.12, Python 3.6.8, GCC 5.5.0, ONNX 1.3.0, TensorFlow 1.13.1, PyTorch 1.1.0, torchvision 0.3.0, PyCUDA 2019.1, SacreBLEU 1.3.3, simplejson, OpenCV 4.1.1 |
System profiles | Performance |
For more information about how to run the benchmark, see Running the MLPerf Inference v0.7 Benchmark on Dell EMC Systems.
MLPerf Inference v0.7 performance results
The MLPerf inference benchmark measures how fast a system can perform machine learning (ML) inference using a trained model in various deployment scenarios. The following results represent the Offline and Server scenarios of the MLPerf Inference benchmark. For more information about different scenarios, models, datasets, accuracy targets, and latency constraints in MLPerf Inference v0.7, see Deep Learning Performance with MLPerf Inference v0.7 Benchmark.
In the MLPerf inference evaluation framework, the LoadGen load generator sends inference queries to the system under test, in our case, the PowerEdge R7525 server with various GPU configurations. The system under test uses a backend (for example, TensorRT, TensorFlow, or PyTorch) to perform inferencing and sends the results back to LoadGen.
MLPerf has identified four different scenarios that enable representative testing of a wide variety of inference platforms and use cases. In this blog, we discuss the Offline and Server scenario performance. The main differences between these scenarios are based on how the queries are sent and received:
- Offline—One query with all samples is sent to the system under test. The system under test can send the results back once or multiple times in any order. The performance metric is samples per second.
- Server—Queries are sent to the system under test following a Poisson distribution (to model real-world random events). One query has one sample. The performance metric is queries per second (QPS) within latency bound.
Note: The performance metrics for both the Offline and Server scenarios represent the throughput of the system.
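As a rough, hypothetical sketch of the Server-scenario constraint, the snippet below checks whether a simulated latency distribution stays within a latency bound at a target percentile. Conceptually, the reported Server result is the highest offered QPS for which such a check still passes; the exact percentiles and bounds are defined per benchmark by MLPerf, and the latencies here are simulated, not measured.

```python
import random

def meets_latency_bound(latencies_ms, bound_ms, percentile=0.99):
    """Return True if the chosen percentile of query latencies is within the bound."""
    ordered = sorted(latencies_ms)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx] <= bound_ms

# Simulated per-query latencies for a hypothetical system (not measured data).
rng = random.Random(42)
latencies = [rng.gauss(8.0, 2.0) for _ in range(10_000)]   # milliseconds

# Roughly speaking, the reported Server result is the highest offered QPS for which
# a check like this still passes at the benchmark's latency constraint.
print(meets_latency_bound(latencies, bound_ms=15.0))
```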
In all the benchmarks, two NVIDIA A100 GPUs outperform eight NVIDIA T4 GPUs and three NVIDIA RTX8000 GPUs for the following models:
- ResNet-50 image classification model
- SSD-ResNet34 object detection model
- RNN-T speech recognition model
- BERT language processing model
- DLRM recommender model
- 3D U-Net medical image segmentation model
The following graphs show PowerEdge R7525 server performance with two NVIDIA A100 GPUs, eight NVIDIA T4 GPUs, and three NVIDIA RTX8000 GPUs with 99% accuracy target benchmarks and 99.9% accuracy targets for applicable benchmarks:
- 99% accuracy (default accuracy) target benchmarks: ResNet-50, SSD-Resnet34, and RNN-T
- 99% and 99.9% accuracy (high accuracy) target benchmarks: DLRM, BERT, and 3D-Unet
99% accuracy target benchmarks
ResNet-50
The following figure shows results for the ResNet-50 model:
Figure 2. ResNet-50 Offline and Server inference performance
From the graph, we can derive per-GPU values: we divide the system throughput (across all GPUs) by the number of GPUs, because the results scale nearly linearly with GPU count.
SSD-Resnet34
The following figure shows the results for the SSD-Resnet34 model:
Figure 3. SSD-Resnet34 Offline and Server inference performance
RNN-T
The following figure shows the results for the RNN-T model:
Figure 4. RNN-T Offline and Server inference performance
99.9% accuracy target benchmarks
DLRM
The following figures show the results for the DLRM model with 99% and 99.9% accuracy:
Figure 5. DLRM Offline and Server Scenario inference performance – 99% and 99.9% accuracy
For the DLRM recommender and 3D U-Net medical image segmentation (see Figure 7) models, the 99% and 99.9% accuracy targets yield the same throughput; the 99.9% accuracy constraint is satisfied without any throughput penalty.
BERT
The following figures show the results for the BERT model with 99% and 99.9% accuracy:
Figure 6. BERT Offline and Server inference performance – 99% and 99.9% accuracy
For the BERT language processing model, two NVIDIA A100 GPUs outperform eight NVIDIA T4 GPUs and three NVIDIA RTX8000 GPUs. However, the performance of three NVIDIA RTX8000 GPUs is a little better than that of eight NVIDIA T4 GPUs.
3D U-Net
For the 3D-Unet medical image segmentation model, only the Offline scenario benchmark is available.
The following figure shows the results for the 3D U-Net model Offline scenario:
Figure 7. 3D U-Net Offline inference performance
The following table compares the throughput of two NVIDIA A100 GPUs against eight NVIDIA T4 GPUs and three NVIDIA RTX8000 GPUs for both the 99% and 99.9% accuracy targets; a sketch of how these ratios are computed follows the table:
Model | Scenario | Accuracy | 2 x A100 GPUs vs 8 x T4 GPUs | 2 x A100 GPUs vs 3 x RTX8000 GPUs
---|---|---|---|---
ResNet-50 | Offline | 99% | 5.21x | 2.10x
ResNet-50 | Server | 99% | 4.68x | 1.89x
SSD-Resnet34 | Offline | 99% | 6.00x | 2.35x
SSD-Resnet34 | Server | 99% | 5.99x | 2.21x
RNN-T | Offline | 99% | 5.55x | 2.14x
RNN-T | Server | 99% | 6.71x | 2.43x
DLRM | Offline | 99% | 6.55x | 2.52x
DLRM | Server | 99% | 5.92x | 2.47x
DLRM | Offline | 99.9% | 6.55x | 2.52x
DLRM | Server | 99.9% | 5.92x | 2.47x
BERT | Offline | 99% | 6.26x | 2.31x
BERT | Server | 99% | 6.80x | 2.72x
BERT | Offline | 99.9% | 7.04x | 2.22x
BERT | Server | 99.9% | 6.84x | 2.20x
3D U-Net | Offline | 99% | 5.05x | 2.06x
3D U-Net | Offline | 99.9% | 5.05x | 2.06x
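The ratios in the table are simple throughput quotients between configurations. The sketch below reproduces the ResNet-50 Offline row; the throughput values are illustrative placeholders chosen only so the quotients come out to the cited 5.21x and 2.10x, not the measured results.

```python
# Sketch of how the comparison ratios in the table are formed:
#   speedup = throughput of the 2x A100 configuration / throughput of the other configuration.
# Throughput values are illustrative placeholders chosen to reproduce the cited
# 5.21x and 2.10x ResNet-50 Offline ratios; they are not the measured results.

throughput = {
    "2x A100":    76_000.0,   # ResNet-50 Offline, samples/s (placeholder)
    "8x T4":      14_587.0,   # placeholder
    "3x RTX8000": 36_190.0,   # placeholder
}

baseline = throughput["2x A100"]
for config in ("8x T4", "3x RTX8000"):
    print(f"2x A100 vs {config}: {baseline / throughput[config]:.2f}x")
```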
Conclusion
With support for NVIDIA A100, NVIDIA T4, or NVIDIA RTX8000 GPUs, the Dell EMC PowerEdge R7525 server is an exceptional choice for workloads that involve deep learning inference. The higher throughput that we observed with NVIDIA A100 GPUs translates to performance gains and faster business value for inference applications.
The Dell EMC PowerEdge R7525 server with two NVIDIA A100 GPUs delivers optimal performance for various inference workloads, whether in a batch inference setting such as the Offline scenario or an online inference setting such as the Server scenario.
Next steps
In future blogs, we will discuss sizing the system (server and GPU configurations) correctly based on the type of workload (area and task).

MLPerf Inference v0.7 Benchmarks on PowerEdge R7515 Servers
Tue, 08 Dec 2020 00:14:16 -0000
|Read Time: 0 minutes
Introduction
MLPerf (https://mlperf.org) Inference is a benchmark suite for measuring how fast Machine Learning (ML) and Deep Learning (DL) systems can process input inference data and produce results using a trained model. The benchmarks belong to a diversified set of ML use cases that are popular in the industry and provide a standard for hardware platforms to perform ML-specific tasks. Hence, good performance under these benchmarks signifies a hardware setup that is well optimized for real world ML inferencing use cases.
System under Test (SUT)
- Server – Dell EMC PowerEdge R7515
- GPU – NVIDIA Tesla T4
- Framework – TensorRT™ 7.2.0.14
Dell EMC PowerEdge R7515
Table 1 Dell EMC PowerEdge R7515 technical specifications
Component | Description |
---|---|
System name | PowerEdge R7515 |
Status | Commercially available |
System type | Data center |
Number of nodes | 1 |
Host processor model name | AMD® EPYC® 7702P |
Host processors per node | 1 |
Host processor core count | 64 |
Host processor frequency | 2.00 GHz |
Host memory capacity | 256 GB DDR4, 2933 MHz |
Host storage | 3.2 TB SSD |
Host accelerator | NVIDIA Tesla T4 |
Accelerators per node | 4 |
NVIDIA Tesla T4
The NVIDIA Tesla T4, based on NVIDIA's Turing architecture, is one of the most widely used AI inference accelerators. The Tesla T4 features NVIDIA Turing Tensor cores, which enable it to accelerate many types of neural networks for images, speech, translation, and recommender systems, to name a few. The Tesla T4 supports a wide variety of precisions and accelerates all major DL and ML frameworks, including TensorFlow, PyTorch, MXNet, Chainer, and Caffe2.
Table 2 NVIDIA Tesla T4 technical specifications
Component | Description |
---|---|
GPU architecture | NVIDIA Turing |
NVIDIA Turing Tensor cores | 320 |
NVIDIA CUDA® cores | 2,560 |
Single-precision | 8.1 TFLOPS |
Mixed-precision (FP16/FP32) | 65 TFLOPS |
INT8 | 130 TOPS |
INT4 | 260 TOPS |
GPU memory | 16 GB GDDR6, 320+ GB/s |
ECC | Yes |
Interconnect bandwidth | 32 GB/s |
System interface | X16 PCIe Gen3 |
Form factor | Low profile PCIe |
Thermal solution | Passive |
Compute APIs | CUDA, NVIDIA TensorRT™, ONNX |
Power | 70W |
MLPerf Inference v0.7
The MLPerf inference benchmark measures how fast a system can perform ML inference using a trained model with new data that is provided in various deployment scenarios. Table 3 shows seven mature models that are in the official v0.7 release.
Table 3 MLPerf Inference Suite v0.7
Model | Reference application | Dataset |
---|---|---|
resnet50-v1.5 | vision/classification and detection | ImageNet (224 x 224) |
ssd-mobilenet 300 x 300 | vision/classification and detection | COCO (300 x 300) |
ssd-resnet34 1200 x 1200 | vision/classification and detection | COCO (1200 x 1200) |
bert | language | squad-1.1 |
dlrm | recommendation | Criteo Terabyte |
3d-unet | vision/medical imaging | BraTS 2019 |
rnnt | speech recognition | OpenSLR LibriSpeech Corpus |
The above models serve various critical inference applications or use cases, whose deployment contexts are known as "scenarios." Each scenario uses different metrics and demonstrates performance as it would be seen in a production environment. MLPerf Inference consists of the four evaluation scenarios shown in Table 4:
- Single-stream
- Multi-stream
- Server
- Offline
Table 4 Deployment scenarios
Scenario | Sample use case | Metrics |
---|---|---|
Single-stream | Cell phone augmented reality | Latency in ms |
Multi-stream | Multiple camera driving assistance | Number of streams |
Server | Translation sites | QPS |
Offline | Photo sorting | Inputs/s |
Results
Inference performance is measured in terms of samples and queries. A sample is a unit on which inference is run, such as an image or a sentence. A query is a set of samples that are issued to an inference system together. For a detailed explanation of the definitions, rules, and constraints of MLPerf Inference, see: https://github.com/mlperf/inference_policies/blob/master/inference_rules.adoc#constraints-for-the-closed-division
Default Accuracy refers to a configuration in which the model must achieve at least 99% of the reference model's accuracy; High Accuracy refers to a configuration in which the model must achieve at least 99.9% of the reference model's accuracy.
For the MLPerf Inference v0.7 submissions, Dell EMC used the Offline and Server scenarios because they are most representative of datacenter systems. The Offline scenario represents use cases where inference is run as a batch job (for instance, using AI for photo sorting), while the Server scenario represents interactive inference operations (such as a translation app).
MLPerf Inference results on the PowerEdge R7515
Table 5 PowerEdge R7515 inference results
System | Scenario | 3D-UNet (default) | 3D-UNet (high) | BERT (default) | BERT (high) | DLRM (default) | DLRM (high) | ResNet50 (default) | RNN-T (default) | SSD-ResNet34 (default)
---|---|---|---|---|---|---|---|---|---|---
Dell EMC R7515 (4x T4) | Offline (samples/s) | 28 | 28 | 1,708 | 715 | 126,287 | 126,287 | 23,290 | 5,712 | 535
Dell EMC R7515 (4x T4) | Server (queries/s) | n/a | n/a | 1,249 | 629 | 126,514 | 126,514 | 21,506 | 4,096 | 450
Table 5 above shows the raw performance of the R7515_T4x4 SUT in samples/s for the Offline scenario and queries/s for the Server scenario. Detailed results for this and other configurations can be found at https://mlperf.org/inference-results-0-7/
Figures 1 to 4 below show the inference capabilities of two Dell PowerEdge servers, the R7515 and the R7525. Both are 2U and are powered by AMD processors; the R7515 is single socket, and the R7525 is dual socket. The R7515 used 4x NVIDIA Tesla T4 GPUs, while the R7525 used four different configurations of three NVIDIA GPU accelerators: Tesla T4, Quadro RTX8000, and A100. Each bar indicates the relative number of inference operations completed in a set amount of time while bounded by latency constraints. The higher the bar, the higher the inference capability of the platform.
Figure 1 Offline scenario relative performance with default accuracy for six different benchmarks and five different configurations using R7515_T4x4 as a baseline
Figure 2 Offline scenario relative performance with high accuracy for six different benchmarks and five different configurations using R7515_T4x4 as a baseline
Figure 3 Server scenario relative performance with default accuracy for five different benchmarks and five different configurations using R7515_T4x4 as a baseline
Figure 4 Server scenario relative performance with high accuracy for two different benchmarks and five different configurations using R7515_T4x4 as a baseline
Figure 5 shows the relative price of each GPU configuration, using the cost of the Tesla T4 as the baseline, and the corresponding price/performance. The price/performance shown is an estimate to illustrate the value of the money spent on the GPU configurations and should not be taken as the price/performance of the entire SUT. In this case, the shorter the bar, the better.
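A minimal sketch of that normalization follows, assuming the relative GPU costs quoted in the key takeaways below (the A100x3 configuration at 3x the baseline configuration cost, and the RTX8000 at 2.22x the T4 price); the performance figures are illustrative placeholders, not measured results.

```python
# Hypothetical sketch of the price/performance normalization used for Figure 5.
#   relative price       = configuration GPU cost / baseline GPU cost
#   relative performance = configuration throughput / baseline throughput
#   price/performance    = relative price / relative performance   (lower is better)
# GPU costs follow the relative prices quoted in the key takeaways below; the
# performance figures are illustrative placeholders, not measured results.

baseline_cost = 4 * 1.0        # R7515 with 4x T4, T4 cost normalized to 1.0

configs = {
    "R7525_A100x3":    {"gpu_cost": 3 * 4.0,  "rel_performance": 4.9},   # ~3x baseline cost
    "R7525_RTX8000x3": {"gpu_cost": 3 * 2.22, "rel_performance": 2.3},   # RTX8000 = 2.22x T4 price
}

for name, cfg in configs.items():
    rel_price = cfg["gpu_cost"] / baseline_cost
    price_perf = rel_price / cfg["rel_performance"]
    # A100x3 prints ~0.61, matching the relative price/performance cited below.
    print(f"{name}: relative price {rel_price:.2f}, price/performance {price_perf:.2f}")
```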
Key Takeaways from the results
- Performance is almost linearly proportional to the number of GPU cards. Check Figures 1 to 4 and compare the performance of the R7515_T4x4 with the R7525_T4x8, or the R7525_A100x2 with the R7525_A100x3.
- The relative performance of the R7525_T4x8 is about 2.0 for most benchmarks; it has twice the number of GPUs as the reference system, confirming that GPU count has a significant impact on performance.
- The more expensive GPUs provide better price/performance. From Figure 5, the cost of the R7525_A100x3 configuration is 3x the cost of the reference configuration (R7515_T4x4), but its relative price/performance is 0.61.
- The price of the RTX8000 is 2.22x the price of the Tesla T4, as listed on the Dell website. The RTX8000 can be used in smaller quantities (3 compared to 8x T4) at a lower total cost. From Figure 5, the R7525_RTX8000x3 costs 0.8333x as much as the R7525_T4x8, and it posts better performance and price/performance.
- Generally, Dell Technologies provides server configurations with the flexibility to deploy customer inference workloads on systems that match their requirements:
- The NVIDIA T4 is a low profile, lower power GPU option that is widely deployed for inference due to its superior power efficiency and economic value.
- With 48 GB of GDDR6 memory, the NVIDIA Quadro RTX 8000 is designed to work with memory intensive workloads like creating the most complex models, building massive architectural datasets and visualizing immense data science workloads. Dell is the only vendor that submitted results using NVIDIA Quadro RTX GPUs.
- The NVIDIA A100-PCIe-40G is a powerful platform that is popularly used for training state-of-the-art deep learning models. For customers who are less budget-constrained and have heavy inference computational requirements, its high initial cost is more than offset by its better price/performance.
Conclusion
As shown in the charts above, the Dell EMC PowerEdge R7515 performed well in a wide range of benchmark scenarios. The benchmarks discussed in this paper cover diverse use cases: for instance, object detection on image data (the SSD-ResNet34 model on the COCO dataset), language processing (the BERT model on SQuAD v1.1 for machine comprehension of text), and recommendation (the DLRM model with the Criteo 1 TB click logs dataset).