Introduction

In this blog, we present the MLPerf™ v4.0 Data Center Inference results obtained on a Dell PowerEdge R760 with the latest 5th Generation Intel® Xeon® Scalable Processors (a CPU-only system).

These new Intel® Xeon® processors use an Intel® AMX matrix multiplication engine in each core to boost overall inferencing performance. With a focus on ease of use, Dell Technologies delivers strong out-of-the-box CPU performance through an optimized BIOS profile that takes full advantage of Intel's oneDNN software, which is integrated with both the PyTorch and TensorFlow frameworks. The server configuration and the CPU specifications used in the benchmark experiments are shown in Tables 1 and 2, respectively.

Table 1. Dell PowerEdge R760 Server Configuration

System Name	PowerEdge R760
Status	Available
System Type	Data Center
Number of Nodes	1
Host Processor Model	5th Generation Intel® Xeon® Scalable Processors
Host Processors per Node	2
Host Processor Core Count	64
Host Processor Frequency	1.9 GHz, 3.9 GHz Turbo Boost
Host Memory Capacity	2 TB, 16 x 128 GB 5600 MT/s
Host Storage Capacity	7.68 TB, NVMe

Table 2. 5th Generation Intel® Xeon® Scalable Processor Technical Specifications

Product Collection	5th Generation Intel® Xeon® Scalable Processors
Processor Name	Platinum 8592+
Status	Launched
# of CPU Cores	64
# of Threads	128
Base Frequency	1.9 GHz
Max Turbo Speed	3.9 GHz
Cache L3	320 MB
Memory Type	DDR5 5600 MT/s
ECC Memory Supported	Yes
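
As a quick sanity check on Tables 1 and 2, the per-node totals can be derived with a few lines of arithmetic. The sketch below is only illustrative: the core, thread, DIMM, and transfer-rate figures come from the tables in this post (with the 1 DIMM per channel layout taken from Table 5), while the 8 bytes moved per transfer on each memory channel is an assumption about DDR5 and not something stated here. The resulting bandwidth is a theoretical peak, not a measured value.

```python
# Values taken from Tables 1, 2, and 5 of this post.
SOCKETS = 2
CORES_PER_SOCKET = 64
THREADS_PER_SOCKET = 128
DIMMS = 16
DIMM_SIZE_GB = 128
TRANSFER_RATE_MT_S = 5600

# Assumption (not from the post): each DDR5 channel moves 8 bytes per transfer.
BYTES_PER_TRANSFER = 8


def node_totals():
    """Return aggregate cores, threads, memory, and theoretical peak bandwidth."""
    channels = DIMMS  # 1 DIMM per channel, so one channel per DIMM
    peak_mb_s = channels * TRANSFER_RATE_MT_S * BYTES_PER_TRANSFER
    return {
        "cores": SOCKETS * CORES_PER_SOCKET,
        "threads": SOCKETS * THREADS_PER_SOCKET,
        "memory_gb": DIMMS * DIMM_SIZE_GB,
        "peak_bandwidth_gb_s": peak_mb_s / 1000,
    }


if __name__ == "__main__":
    for name, value in node_totals().items():
        print(f"{name}: {value}")
```

The memory total of 2048 GB matches the 2 TB listed in Table 1.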

MLPerf™ Inference v4.0 - Datacenter

The MLPerf™ inference benchmark measures how fast a system can perform ML inference using a trained model with new data in a variety of deployment scenarios. There are two benchmark suites, one for Datacenter systems and one for Edge. Figure 1 shows the seven models in the official v4.0 release for the Datacenter category, each targeting a different task. All seven were run on this PowerEdge R760 and submitted in the closed division. A dataset and a quality target are defined for each model, as listed in Table 3.

Figure 1. Benchmarked models for MLPerf™ datacenter inference v4.0

Table 3. Datacenter Suite Benchmarks. Source: MLCommons™

Area	Task	Model	Dataset	QSL Size	Quality	Server latency constraint
Vision	Image classification	ResNet50-v1.5	ImageNet (224x224)	1024	99% of FP32 (76.46%)	15 ms
Vision	Object detection	RetinaNet	OpenImages (800x800)	64	99% of FP32 (0.20 mAP)	100 ms
Vision	Medical imaging	3D-Unet	KITS 2019 (602x512x512)	16	99.9% of FP32 (0.86330 mean DICE score)	N/A
Speech	Speech-to-text	RNN-T	Librispeech dev-clean (samples < 15 seconds)	2513	99% of FP32 (1 - WER, where WER=7.452253714852645%)	1000 ms
Language	Language processing	BERT-large	SQuAD v1.1 (max_seq_len=384)	10833	99% of FP32 and 99.9% of FP32 (f1_score=90.874%)	130 ms
Language	Summarization	GPT-J	CNN Dailymail (v3.0.0, max_seq_len=2048)	13368	99% of FP32 (f1_score=80.25% rouge1=42.9865, rouge2=20.1235, rougeL=29.9881)	20 s
Commerce	Recommendation	DLRMv2	Criteo 4TB multi-hot	204800	99% of FP32 (AUC=80.25%)	60 ms
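
The Quality column in Table 3 expresses each target as a fraction of the FP32 reference accuracy. A minimal sketch of how that translates into a minimum acceptable score is shown below. It uses three rows copied from Table 3; the dictionary layout and the function name `minimum_score` are illustrative choices and are not part of the MLPerf™ tooling.

```python
# (FP32 reference score, required fractions) copied from Table 3.
REFERENCE_SCORES = {
    "ResNet50-v1.5 top-1 accuracy (%)": (76.46, (0.99,)),
    "3D-Unet mean DICE": (0.86330, (0.999,)),
    "BERT-large F1 (%)": (90.874, (0.99, 0.999)),
}


def minimum_score(fp32_score, fraction):
    """Lowest score a submission may report and still meet the quality target."""
    return fp32_score * fraction


if __name__ == "__main__":
    for name, (fp32_score, fractions) in REFERENCE_SCORES.items():
        for fraction in fractions:
            target = minimum_score(fp32_score, fraction)
            print(f"{name}: {fraction:.1%} of FP32 -> at least {target:.6g}")
```

For BERT-large, the two fractions correspond to the two accuracy variants listed in the table (99% and 99.9% of FP32).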

Scenarios

The models are deployed in a variety of critical inference applications or use cases known as “scenarios”, where each scenario uses a different metric that reflects how the system performs in a production environment. Each scenario is described below. Table 4 shows the scenarios required for each Datacenter benchmark included in this v4.0 submission.

Offline scenario: represents applications that process input in batches, with all data available immediately and no latency constraint. Performance is measured in samples per second.

Server scenario: represents online applications that receive queries at random times. Performance is measured in queries per second (QPS), subject to a latency bound. The server scenario is more demanding because of its latency constraint and query-arrival pattern, and this is reflected in lower throughput compared with the offline scenario.
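
The difference between the two scenarios can be illustrated with a toy simulation. This is only a sketch under stated assumptions and not the MLPerf™ load generator: it models a single worker with a fixed, hypothetical service time of 10 ms per query, draws Poisson arrivals for the server scenario, and checks the 99th-percentile latency against a bound. The function names and the choice of percentile are illustrative.

```python
import random


def offline_throughput(num_samples, seconds_per_sample):
    """Offline: every sample is available at t=0, so only total time matters."""
    total_seconds = num_samples * seconds_per_sample
    return num_samples / total_seconds


def server_latencies(qps, seconds_per_query, num_queries, seed=0):
    """Server: queries arrive at random times and are served one at a time."""
    rng = random.Random(seed)
    arrival = 0.0
    worker_free_at = 0.0
    latencies = []
    for _ in range(num_queries):
        arrival += rng.expovariate(qps)      # Poisson arrivals at the target QPS
        start = max(arrival, worker_free_at)  # wait if the worker is still busy
        worker_free_at = start + seconds_per_query
        latencies.append(worker_free_at - arrival)
    return latencies


def meets_bound(latencies, bound_seconds, percentile=0.99):
    """True if the chosen latency percentile stays within the bound."""
    ordered = sorted(latencies)
    index = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[index] <= bound_seconds


if __name__ == "__main__":
    service = 0.010  # hypothetical 10 ms per query, not a measured value
    print("offline samples/s:", offline_throughput(1000, service))
    for qps in (1, 50, 90):
        ok = meets_bound(server_latencies(qps, service, 5000), 0.015)
        print(f"server at {qps} QPS meets 15 ms bound: {ok}")
```

In this model the worker could sustain 100 samples per second offline, but queueing delay means the latency bound is only met at a much lower query rate, which is the throughput degradation described above.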

Each Datacenter benchmark requires the following scenarios:

Table 4. Datacenter Suite Benchmark Scenarios. Source: MLCommons™

Area	Task	Required Scenarios
Vision	Image classification	Server, Offline
Vision	Object detection	Server, Offline
Vision	Medical imaging	Offline
Speech	Speech-to-text	Server, Offline
Language	Language processing	Server, Offline
Language	Summarization	Server, Offline
Commerce	Recommendation	Server, Offline

Software stack and system configuration

The software stack and system configuration used for this submission are summarized in Table 5.

Table 5. System Configuration

OS	CentOS Stream 8 (GNU/Linux x86_64)
Kernel	6.7.4-1.el8.elrepo.x86_64
Intel® Optimized Inference SW for MLPerf™	MLPerf™ Intel® oneDNN integrated with Intel® Extension for PyTorch (IPEX)
ECC memory mode	ON
Host memory configuration	2 TB, 16 x 128 GB, 1 DIMM per channel, well balanced
Turbo mode	ON
CPU frequency governor	Performance
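
Two of the settings in Table 5 can be inspected from user space on a Linux host. The sketch below is an illustration and not part of the submission tooling: it assumes the kernel exposes the AMX feature flags `amx_tile`, `amx_int8`, and `amx_bf16` in `/proc/cpuinfo` and the frequency governor under the standard `cpufreq` sysfs path. The helper names are hypothetical.

```python
from pathlib import Path

AMX_FLAGS = ("amx_tile", "amx_int8", "amx_bf16")
GOVERNOR_PATH = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"


def amx_flags_present(cpuinfo_text):
    """Report which AMX feature flags appear in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
    return {name: name in flags for name in AMX_FLAGS}


def read_text_or_none(path):
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cpuinfo = read_text_or_none("/proc/cpuinfo") or ""
    print("AMX flags:", amx_flags_present(cpuinfo))
    print("cpu0 governor:", read_text_or_none(GOVERNOR_PATH))
```

On a system configured as in Table 5, the governor line would be expected to read `performance`.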

What is Intel® AMX (Advanced Matrix Extensions)?

Intel® AMX is a built-in accelerator that enables 5th Gen Intel® Xeon® Scalable processors to optimize deep learning (DL) training and inferencing workloads. With the high-speed matrix multiplications enabled by Intel® AMX, 5th Gen Intel® Xeon® Scalable processors can quickly pivot between optimizing general computing and AI workloads.

Imagine an automobile that could excel at city driving and then quickly shift to deliver Formula 1 racing performance. 5th Gen Intel® Xeon® Scalable processors deliver this level of flexibility. Developers can code AI functionality to take advantage of the Intel® AMX instruction set and code non-AI functionality to use the rest of the processor instruction set architecture (ISA). Intel® has integrated the oneAPI Deep Neural Network Library (oneDNN), its oneAPI DL engine, into popular open-source tools for AI applications, including TensorFlow, PyTorch, PaddlePaddle, and ONNX.

AMX architecture

Intel® AMX architecture consists of two components, as shown in Figure 2:

Tiles consist of eight two-dimensional registers, each 1 kilobyte in size. They store large chunks of data.
Tile Matrix Multiplication (TMUL) is an accelerator engine attached to the tiles that performs matrix-multiply computations for AI.

Figure 2. Intel® AMX architecture consists of 2D register files (tiles) and TMUL
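
To make the tile sizes concrete, the sketch below works through the arithmetic for a single tile operation. The tile count and the 1 kilobyte tile size come from the text above; the layout of 16 rows per tile is taken from Intel's public description of AMX and is an assumption as far as this post is concerned. The pure-Python matrix multiply only mirrors the logical computation (INT8 inputs accumulated into 32-bit results) and does not reproduce the packed memory layout or the instructions the hardware actually uses.

```python
TILE_COUNT = 8      # from the text above
TILE_BYTES = 1024   # 1 kilobyte per tile, from the text above
TILE_ROWS = 16      # assumption: maximum number of rows in one tile
ROW_BYTES = TILE_BYTES // TILE_ROWS  # 64 bytes per row


def tile_shape(element_bytes):
    """Rows and columns of one full tile for a given element size in bytes."""
    return TILE_ROWS, ROW_BYTES // element_bytes


def tile_matmul_int8(a, b, c):
    """Accumulate a (16x64) times b (64x16) into c (16x16), as TMUL does logically."""
    inner = len(b)
    for i in range(len(a)):
        for j in range(len(b[0])):
            c[i][j] += sum(a[i][k] * b[k][j] for k in range(inner))
    return c


if __name__ == "__main__":
    print("total tile storage (bytes):", TILE_COUNT * TILE_BYTES)
    print("INT8 tile shape:", tile_shape(1))
    print("BF16 tile shape:", tile_shape(2))
    print("32-bit accumulator tile shape:", tile_shape(4))

    rows, depth = tile_shape(1)
    a = [[1] * depth for _ in range(rows)]
    b = [[2] * rows for _ in range(depth)]
    c = [[0] * rows for _ in range(rows)]
    tile_matmul_int8(a, b, c)
    print("multiply-accumulates in one tile operation:", rows * rows * depth)
    print("c[0][0]:", c[0][0])
```

Each tile operation in this sketch covers 16 x 16 x 64 multiply-accumulates, which is why keeping operands in tiles suits the dense matrix multiplications that dominate DL inference.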

Results

Both the MLPerf™ v3.1 and MLPerf™ v4.0 benchmark results were obtained on the Dell R760 server, but with different generations of Xeon® CPUs (4th Generation Intel® Xeon® CPUs for MLPerf™ v3.1 versus 5th Generation Intel® Xeon® CPUs for MLPerf™ v4.0) and different optimized software stacks. In this section, we present the two sets of results side by side so that the improvement over the previous submission can be easily observed.

Comparing Performance from MLPerf™ v4.0 to MLPerf™ v3.1

ResNet50 server & offline scenarios:

Figure 3. ResNet50 inference throughput in server and offline scenarios

BERT Large Language Model server & offline scenarios:

Figure 4. BERT Inference results for server and offline scenarios

RetinaNet Object Detection Model server & offline scenarios:

Figure 5. RetinaNet Object Detection Model Inference results for server and offline scenarios

RNN-T Speech-to-Text Model server & offline scenarios:

Figure 6. RNN-T Speech-to-Text Model Inference results for server and offline scenarios

3D-Unet Medical Imaging Model offline scenario:

Figure 7. 3D-Unet Medical Imaging Model Inference results for the offline scenario

DLRMv2-99 Recommendation Model server & offline scenarios:

Figure 8. DLRMv2-99 Recommendation Model Inference results for server and offline scenarios

GPT-J-99 Summarization Model server & offline scenarios:

Figure 9. GPT-J-99 Summarization Model Inference results for server and offline scenarios

Conclusion

The PowerEdge R760 server with 5th Generation Intel® Xeon® Scalable Processors produces strong data center inference performance, confirmed by the official version 4.0 MLPerf™ benchmarking results from MLCommons™.
The high performance and versatility are demonstrated across natural language processing, image classification, object detection, medical imaging, speech-to-text inference, recommendation, and summarization systems.
Compared to its prior version 3.0 and 3.1 submissions enabled by 4th Generation Intel® Xeon® Scalable Processors, the R760 with 5th Generation Intel® Xeon® Scalable Processors shows significant performance improvement across different models, including generative AI models such as GPT-J.
The R760 supports the different deep learning inference scenarios in the MLPerf™ benchmark as well as other complex workloads such as database and advanced analytics. It is an ideal solution for data center modernization to drive operational efficiency, increase productivity, and minimize total cost of ownership (TCO).

References

MLCommons™ MLPerf™ v4.0 Inference Benchmark Submission IDs

ID	Submitter	System
4.0-0026	Dell	Dell PowerEdge Server R760 (2x Intel® Xeon® Platinum 8592+)


MLPerf™ Inference 4.0 on Dell PowerEdge Server with Intel® 5th Generation Xeon® CPU
