1S PowerEdge R7515 has Equivalent T4 GPU Performance to 2S PowerEdge R7425
Summary
The 2nd Gen AMD EPYC™ CPU is a 7nm processor with up to 64 cores, making it a powerhouse for any server. Its impressive specs give it room for generational growth as supporting server hardware progresses to become capable of fully utilizing it. This DfD analyzes how one 64-core AMD CPU in a 1S R7515 produces equivalent T4 GPU performance to two 32-core AMD CPUs in a 2S R7425, and why users looking to run ML inference workloads should consider this 64-core CPU in a 1S server.
Distinguished Next-Gen AMD EPYC™ CPU
The launch of AMD's 2nd Generation EPYC™ (Rome) CPUs shook up the CPU industry by refining AMD's proprietary Zen microarchitecture to new limits. With up to 64 cores, twice the core count of its predecessor (Naples), AMD went above and beyond the traditional tech mold by delivering a product truly worthy of the term "next-gen".
Figure 1 – AMD Rome CPU architecture graphic (a large I/O die in the center, bordered by 8 chiplet dies of 8 cores each)
From a component-spec standpoint, the Rome CPU is twice as capable as the Naples CPU. However, Dell Technologies wanted to confirm its ability to manage dense workloads that stress the processor. This led to various tests executed on the PowerEdge R7515 server, which supports one Rome CPU, and the PowerEdge R7425 server, which supports two Naples CPUs, to record and compare the performance of each CPU generation. Object detection, image classification, and machine translation workloads were run with NVIDIA T4 GPUs assisting the CPU(s).
VDI, IVA and Inference Studies
By executing tests on both servers (Figure 2) for various workloads (Figures 3-7), two factors are examined:
- How the R7515 (Rome) and R7425 (Naples) solutions performed across various Machine Learning inference workloads, accounting for the R7515 solution having eight fewer memory modules.
- How NVIDIA T4 GPU performance compared between both solutions (QPS and inputs per second).
Server Details
Figure 2 – Server configuration details for the 32-core server (left) and 64-core server (right)
Figures 3 through 7 display the performance comparison of a 1S PowerEdge R7515 configured with 4 NVIDIA T4 GPUs and a 2S PowerEdge R7425 with 6 NVIDIA T4 GPUs. Although the bar graphs may not appear equivalent, once the total queries and inputs per second are divided by the total GPU count, the performance per individual GPU is nearly equivalent (see Figure 8).
MobileNet-v1 (ImageNet 224x224)

| Performance Measurement | R7515 (4x T4) | R7425 (6x T4) | 1S - 2S | % Variance |
| --- | --- | --- | --- | --- |
| QPS (x1 T4) | 16254 | 16431 | -177 | -1.08% |
| Input / Second (x1 T4) | 16945 | 16815 | 130 | 0.77% |

ResNet-50 v1.5 (ImageNet 224x224)

| Performance Measurement | R7515 (4x T4) | R7425 (6x T4) | 1S - 2S | % Variance |
| --- | --- | --- | --- | --- |
| QPS (x1 T4) | 4770 | 5098 | -328 | -6.43% |
| Input / Second (x1 T4) | 5397 | 5368 | 29 | 0.54% |

SSD w/ MobileNet-v1 (COCO)

| Performance Measurement | R7515 (4x T4) | R7425 (6x T4) | 1S - 2S | % Variance |
| --- | --- | --- | --- | --- |
| QPS (x1 T4) | 6484 | 6947 | -463 | -6.66% |
| Input / Second (x1 T4) | 7122 | 7268 | -146 | -2.01% |

SSD w/ ResNet-34 (COCO 1200x1200)

| Performance Measurement | R7515 (4x T4) | R7425 (6x T4) | 1S - 2S | % Variance |
| --- | --- | --- | --- | --- |
| QPS (x1 T4) | 100 | 117 | -17 | -14.53% |
| Input / Second (x1 T4) | 129 | 132 | -3 | -2.27% |

GNMT (WMT E-G)

| Performance Measurement | R7515 (4x T4) | R7425 (6x T4) | 1S - 2S | % Variance |
| --- | --- | --- | --- | --- |
| QPS (x1 T4) | 200 | 198 | 2 | 1.01% |
| Input / Second (x1 T4) | 341 | 221 | 120 | 54.30% |
Figure 8 – Per-GPU performance for a single T4, with variance percentages in the last column. Note that negative percentages translate to lower performance for R7515 GPUs.
Now that the data is reduced to a common denominator of one GPU, the performance variance becomes easy to interpret. The inputs per second for image classification and object detection are nearly identical between server configurations, staying within ±3% of one another. Machine translation numbers, however, are heavily boosted by the AMD Rome CPU. The queries per second (QPS) vary a little more but are still very similar. All workloads stay within ±7% of one another, except for the SSD w/ ResNet-34 object detection workload, which shows a 14.53% loss in performance.
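As a worked example of this normalization (a sketch added for illustration; the server totals are back-computed from the per-GPU table values), the ResNet-50 v1.5 QPS row reduces to per-GPU numbers as follows:

```python
# Worked example of the Figure 8 normalization: divide each server's total
# throughput by its GPU count, then compare the per-GPU numbers directly.
r7515_total_qps = 4770 * 4   # 4x T4 in the 1S R7515
r7425_total_qps = 5098 * 6   # 6x T4 in the 2S R7425

per_gpu_r7515 = r7515_total_qps / 4   # 4770 QPS per T4
per_gpu_r7425 = r7425_total_qps / 6   # 5098 QPS per T4

variance = (per_gpu_r7515 - per_gpu_r7425) / per_gpu_r7425 * 100
print(f"per-GPU delta: {per_gpu_r7515 - per_gpu_r7425:+.0f} QPS ({variance:+.2f}%)")
# -> per-GPU delta: -328 QPS (-6.43%), matching the table above
```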
Major Takeaways
This data shows that, despite executing the workloads on a single-socket server, the Rome configuration executes vision and language processing tasks at nearly equivalent performance to the Naples configuration. Knowing this, Dell Technologies customers can weigh the following takeaways for their next PowerEdge configuration order:
- A single-socket 64-core AMD Rome CPU performs at near equivalence to two 32-core AMD Naples CPUs for vision and language processing tasks. This means that inference workloads in the AI space can perform effectively with fewer components loaded in the server. Customers running workloads, such as inference, that are not impacted by a reduction in total system memory capacity are therefore great candidates for switching from 2S to 1S platforms.
- Non-Uniform Memory Access (NUMA) memory and I/O performance issues associated with 2S platforms are avoided with the 1S R7515 Rome configuration. This benefits I/O- and memory-intensive workloads because data transfers are localized to one socket, avoiding the associated latency and bandwidth penalties (a quick way to inspect a server's NUMA topology is sketched after this list).
- 64-core single socket servers typically offer better value due to the amortization of system components.
- Reducing the number of CPUs and memory modules reduces total power consumption.
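To make the NUMA point concrete, below is a minimal, Linux-only sketch (not part of the original study) that reads the kernel's sysfs topology files and lists each NUMA node with its local CPUs. On a 1S R7515 at default BIOS settings (NPS1) you would typically see a single node, so no remote-memory hops are possible; a 2S platform reports multiple nodes.

```python
# Minimal sketch: list NUMA nodes and their local CPUs on a Linux host.
import glob
import os

for node_path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node = os.path.basename(node_path)
    # cpulist holds the CPU ranges local to this node, e.g. "0-63"
    with open(os.path.join(node_path, "cpulist")) as f:
        cpus = f.read().strip()
    print(f"{node}: CPUs {cpus}")
```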
Conclusion
One 2nd Generation AMD EPYC™ (Rome) CPU can support AI vision and language processing tasks at near-equivalent performance to two 1st Generation AMD EPYC™ (Naples) CPUs. The advantages attached to this generational leap, such as increased cost-effectiveness, will appeal to many PowerEdge users and should be considered for future solutions.
Related Documents
Efficient Machine Learning Inference on Dell EMC PowerEdge R7525 and R7515 Servers using NVIDIA GPUs
Summary
Dell EMC™ participated in the MLPerf™ Consortium v0.7 result submissions for machine learning. This DfD presents results for two AMD-based PowerEdge™ server platforms, the R7515 and R7525. The results show that Dell EMC servers with AMD processors, when paired with various NVIDIA GPUs, offer the industry-leading inference performance and flexibility required to match the compute requirements of AI workloads.
MLPerf Inference Benchmarks
MLPerf (https://mlperf.org) Inference is a benchmark suite for measuring how fast Machine Learning (ML) and Deep Learning (DL) systems can process inputs and produce results using a trained model. The benchmarks represent a diverse set of ML use cases that are popular in the industry and demand competitive hardware for ML-specific tasks. Good performance under these benchmarks therefore signifies a hardware setup that is well optimized for real-world ML inferencing use cases. The second iteration of the suite (v0.7) has evolved to represent relevant industry use cases in the datacenter and at the edge. Users can compare overall system performance in AI use cases of natural language processing, medical imaging, recommendation systems, and speech recognition, as well as different use cases in computer vision.
MLPerf Inference v0.7
The MLPerf inference benchmark measures how fast a system can perform ML inference using a trained model with new data in a variety of deployment scenarios. Table 1 below lists the seven mature models included in the official v0.7 release:
| Model | Reference Application | Dataset |
| --- | --- | --- |
| resnet50-v1.5 | vision / classification and detection | ImageNet (224x224) |
| ssd-mobilenet 300x300 | vision / classification and detection | COCO (300x300) |
| ssd-resnet34 1200x1200 | vision / classification and detection | COCO (1200x1200) |
| bert | language | squad-1.1 |
| dlrm | recommendation | Criteo Terabyte |
| 3d-unet | vision / medical imaging | BraTS 2019 |
| rnnt | speech recognition | OpenSLR LibriSpeech Corpus |

Table 1 – Models included in the official MLPerf Inference v0.7 release
The above models serve in a variety of critical inference applications or use cases known as "scenarios". Each scenario requires different metrics, demonstrating production-environment performance in real practice. MLPerf Inference consists of four evaluation scenarios: single-stream, multi-stream, server, and offline. See Table 2 below; a minimal example of driving one of these scenarios follows the table:
| Scenario | Sample Use Case | Metrics |
| --- | --- | --- |
| SingleStream | Cell phone augmented reality | Latency in milliseconds |
| MultiStream | Multiple-camera driving assistance | Number of streams |
| Server | Translation site | QPS |
| Offline | Photo sorting | Inputs/second |

Table 2 – MLPerf Inference evaluation scenarios and their metrics
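For illustration, here is a minimal sketch of driving a system under test with the MLPerf LoadGen Python bindings in the Offline scenario. The `run_model` function is a hypothetical stand-in for real inference, and exact LoadGen signatures vary slightly between releases, so treat this as a sketch rather than a drop-in harness:

```python
# Minimal sketch of an MLPerf LoadGen Offline run (assumes the
# mlperf_loadgen Python package; run_model() is a hypothetical stub).
import array
import mlperf_loadgen as lg
import numpy as np

def run_model(indices):
    # Hypothetical batched inference: one result array per sample index.
    return [np.zeros(1, dtype=np.float32) for _ in indices]

def issue_queries(query_samples):
    # LoadGen hands the SUT a batch of samples; report completions back.
    results = run_model([qs.index for qs in query_samples])
    buffers, responses = [], []
    for qs, res in zip(query_samples, results):
        buf = array.array("B", res.tobytes())
        buffers.append(buf)  # keep result buffers alive until completion
        addr, _ = buf.buffer_info()
        responses.append(lg.QuerySampleResponse(qs.id, addr, len(buf)))
    lg.QuerySamplesComplete(responses)

def flush_queries():
    pass

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline  # batch job, measured in inputs/second
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(1024, 1024, lambda s: None, lambda s: None)
lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```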
Executing Inference Workloads on Dell EMC PowerEdge
The PowerEdge™ R7515 and R7525 coupled with NVIDIA GPUs were chosen for inference performance benchmarking because they support the precisions and capabilities required for demanding inference workloads.
Dell EMC PowerEdge™ R7515
The Dell EMC PowerEdge R7515 is a 2U, AMD-powered server that supports a single 2nd Generation AMD EPYC (Rome) processor with up to 64 cores in a single socket. With 8x memory channels, it features 16x memory module slots, for a potential of 2TB using 128GB memory modules in all 16 slots. 3-Dimensional Stack DIMMs (3DS DIMMs) are also supported.
SATA, SAS, and NVMe drives are supported on this chassis, with several storage options to choose from depending on the workload. Chassis configurations include:
- 8 x 3.5-inch hot plug SATA/SAS drives (HDD)
- 12 x 3.5-inch hot plug SATA/SAS drives (HDD)
- 24 x 2.5-inch hot plug SATA/SAS/NVMe drives
The R7515 is a general-purpose platform capable of handling demanding workloads and applications, such as data warehouses, ecommerce, databases, and high-performance computing (HPC). The server also provides extraordinary storage capacity options, making it well suited for data-intensive applications without sacrificing I/O performance. The R7515 benchmark configuration used in testing is shown in Table 3.
Table 3 – R7515 benchmarking configuration
Dell EMC PowerEdge™ R7525
The Dell EMC PowerEdge R7525 is a 2-socket, 2U rack server designed to run complex workloads using highly scalable memory, I/O capacity, and network options. The system is based on the 2nd Gen AMD EPYC processor (up to 64 cores), has up to 32 DIMMs, PCI Express (PCIe) 4.0-enabled expansion slots, and supports up to three double-wide 300W or six single-wide 75W accelerators.
SATA, SAS, and NVMe drives are supported on this chassis, with several storage options to choose from depending on the workload. Storage configurations include:
- Front Bays
- Up to 24 x 2.5” NVMe
- Up to 16 x 2.5” SAS/SATA (SSD/HDD) and NVMe
- Up to 12 x 3.5” SAS/SATA (HDD)
- Up to 2 x 2.5” SAS/SATA/NVMe (HDD/SSD)
- Rear Bays
- Up to 2 x 2.5” SAS/SATA/NVMe (HDD/SSD)
The R7525 is a highly adaptable and powerful platform capable of handling a variety of demanding workloads while also providing flexibility. The R7525 benchmark configuration used in testing can be seen in Table 4.
Table 4 – R7525 benchmarking configuration
NVIDIA Technologies Used for Efficient Inference
NVIDIA® Tesla T4
The NVIDIA Tesla T4, based on NVIDIA's Turing™ architecture, is one of the most widely used AI inference accelerators. The Tesla T4 features NVIDIA Turing Tensor Cores, which enable it to accelerate all types of neural networks for images, speech, translation, and recommender systems, to name a few. The Tesla T4 supports a wide variety of precisions and accelerates all major DL and ML frameworks, including TensorFlow, PyTorch, MXNet, Chainer, and Caffe2.
For more details on NVIDIA Tesla T4, please refer to https://www.nvidia.com/en-us/data-center/tesla-t4/
NVIDIA® Quadro RTX™ 8000
NVIDIA® Quadro® RTX™ 8000, powered by the NVIDIA Turing™ architecture and the NVIDIA RTX platform, combines unparalleled performance and memory capacity to deliver the world’s most powerful graphics card solution for professional workflows. With 48 GB of GDDR6 memory, the NVIDIA Quadro RTX 8000 is designed to work with memory intensive workloads that create complex models, build massive architectural datasets and visualize immense data science workloads.
For more details on NVIDIA® Quadro® RTX™ 8000, please refer to https://www.nvidia.com/en-us/design-visualization/quadro/rtx-8000/
NVIDIA® A100-PCIE
The NVIDIA A100 Tensor Core GPU is the flagship product of the NVIDIA data center platform for deep learning, HPC, and data analytics. The platform accelerates over 700 HPC applications and every major deep learning framework. It’s available everywhere, from desktops to servers to cloud services, delivering both dramatic performance gains and cost-saving opportunities.
For more details, please refer to https://www.nvidia.com/en-us/data-center/a100/
NVIDIA Inference Software Stack for GPUs
At its core, NVIDIA TensorRT™ is a C++ library designed to optimize deep learning inference performance on systems that contain NVIDIA GPUs. It supports models trained in most major deep learning frameworks, including, but not limited to, TensorFlow, Caffe, PyTorch, and MXNet. After the neural network is trained, TensorRT enables the network to be compressed, optimized, and deployed as a runtime without the overhead of a framework. It supports FP32, FP16, and INT8 precisions. To optimize the model, TensorRT builds an inference engine out of the trained model by analyzing the layers of the model and eliminating layers whose output is not used, or by combining operations to perform faster calculations. The result of all these optimizations is improved latency, throughput, and efficiency. TensorRT is available on NVIDIA NGC.
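As a rough illustration of that workflow, here is a minimal sketch using the TensorRT Python API (assuming a TensorRT 8.x install; "model.onnx" and "model.plan" are hypothetical file names, not artifacts from this study):

```python
# Minimal sketch: build an optimized TensorRT inference engine from a
# trained network exported to ONNX, then serialize it for deployment.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # hypothetical trained model
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # request reduced precision where supported

# TensorRT analyzes the graph, drops unused layers, and fuses ops here.
engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine)
```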
MLPerf v0.7 Performance Results and Key Takeaways
Figures 1 and 2 below show the inference capabilities of the PowerEdge R7515 and PowerEdge R7525 configured with different NVIDIA GPUs. Each bar indicates the relative performance of inference operations completed while meeting certain latency constraints; the higher the bar, the higher the inference capability of the platform. Details on the different scenarios used in the MLPerf inference tests (server and offline) are available at the MLPerf website. The offline scenario represents use cases where inference is done as a batch job (such as using AI for photo sorting), while the server scenario represents an interactive inference operation (such as a translation app). The relative performance of the different servers is plotted below to show the inference capabilities and flexibility that can be achieved using these platforms:
Offline Performance
Figure 1 – Offline scenario relative performance for five different benchmarks and four different server configs, using the R7515 (4x T4) as a baseline
Server Performance
Figure 2 – Server scenario relative performance for five different benchmarks and four different server configs, using the R7515 (4x T4) as a baseline
The R7515 and R7525 offer configuration flexibility to address inference performance and datacenter requirements around power and cost. Inference applications can be deployed on an AMD single-socket system without compromising accelerator support, storage, or I/O capacities, or on a dual-socket system with configurations that support higher capabilities. Both platforms support PCIe Gen4 links for the latest GPU offerings, such as the A100 and the upcoming PCIe Gen4-capable Radeon Instinct MI100 GPUs from AMD.
The Dell PowerEdge platforms offer a variety of PCIe riser options that enable support for multiple low-profile GPUs (up to 8 T4) or up to 3 full-height, double-wide GPU accelerators (RTX 8000 or A100). Customers can choose the GPU model and the number of GPUs based on workload requirements and to fit their datacenter power and density needs. Figure 3 shows a relative comparison of the GPUs used in the MLPerf study from a performance, power, price, and memory point of view. The specs for the different GPUs supported on Dell platforms, along with server recommendations, are covered in previous DfDs.
Figure 3 – Relative comparisons between the A100, RTX 8000, and T4 GPUs for various metrics
Conclusion
As demonstrated by the MLPerf performance results, inference workloads executed on the Dell EMC PowerEdge R7515 and Dell EMC PowerEdge R7525 performed well in a wide range of benchmark scenarios. These results can serve as a guide to help identify the configuration that matches your inference requirements.
Understanding the Value of AMD's Socket-to-Socket Infinity Fabric
Summary
AMD socket-to-socket Infinity Fabric increases CPU-to-CPU transaction speeds by allowing multiple sockets to communicate directly with one another through dedicated lanes. This DfD explains what the socket-to-socket Infinity Fabric interconnect is, how it functions and provides value, and how users can gain additional value by dedicating one of the x16 links for use as a PCIe bus for NVMe drives or GPUs.
Introduction
Prior to the socket-to-socket Infinity Fabric (IF) interconnect, CPU-to-CPU communications on AMD platforms generally took place over the HyperTransport (HT) bus. This pathway served multi-socket servers well during the lifespan of HT, but emerging technologies pushed for a solution that would increase data transfer speeds and allow for combo links.
AMD released socket-to-socket Infinity Fabric (also known as xGMI) to resolve these bottlenecks. Dedicated IF links for direct CPU-to-CPU communications allow for greater data transfer speeds, so multi-socket server users can do more work in the same amount of time as before.
How Socket-to-Socket Infinity Fabric Works
IF is the external socket-to-socket interface for 2-socket servers. The architecture used for IF links is a combination serializer/deserializer (SerDes) that can carry both PCIe and xGMI traffic, allowing for sixteen lanes per link and a great deal of platform flexibility. xGMI2 is the current generation, with per-lane speeds of up to 18Gbps, faster than the PCIe Gen4 speed of 16Gbps. Each IF lane connects one CPU I/O die to the other, directly linking the two CPUs to one another. Most dual-socket servers have three to four IF links dedicated to CPU connections. Figure 1 depicts a high-level illustration of how socket-to-socket IF links connect across CPUs.
Figure 1 – Four socket-to-socket IF links connecting two CPUs
The Value of Infinity Fabric Interconnect
Socket to socket IF interconnect creates several advantages for PowerEdge customers:
- Dedicated IF lanes are routed directly from one CPU to the other CPU, ensuring inter-socket communications travel the shortest distance possible
- xGMI2 speeds (18Gbps per lane) exceed those of PCIe Gen4, allowing for extremely fast inter-socket data transfers
Furthermore, if customers require additional PCIe lanes for peripheral components, such as NVMe drives or GPUs, one of the four IF links is a cabled connection that can be repurposed as PCIe lanes. AMD's highly optimized and flexible link topologies enable sixteen lanes of Infinity Fabric per socket to be repurposed. This means that 2S AMD servers, such as the PowerEdge R7525, gain thirty-two additional lanes, for a total of 160 PCIe lanes for peripherals. Figure 2 below illustrates what this looks like:
Figure 2 – Diagram showing additional PCIe lanes available in a 2S configuration
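For illustration, the lane and bandwidth arithmetic above can be sanity-checked with a few lines (the 128-lane baseline for a 2S EPYC platform is an assumption, consistent with the 160-lane total quoted above):

```python
# Sanity-check the Infinity Fabric lane/bandwidth arithmetic from the text.
lanes_per_if_link = 16        # sixteen lanes per IF link
xgmi2_gbps_per_lane = 18      # xGMI2 per-lane speed, vs 16Gbps for PCIe Gen4

per_link_gbps = lanes_per_if_link * xgmi2_gbps_per_lane  # 288 Gbps per link
extra_pcie_lanes = lanes_per_if_link * 2                 # one link repurposed per socket = 32
total_pcie_lanes = 128 + extra_pcie_lanes                # 160 lanes for peripherals

print(per_link_gbps, extra_pcie_lanes, total_pcie_lanes)  # 288 32 160
```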
Conclusion
AMD's socket-to-socket Infinity Fabric interconnect replaced the former HyperTransport interconnect to allow massive amounts of data to travel fast enough to avoid speed bottlenecks. Furthermore, customers needing additional PCIe lanes can repurpose one of the four IF links for peripheral support. These advantages allow AMD-based PowerEdge servers, such as the R7525, to meet customer needs.