Choosing a PowerEdge Server and NVIDIA GPUs for AI Inference at the Edge

Fri, 05 May 2023 16:38:19 -0000

Fabricio Bronzati

Manpreet Sokhi

Rakshith Vasudev

Frank Han

Dell Technologies submitted several benchmark results for the latest MLCommons™ Inference v3.0 benchmark suite. One objective was to help customers choose a suitable server and GPU combination for their workload. This blog reviews the Edge benchmark results and explains how to determine the best server and GPU configuration for different types of ML applications.

Results overview

For computer vision workloads, which are widely used in security systems, industrial applications, and even self-driving cars, ResNet and RetinaNet results were submitted. ResNet is an image classification task and RetinaNet is an object detection task. The following figures show that for intensive processing, the NVIDIA A30 GPU, which is a double-wide card, provides the best performance, with almost two times more images per second than the NVIDIA L4 GPU. However, the NVIDIA L4 GPU is a single-wide card that requires only 43 percent of the power of the NVIDIA A30 GPU, based on the nominal Thermal Design Power (TDP) of each GPU. This low power consumption is a great advantage for applications that must limit power draw or that run in environments that are challenging to cool. The NVIDIA L4 GPU is the replacement for the best-selling NVIDIA T4 GPU and provides twice the performance in the same form factor. We therefore consider this card the best option for most Edge AI workloads.

Conversely, the NVIDIA A2 GPU has the lowest price, power consumption (TDP), and performance level among the available options. Therefore, if the application is compatible with this GPU, it has the potential to deliver the lowest total cost of ownership (TCO).

Figure 1: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the ResNet Offline benchmark

Figure 2: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the RetinaNet Offline benchmark
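
The 43 percent power figure quoted above can be checked from nominal TDP values. The following Python sketch assumes TDPs of 72 W for the NVIDIA L4 GPU and 165 W for the NVIDIA A30 GPU. These values come from NVIDIA's published specifications rather than from this blog, and the function name `tdp_ratio` is ours, for illustration only.

```python
# Nominal TDP values in watts. These are assumptions taken from NVIDIA's
# published specifications, not from the benchmark results in this blog.
NOMINAL_TDP_W = {"L4": 72.0, "A30": 165.0}


def tdp_ratio(gpu: str, baseline: str, tdp_w: dict = NOMINAL_TDP_W) -> float:
    """Return the TDP of `gpu` as a fraction of the TDP of `baseline`."""
    return tdp_w[gpu] / tdp_w[baseline]


if __name__ == "__main__":
    ratio = tdp_ratio("L4", "A30")
    print(f"L4 nominal TDP is {ratio:.1%} of A30 nominal TDP")
```

Under these assumptions the ratio is a little under 44 percent, which is in line with the roughly 43 percent cited. Note that TDP is a nominal limit, so measured power under a real workload can differ.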

The 3D-UNet benchmark is the other computer vision benchmark. It uses medical images for volumetric segmentation. We saw the same results for the default accuracy and high accuracy targets. Again, the NVIDIA A30 GPU delivered significantly better performance than the NVIDIA L4 GPU. However, the trade-offs in energy consumption, space, and cooling capacity discussed previously still apply when considering which GPU to use for each application and use case.

Figure 3: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the 3D-UNet Offline benchmark

Another important benchmark is BERT, a natural language processing (NLP) model that performs tasks such as question answering and text summarization. We observed similar performance differences between the NVIDIA A30, L4, T4, and A2 GPUs. In the following figure, higher values are better.

Figure 4: Performance comparison of NVIDIA A30, L4, T4, and A2 GPUs for the BERT Offline benchmark

MLPerf benchmarks also include latency results, which measure the time that a system takes to process a request. For some use cases, this processing time can be more critical than the number of requests that can be processed per second. For example, if a conversational application or an object detection query that needs a real-time response takes several seconds to answer, the delay can significantly degrade the experience of the user or application.
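
To make the difference between latency and throughput concrete, the following Python sketch summarizes a list of per-request latencies in both ways. The sample values are made up, the helper names are hypothetical, and the 90th and 99th percentiles are chosen for illustration; check the MLPerf Inference rules for the exact metric used by each scenario.

```python
import math


def percentile_latency(latencies_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    `pct` percent of all samples are less than or equal to it."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def sequential_throughput(latencies_ms):
    """Requests per second when requests are served strictly one at a time."""
    total_seconds = sum(latencies_ms) / 1000.0
    return len(latencies_ms) / total_seconds


if __name__ == "__main__":
    # Made-up latencies in milliseconds, with one slow outlier.
    samples = [10, 12, 11, 13, 50, 12, 11, 10, 12, 11]
    print("p90 latency (ms):", percentile_latency(samples, 90))
    print("p99 latency (ms):", percentile_latency(samples, 99))
    print("throughput (req/s):", round(sequential_throughput(samples), 1))
```

A single slow request barely changes the throughput figure but dominates the high percentiles, which is why latency-sensitive applications look at tail latency rather than at throughput alone.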

As shown in the following figures, the NVIDIA A30 and NVIDIA L4 GPUs have similar latency results. Which of the two provides the lowest latency depends on the workload. For customers planning to replace the NVIDIA T4 GPU or seeking a better response time for their applications, the NVIDIA L4 GPU is an excellent option. The NVIDIA A2 GPU can also be used for applications that require low latency because it performed better than the NVIDIA T4 GPU in single-stream workloads. In the figures, lower values are better.

Figure 5: Latency comparison of NVIDIA A30, L4, T4, and A2 GPUs for the ResNet single-stream and multistream benchmark

Figure 6: Latency comparison of NVIDIA A30, L4, T4, and A2 GPUs for the RetinaNet single-stream and multistream benchmark and the BERT single-stream benchmark

Dell Technologies submitted results to several benchmarks to help identify which configuration is the most environmentally friendly, because the carbon footprint of the data center is a concern today. Power efficiency also matters because some edge locations have power and cooling limitations. Therefore, it is important to understand performance relative to power consumption.

The following figure shows that the NVIDIA L4 GPU delivers equal or better performance per watt than the NVIDIA A2 GPU, even though it consumes more power. For Throughput and Perf/watt values, higher is better; for Power (watt) values, lower is better.

Figure 7: NVIDIA L4 and A2 GPU power consumption comparison
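
Performance per watt is simply throughput divided by power. The short Python sketch below shows how a card that draws more power can still be the more efficient one. The card names and numbers are hypothetical placeholders, not measured results from this blog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    name: str
    throughput: float  # samples per second (higher is better)
    power_w: float     # average power in watts (lower is better)

    @property
    def perf_per_watt(self) -> float:
        """Samples per second delivered per watt (higher is better)."""
        if self.power_w <= 0:
            raise ValueError("power must be positive")
        return self.throughput / self.power_w


# Hypothetical placeholder values, for illustration only.
card_x = Measurement("card_x", throughput=1200.0, power_w=60.0)
card_y = Measurement("card_y", throughput=2400.0, power_w=100.0)

if __name__ == "__main__":
    for card in (card_x, card_y):
        print(f"{card.name}: {card.perf_per_watt:.1f} samples/s per watt")
```

In this example, `card_y` draws more power than `card_x` but completes more work per watt, which is the same pattern described above for the NVIDIA L4 and A2 GPUs.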

Conclusion

Based on the measured MLPerf Inference v3.0 results, we can conclude that all the NVIDIA GPUs tested for Edge workloads have characteristics that address several use cases. Customers must evaluate size, performance, latency, power consumption, and price. Depending on the requirements of the application, one of the evaluated GPUs will be the better fit for the final use case.
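
One way to apply the criteria above is to filter the candidate GPUs by the hard limits of the deployment and then pick the cheapest card that remains. The following Python sketch illustrates that idea only. The `Candidate` fields, the `pick_gpu` helper, and all values are hypothetical placeholders that do not describe any real GPU.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    slot_width: int     # 1 = single-wide card, 2 = double-wide card
    throughput: float   # samples per second
    latency_ms: float   # tail latency in milliseconds
    power_w: float      # nominal power in watts
    price: float        # relative price, arbitrary units


def pick_gpu(candidates: Iterable[Candidate], *, max_slots: int,
             max_power_w: float, max_latency_ms: float,
             min_throughput: float) -> Optional[Candidate]:
    """Return the cheapest candidate that meets every limit, or None."""
    feasible = [
        c for c in candidates
        if c.slot_width <= max_slots
        and c.power_w <= max_power_w
        and c.latency_ms <= max_latency_ms
        and c.throughput >= min_throughput
    ]
    return min(feasible, key=lambda c: c.price, default=None)


# Hypothetical placeholder candidates, for illustration only.
CANDIDATES = [
    Candidate("big", slot_width=2, throughput=4000, latency_ms=1.0, power_w=165, price=5.0),
    Candidate("mid", slot_width=1, throughput=2000, latency_ms=1.2, power_w=72, price=3.0),
    Candidate("small", slot_width=1, throughput=500, latency_ms=3.0, power_w=60, price=1.0),
]

if __name__ == "__main__":
    choice = pick_gpu(CANDIDATES, max_slots=1, max_power_w=75,
                      max_latency_ms=2.0, min_throughput=1000)
    print(choice.name if choice else "no candidate meets the limits")
```

Choosing the cheapest feasible card is only one possible policy. A deployment that is limited by power rather than by budget could rank the feasible candidates by performance per watt instead.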

Another important conclusion is that the NVIDIA L4 GPU is an exceptional upgrade for customers and applications running on NVIDIA T4 GPUs. Because one NVIDIA L4 GPU can provide twice the performance of the NVIDIA T4 GPU for some workloads, migrating to this new GPU can help consolidate equipment, reduce power consumption, and reduce the TCO.

With these benchmark submissions, Dell Technologies demonstrates the broad Dell portfolio that provides the infrastructure for any type of customer requirement.

The following blogs provide analyses of other MLPerf™ benchmark results:

References

For more information about Dell PowerEdge servers, go to the following links:

For more information about NVIDIA GPUs, go to the following links:

MLCommons™ Inference v3.0 results presented in this document are based on the following system IDs:

ID	Submitter	Availability	System
2.1-0005	Dell Technologies	Available	Dell PowerEdge XE2420 (1x T4, TensorRT)
2.1-0017	Dell Technologies	Available	Dell PowerEdge XR4520c (1x A2, TensorRT)
2.1-0018	Dell Technologies	Available	Dell PowerEdge XR4520c (1x A30, TensorRT)
2.1-0019	Dell Technologies	Available	Dell PowerEdge XR4520c (1x A2, MaxQ, TensorRT)
2.1-0125	Dell Technologies	Preview	Dell PowerEdge XR5610 (1x L4, TensorRT, MaxQ)
2.1-0126	Dell Technologies	Preview	Dell PowerEdge XR7620 (1x L4, TensorRT)

Table 1: MLPerf™ system IDs
