AI Acceleration using Red Hat OpenShift with Dell PowerEdge Servers with 4th Gen Intel® Xeon® Processors

Mon, 13 May 2024 19:46:21 -0000

Delmar Hernandez

Manya Rastogi

Rajesh Poornachandran - Intel

Divakar Mariyanna - Intel

Ryan Saffores - Intel

Veenadhari Bedida - Intel

Summary

Intel® Xeon® Scalable processors feature built-in accelerators that increase per-core performance and speed up AI workloads on the CPU, along with advanced security technologies for the most in-demand workload requirements, while offering cloud choice and application portability[1]. Red Hat OpenShift (RHOS)[2] provides a robust platform for running Large Language Model (LLM) inference and fine-tuning experiments. Red Hat OpenShift Container Platform (RHOCP) leverages Kubernetes containerization technology, allowing us to package the LLM and its dependencies in a container for ease of deployment and portability. This ensures consistent and isolated execution across different environments. To demonstrate the combined benefits of the advanced hardware and software products, including full end-to-end orchestration, Dell and Intel recently conducted LLM Artificial Intelligence (AI) performance testing. This document summarizes the key features incorporated at the system level, along with performance results for both LLM fine-tuning and inference use cases.

Solution overview

OpenShift is a family of containerization software products developed by Red Hat. Its flagship product is the OpenShift Container Platform - a hybrid cloud platform as a service built around Linux containers orchestrated and managed by Kubernetes on a foundation of Red Hat Enterprise Linux[3].

Key features of the 4th Gen Intel Xeon Scalable processors and the software stack used for this test include:

New Intel Advanced Matrix Extensions (AMX) capabilities[4]
Improved Intel Advanced Vector Extensions (AVX) performance
The new Intel Extension for PyTorch® open-source solution[5]
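
Before running a workload that relies on these instruction sets, it can be useful to confirm that the host exposes them. The following is a minimal sketch, not part of the tested configuration: it assumes a Linux host and reads the feature flags that the kernel reports in /proc/cpuinfo. The helper names (parse_cpu_flags, accelerator_report) are hypothetical and chosen only for illustration.

```python
from pathlib import Path

# Feature-flag names as the Linux kernel reports them in /proc/cpuinfo.
AMX_FLAGS = ("amx_tile", "amx_bf16", "amx_int8")
AVX512_FLAGS = ("avx512f", "avx512_vnni", "avx512_bf16")


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # All logical CPUs normally report the same flags; use the first.
            return set(value.split())
    return set()


def accelerator_report(flags: set[str]) -> dict[str, bool]:
    """Map each AMX/AVX-512 flag of interest to whether it is present."""
    return {name: name in flags for name in AMX_FLAGS + AVX512_FLAGS}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    for name, present in accelerator_report(parse_cpu_flags(text)).items():
        print(f"{name}: {'yes' if present else 'no'}")
```

Inside a container, the flags shown are those of the underlying host CPU, so the same check can be run from a pod to confirm that it was scheduled on a node with AMX support.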

System configurations tested

To conduct the testing, we first deployed a 16th generation Dell PowerEdge R760 with Red Hat Enterprise Linux 8.8 as an “Administration node”. Next, we deployed a cluster of three 16th generation Dell PowerEdge R660s with Red Hat Enterprise Linux CoreOS 4.13.92 as the “Control Plane” nodes providing the Kubernetes services. These systems were chosen based on hardware availability and provide administration and orchestration of the OpenShift cluster. Table 1 shows the hardware configuration used; Table 2 shows the associated software configuration.

Hardware configuration

Table 1. Hardware configuration

	Admin Node	Control Plane Node
System	Dell Inc. PowerEdge R760	Dell Inc. PowerEdge R660
CPU Model	Intel Xeon Platinum 8452Y	Intel Xeon Platinum 8452Y
Sockets	2	2
Cores per Socket	36	36
All Core Turbo Freq	2.8GHz	2.8GHz
TDP	300W	300W
Memory	1024GB (16x64GB DDR5 4800 MT/s)	1024GB (16x64GB DDR5 4800 MT/s)
Microcode	0x2b0001b0	0x2b0001b0
Test Date	Tested by Intel as of 11/30/23	Tested by Intel as of 11/30/23

Software configuration

Table 2. Software configuration

Component	Version
Kernel	5.14.0-284.18.1.el9_2.x86_64
OS	RHEL CoreOS 4.13.92
RHOCP	v1.26.5
Framework	PyTorch 2.1.0+cpu
Other Software	Python: 3.9, IPEX: 2.1.0+cpu, transformers: 4.31.0

Workload configuration

Table 3. Workload configuration

Parameter	Value
Model	Llama2-7B-hf
Dataset	Finance-Alpaca
Fine-tuning	1-, 2-, and 3-node clusters
Inference	Single node
Precision	BFloat16 and INT8
Batch Size	1, 2, 4, 6, and 8
Inference SLA	100 ms for second token latency
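
The inference SLA in Table 3 is expressed as a second token latency. As a rough sketch of how such a check could be scripted, the example below assumes that "second token latency" means the average time between consecutive output tokens after the first one; that interpretation, the function names, and the timestamps in the usage note are assumptions for illustration and are not taken from the test harness used in this study.

```python
from statistics import mean

SLA_MS = 100.0  # second token latency target listed in Table 3


def token_latencies_ms(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Return (first-token latency, mean next-token latency) in milliseconds.

    request_start and token_times are timestamps in seconds; token_times[i]
    is the moment the i-th output token became available.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two generated tokens")
    first_ms = (token_times[0] - request_start) * 1000.0
    # Gaps between consecutive tokens after the first one.
    gaps_ms = [(b - a) * 1000.0 for a, b in zip(token_times, token_times[1:])]
    return first_ms, mean(gaps_ms)


def meets_sla(next_token_ms: float, sla_ms: float = SLA_MS) -> bool:
    """True when the mean next-token latency is within the SLA."""
    return next_token_ms <= sla_ms
```

For example, with made-up timestamps token_latencies_ms(0.0, [0.5, 0.55, 0.6, 0.65]) yields a first-token latency of about 500 ms and a next-token latency of about 50 ms, which meets_sla would accept against the 100 ms target.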

Performance results

All figures in this section show performance results for LLAMA-2-7B. Figure 1 shows the fine-tuning efficiency of LLAMA-2-7B from 1 to 3 nodes, using time to train (hours) as the key performance indicator (KPI). Figure 2 shows single-node inference performance for both the INT8 and BFloat16 datatypes, accelerated by the AMX engines built into 4th Gen Xeon processors. Figure 3 shows performance in multi-instance scenarios. Figures 4-11 show performance sweeps across various batch sizes.

Chart showing finetuning scaling across 1, 2, and 3 nodes.

Figure 1. Fine-tuning scaling efficiency

Graph showing INT8 and BFloat16 single-socket performance across multiple input token sizes.

Figure 2. Inference performance for different input token sizes

Graph showing Int8 single socket multi-instance performance across multiple input token sizes.

Figure 3. Multi-instance inference performance for different input token sizes

Graph showing bfloat16 single socket performance with input token of 1024 across various batch sizes.

Figure 4. Inference performance for different batch sizes

Chart showing output of 2nd token latency scaling in ms across 1-8 batch size.

Figure 5. Inference performance for different batch sizes

Graph showing bfloat16 performance with input token of 128 across multiple batch sizes

Figure 6. Inference performance for different batch sizes

Graph showing bfloat16 single socket performance with input token of 32 across various batch sizes

Figure 7. Inference performance for different batch sizes

Chart showing output of 2nd token latency ms across batch size 1 through 8 in increments of 2 with input token of 1024.

Figure 8. Inference performance for different batch sizes

Graph showing Int8 single socket performance with input token 32 across various batch sizes

Figure 9. Inference performance for different batch sizes

Graph showing int8 single socket performance with input token of 128 across various batch sizes

Figure 10. Inference performance for different batch sizes

Graph showing int8 single socket performance with input token of 2048 across various batch sizes

Figure 11. Inference performance for different batch sizes

Key takeaways

Fine-tuning can be scaled from 1 to 3 nodes with straightforward orchestration through Kubernetes and RHOS, with 25%-35% scaling efficiency.
Across input tokens (32, 128, 1K, 2K), INT8 with 1 instance per socket delivers inference with average latency under 50 ms.
Across input tokens (32, 128, 1K, 2K), INT8 with 2 instances per socket delivers inference with average latency under 100 ms.
Across input tokens (32, 128), INT8 with 3 instances per socket delivers inference with average latency under 100 ms.
Across input tokens (32, 128, 1K, 2K), BF16 with 1 instance on 1 socket delivers inference with average latency under 100 ms.
Across input tokens (32, 128, 1K, 2K), INT8 delivers up to a 1.7x speedup over the BF16 model.
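
The ratios quoted above can be reproduced from raw measurements with a few lines of arithmetic. The sketch below is illustrative only: the source does not state how scaling efficiency was computed, so the code uses one common definition (speedup divided by node count), and its output is not necessarily comparable to the 25%-35% figure above. The function names are hypothetical.

```python
def speedup(baseline: float, improved: float) -> float:
    """Ratio of a baseline time or latency to an improved one (same units)."""
    if baseline <= 0 or improved <= 0:
        raise ValueError("measurements must be positive")
    return baseline / improved


def scaling_efficiency(t_one_node: float, t_n_nodes: float, nodes: int) -> float:
    """Speedup over a single node divided by the number of nodes.

    This is one common definition and an assumption here; 1.0 would mean
    perfectly linear scaling.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return speedup(t_one_node, t_n_nodes) / nodes
```

As a usage example with made-up latencies, speedup(85.0, 50.0) gives roughly 1.7, the form of the INT8 versus BF16 comparison above; the measured values themselves are shown only in the figures.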

Conclusion and future work

This work demonstrated the performance of 4th Gen Xeon processors on Dell PowerEdge servers with RHOS for fine-tuning and inference of the Meta LLAMA 2 Large Language Model (LLM). It also demonstrates that choosing the right combination of server, processor, and software products can help workloads scale out with increased performance. We would like to extend this study to larger LLMs and to a variety of network topologies of varying speeds and feeds, to identify the optimal compute vs. communication tradeoffs for best performance.

Notices and disclaimers

Performance varies by use, configuration and other factors. Learn more at www.intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.

Learn more

Contact your Dell or Intel account team for a customized quote.

_____________

[1] https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html

[2] https://www.redhat.com/en/technologies/cloud-computing/openshift

[3] https://en.wikipedia.org/wiki/OpenShift

[4] https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html

[5] https://github.com/intel/intel-extension-for-pytorch
