The demand for powerful computing resources is rising with the growing use of deep learning for natural language processing, image and audio recognition, predictive analysis, and more. Meeting this demand requires faster and more efficient inference systems at enterprise scale. The Intel® Gaudi® processor is built to enhance the speed and efficiency of deep learning, from training to inference. It comes with user-friendly software that adapts to a variety of workloads and makes it easy for developers to migrate their existing models to Gaudi® 3.
For more information on the Dell PowerEdge XE9680 offering with Intel® Gaudi® 3, refer to this tech note from Dell Technologies and Intel®: Introducing Dell PowerEdge XE9680 with Intel Gaudi 3 AI Accelerator.
This deployment guide provides detailed insights into the versatile applications of the Gaudi® 3 processor, showcasing its inference and fine-tuning capabilities. It covers inference with Optimum-Habana, vLLM, and TGI, and includes an example to get you started with fine-tuning.
Optimum for Intel® Gaudi® was designed with one goal in mind: to make training and inference straightforward for Transformers and Diffusers users while fully leveraging the power of Intel® Gaudi® AI Accelerators.
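As a minimal sketch of what that drop-in approach looks like, the snippet below adapts a standard Transformers fine-tuning loop to Gaudi using the Gaudi-specific classes from optimum-habana. The small gpt2 model, synthetic dataset, and Habana/gpt2 Gaudi configuration are placeholders chosen to keep the example self-contained; the same pattern applies to the Llama3-8B workloads discussed in this guide.

```python
# A minimal fine-tuning sketch with optimum-habana, intended for a Gaudi host.
# Model, dataset, and Gaudi config names are illustrative placeholders.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana import GaudiTrainer, GaudiTrainingArguments

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Build a toy causal-LM dataset; labels mirror input_ids for LM training.
texts = ["Intel Gaudi 3 accelerates deep learning workloads."] * 16
train_dataset = Dataset.from_dict(dict(tokenizer(texts)))
train_dataset = train_dataset.map(lambda ex: {"labels": ex["input_ids"]})

# GaudiTrainingArguments mirrors transformers.TrainingArguments and adds
# Habana-specific switches for HPU execution.
args = GaudiTrainingArguments(
    output_dir="./gaudi-finetune-demo",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    use_habana=True,                  # run on HPU
    use_lazy_mode=True,               # Gaudi lazy-mode graph execution
    gaudi_config_name="Habana/gpt2",  # Gaudi config hosted on the Hub
)

# GaudiTrainer is a drop-in replacement for transformers.Trainer.
trainer = GaudiTrainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()
```

The only changes relative to a stock Transformers script are the two Gaudi class substitutions and the Habana-specific arguments, which is the migration path Optimum for Intel® Gaudi® is designed around.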
vLLM is a fast and easy-to-use library for LLM inference and serving. It manages the attention key-value (KV) cache with PagedAttention, a technique inspired by virtual memory paging, which reduces memory waste from fragmentation and improves inference throughput. When deploying large language models that exceed typical memory constraints, or in scenarios requiring efficient resource utilization and high-performance serving, vLLM combined with the power of Intel® Gaudi® AI Accelerators plays a pivotal role.
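To make the workflow concrete, the sketch below uses vLLM's offline generation API. It assumes a Gaudi-enabled vLLM build is installed on the host (the API itself is standard vLLM); the model name and sampling settings are illustrative.

```python
# Minimal offline-inference sketch with vLLM's Python API.
# Assumes a Gaudi-enabled vLLM build is installed so the engine runs on HPU;
# the model name and sampling parameters below are illustrative.
from vllm import LLM, SamplingParams

prompts = [
    "Explain what PagedAttention does in one sentence.",
    "List two benefits of FP8 quantization.",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

# The engine pre-allocates and pages the KV cache, so throughput stays
# high even with many concurrent sequences.
llm = LLM(model="meta-llama/Meta-Llama-3-8B", dtype="bfloat16")

for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```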
Text Generation Inference (TGI) is an optimized serving solution for large language models, particularly suited to applications requiring high-performance text generation.
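Because TGI runs as a standalone server, clients interact with it over a REST API. The sketch below queries a TGI instance that is assumed to already be serving a model on the Gaudi host; the address, port, and generation parameters are illustrative.

```python
# Minimal client sketch against a running TGI server.
# Assumes a Gaudi-enabled TGI container is already serving a model,
# e.g. at http://localhost:8080 (address and port are illustrative).
import requests

payload = {
    "inputs": "What makes the PowerEdge XE9680 suitable for GenAI?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7},
}

# TGI exposes a /generate endpoint that returns the completed text.
response = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```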
When running inference on large language models (LLMs), high memory usage is often a bottleneck. This guide shares steps to run these workloads with BFloat16 precision. In addition, FP8 quantization halves the required memory bandwidth, and FP8 compute is twice as fast as BF16 compute, so even compute-bound workloads, such as offline inference at large batch sizes, benefit. Details on these steps are provided as well.
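As an example of the BFloat16 path, the sketch below loads a model in BF16 on the HPU device; importing the Habana PyTorch bridge registers the hpu device with PyTorch. The model name is an illustrative placeholder, and the FP8 path is configured separately through the Gaudi software stack's quantization tooling rather than a dtype flag.

```python
# Minimal BF16 inference sketch on a Gaudi host (model name is illustrative).
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Loading in bfloat16 halves memory use relative to FP32 while keeping
# FP32's dynamic range, which suits LLM inference well.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("hpu")

inputs = tokenizer("Gaudi 3 accelerates", return_tensors="pt").to("hpu")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```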
This guide covers how to get started with the stack and model configurations of Optimum, vLLM, and TGI for Llama3-8B models. The study can be scaled further to larger models such as Llama3-70B. As efficiency and performance optimizations progress in the Habana Gaudi® software stack, additional performance improvements can be expected in upcoming versions.
Each use case is explored to demonstrate how the Gaudi® 3 processor can be effectively utilized to enhance performance and efficiency in cutting-edge AI domains. To provide clarity to customers looking to deploy models on Gaudi® 3, Dell Technologies and Intel® partnered to conduct performance testing. The Dell platform chosen for this testing was the PowerEdge XE9680 with 5th Generation Intel® Xeon® processors and Intel® Gaudi® 3 accelerators.
All testing was conducted on Dell PowerEdge servers by Intel® engineers during August and September 2024.