This section describes how to set up and use vLLM, a fast and efficient library for Large Language Model (LLM) inference and serving, with the Intel® Gaudi® 3 AI Accelerator. The vLLM fork for Gaudi® (HabanaAI/vllm-fork) is an adaptation of the upstream vLLM project, optimized to take advantage of Gaudi® hardware.
Note: The HabanaAI vllm-fork is released periodically, so ensure that the versions and related configuration values in the following .yaml files are updated to match the latest release.
vLLM offers several advantages for LLM inference:
- PagedAttention, which manages the attention key-value cache in fixed-size blocks to reduce memory waste and fragmentation
- Continuous batching of incoming requests for high serving throughput
- An OpenAI-compatible API server, so existing OpenAI client code can be pointed at the deployment with minimal changes
- Support for common serving features such as tensor parallelism and streaming outputs
The remainder of this section demonstrates fast, efficient LLM inference with vLLM running on Gaudi® hardware.
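As a quick orientation before the deployment files, the sketch below shows one way to launch vLLM's OpenAI-compatible server for the Llama-3-8B BF16 use case. This is an illustrative sketch, not the deployment method from the repository: the model identifier, port, and flag values are assumptions, and the actual supported versions and settings are defined in the repository's .yaml files.

```shell
# Illustrative sketch only -- model name, port, and flag values are
# assumptions; consult the repository's .yaml files for the supported
# configuration for your vllm-fork release.

# Launch the OpenAI-compatible vLLM API server, serving Llama-3-8B
# with BF16 precision on a single Gaudi accelerator.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B \
    --dtype bfloat16 \
    --tensor-parallel-size 1 \
    --port 8000
```

Once the server is up, any OpenAI-compatible client can send requests to `http://localhost:8000/v1`.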
Sample files for the vLLM use case (Llama-3-8B with BF16 precision) are available at the following GitHub repository:
https://github.com/dell-examples/generative-ai/tree/main/intel-XE9680-gaudi3/vLLM
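To illustrate how a deployed model is queried, the following sketch builds a request for the OpenAI-compatible `/v1/completions` route that vLLM exposes. The endpoint URL, default port 8000, and model identifier are assumptions for illustration; only the Python standard library is used, so no client package is required.

```python
import json
from urllib import request

# Assumed endpoint: vLLM's OpenAI-compatible server on its default port.
API_URL = "http://localhost:8000/v1/completions"


def build_payload(prompt, model="meta-llama/Meta-Llama-3-8B", max_tokens=128):
    """Build a JSON payload for the OpenAI-compatible /v1/completions route."""
    return {
        "model": model,          # must match the model the server was launched with
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }


def query(prompt):
    """POST a completion request and return the generated text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = request.Request(
        API_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Example (requires a running server):
#   print(query("What workloads suit the PowerEdge XE9680?"))
```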
The following deployment files are included in the vLLM repository: