This section describes how to set up and use vLLM, a fast and efficient library for Large Language Model (LLM) inference and serving, with the Intel® Gaudi® 3 AI Accelerator. The vLLM fork for Gaudi® (HabanaAI/vllm-fork) is an adaptation of the upstream vLLM project, optimized to take advantage of Gaudi® hardware.
Note: The HabanaAI vllm-fork is released periodically, so ensure that the versions and related configuration values in the following .yaml files are updated to match the latest release.
vLLM offers several advantages for LLM inference:
- PagedAttention, which manages the attention key-value cache in fixed-size blocks to reduce memory waste and fragmentation
- Continuous batching of incoming requests for high serving throughput
- An OpenAI-compatible API server, so existing OpenAI client code can be pointed at the deployment with minimal changes
- Support for common serving features such as tensor parallelism and streaming outputs
The remainder of this section demonstrates fast, efficient LLM inference with vLLM running on Gaudi® hardware.
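As a quick orientation before the deployment files, the sketch below shows one way to launch vLLM's OpenAI-compatible server for the Llama-3-8B BF16 use case. This is an illustrative sketch, not the deployment method from the repository: the model identifier, port, and flag values are assumptions, and the actual supported versions and settings are defined in the repository's .yaml files.

```shell
# Illustrative sketch only -- model name, port, and flag values are
# assumptions; consult the repository's .yaml files for the supported
# configuration for your vllm-fork release.

# Launch the OpenAI-compatible vLLM API server, serving Llama-3-8B
# with BF16 precision on a single Gaudi accelerator.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B \
    --dtype bfloat16 \
    --tensor-parallel-size 1 \
    --port 8000
```

Once the server is up, any OpenAI-compatible client can send requests to `http://localhost:8000/v1`.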
Sample files for the vLLM use case (Llama-3-8B with BF16 precision) are available at the following GitHub repository:
https://github.com/dell-examples/generative-ai/tree/main/intel-XE9680-gaudi3/vLLM
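To illustrate how a deployed model is queried, the following sketch builds a request for the OpenAI-compatible `/v1/completions` route that vLLM exposes. The endpoint URL, default port 8000, and model identifier are assumptions for illustration; only the Python standard library is used, so no client package is required.

```python
import json
from urllib import request

# Assumed endpoint: vLLM's OpenAI-compatible server on its default port.
API_URL = "http://localhost:8000/v1/completions"


def build_payload(prompt, model="meta-llama/Meta-Llama-3-8B", max_tokens=128):
    """Build a JSON payload for the OpenAI-compatible /v1/completions route."""
    return {
        "model": model,          # must match the model the server was launched with
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }


def query(prompt):
    """POST a completion request and return the generated text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = request.Request(
        API_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Example (requires a running server):
#   print(query("What workloads suit the PowerEdge XE9680?"))
```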
The following deployment files are included in the vLLM repository: