Run Llama 3 on Dell PowerEdge XE9680 and AMD MI300X with vLLM
Thu, 09 May 2024 18:49:19 -0000
In the rapidly evolving AI landscape, Meta AI Llama 3 stands out as a leading large language model (LLM), driving advancements across a variety of applications, from chatbots to text generation. Dell PowerEdge servers offer an ideal platform for deploying this sophisticated LLM, catering to the growing needs of AI-centric enterprises with their robust performance and scalability.
This blog demonstrates how to run inference with the state-of-the-art Llama 3 models (meta-llama/Meta-Llama-3-8B-Instruct and meta-llama/Meta-Llama-3-70B-Instruct) using different methods: offline inferencing with a container image, an endpoint API server, and an OpenAI-compatible server, all using Hugging Face models with vLLM.
Figure 1. Architecture for deploying the Llama 3 model on PowerEdge XE9680 with AMD MI300X
Deploy Llama 3
Step 1: Configure Dell PowerEdge XE9680 Server
Use the following system configuration settings:
- Operating System: Ubuntu 22.04.3 LTS
- Kernel: Linux 5.15.0-105-generic
- Architecture: x86-64
- nerdctl: 1.5.0
- ROCm™ version: 6.1
- Server: Dell™ PowerEdge™ XE9680
- GPU: 8x AMD Instinct™ MI300X Accelerators
- vLLM version: 0.3.2+rocm603
- Llama 3 model: meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B-Instruct
Step 2: Build the vLLM container image
vLLM is a high-performance, memory-efficient serving engine for large language models (LLMs). It leverages PagedAttention and continuous batching techniques to rapidly process LLM requests.
1. Install the AMD ROCm™ driver, libraries, and tools. Follow AMD's instructions for your Linux-based platform. To confirm the installation succeeded, check the GPU info using rocm-smi.
2. To quickly start with vLLM, we suggest using the ROCm Docker container, as installing and building vLLM from source can be complex; however, for our study, we built it from source to access the latest version.
- Git clone the vLLM v0.3.2 release.
git clone -b v0.3.2 https://github.com/vllm-project/vllm.git
- nerdctl build requires BuildKit to be enabled. Use the following instructions to set up nerdctl build with BuildKit.
cd vllm
sudo nerdctl build -f Dockerfile -t vllm-rocm:latest .
(This build takes approximately 2-3 hours.)
nerdctl images | grep vllm
- Run the container, replacing <path/to/model> with the appropriate path if you have a folder of LLMs you would like to mount and access in the container.
nerdctl run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri/card0 --device /dev/dri/card1 --device /dev/dri/renderD128 -v <path/to/model>:/app/model vllm-rocm:latest
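Once inside the container, you can optionally confirm that the mapped GPU devices are visible to PyTorch before moving on. The following minimal sketch (not part of the original workflow) relies on the fact that ROCm builds of PyTorch report AMD GPUs through the torch.cuda API:

import torch

# ROCm builds of PyTorch expose AMD accelerators through the torch.cuda API.
print(torch.cuda.is_available())       # True if the mapped MI300X devices are visible
print(torch.cuda.device_count())       # number of GPUs passed through to the container
print(torch.cuda.get_device_name(0))   # name of the first visible accelerator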
Step 3: Run Llama 3 using vLLM with three approaches
vLLM provides several ways to deploy and run your Llama 3 model: offline inferencing with a container image, an endpoint API server, or an OpenAI-compatible server.
Using the Dell PowerEdge XE9680 server with AMD MI300X GPUs, let's explore each option in detail to determine which aligns best with your infrastructure and workflow requirements.
Container Image
- Use pre-built container images to install vLLM quickly and with minimal effort.
- Supports consistent deployment across diverse environments, both on-premises and cloud-based.
Endpoint API
- Access vLLM's functionality through a RESTful API, enabling easy integration with other systems.
- Encourages a modular design and smooth communication with external applications for added flexibility.
OpenAI-Compatible Server
- Run vLLM behind an OpenAI-compatible server; this option is perfect for those who are familiar with OpenAI's API.
- Ideal for users seeking a seamless transition from existing OpenAI workflows to vLLM.
Now, let's dive into the step-by-step process for implementing each of these approaches.
Approach 1: Llama 3 with container image (offline inferencing)
To start offline inferencing, first export the environment variables. To use the Llama 3 model, you must set the HUGGING_FACE_HUB_TOKEN environment variable for authentication; this requires signing up for a Hugging Face account to obtain an access token.
root@16118303efa7:/app# export HUGGING_FACE_HUB_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
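Optionally, you can verify that the exported token is valid and that your account has been granted access to the gated Llama 3 repository. The following is a minimal sketch using the huggingface_hub library (available in the container as part of vLLM's dependencies); the check itself is not required by the workflow:

import os
from huggingface_hub import HfApi

# Validate the token and confirm access to the gated Llama 3 repository.
api = HfApi(token=os.environ["HUGGING_FACE_HUB_TOKEN"])
print(api.whoami()["name"])  # the Hugging Face account tied to the token
api.model_info("meta-llama/Meta-Llama-3-8B-Instruct")  # raises an error if access has not been granted
print("Access to Meta-Llama-3-8B-Instruct confirmed")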
Use the existing example script provided by vLLM and edit the model used for offline inferencing; in our case, meta-llama/Meta-Llama-3-8B-Instruct and meta-llama/Meta-Llama-3-70B-Instruct. Make sure you have valid permissions to access the model on Hugging Face.
root@16118303efa7:/app# vi vllm/examples/offline_inference.py
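After editing, the script looks roughly like the following sketch. The prompts and sampling parameters are the stock values from vLLM's offline_inference.py example; only the model name is changed (swap in the 70B checkpoint if required):

from vllm import LLM, SamplingParams

# Stock example prompts from vLLM's offline_inference.py.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Point the engine at the Llama 3 model downloaded from Hugging Face.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")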
Next, the Python script (offline_inference.py) is executed to run the inference process. This script is designed to work with vLLM, an optimized inference engine for large language models.
root@16118303efa7:/app# python vllm/examples/offline_inference.py
The script then initializes the Llama 3 model, logging key setup information such as tokenizer and model configuration, device setup, and weight-loading format. It then downloads and loads the model weights and components (such as tokenizer configurations) for offline inference, with detailed logs showing which files are downloaded and at what speed.
Finally, the script executes the Llama 3 model with example prompts, displaying text-based responses to demonstrate coherent outputs, confirming the model's offline inference capabilities.
Prompt: 'Hello, my name is', Generated text: ' Helen and I am a second-year student at the University of Central Florida. I'
Prompt: 'The president of the United States is', Generated text: " a powerful figure, with the ability to shape the country's laws, policies,"
Prompt: 'The capital of France is', Generated text: ' a city that is steeped in history, art, and culture. From the'
Prompt: 'The future of AI is', Generated text: ' here, and it’s already changing the way we live, work, and interact'
Approach 2: Llama 3 inferencing via API server
This approach maintains a consistent server environment and uses vLLM's API server, which runs the vLLM backend and lets you interact with it through HTTP endpoints. This provides flexible access to language models for generating text or processing requests.
First, execute the following command to enable the api_server inference endpoint inside the container. You can modify the model argument as needed from the supported model list. If you require the 70B model, ensure that it runs on a dedicated GPU, as the VRAM won't be sufficient if other processes are also running.
root@16118303efa7:/app# python -m vllm.entrypoints.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
INFO 01-17 20:25:37 llm_engine.py:222] # GPU blocks: 2642, # CPU blocks: 327
INFO: Started server process [10]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
The preceding output shows the server starting successfully: vLLM is now listening on port 8000, and you can use the endpoint to communicate with the model.
curl http://localhost:8000/generate -d '{ "prompt": "Dell is", "use_beam_search": true, "n": 5, "temperature": 0 }'
The following example provides the expected output.
{"text":["Dell is one of the world's largest technology companies, providing a wide range of products and","Dell is one of the most well-known and respected technology companies in the world, with a","Dell is a well-known brand in the tech industry, and their laptops are popular among consumers","Dell is one of the most well-known and respected technology companies in the world. With a","Dell is one of the most well-known and respected technology companies in the world, and their"]}
Approach 3: Llama 3 inferencing via OpenAI-compatible server
This approach maintains a consistent server environment with vLLM deployed as a server that implements the OpenAI API protocol, allowing vLLM to be used as a drop-in replacement for applications that use the OpenAI API. By default, the server starts at http://localhost:8000; you can specify the address with the --host and --port arguments. The server currently hosts one model at a time and implements the list models, create chat completion, and create completion endpoints. The following steps provide an example for a Llama 3 model.
To activate the OpenAI-compatible inference endpoint within the container, use one of the following commands. If you require the 70B model, ensure it runs on a dedicated GPU, as the VRAM won't be sufficient if other processes are also running.
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct
To install the OpenAI Python client, run the following command with root privileges on the host.
pip install openai
Create the following Python file, editing the model name and the OpenAI API base URL as needed.
root@compute1:~# cat openai_vllm.py
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain the differences between Navy Diver and EOD rate card"}],
    max_tokens=4000,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
Run the following command to execute the script.
root@compute1:~# python3 openai_vllm.py
The following example provides the expected output.
Navy Diver (ND) and Explosive Ordnance Disposal (EOD) are both specialized ratings in the United States Navy, but they have distinct roles and responsibilities.

**Navy Diver (ND) Rating:**

Navy Divers are trained to conduct a variety of underwater operations, including:

1. Salvage and recovery of underwater equipment and vessels
2. Construction and maintenance of underwater structures
3. Clearance of underwater obstacles and hazards
4. Recovery of aircraft and other underwater vehicles
5. Scientific and research diving

Navy Divers typically work in a variety of environments, including freshwater and saltwater, and may be deployed on board ships, submarines, or ashore. Their primary responsibilities include:

* Conducting dives to depths of up to 300 feet (91 meters) in a variety of environments
* Operating specialized diving equipment, such as rebreathers and rebreather systems
* Performing underwater repairs and maintenance on equipment and vessels
* Conducting underwater construction and salvage operations

**Explosive Ordnance Disposal (EOD) Rating:**
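Besides streaming chat completions, the server also implements the list models and create completion endpoints mentioned earlier. The following is a minimal sketch querying both with the same OpenAI client, assuming the server is still running at http://localhost:8000 (the prompt text is only an illustration):

from openai import OpenAI

# Point the OpenAI client at the local vLLM server; no real API key is required.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# List the model currently hosted by the server.
for model in client.models.list().data:
    print(model.id)

# Plain (non-chat) completion against the same model.
completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Dell PowerEdge XE9680 is",
    max_tokens=64,
)
print(completion.choices[0].text)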
Conclusion
Running Llama 3 on the Dell PowerEdge XE9680 taps into the powerful capabilities of AMD's MI300X accelerators, delivers swift data processing, and provides a flexible infrastructure. Leveraging vLLM enhances model deployment and inference efficiency, reinforcing Meta's focus on advanced open-source AI technology and demonstrating Dell's strength in delivering high-performance solutions for a broad range of applications.
Stay tuned for future blogs posted on the Dell Technologies Info Hub for AI for more information on vLLM with different models and their performance metrics on the PowerEdge XE9680.