Run Llama 3 on Dell PowerEdge XE9680 and AMD MI300X with vLLM
Thu, 09 May 2024 18:49:19 -0000
In the rapidly evolving AI landscape, Meta AI Llama 3 stands out as a leading large language model (LLM), driving advancements across a variety of applications, from chatbots to text generation. Dell PowerEdge servers offer an ideal platform for deploying this sophisticated LLM, catering to the growing needs of AI-centric enterprises with their robust performance and scalability.
This blog demonstrates how to run inference with the state-of-the-art Llama 3 models (meta-llama/Meta-Llama-3-8B-Instruct and meta-llama/Meta-Llama-3-70B-Instruct) using different methods: offline inferencing with a container image, an endpoint API server, and an OpenAI-compatible server, all using Hugging Face models with vLLM.
Figure 1. Architecture for deploying the Llama 3 model on PowerEdge XE9680 with AMD MI300X
Deploy Llama 3
Step 1: Configure Dell PowerEdge XE9680 Server
Use the following system configuration settings:
- Operating System: Ubuntu 22.04.3 LTS
- Kernel: Linux 5.15.0-105-generic
- Architecture: x86-64
- nerdctl: 1.5.0
- ROCm™ version: 6.1
- Server: Dell™ PowerEdge™ XE9680
- GPU: 8x AMD Instinct™ MI300X Accelerators
- vLLM version: 0.3.2+rocm603
- Llama 3 model: meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B-Instruct
Step 2: Build the vLLM container image
vLLM is a high-performance, memory-efficient serving engine for large language models (LLMs). It leverages PagedAttention and continuous batching techniques to rapidly process LLM requests.
1. Install the AMD ROCm™ driver, libraries, and tools. Follow AMD's instructions for your Linux-based platform. To confirm the installation succeeded, check the GPU info using rocm-smi.
2. To quickly start with vLLM, we suggest using the ROCm Docker container, as installing and building vLLM from source can be complex; however, for our study, we built it from source to access the latest version.
- Git clone the vLLM v0.3.2 release.
git clone -b v0.3.2 https://github.com/vllm-project/vllm.git
- nerdctl build requires BuildKit to be enabled. Use the following instructions to set up nerdctl build with BuildKit.
cd vllm
sudo nerdctl build -f Dockerfile -t vllm-rocm:latest .
(This build takes approximately 2-3 hours.)
nerdctl images | grep vllm
- Run the container, replacing <path/to/model> with the appropriate path if you have a folder of LLMs you would like to mount and access in the container.
nerdctl run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri/card0 --device /dev/dri/card1 --device /dev/dri/renderD128 -v <path/to/model>:/app/model vllm-rocm:latest
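Once inside the container, you can optionally confirm that the mapped GPU devices are visible to PyTorch before moving on. The following minimal sketch (not part of the original workflow) relies on the fact that ROCm builds of PyTorch report AMD GPUs through the torch.cuda API:

import torch

# ROCm builds of PyTorch expose AMD accelerators through the torch.cuda API.
print(torch.cuda.is_available())       # True if the mapped MI300X devices are visible
print(torch.cuda.device_count())       # number of GPUs passed through to the container
print(torch.cuda.get_device_name(0))   # name of the first visible accelerator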
Step 3: Run Llama 3 using vLLM with three approaches
vLLM provides several ways to deploy and run your Llama 3 model: offline inferencing with a container image, an endpoint API server, or an OpenAI-compatible server.
Using the Dell PowerEdge XE9680 server with AMD MI300X GPUs, let's explore each option in detail to determine which aligns best with your infrastructure and workflow requirements.
Container Image
- Use pre-built container images to install vLLM quickly and with minimal effort.
- Supports consistent deployment across diverse environments, both on-premises and cloud-based.
Endpoint API
- Access vLLM's functionality through a RESTful API, enabling easy integration with other systems.
- Encourages a modular design and smooth communication with external applications for added flexibility.
OpenAI-Compatible Server
- Run vLLM behind an OpenAI-compatible server; this option is perfect for those who are familiar with OpenAI's API.
- Ideal for users seeking a seamless transition from existing OpenAI workflows to vLLM.
Now, let's dive into the step-by-step process for implementing each of these approaches.
Approach 1: Llama 3 with container image (offline inferencing)
To start offline inferencing, first export the environment variables. To use the Llama 3 model, you must set the HUGGING_FACE_HUB_TOKEN environment variable for authentication; this requires signing up for a Hugging Face account to obtain an access token.
root@16118303efa7:/app# export HUGGING_FACE_HUB_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
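Optionally, you can verify that the exported token is valid and that your account has been granted access to the gated Llama 3 repository. The following is a minimal sketch using the huggingface_hub library (available in the container as part of vLLM's dependencies); the check itself is not required by the workflow:

import os
from huggingface_hub import HfApi

# Validate the token and confirm access to the gated Llama 3 repository.
api = HfApi(token=os.environ["HUGGING_FACE_HUB_TOKEN"])
print(api.whoami()["name"])  # the Hugging Face account tied to the token
api.model_info("meta-llama/Meta-Llama-3-8B-Instruct")  # raises an error if access has not been granted
print("Access to Meta-Llama-3-8B-Instruct confirmed")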
Use the existing example script provided by vLLM and edit the model used for offline inferencing; in our case, meta-llama/Meta-Llama-3-8B-Instruct and meta-llama/Meta-Llama-3-70B-Instruct. Make sure you have valid permissions to access the model on Hugging Face.
root@16118303efa7:/app# vi vllm/examples/offline_inference.py
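After editing, the script looks roughly like the following sketch. The prompts and sampling parameters are the stock values from vLLM's offline_inference.py example; only the model name is changed (swap in the 70B checkpoint if required):

from vllm import LLM, SamplingParams

# Stock example prompts from vLLM's offline_inference.py.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Point the engine at the Llama 3 model downloaded from Hugging Face.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")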
Next, the Python script (offline_inference.py) is executed to run the inference process. This script is designed to work with vLLM, an optimized inference engine for large language models.
root@16118303efa7:/app# python vllm/examples/offline_inference.py
The script then initializes the Llama 3 model, logging key setup information such as tokenizer and model configuration, device setup, and weight-loading format. It then downloads and loads the model weights and components (such as tokenizer configurations) for offline inference, with detailed logs showing which files are downloaded and at what speed.
Finally, the script executes the Llama 3 model with example prompts, displaying text-based responses to demonstrate coherent outputs, confirming the model's offline inference capabilities.
Prompt: 'Hello, my name is', Generated text: ' Helen and I am a second-year student at the University of Central Florida. I'
Prompt: 'The president of the United States is', Generated text: " a powerful figure, with the ability to shape the country's laws, policies,"
Prompt: 'The capital of France is', Generated text: ' a city that is steeped in history, art, and culture. From the'
Prompt: 'The future of AI is', Generated text: ' here, and it’s already changing the way we live, work, and interact'
Approach 2: Llama 3 inferencing via API server
This approach maintains a consistent server environment and uses vLLM's API server, which runs the vLLM backend and lets you interact with it through HTTP endpoints. This provides flexible access to language models for generating text or processing requests.
First, execute the following command to enable the api_server inference endpoint inside the container. You can modify the model argument as needed from the supported model list. If you require the 70B model, ensure that it runs on a dedicated GPU, as the VRAM won't be sufficient if other processes are also running.
root@16118303efa7:/app# python -m vllm.entrypoints.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
INFO 01-17 20:25:37 llm_engine.py:222] # GPU blocks: 2642, # CPU blocks: 327
INFO: Started server process [10]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
The preceding output shows the server starting successfully: vLLM is now listening on port 8000, and you can use the endpoint to communicate with the model.
curl http://localhost:8000/generate -d '{ "prompt": "Dell is", "use_beam_search": true, "n": 5, "temperature": 0 }'
The following example provides the expected output.
{"text":["Dell is one of the world's largest technology companies, providing a wide range of products and","Dell is one of the most well-known and respected technology companies in the world, with a","Dell is a well-known brand in the tech industry, and their laptops are popular among consumers","Dell is one of the most well-known and respected technology companies in the world. With a","Dell is one of the most well-known and respected technology companies in the world, and their"]}
Approach 3: Llama 3 inferencing via OpenAI-compatible server
This approach maintains a consistent server environment with vLLM deployed as a server that implements the OpenAI API protocol, allowing vLLM to be used as a drop-in replacement for applications that use the OpenAI API. By default, the server starts at http://localhost:8000; you can specify the address with the --host and --port arguments. The server currently hosts one model at a time and implements the list models, create chat completion, and create completion endpoints. The following steps provide an example for a Llama 3 model.
To activate the OpenAI-compatible inference endpoint within the container, use one of the following commands. If you require the 70B model, ensure it runs on a dedicated GPU, as the VRAM won't be sufficient if other processes are also running.
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct
To install the OpenAI Python client, run the following command with root privileges on the host.
pip install openai
Create the following Python file, editing the model name and the OpenAI API base URL as needed.
root@compute1:~# cat openai_vllm.py
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain the differences between Navy Diver and EOD rate card"}],
    max_tokens=4000,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
Run the following command to execute the script.
root@compute1:~# python3 openai_vllm.py
The following example provides the expected output.
Navy Diver (ND) and Explosive Ordnance Disposal (EOD) are both specialized ratings in the United States Navy, but they have distinct roles and responsibilities.

**Navy Diver (ND) Rating:**

Navy Divers are trained to conduct a variety of underwater operations, including:

1. Salvage and recovery of underwater equipment and vessels
2. Construction and maintenance of underwater structures
3. Clearance of underwater obstacles and hazards
4. Recovery of aircraft and other underwater vehicles
5. Scientific and research diving

Navy Divers typically work in a variety of environments, including freshwater and saltwater, and may be deployed on board ships, submarines, or ashore. Their primary responsibilities include:

* Conducting dives to depths of up to 300 feet (91 meters) in a variety of environments
* Operating specialized diving equipment, such as rebreathers and rebreather systems
* Performing underwater repairs and maintenance on equipment and vessels
* Conducting underwater construction and salvage operations

**Explosive Ordnance Disposal (EOD) Rating:**
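Besides streaming chat completions, the server also implements the list models and create completion endpoints mentioned earlier. The following is a minimal sketch querying both with the same OpenAI client, assuming the server is still running at http://localhost:8000 (the prompt text is only an illustration):

from openai import OpenAI

# Point the OpenAI client at the local vLLM server; no real API key is required.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# List the model currently hosted by the server.
for model in client.models.list().data:
    print(model.id)

# Plain (non-chat) completion against the same model.
completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Dell PowerEdge XE9680 is",
    max_tokens=64,
)
print(completion.choices[0].text)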
Conclusion
Running Llama 3 on the Dell PowerEdge XE9680 taps into the powerful capabilities of AMD's MI300X accelerators, delivers swift data processing, and provides a flexible infrastructure. Leveraging vLLM enhances model deployment and inference efficiency, reinforcing Meta's focus on advanced open-source AI technology and demonstrating Dell's strength in delivering high-performance solutions for a broad range of applications.
Stay tuned for future blogs posted on the Dell Technologies Info Hub for AI for more information on vLLM with different models and their performance metrics on the PowerEdge XE9680.