Inference validation
Our objective is to deploy the Llama 3 70B-Instruct model with the vLLM library on KServe for real-time generative AI inference. vLLM integrates seamlessly with Hugging Face models and exposes an OpenAI-compatible API server, while KServe provides a highly scalable model-serving platform on Kubernetes. In this validation, we show how to serve the Meta-Llama-3-70B-Instruct model using vLLM and KServe, send a request to the API endpoint, and receive the response generated by the LLM.
Table 8. Software components and versions

| Component         | Details and version       |
|-------------------|---------------------------|
| vLLM              | 0.4.1                     |
| KServe controller | 0.11.2                    |
| LLM               | Meta-Llama-3-70B-Instruct |
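To confirm these versions on a running cluster, the KServe controller image tag can be inspected directly. The command below is a sketch that assumes the default KServe installation, which places a kserve-controller-manager deployment in the kserve namespace:

kubectl get deployment kserve-controller-manager -n kserve \
  -o jsonpath='{.spec.template.spec.containers[0].image}'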
For this validation, we installed the vLLM library with ROCm support. The following steps show how to deploy Llama 3 and run inference.

First, create an InferenceService manifest named kserve-llama3_70B-example.yaml that defines the vLLM container, the model to load, and the GPU resources to request:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b
spec:
  predictor:
    containers:
      - args:
          - "--port"
          - "8080"
          - "--model"
          - "meta-llama/Meta-Llama-3-70B-Instruct"
        command:
          - "python3"
          - "-m"
          - "vllm.entrypoints.api_server"
        env:
          - name: HUGGING_FACE_HUB_TOKEN
            value: <UPDATE_HF_TOKEN>   # replace with your Hugging Face access token
        image: imagehub.ai.lab:5000/corescientificai/private-oci:vllm-0.4.1-rocm6.1-db-v1
        name: vllm-container
        resources:
          limits:
            cpu: "4"
            memory: 600Gi
            amd.com/gpu: "8"
          requests:
            cpu: "1"
            memory: 200Gi
            amd.com/gpu: "8"
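The manifest above passes the Hugging Face token as a plain environment variable. As an alternative, the token can be stored in a Kubernetes Secret and referenced from the container. The snippet below is a sketch that assumes a Secret named hf-token with a key named token:

kubectl create secret generic hf-token --from-literal=token='<UPDATE_HF_TOKEN>'

        env:
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token   # hypothetical Secret holding the Hugging Face token
                key: token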
Apply the manifest, then check the status of the InferenceService and its pod:

kubectl create -f kserve-llama3_70B-example.yaml
kubectl get isvc
kubectl get pods
Wait for the isvc resource to become ready; both containers in the pod must report Ready.
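Instead of polling manually, kubectl can block until the predictor pod reports Ready. This sketch assumes KServe applies its standard serving.kserve.io/inferenceservice label to the pods it creates:

kubectl wait --for=condition=Ready pod \
  -l serving.kserve.io/inferenceservice=llama3-70b --timeout=60m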
When the pod is ready, forward the predictor pod's port so that the API endpoint is reachable locally. The pod name is specific to this deployment; substitute the name returned by kubectl get pods:

kubectl port-forward llama3-70b-predictor-00001-deployment-6cf668ccc4-zz7cc 8081:8080
Send a generation request to the /generate endpoint:

curl http://0.0.0.0:8081/generate -d '{
    "prompt": "The capital of Texas is",
    "use_beam_search": true,
    "n": 2,
    "temperature": 0
}'
{"text":["The capital of Texas is Austin, which is located in the south-central part of the state. Austin is","The capital of Texas is Aus
tin. The city is located in the central part of the state, along the"]}root@omni1:~#
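The same request can also be sent programmatically. The following minimal Python sketch assumes the port-forward above is still active on localhost:8081 and that the requests package is installed:

import requests

# Build the same generation request that was sent with curl above
payload = {
    "prompt": "The capital of Texas is",
    "use_beam_search": True,
    "n": 2,
    "temperature": 0,
}

# POST to the vLLM API server exposed through the kubectl port-forward
response = requests.post("http://localhost:8081/generate", json=payload, timeout=120)
response.raise_for_status()

# The server responds with a JSON object whose "text" field holds the generated completions
for completion in response.json()["text"]:
    print(completion)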
For more information, see the vLLM Meets Kubernetes-Deploying Llama-3 Models using KServe on Dell PowerEdge XE9680 with AMD MI300X blog.