vLLM Meets Kubernetes: Deploying Llama-3 Models using KServe on Dell PowerEdge XE9680 with AMD MI300X
Fri, 17 May 2024 19:18:34 -0000
Introduction
Dell's PowerEdge XE9680 server infrastructure, coupled with the capabilities of Kubernetes and KServe, offers a comprehensive platform for seamlessly deploying and managing sophisticated large language models such as Meta AI's Llama-3, addressing the evolving needs of AI-driven businesses.
Our previous blog post explored leveraging advanced Llama-3 models (meta-llama/Meta-Llama-3-8B-Instruct and meta-llama/Meta-Llama-3-70B-Instruct) for inference tasks, where Dell Technologies highlighted container-based deployment, endpoint API methods, and OpenAI-style inferencing. This follow-up blog delves deeper into the inference process, this time with a focus on Kubernetes (k8s) and KServe integration.
This method seamlessly integrates with the Hugging Face ecosystem and the vLLM framework, all operational on the robust Dell™ PowerEdge™ XE9680 server, empowered by the high-performance AMD Instinct™ MI300X accelerators.
System configurations and prerequisites
- Operating System: Ubuntu 22.04.3 LTS
- Kernel: Linux 5.15.0-105-generic
- Architecture: x86-64
- ROCm™ version: 6.1
- Server: Dell™ PowerEdge™ XE9680
- GPU: 8x AMD Instinct™ MI300X Accelerators
- vLLM version: 0.4.1.2
- Llama-3: meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B-Instruct
- Kubernetes: 1.26.12
- KServe: v0.11.2
To install vLLM, see our previous blog for instructions on setting up the cluster environment needed for inference.
Deploying Kubernetes and KServe on XE9680
Overview
To set up Kubernetes on a bare metal XE9680 cluster, Dell Technologies used Kubespray, an open-source tool that streamlines Kubernetes deployments. Dell Technologies followed the quick start section of its documentation, which provides clear step-by-step instructions for installation and configuration.
Next, Dell Technologies installed KServe, a highly scalable and standards-based model inference platform on Kubernetes.
KServe provides the following features:
- It acts as a universal inference protocol for a range of machine learning frameworks, ensuring compatibility across different platforms.
- It supports serverless inference workloads with autoscaling capabilities, including GPU scaling down to zero when not in use.
- It uses ModelMesh to achieve high scalability, optimized resource utilization, and intelligent request routing.
- It provides a simple yet modular solution for production-level ML serving, encompassing prediction, preprocessing and postprocessing, monitoring, and explainability features.
- It facilitates advanced deployment strategies such as canary rollouts, experimental testing, model ensembles, and transformers for more complex use cases.
Setting up KServe: Quickstart and Serverless installation
To test inference with KServe, start with the KServe Quickstart guide for a simple setup. If you need a production-ready environment, refer to the Administration Guide. Dell Technologies opted for the Serverless installation to meet its scalability and resource flexibility requirements.
Serverless installation:
As part of KServe (v0.11.2) installation, Dell Technologies had to install the following dependencies first:
- Istio (v1.17.0)
- Certificate manager (v1.13.0)
- Knative Serving (v1.11.0)
- DNS Configuration
Each dependency is described below.
Istio (v1.17.0)
Purpose: Manages traffic and security.
Why needed: Ensures efficient routing, secure service communication, and observability for microservices.
- Download the Istio tar file: https://github.com/istio/istio/releases/download/1.17.0/istio-1.17.0-linux-amd64.tar.gz
- Extract the tar file and change directory (cd) to the extracted folder.
- Install with the istioctl command: bin/istioctl install --set profile=default -y
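Optionally, confirm that the Istio control plane and ingress gateway pods are running before proceeding (istio-system is the default namespace created by the installer):
kubectl get pods -n istio-system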
See the Istio guide for details.
Certificate manager (v1.13.0)
Purpose: Automates TLS certificate management.
Why needed: Provides encrypted and secure communication between services, crucial for protecting data.
- kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
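Optionally, confirm that the cert-manager pods are running (cert-manager installs into its own namespace by default):
kubectl get pods -n cert-manager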
Knative Serving (v1.11.0)
Purpose: Enables serverless deployment and scaling.
Why needed: Automatically scales KServe model serving pods based on demand, ensuring efficient resource use.
- kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.11.0/serving-crds.yaml
- kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.11.0/serving-core.yaml
- kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.11.0/net-istio.yaml
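Optionally, confirm that the Knative Serving components are healthy before moving on:
kubectl get pods -n knative-serving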
DNS Configuration
Purpose: Configures the domain used in Knative service URLs.
Why needed: Ensures that services can communicate using human-readable names, which is crucial for reliable request routing.
kubectl patch configmap/config-domain \
--namespace knative-serving \
--type merge \
--patch '{"data":{"example.com":""}}'
For more details, please see the Knative Serving install guide.
Installation steps for KServe
- kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.11.2/kserve.yaml
- kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.11.2/kserve-runtimes.yaml
The deployment requires a ClusterStorageContainer. Provide the http_proxy and https_proxy values in it if the cluster is behind a proxy and the nodes do not have direct Internet access, as illustrated in the sketch below.
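A minimal sketch of such a ClusterStorageContainer is shown here; the proxy URL, image tag, resource values, and supported URI formats are placeholders to adapt to your cluster (the default ClusterStorageContainer installed with KServe can also be edited in place):
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterStorageContainer
metadata:
  name: default
spec:
  container:
    name: storage-initializer
    image: kserve/storage-initializer:v0.11.2
    env:
      # Set these only if the cluster sits behind a proxy
      - name: http_proxy
        value: "http://proxy.example.com:80"
      - name: https_proxy
        value: "http://proxy.example.com:80"
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: "1"
        memory: 1Gi
  supportedUriFormats:
    - prefix: s3://
    - prefix: https://
    - prefix: http://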
KServe is also included with the Kubeflow deployment. If you prefer to deploy Kubeflow instead, refer to the Kubeflow git page for installation steps. Dell Technologies used the single-command installation approach to install Kubeflow with Kustomize, shown below.
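For reference, the single-command pattern documented in the Kubeflow manifests repository is roughly the following; the example directory name and retry loop come from that repository's README and may change between releases:
git clone https://github.com/kubeflow/manifests.git
cd manifests
# Apply the example kustomization, retrying until all CRDs and resources are accepted
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying to apply resources"
  sleep 10
done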
Llama 3 model execution with Kubernetes
With the cluster and KServe configured as described above, deploy the Llama 3 model on Kubernetes by following the instructions below.
Create the manifest file
This YAML manifest uses the serving.kserve.io/v1beta1 API version, ensuring compatibility with the KServe environment. It defines a KServe Inference Service named llama-3-70b that runs a container serving the model with vLLM.
The configuration allocates MI300X GPUs and sets the required environment. It specifies resource requests and limits for CPUs, memory, and GPUs, along with proxy settings and the Hugging Face authentication token.
In the YAML example file below, arguments are passed to the container's command. This container expects:
- --port: The port on which the service will listen (8080)
- --model: The model to be loaded, specified as meta-llama/Meta-Llama-3-70B-Instruct or meta-llama/Meta-Llama-3-8B-Instruct
Alternatively, separate YAML files can be created to run both models independently.
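The original manifest is not reproduced here, but a minimal sketch for the 70B model could look like the following. The container image, Hugging Face token secret, proxy value, and resource figures are placeholders, and the vLLM OpenAI API server entrypoint is an assumption; adapt them to your registry and environment:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-70b
spec:
  predictor:
    containers:
      - name: vllm-container
        # Placeholder: a vLLM ROCm container image published to your own registry
        image: <your-registry>/vllm-rocm:0.4.1
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
          - --port
          - "8080"
          - --model
          - meta-llama/Meta-Llama-3-70B-Instruct
        env:
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token        # placeholder secret containing the HF token
                key: token
          - name: https_proxy         # set only if the cluster is behind a proxy
            value: "http://proxy.example.com:80"
        ports:
          - containerPort: 8080
            protocol: TCP
        resources:
          requests:
            cpu: "8"
            memory: 64Gi
            amd.com/gpu: "1"          # one MI300X per model instance
          limits:
            cpu: "16"
            memory: 256Gi
            amd.com/gpu: "1"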
For endpoint inferencing, choose any of the three methods mentioned in the previous blog: offline inferencing with a container image, the endpoint API, or the OpenAI-style approach.
Apply the manifest file
The next step is to run the kubectl apply command to deploy the vLLM configuration defined in the YAML file onto the Kubernetes cluster. This command triggers Kubernetes to interpret the YAML specification and create the Inference Service named llama-3-70b, setting up the vLLM model container with the designated resources and environment configuration.
The initial READY status will be either unknown or null. After the model is ready, it changes to True. For an instant overview of Inference Services across all namespaces, run the kubectl get isvc -A command, as shown below. It provides essential details such as readiness status, URLs, and revision information, enabling quick insight into deployment status and history.
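Assuming the manifest above was saved as llama-3-70b.yaml (the file name is illustrative), the sequence is:
kubectl apply -f llama-3-70b.yaml
kubectl get isvc llama-3-70b
kubectl get isvc -A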
For each deployed Inference Service, one pod is created. In this deployment, two pods are running, each hosting a distinct Llama-3 model (8B and 70B) on different GPUs of the same XE9680 server.
To get detailed information about a pod's configuration, status, and events, use the kubectl describe pod command, which aids in troubleshooting and monitoring within Kubernetes.
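For example (the pod name below is illustrative; copy the real name from the kubectl get pods output):
kubectl get pods
kubectl describe pod llama-3-70b-predictor-<revision>-<hash>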
After the pod is up and running, users can perform inference through the designated endpoints. Run a curl request to verify that the model is being served, either against the local host or through the ingress IP and port, as in the sketch below.
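A sketch of such a request, assuming the OpenAI-compatible endpoint served by vLLM on port 8080 and the Istio ingress gateway installed earlier; the ingress host, port, and service hostname are resolved from the cluster, and the prompt is only an example:
# Resolve the ingress address and the Inference Service hostname
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
export SERVICE_HOSTNAME=$(kubectl get isvc llama-3-70b -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# Send a completion request through the ingress
curl http://${INGRESS_HOST}:${INGRESS_PORT}/v1/completions \
  -H "Host: ${SERVICE_HOSTNAME}" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-70B-Instruct", "prompt": "What is Kubernetes?", "max_tokens": 64}'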
Conclusion
Integrating Llama 3 on the Dell PowerEdge XE9680 with the powerful AMD Instinct MI300X accelerators highlights the adaptability and efficiency of a Kubernetes-based infrastructure. vLLM accelerates model serving, while KServe streamlines deploying, scaling, and managing these inference workloads on Kubernetes.