Deploy the solution for running AI workloads as Kubernetes pods
This section provides guidelines for deploying a Tanzu Kubernetes cluster for running AI workloads as Kubernetes pods.
The following figure shows the software components of VMware vSphere with Tanzu:
Figure 8. VMware vSphere with Tanzu – software components
For installation instructions, see Install and Configure the NSX Advanced Load Balancer. The high-level steps include:
Enable a vSphere cluster for Workload Management by creating a Supervisor Cluster. See Enable Workload Management with vSphere Networking and the Tanzu product documentation for more information.
A VM class is a template that defines CPU, memory, and reservations for VMs. GPU allocations are added to the VM class. The VM class helps to set guardrails for the policy and governance of VMs by anticipating development needs and accounting for resource availability and constraints. We recommend creating one VM class per Tanzu Kubernetes cluster, or allocating the same GPU resource to all the VM classes that the cluster uses.
The steps to create a VM class include:
Time Sharing refers to temporal partitioning, while Multi-Instance GPU sharing refers to NVIDIA MIG capability, as shown in the following figure:
Figure 9. GPU sharing field
VM classes can also be used with NVIDIA networking. NVIDIA networking cards can be added to VM classes as PCI devices. You can use VM classes with NVIDIA networking without GPUs. For detailed instructions, see the NVIDIA AI Enterprise documentation.
After the VM classes and namespaces are created, use a YAML file to create a Tanzu Kubernetes cluster. The following example shows a YAML file.
Note: Enter the details for your cluster in the fields with angle brackets (< >).
apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
metadata:
  name: tkg-a30-cx6
  namespace: tkg-ns
  annotations:
    run.tanzu.vmware.com/resolve-os-image.nodepool-a30-cx6: os-name=ubuntu
spec:
  distribution:
    fullVersion: v1.23.8---vmware.2-tkg.2-zshippable
  settings:
    storage:
      defaultClass: vsan-default-storage-policy
    network:
      cni:
        name: antrea
      services:
        cidrBlocks: [<IP range>]
      pods:
        cidrBlocks: [<IP range>]
      serviceDomain: local
  topology:
    controlPlane:
      replicas: 3
      storageClass: <Kubernetes Storage Class>
      tkr:
        reference:
      vmClass: vm-class-control-plane
    nodePools:
    - name: nodepool-a30-cx6
      labels:
        node-label: gpu-infer-large
      replicas: 2
      storageClass: <Kubernetes Storage Class>
      tkr:
        reference:
      vmClass: vm-class-cpu-32c-512g
      volumes:
      - name: containerd
        mountPath: /var/lib/containerd
        capacity:
          storage: 100Gi
      - name: kubelet
        mountPath: /var/lib/kubelet
        capacity:
          storage: 100Gi
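Because the example leaves several fields as angle-bracket placeholders, a quick pre-flight check can catch any that were missed before the file is applied. The following is a minimal sketch and is not part of the Dell, VMware, or NVIDIA tooling: the function name check_placeholders is hypothetical, and the pattern only looks for text of the form <...>.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report any unfilled <...> placeholders in a manifest.
# Returns 0 when none remain, 1 when at least one is found.
check_placeholders() {
  manifest="$1"
  # -n prints line numbers; the pattern matches "<", one or more non-">" characters, then ">".
  if grep -nE '<[^>]+>' "$manifest"; then
    echo "Unfilled placeholders remain in $manifest" >&2
    return 1
  fi
  return 0
}
```

For example, check_placeholders tanzucluster.yaml prints the offending lines and returns a nonzero status until every placeholder has been replaced.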
Run the following command to apply the configuration in the YAML file and create the Tanzu Kubernetes cluster. Ensure that kubectl is set to the correct context, which is the Supervisor Cluster.
kubectl apply -f tanzucluster.yaml
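One way to guard against applying the manifest to the wrong cluster is to compare the active kubectl context with the expected one first. This sketch assumes that, after logging in with the vSphere Plugin for kubectl, the context for the vSphere Namespace carries the namespace name (tkg-ns in the example YAML); the function name require_context is hypothetical.

```shell
#!/usr/bin/env bash
# Log in to the Supervisor Cluster first (replace the values in angle brackets):
#   kubectl vsphere login --server=<Supervisor Cluster IP> --vsphere-username <user>
#   kubectl config use-context tkg-ns

# Hypothetical helper: succeed only if the active context matches the expected name.
require_context() {
  expected="$1"
  current="$(kubectl config current-context)"
  if [ "$current" != "$expected" ]; then
    echo "kubectl context is '$current', expected '$expected'" >&2
    return 1
  fi
  return 0
}
```

A possible use is require_context tkg-ns && kubectl apply -f tanzucluster.yaml, which skips the apply step when the context does not match.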
Run the following command to see the status of the cluster and ensure that the cluster is deployed and ready.
kubectl get tkc
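Cluster creation can take several minutes, so a polling loop may be more convenient than rerunning the command by hand. The sketch below assumes that the TanzuKubernetesCluster resource reports a status condition of type Ready; verify this against your Tanzu release (for example, with kubectl describe tkc) before relying on it. The function name wait_for_tkc, its arguments, and the default intervals are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll until the cluster's Ready condition is "True".
# Usage: wait_for_tkc <cluster> <namespace> [attempts] [seconds-between-attempts]
wait_for_tkc() {
  name="$1"
  ns="$2"
  attempts="${3:-60}"
  delay="${4:-30}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Assumption: the resource exposes a status condition with type "Ready".
    ready="$(kubectl get tkc "$name" -n "$ns" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || true)"
    if [ "$ready" = "True" ]; then
      echo "Cluster $name is ready"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Timed out waiting for cluster $name" >&2
  return 1
}
```

With the example cluster, wait_for_tkc tkg-a30-cx6 tkg-ns checks every 30 seconds for up to 60 attempts and returns a nonzero status if the cluster never reports ready.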
NVIDIA GPU Operator uses the operator framework in Kubernetes to automate the management of all NVIDIA software components that are needed to provision the GPU. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plug-in for GPUs, the NVIDIA Container Runtime, automatic node labeling, DCGM-based monitoring, and others.
NVIDIA Network Operator uses the operator framework in Kubernetes to manage networking-related components and to enable fast networking, RDMA, and GPUDirect for workloads in a Kubernetes cluster. The Network Operator works with the GPU Operator to enable GPUDirect RDMA on compatible systems.
See the NVIDIA documentation for information about deploying the operators.
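After the operators are running, a short-lived pod that requests one GPU is a common way to confirm that the GPU node pool is schedulable. The following sketch prints such a manifest. It assumes that the GPU Operator's device plug-in advertises the nvidia.com/gpu resource, that the node-label value from the example cluster YAML is present on the worker nodes, and that the image you supply makes nvidia-smi available. The function name gpu_test_pod and the pod name gpu-smoke-test are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a pod manifest that requests one GPU.
# Usage: gpu_test_pod <container image that includes CUDA tooling>
gpu_test_pod() {
  image="$1"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    node-label: gpu-infer-large   # label set on the node pool in the cluster YAML
  containers:
  - name: cuda
    image: ${image}
    command: ["nvidia-smi"]       # assumption: made available by the NVIDIA Container Runtime
    resources:
      limits:
        nvidia.com/gpu: 1         # resource name advertised by the NVIDIA device plug-in
EOF
}
```

The output can be piped to kubectl apply -f - while the kubectl context points at the Tanzu Kubernetes cluster, and kubectl logs gpu-smoke-test should then list the GPU that the pod was given if the assumptions hold.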
This section provides an overview of the software components that are deployed as part of the Tanzu ecosystem.
Tanzu Mission Control is available as SaaS in the VMware Cloud Services portfolio. Follow the steps in the invitation link or service sign-up email to create a Cloud Services account. When you have created an account, you can access Tanzu Mission Control as one of the services available to your account.
Deploy the following TKG extensions: