The following figure shows the software components of the validated design for AI:
Figure 4. Validated design for AI components
The software components for the validated design for AI include:
- VMware vSphere 8 supports virtualized NVIDIA GPUs for both VMs and containers through VMware vSphere with Tanzu.
- NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and data analytics software that is optimized, certified, and supported by NVIDIA to run exclusively on VMware vSphere with NVIDIA-Certified Systems. It includes the following components for Tanzu support:
  - NVIDIA GPU Operator, which uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plug-in for GPUs, the NVIDIA Container Runtime, automatic node labeling, DCGM-based monitoring, and others.
  - NVIDIA Network Operator, which uses the operator framework within Kubernetes to manage networking-related components that enable fast networking, RDMA, and GPUDirect for workloads in a Kubernetes cluster. The Network Operator works with the GPU Operator to enable GPUDirect RDMA on compatible systems.
- Harbor is an open-source, trusted, cloud-native container registry that stores, signs, and scans content. Tanzu Kubernetes Grid includes signed binaries for Harbor, which can be deployed on a shared services cluster to provide container registry services for other Tanzu Kubernetes clusters.
Harbor can be integrated with NVIDIA AI Enterprise and used as a private registry for NVIDIA AI Enterprise images, which can then be made available to all the Tanzu Kubernetes clusters. The implementation of Harbor as a shared service has been tested with Tanzu Kubernetes Grid and is fully supported by VMware.
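Using Harbor as a private registry typically means that workloads reference images by the Harbor host and project rather than by the upstream registry. The following minimal Python sketch shows one way to rewrite an image reference accordingly. The host name `harbor.example.internal`, the project `ai-images`, and the image names are hypothetical placeholders, not values from this design, and the sketch does not copy any image data.

```python
def retarget_image(image: str, harbor_host: str, project: str) -> str:
    """Rewrite an image reference so that it points at a Harbor project.

    Harbor addresses images as <host>/<project>/<repository>:<tag>.
    """
    first, sep, rest = image.partition("/")
    # Docker convention: the first path component is a registry host only
    # if it contains a dot or a colon, or is exactly "localhost".
    has_registry = bool(sep) and ("." in first or ":" in first or first == "localhost")
    # Drop the upstream registry host, if any, and keep repository and tag.
    path = rest if has_registry else image
    return f"{harbor_host}/{project}/{path}"


# Hypothetical upstream reference, used only for illustration.
print(retarget_image("registry.example.com/team/trainer:1.0",
                     "harbor.example.internal", "ai-images"))
```

The image itself would still need to be replicated or pushed to Harbor by other means; the function only computes the reference that the Tanzu Kubernetes clusters would pull from.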
- Prometheus is an open-source systems monitoring and alerting toolkit. It collects and stores its metrics as time series data; that is, each metric value is stored with the timestamp at which it was recorded, along with optional key-value pairs called labels. Grafana is open-source software that enables you to visualize and analyze the metrics data that Prometheus collects on Tanzu Kubernetes clusters. Tanzu Kubernetes Grid includes signed binaries for Prometheus and Grafana that you can deploy on Tanzu Kubernetes clusters to monitor cluster health and services.
Prometheus and Grafana are supported with VMware Tanzu. However, Prometheus and Grafana do not support GPU metrics for NVIDIA A100 GPUs.
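The time series model described above can be sketched in a few lines of Python. This is an illustration of the data model only, not Prometheus code: the `Sample` class, the metric name `node_memory_usage_bytes`, and the label values are hypothetical. The rendered line follows the general shape of the Prometheus text exposition format (metric name, optional labels in braces, value, optional timestamp in milliseconds).

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Sample:
    """One point in a time series: a value recorded at a timestamp."""
    name: str                 # metric name (hypothetical in this sketch)
    value: float              # recorded value
    timestamp_ms: int         # milliseconds since the Unix epoch
    labels: Dict[str, str] = field(default_factory=dict)  # optional key-value pairs

    def series_key(self) -> str:
        """Identify the series by metric name plus its sorted label set."""
        if not self.labels:
            return self.name
        # Note: label values are not escaped here; a complete implementation
        # would escape backslashes, double quotes, and newlines.
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(self.labels.items()))
        return f"{self.name}{{{pairs}}}"

    def to_text(self) -> str:
        """Render the sample as a single text line."""
        return f"{self.series_key()} {self.value} {self.timestamp_ms}"


sample = Sample("node_memory_usage_bytes", 1024.0, 1700000000000,
                {"node": "worker-1", "cluster": "tkc-ai"})
print(sample.to_text())
```

Two samples belong to the same series only when both the metric name and the full label set match, which is why the sketch sorts the labels before building the key.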
- VMware NSX Advanced Load Balancer (Avi) with Cloud Services provides multicloud load balancing, web application firewall, and container ingress services. It can be used to load balance AI workloads such as machine learning operations (MLOps) applications or inference services.
- Tanzu Mission Control is a centralized hub for simplified, multicloud, multicluster Kubernetes management. It provides centralized policy management, enabling administrators to apply consistent policies, such as access, security, and custom policies, to a fleet of clusters and namespaces at scale. It also provides life cycle management for Kubernetes clusters, enabling administrators to provision, scale, upgrade, and delete Tanzu Kubernetes Grid clusters.
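To illustrate how a workload consumes the GPUs that the NVIDIA GPU Operator provisions, the following minimal Python sketch builds a Pod manifest that requests GPUs through the `nvidia.com/gpu` extended resource, which the NVIDIA device plug-in for Kubernetes advertises on GPU nodes. The pod name, image reference, and helper function name are hypothetical placeholders, and the sketch assumes that the GPU Operator is already installed on the target cluster.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Kubernetes Pod manifest that requests `gpus` NVIDIA GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,  # hypothetical image reference
                    # GPUs are requested in whole units under `limits`.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


manifest = gpu_pod_manifest(
    "cuda-check", "harbor.example.internal/ai-images/cuda-sample:1.0"
)
# kubectl accepts JSON as well as YAML, so the output can be saved to a
# file and applied with `kubectl apply -f <file>`.
print(json.dumps(manifest, indent=2))
```

The scheduler places such a pod only on a node that advertises enough free `nvidia.com/gpu` capacity, so the manifest itself does not need to name a specific node.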