The Validated Design for AI with VMware and NVIDIA delivers the cnvrg.io MLOps platform on Dell Technologies infrastructure, VMware vSphere, NVIDIA GPUs, and NVIDIA AI Enterprise. The following figure shows how cnvrg.io is deployed on these components:
Figure 3. Solution architecture for the Validated Design for MLOps using cnvrg.io
PowerEdge servers or the VxRail HCI Appliance provide compute resources for the AI workloads and pipelines deployed on cnvrg.io. The PowerEdge servers can optionally be configured with NVIDIA GPUs to accelerate neural network training and inference. PowerScale provides storage for the data lake, the repository for the unstructured data used for neural network training. Isilon F800 All-Flash scale-out NAS storage is ideal for this role, delivering the analytics performance and extreme concurrency at scale needed to consistently feed the most data-hungry deep learning (DL) algorithms. Dell PowerSwitch switches provide network connectivity and out-of-band (OOB) management connectivity.
These PowerEdge servers are configured with VMware vSphere with Tanzu to enable the creation of Tanzu Kubernetes clusters, which can be managed by Tanzu Mission Control. Servers running VMware vSAN provide the storage repository for the VMs and pods.
NVIDIA AI Enterprise and its GPU Operator provide automated deployment of the NVIDIA software components required for the GPUs. Using the Multi-Instance GPU (MIG) capability of the NVIDIA GPUs, administrators can create vGPU profiles and assign them to Kubernetes worker nodes. Using cnvrg.io, these nodes can run workloads such as TensorFlow for training, Jupyter notebooks for interactive model development, and TensorRT for inference.
NVIDIA AI Enterprise provides fully validated and supported AI frameworks that are made available as containers and can be integrated into the cnvrg.io catalog.
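As an illustration, the following Python sketch uses the Kubernetes client to submit a TensorFlow training pod that requests one GPU-backed resource on a worker node. The namespace, container image tag, and training command are placeholders, and the resource name depends on the MIG strategy configured in the GPU Operator: with the single strategy, MIG instances are exposed as nvidia.com/gpu, while the mixed strategy exposes profile-specific names such as nvidia.com/mig-2g.20gb.

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
core = client.CoreV1Api()

# Training pod that requests one GPU (or one MIG instance, depending on the
# GPU Operator MIG strategy). Namespace, image tag, and command are placeholders.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="tf-train", namespace="cnvrg"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/tensorflow:23.03-tf2-py3",  # NGC TensorFlow container (tag is an example)
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},  # or a MIG-specific name such as "nvidia.com/mig-2g.20gb"
                ),
            )
        ],
    ),
)
core.create_namespaced_pod(namespace="cnvrg", body=pod)
```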
To make better use of the resources and make them available to AI workloads, we recommend creating multiple node pools in the Tanzu Kubernetes cluster. Node pools improve the use of CPU and GPU resources and simplify management. Each node pool has nodes (or workers) with a specific CPU, memory, and GPU resource allocation, configured through a VM class. The node pools can then be associated with cnvrg.io workspaces through node labels in the compute templates. The following figure shows the configuration:
Figure 4. Multiple-node pool configuration with associated VM classes
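Before mapping node pools to cnvrg.io compute templates, you can confirm which worker nodes belong to which pool by grouping the nodes by their node-pool label. The following Python sketch uses a hypothetical label key; check the labels your Tanzu Kubernetes cluster actually applies (for example, with kubectl get nodes --show-labels) and substitute the correct key.

```python
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Assumed label key for the node pool; verify the real key on your cluster.
POOL_LABEL = "run.tanzu.vmware.com/tkc-nodepool"

# Group worker nodes by node-pool label.
pools = defaultdict(list)
for node in core.list_node().items:
    pool = (node.metadata.labels or {}).get(POOL_LABEL, "<unlabeled>")
    pools[pool].append(node.metadata.name)

for pool, nodes in sorted(pools.items()):
    print(f"{pool}: {', '.join(nodes)}")
```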
When assigning GPU resources to a VM class using MIG profiles, specify the memory size associated with that MIG profile. The following table shows the recommended VM classes for a cnvrg.io deployment; it is the vSphere with Tanzu-specific instance of Table 3.
Table 5. Recommended sizing of different worker node configurations in a Tanzu Kubernetes cluster using VM Classes
VM Class for worker node | Number of vCPUs | vMemory (in GB) | GPU Count, GPU Memory (Per VM) |
CPU-only | 64 | 128 | None |
GPU-train-medium | 24 | 96 | |
GPU-train-large | 48 | 96 | 1 GPU, 80 GB |
GPU-infer-small | 16 | 32 | 1 GPU, 10 GB |
GPU-infer-medium | 16 | 32 | 1 GPU, 20 GB |
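The GPU memory column in Table 5 maps directly to a MIG profile. Assuming NVIDIA A100 80 GB GPUs, the following sketch shows that mapping as a small helper; profile names differ for other GPU models, so treat the values in the code as illustrative only.

```python
# Illustrative mapping of the Table 5 GPU memory sizes to NVIDIA A100 80 GB
# MIG profile names. Other GPU models use different profile names and sizes.
A100_80GB_MIG_PROFILES = {
    10: "1g.10gb",
    20: "2g.20gb",
    40: "3g.40gb",
    80: "7g.80gb",  # effectively the full GPU
}

def mig_profile_for(gpu_memory_gb: int) -> str:
    """Return the smallest A100 80 GB MIG profile with at least the requested memory."""
    for size in sorted(A100_80GB_MIG_PROFILES):
        if size >= gpu_memory_gb:
            return A100_80GB_MIG_PROFILES[size]
    raise ValueError(f"No MIG profile provides {gpu_memory_gb} GB on this GPU")

# Example: the GPU-infer-medium VM class in Table 5 requests 20 GB of GPU memory.
print(mig_profile_for(20))  # -> 2g.20gb
```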
We recommend creating a Tanzu Kubernetes cluster with three Kubernetes control plane nodes for high availability.
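A minimal sketch of creating such a cluster through the Kubernetes custom objects API is shown below. The manifest sets three control plane replicas and defines node pools bound to the VM classes from Table 5. The cluster name, namespace, storage class, Kubernetes release (tkr), VM class names, and replica counts are placeholders, and the exact schema depends on your vSphere with Tanzu release (the v1alpha2 TanzuKubernetesCluster API is assumed here).

```python
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

# Sketch of a TanzuKubernetesCluster with three control plane nodes and node
# pools bound to the VM classes from Table 5. All names below are placeholders;
# verify the schema against your vSphere with Tanzu release.
tkc = {
    "apiVersion": "run.tanzu.vmware.com/v1alpha2",
    "kind": "TanzuKubernetesCluster",
    "metadata": {"name": "cnvrg-tkc", "namespace": "cnvrg-ns"},
    "spec": {
        "topology": {
            "controlPlane": {
                "replicas": 3,
                "vmClass": "best-effort-large",  # placeholder control plane VM class
                "storageClass": "vsan-default-storage-policy",
                "tkr": {"reference": {"name": "<tkr-version>"}},
            },
            "nodePools": [
                {"name": "gpu-train-large", "replicas": 2,
                 "vmClass": "gpu-train-large",
                 "storageClass": "vsan-default-storage-policy"},
                {"name": "gpu-infer-small", "replicas": 2,
                 "vmClass": "gpu-infer-small",
                 "storageClass": "vsan-default-storage-policy"},
            ],
        }
    },
}

custom.create_namespaced_custom_object(
    group="run.tanzu.vmware.com",
    version="v1alpha2",
    namespace="cnvrg-ns",
    plural="tanzukubernetesclusters",
    body=tkc,
)
```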
The design guide for the Validated Design for AI with VMware and NVIDIA provides recommended configurations that use NVIDIA GPUs. Additionally, a CPU-only configuration can be used for cnvrg.io deployments that do not require hardware acceleration for neural network training. The following table provides the recommendation for a CPU-only configuration:
Table 6. Recommended configurations for the CPU-only Validated Design
Configuration | CPU-only configuration |
Compute server | PowerEdge R750 or VxRail P670F |
Server configuration | |
Number of nodes in a cluster | Minimum of 3 ESXi hosts; 4 hosts are recommended when using vSAN to maintain resiliency during updates and upgrades |
Network switch | 2 x Dell PowerSwitch S5248F-ON |
OOB switch | 1 x Dell PowerSwitch N3248TE-ON or 1 x Dell PowerSwitch S4148T-ON |
VMware vSphere | |
Storage | |