cnvrg.io on VMware Tanzu
The cnvrg.io MLOps platform is delivered on the Validated Design for AI with VMware and NVIDIA, which includes Dell Technologies infrastructure, VMware vSphere with Tanzu, NVIDIA GPUs, and NVIDIA AI Enterprise. This validated design is a virtualization platform that enables an enterprise to manage both on-demand Kubernetes clusters and traditional virtual machines, providing complete life cycle management of the underlying compute and storage resources.
See the following documentation for more information about the Validated Design for AI using VMware and NVIDIA:
The following figure shows how cnvrg.io is deployed on these components:
PowerEdge servers or the VxRail HCI Appliance provide compute resources for the AI workloads and pipelines deployed on cnvrg.io. The PowerEdge servers can optionally be configured with NVIDIA GPUs to accelerate neural network training and inference. PowerScale provides storage for the data lake, the repository for the unstructured data used in neural network training. The Isilon F800 All-Flash scale-out NAS is ideal for this role, delivering the analytics performance and extreme concurrency at scale needed to consistently feed the most data-hungry deep learning algorithms. PowerSwitch switches provide network connectivity and out-of-band (OOB) management connectivity.
These PowerEdge servers are configured with VMware vSphere with Tanzu to enable the creation of Tanzu Kubernetes clusters, which can be managed by Tanzu Mission Control. Servers running VMware vSAN provide a storage repository for the VMs and pods.
NVIDIA AI Enterprise and its GPU Operator provide automated deployment of the NVIDIA software components required for the GPUs. Using the Multi-Instance GPU (MIG) capability of the NVIDIA GPUs, administrators can create vGPU profiles and assign them to Kubernetes worker nodes. Using cnvrg.io, these nodes can run various workloads such as TensorFlow for training, Jupyter notebooks for interactive model development, and TensorRT for inference.
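For illustration, the following minimal sketch shows how such a workload requests a GPU in Kubernetes; the pod name, NGC image tag, and training script are assumptions, not artifacts of this validated design:

```yaml
# Minimal sketch of a training pod requesting one vGPU (for example, a
# MIG-backed profile). Assumes the NVIDIA GPU Operator exposes the device
# as nvidia.com/gpu; the image tag and script are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: tf-train-example
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: nvcr.io/nvidia/tensorflow:23.03-tf2-py3   # example NGC TensorFlow image
    command: ["python", "train.py"]                  # placeholder training script
    resources:
      limits:
        nvidia.com/gpu: 1                            # one vGPU per pod
```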
NVIDIA AI Enterprise provides fully validated and supported AI frameworks that are made available as containers and can be integrated into the cnvrg.io catalog.
ML workloads run as worker pods. Worker pods are categorized, and each category is associated with a specific type of Kubernetes worker node that has a particular hardware resource configuration, such as the type of GPU configured in each worker node. The categories of worker pods include:
As described earlier, each category of worker pods is associated with a specific type of Kubernetes worker node using node pools. Certain pods are constrained to run only on a particular set of nodes (a pool) by using node labels, which in Kubernetes are key/value pairs attached to nodes. Pods are then constrained through a nodeSelector: this field is added to the pod specification to name the labels that the target node must have, and Kubernetes schedules the pod only onto nodes that carry those labels, as shown in the sketch that follows.
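For illustration only, the label key and values below are assumptions rather than labels mandated by this design. A worker node is first labeled, for example with `kubectl label node worker-gpu-01 node-pool=gpu-train-large`, and a pod then selects that pool:

```yaml
# Sketch of a pod constrained to a hypothetical "gpu-train-large" node pool.
# The node-pool label key and its value are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: pinned-workload-example
spec:
  nodeSelector:
    node-pool: gpu-train-large   # only nodes carrying this label are eligible
  containers:
  - name: workload
    image: registry.example.com/workload:latest   # placeholder image
```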
The following node pools can be created by manually assigning node labels to the corresponding worker nodes:
Data scientists can specify node labels when creating compute templates that will be used for their project or workspace.
Administrators can also manually configure the control plane pods with corresponding node labels to constrain the control plane pods to run only on certain nodes (control plane node pool). However, this option is beyond the scope of this design guide. This validated design relies on the Kubernetes scheduler to automatically place the cnvrg.io control plane pods on worker nodes based on resource availability.
For smaller deployments, some of these node pools, such as the GPU model development and GPU training pools, can be combined. Alternatively, node pools can be omitted for certain pods (for example, CPU worker pods), letting the Kubernetes scheduler place those pods automatically, spreading them across nodes while avoiding nodes with insufficient free resources.
To make better use of the resources available to AI workloads, we recommend creating multiple node pools in the Tanzu Kubernetes cluster. Node pools improve the use of CPU and GPU resources and simplify management. Each node pool contains nodes (workers) with a specific CPU, memory, and GPU resource allocation, configured through a VM class. The node pools can then be associated with cnvrg.io workspaces through node labels in the compute templates. The following figure shows the configuration:
When assigning GPU resources to a VM class using MIG profiles, specify the memory size associated with that MIG profile. The following table shows the recommended VM classes for a cnvrg.io deployment, as a vSphere with Tanzu-specific instance of Table 3; a sample cluster specification follows the table.
VM class for worker node | Number of vCPUs | vMemory (GB) | GPU count, GPU memory (per VM) |
CPU-only | 64 | 128 | None |
GPU-train-medium | 24 | 96 | 1 GPU, 40 GB |
GPU-train-large | 48 | 96 | 1 GPU, 80 GB |
GPU-infer-small | 16 | 32 | 1 GPU, 10 GB |
GPU-infer-medium | 16 | 32 | 1 GPU, 20 GB |
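The following is a minimal sketch of how these VM classes might map to node pools in a Tanzu Kubernetes cluster specification (v1alpha3 API). The cluster name, replica counts, storage class, and Tanzu Kubernetes release (TKR) version are assumptions for illustration; substitute the values available in your environment:

```yaml
# Sketch of a TanzuKubernetesCluster that maps custom VM classes to node pools.
# Names, replica counts, storage class, and TKR version are illustrative.
apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
metadata:
  name: cnvrg-tkc-example
spec:
  topology:
    controlPlane:
      replicas: 3                                  # three control plane nodes, per the recommendation
      vmClass: best-effort-large                   # example built-in VM class
      storageClass: vsan-default-storage-policy    # example vSAN storage class name
      tkr:
        reference:
          name: v1.23.8---vmware.2-tkg.2-zshippable   # example TKR; use one offered in your environment
    nodePools:
    - name: cpu-only
      replicas: 2
      vmClass: cpu-only                            # custom VM class: 64 vCPUs, 128 GB
      storageClass: vsan-default-storage-policy
    - name: gpu-train-large
      replicas: 1
      vmClass: gpu-train-large                     # custom VM class with a full A100 80 GB vGPU
      storageClass: vsan-default-storage-policy
      labels:
        node-pool: gpu-train-large                 # surfaces as a node label for nodeSelector matching
```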
We recommend creating a Tanzu Kubernetes cluster with three Kubernetes control plane nodes.
The design guide for the Validated Design for AI using VMware and NVIDIA provides recommended configurations that use NVIDIA GPUs. Additionally, a CPU-only configuration can be used for cnvrg.io deployments that do not require hardware acceleration for neural network training.
The following table provides the recommendation for a CPU-only configuration:
Configuration | CPU-only configuration |
Compute server | PowerEdge R750 or VxRail P670F |
Server configuration | |
Number of nodes in a cluster | Minimum of 3 ESXi hosts; 4 hosts recommended when using vSAN, for resiliency during updates and upgrades |
Network switch | 2 x Dell PowerSwitch S5248F-ON |
OOB switch | 1 x Dell PowerSwitch N3248TE-ON or 1 x Dell PowerSwitch S4148T-ON |
VMware vSphere | |
Storage | |
The Validated Design for AI using VMware and NVIDIA offers two network architecture options: a 25 GbE-based design and a 100 GbE-based design. For deploying cnvrg.io, we recommend the 25 GbE-based design with PowerSwitch network switches. This network design suits neural network training jobs that run on a single node (using at most two GPUs), as well as model development and inference jobs that take advantage of GPU partitioning. Because this release does not support multi-GPU training across nodes, a 100 GbE-based network is not required.
In this validated design, we use VMware NSX Advanced Load Balancer as the ingress controller and load balancer.
vSAN is the recommended storage for the VMs that serve as the Tanzu Kubernetes cluster control plane nodes and worker nodes. vSAN automatically creates a default storage class through the vSAN Container Storage Interface (CSI) driver. When cnvrg.io is deployed, vSAN is used for pod storage and container images. cnvrg.io application data is stored as persistent volumes on vSAN, and datasets imported to cnvrg.io are also hosted on vSAN.
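As a sketch, a persistent volume claim against the vSAN CSI storage class might look like the following; the storage class name varies by environment (check `kubectl get storageclass`) and is an assumption here:

```yaml
# Sketch of a PVC bound to the vSAN CSI storage class.
# The storageClassName and requested size are illustrative assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cnvrg-app-data-example
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: vsan-default-storage-policy   # assumed name of the vSAN storage class
  resources:
    requests:
      storage: 50Gi
```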
We recommend PowerScale storage for the data lake, that is, for storing the data required for neural network training. PowerScale storage can also be used as an NFS cache for datasets. A local NFS cache can save time when working with large datasets pulled from external object storage; the data is saved to an NFS server accessible from the Kubernetes cluster.
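For illustration, a statically provisioned NFS volume backed by a PowerScale export could be declared as follows; the server address and export path are placeholders, not values from this design. A matching PersistentVolumeClaim would then bind to this volume:

```yaml
# Sketch of a static PV for a PowerScale NFS export used as a dataset cache.
# The server address and export path are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: powerscale-dataset-cache
spec:
  capacity:
    storage: 1Ti
  accessModes:
  - ReadWriteMany                  # NFS lets many pods share the cached datasets
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.0.2.10             # placeholder PowerScale SmartConnect address
    path: /ifs/data/cnvrg-cache    # placeholder export path
```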
The following figure illustrates the storage configuration for this validated design:
We recommend creating different types of worker nodes in a Tanzu Kubernetes cluster to support the templates described in Table 2. Worker nodes that have the same configuration are grouped into node pools, and each pool is assigned a particular node label.
The following table provides the recommended sizing for different worker node configurations in a Tanzu Kubernetes cluster. In the vGPU profile names, the first number is the count of MIG compute slices and the trailing number is the framebuffer size in GB; for example, grid_a100d-3-40c provides three compute slices and 40 GB of GPU memory:
Worker node configuration | Number of CPU cores | Memory (GB) | NVIDIA vGPU profile (A100 80 GB) |
CPU-only | 64 | 128 | None |
GPU-train-medium | 24 | 96 | grid_a100d-3-40c |
GPU-train-large | 48 | 96 | grid_a100d-7-80c |
GPU-infer-small | 16 | 32 | grid_a100d-1-10c |
GPU-infer-medium | 16 | 32 | grid_a100d-2-20c |
Note that a CPU-only worker node can host multiple CPU-only compute templates, whereas each of the GPU worker nodes described in the preceding table can host only one GPU-based compute template. The CPU cores and memory allocated to a worker node exceed what is exposed to the compute template, ensuring that enough compute resources remain available for system processes such as the GPU Operator, Kubernetes management pods, and cnvrg.io management pods.