cnvrg.io provides an enterprise-ready MLOps platform for data scientists and ML engineers to develop and publish AI applications. It is deployed on Kubernetes using container images with prebuilt deployment templates and default configurations.
The following figure shows how cnvrg.io is deployed on Kubernetes:
As a containerized application, cnvrg.io uses Kubernetes capabilities to manage the life cycle of the deployment, while providing improved availability and scalability. The cnvrg.io platform consists of control plane pods and worker pods. The control plane pods comprise all the pods in the cnvrg.io management plane, such as the application server, web server, PostgreSQL, Fluent Bit, and Sidekiq. Note that the cnvrg.io control plane is different from the Kubernetes control plane. References to the control plane in this document mean the cnvrg.io control plane.
ML workloads run as worker pods. Worker pods are grouped into categories, and each category is associated with specific types of Kubernetes worker nodes that have a particular hardware resource configuration, such as the type of GPU configured in each worker node. The categories of worker pods include:
As described earlier, each category of worker pods is associated with a specific type of Kubernetes worker node. There are several ways to associate worker pods with a category. In this design guide we focus on two approaches:
The following node pools can be created by manually assigning node labels to the corresponding worker nodes:
Data scientists can specify node labels when creating compute templates that will be used for their project or workspace.
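As an illustration, a node pool can be created by labeling nodes (for example, with `kubectl label node <node-name> pool=gpu-training`) and then referencing that label in a pod spec or compute template. The label key `pool`, its value, and the image name below are hypothetical placeholders, not cnvrg.io defaults:

```yaml
# Sketch: constrain a training workload to nodes labeled pool=gpu-training
# (label applied beforehand with: kubectl label node <node-name> pool=gpu-training)
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  nodeSelector:
    pool: gpu-training        # hypothetical node pool label
  containers:
  - name: trainer
    image: my-registry/trainer:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1     # request one GPU on the selected node
```

The Kubernetes scheduler will only place this pod on worker nodes that carry the matching label, which is how node pools constrain each category of worker pods to its intended hardware.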
Administrators can also manually configure the control plane pods with corresponding node labels to constrain the control plane pods to run only on certain nodes (control plane node pool). However, this option is beyond the scope of this design guide. This validated design relies on the Kubernetes scheduler to automatically place the cnvrg.io control plane pods on worker nodes based on resource availability.
For smaller deployments, some of these node pools, such as the GPU model development and GPU training node pools, can be combined. Alternatively, we can avoid creating node pools for certain pods (for example, CPU worker pods) and let the Kubernetes scheduler place the pods automatically (for example, spreading pods across nodes without placing them on a node with insufficient free resources).
The node pool approach is recommended for simplified management.
As discussed earlier, a cnvrg.io deployment consists of a control plane that includes components that manage the deployment, along with worker nodes where AI workloads run. This modular deployment allows administrators to size, manage, and operate the control plane independently of the worker nodes. Administrators can seamlessly update cnvrg.io to the latest version and make new features available to the worker nodes with minimal impact.
In this design guide we focus on the following types of scaling:
When the workload for a particular component (such as the Sidekiq job scheduler) increases, raising resource use such as average CPU use or average memory use, the Horizontal Pod Autoscaler (HPA) automatically deploys new control plane pods of that component to handle the increased workload. If the load decreases and the number of pods is above the configured minimum, the HPA instructs the workload resource to scale down.
The HPA is implemented as a Kubernetes API resource and a controller. The resource determines the behavior of the controller. The horizontal pod autoscaling controller, running in the Kubernetes control plane, periodically adjusts the desired scale of its target (for example, a deployment) to match observed metrics such as average CPU use, average memory use, or any custom metric that you specify.
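As a sketch, an HPA targeting a hypothetical Sidekiq deployment might look like the following. The deployment name, replica bounds, and CPU target are illustrative assumptions, not cnvrg.io defaults:

```yaml
# Sketch: scale a deployment between 1 and 4 replicas based on CPU use
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sidekiq-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sidekiq             # hypothetical deployment name
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80   # add pods when average CPU use exceeds 80%
```

Here the HPA resource declares the desired behavior, and the controller in the Kubernetes control plane periodically compares observed CPU use against the 80% target and adjusts the replica count within the configured bounds.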
There are three options to make AI data available to workspaces in cnvrg.io:
cnvrg.io uses ingress control and load balancers to govern access to the deployment. cnvrg.io requires a DNS wildcard subdomain record, which resolves the cnvrg.io cluster domain to the ingress IP address, for example, *.cnvrg.my-org.com -> 172.20.13.42. Istio allocates subdomains to the different components of cnvrg.io and to new workspaces and endpoints.
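For illustration, a wildcard host of this form is typically configured on the Istio ingress gateway. The gateway name and domain below are placeholders matching the example DNS record above, not cnvrg.io-specific values:

```yaml
# Sketch: Istio Gateway accepting traffic for the wildcard subdomain
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: cnvrg-gateway
spec:
  selector:
    istio: ingressgateway     # default Istio ingress gateway pod label
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*.cnvrg.my-org.com"    # wildcard subdomain from the DNS record
```

Because the DNS record resolves every subdomain to the same ingress IP address, Istio can route each allocated subdomain (a component, workspace, or endpoint) to the correct backing service.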