cnvrg.io is a machine learning platform built by data scientists, for data scientists. cnvrg.io helps teams manage, build, and automate machine learning from research to production. The key features of cnvrg.io include the ability to:
cnvrg.io is deployed on Kubernetes. The deployment consists of control plane nodes and worker nodes, and ML workloads run on compute resources in the worker nodes. This section describes these infrastructure concepts.
The cnvrg.io control plane manages the cnvrg.io back-end and front-end services, including the database, object storage, metadata services, and more. The control plane makes global decisions about the deployment. It includes components that are responsible for managing the provisioning and execution of AI workloads and pipelines. The control plane provides a control point for multiple Kubernetes clusters and Spark clusters. It consists of, but is not limited to, the following components:
Note: Prometheus and Grafana installed with cnvrg.io do not display GPU usage metrics in this release.
cnvrg.io refers to Kubernetes clusters and on-premises machines as compute resources. ML jobs and workloads run on these compute resources. cnvrg.io integrates with Kubernetes, and administrators can connect and manage multiple Kubernetes clusters.
You can connect to your on-premises machines and add them as compute resources. You can also add remote Spark clusters as compute resources, but remote Spark clusters are outside the scope of this validated design.
This validated design provides two options for deploying cnvrg.io:
A compute template is a defined set of resources that a particular project or workspace uses. The template includes sizing of CPU, memory, and GPU resources. Each template describes a different set of resources to use as a compute engine when running a job, either on a Kubernetes cluster or on a specific on-premises machine. cnvrg.io supports five types of compute templates: a single pod template, a multinode template, a distributed PyTorch template, a Spark on Kubernetes template, and a remote Spark template. In this validated design, we focus only on the single pod template.
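As an illustration of what a single pod template captures, the following minimal Python sketch models the CPU, memory, and GPU sizing and renders it as Kubernetes-style resource quantities. The class name, field names, and example values are hypothetical and are not part of the cnvrg.io API. The only external name used is nvidia.com/gpu, the resource name that the NVIDIA device plugin for Kubernetes exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeTemplate:
    """Hypothetical model of a single pod compute template (not the cnvrg.io API)."""

    name: str
    cpu_cores: float
    memory_gib: float
    gpus: int = 0

    def to_resource_requests(self) -> dict:
        """Render the sizing as Kubernetes-style resource quantities."""
        if self.cpu_cores <= 0 or self.memory_gib <= 0 or self.gpus < 0:
            raise ValueError(
                "CPU and memory must be positive; GPU count cannot be negative"
            )
        requests = {
            "cpu": f"{self.cpu_cores:g}",
            "memory": f"{self.memory_gib:g}Gi",
        }
        if self.gpus:
            # Resource name exposed by the NVIDIA device plugin for Kubernetes.
            requests["nvidia.com/gpu"] = str(self.gpus)
        return requests


# Example values are illustrative only.
small_gpu = ComputeTemplate(name="small-gpu", cpu_cores=4, memory_gib=16, gpus=1)
print(small_gpu.to_resource_requests())
```

Keeping the template immutable reflects the idea that a template is a fixed, named sizing that many jobs can reuse.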
cnvrg.io supports NVIDIA Multi-Instance GPU (MIG) partitions, which can be enabled during installation. Compute templates can be configured with a specific MIG partition. In this validated design, the GPU resources are allocated using MIG partitions created on NVIDIA A100 and A30 GPUs.
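To make the link between a compute template and a MIG partition concrete, the sketch below maps a MIG profile to the extended resource name that the NVIDIA device plugin for Kubernetes advertises when it runs with its mixed MIG strategy. This is an assumption for illustration: the text above does not state which MIG strategy, which A100 memory size, or which profiles this design uses, and the function and dictionary names are hypothetical. The profile names are those NVIDIA documents for the A100 and A30.

```python
# MIG profile names documented by NVIDIA for each GPU model.
# Assumption: the A100 memory size used in this design is not stated,
# so both variants are listed here.
MIG_PROFILES = {
    "A100-40GB": ("1g.5gb", "2g.10gb", "3g.20gb", "4g.20gb", "7g.40gb"),
    "A100-80GB": ("1g.10gb", "2g.20gb", "3g.40gb", "4g.40gb", "7g.80gb"),
    "A30": ("1g.6gb", "2g.12gb", "4g.24gb"),
}


def mig_resource_name(gpu_model: str, profile: str) -> str:
    """Return the Kubernetes extended resource name for a MIG profile.

    Assumes the NVIDIA device plugin's "mixed" MIG strategy, under which
    each profile is advertised as nvidia.com/mig-<profile>.
    """
    try:
        valid_profiles = MIG_PROFILES[gpu_model]
    except KeyError:
        raise ValueError(f"unknown GPU model: {gpu_model!r}") from None
    if profile not in valid_profiles:
        raise ValueError(f"{profile!r} is not a MIG profile of {gpu_model}")
    return f"nvidia.com/mig-{profile}"


# A template requesting one of the smallest A30 partitions.
gpu_request = {mig_resource_name("A30", "1g.6gb"): "1"}
print(gpu_request)
```

Validating the profile against the GPU model catches a template that names a partition the underlying hardware cannot provide.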
Registries are repositories of Docker images. A Docker image is a template that contains the source code, libraries, and dependencies that cnvrg.io requires to deploy a container to run a job. A job can be a workspace, experiment, web application, endpoint, or flow. cnvrg.io supplies capable, ready-to-use Docker images by default. To use NVIDIA GPU resources, you must download images from NVIDIA AI Enterprise and add them to the registry. These images include PyTorch, NVIDIA RAPIDS, NVIDIA TensorRT, and NVIDIA Triton Inference Server.