Architecture concepts for cnvrg.io

Thank you for your feedback!

Overview
cnvrg.io is a machine learning platform built by data scientists, for data scientists. cnvrg.io helps teams manage, build, and automate machine learning from research to production. The key features of cnvrg.io include the ability to:
- Manage—Deploy more models with drag-and-drop ML pipelines. cnvrg.io offers enterprises a dynamic, highly automated platform for managing the ML life cycle. cnvrg.io helps data scientists, software engineers, and even business analysts apply state-of-the-art algorithms on their datasets with cloud-like, self-service ease.
- Build—Run and track experiments in parallel with the freedom to use any compute environment, framework, programming language, or tool with minimal configuration.
- Automate—Build more models and automate your machine learning from research to production using reusable components and a drag-and-drop interface. Add continual and active learning to your models with one click.
cnvrg.io software architecture
cnvrg.io is a Kubernetes-based deployment. The deployment consists of control plane nodes and worker nodes. ML workloads run on compute resources in the worker node. This section describes these infrastructure concepts.
Control plane
The cnvrg.io control plane manages the cnvrg.io back-end and front-end services, including the database, object storage, metadata services, and more. The control plane makes global decisions about the deployment. It includes components that are responsible for managing the provisioning and execution of AI workloads and pipelines. The control plane provides a control point for multiple Kubernetes clusters and Spark clusters. It consists of, but is not limited to, the following components:
- Application server runs the main application code. It oversees the deployment, API services, and application logic.
- Web server handles web requests.
- PostgreSQL is a free and open-source relational database management system (RDBMS).
- Fluent Bit is a lightweight log processor and forwarder that collects data and logs from diverse sources, unifies them, and sends them to multiple destinations. It gathers log information from various cnvrg.io pods and forwards them to Elasticsearch.
- Prometheus and Grafana are open-source software. Prometheus is an open-source systems monitoring and alerting toolkit. Grafana is open-source software that enables you to visualize and analyze metrics data collected by Prometheus.
Note: Prometheus and Grafana installed with cnvrg.io do not display GPU usage metrics in this release.
- Istio is an open-source service mesh that provides a uniform and more efficient way to secure, connect, and monitor services. It is also the default ingress controller used by cnvrg.io.
- Elasticsearch and Kibana are free software. Elasticsearch is a distributed, free, and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. Kibana is a free and open user interface that helps to visualize Elasticsearch data, navigate the Elastic Stack, and provide a dashboard for viewing data.
- Sidekiq is an open-source job scheduler written in Ruby. It uses Redis as an in-memory data structure store. Sidekiq is responsible for running all cnvrg.io jobs and monitors the life cycle of each job. It also manages gathering system metrics and sending alerts.
- Redis is a distributed, in-memory key-value database, cache, and message broker used by Sidekiq.
Worker nodes as compute resources
cnvrg.io refers to Kubernetes clusters and on-premises machines as compute resources. ML jobs and workloads are run from the compute resources. cnvrg.io seamlessly integrates with Kubernetes clusters and allows administrators to use Kubernetes clusters quickly and easily. cnvrg.io can connect and manage multiple Kubernetes clusters.
You can connect to your on-premises machines and add them as resources. Also, you can add remote Spark clusters as compute resources. These remote Spark clusters are outside the scope of this validated design.
Metacloud and on-premises deployment
This validated design provides two options for deploying cnvrg.io:
- The first option is to deploy cnvrg.io on an on-premises cluster. The control plane and worker nodes are both deployed on Kubernetes clusters.
- The second option, which is called cnvg.io Metacloud, is a SaaS version of cnvrg.io that eliminates issues of sizing, provisioning, and managing the control plane. Customers can connect to their on-premises Kubernetes clusters on Dell infrastructure to cnvg.io Metacloud.
Resource concepts
Compute templates
A compute template is a defined set of resources that a particular project or workspace uses. The template includes sizing of CPU, memory, and GPU resources. Each template describes a different set of resources to use as a compute engine when running a job – either as a Kubernetes cluster or as a specific on-premises machine. cnvrg.io supports five types of compute templates, a single pod template, a multinode template, a distributed PyTorch template, a Spark on Kubernetes template, and a remote Spark template. In this validated design, we focus only on the single pod template.
cnvrg.io supports NVIDIA Multi-Instance GPU (MIG) partitions, which can be enabled during installation. Compute templates can be configured with a specific MIG partition. In this validated design, the GPU resources are allocated using MIG partitions created on NVIDIA A100 and A30 GPUs.
Registries
Registries are repositories of Docker images. A Docker image is a template that contains the source code, libraries, and dependencies that cnvrg.io requires to deploy a container to run a job. A job can be a workspace, experiment, web application, endpoint, or flow. cnvrg.io supplies capable, ready-to-use Docker images by default. To use NVIDIA GPU resources, you must download images from NVIDIA AI Enterprise and add them to the registry. These images include PyTorch, NVIDIA RAPIDS, NVIDIA TensorRT, and NVIDIA Triton Inference Server.

Your Browser is Out of Date

Architecture concepts for cnvrg.io

Architecture concepts for cnvrg.io

Overview

cnvrg.io software architecture

Control plane

Worker nodes as compute resources

Metacloud and on-premises deployment

Resource concepts

Compute templates

Registries