To better understand the sizing recommendations, it helps to keep in mind the key cnvrg.io concepts used in this section: compute templates, worker nodes grouped into node pools, and the compute resource (the Kubernetes cluster) on which workloads run. We first provide recommendations for creating compute templates that can be used for various AI workloads and use cases. Then, we provide sizing recommendations for different types of worker nodes. Finally, we provide sizing recommendations for the overall cnvrg.io deployment.
The following figure shows workspaces running on worker nodes that use the recommended compute templates. The rest of this section describes the sizing of the CPU- and GPU-based templates and worker nodes.
Figure 2. Worker nodes, compute templates, and compute resource in cnvrg.io
In a compute template, the size of the NVIDIA MIG partition is abstracted; therefore, a template can only specify whether or not it uses a GPU. The type of GPU partition assigned to a template is determined by the worker node (or node pool) associated with the template. These associations are specified using node labels. The following table shows the recommended compute templates that can be created in cnvrg.io:
Table 2. Recommended compute templates
| Compute template | Number of CPU cores | Memory (in GB) | Number of GPUs | Node label |
|---|---|---|---|---|
| CPU-small | 4 | 8 | 0 | CPU-only |
| CPU-medium | 8 | 16 | 0 | CPU-only |
| CPU-large | 16 | 32 | 0 | CPU-only |
| GPU-train-medium | 16 | 64 | 1 | GPU-train-medium |
| GPU-train-large | 32 | 64 | 1 | GPU-train-large |
| GPU-infer-small | 8 | 16 | 1 | GPU-infer-small |
| GPU-infer-medium | 8 | 16 | 1 | GPU-infer-medium |
Note: cnvrg.io deployments come with a default CPU template that has one CPU core and 1 GB of memory.
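To make the template definitions easier to reason about, the following minimal sketch captures the Table 2 templates as a plain Python structure and shows how one of them might be expressed as Kubernetes-style resource requests. This is an illustration only: cnvrg.io manages templates through its own UI and APIs, and the nvidia.com/gpu resource key is an assumption that can differ on MIG-backed vGPU clusters.

```python
# Sketch: the recommended compute templates from Table 2 as a simple Python
# structure. Illustrative only; this is not how cnvrg.io stores templates.
COMPUTE_TEMPLATES = {
    "CPU-small":        {"cpu": 4,  "memory_gb": 8,  "gpus": 0, "node_label": "CPU-only"},
    "CPU-medium":       {"cpu": 8,  "memory_gb": 16, "gpus": 0, "node_label": "CPU-only"},
    "CPU-large":        {"cpu": 16, "memory_gb": 32, "gpus": 0, "node_label": "CPU-only"},
    "GPU-train-medium": {"cpu": 16, "memory_gb": 64, "gpus": 1, "node_label": "GPU-train-medium"},
    "GPU-train-large":  {"cpu": 32, "memory_gb": 64, "gpus": 1, "node_label": "GPU-train-large"},
    "GPU-infer-small":  {"cpu": 8,  "memory_gb": 16, "gpus": 1, "node_label": "GPU-infer-small"},
    "GPU-infer-medium": {"cpu": 8,  "memory_gb": 16, "gpus": 1, "node_label": "GPU-infer-medium"},
}

def to_k8s_requests(template_name):
    """Express a template as Kubernetes-style resource request strings.
    The "nvidia.com/gpu" key is an assumption; a MIG-backed vGPU cluster may
    expose a profile-specific resource name instead."""
    t = COMPUTE_TEMPLATES[template_name]
    requests = {"cpu": str(t["cpu"]), "memory": f"{t['memory_gb']}Gi"}
    if t["gpus"]:
        requests["nvidia.com/gpu"] = str(t["gpus"])
    return requests

print(to_k8s_requests("GPU-train-medium"))
# -> {'cpu': '16', 'memory': '64Gi', 'nvidia.com/gpu': '1'}
```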
Each compute template can be used for a specific set of workloads or use cases. Our recommended compute templates fall into three workload categories: CPU-only workloads, GPU training workloads, and GPU inference workloads.
A compute resource, which is a Kubernetes cluster consisting of Kubernetes worker nodes, runs the cnvrg.io workspaces as pods. We recommend creating different types of worker nodes to support the templates described in Table 2. Worker nodes that have the same configuration are grouped as node pools and are assigned a particular node label.
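As a concrete illustration of those associations, the sketch below labels a worker node so that it joins the GPU-train-medium node pool, using the official kubernetes Python client. The node name and the node-pool label key are hypothetical placeholders; the actual key is whatever your compute templates and node pools are configured to match on.

```python
from kubernetes import client, config

# Sketch: assign a worker node to the "GPU-train-medium" node pool by applying
# a node label. The label key "node-pool" and the node name are hypothetical
# placeholders chosen for illustration.
config.load_kube_config()
v1 = client.CoreV1Api()

node_name = "gpu-worker-01"  # placeholder node name
v1.patch_node(node_name, {"metadata": {"labels": {"node-pool": "GPU-train-medium"}}})
```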
The following table provides the recommended sizing for different worker node configurations in a Kubernetes cluster:
Table 3. Recommended sizing for node configurations
| Worker node type | Number of CPU cores | Memory (in GB) | NVIDIA GPU profile for A100 80 GB | Node label for node pool |
|---|---|---|---|---|
| CPU-only | 64 | 128 | None | CPU-only |
| GPU-train-medium | 24 | 96 | grid_a100d-3-40c | GPU-train-medium |
| GPU-train-large | 48 | 96 | grid_a100d-7-80c | GPU-train-large |
| GPU-infer-small | 16 | 32 | grid_a100d-1-10c | GPU-infer-small |
| GPU-infer-medium | 16 | 32 | grid_a100d-2-20c | GPU-infer-medium |
Note that a CPU-only worker node can host multiple CPU-only compute templates. However, each of the GPU worker nodes described in the preceding table can host only one GPU-based compute template. The CPU cores and memory allocated to a worker node exceed what is available to the compute template, which ensures that enough compute resources remain for system processes such as the GPU operators, Kubernetes management pods, and cnvrg.io management pods.
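The following sketch shows how a workload pod could be steered to the matching node pool with a nodeSelector and a GPU resource request. The node-pool label key, the image name, and the nvidia.com/gpu resource name are assumptions for illustration; with MIG-backed vGPU profiles, the GPU Operator may expose a profile-specific resource name instead, so verify the exact name on your cluster.

```python
from kubernetes import client

# Sketch: a workspace pod pinned to the GPU-train-medium node pool from Table 3.
# Assumptions for illustration: the "node-pool" label key, the placeholder image,
# and the "nvidia.com/gpu" resource name (a MIG-backed vGPU cluster may advertise
# a profile-specific resource; check `kubectl describe node` for the exact key).
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="workspace-train"),
    spec=client.V1PodSpec(
        node_selector={"node-pool": "GPU-train-medium"},
        containers=[
            client.V1Container(
                name="workspace",
                image="example/workspace:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    requests={"nvidia.com/gpu": "1"},
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)
```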
Sizing recommendation for cnvrg.io deployment
You must consider several factors for sizing a cnvrg.io deployment, such as the number of projects and workspaces, the type of workspaces, the number of concurrent users, and the number of active tasks. With its modular and microservices-based design, a cnvrg.io deployment can easily be scaled depending on resource use. NVIDIA MIG capability also enables effective partitioning of GPUs for various use cases and increases GPU utilization.
We recommend three deployment sizes for our validated design. The total CPU, memory, and GPU resources for each deployment can be calculated from the number and configuration of its worker nodes. The storage recommendation includes the storage required for both the cnvrg.io control plane and user data; it is assumed that this storage is shared across all the worker nodes.
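As a minimal sketch of that calculation, the snippet below sums CPU, memory, and GPU slices across a set of worker nodes using the per-node configurations from Table 3. The node counts in the example are hypothetical placeholders, not the recommended counts for any of the deployments below.

```python
# Sketch: derive total cluster resources from worker-node counts, using the
# per-node configurations in Table 3. Node counts here are hypothetical.
NODE_CONFIGS = {
    "CPU-only":         {"cpu": 64, "memory_gb": 128, "gpu_slices": 0},
    "GPU-train-medium": {"cpu": 24, "memory_gb": 96,  "gpu_slices": 1},
    "GPU-train-large":  {"cpu": 48, "memory_gb": 96,  "gpu_slices": 1},
    "GPU-infer-small":  {"cpu": 16, "memory_gb": 32,  "gpu_slices": 1},
    "GPU-infer-medium": {"cpu": 16, "memory_gb": 32,  "gpu_slices": 1},
}

def cluster_totals(node_counts):
    """Sum CPU, memory, and GPU slices across all worker nodes."""
    totals = {"cpu": 0, "memory_gb": 0, "gpu_slices": 0}
    for node_type, count in node_counts.items():
        cfg = NODE_CONFIGS[node_type]
        for key in totals:
            totals[key] += cfg[key] * count
    return totals

# Example with hypothetical counts: 2 CPU-only, 1 training, 2 inference nodes.
print(cluster_totals({"CPU-only": 2, "GPU-train-medium": 1, "GPU-infer-small": 2}))
# -> {'cpu': 184, 'memory_gb': 416, 'gpu_slices': 3}
```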
The recommended deployments are shown in the following table:
Table 4. Recommended sizing for the validated design
| Deployment | Number of worker nodes | Recommended storage |
|---|---|---|
| Minimum production | | 2 TB |
| Mainstream | | 5 TB |
| High performance | | Greater than 5 TB, depending on the use case and datasets |