Launch Flexible Machine Learning Models Quickly with cnvrg.io® on Red Hat OpenShift
Wed, 17 Jan 2024 14:11:31 -0000
Summary
Data scientists carry significant responsibility for supporting the decision-making and strategy of their companies. To this end, they extract insights from large amounts of heterogeneous data through a set of iterative tasks: cleaning and formatting the available data, building training and testing datasets, mining data for patterns, deciding on the type of data analysis to apply and the ML methods to use, evaluating and interpreting the results, refining ML algorithms, and possibly even managing infrastructure. To help data scientists deliver the most impactful insights efficiently and effectively, cnvrg.io provides a unified platform to operationalize the full machine learning (ML) lifecycle from research to production.
As the leading data-science platform for ML model operationalization (MLOps) and management, cnvrg.io is a pioneer in building cutting-edge ML development solutions that provide data scientists with all the tools they need in one place to streamline their processes. In addition, by deploying MLOps on Red Hat OpenShift, data scientists can launch flexible, container-based jobs and pipelines that can easily scale to deliver better efficiency in terms of compute resource utilization and cost. Infrastructure teams can also manage and monitor ML workloads in a single managed and cloud-native environment. For infrastructure architects who are deploying cnvrg.io on Dell PowerEdge servers and Intel® components, this document provides recommended hardware bill of materials (BoM) configurations to help get them started.
Key considerations
Key considerations for using the recommended hardware BoMs for deploying cnvrg.io on Red Hat OpenShift include:
- Provision external storage. When deploying cnvrg.io on Red Hat OpenShift, local storage is used only for container images and ephemeral volumes. External persistent storage volumes should be provisioned on a storage array or on another solution that you already have in place. If you do not already have a persistent storage solution, contact your Dell Technologies representative for guidance.
- Use high-performance object storage. The hardware BoMs below assume that you use an in-cluster solution based on MinIO for object storage. The number of drives and the capacity for MinIO object storage depend on the dataset size and performance requirements. Alternatively, you can use an external S3-compatible object store such as Elastic Cloud Storage (ECS) or Dell PowerScale (Isilon), powered by high-capacity Solidigm SSDs.
- Scale object storage independently. Object storage capacity can be scaled independently of worker nodes by deploying additional storage nodes. Both high-performance configurations (with NVM Express [NVMe] Solidigm solid-state drives [SSDs]) and high-capacity configurations (with rotational hard-disk drives [HDDs]) can be used. All nodes using NVMe drives should be configured with 100 Gbps network interface controllers (NICs) to take full advantage of the drives’ I/O throughput.
Recommended configurations
Controller nodes (3 nodes required) and worker nodes
Table 1. PowerEdge R660-based, up to 10 NVMe drives, 1RU
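Feature | Control-plane (master) nodes | ML/artificial intelligence (AI) CPU cluster (worker) nodes, Base configuration | ML/AI CPU cluster (worker) nodes, Plus configuration
Platform | Dell R660 supporting 10x 2.5” drives with NVMe backplane, direct connection | Dell R660 supporting 10x 2.5” drives with NVMe backplane, direct connection | Dell R660 supporting 10x 2.5” drives with NVMe backplane, direct connection
CPU | 2x Xeon® Gold 6426Y (16c @ 2.5GHz) | 2x Xeon® Gold 6448Y (32c @ 2.1GHz) | 2x Xeon® Platinum 8468 (48c @ 2.1GHz)
DRAM | 128GB (8x 16GB DDR5-4800) | 256GB (16x 16GB DDR5-4800) | 512GB (16x 32GB DDR5-4800)
Boot device | Dell BOSS-N1 with 2x 480GB M.2 NVMe SSD (RAID1) | Dell BOSS-N1 with 2x 480GB M.2 NVMe SSD (RAID1) | Dell BOSS-N1 with 2x 480GB M.2 NVMe SSD (RAID1)
Storage[1] | 1x 1.6TB Solidigm[2] D7-P5620 SSD (PCIe Gen4, Mixed-use) | 2x 1.6TB Solidigm[2] D7-P5620 SSD (PCIe Gen4, Mixed-use) | 2x 1.6TB Solidigm[2] D7-P5620 SSD (PCIe Gen4, Mixed-use)
Object storage[3] | N/A | 4x (up to 10x) 1.92TB, 3.84TB or 7.68TB Solidigm D7-P5520 SSD (PCIe Gen4, Read-Intensive) | 4x (up to 10x) 1.92TB, 3.84TB or 7.68TB Solidigm D7-P5520 SSD (PCIe Gen4, Read-Intensive)
Shared storage[4] | N/A | External | External
NIC[5] | Intel® X710-T4L for OCP3 (Quad-port 10Gb) | Intel® X710-T4L for OCP3 (Quad-port 10Gb), or Intel® E810-CQDA2 PCIe add-on card (dual-port 100Gb) | Intel® X710-T4L for OCP3 (Quad-port 10Gb), or Intel® E810-CQDA2 PCIe add-on card (dual-port 100Gb)
Additional NIC for external storage[6] | N/A | Intel® X710-T4L for OCP3 (Quad-port 10Gb), or Intel® E810-CQDA2 PCIe add-on card (dual-port 100Gb) | Intel® X710-T4L for OCP3 (Quad-port 10Gb), or Intel® E810-CQDA2 PCIe add-on card (dual-port 100Gb)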
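To compare the worker configurations in Table 1, the per-node figures can be multiplied out to cluster totals. The following is a minimal Python sketch, not a sizing tool from Dell or cnvrg.io. The socket counts, core counts, and DRAM sizes are taken from Table 1; the `NodeConfig` class, the `cluster_totals` function, and the four-worker example are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    """Per-node CPU and memory figures copied from Table 1."""
    name: str
    sockets: int
    cores_per_socket: int
    dram_gb: int

    @property
    def cores(self) -> int:
        # Physical cores per node: sockets x cores per socket.
        return self.sockets * self.cores_per_socket


# Values from Table 1 (2x Gold 6426Y, 2x Gold 6448Y, 2x Platinum 8468).
CONTROL_PLANE = NodeConfig("control-plane", sockets=2, cores_per_socket=16, dram_gb=128)
WORKER_BASE = NodeConfig("worker-base", sockets=2, cores_per_socket=32, dram_gb=256)
WORKER_PLUS = NodeConfig("worker-plus", sockets=2, cores_per_socket=48, dram_gb=512)


def cluster_totals(worker: NodeConfig, worker_count: int) -> dict:
    """Return aggregate worker cores and DRAM for a hypothetical worker count."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    return {
        "worker_cores": worker.cores * worker_count,
        "worker_dram_gb": worker.dram_gb * worker_count,
    }


if __name__ == "__main__":
    # The worker count of 4 is an assumption for illustration only.
    for cfg in (WORKER_BASE, WORKER_PLUS):
        print(cfg.name, cluster_totals(cfg, worker_count=4))
```

Under these figures, four Base workers provide 256 cores and 1,024 GB of DRAM, while four Plus workers provide 384 cores and 2,048 GB. The three control-plane nodes are left out of the totals because they are not listed as ML/AI worker capacity.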
Optional – Dedicated storage nodes
Table 2. PowerEdge R660-based, up to 10 NVMe drives or 12 SAS drives, 1RU
Feature | High-performance node | High-capacity node
Platform | Dell R660 supporting 10x 2.5” drives with NVMe backplane | Dell R760 supporting 12x 3.5” drives with SAS/SATA backplane
CPU | 2x Xeon® Gold 6442Y (24c @ 2.6GHz) | 2x Xeon® Gold 6426Y (16c @ 2.5GHz)
DRAM | 128GB (8x 16GB DDR5-4800) | 128GB (8x 16GB DDR5-4800)
Storage controller | None | HBA355e adapter
Boot device | Dell BOSS-N1 with 2x 480GB M.2 NVMe SSD (RAID1) | Dell BOSS-N1 with 2x 480GB M.2 NVMe SSD (RAID1)
Object storage[3] | Up to 10x 1.92TB / 3.84TB / 7.68TB Solidigm D7-P5520 SSD (PCIe Gen4, Read-Intensive) | Up to 12x 8TB / 16TB / 22TB 3.5in 12Gbps SAS HDD 7.2k RPM
NIC[4] | Intel® E810-CQDA2 PCIe add-on card (dual-port 100Gb) | Intel® E810-XXV for OCP3 (dual-port 25Gb)
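Because the number of drives and their capacity depend on dataset size, a rough estimate of how many dedicated storage nodes a dataset needs can be useful as a starting point. The sketch below is an illustration under stated assumptions, not a MinIO or Dell sizing method. The drive counts and capacities come from the dedicated storage node table; the function names, the 500 TB dataset, and especially `usable_fraction` (the share of raw capacity left after data protection overhead) are hypothetical and must be replaced with values from your own object storage design.

```python
import math

# Maximum drive counts and drive sizes (TB) from the dedicated storage node table.
HIGH_PERFORMANCE = {"max_drives": 10, "drive_tb_options": (1.92, 3.84, 7.68)}
HIGH_CAPACITY = {"max_drives": 12, "drive_tb_options": (8, 16, 22)}


def raw_tb_per_node(drives: int, drive_tb: float) -> float:
    """Raw (unprotected) capacity of one storage node in TB."""
    return drives * drive_tb


def nodes_needed(dataset_tb: float, drives: int, drive_tb: float,
                 usable_fraction: float) -> int:
    """Estimate the node count for a dataset.

    usable_fraction is an assumption: the portion of raw capacity that
    remains usable after data protection overhead. It is not a value
    given in this document.
    """
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable_tb = raw_tb_per_node(drives, drive_tb) * usable_fraction
    return math.ceil(dataset_tb / usable_tb)


if __name__ == "__main__":
    dataset_tb = 500        # hypothetical dataset size
    usable_fraction = 0.75  # hypothetical; depends on the protection scheme
    for label, node in (("high performance", HIGH_PERFORMANCE),
                        ("high capacity", HIGH_CAPACITY)):
        largest = max(node["drive_tb_options"])
        count = nodes_needed(dataset_tb, node["max_drives"], largest, usable_fraction)
        print(f"{label}: {count} fully populated node(s) with {largest} TB drives")
```

The estimate covers capacity only. Throughput requirements, minimum node counts for the chosen object store, and growth headroom would need to be considered separately.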
Learn more
For a customized quote, contact your Dell or Intel account team at 1-877-289-3355.
[1] Local storage used only for container images and ephemeral volumes; persistent volumes should be provisioned on an external storage system.
[2] Formerly Intel
[3] The number of drives and the capacity for MinIO object storage depend on the dataset size and performance requirements.
[4] External shared storage required for Kubernetes persistent volumes.
[5] 100 Gb NICs are recommended for higher throughput.
[6] Optional; required only if a dedicated storage network for an external storage system is necessary.