Deploy Machine Learning Models Quickly with cnvrg.io and VMware Tanzu
Summary
Data scientists and developers use cnvrg.io to quickly deploy machine learning (ML) models to production. For infrastructure teams interested in enabling cnvrg.io on VMware Tanzu, this article contains a recommended hardware bill of materials (BoM). Data scientists will appreciate the performance boost they can experience using Dell PowerEdge servers with Intel Xeon Scalable processors as they wrangle big data to uncover hidden patterns, correlations, and market trends. Containers are a quick and effective way to deploy MLOps solutions built with cnvrg.io, and IT teams are turning to VMware Tanzu to create them. Tanzu enables IT admins to curate security-enabled container images that are grab-and-go for data scientists and developers, speeding development and delivery.
Market positioning
Too many AI projects take too long to deliver value. What gets in the way? Drudgery from low-level tasks that should be automated: managing compute, storage, and software; managing Kubernetes pods; sequencing jobs; and monitoring experiments, models, and resources. AI development requires data scientists to run many experiments, tune a variety of optimizations, and then prepare models for deployment. There is no time to waste on tasks already automated by MLOps platforms.
The cnvrg.io MLOps platform streamlines the model lifecycle through data ingestion, training, testing, deployment, monitoring, and continuous updating. The cnvrg.io Kubernetes operator deploys with VMware Tanzu to seamlessly manage pods and schedule containers. With cnvrg.io, AI developers can create entire AI pipelines with a few commands, or with a drag-and-drop visual canvas. The result? AI developers can deploy continuously updated models faster, for a better return on AI investments.
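What might such a pipeline look like in practice? The Python sketch below is illustrative only: it implements a generic stage-chaining pattern of the kind that cnvrg.io automates, not the actual cnvrg.io SDK, and all class names, stage names, and the dataset URI are hypothetical.

```python
from typing import Any, Callable

class Pipeline:
    """Toy stage-chaining runner: each stage consumes the previous stage's output."""

    def __init__(self) -> None:
        self._stages: list[tuple[str, Callable[[Any], Any]]] = []

    def stage(self, name: str) -> Callable:
        """Decorator that registers a function as the next stage in order."""
        def register(fn: Callable[[Any], Any]) -> Callable[[Any], Any]:
            self._stages.append((name, fn))
            return fn
        return register

    def run(self, payload: Any) -> Any:
        for name, fn in self._stages:
            print(f"[pipeline] running stage: {name}")
            payload = fn(payload)  # each stage feeds the next
        return payload

pipe = Pipeline()

@pipe.stage("ingest")
def ingest(source: str) -> list[float]:
    # Placeholder: pull raw records from the (hypothetical) dataset URI.
    return [1.0, 2.0, 3.0]

@pipe.stage("train")
def train(data: list[float]) -> dict:
    # Placeholder "training": fit a trivial model (here, just the mean).
    return {"mean": sum(data) / len(data)}

@pipe.stage("deploy")
def deploy(model: dict) -> dict:
    # Placeholder: publish the model artifact to a serving endpoint.
    print(f"[pipeline] deployed: {model}")
    return model

pipe.run("s3://datasets/example")  # hypothetical dataset URI
```

In a real deployment, the platform (rather than hand-rolled code like this) tracks each stage as an experiment, schedules it on the Tanzu-managed cluster, and keeps the deployed model continuously updated.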
Key considerations
- Intel Xeon Scalable Processors – The 4th Generation Intel Xeon Scalable processor family features the most built-in accelerators of any CPU on the market for AI, databases, analytics, networking, storage, crypto, and data compression workloads.
- Memory throughput – Dell PowerEdge servers with Intel 4th Gen Xeon Scalable processors provide enhanced memory performance by supporting eight channels of DDR5 memory per socket, at speeds of up to 4800 MT/s with one DIMM per channel (1DPC) or up to 4400 MT/s with two DIMMs per channel (2DPC). Dell PowerEdge servers using DDR5 support higher-capacity memory modules, consume less power, and offer up to 1.5x the bandwidth of previous-generation platforms that use DDR4 (see the worked example after this list).
- Higher performance for intensive ML applications – Dell PowerEdge R760 servers support up to 24x 2.5” NVM Express (NVMe) drives with an NVMe backplane. NVMe drives enable VMware vSAN, which runs beneath VMware Tanzu, to meet the throughput and latency requirements of demanding ML workloads.
- Storage architecture – vSAN’s Original Storage Architecture (OSA) is a legacy two-tier model that pairs high-throughput drives in a caching tier with high-capacity drives in a capacity tier. In contrast, the Express Storage Architecture (ESA), introduced in vSAN 8.0, is a single-tier design that takes full advantage of modern NVMe drives.
- Scale object-storage capacity – Deploy additional storage nodes to scale object-store capacity independently of worker nodes. Both high-performance (NVMe solid-state drive [SSD]) and high-capacity (rotational hard-disk drive [HDD]) configurations can be used. All nodes using NVMe drives should be configured with 100 Gb network interface controllers (NICs) to take full advantage of the drives’ data transfer rates.
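To put the memory-throughput bullet above in perspective, the short calculation below derives the theoretical peak DDR5 bandwidth per socket from the quoted channel count and transfer rate; the 8-bytes-per-transfer figure reflects the 64-bit data bus of each memory channel.

```python
# Theoretical peak DDR5 bandwidth per socket for the 1DPC configuration:
# channels x transfer rate x bytes per transfer (64-bit data bus = 8 bytes).
CHANNELS_PER_SOCKET = 8
TRANSFER_RATE_MT_S = 4800   # DDR5-4800, one DIMM per channel (1DPC)
BYTES_PER_TRANSFER = 8      # 64-bit data bus per channel

peak_gbs = CHANNELS_PER_SOCKET * TRANSFER_RATE_MT_S * BYTES_PER_TRANSFER / 1000
print(f"DDR5-4800 peak per socket: {peak_gbs:.1f} GB/s")      # 307.2 GB/s
print(f"DDR5-4800 dual socket:     {2 * peak_gbs:.1f} GB/s")  # 614.4 GB/s

# Previous generation for comparison: eight channels of DDR4-3200.
ddr4_gbs = CHANNELS_PER_SOCKET * 3200 * BYTES_PER_TRANSFER / 1000
print(f"DDR4-3200 peak per socket: {ddr4_gbs:.1f} GB/s")      # 204.8 GB/s
print(f"Bandwidth ratio: {peak_gbs / ddr4_gbs:.2f}x")         # 1.50x
```

The DDR4-3200 comparison yields 204.8 GB/s per socket, consistent with the "up to 1.5x" bandwidth claim above.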
Recommended configurations
Worker nodes (minimum of four nodes required, up to 64 nodes per cluster; a quick way to verify the node count is sketched after Table 1)
Table 1. PowerEdge R760-based, up to 16 NVMe drives, 2RU

| Feature | OSA | ESA |
|---|---|---|
| Platform | Dell PowerEdge R760 supporting 16x 2.5” drives with NVMe backplane (direct connection) | Same as OSA |
| CPU | Base: 2x Intel Xeon Gold 6448Y (32 cores @ 2.1 GHz); Plus: 2x Intel Xeon Platinum 8468 (48 cores @ 2.1 GHz) | Same as OSA |
| DRAM | 256 GB (16x 16 GB DDR5-4800) | 512 GB (16x 32 GB DDR5-4800) |
| Boot device | Dell BOSS-N1 with 2x 480 GB M.2 NVMe SSDs (RAID 1) | Same as OSA |
| vSAN cache tier [1] | 2x 1.92 TB Solidigm D7-P5520 SSD (PCIe Gen4, read-intensive) | N/A (single-tier design) |
| vSAN capacity tier [1] | 6x 1.92 TB Solidigm D7-P5620 SSD (PCIe Gen4, mixed use) | Same as OSA |
| Object storage [1] | 4x (up to 10x) 1.92 TB, 3.84 TB, or 7.68 TB Solidigm D7-P5520 SSD (PCIe Gen4, read-intensive) | Same as OSA |
| NIC [2] | Intel E810-XXV for OCP 3.0 (dual-port 25 Gb), or Intel E810-CQDA2 PCIe add-in card (dual-port 100 Gb) | Same as OSA |
| Additional NIC [3] | Intel E810-XXV for OCP 3.0 (dual-port 25 Gb), or Intel E810-CQDA2 PCIe add-in card (dual-port 100 Gb) | Same as OSA |
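Before installing the cnvrg.io operator, it is worth confirming that the Tanzu workload cluster actually presents the minimum of four worker nodes sized per Table 1. The sketch below uses the official Kubernetes Python client; it assumes a kubeconfig for the target cluster is already in place, and the control-plane label check reflects current upstream Kubernetes conventions.

```python
# A minimal sketch: count worker nodes and report their allocatable capacity.
# Requires the official client: pip install kubernetes
from kubernetes import client, config

MIN_WORKER_NODES = 4

config.load_kube_config()  # uses the current kubeconfig context
nodes = client.CoreV1Api().list_node().items

# Control-plane nodes carry a well-known role label; everything else is a worker.
workers = [
    n for n in nodes
    if "node-role.kubernetes.io/control-plane" not in (n.metadata.labels or {})
]

for n in workers:
    alloc = n.status.allocatable
    print(f"{n.metadata.name}: cpu={alloc['cpu']}, memory={alloc['memory']}")

if len(workers) < MIN_WORKER_NODES:
    raise SystemExit(f"need at least {MIN_WORKER_NODES} workers, found {len(workers)}")
```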
Optional – Dedicated storage nodes
Table 2. PowerEdge R660-based (up to 10 NVMe drives, 1RU) or PowerEdge R760-based (up to 12 SAS drives, 2RU)

| Feature | High performance | High capacity |
|---|---|---|
| Platform | Dell PowerEdge R660 supporting 10x 2.5” drives with NVMe backplane | Dell PowerEdge R760 supporting 12x 3.5” drives with SAS/SATA backplane |
| CPU | 2x Intel Xeon Gold 6442Y (24 cores @ 2.6 GHz) | 2x Intel Xeon Gold 6426Y (16 cores @ 2.5 GHz) |
| DRAM | 128 GB (16x 8 GB DDR5-4800) | Same as high performance |
| Storage controller | None (direct NVMe connection) | HBA355e adapter |
| Boot device | Dell BOSS-N1 with 2x 480 GB M.2 NVMe SSDs (RAID 1) | Same as high performance |
| Object storage [1] | Up to 10x 1.92 TB, 3.84 TB, or 7.68 TB Solidigm D7-P5520 SSD (PCIe Gen4, read-intensive) | Up to 12x 8 TB, 16 TB, or 22 TB 3.5” 12 Gbps 7.2K RPM SAS HDD |
| NIC [2] | Intel E810-CQDA2 PCIe add-in card (dual-port 100 Gb) | Intel E810-XXV for OCP 3.0 (dual-port 25 Gb) |
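Footnote [1] below notes that the object-storage drives back a MinIO object store. As a usage illustration, the sketch below stages a training dataset in such a store with the official MinIO Python SDK; the endpoint, credentials, bucket, and paths are placeholders.

```python
# A minimal sketch: stage a training dataset in the MinIO-backed object store
# served by the storage nodes above. Requires: pip install minio
from minio import Minio

client = Minio(
    "minio.storage.example.local:9000",  # placeholder endpoint
    access_key="ACCESS_KEY",             # placeholder credentials
    secret_key="SECRET_KEY",
    secure=False,                        # set True once TLS is configured
)

bucket = "training-data"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# fput_object streams the file from disk, so large datasets need not fit in memory.
client.fput_object(bucket, "datasets/images-v1.tar", "/data/images-v1.tar")
print(client.stat_object(bucket, "datasets/images-v1.tar").size, "bytes stored")
```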
Learn more
Deploy ML models quickly with cnvrg.io and VMware Tanzu. Contact your Dell or Intel account team at 1-877-289-3355 for a customized quote.
[1] The number of drives and capacity for MinIO object storage depends on dataset size and performance requirements.
[2] 100 Gb NICs are recommended for higher throughput.
[3] Optional – required only if a dedicated storage network for an external storage system is needed.