Deploy Machine Learning Models Quickly with cnvrg.io and VMware Tanzu
Summary
Data scientists and developers use cnvrg.io to quickly deploy machine learning (ML) models to production. For infrastructure teams interested in enabling cnvrg.io on VMware Tanzu, this article contains a recommended hardware bill of materials (BoM). Data scientists will appreciate the performance boost they can experience using Dell PowerEdge servers with Intel Xeon Scalable processors as they wrangle big data to uncover hidden patterns, correlations, and market trends. Containers are a quick and effective way to deploy MLOps solutions built with cnvrg.io, and IT teams are turning to VMware Tanzu to create them. Tanzu enables IT admins to curate security-enabled container images that are grab-and-go for data scientists and developers, speeding development and delivery.
Market positioning
Too many AI projects take too long to deliver value. What gets in the way? Drudgery from low-level tasks that should be automated: managing compute, storage, and software; managing Kubernetes pods; sequencing jobs; and monitoring experiments, models, and resources. AI development requires data scientists to run many experiments, tune a variety of optimizations, and then prepare models for deployment. There is no time to waste on tasks already automated by MLOps platforms.
The cnvrg.io MLOps platform streamlines the model lifecycle through data ingestion, training, testing, deployment, monitoring, and continuous updating. The cnvrg.io Kubernetes operator deploys with VMware Tanzu to seamlessly manage pods and schedule containers. With cnvrg.io, AI developers can create entire AI pipelines with a few commands, or with a drag-and-drop visual canvas. The result? AI developers can deploy continuously updated models faster, for a better return on AI investments.
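What might such a pipeline look like in practice? The Python sketch below is illustrative only: it implements a generic stage-chaining pattern of the kind that cnvrg.io automates, not the actual cnvrg.io SDK, and all class names, stage names, and the dataset URI are hypothetical.

```python
from typing import Any, Callable

class Pipeline:
    """Toy stage-chaining runner: each stage consumes the previous stage's output."""

    def __init__(self) -> None:
        self._stages: list[tuple[str, Callable[[Any], Any]]] = []

    def stage(self, name: str) -> Callable:
        """Decorator that registers a function as the next stage in order."""
        def register(fn: Callable[[Any], Any]) -> Callable[[Any], Any]:
            self._stages.append((name, fn))
            return fn
        return register

    def run(self, payload: Any) -> Any:
        for name, fn in self._stages:
            print(f"[pipeline] running stage: {name}")
            payload = fn(payload)  # each stage feeds the next
        return payload

pipe = Pipeline()

@pipe.stage("ingest")
def ingest(source: str) -> list[float]:
    # Placeholder: pull raw records from the (hypothetical) dataset URI.
    return [1.0, 2.0, 3.0]

@pipe.stage("train")
def train(data: list[float]) -> dict:
    # Placeholder "training": fit a trivial model (here, just the mean).
    return {"mean": sum(data) / len(data)}

@pipe.stage("deploy")
def deploy(model: dict) -> dict:
    # Placeholder: publish the model artifact to a serving endpoint.
    print(f"[pipeline] deployed: {model}")
    return model

pipe.run("s3://datasets/example")  # hypothetical dataset URI
```

In a real deployment, the platform (rather than hand-rolled code like this) tracks each stage as an experiment, schedules it on the Tanzu-managed cluster, and keeps the deployed model continuously updated.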
Key considerations
- Intel Xeon Scalable Processors – The 4th Generation Intel Xeon Scalable processor family features the most built-in accelerators of any CPU on the market for AI, databases, analytics, networking, storage, crypto, and data compression workloads.
- Memory throughput – Dell PowerEdge servers with Intel 4th Gen Xeon Scalable processors provide enhanced memory performance by supporting eight channels of DDR5 memory per socket, at speeds of up to 4800 MT/s with one DIMM per channel (1DPC) or up to 4400 MT/s with two DIMMs per channel (2DPC). Dell PowerEdge servers using DDR5 support higher-capacity memory modules, consume less power, and offer up to 1.5x the bandwidth of previous-generation platforms that use DDR4 (see the worked example after this list).
- Higher performance for intensive ML applications – Dell PowerEdge R760 servers support up to 24x 2.5” NVM Express (NVMe) drives with an NVMe backplane. NVMe drives enable VMware vSAN, which runs beneath VMware Tanzu, to meet the throughput and latency requirements of demanding ML workloads.
- Storage architecture – vSAN’s Original Storage Architecture (OSA) is a legacy two-tier model that pairs high-throughput drives in a caching tier with high-capacity drives in a capacity tier. In contrast, the Express Storage Architecture (ESA), introduced in vSAN 8.0, is a single-tier design that takes full advantage of modern NVMe drives.
- Scale object-storage capacity – Deploy additional storage nodes to scale object-store capacity independently of worker nodes. Both high-performance (NVMe solid-state drive [SSD]) and high-capacity (rotational hard-disk drive [HDD]) configurations can be used. All nodes using NVMe drives should be configured with 100 Gb network interface controllers (NICs) to take full advantage of the drives’ data transfer rates.
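To put the memory-throughput bullet above in perspective, the short calculation below derives the theoretical peak DDR5 bandwidth per socket from the quoted channel count and transfer rate; the 8-bytes-per-transfer figure reflects the 64-bit data bus of each memory channel.

```python
# Theoretical peak DDR5 bandwidth per socket for the 1DPC configuration:
# channels x transfer rate x bytes per transfer (64-bit data bus = 8 bytes).
CHANNELS_PER_SOCKET = 8
TRANSFER_RATE_MT_S = 4800   # DDR5-4800, one DIMM per channel (1DPC)
BYTES_PER_TRANSFER = 8      # 64-bit data bus per channel

peak_gbs = CHANNELS_PER_SOCKET * TRANSFER_RATE_MT_S * BYTES_PER_TRANSFER / 1000
print(f"DDR5-4800 peak per socket: {peak_gbs:.1f} GB/s")      # 307.2 GB/s
print(f"DDR5-4800 dual socket:     {2 * peak_gbs:.1f} GB/s")  # 614.4 GB/s

# Previous generation for comparison: eight channels of DDR4-3200.
ddr4_gbs = CHANNELS_PER_SOCKET * 3200 * BYTES_PER_TRANSFER / 1000
print(f"DDR4-3200 peak per socket: {ddr4_gbs:.1f} GB/s")      # 204.8 GB/s
print(f"Bandwidth ratio: {peak_gbs / ddr4_gbs:.2f}x")         # 1.50x
```

The DDR4-3200 comparison yields 204.8 GB/s per socket, consistent with the "up to 1.5x" bandwidth claim above.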
Recommended configurations
Worker nodes (minimum of four nodes required, up to 64 nodes per cluster; a quick way to verify the node count is sketched after Table 1)
Table 1. PowerEdge R760-based, up to 16 NVMe drives, 2RU

| Feature | OSA | ESA |
|---|---|---|
| Platform | Dell PowerEdge R760 supporting 16x 2.5” drives with NVMe backplane (direct connection) | Same as OSA |
| CPU | Base: 2x Intel Xeon Gold 6448Y (32 cores @ 2.1 GHz); Plus: 2x Intel Xeon Platinum 8468 (48 cores @ 2.1 GHz) | Same as OSA |
| DRAM | 256 GB (16x 16 GB DDR5-4800) | 512 GB (16x 32 GB DDR5-4800) |
| Boot device | Dell BOSS-N1 with 2x 480 GB M.2 NVMe SSDs (RAID 1) | Same as OSA |
| vSAN cache tier [1] | 2x 1.92 TB Solidigm D7-P5520 SSD (PCIe Gen4, read-intensive) | N/A (single-tier design) |
| vSAN capacity tier [1] | 6x 1.92 TB Solidigm D7-P5620 SSD (PCIe Gen4, mixed use) | Same as OSA |
| Object storage [1] | 4x (up to 10x) 1.92 TB, 3.84 TB, or 7.68 TB Solidigm D7-P5520 SSD (PCIe Gen4, read-intensive) | Same as OSA |
| NIC [2] | Intel E810-XXV for OCP 3.0 (dual-port 25 Gb), or Intel E810-CQDA2 PCIe add-in card (dual-port 100 Gb) | Same as OSA |
| Additional NIC [3] | Intel E810-XXV for OCP 3.0 (dual-port 25 Gb), or Intel E810-CQDA2 PCIe add-in card (dual-port 100 Gb) | Same as OSA |
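Before installing the cnvrg.io operator, it is worth confirming that the Tanzu workload cluster actually presents the minimum of four worker nodes sized per Table 1. The sketch below uses the official Kubernetes Python client; it assumes a kubeconfig for the target cluster is already in place, and the control-plane label check reflects current upstream Kubernetes conventions.

```python
# A minimal sketch: count worker nodes and report their allocatable capacity.
# Requires the official client: pip install kubernetes
from kubernetes import client, config

MIN_WORKER_NODES = 4

config.load_kube_config()  # uses the current kubeconfig context
nodes = client.CoreV1Api().list_node().items

# Control-plane nodes carry a well-known role label; everything else is a worker.
workers = [
    n for n in nodes
    if "node-role.kubernetes.io/control-plane" not in (n.metadata.labels or {})
]

for n in workers:
    alloc = n.status.allocatable
    print(f"{n.metadata.name}: cpu={alloc['cpu']}, memory={alloc['memory']}")

if len(workers) < MIN_WORKER_NODES:
    raise SystemExit(f"need at least {MIN_WORKER_NODES} workers, found {len(workers)}")
```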
Optional – Dedicated storage nodes
Table 2. PowerEdge R660-based (up to 10 NVMe drives, 1RU) or PowerEdge R760-based (up to 12 SAS drives, 2RU)

| Feature | High performance | High capacity |
|---|---|---|
| Platform | Dell PowerEdge R660 supporting 10x 2.5” drives with NVMe backplane | Dell PowerEdge R760 supporting 12x 3.5” drives with SAS/SATA backplane |
| CPU | 2x Intel Xeon Gold 6442Y (24 cores @ 2.6 GHz) | 2x Intel Xeon Gold 6426Y (16 cores @ 2.5 GHz) |
| DRAM | 128 GB (16x 8 GB DDR5-4800) | Same as high performance |
| Storage controller | None (direct NVMe connection) | HBA355e adapter |
| Boot device | Dell BOSS-N1 with 2x 480 GB M.2 NVMe SSDs (RAID 1) | Same as high performance |
| Object storage [1] | Up to 10x 1.92 TB, 3.84 TB, or 7.68 TB Solidigm D7-P5520 SSD (PCIe Gen4, read-intensive) | Up to 12x 8 TB, 16 TB, or 22 TB 3.5” 12 Gbps 7.2K RPM SAS HDD |
| NIC [2] | Intel E810-CQDA2 PCIe add-in card (dual-port 100 Gb) | Intel E810-XXV for OCP 3.0 (dual-port 25 Gb) |
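Footnote [1] below notes that the object-storage drives back a MinIO object store. As a usage illustration, the sketch below stages a training dataset in such a store with the official MinIO Python SDK; the endpoint, credentials, bucket, and paths are placeholders.

```python
# A minimal sketch: stage a training dataset in the MinIO-backed object store
# served by the storage nodes above. Requires: pip install minio
from minio import Minio

client = Minio(
    "minio.storage.example.local:9000",  # placeholder endpoint
    access_key="ACCESS_KEY",             # placeholder credentials
    secret_key="SECRET_KEY",
    secure=False,                        # set True once TLS is configured
)

bucket = "training-data"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# fput_object streams the file from disk, so large datasets need not fit in memory.
client.fput_object(bucket, "datasets/images-v1.tar", "/data/images-v1.tar")
print(client.stat_object(bucket, "datasets/images-v1.tar").size, "bytes stored")
```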
Learn more
Deploy ML models quickly with cnvrg.io and VMware Tanzu. Contact your Dell or Intel account team at 1-877-289-3355 for a customized quote.
[1] The number of drives and capacity for MinIO object storage depends on dataset size and performance requirements.
[2] 100 Gb NICs are recommended for higher throughput.
[3] Optional – required only if a dedicated storage network for an external storage system is needed.