
Scaling and Optimizing ML in Enterprises


Tue, 16 May 2023 19:53:46 -0000


Justin King

Todd Mottershead

Seamus Jones

Abirami Prabhakaran

Francisco M. Casares

Marcin Hoffmann

Marcin Gajzler

Krzysztof Cieplucha (Intel)

Andy Morris

Mishali Naik (Intel)

Summary

This joint paper, written by Dell Technologies in collaboration with Intel®, describes the key hardware considerations for configuring a successful MLOps deployment and recommends configurations based on the most recent 15th Generation Dell PowerEdge server portfolio.

Today’s enterprises are looking to operationalize machine learning to accelerate and scale data science across the organization, especially as their need to deploy, monitor, and maintain data pipelines and models grows. Cloud native infrastructure, such as Kubernetes, offers a fast and scalable means to implement Machine Learning Operations (MLOps) by using Kubeflow, an open-source platform for developing and deploying Machine Learning (ML) pipelines on Kubernetes.

Dell PowerEdge R650 servers with 3rd Generation Intel® Xeon® Scalable processors deliver a scalable, portable, and cost-effective solution to implement and operationalize machine learning within the enterprise.

Key Considerations

Portability. A single end-to-end platform to meet the machine learning needs of various use cases, including predictive analytics, inference, and transfer learning.
Optimized performance. High-performance 3rd Generation Intel® Xeon® Scalable processors accelerate machine learning algorithms using AVX-512. Intel® performance optimizations that are built into Dell PowerEdge servers can help fine-tune large Transformer models across multi-node systems. These optimizations work in conjunction with open-source cloud native MLOps tools and include Intel® and open-source software and hardware technologies such as the Kubernetes stack, AVX-512, Horovod for distributed training, and TensorFlow 2.10.0.
Scalability. As the machine learning workload grows, additional compute capacity needs to be added to the cloud native infrastructure. Dell PowerEdge R750 servers with 3rd Generation Intel® Xeon® Scalable processors deliver an efficient and scalable approach to MLOps.
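
Because the performance point above depends on AVX-512 being available on the worker nodes, it can be useful to confirm that a node actually reports those instruction sets before scheduling training jobs on it. The following is a minimal sketch, assuming a Linux host that exposes CPU feature flags in /proc/cpuinfo; the helper names `cpu_flags` and `avx512_features` are hypothetical and are not part of any Dell or Intel tooling.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def avx512_features(cpuinfo_text: str) -> list[str]:
    """Return the sorted AVX-512 related flags (for example avx512f, avx512_vnni)."""
    return sorted(flag for flag in cpu_flags(cpuinfo_text) if flag.startswith("avx512"))


if __name__ == "__main__":
    # Assumption: running on Linux; on other systems the file does not exist.
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    print(avx512_features(text) or "no AVX-512 flags reported")
```

On a node without AVX-512 the list is empty, in which case frameworks such as TensorFlow would typically fall back to narrower vector instructions.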

Recommended Configurations

Cluster
	Control Plane Nodes (3 Nodes Required)	Data Plane Nodes (4 Nodes or More)
Functions	Kubernetes services	Develop, deploy, and run machine learning (ML) workflows
Platform	Dell PowerEdge R650 with up to 10x 2.5" NVMe direct drives
CPU	2x Intel® Xeon® Gold 6326 processor (16 cores @ 2.9 GHz), or better	2x Intel® Xeon® Platinum 8380 processor (40 cores @ 2.3 GHz), or 2x Intel® Xeon® Platinum 8368 processor (38 cores @ 2.4 GHz), or Intel® Xeon® Platinum 8360Y processor (36 cores @ 2.4 GHz)
DRAM	128 GB (16x 8 GB DDR4-3200)	512 GB (16x 32 GB DDR4-3200)
Boot device	Dell Boot Optimized Server Storage (BOSS)-S2 with 2x 240 GB or 2x 480 GB Intel® SSD S4510 M.2 SATA (RAID 1)
Storage adapter	Not required for all-NVMe configuration
Storage (NVMe)	1x 1.6 TB Enterprise NVMe Mixed-Use AG Drive U.2 Gen4	1x 1.6 TB (or larger) Enterprise NVMe Mixed-Use AG Drive U.2 Gen4
NIC	Intel® E810-XXVDA2 for OCP3 (dual-port 25GbE)	Intel® E810-XXVDA2 for OCP3 (dual-port 25GbE), or Intel® E810-CQDA2 PCIe (dual-port 100GbE)
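
To make the sizing implied by the table concrete, the sketch below totals the physical cores and DRAM for the minimum cluster (three control plane nodes and four data plane nodes) and estimates how many data plane nodes a given core requirement would need. It assumes the Gold 6326 and Platinum 8380 options from the table; `NodeSpec`, `totals`, and `data_nodes_needed` are hypothetical names introduced for illustration, and the calculation ignores per-node overhead reserved for Kubernetes system services.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    sockets: int
    cores_per_socket: int
    dram_gb: int

    @property
    def cores(self) -> int:
        return self.sockets * self.cores_per_socket


# Values taken from the recommended configuration table.
CONTROL_PLANE = NodeSpec(sockets=2, cores_per_socket=16, dram_gb=128)  # 2x Gold 6326
DATA_PLANE = NodeSpec(sockets=2, cores_per_socket=40, dram_gb=512)     # 2x Platinum 8380


def totals(spec: NodeSpec, nodes: int) -> tuple[int, int]:
    """Return (total physical cores, total DRAM in GB) for a group of identical nodes."""
    return nodes * spec.cores, nodes * spec.dram_gb


def data_nodes_needed(required_cores: int, spec: NodeSpec = DATA_PLANE, minimum: int = 4) -> int:
    """Return the node count covering required_cores, never below the recommended minimum."""
    return max(minimum, math.ceil(required_cores / spec.cores))


if __name__ == "__main__":
    print("control plane:", totals(CONTROL_PLANE, 3))   # (96, 384)
    print("data plane:", totals(DATA_PLANE, 4))         # (320, 2048)
    print("nodes for 500 cores:", data_nodes_needed(500))  # 7
```

A real capacity plan would also account for memory per pod, storage, and network bandwidth; this example covers only the core and DRAM arithmetic.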

Resources

Visit the Dell support page, or contact your Dell or Intel account team for a customized quote at 1-877-289-3355.
