Scaling and Optimizing ML in Enterprises
Tue, 16 May 2023
Summary
This joint paper, written by Dell Technologies in collaboration with Intel®, describes the key hardware considerations for configuring a successful MLOps deployment and recommends configurations based on the most recent 15th Generation Dell PowerEdge server portfolio.
Today’s enterprises are looking to operationalize machine learning to accelerate and scale data science across the organization, especially as they need to deploy, monitor, and maintain data pipelines and models. Cloud native infrastructure such as Kubernetes offers a fast and scalable way to implement Machine Learning Operations (MLOps) through Kubeflow, an open source platform for developing and deploying Machine Learning (ML) pipelines on Kubernetes.
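To illustrate how such a pipeline might be expressed, the sketch below uses the Kubeflow Pipelines (kfp) v2 Python SDK. The component images, step names, and parameters are placeholders for illustration only and are not part of the referenced solution.

```python
# A minimal sketch of a Kubeflow pipeline, assuming the kfp v2 SDK
# (pip install kfp) and an existing Kubeflow Pipelines deployment.
from kfp import dsl, compiler


@dsl.component(base_image="tensorflow/tensorflow:2.10.0")
def train(epochs: int) -> str:
    # Placeholder training step; a real component would load data,
    # build a model, and write artifacts to shared storage.
    return f"trained for {epochs} epochs"


@dsl.component(base_image="tensorflow/tensorflow:2.10.0")
def evaluate(model_info: str) -> float:
    # Placeholder evaluation step.
    print(f"evaluating: {model_info}")
    return 0.0


@dsl.pipeline(name="example-training-pipeline")
def training_pipeline(epochs: int = 10):
    train_task = train(epochs=epochs)
    evaluate(model_info=train_task.output)


if __name__ == "__main__":
    # Compile to a YAML spec that can be uploaded through the
    # Kubeflow Pipelines UI or submitted with the kfp client.
    compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```

The compiled YAML can then be uploaded through the Kubeflow Pipelines UI or submitted with the kfp client against a cluster such as the one described in the recommended configurations below.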
Dell PowerEdge R650 servers with 3rd Generation Intel® Xeon® Scalable processors deliver a scalable, portable, and cost-effective solution for implementing and operationalizing machine learning within the enterprise.
Key Considerations
- Portability. A single end-to-end platform to meet the machine learning needs of various use cases, including predictive analytics, inference, and transfer learning.
- Optimized performance. High-performance 3rd Generation Intel® Xeon® Scalable processors optimize performance for machine learning algorithms using AVX-512. Intel® performance optimizations built into Dell PowerEdge servers can help fine-tune large Transformer models across multi-node systems, working in conjunction with open-source cloud native MLOps tools. Optimizations include Intel® and open-source software and hardware technologies such as the Kubernetes stack, AVX-512, Horovod for distributed training, and TensorFlow 2.10.0 (see the training sketch after this list).
- Scalability. As the machine learning workload grows, additional compute capacity needs to be added to the cloud native infrastructure. Dell PowerEdge R750 servers with 3rd Generation Intel® Xeon® Scalable processors deliver an efficient and scalable approach to MLOps.
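To make the Horovod and TensorFlow 2.10 combination mentioned above concrete, here is a minimal data-parallel training sketch. It assumes horovod[tensorflow] is installed on each node; the model, dataset, and hyperparameters are illustrative placeholders rather than the tuned configuration from the paper.

```python
# A minimal sketch of data-parallel training with Horovod on TensorFlow 2.10.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod; each worker process learns its rank and the world size.
hvd.init()

# Scale the learning rate by the number of workers, a common Horovod practice.
opt = tf.keras.optimizers.Adam(learning_rate=0.001 * hvd.size())
# Wrap the optimizer so gradients are averaged across workers via allreduce.
opt = hvd.DistributedOptimizer(opt)

# Placeholder model and dataset (MNIST) purely for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

callbacks = [
    # Broadcast initial variables from rank 0 so all workers start in sync.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Only rank 0 prints progress to keep multi-process output readable.
model.fit(x_train, y_train,
          batch_size=64,
          epochs=3,
          callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

A run across two four-process hosts would typically be launched with something like `horovodrun -np 8 -H host1:4,host2:4 python train.py` (hostnames and process counts here are examples).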
Recommended Configurations
| Cluster | Control Plane Nodes (three nodes required) | Data Plane Nodes (four or more nodes) |
|---|---|---|
| Functions | Kubernetes services | Develop, deploy, and run Machine Learning (ML) workflows |
| Platform | Dell PowerEdge R650 with up to 10x 2.5" NVMe direct drives | Dell PowerEdge R650 with up to 10x 2.5" NVMe direct drives |
| CPU | 2x Intel® Xeon® Gold 6326 processor (16 cores @ 2.9 GHz), or better | 2x Intel® Xeon® Platinum 8380 processor (40 cores @ 2.3 GHz), 2x Intel® Xeon® Platinum 8368 processor (38 cores @ 2.4 GHz), or 2x Intel® Xeon® Platinum 8360Y processor (36 cores @ 2.4 GHz) |
| DRAM | 128 GB (16x 8 GB DDR4-3200) | 512 GB (16x 32 GB DDR4-3200) |
| Boot device | Dell Boot Optimized Server Storage (BOSS-S2) with 2x 240 GB or 2x 480 GB Intel® SSD S4510 M.2 SATA (RAID 1) | Dell Boot Optimized Server Storage (BOSS-S2) with 2x 240 GB or 2x 480 GB Intel® SSD S4510 M.2 SATA (RAID 1) |
| Storage adapter | Not required for all-NVMe configuration | Not required for all-NVMe configuration |
| Storage (NVMe) | 1x 1.6 TB Enterprise NVMe Mixed-Use AG Drive U.2 Gen4 | 1x 1.6 TB (or larger) Enterprise NVMe Mixed-Use AG Drive U.2 Gen4 |
| NIC | Intel® E810-XXVDA2 for OCP 3.0 (dual-port 25 GbE) | Intel® E810-XXVDA2 for OCP 3.0 (dual-port 25 GbE), or Intel® E810-CQDA2 PCIe (dual-port 100 GbE) |
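As a quick operational check, a sketch like the following can list each node's advertised CPU and memory capacity so a running cluster can be compared against the sizing above. It assumes the official Python kubernetes client and a working kubeconfig; it is a convenience example, not part of the referenced solution.

```python
# List each Kubernetes node's role and capacity, assuming the official
# Python kubernetes client (pip install kubernetes) and a valid kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    name = node.metadata.name
    # Node roles are encoded as node-role.kubernetes.io/<role> labels.
    roles = [
        label.split("/", 1)[1]
        for label in node.metadata.labels
        if label.startswith("node-role.kubernetes.io/")
    ]
    capacity = node.status.capacity
    print(f"{name} roles={roles or ['worker']} "
          f"cpu={capacity['cpu']} memory={capacity['memory']}")
```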
Resources
Visit the Dell support page, or contact your Dell or Intel account team at 1-877-289-3355 for a customized quote.