Dell Technologies, NVIDIA, and VMware are offering enterprises a new way forward with the launch of a solution that democratizes and unlocks AI across the enterprise. This Dell Technologies Validated Design for AI is jointly engineered and validated to help organizations capitalize on the benefits of virtualization for AI workloads. The Dell Technologies Validated Design for AI includes the latest version of VMware vSphere combined with the NVIDIA AI Enterprise suite on Dell EMC PowerEdge servers and VxRail Hyperconverged Infrastructure (HCI). The design also includes Dell EMC PowerScale, which provides the analytics performance and concurrency at scale needed to consistently feed the most data-hungry AI algorithms.
Figure 1. Overview of components in the Dell Technologies Validated Design for AI
This combination of leading-edge technologies makes it possible to use the latest NVIDIA Ampere GPUs with the predictability and security of vSphere virtualization on VMware-optimized infrastructure. This validated design brings the following key benefits:
- No siloed infrastructure for AI—Customers can use the same data center tools and processes with which they are familiar for building and operating AI infrastructure. With integration to the VMware ecosystem, customers can avoid silos of AI-specific systems that are difficult to manage and secure. They can also mitigate the risks of shadow AI deployments, where data scientists and machine learning engineers procure resources outside of the IT ecosystem.
- Consistent tools for management and operations—GPU resources can now be virtualized in the same manner as CPU, memory, network, and storage resources. This virtualization allows IT administrators to use the same management and operations tools for both their AI workloads and other data center workloads.
- Curated end-to-end AI software—The NVIDIA AI Enterprise software suite includes AI and data science tools and frameworks that are packaged as containers for easy and rapid deployment. These containers are selected to support end-to-end AI development and are validated on VMware vSphere.
- Enterprise-grade support—NVIDIA Support Services for the NVIDIA AI Enterprise software suite provides access to comprehensive software patches, updates, upgrades, and technical support. These services give customers an easy and reliable way to improve productivity and reduce downtime for their AI infrastructure.
- Near bare-metal performance and scalability—AI workloads can run at near bare-metal performance on virtualized GPUs. These workloads can scale across multiple GPUs and multiple nodes, allowing training of even the largest deep learning models.
Some of the key features of this validated design include:
- GPU virtualization and allocation—VMware vSphere 7 Update 2 and later supports virtualization for NVIDIA Ampere GPUs. The virtualized GPUs can be assigned to virtual machines (VMs) and containers through Single-Root Input/Output Virtualization (SR-IOV). Also, vSphere supports:
- Partitioning of GPUs using NVIDIA Multi-Instance GPU (MIG) technology, increasing GPU utilization
- GPU aggregation, allowing multiple virtual GPUs to be assigned to a single VM to support compute-intensive deep learning jobs
- Availability and continuous maintenance using VMware vMotion—vSphere enables live migration (using vMotion) for NVIDIA virtual GPU (vGPU)-powered VMs, simplifying infrastructure maintenance such as consolidation, expansion, or upgrades, and enabling nondisruptive operations
Also, with the Distributed Resource Scheduler (DRS), vSphere provides automatic initial workload placement for AI infrastructure at scale for optimal resource consumption and to avoid performance bottlenecks.
- Support for VM suspend and resume operations with virtual GPUs
- Multinode training—GPUDirect RDMA from NVIDIA enables a direct peer-to-peer data path between GPU memory and NVIDIA ConnectX network adapters. This path significantly decreases GPU-to-GPU communication latency and completely offloads the CPU, removing it from all GPU-to-GPU communications across the network and enabling near bare-metal performance for multinode training.
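As an illustration of the MIG partitioning mentioned above, the following sketch shows how a host GPU can be partitioned at the driver level with `nvidia-smi`. This is a generic NVIDIA workflow, not a vSphere-specific procedure; the GPU index and the profile ID (9, which corresponds to the 3g.20gb profile on an A100 40 GB card) are assumptions for illustration and should be adjusted for the actual hardware:

```shell
# Enable MIG mode on GPU 0 (requires a MIG-capable GPU such as the A100;
# a GPU reset or host reboot may be needed before the change takes effect)
nvidia-smi -i 0 -mig 1

# List the MIG GPU instance profiles this GPU supports
nvidia-smi mig -i 0 -lgip

# Create two 3g.20gb GPU instances and their default compute instances
# (profile ID 9 is 3g.20gb on an A100 40 GB; adjust for your card)
nvidia-smi mig -i 0 -cgi 9,9 -C

# Confirm the resulting GPU instances
nvidia-smi mig -i 0 -lgi
```

In the validated design, the resulting MIG-backed vGPU profiles are then assigned to VMs through vSphere rather than consumed directly on the host.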
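To show how a data scientist might exercise the multinode training path described above, here is a hedged sketch of launching a PyTorch distributed job across two GPU-enabled VMs with `torchrun`. The hostname `node-a`, the script name `train.py`, and the per-node GPU count are illustrative assumptions; NCCL decides at runtime whether GPUDirect RDMA is used, and `NCCL_NET_GDR_LEVEL` only controls how far across the topology it is permitted:

```shell
# Illustrative two-node launch; run once on each VM, changing --node_rank.
# NCCL_NET_GDR_LEVEL governs when NCCL may use GPUDirect RDMA between the
# GPU and the NIC; SYS permits it across the whole system topology.
export NCCL_NET_GDR_LEVEL=SYS

# node-a is assumed to be the rendezvous host (hypothetical hostname)
torchrun \
  --nnodes=2 \
  --nproc_per_node=4 \
  --node_rank=0 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=node-a:29500 \
  train.py
```

With the RDMA path active, gradient all-reduce traffic moves NIC-to-GPU without staging through host memory, which is what allows the near bare-metal multinode scaling the design describes.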