With VMware vSphere support for virtualized GPUs, IT administrators can run AI workloads such as neural network training, inference, or model development alongside their standard data center applications, such as MySQL and Apache web server. The following figure shows the Dell Technologies Validated Design for AI with PowerEdge R750xa servers, each with two NVIDIA A100 GPUs and a ConnectX network adapter, as part of a VMware vSphere cluster. The VMs with vGPU run containers from NVIDIA AI Enterprise:
Figure 6. Dell Technologies Validated Design for AI with PowerEdge R750xa servers
Key aspects of this validated design include:
- Compute server—The PowerEdge R750xa, R750, R740, R740xd, and R7525 servers are part of this validated design. Customers who require HCI can use VxRail V670 or VxRail V570F appliances.
- GPUs—NVIDIA A100 and A30 GPUs can be used for AI and machine learning. We recommend the A100 GPU for large neural network training models that require high performance and the A30 GPU for AI inference and mainstream enterprise workloads. The number of GPUs supported in a server depends on the server model as shown in Table 1.
- Storage—vSAN is the recommended storage for VMs. We recommend PowerStore storage for data lake storage, that is, for storing the data required for neural network training. PowerScale storage can also be used to store data for AI workloads in an NFS partition.
- Network infrastructure—Customers can use either a 25 Gb Ethernet (25 GbE) or a 100 Gb Ethernet (100 GbE) network infrastructure. We recommend 25 GbE for workloads that can run on existing network infrastructure without requiring an investment in 100 GbE. A 25 GbE design is suited for neural network training jobs that can run on a single node (using at most two GPUs), and for model development and inference jobs that take advantage of GPU partitioning. We recommend 100 GbE for workloads that require large-scale model training using large datasets (typically high-resolution video or image-based datasets).
- Management with VMware vCenter—VMware vCenter Server can be deployed as a VM in either of the following ways:
  - vCenter Server is installed on one PowerEdge R750xa server in the compute cluster. This deployment is recommended only for small environments, because maintenance, upgrades, and other host operations might affect the availability of vCenter Server.
  - vCenter Server is installed on a separate management cluster that has network connectivity to the compute cluster with GPUs. We recommend this deployment because it avoids the preceding limitations.
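The GPU and network recommendations in the list above can be read as a simple decision rule. The following Python sketch is illustrative only and is not part of the validated design. The `Workload` class, its field names, and both helper functions are hypothetical. Two mappings are assumptions beyond what the text states: model development is mapped to the A30 GPU, and multi-node training is mapped to 100 GbE.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of an AI workload (names are illustrative)."""
    kind: str                    # "training", "inference", or "development"
    multi_node: bool = False     # True if training spans more than one server
    large_dataset: bool = False  # for example, high-resolution video or images


def recommend_gpu(w: Workload) -> str:
    # A100 for neural network training that requires high performance;
    # A30 for inference and mainstream enterprise workloads.
    # Assumption: model development is treated like inference here.
    return "A100" if w.kind == "training" else "A30"


def recommend_network(w: Workload) -> str:
    # 100 GbE for large-scale training on large datasets.
    # Assumption: multi-node training is also treated as large-scale.
    if w.kind == "training" and (w.multi_node or w.large_dataset):
        return "100 GbE"
    # 25 GbE for single-node training, model development, and inference.
    return "25 GbE"


if __name__ == "__main__":
    job = Workload(kind="training", large_dataset=True)
    print(recommend_gpu(job), recommend_network(job))
```

A rule like this is only a starting point for sizing. The number of GPUs per server still depends on the server model, as noted in Table 1.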
For more information about the validated design, including detailed recommended configurations, design considerations, and deployment overview, see the Virtualizing GPUs for AI with VMware and NVIDIA design guide.