Design considerations for the machine learning compute servers include:
- PowerEdge server and VxRail models—PowerEdge R7525 or PowerEdge R750 servers are ideal for mainstream performance. The PowerEdge R750xa server is designed for accelerators and provides the highest GPU density among PowerEdge servers. Customers who require a turnkey HCI appliance with full-stack life cycle management can select VxRail as the compute server. All these models are NVIDIA-Certified Systems, which means that they have been validated to provide optimal performance, scalability, and security for accelerated workloads.
- Number of compute servers—We have validated up to four servers per VMware vSphere cluster. VMware supports a maximum of 64 nodes per cluster.
- Processor and memory—We recommend Intel Xeon Platinum or Gold processors for PowerEdge R750, PowerEdge R750xa, and VxRail 670F servers, and AMD EPYC processors for PowerEdge R7525 servers. We recommend at least 512 GB of memory for memory-intensive AI workloads.
- GPUs—The PowerEdge servers can be configured with NVIDIA A100 or A30 GPUs. The A100 GPU is recommended for deep learning and training of complex neural networks. The A30 GPU is recommended for AI inference, language processing, conversational AI, and recommender systems.
- Storage for VMware ESXi—For better reliability, higher performance, and increased isolation from VM data, we recommend that you install VMware ESXi on the BOSS controller.
- Storage controller and drives for vSAN—A Dell HBA330 controller is used as the storage controller for vSAN. Customer workloads and VM requirements determine the drive requirements. We used 800 GB Write Intensive SAS SSDs for the vSAN cache tier and 960 GB Read Intensive SAS SSDs for the vSAN capacity tier.
- Network adapters—As shown in Table 5, the 25 GbE design uses ConnectX-5/ConnectX-6 network adapters and the 100 GbE design uses ConnectX-6 network adapters. The 100 GbE design also uses the Intel X710 10 GbE network adapter for management traffic, vSphere vMotion, and traffic from VMs that do not require 100 GbE.
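The recommendations in the list above can be sketched as a small planning check. This is only an illustration: the class, function, and workload labels (`ComputeCluster`, `review`, `"training"`, `"inference"`) are hypothetical names introduced here and are not part of any Dell, VMware, or NVIDIA tool. The numeric limits and model-to-processor pairings are the ones stated above.

```python
from dataclasses import dataclass
from typing import List

# Figures taken from the design considerations above.
VALIDATED_SERVERS_PER_CLUSTER = 4   # validated in this design
MAX_NODES_PER_CLUSTER = 64          # VMware maximum cited above
RECOMMENDED_MIN_MEMORY_GB = 512     # for memory-intensive AI workloads

# Recommended processor vendor per server model (from the list above).
RECOMMENDED_CPU_VENDOR = {
    "PowerEdge R750": "Intel",
    "PowerEdge R750xa": "Intel",
    "VxRail 670F": "Intel",
    "PowerEdge R7525": "AMD",
}

# Recommended GPU per workload type; the workload labels are hypothetical.
RECOMMENDED_GPU = {
    "training": "A100",
    "inference": "A30",
}


@dataclass
class ComputeCluster:
    server_model: str
    cpu_vendor: str
    memory_gb: int
    gpu_model: str
    workload: str
    server_count: int


def review(cluster: ComputeCluster) -> List[str]:
    """Return notes describing where a planned cluster departs from the guidance."""
    notes = []

    expected_cpu = RECOMMENDED_CPU_VENDOR.get(cluster.server_model)
    if expected_cpu is None:
        notes.append(f"{cluster.server_model} is not covered by this design")
    elif cluster.cpu_vendor != expected_cpu:
        notes.append(f"{expected_cpu} processors are recommended for {cluster.server_model}")

    if cluster.memory_gb < RECOMMENDED_MIN_MEMORY_GB:
        notes.append(f"at least {RECOMMENDED_MIN_MEMORY_GB} GB of memory is recommended")

    expected_gpu = RECOMMENDED_GPU.get(cluster.workload)
    if expected_gpu is not None and cluster.gpu_model != expected_gpu:
        notes.append(f"the {expected_gpu} GPU is recommended for {cluster.workload}")

    if cluster.server_count > MAX_NODES_PER_CLUSTER:
        notes.append(f"exceeds the maximum of {MAX_NODES_PER_CLUSTER} nodes per cluster")
    elif cluster.server_count > VALIDATED_SERVERS_PER_CLUSTER:
        notes.append(f"more than the {VALIDATED_SERVERS_PER_CLUSTER} servers validated per cluster")

    return notes
```

Calling `review(ComputeCluster("PowerEdge R750xa", "Intel", 512, "A100", "training", 4))` returns an empty list because every field matches the guidance. A plan that departs from it returns one note per departure, which can serve as a checklist before ordering hardware.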
VMware vCenter Server can be deployed in either of the following ways:
- vCenter Server is installed on one PowerEdge or VxRail server in the compute cluster. This deployment is recommended only for small environments. Take care during maintenance, upgrades, and other host operations that might affect the availability of vCenter Server.
- vCenter Server is installed on a separate management cluster and has network connectivity to the compute cluster with GPUs.