Design considerations for the machine learning compute servers include:
- Number of compute servers—We have validated up to four servers per VMware vSphere cluster. VMware supports a maximum of 64 nodes per cluster.
- Processor and memory—We recommend Intel Xeon Platinum or Gold processors for virtualization. We recommend at least 384 GB memory for memory-intensive AI workloads.
- GPUs—Each PowerEdge R740 or PowerEdge R740xd compute server can support a maximum of three NVIDIA A100 GPUs (without any additional network adapters). When you configure the server with a ConnectX-5 or ConnectX-6 network adapter, a maximum of two NVIDIA A100 GPUs is supported.
- Storage for VMware ESXi—VMware ESXi can be installed on an SD card or on the BOSS controller.
- Storage controller and drives for vSAN—A Dell HBA330 controller is used as the storage controller for vSAN. Customer workloads and VM requirements drive the drive-capacity requirements. We used 800 GB write-intensive SAS SSDs for the vSAN cache tier and 960 GB read-intensive SAS SSDs for the vSAN capacity tier.
- Network adapters—As noted in Table 2, the 25 GbE design uses the ConnectX-5 network adapter and the 100 GbE design uses the ConnectX-6 network adapter. The 100 GbE design also uses an Intel X710 10 GbE network adapter for management, vSphere vMotion, and other VM traffic that does not require 100 GbE bandwidth.
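The sizing rules above (validated servers per cluster, and the A100 GPU limit that depends on whether a ConnectX adapter is installed) can be encoded as a simple configuration check. The sketch below is illustrative only, not a Dell or VMware tool; the function and parameter names are our own, and it captures just the limits stated in this section.

```python
# Illustrative sketch (not Dell/VMware tooling): encode the sizing rules
# from this section and check a hypothetical server configuration.

MAX_SERVERS_PER_CLUSTER = 4    # validated limit in this design
VSPHERE_CLUSTER_NODE_MAX = 64  # VMware vSphere cluster maximum

def max_a100_gpus(has_connectx_adapter: bool) -> int:
    """R740/R740xd A100 limit: three GPUs without a ConnectX-5/6
    adapter installed, two when one is present."""
    return 2 if has_connectx_adapter else 3

def validate(servers_in_cluster: int, gpus: int, has_connectx: bool) -> list:
    """Return a list of rule violations (empty list means the config fits)."""
    problems = []
    if servers_in_cluster > MAX_SERVERS_PER_CLUSTER:
        problems.append(
            f"{servers_in_cluster} servers exceeds the validated "
            f"limit of {MAX_SERVERS_PER_CLUSTER} per cluster"
        )
    limit = max_a100_gpus(has_connectx)
    if gpus > limit:
        problems.append(
            f"{gpus} A100 GPUs exceeds the limit of {limit} "
            f"for this network adapter configuration"
        )
    return problems

# Example: a 100 GbE design server (ConnectX-6 present) with three GPUs
# violates the two-GPU limit for that configuration.
print(validate(servers_in_cluster=4, gpus=3, has_connectx=True))
```

Running the check with two GPUs and a ConnectX adapter, or three GPUs without one, returns an empty list, matching the limits stated above.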