Design considerations for the machine learning compute servers include:
- PowerEdge server and VxRail models—PowerEdge R7525 or R750 servers are ideal for mainstream performance. The PowerEdge R750xa server is designed for accelerators and provides the highest GPU density among PowerEdge servers. Customers who require a turnkey HCI appliance with full-stack life cycle management can select VxRail as the compute server. All these models are NVIDIA-Certified Systems, which means they have been validated to provide optimal performance, scalability, and security for accelerated workloads.
- Number of compute servers—We have validated up to four servers per VMware vSphere cluster. VMware supports a maximum of 64 nodes per cluster.
- Processor and memory—We recommend Intel Xeon Platinum or Gold processors for virtualization. We recommend at least 384 GB of memory for memory-intensive AI workloads.
- GPUs—The PowerEdge servers can be configured with NVIDIA A100 or A30 GPUs. The A100 GPU is recommended for deep learning and training of complex neural networks. The A30 GPU is recommended for AI inference, language processing, conversational AI, and recommender systems.
- Storage for VMware ESXi—ESXi can be installed on an SD card or on the BOSS controller.
- Storage controller and drives for vSAN—A Dell HBA330 controller is used as the storage controller for vSAN. Customer workloads and VM requirements determine the drive configuration. We used 800 GB write-intensive SAS SSDs for the vSAN cache tier and 960 GB read-intensive SAS SSDs for the vSAN capacity tier.
- Network adapters—As shown in Table 2, the 25 GbE design uses the ConnectX-5 network adapter and the 100 GbE design uses the ConnectX-6 network adapter. The 100 GbE design also uses the Intel X710 10 GbE network adapter for management, vSphere vMotion, and other VM traffic that does not require 100 GbE.
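The sizing guidance above can be condensed into a quick sanity check. The following Python sketch is purely illustrative: the function, dictionary, and threshold names are ours, not part of the validated design, and it only encodes the node-count, memory, and GPU recommendations stated in this section.

```python
# Illustrative design-check helper; names and structure are hypothetical.
MAX_NODES_PER_CLUSTER = 64   # VMware vSphere cluster maximum
VALIDATED_NODES = 4          # node count validated in this design
MIN_MEMORY_GB = 384          # recommended minimum for memory-intensive AI workloads
RECOMMENDED_GPUS = {
    "training": "A100",      # deep learning and training of complex neural networks
    "inference": "A30",      # AI inference, language processing, conversational AI
}

def check_cluster(nodes: int, memory_gb: int, workload: str) -> dict:
    """Report whether a proposed compute cluster follows the guidance above."""
    return {
        "within_vmware_max": nodes <= MAX_NODES_PER_CLUSTER,
        "within_validated_count": nodes <= VALIDATED_NODES,
        "memory_ok": memory_gb >= MIN_MEMORY_GB,
        "recommended_gpu": RECOMMENDED_GPUS.get(workload, "A100 or A30"),
    }

report = check_cluster(nodes=4, memory_gb=512, workload="training")
```

A four-node training cluster with 512 GB of memory per server, as in this example, satisfies all of the recommendations and maps to the A100 GPU.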