Selecting the appropriate server and network configuration for generative AI model customization is crucial to ensure that adequate resources are allocated for model training. This section provides example configurations for the management and compute workloads, along with the network architecture.
The following sections describe the management server configuration.
The following table provides the recommended minimum configuration for the management head node and control plane nodes:
Table 3. PowerEdge R660 head node and control plane configuration
| Component | Head node and control plane nodes |
|---|---|
| Server model | 3 x PowerEdge R660 |
| CPU | 1 x Intel Xeon Gold 6426Y 2.5 GHz, 16C/32T |
| Memory | 8 x 16 GB DDR5 4800 MT/s RDIMM |
| RAID controller | PERC H755 with rear-load brackets |
| Storage | 4 x 960 GB SSD vSAS Read Intensive 12 Gbps 512e 2.5 in Hot-Plug, AG Drive SED, 1 DWPD |
| PXE network | 1 x Broadcom 5720 Dual Port 1 GbE Optional LOM |
| PXE/Kubernetes network | 1 x NVIDIA ConnectX-6 Lx Dual Port 10/25 GbE SFP28, OCP NIC 3.0 |
| Kubernetes/storage network (optional) | 1 x NVIDIA ConnectX-6 Lx Dual Port 10/25 GbE SFP28 Adapter, PCIe |
| InfiniBand network (optional) | 1 x NVIDIA ConnectX-7 Single Port NDR OSFP PCIe, No Crypto, Full Height or 1 x NVIDIA ConnectX-6 Single Port HDR200 VPI InfiniBand Adapter, PCIe |
Consider the following recommendations for head and control plane node configuration:
PowerEdge XE9680 servers can be configured as worker nodes for generative AI model customization. The following table provides a recommended configuration for a PowerEdge XE9680 GPU worker node:
Table 4. PowerEdge XE9680 GPU worker node
| Component | Details |
|---|---|
| Server model | PowerEdge XE9680 (minimum of 2) |
| CPU | 2 x Intel Xeon Platinum 8468 2.1 GHz, 48C/96T, 16 GT/s |
| Memory | 16 x 64 GB RDIMM, 4800 MT/s Dual Rank |
| Storage | 2 x 1.92 TB Enterprise NVMe Read Intensive AG Drive U.2 Gen4 with carrier |
| PXE network | 1 x Broadcom 5720 Dual Port 1 GbE Optional LOM |
| Kubernetes network | 1 x Intel E810-XXV Dual Port 10/25 GbE SFP28, OCP NIC 3.0 |
| Storage network | 2 x NVIDIA ConnectX-6 Dx Dual Port 100 GbE QSFP56 Network Adapter |
| GPU | 8 x NVIDIA H100 SXM |
| InfiniBand network | 8 x NVIDIA ConnectX-7 Single Port NDR OSFP PCIe, No Crypto, Full Height or 8 x NVIDIA ConnectX-6 Single Port HDR200 VPI InfiniBand Adapter, PCIe |
Internally, each PowerEdge XE9680 has four PCIe switches, with two GPUs connected to each switch. For maximum throughput and performance, each PCIe switch is subdivided into two virtual switches. Therefore, for optimal GPU network performance, we recommend a dedicated network adapter for each GPU: four InfiniBand adapters for the PowerEdge XE8640 and eight for the PowerEdge XE9680.
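To verify this mapping on a deployed worker node, you can inspect the GPU-to-NIC PCIe affinity directly. The following is a minimal sketch, assuming the NVIDIA driver is installed so that `nvidia-smi` is available on the node; it prints the topology matrix, in which each GPU should report PIX (a single PCIe switch) against its intended InfiniBand adapter rather than SYS (traffic crossing the CPU interconnect).

```python
# topology_check.py - print the GPU/NIC PCIe topology on a worker node.
# Assumes the NVIDIA driver is installed so that `nvidia-smi` is on PATH.
import subprocess


def show_gpu_nic_topology() -> None:
    """Run `nvidia-smi topo -m` and print the affinity matrix.

    In the matrix, look for PIX (same PCIe switch) between each GPU and
    its intended InfiniBand adapter; SYS indicates that traffic would
    cross the CPU interconnect and lose bandwidth.
    """
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout)


if __name__ == "__main__":
    show_gpu_nic_topology()
```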
Although LLM workloads run primarily on the GPUs and place only modest demands on the CPU and memory, we recommend equipping the servers with high-performance CPUs and generous memory capacity. This provisioning leaves sufficient headroom for data processing, machine learning operations, monitoring, and logging tasks, ensuring that these supporting functions never disrupt the critical AI operations running on the GPUs.
The following figure shows the network architecture, including the network connectivity for the compute servers and the three PowerEdge head nodes that host NVIDIA Base Command Manager Essentials and the Kubernetes control plane nodes.
Figure 5. Network design
This validated design requires the following networks to manage the cluster and facilitate communication and coordination between different components and nodes in the cluster:
The following figure shows an example rack design for two PowerEdge XE9680 servers, equivalent to the configuration shown in Figure 5:
Figure 6. Example rack configuration for Validated Design for Model Customization
This rack design was created by using the Dell Enterprise Infrastructure Planning Tool (with the illustrations of the switches enhanced). You can use the tool to configure your solution and obtain weight, power requirements, airflow, and other details.
This example shows two servers in a single rack and requires two 17 kW Power Distribution Units (PDUs). However, customers must carefully evaluate their own power and cooling requirements, along with their preferences for rack layout, power distribution, airflow management, and cabling design.
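To illustrate the sizing arithmetic behind this example, the following is a minimal sketch. The per-server and overhead wattages are placeholder assumptions for illustration only, not measured values for the PowerEdge XE9680; substitute figures from the Dell Enterprise Infrastructure Planning Tool for your actual configuration.

```python
# pdu_sizing_sketch.py - illustrative rack power arithmetic.
# All input values below are assumptions for illustration only; take
# real figures from the Dell Enterprise Infrastructure Planning Tool.

SERVERS_PER_RACK = 2
WATTS_PER_SERVER = 10_000       # hypothetical peak draw per GPU server
SWITCH_AND_MISC_WATTS = 2_000   # hypothetical networking/management overhead
PDU_CAPACITY_WATTS = 17_000     # matches the 17 kW PDUs in this example
PDU_COUNT = 2

total_load = SERVERS_PER_RACK * WATTS_PER_SERVER + SWITCH_AND_MISC_WATTS
total_capacity = PDU_COUNT * PDU_CAPACITY_WATTS
headroom = total_capacity - total_load

print(f"Estimated rack load:    {total_load / 1000:.1f} kW")
print(f"Installed PDU capacity: {total_capacity / 1000:.1f} kW")
print(f"Headroom:               {headroom / 1000:.1f} kW")
```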
Where significant growth in the size of the deployment is anticipated, customers should consider separate racks for compute, storage, and management nodes to allow sufficient capacity for that growth.
The following table provides the APC rack and PDU recommendations for the Americas region. We recommend that you consult your Dell or APC representative, who can assess your unique data center requirements and provide an accurate PDU recommendation.
Table 5. Example Rack and PDU recommendations for the PowerEdge XE9680 server
| Servers per cabinet | Rack U height | APC rack model | PDU quantity | APC PDU model |
|---|---|---|---|---|
| 2 | 42 | AR3300 | 2 | APDU10452SW |
| 4 | 42 | AR3350 | 4 | APDU10452SW |
| 2 | 48 | AR3307 | 2 | APDU10450SW |
| 4 | 48 | AR3357 | 4 | APDU10450SW |