Selecting the appropriate server and network configuration for generative AI inference is crucial to ensure that adequate resources are allocated for both management and inference tasks. This section provides example configurations for the management and compute workloads, along with the network architecture.
The following table provides the recommended minimum configuration for the management head node and Kubernetes control plane nodes.
Table 4. PowerEdge R660 head node and Kubernetes control plane configuration
| Component | Head node and control plane nodes |
| --- | --- |
| Server model | 3 x PowerEdge R660 |
| CPU | 1 x Intel Xeon Gold 6426Y 2.5 GHz, 16C/32T |
| Memory | 8 x 16 GB DDR5 4800 MT/s RDIMM |
| RAID controller | PERC H755 with rear-load brackets |
| Storage | 4 x 960 GB SSD vSAS Read Intensive 12 Gbps 512e 2.5 in Hot-Plug, AG Drive SED, 1 DWPD |
| PXE network | Broadcom 5720 Dual Port 1 GbE Optional LOM |
| PXE/K8S network | NVIDIA ConnectX-6 Lx Dual Port 10/25 GbE SFP28, OCP NIC 3.0 |
| K8S/Storage network | 1 x NVIDIA ConnectX-6 Lx Dual Port 10/25 GbE SFP28 Adapter, PCIe (optional) |
Consider the following recommendations:
Dell Technologies provides a selection of three GPU-optimized servers suitable for configuration as worker nodes for generative AI inference: the PowerEdge R760xa, PowerEdge XE9680, and PowerEdge XE8640 servers. Customers have the flexibility to choose one of these PowerEdge servers based on the specific model size that they require. Larger models, characterized by a greater parameter count, require servers equipped with a higher GPU count and enhanced connectivity. For examples of LLMs that can be deployed on each server model, see Table 3.
The GPU-optimized servers act as worker nodes in a Kubernetes cluster. The number of servers depends on the number of models and the number of concurrent requests served by those models. We have validated an eight-GPU worker node cluster. The minimum number of worker nodes in the cluster is one.
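As a sketch of how inference workloads land on these worker nodes, a pod can request GPUs through the NVIDIA device plugin's `nvidia.com/gpu` extended resource, and the Kubernetes scheduler then places it on a GPU worker node. The pod name and container image below are illustrative placeholders, not part of this validated design:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference            # placeholder name
spec:
  containers:
    - name: inference-server
      image: nvcr.io/nvidia/tritonserver:23.10-py3   # example image only
      resources:
        limits:
          nvidia.com/gpu: 1      # request one GPU; the scheduler selects a GPU worker node
```

Requesting more than one GPU per pod follows the same pattern; the device plugin exposes as many `nvidia.com/gpu` resources as the node has GPUs installed.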
The following table shows a recommended configuration for a PowerEdge R760xa GPU worker node.
Table 5. PowerEdge R760xa GPU worker node
| Component | Details |
| --- | --- |
| Server model | PowerEdge R760xa |
| CPU | 2 x Intel Xeon Gold 6438M 2.2 GHz, 32C/64T |
| Memory | 16 x 32 GB DDR5 4800 MT/s RDIMM |
| Storage | 2 x 1.92 TB Enterprise NVMe Read Intensive AG Drive U.2 Gen4 with carrier |
| PXE network | Broadcom 5720 Dual Port 1 GbE Optional LOM |
| K8S/Storage network | |
| GPU | Either: |
The following table shows a recommended configuration for a PowerEdge XE8640 GPU worker node.
Table 6. PowerEdge XE8640 GPU worker node
| Component | Details |
| --- | --- |
| Server model | PowerEdge XE8640 |
| CPU | 2 x Intel Xeon Platinum 8468 2.1 GHz, 48C/96T, 16 GT/s |
| Memory | 16 x 32 GB RDIMM, 4800 MT/s Dual Rank |
| Storage | 2 x 1.92 TB Enterprise NVMe Read Intensive AG Drive U.2 Gen4 with carrier |
| PXE network | Broadcom 5720 Dual Port 1 GbE Optional LOM |
| K8S/Storage network | |
| GPU | 4 x NVIDIA H100 SXM |
The following table shows a recommended configuration for a PowerEdge XE9680 GPU worker node.
Table 7. PowerEdge XE9680 GPU worker node
| Component | Details |
| --- | --- |
| Server model | PowerEdge XE9680 |
| CPU | 2 x Intel Xeon Platinum 8470 2.0 GHz, 52C/104T |
| Memory | 16 x 64 GB RDIMM, 4800 MT/s Dual Rank |
| Storage | 2 x 1.92 TB Enterprise NVMe Read Intensive AG Drive U.2 Gen4 with carrier |
| PXE network | Broadcom 5720 Dual Port 1 GbE Optional LOM |
| K8S/Storage network | |
| GPU | 8 x NVIDIA H100 SXM |
The XE9680 worker node configuration includes twice the CPU memory of the XE8640 configuration because it hosts twice as many GPUs; the higher overall inferencing capacity drives a correspondingly higher CPU memory requirement.
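A quick calculation, using only the Memory and GPU rows of Tables 6 and 7, confirms that the two SXM configurations keep the same CPU-memory-to-GPU ratio:

```python
# CPU memory per GPU for the two SXM worker node configurations,
# taken from the Memory and GPU rows of Tables 6 and 7.
nodes = {
    "XE8640": {"dimms": 16, "dimm_gb": 32, "gpus": 4},
    "XE9680": {"dimms": 16, "dimm_gb": 64, "gpus": 8},
}

for name, n in nodes.items():
    total_gb = n["dimms"] * n["dimm_gb"]
    per_gpu = total_gb / n["gpus"]
    print(f"{name}: {total_gb} GB total, {per_gpu:.0f} GB per GPU")
# XE8640: 512 GB total, 128 GB per GPU
# XE9680: 1024 GB total, 128 GB per GPU
```

Both configurations provide 128 GB of CPU memory per GPU, so doubling the GPU count doubles the total memory requirement.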
While inferencing tasks primarily rely on the GPUs and do not significantly tax the CPU and memory, we recommend equipping the system with high-performance CPUs and larger memory capacities. This provisioning leaves headroom for data processing, machine learning operations, monitoring, and logging, ensuring that these functions never disrupt the critical inferencing operations running on the GPUs.
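One way to enforce such headroom on Kubernetes worker nodes is the kubelet's `systemReserved` and `kubeReserved` settings, which withhold CPU and memory from the pool allocatable to pods. The reservation values below are illustrative assumptions, not validated sizing guidance:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:        # held back for OS daemons, monitoring, and logging agents
  cpu: "2"             # example value; tune for your workload
  memory: 8Gi
kubeReserved:          # held back for the kubelet and container runtime
  cpu: "1"
  memory: 4Gi
```

With these reservations in place, inference pods cannot consume the CPU and memory that system services depend on, regardless of cluster load.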
The following figure shows the network architecture, including the network connectivity for the compute servers and the three PowerEdge head nodes that host NVIDIA Base Command Manager Essentials and the Kubernetes control plane.
Figure 3. Network architecture
This validated design requires the following networks to manage the cluster and facilitate communication and coordination between different components and nodes within the cluster: