The compute infrastructure is a critical component of the design, enabling efficient deployment of AI models. The PowerEdge XE9680 is a two-socket, 6U server that supports eight NVIDIA H100 accelerators, providing substantial GPU capacity for AI workloads.
In this design, PowerEdge XE9680 servers are configured as worker nodes in a cluster. Omnia, an open-source tool for deploying and managing clusters, is used to deploy the operating systems, while the remaining required software stack is configured manually.
The following table provides a recommended configuration for a PowerEdge XE9680 GPU compute node:
| Component | Compute nodes |
|---|---|
| Server model | 4x PowerEdge XE9680 |
| CPU | 2x Intel Xeon Platinum 8480+ processor (105 MB cache, 2.00 GHz) |
| Memory | 2 TB (32x 64 GB, 4800 MT/s) |
| Operating system | Ubuntu 22.04 |
| Storage | Local: 1.92 TB NVMe (MZ-WLR3T8B); Shared: 103 TB NFS mount (PowerScale F710) |
| Networking | GPU network: 8x NVIDIA Mellanox NDR 400 Gb/s; Storage network: 1x NVIDIA ConnectX-6 port set to Ethernet mode |
| GPU (accelerator) | 8x NVIDIA H100 SXM 80 GB Tensor Core GPU |
The CPU memory allocation in the PowerEdge XE9680 GPU compute node configuration must exceed the combined GPU memory footprint. With eight 80 GB GPUs, the aggregate GPU memory is 640 GB; we therefore recommend a minimum of 2 TB of total RAM. While LLM tasks primarily rely on GPUs and do not significantly tax the CPU and memory, it is advisable to equip the system with high-performance CPUs and larger memory capacities. This provisioning ensures sufficient capacity for data processing activities, machine learning operations, and monitoring and logging tasks. The objective is to ensure that the servers provide ample CPU and memory resources for these functions, preventing any potential disruptions to the critical AI operations on the GPUs.
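The sizing rule above can be expressed as a quick check. The following is a minimal sketch, not a Dell tool: the helper function name and constants are illustrative, with values taken from the recommended XE9680 configuration (8x 80 GB H100 GPUs, 2 TB system RAM).

```python
# Sizing rule: total system RAM should exceed the aggregate GPU memory,
# leaving headroom for data processing, monitoring, and logging tasks.

GPU_COUNT = 8
GPU_MEM_GB = 80          # NVIDIA H100 SXM, 80 GB each
SYSTEM_RAM_GB = 2048     # 2 TB total (32x 64 GB DIMMs)

def ram_exceeds_gpu_footprint(ram_gb: int, gpu_count: int, gpu_mem_gb: int) -> bool:
    """Return True when total system RAM exceeds the combined GPU memory."""
    return ram_gb > gpu_count * gpu_mem_gb

aggregate_gpu_gb = GPU_COUNT * GPU_MEM_GB  # 8 x 80 GB = 640 GB
print(f"Aggregate GPU memory: {aggregate_gpu_gb} GB")
print(f"2 TB RAM sufficient: {ram_exceeds_gpu_footprint(SYSTEM_RAM_GB, GPU_COUNT, GPU_MEM_GB)}")
```

For this configuration the check passes comfortably: 2048 GB of RAM is more than three times the 640 GB aggregate GPU memory, which is the headroom the paragraph above recommends.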