The compute infrastructure is a critical component of the design, enabling efficient deployment of AI models. The PowerEdge XE9680 is a two-socket, 6U server that supports eight NVIDIA H100 accelerators, providing substantial GPU capacity for AI workloads.
In this design, PowerEdge XE9680 servers are configured as worker nodes in a cluster. Omnia, an open-source tool for deploying and managing clusters, is used to deploy the operating systems, while the remaining required software stack is configured manually.
The following table provides a recommended configuration for a PowerEdge XE9680 GPU compute node:
| Component | Compute nodes |
|---|---|
| Server model | 4x PowerEdge XE9680 |
| CPU | 2x Intel Xeon Platinum 8480+ processor (105 MB cache, 2.00 GHz) |
| Memory | 2 TB (32x 64 GB, 4800 MT/s) |
| Operating system | Ubuntu 22.04 |
| Storage | Local: 1.92 TB NVMe (MZ-WLR3T8B); Shared: 103 TB NFS mount (PowerScale F710) |
| Networking | GPU network: 8x NVIDIA Mellanox NDR 400 Gb/s; Storage network: 1x NVIDIA ConnectX-6 port set to Ethernet mode |
| GPU (accelerator) | 8x NVIDIA H100 SXM 80 GB Tensor Core GPU |
The CPU memory allocation in the PowerEdge XE9680 GPU compute node configuration must exceed the combined GPU memory footprint. With eight 80 GB GPUs, the aggregate GPU memory is 640 GB; we therefore recommend a minimum of 2 TB of total RAM. While LLM tasks primarily rely on GPUs and do not significantly tax the CPU and memory, it is advisable to equip the system with high-performance CPUs and larger memory capacities. This provisioning ensures sufficient capacity for data processing activities, machine learning operations, and monitoring and logging tasks. The objective is to ensure that the servers provide ample CPU and memory resources for these functions, preventing any potential disruptions to the critical AI operations on the GPUs.
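The sizing rule above can be expressed as a quick check. The following is a minimal sketch, not a Dell tool: the helper function name and constants are illustrative, with values taken from the recommended XE9680 configuration (8x 80 GB H100 GPUs, 2 TB system RAM).

```python
# Sizing rule: total system RAM should exceed the aggregate GPU memory,
# leaving headroom for data processing, monitoring, and logging tasks.

GPU_COUNT = 8
GPU_MEM_GB = 80          # NVIDIA H100 SXM, 80 GB each
SYSTEM_RAM_GB = 2048     # 2 TB total (32x 64 GB DIMMs)

def ram_exceeds_gpu_footprint(ram_gb: int, gpu_count: int, gpu_mem_gb: int) -> bool:
    """Return True when total system RAM exceeds the combined GPU memory."""
    return ram_gb > gpu_count * gpu_mem_gb

aggregate_gpu_gb = GPU_COUNT * GPU_MEM_GB  # 8 x 80 GB = 640 GB
print(f"Aggregate GPU memory: {aggregate_gpu_gb} GB")
print(f"2 TB RAM sufficient: {ram_exceeds_gpu_footprint(SYSTEM_RAM_GB, GPU_COUNT, GPU_MEM_GB)}")
```

For this configuration the check passes comfortably: 2048 GB of RAM is more than three times the 640 GB aggregate GPU memory, which is the headroom the paragraph above recommends.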