Selecting the appropriate server and network configuration for generative AI model customization is crucial to ensure that adequate resources are allocated for model training. This section provides example configurations for the management and compute workloads, along with the network architecture.
The following table provides the recommended minimum configuration for the management head node and the control plane node:
Table 2. PowerEdge R660 head node and control plane configuration
| Component | Head node and control plane nodes |
|---|---|
| Server model | 2 x PowerEdge R660 |
| CPU | 1 x Intel Xeon Gold 6438M 2.2G, 32C/64T |
| Memory | 8 x 16 GB DDR5 4800 MT/s RDIMM |
| Operating system | BOSS-N1 controller card with 2 x M.2 960 GB (RAID 1) |
| RAID controller | PERC H755 with rear load brackets |
| Storage | 2 x 3.84 TB SSD SAS RI 24 Gbps 512e 2.5 in Hot-Plug, AG Drive, 1 DWPD |
| PXE network (optional) | 1 x Broadcom 5720 Dual Port 1 GbE LOM |
| Frontend network | 1 x Broadcom 57414 Dual Port 10/25 GbE SFP28, OCP NIC 3.0 (optional); 1 x Broadcom 57508 Dual Port 100 GbE QSFP Adapter, PCIe Low Profile (recommended) |
Consider the following recommendations for the Omnia and Kubernetes control plane configuration:
The PowerEdge XE9680 server is a two-socket, 6U server designed specifically for AI workloads. It supports eight accelerators, making it well suited to machine learning and deep learning training and inference, and especially to training LLMs.
The PowerEdge XE9680 server with the AMD Instinct MI300X accelerator offers high-performance capabilities for enterprises seeking to unlock the value of their data and differentiate their business with customized LLMs. With eight MI300X accelerators, each providing 192 GB of HBM3 at 5.3 TB/s for a total coherent HBM3 capacity of 1.5 TB per server, and over 21 petaflops of FP16 performance, the PowerEdge XE9680 server enables enterprises to train larger models while reducing the data center footprint, lowering TCO, and gaining a competitive edge.
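As a quick sanity check of those per-server figures, the sketch below multiplies the published per-GPU numbers by the eight accelerators in the chassis. The ~2.61 PFLOPS peak FP16 per GPU (with structured sparsity) used to reproduce the 21-petaflop total is an assumption about how that figure is derived; AMD also quotes roughly half of that for dense FP16.

```python
# Back-of-the-envelope totals for one PowerEdge XE9680 with 8 x MI300X.
# The per-GPU peak FP16 value (~2.61 PFLOPS with structured sparsity) is an
# assumption used to reproduce the "over 21 petaflops" figure.

GPUS_PER_SERVER = 8
HBM3_PER_GPU_GB = 192             # HBM3 capacity per MI300X
HBM3_BW_PER_GPU_TBPS = 5.3        # HBM3 bandwidth per MI300X
PEAK_FP16_PER_GPU_PFLOPS = 2.61   # assumed sparse FP16 peak per GPU

total_hbm_tb = GPUS_PER_SERVER * HBM3_PER_GPU_GB / 1024
total_fp16_pflops = GPUS_PER_SERVER * PEAK_FP16_PER_GPU_PFLOPS

print(f"Coherent HBM3 per server: {total_hbm_tb:.1f} TB")           # -> 1.5 TB
print(f"Aggregate peak FP16:      {total_fp16_pflops:.1f} PFLOPS")  # -> ~20.9 PFLOPS
```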
The GPU-optimized servers act as worker nodes in a Kubernetes cluster. The number of servers depends on the size of the model, the customization method, and the end user's requirements for training time. Larger models, with their greater parameter counts, require servers with more GPUs and higher-bandwidth connectivity between them.
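To make that sizing relationship concrete, the following sketch applies a common rule of thumb for full fine-tuning with the Adam optimizer in mixed precision: roughly 16 bytes of GPU memory per parameter for weights, gradients, and optimizer states, plus headroom for activations and buffers. The byte counts, the 1.3x headroom factor, and the example model sizes are illustrative assumptions, not measured requirements.

```python
import math

# Rough sizing sketch: how many 8-GPU XE9680 worker nodes are needed to hold
# the training state for full fine-tuning with Adam in mixed precision.
# Assumed rule of thumb: 2 bytes/param (bf16 weights) + 2 bytes/param
# (gradients) + 12 bytes/param (fp32 optimizer states), with 1.3x headroom
# for activations, buffers, and fragmentation.

BYTES_PER_PARAM = 2 + 2 + 12   # weights + gradients + Adam optimizer states
HEADROOM = 1.3                 # assumed margin for activations and buffers
HBM_PER_GPU_GB = 192
GPUS_PER_NODE = 8

def nodes_needed(params_billion: float) -> int:
    """Estimate the minimum number of XE9680 nodes for a given model size."""
    state_gb = params_billion * BYTES_PER_PARAM * HEADROOM  # GB, since 1B params * 1 byte = 1 GB
    gpus = math.ceil(state_gb / HBM_PER_GPU_GB)
    return max(1, math.ceil(gpus / GPUS_PER_NODE))

for size in (7, 70, 180):
    print(f"{size}B parameters -> ~{nodes_needed(size)} XE9680 node(s)")
```

The estimate assumes the training state is fully sharded across GPUs (for example, with FSDP or DeepSpeed ZeRO-3). Parameter-efficient methods such as LoRA reduce the footprint dramatically, while long sequence lengths and large batch sizes increase activation memory beyond the assumed headroom.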
The following table provides a recommended configuration for a PowerEdge XE9680 GPU worker node:
Table 3. PowerEdge XE9680 GPU worker node
| Component | Details |
|---|---|
| Server model | PowerEdge XE9680 (minimum of 2) |
| CPU | 2 x Intel Xeon Platinum 8462Y+ 2.8G, 32C/64T, 16 GT/s, 60M Cache |
| Memory | 32 x 64 GB RDIMM, 4800 MT/s Dual Rank |
| Operating system | BOSS-N1 controller card with 2 x M.2 960 GB (RAID 1) |
| Storage | 2 x 3.84 TB Data Center NVMe Read Intensive AG Drive U.2 Gen4 |
| PXE network (optional) | Broadcom 5720 Dual Port 1 GbE LOM |
| Frontend network | 2 x Broadcom 57608 Dual Port 200G Q112 Adapter, PCIe Full Height (configured as 1 x 400G) |
| GPU (accelerator) | 8 x AMD Instinct MI300X, 192 GB, 750 W |
| Backend network | 8 x Broadcom 57608 Dual Port 200G Q112 Adapter, PCIe Full Height (configured as 1 x 400G) |
The CPU memory allocation in the PowerEdge XE9680 GPU worker node configuration must be greater than the combined GPU memory footprint; we therefore recommend a minimum of 2 TB of total system memory. Although LLM workloads primarily rely on the GPUs and do not significantly tax the CPU and system memory, it is advisable to equip the system with high-performance CPUs and ample memory capacity. This provisioning leaves sufficient headroom for data processing, machine learning operations, monitoring, and logging tasks, so that these functions never disrupt the critical AI operations running on the GPUs.
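A minimal check of that guidance against the Table 3 configuration (the DIMM and GPU counts below simply mirror the table):

```python
# Verify that host memory exceeds the combined GPU memory footprint
# for the Table 3 worker-node configuration.

DIMMS, DIMM_GB = 32, 64        # 32 x 64 GB RDIMMs
GPUS, HBM_GB = 8, 192          # 8 x MI300X with 192 GB HBM3 each

host_ram_gb = DIMMS * DIMM_GB  # 2048 GB (2 TB)
gpu_hbm_gb = GPUS * HBM_GB     # 1536 GB (1.5 TB)

assert host_ram_gb > gpu_hbm_gb, "host RAM should exceed total GPU HBM"
print(f"Host RAM {host_ram_gb} GB vs total GPU HBM {gpu_hbm_gb} GB "
      f"(headroom {host_ram_gb - gpu_hbm_gb} GB)")
```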
Dell Technologies Secured Component Verification (SCV) is a step in the Dell production process that provides assurance of product integrity from the time an order is fulfilled at the Dell factory until it is delivered to the end user. When a client or server product is built, a manifest of the installed components is generated, cryptographically signed by a Dell Certificate Authority, and stored securely in the system. When the product is received, customers can use the designated SCV validation application to verify that no unauthorized modifications have been made to the system components. For more information, see Dell Technologies Secured Component Verification.
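The SCV application itself is a Dell-provided tool, but the underlying idea, a factory-signed component manifest that is verified against the delivered hardware, can be illustrated generically. The sketch below is conceptual only: the Ed25519 key, manifest fields, and inventory values are hypothetical and do not reflect the actual SCV implementation or its certificate chain.

```python
# Conceptual illustration only: a factory signs a manifest of installed
# components, and the recipient verifies the signature and compares the
# manifest against the inventory it discovers. This is NOT the Dell SCV
# tool or its certificate handling; key type and fields are hypothetical.
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# "Factory" side: build and sign the component manifest.
factory_key = Ed25519PrivateKey.generate()
manifest = {"service_tag": "ABC1234",
            "components": ["BOSS-N1", "PERC H755", "8x MI300X"]}
payload = json.dumps(manifest, sort_keys=True).encode()
signature = factory_key.sign(payload)

# "Customer" side: verify the signature, then compare against what is
# actually installed in the received system.
public_key = factory_key.public_key()  # in practice, obtained via a trusted CA chain
try:
    public_key.verify(signature, payload)
except InvalidSignature:
    raise SystemExit("Manifest signature invalid: possible tampering")

discovered = ["BOSS-N1", "PERC H755", "8x MI300X"]  # hypothetical inventory scan
if discovered == manifest["components"]:
    print("Component inventory matches the signed manifest")
else:
    print("Mismatch detected: components changed since factory build")
```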