Validation system configurations
We validated our reference design with up to six PowerEdge XE9680 nodes. In most cases, pretraining a model with billions of parameters necessitates a significantly larger cluster, often comprising hundreds of GPUs. Our validation aims to demonstrate the feasibility of pretraining on Dell infrastructure equipped with NVIDIA GPUs and software. Furthermore, the cluster size we used for validation should adequately support pretraining models with billions of parameters.
The following tables list the hardware configurations and software components used for generative AI model training in this design.
| Component | Configuration 1 | Configuration 2 |
|---|---|---|
| Compute server for model customization | 6 x PowerEdge XE9680 servers | 6 x PowerEdge XE9680 servers |
| GPUs per server | 8 x NVIDIA H100 SXM GPUs | 8 x NVIDIA H100 SXM GPUs |
| Ethernet network adapters | 2 x NVIDIA ConnectX-6 DX dual-port 100 GbE | 2 x NVIDIA ConnectX-6 DX dual-port 100 GbE |
| Ethernet network switches | 2 x PowerSwitch S5232F-ON | 2 x PowerSwitch S5232F-ON |
| InfiniBand network adapters | 8 x NVIDIA ConnectX-7 single-port NDR OSFP PCIe | 4 x NVIDIA ConnectX-7 |
| InfiniBand network switch | QM9790 | QM9790 |
We tested three deployment sizes, consisting of two, four, and six PowerEdge XE9680 servers, and evaluated the two network configurations listed in the preceding table. The following table lists the software components:
| Component | Details |
|---|---|
| Operating system | Ubuntu 22.04.1 LTS |
| Cluster management | NVIDIA Base Command Manager Essentials 10.23.12 |
| Slurm cluster | Slurm 23.02.4 |
| AI framework | NVIDIA NeMo Framework v23.11 |
A Slurm cluster, managed by the Simple Linux Utility for Resource Management software, is a high-performance computing environment that efficiently schedules computing tasks across multiple nodes. This open-source system handles job scheduling, tracks resource availability, and prioritizes tasks based on user-defined requirements. It uses job queues and fairness mechanisms to ensure that higher-priority jobs are accommodated without starving lower-priority ones.
Slurm also offers access control features for user management and access policies, and it is designed to handle node failures gracefully, requeuing jobs to maintain efficiency. We used Slurm for LLM training because its scheduling capabilities, such as batch scheduling, preemption, and multiple queues, make it well suited to orchestrating long-running tasks such as LLM training.
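To illustrate how such a training job is submitted, the following is a minimal Slurm batch-script sketch for a six-node run on this class of hardware. The partition, container, and script paths are hypothetical placeholders rather than values from the validated design; the node and GPU counts reflect the six-node PowerEdge XE9680 configuration described above.

```shell
#!/bin/bash
# Hypothetical Slurm batch script for multi-node LLM pretraining.
# Job name, time limit, and paths are placeholders; adjust for your site.
#SBATCH --job-name=llm-pretrain
#SBATCH --nodes=6                # 6 x PowerEdge XE9680 nodes
#SBATCH --ntasks-per-node=8      # one task per H100 GPU
#SBATCH --gpus-per-node=8
#SBATCH --exclusive              # reserve whole nodes for the run
#SBATCH --time=72:00:00
#SBATCH --output=%x_%j.out

# srun launches 48 ranks (6 nodes x 8 GPUs); the training framework's
# distributed runtime coordinates them over InfiniBand.
srun python /path/to/pretraining_script.py \
    trainer.num_nodes="$SLURM_JOB_NUM_NODES" \
    trainer.devices=8
```

Submitting the script with `sbatch` places it in a queue, and Slurm starts it once six nodes become available, which is the batch-scheduling behavior described above.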