Validation system configurations
We validated our reference design with up to six PowerEdge XE9680 nodes. In most cases, pretraining a model with billions of parameters necessitates a significantly larger cluster, often comprising hundreds of GPUs. Our validation aims to demonstrate the feasibility of pretraining on Dell infrastructure equipped with NVIDIA GPUs and software. Furthermore, the cluster size we used for validation should adequately support pretraining models with billions of parameters.
The following tables list the hardware configurations and software components used for generative AI model training in this design.
| Component | Configuration 1 | Configuration 2 |
|---|---|---|
| Compute server for model customization | 6 x PowerEdge XE9680 servers | 6 x PowerEdge XE9680 servers |
| GPUs per server | 8 x NVIDIA H100 SXM GPUs | 8 x NVIDIA H100 SXM GPUs |
| Ethernet network adapters | 2 x NVIDIA ConnectX-6 DX dual-port 100 GbE | 2 x NVIDIA ConnectX-6 DX dual-port 100 GbE |
| Ethernet network switches | 2 x PowerSwitch S5232F-ON | 2 x PowerSwitch S5232F-ON |
| InfiniBand network adapters | 8 x NVIDIA ConnectX-7 single-port NDR OSFP PCIe | 4 x NVIDIA ConnectX-7 |
| InfiniBand network switch | QM9790 | QM9790 |
We tested three deployment sizes, consisting of two, four, and six PowerEdge XE9680 servers, and evaluated the two network configurations listed in the preceding table. The following table lists the software components:
| Component | Details |
|---|---|
| Operating system | Ubuntu 22.04.1 LTS |
| Cluster management | NVIDIA Base Command Manager Essentials 10.23.12 |
| Slurm cluster | Slurm 23.02.4 |
| AI framework | NVIDIA NeMo Framework v23.11 |
A Slurm cluster, managed by the Simple Linux Utility for Resource Management software, is a high-performance computing environment that efficiently schedules computing tasks across multiple nodes. This open-source system handles job scheduling, tracks resource availability, and prioritizes tasks based on user-defined requirements. It uses job queues and fairness mechanisms to ensure that higher-priority jobs are accommodated without starving lower-priority ones.
Slurm also offers access control features for user management and access policies, and it is designed to handle node failures gracefully, requeuing jobs to maintain efficiency. We used Slurm for LLM training because its scheduling capabilities, such as batch scheduling, preemption, and multiple queues, make it well suited to orchestrating long-running tasks such as LLM training.
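To illustrate how such a training job is submitted, the following is a minimal Slurm batch-script sketch for a six-node run on this class of hardware. The partition, container, and script paths are hypothetical placeholders rather than values from the validated design; the node and GPU counts reflect the six-node PowerEdge XE9680 configuration described above.

```shell
#!/bin/bash
# Hypothetical Slurm batch script for multi-node LLM pretraining.
# Job name, time limit, and paths are placeholders; adjust for your site.
#SBATCH --job-name=llm-pretrain
#SBATCH --nodes=6                # 6 x PowerEdge XE9680 nodes
#SBATCH --ntasks-per-node=8      # one task per H100 GPU
#SBATCH --gpus-per-node=8
#SBATCH --exclusive              # reserve whole nodes for the run
#SBATCH --time=72:00:00
#SBATCH --output=%x_%j.out

# srun launches 48 ranks (6 nodes x 8 GPUs); the training framework's
# distributed runtime coordinates them over InfiniBand.
srun python /path/to/pretraining_script.py \
    trainer.num_nodes="$SLURM_JOB_NUM_NODES" \
    trainer.devices=8
```

Submitting the script with `sbatch` places it in a queue, and Slurm starts it once six nodes become available, which is the batch-scheduling behavior described above.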