The solution approach for this design is to train a popular LLM to demonstrate both GPU and storage scaling. To keep the design applicable to today's GenAI workflows, the Llama 2 model architecture was chosen at the 7B and 70B parameter counts.
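The choice of 7B and 70B parameter counts matters for sizing: a back-of-envelope estimate (an illustrative sketch, not taken from this paper) using the common rule of thumb of ~16 bytes per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 master weights, momentum, and variance) shows why the larger model must be sharded across many GPUs:

```python
# Back-of-envelope GPU memory estimate for mixed-precision Adam training.
# Illustrative only: ignores activations, KV caches, and framework overhead.
# Rule of thumb: ~16 bytes per parameter =
#   2 B fp16 weights + 2 B fp16 gradients
#   + 4 B fp32 master weights + 4 B momentum + 4 B variance

def training_state_gib(params_billions: float, bytes_per_param: int = 16) -> float:
    """Return the approximate model + optimizer state size in GiB."""
    return params_billions * 1e9 * bytes_per_param / 2**30

for size in (7, 70):
    gib = training_state_gib(size)
    # A single 80 GB H100 holds ~74.5 GiB, so even the 7B state is tight
    # on one GPU, and the 70B state must be distributed across many GPUs.
    print(f"Llama 2 {size}B: ~{gib:,.0f} GiB of training state")
```

This is the core motivation for multi-GPU, multi-node training in this design.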
| Component | Configuration 1 | Configuration 2 |
|---|---|---|
| Compute server for model customization | 1 x PowerEdge XE9680 server | 6 x PowerEdge XE9680 servers |
| GPUs per server | 8 x NVIDIA H100 SXM GPUs | 8 x NVIDIA H100 SXM GPUs |
| Ethernet network adapters | 2 x NVIDIA ConnectX-6 DX dual-port 100 GbE | 2 x NVIDIA ConnectX-6 DX dual-port 100 GbE |
| Ethernet network switch | 2 x PowerSwitch S5232F-ON | 2 x PowerSwitch S5232F-ON |
| InfiniBand network adapter | 4 x NVIDIA ConnectX-7 single-port NDR OSFP PCIe | 4 x NVIDIA ConnectX-7 single-port NDR OSFP PCIe |
| InfiniBand network switch | NVIDIA QM9790 | NVIDIA QM9790 |
| PowerScale F600 cluster | 3 x PowerScale F600 performance-optimized nodes | 3 x PowerScale F600 performance-optimized nodes |
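To put the two configurations in perspective, the following sketch computes total GPU count and aggregate InfiniBand bandwidth per configuration. It assumes the NDR line rate of 400 Gb/s per ConnectX-7 port; the per-server counts come from the table above:

```python
# Illustrative scale arithmetic for the two configurations in the table.
# Assumes 400 Gb/s per NDR InfiniBand port (the NDR line rate).

NDR_GBPS = 400  # per single-port ConnectX-7 NDR adapter

def cluster_scale(servers: int, gpus_per_server: int = 8, ib_ports: int = 4):
    """Return (total GPUs, aggregate IB bandwidth in Gb/s) for a configuration."""
    return servers * gpus_per_server, servers * ib_ports * NDR_GBPS

for name, servers in (("Configuration 1", 1), ("Configuration 2", 6)):
    gpus, gbps = cluster_scale(servers)
    print(f"{name}: {gpus} H100 GPUs, {gbps:,} Gb/s aggregate IB bandwidth")
```

Configuration 2 thus scales to 48 H100 GPUs with a six-fold increase in aggregate InfiniBand bandwidth over Configuration 1.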
Figure 4 shows the network architecture and connectivity for the PowerEdge training nodes, PowerScale storage, and the three control plane nodes that incorporate NVIDIA Base Command Manager Essentials and other software components.
The NVIDIA AI software stack is the primary software used in this design. NVIDIA enterprise software solutions are designed to give IT admins, data scientists, architects, and designers access to the tools they need to easily manage and optimize their accelerated systems.
NVIDIA AI Enterprise, the software layer of the NVIDIA AI platform, accelerates the data science pipeline and streamlines development and deployment of production AI, including generative AI, computer vision, speech AI, and more. This secure, stable, cloud-native platform of AI software includes over 100 frameworks, pretrained models, and tools that accelerate data processing, simplify model training and optimization, and streamline deployment.
| Component | Details |
|---|---|
| Operating system | Ubuntu 22.04.1 LTS |
| Cluster management | NVIDIA Base Command Manager Essentials 10.23.12 |
| Slurm cluster | Slurm 23.02.4 |
| AI framework | NVIDIA NeMo Framework v23.11 |
The solution design presented here is modular, and each of the components can be independently scaled depending on the customer’s workflow and application requirements.
The goal of this validation was not to train a model to convergence and produce a complete foundation model, but rather to train for a defined number of steps.
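Step-bounded training of this kind can be sketched as follows. This is a minimal illustration, not the NeMo implementation; NeMo's PyTorch Lightning-based trainer exposes the same idea through a maximum-steps setting in its trainer configuration:

```python
# Minimal sketch of step-bounded training: stop after a fixed number of
# steps regardless of convergence, as done in this validation.

def train_for_steps(step_fn, data_iter, max_steps: int):
    """Run step_fn over batches until max_steps is reached, then stop."""
    losses = []
    for step, batch in enumerate(data_iter):
        if step >= max_steps:
            break
        losses.append(step_fn(batch))
    return losses

# Usage: a dummy loss function, stopped after 5 steps even though the
# data iterator could supply 100 batches.
history = train_for_steps(lambda batch: 1.0 / (batch + 1), iter(range(100)), max_steps=5)
print(len(history))  # 5 steps recorded
```

Bounding the step count makes throughput comparisons between configurations repeatable without the cost of a full training run.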
The following list provides the details of our validation setup: