Large model training is the most compute-intensive workload of the three use cases, with the largest models requiring data centers with large numbers of GPUs to train a model in a few months. The minimum configuration for training is eight PowerEdge XE9680 servers, each with eight NVIDIA H100 GPUs. Training the largest models requires expanding the cluster to 16x, 32x, or even larger configurations.
Design considerations for large model training include:
- Large generative AI models have significant compute requirements for training. According to OpenAI, GPT-3 with 175 billion parameters has a model size of approximately 350 GB, and it would take about 355 years to train GPT-3 on a single NVIDIA Tesla V100 GPU, compared to roughly 34 days on 1,024 NVIDIA A100 GPUs (see the back-of-the-envelope compute estimate after this list).
- During training, the model has a considerable memory footprint that does not fit in a single GPU; therefore, you must split the model across multiple GPUs using model-parallel techniques (see the memory estimate after this list).
- The combination of model size, parallelism techniques for performance, and the size of the working dataset requires high communication throughput between GPUs, thus benefitting from PowerEdge XE9680 servers with eight NVIDIA GPUs fully connected to each other by NVIDIA NVLink and NVIDIA NVSwitch.
- During the training phase, there is also a significant amount of information exchanged (for example, gradients and weights) between GPUs on different nodes; InfiniBand is required for optimized performance and throughput (a minimal communication sketch follows this list).
- The QM9700 InfiniBand switch has 64 NDR (400 Gb/s) ports. Therefore, 24 PowerEdge XE9680 server nodes in this cluster fill the ports on the QM9700 in the InfiniBand module. To scale beyond this, add InfiniBand modules in a fat-tree network topology.
- As you add PowerEdge XE9680 server nodes to your cluster, expand the PowerScale storage and its switches appropriately to meet the input/output performance requirements.
- Checkpointing is a standard technique used in large model training. The size of the checkpoints depends on the size and dimensions of the model and on the pipeline parallelism used in training (see the checkpoint sizing sketch after this list).
- Four Dell PowerScale F600 Prime storage platforms deliver 8 GB/s write and 40 GB/s read throughput, and performance scales linearly as nodes are added.
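
The training-time figures above can be sanity-checked with a back-of-the-envelope calculation. The sketch below is illustrative only; the ~6 FLOPs per parameter per token rule of thumb, the 300 billion training tokens, and the sustained per-GPU throughput numbers are assumptions, not figures from this design.

```python
# Back-of-the-envelope training-time estimate for a GPT-3-scale model.
# Assumed (not from the design guide): ~6 FLOPs per parameter per training
# token, ~300B training tokens, and sustained (not peak) throughput of
# roughly 28 TFLOPS on one V100 and ~100 TFLOPS per A100 at scale.

PARAMS = 175e9                       # GPT-3 parameter count
TOKENS = 300e9                       # assumed training tokens
TOTAL_FLOPS = 6 * PARAMS * TOKENS    # ~3.15e23 FLOPs

V100_SUSTAINED = 28e12               # assumed sustained FLOPS, single V100
A100_SUSTAINED = 100e12              # assumed sustained FLOPS per A100

years_single_v100 = TOTAL_FLOPS / V100_SUSTAINED / (365 * 24 * 3600)
days_1024_a100 = TOTAL_FLOPS / (A100_SUSTAINED * 1024) / (24 * 3600)

print(f"Single V100: ~{years_single_v100:.0f} years")    # ~357 years
print(f"1,024 A100s: ~{days_1024_a100:.0f} days")        # ~36 days
```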
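
To see why a 175B-parameter model cannot be trained on a single GPU, the following rough estimate assumes a common mixed-precision Adam recipe of about 16 bytes of model and optimizer state per parameter; the byte count is an assumption and excludes activations and temporary buffers.

```python
# Rough training memory footprint for a 175B-parameter model under
# mixed-precision Adam: ~16 bytes per parameter (FP16 weights and
# gradients, plus FP32 master weights, momentum, and variance).

PARAMS = 175e9
BYTES_PER_PARAM = 16                     # assumed optimizer/precision recipe
GPU_HBM_BYTES = 80e9                     # HBM capacity of one NVIDIA H100 GPU

state_bytes = PARAMS * BYTES_PER_PARAM   # ~2.8 TB of model and optimizer state
min_gpus = state_bytes / GPU_HBM_BYTES   # lower bound, ignoring activations

print(f"Model + optimizer state: ~{state_bytes / 1e12:.1f} TB")
print(f"Minimum GPUs just to hold that state: ~{min_gpus:.0f}")
```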
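
The intra-node (NVLink/NVSwitch) and inter-node (InfiniBand) traffic described above is typically carried by NCCL collectives. The minimal PyTorch sketch below issues a single all-reduce, standing in for the gradient exchange that happens every training step; the torchrun launch parameters and the 1 GiB bucket size are illustrative assumptions.

```python
# Minimal sketch of the collective traffic described above: an NCCL
# all-reduce that runs over NVLink/NVSwitch inside a node and over
# InfiniBand between nodes. Example launch (placeholders, adjust to taste):
#   torchrun --nnodes=2 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")       # NCCL selects NVLink/IB paths
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in for a gradient bucket exchanged every training step (~1 GiB FP32).
    grads = torch.randn(256 * 1024 * 1024, device="cuda")
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print(f"all-reduce complete across {dist.get_world_size()} GPUs")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```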
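
Checkpoint size and write time can be estimated in the same back-of-the-envelope style. The ~14 bytes per parameter of checkpointed state is an assumption (FP16 weights plus FP32 Adam state); the 8 GB/s aggregate write throughput is taken from the PowerScale figure above.

```python
# Illustrative checkpoint sizing and write-time estimate. Actual checkpoint
# size depends on the model dimensions and the parallelism layout used.

PARAMS = 175e9
BYTES_PER_PARAM = 14          # assumed: 2 (FP16 weights) + 12 (FP32 Adam state)
WRITE_THROUGHPUT = 8e9        # 8 GB/s aggregate write from the bullet above

checkpoint_bytes = PARAMS * BYTES_PER_PARAM
write_seconds = checkpoint_bytes / WRITE_THROUGHPUT

print(f"Checkpoint size: ~{checkpoint_bytes / 1e12:.2f} TB")
print(f"Time to write at 8 GB/s: ~{write_seconds / 60:.0f} minutes")
```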