There are multiple considerations for the hardware infrastructure components of a generative AI training system, including high-performance compute and memory, high-speed networking, and scalable, high-capacity, low-latency storage.
The time to train a large model depends on many factors beyond the number of server nodes and the number and types of GPUs per node, such as the number of parameters, the numeric precision, the model architecture, and the algorithms and techniques used to train the model. On top of these model dependencies, the underlying AI frameworks and libraries may make training faster and more efficient with each new software release. We can be sure of one thing: model sizes are increasing, and even a set of eight XE9680 servers, each with eight NVIDIA H100 GPUs, can take up to four days to train a smaller 5B-parameter LLM.
Large models also have large memory requirements. For example, a 7B-parameter model needs 7 x 4 = 28GB of GPU memory just to store its parameters at FP32 precision, because a 32-bit floating-point value occupies 4 bytes. At FP16 precision, where each value occupies only 2 bytes, the same parameters need 14GB of GPU memory.
Larger models like GPT-3, with 175B parameters, would require 350GB of GPU memory just to load the model. The NVIDIA H100 GPU has only 80GB of memory, which means that a 175B model must be split across multiple GPUs.
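As a rough illustration of this arithmetic, the short Python sketch below computes the memory needed to hold the weights alone at FP32 and FP16 precision, and the minimum number of 80GB GPUs that could hold them. It deliberately ignores activations, gradients, and optimizer state, so real requirements are considerably higher.

import math

# Back-of-the-envelope GPU memory needed just to hold model weights.
# Illustrative only: activations, gradients, optimizer state, and
# framework overhead are excluded.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

def weight_memory_gb(num_params, precision):
    # Memory (GB) required to store the parameters alone.
    return num_params * BYTES_PER_PARAM[precision] / 1e9

def min_gpus_for_weights(num_params, precision, gpu_mem_gb=80):
    # Smallest number of GPUs whose combined memory can hold the weights.
    return math.ceil(weight_memory_gb(num_params, precision) / gpu_mem_gb)

print(weight_memory_gb(7e9, "fp32"))        # 28.0 GB
print(weight_memory_gb(7e9, "fp16"))        # 14.0 GB
print(weight_memory_gb(175e9, "fp16"))      # 350.0 GB
print(min_gpus_for_weights(175e9, "fp16"))  # 5 x 80GB GPUs, for the weights alone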
GPU memory is one of the most significant hurdles in training LLMs. State-of-the-art optimizers like Adam converge much faster than traditional stochastic gradient descent because they track first- and second-order moment estimates; however, to track these moments, Adam must keep two additional values for each parameter in the model, adding a 2X memory overhead. In addition, for deep learning models like transformers, the activations of all layers must stay in memory for backpropagation, so the memory cost of such models grows proportionally with the number of layers.
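The sketch below extends the same back-of-the-envelope accounting to training with Adam. Counting one gradient per parameter in addition to Adam's two moment buffers is an assumption beyond the text, and activations are still excluded.

def adam_training_memory_gb(num_params, bytes_per_value=4):
    # Rough per-parameter accounting when training with Adam at FP32.
    weights   = num_params * bytes_per_value  # model parameters
    gradients = num_params * bytes_per_value  # one gradient per parameter (assumption)
    adam_m    = num_params * bytes_per_value  # first-moment estimate kept by Adam
    adam_v    = num_params * bytes_per_value  # second-moment estimate kept by Adam
    return (weights + gradients + adam_m + adam_v) / 1e9

print(adam_training_memory_gb(7e9))  # ~112 GB before counting any activations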
Parallelism techniques such as model parallelism, where a language model is split over multiple GPUs with each GPU storing one or more layers, provide a huge improvement in the time to train the model. However, they bring the challenge of the data communication required between GPUs. Typically, those communications go over the PCIe bus, and sometimes over the even slower CPU-to-CPU NUMA interconnect, which may not meet the performance and throughput requirements of this kind of workload. That is why optimized GPU interconnects like NVIDIA NVLink are good solutions for linking GPU pairs in a machine, providing efficient communication when the model is split between two GPUs. When a model is split among four, six, or eight GPUs, NVIDIA NVSwitch comes into play: a system like the XE9680 with eight NVIDIA H100 SXM GPUs delivers all-to-all GPU communication between the GPUs within the box.
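As a minimal illustration of this kind of splitting, the PyTorch sketch below places the first half of a toy model on one GPU and the second half on another, so the activation tensor must cross the GPU interconnect on every forward pass. The layer sizes and two-GPU split are illustrative; production LLM training uses dedicated frameworks (for example, Megatron-LM or DeepSpeed) rather than hand-placed layers.

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the layers on GPU 0, second half on GPU 1.
        self.part1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # The activation tensor crosses the GPU interconnect here
        # (NVLink/NVSwitch when available, otherwise PCIe).
        x = self.part2(x.to("cuda:1"))
        return x

model = TwoGPUModel()
out = model(torch.randn(8, 4096))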
Once the model is scaled across a sufficient number of GPUs to satisfy the memory capacity requirements using model parallelism, data parallelism is used to further scale performance and reduce training time. This is typically accomplished in three steps. First, copies of the model are dispatched to each node (or group of nodes if model parallelism is combined with data parallelism). Then, the data is sharded and distributed to the n nodes or node groups. Finally, all results are aggregated for the backpropagation step with an all-reduce operation, which requires a low-latency, high-throughput connection between nodes. Data parallelism uses a collective communication library (such as NVIDIA's NCCL) to synchronize the transfer of the gradients.
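A minimal sketch of this pattern, using PyTorch DistributedDataParallel (DDP) with the NCCL backend, is shown below. Each process owns one GPU and a full model replica, and gradients are all-reduced during the backward pass. The model, script name, launch command, and data-sharding step are illustrative.

# Launch with, for example: torchrun --nproc_per_node=8 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")   # NCCL handles the all-reduce
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda()
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Each rank sees a different shard of the data (sharding logic omitted).
inputs = torch.randn(32, 4096).cuda()
loss = model(inputs).sum()
loss.backward()    # gradients are all-reduced across ranks here
optimizer.step()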
For distributed training, high-speed networking like InfiniBand, with its low latency and high bandwidth capabilities, significantly reduces the communication overhead between processors. This acceleration is essential for synchronizing large volumes of data and parameters efficiently, ensuring that the distributed components of the model can be updated at the lowest possible latency. NVIDIA’s GPUDirect RDMA further enhances data transfer efficiency by allowing direct, high-speed, low-latency transfers between GPUs across multiple nodes, bypassing the traditional multistep data copy process.
As mentioned, training large models takes a significant amount of time on a large number of GPUs, so there is a requirement for application-level checkpointing. If the training is disrupted for any reason, it can be restarted by restoring the last known good checkpoint. In the case of LLM training, a checkpoint is essentially a snapshot of the model parameters (weights) taken after a defined time period or a defined number of steps.
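The sketch below shows what application-level checkpointing can look like in PyTorch: the model and optimizer state are saved to shared storage at a chosen interval and restored on restart. The path and checkpoint layout are assumptions for illustration; LLM training frameworks typically provide their own sharded checkpoint formats.

import torch

CHECKPOINT_PATH = "/mnt/checkpoints/step_{step}.pt"  # assumed shared-storage path

def save_checkpoint(model, optimizer, step):
    # Snapshot model weights and optimizer state after a defined number of steps.
    torch.save(
        {"step": step,
         "model_state": model.state_dict(),
         "optimizer_state": optimizer.state_dict()},
        CHECKPOINT_PATH.format(step=step),
    )

def load_checkpoint(model, optimizer, path):
    # Restore the last known good checkpoint and resume from its step count.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]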
This is where high-performance storage comes into play: the size of each checkpoint file depends on the size of the model, and the number of checkpoint files depends on how many GPUs hold that model in memory during training. For example, a 175B-parameter model trained with tensor parallelism of 8 and pipeline parallelism of 8 means that 8 GPUs in each of 8 nodes work together on the model, so 64 GPUs together hold one copy of the model, each holding a unique shard. Each GPU writes one 40GB checkpoint file, and the model creates approximately 2.4TB of data across 64 files. Checkpoint writes are single-threaded from each GPU, and on restore the checkpoints are read simultaneously by all 64 GPUs.
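The arithmetic behind these numbers can be sketched as follows. The roughly 14 bytes per parameter is an assumption (for example, FP16 weights plus FP32 optimizer state) chosen to approximate the figures above.

params = 175e9
bytes_per_param = 14     # assumption: weights plus optimizer state per parameter
tp, pp = 8, 8
shards = tp * pp         # 64 checkpoint files, one per GPU holding a model shard

total_tb = params * bytes_per_param / 1e12           # ~2.45 TB per checkpoint
per_shard_gb = params * bytes_per_param / shards / 1e9   # ~38 GB per GPU file
print(shards, round(total_tb, 2), round(per_shard_gb, 1))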