Generative AI models have been growing in computational requirements. Training a model from scratch is typically a resource-intensive endeavor and can take considerable time. For example, according to OpenAI, GPT-3 with 175B parameters has a model size of approximately 350 GB and would take 34 days to train on 1,024 NVIDIA A100 GPUs, or 355 years on a single V100 GPU.
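Figures like these can be sanity-checked with the widely used back-of-the-envelope rule that training compute is roughly 6 × parameters × training tokens. The token count, per-GPU throughput, and utilization below are illustrative assumptions, not figures from this paper:

```python
# Rough training-compute estimate using the common 6*N*D approximation
# (FLOPs ~ 6 x parameters x training tokens). All inputs below are
# illustrative assumptions for a GPT-3-scale run.

def training_days(params, tokens, n_gpus, flops_per_gpu, utilization):
    """Estimated wall-clock training time in days."""
    total_flops = 6 * params * tokens
    effective_rate = n_gpus * flops_per_gpu * utilization
    return total_flops / effective_rate / 86_400  # 86,400 seconds per day

# Assumed: 175B parameters, ~300B training tokens, 1,024 A100 GPUs at
# ~312 TFLOPS peak (BF16) and ~35% sustained utilization.
days = training_days(175e9, 300e9, 1024, 312e12, 0.35)
print(f"~{days:.0f} days")
```

With these assumptions the estimate lands in the low thirties of days, the same order of magnitude as the 34-day figure quoted above; real utilization varies widely with interconnect, batch size, and parallelism strategy.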
As another example, Databricks recently released DBRX, an open, general-purpose LLM with 132B parameters trained using next-token prediction. Training ran on 3,072 NVIDIA H100 GPUs and took three months to complete.
Clearly, training LLMs requires significant computational resources, including multiple GPUs and distributed training setups. Training times can vary depending on factors such as the size of the model, the complexity of the task, the size of the dataset, and the hardware infrastructure available. Therefore, the selection of these factors is an important consideration.
Creating a pre-trained LLM involves several stages, and the choices you make during planning affect both the training time and the performance of the resulting model. The considerations include the acquisition and preparation of the data, the selection of the model architecture, and the development and training of the model, including tokenization. These steps are explained below, followed by a discussion of the design of the infrastructure on which the training will take place.
The process of training for a large language model typically involves the following steps:
1. Data collection and preprocessing
2. Architecture selection
3. Initialization and parameterization
4. Objective function and training task definition
5. Training procedure
The next two steps are not within the scope of this document, but they are included here for full context:
6. Evaluation and fine-tuning
7. Model deployment
Overall, the process of pre-training for a large language model involves collecting and preprocessing data, selecting an appropriate architecture, defining the pre-training task, training the model, evaluating its performance, and deploying it for inference on new tasks.
Transformers are the default choice for NLP applications. Key elements to consider during model selection include the number of layers in the transformer blocks, the number of attention heads, the loss function, and the hyperparameters. The size and configuration of the model directly influence the compute time required to train it. During training, the model is presented with a sequence of tokens and trained to predict the next token in the sequence, adjusting its weights based on the difference between its predicted token and the actual token. This process is repeated millions of times, until the model reaches a target level of performance or additional training no longer improves its accuracy.
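The next-token prediction loop described above can be sketched in miniature. The example below, a deliberately tiny stand-in for a transformer, trains a single bigram logit table with cross-entropy loss and gradient descent on a toy corpus; full-scale LLM training follows the same predict-compare-update cycle at vastly larger scale:

```python
import numpy as np

# Minimal sketch of next-token prediction: a bigram model learns
# P(next token | current token) by minimizing cross-entropy loss.
# W[i] holds the logits for the token that follows token i.

rng = np.random.default_rng(0)
vocab_size = 8
W = rng.normal(scale=0.1, size=(vocab_size, vocab_size))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy corpus: a repeating sequence the model can memorize.
tokens = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]

lr = 0.5
for step in range(200):
    loss = 0.0
    for cur, nxt in zip(tokens, tokens[1:]):
        probs = softmax(W[cur])
        loss -= np.log(probs[nxt])   # cross-entropy against actual token
        grad = probs.copy()
        grad[nxt] -= 1.0             # dLoss/dlogits for softmax + CE
        W[cur] -= lr * grad          # weight update
print(f"final avg loss: {loss / (len(tokens) - 1):.3f}")
```

After training, the model predicts the sequence correctly: the highest-probability successor of token 0 is token 1, of token 3 is token 0, and so on. In a real LLM, the weight matrix is replaced by billions of transformer parameters and the updates are computed by backpropagation on an optimizer such as Adam.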
To decrease training time, different parallelism techniques can be used, such as data parallelism and model parallelism. Model parallelism can be implemented as sequence parallelism, pipeline parallelism, and tensor parallelism. Each of these techniques has unique workload characteristics and infrastructure requirements.
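The distinction between data and tensor parallelism can be illustrated for a single linear layer y = x·W. The sketch below simulates device boundaries with array splits; a real setup would use a distributed training framework (for example, PyTorch DDP/FSDP or Megatron-LM) rather than numpy:

```python
import numpy as np

# Contrast data parallelism and tensor (model) parallelism for one
# linear layer y = x @ W. "Devices" are simulated with array splits.

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))    # batch of 8 inputs, 16 features
W = rng.normal(size=(16, 32))   # layer weights

# Data parallelism: each device holds a FULL copy of W and processes a
# shard of the batch; outputs are concatenated along the batch axis
# (in real training, gradients are all-reduced across devices).
batch_shards = np.split(x, 2, axis=0)
y_dp = np.concatenate([shard @ W for shard in batch_shards], axis=0)

# Tensor parallelism: each device holds a COLUMN SLICE of W and sees the
# full batch; partial outputs are concatenated along the feature axis.
weight_slices = np.split(W, 2, axis=1)
y_tp = np.concatenate([x @ w for w in weight_slices], axis=1)

# Both strategies reproduce the single-device result.
assert np.allclose(y_dp, x @ W)
assert np.allclose(y_tp, x @ W)
```

Data parallelism scales with batch size and needs enough memory per device for the whole model, while tensor parallelism shards the model itself at the cost of extra communication per layer; large training runs typically combine both, along with pipeline parallelism across layers.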
When selecting an architecture for your LLM, several factors should be considered: