Generative AI models have been growing in computational requirements. Training a model from scratch is typically a resource-intensive endeavor and can take considerable time. For example, according to OpenAI, GPT-3 with 175B parameters has a model size of approximately 350 GB and would take 34 days to train on 1,024 NVIDIA A100 GPUs, or 355 years on a single V100 GPU.
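Figures like these can be sanity-checked with the widely used back-of-the-envelope rule that training compute is roughly 6 × parameters × training tokens. The token count, per-GPU throughput, and utilization below are illustrative assumptions, not figures from this paper:

```python
# Rough training-compute estimate using the common 6*N*D approximation
# (FLOPs ~ 6 x parameters x training tokens). All inputs below are
# illustrative assumptions for a GPT-3-scale run.

def training_days(params, tokens, n_gpus, flops_per_gpu, utilization):
    """Estimated wall-clock training time in days."""
    total_flops = 6 * params * tokens
    effective_rate = n_gpus * flops_per_gpu * utilization
    return total_flops / effective_rate / 86_400  # 86,400 seconds per day

# Assumed: 175B parameters, ~300B training tokens, 1,024 A100 GPUs at
# ~312 TFLOPS peak (BF16) and ~35% sustained utilization.
days = training_days(175e9, 300e9, 1024, 312e12, 0.35)
print(f"~{days:.0f} days")
```

With these assumptions the estimate lands in the low thirties of days, the same order of magnitude as the 34-day figure quoted above; real utilization varies widely with interconnect, batch size, and parallelism strategy.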
As another example, Databricks recently released DBRX, an open, general-purpose LLM with 132B parameters trained using next-token prediction. Training ran on 3,072 NVIDIA H100 GPUs and took three months to complete.
Clearly, training LLMs requires significant computational resources, including multiple GPUs and distributed training setups. Training times can vary depending on factors such as the size of the model, the complexity of the task, the size of the dataset, and the hardware infrastructure available. Therefore, the selection of these factors is an important consideration.
Creating a pre-trained LLM involves several stages, and the choices you make during planning affect both the training time and the performance of the resulting model. The considerations include the acquisition and preparation of the data, the selection of the model architecture, and the development and training of the model, including tokenization. These steps are explained below, followed by a discussion of the design of the infrastructure on which the training will take place.
The process of training for a large language model typically involves the following steps:
1. Data collection and preprocessing
2. Architecture selection
3. Initialization and parameterization
4. Objective function and training task definition
5. Training procedure
The next two steps are not within the scope of this document, but they are included here for full context:
6. Evaluation and fine-tuning
7. Model deployment
Overall, the process of pre-training for a large language model involves collecting and preprocessing data, selecting an appropriate architecture, defining the pre-training task, training the model, evaluating its performance, and deploying it for inference on new tasks.
Transformers are the default choice for NLP applications. Key elements to consider during model selection include the number of layers in the transformer blocks, the number of attention heads, the loss function, and the hyperparameters. The size and configuration of the model directly influence the compute time required to train it. During training, the model is presented with a sequence of tokens and trained to predict the next token in the sequence, adjusting its weights based on the difference between its predicted token and the actual token. This process is repeated millions of times, until the model reaches a target level of performance or additional training no longer improves its accuracy.
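The next-token prediction loop described above can be sketched in miniature. The example below, a deliberately tiny stand-in for a transformer, trains a single bigram logit table with cross-entropy loss and gradient descent on a toy corpus; full-scale LLM training follows the same predict-compare-update cycle at vastly larger scale:

```python
import numpy as np

# Minimal sketch of next-token prediction: a bigram model learns
# P(next token | current token) by minimizing cross-entropy loss.
# W[i] holds the logits for the token that follows token i.

rng = np.random.default_rng(0)
vocab_size = 8
W = rng.normal(scale=0.1, size=(vocab_size, vocab_size))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy corpus: a repeating sequence the model can memorize.
tokens = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]

lr = 0.5
for step in range(200):
    loss = 0.0
    for cur, nxt in zip(tokens, tokens[1:]):
        probs = softmax(W[cur])
        loss -= np.log(probs[nxt])   # cross-entropy against actual token
        grad = probs.copy()
        grad[nxt] -= 1.0             # dLoss/dlogits for softmax + CE
        W[cur] -= lr * grad          # weight update
print(f"final avg loss: {loss / (len(tokens) - 1):.3f}")
```

After training, the model predicts the sequence correctly: the highest-probability successor of token 0 is token 1, of token 3 is token 0, and so on. In a real LLM, the weight matrix is replaced by billions of transformer parameters and the updates are computed by backpropagation on an optimizer such as Adam.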
To decrease training time, different parallelism techniques can be used, such as data parallelism and model parallelism. Model parallelism can be implemented as sequence parallelism, pipeline parallelism, and tensor parallelism. Each of these techniques has unique workload characteristics and infrastructure requirements.
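The distinction between data and tensor parallelism can be illustrated for a single linear layer y = x·W. The sketch below simulates device boundaries with array splits; a real setup would use a distributed training framework (for example, PyTorch DDP/FSDP or Megatron-LM) rather than numpy:

```python
import numpy as np

# Contrast data parallelism and tensor (model) parallelism for one
# linear layer y = x @ W. "Devices" are simulated with array splits.

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))    # batch of 8 inputs, 16 features
W = rng.normal(size=(16, 32))   # layer weights

# Data parallelism: each device holds a FULL copy of W and processes a
# shard of the batch; outputs are concatenated along the batch axis
# (in real training, gradients are all-reduced across devices).
batch_shards = np.split(x, 2, axis=0)
y_dp = np.concatenate([shard @ W for shard in batch_shards], axis=0)

# Tensor parallelism: each device holds a COLUMN SLICE of W and sees the
# full batch; partial outputs are concatenated along the feature axis.
weight_slices = np.split(W, 2, axis=1)
y_tp = np.concatenate([x @ w for w in weight_slices], axis=1)

# Both strategies reproduce the single-device result.
assert np.allclose(y_dp, x @ W)
assert np.allclose(y_tp, x @ W)
```

Data parallelism scales with batch size and needs enough memory per device for the whole model, while tensor parallelism shards the model itself at the cost of extra communication per layer; large training runs typically combine both, along with pipeline parallelism across layers.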
When selecting an architecture for your LLM, several factors should be considered: