LLMs are advanced natural language processing models that use deep learning techniques to understand and generate human language. Language models have been built on a range of architectures, from recurrent neural networks (RNNs) to transformers; modern LLMs are almost universally based on the transformer architecture. Generative Pre-trained Transformer (GPT) is a popular and influential family of LLMs built on the transformer architecture, a deep neural network design that handles sequential data efficiently. Transformers use self-attention mechanisms to process input sequences and learn contextual relationships between tokens, enabling them to generate coherent and contextually relevant language.
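To make the self-attention mechanism concrete, the following is a minimal sketch of single-head scaled dot-product attention using NumPy. The matrices and dimensions are illustrative placeholders, not taken from any particular model:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x: (seq_len, d_model) token embeddings.
    w_q, w_k, w_v: (d_model, d_k) projection matrices (random here,
    purely for illustration).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    # Scores measure how strongly each token attends to every other token.
    scores = q @ k.T / np.sqrt(d_k)
    # Row-wise softmax converts scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted mixture of all value vectors.
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
out = self_attention(x,
                     rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)))
print(out.shape)
```

Real transformer layers stack many such heads in parallel (multi-head attention) and interleave them with feed-forward layers, but the core contextual mixing is captured by this computation.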
Parameters in LLMs refer to the learnable components, or weights, of the neural network that makes up the model. These parameters determine how the model processes input data and makes predictions or generates output. GPT-style models typically have millions (M) to billions (B) of parameters. The parameters are learned during training, in which the model is exposed to vast amounts of data and adjusts its weights to model language. Assuming comparable model architecture and training data, models with more parameters are generally more accurate and capable. Models with more parameters also require more compute resources, especially GPU resources.
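As a rough illustration of where those parameter counts come from, the following sketch estimates the parameter count of a GPT-style decoder-only transformer from its configuration. The formula is an approximation (it ignores biases and layer norms), and the GPT-2 figures used to sanity-check it are that model's published configuration:

```python
def gpt_param_estimate(n_layer, d_model, vocab_size, n_ctx):
    """Approximate parameter count for a GPT-style decoder-only transformer.

    Per transformer block: ~4*d^2 weights for attention (Q, K, V, and
    output projections) plus ~8*d^2 for the MLP (d -> 4d -> d),
    giving roughly 12*d^2 per block. Biases and layer norms are ignored.
    """
    per_block = 12 * d_model ** 2
    # Token embedding table plus learned positional embeddings.
    embeddings = (vocab_size + n_ctx) * d_model
    return n_layer * per_block + embeddings

# Sanity check against GPT-2 small (12 layers, d_model=768,
# vocabulary 50,257, context 1,024), which has ~124 M parameters.
est = gpt_param_estimate(n_layer=12, d_model=768, vocab_size=50257, n_ctx=1024)
print(f"{est / 1e6:.1f} M")  # ~124 M
```

The quadratic dependence on `d_model` and the linear dependence on layer count show why scaling a model from millions to billions of parameters multiplies both memory footprint and GPU compute requirements.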
Accuracy of LLMs is typically measured by their performance on specific natural language processing (NLP) tasks, and the evaluation metrics used depend on the nature of the task. Commonly used benchmarks include the AI2 Reasoning Challenge (ARC), HellaSwag, and the Stanford Question Answering Dataset (SQuAD). Hugging Face maintains a leaderboard for open LLM models.
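As an example of task-specific metrics, SQuAD-style question answering is commonly scored with exact match (EM) and token-level F1. The following is a simplified sketch of those two metrics; the answer strings are invented examples, and the normalization is a pared-down version of the usual SQuAD preprocessing:

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))   # 1.0
print(round(f1_score("in Paris, France", "Paris"), 2))   # 0.5
```

Other benchmarks use different metrics, such as multiple-choice accuracy for ARC and HellaSwag, which is why reported "accuracy" figures are only comparable within the same benchmark.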
Publicly available NVIDIA NeMo, BLOOM, and Llama models are foundation models. While these models offer a strong starting point with general capabilities, they are typically customized for a specific use case. This design guide does not address model customization.
The following sections provide examples of several LLMs that we validated in this design. For more details about the model validation, see Table 6.
GPT-style models are decoder-only transformer models. NVIDIA NeMo GPT models are trained using the NeMo framework and are available in several parameter sizes, including 345 M, 1.3 B, 2 B, 5 B, and 20 B parameters. These models were trained on The Pile, an 825 GiB English text corpus curated by EleutherAI specifically for training LLMs. The 2 B parameter model, however, was trained on 1.1 T tokens spanning 53 languages and code.
BigScience Large Open-science Open-access Multilingual Language Model (BLOOM) is an autoregressive LLM developed collaboratively by over 1,000 researchers. It is based on a decoder-only transformer architecture. The model was trained on the ROOTS corpus, which draws from 46 natural languages and 13 programming languages and totals 1.61 TB of text. BLOOM is available in multiple parameter sizes, including 560 M, 1.1 B, 1.7 B, 3 B, 7.1 B, and 176 B.
Llama 2, developed by Meta and released in partnership with Microsoft, is freely available for research and commercial use. It offers a collection of pretrained models for generative text and fine-tuned models optimized for chat use cases. The Llama 2 models are trained on 2 T tokens and feature double the context length of Llama 1. The Llama 2-chat models have additionally been tuned with over 1 million human annotations. These models are built on an optimized transformer architecture and come in several parameter sizes, including 7 B, 13 B, and 70 B.