Several architectural concepts related to generative AI inferencing are relevant to this design, including the characteristics of large language models (LLMs) and examples of specific models.
LLMs are advanced natural language processing models that use deep learning techniques to understand and generate human language. Language models have been built on a range of neural network architectures, such as recurrent neural networks (RNNs) and transformers, and most current LLMs are transformer-based. Generative Pre-trained Transformer (GPT) is a popular and influential example of an LLM that is based on the transformer architecture, a deep neural network architecture designed to handle sequential data efficiently. Transformers use self-attention mechanisms to process input sequences and learn contextual relationships between words, enabling them to generate coherent and contextually relevant language.
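The self-attention idea described above can be illustrated with a minimal sketch of scaled dot-product attention. It is a simplification under stated assumptions: the queries, keys, and values are taken to be the token embeddings themselves, whereas a real transformer first applies learned projection matrices and uses multiple attention heads. The function names and the example embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product self-attention with identity projections.

    Assumption: queries, keys, and values are the embeddings themselves.
    A real transformer applies learned projection matrices first.
    """
    dim = len(embeddings[0])
    outputs = []
    for query in embeddings:
        # Score this token against every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, embeddings))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Hypothetical 2-dimensional embeddings for a three-token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(tokens)
```

Each row of `contextual` blends information from every token in the sequence, weighted by similarity, which is how the mechanism captures contextual relationships between words.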
A foundation model, also called a pre-trained model, is a machine learning model that has been trained on a large dataset before it is fine-tuned or adapted for a more specialized task. These models are typically trained on vast amounts of general data to learn basic features, patterns, and context within the data.
Foundation models are crucial because they provide a starting point that already understands a broad range of concepts and language patterns. This starting point makes customizing and fine-tuning for specific tasks much more effective and efficient. While this design is focused on inferencing using existing foundation models, a subsequent design will address model customization, including fine-tuning and other methods.
Parameters in LLMs are the learnable weights of the neural network that makes up the model. These parameters determine how the model processes input data and makes predictions or generates output. Typically, GPTs have millions (M) to billions (B) of parameters. The parameters are learned during the training process, in which the model is exposed to vast amounts of data and adjusts its parameters to generate language. Assuming that the model architecture and training data are comparable, a model with more parameters is generally more accurate and more capable, although a smaller model that is trained for a particular outcome can sometimes be more accurate for that outcome. Models with more parameters also require more compute resources, especially GPU resources, so choosing a model involves balancing accuracy against resource requirements.
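The relationship between parameter count and GPU resources can be approximated with simple arithmetic. The following sketch estimates the memory needed to hold the model weights alone, assuming 2 bytes per parameter (16-bit precision). It deliberately excludes activations, the key-value cache, and framework overhead, so actual requirements are higher. The function name is hypothetical, and the parameter counts are illustrative sizes taken from models mentioned in this guide.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Estimate memory for model weights only, in gigabytes (10**9 bytes).

    Assumption: 2 bytes per parameter corresponds to FP16 or BF16 weights.
    Activations, the key-value cache, and framework overhead are excluded.
    """
    return num_parameters * bytes_per_parameter / 1e9


# Illustrative parameter counts: 345 M, 20 B, and 70 B.
for name, params in [("345 M", 345e6), ("20 B", 20e9), ("70 B", 70e9)]:
    estimate = weight_memory_gb(params)
    print(f"{name} parameters: about {estimate:.1f} GB of weights at 16-bit precision")
```

An estimate of this kind indicates whether a model's weights can fit in the memory of a single GPU or must be split across several.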
The accuracy of LLMs is typically measured by their performance on specific natural language processing (NLP) tasks. The evaluation metrics used depend on the nature of the task. Commonly used benchmarks for evaluating LLMs include the AI2 Reasoning Challenge (ARC), HellaSwag, and the Stanford Question Answering Dataset (SQuAD). Hugging Face maintains a leaderboard for open LLMs.
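As an illustration of a task-specific metric, question-answering benchmarks such as SQuAD are commonly scored with exact match and token-level F1. The following is a minimal sketch in that style, not the official evaluation script; the function names are hypothetical and the normalization rules are simplified assumptions.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """True when the normalized answers are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Count tokens shared by both answers, respecting duplicates.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match rewards only fully correct answers, while F1 gives partial credit when a generated answer overlaps the reference.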
Publicly available NVIDIA NeMo, BLOOM, and Llama models are foundation models. While these models offer a strong starting point with general capabilities, they are typically customized for specific use. This design guide does not address model customization.
Tokenization in generative AI refers to the process of breaking down a piece of text into smaller units, called tokens. These tokens can be words, subwords, or even characters, depending on the granularity chosen for the tokenization process.
In NLP tasks, tokenization is a critical step because it transforms continuous text into discrete units that machine learning models can process. By segmenting text into tokens, the model gains a structured representation of the input that it can then analyze, understand, and use to generate responses. The choice of tokenization strategy (word-level, subword-level, or character-level) and the specific tokenizer used can significantly impact the model's performance on various tasks.
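The three granularities can be compared with a small sketch. The subword splitter here uses greedy longest-match against a tiny hypothetical vocabulary. This is a simplified stand-in, not the tokenizer of any model named in this guide; production tokenizers learn their vocabularies from data, for example with byte-pair encoding.

```python
def word_tokens(text):
    """Word-level: split on whitespace."""
    return text.split()


def char_tokens(text):
    """Character-level: every character is a token."""
    return list(text)


def subword_tokens(word, vocabulary):
    """Greedy longest-match split of one word into vocabulary pieces.

    Falls back to a single character when no longer piece matches.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one character long.
        while end > start + 1 and word[start:end] not in vocabulary:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabulary chosen only for this illustration.
VOCAB = {"token", "ization", "infer", "encing"}
text = "tokenization inferencing"

print(word_tokens(text))
print([subword_tokens(word, VOCAB) for word in word_tokens(text)])
print(len(char_tokens(text)), "character-level tokens")
```

Subword tokenization sits between the other two strategies: it keeps the vocabulary small while still representing common word fragments as single tokens.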
We have validated several foundation models for inferencing with this infrastructure design, including most of the NeMo GPT models (ranging from 345 M to 20 B parameters) with Triton Inference Server. We also ran inferencing on BLOOM, Llama, and Stable Diffusion models on standard Python or PyTorch containers that are available from the NVIDIA NGC Catalog. For more information about the models we explicitly validated, see Table 9 in Validation Results.
GPT-style models are decoder-only transformer models. NVIDIA NeMo GPT models are trained using the NeMo framework. Several NeMo GPT models with varying parameter sizes exist, including 345 M, 1.3 B, 2 B, 5 B, and 20 B parameters. These models were trained on the Pile.
BigScience Large Open-science Open-access Multilingual Language Model (BLOOM) is an autoregressive LLM developed collaboratively by over 1000 researchers. It is based on a decoder-only transformer architecture. The model was trained on the ROOTS corpus, which consists of sources from 46 natural and 13 programming languages, totaling 1.61 TB of text. BLOOM offers multiple models with different parameter sizes, including 560 M, 1 B, 3 B, 7 B, and 176 B.
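Autoregressive generation, which decoder-only models such as BLOOM use, produces one token at a time and feeds each new token back in as context for the next step. The following sketch shows the loop with greedy decoding. The lookup table is a hypothetical stand-in for a trained model and conditions only on the last token, whereas a real decoder-only transformer conditions on the entire preceding sequence.

```python
def next_token_scores(context):
    """Hypothetical stand-in for a trained model.

    Returns a score for each candidate next token. This toy version
    looks only at the last token; a real model uses the full context.
    """
    table = {
        "<s>": {"the": 0.9, "model": 0.1},
        "the": {"model": 0.8, "text": 0.2},
        "model": {"generates": 0.7, "</s>": 0.3},
        "generates": {"text": 0.9, "</s>": 0.1},
        "text": {"</s>": 1.0},
    }
    return table[context[-1]]


def generate(max_new_tokens=10):
    """Greedy autoregressive decoding from a start-of-sequence token."""
    sequence = ["<s>"]
    for _ in range(max_new_tokens):
        scores = next_token_scores(sequence)
        token = max(scores, key=scores.get)  # greedy: take the top-scoring token
        sequence.append(token)  # the new token becomes part of the context
        if token == "</s>":  # stop at the end-of-sequence token
            break
    return sequence[1:]


print(generate())
```

Because each step depends on the tokens produced before it, generation is inherently sequential, which is one reason inference latency grows with output length.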
Llama 2, developed by Meta and released in partnership with Microsoft, is freely available for research and commercial use. It offers a collection of pretrained models for generative text and fine-tuned models optimized for chat use cases. The Llama 2 models are trained on a dataset of 2 trillion tokens and have double the context length of Llama 1. Moreover, the Llama 2-Chat models have been further refined using more than 1 million new human annotations. These models are built on an optimized transformer architecture and come in various parameter sizes, including 7 B, 13 B, and 70 B.
Stable Diffusion is a latent, text-to-image diffusion model. Latent diffusion models (LDMs) operate by repeatedly reducing noise in a latent representation space and then converting that representation into a complete image. Although Stable Diffusion is not an LLM, it was validated to illustrate the capability of this architecture to carry out inferencing for other generative AI modalities, such as image generation.