LLMs are advanced natural language processing models that use deep learning techniques to understand and generate human language. LLMs encompass a range of deep learning architectures and approaches, such as recurrent neural networks (RNNs) and, most commonly today, transformers. Generative Pre-trained Transformer (GPT) is a popular and influential family of LLMs based on the transformer architecture, a deep neural network design that handles sequential data efficiently. Transformers use self-attention mechanisms to process input sequences and learn contextual relationships between words, enabling them to generate coherent and contextually relevant language.
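To make the mechanism concrete, the following is a minimal NumPy sketch of single-head scaled dot-product self-attention with toy dimensions (the unmasked form; production transformers add multiple heads, causal masking, and many stacked layers):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over token embeddings X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project tokens to queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # similarity between every pair of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: attention weights per token
    return weights @ V                               # each output mixes context from all tokens

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                              # a toy 4-token sequence
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (4, 8): one context vector per token
```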
A foundation model is an LLM that has been trained on a large, general-purpose dataset before it is fine-tuned or adapted for a more specialized task. These models are typically trained on vast amounts of general data to learn basic features, patterns, and context within that data. Foundation models are crucial because they provide a starting point that already understands a broad range of concepts and language patterns.
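As a concrete illustration of this pattern, the sketch below uses the Hugging Face transformers library (an illustrative choice, not something this design prescribes) to take a checkpoint pretrained on general text and attach a task-specific classification head ready for fine-tuning:

```python
# Minimal sketch of the foundation-model pattern: start from a general
# pretrained checkpoint, then adapt it to a narrower task. The model name
# is an illustrative assumption, not part of this guide.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "distilbert-base-uncased"  # a small, general-purpose pretrained model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)
# The pretrained layers already encode broad language patterns; fine-tuning
# now only has to adapt them (plus the new head) to the specialized task.
```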
Some popular examples of community-shared LLMs include Llama 2, BLOOM, Falcon, and MPT. NVIDIA also offers several pretrained models with the NeMo Framework.
In this design, we primarily focus on Llama 2. Llama 2, developed by Meta and released in partnership with Microsoft, is freely available for research and commercial use. It offers a collection of pretrained models for generative text and fine-tuned models optimized for chat use cases. The Llama 2 models were trained on 2 trillion tokens and feature double the context length of Llama 1. The Llama 2-chat models have additionally been refined with over 1 million new human annotations. These models are built on an optimized transformer architecture and come in various parameter sizes. On-premises deployment of open-source large language models such as Llama 2 offers customers better value over time, with predictable costs, complete control over their data, reduced risk of security breaches and IP leakage, and easier compliance with regulations.
See Meta’s Responsible Use Guide and Community License Agreement for guidance on using Llama 2 in your enterprise deployment.
The Llama 2 model comes in three sizes: 7B, 13B, and 70B parameters. The following table provides details that can guide you in selecting the right model for your use case:
Table 1. Example use cases for Llama 2 models

| Foundation model | Strengths and use cases |
|------------------|-------------------------|
| Llama 2 7B | The fastest model, suited to simple language tasks such as text classification and spelling correction. In an enterprise setting, for example, text classification can be used to categorize and route customer queries appropriately. |
| Llama 2 13B | More accurate and verbose than Llama 2 7B, especially for tasks that require generating long output sequences. Useful for developing chatbots and for writing blog posts, articles, and summaries. |
| Llama 2 70B | The most capable model for complex generative tasks, such as creative writing of engaging content for company-specific advertisements or promotional material. It also supports training and knowledge sharing, helping to create comprehensive and accurate answers to common questions for knowledge bases and FAQs. |
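Once a size is chosen from the table above, one common way to experiment with the model is through the Hugging Face transformers library. This is an illustrative sketch rather than the deployment path prescribed in this design; note that the meta-llama checkpoints are gated, so you must first accept Meta's license on huggingface.co:

```python
# Illustrative only: load a Llama 2 chat model via Hugging Face transformers.
# Requires the transformers and accelerate packages, a GPU with enough memory,
# and prior acceptance of Meta's license for the gated meta-llama checkpoints.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # swap in the 13b or 70b variant per the table
    device_map="auto",                      # let accelerate place layers on available GPUs
)
result = generator("Classify this support ticket: 'My invoice total looks wrong.'",
                   max_new_tokens=64)
print(result[0]["generated_text"])
```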
Parameters in LLMs are the learnable weights of the neural network that makes up the model. They determine how the model processes input data and makes predictions or generates output. Typically, GPT-style models have millions (M) to billions (B) of parameters, which are learned during training as the model is exposed to vast amounts of data and adjusts its weights through gradient descent. Assuming comparable model architecture and training data, a model with more parameters is generally more accurate and more capable, although a smaller model trained for a specific outcome can be more accurate on that task, especially when given more task-specific training data. Models with more parameters also require more compute resources, especially GPU memory, so a balance must be struck when choosing a model.
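As a rough illustration of this trade-off, the weights alone set a GPU memory floor of about two bytes per parameter at FP16/BF16 precision (a simplifying assumption; real inference also needs memory for the KV cache and activations, so treat these as lower bounds):

```python
# Back-of-the-envelope memory needed just to hold the weights at 2 bytes/param.
for name, params in [("Llama 2 7B", 7e9), ("Llama 2 13B", 13e9), ("Llama 2 70B", 70e9)]:
    print(f"{name}: ~{params * 2 / 1e9:.0f} GB of weights in FP16")
# Llama 2 7B: ~14 GB, Llama 2 13B: ~26 GB, Llama 2 70B: ~140 GB
```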