Customizing an LLM at inference time involves adapting or adjusting its inputs during inference rather than changing the model itself. This customization allows the model to generate responses or perform tasks tailored to a particular application or user's needs without retraining the entire model.
A common way to achieve inference-time customization of an LLM is prompt engineering. Prompt engineering involves crafting effective input queries or prompts to elicit the desired responses from a generative AI model. The goal is to construct prompts that guide the model toward generating the required output, which can involve using specific keywords, phrasing, or context to influence the generated content.
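The following sketch illustrates this idea in Python. The `build_prompt` helper and the commented `generate` call are hypothetical placeholders for your own inference endpoint; the point is how task context, few-shot examples, and phrasing are assembled into a single prompt.

```python
# A minimal sketch of prompt engineering: the same model call, steered by
# context, examples, and phrasing added to the prompt. `generate` is a
# hypothetical call to a deployed model.

def build_prompt(task_context: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a prompt from task context, few-shot examples, and the user query."""
    parts = [task_context]
    for question, answer in examples:
        parts.append(f"Q: {question}\nA: {answer}")
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

prompt = build_prompt(
    task_context="You are a concise assistant. Answer in one sentence.",
    examples=[("What is an LLM?", "A large language model trained on text.")],
    query="What is prompt engineering?",
)
# response = generate(prompt)  # hypothetical call to the deployed model
```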
Prompt engineering is especially valuable when using models like Llama 2 as it allows you to tailor their behavior for specific applications. It can be a crucial component in designing conversational agents, automated customer support systems, content generation, and more, where you need to guide the model's responses effectively.
Foundation models do not have embedded domain-specific knowledge, so prompt engineering alone might be of limited use with them. However, prompt engineering can be combined with models that have been customized with domain-specific knowledge. For example, a bank that develops a chatbot can append the following text to each end-user prompt: "Your responses must exclude any content that might be harmful, unethical, racist, sexist, toxic, dangerous, or illegal."
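One hedged sketch of how the bank could apply this instruction is to inject it as the system prompt in the Llama 2 chat template; the surrounding serving code is assumed.

```python
# Inject a safety instruction as a Llama 2 chat system prompt.
SAFETY_POLICY = (
    "Your responses must exclude any content that might be harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal."
)

def llama2_chat_prompt(system: str, user_message: str) -> str:
    """Format a single-turn prompt using the Llama 2 chat template."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user_message} [/INST]"

prompt = llama2_chat_prompt(SAFETY_POLICY, "How do I open a savings account?")
```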
Model customization of LLMs refers to techniques in which a model that was initially trained on a large dataset for a general language-related task is adapted for various other language understanding and generation tasks. This approach has revolutionized the field of NLP, enabling enterprises to harness readily available models like Llama 2 to develop customized models geared toward their real-world language-related tasks. Some popular model customization techniques include the following:
A popular model customization method is supervised fine-tuning (SFT), which adapts the model to excel at particular language understanding or text generation tasks, such as text classification, question answering, language translation, or text summarization. This process begins with a foundation LLM and trains it further on a dataset containing labeled examples specific to the target task. Through backpropagation on the task-specific data, all of the model's parameters are adjusted, potentially enhancing its performance on the targeted task.
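The following is a minimal SFT sketch using the Hugging Face transformers library, not a production recipe. The dataset name "your-dataset" is a placeholder, and it is assumed to have "text" and "label" columns; access to the Llama 2 weights is also assumed.

```python
# A minimal supervised fine-tuning sketch for text classification.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # assumes you have access to the weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

dataset = load_dataset("your-dataset")  # placeholder labeled dataset

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

# Full fine-tuning: every parameter in the model is trainable.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", num_train_epochs=1),
    train_dataset=tokenized["train"],
)
trainer.train()
```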
As the parameter counts of popular models have grown, fine-tuning an entire model has become computationally intensive. The primary objective of Parameter-Efficient Fine-Tuning (PEFT) is to fine-tune only a small fraction of the model's parameters while attaining performance comparable to full fine-tuning and substantially reducing the computational demands. PEFT achieves this efficiency by freezing most layers of the pretrained model and concentrating the fine-tuning on a small set of layers or added parameters. This approach lets the model adapt to new tasks with a much smaller computational burden and with less reliance on a large number of labeled examples.
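A simplified illustration of the freezing idea in PyTorch follows. The layer names in the commented example are illustrative of the Llama architecture; real PEFT libraries automate this selection.

```python
# Freeze most of a pretrained model and train only a small subset of parameters.
import torch.nn as nn

def freeze_except(model: nn.Module, trainable_substrings: list[str]) -> None:
    """Freeze all parameters except those whose names contain a given substring."""
    for name, param in model.named_parameters():
        param.requires_grad = any(s in name for s in trainable_substrings)

def count_trainable(model: nn.Module) -> int:
    """Count parameters that will receive gradient updates."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g., train only the final transformer block and the output head:
# freeze_except(model, ["layers.31", "lm_head"])
```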
There are several methods of customizing a large language model using PEFT. In this design guide, we focus on P-tuning and Low-Rank Adaptation (LoRA).
P-tuning employs a small, trainable model preceding the LLM. This smaller model's purpose is to encode the text prompt and generate specialized virtual tokens specific to the task at hand. These task-specific virtual tokens are prepended to the prompt, which is then forwarded to the LLM. When the tuning process is finalized, these virtual tokens are cataloged in a lookup table for later use during inference, effectively supplanting the smaller model. By fine-tuning solely the parameters of the compact model while keeping the LLM's parameters fixed, P-tuning significantly reduces the computational resources required for model customization.
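As a hedged sketch, the Hugging Face peft library implements P-tuning through its PromptEncoderConfig: a small prompt encoder is trained to produce the virtual tokens while the LLM itself stays frozen. The model name and hyperparameter values below are assumptions.

```python
# P-tuning with the Hugging Face `peft` library: only the small prompt
# encoder is trainable; the base LLM's parameters stay frozen.
from peft import PromptEncoderConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = PromptEncoderConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,    # virtual tokens prepended to each prompt
    encoder_hidden_size=128,  # size of the small trainable prompt encoder
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # reports only the prompt encoder's parameters
```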
Low-Rank Adaptation (LoRA) offers another methodology for fine-tuning large language models to excel in specific tasks or domains. LoRA keeps the pretrained model weights frozen and extends the model's capabilities by integrating small additional layers known as rank-decomposition matrices. The key distinction is that only these added matrices undergo training, rather than the entire model, which drastically reduces the number of trainable parameters. As a result, LoRA achieves a substantial reduction in computational requirements while yielding performance that is on par with, or surpasses, the results attained with conventional fine-tuning techniques across a diverse range of tasks.
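A minimal LoRA sketch with the peft library is shown below. The target module names follow the Llama attention projections, and the rank and scaling values are assumptions chosen for illustration.

```python
# LoRA with `peft`: the pretrained weights stay frozen and only the injected
# rank-decomposition matrices are trained.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the decomposition matrices
    lora_alpha=16,                        # scaling factor for the LoRA update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```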
Additional PEFT techniques such as Prompt Tuning, Adapters, and IA3 are not covered in this design guide. See the NVIDIA blog about model customization techniques.
These methods collectively offer a range of strategies for customizing LLMs to perform effectively on specific tasks or in particular domains. Depending on the requirements of the application, one or a combination of these methods can be employed to achieve optimal performance.