Model customization
Model customization, a subset of which is fine-tuning, refers to adapting a pretrained foundation model to perform a specific task or cater to a particular domain. It involves training the foundation model on a task-specific dataset, or adjusting its prompts or parameters, to optimize performance for the desired use case. Fine-tuning enhances the model's ability to generate accurate and contextually relevant outputs.
Distributed fine-tuning of LLMs is a resource-intensive process. It demands high computational power and substantial memory to store the model parameters, and it also requires efficient data communication, which is why high-speed interconnects are needed. High-speed interconnects, such as InfiniBand or high-speed Ethernet, facilitate rapid communication between nodes in a distributed computing environment. They enable quick synchronization of model parameters and gradients across multiple GPUs or nodes, which is critical for effective distributed training. Without a sufficiently high-performing network, the communication overhead becomes a bottleneck, significantly slowing the fine-tuning process and reducing the overall efficiency of the distributed system.
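The following is a minimal sketch of how gradient synchronization across GPUs looks in practice, assuming PyTorch DistributedDataParallel launched with torchrun; on AMD Instinct GPUs, PyTorch's "nccl" backend maps to RCCL. The tiny linear model and random data are placeholders for illustration only, not the validated design's training code.

```python
# Sketch: gradient all-reduce over the interconnect with PyTorch DDP.
# Launch example (assumption): torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")        # collective communication backend (RCCL on ROCm)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder for an LLM replica
ddp_model = DDP(model, device_ids=[local_rank])

optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
inputs = torch.randn(8, 1024).cuda(local_rank)          # placeholder batch

loss = ddp_model(inputs).sum()
# Gradients are all-reduced across ranks during backward; this traffic is
# what the high-speed interconnect carries on every training step.
loss.backward()
optimizer.step()
```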
The following sections describe the fine-tuning techniques that we validated in this design.
Supervised fine-tuning (SFT) is a popular model customization method that adapts the model to excel in particular language understanding or text generation tasks, such as text classification, question answering, language translation, or text summarization. This process begins with a foundation LLM and proceeds by training it further using a dataset containing labeled examples specific to the required task. Through backpropagation on the task-specific data, the entire model's parameters are adjusted, potentially enhancing its performance on the targeted task.
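A minimal SFT sketch follows, assuming the Hugging Face transformers library, a small placeholder causal LM, and a toy set of labeled examples; it illustrates updating all model parameters through backpropagation rather than the exact recipe validated in this design.

```python
# Sketch: supervised fine-tuning of a causal LM on labeled examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"                      # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy labeled examples: task prompt and expected output in one sequence.
examples = [
    "Question: What is the capital of France? Answer: Paris.",
    "Question: Classify the sentiment of 'I loved it'. Answer: Positive.",
]
batch = tokenizer(examples, padding=True, truncation=True, return_tensors="pt")

# For causal-LM SFT, the labels are the input tokens; padding is ignored in the loss.
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)
outputs.loss.backward()                               # backpropagation adjusts all parameters
optimizer.step()
optimizer.zero_grad()
```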
Low-rank adaptation (LoRA) introduces a methodology for fine-tuning LLMs to excel in specific tasks or domains. It keeps the pretrained model weights frozen while extending the model's capabilities through additional layers known as "rank-decomposition matrices." The key distinction is that only these added layers are trained, rather than the entire model. Training only these supplementary layers greatly reduces the number of trainable parameters and the computational resources required. As a result, LoRA substantially lowers computational requirements while yielding performance that equals or surpasses conventional fine-tuning across a diverse range of tasks.
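The sketch below shows how rank-decomposition matrices can be attached to a frozen base model, assuming the Hugging Face peft and transformers libraries; the placeholder model and the target module names ("q_proj", "v_proj") are assumptions that depend on the chosen model architecture.

```python
# Sketch: wrapping a frozen base model with LoRA adapters using peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder model

lora_config = LoraConfig(
    r=8,                                   # rank of the decomposition matrices
    lora_alpha=16,                         # scaling factor applied to the LoRA update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
# Only the injected rank-decomposition matrices are trainable;
# the pretrained weights remain frozen.
model.print_trainable_parameters()
```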
Checkpointing is a crucial aspect of fine-tuning LLMs. Due to the massive size of these models, they often do not fit into memory, necessitating the use of parallelism techniques such as data, model, and pipeline parallelism. Because training these models can take a significant amount of time, it is essential to save the state of the system regularly, a process known as checkpointing. Checkpointing allows for the recovery of the model's execution if unexpected events like component failures or undesirable learning patterns occur.
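A minimal checkpointing sketch is shown below, assuming a plain PyTorch training loop; the function names and directory layout are illustrative, and a production deployment would typically rely on the training framework's own distributed checkpointing support.

```python
# Sketch: periodic checkpointing so training can resume after a failure.
import os
import torch

def save_checkpoint(model, optimizer, step, path="checkpoints"):
    """Save model and optimizer state so training can be restarted from this step."""
    os.makedirs(path, exist_ok=True)
    torch.save(
        {
            "step": step,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        os.path.join(path, f"step_{step}.pt"),
    )

def load_checkpoint(model, optimizer, ckpt_file):
    """Restore model and optimizer state, returning the step to resume from."""
    state = torch.load(ckpt_file, map_location="cpu")
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    return state["step"]

# Inside the training loop, save every N steps, for example:
# if step % 500 == 0:
#     save_checkpoint(model, optimizer, step)
```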