This section presents the performance results of model customization runs conducted on Dell PowerEdge XE9680 servers powered by AMD MI300X GPUs. The experiments were performed in a Kubernetes cluster environment, scaling from one node to eight nodes. The models under test were Llama3-8B and Llama3-70B, fine-tuned with two techniques: Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA).
The backend infrastructure was managed by Omnia, an open-source toolkit developed by Dell for deploying and managing high-performance clusters tailored for HPC, AI, and data analytics workloads. Omnia simplifies fine-tuning with features that include Ansible playbook-based deployment of Slurm and Kubernetes on servers running an RPM-based Linux operating system, hardware and operating system enablement, cluster management, telemetry and monitoring, and containerization.
The training data was drawn from the databricks-dolly-15k dataset.
Supervised Fine-Tuning (SFT) trains a pretrained model on a specific dataset with labeled data to improve its performance on a particular task. It fine-tunes the model’s parameters to align better with the target task, enhancing accuracy and efficiency.
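To illustrate the idea (this is not the torchtune implementation used in these runs), the following sketch continues training a small "pretrained" linear model with gradient descent on a labeled dataset; all dimensions and data here are toy values invented for the example:

```python
import numpy as np

# Minimal sketch of Supervised Fine-Tuning: start from existing
# ("pretrained") weights and continue training them with gradient
# descent on a small labeled dataset.

rng = np.random.default_rng(1)
w = rng.standard_normal(3) * 0.1                 # stand-in for pretrained weights
X = rng.standard_normal((32, 3))                 # labeled task data: features
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)  # task labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(200):                             # fine-tune on the labeled set
    p = sigmoid(X @ w)                           # model predictions
    w -= lr * X.T @ (p - y) / len(y)             # cross-entropy gradient step

accuracy = float(((sigmoid(X @ w) > 0.5) == y).mean())
```

The same principle applies at LLM scale: every parameter of the pretrained model is updated against the labeled target data, which is why SFT is more resource-intensive than parameter-efficient methods.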
Low-Rank Adaptation (LoRA) adapts a pretrained model to new tasks by introducing low-rank matrices into the model’s layers. This approach reduces the number of parameters that must be fine-tuned, making the adaptation process more efficient and less resource-intensive.
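The core mechanism can be sketched in a few lines: the pretrained weight matrix W is frozen, and only two small low-rank factors A and B are trained, so the effective weight becomes W + (alpha/r)·BA. The dimensions, rank, and scaling below are illustrative choices, not the values from the reported runs:

```python
import numpy as np

# Minimal sketch of Low-Rank Adaptation: the pretrained weight W is
# frozen; only the low-rank factors A and B are trained, giving an
# effective weight of W + (alpha / r) * (B @ A).

d_out, d_in, r, alpha = 1024, 1024, 8, 16        # illustrative sizes

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))           # pretrained weight (frozen)
A = rng.standard_normal((r, d_in)) * 0.01        # trainable factor, small init
B = np.zeros((d_out, r))                         # zero init: no change at start

def lora_forward(x):
    # Base projection plus the scaled low-rank update.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)       # B = 0, so output is unchanged

full_params = W.size                             # all weights, if fully fine-tuned
lora_params = A.size + B.size                    # weights actually trained by LoRA
print(f"parameter reduction: {full_params // lora_params}x")
```

The zero initialization of B means the adapted model starts out identical to the pretrained one, and the trainable parameter count scales with the rank r rather than with the full weight dimensions.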
The training configurations used in these experiments provide important context for the performance results; the key parameters are listed in Table 14.
The performance data was collected using PyTorch torchtune, focusing on training time for both SFT and LoRA techniques across different node configurations. For more information about how we used torchtune to run fine-tuning, see the Fine-tuning Llama 3 Using torchtune blog.
The following table shows the training configurations that we used:
Table 14. Training configurations
| Parameter | SFT (Llama 3 8B and 70B) | LoRA (Llama 3 8B and 70B) |
|---|---|---|
| Epochs | 1 | 1 |
| Gradient accumulation steps | 1 | 1 |
| Batch size | 2 | 2 |
| Data type | bf16 | bf16 |
| Enable activation checkpointing | True | True |
| Memory-efficient FSDP wrap | True | N/A |
| Learning rate | N/A | 3e-4 |
| FSDP | Enabled | Enabled |
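For orientation, these parameters correspond to fields in torchtune recipe configuration files. The fragment below is an illustrative sketch assembled from the key names in torchtune's published Llama 3 recipes, not the exact configuration used in these runs; field names can differ between torchtune versions:

```yaml
# Illustrative torchtune-style settings mirroring Table 14 (SFT case).
epochs: 1
gradient_accumulation_steps: 1
batch_size: 2
dtype: bf16
enable_activation_checkpointing: True
memory_efficient_fsdp_wrap: True      # SFT only; not applicable to LoRA

# LoRA recipes instead set the learning rate on the optimizer:
# optimizer:
#   _component_: torch.optim.AdamW
#   lr: 3e-4
```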
To capture the performance, we used the processes described in Chapter 4. The following table shows the results:
Table 15. Fine-tuning performance results
| Model | Number of nodes | Number of GPUs | SFT training time (seconds) | LoRA training time (seconds) |
|---|---|---|---|---|
| Llama3-8B | 1 | 8 | 684 | 679 |
| Llama3-8B | 2 | 16 | 383 | 364 |
| Llama3-8B | 4 | 32 | 196 | 187 |
| Llama3-8B | 8 | 64 | 100 | 99 |
| Llama3-70B | 1 | 8 | 2831 | 3008 |
| Llama3-70B | 2 | 16 | 1720 | 1797 |
| Llama3-70B | 4 | 32 | 885 | 922 |
| Llama3-70B | 8 | 64 | 463 | 494 |
The following observations can be made from the results:

- Training time decreased substantially, and nearly linearly, as the cluster scaled from one node (8 GPUs) to eight nodes (64 GPUs), for both models and both techniques.
- SFT and LoRA completed in comparable times for Llama3-8B, while for Llama3-70B, SFT was slightly faster than LoRA at every node count.

The performance runs on the Dell PowerEdge XE9680 servers with AMD MI300X accelerators in a Kubernetes cluster environment demonstrated that combining techniques such as SFT and LoRA with FSDP is highly effective in optimizing the model customization process. These results provide valuable insights for designing and implementing scalable AI training infrastructures.
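The scaling behavior can be quantified from the SFT times in Table 15 by computing the speedup over the single-node run and dividing by the node-count multiplier:

```python
# Scaling efficiency derived from the SFT training times in Table 15:
# speedup over the single-node run, divided by the node-count multiplier.

sft_times = {
    "Llama3-8B":  {1: 684, 2: 383, 4: 196, 8: 100},
    "Llama3-70B": {1: 2831, 2: 1720, 4: 885, 8: 463},
}

efficiency = {}
for model, times in sft_times.items():
    base = times[1]                    # single-node training time
    for nodes, t in times.items():
        efficiency[(model, nodes)] = (base / t) / nodes
        print(f"{model}, {nodes} node(s): "
              f"{efficiency[(model, nodes)]:.0%} scaling efficiency")
```

By this measure, Llama3-8B retains roughly 86% scaling efficiency at eight nodes and Llama3-70B roughly 76%, consistent with the near-linear scaling observed above.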