This section presents the performance results of model customization runs conducted on Dell PowerEdge XE9680 servers powered by AMD MI300X GPUs. The experiments were performed in a Kubernetes cluster environment, scaling from one node to eight nodes. The models under test were Llama3-8B and Llama3-70B, fine-tuned with two techniques: Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA).
The backend infrastructure was managed by Omnia, an open-source toolkit developed by Dell for deploying and managing high-performance clusters tailored for HPC, AI, and data analytics workloads. Omnia simplifies fine-tuning with features that include Ansible playbook-based deployment of Slurm and Kubernetes on servers running an RPM-based Linux operating system, hardware and operating system enablement, cluster management, telemetry and monitoring, and containerization.
The training data was drawn from the databricks-dolly-15k dataset.
Supervised Fine-Tuning (SFT) trains a pretrained model on a specific dataset with labeled data to improve its performance on a particular task. It fine-tunes the model’s parameters to align better with the target task, enhancing accuracy and efficiency.
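To illustrate the idea (this is not the torchtune implementation used in these runs), the following sketch continues training a small "pretrained" linear model with gradient descent on a labeled dataset; all dimensions and data here are toy values invented for the example:

```python
import numpy as np

# Minimal sketch of Supervised Fine-Tuning: start from existing
# ("pretrained") weights and continue training them with gradient
# descent on a small labeled dataset.

rng = np.random.default_rng(1)
w = rng.standard_normal(3) * 0.1                 # stand-in for pretrained weights
X = rng.standard_normal((32, 3))                 # labeled task data: features
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)  # task labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(200):                             # fine-tune on the labeled set
    p = sigmoid(X @ w)                           # model predictions
    w -= lr * X.T @ (p - y) / len(y)             # cross-entropy gradient step

accuracy = float(((sigmoid(X @ w) > 0.5) == y).mean())
```

The same principle applies at LLM scale: every parameter of the pretrained model is updated against the labeled target data, which is why SFT is more resource-intensive than parameter-efficient methods.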
Low-Rank Adaptation (LoRA) adapts a pretrained model to new tasks by introducing low-rank matrices into the model’s layers. This approach reduces the number of parameters that must be fine-tuned, making the adaptation process more efficient and less resource-intensive.
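The core mechanism can be sketched in a few lines: the pretrained weight matrix W is frozen, and only two small low-rank factors A and B are trained, so the effective weight becomes W + (alpha/r)·BA. The dimensions, rank, and scaling below are illustrative choices, not the values from the reported runs:

```python
import numpy as np

# Minimal sketch of Low-Rank Adaptation: the pretrained weight W is
# frozen; only the low-rank factors A and B are trained, giving an
# effective weight of W + (alpha / r) * (B @ A).

d_out, d_in, r, alpha = 1024, 1024, 8, 16        # illustrative sizes

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))           # pretrained weight (frozen)
A = rng.standard_normal((r, d_in)) * 0.01        # trainable factor, small init
B = np.zeros((d_out, r))                         # zero init: no change at start

def lora_forward(x):
    # Base projection plus the scaled low-rank update.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)       # B = 0, so output is unchanged

full_params = W.size                             # all weights, if fully fine-tuned
lora_params = A.size + B.size                    # weights actually trained by LoRA
print(f"parameter reduction: {full_params // lora_params}x")
```

The zero initialization of B means the adapted model starts out identical to the pretrained one, and the trainable parameter count scales with the rank r rather than with the full weight dimensions.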
The training configurations used in these experiments provide important context for the performance results; the key parameters are listed in Table 14.
The performance data was collected using PyTorch torchtune, focusing on training time for both SFT and LoRA techniques across different node configurations. For more information about how we used torchtune to run fine-tuning, see the Fine-tuning Llama 3 Using torchtune blog.
The following table shows the training configurations that we used:
Table 14. Training configurations
| Parameter | SFT (Llama 3 8B and 70B) | LoRA (Llama 3 8B and 70B) |
|---|---|---|
| Epochs | 1 | 1 |
| Gradient accumulation steps | 1 | 1 |
| Batch size | 2 | 2 |
| Data type | bf16 | bf16 |
| Enable activation checkpointing | True | True |
| Memory-efficient FSDP wrap | True | N/A |
| Learning rate | N/A | 3e-4 |
| FSDP | Enabled | Enabled |
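For orientation, these parameters correspond to fields in torchtune recipe configuration files. The fragment below is an illustrative sketch assembled from the key names in torchtune's published Llama 3 recipes, not the exact configuration used in these runs; field names can differ between torchtune versions:

```yaml
# Illustrative torchtune-style settings mirroring Table 14 (SFT case).
epochs: 1
gradient_accumulation_steps: 1
batch_size: 2
dtype: bf16
enable_activation_checkpointing: True
memory_efficient_fsdp_wrap: True      # SFT only; not applicable to LoRA

# LoRA recipes instead set the learning rate on the optimizer:
# optimizer:
#   _component_: torch.optim.AdamW
#   lr: 3e-4
```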
To capture the performance, we used the processes described in Chapter 4. The following table shows the results:
Table 15. Fine-tuning performance results
| Model | Number of nodes | Number of GPUs | SFT training time (seconds) | LoRA training time (seconds) |
|---|---|---|---|---|
| Llama3-8B | 1 | 8 | 684 | 679 |
| Llama3-8B | 2 | 16 | 383 | 364 |
| Llama3-8B | 4 | 32 | 196 | 187 |
| Llama3-8B | 8 | 64 | 100 | 99 |
| Llama3-70B | 1 | 8 | 2831 | 3008 |
| Llama3-70B | 2 | 16 | 1720 | 1797 |
| Llama3-70B | 4 | 32 | 885 | 922 |
| Llama3-70B | 8 | 64 | 463 | 494 |
The following observations can be made from the results:

- Training time decreased substantially, and nearly linearly, as the cluster scaled from one node (8 GPUs) to eight nodes (64 GPUs), for both models and both techniques.
- SFT and LoRA completed in comparable times for Llama3-8B, while for Llama3-70B, SFT was slightly faster than LoRA at every node count.

The performance runs on the Dell PowerEdge XE9680 servers with AMD MI300X accelerators in a Kubernetes cluster environment demonstrated that combining techniques such as SFT and LoRA with FSDP is highly effective in optimizing the model customization process. These results provide valuable insights for designing and implementing scalable AI training infrastructures.
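The scaling behavior can be quantified from the SFT times in Table 15 by computing the speedup over the single-node run and dividing by the node-count multiplier:

```python
# Scaling efficiency derived from the SFT training times in Table 15:
# speedup over the single-node run, divided by the node-count multiplier.

sft_times = {
    "Llama3-8B":  {1: 684, 2: 383, 4: 196, 8: 100},
    "Llama3-70B": {1: 2831, 2: 1720, 4: 885, 8: 463},
}

efficiency = {}
for model, times in sft_times.items():
    base = times[1]                    # single-node training time
    for nodes, t in times.items():
        efficiency[(model, nodes)] = (base / t) / nodes
        print(f"{model}, {nodes} node(s): "
              f"{efficiency[(model, nodes)]:.0%} scaling efficiency")
```

By this measure, Llama3-8B retains roughly 86% scaling efficiency at eight nodes and Llama3-70B roughly 76%, consistent with the near-linear scaling observed above.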