In this section, we summarize the performance characteristics of customizing an LLM. The experiments use the following configuration:
- Total number of samples in the train dataset: 12012
- Unique samples in the train dataset: 11895
- Maximum number of tokens in a sample in the train dataset: 8588
- Minimum number of tokens in a sample in the train dataset: 1
- Average number of tokens in a sample in the train dataset: approximately 114.54
- Total tokens in the train dataset: 1375803
- Micro batch size: 1 (Micro batch size refers to the size of a small subset of data samples processed at each step during training by each process/GPU.)
- Global batch size: 128 (Global batch size is the total number of data samples processed in a single training step. Global batch size = Micro batch size * Data Parallel Size * Gradient Accumulation Steps; see the NVIDIA NeMo documentation and the sketch after this list.)
- Number of steps: A training step is a single iteration of the training process in which the model is updated based on a batch of input data. Each step involves a forward pass and a backward pass through the model to calculate gradients, which are then used to update the model's parameters. The time to model convergence depends on the dataset, the model, and the hyperparameters; typically, models converge between 1,000 and 2,000 steps. We used the following number of steps:
- For LoRA and P-tuning with the Llama 2 7B and 13B models, we used 1,000 steps.
- For LoRA and P-tuning with the Llama 2 70B model, and for SFT with all three Llama 2 models, we used 50 steps; the results presented are extrapolated values.
- We used Tensor Parallelism and Pipeline Parallelism (see Table 8).
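To make the batch-size relationship concrete, the following minimal Python sketch shows how the global batch size of 128 decomposes into micro batch size, data parallel size, and gradient accumulation steps, and how many passes over the 12,012-sample train dataset 1,000 steps correspond to. The 8-way data parallel, 16-step gradient accumulation split is a hypothetical illustration; the actual decomposition depends on the tensor and pipeline parallelism settings in Table 8.

```python
# Minimal sketch (not the validated configuration): relating the global batch
# size in this section to micro batch size, data parallel size, and gradient
# accumulation steps, and estimating how many epochs 1,000 steps covers.

def global_batch_size(micro_batch_size: int,
                      data_parallel_size: int,
                      grad_accum_steps: int) -> int:
    """Global batch size = micro batch size * data parallel size
    * gradient accumulation steps (see the NVIDIA NeMo documentation)."""
    return micro_batch_size * data_parallel_size * grad_accum_steps


# Values taken from this section's configuration.
MICRO_BATCH_SIZE = 1
GLOBAL_BATCH_SIZE = 128
TRAIN_SAMPLES = 12012
STEPS = 1000

# Hypothetical split: 8-way data parallelism with 16 gradient accumulation
# steps reproduces the global batch size of 128.
assert global_batch_size(MICRO_BATCH_SIZE, 8, 16) == GLOBAL_BATCH_SIZE

steps_per_epoch = TRAIN_SAMPLES / GLOBAL_BATCH_SIZE   # ~94 steps per epoch
epochs_in_run = STEPS / steps_per_epoch                # ~10.7 epochs
print(f"{steps_per_epoch:.1f} steps/epoch, {epochs_in_run:.1f} epochs in {STEPS} steps")
```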
Note: The results presented here are based on our validation process and are not optimized for best performance. Therefore, do not use these findings for direct performance comparisons between this server and other hardware. As we continue our work, we will revise this document to reflect the most effective configurations. Dell Technologies has published benchmarking results using MLPerf™ that you can use for performance comparison.
The following table shows the time that we measured for model customization. It includes the time to train the model for 1,000 steps. It does not include the time to load the model, load the dataset, or perform validation.
Table 9. Time for model customization for various models and customization methods on a single node PowerEdge XE9680 server
| Models | Number of GPUs | LoRA (time in minutes) | P-tuning (time in minutes) | SFT (time in minutes) |
|---|---|---|---|---|
| Llama 2 7B | 4 | 102 | 100 | 380[5] |
| Llama 2 7B | 8 | 79 | 78 | N/A |
| Llama 2 7B | 16 | 39 | 39 | N/A |
| Llama 2 13B | 4 | 248 | 220 | 820[5] |
| Llama 2 13B | 8 | 174 | 166 | 660[5] |
| Llama 2 13B | 16 | 85 | 81 | N/A |
| Llama 2 70B | 8 | 985[5] | 867[5] | N/A |
| Llama 2 70B | 16 | 487 | 428 | N/A |
[5] Training time is extrapolated from 50 steps to 1,000 steps.
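The extrapolation in note [5] can be reproduced with a simple linear scaling of the measured per-step time, as in the sketch below. This assumes that the per-step time is roughly constant after the first few steps; it illustrates the calculation and is not necessarily the exact methodology we used.

```python
# Sketch of linear extrapolation from a short measured run to 1,000 steps.
# Assumption: per-step time is roughly constant, so total time scales
# linearly with the number of steps. Illustrative only, not the exact
# methodology behind Table 9.

def extrapolate_minutes(measured_minutes: float,
                        measured_steps: int = 50,
                        target_steps: int = 1000) -> float:
    per_step = measured_minutes / measured_steps
    return per_step * target_steps


# Hypothetical measured value consistent with the Llama 2 70B LoRA entry on
# 8 GPUs in Table 9: a 50-step run of 49.25 minutes extrapolates to 985 minutes.
print(f"{extrapolate_minutes(49.25):.0f} minutes")  # -> 985 minutes
```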
We have made the following observations:
- All customization methods scale well when adding GPU resources.
- Both LoRA and P-tuning require considerably less training time than SFT. This efficiency occurs because SFT updates all of the model's parameters, whereas LoRA and P-tuning modify only a much smaller subset of parameters (see the sketch after this list). For many practical scenarios, we advise beginning with LoRA or P-tuning as the model customization approach.
- Running SFT model customization for Llama 2 70B on eight GPUs (scenarios marked as N/A) requires more resources than are currently available in a single PowerEdge XE9680 server.
- Although the SFT runs with Llama 2 7B completed successfully, they exhibited fluctuating training times. We are investigating these runs and will update this document when we pinpoint the underlying cause.
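To illustrate why LoRA and P-tuning train so much faster, the back-of-the-envelope sketch below estimates the trainable-parameter count for a rank-8 LoRA adapter on a Llama 2 7B-class model and compares it to full SFT. The hidden size, layer count, adapted projections, and rank are illustrative assumptions, not the exact settings used in these experiments.

```python
# Minimal sketch (illustrative assumptions, not the exact experiment settings):
# rough trainable-parameter counts for full SFT versus a LoRA adapter on a
# Llama 2 7B-class model.

# Assumed Llama 2 7B-style dimensions.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
TOTAL_PARAMS = 7e9           # full fine-tuning (SFT) updates all of these

# Assumed LoRA setup: rank-8 adapters on the attention query and value
# projections of every layer. Each adapter adds two low-rank matrices,
# A (hidden x r) and B (r x hidden).
RANK = 8
ADAPTED_PROJECTIONS_PER_LAYER = 2    # e.g., q_proj and v_proj

lora_params = (NUM_LAYERS * ADAPTED_PROJECTIONS_PER_LAYER
               * 2 * HIDDEN_SIZE * RANK)

print(f"LoRA trainable params : {lora_params / 1e6:.1f} M")
print(f"SFT trainable params  : {TOTAL_PARAMS / 1e9:.1f} B")
print(f"LoRA fraction of total: {lora_params / TOTAL_PARAMS:.4%}")
# -> roughly 4.2 M trainable parameters, about 0.06% of the full model,
#    which is why LoRA (and similarly P-tuning) trains much faster than SFT.
```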