In this section, we summarize the performance characteristics of customizing an LLM. The experiments use the following configuration:
- Total number of samples in the train dataset: 12012
- Unique samples in the train dataset: 11895
- Maximum number of tokens in a sample in the train dataset: 8588
- Minimum number of tokens in a sample in the train dataset: 1
- Average number of tokens in a sample in the train dataset: approximately 114.54
- Total tokens in the train dataset: 1375803
- Micro batch size: 1 (Micro batch size refers to the size of a small subset of data samples processed at each step during training by each process/GPU.)
- Global batch size: 128 (Global batch size is the total number of data samples processed in a single training step. Global batch size = Micro batch size * Data Parallel Size * Gradient Accumulation steps. See the NVIDIA NeMo documentation.)
- Number of steps: A training step is a single iteration of the training process in which the model is updated based on a batch of input data. Each step involves a forward and backward pass through the model to calculate gradients, which are then used to update the model's parameters. The time to convergence depends on the dataset, the model, and the hyperparameters; typically, models converge anywhere between 1000 and 2000 steps. We used the following number of steps:
- For LoRA and P-tuning with Llama 2 7B and 13B models, we used 1000 steps.
- For LoRA and P-tuning with Llama 2 70B and for SFT for all three Llama models, we used 50 steps and the results presented are extrapolated values.
- We used Tensor Parallelism and Pipeline Parallelism (see Table 11).
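The dataset and batch-size figures above can be sanity-checked with a few lines of Python. The dataset counts and batch sizes come from the configuration above; the data-parallel size of 8 is an assumed example, since the actual value depends on the GPU count and the tensor/pipeline parallelism layout:

```python
# Dataset statistics from the configuration above.
total_samples = 12012
total_tokens = 1_375_803
avg_tokens = total_tokens / total_samples
print(round(avg_tokens, 2))  # 114.54

# Global batch size = micro batch size * data parallel size * gradient accumulation steps.
micro_batch_size = 1
global_batch_size = 128
data_parallel_size = 8  # assumption for illustration only
grad_accum_steps = global_batch_size // (micro_batch_size * data_parallel_size)
print(grad_accum_steps)  # 16
```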
Note: The results presented here are based on our validation process and are not tuned for best performance. Therefore, do not use these findings for direct performance comparisons between the server and other hardware. As we continue our work, this document will be revised to reflect the most effective configurations. Dell Technologies has published benchmarking results using MLPerf that you can use for performance comparison.
The following table shows the time that we measured for model customization. It includes the time to train the model for 1000 steps. It does not include the time to load the model, load the dataset, or perform validation.
Table 12. Time for model customization for various models and customization methods
| Model | GPU configuration | LoRA | P-tuning | SFT |
|---|---|---|---|---|
| Llama 2 7B | 4 x L40S | 352 | 549 | N/A |
| Llama 2 7B | 8 x L40S | 196 | 280 | 269 |
| Llama 2 7B | 16 x L40S | 135 | 107 | 148 |
| Llama 2 13B | 8 x L40S | 494 | 754 | N/A |
| Llama 2 13B | 16 x L40S | 270 | 381 | 405 |
| Llama 2 7B | 8 x H100 SXM | 79 | 78 | 163 |
| Llama 2 7B | 16 x H100 SXM | 40 | 40 | 84 |
| Llama 2 7B | 32 x H100 SXM | 20 | 20 | 45 |
| Llama 2 13B | 8 x H100 SXM | 180 | 176 | 380 |
| Llama 2 13B | 16 x H100 SXM | 95 | 89 | 185 |
| Llama 2 13B | 32 x H100 SXM | 48 | 46 | 100 |
| Llama 2 70B | 8 x H100 SXM | 1123 | 961 | N/A |
| Llama 2 70B | 16 x H100 SXM | 571 | N/A | N/A |
| Llama 2 70B | 32 x H100 SXM | 369 | 320 | N/A |
We made the following observations:
- All models scale well when GPU resources are added. LoRA and P-tuning scale nearly linearly on H100 SXM GPUs.
- Both LoRA and P-tuning require considerably less training time compared to SFT. This efficiency occurs because SFT involves updating all the model's parameters, whereas LoRA and P-tuning focus on modifying only a smaller subset of parameters. For many practical scenarios, beginning with LoRA or P-tuning as the preferred model customization approach is advisable.
- Running SFT model customization for the Llama 2 scenarios marked as N/A requires more resources than are available in those configurations.
- All L40S training times were measured for 100 steps and extrapolated to 1000 steps.
- With L40S, LoRA and SFT perform better than P-tuning. On L40S, pipeline parallelism provides better performance than tensor parallelism. Pipeline parallelism support with P-tuning was not available as of publication of this document.
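The extrapolation described in the observations is simple linear scaling of a short measured run to the full step count, which assumes the per-step time is constant. The function name and sample numbers below are illustrative, not measurements:

```python
def extrapolate(measured_time, measured_steps, target_steps):
    """Linearly extrapolate training time from a short run to target_steps.

    Assumes per-step time is constant, so total time scales with step count.
    """
    return measured_time * target_steps / measured_steps

# Hypothetical example: a 100-step run taking 50 time units
# projects to 500 time units at 1000 steps.
print(extrapolate(50, 100, 1000))  # 500.0
```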