In this section, we summarize the performance characteristics of customizing an LLM. The experiments use the following configuration:
- Total number of samples in the train dataset: 12012
- Unique samples in the train dataset: 11895
- Maximum number of tokens in a sample in the train dataset: 8588
- Minimum number of tokens in a sample in the train dataset: 1
- Average number of tokens in a sample in the train dataset: approximately 114.54
- Total tokens in the train dataset: 1375803
- Micro batch size: 1 (Micro batch size refers to the size of a small subset of data samples processed at each step during training by each process/GPU.)
- Global batch size: 128 (Global batch size is the total number of data samples processed in a single training step. Global batch size = Micro batch size * Data Parallel Size * Gradient Accumulation steps. See the NVIDIA NeMo documentation.)
- Number of steps: A training step is a single iteration of the training process in which the model is updated based on a batch of input data. Each step involves a forward and backward pass through the model to calculate gradients, which are then used to update the model's parameters. The time to convergence depends on the dataset, the model, and the hyperparameters; typically, models converge anywhere between 1000 and 2000 steps. We used the following number of steps:
- For LoRA and P-tuning with Llama 2 7B and 13B models, we used 1000 steps.
- For LoRA and P-tuning with Llama 2 70B and for SFT for all three Llama models, we used 50 steps and the results presented are extrapolated values.
- We used Tensor Parallelism and Pipeline Parallelism (see Table 11).
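The dataset and batch-size figures above can be sanity-checked with a few lines of Python. The dataset counts and batch sizes come from the configuration above; the data-parallel size of 8 is an assumed example, since the actual value depends on the GPU count and the tensor/pipeline parallelism layout:

```python
# Dataset statistics from the configuration above.
total_samples = 12012
total_tokens = 1_375_803
avg_tokens = total_tokens / total_samples
print(round(avg_tokens, 2))  # 114.54

# Global batch size = micro batch size * data parallel size * gradient accumulation steps.
micro_batch_size = 1
global_batch_size = 128
data_parallel_size = 8  # assumption for illustration only
grad_accum_steps = global_batch_size // (micro_batch_size * data_parallel_size)
print(grad_accum_steps)  # 16
```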
Note: The results presented here are based on our validation process and are not tuned for best performance. Therefore, do not use these findings for direct performance comparisons between the server and other hardware. As we continue our work, this document will be revised to reflect the most effective configurations. Dell Technologies has published benchmarking results using MLPerf that you can use for performance comparison.
The following table shows the time that we measured for model customization. It includes the time to train the model for 1000 steps. It does not include the time to load the model, load the dataset, or perform validation.
Table 12. Time for model customization for various models and customization methods
| Model | GPU configuration | LoRA | P-tuning | SFT |
|---|---|---|---|---|
| Llama 2 7B | 4 x L40S | 352 | 549 | N/A |
| Llama 2 7B | 8 x L40S | 196 | 280 | 269 |
| Llama 2 7B | 16 x L40S | 135 | 107 | 148 |
| Llama 2 13B | 8 x L40S | 494 | 754 | N/A |
| Llama 2 13B | 16 x L40S | 270 | 381 | 405 |
| Llama 2 7B | 8 x H100 SXM | 79 | 78 | 163 |
| Llama 2 7B | 16 x H100 SXM | 40 | 40 | 84 |
| Llama 2 7B | 32 x H100 SXM | 20 | 20 | 45 |
| Llama 2 13B | 8 x H100 SXM | 180 | 176 | 380 |
| Llama 2 13B | 16 x H100 SXM | 95 | 89 | 185 |
| Llama 2 13B | 32 x H100 SXM | 48 | 46 | 100 |
| Llama 2 70B | 8 x H100 SXM | 1123 | 961 | N/A |
| Llama 2 70B | 16 x H100 SXM | 571 | N/A | N/A |
| Llama 2 70B | 32 x H100 SXM | 369 | 320 | N/A |
We made the following observations:
- All models scale well when GPU resources are added. LoRA and P-tuning scale nearly linearly on H100 SXM GPUs.
- Both LoRA and P-tuning require considerably less training time compared to SFT. This efficiency occurs because SFT involves updating all the model's parameters, whereas LoRA and P-tuning focus on modifying only a smaller subset of parameters. For many practical scenarios, beginning with LoRA or P-tuning as the preferred model customization approach is advisable.
- Running SFT model customization for the Llama 2 scenarios marked as N/A requires more resources than are available in those configurations.
- All L40S training times were measured for 100 steps and extrapolated to 1000 steps.
- With L40S, LoRA and SFT perform better than P-tuning. On L40S, pipeline parallelism provides better performance than tensor parallelism. Pipeline parallelism support with P-tuning was not available as of publication of this document.
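The extrapolation described in the observations is simple linear scaling of a short measured run to the full step count, which assumes the per-step time is constant. The function name and sample numbers below are illustrative, not measurements:

```python
def extrapolate(measured_time, measured_steps, target_steps):
    """Linearly extrapolate training time from a short run to target_steps.

    Assumes per-step time is constant, so total time scales with step count.
    """
    return measured_time * target_steps / measured_steps

# Hypothetical example: a 100-step run taking 50 time units
# projects to 500 time units at 1000 steps.
print(extrapolate(50, 100, 1000))  # 500.0
```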