The goal of our validation is to ensure the reliability, optimal performance, scalability, and interoperability of the system. Model customization yields an LLM that incorporates domain-specific knowledge. We validated our design to ensure the functionality of model customization techniques available in the NeMo framework. Our goal in this validation was not to train a model to convergence. The following list provides the details of our validation setup:
Distributed checkpointing─ Multinode SFT employs distributed checkpointing, which requires worker nodes to access folders created by their counterparts. We are working to determine the best way to make the volumes available to all the worker nodes.
Time for training─ Usually, data scientists train a model until it reaches convergence, a point influenced by factors such as the dataset, model complexity, and chosen hyperparameters. Our aim was not to achieve convergence for every scenario, because convergence depends on our chosen dataset and parameters and offers limited insight into a customer's needs. To maintain a consistent metric across all scenarios, we ran each training job for a minimum of 50 steps.
The following table provides examples of prompt engineering results from our validation work and shows how the model's response can be shaped using specific prompts:
Table 8. Examples of prompt engineering for describing the earth[1]
| Prompt | Response from Llama 2 70B model |
|---|---|
| In a sentence can you creatively describe the planet earth to a five-year-old? | I’m not sure I can do it in a sentence, but I’ll give it a try. The Earth is a big, round ball that is covered with water and land. |
| How would you concisely describe planet earth to a high school senior? | I would say that the Earth is a planet that is the only one in the solar system that has life on it. It is the third planet from the Sun and is the largest of the terrestrial planets. |
| Can you write a multiline poem to describe the planet earth? | The earth is a beautiful place, With its mountains and valleys, And its oceans and seas. It’s a place of wonder and awe, Where life can be found in abundance, And where the sun shines bright. The earth is a place of mystery, With its secrets yet to be discovered, And its beauty yet to be seen. It’s a place of hope and promise, Where dreams can come true, And where the future is bright. The earth is a place of love, Where people can come together, And where peace can be found. |
[1] The results are from the Llama 2 70B model with the following parameters: top_k=1, top_p=0.1, and temperature=0.5.
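As a rough illustration of how prompts and the sampling parameters listed above might be submitted programmatically, the following minimal Python sketch posts a prompt to a generic text-completion endpoint. The endpoint URL and the request and response schemas are assumptions for illustration only, not part of the validated deployment; the sampling values mirror those in the footnote.

```python
import requests

# Hypothetical inference endpoint; replace with the URL of your deployed Llama 2 70B service.
INFERENCE_URL = "http://localhost:8080/v1/completions"  # assumption, not from the validated setup


def describe_earth(prompt: str) -> str:
    """Send a prompt using the sampling parameters from Table 8 (top_k=1, top_p=0.1, temperature=0.5)."""
    payload = {
        "prompt": prompt,
        "max_tokens": 256,   # illustrative value
        "top_k": 1,
        "top_p": 0.1,
        "temperature": 0.5,
    }
    response = requests.post(INFERENCE_URL, json=payload, timeout=120)
    response.raise_for_status()
    # The response schema below is assumed; adjust the key names to match your serving stack.
    return response.json()["choices"][0]["text"]


if __name__ == "__main__":
    print(describe_earth(
        "In a sentence can you creatively describe the planet earth to a five-year-old?"
    ))
```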
The following table summarizes the scenarios we validated and the configuration that we used.
Table 9. Validated scenario configuration
| | Llama 2 7B | Llama 2 13B | Llama 2 70B |
|---|---|---|---|
| Validated on a single PowerEdge XE9680 server | | | |
| SFT | Number of GPUs: 8<br>TP: 2<br>PP: 1<br>Maximum number of steps: 50 | Number of GPUs: 8<br>TP: 4<br>PP: 1<br>Maximum number of steps: 50 | Resources available in a single server are insufficient to perform this scenario. |
| P-Tuning | Number of GPUs: 4, 8<br>TP: 1<br>PP: 1<br>Maximum number of steps: 1000 | Number of GPUs: 4, 8<br>TP: 2<br>PP: 1<br>Maximum number of steps: 1000 | Number of GPUs: 8<br>TP: 8<br>PP: 1<br>Maximum number of steps: 1000 |
| LoRA | Number of GPUs: 4, 8<br>TP: 1<br>PP: 1<br>Maximum number of steps: 1000 | Number of GPUs: 4, 8<br>TP: 2<br>PP: 1<br>Maximum number of steps: 1000 | Number of GPUs: 8<br>TP: 8<br>PP: 1<br>Maximum number of steps: 1000 |
| Validated on two PowerEdge XE9680 servers (multinode) | | | |
| SFT | Number of GPUs: 16<br>TP: 2<br>PP: 1<br>Maximum number of steps: 50 | Number of GPUs: 16<br>TP: 4<br>PP: 1<br>Maximum number of steps: 50 | Resources available in two servers are insufficient to perform this scenario. |
| P-Tuning | Number of GPUs: 16<br>TP: 1<br>PP: 1<br>Maximum number of steps: 1000 | Number of GPUs: 16<br>TP: 2<br>PP: 1<br>Maximum number of steps: 1000 | Number of GPUs: 16<br>TP: 8<br>PP: 1<br>Maximum number of steps: 1000 |
| LoRA | Number of GPUs: 16<br>TP: 1<br>PP: 1<br>Maximum number of steps: 1000 | Number of GPUs: 16<br>TP: 2<br>PP: 1<br>Maximum number of steps: 1000 | Number of GPUs: 16<br>TP: 8<br>PP: 1<br>Maximum number of steps: 1000 |
In the preceding table, TP refers to Tensor Parallelism, a technique that distributes the computation and memory of a single layer or operation in a neural network across multiple GPUs. It is particularly useful for models that are too large to fit in the memory of a single device. TP splits the weight matrices and activations of individual layers into smaller tensors and distributes them across multiple devices, enabling parallel processing and reducing the time required for training.
PP refers to Pipeline Parallelism, a technique that improves training efficiency by splitting the model into smaller, connected stages. Each stage holds a contiguous subset of the model's layers and passes its activations sequentially to the next stage. Processing micro-batches in a pipelined fashion allows computation and communication to overlap, reducing idle time and improving overall training speed. PP is especially useful for very deep models with many layers, as it helps exploit parallelism in the forward and backward passes of the neural network.
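The GPU counts in Table 9 relate to the parallelism settings in a simple way: each model replica spans TP × PP GPUs, and the remaining factor becomes the data-parallel size. The following short Python sketch is an illustration of that relationship (it is not part of the NeMo framework code), using configurations such as those in Table 9.

```python
def data_parallel_size(num_gpus: int, tp: int, pp: int) -> int:
    """Return the data-parallel size implied by a GPU count and TP/PP settings.

    Each model replica spans tp * pp GPUs, so the GPU count must be divisible
    by tp * pp; the quotient is the number of replicas trained in data parallel.
    """
    model_parallel_size = tp * pp
    if num_gpus % model_parallel_size != 0:
        raise ValueError("num_gpus must be divisible by TP * PP")
    return num_gpus // model_parallel_size


# Examples drawn from Table 9:
print(data_parallel_size(num_gpus=8, tp=2, pp=1))   # Llama 2 7B SFT on one server  -> 4 replicas
print(data_parallel_size(num_gpus=16, tp=4, pp=1))  # Llama 2 13B SFT on two servers -> 4 replicas
print(data_parallel_size(num_gpus=8, tp=8, pp=1))   # Llama 2 70B P-Tuning on one server -> 1 replica
```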
To illustrate the validation, we provided a few example results. The appendix includes the Slurm batch file that initiates the model customization job.
The following figure illustrates the training loss for Llama 2 70B on a single-node, 8-GPU run. Note that the graphs illustrate the validation of model customization, not necessarily the optimal hyperparameter configuration (such as learning rate):
Figure 7. Training loss for Llama 2 70B
For the same experiments, the following figure shows how the learning rate adapted during model customization:
Figure 8. Learning rate adapted during model customization
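Curves like those in Figures 7 and 8 can be recreated from the TensorBoard event files written during training. The sketch below shows one way to do so; the log directory and the scalar tag names ("reduced_train_loss" and "lr") are assumptions for illustration and may differ in your run.

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
import matplotlib.pyplot as plt

# Path to the TensorBoard event files written by the training job (assumed path).
LOG_DIR = "/results/llama2_70b_customization/tensorboard"

acc = EventAccumulator(LOG_DIR)
acc.Reload()


def scalar_series(tag: str):
    """Return (steps, values) for a logged scalar tag."""
    events = acc.Scalars(tag)
    return [e.step for e in events], [e.value for e in events]


fig, (ax_loss, ax_lr) = plt.subplots(1, 2, figsize=(10, 4))

steps, loss = scalar_series("reduced_train_loss")  # assumed tag name for training loss
ax_loss.plot(steps, loss)
ax_loss.set_xlabel("Step")
ax_loss.set_ylabel("Training loss")

steps, lr = scalar_series("lr")  # assumed tag name for learning rate
ax_lr.plot(steps, lr)
ax_lr.set_xlabel("Step")
ax_lr.set_ylabel("Learning rate")

fig.tight_layout()
fig.savefig("customization_curves.png")
```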