Model training, or pretraining, yields a foundational LLM by training it on a large corpus of data. We validated our design to confirm the functionality of the model training technique available in the NeMo framework. Our goal in this validation was not to train a model to convergence and produce a complete foundational model, but rather to train for a defined number of steps to achieve the goals described here.
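For reference, a pretraining run like this can be capped at a fixed step count through the launcher configuration. The following is a minimal sketch that assumes the NeMo Megatron launcher's Hydra-style command-line overrides; the recipe name and configuration keys (for example, training=llama/llama2_7b and training.trainer.max_steps) vary between launcher releases and should be checked against your version.

```python
# Minimal sketch: a fixed-step (500-step) Llama 2 7B pretraining run through the
# NeMo Megatron launcher. The recipe name and override keys are assumptions based
# on the launcher's Hydra-style configuration and may differ in your release.
import subprocess

overrides = [
    "stages=[training]",                            # run only the pretraining stage
    "training=llama/llama2_7b",                     # assumed Llama 2 7B training recipe
    "training.trainer.max_steps=500",               # stop after 500 training steps
    "training.model.micro_batch_size=1",
    "training.model.global_batch_size=144",
    "training.model.tensor_model_parallel_size=2",
    "training.model.pipeline_model_parallel_size=1",
]

# main.py is the entry point of the NeMo Megatron launcher repository
subprocess.run(["python3", "main.py", *overrides], check=True)
```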
The details of our validation setup are as follows:
Among the various LLMs available, we selected the 7B and 70B parameter variants of the Llama 2 architecture for training, based on several key factors.
The following table shows the tensor and pipeline parallelism values we used for each model. The NeMo Megatron launcher generates these values based on the model parameter size and the number of GPUs.
Table 3. Parallelism for Llama 2 architecture for training
Model | Tensor parallelism | Pipeline parallelism | Micro batch size | Global batch size | Sequence length |
Llama 2 7B | 2 | 1 | 1 | 144 | 4096 |
Llama 2 70B | 4 | 4 | 1 | 144 | 4096 |
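These settings determine how each global batch is laid out across the cluster: in Megatron-style training, the data-parallel size equals the total GPU count divided by the product of tensor and pipeline parallelism, and the global batch size must be a multiple of the micro batch size times the data-parallel size, with the remainder covered by gradient accumulation. The short sketch below reproduces that arithmetic for the Table 3 values, assuming 8 GPUs per PowerEdge XE9680 server.

```python
# Sketch of Megatron-style parallelism arithmetic for the Table 3 settings.
# Assumes 8 GPUs per PowerEdge XE9680 node.
GPUS_PER_NODE = 8

def batch_layout(nodes, tp, pp, micro_batch, global_batch):
    """Return (data_parallel_size, gradient_accumulation_steps)."""
    world_size = nodes * GPUS_PER_NODE
    data_parallel = world_size // (tp * pp)                 # model replicas
    assert global_batch % (micro_batch * data_parallel) == 0
    return data_parallel, global_batch // (micro_batch * data_parallel)

# Llama 2 7B on 2, 4, and 6 nodes (TP=2, PP=1, MBS=1, GBS=144)
for nodes in (2, 4, 6):
    print("Llama 2 7B ", nodes, batch_layout(nodes, tp=2, pp=1, micro_batch=1, global_batch=144))

# Llama 2 70B on 6 nodes (TP=4, PP=4, MBS=1, GBS=144)
print("Llama 2 70B", 6, batch_layout(6, tp=4, pp=4, micro_batch=1, global_batch=144))
```

For example, the 7B model on 2 nodes (16 GPUs) with tensor parallelism 2 gives a data-parallel size of 8, so a global batch of 144 corresponds to 18 gradient-accumulation micro-steps per training step.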
The following table shows the time that we measured for model pretraining. It includes the time to train the model for 500 steps. It does not include the time to initialize the model, load the dataset, or perform checkpointing and validation.
Table 4. Time in minutes for model training for various models
Model | Number of nodes | Training time for 500 steps, Configuration 1 (minutes) | Training time for 500 steps, Configuration 2 (minutes) |
Llama 2 7B | 2 | 61 | 63 |
Llama 2 7B | 4 | 32 | 35 |
Llama 2 7B | 6 | 22 | 25 |
Llama 2 70B | 6 | 230 | 244 |
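One way to interpret these numbers is as aggregate training throughput: each step processes global batch size × sequence length = 144 × 4096 ≈ 590 K tokens, so dividing the tokens processed in 500 steps by the elapsed time yields approximate tokens per second. The sketch below applies this conversion to the Configuration 1 column of Table 4; it only restates the arithmetic behind the table and is not an additional measurement.

```python
# Sketch: converting the Table 4 step times into approximate training throughput.
# Uses the Table 3 batch settings (global batch 144, sequence length 4096) and the
# Configuration 1 times; derived arithmetic only, not a new measurement.
STEPS = 500
GLOBAL_BATCH = 144
SEQ_LEN = 4096
TOKENS_PER_RUN = STEPS * GLOBAL_BATCH * SEQ_LEN   # ~295M tokens over 500 steps

runs = {  # (model, nodes): minutes for 500 steps, Configuration 1
    ("Llama 2 7B", 2): 61,
    ("Llama 2 7B", 4): 32,
    ("Llama 2 7B", 6): 22,
    ("Llama 2 70B", 6): 230,
}

for (model, nodes), minutes in runs.items():
    tokens_per_sec = TOKENS_PER_RUN / (minutes * 60)
    print(f"{model} on {nodes} nodes: ~{tokens_per_sec:,.0f} tokens/s")
```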
Note: All performance data contained in this report was obtained in a rigorously controlled environment. Results obtained in other operating environments may vary significantly. Dell Technologies does not warrant or represent that a user can or will achieve similar performance results.
With regard to performance results, please note the following:
While Configuration 1 performs slightly better, the improvement is not significant. For clusters with fewer than 8 nodes, Configuration 2 may offer a better cost-benefit ratio. However, in large clusters, network communication becomes crucial, and we recommend equipping each PowerEdge XE9680 server with 8 NVIDIA ConnectX-7 adapters.
Note that training results are highly dependent upon workload, specific application requirements, and system design and implementation. Relative system performance will vary as a result of these and other factors. Therefore, this workload should not be used as a substitute for a specific customer application benchmark when critical capacity planning or product evaluation decisions are contemplated. For benchmarking on PowerEdge servers, refer to the MLPerf benchmarking page.