Model training, or pretraining, yields a foundational LLM by training it on a large corpus of data. We validated our design to confirm the functionality of the model training technique available in the NeMo framework. Our goal in this validation was not to train a model to convergence and produce a complete foundational model, but rather to train for a defined number of steps to achieve the goals described here.
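For reference, a pretraining run like this can be capped at a fixed step count through the launcher configuration. The following is a minimal sketch that assumes the NeMo Megatron launcher's Hydra-style command-line overrides; the recipe name and configuration keys (for example, training=llama/llama2_7b and training.trainer.max_steps) vary between launcher releases and should be checked against your version.

```python
# Minimal sketch: a fixed-step (500-step) Llama 2 7B pretraining run through the
# NeMo Megatron launcher. The recipe name and override keys are assumptions based
# on the launcher's Hydra-style configuration and may differ in your release.
import subprocess

overrides = [
    "stages=[training]",                            # run only the pretraining stage
    "training=llama/llama2_7b",                     # assumed Llama 2 7B training recipe
    "training.trainer.max_steps=500",               # stop after 500 training steps
    "training.model.micro_batch_size=1",
    "training.model.global_batch_size=144",
    "training.model.tensor_model_parallel_size=2",
    "training.model.pipeline_model_parallel_size=1",
]

# main.py is the entry point of the NeMo Megatron launcher repository
subprocess.run(["python3", "main.py", *overrides], check=True)
```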
The details of our validation setup are as follows:
Among the various LLMs available, we selected the 7B and 70B parameter variants of the Llama 2 architecture for training, based on several key factors.
The following table shows the tensor and pipeline parallelism values we used for each model. The NeMo Megatron launcher generates these values based on the model parameter size and the number of GPUs.
Table 3. Parallelism for Llama 2 architecture for training
Model | Tensor parallelism | Pipeline parallelism | Micro batch size | Global batch size | Sequence length |
Llama 2 7B | 2 | 1 | 1 | 144 | 4096 |
Llama 2 70B | 4 | 4 | 1 | 144 | 4096 |
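These settings determine how each global batch is laid out across the cluster: in Megatron-style training, the data-parallel size equals the total GPU count divided by the product of tensor and pipeline parallelism, and the global batch size must be a multiple of the micro batch size times the data-parallel size, with the remainder covered by gradient accumulation. The short sketch below reproduces that arithmetic for the Table 3 values, assuming 8 GPUs per PowerEdge XE9680 server.

```python
# Sketch of Megatron-style parallelism arithmetic for the Table 3 settings.
# Assumes 8 GPUs per PowerEdge XE9680 node.
GPUS_PER_NODE = 8

def batch_layout(nodes, tp, pp, micro_batch, global_batch):
    """Return (data_parallel_size, gradient_accumulation_steps)."""
    world_size = nodes * GPUS_PER_NODE
    data_parallel = world_size // (tp * pp)                 # model replicas
    assert global_batch % (micro_batch * data_parallel) == 0
    return data_parallel, global_batch // (micro_batch * data_parallel)

# Llama 2 7B on 2, 4, and 6 nodes (TP=2, PP=1, MBS=1, GBS=144)
for nodes in (2, 4, 6):
    print("Llama 2 7B ", nodes, batch_layout(nodes, tp=2, pp=1, micro_batch=1, global_batch=144))

# Llama 2 70B on 6 nodes (TP=4, PP=4, MBS=1, GBS=144)
print("Llama 2 70B", 6, batch_layout(6, tp=4, pp=4, micro_batch=1, global_batch=144))
```

For example, the 7B model on 2 nodes (16 GPUs) with tensor parallelism 2 gives a data-parallel size of 8, so a global batch of 144 corresponds to 18 gradient-accumulation micro-steps per training step.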
The following table shows the time that we measured for model pretraining. It includes the time to train the model for 500 steps. It does not include the time to initialize the model, load the dataset, or perform checkpointing and validation.
Table 4. Time in minutes for model training for various models
Model | Number of nodes | Training time for 500 steps, Configuration 1 (minutes) | Training time for 500 steps, Configuration 2 (minutes) |
Llama 2 7B | 2 | 61 | 63 |
Llama 2 7B | 4 | 32 | 35 |
Llama 2 7B | 6 | 22 | 25 |
Llama 2 70B | 6 | 230 | 244 |
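One way to interpret these numbers is as aggregate training throughput: each step processes global batch size × sequence length = 144 × 4096 ≈ 590 K tokens, so dividing the tokens processed in 500 steps by the elapsed time yields approximate tokens per second. The sketch below applies this conversion to the Configuration 1 column of Table 4; it only restates the arithmetic behind the table and is not an additional measurement.

```python
# Sketch: converting the Table 4 step times into approximate training throughput.
# Uses the Table 3 batch settings (global batch 144, sequence length 4096) and the
# Configuration 1 times; derived arithmetic only, not a new measurement.
STEPS = 500
GLOBAL_BATCH = 144
SEQ_LEN = 4096
TOKENS_PER_RUN = STEPS * GLOBAL_BATCH * SEQ_LEN   # ~295M tokens over 500 steps

runs = {  # (model, nodes): minutes for 500 steps, Configuration 1
    ("Llama 2 7B", 2): 61,
    ("Llama 2 7B", 4): 32,
    ("Llama 2 7B", 6): 22,
    ("Llama 2 70B", 6): 230,
}

for (model, nodes), minutes in runs.items():
    tokens_per_sec = TOKENS_PER_RUN / (minutes * 60)
    print(f"{model} on {nodes} nodes: ~{tokens_per_sec:,.0f} tokens/s")
```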
Note: All performance data contained in this report was obtained in a rigorously controlled environment. Results obtained in other operating environments may vary significantly. Dell Technologies does not warrant or represent that a user can or will achieve similar performance results.
With regard to performance results, please note the following:
While Configuration 1 performs slightly better, the improvement is not significant. For clusters with fewer than 8 nodes, Configuration 2 may offer a better cost-benefit ratio. However, in large clusters, network communication becomes crucial, and we recommend equipping each PowerEdge XE9680 server with 8 NVIDIA ConnectX-7 adapters.
Note that training results are highly dependent upon workload, specific application requirements, and system design and implementation. Relative system performance will vary as a result of these and other factors. Therefore, this workload should not be used as a substitute for a specific customer application benchmark when critical capacity planning or product evaluation decisions are contemplated. For benchmarking on PowerEdge servers, refer to the MLPerf benchmarking page.