The following table shows read and write performance during the initial epoch and the checkpoint operation for both the 7B and 70B parameter models, covering the dataset load, checkpointing, and validation phases.
We evaluated two configurations:
| Model | # of XE9680 Servers | Time (minutes) | Checkpoint Size | Checkpoint Time (mm:ss) | Peak Read Throughput | Peak Write Throughput |
|---|---|---|---|---|---|---|
| Llama 2 7B | 1 | 3.32 | 100 GB | 2:30 | 827 KB/s | 947 MB/s |
| Llama 2 70B | 6 | 9.27 | 1.1 TB | 3:28 | 3.8 MB/s | 7.02 GB/s |
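The sustained write throughput a checkpoint demands can be approximated by dividing the checkpoint size by its duration. A minimal sketch using the figures from the table above (the helper name is illustrative, not part of any Dell tooling):

```python
# Estimate the sustained write throughput each checkpoint requires,
# using the checkpoint sizes and durations from the table above.

def sustained_write_gbps(size_gb: float, minutes: int, seconds: int) -> float:
    """Average write rate in GB/s needed to land a checkpoint in the given time."""
    duration_s = minutes * 60 + seconds
    return size_gb / duration_s

# Llama 2 7B: 100 GB in 2:30
print(round(sustained_write_gbps(100, 2, 30), 2))    # ~0.67 GB/s
# Llama 2 70B: 1.1 TB (1100 GB) in 3:28
print(round(sustained_write_gbps(1100, 3, 28), 2))   # ~5.29 GB/s
```

Both averages sit below the measured peaks (947 MB/s and 7.02 GB/s), which is consistent with checkpoint writes arriving in bursts rather than as a steady stream.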
The initial data load for both model examples had little performance impact on the storage. This is expected: most language- and text-based models have comparatively small datasets, so the data-load portion of training places minimal demand on the storage, which accounts for the low read activity on the file system.
The checkpoint data is more interesting. The different parameter counts of the two examples show the impact of model size on the write throughput required of the OneFS file system during the checkpoint operation: checkpointing the 70B parameter model required significantly more write throughput than checkpointing the 7B parameter model.
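The roughly tenfold jump in checkpoint size tracks the parameter count. As a rough back-of-the-envelope check, assuming mixed-precision training with an Adam-style optimizer (about 2 bytes of fp16 weights plus about 12 bytes of fp32 optimizer state and master weights per parameter; the exact contents of these checkpoints are not specified in this paper):

```python
# Rough checkpoint-size estimate for mixed-precision Adam-style training.
# Assumption: ~14 bytes/parameter (2 B fp16 weights + 12 B fp32 optimizer
# state and master weights). Actual checkpoint layouts vary by framework.

BYTES_PER_PARAM = 14

def checkpoint_size_gb(n_params_billion: float) -> float:
    """Estimated checkpoint size in GB for a model of the given parameter count."""
    return n_params_billion * 1e9 * BYTES_PER_PARAM / 1e9

print(checkpoint_size_gb(7))   # 98.0 GB, close to the observed 100 GB
print(checkpoint_size_gb(70))  # 980.0 GB, the same order as the observed 1.1 TB
```

The estimate lands near the measured sizes, suggesting these checkpoints include optimizer state rather than model weights alone; this is an inference from the numbers, not a statement from the test report.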
Note that benchmark results are highly dependent upon workload, specific application requirements, and system design and implementation. Relative system performance will vary due to these factors. Therefore, this workload should not be used as a substitute for a specific customer application benchmark when critical capacity planning and/or product evaluation decisions are contemplated. For benchmarking on Dell PowerEdge servers, refer to the MLPerf benchmarking page.