The following table shows read and write performance during the initial epoch and the checkpoint operation for both the 7B and 70B parameter models, covering the dataset load, checkpointing, and validation phases.
We evaluated two configurations:
| Model | # of XE9680 Servers | Time (minutes) | Checkpoint Size | Checkpoint Time (mm:ss) | Peak Read Throughput | Peak Write Throughput |
|---|---|---|---|---|---|---|
| Llama 2 7B | 1 | 3.32 | 100 GB | 2:30 | 827 KB/s | 947 MB/s |
| Llama 2 70B | 6 | 9.27 | 1.1 TB | 3:28 | 3.8 MB/s | 7.02 GB/s |
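The sustained write throughput a checkpoint demands can be approximated by dividing the checkpoint size by its duration. A minimal sketch using the figures from the table above (the helper name is illustrative, not part of any Dell tooling):

```python
# Estimate the sustained write throughput each checkpoint requires,
# using the checkpoint sizes and durations from the table above.

def sustained_write_gbps(size_gb: float, minutes: int, seconds: int) -> float:
    """Average write rate in GB/s needed to land a checkpoint in the given time."""
    duration_s = minutes * 60 + seconds
    return size_gb / duration_s

# Llama 2 7B: 100 GB in 2:30
print(round(sustained_write_gbps(100, 2, 30), 2))    # ~0.67 GB/s
# Llama 2 70B: 1.1 TB (1100 GB) in 3:28
print(round(sustained_write_gbps(1100, 3, 28), 2))   # ~5.29 GB/s
```

Both averages sit below the measured peaks (947 MB/s and 7.02 GB/s), which is consistent with checkpoint writes arriving in bursts rather than as a steady stream.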
The initial data load for both model examples had little performance impact on the storage. This is expected: most language- and text-based models have comparatively small datasets, so the data-load portion of training places minimal demand on the storage, which accounts for the low read activity on the file system.
The checkpoint data is more interesting. The different parameter counts of the two examples show the impact of model size on the write throughput required of the OneFS file system during the checkpoint operation: checkpointing the 70B parameter model required significantly more write throughput than checkpointing the 7B parameter model.
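The roughly tenfold jump in checkpoint size tracks the parameter count. As a rough back-of-the-envelope check, assuming mixed-precision training with an Adam-style optimizer (about 2 bytes of fp16 weights plus about 12 bytes of fp32 optimizer state and master weights per parameter; the exact contents of these checkpoints are not specified in this paper):

```python
# Rough checkpoint-size estimate for mixed-precision Adam-style training.
# Assumption: ~14 bytes/parameter (2 B fp16 weights + 12 B fp32 optimizer
# state and master weights). Actual checkpoint layouts vary by framework.

BYTES_PER_PARAM = 14

def checkpoint_size_gb(n_params_billion: float) -> float:
    """Estimated checkpoint size in GB for a model of the given parameter count."""
    return n_params_billion * 1e9 * BYTES_PER_PARAM / 1e9

print(checkpoint_size_gb(7))   # 98.0 GB, close to the observed 100 GB
print(checkpoint_size_gb(70))  # 980.0 GB, the same order as the observed 1.1 TB
```

The estimate lands near the measured sizes, suggesting these checkpoints include optimizer state rather than model weights alone; this is an inference from the numbers, not a statement from the test report.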
Note that benchmark results are highly dependent upon workload, specific application requirements, and system design and implementation. Relative system performance will vary due to these factors. Therefore, this workload should not be used as a substitute for a specific customer application benchmark when critical capacity planning and/or product evaluation decisions are contemplated. For benchmarking on Dell PowerEdge servers, refer to the MLPerf benchmarking page.