Each node in the cluster used one Broadcom NetXtreme BCM57508 (1 x 100 GbE) NIC port connected to the Z9664F-ON switch. Network adapter bandwidth usage was measured during fine-tuning of the Llama 2 7B model with a batch size of 126 per GPU, where batch size is the number of training samples processed in one iteration.
| Metric | XE9680 | XE8545 | R760xa |
|---|---|---|---|
| Average received bandwidth (rx) | 9.25 Gbps | 9.17 Gbps | 9.26 Gbps |
| Average transmitted bandwidth (tx) | 9.18 Gbps | 9.25 Gbps | 9.25 Gbps |
| Average total bandwidth (rx + tx) | 18.4 Gbps | 18.4 Gbps | 18.5 Gbps |
| Maximum received bandwidth (rx) | 46.0 Gbps | 44.3 Gbps | 43.6 Gbps |
| Maximum transmitted bandwidth (tx) | 46.2 Gbps | 44.2 Gbps | 43.3 Gbps |
| Maximum total bandwidth (rx + tx) | 92.2 Gbps | 88.5 Gbps | 86.9 Gbps |
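The document does not state which tool collected these figures. As a rough illustration, the minimal sketch below samples the per-interface byte counters that Linux exposes in /proc/net/dev and converts the deltas to Gbps; the interface name `ens1f0` is a hypothetical placeholder, not the configuration used in the tests.

```python
import time

def read_counters(iface):
    """Return (rx_bytes, tx_bytes) for iface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            name, _, rest = line.partition(":")
            if name.strip() == iface:
                fields = rest.split()
                # Field 0 is cumulative rx bytes; field 8 is cumulative tx bytes.
                return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {iface!r} not found")

def sample_bandwidth(iface, interval=1.0):
    """Sample rx/tx bandwidth in Gbps over one interval."""
    rx0, tx0 = read_counters(iface)
    time.sleep(interval)
    rx1, tx1 = read_counters(iface)

    def to_gbps(delta_bytes):
        return delta_bytes * 8 / interval / 1e9

    return to_gbps(rx1 - rx0), to_gbps(tx1 - tx0)

if __name__ == "__main__":
    rx, tx = sample_bandwidth("ens1f0")  # hypothetical interface name
    print(f"rx {rx:.2f} Gbps, tx {tx:.2f} Gbps, total {rx + tx:.2f} Gbps")
```

Sampling once per second during a training run and keeping the running mean and maximum of rx, tx, and rx + tx would yield averages and peaks of the kind reported above.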
The testing phase included two validation tests, each making five passes (epochs) through the entire dataset. The first test ran on a single server; the second distributed the fine-tuning workload across all three servers in the cluster.
Running the workload alone, the XE8545 took two hours to complete one epoch. When the same workload was distributed across the three-node cluster, the time per epoch fell to 46 minutes, roughly a 2.6x speedup (120 min / 46 min ≈ 2.6). This demonstrates substantial time savings from training on a three-node cluster rather than a single node.
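As a rough illustration of how such a workload divides across nodes, the sketch below uses PyTorch's DistributedDataParallel; the tiny linear model, synthetic tensors, and batch size of 16 are placeholders, not the model or configuration validated in these tests.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process;
    # the NCCL backend carries inter-GPU and inter-node gradient traffic.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and data; a real run would load the Llama 2 7B
    # checkpoint and the fine-tuning dataset instead.
    model = torch.nn.Linear(4096, 4096).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    data = TensorDataset(torch.randn(1024, 4096), torch.randn(1024, 4096))

    # DistributedSampler shards the dataset so each rank trains on a
    # distinct slice; this is how the workload divides across nodes.
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=16, sampler=sampler)  # per-GPU batch

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(5):  # five passes (epochs), as in the validation tests
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP all-reduces gradients here, over the NICs
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

One way to launch such a script on each of the three nodes is `torchrun --nnodes=3 --nproc_per_node=<GPUs per node> --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 train.py`, in which case the gradient all-reduce traffic is what the 100 GbE links carry during each step.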
The network was not a bottleneck for this model: average total bandwidth stayed near 18.5 Gbps, and even the 92.2 Gbps peak remained within the capacity of the 100 GbE links. All components in this scenario ran at peak efficiency for this design.