Each node in the cluster used one Broadcom NetXtreme BCM57508 (1 x 100 GbE) NIC port connected to the Z9664F-ON switch. Network adapter bandwidth usage was measured during fine-tuning of the Llama 2 7B model with a batch size of 126 per GPU, where batch size is the number of training samples processed in one iteration.
| Metric | XE9680 | XE8545 | R760xa |
|---|---|---|---|
| Average received bandwidth (rx) | 9.25 Gbps | 9.17 Gbps | 9.26 Gbps |
| Average transmitted bandwidth (tx) | 9.18 Gbps | 9.25 Gbps | 9.25 Gbps |
| Average total bandwidth (rx + tx) | 18.4 Gbps | 18.4 Gbps | 18.5 Gbps |
| Maximum received bandwidth (rx) | 46.0 Gbps | 44.3 Gbps | 43.6 Gbps |
| Maximum transmitted bandwidth (tx) | 46.2 Gbps | 44.2 Gbps | 43.3 Gbps |
| Maximum total bandwidth (rx + tx) | 92.2 Gbps | 88.5 Gbps | 86.9 Gbps |
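The document does not state which tool collected these figures. As a rough illustration, the minimal sketch below samples the per-interface byte counters that Linux exposes in /proc/net/dev and converts the deltas to Gbps; the interface name `ens1f0` is a hypothetical placeholder, not the configuration used in the tests.

```python
import time

def read_counters(iface):
    """Return (rx_bytes, tx_bytes) for iface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            name, _, rest = line.partition(":")
            if name.strip() == iface:
                fields = rest.split()
                # Field 0 is cumulative rx bytes; field 8 is cumulative tx bytes.
                return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {iface!r} not found")

def sample_bandwidth(iface, interval=1.0):
    """Sample rx/tx bandwidth in Gbps over one interval."""
    rx0, tx0 = read_counters(iface)
    time.sleep(interval)
    rx1, tx1 = read_counters(iface)

    def to_gbps(delta_bytes):
        return delta_bytes * 8 / interval / 1e9

    return to_gbps(rx1 - rx0), to_gbps(tx1 - tx0)

if __name__ == "__main__":
    rx, tx = sample_bandwidth("ens1f0")  # hypothetical interface name
    print(f"rx {rx:.2f} Gbps, tx {tx:.2f} Gbps, total {rx + tx:.2f} Gbps")
```

Sampling once per second during a training run and keeping the running mean and maximum of rx, tx, and rx + tx would yield averages and peaks of the kind reported above.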
The testing phase included two validation tests, each making five passes (epochs) through the entire dataset. The first test ran on a single server; the second distributed the fine-tuning workload across all three servers in the cluster.
Running the workload alone, the XE8545 took two hours to complete one epoch. When the same workload was distributed across the three-node cluster, the time per epoch fell to 46 minutes, roughly a 2.6x speedup (120 min / 46 min ≈ 2.6). This demonstrates substantial time savings from training on a three-node cluster rather than a single node.
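As a rough illustration of how such a workload divides across nodes, the sketch below uses PyTorch's DistributedDataParallel; the tiny linear model, synthetic tensors, and batch size of 16 are placeholders, not the model or configuration validated in these tests.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process;
    # the NCCL backend carries inter-GPU and inter-node gradient traffic.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and data; a real run would load the Llama 2 7B
    # checkpoint and the fine-tuning dataset instead.
    model = torch.nn.Linear(4096, 4096).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    data = TensorDataset(torch.randn(1024, 4096), torch.randn(1024, 4096))

    # DistributedSampler shards the dataset so each rank trains on a
    # distinct slice; this is how the workload divides across nodes.
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=16, sampler=sampler)  # per-GPU batch

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(5):  # five passes (epochs), as in the validation tests
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP all-reduces gradients here, over the NICs
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

One way to launch such a script on each of the three nodes is `torchrun --nnodes=3 --nproc_per_node=<GPUs per node> --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 train.py`, in which case the gradient all-reduce traffic is what the 100 GbE links carry during each step.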
The network was not a bottleneck for this model: average total bandwidth stayed near 18.5 Gbps, and even the 92.2 Gbps peak remained within the capacity of the 100 GbE links. All components in this scenario ran at peak efficiency for this design.