Fine-tuning is a form of distributed training that builds on pretrained models to speed up the development of customized AI inferencing solutions. Instead of training a model from scratch, companies can apply fine-tuning methods to adapt a pretrained model with their own private data, cost-effectively creating solutions tailored to their needs.
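As a minimal illustration of this pattern, the sketch below freezes a pretrained model and trains only a new classification head on stand-in data. It assumes PyTorch and torchvision are available; the ResNet-50 model, the 10-class head, and the dummy dataset are illustrative assumptions, not part of any validated Dell design.

```python
# Minimal fine-tuning sketch (illustrative assumptions throughout):
# adapt a pretrained ResNet-50 to a hypothetical 10-class private dataset.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the pretrained backbone so only the new head is trained,
# which is what keeps fine-tuning cheaper than training from scratch.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head to match the private dataset.
model.fc = nn.Linear(model.fc.in_features, 10)

# Dummy stand-in for a company's private data: random 224x224 RGB
# tensors with integer labels, so the sketch runs end to end.
dummy = TensorDataset(torch.randn(32, 3, 224, 224),
                      torch.randint(0, 10, (32,)))
train_loader = DataLoader(dummy, batch_size=8)

optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()
```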
The network plays a key role in distributed fine-tuning. The server and storage traffic generated by distributed AI training demands a network architecture with low latency and high bandwidth. Large farms of GPUs have become commonplace for model training, and Ethernet networks must scale to keep up with the growing bandwidth demands as more GPUs are added to these clusters.
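To make that traffic concrete, the sketch below performs the gradient all-reduce that dominates network load in data-parallel training. It assumes a PyTorch/NCCL environment launched with torchrun; the tensor size and the launch command are illustrative assumptions.

```python
# Illustrative all-reduce sketch: in data-parallel training, every rank
# exchanges roughly a full copy of the gradients each step, so per-step
# network traffic grows with model size and GPU count.
# Assumed launch: torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    # NCCL carries inter-server traffic over the Ethernet fabric, which is
    # why switch bandwidth and latency bound how far a GPU cluster scales.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # 1 GiB of float32 values standing in for one step's gradients
    # (the size is an assumption for illustration).
    grads = torch.ones(256 * 1024 * 1024, device="cuda")

    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print(f"all-reduce complete across {dist.get_world_size()} ranks")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```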
Dell high performance computing (HPC) server clusters are used for parallel processing, deep learning, machine learning, training, and inference. Each of these workloads creates an enormous amount of network traffic that must pass seamlessly between all compute and storage systems. A poorly designed or misconfigured network introduces the bottlenecks and data loss that these workloads cannot tolerate.
The Dell PowerSwitch Z9664F-ON provides the high-speed, low-latency communication that AI workloads require, enabling simultaneous transfer of vast amounts of data between all components.
This document provides a starting point for anyone who wants a basic example of how an AI cluster can be created with only a few components, using a single network switch.