This test shows how a single switch can be used for distributed training with a limited number of servers, but most Large Language Model (LLM) deployments will have multiple switches making up an AI fabric. An LLM deployment can contain up to four distinct fabrics: a GPU fabric, a storage fabric, an in-band management fabric, and an out-of-band (OOB) management fabric. Some of these may be optional depending on individual requirements.
The GPU fabric is a dedicated network fabric that provides GPU-to-GPU connectivity for AI/ML training or inference jobs. During distributed training, GPUs exchange gradients and synchronize model parameters over this fabric. Ethernet solutions are evolving as the preferred choice for these fabrics. Each GPU server has 8x400G or 8x(2x200G) connectivity to leaf switches, which in turn connect to spines and super spines. Of all the fabric types, this one has the most stringent requirements for low latency and lossless Ethernet.
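To make those bandwidth demands concrete, the back-of-the-envelope sketch below estimates per-GPU traffic for one gradient synchronization using the standard ring all-reduce cost model. The model size, precision, and cluster figures are illustrative assumptions, not measurements from this test.

```python
# Back-of-the-envelope estimate of GPU-fabric traffic for one training step.
# Assumptions (illustrative): a 70B-parameter model, fp16 gradients, and a
# ring all-reduce across 8 GPUs, each with one 400G port into the fabric.

model_params = 70e9          # parameters (assumed)
bytes_per_grad = 2           # fp16 gradient
n_gpus = 8                   # GPUs in the all-reduce ring (assumed)

grad_bytes = model_params * bytes_per_grad

# Ring all-reduce sends and receives 2*(n-1)/n of the buffer per GPU.
per_gpu_bytes = 2 * (n_gpus - 1) / n_gpus * grad_bytes

link_gbps = 400              # one 400G port per GPU
link_bytes_per_s = link_gbps / 8 * 1e9

print(f"All-reduce traffic per GPU: {per_gpu_bytes / 1e9:.1f} GB")
print(f"Time on one 400G link:      {per_gpu_bytes / link_bytes_per_s:.2f} s")
```

Even this rough arithmetic shows why the GPU fabric must run at line rate with no loss: hundreds of gigabytes cross each link on every synchronization, and any retransmission stalls all GPUs in the collective.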
The storage fabric provides access to a large-scale shared storage infrastructure. GPUs use this shared storage to read training datasets and to write model checkpoints during AI/ML training or inference jobs. These are typically 100 GbE to 200 GbE fabrics.
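A quick sizing sketch helps explain the 100 GbE to 200 GbE figure. The example below estimates the time to stream a model checkpoint over a single storage link; the model size and precision are assumed values, and real throughput would be lower due to protocol and file-system overhead.

```python
# Rough estimate of checkpoint write time over the storage fabric.
# Assumptions (illustrative): a 70B-parameter model checkpointed in fp16,
# streamed over a single 200 GbE connection at line rate.

checkpoint_bytes = 70e9 * 2          # fp16 weights only (assumed)
fabric_gbps = 200                    # 200 GbE storage link
fabric_bytes_per_s = fabric_gbps / 8 * 1e9

seconds = checkpoint_bytes / fabric_bytes_per_s
print(f"Checkpoint size: {checkpoint_bytes / 1e9:.0f} GB, "
      f"write time at line rate: {seconds:.0f} s")
```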
The in-band management fabric is used to distribute AI/ML jobs onto the data center back-end network. This is the network that prioritizes, batches, and allocates the necessary resources (GPUs, storage, network) for AI/ML applications. These too are typically 100 GbE to 200 GbE fabrics, since large amounts of data are sent to each GPU for processing.
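The sketch below illustrates the kind of resource bookkeeping this layer performs when placing jobs. The class, function, and server names are hypothetical; production schedulers such as Slurm or Kubernetes handle this far more comprehensively.

```python
# Minimal sketch of job placement as an in-band management plane might do it.
# All names are hypothetical; this greedily assigns each job to the first
# server with enough free GPUs, largest jobs first.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus_needed: int

def place_jobs(jobs, free_gpus_per_server):
    """Assign each job to the first server with enough free GPUs."""
    placement = {}
    for job in sorted(jobs, key=lambda j: j.gpus_needed, reverse=True):
        for server, free in free_gpus_per_server.items():
            if free >= job.gpus_needed:
                placement[job.name] = server
                free_gpus_per_server[server] -= job.gpus_needed
                break
    return placement

jobs = [Job("llm-pretrain", 8), Job("finetune", 4), Job("eval", 2)]
print(place_jobs(jobs, {"server-1": 8, "server-2": 8}))
# {'llm-pretrain': 'server-1', 'finetune': 'server-2', 'eval': 'server-2'}
```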
An OOB management fabric gives network administrators access to manage each server, storage device, and network switch. This is the traditional OOB network, which may use a 1 GbE connection to reach server iDRACs and the OOB ports on the storage and network devices across the different fabrics.
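As a concrete example of OOB access, the sketch below queries a server's baseboard management controller (such as a Dell iDRAC) over the standard Redfish REST API. The address and credentials are placeholders; verify=False is shown only because BMCs commonly ship with self-signed certificates, and a production script should validate the certificate instead.

```python
# Minimal sketch of querying a server BMC over the OOB network via Redfish.
# The BMC address and credentials below are placeholders.
import requests

BMC = "https://192.0.2.10"    # iDRAC address on the OOB network (placeholder)
AUTH = ("admin", "password")  # placeholder credentials

resp = requests.get(f"{BMC}/redfish/v1/Systems",
                    auth=AUTH, verify=False, timeout=10)
resp.raise_for_status()

# List the systems the BMC manages, e.g. /redfish/v1/Systems/System.Embedded.1
for member in resp.json().get("Members", []):
    print(member["@odata.id"])
```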