To become the fabric interconnect of choice for AI workloads, the AI Ethernet fabric must deliver the following key features and build on open standards:
Interoperable
The Dell AI fabric must operate at the highest level while building on a well-established network ecosystem based on proven open standards such as Ethernet. An Ethernet-based approach keeps the architecture flexible.
High-Performance
AI workloads are unique in that they require specific network or fabric properties to perform optimally.
With Dell Enterprise SONiC as the network operating system, the fabric gains an AI-specific feature set that includes cut-through switching, RDMA over Converged Ethernet version 2 (RoCEv2), dynamic load balancing (DLB), and enhanced hashing to deliver the network performance that supports AI workloads. For details about these requirements, see Figure 2. In addition to the software feature set provided by Dell Enterprise SONiC, the PowerSwitch Z-series provides the necessary non-blocking 800GbE and 400GbE switching fabric capacity.
Scalable
With the Dell networking product portfolio, the AI fabric can be customized to support a range of GPU cluster deployments. The requirements of AI workloads can range from a single GPU to a cluster of multiple GPUs. This whitepaper explores some relevant AI fabric topologies: single switch, TOR-wired Clos topology, Pure Rail topology, and Rail Optimized topology.
Lossless fabric
To support RoCEv2, an Ethernet fabric must use priority flow control (PFC) to be lossless. Additional congestion control techniques, such as Data Center Quantized Congestion Notification (DCQCN), are also needed.
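As an illustration of how PFC keeps a priority lossless, the following minimal sketch models a per-priority ingress buffer with pause and resume thresholds. The class name, the threshold values, and the "cells" unit are hypothetical and chosen only for illustration. Real switches derive these thresholds from buffer size, link speed, and cable length, and this sketch does not represent the Dell Enterprise SONiC implementation.

```python
class PfcIngressQueue:
    """Per-priority ingress buffer with XOFF/XON hysteresis.

    All names and thresholds are hypothetical, for illustration only.
    """

    def __init__(self, xoff_cells, xon_cells):
        assert xon_cells < xoff_cells
        self.xoff = xoff_cells   # depth at which a pause is signalled
        self.xon = xon_cells     # depth at which the sender may resume
        self.depth = 0
        self.paused = False      # True while the upstream sender is paused

    def enqueue(self, cells):
        """Buffer arriving traffic; return "PAUSE" when XOFF is crossed."""
        self.depth += cells
        if not self.paused and self.depth >= self.xoff:
            self.paused = True
            return "PAUSE"
        return None

    def dequeue(self, cells):
        """Drain traffic; return "RESUME" once depth falls to XON or below."""
        self.depth = max(0, self.depth - cells)
        if self.paused and self.depth <= self.xon:
            self.paused = False
            return "RESUME"
        return None


queue = PfcIngressQueue(xoff_cells=100, xon_cells=40)
events = [queue.enqueue(60), queue.enqueue(50), queue.dequeue(30), queue.dequeue(50)]
```

The gap between the two thresholds prevents the queue from toggling between pause and resume on every packet. Because the upstream sender is paused before the buffer overflows, frames on that priority are held back rather than dropped.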
Load balancing
In a leaf-and-spine architecture, the fabric is Layer 3 and uses dynamic load balancing (DLB) across equal-cost multipath (ECMP) links. This results in a uniform traffic distribution across all the links between the leaf and spine switches.
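The difference between static ECMP hashing and a load-aware choice can be sketched as follows. This is a simplified model under stated assumptions: the function names and flow tuples are hypothetical, every flow has the same size, and the "dynamic" path simply picks the least-loaded uplink. Actual DLB implementations in switch hardware use their own link-quality metrics and are not described by this code.

```python
import zlib


def ecmp_pick(flow, n_links):
    """Static ECMP: hash the flow tuple and ignore current link load."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % n_links


def dlb_pick(link_load):
    """Load-aware choice: least-loaded link, lowest index wins ties."""
    return min(range(len(link_load)), key=lambda i: link_load[i])


def distribute(flows, n_links, dynamic):
    """Place each (flow, size) pair on an uplink; return per-link load."""
    load = [0] * n_links
    for flow, size in flows:
        link = dlb_pick(load) if dynamic else ecmp_pick(flow, n_links)
        load[link] += size
    return load


# Eight equal-sized flows over four leaf-to-spine uplinks (hypothetical values).
flows = [(("10.0.0.%d" % i, "10.0.1.1", 4791, 17), 100) for i in range(8)]
static_load = distribute(flows, n_links=4, dynamic=False)
dynamic_load = distribute(flows, n_links=4, dynamic=True)
```

With a small number of large flows, which is typical of AI training traffic, a static hash can place several flows on the same uplink while others stay idle. The load-aware choice spreads the same flows evenly.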
Congestion control
DCQCN is the main congestion control mechanism deployed on an Ethernet fabric. DCQCN uses explicit congestion notification (ECN) to implement a rate-based, per-flow, end-to-end congestion control protocol. By acting on a per-flow basis, DCQCN provides fairness and significantly increases Ethernet fabric performance.
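The per-flow rate adjustment can be sketched as follows, based on the sender-side update rules in the published DCQCN algorithm. It is a minimal sketch that assumes the receiver returns a congestion notification packet (CNP) to the sender whenever ECN-marked packets arrive. The class name, method names, and the additive-increase step are hypothetical, and the timers and byte counters that trigger these events in a real NIC are omitted.

```python
class DcqcnSender:
    """Simplified DCQCN sender state for one flow (illustrative only)."""

    def __init__(self, line_rate_gbps, g=1 / 256, rate_ai_gbps=0.05):
        self.line_rate = line_rate_gbps
        self.g = g                      # weight for the alpha moving average
        self.rate_ai = rate_ai_gbps     # hypothetical additive-increase step
        self.current = line_rate_gbps   # current sending rate
        self.target = line_rate_gbps    # rate to recover toward
        self.alpha = 1.0                # congestion estimate in [0, 1]

    def on_cnp(self):
        """CNP received: remember the old rate, then cut the current rate."""
        self.target = self.current
        self.current *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_interval(self):
        """No CNP during the alpha timer: decay the congestion estimate."""
        self.alpha *= 1 - self.g

    def on_increase_event(self, additive=False):
        """Recover toward the target; optionally raise the target first."""
        if additive:
            self.target = min(self.target + self.rate_ai, self.line_rate)
        self.current = (self.target + self.current) / 2


sender = DcqcnSender(line_rate_gbps=400.0)
sender.on_cnp()
rate_after_cut = sender.current
sender.on_increase_event()
rate_after_recovery = sender.current
```

Because each flow keeps its own rate and congestion estimate, only the flows that actually receive CNPs slow down, which is the source of the per-flow fairness described above.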