Dell Enterprise SONiC continues to evolve and redefine the networking landscape. It is quickly becoming the go-to choice for modern enterprises across diverse industries, including as the preferred backbone for GenAI/AI fabrics. The demand for scalable, high-performance networking solutions has never been greater.
Dell Enterprise SONiC brings unique feature enhancements such as RoCEv2, Dynamic Load Balancing with adaptive routing, and enhanced user-defined hashing. Together, these features deliver better packet entropy, optimal traffic distribution, and lossless fabric operation.
A typical GenAI deployment consists of three different Ethernet fabrics:
- Out-of-Band (OOB) management fabric.
- Frontend or converged fabric.
- Backend or scale-out GPU fabric.
Out-of-Band management fabric
This fabric provides overall management access into the GenAI environment. It runs at 1 GbE and can be based on Layer 2 or Layer 3 features. The recommended Dell PowerSwitch is the N3248TE-ON, which provides 48 x 1 GbE copper RJ45 ports plus 2 x 10 GbE SFP+ ports for uplink functions.
Frontend or converged fabric
The second GenAI fabric is the frontend or converged fabric. This fabric connects the supporting GenAI components: storage, multitenant external access, and GPU cluster application management tools such as Slurm and Kubernetes.
The frontend fabric link speeds range from 400 GbE down to 25 GbE. The typical fabric design is a leaf and spine Clos architecture leveraging BGP EVPN VXLAN for multitenant external access into the GenAI environment.
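To make the leaf and spine sizing concrete, here is a short, illustrative Python sketch of two-tier Clos capacity math. The port counts and link speeds are assumptions chosen for the example, not a Dell reference design.

```python
# Illustrative two-tier Clos sizing helper. Port counts and speeds below are
# assumptions for the example, not a Dell reference design.

def clos_sizing(leaf_ports: int, uplinks_per_leaf: int, spine_ports: int,
                downlink_gbps: int, uplink_gbps: int) -> dict:
    """Basic capacity math for a two-tier leaf/spine Clos.

    Assumes one spine per leaf uplink, with each leaf wired once to every spine.
    """
    downlinks_per_leaf = leaf_ports - uplinks_per_leaf
    max_leaves = spine_ports  # each spine port terminates one leaf uplink
    oversub = (downlinks_per_leaf * downlink_gbps) / (uplinks_per_leaf * uplink_gbps)
    return {
        "servers_per_leaf": downlinks_per_leaf,
        "max_leaves": max_leaves,
        "max_servers": downlinks_per_leaf * max_leaves,
        "oversubscription_ratio": round(oversub, 2),
    }

# Example: 32-port leaves with 8 x 400 GbE uplinks and 24 x 100 GbE downlinks.
print(clos_sizing(leaf_ports=32, uplinks_per_leaf=8, spine_ports=32,
                  downlink_gbps=100, uplink_gbps=400))
# -> {'servers_per_leaf': 24, 'max_leaves': 32, 'max_servers': 768,
#     'oversubscription_ratio': 0.75}
```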
Backend or scale-out GPU fabric
The third and final fabric is the backend or scale-out GPU fabric. This fabric refers to the GPU interconnectivity. The link speed ranges between 800 GbE and 400 GbE.
Unlike the frontend or converged fabric, which uses leaf and spine with BGP EVPN VXLAN, the backend or scale-out GPU fabric can take several forms in GenAI deployments:
- Standalone switch fabric
- This type of fabric refers to a single Dell PowerSwitch Z9664F or Z9864F. A single switch with 64 x 400/800 GbE ports provides a dedicated 400 or 800 GbE link for each GPU, so up to 64 GPUs communicate through the same switch.
- Leaf and spine fabric
- This type of fabric refers to a multi-tier design (two or three tiers) consisting of leaf and spine switches using the Z9664F or Z9864F. The interlinks between leaf and spine are usually oversubscribed and run either Layer 2 or Layer 3 features.
Unlike the standalone switch fabric, where all GPUs communicate through a single switch, inter-GPU communication in a leaf and spine fabric is not deterministic: data traffic can stay within a single leaf switch or traverse multiple hops.
- Pure rail fabric
- This type of fabric refers to a set of dedicated Dell PowerSwitch 400/800 GbE switches providing dedicated GPU links. It is similar to the standalone switch fabric but scales beyond a single 64-GPU cluster.
- Rail optimized fabric
- This type of fabric refers to a leaf and spine architecture that preserves the pure rail design. The rail optimized fabric adds a spine layer that allows inter-rail GPU connectivity when the optimal dedicated rail-to-rail path is not available (see the sketch after this list).
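The difference between a pure rail and a rail optimized fabric comes down to how GPU NICs map to switches. The Python sketch below illustrates the common rail convention, where NIC i of every GPU server attaches to rail switch i; the server and NIC counts are assumptions for illustration only.

```python
# Illustrative rail mapping: NIC i of every GPU server attaches to rail switch i.
# Server and NIC counts are example assumptions, not a reference design.

NICS_PER_SERVER = 8  # e.g., one NIC per GPU in an 8-GPU server

def rail_switch_for(server: int, nic: int) -> str:
    """In a pure rail fabric, traffic on NIC `nic` stays on rail switch `nic`,
    regardless of the server it originates from."""
    return f"rail-switch-{nic}"

# Same-rail traffic (NIC 3 on server 0 to NIC 3 on server 9) stays on one switch:
assert rail_switch_for(0, 3) == rail_switch_for(9, 3)

# Enumerate the rail connections for one server:
for nic in range(NICS_PER_SERVER):
    print(f"server-0 NIC {nic} -> {rail_switch_for(0, nic)}")

# Cross-rail traffic (e.g., NIC 3 to NIC 5) has no single-switch path in a pure
# rail fabric; a rail optimized fabric adds a spine layer so rail-switch-3 can
# reach rail-switch-5 when the dedicated rail-to-rail path is unavailable.
```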
Deployment best practices
Unlike a typical non-GenAI environment (storage, web traffic, virtualization, standard data traffic), a GenAI environment imposes specific networking requirements on each of the fabrics that make it up.
The OOB management fabric requires neither advanced networking features nor high bandwidth to provide management access. When deploying the OOB management fabric, the following guidelines are recommended:
- Minimize the number of management subnets. The fewer management subnets that must be exposed outside the GenAI environment, the less complex the networking design of the OOB management fabric.
- Enable spanning-tree on all connections to the devices being managed.
- Use the 10 GbE SFP+ uplinks on each N3248TE-ON as the external connections into the OOB management fabric when needed.
- Consider Layer 2 to keep the OOB management fabric as a flat network reachable through the 10 GbE SFP+ uplinks (a configuration sketch follows this list).
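As one way to realize the flat Layer 2 recommendation above, the following Python sketch emits a minimal SONiC-style config_db.json fragment using the community-schema VLAN and VLAN_MEMBER tables. The VLAN ID and port names are assumptions; validate any real deployment against the Dell Enterprise SONiC documentation.

```python
import json

# Minimal flat-L2 sketch for the OOB fabric, expressed as a SONiC-style
# config_db.json fragment (community schema; VLAN ID and port names are
# assumptions chosen for illustration).
MGMT_VLAN = 100
MGMT_PORTS = [f"Ethernet{i}" for i in range(48)]  # 48 x 1 GbE copper ports

config_fragment = {
    "VLAN": {
        f"Vlan{MGMT_VLAN}": {"vlanid": str(MGMT_VLAN)},
    },
    "VLAN_MEMBER": {
        f"Vlan{MGMT_VLAN}|{port}": {"tagging_mode": "untagged"}
        for port in MGMT_PORTS
    },
}

print(json.dumps(config_fragment, indent=2))
```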
The frontend fabric requires some planning, as it must deliver performance, multitenant access, and GPU cluster administration. The following guidelines are recommended:
- Enable jumbo frames.
- Enable BGP EVPN VXLAN to deliver multitenant access (see the sketch after this list).
- Deploy the maximum available link speed, anywhere from 100 GbE up to 400 GbE, on the storage cluster connections to the switches.
- Deploy Layer 2 features such as spanning-tree and link-aggregation between storage and application servers running Slurm or Kubernetes.
- The border leaf switches should terminate all VXLAN tunnels and provide external access into the frontend fabric.
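As a rough illustration of the jumbo-frame and VXLAN guidelines above, this Python sketch emits a SONiC-style config_db.json fragment using the community-schema PORT, VXLAN_TUNNEL, and VXLAN_TUNNEL_MAP tables. The MTU value, VTEP address, and VLAN-to-VNI mapping are assumptions, and the BGP EVPN control plane (configured in the FRR routing stack) is omitted.

```python
import json

# Frontend fabric sketch: jumbo frames plus a VLAN-to-VNI VXLAN mapping,
# expressed as a SONiC-style config_db.json fragment (community schema).
# MTU, VTEP source IP, and the VLAN/VNI pair are assumptions for illustration;
# BGP EVPN itself lives in the FRR routing stack and is not shown here.
VTEP_IP = "10.1.0.1"
VLAN, VNI = 100, 1000

config_fragment = {
    "PORT": {
        "Ethernet0": {"mtu": "9100", "admin_status": "up"},  # jumbo frames
    },
    "VXLAN_TUNNEL": {
        "vtep1": {"src_ip": VTEP_IP},
    },
    "VXLAN_TUNNEL_MAP": {
        f"vtep1|map_{VNI}_Vlan{VLAN}": {"vlan": f"Vlan{VLAN}", "vni": str(VNI)},
    },
}

print(json.dumps(config_fragment, indent=2))
```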
The last fabric is the backend GPU fabric. This is the engine of the entire GenAI deployment, and it has specific networking requirements: high throughput, lossless transport, low latency, and scalability. To achieve these requirements, the following guidelines should be followed:
- Enable jumbo frames across the entire fabric.
- Either Layer 2 or Layer 3 can be deployed between the GPUs and the switches.
- Configure dedicated 400 GbE or 800 GbE links between each GPU and the switches.
- Enable RoCEv2 on all switches that are part of the fabric (see the sketch after this list).
- If a rail optimized fabric is deployed, enable Dynamic Load Balancing between the rail leaf and spine switches.
- Enable cut-through switching.
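On SONiC, RoCEv2 losslessness is typically built from PFC on the RoCE traffic classes plus WRED/ECN marking. The sketch below shows a minimal, hypothetical config_db.json fragment using the community-schema PORT_QOS_MAP, WRED_PROFILE, and QUEUE tables; the priorities and thresholds are placeholder assumptions, and production deployments should follow Dell's validated RoCEv2 profiles.

```python
import json

# Backend GPU fabric sketch: PFC on the RoCE priorities plus an ECN-enabled
# WRED profile, expressed as a SONiC-style config_db.json fragment
# (community schema; priorities, thresholds, and port names are placeholder
# assumptions -- follow Dell's validated RoCEv2 profiles in production).
ROCE_PRIORITIES = "3,4"

config_fragment = {
    "PORT_QOS_MAP": {
        "Ethernet0": {"pfc_enable": ROCE_PRIORITIES},  # lossless priorities
    },
    "WRED_PROFILE": {
        "ROCE_LOSSLESS": {
            "wred_green_enable": "true",
            "ecn": "ecn_all",                  # mark with ECN rather than drop
            "green_min_threshold": "1048576",  # bytes; placeholder values
            "green_max_threshold": "2097152",
            "green_drop_probability": "5",
        },
    },
    "QUEUE": {
        "Ethernet0|3-4": {"wred_profile": "ROCE_LOSSLESS"},
    },
}

print(json.dumps(config_fragment, indent=2))
```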