The typical GenAI deployment consists of several fabrics (see Figure 6). These fabrics are implemented in a distributed leaf and spine architecture using Dell PowerSwitch Z, S, and N-Series switches. The fabrics are deployed using either traditional Layer 3 BGP or a BGP EVPN overlay.
The back-end fabric is the inter-GPU communication channel, where most of the switching activity takes place. It has the strictest requirements of all the fabrics because it must meet all three at once: lossless behavior, high performance, and scale. All connections from the workloads and between the switches are 400GE.
The front-end fabric provides workload storage and cluster access. This fabric does not demand high throughput or lossless behavior.
Lastly, the out-of-band management fabric provides simple fabric and workload management.
These fabrics range in size from small (a single Z9664F-ON) to large (several dozen Z9664F-ON switches).
The first Dell GenAI fabric shows:
- A GenAI fabric of 64 GPUs
- No link oversubscription between workload and fabric
- A single Dell PowerSwitch Z9664F-ON
- Eight Dell XE9680s
The second Dell GenAI fabric shows:
- A GenAI fabric of 2,048 GPUs
- A leaf and spine architecture
- No link oversubscription between the workload clusters and fabric
- 96 Dell PowerSwitch Z9664F-ON
- 256 Dell XE9680s
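The switch and server counts in the two examples above can be reproduced with simple port arithmetic. The following is a minimal sketch, not a Dell sizing tool. It assumes that each XE9680 holds eight GPUs, that each GPU uses one 400GE port on the back-end fabric, and that each Z9664F-ON offers 64 x 400GE ports. The function name and its parameters are hypothetical and used only for illustration.

```python
import math


def size_nonblocking_fabric(gpus, gpus_per_server=8, switch_ports=64):
    """Estimate server, leaf, and spine counts for a fabric with no oversubscription.

    Assumptions (not taken from a Dell tool): one 400GE fabric port per GPU,
    identical switches in both tiers, and at most two tiers (leaf and spine).
    """
    servers = math.ceil(gpus / gpus_per_server)

    # Small case: every GPU port fits on a single switch, so no spine is needed.
    if gpus <= switch_ports:
        return {"servers": servers, "leaves": 1, "spines": 0, "switches": 1}

    # 1:1 subscription: half of each leaf's ports face GPUs, half face spines.
    downlinks_per_leaf = switch_ports // 2
    leaves = math.ceil(gpus / downlinks_per_leaf)

    # In a two-tier design, each spine needs at least one port per leaf.
    if leaves > switch_ports:
        raise ValueError("too many GPUs for a two-tier fabric with this switch")

    # Every leaf uplink terminates on a spine port.
    total_uplinks = leaves * downlinks_per_leaf
    spines = math.ceil(total_uplinks / switch_ports)

    return {
        "servers": servers,
        "leaves": leaves,
        "spines": spines,
        "switches": leaves + spines,
    }


if __name__ == "__main__":
    print(size_nonblocking_fabric(64))
    print(size_nonblocking_fabric(2048))
```

Under these assumptions, 64 GPUs map to eight servers on a single switch, and 2,048 GPUs map to 256 servers on 64 leaf and 32 spine switches, which matches the 96 Z9664F-ON switches listed above. With 64-port switches, 2,048 GPUs is also the largest two-tier fabric this sketch allows without oversubscription.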
Since its initial release, SONiC has been aimed at cloud providers. It has also benefited from Dell Technologies contributions merged into its community versions.
Key features have been added to SONiC, resulting in multiple bundles with different feature sets. Figure 15 shows the Dell Enterprise SONiC features that enable the deployment of GenAI workloads.
These features have been discussed in (Ethernet Fabric for GenAI workloads).