The deployment of generative AI technologies presents several challenges, including technical complexity, a shortage of skilled professionals, and the constraints of proprietary solutions such as InfiniBand. These limitations complicate integration, extend evaluation periods, increase costs, and contribute to vendor lock-in. As AI models scale from billions to trillions of parameters, efficient data transfer becomes paramount, and any network-induced delay can severely impact performance. Fast, bulk data transfer, especially for the large "elephant flows" between source and destination, is essential for completing tasks on time. While per-packet latency is important, the total processing time for each step is even more critical. Large AI clusters also require coordinated congestion management to prevent packet loss and ensure optimal GPU utilization, along with synchronized management and monitoring to optimize both computational and network resources. Dell Technologies addresses these issues with the PowerSwitch Z-series, an open, Ethernet-based solution tailored for generative AI that provides next-generation Ethernet fabrics with advanced network silicon to deliver the low latency and high throughput these applications require.
This design incorporates three physical networks: a cluster management (out-of-band) network, a frontend network fabric, and a backend (GPU) network fabric. The following PowerSwitch Z-series models provide the switching for the frontend and backend fabrics:
PowerSwitch Z9432F-ON is a high-density 400 GbE fabric switch with up to 32 ports of 400 GbE or up to 128 ports of 100 GbE using breakout cables. It has a broad range of functionality to meet the growing demands of today’s data center environment.
PowerSwitch Z9664F-ON offers optimum flexibility and cost-effectiveness, providing a high-density choice of either 64 ports of 400 GbE in a QSFP56-DD form factor or 256 ports of 100 GbE in a 2U design. It can be used as a 10/25/40/50/100/200 GbE switch using breakout cables for a maximum of 256 ports.
For AI applications as in this design, you may also want to consider the PowerSwitch Z9864F-ON as an alternative to the Z9664F-ON. The Z9864F-ON provides state-of-the-art, high-density 100/200/400/800 GbE ports and a broad range of functionality. The Z9864F-ON provides a density of 64 ports of 800 GbE in an OSFP112 form factor and 2U design. It can also be used as a 100/200/400 switch using breakout cables for a maximum of 320 ports.
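The breakout port counts quoted above follow from simple multiplication. The short sketch below is illustrative only; the 4-way breakout factor is an assumption inferred from the port counts quoted above, so check the switch documentation for the supported breakout modes.

```python
# Illustrative arithmetic only: breakout cabling multiplies the native port
# count by the split factor. A 4-way split is assumed here because it matches
# the figures quoted above; check the switch documentation for supported modes.

def breakout_ports(native_ports: int, split: int) -> int:
    """Total logical ports when every native port is split by a breakout cable."""
    return native_ports * split

print(breakout_ports(32, 4))   # Z9432F-ON: 32 x 400 GbE -> 128 x 100 GbE
print(breakout_ports(64, 4))   # Z9664F-ON: 64 x 400 GbE -> 256 x 100 GbE
```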
Dell Technologies Open Networking is designed around a highly scalable, cloud-ready data center network fabric and uses the Dell Enterprise SONiC network operating system. IT organizations can run their business with the innovation, automation, and reliability that come from the first commercial offering of SONiC with production-ready enterprise feature enhancements and global support, targeted at demanding cloud, data center, and edge fabrics. Dell Technologies has long been a pioneer in open networking, and its leadership in the development and deployment of SONiC is a testament to its continued commitment to innovation, community collaboration, and contribution. Dell's approach to SONiC is built on the principles of openness and interoperability, empowering enterprises to avoid vendor lock-in and enabling them to customize and optimize their networks to meet specific needs.
Enterprise SONiC Distribution by Dell Technologies offers several features designed to enhance the performance, efficiency, and connectivity of AI fabrics. Enterprise SONiC brings substantial advancements in AI fabric enablement: with feature additions such as Dynamic Load Balancing with Adaptive Routing and Enhanced User-defined Hashing, it empowers organizations to use AI fabrics more effectively. Dynamic Load Balancing ensures optimal use of the links in an AI fabric, while Adaptive Routing enhances forwarding behavior, maximizing the performance and efficiency of network resources. These capabilities allow AI data flows to use all available paths to their destination simultaneously.
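To build intuition for why packet entropy matters, the following minimal sketch models per-flow hashing across a set of equal-cost uplinks. It is illustrative only and does not reflect the actual hashing, Adaptive Routing, or Dynamic Load Balancing logic implemented in Enterprise SONiC or the switch silicon; the number of uplinks and the header fields hashed are assumptions.

```python
# Minimal sketch of per-flow hashing over equal-cost uplinks, for intuition
# only. It does not reflect the actual hashing, Adaptive Routing, or Dynamic
# Load Balancing behavior of Enterprise SONiC or the underlying silicon.
import hashlib
from collections import Counter

NUM_UPLINKS = 4  # assumed number of equal-cost paths

def pick_uplink(*fields) -> int:
    """Deterministically map a flow (defined by the hashed fields) to an uplink."""
    digest = hashlib.sha256(repr(fields).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_UPLINKS

# Hashing only on source/destination IP collapses every flow between two
# nodes onto one link; adding more fields (here, a per-queue-pair UDP source
# port) restores entropy and spreads the load.
coarse = Counter(pick_uplink("10.0.0.1", "10.0.1.1") for _ in range(16))
fine = Counter(pick_uplink("10.0.0.1", "10.0.1.1", sport)
               for sport in range(49152, 49168))
print("coarse hashing:", dict(coarse))   # all 16 flows land on a single uplink
print("finer hashing: ", dict(fine))     # flows spread across multiple uplinks
```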
The PowerSwitch Z-series switches are high-performance, open, and scalable data center switches used for spine, core, and aggregation applications.
The following figure shows the network architecture, including the network connectivity for the compute servers:
Figure 2. Networking design for up to eight nodes
The networking design shown in Figure 2 is a two-tier multi-switch configuration designed to support up to eight compute nodes.
Various network topologies are available for the backend network (also called the GPU fabric). The topologies that have been validated in this design guide are single switch and "fat tree" (shown in Figure 2 using a single spine and two leaf switches).
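A quick port-budget calculation shows why the two-leaf, single-spine fat tree in Figure 2 tops out at eight nodes. The sketch below uses the values from this design (eight backend NICs per node and 64-port leaf and spine switches); the helper itself is purely illustrative.

```python
# Port budget for the Figure 2 fat tree, using values from this design.
NODES = 8
NICS_PER_NODE = 8        # one backend Broadcom 57608 per MI300X accelerator
SWITCH_PORTS = 64        # PowerSwitch Z9664F-ON
LEAVES = 2

host_ports_per_leaf = NODES * NICS_PER_NODE // LEAVES    # 32 server-facing ports
uplinks_per_leaf = SWITCH_PORTS - host_ports_per_leaf    # 32 ports left for the spine
spine_ports_used = uplinks_per_leaf * LEAVES             # 64 of 64 spine ports

print(host_ports_per_leaf, uplinks_per_leaf, spine_ports_used)
# 32 downlinks and 32 uplinks per leaf gives a 1:1 (non-blocking) ratio,
# and the single spine is fully consumed at eight nodes (64 GPUs).
```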
The following sections describe the networks that this design requires to manage the cluster and facilitate communication and coordination between different components and nodes in the cluster.
The cluster management and out-of-band (OOB) network is a separate and dedicated physical network infrastructure used for managing and monitoring the servers. Omnia also uses this network to manage and deploy the host operating system image on the cluster nodes. Because of the low network requirements, a 1-Gigabit Ethernet network suffices. In our validated design, we use the PowerSwitch N3248TE-ON switch for this network fabric.
There are two design options available for this setup:
The frontend network fabric in this validated design refers to the standard Data Center Fabric providing connectivity to resources outside the AI cluster. As shown in Figure 2, it is a converged fabric that supports storage, external data center access, and Kubernetes management operations. This network fabric is often referred to as the north-south network.
As shown in Figure 2, a pair of Dell PowerSwitch Z9432F-ON switches is used for this network fabric. The PowerSwitch Z9432F-ON is a 32-port switch that supports 400 GbE and can be configured to operate in 10, 25, 40, 50, 100, or 400 GbE modes. These switches connect to the PowerScale F710 storage arrays and to the PowerEdge R660 Kubernetes and Omnia management servers over 100 GbE QSFP or 25 GbE SFP cables, and they provide external access by connecting to the data center network. They also connect to the PowerEdge XE9680 server nodes using Broadcom 400 GbE BCM57608 network adapters. Two BCM57608 network adapters are installed in PCIe slots 31 and 40, connecting to the PowerSwitch Z9432F-ON switches. Each adapter is placed in a PCIe slot linked to a different CPU socket, ensuring a balanced configuration.
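Because each frontend adapter should sit behind a different CPU socket, a quick host-side check can confirm the NUMA placement. The sketch below reads the NUMA node that Linux reports for each interface through sysfs; the interface names are hypothetical placeholders for the two Broadcom 57608 ports in slots 31 and 40.

```python
# Hypothetical host-side check that the two frontend NICs sit on different
# NUMA nodes (one per CPU socket). Replace the interface names with the
# actual names of the Broadcom 57608 ports in slots 31 and 40.
from pathlib import Path

def numa_node(iface: str) -> int:
    """NUMA node reported by the Linux kernel for a network interface."""
    return int(Path(f"/sys/class/net/{iface}/device/numa_node").read_text())

frontend_nics = ["ens31f0np0", "ens40f0np0"]   # placeholder interface names
placement = {nic: numa_node(nic) for nic in frontend_nics}
print(placement)
if len(set(placement.values())) < len(frontend_nics):
    print("Warning: frontend NICs share a NUMA node; check slot placement")
```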
The backend network fabric, also referred to as the GPU, scale-out, or east-west network fabric, is a high-bandwidth, low-latency 400 Gb Ethernet fabric dedicated to inter-node GPU communication. It is the fabric used to exchange AI workload parameters between GPUs in different PowerEdge XE9680 nodes. Internally, each network adapter is paired with a GPU through a PCIe switch.
The backend network fabric uses the Remote Direct Memory Access (RDMA) protocol, enabling Peer Direct RDMA. This technology allows direct memory access between GPUs on different nodes without involving the host CPU of either node. RDMA with Enhanced Hashing provides better packet entropy and optimal traffic distribution. By enabling direct communication between the memory of GPUs across nodes, Peer Direct reduces latency and increases throughput. This direct path for data transfer bypasses the traditional bottlenecks associated with CPU and system memory, accelerating workloads such as generative AI LLMs that require intensive GPU computations and large-scale data movement.
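As a deliberately minimal example of the collective traffic that crosses this fabric, the sketch below performs a single all-reduce with torch.distributed. It assumes a ROCm build of PyTorch in which the "nccl" backend is backed by RCCL and a launcher such as torchrun that sets the rendezvous environment; whether the transfer actually uses RoCEv2 and Peer Direct RDMA depends on the RCCL and NIC configuration, not on this code.

```python
# Minimal multi-node all-reduce sketch, assuming a ROCm build of PyTorch in
# which the "nccl" backend is provided by RCCL. Launch with torchrun, one
# process per GPU, across the nodes participating in the backend fabric.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")          # RCCL on ROCm platforms
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)                # HIP devices surface via torch.cuda

    # One gradient-sized tensor per GPU; all-reduce sums it across every rank.
    payload = torch.ones(64 * 1024 * 1024, device="cuda")
    dist.all_reduce(payload, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print("all-reduce complete, value:", payload[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```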
Tail-latency sensitivity is a key characteristic of AI fabrics because these workloads cannot afford delays; therefore, technologies such as RoCEv2 and Priority Flow Control (PFC) are important considerations. These features ensure smooth data flow, while Explicit Congestion Notification (ECN) and PFC watchdogs manage and mitigate congestion in the network. Adaptive Routing with Dynamic Load Balancing allows dynamic path selection to avoid suboptimal links and enhances performance by reacting quickly to link failures.
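For intuition, ECN-based congestion control typically marks packets with a probability that ramps up between two queue-depth thresholds. The sketch below models that generic WRED/ECN-style curve; the threshold and probability values are arbitrary examples, not the settings used in this validated design or shipped as Enterprise SONiC defaults.

```python
# Illustrative WRED/ECN-style marking curve: between a minimum and maximum
# queue threshold, packets are marked with a probability that ramps linearly;
# above the maximum, every packet is marked. Values are example placeholders.
K_MIN_CELLS = 1_000     # start marking above this queue depth
K_MAX_CELLS = 10_000    # mark every packet at or above this depth
P_MAX = 0.2             # marking probability just below K_MAX

def ecn_mark_probability(queue_depth: int) -> float:
    if queue_depth <= K_MIN_CELLS:
        return 0.0
    if queue_depth >= K_MAX_CELLS:
        return 1.0
    return P_MAX * (queue_depth - K_MIN_CELLS) / (K_MAX_CELLS - K_MIN_CELLS)

for depth in (500, 2_000, 6_000, 12_000):
    print(depth, round(ecn_mark_probability(depth), 3))
```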
A backend network fabric is required for distributed fine-tuning of LLMs, where it facilitates rapid communication between nodes in a distributed computing environment. It is typically not required for inference-only clusters in which the model fits inside a single server. This design uses the Llama 3 model, which fits in a single PowerEdge XE9680 server for inferencing, so a backend network fabric is not needed for that workload.
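A rough memory calculation illustrates why inference does not need the backend fabric here. The sketch below assumes, for illustration, the Llama 3 70B variant held in 16-bit weights and ignores KV-cache and activation overhead; each AMD Instinct MI300X provides 192 GB of HBM3.

```python
# Rough memory-footprint check of why a single XE9680 can serve the model.
# Parameter count and precision are illustrative assumptions (Llama 3 70B in
# 16-bit weights); KV-cache and activation overheads are ignored.
PARAMS = 70e9
BYTES_PER_PARAM = 2          # fp16/bf16 weights
GPUS_PER_NODE = 8
HBM_PER_GPU_GB = 192         # AMD Instinct MI300X HBM3 capacity

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
node_hbm_gb = GPUS_PER_NODE * HBM_PER_GPU_GB
print(f"weights ~{weights_gb:.0f} GB vs {node_hbm_gb} GB of HBM per node")
# ~140 GB of weights fits comfortably in 1536 GB of node HBM, so no
# inter-node (backend) traffic is required for inference.
```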
The following figure shows the front view of the PowerEdge XE9680 chassis:
Figure 3. Front chassis view of the Dell PowerEdge XE9680 server
There are 10 PCIe Gen 5 slots on the PowerEdge XE9680 server, which are internally connected to the CPUs and GPUs using PCIe switches. Each of the 10 PCIe slots is populated with a Broadcom 57608 400 GbE network adapter. The eight adapters in PCIe slots 32 through 39 are connected to a PowerSwitch Z9664F-ON; these eight NICs are coupled with the eight AMD MI300X accelerators and used for the backend fabric. The two adapters in the far-left and far-right slots (31 and 40) are connected to the PowerSwitch Z9432F-ON and used for the frontend fabric.
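The slot-to-fabric assignment described above can be summarized as a simple lookup, shown in the sketch below. The mapping reflects this design's slot numbering; the pairing of each backend NIC with a specific accelerator is handled by the server's internal PCIe switches and is not modeled here.

```python
# Slot-to-fabric map for the ten Broadcom 57608 adapters described above.
# Slot numbers and fabric roles come from this design; NIC-to-GPU pairing is
# handled by the XE9680's internal PCIe switches and is not modeled here.
NIC_SLOT_FABRIC = {
    **{slot: "backend (PowerSwitch Z9664F-ON)" for slot in range(32, 40)},
    31: "frontend (PowerSwitch Z9432F-ON)",
    40: "frontend (PowerSwitch Z9432F-ON)",
}

for slot in sorted(NIC_SLOT_FABRIC):
    print(f"slot {slot} -> {NIC_SLOT_FABRIC[slot]}")
```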
The PowerSwitch Z9664F-ON is a high-performance, high-density data center switch featuring 64 ports of 400 GbE connectivity. Powered by the Broadcom Tomahawk 4 chipset, it is ideally suited to meet the needs and demands of the backend fabric.
As part of the validated design, a “fat-tree” topology was implemented for this high-performance backend fabric. With each node requiring eight 400 GbE connections, this switch topology allows for scaling up to eight nodes or 64 GPUs in a cluster.
The following figure shows the port mapping of the Broadcom 57608 400 GbE NIC to the switch for the backend network. It shows a mapping of up to eight nodes. The spine connects to two leaf switches and uses 32 out of the 64 available 400 GbE ports for each. Each leaf uses the top 32 ports to connect to the spine and the bottom 32 ports to connect to the Broadcom 57608 400 GbE adapters on the PowerEdge XE9680 servers in slots 32 through 39.
Figure 4. Backend network using the PowerSwitch Z9664F-ON NIC to switch port mapping
You can construct a more expansive multi-tier switch topology to scale beyond eight nodes. The networking whitepaper Dell Technologies AI Fabrics Overview discusses some of these topologies.
Table 4. Network switch and adapters used for each fabric
Network fabric | Switch | Network adapter
Management | PowerSwitch N3248TE-ON | 1 GbE Broadcom 5720 (LOM)
Frontend | PowerSwitch Z9432F-ON | 2 x 400 GbE Broadcom 57608 adapters
Backend | PowerSwitch Z9664F-ON | 8 x 400 GbE Broadcom 57608 adapters