This design uses the NVIDIA Spectrum-X networking platform and consists of three physical networks: the cluster management and out-of-band (OOB) network, the frontend network fabric, and the backend network fabric.
The following figure shows the network topology used for the reference design:
Note: For the purpose of illustration, the figure shows only a single PowerEdge XE9680 server and a single PowerEdge R760xa server.
The cluster management and out-of-band (OOB) network is a separate, dedicated physical network infrastructure used for managing and monitoring the servers. Because its bandwidth requirements are low, a 1-Gigabit Ethernet network suffices.
There are two design options available for this setup:
The backend network fabric, also referred to as the GPU, Scale Out, or east-west network fabric, is a 400 GbE high-bandwidth, low-latency network fabric dedicated to inter-node GPU communication. It is required for distributed fine-tuning of LLMs, where it enables rapid communication between nodes in a distributed computing environment. A backend network fabric is typically not required for inference-only clusters in which the model fits inside a single server. This design uses the Llama 3 model, which fits in a single PowerEdge XE9680 or PowerEdge R760xa server for inferencing, so a backend network fabric is not needed.
The backend network is powered by the Spectrum-4 SN5600 switch, a versatile network switch that can serve as a smart leaf, spine, or super-spine. It provides 64 ports of 800 GbE in a compact 2U form factor, making it well suited for an inter-GPU fabric. It supports both conventional leaf/spine designs with top-of-rack (ToR) switches and end-of-row (EoR) topologies. The SN5600 switch offers a wide range of connectivity options, from 1 GbE to 800 GbE, and delivers an industry-leading total throughput of 51.2 Tb/s.
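As a quick sanity check of these figures, the following sketch (illustrative Python, not a vendor tool) confirms that 64 ports at 800 GbE line rate account for the quoted 51.2 Tb/s of total throughput.

```python
# Illustrative arithmetic only: verify the SN5600 throughput figure quoted above.
PORTS_800GBE = 64          # 800 GbE ports on one SN5600
PORT_SPEED_GBPS = 800      # per-port line rate in Gb/s

total_tbps = PORTS_800GBE * PORT_SPEED_GBPS / 1000
print(f"Total switch throughput: {total_tbps:.1f} Tb/s")   # 51.2 Tb/s
```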
Spectrum-X incorporates the following innovations to achieve the highest effective bandwidth under load and at scale:
The following figure shows the front view of the PowerEdge XE9680 chassis:
The PowerEdge XE9680 server has 10 PCIe Gen 5 slots, which are internally connected to the CPUs and GPUs through PCIe switches. Eight of these slots are populated with NVIDIA BlueField-3 single-port 400 GbE B3140H adapters that connect to a Spectrum-4 SN5600 switch, which constitutes the backend fabric.
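The following rough sketch (illustrative only; the constants simply restate the adapter counts above) shows the aggregate backend bandwidth that a single PowerEdge XE9680 node contributes to the Scale Out fabric.

```python
# Illustrative arithmetic only: backend bandwidth per XE9680 node.
GPU_NICS_PER_NODE = 8      # BlueField-3 single-port 400 GbE adapters, one per GPU
NIC_SPEED_GBPS = 400       # 400 GbE per adapter

node_backend_gbps = GPU_NICS_PER_NODE * NIC_SPEED_GBPS
print(f"Backend bandwidth per node: {node_backend_gbps} Gb/s "
      f"({node_backend_gbps / 1000:.1f} Tb/s)")            # 3200 Gb/s (3.2 Tb/s)
```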
The frontend network fabric in this design refers to the standard Data Center Fabric providing connectivity to resources outside the AI cluster. As shown in Figure 1, it is a converged fabric that supports storage, external data center access, and Kubernetes management operations. This network fabric is often referred to as the north-south network.
A pair of Spectrum-4 SN5400 switches powers the frontend network fabric. These switches connect to the PowerScale F710 storage arrays and to the PowerEdge R660 nodes that host Kubernetes management and Base Command Manager. The switches also provide external access by connecting to the data center network, and they connect to the PowerEdge XE9680 or PowerEdge R760xa server nodes using NVIDIA BlueField-3 dual-port 200 GbE adapters.
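For context, the sketch below (illustrative arithmetic only, assuming a single dual-port 200 GbE BlueField-3 adapter per node; the actual adapter population may differ) estimates the frontend bandwidth available to one compute node.

```python
# Illustrative arithmetic only: frontend (north-south) bandwidth per node,
# assuming one dual-port 200 GbE BlueField-3 adapter is used for the frontend.
FRONTEND_PORTS = 2         # dual-port adapter
PORT_SPEED_GBPS = 200      # 200 GbE per port

print(f"Frontend bandwidth per node: {FRONTEND_PORTS * PORT_SPEED_GBPS} Gb/s")  # 400 Gb/s
```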
NVIDIA’s Ethernet Switch Configurator can be used to help identify and order the appropriate cables and optics for your AI fabrics. After entering your topology information into the form, it lists the available product options for your environment, along with part numbers and other information.
For the network design that is shown in Figure 7, customers can use either direct attach copper (DAC) or fiber cables.
Each GPU in the Dell PowerEdge XE9680 server is coupled through PCIe with a respective NIC that interfaces with the fabric at 400 GbE, as shown in the following figure. Different types of cabling are supported: direct attach copper (DAC) or fiber.
The following figure shows the front view of the PowerEdge XE9680 server:
The NICs in the two PCIe slots on the far right and far left are used for the frontend fabric connections. The eight GPU NICs are coupled with the eight GPUs and used to connect to the Scale Out Fabric.
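As a minimal illustration of this coupling (the naming scheme below is hypothetical and not taken from the design guide), each GPU can be modeled as paired one-to-one with its backend NIC, with the two remaining slots carrying the frontend adapters:

```python
# Hypothetical naming for illustration only: one backend NIC per GPU, plus two
# frontend slots, matching the XE9680 layout described above.

def gpu_nic_map(server: str, gpus: int = 8) -> dict:
    """Pair each GPU with the backend NIC it is coupled to over PCIe."""
    return {f"{server}/gpu{i}": f"{server}/nic{i}" for i in range(1, gpus + 1)}

backend = gpu_nic_map("server1")
frontend_slots = ["slot_far_left", "slot_far_right"]   # frontend fabric adapters

print(backend["server1/gpu4"])   # server1/nic4, which connects to the Scale Out fabric
```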
The following figure shows how the Scale Out fabric interconnects the NICs coupled with the GPUs in PowerEdge XE9680 servers to build a GPU cluster that spans more than a single node. The set of GPUs composing the cluster uses the Scale Out fabric to exchange AI workload parameters.
When deploying a Scale Out fabric, keep the following guidelines in mind:
This list is not an exhaustive networking design checklist as the fabric configuration requires detailed planning and design throughout the entire deployment cycle.
Multiple topologies can be used to build a Scale Out Fabric:
The single switch topology is the simplest instance of a TOR-wired Clos topology. The following figure shows the connectivity properties of the TOR-wired Clos topology:
This topology enables direct Scale Out fabric communication between any GPU pair without involving the Scale Up fabric. For example, GPU 4 on server 1 can directly communicate through NIC 4 on server 1 with GPU 8 on server 2, because the single switch topology (or TOR-wired Clos topology) enables NIC 4 on server 1 to reach NIC 8 on server 2, which is coupled with GPU 8 on server 2.
The following items characterize a single switch topology:
The following figure shows a single switch fabric using the NVIDIA SN5600 switch:
The connections from each GPU to the switch can be 800 GbE or 400 GbE.
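The sketch below is a back-of-the-envelope sizing estimate (not a validated configuration) of how many PowerEdge XE9680 nodes a single SN5600 can serve in this topology, assuming each GPU uses a 400 GbE link and each 800 GbE switch port is broken out into two 400 GbE links.

```python
# Illustrative sizing only: single switch topology capacity under the stated assumptions.
SWITCH_PORTS_800GBE = 64       # SN5600 port count
BREAKOUT_PER_PORT = 2          # assumed 2 x 400 GbE breakout per 800 GbE port
GPU_NICS_PER_NODE = 8          # one 400 GbE NIC per GPU in an XE9680

links_400gbe = SWITCH_PORTS_800GBE * BREAKOUT_PER_PORT       # 128 links
max_nodes = links_400gbe // GPU_NICS_PER_NODE                # 16 nodes
print(f"Maximum nodes on one switch: {max_nodes} ({max_nodes * GPU_NICS_PER_NODE} GPUs)")
```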
In this topology, each NIC in a server is connected to a different switch (or spine-leaf network), which is called a rail. The rails are also interconnected at an upper tier. Therefore, this topology provides two ways to cross rails: through the Scale Up fabric (preferred) or through the upper tier of the Scale Out topology.
The following figure shows the connectivity properties of the Rail Optimized topology:
For example, to communicate with GPU 8 on server 2, GPU 4 on server 1 can either use the Scale Up fabric on server 1 to hand the data to the GPU coupled with NIC 8 on server 1, which then sends it over rail 8 to NIC 8 on server 2, or send the data from NIC 4 on server 1 through the upper tier of the Scale Out topology down to rail 8 and NIC 8 on server 2.
This property allows AI workloads to perform better on a Rail Optimized topology than on a Pure Rail topology because the current Collective Communication Libraries are not yet fully optimized for the Pure Rail topology. As such, the Rail Optimized topology is the recommended topology to build a Scale Out fabric (see Trade-offs).
The following figure shows an example of Rail Optimized topology with the NVIDIA SN5600 switch:
The Rail Optimized topology provides the same set of benefits as a Pure Rail topology, plus scalability and connectivity through the upper tier. At times, GPU-to-GPU traffic across different nodes might not take the optimal path, and the upper tier has to provide connectivity between the different rails.
For example, suppose the red GPU on NODE_1 in POD_1 wants to communicate with the green GPU on NODE_1 in POD_3. Ideally, the red GPU uses the internal fabric on NODE_1 to communicate with the green GPU on NODE_1, and that GPU then transmits the data on Rail3. However, if this data path is not available, the red GPU transmits its data upstream toward Spine1, and Spine1 sends it downstream to Rail3.
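The following sketch (a hypothetical helper, not an actual routing implementation) summarizes this path-selection logic: same-rail traffic stays on its rail, cross-rail traffic preferably crosses the node's Scale Up fabric onto the destination rail, and the spine tier is used only as a fallback.

```python
# Hypothetical illustration of Rail Optimized path selection; not a real routing API.
def choose_path(src_gpu: int, dst_gpu: int, scale_up_available: bool) -> list:
    src_rail, dst_rail = f"rail{src_gpu}", f"rail{dst_gpu}"
    if src_rail == dst_rail:
        return [src_rail]                            # same rail: single leaf hop
    if scale_up_available:
        return ["scale_up_fabric", dst_rail]         # preferred: hand off inside the node, then use the destination rail
    return [src_rail, "spine", dst_rail]             # fallback: up to the spine tier, down to the destination rail

print(choose_path(4, 8, scale_up_available=True))    # ['scale_up_fabric', 'rail8']
print(choose_path(4, 8, scale_up_available=False))   # ['rail4', 'spine', 'rail8']
```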
The TOR-wired Clos topology allows shorter cables to connect servers to leaf switches within a rack, enabling the use of cheaper DAC cables in place of optical cables.
Conversely, the Pure Rail and Rail Optimized topologies provide better GPU reachability (that is, more GPUs at a one-hop distance). Therefore, these topologies allow better performance for AI workloads, which is why they are recommended over the TOR-wired Clos topology.