There are many types of applications that exist in a data center or network fabric. These applications can be loss-tolerant (that is, they can tolerate packet drops or are latency insensitive) or loss-intolerant (that is, they cannot tolerate packet drops or are latency sensitive).
Among these applications, few are as unique as AI, whose data traffic pattern is characterized by a small number of very large (elephant) flows, highly parallel and bursty transmissions, and extreme sensitivity to packet loss and latency.
These workload characteristics differentiate AI because it must adhere to strict infrastructure requirements to be of any use to any organization.
Fabrics for generative AI compute clusters are challenged with delivering the highest bandwidth and lowest latency data transfer while avoiding packet loss or any kind of retransmission delays.
Given the massive data volumes pushed through the fabrics in support of AI workloads, these fabrics operate as close to saturation as possible, with traffic characterized by the highly parallel transmission of multiple elephant flows.
Finally, effective use of compute resources depends on minimizing delays due to the network transfers to allow the parallel compute jobs to progress in a synchronized fashion.
Unfortunately, traditional Ethernet is not sufficient for most AI infrastructure because of its inherent congestion, high latency, and unfair bandwidth allocation. The Ethernet of yesterday cannot meet the demanding requirements of AI traffic.
Ethernet fabrics for AI need to deliver key features and use open standards to become a compelling fabric interconnect of choice for the AI world. The following sections describe these features.
The fabric must deliver the highest level of performance while building on a well-established network ecosystem of proven open standards such as Ethernet. An Ethernet-based approach enables a flexible architecture.
AI workloads are unique in that they require specific network or fabric properties to perform optimally.
The requirements of AI workloads can range from a single GPU to a cluster of multiple GPUs. This white paper is relevant to the various AI fabric topologies: single switch, TOR-wired Clos topology, Pure Rail topology, and Rail Optimized topology.
GPUDirect RDMA, which is specifically engineered for GPU acceleration, facilitates direct interaction between NVIDIA GPUs across different systems, circumventing system CPUs and removing the necessity for data buffer copies through the system memory. When run over RoCE, GPUDirect RDMA attains its best performance, particularly when implemented on a lossless network.
In a leaf and spine architecture, the fabric is Layer 3. It uses NVIDIA RoCE Adaptive Routing over equal cost multipath (ECMP) links, which results in uniform traffic distribution across all links between the leaf and spine switches. NVIDIA Direct Data Placement (DDP) Technology augments RoCE Adaptive Routing by placing packets that arrive out of order directly into their correct locations in the receiving host or GPU memory.
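As a conceptual illustration only, the following Python sketch models how per-packet adaptive routing scatters a flow across equal-cost paths (so packets arrive out of order) and how a DDP-style receiver reassembles the message by writing each packet at the offset implied by its sequence number. The function names and packet model are illustrative, not NVIDIA APIs:

```python
import random

def spray_packets(packets, num_paths):
    """Per-packet adaptive routing sketch: each packet of a flow may take
    any equal-cost path, so arrival order is not guaranteed."""
    routed = [(seq, random.randrange(num_paths)) for seq in packets]
    random.shuffle(routed)  # model differing latencies across paths
    return routed

def direct_data_placement(arrived, buffer_size):
    """DDP-style receive: place each packet at the buffer offset given by
    its sequence number, so no in-order delivery is required."""
    buffer = [None] * buffer_size
    for seq, _path in arrived:
        buffer[seq] = seq  # the sequence number stands in for the payload
    return buffer

packets = list(range(8))
arrived = spray_packets(packets, num_paths=4)
result = direct_data_placement(arrived, buffer_size=8)
assert result == packets  # message complete despite out-of-order arrival
```

The key point the sketch captures is that out-of-order arrival costs nothing at the receiver when placement, rather than delivery order, determines where data lands.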
The Spectrum-X platform also includes performance isolation measures to ensure that workloads do not impact each other’s performance. With Spectrum-X’s RoCE Adaptive Routing, performance isolation is attained using fine-grained data path balancing to avoid collision of flows across the leaf and spine.
NVIDIA Spectrum-X uses Scaling Zero Touch RoCE Technology with Round Trip Time Congestion Control for end-to-end congestion control. This approach enables the SuperNIC to rate limit transmissions based on telemetry data obtained from the switch. Congestion control provides AI environments with better throughput and increased performance over traditional Ethernet.
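The interaction between switch telemetry and SuperNIC rate limiting can be pictured as a round-trip-time probe loop. The following Python snippet is a deliberately simplified model of RTT-based congestion control, not the actual Spectrum-X algorithm; the threshold, step size, and decrease factor are illustrative assumptions:

```python
def next_rate(current_rate, rtt_us, base_rtt_us, line_rate,
              additive_step=0.5, decrease_factor=0.8):
    """One step of a simplified RTT-probe congestion controller:
    an RTT near the uncongested baseline allows the sender to ramp toward
    line rate; an inflated RTT signals queuing in the fabric and triggers
    a multiplicative decrease."""
    if rtt_us <= base_rtt_us * 1.2:          # little or no queuing observed
        return min(line_rate, current_rate + additive_step)
    return max(0.1, current_rate * decrease_factor)

# Uncongested round trips ramp the sender up...
rate = 10.0
for _ in range(5):
    rate = next_rate(rate, rtt_us=8.0, base_rtt_us=8.0, line_rate=400.0)
# ...while an inflated RTT backs it off.
rate = next_rate(rate, rtt_us=40.0, base_rtt_us=8.0, line_rate=400.0)
```

Measuring congestion end to end via RTT, rather than waiting for packet drops, is what lets the sender react before queues overflow, which is the property the text attributes to Spectrum-X congestion control.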
NCCL is a high-performance library developed by NVIDIA for collective communication, providing optimized routines for multi-GPU and multi-node communication in deep learning and HPC applications. In this validation, NCCL is used to benchmark the communication bandwidth and efficiency between the GPUs, tuned through parameters such as NCCL_BUFFSIZE, NCCL_ALGO, NCCL_IB_HCA, NCCL_DEBUG, and NCCL_IB_GID_INDEX. The benchmarks were run through a Slurm job script submitted with sbatch, which ensured conditions close to real deep learning workflows and exercised the performance and scalability of the communication infrastructure during GPU testing.
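A Slurm job script of the kind described above might look as follows. This is an illustrative sketch only: the node and GPU counts, the NCCL variable values, and the path to the nccl-tests all_reduce_perf binary are site-specific placeholders, not values taken from this validation:

```shell
#!/bin/bash
#SBATCH --job-name=nccl-allreduce   # placeholder job name
#SBATCH --nodes=2                   # number of GPU nodes; adjust to the cluster
#SBATCH --ntasks-per-node=8         # one task per GPU
#SBATCH --gpus-per-node=8

# NCCL tuning variables named in this validation; values are placeholders.
export NCCL_DEBUG=INFO              # log topology and transport selection
export NCCL_ALGO=Ring               # pin the collective algorithm
export NCCL_BUFFSIZE=8388608        # 8 MiB transport buffers
export NCCL_IB_HCA=mlx5             # select the RoCE-capable adapters
export NCCL_IB_GID_INDEX=3          # GID index used for RoCEv2 addressing

# all_reduce_perf is from the nccl-tests suite; the path is site-specific.
srun ./all_reduce_perf -b 8 -e 4G -f 2 -g 1
```

Sweeping message sizes from small (-b 8) to large (-e 4G) is what surfaces both latency-bound and bandwidth-bound behavior of the fabric in a single run.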