There are many types of applications that exist in a data center or network fabric. These applications can be loss-tolerant (that is, they can tolerate packet drops or are latency insensitive) or loss-intolerant (that is, they cannot tolerate packet drops or are latency sensitive).
Among these applications, few are as unique as AI, whose data traffic pattern is characterized by a small number of very large (elephant) flows, highly parallel and bursty transmissions, and extreme sensitivity to packet loss and latency.
These workload characteristics differentiate AI because it must adhere to strict infrastructure requirements to be of any use to any organization.
Fabrics for generative AI compute clusters are challenged with delivering the highest bandwidth and lowest latency data transfer while avoiding packet loss or any kind of retransmission delays.
Given the massive data volumes pushed through the fabrics in support of AI workloads, these fabrics operate as close to saturation as possible, with traffic characterized by the highly parallel transmission of multiple elephant flows.
Finally, effective use of compute resources depends on minimizing delays due to the network transfers to allow the parallel compute jobs to progress in a synchronized fashion.
Unfortunately, traditional Ethernet is not sufficient for most AI infrastructure because of its inherent congestion, high latency, and unfair bandwidth allocation. The Ethernet of yesterday cannot meet the demanding requirements of AI traffic.
Ethernet fabrics for AI need to deliver key features and use open standards to become a compelling fabric interconnect of choice for the AI world. The following sections describe these features.
The fabric must deliver the highest level of performance while building on a well-established network ecosystem of proven open standards such as Ethernet. An Ethernet-based approach enables a flexible architecture.
AI workloads are unique in that they require specific network or fabric properties to perform optimally.
The requirements of AI workloads can range from a single GPU to a cluster of multiple GPUs. This white paper is relevant to the various AI fabric topologies: single switch, TOR-wired Clos topology, Pure Rail topology, and Rail Optimized topology.
GPUDirect RDMA, which is specifically engineered for GPU acceleration, facilitates direct interaction between NVIDIA GPUs across different systems, circumventing system CPUs and removing the necessity for data buffer copies through the system memory. When run over RoCE, GPUDirect RDMA attains its best performance, particularly when implemented on a lossless network.
In a leaf and spine architecture, the fabric is Layer 3. It uses NVIDIA RoCE Adaptive Routing over equal cost multipath (ECMP) links, which results in uniform traffic distribution across all links between the leaf and spine switches. NVIDIA Direct Data Placement (DDP) Technology augments RoCE Adaptive Routing by placing packets that arrive out of order directly into their correct locations in the receiving host or GPU memory.
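As a conceptual illustration only, the following Python sketch models how per-packet adaptive routing scatters a flow across equal-cost paths (so packets arrive out of order) and how a DDP-style receiver reassembles the message by writing each packet at the offset implied by its sequence number. The function names and packet model are illustrative, not NVIDIA APIs:

```python
import random

def spray_packets(packets, num_paths):
    """Per-packet adaptive routing sketch: each packet of a flow may take
    any equal-cost path, so arrival order is not guaranteed."""
    routed = [(seq, random.randrange(num_paths)) for seq in packets]
    random.shuffle(routed)  # model differing latencies across paths
    return routed

def direct_data_placement(arrived, buffer_size):
    """DDP-style receive: place each packet at the buffer offset given by
    its sequence number, so no in-order delivery is required."""
    buffer = [None] * buffer_size
    for seq, _path in arrived:
        buffer[seq] = seq  # the sequence number stands in for the payload
    return buffer

packets = list(range(8))
arrived = spray_packets(packets, num_paths=4)
result = direct_data_placement(arrived, buffer_size=8)
assert result == packets  # message complete despite out-of-order arrival
```

The key point the sketch captures is that out-of-order arrival costs nothing at the receiver when placement, rather than delivery order, determines where data lands.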
The Spectrum-X platform also includes performance isolation measures to ensure that workloads do not impact each other’s performance. With Spectrum-X’s RoCE Adaptive Routing, performance isolation is attained using fine-grained data path balancing to avoid collision of flows across the leaf and spine.
NVIDIA Spectrum-X uses Scaling Zero Touch RoCE Technology with Round Trip Time Congestion Control for end-to-end congestion control. This approach enables the SuperNIC to rate limit transmissions based on telemetry data obtained from the switch. Congestion control provides AI environments with better throughput and increased performance over traditional Ethernet.
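The interaction between switch telemetry and SuperNIC rate limiting can be pictured as a round-trip-time probe loop. The following Python snippet is a deliberately simplified model of RTT-based congestion control, not the actual Spectrum-X algorithm; the threshold, step size, and decrease factor are illustrative assumptions:

```python
def next_rate(current_rate, rtt_us, base_rtt_us, line_rate,
              additive_step=0.5, decrease_factor=0.8):
    """One step of a simplified RTT-probe congestion controller:
    an RTT near the uncongested baseline allows the sender to ramp toward
    line rate; an inflated RTT signals queuing in the fabric and triggers
    a multiplicative decrease."""
    if rtt_us <= base_rtt_us * 1.2:          # little or no queuing observed
        return min(line_rate, current_rate + additive_step)
    return max(0.1, current_rate * decrease_factor)

# Uncongested round trips ramp the sender up...
rate = 10.0
for _ in range(5):
    rate = next_rate(rate, rtt_us=8.0, base_rtt_us=8.0, line_rate=400.0)
# ...while an inflated RTT backs it off.
rate = next_rate(rate, rtt_us=40.0, base_rtt_us=8.0, line_rate=400.0)
```

Measuring congestion end to end via RTT, rather than waiting for packet drops, is what lets the sender react before queues overflow, which is the property the text attributes to Spectrum-X congestion control.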
NCCL is a high-performance library developed by NVIDIA for collective communication, providing optimized routines for multi-GPU and multi-node communication in deep learning and HPC applications. In this validation, NCCL is used to benchmark the communication bandwidth and efficiency between the GPUs, tuned through parameters such as NCCL_BUFFSIZE, NCCL_ALGO, NCCL_IB_HCA, NCCL_DEBUG, and NCCL_IB_GID_INDEX. The benchmarks were run through a Slurm job script submitted with sbatch, which ensured conditions close to real deep learning workflows and exercised the performance and scalability of the communication infrastructure during GPU testing.
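A Slurm job script of the kind described above might look as follows. This is an illustrative sketch only: the node and GPU counts, the NCCL variable values, and the path to the nccl-tests all_reduce_perf binary are site-specific placeholders, not values taken from this validation:

```shell
#!/bin/bash
#SBATCH --job-name=nccl-allreduce   # placeholder job name
#SBATCH --nodes=2                   # number of GPU nodes; adjust to the cluster
#SBATCH --ntasks-per-node=8         # one task per GPU
#SBATCH --gpus-per-node=8

# NCCL tuning variables named in this validation; values are placeholders.
export NCCL_DEBUG=INFO              # log topology and transport selection
export NCCL_ALGO=Ring               # pin the collective algorithm
export NCCL_BUFFSIZE=8388608        # 8 MiB transport buffers
export NCCL_IB_HCA=mlx5             # select the RoCE-capable adapters
export NCCL_IB_GID_INDEX=3          # GID index used for RoCEv2 addressing

# all_reduce_perf is from the nccl-tests suite; the path is site-specific.
srun ./all_reduce_perf -b 8 -e 4G -f 2 -g 1
```

Sweeping message sizes from small (-b 8) to large (-e 4G) is what surfaces both latency-bound and bandwidth-bound behavior of the fabric in a single run.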