Many types of applications run in a data center or network fabric. These applications can be loss-tolerant (they can tolerate packet drops or are latency insensitive) or loss-intolerant (they cannot tolerate packet drops or are latency sensitive).
Among these applications, few are as distinctive as artificial intelligence (AI), whose data traffic pattern is characterized by:
- An extremely large volume of data exchanged, particularly with large language models (LLMs) and similar models
- High-rate data exchange taking place during the initial training phase of any model
- Latency-sensitive applications exchanging vast amounts of data in a feedback-loop fashion
- Diverse traffic patterns, ranging from predictable and somewhat ordered to unpredictable
- Heterogeneous flow sizes (such as elephant and mouse flows)
- Chaotic and bursty dataset patterns
These characteristics make AI different from other workloads in the sense that it must adhere to strict infrastructure requirements to be useful to any organization.
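The mix of elephant and mouse flows mentioned above can be illustrated with a minimal sketch. The 1 MB threshold, flow names, and byte counts below are purely illustrative assumptions, not standard values:

```python
# Illustrative sketch: separating "elephant" flows (large, long-lived bulk
# transfers) from "mouse" flows (small, short-lived messages) by byte count.
# The 1 MB threshold is an assumed cutoff for demonstration only.
ELEPHANT_THRESHOLD_BYTES = 1_000_000

def classify(flow_bytes: int) -> str:
    """Return 'elephant' for large flows, 'mouse' for small ones."""
    return "elephant" if flow_bytes >= ELEPHANT_THRESHOLD_BYTES else "mouse"

# Hypothetical flows from an AI training fabric.
flows = {
    "gradient-exchange": 512_000_000,  # bulk model-synchronization transfer
    "health-probe": 2_000,             # small control-plane message
}

for name, size in flows.items():
    print(f"{name}: {classify(size)}")
```

In practice, fabrics must serve both classes well at once: elephant flows demand sustained bandwidth, while mouse flows demand low queuing delay.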
Fabrics for GenAI compute clusters are all challenged with delivering the highest-bandwidth, lowest-latency data transfer while avoiding packet loss and retransmission delays.
Given the massive data volumes being pushed through the fabrics in support of AI workloads, these fabrics will operate as close to saturation as possible, characterized by highly parallelized transmission of multiple elephant flows.
Finally, effective utilization of the compute resources depends on minimizing network transfer delays so that the parallel compute jobs can progress in a synchronized fashion.
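The synchronization point above is why even rare network delays are costly: in a synchronous parallel step, every worker waits for the slowest transfer. The following sketch makes that "straggler" arithmetic concrete; the worker count, transfer times, and 5% congestion rate are illustrative assumptions:

```python
import random

def step_time(transfer_times):
    # In a synchronous step, all workers must finish exchanging data
    # before any can proceed, so the step is gated by the slowest transfer.
    return max(transfer_times)

random.seed(0)  # deterministic for reproducibility
workers = 8
steps = 1000

# Baseline: every transfer takes 10 ms on an uncongested fabric.
ideal = sum(step_time([10.0] * workers) for _ in range(steps))

# Assumed congestion model: 5% of transfers slow to 50 ms.
congested = sum(
    step_time([50.0 if random.random() < 0.05 else 10.0
               for _ in range(workers)])
    for _ in range(steps)
)

print(f"ideal total: {ideal:.0f} ms, with congestion: {congested:.0f} ms")
```

Even though only a small fraction of individual transfers slow down, the probability that at least one of the eight parallel transfers is slow in a given step is much higher, so total job time inflates disproportionately. This is the effect a lossless, low-latency fabric is designed to avoid.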
In summary, an AI fabric must be:
- Lossless
- High performing
- Scalable
Dell Technologies, with its complete product portfolio, has the software and hardware component stack needed to create and deliver the necessary infrastructure and surrounding components for an AI environment.