Building a GPU Cluster with the Dell PowerEdge XE9680

The Dell PowerEdge XE9680 integrates eight GPUs and up to 10 NICs/DPUs. Multiple fabrics are needed to create a GPU cluster based on the XE9680, as shown in Figure 9.

Figure 9. Example of a Dell PowerEdge XE9680-based GPU cluster

The eight GPUs within an XE9680 are interconnected by an Intra-Node Scale Up Fabric that creates a GPU high bandwidth domain. Eight of the NICs/DPUs are paired with the eight GPUs through PCIe and are used to exchange parameters among GPUs belonging to different chassis through a Scale Out Fabric. The remaining two NICs/DPUs are connected to a Front-End Fabric, which is used to deploy storage, applications, and in-band cluster management.
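The NIC/DPU allocation described above can be sketched as a small sizing helper. This is only an illustration: the names `NodePlan`, `plan_node`, and `cluster_ports` are hypothetical, and the port counts assume one fabric-facing port per NIC/DPU, which the text does not state.

```python
from dataclasses import dataclass

# Per-node figures taken from the text; all names here are illustrative only.
GPUS_PER_NODE = 8
MAX_NICS_PER_NODE = 10


@dataclass(frozen=True)
class NodePlan:
    scale_out_nics: int  # NICs/DPUs paired with the GPUs over PCIe
    front_end_nics: int  # NICs/DPUs for storage, applications, in-band management


def plan_node(total_nics: int = MAX_NICS_PER_NODE) -> NodePlan:
    """Split a node's NICs/DPUs: one per GPU for scale out, the rest for the front end."""
    if not GPUS_PER_NODE <= total_nics <= MAX_NICS_PER_NODE:
        raise ValueError("this sketch expects between 8 and 10 NICs/DPUs per node")
    return NodePlan(
        scale_out_nics=GPUS_PER_NODE,
        front_end_nics=total_nics - GPUS_PER_NODE,
    )


def cluster_ports(nodes: int, plan: NodePlan) -> dict:
    """Count GPUs and NIC ports per fabric, assuming one port per NIC/DPU."""
    return {
        "gpus": nodes * GPUS_PER_NODE,
        "scale_out_ports": nodes * plan.scale_out_nics,
        "front_end_ports": nodes * plan.front_end_nics,
    }


print(cluster_ports(4, plan_node()))
# {'gpus': 32, 'scale_out_ports': 32, 'front_end_ports': 8}
```

For a hypothetical four-node cluster, this gives 32 GPUs, 32 Scale Out Fabric ports, and 8 Front-End Fabric ports. The Scale Up Fabric is internal to each chassis, so it adds no ports to this count.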

Figure 9 shows the different fabrics that support the deployment of a typical AI workload:

1. Scale Up Fabric: This fabric interconnects the eight GPUs to form a high bandwidth domain, usually through proprietary network technologies (e.g., NVIDIA NVLink or AMD XGMI).
2. Scale Out Fabric: This fabric is used by the set of GPUs composing a GPU cluster to exchange AI workload parameters.
3. Front-End Fabric: This fabric supports the deployment of the storage, application, and in-band cluster management components of the AI solution.
4. Management Network: This network supports management of the overall environment for the solution.

The terms "Scale Up Fabric", "Scale Out Fabric", and "Front-End Fabric" are expected to become standard in the industry. However, other terms are also used for these fabrics and are listed here for reference:

Scale Up Fabric: GPU Fabric, Back-End Fabric, Front-End Fabric.
Scale Out Fabric: Host Fabric, East-West Fabric, Back-End Fabric, Front-End Fabric.
Front-End Fabric: Storage/Access Fabric, North-South Fabric.
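Several of the alternative names appear under more than one fabric, so a lookup from a term to the terminology used here can return several candidates. The following minimal sketch copies the lists above verbatim; the names `ALIASES` and `canonical_names` are hypothetical.

```python
# Alternative names copied from the lists above, keyed by the name used in this document.
ALIASES = {
    "Scale Up Fabric": ["GPU Fabric", "Back-End Fabric", "Front-End Fabric"],
    "Scale Out Fabric": ["Host Fabric", "East-West Fabric", "Back-End Fabric", "Front-End Fabric"],
    "Front-End Fabric": ["Storage/Access Fabric", "North-South Fabric"],
}


def canonical_names(term: str) -> list:
    """Return every fabric that a term may refer to, in the order listed above."""
    return [
        name
        for name, aliases in ALIASES.items()
        if term == name or term in aliases
    ]


print(canonical_names("North-South Fabric"))  # ['Front-End Fabric']
print(canonical_names("Back-End Fabric"))     # ['Scale Up Fabric', 'Scale Out Fabric']
```

A term that returns more than one fabric, such as "Back-End Fabric", needs context to be resolved.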

We will focus on the Scale Out Fabric, which all the GPUs in the cluster use to exchange parameters during the long, intensive training jobs that create a new model. We will also consider the factors that are important when building a Front-End Fabric, which provides the storage access and application access that support the cluster functionality.
