Reference Architecture: GPU Acceleration for Dell PowerEdge MX7000
Wed, 03 Jul 2024 17:24:01 -0000
Summary
GPU acceleration is pivotal for powering a broad spectrum of demanding applications, particularly AI. The Liqid SmartStack MX for Dell PowerEdge MX7000 enables the on-demand allocation of enterprise-grade PCIe GPUs to Dell PowerEdge compute sleds. This integration provides a modular infrastructure that accelerates AI-driven workloads while letting organizations stay in their chosen server form factor.
GPU Acceleration for PowerEdge MX7000
The Dell PowerEdge MX7000 Modular Chassis stands at the forefront of modernizing dynamic data centers, streamlining the deployment and management of complex workloads. Pivotal to this modernization is the Liqid SmartStack MX, which introduces GPU acceleration capabilities essential for compute-intensive environments, including AI and graphics-intensive applications.
By adding SmartStack MX, customers can dynamically connect and scale enterprise-grade GPUs from NVIDIA, AMD, and Intel to Dell PowerEdge compute sleds at the bare-metal level. The sleds connect over PCIe to pools of 10, 20, or 30 GPUs, and Liqid Matrix software can dynamically attach up to 20 enterprise GPUs, such as the NVIDIA L40S, to a single MX760c compute sled. The solution also supports reallocating GPUs between compute sleds as workload demands evolve, and provisioning them to Dell PowerEdge R- and C-series rackmount servers. This enables migration between platforms for better utilization, agility, and investment protection.
Figure 1. SmartStack MX20 Complete System
Liqid SmartStack MX systems are fully validated, composable solutions designed to meet your most challenging GPU requirements. Each system includes several key components. The Liqid EX-4410 PCIe expansion chassis holds up to 10 FHFL double-width GPUs. The Liqid Director hosts the Liqid Matrix software used for GPU provisioning, also known as composing. The Liqid 48-port PCIe Gen 4.0 switch is used in the SmartStack MX20, MX30, and MX30+ systems. Depending on the system configuration, either four or eight PCIe HBAs are housed within the CoreModuleXL, an expansion module provided by Amulet Hotkey. This module occupies the B1 and B2 fabrics in the MX7000 and serves as the direct connection between the Dell compute sleds and the Liqid PCIe fabric and GPUs. SmartStack MX supports attaching GPUs to the following Dell PowerEdge compute sleds: MX760c, MX750c, and MX740c.
Liqid SmartStack MX Series Technical Specifications
| | SmartStack MX10 | SmartStack MX20 | SmartStack MX30 | SmartStack MX30+ |
| Description | 10 GPU / 4 Host Capacity | 20 GPU / 8 Host Capacity | 30 GPU / 8 Host Capacity | 30 GPU / 16 Host Capacity |
| No. of MX7000 Chassis | 1x MX7000 | 1x MX7000 | 1x MX7000 | 2x MX7000 |
| Max GPUs per MX7000 Enclosure | 10x Full-height, full-length (FHFL) 10.5”, dual-slot | 20x Full-height, full-length (FHFL) 10.5”, dual-slot | 30x Full-height, full-length (FHFL) 10.5”, dual-slot | 30x Full-height, full-length (FHFL) 10.5”, dual-slot |
| Supported Device Types | GPU, NVMe, FPGA, DPU | GPU, NVMe, FPGA, DPU | GPU, NVMe, FPGA, DPU | GPU, NVMe, FPGA, DPU |
| Max Hosts | 4x Compute Sleds | 8x Compute Sleds | 8x Compute Sleds | 16x Compute Sleds |
| PCIe Expansion Chassis | 1x Liqid EX-4410 - 10-slot | 2x Liqid EX-4410 - 10-slot | 3x Liqid EX-4410 - 10-slot | 3x Liqid EX-4410 - 10-slot |
| PCIe Fabric Switch | Integrated Switch | 1x 48-Port Switch | 1x 48-Port Switch | 2x 48-Port Switches |
| PCIe Fabric HBA | 1x Fabric B CoreModuleXL w/ 4x PCIe Gen4 x16 HBAs | 1x Fabric B CoreModuleXL w/ 8x PCIe Gen4 x16 HBAs | 1x Fabric B CoreModuleXL w/ 8x PCIe Gen4 x16 HBAs | 2x Fabric B CoreModuleXL w/ 16x PCIe Gen4 x16 HBAs |
| Rack Units | 5U | 10U | 14U | 15U |
| Composable Devices | See liqid.com/resources/library for the current hardware compatibility list of composable PCIe devices |
Table 1. Liqid SmartStack Solutions
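As a rough sizing aid, the capacities in Table 1 can be encoded as data and searched for the smallest system that covers a given GPU and host count. This is only a sketch: the `SmartStackConfig` class and `smallest_config` function are hypothetical names introduced for illustration, and the figures are transcribed from Table 1 rather than taken from any Liqid tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SmartStackConfig:
    """One column of Table 1 (hypothetical helper type, not a Liqid API)."""
    name: str
    max_gpus: int
    max_hosts: int
    mx7000_chassis: int
    rack_units: int


# Values transcribed from Table 1.
CONFIGS = [
    SmartStackConfig("SmartStack MX10", 10, 4, 1, 5),
    SmartStackConfig("SmartStack MX20", 20, 8, 1, 10),
    SmartStackConfig("SmartStack MX30", 30, 8, 1, 14),
    SmartStackConfig("SmartStack MX30+", 30, 16, 2, 15),
]


def smallest_config(gpus: int, hosts: int) -> Optional[SmartStackConfig]:
    """Return the config with the fewest rack units that fits, or None."""
    candidates = [
        c for c in CONFIGS if c.max_gpus >= gpus and c.max_hosts >= hosts
    ]
    return min(candidates, key=lambda c: c.rack_units, default=None)


if __name__ == "__main__":
    choice = smallest_config(gpus=12, hosts=6)
    print(choice.name if choice else "no single SmartStack MX system fits")
```

Ranking by rack units is an arbitrary choice for the sketch; a real sizing exercise would also weigh cost, chassis count, and expected growth.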
Implementing GPU Expansion for MX
Figure 2. Liqid Matrix User Interface
First, install GPUs into the PCIe expansion chassis. Supported GPUs are listed on the Liqid Hardware Compatibility List (HCL). Then connect the HBAs in the CoreModuleXL, which occupies MX7000 Fabrics B1 and B2, to the Liqid PCIe fabric. Liqid Matrix software is connected to the fabric via the Liqid Director and is used to provision resources. Liqid also supports provisioning other PCIe resources to Dell PowerEdge compute sleds, including Liqid NVMe SSDs, NICs, and DPUs.
Software Defined GPU Deployment
Once PCIe devices are connected to the MX7000, Liqid Matrix software enables the dynamic allocation of GPUs to MX compute sleds at the bare-metal level and supports GPU hot-plug. Up to 20 GPUs can be added to a single MX760c compute sled via the Liqid UI or a RESTful API to meet end-user workload requirements. The MX750c also supports up to 20 GPUs per compute sled, and the MX740c supports up to 16. To the operating system, the GPUs are presented as local resources directly connected to the MX compute sled over PCIe. Most operating systems are supported, including Linux, Windows, and VMware. Liqid also provides Slurm and Kubernetes plug-ins. As workload needs change, resources including GPUs, NVMe SSDs, and FPGAs can be added or removed on the fly via software.
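Before submitting a composition request through the UI or API, it can help to check a planned allocation against the per-sled limits stated above. The sketch below does only that check. It does not call Liqid Matrix, and `validate_plan`, the plan layout, and the sled labels are all hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple

# Per-sled GPU limits as stated in the text above.
MAX_GPUS_PER_SLED: Dict[str, int] = {"MX760c": 20, "MX750c": 20, "MX740c": 16}


def validate_plan(plan: Dict[str, Tuple[str, int]], pool_size: int) -> List[str]:
    """Check a plan of {sled label: (sled model, GPU count)} against the limits.

    Returns a list of human-readable problems; an empty list means the plan
    respects both the per-sled limits and the size of the GPU pool.
    """
    problems: List[str] = []
    total = 0
    for sled, (model, count) in plan.items():
        limit = MAX_GPUS_PER_SLED.get(model)
        if limit is None:
            problems.append(f"{sled}: unsupported sled model {model}")
            continue
        if count < 0:
            problems.append(f"{sled}: GPU count cannot be negative")
            continue
        if count > limit:
            problems.append(
                f"{sled}: {count} GPUs exceeds the {model} limit of {limit}"
            )
        total += count
    if total > pool_size:
        problems.append(f"plan needs {total} GPUs but the pool holds {pool_size}")
    return problems


if __name__ == "__main__":
    plan = {"sled-1": ("MX760c", 20), "sled-2": ("MX740c", 10)}
    print(validate_plan(plan, pool_size=30) or "plan fits")
```

Returning a list of problems rather than raising on the first one lets an operator see every violation in a plan at once.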
Enabling GPU Peer-2-Peer Capability
A key feature of SmartStack MX is RDMA Peer-2-Peer (P2P) communication, which is supported between GPUs in a single chassis, across multiple Liqid expansion chassis, and between GPUs and NVMe SSDs. P2P transfers bypass the x86 processor so that devices exchange data directly over the PCIe fabric, which improves both throughput and response time (latency) for the highest-performing GPU-centric applications. Performance improvements include up to a tenfold increase in throughput, along with reduced latency, even in complex, multi-chassis configurations. The Liqid GPU expansion chassis is PCIe Gen4, so P2P traffic for the MX760c, MX750c, and MX740c runs at PCIe Gen4 levels. The accompanying chart (Figure 3) and table (Table 2) provide an overview of how PCIe Peer-2-Peer functions are enabled and show the expected performance improvement when GPUs are composed to a single node with GPU RDMA Peer-2-Peer enabled.
Figure 3. Peer-2-Peer Modes and Performance
Table 2. Comparing Performance with Peer-2-Peer Disabled vs. Enabled
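For context on what "PCIe Gen4 levels" means, the theoretical ceiling of a Gen4 link can be worked out from the PCIe signaling rate. The sketch below uses the standard Gen4 figures (16 GT/s per lane, 128b/130b encoding) and gives a raw per-direction link rate before packet and protocol overhead. It is not a Liqid measurement, and the function name is a hypothetical one used for illustration.

```python
# Standard PCIe Gen4 signaling parameters.
GEN4_TRANSFER_RATE_GT_S = 16.0      # giga-transfers per second, per lane
ENCODING_EFFICIENCY = 128 / 130     # 128b/130b line encoding


def pcie_gen4_bandwidth_gb_s(lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s for a Gen4 link."""
    if lanes <= 0:
        raise ValueError("lanes must be a positive integer")
    usable_gbit_s = GEN4_TRANSFER_RATE_GT_S * ENCODING_EFFICIENCY * lanes
    return usable_gbit_s / 8        # bits to bytes


if __name__ == "__main__":
    # The HBAs in Table 1 are PCIe Gen4 x16.
    print(f"Gen4 x16: {pcie_gen4_bandwidth_gb_s(16):.1f} GB/s per direction")
```

For an x16 link this works out to about 31.5 GB/s per direction. Achieved P2P throughput will be lower once transaction-layer overhead and switch hops are taken into account.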
Application Performance
RDMA Peer-2-Peer is a crucial enhancement in GPU scaling for Artificial Intelligence, particularly for machine learning-based applications. Figure 4 presents performance data obtained using MLPerf on the MX7000 equipped with SmartStack MX. It showcases strong scalability from 4-GPU to 20-GPU configurations on a single compute sled. This data, represented in queries/second, demonstrates high scaling efficiency across a variety of MLPerf 3.1 workloads, achieved with the implementation of composable PCIe GPUs and Peer-2-Peer technology. The results illustrate a near-linear growth pattern in performance, highlighting the robust capabilities of Liqid's technology, which can allocate up to 20 GPUs to an application running on a single compute sled. Such scalability ensures optimal performance and resource utilization, critical for demanding AI computations.
Figure 4. GPU Performance Scaling Comparison – MX7000 with SmartStack MX (NVIDIA L40S), with Peer-2-Peer enabled.
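"Near-linear" scaling can be quantified as scaling efficiency: the measured speedup divided by the ideal speedup for the same increase in GPU count. The sketch below shows the calculation. The function name is hypothetical, and the numbers in the usage example are placeholders chosen to make the arithmetic easy to follow, not results from Figure 4.

```python
def scaling_efficiency(
    base_gpus: int, base_qps: float, gpus: int, qps: float
) -> float:
    """Measured speedup over the baseline, divided by the ideal speedup.

    A value of 1.0 means perfectly linear scaling; lower values mean some
    of the added GPU capacity did not turn into throughput.
    """
    if base_gpus <= 0 or gpus <= 0:
        raise ValueError("GPU counts must be positive")
    if base_qps <= 0:
        raise ValueError("baseline throughput must be positive")
    measured_speedup = qps / base_qps
    ideal_speedup = gpus / base_gpus
    return measured_speedup / ideal_speedup


if __name__ == "__main__":
    # Placeholder throughput values, NOT measurements from this document.
    eff = scaling_efficiency(base_gpus=4, base_qps=100.0, gpus=20, qps=450.0)
    print(f"scaling efficiency: {eff:.0%}")
```

Applying the same calculation to the queries/second values for the 4-GPU and 20-GPU configurations would give a single efficiency figure per MLPerf workload.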
Conclusion
The Liqid SmartStack MX enables advanced AI and graphics-intensive workloads to be executed efficiently on Dell PowerEdge servers. Through a strategic collaboration with Dell Technologies Design Solutions, Liqid has enhanced the PowerEdge MX compute sleds with composable GPU capacity. This partnership accelerates applications and lets enterprises scale AI capabilities flexibly, up to 20 GPUs on a single compute sled. The integration of Liqid’s technology with Dell Technologies’ infrastructure reflects a shared commitment to advancing data center performance for enterprise computing.
This reference architecture is available as part of the Dell Technologies Design Solutions.
Ask your Dell account team for more details or contact a Liqid expert.