Dell Technologies provides a diverse selection of acceleration-optimized servers with an extensive portfolio of accelerators featuring NVIDIA GPUs. In this design, we showcase three Dell PowerEdge servers tailored for generative AI purposes: the PowerEdge R760xa, PowerEdge XE8640, and PowerEdge XE9680.
In this section, we describe the configuration and connectivity options for NVIDIA GPUs, and how these server-GPU combinations can be applied to various LLM use cases.
This design for inferencing supports several options for NVIDIA GPU acceleration components. The following table provides a summary of the GPUs used in this design:
Table 2. NVIDIA GPUs – Technical specifications and use cases
| | NVIDIA H100 SXM GPU | NVIDIA H100 PCIe GPU | NVIDIA L40 PCIe GPU |
|---|---|---|---|
| Latest supported PowerEdge servers (and maximum number of GPUs) | PowerEdge XE9680 (8), PowerEdge XE8640 (4) | PowerEdge R760xa (4), PowerEdge R760 (2) | PowerEdge R760xa (4), PowerEdge R760 (2) |
| GPU memory | 80 GB | 80 GB | 48 GB |
| Form factor | SXM | PCIe (dual width, dual slot) | PCIe (dual width, dual slot) |
| GPU interconnect | 900 GB/s NVLink | 600 GB/s NVLink Bridge (supported in PowerEdge R760xa); 128 GB/s PCIe Gen5 | None |
| Multi-instance GPU support | Up to 7 MIGs | Up to 7 MIGs | None |
| Decoders | 7 NVDEC, 7 JPEG | 7 NVDEC, 7 JPEG | 3 NVDEC, 3 NVENC |
| Max thermal design power (TDP) | 700 W | 350 W | 300 W |
| NVIDIA AI Enterprise | Add-on | Included | Add-on |
| Most applicable use cases | Generative AI training; large-scale distributed training | Discriminative/predictive AI training and inference; generative AI inference | Small-scale AI; visual computing; discriminative/predictive AI inference |
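The specifications in the preceding table can be cross-checked against the GPUs actually installed in a server. The following is a minimal sketch using the nvidia-ml-py (pynvml) Python bindings; it assumes the package and an NVIDIA driver are present, and MIG queries raise an error on GPUs that do not support MIG, such as the L40.

```python
# Sketch: enumerate installed NVIDIA GPUs and report memory, power limit,
# and MIG mode, for comparison with the specifications in Table 2.
# Assumes the nvidia-ml-py package (imported as pynvml) and an NVIDIA driver.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older bindings return bytes
            name = name.decode()
        mem_gb = pynvml.nvmlDeviceGetMemoryInfo(handle).total / 1024**3
        tdp_w = pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000.0
        try:
            mig_current, _ = pynvml.nvmlDeviceGetMigMode(handle)
            mig = "enabled" if mig_current else "disabled"
        except pynvml.NVMLError:
            mig = "not supported"
        print(f"GPU {i}: {name}, {mem_gb:.0f} GB, {tdp_w:.0f} W limit, MIG {mig}")
finally:
    pynvml.nvmlShutdown()
```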
NVIDIA GPUs support several options for connecting two or more GPUs, each offering different bandwidths. GPU connectivity is often required for multi-GPU applications, especially when higher performance and lower latency are crucial. LLMs often do not fit in the memory of a single GPU and are typically deployed across multiple GPUs, which therefore require high-speed connectivity between them.
NVIDIA NVLink is a high-speed interconnect technology developed by NVIDIA for connecting multiple NVIDIA GPUs to work in parallel. It allows for direct communication between the GPUs with high bandwidth and low latency, enabling them to share data and work collaboratively on compute-intensive tasks.
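To check whether NVLink connections are active between the GPUs in a server, the NVML NVLink queries can be used. The sketch below relies on the nvidia-ml-py (pynvml) bindings; exact function availability depends on the driver and library version, and GPUs without NVLink (for example, the L40) simply report no active links.

```python
# Sketch: count active NVLink links per GPU using NVML.
# Assumes nvidia-ml-py (pynvml); GPUs without NVLink report zero links.
import pynvml

pynvml.nvmlInit()
try:
    max_links = getattr(pynvml, "NVML_NVLINK_MAX_LINKS", 18)
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        active = 0
        for link in range(max_links):
            try:
                if pynvml.nvmlDeviceGetNvLinkState(handle, link):
                    active += 1
            except pynvml.NVMLError:
                break  # link index not present or NVLink unsupported
        print(f"GPU {i}: {active} active NVLink link(s)")
finally:
    pynvml.nvmlShutdown()
```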
The following figure illustrates the NVIDIA GPU connectivity options for the PowerEdge servers used in this design:
Figure 1. NVIDIA GPU connectivity in PowerEdge servers
PowerEdge servers support several different NVLink options:
1. PowerEdge R760xa server with NVIDIA H100 GPUs and NVLink Bridge—NVIDIA NVLink is a high-speed point-to-point (P2P) peer transfer connection. An NVLink bridge is a physical component that facilitates the connection between NVLink-capable GPUs. It acts as an interconnect between the GPUs, allowing them to exchange data at extremely high speeds.
The PowerEdge R760xa server supports four NVIDIA H100 GPUs; an NVLink Bridge can connect each pair of GPUs. The NVIDIA H100 GPU supports an NVLink Bridge connection with a single adjacent NVIDIA H100 GPU. Each of the three attached bridges spans two PCIe slots, for a total maximum NVLink Bridge bandwidth of 600 GB/s.
2. PowerEdge XE8640 server with NVIDIA H100 SXM GPUs and NVLink—The PowerEdge XE8640 server incorporates four H100 GPUs with NVIDIA SXM5 technology. NVIDIA SXM is a high-bandwidth socket solution for connecting NVIDIA Compute Accelerators to a system.
The NVIDIA SXM form factor enables multiple GPUs to be tightly interconnected in a server, providing high-bandwidth and low-latency communication between the GPUs. NVIDIA's NVLink technology, which allows for faster data transfers compared to traditional PCIe connections, facilitates this direct GPU-to-GPU communication. The NVLink technology provides a bandwidth of 900 GB/s between any two GPUs.
3. PowerEdge XE9680 server with NVIDIA H100 GPUs and NVSwitch—The PowerEdge XE9680 server incorporates eight NVIDIA H100 GPUs with NVIDIA SXM5 technology. The server includes NVIDIA NVSwitch technology, which is a high-performance, fully connected, and scalable switch technology. It is designed to enable ultrafast communication between multiple NVIDIA GPUs. NVIDIA NVSwitch facilitates high-bandwidth and low-latency data transfers, making it ideal for large-scale AI and high-performance computing (HPC) applications. The NVSwitch technology provides a bandwidth of 900 GB/s between any two GPUs.
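The interconnect differences described above can also be observed empirically. The following PyTorch sketch times a device-to-device tensor copy between two GPUs to obtain a rough bandwidth estimate; the `p2p_bandwidth_gb_s` helper is hypothetical, and whether the copy travels over NVLink, NVSwitch, or PCIe depends on the hardware and driver, so the result is a coarse indication rather than a precise NVLink measurement.

```python
# Sketch: rough GPU-to-GPU copy bandwidth between two devices.
# Illustrative only; requires PyTorch with at least two CUDA GPUs visible.
import time
import torch

def p2p_bandwidth_gb_s(src=0, dst=1, size_mb=1024, iters=20):
    a = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device=f"cuda:{src}")
    b = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device=f"cuda:{dst}")
    b.copy_(a)                      # warm-up copy
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    start = time.perf_counter()
    for _ in range(iters):
        b.copy_(a)
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    elapsed = time.perf_counter() - start
    return size_mb / 1024 * iters / elapsed  # GB transferred per second

if torch.cuda.device_count() >= 2:
    print(f"GPU0 -> GPU1: ~{p2p_bandwidth_gb_s():.0f} GB/s")
```

On systems where peer access is available, `torch.cuda.can_device_access_peer(0, 1)` returns True; note that peer access can also be reported over plain PCIe, so this check alone does not confirm NVLink.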
During inference, an AI model's parameters are stored in GPU memory. LLMs might require the memory of multiple GPUs to accommodate their entire neural network structure. In such cases, it is necessary to interconnect the GPUs using NVLink technology to effectively support the model's operations and ensure seamless communication between the GPUs. Therefore, the size of the LLM model that an enterprise requires determines both the number of GPUs needed and the nature of their interconnection. As a result, the choice of PowerEdge server model for the inference infrastructure depends on the size of the LLM models to be deployed.
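As a rough, weights-only rule of thumb, a model's memory footprint can be estimated from its parameter count and precision. The sketch below uses an assumed overhead multiplier for runtime buffers; real deployments also need KV-cache memory that grows with batch size and sequence length, so actual requirements at high concurrency can be substantially higher than these figures.

```python
# Sketch: back-of-the-envelope GPU memory estimate for serving an LLM.
# The 1.2x overhead multiplier is an assumption; KV cache for many concurrent
# sequences is not included and can add significantly more memory.
def estimate_llm_memory_gb(params_billion, bytes_per_param=2, overhead=1.2):
    weights_gb = params_billion * 1e9 * bytes_per_param / 1024**3
    return weights_gb * overhead

for name, params_b in [("Llama 2 13B", 13), ("Llama 2 70B", 70), ("BLOOM 175B", 175)]:
    print(f"{name}: ~{estimate_llm_memory_gb(params_b):.0f} GB (FP16 weights + overhead)")
```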
For models with memory footprints under 160 GB, we recommend using the PowerEdge R760xa server for inference. Most models fall within this category. Notable examples include Llama 2 7B and 13B.
If your model's memory footprint falls between 160 GB and 320 GB, we recommend the PowerEdge XE8640. The XE8640 features four fully connected GPUs with NVLink, enabling models to leverage this capability for enhanced performance. While there are no widely recognized community models in this range, customers have the option to deploy their own custom models.
For models with memory footprints exceeding 320 GB, we recommend the PowerEdge XE9680. Models such as Llama 2 70B and BLOOM 175B fall into this category. The following table summarizes examples of LLM models that can be deployed on the PowerEdge servers.
Table 3. Example models supported in PowerEdge servers

| Model characteristics | PowerEdge R760xa with NVIDIA H100 PCIe using NVLink Bridge | PowerEdge XE8640 with NVIDIA H100 SXM | PowerEdge XE9680 with NVIDIA H100 SXM |
|---|---|---|---|
| Total GPU memory available | 320 GB | 320 GB | 640 GB |
| Maximum memory footprint of a model that can run | 160 GB | 320 GB | 640 GB |
| Example open-source LLMs | NeMo GPT 345M, 1.3B, 2B, 5B, and 20B; Llama 2 7B and 13B | NeMo GPT 345M, 1.3B, 2B, 5B, and 20B; Llama 2 7B and 13B | NeMo GPT 345M, 1.3B, 2B, 5B, and 20B; Llama 2 7B and 13B; Llama 2 70B (available for 8 GPUs); BLOOM 175B |
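As an illustration of how the thresholds in Table 3 might drive server selection, the following sketch maps an estimated single-model memory footprint to the configurations above. The `recommend_server` helper is hypothetical, and multi-node deployments (see the note at the end of this section) are out of scope.

```python
# Sketch: map an estimated model memory footprint (GB) to the single-server
# configurations in Table 3. Hypothetical helper for illustration only.
def recommend_server(footprint_gb):
    if footprint_gb <= 160:
        return "PowerEdge R760xa (4x NVIDIA H100 PCIe with NVLink Bridge)"
    if footprint_gb <= 320:
        return "PowerEdge XE8640 (4x NVIDIA H100 SXM)"
    if footprint_gb <= 640:
        return "PowerEdge XE9680 (8x NVIDIA H100 SXM)"
    return "Exceeds a single server; consider a multi-node deployment"

print(recommend_server(29))    # e.g., a 13B-parameter model in FP16
print(recommend_server(391))   # e.g., a 175B-parameter model in FP16
```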
A single LLM model can support a specific number of concurrent users. Beyond this threshold, the model's latency becomes prohibitively high for practical use. To accommodate a substantial volume of concurrent user requests, enterprises might find it necessary to deploy multiple instances of the same model. This consideration drives the requirement for a specific number of servers to host the LLM model in alignment with the specific use case. For more information, see Sizing guidelines.
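For example, the arithmetic behind this sizing might look like the sketch below; every value in it is a hypothetical placeholder rather than a measured figure, and the Sizing guidelines section should be used for validated numbers.

```python
# Sketch: how concurrent-user requirements translate into model instances
# and servers. All values are hypothetical placeholders for illustration.
import math

peak_concurrent_users = 200     # assumed peak demand
users_per_instance = 25         # assumed capacity of one model instance
gpus_per_instance = 2           # e.g., a model sharded across two GPUs
gpus_per_server = 4             # e.g., PowerEdge R760xa or XE8640

instances = math.ceil(peak_concurrent_users / users_per_instance)
servers = math.ceil(instances * gpus_per_instance / gpus_per_server)
print(f"{instances} model instances across {servers} servers")
```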
For more information about generative AI models that were validated as part of this design, see Table 10.
Note: The preceding table does not consider cases in which a model spans multiple nodes (multiple servers) interconnected with a high-speed network such as InfiniBand.