Hardware design
Dell Technologies provides two GPU-optimized servers as part of the AI Factory with NVIDIA that are suitable for configuration as generative AI worker nodes for inference: the PowerEdge R760xa and the PowerEdge XE9680. These GPU-optimized servers act as worker nodes in a Kubernetes cluster. The number of servers depends on the number of models deployed and the number of concurrent requests served by those models.
PowerEdge R760xa GPU worker node
The following table shows the recommended configuration for a PowerEdge R760xa GPU worker node.
| Component | Details |
|---|---|
| Server model | PowerEdge R760xa |
| CPU | 2 x Intel Xeon Gold 6438M 2.2 GHz, 32C/64T |
| Memory | 16 x 32 GB DDR5 4800 MT/s RDIMM |
| Operating system drive | BOSS-N1 controller card with 2 x M.2 480 GB (RAID 1) |
| Storage | 2 x 1.92 TB Data Center NVMe Read Intensive AG Drive U.2 Gen4 |
| GPU | 4 x NVIDIA L40S 48 GB PCIe |
The PowerEdge R760xa is selected as part of the AI Factory with NVIDIA. The system under test has two Intel® Xeon® Gold 6438M processors (32 cores/64 threads each) with 512 GB of memory. The system has four NVIDIA L40S GPUs, each with 48 GB of memory, using PCIe Gen4 x16 connectivity. The L40S power consumption is 350 W.
PowerEdge XE9680 GPU worker node
The following table shows the recommended configuration for a PowerEdge XE9680 GPU worker node.
| Component | Details |
|---|---|
| Server model | PowerEdge XE9680 |
| CPU | 2 x Intel Xeon Platinum 8480+ 2.0 GHz, 56C/112T |
| Memory | 16 x 64 GB DDR5 4800 MT/s RDIMM |
| Operating system drive | BOSS-N1 controller card with 2 x M.2 480 GB (RAID 1) |
| Storage | 2 x 1.92 TB Data Center NVMe Read Intensive AG Drive U.2 Gen4 |
| GPU | 8 x NVIDIA H100 SXM, 80 GB |
The PowerEdge XE9680 is selected as part of the AI Factory with NVIDIA. The system under test has two Intel® Xeon® Platinum 8480+ processors (56 cores/112 threads each) with 1 TB of memory. The system has eight NVIDIA H100 GPUs, each with 80 GB of memory, using SXM5 connectivity.
The NVIDIA GPU baseboard provides an eight-way NVLink topology with full connectivity for high-bandwidth GPU-to-GPU communication. The H100 power consumption is 700 W.
The PowerScale F710 provides centralized storage. The models were downloaded from the shared scale-out NFS share. However, once a model has been downloaded from the share, there is little if any further storage traffic. If we were discussing RAG and measuring vector database performance on a common network share, the story would be different. With inferencing only, we use PowerScale as a shared storage platform from which we download and cache models. If we were training a model, checkpoints would be written to and read from the network share throughout the run. It is important to ensure that your storage traffic is optimized for AI workloads.
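The snippet below is a minimal sketch of this download-and-cache pattern. The mount point `/mnt/powerscale`, the cache directory `/opt/model-cache`, and the model directory name are hypothetical placeholders, not paths from this environment.

```python
import shutil
from pathlib import Path

# Hypothetical paths: the PowerScale NFS share is assumed to be mounted at
# /mnt/powerscale, and models are cached on local NVMe at /opt/model-cache.
NFS_MODELS = Path("/mnt/powerscale/models")
LOCAL_CACHE = Path("/opt/model-cache")

def ensure_cached(model_name: str) -> Path:
    """Copy a model from the NFS share to the local cache if not already present.

    After this one-time copy, inference serves entirely from local storage,
    which is why steady-state NFS traffic is minimal for inference-only workloads.
    """
    src = NFS_MODELS / model_name
    dst = LOCAL_CACHE / model_name
    if not dst.exists():
        shutil.copytree(src, dst)
    return dst

model_path = ensure_cached("llama3-8b-instruct")  # placeholder model directory name
```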
The inferencing activities described in this document use only a single node at any point in time. For multi-node AI activities, networking plays a critical role: the GPU servers are interconnected over an optimized GPU fabric to enable the environment to perform distributed AI tasks. In addition, the storage fabric used to load AI models and the front-end fabric are critical elements of any architectural design. In this document, the models are downloaded from back-end shared network storage, and a front-end load balancer distributes front-end traffic.
Each model was run from a single system and loaded into GPU memory on that system. The front-end load balancer distributed users across one or more servers. At no point was a single model distributed across multiple servers. Upcoming papers will discuss GPU fabrics and spreading models across more than one server.
This solution runs NVIDIA AI Enterprise software, an end-to-end, cloud-native software platform running on-premises in your data center. It accelerates data science pipelines and streamlines the development and deployment of production-grade inference workloads and other generative AI applications. Easy-to-use microservices provide optimized model performance with enterprise-grade security, support, and stability, ensuring a smooth transition from prototype to production.
The Dell AI Factory with NVIDIA speeds AI adoption by delivering integrated Dell and NVIDIA capabilities tailored to a customer’s specifications or as pre-validated, full-stack solutions.
The environment is running Kubernetes version 1.27.
NVIDIA NIM supports a wide variety of models, as listed in the NVIDIA NIM documentation. In this design, we test the following generative AI models: Llama 3 8B and Llama 3 70B, at different levels of quantization using FP8 and FP16 precision.
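NIM exposes an OpenAI-compatible API, so a deployed model can be queried as in the sketch below. The endpoint address and the model name `meta/llama3-8b-instruct` are assumptions for illustration; substitute the values published by your NIM deployment.

```python
import requests

# Hypothetical NIM endpoint; substitute the address of your worker node
# or front-end load balancer.
NIM_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "meta/llama3-8b-instruct",  # assumed model name; check your NIM catalog
    "messages": [{"role": "user", "content": "Summarize what an inference server does."}],
    "max_tokens": 128,  # matches the 128-token output sequence length used in testing
    "stream": False,
}

response = requests.post(NIM_URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```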
Large language models translate words into tokens; each token corresponds to roughly 0.75 English words. The LLM takes the query (the input sequence length), places it in a queue, processes it, and generates the response one token at a time (the output sequence length).
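As a back-of-the-envelope illustration of this ratio (the 0.75 figure is the approximation cited above, not an exact constant):

```python
TOKENS_PER_WORD_RATIO = 0.75  # approximate: 1 token ~ 0.75 English words

def tokens_to_words(tokens: int) -> float:
    """Estimate the English word count for a given token count."""
    return tokens * TOKENS_PER_WORD_RATIO

# The test configuration uses 128-token input and output sequences,
# which corresponds to roughly 96 English words each.
print(tokens_to_words(128))  # ~96.0
```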
GenAI-Perf is a command-line tool for measuring the throughput and latency of generative AI models served through an inference server. For large language models (LLMs), GenAI-Perf provides metrics such as output token throughput, time to first token, inter-token latency, and request throughput. GenAI-Perf is currently in early release and under rapid development, so we update our methodology as the tooling matures. Our testing uses an input sequence length of 128 tokens and an output sequence length of 128 tokens.
Below is the methodology used to describe each test.
Metrics of measurement: output token throughput, time to first token, inter-token latency, and request throughput.
Configuration: input sequence length of 128 tokens and output sequence length of 128 tokens.
Reference: GenAI-Perf — NVIDIA Triton Inference Server
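These latency metrics can be derived from per-token timestamps captured on a streaming response. The sketch below shows one way to compute them for a single request; it is illustrative only and is not GenAI-Perf's implementation.

```python
from statistics import mean

def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive core LLM latency metrics from one streamed response.

    request_start -- wall-clock time the request was sent (seconds)
    token_times   -- wall-clock arrival time of each output token (seconds)
    """
    ttft = token_times[0] - request_start                # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0                    # inter-token latency
    total = token_times[-1] - request_start
    throughput = len(token_times) / total                # output tokens per second
    return {"ttft_s": ttft, "itl_s": itl, "output_token_throughput": throughput}

# Example: a request sent at t=0.0 whose four tokens arrive at these times.
print(latency_metrics(0.0, [0.25, 0.30, 0.35, 0.40]))
# {'ttft_s': 0.25, 'itl_s': 0.05, 'output_token_throughput': 10.0}
```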
GenAI-Perf could be run on the system under test with only a small impact on performance. However, for a more realistic scenario, the GenAI-Perf software was deployed on a separate PowerEdge R6525 with ample CPU, memory, and network bandwidth to ensure that the test software had enough resources to generate the required load.