Hardware design
Dell Technologies provides two GPU-optimized servers as part of the AI Factory with NVIDIA that are suitable for configuration as generative AI worker nodes for inference: the PowerEdge R760xa and the PowerEdge XE9680. These GPU-optimized servers act as worker nodes in a Kubernetes cluster. The number of servers depends on the number of models deployed and the number of concurrent requests served by those models.
PowerEdge R760xa GPU worker node
The following table shows the recommended configuration for a PowerEdge R760xa GPU worker node.
| Component | Details |
|---|---|
| Server model | PowerEdge R760xa |
| CPU | 2 x Intel Xeon Gold 6438M 2.2 GHz, 32C/64T |
| Memory | 16 x 32 GB DDR5 4800 MT/s RDIMM |
| Operating system drive | BOSS-N1 controller card with 2 x M.2 480 GB (RAID 1) |
| Storage | 2 x 1.92 TB Data Center NVMe Read Intensive AG Drive U.2 Gen4 |
| GPU | 4 x NVIDIA L40S 48 GB PCIe |
The PowerEdge R760xa is selected as part of the AI Factory with NVIDIA. The system under test has two Intel® Xeon® Gold 6438M processors (32 cores/64 threads each) with 512 GB of memory. The system has four NVIDIA L40S GPUs, each with 48 GB of memory, using PCIe Gen4 x16 connectivity. The L40S power consumption is 350 W.
PowerEdge XE9680 GPU worker node
The following table shows the recommended configuration for a PowerEdge XE9680 GPU worker node.
| Component | Details |
|---|---|
| Server model | PowerEdge XE9680 |
| CPU | 2 x Intel Xeon Platinum 8480+ 2.0 GHz, 56C/112T |
| Memory | 16 x 64 GB DDR5 4800 MT/s RDIMM |
| Operating system drive | BOSS-N1 controller card with 2 x M.2 480 GB (RAID 1) |
| Storage | 2 x 1.92 TB Data Center NVMe Read Intensive AG Drive U.2 Gen4 |
| GPU | 8 x NVIDIA H100 SXM, 80 GB |
The PowerEdge XE9680 is selected as part of the AI Factory with NVIDIA. The system under test has two Intel® Xeon® Platinum 8480+ processors (56 cores/112 threads each) with 1 TB of memory. The system has eight NVIDIA H100 GPUs, each with 80 GB of memory, using SXM5 connectivity.
The NVIDIA GPU baseboard provides an eight-way NVLink topology with full connectivity for high-bandwidth GPU-to-GPU communication. The H100 power consumption is 700 W.
The PowerScale F710 provides centralized storage. The models were downloaded from the shared scale-out NFS share. However, once a model has been downloaded from the share, there is little if any further storage traffic. If we were discussing RAG and measuring vector database performance on a common network share, the story would be different. With inferencing only, we use PowerScale as a shared storage platform from which we download and cache models. If we were training a model, checkpoints would be written to and read from the network share throughout the run. It is important to ensure that your storage traffic is optimized for AI workloads.
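The snippet below is a minimal sketch of this download-and-cache pattern. The mount point `/mnt/powerscale`, the cache directory `/opt/model-cache`, and the model directory name are hypothetical placeholders, not paths from this environment.

```python
import shutil
from pathlib import Path

# Hypothetical paths: the PowerScale NFS share is assumed to be mounted at
# /mnt/powerscale, and models are cached on local NVMe at /opt/model-cache.
NFS_MODELS = Path("/mnt/powerscale/models")
LOCAL_CACHE = Path("/opt/model-cache")

def ensure_cached(model_name: str) -> Path:
    """Copy a model from the NFS share to the local cache if not already present.

    After this one-time copy, inference serves entirely from local storage,
    which is why steady-state NFS traffic is minimal for inference-only workloads.
    """
    src = NFS_MODELS / model_name
    dst = LOCAL_CACHE / model_name
    if not dst.exists():
        shutil.copytree(src, dst)
    return dst

model_path = ensure_cached("llama3-8b-instruct")  # placeholder model directory name
```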
The inferencing activities described in this document use only a single node at any point in time. For multi-node AI activities, networking plays a critical role: the GPU servers are interconnected over an optimized GPU fabric to enable the environment to perform distributed AI tasks. In addition, the storage fabric used to load AI models and the front-end fabric are critical elements of any architectural design. In this document, the models are downloaded from back-end shared network storage, and a front-end load balancer distributes front-end traffic.
Each model was run from a single system and loaded into GPU memory on that system. The front-end load balancer distributed users across one or more servers. At no point was a single model distributed across multiple servers. Upcoming papers will discuss GPU fabrics and spreading models across more than one server.
This solution runs NVIDIA AI Enterprise software, an end-to-end, cloud-native software platform running on-premises in your data center. It accelerates data science pipelines and streamlines the development and deployment of production-grade inference workloads and other generative AI applications. Easy-to-use microservices provide optimized model performance with enterprise-grade security, support, and stability, ensuring a smooth transition from prototype to production.
The Dell AI Factory with NVIDIA speeds AI adoption by delivering integrated Dell and NVIDIA capabilities tailored to a customer’s specifications or as pre-validated, full-stack solutions.
The environment is running Kubernetes version 1.27.
NVIDIA NIM supports a wide variety of models, as listed in the NVIDIA NIM documentation. In this design, we test the following generative AI models: Llama 3 8B and Llama 3 70B, at different levels of quantization using FP8 and FP16 precision.
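NIM exposes an OpenAI-compatible API, so a deployed model can be queried as in the sketch below. The endpoint address and the model name `meta/llama3-8b-instruct` are assumptions for illustration; substitute the values published by your NIM deployment.

```python
import requests

# Hypothetical NIM endpoint; substitute the address of your worker node
# or front-end load balancer.
NIM_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "meta/llama3-8b-instruct",  # assumed model name; check your NIM catalog
    "messages": [{"role": "user", "content": "Summarize what an inference server does."}],
    "max_tokens": 128,  # matches the 128-token output sequence length used in testing
    "stream": False,
}

response = requests.post(NIM_URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```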
Large language models translate words into tokens; each token corresponds to roughly 0.75 English words. The LLM takes the query (the input sequence length), places it in a queue, processes it, and generates the response one token at a time (the output sequence length).
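As a back-of-the-envelope illustration of this ratio (the 0.75 figure is the approximation cited above, not an exact constant):

```python
TOKENS_PER_WORD_RATIO = 0.75  # approximate: 1 token ~ 0.75 English words

def tokens_to_words(tokens: int) -> float:
    """Estimate the English word count for a given token count."""
    return tokens * TOKENS_PER_WORD_RATIO

# The test configuration uses 128-token input and output sequences,
# which corresponds to roughly 96 English words each.
print(tokens_to_words(128))  # ~96.0
```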
GenAI-Perf is a command-line tool for measuring the throughput and latency of generative AI models served through an inference server. For large language models (LLMs), GenAI-Perf provides metrics such as output token throughput, time to first token, inter-token latency, and request throughput. GenAI-Perf is currently in early release and under rapid development, so we update our methodology as the tooling matures. Our testing uses an input sequence length of 128 tokens and an output sequence length of 128 tokens.
Below is the methodology used to describe each test.
Metrics of measurement: output token throughput, time to first token, inter-token latency, and request throughput.
Configuration: input sequence length of 128 tokens and output sequence length of 128 tokens.
Reference: GenAI-Perf — NVIDIA Triton Inference Server
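These latency metrics can be derived from per-token timestamps captured on a streaming response. The sketch below shows one way to compute them for a single request; it is illustrative only and is not GenAI-Perf's implementation.

```python
from statistics import mean

def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive core LLM latency metrics from one streamed response.

    request_start -- wall-clock time the request was sent (seconds)
    token_times   -- wall-clock arrival time of each output token (seconds)
    """
    ttft = token_times[0] - request_start                # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0                    # inter-token latency
    total = token_times[-1] - request_start
    throughput = len(token_times) / total                # output tokens per second
    return {"ttft_s": ttft, "itl_s": itl, "output_token_throughput": throughput}

# Example: a request sent at t=0.0 whose four tokens arrive at these times.
print(latency_metrics(0.0, [0.25, 0.30, 0.35, 0.40]))
# {'ttft_s': 0.25, 'itl_s': 0.05, 'output_token_throughput': 10.0}
```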
GenAI-Perf could be run on the system under test with only a small impact on performance. However, for a more realistic scenario, the GenAI-Perf software was deployed on a separate PowerEdge R6525 with ample CPU, memory, and network bandwidth to ensure that the test software had enough resources to generate the required load.