Overview
Tensor parallelism is one method of leveraging the compute and memory of multiple GPUs to boost the performance of a large language model (LLM). In this paper we maximize the inference performance of two of Meta’s open-source Llama models, the 70-billion-parameter Llama-3 70B and the 13-billion-parameter Llama-2 13B, using tensor parallelism on a Dell PowerEdge XE9680 server with eight NVIDIA H100 GPUs. Tensor-parallel (TP) optimized engines for these models are built with NVIDIA’s open-source TensorRT-LLM inference acceleration library.
Tensor parallelism lets us run models that cannot easily fit on a single GPU by partitioning the necessary computations into smaller pieces that can be distributed across multiple GPUs. We show how this is done for Meta’s Llama-3 70B model (available from the Dell Enterprise Hub, https://dell.huggingface.co/). The 16-bit floating-point version of this model can be run on 2, 4, or 8 GPUs, using corresponding TP degrees of 2, 4, and 8, with performance increasing at the higher degrees.
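To see why at least two GPUs are required, consider the weight memory alone. The following is a minimal back-of-the-envelope sketch (it ignores the KV cache, activations, and engine overhead, which is why TP=2 is tight in practice and higher TP degrees give useful headroom):

```python
# Rough FP16 weight-memory estimate for Llama-3 70B across TP degrees.
# Illustration only: ignores KV cache, activations, and engine overhead.

PARAMS = 70e9          # Llama-3 70B parameter count
BYTES_PER_PARAM = 2    # FP16 = 2 bytes per parameter
HBM_PER_GPU_GB = 80    # H100 HBM3 capacity per GPU

total_gb = PARAMS * BYTES_PER_PARAM / 1e9  # ~140 GB of weights

for tp in (1, 2, 4, 8):
    per_gpu = total_gb / tp  # weights are sharded evenly across TP ranks
    fits = "fits" if per_gpu < HBM_PER_GPU_GB else "does not fit"
    print(f"TP={tp}: {per_gpu:.0f} GB of weights per GPU -> {fits} in 80 GB HBM")
```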
Tensor parallelism also lets us run smaller models, such as Llama-2 13B, at higher tokens-per-second efficiency by drawing on all of the compute and high-bandwidth memory (HBM) aggregated across multiple GPUs. For Llama-2 13B we show results for TP degrees of 1, 2, 4, and 8, with the improvement coming from the larger degrees of TP (see the Appendix for an explanation of how tensor parallelism works, and the sketch below for the core idea).
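The Appendix covers the mechanics in detail; as a minimal illustration of the idea, the sketch below splits a linear layer’s weight matrix column-wise across simulated "GPUs" using NumPy and then gathers the partial outputs. This partition-and-combine pattern is the same one a TP engine applies to the transformer’s large matrix multiplications (a real engine performs the gather with NCCL collectives across physical GPUs):

```python
import numpy as np

# Toy column-parallel linear layer: each "GPU" holds one slice of the
# weight matrix, computes a partial output, and the partial outputs are
# then gathered to reproduce the full result.

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1024))        # batch of input activations
W = rng.standard_normal((1024, 4096))     # full weight matrix

tp = 4
shards = np.split(W, tp, axis=1)          # one column shard per TP rank

# Each rank multiplies the full input by its own weight shard...
partials = [x @ w for w in shards]
# ...and an all-gather concatenates the partial outputs.
y_tp = np.concatenate(partials, axis=1)

assert np.allclose(y_tp, x @ W)           # identical to the single-GPU result
```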
In our examples, we start with the tensor parallelism degree (and the equivalent number of GPUs) that provides just enough memory to hold the given 16-bit floating-point model. We then measure total token-generation throughput, and throughput per GPU, at that minimum TP and at higher degrees of TP, to evaluate how well TP scales and maintains efficiency as more GPUs are brought in. An increased degree of TP brings the benefits of additional compute and memory, but also additional costs from computing smaller, fragmented tensors and then synchronizing those computations. Our results show how these benefits and costs balance at varying degrees of TP. A future white paper will show the results of combining tensor parallelism with quantized Llama models, such as 8-bit floating-point (FP8) versions.
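The scaling comparison reduces to a simple per-GPU efficiency metric. The function below is our own bookkeeping sketch (the name and signature are not from any library, and the values fed into it would come from measured runs, not from placeholder numbers):

```python
def scaling_efficiency(tput_tps: float, tp: int,
                       base_tput_tps: float, base_tp: int) -> float:
    """Per-GPU throughput at TP degree `tp`, relative to the smallest
    TP degree that fits the model (`base_tp`).

    Returns 1.0 for perfect linear scaling; values below 1.0 reflect the
    cost of fragmenting tensors and synchronizing across more GPUs.
    """
    per_gpu = tput_tps / tp
    base_per_gpu = base_tput_tps / base_tp
    return per_gpu / base_per_gpu

# Usage (variables are placeholders, not measurements from this paper):
# eff_tp4 = scaling_efficiency(tput_tps=measured_tp4_tput, tp=4,
#                              base_tput_tps=measured_tp2_tput, base_tp=2)
```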
We evaluate the two Llama inference engines using NVIDIA’s TensorRT-LLM in what is known as “standalone” or “offline” mode, as opposed to “server” mode. Running in offline mode lets us judge the raw computational and memory-access efficiency of the Llama model engines themselves, without the statistical effects of serving multiple prompt requests that arrive at different times.
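A minimal sketch of what an offline-mode measurement looks like is shown below. Here `engine.generate` is a hypothetical placeholder for the batch-generation entry point of whatever runtime is being benchmarked (it is not a TensorRT-LLM API); the point is simply that all prompts are submitted at once and timed end to end, with no request arrival process:

```python
import time

def offline_throughput(engine, prompts, max_new_tokens):
    """Offline-mode measurement: submit the whole batch at once and
    time end-to-end generation.

    `engine.generate` is a hypothetical placeholder for the batch
    generation call of the inference runtime being benchmarked.
    """
    start = time.perf_counter()
    outputs = engine.generate(prompts, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start

    generated = sum(len(o.token_ids) for o in outputs)  # total new tokens
    return generated / elapsed                          # tokens per second
```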
We run and record results for two scenarios: the first applies a Time To First Token (TTFT) constraint of less than 2 seconds, and the second applies no time constraint. The two-second TTFT constraint lets us characterize how the engines would perform in real-time serving applications.
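Checking the TTFT constraint amounts to timing the gap between submitting a prompt and receiving the first generated token. Below is a minimal sketch, again with a hypothetical streaming interface (`engine.stream`) standing in for the runtime’s own:

```python
import time

TTFT_LIMIT_S = 2.0  # the real-time constraint used in the first scenario

def first_token_latency(engine, prompt):
    """Time from request submission to the first streamed token.

    `engine.stream` is a hypothetical placeholder for a streaming
    generation call; any runtime that yields tokens one at a time works.
    """
    start = time.perf_counter()
    for _token in engine.stream(prompt):
        return time.perf_counter() - start  # stop at the first token
    return float("inf")                     # nothing was generated

# A configuration satisfies the first scenario when
# first_token_latency(engine, prompt) < TTFT_LIMIT_S across its workload.
```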
In Table 1, we show the server configuration used in our evaluation.
Table 1. Server configuration

| Component | Value |
|---|---|
| System Name | PowerEdge XE9680 |
| Status | Available |
| System Type | Data Center |
| Number of Nodes | 1 |
| Host Processor Model | 4th Generation Intel® Xeon® Scalable Processors |
| Host Processor Name | Intel® Xeon® Platinum 8468 |
| Host Processors per Node | 2 |
| Cores per Socket | 46 |
| Host Processor Frequency | 2.1 GHz |
| Host Memory Capacity and Type | 1 TB (16x 64 GB DIMMs), 4800 MT/s DDR5 |
| Host Storage Capacity | 6.4 TB, NVMe |
| GPU Name and Quantity | 8x NVIDIA H100 |
| GPU Memory Capacity and Type | 80 GB HBM3 (per GPU) |
| GPU High-speed Interface | PCIe Gen5 / NVLink Gen4 |
| NVIDIA Driver Version | 535.104.12 |
| CUDA Version | 12.3 |
| TensorRT-LLM Version | 0.9.0 |