Overview
Tensor parallelism is one method of leveraging the compute and memory of multiple GPUs to boost the performance of a large language model (LLM). In this paper we maximize the inference performance of two of Meta’s open-source Llama models, the 70-billion-parameter Llama-3 70B and the 13-billion-parameter Llama-2 13B, using tensor parallelism on a Dell PowerEdge XE9680 server with eight NVIDIA H100 GPUs. Tensor-parallel (TP) optimized engines for these models are built with NVIDIA’s open-source TensorRT-LLM inference acceleration library.
Tensor parallelism lets us run models that cannot easily fit on a single GPU by partitioning the necessary computations into smaller pieces that can be distributed across multiple GPUs. We show how this is done for Meta’s Llama-3 70B model (available from the Dell Enterprise Hub, https://dell.huggingface.co/). The 16-bit floating-point version of this model can be run on 2, 4, or 8 GPUs, using corresponding TP degrees of 2, 4, and 8, with performance increasing at the higher degrees.
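To see why at least two GPUs are required, consider the weight memory alone. The following is a minimal back-of-the-envelope sketch (it ignores the KV cache, activations, and engine overhead, which is why TP=2 is tight in practice and higher TP degrees give useful headroom):

```python
# Rough FP16 weight-memory estimate for Llama-3 70B across TP degrees.
# Illustration only: ignores KV cache, activations, and engine overhead.

PARAMS = 70e9          # Llama-3 70B parameter count
BYTES_PER_PARAM = 2    # FP16 = 2 bytes per parameter
HBM_PER_GPU_GB = 80    # H100 HBM3 capacity per GPU

total_gb = PARAMS * BYTES_PER_PARAM / 1e9  # ~140 GB of weights

for tp in (1, 2, 4, 8):
    per_gpu = total_gb / tp  # weights are sharded evenly across TP ranks
    fits = "fits" if per_gpu < HBM_PER_GPU_GB else "does not fit"
    print(f"TP={tp}: {per_gpu:.0f} GB of weights per GPU -> {fits} in 80 GB HBM")
```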
Tensor parallelism also lets us run smaller models, such as Llama-2 13B, at higher tokens-per-second efficiency by drawing on all of the compute and high-bandwidth memory (HBM) aggregated across multiple GPUs. For Llama-2 13B we show results for TP degrees of 1, 2, 4, and 8, with the improvement coming from the larger degrees of TP (see the Appendix for an explanation of how tensor parallelism works, and the sketch below for the core idea).
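The Appendix covers the mechanics in detail; as a minimal illustration of the idea, the sketch below splits a linear layer’s weight matrix column-wise across simulated "GPUs" using NumPy and then gathers the partial outputs. This partition-and-combine pattern is the same one a TP engine applies to the transformer’s large matrix multiplications (a real engine performs the gather with NCCL collectives across physical GPUs):

```python
import numpy as np

# Toy column-parallel linear layer: each "GPU" holds one slice of the
# weight matrix, computes a partial output, and the partial outputs are
# then gathered to reproduce the full result.

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1024))        # batch of input activations
W = rng.standard_normal((1024, 4096))     # full weight matrix

tp = 4
shards = np.split(W, tp, axis=1)          # one column shard per TP rank

# Each rank multiplies the full input by its own weight shard...
partials = [x @ w for w in shards]
# ...and an all-gather concatenates the partial outputs.
y_tp = np.concatenate(partials, axis=1)

assert np.allclose(y_tp, x @ W)           # identical to the single-GPU result
```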
In our examples, we start with the tensor parallelism degree (and the equivalent number of GPUs) that provides just enough memory to hold the given 16-bit floating-point model. We then measure total token-generation throughput, and throughput per GPU, at that minimum TP and at higher degrees of TP, to evaluate how well TP scales and maintains efficiency as more GPUs are brought in. An increased degree of TP brings the benefits of additional compute and memory, but also additional costs from computing smaller, fragmented tensors and then synchronizing those computations. Our results show how these benefits and costs balance at varying degrees of TP. A future white paper will show the results of combining tensor parallelism with quantized Llama models, such as 8-bit floating-point (FP8) versions.
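The scaling comparison reduces to a simple per-GPU efficiency metric. The function below is our own bookkeeping sketch (the name and signature are not from any library, and the values fed into it would come from measured runs, not from placeholder numbers):

```python
def scaling_efficiency(tput_tps: float, tp: int,
                       base_tput_tps: float, base_tp: int) -> float:
    """Per-GPU throughput at TP degree `tp`, relative to the smallest
    TP degree that fits the model (`base_tp`).

    Returns 1.0 for perfect linear scaling; values below 1.0 reflect the
    cost of fragmenting tensors and synchronizing across more GPUs.
    """
    per_gpu = tput_tps / tp
    base_per_gpu = base_tput_tps / base_tp
    return per_gpu / base_per_gpu

# Usage (variables are placeholders, not measurements from this paper):
# eff_tp4 = scaling_efficiency(tput_tps=measured_tp4_tput, tp=4,
#                              base_tput_tps=measured_tp2_tput, base_tp=2)
```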
We evaluate the two Llama inference engines using NVIDIA’s TensorRT-LLM in what is known as “standalone” or “offline” mode, as opposed to “server” mode. Running in offline mode lets us judge the raw computational and memory-access efficiency of the Llama model engines themselves, without the statistical effects of serving multiple prompt requests that arrive at different times.
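A minimal sketch of what an offline-mode measurement looks like is shown below. Here `engine.generate` is a hypothetical placeholder for the batch-generation entry point of whatever runtime is being benchmarked (it is not a TensorRT-LLM API); the point is simply that all prompts are submitted at once and timed end to end, with no request arrival process:

```python
import time

def offline_throughput(engine, prompts, max_new_tokens):
    """Offline-mode measurement: submit the whole batch at once and
    time end-to-end generation.

    `engine.generate` is a hypothetical placeholder for the batch
    generation call of the inference runtime being benchmarked.
    """
    start = time.perf_counter()
    outputs = engine.generate(prompts, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start

    generated = sum(len(o.token_ids) for o in outputs)  # total new tokens
    return generated / elapsed                          # tokens per second
```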
We run and record results for two scenarios: the first applies a Time To First Token (TTFT) constraint of less than 2 seconds, and the second applies no time constraint. The two-second TTFT constraint lets us characterize how the engines would perform in real-time serving applications.
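Checking the TTFT constraint amounts to timing the gap between submitting a prompt and receiving the first generated token. Below is a minimal sketch, again with a hypothetical streaming interface (`engine.stream`) standing in for the runtime’s own:

```python
import time

TTFT_LIMIT_S = 2.0  # the real-time constraint used in the first scenario

def first_token_latency(engine, prompt):
    """Time from request submission to the first streamed token.

    `engine.stream` is a hypothetical placeholder for a streaming
    generation call; any runtime that yields tokens one at a time works.
    """
    start = time.perf_counter()
    for _token in engine.stream(prompt):
        return time.perf_counter() - start  # stop at the first token
    return float("inf")                     # nothing was generated

# A configuration satisfies the first scenario when
# first_token_latency(engine, prompt) < TTFT_LIMIT_S across its workload.
```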
In Table 1, we show the server configuration used in our evaluation.
Table 1. Server configuration

| Component | Value |
|---|---|
| System Name | PowerEdge XE9680 |
| Status | Available |
| System Type | Data Center |
| Number of Nodes | 1 |
| Host Processor Model | 4th Generation Intel® Xeon® Scalable Processors |
| Host Processor Name | Intel® Xeon® Platinum 8468 |
| Host Processors per Node | 2 |
| Cores per Socket | 46 |
| Host Processor Frequency | 2.1 GHz |
| Host Memory Capacity and Type | 1 TB (16x 64 GB DIMMs), 4800 MT/s DDR5 |
| Host Storage Capacity | 6.4 TB, NVMe |
| GPU Name and Quantity | 8x NVIDIA H100 |
| GPU Memory Capacity and Type | 80 GB HBM3 (per GPU) |
| GPU High-speed Interface | PCIe Gen5 / NVLink Gen4 |
| NVIDIA Driver Version | 535.104.12 |
| CUDA Version | 12.3 |
| TensorRT-LLM Version | 0.9.0 |