In Table 2, we show the two models used in our tensor parallelism experiments. The two models we chose span a variety of characteristics:
| Model | Parameters | Context | Quantization | Memory required for parameters | Minimum # of H100 GPUs (80 GB HBM3 each) | Attention |
|---|---|---|---|---|---|---|
| Llama-3 70B | 70B | 8k tokens | FP16 | ~140 GB | 2 | GQA |
| Llama-2 13B | 13B | 4k tokens | FP16 | ~26 GB | 1 | MHA |
The minimum number of H100s needed is determined by the memory required for parameters, activations, and KV cache. Each FP16 parameter requires two bytes of memory, and each additional GPU contributes 80 GB to the aggregate memory pool.
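As a rough illustration of the parameter-memory portion of this sizing, the sketch below computes a lower bound on GPU count from parameter memory alone; it does not model the activation or KV cache overhead, which depends on context length and batch size.

```python
import math

def min_gpus_for_parameters(num_params_billion: float,
                            bytes_per_param: int = 2,   # FP16
                            gpu_memory_gb: int = 80) -> int:
    """Lower bound on GPU count from parameter memory alone.

    Ignores activations and KV cache, so the practical minimum can be higher.
    """
    param_memory_gb = num_params_billion * 1e9 * bytes_per_param / 1e9
    return math.ceil(param_memory_gb / gpu_memory_gb)

print(min_gpus_for_parameters(70))  # Llama-3 70B -> ~140 GB -> 2 GPUs
print(min_gpus_for_parameters(13))  # Llama-2 13B -> ~26 GB  -> 1 GPU
```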
In Table 3 and Table 4, we show the tests we ran against each of the models.
We tested Llama-3 70B FP16 inference with an 8k token context (4k input, 4k output) and Llama-2 13B FP16 inference with a 4k token context (2k input, 2k output). To compare the scaling of throughput and efficiency across different tensor parallelism counts, we plot the TensorRT-LLM results for 1) total throughput and 2) throughput efficiency per GPU at various batch sizes, up to and including the largest batch size that fits in memory. We show these results for two scenarios: one in which we consider only results with a Time To First Token (TTFT) of less than 2 seconds, and one with no TTFT constraint.
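A minimal sketch of the per-GPU efficiency metric, assuming it is simply total throughput divided by the TP degree; the throughput numbers below are illustrative, not measured results.

```python
def throughput_per_gpu(total_throughput_tokens_per_s: float, tp_degree: int) -> float:
    """Throughput efficiency per GPU: total throughput divided by the number of GPUs in the TP group."""
    return total_throughput_tokens_per_s / tp_degree

# Illustrative only: if TP=8 delivered the same total throughput as TP=2,
# its per-GPU efficiency would be 4x lower.
print(throughput_per_gpu(3000.0, 2))  # 1500.0 tokens/s per GPU
print(throughput_per_gpu(3000.0, 8))  # 375.0 tokens/s per GPU
```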
Because a model requires the accelerators to have enough high bandwidth memory (HBM) for its parameters and KV cache, larger models must use TP across more than one GPU. For the Llama-3 70B model, this minimum is 2 H100 GPUs. We therefore ran Llama-3 70B with tensor parallelism of 2, 4, and 8, both because of the model size and because the number of KV heads (8 for Llama-3 70B) must be divisible by the number of GPUs. Meanwhile, since Llama-2 13B fits on a single GPU, we tested it with tensor parallelism of 1, 2, 4, and 8.
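The memory minimum and the KV-head divisibility rule together determine which TP degrees are valid on the 8-GPU XE9680. A minimal sketch of that check follows (the 40-head figure for Llama-2 13B's MHA is an assumption, not stated above):

```python
def valid_tp_degrees(num_kv_heads: int, min_gpus: int) -> list[int]:
    """TP degrees on an 8-GPU server that meet the memory minimum and evenly divide the KV heads."""
    return [tp for tp in (1, 2, 4, 8) if tp >= min_gpus and num_kv_heads % tp == 0]

print(valid_tp_degrees(num_kv_heads=8, min_gpus=2))   # Llama-3 70B (GQA, 8 KV heads) -> [2, 4, 8]
print(valid_tp_degrees(num_kv_heads=40, min_gpus=1))  # Llama-2 13B (MHA, assumed 40 heads) -> [1, 2, 4, 8]
```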
While a larger TP count increases pooled memory and brings additional compute power to bear, a high degree of model parallelism leads to smaller matrix multiplications (General Matrix Multiplications, or GEMMs), which can decrease GPU utilization. Our tests show empirically the combined net effect of all of these factors.
| Model | Time Constraint | Tensor Parallelism | Context | Batch Size |
|---|---|---|---|---|
| Llama-3 70B | Time To First Token (TTFT) < 2 s | TP = 2, 4, 8 GPUs | 4k in, 4k out | Up to and including the largest batch size that fits in memory |
| Llama-3 70B | No constraint | TP = 2, 4, 8 GPUs | 4k in, 4k out | Up to and including the largest batch size that fits in memory |
| Model | Time Constraint | Tensor Parallelism | Context | Batch Size |
|---|---|---|---|---|
| Llama-2 13B | Time To First Token (TTFT) < 2 s | TP = 1, 2, 4, 8 GPUs | 2k in, 2k out | Up to and including the largest batch size that fits in memory |
| Llama-2 13B | No constraint | TP = 1, 2, 4, 8 GPUs | 2k in, 2k out | Up to and including the largest batch size that fits in memory |