Conclusion
Our results demonstrate that the TensorRT-LLM tensor-parallel compiled engines for both the Llama-3 70B and the Llama-2 13B models scale up very effectively on our Dell PowerEdge XE9680 server with 8x H100 GPU accelerators. When we started, we knew that a larger TP degree makes more compute available to generate tokens quickly enough to meet Time To First Token constraints. We also knew that the larger aggregate memory that comes with a larger TP degree allows larger batch sizes, which is particularly useful for maximizing total throughput.
But against these benefits of higher degrees of TP comes the cost of working with smaller matrices, which are computed less efficiently per element than larger ones. It was not clear in advance how effective a larger TP degree would be compared to a smaller one. (A toy sketch of how tensor parallelism shards these matrices follows below.)
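To make the tradeoff concrete, the minimal sketch below illustrates Megatron-style column-parallel sharding of a single linear layer, using NumPy on one machine as a stand-in for multiple GPUs. It is an illustration of the general technique, not the TensorRT-LLM implementation, and the matrix sizes are arbitrary: each "GPU" holds 1/TP of the weight matrix and performs a correspondingly smaller matrix multiplication.

```python
# Illustrative sketch (not the TensorRT-LLM implementation): Megatron-style
# column-parallel sharding of one linear layer across tp_degree "GPUs".
import numpy as np

def column_parallel_matmul(x, W, tp_degree):
    """Split W column-wise into tp_degree shards, compute each shard's
    partial output independently, then concatenate the partial outputs
    (an all-gather in a real multi-GPU setup)."""
    shards = np.array_split(W, tp_degree, axis=1)   # one shard per GPU
    partials = [x @ shard for shard in shards]      # smaller, independent matmuls
    return np.concatenate(partials, axis=1)         # gather the outputs

x = np.random.randn(4, 8192)        # a small batch of activations
W = np.random.randn(8192, 8192)     # a hypothetical weight matrix
full = x @ W
for tp in (1, 2, 4, 8):
    sharded = column_parallel_matmul(x, W, tp)
    assert np.allclose(full, sharded)  # same result; each GPU holds 1/tp of W
    print(f"TP={tp}: each shard is {W.shape[0]} x {W.shape[1] // tp}")
```

The result is identical at every TP degree; what changes is that each GPU multiplies against a narrower weight shard, which is the source of the per-element efficiency loss discussed above.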
The result we observed, after all of these factors are traded off and accounted for, is that the TensorRT-LLM tensor parallelism technique produces efficient sharding from 1 to 2 to 4 to 8 GPUs. Peak efficiency per GPU comes at either TP=4 or TP=8 in every case, and in every case the efficiency per GPU at TP=8 on the Dell PowerEdge XE9680 server is at or above 0.9x of the peak efficiency achieved at any TP, which is excellent.
Even at TP=8, the increased raw compute power (and the matrix efficiency derived from the larger pooled memory, for use cases that are compatible with larger batch sizes) mostly, and in some cases completely, offsets the computation synchronization cost and the small-matrix computation cost.
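The 0.9x figure can be checked directly from the total-throughput numbers reported in Table 5 through Table 8 below. The short script that follows simply derives per-GPU throughput as total throughput divided by TP degree and compares the TP=8 value against the best per-GPU value at any TP; the table names in the dictionary are only labels for the four result sets.

```python
# Check the per-GPU efficiency claim from the total throughputs (tokens/sec)
# reported in Tables 5 through 8.
results = {
    "Table 5": {2: 62.2, 4: 264.4, 8: 585.2},
    "Table 6": {2: 160, 4: 1359, 8: 2581},
    "Table 7": {1: 430, 2: 1386, 4: 2787, 8: 5018},
    "Table 8": {1: 430, 2: 1386, 4: 2930, 8: 5432},
}
for name, totals in results.items():
    per_gpu = {tp: total / tp for tp, total in totals.items()}  # tokens/sec/GPU
    peak = max(per_gpu.values())
    ratio = per_gpu[8] / peak
    print(f"{name}: TP=8 per-GPU = {per_gpu[8]:.1f}, "
          f"peak per-GPU = {peak:.1f}, TP=8/peak = {ratio:.2f}")
```

Running this reproduces the pattern described above: the TP=8 per-GPU throughput is never below roughly 0.9x of the best per-GPU throughput observed at any TP degree.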
Moreover, we observed similar results for both model sizes (13B and 70B), which use different attention mechanisms (Multi-Head Attention and Grouped-Query Attention, respectively).
Similar results were also seen with and without a Time To First Token constraint.
Bottom line: Tensor parallelism allows us to scale up Llama-3 and Llama-2 LLMs across all 8x H100 GPUs on Dell PowerEdge XE9680 servers and make full use of the available High Bandwidth Memory (HBM3) and compute power.
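As a rough illustration of why higher TP degrees unlock larger batch sizes, the sketch below estimates the per-GPU weight footprint as the model is sharded more widely. It assumes FP16 weights and 80 GB of HBM3 per H100, and it ignores activation and runtime overheads, so treat it as a back-of-the-envelope estimate rather than a sizing tool.

```python
# Back-of-the-envelope memory sketch (assumes FP16 weights and 80 GB of HBM3
# per H100; real engines also need space for activations and the KV cache).
HBM3_PER_GPU_GB = 80

def weights_per_gpu_gb(params_billion, tp_degree, bytes_per_param=2):
    """FP16 weight footprint per GPU once the model is sharded TP ways."""
    return params_billion * 1e9 * bytes_per_param / tp_degree / 1e9

for model, params in (("Llama-2 13B", 13), ("Llama-3 70B", 70)):
    for tp in (1, 2, 4, 8):
        w = weights_per_gpu_gb(params, tp)
        free = HBM3_PER_GPU_GB - w
        fits = "fits" if free > 0 else "does not fit"
        print(f"{model} at TP={tp}: ~{w:.0f} GB weights/GPU, "
              f"~{max(free, 0):.0f} GB left for KV cache ({fits})")
```

Under these assumptions the 70B model does not fit on a single GPU at all, and at TP=8 the per-GPU weight footprint shrinks enough to leave most of the HBM3 free for the KV cache, which is what allows the larger batch sizes seen in the tables below.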
The maximum batch size and throughput we observed for each use case and tensor parallelism degree are listed here, in Table 5 through Table 8.
Table 5

| TP Degree | Context (tokens in / tokens out) | Max Batch Size | Total Throughput (tokens/sec) | Throughput per GPU (tokens/sec) |
|---|---|---|---|---|
| 2 | 4096 in / 4096 out | 2 | 62.2 | 31.1 |
| 4 | 4096 in / 4096 out | 6 | 264.4 | 66.11 |
| 8 | 4096 in / 4096 out | 10 | 585.2 | 73.15 |
Table 6

| TP Degree | Context (tokens in / tokens out) | Max Batch Size | Total Throughput (tokens/sec) | Throughput per GPU (tokens/sec) |
|---|---|---|---|---|
| 2 | 4096 in / 4096 out | 6 | 160 | 80 |
| 4 | 4096 in / 4096 out | 40 | 1359 | 340 |
| 8 | 4096 in / 4096 out | 62 | 2581 | 323 |
Table 7

| TP Degree | Context (tokens in / tokens out) | Max Batch Size | Total Throughput (tokens/sec) | Throughput per GPU (tokens/sec) |
|---|---|---|---|---|
| 1 | 2048 in / 2048 out | 8 | 430 | 430 |
| 2 | 2048 in / 2048 out | 32 | 1386 | 693 |
| 4 | 2048 in / 2048 out | 60 | 2787 | 697 |
| 8 | 2048 in / 2048 out | 88 | 5018 | 627 |
Table 8

| TP Degree | Context (tokens in / tokens out) | Max Batch Size | Total Throughput (tokens/sec) | Throughput per GPU (tokens/sec) |
|---|---|---|---|---|
| 1 | 2048 in / 2048 out | 8 | 430 | 430 |
| 2 | 2048 in / 2048 out | 32 | 1386 | 693 |
| 4 | 2048 in / 2048 out | 64 | 2930 | 733 |
| 8 | 2048 in / 2048 out | 128 | 5432 | 677 |