Conclusion
Our results demonstrate that the TensorRT-LLM tensor-parallel compiled engines for both the Llama-3 70B and the Llama-2 13B models scale up very effectively on our Dell PowerEdge XE9680 server with 8x H100 GPU accelerators. When we started, we knew that a larger TP degree makes more compute available to generate tokens quickly enough to meet Time To First Token constraints. We also knew that the larger aggregate memory that comes with a larger TP degree allows larger batch sizes, which is particularly useful for maximizing total throughput.
But against these benefits of higher degrees of TP comes the cost of working with smaller matrices, which are computed less efficiently per element than larger ones. It was not clear in advance how effective a larger TP degree would be compared to a smaller one. (A toy sketch of how tensor parallelism shards these matrices follows below.)
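To make the tradeoff concrete, the minimal sketch below illustrates Megatron-style column-parallel sharding of a single linear layer, using NumPy on one machine as a stand-in for multiple GPUs. It is an illustration of the general technique, not the TensorRT-LLM implementation, and the matrix sizes are arbitrary: each "GPU" holds 1/TP of the weight matrix and performs a correspondingly smaller matrix multiplication.

```python
# Illustrative sketch (not the TensorRT-LLM implementation): Megatron-style
# column-parallel sharding of one linear layer across tp_degree "GPUs".
import numpy as np

def column_parallel_matmul(x, W, tp_degree):
    """Split W column-wise into tp_degree shards, compute each shard's
    partial output independently, then concatenate the partial outputs
    (an all-gather in a real multi-GPU setup)."""
    shards = np.array_split(W, tp_degree, axis=1)   # one shard per GPU
    partials = [x @ shard for shard in shards]      # smaller, independent matmuls
    return np.concatenate(partials, axis=1)         # gather the outputs

x = np.random.randn(4, 8192)        # a small batch of activations
W = np.random.randn(8192, 8192)     # a hypothetical weight matrix
full = x @ W
for tp in (1, 2, 4, 8):
    sharded = column_parallel_matmul(x, W, tp)
    assert np.allclose(full, sharded)  # same result; each GPU holds 1/tp of W
    print(f"TP={tp}: each shard is {W.shape[0]} x {W.shape[1] // tp}")
```

The result is identical at every TP degree; what changes is that each GPU multiplies against a narrower weight shard, which is the source of the per-element efficiency loss discussed above.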
The result we observed, after all of these factors are traded off and accounted for, is that the TensorRT-LLM tensor parallelism technique produces efficient sharding from 1 to 2 to 4 to 8 GPUs. Peak efficiency per GPU comes at either TP=4 or TP=8 in every case, and in every case the efficiency per GPU at TP=8 on the Dell PowerEdge XE9680 server is at or above 0.9x of the peak efficiency achieved at any TP, which is excellent.
Even at TP=8, the increased raw compute power (and the matrix efficiency derived from the larger pooled memory, for use cases that are compatible with larger batch sizes) mostly, and in some cases completely, offsets the computation synchronization cost and the small-matrix computation cost.
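The 0.9x figure can be checked directly from the total-throughput numbers reported in Table 5 through Table 8 below. The short script that follows simply derives per-GPU throughput as total throughput divided by TP degree and compares the TP=8 value against the best per-GPU value at any TP; the table names in the dictionary are only labels for the four result sets.

```python
# Check the per-GPU efficiency claim from the total throughputs (tokens/sec)
# reported in Tables 5 through 8.
results = {
    "Table 5": {2: 62.2, 4: 264.4, 8: 585.2},
    "Table 6": {2: 160, 4: 1359, 8: 2581},
    "Table 7": {1: 430, 2: 1386, 4: 2787, 8: 5018},
    "Table 8": {1: 430, 2: 1386, 4: 2930, 8: 5432},
}
for name, totals in results.items():
    per_gpu = {tp: total / tp for tp, total in totals.items()}  # tokens/sec/GPU
    peak = max(per_gpu.values())
    ratio = per_gpu[8] / peak
    print(f"{name}: TP=8 per-GPU = {per_gpu[8]:.1f}, "
          f"peak per-GPU = {peak:.1f}, TP=8/peak = {ratio:.2f}")
```

Running this reproduces the pattern described above: the TP=8 per-GPU throughput is never below roughly 0.9x of the best per-GPU throughput observed at any TP degree.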
Moreover, we observed similar results for both model sizes (13B and 70B), which use different attention mechanisms (Multi-Head Attention and Grouped-Query Attention, respectively).
Similar results were also seen with and without a Time To First Token constraint.
Bottom line: Tensor parallelism allows us to scale up Llama-3 and Llama-2 LLMs across all 8x H100 GPUs on Dell PowerEdge XE9680 servers and make full use of the available High Bandwidth Memory (HBM3) and compute power.
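As a rough illustration of why higher TP degrees unlock larger batch sizes, the sketch below estimates the per-GPU weight footprint as the model is sharded more widely. It assumes FP16 weights and 80 GB of HBM3 per H100, and it ignores activation and runtime overheads, so treat it as a back-of-the-envelope estimate rather than a sizing tool.

```python
# Back-of-the-envelope memory sketch (assumes FP16 weights and 80 GB of HBM3
# per H100; real engines also need space for activations and the KV cache).
HBM3_PER_GPU_GB = 80

def weights_per_gpu_gb(params_billion, tp_degree, bytes_per_param=2):
    """FP16 weight footprint per GPU once the model is sharded TP ways."""
    return params_billion * 1e9 * bytes_per_param / tp_degree / 1e9

for model, params in (("Llama-2 13B", 13), ("Llama-3 70B", 70)):
    for tp in (1, 2, 4, 8):
        w = weights_per_gpu_gb(params, tp)
        free = HBM3_PER_GPU_GB - w
        fits = "fits" if free > 0 else "does not fit"
        print(f"{model} at TP={tp}: ~{w:.0f} GB weights/GPU, "
              f"~{max(free, 0):.0f} GB left for KV cache ({fits})")
```

Under these assumptions the 70B model does not fit on a single GPU at all, and at TP=8 the per-GPU weight footprint shrinks enough to leave most of the HBM3 free for the KV cache, which is what allows the larger batch sizes seen in the tables below.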
The maximum batch size and throughput we observed for each use case and tensor parallelism degree are listed here, in Table 5 through Table 8.
Table 5

| TP Degree | Context (tokens in / tokens out) | Max Batch Size | Total Throughput (tokens/sec) | Throughput per GPU (tokens/sec) |
|---|---|---|---|---|
| 2 | 4096 in / 4096 out | 2 | 62.2 | 31.1 |
| 4 | 4096 in / 4096 out | 6 | 264.4 | 66.11 |
| 8 | 4096 in / 4096 out | 10 | 585.2 | 73.15 |
Table 6

| TP Degree | Context (tokens in / tokens out) | Max Batch Size | Total Throughput (tokens/sec) | Throughput per GPU (tokens/sec) |
|---|---|---|---|---|
| 2 | 4096 in / 4096 out | 6 | 160 | 80 |
| 4 | 4096 in / 4096 out | 40 | 1359 | 340 |
| 8 | 4096 in / 4096 out | 62 | 2581 | 323 |
Table 7

| TP Degree | Context (tokens in / tokens out) | Max Batch Size | Total Throughput (tokens/sec) | Throughput per GPU (tokens/sec) |
|---|---|---|---|---|
| 1 | 2048 in / 2048 out | 8 | 430 | 430 |
| 2 | 2048 in / 2048 out | 32 | 1386 | 693 |
| 4 | 2048 in / 2048 out | 60 | 2787 | 697 |
| 8 | 2048 in / 2048 out | 88 | 5018 | 627 |
Table 8

| TP Degree | Context (tokens in / tokens out) | Max Batch Size | Total Throughput (tokens/sec) | Throughput per GPU (tokens/sec) |
|---|---|---|---|---|
| 1 | 2048 in / 2048 out | 8 | 430 | 430 |
| 2 | 2048 in / 2048 out | 32 | 1386 | 693 |
| 4 | 2048 in / 2048 out | 64 | 2930 | 733 |
| 8 | 2048 in / 2048 out | 128 | 5432 | 677 |