Maximizing Llama Open Source Model Inference Performance with Tensor Parallelism on a Dell XE9680 with H100s

Llama-2 13B TP efficiency analysis with 2 second TTFT constraint
We plotted the throughput efficiency per GPU for different tensor parallelism degrees at a full 4k tokens of context length (2k input and 2k output) without any time constraint.
In Figure 12, we show throughput efficiency per GPU at the best batch size (the maximum batch size for each tensor parallelism degree). The TP1 case does not have enough memory to be effective, but by TP2 the throughput per GPU is already quite good (0.99x of the peak, which occurs at TP4). TP8 also remains very effective, with a per-GPU efficiency of 0.9x that of TP4.
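The per-GPU efficiency metric behind this comparison can be sketched as follows: divide each configuration's total throughput by its tensor parallelism degree, then normalize to the best per-GPU value. The helper name and the throughput numbers below are illustrative assumptions, not the measured values behind Figure 12.

```python
def per_gpu_efficiency(throughputs):
    """throughputs: dict mapping TP degree -> total throughput (tokens/s).

    Returns each TP degree's per-GPU throughput normalized to the best
    per-GPU value across all degrees (the peak gets efficiency 1.0).
    """
    # Per-GPU throughput: total throughput divided by GPU count (TP degree).
    per_gpu = {tp: t / tp for tp, t in throughputs.items()}
    best = max(per_gpu.values())
    return {tp: per_gpu[tp] / best for tp in throughputs}

# Illustrative numbers only -- chosen so the ratios mirror the trend
# described in the text (TP4 is the per-GPU peak).
example = {2: 1980.0, 4: 4000.0, 8: 7200.0}
print(per_gpu_efficiency(example))  # TP2 ~0.99, TP4 = 1.0, TP8 ~0.9
```

A value near 1.0 at TP2 and TP8 indicates that scaling the GPU count changes total throughput almost proportionally, which is the sense in which tensor parallelism remains "effective" here.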