Total throughput analysis with 2-second TTFT constraint
First, we plot total throughput for different tensor parallelism degrees for Llama 3 70B with an 8K-token context length, that is, 4K input tokens and 4K output tokens. The results were filtered to runs with a Time To First Token (TTFT) of less than 2 seconds.
In Figure 2, we show the total throughput at the best batch size (the maximum batch size that satisfies the constraint for each tensor parallelism degree). The maximum throughput, observed with TP 8, is 585.2 tokens per second.
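The selection procedure described above can be sketched as a small filter-and-select step over benchmark records. This is an illustrative sketch, not the tooling used in the study: the record fields and all numbers except the 585.2 tokens-per-second TP 8 result are hypothetical placeholders.

```python
# Hypothetical benchmark records: one entry per (TP degree, batch size) run.
# Only the TP 8 throughput of 585.2 tok/s comes from the measured results;
# the other values are illustrative.
results = [
    {"tp": 4, "batch": 16, "ttft_s": 1.4, "throughput": 410.0},
    {"tp": 4, "batch": 32, "ttft_s": 2.6, "throughput": 520.0},  # violates TTFT limit
    {"tp": 8, "batch": 32, "ttft_s": 1.1, "throughput": 470.0},
    {"tp": 8, "batch": 64, "ttft_s": 1.9, "throughput": 585.2},
]

TTFT_LIMIT_S = 2.0

# Step 1: keep only runs that satisfy the TTFT constraint.
feasible = [r for r in results if r["ttft_s"] < TTFT_LIMIT_S]

# Step 2: for each TP degree, report the largest feasible batch size.
best = {}
for r in feasible:
    current = best.get(r["tp"])
    if current is None or r["batch"] > current["batch"]:
        best[r["tp"]] = r

for tp in sorted(best):
    r = best[tp]
    print(f"TP {tp}: batch {r['batch']}, {r['throughput']} tok/s")
```

With these placeholder inputs, the TP 4 run at batch 32 is discarded for exceeding the 2-second TTFT limit, so the reported points are batch 16 for TP 4 and batch 64 for TP 8.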