Llama-3 70B Throughput analysis without TTFT constraint
In Figure 5, we plot total throughput across batch sizes for different tensor parallelism degrees for Llama-3 70B with 8k tokens of context length (4k input and 4k output).
In Figure 6, we show the total throughput at the best batch size (the maximum batch size for each tensor parallelism degree). The highest throughput observed, 2580.61 tokens per second, is achieved with TP 8.
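As a point of reference, total throughput in benchmarks like this is typically computed as the total number of generated tokens across the batch divided by end-to-end run time. The sketch below illustrates that arithmetic; the batch size and latency used in the example are hypothetical and are not values reported in this paper (only the 4k-output context split comes from the text).

```python
def total_throughput(batch_size: int, output_tokens: int, e2e_latency_s: float) -> float:
    """Total throughput (tokens/s) = generated tokens across the batch
    divided by end-to-end latency for the run."""
    return batch_size * output_tokens / e2e_latency_s

# Hypothetical example: a TP-8 run generating 4k output tokens per request.
# A batch of 64 finishing in ~101.5 s would land in the ~2580 tokens/s
# range, the order of magnitude reported for TP 8 above.
print(round(total_throughput(64, 4096, 101.5), 1))
```

This also shows why larger batch sizes raise total throughput: the numerator grows with the batch while end-to-end latency typically grows sublinearly, up to the memory limit that caps the maximum batch size for each TP degree.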