Total throughput analysis with 2-second TTFT constraint
First, we plot total throughput for different tensor parallelism degrees for Llama 3 70B with an 8K-token context length, that is, 4K input tokens and 4K output tokens. The results were filtered to runs with a Time To First Token (TTFT) of less than 2 seconds.
In Figure 2, we show the total throughput at the best batch size (the maximum batch size that satisfies the constraint for each tensor parallelism degree). The maximum throughput, observed with TP 8, is 585.2 tokens per second.
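The selection procedure described above can be sketched as a small filter-and-select step over benchmark records. This is an illustrative sketch, not the tooling used in the study: the record fields and all numbers except the 585.2 tokens-per-second TP 8 result are hypothetical placeholders.

```python
# Hypothetical benchmark records: one entry per (TP degree, batch size) run.
# Only the TP 8 throughput of 585.2 tok/s comes from the measured results;
# the other values are illustrative.
results = [
    {"tp": 4, "batch": 16, "ttft_s": 1.4, "throughput": 410.0},
    {"tp": 4, "batch": 32, "ttft_s": 2.6, "throughput": 520.0},  # violates TTFT limit
    {"tp": 8, "batch": 32, "ttft_s": 1.1, "throughput": 470.0},
    {"tp": 8, "batch": 64, "ttft_s": 1.9, "throughput": 585.2},
]

TTFT_LIMIT_S = 2.0

# Step 1: keep only runs that satisfy the TTFT constraint.
feasible = [r for r in results if r["ttft_s"] < TTFT_LIMIT_S]

# Step 2: for each TP degree, report the largest feasible batch size.
best = {}
for r in feasible:
    current = best.get(r["tp"])
    if current is None or r["batch"] > current["batch"]:
        best[r["tp"]] = r

for tp in sorted(best):
    r = best[tp]
    print(f"TP {tp}: batch {r['batch']}, {r['throughput']} tok/s")
```

With these placeholder inputs, the TP 4 run at batch 32 is discarded for exceeding the 2-second TTFT limit, so the reported points are batch 16 for TP 4 and batch 64 for TP 8.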