Llama-3 70B Tokens per second per GPU without any TTFT constraint
Similarly, we plotted per-GPU throughput for each tensor parallelism degree at a full 8k-token context length (4k input and 4k output tokens), with no TTFT constraint applied.
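The per-GPU metric here is simply aggregate throughput normalized by the number of GPUs in the tensor-parallel group. A minimal sketch of that normalization, using hypothetical aggregate throughput values (the figures below are illustrative placeholders, not measurements from this paper):

```python
def per_gpu_throughput(total_tps: float, tp_degree: int) -> float:
    """Normalize aggregate throughput (tokens/s) by the number of
    GPUs the tensor-parallel group occupies."""
    return total_tps / tp_degree

# Hypothetical aggregate throughput per TP degree (tokens/s),
# for illustration only.
aggregate_tps = {2: 900.0, 4: 1500.0, 8: 2200.0}

for tp, tps in sorted(aggregate_tps.items()):
    print(f"TP={tp}: {per_gpu_throughput(tps, tp):.1f} tokens/s/GPU")
```

This normalization is what makes TP degrees comparable: a higher TP degree may raise aggregate throughput while still lowering the per-GPU figure, since the extra GPUs spend part of their time on inter-GPU communication.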