Maximizing Llama Open Source Model Inference Performance with Tensor Parallelism on a Dell XE9680 with H100s

Llama-2 13B TP efficiency analysis with 2 second TTFT constraint
We plotted the throughput efficiency per GPU for different tensor parallelism degrees at a full 4k tokens of context length (2k input and 2k output) without any time constraint.
In Figure 12, we show throughput efficiency per GPU at the best batch size (the maximum batch size for each tensor parallelism degree). The TP1 case does not have enough memory to be effective, but by TP2 the throughput per GPU is already quite good (0.99x of the peak, which occurs at TP4). TP8 also remains very effective, with a per-GPU efficiency of 0.9x that of TP4.
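The per-GPU efficiency metric behind this comparison can be sketched as follows: divide each configuration's total throughput by its tensor parallelism degree, then normalize to the best per-GPU value. The helper name and the throughput numbers below are illustrative assumptions, not the measured values behind Figure 12.

```python
def per_gpu_efficiency(throughputs):
    """throughputs: dict mapping TP degree -> total throughput (tokens/s).

    Returns each TP degree's per-GPU throughput normalized to the best
    per-GPU value across all degrees (the peak gets efficiency 1.0).
    """
    # Per-GPU throughput: total throughput divided by GPU count (TP degree).
    per_gpu = {tp: t / tp for tp, t in throughputs.items()}
    best = max(per_gpu.values())
    return {tp: per_gpu[tp] / best for tp in throughputs}

# Illustrative numbers only -- chosen so the ratios mirror the trend
# described in the text (TP4 is the per-GPU peak).
example = {2: 1980.0, 4: 4000.0, 8: 7200.0}
print(per_gpu_efficiency(example))  # TP2 ~0.99, TP4 = 1.0, TP8 ~0.9
```

A value near 1.0 at TP2 and TP8 indicates that scaling the GPU count changes total throughput almost proportionally, which is the sense in which tensor parallelism remains "effective" here.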