We assessed LLM performance by measuring first token latency in two distinct scenarios: chatbot interactions, characterized by short input lengths, and Retrieval-Augmented Generation (RAG) tasks, which typically involve longer input lengths. We also measured tokens per second, reflecting offline task scenarios.
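The two metrics can be expressed as simple helper functions. This is a minimal sketch of the definitions only; the streaming-iterator interface below is a hypothetical stand-in, not the TensorRT-LLM API:

```python
import time

def first_token_latency_ms(token_stream):
    """Time from request submission until the first generated token arrives
    (the online metric). `token_stream` is any iterator that yields tokens
    as they are produced -- a hypothetical stand-in for an engine's
    streaming interface."""
    start = time.perf_counter()
    next(token_stream)  # blocks until the first token is ready
    return (time.perf_counter() - start) * 1000.0

def tokens_per_second(total_tokens, elapsed_seconds):
    """Aggregate throughput (the offline metric) over a batch of requests."""
    return total_tokens / elapsed_seconds
```

For example, a batch of 64 sequences each generating 128 tokens in 2 seconds corresponds to `tokens_per_second(64 * 128, 2.0)`, or 4096 tokens/s.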
Note: These results were measured with TensorRT-LLM version 0.7.1. Newer versions of TensorRT-LLM could yield better performance results.
We used the benchmarking scripts that are packaged with NVIDIA TensorRT-LLM to measure model performance in the described scenarios. These scripts were run directly on TensorRT-LLM, without incorporating Triton Inference Server.
A model's inference performance varies with model parallelism. To determine the best configuration, we selected Llama 2 13B and compared its performance while adjusting tensor parallelism (TP) and pipeline parallelism (PP). From these results, we identified the optimal TP and PP values for each GPU configuration.
| NVIDIA GPU | Optimal parallelism |
|---|---|
| 1 x L40S | TP=1, PP=1 |
| 2 x L40S | TP=2, PP=1 |
| 4 x L40S | TP=4, PP=1 |
| 2 x H100 SXM | TP=2, PP=1 |
| 4 x H100 SXM | TP=4, PP=1 |
| 8 x H100 SXM | TP=8, PP=1 |
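The pattern in the table above, pure tensor parallelism with a single pipeline stage, can be captured in a small helper. This is a summary of the measurements for Llama 2 13B on the tested configurations, not a general rule for other models or GPU counts:

```python
def optimal_parallelism(num_gpus: int) -> dict:
    """Optimal TP/PP split observed for Llama 2 13B on the tested
    configurations: tensor parallelism across all GPUs, no pipeline stages.
    Note that TP x PP must equal the total number of GPUs the engine uses."""
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    return {"TP": num_gpus, "PP": 1}
```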
We used the following model build parameters:
The goal of this benchmarking is to measure first token latency, representing online tasks. The following tables show the results that we observed for a batch size of 1 and an output length of 1, so that total generation time equals first token latency.
First token latency in milliseconds, chatbot scenario (short inputs):

| Model/GPU | 1 x L40S | 2 x L40S | 4 x L40S | 1 x H100 SXM | 2 x H100 SXM | 4 x H100 SXM | 8 x H100 SXM |
|---|---|---|---|---|---|---|---|
| Llama 2 7B - FP8 | 18.03 | 18.23 | 15.52 | 8.05 | 8.95 | 6.17 | 7.49 |
| Llama 2 13B - FP8 | 29.56 | 29.77 | 23.49 | 12.16 | 12.61 | 10.24 | 9.40 |
| Llama 2 70B - FP8 | Not supported | Not supported | 76.34 | Not supported | 42.05 | 25.09 | 24.53 |
| Llama 2 7B - AWQ | 10.38 | 10.98 | 11.37 | 9.27 | 9.36 | Not recommended | Not recommended |
| Llama 2 13B - AWQ | 18.33 | 19.33 | 19.56 | 15.21 | 15.86 | 10.88 | Not recommended |
| Llama 2 70B - AWQ | Not supported | Not supported | 60.87 | 58.01 | 58.87 | 30.61 | Not recommended |
| Mistral - FP8 | 19.26 | 19.34 | 15.89 | 8.48 | 8.28 | 6.38 | 7.15 |
| Falcon 180B - FP8 | Not supported | Not supported | Not supported | Not supported | Not supported | Not supported | 28.15 |
First token latency in milliseconds, RAG scenario (longer inputs):

| Model/GPU | 1 x L40S | 2 x L40S | 4 x L40S | 1 x H100 SXM | 2 x H100 SXM | 4 x H100 SXM | 8 x H100 SXM |
|---|---|---|---|---|---|---|---|
| Llama 2 7B - FP8 | 68.45 | 72.09 | 119.57 | 29.87 | 29.96 | 19.33 | 18.64 |
| Llama 2 13B - FP8 | 132.55 | 138.92 | 193.84 | 54.89 | 55.32 | 30.9 | 26.20 |
| Llama 2 70B - FP8 | Not supported | Not supported | 674.99 | Not supported | 224.21 | 106.33 | 83.44 |
| Llama 2 7B - AWQ | 116.59 | 125.49 | 134.94 | 80.24 | 81.28 | Not recommended | Not recommended |
| Llama 2 13B - AWQ | 228.81 | 244.12 | 221.86 | 151.15 | 152.19 | 57.52 | Not recommended |
| Llama 2 70B - AWQ | Error | Error | 825.29 | 755.56 | 731.28 | 252.76 | Not recommended |
| Mistral - FP8 | 73.26 | 76.40 | 120.25 | 31.22 | 31.66 | 19.60 | 20.89 |
| Falcon 180B - FP8 | Not supported | Not supported | Not supported | Not supported | Not supported | Not supported | 128.84 |
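Because the output length is capped at one token in these runs, end-to-end wall-clock time equals first token latency, so no streaming hook is required. A minimal timing harness under that assumption might look like the following, where `generate` is a hypothetical stand-in for the engine's generate call, not a TensorRT-LLM API:

```python
import statistics
import time

def measure_first_token_latency_ms(generate, prompt, iters=10, warmup=2):
    """Median end-to-end latency (ms) at batch size 1, output length 1.
    With max_new_tokens=1, total generation time equals first token latency.
    `generate(prompt, max_new_tokens)` is a hypothetical engine call."""
    for _ in range(warmup):  # discard warmup runs (caches, lazy init)
        generate(prompt, max_new_tokens=1)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        generate(prompt, max_new_tokens=1)
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)
```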
The goal of this benchmarking is to measure the tokens per second to represent offline tasks. The following table shows the results that we observed for a batch size of 64 and an input and output length of 128.
Throughput in tokens per second:

| Model/GPU | 1 x L40S | 2 x L40S | 4 x L40S | 1 x H100 SXM (XE8640) | 2 x H100 SXM | 4 x H100 SXM | 8 x H100 SXM (XE9680) |
|---|---|---|---|---|---|---|---|
| Llama 2 7B - FP8 | 3355.63 | 4018.46 | 4690.64 | 10033.68 | 9917.65 | 14458.99 | 13542.70 |
| Llama 2 13B - FP8 | 1901.83 | 2346.48 | 2851.63 | 6055.56 | 6594.96 | 9874.79 | 9587.35 |
| Llama 2 70B - FP8 | Not supported | Not supported | 861.97 | Not supported | 2115.99 | 3189.37 | 3794.54 |
| Llama 2 7B - AWQ | 3046.51 | 3630.44 | 5392.88 | 5804.99 | 6649.89 | Not recommended | Not recommended |
| Llama 2 13B - AWQ | 1876.85 | 2258.64 | 2692.63 | 3512.79 | 4289.96 | 6117.19 | Not recommended |
| Llama 2 70B - AWQ | Not supported | 749.73 | 960.82 | 900.50 | 1280.65 | 1811.31 | Not recommended |
| Mistral - FP8 | 3767.75 | 4268.13 | 4835.76 | 9963.33 | 10194.46 | 14208.13 | 13540.23 |
| Falcon 180B - FP8 | Not supported | Not supported | Not supported | Not supported | Not supported | Not supported | 2861.58 |
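As a sanity check on how to read these throughput numbers: with a batch size of 64 and an output length of 128, each batch produces 64 × 128 = 8192 generated tokens, so a throughput figure directly implies a per-batch generation time:

```python
BATCH_SIZE = 64
OUTPUT_LEN = 128
tokens_per_batch = BATCH_SIZE * OUTPUT_LEN  # 8192 generated tokens per batch

def seconds_per_batch(tokens_per_second: float) -> float:
    """Time to generate one full batch at the measured throughput."""
    return tokens_per_batch / tokens_per_second

# E.g., Llama 2 7B FP8 on 1 x L40S at 3355.63 tokens/s takes
# roughly 8192 / 3355.63, about 2.44 s per batch.
```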