We assessed LLM performance by measuring first token latency in two distinct scenarios: chatbot interactions, characterized by short input lengths, and Retrieval-Augmented Generation (RAG) tasks, which typically involve longer input lengths. We also measured tokens per second, reflecting offline task scenarios.
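The two metrics can be expressed as simple helper functions. This is a minimal sketch of the definitions only; the streaming-iterator interface below is a hypothetical stand-in, not the TensorRT-LLM API:

```python
import time

def first_token_latency_ms(token_stream):
    """Time from request submission until the first generated token arrives
    (the online metric). `token_stream` is any iterator that yields tokens
    as they are produced -- a hypothetical stand-in for an engine's
    streaming interface."""
    start = time.perf_counter()
    next(token_stream)  # blocks until the first token is ready
    return (time.perf_counter() - start) * 1000.0

def tokens_per_second(total_tokens, elapsed_seconds):
    """Aggregate throughput (the offline metric) over a batch of requests."""
    return total_tokens / elapsed_seconds
```

For example, a batch of 64 sequences each generating 128 tokens in 2 seconds corresponds to `tokens_per_second(64 * 128, 2.0)`, or 4096 tokens/s.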
Note: These results were measured with TensorRT-LLM version 0.7.1. Newer versions of TensorRT-LLM could yield better performance results.
We used the benchmarking scripts that are packaged with NVIDIA TensorRT-LLM to measure model performance in the described scenarios. These scripts were run directly on TensorRT-LLM, without incorporating Triton Inference Server.
A model's inference performance varies with model parallelism. To determine the best configuration, we selected Llama 2 13B and compared its performance while adjusting tensor parallelism (TP) and pipeline parallelism (PP). From these results, we identified the optimal TP and PP values for each GPU configuration.
| NVIDIA GPU | Optimal parallelism |
|---|---|
| 1 x L40S | TP=1, PP=1 |
| 2 x L40S | TP=2, PP=1 |
| 4 x L40S | TP=4, PP=1 |
| 2 x H100 SXM | TP=2, PP=1 |
| 4 x H100 SXM | TP=4, PP=1 |
| 8 x H100 SXM | TP=8, PP=1 |
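The pattern in the table above, pure tensor parallelism with a single pipeline stage, can be captured in a small helper. This is a summary of the measurements for Llama 2 13B on the tested configurations, not a general rule for other models or GPU counts:

```python
def optimal_parallelism(num_gpus: int) -> dict:
    """Optimal TP/PP split observed for Llama 2 13B on the tested
    configurations: tensor parallelism across all GPUs, no pipeline stages.
    Note that TP x PP must equal the total number of GPUs the engine uses."""
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    return {"TP": num_gpus, "PP": 1}
```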
We used the following model build parameters:
The goal of this benchmarking is to measure first token latency, representing online tasks. The following tables show the results that we observed for a batch size of 1 and an output length of 1, so that total generation time equals first token latency.
First token latency in milliseconds, chatbot scenario (short inputs):

| Model/GPU | 1 x L40S | 2 x L40S | 4 x L40S | 1 x H100 SXM | 2 x H100 SXM | 4 x H100 SXM | 8 x H100 SXM |
|---|---|---|---|---|---|---|---|
| Llama 2 7B - FP8 | 18.03 | 18.23 | 15.52 | 8.05 | 8.95 | 6.17 | 7.49 |
| Llama 2 13B - FP8 | 29.56 | 29.77 | 23.49 | 12.16 | 12.61 | 10.24 | 9.40 |
| Llama 2 70B - FP8 | Not supported | Not supported | 76.34 | Not supported | 42.05 | 25.09 | 24.53 |
| Llama 2 7B - AWQ | 10.38 | 10.98 | 11.37 | 9.27 | 9.36 | Not recommended | Not recommended |
| Llama 2 13B - AWQ | 18.33 | 19.33 | 19.56 | 15.21 | 15.86 | 10.88 | Not recommended |
| Llama 2 70B - AWQ | Not supported | Not supported | 60.87 | 58.01 | 58.87 | 30.61 | Not recommended |
| Mistral - FP8 | 19.26 | 19.34 | 15.89 | 8.48 | 8.28 | 6.38 | 7.15 |
| Falcon 180B - FP8 | Not supported | Not supported | Not supported | Not supported | Not supported | Not supported | 28.15 |
First token latency in milliseconds, RAG scenario (longer inputs):

| Model/GPU | 1 x L40S | 2 x L40S | 4 x L40S | 1 x H100 SXM | 2 x H100 SXM | 4 x H100 SXM | 8 x H100 SXM |
|---|---|---|---|---|---|---|---|
| Llama 2 7B - FP8 | 68.45 | 72.09 | 119.57 | 29.87 | 29.96 | 19.33 | 18.64 |
| Llama 2 13B - FP8 | 132.55 | 138.92 | 193.84 | 54.89 | 55.32 | 30.9 | 26.20 |
| Llama 2 70B - FP8 | Not supported | Not supported | 674.99 | Not supported | 224.21 | 106.33 | 83.44 |
| Llama 2 7B - AWQ | 116.59 | 125.49 | 134.94 | 80.24 | 81.28 | Not recommended | Not recommended |
| Llama 2 13B - AWQ | 228.81 | 244.12 | 221.86 | 151.15 | 152.19 | 57.52 | Not recommended |
| Llama 2 70B - AWQ | Error | Error | 825.29 | 755.56 | 731.28 | 252.76 | Not recommended |
| Mistral - FP8 | 73.26 | 76.40 | 120.25 | 31.22 | 31.66 | 19.60 | 20.89 |
| Falcon 180B - FP8 | Not supported | Not supported | Not supported | Not supported | Not supported | Not supported | 128.84 |
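Because the output length is capped at one token in these runs, end-to-end wall-clock time equals first token latency, so no streaming hook is required. A minimal timing harness under that assumption might look like the following, where `generate` is a hypothetical stand-in for the engine's generate call, not a TensorRT-LLM API:

```python
import statistics
import time

def measure_first_token_latency_ms(generate, prompt, iters=10, warmup=2):
    """Median end-to-end latency (ms) at batch size 1, output length 1.
    With max_new_tokens=1, total generation time equals first token latency.
    `generate(prompt, max_new_tokens)` is a hypothetical engine call."""
    for _ in range(warmup):  # discard warmup runs (caches, lazy init)
        generate(prompt, max_new_tokens=1)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        generate(prompt, max_new_tokens=1)
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)
```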
The goal of this benchmarking is to measure the tokens per second to represent offline tasks. The following table shows the results that we observed for a batch size of 64 and an input and output length of 128.
Throughput in tokens per second:

| Model/GPU | 1 x L40S | 2 x L40S | 4 x L40S | 1 x H100 SXM (XE8640) | 2 x H100 SXM | 4 x H100 SXM | 8 x H100 SXM (XE9680) |
|---|---|---|---|---|---|---|---|
| Llama 2 7B - FP8 | 3355.63 | 4018.46 | 4690.64 | 10033.68 | 9917.65 | 14458.99 | 13542.70 |
| Llama 2 13B - FP8 | 1901.83 | 2346.48 | 2851.63 | 6055.56 | 6594.96 | 9874.79 | 9587.35 |
| Llama 2 70B - FP8 | Not supported | Not supported | 861.97 | Not supported | 2115.99 | 3189.37 | 3794.54 |
| Llama 2 7B - AWQ | 3046.51 | 3630.44 | 5392.88 | 5804.99 | 6649.89 | Not recommended | Not recommended |
| Llama 2 13B - AWQ | 1876.85 | 2258.64 | 2692.63 | 3512.79 | 4289.96 | 6117.19 | Not recommended |
| Llama 2 70B - AWQ | Not supported | 749.73 | 960.82 | 900.50 | 1280.65 | 1811.31 | Not recommended |
| Mistral - FP8 | 3767.75 | 4268.13 | 4835.76 | 9963.33 | 10194.46 | 14208.13 | 13540.23 |
| Falcon 180B - FP8 | Not supported | Not supported | Not supported | Not supported | Not supported | Not supported | 2861.58 |
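As a sanity check on how to read these throughput numbers: with a batch size of 64 and an output length of 128, each batch produces 64 × 128 = 8192 generated tokens, so a throughput figure directly implies a per-batch generation time:

```python
BATCH_SIZE = 64
OUTPUT_LEN = 128
tokens_per_batch = BATCH_SIZE * OUTPUT_LEN  # 8192 generated tokens per batch

def seconds_per_batch(tokens_per_second: float) -> float:
    """Time to generate one full batch at the measured throughput."""
    return tokens_per_batch / tokens_per_second

# E.g., Llama 2 7B FP8 on 1 x L40S at 3355.63 tokens/s takes
# roughly 8192 / 3355.63, about 2.44 s per batch.
```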