Sizing the infrastructure for LLM inference is essential because of the computational demands, high memory requirements, and unique characteristics of these models. Properly sized infrastructure handles an LLM's massive parameter count without out-of-memory errors, delivers low latency, and accommodates varying workloads, enabling resource efficiency, scalability, and cost optimization. In short, proper infrastructure sizing ensures good performance, scalability, and user experience while managing operational costs and avoiding resource bottlenecks.
To understand the performance of a large language model (LLM), you can measure several metrics, each assessing a different aspect of the model's behavior (a minimal timing sketch follows this list):
- First token latency—This metric, also known as time to first token (TTFT), measures the time it takes for the model to generate the first token of a response after receiving an input prompt. It reflects the initial processing time of the model and can be important for real-time applications where low latency is crucial.
- Tokens per second—This metric measures the throughput of the model, indicating how many tokens (subword units of text) the model can generate per second on average. It provides insight into the overall speed of the model and its ability to process input data efficiently.
- Overall response latency—Unlike first token latency, this metric measures the total time it takes for the model to generate a complete response to a given input prompt. It includes the time for processing the entire input sequence and generating the output sequence. It is crucial for assessing the end-to-end latency experienced by users interacting with the model.
- Number of concurrent users—This metric measures the model's ability to handle multiple simultaneous requests or users. It helps determine the scalability and resource requirements of deploying the model in production environments with varying levels of user concurrency.
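The following is a minimal, framework-agnostic sketch of how the first three metrics can be measured. The `measure_stream` helper and the token stream it consumes are illustrative names, standing in for whatever streaming client your serving endpoint provides:

```python
import time

def measure_stream(stream):
    """Time a token stream; returns (ttft_s, tokens_per_s, total_s).

    `stream` is any iterable that yields tokens as the model emits them,
    for example a streaming client for your inference endpoint.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        count += 1
        if first is None:
            first = time.perf_counter()  # first token arrived: TTFT
    end = time.perf_counter()
    ttft = (first - start) if first is not None else float("nan")
    # Decode throughput over the generation phase, after the first token.
    tps = (count - 1) / (end - first) if count > 1 else 0.0
    return ttft, tps, end - start
```

Here, `total_s` is the overall response latency that users experience. The number of concurrent users is measured by driving many such streams in parallel and observing how TTFT and tokens per second degrade as concurrency rises.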
The following factors impact the performance of LLM inference:
- Model size in parameters—The number of parameters in the LLM directly impacts its memory footprint and computational requirements. Larger models with more parameters might require more powerful hardware for efficient inference. The following table provides GPU memory consumption for various models (a rough estimator is sketched at the end of this section). The models were built with the following parameters:
- Tensor parallelism = 1
- Pipeline parallelism = 1
- Maximum input length = 2048
- Maximum output length = 2048
- Maximum batch size = 64
The memory consumption was measured with a batch size of 1, an input length of 128, and an output length of 1.
Table 11. GPU memory consumption by model and quantization format

| Model and quantization | GPU memory consumption |
|---|---|
| LLama 2 7B - FP8 | 23.54 GB |
| LLama 2 7B - AWQ | 16.54 GB |
| LLama 2 13B - FP8 | 33.52 GB |
| LLama 2 13B - AWQ | 23.67 GB |
| LLama 2 70B - FP8 | 70.06 GB |
| LLama 2 70B - AWQ | 66.28 GB |
| Mistral - FP8 | 25.64 GB |
| Falcon 180B - FP8 | 52.29 GB |
- GPU type, compute, and memory use—The choice of GPU type, compute capability, and available memory are crucial factors in optimizing inference performance. Different GPU architectures (for example, NVIDIA H100 compared to NVIDIA L40S GPUs) offer varying levels of compute power and memory bandwidth, which affect the speed and efficiency of LLM inference. A snippet for checking the memory that the target GPUs actually offer appears at the end of this section.
- Model build parameters—Parameters such as maximum input length, maximum output length, and maximum batch size determine the resources consumed by the model during inference. These parameters influence the memory and computational requirements of the model, most visibly the size of the KV cache, and must be carefully tuned to optimize inference performance while ensuring that the model can handle inputs of varying lengths and complexities. The estimator sketched at the end of this section shows how they drive memory use.
- Model parallelism—Model parallelism is the technique of dividing the computation of a neural network across multiple GPUs to improve throughput and reduce inference latency. By partitioning the model into smaller segments and processing them in parallel, it is possible to exploit the computational capabilities of multiple GPUs and accelerate inference speed. However, implementing model parallelism requires careful consideration of communication overhead and load balancing to ensure efficient use of resources. A sharding sketch at the end of this section illustrates the idea.
- Model quantization—Model quantization reduces the precision of numerical values in the model parameters, typically from 32-bit floating-point numbers to lower-precision formats such as 16-bit floating point or fixed-point integers. Quantization can significantly reduce memory bandwidth requirements and improve inference speed by allowing more efficient storage and computation of model parameters. However, it might also impact the accuracy and quality of model predictions, so the trade-off between inference speed and model accuracy must be balanced when applying quantization techniques; a minimal quantization sketch closes this section.
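To make the model-size and build-parameter factors concrete, here is a back-of-the-envelope Python sketch. The function name and the assumption that the KV cache is fully materialized for the maximum batch size and sequence length are ours; the Llama 2 7B shape constants (32 layers, 32 attention heads, head dimension 128) follow the published architecture:

```python
def estimate_gpu_memory_gb(
    n_params: float,        # model parameters, e.g., 7e9 for LLama 2 7B
    weight_bytes: float,    # bytes per weight: 2 (FP16), 1 (FP8), 0.5 (4-bit AWQ)
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    max_seq_len: int,       # maximum input tokens + maximum output tokens
    max_batch_size: int,
    kv_bytes: float = 2.0,  # KV-cache precision, commonly FP16
) -> float:
    """Rough estimate: weight memory plus a fully materialized KV cache."""
    weights = n_params * weight_bytes
    # K and V are cached for every layer, head, and token in flight.
    kv_cache = (2 * n_layers * n_kv_heads * head_dim
                * max_seq_len * max_batch_size * kv_bytes)
    return (weights + kv_cache) / 1e9

# LLama 2 7B with FP8 weights, sized for the build parameters above
# (2048 input + 2048 output tokens, maximum batch size 64):
print(f"{estimate_gpu_memory_gb(7e9, 1, 32, 32, 128, 4096, 64):.1f} GB")
```

The wide gap between this fully preallocated worst case (about 144 GB) and the 23.54 GB measured in Table 11 at batch size 1 with short sequences shows that actual in-flight KV-cache usage, not just the build maxima, determines the footprint in practice; the measured figures also include activations and runtime buffers that this sketch ignores.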
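To check a planned configuration against the hardware, the free and total memory on each visible GPU can be queried; this snippet assumes PyTorch is installed on the host:

```python
import torch

# Report free and total memory on every visible GPU so that expected
# footprints (see Table 11) can be checked against real headroom.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i} ({name}): {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```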
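The sharding idea behind model parallelism can be shown with plain NumPy: a linear layer's weight matrix is split column-wise so that each device computes an independent slice of the output, and a gather step reassembles the full result. This is a single-process illustration; real tensor-parallel runtimes place each shard on its own GPU and use collective communication for the gather:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))           # a batch of activations
w = rng.standard_normal((512, 1024))        # the full weight matrix

n_gpus = 2
shards = np.split(w, n_gpus, axis=1)        # column-wise shard per device
partials = [x @ shard for shard in shards]  # each device computes its slice
y = np.concatenate(partials, axis=1)        # "all-gather" of the slices

assert np.allclose(y, x @ w)                # matches the unsharded result
```

The concatenation is the communication overhead the text warns about: every additional shard buys compute but adds a synchronization point.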
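Finally, here is a minimal sketch of symmetric per-tensor int8 quantization. It is a simpler scheme than the FP8 and AWQ formats in Table 11, but it shows the same mechanics: a scale maps weights to a narrow integer range, storage shrinks accordingly, and rounding introduces a small reconstruction error:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)
q, scale = quantize_int8(w)

print(f"fp32: {w.nbytes / 1e6:.0f} MB -> int8: {q.nbytes / 1e6:.0f} MB")  # 4x smaller
err = np.abs(w - q.astype(np.float32) * scale).mean()
print(f"mean absolute dequantization error: {err:.4f}")
```

The 4x storage reduction translates directly into lower memory bandwidth per token, which is why the AWQ rows in Table 11 need less memory than their FP8 counterparts; the reconstruction error is the accuracy cost the text cautions about.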