Sizing guidelines
End users typically have requirements for first token latency, input and output length, batch size, overall latency, and throughput. As mentioned earlier, model performance is influenced by several factors, including GPU type, model build parameters, model parallelism, and quantization. Sizing the infrastructure appropriately is a matter of aligning these user requirements with the factors that affect model performance.
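This alignment can be thought of as filtering benchmark data, such as the measurements in Table 13 and Table 15, against the user's targets. The following Python sketch is purely illustrative: the type and field names are our own and do not come from this guide.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MeasuredProfile:
    """One row of benchmark data for a model build (fields illustrative)."""
    model: str
    gpu: str
    quantization: str
    first_token_latency_ms: float
    tokens_per_second: float

@dataclass
class UserRequirements:
    """Targets an end user might specify for an inference service."""
    max_first_token_latency_ms: float
    min_tokens_per_second: float

def viable_profiles(profiles: List[MeasuredProfile],
                    req: UserRequirements) -> List[MeasuredProfile]:
    """Keep only the measured builds that meet the latency and
    throughput targets; sizing then chooses hardware among these."""
    return [p for p in profiles
            if p.first_token_latency_ms <= req.max_first_token_latency_ms
            and p.tokens_per_second >= req.min_tokens_per_second]
```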
To illustrate the sizing process, we consider two examples. The first is a small deployment that runs smaller models in production, such as the Llama 2 13B and Mistral models. These models are built with a maximum batch size of 64, so each instance can serve at most 64 concurrent requests, and two instances of each model can serve a maximum of 128 concurrent users. First token latency and tokens per second can be inferred from Table 13 and Table 15; note that Table 13 reports first token latency at a batch size of 1, and increasing the batch size increases both first token latency and overall latency. Using the data from the previous section, we can host 2 x Llama 2 13B - AWQ and 2 x Mistral - FP8 in a single PowerEdge R760xa server with 4 x NVIDIA L40S GPUs.
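The concurrency arithmetic above reduces to a one-line calculation. A minimal sketch, assuming each model instance serves at most its build-time maximum batch size of concurrent requests (the function name is illustrative):

```python
import math

def instances_needed(target_concurrent_users: int, max_batch_size: int) -> int:
    """Each instance serves at most `max_batch_size` concurrent requests,
    so the instance count is the ceiling of users over batch size."""
    return math.ceil(target_concurrent_users / max_batch_size)

# Scenario 1: models built with a maximum batch size of 64,
# targeting 128 concurrent users per model.
print(instances_needed(128, 64))  # -> 2 instances of each model
```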
In the second example, we consider a larger deployment that requires models with a mix of parameter counts, such as Llama 2 70B, 13B, and 7B, all in FP8. Using the data from the previous section, we can host all these models in a single PowerEdge XE9680 server with 8 x NVIDIA H100 GPUs.
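A back-of-the-envelope weight-memory estimate shows why this mix fits. The sketch below assumes roughly 1 byte per parameter for FP8 and 0.5 bytes for 4-bit AWQ, and deliberately ignores KV cache, activations, and runtime overhead, all of which consume additional GPU memory in practice:

```python
# Rough rule-of-thumb bytes per parameter for common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "awq4": 0.5}

def weight_memory_gb(params_billion: float, quant: str) -> float:
    """Approximate weight footprint only; KV cache, activations, and
    runtime overhead add a significant, workload-dependent amount."""
    return params_billion * BYTES_PER_PARAM[quant]

# Scenario 2 mix on a PowerEdge XE9680 (8 x H100, 80 GB each = 640 GB total):
scenario_2 = [(70, "fp8"), (13, "fp8"), (7, "fp8"), (7, "fp8")]
total_gb = sum(weight_memory_gb(p, q) for p, q in scenario_2)
print(f"Scenario 2 weights: ~{total_gb:.0f} GB vs 640 GB aggregate HBM")  # ~97 GB
```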
The following table lists the server requirements for the example scenarios that we described:
| Example scenario 1: PowerEdge R760xa with 4 x NVIDIA L40S GPUs | Example scenario 2: PowerEdge XE9680 with 8 x NVIDIA H100 GPUs |
| --- | --- |
| 2 x Llama 2 13B - AWQ | 1 x Llama 2 70B - FP8 |
| 2 x Mistral - FP8 | 1 x Llama 2 13B - FP8 |
| | 2 x Llama 2 7B - FP8 |
The performance and sizing guidelines do not consider NVIDIA's time slicing or Multi-Instance GPU (MIG) capabilities, which can partition a GPU and potentially increase the number of models hosted on a particular server. As we characterize inference performance with these advanced capabilities, we will update this section.