To illustrate the process of sizing the compute infrastructure, let's examine a specific scenario. In this scenario, our goal is to deploy a NeMo GPT 5B model for 32 concurrent users and a NeMo GPT 20B model for 64 concurrent users. We will guide you through the steps to determine the infrastructure requirements for this example.
- Identify the Number of Model Instances: First, we need to ascertain the number of model instances required to meet these user concurrency targets. Based on the model analyzer reports, the optimal concurrency levels for these models are 4 for the 5B model and 16 for the 20B model. We chose these values because exceeding them significantly increases latency.
- Calculate Model Instances: Next, we calculate the number of instances needed for each model to serve the required concurrent users. Dividing the user targets by the optimal concurrency levels (32 ÷ 4 and 64 ÷ 16), we find that we require eight instances of the 5B model and four instances of the 20B model; the sketch after these steps shows the same arithmetic in code. Note that latency requirements also play a role in determining the required number of model instances. For this example, we assume that the observed latency meets the criteria.
- Determine GPU Requirements: After establishing the number of instances, we calculate the total number of GPUs necessary to deploy these models. As outlined in Table 14, GPU utilization hovers around 100 percent at these concurrency levels, so each model instance requires a dedicated GPU. Therefore, to accommodate eight instances of the 5B model and four instances of the 20B model, we need a total of 12 GPUs.
- Server Selection: To support this scenario, we require three PowerEdge R760xa servers, each configured with 4 GPUs, to meet the demands of our deployment.
By following these steps, you can effectively size the compute infrastructure to cater to your specific user concurrency and model requirements.
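The arithmetic in these steps can be expressed as a short sizing sketch. This is illustrative only: the function and variable names are our own, and it assumes one GPU per model instance and the per-instance concurrency limits reported by the model analyzer.

```python
import math

def size_deployment(workloads, gpus_per_instance=1, gpus_per_server=4):
    """Return (instances per model, total GPUs, servers) for a workload list.

    workloads: list of (model_name, concurrent_users, optimal_concurrency),
    where optimal_concurrency is the per-instance limit taken from the
    model analyzer reports.
    """
    instances = {}
    total_gpus = 0
    for model, users, concurrency in workloads:
        # Round up: a partially loaded instance still occupies a whole GPU.
        count = math.ceil(users / concurrency)
        instances[model] = count
        total_gpus += count * gpus_per_instance
    servers = math.ceil(total_gpus / gpus_per_server)
    return instances, total_gpus, servers

# Example scenario: PowerEdge R760xa with 4 x H100 GPUs per server.
instances, gpus, servers = size_deployment(
    [("NeMo GPT 5B", 32, 4), ("NeMo GPT 20B", 64, 16)],
    gpus_per_server=4,
)
print(instances)  # {'NeMo GPT 5B': 8, 'NeMo GPT 20B': 4}
print(gpus)       # 12
print(servers)    # 3
```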
The following table summarizes the server requirements for the example scenario that we described. We have added one additional scenario for PowerEdge XE9680:
Table 15. Sizing examples
| | Scenario 1 | Scenario 2 |
| --- | --- | --- |
| Models and concurrent users | - NeMo GPT 5B model for 32 concurrent users<br>- NeMo GPT 20B model for 64 concurrent users | - NeMo GPT 1.3B model for 32 concurrent users<br>- NeMo GPT 5B model for 64 concurrent users<br>- NeMo GPT 20B model for 128 concurrent users |
| Total models needed to support required concurrency | - 8 x NeMo GPT 5B models<br>- 4 x NeMo GPT 20B models | - 8 x NeMo GPT 1.3B models<br>- 16 x NeMo GPT 5B models<br>- 8 x NeMo GPT 20B models |
| Total servers required | Three PowerEdge R760xa or PowerEdge XE8640 servers, each with 4 x H100 GPUs | Four PowerEdge XE9680 servers, each with 8 x H100 GPUs |
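Applied to the second scenario, the same sketch reproduces the totals in the table, assuming per-instance concurrency limits of 4 for the 1.3B and 5B models and 16 for the 20B model (values inferred from the instance counts above, not stated directly in the model analyzer discussion):

```python
instances, gpus, servers = size_deployment(
    [("NeMo GPT 1.3B", 32, 4),    # concurrency limit inferred from the table
     ("NeMo GPT 5B", 64, 4),
     ("NeMo GPT 20B", 128, 16)],
    gpus_per_server=8,            # PowerEdge XE9680: 8 x H100 GPUs
)
print(instances)  # {'NeMo GPT 1.3B': 8, 'NeMo GPT 5B': 16, 'NeMo GPT 20B': 8}
print(gpus)       # 32
print(servers)    # 4
```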
These calculations are based on the FasterTransformer backend. NVIDIA recently announced early access for TensorRT-LLM, which includes several performance optimizations. Our sizing guidelines will be updated when TensorRT-LLM is generally available.