When sizing AI inference infrastructure, several factors must be considered to ensure optimal performance, efficiency, and cost-effectiveness. Key factors include the model size and its GPU memory consumption, the number of concurrent users to be supported, and the latency requirements of the application.
Table 11 provides insights into how model size relates to GPU memory consumption.
Using the Model Analyzer report, we can assess the model and determine the number of concurrent requests it can handle while meeting our latency requirements. If more concurrent requests must be supported, additional instances of the model are deployed. These additional instances ensure efficient use of resources and optimal performance when serving multiple requests simultaneously.
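As a rough illustration of this step, the following Python sketch selects the highest per-instance concurrency whose measured latency still meets a latency target. The latency target and the report values are placeholder assumptions for illustration only, not measurements from this validated design.

```python
# Hypothetical example: choose the highest per-instance concurrency whose
# measured p95 latency still meets the latency target. The report values
# below are illustrative placeholders, not numbers from this design.

LATENCY_TARGET_MS = 250  # assumed latency requirement

# (concurrency, p95 latency in ms) pairs, as might be read from a
# Model Analyzer report exported for one model configuration
report = [(1, 90), (2, 120), (4, 210), (8, 380), (16, 700)]

meets_target = [c for c, p95 in report if p95 <= LATENCY_TARGET_MS]
optimal_concurrency = max(meets_target) if meets_target else 1
print(f"Optimal concurrency per model instance: {optimal_concurrency}")
```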
To demonstrate how to size the compute infrastructure, consider an example scenario. In this scenario, we want to deploy a NeMo GPT 5B model for 64 concurrent users and a NeMo GPT 20B model for 128 concurrent users. First, we must determine the number of model instances needed to meet this requirement. Based on the Model Analyzer reports, the optimal concurrency for these models is four and 16, respectively. To meet the requirement for concurrent users, we therefore need 16 instances of the 5B model (64 ÷ 4) and eight instances of the 20B model (128 ÷ 16). Note that latency requirements also drive the number of model instances needed. For this example, we assume the observed latency meets the requirements.
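The instance counts follow from a ceiling division of required users by per-instance concurrency. The short sketch below reproduces that arithmetic; the dictionary keys are simply labels for the two example models.

```python
import math

# Concurrent users required per model in the example scenario
required_users = {"NeMo GPT 5B": 64, "NeMo GPT 20B": 128}

# Optimal per-instance concurrency taken from the Model Analyzer reports
optimal_concurrency = {"NeMo GPT 5B": 4, "NeMo GPT 20B": 16}

# Instances needed = ceiling(required users / concurrency per instance)
instances = {
    model: math.ceil(users / optimal_concurrency[model])
    for model, users in required_users.items()
}
print(instances)  # {'NeMo GPT 5B': 16, 'NeMo GPT 20B': 8}
```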
After determining the number of instances, we calculate the total GPU memory required to deploy these models. Using the following table as a reference for the example sizing in this validated design, the total GPU memory required to host these models is approximately 560 GB. This requirement can be met by two PowerEdge R760xa servers, each equipped with four NVIDIA H100 GPUs.
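The memory and server math can be checked with the sketch below. The per-instance memory figures are assumptions chosen only so that the totals reproduce the approximately 560 GB in this example; substitute measured values for real sizing. Each R760xa is assumed to provide four 80 GB H100 GPUs.

```python
import math

# Assumed per-instance GPU memory figures, chosen only so that the totals
# reproduce the ~560 GB in this example; use measured values for real sizing.
memory_per_instance_gb = {"NeMo GPT 5B": 15, "NeMo GPT 20B": 40}
instances = {"NeMo GPT 5B": 16, "NeMo GPT 20B": 8}

total_gpu_memory_gb = sum(
    memory_per_instance_gb[m] * count for m, count in instances.items()
)

# Each PowerEdge R760xa in this design hosts four NVIDIA H100 GPUs (80 GB each)
gpu_memory_per_server_gb = 4 * 80
servers_required = math.ceil(total_gpu_memory_gb / gpu_memory_per_server_gb)

print(f"Total GPU memory required: {total_gpu_memory_gb} GB")    # 560 GB
print(f"PowerEdge R760xa servers required: {servers_required}")  # 2
```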
The following table summarizes the memory requirements for the example scenario:
Table 11. Sizing example
Requirements and sizing | Example Scenario |
Models and concurrent users | NeMo GPT 5B for 64 concurrent users; NeMo GPT 20B for 128 concurrent users |
Total model instances needed to support required concurrency | 16 instances of the 5B model; eight instances of the 20B model |
Total GPU memory required | Approximately 560 GB |
Total PowerEdge R760xa required | Two PowerEdge R760xa servers with 8 x H100 GPUs |