We characterized the performance of the following NeMo models: 345M, 1.3B, 5B, and 20B. The tests ran on a PowerEdge R760xa server with four H100 GPUs, with an NVLink Bridge connecting the GPUs into two pairs. We used the Model Analyzer tool, which allowed us to run model sweeps using synthetic datasets. During these sweeps, Model Analyzer performed inference against each NeMo model while varying the number of concurrent requests, enabling us to measure response latencies and gather essential GPU metrics. All NeMo models ran on a single GPU, and Model Analyzer generated a comprehensive report for each sweep.
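As a rough illustration of how such a sweep can be launched, the following sketch invokes the Model Analyzer CLI from Python; the repository path and model name are placeholders, and only the most basic options are shown, not the exact configuration used in this study:

```python
# Illustrative only: launch a Triton Model Analyzer profiling sweep from Python.
# The repository path and model name below are placeholders.
import subprocess

subprocess.run(
    [
        "model-analyzer", "profile",
        "--model-repository", "/models",       # Triton model repository (assumed path)
        "--profile-models", "nemo_gpt_20b",    # hypothetical model name
    ],
    check=True,
)
```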
Because GPU memory is a primary bottleneck for efficient LLM inference, it was crucial to analyze model memory consumption. Understanding the GPU memory consumed by various LLMs is pivotal for determining the appropriate resource allocation in production environments. By accurately assessing GPU memory use, we can size the infrastructure properly, avoid out-of-memory errors, and optimize the overall performance of LLM inference.
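As a minimal sketch of how per-GPU memory use can be sampled during such a run (using the NVML Python bindings, pynvml; Model Analyzer collects equivalent metrics automatically):

```python
# Minimal sketch: sample per-GPU memory use with the NVML Python bindings
# (pynvml). Model Analyzer gathers equivalent metrics automatically; this only
# illustrates how figures like those in Table 13 can be spot-checked by hand.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: {mem.used / 1e9:.1f} GB used of {mem.total / 1e9:.1f} GB")
finally:
    pynvml.nvmlShutdown()
```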
Table 13. Model parameter and GPU memory consumption
| Model | GPU memory consumed (for one concurrent request) | Precision |
|---|---|---|
| NeMo GPT 345M | 3.1 GB | BF16 |
| NeMo GPT 1.3B | 4.1 GB | BF16 |
| NeMo GPT 5B | 12.4 GB | BF16 |
| NeMo GPT 20B | 42.4 GB | BF16 |
| Bloom 7B | 16.6 GB | BF16 |
| Stable Diffusion 2.0 | 7.9 GB | FP16 |
| Llama-2-7B | 14.2 GB | FP16 |
| Llama-2-13B | 29.5 GB | FP16 |
| Llama-2-70B | 154.7 GB | FP16 |
As a rule of thumb, we can estimate that an LLM consumes approximately 2.1 GB of GPU memory for every 1 billion parameters at BF16 precision. This follows from BF16 storing each parameter in 2 bytes (about 2 GB per billion parameters for the weights alone), with the remainder consumed by activations and inference-runtime overhead. There are exceptions to this rule, such as smaller models like NeMo 345M, for which fixed overhead dominates. Additionally, non-LLM models like Stable Diffusion have their own memory requirements and might not follow the same pattern. Note that the memory consumed by a model might increase slightly as the number of concurrent requests it handles increases.
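As a quick sanity check of this rule of thumb (the 2.1 GB-per-billion-parameters factor below is the empirical estimate derived from the table above, not a published constant):

```python
# Rough GPU-memory estimate for serving an LLM at BF16/FP16 (2 bytes per parameter).
# The 2.1 GB-per-billion-parameters factor is the empirical rule of thumb taken
# from Table 13; actual use also depends on activations and runtime overhead.

EMPIRICAL_GB_PER_BILLION = 2.1    # rule of thumb observed in Table 13

def estimate_gpu_memory_gb(params_billion: float) -> float:
    """Estimate GPU memory (GB) for one concurrent request at BF16 precision."""
    return params_billion * EMPIRICAL_GB_PER_BILLION

for size in (1.3, 5, 20):
    print(f"{size}B parameters -> ~{estimate_gpu_memory_gb(size):.1f} GB")
# ~2.7 GB, ~10.5 GB, ~42.0 GB -- close to the measured 4.1, 12.4, and 42.4 GB,
# with smaller models deviating more because of fixed runtime overhead.
```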
FP16 (half precision) and BF16 (bfloat16) are two 16-bit floating-point formats used in deep learning and high-performance computing to represent numerical values with reduced precision. Both formats halve memory use relative to FP32 and improve computation speed and energy efficiency. FP16 uses a 5-bit exponent and a 10-bit mantissa, while BF16 keeps the 8-bit exponent of FP32 with a 7-bit mantissa, trading some precision for the full FP32 dynamic range. BF16 is therefore often preferred for deep learning training and inference on hardware that supports it natively, such as tensor processing units and recent NVIDIA GPUs, including the H100.
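A minimal sketch of the difference, assuming PyTorch is available (torch.finfo reports the numeric limits of each dtype):

```python
# Compare the numeric range and precision of FP16 and BF16 (requires PyTorch).
import torch

for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  min normal={info.tiny:.3e}  "
          f"machine epsilon={info.eps:.3e}")

# float16 : max ~6.550e+04, eps ~9.77e-04 -> narrower range, finer precision
# bfloat16: max ~3.389e+38, eps ~7.81e-03 -> FP32-like range, coarser precision
```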
Using the Model Analyzer report, we can visualize how request latency changes with the number of concurrent client requests. The following figure shows a graph produced by Model Analyzer for the 20B model:
Figure 10. Model Analyzer report for 20B model
For every model, we identified the optimal number of concurrent requests: the highest concurrency that effectively uses the available resources without significantly degrading latency. For example, using the preceding graph, we concluded that a single instance of the 20B model can support a concurrency of 16. This information allows us to strike the right balance between computational efficiency and responsiveness for each of the NeMo models.
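For illustration only, the following sketch shows one simple way to pick such an operating point from sweep data; the 10 percent threshold and the example numbers are our own assumptions, not the criterion used by Model Analyzer or in this study.

```python
# Hypothetical helper for choosing an operating concurrency from sweep results.
# The 10 percent threshold is an illustrative assumption, not the criterion
# used by Model Analyzer or in this study.

def pick_concurrency(sweep, min_throughput_gain=0.10):
    """sweep: list of (concurrency, throughput_infer_per_sec) sorted by
    concurrency. Returns the last concurrency at which increasing the load
    still improved throughput by at least min_throughput_gain; beyond that
    point, extra requests mostly add queueing latency."""
    chosen, prev_throughput = sweep[0]
    for concurrency, throughput in sweep[1:]:
        if throughput < prev_throughput * (1 + min_throughput_gain):
            break
        chosen, prev_throughput = concurrency, throughput
    return chosen

# Example with made-up numbers shaped roughly like the 20B sweep in Figure 10:
sweep_20b = [(1, 2.1), (2, 4.0), (4, 7.6), (8, 12.9), (16, 14.5), (32, 14.7)]
print(pick_concurrency(sweep_20b))  # -> 16
```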
The following table summarizes the resource-use, latency, and throughput metrics for each model at its optimal number of concurrent requests:
Table 14. Resource use and model metrics for the optimal number of concurrent requests
| Metric | NeMo GPT 345M | NeMo GPT 1.3B | NeMo GPT 5B | NeMo GPT 20B |
|---|---|---|---|---|
| Request concurrency | 4 | 4 | 4 | 16 |
| p95 latency (ms) | 295.85 | 406.69 | 1135.52 | 1112.39 |
| Client response wait (ms) | 275.34 | 396.45 | 1101.13 | 1099.72 |
| Server queue (ms) | 206.29 | 297.04 | 824.52 | 49.58 |
| Throughput (inferences/sec) | 14.5 | 10.07 | 3.61 | 14.51 |
| Max server memory usage (GB) | 4.68 | 4.35 | 4.68 | 4.59 |
| Max GPU memory usage (GB) | 3.1 | 4.12 | 12.41 | 44.76 |
| Avg GPU utilization (%) | 98 | 99 | 99 | 93.5 |