Many enterprises elect to start with a pretrained model and either use it without modification or apply prompt engineering or P-tuning to adapt the model to a specific function. Designing with production deployment in mind is critical for LLMs because inference places a heavy demand on compute: depending on model size, larger models can require multiple 8x GPU systems to achieve second or subsecond response times. The minimum configuration for inferencing pretrained models starts with a single PowerEdge R760xa server with up to four NVIDIA H100 GPUs, or one PowerEdge XE9680 server with eight NVIDIA H100 GPUs, depending on model size and the number of model instances. The number of nodes can then scale out as needed for performance or capacity, though a minimum of two nodes is recommended for reliability.
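As a minimal illustration of the prompt-engineering path, the sketch below constructs a few-shot prompt and submits it to an already-deployed inference endpoint. The endpoint URL and the `text_input`/`text_output` field names are assumptions made for illustration and must be adapted to the serving stack in use.

```python
import requests  # assumes the deployed model is reachable over HTTP

# Hypothetical endpoint of the serving layer that hosts the pretrained model.
INFERENCE_URL = "http://inference-host:8000/v2/models/llm/generate"


def build_few_shot_prompt(review: str) -> str:
    """Prepend labeled examples so the unmodified pretrained model follows the task format."""
    examples = (
        "Classify the sentiment of each review as Positive or Negative.\n"
        "Review: The unit arrived on time and works great. Sentiment: Positive\n"
        "Review: The device failed after two days. Sentiment: Negative\n"
    )
    return f"{examples}Review: {review} Sentiment:"


def query_model(review: str) -> str:
    payload = {
        "text_input": build_few_shot_prompt(review),  # field name is an assumption
        "max_tokens": 8,
        "temperature": 0.0,  # deterministic output suits classification-style prompts
    }
    response = requests.post(INFERENCE_URL, json=payload, timeout=30)
    response.raise_for_status()
    return response.json().get("text_output", "")  # field name is an assumption


if __name__ == "__main__":
    print(query_model("Setup took minutes and support was responsive."))
```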
Design considerations for inferencing large models include:
- Large models have a correspondingly large memory footprint. While there is no strict boundary that defines a large model, for the sake of simplicity anything above 10B parameters can be considered a large model; a rough memory estimate is sketched after this list.
- When a model is split across GPUs, communication between the GPUs plays a crucial role in delivering optimum performance. NVIDIA Triton Inference Server with a multi-GPU deployment using the FasterTransformer backend can therefore be employed; a client-side sketch of this deployment pattern follows this list.
- For large models above 40B parameters, we recommend the PowerEdge XE9680 server. For model sizes less than 40B parameters, the PowerEdge R760xa server delivers excellent performance.
- The PowerSwitch Z9432F supports 32 ports of 400 GbE (QSFP56-DD optical transceivers) or up to 128 ports of 100 GbE. Inference does not require an InfiniBand fabric or high interconnect throughput; therefore, the deployment scales linearly for concurrency needs up to 32 nodes.
- Throughput (inferences per second) requirements might dictate deploying multiple GPUs, depending on workload needs; a simple way to measure throughput at a given concurrency is sketched after this list.
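To make the memory-footprint and sizing thresholds above concrete, the following sketch estimates the weight-only memory of a model from its parameter count, assuming FP16 precision (2 bytes per parameter). Runtime memory is higher once the KV cache and activations are included, so this is a lower bound used only for coarse sizing.

```python
def estimated_weight_memory_gb(parameters_in_billions: float, bytes_per_param: int = 2) -> float:
    """Weight-only memory estimate in GB; FP16 uses 2 bytes per parameter."""
    # 1e9 parameters x bytes-per-parameter / 1e9 bytes-per-GB cancels out.
    return parameters_in_billions * bytes_per_param


for size_b in (10, 40, 70, 175):
    print(f"{size_b}B parameters -> ~{estimated_weight_memory_gb(size_b):.0f} GB of FP16 weights")
```

By this estimate, a 40B-parameter model already consumes roughly 80 GB for weights alone, which matches the memory of a single NVIDIA H100 GPU before any runtime overhead and is consistent with the recommendation to move to the eight-GPU PowerEdge XE9680 server above that size.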
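The sketch below shows a client request to a FasterTransformer-backed model served by NVIDIA Triton Inference Server. The tensor-parallel split across GPUs is configured on the server side in the model's config.pbtxt and is transparent to the client. The model name, tensor names, and token IDs are assumptions for illustration and must match the deployed model's configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Assumed model name; the tensor-parallel degree (for example, 4 GPUs on an
# R760xa or 8 GPUs on an XE9680) is set server-side in the model's config.pbtxt.
MODEL_NAME = "gpt_fastertransformer"

client = httpclient.InferenceServerClient(url="triton-host:8000")

# Tokenized prompt; a real deployment tokenizes with the model's own tokenizer.
input_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.uint32)
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
request_output_len = np.array([[32]], dtype=np.uint32)  # tokens to generate

inputs = []
for name, data in [
    ("input_ids", input_ids),
    ("input_lengths", input_lengths),
    ("request_output_len", request_output_len),
]:
    tensor = httpclient.InferInput(name, list(data.shape), "UINT32")
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

outputs = [httpclient.InferRequestedOutput("output_ids")]
result = client.infer(MODEL_NAME, inputs, outputs=outputs)
print(result.as_numpy("output_ids"))
```

Because the GPU-to-GPU communication happens inside the server, changing the tensor-parallel degree affects only the server configuration, not the client code.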
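To relate throughput requirements to GPU count, a simple load test such as the one sketched below can measure inferences per second at a given client concurrency. It reuses the same hypothetical endpoint and field names as the prompt-engineering sketch above; Triton also ships a purpose-built tool, perf_analyzer, for more thorough measurements of this kind.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Same hypothetical generate endpoint used in the prompt-engineering sketch above.
INFERENCE_URL = "http://inference-host:8000/v2/models/llm/generate"


def send_request(prompt: str) -> None:
    """Submit one generation request; field names are assumptions, as before."""
    requests.post(
        INFERENCE_URL,
        json={"text_input": prompt, "max_tokens": 64},
        timeout=60,
    ).raise_for_status()


def measure_throughput(prompts: list[str], concurrency: int) -> float:
    """Return observed inferences per second at the given client concurrency."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(send_request, prompts))
    elapsed = time.perf_counter() - start
    return len(prompts) / elapsed


if __name__ == "__main__":
    # Replay 256 canned prompts at a client concurrency of 16.
    print(f"{measure_throughput(['What is the warranty period?'] * 256, 16):.1f} inferences/s")
```

If the measured rate falls short of the target, additional GPUs can be added, either as more Triton model instances on a node or as additional nodes, until the requirement is met.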