Large model inferencing
Many enterprises elect to start with a pretrained model, either using it without modification or applying prompt engineering or P-tuning to adapt it to a specific function. Designing with production deployment in mind is critical for LLMs because of their heavy compute demands: depending on model size, larger models can require multiple 8-GPU systems to achieve second-level or subsecond response times. The minimum configuration for inferencing pretrained models starts with a single PowerEdge R760xa server with up to four NVIDIA H100 GPUs, or one PowerEdge XE9680 server with eight NVIDIA H100 GPUs, depending on model size and the number of model instances. The deployment can then scale out across additional nodes as performance or capacity needs grow, though a minimum of two nodes is recommended for high availability. A rough sizing sketch follows below.
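To make the sizing claim concrete, the sketch below estimates how many 80 GB GPUs (the HBM capacity of an NVIDIA H100) a single inference instance needs once FP16 weights and the KV cache are accounted for. This is a minimal back-of-the-envelope model, not a Dell sizing tool; the 70B-parameter example, its layer and attention-head geometry, and the 20 percent overhead factor are all illustrative assumptions.

```python
import math

def gpus_needed(params_b: float, layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2,
                gpu_mem_gb: float = 80.0, overhead: float = 1.2) -> int:
    """Rough GPU count for one inference instance of a transformer LLM."""
    # FP16/BF16 weights: (params_b * 1e9 params) * 2 bytes / 1e9 bytes-per-GB
    weights_gb = params_b * bytes_per_elem
    # KV cache: 2 tensors (K and V) per layer, per KV head, per cached token
    kv_gb = (2 * layers * kv_heads * head_dim * bytes_per_elem
             * seq_len * batch) / 1e9
    # Assumed 20% headroom for activations, buffers, and fragmentation
    total_gb = (weights_gb + kv_gb) * overhead
    return math.ceil(total_gb / gpu_mem_gb)

# Hypothetical 70B-parameter model with grouped-query attention
# (80 layers, 8 KV heads, head dimension 128), 4K context, batch of 16:
print(gpus_needed(70, layers=80, kv_heads=8, head_dim=128,
                  seq_len=4096, batch=16))   # -> 3
```

Even under these favorable assumptions, a 70B-class model exceeds a single GPU's memory on weights alone, and models in the hundreds of billions of parameters push the estimate toward an 8-GPU system such as the XE9680, which is why the configurations above are stated as minimums.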
Design considerations for inferencing large models include: