Inferencing
Inferencing in generative AI is the process of using a trained model to generate predictions, make decisions, or produce outputs based on specific input data and contexts. The model applies the patterns it acquired during training to produce new content in response to each input.
In the context of LLMs, inference is the process of generating new text or making predictions with the trained model. Given an input (often called a prompt), the LLM generates the continuation that is statistically most likely based on its training. This might involve completing a sentence, writing a paragraph, or generating an entire story. The quality of inference depends on the richness of the training data and the complexity of the model. Inference is a crucial step in using LLMs for practical applications such as chatbots, text summarization, translation, and more.
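As a minimal sketch of this prompt-to-continuation flow, the snippet below uses PyTorch and the Hugging Face Transformers library to generate text. The checkpoint name ("gpt2") and the prompt are placeholders; any causal language model used in your deployment could be substituted.

```python
# Minimal sketch of LLM inference: a prompt goes in, a likely continuation comes out.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the checkpoint used in your deployment
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype).to(device)
model.eval()

prompt = "Generative AI in the enterprise enables"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# generate() repeatedly predicts the next token that is most likely given the
# prompt and the tokens generated so far, then appends it and continues.
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

On AMD Instinct accelerators, the ROCm build of PyTorch exposes the GPU through the same torch.cuda interface, so the device selection above works unchanged.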
Inference in LLMs primarily requires substantial computational resources, particularly processing power and memory. Processing power, typically provided by GPUs, is needed to perform the large matrix multiplications in the model's many layers, while GPU memory is needed to hold the model's parameters during inference. Allocating insufficient GPU compute or memory creates an inferencing bottleneck. However, compared to training or fine-tuning, the resource requirements for inference are lower because inference involves only a forward pass through the model, with no backpropagation or parameter updates. A single inferencing task runs on a single server and does not typically require high-speed inter-node networking.
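The following sketch, again assuming PyTorch and Transformers with a placeholder checkpoint, illustrates both points: the forward pass runs under torch.inference_mode() so no gradients or optimizer state are kept, and a simple parameter count gives a lower bound on the GPU memory needed to hold the weights.

```python
# Sketch: inference needs only a forward pass, so autograd is disabled and no
# optimizer state is allocated; GPU memory is dominated by the model weights
# (plus a KV cache that grows with sequence length).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()

# Rough lower bound on memory for the weights: parameter count x bytes per parameter.
num_params = sum(p.numel() for p in model.parameters())
bytes_per_param = next(model.parameters()).element_size()
print(f"~{num_params * bytes_per_param / 1e9:.2f} GB for the weights alone")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(device)
with torch.inference_mode():          # forward pass only: no graph, no backpropagation
    logits = model(**inputs).logits   # no parameters are updated
next_token_id = int(logits[0, -1].argmax())
print("Most likely next token:", tokenizer.decode(next_token_id))
```

For example, a 7-billion-parameter model stored in FP16 needs roughly 14 GB for its weights alone, before accounting for the KV cache and activations, which is why GPU memory capacity largely determines which models fit on a single accelerator.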