NVIDIA AI Enterprise and LLM operations
Large language model operations (LLMOps) refers to the specialized practices and workflows that accelerate the development, deployment, and management of LLMs throughout their life cycle. These operations include data preprocessing, language model training, monitoring, fine-tuning, and deployment. Like machine learning operations (MLOps), LLMOps enables collaboration among data scientists, AI developers, DevOps engineers, and IT professionals.
NVIDIA AI Enterprise provides enterprise support for the software frameworks, toolkits, workflows, and models used for inferencing. For this design, we used several NVIDIA capabilities in support of LLMOps, as described below.
In addition to providing enterprise support for components such as Triton Inference Server and TensorRT-LLM, NVIDIA AI Enterprise is required for Audio2Face, an NVIDIA Inference Microservice (NIM) that is part of the UneeQ software implementation and is used to animate 3D character facial movements to match the audio input.
NVIDIA Triton Inference Server (also known as Triton) is inference-serving software that standardizes AI model deployment and execution and delivers fast, scalable AI in production. Enterprise support for Triton is available through NVIDIA AI Enterprise; alternatively, Triton is available as open-source software.
Triton streamlines and standardizes AI inference by enabling teams to deploy, run, and scale trained machine learning or deep learning models from any framework on any GPU- or CPU-based infrastructure. It provides AI researchers and data scientists the freedom to choose the appropriate framework for their projects without impacting production deployment. It also helps developers deliver high-performance inference across cloud, on-premises, edge, and embedded devices.
These capabilities are the key benefits of Triton for AI inferencing: streamlined deployment, framework flexibility, and consistent performance across cloud, on-premises, edge, and embedded platforms.
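Because Triton exposes the same client API regardless of the framework that produced the model, an inference request looks the same whether the back end is TensorFlow, PyTorch, ONNX Runtime, or a TensorRT engine. The following is a minimal sketch using the open-source tritonclient package against a server on its default HTTP port; the model name and tensor names (my_model, INPUT0, OUTPUT0) are placeholders for whatever the deployed model's configuration defines.

```python
# Minimal sketch of a Triton client call, assuming a server at localhost:8000
# hosting a hypothetical model "my_model" with one FP32 input "INPUT0" and
# one FP32 output "OUTPUT0"; names and shapes are illustrative only.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request: a single batch of four float values.
input_data = np.array([[0.1, 0.2, 0.3, 0.4]], dtype=np.float32)
infer_input = httpclient.InferInput("INPUT0", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data)

# Send the request and read the named output tensor back as a NumPy array.
response = client.infer(model_name="my_model", inputs=[infer_input])
print(response.as_numpy("OUTPUT0"))
```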
Triton Inference Server hosts the generative AI model, which in our case is Llama 3. Together with its TensorRT-LLM and NeMo framework integrations, Triton provides the infrastructure for deploying generative AI models.
NVIDIA recently unveiled TensorRT-LLM as a new backend for hosting LLMs. It encompasses the TensorRT deep learning compiler, featuring optimized kernels, preprocessing and postprocessing workflows, and specialized multi-GPU and multinode communication components, all of which contribute to remarkable performance gains on NVIDIA GPUs. TensorRT-LLM empowers developers to explore novel LLMs, providing both peak performance and rapid customization options. For additional insights and in-depth information, see NVIDIA’s Technical Blog.
NVIDIA offers the TensorRT-LLM API to define LLMs and build TensorRT engines that perform inference on NVIDIA GPUs. It includes a backend for integration with the NVIDIA Triton Inference Server to serve the LLMs. Models built with TensorRT-LLM can run on a wide range of configurations, from a single GPU to multiple nodes with multiple GPUs, using tensor parallelism and pipeline parallelism.
The Python API of TensorRT-LLM is architected to be similar to the PyTorch API. It provides a functional module and a layers module for assembling LLMs. To maximize performance and reduce the memory footprint, models can be run with different pretraining and post-training quantization options from the NVIDIA Algorithmic Model Optimization (AMMO) toolkit. The first step in creating an inference solution is to either define your own model or select a predefined network architecture from the list of models supported by TensorRT-LLM. The model must then be trained with a framework such as NVIDIA NeMo or PyTorch, because TensorRT-LLM does not perform model training. For predefined models, checkpoints can be downloaded from various providers; for example, TensorRT-LLM can use model weights obtained from the Hugging Face Hub.
Equipped with the model definition and the weights, the TensorRT-LLM Python API is used to re-create the model in a format that TensorRT can compile into an efficient engine. TensorRT-LLM also provides the components to create a runtime that executes the TensorRT engine, and the Triton backend for TensorRT-LLM serves these models through Triton Inference Server.
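The sketch below condenses this build-and-run flow using the high-level LLM API available in recent TensorRT-LLM releases: it pulls Llama 3 weights from the Hugging Face Hub, compiles a TensorRT engine, and runs a test generation before the engine is handed to Triton. The model identifier, parallelism setting, and sampling values are illustrative assumptions, and the exact API surface varies between TensorRT-LLM versions.

```python
# Hedged sketch of the TensorRT-LLM high-level LLM API (recent releases).
# The model ID, tensor_parallel_size, and sampling values are illustrative;
# check the TensorRT-LLM documentation for your installed version.
from tensorrt_llm import LLM, SamplingParams

# Download the checkpoint from the Hugging Face Hub and compile a TensorRT
# engine for it; tensor_parallel_size splits the model across two GPUs.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=2)

sampling = SamplingParams(max_tokens=128, temperature=0.7)

# Run a quick generation to validate the engine before serving it with Triton.
for output in llm.generate(["What does Triton Inference Server do?"], sampling):
    print(output.outputs[0].text)
```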
The final optimized model is ready for production deployment on a PowerEdge R760xa server equipped with NVIDIA GPUs using Triton Inference Server. It can be accessed through an API endpoint using the HTTP or gRPC protocols. Triton Inference Server also exposes health and performance metrics for the model in production, which can be collected and visualized with Prometheus and Grafana (see Other software components). These monitoring tools provide valuable insight into the model's performance and overall system health during deployment.
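As a sketch of that access pattern, the snippet below posts a prompt to Triton's HTTP generate endpoint and then reads the Prometheus-format metrics the server publishes. The host name, the ensemble model name, and the default ports (8000 for HTTP, 8002 for metrics) are assumptions based on a typical TensorRT-LLM backend deployment and may differ in your environment.

```python
# Minimal sketch, assuming Triton's default ports and the "ensemble" model
# name used in TensorRT-LLM backend examples; adjust for the actual deployment.
import requests

TRITON_HOST = "r760xa-inference.example.local"  # hypothetical server name

# Triton's generate extension: POST a prompt, receive the completion as JSON.
response = requests.post(
    f"http://{TRITON_HOST}:8000/v2/models/ensemble/generate",
    json={"text_input": "Summarize our return policy.", "max_tokens": 200},
)
print(response.json()["text_output"])

# The same server exposes Prometheus-format metrics that Grafana can chart.
metrics = requests.get(f"http://{TRITON_HOST}:8002/metrics").text
for line in metrics.splitlines():
    if line.startswith("nv_inference_request_success"):
        print(line)
```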
The NVIDIA GPU Operator automates the life cycle management of the software required to expose GPUs on Kubernetes. It enables advanced functionality, including better GPU performance, utilization, and telemetry. The NVIDIA GPU Operator uses the Kubernetes operator framework to automate the management of all NVIDIA software components needed to provision the GPU.
These components include the drivers to enable NVIDIA’s Compute Unified Device Architecture (CUDA), the Kubernetes device plug-in for GPUs, the NVIDIA Container Toolkit, automatic node labeling using GPU Feature Discovery (GFD), and Data Center GPU Manager (DCGM) monitoring, among others.
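One practical effect of GFD is that GPU attributes appear as node labels that schedulers and administrators can inspect. The following sketch uses the Kubernetes Python client to print those labels for each node; it assumes cluster access through a local kubeconfig and that labels follow the nvidia.com/ prefix convention that GFD uses.

```python
# Minimal sketch, assuming kubectl-style access via a local kubeconfig and a
# cluster where the GPU Operator and GPU Feature Discovery are installed.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# GFD publishes GPU attributes (product, memory, count, ...) as node labels
# under the nvidia.com/ prefix; print them for every node in the cluster.
for node in v1.list_node().items:
    gpu_labels = {k: v for k, v in node.metadata.labels.items()
                  if k.startswith("nvidia.com/")}
    print(node.metadata.name, gpu_labels)
```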