NVIDIA AI Enterprise provides enterprise support for various software frameworks, toolkits, workflows, and models that support inferencing. See the NVIDIA AI Enterprise documentation for more information about all components available with NVIDIA AI Enterprise. The components incorporated in this validated design that are described below are available as part of NVIDIA AI Enterprise.
The following sections describe the key software components and how they are used in this design. For more information about how these key components work together, see Figure 4 in Software architecture.
NVIDIA Triton Inference Server (also known as Triton) is inference serving software that standardizes AI model deployment and execution and delivers fast and scalable AI in production. Enterprise support for Triton is available through NVIDIA AI Enterprise. It is also available as open-source software.
Triton streamlines and standardizes AI inference by enabling teams to deploy, run, and scale trained machine learning or deep learning models from any framework on any GPU- or CPU-based infrastructure. It provides AI researchers and data scientists the freedom to choose the appropriate framework for their projects without impacting production deployment. It also helps developers deliver high-performance inference across cloud, on-premises, edge, and embedded devices.
The benefits of Triton for AI inferencing include the following:
Triton Inference Server is at the core of this validated design. It is the software that hosts the generative AI models. Triton Inference Server, together with its integration with Model Analyzer, FasterTransformer, and the NeMo framework, provides an ideal software stack for deploying generative AI models.
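As a minimal illustration of how a client application interacts with Triton, the following Python sketch sends an inference request using the tritonclient package. The model name, tensor names, shape, and data type are hypothetical placeholders; the actual values come from the model's configuration in the Triton model repository.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton Inference Server instance (assumed to listen on the
# default HTTP port 8000 on the local host).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical model and tensor names; replace them with the names defined
# in the model's config.pbtxt in your Triton model repository.
model_name = "my_model"
input_data = np.random.rand(1, 16).astype(np.float32)

inputs = [httpclient.InferInput("INPUT0", list(input_data.shape), "FP32")]
inputs[0].set_data_from_numpy(input_data)
outputs = [httpclient.InferRequestedOutput("OUTPUT0")]

# Send the request and read the result back as a NumPy array.
response = client.infer(model_name, inputs, outputs=outputs)
print(response.as_numpy("OUTPUT0"))
```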
In NLP, the encoder and decoder are crucial components, and the transformer layer has gained popularity as an architecture for both. FasterTransformer from NVIDIA offers a highly optimized transformer layer for both the encoder and decoder, specifically designed for efficient inference.
When running on NVIDIA GPUs, FasterTransformer automatically uses the computational power of Tensor Cores, especially when the data and weights are represented in FP16 precision, enabling faster computations.
FasterTransformer is built using CUDA, cuBLAS, cuBLASLt, and C++. It provides convenient APIs for popular deep learning frameworks such as TensorFlow and PyTorch, as well as a backend for Triton Inference Server. These APIs allow users to seamlessly integrate FasterTransformer into their existing workflows.
In this validated design, we use FasterTransformer for inferencing NVIDIA NeMo models.
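To show how inference requests reach a FasterTransformer-backed model through Triton, the sketch below sends a tokenized prompt to a hypothetical model named fastertransformer. The input and output tensor names and the UINT32 data type follow common FasterTransformer backend conventions, but the exact names, types, and required parameters depend on the deployed model configuration.

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Token IDs for the prompt would normally come from the model's tokenizer;
# the values below are placeholders.
input_ids = np.array([[50256, 15496, 11, 995]], dtype=np.uint32)
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
request_output_len = np.array([[32]], dtype=np.uint32)  # tokens to generate

inputs = []
for name, data in [("input_ids", input_ids),
                   ("input_lengths", input_lengths),
                   ("request_output_len", request_output_len)]:
    tensor = httpclient.InferInput(name, list(data.shape), "UINT32")
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

# "fastertransformer" is a placeholder model name; adjust to your deployment.
response = client.infer("fastertransformer", inputs)
output_ids = response.as_numpy("output_ids")
print(output_ids)  # decode with the tokenizer to recover the generated text
```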
NVIDIA recently unveiled TensorRT-LLM as a new backend for hosting LLMs. It encompasses the TensorRT deep learning compiler, featuring optimized kernels, preprocessing and postprocessing workflows, and specialized multi-GPU and multinode communication components, all of which contribute to remarkable performance gains on NVIDIA GPUs. TensorRT-LLM empowers developers to explore novel LLMs, providing both peak performance and rapid customization options. For additional insights and in-depth information, see NVIDIA’s Technical Blog. At the time of writing, TensorRT-LLM is in an early access phase, and this white paper will be updated when it becomes generally available.
The Triton Inference Server offers a powerful solution for deploying AI models. However, each deployment presents its unique challenges, such as meeting latency targets, working with limited hardware resources, and accommodating various model requirements. To address these complexities, the Model Analyzer provides insight for planning and decision making.
The Model Analyzer enables users to send requests to their models while monitoring GPU memory and compute utilization. This tool provides a deep understanding of an AI model's GPU memory requirements under different batch sizes and instance configurations. Using this information, users can make informed decisions about efficiently combining multiple models on a single GPU, ensuring optimal memory usage without exceeding capacity.
The Model Analyzer is a command-line interface (CLI) that significantly enhances comprehension of Triton Inference Server model compute and memory demands. It provides this awareness by conducting customizable configuration “sweeps” and generating comprehensive reports summarizing performance metrics.
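As an example of how such a sweep might be launched, the following sketch wraps a typical Model Analyzer profile invocation in Python. The model repository path, model name, and output path are placeholders, and the set of available flags depends on the Model Analyzer version in use.

```python
import subprocess

# Hypothetical paths and model name; substitute the values for your
# environment. Model Analyzer launches Triton, sweeps configurations such
# as batch sizes and instance counts, and records GPU memory and compute
# metrics for each configuration.
cmd = [
    "model-analyzer", "profile",
    "--model-repository", "/models",
    "--profile-models", "my_llm_model",
    "--output-model-repository-path", "/tmp/output_models",
    "--override-output-model-repository",
]
subprocess.run(cmd, check=True)
```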
With the Model Analyzer, you can:
By using the insights provided by the Model Analyzer, you can make better-informed decisions and optimize your AI model deployments for peak performance and efficiency.
In this validated design, we use the Model Analyzer to generate load and monitor the performance of LLM inference. We use Prometheus and Grafana to gather and visualize the performance metrics. For the results of our validation, which uses Model Analyzer, see Chapter 5. For our sizing guidelines based on the Model Analyzer reports, see Chapter 6.
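As a simple illustration of the metrics pipeline, the sketch below scrapes Triton's Prometheus-format metrics endpoint directly; this is the same endpoint that Prometheus scrapes and Grafana visualizes in this design. The port and metric names shown are Triton defaults and may differ in other deployments.

```python
import requests

# Triton exposes Prometheus-format metrics on port 8002 by default.
metrics = requests.get("http://localhost:8002/metrics", timeout=5).text

# Print a few inference- and GPU-related counters and gauges of interest.
for line in metrics.splitlines():
    if line.startswith(("nv_inference_request_success",
                        "nv_gpu_utilization",
                        "nv_gpu_memory_used_bytes")):
        print(line)
```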
NVIDIA supports various generative AI models that can be used for inferencing or as the basis for model customization, such as transfer learning.
In this validated design, we used open-source NeMo GPT models to demonstrate LLM inference in our solutions. We also used the NeMo toolkit, available in the NGC Catalog, to deploy those models. Our validation efforts were performed with the NVIDIA NeMo Docker container.
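For illustration only, the sketch below shows one way the NeMo container might be started from Python with GPU access and the default Triton ports exposed. The image tag, volume path, and the command run inside the container are placeholders that should be taken from the NGC Catalog and your deployment plan.

```python
import subprocess

# Hypothetical image tag and host paths; consult the NGC Catalog for the
# current NVIDIA NeMo container tag. The serving command to run inside the
# container is omitted here and depends on the model being deployed.
cmd = [
    "docker", "run", "--rm", "--gpus", "all",
    "-v", "/path/to/models:/models",
    "-p", "8000:8000", "-p", "8001:8001", "-p", "8002:8002",
    "nvcr.io/nvidia/nemo:<tag>",
]
subprocess.run(cmd, check=True)
```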
NVIDIA Base Command Manager Essentials facilitates seamless operationalization of AI development at scale by providing features like operating system provisioning, firmware upgrades, network and storage configuration, multi-GPU and multinode job scheduling, and system monitoring. It maximizes the use and performance of the underlying hardware architecture.
In this validated design, we use NVIDIA Base Command Manager Essentials for: