NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and data analytics software that accelerates the data science pipeline and streamlines the development and deployment of production AI, including generative AI, computer vision, speech AI, and more. This secure, stable platform includes over 100 frameworks, pretrained models, and tools that accelerate data processing, simplify model training and optimization, and streamline deployment.
NVIDIA AI Enterprise provides enterprise support for various software frameworks, toolkits, workflows, and models. See the NVIDIA AI Enterprise documentation for more information about all components available with NVIDIA AI Enterprise. The following components incorporated in this validated design are available as part of NVIDIA AI Enterprise:
NVIDIA Triton Inference Server (also known as Triton) is inference serving software that standardizes AI model deployment and execution and delivers fast and scalable AI in production. Enterprise support for Triton is available through NVIDIA AI Enterprise. Triton is also available as open-source software.
Triton streamlines and standardizes AI inference by enabling teams to deploy, run, and scale trained machine learning or deep learning models from any framework on any GPU- or CPU-based infrastructure. It provides AI researchers and data scientists the freedom to choose the appropriate framework for their projects without impacting production deployment. It also helps developers deliver high-performance inference across cloud, on-premises, edge, and embedded devices.
The benefits of Triton for AI inferencing include the following:
Triton Inference Server is at the core of this validated design. It is the software that hosts the generative AI models. Triton Inference Server, together with its integration with Model Analyzer, FasterTransformer, and the NeMo framework, provides an ideal software stack for deploying generative AI models.
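As an illustration of this request path, the following minimal sketch shows how an application might submit an inference request to a model hosted on Triton using the Python HTTP client. The server address, model name (my_model), and tensor names (INPUT0, OUTPUT0) are placeholders and must match the model's configuration in the Triton model repository.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server; 8000 is Triton's default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a request for a placeholder model with one FP32 input tensor.
data = np.random.rand(1, 16).astype(np.float32)
inputs = [httpclient.InferInput("INPUT0", list(data.shape), "FP32")]
inputs[0].set_data_from_numpy(data)
outputs = [httpclient.InferRequestedOutput("OUTPUT0")]

# Submit the request and read back the named output tensor.
result = client.infer(model_name="my_model", inputs=inputs, outputs=outputs)
print(result.as_numpy("OUTPUT0"))
```

Because Triton resolves the serving backend from the model's configuration, the same client pattern applies whether the model was exported from TensorFlow, PyTorch, ONNX, or another supported framework.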
In NLP, the encoder and decoder are crucial components, and the transformer layer has gained popularity as an architecture for both. FasterTransformer from NVIDIA offers a highly optimized transformer layer for both the encoder and decoder, specifically designed for efficient inference.
When running on NVIDIA GPUs, FasterTransformer automatically uses the computational power of Tensor Cores, especially when the data and weights are represented in FP16 precision, enabling faster computations.
FasterTransformer is built using CUDA, cuBLAS, cuBLASLt, and C++. It provides convenient APIs for popular deep learning frameworks like TensorFlow, PyTorch, and Triton backend. These APIs allow users to seamlessly integrate FasterTransformer into their existing workflows using these frameworks.
In this validated design, we use FasterTransformer to run inference on NVIDIA NeMo models.
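The snippet below illustrates the precision principle only; it is not FasterTransformer's API. It runs a standard PyTorch transformer layer in FP16 so that its matrix multiplications execute on Tensor Cores, which is the same strategy FasterTransformer applies with fused, hand-tuned CUDA kernels. The layer dimensions are arbitrary, and a CUDA-capable GPU is required.

```python
import torch

# Standard PyTorch encoder layer, cast to FP16 and moved to the GPU so that
# its matrix multiplications run on Tensor Cores.
layer = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
layer = layer.cuda().half().eval()

# A batch of 8 sequences, 128 tokens each, with a 1024-dimensional hidden size.
tokens = torch.randn(8, 128, 1024, device="cuda", dtype=torch.float16)

with torch.no_grad():
    out = layer(tokens)
print(out.shape, out.dtype)  # torch.Size([8, 128, 1024]) torch.float16
```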
The Triton Inference Server offers a powerful solution for deploying AI models. However, each deployment presents its unique challenges, such as meeting latency targets, working with limited hardware resources, and accommodating various model requirements. To address these complexities, the Model Analyzer provides insight for planning and decision making.
The Model Analyzer enables users to send requests to their models while monitoring GPU memory and compute utilization. It provides a deep understanding of an AI model’s GPU memory requirements under different batch sizes and instance configurations. Using this information, users can make informed decisions about efficiently combining multiple models on a single GPU, ensuring optimal memory usage without exceeding capacity.
The Model Analyzer is a command-line interface (CLI) that significantly enhances comprehension of Triton Inference Server model compute and memory demands. It provides this awareness by conducting customizable configuration “sweeps” and generating comprehensive reports summarizing performance metrics.
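As a minimal sketch of such a sweep, the command below profiles a single model. It assumes the model-analyzer CLI is installed alongside Triton, that a model repository exists at /models, and that the repository contains a model named gpt_nemo (a hypothetical name); the output repository path is also a placeholder.

```python
import subprocess

# Sweep the hypothetical "gpt_nemo" model across Model Analyzer's default
# search space (batch sizes, instance counts) and record GPU memory and
# compute metrics for each configuration.
subprocess.run(
    [
        "model-analyzer", "profile",
        "--model-repository", "/models",
        "--profile-models", "gpt_nemo",
        "--output-model-repository-path", "/tmp/output_models",
    ],
    check=True,
)
```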
With the Model Analyzer, you can:
By using the insights provided by the Model Analyzer, you can make better-informed decisions and optimize your AI model deployments for peak performance and efficiency.
In this validated design, we use the Model Analyzer to generate load and monitor the performance of LLM inference. We use Prometheus and Grafana to gather and visualize the performance metrics. For the results of our validation, which uses the Model Analyzer, see Chapter 5. For our sizing guidelines based on the Model Analyzer reports, see Chapter 6.
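As a small sketch of the metrics path, assuming a Triton server running locally with its metrics service on the default port 8002, the snippet below reads the Prometheus-format endpoint that Prometheus scrapes and Grafana visualizes; the metric shown is one of Triton's standard inference counters.

```python
import requests

# Fetch Triton's Prometheus-format metrics (default metrics port is 8002).
metrics = requests.get("http://localhost:8002/metrics", timeout=5).text

# Print the successful-inference counters; Prometheus scrapes this same
# endpoint on an interval and Grafana charts the stored series.
for line in metrics.splitlines():
    if line.startswith("nv_inference_request_success"):
        print(line)
```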
NVIDIA NeMo is a framework to build, customize, and deploy generative AI models with billions of parameters. The NeMo framework provides an accelerated workflow for training with 3D parallelism techniques. It offers a choice of several customization techniques and is optimized for at-scale inference of large-scale models for language and image applications, with multi-GPU and multinode configurations. The NeMo framework makes generative AI model development easy, cost-effective, and fast for enterprises.
Figure 3. NVIDIA NeMo framework
Various generative AI models are available that can be used for inference or as a starting point for model customization through transfer learning.
In this validated design, we used open-source NeMo GPT models to demonstrate LLM inference in our solutions. We also used the NeMo toolkit, available in the NGC Catalog, to deploy those models. Our validation was performed with the NVIDIA NeMo Docker container.
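The sketch below shows one way to pull and smoke-test that container, assuming Docker with the NVIDIA Container Toolkit and access to the NGC Catalog; the image tag is a placeholder to be replaced with the tag listed in the NGC Catalog.

```python
import subprocess

# Placeholder tag: substitute the NeMo framework container tag from the NGC Catalog.
image = "nvcr.io/nvidia/nemo:<tag>"

# Pull the container and verify that the NeMo toolkit inside it imports cleanly.
subprocess.run(["docker", "pull", image], check=True)
subprocess.run(
    ["docker", "run", "--gpus", "all", "--rm", image,
     "python", "-c", "import nemo; print(nemo.__version__)"],
    check=True,
)
```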
NVIDIA offers cluster management software for AI infrastructure that facilitates seamless operationalization of AI development at scale by providing features like operating system provisioning, firmware upgrades, network and storage configuration, multi-GPU and multinode job scheduling, and system monitoring. It maximizes the use and performance of the underlying hardware architecture.
In this validated design, we use the NVIDIA cluster manager capabilities for: