NVIDIA AI Enterprise provides enterprise support for various software frameworks, toolkits, and workflows. See the NVIDIA AI Enterprise documentation for more information about all components available with NVIDIA AI Enterprise. The following components of NVIDIA AI Enterprise are incorporated in this validated design:
The following figure shows a block diagram of the NVIDIA AI Enterprise end-to-end software suite:
Figure 1. NVIDIA AI Enterprise
The following sections provide descriptions of each component and how they are used for model customization.
NVIDIA NeMo is a framework to build, customize, and deploy generative AI models with billions of parameters. The NeMo framework provides an accelerated workflow for training with 3D parallelism techniques. It offers a choice of several customization techniques and is optimized for at-scale inference of large-scale models for language and image applications, with multi-GPU and multinode configurations. The NeMo framework makes generative AI model development easy, cost-effective, and fast for enterprises.
Figure 2. NVIDIA NeMo Framework
Key components of the NeMo Framework include:
The NeMo framework is at the core of this validated design. The framework is made available as two containers through NVIDIA's NGC catalog: one for training and model customization and another for inference. In this validated design, we used the NeMo framework's model customization recipes, running across multiple nodes, to customize Llama 2 models that were first converted to the NeMo format with conversion scripts provided in the framework. The customized models were then deployed to production using the NeMo inference container with Triton Inference Server. See the NeMo User Guide for more information.
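As a rough illustration of this workflow, the sketch below converts a Hugging Face Llama 2 checkpoint to the .nemo format and launches a LoRA-style customization run. The script paths, flags, and Hydra overrides shown here are assumptions that vary between NeMo framework container releases; consult the NeMo User Guide for the exact recipe for the container in use.

```python
# Illustrative sketch only: script locations, flag names, and config overrides
# are assumptions and differ between NeMo framework container versions.
import subprocess

NEMO_ROOT = "/opt/NeMo"  # typical source location inside the NeMo training container (assumption)

# Step 1: convert a Hugging Face Llama 2 checkpoint to the .nemo format.
subprocess.run(
    [
        "python",
        f"{NEMO_ROOT}/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py",
        "--input_name_or_path", "/workspace/models/Llama-2-7b-hf",
        "--output_path", "/workspace/models/llama2-7b.nemo",
    ],
    check=True,
)

# Step 2: launch a LoRA customization job. In this validated design the job is
# scheduled across multiple nodes by the cluster manager; it is shown here as a
# direct invocation for clarity.
subprocess.run(
    [
        "python",
        f"{NEMO_ROOT}/examples/nlp/language_modeling/tuning/megatron_gpt_peft_tuning.py",
        "trainer.num_nodes=2",
        "trainer.devices=8",
        "model.restore_from_path=/workspace/models/llama2-7b.nemo",
        "model.peft.peft_scheme=lora",
        "model.data.train_ds.file_names=[/workspace/data/train.jsonl]",
        "model.data.validation_ds.file_names=[/workspace/data/val.jsonl]",
    ],
    check=True,
)
```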
NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes inference performance of the latest LLMs on NVIDIA GPUs. It lets developers experiment with new LLMs, offering speed-of-light performance and quick customization without deep knowledge of C++ or CUDA.
TensorRT-LLM wraps a deep learning compiler—which includes optimized kernels from FasterTransformer, pre- and postprocessing, and multi-GPU and multinode communication—in a simple open-source Python API for defining, optimizing, and running LLMs for inference in production.
TensorRT-LLM is integrated with the NeMo framework and is used in this validated design to deploy the customized Llama 2 models.
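To show how this integration might be exercised, the sketch below builds a TensorRT-LLM engine from a customized .nemo checkpoint using the export helper shipped in the NeMo framework inference container. The module path, class name, and argument names are assumptions and change between releases; the NeMo User Guide documents the exact API for each container version.

```python
# Minimal sketch, assuming the NeMo inference container exposes a TensorRT-LLM
# export helper under nemo.export; names and signatures are assumptions.
from nemo.export import TensorRTLLM

# Build a TensorRT-LLM engine from the customized Llama 2 checkpoint.
exporter = TensorRTLLM(model_dir="/workspace/trt_engines/llama2-7b")  # engine output directory (assumption)
exporter.export(
    nemo_checkpoint_path="/workspace/models/llama2-7b-customized.nemo",  # output of the customization step
    model_type="llama",   # target architecture for the engine build (assumption)
    n_gpus=1,             # number of GPUs for tensor parallelism (assumption)
)
```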
NVIDIA Triton Inference Server is inference serving software that standardizes AI model deployment and execution and delivers fast and scalable AI in production. Enterprise support for Triton Inference Server is available with NVIDIA AI Enterprise.
Triton Inference Server streamlines and standardizes AI inference of pretrained and customized models by enabling teams to deploy, run, and scale trained machine learning (ML) or deep learning (DL) models from any framework on any GPU- or CPU-based infrastructure. It provides AI researchers and data scientists the freedom to choose the appropriate framework for their projects without impacting production deployment. It also helps developers deliver high-performance inference across cloud, on-premises, edge, and embedded devices.
Triton Inference Server is integrated with the NeMo framework, making it well suited for deploying generative AI models. In this validated design, we use Triton Inference Server with the NeMo inference container to deploy the customized Llama 2 models.
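Once the customized model is serving behind Triton, client applications query it over Triton's standard HTTP or gRPC endpoints. The following minimal sketch uses the tritonclient Python package; the model name (llama2-customized) and the "prompts"/"outputs" tensor names and shapes are assumptions and must match the model configuration generated at deployment time.

```python
# Minimal sketch of querying a Triton-served model with the tritonclient package.
# The model name and the input/output tensor names are assumptions; they must
# match the deployed model's configuration.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
assert client.is_server_ready()

# Text inputs are typically passed to Triton as BYTES tensors.
prompt = np.array([["Summarize the key benefits of the validated design."]], dtype=object)
infer_input = httpclient.InferInput("prompts", prompt.shape, "BYTES")
infer_input.set_data_from_numpy(prompt)

result = client.infer(model_name="llama2-customized", inputs=[infer_input])
print(result.as_numpy("outputs"))
```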
NVIDIA Base Command Manager Essentials facilitates seamless operationalization of AI development at scale by providing features like operating system provisioning, firmware upgrades, network and storage configuration, multi-GPU and multinode job scheduling, and system monitoring. It maximizes the use and performance of the underlying hardware architecture.
In this validated design, we use NVIDIA Base Command Manager Essentials for: