NVIDIA AI Enterprise provides enterprise support for a range of software frameworks, toolkits, workflows, and models used for inferencing. See the NVIDIA AI Enterprise documentation for more information about all components available with NVIDIA AI Enterprise. The following components incorporated in this design are available as part of NVIDIA AI Enterprise:
The following sections describe the key software components and how they are used in this design. For more information about how these key components work together, see Chapter 4.
NVIDIA Triton Inference Server (also known as Triton) is inference-serving software that standardizes AI model deployment and execution, delivering fast and scalable AI in production. Enterprise support for Triton is available through NVIDIA AI Enterprise. It is also available as open-source software.
Triton streamlines and standardizes AI inference by enabling teams to deploy, run, and scale trained machine learning or deep learning models from any framework on any GPU- or CPU-based infrastructure. It provides AI researchers and data scientists the freedom to choose the appropriate framework for their projects without impacting production deployment. It also helps developers deliver high-performance inference across cloud, on-premises, edge, and embedded devices.
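To serve a model, Triton reads it from a model repository, a directory tree in which each model has a configuration file (config.pbtxt) describing its backend, inputs, and outputs. The fragment below is a minimal sketch for a hypothetical text-generation model; the model name, tensor names, and backend choice are illustrative assumptions, not part of this design.

```
# model_repository/my_llm/config.pbtxt  (hypothetical model "my_llm")
name: "my_llm"
backend: "python"          # backend is model-dependent (e.g. tensorrt, onnxruntime, python)
max_batch_size: 8
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
```

The model weights themselves live in numbered version subdirectories (for example, model_repository/my_llm/1/), which lets Triton load and serve multiple versions of the same model side by side.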
The benefits of Triton for AI inferencing include the following:
Triton Inference Server is at the core of this design: it is the software that hosts the generative AI models and serves their inference requests.
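Clients reach a hosted model through Triton's HTTP/REST endpoint, which implements the KServe v2 inference protocol. The sketch below builds a request body for that protocol; the model name, tensor names, and shapes are hypothetical and would need to match the deployed model's configuration.

```python
import json


def build_infer_request(model_name: str, prompt: str):
    """Build a KServe v2 inference request for Triton's HTTP endpoint
    (POST /v2/models/<model_name>/infer).

    The tensor names "text_input"/"text_output" are illustrative
    assumptions; they must match the model's config.pbtxt.
    """
    body = {
        "inputs": [
            {
                "name": "text_input",
                "shape": [1, 1],
                "datatype": "BYTES",  # strings travel as BYTES in the v2 protocol
                "data": [prompt],
            }
        ],
        "outputs": [{"name": "text_output"}],
    }
    return f"/v2/models/{model_name}/infer", json.dumps(body)


# Example: request path and JSON payload for a hypothetical "my_llm" model.
path, payload = build_infer_request("my_llm", "What is Triton?")
```

The same payload shape works across backends, which is part of what lets Triton standardize serving: clients address tensors by name and datatype rather than by framework-specific APIs.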
NVIDIA Base Command Manager Essentials facilitates seamless operationalization of AI development at scale by providing features like operating system provisioning, firmware upgrades, network and storage configuration, multi-GPU and multi-node job scheduling, and system monitoring. It maximizes the use and performance of the underlying hardware architecture.
In this design, we use NVIDIA Base Command Manager Essentials for:
Cluster monitoring and management, including health monitoring, fault tolerance, resource utilization monitoring, software and package management, security and access control, and scaling