In our validated design, NVIDIA’s cluster manager handles bare metal provisioning, deploying the operating system (through PXE), drivers, and local storage configuration to the PowerEdge compute nodes. It then deploys Kubernetes, configuring the control plane and worker nodes and access control, and provisions essential Kubernetes management toolkits and frameworks such as Prometheus. It also handles network configuration, including cluster networking, pod networking, and storage networking. Finally, NVIDIA’s cluster manager is used to deploy NVIDIA software, including the NVIDIA GPU Operator and Fabric Manager, essential components for optimizing GPU performance and management.
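For illustration, the following Python sketch (assuming the official kubernetes Python client and a kubeconfig for the cluster) verifies that the GPU Operator has exposed the nvidia.com/gpu resource on the worker nodes after deployment:

# Minimal sketch: after the NVIDIA GPU Operator is deployed, confirm that worker
# nodes advertise the nvidia.com/gpu resource to the Kubernetes scheduler.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    allocatable = node.status.allocatable or {}
    gpus = allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")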
After the infrastructure is ready, the following workflow describes the software architecture used to deploy an AI model in production.
AI models are typically optimized for performance before deployment to production. Optimized models offer faster inference speed, improved resource efficiency, and reduced latency, resulting in cost savings and better scalability. They can handle increased workloads, require fewer computational resources, and provide a better user experience with quick responses.
The following figure shows an example of LLM model optimization and deployment using the software components described in this reference architecture:
Figure 6. Workflow to optimize NeMo LLM models for inferencing using NVIDIA toolkits
NVIDIA offers several pretrained models in the NeMo format. To optimize a model’s throughput and latency, it can be converted to the FasterTransformer format, which includes performance modifications to the encoder and decoder layers of the transformer architecture. FasterTransformer enables models to serve inference requests with latencies three times lower or more than their non-FasterTransformer counterparts. The NeMo framework training container includes the FasterTransformer framework and scripts for converting a .nemo file to the FasterTransformer format.
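As an illustration only, the conversion can be scripted inside the NeMo framework training container. The converter script path, model file names, and tensor-parallel setting below are hypothetical placeholders; use the conversion script documented for your model family and container version:

# Illustrative sketch: invoke the .nemo -> FasterTransformer converter shipped in
# the NeMo framework training container. The script path and arguments below are
# hypothetical placeholders and vary by model family and container version.
import subprocess

subprocess.run(
    [
        "python",
        "/opt/NeMo/scripts/convert_nemo_to_fastertransformer.py",  # hypothetical converter path
        "--in-file", "/models/llm.nemo",       # source NeMo checkpoint (placeholder)
        "--out-dir", "/models/llm-ft",         # FasterTransformer output directory (placeholder)
        "--tensor-parallelism", "2",           # split weights across two GPUs (example value)
    ],
    check=True,
)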
Note: The NeMo models available in Hugging Face are foundation models. While these models offer a strong starting point with general capabilities, they are typically customized for a specific use case. This design guide does not address or show model customization.
After the model is converted, it can be further optimized using the Model Analyzer tool. Model Analyzer provides insight into the compute and memory requirements of Triton Inference Server models by analyzing various configuration settings and generating performance reports. These reports summarize metrics such as latency, throughput, GPU resource utilization, and power draw, making it easy to compare performance across configurations and identify the optimal configuration for the inference container.
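The following sketch shows one way to drive a Model Analyzer profiling run. Model Analyzer is a command-line tool, so it is invoked here through Python’s subprocess for consistency with the other examples; the repository paths and model name are placeholders, and flag names can vary across Model Analyzer releases:

# Sketch of a Model Analyzer profiling run over a Triton model repository.
# Paths and the model name are placeholders for this design guide.
import subprocess

subprocess.run(
    [
        "model-analyzer", "profile",
        "--model-repository", "/models/triton-repo",   # Triton model repository (placeholder)
        "--profile-models", "llm_fastertransformer",   # model to profile (placeholder)
        "--output-model-repository-path", "/models/analyzer-output",
    ],
    check=True,
)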
The final optimized model is ready for production deployment with Triton Inference Server on a PowerEdge server equipped with NVIDIA GPUs. The model can be accessed through API endpoints over the HTTP or gRPC protocols. Triton Inference Server also exposes health and performance metrics for the model in production, which can be collected and visualized with Prometheus and Grafana. These monitoring tools provide valuable insights into the model's performance and overall system health during deployment.
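As a minimal sketch, assuming the tritonclient Python package and Triton’s default ports, the deployed model’s health can be checked as follows; Prometheus scrapes the companion metrics endpoint that Triton serves on port 8002 by default:

# Check server liveness, server readiness, and model readiness on a deployed
# Triton Inference Server. The model name is a placeholder; default ports are
# 8000 (HTTP) and 8001 (gRPC).
import tritonclient.http as httpclient

triton = httpclient.InferenceServerClient(url="localhost:8000")

print("server live: ", triton.is_server_live())
print("server ready:", triton.is_server_ready())
print("model ready: ", triton.is_model_ready("llm_fastertransformer"))  # placeholder model name
# Prometheus-format metrics (latency, throughput, GPU utilization) are served at
# http://localhost:8002/metrics for scraping and Grafana dashboards.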
The same workflow applies to predictive or discriminative AI. For a recommendation system use case, the starting point is a model developed using the Merlin framework rather than a NeMo model.