The Dell Validated Design for Generative AI Inferencing is a reference architecture designed to address the challenges of deploying LLMs in production environments. LLMs have shown tremendous potential in natural language processing tasks but require specialized infrastructure for efficient deployment and inferencing.
This reference architecture serves as a blueprint, offering organizations guidelines and best practices to design and implement scalable, efficient, and reliable AI inference systems specifically tailored for generative AI models. While its primary focus is generative AI inferencing, the architecture can be adapted for discriminative or predictive AI models, as explained further in this section.
Figure 2. Reference architecture
The following sections describe the key components of the reference architecture.
The compute infrastructure is a critical component of the design, responsible for the efficient execution of AI models. Dell Technologies offers a range of acceleration-optimized servers, equipped with NVIDIA GPUs, to handle the intense compute demands of LLMs. The following server models are available as compute resources for deploying LLMs in production:
Additionally, the PowerEdge R760xa server configured with NVIDIA L40 GPUs can be used for inferencing predictive AI, such as recommendation systems.
Organizations can choose between 25 Gb and 100 Gb networking infrastructure based on their specific requirements. For LLM inferencing tasks using text data, we recommend using the existing network infrastructure with 25 Gb Ethernet, which adequately meets the bandwidth demands of text data. To future-proof the infrastructure, a 100 Gb Ethernet setup can be used. PowerSwitch S5232F-ON or PowerSwitch S5248F-ON can be used as the network switch: PowerSwitch S5232F-ON supports both 25 Gb and 100 Gb Ethernet, while PowerSwitch S5248F-ON is a 25 Gb Ethernet switch. ConnectX-6 network adapters, available in both 25 Gb and 100 Gb options, provide network connectivity.
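As a rough illustration of why 25 Gb Ethernet is typically sufficient for text payloads, the following Python sketch estimates the wire time for a hypothetical batch of tokenized prompts at 25 GbE and 100 GbE. The payload size and link efficiency are assumptions for illustration, not measurements from this design.

```python
# Back-of-envelope check: time to move an inference payload over 25 GbE vs. 100 GbE.
# Payload size and link efficiency are illustrative assumptions, not measured values.

def transfer_time_ms(payload_bytes: float, link_gbps: float, efficiency: float = 0.9) -> float:
    """Approximate wire time for a payload on a link with the given usable efficiency."""
    usable_bps = link_gbps * 1e9 * efficiency
    return payload_bytes * 8 / usable_bps * 1e3

# A batch of 32 text prompts of ~4 KB each (tokenized text is small relative to link capacity).
batch_bytes = 32 * 4 * 1024

for gbps in (25, 100):
    print(f"{gbps} GbE: {transfer_time_ms(batch_bytes, gbps):.3f} ms per batch")
```

Even on the slower link, the wire time for a text batch is a small fraction of a millisecond, which is negligible compared with model execution time.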
The scope of the reference architecture includes only AI models in production that fit within a single PowerEdge server. It does not include models that span multiple nodes and require a high-speed interconnect.
The management infrastructure ensures the seamless deployment and orchestration of the AI inference system. NVIDIA Base Command Manager Essentials performs bare metal provisioning, cluster deployment, and ongoing management tasks. Deployed on a PowerEdge R660 server that serves as a head node, NVIDIA Base Command Manager Essentials simplifies the administration of the entire cluster.
To enable efficient container orchestration, a Kubernetes cluster is deployed in the compute infrastructure using NVIDIA Base Command Manager Essentials. To ensure high availability and fault tolerance, we recommend installing the Kubernetes control plane on three PowerEdge R660 servers. The management node can serve as one of the control plane nodes.
Local storage that is available in the PowerEdge servers is used for operating system and container storage. Kubernetes, deployed by NVIDIA Base Command Manager Essentials, deploys the Rancher Local Path Provisioner storage class, which makes local storage available to pods through dynamically provisioned persistent volumes.
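For illustration, the following sketch uses the official Kubernetes Python client to request a persistent volume from the local path storage class. The claim name, namespace, size, and the storage class name "local-path" (the provisioner's default) are assumptions that may differ in a given deployment.

```python
# Minimal sketch: request a persistent volume backed by the Rancher local path
# storage class, using the official Kubernetes Python client. The claim name,
# namespace, and size are illustrative; "local-path" is the provisioner's default
# StorageClass name and may differ in a given cluster.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="triton-model-cache"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="local-path",
        resources=client.V1ResourceRequirements(requests={"storage": "100Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc
)
```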
The need for external storage for AI model inference depends on the specific requirements and characteristics of the AI model and the deployment environment. In many cases, external storage is not strictly required for AI model inference because the models reside in GPU memory. In this validated design, external storage is not explicitly required as part of the architecture.
However, PowerScale storage can be used as a repository for models, model versioning and management, and model ensembles, and for the storage and archival of inference data, including the capture and retention of prompts and outputs during inferencing operations. This capability can be useful for marketing, sales, or customer service applications where further analysis of customer interactions may be desirable.
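As a simple illustration of capturing inference data, the sketch below appends prompts and generated outputs as JSON lines to a directory on a PowerScale NFS export mounted on the inference host. The mount point, file layout, and field names are illustrative assumptions.

```python
# Minimal sketch: append prompts and generated outputs as JSON lines to a
# PowerScale NFS export mounted on the inference host. The mount point and
# file layout are assumptions for illustration only.
import json
import time
from pathlib import Path

ARCHIVE_DIR = Path("/mnt/powerscale/inference-archive")  # hypothetical NFS mount point

def archive_interaction(prompt: str, output: str, model: str) -> None:
    """Write one prompt/output pair to a date-stamped JSON Lines file."""
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    record = {"timestamp": time.time(), "model": model, "prompt": prompt, "output": output}
    day_file = ARCHIVE_DIR / f"{time.strftime('%Y-%m-%d')}.jsonl"
    with day_file.open("a") as f:
        f.write(json.dumps(record) + "\n")

archive_interaction("What is our return policy?", "Our return policy allows ...", "gpt-custom")
```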
The flexible, robust, and secure storage capabilities of PowerScale offer the scale and speed necessary for operationalizing AI models, providing a foundational component for AI workflows. Its capacity to handle the vast data requirements of AI, combined with its reliability and high performance, underscores the crucial role that external storage plays in successfully bringing AI models from conception to application.
The heart of the AI inference system is the NVIDIA Triton Inference Server, described in detail earlier, which hosts the AI models and processes inference requests. Triton is powerful inference-serving software that serves AI models efficiently with low latency and high throughput. Its integration with the compute infrastructure, GPU accelerators, and networking ensures smooth and optimized inferencing operations.
To enable the deployment of generative AI models in production, the NeMo framework, combined with Triton Inference Server, offers powerful AI tools and optimizations. In particular, pairing NeMo with FasterTransformer unlocks state-of-the-art accuracy, low latency, and high-throughput inference performance across both single-GPU and multi-GPU configurations, empowering organizations to realize the full potential of their generative AI models.
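For illustration, the following sketch shows what a client-side request to Triton over HTTP can look like, using the tritonclient Python library. The model name, tensor names, and shapes are hypothetical; the actual names come from the model's Triton configuration (config.pbtxt) generated when the NeMo or FasterTransformer model is exported.

```python
# Minimal sketch of a client-side inference request to Triton over HTTP using the
# tritonclient library. The model name ("gpt_model") and tensor names/shapes are
# hypothetical; the real names are defined by the model's Triton configuration.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Tokenized prompt (token IDs would normally come from the model's tokenizer).
input_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int32)

inputs = [httpclient.InferInput("input_ids", list(input_ids.shape), "INT32")]
inputs[0].set_data_from_numpy(input_ids)

outputs = [httpclient.InferRequestedOutput("output_ids")]

result = client.infer(model_name="gpt_model", inputs=inputs, outputs=outputs)
print(result.as_numpy("output_ids"))
```

Triton also exposes a gRPC endpoint (port 8001 by default) that follows the same request pattern when lower client-side overhead is desired.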
The reference architecture can also be used for predictive or discriminative AI. For example, recommendation systems can be built on this reference architecture. Recommendation systems are AI models that analyze user preferences, behaviors, and historical data to suggest relevant and personalized items or content. NVIDIA Merlin is a framework that accelerates and streamlines the development and deployment of large-scale deep learning-based recommendation systems. It aims to optimize the entire recommendation pipeline, from data preprocessing and feature engineering to model training and deployment, with a focus on performance and scalability.
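As a brief illustration of where Merlin fits in such a pipeline, the sketch below uses Merlin's NVTabular library to preprocess a table of user-item interactions before training. The column names and file paths are assumptions for illustration.

```python
# Minimal sketch: GPU-accelerated feature preprocessing with NVTabular (part of
# NVIDIA Merlin). Column names and file paths are illustrative assumptions.
import nvtabular as nvt

# Categorical and continuous feature transformations, expressed as NVTabular op graphs.
cat_features = ["user_id", "item_id"] >> nvt.ops.Categorify()
cont_features = ["user_age", "item_price"] >> nvt.ops.Normalize()

# Combine feature groups with the label column into a single preprocessing workflow.
workflow = nvt.Workflow(cat_features + cont_features + ["click"])

train = nvt.Dataset("interactions.parquet")          # hypothetical interaction log
workflow.fit(train)                                   # compute vocabularies and statistics
workflow.transform(train).to_parquet("processed/")    # write preprocessed training data
```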
The reference architecture benefits from the extensive collection of pretrained models provided by the NeMo framework. These models address various categories, including Automatic Speech Recognition (ASR), NLP, and Text-to-Speech (TTS). Additionally, GPT models are readily available for download from locations such as the Hugging Face model repository, offering a vast selection of generative AI capabilities. The specific models that we validated in this design are listed in Chapter 5.
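For illustration, the following sketch downloads a pretrained GPT model from the Hugging Face Hub into a local model directory using the huggingface_hub library. The repository ID and target path are examples only and should be replaced with the model actually being deployed.

```python
# Minimal sketch: pull a pretrained generative model from the Hugging Face Hub
# into a local model directory. The repository ID and target path are examples
# shown for illustration; substitute the model actually being deployed.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="nvidia/nemo-megatron-gpt-1.3B",  # example repository ID
    local_dir="/models/gpt-1.3b",
)
print(f"Model files downloaded to {local_path}")
```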
Organizations seeking comprehensive model life cycle management can optionally deploy Machine Learning Operations (MLOps) platforms and toolsets, such as cnvrg.io, Kubeflow, MLflow, and others.
MLOps integrates machine learning with software engineering for efficient deployment and management of models in real-world applications. In generative AI, MLOps can automate model deployment, ensure continuous integration, monitor performance, optimize resources, handle errors, and ensure security and compliance. It can also manage model versions, detect data drift, and provide model explainability. These practices ensure generative models operate reliably and efficiently in production, which is critical for interactive tasks like content generation and customer service chatbots.
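As a small illustration of these practices, the sketch below uses MLflow, one of the toolsets named above, to track a deployment of the inference service so that parameters, monitoring metrics, and serving configuration remain auditable. All values are illustrative, and by default the run is recorded in a local ./mlruns store.

```python
# Minimal sketch: use MLflow tracking to record an inference-service deployment,
# so that configuration and monitoring metrics stay auditable over time.
# All values are illustrative; a shared MLflow tracking server could be configured
# with mlflow.set_tracking_uri() instead of the default local ./mlruns store.
import mlflow

mlflow.set_experiment("genai-inference")

with mlflow.start_run(run_name="gpt-deployment"):
    mlflow.log_param("model_name", "gpt_model")                       # model served by Triton
    mlflow.log_param("gpu", "NVIDIA H100")                            # target accelerator
    mlflow.log_metric("p99_latency_ms", 87.0)                         # example monitoring metric
    mlflow.log_dict({"max_batch_size": 8, "instances": 2},            # serving configuration
                    "serving_config.json")
```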
We have validated the MLOps platform from cnvrg.io as part of this validated design. cnvrg.io delivers a full-stack MLOps platform that helps simplify continuous training and deployment of AI and ML models. With cnvrg.io, organizations can automate end-to-end ML pipelines at scale and easily place training or inferencing workloads on CPUs or GPUs, based on cost and performance trade-offs. For more information, see the design guide Optimize Machine Learning through MLOps with Dell Technologies and cnvrg.io.
With the architectural elements described in this section, the Dell Validated Design for Generative AI Inferencing enables organizations to confidently implement high-performance, efficient, and reliable AI inference systems. The architecture's modularity and scalability offer flexibility, making it well suited for various AI workflows, while its primary focus on generative AI inferencing maximizes the potential of advanced LLMs.