The Dell Validated Design for Generative AI Inferencing is a reference architecture designed to address the challenges of deploying large language models (LLMs) in production environments. LLMs have shown tremendous potential in natural language processing tasks but require specialized infrastructure for efficient deployment and inferencing.
This reference architecture serves as a blueprint, offering organizations guidelines and best practices to design and implement scalable, efficient, and reliable AI inference systems specifically tailored for generative AI models. While its primary focus is generative AI inferencing, the architecture can be adapted for discriminative or predictive AI models, as explained further in this section.
Figure 4. Reference architecture
The following sections describe the key components of the reference architecture.
The compute infrastructure is a critical component of the design, responsible for the efficient execution of AI models. Dell Technologies offers a range of acceleration-optimized servers equipped with NVIDIA GPUs to handle the intensive compute demands of LLMs. The following server models are available as compute resources for deploying LLMs in production:
Additionally, the PowerEdge R760xa server configured with NVIDIA L40 GPUs can be used for inferencing of nongenerative AI workloads, such as recommendation systems.
Organizations can choose between 25 Gb and 100 Gb networking infrastructure based on their specific requirements. For LLM inferencing tasks using text data, we recommend using existing network infrastructure with 25 Gb Ethernet, which adequately meets the bandwidth demands of text data. To future-proof the infrastructure, a 100 Gb Ethernet setup can be used. PowerSwitch S5232F-ON or PowerSwitch S5248F-ON can be used as the network switch. PowerSwitch S5232F-ON supports both 25 Gb and 100 Gb Ethernet, while PowerSwitch S5248F-ON is a 25 Gb Ethernet switch. ConnectX-6 network adapters, available in both 25 Gb and 100 Gb options, provide network connectivity.
The scope of the reference architecture includes only AI models in production that fit within a single PowerEdge server. It does not include models that span multiple nodes and require a high-speed interconnect.
The management infrastructure ensures the seamless deployment and orchestration of the AI inference system. NVIDIA’s cluster management software performs bare-metal provisioning, cluster deployment, and ongoing management tasks. Deployed on a PowerEdge R660 server that serves as the head node, the cluster management software simplifies administration of the entire cluster.
To enable efficient container orchestration, a Kubernetes cluster is deployed on the compute infrastructure, under the management of the cluster manager. Depending on redundancy and scalability requirements, the Kubernetes control plane can be deployed on one or three PowerEdge R660 servers. For a small compute cluster (fewer than eight nodes), a single control plane node is sufficient. For larger and highly redundant clusters, the control plane can be deployed on three nodes.
Local storage available in the PowerEdge servers is used for operating system and container storage. Kubernetes, deployed by NVIDIA’s cluster management software, deploys the Rancher local path storage class, which makes local storage available for provisioning persistent volumes to pods.
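As an illustration, the following minimal sketch requests storage from that local path storage class using the Kubernetes Python client. The namespace, claim name, and capacity shown are hypothetical and not prescribed by this design.

```python
# Minimal sketch: requesting local storage through the Rancher "local-path"
# storage class with the Kubernetes Python client.
# The namespace, claim name, and size below are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="triton-model-cache"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="local-path",  # storage class created by the local path provisioner
        resources=client.V1ResourceRequirements(requests={"storage": "50Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="inference", body=pvc
)
```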
The need for external storage for AI model inference depends on the specific requirements and characteristics of the AI model and the deployment environment. In many cases, external storage is not strictly required for AI model inference, as the models reside in GPU memory. In this validated design, we do not include external storage as part of the architecture.
However, PowerScale storage can be used as a repository for models, model versioning and management, and model ensembles, and for the storage and archival of inference data. Its robust external storage capabilities offer the scale and speed necessary for operationalizing AI models, providing a foundational component for AI workflows. Its capacity to handle the vast data requirements of AI, combined with its reliability and high performance, cements the crucial role that external storage plays in successfully bringing AI models from conception to application.
The heart of the AI inference system is the Triton Inference Server, which handles the AI models and processes inference requests. Triton is a powerful inference server software that efficiently serves AI models with low latency and high throughput. Its integration with the compute infrastructure, GPU accelerators, and networking ensures smooth and optimized inferencing operations.
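For context, the following sketch shows how a client application might submit an inference request to Triton over its HTTP endpoint using the tritonclient library. The model name, tensor names, shapes, and data types are illustrative assumptions and must match the configuration of the model actually deployed.

```python
# Minimal sketch of submitting an inference request to Triton Inference Server
# over HTTP. The model name ("my_llm"), tensor names, shapes, and datatypes are
# illustrative assumptions; they must match the deployed model's config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

triton = httpclient.InferenceServerClient(url="localhost:8000")

# Example input: a batch of one tokenized prompt (placeholder token IDs).
token_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int32)

infer_input = httpclient.InferInput("input_ids", list(token_ids.shape), "INT32")
infer_input.set_data_from_numpy(token_ids)

response = triton.infer(model_name="my_llm", inputs=[infer_input])
output = response.as_numpy("output_ids")  # output tensor name is also an assumption
print(output)
```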
To enable the deployment of generative AI models in production, the NeMo framework, combined with Triton Inference Server, offers powerful AI tools and optimizations. Specifically, the NeMo framework, combined with FasterTransformer, unlocks state-of-the-art accuracy, low latency, and high throughput inference performance across both single-GPU and multi-GPU configurations. This combination empowers organizations to unleash the full potential of their generative AI models.
The reference architecture can also be used for predictive or discriminative AI. For example, recommendation systems can be built on this reference architecture. Recommendation systems are AI models that analyze user preferences, behaviors, and historical data to suggest relevant and personalized items or content. NVIDIA Merlin is a framework that accelerates and streamlines the development and deployment of large-scale deep learning-based recommendation systems. It aims to optimize the entire recommendation pipeline, from data preprocessing and feature engineering to model training and deployment, with a focus on performance and scalability.
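As a hedged illustration of the preprocessing stage of that pipeline, the sketch below uses NVTabular, the feature engineering library in the Merlin framework, to prepare a hypothetical interaction log on the GPU before model training. The column names and file paths are assumptions, not part of the validated design.

```python
# Illustrative sketch of GPU-accelerated feature preprocessing with NVTabular,
# the feature-engineering library of the NVIDIA Merlin framework.
# Column names and the parquet paths are hypothetical.
import nvtabular as nvt
from nvtabular import ops

# Encode categorical user/item IDs to contiguous integer indices;
# normalize the continuous purchase count feature.
cat_features = ["user_id", "item_id"] >> ops.Categorify()
cont_features = ["purchase_count"] >> ops.Normalize()

workflow = nvt.Workflow(cat_features + cont_features)

dataset = nvt.Dataset("interactions.parquet")  # hypothetical interaction log
workflow.fit(dataset)
workflow.transform(dataset).to_parquet("preprocessed/")
```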
The reference architecture benefits from the extensive collection of pretrained models provided by the NeMo framework. These models address various categories, including automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS). Additionally, GPT models are readily available for download from locations such as the Hugging Face model repository, offering a vast selection of generative AI capabilities.
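For experimentation, a GPT-style checkpoint can be pulled directly from the Hugging Face model repository, as in the sketch below. The gpt2 checkpoint and prompt are purely illustrative; a production deployment would instead export the chosen model into the Triton model repository.

```python
# Minimal sketch of pulling a pretrained GPT-style model from the Hugging Face
# model repository for local experimentation. The "gpt2" checkpoint and the
# prompt are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Generative AI inferencing on PowerEdge servers"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```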
Organizations seeking comprehensive model life cycle management can optionally deploy MLOps platforms such as cnvrg.io, Kubeflow, MLflow, and Domino Data Lab. These platforms streamline the deployment, monitoring, and maintenance of AI models, ensuring efficient management and optimization throughout their life cycle. We have validated cnvrg.io as part of this validated design.
By adhering to this Dell Validated Design for Generative AI Inferencing, organizations can confidently implement high-performance, efficient, and reliable AI inference systems. The architecture's modularity and scalability offer flexibility, making it well suited for various AI workflows, while its primary focus on generative AI inferencing maximizes the potential of advanced LLMs.