The Dell Validated Design for Generative AI Model Customization is designed to address the challenges of customizing LLMs for enterprise use cases. LLMs have shown tremendous potential in natural language processing tasks but require specialized infrastructure for efficient customization and deployment.
This reference architecture serves as a blueprint, offering organizations guidelines and best practices to design and implement scalable, efficient, and reliable infrastructure specifically tailored for generative AI model training and customization. While its primary focus is LLM customization, the architecture can be adapted for discriminative or predictive AI model training.
Figure 4. Reference architecture
The following sections describe the key components of the reference architecture shown in the preceding figure. Additional information about the multinode configuration is provided in the Networking design section.
The compute infrastructure is a critical component of the design, responsible for the training of AI models. Dell Technologies offers a range of acceleration-optimized servers, equipped with NVIDIA GPUs and high-speed connectivity, to handle the intense compute demands of LLMs. The PowerEdge XE9680 server is used as the compute infrastructure for LLM customization in the current version of this design and is validated in both single-node and multinode configurations.
Two cluster configurations are available: the PowerEdge servers with NVIDIA GPUs can be configured as either a Kubernetes cluster or a Slurm cluster.
A Kubernetes cluster is a group of interconnected servers that run containerized applications managed by Kubernetes, an open-source container orchestration system. The cluster consists of control plane nodes, which manage the cluster, and worker nodes, which run workloads. Containers are grouped into pods, the smallest deployable units. Kubernetes manages the scaling and deployment of pods through replica sets and deployments, ensuring that the desired number of replicas is running. Services provide load balancing and network access to pods, while resources such as ConfigMaps and Secrets hold configuration and sensitive data. Kubernetes clusters are highly scalable and offer features such as automated load balancing and self-healing, making them well suited for managing containerized applications and complex distributed systems.
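To illustrate these concepts, the following minimal sketch uses the official Kubernetes Python client to list the pods and deployments running on a cluster. It assumes that the kubernetes Python package is installed and that a valid kubeconfig is available on the node where it runs.

```python
# Minimal sketch: query pods and deployments on a Kubernetes cluster.
# Assumes the official "kubernetes" Python package and a valid kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # reads the kubeconfig of the current user

core = client.CoreV1Api()
apps = client.AppsV1Api()

# Pods are the smallest deployable units managed by the cluster.
for pod in core.list_pod_for_all_namespaces().items:
    print(f"pod {pod.metadata.namespace}/{pod.metadata.name}: {pod.status.phase}")

# Deployments declare the desired number of pod replicas; Kubernetes
# continuously reconciles the cluster toward that desired state.
for dep in apps.list_deployment_for_all_namespaces().items:
    print(f"deployment {dep.metadata.name}: "
          f"{dep.status.ready_replicas or 0}/{dep.spec.replicas} replicas ready")
```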
A Slurm cluster, powered by the "Simple Linux Utility for Resource Management" software, is a high-performance computing environment that efficiently manages and schedules computing tasks across multiple nodes. This open-source system is efficient at job scheduling, tracking resource availability, and prioritizing tasks based on user-defined requirements. It uses job queues and provides fairness mechanisms, ensuring that higher-priority jobs are accommodated without neglecting lower-priority ones. Slurm offers access control features, facilitating user management and access policies, and is designed to handle node failures gracefully, redistributing jobs to maintain efficiency. It is a popular choice for scientific research, academic institutions, and organizations requiring substantial computational power for tasks such as AI model training and customization, simulations, and data analysis.
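For illustration, the following minimal sketch composes a Slurm batch script and submits it with the sbatch command. The partition name, resource requests, and training entry point are hypothetical placeholders, not values prescribed by this design.

```python
# Minimal sketch: submit a multi-node GPU job to Slurm through sbatch.
# The partition, resource values, and train.py entry point are hypothetical.
import subprocess
import tempfile

BATCH_SCRIPT = """#!/bin/bash
#SBATCH --job-name=llm-customization
#SBATCH --partition=defq
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00
#SBATCH --output=%x-%j.out

srun python train.py
"""

with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as handle:
    handle.write(BATCH_SCRIPT)
    script_path = handle.name

# Slurm queues the job and schedules it when the requested nodes,
# GPUs, and time window become available on the chosen partition.
result = subprocess.run(["sbatch", script_path],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())  # for example: "Submitted batch job 1234"
```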
NVIDIA Base Command Manager Essentials allows customers to manage both Kubernetes and Slurm clusters seamlessly. Base Command Manager Essentials can be used to configure a PowerEdge server as part of either the Kubernetes cluster or the Slurm cluster, and a server can be reassigned from one cluster to the other with just a reboot. This method allows administrators to allocate resources to either cluster on demand and with minimal overhead.
We rely on Slurm and Kubernetes to provide secure clusters for model customization. This design does not address any additional security considerations.
Dell Technologies recommends using the Slurm cluster for model customization and the Kubernetes cluster for model inferencing. Slurm offers scheduling capabilities such as batch scheduling, preemption, and multiple queues, making it efficient for orchestrating long-running tasks such as model customization. Kubernetes delivers capabilities such as autoscaling, high availability, fault tolerance, and load balancing, which are tailored for services like model inferencing.
For this version of the design, we validated both a Slurm cluster and Kubernetes cluster for running model customization. See the Model customization validation section for details about the scenarios that we validated.
This validated design incorporates two physical networks: an Ethernet network for management, storage, and client/server traffic (sometimes referred to as north/south traffic), and an InfiniBand network for internode communication (sometimes referred to as east/west traffic) used for distributed training.
For Ethernet, organizations can choose between 25 Gb and 100 Gb networking infrastructure based on their specific requirements. For LLM customization tasks using text data, we recommend the Dell PowerSwitch Z9432F-ON, which adequately meets the bandwidth demands of text data.
PowerSwitch S5232F-ON or PowerSwitch S5248F-ON can also be used as the network switch. PowerSwitch S5232F-ON supports both 25 Gb and 100 Gb Ethernet, while PowerSwitch S5248F-ON is a 25 Gb Ethernet switch.
You can use ConnectX-6 network adapters for network connectivity. They are available in both 25 Gb and 100 Gb options.
When model customization requires multiple servers for LLM training, you must connect these servers with a high-speed interconnect. InfiniBand is a preferred choice for internode connectivity in LLM customization because its high bandwidth facilitates swift data transfers between nodes, which is particularly essential when handling large datasets and complex neural networks. Low-latency communication is crucial for synchronous model training, and InfiniBand's low latency ensures rapid exchange of updates between nodes, contributing to synchronization and overall efficiency in distributed training.
Additionally, InfiniBand natively supports collective operations such as all-reduce, which are fundamental to AI model training. InfiniBand's support for Remote Direct Memory Access (RDMA) allows data to be transferred directly between the memory of one node and another, reducing CPU involvement and minimizing latency. InfiniBand also provides reliable data transfer, reducing the chance of data loss or errors. Overall, InfiniBand's combination of high performance, low latency, scalability, and reliability makes it an ideal choice for AI model training on distributed computing clusters.
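As a simple illustration of these collective operations, the following minimal sketch performs an all-reduce across all GPUs in a distributed job using PyTorch with the NCCL backend, which transports data over InfiniBand with RDMA when such a fabric is available. It assumes the script is launched with torchrun across the participating nodes; the node and GPU counts shown are examples only.

```python
# Minimal sketch: an all-reduce across all GPUs in a distributed job.
# Assumes PyTorch with CUDA and a launch such as:
#   torchrun --nnodes=2 --nproc_per_node=8 allreduce_check.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Each rank contributes a tensor; all_reduce sums them in place across
    # every GPU in the job. NCCL carries the traffic over InfiniBand
    # (using RDMA) when the fabric is present.
    values = torch.full((4,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(values, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print("summed tensor:", values.tolist())
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```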
In this validated design, we used the following:
Customers can choose either the NDR or HDR configurations.
Note: We recommend eight single-port ConnectX InfiniBand adapters for each server to take advantage of GPUDirect RDMA. Each GPU in the PowerEdge XE9680 server requires a dedicated InfiniBand port.
The management infrastructure ensures the seamless deployment and orchestration of the AI model customization system. NVIDIA Base Command Manager Essentials performs bare metal provisioning, cluster deployment, and ongoing management tasks. Deployed on a PowerEdge R660 server that serves as a head node, NVIDIA Base Command Manager Essentials simplifies the administration of the entire cluster.
To enable efficient container orchestration, a Kubernetes cluster is deployed in the compute infrastructure using NVIDIA Base Command Manager Essentials. To ensure high availability and fault tolerance, we recommend installing the Kubernetes control plane on three PowerEdge R660 servers. The management node can serve as one of the control plane nodes.
The Cluster configuration section above explains that you have the flexibility to select Slurm, Kubernetes, or both clusters according to your needs. If you choose a Slurm cluster, we recommend that you set up three management nodes. This proactive approach future-proofs your deployment and ensures compatibility with any potential Kubernetes deployment later.
See the Networking design section for additional information about management networks.
Local storage available in the PowerEdge servers provides operating system and container storage. The NeMo Framework might create temporary files and checkpoints that require a large amount of storage. This storage can be mapped to local storage, and we recommend using high-capacity local storage.
The need for external storage for AI model customization depends on the specific requirements and characteristics of the AI model, the number of parameters, and the complexity of the fine-tuning process.
In this design, we recommend PowerScale storage as a repository for datasets for model customization, models, model versioning and management, and model ensembles. We also recommend it for storage and archival of inference data, including capture and retention of prompts and outputs when the model customization has been completed and put into inferencing operations. These recommendations can be useful for marketing and sales or customer service applications where further analysis of customer interactions might be desirable.
The flexible, robust, and secure storage capabilities of PowerScale offer the scale and speed necessary for training and operationalizing AI models, providing a foundational component for AI workflow. Its capacity to handle the vast data requirements of AI, combined with its reliability and high performance, cements the crucial role that external storage plays in successfully bringing AI models from conception to application.
In this validated design, we established volumes in PowerScale storage to store models, datasets, and checkpoints. These volumes are accessed using the NFS protocol and are configured using Base Command Manager on both the head node and the worker nodes. Supervised fine-tuning in the NeMo Framework employs distributed checkpointing, which requires worker nodes to access folders created by their peers. To ensure that these folders are immediately visible to all workers, attribute caching must be disabled (actimeo=0) when mounting the NFS volumes.
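The following minimal sketch can be used to confirm on a worker node that attribute caching is disabled on the NFS volumes. It assumes a Linux node that reports mount options through /proc/mounts; note that some kernels report actimeo=0 as the expanded acregmin/acregmax/acdirmin/acdirmax values.

```python
# Minimal sketch: verify that NFS mounts disable attribute caching.
# Assumes a Linux node that reports mount options in /proc/mounts.
from pathlib import Path


def attribute_caching_disabled(options: str) -> bool:
    opts = dict(opt.partition("=")[::2] for opt in options.split(","))
    # Kernels may report actimeo=0 directly or as the expanded ac* timeouts.
    if opts.get("actimeo") == "0":
        return True
    return all(opts.get(key) == "0"
               for key in ("acregmin", "acregmax", "acdirmin", "acdirmax"))


for line in Path("/proc/mounts").read_text().splitlines():
    device, mountpoint, fstype, options = line.split()[:4]
    if fstype.startswith("nfs") and not attribute_caching_disabled(options):
        print(f"{mountpoint}: attribute caching is still enabled ({options})")
```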
As described earlier, this validated design uses the NeMo framework for model customization with three sizes of the Llama 2 model as the recommended foundation model for our validation scenarios. See the Customizing Large Language Models chapter for more information about Llama 2.
Organizations seeking comprehensive model life cycle management can optionally deploy MLOps platforms and toolsets, like cnvrg.io, Kubeflow, MLflow, and others.
MLOps integrates machine learning with software engineering for efficient deployment and management of models in real-world applications. In generative AI, MLOps can automate model deployment, ensure continuous integration, monitor performance, optimize resources, handle errors, and ensure security and compliance. It can also manage model versions, detect data drift, and provide model explainability. These practices ensure generative models operate reliably and efficiently in production, which is critical for interactive tasks like content generation and customer service chatbots.
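For example, a tracking tool such as MLflow, one of the platforms listed above, can record the parameters, metrics, and artifacts of each customization run for later comparison and versioning. The following minimal sketch assumes an MLflow tracking server at a hypothetical URI and uses hypothetical experiment, run, and parameter names.

```python
# Minimal sketch: track a fine-tuning run with MLflow.
# The tracking URI, experiment name, and logged values are hypothetical.
import mlflow

mlflow.set_tracking_uri("http://mlflow.example.internal:5000")
mlflow.set_experiment("llama2-customization")

with mlflow.start_run(run_name="sft-llama2-7b"):
    # Hyperparameters of the run, logged for later comparison and versioning.
    mlflow.log_params({
        "base_model": "llama2-7b",
        "nodes": 2,
        "gpus_per_node": 8,
        "learning_rate": 1e-5,
    })
    # The training loop would normally log metrics per step or epoch.
    mlflow.log_metric("validation_loss", 1.23, step=100)
    # Artifacts such as evaluation reports or adapter weights can also be
    # attached to the run, for example: mlflow.log_artifact("eval_report.json")
```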
cnvrg.io delivers a full-stack MLOps platform that helps simplify continuous training, tuning, and deployment of AI and ML models. With cnvrg.io, organizations can automate end-to-end ML pipelines at scale and make it easy to place training or inferencing workloads on CPUs and GPUs based on cost and performance trade-offs. For more information about a reference architecture for cnvrg.io on Kubernetes, see the design guide Optimize Machine Learning through MLOPs with Dell Technologies and cnvrg.io.
Note: cnvrg.io and other popular MLOps platforms are only supported on the Kubernetes cluster. If you choose to use an MLOps platform on Kubernetes, you must account for the scheduling considerations on Kubernetes and how it compares with Slurm, as explained in the Cluster configuration section above.
With all the architectural elements described in this section for the Dell Validated Design for Generative AI Model Customization, organizations can confidently implement high-performance, efficient, and reliable AI infrastructure for model customization. The architecture's modularity and scalability offer flexibility, making it well suited for various AI workflows, while its primary focus is on generative AI model customization.