Sizing and scaling the infrastructure for LLM inferencing and fine-tuning
The Dell Validated Design for Generative AI Model Customization is engineered to scale horizontally, or scale out, to meet the demands of LLM inference and fine-tuning.
Scale-out refers to increasing capacity by adding servers to the existing infrastructure. Newly added servers join the cluster and are deployed with the same image as the existing PowerEdge servers, which keeps the scaling process straightforward. By adding PowerEdge XE9680 servers, customers can host more LLM instances and serve a larger number of concurrent users for inferencing. For fine-tuning, the added compute capacity reduces training time.
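As a rough illustration of this scale-out sizing, the sketch below estimates how many XE9680 nodes are needed to serve a target number of concurrent inference users. The per-node instance count and per-instance concurrency are hypothetical placeholders for illustration only, not figures from this design; actual values depend on the model, quantization, and serving stack.

```python
import math

def nodes_for_concurrent_users(
    target_users: int,
    instances_per_node: int = 2,   # hypothetical: LLM instances hosted per XE9680
    users_per_instance: int = 16,  # hypothetical: concurrent users one instance sustains
) -> int:
    """Return the number of XE9680 nodes needed to serve target_users concurrently."""
    users_per_node = instances_per_node * users_per_instance
    return math.ceil(target_users / users_per_node)

# Example: with the placeholder assumptions above, 500 concurrent users
# would require 16 nodes (500 / 32 users per node, rounded up).
print(nodes_for_concurrent_users(500))
```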
In this design, PowerEdge XE9680 servers are deployed with Omnia and configured as worker nodes in a Kubernetes cluster. Kubernetes is well suited to managing containerized applications and complex distributed systems at scale, which allows the infrastructure to grow with the demands of LLM inference and fine-tuning.
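As a minimal sketch of how added worker nodes can be put to use, the example below uses the Kubernetes Python client to list nodes that advertise AMD GPUs and to scale an inference deployment accordingly. The deployment name, namespace, and one-replica-per-node policy are assumptions for illustration, not part of this design; it also assumes a kubeconfig with access to the Omnia-deployed cluster and that the AMD GPU device plugin exposes accelerators as the amd.com/gpu resource.

```python
from kubernetes import client, config

# Assumes a kubeconfig with access to the Omnia-deployed Kubernetes cluster.
config.load_kube_config()

core = client.CoreV1Api()
apps = client.AppsV1Api()

# List worker nodes and the AMD GPUs they advertise through the device plugin.
nodes = core.list_node().items
for node in nodes:
    gpus = node.status.allocatable.get("amd.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} allocatable GPUs")

# Scale a hypothetical LLM inference deployment to one replica per node;
# "llm-inference" and "genai" are placeholder names, not from this design.
apps.patch_namespaced_deployment_scale(
    name="llm-inference",
    namespace="genai",
    body={"spec": {"replicas": len(nodes)}},
)
```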
The design also includes a robust network infrastructure to support this scalability. It uses Dell PowerSwitch Z-series switches, an open Ethernet-based platform that delivers next-generation Ethernet fabrics built on advanced network silicon. The network architecture in this design supports scaling up to eight nodes.
Storage can be scaled independently of the servers. As PowerScale storage nodes are added to a PowerScale OneFS cluster, the cluster gains disk capacity and performance, including additional memory, CPU, and network throughput. Dell PowerScale clusters can scale up to 252 nodes, 186 PB of capacity, and over 2.5 TB/s of read/write throughput within a single namespace.