Dell Technologies provides a diverse selection of acceleration-optimized servers with an extensive portfolio of accelerators. Based on the specific workload characteristics of a digital assistant system and its overall sizing and scaling needs, we selected the PowerEdge R760xa and R760 servers (see Chapter 4 for more details).
To understand this decision better, consider the characteristics of a digital assistant workload:
- From a customer perspective, the most important characteristics are the number of tenants, the number of simultaneous conversations, and the response time, which most directly affects the user experience.
- The workload exhibits a high degree of parallelism. Incoming requests are entirely independent of one another, as they reflect the scope and type of requests from individual users. This independence holds true for all components in the system.
- The LLM workload of a digital assistant solution is characterized as an inference workload, as opposed to a training workload, so we designed an inference-optimized infrastructure.
- Cost is a key concern for any AI workload. To offer a high-performing digital assistant solution while addressing cost concerns, it is imperative to find the right GPU/server mix. In this design, we answer the following questions with an experimental validation of the documented configurations to avoid costly (and lengthy) trial-and-error runs when the solution is deployed at customer sites:
- How many servers are required?
- How many GPUs are required in each server?
- What type of GPUs are best suited for the workload processed?
- Assuming that latency is held constant, the key variable in the system is the number of simultaneous conversations that it must process. When properly designed, the system can easily scale up and down by adding and removing servers in the respective tiers, as needed.
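The sizing questions above can be sketched as a simple capacity calculation. The figures below (conversations per GPU, GPUs per server) are illustrative placeholders only, not validated sizing data; the experimentally validated values appear in the documented configurations.

```python
import math

def servers_needed(simultaneous_conversations: int,
                   conversations_per_gpu: int = 8,
                   gpus_per_server: int = 4) -> int:
    """Hypothetical sizing sketch: round up to whole GPUs, then to whole
    servers, so the latency target can be held constant as load grows."""
    gpus = math.ceil(simultaneous_conversations / conversations_per_gpu)
    return math.ceil(gpus / gpus_per_server)

# Example: 100 simultaneous conversations -> 13 GPUs -> 4 servers
print(servers_needed(100))
```

Because the requests are independent, capacity scales close to linearly, which is what makes this kind of back-of-the-envelope estimate a reasonable starting point before validation.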
A cloud-native design based on Kubernetes containers addresses the operations and management requirements. The Kubernetes control plane itself consumes few compute resources. Kubernetes best practices dictate implementing the control plane on three independent control nodes. For the Kubernetes control nodes, we chose Dell PowerEdge R660 servers, which offer great value for limited-size management workloads while meeting the Kubernetes quorum requirement.
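The three-node requirement follows from how etcd, the Kubernetes datastore, maintains quorum: a majority of members must be available. A minimal sketch of the arithmetic:

```python
def quorum(n: int) -> int:
    """Majority of an n-member etcd cluster."""
    return n // 2 + 1

def failures_tolerated(n: int) -> int:
    """Control-plane node failures survivable while keeping quorum."""
    return n - quorum(n)

# A 2-node control plane tolerates no failures (quorum is 2 of 2),
# while 3 nodes tolerate 1 failure (quorum is 2 of 3) -- hence the
# three-node best practice.
print(failures_tolerated(2), failures_tolerated(3))
```

Adding a fourth node does not improve fault tolerance (quorum becomes 3 of 4, still tolerating one failure), which is why odd-sized control planes are the norm.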