Deploying a Large Language Model (LLM) can be a complicated and time-consuming operation. Dell endeavors to simplify this process for our customers and to ensure the most efficient transition from development to deployment. It is also our goal to educate our customers and help them choose the right infrastructure for a particular workload. With this in mind, this whitepaper provides step-by-step guidance for deploying Llama 2 for inferencing in an on-premises data center and analyzes the memory utilization, latency, and efficiency of an LLM on a Dell platform.
We will demonstrate that the latency of the model is linearly related to the number of prompts when the number of prompts equals the maximum batch size. If the number of prompts is held constant and the batch size is increased, the latency varies within ±10% of its mean on a single A100 40GB GPU. Furthermore, the efficiency of the model plateaus beyond an inflection point in the number of prompts. Based on our studies, the GPU memory limit is reached at a batch size of 68.
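To illustrate how latency and GPU memory can be measured as the batch size (number of prompts) grows, the following minimal sketch uses the Hugging Face Transformers library with PyTorch. It assumes the meta-llama/Llama-2-7b-hf checkpoint and a single CUDA GPU; it is not the exact benchmarking harness used in this whitepaper, and the model variant, prompt, and token counts are placeholders chosen for illustration.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model variant; the study may use a different Llama 2 size or serving stack.
MODEL_ID = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token      # Llama 2 has no pad token by default
tokenizer.padding_side = "left"                # left padding for decoder-only generation

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

prompt = "Explain the benefits of on-premises LLM deployment."  # placeholder prompt

def measure(batch_size: int, max_new_tokens: int = 128):
    """Time one batched generate() call and report peak GPU memory."""
    batch = [prompt] * batch_size  # batch size == number of identical prompts
    inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    latency = time.perf_counter() - start

    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    return latency, peak_gib

# Sweep the batch size until GPU memory becomes the limiting factor.
for bs in (1, 2, 4, 8, 16, 32, 64):
    latency, peak_gib = measure(bs)
    print(f"batch={bs:<3d} latency={latency:6.2f}s peak_mem={peak_gib:5.1f} GiB")

Plotting the printed latencies against the batch size reproduces the kind of trend discussed above: latency grows roughly linearly with the number of prompts, and the sweep terminates with an out-of-memory error once the GPU memory limit is reached.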