Driving GenAI Advancements: Dell PowerEdge R760 with the Latest 5th Gen Intel® Xeon® Scalable Processors

Real-time chatbot inference use case
The workloads chosen for these tests were Meta's Llama 2 models and TII's Falcon model, all of which are available from HuggingFace. Llama 2 models come in a range of parameter sizes; the two sizes chosen for this evaluation are (1) https://huggingface.co/meta-llama/Llama-2-7b-hf and (2) https://huggingface.co/meta-llama/Llama-2-13b-hf. Both models were acquired under the license from Meta/HuggingFace. For the Falcon model, we chose the 40B-parameter variant: https://huggingface.co/tiiuae/falcon-40b.
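As a minimal sketch of how these checkpoints can be pulled from the HuggingFace Hub with the `transformers` library: the `MODEL_IDS` mapping below lists exactly the three repositories named above, while the `load_model` helper and the `HF_TOKEN` environment variable (used for the gated Llama 2 weights) are illustrative assumptions, not part of the paper's methodology.

```python
# Model repositories evaluated in this paper, keyed by a short alias.
MODEL_IDS = {
    "llama2-7b": "meta-llama/Llama-2-7b-hf",
    "llama2-13b": "meta-llama/Llama-2-13b-hf",
    "falcon-40b": "tiiuae/falcon-40b",
}

def load_model(name: str):
    """Download and load one of the evaluated models.

    Requires the `transformers` package; the Llama 2 repositories are
    gated, so Meta's license must be accepted on the Hub and an access
    token supplied (assumed here via an HF_TOKEN environment variable).
    """
    # Imported lazily so the module can be used without transformers installed.
    import os
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = MODEL_IDS[name]
    token = os.environ.get("HF_TOKEN")  # needed for the gated Llama 2 weights
    tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
    model = AutoModelForCausalLM.from_pretrained(model_id, token=token)
    return tokenizer, model
```

Note that loading the 40B-parameter Falcon model requires substantial host memory; the multi-node configurations discussed below address exactly this.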
The objective is to showcase LLM next-token latency meeting the market requirement of <100 ms for various models across different precisions, using 1, 2, or 4 nodes.
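A simple way to check a model against the <100 ms next-token target is to time each decode step individually. The harness below is a sketch using only the standard library; `generate_next_token` is a hypothetical stand-in for a single decode step of a real model (e.g. a wrapper around a generate call with one new token), and the median-based pass/fail rule is our assumption for smoothing out warm-up outliers.

```python
import time
import statistics

def next_token_latencies_ms(generate_next_token, prompt, n_tokens=32):
    """Time each successive token produced by a generation callable.

    `generate_next_token` takes the running text and returns the next
    token as a string; latencies are returned in milliseconds.
    """
    text = prompt
    latencies = []
    for _ in range(n_tokens):
        start = time.perf_counter()
        token = generate_next_token(text)
        latencies.append((time.perf_counter() - start) * 1000.0)  # ms
        text += token
    return latencies

def meets_market_requirement(latencies_ms, threshold_ms=100.0):
    # Compare the median per-token latency against the <100 ms target;
    # the median is robust to first-token / warm-up spikes.
    return statistics.median(latencies_ms) < threshold_ms

# Example with a trivial dummy "model" standing in for a real decode step.
dummy = lambda text: " x"
lat = next_token_latencies_ms(dummy, "Hello", n_tokens=8)
print(meets_market_requirement(lat))
```

The same harness applies unchanged whether the decode step runs on one node or is served by a 2- or 4-node tensor-parallel deployment, since only the callable changes.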
In addition to inference, CPUs can efficiently handle fine-tuning of AI models by leveraging multi-node distributed frameworks.
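For illustration only, a multi-node fine-tuning job of this kind is typically launched with a tool such as PyTorch's `torchrun`; the node count, hostnames, and the `finetune.py` script name below are hypothetical placeholders, not the paper's actual configuration.

```shell
# Run on each of 4 CPU nodes; --node_rank differs per node (0..3).
# finetune.py is a hypothetical training script using torch.distributed
# with the gloo backend, the usual choice for CPU-only clusters.
torchrun \
  --nnodes=4 \
  --nproc_per_node=1 \
  --node_rank=0 \
  --master_addr=head-node.example \
  --master_port=29500 \
  finetune.py --model meta-llama/Llama-2-7b-hf
```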