Driving GenAI Advancements: Dell PowerEdge R760 with the Latest 5th Gen Intel® Xeon® Scalable Processors

Real-time chatbot inference use case
The workloads chosen for these tests were Meta's Llama 2 models and TII's Falcon model, all of which are available from HuggingFace. Llama 2 models come in a range of parameter sizes; the two sizes chosen for this evaluation are (1) https://huggingface.co/meta-llama/Llama-2-7b-hf and (2) https://huggingface.co/meta-llama/Llama-2-13b-hf. Both models were acquired under the license from Meta/HuggingFace. For the Falcon model, we chose the 40B-parameter variant: https://huggingface.co/tiiuae/falcon-40b.
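As a minimal sketch of how these checkpoints can be pulled from the HuggingFace Hub with the `transformers` library: the `MODEL_IDS` mapping below lists exactly the three repositories named above, while the `load_model` helper and the `HF_TOKEN` environment variable (used for the gated Llama 2 weights) are illustrative assumptions, not part of the paper's methodology.

```python
# Model repositories evaluated in this paper, keyed by a short alias.
MODEL_IDS = {
    "llama2-7b": "meta-llama/Llama-2-7b-hf",
    "llama2-13b": "meta-llama/Llama-2-13b-hf",
    "falcon-40b": "tiiuae/falcon-40b",
}

def load_model(name: str):
    """Download and load one of the evaluated models.

    Requires the `transformers` package; the Llama 2 repositories are
    gated, so Meta's license must be accepted on the Hub and an access
    token supplied (assumed here via an HF_TOKEN environment variable).
    """
    # Imported lazily so the module can be used without transformers installed.
    import os
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = MODEL_IDS[name]
    token = os.environ.get("HF_TOKEN")  # needed for the gated Llama 2 weights
    tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
    model = AutoModelForCausalLM.from_pretrained(model_id, token=token)
    return tokenizer, model
```

Note that loading the 40B-parameter Falcon model requires substantial host memory; the multi-node configurations discussed below address exactly this.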
The objective is to showcase LLM next-token latency meeting the market requirement of <100 ms for various models across different precisions, using 1, 2, or 4 nodes.
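A simple way to check a model against the <100 ms next-token target is to time each decode step individually. The harness below is a sketch using only the standard library; `generate_next_token` is a hypothetical stand-in for a single decode step of a real model (e.g. a wrapper around a generate call with one new token), and the median-based pass/fail rule is our assumption for smoothing out warm-up outliers.

```python
import time
import statistics

def next_token_latencies_ms(generate_next_token, prompt, n_tokens=32):
    """Time each successive token produced by a generation callable.

    `generate_next_token` takes the running text and returns the next
    token as a string; latencies are returned in milliseconds.
    """
    text = prompt
    latencies = []
    for _ in range(n_tokens):
        start = time.perf_counter()
        token = generate_next_token(text)
        latencies.append((time.perf_counter() - start) * 1000.0)  # ms
        text += token
    return latencies

def meets_market_requirement(latencies_ms, threshold_ms=100.0):
    # Compare the median per-token latency against the <100 ms target;
    # the median is robust to first-token / warm-up spikes.
    return statistics.median(latencies_ms) < threshold_ms

# Example with a trivial dummy "model" standing in for a real decode step.
dummy = lambda text: " x"
lat = next_token_latencies_ms(dummy, "Hello", n_tokens=8)
print(meets_market_requirement(lat))
```

The same harness applies unchanged whether the decode step runs on one node or is served by a 2- or 4-node tensor-parallel deployment, since only the callable changes.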
In addition to inference, CPUs can efficiently handle fine-tuning of AI models by leveraging multi-node distributed frameworks.
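For illustration only, a multi-node fine-tuning job of this kind is typically launched with a tool such as PyTorch's `torchrun`; the node count, hostnames, and the `finetune.py` script name below are hypothetical placeholders, not the paper's actual configuration.

```shell
# Run on each of 4 CPU nodes; --node_rank differs per node (0..3).
# finetune.py is a hypothetical training script using torch.distributed
# with the gloo backend, the usual choice for CPU-only clusters.
torchrun \
  --nnodes=4 \
  --nproc_per_node=1 \
  --node_rank=0 \
  --master_addr=head-node.example \
  --master_port=29500 \
  finetune.py --model meta-llama/Llama-2-7b-hf
```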