Key takeaways from these test results:
- The optimized end-to-end stack with Kubeflow provides an efficient, scalable way to run enterprise generative AI use cases on Intel® hardware.
- Llama2-7B, at up to batch size 16 with 2048 input tokens, meets the 100 ms market requirement for next-token latency on a single Dell R760 node with 5th Gen Intel® Xeon® Scalable Processors.
- Llama2-13B, at up to batch size 16 with 2048 input tokens, stays well below the 100 ms market requirement for next-token latency on four Dell R760 nodes with 5th Gen Intel® Xeon® Scalable Processors.
- Falcon-40B, at up to batch size 4 with 1024 input tokens, meets the 100 ms market requirement for next-token latency on four Dell R760 nodes with 5th Gen Intel® Xeon® Scalable Processors. Running this model on a single node is currently not feasible.
- A Bio-GPT model can be fine-tuned in ~16.23 minutes on four Dell R760 nodes with 5th Gen Intel® Xeon® Scalable Processors.
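
To make the 100 ms figure concrete, here is a minimal Python sketch of what that budget implies at the batch sizes listed above. It is an illustration only: the function names are hypothetical and not part of the tested stack, and it assumes each decode step emits one token per sequence in the batch. It uses no measured latencies, only the budget and batch sizes from the takeaways.

```python
# Next-token latency budget cited in the takeaways, in milliseconds.
LATENCY_BUDGET_MS = 100.0


def meets_budget(next_token_latency_ms: float,
                 budget_ms: float = LATENCY_BUDGET_MS) -> bool:
    """Return True when a measured next-token latency is within the budget."""
    return next_token_latency_ms <= budget_ms


def min_aggregate_tokens_per_s(batch_size: int,
                               budget_ms: float = LATENCY_BUDGET_MS) -> float:
    """Lower bound on aggregate decode throughput when the budget is met.

    Assumption (not from the test report): every decode step produces
    exactly one token for each sequence in the batch.
    """
    return batch_size * 1000.0 / budget_ms


# Batch sizes taken from the takeaways; no latency values are assumed here.
max_batch_sizes = {"Llama2-7B": 16, "Llama2-13B": 16, "Falcon-40B": 4}

for model, batch_size in max_batch_sizes.items():
    floor = min_aggregate_tokens_per_s(batch_size)
    print(f"{model}: at least {floor:.0f} tokens/s at batch size {batch_size}")
```

Under that assumption, meeting the 100 ms budget at batch size 16 corresponds to at least 160 generated tokens per second across the batch, and at batch size 4 to at least 40 tokens per second. Each individual request still receives a token at most every 100 ms.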