Introduction
Meta and Microsoft released Llama 2, an open-source LLM, to the public for research and commercial use[1]. Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The fine-tuned versions use Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to align to human preferences for helpfulness and safety. The model was pretrained on 2 trillion tokens of data from publicly available sources.
This release includes model weights and starting code for pretrained and fine-tuned Llama 2 language models with 7B (billion), 13B, and 70B parameters. The following table provides further detail about the models.
| Models | Fine-tuned models | Parameters |
|---|---|---|
| Llama 2-7B | Llama 2-7B-chat | 7B |
| Llama 2-13B | Llama 2-13B-chat | 13B |
| Llama 2-70B | Llama 2-70B-chat | 70B |
To run these models for inferencing, the 7B model requires one GPU, the 13B model requires two GPUs, and the 70B model requires eight GPUs. These model parallel (MP) values are set while the model is being built[2].
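Because the 7B model fits on a single GPU, it can be served with standard tooling. The following minimal sketch shows one common way to run single-GPU inference with the Hugging Face transformers library; the model ID (`meta-llama/Llama-2-7b-chat-hf`), half-precision setting, and generation parameters are illustrative assumptions and not part of the reference release described above.

```python
# Minimal single-GPU inference sketch, assuming the transformers and torch
# packages are installed and access to the gated Llama 2 checkpoint has been
# granted on the Hugging Face Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed model ID for the 7B chat model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision keeps the 7B weights within a single GPU's memory
    device_map="auto",          # place the model on the available GPU
)

prompt = "Explain what model parallelism means for large language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a short completion; the token budget here is illustrative only.
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For the 13B and 70B models, the weights must be sharded across multiple GPUs to match the MP values noted above, which multi-GPU launchers such as torchrun handle in the reference implementation[2].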