The key takeaways from this experiment with the open-source Llama 2 7B-parameter model include:
- The results demonstrate that users must tune the number of concurrent prompts to stay within their latency budget, since the accelerator reaches its maximum throughput at 100 prompts (a minimal benchmarking sketch follows this list).
- Model latency scales almost perfectly linearly with the number of prompts.
- As the number of prompts increased, the average FLOPS consumed by the model rose asymptotically toward a ceiling set by the underlying hardware, after which computational efficiency remained nearly constant.
- Varying the batch size while holding the number of prompts constant had no effect on the model's latency or efficiency.
- The previous-generation NVIDIA A100 GPU, based on the Ampere architecture, remains viable for Llama 2 7B inference.
- A single A100 GPU was sufficient to achieve 139 tokens per second.
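
To reproduce the latency and throughput trends above, a minimal benchmarking sketch along these lines can be used. It assumes a CUDA-capable GPU (e.g., an A100), the Hugging Face `transformers` library, and access to the gated `meta-llama/Llama-2-7b-hf` checkpoint; the prompt text, token counts, and prompt-count sweep are illustrative choices, not the exact configuration used in the experiment.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # assumes access to this gated checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token
tokenizer.padding_side = "left"  # decoder-only models should be left-padded for generation
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")


def measure(num_prompts: int, max_new_tokens: int = 64) -> tuple[float, float]:
    """Return (latency in seconds, aggregate tokens/s) for a batch of identical prompts."""
    prompts = ["Explain GPU inference in one sentence."] * num_prompts
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    latency = time.perf_counter() - start
    # Upper-bound token count: sequences that hit EOS early generate fewer tokens.
    new_tokens = (out.shape[-1] - inputs["input_ids"].shape[-1]) * num_prompts
    return latency, new_tokens / latency


# Sweep the number of prompts to observe latency scaling and throughput saturation.
for n in (1, 10, 50, 100, 200):
    latency, tps = measure(n)
    print(f"{n:>4} prompts: {latency:6.2f} s, {tps:7.1f} tokens/s")
```

Plotting latency against the prompt count from a sweep like this should show the near-linear relationship noted above, while aggregate tokens/s flattens once the accelerator saturates.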