Characterizing the model provides valuable insight into memory utilization, latency, and computational efficiency during inferencing, measured while varying the batch size (the number of chunks in which the tokens are processed) and the number of prompts (user inputs to the model, in the form of questions). We demonstrate deploying the Llama 2 7B chat model on a PowerEdge R760xa server with one NVIDIA A100 40GB GPU for inferencing. To measure latency and TFLOPS (tera floating-point operations per second) on the GPU, we used the DeepSpeed Flops Profiler.
| Component | Details |
|---|---|
| **Hardware** | |
| Compute server for inferencing | PowerEdge R760xa |
| GPU | NVIDIA A100 40GB PCIe CEM |
| Host processor model name | Intel(R) Xeon(R) Gold 6454S (Sapphire Rapids) |
| Host processors per node | 2 |
| Host processor core count | 32 |
| Host processor frequency | 2.2 GHz |
| Host memory capacity | 512 GB (16 x 32 GB 4800 MT/s DIMMs) |
| Host storage type | SSD |
| Host storage capacity | 900 GB |
| **Software** | |
| Operating system | Ubuntu 22.04.1 |
| Framework | |
| Profiler | DeepSpeed Flops Profiler |