Characterizing the model provides valuable insight into memory utilization, latency, and computational efficiency during inferencing, measured while varying the batch size (the number of chunks in which the tokens are processed) and the number of prompts (user inputs to the model, in the form of questions). We demonstrate deploying the Llama 2 7B chat model on a PowerEdge R760xa server with one NVIDIA A100 40GB GPU for inferencing. To measure latency and TFLOPS (tera floating-point operations per second) on the GPU, we used the DeepSpeed Flops Profiler.
| Component | Details |
|---|---|
| **Hardware** | |
| Compute server for inferencing | PowerEdge R760xa |
| GPU | NVIDIA A100 40GB PCIe CEM |
| Host processor model name | Intel(R) Xeon(R) Gold 6454S (Sapphire Rapids) |
| Host processors per node | 2 |
| Host processor core count | 32 |
| Host processor frequency | 2.2 GHz |
| Host memory capacity | 512 GB (16 x 32 GB 4800 MT/s DIMMs) |
| Host storage type | SSD |
| Host storage capacity | 900 GB |
| **Software** | |
| Operating system | Ubuntu 22.04.1 |
| Framework | |
| Profiler | DeepSpeed Flops Profiler |