In Table 2, we show the two models used in our tensor parallelism experiments. The two models we chose span a variety of characteristics:
| Model | Parameters | Context | Quantization | Memory required for parameters | Minimum # of H100 GPUs (80 GB HBM3 each) | Attention |
|---|---|---|---|---|---|---|
| Llama-3 70B | 70B | 8k tokens | FP16 | ~140 GB | 2 | GQA |
| Llama-2 13B | 13B | 4k tokens | FP16 | ~26 GB | 1 | MHA |
The minimum number of H100s needed is determined by the memory required for parameters, activations, and KV cache. Each FP16 parameter requires two bytes of memory, and each additional GPU contributes 80 GB to the aggregate memory pool.
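As a rough illustration of the parameter-memory portion of this sizing, the sketch below computes a lower bound on GPU count from parameter memory alone; it does not model the activation or KV cache overhead, which depends on context length and batch size.

```python
import math

def min_gpus_for_parameters(num_params_billion: float,
                            bytes_per_param: int = 2,   # FP16
                            gpu_memory_gb: int = 80) -> int:
    """Lower bound on GPU count from parameter memory alone.

    Ignores activations and KV cache, so the practical minimum can be higher.
    """
    param_memory_gb = num_params_billion * 1e9 * bytes_per_param / 1e9
    return math.ceil(param_memory_gb / gpu_memory_gb)

print(min_gpus_for_parameters(70))  # Llama-3 70B -> ~140 GB -> 2 GPUs
print(min_gpus_for_parameters(13))  # Llama-2 13B -> ~26 GB  -> 1 GPU
```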
In Table 3 and Table 4, we show the tests we ran against each of the models.
We tested Llama-3 70B FP16 inference with an 8k token context (4k input, 4k output) and Llama-2 13B FP16 inference with a 4k token context (2k input, 2k output). To compare the scaling of throughput and efficiency across different tensor parallelism counts, we plot the TensorRT-LLM results for 1) total throughput and 2) throughput efficiency per GPU at various batch sizes, up to and including the largest batch size that fits in memory. We show these results for two scenarios: one in which we consider only results with a Time To First Token (TTFT) of less than 2 seconds, and one with no TTFT constraint.
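A minimal sketch of the per-GPU efficiency metric, assuming it is simply total throughput divided by the TP degree; the throughput numbers below are illustrative, not measured results.

```python
def throughput_per_gpu(total_throughput_tokens_per_s: float, tp_degree: int) -> float:
    """Throughput efficiency per GPU: total throughput divided by the number of GPUs in the TP group."""
    return total_throughput_tokens_per_s / tp_degree

# Illustrative only: if TP=8 delivered the same total throughput as TP=2,
# its per-GPU efficiency would be 4x lower.
print(throughput_per_gpu(3000.0, 2))  # 1500.0 tokens/s per GPU
print(throughput_per_gpu(3000.0, 8))  # 375.0 tokens/s per GPU
```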
Because a model requires the accelerators to have enough high bandwidth memory (HBM) for its parameters and KV cache, larger models must use TP across more than one GPU. For the Llama-3 70B model, this minimum is 2 H100 GPUs. We therefore ran Llama-3 70B with tensor parallelism of 2, 4, and 8, both because of the model size and because the number of KV heads (8 for Llama-3 70B) must be divisible by the number of GPUs. Meanwhile, since Llama-2 13B fits on a single GPU, we tested it with tensor parallelism of 1, 2, 4, and 8.
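The memory minimum and the KV-head divisibility rule together determine which TP degrees are valid on the 8-GPU XE9680. A minimal sketch of that check follows (the 40-head figure for Llama-2 13B's MHA is an assumption, not stated above):

```python
def valid_tp_degrees(num_kv_heads: int, min_gpus: int) -> list[int]:
    """TP degrees on an 8-GPU server that meet the memory minimum and evenly divide the KV heads."""
    return [tp for tp in (1, 2, 4, 8) if tp >= min_gpus and num_kv_heads % tp == 0]

print(valid_tp_degrees(num_kv_heads=8, min_gpus=2))   # Llama-3 70B (GQA, 8 KV heads) -> [2, 4, 8]
print(valid_tp_degrees(num_kv_heads=40, min_gpus=1))  # Llama-2 13B (MHA, assumed 40 heads) -> [1, 2, 4, 8]
```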
While a larger TP count increases pooled memory and brings additional compute power to bear, a high degree of model parallelism leads to smaller matrix multiplications (General Matrix Multiplications, or GEMMs), which can decrease GPU utilization. Our tests show empirically the combined net effect of all of these factors.
| Model | Time Constraint | Tensor Parallelism | Context | Batch Size |
|---|---|---|---|---|
| Llama-3 70B | Time To First Token (TTFT) < 2 s | TP = 2, 4, 8 GPUs | 4k in, 4k out | Up to and including the largest batch size that fits in memory |
| Llama-3 70B | No constraint | TP = 2, 4, 8 GPUs | 4k in, 4k out | Up to and including the largest batch size that fits in memory |
| Model | Time Constraint | Tensor Parallelism | Context | Batch Size |
|---|---|---|---|---|
| Llama-2 13B | Time To First Token (TTFT) < 2 s | TP = 1, 2, 4, 8 GPUs | 2k in, 2k out | Up to and including the largest batch size that fits in memory |
| Llama-2 13B | No constraint | TP = 1, 2, 4, 8 GPUs | 2k in, 2k out | Up to and including the largest batch size that fits in memory |