Appendix

Maximizing Llama Open Source Model Inference Performance with Tensor Parallelism on a Dell XE9680 with H100s




Tensor parallelism is a parallelization technique that distributes the tensor operations of a model's layers across multiple GPUs, so that all of the GPUs work on the same layer at the same time. Each GPU processes only a slice of a tensor, and the full tensor is aggregated only for operations that require it. For example, a main building block of transformer models is a fully connected layer (nn.Linear) followed by a nonlinear GeLU activation. This block can be written as Y = GeLU(XA), where X is the input matrix, A is the weight matrix, and Y is the output matrix. The matrix multiplication XA can be split across multiple GPUs, as shown in Figure 17. Note, however, that tensor operations are not limited to multiplications and can be applied to tensors with more than two dimensions.
Figure 17. High-level illustration of two ways in which matrix multiplication (the most common tensor operation) is parallelized
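
The splitting of Y = GeLU(XA) can be made concrete with a small sketch. It is illustrative only: plain Python lists stand in for GPU tensors, each loop iteration stands in for one GPU, and the helper names (`column_parallel`, `row_parallel`, `n_shards`) are hypothetical rather than taken from any library. It assumes the two ways of parallelizing the multiplication are a split of A by columns and a split of A by rows, which are the commonly described variants, and it assumes the split dimension is divisible by the number of shards.

```python
import math

def gelu(x):
    # Exact GeLU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matmul(X, A):
    # Matrix product of nested lists: (rows x inner) @ (inner x cols).
    return [[sum(x * a for x, a in zip(row, col)) for col in zip(*A)] for row in X]

def apply_gelu(M):
    return [[gelu(v) for v in row] for row in M]

def column_parallel(X, A, n_shards):
    # Split A by columns. Each shard computes GeLU(X @ A_s) on its own,
    # because GeLU is element-wise; the outputs are then concatenated
    # column-wise (an all-gather on real hardware).
    step = len(A[0]) // n_shards
    outputs = []
    for s in range(n_shards):
        A_s = [row[s * step:(s + 1) * step] for row in A]
        outputs.append(apply_gelu(matmul(X, A_s)))
    return [sum((out[r] for out in outputs), []) for r in range(len(X))]

def row_parallel(X, A, n_shards):
    # Split A by rows and X by columns. Each shard produces a partial
    # product of the full output shape; the partials must be summed
    # (an all-reduce on real hardware) BEFORE GeLU, since GeLU is nonlinear.
    step = len(A) // n_shards
    partials = []
    for s in range(n_shards):
        X_s = [row[s * step:(s + 1) * step] for row in X]
        A_s = A[s * step:(s + 1) * step]
        partials.append(matmul(X_s, A_s))
    summed = [[sum(p[r][c] for p in partials) for c in range(len(A[0]))]
              for r in range(len(X))]
    return apply_gelu(summed)

def max_abs_diff(P, Q):
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))

# Arbitrary example values: X is 2 x 4, A is 4 x 4, split across 2 shards.
X = [[1.0, -2.0, 0.5, 3.0], [0.0, 1.5, -1.0, 2.0]]
A = [[0.1, 0.2, -0.3, 0.4], [0.5, -0.6, 0.7, 0.8],
     [-0.9, 1.0, 1.1, -1.2], [1.3, 1.4, -1.5, 1.6]]

reference = apply_gelu(matmul(X, A))   # single-device result
col_result = column_parallel(X, A, 2)
row_result = row_parallel(X, A, 2)
```

Both sharded results should agree with the single-device `reference` up to floating-point rounding. The difference between the two variants is where communication happens: the column split needs no communication until the outputs are gathered, while the row split needs a sum across shards before the activation can be applied.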