Use Case 5 - Impact of Running Models with Different Quantization on the XE9680
This use case focuses on understanding the impact of running models with different quantizations on the XE9680. Specifically, we tested the Llama 3 70B model using both FP16 and FP8 quantization to compare their performance metrics.
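To put the FP16 versus FP8 comparison in context, a back-of-envelope calculation (not taken from the paper) shows why precision matters for a 70B-parameter model: halving the bytes per parameter halves the GPU memory needed just to hold the weights, before accounting for activations, KV cache, or runtime overhead.

```python
# Rough GPU memory required for the weights alone of a 70B-parameter model
# at different precisions. Activations, KV cache, and framework overhead
# are excluded; figures are illustrative, not measured on the XE9680.
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

params = 70e9  # Llama 3 70B

fp16 = weight_memory_gb(params, 2.0)  # FP16: 2 bytes per parameter
fp8 = weight_memory_gb(params, 1.0)   # FP8:  1 byte per parameter

print(f"FP16 weights: {fp16:.0f} GB")  # 140 GB
print(f"FP8 weights:  {fp8:.0f} GB")   # 70 GB
```

The smaller FP8 footprint leaves more GPU memory for the KV cache, which is one reason quantization can improve throughput in addition to reducing the number of GPUs required.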
Goals:
- Compare the performance metrics of the Llama 3 70B model under FP16 and FP8 quantization on the XE9680.
Note: This paper does not detail the effect of quantization on model weights or response accuracy. As a general rule, larger models provide more accurate responses.
Note: Unless otherwise stated, all use cases use the default NIM quantization (FP8).
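As a hedged sketch of how a non-default precision can be selected, the NVIDIA NIM container exposes model profiles that encode precision and parallelism choices; pinning one overrides the automatic (FP8) selection. The image tag, cache path, and profile ID below are illustrative placeholders, not values taken from this paper.

```shell
# Hypothetical sketch: launch the Llama 3 70B NIM container with an
# explicitly pinned model profile instead of the automatically selected
# default. Run the container's `list-model-profiles` utility first to see
# the profile IDs available on your hardware; the ID below is a placeholder.
docker run --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE="<profile-id-from-list-model-profiles>" \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-70b-instruct:latest
```

Once the container is up, inference requests go to the OpenAI-compatible endpoint on port 8000 regardless of which precision profile was selected, so the client side of a benchmark is unchanged between FP16 and FP8 runs.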