Use Case 4 - Running Multiple Different Models on the XE9680
In this use case, we run multiple different models on a single XE9680 server: one Llama 3 70B model deployed across four H100 GPUs and four Llama 3 8B models, each on a single H100 GPU. This configuration fully utilizes the eight GPUs available on the XE9680, serving a total of five models concurrently. The objective is to gather throughput and latency metrics and compare them against baseline results in which a single model runs on the XE9680.
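One way to realize this layout is to pin each model server to its own GPU subset via the CUDA_VISIBLE_DEVICES environment variable. The sketch below is a minimal illustration, assuming vLLM's OpenAI-compatible `vllm serve` CLI and the Hugging Face model IDs `meta-llama/Meta-Llama-3-70B-Instruct` and `meta-llama/Meta-Llama-3-8B-Instruct`; the serving stack actually benchmarked in the white paper may differ, but the GPU-partitioning approach is the same.

```python
import os
import subprocess

# Illustrative sketch: launch five model servers on one XE9680,
# pinning each to a disjoint set of H100 GPUs. Assumes vLLM is
# installed; the paper's actual serving stack may differ.

# (model_id, gpu_ids, port, tensor_parallel_size)
deployments = [
    ("meta-llama/Meta-Llama-3-70B-Instruct", "0,1,2,3", 8000, 4),
    ("meta-llama/Meta-Llama-3-8B-Instruct",  "4",       8001, 1),
    ("meta-llama/Meta-Llama-3-8B-Instruct",  "5",       8002, 1),
    ("meta-llama/Meta-Llama-3-8B-Instruct",  "6",       8003, 1),
    ("meta-llama/Meta-Llama-3-8B-Instruct",  "7",       8004, 1),
]

processes = []
for model, gpus, port, tp in deployments:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = gpus  # restrict this server to its GPUs
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),
        "--port", str(port),
    ]
    processes.append(subprocess.Popen(cmd, env=env))

# Block until the servers exit (Ctrl+C to stop all of them).
for p in processes:
    p.wait()
```

Setting `--tensor-parallel-size 4` for the 70B model matches its four-GPU allocation, while each 8B instance sees exactly one GPU. The five servers expose independent endpoints (ports 8000 through 8004 in this sketch), so a load generator can target each one separately when collecting per-model throughput and latency.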
Goals:
Demonstrate that running multiple different models on a single XE9680 server enables efficient utilization of GPU resources without sacrificing performance. This capability is critical for applications requiring diverse model deployments, such as code generation and question-answering systems. By managing resources effectively and maintaining consistent performance metrics, the XE9680 can serve as a robust platform for advanced AI workloads.