Use Case 4 - Running Multiple Different Models on the XE9680
In this use case, we run multiple different models on a single XE9680 server: one Llama 3 70B model deployed across four H100 GPUs and four Llama 3 8B models, each on a single H100 GPU. This configuration fully utilizes the eight GPUs available on the XE9680, serving a total of five models concurrently. The objective is to gather throughput and latency metrics and compare them against baseline results in which a single model runs on the XE9680.
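One way to realize this layout is to pin each model server to its own GPU subset via the CUDA_VISIBLE_DEVICES environment variable. The sketch below is a minimal illustration, assuming vLLM's OpenAI-compatible `vllm serve` CLI and the Hugging Face model IDs `meta-llama/Meta-Llama-3-70B-Instruct` and `meta-llama/Meta-Llama-3-8B-Instruct`; the serving stack actually benchmarked in the white paper may differ, but the GPU-partitioning approach is the same.

```python
import os
import subprocess

# Illustrative sketch: launch five model servers on one XE9680,
# pinning each to a disjoint set of H100 GPUs. Assumes vLLM is
# installed; the paper's actual serving stack may differ.

# (model_id, gpu_ids, port, tensor_parallel_size)
deployments = [
    ("meta-llama/Meta-Llama-3-70B-Instruct", "0,1,2,3", 8000, 4),
    ("meta-llama/Meta-Llama-3-8B-Instruct",  "4",       8001, 1),
    ("meta-llama/Meta-Llama-3-8B-Instruct",  "5",       8002, 1),
    ("meta-llama/Meta-Llama-3-8B-Instruct",  "6",       8003, 1),
    ("meta-llama/Meta-Llama-3-8B-Instruct",  "7",       8004, 1),
]

processes = []
for model, gpus, port, tp in deployments:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = gpus  # restrict this server to its GPUs
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),
        "--port", str(port),
    ]
    processes.append(subprocess.Popen(cmd, env=env))

# Block until the servers exit (Ctrl+C to stop all of them).
for p in processes:
    p.wait()
```

Setting `--tensor-parallel-size 4` for the 70B model matches its four-GPU allocation, while each 8B instance sees exactly one GPU. The five servers expose independent endpoints (ports 8000 through 8004 in this sketch), so a load generator can target each one separately when collecting per-model throughput and latency.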
Goals:
Demonstrate that running multiple different models on a single XE9680 server enables efficient utilization of GPU resources without sacrificing performance. This capability is critical for applications requiring diverse model deployments, such as code generation and question-answering systems. By managing resources effectively and maintaining consistent performance metrics, the XE9680 can serve as a robust platform for advanced AI workloads.