The second use case involves running multiple instances of the Llama 3 8B model on the PowerEdge XE9680 and R760xa servers. The goal is to measure latency and throughput as the number of instances increases, while collecting system metrics such as CPU, GPU, memory, and network utilization. This helps characterize the performance impact of scaling up the number of model instances on each server.
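The latency and throughput comparison described above can be sketched as a small aggregation step: given the per-request latencies and total wall time observed for each instance count, compute summary statistics and requests per second. This is a minimal illustrative sketch, not the benchmark harness used in this paper; the function name, percentile choice, and sample numbers are assumptions for illustration.

```python
import statistics

def summarize(latencies_s, wall_time_s):
    """Aggregate per-request latencies (seconds) into the kind of
    metrics this use case reports: median/p95 latency and throughput."""
    lat_sorted = sorted(latencies_s)
    p50 = statistics.median(lat_sorted)
    # Nearest-rank p95 over the sorted sample (illustrative choice)
    p95 = lat_sorted[int(0.95 * (len(lat_sorted) - 1))]
    throughput = len(latencies_s) / wall_time_s  # requests per second
    return {"p50_s": p50, "p95_s": p95, "rps": throughput}

# Synthetic numbers for illustration only: one instance vs. two instances
one_instance = summarize([0.8, 0.9, 1.0, 1.1, 1.2], wall_time_s=5.0)
two_instances = summarize(
    [0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.2, 1.3, 1.3, 1.4], wall_time_s=5.5
)
```

In a real run, the latency lists would come from timing each request against the deployed model instances, and the same summary would be produced at each instance count to show how per-request latency and aggregate throughput trade off as instances are added.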
Goals