Use Case 1 - Single Model Baseline Test
To start the scaling process, we must first establish a baseline. The baseline deploys a single instance of each model on a single server and measures its performance in isolation. The results serve as a reference point for evaluating the impact of scaling and different configurations on model performance. We begin with a single PowerEdge R760xa and a single PowerEdge XE9680, running Llama 3 8B and Llama 3 70B to test the performance of each model.
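As a concrete illustration of what a baseline measurement looks like, the sketch below times sequential requests against a model served by Triton Inference Server. This is a minimal example rather than the benchmarking harness used in this paper: the endpoint URL, model name, and tensor names are assumptions (they follow the TensorRT-LLM backend's ensemble convention) and must be adapted to the actual deployment.

```python
import time
import numpy as np
import tritonclient.http as httpclient

# Hypothetical endpoint and model name; adjust to your deployment.
TRITON_URL = "localhost:8000"
MODEL_NAME = "ensemble"  # common name in TensorRT-LLM backend examples

client = httpclient.InferenceServerClient(url=TRITON_URL)

def infer_once(prompt: str, max_tokens: int = 128) -> str:
    # Tensor names follow the tensorrtllm_backend ensemble convention;
    # verify them against your model repository.
    text = httpclient.InferInput("text_input", [1, 1], "BYTES")
    text.set_data_from_numpy(np.array([[prompt.encode()]], dtype=object))
    tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
    tokens.set_data_from_numpy(np.array([[max_tokens]], dtype=np.int32))
    result = client.infer(MODEL_NAME, inputs=[text, tokens])
    return result.as_numpy("text_output").flatten()[0].decode()

# Baseline: sequential requests, per-request latency.
latencies = []
for _ in range(20):
    start = time.perf_counter()
    infer_once("Summarize the benefits of GPU inference in one sentence.")
    latencies.append(time.perf_counter() - start)

print(f"mean latency: {np.mean(latencies):.3f} s, "
      f"p95: {np.percentile(latencies, 95):.3f} s")
```

A full benchmark would also sweep request concurrency and input/output lengths; sequential single requests are only the starting point.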
Goals:
XE9680-1xH100-Llama3-8B – This test consists of a single XE9680 with a single instance of Llama 3 8B running on a single H100 GPU. The other 7 H100 GPUs in the chassis are not used for this test. This test uses FP16.
XE9680-4xH100-Llama3-70B – This test consists of a single XE9680 with a single instance of Llama 3 70B running on 4 H100 GPUs. The other 4 H100 GPUs in the chassis are not used for this test. This test uses FP16.
R760xa-1xL40S-Llama3-8B – This test consists of a single R760xa with a single instance of Llama 3 8B running on an L40S GPU. The other 3 L40S GPUs in the chassis are not used for this test. This test uses FP16.
R760xa-4xL40S-Llama3-70B – This test consists of a single R760xa with a single instance of Llama 3 70B running on 4 L40S GPUs. This test uses FP8. The 70B model is too large to fit in the memory of a single L40S, so it must be partitioned across multiple GPUs; Triton and TensorRT-LLM provide a mechanism that enables a large model to be hosted by multiple GPU devices working in concert (see the sketches after this list). The GPUs that host a model must reside on the same node; combining GPUs that reside on separate nodes is not covered in this guide. FP8 on the H100 is covered in use case 5.
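To see why the 70B model must span multiple L40S GPUs, a back-of-the-envelope weight-footprint calculation is enough. The 48 GB figure is the L40S memory capacity; KV cache and activations consume additional memory on top of the weights, which is why this test uses 4 GPUs rather than the bare minimum the arithmetic suggests.

```python
L40S_MEM_GB = 48  # memory capacity of one L40S GPU
PARAMS_B = 70     # Llama 3 70B parameter count, in billions

# Weights only: one billion parameters at one byte each is roughly 1 GB.
for precision, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    weights_gb = PARAMS_B * bytes_per_param
    min_gpus = -(-weights_gb // L40S_MEM_GB)  # ceiling division
    print(f"{precision}: ~{weights_gb} GB of weights -> "
          f"at least {min_gpus} L40S GPU(s) for the weights alone")

# FP16: ~140 GB -> 3 GPUs minimum; FP8: ~70 GB -> 2 GPUs minimum.
# KV cache, activations, and headroom push the practical requirement
# to the 4-GPU tensor-parallel configuration used in this test.
```

For reference, the sketch below shows one way such a 4-way tensor-parallel FP8 engine might be produced with TensorRT-LLM before handing it to Triton. The script paths and flags mirror the TensorRT-LLM Llama examples but change between releases, so treat every path and option here as an assumption to verify against the installed version.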
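```python
import subprocess

# All paths below are hypothetical placeholders.
HF_MODEL = "/models/Meta-Llama-3-70B-Instruct"
CKPT_DIR = "/models/llama3-70b-fp8-tp4-ckpt"
ENGINE_DIR = "/models/llama3-70b-fp8-tp4-engine"

# Step 1: quantize to FP8 and shard the checkpoint 4 ways
# (tensor parallelism) so each L40S holds one quarter of the weights.
subprocess.run(
    ["python", "examples/quantization/quantize.py",
     "--model_dir", HF_MODEL,
     "--output_dir", CKPT_DIR,
     "--qformat", "fp8",
     "--tp_size", "4"],
    check=True,
)

# Step 2: compile the sharded checkpoint into per-rank TensorRT engines
# that the Triton TensorRT-LLM backend can load.
subprocess.run(
    ["trtllm-build",
     "--checkpoint_dir", CKPT_DIR,
     "--output_dir", ENGINE_DIR],
    check=True,
)
```

At serve time, the TensorRT-LLM backend launches one rank per GPU and coordinates them so the four L40S GPUs work in concert on every request.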