The RetinaNet models with a ResNet18 backbone were deployed in the DS-Triton end-to-end pipeline under several test configurations to demonstrate inference at scale. All configurations ran on one of the four NVIDIA T4 GPUs in the PowerEdge R7515 server and processed 1248 x 384-pixel, three-channel images. The following table presents the results, and Figure 6 depicts the throughput performance.
| Test | Precision | GPU ID | Runtime inference batch size | Sources | Model instances | Interval | FPS (avg) per source | FPS (avg) total | Speed increase from single-instance FP32 baseline |
|---|---|---|---|---|---|---|---|---|---|
| A | TensorRT-FP32 | 0 | 1 | 1 | 1 | 0 | 96.3 | 96 | Baseline |
| B | TensorRT-FP16 | 0 | 1 | 1 | 1 | 0 | 196.7 | 197 | ~2X |
| C | TensorRT-INT8 | 0 | 1 | 1 | 1 | 0 | 278.2 | 278 | ~3X |
| D | TensorRT-INT8 | 0 | 8 | 8 | 1 | 0 | 68.0 | 544 | ~5X |
| E | TensorRT-INT8 | 0 | 8 | 8 | 1 | 1 | 121.1 | 969 | ~10X |
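To show where the parameters in the table surface in a DeepStream pipeline, the following minimal Python sketch builds a single-source pipeline. It is an illustration only, not the test harness used for these results: the input file `sample_1248x384.h264` and the nvinferserver configuration file `triton_pgie_config.txt` are placeholders, and the precision, model instance count, and inference interval used in tests A through E would be set in the Triton and nvinferserver configuration files that the placeholder path stands in for.

```python
#!/usr/bin/env python3
# Minimal single-source DeepStream-Triton pipeline sketch (illustrative only;
# file names are placeholders, not the white paper's actual test application).
import gi

gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# nvstreammux batch-size maps to the "Runtime inference batch size" column and
# width/height to the 1248 x 384 input resolution. Each entry in the "Sources"
# column corresponds to one decode branch feeding one m.sink_<n> pad.
# nvinferserver hands inference to Triton; the precision variant (FP32/FP16/INT8),
# the model instance count, and the inference interval are configured in the
# Triton model configuration and the nvinferserver config file referenced below.
pipeline = Gst.parse_launch(
    "filesrc location=sample_1248x384.h264 ! h264parse ! nvv4l2decoder ! "
    "m.sink_0 nvstreammux name=m batch-size=1 width=1248 height=384 "
    "batched-push-timeout=40000 ! "
    "nvinferserver config-file-path=triton_pgie_config.txt ! "
    "nvvideoconvert ! nvdsosd ! fakesink sync=false"
)

pipeline.set_state(Gst.State.PLAYING)
loop = GLib.MainLoop()

# Stop the main loop on end-of-stream or error messages from the pipeline bus.
bus = pipeline.get_bus()
bus.add_signal_watch()
bus.connect("message::eos", lambda *_: loop.quit())
bus.connect("message::error", lambda *_: loop.quit())

try:
    loop.run()
finally:
    pipeline.set_state(Gst.State.NULL)
```

Scaling this sketch toward the eight-source tests (D and E) would mean adding one decode branch per stream, each linked to its own `m.sink_<n>` pad, and raising the nvstreammux batch size to match the number of sources.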
Each custom application has its own requirements, and the tuning process varies from case to case. For setting these parameters, NVIDIA recommends common practices for performance optimization (see DeepStream SDK: Best practices for performance optimization).
The following figure shows an example visualization produced by the engine running in the DeepStream-Triton framework.