This white paper presents the throughput obtained when running a RetinaNet object detection network on an NVIDIA A30 GPU in a Dell PowerEdge R7515 server. The A30 is part of NVIDIA's Ampere line of GPUs. A single A30 can deliver 130 tera operations per second (TOPS) of 8-bit integer performance and has a memory bandwidth of 933 GB/sec. The object detection experiments are set up using the procedures detailed in a previous paper, Developing and Deploying Vision AI with Dell and NVIDIA Metropolis (delltechnologies.com).
First, a pre-trained RetinaNet model with an EfficientNet_B0 backbone is fine-tuned on more than 7,000 traffic images from the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) data set using the NVIDIA TAO Toolkit, a low-code AI model development tool. In our experiments, we fine-tune the model using only the original KITTI data set, without creating additional augmented images. Training of the unpruned model continues until the validation loss stops decreasing and instead begins to increase because of overfitting.
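The TAO Toolkit applies this stopping criterion inside its own training loop, so no user code is needed; the short sketch below only illustrates the idea of stopping once validation loss turns upward. The should_stop helper and the patience value of three epochs are assumptions for this sketch, not part of the TAO Toolkit API.

```python
# Illustrative only: a generic early-stopping check on validation loss.
# TAO's trainer applies an equivalent criterion internally; the function
# name and the patience value below are assumptions for this sketch.

def should_stop(val_losses, patience=3):
    """Stop once validation loss has not improved for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    # If none of the recent epochs improved on the earlier best,
    # the model has started to overfit the training images.
    return min(recent) >= best_so_far

# Example: loss decreases, then climbs for three epochs -> stop.
history = [0.92, 0.71, 0.58, 0.55, 0.57, 0.60, 0.63]
print(should_stop(history))  # True
```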
We then perform optional cycles of pruning to reduce the model size, followed by retraining to recover accuracy. In this example, the model reached 84% mean Average Precision (mAP) across the four target classes (bicycle, car, person, and road sign) at the point where fine-tuning and pruning were stopped. We then optimized the model for inference using the built-in optimization in the TAO Toolkit, which is powered by TensorRT. Depending on the use case, the model can be converted for 32-bit floating-point, 16-bit floating-point, or 8-bit integer inference. We chose 8-bit integer here because it yields the highest inference throughput, while the accuracy loss from 8-bit quantization is minimized by having used Quantization Aware Training (QAT).
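The TAO Toolkit performs this conversion through its built-in export path, so no hand-written TensorRT code is required. The sketch below only illustrates the general TensorRT Python API pattern for building an INT8 engine; the ONNX intermediate and the file names are assumptions for illustration, not the exact TAO workflow.

```python
# Illustrative TensorRT INT8 engine build (not the exact TAO export path).
# Assumes TensorRT 8.x and a QAT model exported to ONNX ("retinanet_qat.onnx"
# is a hypothetical file name). With QAT, the quantization scales travel in
# the model's Q/DQ nodes, so no separate calibration data set is required.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_int8_engine(onnx_path):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)   # enable 8-bit integer kernels

    # Returns a serialized engine that DeepStream/Triton can load.
    return builder.build_serialized_network(network, config)

if __name__ == "__main__":
    plan = build_int8_engine("retinanet_qat.onnx")
    with open("retinanet_int8.engine", "wb") as f:
        f.write(plan)
```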
The 8-bit inference engine is then replicated as needed, and the engine instances are deployed on the Dell PowerEdge R7515 server through NVIDIA's DeepStream SDK (v6.0). Using Triton and feeding eight video streams simultaneously, a total throughput of more than 800 frames per second of object detection inference is achieved. When inference is performed on every fourth frame of each stream, the total frames per second passed to downstream tracking grows to more than 2,200. The overall process is shown in the following figure; a standalone Triton client sketch follows Figure 1.
Figure 1. An end-to-end training-to-deployment pipeline for object detection
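In the measured configurations, DeepStream feeds decoded frames to the engine (or to Triton) directly, and perf_analyzer drives the standalone Triton numbers in Table 1. As a rough illustration of how the deployed model could also be exercised from a standalone Python client, the sketch below sends one batch of eight preprocessed frames over gRPC. The model name, input tensor name, and input resolution are assumptions; the real values come from the deployed model's Triton configuration.

```python
# Illustrative Triton gRPC client; not part of the measured DeepStream pipeline.
# MODEL_NAME, INPUT_NAME, and the 544x960 resolution are assumptions here --
# the real values come from the deployed model's Triton config.pbtxt.
import numpy as np
import tritonclient.grpc as grpcclient

MODEL_NAME = "retinanet_efficientnet_b0"    # hypothetical model name
INPUT_NAME = "Input"                        # hypothetical input tensor name
BATCH_SHAPE = (8, 3, 544, 960)              # batch of 8 frames, CHW, assumed size

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Random data stands in for eight decoded, preprocessed video frames.
frames = np.random.rand(*BATCH_SHAPE).astype(np.float32)

infer_input = grpcclient.InferInput(INPUT_NAME, list(frames.shape), "FP32")
infer_input.set_data_from_numpy(frames)

# One synchronous request; perf_analyzer issues many such requests concurrently
# to produce the standalone Triton throughput figures reported in Table 1.
response = client.infer(model_name=MODEL_NAME, inputs=[infer_input])
# Output tensor names depend on the exported model; retrieve them with
# response.as_numpy(<output name>).
```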
The following table shows the resulting inference throughput for four configurations: TRT Exec, DeepStream, Triton Exec (measured with perf_analyzer), and DeepStream-Triton.
Note: In the table below, the Frame interval column is the DeepStream inference interval: 0 runs inference on every frame, 1 on every second frame, and 3 on every fourth frame.
Table 1. RetinaNet / EfficientNet_B0 object detection inference throughput - frames per second
| Input Video Source | Batch | Frame interval | TRT Exec (1 instance) | DeepStream (1 instance) | Triton Exec, perf_analyzer (1 instance) | Triton Exec, perf_analyzer (2 instances) | DeepStream-Triton (1 instance) | DeepStream-Triton (2 instances) |
|---|---|---|---|---|---|---|---|---|
| sample_1080p_h264.mp4 | 8 | 0 | 813 fps | 362 fps | 818 fps | 1012 fps | 728 fps | 872 fps |
| sample_1080p_h264.mp4 | 8 | 1 | N/A | 712 fps | N/A | N/A | 1360 fps | 1496 fps |
| sample_1080p_h264.mp4 | 8 | 3 | N/A | 1152 fps | N/A | N/A | 2224 fps | 2224 fps |
The throughput data in Table 1 determines how many cameras, each producing 30 fps of video, can be supported when running object detection on their output. The following figure shows this number of supported cameras; a short calculation sketch follows Figure 2.
Figure 2. Number of 30 fps cameras that can be supported, based on the configurations in Table 1
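The mapping from Table 1 to Figure 2 is a simple division of sustained pipeline throughput by the per-camera frame rate. The sketch below reproduces that arithmetic for the DeepStream-Triton, two-instance column; the helper name and the choice to round down are assumptions for this illustration, and the figure's exact values come from the white paper's own chart.

```python
# Illustrative calculation behind Figure 2: supported cameras = throughput / 30 fps.
# Rounding down is an assumption for this sketch.
CAMERA_FPS = 30

def cameras_supported(pipeline_fps, camera_fps=CAMERA_FPS):
    # Each camera contributes camera_fps frames per second to the pipeline.
    return pipeline_fps // camera_fps

# DeepStream-Triton, 2 instances, from Table 1.
for interval, fps in [(0, 872), (1, 1496), (3, 2224)]:
    print(f"frame interval {interval}: {fps} fps -> {cameras_supported(fps)} cameras")
# frame interval 0: 872 fps -> 29 cameras
# frame interval 1: 1496 fps -> 49 cameras
# frame interval 3: 2224 fps -> 74 cameras
```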