This white paper presents the throughput obtained when running a RetinaNet object detection network on an NVIDIA A2 GPU in Dell PowerEdge R6515 servers. The A2 is part of NVIDIA's Ampere line of GPUs; a single A2 delivers 36 tera operations per second (TOPS) of 8-bit integer performance and has 200 GB/sec of memory bandwidth. The object detection experiments are set up using the procedures detailed in a previous paper, Developing and Deploying Vision AI with Dell and NVIDIA Metropolis (delltechnologies.com).
First, a pre-trained RetinaNet model with an EfficientNet_B0 backbone is fine-tuned on more than 7,000 traffic images from the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) data set using the NVIDIA TAO Toolkit, a low-code AI model development tool. In our experiments we fine-tune the model using only the original KITTI data set, without creating additional augmented images. Training of the unpruned model continues until the validation loss stops decreasing and instead begins to rise, a sign of overfitting.
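The stopping criterion above, halting once validation loss turns upward, amounts to a simple early-stopping check. The following is an illustrative sketch, not part of the TAO Toolkit; the `patience` parameter is our own assumption about how many non-improving epochs to tolerate:

```python
def should_stop(val_losses, patience=3):
    """Return True when validation loss has not improved for
    `patience` consecutive epochs -- a sign of overfitting."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses[:-patience])
    # Stop if none of the last `patience` losses beat the earlier best.
    return min(val_losses[-patience:]) >= best

# Loss falls, then climbs for three epochs -> stop.
should_stop([1.0, 0.8, 0.7, 0.75, 0.80, 0.85])  # True
# Loss still falling -> keep training.
should_stop([1.0, 0.9, 0.8, 0.7])               # False
```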
We then perform optional cycles of pruning, to reduce model size, followed by retraining to recover accuracy. In this example, the model reached 84% mean Average Precision (mAP) across the four target classes (bicycle, car, person, and road sign) at the point where fine-tuning and pruning were stopped. We then optimized the model for inference using the TAO Toolkit's built-in optimization, powered by TensorRT. Depending on the use case, the model can be converted to a 32-bit floating-point, 16-bit floating-point, or 8-bit integer inference engine. We chose 8-bit integer here because it delivers the highest inference throughput, while the accuracy loss from 8-bit quantization is minimized by having used Quantization Aware Training (QAT).
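To illustrate what 8-bit conversion involves, symmetric INT8 quantization maps floating-point values to integer codes via a per-tensor scale; QAT trains with this rounding in the loop so the final weights tolerate it. This is a minimal conceptual sketch, not the TensorRT implementation:

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: scale by max|v|/127,
    round to the nearest integer, clamp to [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [x * scale for x in q]

weights = [0.02, -1.27, 0.635, 0.9]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Each reconstructed value differs from the original by at most scale/2,
# which is why well-scaled INT8 loses so little accuracy.
```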
The 8-bit inference engine is then replicated as needed, and the engines are deployed on the Dell PowerEdge R7515 server through NVIDIA's DeepStream SDK (v6.0). Using Triton, and feeding eight video streams simultaneously, we achieve a total throughput of more than 200 frames per second of object detection inference. When inference is performed on only every fourth frame of each stream, the total frames per second passed to downstream tracking grows to more than 700. The overall process is shown in the following figure.
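The jump from roughly 200 fps to more than 700 fps follows from the inference interval: when the detector runs on only every (interval + 1)-th frame, all frames still flow to the tracker, so pipeline throughput is roughly the detector rate multiplied by (interval + 1), up to the decode limit. (In DeepStream this is controlled by the inference plugin's `interval` property.) A back-of-the-envelope sketch, using illustrative numbers rather than the exact Table 1 measurements:

```python
def pipeline_fps(detector_fps, interval):
    """Frames per second delivered downstream when the detector runs on
    only every (interval + 1)-th frame.  Assumes the detector, not video
    decode, is the pipeline bottleneck."""
    return detector_fps * (interval + 1)

# Illustrative: a detector sustaining 180 fps, inferring every 4th frame
# (interval = 3), lets ~720 fps of video reach the downstream tracker.
pipeline_fps(180, 3)  # 720
# With interval 0, pipeline throughput equals detector throughput.
pipeline_fps(216, 0)  # 216
```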
Figure 1. An end-to-end training-to-deployment pipeline for object detection
The following table shows the resulting inference throughput from three configurations, plus a variant of the 2-GPU configuration with profiling enabled:
Table 1. RetinaNet / EfficientNet_B0 object detection inference throughput - frames per second
Input video source | Batch size | Inference interval (frames skipped) | 1-GPU TRT exec throughput | 1-GPU DeepStream-Triton throughput | 2-GPU DeepStream-Triton throughput | 2-GPU DeepStream-Triton throughput (profiling at 10k samples/sec) |
sample_1080p_h264.mp4 | 8 | 0 | 229 fps | 216 fps | 328 fps | 320 fps |
sample_1080p_h264.mp4 | 8 | 1 | N/A | 424 fps | 648 fps | 616 fps |
sample_1080p_h264.mp4 | 8 | 3 | N/A | 720 fps | 752 fps | 720 fps |
Based on the throughput data in Table 1 (with profiling turned off), we can derive the number of cameras, each producing 30 fps of video, whose output can be supported if we run object detection on every stream. The following figure shows this number of supported cameras.
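The camera counts follow from dividing each sustained throughput by the 30 fps each camera produces. A quick sketch using the 2-GPU DeepStream-Triton column of Table 1 (profiling off); Figure 2 may round these counts differently:

```python
def cameras_supported(throughput_fps, camera_fps=30):
    """Number of 30 fps cameras a given inference throughput can cover."""
    return throughput_fps // camera_fps

# Table 1, 2-GPU DeepStream-Triton throughputs at intervals 0, 1, 3:
for fps in (328, 648, 752):
    print(fps, "fps ->", cameras_supported(fps), "cameras")
# 328 -> 10, 648 -> 21, 752 -> 25
```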
Figure 2. Number of 30 fps cameras that can be supported based on Table 1 configs