This paper on intelligent video analytics extends the document Boost Video Analytics Throughput Using NVIDIA Metropolis - Triton Inference Server and DeepStream. In this paper, we incorporate NVIDIA pretrained models and the NVIDIA TAO Toolkit, which enable developers to create production-ready models through fine-tuning, using only a fraction of the data required when training deep neural networks from scratch. These tools combine with Dell server technology, enabling users to capture the full benefits of transfer learning in real-world vision AI applications such as object detection, object tracking, and image segmentation. The result is vision AI applications that are highly accurate, scalable, performant, and seamlessly deployed.
We use two examples to step through network implementation, showing how to retrain and fine-tune an object detection model with the TAO Toolkit and deploy it for inference with the NVIDIA DeepStream SDK and the Triton Inference Server on the PowerEdge R7515 server.
First, we explore the use of RetinaNet, one of the TAO-supported object-detection models, and pair it with an EfficientNet backbone for feature extraction. The combined RetinaNet model with an EfficientNet_b0 backbone, pretrained on the ImageNet dataset, is available from the NVIDIA NGC catalog, a GPU-optimized hub of AI software. We use the TAO Toolkit to retarget the model to a different dataset: the Karlsruhe Institute of Technology and Toyota Technological Institute set of traffic images (the KITTI object detection benchmark dataset). Our discussion and diagrams show the overall TAO Toolkit workflow: dataset conversion, retraining, evaluation, pruning, and model export with conversion to floating-point and integer inference engines. We show the throughput performance achieved on the PowerEdge R7515 using both unpruned and pruned engines, and we discuss the optional path of using quantization-aware training to produce the INT8 engines.
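The workflow stages above map onto the TAO launcher CLI. The following is a hedged sketch of that command sequence; the spec files, paths, model filenames, and the `$KEY` encryption key are illustrative placeholders, and exact flags vary by TAO Toolkit version.

```shell
# Hedged sketch of the TAO workflow: convert, train, evaluate, prune,
# retrain, and export. All paths and spec files are placeholders.

# 1. Convert KITTI-format images and labels into TFRecords
tao retinanet dataset_convert -d convert_spec.txt -o tfrecords/kitti

# 2. Retrain the NGC-pretrained RetinaNet model on the KITTI data
tao retinanet train -e train_spec.txt -r results/unpruned -k $KEY

# 3. Evaluate the retrained model against the validation split
tao retinanet evaluate -e train_spec.txt \
    -m results/unpruned/weights/model.tlt -k $KEY

# 4. Prune to shrink the model, then retrain to recover accuracy
tao retinanet prune -m results/unpruned/weights/model.tlt \
    -o results/pruned/model_pruned.tlt -k $KEY
tao retinanet retrain -e retrain_spec.txt -r results/pruned -k $KEY

# 5. Export for inference-engine generation; INT8 export also emits
#    a calibration cache used when building the integer engine
tao retinanet export -m results/pruned/weights/model.tlt \
    -o model.etlt --data_type int8 --cal_cache_file cal.bin -k $KEY
```

These commands require the TAO launcher and its containers to be installed, so the sketch is meant as a map of the stages rather than a copy-paste recipe.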
In a second example, we step through one of the same RetinaNet models (this time paired with a ResNet18 backbone) in greater detail. We examine the model’s mean Average Precision (mAP) and throughput for various instantiations. This analysis covers standalone engines at varying precision (32-bit and 16-bit floating point, and 8-bit integer), engines running within DeepStream, and engines configured several ways within the DeepStream multi-stream pipeline framework.
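For the standalone-engine comparison, the three precisions can be built from one exported model. As a hedged illustration, TensorRT's bundled `trtexec` tool builds engines like this; `model.onnx` and the calibration cache are placeholders (the paper's TAO flow exports `.etlt` models and uses TAO's own converter rather than ONNX, so treat this as a generic TensorRT sketch, not the paper's exact procedure).

```shell
# Hedged sketch: building one engine per precision with trtexec.
# FP32 is the default; FP16 and INT8 are opted into per build.
trtexec --onnx=model.onnx --saveEngine=model_fp32.engine
trtexec --onnx=model.onnx --saveEngine=model_fp16.engine --fp16
trtexec --onnx=model.onnx --saveEngine=model_int8.engine --int8 \
        --calib=cal.cache
```

Each resulting engine can then be benchmarked standalone or dropped into a DeepStream pipeline configuration for the in-pipeline comparisons described above.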
We use profiling to explore how you might load multiple streams and engine instances using the software stack on a PowerEdge R7515 server to maximize the total inference throughput produced by one of the T4 GPUs. A single 32-bit floating point instance of the RetinaNet and ResNet18 combination running on one T4 can produce ~96 frames per second of inference on our targeted 1248 x 348 pixel, 3-channel KITTI images. We show a case where, with multiple 8-bit integer engine instances running on multiple streams taken at suitable intervals, the frames per second produced by that one T4 GPU can reach as high as ~968 fps. At that higher rate, nearly a factor of 10 over the single-instance, single-stream 32-bit floating point case, production finally becomes decoder-bound as well as compute-bound.
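A quick back-of-envelope check makes the scale of that gain concrete. The figures below are the ones quoted in this section; the 30-fps stream count is our own illustrative extrapolation, not a measurement from the paper.

```shell
# Back-of-envelope arithmetic on the quoted throughput figures:
# ~968 fps (multiple INT8 instances) vs. ~96 fps (one FP32 instance).
speedup=$(awk 'BEGIN { printf "%.1f", 968 / 96 }')
echo "INT8 multi-instance speedup: ${speedup}x"   # prints 10.1x

# Illustrative only: at ~968 fps, one T4 could in principle keep up
# with roughly 32 live 30-fps camera streams.
streams=$(awk 'BEGIN { printf "%d", 968 / 30 }')
echo "30-fps streams served: ${streams}"          # prints 32
```

The ~10x figure is why the bottleneck shifts: once the INT8 engines are this fast, video decode capacity on the GPU matters as much as compute.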