Synthetic Data Generation with Dell PowerEdge R760xa and NVIDIA Omniverse Platform

Related publications

This blog is part three of the blog series Dell Technologies PowerEdge R760xa with NVIDIA Omniverse:

Part One: Deploy Virtualized NVIDIA Omniverse Environment with Dell PowerEdge R760xa and NVIDIA L40 GPUs
Part Two: Developing Digital Twin 3D Models on Dell PowerEdge R760xa with NVIDIA Omniverse virtualized platform

Synthetic Data Generation

Synthetic Data Generation (SDG) is the process of generating data for model development using analysis and tools to represent the patterns, relationships, and characteristics of real-world data.

The collection, curation, and annotation of real-world data is inherently time-consuming, expensive and may not be feasible in many circumstances. Synthetic data on the other hand might prove a suitable alternative, either in its own right or in conjunction with existing real-word data. Synthetic data has the added benefit of less risk of infringing on privacy or exposing sensitive information while providing data diversity.

Types of Synthetic data and use cases:

Synthetic text: Artificially generated text, such as text used to train language models.

Synthetic media: Artificially generated sound, image, and video, such as media used to train autonomous vehicles or computer vision models.

Synthetic tabular data: Artificially generated structured data (such as, spreadsheets and databases).

Synthetic data/datasets are increasingly being used to train AI algorithms and perform real-world modeling.

Figure 1: Gartner Synthetic Data forecast. Source www.gartner.com

Gartner predicts that by 2030, most AI modeling data will be artificially generated by rules, statistical models, simulations, or other techniques.

Part one of this blog series discussed how to deploy a virtual instance of NVIDIAs Omniverse platform with Part two highlighting how 3D scenes that are built with Omniverse SimReady assets can be used in the development of physically accurate Digital Twin 3D workflows.

This article, Part 3, is the final installment in the series and describes how NVIDIA Omniverse™ Isaac Sim with SimReady Assets can produce photorealistic, physically accurate simulations and synthetic data, which can subsequently be used for AI training and integrated into Digital Twin development workflows.

Omniverse Isaac Sim and TAO Toolkit

Isaac Sim is an extensible robotics simulation toolkit for the NVIDIA Omniverse™ platform. With Isaac Sim, developers can train and optimize AI robots for a wide variety of tasks. It provides the tools and workflows needed to create physically accurate simulations and synthetic datasets.

Note: Isaac Sim exposes as a set of extensions (Omniverse Replicator) to provide synthetic data generation capabilities.

Figure 2: NVIDIA Isaac Sim Stack

SimReady 3D assets and SDG are designed to be used in concert to create and randomize a variety of scenarios to meet specific training goals and do it safely as a virtual simulation. Unlike real-world datasets that must be manually annotated before use, SDG data annotated using semantic labeling can be used directly to train AI models.

NVIDIA TAO Toolkit is used for model training and inferencing. The TAO Toolkit is a powerful open-source AI toolkit that is designed to simplify the process of creating highly accurate, customized computer vision AI models. It leverages transfer learning, enabling the adaption of existing AI models for specific data. Built with TensorFlow and PyTorch, it streamlines the model training process.

TAO is part of NVIDIA AI Enterprise, an enterprise-ready AI software platform with security, stability, manageability, and support.

Figure 3: TAO overview 

TAO supports most popular computer vision tasks such as:

Image Classification
Object Detection
Semantic Segmentation
Optical character recognition (OCR)

Synthetic Image Data Generation on Dell Technologies PowerEdge R760xa servers

This article concentrates on leveraging Dell PowerEdge R760xa servers with 4x L40 GPUs, a virtualized instance of NVIDIAs Omniverse Enterprise with Isaac Sim to synthetically generate a dataset of suitable images that can be used to train Computer Vision object detection AI models.

Figure 4: Virtualized Omniverse stack on Dell PowerEdge 760xa server with 4x L40 GPUs

Enhance Computer Vision Model Capabilities with Synthetically Generated Data

The following example explores the generation of synthetic data and how it can be used to train a Computer Vision AI model to detect “Pallet Jack” objects within a warehouse setting for logistics planning.

Figure 5: Synthetic & Real Images of “Pallet Jack” objects within a warehouse scene

High Level Steps:

Isaac Sim configuration
Load warehouse scene & SimReady Assets
Scenario Simulation
Generate synthetic 3D image data
Train CV object detection AI model with synthetic data
Test AI Model Inference performance on real-world images

The following NVIDIA documentation provides a repository of collateral material (information, scripts, notebooks, and so forth) focused on:

Isaac Sim Configuration

Several python scripts are included to configure and run 3D simulations with Isaac Sim. The resulting SDG output consists of ~ 5,000 high quality annotated images of “Pallet Jack” objects within a warehouse scene.

Synthetic Image Data Generation techniques

One factor to consider when generating synthetic image data is diversity. Similar or repetitive images from the synthetic domain/scene will likely not help to improve AI model accuracy. Suitable domain randomization techniques that vary image generation maybe required, such as:

Scene distraction (multiple objects, occlusions),
Scene composition (lighting, reflective surfaces, textures).

See Figure 6 & 7.

Figure 6: Isaac Sim simulation (multi object scene with occlusions) & GPU utilization snapshotFigure 7: Isaac Sim simulation (reflective materials)

Data diversity techniques will likely vary on a use case or scene basis, for example indoor vs outdoor scenes may require different approaches before satisfactory data is generated.

Pre-Trained Computer Vision AI models

A sample Jupyter notebook from the following (Omniverse repository) uses the TAO Toolkit from NVIDIA to download a pre-trained computer vision AI (DetectNet_v2/resnet18) model, which is then trained on the previously generated synthetic images.

Figure 8 shows computer vision AI model validation on synthetic images.

Figure 8: Object detection training validation with synthetic Data

Finally, the synthetically trained computer vision AI object detection model is evaluated on real-world “pallet-jack” images during inferencing to assess the performance and accuracy obtained.

The example in Figure 9 shows computer vision object detection of “pallet-jacks” in real-world images

Figure 9: Object detection Inferencing on Real-World Images

Note: See Logistic Objects in Context (LOCO) dataset containing real-world images within logistics settings.

Conclusion

Synthetic data is playing an increasingly important role in enhancing the capabilities of AI models. Synthetic data used either on its own or in augmenting existing datasets may be used to address certain shortcomings and impediments of real datasets in the training of AI models such as: insufficient data, diversity, privacy, rare scenarios, and cost of curation.

NVIDIA’s Omniverse Platform in conjunction with the Isaac Sim application enables 3D scenario modeling and simulation with corresponding synthetic data image generation which can then be used to train AI and or integrated into 3D data pipelines such as Digital Twins.

In this three-part blog series Dell Technologies PowerEdge 760xa with NVIDIA Omniverse Enterprise Platform, we explored,