Synthetic Data Generation with Dell PowerEdge R760xa and NVIDIA Omniverse Platform
Tue, 14 May 2024 15:08:48 -0000
|Read Time: 0 minutes
Related publications
This blog is part three of the blog series Dell Technologies PowerEdge R760xa with NVIDIA Omniverse:
- Part One: Deploy Virtualized NVIDIA Omniverse Environment with Dell PowerEdge R760xa and NVIDIA L40 GPUs
- Part Two: Developing Digital Twin 3D Models on Dell PowerEdge R760xa with NVIDIA Omniverse virtualized platform
Synthetic Data Generation
Synthetic Data Generation (SDG) is the process of generating data for model development using analysis and tools to represent the patterns, relationships, and characteristics of real-world data.
The collection, curation, and annotation of real-world data is inherently time-consuming, expensive and may not be feasible in many circumstances. Synthetic data on the other hand might prove a suitable alternative, either in its own right or in conjunction with existing real-word data. Synthetic data has the added benefit of less risk of infringing on privacy or exposing sensitive information while providing data diversity.
Types of Synthetic data and use cases:
Synthetic text: Artificially generated text, such as text used to train language models.
Synthetic media: Artificially generated sound, image, and video, such as media used to train autonomous vehicles or computer vision models.
Synthetic tabular data: Artificially generated structured data (such as, spreadsheets and databases).
Synthetic data/datasets are increasingly being used to train AI algorithms and perform real-world modeling.
Gartner predicts that by 2030, most AI modeling data will be artificially generated by rules, statistical models, simulations, or other techniques.
Part one of this blog series discussed how to deploy a virtual instance of NVIDIAs Omniverse platform with Part two highlighting how 3D scenes that are built with Omniverse SimReady assets can be used in the development of physically accurate Digital Twin 3D workflows.
This article, Part 3, is the final installment in the series and describes how NVIDIA Omniverse™ Isaac Sim with SimReady Assets can produce photorealistic, physically accurate simulations and synthetic data, which can subsequently be used for AI training and integrated into Digital Twin development workflows.
Omniverse Isaac Sim and TAO Toolkit
Isaac Sim is an extensible robotics simulation toolkit for the NVIDIA Omniverse™ platform. With Isaac Sim, developers can train and optimize AI robots for a wide variety of tasks. It provides the tools and workflows needed to create physically accurate simulations and synthetic datasets.
Note: Isaac Sim exposes as a set of extensions (Omniverse Replicator) to provide synthetic data generation capabilities.
SimReady 3D assets and SDG are designed to be used in concert to create and randomize a variety of scenarios to meet specific training goals and do it safely as a virtual simulation. Unlike real-world datasets that must be manually annotated before use, SDG data annotated using semantic labeling can be used directly to train AI models.
NVIDIA TAO Toolkit is used for model training and inferencing. The TAO Toolkit is a powerful open-source AI toolkit that is designed to simplify the process of creating highly accurate, customized computer vision AI models. It leverages transfer learning, enabling the adaption of existing AI models for specific data. Built with TensorFlow and PyTorch, it streamlines the model training process.
TAO is part of NVIDIA AI Enterprise, an enterprise-ready AI software platform with security, stability, manageability, and support.
TAO supports most popular computer vision tasks such as:
- Image Classification
- Object Detection
- Semantic Segmentation
- Optical character recognition (OCR)
Synthetic Image Data Generation on Dell Technologies PowerEdge R760xa servers
This article concentrates on leveraging Dell PowerEdge R760xa servers with 4x L40 GPUs, a virtualized instance of NVIDIAs Omniverse Enterprise with Isaac Sim to synthetically generate a dataset of suitable images that can be used to train Computer Vision object detection AI models.
Enhance Computer Vision Model Capabilities with Synthetically Generated Data
The following example explores the generation of synthetic data and how it can be used to train a Computer Vision AI model to detect “Pallet Jack” objects within a warehouse setting for logistics planning.
High Level Steps:
- Isaac Sim configuration
- Load warehouse scene & SimReady Assets
- Scenario Simulation
- Generate synthetic 3D image data
- Train CV object detection AI model with synthetic data
- Test AI Model Inference performance on real-world images
The following NVIDIA documentation provides a repository of collateral material (information, scripts, notebooks, and so forth) focused on:
- Isaac Sim configurations
- Synthetic Image Data Generation techniques
- Pre-Trained Computer Vision AI models
Isaac Sim Configuration
Several python scripts are included to configure and run 3D simulations with Isaac Sim. The resulting SDG output consists of ~ 5,000 high quality annotated images of “Pallet Jack” objects within a warehouse scene.
Synthetic Image Data Generation techniques
One factor to consider when generating synthetic image data is diversity. Similar or repetitive images from the synthetic domain/scene will likely not help to improve AI model accuracy. Suitable domain randomization techniques that vary image generation maybe required, such as:
- Scene distraction (multiple objects, occlusions),
- Scene composition (lighting, reflective surfaces, textures).
See Figure 6 & 7.
Data diversity techniques will likely vary on a use case or scene basis, for example indoor vs outdoor scenes may require different approaches before satisfactory data is generated.
Pre-Trained Computer Vision AI models
A sample Jupyter notebook from the following (Omniverse repository) uses the TAO Toolkit from NVIDIA to download a pre-trained computer vision AI (DetectNet_v2/resnet18) model, which is then trained on the previously generated synthetic images.
Figure 8 shows computer vision AI model validation on synthetic images.
Finally, the synthetically trained computer vision AI object detection model is evaluated on real-world “pallet-jack” images during inferencing to assess the performance and accuracy obtained.
The example in Figure 9 shows computer vision object detection of “pallet-jacks” in real-world images
Note: See Logistic Objects in Context (LOCO) dataset containing real-world images within logistics settings.
Conclusion
Synthetic data is playing an increasingly important role in enhancing the capabilities of AI models. Synthetic data used either on its own or in augmenting existing datasets may be used to address certain shortcomings and impediments of real datasets in the training of AI models such as: insufficient data, diversity, privacy, rare scenarios, and cost of curation.
NVIDIA’s Omniverse Platform in conjunction with the Isaac Sim application enables 3D scenario modeling and simulation with corresponding synthetic data image generation which can then be used to train AI and or integrated into 3D data pipelines such as Digital Twins.
In this three-part blog series Dell Technologies PowerEdge 760xa with NVIDIA Omniverse Enterprise Platform, we explored,
- Virtual deployment,
- Build/simulating high fidelity and physically accurate 3D models
- Synthetic Data Generation
- Training computer vision AI model with synthetic data
For further information see Dell Technologies Technical White Paper: A Digital Twin Journey: Computer Vision AI Model Enhancement with Dell Technologies Integrated Solutions & NVIDIA Omniverse
References
Virtual Workstation Interactive Collaboration with NVIDIA Omniverse
Applying Digital Twin and AMR Technologies for Industrial Manufacturing