Data science professionals have been pushing the limits of IT infrastructure for decades. Dell Technologies and NVIDIA have responded with numerous advancements in computation, data storage options, and high-speed networking to meet that challenge. In this validated design, we show how you can manage those advances in hardware technology more efficiently, while preserving the impressive performance gains, in a virtualized environment using the latest version of VMware virtualization software.
Nowhere have the performance advancements for data science been as rapid as in hardware accelerators based on technology once used primarily for graphics processing. The newest generation of Ampere GPUs from NVIDIA provides up to 20 times the performance of the prior generation of NVIDIA GPUs. The NVIDIA A100 GPU can be partitioned into as many as seven GPU instances to adjust dynamically to shifting demands. This combination of GPU partitioning and VMware virtualization gives IT professionals the tools that they need to meet the most demanding production machine learning workloads, while also serving the experimentation needs of large groups of data science users on a single platform.
In Solution architecture, we describe the recommended configuration for enterprise AI workloads. We describe redundant networking designs for both 25 GbE and 100 GbE equipment. We also provide best practices for deploying NVIDIA Virtual GPU Manager for VMware vSphere and for configuring VMs to use GPU partitions based on the Multi-Instance GPU (MIG) feature of the A100 and A30 GPUs.
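In a vSphere deployment, partitions are normally exposed to VMs through MIG-backed vGPU profiles assigned in the vSphere Client. For illustration only, the underlying MIG workflow on a host with an A100 and the NVIDIA driver installed might be sketched as follows; GPU index and profile names are example values, not a prescribed configuration:

```shell
# Enable MIG mode on GPU 0 (may require a GPU reset or host reboot).
nvidia-smi -i 0 -mig 1

# Create two example GPU instances on an A100-40GB -- one 3g.20gb and
# one 2g.10gb -- each with its default compute instance (-C).
nvidia-smi mig -i 0 -cgi 3g.20gb,2g.10gb -C

# List the resulting GPU instances to confirm the partition layout.
nvidia-smi mig -i 0 -lgi
```

The profiles chosen here are illustrative; the validated design's recommended profiles depend on the workload mix described in Solution architecture.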
In Validation and machine learning performance, we describe the lab environment that we implemented to support the performance evaluation, comparing virtualized GPUs with bare metal and comparing several GPU partitioning approaches with the A100 GPU. We show the results of eight tests using different partition profiles for the A100 GPU, running both training of an image classifier (ResNet-50 v1.5) and inference using a trained ResNet model. The results show less than a five percent model training performance penalty for the virtualized configuration, a small cost relative to the gains in efficiency. These results should encourage more IT organizations to deploy machine learning workloads for AI on VMware-managed platforms using vGPU with hardware partitioning.