Data science professionals have been pushing the limits of IT infrastructure for decades. Dell Technologies and NVIDIA have responded with numerous advancements in computation, data storage options, and high-speed networking to meet that challenge. In this design guide, we show how you can manage those hardware advances more efficiently while preserving their impressive performance gains in a virtualized environment using the latest version of VMware virtualization software.
Nowhere have the performance advancements for data science been as rapid as in the development of hardware accelerators based on technology once used primarily for graphics processing. The newest generation of NVIDIA Ampere GPUs provides up to 20 times the performance of the prior generation of NVIDIA GPUs. The NVIDIA A100 GPU can be partitioned into as many as seven GPU instances to adjust dynamically to shifting demands. This combination of GPU partitioning and VMware virtualization gives IT professionals the tools that they need to meet the most demanding production machine learning workloads, while also serving the experimentation needs of large groups of data science users on a single platform.
In Dell Technologies, VMware, and NVIDIA technology integration, we list each of the components needed to create an environment that can efficiently meet the varied needs of data science work in support of AI, from basic experimentation on a small, isolated GPU partition to production-scale model training on full GPU instances.
In Design guidance, we describe the lab environment that we implemented to support the performance evaluation, which compares several GPU partitioning approaches using the A100 in VMware VMs against the same workload on nonvirtualized A100 devices. We describe redundant networking designs for both 25 GbE and 100 GbE equipment. We also provide best practices for deploying NVIDIA Virtual GPU Manager for VMware vSphere and for configuring VMs to use GPU partitions based on the Multi-Instance GPU (MIG) feature of the A100 GPU.
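As a brief sketch of what the MIG configuration described above involves, the following commands use the standard nvidia-smi CLI to enable MIG mode and carve an A100 into GPU instances. This is an illustrative fragment, not part of the guide's documented procedure; the profile IDs shown apply to the A100 40 GB and can vary by GPU model and driver version, so list the supported profiles on your own system first.

```shell
# Enable MIG mode on GPU 0 (takes effect after a GPU reset)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this GPU supports
# (on an A100 40 GB: 1g.5gb, 2g.10gb, 3g.20gb, 4g.20gb, 7g.40gb)
sudo nvidia-smi mig -lgip

# Create two 3g.20gb GPU instances (profile ID 9 on A100 40 GB)
# and a compute instance on each (-C); IDs are illustrative
sudo nvidia-smi mig -cgi 9,9 -C

# Verify the resulting GPU instances
sudo nvidia-smi mig -lgi
```

Each resulting GPU instance can then be presented to a VM through the NVIDIA vGPU software, which is the approach evaluated in this guide.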
In Machine learning performance, we show the results of eight tests using different partition profiles for the A100 GPU, running both training of an image classifier (ResNet-50 v1.5) and inference using a trained ResNet model. The results show a model training performance penalty of less than 5 percent, weighed against gains in efficiency and throughput for inference. For most scenarios, vGPU with hardware partitioning should encourage more IT organizations to deploy machine learning workloads for AI on VMware-managed platforms.