The Tensor Core technology in the Ampere architecture brings dramatic performance gains to AI workloads. Large-scale testing and customer case studies show that A100 GPUs can reduce training times from weeks to hours, and the A100 GPU also delivers substantial acceleration for inference workloads. IT professionals benefit from reduced operational complexity by using a single technology that is easy to onboard and manage for these use cases.
The A100 GPU is a dual-slot 10.5-inch PCI Express (PCIe) Gen4 card that is based on the NVIDIA Ampere GA100 GPU. It uses a passive heat sink for cooling. The A100 PCIe supports double precision (FP64), single precision (FP32), and half precision (FP16) compute tasks. It also supports unified virtual memory and a page migration engine. The A100 GPU is available in 40 GB and 80 GB memory versions and in PCIe and SXM form factors.
The PowerEdge R740xd server supports the A100 40 GB PCIe GPU. This design guide applies only to the A100 40 GB PCIe GPU.
The new Multi-Instance GPU (MIG) feature was designed to partition the A100 GPU into individual instances, each fully isolated with its own high-bandwidth memory, cache, and compute. The A100 PCIe card supports MIG configurations with up to seven GPU instances per A100 GPU.
By combining MIG with NVIDIA Virtual GPU (vGPU) software, enterprises can take advantage of the management, monitoring, and operational benefits of VMware server virtualization. VMware virtual machines (VMs) with MIG-backed vGPUs provide the flexibility to run mixed-size (heterogeneous) partitioned GPU instances.
MIG allows multiple vGPUs (and therefore, VMs) to run in parallel on a single A100 GPU, while preserving the isolation guarantees that a vGPU provides. Administrators can partition the GPUs and allocate the required GPU capacity to individual data scientists. The data scientist can be assured of predictable performance due to the isolation and Quality of Service guarantees of the vGPU.
The following table lists the supported GPU partitions:

GPU profile name | Profile name on VMs | Fraction of GPU memory | Fraction of GPU compute | Maximum number of instances available
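As a sketch of how these partitions add up, the MIG profiles that NVIDIA publishes for the A100 40 GB can be modeled as slices of the GPU's seven compute units and eight 5 GB memory units. The profile names, slice counts, and instance limits below reflect NVIDIA's standard MIG geometry; the helper function and its name are illustrative assumptions, not part of this guide or of any NVIDIA API, and the check deliberately ignores NVIDIA's physical placement constraints.

```python
# Illustrative model of A100 40 GB MIG profiles: compute slices (of 7),
# memory slices (of 8, at 5 GB each), and the per-profile instance limit.
# The helper below is a hypothetical sketch, not an NVIDIA or VMware API.
from collections import Counter

MIG_PROFILES = {
    "MIG 1g.5gb":  {"compute": 1, "memory": 1, "max_instances": 7},
    "MIG 2g.10gb": {"compute": 2, "memory": 2, "max_instances": 3},
    "MIG 3g.20gb": {"compute": 3, "memory": 4, "max_instances": 2},
    "MIG 4g.20gb": {"compute": 4, "memory": 4, "max_instances": 1},
    "MIG 7g.40gb": {"compute": 7, "memory": 8, "max_instances": 1},
}

def partition_fits(profiles):
    """Check whether a list of MIG profile names fits on one A100 40 GB.

    A simple capacity check only; real MIG placement has additional
    physical placement rules that this sketch does not model.
    """
    counts = Counter(profiles)
    for name, n in counts.items():
        if n > MIG_PROFILES[name]["max_instances"]:
            return False
    compute = sum(MIG_PROFILES[p]["compute"] for p in profiles)
    memory = sum(MIG_PROFILES[p]["memory"] for p in profiles)
    return compute <= 7 and memory <= 8
```

For example, a mixed partition of one MIG 4g.20gb, one MIG 2g.10gb, and one MIG 1g.5gb instance uses all seven compute slices and fits on a single GPU.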
When a VM is created, assign one of the preceding partitions and power on the VM. The GPU resource is then allocated to the VM. No other VM can use that resource, guaranteeing isolation of GPU resources to that VM. The resources are freed only when you power off or migrate the VM from the server.
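The allocation lifecycle described above can be sketched as a toy bookkeeping model: powering on a VM claims a MIG-backed vGPU exclusively, and the instance is released only when the VM powers off (or migrates away). The class and method names are illustrative assumptions, not part of the actual vSphere or NVIDIA vGPU manager APIs.

```python
# Toy model of exclusive MIG instance allocation: an instance claimed by a
# powered-on VM cannot be used by any other VM until it is released.
# Hypothetical sketch -- not a vSphere or NVIDIA vGPU manager API.
class MigAllocator:
    def __init__(self, instances):
        # instances: MIG instance IDs available on the GPU
        self.free = set(instances)
        self.owner = {}  # instance ID -> VM name

    def power_on(self, vm, instance):
        """Allocate the instance to the VM; fail if it is already taken."""
        if instance in self.owner:
            raise RuntimeError(
                f"{instance} is already allocated to {self.owner[instance]}")
        self.free.discard(instance)
        self.owner[instance] = vm

    def power_off(self, vm):
        """Free all instances held by the VM, as powering it off would."""
        for inst, owner in list(self.owner.items()):
            if owner == vm:
                del self.owner[inst]
                self.free.add(inst)
```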
The following figure shows an example of how MIG partitions are allocated to VMs:
The example shows an A100 GPU partitioned into three MIG profiles: MIG 4g.20gb, MIG 2g.10gb, and MIG 1g.5gb. These profiles are assigned to run light training using TensorFlow, model development using Jupyter notebook, and inference using TensorRT.
As of the publication of this design guide, MIG only supports assigning one profile type to a VM at a time.
The ConnectX®-6 Dx SmartNIC is a secure and advanced cloud network interface card that accelerates mission-critical, data-center applications, such as virtualization, SDN/NFV, big data, machine learning, network security, and storage. This SmartNIC provides up to two ports of 100 Gb/s Ethernet connectivity or a single port of 200 Gb/s Ethernet connectivity through a QSFP56 network interface.
ConnectX-5 Ethernet adapter cards are the previous generation of network adapters, offering acceleration engines that optimize the performance of data analytics, high-performance computing, and virtualization workloads. ConnectX-5 supports multiple network interfaces. In this design guide, we use two ports of 25 Gb/s Ethernet connectivity.
Both ConnectX-6 and ConnectX-5 support Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE), the network protocol required for multinode training with GPUDirect RDMA.
To be successful with machine learning and AI initiatives, enterprises need a modern coherent computing infrastructure that provides functionality, performance, security, and scalability. Organizations also benefit when they can run both development and production workloads with common technology. With NVIDIA-Certified Systems™ from Dell Technologies, enterprises can confidently choose performance-optimized hardware that runs VMware and NVIDIA software solutions—all backed by enterprise-grade support.
NVIDIA-Certified Systems from Dell Technologies have been proven through a rigorous suite of functional and performance tests. The test results confirm these servers can deliver high performance both in single-node and networked multinode cluster training and inference benchmarks.
Dell Technologies produces a range of PowerEdge servers that are qualified as NVIDIA-Certified Systems. NVIDIA-Certified Systems are shipped with the NVIDIA Ampere architecture’s A100 Tensor Core GPU and the latest NVIDIA Mellanox ConnectX-6 network adapters. An NVIDIA-Certified System conforms to NVIDIA design best practices and has passed certification tests that address a range of use cases. These use cases include deep learning training, AI inference, data science algorithms, intelligent video analytics, security, and network and storage offload for both single-node and multinode clusters. The certified systems also qualify for NVIDIA-Certified Systems Software Support, which you can purchase from the NVIDIA GPU Cloud (NGC™) catalog for NVIDIA drivers and NVIDIA-published software.
The NGC container registry service provides researchers, data scientists, and developers with access to a comprehensive catalog of containerized GPU-accelerated software for AI, machine learning, and high-performance computing (HPC). These containers are fully configured to take advantage of NVIDIA GPUs whether on-premises, at the edge, or in the cloud. Each container image is fully optimized to work across a wide variety of NVIDIA GPU platforms.
The NGC container registry supports containers as well as Kubernetes-ready Helm charts. This design guide addresses only containers; Helm charts are not supported in this release.
NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and data analytics software that is optimized, certified, and supported by NVIDIA to run on VMware vSphere with NVIDIA-Certified Systems, as shown in the following figure:
NVIDIA AI Enterprise includes key enabling technologies and software from NVIDIA for rapid deployment, management, and scaling of AI workloads in the modern hybrid cloud. NVIDIA licenses and supports NVIDIA AI Enterprise.
With NVIDIA AI Enterprise running on PowerEdge servers managed by vSphere, customers can avoid silos of AI-specific systems that are difficult to manage and secure. They can also mitigate the risks of shadow AI deployments, in which data scientists and machine learning engineers procure resources outside of the IT ecosystem.