Figure 2. NVIDIA AI Enterprise—a comprehensive AI suite
NVIDIA AI Enterprise includes key enabling technologies and software from NVIDIA for rapid deployment, management, and scaling of AI workloads in the modern hybrid cloud. NVIDIA licenses and supports NVIDIA AI Enterprise. NVIDIA AI Enterprise replaces NVIDIA Virtual Compute Server (vCS) for vGPU compute workloads running in VMware vSphere.
The software in the NVIDIA AI Enterprise suite is organized into the following layers:
NVIDIA AI Enterprise is licensed per CPU socket and can be purchased through Dell Software & Peripherals. You can purchase NVIDIA AI Enterprise either as a perpetual license with support services or as an annual or multiyear subscription. A perpetual license provides the right to use the NVIDIA AI Enterprise software indefinitely, with no expiration. Perpetual licenses must be purchased together with one-year, three-year, or five-year support services. A one-year support service is also available for renewals. For more information, see the NVIDIA AI Enterprise Packaging, Pricing, and Licensing Guide.
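Because licensing is per CPU socket, the number of licenses for a cluster follows from the socket count of each host. The following is a minimal sketch of that arithmetic, not an official sizing tool. The `Host` class, the host names, and the assumption that every socket in every host is counted are illustrative only; confirm the counting rules against the licensing guide.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Host:
    """Hypothetical description of one server in the cluster."""
    name: str
    cpu_sockets: int


def licenses_required(hosts: Iterable[Host]) -> int:
    """Return one license per CPU socket, summed over all hosts.

    Assumption: every socket in every listed host needs a license.
    """
    return sum(host.cpu_sockets for host in hosts)


# Example: four dual-socket hosts (names are placeholders).
hosts = [Host(f"esxi-{i:02d}", cpu_sockets=2) for i in range(1, 5)]
print(licenses_required(hosts))  # 8
```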
NVIDIA Support Services for the NVIDIA AI Enterprise software suite provides seamless access to comprehensive software patches, updates, upgrades, and technical support.
The Tensor Core technology in the NVIDIA Ampere architecture delivers substantial performance gains for AI workloads. Large-scale testing and customer case studies indicate that Ampere-based GPUs with Tensor Cores can reduce training times from weeks to hours. Two types of GPUs are available for compute workloads:
Note: NVIDIA AI Enterprise also supports NVIDIA A40 and T4 GPUs. The NVIDIA A40 GPU accelerates the most demanding visual computing workloads from the data center, combining the latest NVIDIA Ampere architecture RT Cores, Tensor Cores, and CUDA Cores with 48 GB of graphics memory. We recommend the NVIDIA T4 GPU for existing installations. NVIDIA A40 and T4 GPUs do not support Multi-Instance GPU (MIG).
A100 and A30 GPUs support the MIG feature, which allows administrators to partition a single GPU into multiple instances, each fully isolated with its own high-bandwidth memory, cache, and compute. The A100 PCIe card supports MIG configurations with up to seven GPU instances per A100 GPU, while the A30 GPU supports up to four GPU instances. For more information, see the section about virtual GPUs in the Virtualizing GPUs for AI with VMware and NVIDIA design guide.
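The per-GPU limits above translate directly into an upper bound on the number of isolated GPU instances that a cluster can expose. The following is a minimal sketch of that calculation, assuming every GPU is partitioned into its smallest supported instance size. The function name and the sample inventory are hypothetical, and actual instance counts depend on the MIG profiles that an administrator selects.

```python
# Maximum MIG GPU instances per physical GPU, as stated in the text above.
MAX_MIG_INSTANCES = {"A100": 7, "A30": 4}


def max_gpu_instances(gpu_counts: dict[str, int]) -> int:
    """Return the upper bound on MIG instances for a GPU inventory.

    gpu_counts maps a GPU model name to the number of physical GPUs.
    Models without MIG support (for example, A40 or T4) are rejected.
    """
    total = 0
    for model, count in gpu_counts.items():
        if model not in MAX_MIG_INSTANCES:
            raise ValueError(f"{model} does not support MIG")
        if count < 0:
            raise ValueError(f"GPU count for {model} must not be negative")
        total += MAX_MIG_INSTANCES[model] * count
    return total


# Example inventory (illustrative only): two A100 and four A30 GPUs.
print(max_gpu_instances({"A100": 2, "A30": 4}))  # 2*7 + 4*4 = 30
```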
The ConnectX-6 Dx SmartNIC is a secure and advanced cloud network interface card that accelerates mission-critical data center applications, such as virtualization, SDN/NFV, big data, machine learning, network security, and storage. ConnectX-6 supports multiple network interfaces. In this validated design, we use ConnectX-6 Lx for 25 Gb/s Ethernet connectivity and optionally ConnectX-6 Dx for 100 Gb/s Ethernet connectivity.
ConnectX-5 Ethernet adapter cards are the previous generation of network adapters. They offer acceleration engines that optimize the performance of data analytics, high-performance computing, and virtualization workloads. ConnectX-5 supports multiple network interfaces. In this validated design, we use two ports for 25 Gb/s Ethernet connectivity.
Both ConnectX-6 and ConnectX-5 support Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE), the network protocol required for multinode training with GPUDirect RDMA.
To be successful with machine learning and AI initiatives, enterprises need a modern coherent computing infrastructure that provides functionality, performance, security, and scalability. Organizations also benefit when they can run both development and production workloads with common technology. With NVIDIA-Certified Systems from Dell Technologies, enterprises can confidently choose performance-optimized hardware that runs VMware and NVIDIA software solutions—all backed by enterprise-grade support.
Dell Technologies produces a range of PowerEdge servers and VxRail HCI appliances that are qualified as NVIDIA-Certified Systems. NVIDIA-Certified Systems are shipped with NVIDIA Ampere architecture A100 and A30 Tensor Core GPUs and the latest NVIDIA Mellanox ConnectX-6 network adapters.
A subset of NVIDIA-Certified Systems goes through additional certification, including VMware GPU certification, to ensure compatibility with NVIDIA AI Enterprise. An NVIDIA-Certified System that is compatible with NVIDIA AI Enterprise conforms to NVIDIA design best practices and has passed certification tests that address a range of use cases on VMware vSphere infrastructure. These use cases include deep learning training, AI inference, data science algorithms, intelligent video analytics, security, and network and storage offload for both single-node and multinode clusters.