Note: In a customer deployment, the number of DGX A100 systems and PowerScale storage nodes will vary and can be scaled independently to meet the requirements of the specific DL workloads.
The hardware architecture includes these key components:
- Compute: Four NVIDIA DGX A100 systems. The DGX A100 system is a fully integrated, turnkey hardware and software system that is purpose-built for DL workflows. Each DGX A100 system is powered by eight NVIDIA A100 Tensor Core GPUs that are interconnected using NVIDIA NVSwitch technology, which provides an ultra-high-bandwidth, low-latency fabric for inter-GPU communication. This topology is essential for multi-GPU training because it eliminates the bottleneck of PCIe-based interconnects, which cannot scale performance linearly as the GPU count increases. The DGX A100 system is also equipped with eight single-port NVIDIA Mellanox ConnectX-6 VPI HDR InfiniBand adapters for clustering and two dual-port ConnectX-6 VPI Ethernet adapters for storage and management networking, all capable of 200 Gb/s. A brief training sketch that exercises this topology follows this list.
- Storage: A critical component of DL solutions is high-performance storage. A four-node Dell EMC PowerScale F800 cluster was used in this solution, with an NFS share mounted on the NVIDIA DGX A100 systems. It is uniquely suited for modern DL applications, delivering the flexibility to deal with any data type, the scalability for data sets that range into the petabytes, and the concurrency to support the massive concurrent I/O requests from the GPUs.
- Networking: The solution consists of two network fabrics:
- The NVIDIA Mellanox SN3700V Ethernet switches provide the high-speed “front-end” Ethernet connectivity between the Isilon F800 cluster nodes and the NVIDIA DGX A100 systems. The F800 nodes connect with 40 GbE links, the DGX A100 systems connect with 100 GbE links, and the SN3700V switches automatically forward traffic across the different link speeds with minimal latency. Based on the NVIDIA Spectrum-2 switch ASIC and purpose-built for the modern datacenter, the SN3700V switch combines high-performance packet processing, rich datacenter features, cloud network scale, and visibility. A flexible, unified buffer ensures fair and predictable performance across any combination of ports and speeds from 10 Gb/s to 200 Gb/s, and an Open Ethernet design supports multiple network OS choices, including NVIDIA Cumulus Linux, NVIDIA Onyx, and Software for Open Networking in the Cloud (SONiC).
- The NVIDIA Mellanox QM8700 InfiniBand switches provide high-throughput, low-latency networking between the DGX A100 systems. Designed for both EDR 100 Gb/s and HDR 200 Gb/s InfiniBand links, they minimize latency and maximize throughput for all GPU-to-GPU communication between systems. The QM8700 switches support Remote Direct Memory Access (RDMA) and in-network computing offloads for AI and data analytics, enabling faster and more efficient data transfers. They also support NVIDIA GPUDirect, Mellanox SHARP for network-based AI and analytics offloads (such as MPI AllReduce), and Mellanox SHIELD for maximum resiliency in a self-healing network.
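To make the interplay of these components concrete, the sketch below outlines a distributed training job as it might run on this hardware: one process per GPU, with NCCL carrying gradient allreduce over NVSwitch inside each system and over the InfiniBand fabric between systems, and input data read from the PowerScale NFS share. This is a minimal sketch assuming PyTorch and torchvision launched with torchrun; the mount path /mnt/powerscale, the dataset layout, and the ResNet-50 model are illustrative placeholders, not part of the validated solution.

```python
# Minimal multi-node, multi-GPU training sketch (assumes PyTorch, torchvision, torchrun).
# NCCL uses NVSwitch for intra-node GPU traffic and GPUDirect RDMA over the InfiniBand
# fabric for inter-node allreduce; data is read from a hypothetical NFS mount of the
# PowerScale share at /mnt/powerscale.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
import torchvision.datasets as datasets
import torchvision.models as models
import torchvision.transforms as transforms

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Illustrative dataset location on the NFS-mounted PowerScale share.
    dataset = datasets.ImageFolder(
        "/mnt/powerscale/imagenet/train",
        transform=transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.ToTensor(),
        ]),
    )
    # Several DataLoader workers per GPU keep concurrent reads in flight
    # against the scale-out NAS.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=256, sampler=sampler,
                        num_workers=8, pin_memory=True)

    model = models.resnet50().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients sync via NCCL allreduce
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(1):
        sampler.set_epoch(epoch)
        for images, labels in loader:
            images = images.cuda(local_rank, non_blocking=True)
            labels = labels.cuda(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()   # triggers bucketed allreduce across all participating GPUs
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with torchrun (for example, --nnodes=4 --nproc_per_node=8), such a job would span all 32 A100 GPUs, with NCCL selecting NVSwitch paths inside each system and the HDR InfiniBand adapters between systems.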
The software architecture includes these key components:
- Docker containers or virtual machines: There has been a dramatic rise in the use of software containers to simplify the deployment of applications at scale. You can use either virtual machines or containers that encapsulate all of an application’s dependencies to provide reliable execution of DL training jobs. A Docker container is now more widely used for bundling an application with all its libraries, configuration files, and environment variables so that the execution environment is always the same. To enable portability in Docker images that leverage GPUs, NVIDIA developed the Docker Engine Utility for NVIDIA GPUs, also known as the NVIDIA Container Toolkit, an open-source project that provides a command-line tool. The publicly available CSI driver for PowerScale scale-out NAS provides support for provisioning persistent storage. A minimal container-launch sketch follows.
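As a minimal illustration of how a GPU-enabled container might be launched on a DGX A100 system, the sketch below uses the Docker SDK for Python to request all GPUs through the NVIDIA Container Toolkit and to bind-mount the PowerScale NFS share into the container. The image tag, mount path, and training command are assumed placeholders rather than the exact software used in this solution; the equivalent docker run command with the --gpus all flag achieves the same result.

```python
# Sketch: launch a GPU-enabled training container via the Docker SDK for Python.
# Assumes the NVIDIA Container Toolkit is installed on the DGX A100 system;
# the image tag, mount path, and command below are illustrative placeholders.
import docker

client = docker.from_env()

container = client.containers.run(
    "nvcr.io/nvidia/pytorch:23.10-py3",           # example NGC image tag
    command="python /workspace/train.py",         # placeholder training script
    device_requests=[                             # equivalent to `docker run --gpus all`
        docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
    ],
    volumes={
        "/mnt/powerscale": {"bind": "/data", "mode": "rw"}  # NFS share from PowerScale
    },
    shm_size="16g",                               # larger shared memory for DataLoader workers
    detach=True,
)

container.wait()                                  # block until the training job exits
print(container.logs().decode())
```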