This section provides guidelines for deploying a GPU cluster running vSphere 7.0 with A100 or A30 GPUs on PowerEdge servers. We focus on enabling MIG and configuring the VMs with access to a virtual GPU (vGPU).
Installing and configuring the NVIDIA AI Enterprise Host software
See the NVIDIA AI Enterprise platform deployment guide for detailed instructions about installing and configuring the NVIDIA Virtual GPU Manager for vSphere, which includes these high-level steps:
- Use the vSphere Installation bundle (VIB) to install the NVIDIA Virtual GPU Manager for vSphere.
- Configure vSphere vMotion with vGPU by enabling an advanced vCenter Server setting.
- Change the Default Graphics Type in vSphere.
- Configure a GPU for MIG-backed vGPUs. By default, MIG mode is not enabled on the A100 or A30 GPU.
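As an illustration of the last step, MIG mode can be enabled per GPU from the ESXi host shell with nvidia-smi. The commands below are a minimal sketch that assumes GPU index 0; a GPU reset or host reboot may be required before the change takes effect:
[root@esxi01:~] nvidia-smi -i 0 -mig 1
[root@esxi01:~] nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv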
Configuring a VM to use MIG
Follow these steps to configure a VM to use MIG:
- Create a VM, provide virtual processor and memory configurations, and then assign a MIG partition, as shown in the following figure:
Figure 6. Creating a VM and assigning a MIG profile
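The MIG-backed vGPU profiles that can be assigned correspond to the GPU instance profiles supported by the GPU. As a quick reference, these profiles can be listed on the ESXi host with nvidia-smi (a minimal sketch, assuming MIG mode is already enabled on the GPU):
[root@esxi01:~] nvidia-smi mig -lgip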
- Boot the guest operating system in the VM in EFI (UEFI) mode, which is required for correct GPU operation:
- In the vSphere Client, select your VM.
- Go to Edit Settings > VM Options > Boot Options. In the Firmware field, ensure that EFI is selected, as shown in the following figure:
Figure 7. VM boot options
- Click Advanced > Configuration Parameters > Edit Configuration, and then edit the parameters required for your GPU configuration.
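For example, GPUs with large BAR1 memory such as the A100 typically require 64-bit MMIO settings along the following lines (the values shown here are illustrative, not prescriptive):
pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "64"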
- Install the Linux operating system on the VM.
- Install the NVIDIA vGPU Software Graphics Driver on the VM.
- License GRID vGPU on the VM.
For detailed instructions for steps 4 through 6, see the Virtual GPU Software User Guide.
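After steps 5 and 6, the driver and license state can be confirmed from inside the VM. One possible check (assuming the guest driver is installed and the VM can reach the licensing service) is:
nvidia-smi -q | grep -i -A 2 "license"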
Configuring a VM with multiple GPUs
The PowerEdge R750xa server supports a maximum of two GPUs per VM. The two GPUs must be connected through NVLink for peer-to-peer communication. Use the following steps to determine which GPUs in the PowerEdge R750xa server are connected through NVLink:
- Ensure that MIG mode is disabled on the PowerEdge server. Multi-GPU configuration is supported only in non-MIG mode.
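One way to confirm the current MIG mode of each GPU on the host (a minimal check using standard nvidia-smi query options) is:
[root@esxi01:~] nvidia-smi --query-gpu=index,mig.mode.current --format=csv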
- Detect the topology between the GPUs by typing the following command:
[root@esxi01:~] nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV12    SYS     SYS
GPU1    NV12     X      SYS     SYS
GPU2    SYS     SYS      X      NV12
GPU3    SYS     SYS     NV12     X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
The preceding example output is from a PowerEdge R750xa system with four A100 GPUs. GPU0 and GPU1 are connected by NVLink, as are GPU2 and GPU3. Assign GPU0 and GPU1 to a single VM.
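NVLink connectivity can also be confirmed per GPU. For example, the following command (assuming GPU index 0) reports the state of each NVLink lane:
[root@esxi01:~] nvidia-smi nvlink --status -i 0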
- Obtain the PCI bus IDs (hex IDs) of the NVLink-connected GPUs from nvidia-smi.
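One way to list the bus IDs (a sketch using standard nvidia-smi query options) is:
[root@esxi01:~] nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv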
- In the VM advanced configuration, add keys for two GPUs:
pciPassthru0.cfg.gpu-pci-id = "00000000:17:00.0"
pciPassthru1.cfg.gpu-pci-id = "00000000:65:00.0"
Mounting datasets from PowerScale to the VM
Use the following steps to mount AI datasets from PowerScale to the VMs:
- Use SSH to connect to the Linux VM.
- Create a directory to serve as the mount point for the remote NFS share:
sudo mkdir /mnt/dataset
- Mount the NFS share by running the following command as root or as a user with sudo privileges:
sudo mount -t nfs <NFS IP addr>:<path> /mnt/dataset
- Verify that the remote NFS volume is successfully mounted:
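For example, either of the following commands, run inside the VM and assuming the /mnt/dataset mount point created earlier, confirms that the share is mounted:
df -h /mnt/dataset
mount | grep /mnt/dataset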
Installing NVIDIA AI Enterprise Suite
The AI and data science applications and frameworks are distributed as NGC container images through the NVIDIA NGC Enterprise Catalog. Each container image contains the entire user-space software stack that is required to run the application or framework, such as the CUDA libraries, cuDNN, any required Magnum IO components, TensorRT, and the framework itself. See the Installing AI and Data Science Applications and Frameworks chapter in the NVIDIA AI Enterprise documentation.
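As an illustration, pulling a framework container from the NGC Enterprise Catalog follows the standard NGC workflow. The registry path and tag below are placeholders, and an NGC API key with an NVIDIA AI Enterprise entitlement is assumed:
docker login nvcr.io        # user name: $oauthtoken, password: <NGC API key>
docker pull nvcr.io/nvaie/tensorflow:<tag>
docker run --gpus all -it --rm nvcr.io/nvaie/tensorflow:<tag>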
Multinode training using GPUDirect RDMA
You can configure multiple VMs to run distributed training of a neural network. Key steps in the configuration include:
- Disabling MIG on the GPU and assigning a non-MIG profile to the VMs
- Configuring a ConnectX network adapter and assigning it to a VM in passthrough mode
- Ensuring that each GPU and NIC pair is on the same PCIe root complex and the same NUMA node
- Configuring RDMA over Converged Ethernet (RoCE) on the network adapter and network switches
For detailed instructions about how to set up multinode training using GPUDirect RDMA, see NVIDIA’s AI Practitioners Deployment Guides.
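Before launching a distributed job, it is useful to confirm the GPU-to-NIC affinity from inside each VM. With the NIC passed through and its drivers installed, nvidia-smi includes the adapter (for example, mlx5_0) in the topology matrix; a PIX or PXB relationship between the GPU and the NIC indicates that they share a PCIe root complex:
nvidia-smi topo -m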