Multiple virtual machines can be configured to run distributed training of a neural network. Key configuration steps include:
- Disabling MIG on the GPU and assigning a non-MIG profile to the virtual machines
- Configuring a ConnectX network adapter and assigning it to a VM in passthrough mode
- Ensuring each GPU and NIC pair is on the same PCIe root complex and the same NUMA node
- Configuring RDMA over Converged Ethernet (RoCE) on the network adapter and network switches
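The MIG and topology checks above can be sketched with standard host tools. This is a sketch assuming `nvidia-smi` and a Linux hypervisor host; the PCI addresses are illustrative, not taken from the guide:

```shell
# Disable MIG mode on GPU 0 so the full GPU can be assigned to a VM
# with a non-MIG profile (takes effect after a GPU reset or reboot).
nvidia-smi -i 0 -mig 0

# Show the GPU/NIC topology matrix; GPU-NIC pairs that share a PCIe
# root complex are reported as PIX or PXB rather than SYS.
nvidia-smi topo -m

# Confirm the NUMA node of a GPU and a ConnectX NIC via sysfs.
# The PCI addresses below are illustrative placeholders.
cat /sys/bus/pci/devices/0000:3b:00.0/numa_node
cat /sys/bus/pci/devices/0000:5e:00.0/numa_node
```

Matching `numa_node` values for a GPU and its paired NIC satisfy the NUMA requirement above; RoCE itself is configured separately on the adapter and the network switches, as described in the deployment guides.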
For detailed instructions on how to set up multi-node training using GPUDirect RDMA, see NVIDIA's AI Practitioners Deployment Guides.