The following tables list the system configurations and software stack used for the validation efforts in this design. Two configurations were validated: one with InfiniBand High Data Rate (HDR) and the other with InfiniBand Next Data Rate (NDR).
Table 6. System configuration
| Component | Config 1 (HDR) | Config 2 (NDR) |
|---|---|---|
| Compute server for model customization | 2 x PowerEdge XE9680 | 2 x PowerEdge XE9680 |
| GPUs per PowerEdge XE9680 server | 8 x NVIDIA H100 SXM GPUs | 8 x NVIDIA H100 SXM GPUs |
| Ethernet network adapters | | |
| Ethernet network switch | 2 x PowerSwitch S5248F-ON | 2 x PowerSwitch S5248F-ON |
| InfiniBand network adapter | 8 x Mellanox ConnectX-6 Single Port HDR200 VPI InfiniBand Adapter PCIe | 8 x NVIDIA ConnectX-7 Single Port NDR OSFP PCIe, No Crypto, Full Height |
| InfiniBand network switch | QM8790 | QM9790 |
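
To confirm that a compute node matches the hardware configuration above, the GPU and InfiniBand inventory can be checked directly on the node. The following is a minimal sketch, assuming the `nvidia-smi` and `ibstat` command-line tools are installed on the PowerEdge XE9680 nodes; it counts the H100 GPUs and reports the link rate of each InfiniBand port (200 Gb/s for HDR in Config 1, 400 Gb/s for NDR in Config 2). The expected values are illustrative and should be adjusted to the configuration being validated.

```python
"""Sketch: verify per-node GPU count and InfiniBand link rates against Table 6.

Assumes nvidia-smi (NVIDIA driver) and ibstat (infiniband-diags) are installed.
"""
import subprocess

EXPECTED_GPUS = 8          # 8 x NVIDIA H100 SXM per XE9680 server (Table 6)
EXPECTED_IB_RATE = "400"   # Gb/s: "200" for Config 1 (HDR), "400" for Config 2 (NDR)


def run(cmd):
    """Run a command and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Count the GPUs reported by the driver and check that they are H100s.
gpu_names = [line.strip() for line in
             run(["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"]).splitlines()
             if line.strip()]
assert len(gpu_names) == EXPECTED_GPUS, f"expected {EXPECTED_GPUS} GPUs, found {len(gpu_names)}"
assert all("H100" in name for name in gpu_names), f"unexpected GPU models: {gpu_names}"

# Collect the link rate of every InfiniBand port reported by ibstat.
rates = [line.split(":", 1)[1].strip() for line in run(["ibstat"]).splitlines()
         if line.strip().startswith("Rate:")]

print(f"GPUs: {len(gpu_names)} x {gpu_names[0]}")
print(f"InfiniBand port rates (Gb/s): {rates}")
assert all(rate == EXPECTED_IB_RATE for rate in rates), \
    f"expected all ports at {EXPECTED_IB_RATE} Gb/s, found {rates}"
```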
Table 7. Software components and versions
| Component | Details |
|---|---|
| Operating system | Ubuntu 22.04.1 LTS |
| Cluster management | NVIDIA Base Command Manager Essentials 10.23.09 |
| Slurm cluster | Slurm 23.02.4 |
| Kubernetes cluster | Version 1.27.6 |
| AI framework | NVIDIA NeMo Framework v23.08.03 |
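
The installed software stack can be cross-checked against the table above from a login node. The sketch below is an assumption-laden example, not part of the validated design: it presumes the standard `lsb_release`, `sinfo`, and `kubectl` tools are available in the path and that the Kubernetes API server is reachable, and it simply looks for the version strings listed in Table 7 in each command's output.

```python
"""Sketch: cross-check the installed software stack against Table 7.

Assumes lsb_release, sinfo, and kubectl are available on a login node.
"""
import subprocess

# Expected version strings taken from Table 7.
EXPECTED = {
    ("lsb_release", "-d"): "Ubuntu 22.04.1 LTS",
    ("sinfo", "--version"): "23.02.4",   # Slurm prints "slurm 23.02.4"
    ("kubectl", "version"): "v1.27.6",   # server version; requires cluster access
}

for cmd, expected in EXPECTED.items():
    out = subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
    status = "OK" if expected in out else "MISMATCH"
    print(f"{status}: {' '.join(cmd)} -> expected to contain {expected!r}")
```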