Cluster management software
This Dell Validated Design uses the Omnia project as its cluster management software. Omnia is Dell's first and only open-source tool for this purpose. It integrates best-of-breed open-source projects via Ansible to provision, configure, deploy, and monitor the cluster. The Omnia project resides on GitHub, and documentation and installation guides are available on ReadTheDocs. ProSupport for Omnia is also available and can be purchased at server point-of-sale.
| Task | Deployment | Additional details |
| --- | --- | --- |
| Provision and OS setup | Omnia | Provisioned using a mapping file |
| CUDA installation | Omnia | Latest version (12.3.1) installed |
| Slurm installation | Omnia | Installed, but not needed for this workload |
| Kubernetes installation | Omnia | Version 1.19 installed |
| Install NVIDIA device plug-in on compute nodes | Manual | This plugin allows Kubernetes to access all GPUs on a compute node, track GPU health, and run GPU-enabled containers. Omnia 1.5 currently does not install this package. Resolution: Verify that all prerequisites are met and follow the installation guide. |
| Install NVIDIA GPU Feature Discovery on compute nodes | Manual | This plugin automatically detects and labels GPUs on a node for consumption by Kubernetes. Omnia 1.5 installs version 0.7.0, while the most recent version at the time of this writing is 0.8.2. CUDA driver 12.3.1 requires the latest version to be installed. Resolution: Follow the installation guide. |
| Install NVIDIA Fabric Manager on compute nodes | Manual | NVIDIA Fabric Manager is required on compute servers that use NVSwitch for GPU-to-GPU communication within a single machine. Omnia 1.5 currently does not install this package. Resolution: Use the APT installer on Ubuntu/Debian compute nodes or the YUM installer on RHEL/Rocky compute nodes. |
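The three manual steps in the table above can be sketched as the following commands. This is a hedged outline only: the manifest URLs, release tags, and the Fabric Manager package suffix are assumptions based on NVIDIA's usual release layout and should be confirmed against the current NVIDIA documentation for your driver branch.

```shell
# 1. NVIDIA device plug-in: deploy the upstream DaemonSet manifest.
#    (Release tag v0.14.0 is an assumption; pick the release matching your driver.)
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# 2. NVIDIA GPU Feature Discovery: deploy the 0.8.2 release manifest.
#    (Manifest path is an assumption based on the project's repository layout.)
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/v0.8.2/deployments/static/gpu-feature-discovery-daemonset.yaml

# 3. NVIDIA Fabric Manager: install the package matching the driver branch
#    and start the service. (The "-545" suffix is an assumption for the
#    CUDA 12.3-era driver; confirm the branch installed on your nodes.)
sudo apt-get install -y nvidia-fabricmanager-545   # Ubuntu/Debian
# sudo yum install -y nvidia-fabricmanager-545     # RHEL/Rocky
sudo systemctl enable --now nvidia-fabricmanager
```

The two Kubernetes components are run as DaemonSets so every GPU node receives a copy; Fabric Manager, by contrast, is a host-level systemd service and must be installed on the node OS itself.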
Issue: The Kubernetes pod networking tool Calico requires the NICs on the management and compute nodes to have matching names.
Workaround: Run the following command on the management (control plane) node, which also restarts the Calico pods:
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=eno8403
where "eno8403" should be replaced with the name of the compute node's NIC.
Note: To find a compute node's NIC name, run ifconfig on that node.
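To confirm the workaround took effect, something like the following can be run on the control plane node. The commands are standard kubectl usage; the interface name "eno8403" is the example NIC from the workaround and should be replaced with your own.

```shell
# Wait for the Calico DaemonSet to finish rolling out after the env change.
kubectl rollout status daemonset/calico-node -n kube-system

# Confirm the autodetection method now points at the intended interface;
# this should print: interface=eno8403
kubectl get daemonset calico-node -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="IP_AUTODETECTION_METHOD")].value}'
```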
Issue: Docker Hub limits the number of image downloads ("pulls") for unauthenticated users.
Workaround: If the Docker pull limit is reached, provide Docker credentials in omnia_config.yml.
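A sketch of the credential entries in omnia_config.yml is shown below. The variable names are assumptions based on typical Omnia releases and should be checked against the Omnia documentation for the version in use.

```yaml
# Hypothetical omnia_config.yml fragment: Docker Hub credentials that Omnia
# uses to authenticate image pulls and avoid the anonymous rate limit.
docker_username: "your_dockerhub_user"    # assumption: variable name per Omnia docs
docker_password: "your_dockerhub_token"   # prefer an access token over the account password
```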