Cluster management software
This Dell Validated Design uses the Omnia project as its cluster management software. Omnia is Dell's first and only open-source tool for this purpose. It integrates best-of-breed open-source projects via Ansible to provision, configure, deploy, and monitor the cluster. The Omnia project resides on GitHub, and documentation and installation guides are available on ReadTheDocs. ProSupport for Omnia is also available and can be purchased at server point-of-sale.
| Task | Deployment | Additional details |
| --- | --- | --- |
| Provision and OS setup | Omnia | Provisioned using a mapping file |
| CUDA installation | Omnia | Latest version (12.3.1) installed |
| Slurm installation | Omnia | Installed, but not needed for this workload |
| Kubernetes installation | Omnia | Version 1.19 installed |
| Install NVIDIA device plug-in on compute nodes | Manual | This plugin allows Kubernetes to access all GPUs on a compute node, track GPU health, and run GPU-enabled containers. Omnia 1.5 currently does not install this package. Resolution: Verify that all prerequisites are met and follow the installation guide. |
| Install NVIDIA GPU Feature Discovery on compute nodes | Manual | This plugin automatically detects and labels GPUs on a node for consumption by Kubernetes. Omnia 1.5 installs version 0.7.0, while the most recent version at the time of this writing is 0.8.2. CUDA driver 12.3.1 requires the latest version to be installed. Resolution: Follow the installation guide. |
| Install NVIDIA Fabric Manager on compute nodes | Manual | NVIDIA Fabric Manager is required on compute servers that use NVSwitch for GPU-to-GPU communication within a single machine. Omnia 1.5 currently does not install this package. Resolution: Use the APT installer on Ubuntu/Debian compute nodes or the YUM installer on RHEL/Rocky compute nodes. |
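The three manual steps in the table above can be sketched as the following commands. This is a hedged outline only: the manifest URLs, release tags, and the Fabric Manager package suffix are assumptions based on NVIDIA's usual release layout and should be confirmed against the current NVIDIA documentation for your driver branch.

```shell
# 1. NVIDIA device plug-in: deploy the upstream DaemonSet manifest.
#    (Release tag v0.14.0 is an assumption; pick the release matching your driver.)
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# 2. NVIDIA GPU Feature Discovery: deploy the 0.8.2 release manifest.
#    (Manifest path is an assumption based on the project's repository layout.)
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/v0.8.2/deployments/static/gpu-feature-discovery-daemonset.yaml

# 3. NVIDIA Fabric Manager: install the package matching the driver branch
#    and start the service. (The "-545" suffix is an assumption for the
#    CUDA 12.3-era driver; confirm the branch installed on your nodes.)
sudo apt-get install -y nvidia-fabricmanager-545   # Ubuntu/Debian
# sudo yum install -y nvidia-fabricmanager-545     # RHEL/Rocky
sudo systemctl enable --now nvidia-fabricmanager
```

The two Kubernetes components are run as DaemonSets so every GPU node receives a copy; Fabric Manager, by contrast, is a host-level systemd service and must be installed on the node OS itself.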
Issue: The Kubernetes pod networking tool Calico requires the NICs on the management and compute nodes to have matching names.
Workaround: Run the following command on the management (control plane) node, which also restarts the Calico pods:
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=eno8403
where "eno8403" should be replaced with the name of the compute node's NIC.
Note: To find a compute node's NIC name, run ifconfig on that node.
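To confirm the workaround took effect, something like the following can be run on the control plane node. The commands are standard kubectl usage; the interface name "eno8403" is the example NIC from the workaround and should be replaced with your own.

```shell
# Wait for the Calico DaemonSet to finish rolling out after the env change.
kubectl rollout status daemonset/calico-node -n kube-system

# Confirm the autodetection method now points at the intended interface;
# this should print: interface=eno8403
kubectl get daemonset calico-node -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="IP_AUTODETECTION_METHOD")].value}'
```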
Issue: Docker Hub limits the number of image downloads ("pulls") for unauthenticated users.
Workaround: If the Docker pull limit is reached, provide Docker credentials in omnia_config.yml.
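A sketch of the credential entries in omnia_config.yml is shown below. The variable names are assumptions based on typical Omnia releases and should be checked against the Omnia documentation for the version in use.

```yaml
# Hypothetical omnia_config.yml fragment: Docker Hub credentials that Omnia
# uses to authenticate image pulls and avoid the anonymous rate limit.
docker_username: "your_dockerhub_user"    # assumption: variable name per Omnia docs
docker_password: "your_dockerhub_token"   # prefer an access token over the account password
```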