Containerized HPC Workloads Made Easy with Omnia and Singularity
Mon, 28 Jun 2021 14:35:14 -0000
Maximizing application performance and system utilization has always been important for HPC users. The libraries, compilers, and applications found on these systems are the result of heroic efforts by HPC system administrators and teams of HPC specialists who fine-tune, test, and maintain optimal builds of complex hierarchies of software for users. Fortunately for both researchers and administrators, some of that burden can be relieved with the use of containers, where software solutions can be built to run reliably when moved from one computing environment to another. This includes moving from one research lab to another, from a developer’s laptop to a production lab, or even from an on-prem data center to the cloud.
Singularity gives HPC system administrators and users a way to take advantage of application containerization while running on batch-scheduled systems. Singularity is a container runtime that can build containers in its own native format, as well as run containers from OCI-compatible (for example, Docker) images. By default, Singularity enforces security restrictions by running containers in user space, and it preserves the user's identity when jobs are run through batch schedulers, providing a simple method to deploy containerized workloads in multi-user HPC environments.
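As a minimal sketch of what that user-space model looks like in practice (the ubuntu image tag here is just an example), an ordinary unprivileged user can run a command inside a Docker-format image and see that the process keeps their own UID and GID:

# pull and run an OCI (Docker) image as an ordinary user; no root required
# 'id' inside the container reports the same user and group as on the host
singularity exec docker://ubuntu:20.04 id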
Capturing best practices for HPC system deployment and use is the goal of Omnia, and we recognize that those practices vary across industry and research institutions. Omnia is developed with the entire community in mind, and we aim to provide tools that help that community be productive. To this end, we recently added Singularity as an automatically installed package when deploying Slurm clusters with Omnia.
Building a Singularity-enabled cluster with Omnia
Installing a Slurm cluster with Omnia and running a Singularity job is simple. We provide a repository of Ansible playbooks that configure a pile of metal or a set of cloud resources into a ready-to-use Slurm cluster, either by applying the Slurm role in AWX or by running the playbook on the command line:
ansible-playbook -i inventory omnia.yaml --skip-tags kubernetes
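The inventory file referenced by -i simply lists the hosts the playbook will configure. A minimal sketch is shown below; the group names and hostnames are illustrative, so consult the Omnia documentation for the exact groups your release expects:

# example Ansible inventory (group names and hostnames are placeholders)
[manager]
head-node-01

[compute]
compute-node-[01:04]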
Once the playbook has completed, users are presented with a fully functional Slurm cluster with Singularity installed. We can run a simple “hello world” example using a container pulled directly from Singularity Hub. Here is an example Slurm submission script for the “Hello World” example:
#!/bin/bash
#SBATCH -J singularity_test
#SBATCH -o singularity_test.out.%J
#SBATCH -e singularity_test.err.%J
#SBATCH -t 0-00:10
#SBATCH -N 1

# pull example Singularity container
singularity pull --name hello-world.sif shub://vsoch/hello-world

# execute Singularity container
singularity exec hello-world.sif cat /etc/os-release
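Assuming the script above is saved as singularity_test.slurm (the filename is ours), submitting it and checking the result follows the usual Slurm workflow:

# quick check that Singularity is available on a compute node
srun -N 1 singularity --version

# submit the job and inspect the output once it completes
sbatch singularity_test.slurm
cat singularity_test.out.*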
Executing HPC applications without installing software
The “hello world” example is great, but it doesn’t demonstrate running real HPC codes. Fortunately, several hardware vendors have begun to publish containers for both HPC and AI workloads, such as Intel's oneContainer and Nvidia's NGC. Nvidia NGC is a catalog of GPU-accelerated software arranged in collections, containers, and Helm charts. This free-to-use repository has the latest builds of popular software for deep learning and simulation, with optimizations for Nvidia GPU systems. With Singularity we can take advantage of NGC containers on our bare-metal Slurm cluster. Starting with the LAMMPS example on the NGC website, we demonstrate how to run a standard Lennard-Jones 3D melt experiment without having to compile any of the libraries and executables.
The input file for running this benchmark, in.lj.txt, can be downloaded from the Sandia National Laboratory site:
wget https://lammps.sandia.gov/inputs/in.lj.txt
Next, make a local copy of the LAMMPS container from NGC and name it lammps.sif:
singularity build lammps.sif docker://nvcr.io/hpc/lammps:29Oct2020
The example can be executed directly from the command line using srun. The following command runs 8 tasks on 2 nodes with a total of 8 GPUs:
srun --mpi=pmi2 -N2 --ntasks=8 --ntasks-per-socket=2 singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd lammps.sif lmp -k on g 8 -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 -var x 8 -var y 8 -var z 8 -in /host_pwd/in.lj.txt
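For readers unfamiliar with these options, here is a brief, unofficial annotation of the key flags in the command above:

# --mpi=pmi2              have Slurm launch the MPI ranks through the PMI-2 interface
# --nv                    expose the host NVIDIA GPUs and driver libraries inside the container
# -B ${PWD}:/host_pwd     bind-mount the current directory into the container at /host_pwd
# --pwd /host_pwd         start the containerized process in that directory
# -k on g 8 -sf kk        enable the LAMMPS Kokkos package on 8 GPUs and apply the kk accelerator suffix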
Alternatively, the following example Slurm submission script will permit batch execution with the same parameters as above, 8 tasks on 2 nodes with a total of 8 GPUs:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-socket=2
#SBATCH --time 00:10:00

set -e; set -o pipefail

# Build SIF, if it doesn't exist
if [[ ! -f lammps.sif ]]; then
    singularity build lammps.sif docker://nvcr.io/hpc/lammps:29Oct2020
fi

readonly gpus_per_node=$(( SLURM_NTASKS / SLURM_JOB_NUM_NODES ))

echo "Running Lennard Jones 8x4x8 example on ${SLURM_NTASKS} GPUS..."

srun --mpi=pmi2 \
    singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd lammps.sif \
    lmp -k on g ${gpus_per_node} -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 \
    -var x 8 -var y 8 -var z 8 -in /host_pwd/in.lj.txt
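If the batch script is saved as, say, lammps_lj.sbatch (the name is ours), submission is a single command; because the script does not set -o, Slurm writes the output to its default slurm-&lt;jobid&gt;.out file:

sbatch lammps_lj.sbatch
squeue -u $USER          # watch the job in the queue
cat slurm-&lt;jobid&gt;.out    # LAMMPS log once the job finishes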
Containers provide a simple solution to the complex task of building optimized software that runs anywhere. Researchers no longer need to build the software themselves or wait for a release to be made available at the site where they are running. Whether running on a workstation, a laptop, an on-prem HPC resource, or a cloud environment, they can be sure they are using the same optimized build for every run.
Omnia is an open source project that makes it easy to set up a Slurm or Kubernetes environment. When we combine the simplicity of Omnia for system deployment with Nvidia NGC containers for optimized software, both researchers and system administrators can concentrate on what matters most: getting results faster.
Learn more
Learn more about Singularity containers at https://sylabs.io/singularity/. Omnia is available for download at https://github.com/dellhpc/omnia.