Increased Automation, Scale, and Capability with Omnia 1.1
Mon, 15 Nov 2021 23:08:56 -0000
The release of Omnia version 1.0 in March of 2021 was a huge milestone for the Omnia community. It was the culmination of nearly a year of planning, conversations with customers and community members, development, and testing. Omnia version 1.0 included:
- bare-metal provisioning with Cobbler,
- automated Slurm and Kubernetes cluster deployment, and
- automated Kubeflow deployment.
The Omnia project was designed to rapidly add features and evolve, and we are proud to announce the first update to Omnia just 7 months later. While version 1.0 had a ton of great features for a first release, version 1.1 turned out to be even bigger!
The Omnia Project
Omnia is an open-source, community-driven framework for deploying high-performance computing (HPC) clusters for simulation & modeling, artificial intelligence, and data analytics. By automating the entire process, Omnia reduces deployment time for these complex systems from weeks to hours.
Omnia was incubated at Dell Technologies in partnership with Intel. The project was initiated by two HPC & AI experts who needed to quickly set up proof-of-concept clusters in Dell’s HPC & AI Innovation Lab, and it has since grown into a much larger effort to create production-grade clusters on demand and at scale. Today, Omnia has thirty collaborators from nearly a dozen organizations, including five official community member organizations. The code repo has been cloned over a thousand times and has over forty thousand views! The project is off to a great start, with more new features releasing regularly!
Omnia 1.1
Omnia version 1.1 includes a multitude of new features and capabilities that expand datacenter automation beyond the compute server. This latest release lays the groundwork for Omnia to handle future exascale supercomputer deployments while also allowing the set of end-user features and platforms to grow more rapidly.
New in Omnia 1.1
- iDRAC-based provisioning
- PowerVault provisioning/configuration (automatically turns a PowerVault array into an NFS file share)
- Parallel gang scheduling for Kubernetes (for MPI and Spark jobs)
- User authentication/management using LDAP/Kerberos
- Automatic firmware updating for PowerEdge servers with Intel® 2nd-generation Xeon® Scalable processors when using iDRAC for provisioning
- Automatic configuration of Dell PowerSwitch 100Gb Ethernet and Nvidia InfiniBand switches
- Updated AWX GUI for deploying logical clusters
- Additional MLOps platform options (Polyaxon, in addition to the existing Kubeflow)
A brand-new control plane designed for future growth
The new control plane (formerly called the Omnia appliance) is now a full Kubernetes-based deployment with a wealth of features. It includes Dell iDRAC integration for firmware updates and OS provisioning when iDRAC Enterprise or Datacenter licenses are detected, with automatic fallback to Cobbler-based PXE provisioning when those licenses are not available. This lets cluster administrators using Dell servers take full advantage of their iDRAC Enterprise or Datacenter licenses, while Omnia continues to offer a fully open-source and vendor-agnostic alternative. This new Kubernetes-based control plane is the first step toward an expandable, multi-server control plane that could manage the bare-metal provisioning and deployment of thousands of compute nodes for petascale and eventual exascale systems.
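The provisioning decision can be pictured roughly as follows. This is a hypothetical shell sketch, not Omnia's actual implementation; it assumes the Dell racadm utility is available and that the iDRAC address and credentials for a node are already known.

# Hypothetical illustration only -- Omnia implements this decision in its Ansible playbooks.
# Assumes racadm is installed and $idrac_ip, $idrac_user, and $idrac_pass are set.
if racadm -r "$idrac_ip" -u "$idrac_user" -p "$idrac_pass" license view 2>/dev/null \
      | grep -Eq 'Enterprise|Datacenter'; then
    echo "iDRAC Enterprise/Datacenter license detected: provision this node through iDRAC"
else
    echo "No qualifying iDRAC license: fall back to Cobbler-based PXE provisioning"
fi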
Automatically detect and deploy more than just servers
The development team has also extended Omnia’s automation capability beyond compute servers. The control plane is now able to automatically detect and configure Dell EMC PowerSwitches, Nvidia/Mellanox InfiniBand switches, and Dell EMC PowerVault storage arrays. This allows users to deploy complete HPC environments using Omnia’s one-touch philosophy, with compute, network, and storage pieces ready to go. Dell EMC PowerSwitches are automatically configured for both management and fabric deployments, with automatic configuration of RoCEv2 for supported 100Gbps Ethernet switches. Nvidia InfiniBand fabrics are automatically deployed when an InfiniBand switch is detected, with the subnet manager running on the control plane. And when the control plane detects a Dell EMC PowerVault ME4 storage array, it automatically configures the RAID, formats the array, and sets up an NFS service that can be shared by the various logical clusters in the Omnia resource pool. In less than a day, a loading dock full of servers, storage, and networking can be transformed into a functional Omnia resource pool, ready to be configured into logical Slurm and Kubernetes clusters on demand.
Automated deployment of LDAP services
Starting with version 1.1, Omnia also reduces the pain of user management. When logical Slurm clusters are created, Omnia takes care of all the backend services needed for a fully functional, batch-scheduled simulation and modeling environment, including Kerberos user authentication with FreeIPA. System administrators immediately have access to both a CLI and a web-based interface for user management, built upon well-known open-source components and standard protocols. Systems can also be configured to point to an existing LDAP service elsewhere in the data center.
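As a minimal sketch of day-to-day user management on such a cluster (assuming the FreeIPA client tools are installed on a login node and an admin principal exists; the user name below is purely illustrative):

# Obtain a Kerberos ticket as the FreeIPA admin principal
kinit admin

# Create a cluster user (name and details are illustrative)
ipa user-add jdoe --first=Jane --last=Doe --password

# Verify the new account
ipa user-show jdoe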
Preparing Kubernetes for HPC workloads
Interest in Kubernetes has been growing in the HPC community, especially for data science and data analytics workloads. Interest in those use cases is precisely why Omnia included the ability to deploy Kubernetes from the start. However, default configurations of Kubernetes are missing some of the key components needed to make it useful for parallel and distributed data processing. Omnia version 1.0 included the mpi-operator from the Kubeflow project, which provides custom resource definitions (CRDs) for MPI job execution. Version 1.1 now includes the spark-operator to make executing Spark jobs simpler as well. Another feature of version 1.1 is the option to use gang scheduling for Kubernetes pods through the Volcano project. This gives Kubernetes the ability to understand that all the pods in an MPI job should be scheduled simultaneously, rather than deploying pods a few at a time as resources become available.
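As a rough sketch of how gang scheduling is expressed with Volcano (assuming Volcano has been deployed on the cluster; the group name and size below are illustrative, not Omnia defaults), a PodGroup declares how many pods must be schedulable before any of them start:

# Illustrative only -- assumes Volcano is installed and kubectl is pointed at the cluster.
cat <<'EOF' | kubectl apply -f -
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: mpi-demo
spec:
  minMember: 4    # no pod in this group starts until all 4 can be placed
EOF

The pods that make up the MPI job then request the Volcano scheduler (schedulerName: volcano) and reference the group with the scheduling.k8s.io/group-name annotation, so Volcano treats them as a single gang.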
A new platform for neural network research
Artificial intelligence research has been a central workload for Omnia. Providing easy-to-deploy MLOps platforms like Kubeflow is critical to giving data scientists and AI researchers the flexibility to experiment with new neural network architectures. In addition to Kubeflow, Omnia now offers automated installation of the Polyaxon deep learning platform (a brief usage sketch follows the list below). Polyaxon gives neural network researchers and data science teams the ability to:
- index and catalog experiments,
- execute Distributed TensorFlow experiments,
- train MPI-enabled TensorFlow and PyTorch models, and
- tune/optimize models using parametric sweeps of hyperparameter values.
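Here is the promised sketch of submitting work to Polyaxon. It is illustrative only: it assumes the Polyaxon CLI is installed and configured against the cluster's Polyaxon deployment, the component name, image, and command are made up, and the file follows the Polyaxon v1-style polyaxonfile layout.

# Illustrative only -- assumes the Polyaxon CLI is installed and pointed at the cluster.
cat > polyaxonfile.yaml <<'EOF'
version: 1.1
kind: component
name: hello-polyaxon          # illustrative name
run:
  kind: job
  container:
    image: tensorflow/tensorflow:2.4.1
    command: ["python", "-c", "print('hello from Polyaxon')"]
EOF

polyaxon run -f polyaxonfile.yaml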
Even greater things are on the horizon!
Version 1.1 is a big release, but the Omnia community has even greater things planned. Soon we will be adding support for the entire line of Dell EMC PowerEdge servers with Intel® 3rd-generation Xeon® Scalable (code name “Ice Lake”) processors. Additionally, Omnia will soon be able to deploy logical clusters on top of servers provisioned with either Rocky Linux or CentOS, offering users a choice of base operating systems. Looking farther out, we are working with our customers, technology partners, and community members to bring support for creating BeeGFS filesystems on demand, deploying new user platforms like Open OnDemand, and providing better administrative interfaces for Kubernetes cluster administration through Lens. Anyone is free to look at what we’re working on (and suggest new things to try) by going to the Omnia GitHub.
Learn More
Learn more about Omnia by visiting Omnia on GitHub.
Read the Dell Technologies solution overview on Omnia here.
Related Blog Posts
Containerized HPC Workloads Made Easy with Omnia and Singularity
Mon, 28 Jun 2021 14:35:14 -0000
Maximizing application performance and system utilization has always been important for HPC users. The libraries, compilers, and applications found on these systems are the result of heroic efforts by HPC system administrators and teams of HPC specialists who fine-tune, test, and maintain optimal builds of complex hierarchies of software for users. Fortunately for both researchers and administrators, some of that burden can be relieved with the use of containers, where software solutions can be built to run reliably when moved from one computing environment to another. This includes moving from one research lab to another, or from the developer’s laptop to a production lab, or even from an on-prem data center to the cloud.
Singularity has provided HPC system administrators and users a way to take advantage of application containerization while running on batch-scheduled systems. Singularity is a container runtime that can build containers in its own native format, as well as execute any OCI-compatible container. By default, Singularity enforces security restrictions on containers by running in user space, and it can preserve user identification when run through batch schedulers, providing a simple method to deploy containerized workloads in multi-user HPC environments.
Promoting best practices for HPC system deployment and use is a goal of Omnia, and we recognize that those practices vary across industry and research institutions. Omnia is developed with the entire community in mind, and we aim to provide tools that help that community be productive. To this end, we recently added Singularity as an automatically installed package when deploying Slurm clusters with Omnia.
Building a Singularity-enabled cluster with Omnia
Installing a Slurm cluster with Omnia and running a Singularity job is simple. We provide a repository of Ansible playbooks that configure a pile of metal or cloud resources into a ready-to-use Slurm cluster, either by applying the Slurm role in AWX or by running the playbook on the command line:
ansible-playbook -i inventory omnia.yaml --skip-tags kubernetes
Once the playbook has completed, users are presented with a fully functional Slurm cluster with Singularity installed. We can run a simple “hello world” example, using containers directly from Singularity Hub. Here is an example Slurm submission script that does so:
#!/bin/bash
#SBATCH -J singularity_test
#SBATCH -o singularity_test.out.%J
#SBATCH -e singularity_test.err.%J
#SBATCH -t 0-00:10
#SBATCH -N 1

# pull example Singularity container
singularity pull --name hello-world.sif shub://vsoch/hello-world

# execute Singularity container
singularity exec hello-world.sif cat /etc/os-release
Executing HPC applications without installing software
The “hello world” example is great, but it doesn’t demonstrate running real HPC codes. Fortunately, several hardware vendors have begun to publish containers for both HPC and AI workloads, such as Intel's oneContainer and Nvidia's NGC. Nvidia NGC is a catalog of GPU-accelerated software arranged in collections, containers, and Helm charts. This free-to-use repository has the latest builds of popular software used for deep learning and simulation, with optimizations for Nvidia GPU systems. With Singularity we can take advantage of the NGC containers on our bare-metal Slurm cluster. Starting with the LAMMPS example on the NGC website, we demonstrate how to run a standard Lennard-Jones 3D melt experiment without having to compile all the libraries and executables.
The input file for running this benchmark, in.lj.txt, can be downloaded from the Sandia National Laboratory site:
wget https://lammps.sandia.gov/inputs/in.lj.txt
Next, make a local copy of the LAMMPS container from NGC and name it lammps.sif:
singularity build lammps.sif docker://nvcr.io/hpc/lammps:29Oct2020
The benchmark can be executed directly from the command line using srun. The following example runs 8 tasks on 2 nodes with a total of 8 GPUs:
srun --mpi=pmi2 -N2 --ntasks=8 --ntasks-per-socket=2 singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd lammps.sif lmp -k on g 8 -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 -var x 8 -var y 8 -var z 8 -in /host_pwd/in.lj.txt
Alternatively, the following example Slurm submission script will permit batch execution with the same parameters as above, 8 tasks on 2 nodes with a total of 8 GPUs:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-socket=2
#SBATCH --time 00:10:00

set -e; set -o pipefail

# Build SIF, if it doesn't exist
if [[ ! -f lammps.sif ]]; then
    singularity build lammps.sif docker://nvcr.io/hpc/lammps:29Oct2020
fi

readonly gpus_per_node=$(( SLURM_NTASKS / SLURM_JOB_NUM_NODES ))

echo "Running Lennard Jones 8x4x8 example on ${SLURM_NTASKS} GPUS..."
srun --mpi=pmi2 \
    singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd lammps.sif \
    lmp -k on g ${gpus_per_node} -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 \
    -var x 8 -var y 8 -var z 8 -in /host_pwd/in.lj.txt
Containers provide a simple solution to the complex task of building optimized software to run anywhere. Researchers are no longer required to attempt building software themselves or to wait for software to be made available at the site where they are running. Whether running on a workstation, laptop, on-prem HPC resource, or cloud environment, they can be sure they are using the same optimized version for every run.
Omnia is an open source project that makes it easy to set up a Slurm or Kubernetes environment. When we combine the simplicity of Omnia for system deployment with Nvidia NGC containers for optimized software, both researchers and system administrators can concentrate on what matters most: getting results faster.
Learn more
Learn more about Singularity containers at https://sylabs.io/singularity/. Omnia is available for download at https://github.com/dellhpc/omnia.
Taming the Accelerator Cambrian Explosion with Omnia
Thu, 23 Sep 2021 18:29:00 -0000
We are in the midst of a compute accelerator renaissance. Myriad new hardware accelerator companies are springing up with novel architectures and execution models for accelerating simulation and artificial intelligence (AI) workloads, each with a purported advantage over the others. Many are still in stealth, some have become public knowledge, others have started selling hardware, and still others have been gobbled up by larger, established players. This frenzied activity in the hardware space, driven by the growth of AI as a way to extract even greater value from new and existing data, has led some to liken it to the “Cambrian Explosion,” when life on Earth diversified at a rate not seen before or since.
If you’re in the business of standing up and maintaining infrastructure for high-performance computing and AI, this type of rapid diversification can be terrifying. How do I deal with all of these new hardware components? How do I manage all of the device drivers? What about all of the device plugins and operators necessary to make them function in my container-orchestrated environment? Data scientists and computational researchers often want the newest technology available, but putting it into production can be next to impossible. It’s enough to keep HPC/AI systems administrators lying awake at night.
At Dell Technologies, we now offer many different accelerator technologies within our PowerEdge server portfolio, from Graphics Processing Units (GPUs) in multiple sizes to Field-Programmable Gate Array (FPGA)-based accelerators. And there are even more to come. We understand that it can be a daunting task to manage all of this different hardware; it’s something we do every day in Dell Technologies’ HPC & AI Innovation Lab. So we’ve developed a mechanism for detecting, identifying, and deploying various accelerator technologies in an automated way, helping to relieve our own deployment headaches. And we’ve integrated that capability into Omnia, an open-source, community-driven high-performance cluster deployment project started by Dell Technologies and Intel.
Deploy-time accelerator detection and installation
We recognize that tomorrow’s high-performance clusters will not be fully homogeneous, consisting of exact copies of the same compute building block replicated tens, hundreds, or thousands of times. Instead, clusters are becoming more heterogeneous, consisting of as many as a dozen different server configurations, all tied together under a single (or, in some cases, multiple) scheduler or container orchestrator.
This heterogeneity can be a problem for many of today’s cluster deployment tools, which rely on the concept of the “golden image” – a complete image of the server's operating system, hardware drivers, and software stack. The golden image model is extremely useful in many environments, such as homogeneous and diskless deployments. But in the clusters of tomorrow, which will try to capture the amazing potential of this hardware diversity, the golden image model becomes unmanageable.
Omnia, by contrast, does not rely on a golden image. We think of cluster deployment like 3D printing: rapidly placing layer after layer of software components and capabilities on top of the hardware until a functional server building block emerges. This allows us, with the use of some intelligent detection and logic, to build bespoke software stacks for each server building block, on demand, at deploy time. From Omnia’s perspective, there’s really no difference between deploying a compute server with no accelerators into a cluster and deploying a compute server with GPUs or FPGAs into that same cluster. We simply pick different component layers during the process.
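As a rough illustration of the idea (a hypothetical shell sketch, not Omnia's actual Ansible logic; the layer names are made up), deploy-time detection probes the hardware and then selects which software layers to apply:

# Hypothetical sketch only -- Omnia expresses this logic as Ansible roles, not a shell script.
layers="base-os slurm-node"    # layers every compute node receives (names are illustrative)

if lspci | grep -qi 'nvidia'; then
    layers="$layers nvidia-driver nvidia-container-toolkit"
elif lspci | grep -qi 'xilinx\|fpga'; then
    layers="$layers fpga-runtime"
fi

echo "Software layers selected for this node: $layers"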
What does this mean for cluster deployment?
It means that clusters can now be built from a variety of heterogeneous server building blocks, all managed together as a single entity. Instead of a cluster of CPU servers, another cluster of GPU-accelerated servers, and yet another cluster of FPGA-accelerated servers, research and HPC IT organizations can manage a single resource with all of the different types of technologies that their users demand, all connected by a unified network fabric and sharing a set of unified storage solutions.
And by using Omnia, the process of deploying clusters of heterogeneous building blocks has been dramatically simplified. Regardless of how many types of building blocks an organization wants to use within their next-generation cluster, it can all be deployed using the same approach, and at the same time. There’s no need to build special images for this type of server and that type of server; simply start the Omnia deployment process, and Omnia’s intelligent software deployment system will do the rest.
Learn more
Omnia is available to download on GitHub today. You can learn more about the Omnia project in our previous blog post.