Artifical Intelligence (AI) and Machine Learning (ML) helps organizations make intelligent data driven business decisions and are critical components to help businesses thrive in a digitally transforming world. While the total annual corporate AI investment has increased substantially since 2019, many organizations are still experiencing barriers to successfully adopt AI. Organizations move along the AI and analytics maturity curve at different rates. Automation methodologies such as Machine Learning Operations (MLOps) and Automatic Machine Learning (AutoML) can serve as the backbone for tools and processes that allow organizations to experiment and deploy models with the speed and scale of a highly efficient, AI-first enterprise. MLOps and AutoML are associated with distinct and important components of AI/ML work streams. This blog introduces how software platfoms like cnvrg.io and H2O Driverless AI make it easy for organizations to adopt these automation methodologies into their AI environment.

This blog is intended to serve as a reference for Dell’s position on MLOps solutions that help organizations scale their AI and ML practices. MLOps and AutoML provide a powerful combination that brings business value out of AI projects quicker and in a secure and scalable way. Dell Validated Designs provides the Reference Architecture that combines the software and Dell hardware to bring these solutions to life.

Importance of automation methodologies

Deploying models to a production environment is an important component to getting the most business value from an AI/ML project. While there are numerous tasks to get a project into production, from Exploratory Data Analysis to model training and tuning, successfully deployed models require additional sets of tasks and procedures, such as runtime model management, model observability and retraining, and inferencing reliability and cost optimization. The lifeycle of a AI/ML project involves disciplines of data engineering, data science, DevOps engineering and roles with differing skillsets across these teams. With all the steps listed above for just a single AI/ML project, it’s not difficult to see the challenges organizations have when faced with wanting to rapidly grow the number of projects across different business units within the organization. Organizations that prioritize ROI, consistency, reusability, traceability, reliability and automation in their AI/ML projects through sets of procedures and tools described in this paper are set up to scale in AI and meet the demand of AI for its business.

Components of an AI/ML project

A typical AI/ML project has many distinct tasks which can flow in a cascading, yet circular manner. This means that while tasks may have dependencies on completion of previous tasks, the continuous learning nature of ML projects create an iterative feedback loop throughout the project.

The following list describes steps that a typical AI/ML project will run through.

Figure 1. Distinct Tasks in an AI/ML Project

Each task serves an important role in the project and can be grouped at a high level by defining the problem statement, the data and modeling work, and productionizing the final model for inference. Because these groups of tasks have different objectives, there are different automation methodologies that have been developed to streamline and scale projects. The concept of Automated Machine Learning (AutoML) was developed to make the data and modeling work as efficient as possible. AutoML is a set of proven and optimized solutions in data modeling. The practice of ModelOps was developed to deploy models faster and more scalable. While AutoML and ModelOps automate specific tasks within a project, the practice of MLOps serves as an umbrella automation methodology that contains guiding principles for all AI/ML projects.

MLOpsThe key to navigating the challenges of inconsistent data, labor constraints, model complexity, and model deployment to operate efficiently and maximize the business value of AI is through the adoption of MLOps. MLOps, at a high level, is the practice of applying software engineering principles of Continuous Improvement and Continuous Delivery (CI/CD) to Machine Learning projects. It is a set of practices that provide the framework for consistencty and reusability that leads to quicker model deployments and scalability.

MLOps tools are the software based applications that help organizations put the MLOps principles into practice in an automated fashion.

The complexities stemming from ever-changing business environments that affect underlying data, inference needs, etc mean that MLOps in AI/ML projects need to have quicker iteratons than typical DevOps software projects.

AutoML

At the heart of a AI/ML project is the quest for business insights, and the tasks that lead to these insights can be done at a scalable and efficient manner with AutoML. AutoML is the process of automating exploratory data analysis (EDA), algorithm selection, training and optimizations of models.

AutoML tools are low-code or no-code platforms that begin with the ingestion of data. Summary statistics, data visualizations, outlier detection, feature interaction, and other tasks associated with EDA are then automatically completed. For model training, AutoML tools can detect what type of algorthms are appopriate for the data and business question and proceed to test each model. AutoML also itierates over hundreds of versions of the models by tweaking the parameters to find the optimal settings. After cross-validation and further testing of the model, a model package is created which includes the data transformations and scoring pipelines for easy deployment into a production environment.

ModelOps

Oncea trained model is ready for deployment in a production environment, a whole new set of tasks and processes begin. ModelOps is the set of practices and processes that support fast and scalable deployment of models to production. Model performance degrades over time for reasons such as underlying trends in the data changing or introduction of new data, so models need to be monitored closely and be updated to keep peak business value throughout its lifecycle.

Model monitoring and management are key components of ModelOps, but there are many other aspects to consider as part of a ModelOps strategy. Managing infrastructure for proper resource allocation (e.g how and when to include accelerators), automatic model re-training in near real time, and integrating with advanced network sevurity solutions, versioning, and migration are other elements that must be considered when thinking about scaling an AI environment.

Dell solution for automation methodologies

Dell offers solutions that bring together the infrastructure and software partnerships to capitalize on the benefits of AutoML, MLOps and ModelOps. Through jointly engineered and tested solutions with Dell Validated Designs, organizations can provide their AI/ML teams with predictable and configurable ML environments and with their operational AI goals.

Dell has partnered with cnvrg.io and H2O to provide the software platforms to pair with the compute, storage, and networking infrastructure from Dell to complete the AI/ML solutions.

MLOps – cnvrg.io

cnvrg.io is a machine learning platform built by data scientists that makes implementation of MLOps and the process of taking models from experimentation to deployment efficient and scalable. cnvrg.io provides the platform to manage all aspects of the ML life cycle, from data pipelines, to experimentation, to model deployment. It is a Kubernetes-based application that allows users to work in any compute environment, whether it be in the cloud or on-premises and have access to any programming language.

The management layer of cnvrg.io is powered by a control plane that leverages Kubernetes to manage the containers and pods that are needed to orchestrate the tasks of a project. Users can view the state and health and resource statistics of the environment and each task using the cnvrg.io dashboard.

cnvrg.io makes it easy to access the algorithms and data components, whether they are pre-trained models or models built from scratch, with Git interaction through the AI Library. Data pre-processing logic or any customized models can be stored and implemented for tasks across any project by using the drag-and-drop interface for end-to-end management called cnvrg.io Pipelines.

The orchestration and scheduling features use Kubernetes-based meta-scheduler, which makes jobs portable across environments and can scale resources up or down on demand. Cnvrg.io facilitates job scheduling across clusters in the cloud and on-premises to navigate through resource contention and bottlenecks. The ability to intelligently deploy and manage compute resources, from CPU, GPU, and other specialized AI accelerators to the tasks where they can be best used is important to achieving operational goals in AI.

cnvrg.io solution architecture

The cnvrg.io software can be installed directly on your data center, or it can be accessed through the cnvrg.io Metacloud offering. Both versions allow users to configure the organization’s own infrastructure into compute templates. For installations into an on-premises data center, cnvrg.io can be deployed on various Kubernetes infrastructures, including bare metal, but the Dell Validated Design for AI uses VMware and NVIDIA to provide a powerful combination of composability and performance.

Dell’s PowerEdge servers that can be equipped with NVIDIA GPUs provide the compute resources required to run any algorithm in machine learning packages like scikit learn to deep learning algorithms in frameworks like TensorFlow and PyTorch. For storage, Dell’s PowerScale appliance with all-flash, scale out NAS storage deliver the concurrency performance to support data heavy neural networks. VMware vSphere with Tanzu allows for the Tanzu Kubernetes clusters, which are managed by Tanzu Mission Control. The servers running VMware vSAN provide a storage repository for the VM and pods. PowerSwitch network switches with a 25 GbE-based design or 100 GbE-based design allow for neural network training jobs than can run on a single node. Finally, the NVIDIA AI Enterprise comes with the software support for GPUs such as fractionalizing GPU resources with the MIG capability.

Dell provides recommendations for sizing of the different worker node configurations, such as the number of CPUs/GPUs and amount of memory, that users can deploy for the various types of algorithms different AI/ML projects may use.

Figure 2. Dell/cnvrg.io Solution Architecture

For more information, see the Design Guide—Optimize Machine Learning Through MLOps with Dell Technologies cnvrg.io.

AutoML – H2O.ai Driverless AI

Dell has partnered with H2O and its flagship product, Driverless AI, to give organizations a comprehensive AutoML platform to empower both data scientists and non-technical folks to unlock insights efficiently and effectively. Driverless AI has several features that help optimize the model development portion of an AI/ML workflow, from data ingestion to model selection, as organizations look to gain faster and higher quality insights to business stakeholders. It is a true no-code solution with a drag and drop type interface that opens the door for citizen data scientists.

Starting with data ingestion, Driverless AI can connect to datasets in various formats and file systems, no matter where the data resides, from on-premises to a clould provider. Once ingested, Driverless AI runs EDA, provides data visualization, outlier detection, and summary statistics on your data. The tool also automatically suggests data transformations based on the shape of your data and performs a comprehensive feature engineering process that search for high-value predictors against the target variable. A summary of the auto-created features is displayed in an easy to digest dashboard.

For model development, Driverless AI automatically trains multiple in-built models, with numerous iterations for hyper parameter tuning. The tool applies a genetic algorithm that creates an ensemble, ‘survival of the fittest’ final model. The user also has the ability to set the priority on factors of accuracy, time, and interpretability. If the user wishes to arrive at a model that needs to be presented to a less technical busines audience, for example, the tool will focus on algorithms that have more explainable features rather than black box type models that may achieve better accuracy with a longer training time. While the Driverless AI tool may be run as a no-code solution, the bring your own recipe feature empowers more seasoned data scientists to bring custom data transformations and algorithms into the tools as part of the experimenting process.

The final output of Driverless AI, after a champion model is crowned, will include a scoring pipeline file that makes it easy to deploy to a production environment for inference. The scoring pipeline can be saved in Python or a MOJO and includes components like data transformations, scripts, runtime, etc.

Driverless AI solution architecture

The H2O Driverless AI platform can be deployed either in Kubernetes as pods or as a stand-alone container. The Dell Validated Design of Driverless AI highlights the flexibility VMware vSphere with Tanzu for the Kubernetes layer works with H2O’s Enterprise Steam to provide resource control and monitoring, access control, and security out of the box.

Dell PowerEdge servers, with optional NVIDIA GPUs, and the NVIDIA AI Enterprise make building containers easy for different sets of users. For use cases that are heavy on EDA or employ traditional machine learning algorithms, Driverless AI containers with CPUs only may be appropriate, while containers with GPUs are best suited for training deep learning models usually associated with natural language processing and computer vision. Dell PowerScale storage and Dell PowerSwitch network adapters provide concurrency at scale to train the data intensive algorithms found within Driverless AI.

Dell provides sizing deployments recommendations specific to an organization’s requirements and capabilities. For organizations starting their AI journey, a deployment with 5 Driverless AI instances, 40 CPU cores, 3.2 TB of memory, and 5 TB storage is recommended for workloads and projects that perform classic machine learning or statistical modeling. For a mainstream deployment with more users and heavier workloads that would benefit from GPU acceleration, 10 Driverless AI instances with 100 CPU cores and 5 NVIDIA A100 GPUs, 8 TB of memory, and 10 TB of storage is recommended. Finally, for high performance deployments for organizations that want to deploy AI models at scale, 20 Driverless AI instances, 200 CPU cores and 10 A100 GPUs, 16 TB of memory, and 20 TB of storage provides the infrastructure for a full-service AI environment.

Figure 3. Dell/H2O Driverless AI Solution Architecture

For more information, see Automate Machine Learning with H2O Driverless AI.

Dell is your partner in your AI Journey

AI is constantly evolving, and many organizations do not have the AI expertise to keep up with designing, developing, deploying, and managing solution stacks at the competitive pace. Dell Technologies is your trusted partner and offers solutions that empower your organization in its AI journey. For over the past decade, Dell has been a proven leader in the advanced computing space that includes industry leading products, solutions, and expertise. We have a specialized team of AI, High Performance Computing (HPC), and Data Analytics experts dedicated to helping you keep pace on your AI journey.

AI Professional Services

Regardless of your AI needs, you can rest assured that your deployments will be backed up by Dell’s world class technology services. Our expert consulting services for AI help you plan, implement, and optimize AI solutions, while more than 35,000 services experts can meet you where you are on your AI journey.

Customer Solution Center

The Customer Solution Center is a team of experienced professionals equipped to provide expert advice, recommendations, and demonstrations of the cutting-edge technologies and platforms essential for successful AI implementation. Our staff maintains a thorough understanding of the diverse needs and challenges of our customers and offers valuable insights garnered from extensive engagement with a broad range of clients. By leveraging our extensive knowledge and expertise, you gain a competitive advantage in your pursuit of AI solutions.

AI and HPC Innovation Lab

The AI and HPC Innovation Lab is a premier infrastructure, equipped with a highly skilled team of computer scientists, engineers, and Ph.D. level experts. This team actively engages with customers and members of the AI and HPC community, fostering partnerships and collaborations to drive innovation. With early access to cutting-edge technologies, the Lab is equipped to integrate and optimize clusters, benchmark applications, establish best practices, and publish insightful white papers. By working directly with Dell's subject matter experts, customers can expect tailored solutions for their specific AI and HPC requirements.

Conclusion

MLOps and AutoML play a critical role in fostering the successful integration of AI/ML into organizations. MLOps provides a standardized framework for ensuring consistency, reusability, and scalability in AI/ML initiatives, while AutoML streamlines the data and modeling process. This synergistic approach enables organizations to make data-driven decisions and derive maximum business value from their AI/ML endeavors. Dell Validated Designs offer a blueprint for implementing MLOps, thereby bringing these concepts to fruition. The dynamic nature of AI/ML projects necessitates rapid iterations and automation to tackle challenges such as data inconsistency and resource limitations. MLOps and AutoML serve as crucial enablers in driving digital transformation and establishing an AI-centric enterprise.

Your Browser is Out of Date

Accelerate Your Journey to AI Success with MLOps and AutoML