
Accelerate Your Journey to AI Success with MLOps and AutoML
Tue, 07 Feb 2023 22:05:15 -0000
|Read Time: 0 minutes
Artifical Intelligence (AI) and Machine Learning (ML) helps organizations make intelligent data driven business decisions and are critical components to help businesses thrive in a digitally transforming world. While the total annual corporate AI investment has increased substantially since 2019, many organizations are still experiencing barriers to successfully adopt AI. Organizations move along the AI and analytics maturity curve at different rates. Automation methodologies such as Machine Learning Operations (MLOps) and Automatic Machine Learning (AutoML) can serve as the backbone for tools and processes that allow organizations to experiment and deploy models with the speed and scale of a highly efficient, AI-first enterprise. MLOps and AutoML are associated with distinct and important components of AI/ML work streams. This blog introduces how software platfoms like cnvrg.io and H2O Driverless AI make it easy for organizations to adopt these automation methodologies into their AI environment.
This blog is intended to serve as a reference for Dell’s position on MLOps solutions that help organizations scale their AI and ML practices. MLOps and AutoML provide a powerful combination that brings business value out of AI projects quicker and in a secure and scalable way. Dell Validated Designs provides the Reference Architecture that combines the software and Dell hardware to bring these solutions to life.
Importance of automation methodologies
Deploying models to a production environment is an important component to getting the most business value from an AI/ML project. While there are numerous tasks to get a project into production, from Exploratory Data Analysis to model training and tuning, successfully deployed models require additional sets of tasks and procedures, such as runtime model management, model observability and retraining, and inferencing reliability and cost optimization. The lifeycle of a AI/ML project involves disciplines of data engineering, data science, DevOps engineering and roles with differing skillsets across these teams. With all the steps listed above for just a single AI/ML project, it’s not difficult to see the challenges organizations have when faced with wanting to rapidly grow the number of projects across different business units within the organization. Organizations that prioritize ROI, consistency, reusability, traceability, reliability and automation in their AI/ML projects through sets of procedures and tools described in this paper are set up to scale in AI and meet the demand of AI for its business.
Components of an AI/ML project
A typical AI/ML project has many distinct tasks which can flow in a cascading, yet circular manner. This means that while tasks may have dependencies on completion of previous tasks, the continuous learning nature of ML projects create an iterative feedback loop throughout the project.
The following list describes steps that a typical AI/ML project will run through.
- Objective Specification
- Exploratory Data Analysis (EDA)
- Model Training
- Model Implementation
- Model Optimization and Cross Validation
- Testing
- Model Deployment
- Inference
Figure 1. Distinct Tasks in an AI/ML Project
Each task serves an important role in the project and can be grouped at a high level by defining the problem statement, the data and modeling work, and productionizing the final model for inference. Because these groups of tasks have different objectives, there are different automation methodologies that have been developed to streamline and scale projects. The concept of Automated Machine Learning (AutoML) was developed to make the data and modeling work as efficient as possible. AutoML is a set of proven and optimized solutions in data modeling. The practice of ModelOps was developed to deploy models faster and more scalable. While AutoML and ModelOps automate specific tasks within a project, the practice of MLOps serves as an umbrella automation methodology that contains guiding principles for all AI/ML projects.
MLOpsThe key to navigating the challenges of inconsistent data, labor constraints, model complexity, and model deployment to operate efficiently and maximize the business value of AI is through the adoption of MLOps. MLOps, at a high level, is the practice of applying software engineering principles of Continuous Improvement and Continuous Delivery (CI/CD) to Machine Learning projects. It is a set of practices that provide the framework for consistencty and reusability that leads to quicker model deployments and scalability.
MLOps tools are the software based applications that help organizations put the MLOps principles into practice in an automated fashion.
The complexities stemming from ever-changing business environments that affect underlying data, inference needs, etc mean that MLOps in AI/ML projects need to have quicker iteratons than typical DevOps software projects.
AutoML
At the heart of a AI/ML project is the quest for business insights, and the tasks that lead to these insights can be done at a scalable and efficient manner with AutoML. AutoML is the process of automating exploratory data analysis (EDA), algorithm selection, training and optimizations of models.
AutoML tools are low-code or no-code platforms that begin with the ingestion of data. Summary statistics, data visualizations, outlier detection, feature interaction, and other tasks associated with EDA are then automatically completed. For model training, AutoML tools can detect what type of algorthms are appopriate for the data and business question and proceed to test each model. AutoML also itierates over hundreds of versions of the models by tweaking the parameters to find the optimal settings. After cross-validation and further testing of the model, a model package is created which includes the data transformations and scoring pipelines for easy deployment into a production environment.
ModelOps
Oncea trained model is ready for deployment in a production environment, a whole new set of tasks and processes begin. ModelOps is the set of practices and processes that support fast and scalable deployment of models to production. Model performance degrades over time for reasons such as underlying trends in the data changing or introduction of new data, so models need to be monitored closely and be updated to keep peak business value throughout its lifecycle.
Model monitoring and management are key components of ModelOps, but there are many other aspects to consider as part of a ModelOps strategy. Managing infrastructure for proper resource allocation (e.g how and when to include accelerators), automatic model re-training in near real time, and integrating with advanced network sevurity solutions, versioning, and migration are other elements that must be considered when thinking about scaling an AI environment.
Dell solution for automation methodologies
Dell offers solutions that bring together the infrastructure and software partnerships to capitalize on the benefits of AutoML, MLOps and ModelOps. Through jointly engineered and tested solutions with Dell Validated Designs, organizations can provide their AI/ML teams with predictable and configurable ML environments and with their operational AI goals.
Dell has partnered with cnvrg.io and H2O to provide the software platforms to pair with the compute, storage, and networking infrastructure from Dell to complete the AI/ML solutions.
MLOps – cnvrg.io
cnvrg.io is a machine learning platform built by data scientists that makes implementation of MLOps and the process of taking models from experimentation to deployment efficient and scalable. cnvrg.io provides the platform to manage all aspects of the ML life cycle, from data pipelines, to experimentation, to model deployment. It is a Kubernetes-based application that allows users to work in any compute environment, whether it be in the cloud or on-premises and have access to any programming language.
The management layer of cnvrg.io is powered by a control plane that leverages Kubernetes to manage the containers and pods that are needed to orchestrate the tasks of a project. Users can view the state and health and resource statistics of the environment and each task using the cnvrg.io dashboard.
cnvrg.io makes it easy to access the algorithms and data components, whether they are pre-trained models or models built from scratch, with Git interaction through the AI Library. Data pre-processing logic or any customized models can be stored and implemented for tasks across any project by using the drag-and-drop interface for end-to-end management called cnvrg.io Pipelines.
The orchestration and scheduling features use Kubernetes-based meta-scheduler, which makes jobs portable across environments and can scale resources up or down on demand. Cnvrg.io facilitates job scheduling across clusters in the cloud and on-premises to navigate through resource contention and bottlenecks. The ability to intelligently deploy and manage compute resources, from CPU, GPU, and other specialized AI accelerators to the tasks where they can be best used is important to achieving operational goals in AI.
cnvrg.io solution architecture
The cnvrg.io software can be installed directly on your data center, or it can be accessed through the cnvrg.io Metacloud offering. Both versions allow users to configure the organization’s own infrastructure into compute templates. For installations into an on-premises data center, cnvrg.io can be deployed on various Kubernetes infrastructures, including bare metal, but the Dell Validated Design for AI uses VMware and NVIDIA to provide a powerful combination of composability and performance.
Dell’s PowerEdge servers that can be equipped with NVIDIA GPUs provide the compute resources required to run any algorithm in machine learning packages like scikit learn to deep learning algorithms in frameworks like TensorFlow and PyTorch. For storage, Dell’s PowerScale appliance with all-flash, scale out NAS storage deliver the concurrency performance to support data heavy neural networks. VMware vSphere with Tanzu allows for the Tanzu Kubernetes clusters, which are managed by Tanzu Mission Control. The servers running VMware vSAN provide a storage repository for the VM and pods. PowerSwitch network switches with a 25 GbE-based design or 100 GbE-based design allow for neural network training jobs than can run on a single node. Finally, the NVIDIA AI Enterprise comes with the software support for GPUs such as fractionalizing GPU resources with the MIG capability.
Dell provides recommendations for sizing of the different worker node configurations, such as the number of CPUs/GPUs and amount of memory, that users can deploy for the various types of algorithms different AI/ML projects may use.
Figure 2. Dell/cnvrg.io Solution Architecture
For more information, see the Design Guide—Optimize Machine Learning Through MLOps with Dell Technologies cnvrg.io.
AutoML – H2O.ai Driverless AI
Dell has partnered with H2O and its flagship product, Driverless AI, to give organizations a comprehensive AutoML platform to empower both data scientists and non-technical folks to unlock insights efficiently and effectively. Driverless AI has several features that help optimize the model development portion of an AI/ML workflow, from data ingestion to model selection, as organizations look to gain faster and higher quality insights to business stakeholders. It is a true no-code solution with a drag and drop type interface that opens the door for citizen data scientists.
Starting with data ingestion, Driverless AI can connect to datasets in various formats and file systems, no matter where the data resides, from on-premises to a clould provider. Once ingested, Driverless AI runs EDA, provides data visualization, outlier detection, and summary statistics on your data. The tool also automatically suggests data transformations based on the shape of your data and performs a comprehensive feature engineering process that search for high-value predictors against the target variable. A summary of the auto-created features is displayed in an easy to digest dashboard.
For model development, Driverless AI automatically trains multiple in-built models, with numerous iterations for hyper parameter tuning. The tool applies a genetic algorithm that creates an ensemble, ‘survival of the fittest’ final model. The user also has the ability to set the priority on factors of accuracy, time, and interpretability. If the user wishes to arrive at a model that needs to be presented to a less technical busines audience, for example, the tool will focus on algorithms that have more explainable features rather than black box type models that may achieve better accuracy with a longer training time. While the Driverless AI tool may be run as a no-code solution, the bring your own recipe feature empowers more seasoned data scientists to bring custom data transformations and algorithms into the tools as part of the experimenting process.
The final output of Driverless AI, after a champion model is crowned, will include a scoring pipeline file that makes it easy to deploy to a production environment for inference. The scoring pipeline can be saved in Python or a MOJO and includes components like data transformations, scripts, runtime, etc.
Driverless AI solution architecture
The H2O Driverless AI platform can be deployed either in Kubernetes as pods or as a stand-alone container. The Dell Validated Design of Driverless AI highlights the flexibility VMware vSphere with Tanzu for the Kubernetes layer works with H2O’s Enterprise Steam to provide resource control and monitoring, access control, and security out of the box.
Dell PowerEdge servers, with optional NVIDIA GPUs, and the NVIDIA AI Enterprise make building containers easy for different sets of users. For use cases that are heavy on EDA or employ traditional machine learning algorithms, Driverless AI containers with CPUs only may be appropriate, while containers with GPUs are best suited for training deep learning models usually associated with natural language processing and computer vision. Dell PowerScale storage and Dell PowerSwitch network adapters provide concurrency at scale to train the data intensive algorithms found within Driverless AI.
Dell provides sizing deployments recommendations specific to an organization’s requirements and capabilities. For organizations starting their AI journey, a deployment with 5 Driverless AI instances, 40 CPU cores, 3.2 TB of memory, and 5 TB storage is recommended for workloads and projects that perform classic machine learning or statistical modeling. For a mainstream deployment with more users and heavier workloads that would benefit from GPU acceleration, 10 Driverless AI instances with 100 CPU cores and 5 NVIDIA A100 GPUs, 8 TB of memory, and 10 TB of storage is recommended. Finally, for high performance deployments for organizations that want to deploy AI models at scale, 20 Driverless AI instances, 200 CPU cores and 10 A100 GPUs, 16 TB of memory, and 20 TB of storage provides the infrastructure for a full-service AI environment.
Figure 3. Dell/H2O Driverless AI Solution Architecture
For more information, see Automate Machine Learning with H2O Driverless AI.
Dell is your partner in your AI Journey
AI is constantly evolving, and many organizations do not have the AI expertise to keep up with designing, developing, deploying, and managing solution stacks at the competitive pace. Dell Technologies is your trusted partner and offers solutions that empower your organization in its AI journey. For over the past decade, Dell has been a proven leader in the advanced computing space that includes industry leading products, solutions, and expertise. We have a specialized team of AI, High Performance Computing (HPC), and Data Analytics experts dedicated to helping you keep pace on your AI journey.
AI Professional Services
Regardless of your AI needs, you can rest assured that your deployments will be backed up by Dell’s world class technology services. Our expert consulting services for AI help you plan, implement, and optimize AI solutions, while more than 35,000 services experts can meet you where you are on your AI journey.
- Consulting Services
- Deployment Services
- Support Services
- Payment Solutions
- Managed Services
- Residency Services
Customer Solution Center
The Customer Solution Center is a team of experienced professionals equipped to provide expert advice, recommendations, and demonstrations of the cutting-edge technologies and platforms essential for successful AI implementation. Our staff maintains a thorough understanding of the diverse needs and challenges of our customers and offers valuable insights garnered from extensive engagement with a broad range of clients. By leveraging our extensive knowledge and expertise, you gain a competitive advantage in your pursuit of AI solutions.
AI and HPC Innovation Lab
The AI and HPC Innovation Lab is a premier infrastructure, equipped with a highly skilled team of computer scientists, engineers, and Ph.D. level experts. This team actively engages with customers and members of the AI and HPC community, fostering partnerships and collaborations to drive innovation. With early access to cutting-edge technologies, the Lab is equipped to integrate and optimize clusters, benchmark applications, establish best practices, and publish insightful white papers. By working directly with Dell's subject matter experts, customers can expect tailored solutions for their specific AI and HPC requirements.
Conclusion
MLOps and AutoML play a critical role in fostering the successful integration of AI/ML into organizations. MLOps provides a standardized framework for ensuring consistency, reusability, and scalability in AI/ML initiatives, while AutoML streamlines the data and modeling process. This synergistic approach enables organizations to make data-driven decisions and derive maximum business value from their AI/ML endeavors. Dell Validated Designs offer a blueprint for implementing MLOps, thereby bringing these concepts to fruition. The dynamic nature of AI/ML projects necessitates rapid iterations and automation to tackle challenges such as data inconsistency and resource limitations. MLOps and AutoML serve as crucial enablers in driving digital transformation and establishing an AI-centric enterprise.
Related Blog Posts

Accelerating the Journey towards Autonomous Telecom Networks
Fri, 06 Jan 2023 14:29:40 -0000
|Read Time: 0 minutes
How Dell Technologies is supporting communications service providers accelerate automation
Communications service providers (CSPs) are on a journey of digital transformation that gives them the ability to offer new innovative services and a better customer experience in an open, agile, and cost-effective manner. Recent developments in 5G, Edge, Radio Access Network disaggregation, and, most importantly the pandemic have all proven to be catalysts that accelerated this digital transformation. However, all these advancements in telecom come with their own set of challenges. New architectures and solutions have made the modern network considerably more complex and difficult to manage.
In response, CSPs are evaluating new ways of managing their complex networks using automation and artificial intelligence. The ability to fully orchestrate the operation of digital platforms is vital for touchless operations and consistent delivery of services. Almost every CSP is working on this today. However, the standard automation architecture and tools can't be directly applied by CSPs as all these solutions need to adhere to strict telecom requirements and specifications such as those defined by enhanced Telecom Operations Map (eTOM), Telecom Management Forum (TM Forum), European Telecommunications Standards Institute (ETSI), 3rd Generation Partnership Project (3GPP), etc. CSPs also need to operate many telecom solutions including legacy physical network functions (PNF), virtual network functions (VNF), and the latest 5G era containerized network functions (CNF).
Removing barriers with telecom automation
Although many CSPs have built cloud platforms, only a handful have achieved their automation targets. So, what do you do when there is no ready-made industry-standard automation solution? You build one. And that’s exactly what Dell Technologies did with the recent launch of its Dell Telecom Multi-Cloud Foundation. Dell Telecom Multi-Cloud Foundation automates the deployment and life-cycle management of the cloud platforms used in a telecom network to reduce operational costs while consistently meeting telco-grade SLAs. It also supports the leading cloud platforms offering operators the flexibility of choosing the platform that best meets their needs based on workload requirements and cost-to-serve. It streamlines telecom cloud design, deployment, and management with integrated hardware, software, and support.
The solution includes Dell Telecom Infrastructure Blocks. Telecom Infrastructure Blocks are engineered systems that provide foundational building blocks that include all the hardware, software and licenses to build and scale out cloud infrastructure for a defined telecom use case.
Telecom Infrastructure Block releases will be delivered in an agile manner with multiple releases per year to simplify lifecycle management. In 2023, Dell Telecom Infrastructure Blocks will support workloads for Radio Access Network and Core network functions with:
- Dell Telecom Infrastructure Blocks for Wind River which will support vRAN and Open RAN workloads.
Dell Telecom Infrastructure Blocks for RedHat will target core network workloads (planned). The primary goal of Telecom Multi-Cloud Foundation with Telecom Infrastructure Blocks is to deliver telco cloud platforms that are engineered for scaled deployments, providing three core capabilities:
- Integration: All components of the platform, including computing, storage, networking, ancillaries like accelerators, Cloud CaaS software, and management tools are integrated into Dell’s factories.
- Validation: A solution engineered and validated by our cloud partners and already proven to work in the field. The engineering and validation process includes detailed test cases across both functional and non-functional aspects of the platform
- Automation: A Solution that is fully automated and that can seamlessly integrate with Telco’s existing orchestration and inventory systems.
Dell Technologies Telecom Multi-Cloud Foundation meets Telco automation requirements
Dell Technologies Multi-Cloud Foundation provides communications service providers with a platform-centric solution based on open Application Programming interfaces (APIs) and consistent tools. This means the platform can deliver outcomes based on a unique use case and workload and then scale out deployments using an API-based approach.
Dell Telcom Multi-Cloud Foundation enables telco-grade automation through the following key capabilities:
- An open API and workflow approach: All the capabilities of the platform are available as declarative APIs so there is no need to manage each infrastructure component independently, rather open APIs and workflows are triggered via northbound orchestration systems. This capability not only automates deployment but also Day 2 operations and life-cycle management.
- Scalable architecture: The automation architecture is based on a fully distributed and federated architecture, so it can scale to 100,000’s of sites.
- Data-Driven architecture: The automation architecture is data-driven and distributed so data can be tapped from edge and regional sites enabling real-time use cases and data-driven automation.
Automation use cases with Dell Technologies Telecom Multi-Cloud Foundation
Telecom Automation is not just about Day 0 (design) and Day 1 (deployment) but should also cover Day 2 (operations and lifecycle management). Dell Telecom Multi-Cloud Foundation supports the following use cases:
- Automated Deployment: It includes a fully-automated deployment of the cloud infrastructure based on customer specifications.
- O-Cloud as Code: It employs declarative automation using infrastructure data, which includes site data, networking, resources, and credentials to automate tasks independent of the workflow. This de-coupling is crucial to orchestrate the platform.
- Operational fulfillment: Integrations with Wind River Studio Conductor delivers full set of operational tools that provide a single management and observation platform for the operations team. This helps with creating a unified layer for Network Operations Center (NOC) teams to monitor and manage the platform.
- Staging: The platform is staged in Dell’s factory to reduce the time spent deploying and configuring the system on-site and can be tuned in the field using the built-in automation to meet any unique operator specifications.
Dell Technologies developed Dell Telecom Multi-Cloud Foundation and Dell Telecom Infrastructure Blocks to accelerate 5G cloud infrastructure transformation. Our current release of Telecom Infrastructure Blocks for Wind River delivers an engineered and factory-integrated system that comes with a fully automated deployment model for CSPs looking to build resilient and high-performance RAN.
To learn more about our solution, please visit the Dell Telecom Multi-Cloud Foundation solutions site.
About the Author: Saad Sheikh
Saad Sheikh is APJ's Lead Systems Architect in Telecom Systems Business at Dell Technologies. In his current role, he is responsible for driving Telecom Cloud, Automation, and NGOPS transformations in APJ supporting partners, NEPs, and customers to accelerate Network Transformation for 5G, Open RAN, Core, and Edge using Dell’s products and capabilities. He is an industry leader with over 20 years of experience in Telco industry holding roles in Telco, System Integrators, Consulting businesses, and with Telecom vendors where he has worked on E2E Telecoms systems (RAN, Transport, Core, Networks), Cloud platforms, Automation, Orchestration, and Intelligent Networking. As part of Dell CTO team, he represents Dell in Linux Foundation, TMforum, GSMA, and TIP.

Introduction to Ansible Network Collection for Dell SmartFabric Services
Wed, 19 Oct 2022 18:54:48 -0000
|Read Time: 0 minutes
SFS Ansible collection
With Dell OS10.5.4.0, the OS10 Ansible automation journey continues. Ansible helps DevOps and NetOps reduce time and effort when designing, managing, and monitoring OS10 networks in enterprise IT environments. Updates to Ansible collections provide new automation benefits for increased operational efficiencies. This blog provides a quick introduction to the Ansible network collection for Dell SmartFabric Services (SFS).
The Ansible network collection for SFS allows you to provision and manage OS10 network switches in SmartFabric Services mode. The collection includes core modules and plugins and supports network_cli and httpapi connections. Supported versions include Ansible 2.10 or later. Sample playbooks and documentation are also included to show how you can use the collection.
With this introduction, there are now additional SFS automation choices. The following table lists some examples of the included modules:
Module Name | Module Description |
---|---|
sfs_setup | Manage configuration of L3 Fabric setup |
sfs_backup_restore | Manage backup restore configuration |
sfs_virtual_network | Manage virtual network configuration |
sfs_preferred_master | Manage preferred master configuration |
sfs_nodes | Manage nodes configuration |
More SFS automation choices
The SmartFabric OS10 SFS feature provides network fabric automation. The SFS leaf and spine personality is integrated with systems including VxRail and PowerStore. SFS delivers autonomous fabric deployment, expansion, and life cycle management. SFS automatically configures leaf and spine fabrics. SFS for leaf and spine is supported on S-series and Z-series Dell PowerSwitches. See the Support matrix for a complete list of supported platforms. For more details see Dell SmartFabric Services User Guide, Release 10.5.4.
You can access the collection by searching for dellemc.sfs on the Ansible Galaxy website. For more information, see the Dell OS10 SmartFabric Services Ansible collection.
