Data analytics and artificial intelligence

Introduction
Enterprises are rapidly increasing their investments in infrastructure platforms to support data analytics and artificial intelligence (AI), including the more specific AI disciplines of machine learning (ML) and deep learning (DL). All these disciplines benefit from running in containerized environments, and OpenShift Container Platform makes those benefits available to developers, data scientists, and IT operators.
For simplicity, we use “data analytics as a service” (DAaaS) to refer to analytics and AI that are operated and instantiated in a containerized environment. OpenShift Container Platform enables operators to create a DAaaS environment as an extensible analytics platform with a private cloud-based delivery model. This delivery model makes various tools available for data analytics and can be configured to efficiently process and analyze huge quantities of heterogeneous data from shared data stores.
The data analytics life cycle, and particularly the ML life cycle, is a multiphase process to integrate large volumes and varieties of data, abundant compute power, and open-source languages, libraries, and tools to build intelligent applications and predictive outcomes. At a high level, the life cycle consists of these steps:
- Data acquisition and preparation: Ensures that the input data is complete and of high quality
- Model creation: Includes training, testing, and selection of the model with the highest prediction accuracy
- Model deployment: Includes inferencing in the application development and operations processes
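The three life-cycle steps above can be illustrated with a deliberately small sketch. It is not taken from the solution itself: the toy data set, the two candidate "models" (a midpoint threshold and a majority-class baseline), and all function names are assumptions chosen only to show preparation, training, testing, selection, and inference in one place.

```python
from statistics import mean

# Hypothetical toy data set: (feature, label) pairs; None marks a missing value.
RAW_ROWS = [(0.5, 0), (1.0, 0), (None, 1), (1.5, 0),
            (2.5, 1), (3.0, 1), (3.5, 1), (2.0, 0)]


def prepare(rows):
    """Data acquisition and preparation: keep only complete rows."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def train_midpoint(rows):
    """Candidate 1: threshold halfway between the two class means."""
    zeros = [x for x, y in rows if y == 0]
    ones = [x for x, y in rows if y == 1]
    threshold = (mean(zeros) + mean(ones)) / 2
    return lambda x: 1 if x > threshold else 0


def train_majority(rows):
    """Candidate 2: baseline that always predicts the most common label."""
    labels = [y for _, y in rows]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def accuracy(model, rows):
    """Fraction of rows for which the model predicts the correct label."""
    return mean([1.0 if model(x) == y else 0.0 for x, y in rows])


# Model creation: train on one half, test on the other, select the best.
clean = prepare(RAW_ROWS)
train_rows, test_rows = clean[::2], clean[1::2]
candidates = {
    "midpoint": train_midpoint(train_rows),
    "majority": train_majority(train_rows),
}
scores = {name: accuracy(model, test_rows) for name, model in candidates.items()}
best_name = max(scores, key=scores.get)

# Model deployment: the selected model serves inference requests.
predict = candidates[best_name]
print(best_name, scores[best_name], predict(2.8))
```

In a real pipeline each stage would typically run as its own containerized job, and the selection step would compare far richer models, but the flow from clean data to a selected, deployable predictor is the same.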
Key challenges
Data scientists and engineers are primarily responsible for developing modeling methods that ensure that the selected model continues to provide the highest prediction accuracy. The key challenges that data scientists face include:
- Selection and deployment of the right AI tools (such as Apache Spark, TensorFlow, and PyTorch)
- Complexities and time required to train, test, select, and retrain the AI model that provides the highest prediction accuracy
- Slow execution of AI modeling and inferencing tasks because of a lack of hardware acceleration
- Limited IT operations capacity to provision and manage infrastructure
- Collaboration with data engineers and software developers to ensure input data hygiene and successful AI model deployment in application development processes
Containers and Kubernetes are key to accelerating the data analytics life cycle because they provide data scientists and IT operators with the agility, flexibility, portability, and scalability needed to train, test, and deploy ML models.
OpenShift Container Platform provides all these benefits. Through its DevOps capabilities and integration with hardware accelerators, it enables better collaboration between data scientists and software developers. OpenShift Container Platform also accelerates the roll-out of analytics applications to departments as needed. The benefits include the ability to:
- Empower data scientists with a consistent, self-service-based, cloud-like experience:
  - Gives data scientists the flexibility and portability to use containerized ML tools of their choice to quickly build, scale, reproduce, and share ML modeling results in a consistent way with peers and software developers.
  - Eliminates dependency on IT to provision infrastructure for iterative, compute-intensive ML modeling tasks.
- Accelerate compute-intensive ML modeling and inferencing jobs:
  - On-demand access to high-performance hardware can meet the high compute resource requirements of the search for the ML model that provides the highest prediction accuracy.
- Streamline the development and operations of intelligent applications:
  - Extending OpenShift DevOps automation capabilities to the ML life cycle enables collaboration between data scientists, software developers, and IT operations so that ML models can be quickly integrated into the development of intelligent applications.
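As an illustration of the on-demand access to hardware acceleration mentioned above, the following sketch builds a Kubernetes Pod manifest that requests one GPU for a training container. It assumes that the cluster exposes GPUs through the `nvidia.com/gpu` extended resource (the name used by the NVIDIA device plugin); the pod name, image, and command are hypothetical placeholders, and the helper function is not part of any OpenShift API.

```python
import json


def gpu_training_pod(name, image, command, gpus=1):
    """Return a Pod manifest (as a dict) that requests `gpus` GPUs.

    Assumes the cluster advertises GPUs as the `nvidia.com/gpu` resource.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # A training run is a one-shot task, so do not restart on exit.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    "command": command,
                    # Extended resources such as GPUs are requested via limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical image and command, for illustration only.
manifest = gpu_training_pod(
    "train-demo",
    "registry.example.com/ml/trainer:latest",
    ["python", "train.py"],
)
print(json.dumps(manifest, indent=2))
```

Because Kubernetes accepts manifests in JSON as well as YAML, the printed output could be saved to a file and applied with the usual cluster tooling. The scheduler then places the pod only on a node that has a free GPU.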
Kubeflow ML on OpenShift
One example of ML on OpenShift Container Platform is the work done by Dell Technologies and Red Hat to deploy Kubeflow on OpenShift.
Kubeflow is an open-source Kubernetes-native platform for ML workloads that enables enterprises to accelerate their ML/DL projects. Based originally on Google’s use of TensorFlow on Kubernetes, Kubeflow is a composable, scalable, portable ML stack that includes components and contributions from various sources and organizations. Kubeflow bundles support for popular ML/DL frameworks such as TensorFlow, MXNet, and PyTorch, together with the Katib hyperparameter tuning component, in a single deployment binary file. By running Kubeflow on OpenShift Container Platform, you can quickly operationalize a robust ML pipeline.
For more information, see Kubeflow: The Machine Learning Toolkit for Kubernetes.
Spark analytics on OpenShift
An example of large-scale data analytics being run on OpenShift Container Platform is the Dell EMC Spark on Kubernetes Ready Solution for Data Analytics.
Apache Spark, a unified analytics engine for big data and machine learning, is one of the largest open-source projects in data processing. Data scientists want to run Spark processes that are distributed across multiple systems to have access to additional memory and computing cores. OpenShift orchestrates the creation, placement, and life cycle management of those Spark processes across a cluster of servers by using container virtualization to host the processes.
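To make the orchestration described above more concrete, the sketch below assembles a `spark-submit` command line that targets a Kubernetes API server, using configuration keys documented by Apache Spark for its Kubernetes mode (`spark.kubernetes.namespace`, `spark.kubernetes.container.image`, `spark.executor.instances`). The API server URL, image, namespace, job name, and application path are hypothetical placeholders, and the helper function is an illustration rather than part of the Dell EMC solution.

```python
import shlex


def spark_submit_args(api_server, image, app, executors=3,
                      namespace="spark-demo", name="demo-job"):
    """Build an argument list for spark-submit in Kubernetes cluster mode."""
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to use the Kubernetes scheduler.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver itself runs in a pod on the cluster.
        "--deploy-mode", "cluster",
        "--name", name,
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
        # Each executor runs in its own pod, placed by the scheduler.
        "--conf", f"spark.executor.instances={executors}",
        app,
    ]


# Placeholder values, for illustration only.
args = spark_submit_args(
    "https://api.cluster.example.com:6443",
    "registry.example.com/spark:latest",
    "local:///opt/spark/examples/src/main/python/pi.py",
    executors=5,
)
command_line = shlex.join(args)
print(command_line)
```

Raising `executors` is how a data scientist would ask for more memory and cores. The platform, not the user, decides which servers host the resulting executor pods.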
For more information, see the Spark on Kubernetes reference architecture guide on the Dell Technologies Info Hub for AI and Data Analytics.
SQL Server big data clusters on OpenShift
Another example of big data analytics being run on OpenShift is the Dell EMC solution for Microsoft SQL Server 2019 Big Data Clusters.
SQL Server Big Data Clusters enable deployment of scalable clusters consisting of SQL Server, Spark, and HDFS containers running on Kubernetes. These components run side by side to enable you to read, write, and process big data so that you can easily combine and analyze your high-value relational data with high-volume big data. OpenShift Container Platform is one of the Kubernetes platforms on which you can run SQL Server Big Data Clusters.
For more information, see the Microsoft SQL Server 2019 Big Data Clusters White Paper on the Dell Technologies Info Hub for SQL Server.
