Design Guide—Red Hat OpenShift Container Platform 4.6 on Dell Infrastructure: Data analytics and artificial intelligence
Enterprises are rapidly increasing their investments in infrastructure platforms to support data analytics and AI, including the more specialized AI disciplines of machine learning (ML) and deep learning (DL). All these disciplines benefit from running in containerized environments, and the benefits of running these applications on OpenShift Container Platform extend to developers, data scientists, and IT operators.
For simplicity, we use the term “data analytics as a service” (DAaaS) for analytics and AI that are operated and instantiated in a containerized environment. OpenShift Container Platform enables operators to create a DAaaS environment as an extensible analytics platform with a private cloud-based delivery model. This delivery model makes various tools available for data analytics and can be configured to efficiently process and analyze huge quantities of heterogeneous data from shared data stores.
The data analytics life cycle, particularly the ML life cycle, is a multiphase process of integrating large volumes and varieties of data, abundant compute power, and open-source languages, libraries, and tools to build intelligent applications and predictive outcomes. At a high level, the life cycle moves from data collection and preparation, through model development, training, and validation, to deployment and ongoing monitoring of models in production.
Data scientists and engineers are primarily responsible for developing modeling methods that ensure that the selected model continues to deliver the highest prediction accuracy. Key challenges that data scientists face include gaining on-demand access to sufficient compute resources, keeping pace with a fast-moving ecosystem of open-source tools and frameworks, and moving models from experimentation into production.
Containers and Kubernetes are key to accelerating the data analytics life cycle because they provide data scientists and IT operators with the agility, flexibility, portability, and scalability that is needed to train, test, and deploy ML models.
OpenShift Container Platform provides all these benefits. Through its DevOps capabilities and integration with hardware accelerators, the platform enables better collaboration between data scientists and software developers, and it accelerates the rollout of analytics applications to departments as needed. The benefits include the ability to:
- Provide on-demand access to high-performance hardware, seamlessly meeting the heavy compute requirements of training and selecting the ML model that yields the highest prediction accuracy.
- Extend OpenShift DevOps automation capabilities to the ML life cycle, enabling collaboration between data scientists, software developers, and IT operations so that ML models can be quickly integrated into the development of intelligent applications.
A recent white paper explored the implications of running resource-intensive ML applications on OpenShift. MLPerf benchmarks provide an independent evaluation of performance for various parts of the machine learning ecosystem, including the cloud and hardware platforms being used. The MLPerf training and inference benchmarks were run on OpenShift and compared to NVIDIA’s MLPerf benchmark results, which were not run on a container automation platform. The results indicated that the addition of the OpenShift platform did not hamper the performance of intensive ML applications, demonstrating that OpenShift provides valuable benefits for running ML applications in production environments.
One example of ML on OpenShift Container Platform is the work that Dell Technologies and Red Hat did to deploy Kubeflow on OpenShift.
Kubeflow is an open-source Kubernetes-native platform for ML workloads that enables service providers to accelerate their ML/DL projects. Based originally on Google’s use of TensorFlow on Kubernetes, Kubeflow is a composable, scalable, portable ML stack that includes components and contributions from a variety of sources and organizations. Kubeflow bundles popular ML/DL frameworks such as TensorFlow, MXNet, and PyTorch, together with components such as Katib for hyperparameter tuning, in a single deployment. By running Kubeflow on OpenShift Container Platform, you can quickly operationalize a robust ML pipeline.
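As a sketch of what a Kubeflow training workload looks like in practice, the manifest below defines a minimal distributed TFJob, the Kubeflow custom resource for TensorFlow training. The job name, namespace, container image, and GPU request are illustrative placeholders, not values from this solution:

```yaml
# Illustrative TFJob manifest (kubeflow.org/v1 API); all names are placeholders.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-train            # hypothetical job name
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2              # two distributed training workers
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: registry.example.com/mnist-train:latest  # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1   # one GPU per worker, if accelerators are present
```

Applying such a manifest (for example, with `oc apply -f tfjob.yaml`) causes the Kubeflow training operator to create and manage the worker pods for the life of the job.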
For more information, see the Machine Learning Using Red Hat OpenShift Container Platform white paper (based on the OpenShift Container Platform 4.2 release).
For more information, see Kubeflow: The Machine Learning Toolkit for Kubernetes.
An example of large-scale data analytics being run on OpenShift Container Platform is the Dell Spark on Kubernetes Solution for Data Analytics. Apache Spark, a unified analytics engine for big data and ML, is one of the largest open-source projects in data processing. Data scientists want to run Spark processes that are distributed across multiple systems to have access to additional memory and computing cores. OpenShift orchestrates the creation, placement, and life cycle management of those Spark processes across a cluster of servers by using container virtualization to host the processes.
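To illustrate that delivery model, a Spark application can be submitted directly against the cluster's Kubernetes API, which then schedules the driver and executor pods as containers. The command below is a sketch using Spark's standard Kubernetes options; the API server URL, namespace, service account, image, and application path are placeholders:

```shell
# Illustrative spark-submit against a Kubernetes/OpenShift API server.
# URL, namespace, service account, and image are placeholders.
spark-submit \
  --master k8s://https://api.cluster.example.com:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.namespace=spark-demo \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=registry.example.com/spark-py:latest \
  local:///opt/spark/examples/src/main/python/pi.py
```

Here `spark.executor.instances` controls how many executor pods the driver requests, which is how the distributed memory and computing cores described above are allocated across the cluster.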
Another example of big data analytics being run on OpenShift is the Dell solution for Microsoft SQL Server 2019 Big Data Clusters.
SQL Server Big Data Clusters enable deployment of scalable clusters consisting of SQL Server, Spark, and HDFS containers running on Kubernetes. These components run side by side to enable you to read, write, and process big data so that you can easily combine and analyze your high-value relational data with high-volume big data. OpenShift Container Platform is one of the Kubernetes platforms on which you can run SQL Server Big Data Clusters.
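As a sketch of how these side-by-side components combine, the T-SQL below defines an external table over CSV files in the cluster's HDFS storage pool (using the built-in SqlStoragePool data source of Big Data Clusters) and joins it with a relational table. The table names, columns, and HDFS path are hypothetical:

```sql
-- Illustrative only: table names, columns, and the HDFS path are placeholders.
CREATE EXTERNAL FILE FORMAT csv_format
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2));

-- External table over high-volume clickstream files stored in HDFS
CREATE EXTERNAL TABLE dbo.web_clickstream_hdfs
    (click_date DATE, user_sk BIGINT)
WITH (DATA_SOURCE = SqlStoragePool,   -- built-in data source for the HDFS storage pool
      LOCATION = '/clickstream_data', -- placeholder HDFS directory
      FILE_FORMAT = csv_format);

-- Combine high-volume HDFS data with high-value relational data
SELECT c.customer_id, COUNT(*) AS clicks
FROM dbo.web_clickstream_hdfs AS h
JOIN dbo.customers AS c ON c.customer_sk = h.user_sk
GROUP BY c.customer_id;
```

Because the external table resolves to HDFS files at query time, analysts can join big data with relational tables in ordinary T-SQL without first importing the files into SQL Server.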
For more information, see the Microsoft SQL Server 2019 Big Data Clusters White Paper on the Dell Technologies Info Hub for SQL Server.