This guide outlines a practical approach to tool selection and model creation for projects with machine learning and deep learning requirements, using technology available today.
This guide describes how to use a combination of open-source projects and commercial software, including:
It covers issues of interest to both data scientists and IT professionals. It also describes how to use Docker and Kubernetes to manage infrastructure that integrates with the work of software developers building cloud-native applications. The guide follows an inventory management use case for a fictitious large online and physical retailer, solving the technology puzzle one analytics pipeline stage at a time.
One of the main reasons for Dell EMC's continued interest in Hadoop for advanced analytics is its breadth and flexibility.
Organizations can combine a handful of projects that handle everything from data ingestion and cleansing to model development and model hosting for inference. A platform that enables IT, data engineers, business analysts, and data scientists to focus investments and share resources could couple:
The two main challenges when moving from data analytics on a single system to a scale-out approach are distributing data storage across many machines and distributing the computation that processes that data.
The original two projects that made up the first version of Hadoop, the Hadoop Distributed File System (HDFS) and MapReduce, solved both problems in the mid-to-late 2000s. MapReduce, the original distributed compute engine, was poorly suited to iterative workloads such as machine learning and deep learning. It has largely been augmented or replaced by newer distributed computing frameworks, with Spark being the most important for data science.
Spark started with a concept familiar to everyone who works in data science: the data frame. It then distributed that data frame across many systems, taking advantage of their combined memory and computing cores, so that data scientists did not have to change the way they traditionally work with data.
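As a rough sketch of what this looks like in practice, the PySpark snippet below loads a data set into a distributed DataFrame and runs a familiar filter-and-aggregate workflow. The file name inventory.csv and its column names are hypothetical stand-ins for the retailer use case; only the SparkSession and DataFrame APIs come from Spark itself.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes a running Spark environment; "inventory.csv" and its columns
# are hypothetical stand-ins for the retailer use case in this guide.
spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# The DataFrame behaves like a single-machine data frame, but its rows
# are partitioned across the memory of the executors in the cluster.
df = spark.read.csv("inventory.csv", header=True, inferSchema=True)

# Familiar operations -- filter, group, aggregate -- run in parallel on
# every node that holds a partition of the data.
on_hand = (
    df.filter(F.col("quantity") > 0)
      .groupBy("warehouse")
      .agg(F.sum("quantity").alias("units_on_hand"))
)
on_hand.show()
```

The point is not the specific query: each operation executes in parallel on whichever nodes hold partitions of the data, while the code itself reads like ordinary single-machine data frame work.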
Data science has expanded beyond the computing power of a single machine in order to escape the memory and processing-core limits of ubiquitous, inexpensive x86 systems.
Managing massive, distributed systems that handle the scaling needs of data scientists working with ever-increasing data volumes presents new challenges for architects and IT operations professionals. This is where Kubernetes makes a significant contribution.
The relationship between Spark and Kubernetes is conceptually simple. Data scientists want to run many Spark processes, distributed across multiple systems, to gain access to more memory and computing cores. Container technology such as Docker hosts those Spark processes, and Kubernetes orchestrates the creation, placement, and life cycle management of the containers across a cluster of x86 servers. There are many Kubernetes implementations available today, so later chapters discuss using OpenShift Kubernetes and how it compares to other options.
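A minimal sketch of that relationship, assuming a reachable Kubernetes API server and a Spark container image already published to a registry (the endpoint, namespace, and image name below are placeholders): the Spark application is pointed at the cluster through a k8s:// master URL, and Kubernetes schedules the requested executors as containers.

```python
from pyspark.sql import SparkSession

# Hypothetical API endpoint, namespace, and image; substitute your own.
# Kubernetes launches each requested executor as a container and manages
# its placement and life cycle across the cluster of x86 servers.
spark = (
    SparkSession.builder
    .appName("inventory-analytics")
    .master("k8s://https://k8s-api.example.com:6443")
    .config("spark.kubernetes.namespace", "data-science")
    .config("spark.kubernetes.container.image",
            "registry.example.com/spark-py:3.4.0")
    .config("spark.executor.instances", "4")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
```

The same settings are commonly supplied through spark-submit instead; either way, Spark asks Kubernetes for executors, and Kubernetes handles their creation, placement, and cleanup.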