There are many lessons learned over the last four to five decades of using computers for data processing, and they have led to our current enthusiasm for the possibilities of advanced analytics. Machine learning, deep learning, and artificial intelligence are beginning to solve significant business and societal challenges. How far we can extend those achievements is impossible to predict. The required interplay between data, computer hardware, software, and human creativity is complex. Identifying the right combination of equipment, tools, and knowledgeable practitioners that can produce good results with a reasonable investment is even more challenging today than it was 40 or 50 years ago. The vast number of technology options available today and the speed of change make it difficult to construct a project plan that can deliver timely results. It is easy to fall into the trap of postponing decisions until more research appears or the next promising technology reaches its next stage of maturity.
This eBook outlines a practical approach to tool selection and staffing for projects with machine learning and/or deep learning requirements, using technology that is available today. We will describe how to use a combination of open-source projects and commercial software, including Spark, Spark SQL, MLlib, BigDL, and Kubernetes/Docker. We cover issues of interest to both data scientists and IT professionals. We will also describe how to use Docker and Kubernetes to manage infrastructure in a way that integrates with the work of software developers building cloud-native applications. A use case story, centered on inventory management for a large fictitious retailer with both online and brick-and-mortar outlets, helps weave the pieces of the technology puzzle together, focusing on one stage of the analytics pipeline at a time.
Why Spark and Hadoop?
One of the main reasons for our continued interest in the Hadoop ecosystem for advanced analytics is the breadth and flexibility of the platform. It is possible to combine a relatively small number of projects that handle everything from data ingestion and data cleaning to model development and model hosting for inference. A platform that is easy to integrate and scale can become the rallying point for IT, data engineers, business analysts, and data scientists to focus investments and share resources.
The two main challenges when moving from data analytics on a single computer to a scale-out approach are how to distribute the computations, along with a subset of the data, to a coordinated and interconnected set of compute nodes, and how to scale out data storage to keep up with the I/O demands of multiple compute nodes. The two original projects that made up the first version of Hadoop solved both problems in the mid-to-late 2000s. The original distributed compute engine, MapReduce, was not capable of advanced analytics such as machine learning or deep learning, but it has largely been replaced and augmented by newer distributed computing frameworks, with Spark being the most important for data science. Spark started with a concept familiar to everyone who works in data science, the data frame, and devised an approach to distribute it across many computers, taking advantage of their combined memory and computing cores in a way that did not force data scientists to change how they traditionally work with data. We'll have a lot more to say about Spark, Spark data frames, and distributed computing in later sections of this eBook.
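To make the idea concrete, here is a minimal PySpark sketch of that familiar data frame concept; the file name and column names are hypothetical, and the example assumes a working Spark installation.

```python
# Minimal sketch: the Spark data frame API feels familiar to anyone who has
# worked with tabular data, but the work is spread across executor processes.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

# Hypothetical CSV of sales records; Spark splits it into partitions and
# distributes them across the memory of the worker nodes.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# The aggregation is planned on the driver and executed in parallel on the
# partitions; only the small aggregated result is brought back.
daily_totals = (sales
                .groupBy("store_id", "sale_date")
                .agg(F.sum("quantity").alias("units_sold")))

daily_totals.show(5)
```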
As we noted in the previous section, modern data science had to expand beyond the computing power of a single machine to escape the memory and computing core limits imposed by the physical design of widely available, low-cost x86 computers. Managing distributed computing systems with dozens, hundreds, or thousands of machines to handle the scaling needs of data scientists working with increasing volumes of data presents new challenges for the architects and IT operations professionals who support these large data science initiatives. This is where Kubernetes is making a significant contribution.
The relationship between Spark and Kubernetes is conceptually simple. Data scientists want to run many Spark processes distributed across multiple computers to gain access to more memory and computing cores. If we use container virtualization such as Docker to host those Spark processes, Kubernetes will orchestrate the creation, placement, and life cycle management of those processes across a cluster of low-cost x86 servers. There are many Kubernetes implementations available today, so we will have a lot more to say in later chapters about how we recommend using Kubernetes with Spark and the trade-offs you should consider for your own needs.
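As a rough illustration of that relationship (not a complete deployment recipe), the sketch below points a SparkSession at a Kubernetes API server so that executors are launched as pods running a Docker image. The API server address, namespace, and image name are placeholders you would replace for your own cluster.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-on-k8s-example")
         # Placeholder Kubernetes API endpoint; Spark asks this API server
         # to schedule executor pods on the cluster's worker nodes.
         .master("k8s://https://kubernetes.example.com:6443")
         # Number of executor pods to request.
         .config("spark.executor.instances", "4")
         # Placeholder namespace in which the executor pods will run.
         .config("spark.kubernetes.namespace", "analytics")
         # Placeholder Docker image containing Spark and its dependencies.
         .config("spark.kubernetes.container.image", "myrepo/spark-py:3.0")
         .getOrCreate())
```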
A Use Case Story
We still find the analytics “pipeline” a useful metaphor for how data engineers and data scientists work through data ingestion, data cleaning and transformation, data merging, model training and testing, and finally model hosting for inference. For this eBook, we chose to develop a story that explains both the challenges and the most successful approaches for each of these steps in the analytics pipeline. That forced us to select a data set with enough volume and complexity to highlight some real-world challenges and solutions without incurring the cost and lead time of solving an actual enterprise-class problem. We chose to build a story describing a machine learning model-based approach to inventory management for a very large retailer with hundreds of thousands of individual stock keeping units (SKUs). We will show how Spark, together with a set of related big data technologies, can provide an end-to-end solution for real-world data analytics problems by developing the individual pieces using a simplified data set and requirements.
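The sketch below compresses those pipeline stages into a few lines of PySpark for the inventory story; the file name, column names, and the simple regression model are invented for illustration, and later chapters work through each stage in much more depth.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("inventory-pipeline-sketch").getOrCreate()

# Ingest: hypothetical historical sales per SKU per store.
raw = spark.read.parquet("sku_sales_history.parquet")

# Clean and transform: drop incomplete rows and derive a simple feature.
clean = (raw.dropna(subset=["sku", "units_sold", "price"])
            .withColumn("discounted",
                        (F.col("price") < F.col("list_price")).cast("int")))

# Assemble features and train a baseline demand model with MLlib.
assembler = VectorAssembler(inputCols=["price", "discounted"],
                            outputCol="features")
train_df = (assembler.transform(clean)
            .select("features", F.col("units_sold").alias("label")))

train, test = train_df.randomSplit([0.8, 0.2], seed=42)
model = LinearRegression().fit(train)

# Test: evaluate on the held-out split before hosting the model for inference.
print("RMSE:", model.evaluate(test).rootMeanSquaredError)
```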