We have developed a use case story that is common to many large retail organizations: demand forecasting and inventory management. We could have chosen any industry that wrestles with vast amounts of structured and unstructured data every day, including manufacturing, health care, and financial services, and made an equally compelling case for using Spark. One of the primary reasons we chose the retail use case was the availability of a large multi-table simulated data set whose entities include orders with line items, customers, and sales regions (geography). Broad familiarity with the retail experience (everyone shops and knows about stock outages) was also attractive because it makes the storytelling easier. Finally, the schema of the generated data could easily be staged in two or more simulated source systems and then brought back together to demonstrate a multi-stage analytics pipeline.
We are not claiming that the data engineering or data science we demonstrate is representative of the challenges that would need to be solved in an actual retail analytics solution. The simulated data is not intended to evaluate the relative robustness of any machine learning or deep learning techniques. Our goal is simply to place realistic demands on our lab resources while processing the pipeline. We built the lab demonstration to highlight the completeness of the Spark/Hadoop toolset for implementing full-featured data pipelines. As you will see in our discussion of using Spark for interactive analysis, the relationships between variables in the data set are artificial and simplistic. That does not stop us from showing how the platform handles distributed analytics on large data sets; it only limits our ability to "find" interesting associations in the data.
We will discuss all the important elements of a typical analytics data pipeline: ingesting data, joining and filtering large related tables, restructuring data to fit the input requirements of common machine learning models, training models in a distributed fashion, and hosting trained models so other systems can use them for inference, as well as configuring hardware resources and installing software tools.
Our use case story begins with a fictitious retail company that has a catalog with hundreds of thousands of SKUs grouped into 5 market segments:
The sales demand forecast used for product ordering is a manual roll-up of estimates from segment managers in each of the many sales areas, based on their experience and knowledge of local conditions. This process is slow and often produces incomplete forecasts when staff miss submission deadlines. The company also carries high inventory costs because both the area/segment managers and the purchasers tend to overestimate demand to avoid stock-outs.
Management would like to add a model-based demand forecasting option to the planning process, using data from the sales order and supply purchasing systems. They have been told that estimating an individual model for each product is very challenging given the number of SKUs they manage and the sparsity of sales for many items in the catalog. The company hired an inventory management consultant, who suggested a process called hierarchical forecasting that is common in organizations with this many products. The consultant also advised that the company may need new technology to implement the modeling system. The company has extensive experience with enterprise-class relational database management systems, including a Massively Parallel Processing (MPP) database, and has recently started using a Hadoop Distributed File System (HDFS) data lake to offload some analytics from the already overloaded RDBMS systems. Spark is the most popular tool for accessing data in the Hadoop file system, and management would prefer that any new analytics processing for the inventory planning system be developed primarily with Hadoop and Spark; IT management wants to avoid bringing in new technology silos.
The consultants have confirmed that product-level forecasting with such a large catalog is complicated. They considered a multi-tier approach in which high-profit products would be modeled individually and the current system would be kept for everything else. Management rejected that for the first round of development and chose instead to implement a consistent two-stage approach for all products. The proposed design first develops a model to forecast aggregate sales for each market segment using a classical time-series approach. The aggregate forecasts of daily sales (in dollars) will then be allocated to individual products based on each product's historical contribution to the segment's total sales, and converted from dollar values to units using recent average prices. The individual product forecasts will be compared against current inventory to estimate when the next out-of-stock event is likely to occur for each product.
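To make the allocation step concrete, the following is a minimal PySpark sketch of how aggregate segment forecasts could be spread across products and converted to units. The table names (segment_forecasts, product_sales) and column names are hypothetical illustrations, not part of the company's actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demand-allocation").getOrCreate()

# Hypothetical inputs:
#   segment_forecasts: one row per segment per day with forecast_sales_dollars
#   product_sales:     historical sales per product with segment, sku, sales_dollars, units
segment_forecasts = spark.table("segment_forecasts")
product_sales = spark.table("product_sales")

# Total historical dollar sales per segment.
segment_totals = product_sales.groupBy("segment").agg(
    F.sum("sales_dollars").alias("segment_sales"))

# Each product's share of its segment's sales and its recent average price.
product_share = (product_sales.groupBy("segment", "sku")
    .agg(F.sum("sales_dollars").alias("product_dollars"),
         F.sum("units").alias("product_units"))
    .withColumn("avg_price", F.col("product_dollars") / F.col("product_units"))
    .join(segment_totals, "segment")
    .withColumn("share", F.col("product_dollars") / F.col("segment_sales")))

# Allocate the aggregate dollar forecast to products, then convert to units.
product_forecast = (segment_forecasts.join(product_share, "segment")
    .withColumn("forecast_dollars", F.col("forecast_sales_dollars") * F.col("share"))
    .withColumn("forecast_units", F.col("forecast_dollars") / F.col("avg_price")))
```

The resulting product_forecast table is what would be compared against current inventory to flag likely out-of-stock dates.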
The amount of data and the number of computations required for an organization of this size far exceed the capability of any single scale-up machine. The data preparation, model training, and subsequent allocation and conversion calculations all involve tables with tens of millions of rows, which makes operations like joining, aggregating, and filtering highly demanding of processor and memory. The team is confident that choosing Spark for the data ingestion and transformation workload is its best option, but it has no experience with model training or model inference using projects from the Spark ecosystem such as MLlib for machine learning or Intel BigDL for deep learning. Most of the team has been training models in Python and R on smaller data sets using a single workstation.
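The joining and aggregating workload described above might look like the sketch below, which rolls the simulated orders and line items up to daily dollar sales per market segment, the input to the aggregate time-series model. Table names, column names, and the filter date are assumptions for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-segment-sales").getOrCreate()

# Hypothetical source tables staged from the simulated systems.
orders = spark.table("orders")                  # order_id, order_date, customer_id, ...
line_items = spark.table("order_line_items")    # order_id, sku, segment, quantity, unit_price, ...

# Join orders to line items, filter to the training window, and aggregate
# to daily dollar sales per market segment.
daily_segment_sales = (line_items
    .join(orders, "order_id")
    .filter(F.col("order_date") >= F.lit("2022-01-01"))
    .withColumn("sales_dollars", F.col("quantity") * F.col("unit_price"))
    .groupBy("segment", "order_date")
    .agg(F.sum("sales_dollars").alias("sales_dollars")))

daily_segment_sales.write.mode("overwrite").parquet("/data/daily_segment_sales")
```

Spark distributes the join and aggregation across the cluster, which is what makes tables with tens of millions of rows tractable where a single workstation would not be.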
The team also lacks experience managing applications that use machine learning models in production. Stock outages are known to influence customer perceptions in ways that can persist long into the future, so the data science team will need a workflow that incorporates new information into the models within days of it becoming available. They will need automation that can run without human intervention during normal operations but that also notifies them quickly if the models produce forecasts that differ significantly from historical trends. The platform and tools they choose for this project will have to be easy to automate and to monitor.
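As one illustration (not a prescribed design) of the kind of automated sanity check the team has in mind, the sketch below flags segments whose latest forecast deviates sharply from a recent historical average. The table names, column names, and the 30% threshold are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("forecast-monitor").getOrCreate()

# Hypothetical tables written by earlier pipeline stages.
history = spark.table("daily_segment_sales").groupBy("segment").agg(
    F.avg("sales_dollars").alias("hist_avg"))
forecasts = spark.table("segment_forecasts")

# Flag segment forecasts that differ from the historical average by more than 30%.
alerts = (forecasts.join(history, "segment")
    .withColumn("pct_diff",
                F.abs(F.col("forecast_sales_dollars") - F.col("hist_avg")) / F.col("hist_avg"))
    .filter(F.col("pct_diff") > 0.30))

if alerts.count() > 0:
    # In a production workflow this would raise an alert rather than print.
    print(f"{alerts.count()} segment forecasts deviate more than 30% from the historical average")
```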