This use case story is common to many large retail organizations that require demand forecasting and inventory management.
The use case could focus on any industry that wrestles with vast amounts of structured and unstructured data every day, including:
Any of those industries would make an equally compelling case for using Spark. One of the primary reasons this guide focuses on the retail use case is the availability of a large, multi-table, simulated dataset. The dataset contains orders with line items, customers, and sales regions (geography).
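As an illustration only, the following minimal PySpark sketch shows how such a multi-table schema might be declared. The table and column names are assumptions for this sketch, not the actual fields of the generated dataset.

```python
# Illustrative schemas for a multi-table retail dataset.
# Table and column names are assumptions, not the actual generated data.
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DoubleType, DateType)

orders_schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("customer_id", StringType(), False),
    StructField("order_date", DateType(), True),
])

line_items_schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("product_id", StringType(), False),
    StructField("quantity", IntegerType(), True),
    StructField("unit_price", DoubleType(), True),
])

customers_schema = StructType([
    StructField("customer_id", StringType(), False),
    StructField("region_id", StringType(), True),
])

regions_schema = StructType([
    StructField("region_id", StringType(), False),
    StructField("region_name", StringType(), True),
])
```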
Broad familiarity with the retail experience also makes the story easier to tell: everyone shops, and everyone has encountered stock outages. Finally, the generated data can be staged across two or more simulated source systems and then brought back together, demonstrating a multi-stage analytics pipeline, as sketched below.
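A minimal PySpark sketch of that pattern follows: data staged in two simulated source systems is read, joined back together, and aggregated. The storage paths, formats, and column names here are assumptions chosen for illustration, not the configuration used in the lab.

```python
# Sketch: read staged data from two simulated source systems, bring it back
# together, and aggregate it for downstream analysis.
# Paths, formats, and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("retail-pipeline-sketch").getOrCreate()

# Stage 1: ingest from each simulated source system.
orders = spark.read.parquet("hdfs:///staging/oltp/orders")          # source system A
line_items = spark.read.parquet("hdfs:///staging/oltp/line_items")  # source system A
customers = spark.read.csv("hdfs:///staging/crm/customers",         # source system B
                           header=True, inferSchema=True)

# Stage 2: bring the sources back together.
order_detail = (orders
                .join(line_items, "order_id")
                .join(customers, "customer_id"))

# Stage 3: aggregate into a curated table suitable for demand forecasting.
daily_demand = (order_detail
                .groupBy("product_id", "order_date", "region_id")
                .agg(F.sum("quantity").alias("units_sold")))

daily_demand.write.mode("overwrite").parquet("hdfs:///curated/daily_demand")
```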
This use case does not claim that the data engineering or data science it demonstrates is representative of the challenges a production team must solve. The simulated data is not intended to evaluate the relative robustness of machine learning or deep learning techniques; its goal is to place realistic demands on lab resources as the pipeline is processed.
Dell EMC built the lab demonstration to showcase the entire Spark and Hadoop toolset for implementing full-featured data pipelines. As noted in the later discussion of using Spark for interactive analysis, the relationships between variables in the dataset are artificial and simplistic. That does not prevent the use case from showing how the platform handles distributed analytics on large datasets; it only limits the ability to find interesting associations in the data.
This guide discusses the important elements of a typical analytics data pipeline, including: