The amount of data and the number of computations required for an organization of this size exceed the capability of any single scale-up machine architecture.
The data preparation, model training, and subsequent allocation and conversion calculations all involve tables with tens of millions of rows, which makes operations like joining, aggregating, and filtering demanding on both processors and memory. The team is confident that Spark is the best choice for the data ingestion and transformation workload. However, they have no experience with data modeling or model inference using projects from the Spark ecosystem, such as MLlib for machine learning or Intel BigDL for deep learning. Most of the team has worked with Python and R, training models on smaller datasets on a single workstation.
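To make the scale problem concrete, a minimal PySpark sketch of the kind of join-and-aggregate step the team describes might look like the following. The file paths, table names, and column names are illustrative assumptions, not details from the scenario.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demand-prep").getOrCreate()

# Hypothetical inputs: point-of-sale records with tens of millions of rows,
# plus a smaller table of store attributes.
sales = spark.read.parquet("s3://example-bucket/sales/")
stores = spark.read.parquet("s3://example-bucket/stores/")

# Join sales to store attributes, then aggregate weekly demand per store
# and product -- the kind of wide join and group-by that strains a single
# workstation but distributes naturally across a Spark cluster.
weekly_demand = (
    sales.join(stores, on="store_id", how="inner")
         .withColumn("week", F.date_trunc("week", F.col("sale_date")))
         .groupBy("store_id", "product_id", "week")
         .agg(F.sum("quantity").alias("units_sold"),
              F.sum("revenue").alias("revenue"))
)

weekly_demand.write.mode("overwrite").parquet("s3://example-bucket/weekly_demand/")
```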
The team also lacks experience operating applications that use machine learning models in production. Stock outages are known to shape customer perceptions that can persist long into the future, so the data science team must define a workflow that incorporates new information into their models within days of it becoming available. They need automation that runs without human intervention during normal operations, and they must be notified quickly if the models produce forecasts that differ significantly from historical trends. The platform and tools they choose for this project must therefore be easy to automate and must incorporate monitoring.
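As a rough illustration of that monitoring requirement, a deviation check along the following lines could flag forecasts that stray too far from historical trends. The column names, the threshold, and the notification hook are assumptions used only for this sketch, not a prescribed design.

```python
import pandas as pd

def check_forecast_drift(forecasts: pd.DataFrame,
                         history: pd.DataFrame,
                         threshold: float = 0.25) -> pd.DataFrame:
    """Flag products whose forecast deviates from the trailing historical
    average by more than `threshold` (a fraction of the historical mean)."""
    # Average historical demand per product (e.g., over the trailing year).
    hist = (history.groupby("product_id", as_index=False)["units_sold"]
                   .mean()
                   .rename(columns={"units_sold": "hist_mean"}))
    merged = forecasts.merge(hist, on="product_id")
    merged["deviation"] = ((merged["forecast_units"] - merged["hist_mean"]).abs()
                           / merged["hist_mean"])
    return merged[merged["deviation"] > threshold]

# Example of how such a check might be wired into an automated run:
# flagged = check_forecast_drift(latest_forecasts, last_52_weeks)
# if not flagged.empty:
#     notify_team(flagged)  # hypothetical alerting hook, e.g., page the on-call analyst
```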