The inventory management use case relies on four workflows to obtain its results. These workflows, in the order they are performed, are:
- Query a data lake
- Run a data preparation job
- Create a model
- Validate the model
This topic describes how to obtain data from a data lake using Spark.
Before querying a data lake with Spark, you must:
- Connect to the data lake platform.
- Locate the datafiles in the data lake platform.
- Log in to the Kubernetes platform.
- Verify network connectivity between the data lake and the Kubernetes platform.
This procedure loads datafiles from the data lake into Spark memory.
Note: The use case workflows in this reference architecture use the Python programming language.
- Create a SparkContext object as the entry point to Spark.
- Use that object to perform SQL operations that read the .csv datafiles from the data lake.
Figure 1: Reading datafiles from a data lake
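The exact code depends on the data lake platform and the datafile locations. The following is a minimal sketch, assuming three .csv datafiles reachable over an S3-compatible path; the s3a:// paths and the sales/items/stores names are placeholders, not part of the reference architecture.
```python
from pyspark.sql import SparkSession

# Create the Spark entry point; the SparkContext is available through it.
spark = SparkSession.builder.appName("inventory-management").getOrCreate()
sc = spark.sparkContext

# Read the .csv datafiles from the data lake. The paths and file names are
# placeholders; replace them with the locations found in your data lake platform.
sales_df = spark.read.csv("s3a://datalake/sales.csv", header=True, inferSchema=True)
items_df = spark.read.csv("s3a://datalake/items.csv", header=True, inferSchema=True)
stores_df = spark.read.csv("s3a://datalake/stores.csv", header=True, inferSchema=True)

sales_df.printSchema()
```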
The datafiles are now loaded into Spark memory.
Figure 2: Datafiles loaded into Spark memory
This topic describes how to prepare the data, now held in Spark dataframes, for model training and testing.
Before transforming the dataframes, you must ensure that they are properly loaded and visible in Spark.
After all the dataframes are loaded, this procedure performs Extract, Transform, and Load (ETL) operations to create one coherent dataset from three different datafiles. It then splits the data into test and train sets. The train set is used to train the model, and the model's performance is measured on the test set.
- Join the various dataframes on the join key to create a single data table.
Figure 3: Join the datasets
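A minimal sketch of this join, assuming the three placeholder dataframes from the previous topic share item_id and store_id join keys (these column names are assumptions):
```python
# Join the dataframes on their shared keys to create a single data table.
joined_df = (
    sales_df
    .join(items_df, on="item_id", how="inner")
    .join(stores_df, on="store_id", how="inner")
)
joined_df.show(5)
```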
- Parse the input columns to derive additional input columns.
Figure 4: Parse the input columns
Figure 5: Parse the input columns (continued)
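As an illustration of this parsing step, additional input columns can be derived from a date column. The sketch below assumes the joined table has a column named date:
```python
from pyspark.sql import functions as F

# Derive additional input columns from the existing "date" column.
parsed_df = (
    joined_df
    .withColumn("year", F.year("date"))
    .withColumn("month", F.month("date"))
    .withColumn("day_of_week", F.dayofweek("date"))
    .withColumn("week_of_year", F.weekofyear("date"))
)
```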
- Split the data into test and train sets, and remove NULL values from the table.
Figure 6: Split the dataset
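A sketch of this final preparation step; the 80/20 random split is an assumption, and a time-based split is also common for sales data:
```python
# Remove rows with NULL values, then split the data into train and test sets.
clean_df = parsed_df.dropna()
train_df, test_df = clean_df.randomSplit([0.8, 0.2], seed=42)
```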
This topic describes how to create a model for testing and training.
Before you create a model, you must:
- Check that the data tables are joined correctly.
- Ensure that all the NULL values are removed.
This procedure adds statistical columns derived from the data, selects the model input columns, and defines the model.
- Add the average and variation of sales (the target) up to a given date as additional columns for that date.
Figure 7: Add new columns
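One way to compute these running statistics is with a window ordered by date. The sketch below assumes sales is the target column and that item_id and store_id identify a series; all names are placeholders:
```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window over all rows up to and including each date, per item/store series.
w = (
    Window.partitionBy("item_id", "store_id")
    .orderBy("date")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

def add_running_stats(df):
    return (
        df.withColumn("sales_avg_to_date", F.avg("sales").over(w))
          .withColumn("sales_std_to_date", F.stddev("sales").over(w))
          # The first row of each series has no spread yet; 0 is a simple default.
          .fillna(0.0, subset=["sales_std_to_date"])
    )

train_df = add_running_stats(train_df)
test_df = add_running_stats(test_df)
```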
- Check the correlation of the output variable to all the input features.
Figure 8: Code for checking correlation of output to input
Figure 9: Resulting graph
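A sketch of this check using DataFrame.stat.corr for each candidate input column; the feature names continue the earlier placeholders:
```python
# Pearson correlation between each numeric input column and the target.
candidate_features = [
    "month", "day_of_week", "week_of_year",
    "sales_avg_to_date", "sales_std_to_date",
]
for feature in candidate_features:
    corr = train_df.stat.corr(feature, "sales")
    print(f"{feature}: {corr:.3f}")
```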
- Select the input features with high correlation values, and define a model.
Figure 10: Select the high correlation input features
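A sketch of this step, assuming a decision tree regressor from pyspark.ml, consistent with the decision tree parameters described in the validation topic; the selected feature names are placeholders carried over from the correlation check:
```python
from pyspark.ml.regression import DecisionTreeRegressor

# Keep only the input features with high correlation to the target, and
# define the model to be trained later.
selected_features = ["month", "week_of_year", "sales_avg_to_date"]
dt = DecisionTreeRegressor(featuresCol="features", labelCol="sales")
```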
The machine learning (ML) model is now ready to be trained on the train data.
This topic describes how to validate the model on a test dataset, and view the results.
Before validating the model, you must:
- Ensure that the test and train data contain exactly the same columns that are selected as inputs to the model.
- Ensure that the statistical columns have correct numerical values in each row.
After the test and train datasets have been created, this procedure:
- Defines a pipeline for the model
- Trains the model on the training data
- Evaluates the model on the test set to see how it performs
- Concatenate all the input columns into a single vector, and pass it to the model for training.
Note: PySpark ML models accept all the input features only as a single vector column.
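A sketch of this assembly step with VectorAssembler, continuing the placeholder column names:
```python
from pyspark.ml.feature import VectorAssembler

# Concatenate the selected input columns into the single "features" vector
# column that pyspark ML models expect.
assembler = VectorAssembler(inputCols=selected_features, outputCol="features")
assembler.transform(train_df).select("features").show(3, truncate=False)
```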
- Define the parameters of the model. For example, the depth of the decision tree.
- Train the model on the training set.
Figure 11: Train the model
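A sketch of the parameter definition and training steps, combining the assembler and the regressor in a pipeline; the maxDepth value is an arbitrary example, not a recommendation:
```python
from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor

# Set the model parameters (for example, the tree depth), build the pipeline,
# and train it on the train set.
dt = DecisionTreeRegressor(featuresCol="features", labelCol="sales", maxDepth=5)
pipeline = Pipeline(stages=[assembler, dt])
model = pipeline.fit(train_df)
```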
- Use the trained model on the test set, and check the Root Mean Squared Error (RMSE) value.
Figure 12: Apply the trained model
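A sketch of the validation step, using RegressionEvaluator to compute the RMSE on the test set:
```python
from pyspark.ml.evaluation import RegressionEvaluator

# Apply the trained pipeline to the test set and compute the RMSE.
predictions = model.transform(test_df)
evaluator = RegressionEvaluator(
    labelCol="sales", predictionCol="prediction", metricName="rmse"
)
rmse = evaluator.evaluate(predictions)
print(f"Test RMSE: {rmse:.3f}")
```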
Now you can plot a graph to gain better insight into the model results.
Note: The data represented in Figure 13 was randomly generated. The model found no seasonality patterns, so it produced a flat-line forecast.
Figure 13: Plot a graph
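One possible way to produce such a plot is to convert a small, date-ordered sample of the predictions to pandas and use matplotlib; this choice of plotting library is an assumption:
```python
import matplotlib.pyplot as plt

# Plot actual sales against the model forecast for a small, date-ordered sample.
sample = (
    predictions.select("date", "sales", "prediction")
    .orderBy("date")
    .limit(200)
    .toPandas()
)

plt.plot(sample["date"], sample["sales"], label="actual")
plt.plot(sample["date"], sample["prediction"], label="forecast")
plt.xlabel("date")
plt.ylabel("sales")
plt.legend()
plt.show()
```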