In the section Optimized End-to-End ML Pipeline vs. Serial Pipeline, the performance of the optimized implementation (using Modin-Dask and oneDAL) was compared against the unoptimized version (using native pandas and stock XGBoost), with the results broken down by the three pipeline stages. This section uses the optimized code to study how pipeline performance scales with dataset size for both datasets. The section ML Pipeline Performance: NYC Dataset, Varying Dataset Sizes examines the NYC dataset, reporting timings for each of the three pipeline stages and then summarizing the composite (end-to-end) performance across all three stages. Similarly, the section ML Pipeline Performance: Higgs Dataset, Varying Dataset Sizes repeats the scaling analysis for the Higgs dataset.
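The scaling study described above can be sketched as a small timing harness: run each pipeline stage on growing subsets of a dataset and record per-stage wall-clock time. This is a minimal illustration, not the benchmark code itself; the stage bodies (`ingest`, `preprocess`, `train`) are placeholders standing in for the real work, which in the optimized pipeline would run on Modin-on-Dask (ingestion and preprocessing) and oneDAL-accelerated training.

```python
import time

def time_stage(fn, data):
    # Run one pipeline stage and measure its wall-clock time.
    t0 = time.perf_counter()
    out = fn(data)
    return out, time.perf_counter() - t0

# Placeholder stage bodies; the real pipeline's stages are far heavier.
def ingest(rows):        # stage 1: data ingestion
    return list(rows)

def preprocess(rows):    # stage 2: feature engineering / ETL
    return [r * 2 for r in rows]

def train(rows):         # stage 3: model training
    return sum(rows) / len(rows)

def scaling_study(full_dataset, fractions=(0.25, 0.5, 1.0)):
    # Time each stage on progressively larger subsets of the dataset.
    results = {}
    for frac in fractions:
        subset = full_dataset[: int(len(full_dataset) * frac)]
        timings = {}
        data, timings["ingest"] = time_stage(ingest, subset)
        data, timings["preprocess"] = time_stage(preprocess, data)
        _, timings["train"] = time_stage(train, data)
        timings["total"] = sum(timings.values())  # composite (E2E) time
        results[frac] = timings
    return results

report = scaling_study(list(range(100_000)))
```

Plotting each stage's time (and the composite total) against the dataset fraction gives the kind of per-stage breakdown the following sections report for the NYC and Higgs datasets.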