An end-to-end (E2E) Machine Learning (ML) pipeline consists of three general stages: ETL (extract, transform, load), model training, and inference.
These stages are generally computationally expensive, which impacts the performance and effectiveness of the resulting ML pipeline. The goal of this project is to accelerate an E2E ML pipeline using Intel's oneAPI AI Analytics Toolkit and to compare its performance against a baseline pandas implementation [1].
In evaluating the ETL stage (the most time-consuming stage in the pipeline), we compared ETL performance when implemented with Modin (using the Ray, Dask, or OmniSci backend) against the native pandas library. We showed that all three Modin backends performed better than the baseline serial pandas library, with Modin-Dask performing best (~5x speedup). This experiment verified that Modin can use all available CPU cores to process data in parallel while preserving the pandas API. Based on this result, an optimized E2E ML pipeline was implemented and deployed on a computing node of the Intel® DevCloud (consisting of an Intel® Ice Lake CPU and a 150W Intel® Arctic Sound GPU) to test the acceleration of large-scale workloads using Intel's oneAPI toolkit.
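To illustrate why adopting Modin requires almost no code changes, the minimal sketch below shows pandas being swapped for Modin with the Dask backend; the file name and column names are hypothetical placeholders for the NYC Taxi data, and only the engine selection and the import differ from a plain pandas script.

```python
import os
os.environ["MODIN_ENGINE"] = "dask"   # select the Dask execution backend before importing Modin

import modin.pandas as pd             # drop-in replacement for `import pandas as pd`

# Illustrative file and column names only (the NYC Taxi schema is assumed).
df = pd.read_csv("yellow_tripdata.csv")        # the read is partitioned across all CPU cores
df = df[df["trip_distance"] > 0]               # pandas-style filtering, executed in parallel
avg_fare = df.groupby("passenger_count")["fare_amount"].mean()
print(avg_fare)
```

Because Modin partitions each DataFrame across worker processes, operations such as read_csv, filtering, and groupby are spread over all available cores rather than running on a single one, which is the behavior the ETL benchmark exercises.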
The optimized implementation uses Modin-Dask to accelerate the ETL stage and oneDAL to accelerate the Inference stage. The comparison was performed using an NYC Taxi historical dataset (~64GB) for a regression task and a Higgs dataset (~7.5GB) for a classification task. The optimized E2E ML pipeline was compared against an unoptimized baseline version (using the native pandas and XGBoost libraries). The ETL stage of the E2E ML pipeline achieved ~5x speedup for the NYC dataset and ~3x for the Higgs dataset. For the Inference stage, a significant improvement was also observed (~3x for the NYC dataset, ~2.25x for the Higgs dataset) when using oneDAL with XGBoost in the optimized pipeline. This is because oneDAL exploits the capabilities of the underlying Intel hardware, using the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set to maximize utilization of the Intel® Xeon® processors. Finally, experiments were performed on both datasets to observe the speedup trends as the dataset size was varied.
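The sketch below shows one way such an inference setup can look using daal4py, the Python interface to oneDAL: a booster trained with stock XGBoost is converted into a oneDAL model and then used for prediction. The synthetic data and hyperparameters are illustrative stand-ins for the NYC Taxi regression workload, not the project's actual configuration.

```python
import numpy as np
import xgboost as xgb
import daal4py as d4p

# Synthetic regression data standing in for the NYC Taxi features.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((100_000, 8)), rng.random(100_000)
X_test = rng.random((10_000, 8))

# Train a gradient-boosted regression model with stock XGBoost.
booster = xgb.train({"objective": "reg:squarederror", "max_depth": 6},
                    xgb.DMatrix(X_train, label=y_train), num_boost_round=50)

# Convert the trained booster into a oneDAL (daal4py) model and run
# prediction through oneDAL, which vectorizes the tree traversal
# (e.g. with AVX-512 on supporting Intel Xeon CPUs).
daal_model = d4p.get_gbt_model_from_xgboost(booster)
result = d4p.gbt_regression_prediction().compute(X_test, daal_model)
print(result.prediction[:5].ravel())
```

Training remains unchanged in this pattern; only the prediction step is handed to oneDAL, which is consistent with the Inference-stage speedups reported above.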