An end-to-end (E2E) Machine Learning (ML) pipeline consists of three general stages: ETL (extract, transform, and load), Training, and Inference.
These stages are generally computationally expensive, which impacts the performance and effectiveness of the resulting ML pipeline. The goal of this project is to accelerate an E2E ML pipeline using Intel’s oneAPI AI Analytics Toolkit and compare its performance against a baseline pandas implementation.
In evaluating the ETL stage (the most time-consuming stage in the pipeline), we showed that all three Modin backends (Ray, Dask, and OmniSci) performed better than the baseline serial pandas library, with Modin Dask the best performing (~5x speedup). This experiment verified that the Modin libraries can efficiently use all available CPU cores to process data in parallel, allowing Modin to support the pandas API efficiently. Modin Dask performed best because it supports more of the pandas API than Modin Ray and Modin OmniSci, minimizing the number of operations that must default to the original pandas implementations. Modin’s fine-grained control over partitioning allows it to support pandas operations that are otherwise very challenging to parallelize.
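As a rough illustration of how a backend is chosen, the sketch below sets the Modin engine through the standard configuration variable before importing modin.pandas; file and column names are hypothetical and not taken from the project’s code.

```python
import os

# A minimal sketch: the engine must be selected before modin.pandas is imported.
# "dask" and "ray" are the standard engine values; the OmniSci backend is enabled
# through a separate storage-format setting in Intel's Modin distribution.
os.environ["MODIN_ENGINE"] = "dask"

import modin.pandas as pd  # drop-in replacement for the pandas API

# Work is partitioned across all available CPU cores; any operation the chosen
# backend does not support falls back to the original pandas implementation.
df = pd.read_csv("trips.csv")  # hypothetical file
print(df.groupby("passenger_count")["fare_amount"].mean())
```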
Based on this result, an optimized E2E ML pipeline was implemented and deployed on a computing node of the Intel® DevCloud (consisting of an Intel® Ice Lake CPU and a 150W Intel® Arctic Sound GPU) to test the acceleration of large-scale workloads using Intel’s oneAPI toolkit. The comparison used an NYC Taxi historical dataset (64GB) for a regression problem and a Higgs dataset (7.5GB) for a classification problem. The optimized E2E ML pipeline was compared against an unoptimized version (using the native pandas and XGBoost libraries), and experiments were repeated for varying dataset sizes.
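The per-stage comparison can be pictured with a simple timing harness along the following lines; the stage functions are placeholders for each implementation’s actual code, not the project’s own helpers.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (elapsed seconds, result)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result

def benchmark(run_etl, train_model, run_inference, data_path):
    """Time each stage of one pipeline implementation (baseline or optimized)."""
    t_etl, (X_train, X_test, y_train, y_test) = timed(run_etl, data_path)
    t_train, model = timed(train_model, X_train, y_train)
    t_infer, _ = timed(run_inference, model, X_test)
    return {"etl": t_etl, "train": t_train, "inference": t_infer,
            "total": t_etl + t_train + t_infer}

# Per-stage speedup = baseline time / optimized time, e.g.
# speedup_etl = baseline["etl"] / optimized["etl"]
```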
In summary, Intel’s AI Analytics Toolkit provides familiar Python tools and frameworks built with low-level optimizations that conveniently accelerate end-to-end ML workloads. With oneAPI libraries built for low-level compute optimization, it is possible to achieve highly efficient multithreading, vectorization, and memory management, and to scale scientific computations across a cluster.
The optimized implementation uses Modin Dask to accelerate the ETL stage of the E2E ML pipeline. For the ETL stage, the speedup grows steadily with dataset size for both datasets: the NYC dataset ranges from ~4x at 2GB to ~4.8x at 16GB to ~5x at 64GB; the Higgs dataset ranges from ~1.9x at 1GB to ~2.9x at 4GB to ~3x at 7.5GB. The performance increase of the ETL stage can be attributed to the Intel® AI Analytics Toolkit’s distribution of Modin with Dask, which uses all available CPU cores efficiently.
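A minimal ETL sketch of this pattern is shown below; the column names follow the public NYC Taxi schema and are assumptions, not the project’s exact preprocessing.

```python
import os
os.environ["MODIN_ENGINE"] = "dask"
import modin.pandas as pd

# Illustrative columns from the public NYC Taxi schema.
df = pd.read_csv("nyc_taxi.csv",
                 parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"])

# Feature engineering and cleaning execute in parallel across row partitions.
df["trip_minutes"] = (df["tpep_dropoff_datetime"]
                      - df["tpep_pickup_datetime"]).dt.total_seconds() / 60
df = df[(df["fare_amount"] > 0) & (df["trip_distance"] > 0)]

X = df[["passenger_count", "trip_distance", "trip_minutes"]]
y = df["fare_amount"]
```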
The optimized implementation uses oneDAL to accelerate the Inference stage of the E2E ML pipeline. For the NYC dataset, the speedup of the Inference stage increases with dataset size, from ~1.8x at 2GB to the largest speedup of ~3x at 64GB. For the Higgs dataset, the speedup increases from slightly >1x at 1GB to ~1.8x at 4GB to the largest speedup of >3x at 7.5GB. The performance boost for the Inference stage is the result of the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) vector instruction set, which maximizes the utilization of the Intel® Xeon® processors.
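A hedged sketch of this conversion follows, assuming daal4py’s gradient-boosting model converter and synthetic stand-in data in place of the actual pipeline features.

```python
import numpy as np
import xgboost as xgb
import daal4py as d4p

# Synthetic stand-in data; the real pipeline uses the NYC Taxi / Higgs features.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((10_000, 8)), rng.random(10_000)
X_test = rng.random((2_000, 8))

params = {"objective": "reg:squarederror", "max_depth": 8, "tree_method": "hist"}
booster = xgb.train(params, xgb.DMatrix(X_train, label=y_train),
                    num_boost_round=100)

# Convert the trained XGBoost booster to a oneDAL model; this conversion is the
# overhead attributed to the Training stage in the results above.
daal_model = d4p.get_gbt_model_from_xgboost(booster)

# Prediction now runs through oneDAL's vectorized (AVX-512) kernels.
pred = d4p.gbt_regression_prediction().compute(X_test, daal_model).prediction
```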
For the Training stage, the baseline pandas version performed better on both datasets. This is expected: in the optimized version the trained XGBoost model is converted to a oneDAL model using daal4py, and this conversion adds overhead to the Training stage. For the NYC dataset, training in the optimized version takes slightly more time than the native pandas/XGBoost version at every dataset size; for the smaller Higgs dataset, the conversion overhead is an even larger fraction of the training time.
Finally, as shown in the section Evaluation of ML Pipeline with Varying Dataset Sizes, the composite results (covering all three pipeline stages) summarize the performance of the E2E ML pipeline. The ETL stage is the most time-consuming stage in the pipeline for all dataset sizes. Both the ETL and Inference stages show a steady increase in speedup, whereas the Training stage is not accelerated by the AI Analytics Toolkit; at small dataset sizes, the Training-stage overhead is especially noticeable. For the NYC dataset, the overall speedup of the E2E pipeline increases steadily from ~2x at 2GB to ~2.6x at 16GB, reaching the maximum speedup of ~3x at 64GB. For the Higgs dataset, the overall speedup is very modest at the smallest sizes (1GB, 2GB); beyond 2GB it stays fairly constant below 2x, reaching ~1.9x at 7.5GB. Thus, small datasets such as the Higgs dataset are too small to take full advantage of the parallel benefits offered by the AI Analytics Toolkit.