An end-to-end (E2E) Machine Learning (ML) pipeline consists of three general stages: (1) Extract, Transform, and Load (ETL) of data; (2) Model Training; and (3) Inference and Visualization, as illustrated in Figure 1. Each of these stages is computationally expensive, which impacts the performance and effectiveness of the resulting ML pipeline.
The goal of this project is to accelerate the performance of an end-to-end ML pipeline using Intel’s oneAPI AI Analytics Toolkit [1]. The oneAPI AI Analytics Toolkit is a set of powerful and familiar Python packages and frameworks that are optimized to deliver drop-in acceleration on Intel architectures. With oneAPI libraries built for low-level compute optimizations, it is possible to achieve highly efficient multithreading, vectorization, and memory management, and to scale scientific computations efficiently across a cluster.
Figure 1. End-to-end ML Pipeline using Intel’s oneAPI AI Analytics Toolkit
Some of the key features of this toolkit are illustrated in Figure 1.
The oneAPI AI Analytics Toolkit [1] is implemented using the oneAPI Data Analytics Library (oneDAL), a powerful machine learning library that helps speed up big data analysis. oneDAL is an extension of the Intel® Data Analytics Acceleration Library (DAAL) and is part of oneAPI. oneAPI is a cross-industry, open, standards-based unified programming model that delivers a common developer experience across accelerator architectures. For more details about these libraries, see the References section for oneAPI [2] and oneDAL [3].
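As a concrete illustration of the drop-in acceleration described above, the sketch below patches scikit-learn with the Intel® Extension for Scikit-learn (scikit-learn-intelex), which dispatches supported estimators to oneDAL. This is a minimal sketch: the dataset and estimator choices are illustrative assumptions, not the exact workload used in this study.

```python
# Minimal sketch: enable oneDAL-backed scikit-learn via scikit-learn-intelex.
# The dataset and estimator below are illustrative, not the exact
# workload used in this study.
from sklearnex import patch_sklearn
patch_sklearn()  # must run before importing scikit-learn estimators

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(X, y)  # dispatched to the oneDAL implementation when supported
print(clf.score(X, y))
```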
This study was conducted using two datasets: the New York City (NYC) Taxi historical dataset [4] and the Higgs dataset [5].
As stated above and illustrated in Figure 1, the E2E ML pipeline consists of three general stages. The Extract, Transform, and Load (ETL) stage is the first stage of the pipeline. In this stage, data is extracted from different sources and transformed into the format required by the Model Training stage. Because the ETL stage is the most time-consuming stage in the pipeline, it is essential to optimize it. In the section Evaluation of oneAPI compatible ETL libraries in this paper, we describe the experiments that we performed to evaluate the various oneAPI-compatible ETL libraries. We compared ETL performance when implemented using Modin [6] (with the Ray, Dask, and OmniSci engines) against the native pandas library. The performance comparison is further broken down into the most commonly used ETL functions: pd.read_csv, df.rename, df.drop, df.fillna, and pd.concat. Based on the performance advantage of Modin-Dask, we optimized the ETL stage of our ML pipeline using the Modin-Dask library, as sketched below.
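The sketch below shows how Modin with the Dask engine can be wired in as a drop-in replacement for pandas, exercising the ETL functions listed above. The file name and column names are hypothetical placeholders, not the actual NYC Taxi or Higgs schema.

```python
# Minimal sketch of Modin-Dask as a drop-in pandas replacement.
# "trips.csv" and the column names are hypothetical placeholders,
# not the actual dataset schema used in this study.
import os
os.environ["MODIN_ENGINE"] = "dask"  # select Dask as the execution engine

import modin.pandas as pd  # same API surface as native pandas

df = pd.read_csv("trips.csv")                     # parallel CSV ingestion
df = df.rename(columns={"old_name": "new_name"})  # rename columns
df = df.drop(columns=["unused_col"])              # drop unneeded columns
df = df.fillna(0)                                 # impute missing values
df = pd.concat([df, df])                          # concatenate frames
```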
The optimized end-to-end ML pipeline was implemented and deployed on a computing node of the Intel® DevCloud [7], containing an Intel® Ice Lake CPU (with 96 CPU cores) and a 150W Intel® Arctic Sound GPU. In the section Optimized End-to-End ML Pipeline vs. Serial Pipeline, the performance of the optimized implementation (using Modin-Dask and oneDAL) is compared against the unoptimized version (using the native pandas and XGBoost libraries), with a performance breakdown across the three stages of the pipeline. The comparison is performed for both the NYC dataset (a regression problem) and the Higgs dataset (a classification problem). In the section Evaluation of ML Pipeline with Varying Dataset Sizes, the optimized code is then used to study how pipeline performance scales with dataset size. The NYC dataset was varied from 2 GB to 64 GB (doubling at each step), and the Higgs dataset from 1 GB to 7.5 GB (in increments of 1 GB).
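To illustrate one way the training and inference stages can be combined with oneDAL, the sketch below trains an XGBoost regressor and converts the trained booster into a daal4py model for accelerated prediction. This is a hedged sketch over synthetic data; the hyperparameters and conversion path are assumptions for illustration, not the exact configuration used in this study.

```python
# Minimal sketch: train with XGBoost, then convert the booster to a
# daal4py (oneDAL) model for accelerated inference. Data and
# hyperparameters are synthetic/illustrative, not those of this study.
import numpy as np
import xgboost as xgb
import daal4py as d4p

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 20)).astype(np.float32)
y = X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(100_000)

booster = xgb.train(
    {"objective": "reg:squarederror", "max_depth": 8},
    xgb.DMatrix(X, label=y),
    num_boost_round=100,
)

# Convert the trained XGBoost booster into a oneDAL model and predict.
daal_model = d4p.get_gbt_model_from_xgboost(booster)
result = d4p.gbt_regression_prediction().compute(X, daal_model)
preds = result.prediction  # shape (n_samples, 1)
```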