
Deep Learning on Spark is Getting Interesting
Mon, 03 Aug 2020 15:53:44 -0000
The year 2012 will be remembered as a breakout year for data analytics. Deep learning's meteoric rise to prominence can largely be attributed to the 2012 introduction of convolutional neural networks (CNNs) for image classification on the ImageNet dataset during the Large-Scale Visual Recognition Challenge (LSVRC) [1]. It was a historic event after a very long incubation period for deep learning that began with mathematical theory work in the 1940s, 50s, and 60s. The prior history of neural network and deep learning development is fascinating and should not be forgotten, but it is not an overstatement to say that 2012 was the breakout year for deep learning.
Coincidentally, 2012 was also a breakout year for in-memory distributed computing. A group of researchers from the University of California, Berkeley's AMPLab published a paper with an unassuming title that changed the world of data analytics: "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing" [2]. The paper describes how Spark's creators developed an efficient, general-purpose, and fault-tolerant in-memory data abstraction for sharing data in cluster applications. The effort was motivated by the shortcomings of both MapReduce and other distributed-memory programming models for processing iterative algorithms and interactive data mining jobs.
Spark achieved its original goal of being a general-purpose programming environment, as proven by the ongoing development of so many application libraries that leverage its RDD abstraction: GraphX for creating graphs and graph-parallel computation, Spark Streaming for scalable, fault-tolerant streaming applications, and MLlib for scalable machine learning. The rest of this article describes the development and integration of deep learning libraries, a now extremely useful class of the iterative algorithms that Spark was designed to address. The importance of deep learning for data analytics and artificial intelligence was just beginning to emerge when Spark was created, so the combination of the two developments has been interesting to watch.
MLlib – The original machine learning library for Spark
MLlib development started not long after the AMPlab code was transferred to the Apache Software Foundation in 2013. It is not really a deep learning library; however, it does offer a multilayer perceptron classifier [3], based on the feedforward artificial neural network, with backpropagation implemented for learning the model. Fully connected neural networks of this kind were largely set aside after the development of more sophisticated models constructed from convolutional, recursive, and recurrent networks.
Fully connected shallow and deep networks are making a comeback as alternatives to tree-based models for both regression and classification. There is also a lot of current interest in various forms of autoencoders used to learn latent (hidden) compressed representations of data for dimension reduction and self-supervised classification. MLlib, therefore, is best characterized as a machine learning library with some limited neural network capability.
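For illustration, here is a minimal PySpark sketch of MLlib's multilayer perceptron classifier, trained on the sample LIBSVM data set that ships with the Spark distribution; the layer sizes and file path are illustrative choices, not prescriptions:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("mlp-example").getOrCreate()

# Any DataFrame with "features" and "label" columns works here.
data = spark.read.format("libsvm") \
    .load("data/mllib/sample_multiclass_classification_data.txt")
train, test = data.randomSplit([0.8, 0.2], seed=42)

# layers: 4 input features, two hidden layers, 3 output classes
mlp = MultilayerPerceptronClassifier(layers=[4, 8, 8, 3], maxIter=100, seed=42)
model = mlp.fit(train)

accuracy = MulticlassClassificationEvaluator(metricName="accuracy") \
    .evaluate(model.transform(test))
print(f"Test accuracy: {accuracy:.3f}")
```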
BigDL – Intel open sources a full-featured deep learning library for Spark
BigDL is a distributed deep learning library for Apache Spark. It implements distributed, data-parallel training directly on top of Spark's functional compute model, using the core Spark features of copy-on-write and coarse-grained operations. The framework has been referenced in applications as diverse as transfer learning-based image classification, object detection and feature extraction, sequence-to-sequence prediction for precipitation nowcasting, neural collaborative filtering for recommendations, and more. Contributors and users span a wide range of industries, including Mastercard, World Bank, Cray, Talroo, University of California San Francisco (UCSF), JD, UnionPay, Telefonica, and GigaSpaces [4].
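To give a flavor of the programming model, here is a minimal training sketch based on the pre-2.0 BigDL Python API; the module paths, argument names, and the preparation of training_rdd (an RDD of BigDL Sample objects, built upstream with ordinary Spark code) are assumptions that varied across BigDL releases:

```python
# Sketch only: BigDL 0.x-era Python API; exact signatures vary by release.
from pyspark import SparkContext
from bigdl.util.common import create_spark_conf, init_engine
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

sc = SparkContext(conf=create_spark_conf())
init_engine()  # initializes BigDL on the Spark executors

# A small fully connected model; real pipelines would build CNNs or RNNs.
model = (Sequential()
         .add(Linear(784, 128)).add(ReLU())
         .add(Linear(128, 10)).add(LogSoftMax()))

# training_rdd: an RDD[Sample] prepared upstream (assumed here)
optimizer = Optimizer(model=model,
                      training_rdd=training_rdd,
                      criterion=ClassNLLCriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(5),
                      batch_size=256)
trained_model = optimizer.optimize()  # synchronous data-parallel SGD on Spark
```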
Engineers from Dell EMC and Intel recently completed a white paper demonstrating the use of deep learning development tools from the Intel Analytics Zoo [5] to build an integrated pipeline on Apache Spark that ends with a deep neural network model for predicting diseases from chest X-rays [6]. Tools and examples in the Analytics Zoo give data scientists the ability to train and deploy BigDL, TensorFlow, and Keras models on Apache Spark clusters. Application developers can also use Analytics Zoo resources to deploy production-class intelligent applications through model extraction, with models that can be served in Java, Scala, or any other Java virtual machine (JVM) language.
The researchers conclude that modern deep learning applications can be developed and deployed at scale on an existing Hadoop and Spark cluster. This approach avoids the need to move data to a separate deep learning cluster and eliminates the operational complexity of provisioning and maintaining yet another distributed computing environment. The open-source software described in the white paper is available on GitHub [7].
H2O.ai – Sparkling Water for Spark
H2O is a fast, scalable, open-source machine learning and deep learning platform for smarter applications. Much like MLlib, the H2O algorithms cover a wide range of useful machine learning techniques but, for deep learning, only fully connected MLPs. With H2O, enterprises like PayPal, Nielsen Catalina, Cisco, and others can use all of their data, without sampling, to get accurate predictions faster [8]. Dell EMC, Intel, and H2O.ai recently developed a joint reference architecture that outlines both technical considerations and sizing guidance for an on-premises enterprise AI platform [9].
The engineers show how running H2O.ai software on optimized Dell EMC infrastructure with the latest Intel® Xeon® Scalable processors and NVMe storage enables organizations to use AI to improve customer experiences, streamline business processes, and decrease waste and fraud. The validated software included the H2O Driverless AI enterprise platform and the H2O and H2O Sparkling Water open-source software platforms. Sparkling Water is designed to be executed as a regular Spark application. It provides a way to initialize H2O services on Spark and to access data stored in both Spark and H2O data structures. H2O Sparkling Water algorithms are designed to take advantage of the distributed in-memory computing of existing Spark clusters, and results from H2O can easily be deployed using H2O low-latency pipelines or within Spark for scoring.
H2O Sparkling Water cluster performance was evaluated on three- and five-node clusters. In this mode, H2O launches through the Spark workers, and Spark manages the job scheduling and communications between the nodes. Three and five Dell EMC PowerEdge R740xd servers with Intel Xeon Gold 6248 processors were used to train XGBoost and GBM models on the mortgage data set derived from the Fannie Mae Single-Family Loan Performance data set.
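The programming pattern is compact. Below is a minimal PySparkling sketch in the spirit of the mortgage experiment; the delinquent label column is a hypothetical name, and calls such as H2OContext.getOrCreate and asH2OFrame differ slightly across Sparkling Water releases:

```python
# Sketch: launching H2O inside a Spark application with PySparkling.
from pyspark.sql import SparkSession
from pysparkling import H2OContext
from h2o.estimators.gbm import H2OGradientBoostingEstimator

spark = SparkSession.builder.appName("sparkling-water-gbm").getOrCreate()
hc = H2OContext.getOrCreate()  # starts H2O workers inside the Spark executors

# mortgage_df: a Spark DataFrame prepared upstream with Spark SQL (assumed)
mortgage_hf = hc.asH2OFrame(mortgage_df, "mortgage")
train, valid = mortgage_hf.split_frame(ratios=[0.8], seed=42)

gbm = H2OGradientBoostingEstimator(ntrees=100, max_depth=6, seed=42)
gbm.train(y="delinquent", training_frame=train, validation_frame=valid)
print(gbm.auc(valid=True))  # validation AUC
```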
Spark and GPUs
Many data scientists familiar with Spark for machine learning have been waiting for official support for GPUs. The advantages of modern neural network models, like the CNN entry in the 2012 LSVRC, would not have been fully realized without the work of NVIDIA and others on new acceleration hardware. NVIDIA GPU technology like the Volta V100 has morphed into a class of advanced, enterprise-class ML/DL accelerators that reduce training time for all types of neural network configurations, including CNNs, RNNs (recurrent neural networks), and GANs (generative adversarial networks), to mention just a few of the most popular forms. Deep learning researchers see many advantages to building end-to-end model training pipelines that take advantage of Spark's generalized distributed computing capability for everything from data cleaning and shaping through scale-out training integrated with GPUs.
NVIDIA recently announced that it has been working with the Apache Spark open-source community to bring native GPU acceleration to the next version of the big data processing framework, Spark 3.0 [10]. The Apache Spark community is distributing a preview release of Spark 3.0 to encourage wide-scale community testing of the upcoming release. The preview is not a stable release of the expected API specification or functionality. No firm date for the general availability of Spark 3.0 has been announced, but organizations exploring options for distributed deep learning with GPUs should start evaluating the proposed features and advantages of Spark 3.0.
Cloudera is also giving developers and data scientists an opportunity to test and evaluate the preview release of Spark 3.0. The current GA version of the Cloudera Runtime includes the Apache Spark 3.0 preview 2 as part of the CDS 3 (Experimental) Powered by Apache Spark release [11]. Full Spark 3.0 preview 2 documentation, including many code samples, is available from the Apache Spark website [12].
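As a sketch of what this looks like in practice, the Spark 3.0 preview exposes GPU-aware scheduling through configuration properties, and NVIDIA's RAPIDS Accelerator plugs in as a SQL plugin; the plugin class name, jar placement, and discovery script path below are assumptions tied to specific Spark and plugin builds:

```python
# Sketch: GPU resource scheduling in the Spark 3.0 preview plus the
# RAPIDS Accelerator (the plugin jar must already be on the classpath).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark3-gpu-preview")
         .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   # RAPIDS Accelerator
         .config("spark.rapids.sql.enabled", "true")
         .config("spark.executor.resource.gpu.amount", "1")       # one GPU per executor
         .config("spark.task.resource.gpu.amount", "0.25")        # four tasks share a GPU
         .config("spark.executor.resource.gpu.discoveryScript",
                 "/opt/sparkRapidsPlugin/getGpusResources.sh")    # assumed path
         .getOrCreate())

# DataFrame/SQL operations now run on the GPU where the plugin supports them.
```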
What’s next
It has been eight years since the breakout events for deep learning and distributed computing with Spark. We have seen tremendous adoption of both deep learning and Spark for all types of analytics use cases, from medical imaging to language processing to manufacturing control and beyond. We are now poised to see new breakthroughs in the merging of Spark and deep learning, especially with the addition of support for hardware accelerators. IT professionals and data scientists are still too heavily burdened with the hidden technical debt of managing machine learning systems [13]. The integration of accelerated deep learning with the power of Spark's generalized distributed computing platform will give both the IT and data science communities a capable and manageable environment for developing and hosting end-to-end data analysis pipelines in a common framework.
References
[1] Alom, M. Z., Taha, T. M., Yakopcic, C., Westberg, S., Sidike, P., Nasrin, M. S., ... & Asari, V. K. (2018). The history began from AlexNet: A comprehensive survey on deep learning approaches. arXiv preprint arXiv:1803.01164.
[2] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauly, M., ... & Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Presented as part of the 9th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 12) (pp. 15-28).
[3] Apache Spark (June 2020) Multilayer perceptron classifier https://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier
[4] Dai, J. J., Wang, Y., Qiu, X., Ding, D., Zhang, Y., Wang, Y., ... & Wang, J. (2019, November). BigDL: A distributed deep learning framework for big data. In Proceedings of the ACM Symposium on Cloud Computing (pp. 50-60).
[5] Intel Analytics Zoo (June 2020) https://software.intel.com/content/www/us/en/develop/topics/ai/analytics-zoo.html
[6] Chandrasekaran, Bala (Dell EMC), Yang, Yuhao (Intel), Govindan, Sajan (Intel), & Abd, Mehmood (Dell EMC) (2019). Deep Learning on Apache Spark and Analytics Zoo.
[7] Dell AI Engineering (June 2020) BigDL Image Processing Examples https://github.com/dell-ai-engineering/BigDL-ImageProcessing-Examples
[8] Candel, A., Parmar, V., LeDell, E., and Arora, A. (Apr 2020). Deep Learning with H2O https://www.h2o.ai/wp-content/themes/h2o2016/images/resources/DeepLearningBooklet.pdf
[9] Dell Technologies (February 2020). Reference Architectures for H2O.ai. https://www.dellemc.com/resources/en-us/asset/white-papers/products/ready-solutions/dell-h20-architectures-pdf.pdf
[10] Woodie, Alex (May 2020). Spark 3.0 to Get Native GPU Acceleration. Datanami. https://www.datanami.com/2020/05/14/spark-3-0-to-get-native-gpu-acceleration/
[11] CDS 3 (Experimental) Powered by Apache Spark Overview (June 2020) https://docs.cloudera.com/runtime/7.0.3/cds-3/topics/spark-spark-3-overview.html
[12] Spark Overview (June 2020) https://spark.apache.org/docs/3.0.0-preview2/
[13] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... & Dennison, D. (2015). Hidden technical debt in machine learning systems. In Advances in neural information processing systems (pp. 2503-2511).
Related Blog Posts

Omnia: Open-source deployment of high-performance clusters to run simulation, AI, and data analytics workloads
Mon, 12 Dec 2022 18:31:28 -0000
High Performance Computing (HPC), in which clusters of machines work together as one supercomputer, is changing the way we live and how we work. These clusters of CPUs, memory, accelerators, and other resources help us forecast the weather and understand climate change, understand diseases, design new drugs and therapies, develop safe cars and planes, improve solar panels, and even simulate life and the evolution of the universe itself. The cluster architecture model that makes this compute-intensive research possible is also well suited for high performance data analytics (HPDA) and developing machine learning models. With the Big Data era in full swing and the Artificial Intelligence (AI) gold rush underway, we have seen marketing teams with their own Hadoop clusters attempting to transition to HPDA and finance teams managing their own GPU farms. Everyone has the same goals: to gain new, better insights faster by using HPDA and by developing advanced machine learning models using techniques such as deep learning and reinforcement learning. Today, everyone has a use for their own high-performance computing cluster. It's the age of the clusters!
Today's AI-driven IT Headache: Siloed Clusters and Cluster Sprawl
Unfortunately, cluster sprawl has taken over our data centers and consumes inordinate amounts of IT resources. Large research organizations and businesses have a cluster for this and a cluster for that. Perhaps each group has a little “sandbox” cluster, or each type of workload has a different cluster. Many of these clusters look remarkably similar, but they each need a dedicated system administrator (or set of administrators), have different authorization credentials, different operating models, and sit in different racks in your data center. What if there was a way to bring them all together?
That’s why Dell Technologies, in partnership with Intel, started the Omnia project.
The Omnia Project
The Omnia project is an open-source initiative with a simple aim: to make consolidated infrastructure easy and painless to deploy using open-source and freely available software. By bringing the best open-source software tools together with the domain expertise of Dell Technologies' HPC & AI Innovation Lab, HPC & AI Centers of Excellence, and the broader HPC community, Omnia gives customers decades of accumulated expertise in deploying state-of-the-art systems for HPC, AI, and data analytics, all in a set of easily executable Ansible playbooks. In a single day, a stack of servers, networking switches, and storage arrays can be transformed into one consolidated cluster for running all your HPC, AI, and data analytics workloads.
Simple by Design
Omnia’s design philosophy is simplicity. We look for the best, most straightforward approach to solving each task.
- Need to run the Slurm workload manager? Omnia assembles Ansible plays that build the required RPM files and deploy them correctly, making sure all the necessary dependencies are installed and functional.
- Need to run the Kubernetes container orchestrator? Omnia takes advantage of community supported package repositories for Linux (currently CentOS) and automates all the steps for creating a functional multi-node Kubernetes cluster.
- Need a multi-user, interactive Python/R/Julia development environment? Omnia takes advantage of best-of-breed deployments for Kubernetes through Helm and OperatorHub, provides configuration files for dynamic and persistent storage, points to optimized containers in DockerHub, Nvidia GPU Cloud (NGC), or other container registries for unaccelerated and accelerated workloads, and automatically deploys machine learning platforms like Kubeflow.
Before we go through the process of building something from scratch, we will make sure there isn’t already a community actively maintaining that toolset. We’d rather leverage others' great work than reinvent the wheel.
Inclusive by Nature
Omnia’s contribution philosophy is inclusivity. From code and documentation updates to feature requests and bug reports, every user’s contributions are welcomed with open arms. We provide an open forum for conversations about feature ideas and potential implementation solutions, making use of issue threads on GitHub. And as the project grows and expands, we expect the technical governance committee to grow to include the top contributors and stakeholders from the community.
What's Next?
Omnia is just getting started. Right now, we can easily deploy Slurm and Kubernetes clusters from a stack of pre-provisioned, pre-networked servers, but our aim is higher than that. We are currently adding capabilities for performing bare-metal provisioning and supporting new and varying types of accelerators. In the future, we want to collect information from the iDRAC out-of-band management system on Dell EMC PowerEdge servers, configure Dell EMC PowerSwitch Ethernet switches, and much more.
What does the future hold? While we have near-term plans for additional feature integrations, we are looking to partner with the community to define and develop future integrations. Omnia will grow and develop based on community feedback and your contributions. In the end, the Omnia project will install and configure not only the open-source software we at Dell Technologies think is important, but also the software you, the community, want it to! We can't think of a better way for our customers to easily set up clusters for HPC, AI, and HPDA workloads, all while leveraging the expertise of the entire Dell Technologies HPC community.
Omnia is available today on GitHub at https://github.com/dellhpc/omnia. Join the community now and help guide the design and development of the next generation of open-source consolidated cluster deployment tools!

Interpreting TPCx-AI Benchmark Results
Wed, 01 Feb 2023 14:29:11 -0000
TPCx-AI Benchmark
Overview
The TPCx-AI benchmark abstracts the diversity of operations in a retail data center scenario. Selecting a retail business model helps the reader relate intuitively to the components of the benchmark, without tracking that industry segment so tightly as to limit the benchmark's relevance. The TPCx-AI benchmark can be used to characterize any industry that must transform operational and external data into business intelligence.
This blog post introduces the TPCx-AI benchmark and uses a published TPCx-AI result to describe how the primary metrics are determined and how they should be read.
Benchmark model
TPCx-AI data science pipeline
The TPCx-AI benchmark imitates the activity of retail businesses and data centers with:
- Customer information
- Department stores
- Sales
- Financial data
- Product catalog and reviews
- Emails
- Data center logs
- Facial images
- Audio conversations
It models the challenges of end-to-end artificial intelligence systems and pipelines where the power of machine learning and deep learning is used to:
- Detect anomalies (fraud and failures)
- Drive AI-based logistics optimizations to reduce costs through real-time forecasts (classification, clustering, forecasting, and prediction)
- Use deep learning AI techniques for customer service management and personalized marketing (facial recognition and speech recognition)
It consists of ten different use cases that help any retail business data center address and manage its business analysis environment.
The TPCx-AI kit uses a Parallel Data Generator Framework (PDGF) to generate the test dataset. To mimic the datasets of different company sizes, the user can specify the scale factor (SF), a configuration parameter that sets the target input dataset size in GB. For example, SF=100 corresponds to 100 GB. Once generated, all the data is processed in the subsequent stages of the data science pipeline.
Use cases
The TPCx-AI Benchmark models the following use cases:
Figure 1: TPCx-AI benchmark use case pipeline flow
Table 1: TPCx-AI benchmark use cases
ID | Use case | ML/DL | Area | Algorithm |
UC01 | Customer Segmentation | ML | Analytics | K-Means |
UC02 | Conversation Transcription | DL | NLP | Recurrent Neural Network |
UC03 | Sales Forecasting | ML | Analytics | ARIMA |
UC04 | Spam Detection | ML | Analytics | Naïve Bayes |
UC05 | Price Prediction | DL | NLP | RNN |
UC06 | Hardware Failure Detection | ML | Analytics | Support Vector Machines |
UC07 | Product Rating | ML | Recommendation | Alternating Least Squares |
UC08 | Trip Type Classification | ML | Analytics | XGBoost |
UC09 | Facial Recognition | DL | Analytics | Logistic Regression |
UC10 | Fraud Detection | ML | Analytics | Logistic Regression |
Benchmark run
The TPCx-AI Benchmark run consists of seven separate tests run sequentially. The tests are listed below:
- Data Generation using PDGF
- Load Test – Loads data into persistent storage (HDFS or other file systems)
- Power Training Test – Generates and trains models
- Power Serving Test I – Uses the model trained in the Power Training Test to conduct the serving phase (inference) for each use case
- Power Serving Test II – There are two serving tests that run sequentially. The test with the greater geometric mean (geomean) of serving times is used in the overall score.
- Scoring Test – Model validation stage. Accuracy of the model is determined using defined accuracy metrics and criteria
- Throughput Test – Runs two or more concurrent serving streams
The elapsed time for each test is reported.
Note: There are seven benchmark phases that span an end-to-end data science pipeline, as shown in Figure 1. For a compliant performance run, the data generation phase is run but not scored; the scored portion consists of the subsequent six tests, load test through throughput test, run sequentially.
Primary metrics
For every result, the TPC requires the publication of three primary metrics:
- Performance
- Price-Performance
- Availability Date
Performance metric
It is possible that not all scenarios in TPCx-AI will be applicable to all users. To account for this situation, the performance metric for TPCx-AI was defined so that no single scenario dominates it. The primary performance metric is the throughput, expressed in terms of AI use cases per minute (AIUCpm) @ SF, as defined in the figure below.
Figure 2: Definition of the TPCx-AI benchmark metric
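The figure itself is not reproduced here; reconstructed from the symbol definitions below and the worked example later in this post, the metric takes the form:

AIUCpm@SF = (SF × N × 60) / (TLD × TPTT × TPST × TTT)^(1/4)

In other words, the score grows with the number of use cases completed per minute and is divided by the fourth root of the product of the four timed phases, so no single phase dominates.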
Where:
TLD = Load time
TPTT = Geomean of the training times
TPST1 = Geomean of the serving times from Power Serving Test I
TPST2 = Geomean of the serving times from Power Serving Test II
TPST = Max(TPST1, TPST2)
TTT = Total elapsed time / (#streams × number of use cases)
N = Number of use cases
Note: The elapsed time for the scoring test is not considered for the calculation of the performance metric. Instead, the results of the scoring test are used to determine whether the Performance test was successful.
The scoring test result for each use case should meet or exceed the reference result set provided in the kit, as shown in the figure below.
Figure 3: Benchmark run accuracy metrics
Calculating the Performance metric
To illustrate how the performance metric is calculated, let us consider the results published for SF=10 at:
https://www.tpc.org/tpcx-ai/results/tpcxai_result_detail5.asp?id=122110802
The figure below shows a portion of the TPCx-AI result highlights, with the elapsed times for the six sequential tests that constitute the benchmark run.
Figure 4: Elapsed time for the benchmark test phases
The result highlights provide only the training times and the serving times. To calculate the final performance metric, we need the geometric mean of the training times and of the serving times, which requires the time taken for each use case. Those times are provided in the Full Disclosure Report (FDR) that is part of the benchmark results. The FDR for the SF=10 result that we are considering is at:
https://www.tpc.org/results/fdr/tpcxai/dell~tpcxai~10~dell_poweredge_r7615~fdr~2022-11-09~v01.pdf
The use case times and accuracy table from the FDR are shown in the figure below.
Figure 5: Use case times and accuracy
Note: The accuracy metrics are defined in Table 7a of the TPCx-AI User Guide.
Using the data in Figure 4 and Figure 5:
TLD = Load time = 2.306 seconds
TPTT = Geomean of training times = 316.799 seconds
= (119.995 × 2104.383 × 113.122 × 89.595 × 974.454 × 424.76 × 26.14 × 4928.427 × 29.112 × 253.63)^(1/10)
TPST1 = Geomean of serving times (test I) = 19.751 seconds
= (10.025 × 8.949 × 4.405 × 12.05 × 4.489 × 144.016 × 4.254 × 396.486 × 75.706 × 22.987)^(1/10)
TPST2 = Geomean of serving times (test II) = 19.893 seconds
= (10.043 × 8.92 × 4.39 × 12.288 × 4.622 × 148.551 × 4.275 × 396.099 × 75.508 × 22.881)^(1/10)
TPST = Max(TPST1, TPST2) = 19.893 seconds
TTT = Total elapsed time / (#streams × number of use cases) = 2748.071 / (100 × 10) = 2.748 seconds
N = Number of use cases = 10
Note: The geometric mean is arrived at by multiplying the time taken for each of the use cases and finding the 10th root of the product.
Plugging the values into the formula for calculating AIUCpm@SF given in Figure 2, we get:
AIUCpm@SF = (10 × 10 × 60) / (2.306 × 316.799 × 19.893 × 2.748)^(1/4)
= 6000 / (39935.591)^(1/4)
= 6000 / 14.1365
= 424.433
The published result is AIUCpm@SF10 = 425.31; the small difference from 424.433 is due to rounding of the intermediate values shown here.
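As a cross-check, the full computation can be reproduced in a few lines of Python, using the per-use-case times from Figures 4 and 5 and the 100 concurrent serving streams reported in the FDR:

```python
# Cross-check of the published SF=10 result using the per-use-case
# times from Figures 4 and 5 (values as printed in the FDR).
from statistics import geometric_mean

SF, N, STREAMS = 10, 10, 100   # scale factor, use cases, serving streams

t_load = 2.306
training = [119.995, 2104.383, 113.122, 89.595, 974.454,
            424.76, 26.14, 4928.427, 29.112, 253.63]
serving1 = [10.025, 8.949, 4.405, 12.05, 4.489,
            144.016, 4.254, 396.486, 75.706, 22.987]
serving2 = [10.043, 8.92, 4.39, 12.288, 4.622,
            148.551, 4.275, 396.099, 75.508, 22.881]

tptt = geometric_mean(training)                       # ~316.80 s
tpst = max(geometric_mean(serving1),
           geometric_mean(serving2))                  # ~19.89 s
ttt = 2748.071 / (STREAMS * N)                        # ~2.748 s

aiucpm = SF * N * 60 / (t_load * tptt * tpst * ttt) ** 0.25
print(f"AIUCpm@SF{SF} = {aiucpm:.3f}")                # ~424.43 (published: 425.31)
```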
Calculating the Price-Performance metric
The Price-Performance metric is defined in the figure below.
Figure 6: Price-Performance metric definition
Where:
- P is the price of the hardware and software components in the System Under Test (SUT)
- AIUCpm@SF is the reported primary performance metric
Note: A one-year pricing model must be used to calculate the price and the price-performance result of the TPCx-AI Benchmark.
AIUCpm@SF10 = 425.31
Price of the configuration = $48,412
$/AIUCpm@SF10 = $48,412 / 425.31 = 113.83 USD per AIUCpm@SF10
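Using the published values, the arithmetic can be checked in two lines of Python:

```python
# Price-Performance = P / AIUCpm@SF, using the published values.
aiucpm = 425.31          # reported AIUCpm@SF10
price_usd = 48_412       # one-year priced SUT configuration
print(f"{price_usd / aiucpm:.2f} USD per AIUCpm@SF10")   # -> 113.83
```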
Availability date
All components used in this result will be orderable and available for shipping by February 22, 2023.
Performance results
Dell has published six world record-setting results based on the TPCx-AI Benchmark standard of the TPC. Links to the publications are provided below.
SF1000
Dell PowerEdge R650/Intel Xeon Gold (Ice Lake) 6348/CDP 7.1.7—11 nodes
https://www.tpc.org/tpcx-ai/results/tpcxai_result_detail5.asp?id=122120101
SF300
Dell PowerEdge R6625/AMD EPYC Genoa 9354/CDP 7.1.7—four nodes
https://www.tpc.org/tpcx-ai/results/tpcxai_result_detail5.asp?id=122110805
SF100
Dell PowerEdge R6625/AMD EPYC Genoa 9354/CDP 7.1.7—four nodes
https://www.tpc.org/tpcx-ai/results/tpcxai_result_detail5.asp?id=122110804
SF30
Dell PowerEdge R6625/AMD EPYC Genoa 9174F/Anaconda3—one node
https://www.tpc.org/tpcx-ai/results/tpcxai_result_detail5.asp?id=122110803
SF10
Dell PowerEdge R7615/AMD EPYC Genoa 9374F/Anaconda3—one node
https://www.tpc.org/tpcx-ai/results/tpcxai_result_detail5.asp?id=122110802
SF3
Dell PowerEdge R7615/AMD EPYC Genoa 9374F/Anaconda3—one node
https://www.tpc.org/tpcx-ai/results/tpcxai_result_detail5.asp?id=122110801
With these results, Dell Technologies holds the following world records on the TPCx-AI Benchmark Standard:
- #1 Performance and Price-Performance on SF1000
- #1 Performance and Price-Performance on SF300
- #1 Performance and Price-Performance on SF100
- #1 Performance and Price-Performance on SF30
- #1 Performance on SF10
- #1 Performance and Price-Performance on SF3
Conclusion
This blog post describes the TPCx-AI benchmark and how the performance results of a TPCx-AI Benchmark publication can be interpreted. It also shows how Dell Technologies maintains leadership in the TPCx-AI landscape.