Spark is commonly referred to as an “open-source, general-purpose, distributed computing platform”. We will break that sentence down in the rest of this chapter, starting with a discussion of why this class of platforms exists. There are a handful of application development sub-systems that are so broadly applicable that virtually every software developer will use at least one of them during their career. However, building platforms that can take advantage of multi-threaded processors, memory caching, and/or distributed processing using a cluster of computers is beyond the skill set of many professional software developers, let alone most data scientists or data engineers. This need for scalable computing has fortunately led to the development of many open-source and commercial offerings. Spark is one example of an open-source, scalable, distributed computing platform.
The most successful of the scalable platforms are both robust and flexible enough to be used by millions of application developers that otherwise would not be capable of building scalable foundations for critical supporting services on their own. Another characteristic that you will see in all successful platforms is that application developers will use them in creative ways not anticipated by the core developers. This can be either a benefit or a liability to the organizations that adopt them.
Some examples of platforms that have become a core part of so many applications are worth mentioning before we talk about Spark. Relational database management systems and distributed messaging systems meet all the criteria mentioned above: sub-systems used by millions of application developers that would be difficult or impossible for most software professionals to recreate for their own dedicated use. Developing multi-user, ACID-compliant, transaction processing systems with online backup capabilities and reliable crash recovery is only achievable by the best-trained computer scientists with years of experience. Of the major offerings available, only Oracle has developed a truly distributed offering for transaction processing (Real Application Clusters). However, the successful development of scale-up systems using advances in multi-core CPUs and memory technology has made these platforms an essential component of the success of nearly every organization we see. If every one of these organizations needed to find and hire resources with the necessary skills to build an RDBMS, we would not have the widespread success associated with the last 40+ years of digital transformation. (Oracle was introduced in 1979, IBM DB2 in 1983, and Microsoft SQL Server in 1989, to name a few.) Millions of application developers have built scalable and resilient applications using these RDBMS platforms that they otherwise would not have been able to build alone.
There are also many software applications that require the advantages that come from using scale-out message processing. Although the applicability is perhaps not as general as ACID-compliant transaction processing, the software industry still has produced several successful message processing systems including IBM MQ, Microsoft Message Queue (MSMQ) and the widely used Apache open-source Kafka project. While these messaging systems have different approaches and applicable use cases, the important point is that it was more efficient for the software industry to have these platforms developed by a small number of exceptionally talented platform developers rather than having hundreds of thousands of application developers each trying to write their own foundational platform for messaging.
Spark is a relative newcomer to the list of highly scalable distributed computing platforms, first released as an open-source project around 2010 and later donated to the Apache Software Foundation. Spark is robust and has the flexibility to be called a “general purpose” distributed computing platform. The breadth of Spark use cases can be seen from the large number of organizations that publish papers and speak at conferences, including making movie recommendations at Netflix, performing DNA sequencing, analyzing particle physics at CERN, and protecting financial payments at MasterCard and PayPal. The remainder of this chapter focuses on providing enough detail about the architecture, advantages, and limitations of Spark at a beginner or intermediate level. Our intent is to provide background so that data professionals can follow our use of Spark for data transformation and model training as described later in this eBook. You will see how our engineers used Spark to efficiently execute those tasks on a cluster of machines without having to code the distributed computing details directly. Spark allows data professionals to achieve a faster time to solution for many enterprise-class problems including data processing and data science.
During the incubation of Google, the founders realized that in order to revolutionize the efficiency and relevance of web search, they were going to need to develop new computing tools to do it. The number of URLs that existed on the Internet in the early 2000s, plus the complexity of analyzing the relationships of inter-page linking, meant that Google needed both a new scale-out file system and a new scale-out computing platform. The first publicly available descriptions of how those two challenges might be solved were published as white papers in 2003-2004. The researchers at Yahoo who developed the first versions of the Hadoop Distributed File System (HDFS) and the MapReduce computing platform credit those early Google papers for the architectural foundations that started the Hadoop open-source initiative.
The rapid accumulation of data that came with digital transformation created an ever-growing need for distributed computing and storage platforms. At the same time, interest and acceptance of open-source software were on the rise. The rapid expansion in the development and use of Apache Hadoop ecosystem projects including HDFS and MapReduce demonstrated the applicability of distributed compute and storage for many areas beyond the web search interests of Google and Yahoo. It is a testament to the original architects that well-designed generalized tools can produce many benefits beyond the narrow application requirements that initially prompted the research.
As the volume and variety of the data that organizations wanted to explore grew, MapReduce began to show signs of severe limitations. The name reveals the first limitation: the architecture is limited to just two operation types, mapping and reducing. Mapping was defined as the process of transforming a file with one format into a new file with a different format. The most common example used when teaching developers about MapReduce is a word count map: given a file of text, transform that unstructured format into a two-column list with every distinct word from the original document in the first column and a count of the number of times it appears in the second column.
If there are multiple files in the source, or if you want to split several large files into smaller chunks to get more parallelism in processing, then you might also add a reduce step to the map task described above. For example, a simple reduce task would be to combine all the results from the map into a single “master list” of word counts, i.e., input many files and reduce them to a single output file with the summary results.
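To make the two steps concrete, here is a minimal, single-process Python sketch of the word count logic. The function names and the tiny in-memory input are hypothetical stand-ins for the map and reduce phases that a real MapReduce job would run across many files and machines.

```python
from collections import defaultdict

# Map step: transform each line of text into (word, 1) pairs.
def map_words(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Reduce step: combine the pairs from all map outputs into one summary list.
def reduce_counts(pairs):
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return totals

lines = ["the quick brown fox", "the lazy dog"]
print(dict(reduce_counts(map_words(lines))))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```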
The MapReduce design limited the scope of a job to, at most, one map task and one reduce task. The analyst could split the tasks and have one job perform just the map before writing those results to disk. Then another job could read the output of the map task from disk, perform the reduce step, and end by writing the single summary output file to disk. Breaking the work into two jobs with all the extra reads and writes might seem inefficient, but there may be an operational justification for doing it.
However, combining jobs to cut down on reads and writes was not possible with MapReduce. If you wanted to add another step that filters the summary to eliminate common words, plus any words in all capital letters, you could not string that together with the first map and reduce. A job can contain at most one of each, and every job carries the overhead of the initial reads from disk and the final writes to disk.
These limitations led researchers at the UC Berkeley AMPLab to review the internals of MapReduce to see if they could improve the efficiency of distributed computing compatible with the Hadoop ecosystem. The result of that work was the design for a new parallel processing framework: Spark.
Entire books are available on the Spark architecture and internals. Our goal is to provide enough summary material for those without much background to understand the remainder of this eBook. For those more familiar with Spark, we hope this treatment gives you some new insights and possibly adds some new approaches to how you explain Spark to others. The generally accepted way to refer to the generic data object common to many data science disciplines is the two-word form: data frame. Spark dataframes are always written as a single word in the documentation. We try to follow this convention consistently in this paper.
Many problem domains that need an approach for efficiently describing and processing a sequence of events have used a tool from mathematical graph theory called a directed acyclic graph (DAG). The MapReduce architecture for defining jobs can be described with a very simple (almost trivial) DAG.
The designers of Spark wanted to expand support for more complex DAGs while simultaneously isolating application developers from having to define the DAG before submitting a job to Spark. The new approach would define a larger set of common tasks like union(), randomSample(), sortBy(), filter(), etc. that could be strung together to perform a wide range of useful jobs or programs. Application developers could describe what they wanted Spark to do with the data, and the Spark core runtime would construct an efficient DAG that it could execute to produce the desired result. Developers working with MapReduce were responsible for building and sharing reusable building blocks of code to execute in either a map or reduce task. Spark provides a rich set of prebuilt functions grouped into transformations and actions available through an API. This built-in shared library makes Spark development available to a wider range of developers and speeds delivery of results for experienced developers.
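As a rough illustration of that programming model, the following PySpark sketch strings several transformations together and lets Spark build the DAG. The input path and the filter threshold are hypothetical, and the job only runs when an action such as take() is called.

```python
from pyspark import SparkContext

sc = SparkContext(appName="chained-word-count")

# Transformations are only recorded here; Spark assembles them into a DAG.
counts = (sc.textFile("/data/corpus.txt")            # hypothetical input path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word.lower(), 1))
            .reduceByKey(lambda a, b: a + b)
            .filter(lambda pair: pair[1] > 10)        # an extra step MapReduce could not chain
            .sortBy(lambda pair: -pair[1]))

print(counts.take(20))   # the action triggers execution of the whole DAG
```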
The other limitation of MapReduce that was addressed in the Spark core design was removing the requirement to read from and write to disk after each one- or two-step job (map, reduce, or map and reduce). Spark DAGs could execute an arbitrary number of transformations or actions in one job, as determined by the requirements of the application developer. The preferred data location during multi-stage job processing is memory. Spark will manage moving data between memory and disk when there is a need; however, some subsets of data that Spark cannot split between memory and disk may still cause out-of-memory errors.
Spark core defines an abstraction for managing and working with distributed data called a resilient distributed dataset (RDD). RDDs can be pinned in memory for efficient execution of multistep transformation pipelines, can recover from many types of node failures when distributed across multiple nodes in the cluster without requiring a complete restart of a job, and can spool or spill data to disk when necessary. Together, the more flexible job composition capabilities, DAG processing, and RDD functionality addressed many of the limitations of MapReduce discussed above.
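A minimal sketch of pinning an RDD, assuming a hypothetical CSV input: the persist() call below asks Spark to keep partitions in memory and spill to disk only when they do not fit, so later actions reuse the cached data rather than re-reading the source.

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="rdd-persist-sketch")

parsed = (sc.textFile("/data/ratings.csv")        # hypothetical input path
            .map(lambda line: line.split(",")))

# Keep partitions in memory, spilling to disk if they do not fit.
parsed.persist(StorageLevel.MEMORY_AND_DISK)

print(parsed.count())    # first action materializes and caches the RDD
print(parsed.first())    # later actions reuse the cached partitions
```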
You may encounter the term “rectangular data” when reading about data engineering and data science. Understanding the appeal of Spark starts with understanding the near-universal adoption of a common type of rectangular data structure in the data science community – data frames. A data frame is a 2-dimensional data structure with labeled columns. Each column is formatted for a single data type, such as real numbers, character strings, integers, or Boolean (logical) values. Every cell value in the column must match the data type for that column. These requirements place constraints on how a data scientist constructs a data frame, but they also ensure that data frames are compatible with many types of statistical and data manipulation libraries.
For Python developers, the data frame object and functions implemented in the Pandas library are used in virtually every data science project. Data scientists who work with R have data frames implemented as a data type in the base libraries. The Java open-source community has developed several data frame libraries with features applicable to data science, including data importing, filtering, aggregation, and column creation, but they vary considerably in ease of programming.
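As a small illustration (with made-up column names and values), a Pandas data frame enforces one data type per labeled column, which is what keeps it compatible with downstream statistical and data manipulation libraries.

```python
import pandas as pd

# A small pandas data frame: each labeled column holds a single data type.
df = pd.DataFrame({
    "name": ["alice", "bob", "carol"],   # character strings
    "age": [34, 28, 45],                 # integers
    "active": [True, False, True],       # Boolean values
})

print(df.dtypes)              # per-column data types
print(df[df["age"] > 30])     # filtering returns another data frame
```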
Data frames are by far the most important type of rectangular data for data science work, but there are other common types of rectangular data in the world, including relational database tables, comma-separated values (CSV) files, and spreadsheets, to name a few. Comparing data frames to spreadsheets can help conceptually, but in practice, data scientists generally avoid spreadsheets for data work in favor of a data frame library supported in their development environment. Most machine learning and some deep learning algorithms used for model training are designed to accept data frames as inputs. Even in those instances where the final step of a data pipeline does not accept a data frame, data scientists will often import, sort, filter, and augment the data using a data frame object and then convert to another data structure as late as possible in the pipeline.
The main limitation of data frames is, in most cases, memory. Many of the data frame implementations we discuss in this chapter are limited by the memory available on a single computer, and many are also restricted to a single processor core. While many laptops today have 16GB or more of memory and can, therefore, hold large data frames in memory, sorting, filtering, and selecting operations can become painfully slow with data frames over 1-2GB. Spark lets data scientists create distributed data frames that spread across the memory of many computers and use multiple cores of each machine in the Spark cluster, using the same types of operations that are used with Python, R, and Java data frame libraries and data types.
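The sketch below, with hypothetical data, shows the hand-off from a single-machine Pandas data frame to a Spark dataframe that is partitioned across the executors of a cluster while supporting familiar operations such as filtering.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distributed-df-sketch").getOrCreate()

# A single-machine pandas data frame with hypothetical contents.
pdf = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# The pandas frame lives in the memory of one machine; the Spark dataframe
# is partitioned across the executors of the cluster.
sdf = spark.createDataFrame(pdf)
sdf.filter(sdf.value > 15.0).show()
```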
Traditional databases that support asynchronous, fine-grained updates to shared state typically use centralized update logging and data checkpointing to provide fault tolerance. That approach has not scaled well for large data transformations. RDDs, on the other hand, are particularly well suited to batch processing applications that apply a set of sequential transformations to large datasets and still provide efficient fault tolerance.
Programmers can specify that the rows of an RDD be partitioned across multiple machines in a cluster based on a key in each record. Partitioning aids both performance and fault tolerance. One of the design goals for RDDs was that only the RDD partitions on a failed node would need to be recomputed, without having to roll back the whole program. Another goal was that the failed partitions could be recomputed in parallel on different nodes.
In order to achieve these goals, Spark needs to track only the definitions of the transformations (the DAG) that would need to be replayed to take an RDD partition from an initial state to the final state. This frees Spark from having to log the large amounts of data needed for fault tolerance when supporting fine-grained updates. Several properties of RDDs work together to make this job-level fault tolerance possible.
RDDs are one of the key Spark developments that isolate application programmers from the details of distributed computing. Programmers see RDDs as two-dimensional data structures with typed columns that can scale larger than the memory of any typical workstation or single server. From the Spark internal perspective, an RDD is a distributed memory abstraction with fault tolerance that allows programmers to perform in-memory computations using one or more nodes of a computer cluster. The in-memory nature of RDDs has significant performance benefits over MapReduce for iterative algorithms and interactive data science.
Spark dataframes can seem indistinguishable from Python and R data frames; however, there are very important differences. Spark dataframes are a higher-level abstraction of a two-dimensional row and column data structure that is implemented on top of Spark Core RDDs. The primary motivation for this additional layer of abstraction was to provide a more familiar experience to developers who use Structured Query Language (SQL) for data transformation and analysis. Providing a SQL interface to Spark data also allowed the Spark developers to implement an optimizer that interprets SQL code and emits an efficient physical plan for performing work on the cluster.
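A brief sketch of the dataframe API in Python, assuming a hypothetical CSV file and column names; the explain() call prints the physical plan that the optimizer produced for the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# Hypothetical CSV file with a header row, read into a distributed dataframe.
df = spark.read.csv("/data/events.csv", header=True, inferSchema=True)

result = (df.filter(F.col("status") == "active")
            .groupBy("country")
            .agg(F.count("*").alias("events")))

result.explain()   # show the physical plan produced by the optimizer
result.show(10)
```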
Spark dataframes are accessible from Scala, Java, Python, and R. This broad language support is possible, even though Python and R lack compile-time type safety, because dataframes do not impose type safety. This design choice for Spark dataframes is considered a significant downside, especially in large environments with lots of users. A considerable amount of cluster resources can be wasted when jobs fail due to analysis errors that are difficult to avoid for large datasets with lots of columns.
Spark datasets are a newer option that is available only to Scala and Java developers since the API is typed. Datasets share many of the benefits of RDDs, including typing and the use of looping with inline code expressions (lambda functions), plus the performance gains from the Spark SQL optimizer.
Many developers feel that Spark datasets will be the preferred path forward although they are still relatively new. Python and R developers do not have an option for using datasets today and so will continue to generate demand for new features in Spark dataframes. Conversion between RDDs, dataframes, and datasets is possible if your programming language supports the implementation but can be costly for large amounts of data.
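For completeness, here is a small Python sketch of converting between the two representations available from Python; the Row contents are hypothetical, and the round trip carries a real cost for large amounts of data.

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("conversion-sketch").getOrCreate()

# DataFrame -> RDD: each row is exposed as a Row object in the underlying RDD.
df = spark.createDataFrame([Row(name="a", value=1), Row(name="b", value=2)])
rdd = df.rdd

# RDD -> DataFrame: supply Row objects (or an explicit schema) so Spark can
# name and type the columns again.
df2 = spark.createDataFrame(rdd.map(lambda r: Row(name=r.name, value=r.value * 10)))
df2.show()
```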
The introduction of the Spark SQL library made Spark distributed computing accessible to the very large population of developers that work with Structured Query Language (SQL). Spark SQL allows developers to write common SQL expressions that get passed through an optimizer that produces an “execution plan” to perform the work on the cluster and return the results as either a Spark dataframe or dataset.
Since Spark SQL generates the low-level instructions to access the underlying core RDDs, all Spark SQL jobs have the advantages of fault tolerance and scale-out partitioned performance of RDDs. Spark SQL can also be used to directly query other data sources including Hive, Avro, Parquet, JSON, and JDBC including joins across multiple sources.
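As a sketch of that workflow, assuming hypothetical Parquet and JSON files and column names, the query below joins two sources with plain SQL and returns the result as a Spark dataframe.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

# Hypothetical source files; Spark SQL can read Parquet and JSON directly.
orders = spark.read.parquet("/data/orders.parquet")
customers = spark.read.json("/data/customers.json")

orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

# A join across the two sources, expressed in ordinary SQL.
summary = spark.sql("""
    SELECT c.region, SUM(o.amount) AS total_amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
""")
summary.show()
```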
Our use case described later in this eBook uses Spark SQL for all data access and transformations.
Spark was designed to improve the performance of tasks that require or benefit from iterative computation and parallelism. Many machine learning algorithms use iteration with an optimization technique like stochastic gradient descent. MLlib contains many algorithms for regression, classification, clustering, topic modeling, and frequent itemset analysis that leverage iteration and can yield better results than single-pass approximations. There is also a wide range of decision tree models, including XGBoost, an extreme gradient boosting implementation that we use later in our lab work.
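The following is a minimal MLlib sketch in Python with made-up feature columns and values: the features are assembled into a vector column and a logistic regression model is trained iteratively on the resulting dataframe.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Hypothetical training data: two numeric features and a binary label.
train = spark.createDataFrame(
    [(0.0, 1.2, 0), (1.5, 0.3, 1), (2.1, 2.2, 1), (0.2, 0.1, 0)],
    ["f1", "f2", "label"],
)

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=20, featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(train)
model.transform(train).select("label", "prediction").show()
```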
Our use case described later in this eBook uses MLlib for machine learning.
Developers working with Spark Streaming access a high-level abstraction called discretized streams (DStreams) to represent a continuous stream of data. The physical implementation of a DStream in Spark is based on a series of RDDs that hold a “window” of data. Therefore, operations on DStreams can be distributed across the partitions of the RDDs, leveraging the parallel execution engine of Spark core to achieve both high performance and fault tolerance.
Spark Streaming has built-in support for file systems and socket connections. Advanced sources like Kafka and Flume require interfacing with external non-Spark libraries with complex dependencies.
There are Scala, Java and Python APIs for Spark Streaming. The R community has developed several packages for working with Spark Streaming available on CRAN.
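A minimal Python sketch of the DStream API, assuming a hypothetical socket source on localhost:9999; each batch interval produces an RDD, and the word counts are computed on its partitions in parallel.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-word-count")
ssc = StreamingContext(sc, 5)                      # 5-second batch interval; each batch becomes an RDD

lines = ssc.socketTextStream("localhost", 9999)    # built-in socket source (hypothetical host/port)
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
word_counts.pprint()                               # print a sample of each batch's counts

ssc.start()
ssc.awaitTermination()
```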
We do not use Spark Streaming in the use case described later in this eBook.
We have shown how all the Spark libraries discussed above leverage the Spark Core distributed memory abstraction referred to as an RDD and the Spark parallel execution engine. Since graphs in computer science are an abstract data structure, they present an opportunity to leverage the Spark Core engine for another form of specialized data analysis.
Graphs are defined by a set of vertices (nodes) and the corresponding edges (arcs). GraphX uses two RDDs as the physical data structures: one defining the vertices and one defining the edges. The abstract graph object, along with any defined properties of the vertices and edges, can be read and modified using the underlying RDDs.
GraphX includes a set of fundamental graph operators (subgraph, joinVertices, numVertices, …) as well as a variant of the Pregel API for implementing graph-parallel iterative algorithms such as PageRank. GraphX also comes with a variety of graph algorithms, including PageRank, connected components, strongly connected components, and triangle count, many of which were contributed by the Spark community.
We do not use GraphX in the use case described later in this eBook.
The availability of so many language choices to Spark developers has “sparked” much debate over which is the best language. The answer to that question is nuanced. It is safe to say that larger organizations with substantial investments in Spark are more likely to have developers using multiple languages. Larger organizations are also more likely to have standards and a published usage policy, especially for production clusters or shared development environments hosting time-critical projects. In this section, we try to highlight a few of the differences that can be important and let our readers decide on the best policy for their organization and environment type.
There are two fundamental ways that developers can interact with Spark. The safest way is writing code to automate Spark operations through a controller application. An example would be reading data from HDFS, using Spark SQL to reshape the data and then calling a model training function on the transformed dataframe using MLlib. This example is part of the workflow that we implement in our test use case using Python.
Even with the restrictions above there is still a reason to be concerned about the impact of choosing Python or R for these scenarios. Since Python and R cannot enforce data type consistency between code and data structures in a compile phase, the corresponding Spark language APIs only support untyped dataframes. Scala and Java developers can choose between Spark untyped dataframes and typed datasets. The advantage of using a compiled language and a typed dataset means that developers can be notified of errors such as referencing a non-existent column of a dataset before the code is submitted to the cluster. Jobs that are submitted to the cluster with errors may consume a significant amount of resources before encountering a bug and halting. Experienced developers will write and test code in non-production environments to avoid wasting precious resources, but things still can and do go wrong sometimes.
The other way of writing code for Spark is to supply a code block that Spark runs as part of a job. An example is a User Defined Function (UDF). Experienced Spark programmers try to limit their use of UDFs whenever possible, but many forms of input data transformation and business logic need the flexibility of procedural code applied to an entire data set to accomplish a critical task. This code is executed by each worker against the partitions of data assigned to it by the driver. UDFs written in Scala and Java are typically a full order of magnitude (10x) faster than standard Python UDFs. There is very little documentation available for UDFs written in R, so proceed with caution if you are exploring that approach.
Understanding this difference in performance for UDFs written in different languages begins with the initial design choice to develop Spark in Scala, compiled to run on the Java Virtual Machine (JVM). UDFs written in Scala or Java and compiled for the JVM can communicate with the workers with very little overhead. Python and R code must run outside the JVM in another process since they cannot be compiled into Java bytecode. Data and code passed between the worker running on the JVM and the outside process must be converted to a common protocol, which requires serialization on one end and deserialization on the other. This additional processing overhead is costly and becomes more significant as the size of the data being processed increases.
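To illustrate, here is a hedged Python sketch with hypothetical data: the first UDF pays the row-by-row serialization cost described above, while the Arrow-based pandas UDF (available in recent Spark versions) moves data in columnar batches and usually reduces that overhead.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# A plain Python UDF: each row is serialized to a separate Python process and back.
upper_udf = F.udf(lambda s: s.upper(), StringType())
df.withColumn("name_upper", upper_udf("name")).show()

# An Arrow-based pandas UDF operates on columnar batches, reducing the overhead.
@F.pandas_udf(StringType())
def upper_pandas(names: pd.Series) -> pd.Series:
    return names.str.upper()

df.withColumn("name_upper", upper_pandas("name")).show()
```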
For both of the reasons discussed above, Scala and Java have advantages over Python and R. So how do developers rate Scala vs. Java? The overwhelming consensus is that Scala code for Spark is more compact and easier to both write and maintain than Java. Some of the criticisms of Java code complexity for Spark development were addressed with the release of Java 8, but even the latest articles on Spark development strongly favor Scala over Java.
Despite these facts, the use of R and especially Python for Spark seems to be growing rapidly. Some of the very large organizations that often make presentations at conferences and publish technical papers on Spark are multi-language environments. For some organizations, any loss of efficiency on a Spark cluster resulting from the use of Python or R might be acceptable in exchange for productivity gains of using those tools. There are also new libraries and packages being developed by the Python and R communities to smooth the rough interfaces between these interpreted languages and Spark.
New developments from the Spark community are also reducing or eliminating the need to use Spark's low-level data structures, transformations, and actions. Many analytics professionals may be able to achieve near performance parity with Scala in applications written completely in Python or R. The future capabilities of the higher-level Spark libraries like Spark SQL, Spark Streaming, MLlib, and GraphX, combined with more sophisticated optimizers and perhaps code generators, may someday eliminate most of the need for programming in Scala or Java directly with RDDs and low-level operators.
The foundational philosophy of the Hadoop project was to develop data processing tools that leveraged the write-once-read-many pattern. The Spark project including Core, SQL, GraphX and even Streaming all rely on this pattern for their speed and scale. If data in the source system(s) change, it is better to bulk delete and reload portions of the data that Spark draws from rather than attempting to use RDDs or higher-level abstractions to control fine-grained updates.
We discussed above some of the changes that were made to traditional RDBMSs to improve their ability to meet the challenges of analytics on the rapidly growing data volumes that came to be referred to as “big data”. The scalable multi-user transactional semantics that those systems excel at are just as limiting to big data analytics as the write-once-read-many optimizations in Spark are to transaction processing (TP). If your application needs strong transactional consistency support, use a tool that is designed for that. If the volume of historical data that the application maintains can be efficiently analyzed in the same TP system, then you may not need big data tools at all. If your application accumulates more data than can be efficiently analyzed on the TP system, or you need to consolidate data from multiple sources for analysis, then you most likely need big data tools like Spark.
The terminology used for data streaming and messaging applications is not well defined, and therefore, not used consistently. In this discussion, we will both define our use of the terms and explain why we think treating these technologies separately is important.
When the stream is made up of relatively small units of information packaged as a complex data type, it is likely that the application receiving this data will want to inspect each package as it arrives. The “listener” application will probably evaluate some logic using data in the package and then make one or more decisions regarding next steps. The application could transform or extract a subset of the original package. All the results of the inspection and logic could get routed to another microservice, or the stream may be split into two or more new streams based on the application logic. These are not good use cases for Spark Streaming.
The best use case for Spark Streaming is to buffer incoming streams and fill the partitions of an RDD. Programmers then have a single interface to all the data in the RDD for performing analysis over a time window of any length; Spark RDDs are especially useful when operating on multiple partitions in parallel.
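As a closing sketch, assuming the same hypothetical socket source as earlier, the windowed word count below buffers one minute of incoming data across RDD partitions and recomputes the counts every twenty seconds.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="windowed-counts")
ssc = StreamingContext(sc, 10)                     # 10-second micro-batches
ssc.checkpoint("/tmp/streaming-checkpoint")        # required for windowed state

lines = ssc.socketTextStream("localhost", 9999)    # hypothetical source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b,   # add counts from new batches
                                     lambda a, b: a - b,   # subtract batches leaving the window
                                     windowDuration=60,
                                     slideDuration=20))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```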