The terms data analytics, data analysis, and data analyst have been in use for many decades and therefore carry very broad definitions that not everyone will agree on. For the purpose of this eBook, we use data analytics to mean the process of extracting information from raw data, whether visually, via summaries, or using a data modeling process like machine learning. This chapter describes some of the challenges and methods involved in processing large sources of raw data (big data) through inspection, cleansing, transforming, modeling, and publishing results for external consumption by other processes. Every one of these components of data analysis has been written about extensively, but we believe additional value can be added by discussing methodologies for developing complete data processing pipelines in the context of interesting use cases. As the title reveals, we will use Apache Spark as the primary technology for implementing an end-to-end pipeline for one use case in later chapters, but in this chapter we describe some important aspects of data analytics and working with data in general terms. This background will help us explain what Spark can and cannot do during the use case discussion. We will not cover topics in the areas of decision science, information theory, organizational dynamics, or business metrics/KPIs related to incorporating the information derived from data analytics into business decision making.
Ever since humans started collecting data, beginning with scratch marks on cave walls, other humans have tried to develop tools to make the task easier. Our experience has shown that the collection, storage, and analysis of data, along with the communication of insights to others, can all be improved with the right tools. This work has turned into a huge industry. The field of statistics was born from the need for better ways to characterize and summarize data than could be accomplished with other types of applied mathematics. Later, machine learning and deep learning were developed as the need emerged for tools to analyze larger datasets with more sophisticated relationships. Innovation in the development of computing hardware, first analog and now digital, has also been intimately connected to the drive for more capable data analysis tools.
The volume of data available in the age of digital transformation has created some of the most challenging problems that IT professionals face today. At the same time, we have been so successful in the race to produce better and better data analytics tools capable of working with these large volumes of data that we have created another problem: an almost overwhelming set of software and hardware choices that is difficult to evaluate and choose from. How does a data analytics team composed of IT professionals, data engineers, and data scientists agree on a combination of hardware and software that meets everyone's needs for ease of use, scalability, and future resiliency?
The needs of data scientists today cover an extremely broad spectrum of techniques and tools that often require specialized skills and may produce many different types of outcomes. IT needs to be flexible but also cost-conscious and therefore can’t implement everything that might be in demand. In a field with this much diversity, organizations will want to adopt one or more classification schemes to organize the discussion of use cases and techniques that can aid in decision making. There are many relevant dimensions and therefore many parallel and overlapping grouping schemes exist. The table below shows just a few of the more commonly discussed classifications of types of analytics and data.
4 Types of Analytics: descriptive, diagnostic, predictive, prescriptive
3 Types of Machine Learning: supervised, unsupervised, reinforcement
3 Types of Data: structured, semi-structured, unstructured
2 Types of Analytics: batch, streaming (real-time)
In the following sections of this chapter, we will discuss some less common and slightly more technical ways to classify data along with the implications for the types of specialized analytics that are required to do advanced analytics with complex data.
The ability to speak with confidence about the validity of data analytics results begins with building trust in the source data. Virtually every entry-level course in statistics and data science begins with a section on data visualization focused on using interactive visual techniques to learn about the interesting aspects of a data set prior to modeling. You can think about interactive analytics the same way you might approach preparing a vehicle for a long trip. A walk around the vehicle might prompt you to look more closely at a small crack in the windshield or to check the air pressure in a tire that looks low. The process follows a trajectory of asking a question, looking at the result, and evaluating what new questions that answer suggests.
The term interactive analytics usually refers to an interaction between a human and a dataset that is too large and/or complex to understand using simple descriptive statistics. Interactive analytics is also often referred to as interactive visualization reflecting the reality that it almost always involves producing and evaluating graphical summaries of the data. The goal is to use a combination of “best practices” and ad hoc approaches together in an iterative process of getting to “know” your data where each result typically generates more questions than answers in the beginning.
A researcher beginning to work with a new dataset that includes physical characteristics of people may want to know the proportion of people with blonde hair and the proportion of those with blue eyes in the sample. If these results seem reasonable, she may then compute the proportion of people with both blonde hair and blue eyes to uncover more information that will confirm that the sample is representative of the population in general.
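To make this concrete, here is a minimal sketch of those first queries, assuming a hypothetical PySpark DataFrame named people with hair_color and eye_color columns (the sample rows are illustrative):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("interactive-eda").getOrCreate()

# Hypothetical sample of physical characteristics.
people = spark.createDataFrame(
    [("blonde", "blue"), ("brown", "brown"), ("blonde", "green"),
     ("black", "blue"), ("blonde", "blue")],
    ["hair_color", "eye_color"],
)

total = people.count()

# Proportion with blonde hair, with blue eyes, and with both together.
p_blonde = people.filter(F.col("hair_color") == "blonde").count() / total
p_blue = people.filter(F.col("eye_color") == "blue").count() / total
p_both = people.filter(
    (F.col("hair_color") == "blonde") & (F.col("eye_color") == "blue")
).count() / total

print(p_blonde, p_blue, p_both)
```

Each of these answers would typically prompt the next query, which is exactly the iterative cycle described below.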
The order and types of questions explored during an interactive analysis session mainly derive from the results of previous questions. The data scientist starts by exploring the aspects of the data about which they already have some expectations, looking for signs of observable relationships that confirm those expectations. The output from these queries, together with domain knowledge, prompts the data scientist to look at other summaries or details in the data. The cycle continues, and the data scientist either builds trust in the quality of the data or begins to form one or more hypotheses about why the data may not be of good quality, e.g., the data are too problematic to be trusted for model building.
The efficiency of interactive analytics depends on many factors including:
The data analytics industry is very concerned about the amount of time analysts and data scientists spend preparing/munging data prior to applying algorithms and training models. No one likes to write code for processing dirty data, or to document decisions about what data was kept, discarded, or modified just to get a model to train. However, time spent "getting to know the data" should not also be targeted as an unproductive use of time to be eliminated. Especially in an era where data science generalists develop specialties in fields like language processing or computer vision and travel between subject domains, there needs to be time budgeted for exploratory data analysis that relies heavily on tools supporting human interaction with the data.
Data analytics is about extracting information from data. Data generated in the last 5 minutes will contain some information, but data collected over the last day will typically contain richer information. If you want to study the December buying patterns of retail consumers, you will need several seasons of data, or more, to assess whether any trends are discoverable. For these types of research questions you will need to store data and perform batch data analytics; therefore, we do not believe that batch analytics is outdated and soon to be replaced by a sole focus on "real-time" analytics. We need to continue developing improvements for all types of analytics, including batch analytics.
Machine learning and deep learning model training has historically been batch oriented. If you constantly added new data to the training set during a training session, it would be impossible to tell what went wrong if the model failed to converge, and the results would not be reproducible. Once a model has been trained and validated, adding near-real-time observations to the training set is feasible for some use cases.
When working with batches of historical data, one of the most difficult decisions is choosing the best time frame for a training set. Many relationships that we want to study change over time, especially when humans are a factor. For example, the factors that lead consumers to purchase electric vehicles have changed over the last 50 years as more convenient options have been developed and attitudes regarding energy use and the environment have shifted. Training a model on 25 or 30 years of historical purchase and demographic data might be biased by old relationships in the data. But should the researcher use just the last 2-3 years of data, or perhaps 10? Exploring the options for the best sample of data, the type of model, and the specific hyperparameters is largely an intuitive and time-consuming process. More research into improvements in batch analytics is underway, but it is certainly not getting the same level of attention as streaming analytics.
Although most people would have a very difficult time defining what a time series is, time series are relatively easy to identify once you see an example. A simple definition of a time series is:
a collection of data points, each associated one-to-one with a discrete point in time, where the intervals between time points are most commonly equal in length.
While even this simple formal definition of a time series may seem overly complex, the examples below are easier to understand and explain than the formal definition. Here is a series of outdoor air temperature readings in degrees Fahrenheit taken every 5 minutes:
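A minimal sketch in pandas, using hypothetical readings sampled on a fixed 5-minute index (the specific values are illustrative, not real measurements):

```python
import pandas as pd

# Hypothetical outdoor temperature readings (degrees F), one every 5 minutes.
times = pd.date_range("2023-07-01 12:00", periods=6, freq="5min")
temps = pd.Series([71.3, 71.5, 71.9, 72.1, 72.4, 72.2], index=times)
print(temps)
```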
In this example, the time interval between observations is equal in length (5 minutes) and we have one data point (a number), a temperature, corresponding to each time interval. This is consistent with the definition of a time series above. What the definition does not include is how the single measurement was calculated for the 5-minute interval: it could be the maximum, minimum, average, beginning, or ending value. Details like this should be captured in the metadata (data about data) for the time series and stored with the source data, since they may be important to the design of our analysis and/or the interpretation of the results.
If you have some familiarity with machine learning and/or data models you probably think of data as either features (inputs or explanatory variables) or outcomes (a number or probability that you want to forecast). In modeling a time series, we have one feature, time, and the outcome is the data associated with each time interval. In the example above the outcome is outdoor air temperature and the explanatory feature is time.
Time series analytics research began long before computers were invented, and most of it has been conducted by statisticians. More recently, machine learning tools, including regression trees and fully connected shallow neural networks, have been applied to time-series data, especially when other features are available to capture related influences such as geography or demographics.
While most time series involve numeric data types, we can also have a time series of discrete data represented as character strings or a discrete set of integers. There are entire textbooks (for example, An Introduction to Discrete-Valued Time Series) devoted just to the analysis of time series of discrete data values. It pays to keep both types of time series in mind when exploring a new dataset and to be aware that the techniques for analysis are vastly different. Below is an example of data tracking the status of a traffic signal every minute for a period of 10 minutes.
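A sketch of what such a discrete-valued series might look like, again with hypothetical values:

```python
import pandas as pd

# Hypothetical traffic signal status sampled once per minute for 10 minutes.
times = pd.date_range("2023-07-01 12:00", periods=10, freq="1min")
status = pd.Series(
    ["GREEN", "GREEN", "YELLOW", "RED", "RED", "RED",
     "GREEN", "GREEN", "GREEN", "YELLOW"],
    index=times,
    dtype="category",  # a fixed set of possible values, not a continuous range
)
print(status)
```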
We used a set of unique character strings (RED, YELLOW, GREEN), but we could have also used integers (1, 2, 3). The difference between this example and the one above is that temperature can take any value within a continuous range, with almost limitless possibilities depending on how precisely we can measure, whereas a time series of discrete values has a fixed number of possible values. The traffic light example above can produce exactly 1 of 3 possible values for any single measurement.
Time series data may be the largest source of raw data generated in the world today. It is impossible to say how much storage would be required to collect all the time-series data available in the world, since only a small fraction of it is ever accessed and/or analyzed. The potential to capture and analyze time-series data for a multitude of use cases is an area every data scientist should be considering. The laptop or other computer you are using to read this document is generating thousands of data points every second related to performance characteristics, environmental factors, and security. Every imaginable disk drive, network, graphics card, and central processing unit (CPU) metric is being monitored by the system and made available to system administrators for viewing and analysis if they choose to do so. Most systems are configured to write a small amount of historical data to the local drive that constantly gets overwritten by new data, keeping the drive from filling up with diagnostic data that would eventually halt operations. This is just one example that anyone with a computer can explore for themselves. There are billions of devices that are both capable of generating operational time series data and have network connections, but the vast majority are either not connected to a network or are streaming only a very small subset of the data series they can produce.
Virtually every mid-sized and larger commercial building in the world has a building energy management system that monitors and stores detailed temperature, humidity, airflow, lighting status, and energy consumption data for all major environmental components at regular time intervals varying from minutes to seconds. Every modern piece of industrial equipment, including pumps, fans, boilers, furnaces, conveyor belts, and robotics to name just a few, can produce thousands of data points every second and make them accessible over a remote network data connection, just as we discussed above for computers. Imagine the amount of equipment, processes, and data that could be accessed from a hospital network, a transportation system, a smart home, or a smart city. And this list only scratches the surface of the total sources of time series data available in the digital world. Many of the data sources listed above are grouped together under the Internet of Things (IoT) umbrella, but there are many other scientific and engineering applications whose data is simply called operational data or research data.
Everyone can probably relate to the frustration experienced when your computer "crashes". In information technology support circles this is referred to as an "event". Other examples of events that might be familiar to most people: your airplane is late leaving the gate, you have an automobile accident, or perhaps you win the lottery. If one of the primary characteristics of a time series is the regularity of the time between data samples, events are the opposite: as you can see from the examples above, the time interval between events is typically irregular and hard to determine in advance. In fact, the challenge of predicting when the next event of interest will occur is what makes data analytics with event data so challenging and interesting.
Most of the event processing pipelines you will encounter start with descriptive analytics, e.g., counts, sums, and simple statistics like average, minimum, and maximum values summarized over regular time intervals and possibly grouped by related features like geography or some population characteristic. This type of "pre-processing" can turn event data with irregular intervals into time series of numbers with equal intervals that are easier to visualize and examine for trends. Predictive analytics for event data involves modeling and is therefore probabilistic. A common type of research question is "what is the probability that event A will occur in population B during time interval T?" For example, what is the probability of a computer crash for either Windows or Linux users in the next 6 months? Or, what is the probability of a person aged 40-45 being diagnosed with skin cancer in Florida compared to New York? The list of possible research questions related to event data modeling is limitless.
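A sketch of that pre-processing step in PySpark, rolling irregular events up into an equal-interval count series with a time-window aggregation; the event log and its region column are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("event-rollup").getOrCreate()

# Hypothetical event log: (timestamp, region), with irregular intervals.
events = spark.createDataFrame(
    [("2023-07-01 00:05:00", "FL"), ("2023-07-01 00:47:00", "FL"),
     ("2023-07-01 01:02:00", "NY"), ("2023-07-01 03:15:00", "NY")],
    ["ts", "region"],
).withColumn("ts", F.to_timestamp("ts"))

# Count events in fixed 1-hour windows, grouped by a population
# characteristic (here, region): irregular events become a regular series.
hourly_counts = (
    events.groupBy(F.window("ts", "1 hour"), "region")
          .agg(F.count("*").alias("event_count"))
)
hourly_counts.show(truncate=False)
```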
You can also frame event data analytics in terms of numerical forecasting or regression. There is obviously a relationship between event counts, probability of occurrence, population size, and time intervals. Both machine learning and deep learning can be used for virtually any type of use case and research question formulation; however, special care needs to be exercised with event data when the set of possible outcomes is extremely imbalanced. Credit card fraud seems pervasive these days, but the ratio of non-fraudulent to fraudulent transactions is still very high given how often people use credit and debit cards. Techniques such as "oversampling" can help improve the understanding of the relationships between the input features and "rare" or imbalanced outcomes, but they require advanced training to select the right techniques and to interpret the results.
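A minimal sketch of one such rebalancing approach, naive random oversampling of the minority class in PySpark; the transactions data and its is_fraud column are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rebalance").getOrCreate()

# Hypothetical transactions: (amount, is_fraud) with a heavy class imbalance.
transactions = spark.createDataFrame(
    [(12.40, False), (8.15, False), (99.99, False),
     (54.30, False), (310.00, True)],
    ["amount", "is_fraud"],
)

majority = transactions.filter(~F.col("is_fraud"))
minority = transactions.filter(F.col("is_fraud"))

# Naive random oversampling: replicate minority rows (with replacement)
# until the classes are roughly balanced. Crude, but illustrates the idea.
ratio = majority.count() / minority.count()
balanced = majority.unionByName(
    minority.sample(withReplacement=True, fraction=ratio, seed=42)
)
```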
Another interesting modeling opportunity with events is in the area of prescriptive analytics: answering questions about what actions an actor can take to avoid or trigger an event occurrence. What factors would encourage a competitor's customer to leave their existing mobile phone service contract and switch to another service? What actions would extend the expected duration before the next failure event for a piece of equipment? Answers to these questions based on data modeling are also described in terms of uncertainty and probabilities that decision-makers need to be comfortable evaluating.
The data associated with an event may be as simple as a single number (0, 1) or string of text ("it happened again") but could also be a complex multi-part data object. Computer crashes often generate metadata describing the time, user name, and versions of various software and drivers, as well as files storing part or all of the data that was in memory when the event was detected. Part of the challenge when working with event data is anticipating how much data each event will generate and what is needed for analysis. A philosophy of "keep everything" provides future flexibility but may limit how much history can be affordably retained.
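A hypothetical sketch of what such a multi-part crash-event record might contain, with simple scalar fields plus a pointer to a much larger artifact:

```python
import json

# Hypothetical crash event; every field name and value is illustrative.
crash_event = {
    "timestamp": "2023-07-01T12:34:56Z",
    "user": "jdoe",
    "os_version": "10.0.19045",
    "driver_versions": {"gpu": "531.79", "network": "22.10.1"},
    "memory_dump_path": "/var/crash/20230701-123456.dmp",  # large artifact
}
print(json.dumps(crash_event, indent=2))
```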
We most often think of streaming data as a near-continuous set of values coming from a single source; however, we can also have "event streams" from many sources, each of which has irregular and potentially long intervals between values but which together appear at the target as a steady high-volume stream. The most salient characteristics of a streaming data analytics application are that data arrives in units with very low latency between items and is processed as soon as possible, often while still in the memory of the receiving system and before being persisted to disk. Very high-velocity data may be temporarily stored in a local queue and then sent for processing in "mini-batches" to cut down on communication overhead, but this would still be considered streaming data analysis so long as the latency between generation and processing remains low.
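As a sketch of the mini-batch pattern, Spark Structured Streaming processes a stream as a sequence of small batches; the Kafka topic and broker below are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-minibatch").getOrCreate()

# Read a high-velocity stream from Kafka (hypothetical topic and broker).
readings = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "sensor-readings")
         .load()
)

# Process in mini-batches triggered every 5 seconds: low latency,
# with batching to cut down on per-record communication overhead.
query = (
    readings.writeStream
            .format("console")
            .trigger(processingTime="5 seconds")
            .start()
)
query.awaitTermination()
```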
Engineers have been interested in streaming data for centuries. In 1874, French engineers built a system of weather and snow-depth sensors on Mont Blanc that transmitted near real-time information to Paris, and so began the "modern" era of data streaming. Extreme weather conditions can come and go rapidly on and around Mont Blanc, and public safety concerns and research interests were important factors in the decision to undertake the expense of streaming data to a remote and safer monitoring site. More recently, the hype suggesting that real-time analytics is replacing batch analytics has grown almost as fast as the capability of streaming technology itself. System architects should not give in to over-engineering every application, taking on the cost of designing and building a streaming data solution without careful consideration of the expected value of lower data latency for the use case being considered.
Data streaming technology and streaming data analytics technology can either be packaged together in a single platform or chosen separately and integrated by a developer. The last 10 years of technology development in the “big data” and Hadoop ecosystems have produced a number of advances for designing streaming data applications. The open-source projects for Flink, Spark Streaming, and Kafka give system architects many options to choose from but increase the risk associated with developing on a platform that may not have enough momentum in such a crowded field.
Kafka began primarily as message queuing technology and later evolved into a platform with both data streaming and analytics capability. Even the core message queuing technology in Kafka can be configured with different message payload sizes, and the processes that read new messages from a topic can be scheduled across a wide range of processing latencies that still meet the requirements of many streaming applications. In some use cases we see Kafka satisfying both the data management and analytics requirements of a streaming analytics application, while in other use cases specialized analytics engines such as Druid, Hive, or even custom software are used with Kafka.
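A minimal sketch using the kafka-python client, showing two of the consumer knobs that trade per-message latency against throughput; the topic, broker, and handler are hypothetical:

```python
from kafka import KafkaConsumer

def process(payload):
    # Hypothetical downstream handler.
    print(payload)

# max_poll_records and fetch_max_bytes control how much data each poll
# returns: bigger fetches mean higher throughput but higher latency.
consumer = KafkaConsumer(
    "pump-telemetry",                   # hypothetical topic
    bootstrap_servers="broker:9092",    # hypothetical broker
    max_poll_records=500,
    fetch_max_bytes=50 * 1024 * 1024,
)

for message in consumer:
    process(message.value)
```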
The variety and complexity of use cases that can benefit from streaming data analytics are staggering. At the simplest end of the continuum would be a filter on a single numeric or string value. For example, if the temperature of this pump motor is more than 150 degrees Fahrenheit, sound an alarm. This type of low-complexity application can typically be programmed on or close to the piece of equipment and requires very little computing power or memory. Moving toward the complex end of the spectrum are use cases that require multiple adjacent data points from the stream, often referred to as windowing the data. Going back to our pump example: if the temperature rises in 3 consecutive readings from a normal of 100 degrees to 150 degrees, shut the pump off; however, if the motor temperature is oscillating up and down between 100 and 150 degrees over 10 or more values, log an incident report. Moving further up the complexity continuum would be an application that needs data from multiple streams available in a common window. The streams may come from sources that are physically far apart or on disconnected networks, so the data needs to be routed to a commonly accessible device, and the analytics programming becomes much more complex.
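A sketch of the single-stream pump rules in plain Python, keeping the most recent readings in a fixed-size window; the thresholds come from the example above, and the three action functions are hypothetical stand-ins for real control hooks:

```python
from collections import deque

# Hypothetical actions; a real system would call into control hardware.
def sound_alarm():    print("ALARM: temperature limit exceeded")
def shut_off_pump():  print("ACTION: pump shut off")
def log_incident():   print("LOG: oscillating temperature incident")

window = deque(maxlen=10)  # the 10 most recent readings

def on_reading(temp_f):
    window.append(temp_f)

    # Simplest case: a single reading over the limit -> alarm.
    if temp_f > 150:
        sound_alarm()

    # Windowed rule: 3 consecutive rising readings ending at/above 150 -> shut off.
    last3 = list(window)[-3:]
    if len(last3) == 3 and last3[0] < last3[1] < last3[2] and last3[2] >= 150:
        shut_off_pump()

    # Oscillation between 100 and 150 across 10 readings -> log an incident.
    readings = list(window)
    if len(readings) == 10 and all(100 <= t <= 150 for t in readings):
        swings = sum(1 for a, b in zip(readings, readings[1:]) if abs(a - b) >= 10)
        if swings >= 5:
            log_incident()

# Feed a few hypothetical readings.
for t in (100, 126, 151):
    on_reading(t)
```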
These are only a few of the many opportunities and issues that organizations encounter when working with streaming data and stream analytics. In general, the adage “push analytics to the edge” should be followed when possible. Any efficiency that can be realized by data filtering, aggregation, or conversion of continuous streams to a set of less frequent events is worth exploring. However, statements such as “all analytics are moving to the edge” are hyperbole and only add to the existing confusion caused by the rapid expansion of technology options for data analytics and data streaming.
Much has been written about the four (4) Vs of big data: Volume, Velocity, Variety, and Veracity. However, you could have a data pipeline that processes high volumes of frequently generated data consisting of numbers, text, audio, and images and still not have to deal with much complexity if all of that data comes from a single source with a consistent format. While the four Vs are all important characteristics of data, there is a complexity factor introduced when handling multiple sources of data that will affect the choice of tools and platforms for data analytics.
When information about a single object, like a physical facility, is assembled from multiple sources, there must be a key that uniquely matches a single entity in each of the sources. Managing keys, joining data based on keys, and handling key errors are just a few of the many significant challenges encountered in multi-source data analytics. We have been working on this problem for decades in the context of enterprise data warehousing, and still the number of difficult-to-join data sets may be growing. Application modernization and the transformation to agile methods for cloud-native applications cannot address the challenges of data integration unless the requirements are documented and implemented early in the design.
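A sketch of one such key-matching check in PySpark, using hypothetical facility datasets keyed on facility_id; an outer join keeps key errors visible instead of silently dropping them:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("multi-source-join").getOrCreate()

# Two hypothetical sources describing the same physical facilities.
energy = spark.createDataFrame(
    [("F001", 120.5), ("F002", 98.1), ("F004", 210.0)],
    ["facility_id", "kwh"],
)
occupancy = spark.createDataFrame(
    [("F001", 340), ("F002", 120), ("F003", 85)],
    ["facility_id", "headcount"],
)

# Full outer join keeps rows from both sides so mismatches stay visible.
joined = energy.join(occupancy, on="facility_id", how="full_outer")

# Count the key errors before trusting any downstream analysis.
unmatched = joined.filter(
    F.col("kwh").isNull() | F.col("headcount").isNull()
).count()
print(f"{unmatched} facility records failed to match across sources")
```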
The resource impacts of joining large datasets from multiple sources are also both well documented and challenging. How the data are keyed, sorted, and coded never seems to be optimal in the source systems, and the overall impact is multiplied across the many steps in an analytics pipeline that are devoted to data lookups, sorting, and writing out intermediate results. Platforms capable of performing these large-scale data integration jobs typically require significant manual tuning that relies on trial and error to find a good set of configuration parameters for any given job.
The realization that the time to business benefit from data analytics is heavily impacted by the effort spent on acquiring, transforming, persisting, exploring, and cleaning data is shifting the focus from tools for model development to tools for data management. The front end of the analytics pipeline is time-consuming and expensive.
The most frequently cited estimate is that data analytics professionals spend on average 80% of their time/resources on these non-model development activities. When you have requirements that include the need to integrate large data sets from multiple sources you will almost certainly encounter significant complexity and be at the top end of the percentage range spent prior to model development.
These effort estimates are fueling a growing misconception among non-practitioners that skills specialization and division of labor can lead to lower project costs and shorter time to value. The argument is that data engineers should manage the pre-processing of data and data scientists should just build models with the cleanly structured and sanitized data. Our experience shows that data modelers/data scientists need to be intimately connected to all stages of the data ingestion and transformation process. We believe that data science is a team sport, and there are many opportunities to involve people with different skills and levels of experience in the most labor-intensive aspects of a project; however, managers should not try to separate the people building models from the data pipeline that feeds the models in an effort to save money. Data science teams should be composed of data engineers, data analysts, data scientists, subject matter experts, and project management specialists working in lockstep at each stage of the analytics pipeline for every non-trivial project.
There is also some interesting development of tools that automate aspects of the data pre-processing pipeline. It is still very early days for seeing significant benefits from intelligent data joining, exception handling, and coding of input data, but the rate of investment in this category of tools is accelerating. While there are clear patterns of data processing that can benefit from automation, it is our opinion that when dealing with multiple large datasets the benefits of generic automation will remain marginal for some time to come.
When approaching the design of a solution for a new analytics use case, start with an understanding of what information could improve the business process or inform a decision-maker. Then realize that there is not now, nor is there ever likely to be, a single tool or technique that will meet all of those requirements. The other extreme, piecing together a broad collection of specialized open source and commercial software, will most likely lead to worse results than trying to force all problems into one tool. Most organizations have long weighed the fight against tool and data platform silo proliferation against the anticipated benefits of introducing "best of breed" data analytics technologies. Investments in tools that require proprietary data formats and hardware are especially limiting and difficult to break free from once implemented. Perhaps the best way to characterize the challenge is to ask: what is the minimum number of tools and platforms that will meet all the requirements of the organization and create the fewest islands and silos?
This chapter covered some of the types of data analytics and data you may encounter when analyzing use cases. There are, of course, specialized systems for doing all of the above types of analytics. In the remaining chapters, we will show how the Spark platform and related tools can be used to implement a complete analytics pipeline that covers many of the topics introduced above, based on the requirements of a specific use case. Our solutions engineering team built a lab environment to test interesting features of the Spark platform and toolset. We wrap up with some conclusions, lessons learned, and resources for continuing to learn about and use Spark.