Real-Time Streaming Solutions Beyond Data Ingestion
Wed, 16 Dec 2020 22:31:30 -0000
It has been all about data—data at rest, data in flight, IoT data, and so forth. Let's start with the traditional data processing approach and its relationship to modern database technologies. In that approach, a user's request creates a data entity modeled on the inquiry payload. Traditional databases and business applications have been the sole actors implementing such data models: they cooperate to process users' inquiries and persist the results in static data stores for later updates. Business continuity is measured by the degree of this activity among the business applications that consume data from these shared stores. With a low degree of activity, the business risks sitting idle while it waits to acquire more data.
This paradigm inherently risks missing a great opportunity to maintain a higher degree of business continuity. Closing the gap requires a shift away from the static data store. Today's massive ingestion requirements mandate processing models that continuously generate insight from "data in flight," mostly in real time. To avoid storage access bottlenecks, persisting interim computed results in a permanent data store must be kept to a minimum.
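The contrast with the static-store paradigm can be illustrated with a minimal sketch in plain Python (not tied to any Dell product): a running aggregate that yields fresh insight after every in-flight event while persisting only a tiny amount of interim state, instead of storing the full history and recomputing later.

```python
from typing import Iterator, Tuple

def running_mean(events: Iterator[float]) -> Iterator[Tuple[int, float]]:
    """Yield (count, mean) after each event; only two numbers are kept as state."""
    total, count = 0.0, 0
    for value in events:
        total += value
        count += 1
        yield count, total / count

# Insight is available after every event, without persisting the full history.
latest = None
for count, mean in running_mean(iter([4.0, 8.0, 6.0])):
    latest = (count, mean)
```

The same mean could be computed by landing all events in a store first, but the streaming formulation produces an answer continuously and keeps interim persistence minimal.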
This blog addresses these modern data processing models from a real-time streaming ingestion and processing perspective. In addition, it discusses Dell Technologies’ offerings of such models in detail.
Customers have the option of building their own solutions from open source projects to adopt real-time streaming analytics technologies. Mixing and matching such components to implement real-time data ingestion and processing infrastructure is cumbersome, and stabilizing that infrastructure in production requires a variety of costly skills. To simplify these implementations, Dell Technologies offers validated reference architectures that meet target KPIs for storage and compute capacity. The following sections provide high-level information about real-time data streaming and the popular platforms that implement these solutions. This blog focuses on two Ready Architecture solutions from Dell—Streaming Data Platform (formerly known as Nautilus) and a Real-Time Streaming reference architecture based on Confluent's Kafka ingestion platform—and provides a comparative analysis of the two.
Real-time data streaming
The topic of real-time data streaming goes far beyond ingesting data in real time. Many publications describe the compelling objectives behind a system that ingests millions of data events in real time. An article by Jay Kreps, one of the co-creators of open source Apache Kafka, provides a comprehensive, in-depth overview of ingesting real-time streaming data. This blog covers both the ingestion and the processing sides of real-time streaming analytics platforms.
Real-time streaming analytics platforms
A comprehensive end-to-end big data analytics platform demands must-have features that:
- Simplify the data ingestion layer
- Integrate seamlessly with other components in the big data ecosystem
- Provide programming model APIs for developing insight-analytics applications
- Provide plug-and-play hooks to expose the processed data to visualization and business intelligence layers
Over the years, demand for real-time ingestion has motivated the implementation of several streaming analytics engines, each with a unique target architecture. These engines provide capabilities ranging from micro-batching streamed data for near-real-time performance to true real-time processing. The ingested datatype may range from a byte-stream event to a complex event format. Examples of such ingestion engines are the Dell Technologies-supported Pravega and open source Apache Kafka, both of which integrate seamlessly with open source big data analytics engines such as Samza, Spark, Flink, and Storm, to name a few. Proprietary implementations of similar technologies are offered by a variety of vendors; a short list includes Striim, WSO2 Complex Event Processor, IBM Streams, SAP Event Stream Processor, and TIBCO Event Processing.
Real-time streaming analytics solutions: A Dell Technologies strategy
Dell Technologies offers customers two solutions for implementing their real-time streaming infrastructure. One is built on Apache Kafka as the ingestion layer with Kafka Stream Processing as the default streaming data processing engine. The second is built on open source Pravega as the ingestion layer with Flink as the default real-time streaming data processing engine. But how are these solutions used to meet customers' requirements? Let's review the integration patterns in which Dell Technologies real-time streaming offerings provide the data ingestion and partial preprocessing layers.
Real-time streaming and big data processing patterns
Customers implement real-time streaming in different ways to meet their specific requirements, which implies that there may be many ways of integrating a real-time streaming solution with the remaining components in the customer's infrastructure ecosystem. Figure 1 depicts a minimal big data integration pattern that customers may implement by mixing and matching a variety of existing streaming, storage, compute, and business analytics technologies.
Figure 1: A modern big data integration pattern for processing real-time ingested data
There are several options to implement the Stream Processors layer, including the following two offerings from Dell Technologies.
Dell EMC–Confluent Ready Architecture for Real-Time Data Streaming
The core component of this solution is Apache Kafka, which also delivers Kafka Stream Processing in the same package. Confluent provides and supports the Apache Kafka distribution along with Confluent Enterprise-Ready Platform with advanced capabilities to improve Kafka. Additional community and commercial platform features enable:
- Accelerated application development and connectivity
- Event transformations through stream processing
- Simplified enterprise operations at scale and adherence to stringent architectural requirements
Dell Technologies provides infrastructure for implementing stream processing deployment architectures using one of two Kafka distributions from Confluent—Standard Cluster Architecture or Large Cluster Architecture. Both cluster architectures may be implemented as either the streaming branch of a Lambda Architecture or as the single process flow engine in a Kappa Architecture. For a description of the difference between the two architectures, see this blog. For more details about the product, see Dell Real-Time Big Data Streaming Ready Architecture documentation.
- Standard Cluster Architecture: This architecture consists of two Dell EMC PowerEdge R640 servers to provide resources for Confluent’s Control Center, three R640 servers to host Kafka Brokers, and two R640 servers to provide compute and storage resources for Confluent’s higher-level KSQL APIs leveraging the Apache Kafka Stream Processing engine. The Kafka Broker nodes also host the Kafka Zookeeper and the Kafka Rebalancer applications. Figure 2 depicts the Standard Cluster Architecture.
Figure 2: Standard Dell Real-Time Streaming Big Data Cluster Architecture
- Large Cluster Architecture: This architecture consists of two PowerEdge R640 servers to provide resources for Confluent's Control Center, a configurable number of R640 servers to host Kafka Brokers for scalability, and a configurable number of R640 servers to provide compute and storage resources for Confluent's KSQL APIs on top of the Apache Kafka Stream Processing engine. The Kafka Broker nodes also host the Kafka Zookeeper and the Kafka Rebalancer applications. Figure 3 depicts the Large Cluster Architecture.
Figure 3: Large Scalable Dell Real-Time Streaming Big Data Cluster Architecture
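The Lambda/Kappa distinction mentioned above can be sketched in miniature: a Lambda Architecture's batch layer periodically recomputes a view from the full persisted history, while a Kappa Architecture maintains the same view incrementally from a single stream flow. The toy below is plain Python, illustrative only, not Confluent or Dell code.

```python
def batch_count(all_events):
    # Lambda batch layer: periodically recompute the view from persisted history.
    counts = {}
    for key in all_events:
        counts[key] = counts.get(key, 0) + 1
    return counts

def stream_count(state, event):
    # Kappa (or Lambda speed layer): update incremental state per event.
    state[event] = state.get(event, 0) + 1
    return state

history = ["click", "view", "click"]
lambda_view = batch_count(history)   # recomputed from stored data
kappa_view = {}
for event in history:                # single always-on stream flow
    kappa_view = stream_count(kappa_view, event)
```

Both paths converge on the same view; the trade-off is between the operational cost of maintaining two code paths (Lambda) and relying entirely on stream replay (Kappa).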
Dell EMC Streaming Data Platform (SDP)
SDP is an elastically scalable platform for ingesting, storing, and analyzing continuously streaming data in real time. The platform can concurrently process both real-time and collected historical data in the same application. The core components of SDP are open source Pravega for ingestion, Long Term Storage, Apache Flink for compute, open source Kubernetes, and a Dell Technologies proprietary software known as Management Platform. Figure 4 shows the SDP architecture and its software stack components.
Figure 4: Streaming Data Platform Architecture Overview
- Open source Pravega provides the ingestion and storage layers, implementing streams built from heterogeneous datatypes and storing them as appended "segments." The unstructured, structured, and semi-structured data may range from a few bytes emitted by IoT devices, to clickstreams generated as users surf websites, to business applications' intermediate transaction results, to complex events of virtually any size. SDP offers two options for Pravega's persistent Long Term Storage: Dell EMC Isilon and Dell EMC ECS S3. These storage options are mutually exclusive—that is, both cannot be used in the same SDP instance—and migrating from one to the other is not yet supported. For details on Pravega and its role in providing storage for SDP streams using Isilon or ECS S3, refer to this Pravega webinar.
- Apache Flink is SDP's default event processing engine. It consumes ingested streamed data from Pravega's storage layer and processes it in an instance of a previously implemented data pipeline application. The pipeline application invokes the Flink DataStream APIs and processes continuous, unbounded streams of data in real time. Alternative analytics engines, such as Apache Spark, are also available. To unify the APIs of multiple analytics engines and avoid writing multiple versions of the same data pipeline application, work is underway to add the Apache Beam APIs to SDP, allowing one data pipeline application to run on multiple underlying engines on demand.
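The idea behind adding Beam can be sketched abstractly: define the pipeline once and hand it to interchangeable executors. The toy "runners" below are stand-ins in plain Python, not Beam's actual API; they only illustrate why a single pipeline definition decouples business logic from the engine that runs it.

```python
# One pipeline definition: keep even values, square them.
def pipeline(stream):
    return (x * x for x in stream if x % 2 == 0)

# Stand-in "runner" 1: a streaming executor that emits results lazily,
# one event at a time, as an unbounded source would.
def run_streaming(source):
    for result in pipeline(iter(source)):
        yield result

# Stand-in "runner" 2: a batch executor that materializes all results.
def run_batch(source):
    return list(pipeline(source))
```

With a real Beam pipeline, `run_streaming` and `run_batch` would correspond to different runners (for example, Flink or Spark) selected at launch time, while `pipeline` stays unchanged.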
Comparative analysis: Dell EMC real-time streaming solutions
Both Dell EMC real-time streaming solutions address the same problem and ultimately provide the same solution for it. However, in addition to using different technology implementations, each tends to be a better fit for certain streaming workloads. The best starting point for selecting one over the other is an understanding of the exact demands of the target use case and workload.
In most situations, users already know what they want from a real-time ingestion solution: typically an open source option that is popular in the industry, and in most cases that means Kafka. Additional characteristics, such as the mechanisms for receiving, storing, and processing events, are secondary. Most of our customer conversations are about a reliable ingestion layer that can guarantee delivery of the customer's business events to the consuming applications. Further expectations focus on no loss of events, simple yet long-term storage capacity, and, in most cases, a well-defined integration method for implementing initial preprocessing tasks such as filtering, cleansing, and transformations like Extract, Transform, Load (ETL). The purpose of preprocessing is to offload non-business-logic work from the target analytics engine (Spark, Flink, Kafka Stream Processing), resulting in better overall end-to-end real-time performance.
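Such a preprocessing stage can be sketched as a simple generator pipeline. The field names and rules below are hypothetical, chosen only to illustrate filtering, cleansing, and ETL-style transformation happening before events reach the analytics engine:

```python
import json

def preprocess(raw_events):
    """Filter, cleanse, and transform raw events before the analytics engine sees them."""
    for raw in raw_events:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue                         # filtering: drop malformed payloads
        if event.get("value") is None:
            continue                         # cleansing: drop incomplete records
        yield {                              # transformation: normalize the schema
            "sensor": str(event.get("id", "unknown")),
            "value": float(event["value"]),
        }

raw = ['{"id": 7, "value": "21.5"}', 'not json', '{"id": 8}']
clean = list(preprocess(raw))                # only the valid, normalized event survives
```

In production this logic would run in the stream processing layer (for example, Kafka Stream Processing), so the downstream engine receives only well-formed events.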
Kafka and Pravega in a nutshell
Kafka is essentially a messaging vehicle that decouples the sender of an event from the application that processes it for business insight. By default, Kafka persists incoming data temporarily on local disk, while longer-term storage for the ingested data is implemented in what are known as Kafka Broker servers. When an event is received, it is broadcast to the interested applications, known as subscribers. An application may subscribe to more than one event-type group, also known as a topic. By default, Kafka stores and replicates the events of a topic in partitions configured across Kafka Brokers. Replicas of an event may be distributed among several Brokers to prevent data loss and guarantee recovery in case of failover. A Broker cluster may be constructed and configured on several Dell EMC PowerEdge R640 servers; to avoid storage and compute capacity limits, the cluster may be extended by adding more Brokers to the topology. This horizontal scalability is a defining characteristic of the Kafka architecture. The de facto analytics engine provided in the open source Kafka stack is Kafka Stream Processing. It is customary to use Kafka Stream Processing as a preprocessing engine and then route the results, as real-time streaming artifacts, to an analytics engine such as Flink or Spark Streaming that implements the actual business logic. Confluent wraps the Kafka Stream Processing implementation in an abstraction layer known as the KSQL APIs, which makes it extremely simple to process events with SQL-like statements instead of third-generation languages such as Java or C++ or scripting languages such as Python.
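The pub/sub mechanics described above (topics, keyed partitions, and broadcast to every subscriber) can be mimicked with a toy in-memory broker. This sketch illustrates the pattern only; it is not the Kafka API, and the class and method names are invented for the illustration.

```python
from collections import defaultdict

class ToyBroker:
    """A toy pub/sub broker: topics, keyed partitions, broadcast to subscribers."""
    def __init__(self, partitions=3):
        # Each partition maps topic -> list of stored events.
        self.partitions = [defaultdict(list) for _ in range(partitions)]
        self.subscribers = defaultdict(list)   # topic -> subscriber callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, key, event):
        # Events of a topic are spread across partitions by key, as in Kafka.
        partition = hash(key) % len(self.partitions)
        self.partitions[partition][topic].append(event)
        for callback in self.subscribers[topic]:  # broadcast to every subscriber
            callback(event)

broker = ToyBroker()
seen_a, seen_b = [], []
broker.subscribe("clicks", seen_a.append)
broker.subscribe("clicks", seen_b.append)
broker.publish("clicks", key="user-1", event={"page": "/home"})
```

Both subscribers receive the same event, which is the defining behavior of the pub/sub pattern; replication across Brokers for fault tolerance is a separate concern not modeled here.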
Unlike Kafka's messaging protocol and partition-based event persistence, Pravega implements a storage protocol and persists events immediately as appended streams. As time goes by and the events age, they become long-term data entities. Therefore, unlike Kafka, the Pravega architecture does not require separate long-term storage: the historical data remains available in the same store. In Dell's current SDP architecture, Pravega routes previously appended streams to Flink, which provides the data pipeline that implements the actual business logic. For scalability, Pravega uses Isilon or ECS S3 as extended and/or archival storage.
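Pravega's append-only stream-and-segment idea can likewise be sketched in miniature. The class below is a toy illustration, not Pravega's actual design; it only shows why fresh events and history can share one store when everything is an appended segment.

```python
class ToyStream:
    """A toy append-only stream: events persist as appended segments,
    so recent and historical data live in the same store."""
    def __init__(self, segment_size=2):
        self.segment_size = segment_size
        self.segments = [[]]

    def append(self, event):
        if len(self.segments[-1]) >= self.segment_size:
            self.segments.append([])         # seal the segment, open a new one
        self.segments[-1].append(event)

    def read(self, start=0):
        # A reader can replay from any offset: history needs no separate store.
        events = [e for seg in self.segments for e in seg]
        return events[start:]

stream = ToyStream()
for e in ["e1", "e2", "e3"]:
    stream.append(e)
```

A real deployment would tier sealed segments to Long Term Storage (Isilon or ECS S3 in SDP), but the reader-facing abstraction stays a single continuous stream.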
Although both SDP and Kafka act as a vehicle between the event sender and the event processor, they implement this transport differently. By design, Kafka implements the pub/sub pattern: it broadcasts each event to all interested applications at the same time. Pravega makes specific events available directly to a specific application by implementing a point-to-point pattern. Both Kafka and Pravega claim guaranteed delivery; however, the point-to-point approach supports a stricter underlying transport.
Dell Technologies offers two real-time streaming solutions, and it is not a simple task to promote one over the other. Ideally, every customer problem requires an initial analysis on the data source, data format, data size, expected data ingestion frequency, guaranteed delivery requirements, integration requirements, transactional rollback requirements (if applicable), storage requirements, transformation requirements, and data structural complexity. Aggregated results from such analysis may help us suggest a specific solution.
Dell works with customers to collect as much detailed information as possible about their streaming use cases. Kafka Stream Processing has an impressive feature: it can offload the transformation portion of the analytics from a pipeline engine such as Flink or Spark, which can be a great advantage. SDP, meanwhile, requires extra scripting outside the Flink configuration space to provide a logically equivalent capability. On the other hand, SDP simplifies storage through Pravega's native streams-per-segment architecture, while Kafka's core storage logic belongs to a messaging layer that requires a dedicated file system. Customers with IoT device data use cases are concerned with high ingestion rates (number of events per second). We soon expect to publish benchmarking results from a comparative analysis of ingestion rate performed on our SDP and Confluent Real-Time Streaming solutions.
I owe an enormous debt of gratitude to my colleagues Mike Pittaro and Mike King of Dell Technologies. They shared their valuable time to discuss the nuances of the text, guided me to clarify concepts, and made specific recommendations to deliver cohesive content.
Author: Amir Bahmanyari, Advisory Engineer, Dell Technologies Data-Centric Workload & Solutions. Amir joined the Dell Technologies Big Data Analytics team in late 2017. He works with Dell Technologies customers to build their Big Data solutions. Amir has a special interest in the field of Artificial Intelligence. He has been active in artificial and evolutionary intelligence work since the late 1980s, when he was a Ph.D. candidate at Wayne State University, Detroit, MI. Amir implemented multiple AI and computer vision solutions for motion detection and analysis. His special interest in biological and evolutionary intelligence algorithms led him to innovate a neuron model that mimics the data processing behavior in protein structures of cytoskeletal fibers. Prior to Dell, Amir worked for several startups in Silicon Valley and as a Big Data Analytics Platform Architect at Walmart Stores, Inc.
Related Blog Posts
AI-based Edge Analytics in the Service Provider Space
Fri, 15 Jan 2021 11:48:37 -0000
Advances in Service Provider performance management lag behind the growth in digital transformation. Consider, for example, Dynamic Spectrum Sharing (DSS) in 5G networks: operators need to rapidly map small-cell flows to available frequency bands in the presence of constraints like differing radio technologies and interference. Another example is the need to detect and/or predict infrastructure failures from KPIs, traces, profiles, and knowledge bases, to trigger a fix before an issue manifests itself. Yet another example is energy optimization in data centers, where servers are powered off to save energy and workloads are moved around the cluster without affecting end-to-end service. In all of these scenarios, and in numerous other use cases affecting industries such as factory automation, automotive, IIoT, smart cities, energy, healthcare, entertainment, and surveillance, AI on Big Data needs to replace legacy IT processes and tasks to trigger timely changes in the network substrate. The following figure illustrates how Big Data from the substrate can be consumed by fast-responding, interconnected AI models to act on service degradations. The traditional approach of DevOps reacting to irregularities visualized through Network Operations Center (NOC) terminals does not scale. Gartner and IDC both predict that by 2024 more than 60 percent of Mobile Network Operators (MNOs) will adopt AI-based analytics in their IT operations.
Figure 1. Decision and Controls with Models
Data streams originating in the substrate and gathered in the collection gateway may be compressed, and there may be gaps in data collection that need interpolation. Not all collected data types have an equal impact on decision-making, which makes feature filtering important. These issues justify the need for multi-stage pre-processing. Similarly, rapid decision-making can be achieved through multi-stage interconnected models using deep-learning technology: instead of one huge, complex model, experts agree that simpler interconnected models lead to a more reusable design. The following figure illustrates the decision-making process. It shows a sample interconnected model graph that detects anomalies, identifies root causes, and decides on a control sequence to recommend remediation measures.
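Two of the pre-processing stages mentioned above, gap interpolation and feature filtering, can be sketched in plain Python. The metric names and variance threshold are illustrative assumptions, not part of the described platform:

```python
def interpolate_gaps(series):
    """Linearly fill None gaps in a collected metric series."""
    filled = list(series)
    for i, v in enumerate(filled):
        if v is None:
            prev_i = max(j for j in range(i) if filled[j] is not None)
            next_i = min(j for j in range(i + 1, len(filled)) if filled[j] is not None)
            frac = (i - prev_i) / (next_i - prev_i)
            filled[i] = filled[prev_i] + frac * (filled[next_i] - filled[prev_i])
    return filled

def filter_features(features, min_variance=0.01):
    """Keep only metrics that vary enough to influence decision-making."""
    kept = {}
    for name, values in features.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        if variance >= min_variance:
            kept[name] = values
    return kept

cpu = interpolate_gaps([1.0, None, 3.0])                       # gap filled linearly
features = filter_features({"cpu": cpu, "const": [5.0, 5.0, 5.0]})  # flat metric dropped
```

Downstream models then see only complete, informative series, which is the point of staging the pre-processing before the model graph.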
Figure 2. Runtime acceleration key for real-time loops
Deep-learning is a good tool for inductive reasoning, but deductive reasoning is also important for decision-making (for example, to limit cascading errors) and this requires one or more postprocessing stages. Collectively, these arguments point to a need for auto-pipelining through Function-as-a-Service (FaaS) for WebScale automation in the cloud-native space. Add to this the need for streaming, visualization, and time-series databases for selective data-processing in the stages, and what we end up with is a Machine Learning Operating System (ML-OS) that provides these services. An ML-OS, such as Nuclio, automatically maps pipelined functions (for example, python definitions) to cloud-native frameworks, utilizing specified configurations, as well as supporting open-source tools for visualization, streaming, in-memory time-series databases, and GPU-based model acceleration. Applications developed on the ML-OS ingest data and output control sequences for continuous optimization in decision-making. These real-time decision-making loops collectively enable WebScale Network Automation, Slice Management, RAN operations, Traffic Optimization, QoE Optimization, and Security. The following figure illustrates the AIOPs platform.
Figure 3. AIOPs Platform
In this section we show our prototype deployment using Generation (substrate) and Analytics infrastructure, as shown in the following figure. Generation includes a workload placed in Red Hat OpenStack (R16) VMs, where synthetically generated tomography images are compressively sensed in one VM and then reconstructed in another VM. System performance metrics from this workload environment are exported to the Service Telemetry Framework (STF) Gateway placed in Red Hat OpenShift (v4.3) containers, which gathers data for streaming to the Analytics cluster placed in VMware (v6.7) VMs. The Analytics cluster includes the Iguazio ML-OS with native GPU acceleration, and Anodot models for correlation and anomaly detection.
Figure 4. Workload layout in virtual infrastructure
The OpenStack cluster has 3 physical control nodes (R640) and 2 physical compute nodes (R740). Vm-1 generates random tomography images, which are compressed and dispatched to Vm-2 for reconstruction using L1 Lasso regression. OpenShift (OCP) is deployed on a pool of VMware virtual hosts (v6.7) with vSAN (see Reference Design) on 3 physical nodes (R740xd). The OCP deployment spans 3 control and 5 compute virtual hosts. There is a separate administration virtual host (CSAH) for infrastructure services (DHCP, DNS, HAPROXY, TFTP, PXE, and CLI) on the OCP platform. vSphere CSI drivers are enabled on the OCP platform so that persistent volume requirements for OCP pods are satisfied by vSAN storage. RH STF deployed on OpenShift facilitates the automated collection of measurements from the workload environment over the RabbitMQ message bus. STF stores metrics in the local Prometheus database and can forward them to data sinks such as Nuclio or Isilon (remote storage). The Nuclio ML-OS is installed as 3 data VMs and 3 application VMs using data, client, and management networks. Anodot models in the application VMs process metrics from the OpenStack environment to detect anomalies and correlate them, as shown in the following figure.
Figure 5. Sleeve tightening of metrics in Anodot
The Python snippet and image snapshot shown below capture a workload running in the OpenStack space. Self-timed logic (not shown here) in Vm-1 is used to randomly generate CPU-utilization surges during compression by resizing imaging resolution. The Anodot dashboard shows the resulting surge in CPU utilization in Vm-2 during reconstruction, hinting at a root-cause issue. Similar behavior can be seen in network utilization, which the Anodot dashboard shows by aligning the anomalies to indicate correlation.
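As a stand-in for the detection at work (this is a simple z-score illustration with made-up numbers, not the original snippet or Anodot's algorithm), one can flag points that deviate sharply from a metric's mean and check whether anomalies in two metrics align in time:

```python
import statistics

def anomalies(series, threshold=2.0):
    """Indices where a metric deviates from its mean by > threshold std-devs."""
    mean = statistics.mean(series)
    std = statistics.pstdev(series)
    return {i for i, v in enumerate(series) if std and abs(v - mean) > threshold * std}

cpu_vm2 = [10, 11, 10, 12, 11, 95, 10]    # surge during reconstruction
net_vm2 = [5, 6, 5, 6, 5, 60, 5]          # network surge at the same step

cpu_anoms = anomalies(cpu_vm2)
aligned = cpu_anoms & anomalies(net_vm2)  # time-aligned anomalies hint at correlation
```

A production system would use rolling baselines and learned seasonality rather than a global mean, but the alignment check captures the correlation idea shown in the dashboard.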
Figure 6. Anomaly detection in Anodot
The analytics solution proposed here uses open interfaces to aggregate data from all segments of the network, such as RAN, Packet Core, IMS, Messaging, Transport, Platform, and Devices. This provides the ability to correlate metrics and events across all nodes to create an end-to-end view of the network, the flow or a slice. AI turns this end-to-end insight into tangible inferences that drive autonomics in the substrate through control sequences.
Big Solutions on Dell EMC VxRail with SQL 2019 Big Data Cluster
Thu, 09 Jul 2020 19:20:04 -0000
The amount of data, in many different formats, that organizations must manage, ingest, and analyze has been the driving force behind Microsoft SQL Server 2019 Big Data Clusters (BDC). SQL Server 2019 BDC enables the deployment of scalable clusters of SQL Server, Spark, and containerized HDFS (Hadoop Distributed File System) running on Kubernetes.
We recently deployed and tested SQL Server 2019 BDC on Dell EMC VxRail hyperconverged infrastructure to demonstrate how VxRail delivers the performance, scalability, and flexibility needed to bring these multiple workloads together.
The Dell EMC VxRail platform was selected for its ability to incorporate compute, storage, virtualization, and management in one platform offering. The key feature of the VxRail HCI is the integration of vSphere, vSAN, and VxRail HCI System Software for an efficient and reliable deployment and operations experience. The use of VxRail with SQL Server 2019 BDC makes it easy to unite relational data with big data.
The testing demonstrates the advantages of using VxRail with SQL Server 2019 BDC for analytic application development. This also demonstrates how Docker, Kubernetes, and the vSphere Container Storage Interface (CSI) driver accelerate the application development life cycle when they are used with VxRail. The lab environment for development and testing used four VxRail E560F nodes supported by the vSphere CSI driver. With this solution, developers can provision SQL Server BDC in containerized environments without the complexities of traditional methods for installing databases and provisioning storage.
Our white paper, Microsoft SQL Server 2019 Big Data Cluster on Dell EMC VxRail, shows the power of implementing SQL Server 2019 BDC technologies on VxRail. Integrating SQL Server 2019 RDBMS, SQL Server BDC, MongoDB, and Oracle RDBMS helps to create a unified data analytics application. Using VxRail enhances the ability of SQL Server 2019 to scale out storage and compute clusters while embracing the virtualization techniques from VMware. This SQL Server 2019 BDC solution also benefits from the simplicity of a complete yet flexible validated Dell EMC VxRail with Kubernetes management and storage integration.
The solution demonstrates the combined value of the following technologies:
- VxRail E560F – All-flash performance
- Large tables stored on a scaled-out HDFS storage cluster that is hosted by BDC
- Smaller related data tables that are hosted on SQL Server, MongoDB, and Oracle databases
- Distributed queries that are enabled by the PolyBase capability in SQL Server 2019 to process Transact-SQL queries that access external data in SQL Server, Oracle, Teradata, and MongoDB.
- Red Hat Enterprise Linux
Big Data Cluster Services
This diagram shows how the pools are built and details the benefits of Kubernetes features for container orchestration at scale, including:
- Autoscaling, replication, and recovery of containers
- Intracontainer communication, such as IP sharing
- A single entity—a pod—for creating and managing multiple containers
- A container resource usage and performance analysis agent, cAdvisor
- Network pluggable architecture
- Load balancing
- Health check service
This white paper, Microsoft SQL Server 2019 Big Data Cluster on Dell EMC VxRail, addresses big data storage, the tools for handling big data, and the details around testing with TPC-H. When we tested data virtualization with PolyBase, the queries were successful, running without error and returning the results that joined all four data sources.
Because data virtualization does not involve physically copying and moving the data (so that the data is available to business users in real-time), BDC simplifies and centralizes access to and analysis of the organization’s data sphere. It enables IT to manage the solution by consolidating big data and data virtualization on one platform with a proven set of tools.
Success starts with the right foundation:
SQL Server 2019 BDC is a compelling new way to utilize SQL Server to bring high-value relational data and high-volume big data together on a unified, scalable data platform. All of this can be deployed with VxRail, enabling enterprises to experience the power of PolyBase to virtualize their data stores, create data lakes, and create scalable data marts in a unified, secure environment without needing to implement slow and costly Extract, Transform, and Load (ETL) pipelines. This makes data-driven applications and analysis more responsive and productive. SQL Server 2019 BDC and Dell EMC VxRail provide a complete unified data platform to deliver intelligent applications that can help make any organization more successful.
Read the full paper to learn more about how Dell EMC VxRail with SQL 2019 Big Data Clusters can:
- Bring high-value relational data and high-volume big data together on a single, scalable platform.
- Incorporate intelligent features and get insights from more of your data—including data stored beyond SQL Server in Hadoop, Oracle, Teradata, and MongoDB.
- Support and enhance your database management and data-driven apps with advanced analytics using Hadoop and Spark.
Additional VxRail & SQL resources:
Author: Vic Dery, Senior Principal Engineer, VxRail Technical Marketing