The figure in Edge connectivity shows the primary functional components at the edge.
The edge environment runs under Red Hat Enterprise Linux Server or CentOS. The recommended infrastructure for the edge environment is covered in Edge configurations.
The edge environment runs the Confluent Platform directly under Linux. Three broker nodes are recommended at the edge to provide high availability for the Kafka topics. A single broker can be used to minimize cost at the expense of high availability. This single broker configuration may be appropriate for environments with many edge installations where overall cost is more important than reliability of a single edge site. A single edge installation can support many pieces of equipment by scaling the number of broker nodes.
The Kafka topic data is stored directly on local storage in each broker node. The amount of storage that is required depends on:
Because the architecture is designed to forward the data to the core, the retention time can be relatively short. The edge needs only enough storage to queue data while the interface to the core is temporarily unavailable.
The correct sizing for a specific environment can be calculated using the Capacity Planning section of the Confluent Platform documentation.
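As a rough illustration of how retention time drives edge storage, the following sketch multiplies ingest rate by the retention window. It is a back-of-the-envelope estimate only: the function names, the replication factor of 3, and the 25% headroom are assumptions made for this example, and it ignores compression and index overhead. The Confluent capacity planning guidance remains the authoritative source.

```python
def required_storage_bytes(msgs_per_sec, avg_msg_bytes, retention_hours,
                           replication_factor=3, headroom=0.25):
    """Estimate total cluster storage needed to hold topic data for the
    retention window. All parameters are illustrative assumptions."""
    # Raw bytes written during the retention window.
    raw = msgs_per_sec * avg_msg_bytes * retention_hours * 3600
    # Each replica stores a full copy; headroom leaves spare capacity.
    return raw * replication_factor * (1 + headroom)


def per_broker_bytes(total_bytes, brokers=3):
    """Split the total evenly across brokers (assumes balanced partitions)."""
    return total_bytes / brokers


if __name__ == "__main__":
    total = required_storage_bytes(1000, 200, retention_hours=24)
    print(f"cluster: {total / 1e9:.1f} GB, "
          f"per broker: {per_broker_bytes(total) / 1e9:.1f} GB")
```

The retention window here should cover the longest expected outage of the link to the core, since that is the period during which data must be queued locally.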
The architecture allows the edge to operate independently of the core. All the necessary functions for anomaly detection and alert generation are hosted at the edge. The edge requires connectivity to the core for replication of edge data, forwarding of alerts, and updates to the machine learning models.
The details of the connection are site and installation specific. The connection should have enough bandwidth to support the volume of data coming from all the systems in the equipment topics, plus any aggregated streams created at the edge. In general, the connection is not latency sensitive since alerts are processed locally. If alerts must be forwarded to the core, then latency becomes a consideration.
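The bandwidth requirement can be reasoned about with simple arithmetic. The sketch below estimates the steady-state rate for a given message volume and the time needed to drain a backlog after an outage. The function names and the example figures are hypothetical, and protocol overhead and compression are not modeled.

```python
def steady_state_mbps(msgs_per_sec, avg_msg_bytes):
    """Sustained link rate, in megabits per second, needed to forward all
    edge data as it arrives (payload only, no protocol overhead)."""
    return msgs_per_sec * avg_msg_bytes * 8 / 1_000_000


def catch_up_hours(outage_hours, ingest_mbps, link_mbps):
    """Hours needed to drain the backlog built up during an outage while
    new data keeps arriving at ingest_mbps."""
    spare = link_mbps - ingest_mbps
    if spare <= 0:
        raise ValueError("link cannot keep up with the ingest rate")
    backlog_megabit_hours = outage_hours * ingest_mbps
    return backlog_megabit_hours / spare


if __name__ == "__main__":
    rate = steady_state_mbps(1000, 200)
    print(f"steady state: {rate:.2f} Mbit/s")
    print(f"catch-up after a 2 h outage on a 5 Mbit/s link: "
          f"{catch_up_hours(2, rate, 5.0):.2f} h")
```

A link sized exactly to the steady-state rate never catches up after an outage, so some spare capacity is needed in practice.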
All connections between the edge and the core should be firewalled for security.
Data collection from the monitored equipment is specific to the type of equipment, the interfaces and data formats available, and the parametric data necessary for the detection models. The collection and formatting of the data are performed in the protocol interface blocks. This step will typically involve some equipment-specific software to collect the data and perform any necessary formatting and conversion. This software can use the Kafka producer library to pass the data into a Kafka topic for further processing. Depending on the specific telemetry interfaces supported by the equipment, a Dell Edge Gateway may be necessary to support the physical hardware interfaces. The Dell Edge Gateway 3001 provides multifunction general-purpose I/O (GPIO) hardware support, while the Dell Edge Gateway 3002 provides a CANbus interface for industrial equipment that supports the CANbus standard.
After the data has been collected, formatted, or converted as appropriate, the data is passed to a Kafka topic for each individual machine using the Kafka producer library. The data is also converted to Apache Avro format before it is passed to Kafka. Avro is used to provide a compact data representation with a defined schema. The Confluent Schema Registry is used to ensure data compatibility between the edge data format and the rest of the environment.
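To make the idea of a defined schema concrete, the sketch below expresses a possible Avro record schema as a Python dictionary and checks a record for missing fields. The record name, namespace, and field names are hypothetical, since the source does not list the telemetry fields. Actual Avro serialization and Schema Registry registration would be handled by an Avro library and the Confluent serializers, which are not shown here.

```python
import json

# Hypothetical Avro schema for one telemetry record; field names are
# illustrative assumptions, not taken from the architecture.
TELEMETRY_SCHEMA = {
    "type": "record",
    "name": "EquipmentTelemetry",
    "namespace": "example.edge",
    "fields": [
        {"name": "machine_id", "type": "string"},
        {"name": "timestamp",
         "type": {"type": "long", "logicalType": "timestamp-millis"}},
        {"name": "temperature", "type": "double"},
        {"name": "vibration", "type": "double"},
    ],
}


def missing_fields(record, schema=TELEMETRY_SCHEMA):
    """Return the schema field names that are absent from the record."""
    return [f["name"] for f in schema["fields"] if f["name"] not in record]


# Avro schemas are exchanged as JSON documents.
schema_json = json.dumps(TELEMETRY_SCHEMA, indent=2)

if __name__ == "__main__":
    print(schema_json)
    print(missing_fields({"machine_id": "m1", "timestamp": 0}))
```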
The lab development environment used a data generator program to simulate the equipment and its telemetry. The data generator was designed to create data, including anomalies, so the machine learning models could be tested. This design also allowed the simulation of multiple pieces of equipment.
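A generator of this kind might look like the following minimal sketch. It is not the lab's actual program: the field names, the normal operating temperature, and the spike used to mark an anomaly are all assumptions chosen for illustration.

```python
import random


def generate_telemetry(machine_id, n, anomaly_rate=0.05, seed=None):
    """Yield n simulated telemetry records for one machine, injecting a
    temperature spike into roughly anomaly_rate of them."""
    rng = random.Random(seed)  # seeded for reproducible test data
    for seq in range(n):
        is_anomaly = rng.random() < anomaly_rate
        temperature = rng.gauss(70.0, 2.0)  # assumed normal operating range
        if is_anomaly:
            temperature += 25.0  # assumed anomaly: a large upward spike
        yield {
            "machine_id": machine_id,
            "seq": seq,
            "temperature": temperature,
            "is_anomaly": is_anomaly,  # ground-truth label for model testing
        }


if __name__ == "__main__":
    # Simulate several pieces of equipment by varying the machine id and seed.
    for m in range(3):
        records = list(generate_telemetry(f"machine{m}", 1000, seed=m))
        anomalies = sum(r["is_anomaly"] for r in records)
        print(f"machine{m}: {anomalies} anomalies in {len(records)} records")
```

Keeping the ground-truth label in each record makes it straightforward to compare the model's output against the injected anomalies.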
Anomaly detection at the edge is done using a machine learning model that is developed in TensorFlow and Keras. Model development is done at the core, and the model is pushed to the edge for execution. Confluent KSQL and its user-defined function (UDF) capability are used to run the model on each data record and generate events for anomalies.
KSQL user-defined functions allow developers to extend KSQL with Java functions including:
A user-defined function provides a clean method of integrating the machine learning model with the data streams and supports execution on each data record. This method also eliminates the need for a separate model serving and execution capability at the edge.
The UDF is written in Java. It uses the Java API for TensorFlow to load and run the model. The UDF is added to KSQL by copying it to the KSQL extension directory before starting KSQL.
The data from each individual machine topic in Kafka is fed to Confluent KSQL by creating a KSQL stream from the topic. A second KSQL stream is created from the first stream as a SELECT that calls a KSQL user-defined function. The user-defined function then runs the model with the stream data, and the results are written to a second topic.
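Because there is one topic per machine, the pair of KSQL statements can be generated from a template. The sketch below renders them as strings. The UDF name `ANOMALY`, the column name `temperature`, and the stream naming scheme are hypothetical. It also assumes that the topic name is a valid KSQL identifier and that the stream's columns are read from the Avro schema in Schema Registry.

```python
def ksql_statements(machine_topic, udf_name="ANOMALY",
                    value_column="temperature"):
    """Render the two KSQL statements for one machine topic.

    All names are illustrative; udf_name must match a UDF that has
    actually been installed in the KSQL extension directory."""
    source = f"{machine_topic}_raw"
    scored = f"{machine_topic}_scored"
    # First stream: exposes the machine topic to KSQL.
    create_source = (
        f"CREATE STREAM {source} "
        f"WITH (KAFKA_TOPIC='{machine_topic}', VALUE_FORMAT='AVRO');"
    )
    # Second stream: a SELECT that calls the UDF on every record; KSQL
    # writes the result to a new backing topic.
    create_scored = (
        f"CREATE STREAM {scored} AS "
        f"SELECT {value_column}, {udf_name}({value_column}) AS anomaly_score "
        f"FROM {source};"
    )
    return create_source, create_scored


if __name__ == "__main__":
    for statement in ksql_statements("machine1"):
        print(statement)
```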
Model deployment and updates are accomplished by deploying a new version of the KSQL user-defined function or the model file.
The handling of alerts typically depends on the implementation requirements and must be flexible. In this architecture, the second topic that is created using KSQL contains the data describing any anomalies (that is, the results of the model execution). It can also contain equipment identification, time references, or any other information relevant to generating an alert. This data can be used in multiple alert scenarios:
Any single approach or a combination of these approaches can be used in a specific deployment and may use any features of the stream processing platform.
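As one simple illustration of consuming the anomaly topic, the sketch below turns a scored record into an alert when its score crosses a threshold. The record fields (`machine_id`, `timestamp`, `anomaly_score`) and the threshold value are assumptions for this example. How the alert is then delivered is deployment specific and is not shown.

```python
def to_alert(record, threshold=0.8):
    """Return an alert dict for a scored record, or None when the score is
    below the threshold. Field names and threshold are illustrative."""
    if record["anomaly_score"] < threshold:
        return None
    return {
        "machine_id": record["machine_id"],  # equipment identification
        "timestamp": record["timestamp"],    # time reference from the record
        "score": record["anomaly_score"],
    }


def alerts_from(records, threshold=0.8):
    """Filter a sequence of scored records down to alerts."""
    return [a for a in (to_alert(r, threshold) for r in records)
            if a is not None]


if __name__ == "__main__":
    sample = [
        {"machine_id": "m1", "timestamp": 1, "anomaly_score": 0.10},
        {"machine_id": "m1", "timestamp": 2, "anomaly_score": 0.93},
    ]
    print(alerts_from(sample))
```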
The architecture is designed to support multiple edge locations. Topics from each edge location are copied to the core using Confluent Replicator. Replicator runs in the core and pulls data from the edges. This design provides data protection and enables analysis of the raw data in the core.