There are many ways to build a model-based data filter for detecting anomalies.
The Confluent-developed example above uses an open-source framework from H2O.ai to create the models. Dell Technologies recently completed a white paper for the PowerEdge Reference Architecture series titled "Unleash the power of AI using H2O.ai on Dell EMC infrastructure optimized for machine learning."
For this use case, Dell EMC chose a deep learning model that is developed in Python using Keras with a TensorFlow backend (TensorFlow Core Documentation for Keras).
The use of deep learning for anomaly detection is an evolving area of research that is beyond the scope of this document. It is nevertheless important to highlight the opportunities and challenges in implementing use cases that support the Industry 4.0 vision of merging data science, IT, and OT.
For deployment to production, data scientists can save an H2O model as a Plain Old Java Object (POJO) or a Model Object, Optimized (MOJO). This feature integrates well with the programming environment that is supported for KSQL on the Confluent platform. Because these artifacts are plain Java code, they are not tied to a specific version of H2O, and they do not require a running H2O cluster for inference. (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/save-and-load-model.html#saving-loading-downloading-and-uploading-models)
For this use case, Dell EMC chose the MLflow open-source platform to manage the machine learning life cycle, including:
The primary deep learning models that are used for anomaly detection follow an encoder and decoder pattern, which is known as an autoencoder. The prior state of the art for automated anomaly detection used supervised machine learning or statistical regression models, in which the detection tool is trained to distinguish between two prelabeled classes (healthy and unhealthy). (Borghesi, A., Bartolini, A., Lombardi, M., Milano, M., & Benini, L. (2019, July). Anomaly detection using autoencoders in high performance computing systems. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, pp. 9428-9433).)
Autoencoders are trained with only normal (healthy) behavior, eliminating the requirement to find and label anomalies before the model can be deployed. For normal input, the trained autoencoder encodes the data and then decodes the values back to a close approximation of the original state. When anomalies occur in the data stream, the model cannot accurately encode and reconstruct the input data, and the resulting large reconstruction error identifies the abnormal condition.
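The scoring step that is described above can be sketched in a few lines of Python. This is a minimal illustration and not the model that Dell EMC deployed. The trained autoencoder is replaced by a hypothetical stand-in, `toy_reconstruct`, which assumes that healthy readings lie close to the line y = 2x and reconstructs each input by projecting it onto that line. The sample values, the function names, and the threshold rule (mean plus three standard deviations of the healthy reconstruction errors) are all assumptions made for the example.

```python
from statistics import mean, stdev
from typing import Callable, List, Sequence

Vector = Sequence[float]
Reconstructor = Callable[[Vector], List[float]]


def reconstruction_error(x: Vector, x_hat: Vector) -> float:
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def fit_threshold(healthy: Sequence[Vector],
                  reconstruct: Reconstructor,
                  k: float = 3.0) -> float:
    """Derive an error threshold from healthy data only (assumed rule: mean + k * stdev)."""
    errors = [reconstruction_error(x, reconstruct(x)) for x in healthy]
    return mean(errors) + k * stdev(errors)


def is_anomaly(x: Vector, reconstruct: Reconstructor, threshold: float) -> bool:
    """Flag an input whose reconstruction error exceeds the threshold."""
    return reconstruction_error(x, reconstruct(x)) > threshold


def toy_reconstruct(x: Vector) -> List[float]:
    """Hypothetical stand-in for a trained autoencoder with a one-value bottleneck."""
    code = (x[0] + 2.0 * x[1]) / 5.0   # "encode": project onto the direction (1, 2)
    return [code, 2.0 * code]          # "decode": map the code back to two features


# Illustrative healthy readings that lie close to the line y = 2x.
healthy_samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.05), (4.0, 8.0), (5.0, 9.9)]
threshold = fit_threshold(healthy_samples, toy_reconstruct)

normal_reading = (2.5, 5.0)    # follows the healthy pattern
abnormal_reading = (3.0, 1.0)  # breaks the healthy pattern

print(is_anomaly(normal_reading, toy_reconstruct, threshold))
print(is_anomaly(abnormal_reading, toy_reconstruct, threshold))
```

With a real model, `toy_reconstruct` would be replaced by a call to the trained autoencoder's prediction function. The threshold would typically be tuned on held-out healthy data instead of being fixed by a simple rule.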
The use of a Keras and TensorFlow model also fits this scenario. Keras supports saving a single HDF5 format file that contains the model's architecture, weight values, and compile information. That file can then be used from a Java application through KSQL UDFs. HDF5 is a lightweight Keras model output format and an alternative to the more recent SavedModel format. (TensorFlow Guide)