Guest blog post by Ajit Jaokar
This two part blog is based on my forthcoming book: Data Science for Internet of Things.
It is also the basis for the course I teach, the Data Science for Internet of Things Course. I will be syndicating sections of the book on the Data Science Central blog. I welcome your comments. Please email me at ajit.jaokar at futuretext.com. Email me also for a pdf version if you are interested in joining the course.
Here, we start off with the question: at which points could you apply analytics to the IoT ecosystem, and what are the implications? We then extend this to a broader question: could we formulate a methodology to solve Data Science for IoT problems? I have illustrated my thinking through a number of companies/examples. I personally work with an Open Source strategy (based on R, Spark and Python), but the methodology applies to any implementation. We are currently working with a range of implementations including AWS, Azure, GE Predix, Nvidia, etc. Thus, the discussion is vendor agnostic.
I also mention some trends I am following, such as Apache NiFi.
The Internet of Things and the flow of Data
As we move towards a world of 50 billion connected devices, Data Science for IoT (IoT analytics) helps to create new services and business models. IoT analytics is the application of data science models to IoT datasets. The flow of data starts with the deployment of sensors. Sensors detect events or changes in quantities and provide a corresponding output in the form of a signal. Historically, sensors have been used in domains such as manufacturing. Now their deployment is becoming pervasive through ordinary objects like wearables, and sensors are also being deployed through new devices like robots and self-driving cars. This widespread deployment of sensors has led to the Internet of Things.
Features of a typical wireless sensor node are described in this paper (wireless embedded sensor architecture). Typically, data arising from sensors is in time series format and is often geotagged. This means there are two forms of analytics for IoT: time series analytics and spatial analytics. Time series analytics typically leads to insights like anomaly detection; hence, classifiers are commonly used in IoT analytics. But by looking at historical trends, streaming data, and data combined from multiple events (sensor fusion), we can get new insights. And more use cases for IoT keep emerging, such as Augmented Reality (think Pokemon Go + IoT).
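As a minimal sketch of anomaly detection on a sensor time series, consider a rolling z-score check: flag a reading that deviates strongly from the recent window. The window size, threshold and sample values below are illustrative choices, not recommendations.

```python
# Rolling z-score anomaly detection sketch for a sensor time series.
# A reading is flagged when it lies more than `threshold` standard
# deviations away from the mean of the preceding `window` readings.
import statistics

def detect_anomalies(readings, window=5, threshold=3.0):
    """Return indices of readings that deviate strongly from the
    mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev > 0 and abs(readings[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies

# A stable temperature stream with one injected spike at index 6
stream = [21.0, 21.1, 20.9, 21.0, 21.2, 21.1, 35.0, 21.0]
print(detect_anomalies(stream))  # → [6]
```

A production classifier would of course be trained on labeled data; this fixed-threshold version simply illustrates the shape of the problem.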
Meanwhile, sensors themselves continue to evolve. Sensors have shrunk due to technologies like MEMS, and their communications protocols have improved through new technologies like LoRa. These protocols lead to new forms of communication for IoT such as Device to Device, Device to Server, or Server to Server. Thus, whichever way we look at it, IoT devices create a large amount of data. Typically, the goal of IoT analytics is to analyse the data as close to the event as possible. We see this requirement in many 'Smart city' type applications such as transportation, energy grids, and utilities like water, street lighting, parking, etc.
IoT data transformation techniques
Once data is captured through the sensor, a few analytics techniques can be applied to the data, some of which are unique to IoT. For instance, not all data may be sent to the Cloud/Lake. We could perform temporal or spatial analysis. Considering the volume of data, some of it may be discarded at source or summarized at the Edge. Data could also be aggregated, and aggregate analytics could be applied to the IoT data aggregates at the Edge. For example, if you want to detect failure of a component, you could look for spikes in values for that component over a recent span (thereby potentially predicting failure). You could also correlate data across multiple IoT streams. Typically, in stream processing, we are trying to find out what happened now (as opposed to what happened in the past); hence, the response should be near real-time. In addition, sensor data could be 'cleaned' at the Edge:
- Missing values in sensor data could be filled in (imputing values)
- Sensor data could be combined to infer an event (Complex Event Processing)
- Data could be normalized across sensors, time and devices
- Different data formats and multiple communication protocols could be handled
- Thresholds could be managed
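The cleaning and summarization steps above can be sketched as follows: impute missing readings with the last known value, then ship a single aggregate record upstream instead of every raw reading. The field names and values are illustrative.

```python
# Edge-side preprocessing sketch: impute missing values (None) with the
# last known reading, then summarize a window into one aggregate record
# instead of sending every raw reading to the Cloud/Data Lake.
def impute_last_known(readings, default=0.0):
    cleaned, last = [], default
    for r in readings:
        if r is None:
            cleaned.append(last)   # fill gap with last known value
        else:
            cleaned.append(r)
            last = r
    return cleaned

def summarize(readings):
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": sum(readings) / len(readings),
    }

raw = [10.0, None, 10.4, 10.2, None, 10.6]
cleaned = impute_last_known(raw)
summary = summarize(cleaned)   # one record sent upstream, not six
```

Real deployments would pick an imputation strategy (interpolation, model-based) suited to the sensor; last-known-value is just the simplest case.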
Applying IoT Analytics to the Flow of Data
Here, we address the possible locations and types of analytics that could be applied to IoT datasets.
Some initial notes:
- IoT data arises from sensors and ultimately resides in the Cloud.
- We use the concept of a ‘Data Lake’ to refer to a repository of Data
- We consider four possible avenues for IoT analytics: ‘Analytics at the Edge’, ‘Streaming Analytics’ , NoSQL databases and ‘IoT analytics at the Data Lake’
- For Streaming analytics, we could build an offline model and apply it to a stream
- If we consider cameras as sensors, Deep learning techniques could be applied to Image and video datasets (for example CNNs)
- Even when IoT data volumes are high, not all scenarios need the data to be distributed. It is very much possible to run analytics on a single node with a non-distributed architecture in Python or R.
- Feedback mechanisms are a key part of IoT analytics. Feedback is part of multiple IoT analytics modalities (e.g. Edge, Streaming)
- CEP (Complex event processing) can be applied to multiple points as we see in the diagram
We now describe various analytics techniques that could be applied to IoT datasets.
Complex event processing
Complex Event Processing (CEP) can be used at multiple points for IoT analytics (e.g. Edge, Stream, Cloud, etc.).
In general, Event processing is a method of tracking and analyzing streams of data and deriving a conclusion from them. Complex event processing, or CEP, is event processing that combines data from multiple sources to infer events or patterns that suggest more complicated circumstances. The goal of complex event processing is to identify meaningful events (such as opportunities or threats) and respond to them as quickly as possible.
In CEP, the data is in motion. In contrast, a traditional query (e.g. against an RDBMS) acts on static data. Thus, CEP is mainly about stream processing, but the algorithms underlying CEP can also be applied to historical data.
CEP relies on a number of techniques applied to events, including pattern detection, abstraction, filtering, aggregation and transformation. CEP algorithms model event hierarchies and detect relationships (such as causality, membership or timing) between events. They create an abstraction of event-driven processes. Thus, CEP engines typically act as event correlation engines: they analyze a mass of events, pinpoint the most significant ones, and trigger actions.
Most CEP solutions and concepts can be classified into two main categories: aggregation-oriented CEP and detection-oriented CEP. An aggregation-oriented CEP solution is focused on executing on-line algorithms as a response to event data entering the system - for example, continuously calculating an average based on data in the inbound events. Detection-oriented CEP is focused on detecting combinations of events called event patterns or situations - for example, looking for a specific sequence of events. For IoT, CEP techniques are concerned with deriving a higher-order value/abstraction from discrete sensor readings.
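The two CEP styles can be sketched over a single event stream: an aggregation-oriented operator maintains a running average online, while a detection-oriented operator fires when a specific event sequence appears. The event names and values below are invented for illustration.

```python
# Sketch of the two CEP categories over one event stream.
from collections import deque

class RunningAverage:
    """Aggregation-oriented: update an online average per event."""
    def __init__(self):
        self.total, self.count = 0.0, 0
    def update(self, value):
        self.total += value
        self.count += 1
        return self.total / self.count

class SequenceDetector:
    """Detection-oriented: fire when the last events match a pattern."""
    def __init__(self, pattern):
        self.pattern = pattern
        self.recent = deque(maxlen=len(pattern))
    def update(self, event):
        self.recent.append(event)
        return list(self.recent) == self.pattern

avg = RunningAverage()
detector = SequenceDetector(["overheat", "overheat", "shutdown"])

events = [("overheat", 80), ("overheat", 85), ("shutdown", 0)]
for name, temp in events:
    mean_temp = avg.update(temp)     # aggregation result so far
    fired = detector.update(name)    # True once the pattern completes
```

A real CEP engine adds time windows, out-of-order handling and a rule language on top, but the aggregation/detection split is the same.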
CEP uses techniques like Bayesian networks, neural networks, Dempster-Shafer methods, Kalman filters, etc. Some more background is available at Developing a complex event processing architecture for IoT.
Real-time systems differ in the way they perform analytics. Specifically, real-time systems perform analytics on short time windows of data streams; hence, the scope of real-time analytics is a 'window', which typically comprises the last few time slots. Making predictions on real-time data streams involves building an offline model and applying it to the stream. Models incorporate one or more machine learning algorithms trained on historical data (spam, credit card fraud, etc.). Once built, the model can be validated against a real-time system to find deviations in the stream data. Deviations beyond a certain threshold are tagged as anomalies.
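The "build offline, apply online" pattern can be sketched with a deliberately tiny model: a nearest-centroid classifier trained once on labeled historical readings, then applied to each value arriving on the stream. The labels and data below are invented for illustration; a real system would persist the trained model and score proper feature vectors.

```python
# Sketch of building a model offline and applying it to a stream.
# The "model" here is a nearest-centroid classifier: one mean per label.
def train_offline(labeled):
    """labeled: list of (value, label) pairs. Returns per-label mean."""
    sums, counts = {}, {}
    for value, label in labeled:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(model, value):
    """Assign the label whose centroid is closest to the value."""
    return min(model, key=lambda label: abs(value - model[label]))

# Offline phase: train once on historical, labeled data
history = [(4.9, "normal"), (5.1, "normal"), (12.0, "fault"), (11.5, "fault")]
model = train_offline(history)

# Online phase: score each value as it arrives on the stream
stream = [5.0, 11.8, 5.2]
labels = [classify(model, v) for v in stream]
```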
IoT ecosystems can create many logs depending on the status of IoT devices. By collecting these logs for a period of time and analyzing the sequence of event patterns, a model to predict a fault can be built, including the probability of failure for the sequence. This model is then applied to the stream (online). A technique like the Hidden Markov Model can be used for detecting failure patterns based on the observed sequence. Complex Event Processing can be used to combine events over a time frame (e.g. the last minute) and correlate patterns to detect the failure pattern.
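To make the HMM idea concrete, here is a sketch that scores an observed log sequence under two hand-set discrete HMMs ("normal" vs "failure") using the forward algorithm, and labels the sequence with the higher-likelihood model. All the probabilities and the ok/error encoding are illustrative assumptions, not learned from real device logs.

```python
# Forward-algorithm sketch: score an event-log sequence under two HMMs
# and flag failure when the "failure" model explains it better.
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(obs | HMM) via the forward algorithm.
    pi: initial state probs, A: state transitions, B: emission probs."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Observations: 0 = "ok" log line, 1 = "error" log line
pi = np.array([0.5, 0.5])                       # states: healthy, degraded
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B_normal  = np.array([[0.95, 0.05],             # healthy devices rarely
                      [0.70, 0.30]])            # emit errors
B_failure = np.array([[0.40, 0.60],             # failing devices emit
                      [0.10, 0.90]])            # mostly errors

obs = [1, 1, 1, 0, 1]                           # mostly error lines
is_failure = (forward_likelihood(pi, A, B_failure, obs)
              > forward_likelihood(pi, A, B_normal, obs))
```

In practice the transition and emission matrices would be estimated from collected logs (e.g. with Baum-Welch) rather than set by hand.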
Typically, streaming systems could be implemented in Kafka and Spark.
Some interesting links on streaming I am tracking:
Part two will consider other technologies, including Edge processing and Deep learning.
If you want to be a part of my course please see the testimonials at Data Science for Internet of Things Course.