


By Bill Vorhies

Bill is Editorial Director for our sister site Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001. This article originally appeared here

Summary:  In this Lesson 3 we continue to provide a complete foundation and broad understanding of the technical issues surrounding an IoT or streaming system so that the reader can make intelligent decisions and ask informed questions when planning their IoT system. 

In Lesson 1

  • Is it IoT or Streaming
  • Basics of IoT Architecture – Open Source
  • Data Capture – Open Source with Options
  • Storage – Open Source with Options
  • Query – Open Source with Options

In Lesson 2

  • Stream Processing – Open Source
  • What Can Stream Processors Do
  • Open Source Options for Stream Processors
  • Spark Streaming and Storm
  • Lambda Architecture – Speed plus Safety
  • Do You Really Need a Stream Processor
  • Four Applications of Sensor Data

In This Article

  • Three Data Handling Paradigms – Spark versus Storm
  • Streaming and Real Time Analytics
  • Beyond Open Source for Streaming
  • Competitors to Consider
  • Trends to Watch

Continuing from Lesson 2, our intent is to provide a broad foundation for folks who are starting to think about streaming and IoT.  In this lesson we’ll explain how SPARK and Storm handle data streams differently, discuss what real time analytics actually means, offer some alternatives for streaming beyond open source, and suggest some trends you should watch in this fast-evolving space.

 

Three Data Handling Paradigms:  SPARK Versus Storm

When users compare SPARK and Storm, the conversation usually focuses on the difference in the way they handle the incoming data stream.

  • Storm processes incoming data one event at a time – called Atomic processing. 
  • SPARK processes incoming data in very small batches – called Micro Batch.  A SPARK micro batch is typically between ½ second and 10 seconds depending on how often the sensors are transmitting.  You can define this value.
  • A third method called Windowing allows for much longer windows of time and can be useful in some text or sentiment analysis applications, or systems in which signals only evolve over a relatively long period of time.

 

Atomic:  (aka one-tuple-at-a-time) Processes each inbound data event as a separate element.  This is the most intuitively obvious but also the most computationally expensive design.  It is used, for example, to guarantee the fastest processing of individual events with the least delay in transmitting the event to the subscriber.  It is often seen for customer transactional inputs so that if some element of the event block fails, the entire block is not deleted but moved to a bad-record file that can be processed further later.  Apache Storm uses this paradigm.

Micro batching:  The critique of this approach is that it processes in batches rather than as an atomic-level stream, but typically those batches are extremely small, encompassing only the events that occur within a few seconds.  You can adjust the time window.  This makes the process somewhat more efficient.  SPARK Streaming uses this paradigm.

Windowing:  A hybrid of the two approaches, Windowing maintains the atomic processing of each data item but creates pseudo-batches (windows) to make processing more efficient.  This also allows for many more sophisticated interpretations such as sliding windows (e.g. everything that occurred in the last X period of time). 
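To make the micro-batch and window ideas concrete, here is a minimal PySpark Streaming sketch; the socket source, host, port, and field layout are assumptions for illustration, not part of any particular deployment.  It sets a 2-second micro-batch and computes a 60-second sliding window on top of it.

```python
# Minimal PySpark Streaming sketch: 2-second micro-batches plus a
# 60-second sliding window recomputed every 10 seconds.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="SensorStream")
ssc = StreamingContext(sc, batchDuration=2)   # the micro-batch interval you define
ssc.checkpoint("checkpoint")                  # required for windowed state

# Assume each line arrives as "sensorId,value" (hypothetical format)
lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.map(lambda l: (l.split(",")[0], float(l.split(",")[1])))

# Sliding window: running sum of readings over the last 60 seconds
sums = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,    # values entering the window
    lambda a, b: a - b,    # values leaving the window
    60, 10)
sums.pprint()

ssc.start()
ssc.awaitTermination()
```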

All three of these approaches can guarantee that each data element is processed at least once.  Only the Atomic paradigm can guarantee that each data element is processed only once.

 

Consider this Example 

Your sensors are like FitBits and sample data every 10 seconds.  They transmit it in bursts whenever the sensor is cued to dump its data into a Wi-Fi stream.  One user may monitor the results of the stream many times during the day, valuing low latency and causing his sensor to upload via Wi-Fi frequently.  Another user may not be near a Wi-Fi connection or may simply not bother to download the data for several days.  A third user may have trouble with a network connection or with the hardware itself, causing the sensor to transmit incomplete packets that are repeated later or are simply missing from the stream.

In this scenario, data from sensors originating at the same time may arrive at the stream processor with widely different delays and some of those packets that were disrupted may have been transmitted more than once or not at all.

You will need to evaluate carefully whether guaranteeing ‘only once’ processing, or the marginally faster response time of atomic processing, should weigh in your selection of a Stream Processor.

 

Streaming and Real Time Analytics

It’s common in IoT to find references to “real time analytics” or “in stream analytics” and these terms can be misleading.  Real time analytics does not mean discovering wholly new patterns in the data in real time while it is streaming by.  What it means is that previously developed predictive models that were deployed into the Stream Processor can score the streaming data and determine whether that signal is present, in real time.

It’s important to remember that the data science behind your sophisticated Stream Processor was developed in the classic two-step data science process.  First, data scientists work in batch with historical data with a known outcome (supervised learning) to develop an algorithm that uses the inputs to predict the likelihood of the targeted event.  The model, an algebraic formula represented by a few lines of code (C, Python, Java, R, or others), is then exported into a program within the Stream Processor and goes to work evaluating the passing data to see if the signal is present.  If it is, some form of action alert is sent to a human or machine, or sent as a visual signal to a dashboard.
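As a concrete, hypothetical illustration: the exported “model” is often no more than a scoring function like the sketch below, applied to each event as it streams by.  The coefficients and field names here are invented for the example.

```python
import math

# Hypothetical coefficients fitted offline, in batch, by the data scientists
INTERCEPT = -4.2
COEF = {"temp": 0.031, "vibration": 1.7, "power": 0.006}

def score(event):
    """Logistic model: probability that the targeted signal is present."""
    z = INTERCEPT + sum(COEF[k] * event[k] for k in COEF)
    return 1.0 / (1.0 + math.exp(-z))

# The Stream Processor applies this to every passing event
if score({"temp": 98.0, "vibration": 2.4, "power": 450.0}) > 0.8:
    print("ALERT: signal present")   # alert a human, a machine, or a dashboard
```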

Recently the first indications that some new discoveries can be made in real time have been emerging but they are exceedingly rare.  See more in this article.

 

Beyond Open Source for Streaming

Why would you want to look beyond open source for your IoT system?  Largely because while open source tools and packages are practically free, they are free in the sense of a ‘free puppy’.

Yes, these packages can be downloaded for free from Apache, but the most reasonable sources are the three primary distributors, Hortonworks, Cloudera, and MapR, all of whom make sure the code is kept up to date and add certain features that make it easier to maintain.  Even from these distributors, your total investment should be in the low five figures.  This does not, of course, include implementation, consulting, or configuration support, which is extra, whether from the distributors, from other consultants, or from your own staff if they are qualified.

With open source, what you also get is complexity.  Author Jim Scott, writing about SPARK, summed it up quite nicely:  “SPARK is like a fighter jet that you have to build yourself. The great thing about that is that after you are done building it, you have a fighter jet. Problem is, have you ever flown a fighter jet? There are more levers than could be imagined.”

In IT parlance, the configurations and initial programs you create in SPARK or other open source streaming platforms will be brittle.  That is, every time your business rules change you will have to modify the SPARK code, typically written in Scala (Python is also available).

Similarly, standing up a SPARK or Hadoop storage cluster comes with programming and DBA overhead that you may not want to incur, or will at least want to minimize.  Using one of the major cloud providers and/or adding a SaaS service like Qubole will greatly reduce your labor at only a little incremental cost.

The same is true for the proprietary Stream Processors, many of which are offered by major companies and are well tested and supported.  Many of these come with drag-and-drop visual interfaces eliminating the need for manual coding, so that any reasonably dedicated programmer or analyst can configure and maintain the internal logic as your business changes.  (Keep your eye on NiFi, the new open source platform that also claims drag-and-drop.)

 

Competitors to Consider

Forrester publishes a periodic rating and ranking of competing “Big Data Streaming Analytics Platforms” and as of the spring of 2016 listed 15 worthy of consideration.

Here are the seven Forrester regards as leaders in rank order:

  1. IBM
  2. Software AG
  3. SAP
  4. TIBCO Software
  5. Oracle
  6. DataTorrent
  7. SQLstream

There are eight additional ‘strong performers’ in rank order:

  1. Impetus Technologies
  2. SAS
  3. Striim
  4. Informatica
  5. WSO2
  6. Cisco Systems
  7. data Artisans
  8. EsperTech

Note that the ranking does not include the cloud-only offerings, which should certainly be included in any competitive comparison:

  1. Amazon Web Services’ Elastic MapReduce
  2. Google Cloud Dataflow
  3. Microsoft Azure Stream Analytics


It’s likely that you can get a copy of the full report from one of these competitors.  Be sure to pay attention to the details.  For example, here are some interesting observations from the numerical scoring table.

Stream Handling:  In this presumably core capability, Software AG got a perfect score while Impetus and WSO2 scored decidedly below average.

Stream Operators (Programs):  Another presumably core capability.  IBM Streams was given a perfect score.  Most other competitors had scores near 4.0 (out of 5.0), except for data Artisans, which was given a noticeably weak score.

Implementation Support: data Artisans and EsperTech were decidedly weaker than others.

In all there are 12 scoring categories that you’ll want to examine closely.

What these 15 leaders and 3 cloud offerings have in common is that they greatly simplify the programming and configuration and hide the gory details.  That’s a value well worth considering.

 

Trends to Watch

IoT and streaming are a fast-growing area with a high rate of change.  Witness the ascendance of SPARK in just the last year to become the go-to open source solution.  All of this development reflects the market demand for more and more tools and platforms to address the exploding market for data-in-motion applications.

All of this means you will need to keep your research up to date during your design and selection period.  However, don’t let the rate of change deter you from getting started.

  • One direction of growth will be the further refinement of SPARK to become a single platform capable of all four architectural elements:  data capture, stream processing, storage, and query.
  • We would expect many of the proprietary solutions to stake this claim also.
  • When this is proven reliable you can abandon the separate components required by the Lambda architecture.
  • We expect SPARK to move in the direction of simplifying set up and maintenance which is the same ground the proprietary solutions are claiming.  Watch particularly for integration of NiFi into SPARK, or at least the drag-and-drop interface elements creating a much friendlier UI.

By Bill Vorhies.

Bill is Editorial Director for our sister site Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001. This article originally appeared here

Summary:  In this Lesson 2 we continue to provide a complete foundation and broad understanding of the technical issues surrounding an IoT or streaming system so that the reader can make intelligent decisions and ask informed questions when planning their IoT system. 

In Lesson 1

  • Is it IoT or Streaming
  • Basics of IoT Architecture – Open Source
  • Data Capture – Open Source with Options
  • Storage – Open Source with Options
  • Query – Open Source with Options

In This Article

  • Stream Processing – Open Source
  • What Can Stream Processors Do
  • Open Source Options for Stream Processors
  • Spark Streaming and Storm
  • Lambda Architecture – Speed plus Safety
  • Do You Really Need a Stream Processor
  • Four Applications of Sensor Data

In Lesson 3

  • Three Data Handling Paradigms – Spark versus Storm
  • Streaming and Real Time Analytics
  • Beyond Open Source for Streaming
  • Competitors to Consider
  • Trends to Watch

Continuing from Lesson 1, our intent is to provide a broad foundation for folks who are starting to think about streaming and IoT.  In this lesson we’ll dive into Stream Processing, the heart of IoT, then discuss the Lambda architecture, whether you really need a Stream Processor, and a structure for thinking about what sensors can do.

 

Stream Processing – Open Source

Event Stream Processing platforms are the Swiss Army knives that can make data-in-motion do almost anything you want it to do.

The easiest way to understand ESP architecture is to see it as three layers or functions: input, processing, and output.

 

Input accepts virtually all types of time-based streaming data, and multiple input streams are common.  Within the main ESP processor, a variety of actions called programs or operators are applied.  The results of those programs are passed to the subscriber interface, which can send alerts via human interfaces or create machine-automated actions, and also pass the data to Fast and Forever data stores.

It is true that Stream Processing platforms can directly receive data streams, but recall that they are not good at preserving accidentally lost data, so you will still want a Data Capture front end like Kafka that can rewind and replay lost data.  It’s likely that in the near future many stream processors will resolve this problem, at which point you should revisit the need for a Kafka front end.

 

Stream Processing Requirements

The requirements for your stream processor are these:

  • High Velocity:  Capable of ingesting and processing millions of events per second, depending on your specific business need.
  • Scales Easily:  These will all run on distributed clusters.
  • Fault Tolerant:  This is different from guaranteeing no lost data.
  • Guaranteed Processing:  This comes in two flavors: 1.) process each event at least once, and 2.) process each event only once.  The ‘only once’ criterion is harder to guarantee.  This is an advanced topic we will discuss a little later.
  • Performs the Programs You Need for Your Application.

 

What Can ESP Programs Do

The real power is in the programs, starting with the ability to do data cleansing on the front end (a kind of mini-MDM) and then to duplicate the stream of data so that each identical stream can be used in different analytic routines simultaneously, without waiting for one to finish before the next begins.  Here’s a diagram from a healthcare example used in a previous article that illustrates multiple streams being augmented by static data and processed by different logic types at the same time.  Each block represents a separate program within the ESP that needs to be created by you.

 

There are a very large number of different logic types that can be applied through these ESP programs including:

  • Compute
  • Copy, to establish multiple processing paths – each with different retention periods of say 5 to 15 minutes
  • Aggregate
  • Count
  • Filter – allows you to keep only the data from the stream that is useful and discard the rest, greatly reducing storage.
  • Function (transform)
  • Join
  • Notification – email, text, or multimedia
  • Pattern (detection) – specify events of interest (EOIs)
  • Procedure – apply an advanced predictive model
  • Text context – could detect, for example, Tweet patterns of interest
  • Text Sentiment – can monitor for positive or negative sentiment in a social media stream

There is some variation in what open source and proprietary packages can do so check the details against what you need to accomplish.
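As a rough illustration of how a few of these operators (Copy, Filter, Function, Count, Notification) chain together, here is a pure-Python stand-in; a real ESP wires these up as separate programs, and the events here are invented.

```python
import itertools

# Hypothetical inbound events
events = [
    {"sensor": "a1", "temp": 71.2},
    {"sensor": "a2", "temp": 104.5},
    {"sensor": "a1", "temp": 69.8},
]

# Copy: two identical streams for two simultaneous processing paths
path1, path2 = itertools.tee(iter(events))

# Filter: keep only the readings that matter, discard the rest
hot = (e for e in path1 if e["temp"] > 100.0)

# Function (transform): reshape for the subscriber
alerts = ("{}: overheat at {}F".format(e["sensor"], e["temp"]) for e in hot)

# Notification: a real ESP would email or text these
for alert in alerts:
    print(alert)

# Count on the second, unfiltered path
print("events seen:", sum(1 for _ in path2))
```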

 

Open Source Options for Stream Processing

The major open source options (all Apache) are these:

Samza:  A distributed stream processing framework. It uses Kafka for messaging, and YARN to provide fault tolerance, processor isolation, security, and resource management.

NiFi: This is a fairly new project still in incubation.  It is different because of its user-friendly drag-and-drop graphical user interface and the ease with which it can be customized on the fly for specific needs.

Storm:  A well-tested, event-based stream processor originally developed by Twitter.

SPARK Streaming:  SPARK Streaming is one of the five components of SPARK, which is the first to integrate batch and streaming in a single enterprise-capable platform.

 

SPARK Streaming and Storm Are the Most Commonly Used Open Source Packages

SPARK has been around for several years but in the last year it’s had an amazing increase in adoption, now replacing Hadoop/MapReduce in most new projects and with many legacy Hadoop/MapReduce systems migrating to SPARK.  SPARK development is headed toward being the only stack you would need for an IoT application.

SPARK consists of five components all of which support Scala, Java, Python, and R.

  1. SPARK:  The core application is a batch processing engine that is compatible with HDFS and other NoSQL DBs.  Its popularity is driven by the fact that it is 10X to 100X faster than Hadoop/MapReduce.
  2. MLlib:  A powerful on-board library of machine learning algorithms for data science.
  3. SPARK SQL:  For direct support of SQL queries.
  4. SPARK Streaming:  Its integrated stream processing engine.
  5. GraphX:  A powerful graph processing engine, useful even outside of streaming applications.

 

Storm by contrast is a pure event stream processor.  The differences between Storm and SPARK Streaming are minor except in the area of how they partition the incoming data.  This is an advanced topic discussed later.

If, after you’ve absorbed the lesson about data partitioning, you determine that it does not impact your application, then in open source SPARK / SPARK Streaming is the most likely choice.

 

Lambda Architecture – Speed Plus Safety

The standard reference architecture for an IoT streaming application is known as the Lambda architecture, which incorporates a Speed Layer and a Safety Layer.

The inbound data stream is duplicated by the Data Capture app (Kafka) and sent in two directions, one to the safety of storage, and the other into the Stream Processing platform (SPARK Streaming or Storm).  This guarantees that any data lost can be replayed to ensure that all data is processed at least once.
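A minimal kafka-python sketch of that split (topic, group, and host names are assumptions): the same topic is read by two independent consumer groups, one feeding the speed layer and one archiving to the safety layer, so each event reaches both paths.

```python
from kafka import KafkaConsumer

# Speed layer: the Stream Processor reads from here with low latency
speed = KafkaConsumer("sensor-events",
                      group_id="speed-layer",
                      bootstrap_servers="localhost:9092")

# Safety layer: archives everything; can rewind to the earliest offset
# so lost data can be replayed and processed at least once
safety = KafkaConsumer("sensor-events",
                       group_id="safety-layer",
                       bootstrap_servers="localhost:9092",
                       auto_offset_reset="earliest")
```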

 

The queries on the Stream Processing side may extract static data to add to the data stream in the Stream Processor, or they may be used to send messages, alerts, and data to consumers via any number of media including email, SMS, customer applications, or dashboards.  Alerts are also natively produced within the Stream Processor.

Queries on the Storage (safety) layer will be batch queries, used to create the advanced analytics embedded in the Stream Processor or to answer ad hoc inquiries, for example to develop new predictive models.

 

Do You Really Need a Stream Processor?

As you plan your IoT platform you should consider whether a Stream Processor is actually required.  For certain scenarios, where the message to the end user is required only infrequently or for certain sensor uses, it may be possible to skip the added complexity of a Stream Processor altogether.

 

When Real Time is Long

When real time is fairly long, for example when notifying the end user of any new findings can occur only once a day or even less often, it may be perfectly reasonable to process the sensor data in batch.

From an architecture standpoint the sensor data would arrive at the Data Capture app (Kafka) and be sent directly to storage.  Using regular batch processing routines today’s data would be analyzed overnight and any important signals sent to the user the following day.

Batch processing is a possibility where ‘real time’ is 24 hours or more and in some cases perhaps as short as 12 hours.  Shorter than this and Stream Processing becomes more attractive.

It is possible to configure Stream Processing to evaluate data over any time period including days, weeks, and even months but at some point the value of simplifying the system outweighs the value of Stream Processing.

 

Four Applications of Sensor Data

There are four broad applications of sensor data that may also impact your decision as to whether or not to incorporate Stream Processing as illustrated by these examples.

Sensor Direct:  For example, reading the GPS coordinates directly from the sensor and dropping them onto a map can readily create a ‘where’s my phone’ style app.  It may be necessary to join static data regarding the user (their home address, in order to limit the map scale), and that could be accomplished external to a Stream Processor using a standard table join, or within the Stream Processor itself.

Expert Rules:  Without the use of data science, it may be possible to write rules that give meaning to the inbound stream of data.  For example, when combined with the patient’s static data an expert rule might be to summon medical assistance if the patient’s temperature reaches 103°.
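In code, such an expert rule is trivial; a hypothetical sketch (the threshold and field names are invented for the example):

```python
FEVER_THRESHOLD_F = 103.0   # the expert-chosen threshold, no model required

def evaluate(patient, temp_f):
    """Combine the patient's static data with the live sensor reading."""
    if temp_f >= FEVER_THRESHOLD_F:
        return "Summon medical assistance for {} (temp {}F)".format(
            patient["name"], temp_f)
    return None

print(evaluate({"name": "J. Doe"}, 103.4))
```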

Predictive Analytics: The next two applications are both within the realm of data science.  Predictive analytics are used by a data scientist to find meaningful information in the data.

Unsupervised Learning:  In predictive analytics, unsupervised learning means applying techniques like clustering and segmentation that don’t require historical data indicating a specific outcome.  For example, an accelerometer in your FitBit can readily learn that you are now more or less active than you have been recently, or that you are more or less active than other FitBit users with whom you compare yourself.  Joining with the customer’s static data is a likely requirement to give the reading some context.

The advantage of unsupervised learning is that it can be deployed almost immediately after the sensor is placed since no long period of time is required to build up training data. 

Some unsupervised modeling will be required to determine the thresholds at which the alerts should be sent.  For example, a message might only be appropriate if the period of change was more than say 20% day-over-day, or more than one standard deviation greater than a similar group of users. 
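A sketch of those two hypothetical thresholds, with the 20% day-over-day and one-standard-deviation cutoffs assumed to have been determined offline:

```python
def should_alert(today, yesterday, peer_mean, peer_std):
    """Alert on a large day-over-day change or deviation from peers."""
    day_over_day = abs(today - yesterday) / yesterday
    z_score = (today - peer_mean) / peer_std
    return day_over_day > 0.20 or z_score > 1.0

# e.g. 9,000 steps today vs 12,000 yesterday; peers average 8,000 +/- 2,000
print(should_alert(9000, 12000, 8000, 2000))  # True: a 25% day-over-day drop
```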

These algorithms would be determined by data scientists working from batch data and exported into the Stream Processor as a formula to be applied to the data as it streams by.

Supervised Learning:  Predictive models are developed using training data in which the outcome is known.  This requires some examples of the behavior or state to be detected and some examples where that state is not present. 

For example, we might record the temperature, vibration, and power consumption of a motor, and also whether that motor failed within the 12 hours following the measurement.  If sufficient training data is available, a predictive model could be developed that predicts motor failure 12 hours ahead of time.

The model in the form of an algebraic formula (a few lines of C, Java, Python, or R) is then exported to the Stream Processor to score data as it streams by, automatically sending alerts when the score indicates an impending failure. 

The benefits of sophisticated predictive models used in Stream Processing are very high.  The challenge may be in gathering sufficient training data if the event is rare as a percentage of all readings or rare over time meaning that much time may pass before adequate training data can be acquired.

Watch for our final installment, Lesson 3.


By Bill Vorhies.

Bill is Editorial Director for our sister site Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001. This article originally appeared here

Summary: This is the first in a series of articles aimed at providing a complete foundation and broad understanding of the technical issues surrounding an IoT or streaming system so that the reader can make intelligent decisions and ask informed questions when planning their IoT system. 

In This Article

  • Is it IoT or Streaming
  • Basics of IoT Architecture – Open Source
  • Data Capture – Open Source with Options
  • Storage – Open Source with Options
  • Query – Open Source with Options

In Lesson 2

  • Stream Processing – Open Source
  • What Can Stream Processors Do
  • Open Source Options for Stream Processors
  • Spark Streaming and Storm
  • Lambda Architecture – Speed plus Safety
  • Do You Really Need a Stream Processor
  • Four Applications of Sensor Data

In Lesson 3

  • Three Data Handling Paradigms – Spark versus Storm
  • Streaming and Real Time Analytics
  • Beyond Open Source for Streaming
  • Competitors to Consider
  • Trends to Watch

In talking to clients and prospects who are at the beginning of their IoT streaming projects, it’s clear that there’s a lot of misunderstanding and there are gaps in their knowledge.  You can find hundreds of articles on IoT, and inevitably they focus on some portion of the whole without an overall context or foundation.  This is understandable since the topic is big and far-ranging, not to mention fast-changing.

So our intent is to provide a broad foundation for folks who are starting to think about streaming and IoT.  We’ll start with the basics and move up through some of the more advanced topics, hopefully leaving you with enough information to then begin to start designing the details of your project or at least equipped to ask the right questions.

Since this is a large topic, we’ll spread it out over several articles with the goal of starting with the basics and adding detail in logical building blocks.

 

Is It IoT or Is It Streaming?

The very first thing we need to clear up for beginners is the nomenclature.  You will see the terms “IoT” and “Streaming” used to mean different things as well as parts of the same thing.  Here’s the core of the difference:  if the signal derives from sensors, it’s IoT (Internet of Things).  The problem is that there are plenty of situations where the signal doesn’t come from sensors but is handled in essentially the same way.  Web logs, click streams, streams of text from social media, and streams of stock prices are examples of non-sensor streams that are therefore not “IoT”.

What they share, however, is that all are streams of data in motion.  Streaming is really the core concept, and we could just as easily have called this “Event Stream Processing”, except that focusing on streaming leaves out several core elements of the architecture such as how we capture the signal, store the data, and query it.

In terms of the architecture, the streaming part is only one of the four main elements we’ll discuss here.  Later we’ll also talk about the fact that although the data may be streaming, you may not need to process it as a stream depending on what you think of as real time.  It’s a little confusing but we promise to clear that up below.

The architecture needed to handle all types of streaming data is essentially the same regardless of whether the source is specifically a sensor or not, so throughout we’re going to refer to this as “IoT Architecture”.  And since this is going to be a discussion that focuses on architecture, if you’re still unclear about streaming in general you might start with these overviews: “Stream Processing – What Is It and Who Needs It” and “Stream Processing and Streaming Analytics – How It Works”.

 

Basics of IoT Architecture – Open Source

Open source in Big Data has become a huge driver of innovation.  So much so that probably 80% of the information available online deals with some element or package for data handling that is open source.  Open source is also almost completely synonymous with the Apache Software Foundation.  So to understand the basics of IoT architecture we’re going to start by focusing on open source tools and packages.

If you’re at all familiar with IoT you cannot have avoided learning something about SPARK and Storm, two of the primary Apache open source streaming projects, but these are only part of the overall architecture.  Later in this series we’ll turn our attention to the emerging proprietary, non-open source options and why you may want to consider them.

Your IoT architecture will consist of four components: Data Capture, Stream Processing, Storage, and Query.  Depending on the specific packages you choose some of these may be combined but for this open source discussion we’ll assume they’re separate.

 

Data Capture – Open Source

Think of the Data Capture component as the catcher’s mitt for all your incoming sources, be they sensor streams, web streams, text, image, or social media.  The Data Capture application needs to:

  1. Be able to capture all your data as fast as it’s coming from all sources at the same time.  In digital advertising bidding, for example, this can easily be 1 million events per second.  There are applications where the rate is even higher, but it’s unlikely that yours will be this high.  However, if you have a million sensors each transmitting once per second, you’re already there.
  2. Must not lose events.  Sensor data is notoriously dirty.  This can be caused by malfunction, age, signal drift, connectivity issues, or a variety of other network, software and hardware issues.  Depending on your use case you may be able to stand some data loss but our assumption is that you don’t want to lose any.
  3. Scale Easily:  As your data grows, your data capture app needs to keep up.  This means that it will be a distributed app running on a cluster as will all the other components discussed here.

Streaming data is time series so it arrives with at least three pieces of information: 1.) the time stamp from its moment of origination, 2.) sensor or source ID, and 3.) the value(s) being read at that moment.
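For illustration, a typical event might look like this; the field names are an assumption for the example, not a standard:

```python
event = {
    "ts": "2016-06-01T12:00:00.000Z",  # 1.) time stamp at origination
    "sensor_id": "pump-0042",          # 2.) sensor or source ID
    "value": 73.4,                     # 3.) the reading(s) at that moment
}
```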

Later you may combine your streaming data with static data, for example about your customer, but that happens in another component.

 

Why Do You Need a Message Collector At All?

Many of the Stream Processing apps including SPARK and Storm can directly ingest messages without a separate Message Collector front end.  However, if a node in the cluster fails they can’t guarantee that the data can be recovered.  Since we assume your business need demands that you be able to save all the incoming data, a front end Message Collector that can temporarily store and repeat data in the case of failure is considered a safe architecture.

 

Open Source Options for Message Collectors

In open source you have a number of options.  Here are some of the better known Data Collectors.  This is not an exhaustive list.

  • FluentD – General purpose multi-source data collector.
  • Flume – Large scale log aggregation framework.  Part of the Hadoop ecosystem.
  • MQ (e.g. RabbitMQ) – There are a number of these lightweight message brokers, descendants of IBM’s original MQ (message queuing) middleware.
  • AWS Kinesis – A managed cloud data collector; the other major cloud services have similar offerings.
  • Kafka – Distributed publish-subscribe messaging system for large amounts of streaming data.

 

Kafka is Currently the Most Popular Choice

Kafka is not your only choice, but it is far and away today’s most common choice, used by LinkedIn, Netflix, Spotify, Uber, and Airbnb, among others.

Kafka is a distributed messaging system designed to tolerate hardware, software, and network failures and to allow segments of failed data to be essentially rewound and replayed, providing the needed safety in your system.  Kafka came out of LinkedIn in 2011 and is known for its ability to handle very high throughput rates and to scale out.

If your stream of data needed no other processing, it could be passed directly through Kafka to a data store.
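A minimal kafka-python sketch of that pass-through (topic and host are assumptions): sensors publish readings to a topic, and a storage consumer on the other side writes them straight to the data store.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each sensor reading is published to the topic; Kafka buffers it durably
producer.send("sensor-events",
              {"ts": "2016-06-01T12:00:00Z", "sensor_id": "pump-0042",
               "value": 73.4})
producer.flush()
```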

 

Storage – Open Source

Here’s a quick way to do a back-of-envelope assessment of how much storage you’ll need.  For example:

  • Number of sensors:  1 million
  • Signal frequency:  every 60 seconds
  • Data packet size:  1 KB
  • Events per sensor per day:  1,440
  • Total events per day:  1.44 billion
  • Events per second:  16,667
  • Total data size per day:  1.44 TB
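The arithmetic behind that estimate, as a sketch you can rerun with your own numbers:

```python
sensors = 1_000_000        # number of sensors
interval_s = 60            # signal frequency: every 60 seconds
packet_kb = 1              # data packet size: 1 KB

events_per_sensor_day = 24 * 3600 // interval_s       # 1,440
events_per_day = sensors * events_per_sensor_day      # 1.44 billion
events_per_second = sensors / interval_s              # ~16,667
tb_per_day = events_per_day * packet_kb / 1e9         # ~1.44 TB

print("{:,} events/day, {:,.0f} events/s, {:.2f} TB/day".format(
    events_per_day, events_per_second, tb_per_day))
```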

 

Your system will need two types of storage, ‘Forever’ storage and ‘Fast’ storage.

Fast storage is for real time lookup after the data has passed through your streaming platform, or even while it is still resident there.  You might need to query Fast storage in just a few milliseconds to add data and context to the data stream flowing through your streaming platform: for example, the min, max, or average readings for sensor X over the last 24 hours or the last month.  How long you hold data in Fast storage will depend on your specific business need.

Forever storage isn’t really forever but you’ll need to assess exactly how long you want to hold on to the data.  It could be forever or it could be a matter of months or years.  Forever storage will support your advanced analytics and the predictive models you’ll implement to create signals in your streaming platform, and for general ad hoc batch queries.

RDBMS is not going to work for either of these needs based on speed, cost, and scale limitations.  Both of these stores are going to be some version of NoSQL.

 

Cost Considerations

In selecting your storage platforms you’ll be concerned about scalability and reliability, but you’ll also be concerned about cost.  Consider this comparison drawn from Hortonworks.

 

For on-premise storage, a Hadoop cluster will be both the low-cost and the best scalability/reliability option.  Cloud storage, also based on Hadoop, is now approaching 1¢ per GB per month from Google, Amazon, and Microsoft.

 

Open Source Options for Storage

Once again we have to pause to explain nomenclature, this time about “Hadoop”.  Many times, indeed most times that you read about “Hadoop” the author is speaking about the whole ecosystem of packages that are available to run on Hadoop. 

Technically, however, Hadoop consists of three elements that are the minimum requirements for it to operate as a database.  Those are: HDFS (the Hadoop file system – how the data is stored), YARN (the scheduler), and Map/Reduce (the query system).  “Hadoop” (the three-component database) is good for batch queries but has recently been largely overtaken in new projects by SPARK, which runs on HDFS and has a much faster query method.

What you should really focus on is the HDFS foundation.  There are alternatives to HDFS, such as S3 and MongoDB, and these are viable options.  However, almost universally what you will encounter are NoSQL database systems based on HDFS.  These options include:

  • HBase
  • Cassandra
  • Accumulo
  • SPARK
  • And many others.

We said earlier that RDBMS was non-competitive based on many factors, not the least of which is that the requirement for schema-on-write is much less flexible than the NoSQL schema-on-read (late schema).  However, if you are committed to RDBMS you should examine the new entries in NewSQL, which are RDBMS with most of the benefits of NoSQL.  If you’re not familiar, try one of these refresher articles here, here, or here.

 

Query – Open Source

The goal of your IoT streaming system is to be able to flag certain events in real time that your customer/user will find valuable.  At any given moment your system will contain two types of data, 1.) Data-in-motion, as it passes through your stream processing platform, and 2.) Data-at-rest, some of which will be in fast storage and some in forever storage.

There are two types of activity that will require you to query your data:

Real time outputs:  If your goal is to send an action message to a human or a machine, or if you are sending data to a dashboard for real time update, you may need to enhance your streaming data with stored information.  One common type is static user information.  For example, adding static customer data to the data stream while it is passing through the stream processor can enhance the predictive power of the signal.  A second type might be signal enhancement.  For example, if your sensor is telling you the current reading from a machine, you might need to compare that to the average, min, max, or other statistical variations from that same sensor over a variety of time periods ranging from the last minute to the last month.

These data are going to be stored in your Fast storage and your query needs to be completed within a few milliseconds.
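As a sketch, such a millisecond enrichment lookup against an HBase-style Fast store might look like this with happybase; the host, table, and column names are assumptions for the example.

```python
import happybase

conn = happybase.Connection("fast-store-host")
stats = conn.table("sensor_stats")

def enrich(event):
    """Add 24-hour context from Fast storage to an in-flight event."""
    row = stats.row(event["sensor_id"].encode())   # point lookup by key
    event["avg_24h"] = float(row[b"s:avg_24h"])
    event["max_24h"] = float(row[b"s:max_24h"])
    return event
```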

Analysis Queries:  It’s likely that your IoT system will contain some sophisticated predictive models that score the data as it passes by to predict human or machine behavior.  In IoT, developing predictive analytics remains the classic two-step data science process: first, analyze and model known data to create the predictive model, and second, export that code (or API) into your stream processing system so that it can score data passing through, based on the model.  Your Forever data is the basis on which those predictive analytic models will be developed.  You will extract that data for analysis using a batch query that is much less time sensitive.

Open Source Options for Query

In the HDFS Apache ecosystem there are three broad categories of query options.

  1. Map/Reduce:  This method is one of the three legs of a Hadoop Database implementation and has been around the longest.  It can be complex to code though updated Apache projects like Pig and Hive seek to make this easier.  In batch mode, for analytic queries where time is not an issue Map/Reduce on a traditional Hadoop cluster will work perfectly well and can return results from large scale queries in minutes or hours.
  2. SPARK:  Based on HDFS, SPARK has started to replace Hadoop Map/Reduce because it is 10X to 100X faster at queries (depending on whether the data is on disc or in memory).  Particularly if you have used SPARK in your streaming platform it will make sense to also use it for your real time queries.  Latencies in the milliseconds range can be achieved depending on memory and other hardware factors.
  3. SQL:  Traditionally the whole NoSQL movement was named after database designs like Hadoop that could not be queried by SQL.  However, so many people were fluent in SQL, and not in the more obscure Map/Reduce queries, that there has been a constant drumbeat of development aimed at allowing SQL queries.  Today, SQL is so common on these HDFS databases that it’s no longer accurate to say NoSQL.  However, all these SQL implementations require some sort of intermediate translator, so they are generally not suited to millisecond queries.  They do, however, make your non-traditional data stores open to any analysts or data scientists with SQL skills.  (A short SPARK SQL sketch follows this list.)
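As promised above, here is a short SPARK SQL sketch of a batch analysis query against Forever storage (SPARK 1.x-era API; the HDFS path, table, and column names are assumptions):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="AdHocQuery")
sqlContext = SQLContext(sc)

# Forever storage scanned in batch; latency in minutes is acceptable here
df = sqlContext.read.parquet("hdfs:///sensors/forever/")
df.registerTempTable("readings")

daily = sqlContext.sql("""
    SELECT sensor_id, AVG(value) AS avg_val, MAX(value) AS max_val
    FROM readings
    GROUP BY sensor_id
""")
daily.show()
```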

Watch for Lessons 2 and 3 in the coming weeks.

