Real-time processing is an important part of the Hadoop architecture, but is it always the best approach to a given problem? In this post, we will provide guidelines to help you decide when to use streaming technologies and when to use batch processing.
What is real-time in the context of big data? At a high level, it means data arriving from an external source and leading to information coming out or actions being triggered. An architecture can be considered real-time if the latency from the moment data is input to an action or to analytics reflecting that data is no more than a few seconds.
Tools that support building real-time systems in the Hadoop ecosystem fit into three general categories:
- Streaming technologies (e.g., Spark Streaming and Storm)
- SQL on Hadoop (e.g., Presto, Drill, and Impala)
- NoSQL databases (e.g., HBase, Cassandra, and Elasticsearch)
These tools can be used in various combinations to yield systems with latencies in seconds or less.
Many people assume that stream processing is inherently better than batch processing because derived data will be fresher. While that can be true, there are many situations where a batch architecture makes more sense. Here are a few:
- Unique user counts (how many distinct users interacted with a website) have no instantaneous meaning; they are defined over a period and so are naturally computed in batch
- Ratios of measured variables (such as click-through or conversion rates) only reach statistical significance when computed over a period of time, which might as well be done with a simpler batch architecture
- Certain computations, such as deep dimensional counts with rollups, are significantly more efficient if the rolling up is done once in batch at the end of the period rather than as each new value arrives
- Batch tools are more mature and stable, and usually require less hardware than a corresponding streaming architecture
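To make the first two points concrete, here is a minimal Python sketch (the toy data and function names are ours, not any particular framework's) showing why a unique-user count and a conversion-style ratio fall out naturally from a single batch pass over a period's events:

```python
# Hypothetical click events for one day: (timestamp, user_id) pairs.
events = [
    (1, "alice"), (2, "bob"), (3, "alice"),
    (4, "carol"), (5, "bob"), (6, "alice"),
]

def batch_unique_users(events):
    """One pass over the full period yields the exact distinct count."""
    return len({user for _, user in events})

def batch_click_through_rate(clicks, impressions):
    """Ratios only stabilize over a period; one batch division suffices."""
    return clicks / impressions if impressions else 0.0

print(batch_unique_users(events))        # 3 distinct users
print(batch_click_through_rate(6, 120))  # 0.05
```

A streaming version of the same distinct count would have to hold per-user state indefinitely (or approximate it), which is part of why the batch formulation is simpler.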
Still want a real-time architecture for your Hadoop ecosystem? Here are some options:
- Lambda architecture—data comes in and then forks to both streaming and batch processing, with each feeding a serving layer with derived data. The argument for two parallel processing paths is based on an assumption that the streaming path might be unreliable (see figure 1).
Figure 1. Lambda architecture example
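The Lambda idea can be sketched in plain Python (the names here are illustrative; a real system would use Hadoop for the batch layer and something like Spark Streaming or Storm for the speed layer). The batch layer periodically recomputes authoritative views from full history, the speed layer maintains incremental deltas for events the batch run has not seen yet, and the serving layer merges the two:

```python
def batch_layer(all_events):
    """Recompute counts from the full history (slow but authoritative)."""
    counts = {}
    for key in all_events:
        counts[key] = counts.get(key, 0) + 1
    return counts

def speed_layer(recent_events):
    """Incrementally count only events since the last batch run."""
    deltas = {}
    for key in recent_events:
        deltas[key] = deltas.get(key, 0) + 1
    return deltas

def serving_layer(batch_view, realtime_view):
    """Merge both views; the batch view absorbs deltas once it catches up."""
    merged = dict(batch_view)
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

history = ["page_a", "page_b", "page_a"]  # already processed in batch
recent = ["page_a", "page_c"]             # not yet in any batch run
view = serving_layer(batch_layer(history), speed_layer(recent))
print(view)  # {'page_a': 3, 'page_b': 1, 'page_c': 1}
```

Note the operational cost this implies: the same counting logic is written twice, once per path, which is a common criticism of Lambda.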
- Kappa architecture—a reaction to the Lambda architecture where Kafka is used to buffer data input to a streaming-only processing path. If an error is found, then historical data can be re-processed to correct the error (see figure 2).
Figure 2. Kappa architecture example
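The essence of Kappa is that raw input is kept in a replayable log, so re-processing is just running the stream again from the beginning. A few lines of Python show the idea (a plain list stands in for a Kafka topic here):

```python
log = []  # stand-in for a Kafka topic that retains all input

def append(event):
    log.append(event)

def buggy_processor(event):
    return event["value"]  # bug: ignores the multiplier field

def fixed_processor(event):
    return event["value"] * event.get("multiplier", 1)

def replay(processor, from_offset=0):
    """Rebuild derived state by streaming the retained log again."""
    return sum(processor(e) for e in log[from_offset:])

append({"value": 2, "multiplier": 3})
append({"value": 5})

wrong_total = replay(buggy_processor)  # incorrect derived data
right_total = replay(fixed_processor)  # corrected by re-processing the log
print(wrong_total, right_total)        # 7 11
```

The key design choice is that correction requires no separate batch path, only enough log retention to replay the history you care about.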
- Mu architecture—this builds on the Lambda architecture; there is still a batch processing path and a streaming path, but the two paths are used for different derived data, depending on which path is most appropriate for each. A common serving layer combines the results of both processing paths, presenting a unified view to the user (see figure 3).
Figure 3. Mu architecture example
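A minimal Python sketch of the Mu routing idea (names are illustrative): each aggregate is assigned to whichever path suits it best, but both aggregators write into one serving store in the same format, so the serving layer does not care which path produced a given value:

```python
serving_store = {}  # shared store, e.g. time-bucketed rows in HBase

def stream_aggregator(metric, events):
    """Latency-sensitive aggregates, updated as events arrive."""
    for value in events:
        serving_store[metric] = serving_store.get(metric, 0) + value

def batch_aggregator(metric, events):
    """Aggregates better done per period, e.g. distinct counts."""
    serving_store[metric] = len(set(events))

stream_aggregator("page_views", [1, 1, 1, 1])      # streaming path
batch_aggregator("unique_users", ["a", "b", "a"])  # batch path

# The serving layer reads both kinds of aggregate the same way.
print(serving_store)  # {'page_views': 4, 'unique_users': 2}
```

Unlike Lambda, no aggregate is computed twice; the duplication is replaced by a per-aggregate routing decision.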
The illustration below (figure 4) provides an example of the Mu architecture combined with Think Big’s Dashboard Engine for Hadoop. The diagram shows both a batch and a stream aggregator, and time-based aggregates are stored in HBase. We also use an API that simulates SQL, so tools like Tableau and MicroStrategy can send SQL to the API server, which then translates it into our storage format in HBase. Because this is a Mu architecture, each aggregate can be assigned to either the batch or the streaming pathway, with the resulting aggregate data stored and served in the same manner.
Figure 4. Example of the Mu Architecture with Think Big’s Dashboard Engine for Hadoop
The bottom line is that there are use cases for both batch and real-time data processing, and many systems will include some of each. Batch tools are typically more stable, less subject to frequent revision, and generally require fewer resources. Real-time processing offers tremendous potential, but there are tradeoffs and no simple answers. Indeed, batch architectures will continue to be the technology of choice in many situations.