Applying Big Data Analytics to Clickstream Data


An important application for Big Data, indeed, the original application for which Big Data technologies were first created, is clickstream analytics.  Analyzed with predictive models, clickstream data can yield rich insights into user engagement, site performance, query response times, and more.


Clickstreams are records of users' interactions with a website or other computer application.  Each row of the clickstream contains a timestamp and an indication of what the user did.  Every click or other action is logged, hence the term "clickstream."  In some circumstances, what the website does is also logged; this is useful when the website behaves differently for different users, for example by posting recommendations.
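To make the row structure concrete, here is a minimal sketch of one clickstream record. The field names and values are purely illustrative assumptions, not a standard schema:

```python
from datetime import datetime, timezone

# A hypothetical clickstream row; every field name here is illustrative.
click_event = {
    # when the action happened
    "timestamp": datetime(2013, 6, 1, 14, 32, 7, tzinfo=timezone.utc).isoformat(),
    # who acted (cookie or login identifier)
    "user_id": "u-48213",
    # what the user did: click, pageview, search, ...
    "action": "click",
    "url": "/electronics/cameras/dslr/nikon/d5200",
    # optionally, what the site did in response (e.g. a recommendation shown)
    "site_response": "recommended:tripod-kit",
}
```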


Clickstream data generally comes from one of two sources: (1) logs from the servers that originally served the website, or (2) messages sent to a central collection server by JavaScript embedded in the website's pages.


Many of Think Big's customers have clickstream data that they wish to analyze for eCommerce, brand, operational, or other uses.  A typical approach is to load the data into a relational database, such as Oracle, and derive analytics with multiple SQL calls.  Because of the sheer scale of clickstream data, this approach tends not to perform acceptably.


We have developed a Big Data approach to clickstream analytics that produces aggregates for reporting as well as session statistics, cohort analysis, and path analysis.  An integrated A/B testing system is also available.


Our clickstream solution is based on proven techniques and technologies we've successfully applied to similar projects in the past.  It can process events in batches at any interval down to approximately five minutes.  Latency, the interval between when event data is first available to process and when analytics are available, is generally a few minutes.  For most customers and most use cases this latency is adequate, and it is generally one to three orders of magnitude lower than that of solutions built on relational databases.


To make access to analytics fast under a high request load, we pre-compute numerous aggregates (the cross product of metrics and dimensions) on a periodic basis.  The structure of Hadoop allows us to do this in near-constant time regardless of the number of aggregates we are producing.  We make a single scan through the data (or through sessionized data, depending on the nature of the metric being computed).  For each element scanned, a single key-value pair is emitted for each aggregate that the element participates in; the key identifies the aggregate.
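The single-scan pattern can be sketched as a tiny map/reduce pair. This is a hedged illustration, not our production code: the key format and the `dimensions` field are assumptions, and in a real deployment these functions would run inside Hadoop mappers and reducers rather than in-process:

```python
from collections import defaultdict

def map_event(event):
    """Emit one (aggregate_key, 1) pair for every aggregate this
    event participates in (the mapper side of the scan)."""
    yield ("clicks:total", 1)
    for dim, value in event.get("dimensions", {}).items():
        yield (f"clicks:{dim}={value}", 1)

def reduce_pairs(pairs):
    """Sum counts per aggregate key (the reducer side)."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

events = [
    {"dimensions": {"gender": "male",   "age_band": "20-29"}},
    {"dimensions": {"gender": "female", "age_band": "20-29"}},
]
pairs = [p for e in events for p in map_event(e)]
totals = reduce_pairs(pairs)
# e.g. totals["clicks:total"] == 2 and totals["clicks:age_band=20-29"] == 2
```

Because every pair is emitted during one pass over the data, adding another aggregate adds only one more emitted pair per matching event, not another scan.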


As an example, consider that we are scanning a click event record.  We certainly want to count all clicks (first key-value pair).  Suppose the URL fits hierarchically into five different categories within the website.  We emit five more pairs with keys containing the categories.  Suppose the user is known to be a male, age 26.  We could then emit one pair for the male aggregate, one for the age range we're bucketing (say 20-29), and perhaps one for males aged 20-29.  We could also emit one pair for each of the five categories combined with gender, age, or both.  Reducers then add up the data for each aggregate combination and upload the result to HBase, where it is available for efficient retrieval.
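The emission step in this example can be written out directly. This is a minimal sketch under assumed key formats and field names; the hierarchy of five categories and the demographic fields mirror the scenario above:

```python
def emit_aggregates(event):
    """Yield one (aggregate_key, 1) pair for every aggregate a single
    click participates in. Key strings are illustrative, not a schema."""
    yield ("all_clicks", 1)                      # count every click
    categories = event["categories"]             # e.g. 5 hierarchical site categories
    gender = event.get("gender")
    age_band = event.get("age_band")
    for cat in categories:                       # one pair per category level
        yield (f"cat={cat}", 1)
    if gender:
        yield (f"gender={gender}", 1)
    if age_band:
        yield (f"age={age_band}", 1)
    if gender and age_band:
        yield (f"gender={gender}|age={age_band}", 1)
    # each category combined with gender, age, or both
    for cat in categories:
        if gender:
            yield (f"cat={cat}|gender={gender}", 1)
        if age_band:
            yield (f"cat={cat}|age={age_band}", 1)
        if gender and age_band:
            yield (f"cat={cat}|gender={gender}|age={age_band}", 1)

event = {
    "categories": ["electronics", "cameras", "dslr", "nikon", "d5200"],
    "gender": "male",
    "age_band": "20-29",
}
pairs = list(emit_aggregates(event))
# 1 + 5 + 1 + 1 + 1 + (5 * 3) = 24 pairs for this one click
```

Downstream, reducers sum the pairs per key and write each total to HBase, exactly as the paragraph above describes.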


With our Big Data architecture, we have built clickstream analytics solutions that produce hundreds of millions of distinct aggregates per day.  Because all the aggregates are produced concurrently, rather than one by one as they would be in an RDBMS solution, we are able to scale without increasing latency.  And because we can scale naturally, there is no need to employ the complex staging tables an RDBMS solution would require for efficiency, greatly decreasing the development time needed to deploy the solution.
