Clustering Clickstream Data Using Apache Spark


Want to know more about your customers? Of course, you can always ask about their preferences, interests, intent to purchase and more. However, this process potentially introduces bias, as customers may tell you exactly what you want to hear or, worse, say one thing and then do another. A supplemental (and some would say better) method of understanding customers is to observe their actual behavior, and there are few better places to get real customer interaction data than online.

After capturing a data set of website clickstreams, you can choose among various technologies and applications for parsing and analyzing customer interactions. Hadoop’s MapReduce is one such option, but MapReduce tends to be painfully slow, as each pass through the data goes through a Map, Shuffle, Sort, and Reduce process. And if you need to iterate, it’s possible to have hundreds of MapReduce jobs going at any one time, along with plenty of I/O hits. Translation: there’s a better way of doing things.

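Enter Apache Spark. This exciting technology is gaining significant adoption in IT circles as a speedy platform for data processing and exploratory analytics. By taking a clickstream data set and using Spark instead of Hadoop’s tedious MapReduce, it’s possible to gain business insights up to 10 to 100 times faster, particularly for iterative work, because Spark can keep working data in memory between passes instead of writing it to disk after every step.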

One method of clickstream data analysis is to look for similarities among customer groups. For example, suppose you want to group audiences in a pretty messy numeric dataset. By using the built-in K-Means clustering function in Spark’s MLlib, you can get a clearer picture of the various groupings of continuous data and how they relate to each other.
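To make the idea concrete, here is a minimal plain-Python sketch of what K-Means does. It is not Spark’s MLlib implementation, only an illustration of the assign-then-average loop, and the session data (pages viewed, minutes on site) is made up for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two numeric points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(dim) / n for dim in zip(*points))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        for p in points
    ]


def k_means(points, k, iterations=20, seed=0):
    """Alternate between assigning points and re-averaging the centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean_point(members) if members else centers[i])
        if new_centers == centers:  # nothing moved, so we have converged
            break
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical sessions: (pages viewed, minutes on site).
sessions = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (8.0, 9.0), (8.5, 9.5), (9.0, 8.8)]
centers, labels = k_means(sessions, k=2)
print(centers, labels)
```

On a real clickstream you would hand this work to MLlib so that it runs across the cluster; the sketch only shows why the method needs numeric data, since each center is an average.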

However, let’s now suppose we want to use categorical data, which is data that cannot be described as a number (e.g. “color”, “heavy” vs. “light”, “type”, etc.). K-Means does not apply directly here because it relies on averaging numeric values, so this can be a difficult process unless you utilize Think Big’s recently open-sourced K-Modes algorithm for Spark. With K-Modes you can cluster users by looking at the most frequently occurring values, or most common responses, for each attribute.
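As a rough illustration of the K-Modes idea, the plain-Python sketch below swaps the average for the most frequent value per attribute and swaps Euclidean distance for a count of mismatching attributes. It is not Think Big’s Spark implementation, and the attribute values (device, referrer, section) are hypothetical.

```python
import random
from collections import Counter


def mismatches(a, b):
    """Number of attributes on which two records disagree."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Most frequent value in each attribute position."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def assign(records, modes):
    """Label each record with the index of the mode it mismatches least."""
    return [
        min(range(len(modes)), key=lambda i: mismatches(r, modes[i]))
        for r in records
    ]


def k_modes(records, k, iterations=20, seed=0):
    """Alternate between assigning records and recomputing cluster modes."""
    rng = random.Random(seed)
    # Start from k distinct records so no two initial modes are identical.
    modes = rng.sample(sorted(set(records)), k)
    for _ in range(iterations):
        labels = assign(records, modes)
        new_modes = []
        for i in range(k):
            members = [r for r, label in zip(records, labels) if label == i]
            # Keep the old mode if a cluster ends up empty.
            new_modes.append(column_modes(members) if members else modes[i])
        if new_modes == modes:  # modes stopped changing
            break
        modes = new_modes
    return modes, assign(records, modes)


# Hypothetical visitors: (device, referrer, section viewed).
visitors = [
    ("mobile", "search", "sports"),
    ("mobile", "search", "news"),
    ("mobile", "social", "sports"),
    ("desktop", "email", "finance"),
    ("desktop", "email", "news"),
    ("desktop", "direct", "finance"),
]
modes, labels = k_modes(visitors, k=2)
print(modes, labels)
```

As with K-Means, the result depends on the starting modes, so in practice it is common to try several seeds and compare the total mismatch count.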

As they say in the “ShamWow” commercials, “But wait, there’s more…”

Marissa Saunders, data scientist at Think Big, has created an easy-to-access tutorial on using Spark for clustering clickstream data. In this 30-minute video, Marissa explains the difference between k-means and k-modes clustering and why Spark is the perfect choice for this endeavor. You’ll also view a sample use case for clickstream analysis on Spark utilizing a publicly available dataset. This is the kind of data you’re probably going to encounter when analyzing clickstreams: data that are large in volume, not cleansed, and in JSON format.

Identifying different user types can drive insight into online customer behavior. It’s important to have answers to questions such as “who are my customers?”, “what are they looking at?”, “where do they come from?”, “how similar are they to one another?” and “what else might they be interested in viewing?” The name of the game is understanding customers better and getting insights faster than your competition. Using Apache Spark for clickstream analysis can help you arrive at answers sooner than you probably expect.

