At conferences and on client campuses, I am frequently cornered by people seeking to understand whether they have Big Data. Is their data “big”? Do they need a tool like Hadoop?
At the heart of the confusion is a simple question: How big does a dataset need to be to be considered a “Big Dataset”? Industry articles frequently reference Google’s search data and the social data of Twitter, Facebook, and others. But the deluge of Big Data encompasses social media, search, customer interactions, device data, and the recently coined “exhaust data.”
Big Datasets come from a variety of sources, represent data from many industries, and are used in different ways by successful firms. But, at the core, Big Datasets are large, event-oriented, and multi-structured.
Big Datasets Are Large
While size alone does not define Big Data, Big Datasets are typically large enough to challenge existing tools. Due to their size, manipulating and analyzing Big Datasets requires specialized engines and parallel compute power. However, Big Datasets are larger because they contain a different type of data, not a greater volume of traditional data.
Big Datasets Are Event Oriented
Big Datasets are large because they capture more detailed information, and in particular information about events. A traditional dataset for an online retailer houses information about transactions and profiles. A Big Dataset will contain every click that led to a transaction, how the transaction was made, how long the process took, and so on. And customers no longer have static profiles in Big Data; even preferences are represented as events – changes over time in the tastes and tendencies of a customer, measured at a micro scale. Did you start to type a product query and stop? Did you browse the search page but never click a product? Did your attention focus on a recommended product for longer than usual? These event-oriented datasets enabled a Think Big client to identify latent and growing product demand before its supply chain analysts could.
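The contrast between a static profile and an event stream can be sketched in a few lines of Python. The event names and fields here (`rec_dwell`, `dwell_ms`, and so on) are illustrative, not drawn from any particular client system:

```python
from collections import Counter

# A traditional profile: one static record per customer.
profile = {"customer_id": 42, "favorite_category": "outdoor"}

# An event-oriented view: every micro-interaction is its own timestamped record.
events = [
    {"customer_id": 42, "type": "query_abandoned", "text": "camp sto", "ts": 1},
    {"customer_id": 42, "type": "search_viewed", "query": "camp stove", "ts": 2},
    {"customer_id": 42, "type": "rec_dwell", "product": "stove-x", "dwell_ms": 5400, "ts": 3},
    {"customer_id": 42, "type": "rec_dwell", "product": "stove-x", "dwell_ms": 6100, "ts": 4},
]

def latent_demand(events, dwell_threshold_ms=5000):
    """Count long dwells per product -- a crude interest signal
    that never appears in a transaction table."""
    counts = Counter()
    for e in events:
        if e["type"] == "rec_dwell" and e["dwell_ms"] >= dwell_threshold_ms:
            counts[e["product"]] += 1
    return counts

print(latent_demand(events))  # Counter({'stove-x': 2})
```

Notice that the abandoned query and the dwell times simply do not exist in the profile record; only the event stream preserves them for later analysis.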
Event data extends beyond the interactions of customers or devices. Exhaust data captures the byproducts of events. Typing a URL into a web browser loads a webpage; but before that happens, the browser quietly makes a DNS request to determine the location of the webpage. Nearly every way one interacts with the Internet has such a byproduct, and firms like DNS hosts and Internet Service Providers are storing it. Think Big worked with a client to show that even hackers generate exhaust event data, and that it can reveal where the hacker is hiding.
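A minimal sketch of what mining DNS exhaust might look like. The log line format and domain names below are invented for illustration; real resolver logs vary by vendor:

```python
import re
from collections import Counter

# Hypothetical DNS resolver log lines (format invented for this example).
log_lines = [
    "2024-05-01T12:00:01Z client=10.0.0.5 query=www.example.com type=A",
    "2024-05-01T12:00:02Z client=10.0.0.5 query=a9f3c1.badhost.io type=A",
    "2024-05-01T12:00:03Z client=10.0.0.9 query=b7e2d4.badhost.io type=A",
]

LINE_RE = re.compile(r"client=(?P<client>\S+) query=(?P<query>\S+)")

def domains_by_client(lines):
    """Tally which second-level domains each client resolves.
    Repeated lookups of odd, high-entropy subdomains under one domain
    can point at tunneling or beaconing activity."""
    tally = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            parts = m.group("query").split(".")
            sld = ".".join(parts[-2:])  # e.g. 'badhost.io'
            tally[(m.group("client"), sld)] += 1
    return tally

print(domains_by_client(log_lines))
```

Nobody emits these records on purpose; they are a byproduct of browsing, which is exactly why they are called exhaust.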
Big Datasets Are Multi-structured
More event data is being captured today than analysts yet know how to extract value from, but the mantra of Big Data is “store everything, analyze it later.” Big Datasets contain raw event data, unaltered by prior hypotheses. Some is structured and quantified, like a purchase. Some is multi-structured, like a network packet. And some is completely unstructured, like a customer call center recording.
Storing raw events en masse means that Big Datasets are structurally and semantically different from traditional data. But this tradeoff enables analysts to find relationships post hoc, without knowing in advance which data and which fields they needed to store, or how. This flexibility of Big Datasets enables the identification of novel relationships, like that between social media behavior and economic performance or payment defaults.