Much Hadoop about Nothing? We Don’t Think So.

In the mid-2000s, when I was Chief Architect and later VP Engineering at Quantcast, we were using one of the first Hadoop clusters as a low-cost way to process the vast amount of data needed to directly measure many of the largest websites in the world. We even helped a skeptical Facebook see the value of this new, cool technology. Back then, we were processing petabytes of data each day, applying predictive models to offer our customers new features that we couldn’t offer before, all thanks to Hadoop. For example, we developed “lookalike” programs, a new advertising approach to find millions of new people who are likely to respond to any one of thousands of ad campaigns. We also built realtime systems powered by NoSQL databases to compute tens of millions of these predictive models each second, all within milliseconds.


When I saw the power of what Hadoop, NoSQL, and predictive analytics could do, I knew enterprises would benefit greatly if they could find a way to intelligently process the vast amounts of data they were collecting, or their “Big Data.” I also knew it would be a very long time before the skills and products would mature to the point where Big Data could be considered “plug and play.” Images of my experience building C-bridge came into view, as I recognized that the “skills and technology gap” that had made C-bridge successful back then was about to open up again.


So, on Easter Sunday 2010, Katie, Rick, and I decided to start Think Big. Our goal was to recruit the brightest data scientists and data engineers to fill the knowledge gap that is so critical to the success of any Big Data project. From the very beginning, we knew we could differentiate ourselves from traditional system integrators by developing a unique methodology, combining it with the best talent, and working side-by-side with clients to implement Big Data projects that would produce amazing results. Although we are often unable to speak about them publicly, we have helped our Fortune 50 clients exceed their business goals by using Big Data to:

  • Launch and tailor new product offerings to consumers;
  • Build predictive patterns to prevent device failures before they occur; and
  • Identify ways to increase operational efficiencies.


Our goal is to provide services that allow clients to harness the power of Big Data to create radical new value by automating decisions, tailoring interactions to consumers, and enabling intelligent networks of devices. Along the way, we intend to turn the traditional consulting model on its head, evangelizing fast, nimble projects that emphasize learning instead of the long, drawn-out ones that people envision when they hear the word “consultant.” I agree with Geoffrey Moore that The Tide Has Turned: Big Data is going to usher in a new wave of innovation and economic growth across a range of industries, and companies will need purpose-built consultancies to enable that transformation. That’s why we created Think Big, and why we focus on one thing only: Help Clients Achieve the Promise and Value of Big Data.


Doing so requires a careful blend of technology, skills and planning, which together underpin Think Big’s three major service offerings: Imagine, Illuminate and Implement. Big Data is complex and no one should enter into it blindly, so we offer our “Imagine” services to ensure that the roadmap and use cases are properly defined and prioritized. We offer “Illuminate” services, which include expert training and side-by-side mentoring, so our clients are prepared internally to fully realize the value their data can bring. Finally, we offer “Implement” services, which really bring our customers’ Big Data to life. The combination of our data scientists and engineers and our proven “test and learn” methodology enables our clients to innovate by building large-scale Big Data analytics, data integration and real-time systems for device data, advertising, network data, consumer recommendations, retail and financial market data, and more.


Recently, there has been a backlash against Hadoop because companies are struggling to find the ROI behind their Big Data technology. The question that people should be asking isn’t “Is this ‘Much Hadoop about Nothing’?” but rather “Do you have the right partner?” Think Big is here to help.

One response to “Much Hadoop about Nothing? We Don’t Think So.”

  1. Amen to the data munging being most of the work. We’re currently working on a customer project and I’ve already written two different parsers for their CSV data. The customer’s programmers’ efforts to piece together the data from their web logs have been Herculean. And still we’re not 100% sure of what we have. There are all kinds of fields with numeric codes that are just plain hard to figure out, even with the business people and some of the coders present; much of it is legacy codes that have to be extracted from comments in their .h header files. We were getting good results with a feature that turned out to be cheating: while it made sense to use it, the value in the logs didn’t reflect the value in the incoming request, but rather the value in the outgoing response, which indirectly coded the category for classification. The kicker is that this data’s only an approximation of the real problem. But it’s the best we have, and while more data’s better than more learning, some data’s better than nothing.

     Other customers have wanted us to find things for them (e.g. forward earnings statements in 10Q footnotes, opinions of cars in blogs, recording artists in news), but there was no existing data, so we had to (help them) create it. That’s when they run into problems like whether Bob Dylans is a person mention in . It turns out the customers are not semantic grad students or ontology boffins, so they usually just don’t care.

     But the real problem with all of this tuning to within a percent of a system’s life is that it’s usually just overfitting when you go out into the wild. For instance, the customer mentioned above plans to change the overall organization and the instruction text on their site, so that none of our training data will exactly replicate the runtime environment.
