The “2.0” moniker is not just marketing hype. A lot has changed in the last year to justify the version bump:
- A whole new Hadoop. No, “Big Data 2.0” is not synonymous with “Hadoop 2.0”, but as the open source platform has been undeniably central to the Big Data story to date, major improvements are noteworthy. The new YARN resource manager makes it much easier to share a cluster with custom applications and MapReduce alternatives. One of the first YARN tutorials I saw deployed trusty ol’ memcached as a service on the cluster.
- It’s not all about you, Hadoop. While Hadoop has been opening up, some notable projects have been emerging as likely candidates to supplement MapReduce. Storm and Spark, for instance, have both joined the Apache family. Spark is particularly interesting for its smart use of memory, which greatly accelerates workloads such as the iterative calculations so common in machine learning, as well as query-heavy exploratory analytics. At Think Big, we have deployed Storm for several customers for in-flight processing of data streams.
- It’s not even all about “Big Data” anymore. New, disruptive technologies often enjoy a honeymoon, when everything is “green field” and “new new”. But as Big Data tools like Hadoop and NoSQL databases have matured, they have proven themselves to an ever-expanding audience. And that means we’re just as likely to be designing a Hadoop-based solution for an established enterprise customer as we are for a cutting-edge startup. Integrating with existing databases and data warehouses is central to many of our projects today, and will only become more critical as enterprise adoption continues to grow.
What to expect from Big Data 2.0
What differentiated Web 2.0 from Web 1.0? It wasn’t just the relative lack of “because we can!” businesses like pets.com. After that bubble burst, some relatively modest engineering advancements like AJAX enabled some revolutionary breakthroughs in user experience. The result? An entirely new generation of interactive, user-focused web sites and applications–and an empowered, engaged user base to enjoy them. What would Facebook be without user-contributed content? “Phonebook”, that’s what. No thanks.
Similarly, expect these and other evolutionary changes to conspire over the coming months to shift the landscape in some important ways:
- The enterprise takes over Hadoop. For all of its virtues, Hadoop is not going to take over the enterprise. Most companies will continue to rely on packaged applications running on relational databases for the foreseeable future. But Hadoop excels at busting through these data silos to help connect the dots like never before. As more IT departments seek to fit these tools into production environments, the elephant is going to need to dress up to fit into its new Fortune 500 digs. Expect much more emphasis on security-focused projects and solutions like Falcon, Sentry, Knox, and XA Secure than on tech-focused efforts like yet another columnar storage format.
- More and more SQL in NoSQL. It’s easy to forget how long it has taken today’s BI, analytics, and visualization tools to reach their current state of maturity–and they started with the 40-year-solid foundation of relational databases. The obvious way forward is to leverage these existing investments in SQL tools and skills. Hive showed it was possible to add a SQL layer to Hadoop, and Shark showed it could be fast. Plenty has already been written about Impala, Stinger, and the rest. My pick for the project to watch? Apache Phoenix. It’s immature, to be sure, but it’s the SQL layer that HBase should have come with all along; its inclusion in Hortonworks HDP 2.1 was one of the best surprises of my summer.
- There will be no Big Data 3.0. “Big Data” became capitalized when Google broke with parallel processing canon and eschewed “enterprise class” hardware, special-purpose network interconnects, expensive SAN storage and the like. Using MapReduce to run relatively simple jobs on commodity hardware against a software-based distributed file system was clearly revolutionary, but the world has taken notice and learned. Document-oriented databases, distributed search engines, graph databases, key-value stores, and in-memory databases are no longer fringe. Today’s system architects–and mainstream vendors–are increasingly incorporating disparate technologies and components into data platforms to meet their wide-ranging demands. This blurring will only accelerate, and I’m not sure we’ll even be capitalizing “Big Data” next year.
And while we’re de-capitalizing “Big Data”, we should note that this week, Seagate introduced the first 8TB 3.5″ hard drive. At this rate, “Big Data”–capitalized or not–will soon simply be known as “data”.