To understand the role of open source in data analytics, it helps to recall a concept from telecommunications: the last mile. The last mile is the stretch from a telecom or cable operator's distribution point to the customer's home, and it was long seen as a bottleneck.
But there was good reason for that bottleneck: some 95% or more of all the wiring in the system sat in the last mile.
Something similar is happening in the world of open source data analytics. Hadoop is rightly celebrated as a game changer, but people forget about the “last mile”: the process of converting that power into something that most people can use. In my view, the revolution in big data will come from solving the “last mile” problem, that is, making big data useful for the masses.
Business and technology executives who are building data analytics capabilities to take advantage of big data can avoid many problems by understanding the relationship between open source and the last mile of data analytics.
The Awesome Power of Open Source
No one seeking to create a best-in-class analytics system can afford to ignore open source. Hadoop and its ecosystem don't solve every challenge, but they do solve many. A wealth of data never before available has led to deeper insights and richer models that support advanced discovery and automated execution through predictive analytics and recommendation engines.
Hadoop has also led to transformative innovations in how data is processed and stored. Think about the way data lakes enable storage of massive amounts of data without first having to create a schema or even know what you might do with that data later. Consider how sessionized schemas help aggregate data related to a user so you can analyze and attribute behavior across channels.
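The sessionization idea above can be sketched in a few lines: group each user's events across channels, then split them into sessions wherever the gap between consecutive events exceeds a timeout. This is a minimal illustration, not any particular product's implementation; the event tuples, the 30-minute gap, and the `sessionize` helper are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical click events: (user_id, channel, timestamp).
events = [
    ("u1", "web",    datetime(2024, 1, 1, 9, 0)),
    ("u1", "web",    datetime(2024, 1, 1, 9, 10)),
    ("u1", "mobile", datetime(2024, 1, 1, 13, 0)),
    ("u2", "email",  datetime(2024, 1, 1, 9, 5)),
]

SESSION_GAP = timedelta(minutes=30)  # assumed session timeout

def sessionize(events):
    """Group each user's events into sessions, starting a new session
    whenever the gap between consecutive events exceeds SESSION_GAP."""
    sessions = {}
    for user, channel, ts in sorted(events, key=lambda e: (e[0], e[2])):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1][2] <= SESSION_GAP:
            user_sessions[-1].append((user, channel, ts))
        else:
            user_sessions.append([(user, channel, ts)])
    return sessions

result = sessionize(events)
# u1's two morning web clicks fall into one session; the afternoon
# mobile visit starts a new one, so u1 has two sessions and u2 has one.
```

Once events are keyed by user and session like this, behavior can be attributed across channels, which is exactly what a sessionized schema makes cheap at scale.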
Hadoop has answered an important question about the power of open source communities. Is it possible for open source to innovate and create an entirely new product? The answer is yes. Hadoop's open tool chain allows data to be stored in HDFS or HBase and then processed with tools such as Hive, Pig, Spark, and Storm.
The value, of course, does not come from raw analytical power alone, but from:
- Applying that analytical power to data
- Creating efficient, optimized approaches for analysis and modeling
- Creating a secure, manageable environment for data
- Building richer models of behavior and business activity
- Extracting valuable signals
- Delivering the ability to use and explore data to as many people as possible
The last mile problem for data analytics involves making these steps as easy as possible. This is where open source needs some support.
The Limits of Open Source for Applications
Most great open source projects created platforms for developers and gave rise to a thriving economy of applications and components, many of which are commercial. It's often a case of open source plus. Here are two examples.
1. The best systems leverage open source, but are customized. Facebook, Amazon, and Google all created systems relying heavily on open source, but their most valuable analytics are custom-built. I suspect that systems that enable thousands of users to do relevant analytics will be customized, even if they use open source.
2. Open source communities meet the most common needs, leaving niches for others to fill. Open source communities typically limit productization to a subset of common needs. Problem areas such as metadata management, integration with enterprise applications and datasets, and application security will likely be solved by commercial solutions because the knowledge required is too specialized and the market too small to nurture an open source community.
For all of these reasons, it seems likely that the best solutions will leverage open source but be enhanced by commercial or customized solutions that meet the needs of the last mile.