Originally featured on O’Reilly.
Performing business analytics on the data lake using next-gen open source tools.
“This world is increasingly being driven by quantitative analysis, not qualitative. More than ever, corporate roles involved in decision-making need to have access to data and be able to make sense of it.” – Thomas Nield, revenue management analyst and self-taught engineer at Southwest Airlines.
Many companies want to find more value by deploying analytics projects across the entire enterprise. Scaling analytics on the data lake, however, presents some tool and skill-related challenges. Primarily, the data lake itself is often not safe for mass use due to complexity, security and governance issues. As such, access to it is usually restricted to engineering and IT teams who have the technical know-how and permissions to use it.
This is not simply an access problem, however. On the business side, many analysts do not have the skill set needed to match data to organizational problems. They are also hampered by a lack of awareness of the data or tools available that could solve the problems that they’re seeing.
As the big data ecosystem matures, business users will power their decisions with big data analytics, but it won’t happen automatically. To get there, users need the right tools, awareness and skills to work in the data ecosystem, and companies need to adopt a culture that fosters a creative data mindset.
Jumping through the IT bottleneck
For business users, the current process for getting insight from big data is often long and sometimes frustrating.
According to Dr. Matt North, who comes from a risk analytics background at eBay, “Business analysts are often hogtied when it comes to big data and need to go through an IT approval process or sell their idea to their supervisor before they go anywhere.”
If a business analyst wants to set up a data feed into Hadoop to investigate a particular business problem in detail, they have to get the right permissions and make the right requests for the data. Then, their project becomes IT’s project. Depending on the department and the project’s complexity, this process could take weeks, if not longer, to be completed. The asking, waiting, and following up creates friction and wastes time in deploying analytics projects.
A big data tool for many users
The current lack of access to data reflects the fact that most business users do not have the technical experience necessary to use the data lake, which is often built with open source tools like Hadoop and Spark that are not readily enterprise-friendly. Data lakes are also often disorganized, which makes finding data even more difficult.
But big data does not have to remain the domain of power users – this is simply a limitation of the current tools. There is now a push in the marketplace to make the data lake more accessible to a wide variety of users.
One tool that addresses a number of data lake access, governance, awareness, and skill issues is Kylo, a soon-to-be open sourced data lake orchestration framework based on Spark and NiFi. Kylo automates many functions related to data lakes, including data ingest, preparation, discovery, profiling, and management. The tool is accessed through a GUI built with the business analyst in mind, and it also includes modules for IT operations and data science.
Through Kylo’s UI, business users can manipulate data they care about – creating feeds, defining ingests, wrangling data, transforming it, and publishing to target systems. They retain ownership of the project without needing to deploy any code or turn over control to IT, drastically reducing the amount of time these kinds of projects tend to take.
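To make the feed workflow above more concrete, here is a minimal sketch in plain Python of the ingest, profile, transform, and publish steps. It is not Kylo's API and does not use Spark or NiFi: the function names, the sample data, and the in-memory list standing in for a target system are all hypothetical, chosen only to illustrate the kind of routine that a GUI-driven tool would hide from the business user.

```python
import csv
import io

# Hypothetical sample data; a real feed would read from a source system.
RAW_CSV = """order_id,region,amount
A101,north,129.00
A102,south,
A103,north,89.50
"""

def ingest(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def profile(rows):
    """Count total, empty, and distinct values for each column."""
    stats = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        non_empty = [v for v in values if v != ""]
        stats[column] = {
            "rows": len(values),
            "empty": len(values) - len(non_empty),
            "distinct": len(set(non_empty)),
        }
    return stats

def transform(rows):
    """Drop rows with a missing amount and convert amount to a float."""
    return [
        dict(row, amount=float(row["amount"]))
        for row in rows
        if row["amount"] != ""
    ]

def publish(rows, target):
    """Append cleaned rows to a target (a plain list stands in for a table)."""
    target.extend(rows)
    return len(rows)

rows = ingest(RAW_CSV)
column_stats = profile(rows)
target_table = []
published = publish(transform(rows), target_table)

print(column_stats["amount"])
print(published)
```

In this sketch the profile step is what surfaces the missing amount before anything is published, which is the sort of check an analyst would otherwise have to request from IT.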
For power users such as data scientists, data stewards and IT operations, Kylo provides metadata tracking and Google-like search capabilities, allowing them to confirm that data is being ingested properly and to assess its accuracy. These power users can also create permissions and ready-made templates for business analysts to use, and they can monitor and enforce SLA policies.
Though not a cure-all (especially for companies with a disorganized data lake), Kylo and products like it are interesting because they allow both business and power users to manipulate big data, bridging the gap between big data skills and access. Business users get insight faster, and IT can focus on engineering and data architecture problems instead of coding mundane routines.
Making big data a culture fit
Company culture should not be underestimated in its power to enable data analytics at scale. Successful data-driven companies harness that power by fostering an environment of exploration and awareness.
Adopting such a culture relies on support from both the top and the bottom, whereby management encourages and enables experimentation, and individuals take initiative to learn about new tools and build products that solve problems of immediate interest to them.
Building the right culture is a key step since, according to Dr. North, a lack of awareness of the tools and data available is one of the biggest challenges facing at-scale big data adoption in the enterprise today. Tools like Kylo are promising, but they do little good if business users don’t know they exist or how to use them.
According to Thomas Nield at Southwest Airlines, “Data analysts must be resourceful and driven to not wait on others to provide what they need, and to develop whatever means they need to get something that may not even exist, data wise.” With tools like Kylo, for example, an analyst could discover relevant data sets in the catalog, ingest and prep them on their own, and then download a desktop analytics tool to work with them.
The open source ecosystem is where this kind of self-driven learning thrives, since you don’t need to go through the process of acquiring – and paying for – a commercial tool in order to start using it or teaching others how to use it.
While more feasible for some tools than others (for example, most users probably wouldn’t run a Hadoop cluster on their personal machine), the principle remains the same: learning new skills and tools from open source can be very enabling, and is becoming increasingly crucial in today’s business environment.
Toward a data-ocracy
The emergence of Kylo and other tools like it is paving the way for the next phase of big data – democratization. There is still a long way to go: in the vendor marketplace, tools need to be improved both to increase ease of use and to streamline analytics operations, with the goal of efficiently operationalizing models in the enterprise.
In order to adapt, companies must embark on projects designed to enable business users with awareness of the data and tools that are available, adopt products that facilitate access, and encourage exploration and skill development. At the same time, business users should be proactive in searching out new tools and developing skills on their own.
As tools like Kylo mature, everyone who wants to will be able to use data to make their jobs easier or more fun – hopefully both.