With Hadoop adoption about to take off over the next five years according to Wikibon, there are still plenty of companies “kicking the tires” with pilot projects. To move past the experimental phase of Hadoop, we’ve distilled five best practices that will help promote the integration of Hadoop in your enterprise.
- Define Your Initial Use Case
Although your end goal may envision a large enterprise data lake, it is your initial use case that will ultimately define the success of your Hadoop project. Start by defining a use case of manageable scope that exercises data ingestion, processing, and access.
When defining the use case, start with data access:
- What data do users require?
- How will users access data? (e.g. pre-canned reports, data streams, visualizations, batch data extracts, request response)
Then work backward through your data processing to the raw data sources, defining latency requirements and data-extraction methods along the way. Define the boundaries and implement the initial use case to demonstrate business functionality.
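The scoping exercise above can be captured in a lightweight, reviewable spec. The sketch below is purely illustrative; the field names are assumptions, not any Hadoop convention:

```python
from dataclasses import dataclass

# Illustrative sketch only: field names are assumptions, not a standard.
@dataclass
class UseCaseSpec:
    """Captures the scope and boundaries of an initial Hadoop use case."""
    name: str
    consumers: list            # who needs the data
    access_patterns: list      # e.g. "pre-canned reports", "batch extracts"
    source_systems: list       # raw data origins
    max_latency_minutes: int   # acceptable end-to-end latency
    extraction_method: str = "batch"  # e.g. "batch", "cdc", "streaming"

spec = UseCaseSpec(
    name="daily-churn-report",
    consumers=["marketing-analytics"],
    access_patterns=["pre-canned reports"],
    source_systems=["crm", "billing"],
    max_latency_minutes=24 * 60,  # daily refresh is acceptable
)
print(spec.extraction_method)  # batch
```

Writing the spec down forces the team to agree on consumers, latency, and sources before any cluster work starts.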
- Utilize Existing Enterprise Frameworks
Enterprise frameworks are the foundation of modern software platforms. Frameworks such as Spring, Camel, and JAX-RS can be leveraged to provide control and monitoring functions for data ingestion, pipelines, data access, communication, etc. Frameworks make Hadoop adoption easier, allowing the focus to center on building business logic rather than re-inventing the wheel with a series of hand-rolled control processes. Utilizing frameworks already used elsewhere within the enterprise also means developers can leverage their existing skill sets while learning big data technologies. This approach creates a smoother path to a successful implementation.
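To make the "control and monitoring" role concrete, here is a minimal sketch of the kind of wrapper a framework like Spring or Camel would otherwise provide: it runs records through pipeline steps, captures exceptions for monitoring, and keeps processing. Names and structure are hypothetical:

```python
import logging
from typing import Callable, Iterable, List, Tuple

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

# Hypothetical sketch of a framework-style control process: run each
# record through the pipeline steps, count successes, and capture
# (rather than swallow or crash on) per-record failures.
def run_pipeline(records: Iterable[dict],
                 steps: List[Callable[[dict], dict]]) -> Tuple[int, int]:
    """Return (succeeded, failed) counts for the batch."""
    ok = failed = 0
    for record in records:
        try:
            for step in steps:
                record = step(record)
            ok += 1
        except Exception:
            log.exception("record failed: %r", record)
            failed += 1  # exception is logged for monitoring; pipeline continues
    return ok, failed
```

A real framework adds routing, retries, and metrics on top of this shape; the point is that the control plumbing is generic and should not be rebuilt per project.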
- Data Quality and Platform Management
Many early Hadoop projects were offline analytics applications where guarantees of data quality were not of paramount importance. “At most once” or “at least once” paradigms could be implemented to simplify processing, since data ingestion only had to provide data sets large enough to derive representative models. Enterprises are now demanding “exactly once” semantics, so processing must be designed with data quality in mind from the start. Enterprise frameworks assist with dependency management, exception reporting, and replay functionality.
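One common way to get "exactly once" behavior on top of "at least once" delivery is idempotent writes: every record carries a unique key, so replaying a batch cannot double-count. A minimal sketch, with an in-memory dict standing in for a keyed store such as HBase:

```python
# Sketch of "exactly once" semantics via idempotent upserts. The dict
# stands in for a keyed store (e.g. HBase); names are illustrative.
store = {}

def ingest(batch):
    """Upsert records by key; replaying the same batch is harmless."""
    applied = 0
    for record in batch:
        if record["key"] not in store:
            applied += 1
        store[record["key"]] = record  # idempotent: same key, same row
    return applied

batch = [{"key": "evt-1", "amount": 10}, {"key": "evt-2", "amount": 7}]
ingest(batch)
ingest(batch)        # replay after a failure: no duplicates created
print(len(store))    # 2
```

Because replays are safe, the pipeline can retry aggressively on failure without corrupting downstream counts.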
Most enterprises have existing system monitoring and exception management tools, so Hadoop development should work in harmony with these tools to capture exceptions and monitor cluster health, utilization, etc. Data reconciliation frameworks are necessary to catch data quality issues and replay data through pipelines to correct them.
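The simplest reconciliation check compares row counts per partition between the source extract and what actually landed in the cluster, flagging partitions that need replay. A sketch with hard-coded counts standing in for real queries:

```python
# Minimal reconciliation sketch: counts are hard-coded stand-ins for
# what a real job would query from the source system and from Hive.
def reconcile(source_counts, landed_counts):
    """Return partitions whose landed row count differs from the source."""
    return sorted(
        p for p in source_counts
        if landed_counts.get(p, 0) != source_counts[p]
    )

source = {"2024-01-01": 1000, "2024-01-02": 1200}
landed = {"2024-01-01": 1000, "2024-01-02": 1150}  # 50 rows short
print(reconcile(source, landed))  # ['2024-01-02']
```

The flagged partitions become the input to the replay functionality mentioned above, and the discrepancy itself should be raised through the enterprise's existing exception management tools.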
While Hadoop provides replication of storage and redundancy of processing, control processes need to be designed with redundancy in mind. It is of little use for a cluster to replicate everything in triplicate and distribute processing if your control process sits on a single edge node, creating a single point of failure!
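A toy sketch of one way to avoid that single point of failure: run a standby copy of the control process that takes over only when the active node's heartbeat goes stale. A real deployment would typically use ZooKeeper or similar for coordination; this only illustrates the failover decision:

```python
import time
from typing import Optional

# Toy active/standby sketch. Real clusters would coordinate through
# ZooKeeper or similar; the timeout value is an arbitrary assumption.
HEARTBEAT_TIMEOUT = 30.0  # seconds without a heartbeat before failover

def should_take_over(last_heartbeat: float,
                     now: Optional[float] = None) -> bool:
    """Standby becomes active only if the active node stopped heartbeating."""
    now = time.time() if now is None else now
    return (now - last_heartbeat) > HEARTBEAT_TIMEOUT

print(should_take_over(last_heartbeat=100.0, now=120.0))  # False: still fresh
print(should_take_over(last_heartbeat=100.0, now=200.0))  # True: stale
```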
- Data Modeling
A common Hadoop misconception is that because the platform accepts any file type, you can simply dump data into it and immediately take advantage of its processing power. This is a huge fallacy! In fact, data modeling in Hadoop is of great importance, and it should be tailored primarily to access patterns. To model data well, you need a deep understanding of how the data will be utilized. Ask these questions:
- How are data accessed?
- In what formats do the raw data arrive?
- What is the update frequency of data sets?
- What processing of raw data is required?
Another tip: don’t be afraid of data duplication. Accept that eventual consistency works for many data attributes that change infrequently, or not at all (e.g. date of birth). Take advantage of file formats like ORC to optimize for access patterns.
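To illustrate duplication in service of access patterns, here is a sketch of the same records materialized two ways: keyed for point lookups and grouped for batch reporting. A slowly-changing attribute (date of birth) is duplicated into both views rather than joined at read time; all names are illustrative:

```python
# Illustrative data only: the same customer records modeled for two
# different access patterns, accepting duplication of stable attributes.
customers = [
    {"id": 1, "dob": "1980-05-01", "region": "EU", "spend": 120.0},
    {"id": 2, "dob": "1975-09-12", "region": "US", "spend": 80.0},
]

# View 1: keyed by id for request/response point lookups.
by_id = {c["id"]: c for c in customers}

# View 2: grouped by region for batch reporting; dob is duplicated
# into this view instead of being joined at read time.
by_region = {}
for c in customers:
    by_region.setdefault(c["region"], []).append(c)

print(by_id[1]["dob"])       # 1980-05-01
print(len(by_region["EU"]))  # 1
```

Because dob effectively never changes, the duplicated copies can drift toward consistency lazily without harming either access path.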
- Data Lineage
Applying metadata to incoming data enables tracking of data lineage. This is increasingly important as data sets grow and the Hadoop platform evolves into a multi-faceted data reservoir/lake. Tracking data lineage provides several valuable features, including:
- Tracking of data elements from source to destination(s)
- Tracking data quality and identifying data quality issues
- Providing the basis for data access rights
- Cataloging data sets within the Hadoop cluster
- Defining data retention criteria
Metadata models should be defined before initial implementation. After all, it is very easy to define your models upfront, but very difficult to retrofit metadata onto existing data sets after the fact.
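A metadata model of the kind described above can be as simple as an envelope applied at ingestion time. The field names below are assumptions for illustration, not a standard:

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical lineage envelope applied at ingestion. Field names are
# assumptions; the point is that source, schema version, timestamp,
# and checksum are attached before data lands in the cluster.
def wrap_with_lineage(record: dict, source: str, schema_version: str) -> dict:
    payload = json.dumps(record, sort_keys=True)
    return {
        "payload": record,
        "meta": {
            "source": source,
            "schema_version": schema_version,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "checksum": hashlib.sha256(payload.encode()).hexdigest(),
        },
    }

wrapped = wrap_with_lineage({"id": 42}, source="crm", schema_version="v1")
print(wrapped["meta"]["source"])  # crm
```

With this envelope in place from day one, source-to-destination tracking, cataloging, access rights, and retention decisions all have a consistent hook to key off.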