Getting the Most from Your Data Lake


Enterprises are approaching big data with a heady mixture of anxiety and anticipation. On one hand, these initiatives promise to deliver new levels of flexibility and productivity to competitive business models, but on the other there is widespread concern over acquiring the appropriate skills to secure, govern, and manage big data systems on a daily basis.


Adding insult to injury, data volumes are exploding. Some projections put the worldwide annual data load at 44 zettabytes (that’s 44 followed by 21 zeroes) by 2020, more than 10 times what it is today. The cumulative amount of data that will be under management by the end of the decade is truly mind-boggling.


To prepare for this onslaught, some IT leaders are urging the creation of “data lakes.” These are centralized repositories based on Hadoop that draw raw data from source systems and then pass it to downstream facilities for use by knowledge workers. Data lake designs vary, but they typically involve data stored in the Hadoop Distributed File System (HDFS) and accessed by YARN applications such as MapReduce, Spark, Storm, Solr, Hive, and HBase.


The key driver for the data lake is data variety. As enterprises encounter data from multiple sources – everything from RFID sensors and mobile communications to ecommerce web logs and enterprise database applications – the need to analyze these multi-structured data types becomes paramount. Hadoop provides an unprecedented opportunity to draw connections between these disparate data flows that might otherwise go unnoticed. With Hadoop’s ability to handle diverse multi-structured data, you have a platform for analysis of all corporate data.


Indeed, the use cases for a data lake are myriad. As a means to manage corporate data sources, it can govern which data are accessed and provisioned, by whom, and when, and then track usage, resolve anomalies, find patterns and perform a range of other tasks. As an offload for historical data, it can act as a repository for operational and analytical platforms performing deep history analysis. It can also be leveraged for data discovery, organization and identification, as well as ETL functions like data integration and validation. A data lake, accompanied by an analytic engine that can pre-compute aggregated results and a visualization tool such as Tableau, can even allow business users to report on big data sets that might not live in the data warehouse.


A properly managed data lake is one that provides a high degree of access, provenance, and governance for fast-moving data streams. One way to ensure your data lake doesn’t become a messy swamp of unidentifiable data is to enforce metadata management. While schema metadata is a given in complex storage and analytics environments, a data lake can incorporate various other forms of metadata so that people who are not subject matter experts can easily locate and access the data they need. You want to have the right amount of metadata and governance for the business context: sensitive data and critical production processes need more than experimental data labs working on low-sensitivity data. Indeed, much of the power of a data lake lies in the ability to curate data incrementally and increase governance as the data proves its value.


For example, “business-ontology” metadata can classify data within a specific business environment and help establish relationships between various data sets. It can also relate coarse-grained entities, such as product lines or geographic regions, to each other. Security-related metadata can establish mechanisms like data ownership and access, as well as group association and data or column read authorization.
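
To make the idea of column read authorization concrete, here is a minimal sketch in Python of how security metadata might record an owner and per-group column permissions for one data set. The class name `SecurityMetadata`, its fields, and the sample groups and columns are all hypothetical and chosen for illustration; they are not part of Hadoop or any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class SecurityMetadata:
    """Hypothetical security metadata for one data set in the lake."""

    owner: str
    # Maps a group name to the set of columns that group may read.
    column_read_groups: dict[str, set[str]] = field(default_factory=dict)

    def readable_columns(self, user_groups: set[str]) -> set[str]:
        """Return every column that at least one of the user's groups may read."""
        allowed: set[str] = set()
        for group in user_groups:
            allowed |= self.column_read_groups.get(group, set())
        return allowed


# Illustrative data only: an "orders" data set with two groups.
orders_security = SecurityMetadata(
    owner="sales-data-team",
    column_read_groups={
        "analysts": {"order_id", "region", "amount"},
        "finance": {"order_id", "amount", "card_last4"},
    },
)

analyst_view = orders_security.readable_columns({"analysts"})
```

A query layer could consult `readable_columns` before projecting columns, so that a user who belongs only to the assumed `analysts` group never sees `card_last4`.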


Other types of metadata include operational metadata, which covers data identity and environmental issues (as in, “when was the data ingested, transformed and exported?”), and index-related metadata, which provides data serialization that helps users track down key content. Finally, there is schema metadata, which handles column names and types, along with data interpretation, de-normalization, and other functions.
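
As a rough illustration of operational and schema metadata living side by side, the Python sketch below keeps a schema (column name to type) and a log of lifecycle events for one data set, so that the question “when was the data ingested, transformed and exported?” can be answered from the catalog. The `CatalogEntry` class, the stage names, and the timestamps are assumptions made up for this example, not an existing API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CatalogEntry:
    """Hypothetical catalog record combining schema and operational metadata."""

    name: str
    schema: dict[str, str]  # schema metadata: column name -> column type
    events: list[tuple[str, datetime]] = field(default_factory=list)

    def record(self, stage: str, when: datetime) -> None:
        """Append an operational event such as 'ingested' or 'transformed'."""
        self.events.append((stage, when))

    def last(self, stage: str) -> Optional[datetime]:
        """Return the most recent timestamp for a stage, or None if it never happened."""
        times = [t for s, t in self.events if s == stage]
        return max(times) if times else None


# Illustrative data only.
web_logs = CatalogEntry(
    name="web_logs",
    schema={"client_ip": "string", "request_time": "timestamp", "url": "string"},
)
web_logs.record("ingested", datetime(2015, 1, 5, 8, 0, tzinfo=timezone.utc))
web_logs.record("transformed", datetime(2015, 1, 5, 9, 30, tzinfo=timezone.utc))

ingested_at = web_logs.last("ingested")
exported_at = web_logs.last("exported")  # None: this data set has not been exported
```

Keeping events as an append-only list, rather than overwriting a single timestamp per stage, preserves the history needed for provenance questions.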


The data lake built on Hadoop is a powerful platform that allows an enterprise to quickly and efficiently manage all data types so as to provide consistent and accurate information to the business. With big data emerging as the new enterprise paradigm, it will take an entirely new data architecture to handle it.
