Big data deployments challenge organizations to rethink the role of data across their entire operations landscape. But a funny thing happens when it comes to administering these systems: There is an understandable tendency to map the organizational responsibilities for the care and feeding of big data systems such as Hadoop to existing database administrators (DBAs). After all, these same teams have kept your data assets humming along to this point. Surely they can take on running the new Hadoop cluster, right?
Not in most cases.
Today’s DBAs work on the equivalent of a thoroughly modern car: data warehouses where everything has been precisely fitted together, the structure of the system is orderly, and endless diagnostics are available for the asking. This is a carefully thought-out world where the database administrator’s job is to store, manage, and secure data and to make sure the right people can access it.
Contrast this with the Hadoop stack. Because some elements of Hadoop are immature and its workload management capabilities are still coming of age, it requires a specific understanding of how best to tune the system for daily operations and performance.
In addition, a typical Hadoop cluster may include any of a growing number of analytic processing engines – Hive, Pig, Spark, Storm, HBase, and others – each bringing its own unique characteristics and complexity. Like SQL databases, the Hadoop Distributed File System (HDFS) stores data, but the comparison barely survives beyond that simplest definition. For example, because it is a file system and not a database, HDFS lacks the structure that DBAs expect. Furthermore, in a database such as MySQL, Oracle, or Microsoft SQL Server, DBAs define schemas in advance, whereas analytic schemas in Hadoop can be defined at runtime and frequently evolve over time (for example, parsing structure out of binary data).
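This schema-on-read approach can be sketched in a few lines of plain Python. This is a toy illustration of the concept, not Hadoop itself; the log format, field names, and parsing logic are invented for the example. The point is that the raw data is stored as-is, and structure is imposed only when a job reads it:

```python
# Toy illustration of schema-on-read: raw text is stored unparsed,
# and a schema is applied only at read time. The log format and
# field names here are hypothetical, chosen for illustration.

raw_lines = [
    "2016-03-01T10:00:00 GET /index.html 200",
    "2016-03-01T10:00:05 POST /login 401",
]

def apply_schema(line):
    """Parse structure out of a raw line at read time."""
    timestamp, method, path, status = line.split()
    return {
        "timestamp": timestamp,
        "method": method,
        "path": path,
        "status": int(status),
    }

# The "schema" lives in the reader, not in the storage layer; a
# different job could parse the same raw lines into a different shape.
records = [apply_schema(line) for line in raw_lines]
errors = [r for r in records if r["status"] >= 400]
print(len(errors))  # one failed request in the sample data
```

In a warehouse, the equivalent of `apply_schema` was decided once, up front, by the DBA; in Hadoop, each analytic job can bring its own version of it, and those versions evolve as the questions change.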