Follow the mainstream technology hype machine and you could easily be led to believe that data modeling is dead. The story goes something like this: “Thanks to agile technologies like Hadoop and techniques such as schema on read, prescribed data modeling has been replaced by instantaneous execution on demand.”
If you were laughing, choking, and wiping the coffee off your screen, then you are clearly someone who has done real work with Hadoop and recognizes that line of thinking as utter nonsense. There is a fundamental misunderstanding about schema on read. In fact, schema on read is just a point on a spectrum that begins with old-school schema on write and fully curated models. There’s more to the data modeling story.
When a schema on read model is formed, it is predicated on a lot of structure that was already encoded on write: a file format, a column layout, the keys in a JSON document. All of those structural decisions represent a kind of modeling that shapes what can be accomplished on read. None of that detracts from the fact that schema on read is a far more agile process, one that allows models to be developed incrementally. That is a radical departure from traditional data models, which were highly linear and built on a cascade of limitations. Nevertheless, early decisions about how to organize data, lay it out, and make it efficient continue to be a big part of working with big data systems, just as they are in the relational world.
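The point above can be made concrete with a minimal sketch. The raw records below were written with no curated schema, yet structure is already baked in on write (JSON syntax, field names, value types); the reader then imposes a schema at query time. The field names and the `apply_schema` helper are hypothetical, chosen only for illustration.

```python
import json

# Records written without a curated, up-front schema. Structure still
# exists: JSON syntax, key names, and value types were all fixed on write.
raw_lines = [
    '{"user_id": 1, "event": "click", "ts": "2021-01-01T00:00:00"}',
    '{"user_id": 2, "event": "view"}',  # missing "ts" -- tolerated on read
]

# "Schema on read": the reader decides which fields matter and how to
# coerce them at query time, and can revise this incrementally later.
schema = {"user_id": int, "event": str, "ts": str}

def apply_schema(line, schema):
    record = json.loads(line)
    # Project and coerce only the fields the current analysis needs;
    # absent fields become None instead of failing the whole load.
    return {name: cast(record[name]) if name in record else None
            for name, cast in schema.items()}

rows = [apply_schema(line, schema) for line in raw_lines]
```

Note that nothing here is schema-free: if the writer had chosen a different file format or different key names, the reader's options would change accordingly, which is exactly the modeling-on-write influence described above.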