When a dataset gets too big for your computer to handle comfortably, there are a lot of ways to speed up a computation. You can add RAM, shut down other processes, or overclock the CPU. You can even upgrade to the fastest processor on the market. But at some point these are all quick fixes; they might shave 20% off the runtime, but if your dataset grows again you’ll be in the same bind. The fundamental issue is not how long your computation takes in absolute terms, but how it scales. It’s similar to traditional complexity theory, where no constant-factor optimization will redeem an O(n^2) algorithm for large n.
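To make the constant-factor point concrete, here is a minimal sketch in Python. It assumes a purely quadratic cost model, and the function name `relative_runtime` and its parameters are made up for illustration; real workloads will not follow the model this cleanly.

```python
def relative_runtime(data_growth: float, constant_factor: float, exponent: int = 2) -> float:
    """Runtime relative to the original job, under an assumed cost of c * n**exponent.

    data_growth:     how many times larger the dataset has become (2.0 = doubled)
    constant_factor: what tuning did to the constant (0.8 = 20% faster)
    """
    return constant_factor * data_growth ** exponent


if __name__ == "__main__":
    # A 20% tuning win on the same data, then on data that has doubled.
    print("tuned, same data:   ", relative_runtime(1.0, 0.8))
    print("tuned, doubled data:", relative_runtime(2.0, 0.8))
```

Under this assumed model, a single doubling of the dataset more than cancels the 20% gain: the tuned job on doubled data takes roughly three times as long as the untuned job on the original data.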
Big Data technology is about solving this problem once and for all, with performance that scales gracefully to datasets of any size. Systems like Hadoop are nothing special for small datasets (in fact they’re worse, due to parallelism overhead), but as the gigabytes pile up, the time to process all your data increases only gradually – say, an extra several minutes every time your dataset doubles in size. For the massive datasets being mined today, this becomes the only practical option.
The key to Big Data technologies is “horizontal scaling”. This means that, instead of upgrading to fancier computers when the current system becomes insufficient (“vertical scaling”), you just add more computers to a cluster. The focus is on parallelism, rather than the performance of any one node. Typically the cost of a few commodity machines pales next to the cost of a top-end processor, and the best processor on the market represents an absolute bound on vertical scaling’s power. Big Data systems have no such limit. Hadoop, for example, organizes computers into a tree topology that can be grown to any size; the largest ones currently in use process many petabytes across thousands of nodes.
A corollary of horizontal scaling is fault tolerance. Basic probability dictates that, no matter how reliable your computers are individually, a large enough cluster will have node failures. Rather than struggling in vain to escape node failures, Big Data systems accept them as a fact of life, and handle them automatically. Hadoop, again, replicates data across nodes 3x by default, and the system redirects a computation if one node goes offline.
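The arithmetic behind that claim can be sketched in a few lines of Python. This assumes nodes fail independently with the same probability, which is a simplification, and the per-node failure rate used below is a made-up figure, not a measured one.

```python
def prob_any_failure(num_nodes: int, p_node: float) -> float:
    """Probability that at least one of num_nodes nodes fails,
    assuming independent failures with probability p_node each."""
    return 1.0 - (1.0 - p_node) ** num_nodes


def prob_all_replicas_lost(replicas: int, p_node: float) -> float:
    """Probability that every node holding a copy of one block fails,
    under the same independence assumption."""
    return p_node ** replicas


if __name__ == "__main__":
    p = 0.001  # hypothetical chance that a given node fails in some time window
    for n in (10, 100, 1_000, 10_000):
        print(f"{n:>6} nodes: P(at least one failure) = {prob_any_failure(n, p):.4f}")
    print(f"P(all 3 replicas of a block lost) = {prob_all_replicas_lost(3, p):.1e}")
```

The two functions pull in opposite directions: adding nodes makes some failure nearly certain, while replication makes losing any particular piece of data very unlikely.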
One key technique in horizontal scaling is to “send the code to the data”. In many traditional supercomputers the limiting factor isn’t the CPU’s power; it’s the bandwidth required to get the data *to* the CPU. Most Big Data systems solve this by sending code to whichever computers are storing the data. As much as possible, each node does its processing independently, in a shared-nothing style of parallelism, and coordination between nodes is kept to an absolute minimum.
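As a rough illustration of sending the code to the data, the following Python sketch simulates a cluster inside a single process. The `partitions` dictionary, the node names, and `run_on_node` are all hypothetical stand-ins; a real system would ship the function over the network and run it where the data lives.

```python
from typing import Callable, Dict, List

# Hypothetical cluster: each "node" stores its own partition of the dataset.
partitions: Dict[str, List[int]] = {
    "node-a": [3, 8, 1],
    "node-b": [7, 2],
    "node-c": [5, 9, 4, 6],
}


def run_on_node(node: str, task: Callable[[List[int]], int]) -> int:
    """Stand-in for shipping `task` to `node`: it sees only that node's local data."""
    return task(partitions[node])


def cluster_total() -> int:
    """Sum the whole dataset while moving only one small number per node."""
    # Each node reduces its own partition locally, sharing nothing with the others.
    partial_sums = [run_on_node(node, sum) for node in partitions]
    # The only coordination step: combine the per-node results.
    return sum(partial_sums)


if __name__ == "__main__":
    print(cluster_total())
```

The raw records never leave their node; only one partial sum per node crosses the (simulated) network, which is the point of the technique.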
In practice, the extreme-case scaling of Big Data technology is often less important than its flexibility. Vertical scaling requires a massive investment in new hardware, which may quickly become obsolete if the dataset grows again (or could be wasted money if you overestimated your needs). With Big Data, clusters can be grown or shrunk whenever the occasion calls for it.
At Think Big, our Big Data technology of choice is Hadoop. Besides the huge range of problems that it can tackle, we love Hadoop for its elegant interface. The details of what data is stored on what node, where computation is going on, or even how many nodes are in a cluster, are all masked behind the simple MapReduce API (although you can easily get under the hood if you want to). We believe that sleek APIs, wrapped around massive parallelism, are the future of data processing.
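To give a feel for the programming model, here is a word count written in the MapReduce style as a single-process Python sketch. It is not Hadoop's actual API (which is Java-based); the names `map_words`, `reduce_counts`, and `run_job` are invented for this example, and the grouping step stands in for the shuffle that a real cluster performs across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: combine all the values emitted for one key."""
    return word, sum(counts)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Toy driver: group mapper output by key, then reduce each group."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_words(line):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data", "Big cluster"]))
```

Note that neither `map_words` nor `reduce_counts` mentions nodes, partitions, or cluster size. In this model the author of the job writes only those two functions, and the framework decides where they run.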
If you want to learn more about specific Big Data technologies, we have an overview of the Hadoop ecosystem here: https://thinkbiganalytics.com/leading_big_data_technologies/