NoSQL is an umbrella term for a broad class of database management systems that relax some of the traditional design constraints of relational database management systems (RDBMS) in order to achieve more cost-effective scalability, flexible tradeoffs between availability and consistency (as described by the CAP theorem), and support for data structures that don’t fit well into the relational model, such as key-value data and large graphs. NoSQL databases typically offer neither ACID transactions nor full SQL dialects.
The NoSQL ecosystem is very large. Among the better-known databases are HBase, Cassandra, Aerospike, DynamoDB, MongoDB, Riak, Redis, Accumulo, Datomic, and Couchbase. Of these, HBase and Accumulo are more closely tied to Hadoop than the others, as both use HDFS, by default, for persistent storage and ZooKeeper for service federation.
NoSQL databases expose different information models, including key-value records, JSON or XML documents as records, and graph-oriented data. They expose corresponding programmer APIs and sometimes custom query languages that may or may not be SQL-based. However, a recent trend in this industry is the reintroduction of restricted SQL dialects, to support the large user community accustomed to SQL, along with improved support for transactions.
As an example of a scenario where a NoSQL database is a good fit, an event log for a web site might be captured in a key-value store, where fast appends and key-based retrievals are required, but updates and joins are not.
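The access pattern in this scenario can be sketched in a few lines of Python. This is only a toy in-memory model, not the API of any particular database; the EventLog class, its method names, and the session keys are all hypothetical. It shows that the store needs just two operations, append and lookup by key, and never an update or a join.

```python
from collections import defaultdict


class EventLog:
    """Toy in-memory key-value event log (hypothetical, for illustration)."""

    def __init__(self) -> None:
        # Maps a key (here, a session ID) to its events in arrival order.
        self._events = defaultdict(list)

    def append(self, key: str, event: dict) -> None:
        """Add an event under a key; existing events are never modified."""
        self._events[key].append(event)

    def get(self, key: str) -> list:
        """Return a copy of all events stored under a key, oldest first."""
        return list(self._events.get(key, []))


log = EventLog()
log.append("session-42", {"path": "/home", "status": 200})
log.append("session-42", {"path": "/cart", "status": 200})
log.append("session-7", {"path": "/login", "status": 401})

# Key-based retrieval: all events for one session, in the order appended.
print(log.get("session-42"))
```

A real key-value store would add persistence and distribution across nodes, but the interface it has to offer for this workload stays about this small.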
HBase is a distributed, column-oriented database, where each cell is versioned (a configurable number of previous values is retained). HBase provides Bigtable-like capabilities on top of Hadoop. SQL queries (but not updates) are supported using Hive, but with high latency. Eventually, Impala will also support Hive queries with lower latency. Like many NoSQL databases, HBase does not natively support SQL or complex, multi-row ACID transactions. However, HBase offers high read and write performance and is used in several large applications, such as Facebook’s Messaging Platform. By default, HBase uses HDFS for durable storage, but it layers fast record-level queries and updates on top of this storage, which “raw” HDFS doesn’t support. Hence, HBase is useful when fast, record-level queries and updates are required, but storage in HDFS is desired for use with Pig, Hive, or other MapReduce-based tools.
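The idea of versioned cells can be illustrated with a small in-memory sketch. This is not HBase's client API; the VersionedTable class, its methods, and the row and column names are hypothetical, and a simple counter stands in for the timestamps a real system would attach to each version. Each (row, column) cell keeps at most max_versions values, newest first, and older values are discarded automatically.

```python
import itertools
from collections import deque


class VersionedTable:
    """Toy table whose cells retain a bounded number of versions."""

    def __init__(self, max_versions: int = 3) -> None:
        self._max_versions = max_versions
        # Logical clock standing in for real timestamps (an assumption).
        self._clock = itertools.count(1)
        # (row, column) -> deque of (timestamp, value), newest first.
        self._cells = {}

    def put(self, row: str, column: str, value) -> int:
        """Write a new version of a cell and return its timestamp."""
        ts = next(self._clock)
        cell = self._cells.setdefault(
            (row, column), deque(maxlen=self._max_versions)
        )
        # A full deque drops its oldest entry from the opposite end.
        cell.appendleft((ts, value))
        return ts

    def get(self, row: str, column: str, versions: int = 1) -> list:
        """Return up to `versions` most recent (timestamp, value) pairs."""
        cell = self._cells.get((row, column), ())
        return list(itertools.islice(cell, versions))


table = VersionedTable(max_versions=2)
table.put("user1", "info:email", "a@example.com")
table.put("user1", "info:email", "b@example.com")
table.put("user1", "info:email", "c@example.com")

latest = table.get("user1", "info:email")
history = table.get("user1", "info:email", versions=5)
print(latest)   # only the newest version
print(history)  # at most two versions survive, newest first
```

With max_versions=2, the first value written is no longer retrievable after the third write, which mirrors the "configurable number of previous values" behavior described above.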
Cassandra is one of the most popular NoSQL databases for very large data sets. It is a key-value, clustered database that uses column-oriented storage, sharding by key ranges, and redundant storage for scalability in both data sizes and read/write performance, as well as resiliency against “hot” nodes and node failures. Cassandra has configurable consistency vs. availability (CAP theorem) tradeoffs, such as a tunable quorum model for reads and writes.
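The quorum tradeoff can be made concrete with a small simulation. This is a sketch of the general quorum idea, not Cassandra's implementation or API; QuorumStore and its parameters n, w, and r are hypothetical names. It models the worst case, in which a write has reached only w of the n replicas when a read contacts r of them. Whenever r + w > n, every read set overlaps every write set, so the newest version is always seen; smaller values of r and w respond with fewer replicas available but may return stale data.

```python
import random


class QuorumStore:
    """Toy model of N replicas with write quorum W and read quorum R."""

    def __init__(self, n: int, w: int, r: int, seed: int = 0) -> None:
        if not (1 <= w <= n and 1 <= r <= n):
            raise ValueError("w and r must be between 1 and n")
        self.n, self.w, self.r = n, w, r
        self._replicas = [{} for _ in range(n)]
        self._rng = random.Random(seed)
        self._version = 0  # global version counter (a simplification)

    @property
    def overlapping(self) -> bool:
        """True when any read set must intersect any write set."""
        return self.r + self.w > self.n

    def write(self, key: str, value) -> None:
        """Store a new version on W randomly chosen replicas."""
        self._version += 1
        for i in self._rng.sample(range(self.n), self.w):
            self._replicas[i][key] = (self._version, value)

    def read(self, key: str):
        """Ask R randomly chosen replicas and return the newest value."""
        chosen = self._rng.sample(range(self.n), self.r)
        replies = [self._replicas[i][key] for i in chosen
                   if key in self._replicas[i]]
        if not replies:
            return None
        return max(replies, key=lambda reply: reply[0])[1]


store = QuorumStore(n=3, w=2, r=2)
stale_reads = 0
for i in range(100):
    store.write("k", i)
    if store.read("k") != i:
        stale_reads += 1
print(store.overlapping, stale_reads)
```

Constructing the store with w=1 and r=1 instead makes overlapping false, and reads can then return an older value or nothing at all, which is the availability-for-consistency trade the paragraph describes.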
DynamoDB is Amazon’s highly scalable and available, key-value, NoSQL database. It descends from Dynamo, an internal Amazon key-value store that was one of the earliest NoSQL systems, and the paper written about Dynamo influenced the design of many other NoSQL databases, such as Cassandra.
Couchbase is a key-value NoSQL database that is well suited for mobile applications where a copy of a data set is resident on many devices, changes can be performed on any copy, and copies are synchronized when connectivity is available. Think of how an email client works with local copies of your email history and corresponding email servers.
Redis is a key-value store with specific support for fundamental data structures as values, including strings, hash maps, lists, sets, and sorted sets, whereas most key-value stores have limited understanding of a value’s meaning, except to represent the value as column cells, in many cases. For this reason, Redis is sometimes called a data structure server. Redis keeps all data in memory, which improves performance, but limits the data set sizes it can manage. Durability is optional, by periodic flushing to disk or writing updates to an append log. Master-slave replication is also supported.
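The combination of typed in-memory values and an append log can be sketched as follows. This is a toy model, not Redis or its client API; TinyStore, its three operations, and the log file name are hypothetical, chosen only to echo the flavor of the commands a data structure server offers. Every update is written to an append-only file before it is applied in memory, and a new instance rebuilds its state by replaying that file.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy in-memory store with string, list, and set values."""

    _OPS = ("set", "rpush", "sadd")

    def __init__(self, log_path: str) -> None:
        self._log_path = log_path
        self._data = {}
        # Recover state by replaying the append log, if one exists.
        if os.path.exists(log_path):
            with open(log_path) as log:
                for line in log:
                    self._apply(*json.loads(line))

    def _apply(self, op: str, key: str, value) -> None:
        if op == "set":      # plain string value
            self._data[key] = value
        elif op == "rpush":  # append to a list value
            self._data.setdefault(key, []).append(value)
        elif op == "sadd":   # add a member to a set value
            self._data.setdefault(key, set()).add(value)

    def execute(self, op: str, key: str, value) -> None:
        """Log the update first, then apply it in memory."""
        if op not in self._OPS:
            raise ValueError(f"unknown operation: {op}")
        with open(self._log_path, "a") as log:
            log.write(json.dumps([op, key, value]) + "\n")
        self._apply(op, key, value)

    def snapshot(self) -> dict:
        """Return a shallow copy of the current in-memory state."""
        return dict(self._data)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "appendonly.log")
    store = TinyStore(path)
    store.execute("set", "greeting", "hello")
    store.execute("rpush", "queue", "job-1")
    store.execute("rpush", "queue", "job-2")
    store.execute("sadd", "tags", "nosql")

    # Simulate a restart: a fresh instance replays the log.
    restored_data = TinyStore(path).snapshot()

print(restored_data)
```

The sketch omits periodic snapshots and log compaction, so its log grows without bound; it is meant only to show why logging each update makes an in-memory store recoverable.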
Datomic is a newer entrant in the NoSQL landscape with a unique data model that remembers the state of the database at all points in the past, making historical reconstruction of events and state trivial. Many standard database operations are supported, including joins and ACID transactions. Deployments are distributed, elastic, and highly available.
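A data model that remembers every past state can be approximated by never overwriting anything. The sketch below is loosely inspired by that idea and is not Datomic's API or storage format; FactStore, transact, and as_of are hypothetical names. Facts are appended to a log together with a transaction number, and the state "as of" any transaction is rebuilt by replaying the log up to that point.

```python
class FactStore:
    """Toy append-only store of (tx, entity, attribute, value) facts."""

    def __init__(self) -> None:
        self._log = []  # facts in transaction order; never mutated
        self._tx = 0    # last transaction number issued

    def transact(self, entity: str, attribute: str, value) -> int:
        """Record a new fact and return its transaction number."""
        self._tx += 1
        self._log.append((self._tx, entity, attribute, value))
        return self._tx

    def as_of(self, tx: int) -> dict:
        """Rebuild the state as it was right after transaction `tx`."""
        state = {}
        for fact_tx, entity, attribute, value in self._log:
            if fact_tx > tx:
                break  # the log is ordered, so later facts can be skipped
            state.setdefault(entity, {})[attribute] = value
        return state


db = FactStore()
t1 = db.transact("order-1", "status", "placed")
t2 = db.transact("order-1", "status", "shipped")

before = db.as_of(t1)
after = db.as_of(t2)
print(before)  # the order as it looked after the first transaction
print(after)   # the order as it looks now
```

Replaying the whole log on every query is linear in the number of facts; a production system would rely on indexes, but the append-only principle is what makes historical reconstruction straightforward.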
Riak is a fault-tolerant, distributed, key-value NoSQL database designed for large-scale deployments in cloud or hosted environments. A Riak database is masterless, with no single point of failure. It is resilient against the failure of multiple nodes, and nodes can be added or removed easily. Riak is also optimized for read- and write-intensive applications.