Data is getting bigger and more complex by the day, and so are the choices in handling that data. As a modern application developer you need to understand the emerging field of data management, both RDBMS and NoSQL. Seven Databases in Seven Weeks takes you on a tour of some of the hottest open source databases today. In the tradition of Bruce A. Tate's Seven Languages in Seven Weeks, this book goes beyond your basic tutorial to explore the essential concepts at the core each technology.
Redis, Neo4J, CouchDB, MongoDB, HBase, Riak and Postgres. With each database, you'll tackle a real-world data problem that highlights the concepts and features that make it shine. You'll explore the five data models employed by these databases-relational, key/value, columnar, document and graph-and which kinds of problems are best suited to each.
You'll learn how MongoDB and CouchDB are strikingly different, and discover the Dynamo heritage at the heart of Riak. Make your applications faster with Redis and more connected with Neo4J. Use MapReduce to solve Big Data problems. Build clusters of servers using scalable services like Amazon's Elastic Compute Cloud (EC2).
Discover the CAP theorem and its implications for your distributed data. Understand the tradeoffs between consistency and availability, and when you can use them to your advantage. Use multiple databases in concert to create a platform that's more than the sum of its parts, or find one that meets all your needs at once.
Seven Databases in Seven Weeks will take you on a deep dive into each of the databases, their strengths and weaknesses, and how to choose the ones that fit your needs.
What You Need:
To get the most of of this book you'll have to follow along, and that means you'll need a *nix shell (Mac OSX or Linux preferred, Windows users will need Cygwin), and Java 6 (or greater) and Ruby 1.8.7 (or greater). Each chapter will list the downloads required for that database.
Q&A with Authors Eric Redmond and Jim Wilson
How did you pick the seven databases?
Eric: We did have some criteria, if not explicit. The databases had to be open source---we didn't want to cover any databases that would tie readers to a company. We wanted at least one implementation for each of the five database genres (Relational, Key-Value, Columnar, Document, Graph). Then we chose databases that exemplified some general concepts we wanted to cover, like the CAP theorem, or mapreduce. Finally, we chose databases that were good counterpoints to each other--so we chose MongoDB and CouchDB (different ways of implementing document stores). Or we chose Riak because it was a Dynamo (Amazon's database) implementation to compare to HBase as a BigTable (Google's database) implementation.
Jim: Our goal with the book was principally to introduce readers to the field of choices they now have. Our selections were largely in service of that goal. Even so, it was a pretty long and iterative process. We knew that no matter which ones we picked there'd be people asking why we did or didn't include their favorite. It came down to choosing the genres we wanted to discuss and then picking databases that had the right combination of (A) representing their genre and (B) relative popularity.
For example, we picked PostgreSQL since it sticks very closely to the SQL standard and is relatively less well known than other OSS competitors like MySQL. Similarly, even though both HBase and Cassandra are column-oriented databases, we went with HBase because Cassandra is more of a hybrid that incorporates elements from both the BigTable paper and the Dynamo paper.
Databases are rapidly changing. What do you wish you’d included now?
Eric: There are hundreds of database options, but I'm glad to see that our choices are still going strong a year later. However, if I had to do it over again, I would like to have added a Triplestore (like Mulgara), since the semantic web is slowly popularizing this method of data storage. I also would have liked to spend more time on Neo4j's Cypher language, or have covered Hadoop in a bit of detail, since analytics is a huge part of data storage.
Jim: Yes, databases are rapidly changing, in two senses. First, the field of available data storage technology has been seeing an explosion in recent years. More and more different sorts of databases are cropping up to fill in various niche needs. In the other sense, the databases themselves are rapidly evolving. Even between minor version releases, modern NoSQL databases incorporate more and more features in order to claim more of the market and remain competitive. In that regard, there's a bit of convergence happening and it makes choosing one even harder as there are more that can meet your needs all the time.
I still think the five genres and seven databases we chose satisfy the criteria that we set out to achieve. But there are others I'd like to write about as well. These include some old favorites like SQLite and some databases you might not think of as such, like OpenLDAP and SOLR (an inverted index/search engine).
Why did you decide to write this book?
Eric: Jim and I discussed writing a book like this for quite some time. About a year and a half ago he sent me an email with no body--the subject was "Seven Databases in Seven Weeks?" The title sold me. We both loved Bruce's "Seven Languages" book, and this seemed the perfect format to explore this emerging field.
Jim: As early as March of 2010, Eric and I brainstormed about writing a NoSQL book of some kind. At the time there was a lot of buzz around the term, but also a lot of confusion. We thought we could bring some structure to the discussion and educate people who might not be up to speed yet on all the latest developments.
After reading Bruce A. Tate's Seven Languages in Seven Weeks I thought, "What about Seven Databases?" Eric submitted a proposal and a few weeks later we were off to the races.
What do you see as up and coming databases?
Eric: I've become a big fan of Neo4j. It's one we covered in the book, but in all honesty we chose it because we wanted to explore an open source graph database. But over the past year it's really come into its own. I really do believe this is the year we'll see wider adoption of graph databases.
As for ones we did not cover, I think ElasticSearch is clearly gaining traction. OrientDB is also interesting, as it can act as a relational, key-value, document, or a graph database. I think you'll see more of this multi-genre behavior in the future. And as I hinted at before, Triplestores are gaining a bit of traction, too, though their problem-set greatly overlaps with general graph databases.
Jim: There are many, of course, but there are at least two that I personally look forward to exploring in more detail: ElasticSearch and doozer.
ElasticSearch is a distributed, peer-based, REST/JSON powered document search engine. Using a distributed Lucene index at its core, ElasticSearch allows REST clients to find documents based on fuzzy criteria. Everyone needs a search engine, and ElasticSearch makes it easy.
Doozer is a fast, headless consensus engine. It's written in the Go programming language by the smart folks at Heroku. Doozer is great for storing small blobs of important information that absolutely must be consistent (like cluster configuration metadata), but without a single point of failure.
Table of Contents
9. Wrapping Up
A1. Database Overview Tables
A2. The CAP Theorem