Friday, March 29, 2013

Is MongoDB still on course?

I would like to open a debate about NoSQL products, and especially MongoDB. I was a real fan of MongoDB in its early days, when all of the current solutions were just emerging. Voldemort, HBase, Redis, Riak, CouchDB... everything was kind of new, and developers were highly inspired by the latest research in distributed databases and highly motivated to solve scalability issues. It was the beginning of the war, and I had chosen my side: MongoDB.

Where does MongoDB stand?

The past

What did I like in MongoDB? It was amazingly fast to deploy, and I liked its philosophy of relying on the OS virtual memory manager to handle memory-mapped files in a very clean way. Fast to deploy, and fast to serve your queries, without any SPOF. In contrast, I saw HBase as a real gas plant (an overcomplicated contraption) where you have to configure up to 6 knobs just to set a decent compaction policy. Oh, of course, you'll also have to manage your Hadoop configuration files, and deal with a SPOF thrown in for free. What a stupid design! Even in 2009 it was not really acceptable.
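To make the memory-mapped file idea concrete, here is a minimal Python sketch using the standard `mmap` module. It is not MongoDB code: the file name, the helper functions and the 4-page size are assumptions made up for the illustration. It only shows the principle: the program reads and writes the file as if it were memory, and the OS decides which pages stay in RAM and when dirty pages reach the disk.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # OS page size, commonly 4096 bytes


def write_record(mm, offset, payload):
    """Write bytes into the mapping; the OS decides when the dirty page is written back."""
    mm[offset:offset + len(payload)] = payload


def read_record(mm, offset, length):
    """Read bytes from the mapping; the OS loads the page if it is not resident."""
    return bytes(mm[offset:offset + length])


# Hypothetical data file, pre-sized to 4 pages (an assumption for this sketch)
path = os.path.join(tempfile.mkdtemp(), "demo.data")
with open(path, "wb") as f:
    f.truncate(4 * PAGE)

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)      # map the whole file
    write_record(mm, 2 * PAGE, b"hello")
    mm.flush()                         # explicitly ask for a write-back
    result = read_record(mm, 2 * PAGE, 5)
    mm.close()

# Read the same bytes through regular file I/O to check they reached the file
with open(path, "rb") as f:
    f.seek(2 * PAGE)
    on_disk = f.read(5)
```

The application never manages a cache of its own here, which is what makes the approach so simple, and also why eviction decisions are entirely out of its hands.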

Now

Now we are in 2013, and things have changed. This blog is not the place to discuss performance comparisons; you'll easily find plenty of benchmarks (not always that relevant, by the way). Let's just recap the current situation and point out how everything has evolved. First, I've been quite upset to see the MongoDB team so focused on adding new features to its NoSQL product, such as the aggregation framework (a user-friendly implementation of common map/reduce queries) or new query operators. It looked like they were working to increase their audience rather than to solve complex computer science challenges.
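As a rough illustration of what "a user-friendly implementation of common map/reduce queries" means, the sketch below computes a per-author vote total in plain Python with explicit map and reduce phases, then shows the same query written as an aggregation pipeline. The documents and field names are hypothetical, and the pipeline is only built as data here, not sent to a server.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical documents, made up for this illustration
posts = [
    {"author": "alice", "votes": 3},
    {"author": "bob", "votes": 5},
    {"author": "alice", "votes": 4},
]


def map_phase(docs):
    """Map: emit one (key, value) pair per document."""
    for doc in docs:
        yield doc["author"], doc["votes"]


def reduce_phase(pairs):
    """Group the emitted pairs by key, then reduce each group to a sum."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}


totals = reduce_phase(map_phase(posts))

# The same query expressed declaratively with the $group stage
pipeline = [{"$group": {"_id": "$author", "total": {"$sum": "$votes"}}}]
```

With a driver such as pymongo, a pipeline like this would be passed to `collection.aggregate(pipeline)`; the one-line declaration replaces the hand-written map and reduce functions.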

Alright, MongoDB still has no SPOF, whereas HBase still has this single NameNode; but HBase has strongly increased its overall reliability, while not adding user-friendly features. You'll only find operational features, such as new metrics, support for new compression algorithms or coprocessors (≈ triggers), but I insist, most of the effort was dedicated to RPC enhancements, concurrency improvements and data safety. It remains easy to criticize, though: HBase's data locality design is definitely not perfect (whereas MongoDB at least scales reads through the replica servers) and some of the underlying Hadoop operations would deserve to be synchronized through a ZooKeeper quorum.

Was the past wrong?

What about this "easy to use" point I mentioned earlier? I totally changed my mind. Letting the OS handle virtual memory eviction is not always the best choice. And fine-tuning hundreds of configuration settings can be so efficient! Learn about your workload, think about your priorities, and then you will be able to do so much more with solutions like HBase than you could do with MongoDB. Yes, the learning curve is steeper as there are more settings and more layers to understand, but that's the price to pay if you expect to take your data project to scale (I mean, not only on 3 servers as you could have done with an elementary MySQL instance).

Jira hall of fame

After 3 or 4 years, have a look at the two projects' Jiras. I picked 3 classic MongoDB bugs (a subjective selection), still open and marked as blocking/major bugs:
  • Broken indexes [SERVER-8336]
  • A new server added to a Replica Set Fails to start Replication [SERVER-8023]
  • Socket exception [SEND_ERROR] on Mongo Sharding [SERVER-7008]
Let's compare with HBase bugs still open for the current branch (0.94):
  • NPE while replicating a log that is acquiring a new block from HDFS [HBASE-8096]
  • Replication could have data loss when machine name contains hyphen "-" [HBASE-8207]
  • Increment is non-idempotent but client retries RPC [HBASE-3787]
Obviously, every solution has its own lot of (stupid) bugs. Some of them occur only under very specific conditions and are therefore very hard to fix: full moon, Dvorak keyboard layout or exotic Linux kernel (THIS looks crazy, doesn't it?). But a "broken indexes" ticket still open in 2013 is clearly a sign of immaturity for such a project. I would also like to mention a great article, "MongoDB is still broken by design 5-0".
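The HBase increment ticket listed above describes a classic distributed-systems trap, and a toy simulation makes it easy to see. The sketch below is not HBase code: the `Counter` class, the lost-reply simulation and the request-ID deduplication are all assumptions made up for the illustration. It shows why retrying a non-idempotent operation can apply it twice, and one common remedy (tagging each request with an ID), which is not necessarily how the ticket was or will be resolved.

```python
class Counter:
    """Toy server-side counter, for illustration only."""

    def __init__(self):
        self.value = 0
        self.seen = set()  # request IDs already applied

    def increment(self, amount):
        """Non-idempotent: every call changes the state."""
        self.value += amount
        return self.value

    def increment_once(self, request_id, amount):
        """Idempotent variant: a repeated request ID is applied only once."""
        if request_id not in self.seen:
            self.seen.add(request_id)
            self.value += amount
        return self.value


def call_with_retry(op):
    """Simulate a client whose first reply is lost on the network, so it retries."""
    attempts = 0
    while True:
        attempts += 1
        result = op()          # the server applies the operation
        if attempts == 1:
            continue           # reply lost: the client cannot tell, so it retries
        return result


naive = Counter()
naive_result = call_with_retry(lambda: naive.increment(1))

safe = Counter()
safe_result = call_with_retry(lambda: safe.increment_once("req-1", 1))
```

The client asked for a single increment in both cases, yet the naive counter ends up at 2 while the deduplicating one stays at 1.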

Expectations

Honestly, in 2009, I was expecting MongoDB to become a leader, offering a tight, clean, minimal but strongly reliable framework to store "semi-structured" data, handling memory efficiently, offering a distributed computation stack and letting developers imagine new features at a higher level. I can't agree with this policy of "offering more" rather than "offering the same but stronger". Fortunately, the latest versions seem to be a little more focused on performance and reliability (for example, a global lock has FINALLY been removed).

And reality

To be clear, I'm definitely not against MongoDB. With this article I just wanted to point out that they essentially changed direction, which led to a project that I could no longer follow; but this project is probably still suitable for many uses, for the reasons I presented in my talk at MongoDB Paris last year: geographical indexes, complex query processing, easy secondary indexing, easy to try...

Provided your indexes fit in RAM and you expect to extend your platform in the future, MongoDB offers a variety of features within a single solution. The problem is that there are many competitors to consider as well, such as ElasticSearch.

I wish the team all the best and hope to see all of these features become part of a very robust and reliable project.