Friday, March 29, 2013

Is MongoDB still on course?

I would like to talk about these NoSQL products, and especially MongoDB. I was a real fan of MongoDB in its early days, while all of the current solutions were just emerging. Voldemort, HBase, Redis, Riak, CouchDB... everything was kind of new, with developers highly inspired by the latest research in the world of distributed databases and highly motivated to solve scalability issues. It was the beginning of the war, and I'd chosen my side: MongoDB.

Where does MongoDB stand?

The past

What did I like in MongoDB? It was amazingly fast to deploy, and I liked its philosophy of relying on the OS virtual memory manager to handle memory-mapped files in a very clean way. Fast to deploy, and fast to serve your queries, without any SPOF. By contrast, I considered HBase a real contraption where you have to configure up to 6 knobs just to set a decent compaction policy. Oh, and of course, you'll also have to manage your Hadoop configuration files, and deal with a free bonus SPOF. What a poor design! Even in 2009 it was not really acceptable.


Now we are in 2013, and things have changed. My blog is not a place where I want to discuss performance comparisons; you'll easily find plenty of benchmarks (not always that relevant, by the way). Let's just recap the current situation and point out how everything has evolved. First, I was upset to see the MongoDB team so focused on adding new features to its NoSQL product, such as the aggregation framework – a user-friendly implementation of common map/reduce queries – or new query operators. It looked like they were working to increase their audience rather than to solve complex computer science challenges.

Alright, MongoDB still has no SPOF, whereas HBase still depends on this single NameNode; but HBase has strongly improved its overall reliability, while adding no user-friendly features. You'll only find operational features, such as new metrics, support for new compression algorithms, or coprocessors (≈ triggers), but I insist: most of the effort has been dedicated to RPC enhancements, concurrency improvements and data safety. However it remains easy to criticize: HBase's data locality design is definitely not perfect (whereas MongoDB at least scales reads through the replica servers), and some of the underlying Hadoop operations would have deserved to be synchronized through a ZooKeeper quorum.

Was the past wrong?

What about this "easy to use" point I mentioned earlier? I totally changed my mind. Letting the OS handle virtual memory eviction is not always the best choice. And fine-tuning hundreds of configuration settings can be so effective! Learn about your workload, think about your priorities, and then you will be able to do so much more with solutions like HBase than you could with MongoDB. Yes, the learning curve remains steeper as there are more settings and more layers to understand, but that's the price to pay if you're expecting to run your data project at scale (I mean, not only on 3 servers as you could have done with an elementary MySQL instance).

Jira hall of fame

After 3 or 4 years, have a look at the different projects' Jiras. I picked 3 classic MongoDB bugs (a subjective selection), still open and marked as blocking/major bugs:
  • Broken indexes [SERVER-8336]
  • A new server added to a Replica Set Fails to start Replication [SERVER-8023]
  • Socket exception [SEND_ERROR] on Mongo Sharding [SERVER-7008]
Let's compare with HBase bugs still open for the current branch (0.94):
  • NPE while replicating a log that is acquiring a new block from HDFS [HBASE-8096]
  • Replication could have data loss when machine name contains hyphen "-" [HBASE-8207]
  • Increment is non-idempotent but client retries RPC [HBASE-3787]
Obviously, every solution has its own lot of (stupid) bugs. Some of them will occur under very specific conditions and thus become very hard to fix: full moon, Dvorak keyboard layout or exotic Linux kernel (THIS one looks crazy, doesn't it?). But a still-open "broken indexes" ticket in 2013 is clearly a sign of immaturity for such a project. I would also like to mention a great article, "MongoDB is still broken by design 5-0".


Honestly, in 2009, I was expecting MongoDB to become a leader, offering a kind of tight, clean, minimal but strongly reliable framework to store "semi-structured" data, handling memory efficiently, offering a distributed computation stack and letting developers imagine new features at a higher level. I can't agree with this policy of "offering more" rather than "offering the same but stronger". Fortunately, the latest versions seem to be a little more focused on performance and reliability (as an example, the global lock has FINALLY been removed).

And reality

To be clear, I'm definitely not against MongoDB. Through this article I just wanted to point out that they changed direction sharply, leading to a project that I could not follow anymore; but this project probably remains suitable for many use cases, as I explained in my talk at MongoDB Paris last year: geographical indexes, complex query processing, easy secondary indexing, easy to try...

Provided your indexes fit in RAM and you'll need to extend your platform in the future, MongoDB offers various features within a single solution. The problem is, there are many competitors to consider as well, such as ElasticSearch.

I wish the best to the team and hope to see all of these features come together into a very robust and reliable project.

Thursday, March 21, 2013

Why do I hate plumbing repairs?

These last days I had trouble with my hot water tank, which was leaking. It was not obvious at all because everything was totally wet in the bathroom, including the walls and ceiling. I don't know anything about plumbing, which is why I called a plumber. He had the explanation: everything was wet because of condensation, and the tank was pretty much destroyed by calcium deposits.

Well, you probably don't read my blog to get that kind of information. The interesting part? Here is one of my recent thoughts: WHY do I dislike plumbing? Actually I don't really dislike it, but I'm not very comfortable with manual labor. I don't feel very confident. And it's all because of computer science.

Hmm, this article suddenly looks fascinating: how could I compare plumbing and computer science? It's a story about testability, a story about QA. Think about this simple question: how would you ensure the pipe you've just repaired has fixed your problem for real?

You can't. In fact, this is exactly what happened: the (first) plumber said "Your tank is out of service, here is the invoice for changing it!". Two days later, something was still leaking: a flexible pipe did not withstand the pressure. And yet, this plumber had turned the tank on and shown us everything was okay. Apparently that was not enough at all. It's probably not feasible to unit test such things. When you're writing code, you're writing its unit test(s). And if the unit test has been properly designed, if it's working, it will work forever. No matter whether you change your CPU, hard drive or keyboard, your test will remain able to tell you whether your code is still working.

For example, I don't want to change a tank by myself, just because I think I can't rely on my impressions. Okay, I can turn it on, nothing is leaking now... but how can I be sure I didn't break something else, somewhere else, for another reason? It's not as if I could write code like this:

// Check nothing is leaking once the pipe has been changed.

function test_repair_pipe() {
  Pipe old_pipe = new Pipe(age = 10 * 365 * 86400)  // about 10 years, in seconds
  Pipe new_pipe = new Pipe(age = 0)
  Bathroom context = new Bathroom()
  assert(context.getTank().setPipe(old_pipe).isLeaking() == true)
  assert(context.getTank().setPipe(new_pipe).isLeaking() == false)
}

Imagine if we could do such things for real-world issues: surgery, cooking, and even fighting against bad breath? What a comfortable idea! Well, there is probably a business to build here: e-plumbing, cloud plumbing or even agile plumbing repair.

Saturday, March 16, 2013

It's a kind of Magic, The Gathering

Most of you might have heard at some point about Magic: The Gathering (MTG), a famous trading card game. I played this game when I was 12, only for a few months while the game was very popular at school.

And 2 years ago, I discovered Magic once again. But this time I realized that it was totally different from what I had understood so far. Far more complex, fascinating and entertaining than I thought when I was a young teenager. Not a simple collectible card game. Maybe you already know MTG and you don't need to read this article any further. If you don't know it, let's discover why I definitely love this game.


In a few words, each game consists of a battle between 2 players, each starting with 20 life points and a 60-card deck (the "library"). Basically a player loses when their life total is reduced to zero. Decks are shuffled and each player draws 7 cards, then the battle begins. Players can cast spells, summon creatures and attack their opponents to reduce their life total. A spell needs "mana" to be cast, and this mana is produced by cards called "lands"; everything takes place in a fantasy universe close to the Dungeons & Dragons one. This is the simplest possible overview of the game; the real rules take much more time to read. You can even find a concept of "stack" – yes, a real stack (you know, the data structure with push() and pop() methods) that describes how spells will be resolved.
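To make the stack idea concrete, here is a tiny sketch in Python (the card names and the scenario are purely illustrative, not taken from any actual rules engine):

```python
# Toy model of the MTG stack: the last spell put onto the stack
# resolves first (LIFO). Card names here are purely illustrative.
stack = []

stack.append("Doom Blade")    # player A casts a removal spell...
stack.append("Counterspell")  # ...and player B responds

# Resolution: pop until the stack is empty.
resolution_order = []
while stack:
    resolution_order.append(stack.pop())

print(resolution_order)  # ['Counterspell', 'Doom Blade']
```

The response resolves before the original spell, which is exactly why answering your opponent at the right moment matters so much.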

One of my current favorite cards.

You are a manager!

In the Magic world, if you're expecting to win, you have to manage different kinds of resources efficiently:
  • Cards in hand: basically, the more cards you have, the better your chances of playing the right card at the right time.
  • Mana sources: you need enough mana to play your other spells, but you don't want to draw only mana source cards. Remember, you need to play spells if you want to win...
  • Creatures on the battlefield: put pressure on your opponent with lots of great creatures.
  • Graveyard (cards you've discarded): some cards can be played even after they've already been played! Once these cards are in your graveyard, you can play them almost whenever you want/need them, and that's powerful!
  • Cards in your library: instead of putting pressure with creatures, some players focus on improving the "quality" of the cards they will draw. Some spells allow you to draw more cards than your opponents; this way you might draw a spell that can destroy all your opponent's creatures.
  • Time: if you think you're going to lose a game, you can try to "lock" it and secure a draw; that's still better than losing...
These are the main resources you have to handle in an MTG game. To increase your win rate in a tournament, you must handle your resources better than everyone else. Don't waste your mana, draw cards, don't let statistics beat you.

Statistics and behavior models

Under the most traditional game format, there are some rules you have to follow when building your own deck:
  • At least 60 cards in your deck.
  • You can put in as many land cards (= mana) as you want.
  • Regarding spells, you can put at most 4 copies of the same card in your deck.
Most of the time, a good deck is composed in the following way:
  • EXACTLY 60 cards (only the best, discard the rest).
  • Between 25 and 30 land cards (by the way, there are 5 different mana colors, but let's keep it simple...).
  • Some spells requiring 1 mana source (this is called a "converted mana cost" (CMC) of 1), then 2 mana, 3 and so on. Basically, "expensive" spells are more powerful than the cheapest ones but... they are more expensive: you'll need to survive several turns before playing your "game breaker" card, and thus make sure you're still alive by then. Conversely, playing only cheap spells will probably not be enough to deal the last points of damage to your opponent.
  • This way, the dream hand you can get after drawing 7 cards would be, for instance, 3 land cards plus one card each with CMC 1, 2, 3 and 4. You are then sure to put stronger pressure on each turn, if everything goes well. But remember: your opponent is also playing good cards and can counter your plans.
Now I hope I've described the game mechanisms clearly enough for you to understand the role statistics play in a classical Magic game. In my opinion, the best player is the one who picks the right card at the right moment. This way, that player will have a solution for each problem the opponents throw at them. This requires strong deck-building skills and metagame knowledge (what is considered the current deck-to-beat?). From there, you can bring in game theory, rational choice models and so on.
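Those draw odds follow the hypergeometric distribution. Here is a quick sketch of the computation (the 60-card, 25-land figures come from the guidelines above; the function name and the "2 to 4 lands is playable" threshold are my own choices for the example):

```python
from math import comb

def hand_probability(deck_size, lands_in_deck, hand_size, lands_drawn):
    """Hypergeometric probability of drawing exactly `lands_drawn` lands."""
    others = deck_size - lands_in_deck
    return (comb(lands_in_deck, lands_drawn)
            * comb(others, hand_size - lands_drawn)
            / comb(deck_size, hand_size))

# Chance of a "playable" opening hand (2 to 4 lands out of 7 cards)
# with a 60-card deck running 25 lands.
p = sum(hand_probability(60, 25, 7, k) for k in range(2, 5))
print(round(p, 3))  # ≈ 0.778
```

Tweaking the land count and re-running this kind of computation is exactly how you tune a mana base instead of guessing.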

The kind of statistics you might want to deal with.

But as shown in the previous section, even with a perfectly built, ready-to-win deck, you still have to optimize your resource usage. Nowadays, thanks to the Internet, lots of players just copy deck lists from world champions and try to play and win. That's not enough at all.

Even graph theory can help

Some decks are expected to win by gathering 2, 3 or more cards that make a perfect lethal combo. When playing such a deck, you have to know the shortest path to gather these cards, and build your deck so that you get the best chance of gathering them easily.
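A hedged sketch of that idea: model game states as graph nodes (which combo pieces you currently hold) and card-fetching actions as edges, then a breadth-first search gives the minimum number of actions needed to assemble the combo. Everything below (the state encoding, the two-piece combo, the action list) is invented for illustration:

```python
from collections import deque

# Invented example: the combo needs pieces {"A", "B"}; each available
# action fetches one specific piece. BFS over sets of held pieces finds
# the fewest actions needed to hold the whole combo.
combo = frozenset({"A", "B"})
actions = [frozenset({"A"}), frozenset({"B"})]

def min_actions(start=frozenset()):
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        state, depth = queue.popleft()
        if combo <= state:          # all combo pieces are held
            return depth
        for gained in actions:
            nxt = state | gained
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None                     # combo unreachable

print(min_actions())  # 2 -- one fetch per missing piece
```

A real deck would of course have probabilistic draws rather than deterministic fetches, but the shortest-path framing is the same.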

Oh, and it's also a game. It's an opportunity to share fun and meet great people.


Saturday, March 9, 2013

Regexp: think DFA

Automata theory is such a great science. No matter how abstract it can be, you can find it everywhere. From the simplest recognizable languages to the most complex problems, automata are often the tool you want. Today, I would like to discuss these automata in the context of regular expressions.

Perl and regular expressions. Source: xkcd

For a long time, I used "traditional" regexp engines, mostly java.util.regex and Python's re module -- oh, I can also add grep, sed, and the Lisp engine within Emacs to this list. I use regular expressions every day, mostly for simple use cases, that's true (e.g. grepping "foo" in my files), but also for powerful string replacements. This is because regular expressions ARE powerful. But you probably know they can be painful as well. Let's begin with this first regexp:

> time perl -e '("a" x 10) =~ /^(a?){10}(a){10}$/;'

real 0m0.010s
user 0m0.005s
sys 0m0.004s

Well, as you might have noticed, this is not a real-world regexp, but it can easily lead to a real-world issue. First, we build a string looking like "aa....a", the "a" character repeated 10 times. Then, our regexp will try to match the a? pattern 10 times, and that will match; but the engine will then realize that the (a){10} statement won't match anything, so the Perl engine will do recursive backtracking. Now, you're lost. This is a horrible case since the underlying implementation will always choose the positive match for the a? statement first. The engine has at most n such choices to make; this is how you get your 2^n complexity. How does it behave with n = 25?

> time perl -e '("a" x 25) =~ /^(a?){25}(a){25}$/;'

real 0m5.541s
user 0m5.532s
sys 0m0.007s

Of course an inefficient regexp can be considered acceptable when you're just trying to pick out some relevant files, but it's definitely a problem if this kind of regular expression is used for real-time processing (think about filtering website forms, small RPC calls, or even batch jobs processing thousands of lines of text). In those situations, exponential time is certainly NOT acceptable.

Let's go back to implementations. Basically, most common engines implement an NFA – Non-deterministic Finite Automaton – a finite state machine where a given input can lead to several possible states (and obviously, the machine does not know which one is better). Perl, Java, Ruby and Python, for instance, use backtracking NFA simulations to process regular expressions. This approach is known to be far slower in the worst case than a DFA, its deterministic cousin, which processes input data in linear time. Fortunately, theory tells us that any NFA can be transformed into an equivalent DFA and vice versa.
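Python's re module, being one of those backtracking implementations, exhibits exactly the same blow-up as the Perl one-liner above. A small sketch to reproduce it (absolute timings are of course machine-dependent; only the growth matters):

```python
import re
import time

def time_pathological(n):
    """Time the pathological match of (a?){n}(a){n} against 'a' * n."""
    pattern = re.compile(r"^(a?){%d}(a){%d}$" % (n, n))
    start = time.perf_counter()
    assert pattern.match("a" * n) is not None
    return time.perf_counter() - start

# Each additional optional group roughly doubles the work
# for a backtracking engine.
for n in (5, 10, 15, 18):
    print(n, "->", round(time_pathological(n), 4), "s")
```

The match always succeeds; it's finding it that gets exponentially more expensive.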

Here are 2 examples I picked up from this page to illustrate these concepts: an NFA, and a DFA recognizing the same language.
The famous grep tool, for example, uses a DFA. Want a simple comparison with the previous Perl program?

> time perl -e 'print ("a" x 25)' | grep -E "^(a?){25}(a){25}$"

real 0m0.0011s
user 0m0.007s
sys 0m0.007s

Linear time looks more reasonable. How can I move from recursive backtracking NFAs to DFA implementations in my projects? I've heard about re2. This library uses a DFA implementation and describes itself as a "fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library". By the way, you should easily find bindings (or alternatives) for languages other than C++. This website compares plenty of libraries, and in terms of performance, re2 definitely looks interesting. So, why isn't a DFA used by default? Honestly, I don't know if it's the actual reason, but NFA implementations offer more features than DFAs, such as back references (e.g. you might want to match something like <(.+)>(.*)</\1> to extract XML tags, where \1 refers to the matched opening tag).

If you can't use a DFA, perhaps you can deduce some good practices from this article. The first one is good to know: there is a common syntax allowing you to disable backtracking when you're sure you won't need it: (?> pattern). You might also want to disable capturing thanks to (?: pattern) if you just need to know whether your pattern matches anything in your input data. That's particularly true with Java and its java.util.regex package. You can find more general good advice on JavaWorld, and much of it remains true for other languages.

What a shame, I don't know what to say as a conclusion. I tried to talk both about the abstract concepts behind automata and about efficient/inefficient patterns. I can just give you a vague piece of advice: "think". It's not just about saving CPU cycles, it's mostly about NOT doing crazy stuff.



I will simply say: echo "Welcome, are you also born to segfault?" > /dev/thisblog. This is the first useless post I'm writing, to introduce this new blog. It should become a place to talk about various stuff such as algorithms, distributed systems, external (great) articles, software engineering and funny pictures.