Thursday, September 11, 2014

The factorial exercise

When I'm conducting job interviews, one of my favorite recurring questions is: "Can you write a factorial function?".

Even during a first interview, I think it's relevant and important to see a candidate write code live. Note that this is a very open question, e.g.:
  • it can be written in any language
  • the candidate may supply an iterative or a recursive version
Now that I've asked this question many, many times, I decided to share some of the collected answers:








Answering such a simple question may look quite easy, especially for senior candidates. But writing code live, when you're not prepared, is never that easy, considering stress and time pressure. And as you can see, it's amazing to notice so many different versions: useless intermediate variables, useless conditions, brain-cracking complexity... Nevertheless, this basic quiz always ends up with deeper aspects and more interesting questions, such as:
  • overflows: what happens exactly if you compute factorial(2 << 32)?
  • memoization: what's that and when could you consider it?
  • unit testing: what and how do you test?
  • how to make your code run in parallel?
  • maths and logic: what are the last 10 digits of factorial(123456789) ?
It's obviously not the only question, and it's probably not the most important one. But I'm pretty sure you'd also expect a minimal working version from a candidate with a fresh academic background.
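
For the record, here is a minimal sketch of the kind of answers I expect, covering the iterative, recursive and memoized flavors (Python is only one possible choice, and the names are mine):

from functools import lru_cache

def factorial_iterative(n):
    # Iterative version: a plain accumulator loop.
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

def factorial_recursive(n):
    # Recursive version: the textbook definition (mind the recursion depth).
    return 1 if n <= 1 else n * factorial_recursive(n - 1)

@lru_cache(maxsize=None)
def factorial_memoized(n):
    # Memoized version: repeated calls reuse previously computed results.
    return 1 if n <= 1 else n * factorial_memoized(n - 1)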

Thursday, July 3, 2014

LSE Summer Week 2014

If you're interested in GNU/Linux, *BSD, Android or any other security-related area, you might love these outstanding French conferences, organized by LSE, the security laboratory of my former engineering school (EPITA).

Everything you need to know: http://lse.epita.fr/lse-summer-week-2014/

Feel free to attend! Online videos were available last year; I don't know yet whether they will be this year.

Sunday, May 25, 2014

Book review: "Clojure for machine learning"

I had the opportunity to give my opinion about a book I had been expecting to read. After encountering some issues dealing with Incanter matrices (an ML framework for Clojure programmers), I googled my problem and got answers that seemed to be part of the Clojure for Machine Learning book (Packt Publishing). Of course, the book wasn't entirely available online, so you can imagine how I felt when I got the whole copy three days later for review.

First of all, it was the first time I had read a book from Packt. I have to admit it was even the first time I had heard about Packt. I can't say I was amazed by the layout and print style, but it's more than acceptable (and totally subjective!).


Machine Learning

Well, it's time to talk about the content. I know it's always hard to provide well-balanced explanations of machine learning algorithms: books that are too academic become boring for those who already master the concepts and underlying mathematics (and impenetrable for those who have no maths background at all). Books that are too pragmatic, on the contrary, can be interesting and relevant, but most of the time it becomes quite hard to apply their concepts to other use cases and problems, and it can be frustrating not to know how and why everything works. The author explains his point of view at the very beginning of the book: it aims to sit in the middle of this scale, covering the most relevant problems in a generic fashion while avoiding diving too deep into low-level maths. What a challenge!

Clojure

What about the Clojure content? I still consider myself a Clojure beginner, since I haven't written any large program or solved really hard problems with it. Even if I know and understand all the concepts individually, it remains not-so-easy to gather them and think in proper Clojure style when I have to solve real-life problems. Until now I had mostly played with machine learning algorithms using GNU Octave or Python. This book looks perfect in that dimension! The author provides clear Clojure code, assuming you already know the Clojure fundamentals and giving relevant notes for the lines that deserve further explanation.

Clojure for Machine Learning

Mixing these two axes together, you get this book. At first sight I was surprised not to see any reference to the Incanter project. You just have to wait and keep reading; the book then turns into a nice Incanter introduction. However, I had the feeling the author didn't want to talk too much about Incanter but had to anyway (just a personal impression). I think it would have made sense to introduce this amazing project explicitly; it wouldn't mean the book was 100% single-library oriented.

The part I really liked within the first chapters was the nice introduction to the frameworks available for dealing with matrices (and not Incanter!). It called into question how I had considered matrices in Clojure so far. I just regret that some figures are missing, but it highly motivates me to write those benchmarks myself.

Another thing I noticed is the strange overall organization of the book. Although the table of contents makes it look very clear, I think the author takes strong shortcuts, bypassing some important concepts; on the other hand, he supplies long and deep details about some fundamentals (mostly about ML, not really about Clojure). Make your choice: there is either too much theory or not enough.

Worth reading?

To be honest, I don't know. I guess it depends on who you are and what you expect from this book. The best I can do is give the classical overview as a pros/cons list.

Pros:

  • Nice introduction to LOTS OF concepts
  • Pragmatic, real-life oriented
  • Well-balanced level of required Clojure skills
  • Fair level of details overall

Cons:

  • Lack of fundamentals / academic stuff
  • Lack of summarized explanations
  • No real Incanter introduction
  • Loosely organized

Thursday, May 1, 2014

Emacs, clojure/cider mode: boring autofocus...

Today – labor day – I fine-tuned my Emacs setup, which hadn't evolved for a while: digging into emacs24 packages, testing rainbow-delimiters, customizing company-mode, etc. As a Clojure beginner, I'm also discovering the joy of programming Lisp languages within my favorite $EDITOR.

Then I played with CIDER, a Clojure IDE and REPL for Emacs. As expected, you benefit from the REPL model directly, plus many additional features such as stacktrace analysis. One of my favorite features is the shortcut (C-c C-d in my case) that displays the documentation of a given function.


The problem was that CIDER always decided to automatically move the focus to the attached buffer. This leads to boring keyboard combinations to switch back to the original buffer; even smart shortcuts such as C-<up> won't satisfy me. There are configuration options to disable this behavior when an error is thrown or when the REPL prompts, but I didn't find anything for the documentation lookup.

And here comes the patch! If you're annoyed by this behavior, feel free to patch your setup:

diff --git a/emacs/emacs.d/elpa/cider-0.5.0/cider-interaction.el b/emacs/emacs.d/elpa/cider-0.5.0/cider-interaction.el
index 8f4b8cb..8bcccbf 100644
--- a/emacs/emacs.d/elpa/cider-0.5.0/cider-interaction.el
+++ b/emacs/emacs.d/elpa/cider-0.5.0/cider-interaction.el
@@ -1153,7 +1153,7 @@ if there is no symbol at point, or if QUERY is non-nil."
 (defun cider-doc-handler (symbol)
   "Create a handler to lookup documentation for SYMBOL."
   (let ((form (format "(clojure.repl/doc %s)" symbol))
-        (doc-buffer (cider-popup-buffer cider-doc-buffer t)))
+        (doc-buffer (cider-popup-buffer cider-doc-buffer nil)))
     (cider-tooling-eval form
                         (cider-popup-eval-out-handler doc-buffer)
                         nrepl-buffer-ns)))

Of course, you could create your own variable to customize the focus strategy, but I've never seen a function documentation so long that it would justify moving the focus to another buffer.

Sunday, April 6, 2014

Fight against TEOOD – too-early-optimization-oriented-development

When should you optimize your program?

This is something you might have seen many times around you: some developers love to optimize things. They want their code to run faster. They speak in ops/s ("operations per second") and aim for the highest performance. That's probably a worthy goal, but let me explain why I think it's sometimes especially penalizing.

Why does it matter?

I guess that (great) people want to deliver great projects. They don't want to be blamed once their program has hit its first trivial bottleneck and everything is blocked at a ridiculous performance level.
They keep in mind that caring about performance turns you into a super-developer.

The super-developer, superhero of all developers. (source)

What is TEOOD?

To counter this trend, I'd like to describe my personal view of optimization-oriented development. This is obviously a mocking name, and I only want you to consider the following points. I guess I don't need to add that these arguments probably don't apply to programs that are truly 100% performance-driven, even if some of the points could still be considered.

Define your goal(s): performance != optimization

Of course you're probably coding with tons of 22nd-century ideas, but have you ever thought about what is really important in your super-program? Some ideas:
  • Reliable: not failing on the first corner case encountered. Perhaps some programs can be allowed to fail constantly, but I'd call them "proofs of concept" rather than "programs".
  • Open-sourceable: this means a lot to me. Ensuring your program is properly designed, well documented and strongly tested allows other super-developers to add features or fix your mistakes.
  • Simple: this can be considered, especially when solving simple problems. Do you really need 1,000 lines of C++ code to find the maximum value among 1,000 integers? Of course your program is now able to compute min/avg/max while saving precious CPU cycles, but does it really matter?
Does "Hello World" really need your state-of-the-art GPU optimization? (source)

I guess you'll easily find something in this list more important than raw performance. And that's the point: performance IS NOT optimization. Don't stay too focused on performance, and stay away from optimizations as long as you haven't hit a critical bottleneck. Without even invoking the famous 80/20 rule, I would say that if you get ridiculous performance, it's probably not an optimization problem but a performance one. You've missed something trivial, and that something is most likely related to the global design or to a needlessly complex method, not to a deprecated API or to browsing your ArrayList twice instead of once. And if I'm wrong, guess what? Because you've properly designed your program, you'll easily find the awfully costly function call(s) thanks to a simple profiling tool. And I'm pretty sure the solution will remain quite simple (caching, buffering, multi-threading, etc.).
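
As a tiny illustration of that profiling step, here is a sketch using Python's built-in cProfile (slow_parse is a hypothetical stand-in for such a costly call):

import cProfile
import pstats

def slow_parse(line):
    # Hypothetical hotspot: the kind of costly call a profiler surfaces.
    return sorted(line.split(","))

def pipeline(lines):
    return [slow_parse(line) for line in lines]

profiler = cProfile.Profile()
profiler.enable()
pipeline(["c,b,a"] * 100000)
profiler.disable()

# Top functions by cumulative time: the bottleneck shows up immediately.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)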

And now it's time to tell you what I have in mind: you should take your time to think about the overall design. About how your program will live or die. About what you're trying to do: implementing an algorithm? Adding abstraction? Automating something? You've probably noticed that successful projects get refactored because of design or API issues, rarely because of ugly performance issues.

In TEOOD, I would advise developers at most to consider performance troubles, but never to think about optimizing calls while they're still unable to provide a stable, strong design, or to point out such issues within their unit tests or overall benchmarks.

That's cool, so I can write O(2n) algorithms?

That's not the point. Actually, even in this extreme scenario, I think it might be totally acceptable as long as you're aware of the limitations, you don't need to scale right now, it has been documented, and it replaces 7 days of hard work and thousands of lines of code with a single instruction that required just 5 minutes of your precious time.

If you're writing a program to solve issues, not performance issues, you're probably concerned by this article. Plan your scaling options, both horizontal and vertical (but keep in mind that scaling too early is also a very bad practice). Plan your tolerance margin if you think you're doing un-optimized things. Think about the complexity you're really fighting against, and define the bottlenecks you're ready to hit.

Don't think too fast, speed is not your main concern!

For instance, if you're writing an ETL that reads various file formats and writes to various databases, you should NOT worry about performing many serialization/deserialization operations in your workflow, at least when designing the first parts. Even worse: don't waste your time comparing different serialization methods for every possible case. Even if you're converting String to byte[] and byte[] back to String 42 times in your Java program, you'll have all the cards in hand to test (benchmark) the limits and fix this sooner or later. Your application is probably not even CPU-bound but IO-bound, so wasting CPU cycles in the first draft is not a disaster. At this point, just think about the overall design. Make the pipeline smooth, explicit and easy to maintain for you and everyone else, as in the sketch below. This way you'll be able to optimize the right thing at the right time.
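
To make this concrete, here is a minimal sketch of such a pipeline (Python instead of Java, and the stage names are hypothetical): the naive decoding step is isolated behind a single function, so it can be benchmarked and replaced later without touching the rest.

def decode(raw):
    # Naive first draft: bytes -> str -> fields. Cheap to replace later,
    # because the rest of the pipeline only depends on this function.
    return raw.decode("utf-8").split(",")

def transform(fields):
    return [field.strip().upper() for field in fields]

def pipeline(records):
    for raw in records:
        yield transform(decode(raw))

for row in pipeline([b"foo, bar", b"baz, qux"]):
    print(row)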

I could have called this article "swimming slowly vs sinking fast": taking care of optimizations at the very beginning will irreparably force you to dive deeper and deeper into the source code, continuously turning knobs left and right (note that this applies to benchmark addicts as well).

Once you hit strong performance issues and you don't want to scale, if you've kept the whole program clean, simple and well designed, it probably won't take much time to fix. On the contrary, diving into a complex-because-faster program could lead to many unsafe tricks, perhaps introducing new unknown bugs through side effects.

Conclusion

TL;DR. If I had to describe TEOOD in a few points:
  • Stay focused on what you're doing: solving an algorithmic issue? automating something?
  • Take care of your design; performance will follow.
  • Think (at most) about performance, not about optimizations.
  • Know the limits, document them.
  • Define which bottlenecks matter and how to overcome them.

Saturday, October 26, 2013

Listen to algorithms

This is a useless-yet-interesting YouTube video that a friend of mine has just shared with me.


Until now, people with a supernatural echoic memory couldn't benefit from their power when learning sorting algorithms from books. Now they might want to learn them through YouTube videos instead.

Friday, March 29, 2013

Is MongoDB still on course?

I would like to debate these NoSQL products, and especially MongoDB. I was a real fan of MongoDB in its early days, while all of the current solutions were just emerging. Voldemort, HBase, Redis, Riak, CouchDB... everything was kind of new, developers were highly inspired by the latest research in distributed databases and highly motivated to solve scalability issues. It was the beginning of the war, and I had chosen my side: MongoDB.

Where does MongoDB stand?

The past

What did I like in MongoDB? It was amazingly fast to deploy, and I liked its philosophy of relying on the OS virtual memory manager to handle memory-mapped files in a very clean way. Fast to deploy, and fast to serve your queries, without any SPOF. On the opposite side, I considered HBase a real gas plant, where you have to configure up to 6 knobs just to set a decent compaction policy. Oh, and of course, you also have to manage your Hadoop configuration files and deal with a SPOF thrown in for free. What a stupid design! Even in 2009 it was not really acceptable.

Now

Now it's 2013, and things have changed. My blog is not a place to discuss performance comparisons; you'll easily find plenty of benchmarks (not always that relevant, by the way). Let's just recap the current situation and point out how everything has evolved. First, I've been quite upset to see the MongoDB team so focused on adding new features to its NoSQL product, such as the aggregation framework – a user-friendly implementation of common map/reduce queries – or new query operators. It looked like they were working to increase their audience rather than solving complex computer science challenges.

Alright, MongoDB still has no SPOF, whereas HBase still has its single NameNode; but HBase has strongly improved its overall reliability while not adding user-friendly features. You'll only find operational ones, such as new metrics, support for new compression algorithms, or coprocessors (≈ triggers); and I insist, most of the effort went into RPC enhancements, concurrency improvements and data safety. However, it remains easy to criticize: HBase's data locality design is definitely not perfect (whereas MongoDB at least scales reads through its replica servers), and some of the underlying Hadoop operations would have deserved to be synchronized through a ZooKeeper quorum.

Was the past wrong?

What about this "easy to use" point I mentioned earlier? I've totally changed my mind. Letting the OS handle virtual memory eviction is not always the best choice. And fine-tuning hundreds of configuration settings can be so effective! Learn about your workload, think about your priorities, and then you will be able to do so much more with solutions like HBase than you ever could with MongoDB. Yes, the learning curve remains steeper, as there are more settings and more layers to understand, but that's the price to pay if you expect to run your data project at scale (I mean, not only on 3 servers, as you could have done with an elementary MySQL instance).

Jira hall of fame

After 3 or 4 years, have a look at the different projects' Jiras. I picked 3 classic MongoDB bugs (a subjective selection), still open and marked as blocking/major bugs:
  • Broken indexes [SERVER-8336]
  • A new server added to a Replica Set Fails to start Replication [SERVER-8023]
  • Socket exception [SEND_ERROR] on Mongo Sharding [SERVER-7008]
Let's compare with HBase bugs still open for the current branch (0.94):
  • NPE while replicating a log that is acquiring a new block from HDFS [HBASE-8096]
  • Replication could have data loss when machine name contains hyphen "-" [HBASE-8207]
  • Increment is non-idempotent but client retries RPC [HBASE-3787]
Obviously, every solution has its own lot of (stupid) bugs. Some of them occur only under very specific conditions and thus become very hard to fix: full moon, Dvorak keyboard layout or exotic Linux kernel (THIS one looks crazy, doesn't it?). But a "broken indexes" ticket still open in 2013 is clearly a youth issue for such a project. I would also like to mention a great article, "MongoDB is still broken by design 5-0".

Expectations

Honestly, in 2009, I expected MongoDB to become a leader, offering a kind of tight, clean, minimal but strongly reliable framework to store "semi-structured" data, handling memory efficiently, offering a distributed computation stack and letting developers imagine new features at a higher level. I can't agree with this policy of "offering more" rather than "offering the same but stronger". Hopefully the latest versions seem to be a little more focused on performance and reliability (as an example, the global lock has FINALLY been removed).

And reality

To be clear, I'm definitely not against MongoDB. Through this article I just wanted to point out that they roughly changed direction, leading to a project I could not follow anymore; but this project is probably still suitable for many uses, for the reasons I presented in my talk at MongoDB Paris last year: geographical indexes, complex query processing, easy secondary indexing, easy to try...

Provided your indices fit in RAM and you'll need to extend your platform in the future, MongoDB offers various features within a single solution. The problem is, there are many competitors to consider as well, such as ElasticSearch.

I wish the team all the best and hope to see all of these features become part of a very robust and reliable project.

Thursday, March 21, 2013

Why do I hate plumbing repairs?

These last days I had trouble with my hot water tank, which was leaking. It was not obvious at all, because everything in the bathroom was totally wet, including the walls and ceiling. I don't know anything about plumbing; that's why I called a plumber. He had the explanation: everything was wet because of condensation, and the tank was pretty much destroyed by calcium deposits.

Well, you probably don't read my blog for this kind of information. The interesting part? It's one of my recent thoughts: WHY do I dislike plumbing? Actually I don't really dislike it, but I'm not very comfortable with manual labor. I don't feel very confident. And it's all because of computer science.

Hmm, this article now looks fascinating: how could I compare plumbing and computer science? It's a story about testability, a story about QA. Think about this simple question: how would you ensure the pipe you've just repaired has fixed your problem for real?

You can't. In fact, this is exactly what happened: the (first) plumber said, "Your tank is out of service, here is the invoice for changing it!". Two days later, something was still leaking: a flexible pipe didn't withstand the pressure. Nevertheless, this plumber had turned the tank back on and shown us that everything was okay; clearly, that was not enough at all. It's probably not feasible to unit test such things. When you write code, you write its unit test(s). And if the unit test has been properly designed, if it works, it will work forever. Whether you change your CPU, hard drive or keyboard, your test will still be able to tell you if your code is working.

For example, I don't want to change a tank by myself, just because I think I can't rely on my impressions. Okay, I can turn it on and nothing is leaking now... but how can I be sure I didn't break something else, somewhere else, for another reason? It's not as if we could write code like this:

// Check nothing is leaking once the pipe has been changed.

function test_repair_pipe() {
  Pipe old_pipe = new Pipe(age = 10 * 365 * 86400)  // about 10 years, in seconds
  Pipe new_pipe = new Pipe(age = 0)
  Bathroom context = new Bathroom()
  assert(context.getTank().setPipe(old_pipe).isLeaking() == true)
  assert(context.getTank().setPipe(new_pipe).isLeaking() == false)
}

Imagine if we could do such things for real-world issues: surgery, cooking, even fighting bad breath. What a comfortable idea! Well, there is probably a business to build here: e-plumbing, cloud plumbing or even agile plumbing repair.

Saturday, March 16, 2013

It's a kind of Magic, The Gathering

Most of you might have heard one day about Magic: The Gathering (MTG), a famous trading card game. I played this game when I was 12, only for a few months, while the game was very popular at school.



And 2 years ago, I discovered Magic once again. But this time I realized it was totally different from what I had understood so far: far more complex, fascinating and entertaining than I thought as a young teenager. Not a simple collectible card game. Maybe you already know MTG and don't need to read this article further. If you don't, let's discover why I definitely love this game.

Gameplay

In a few words, each game is a battle between 2 players, each starting with 20 life points and a 60-card deck (the "library"). Basically, a player loses when their life is reduced to zero. Decks are shuffled and each player draws 7 cards, then the battle begins. Players can cast spells, summon creatures and attack their opponent to reduce their life total. A spell needs "mana" to be cast, and this mana is produced by cards called "lands"; everything takes place in a fantasy universe close to the Dungeons & Dragons one. This is the simplest possible overview of the game; the real rules take much more time to read. You can even find a concept of "stack"; yes, a real stack (you know, the data structure with push() and pop() methods) that describes how spells will be resolved.

One of my current favorite cards.
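
If you're curious, a minimal sketch of that resolution order in Python (two classic spells chosen as an illustration); the last spell put on the stack resolves first:

# The MTG "stack" is a true LIFO: the last spell cast resolves first.
stack = []
stack.append("Lightning Bolt")  # player A casts a spell
stack.append("Counterspell")    # player B responds

while stack:
    print("Resolving:", stack.pop())
# Resolving: Counterspell
# Resolving: Lightning Bolt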

You are a manager!

In the Magic world, if you expect to win, you have to manage several kinds of resources efficiently:
  • Cards in hand: basically, the more cards you have, the better your chances of playing the right card at the right time.
  • Mana sources: you need enough mana to play your other spells, but you must not draw only mana source cards. Remember, you need to play spells if you want to win...
  • Creatures on the battlefield: put pressure on your opponent with lots of great creatures.
  • Graveyard (cards you've discarded): some cards can be played even after they've already been played! Once discarded, you can play them almost whenever you want or need to; that's powerful!
  • Cards in your library: instead of putting pressure with creatures, some players focus on improving the "quality" of the cards they will draw. Some spells allow you to draw more cards than your opponent; this way you might draw a spell that can destroy all your opponent's creatures.
  • Time: if you think you're going to lose a game, you can try to "lock" it and secure a draw; that's still better than a loss...
These are the main resources you have to handle in an MTG game. To increase your win rate in a tournament, you must handle your resources better than everyone else. Don't waste your mana, draw cards, don't let statistics beat you.

Statistics and behavior models

With the most traditional game rules, there are some constraints you have to follow when building your own deck:
  • At least 60 cards in your deck.
  • You can put in as many land cards (= mana) as you want.
  • As for spells, you can put at most 4 copies of the same card in your deck.
Most of the time, a good deck is composed in the following way:
  • EXACTLY 60 cards (only the best, discard the rest).
  • Between 25 and 30 land cards (by the way, there are 5 different mana colors, but let's keep it simple...).
  • Some spells requiring 1 mana source (this is called "converted mana cost" (CMC) = 1), then 2 mana, 3 and so on. Basically, "expensive" spells are more powerful than the cheapest ones but... they are more expensive: you'll need to survive several turns before playing your "game breaker" card, so you'd better make sure you're still alive by then. Conversely, playing only cheap spells will probably not be enough to deal the last points of damage to your opponent.
  • This way, the dream hand after drawing 7 cards would be, for instance, 3 land cards plus 1 card each with CMC=1, 2, 3 and 4. You are then sure to put stronger pressure on the board each turn, if everything goes well. But remember: your opponent is also playing good cards and can counter your plans.
Now I hope I've described the game mechanisms clearly enough for you to understand the role statistics play in a classical Magic game. In my opinion, the best player is the one who picks the right card at the right moment; that player will have a solution for each problem the opponent presents. This requires strong deck-building skills and metagame knowledge (what is considered the current deck-to-beat?). From there, you can bring in game theory, rational choice models and so on.

The kind of statistics you might want to deal with.
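
As a toy example of those statistics, here is a sketch (assuming a 60-card deck with 24 lands, numbers picked purely for illustration) that uses the hypergeometric distribution to compute the chances of each number of lands in your opening hand:

from math import comb

def opening_hand_distribution(lands=24, deck=60, hand=7):
    # Hypergeometric distribution: probability of drawing exactly k lands
    # in an opening hand of `hand` cards.
    total = comb(deck, hand)
    for k in range(hand + 1):
        p = comb(lands, k) * comb(deck - lands, hand - k) / total
        print(f"{k} lands: {p:.1%}")

opening_hand_distribution()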

But as shown in the previous section, even with a perfectly built, ready-to-win deck, you still have to optimize your resource usage. Nowadays, thanks to the Internet, lots of players just copy deck lists from world champions and try to play and win. That's not enough at all.

Even graph theory can help

Some decks are expected to win by gathering 2, 3 or more cards that form a perfect lethal combo. When playing such a deck, you have to know the shortest path to gathering these cards, and to build your deck so that you get the best chance of gathering them easily.

Oh, and it's also a game. It's an opportunity to have fun and meet great people.


Saturday, March 9, 2013

Regexp: think DFA

Automata theory is such a great science. No matter how abstract it can be, you can find it everywhere. From the simplest recognizable language to the most complex problems, you might want to use automata. Today, I would like to discuss these automata in regular expressions.

Perl and regular expressions
Source: xkcd

For a long time, I used "traditional" regexp engines: mostly java.util.regex and Python's re module -- oh, I can also add grep, sed, and the Lisp engine within Emacs to this list. I use regular expressions every day, mostly for simple use cases, that's true (e.g. grepping "foo" in my files), but also for powerful string replacements. This is because regular expressions ARE powerful. But you probably know they can be painful as well. Let's begin with this first regexp:

> time perl -e '("a" x 10) =~ /^(a?){10}(a){10}$/;'

real 0m0.010s
user 0m0.005s
sys 0m0.004s

Well, as you might have noticed, this is not a real-world regexp, but it can easily lead to a real-world issue. First, we build a string looking like "aa...a", with the "a" character repeated 10 times. Then our regexp tries to match the a? pattern 10 times, and that will match; but the engine then realizes that the (a){10} statement has nothing left to match, which is why the Perl engine starts recursive backtracking. Now you're lost. This is a horrible case, since the underlying implementation always tries the positive match for the a? statement first. The engine has up to n such choices to make; this is how you get the 2^n complexity. How does it behave with n = 25?

> time perl -e '("a" x 25) =~ /^(a?){25}(a){25}$/;'

real 0m5.541s
user 0m5.532s
sys 0m0.007s

Of course, an inefficient regexp can be considered acceptable when you're just trying to pick up some relevant files, but it's definitely a problem if this kind of regular expression is used for real-time processing (think about filtering website forms, small RPC calls, or even batch jobs processing thousands of lines of text). In our situation, the exponential time is certainly NOT suitable for any of these cases.

Back to implementations. Basically, most common engines implement an NFA – Non-deterministic Finite Automaton – a finite state machine where a given input can lead to several possible states (and, obviously, the machine does not know which one is better). Perl, Java, Ruby and Python, for instance, use backtracking NFA implementations to process regular expressions. This approach is known to be far slower in the worst case than a DFA, its deterministic cousin, which processes the input in linear time. Fortunately, theory tells us that any NFA can be transformed into an equivalent DFA and vice versa.
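
The same blowup is easy to reproduce with Python's re module, which backtracks as well; a quick sketch (exact timings will vary, but expect them to grow roughly exponentially with n):

import re
import time

for n in (10, 15, 20, 24):
    pattern = re.compile(r"^(a?){%d}(a){%d}$" % (n, n))
    start = time.perf_counter()
    pattern.match("a" * n)
    print("n=%d: %.3fs" % (n, time.perf_counter() - start))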

Here are 2 examples I picked up from this page to illustrate these concepts:

NFA
DFA, recognizing the same language.
The famous grep tool, for example, uses a DFA. Want a simple comparison with the previous Perl program?

> time perl -e 'print ("a" x 25)' | grep -E "^(a?){25}(a){25}$"
aaaaaaaaaaaaaaaaaaaaaaaaa

real 0m0.0011s
user 0m0.007s
sys 0m0.007s

Linear time looks much more reasonable. How can I move from recursive backtracking NFAs to DFA implementations in my projects? I've heard about re2. This library also uses a DFA implementation and describes itself as a "fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library". By the way, you should easily find bindings (or alternatives) for languages other than C++. This website compares plenty of libraries, and in terms of performance, re2 definitely looks interesting. So, why isn't the DFA approach used by default? Honestly, I don't know if it's the actual reason, but NFA implementations offer more features than DFAs, such as back references (e.g. you might want to match something like <(.+)>(.*)</(\1)> to extract XML tags, where \1 refers to the matched opening tag).

If you can't use a DFA, perhaps you can deduce some good practices from this article. The first one is good to know: there is a common syntax that lets you disable backtracking when you're sure you won't need it: (?> pattern). You might also want to disable capturing, thanks to (?: pattern), if you just need to know whether your pattern matches anything in your input data. That's particularly true with Java and its java.util.regex package. You can find more general good advice on JavaWorld, and much of it remains true for other languages.
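
A quick sketch of the capturing point with Python's re module (note that atomic groups, the (?> ) syntax, only landed in Python 3.11, so only non-capturing groups are shown here; the log line is a made-up example):

import re

# Non-capturing group: we only want a yes/no answer, so nothing is kept.
has_level = re.compile(r"(?:ERROR|WARN) .+")
print(bool(has_level.search("2013-03-09 ERROR disk full")))  # True

# Capturing version, for comparison: group(1) retains the matched level.
with_capture = re.compile(r"(ERROR|WARN) .+")
print(with_capture.search("2013-03-09 ERROR disk full").group(1))  # ERROR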

What a shame, I don't know what to give as a conclusion. I tried to talk both about the abstract concepts behind automata and about efficient/inefficient patterns. I can only give you a vague piece of advice: "think". It's not just about saving CPU cycles, it's mostly about NOT writing delirious stuff.
