Saturday, March 9, 2013

Regexp: think DFA

Automata theory is such a great science. No matter how abstract it can be, you can find it everywhere. From simplest recognizable langage to the most complex problems, you might want to use automaton. Today, I would like to discuss about these automata in regular expressions.

 Source : xkcd

For a long time, I used "traditional" regexp engines. Mostly `java.util.regex` and python's `re` module -- oh, I can also add `grep`, `sed`, and lisp engine within emacs to this list. I use regular expressions every day, mostly for simple use cases, that's true (ie: grep "foo" in my files) but also for powerful strings replacements. This is because regular expressions ARE powerful. But you probably know they can be painful as well. Let's begin with this first regexp,

```> time perl -e '("a" x 10) =~ /^(a?){10}(a){10}\$/;'

real 0m0.010s
user 0m0.005s
sys 0m0.004s
```

Well, as you might have noticed, this is not real-world regexp, but this can easily lead in a real-world issue. First, we are building a string looking like "aa....a" with 10 times repeated "a" character. Then, our regexp will try to match the `a?` pattern 10 times, and that will match, but then the engine will realize that the `(a){10}` statement won't match anything, that's why the perl engine will do recursive backtracking. Now, you're lost. This is an horrible case since the underlying implementation will always choose the positive matching for the `a?` statement. The engine will have at most n such choices to make, this is how you get your 2^n complexity.  How does it behave with n = 25 ?

```> time perl -e '("a" x 25) =~ /^(a?){25}(a){25}\$/;'

real 0m5.541s
user 0m5.532s
sys 0m0.007s
```

Of course inefficient regexp can be considered as acceptable when you're trying to pick up some relevant files, but it's definitely cumbersome if this kind of regular expression is used for real-time processing (think about filtering website forms, small RPC calls, or even batch job processing thousands lines of text). In our situation the exponential time is certainly NOT suitable for any case.

Go back to implementations. Basically, most of common engines implement a NFA – Non-deterministic Finite Automaton – a finite state machine where a given input can lead to several possible states (and obviously, it does not know which one is the better). Perl, Java, Ruby and Python for instance implement NFA for processing regular expressions. This implementation is known to be far slower than DFA, its deterministic cousin processing input data in polynomial time. Hopefully the theory wants that NFA can be be transformed to DFA and vice versa.

Here are 2 examples I picked up from this page to illustrate these concepts:

 NFA
 DFA, recognizing the same langage.
The famous `grep` tool, for example, uses DFA. Want to get a simple comparison with the previous perl program?

```> time perl -e 'print ("a" x 25)' | grep -E "^(a?){25}(a){25}\$"
aaaaaaaaaaaaaaaaaaaaaaaaa

real 0m0.0011s
user 0m0.007s
sys 0m0.007s
```

Polynomial time looks more reasonable. How can I move from recursive backtracking NFA to DFA implementations in my projects? I've heard about re2. This library also uses DFA implementation and describes itself as a "fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library". By the way you should easily find bindings (or alternatives) for langages other than C++. This website is comparing plenty of libraries and in term of performances, re2 definitely looks interesting. So, why is DFA not used by default? Honestly, I don't know if it's the actual reason, but NFA implementations offer more features than DFA, such as back references (e.g. you might want to match something like `<(.+)>(.*)</(\1)>` to extract XML tags, where `\1` refers to the matched opening tag)

If you can't use DFA, perhaps you can deduce some good practices from this article. The first one it's good to know: there is a common syntax allowing you to disable backtracking if you're sure that you won't need it: `(?> pattern)`. You might also want to disable grouping thanks to `(?: pattern)` if you just need to know if your pattern is matching something/nothing in your input data. That's particularly true with Java and its `java.util.regex` package. You can find more general good advices on JavaWorld, and lots of them remain true for other langages.

What a shame, I don't know what to tell as a conclusion. I tried to talked both about abstract concepts behind automata and about efficient/inefficient patterns. I can just give you a so vague advice like "think". It's not just about saving CPU cycles, it's mostly about NOT delirious stuff.