50

Regular expressions are a powerful tool in a programmer's arsenal, but there are cases where they are not the best choice, or are even outright harmful.

The classic example is parsing HTML with regexps, a known road to numerous bugs. The same probably applies to parsing in general.

But are there other clear no-go areas for regular expressions?


P.S. "The question you're asking appears subjective and is likely to be closed." So I want to emphasize that I am interested in concrete examples where using regexps is known to cause problems.

c69
  • 9
    Parsing HTML with regexp is not just "a known road to numerous bugs". It is actually *impossible*. – Kramii Oct 09 '11 at 09:13
  • 19
    Not only is it impossible, it also leads to [madness and eternal damnation](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Martin Wickman Oct 09 '11 at 09:45
  • Are you talking about *regular expressions* or *regexps*? Those two are completely different things! Also, *if* you are talking about *regexps*, then, well, there is no such thing as *regexps*, there are dozens of incompatible, only superficially related things with widely varying expressivity and computational power, all called *regexps*, but you need to specify which particular thing you are talking about. – Jörg W Mittag Oct 09 '11 at 10:55
  • 3
    @Jörg: Regexp is just an abbreviation for regular expression. – Joren Oct 09 '11 at 11:04
  • 1
    @Joren: The *word* "regexp" is a portmanteau of "regular expression", but they denote two completely different things. *Regular expressions* are a precisely defined, very simple, very elegant mathematical concept. *Regexps* are neither precisely defined, nor simple, nor elegant. *Regular expressions* can only parse regular languages (in fact, that's pretty much the *definition* of "regular language"), *regexps* can -- depending on the flavor -- parse much more. E.g., Ruby 1.9 Regexps can parse many (maybe all?) context-free languages as well. – Jörg W Mittag Oct 09 '11 at 11:17
  • 3
    @Jörg: It is very much true that there is a massive difference between regular expressions in mathematics and their implementations in software libraries. It is also true that most regular expression libraries have extensions that place them far beyond accepting merely regular languages, and that calling them regular expressions is not always so appropriate. I agree with you that there are two different concepts. But they have the same name; regexp is still just an abbreviation, not a term in itself. There are plenty of examples on this site of the full term being used for the software libraries. – Joren Oct 09 '11 at 11:34
  • 2
    @Jörg - these are semantics. While it may be a good idea to call these patterns by different names (if only to avoid the "regular expressions are for regular languages" fallacy), "regexp"/"regular expressions" is not a very good attempt, and only leads to additional confusion. – Kobi Oct 09 '11 at 11:41
  • Is this question asking for a list? – Cyclops Oct 09 '11 at 14:46
  • @Cyclops, yes, a list is OK. But if you can pick a single criterion, that would be even better than a long list ;) Speaking of such a criterion, I think Matteo hit the right spot. Yet there are still "other" cases where it's not feasible to use regexps. – c69 Oct 09 '11 at 14:55
  • 1
    Actually, [attacking bits of HTML with regexes is easy and worthwhile](http://stackoverflow.com/a/4234491/471272). It’s trying to parse an entire web page with a single regex that’s ridiculous. But there is no reason in the world why not to use a regex on a string like `"foo bar"`. The parrots don’t know what they’re squawking about. – tchrist Mar 04 '12 at 23:30

5 Answers

60

Don't use regular expressions:

  • When there are parsers.

This isn't limited to HTML. Even simple, valid XML cannot be reasonably parsed with a regular expression, even if you know the schema and you know it will never change.

Don't try, for example, to pick apart C# source code with regular expressions. Parse it instead, to get the tokens or a meaningful tree structure.
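
To make the XML point concrete, here is a minimal sketch in Python using the standard `xml.etree.ElementTree` module. The sample document and the names `XML_TEXT`, `items_per_order`, and `naive_items` are invented for illustration. The naive regex is fooled by a comment and by a CDATA section, while the parser handles both.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
XML_TEXT = """
<orders>
  <order id="1"><item>pen</item><item>ink</item></order>
  <order id="2"><!-- <item>ghost</item> --><item><![CDATA[a < b]]></item></order>
</orders>
"""

def items_per_order(xml_text):
    """Return {order id: [item text, ...]} using a real XML parser."""
    root = ET.fromstring(xml_text)
    return {
        order.get("id"): [item.text for item in order.findall("item")]
        for order in root.findall("order")
    }

def naive_items(xml_text):
    """The tempting regex: it knows nothing about comments or CDATA."""
    return re.findall(r"<item>(.*?)</item>", xml_text)

print(items_per_order(XML_TEXT))  # structure-aware result
print(naive_items(XML_TEXT))      # includes the commented-out item and raw CDATA markup
```

The regex version also loses the grouping by order, which the parser gives you for free.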

  • More generally, when you have better tools to do your job.

What if you must search for a letter in either lower or upper case? If you love regular expressions, you'll use them. But isn't it easier, faster, and more readable to run two plain searches, one after another? Chances are that in most languages you'll get better performance and more readable code.

For example, the sample code in Ingo's answer is a case where you don't need regular expressions at all. Just search for foo, then for bar.
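
As a rough illustration in Python (the function names here are made up for the sketch), both cases reduce to plain substring checks:

```python
def contains_letter(text, letter):
    """True if `text` contains `letter` in either case, using two plain searches."""
    return letter.lower() in text or letter.upper() in text

def contains_foo_or_bar(text):
    """Search for foo, then for bar; no regex involved."""
    return "foo" in text or "bar" in text

print(contains_letter("Hello", "h"))
print(contains_foo_or_bar("a crowbar"))
```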

  • When parsing human writing.

A good example is an obscenity filter. Not only is it a bad idea to implement one in general, but you may be tempted to do it with regular expressions, and you'll do it wrong. There are plenty of ways a human can write a word, a number, or a sentence that another human will understand but your regular expression will not. So instead of catching real obscenity, your regular expression will spend its time hurting other users.
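
A small Python sketch of how this goes wrong. The one-word blacklist and the name `naive_filter` are hypothetical, chosen only to show both failure modes: innocent words get mangled, and a trivially disguised spelling gets through.

```python
import re

def naive_filter(text):
    # Hypothetical blacklist with a single mild entry, for illustration only.
    return re.sub(r"ass", "***", text, flags=re.IGNORECASE)

# False positives: ordinary words are damaged.
print(naive_filter("A classic assessment of the passage"))
# False negative: a human reads this easily, the pattern does not.
print(naive_filter("what an a$$"))
```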

  • When validating some types of data.

For example, don't validate an e-mail address with a regular expression. In most cases, you'll do it wrong. In the rare case that you do it right, you'll end up with a 6,343-character coding horror.

Without the right tools, you will make mistakes. And you will notice them at the last moment, or maybe never. If you don't care about clean code, you'll write a twenty-line string with no comments, no spaces, and no newlines.
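
To illustrate the "you'll do it wrong" case, here is a sketch in Python with a typical hand-rolled pattern (`NAIVE_EMAIL` is an invented example, not taken from any library). It rejects a valid address that uses `+` in the local part and accepts a domain with an empty label.

```python
import re

# A typical "good enough" pattern of the kind people write by hand (assumption).
NAIVE_EMAIL = re.compile(r"^[\w.]+@[\w.]+\.\w+$")

def naive_is_email(address):
    return NAIVE_EMAIL.match(address) is not None

print(naive_is_email("user@example.com"))      # accepted, fine
print(naive_is_email("user+tag@example.com"))  # valid address, but rejected
print(naive_is_email("a@b..c"))                # invalid domain, but accepted
```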

  • When your code will be read. And then read again, and again and again, every time by different developers.

Seriously, if I take your code and must review or modify it, I don't want to spend a week trying to understand a twenty-line string full of symbols.

ChrisF
Arseni Mourzenko
  • 9
    "Seriously, if I take your code and must review it or modify it, I don't want to spend a week trying to understand a twenty lines long string plenty of symbols." +1! – funkybro Oct 09 '11 at 10:29
  • 1
    This is a much better answer than its step sister on stack overflow: http://stackoverflow.com/questions/7553722/when-should-i-not-use-regular-expressions – Kobi Oct 09 '11 at 11:42
  • 1
    If you are using Perl/PCRE (and probably the other modern regex flavors too), read up about subroutines, named capturing groups and `(?(DEFINE))` assertions ;) You can write very clean regexes using those and actually when you use those you will write grammars that are very similar to what you would write in yacc or alike ;) – NikiC Oct 09 '11 at 18:30
  • 2
    Using regular expressions to parse away blacklisted words is a clbuttic error. – Dan Ray Oct 10 '11 at 12:39
  • There is no reason in the world to avoid throwing a regex at a string like `"stuff"`. Modern regexes have no trouble with this. – tchrist Mar 04 '12 at 23:34
18

The most important thing: when the language you are parsing is not a regular language.

HTML is not a regular language, and parsing it with a regular expression is not possible (not merely difficult, or a road to buggy code).
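
As a sketch of the alternative, assuming Python and its standard `html.parser` module (the class `LinkCollector` and the function `extract_links` are names made up for this example), a real HTML parser copes with quoting styles, letter case, and comments without any special handling:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over lowercased tag and attribute names.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links

SAMPLE = """<p><a class="x" href='/one'>1</a>
<!-- <a href="/hidden">h</a> -->
<A HREF="/two">2</A></p>"""
print(extract_links(SAMPLE))
```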

Matteo
  • 4
    Wrong! If you are using any of the modern regex flavors (Perl, PCRE, Java, .NET, ...) you can do recursion and assertions and thus can also match context-free and context-sensitive grammars. – NikiC Oct 09 '11 at 18:23
  • 9
    @NikiC. Not wrong. "Modern regex flavors" are not regular expressions (which can be used to parse regular languages, hence the name). I agree that with PCRE you can do more but I would not call them just "regular expressions" (as in the original question). – Matteo Oct 09 '11 at 19:18
  • 1
    Modern regexes are so far beyond what your grandma was taught that regexes could do that her advice is immaterial. And even primitive regexes can handle most little snippets of HTML. This blanket prohibition is ridiculous and unrealistic. Regexes were ***made*** for this sort of thing. And yes, [I do know what I’m talking about](http://stackoverflow.com/a/4234491/471272). – tchrist Mar 04 '12 at 23:35
12

On Stack Overflow one often sees people ask for a regex that checks whether a given string does not contain this or that. This is, IMHO, reversing the purpose of regular expressions. Even if a solution exists (employing negative lookbehind assertions or similar), it is often much better to use the regex for what it was made for and handle the negative case with program logic.

Example:

# bad
if (/complicated regex that assures the string does NOT contain foo|bar/) {
    # do something
}

# appropriate
if (/foo|bar/) {
    # error handling
} else {
    # do something
}
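
The same idea as a runnable Python sketch. The names `FORBIDDEN`, `NEGATED`, and `is_acceptable` are invented here, and the lookahead pattern is just one possible way to write the "bad" variant. Both checks give the same answer, but the second one is far easier to read and to change.

```python
import re

# "Bad": one pattern that asserts the absence of foo and bar.
NEGATED = re.compile(r"\A(?!.*(?:foo|bar)).*\Z", re.DOTALL)

# "Appropriate": match what you are looking for, negate in program logic.
FORBIDDEN = re.compile(r"foo|bar")

def is_acceptable(text):
    return FORBIDDEN.search(text) is None

for sample in ["plain text", "some food", "crowbar", ""]:
    print(repr(sample), is_acceptable(sample), NEGATED.match(sample) is not None)
```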
Ingo
  • 1
    +1: A few times, I've avoided coding myself into a corner with regexes by stopping and asking myself "Okay, what am I specifically trying to match?" rather than "What am I trying to avoid?" –  Feb 08 '12 at 18:30
5

Two cases:

When there is an easier way

  • Most languages provide a simple function like INSTR to determine whether one string is a substring of another. If that's what you want to do, use the simpler function. Don't write your own regular expression.

  • If there is a library available for performing a complex string manipulation, use it rather than writing your own regular expression.
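
In Python, for instance, the `in` operator and `str.find` play the role of INSTR. The sketch below (helper names are made up) also shows a typical trap when the search text is used as a regex without escaping:

```python
import re

def contains(haystack, needle):
    """Plain substring test."""
    return needle in haystack

def position(haystack, needle):
    """Index of the first occurrence, or -1 when absent."""
    return haystack.find(needle)

print(contains("version 1.5", "1.5"), position("version 1.5", "1.5"))
# The unescaped dot matches any character, so the regex "finds" 1.5 in "125".
print(contains("125", "1.5"), re.search("1.5", "125") is not None)
```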

When regular expressions are not sufficiently powerful

  • If you need a parser, use a parser.
Kramii
0

Regular expressions cannot identify recursive structures. This is the fundamental limitation.

Take JSON: it is a pretty simple format, but since an object may contain other objects as member values (arbitrarily deep), the syntax is recursive and cannot be parsed by a regex. CSV, on the other hand, can be parsed by regexes since it does not contain any recursive structures.
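
A short Python sketch of what "arbitrarily deep" means in practice, using the standard `json` module. The sample document and the helper `max_depth` are invented for illustration. Note that the helper itself has to be recursive to follow the nesting.

```python
import json

def max_depth(value):
    """Nesting depth of a decoded JSON value (scalars count as 0)."""
    if isinstance(value, dict):
        return 1 + max((max_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((max_depth(v) for v in value), default=0)
    return 0

DOC = json.loads('{"a": {"b": {"c": [1, 2, {"d": null}]}}}')
print(max_depth(DOC))
print(DOC["a"]["b"]["c"][2]["d"])
```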

In short, regular expressions do not allow the pattern to refer to itself. You cannot say: at this point in the syntax, match the whole pattern again. To put it another way, a regular expression only matches linearly; it has no stack that would let it keep track of how deep it is in a nested pattern.

Note that this has nothing to do with how complex or convoluted the format otherwise is. S-expressions are really, really simple, but cannot be parsed with a regex. CSS2, on the other hand, is a pretty complex language, but does not contain recursive structures and therefore can be parsed with a regex. (Although this is not true for CSS3 due to CSS expressions, which have a recursive syntax.)

So the problem is not that parsing HTML using only a regex is ugly, complex, or error-prone. It is simply not possible.

If you need to parse a format that contains recursive structures, you need to at least supplement the regular expressions with a stack to keep track of the nesting level. This is typically how a parser works: regular expressions are used to recognize the "linear" parts, while custom code outside the regex keeps track of the nested structures.
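
A minimal Python sketch of that division of labour, with invented names (`BRACKET`, `CLOSERS`, `is_balanced`): the regex only finds individual brackets, and an explicit stack outside the regex tracks the nesting.

```python
import re

# The regex only recognizes the "linear" pieces: individual brackets.
BRACKET = re.compile(r"[()\[\]{}]")
CLOSERS = {")": "(", "]": "[", "}": "{"}

def is_balanced(text):
    """Return True if every bracket in `text` is closed in the right order."""
    stack = []
    for match in BRACKET.finditer(text):
        bracket = match.group()
        if bracket in CLOSERS:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != CLOSERS[bracket]:
                return False
        else:
            stack.append(bracket)
    return not stack

print(is_balanced("{[()]}"), is_balanced("([)]"), is_balanced("((("))
```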

Usually parsing like this is split into separate phases. Tokenization is the first phase where regular expressions are used to split the input into a sequence of "tokens" like words, punctuation, brackets etc. Parsing is the next phase where these tokens are parsed into a hierarchical structure, a syntax tree.
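
The two phases can be sketched in Python for S-expressions. This is a toy reader under simplifying assumptions (no strings, no quoting, atoms kept as plain text), and the names `tokenize`, `parse`, and `read` are made up for the example. The regex does the tokenization, and a small recursive function builds the tree.

```python
import re

# Phase 1: tokenization. A token is a parenthesis or a run of other characters.
TOKEN = re.compile(r"[()]|[^\s()]+")

def tokenize(text):
    return TOKEN.findall(text)

# Phase 2: parsing. Recursion in ordinary code handles the nesting.
def parse(tokens):
    """Parse one expression from the front of `tokens` (consumed in place)."""
    if not tokens:
        raise ValueError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        children = []
        while tokens and tokens[0] != ")":
            children.append(parse(tokens))
        if not tokens:
            raise ValueError("missing ')'")
        tokens.pop(0)  # discard the closing ")"
        return children
    if token == ")":
        raise ValueError("unexpected ')'")
    return token

def read(text):
    tokens = tokenize(text)
    tree = parse(tokens)
    if tokens:
        raise ValueError("trailing tokens after expression")
    return tree

print(read("(define (square x) (* x x))"))
```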

So when you hear that HTML or C# cannot be parsed by regular expressions, be aware that regular expressions are still a critical part of the parsers. You just cannot parse such languages using only regular expressions and no helper code.

JacquesB