200

There is a popular quote by Jamie Zawinski:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

How is this quote supposed to be understood?

Ixrec
IQAndreas
  • 7
    I was trying to find the context, it's better to guess what he means. It could be nothing much about regex, but a metaphor. Anyway, I found this page: [Source of the famous “Now you have two problems” quote](http://regex.info/blog/2006-09-15/247) – livibetter Oct 11 '10 at 13:47
  • 5
    Consider also "Some people, when faced with a problem, think "I know, I'll dismiss it with a witticism." Now they have zero problems" - http://twitter.com/#!/jongalloway/status/28863872993 – Kate Gregory Oct 29 '10 at 16:46
  • 1
    This type of question is now being [discussed on our meta-discussion site](http://meta.programmers.stackexchange.com/q/2645/8). –  Dec 05 '11 at 20:20
  • 1
    What he might mean is that regular expressions might not be suited for solving the problem. Like parsing HTML ? – James P. Feb 13 '13 at 11:26
  • 1
    I'm not sure which SE site this question belongs on, but I posted it here, because if he means there is a problem with using regular expressions to solve problems, I'm guessing the answer is on-topic here. – IQAndreas Jan 09 '14 at 17:55
  • 46
    The 2nd problem is that they are using regex and still haven't solved the first problem, hence 2 problems. – Ampt Jan 09 '14 at 17:57
  • 5
    Read http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html – Euphoric Jan 09 '14 at 18:00
  • 1
    IOW, according to the quotee, resorting to regex is like saying, "My leg itches; I think I'll chop it off." – B. Clay Shannon-B. Crow Raven Jan 09 '14 at 18:01
  • 2
    @Ampt I still don't follow, you can do [a lot of great things](http://codegolf.stackexchange.com/questions/17676/weird-string-calculation/17745#17745) with Regex, things that would take a hundred lines in plain code. – IQAndreas Jan 09 '14 at 18:06
  • 10
    @IQAndreas Good code is not short. It is understandable and clear. And regexes are far from understandable. Especially the complex ones. – Euphoric Jan 09 '14 at 18:11
  • 24
    @Euphoric - actually, good code *is* short - but without being cryptically concise. –  Jan 09 '14 at 18:30
  • Just like a project I have: create an API (usually in C# as COM visible) for a Perl script. We need a TelNet app - we have one written in Perl. I don't know Perl. I can't use the TelNet source from C#. So, we haven't solved the first problem and now we have a 2nd problem: write a Perl API instead of C#, or write a TelNet class in C#. – IAbstract Jan 09 '14 at 18:43
  • 24
    @IQAndreas: I think it is intended to be semi-humorous. The comment that's being made is that if you are not careful, using regular expressions can make things worse instead of better. – FrustratedWithFormsDesigner Jan 09 '14 at 19:24
  • 145
    Some people, when trying to explain something, think "I know, I'll use a Jamie Zawinski quote." Now they have two things to explain. – detly Jan 09 '14 at 23:22
  • 4
    Jamie disliked and did not 'grok' regexes. It's a pretty stupid quote, as it's an "I don't like broccoli" preference pretending to be some sort of wisdom. – James Anderson Jan 10 '14 at 01:16
  • 3
    @JamesAnderson Actually, when looking at the context of the quote, that does not seem to be the case: http://programmers.stackexchange.com/a/223641/103266 – IQAndreas Jan 10 '14 at 01:19
  • @IQAndreas -- but any tool or facility can be misused; it's not a particular problem with regexes or Perl. – James Anderson Jan 10 '14 at 07:14
  • 3
    @Euphoric Actually it’s typically the other way round: with a little bit of care, even complex regex are *easier* to understand and maintain than equivalent non-regex parsing code. JWZ’s quote is pure FUD, and abused almost as often as Knuth’s adage about premature optimisation (which, despite being often abused, at least has a kernel of truth). – Konrad Rudolph Jan 10 '14 at 10:27
  • 66
    [A programmer had a problem. He thought to himself, "I know, I'll solve it with threads!". has Now problems. two he](http://chat.stackexchange.com/transcript/message/7607698#7607698) – TRiG Jan 10 '14 at 11:06
  • 4
    **Please avoid extended discussions in the comments sections. If you would like to have an extended conversation then visit the [Chat Room](http://chat.stackexchange.com/rooms/21/the-whiteboard)** – maple_shaft Jan 10 '14 at 12:13
  • Just watch your performance and use something like http://rubular.com/. – Fabian Zeindl May 18 '14 at 09:40
  • [Previously](http://www.jwz.org/blog/2014/05/so-this-happened/) – badp May 21 '14 at 09:51
  • I would update this question with link to this post: http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016 – Mateusz Jul 21 '16 at 08:37

17 Answers

219

Some programming technologies are not generally well-understood by programmers (regular expressions, floating point, Perl, AWK, IoC... and others).

These can be amazingly powerful tools for solving the right set of problems. Regular expressions in particular are very useful for matching regular languages. And therein lies the crux of the problem: few people know how to describe a regular language (it's part of computer science theory / linguistics that uses funny symbols - you can read about it under the Chomsky hierarchy).

When dealing with these things, if you use them wrong it is unlikely that you've actually solved your original problem. Using a regular expression to match HTML (a far too common occurrence) will mean that you will miss edge cases. And now, you've still got the original problem that you didn't solve, and another subtle bug floating around that has been introduced by using the wrong solution.
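To make the "missed edge cases" point concrete, here is a minimal Python sketch. The sample markup, the naive pattern, and the helper names (`links_by_regex`, `links_by_parser`, `LinkCollector`) are invented for illustration and are not from the answer; the parser side uses the standard library's `html.parser`.

```python
import re
from html.parser import HTMLParser

# Naive pattern: assumes a lowercase tag, href as the first attribute,
# and double quotes around the value.
NAIVE_LINK = re.compile(r'<a href="([^"]*)">')


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_by_regex(html):
    return NAIVE_LINK.findall(html)


def links_by_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


SAMPLE = (
    '<a href="/plain">ok</a>'
    "<a class='x' href='/single-quoted'>missed by the regex</a>"
    '<A HREF="/upper">also missed</A>'
    '<!-- <a href="/commented-out"> -->'
)

print(links_by_regex(SAMPLE))   # misses two real links, reports a commented-out one
print(links_by_parser(SAMPLE))  # finds the three real links, ignores the comment
```

The regex could be patched for each of these cases, but every patch makes it longer and there is always another case, which is the second problem the quote is joking about.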

This is not to say that regular expressions shouldn't be used, but rather that one should work to understand which problems they can and can't solve, and use them judiciously.

The key to maintaining software is writing maintainable code. Using regular expressions can be counter to that goal. When working with regular expressions, you've written a mini computer (specifically a non-deterministic finite state automaton) in a special domain specific language. It's easy to write the 'Hello world' equivalent in this language and gain rudimentary confidence in it, but going further needs to be tempered with the understanding of the regular language to avoid writing additional bugs that can be very hard to identify and fix (because they aren't part of the program that the regular expression is in).

So now you've got a new problem; you chose the tool of the regular expression to solve it (when it is inappropriate), and you've got two bugs now, both of which are harder to find, because they're hidden in another layer of abstraction.

  • 8
    I'm not sure perl itself belongs in a list of technologies not well-understood by programmers ;) – crad Jan 09 '14 at 22:02
  • 21
    @crad it's more that it has been said about Perl too... Many people have heard it popularized there. I still like the floating point one in the rand talk: "Now you have 2.00000152 problems" –  Jan 09 '14 at 22:05
  • 56
    @crad Some people, when confronted with a problem, think "I know, I'll use perl." Now they have $(^@#%()^%)(#) problems. – Michael Hampton Jan 10 '14 at 00:54
  • I have to point out that regular expressions as implemented by most recent programming languages have little to do with regular languages. They tend to be far more powerful. One could argue, though, that for maintainability reasons they should not be used that way. – Jens Jan 10 '14 at 08:29
  • @MichaelHampton http://twoproblems.com/50 – Gilles 'SO- stop being evil' Jan 10 '14 at 13:25
  • 4
    @Jens if anything, the additional power of the PCRE vs traditional regex makes it a more tempting solution *and* a more difficult to maintain one. The finite automata that the PCRE matches is explored in [Extending Finite Automata to Efficiently Match Perl-Compatible Regular Expressions](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.154.6795)... and it's a non-trivial thing. At least with the traditional regex, one can get their head around it without *too* much trouble once the necessary concepts are understood. –  Jan 10 '14 at 15:23
  • This makes me wonder how many people took the floating-point course versus the finite automata one at my university, where you only need one of the two for a CS degree. (I went for automata, though if I'd had time I would have tried to fit in the FP course as well.) – JAB Jan 10 '14 at 15:48
  • 6
    You make a good point. Regular expressions are effectively a second, non-trivial language. Even if the original programmer is competent in the main language and the flavor of regex used, adding in a "second language" means lower odds that maintainers will know both. Not to mention that regex readability is often lower than the "host" language. – JS. Jan 10 '14 at 18:25
  • Not many people are aware (or at least make use) of [perl's ability to have comments in a regex](http://stackoverflow.com/questions/632795/how-do-you-comment-a-perl-regular-expression). Not sure if other languages allow comments inside the regex, but it helps in writing maintainable code. – jippie Jan 10 '14 at 18:56
  • @jippie P.SE's similar nature of that question - [Readable regular expressions without losing their power?](http://programmers.stackexchange.com/questions/194975/readable-regular-expressions-without-losing-their-power) –  Jan 10 '14 at 20:54
  • @JS. I don’t buy that argument – it’s true for *any* library that you might leverage in a given programming language. Granted, regex are more complex than many libraries, but they are also far more universal, small variations notwithstanding. It’s far more reasonable to expect programmers to know regex than most other libraries – they are a *fundamental* tool of text processing. In fact, if you look at the Unicode definition you’ll see that except for encoding (and normalisation) they are the *only* fundamental tool. – Konrad Rudolph Jan 11 '14 at 10:06
  • ..Know what you mean about floating point. Once had to fix a floating point divide instruction, and had to first write a simulator to simulate how the SPARC machine was executing it because the edge cases were too hard to test otherwise. We take things like floating point for granted, but it is very tricky to get it exactly right. – Elliptical view Jan 11 '14 at 16:04
  • @KonradRudolph: Are you arguing that programmers *should* know regex, or that they are able to effectively write regex? I agree they *should*, but many I know don't. I'm reasonably competent at regex, but *very few* of my programming co-workers are -- however good they may be at C, etc. Competent PERL programmers may have an easier time with regex, but for C, Python, etc., regex is effectively a very different language. I'll make a similar argument for Make. Different paradigms. Many of my very competent programming contemporaries simply aren't competent at regex or Make. Sad, but true. – JS. Jan 13 '14 at 17:30
  • @JS. I’m saying that they *should*, and that it’s so integral that you can safely *force* them to learn it. Do not bow to engineering ignorance, if they contradict best practices. Following this argument to its logical conclusion we would never have gone to the moon because most engineers don’t understand the tools required. The argument is simply invalid. – Konrad Rudolph Jan 13 '14 at 18:37
  • @KonradRudolph: I do not disagree with your point, but the OP asked what the comment meant. In short, if you use Feature X and aren't competent at it, you now have made things worse. – JS. Jan 13 '14 at 18:46
  • @JS. Oh sure. But I feel that an explanation without context obscures a crucial point: that this quote is often (predominantly, in my experience!) used as an [argument from ignorance](https://en.wikipedia.org/wiki/Argument_from_ignorance) against using regex. Your comment on the face of it reads like an endorsement of that position. It seems I’ve misread that. – Konrad Rudolph Jan 13 '14 at 18:52
  • @KonradRudolph a bit that is a good read on the question can also be found at [Codeless Code #121: Where Angels Fear to Thread](http://thecodelesscode.com/case/121) that describes it as "The Proverb of Two Problems". While that particular example deals with threading, all that it says can be said of regex too. –  Jan 13 '14 at 19:00
  • 1
    @KonradRudolph: Like any tool, quotes can be used for both good and ill. I use regex almost daily. Very powerful. Has gotten me out of many jams. I encourage others to learn it. MichaelT's answer helped me to see regex not so much as a "just another library call", but a language unto itself... and very different from many of the languages that "host" it. I also acknowledge regex is just one step above line noise in appearance (write once, read never :-), thus I can understand people's reluctance to invest the effort to learn it. Unfortunate, but part of life. – JS. Jan 13 '14 at 19:07
  • There's no reason to say "non-deterministic". "Finite state automaton" is the same as "deterministic finite state automaton" which is equivalent to a "non-deterministic finite state automaton". – Elliot Gorokhovsky Aug 27 '15 at 03:16
  • @RenéG they are indeed equivalent. However, the mental model that one has when working with `/^([fb]*[oa]+r{0,1})*$/` is more likely that of the NDFA rather than a DFA. Similarly, one could say that lua is equivalent to the bytecode it compiles into when used as a inner language in another program - but people don't think in lua bytecode, they think in the lua. When I see a regular expression, I *see* its NDFA representation rather than its DFA representation. –  Aug 27 '15 at 16:53
  • @MichaelT Fair enough – Elliot Gorokhovsky Aug 27 '15 at 16:54
  • Regexps are too powerful not to use them. My advice when using them is "don't be as clever as you can be when writing regexps", always write regexps that are at least a bit below your level of regexpertise. – Tulains Córdova Jun 22 '17 at 13:52
95

Regular expressions - particularly non-trivial ones - are potentially difficult to code, understand and maintain. You only have to look at the number of questions on Stack Overflow tagged [regex] where the questioner has assumed that the answer to their problem is a regex and has subsequently got stuck. In a lot of cases the problem can (and perhaps should) be solved a different way.

This means that, if you decide to use a regex you now have two problems:

  1. The original problem you wanted to solve.
  2. The support of a regex.

Basically, I think he means you should only use a regex if there's no other way of solving your problem. Another solution is probably going to be easier to code, maintain and support. It may be slower or less efficient, but if that's not critical, ease of maintenance and support should be the overriding concern.

ChrisF
  • 27
    And worse: they're just powerful enough to trick people into trying to use them to parse things they can't, like HTML. See the numerous questions on SO on "how do I parse HTML?" – Frank Shearar Oct 11 '10 at 14:53
  • 6
    For certain situations regex is awesome. In many other cases not so much. At the other end it is a horrifying pit of despair. The problem often arises when someone learns about them for the first time and starts to see applications everywhere. Another famous saying: "When the only tool you have is a hammer, everything looks like a nail." – Todd Williamson Oct 11 '10 at 15:10
  • 3
    Does this mean that by the number of questions in the SO [c#] tag, it is the hardest programming language to understand? –  Oct 29 '10 at 09:45
  • 1
    @Roger - I think C# is an easy language to pick up, but a difficult one to master. c/c++ on the other hand are hard to pick up and hard to master. – ChrisF Oct 29 '10 at 09:58
  • @Chris: I picked C# because it's the most used tag; to point out the problem with "people have lots of questions, it must be difficult." –  Oct 29 '10 at 10:01
  • @Roger - I know ;) – ChrisF Oct 29 '10 at 10:03
  • @Frank: if you make your comment an answer, I will vote for it instead. Yeah, regexps are hard to read, but it's the problem where people think they are a silver bullet for parsing turing-complete input that is the key. – Alex Feinman Oct 29 '10 at 12:56
  • 2
    I would much rather see a complex regular expression than a long series of calls to string methods. OTOH, I really hate seeing regular expressions misused to parse complex languages. – kevin cline May 20 '11 at 03:57
  • 5
    "Basically, I think he means you should only use a regex if there's no other way of solving your problem. Any other solution is going to be easier to code, maintain and support." - seriously disagree.. Regexes are excellent tools, you just have to know their limits. A lot of tasks can be coded more elegantly with regexes. (but, just to make an example, you shouldn't use them to parse HTML) – Karoly Horvath Dec 05 '11 at 23:30
68

It's mostly a tongue-in-cheek joke, albeit with a grain of truth.

There are some tasks for which regular expressions are an excellent fit. I once replaced 500 lines of manually written recursive descent parser code with one regular expression that took around 10 minutes to fully debug. People say regexes are hard to understand and debug, but appropriately-applied ones are not nearly as hard to debug as a huge hand-designed parser. In my example, it took two weeks to debug all the edge cases of the non-regex solution.

However, to paraphrase Uncle Ben:

With great expressivity comes great responsibility.

In other words, regexes add expressivity to your language, but that puts more responsibility on the programmer to choose the most readable mode of expression for a given task.

Some things initially look like a good task for regular expressions, but aren't. For example, anything with nested tokens, like HTML. Sometimes people use a regular expression when a simpler method is more clear. For example, string.endsWith("ing") is easier to understand than the equivalent regex. Sometimes people try to cram a large problem into a single regex, where breaking it into pieces is more appropriate. Sometimes people fail to create appropriate abstractions, repeating a regex over and over instead of creating a well-named function to do the same job (perhaps implemented internally with a regex).
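Two of the points above, sketched in Python rather than the language the answer has in mind: a plain string method versus the equivalent regex, and a regex hidden behind a well-named function instead of being repeated at every call site. The function names and the order-ID format are made up for the example.

```python
import re


def ends_with_ing(word):
    """Plain string method: the intent is obvious at a glance."""
    return word.endswith("ing")


def ends_with_ing_regex(word):
    # Equivalent regex; \Z anchors at the very end of the string.
    return re.search(r"ing\Z", word) is not None


# Hypothetical identifier format (two capital letters, a dash, six digits),
# compiled once and kept private to this module.
_ORDER_ID = re.compile(r"[A-Z]{2}-\d{6}")


def is_order_id(text):
    """Callers see a named predicate; the regex is an implementation detail."""
    return _ORDER_ID.fullmatch(text) is not None


print(ends_with_ing("parsing"), ends_with_ing_regex("parsing"))
print(is_order_id("AB-123456"), is_order_id("AB-12345"))
```

If the format changes, only `_ORDER_ID` needs editing, which is the DRY argument in miniature.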

For some reason, regexes have a weird tendency to create a blind spot to normal software engineering principles like single responsibility and DRY. That's why even people who love them find them problematic at times.

Karl Bielefeldt
  • 10
    Didn't Uncle Ben also say "Perfect results, every time"? Maybe that's why people get so trigger happy with regexes... – Andrzej Doyle Jan 10 '14 at 15:52
  • 4
    The issue with regex regarding HTML that trips up inexperienced developers is that HTML has a context-free grammar, not regular: regex can be used for some simple HTML (or XML) parsing (e.g. grabbing a URL from a named anchor tag), but is not well-suited for anything complex. For that, DOM parsing is more appropriate. Related reading: **[Chomsky hierarchy](http://en.wikipedia.org/wiki/Chomsky_hierarchy)**. –  Mar 03 '15 at 03:17
52

Jeff Atwood brings out a different interpretation in a blog post discussing this very quote: Regular Expressions: Now You Have Two Problems (thanks to Euphoric for the link)

Analyzing the full text of Jamie's posts in the original 1997 thread, we find the following:

Perl's nature encourages the use of regular expressions almost to the exclusion of all other techniques; they are far and away the most "obvious" (at least, to people who don't know any better) way to get from point A to point B.

The first quote is too glib to be taken seriously. But this, I completely agree with. Here's the point Jamie was trying to make: not that regular expressions are evil, per se, but that overuse of regular expressions is evil.

Even if you do fully understand regular expressions, you run into The Golden Hammer problem, trying to solve a problem with regular expressions, when it would have been easier and more clear to do the same thing with regular code (see also CodingHorror: Regex use vs. Regex abuse).

There is another blog post which looks at the context of the quote, and goes into more detail than Atwood: Jeffrey Friedl's Blog: Source of the famous “Now you have two problems” quote

IQAndreas
  • 3
    This is, to my mind, the best answer because it adds context. jwz's criticism of regexes was as much about Perl as anything. – Evicatos Jan 09 '14 at 20:15
  • 3
    @Evicatos There was even more research done on the same 1997 thread in another blog post: http://regex.info/blog/2006-09-15/247 – IQAndreas Jan 09 '14 at 20:30
30

There are a few things going on with this quote.

  1. The quote is a restatement of an earlier joke:

    Whenever faced with a problem, some people say "Lets use AWK." Now, they have two problems. — D. Tilbrook

    It is a joke and a real dig, but it's also a way of highlighting regex as a bad solution by linking it with other bad solutions. It's a great ha ha only serious moment.

  2. To me—mind you, this quote is purposely open to interpretation—the meaning is straightforward. Simply announcing the idea of using a regular expression has not solved the problem. In addition, you've increased the cognitive complexity of the code by adding an additional language with rules that stand apart from whatever language you are using.

  3. Although funny as a joke, you need to compare the complexity of a non-regex solution with the complexity of the regex solution + the additional complexity of including regexes. It may be worthwhile to solve a problem with a regex, despite the additional cost of adding regexes.

Jeffery Thomas
21

RegularExpressionsarenoworsetoreadormaintainthananyotherunformattedcontent;indeedaregexisprobablyeasiertoreadthanthispieceoftexthere-butunfortunatelytheyhaveabadreputationbecausesomeimplementationsdon'tallowformattingandpeopleingeneraldon'tknowthatyoucandoit.

(Regular Expressions are no worse to read or maintain than any other unformatted content; indeed a regex is probably easier to read than this piece of text here - but unfortunately they have a bad reputation because some implementations don't allow formatting and people in general don't know that you can do it.)


Here's a trivial example:

^(?:[^,]*+,){21}[^,]*+$


Which isn't really that difficult to read or maintain anyway, but is even easier when it looks like this:

(?x)    # enables comments, so this whole block can be used in a regex.
^       # start of string

(?:     # start non-capturing group
  [^,]*+  # as many non-commas as possible, but none required
  ,       # a comma
)       # end non-capturing group
{21}    # 21 of previous entity (i.e. the group)

[^,]*+  # as many non-commas as possible, but none required

$       # end of string

That's a bit of an over-the-top example (commenting $ is akin to commenting i++) but clearly there should be no problem reading, understanding, and maintaining that.
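For readers outside Perl, roughly the same layout can be written with Python's `re.VERBOSE` flag. This is a sketch, not the answerer's code: the name `has_22_fields` is invented, and the possessive `*+` is swapped for plain `*` because Python's `re` only accepts possessive quantifiers from version 3.11 and, as a comment below points out, the two behave the same for this anchored pattern.

```python
import re

ROW = re.compile(
    r"""
    ^               # start of string
    (?:             # start non-capturing group
        [^,]*       # as many non-commas as possible, but none required
        ,           # a comma
    ){21}           # exactly 21 of the group
    [^,]*           # the final field, again possibly empty
    $               # end of string
    """,
    re.VERBOSE,
)


def has_22_fields(line):
    """True when the line holds exactly 22 comma-separated fields."""
    return ROW.match(line) is not None


print(has_22_fields(",".join(["x"] * 22)))
print(has_22_fields(",".join(["x"] * 21)))
```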


So long as you're clear as to when regular expressions are suited and when they're a bad idea, there's nothing wrong with them, and most times the JWZ quote doesn't really apply.

Peter Boughton
  • 1
    Sure, but I'm not looking for discussions of the merits of regexs, and I wouldn't like to see this discussion go that way. I'm just trying to understand what he was getting at. – Paul Biggar Oct 11 '10 at 14:05
  • 1
    Then the link in livibetter's comment tells you what you need to know. This response is just pointing out that regexes do not need to be obscure, and thus the quote is nonsense. – Peter Boughton Oct 11 '10 at 14:57
  • 8
    What’s the point of using `*+`? How is that any different (functionally) from just `*`? – Timwi Jan 12 '11 at 18:38
  • You will _love_ the `x` modifier in Perl. –  Aug 07 '11 at 09:21
  • +1: Idontknowwhatregularexpressionsarebutabsolutelylovedthewayyouwrotethefirstfewlines. – KK. Nov 06 '11 at 08:39
  • @PaulBiggar, It seems that the thing JWZ was getting at is the merits of regular expressions. – Caleb Nov 06 '11 at 20:09
  • 1
    While what you say may be true, it doesn't answer this specific question. Your answer boils down to "in my opinion that quote usually isn't true". The question isn't about whether it's true or not, but what the quote means. – Bryan Oakley Dec 05 '11 at 23:40
  • 1
    Also, what's the point of saying `[^,]*+,` instead of `[^,]+`? – Tamás Szelei Dec 06 '11 at 08:20
  • 2
    There's literally no point in doing `*+` in this case; everything is anchored and can be matched in a single pass by an automaton that can count up to 22. The correct modifier on those non-comma sets is just plain old `*`. (What's more, there should also be no differences between greedy and non-greedy matching algorithms here. It's an extremely simple case.) – Donal Fellows Apr 08 '13 at 08:08
14

Regular expressions are very powerful, but they have one small and one big problem: they are hard to write, and near impossible to read.

In a best case the use of the regular expression solves the problem, so then you only have the maintenance problem of the complicated code. If you don't get the regular expression just right, you have both the original problem and the problem with unreadable code that doesn't work.

Sometimes regular expressions are referred to as write-only code. Faced with a regular expression that needs fixing, it's often faster to start from scratch than to try to understand the expression.

gnat
Guffa
  • 1
    The real problem is that regexps cannot implement e.g. a parser since they cannot count how deeply nested they currently are. –  Aug 07 '11 at 09:21
  • 4
    @Thorbjørn Ravn Andersen: That's more of a limitation than a problem. It's only a problem if you try to use regular expressions for that, and then it's not a problem with the regular expressions, it's a problem with your choice of method. – Guffa Aug 07 '11 at 11:28
  • 1
    You can use REs just fine for the lexer (well, for most languages) but assembling the token stream into a parse tree (i.e., _parsing_) is formally beyond them. – Donal Fellows Dec 06 '11 at 15:56
14

In addition to ChrisF's answer - that regular expressions "are difficult to code, understand and maintain", there's worse: they're just powerful enough to trick people into trying to use them to parse things they can't, like HTML. See the numerous questions on SO on "how do I parse HTML?" For instance, the single most epic answer in all of SO!

Frank Shearar
10

The problem is that regex is a complicated beast, and you only solve your problem if you use regex perfectly. If you don't, you end up with 2 problems: your original problem and regex.

You claim that it can do the work of a hundred lines of code, but you could also make the argument that 100 lines of clear, concise code is better than one line of regex.

If you need some proof of this: You can check out this SO Classic or simply comb through the SO Regex Tag

Ampt
  • 8
    Neither of the claims in your first sentence are true. Regex is not particularly complicated, and like no other tool do you need to know it perfectly in order to solve problems with it. That’s just FUD. Your second paragraph is plain ridiculous: *of course* you can make the argument. But it’s not a good one. – Konrad Rudolph Jan 10 '14 at 10:21
  • 1
    @KonradRudolph I think the fact that there are numerous regex generation and validation tools goes to show that regex *is* a complicated mechanism. It's not human readable (by design) and can cause a complete change in flow for someone modifying or writing a piece of code which uses regex. As to the second part, I think it's clear in its implication from the vast grouping of knowledge on P.SE and by the saying "Debugging code is twice as hard as writing it, so if you write the most clever code you can, you are, by definition, not smart enough to debug it" – Ampt Jan 10 '14 at 16:30
  • 2
    That’s not a proper argument. Yes, sure regex are complex. But so are other programming languages. Regex is considerably *less* complex than most other languages, and the tools that exist for regex are dwarfed by development tools for other languages (FWIW I work extensively with regex and I’ve never used such tools …). It’s a simple truth that even complex regex are *simpler* than equivalent non-regex parsing code. – Konrad Rudolph Jan 10 '14 at 17:29
  • @KonradRudolph I think we have a fundamental disagreement about the definition of the word simple then. I'll give you that regex can be more *efficient* or even more *powerful* but I don't think that *simple* is the word that comes to anyone's mind when you think of regex. – Ampt Jan 11 '14 at 04:54
  • Maybe we do but my definition is actionable: I take simple to mean easy to comprehend, easy to maintain, low number of bugs hidden etc. Of course a complex regex will at first glance *not* look very comprehensible. But *the same is true* for an equivalent non-regex piece of code. I’ve never said that regex are simple. I’m saying they’re *simpler* – I’m comparing. That’s important. – Konrad Rudolph Jan 11 '14 at 10:03
  • @Ampt, I don't think regexes are much more complicated than any badly written programming language. For example, a [Perl regex written with best practices standards is pretty readable](https://gist.github.com/smonff/8467532). The argument about the pain of parsing HTML with regexes is true, because nobody will recommend that anybody try to parse (X)HTML with them. – smonff Jan 17 '14 at 02:40
7

The meaning has two parts:

  • First, you didn't solve the original problem.
    This probably refers to the fact that regular expressions often offer incomplete solutions to common problems.
  • Second, you now added additional difficulty associated with the solution you've picked.
    In the case of regular expressions, the additional difficulty probably refers to complexity, maintainability, or the additional difficulty associated with making regular expressions fit a problem they weren't supposed to solve.
tylerl
7

Since you are asking in 2014, it would be interesting to compare the programming-language ideologies of the 1997 context with today's. I will not enter that debate here, but opinions about Perl, and Perl itself, have changed greatly.

However, to stay in a 2013 context (a lot of water has flowed under the bridge since), I would suggest focusing on how the quote gets re-enacted, using a famous XKCD comic that directly quotes Jamie Zawinski's line:

A comic from XKCD about regexes, Perl and problems

At first I had trouble understanding this comic, because it is at once a reference to the Zawinski quote, a quote from a Jay-Z song lyric, and a reference to the GNU program --help -z flag; that was too much culture for me to take in.

I knew it was funny, I could feel it, but I didn't really know why. People often make jokes about Perl and regexes, especially since Perl is not the hippest programming language; I don't really know why that is supposed to be funny... Maybe because Perl mongers do silly things.

So the initial quote seems to be a sarcastic joke based on real-life problems (pain?) caused by programming with tools that hurt. Just as a hammer can hurt a mason, programming with tools that a developer would not have chosen can hurt (the brain, the feelings). Sometimes great debates occur about which tool is the best, but they are almost worthless, because the choice comes down to your taste, your programming team's taste, or cultural or economic reasons. Another excellent XKCD comic about this:

A comic from XKCD about programming tools debates

I can understand people feeling pain about regexes and believing that another tool would be better suited to what regexes are designed for. As @karl-bielefeldt puts it in his answer to your question, with great expressivity comes great responsibility, and this applies especially to regexes. If a developer doesn't care how he or she handles regexes, the result will eventually be a pain for the people who maintain the code later.

I will finish this answer about the re-enactment of quotes with a quote showing a typical example from Damian Conway's Perl Best Practices (a 2005 book).

He explains that writing a pattern like this:

m{'[^\\']*(?:\\.[^\\']*)*'}

...is no more acceptable than writing a program like this:

sub'x{local$_=pop;sub'_{$_>=$_[0
]?$_[1]:$"}_(1,'*')._(5,'-')._(4
,'*').$/._(6,'|').($_>9?'X':$_>8
?'/':$")._(8,'|').$/._(2,'*')._(
7,'-')._(3,'*').$/}print$/x($=).
x(10)x(++$x/10).x($x%10)while<>;

But the pattern can be rewritten; it's still not pretty, but at least it's now survivable.

# Match a single-quoted string efficiently...
m{ '            # an opening single quote
    [^\\']*     # any non-special chars (i.e., not backslash or single quote)
    (?:         # then all of...
    \\ .        #    any explicitly backslashed char
    [^\\']*     #    followed by any non-special chars
    )*          # ...repeated zero or more times
    '           # a closing single quote
}x

This kind of rectangular-shaped code is the second problem, not regexes, which can be formatted in a clear, maintainable and readable way.
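
For readers who do not use Perl, roughly the same layout is available elsewhere. A minimal sketch in Python, assuming the `re.VERBOSE` flag as the counterpart of Perl's `/x` modifier; the name `SINGLE_QUOTED` is made up for illustration and is not from the book:

```python
import re

# Match a single-quoted string; same structure as the commented Perl pattern.
SINGLE_QUOTED = re.compile(
    r"""
    '              # an opening single quote
    [^\\']*        # any non-special chars (not backslash or single quote)
    (?:            # then all of...
        \\ .       #    any explicitly backslashed char
        [^\\']*    #    followed by any non-special chars
    )*             # ...repeated zero or more times
    '              # a closing single quote
    """,
    re.VERBOSE,  # ignore whitespace and allow comments inside the pattern
)
```

For example, `SINGLE_QUOTED.fullmatch(r"'it\'s'")` matches, while an unterminated `'abc` does not.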

smonff
  • 239
  • 1
  • 7
  • 2
    `/* Multiply the first 10 values in an array by 2. */ for (int i = 0 /* the loop counter */; i < 10 /* continue while it is less than 10 */; ++i /* and increment it by 1 in each iteration */) { array[i] *= 2; /* double the i-th element in the array */ }` – 5gon12eder May 11 '15 at 18:14
6

If there is one thing you should learn from computer science, it is the Chomsky hierarchy. I would say that all problems with regular expressions come from attempts to parse context-free grammars with them. When you can impose a limit (or think you can impose a limit) on the nesting levels in a CFG, you get those long and complex regular expressions.
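
A minimal sketch of what that looks like in practice, assuming balanced parentheses as the context-free language. The names `TWO_LEVELS`, `balanced_regex` and `balanced_counter` are hypothetical. The regex is written for at most two levels of nesting and has to be rewritten by hand for every extra level, while a plain counter handles any depth:

```python
import re

# Accepts balanced parentheses nested at most two levels deep.
# Supporting a third level means nesting the pattern once more by hand.
TWO_LEVELS = re.compile(r"(?:[^()]|\((?:[^()]|\([^()]*\))*\))*")


def balanced_regex(text):
    """Depth-limited check: wrong for anything nested deeper than two."""
    return TWO_LEVELS.fullmatch(text) is not None


def balanced_counter(text):
    """Check balance at any depth by tracking the current nesting level."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closing paren with nothing open
                return False
    return depth == 0
```

Both functions agree on `"(a(b)c)"`, but `balanced_regex("((()))")` rejects a perfectly balanced string because it is three levels deep.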

Juha Autero
  • 59
  • 1
  • 1
  • 1
    Yes! People who learn regular expressions without that part of CS background don't always understand that there are just some things that a regex *mathematically* cannot do. – benzado Jun 22 '11 at 16:49
5

Regular expressions are more suitable for tokenisation than for full-scale parsing.

But, a surprisingly large set of things that programmers need to parse are parseable by a regular language (or, worse, almost parseable by a regular language and if you only write a little more code...).

So if one is habituated to "aha, I need to pick text apart, I'll use a regular expression", it's easy to go down that route, when you need something that's closer to a push-down automaton, a CFG parser or even more powerful grammars. That usually ends in tears.

So I think the quote isn't so much slamming regexps (they have their uses and, well used, they're very useful indeed) as the over-reliance on regexps (or, specifically, the uncritical choice of them).
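
A minimal sketch of that division of labour, assuming a tiny arithmetic language with `+`, `*` and parentheses (the grammar and the names `TOKEN`, `tokenize` and `evaluate` are invented for illustration). The regex only splits the text into tokens; the nesting is handled by a small recursive-descent parser:

```python
import re

# Hypothetical token grammar: integers, '+', '*' and parentheses.
TOKEN = re.compile(r"\s*(\d+|[+*()])")


def tokenize(text):
    """Use the regex for what it is good at: cutting text into tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(text):
    """Parse and evaluate; the recursion handles arbitrary nesting."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expr():  # expr := term ('+' term)*
        value = term()
        while peek() == "+":
            advance()
            value += term()
        return value

    def term():  # term := factor ('*' factor)*
        value = factor()
        while peek() == "*":
            advance()
            value *= factor()
        return value

    def factor():  # factor := NUMBER | '(' expr ')'
        token = advance()
        if token == "(":
            value = expr()
            if advance() != ")":
                raise ValueError("expected ')'")
            return value
        if token is None or not token.isdigit():
            raise ValueError(f"unexpected token {token!r}")
        return int(token)

    result = expr()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result
```

`evaluate("2 * (3 + 4)")` gives 14, and deeper nesting needs no change to the token pattern.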

Vatine
  • 4,251
  • 21
  • 20
3

jwz is simply off his rocker with that quote. regular expressions are no different than any language feature - easy to screw up, hard to use elegantly, powerful at times, inappropriate at times, often well documented, often useful.

the same could be said for floating point arithmetic, closures, object-orientation, asynchronous I/O, or anything else you can name. if you don't know what you are doing, programming languages can make you sad.

if you think regexes are hard to read, try reading the equivalent parser implementation for consuming the pattern in question. often regexes win because they are more compact than full parsers...and in most languages, they are faster as well.

don't be put off of using regular expressions (or any other language feature) because a self-promoting blogger makes unqualified statements. try things out for yourself and see what works for you.

Brad Clawsie
  • 740
  • 3
  • 7
  • 1
    FWIW, floating point arithmetic is _waaay more_ tricky than REs, but appears simpler. Beware! (At least tricky REs tend to look dangerous.) – Donal Fellows Dec 06 '11 at 15:59
3

My favourite, in-depth answer to this is given by the famous Rob Pike in a blog post reproduced from an internal Google code comment: http://commandcenter.blogspot.ch/2011/08/regular-expressions-in-lexing-and.html

The summary is that it's not that they are bad, but that they are frequently used for tasks for which they are not necessarily suited, especially when it comes to lexing and parsing some input.

Regular expressions are hard to write, hard to write well, and can be expensive relative to other technologies... Lexers, on the other hand, are fairly easy to write correctly (if not as compactly), and very easy to test. Consider finding alphanumeric identifiers. It's not too hard to write the regexp (something like "[a-ZA-Z_][a-ZA-Z_0-9]*"), but really not much harder to write as a simple loop. The performance of the loop, though, will be much higher and will involve much less code under the covers. A regular expression library is a big thing. Using one to parse identifiers is like using a Ferrari to go to the store for milk.

He says a lot more than that, arguing that regular expressions are useful in, e.g. disposable matching of patterns in text editors but should rarely be used in compiled code, and so on. It's worth a read.
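
To get a feel for the identifier example in the quote, here is a minimal sketch of both versions. The function names are made up, and the pattern is spelled with `a-z` ranges because Python's `re` rejects a range such as `a-Z`. This only compares how much code each approach takes; the performance claim in the quote is about compiled code and is not something a Python sketch can demonstrate:

```python
import re

IDENT_RE = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")


def scan_identifier_regex(text, start=0):
    """Return the identifier beginning at start, or '' if there is none."""
    match = IDENT_RE.match(text, start)
    return match.group(0) if match else ""


def is_ident_start(ch):
    return ch == "_" or "a" <= ch <= "z" or "A" <= ch <= "Z"


def is_ident_char(ch):
    return is_ident_start(ch) or "0" <= ch <= "9"


def scan_identifier_loop(text, start=0):
    """Same result as the regex version, written as a simple loop."""
    if start >= len(text) or not is_ident_start(text[start]):
        return ""
    end = start + 1
    while end < len(text) and is_ident_char(text[end]):
        end += 1
    return text[start:end]
```

Both functions return `"foo_1"` for the input `"foo_1 = 2"` and an empty string for `"9lives"`.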

0

This is related to Alan Perlis' epigram #34:

The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information.

So if you choose the character string as your data structure (and, naturally, regex-based code as the algorithms to manipulate it), you have a problem, even if it works: bad design around an inappropriate representation of data, which is hard to extend and inefficient.

However, often it doesn't work: the original problem isn't solved, and so in that case you have two problems.
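
A minimal sketch of the difference, assuming a made-up `host:port` endpoint as the data being passed around; `port_from_string` and `Endpoint` are hypothetical names. In the first style every consumer repeats its own regex on the raw string, which is the "duplication of process" from the epigram. In the second the text is parsed once and the structure travels instead:

```python
import re
from dataclasses import dataclass


def port_from_string(endpoint):
    """String as data structure: each caller re-parses the text by regex."""
    match = re.fullmatch(r"[^:]+:(\d+)", endpoint)
    if match is None:
        raise ValueError(f"not a host:port string: {endpoint!r}")
    return int(match.group(1))


@dataclass(frozen=True)
class Endpoint:
    """Structured alternative: parse once, then pass the object around."""

    host: str
    port: int

    @classmethod
    def parse(cls, text):
        host, sep, port = text.rpartition(":")
        if not sep or not host or not port.isdecimal():
            raise ValueError(f"not a host:port string: {text!r}")
        return cls(host, int(port))
```

After `endpoint = Endpoint.parse("example.org:8080")`, later code reads `endpoint.port` directly and no longer needs to know how the text was laid out.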

Kaz
  • 3,572
  • 1
  • 19
  • 30
0

Regexes are widely used for quick and dirty text parsing. They are a great tool for expressing patterns that are a little bit more complex than just a plain string match.

However, as regexes get more complex, several issues rear their heads.

  1. The syntax of regexes is optimised for simple matching: most characters match themselves. That is great for simple patterns, but once you have more than a couple of levels of nesting you end up with something that looks more like line noise than well-structured code. I guess you could write a regex as a series of concatenated strings with indentation and comments in between to show the structure of the code, but it seems to be rare for that to actually happen.
  2. Only certain types of text matching are well suited to regexes. Often you get a quick and dirty regex-based parser for some kind of markup language working, but then you try to cover more corner cases and find the regexes getting more and more complex and less and less readable.
  3. The time complexity of a regex may be non-obvious. It is not that difficult to end up with a pattern that works great when it matches but has O(2^n) complexity under certain cases of non-matching.
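
The third point can be tried out with a minimal sketch, assuming a backtracking engine such as the one in CPython's `re` module. The patterns and the helper `time_failed_match` are made up for illustration. Both patterns accept the same strings, but the nested quantifier gives the engine many ways to split a run of `a` characters before it can give up:

```python
import re
import time

NESTED = re.compile(r"(a+)+$")  # nested quantifiers: many ways to split the a's
FLAT = re.compile(r"a+$")       # accepts the same strings with no ambiguity


def time_failed_match(pattern, n):
    """Match against n a's followed by 'b'; return (matched, seconds)."""
    subject = "a" * n + "b"
    started = time.perf_counter()
    matched = pattern.match(subject) is not None
    return matched, time.perf_counter() - started


if __name__ == "__main__":
    # Small sizes only; timings depend on the machine and engine version.
    for n in (12, 14, 16, 18):
        _, nested_time = time_failed_match(NESTED, n)
        _, flat_time = time_failed_match(FLAT, n)
        print(f"n={n:2d}  nested={nested_time:.4f}s  flat={flat_time:.6f}s")
```

On a backtracking engine the nested timings would be expected to grow much faster than the flat ones as `n` increases, even though neither pattern ever matches these inputs.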

Thus it's all too easy to start with a text processing problem, apply regular expressions to it and end up with two problems, the original problem you were trying to solve and dealing with the regular expressions that are attempting to solve (but not solving correctly) the original problem.

Peter Green
  • 2,125
  • 9
  • 15