167

Programmers seem to agree that code readability is far more important than terse one-liners which work but require a senior developer to interpret with any degree of accuracy. Yet that seems to be exactly the way regular expressions were designed. Was there a reason for this?

We all agree that selfDocumentingMethodName() is far better than e(). Why should that not apply to regular expressions as well?

It seems to me that rather than designing a syntax of one-line logic with no structural organization:

var parse_url = /^(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;

And this isn't even strict parsing of a URL!

Instead, we could use a pipeline structure that is organized and readable. As a basic example:

string.regex
   .isRange('A-Z', 'a-z')
   .followedBy('\r');
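
For the sake of concreteness, such a fluent builder could be sketched in JavaScript roughly as follows. This is entirely hypothetical: RegexBuilder, isRange, and followedBy are invented names, and a real library would need many more combinators.

// A toy fluent regex builder -- purely illustrative.
class RegexBuilder {
  constructor() { this.source = ""; }
  isRange(...ranges) {                  // e.g. isRange('A-Z', 'a-z') -> [A-Za-z]
    this.source += "[" + ranges.join("") + "]";
    return this;
  }
  followedBy(text) {                    // append a literal, escaping regex metacharacters
    this.source += text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    return this;
  }
  build() { return new RegExp(this.source); }
}

const re = new RegexBuilder().isRange("A-Z", "a-z").followedBy("\r").build();
console.log(re.source);        // [A-Za-z]\r  (the equivalent terse form)
console.log(re.test("X\r"));   // true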

What advantage does the extremely terse syntax of a regular expression offer, other than being the shortest possible way to express an operation? Ultimately, is there a specific technical reason for the poor readability of regular expression syntax design?

Mark Whitaker
J.Todd
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackexchange.com/rooms/29897/discussion-on-question-by-jonathan-todd-is-there-a-specific-reason-for-the-poor). – maple_shaft Oct 05 '15 at 12:33
  • 1
    I've tried to tackle exactly this readability problem with a library called RegexToolbox. So far it's ported to C#, Java and JavaScript - see https://github.com/markwhitaker/RegexToolbox.CSharp. – Mark Whitaker Sep 04 '17 at 08:38
  • many attempts have been made to resolve this issue, but culture is hard to change. see my answer about [verbal expressions here](https://softwareengineering.stackexchange.com/a/333551/249817). People reach for lowest common tool available. – Parivar Saraff Dec 29 '17 at 06:15
  • Modern day 2022: I'd post a larger answer but I don't have 10 rep to do it. Regular expressions are coded/read just like any other language. It has to be formatted and read vertically just like C++. https://justpaste.it/5mdsf The transition steps from a readable state to production state (in string form) seems a bridge too far for the skeptics. However, modern tools erase the tediousness of this. See Regexformat 9 among others. – sln Nov 08 '22 at 21:01

10 Answers

182

There is one big reason why regular expressions were designed as terse as they are: they were designed to be used as commands to a code editor, not as a language to code in. More precisely, ed was one of the first programs to use regular expressions, and from there regular expressions started their conquest for world domination. For instance, the ed command g/<regular expression>/p soon inspired a separate program called grep, which is still in use today. Because of their power, they were subsequently standardized and used in a variety of tools like sed and vim.

But enough trivia. Why would this origin favor a terse grammar? Because an editor command is not meant to be read even one more time: it suffices that you can remember how to put it together and that it does what you want. However, every character you have to type slows down your editing. The regular expression syntax was designed for writing relatively complex searches in a throw-away fashion, and that is precisely what gives headaches to people who use regexes as code to parse some input to a program.

  • 5
    regex are not meant to parse. otherwise, http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 . and headaches. – njzk2 Sep 30 '15 at 03:34
  • 19
    @njzk2 That answer is actually wrong. An HTML _document_ is not a regular language, but an HTML _open tag_, which is what the question asks about, actually is. – Random832 Sep 30 '15 at 04:30
  • 11
    This is a good answer explaining why original regex is as cryptic as it is, but it does not explain why there is currently no alternative standard with increased readability. – Doc Brown Sep 30 '15 at 05:56
  • 13
    So for those thinking that `grep` is a mispronounced "grab", it comes in fact from `g` / `re`(for regular expression) / `p`? – Hagen von Eitzen Sep 30 '15 at 06:34
  • 3
    @HagenvonEitzen Correct. The `g` stands for global, as in "search every line in the file," and the `p` stands for print (if it matches). – 8bittree Sep 30 '15 at 13:20
  • 6
    @DannyPflughoeft No, it doesn't. An open tag is just ``, there are no tags nested inside it. You can't nest tags, what you nest is _elements_ (open tag, contents including child elements, close tag), which the question was _not_ asking about parsing. HTML _tags_ are a regular language - the balancing/nesting occurs at a level above tags. – Random832 Sep 30 '15 at 17:31
  • 1
    @DocBrown: There is: BNF. In fact it predates regexp, and if I'm not mistaken the first implementation of a regexp compiler used BNF (or rather, yacc) to define regexp syntax. Yes, they're not exactly the same, but since people nowadays use regexp for syntax parsing (what BNF was designed for), it seems fitting that BNF can be used/abused for string parsing. – slebetman Oct 01 '15 at 03:17
  • 1
    @Random832: what about `` – mike3996 Oct 01 '15 at 05:52
  • @slebetman: thanks for this insight, I was aware of BNF, but not for this purpose. – Doc Brown Oct 01 '15 at 05:57
  • 1
    @progo Just because you put data that looks like a pair of tags there doesn't mean that's what is actually required there. Any sequence of characters (except & < and ") and entity references is allowed inside the quotes. And anyway, < is not actually allowed in that position, so your example is not a valid tag. The relevant portion of such a regex would be `"([^"<&]|&...;)*"` [the ... is something I'm not bothering to write now to match an entity reference] – Random832 Oct 01 '15 at 06:00
  • @Random832: you may be right, I did forget about escaped entities. – mike3996 Oct 01 '15 at 08:23
  • @progo The wider point is that the content of attributes doesn't matter and therefore wouldn't be part of the regex. You can put `att="(1 (2 3) 4)"` too, but that doesn't mean the regex has to check for balanced parentheses. – Random832 Oct 01 '15 at 15:35
  • 3
    @Random832: actually the regex *does* have to check for balanced quotes. – mike3996 Oct 01 '15 at 15:39
  • 3
    @progo I said parentheses. I don't even know what you mean by balanced quotes, there's no such thing as left and right quote marks or strings that contain nested quoted strings inside them. There's no "START1 ... START2 ... END2 ... END1" construct (where both kinds of pairs are the same and can nest indefinitely), and _that_ is what regex can't parse. – Random832 Oct 01 '15 at 15:41
  • Here's a regex - it's simplified [only parses XML start tags (no unquoted attributes for example), and no non-ASCII], but should be enough to prove that "open tag" is a regular language: `<[A-Za-z:_][A-Za-z0-9:_-.]*(\s+[A-Za-z:_][A-Za-z0-9:_-.]*\s*=\s*("([^<&"]|&(#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z:_][A-Za-z0-9:_-.]*))"|'([^<&']|&(#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z:_][A-Za-z0-9:_-.]*))'))*\s*>` – Random832 Oct 01 '15 at 15:58
  • And let's not forget that any finite language is by definition a regular language, so if you limit your HTML files to a maximum size (say 1 gigabyte), then a regular expression matching all the valid HTML strings but none of the non-valid ones exists. And "exists" is the operational word here :) – biziclop Oct 02 '15 at 15:20
  • I think at least part of the reason for the terseness of the syntax was also that network connections (and as a consequence, networked terminal connections) were much, much slower back when all of this was happening, so a lot of work went into minimizing the number of keystrokes needed to do pretty much anything. The preference for readability over conciseness is a comparatively modern phenomenon. And though that preference may be correct, it's far too late to replace regex syntax with something else at this point. – aroth Oct 03 '15 at 03:25
  • 1
    Regular expressions were not designed to be used as commands at all; they were invented by mathematicians to describe regular languages. Hence, they are more like math equations than program code. Later, much syntactic sugar was added, making them even more complicated. Add to that the complications of Perl regexes, which are in fact more powerful than just regular (they can describe some non-regular languages, but not all context-free languages), and you have the unreadable mess that regexes now are in most programming languages. Regex originated in 1956; `ed` was first released in 1971. – Polygnome Oct 03 '15 at 09:17
  • 1
    regex are NOT standardized in any way, but for programming languages perl-style regex are the most commonly used. – Polygnome Oct 03 '15 at 09:17
  • If I remember correctly, the syntax used in ed in the early 1980's was significantly simpler than the current syntax, so the terseness was less of a problem. – Patricia Shanahan Oct 03 '15 at 09:20
  • 1
    @Polygnome `man 3 regex` says "conforming to POSIX.1-2001", at least on my system. – cmaster - reinstate monica Oct 03 '15 at 09:21
  • @slebetman BNF predates `ed`, but Kleene and Backus were working on regular expressions and metalanguages at about the same time, in the mid 1950s. – Ross Patterson Oct 04 '15 at 12:34
  • 2
    FWIW, `vim` **is** `ed`. Well, `vim` is the reimplementation of `vi` which was a shorthand to start `ex` -- an improved version of `ed` -- in "**vi**sual mode". Yes, Vim has *that* old a pedigree. ;-) – DevSolar Oct 05 '15 at 12:39
61

The regular expression you cite is a terrible mess and I don't think anyone agrees that it's readable. At the same time, much of that ugliness is inherent to the problem being solved: There are several layers of nesting and the URL grammar is relatively complicated (certainly too complicated to communicate succinctly in any language). However, it's certainly true that there are better ways to describe what this regex is describing. So why aren't they used?

A big reason is inertia and ubiquity. It doesn't explain how they became so popular in the first place, but now that they are, anyone who knows regular expressions can use these skills (with very few differences between dialects) in a hundred different languages and an additional thousand software tools (e.g., text editors and command line tools). By the way, the latter wouldn't and couldn't use any solution that amounts to writing programs, because they are heavily used by non-programmers.

Despite that, regular expressions are often overused, that is, applied even when another tool would be much better. I don't think regex syntax is terrible. But it is clearly much better at short and simple patterns: The archetypal example of identifiers in C-like languages, [a-zA-Z_][a-zA-Z0-9_]* can be read with an absolute minimum of regex knowledge and once that bar is met it is both obvious and nicely succinct. Requiring fewer characters is not inherently bad, quite the opposite. Being concise is a virtue provided you remain comprehensible.

There are at least two reasons why this syntax excels at simple patterns like these: It doesn't require escaping for most characters, so it reads relatively naturally, and it uses all available punctuation to express a variety of simple parsing combinators. Perhaps most importantly, it does not require anything at all for sequencing. You write the first thing, then the thing that comes after it. Contrast this with your followedBy, especially when the following pattern is not a literal but a more complicated expression.
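
To see why implicit sequencing matters, compare the identifier pattern as an ordinary JavaScript regex with what an explicit combinator style might look like. The combinator version is a hypothetical sketch; seq, charClass, range, repeat0, and literal are invented names.

// Terse form: sequencing costs nothing -- write one thing after the other.
const identifier = /[a-zA-Z_][a-zA-Z0-9_]*/;
console.log(identifier.test("my_var1")); // true

// Hypothetical verbose equivalent, where every step must be spelled out:
// const identifier = seq(
//   charClass(range("a", "z"), range("A", "Z"), literal("_")),
//   repeat0(charClass(range("a", "z"), range("A", "Z"), range("0", "9"), literal("_")))
// );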

So why do they fall short in more complicated cases? I can see three main problems:

  1. There are no abstraction capabilities. Formal grammars, which originate from the same field of theoretical computer science as regexes, have a set of productions, so they can give names to intermediate parts of the pattern (for one way to approximate this in ordinary code, see the JavaScript sketch after this list):

    # This is not equivalent to the regex in the question
    # It's just a mock-up of what a grammar could look like
    url      ::= protocol? '/'? '/'? '/'? (domain_part '.')+ tld
    protocol ::= letter+ ':'
    ...
    
  2. As we could see above, whitespace having no special significance is useful, because it permits formatting that is easier on the eyes. The same goes for comments. Regular expressions can't offer that, because a space is just that: a literal ' '. Note though that some implementations allow a "verbose" mode where whitespace is ignored and comments are possible.

  3. There is no meta-language to describe common patterns and combinators. For example, one can write a digit rule once and keep using it in a context free grammar, but one cannot define a "function" so to speak that is given a production p and creates a new production that does something extra with it, for example create a production for a comma separated list of occurrences of p.
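
To illustrate points 1 and 2 in ordinary code: the URL regex from the question can be rebuilt from named string fragments, recovering both naming and layout. A minimal JavaScript sketch (the fragment names are invented; this matches no more strictly than the original):

// Named fragments play the role of grammar productions.
const protocol = "(?:([A-Za-z]+):)?";
const slashes  = "(\\/{0,3})";
const host     = "([0-9.\\-A-Za-z]+)";
const port     = "(?::(\\d+))?";
const path     = "(?:\\/([^?#]*))?";
const query    = "(?:\\?([^#]*))?";
const fragment = "(?:#(.*))?";

const parseUrl = new RegExp(
  "^" + protocol + slashes + host + port + path + query + fragment + "$"
);
console.log(parseUrl.exec("https://example.com:8080/a/b?x=1#top"));
// [..., "https", "//", "example.com", "8080", "a/b", "x=1", "top"]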

The approach that you propose certainly solves these problems. It just doesn't solve them very well, because it trades away far more conciseness than necessary. The first two problems can be solved while remaining within a relatively simple and terse domain-specific language. The third, well... a programmatic solution requires a general-purpose programming language, of course, but in my experience the third is by far the least of those problems. Few patterns have enough occurrences of the same complex task that the programmer yearns for the ability to define new combinators. And when this is necessary, the language is often complicated enough that it can't and shouldn't be parsed with regular expressions anyway.

Solutions for those cases exist. There are approximately ten thousand parser combinator libraries which do roughly what you propose, just with a different set of operations, often different syntax, and almost always with more parsing power than regular expressions (i.e., they deal with context-free languages or some sizable subset of those). Then there are parser generators, which go with the "use a better DSL" approach described above. And there's always the option of writing some of the parsing by hand, in proper code. You can even mix-and-match, using regular expressions for simple sub-tasks and doing the complicated things in the code invoking the regexes.
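
As a flavour of what such parser combinator libraries look like, here is a deliberately tiny JavaScript sketch. All names are invented, and real libraries return structured results rather than just a position; this only shows the composition style.

// A toy parser-combinator sketch: each parser takes (input, pos) and
// returns the position after a successful match, or -1 on failure.
const lit = (s) => (input, pos) =>
  input.startsWith(s, pos) ? pos + s.length : -1;

const seq = (...ps) => (input, pos) => {
  for (const p of ps) {
    pos = p(input, pos);
    if (pos < 0) return -1;
  }
  return pos;
};

const many1 = (p) => (input, pos) => {
  let next = p(input, pos);
  if (next < 0) return -1;
  while (next >= 0) { pos = next; next = p(input, pos); }
  return pos;
};

const range = (lo, hi) => (input, pos) =>
  pos < input.length && input[pos] >= lo && input[pos] <= hi ? pos + 1 : -1;

// letter+ ':' -- roughly the "protocol" production from the grammar above.
const protocol = seq(many1(range("a", "z")), lit(":"));
console.log(protocol("http://example.com", 0)); // 5 (matched "http:")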

I don't know enough about the early years of computing to explain how regular expressions came to be so popular. But they're here to stay. You just have to use them wisely, and not use them when that is wiser.

Tulains Córdova
  • 9
    `I don't know enough about the early years of computing to explain how regular expressions came to be so popular.` We can hazard a guess though: a basic regular expression engine is very easy to implement, much easier than an efficient context-free parser. – biziclop Sep 29 '15 at 19:08
  • 15
    @biziclop I wouldn't overestimate this variable. Yacc, which apparently had enough predecessors to be called "*yet another* compiler compiler", was created in the early 70s, and was included in Unix a version before `grep` was (Version 3 vs Version 4). It appears the first major use of regex was in 1968. –  Sep 29 '15 at 19:15
  • I can only go on what I found on Wikipedia (so I wouldn't believe it 100%) but according to that, `yacc` was created in 1975, the whole idea of LALR parsers (which were amongst the first class of practically usable parsers of their kind) originated in 1973. Whereas the first regexp engine implementation which JIT compiled expressions(!) was published in 1968. But you're right, it's hard to say what swung it, in fact it's hard to say when regexes started to "take off". But I'd suspect once they were put in text editors developers used, they wanted to use them in their own software too. – biziclop Sep 29 '15 at 21:36
  • Fun fact, the `terrible mess` I cite was written by one of JavaScript's most revered developers, Douglas Crockford, currently employed as a senior developer at PayPal, where poorly written RegEx's could be a serious problem. – J.Todd Sep 29 '15 at 23:11
  • @JonathanTodd Citation needed. My guess is there's some context. – jpmc26 Sep 29 '15 at 23:27
  • 1
    @jpmc26 open his book, **JavaScript** *The Good Parts* to the Regex Chapter. – J.Todd Sep 29 '15 at 23:29
  • 2
    `with very few differences between dialects` I wouldn't say it's "very few". Any predefined character class has several definitions between different dialects. And there are also parsing quirks specific to each dialect. – nhahtdh Sep 30 '15 at 04:42
  • "the latter wouldn't and couldn't use any solution that amounts to writing programs, because they are heavily used by non-programmers" - are there any actual non-programmers who know regular expressions? I always thought of regexes as a special-purpose programming language in its own right. – Emilio M Bumachar Oct 02 '15 at 15:00
  • Actually, RegEx *does* provide a way to abstract parts of it: groups, which allow you to name parts of the match and extract their value by referencing the name of the group. Don't ask me to recall the syntax off the top of my head, though. ;) – Chris Bordeman Oct 04 '15 at 00:39
  • @ChrisBordeman That is not what I meant. Groups allow you to name *matches*, what would be needed for making complex regexes readable would be naming *parts of the regex*, so that e.g. you could define e-mail address regex as (made up syntax) `${name}@${domain}`. –  Oct 04 '15 at 06:42
39

Historical perspective

The Wikipedia article is quite detailed about the origins of regular expressions (Kleene, 1956). The original syntax was relatively simple with only *, +, ?, | and grouping (...). It was terse (and readable, the two are not necessarily opposed), because formal languages tend to be expressed with terse mathematical notations.

Later, the syntax and capabilities evolved with editors and grew with Perl, which was trying to be terse by design ("common constructions should be short"). This complicated the syntax a lot, but note that people are now accustomed to regular expressions and are good at writing (if not reading) them. The fact that they are sometimes write-only suggests that when they are too long, they are generally not the right tool. Regular expressions tend to be unreadable when they are being abused.

Beyond string-based regular expressions

Speaking about alternative syntaxes, let's have a look at one that already exists (cl-ppcre, in Common Lisp). Your long regular expression can be parsed with ppcre:parse-string as follows:

(let ((*print-case* :downcase)
      (*print-right-margin* 50))
  (pprint
   (ppcre:parse-string "^(?:([A-Za-z]+):)?(\\/{0,3})([0-9.\\-A-Za-z]+)(?::(\\d+))?(?:\\/([^?#]*))?(?:\\?([^#]*))?(?:#(.*))?$")))

... and results in the following form:

(:sequence :start-anchor
 (:greedy-repetition 0 1
  (:group
   (:sequence
    (:register
     (:greedy-repetition 1 nil
      (:char-class (:range #\A #\Z)
       (:range #\a #\z))))
    #\:)))
 (:register (:greedy-repetition 0 3 #\/))
 (:register
  (:greedy-repetition 1 nil
   (:char-class (:range #\0 #\9) #\. #\-
    (:range #\A #\Z) (:range #\a #\z))))
 (:greedy-repetition 0 1
  (:group
   (:sequence #\:
    (:register
     (:greedy-repetition 1 nil :digit-class)))))
 (:greedy-repetition 0 1
  (:group
   (:sequence #\/
    (:register
     (:greedy-repetition 0 nil
      (:inverted-char-class #\? #\#))))))
 (:greedy-repetition 0 1
  (:group
   (:sequence #\?
    (:register
     (:greedy-repetition 0 nil
      (:inverted-char-class #\#))))))
 (:greedy-repetition 0 1
  (:group
   (:sequence #\#
    (:register
     (:greedy-repetition 0 nil :everything)))))
 :end-anchor)

This syntax is more verbose and, if you look at the comments below, not necessarily more readable. So don't assume that because you have a less compact syntax, things will automatically be clearer.

However, if you start having trouble with your regular expressions, turning them into this format might help you decipher and debug your code. This is one advantage over string-based formats, where a single character error can be difficult to spot. The main advantage of this syntax is to manipulate regular expressions using a structured format instead of a string-based encoding. That allows you to compose and build such expressions like any other data structure in your program. When I use the above syntax, it is generally because I want to build expressions from smaller parts (see also my CodeGolf answer). For your example, we may write [1]:

`(:sequence
   :start-anchor
   ,(protocol)
   ,(slashes)
   ,(domain)
   ,(top-level-domain) ... )

String-based regular expressions can also be composed, using string concatenation and/or interpolation wrapped in helper functions. However, there are limitations to string manipulation which tend to clutter the code (think about nesting problems, not unlike backticks vs. $(...) in bash; also, escape characters may give you headaches).
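
A small JavaScript illustration of the escape-character headache mentioned above (digits and pair are invented names): the same pattern needs one backslash as a regex literal but two in string form.

// As a regex literal, one backslash suffices...
const digitsLiteral = /\d+/;

// ...but in string form, for composition, the backslash itself must be escaped.
const digits = "\\d+";
const pair = new RegExp(digits + "," + digits);
console.log(pair.test("12,34")); // true

// Forgetting the double escape fails silently:
console.log(new RegExp("\d+").source); // "d+" -- the \ was lost in the string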

Note also that the above form allows (:regex "string") forms so that you can mix terse notations with trees. All of that leads IMHO to good readability and composability; it addresses the three problems expressed by delnan, indirectly (i.e. not in the language of regular expressions itself).

To conclude

  • For most purposes, the terse notation is in fact readable. There are difficulties when dealing with extended notations which involve backtracking, etc., but their use is rarely justified. The unwarranted use of regular expressions can lead to unreadable expressions.

  • Regular expressions need not be encoded as strings. If you have a library or a tool that can help you build and compose regular expressions, you'll avoid a lot of potential bugs related to string manipulations.

  • Alternatively, formal grammars are more readable and are better at naming and abstracting sub-expressions. Terminals are generally expressed as simple regular expressions.


[1] You may prefer to build your expressions at read-time, because regular expressions tend to be constants in an application. See create-scanner and load-time-value:

'(:sequence :start-anchor #.(protocol) #.(slashes) ... )
coredump
  • 5
    Maybe I'm just used to traditional RegEx syntax, but I'm not so sure that 22 somewhat readable lines are easier to understand than the equivalent one line regex. –  Sep 30 '15 at 06:04
  • 3
    @dan1111 "somewhat readable" ;-) Okay, but if you need to have a really long regex, it makes sense to define subsets, like `digits`, `ident`, and compose them. They way I see it done is generally with string manipulations (concatenation or interpolation), which brings other problems like proper escaping. Search for occurences of `\\\\\`` in emacs packages, for example. Btw, this is made worse because the same escape character is used both for special characters like `\n` and `\"` and for regex syntax `\(`. A non-lisp example of good syntax is `printf`, where `%d` does not conflict with `\d`. – coredump Sep 30 '15 at 07:20
  • 1
    fair point about the defined subsets. That makes a lot of sense. I'm just skeptical that the verbosity is an improvement. It might be easier for beginners (though concepts like `greedy-repetition` are not intuitive and still have to be learned). However, it sacrifices usability for experts, since it is much harder to see and grasp the whole pattern. –  Sep 30 '15 at 13:02
  • @dan1111 I agree that verbosity by itself is not an improvement. What can be an improvement is manipulating regex using structured data instead of strings. – coredump Sep 30 '15 at 13:14
  • @dan1111 Maybe I should propose an edit using Haskell? Parsec does it in only nine lines; as a one-liner: `do {optional (many1 (letter) >> char ':'); choice (map string ["///","//","/",""]); many1 (oneOf "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-."); optional (char ':' >> many1 digit); optional (char '/' >> many (noneOf "?#")); optional (char '?' >> many (noneOf "#")); optional (char '#' >> many (noneOf "\n")); eof}`. With a few lines like designating the long string as `domainChars = ...` and `section start p = optional (char start >> many p)` it looks pretty simple. – CR Drost Oct 01 '15 at 16:01
  • The long and short of it is that regular expressions originated in a mathematical thesis. In mathematics, notation is geared towards the tersest, simplest expression, not readability (with some caveats). Regular expressions were implemented more-or-less "as-is" in code from the mathematical ideas (at a time when brevity was required for memory considerations), and once that syntax took hold, it became the standard. So the reason is "historical reasons". – user1936 Oct 02 '15 at 14:52
25

The biggest problem with regex isn't the overly terse syntax; it's that we try to express a complex definition in a single expression, instead of composing it from smaller building blocks. This is similar to a programming style where you never use variables and functions and instead embed all your code in a single line.

Compare regex with BNF. Its syntax isn't that much cleaner than regex, but it's used differently. You start by defining simple named symbols and compose them until you arrive at a symbol describing the whole pattern you want to match.

For example look at the URI syntax in rfc3986:

URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
hier-part     = "//" authority path-abempty
              / path-absolute
              / path-rootless
              / path-empty
...

You could write nearly the same thing using a variant of the regex syntax that supports embedding named sub-expressions.
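
Standard regex dialects offer nothing like this, but most languages can approximate it. A JavaScript sketch using template literals and a named capture group (ALPHA, DIGIT, and scheme are invented names, following the ABNF above):

const ALPHA = "[A-Za-z]";
const DIGIT = "[0-9]";

// scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
const scheme = `${ALPHA}(?:${ALPHA}|${DIGIT}|[+\\-.])*`;

const uri = new RegExp(`^(?<scheme>${scheme}):`);
console.log("https://example.com".match(uri).groups.scheme); // "https"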


Personally, I think a terse regex-like syntax is fine for commonly used features like character classes, concatenation, choice, or repetition, but for more complex and rarer features like look-ahead, verbose names are preferable. Quite similar to how we use operators like + or * in normal programming and switch to named functions for rarer operations.

CodesInChaos
12

selfDocumentingMethodName() is far better than e()

Is it? There's a reason most languages use { and } as block delimiters rather than BEGIN and END.

People like terseness, and once you know the syntax, short terminology is better. Imagine your regex example if \d (for digit) were spelled 'digit': the regex would be even more horrible to read. If you made it more readily parseable with control characters, it would look more like XML. Neither is as good once you know the syntax.

To answer your question properly though, you have to realise that regex comes from the days when terseness was mandatory. It's easy to think a 1 MB XML document is no big deal today, but we're talking about days when 1 MB was pretty much your entire storage capacity. There were also fewer languages in use back then, and regex isn't a million miles away from Perl or C, so the syntax would be familiar to programmers of the day, who would be happy to learn it. So there was no reason to make it more verbose.

gbjbaanb
  • 1
    `selfDocumentingMethodName` is *generally agreed* to be better than `e` because programmer intuition doesn't line up with [reality in terms of what actually constitutes readability or good quality code](http://programmers.stackexchange.com/questions/185660/is-the-average-number-of-bugs-per-loc-the-same-for-different-programming-languag). The people doing the agreeing are wrong, but that's how it is. – Alex Celeste Oct 01 '15 at 08:18
  • @Leushenko in a normal program sure, but when writing an expression, short codes are often much better - \d for digit _within a regex_ is much more readable than AnyDigitOrOtherNonNumericCharacter. It's the context of where these things are used that matters; you wouldn't write a function called e() in your usual language for the same reason. – gbjbaanb Oct 01 '15 at 08:38
  • 1
    @Leushenko: Are you claiming that `e()` is better than `selfDocumentingMethodName()`? – JacquesB Oct 01 '15 at 19:37
  • 3
    @JacquesB maybe not in all contexts (like a global name). But for tightly-scoped things? Almost certainly. Definitely more often than the conventional wisdom says. – Alex Celeste Oct 01 '15 at 22:26
  • 1
    @Leushenko: I have a hard time imagining a context were a single letter function name is better than a more descriptive name. But I guess this is pure opinion. – JacquesB Oct 05 '15 at 11:02
  • @JacquesB function names, probably nowhere.. but variables are regularly used (eg `for (int i =0; i< 10`) etc. Now for a syntax such as sed where s means search-replace and g means globally, these are also better than descriptive names in that context. You wouldn't create a function in C# called s() though. – gbjbaanb Oct 05 '15 at 11:59
  • @Leushenko: Fair enough, but both the answer and your comment uses function names as examples, which I why I balked. – JacquesB Oct 05 '15 at 12:36
  • @JacquesB From a very quick glance at your profile it looks like you primarily use Java (correct me if I'm wrong). In most other languages, variables can hold references to functions, so the idea of 'function names' and 'variable names' aren't as distinct as they might be to the average Java dev. – Miles Rout Oct 06 '15 at 20:25
  • 1
    @MilesRout: The example is actually for `e()` versus a self documenting *method* name. Can you explain in which context it is an improvement to use single-letter method names rather than descriptive method names? – JacquesB Oct 07 '15 at 10:01
6

Regex is like Lego pieces. At first glance, you see some differently shaped plastic parts that can be joined. You might think there would not be too many different things you could shape, but then you see the amazing things other people do, and you just wonder what an amazing toy it is.

Regex is like Lego pieces. There are only a few building blocks, but chaining them in different forms produces millions of different regex patterns, which can be used for many complicated tasks.

People rarely use regex features alone. Many languages offer functions to check the length of a string or split the numeric parts out of it, and you can use string functions to slice text and reassemble it. The power of regex is noticed when you use complex forms to do very specific, complex tasks.

You can find tens of thousands of regex questions on SO, and they are rarely marked as duplicates. That alone shows how many unique use cases there are, each very different from the others.

And it is not easy to offer predefined methods to handle that many different, unique tasks. You have string functions for the simple cases, but if those functions are not enough for your specific task, then it is time to use regex.

Mp0int
2

I see this as a problem of practice rather than of capability. The problem usually arises when regular expressions are written monolithically, instead of being composed from parts. Similarly, a good programmer decomposes the functions of his program into concise methods.

For example, a regex string for a URL could be reduced from approximately:

UriRe = [scheme][hier-part][query][fragment]

to:

UriRe = UriSchemeRe + UriHierRe + "(/?|/" + UriQueryRe + UriFragRe + ")"
UriSchemeRe = [scheme]
UriHierRe = [hier-part]
UriQueryRe = [query]
UriFragRe = [fragment]
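
In JavaScript, that decomposition could look like the following sketch. The sub-patterns here are simplified stand-ins, not the full RFC 3986 productions the placeholders refer to.

// Simplified stand-ins for the real [scheme], [hier-part], etc.
const UriSchemeRe = /[A-Za-z][A-Za-z0-9+\-.]*:/;
const UriHierRe   = /\/{0,3}[0-9.\-A-Za-z]+/;
const UriQueryRe  = /(?:\?[^#]*)?/;
const UriFragRe   = /(?:#.*)?/;

// Compose RegExp objects through their .source strings.
const UriRe = new RegExp(
  "^" + UriSchemeRe.source + UriHierRe.source +
        UriQueryRe.source + UriFragRe.source + "$"
);
console.log(UriRe.test("https://example.com?q=1#top")); // true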

Regular expressions are nifty things, but they are prone to abuse by those who become absorbed in their apparent complexity. The resulting expressions are showpieces without long-term value.

toplel32
  • 2
    Unfortunately most programming languages don't include functionality that helps with composing regexes and the way group capture works isn't very friendly to composition either. – CodesInChaos Sep 30 '15 at 11:35
  • 1
    Other languages need to catch up to Perl 5 in their "perl compatible regular expression" support. Subexpressions are not the same as simply concatenating strings of regex specification. Captures should be named, not relying on implicit numbering. – JDługosz Sep 30 '15 at 17:52
0

As @cmaster says, regexps were originally designed to be used only on-the-fly, and it is simply bizarre (and slightly depressing) that the line-noise syntax is still the most popular one. The only explanations I can think of involve either inertia, masochism, or machismo (it's not often that ‘inertia’ is the most appealing reason for doing something...)

Perl makes a rather weak attempt at making them more readable by allowing whitespace and comments, but doesn't do anything remotely imaginative.
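
JavaScript, for example, has no counterpart to Perl's whitespace-and-comments mode, but it is easy to approximate with a small helper. This is a sketch: verbose is an invented name, and it would break on patterns that need literal spaces or # characters.

// Build a RegExp from a readable layout by stripping comments and whitespace.
function verbose(parts) {
  return new RegExp(parts.raw.join("").replace(/#.*$|\s+/gm, ""));
}

const time = verbose`
  ^ ( [01]?[0-9] | 2[0-3] )   # hours
  : ( [0-5][0-9] )            # minutes
  $
`;
console.log(time.test("23:59")); // true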

There are other syntaxes. A good one is the scsh syntax for regexps, which in my experience produces regexps which are reasonably easy to type, but still readable after the fact.

[scsh is splendid for other reasons, just one of which is its famous acknowledgements text]

Norman Gray
  • 2
    Perl6 does! Look at grammars. – JDługosz Sep 30 '15 at 00:59
  • @JDługosz As far as I can see, that looks more like a mechanism for parser generators, rather than an alternative syntax for regular expressions. But the distinction is maybe not a deep one. – Norman Gray Sep 30 '15 at 15:31
  • It can be a replacement, but isn't limited to the same power. You could translate a regexp into an inline grammar with 1-to-1 correspondence of the modifiers, but in a more readable syntax. Examples promoting it as such are in the original Perl Apocalypse. – JDługosz Sep 30 '15 at 17:47
0

I believe regular expressions were designed to be as 'general' and simple as possible, so they can be used (roughly) the same way anywhere.

Your example of regex.isRange(..).followedBy(..) is coupled both to a specific programming language's syntax and perhaps to an object-oriented style (method chaining).

How would this exact 'regex' look in C, for example? The code would have to be changed.

The most 'general' approach would be to define a simple, concise language which can then be easily embedded in any other language without change. And that's (almost) what regexes are.

Aviv Cohn
0

Perl-Compatible Regular Expression engines are widely used, providing a terse regular expression syntax that many editors and languages understand. As @JDługosz pointed out in comments, Perl 6 (not just a new version of Perl 5, but a wholly different language) has attempted to make regular expressions more readable by building them up from individually-defined elements. For example, here is an example grammar for parsing URLs from Wikibooks:

grammar URL {
  rule TOP {
    <protocol>'://'<address>
  }
  token protocol {
    'http'|'https'|'ftp'|'file'
  }
  rule address {
    <subdomain>'.'<domain>'.'<tld>
  }
  ...
}

Splitting up the regular expression like this allows each bit to be individually defined (e.g. constraining domain to be alphanumeric) or extended through subclassing (e.g. FileURL is a URL that constrains protocol to only be "file").

So: no, there is no technical reason for the terseness of regular expressions, but newer, cleaner and more readable ways to represent them are already here! So hopefully we'll see some new ideas in this field.

Gaurav