A better way of doing Regex?

Question

I really dislike regular expressions; each time I come back to them I seem to have to relearn them. They are also incredibly hard to maintain and modify, and it is hard to see at a glance what one is doing.

Has anyone ever tried writing another layer on top of it that turns more semantic, SQL-like statements into regex? I imagine it working along the lines of:

AnotherString = "coffee hello beep 15"

FindString.StartsWith string longer than 5
FindString.Contains "beep" after "hello"
FindString.EndsWith int < 20
FindString.DoesntContain "no!!" and DoesntContain "what!"
Foreach FindString match in AnotherString
    ...
Next

This is probably not the greatest example ever, but the idea is that the pattern is built with some sort of semantically meaningful language that can be broken down into traditional regular expressions. The above would be a lot easier for a developer to modify. I sort of envision it being like SQL/Linq to some degree.

It would make regex a lot more semantic and maintainable. Has this been tried before, and is it a bad/good idea to try this? Could it work?

Edit

Perhaps this is a better example (I know URLs are notoriously difficult to parse and this is oversimplified):

string UserInputtedURL = "http://www.google.com/page.html?ID=5"

Protocols = {"http", "https"};
Domains = {"com", "net", "org"}
Rule.CaseSensitive = false;
Rule starts with Protocols OR starts with "www";
Rule followedby string endson "."
Rule followedby Domains
Rule if stringend or endswith " " end else continuewith Ruleset2

RuleSet2.startswith "/"
etc...

if(UserInputtedURL.Matches(Rule)){
    // URL is valid!
}

You don't need regex to parse URLs; that can be done simply with a regular old scanning parser. That is a contrived example as well. — , May 12 '11 at 18:14
You can use Linq, or you can write macros in Clojure. Functional languages naturally lend themselves to this problem. There is no need to reinvent the wheel. You can document regex and learn it really well, perhaps not using it all the time but combining it with other filter functions. Either way, I do not see a silver bullet ... — Job, May 12 '11 at 18:26
@Kilian Well normal code is easier to understand than machine code, SQL/Linq are easier to understand than whatever is beneath them, so yeah I think there's a lot of room for improvement! — Tom, May 13 '11 at 11:33
We use operators instead of words because they are more easily comprehensible after just a little experience. Do you also prefer "ADD deposit TO current_balance GIVING new_balance" ? — kevin cline, Jul 02 '11 at 06:38
see my answer about [verbal expressions here](https://softwareengineering.stackexchange.com/a/333551/249817) — Parivar Saraff, Dec 29 '17 at 06:10
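
Following up on the first comment above: a minimal sketch, assuming Python and its standard urllib.parse module (the question names no language), of checking the question's simplified URL rule without any regex. The names ALLOWED_SCHEMES, ALLOWED_TLDS and looks_like_valid_url are made up for illustration, and the rule is the question's own simplification, not real URL validation.

```python
from urllib.parse import urlsplit

# Hypothetical rule sets, mirroring "Protocols" and "Domains" in the question.
ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_TLDS = {"com", "net", "org"}


def looks_like_valid_url(text: str) -> bool:
    """Apply the question's simplified rule using a parser instead of a regex."""
    candidate = text.strip()
    if "://" not in candidate:
        # The rule also accepts input that starts with "www" and has no protocol.
        if not candidate.lower().startswith("www."):
            return False
        candidate = "http://" + candidate
    try:
        parts = urlsplit(candidate)
        host = parts.hostname or ""  # hostname is already lower-cased
    except ValueError:
        return False  # malformed authority part
    labels = host.split(".")
    return (
        parts.scheme.lower() in ALLOWED_SCHEMES
        and len(labels) >= 2      # at least "name.tld"
        and all(labels)           # no empty label such as ".com"
        and labels[-1] in ALLOWED_TLDS
    )


print(looks_like_valid_url("http://www.google.com/page.html?ID=5"))
```

Each condition here is an ordinary set lookup or string operation, so changing the allowed protocols or domains means editing a set rather than a pattern.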

score 25 · Answer 1 · answered May 12 '11 at 18:22

The main purpose of regular expressions is to provide a terse notation for forming statements that describe regular languages (in other words, matching a string based on rules). A verbose notation that generates such a terse notation is basically a Rube Goldberg device.

Also, given the following:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. — Jamie Zawinski

You seem to be doing the following:

Some people, when confronted with using regular expressions, think "I know, I'll write a language that generates regular expressions."

Now you have three problems.

Rein Henrichs

+1 for the "now you have 3 problems". Taking it to the next level. Ooh yeah. – Joseph Earl May 12 '11 at 19:34
@Joseph. Or possibly 4 problems (2^2). – Rein Henrichs May 12 '11 at 19:35
+1 Heh, to paraphrase Macbeth "the villainies of regular expressions do multiply upon me." – TechZen Jul 24 '13 at 12:19
But haters like myself would say regex itself is a Rube Goldberg device. Why isn't my whole application a Rube Goldberg device if it uses long names rather than single character names? – LegendLength Jul 15 '17 at 13:43

score 8 · Accepted Answer · answered May 12 '11 at 18:32

This has already been done, at least for Perl.

See http://search.cpan.org/~chromatic/Regexp-English-1.01/lib/Regexp/English.pm

It hasn't really taken the world by storm, but it might be a good starting point if you want to write a similar mechanism for another language.

It's not that hard to get the basics of RegExes down; I find switching dialects (Emacs vs Perl Compatible Regular Expressions vs that weird variant in the Visual Studio Find dialog, for example) the biggest issue. I wouldn't be motivated to learn a "plain English" version. It's almost easier to accept the abstraction, because the natural-language translation of the commonly used symbols is imperfect, too.

This reminded me of Perligata - http://www.csse.monash.edu.au/~damian/papers/HTML/Perligata.html - _dum listis decapitamentum damentum nexto fac sic nextum tum novumversum scribe egresso. lista sic hoc recidementum nextum cis vannementa da listis. cis._ — , Jul 02 '11 at 07:17

score 6 · Answer 3 · edited May 23 '17 at 12:40

What you propose is incredibly verbose. Regex can be hard to digest if done wrong, and it takes some getting used to. Only a little, I'd claim: I rarely read or write regexes, but I still remember the syntax for the most important features (repetition, character classes, lookahead) and can read regexes that use those features relatively fluently. Even so, I'd prefer regexes to something like this, which requires me to type out a full pseudo-English sentence for something that can be expressed perfectly well with a few characters. Also consider the complexity (and error-proneness) of an implementation of such a language!
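
For reference, a quick sketch of the three features named above, using Python's re module. The language and the sample strings are assumptions chosen only for illustration.

```python
import re

# Repetition: "+" means one or more of the preceding item.
numbers = re.findall(r"\d+", "a1 b22 c333")

# Character class: "[aeiou]" matches any single one of the listed characters.
vowels = re.findall(r"[aeiou]", "regex")

# Lookahead: "(?=\d)" requires a digit to follow, without consuming it.
before_digit = re.findall(r"[a-z]+(?=\d)", "a1 b22 c")

print(numbers, vowels, before_digit)
```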

Another issue I have to raise: the checks you use as examples include some things that are completely unreasonable to do with regex. Ending with an integer is easy enough, but comparing numbers is a no-go with regex. Also, many of these tests are written more easily with the programming language's native string processing tools: checking the length, for instance, or checking for a substring, especially if the string gets longer or is dynamic. The fact that regexes exist and are sometimes useful doesn't mean you have to use them for all string processing. Use them with care and everything's fine.
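
To make that concrete, here is a minimal sketch in Python (an assumed language; matches_rules is a hypothetical name) of one reading of the question's first example. Regex is used only to pull out the trailing digits, and everything else is plain string handling and an integer comparison.

```python
import re


def matches_rules(text: str) -> bool:
    """One interpretation of the question's four FindString rules."""
    words = text.split()
    if not words:
        return False

    # "StartsWith string longer than 5": a plain length check on the first word.
    starts_ok = len(words[0]) > 5

    # "Contains 'beep' after 'hello'": two substring searches.
    hello_at = text.find("hello")
    beep_after_hello = (
        hello_at != -1 and text.find("beep", hello_at + len("hello")) != -1
    )

    # "EndsWith int < 20": regex finds the digits, int() does the comparison.
    tail = re.search(r"(\d+)$", text)
    ends_ok = tail is not None and int(tail.group(1)) < 20

    # "DoesntContain": the "in" operator is enough.
    no_banned = "no!!" not in text and "what!" not in text

    return starts_ok and beep_after_hello and ends_ok and no_banned


print(matches_rules("coffee hello beep 15"))
```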

score 5 · Answer 4 · edited Jun 16 '20 at 10:01

What is simple to humans is endlessly complex to computers:

What you are describing is almost AppleScript-like in its syntax, and AppleScript is universally loathed, even by people who know it well. The syntax may look easy and readable, but its verbosity is its downfall: unless you use it every day, you forget all the grammar and keyword rules, and it becomes just as opaque as regex syntax. It is hard for beginners to understand because of its verbosity, and hard for experts for the same reason.

Your contrived straw man example:

Rule followedby string endson "."

So how do I remember to use followedby instead of followed by, after, before, precedes, preceding, or any one of the dozen or so English alternatives for the concept of "coming after" something else? The same logic applies to endson, which could be endswith, endingwith, or ending. You would still need a cheat sheet or book to use your proposed syntax.

answered May 12 '11 at 18:11

It's only a proposed solution, more of an illustration. Don't take what I wrote to be the literal solution! It can be different. – Tom May 13 '11 at 11:35
different != better; that is my point. What appears "simpler" is actually more complex – May 13 '11 at 12:57
+1 A-bleeping-men to loathing AppleScript. I had to provide what little end-user support Apple provided for AppleScript back in the day, and I learned to hate it with a passion. Its pseudo-English code, combined with its reliance on the implementation details of each individual app, makes it a bloody nightmare to program. I have to use it but I hate it. Regex is a walk in the park compared to AppleScript. – TechZen Jul 24 '13 at 12:23

score 3 · Answer 5 · answered May 12 '11 at 17:48

Sure it could work, but it would be tremendously difficult to implement (imho), when you take into account anything but the most basic of expressions.

Regex is a language all unto its own. Once you have an understanding of how it works, you don't forget it (you might need a refresher on the syntax, but that's the same for all languages) and a wrapper becomes unneeded (and the additional overhead would be unwanted).

I thought (somewhat) like you until I read Mastering Regular Expressions (O'Reilly). I'd strongly recommend picking it up.

score 2 · Answer 6 · answered May 12 '11 at 18:52

What you want is a DSL for creating regexes. This is not all that hard. It only gets complex and verbose once you add flags for capturing groups, special codes for frequently used character classes, anchoring and so on.

The basics are:

A single character is a regex.
If r is a regex, then begin r end is also a regex.
If r1 and r2 are regexes, then r1 then r2 is a regex.
If r1 is a regex, n is an integer, and m is either an integer or many, then r1 for n to m times is a regex.
If r1 and r2 are regexes, then r1 or r2 is a regex.
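
The five rules above could be sketched as small Python functions that each return a pattern string for the standard re module. This is only one possible encoding: the function names char, group, then, times and either are invented here, and m=None is an assumed stand-in for "many".

```python
import re
from typing import Optional


def char(c: str) -> str:
    """A single character is a regex (escaped so it matches literally)."""
    return re.escape(c)


def group(r: str) -> str:
    """'begin r end': wrap r in a non-capturing group."""
    return f"(?:{r})"


def then(*rs: str) -> str:
    """'r1 then r2': concatenation."""
    return "".join(rs)


def times(r: str, n: int, m: Optional[int] = None) -> str:
    """'r for n to m times'; m=None stands for 'many'."""
    upper = "" if m is None else str(m)
    return f"{group(r)}{{{n},{upper}}}"


def either(*rs: str) -> str:
    """'r1 or r2': alternation, grouped so it composes safely."""
    return group("|".join(rs))


# 'h' then 'e' then 'l' then 'l' then 'o'
hello = then(*(char(c) for c in "hello"))
# begin ' ' or '\t' or '\r' or '\n' end for 0 to many times
blanks = times(either(char(" "), char("\t"), char("\r"), char("\n")), 0)

print(bool(re.fullmatch(then(blanks, hello), " \t\nhello")))
```

Even in this tiny form, the spelled-out version of a run of whitespace is much longer than the usual shorthand.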

Of course, one would want to abbreviate:

'h' then 'e' then 'l' then 'l' then 'o'

with

"hello"

Likewise, after using this DSL for some time, one would want to write

\s*

instead of

begin ' ' or '\t' or '\r' or '\n' end for 0 to many times

The following is interesting: there may be a parallel universe that is exactly like ours, except that there, regular expressions were introduced, and became common, in a verbose DSL like the one above. And on the stackexchange.com of that universe, someone could be asking why regular expressions have to be so clumsy, having had the good idea of making work with regexes far easier by inventing a concise but equally powerful notation ....

score 1 · Answer 7 · answered Apr 17 '13 at 11:27

The best way to do regular expressions is to learn and understand them, or not to use them at all. Using some other tool as an excuse not to learn regular expressions means you have to re-"learn" them every time you encounter them.

Spend a day, just one (full, undistracted) day to deeply study regular expressions and you'll be rewarded with a new tool you can use your whole career. You'll also have a much greater understanding of when they are appropriate and -- more importantly -- when they are not.

Yup, RegExp are a powerful yet dangerous tool - but if we can create a tool that improves readability while keeping the same expressiveness...well, I don't see the bad part of it :) — ArtoAle, Oct 16 '13 at 15:20

score 1 · Answer 8 · answered Jul 02 '11 at 06:13

An alternative to regular expressions is Backus-Naur Form, and some more human-friendly variations like EBNF or ABNF. Roughly, each part of the grammar is broken into a 'production rule', with a nonterminal definition on the left and a sequence of terminals and nonterminals describing the rule on the right. Your example in BNF would look something like this:

expr ::= startword "hello" "beep" endword
       ;

startword ::= WORD_CHAR WORD_CHAR WORD_CHAR WORD_CHAR WORD_CHAR 
            | startword WORD_CHAR
            ;

endword ::= DIGIT
          | "1" DIGIT
          ;

Also, BNF happens to express context-free languages, a proper superset of the regular languages that regular expressions describe.
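
This particular grammar is still regular, so as a rough sketch it can be mirrored production by production with named regex fragments in Python. Two assumptions are mine: WORD_CHAR and DIGIT are taken to be \w and \d, and tokens are assumed to be separated by whitespace, which the grammar leaves implicit.

```python
import re

# Terminals (assumed meanings; the grammar above does not define them).
WORD_CHAR = r"\w"
DIGIT = r"\d"

# startword: five WORD_CHARs, then any number more (the left-recursive rule).
startword = rf"{WORD_CHAR}{{5,}}"

# endword: "1" DIGIT, or a single DIGIT.
endword = rf"(?:1{DIGIT}|{DIGIT})"

# expr: startword "hello" "beep" endword, with whitespace between tokens.
EXPR = re.compile(rf"{startword}\s+hello\s+beep\s+{endword}")

print(bool(EXPR.fullmatch("coffee hello beep 15")))
```

A grammar with genuinely nested rules has no such direct translation into a classic regular expression and would call for a parser.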

score 0 · Answer 9 · answered Jul 02 '11 at 00:24

I found that Ruby Regexp Generator http://www.rubyregexp.sf.net is helpful, because you can give it a verbose definition, but it will output a regexp for your code. I use it like a crutch, to help me build my regexp, instead of as a standalone solution.
