9

I've been looking at a few lexers in various higher level langauges (Python, PHP, Javascript among others) and they all seem to use regular expressions in one form or another. While I'm sure regex's are probably the best way to do this, I was wondering if there was any way to achieve basic lexing without regular expressions, maybe some sort of direct string parsing or something.

So yeah, is it possible to implement some sort of basic lexing in a higher level language* without using regular expressions in any form?

*Higher level languages being things like Perl/PHP/Python/Javascript etc. I'm sure there is a way to do it in C

Blank
  • 253
  • 3
  • 7

5 Answers5

3

First of all, there have been regular expression libraries for C since before your "higher-level" languages were invented. Just saying, C programs aren't as podunk as some people seem to think.

For most grammars, lexing is a matter of searching for whitespace and a few other characters like ()[]{}; to split the words, and then matching against a list of keywords to see if any match.

Karl Bielefeldt
  • 146,727
  • 38
  • 279
  • 479
  • 1
    I didn't mean C couldn't do regex's, I meant it has more powerful features for doing this sort of stuff. I'd imagine it's easier to build an advanced and performant lexer in C than a higher level language. – Blank Feb 10 '12 at 14:19
  • 1
    @sam the complexity and performance of a lexer or parser is more a function of the complexity of the language being parsed than the langugae the parser is implemented in, so no. – jk. Feb 10 '12 at 14:29
  • +1. A lexer is incredibly simple; you just need a string, a data type for your tokens, and a table of predefined keywords. The trickiest part is dealing with whitespace and comments :P – Mason Wheeler Feb 10 '12 at 19:36
2

You might be interested in "scannerless parsers", which don't have a separate tokenization step. One explanation of the benefits of scannerless parsers is given at the beginning of this paper: Disambiguation Filters for Scannerless Generalized LR Parsers. (There are disadvantages, too, though.)

(PEGs, which have been mentioned in other answers, can also be used to construct scannerless parsers.)

Ryan Culpepper
  • 1,413
  • 9
  • 8
1

There's nothing specific about regular expressions. They are simply shorthand which allows you to generate the code much easier, and implementations are commonly shipped. However, fundamentally, lexers are FSMs and regular expressions are just one way to achieve that goal.

DeadMG
  • 36,794
  • 8
  • 70
  • 139
0

Of course you can use other parsers, as every regular language is also context free. The question really comes down to why you would want to.

There's not really anything simpler than regular expressions (how can you improve O(N)?) and trying to simplify won't help. You can always use simple backtracking as Jetti has pointed out, although I recommend avoiding it if possible.

If you're going to use a more advanced parser for lexing then you likely don't need a lexing phase at all. In fact, the reasons why we have a lexing phase is that it's faster to parse lexed tokens than to parse characters, along with it drastically simplifying our parsing step. So by using a more advanced parser you simply lose all benefit of lexing in the first place.

Pubby
  • 3,290
  • 1
  • 21
  • 26
  • So how does regex do it? Wouldn't it still have to go character by character (for most patterns used in lexing at least)? – Jetti Feb 10 '12 at 14:17
  • @Jetti Yes, of course. – Pubby Feb 10 '12 at 14:25
  • It would be just as easy to read each character and then backtrack if needed to pull out a token. It would be more code but not more difficult. – Jetti Feb 10 '12 at 14:29
  • @Jetti I fail to see how naive backtracking is better. – Pubby Feb 10 '12 at 14:32
  • I never said better. But the OP asked if there are other ways and it is another way that isn't an advanced parser. – Jetti Feb 10 '12 at 14:37
0

It makes sense to either do a lexical analysis with regular expressions, or skip this pass at all and do a much more flexible and powerful lexerless parsing with PEG or GLR.

SK-logic
  • 8,497
  • 4
  • 25
  • 37