
When I began to use parser combinators, my first reaction was a sense of liberation from what felt like an artificial distinction between parsing and lexing. All of a sudden, everything was just parsing!

However, I recently came across this posting on codereview.stackexchange in which someone reinstates this distinction. At first I thought this was very silly of them, but the fact that Parsec provides functions to support this behavior leads me to question myself.

What are the advantages/disadvantages to parsing over an already lexed stream in parser combinators?

Eli Frey

5 Answers


By parsing we most often mean the analysis of context-free languages. A context-free language is more powerful than a regular one, hence the parser can (most often) do the job of the lexical analyser right away.

But this is (a) quite unnatural and (b) often inefficient.

For (a): if I think about how, for example, an if expression looks, I think IF expr THEN expr ELSE expr, and not 'i' 'f', maybe some spaces, then any character an expression can start with, etc. You get the idea.

For (b): there are powerful tools that do an excellent job of recognizing lexical entities such as identifiers, literals, brackets of all kinds, etc. They do their work in practically no time and give you a nice interface: a list of tokens. No more worrying about skipping spaces in the parser; your parser will be much more abstract when it deals with tokens rather than characters.
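
To make (b) concrete, here is a minimal sketch (mine, not from the original answer) of what parsing over an already-lexed stream can look like with Parsec; the `Token` type and the `satisfyTok`, `tok`, `expr` and `ifExpr` names are all invented for illustration:

```haskell
import Text.Parsec
import Text.Parsec.Pos (incSourceColumn)

-- A toy token type such as a separate lexer might produce.
data Token = TIf | TThen | TElse | TIdent String | TNum Integer
  deriving (Eq, Show)

type TokParser = Parsec [Token] ()

-- Accept one token that the given function maps to Just a result.
-- The position update is deliberately crude; a real lexer would attach
-- source positions to its tokens.
satisfyTok :: (Token -> Maybe a) -> TokParser a
satisfyTok = tokenPrim show (\pos _ _ -> incSourceColumn pos 1)

tok :: Token -> TokParser ()
tok t = satisfyTok (\t' -> if t == t' then Just () else Nothing)

-- A toy expression: just an identifier or a number literal.
expr :: TokParser String
expr = satisfyTok $ \t -> case t of
  TIdent s -> Just s
  TNum n   -> Just (show n)
  _        -> Nothing

-- The grammar reads like the informal description:
--   IF expr THEN expr ELSE expr
ifExpr :: TokParser (String, String, String)
ifExpr = do
  tok TIf;   c <- expr
  tok TThen; t <- expr
  tok TElse; e <- expr
  return (c, t, e)
```

With this setup, something like `parse ifExpr "" [TIf, TIdent "x", TThen, TNum 1, TElse, TNum 2]` succeeds without the grammar ever mentioning characters or white space.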

After all, if you think a parser should be busy with low-level stuff, why process characters at all? One could just as well write it at the level of bits! You see, such a parser working at the bit level would be almost incomprehensible. It's the same with characters versus tokens.

Just my 2 cents.

Ingo
  • Just for the sake of precision: a parser can _always_ do the job of a lexical analyser. – Giorgio Jan 07 '12 at 11:07
  • Also, regarding efficiency: I am not sure if a parser would be less efficient (slower). I would expect that the resulting grammar would contain a sub-grammar that describes a regular language, and the code for that sub-grammar would be as fast as a corresponding lexical analyser. IMO the real point is (a): how natural, intuitive it is to work with a simpler, more abstract parser. – Giorgio Jan 07 '12 at 11:22
  • @Giorgio - Regarding your 1st comment: You are right. What I had in mind here are cases where the lexer pragmatically does some work that makes the grammar easier, so that one can use LALR(1) instead of LALR(2). – Ingo Jan 07 '12 at 11:32
  • @Giorgio the efficiency thing comes into play when you have a backtracking parser. It's then just faster to backtrack a few tokens instead of maybe 100s of characters. – Ingo Jan 07 '12 at 11:34
  • You are right: without a lexer you would probably need much more look-ahead (2 would not even be sufficient). Why do you need backtracking? In case of errors? – Giorgio Jan 07 '12 at 11:51
  • @Giorgio - I haven't used backtracking myself, I prefer straight good old LALR(1). But I know there are parsing technologies that do use it. I think that parser combinators can do this, too. – Ingo Jan 07 '12 at 12:00
  • let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/2167/discussion-between-giorgio-and-ingo) – Giorgio Jan 07 '12 at 12:17
  • I have removed my acceptance of your answer after further experimentation and reflection. It seems that you two come from yon world of Antlr et al. Considering the first-class nature of parser combinators, I often simply end up defining a wrapper parser for my token parsers, leaving each token as a single name in the parsing layer of parsers. For instance, your if example would look like `if = string "if" >> expr >> string "then" >> expr >> string "else" >> expr`. – Eli Frey Jan 08 '12 at 06:04
  • Performance is still an open question; I will do some benchmarks. – Eli Frey Jan 08 '12 at 06:06
  • @Eli Frey: I have no experience with parser combinators. In my answer I have described why I think the distinction between lexical analysis and parsing is useful in general (modularity). I do not know Antlr either. – Giorgio Jan 08 '12 at 10:28
  • @Eli, you are free to up- or downvote as you please. I'm still not convinced that parser combinators must necessarily be unfriendly to tokenized input. Hence, I see no reason why one could not write something like `ifexpr = token IF >> expr >> token THEN >> expr >> token ELSE >> expr`. Note that this just gives you a new level of abstraction. You could, for example, make your language friendly to Germans by changing the lexer so that it accepts `wenn`, `dann` and `sonst` instead of `if`, `then` and `else`. – Ingo Jan 08 '12 at 15:08

To everyone suggesting that separating lexing and parsing is "good practice": I have to disagree. In many cases, performing lexing and parsing in a single pass gives much more power, and the performance implications are not as bad as they are presented in the other answers (see Packrat parsing).

This approach shines when one has to mix a number of different languages in a single input stream. This is needed not only by weird metaprogramming-oriented languages like Katahdin and the like, but also by much more mainstream applications, such as literate programming (mixing LaTeX and, say, C++), using HTML in comments, stuffing JavaScript into HTML, and so on.
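
Here is a hedged sketch (mine, not from the answer) of the kind of thing this makes easy: because the "lexing" is just more combinators, the tokenisation rules can change with the parsing context. All type and function names below are invented for illustration:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Chunk = Markup String | Script String
  deriving Show

-- Inside <script> ... </script> we could switch to a completely different
-- character-level grammar; for the sketch we just collect the raw body.
script :: Parser Chunk
script = do
  _    <- try (string "<script>")
  body <- manyTill anyChar (try (string "</script>"))
  return (Script body)

chunk :: Parser Chunk
chunk = script
    <|> (Markup <$> many1 (noneOf "<"))
    <|> (Markup . (:[]) <$> char '<')   -- a stray '<' that opens no script

document :: Parser [Chunk]
document = many chunk <* eof
```

For input like `hello <script>1+2</script> world`, a single pass yields `Markup` chunks around a `Script` chunk, and the script body could just as well have been handed to an entirely different parser, something a fixed, up-front token pass cannot easily do.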

SK-logic
  • In my answer I suggested that it is a "good practice in certain contexts" and not that it is a "better practice in all contexts". – Giorgio Jan 08 '12 at 11:54

A lexical analyser recognizes a regular language and a parser recognizes a context-free language. Since each regular language is also context-free (it can be defined by a so-called right-linear grammar), a parser can also recognize a regular language, so the distinction between parser and lexical analyser seems to add some unneeded complexity: a single context-free grammar (parser) could do the job of both the parser and the lexical analyser.

On the other hand, it can be useful to capture some elements of a context-free language through a regular language (and therefore a lexical analyser) because

  1. These elements occur so often that they can be dealt with in a standard way: recognizing number and string literals, keywords, identifiers, skipping white space, and so on.
  2. Defining a regular language of tokens makes the resulting context-free grammar simpler, e.g. one can reason in terms of identifiers, not in terms of individual characters, or one can ignore white space completely if it is not relevant for that particular language.

So separating parsing from lexical analysis has the advantage that you can work with a simpler context-free grammar and encapsulate some basic (often routine) tasks in the lexical analyser (divide et impera).

EDIT

I am not familiar with parser combinators, so I am not sure how the above considerations apply in that context. My impression is that, even if with parser combinators one has only one context-free grammar, distinguishing between two levels (lexical analysis / parsing) could help make this grammar more modular. As said above, the lower lexical-analysis layer could contain basic reusable parsers for identifiers, literals, and so on.
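
For what it is worth, here is a hedged sketch of that layering with parser combinators (my illustration, not Giorgio's): the "lexical layer" is just a set of reusable character-level parsers, and the grammar layer only ever talks to them. The names below are invented; Parsec ships a more complete ready-made version of this idea in `Text.Parsec.Token`.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Lexical layer: each "token" parser eats trailing white space, so the
-- grammar layer never has to mention spaces at all.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol = lexeme . string

identifier :: Parser String
identifier = lexeme ((:) <$> letter <*> many alphaNum)

integer :: Parser Integer
integer = lexeme (read <$> many1 digit)

-- Grammar layer: written purely in terms of the lexical layer.
data Expr = Var String | Lit Integer | Assign String Expr
  deriving Show

expr :: Parser Expr
expr = Lit <$> integer <|> Var <$> identifier

stmt :: Parser Expr
stmt = Assign <$> identifier <* symbol "=" <*> expr
```

For instance, `parse (spaces *> stmt) "" "x  =  42"` succeeds regardless of spacing, yet neither `expr` nor `stmt` ever mentions white space: the two levels exist even though there is only one grammar and one pass.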

Giorgio
  • Lexemes fall into regular grammars not naturally but by convention, since all the lexers are built upon regular-expression engines. This limits the expressive power of the languages you can design. – SK-logic Jan 08 '12 at 09:44
  • Can you give an example of a language for which it would be appropriate to define lexemes that cannot be described as a regular language? – Giorgio Jan 08 '12 at 10:03
  • For example, in a couple of the domain-specific languages I've built, identifiers could have been TeX expressions, which simplified pretty-printing the code, e.g., an expression like `\alpha'_1 (K_0, \vec{T})`, where \alpha'_1, K_0 and \vec{T} are identifiers. – SK-logic Jan 08 '12 at 10:56
  • Given a context-free grammar you can always take a non-terminal N and treat the words it can derive as units that have a useful meaning in themselves (e.g. an expression, a term, a number, a statement). This can be done regardless of how you parse that unit (parser, parser + lexer, etc). IMO the choice of a parser + lexer is more a technical one (how to implement the parsing) than a semantic one (what is the meaning of the blocks of source code that you parse). Maybe I am overlooking something but the two aspects look orthogonal to me. – Giorgio Jan 08 '12 at 11:37
  • So, I agree with you: if you define some arbitrary basic building blocks (_lexemes_) and want to use a lexical analyser to recognize them, this is not always possible. I just wonder if this is the goal of a lexer. As far as I understand, the goal of a lexical analyser is more a technical one: taking away some low-level, tedious implementation details from the parser. – Giorgio Jan 08 '12 at 11:43
  • These tiny implementation details can just be factored out into separate terminals. – SK-logic Jan 08 '12 at 18:57

Simply put, lexing and parsing should be separated because they have different complexities. Lexing is a DFA (deterministic finite automaton) and parsing is a PDA (push-down automaton). This means that parsing inherently consumes more resources than lexing, and there are specific optimization techniques available only to DFAs. In addition, writing a finite state machine is much less complex, and it's easier to automate.

You're being wasteful by using a parsing algorithm to lex.
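
As a hedged illustration of this point (mine, not from the answer): a lexer only ever needs the current character plus a fixed amount of local state, so even a naive hand-written one is a plain loop with no stack. The token names below are invented.

```haskell
import Data.Char (isAlpha, isAlphaNum, isDigit, isSpace)

-- Invented token type for illustration.
data Tok = TIdent String | TNum String | TPunct Char
  deriving Show

-- A DFA-style tokenizer: look at one character, branch to a fixed state
-- (identifier, number, punctuation), and never push anything on a stack.
tokenize :: String -> [Tok]
tokenize [] = []
tokenize (c:cs)
  | isSpace c = tokenize cs                            -- skip white space
  | isAlpha c = let (rest, more) = span isAlphaNum cs
                in TIdent (c:rest) : tokenize more
  | isDigit c = let (rest, more) = span isDigit cs
                in TNum (c:rest) : tokenize more
  | otherwise = TPunct c : tokenize cs                 -- single-char punctuation
```

For example, `tokenize "if x1 then 42"` gives `[TIdent "if", TIdent "x1", TIdent "then", TNum "42"]`; keyword recognition could be a simple table lookup on top of this, still with no stack in sight.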

DeadMG
  • If you use a parser to do lexical analysis, the PDA would never use the stack; it would basically work as a DFA: just consuming input and jumping between states. I am not 100% sure, but I think that the optimization techniques (reducing the number of states) that can be applied to a DFA can also be applied to a PDA. But yes: it is easier to write the lexical analyser as such without using a more powerful tool, and then to write a simpler parser on top of it. – Giorgio Jan 07 '12 at 11:39
  • In addition, it makes the whole thing more flexible and maintainable. For instance, suppose we have a parser for the Haskell language without the layout rule (i.e., with semicolons and braces). If we have a separate lexer, we could now add the layout rules by just doing another pass over the tokens, adding braces and semicolons as needed. Or, for an easier example: suppose we started out with a language supporting only ASCII characters in identifiers and now we want to support Unicode letters in identifiers. – Ingo Jan 07 '12 at 12:11
  • @Ingo, and why would you need to do it in a separate lexer? Just factor out those terminals. – SK-logic Jan 08 '12 at 11:13
  • @SK-logic: I am not sure I understand your question. I have tried to substantiate in my post why a separate lexer may be a good choice. – Ingo Jan 08 '12 at 14:57
  • Giorgio, no. The stack is a crucial component of a normal LALR-style parser. Doing lexing with a parser is a hideous waste of memory (both static storage and dynamically allocated) and will be much slower. The lexer/parser model is efficient; use it :) – riwalk Jan 08 '12 at 17:01
  • @Ingo, I meant that maintainability is not an excuse for a separate lexer. You can keep things maintainable with a mixed parsing as well. – SK-logic Jan 08 '12 at 19:00
  • @SK-logic, I must admit that I have little experience with parser combinator libraries. Maybe I see it too narrowly from the perspective of a yacc et al. user. Anyway, to me it just feels right to have a lexer for the lexical part. – Ingo Jan 08 '12 at 19:22
  • @Ingo, with the combinators (or PEGs or whatever else similar) you still can have the tokens parsing code pretty separated, but it won't *run* in a separate pass. – SK-logic Jan 08 '12 at 19:52
  • @SK-logic Sure. Actually, yacc/lex did not use a separate pass; instead, yacc calls a gimme-next-token function. And in lazy languages like Haskell, even if the type of the lexer is `String -> [Token]`, this is most likely lazy processing in the end, so the next token is gathered on demand. Still, this is nice: you could also pass in a list created in a really separate pass. – Ingo Jan 08 '12 at 21:30
  • @Ingo, it is still a separate pass, which means that once tokenisation is done you cannot add new tokens or change the existing tokenisation rules depending on the parsing *context*. Which means that mixing languages in a stream is not possible. So it does not matter how lazy the resulting stream is, it is still a separate pass with all its disadvantages. – SK-logic Jan 09 '12 at 07:01
  • @SK-logic: Yes, sure it is a separate pass at least conceptually. – Ingo Jan 09 '12 at 14:31
  • @Stargazer712: You are right: I had the wrong intuition. Yesterday I wrote down an example regular grammar and simulated LALR parsing on it: the whole input is pushed on the stack before it is reduced. So, yes, do not use the context-free machinery if the regular one is sufficient. +1 for this answer. – Giorgio Jan 09 '12 at 18:04

One of the main advantages of separating parsing and lexing is the intermediate representation: the token stream. This can be processed in various ways that would otherwise not be possible with combined lexing/parsing.
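
Here is a hedged sketch of what such processing can look like (my illustration, not the answer's): independent passes over the token list that the parser never needs to know about. The `Token` constructors are invented; the layout-rule pass Ingo mentions in a comment above is the same idea.

```haskell
-- Invented token constructors for illustration.
data Token = TComment String | TStr String | TSym String
  deriving Show

-- Pass 1: throw comments away before parsing.
dropComments :: [Token] -> [Token]
dropComments = filter notComment
  where notComment (TComment _) = False
        notComment _            = True

-- Pass 2: merge adjacent string literals, as C compilers do.
joinStrings :: [Token] -> [Token]
joinStrings (TStr a : TStr b : rest) = joinStrings (TStr (a ++ b) : rest)
joinStrings (t : rest)               = t : joinStrings rest
joinStrings []                       = []
```

Each pass is a plain list-to-list function, so they compose freely (`joinStrings . dropComments`) without the grammar ever changing.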

That said, I have found that good ol' recursive descent can be less complicated and easier to work with than learning some parser generator, and then having to figure out how to express the weaknesses of the grammar within the rules of that parser generator.

sylvanaar
  • Could you explain more about grammars that are more easily expressed on a prefabbed stream than performed at parse time? I only have experience implementing toy languages and a scant few data formats, so perhaps I have missed something. Have you noticed any performance characteristics between your hand-rolled RD parser/lex combos and BNF-fed (I'm assuming) generators? – Eli Frey Jan 08 '12 at 07:49