25

I'm trying to understand compilation and interpretation, step by step, building up a complete picture. I ran into a question while reading this article: http://www.cs.man.ac.uk/~pjj/farrell/comp3.html

It says:

The next stage of the compiler is called the Parser. This part of the compiler has an understanding of the language's grammar. It is responsible for identifying syntax errors and for translating an error free program into internal data structures that can be interpreted or written out in another language.

But I couldn't figure out how the tokenizer can properly tokenize a given stream that contains a syntax error.

It seems like it should get stuck there, or pass some wrong information on to the parser. I mean, isn't tokenizing also a kind of translation?

So how does it get past the lexically corrupted lines of code while tokenizing?

There is an example of a token in the link above, under The Tokenizer heading.

As I understand the form of a token, it seems that if there is something wrong in the code, the token would be corrupted too.

Could you please clarify my misunderstanding?

gnat
FZE

3 Answers

34

A tokenizer is just a parser optimization. It's perfectly possible to implement a parser without a tokenizer.

A tokenizer (or lexer, or scanner) chops the input into a list of tokens. Some parts of the string (comments, whitespace) are usually ignored. Each token has a type (the meaning of this string in the language) and a value (the string that makes up the token). For example, the PHP source snippet

$a + $b

could be represented by the tokens

Variable('$a'),
Plus('+'),
Variable('$b')
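
For illustration only, here is a minimal tokenizer sketch in Python (not PHP's actual lexer; the rule names and the tokenize function are made up for this answer). It tries each token rule at the current position, emits (type, value) pairs, and throws whitespace away:

import re

# Hypothetical token rules for the tiny expression language above.
# Each rule is a named regular expression; whitespace is matched but discarded.
TOKEN_RULES = [
    ("VARIABLE", r"\$[A-Za-z_][A-Za-z0-9_]*"),  # e.g. $a, $b
    ("PLUS",     r"\+"),
    ("SKIP",     r"\s+"),                        # whitespace is ignored
]
MASTER_PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_RULES))

def tokenize(source):
    """Turn a source string into a list of (type, value) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER_PATTERN.match(source, pos)
        if match is None:
            # No rule matches here: a *lexical* error, raised before any parsing happens.
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("$a + $b"))
# [('VARIABLE', '$a'), ('PLUS', '+'), ('VARIABLE', '$b')]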

The tokenizer does not consider whether a token is possible in this context. For example, the input

$a $b + +

would happily produce the token stream

Variable('$a'),
Variable('$b'),
Plus('+'),
Plus('+')

When the parser then consumes these tokens, it will notice that two variables cannot follow each other, and neither can two infix operators. (Note that other languages have different syntaxes where such a token stream may be legal, but not in PHP).
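
With the sketch tokenizer above (again, a toy example rather than PHP), the out-of-order input tokenizes without complaint; it is only a separate parsing stage that would reject the sequence:

# The tokenizer is perfectly happy with input the parser must reject.
print(tokenize("$a $b + +"))
# [('VARIABLE', '$a'), ('VARIABLE', '$b'), ('PLUS', '+'), ('PLUS', '+')]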

A parser may still fail at the tokenizer stage. For example, there might be an illegal character:

$a × ½ — 3

A PHP tokenizer would be unable to match this input to its rules, and would produce an error before the main parsing starts.
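
In the toy sketch above the same thing happens: no token rule matches the × character, so the lexer raises an error before any grammar rule is ever consulted:

try:
    tokenize("$a × ½ — 3")
except SyntaxError as e:
    print(e)   # unexpected character '×' at position 3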

More formally, tokenizers are used when each token can be described as a regular language. The tokens can then be matched extremely efficiently, possibly implemented as a DFA. In contrast, the main grammar is usually context-free and requires a more complicated, less performant parsing algorithm such as LALR.

amon
  • So we could think of the tokenizer as a part of the parser, and a syntax error can occur either at the tokenizing step or at the parsing step, depending on the form of the error. Thanks for the clarification. – FZE Mar 31 '16 at 21:04
  • @FZE: You *could* think that way, but it's misleading. Lexing is not "just a parser optimization". Rather, lexing maps a physical representation (some sequence of characters) into a logical representation (the tokens represented by those characters). This isolates the parser from minutiae like how the end of a line is represented, or whether you decide to represent a logical-and as `and` or `&&` or something else. It's (mostly) separate and different from parsing. Optimization (if any) is an almost accidental side effect. – Jerry Coffin Mar 31 '16 at 21:30
  • @JerryCoffin thanks for the further explanation it makes more sense now. – FZE Mar 31 '16 at 21:40
  • Unless the language has a grammar where some character can *only* occur as part of a multi-character token, *every* possible sequence of characters in the language's character set can be converted into a sequence of tokens. Since most languages allow single-letter variable names and single-digit numbers, the lexer on its own is unlikely to detect many syntax errors, except "the program contains an illegal input character" - and it may be simpler just to output that character as an "invalid token" and leave the parser to write the actual error message. – alephzero Mar 31 '16 at 23:36
  • That's just not true @alephzero. There are many syntax errors that can be found during lexing. That process can also find invalid *combinations* of characters, such as `{ )( }`. Those are all valid characters, but (collectively) not a valid token. – RubberDuck Apr 01 '16 at 01:34
  • @JerryCoffin, amon is correct that the difference is not fundamental. You can create a cohesive BNF grammar that covers both the "lexer" and "parser" parts. We usually categorize the rules into low level (e.g., number, addition operator) and high level (summation), but the grammar itself makes no such distinction. – Paul Draper Apr 01 '16 at 06:48
  • @PaulDraper: The fact that they can be combined doesn't make them the same, any more than the ability to dissolve salt in water means there's no fundamental difference between salt and water. – Jerry Coffin Apr 01 '16 at 07:23
  • @JerryCoffin, more formally, the difference is: lexers are for regular languages, the rest of the parser is for context-free languages. Need to parse a regular language? You can use just a lexer. Need to parse a context-free language? You can use just a context-free parser. Or, for a context-free language, you could split your language into two parts and use one on each. Personally, it's a lot easier to write a parser that's a straight clone of the BNF, but that won't be very performant (e.g. recursive descent parsing a 10000 character string literal, character by character). – Paul Draper Apr 01 '16 at 07:41
  • @PaulDraper Not sure if separating out a regular language as the first phase is the right choice. For example, matched pairs (not regular) might be necessary to describe string literals in some languages, yet it still makes sense to handle them in the first phase. Avoiding/minimizing back-tracking seems like a better guideline. – CodesInChaos Apr 01 '16 at 09:57
  • For an example of how insanely complex all of this can be I'd suggest looking at parsing C++. Classifying tokens there in general is undecidable. – Daniel Jour Apr 01 '16 at 12:36
16

You'd usually expect most syntax errors to come from the parser, not the lexer.

The lexer would generate an error if (and mostly only if) there's something in the input that can't be tokenized. In many languages, however, almost any sequence of characters can be turned into tokens of some sort, so errors here are fairly unusual.

The parser will generate an error if the input contains valid tokens, but those tokens aren't arranged so they form valid statements/expressions in the target language. This is much more common as a rule.

Jerry Coffin
12

The tokenizer just splits the character stream into tokens. From the tokenizer's POV this is completely valid:

1 * * 1

and translates to something like `["1", MULTIPLY, MULTIPLY, "1"]`. Only the parser can reject such expressions - it knows that a multiply operator cannot follow another multiply operator. For example, in JavaScript this produces:

Uncaught SyntaxError: Unexpected token *(…)
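
The same division of labour can be observed with Python's standard-library tokenize and ast modules (assuming a reasonably recent CPython): the tokenizer happily emits NUMBER, OP, OP, NUMBER for this input, and only the parser objects:

import ast
import io
import tokenize

source = "1 * * 1"

# The tokenizer has no objection: it just reports what the characters are.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))

# The parser is what rejects this arrangement of otherwise valid tokens.
try:
    ast.parse(source)
except SyntaxError as e:
    print("SyntaxError:", e.msg)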

There are errors which might be detected by the tokenizer, for example unfinished string literals (`"abc`) or invalid numbers (`0x0abcdefg`). They might still be reported as syntax errors though:

Uncaught SyntaxError: Unexpected token ILLEGAL

Note, however, that the token was not recognized and is reported as ILLEGAL.

Banthar