8

Suppose I had a grammar like:

object 
    { members } 
members 
    pair
pair
    string : value 
value 
    number
    string
string 
    " chars " 
chars 
    char
    char chars 
number
    digit
    digit number

I could parse the following example: { "one" : 1234 }

As far as I understand, I should have the tokens object, members, pair, value, string and chars.

Tokenizing the example should produce

object
    ->members
        ->pair
            ->"one"
            ->"1234"

Parsing the tokens should produce

object
    ->pair
        ->"one"
        ->1234

It seems to me like the tokenizer is either useless or I don't fully understand what it should do.

What is the responsibility of a tokenizer? What is the benefit of a tokenizer over parsing the original string?

gnat
Johannes
    See also [Are separate parsing and lexing passes good practice with parser combinators?](http://programmers.stackexchange.com/q/128888/15162) – Brian May 16 '14 at 13:10

5 Answers

19

You don't seem to understand what a tokenizer should do. In this example, I'd make the tokenizer recognize five kinds of tokens: {, }, :, string, number. The tokenizer produces a string/sequence of tokens, not a tree. And instead of the grammar being written in terms of individual characters (char, digit), it is now written in terms of tokens.

The benefit is that this simplifies the grammar and parser: You no longer need to describe how to parse strings and numbers (note that real languages' string and numeric literals are far more complicated, which boosts this benefit). As far as the parser is concerned, the grammar becomes

object 
    '{' members '}'
members 
    pair
pair
    string ':' value 
value 
    number
    string

This is not only simpler to write a parser for, but also more useful for comprehending the syntactic structure of programs. I know what a string literal is; the interesting part is how I can combine string literals and other atomic units to form programs.
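
For concreteness, here is a minimal tokenizer sketch in Python for the simplified grammar above (the token names, the regular expressions, and the `tokenize` helper are my own illustration, not something the answer prescribes):

    import re

    # Token kinds for the simplified grammar: the tokenizer's whole job is to
    # turn the input string into a flat sequence of (kind, text) pairs.
    TOKEN_SPEC = [
        ("LBRACE", r"\{"),
        ("RBRACE", r"\}"),
        ("COLON",  r":"),
        ("STRING", r'"[^"]*"'),
        ("NUMBER", r"\d+"),
        ("WS",     r"\s+"),          # whitespace separates tokens and is dropped
    ]
    TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

    def tokenize(text):
        """Produce a flat list of tokens; no tree structure is built here."""
        return [(m.lastgroup, m.group())
                for m in TOKEN_RE.finditer(text)
                if m.lastgroup != "WS"]

    # tokenize('{ "one" : 1234 }') ==>
    # [('LBRACE', '{'), ('STRING', '"one"'), ('COLON', ':'),
    #  ('NUMBER', '1234'), ('RBRACE', '}')]

The parser then consumes that flat list and is the part that discovers the object/pair structure.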

  • 1
    This misconception seems to be at the core of this answer: `The benefit is that this simplifies the grammar and parser`. That complexity doesn't just disappear: it goes somewhere else (in this case, to the token grammar and tokenizer). So now you have two things, each of which does ~1/2 the job of a parser. About the same amount of work. Where's the benefit? –  May 16 '14 at 17:03
  • 9
    @MattFenwick The same benefits of breaking any other kind of program into individual modules. Easier to reason about, easier to make changes without affecting unrelated parts. Turning a giant string into a sequence of tokens and turning a sequence of tokens into an abstract syntax tree are two different tasks. There's no reason they need to be related. You should need a reason to introduce dependencies, not need reasons to break them. – Doval May 16 '14 at 17:23
  • 1
    @Doval: Another advantage is that on a multi-core system, one may have one thread which reads source code from a file and puts tokens into a queue, and another thread that reads tokens from the queue and acts upon them. If the parsing of input text into tokens can be done without regard for what those tokens represent, then neither thread will have to wait for the other unless the parser catches up to the tokenizer. – supercat May 16 '14 at 20:28
  • @supercat Can you point to an existing implementation of this? I suspect that the synchronization overhead eats any advantage, in particular since in most compilers, tokenization is far from a bottleneck. Conventional serial optimizations are probably more fruitful. –  May 16 '14 at 20:42
  • @Doval you're making a circular argument -- you're essentially saying that we need separate tokenizers and hierarchical parsers, because tokenizing and hierarchical parsing are different. But why are they different? More importantly, what is the cost of differentiating them? –  May 19 '14 at 01:09
  • @Doval re: cost. Your statement `easier to make changes without affecting unrelated parts` is actually an interesting one. Why do you think they're unrelated? In practice, this assumption is not universally true (i.e. you can not split an arbitrary language into "tokens" and "not tokens" -- [example](http://stackoverflow.com/questions/16803185/are-s-in-type-parameters-tokenized-using-a-special-rule) of Java's lexical grammar failing). –  May 19 '14 at 01:31
  • @MattFenwick Suppose there's two pieces of code A and B, where A depends on the output of B. If you can implement A just as efficiently regardless of whether it knows B's implementation details, then I consider those two computations unrelated and they should be kept that way. Making A depend on the implementation details of B will bring you no benefits and all the drawbacks of tight coupling. I'm not an expert on parsers but it seems to me this is the case here. – Doval May 19 '14 at 10:32
  • @Doval: By separating the two, you then have to have some common definition of a "token" (which is to some degree an implementation detail). If that definition belongs to either A or B, though, a change to one must cause a change in the other. The definition should therefore belong to some bridge C, but then C can't change without causing a cascade of changes through both A and B. Most of this disappears (potentially leaving fewer moving parts) when A and B are combined. – cHao May 19 '14 at 14:49
  • @cHao Both parts need to agree on their definition of a token regardless of whether you've split them or joined them. Combining them buys you nothing regarding that issue. In fact, you were internally going to define a type for the tokens regardless, simply because parsing doesn't care about strings. – Doval May 19 '14 at 15:08
  • @Doval: If there is no "both parts", then there's no need for "both parts" to agree. And to some degree, parsing does care about strings -- a keyword, for example, only needs to be a token if you've embedded the assumption that a parser can only work with tokens. If it can work more directly with the input, then the conversion of (say) "if" to a token becomes largely unnecessary. – cHao May 19 '14 at 15:23
  • @cHao You're going to have code that turns strings to tokens and code that turns tokens to syntax trees no matter how you arrange them. That's part of the definition of what you're trying to do. So, yes, both parts have to agree; if you make a change to the possible tokens the entirety of your code is broken just like both the lexer and parser would be broken if you implemented them separately. – Doval May 19 '14 at 16:20
  • 2
    @Doval [Scannerless parsing](http://en.wikipedia.org/wiki/Scannerless_parsing) is a thing. Such a parser does not necessarily have explicit tokens; there may be implicit knowledge like "I'm currently parsing a string literal" in what grammar rules are currently being applied, but otherwise it goes string -> parse tree. –  May 19 '14 at 16:24
  • @delnan That's fascinating. In that case there's presumably a difference in efficiency (or ease of implementation?) so it doesn't necessarily contradict my suggestion that you need a reason to couple things and not vice-versa. – Doval May 19 '14 at 16:59
  • @Doval I was not asking for the definition of "unrelated", but why you think that tokenizing and hierarchical parsing are unrelated. Because both in theory (of languages) and in practice, they *are* related: splitting them introduces problems that are not present when they're combined. (Of course it's possible to minimize and work around these problems with careful design, but that's not the same as just not having them in the first place.) –  May 21 '14 at 16:26
  • @delnan: I haven't looked at how compilers are implemented these days, but I would not expect synchronization overhead to be particularly large in this sort of situation, especially in a runtime environment where variables written in one thread will "usually" be seen by another reasonably soon, and memory barriers would only be needed in cases where either the reader thread caught up to whatever it could see from the writer, or the writer got too far ahead of the reader. I would expect that if the tokenizer built a dictionary of all unique tokens received in a job... – supercat May 22 '14 at 19:49
  • ...and exposed two `OneWriterMultiReaderAppendOnlyList` objects--one of which held tokens and the other of which held token numbers [that would if positive identify things in the list, and if negative encapsulate either things like reserved words, etc. or pre-evaluated small numeric literals] such a design could perform quite well. – supercat May 22 '14 at 19:52
  • @supercat Well, I'm no expert on synchronization, I just know that it's rather hard to do right and fast at the same time. Striking the right balance (presumably by adjusting some buffer's size) between "reader doesn't catch up" and "producer doesn't get too far ahead" also seems like a delicate numeric optimization problem. That's why I'd like to see an existing implementation. Plus (as mentioned before), even if you achieve zero overhead and perfect parallelization, you only achieve a small speedup because lexing time is usually small (cf. Amdahl's law). –  May 22 '14 at 20:12
  • @delnan: For some kinds of compilers, lexing (folding in dictionary lookup, assuming a global token table, so the remainder of the compiler works with integers rather than strings) can be a significant part of overall compilation speed [simply because the rest of the compiler doesn't have to do a whole lot]. On other types, much more time is spent on graph coloring and other related optimizations. – supercat May 22 '14 at 20:26
  • The argument about separating the scanner and parser for the sake of parallelization seems moot. Most programs have more than one source input file. So a scannerless parser can run one thread per source file - which, assuming you have more source files than CPU cores, is going to require fewer synchronizations, fewer allocations and calls, etc. Also, scannerless parsers may have some advantages vs (context-free) scanners in terms of ambiguity - like, does your parser reserve keywords or allow them in symbol names? Permitting them probably requires context. – mindplay.dk Feb 13 '21 at 10:26
5

The original source file, in whatever programming or markup language you're parsing, is just a long sequence of characters. The "words" that make up the language may be conveniently separated by spaces, or they may not.

For instance in C, the character sequences "foo = bar << 2;" and "foo=bar<<2;" should be considered to be equivalent. The first step in parsing a document is therefore to analyse the sequence of characters and work out where one token ("word") ends and the next one begins.

In my little C example, the tokens are "foo", "=", "bar", "<<", "2" and ";" in both cases. Note the subtlety in this case that it's "<<" and not "<" followed by "<". Tokenisers need to know about the language's syntax, but not its meaning.

Only once you've tokenised the string can you start to think about what the document means.
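
As a rough illustration of that longest-match ("maximal munch") behaviour, here is a toy scanner sketch in Python (the operator table and the `next_token` helper are made up for this one example; a real C lexer handles far more):

    # Toy longest-match scanner for the example above. The operator table is
    # only what this snippet needs; a real C lexer has many more operators.
    OPERATORS = ["<<", "<", "=", ";"]         # longer operators must come first

    def next_token(text, pos):
        """Return (token, new_pos), or None when the input is exhausted."""
        while pos < len(text) and text[pos].isspace():
            pos += 1                          # spaces only separate tokens here
        if pos == len(text):
            return None
        for op in OPERATORS:                  # "<<" wins over "<" because it is tried first
            if text.startswith(op, pos):
                return op, pos + len(op)
        start = pos                           # otherwise collect an identifier/number
        while pos < len(text) and not text[pos].isspace() \
                and not any(text.startswith(op, pos) for op in OPERATORS):
            pos += 1
        return text[start:pos], pos

    # Both "foo = bar << 2;" and "foo=bar<<2;" tokenize to:
    # foo, =, bar, <<, 2, ;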

Simon B
  • Whitespace can be a token too, depending on your grammar. I can't think of a case where a C parser would need to be aware of it, though. – cHao May 16 '14 at 17:31
  • 1
    -1. This is a fallacious appeal to common sense, and raises more questions than it answers. Why is that the first step? Why does it need to be separated from the rest of parsing? What's so important about tokens? What if your language doesn't have "words" in the same sense as C? What if your "tokens" can't be parsed with a regular grammar, but require a context-free or context-sensitive one? How can you think about what the document means if you've tokenized it, but not assembled it into a parse tree? –  May 21 '14 at 16:35
  • Actually, C++'s `>>` is one case where a tokenizer clearly creates a disadvantage: in a normal expression (`2 >> 1`) you want to parse `>>` as the bit shift operator; in a type definition (`vector<vector<int>>`) you want to parse it as two separate closing angle brackets. To do this you need to have an understanding of where in the syntax tree you currently are, which is impossible when tokenizing blindly. – Arshia001 Aug 11 '20 at 06:29
  • @Arshia001 The complexity arises because C++ grammar is context sensitive. For all languages that are context free, it's fine to have the tokenizer separate from the parser. – Aiono Dec 26 '22 at 15:27
  • @f.smith remind me please, is a context-free grammar parsable by PDAs and a context-sensitive grammar only by Turing machines? Also, do we have mainstream languages that are actually context-free? – Arshia001 Dec 27 '22 at 16:16
  • 1
    @Arshia001 Yes, it can only be parsed by Turing machines. Whether a programming language is context free or context sensitive depends on what you count as grammar or not. For example, most languages don't allow redeclaration of the same variable in the same scope. If you check that at the parser level then your language becomes context sensitive. But most languages treat those as valid programs syntactically and give semantic errors. In that sense I think there are context-free production languages. One such example is Lua and maybe Javascript but I am not sure. – Aiono Dec 27 '22 at 18:13
  • @f.smith realistically, it's hard to imagine languages being truly context-free. Most languages' grammars aren't even unambiguous; IIRC, when they added generics to C#, certain expressions which could be parsed as comparisons could suddenly also be parsed as explicit generic arguments, so instead of going out of their way to conjure strange syntax, they made a rule that generic arguments take precedence over comparisons, and they've probably introduced more syntactic ambiguities since then. – Arshia001 Dec 29 '22 at 07:45
  • 1
    @Arshia001 yeah I said there are but didn't say anything about whether most of them are or not. In reality most are context sensitive but usually it's some particular parts of the syntax, and most of the language can be parsed with a context-free parser. And even for a context-sensitive language you can have a lexer, but the parser needs to decide what to do with the token based on the context. In your C++ example, the lexer can just tokenize two consecutive `>` tokens and the parser can decide later to make it a right shift or generic parameter syntax. – Aiono Dec 29 '22 at 09:24
5

You bring up an excellent point. I'm going to disagree with the other answers here, and say that the main goal of a tokenizer is to get better performance during parsing -- i.e., tokenizers are an optimization: an implementation detail of parsing, but not a fundamental one. A large part of the time spent parsing is breaking the input string up into pieces. By optimizing this, the parser's performance can be greatly increased.

So that's a pretty vague definition I just gave, and that's why it's hard to precisely define what a tokenizer should do.

Many languages are defined using two separate grammars: one for tokens, and one for hierarchical syntax elements. You could argue that the purpose of a tokenizer is to implement the token grammar, but this misses the point: splitting a grammar into token and hierarchical grammars is arbitrary and unnecessary from the point of view of expressiveness (although, again, useful as a performance optimization).

It's perfectly reasonable and practical to implement parsers without separate tokenizers, although it's likely the performance will be worse.

It is important to note that there are drawbacks to using a separate tokenizer. One is that the token grammar can become restricted (example, another example). In my personal experience, avoiding separate tokenization reduces the overall complexity (LOC, interfaces between subsystems, etc.) of a parser.
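
To make the "no separate tokenizer" idea concrete, here is a toy scannerless sketch in Python for the question's single-pair grammar (the function names and return shapes are my own illustration, not a recipe):

    # A scannerless recursive-descent sketch: it walks the raw character string
    # directly and never materializes a token list.
    def skip_ws(s, i):
        while i < len(s) and s[i].isspace():
            i += 1
        return i

    def parse_string(s, i):                   # assumes s[i] == '"'
        j = s.index('"', i + 1)
        return s[i + 1:j], j + 1

    def parse_number(s, i):
        j = i
        while j < len(s) and s[j].isdigit():
            j += 1
        return int(s[i:j]), j

    def parse_value(s, i):
        return parse_string(s, i) if s[i] == '"' else parse_number(s, i)

    def parse_object(s, i=0):
        i = skip_ws(s, i)
        assert s[i] == "{"
        key, i = parse_string(s, skip_ws(s, i + 1))
        i = skip_ws(s, i)
        assert s[i] == ":"
        val, i = parse_value(s, skip_ws(s, i + 1))
        i = skip_ws(s, i)
        assert s[i] == "}"
        return {key: val}, i + 1

    # parse_object('{ "one" : 1234 }')  ==>  ({'one': 1234}, 16)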

  • Why would a tokenizer speed things up? Seems like mashing things together would speed it up. – Michael Shaw May 16 '14 at 18:20
  • Michael: Many parsers have a high cost per token, because of backtracking and other lookahead mechanisms. In such parsers, tokenizing 1000 chars to 100 tokens and parsing those 100 tokens can be much easier than parsing 1000 characters directly. – Anonymous May 16 '14 at 19:30
  • If it was simply a performance optimization among many, it would not be such a major part of parsing. It's often presented as a core part of the design of a language implementation, one fully-fledged stage in the compilation pipeline along with parsing, semantic analysis, code generation and the like. This even extends to pure exposition that doesn't try to equip the reader to write their own, only tell them how compilers work. Unless most people who write about the topic have *very* poor judgement on what's essential and what's not, this hints at tokens being more than just an optimization. –  May 16 '14 at 20:09
  • Note that there are languages like Perl and C++ where the possible tokens are modified by where you are in the parsing. A nested template such as `template <class T = Foo<int>>`, for example, changes the meaning of '>>' from '>>' to '>' '>'. – Zan Lynx May 16 '14 at 21:45
  • @MichaelShaw tokenizers (and token grammars) typically limit the complexity (approximately Chomsky type 3) of the tokens, allowing the increase in performance. –  May 19 '14 at 00:57
  • @delnan you're reading too much in my answer that's not there. The subtle implication that I'm pooping on everybody who uses/supports/write about separate tokenization is pure strawman. Plus, you're just making stuff up about the importance of tokenization; tokens aren't interesting, tokenization isn't a big deal. Just one of several ways to get a parse tree. –  May 19 '14 at 01:23
  • @MattFenwick I'm doing no such thing. I'm pointing out that tokens are treated as important (in the same general category as parse trees, IRs and final code generated) by a whole lot of literature, including literature that has no business talking about optimizations. Even the most generous reading of your answer is in disagreement with this fact, so the possible conclusions are: people write far too much about what is just an implementation detail and an optimization, or your characterization as just that doesn't do tokens justice. That argument does not hinge on *you* believing or implying either. –  May 19 '14 at 08:26
2

I agree that the main advantage of tokenizing is speeding up the process. For example, parsing a number is not so simple: you have to check for decimals, the exponential form, and so on. It's something you want to do only once per number.

With tokenization, you ensure that every number will be parsed only once.

Depending on the kind of parser, tokenizing can add complexity (one more parsing step) or remove complexity (splitting the job in two simpler, distinct steps).
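
As a small sketch of the number example in Python (the regular expression is a simplification of a real JSON number rule, and the `number_token` helper is hypothetical):

    import re

    # Recognize the whole numeric literal once, in the tokenizer, and attach the
    # converted value to the token so the parser never re-reads the characters.
    NUMBER_RE = re.compile(r"-?\d+(\.\d+)?([eE][+-]?\d+)?")

    def number_token(text, pos):
        """Return (value, end_pos) if a number literal starts at pos, else None."""
        m = NUMBER_RE.match(text, pos)
        if not m:
            return None
        literal = m.group()
        value = float(literal) if any(c in literal for c in ".eE") else int(literal)
        return value, m.end()

    # number_token("-12.5e3,", 0)  ==>  (-12500.0, 7)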

Gin Quin
1

Lexing and parsing lend themselves to different formalisms. The goal is to make both tasks easier to program and maintain, as well as faster at runtime.

If you look at (f)lex, the usual lexer-generator, it uses regular expressions to express the lexical rules. This is notationally much more compact than a grammar expressed as BNF or some similar parser specification.

At runtime, a lexer can be baked down to a finite automaton. A parser cannot. So splitting the job speeds up the process. A parser has to be something like LR(k) or LL(1) or LALR to deal with the ambiguity.
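
A rough illustration of that split, in Python rather than (f)lex (the rule names are made up, and Python's `re` engine is not literally a DFA, but the shape of the specification is the same): token rules are plain regular expressions, while even something as simple as checking that parentheses balance needs unbounded counting, which no finite automaton can do.

    import re

    # Lexer side: a flat table of regular-expression rules, flex-style.
    LEXER_RULES = re.compile(
        r"(?P<LPAREN>\()|(?P<RPAREN>\))|(?P<NAME>[A-Za-z_]\w*)|(?P<SKIP>\s+)")

    def tokens(text):
        return [(m.lastgroup, m.group())
                for m in LEXER_RULES.finditer(text)
                if m.lastgroup != "SKIP"]

    # Parser side: recognizing arbitrarily deep nesting needs a counter/stack.
    def balanced(toks):
        depth = 0
        for kind, _ in toks:
            if kind == "LPAREN":
                depth += 1
            elif kind == "RPAREN":
                depth -= 1
                if depth < 0:
                    return False
        return depth == 0

    # balanced(tokens("((foo)(bar))"))  ==>  True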

The 'dragon book' has been the classic undergraduate-level text on compilation techniques for nearly 30 years.

200_success
bmargulies
  • What if your tokens can't be described with a regular grammar? Also, a huge difference between a "lexer" and a "parser" is the latter's stack, allowing it, for example, to correctly parse arbitrarily-deep nested parentheses -- `((()())(((()))))`. –  May 21 '14 at 16:40