20

What are good resources on how to write a lexer in C++ (books, tutorials, documents), and what are some good techniques and practices?

I have looked on the internet and everyone says to use a lexer generator like lex. I don't want to do that, I want to write a lexer by hand.

  • Ok, why is lex not good for your purpose? – Phil C Jan 01 '12 at 11:21
  • I want to learn how lexers work. I cannot do that with a lexer generator. –  Jan 01 '12 at 11:22
  • Lex generates disgusting C code. Anyone who wants a decent lexer doesn't use Lex. – DeadMG Jan 01 '12 at 11:28
  • I wouldn't read the generated code; one is normally only interested in the lex source code. – Giorgio Jan 01 '12 at 11:45
  • @Giorgio: The generated code is the code you have to interface with, with disgusting non-thread-safe global variables, for example, and it's the code whose NULL-termination bugs you're introducing into your application. – DeadMG Jan 01 '12 at 11:50
  • @DeadMG: I hadn't thought about the lexer running in a separate thread. BTW, are there more modern lexer generators for C++? – Giorgio Jan 01 '12 at 11:54
  • @Giorgio: I've no idea. I just wrote my own- it's easier and higher quality. – DeadMG Jan 01 '12 at 16:33
  • @Giorgio: Have you ever had to debug code output by Lex? – mattnz Jan 02 '12 at 07:26
  • @mattnz: No, why would you do that? Is it not sufficient to check the lex definitions in case some tokens are not recognized correctly? Debugging the generated code seems to me like debugging the assembly code generated by a C++ compiler instead of debugging the source code. But I admit I haven't used lex for any real application so I have little experience with it. – Giorgio Jan 02 '12 at 11:30

4 Answers

9

Lexers are finite state machines. Therefore, they can be constructed by any general-purpose FSM library. For the purposes of my own education, however, I wrote my own, using expression templates. Here's my lexer:

static const std::unordered_map<Unicode::String, Wide::Lexer::TokenType> reserved_words(
    []() -> std::unordered_map<Unicode::String, Wide::Lexer::TokenType>
    {
        // Maps reserved words to TokenType enumerated values
        std::unordered_map<Unicode::String, Wide::Lexer::TokenType> result;

        // RESERVED WORD
        result[L"dynamic_cast"] = Wide::Lexer::TokenType::DynamicCast;
        result[L"for"] = Wide::Lexer::TokenType::For;
        result[L"while"] = Wide::Lexer::TokenType::While;
        result[L"do"] = Wide::Lexer::TokenType::Do;
        result[L"continue"] = Wide::Lexer::TokenType::Continue;
        result[L"auto"] = Wide::Lexer::TokenType::Auto;
        result[L"break"] = Wide::Lexer::TokenType::Break;
        result[L"type"] = Wide::Lexer::TokenType::Type;
        result[L"switch"] = Wide::Lexer::TokenType::Switch;
        result[L"case"] = Wide::Lexer::TokenType::Case;
        result[L"default"] = Wide::Lexer::TokenType::Default;
        result[L"try"] = Wide::Lexer::TokenType::Try;
        result[L"catch"] = Wide::Lexer::TokenType::Catch;
        result[L"return"] = Wide::Lexer::TokenType::Return;
        result[L"static"] = Wide::Lexer::TokenType::Static;
        result[L"if"] = Wide::Lexer::TokenType::If;
        result[L"else"] = Wide::Lexer::TokenType::Else;
        result[L"decltype"] = Wide::Lexer::TokenType::Decltype;
        result[L"partial"] = Wide::Lexer::TokenType::Partial;
        result[L"using"] = Wide::Lexer::TokenType::Using;
        result[L"true"] = Wide::Lexer::TokenType::True;
        result[L"false"] = Wide::Lexer::TokenType::False;
        result[L"null"] = Wide::Lexer::TokenType::Null;
        result[L"int"] = Wide::Lexer::TokenType::Int;
        result[L"long"] = Wide::Lexer::TokenType::Long;
        result[L"short"] = Wide::Lexer::TokenType::Short;
        result[L"module"] = Wide::Lexer::TokenType::Module;
        result[L"dynamic"] = Wide::Lexer::TokenType::Dynamic;
        result[L"reinterpret_cast"] = Wide::Lexer::TokenType::ReinterpretCast;
        result[L"static_cast"] = Wide::Lexer::TokenType::StaticCast;
        result[L"enum"] = Wide::Lexer::TokenType::Enum;
        result[L"operator"] = Wide::Lexer::TokenType::Operator;
        result[L"throw"] = Wide::Lexer::TokenType::Throw;
        result[L"public"] = Wide::Lexer::TokenType::Public;
        result[L"private"] = Wide::Lexer::TokenType::Private;
        result[L"protected"] = Wide::Lexer::TokenType::Protected;
        result[L"friend"] = Wide::Lexer::TokenType::Friend;
        result[L"this"] = Wide::Lexer::TokenType::This;

        return result;
    }()
);

std::vector<Wide::Lexer::Token*> Lexer::Context::operator()(Unicode::String* filename, Memory::Arena& arena) {

    Wide::IO::TextInputFileOpenArguments args;
    args.encoding = Wide::IO::Encoding::UTF16;
    args.mode = Wide::IO::OpenMode::OpenExisting;
    args.path = *filename;

    auto str = arena.Allocate<Unicode::String>(args().AsString());
    const wchar_t* begin = str->c_str();
    const wchar_t* end = str->c_str() + str->size();

    int line = 1;
    int column = 1;

    std::vector<Token*> tokens;

    // Some variables we'll need for semantic actions
    Wide::Lexer::TokenType type;

    auto multi_line_comment 
        =  MakeEquality(L'/')
        >> MakeEquality(L'*')
        >> *( !(MakeEquality(L'*') >> MakeEquality(L'/')) >> eps)
        >> eps >> eps;

    auto single_line_comment
        =  MakeEquality(L'/')
        >> MakeEquality(L'/')
        >> *( !MakeEquality(L'\n') >> eps);

    auto punctuation
        =  MakeEquality(L',')[[&]{ type = Wide::Lexer::TokenType::Comma; }]
        || MakeEquality(L';')[[&]{ type = Wide::Lexer::TokenType::Semicolon; }]
        || MakeEquality(L'~')[[&]{ type = Wide::Lexer::TokenType::BinaryNOT; }]
        || MakeEquality(L'(')[[&]{ type = Wide::Lexer::TokenType::OpenBracket; }]
        || MakeEquality(L')')[[&]{ type = Wide::Lexer::TokenType::CloseBracket; }]
        || MakeEquality(L'[')[[&]{ type = Wide::Lexer::TokenType::OpenSquareBracket; }]
        || MakeEquality(L']')[[&]{ type = Wide::Lexer::TokenType::CloseSquareBracket; }]
        || MakeEquality(L'{')[[&]{ type = Wide::Lexer::TokenType::OpenCurlyBracket; }]
        || MakeEquality(L'}')[[&]{ type = Wide::Lexer::TokenType::CloseCurlyBracket; }]

        || MakeEquality(L'>') >> (
               MakeEquality(L'>') >> (
                   MakeEquality(L'=')[[&]{ type = Wide::Lexer::TokenType::RightShiftEquals; }]
                || opt[[&]{ type = Wide::Lexer::TokenType::RightShift; }]) 
            || MakeEquality(L'=')[[&]{ type = Wide::Lexer::TokenType::GreaterThanOrEqualTo; }]
            || opt[[&]{ type = Wide::Lexer::TokenType::GreaterThan; }])
        || MakeEquality(L'<') >> (
               MakeEquality(L'<') >> (
                      MakeEquality(L'=')[[&]{ type = Wide::Lexer::TokenType::LeftShiftEquals; }]
                   || opt[[&]{ type = Wide::Lexer::TokenType::LeftShift; }] ) 
            || MakeEquality(L'=')[[&]{ type = Wide::Lexer::TokenType::LessThanOrEqualTo; }] 
            || opt[[&]{ type = Wide::Lexer::TokenType::LessThan; }])

        || MakeEquality(L'-') >> (
               MakeEquality(L'-')[[&]{ type = Wide::Lexer::TokenType::Decrement; }]
            || MakeEquality(L'=')[[&]{ type = Wide::Lexer::TokenType::MinusEquals; }]
            || MakeEquality(L'>')[[&]{ type = Wide::Lexer::TokenType::PointerAccess; }]
            || opt[[&]{ type = Wide::Lexer::TokenType::Minus; }])

        || MakeEquality(L'.')
            >> (MakeEquality(L'.') >> MakeEquality(L'.')[[&]{ type = Wide::Lexer::TokenType::Ellipsis; }] 
            || opt[[&]{ type = Wide::Lexer::TokenType::Dot; }])

        || MakeEquality(L'+') >> (  
               MakeEquality(L'+')[[&]{ type = Wide::Lexer::TokenType::Increment; }] 
            || MakeEquality(L'=')[[&]{ type = Wide::Lexer::TokenType::PlusEquals; }]
            || opt[[&]{ type = Wide::Lexer::TokenType::Plus; }])
        || MakeEquality(L'&') >> (
               MakeEquality(L'&')[[&]{ type = Wide::Lexer::TokenType::LogicalAnd; }]
            || MakeEquality(L'=')[[&]{ type = Wide::Lexer::TokenType::BinaryANDEquals; }] 
            || opt[[&]{ type = Wide::Lexer::TokenType::BinaryAND; }])
        || MakeEquality(L'|') >> (
               MakeEquality(L'|')[[&]{ type = Wide::Lexer::TokenType::LogicalOr; }]
            || MakeEquality(L'=')[[&]{ type = Wide::Lexer::TokenType::BinaryOREquals; }]
            || opt[[&]{ type = Wide::Lexer::TokenType::BinaryOR; }])

        || MakeEquality(L'*') >> (MakeEquality(L'=')[[&]{ type = Wide::Lexer::TokenType::MulEquals; }] 
            || opt[[&]{ type = Wide::Lexer::TokenType::Multiply; }])
        || MakeEquality(L'%') >> (MakeEquality(L'=')[[&]{ type = Wide::Lexer::TokenType::ModulusEquals; }] 
            || opt[[&]{ type = Wide::Lexer::TokenType::Modulus; }])
        || MakeEquality(L'=') >> (MakeEquality(L'=')[[&]{ type = Wide::Lexer::TokenType::EqualTo; }] 
            || opt[[&]{ type = Wide::Lexer::TokenType::Assignment; }])
        || MakeEquality(L'!') >> (MakeEquality(L'=')[[&]{ type = Wide::Lexer::TokenType::NotEquals; }] 
            || opt[[&]{ type = Wide::Lexer::TokenType::LogicalNOT; }])
        || MakeEquality(L'/') >> (MakeEquality(L'=')[[&]{ type = Wide::Lexer::TokenType::DivEquals; }] 
            || opt[[&]{ type = Wide::Lexer::TokenType::Divide; }])
        || MakeEquality(L'^') >> (MakeEquality(L'=')[[&]{ type = Wide::Lexer::TokenType::BinaryXOREquals; }] 
            || opt[[&]{ type = Wide::Lexer::TokenType::BinaryXOR; }])
        || MakeEquality(L':') >> (MakeEquality(L'=')[[&]{ type = Wide::Lexer::TokenType::VarAssign; }] 
            || opt[[&]{ type = Wide::Lexer::TokenType::Colon; }]);

    auto string
        =  L'"' >> *( L'\\' >> MakeEquality(L'"') >> eps || !MakeEquality(L'"') >> eps) >> eps;

    auto character
        =  L'\'' >> *( L'\\' >> MakeEquality(L'\'') >> eps || !MakeEquality(L'\'') >> eps);

    auto digit
        =  MakeRange(L'0', L'9');

    auto letter
        =  MakeRange(L'a', L'z') || MakeRange(L'A', L'Z');

    auto number
        =  +digit >> ((L'.' >> +digit) || opt);

    auto new_line
        = MakeEquality(L'\n')[ [&] { line++; column = 0; } ];

    auto whitespace
        =  MakeEquality(L' ')
        || L'\t'
        || new_line
        || L'\n'
        || L'\r'
        || multi_line_comment
        || single_line_comment;

    auto identifier 
        =  (letter || L'_') >> *(letter || digit || (L'_'));
        //=  *( !(punctuation || string || character || whitespace) >> eps );

    bool skip = false;

    auto lexer 
        =  whitespace[ [&]{ skip = true; } ] // Do not produce a token for whitespace or comments. Just continue on.
        || punctuation[ [&]{ skip = false; } ] // Type set by individual punctuation
        || string[ [&]{ skip = false; type = Wide::Lexer::TokenType::String; } ]
        || character[ [&]{ skip = false; type = Wide::Lexer::TokenType::Character; } ]
        || number[ [&]{ skip = false; type = Wide::Lexer::TokenType::Number; } ]
        || identifier[ [&]{ skip = false; type = Wide::Lexer::TokenType::Identifier; } ];

    auto current = begin;
    while(current != end) {
        if (!lexer(current, end)) {
            throw std::runtime_error("Failed to lex input.");
        }
        column += (current - begin);
        if (skip) {
            begin = current;
            continue;
        }
        Token t(begin, current);
        t.columnbegin = column - (current - begin);
        t.columnend = column;
        t.file = filename;
        t.line = line;
        if (type == Wide::Lexer::TokenType::Identifier) { // check for reserved word
            if (reserved_words.find(t.Codepoints()) != reserved_words.end())
                t.type = reserved_words.find(t.Codepoints())->second;
            else
                t.type = Wide::Lexer::TokenType::Identifier;
        } else {
            t.type = type;
        }
        begin = current;
        tokens.push_back(arena.Allocate<Token>(t));
    }
    return tokens;
}

It's backed by an iterator-based, backtracking, finite state machine library that is roughly 400 lines long. However, it's easy to see that all I had to do was construct simple boolean operations, like and, or, and not, and a couple of regex-style operators like `*` for zero-or-more, `eps` to mean "match anything" and `opt` to mean "match anything but don't consume it". The library is fully generic and based on iterators. The `MakeEquality` stuff is a simple test for equality between `*it` and the value passed in, and `MakeRange` is a simple `<=` `>=` test.
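
The library itself isn't reproduced here, but to give a rough idea of what such combinators can look like, here is a much-simplified, hypothetical C++14 sketch: plain lambdas instead of real expression templates, the iterator type hard-wired to const wchar_t*, and no semantic-action [], *, !, eps or opt. It is not the actual Wide implementation:

// Hypothetical sketch of MakeEquality-style combinators (not the Wide library).
// Each parser is a callable taking (it, end) and returning bool; on failure it
// leaves the iterator where it found it, which is what enables backtracking.
template <typename F>
struct Parser {
    F f;
    bool operator()(const wchar_t*& it, const wchar_t* end) const { return f(it, end); }
};

template <typename F>
Parser<F> MakeParser(F f) { return { f }; }

// Succeed and consume one element if *it equals the given value.
inline auto MakeEquality(wchar_t c) {
    return MakeParser([c](const wchar_t*& it, const wchar_t* end) {
        if (it != end && *it == c) { ++it; return true; }
        return false;
    });
}

// Succeed and consume one element if lo <= *it <= hi.
inline auto MakeRange(wchar_t lo, wchar_t hi) {
    return MakeParser([lo, hi](const wchar_t*& it, const wchar_t* end) {
        if (it != end && lo <= *it && *it <= hi) { ++it; return true; }
        return false;
    });
}

// Sequence: match lhs then rhs; rewind to the starting position if either fails.
template <typename L, typename R>
auto operator>>(Parser<L> lhs, Parser<R> rhs) {
    return MakeParser([lhs, rhs](const wchar_t*& it, const wchar_t* end) {
        const wchar_t* save = it;
        if (lhs(it, end) && rhs(it, end)) return true;
        it = save;
        return false;
    });
}

// Ordered alternative: try lhs first, then rhs from the same position.
template <typename L, typename R>
auto operator||(Parser<L> lhs, Parser<R> rhs) {
    return MakeParser([lhs, rhs](const wchar_t*& it, const wchar_t* end) {
        return lhs(it, end) || rhs(it, end);
    });
}

With those two operators, an expression like MakeEquality(L'-') >> MakeEquality(L'>') already reads like the punctuation rules above; a semantic-action operator [] would simply wrap another parser and invoke a callback on success.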

Eventually, I'm planning to move from backtracking to predictive.

DeadMG
  • I've seen several lexers that just read the next token when requested by the parser to do so. Yours seems to go through a whole file and make a list of tokens. Is there any particular advantage to this method? – user673679 Oct 27 '13 at 12:20
  • @DeadMG: Care to share a `MakeEquality` snippet? Specifically the object returned by that function. Looks very interesting. – Erunehtar Apr 21 '17 at 13:15
6

Keep in mind that every finite state machine corresponds to a regular expression, which corresponds to a structured program using if and while statements.

So, for example, to recognize integers you could have the state machine:

0: digit -> 1
1: digit -> 1

or the regular expression:

digit digit*

or the structured code:

if (isdigit(*pc)){
  while(isdigit(*pc)){
    pc++;
  }
}

Personally, I always write lexers using the latter, because IMHO it is no less clear, and there is nothing faster.
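
For a sense of how this style scales to a whole scanner loop, here is a minimal, self-contained sketch; the token kinds and the next_token interface are invented for the example (a real lexer would also record the lexeme text and its position):

#include <cctype>
#include <cstdio>

// A whole scanner written in the same if/while style; token names are made up.
enum Token { T_NUMBER, T_IDENT, T_PLUS, T_ERROR, T_EOF };

Token next_token(const char*& pc) {
    while (std::isspace((unsigned char)*pc)) pc++;          // skip whitespace

    if (*pc == '\0') return T_EOF;

    if (std::isdigit((unsigned char)*pc)) {                 // digit digit*
        while (std::isdigit((unsigned char)*pc)) pc++;
        return T_NUMBER;
    }
    if (std::isalpha((unsigned char)*pc) || *pc == '_') {   // letter (letter|digit|_)*
        while (std::isalnum((unsigned char)*pc) || *pc == '_') pc++;
        return T_IDENT;
    }
    if (*pc == '+') { pc++; return T_PLUS; }

    pc++;                                                   // unrecognized character
    return T_ERROR;
}

int main() {
    const char* pc = "count + 42";
    for (Token t = next_token(pc); t != T_EOF; t = next_token(pc))
        std::printf("token kind %d\n", t);
}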

Mike Dunlavey
  • I think that if the regular expression becomes very complex, so does the corresponding code. That's why lexer generators are good: I would normally only code a lexer myself if the language is very simple. – Giorgio Jan 01 '12 at 14:30
  • @Giorgio: Maybe it's a matter of taste, but I've built many parsers this way. The lexer doesn't have to handle anything beyond numbers, punctuation, keywords, identifiers, string constants, whitespace, and comments. – Mike Dunlavey Jan 01 '12 at 14:40
  • I have never written a complex parser and all the lexers and parsers I have written were also hand-coded. I just wonder how this scales for more complex regular languages: I have never tried it but I imagine that using a generator (like lex) would be more compact. I admit I have no experience with lex or other generators beyond some toy examples. – Giorgio Jan 01 '12 at 14:55
  • There would be a string you append `*pc` to, right? Like `while(isdigit(*pc)) { value += *pc; pc++; }`. Then after the `}` you convert the value into a number and assign that to a token. –  Jan 01 '12 at 15:59
  • @WTP: For numbers, I just calculate them on the fly, similar to `n = n * 10 + (*pc++ - '0');`. It gets a little more complex for floating point and 'e' notation, but not bad. I'm sure I could save a little code by packing the characters into a buffer and calling `atof` or whatever. It wouldn't run any faster. – Mike Dunlavey Jan 01 '12 at 17:53
  • Here is a link with some indications if you want to simulate a finite state automaton in a procedural language: http://en.wikipedia.org/wiki/Automata-based_programming – Giorgio Jan 03 '12 at 20:29
  • @Giorgio: True? I suppose. Interesting? Not to me. – Mike Dunlavey Jan 03 '12 at 20:48
  • Just out of curiosity: is the material too obvious (actually it just shows a few techniques to simulate FSMs), or do you use a completely different approach? If I understand correctly, your approach is to map the FSM to a control structure, right? – Giorgio Jan 03 '12 at 20:56
  • @Giorgio: Actually I go the other way. I start with the idea of a regular expression / structured code (which to me are equivalent). Where I even start to care about the state-machine formulation is if I want to prove or understand something about it *[as in this paper.](http://arxiv.org/abs/quant-ph/9807026)* – Mike Dunlavey Jan 03 '12 at 21:02
3

First of all, there are different things going on here:

  • splitting the bare characters list into tokens
  • recognizing those tokens (identifying keywords, literals, brackets, ...)
  • verifying a general grammar structure

Generally, we expect a lexer to do all 3 steps in one go; however, the last step is inherently more difficult, and there are some issues with automating it (more on this later).

The most amazing lexer I know of is Boost.Spirit.Qi. It uses expression templates to generate your lexer expressions, and once accustomed to its syntax the code feels really neat. It compiles very slowly though (heavy templates), so it's best to isolate the various portions in dedicated files to avoid recompiling them when they haven't been touched.
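
For a feel of the syntax, here is a small, untested sketch that uses Qi to split an input into identifier-like words; the grammar is invented purely for illustration and is nowhere near a full lexer:

#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>

int main() {
    namespace qi = boost::spirit::qi;
    using boost::spirit::ascii::space;       // whitespace skipper

    std::string input = "if foo then bar";
    std::vector<std::string> words;

    auto first = input.begin();
    bool ok = qi::phrase_parse(
        first, input.end(),
        +qi::lexeme[+qi::char_("a-zA-Z_")],  // one or more identifier-like words
        space,                               // skip whitespace between them
        words);

    if (ok && first == input.end())
        for (const auto& w : words)
            std::cout << w << '\n';
}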

There are some performance pitfalls, and the author of the Epoch compiler explains in an article how he got a 1000x speed-up through intensive profiling and investigation of how Qi works.

Finally, there is also code generated by external tools (Yacc, Bison, ...).


But I promised a write-up on what was wrong with automating the grammar verification.

If you check out Clang, for example, you will realize that instead of using a generated parser or something like Boost.Spirit, they set out to validate the grammar manually, using a generic recursive descent parsing technique. Surely this seems backward?

In fact, there is a very simple reason: error recovery.

The typical example, in C++:

struct Immediate { } instanceOfImmediate;

struct Foo {}

void bar() {
}

Notice the error? A missing semicolon right after the declaration of Foo.

It is a common error, and Clang recovers neatly by realizing that the semicolon is simply missing and that void is not an instance of Foo but the start of the next declaration. This avoids hard-to-diagnose, cryptic error messages.

Most automated tools have no (at least obvious) way of specifying those likely mistakes and how to recover from them. Often recovery requires a little syntactic analysis, so it is far from trivial.
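
To make the idea concrete, here is a hypothetical sketch (not Clang's actual code; every name is invented) of the kind of decision a hand-written parser can make at that point:

#include <cstdio>

// Hypothetical recovery decision after parsing "struct Foo { ... }".
enum class Tok { Semicolon, Identifier, KwVoid, KwStruct, End };

// Returns true if the parser should behave as if the ';' were present.
bool recover_after_struct(Tok next, int line) {
    if (next == Tok::Semicolon)
        return true;                     // normal case, nothing to recover
    if (next == Tok::KwVoid || next == Tok::KwStruct) {
        // The next token clearly starts a new declaration, so the ';' was most
        // likely just forgotten: report one precise error and keep parsing.
        std::printf("line %d: error: expected ';' after struct definition\n", line);
        return true;
    }
    return false;                        // it really is a declarator of this struct
}

int main() {
    recover_after_struct(Tok::KwVoid, 3);   // prints the friendly diagnostic
}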


So, there is a trade-off involved in using an automated tool: you get your parser quickly, but it is less user-friendly.

Matthieu M.
3

Since you want to learn how lexers work, I presume you actually want to know how lexer generators work.

A lexer generator takes a lexical specification, which is a list of rules (regular-expression/token pairs), and generates a lexer. The resulting lexer can then transform an input (character) string into a token string according to this list of rules.

The method that is most commonly used mainly consists of transforming a regular expression into a deterministic finite automaton (DFA) via a nondeterministic finite automaton (NFA), plus a few details.

A detailed guide to doing this transformation can be found here. Note that I haven't read it myself, but it looks quite good. Also, just about any book on compiler construction will feature this transformation in the first few chapters.

If you are interested in lecture slides on the topic, there is no doubt an endless number of them from courses on compiler construction. From my university, you can find such slides here and here.

There are a few more things that are not commonly employed in lexers or treated in texts, but are quite useful nonetheless:

Firstly, handling Unicode is somewhat nontrivial. The problem is that ASCII input is only 8 bits wide, which means that you can easily have a transition table for every state in the DFA, because each one has only 256 entries. However, Unicode, being 16 bits wide (if you use UTF-16), requires a 64k-entry table for every state in the DFA. If you have complex grammars, this may start taking up quite some space. Filling these tables also starts to take quite a bit of time.
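
As a concrete illustration of that table layout, here is a toy, hand-filled DFA for digit digit* with one 256-entry row per state; a generator would of course emit such tables from the specification:

#include <cstdio>
#include <cstring>

// One 256-entry row per DFA state; -1 means "no transition".
const int STATES = 2;
const int ALPHABET = 256;
int table[STATES][ALPHABET];

void build_table() {
    std::memset(table, -1, sizeof table);
    for (int c = '0'; c <= '9'; ++c) {
        table[0][c] = 1;   // start state: a digit enters the accepting state
        table[1][c] = 1;   // accepting state: further digits stay there
    }
}

// Run the DFA and return the length of the longest accepted prefix (0 if none).
int match_integer(const char* s) {
    int state = 0, len = 0, last_accept = 0;
    while (s[len] != '\0') {
        state = table[state][(unsigned char)s[len]];
        if (state == -1) break;
        ++len;
        if (state == 1) last_accept = len;    // state 1 is the accepting state
    }
    return last_accept;
}

int main() {
    build_table();
    std::printf("%d\n", match_integer("1234abc"));   // prints 4
}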

Alternatively, you could generate interval trees. Such a tree may contain the tuples ('a', 'z') and ('A', 'Z'), for example, which is a lot more memory-efficient than having the full table. If you maintain non-overlapping intervals, you can use any balanced binary tree for this purpose. The running time is linear in the number of bits you need for every character, so O(16) in the Unicode case, and in practice it will usually be quite a bit less.
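
Here is a minimal sketch of that interval lookup, using std::map (a balanced tree) in place of a purpose-built interval tree; the states and ranges are made up for the example:

#include <cstdio>
#include <map>

// key = first code point of the interval, value = last code point and target state.
struct Interval { char32_t last; int next_state; };
using Transitions = std::map<char32_t, Interval>;

// Find the transition for code point c, or -1 if no interval contains it.
int lookup(const Transitions& t, char32_t c) {
    auto it = t.upper_bound(c);               // first interval starting after c
    if (it == t.begin()) return -1;           // nothing starts at or before c
    --it;
    return (c <= it->second.last) ? it->second.next_state : -1;
}

int main() {
    Transitions from_state_0 = {
        { U'A', { U'Z', 1 } },                // 'A'..'Z'  -> state 1
        { U'a', { U'z', 1 } },                // 'a'..'z'  -> state 1
        { 0x4E00, { 0x9FFF, 1 } },            // CJK ideographs -> state 1
    };
    std::printf("%d %d\n", lookup(from_state_0, U'q'), lookup(from_state_0, U'!'));  // 1 -1
}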

One more issue is that lexers as commonly generated actually have worst-case quadratic performance. Although this worst-case behaviour is not commonly seen, it might bite you. If you run into the problem and want to solve it, a paper describing how to achieve linear time can be found here.

You'll probably want to be able to describe regular expressions in string form, as they normally appear. However, parsing these regular expression descriptions into NFAs (or possibly a recursive intermediate structure first) is a bit of a chicken-and-egg problem. To parse regular expression descriptions, the Shunting Yard algorithm is very suitable. Wikipedia seems to have an extensive page on the algorithm.
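
As a sketch of that step, here is a tiny shunting-yard conversion from a toy regex dialect to postfix form, which an NFA builder could then consume; to keep it short, concatenation must be written explicitly as '.', whereas a real implementation would insert the implicit concatenation operators in a preprocessing pass:

#include <iostream>
#include <stack>
#include <string>

// Operator precedence for the toy dialect: repetition > concatenation > alternation.
int precedence(char op) {
    switch (op) {
        case '*': return 3;
        case '.': return 2;   // explicit concatenation
        case '|': return 1;
        default:  return 0;   // '(' and ordinary characters
    }
}

// Shunting-yard conversion of the toy regex to postfix (reverse Polish) form.
std::string to_postfix(const std::string& re) {
    std::string out;
    std::stack<char> ops;
    for (char c : re) {
        if (c == '(') {
            ops.push(c);
        } else if (c == ')') {
            while (!ops.empty() && ops.top() != '(') { out += ops.top(); ops.pop(); }
            if (!ops.empty()) ops.pop();              // discard the '('
        } else if (c == '*' || c == '.' || c == '|') {
            while (!ops.empty() && precedence(ops.top()) >= precedence(c)) {
                out += ops.top(); ops.pop();
            }
            ops.push(c);
        } else {
            out += c;                                 // literal character
        }
    }
    while (!ops.empty()) { out += ops.top(); ops.pop(); }
    return out;
}

int main() {
    std::cout << to_postfix("a.(b|c)*") << '\n';      // prints abc|*.
}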

Alex ten Brink