41

I was reading about the GCC compiler suite on Wikipedia when this came up:

GCC started out using LALR parsers generated with Bison, but gradually switched to hand-written recursive-descent parsers; for C++ in 2004, and for C and Objective-C in 2006. Currently all front ends use hand-written recursive-descent parsers.

So, by that last sentence (and for as much as I trust Wikipedia), I can definitely say that "C (gcc), C++ (g++), Objective-C, Objective-C++, Fortran (gfortran), Java (gcj), Ada (GNAT), Go (gccgo), Pascal (gpc),... Mercury, Modula-2, Modula-3, PL/I, D (gdc), and VHDL (ghdl)" are all front ends that no longer use a parser generator. That is, they all use hand-written parsers.

My question, then, is: is this practice ubiquitous? Specifically, I'm looking for exact answers to "does the standard/official implementation of x have a hand-written parser?" for x in [Python, Swift, Ruby, Java, Scala, ML, Haskell]. (Actually, information on any other languages is also welcome here.) I'm sure I could find this on my own after a lot of digging, but I'm also sure it's easily answerable by the community. Thanks!

eatonphil
  • Data point: CPython has a home-brewed LALR parser generator (pgen). Don't know about the rest. –  Jul 17 '14 at 17:50
  • Data point: GHC (Haskell) uses an LALR parser generator (happy), as does OCaml. – Twan van Laarhoven Jul 17 '14 at 18:02
  • Should be *"Do modern high-performance compilers ..."* or similar, because the language is the spec, not the implementation, while it is the compiler that either does or does not use a machine-generated parser. – dmckee --- ex-moderator kitten Jul 17 '14 at 21:18
  • @dmckee, yes you are correct. However, the naming starts to get long and less to the point. Feel free to edit it though if you are more creative than I! – eatonphil Jul 17 '14 at 21:19
  • Regarding ML: MLton uses a parser generator that's specific to ML, I'm 90% sure that SML/NJ does too although I'm less familiar with it. You may or may not want to consider that "hand-written." – Patrick Collins Jul 17 '14 at 21:25
  • It's not an issue of "naming", @dmckee's point is that a language can be specified by a grammar, in prose, or, worst of all, only by an implementation. The ANTLR home page has testimonials from the creators of Python and Objective-C describing how a formal grammar helped their lives. In contrast, for Perl, "the Synopses have often been referred to as the formal Perl 6 specification, but this usage is being deprecated in favor of treating tests as official specifications." ASN.1 and XML Schema are examples of grammars being THE definition of correct for data structures. – Dave Apr 14 '16 at 11:47

2 Answers

36

AFAIK, GCC uses hand-written parsers in particular to improve syntactic error diagnostics, i.e., to give humans meaningful messages on syntax errors.

Parsing theory (and the parser generators descended from it) is mostly about recognizing and parsing a correct input phrase. But we expect compilers to give a meaningful error message on incorrect input, and to be able to parse the rest of the input meaningfully after the syntax error.

Also, old legacy languages like C11 or C++11 (which are conceptually old, even if their latest revisions are only three years old) are not at all context-free. Dealing with that context sensitivity in grammars for parser generators (e.g., bison or even menhir) is tediously difficult.
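
To make the diagnostics point concrete, here is a minimal sketch (not GCC's code; the grammar and all names are invented for illustration) of a hand-written recursive-descent parser for arithmetic expressions. Because each parsing function knows which construct it is in the middle of, it can report something specific like "expected ')' to close the '(' opened earlier" rather than a generic "syntax error":

```cpp
// Minimal sketch (not GCC's code) of a hand-written recursive-descent
// parser. Each parsing function knows exactly which construct it is in
// the middle of, so it can emit a specific, human-oriented diagnostic.
#include <cctype>
#include <cstdio>
#include <cstdlib>
#include <string>

struct Parser {
    std::string src;
    std::size_t pos;

    char peek() const { return pos < src.size() ? src[pos] : '\0'; }
    void skipSpace() { while (std::isspace((unsigned char)peek())) ++pos; }

    [[noreturn]] void error(const std::string& msg) {
        std::fprintf(stderr, "error at column %zu: %s\n", pos + 1, msg.c_str());
        std::exit(1);
    }

    // Grammar being hand-coded:
    //   expr   := term   (('+' | '-') term)*
    //   term   := factor (('*' | '/') factor)*
    //   factor := NUMBER | '(' expr ')'
    double expr() {
        double v = term();
        skipSpace();
        while (peek() == '+' || peek() == '-') {
            char op = src[pos++];
            double rhs = term();
            v = (op == '+') ? v + rhs : v - rhs;
            skipSpace();
        }
        return v;
    }

    double term() {
        double v = factor();
        skipSpace();
        while (peek() == '*' || peek() == '/') {
            char op = src[pos++];
            double rhs = factor();
            v = (op == '*') ? v * rhs : v / rhs;
            skipSpace();
        }
        return v;
    }

    double factor() {
        skipSpace();
        if (std::isdigit((unsigned char)peek())) {
            std::size_t used;
            double v = std::stod(src.substr(pos), &used);
            pos += used;
            return v;
        }
        if (peek() == '(') {
            ++pos;                       // consume '('
            double v = expr();
            skipSpace();
            if (peek() != ')')
                error("expected ')' to close the '(' opened earlier");
            ++pos;                       // consume ')'
            return v;
        }
        error("expected a number or a parenthesized expression here");
    }
};

int main() {
    Parser p{"1 + (2 * 3", 0};      // deliberately malformed input
    std::printf("%g\n", p.expr());  // never reached; a targeted diagnostic is printed instead
}
```

A generated, table-driven parser can be retrofitted with comparable diagnostics, but out of the box it typically only knows that the current token is not in the expected set, which is a large part of why hand-written front ends tend to win on error reporting and recovery.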

Basile Starynkevitch
  • Concur. Recovering well from parsing errors (when you don't want to stop parsing at the very first error, a la the old Borland Pascal) and creating good-quality error messages (including hints and suggestions for resolution, like humans want) are both inherently context-sensitive, heuristic tasks. They can be done atop stock parser generator output, somewhat, but it's a slog. – Jonathan Eunice Jul 17 '14 at 20:12
  • `Dealing with that context sensitivity in grammars for parser generators is tediously difficult`. It's also more or less impossible, as these tools generate context-free parsers. The correct place to check that all context-sensitive constraints are satisfied is *after* you've generated the parse tree, if you're using tools like this. – dtech Jun 03 '15 at 13:03
7

Parser generators and parser engines are quite general. The advantage of that generality is that, in the overall scheme of things, it is easy to build an accurate parser quickly and get it working.

The parser engine itself suffers on the performance front because of its generality: hand-written code will always be significantly faster than a table-driven parser engine.

The second area where parser generators/engines have difficulty is that all real programming languages are context-sensitive, often in quite subtle ways. LR grammars are context-free, meaning there are many subtleties about positioning and environment that are impossible to convey properly in the grammar alone. Attributed grammars attempt to address basic language rules like "declare before use", etc. Wiring this context sensitivity into hand-written code is straightforward.
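
As a rough illustration of that last point (a toy classifier, not a real C parser; the function and variable names are made up), consider the classic C ambiguity where `T * x;` is a pointer declaration if `T` names a typedef and a multiplication expression otherwise. A hand-written parser can simply consult its symbol table at the decision point, something a purely context-free grammar cannot express:

```cpp
// Toy illustration (not a real C parser) of feeding symbol-table context
// into parsing decisions: in C, `T * x;` is a pointer declaration when T
// names a typedef, but an expression statement (a multiplication) when T
// is an ordinary variable. The grammar alone cannot tell the two apart.
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: classify a pre-tokenized statement of the shape
// `A * B ;` by looking A up in the set of typedef names seen so far.
std::string classify(const std::vector<std::string>& tokens,
                     const std::set<std::string>& typedefNames) {
    if (tokens.size() == 4 && tokens[1] == "*" && tokens[3] == ";") {
        if (typedefNames.count(tokens[0]))
            return "declaration: '" + tokens[2] + "' is a pointer to '" + tokens[0] + "'";
        return "expression: '" + tokens[0] + "' multiplied by '" + tokens[2] + "'";
    }
    return "unrecognized statement shape";
}

int main() {
    // In a real front end this set grows as typedef declarations are parsed.
    std::set<std::string> typedefNames = {"size_t", "FILE"};

    std::cout << classify({"size_t", "*", "n", ";"}, typedefNames) << "\n";  // declaration
    std::cout << classify({"a", "*", "b", ";"}, typedefNames) << "\n";       // expression
}
```

In a real front end the set of typedef names is populated as declarations are parsed, which is exactly the kind of feedback from semantic context into parsing that is awkward to thread through a generated, table-driven parser.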

BobDalgleish
  • Citation for the performance claim please? Being table-driven can be a significant performance optimization and generators have access to algorithms that are very efficient but virtually never implemented by hand (precisely because they are an impenetrable mess of tables and magic numbers). –  Jul 17 '14 at 19:29
  • And about the second area: Many, many major real programming languages are not context sensitive in any sense that applies (you'd have to refer to the set of all *valid* programs after type checking and such, which is **never** what a hand-written or generated parser tries to parse). It's true that hand-written parsers are more flexible, and this is useful for some languages, but mostly in the realm of error recovery and reporting, incrementality, etc. -- parser generators are rarely eschewed because of recognition power (whether you'd *want* to write such a grammar is a different story). -1 –  Jul 17 '14 at 19:33
  • If you use symbol table information during parsing, then you might as well call it context-sensitive. Attributed grammars are definitely not context free, although I don't think they are fully context sensitive. Your other points about error recovery and reporting are well taken. – BobDalgleish Jul 17 '14 at 19:36
  • C and C++ need symbol table information during parsing (or accept a far less specific parse tree where no distinction is made between, for example, expression statements and variable declarations). But I wasn't thinking of those. Languages like Java, Lisps, JavaScript, Ruby, Python, Go, Rust, Scala, Swift, Haskell (and probably several more, maybe C# and ML too?) don't need any such information to build the kind of AST you'd want anyway. Many of them actually have LL(1) grammars, or even LALR grammars. –  Jul 17 '14 at 19:39
  • Citation for "all real languages are context sensitive", please? – psr Jul 18 '14 at 00:29