How should I specify a grammar for a parser?

Question

I have been programming for many years, but one task that still takes me inordinately long is to specify a grammar for a parser, and even after this excessive effort, I'm never sure that the grammar I've come up with is good (by any reasonable measure of "good").

I don't expect that there is an algorithm for automating the process of specifying a grammar, but I hope that there are ways to structure the problem that eliminate much of the guesswork and trial-and-error of my current approach.

My first thought has been to read about parsers, and I've done some of this, but everything I've read on this subject takes the grammar as a given (or trivial enough that one can specify it by inspection), and focuses on the problem of translating this grammar into a parser. I'm interested in the problem immediately before: how to specify the grammar in the first place.

I'm primarily interested in the problem of specifying a grammar that formally represents a collection of concrete examples (positive and negative). This is different from the problem of designing a new syntax. Thanks to Macneil for pointing out this distinction.

I had never really appreciated the distinction between a grammar and a syntax, but now that I'm beginning to see it, I could sharpen my first clarification by saying that I'm primarily interested in the problem of specifying a grammar that will enforce a predefined syntax: it just so happens that in my case, the basis for this syntax is usually a collection of positive and negative examples.

How is the grammar specified for a parser? Is there a book or reference out there that's the de-facto standard for describing best practices, design methodologies, and other helpful information about specifying a grammar for a parser? What points, when reading about parser grammar, should I be focusing on?

I edited your question a bit to focus on your actual problem. This site is exactly the kind of place where you can ask your questions about grammars and parsers and get expert answers. If there are external resources worth looking at, they will emerge naturally in the answers that directly help you. — Adam Lear, Sep 11 '11 at 06:03
@Anna: rolled back your edits. The reason I have not accepted any reply so far is precisely because they don't answer my question. I am *specifically* looking for a reference to a textbook, or peer-reviewed article, i.e. a reasonably thorough and authoritative treatment of this problem. I took pains when I wrote my original question to indicate this, by including precisely the phrases that you edited out from my question. — , Sep 11 '11 at 12:57
BTW, this is a meta-problem I have not yet figured out a solution for: how to post a reference request that does not elicit answers that, in essence, attempt to serve as (and therefore, obviate the need for) the reference work I am asking for. I rarely succeed. I feel doubly bad about this, because, in addition to not getting what I am asking for, I feel ungrateful to those who respond, since the replies that I do get tend to require significant more writing from the responders than the what I am asking for. — , Sep 11 '11 at 13:05
@kjo That's unfortunate. If all you're asking for is a list of references, you're not using Stack Exchange to its full potential. Your meta-problem is not using the site as intended. List questions are almost universally [discouraged](http://meta.programmers.stackexchange.com/questions/1624/are-im-looking-for-x-questions-on-topic) on Stack Exchange because they do not fit into the Q&A model very well. I highly recommend you shift your mindset towards asking [questions that have answers, not items, ideas, or opinions](http://blog.stackoverflow.com/2011/01/real-questions-have-answers/). — Adam Lear, Sep 11 '11 at 13:37
@Anna: I find your comment truly baffling. How is "**Q**: I'm looking a reference for X; **A**: Look at *Fundamentals of X, Y, and Z* by F. O. O'Bar", not asking a question that has an answer??? The answer is the reference to a book, or the reference to an article! I don't see what is vague about that. I also don't see why that would be not using the site to its full potential: a reference to a good book or paper by a knowledgeable person can save the asker a *huge* amount of time, of wading through irrelevant or poor-quality material. — , Sep 11 '11 at 13:57
@kjo It's certainly a question, but not the right question to ask *on Stack Exchange*. SE isn't here to build lists of references. It's here to *be* the reference. Please read through the [meta post](http://meta.programmers.stackexchange.com/questions/1624/are-im-looking-for-x-questions-on-topic) I linked to in my comment for a more detailed explanation. — Adam Lear, Sep 11 '11 at 13:59
...(continued from last comment) I have to add that I have been participating in question-answer websites for almost 20 years, and I have *never* had so much difficulty with finding what is considered acceptable and not acceptable to post than with the SE sites. Despite all my efforts to the contrary, my posts are routinely at odds with what others think is appropriate for the site. I have had posts down-voted for being "too specific"! It's damned if you do, damned if you don't. Now I learn that asking for a reference is not asking a question that has an answer????? *sheeesh* — , Sep 11 '11 at 14:04
OK, all I can say is that if a professional asking for a reference to the literature in his/her field is considered an inappropriate question in SE, then it has truly lopped off a *huge* chunk of its potential usefulness as a Q & A site. — , Sep 11 '11 at 14:12
@kjo: Please don't get discouraged! Anna's edits kept the core and the heart of your question and she helped you out by making your question more of the form that we expect on Programmers.SE. I know of no definitive references that you seek, but was able to provide an answer. [OTOH, had I known such a reference, I certainly would have included it.] We want to encourage more of the answers like mine because, in this specific case, I don't believe there is a reference for what you seek, just experience from talking to others. — Macneil, Sep 11 '11 at 16:52
@kjo I've rolled back to Anna's edits and tried to incorporate a specific call for a canonical reference based on [our guidance for book recommendations](http://meta.programmers.stackexchange.com/questions/2003/are-book-recommendations-on-topic): there's a lot of good information in the answers provided, and to invalidate them by making the scope of the question to only be about finding a book would be a waste. Now if we can all just stop with the edit warring, that would be awesome. — , Sep 11 '11 at 17:01
@Macneil: are you saying that you know better than I do what is it that I am looking for??? — , Sep 11 '11 at 17:57
@kjo: I'm not sure what you are referring to specifically, but I do believe that asking a *general* question, which encompasses your more specific question, is the approach that is expected around here. I've had a good question get closed by a single moderator, and it is a very frustrating experience. So, no, I'm not saying I can read your mind, but I can get a read of Programmers.SE. — Macneil, Sep 11 '11 at 18:30
This question is now being [discussed on our meta-discussion site](http://meta.programmers.stackexchange.com/questions/2254/asking-for-a-reference-to-the-literature-on-a-specific-subject). — , Sep 11 '11 at 20:20

Macneil · Answer 1 · 2011-09-10T22:05:22.637

From the sample files you will need to make decisions based on how much you want to generalize from those examples. Suppose you had the following three samples: (each is a separate file)

f() {}
f(a,b) {b+a}
int x = 5;

You could trivially specify two grammars that will accept these samples:

Trivial Grammar One:

start ::= f() {} | f(a,b) {b+a} | int x = 5;

Trivial Grammar Two:

start ::= tokens
tokens ::= token tokens | <empty>
token ::= identifier | literal | { | } | ( | ) | , | + | = | ;

The first one is trivial because it accepts only the three samples. The second one is trivial because it accepts everything that could possibly use those token types. [For this discussion I'm going to assume that you aren't concerned about the tokenizer design much: It's simple to assume identifiers, numbers, and punctuation as your tokens, and you could borrow any token set from any scripting language you'd like anyway.]

So, the procedure you'll need to follow is to start at the high level and decide "how many of each instance do I want to allow?" If a syntactic construct can make sense to repeat any number of times, such as methods in a class, you will want a rule with this form:

methods ::= method methods | empty

Which is better stated in EBNF as:

methods ::= {method}

It will probably be obvious when you only want zero or one instances (meaning that the construct is optional, as with the extends clause for a Java class), or when you want to allow one or more instances (as with a variable initializer in a declaration). You'll need to be mindful of issues like requiring a separator between elements (as with the , in an argument list), requiring a terminator after each element (as with the ; to separate statements), or requiring no separator or terminator (as the case with methods in a class).

If your language uses arithmetic expressions, it would be easy for you to copy from an existing language's precedence rules. It's best to stick to something well-known, like C's expressions rules, than going for something exotic, but only provided that all else is equal.

In addition to precedence issues (what gets parsed with each other) and repetition issues (how many of each element should occur, how are they separated?), you will also need to think about order: Must something always appear before another thing? If one thing is included, should another be excluded?

At this point, you may be tempted to grammatically enforce some rules, a rule such as if a Person's age is specified you don't want to allow their birthdate to be specified as well. While you can construct your grammar to do so, you may find it easier to enforce this with a "semantic check" pass after everything is parsed. This keeps the grammar simpler and, in my opinion, makes for better error messages for when the rule is violated.

+1 for better error messages. Most users of your language won't be experts, whether there are 10 or 10 million of them. Parsing theory has neglected this aspect far too long. — MSalters, Sep 12 '11 at 08:05

Blrfl · Answer 2 · 2011-09-10T20:36:42.953

Where can I learn how to specify the grammar for a parser?

For most parser generators, it's usually some variant of Backus-Naur's <nonterminal> ::= expression format. I'm going to go on the assumption that you're using something like that and not trying to build your parsers by hand. If you can produce a parser for a format where you've been given the syntax (I've included a sample problem below), specifying grammars isn't your problem.

What I think you're up against is divining syntax from a set of samples, which is really more pattern recognition than it is parsing. If you're having to resort to that, it means whoever is providing your data can't give you its syntax because they don't have a good handle on its format. If you have the option of pushing back and telling them to give you a formal definition, do it. It isn't fair for them to give you a vague problem if you could be held responsible for the consequences of a parser based on a deduced syntax accepting bad input or rejecting good input.

...I'm never sure that the grammar I've come up which is good (by any reasonable measure of "good").

"Good" in your situation would have to mean "parses the positives and rejects the negatives." Without any other formal specification of the syntax of your input file, the samples are your only test cases and you can't do any better than that. You could put your foot down and say that only the positive examples are good and reject anything else, but that probably isn't in the spirit of what you're trying to accomplish.

Under saner circumstances, testing a grammar is like testing anything else: you have to come up with enough test cases to exercise all variants of the nonterminals (and terminals, if they're being generated by a lexer).

Sample Problem

Write a grammar that will parse text files containing a list as defined by the rules below:

A list consists of zero or more things.
A thing consists of an identifier, an open brace, an item list and a closing brace.
An _item_list_ consists of zero or more items.
An item constsis of an identifier, an equal sign, another identifier and a semicolon.
An identifier is a sequence of one or more of the characters A-Z, a-z, 0-9 or the underscore.
Whitespace is ignored.

Example of input (all valid):

clank { foo = bar; baz = bear; }
clunk {
    quux =bletch;
    281_apple = OU812;
    He_Eats=Asparagus ; }

And make sure to use "some variant of Backus-Naur" and not BNF itself. BNF can express a grammar, but it makes a lot of very common concepts, such as lists, far more complicated than they need to be. There are various improved versions, such as EBNF, that improve upon these issues. — Mason Wheeler, Sep 11 '11 at 16:58

score 7 · Answer 3 · answered Sep 10 '11 at 22:55

The answers by Macneil and Blrfl are great. I just want to add some comments about the beginning of the process.

A syntax is just a way to represent a program. So the syntax of your language should be determined by your answer to this question: What is a program?

You might say that a program is a collection of classes. Okay, that gives us

program ::= class*

as a starting point. Or you may have to write it

program ::= ε
         |  class program

Now, what is a class? It's got a name; an optional superclass specification; and a bunch of constructor, method, and field declarations. Also, you need some way of grouping a class into a single (unambiguous) unit, and that should involve some concessions to usability (eg, tag it with the reserved word class). Okay:

class ::= "class" identifier extends-clause? "{" class-member-decl * "}"

That's one notation ("concrete syntax") you could choose. Or you could just as easily decide on this:

class ::= "(" "class" identifier extends-clause "(" class-member-decl* ")" ")"

or

class ::= "class" identifier "=" "CLASS" extends-clause? class-member-decl* "END"

You've probably already made this decision implicitly, especially if you have examples, but I just want to reinforce the point: The structure of the syntax is determined by the structure of the programs it represents. That's what gets you past the "trivial grammars" from Macneil's answer. Example programs are still very important, though. They serve two purposes. First, they help you figure out, at the abstract level, what a program is. Second, they help you decide what concrete syntax you should use to represent your language's structure.

Once you've got the structure down, you should go back and deal with issues like allowing whitespace and comments, fixing ambiguities, etc. These are important, but they're secondary to the overall design, and they're highly dependent on the parsing technology you're using.

Finally, don't try to represent everything about your language in the grammar. For example, you may want to forbid certain kinds of unreachable code (eg, a statement after a return, like in Java). You probably shouldn't try to cram that into the grammar, because you'll either miss things (whoops, what if return is in braces, or what if both branches of an if statement return?) or you'll make your grammar way too complicated to manage. It's a context-sensitive constraint; write it as a separate pass. Another very common example of a context-sensitive constraint is a type system. You could reject expressions like 1 + "a" in the grammar, if you worked hard enough, but you couldn't reject 1 + x (where x has type string). So avoid half-baked restrictions in the grammar and do them correctly as a separate pass.

How should I specify a grammar for a parser?

3 Answers3