
(TLDR) When building a parser for a recursive grammar by composing individual parsers (e.g. with a parser combinator framework), there are often circular dependencies between some of the individual parsers. While circular dependencies are generally a sign of bad design, is this a valid case where the circular dependency is inevitable? If so, which solution is preferable for dealing with the circular dependency? Or are parser combinators just a bad idea altogether? (/TLDR)


There are other questions asking about dependency injection with circular dependencies. Typically, the answer is to change the design to avoid the circularity.

I have come across a typical case where I encounter circular dependencies: when I want different services to inspect a recursive structure.

I have been thinking about other examples, but so far the best I have come up with is a parser for a recursive grammar. Let's use JSON as an example, because it is simple and well-known.

  • A JSON "value" can be a string (".."), an object ({..}) or an array ([..]).
  • A string is simple to parse, and has no further children.
  • An object is made up of keys and values, and the surrounding object syntax.
  • An array is made up of values, and the surrounding array syntax.
  • A "key" within an object is basically the same as a string.

Now I am going to create a number of parser objects:

  • A value parser, which depends on the other 3 parsers.
  • An object parser, which depends on string parser and value parser.
  • An array parser, which depends on value parser.
  • A string parser.

I want to manage the 4 parsers with a dependency injection container. Or, even if I don't want a container, I still have to figure out in which order to create the different parsers. There is a chicken-and-egg problem.
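
To make the chicken-and-egg problem concrete, here is roughly what naive constructor injection would look like (a sketch in Java for concreteness; the interface and class names are made up for illustration):

interface Parser {
    Object parse(String input);
}

class StringParser implements Parser {
    public Object parse(String input) { /* ... */ return null; }
}

class ArrayParser implements Parser {
    private final Parser valueParser;
    ArrayParser(Parser valueParser) { this.valueParser = valueParser; }
    public Object parse(String input) { /* ... elements are delegated to valueParser ... */ return null; }
}

class ObjectParser implements Parser {
    private final Parser stringParser;
    private final Parser valueParser;
    ObjectParser(Parser stringParser, Parser valueParser) {
        this.stringParser = stringParser;
        this.valueParser = valueParser;
    }
    public Object parse(String input) { /* ... keys via stringParser, values via valueParser ... */ return null; }
}

class ValueParser implements Parser {
    private final Parser stringParser;
    private final Parser objectParser;
    private final Parser arrayParser;
    ValueParser(Parser stringParser, Parser objectParser, Parser arrayParser) {
        this.stringParser = stringParser;
        this.objectParser = objectParser;
        this.arrayParser = arrayParser;
    }
    public Object parse(String input) { /* ... tries string, object and array in turn ... */ return null; }
}

// There is no valid creation order: new ValueParser(...) needs the object and
// array parsers, which in turn need the value parser in their constructors.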


There are known solutions to this, which can be observed in existing parser libraries. So far I have mostly seen the "stub" solution.

  1. Avoid the circular dependency..

.. by passing the value parser as an argument to the object and array parsers' parse() method.

This works, but it taints the signature of the parse() method. Imagine we want this to be something like a parser combinator, which can be reused for other grammars. We would want a parser interface that is generic and independent of the specific grammar, so we can't have it require a specific parser to be passed around.
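
For illustration, roughly in the same made-up Java as above: the object and array parsers no longer need the value parser at construction time, but the price is a grammar-specific parse() signature.

// The value parser is now an argument of parse() instead of a constructor dependency.
class ObjectParser {
    private final StringParser stringParser;
    ObjectParser(StringParser stringParser) { this.stringParser = stringParser; }

    Object parse(String input, Parser valueParser) {
        // ... keys via stringParser, nested values via the valueParser argument ...
        return null;
    }
}

class ArrayParser {
    Object parse(String input, Parser valueParser) {
        // ... elements are delegated to the valueParser argument ...
        return null;
    }
}

// The value parser can now be constructed last, but ObjectParser and ArrayParser
// no longer fit a generic Parser interface with a plain parse(input) method.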

  2. Use a stub.

Instead of requiring each dependency in the constructor, we could use a set() or add() method on one of the parsers. E.g. we first create an empty value parser ("stub"), and then add the object, array and string parsers to it via the add() method.
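
Again as a rough Java sketch (reusing the made-up classes from the first sketch, with the constructor-injected value parser replaced by a settable one):

import java.util.ArrayList;
import java.util.List;

class ValueParser implements Parser {
    private final List<Parser> alternatives = new ArrayList<>();

    // The wiring step that has to happen after construction.
    void add(Parser parser) { alternatives.add(parser); }

    public Object parse(String input) { /* ... try each alternative in turn ... */ return null; }
}

class Bootstrap {
    static Parser createValueParser() {
        ValueParser value = new ValueParser();            // the "stub", not usable yet
        Parser string = new StringParser();
        Parser object = new ObjectParser(string, value);  // the stub is injected here
        Parser array  = new ArrayParser(value);
        value.add(string);
        value.add(object);
        value.add(array);                                 // only now is the cycle closed
        return value;
    }
}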

  3. Use a proxy.

Instead of creating the actual value parser, we create a proxy object with a reference to the container. Only when the parse() method is called on the proxy for the first time is the real value parser created.
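
Roughly, as a sketch (the Container interface and its get() method here are invented purely for illustration; real containers have their own APIs):

interface Container {
    Parser get(String id);   // hypothetical lookup method
}

class ValueParserProxy implements Parser {
    private final Container container;
    private Parser real;     // the actual value parser, created lazily

    ValueParserProxy(Container container) { this.container = container; }

    public Object parse(String input) {
        if (real == null) {
            // Resolved only on first use, after the object and array parsers
            // have already been wired up with this proxy.
            real = container.get("value_parser");
        }
        return real.parse(input);
    }
}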


Now this is all fine, and I suppose it is just a matter of taste which solution I prefer.

But how does this fit with the typical high-horse response that circular dependencies are a sign of bad design? The example seems totally valid, and there is an entire class of problems like this.

donquixote
  • Just a thought... You should place the specific question at the very beginning and rephrase it so we don't have to read all of the extra info in order to understand exactly what you're asking. – Levi Fuller Jan 08 '16 at 21:08
  • I added a TLDR at the beginning, hope it helps. – donquixote Jan 08 '16 at 21:17
  • A valid question is whether you should use dependency injection for a given project at all. It's a nice technique, but it's no silver bullet. – T. Sar Mar 07 '16 at 20:40

4 Answers

6
  1. Don't make 4 parsers; make 1 parser. You're not parsing 4 different languages; you're parsing 1 language with 4 major grammatical components. The "circular dependency problem" can be handled quite easily with a bog-standard recursive-descent parser whose ParseObject and ParseValue methods can both call each other.
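
A rough sketch of what that looks like (Java, heavily abbreviated; not a complete or robust JSON parser):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JsonParser {
    private final String input;
    private int pos = 0;

    JsonParser(String input) { this.input = input; }

    Object parseValue() {
        switch (input.charAt(pos)) {
            case '"': return parseString();
            case '{': return parseObject();   // mutual recursion with parseValue()
            case '[': return parseArray();    // mutual recursion with parseValue()
            default:  throw new IllegalArgumentException("Unexpected character at " + pos);
        }
    }

    Map<String, Object> parseObject() {
        Map<String, Object> members = new HashMap<>();
        // ... consume '{', then repeatedly: key = parseString(), ':', value = parseValue(), ',' ...
        return members;
    }

    List<Object> parseArray() {
        List<Object> elements = new ArrayList<>();
        // ... consume '[', then repeatedly: elements.add(parseValue()), ',' ...
        return elements;
    }

    String parseString() {
        // ... consume opening '"', characters, closing '"' ...
        return "";
    }
}
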
Mason Wheeler
  • Sure, this works. But what if the grammar is more complex, and I want to split it up into more than one parser class/object? Or I want to use a parser combinator, where I always have to slice it up into elementary parser classes? – donquixote Jan 08 '16 at 20:54
  • @donquixote If the grammar is too complex to easily get right in a simple, hand-written RDP, my advice would be "use ANTLR", not "split it up into multiple parsers." The bigger and more complex a grammar is, the more likely it is to contain mutually-recursive grammatical elements like in this example, which means your problem never goes away. – Mason Wheeler Jan 08 '16 at 20:56
  • From a quick look, I assume that ANTLR is a code generation tool, that will generate a big parser class. Right? So I would have to hope that such a tool exists for the language of my choice, or write my own. – donquixote Jan 08 '16 at 21:05
  • Which still leaves the question what I do if I really want something with smaller objects and composition. E.g. if the exact grammar rules are not known at compile time. But I think I need to come up with a convincing example. For the existing example, your answer is valid. – donquixote Jan 08 '16 at 21:10
  • @donquixote: It's a parser generator, a very good one. You write up some special source code in ANTLR format which describes the grammar of the language you're trying to parse, and feed it into ANTLR, and it generates a parser from that. The generated parser is pretty complicated and it's essentially a black box (even though the entire thing is open-source), but it's widely used and well-tested. – Mason Wheeler Jan 08 '16 at 21:10
  • @donquixote: If the exact grammar rules aren't known at compile time, you'd want to use a different parser generator. Something like OMeta, which is more complicated and slower than ANTLR, but highly dynamic and customizable. You can plug in new rules very easily with it. I don't know why you keep trying to come back around to this idea of "composing smaller parsers into larger ones" but it really sounds like a disaster waiting to happen! My advice is, just don't do it. – Mason Wheeler Jan 08 '16 at 21:13
  • Ok, you just destroyed the entire concept of parser combinators. https://github.com/ferno/loco https://github.com/Vektah/parser-combinator I suppose these exist in languages other than php, but only had a closer look at those two. (I'm off for now, will follow up tomorrow) – donquixote Jan 08 '16 at 21:21
  • Yeah, I've heard of parser combinators. They exist in plenty of languages, and they've always seemed like a really bad idea IMO. – Mason Wheeler Jan 08 '16 at 21:32
  • Just for completeness: ANTLR generates a parser in a specific language. Java, I suppose. A language-agnostic answer would be "Find or write a parser generator". And in any case, this would be a code generation tool that generates the code for one parser class. Correct? – donquixote Jan 09 '16 at 07:54
  • Also, ANTLR only works for the parser problem. If we have a different case of "process a recursive structure", we need another generator tool. Or hand-craft it instead of generating. But the answer would still be, the circular parts should be in a single class. – donquixote Jan 09 '16 at 07:57
  • @donquixote Correct. While it's common to have mutually recursive functions, they almost always come up as a small, manageable set of functions that are used only to process some kind of recursive structure (a BNF grammar, a red-black tree, etc.), so they should be part of a single class, and then there are no circular dependencies. – Ixrec Jan 09 '16 at 13:05
1

First, there is no way of expressing a JSON parser in a way that is not ultimately "circular". We can, however, stave off the circularity. To express this in a simpler way, we use a simple formal language defined as

array = squareArray | angleArray
squareArray = "[" array* "]"
angleArray = "<" array* ">"

and a corresponding type (in pseudo-code)

type Array = SquareArray List[Array] | AngleArray List[Array]

We then define a simple parser of this type (using a fictional monadic parser library).

arrayParser : Parser[Array]
arrayParser = try squareArrayParser `or` angleArrayParser

squareArrayParser : Parser[Array]
squareArrayParser = 
      from start  in parseText "["
      from arrays in many arrayParser
      from end    in parseText "]"
      select SquareArray arrays

angleArrayParser : Parser[Array] 
angleArrayParser = 
      from start  in parseText "<"
      from arrays in many arrayParser
      from end    in parseText ">"
      select AngleArray arrays

There is, to my mind, absolutely nothing wrong with the above definition. If we are dead set on removing circularity, we can note that arrayParser, squareArrayParser and angleArrayParser are all defined in terms of one another. We can factor out this recursive nature by first making each parser take the parsers it depends on as parameters.

arrayParserF : forall t. Parser[t] -> Parser[t] -> Parser[t]
arrayParserF pSquareArray pAngleArray = try pSquareArray `or` pAngleArray

squareArrayParserF : forall s, t. (List[s] -> t) -> Parser[s] -> Parser[t]
squareArrayParserF pSquareArray pArrayParser = 
      from start  in parseText "["
      from arrays in many pArrayParser
      from end    in parseText "]"
      select pSquareArray arrays

angleArrayParserF : forall s, t. (List[s] -> t) -> Parser[s] -> Parser[t]
angleArrayParserF pAngleArray pArrayParser = 
      from start  in parseText "<"
      from arrays in many pArrayParser
      from end    in parseText ">"
      select pAngleArray arrays

We see that we can immediately get back what we started with by saying

arrayParser' : Parser[Array]
arrayParser' = arrayParserF squareArrayParser' angleArrayParser'
  where
    squareArrayParser' : Parser[Array]
    squareArrayParser' = squareArrayParserF SquareArray arrayParser' 

    angleArrayParser' : Parser[Array]
    angleArrayParser' = angleArrayParserF AngleArray arrayParser'

Or equivalently

arrayParser' : Parser[Array]
arrayParser' = (\ pArrayParser -> 
                  arrayParserF 
                    (squareArrayParserF SquareArray pArrayParser)
                    (angleArrayParserF AngleArray pArrayParser)) arrayParser'

Or equivalently (for fun)

arrayParser' : Parser[Array]
arrayParser' = fix (makeArrayParser SquareArray AngleArray)

makeArrayParser : forall t. (List[t] -> t) -> (List[t] -> t) -> Parser[t] -> Parser[t]
makeArrayParser pSquareArray pAngleArray pArrayParser = 
    arrayParserF 
        (squareArrayParserF pSquareArray pArrayParser)
        (angleArrayParserF  pAngleArray  pArrayParser)

While this is mostly academic, it is not entirely useless, as we can now very easily define a parser that parses up to some fixed nesting of brackets.

fixedArrayParser : Int -> Parser[Array]
fixedArrayParser 1 = makeArrayParser SquareArray AngleArray blankParser
fixedArrayParser (n + 1) = makeArrayParser SquareArray AngleArray (fixedArrayParser n)

We can even get more creative with our combinators and interpret our language as expressions of sums and products

sumsAndProductsParser : Parser[Int]
sumsAndProductsParser = fix (makeArrayParser product sum)

So factoring out our parser into subparsers can help us achieve flexibility and usability -- it does make it much easier to parse restrictions and extensions of the original grammar -- but it does so at a cost of complexity. If all you want to do is parse JSON then this may well be too much.

walpen
  • Hu.. I suppose this is pseudocode, but what kind of language is it based on? I am not familiar with it. – donquixote Jan 09 '16 at 08:15
  • And, do you think this matches any of the patterns/solutions mentioned above? (argument, stub, proxy) – donquixote Jan 09 '16 at 08:17
  • I don't recognise the language either, but it is very similar to Haskell (the only differences I see are the use of ':' to introduce a type signature, where Haskell uses '::', and 'forall v.' to introduce a type variable, which Haskell does implicitly). – Jules Jan 09 '16 at 09:50
  • It appears to correspond to your argument method, but then uses higher-order functions to make the arguments implicit, resulting in something close to your proxy, only based on functions rather than objects. – Jules Jan 09 '16 at 09:52
1

Many grammars have mutually recursive production rules, JSON being a prime example. It is not a mark of a bad design.

With regards to using parser combinator frameworks, if the host language allows mutually recursive definitions (e.g. Haskell), then there is no problem. In languages which do not (e.g. Java), one trick is to allow for lazily initialised parsers - I used this in my framework. An example of its usage, for a JSON grammar, is here.

The lazily initialised parser is jvalue:

private static final Parser.Ref<Character, Node> jvalue = Parser.ref();

which is declared at the beginning of the grammar, and initialised at the end:

static {
    jvalue.set(
        choice(
            jnull,
            jbool,
            jnumber,
            jtext,
            jarray,
            jobject
        ).label("JSON value")
    );
}

The Parser.Ref type is itself a Parser (i.e. it implements the Parser interface).
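
Conceptually, such a settable reference only needs to delegate to a parser that is set later. A simplified sketch (with illustrative stand-in types rather than the framework's actual classes):

// Illustrative stand-ins, not the framework's real types.
interface Parser<T> {
    T parse(String input);
}

class ParserRef<T> implements Parser<T> {
    private Parser<T> target;   // set once the whole grammar has been defined

    void set(Parser<T> target) { this.target = target; }

    @Override
    public T parse(String input) {
        if (target == null) {
            throw new IllegalStateException("Parser reference used before being set");
        }
        return target.parse(input);   // plain delegation
    }
}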

One of the prime advantages of the parser combinator approach is that it provides a DSL for defining grammars, which is hosted in the programming language you're already using. This itself has benefits, such as making the semantic actions associated with your grammar production rules subject to the type-checking provided by the host language - i.e. your rules have to be well-typed. Another benefit is being able to extend the DSL with your own combinators.

jon hanson
  • I think what you are doing here matches the "stub" solution I mentioned above. – donquixote Jan 09 '16 at 08:08
  • As a side note: I don't know why everything in your class has to be a static property or static method. Instead you could use instance methods and local variables - not even private at object scope, but local within the scope of one function. – donquixote Jan 09 '16 at 08:10
  • @donquixote Each of the constituent parsers used to build up the main parser are pure functions that do not operate on any object state. The enclosing class is simply being used as a module to house the functions. Consequently they are static fields and methods, which are initialised once when the class is first referenced. If they were declared as variables within a method then they would be initialised every time the method is called. – jon hanson Jan 09 '16 at 08:38
0

Recursive data structures are always a bit tricky, and you will either need some kind of laziness, or to mutate your data structure after initial construction. Of these two, mutation is simpler, and can easily be used to implement laziness. However, external set()ers lead to a quite fragile design where you are forced to perform easily forgotten, manual steps to make sure your parser object has been properly initialized.

For that reason, I would consider a dependency container that is queried lazily to be the far cleaner solution:

class ObjectParser(parsers) {
    parse(input) { ... parsers.getArrayParser().parse(input) ... }
}

class ArrayParser(parsers) {
    parse(input) { ... parsers.getObjectParser().parse(input) ... }
}

class Parsers {
   // Instead of creating a new object each time,
   // you could use the getter to lazily initialize some field.
   getArrayParser()  { return new ArrayParser(this) }
   getObjectParser() { return new ObjectParser(this) }
}

new Parsers().getObjectParser().parse(input)

Note that the parser "objects" ObjectParser and ArrayParser only have a single public method parse() in your design, so they would actually be equivalent to functions or closures. We can inline parsers.getArrayParser().parse(input) to parsers.parseArray(input):

class Parsers {
  parseArray(input) { ... this.parseObject(input) ... }
  parseObject(input) { ... this.parseArray(input) ... }
}

new Parsers().parseObject(input);

So magically, the latter code is exactly equivalent and equally extensible, but uses far less code. Why? Because the host language object system already introduces the required indirection to make this work: the implicit this parameter (previously the explicit parsers) is already pointer-like, and method lookup must be done fairly lazily if you might have subclasses (“late binding”). But we don't even need that for a recursive descent parser, if our language allows us to pre-declare functions, or does not even need predeclarations. In C, this would work as well:

// predeclarations
ResultT parseArray(InputT input);
ResultT parseObject(InputT input);

// implementation
ResultT parseArray(InputT input) { ... parseObject(input) ... }
ResultT parseObject(InputT input)  { ... parseArray(input) ... }

Where does the indirection come from now? From the compiler: due to the predeclarations, a call to one function can be compiled before the call target is known. This is fairly straightforward if both functions are in the same compilation unit, otherwise a linker is required to “tie the knot”.

If you are just writing a recursive descent parser (which works for JSON but doesn't work well for more complicated languages that aren't LL(1)), then using the simplest thing that could work is the solution you should choose: no mutability, no proxy objects, no virtual dispatch, just mutually recursive procedures.

If you want to use a parser generator that simulates a recursive descent parser for you, you will need some values or objects to pass around. Should your language happen to support functional programming or at least higher-order functions, nothing changes, because you can just pass handles to those parser functions around (e.g. function pointers in C, method objects in Java 8, …). If your language is more purist OOP (e.g. Java < 8, C++ < 11), you will need the explicit objects ObjectParser, ArrayParser etc. as in the first example. Only then would I consider this massive waste of code to be a viable solution.
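
For example, with Java 8 method references the mutually recursive parse methods of a single class can be passed around as values directly, without one-method wrapper classes (the choice() helper below is made up for illustration):

import java.util.function.Function;

class Parsers {
    Object parseValue(String input)  { /* ... dispatches to the other two ... */ return null; }
    Object parseObject(String input) { /* ... uses parseValue() for nested values ... */ return null; }
    Object parseArray(String input)  { /* ... uses parseValue() for elements ... */ return null; }
}

class Demo {
    // A hypothetical combinator that takes parse functions as plain values.
    static Function<String, Object> choice(Function<String, Object> first,
                                           Function<String, Object> second) {
        return input -> { /* ... try first, fall back to second ... */ return null; };
    }

    public static void main(String[] args) {
        Parsers p = new Parsers();
        // Method references play the role of the parser "objects".
        Function<String, Object> value = choice(p::parseObject, p::parseArray);
        value.apply("{}");
    }
}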

amon
  • In the first example, every elementary parser depends on the container (the "Parsers" object). This defeats the idea of dependency inversion and composition. Instead of depending on an interface/abstraction, the elementary parsers depend on a specific key in the container, a specific parser. – donquixote Jan 09 '16 at 08:29
  • The second part of the answer is, I think, equivalent to what Mason Wheeler suggested in his answer above. And indeed, a single class with methods is preferable to a bunch of one-method objects all depending on the container. – donquixote Jan 09 '16 at 08:29