I am trying to make a simple expression parser, in which users type an expression in infix notation (for example `ln((1+2)(3-4))`) to apply some calculations. To make this possible I need to parse the expression into an expression tree. I've read in blogs and forums that I first need to convert the expression to postfix (or a similar notation) and then convert the postfixed expression to a tree. Is there any advantage in doing this? Why can't I simply convert the infix expression to a binary tree? Is there any performance improvement? I think parsing an algebraic expression twice is a waste of CPU cycles. Am I wrong?


- If you are just parsing a math expression, converting to postfix should be all you need, since you can easily calculate the result from its stack. – GrandmasterB Nov 15 '17 at 03:47
- FWIW, you can certainly parse infix expressions directly into trees. – Erik Eidt Nov 15 '17 at 03:51
- No, you're not wrong. There's absolutely no need or benefit to going through "post-fix" expressions to produce a parse tree. The articles saying so were either doing it for didactic reasons or were just confused. Instead of reading "blogs and forums", I recommend reading books and lecture notes. There are *tons* of *free* resources of that sort available on the internet for most CS topics, and certainly for parsing. – Derek Elkins left SE Nov 15 '17 at 05:52
- You have all the information you require with either a postfix stack or an expression tree in order to perform the calculation. An expression tree is more flexible, so if you think you might need to do something else with it afterwards, you should prefer that. Otherwise a postfix stack is sufficient. – Neil Nov 15 '17 at 07:31
- I'm guessing that the articles you read talked about the [shunting yard algorithm](https://en.wikipedia.org/wiki/Shunting-yard_algorithm) rather than a grammar-driven parser. The former produces postfix output that can easily be interpreted as a tree; the latter most naturally produces a tree structure that can then be interpreted as postfix (I thought a bit before adding the second clause, but I'll stick by it). – kdgregory Nov 15 '17 at 11:02
- Searching on Google, I found [this similar question](https://stackoverflow.com/questions/18144196/convert-algebraic-expression-directly-into-binary-tree-structure-sans-prefix) on Stack Overflow, so it seems it's not a misconception of mine. I think it's a waste of CPU cycles unless handling nodes of a tree is heavier for the CPU, in which case creating a tree is needless. But the goal is to create a tree due to its flexibility, as @Neil said. Thank you all for helping. – JulianSoto Nov 16 '17 at 01:16
1 Answer
> Is it worth parsing an infix algebraic expression to postfix and then an expression tree?
No, go directly to the expression tree. In all compilers I've checked out (Lua, Go, tinyCC), there is no step converting to postfix.
> I first need to convert the expression to postfix (or a similar notation) and then convert the postfixed expression to a tree. Why can't I simply convert the infix expression to a binary tree?
You can and it is often easier to create a tree directly.
> Is there any advantage in doing this? Is there any performance improvement? I think parsing an algebraic expression twice is a waste of CPU cycles. Am I wrong?
In the past, compilers were advertised as being one-pass to improve efficiency. The speed of modern processors no longer warrants it. I'd bet either technique can process megabytes of input per second, which is enough for most applications of parsers. The big-O complexity of both algorithms is the same, and both make use of a stack. The real performance difference can only be known by comparing two implementations, but I speculate it is minor.
For the language you provided, it is possible to write an acceptable parser in 100-200 lines of C (I put an example on Github). I'll give you a primer on each stage.
Lexing
The lexer takes a string as input and outputs a stream of tokens and their types. For example, for the input `ln((1+2)(3-4))`, the token types are:

    LN ( ( CONST ADD CONST ) ( CONST SUB CONST ) ) EOF
The lexer is also responsible for removing whitespace and comments from the input.
Typically to implement a lexer, you write three functions:
- `next()`: Returns the next token and its type. The token is removed from the input. Repeatedly calling this function reads the entire input until EOF is read.
- `peek()`: Same as `next()`, but does not remove the token from the input. Calling this function twice in a row gives the same output.
- `expect(token_type)`: Raises a parsing error if the next token is not of the given type.
See these slides for an example of how these functions are implemented in Golang's compiler. You might also want to look into using regular expressions to implement a lexer.
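To make these concrete, here is a minimal sketch of the three functions for this token set, under a few assumptions: tokens are plain `int`s, single characters like `'('` stand for themselves, the other token types reuse the `NodeType` enum from the AST section below (repeated here so the snippet is self-contained), and `END` marks the end of input (the `EOF` token above).

    #include <ctype.h>   /* isdigit, isspace */
    #include <stdio.h>   /* fprintf */
    #include <stdlib.h>  /* exit */

    enum NodeType { LN, ADD, SUB, MUL, CONST };  /* doubles as token types */
    enum { END = -1 };                           /* end-of-input (EOF) marker */

    /* Return the next token's type without consuming it. */
    int peek(const char **s) {
        while (isspace((unsigned char)**s)) (*s)++;  /* skip whitespace */
        switch (**s) {
        case '\0': return END;
        case '+':  return ADD;
        case '-':  return SUB;
        case 'l':  return LN;   /* assumes an 'l' always starts "ln" */
        default:
            if (isdigit((unsigned char)**s)) return CONST;
            return **s;         /* '(' and ')' stand for themselves */
        }
    }

    /* Return the next token's type and consume it from the input. */
    int next(const char **s) {
        int type = peek(s);
        if (type == LN)
            *s += 2;                                           /* skip "ln" */
        else if (type == CONST)
            while (isdigit((unsigned char)**s) || **s == '.')  /* skip number */
                (*s)++;
        else if (type != END)
            (*s)++;                                            /* single char */
        return type;
    }

    /* Raise a parsing error if the next token is not of the given type. */
    void expect(const char **s, int type) {
        if (next(s) != type) {
            fprintf(stderr, "parse error near \"%s\"\n", *s);
            exit(1);
        }
    }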
Grammar
I took some liberties when converting your input to a Context-Free Grammar:
    Expression = Terms { ( '+' | '-' ) Terms }
    Terms      = Primary { '(' Expression ')' }
    Primary    = 'ln' Primary
               | '(' Expression ')'
               | CONST
This grammar is written in Extended Backus-Naur Form (EBNF). The terminals are `'+'`, `'-'`, `'('`, `')'`, `'ln'` and `CONST`, where `CONST` is a real number. The non-terminals are `Expression`, `Terms` and `Primary`. The `{}` braces mean zero or more repetitions. The `|` pipe means either the left side or the right side.
For example, the `Expression` rule is one or more `Terms` separated by `+` or `-` tokens. Notice the grammar is recursive. For inspiration on creating a grammar, look at any programming language's specification. For example, section 6.5 of the ISO C standard lists the rules for C's expression syntax.
With this grammar, we can create the top-down derivation from a series of tokens. For example, `CONST + ln CONST`:
    Expression
    => Terms { ( '+' | '-' ) Terms }
    => Primary { '(' Expression ')' } { ( '+' | '-' ) Terms }
    => CONST { '(' Expression ')' } { ( '+' | '-' ) Terms }
    => CONST { ( '+' | '-' ) Terms }
    => CONST ( '+' | '-' ) Terms { ( '+' | '-' ) Terms }
    => CONST '+' Terms { ( '+' | '-' ) Terms }
    => CONST '+' Primary { '(' Expression ')' } { ( '+' | '-' ) Terms }
    => CONST '+' 'ln' Primary { '(' Expression ')' } { ( '+' | '-' ) Terms }
    => CONST '+' 'ln' CONST { '(' Expression ')' } { ( '+' | '-' ) Terms }
    => CONST '+' 'ln' CONST { ( '+' | '-' ) Terms }
    => CONST '+' 'ln' CONST
The first non-terminal (or decision) on each line is expanded into a rule. This is performed repeatedly until only terminals remain. Any input derived from `Expression` is part of the language.
I'll explain later in the parser section how to write a parser for this grammar.
AST
The Abstract Syntax Tree (AST) is a data structure in memory which is a simplification of the grammar and is easy to traverse. You described this as a "binary tree". Note that the `CONST` type has zero children (but stores a value), the `LN` type has one child, and all the other types have two children each.
    enum NodeType { LN, ADD, SUB, MUL, CONST };

    struct AstNode {
        enum NodeType type;
        double value;
        struct AstNode *children[2];
    };
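To see the shape in practice, here is how the tree for `ln(1+2)` could be built by hand, with a hypothetical `newNode` helper (the parser below constructs nodes directly instead):

    #include <stdlib.h>  /* malloc */

    /* Hypothetical helper: allocate a node and initialize its fields. */
    struct AstNode *newNode(enum NodeType type, double value) {
        struct AstNode *n = malloc(sizeof(*n));
        n->type = type;
        n->value = value;
        n->children[0] = n->children[1] = NULL;
        return n;
    }

    /* ln(1+2): CONST has no children, ADD has two, LN has one. */
    struct AstNode *buildExample(void) {
        struct AstNode *sum = newNode(ADD, 0);
        sum->children[0] = newNode(CONST, 1.0);
        sum->children[1] = newNode(CONST, 2.0);
        struct AstNode *root = newNode(LN, 0);
        root->children[0] = sum;
        return root;
    }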
Passes
The following compiler pass traverses the AST and computes the expression's value as a `double`:
    #include <math.h>    /* log */
    #include <stdlib.h>  /* abort */

    /* Evaluate the tree bottom-up with a recursive traversal. */
    double eval(struct AstNode *node) {
        struct AstNode **c = node->children;
        switch (node->type) {
        case LN:    return log(eval(c[0]));
        case ADD:   return eval(c[0]) + eval(c[1]);
        case SUB:   return eval(c[0]) - eval(c[1]);
        case MUL:   return eval(c[0]) * eval(c[1]);
        case CONST: return node->value;
        default:    abort();  /* unreachable if the parser is correct */
        }
    }
More passes can be implemented using the same AST. For example, passes might:

- `free` the memory used by the tree
- Perform optimizations such as constant folding
- Do satisfiability or symbolic analysis (think WolframAlpha)

Being able to write multiple passes for the same AST structure is a huge advantage over postfix evaluation. Modern compilers like LLVM have dozens of passes!
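As a taste of such a pass, here is a sketch of constant folding over the same `AstNode` structure (the name `fold` and the in-place rewriting are just one way to do it). In this tiny language every subtree is constant, so folding collapses the whole tree; the same pattern becomes genuinely useful once variables are added:

    #include <math.h>    /* log */
    #include <stdlib.h>  /* free */

    /* If all of a node's children are constants, compute the result now
     * and turn the node into a CONST leaf. */
    void fold(struct AstNode *node) {
        struct AstNode **c = node->children;
        if (node->type == CONST) return;       /* leaf: nothing to fold */
        fold(c[0]);                            /* fold children first */
        if (node->type != LN) fold(c[1]);      /* LN has only one child */
        if (c[0]->type != CONST) return;
        if (node->type != LN && c[1]->type != CONST) return;
        switch (node->type) {
        case LN:  node->value = log(c[0]->value); break;
        case ADD: node->value = c[0]->value + c[1]->value; break;
        case SUB: node->value = c[0]->value - c[1]->value; break;
        case MUL: node->value = c[0]->value * c[1]->value; break;
        default:  return;
        }
        free(c[0]);                            /* children are CONST leaves */
        if (node->type != LN) free(c[1]);
        node->type = CONST;
    }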
Predictive Parser
For non-obvious reasons, the above grammar is LL(1): left-to-right, leftmost derivation, 1 token of lookahead. This means you always know which rule to expand by looking at only the next token (the `peek()` function described earlier). Since the grammar is LL(1), a predictive parser can be written.
Generally, an EBNF grammar is converted to code with the following steps:

- Rules become functions
- `{}` becomes `for` or `while` loops
- `|` becomes `if` or `switch` statements
- Terminals become calls to `next()` or `expect()`
- Be careful about associativity: `a-b-c != a-(b-c)` (see the sketch below)
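For instance, here is a sketch of how the `Expression` rule could translate (`parseTerms`, handling the `Terms` rule, follows the same pattern and is not shown). The `while` loop implements `{}`, and wrapping everything parsed so far as the left child keeps `-` left-associative:

    #include <stdlib.h>  /* malloc */

    struct AstNode *parseTerms(const char **s);  /* Terms rule, analogous */

    /* Expression = Terms { ( '+' | '-' ) Terms } */
    struct AstNode *parseExpression(const char **s) {
        struct AstNode *n = parseTerms(s);
        while (peek(s) == ADD || peek(s) == SUB) {  /* the {} loop */
            struct AstNode *op = malloc(sizeof(*op));
            op->type = next(s);           /* ADD or SUB */
            op->children[0] = n;          /* everything parsed so far */
            op->children[1] = parseTerms(s);
            n = op;                       /* a-b-c builds (a-b)-c */
        }
        return n;
    }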
The following function parses the `Primary` rule. Notice it allocates a new AST node and fills it in based on a lookahead of one token:
    #include <stdlib.h>  /* malloc, free, atof */

    struct AstNode *parsePrimary(const char **s) {
        struct AstNode *n = malloc(sizeof(*n));
        switch (peek(s)) {
        case LN:                      /* Primary = 'ln' Primary */
            n->type = next(s);
            n->children[0] = parsePrimary(s);
            break;
        case '(':                     /* Primary = '(' Expression ')' */
            free(n);                  /* parseExpression allocates its own node */
            expect(s, '(');
            n = parseExpression(s);
            expect(s, ')');
            break;
        case CONST:                   /* Primary = CONST */
            n->value = atof(*s);      /* read the value before consuming it */
            n->type = next(s);
            break;
        default:
            expect(s, CONST);         /* raises an error and exits */
        }
        return n;
    }
Putting it all together
There needs to be a function to tie together all the stages:
    #include <stdio.h>  /* printf, fprintf */
    int main(int argc, const char **argv) {
        if (argc < 2) {  /* guard against a missing argument */
            fprintf(stderr, "usage: %s EXPRESSION\n", argv[0]);
            return 1;
        }
        const char **s = &argv[1];
        struct AstNode *ast = parseExpression(s);
        expect(s, END);  /* the whole input must be consumed */
        printf("%lf\n", eval(ast));
        return 0;
    }
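Assuming the snippets above are pasted into a single file (say `calc.c`, a name I just made up), something like `cc -o calc calc.c -lm` should build it on a Unix-like system; the `-lm` flag links the math library that provides `log`. Running `./calc '1+2'` would then print `3.000000`.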
This is a lot of information, so I would recommend reading further and possibly watching some YouTube videos on the topic. I posted the source code on Github (it is rather hastily written, like all software, but should work).
