I am trying to make a simple expression parser, in which users type an expression in infix notation (for example `ln((1+2)(3-4))`) to apply some calculations. To make this possible I need to parse the expression into an expression tree. I've read in blogs and forums that I first need to convert the expression to postfix (or a similar notation) and then convert the postfixed expression to a tree. Is there any advantage in doing this? Why can't I simply convert the infix expression to a binary tree? Is there any performance improvement? I think parsing an algebraic expression twice is a waste of CPU cycles. Am I wrong?


- If you are just parsing a math expression, converting to postfix should be all you need, since you can easily calculate the result from its stack. – GrandmasterB Nov 15 '17 at 03:47
- FWIW, you can certainly parse infix expressions directly into trees. – Erik Eidt Nov 15 '17 at 03:51
- No, you're not wrong. There's absolutely no need or benefit to going through "post-fix" expressions to produce a parse tree. The articles saying so were either doing it for didactic reasons or were just confused. Instead of reading "blogs and forums", I recommend reading books and lecture notes. There are *tons* of *free* resources of that sort available on the internet for most CS topics, and certainly for parsing. – Derek Elkins left SE Nov 15 '17 at 05:52
- You have all the information you require with either a postfix stack or an expression tree in order to perform the calculation. An expression tree is more flexible, so if you think you might need to do something else with it afterwards, you should prefer that. Otherwise a postfix stack is sufficient. – Neil Nov 15 '17 at 07:31
- I'm guessing that the articles you read talked about the [shunting yard algorithm](https://en.wikipedia.org/wiki/Shunting-yard_algorithm) rather than a grammar-driven parser. The former produces postfix output that can easily be interpreted as a tree; the latter most naturally produces a tree structure that can then be interpreted as postfix (I thought a bit before adding the second clause, but I'll stick by it). – kdgregory Nov 15 '17 at 11:02
- Searching on Google, I found [this similar question](https://stackoverflow.com/questions/18144196/convert-algebraic-expression-directly-into-binary-tree-structure-sans-prefix) on Stack Overflow, so it seems it's not a misconception of mine. I think it's a waste of CPU cycles unless handling nodes of a tree is heavier for the CPU, in which case creating a tree is needless. But the goal is to create a tree due to its flexibility, as @Neil said. Thank you all for helping. – JulianSoto Nov 16 '17 at 01:16
1 Answer
> Is it worth parsing an infix algebraic expression to postfix and then an expression tree?
No, go directly to the expression tree. In all compilers I've checked out (Lua, Go, tinyCC), there is no step converting to postfix.
> I first need to convert the expression to postfix (or a similar notation) and then convert the postfixed expression to a tree. Why can't I simply convert the infix expression to a binary tree?
You can and it is often easier to create a tree directly.
> Is there any advantage in doing this? Is there any performance improvement? I think parsing an algebraic expression twice is a waste of CPU cycles. Am I wrong?
In the past, compilers were advertised as being one-pass to improve efficiency. The speed of modern processors no longer warrants it. I'd bet either technique can process megabytes of input per second, which is enough for most applications of parsers. The big-O complexity of both algorithms is the same, and both make use of a stack. The real performance difference can only be known by comparing two implementations, but I speculate it is minor.
For the language you provided, it is possible to write an acceptable parser in 100-200 lines of C (I put an example on Github). I'll give you a primer on each stage.
Lexing
The lexer takes a string as input and outputs a stream of tokens and their types. For example, for the input `ln((1+2)(3-4))`, the token types are:

    LN ( ( CONST ADD CONST ) ( CONST SUB CONST ) ) EOF
The lexer is also responsible for removing whitespace and comments from the input.
Typically to implement a lexer, you write three functions:
- `next()`: Returns the next token and its type. The token is removed from the input. Repeatedly calling this function reads the entire input until EOF is read.
- `peek()`: Same as `next()`, but does not remove the token from the input. Calling this function twice in a row gives the same output.
- `expect(token_type)`: Raises a parsing error if the next token is not of the given type.
See these slides for an example of how these functions are implemented in Golang's compiler. You might also want to look into using regular expressions to implement a lexer.
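To make these concrete, here is a minimal sketch of the three functions for this token set, under a few assumptions: tokens are plain `int`s, single characters like `'('` stand for themselves, the other token types reuse the `NodeType` enum from the AST section below (repeated here so the snippet is self-contained), and `END` marks the end of input (the `EOF` token above).

    #include <ctype.h>   /* isdigit, isspace */
    #include <stdio.h>   /* fprintf */
    #include <stdlib.h>  /* exit */

    enum NodeType { LN, ADD, SUB, MUL, CONST };  /* doubles as token types */
    enum { END = -1 };                           /* end-of-input (EOF) marker */

    /* Return the next token's type without consuming it. */
    int peek(const char **s) {
        while (isspace((unsigned char)**s)) (*s)++;  /* skip whitespace */
        switch (**s) {
        case '\0': return END;
        case '+':  return ADD;
        case '-':  return SUB;
        case 'l':  return LN;   /* assumes an 'l' always starts "ln" */
        default:
            if (isdigit((unsigned char)**s)) return CONST;
            return **s;         /* '(' and ')' stand for themselves */
        }
    }

    /* Return the next token's type and consume it from the input. */
    int next(const char **s) {
        int type = peek(s);
        if (type == LN)
            *s += 2;                                           /* skip "ln" */
        else if (type == CONST)
            while (isdigit((unsigned char)**s) || **s == '.')  /* skip number */
                (*s)++;
        else if (type != END)
            (*s)++;                                            /* single char */
        return type;
    }

    /* Raise a parsing error if the next token is not of the given type. */
    void expect(const char **s, int type) {
        if (next(s) != type) {
            fprintf(stderr, "parse error near \"%s\"\n", *s);
            exit(1);
        }
    }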
Grammar
I took some liberties when converting your input to a Context-Free Grammar:
    Expression = Terms { ( '+' | '-' ) Terms }
    Terms      = Primary { '(' Expression ')' }
    Primary    = 'ln' Primary
               | '(' Expression ')'
               | CONST
This grammar is written in Extended Backus-Naur Form (EBNF). The terminals are `'+'`, `'-'`, `'('`, `')'`, `'ln'` and `CONST`, where `CONST` is a real number. The non-terminals are `Expression`, `Terms` and `Primary`. The `{}` braces mean zero or more repetitions. The `|` pipe means either the left side or the right side.
For example, the `Expression` rule is one or more `Terms` separated by `+` or `-` tokens. Notice the grammar is recursive. For inspiration on creating a grammar, look at any programming language's specification. For example, section 6.5 of the ISO C standard lists the rules for C's expression syntax.
With this grammar, we can create the top-down derivation from a series of tokens. For example, `CONST + ln CONST`:
    Expression
    => Terms { ( '+' | '-' ) Terms }
    => Primary { '(' Expression ')' } { ( '+' | '-' ) Terms }
    => CONST { '(' Expression ')' } { ( '+' | '-' ) Terms }
    => CONST { ( '+' | '-' ) Terms }
    => CONST ( '+' | '-' ) Terms { ( '+' | '-' ) Terms }
    => CONST '+' Terms { ( '+' | '-' ) Terms }
    => CONST '+' Primary { '(' Expression ')' } { ( '+' | '-' ) Terms }
    => CONST '+' 'ln' Primary { '(' Expression ')' } { ( '+' | '-' ) Terms }
    => CONST '+' 'ln' CONST { '(' Expression ')' } { ( '+' | '-' ) Terms }
    => CONST '+' 'ln' CONST { ( '+' | '-' ) Terms }
    => CONST '+' 'ln' CONST
The first non-terminal (or decision) on each line is expanded into a rule. This is performed repeatedly until only terminals remain. Any input derived from `Expression` is part of the language.
I'll explain later in the parser section how to write a parser for this grammar.
AST
The Abstract Syntax Tree (AST) is a data structure in memory which is a simplification of the grammar and is easy to traverse. You described this as a "binary tree". Note that the `CONST` type has zero children (but stores a value), the `LN` type has one child, and all the other types have two children each.
    enum NodeType { LN, ADD, SUB, MUL, CONST };

    struct AstNode {
        enum NodeType type;
        double value;
        struct AstNode *children[2];
    };
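To see the shape in practice, here is how the tree for `ln(1+2)` could be built by hand, with a hypothetical `newNode` helper (the parser below constructs nodes directly instead):

    #include <stdlib.h>  /* malloc */

    /* Hypothetical helper: allocate a node and initialize its fields. */
    struct AstNode *newNode(enum NodeType type, double value) {
        struct AstNode *n = malloc(sizeof(*n));
        n->type = type;
        n->value = value;
        n->children[0] = n->children[1] = NULL;
        return n;
    }

    /* ln(1+2): CONST has no children, ADD has two, LN has one. */
    struct AstNode *buildExample(void) {
        struct AstNode *sum = newNode(ADD, 0);
        sum->children[0] = newNode(CONST, 1.0);
        sum->children[1] = newNode(CONST, 2.0);
        struct AstNode *root = newNode(LN, 0);
        root->children[0] = sum;
        return root;
    }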
Passes
The following compiler pass traverses the AST and computes the expression's value as a `double`:
    #include <math.h>    /* log */
    #include <stdlib.h>  /* abort */

    /* Evaluate the tree bottom-up with a recursive traversal. */
    double eval(struct AstNode *node) {
        struct AstNode **c = node->children;
        switch (node->type) {
        case LN:    return log(eval(c[0]));
        case ADD:   return eval(c[0]) + eval(c[1]);
        case SUB:   return eval(c[0]) - eval(c[1]);
        case MUL:   return eval(c[0]) * eval(c[1]);
        case CONST: return node->value;
        default:    abort();  /* unreachable if the parser is correct */
        }
    }
More passes can be implemented using the same AST. For example, passes might:

- `free` the memory used by the tree
- Perform optimizations such as constant folding
- Do satisfiability or symbolic analysis (think WolframAlpha)

Being able to write multiple passes for the same AST structure is a huge advantage over postfix evaluation. Modern compilers like LLVM have dozens of passes!
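As a taste of such a pass, here is a sketch of constant folding over the same `AstNode` structure (the name `fold` and the in-place rewriting are just one way to do it). In this tiny language every subtree is constant, so folding collapses the whole tree; the same pattern becomes genuinely useful once variables are added:

    #include <math.h>    /* log */
    #include <stdlib.h>  /* free */

    /* If all of a node's children are constants, compute the result now
     * and turn the node into a CONST leaf. */
    void fold(struct AstNode *node) {
        struct AstNode **c = node->children;
        if (node->type == CONST) return;       /* leaf: nothing to fold */
        fold(c[0]);                            /* fold children first */
        if (node->type != LN) fold(c[1]);      /* LN has only one child */
        if (c[0]->type != CONST) return;
        if (node->type != LN && c[1]->type != CONST) return;
        switch (node->type) {
        case LN:  node->value = log(c[0]->value); break;
        case ADD: node->value = c[0]->value + c[1]->value; break;
        case SUB: node->value = c[0]->value - c[1]->value; break;
        case MUL: node->value = c[0]->value * c[1]->value; break;
        default:  return;
        }
        free(c[0]);                            /* children are CONST leaves */
        if (node->type != LN) free(c[1]);
        node->type = CONST;
    }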
Predictive Parser
For non-obvious reasons, the above grammar is LL(1): left-to-right, leftmost derivation, 1 token of lookahead. This means you always know which rule to expand by looking at only the next token (the `peek()` function described earlier). Since the grammar is LL(1), a predictive parser can be written.
Generally, an EBNF grammar is converted to code with the following steps:

- Rules become functions
- `{}` becomes `for` or `while` loops
- `|` becomes `if` or `switch` statements
- Terminals become calls to `next()` or `expect()`
- Be careful about associativity: `a-b-c != a-(b-c)` (see the sketch below)
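For instance, here is a sketch of how the `Expression` rule could translate (`parseTerms`, handling the `Terms` rule, follows the same pattern and is not shown). The `while` loop implements `{}`, and wrapping everything parsed so far as the left child keeps `-` left-associative:

    #include <stdlib.h>  /* malloc */

    struct AstNode *parseTerms(const char **s);  /* Terms rule, analogous */

    /* Expression = Terms { ( '+' | '-' ) Terms } */
    struct AstNode *parseExpression(const char **s) {
        struct AstNode *n = parseTerms(s);
        while (peek(s) == ADD || peek(s) == SUB) {  /* the {} loop */
            struct AstNode *op = malloc(sizeof(*op));
            op->type = next(s);           /* ADD or SUB */
            op->children[0] = n;          /* everything parsed so far */
            op->children[1] = parseTerms(s);
            n = op;                       /* a-b-c builds (a-b)-c */
        }
        return n;
    }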
The following function parses the `Primary` rule. Notice it allocates a new AST node and fills it in based on a lookahead of one token:
    #include <stdlib.h>  /* malloc, free, atof */

    struct AstNode *parsePrimary(const char **s) {
        struct AstNode *n = malloc(sizeof(*n));
        switch (peek(s)) {
        case LN:                      /* Primary = 'ln' Primary */
            n->type = next(s);
            n->children[0] = parsePrimary(s);
            break;
        case '(':                     /* Primary = '(' Expression ')' */
            free(n);                  /* parseExpression allocates its own node */
            expect(s, '(');
            n = parseExpression(s);
            expect(s, ')');
            break;
        case CONST:                   /* Primary = CONST */
            n->value = atof(*s);      /* read the value before consuming it */
            n->type = next(s);
            break;
        default:
            expect(s, CONST);         /* raises an error and exits */
        }
        return n;
    }
Putting it all together
There needs to be a function to tie together all the stages:
    #include <stdio.h>  /* printf, fprintf */
    int main(int argc, const char **argv) {
        if (argc < 2) {  /* guard against a missing argument */
            fprintf(stderr, "usage: %s EXPRESSION\n", argv[0]);
            return 1;
        }
        const char **s = &argv[1];
        struct AstNode *ast = parseExpression(s);
        expect(s, END);  /* the whole input must be consumed */
        printf("%lf\n", eval(ast));
        return 0;
    }
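Assuming the snippets above are pasted into a single file (say `calc.c`, a name I just made up), something like `cc -o calc calc.c -lm` should build it on a Unix-like system; the `-lm` flag links the math library that provides `log`. Running `./calc '1+2'` would then print `3.000000`.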
This is a lot of information, so I would recommend reading further and possibly watching some YouTube videos on the topic. I posted the source code on Github (it is rather hastily written, like all software, but should work).
