Parsing infix notation

A common problem in computer science is that of parsing expressions, that is, reading and interpreting (obtaining the semantic meaning of) strings of a given formal language. This is a key part of the process of compiling or interpreting a programming language, or other languages such as those describing mathematical expressions or regular expressions. This article describes an efficient method of parsing *infix notation*, the syntax in which mathematical expressions are conventionally written. While mathematical expressions are used as examples, the principles extend to any language using infix operators.

Infix notation is the name given to a syntax where binary operators are placed between their two operands. In mathematical expressions, the operands may be numbers, or they may be expressions themselves; in the infix expression:

3 + 7 ÷ (4 × 5 − 6)

the operator ÷ is applied to (and lies between) the operands 7 and (4 × 5 − 6).

To parse the expression we require also that the operator precedences be defined. This is because infix notation allows an operand surrounded by two operators — such as the 5 in the example above — to be an operand of either neighbouring operator. Here, the expression could be evaluated as:

(4 × 5) − 6, where × has a higher precidence than −, or as

4 × (5 − 6), where − has a higher precidence than ×.

It's possible to make the order of evaluation unambiguous using brackets alone, but this can be overly verbose.Direct evaluation

Infix expressions can be evaluated by scanning through the expression for the highest precidence operator remaining in it. The operands adjacent to this operator must be associated with it, since there is no higher precidence operator in the expression to 'steal' one of the operands. This operator and pair of operands are then replaced in the expression with the result of the operator applied to the operands, and the procedure is repeated until all operators in the expression have been evaluated. This algorithm is complicated by the presence of brackets, which are outside the framework of binary operators and operands, and are often the highest precidence operator. Brackets can be dealt with recursively by evaluating the sub-expression within the brackets as an expression in its own right, and substituting the evaluated value back into the original expression in place of the bracketed sub-expression. As an example, in the expression above:

3 + 7 ÷ (4 × 5 − 6)

the brackets are the highest precidence operator. The bracketed term:
4 × 5 − 6

is therefore evaluated first as a new expression. The highest precidence operator in this expression is ×, so this and its operands are evaluated to give:
20 − 6.

The highest precidence (and only) operator in this expression is − , whose evaluation to:
14

completes the evaluation of the bracketed term. This value is substituted into the original expression to give:
3 + 7 ÷ 14

The ÷, as the remaining operator of highest precidence, evaluates to give:
3 + ½

and the final answer: 3½.While this algorithm works, it is a slow process to search through the expression for the highest precidence operator at every step. When an expression may be evaluated many times (for example if it defines a user-inputted function to be plotted), another approach may be needed.

Postfix notation

It's possible to rewrite expressions in infix notation in other ways, such as in *prefix notation*, and in *postfix notation* (also known as *Reverse-Polish notation* (RPN)). In postfix notation, a binary operator is placed after both its operands: the expression 3 + 7 in infix notation is written 3 7 + in postfix notation. A compound expression such as (4 × 5) − 6 (in infix) is be written as 4 5 × 6 − in postfix. The expression is read from left-to-right, replacing an operator (when one occurs) with the result of that operator applied to the two previous operands: the first operator in

4 5 × 6 −

is the × operator, which is evaluated with the two previous operands, 4 and 5, to give:
20 6 −

The next operator, −, is then evaluated on 20 and 6 to give the answer: 14.Postfix notation, while less common than infix notation in written mathematics, is much more explicit in sense that the operands belonging to a particular operator can be expressed without the need for brackets or operator precidence. Rebracketting the infix expression to 4 × (5 − 6) does not introduce brackets into the postfix form, but instead causes a change in token ordering, to 4 5 6 − ×.

This explicitness means that the algorithm for evaluating postfix notation can be made much faster than that for evaluating infix notation. A fast implemention of the postfix evaluation algorithm uses a stack: as operands are read from left-to-right, they are pushed onto the stack. When an operator is reached, its two operands are popped from the stack, the result of the operation is calculated, and the result is pushed back onto the stack. For a correctly formed postfix expression, processing all the tokens in the expression in this way will result in one number — the evaluated expression — being left on the stack.

The algorithm for evaluating an expression in postfix notation can also be used to generate an abstract syntax tree for the expression. In this case, the stack is a stack of trees, on to which single-node trees (operands) are pushed. When an operator is reached in the parsing of the postfix expression, it becomes the root of a new tree, with the two trees popped off the stack (the operands for this operator) becoming the two main 'branches'. This new tree is then pushed onto the stack as before.

The shunting-yard algorithm

Ascertaining that expressions in postfix notation can be evaluated much more quickly than those in infix notation suggests that a fast way to evaluate an infix expression might be to convert it first from infix into postfix notation, then evaluate the postfix version. Dijkstra's shunting-yard algorithm is such an algorithm for converting infix expressions to postfix expressions. The algorithm reads tokens and outputs tokens from left-to-right: the next character to be read/written immidiately follows the previous one. It is is based around a stack, which is used to store brackets, and operators which are to occur later in the output. Note that only the operators and brackets are stored on a stack for later use: the operands are output in postfix notation in exactly the same order as they occur in the infix expression. The most basic form of the shunting-yard algorithm is as follows:

Click to display

For each token in turn in the input infix expression:

- If the token is an operand, append it to the postfix output.
- If the token is an operator A then:
- While there is an operator B of higher or equal precidence than A at the top of the stack, pop B off the stack and append it to the output.
- Push A onto the stack.

- If the token is an opening bracket, then push it onto the stack.
- If the token is a closing bracket:
- Pop operators off the stack and append them to the output, until the operator at the top of the stack is a opening bracket.
- Pop the opening bracket off the stack.

- While there are still operator tokens in the stack:
- Pop the operator on the top of the stack, and append it to the output.

A practical infix-to-postfix algorithm for mathematical expressions

The algorithm as presented above deals only with brackets and left-associative binary infix operators. While mathematical expressions mainly use this notation, non-binary operators are common in maths, and have varying conventions. Factorial (n!) is a unary postfix operator (that is, it occurs after a single argument), while arithmetic negation (−, in the context of "−5" rather than "5 − 4") is a unary prefix operator. Functions (with arbitrarily many arguments) are also conventially written in prefix operators, with further brackets, commas and other symbols used to distinguish their arguments. Furthermore, while most common arithmetic operations are left-associative (that is, a − b − c = (a − b) − c), the binary operation of exponentiation is conventionally right-associative a ^ b ^ c = a ^ (b ^ c)). These conventions of syntax can be incorperated into the shunting-yard algorithm, at the expense of reduced clarity.

Click to display

For each token in turn in the input infix expression:

- If the token is an operand, append it to the postfix output.
- If the token is a unary postfix operator, append it to the postfix output.
- If the token is a unary prefix operator, push it on to the stack.
- If the token is a function token, push it on to the stack.
- If the token is a function argument separator
- Pop the top element off the stack and append it to the output, until the top element of the stack is an opening bracket

- If the token is a binary operator A then:
- If A is left-associative, while there is an operator B of higher or equal precidence than A at the top of the stack, pop B off the stack and append it to the output.
- If A is right-associative, while there is an operator B of higher precidence than A at the top of the stack, pop B off the stack and append it to the output.
- Push A onto the stack.

- If the token is an opening bracket, then push it onto the stack.
- If the token is a closing bracket:
- Pop operators off the stack and append them to the output, until the operator at the top of the stack is a opening bracket.
- Pop the opening bracket off the stack.
- If the token at the top of the stack is a function token, pop it and append it to the output.

- While there are still operator tokens in the stack:
- Pop the operator on the top of the stack, and append it to the output.

The following Java applet evaluates a mathematical expression in infix notation by converting it first to postfix notation using the shunting-yard algorithm, then evaluating the expression in postfix notation. Each step of the shunting-yard algorithm and the postfix evaluation algorithm are shown. The supported operators are +, −, × ÷, ^, parentheses and arithmetic negation, with conventional operator precidence and associativity.

The source code for the applet is available for download, under the GNU General Public License (the implementation of the shunting yard algorithm is in Parser.java).