Sunday, April 20, 2008

Using ANTLR to create an Excel Like Formula Parser in Flex

Introduction
One of my first Flex programming projects was to write an Excel-like expression parser. This parser was written using the “shunting yard” algorithm by Edsger Dijkstra to transform infix notation to a reverse-polish notation (postfix) stack which is easily processed. That code isn't publicly available but the algorithms are described in a paper I wrote here.

I decided to re-write the expression evaluator using ANTLR. ANTLR is a parser generator that takes BNF-like notation and generates code for a target language. The default target language is Java, but there are many other target languages including an ActionScript target in the upcoming 3.1 version.

Flex Formula Evaluator
The actual Flex application is available here with source available. This is an example of a formula that can be evaluatedl feel free to copy and paste into the evaluator below (variables are noted in "[variable_name]"):
(1+2)*sum(9-8,5,17.5,[numarray])/if(or([string]="hello",1>5),2,99)

How Does it Work?

The formula parser begins with the ANTLR grammar found in Formula.g and FormulaTree.g (in src com.arcanearcade.antlr). ANTLR has 3 types of grammar rules:
  • Lexer - Used to transform the character stream into a series of tokens. Lexer rule names are written in all capital letters. For example, the following code describes how to recognize variables which are any characters except [ or ] found between [ and ].
  • VARIABLE : '[' ~('[' | ']')+ ']';
  • Parser - Walks though the tokens to form sentences in the grammar. Parser rules begin with lower case letters. The following example shows how to recognize a percent token, which is a number token followed by the percent sign (both described in the parser grammar). The '^' in the grammar is a special token that shows what the head of the created tree grammar looks like. So "50%" is transformed into a tree with an s-expression notation of "(% 50)" and will be evaluated to the value "0.5" by the tree grammar.
  • percent: NUMBER PERCENT^;
  • Tree - This optional grammer taks the tree that can be created by parser grammar and with the help of native code snippets interprets the tree to produce output. While you can use code snippets directly in the parser, using a tree grammer allows you to reuse the lexer and parser for other languages and write a (usually simpler) tree parser for each target language. The following tree grammar shows how to evaluate addition where "6 + 7" was turned into "(+ 6 7)" by the parser and is evaluated to the number "13" by the ActionScript code between the {}
  • operation returns [Object value]
    : ^(ADD a=expression b=expression)
    { $value = Number($a.value) + Number($b.value); }
The ANTLR tools parse the grammar files to produce native code as shown in the diagram below:
Code in FlexFormula.mxml evaluateFormula() takes a string, feeds it into the lexer, takes the output of the lexer feeds it into the parser, and then finally feeds that into the tree to evaluate the result. FormulaTree#addFunction() adds new spreadsheet functions like "sum" and "if" and FormulaTree#lookupVariable sets the function callback to retrieve variables.

What Next?
The expression parser isn't as friendly as it could be (it is just a demo). To improve the formula engine to the point it could be used in a spreadsheet the FormulaLexer, FormulaParser, and FormulaTree in a class that would:
  • Add all the common function.
  • Setup binding to watch for changes to the formula string and only call the lexer and parser when the formula changes.
  • Setup a public and bindable variable for the result value.
  • Add binding for all variables in an expression so when they change the tree can be re-evaluated.
  • Improve error handling

3 comments:

  1. This is awesome! great work.
    I have a question for you. I have a sample grammar, for example:
    "expression: Number,Operand,Number \r"
    + "Number : '-'? [0-9]+\r"
    + "Operand : MUL | DIV | ADD | SUB\r"
    + "MUL : '*'\r"
    + "DIV : '/'\r"
    + "ADD : '+'\r"
    + "SUB : '-'\r"

    I want to parse this make a table driver parser that I can use during runtime. Can you please comment on this ? I just need a direction to start in.

    ReplyDelete
  2. I'd recommend the following tutorial on ANTLR as it has a simple calculator grammar for several languages:
    http://www.antlr.org/wiki/display/ANTLR3/Five+minute+introduction+to+ANTLR+3

    Once you understand that tutorial, the grammar included in my excel expression parser would take you the rest of the way.

    ReplyDelete
  3. Hi, thanks for your great work ... your post certainly made my current task a lot easier. I have one thing to add though. I'd suggest appending an "EOF" to the "formula" rule in Formula.g

    formula
    : (EQ!)? expression EOF
    ;

    In my current case a user used "OR" instead of "||" and the evaluation of "false OR true" was evaluated to false without errors, because ANTLR avaluated the first term correctly and treated the rest as "preceding rubbish that doesn't make the term invalid".

    I also added code to override the default recovery functions of antlr to make it throw errors instead of trying to recover by adding the following to Formula.g:

    @parser::members {
    override public function recoverFromMismatchedToken(input:org.antlr.runtime.IntStream,ttype:int,follow:org.antlr.runtime.BitSet):Object {
    throw new MismatchedTokenException(ttype, input);
    }

    override public function recoverFromMismatchedSet(input:org.antlr.runtime.IntStream,e:org.antlr.runtime.RecognitionException,follow:org.antlr.runtime.BitSet):Object {
    throw e;
    }
    }

    @rulecatch {
    catch (re:RecognitionException) {
    throw re;
    }
    }

    @lexer::members {
    public override function reportError(e:org.antlr.runtime.RecognitionException):void {
    throw new Error(e);
    }
    }

    Keep up the good work,
    Chris

    ReplyDelete