Tutorial 1: Write a grammar, compile and use the parser

This tutorial explains through a simple example how to use the parser generator. It assumes you are moderately familiar with regular expressions, and context-free grammars. The presentation goes as follow:
  • Write a grammar
  • Generate the lexer and parser code
  • Use the generated code in your own project
This tutorial will use the example of a small language for simple mathematical expressions. Using this language, one would be able to use the standards +, -, * and / operators, as well as ( ).

Writing a grammar

The Hime Parser Generator provides its own language for expressing context-free grammars. It is largely similar to the standard BNF form with some enhancements for the sake of expressivity. In the example below, the grammar is shown to have three sections:
cf grammar MathExp
{
	options { }
	terminals { }
	rules { }
} 
  • The options section lets you specify options for the lexer and parser generator.
  • The terminals section is used to describe the regular expressions matching the terminals in the grammar. This section is optional.
  • The rules section is used for stating the context-free rules of the grammar.

In our example, we want to use numbers so we have to define regular expressions matching these:
terminals {
	INTEGER	-> [1-9] [0-9]* | '0' ;
	REAL		-> INTEGER? '.' INTEGER  (('e' | 'E') ('+' | '-')? INTEGER)?
				|  INTEGER ('e' | 'E') ('+' | '-')? INTEGER ;
	NUMBER	-> INTEGER | REAL ;
} 

Here, we define three terminals:
  • INTEGER, which is defined using a very simple regular expression
  • REAL, which matches decimal numbers (23.15) and floats using a power expression (12e-15)
  • NUMBER, which is the union of the two former
Because NUMBER will match any INTEGER and REAL, we have to decide which of these is to be recognized by the lexer. The simple rule here is that the latest defined terminals have higher priority. Hence, the generated lexer will only be able to recognize NUMBER and will never yield INTEGER or REAL.
Also, we want to define a regular expression matching blank spaces that will be used as separators:
WHITE_SPACE	-> 0x0020 | 0x0009 | 0x000B | 0x000C ;
SEPARATOR	-> WHITE_SPACE+;

Here, the WHITE_SPACE expression will match a single space, or tab, or vertical tab. The SEPARATOR expression will match any string of at least one WHITE_SPACE and has a higher priority that WHITE_SPACE.

Now, we can use these terminals to write the context-free grammar rules:
rules {
	exp_atom-> NUMBER
		| '('exp ')' ;

	exp_op0	-> exp_atom
		|  exp_op0 '*' exp_atom
		|  exp_op0 '/' exp_atom ;
		
	exp_op1	-> exp_op0
		|  exp_op1 '+' exp_op0
		|  exp_op1 '-' exp_op0 ;

	exp	-> exp_op1 ;
} 

Here, we have 4 grammar rules, using 4 grammar variables and the NUMBER terminal. Note that it is possible to add inline terminals in the grammar rules (e.g.: '+'). These terminals will have higher priority over those defined in the terminals section, although their definition is limited to simple text (no regular expressions here).
Context-free grammar rules are to be written as variable -> definition ;. You can also use the the *, +, and ? operators as in regular expressions. The parser generator will automatically generates the additional corresponding rules. Use ( ) for grouping terms in rules and | for alternatives.

Finally, we setup the grammar options as follow:
options
{
	Axiom = "exp";
	Separator = "SEPARATOR";
} 
  • The Axiom option specifies which grammar variable is to be used as the root symbol for the grammar.
  • The Separator option specifies which terminal is to be used as token separator in the lexer. Text matching the expression of the separator will automatically be discarded by the lexer.
You can find the complete grammar here.

Generating the lexer and parser

The lexer and parser corresponding to the grammar are easily generated with the command line tool. In this simple case, no additional option is required. Save the grammar in a file named MathExp.gram and execute the following command line:
himecc MathExp.gram
The tool will generate 4 files:
  • MathExpLexer.cs, the source file for the lexer
  • MathExpLexer.bin, the binary representation of the lexer’s automaton
  • MathExpParser.cs, the source file for the parser
  • MathExpParser.bin, the binary representation of the parser’s automaton

Use in your project

To use the generated lexer and parser, include the two .cs files in your project, as well as the two .bin files as embedded resources. You will also need to add the Hime.Redist.dll assembly as a reference in your project. It contains the redistributable implementation of the parsing algorithms. The lexer and parser are used as follow:
MathExpLexer lexer = new MathExpLexer("2 + 3");
MathExpParser parser = new MathExpParser(lexer);
ASTNode root = parser.Parse() ;

The Abstract Syntax Tree (AST) produced by the parser is returned by the Parse() function in the form of a reference to the AST's root. The ASTNode class contains two important properties:
  • Symbol contains the grammar's symbol attached to the node, a variable or a token, i.e. a piece of text matched by the lexer corresponding to one of the grammar's terminals.
  • Children is the list of the current node's children.

In the example above ("2 + 3") the produced AST is as follow:
tree_2+3.png

The lexers and parser created in this manner are bound to parse the piece of text given to the lexer. In order to parse another piece of text, simply create a new lexer and parser in the same way. The two objects are lightweight so you should not have to worry about creating multiple ones.
In this example, the lexer is given the full string that has to be parsed. Another constructor allows you to give a stream of text in the form of a TextReader.

Last edited May 14, 2013 at 11:49 AM by lwouters, version 17

Comments

No comments yet.