Input Language Reference

This page contains the reference documentation on the input language for the Hime Parser Generator.

Structure

The Hime Parser Generator provides its own language for expressing context-free grammars. It is largely similar to the standard BNF form, with some enhancements for the sake of expressivity. As shown in the example below, a grammar has three sections:
cf grammar MathExp
{
	options { }
	terminals { }
	rules { }
} 
  • The options section lets you specify options for the lexer and parser generator.
  • The terminals section is used to describe the regular expressions matching the terminals in the grammar. This section is optional.
  • The rules section is used for stating the context-free rules of the grammar.

Options Section

The options section lets you specify special information for the lexer and parser generator. For the time being, it recognizes two special arguments, written down as follows:
options
{
	Axiom = "exp";
	Separator = "SEPARATOR";
} 
  • The Axiom option specifies which grammar variable is to be used as the root symbol for the grammar.
  • The Separator option specifies which terminal is to be used as token separator in the lexer. Text matching the expression of the separator will automatically be discarded by the lexer.

Terminals Section

This section is used to specify the rules that will be used by the lexer to match terminals. This section is optional and can be removed entirely if only very simple terminals are used; these can then be specified inline in the rules section.
The lexing rules in the terminals section are expressed as regular expressions, whose constructs are detailed below. A lexical rule is written as:
NAME -> regexp ;
For example, the following rule defines a terminal symbol named IDENTIFIER that will match any word composed of at least one lowercase Latin letter:
IDENTIFIER -> [a-z]+ ;
The meaning of a rule is that the terminal IDENTIFIER is to be matched by the generated lexer using the given regular expression in the rule's body. The order of appearance of the rules in the terminals section is important for two reasons:
  • First, a regular expression cannot reference itself or part of itself. That is to say, in the previous example you cannot reference IDENTIFIER in the body of IDENTIFIER's rule. In a rule's body you can only refer to previously defined terminals.
  • Second, different rules may overlap. For instance, the rule ANIMAL -> 'cat' | 'dog' ; is more specific than the IDENTIFIER rule: any word matched by ANIMAL is also matched by IDENTIFIER. The lexer generator then has to decide which rule is to be preferred in the overlapping cases. This decision is taken using a notion of priority in lexical rules: the later a rule is defined, the more priority it has. In our example, if the ANIMAL rule is defined after IDENTIFIER, then the input "cat" will be matched by ANIMAL, as illustrated below. On the contrary, if IDENTIFIER is defined after ANIMAL, then ANIMAL can never be matched because all its possible matches are overridden by IDENTIFIER.
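For illustration, the overlap described above can be written as the following terminals section, where ANIMAL takes priority because it is defined last:
terminals
{
	IDENTIFIER	-> [a-z]+ ;
	ANIMAL		-> 'cat' | 'dog' ;
}
With these definitions, the input "cat" is matched as an ANIMAL token, whereas a word such as "cow" is matched as an IDENTIFIER token.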
Regular expressions in the body of lexical rules conform to the standard semantics of POSIX regular expressions. In addition, some extensions are provided for the sake of expressivity. In regular expressions, you may use the following constructs:

Atoms

Simple Text

In regular expressions, plain text must be enclosed in single quotes. The 'abc' expression will only match the "abc" string. The single quote character itself can also be used by escaping it: the 'a\'bc' expression will match the "a'bc" string. Blank spaces can also be used and must be matched: the 'a bc' expression will only match the "a bc" string. The possible escape sequences are summarized hereafter:
Escape sequence Meaning
'\\' Backslash
'\'' Single quote
'\0' Unicode character 0
'\a' Alert (character 7)
'\b' Backspace (character 8)
'\f' Form feed (character 12)
'\n' New line (character 10)
'\r' Carriage return (character 13)
'\t' Horizontal tab (character 9)
'\v' Vertical tab (character 11)
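As a minimal sketch, escape sequences can be combined with plain text and the union operator described further below (the terminal name WHITE_SPACE is only illustrative):
WHITE_SPACE -> ' ' | '\t' | '\r' | '\n' ;
This rule matches a single blank space, horizontal tab, carriage return or new line character, and such a terminal is a typical candidate for the Separator option.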

Individual Characters

Individual characters can also be represented using their Unicode code point written in hexadecimal, prefixed by 0x. For example, 0xAA and 0x1234 are both valid Unicode code points. Note that only code points in plane 0 (the Basic Multilingual Plane), i.e. code points between 0x0000 and 0xFFFF inclusive, are supported.

Character Ranges

Ranges of characters can be specified using the 0x1234 .. 0x5678 construct. This expression will match any character whose Unicode code point is between 0x1234 and 0x5678 (inclusive). Once again, only code points in Unicode plane 0 are supported.

Any Character

The dot (.) can be used to match any character. It is equivalent to 0x0000 .. 0xFFFF. It should not be enclosed within single quotes. The . expression will match any character, whereas the '.' expression will match only a dot (Unicode code point 0x002E). Note that the dot will also match line ending characters like New Line (0x0A) and Carriage Return (0x0D).
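As a sketch combining these constructs (the terminal names are only illustrative):
TAB      -> 0x0009 ;
ASCII    -> 0x0000 .. 0x007F ;
ANY_CHAR -> . ;
TAB matches the horizontal tab character, ASCII matches any character of the Basic Latin range, and ANY_CHAR matches any single character, including line endings.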

Character Classes

Character classes are to be expressed using the usual [] construction. The [abc] character class will match the a, b, and c characters. [a-z] will match any lowercase Latin letter. Negative classes can be expressed using [^]. The [^a-z] expression will match any character that is not a lowercase Latin letter.
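As a short sketch (the terminal names are illustrative):
HEX_DIGIT  -> [0-9a-fA-F] ;
NOT_LETTER -> [^a-zA-Z] ;
HEX_DIGIT matches a single hexadecimal digit, and NOT_LETTER matches any single character that is not a Latin letter.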

Unicode Character Blocks

Unicode blocks can be matched using the \ub{NAME} construct. For example, the \ub{IsGreek} expression will only match characters in the 0x0370 .. 0x03FF range (Greek characters). The supported Unicode character blocks are:
Block Name Start End
IsBasicLatin 0x0000 0x007F
IsLatin-1Supplement 0x0080 0x00FF
IsLatinExtended-A 0x0100 0x017F
IsLatinExtended-B 0x0180 0x024F
IsIPAExtensions 0x0250 0x02AF
IsSpacingModifierLetters 0x02B0 0x02FF
IsCombiningDiacriticalMarks 0x0300 0x036F
IsGreek 0x0370 0x03FF
IsCyrillic 0x0400 0x04FF
IsCyrillicSupplement 0x0500 0x052F
IsArmenian 0x0530 0x058F
IsHebrew 0x0590 0x05FF
IsArabic 0x0600 0x06FF
IsSyriac 0x0700 0x074F
IsThaana 0x0780 0x07BF
IsDevanagari 0x0900 0x097F
IsBengali 0x0980 0x09FF
IsGurmukhi 0x0A00 0x0A7F
IsGujarati 0x0A80 0x0AFF
IsOriya 0x0B00 0x0B7F
IsTamil 0x0B80 0x0BFF
IsTelugu 0x0C00 0x0C7F
IsKannada 0x0C80 0x0CFF
IsMalayalam 0x0D00 0x0D7F
IsSinhala 0x0D80 0x0DFF
IsThai 0x0E00 0x0E7F
IsLao 0x0E80 0x0EFF
IsTibetan 0x0F00 0x0FFF
IsMyanmar 0x1000 0x109F
IsGeorgian 0x10A0 0x10FF
IsHangulJamo 0x1100 0x11FF
IsEthiopic 0x1200 0x137F
IsCherokee 0x13A0 0x13FF
IsUnifiedCanadianAboriginalSyllabics 0x1400 0x167F
IsOgham 0x1680 0x169F
IsRunic 0x16A0 0x16FF
IsTagalog 0x1700 0x171F
IsHanunoo 0x1720 0x173F
IsBuhid 0x1740 0x175F
IsTagbanwa 0x1760 0x177F
IsKhmer 0x1780 0x17FF
IsMongolian 0x1800 0x18AF
IsLimbu 0x1900 0x194F
IsTaiLe 0x1950 0x197F
IsKhmerSymbols 0x19E0 0x19FF
IsPhoneticExtensions 0x1D00 0x1D7F
IsLatinExtendedAdditional 0x1E00 0x1EFF
IsGreekExtended 0x1F00 0x1FFF
IsGeneralPunctuation 0x2000 0x206F
IsSuperscriptsandSubscripts 0x2070 0x209F
IsCurrencySymbols 0x20A0 0x20CF
IsCombiningDiacriticalMarksforSymbols 0x20D0 0x20FF
IsLetterlikeSymbols 0x2100 0x214F
IsNumberForms 0x2150 0x218F
IsArrows 0x2190 0x21FF
IsMathematicalOperators 0x2200 0x22FF
IsMiscellaneousTechnical 0x2300 0x23FF
IsControlPictures 0x2400 0x243F
IsOpticalCharacterRecognition 0x2440 0x245F
IsEnclosedAlphanumerics 0x2460 0x24FF
IsBoxDrawing 0x2500 0x257F
IsBlockElements 0x2580 0x259F
IsGeometricShapes 0x25A0 0x25FF
IsMiscellaneousSymbols 0x2600 0x26FF
IsDingbats 0x2700 0x27BF
IsMiscellaneousMathematicalSymbols-A 0x27C0 0x27EF
IsSupplementalArrows-A 0x27F0 0x27FF
IsBraillePatterns 0x2800 0x28FF
IsSupplementalArrows-B 0x2900 0x297F
IsMiscellaneousMathematicalSymbols-B 0x2980 0x29FF
IsSupplementalMathematicalOperators 0x2A00 0x2AFF
IsMiscellaneousSymbolsandArrows 0x2B00 0x2BFF
IsCJKRadicalsSupplement 0x2E80 0x2EFF
IsKangxiRadicals 0x2F00 0x2FDF
IsIdeographicDescriptionCharacters 0x2FF0 0x2FFF
IsCJKSymbolsandPunctuation 0x3000 0x303F
IsHiragana 0x3040 0x309F
IsKatakana 0x30A0 0x30FF
IsBopomofo 0x3100 0x312F
IsHangulCompatibilityJamo 0x3130 0x318F
IsKanbun 0x3190 0x319F
IsBopomofoExtended 0x31A0 0x31BF
IsKatakanaPhoneticExtensions 0x31F0 0x31FF
IsEnclosedCJKLettersandMonths 0x3200 0x32FF
IsCJKCompatibility 0x3300 0x33FF
IsCJKUnifiedIdeographsExtensionA 0x3400 0x4DBF
IsYijingHexagramSymbols 0x4DC0 0x4DFF
IsCJKUnifiedIdeographs 0x4E00 0x9FFF
IsYiSyllables 0xA000 0xA48F
IsYiRadicals 0xA490 0xA4CF
IsHangulSyllables 0xAC00 0xD7AF
IsHighSurrogates 0xD800 0xDB7F
IsHighPrivateUseSurrogates 0xDB80 0xDBFF
IsLowSurrogates 0xDC00 0xDFFF
IsPrivateUse 0xE000 0xF8FF
IsCJKCompatibilityIdeographs 0xF900 0xFAFF
IsAlphabeticPresentationForms 0xFB00 0xFB4F
IsArabicPresentationForms-A 0xFB50 0xFDFF
IsVariationSelectors 0xFE00 0xFE0F
IsCombiningHalfMarks 0xFE20 0xFE2F
IsCJKCompatibilityForms 0xFE30 0xFE4F
IsSmallFormVariants 0xFE50 0xFE6F
IsArabicPresentationForms-B 0xFE70 0xFEFF
IsHalfwidthandFullwidthForms 0xFF00 0xFFEF
IsSpecials 0xFFF0 0xFFFF

Unicode Categories

Unicode categories can be matched using the \uc{NAME} construct. For example, the \uc{Lu} expression will only match uppercase letters. The supported Unicode character categories are:
Code Meaning Comments
L Letter
Lu Letter, Uppercase
Ll Letter, Lowercase
Lt Letter, Titlecase
Lm Letter, Modifier
Lo Letter, Other
M Mark
Mn Mark, Non-spacing
Mc Mark, Spacing combining
Me Mark, Enclosing
N Number
Nd Number, Decimal Digit
Nl Number, Letter Includes Roman numerals
No Number, Other
P Punctuation
Pc Punctuation, Connector Includes the underscore
Pd Punctuation, Dash Includes hyphen characters
Ps Punctuation, Open Opening brackets
Pe Punctuation, Close Closing brackets
Pi Punctuation, Initial quote Opening quotation mark
Pf Punctuation, Final quote Closing quotation mark
Po Punctuation, Other
S Symbol
Sm Symbol, Math
Sc Symbol, Currency
Sk Symbol, Modifier
So Symbol, Other
Z Separator
Zs Separator, Space Includes the ASCII spaces
Zl Separator, Line Only the 0x2028 LINE SEPARATOR
Zp Separator, Paragraph Only the 0x2029 PARAGRAPH SEPARATOR
C Other
Cc Other, Control
Cf Other, Format
Cs Other, Surrogate
Co Other, Private Use
Cn Other, Not assigned

Nested Regular Expressions

Nested regular expressions are expressed by referring to the names of previously defined terminals. The 'a' NAME 'b' expression will match any string matched by the regular expression corresponding to NAME, with an 'a' before it and a 'b' after it.
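As a sketch (DIGIT and INTEGER are illustrative names):
DIGIT   -> [0-9] ;
INTEGER -> DIGIT+ ;
INTEGER reuses the previously defined DIGIT terminal; as noted above, only previously defined terminals may be referenced. The + operator is described in the next section.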

Regular Expression Operators

  • Use parentheses () to group terms of the regular expression.
  • Zero or more times: *
  • One or more times: +
  • Optional: ?
  • Cardinality: 'a' {4} will only match "aaaa".
  • Cardinality range: 'a' {2, 4} will match the following strings: "aa", "aaa" and "aaaa".
  • Concatenation: 'a' 'b' will only match the "ab" string.
  • Union: 'a' | 'b' will match only two strings: "a" and "b".
  • Difference: exp1 - exp2 will match the strings matched by exp1 that are not matched by exp2.
The operators' priorities are defined as follows, from the highest to the lowest (an example combining several operators is given after the table):
Operators
( )
*, +, ?, {x}, {x, y}
Concatenation
Difference
Union
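As an illustrative sketch combining several of these operators (the terminal names are only examples):
INTEGER -> '0' | [1-9] [0-9]* ;
WORD    -> \uc{L}+ ;
NOT_AB  -> . - ('a' | 'b') ;
INTEGER matches "0" or any decimal number without a leading zero, WORD matches one or more Unicode letters, and NOT_AB matches any single character except "a" and "b".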


Rules Section

This section presents the constructions that can be used when writing syntactic rules. Syntactic rules are used by the parser to match the sequence of tokens read by the lexer. These rules are context-free rules and are compiled by the Hime Parser Generator in the form of automata. For more information on context-free grammars: Wikipedia Context-Free Grammars. The Hime Parser Generator provides its own language to express context-free grammars. This language is nonetheless very similar to the standard BNF notation (Wikipedia Backus–Naur Form).

The philosophy developed by Hime is that a grammar is a declarative artifact and should not contain code or other constructions not related to context-free grammars. A grammar rule is written using the following pattern:
variable -> rule body ;

The rule's head (variable in this example) is interpreted as a variable. The rule's body is composed using elements defined hereafter.

Atoms

Basic constructions that can be used in syntactic rules' bodies are:

Terminals and Variables

Terminals and variables are referred to using their names. Reading a single rule, there is no way to determine whether a referred name is a terminal or a variable; the whole grammar is necessary to disambiguate. It is therefore good practice (although it is not enforced by the Hime Parser Generator) to write terminals' names in uppercase and variables' names in lowercase. In the rule a -> B c ; , we can assume that a and c are probably variables and B is a terminal. Terminals referred to in syntactic rules must be defined in the terminals section of the grammar using a lexical rule (see Terminals Section).

Inline terminals

Inline terminals can be defined within syntactic rules. They are expressed within single quotes. For example, in a -> 'b' ; , 'b' is an inline terminal defined using the regular expression 'b'.

Virtual symbols

Virtual symbols are neither terminals nor variables. They are not matched against the input by the lexer or the parser. Instead, they are inserted by the parser where they appear in the syntactic rules. For example:
variable -> a b "my_virtual" c ;

Note that virtual symbols are written between double quotes. When the parser encounters this rule, it will insert the virtual symbol named "my_virtual" and create a corresponding node in the AST. Virtual symbols can be used to insert indicators in the AST that can be leveraged during its later traversal, for example by a semantic analyzer.

Semantic actions

Semantic actions are pieces of behavior that are introduced in grammar rules and executed when the parser encounters them. In the Hime Parser Generator, we strongly believe in the separation of concerns, so the content of semantic actions is not written within the grammar rules. Semantic actions are marked in the grammar rules, but must be implemented in one of the languages for the .NET platform. For example:
variable -> A {MyAction} B ;

Here, the { MyAction } represents the semantic action. Note that it is written between curly braces. It is placed directly after the A element in the rule. The meaning of this rule is that when the parser is in this rule, immediately after it encounters an A token, it will execute the { MyAction } semantic action.

The { MyAction } semantic action is itself a piece of code that you specify and that will be called by the parser in this situation. In a sense, semantic actions are callbacks that you define and that are given to the parser. In C#, a semantic action is implemented as a method with the following signature:
void MyAction(Variable head, Symbol[] body, int length);
There should be no return value. The semantic action is given three parameters:
  • head is the symbol corresponding to the head of the rule the parser is currently in. In our example, this would be the variable symbol.
  • body is an array of the symbols found by the parser for the rule so far. In our example, it will contain only one symbol, an A token.
  • length gives the number of symbols in the body array.

Note that if the semantic action is placed in the middle of the rule, the parser will only give in the body parameter the symbols that appear before the semantic action, because the parser is precisely at this point in the rule when the action is executed.

Operators

In order to simplify the writing of syntactic rules, the following operators are available:
  • Term grouping using parentheses ()
  • Optional term using "?"
  • Zero or more occurrences of a term using "*"
  • One or more occurrences of a term using "+"
  • Union using "|"
Internally, the parser generator will "flatten" the provided syntactic rules in order for them to be usable. This process should not affect how the rules are written, but this information is provided here because it has some impact on the production of the abstract syntax trees, as explained in the next section.

Formally, the body of a syntactic rule can only be a sequence (possibly empty) of symbols, that is to say terminals and variables. The operators provided by the tool are great for expressivity but they cannot be used directly for the generation of a suitable parser. Simple rewriting examples are given hereafter:
a -> x | y ;
rewritten as
a -> x ;
a -> y ; 

a -> x? ;
rewritten as
a -> x ;
a -> ; 

a -> x y* ;
rewritten as
_genvar -> y ;
_genvar -> _genvar y ;
a -> x ;
a -> x _genvar ; 

a -> x y+ ;
rewritten as
_genvar -> y ;
_genvar -> _genvar y ;
a -> x _genvar ;

where _genvar designates an automatically generated new variable.

Template rules

Repetitive and similar rules can be factorized using template rules. A template rule is a syntactic rule with parameters which can be invoked, passing values as arguments. A template rule is written as:
head<param1, ...> -> rule's body ;

There may be as many parameters as needed. In the rule's body, parameters may be directly referenced, as in head<param1> -> param1 ; Template parameters can also be used as arguments for the invocation of another template, as in:
a<x> -> x b<x> ;
b<y> -> y ;
s -> a<i> | a<j> ;

This is equivalent to:
ai -> i bi ;
bi -> i ;
aj -> j bj ;
bj -> j ;
s -> ai | aj ;
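As a sketch, template rules are convenient for factoring out separated lists; assuming COMMA is a terminal defined in the terminals section and exp a variable of the grammar:
list<elem, sep> -> elem (sep elem)* ;
exps -> list<exp, COMMA> ;
Here exps matches one or more exp symbols separated by COMMA tokens, without having to write the corresponding rules by hand.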


Abstract Syntax Tree

Parsers generated by the tool are designed to output an abstract syntax tree (AST) corresponding to the parsed input. The structure of the produced AST is similar to those produced by other tools. Given a rule a -> x y z, the parser will produce for this rule a tree node a with x, y and z as children nodes.
The general rule to keep in mind is that, for any rule, the parser produces a sub-tree with the head as top node and the symbols in the rule's body as direct children of this node. Consequently, for the rule a -> x y* z, the produced AST will have a as top node and then x, the matched instances of y and finally z as children of this node.

Tree Actions

The parser generator supports the specification of simple tree actions. This feature enables you to declaratively specify how the ASTs resulting from parsing are built, in order to ease their traversal afterwards. The parser generator offers two basic actions:
  • The drop action specifies that the node on which it applies will simply be removed from the AST along with all its children. The drop action is specified using '!'. For example, the rule a -> x y! z will produce ASTs with a as top node and only x and z as children. The drop action may be applied to a complete group in a rule's body. For example, the rule a -> x (c|d*)! z will produce the same ASTs as the previous example. This DOES NOT change the semantics of the rule; it only drops the potential c and d nodes from the AST after they have been recognized by the parser in the input.
  • The promote action specifies that a particular symbol in the rule's body must become the top node of the produced AST for the rule, instead of the rule's head. The promote action is noted '^'. Multiple promote actions may be specified in the body of a rule. In addition, this action may also be applied to a whole group in the body. See hereafter:
a -> x y^ z will originally produce:
graph1.PNG
Applying the promote actions it becomes:
graph2.PNG

When multiple promote actions are found, they are applied sequentially from left to right:
a -> x y^ z^ will originally produce:
graph3.PNG
Applying the first promote action it becomes:
graph4.PNG
Applying the second promote action it becomes:
graph5.PNG

Promote actions are applied bottom-up in the AST, as follows:
a -> x y^
y -> u v^ w

These rules will originally produce:
graph6.PNG
Applying the first promote action it becomes:
graph7.PNG
The second promote action should have applied to a y node, but it is instead applied to its replacement, thus producing:
graph8.PNG
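As an illustrative sketch, both actions are typically combined to produce cleaner ASTs for expression grammars (the symbols below are only examples, assuming a NUMBER terminal defined in the terminals section):
exp -> exp '+'^ NUMBER ;
exp -> '('! exp^ ')'! ;
exp -> NUMBER^ ;
The parentheses are dropped from the AST, while the '+' operator, the inner expression and the NUMBER token are promoted, so that operators end up as the top nodes of their sub-trees instead of a chain of exp nodes.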

Grammar Inheritance

It is possible to make one grammar inherit from one or more grammars. This means that all the symbols and rules defined in the inherited grammars will be understood in the inheriting one as if defined in it. This is expressed in the following manner:
cf grammar MyGrammar : BaseGram1, BaseGram2 { }
The inheritance has the following effects:
  • All lexical rules defining terminals in BaseGram1 and BaseGram2 are understood in MyGrammar as if defined there. The priority of the lexical rules is then, from the lowest to the highest:
    • Lexical rules in BaseGram1
    • Inline terminals in BaseGram1
    • Lexical rules in BaseGram2
    • Inline terminals in BaseGram2
    • Lexical rules in MyGrammar
    • Inline terminals in MyGrammar
  • All syntactic rules defining variables in BaseGram1 and BaseGram2 are understood in MyGrammar as if defined there.
  • All meta-rules defined in BaseGram1 and BaseGram2 are understood in MyGrammar as if defined there.
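As a sketch, a base grammar providing common lexical definitions could be reused as follows, keeping the three-section skeleton shown earlier (the grammar, terminal and variable names are only illustrative):
cf grammar Lexical
{
	options { }
	terminals
	{
		WHITE_SPACE	-> ' ' | '\t' | '\r' | '\n' ;
		IDENTIFIER	-> [a-z]+ ;
	}
	rules { }
}
cf grammar MyLang : Lexical
{
	options
	{
		Axiom = "statement";
		Separator = "WHITE_SPACE";
	}
	terminals { }
	rules
	{
		statement -> IDENTIFIER ';' ;
	}
}
Here MyLang can use IDENTIFIER and WHITE_SPACE as if they were defined in its own terminals section.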

