AST Generation

Aug 5, 2012 at 9:02 PM

Hi again,

I'm trying to use a top-down approach to morph the AST output by Hime Parser into something I can easily use to emit IL.

I was planning on creating a class called Program (or something like that), and have the tree output by generated parser Analyse() instance method be read in as a parameter. I would then use recursion to encapsulate each symbol using the rules I've defined in my grammar, and then write methods for each symbol that would emit the appropriate IL, again using recursion so I would only have to call new Program(x.Analyse()).Compile("test.exe") at the end of the day.

Here's an example of what I meant to make things clearer.

	class TestASTBuilder
	{
		public interface TestSymbol { }
		
		public TestSymbol Tree;
		
		public class ExpAtom : TestSymbol
		{
			public string Value;
			
			public class Number
			{
				
			}
			
			public class Identifier
			{
				
			}
		}
		
		public class Exp0Symbol : TestSymbol
		{
			public Exp0Symbol Left;
			public Exp0Symbol Right;
			public ExpAtom Child;
			
			public string Value;
		}
		
		public class Exp1Symbol : TestSymbol
		{
			public Exp1Symbol Left;
			public Exp1Symbol Right;
			
			public Exp0Symbol Child;
			
			public string Value;
			
			public Exp1Symbol(Hime.Redist.Parsers.SyntaxTreeNodeAction stn, TestASTBuilder k)
			{
			  /*more recursion goes here*/
			}
		}
		
		public TestASTBuilder(Hime.Redist.Parsers.SyntaxTreeNode stn)
		{
		    Tree = new Exp1Symbol(stn, this);
		}
	}

Here's the problem...

The SyntaxTreeNode object exposes the symbols that were parsed, but it doesn't tell me what rules were used for that specific node, making it difficult to write my own logic in to morph the tree into what I need. Since I'm writing a compiled language, and not an interpreted language, this method is probably best/easiest for me.

So here's what I'm proposing as the solution:

cf grammar TestGrm
{
	options
	{
		Axiom = "exp";
		Separator = "SEPARATOR";
	}
	terminals {
		INTEGER -> [1-9] [0-9]* | '0' ;
		NUMBER -> INTEGER;

		WHITE_SPACE	-> 0x0020 | 0x0009 | 0x000B | 0x000C ;
		SEPARATOR	-> WHITE_SPACE+;
		
		
		PRINTABLE -> . - WHITE_SPACE;
		PRINTABLE_NON_NUMBER -> . - WHITE_SPACE - [0-9];
		IDENTIFIER -> PRINTABLE_NON_NUMBER PRINTABLE*;
		
		OPERATOR_MULTLEVEL -> ('*' | '/' | '%') IDENTIFIER*;
		OPERATOR_ADDLEVEL -> ('+' | '-' ) IDENTIFIER*;
	}
	rules {
		exp_atom-> NUMBER <EXP_ATOM_NUMB>
			|  IDENTIFIER <EXP_ATOM_IDENT>
			|  '(' exp ')' <EXP_ATOM_EXP> ;

		exp_op0	-> exp_atom <EXP_OP0>
			|  exp_op0 OPERATOR_MULTLEVEL exp_atom <EXP_OP0_A> ;
		
		exp_op1	-> exp_op0 <EXP_OP1>
			|  exp_op1 OPERATOR_ADDLEVEL exp_op0 <EXP_OP1_A> ;

		exp	-> exp_op1 <EXP> ;
	}
}

I've added <X> after each rule, so I can reference each potential rule by a tag I set in the grammar. This is also helpful so that if I change the grammar around, I can keep the tags in place, minimizing the need for changing the code each time a new grammar is generated. This is a problem I've had with other parsers, where every time I changed a minor rule, the entire table would shift, and I had to manually update all my code by hand. If tables are generated in the manner I'm proposing, it would be pretty helpful.

Here's how I'd hope to use this:

 

public TestASTBuilder(Hime.Redist.Parsers.SyntaxTreeNode stn)
{
     if (stn.Rule == TestGrm.TestGrmParser.Rules["EXP"]) 
     {
          Tree = new Exp1Symbol(stn.children[0], this);
     }
     else if (stn.Rule == TestGrm.TestGrmParser.Rules["SomeOtherRule"])
    {
          Tree = new SomeOtherSymbol(stn.children[0], stn.children[1], this);
    }
}

If you could implement this, I would be very, very happy. Or if I'm missing something, please let me know.

Thanks again for the wonderful parser.

Coordinator
Aug 6, 2012 at 4:41 AM

Hi,

Thank you for your insight. I think I see your point. The best for you would be for the parser generator to let you define your own class of symbols, similar to what Bison or Anlr are doing if my memory is correct. Unfortunatly this is not possible here at this time, however I still have some good news for you:

The tag system you propose has already be implemented from the start, tags are called virtual symbols because they are not matched from the input but inserted by the parser. In your proposal, you just have to replace your <TAG> by "TAG", i.e. use quotes to define a terminal symbol in the rule. The generated parser will then insert this symbol in the exact same place it has been defined in your rule.

Then, the generated parser does not directly let you detect which rule has been used, however you can use tree actions to ensure that your tags are always at the root of the AST for a given rule. For example:

exp_atom-> NUMBER "EXP_ATOM_NUMB" ^ ;

Note that the ^ symbol is used the specify that the tag should be promoted as the root for this rule. The parser will output the syntax tree with the tag symbol as the root for this rule. In other words, you can apply this action to all your tags to ensure they will effectively replace the respective rule's heads. For more information on tree actions, I may refer you to http://himeparser.codeplex.com/wikipage?title=Parser&referringTitle=Documentation.

Note that your proposed notation for tags using angle brackets is used in this tools for the definition of template rules, as explained in the above link.

Does this answer your problem?

Aug 6, 2012 at 5:30 AM

Yes, the virtual symbols and the syntax work perfectly for me, and I like the syntax a lot. I only have one criticism on the matter. If I use the virtual symbols to match rules, I'm forced to do a string compare on each of the nodes' first child. It would be computationally faster if I could do a numerical compare instead.

I've redefined my grammar like this:

                exp_atom-> "EXP_ATOM_NUM" NUMBER {ATOMNUMBER}
			|  "EXP_ATOM_IDENT" IDENTIFIER {ATOMIDENTIFIER}
			|  "EXP_ATOM_EXP" '(' exp ')' ;

		exp_op0	-> "EXP_OP0" exp_atom
			|  "EXP_OP0_A" exp_op0 OPERATOR_MULTLEVEL exp_atom ;
		
		exp_op1	-> "EXP_OP1" exp_op0
			|  "EXP_OP1_A" exp_op1 OPERATOR_ADDLEVEL exp_op0 ;

		exp	-> "EXP" exp_op1 ;

Then I can check the first child of each node to see how I should build my tree...

But that involves using if (treeNode.Symbol.Name == "EXP") { ... } It would be nice if the Hime Parser generated an integer table for each of the virtual symbols (along with a dictionary) such that I could do this instead: if (treeNode.Symbol.VirtualID == Test.TestParser.VirtualSymbols["EXP"]) { ... }

Where treeNode.Symbol.VirtualID would be set to 0 (or -1, or something like that), when it's a normal symbol, and otherwise be set equal to some integer representing which virtual symbol. This should decrease compile time for very big trees.

Does this make sense?

Thanks again.

Coordinator
Aug 6, 2012 at 5:48 AM

Yes, I agree with what you said. In fact I'm currently in the process of refactoring the library in order to improve the performances. This is the BinTables branch in the code repository. In this branch, the parsers uses a tables for virtual symbols that should be accessible to you. Unfortunatly, this branch is not yet mature enough to be merged back into the default branch just yet. I will check tonight if some of the relevant changes can be re-integrated into default for this purpose though. I'll let you know if this is possible or not.

Also, I have made the changes you requested regarding the output parser errors. They are now in the default branch if you want to check it out. I'll make an official release later with more changes.

Hope this helps.

Aug 6, 2012 at 7:37 AM

Thanks that sounds great. I'll look forward for the new branch of the code to be integrated with the main branch. I think this is the only last thing I need to start fully working on my project.

Thanks, and I do think you've done a great job.