This project is read-only.

Generated parser not halting on incorrect input?


I have been attempting to use Hime to generate a parser for a subset of C++. In doing so, I have stumbled across behavior I didn't expect and which seems like a bug to me (admittedly my experience with parser generation is limited, though).

I have reduced the behavior to a minimal example. Using the attached grammar file, I generate a parser using the command-line "himecc.exe -m RNGLR1 demo.gram" and invoke that parser as follows in C#:

var lexer = new Demo.Demo_Lexer("typedef char * test;");
var parser = new Demo.Demo_Parser(lexer);
var result = parser.Analyse();

This executes and returns a non-null result. There are no errors in the parser error list. However, the grammar specifier the axiom rule as:
demo -> 'typedef' type 'test' ';';

and for the type rule to be:
type -> 'char' | 'short';

The "*" is not specified in the grammar, and I would expect the parser to halt upon encountering it. In fact, if I change the input string to replace the * with several other invalid characters, I get the same behavior. This also occurs if I change the parse method from RNGLR1 to the default (omitting the -m command line switch, essentially).

Is this expected? If so, why?

file attachments


lwouters wrote Oct 21, 2011 at 2:25 PM

I was indeed able to replicate the behaviour you described in this ticket.
The lexer is in fact discarding any character it cannot match using the given lexical rules. In the described case, the two lexical rules (SEPARATOR and WHITE_SPACE) cannot match the ‘*’ character. Accordingly, the generated lexer will discard this character and inform you by reporting an error accessible through the lexer.Errors property.
The parser itself does not report any error as it is always provided expected tokens in this example. Parser errors are reported when a syntactic rule has been violated. This happens when the lexer feeds the parser an unexpected token. Accordingly, you can retrieve the syntactic errors in the parser.Errors property.
The philosophy developed by the parser is that it will try as best as possible to return a syntactic tree, even if errors when encountered. This is useful when programming language compilers must report as much errors as possible is the provided sources instead of only the first one. This is the base rationale behind this behaviour.
However, we can infer some improvement tracks from this:
1) Regarding the non-halting behaviour, we have to improve the documentation or let you parameterize it so you shall not be surprised by it.
2) Regarding the error reporting, we shall either make the two natures of errors (lexical and syntactic) more obvious in the API, or merge them to provide only one place to look for errors.
Thank you for your input.