Symbol.Name not as specified in terminals


When we have a terminal defined in the terminals section, for example:
ASSIGN -> ':=' ;
and one of our own rules contains the terminal text directly as an in-line expression, for example:
statement -> variable ASSIGN expression;
statement2 -> variable2 ':=' expression2;
statement3 -> variable3 ASSIGN expression3;
then after parsing we get three symbols which have Name = ":=" and Value = ":=".
The expected result is Name = "ASSIGN" and Value = ":=".


lwouters wrote Oct 22, 2014 at 7:03 AM

To resolve ambiguities when multiple lexical rules overlap, there is a concept of priority between the terminals. In essence, the later a terminal is defined, the higher its priority. This is explained in the Order paragraph of the documentation for the lexical rules. In addition, terminals that are defined in-line in a syntactic rule, as in the example, always have a greater priority than those defined in the terminals section of a grammar, simply because they appear later.
statement2 -> variable2 ':=' expression2;
This mechanism is used to define keywords in a simple manner because it allows writing:
    IDENTIFIER -> [_a-zA-Z] [_a-zA-Z0-9]* ;
    class_definition -> 'class' IDENTIFIER class_body ;
In this small example, the 'class' keyword could be matched by the IDENTIFIER rule, but it is not, because the keyword is defined later than the IDENTIFIER rule. The definition of the 'class' keyword does not prevent the IDENTIFIER rule from matching other identifiers such as 'foo' or 'bar'.
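
The priority rule can be illustrated with a small Python model. This is only a sketch and not the tool's actual lexer: the `TERMINALS` list, the `match_token` function and the use of Python's `re` module are all made up for the illustration, and it assumes that the longest match wins and that a tie goes to the terminal defined last.

```python
import re

# Terminals in definition order. In-line terminals come after the named ones,
# and the name of an in-line terminal is simply its own text.
TERMINALS = [
    ("IDENTIFIER", r"[_a-zA-Z][_a-zA-Z0-9]*"),
    ("class", r"class"),
]

def match_token(text, terminals=TERMINALS):
    """Return (name, value) for the token at the start of text, or None."""
    best = None
    for name, pattern in terminals:
        m = re.match(pattern, text)
        if m is None or m.group(0) == "":
            continue
        value = m.group(0)
        # Longest match wins; on equal length the later definition wins (>=).
        if best is None or len(value) >= len(best[1]):
            best = (name, value)
    return best
```

With these assumptions, `match_token("class")` yields the keyword, while `match_token("foo")` and `match_token("classes")` still yield IDENTIFIER.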

However, in the reported case, the ':=' terminal defined in-line in the statement2 rule always supersedes the definition of the ASSIGN lexical rule. For this reason, the ASSIGN terminal can never be produced by the lexer, and the rules statement and statement3 are therefore invalid.

The real issue behind this bug report is that the compiler fails to report that the ASSIGN terminal can never be produced by the generated parser and that the grammar is therefore ill-formed. I raised this bug accordingly.
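
To make the missing check concrete, here is a rough Python sketch. It is not the compiler's implementation: `shadowed_terminals` is a hypothetical helper, and it only handles terminals that are fixed pieces of text. A real check would have to work on the lexer's automaton to cover overlapping patterns in general.

```python
def shadowed_terminals(terminals):
    """Find fixed-text terminals that can never be produced.

    terminals: list of (name, literal) pairs in definition order, where
    in-line terminals appear after the named ones. A terminal is reported
    when the same literal is defined again later, because the later
    definition always has priority.
    """
    last_index = {}
    for index, (_name, literal) in enumerate(terminals):
        last_index[literal] = index  # remember the last definition of each literal
    return [
        name
        for index, (name, literal) in enumerate(terminals)
        if last_index[literal] != index
    ]
```

For the reported grammar, `shadowed_terminals([("ASSIGN", ":="), (":=", ":=")])` reports ASSIGN, which is the situation the compiler should warn about.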

Regarding the terminals defined in-line in the syntactic rule, as shown in the 'statement2' rule, their name is always the same as their value because they simply are constant pieces of text that are not explicitly named by the person writing the grammar.

As a side note, all symbols are given a unique identifier. It is much more efficient to check the identifier of a symbol than its name, because checking the name devolves to a string comparison instead of a simple integer check. You can look into the generated code of the lexer and parser to see the existing identifiers (they are constants inside an inner ID class).
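
The idea of checking identifiers rather than names can be sketched as follows. This is a Python illustration only: the `Symbol` type, the `ID` class layout and the constant values are hypothetical stand-ins, and the real constants must be taken from the generated lexer and parser code.

```python
from dataclasses import dataclass

class ID:
    # Hypothetical values for illustration; the real ones are generated.
    ASSIGN = 3
    IDENTIFIER = 4

@dataclass(frozen=True)
class Symbol:
    id: int
    name: str
    value: str

def is_assign(symbol):
    # One integer comparison instead of comparing symbol.name as a string.
    return symbol.id == ID.ASSIGN
```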

kep4uk wrote Oct 22, 2014 at 12:36 PM

Yes, I agree with you regarding the priority of the terminals.
But maybe it would be nice if all terminals defined in-line were checked against the terminals defined in the terminals section and replaced by them when identical (during compilation of the lexer and scanner).
Of course, maybe this is not correct in theory, but the current behavior reduces grammar readability. For example
        LPAREN      -> '(' ;
        RPAREN      -> ')' ;
        SEMI        -> ';' ;
        fb_invocation -> fb_name LPAREN assignment_list? RPAREN SEMI ;
is very hard to understand. With my proposal, it would be possible to rewrite the rule like this:
fb_invocation -> fb_name '(' assignment_list? ')' ';' ;
In most cases this is easier to read.
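
A rough sketch of the proposed substitution, written in Python only to show the idea. The `resolve_inline` function and the way a rule body is represented (a list of strings, with in-line terminals in single quotes) are assumptions for this example and not how the compiler stores grammars.

```python
def resolve_inline(named, rule_body):
    """Replace in-line terminals by named terminals with identical text.

    named: dict mapping a terminal name to its fixed text.
    rule_body: list of rule items; in-line terminals are quoted, e.g. "'('".
    In-line terminals without a named equivalent are left unchanged.
    """
    by_literal = {literal: name for name, literal in named.items()}
    resolved = []
    for item in rule_body:
        is_inline = len(item) >= 2 and item[0] == "'" and item[-1] == "'"
        if is_inline:
            resolved.append(by_literal.get(item[1:-1], item))
        else:
            resolved.append(item)
    return resolved
```

With LPAREN, RPAREN and SEMI defined as above, the body of the rewritten fb_invocation rule would resolve to the same symbols as the original rule, so the symbols would carry the names from the terminals section.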