Terminal not Matching Correctly

Jun 1, 2012 at 3:49 PM

Me again. I'm having an issue with a terminal failing to match sequences that it should match. The grammar definition in question is this:

LETTER -> [a-zA-Z] | 0x0080 .. 0x02AF;
ID -> (LETTER|'_') ( LETTER | [0-9] | '_' | '-' | 0x0027 /* Single Quote */ )*;

And it's failing to match sequences like this:

k_papal_state
k_andalusia
d_cordoba
e_venice

As far as I can tell, it should match them, but it doesn't. In case it's relevant, I'm using a version compiled from the latest source code.

Coordinator
Jun 1, 2012 at 6:58 PM
Edited Jun 1, 2012 at 6:59 PM

Hi,

The two terminals you provided work fine for me (verbatim copy-paste). I suspect you are experiencing the same issue as the one in this discussion. Do you get a particular error, or does it just fail silently? Could you post your complete set of terminals?

Laurent

Jun 1, 2012 at 8:04 PM
Edited Jun 1, 2012 at 8:08 PM

I got no errors when running the generator. Here's the full list of terminals:

        INT -> [0-9]*;
        FLOAT -> INT? '.' INT;
        NUMBER -> INT|FLOAT;
       
        ASSIGN -> '=';
        OPEN -> '{';
        CLOSE -> '}';
       
        LETTER -> [a-zA-Z] | 0x0080 .. 0x02AF;
        ID -> (LETTER|'_') ( LETTER
                           | [0-9]
                           | '_'
                           | '-'
                           | 0x0027 /* Single Quote */ )*;
        STRING -> '"' ' '? ID+ (' ' ID+)? ' '? '"';
       
        BOOL -> 'yes'|'no';
       
        NEW_LINE -> 0x000D /* CR */
                  | 0x000A /* LF */
                  | 0x000D 0x000A /* CR LF */
                  | 0x2028 /* LS */
                  | 0x2029 /* PS */ ;
        COMMENT -> '#' (.* - (.* NEW_LINE .*)) NEW_LINE ;
       
        WHITE_SPACE -> 0x0020 /* Space */
                     | 0x0009 /* Tab */
                     | NEW_LINE
                     | 0x00A0 /* No Break Space */;
        SEPARATOR -> (WHITE_SPACE | COMMENT)+;

 

I am reading from a file encoded in ANSI, but I have had no problems reading constructs like this one with that ID terminal: controls_religion = catholic

I am reading the file into a string object first with the proper encoding, and then passing the string to the lexer.

Coordinator
Jun 1, 2012 at 8:22 PM

From what I understand now, you can correctly parse this input:

controls_religion = catholic

but not this input (sequence of IDs):

k_papal_state
k_andalusia
d_cordoba
e_venice

To match the first input you had the syntactic rule:

idOption -> ID ASSIGN ID

Now you need a new syntactic rule for this second kind of input. I would write it as:

mysequence -> ID* ; // This is a sequence of IDs

and set the axiom of your grammar as follows:

myfile -> idOption | mysequence ;

Does this solve your problem? If not, could you also post the syntactic rules?

Sorry for the trouble.

Laurent

Jun 1, 2012 at 8:38 PM
Edited Jun 1, 2012 at 8:47 PM

Here:

        groupOption -> ID ASSIGN! OPEN! Option+ CLOSE!;
        idOption -> ID ASSIGN! ID;
        boolOption -> ID ASSIGN! BOOL;
        stringOption -> ID ASSIGN! STRING;
        numberOption -> ID ASSIGN! NUMBER;
        colorOption -> ID! ASSIGN! OPEN! NUMBER NUMBER NUMBER CLOSE!;       
               
        Option -> (idOption | boolOption | stringOption | numberOption | colorOption | groupOption );       
       
        MaleName -> 'male_names'! ASSIGN! OPEN! (STRING|ID)+ CLOSE!;
        FemaleName -> 'female_names'! ASSIGN! OPEN! (STRING|ID)+ CLOSE!;
        Name -> (MaleName|FemaleName)^;
       
        Barony -> TITLENAME ASSIGN! OPEN! CLOSE!;
        Title -> TITLENAME ASSIGN! OPEN! (Option | Title | Barony | Name)+ CLOSE!;
       
        Groups -> ID+;

Groups is set as the axiom in the options.

What I meant when I said I could parse the "controls_religion = catholic" construct is that the ID terminal works fine with them. As far as I can tell, there should be no difference between "controls_religion" and "e_venice" as far as the terminal rule is concerned.

For reference, this is what I think the grammar of Title and its sub-rules should be matching, though I've not been able to test it properly because of this issue with the ID. Note that a Title can have a sub-title beneath it, which is why the Title rule has itself as a possible child.

k_andalusia = {
    color = { 31 138 40 }
    color2={ 255 255 255 }
   
    capital = 181 # Cordoba
   
    culture = andalusian_arabic
   
    catholic = 100 # Crusade target weight
    muslim = 50 # Crusade target weight
   
    allow = {
        OR = {
            religion_group = muslim
            religion_group = zoroastrian_group
        }
    }
   
    d_cordoba = {
        color = { 60 180 12 }
        color2={ 255 255 255 }
       
        c_cordoba = {
            color = { 246 216 16 }
            color2={ 255 255 255 }
           
            b_cordoba = {
            }
        }
    }
}

 

[Edit] The TITLENAME construct was my attempt at a fix, but it didn't seem to work, so I deleted the terminal and at the same time switched the Groups rule to contain only ID+. Because the Title and Barony rules aren't being used, they threw no errors when running the generator.

Coordinator
Jun 1, 2012 at 9:21 PM

I am confused, because I could not find anything wrong with the ID terminal. However, I modified your grammar a bit, and the following is able to correctly parse your intended input:

cf grammar Test
{
    options
    {
        Axiom = "Groups";
        Separator = "SEPARATOR";
    }
    terminals {
        INT -> [0-9]*;
        FLOAT -> INT? '.' INT;
        NUMBER -> INT|FLOAT;
      
        ASSIGN -> '=';
        OPEN -> '{';
        CLOSE -> '}';
      
        LETTER -> [a-zA-Z] | 0x0080 .. 0x02AF;
        ID -> (LETTER|'_') ( LETTER
                           | [0-9]
                           | '_'
                           | '-'
                           | 0x0027 /* Single Quote */ )*;
        STRING -> '"' ' '? ID+ (' ' ID+)? ' '? '"';
      
        BOOL -> 'yes'|'no';
      
        NEW_LINE -> 0x000D /* CR */
                  | 0x000A /* LF */
                  | 0x000D 0x000A /* CR LF */
                  | 0x2028 /* LS */
                  | 0x2029 /* PS */ ;
        COMMENT -> '#' (.* - (.* NEW_LINE .*)) NEW_LINE ;
      
        WHITE_SPACE -> 0x0020 /* Space */
                     | 0x0009 /* Tab */
                     | NEW_LINE
                     | 0x00A0 /* No Break Space */;
        SEPARATOR -> (WHITE_SPACE | COMMENT)+;
    }
    rules {
        idOption -> ID ASSIGN! ID;
        boolOption -> ID ASSIGN! BOOL;
        stringOption -> ID ASSIGN! STRING;
        numberOption -> ID ASSIGN! NUMBER;
        colorOption -> ID! ASSIGN! OPEN! NUMBER NUMBER NUMBER CLOSE!;      
              
        Option -> (idOption | boolOption | stringOption | numberOption | colorOption );      
      
        MaleName -> 'male_names'! ASSIGN! OPEN! (STRING|ID)+ CLOSE!;
        FemaleName -> 'female_names'! ASSIGN! OPEN! (STRING|ID)+ CLOSE!;
        Name -> (MaleName|FemaleName)^;
      
        Barony -> ID ASSIGN! OPEN! CLOSE!;
        Title -> ID ASSIGN! OPEN! (Option | Title | Barony | Name)+ CLOSE!;
      
        Groups -> Title+;
    }
}

I removed the groupOption rule because it creates conflicts with the Barony and Title rules. If you still need the groupOption rule, I suggest, if acceptable in your domain, adding keywords for Title and Barony in order to disambiguate them.

Jun 1, 2012 at 10:04 PM
Edited Jun 1, 2012 at 10:14 PM

Thank you for giving me a working base to follow on from. I can't change the data to be read, as it's a data file from the game Crusader Kings II, and I do need the groupOption for the 'allow' section and sub-groups within that, but the titles do have prefixes. I've tried adding the prefixes to a terminal like this:

TITLEPREFIX -> 'e_' | 'k_' | 'd_' |'c_' | 'b_';

And then put that terminal in the rule like this:

Title -> TITLEPREFIX ID ASSIGN! OPEN! (Option | Title | Barony | Name)+ CLOSE!;

But I'm getting an OnLexicalError on the title name (in the above example, k_andalusia) when I try to parse the data. There's no mention of conflicts when running the generator.

[Edit] The TITLEPREFIX terminal has been added after the ID terminal, so as I understand the syntax it should take precedence over ID.

Coordinator
Jun 2, 2012 at 5:09 AM
Edited Jun 2, 2012 at 5:17 AM

The problem seems to be that the parser now expects a prefix followed by an ID. With the k_andalusia input, for example, the lexer reads the first letter and the underscore and sees that it can match a TITLEPREFIX. However, it keeps reading to see whether a longer match is possible, and in doing so it finds that it can also match an ID. It returns the ID, because priority always goes to the longest match. The priority system between terminals applies only when the same input matches two terminal definitions; here it is simply the preference for longer matches that takes effect. This is a well-known and desirable behavior.
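
The longest-match rule described above can be illustrated with a small Python sketch. It is not the generated lexer, only a rough model of it: the regular expressions are ASCII-only approximations of the ID and TITLEPREFIX terminals, next_token is a hypothetical helper, and the tie-breaking rule (a later definition wins when two matches have the same length) is an assumption based on what is said in this thread.

```python
import re

# Hypothetical stand-ins for the two terminals, listed in definition order.
# ASCII-only approximation of LETTER; the real grammar also allows 0x0080..0x02AF.
TERMINALS = [
    ("ID", re.compile(r"[A-Za-z_][A-Za-z0-9_\-']*")),
    ("TITLEPREFIX", re.compile(r"[ekdcb]_")),
]

def next_token(text, pos=0):
    """Return (name, lexeme) for the longest match starting at pos.

    Assumption: when two terminals match the same length, the one
    defined later wins.
    """
    best = None
    for name, pattern in TERMINALS:
        m = pattern.match(text, pos)
        if m and m.end() > pos:
            length = m.end() - pos
            if best is None or length >= best[2]:
                best = (name, m.group(), length)
    if best is None:
        raise ValueError(f"lexical error at position {pos}")
    return best[0], best[1]

print(next_token("k_andalusia"))  # ('ID', 'k_andalusia')
print(next_token("k_"))           # ('TITLEPREFIX', 'k_')
```

In this model TITLEPREFIX is only ever produced when nothing follows the underscore, so a rule that expects TITLEPREFIX followed by ID never sees the prefix token for an input such as k_andalusia.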


OK, so what you can do is modify the grammar as follows:

Add a new terminal TITLENAME defined just after the definition of ID as:

TITLENAME -> LETTER '_' ID ;

Keep the groupOption rule and modify the Title and Barony rules as follows:

Barony -> TITLENAME ASSIGN! OPEN! CLOSE!;
Title -> TITLENAME ASSIGN! OPEN! (Option | Title | Barony | Name)+ CLOSE!;
Groups -> Title+;

This works on your input. The catch is that with this solution you cannot have an ID that starts with a prefix of the form letter-underscore, which can be annoying if the value of an idOption falls into this case. A quick fix is to modify the idOption rule as follows:

idOption -> ID ASSIGN! (ID | TITLENAME) ;
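
To see which words are affected by this catch, here is a rough Python sketch. The two regular expressions are ASCII-only approximations of ID and TITLENAME, classify is a hypothetical helper, and the sketch assumes that TITLENAME wins over ID when both match the whole word (it is defined after ID).

```python
import re

# ASCII-only approximations of the thread's terminals (hypothetical names).
ID = re.compile(r"[A-Za-z_][A-Za-z0-9_\-']*")
# TITLENAME -> LETTER '_' ID
TITLENAME = re.compile(r"[A-Za-z]_[A-Za-z_][A-Za-z0-9_\-']*")

def classify(word):
    """Name the terminal a whole word would get.

    Assumption: TITLENAME wins over ID when both match the same length.
    """
    if TITLENAME.fullmatch(word):
        return "TITLENAME"
    if ID.fullmatch(word):
        return "ID"
    return None

for word in ["k_andalusia", "e_venice", "catholic", "controls_religion"]:
    print(word, "->", classify(word))
```

Under these assumptions only words whose second character is an underscore become TITLENAME, so controls_religion and catholic stay plain IDs, while a value such as e_venice on the right-hand side of an idOption needs the (ID | TITLENAME) alternative.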


A second solution in cases like yours is to write a grammar that is more general than the expected input and to validate the produced ASTs in a later phase. This means you no longer distinguish between groupOption and Title, because their respective contents are very similar. In this case, you need to write the rules as follows:

 

groupOption -> ID ASSIGN! OPEN! (Option | Name)+ CLOSE!;              
Option -> (idOption | boolOption | stringOption | numberOption | colorOption | groupOption);
Groups -> groupOption+;

The Barony and Title rules are removed completely, and the axiom is still Groups. This also works well on your input. However, you will have to write a little more code to handle the AST produced for each groupOption, in order to check whether it is a "real" option group or a title definition, by reading the value of the ID and checking that it begins with an expected prefix.
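
As an illustration of that later validation phase, here is a minimal Python sketch. The Group class is a hypothetical stand-in for whatever AST node the generated parser produces for a groupOption, and the helper names are invented for the example; only the list of prefixes comes from this thread.

```python
from dataclasses import dataclass, field

# Prefixes taken from the TITLEPREFIX terminal posted earlier in the thread.
TITLE_PREFIXES = ("e_", "k_", "d_", "c_", "b_")

@dataclass
class Group:
    """Hypothetical stand-in for a groupOption AST node: its ID plus nested groups."""
    name: str
    children: list = field(default_factory=list)

def is_title(node):
    """A group counts as a title when its ID begins with an expected prefix."""
    return node.name.startswith(TITLE_PREFIXES)

def collect_titles(node):
    """Walk the tree depth-first and return the names of groups that are titles."""
    found = [node.name] if is_title(node) else []
    for child in node.children:
        found.extend(collect_titles(child))
    return found

# Shape of the k_andalusia sample, keeping only the nested groups.
tree = Group("k_andalusia", [
    Group("allow", [Group("OR")]),
    Group("d_cordoba", [Group("c_cordoba", [Group("b_cordoba")])]),
])
print(collect_titles(tree))
# ['k_andalusia', 'd_cordoba', 'c_cordoba', 'b_cordoba']
```

Note that a purely prefix-based check would also classify an ordinary option group as a title if its name happened to start with one of these prefixes, so the check may need extra conditions for the real data.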

Hope this helps.

Laurent

Jun 2, 2012 at 11:20 AM

Ah, I see. As you have probably guessed by now, this is my first time using a parser generator. Thank you for clearing that up.