Where to declare a string terminal

Aug 15, 2013 at 12:29 AM
I want to put a regex that matches every string( like STR -> .* ;) but I don't want this to override my existing terminals. I tried declaring it as the first terminal and as the last terminal. In both cases the lexer catches the entire input as STR.
Coordinator
Aug 15, 2013 at 6:05 AM
Edited Aug 15, 2013 at 6:06 AM
Hi,
What you describe ought to work when the STR rule is the first one.
For example, considering the following grammar:
cf grammar Test1 {
    options {
        Axiom = "test";
    }
    terminals {
        STR     -> .* ;
        NAME    -> [a-zA-Z_] [a-zA-Z_0-9]* ;
    }
    rules {
        test    -> STR | NAME ;
    }
}
On the input "abc", the grammar matches the NAME rule, whereas on the input "123" it matches the STR rule.

What can happen in your case is that another rule may take precedence in a non-obvious manner. If you don't mind putting your grammar here, I could be more specific.
Aug 15, 2013 at 10:00 AM
Edited Aug 15, 2013 at 10:00 AM
I'm trying to build grammar for a wiki text parser.
I'm stuck at the first step. This is intended to handle the case for Heading 2. But there's always a single STR in the lexer output.
cf grammar MathExp {
    options {
        Axiom = "exp";
        Separator = "SEPARATOR";
    }
    terminals {
        STR -> .* ;
        TAG_H2 -> '==' ;
        SEPARATOR    -> WHITE_SPACE+;
    }
rules {
        h2 -> TAG_H2 STR TAG_H2 ;
        exp -> h2 | STR ;
    }
} 
Coordinator
Aug 15, 2013 at 12:05 PM
Edited Aug 15, 2013 at 12:08 PM
The difficulty in your case comes from the fact that some markups alone (e.g. ==) should not be interpreted specially when they are not in pairs and you also that you need some form of "catch-all" token for the raw text. I can propose you the following grammar as a starting point. Below are some explanations about the assumptions and what the rules mean.
cf grammar WikiText {
    options {
        Axiom = "file";
    }
    terminals {
        LINE_END  -> 0x000D /* CR */
                    |  0x000A /* LF */
                    |  0x000D 0x000A /* CR LF */
                    |  0x2028 /* LS */
                    |  0x2029 /* PS */ ;

        CONTENT -> .* - (.* LINE_END .*) ;

        H1 -> '=' CONTENT '=' ;
        H2 -> '==' CONTENT '==' ;
        H3 -> '===' CONTENT '===' ;
    }
    rules {
        file -> line* last;
        line -> content | title1 | title2 | title3 ;
        last -> CONTENT LINE_END? ;

        content -> CONTENT LINE_END ;
        title1 -> H1 LINE_END ;
        title2 -> H2 LINE_END ;
        title3 -> H3 LINE_END ;
    }
}
This grammar correctly parses the following input:
some=content
==MyTitle==
Some other content
In this grammar we "manually" handle the line endings because we use it to separate the content of the file.
We make the following assumptions:
  • The file can be looked as a list of lines
  • Each special markups (titles, etc.) are on their own line
  • A final line ending character is not required if any content is preceding
  • A file cannot terminate on a line other than a (possibly empty) content line (i.e. not title, etc.)
Now the meaning of the rules :
  • LINE_END simply matches any line ending character, or sequence of characters
  • CONTENT matches any string (possibly empty) that does not contain any line ending character
  • A file is a sequence of zero or more lines, plus a last one that may contain some data (or not)
  • Lines are either content lines, or title lines
Aug 15, 2013 at 8:19 PM
Great! Thank you!
I assume drop action is not allowed with terminals.
No big deal though!
Coordinator
Aug 16, 2013 at 5:34 AM
You're welcome.
And indeed you can't apply drop actions in rules for the terminals.
Aug 16, 2013 at 11:18 AM
I'm trying to implement tag pairings in rules section (not in terminals) I hope I could make it to some HTML-like tag-pairings using the same method.
This is the grammar I came up with but I'm getting a shift reduce conflict for terminal H2TAG between rules h2 and broken_h2.
cf grammar WikiText {
    options {
        Axiom = "file";
    }
    terminals {
        H2TAG -> '==' ;
        CONTENT -> .* - (.* H2TAG .*) ;
    }
    rules {
        file -> content* ;
        content -> CONTENT^ | heading^ | broken_heading^ | special_characters^ ;
        special_characters -> H2TAG ;
        broken_heading -> broken_h2 ;
        heading -> h2^ ;
        broken_h2 -> H2TAG CONTENT ;
        h2 -> H2TAG! CONTENT H2TAG! ;
    }
}
I think if I could use an epsilon token this could be re-written like this with no conflict.
cf grammar WikiText {
    options {
        Axiom = "file";
    }
    terminals {
        H2TAG -> '==' ;
        CONTENT -> .* - (.* H2TAG .*) ;
    }
    rules {
        file -> content* ;
        content -> CONTENT^ | heading^ | special_characters^ ;
        special_characters -> H2TAG ;
        heading -> h2^ ;
        h2 -> H2TAG! CONTENT h2_ending ;
        h2_ending -> H2TAG | epsilon ;
    }
}
So, do we have an epsilon token?
Coordinator
Aug 16, 2013 at 11:36 AM
In your grammar you can can simply define a rule with en empty body, so adding the following rule should do the trick:
epsilon -> ;
Alternately, you can also use the "?" operator as follow:
h2_ending -> H2TAG ? ;
This rule is internally rewritten into the equivalent:
h2_ending -> H2TAG ;
h2_ending -> ;
Now, as to your definition of the CONTENT token, I tried this form when building the grammar I gave. The problem is that the lexer always matches as much characters as possible in a token. So there is a problem in the following input:
Some content
== Title ==
In this example, the lexer will match "Some content\r\n=" as a CONTENT token (which is correct) but it destroys the "==" markup. This is why I came up with the restriction of line endings, so that a CONTENT token cannot span multiple lines and as long as special markups are on their own line we are OK.
I also find this solution not so elegant and I would welcome a new one!
Aug 16, 2013 at 11:45 AM
I changed the CONTENT definition to prevent the lexing issue.
I also used the epsilon ending as follows:
cf grammar WikiText {
    options {
        Axiom = "file";
    }
    terminals {
        CONTENT -> .* - (.* '=' .*) ;
        H2TAG -> '==' ;
    }
    rules {
        file -> content* ;
        content -> CONTENT^ | heading^ | special_characters^ ;
        special_characters -> '=' ;
        heading -> h2^ ;
        h2 -> H2TAG! CONTENT h2_ending ;
        h2_ending -> H2TAG | epsilon ;
        epsilon -> ;
    }
}
I still get this shift/reduce conflict:
ERROR: Conflict Shift/Reduce in C on terminal 'H2TAG' for items { [h2_ending → ? H2TAG]  [epsilon → ?] }
Coordinator
Aug 16, 2013 at 4:01 PM
This grammar is indeed ambiguous although it is not immediately obvious. In simple terms, because in the given rules the closing H2TAG is optional the following input is ambiguous:
some content == Title == something
There is two ways to match the input:
some content (== Title ==) something
CONTENT      h2            CONTENT
and
some content (== Title) (== something)
CONTENT      h2            h2
Note that if you use the -l option when compiling the grammar, the tool outputs the log as a MHTML file and adds some additional information, including some example of ambiguous inputs for the reported conflicts.

There are several ways to fix this. The first one is to make the closing tag mandatory, which is simple enough to do.
If you really want to make the closing tag optional I see no easy way to disambiguate the grammar. In fact, is think this grammar is not even LR(*) because on the example above the ambiguity persists even after having reached the end of the input.
Still, if you don't care about determinism, you can still ignore the conflict and compile the grammar with the '-m:rnglr" option. This will force the compiler to generate a RNGLR parser. Such parser will give you one of the two matches in the example above, but which one exactly is undetermined. The consistency across multiple parses on the same input is also not guaranteed, meaning it can (but likely won't) return one match at a given time and the other at another time.

On another note, you may find specifications for Wikitext here, including an attempt using EBNF here. According to these links, and in particular this one, a closing tag is always necessary. Note that they also define line endings as significant in the definition of headers.
Aug 16, 2013 at 4:08 PM
Thank you very much indeed.
I did not want to make the closing tag optional, what I wanted was to accept forms that do not match heading formatting as raw text.
I introduced the broken heading rule in an attempt to escape the shift/reduce conflict there (which itself led to another shift/reduce).
I will find a way out though, you have been way more than helpful.
Thank you again.