Issue: Why is an input string without whitespace recognized as two separate tokens?

Oct 15, 2014 at 6:43 AM
Edited Oct 15, 2014 at 6:47 AM
Hi again )
I noticed some strange behavior: an input string like this
is parsed successfully without any errors, and the generated AST is correct.
I do not understand why.
My grammar contains the following terminals and rules:
        Axiom = <my_root_rule> ;
        Separator = "SEPARATOR" ;
        WHITE_SPACE     -> ' ' | '\t' | '\r' | '\n' ;
        SEPARATOR       -> WHITE_SPACE+;
        IDENTIFIER      -> [a-zA-Z_][a-zA-Z0-9_]* ;
        INTEGER         -> [0-9]* ;
        /*other rules skipped*/
        if_statement -> 'IF' expression 'THEN' statement_list 
                        ('ELSIF' expression 'THEN' statement_list)*
                        ('ELSE' statement_list)?
                        'END_IF' ;
Could you help me? Thanks
Oct 15, 2014 at 7:06 AM

In the second case the parse is successful because there is technically no error. The semantics of the separator is that it specifies a token that is silently dropped whenever it is matched. Usually this corresponds to whitespace and line-ending sequences. The subtlety is that a separator token is not required between other tokens. This is easy to see in a C-like language:
In this example there is no whitespace, but it is still a valid piece of code. The matched tokens would be:
[if] [(] [x] [==] [0] [)]
Exactly the same process applies to your example; the matched tokens are:
[IF] [x] [>] [9] [THEN]
The fact that a separator is not required between [9] and [THEN] follows from your lexical rules. An identifier cannot begin with a digit, so the input string "9THEN" is unambiguously matched as INTEGER followed by 'THEN'.

On the other hand, a separator is always required between two IDENTIFIER tokens; otherwise they would be matched as a single identifier.

I hope this helps!

Oct 15, 2014 at 7:10 AM
To be absolutely complete, the matching algorithm always tries to find the longest matching rule at the current position. Once a rule is matched, the lexer advances by the length of the matched token and repeats. Along the way, it drops all separator tokens.

Hence in your case, facing the string "9THEN", the longest matching rule is INTEGER. It matches "9" as an integer and advances. It then faces the string "THEN", which is matched as the (implicit) token 'THEN'.
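
The longest-match process described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the generator's actual implementation. The rule table, the `tokenize` function and the sample input "IF x > 9THEN" are assumptions made for the example (the input is merely consistent with the token list given earlier), INTEGER is written with `+` so that it never matches an empty string, and ties are resolved in favour of the rule listed first, which is also an assumption.

```python
import re

# Hypothetical rule table loosely mirroring the grammar in the question.
# Keywords come before IDENTIFIER so that they win when both match the
# same length (tie-breaking by rule order is an assumption of this sketch).
TOKEN_RULES = [
    ("SEPARATOR", re.compile(r"[ \t\r\n]+")),
    ("KEYWORD", re.compile(r"IF|THEN|ELSIF|ELSE|END_IF")),
    ("IDENTIFIER", re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")),
    ("INTEGER", re.compile(r"[0-9]+")),
    ("OPERATOR", re.compile(r"[<>=]")),
]


def tokenize(text):
    """Split text into (rule, lexeme) pairs using longest-match lexing."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        # Try every rule at the current position and keep the longest match.
        for name, pattern in TOKEN_RULES:
            match = pattern.match(text, pos)
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        # Separators are matched like any other token but silently dropped.
        if best_name != "SEPARATOR":
            tokens.append((best_name, text[pos:pos + best_len]))
        # Advance by the length of the matched token and repeat.
        pos += best_len
    return tokens


if __name__ == "__main__":
    print(tokenize("IF x > 9THEN"))  # "9THEN" splits into INTEGER and KEYWORD
    print(tokenize("xTHEN"))         # a single IDENTIFIER, no keyword
```

In this sketch "9THEN" is split because no IDENTIFIER can start at the digit, whereas "xTHEN" stays a single identifier because IDENTIFIER gives the longest match at that position.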
Oct 15, 2014 at 9:13 AM
Edited Oct 15, 2014 at 9:27 AM
Thanks for the quick and detailed reply!
As I understand it, I must rewrite my rules like this:
if_statement -> 'IF' expression WHITE_SPACE 'THEN' statement_list 
                ('ELSIF' expression WHITE_SPACE 'THEN' statement_list)*
                ('ELSE' statement_list)?
                'END_IF' ;
or is there a better solution for this?

PS: Hmm, I tried it, but it does not work (error: WHITE_SPACE expected). I think the tokenizer ignores all whitespace because the option Separator = "SEPARATOR" is set. For this reason my modified rule cannot be recognized.
Oct 15, 2014 at 10:32 AM
If you want to force the presence of whitespace, you will indeed have to rewrite the rule. However, you would have to rewrite all of the rules and remove the Separator option, as you noted in your PS above.

I don't see any easy solution to force whitespace in this specific spot, so I would recommend leaving it as is.
If you think this is a problem, please file a bug on Bitbucket.

I'm currently working on support for context-sensitive lexing, i.e. lexical rules that depend on the parser's context. This is meant to solve issues similar to this one (although the current work in progress won't solve this specific issue).
Oct 16, 2014 at 7:04 AM
Thanks for your advice, I think this question may be closed for now )