How to get human-readable tokens from the lexer?

Oct 21, 2014 at 6:29 AM
Hi, Laurent!
I want to get all the tokens from a source text, so I am trying to use the following construction:
Token token;
while (...)
    token = lexer.GetNextToken();
but the Token struct contains only the 'int Index' and 'int SymbolID' properties. How can I get more detailed information for each token from the lexer?
Oct 21, 2014 at 6:43 AM
Edited Oct 21, 2014 at 6:44 AM

This can be easily achieved as follows:
token = lexer.GetNextToken();
Symbol symbol = lexer.Output[token.Index];
// get the name of the symbol (terminal name in the grammar)
string name = symbol.Name;
// get the content of the token
string value = symbol.Value;
Note that lexer.Output is of type TokenizedText and has a bunch of useful methods to retrieve information about the tokens (position, etc.).
You can refer to the API documentation for this.
Oct 21, 2014 at 6:46 AM
So fast! Cool )
Thanks for your reply!
Oct 21, 2014 at 6:51 AM
I'm sorry to be obtrusive, but maybe there should be a composite Token with all the relevant info? It would be more intuitive and better... I think.
Oct 21, 2014 at 7:04 AM
Yes, the API has some kind of abstraction leakage ...
I would have hoped to keep the Token concept internal. It is there for performance reasons. The parser only needs the ID of the matched terminal, as well as an index identifying the matched piece of text. This is the Token.
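To illustrate the idea, here is a minimal sketch of a token that is only a terminal ID plus an index, with the readable data kept in a separate table. The types MiniToken and MiniTokenTable are hypothetical and exist only for this example; they are not the Hime.Redist classes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration only; NOT the Hime.Redist API.
struct MiniToken
{
    public int SymbolID; // ID of the matched terminal
    public int Index;    // index of the matched piece of text in the table

    public MiniToken(int symbolID, int index)
    {
        SymbolID = symbolID;
        Index = index;
    }
}

class MiniTokenTable
{
    private readonly List<string> names = new List<string>();
    private readonly List<string> values = new List<string>();

    // Stores the readable data once and hands back a small value-type token.
    public MiniToken Add(int symbolID, string name, string value)
    {
        names.Add(name);
        values.Add(value);
        return new MiniToken(symbolID, names.Count - 1);
    }

    // The readable data is looked up through the token's index.
    public string NameOf(MiniToken token) { return names[token.Index]; }
    public string ValueOf(MiniToken token) { return values[token.Index]; }
}

static class Demo
{
    static void Main()
    {
        MiniTokenTable table = new MiniTokenTable();
        MiniToken token = table.Add(3, "NUMBER", "42");
        Console.WriteLine(table.NameOf(token) + " = " + table.ValueOf(token));
    }
}
```

The token itself stays two integers wide, which is cheap to pass around; the cost is that the caller has to go through the table to get anything readable.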
Oct 21, 2014 at 7:51 AM
This discussion could last a long time; there are as many opinions about tokens as there are people )
Of course, internally a token is just some indexed data, but I think there is nothing wrong with public properties and return values (like Token) carrying more information about themselves. This would improve readability and reduce potential errors in user code.

You suggest using:
token = lexer.GetNextToken();
Symbol symbol = lexer.Output[token.Index];
This works, but conceptually the token is purely internal data, and so is Output. Why must the user operate on the lexer's internal data structures? This internal indexed data is not clear to the user.

C# is a high-level language, and the programmer wants to see clear and understandable data instead of indexed values.

I hope you understand me correctly. Sorry for my English )
Oct 21, 2014 at 9:39 AM
Hi again.
Now I have all the symbols from the lexer, but I have two questions:
  • how can I get the line and symbol position (at the lexer stage)?
  • how can I "reset" the lexer before parsing (i.e. reset the cursor position so that parsing starts from the beginning)?
Oct 21, 2014 at 10:59 AM
Getting the position is straightforward:
token = lexer.GetNextToken();
TextPosition position = lexer.Output.GetPositionOf(token.Index);
As to the second point, the short answer is that you can't. A lexer (and a parser) is one-shot: it is created, used and then discarded. If you need to parse (or lex) the same input multiple times, you simply create a new parser (or lexer). However, what you can do is first parse the input and then inspect the tokens that were used. You can always look back at the tokens after the parse:
MyLexer lexer = new MyLexer("some input");
MyParser parser = new MyParser(lexer);
ParseResult result = parser.Parse();

for (int i = 0; i != result.Input.TokenCount; i++)
{
    Symbol symbol = result.Input[i];
    TextPosition position = result.GetPositionOf(i);
}
Oct 21, 2014 at 5:45 PM
Edited Oct 21, 2014 at 5:45 PM
Thanks for your reply!
I am trying to build this, but the method GetPositionOf() is unresolved. Which version of Hime.Redist is needed?
Currently I use Hime.Redist v1.3.0.
Oct 22, 2014 at 8:14 AM
Edited Oct 22, 2014 at 8:15 AM
My bad, there is a typo. It should be:
TextPosition position = result.Input.GetPositionOf(i);
Oct 22, 2014 at 6:17 PM
Yes, this works!