Is it possible to split a token into 2 in Antlr4?


I need to be able to split one token into 2 for highlighting purposes. I have a token that looks like this:
ID_INTERP: '$' IDEN;
but I want to highlight the dollar sign differently from the identifier. Is it possible to split this token into two, one with the dollar sign and the other with the identifier? I know I can change the entire token into a different type under certain conditions, but I'd like to be able to add tokens and change the text they contain, that is, to change the token stream so that instead of
ID_INTERP["$foo"]
it would see something like this:
DOLLAR_SIGN["$"] IDEN["foo"]

It is possible by extending your token source to emit more than a single token for a given match. I have used this idea to generate 2 tokens for the lexer rule DOT_IDENTIFIER (see the MySQL grammar in the MySQL Workbench parser). On a match it pushes a dot token and sets the result to IDENTIFIER, effectively creating 2 separate tokens for a single rule.
Sam Harwell described the technique to extend your lexer for this approach in his answer with some Java code. And here is a possible C++ implementation that I'm using:
std::unique_ptr<antlr4::Token> MySQLBaseLexer::nextToken() {
  // First respond with pending tokens to the next token request, if there are any.
  if (!_pendingTokens.empty()) {
    auto pending = std::move(_pendingTokens.front());
    _pendingTokens.pop_front();
    return pending;
  }

  // Let the main lexer class run the next token recognition.
  // This might create additional tokens again.
  auto next = Lexer::nextToken();
  if (!_pendingTokens.empty()) {
    auto pending = std::move(_pendingTokens.front());
    _pendingTokens.pop_front();
    _pendingTokens.push_back(std::move(next));
    return pending;
  }
  return next;
}
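To see the queueing logic in isolation, here is a minimal sketch that does not depend on the ANTLR runtime. Everything in it (Tok, SplittingLexer, matchOne, tokenize) is a hypothetical stand-in invented for illustration; only the shape of nextToken() mirrors the override above. The toy matcher plays the role of the generated lexer rule and its action: when it sees "$name" it pushes a DOLLAR_SIGN token onto the pending queue and returns an IDEN token.

```cpp
#include <cctype>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token type for this sketch; not the ANTLR runtime's Token.
struct Tok {
    std::string type;
    std::string text;
};

// Toy lexer: recognizes "$name" and plain names separated by spaces.
class SplittingLexer {
public:
    explicit SplittingLexer(std::string input) : _input(std::move(input)) {}

    // Same shape as the override above: drain the pending queue first.
    Tok nextToken() {
        if (!_pendingTokens.empty()) {
            Tok pending = std::move(_pendingTokens.front());
            _pendingTokens.pop_front();
            return pending;
        }
        Tok next = matchOne();  // may push extra tokens onto the queue
        if (!_pendingTokens.empty()) {
            Tok pending = std::move(_pendingTokens.front());
            _pendingTokens.pop_front();
            _pendingTokens.push_back(std::move(next));
            return pending;
        }
        return next;
    }

private:
    std::string scanName() {
        std::size_t start = _pos;
        while (_pos < _input.size() &&
               (std::isalnum(static_cast<unsigned char>(_input[_pos])) || _input[_pos] == '_'))
            ++_pos;
        return _input.substr(start, _pos - start);
    }

    // Stand-in for a generated lexer rule plus its action.
    Tok matchOne() {
        while (_pos < _input.size() && _input[_pos] == ' ') ++_pos;
        if (_pos >= _input.size()) return {"EOF", ""};
        if (_input[_pos] == '$') {
            ++_pos;
            std::string name = scanName();
            if (name.empty()) return {"DOLLAR_SIGN", "$"};
            _pendingTokens.push_back({"DOLLAR_SIGN", "$"});  // emitted first
            return {"IDEN", name};                           // emitted second
        }
        std::string name = scanName();
        if (!name.empty()) return {"IDEN", name};
        return {"UNKNOWN", std::string(1, _input[_pos++])};
    }

    std::string _input;
    std::size_t _pos = 0;
    std::deque<Tok> _pendingTokens;
};

// Collects all tokens up to (not including) EOF.
std::vector<Tok> tokenize(const std::string& input) {
    SplittingLexer lexer(input);
    std::vector<Tok> out;
    for (Tok t = lexer.nextToken(); t.type != "EOF"; t = lexer.nextToken())
        out.push_back(t);
    return out;
}
```

Calling tokenize("$foo bar") yields DOLLAR_SIGN, IDEN, IDEN in that order. In a real grammar you would also have to give each emitted token the correct start and stop offsets, which this sketch ignores.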

Related

Bison: Syntax Error processing, unexpected and undefined<token>

I want to handle undefined and unexpected token errors in the yyerror function (or in another function, if that's possible).
For example, I get an error message from Bison:
...
LAC: checking lookahead EXECSQL: S4
Error: popping nterm component_list ()
Stack now 0
Cleanup: discarding lookahead token $undefined ()
Stack now 0
ERRSTAT = "%X0000002C"
But I want to print which token was not found, along with the line number. Is it possible to implement this in Bison, and how?

The special token $undefined is reported when yylex returns a token number which doesn't appear in any parser rule. Most of the time, that's the result of the lexer fallback rule:
. { return yytext[0]; }
But it can also happen if you declare a token in your parser file, and the lexer returns that token, but the token is never actually used in any rule.
Unused tokens don't have names, in the sense that the array of names which Bison includes in your parser doesn't include unused tokens, and so there's no way to look up what the token name originally was. You can, however, often get the token number from the variable yychar. If that number is greater than 0 and less than 256, then the token is probably a single-character token, and you could use that to print an additional error message. However, there's no simple way to modify the error message generated by Bison's verbose error messages; if you're using that feature, you'll still see the invalid token message.
In order to print line numbers, you only need to enable line number counting in the lexical scanner, using
%option yylineno
in your Flex (.l) file. Then you can print the value of yylineno in yyerror. (If you're using a "pure" (reentrant) scanner, then yylineno will be in the scanner_t object. In the normal use case where that object is an extra parser argument, it will also be available inside yyerror.)
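As a rough sketch of the two ideas above (checking whether the token number looks like a single character, and reporting the line), the helper below builds the extra message. The name describeError is made up for this example. In a real build, yyerror would pass it the scanner's yylineno and, assuming a non-reentrant parser where yychar is reachable from yyerror, the value of yychar; here both are plain parameters so that the code compiles on its own.

```cpp
#include <string>

// Hypothetical helper, not part of flex or bison.
// In a real project yyerror might call it roughly like this:
//   extern int yylineno;  // maintained by flex with %option yylineno
//   extern int yychar;    // assumption: non-reentrant parser exposing yychar
//   describeError(msg, yychar, yylineno);
std::string describeError(const std::string& msg, int token, int line) {
    std::string out = "line " + std::to_string(line) + ": " + msg;
    // Token numbers between 1 and 255 are probably single-character tokens,
    // typically produced by a fallback rule such as: . { return yytext[0]; }
    if (token > 0 && token < 256) {
        out += " (unexpected character '";
        out += static_cast<char>(token);
        out += "')";
    }
    return out;
}
```

For a token number of 256 or above the function only prefixes the line number, since there is no reliable way to recover the name of an unused token.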
I know that the above is a bit confusing because there are a lot of different code-generation options with slightly different behaviours. You didn't specify the particular options you're using, so the answer is a bit generic.

Is there a way to get the number of tokens in an ANTLR4 parser rule?

In ANTLR4, it seems that predicates can only be placed at the front of sub-rules in order for them to cause the sub-rule to be skipped. In my grammar, some predicates depend on a token that appears near the end of the sub-rule, with one or more rule invocations in front of it. For example:
date :
{isYear(_input.LT(3).getText())}?
month day=INTEGER year=INTEGER { ... }
In this particular example, I know that month is always one single token, so it is always Token 3 that needs to be checked by isYear(). In general, though, I won't know the number of tokens making up a rule like month until runtime. Is there a way to get its token count?

There is no built-in way to get the length of the rule programmatically. You could use the documentation for ATNState in combination with the _ATN field in your parser to calculate all paths through a rule. If all paths through the rule contain the same number of tokens, then you have calculated the exact number of tokens used by the rule.
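The path calculation itself can be sketched on a toy graph. The Edge, Graph and Range types below are hypothetical and are not the ANTLR ATN API; they only model "a transition either consumes one token or is an epsilon transition". Translating a real rule's ATN states into such a graph is left out. The sketch walks every path from the rule's start state to its stop state and reports a count only if all paths agree; any loop (as produced by * or + subrules) is treated as "no fixed count".

```cpp
#include <algorithm>
#include <vector>

// Hypothetical simplified model of a rule's state graph (not ANTLR's ATN classes).
struct Edge {
    int target;          // index of the state this transition leads to
    bool consumesToken;  // true for a token match, false for an epsilon move
};
using Graph = std::vector<std::vector<Edge>>;

struct Range {
    bool bounded;  // false if a loop or dead end was found
    int min;
    int max;
};

// Minimum and maximum number of tokens on any path from `state` to `stop`.
Range tokenRange(const Graph& g, int state, int stop, std::vector<char>& onPath) {
    if (state == stop) return {true, 0, 0};
    if (onPath[state]) return {false, 0, 0};  // revisiting a state: loop
    onPath[state] = 1;
    Range result{false, 0, 0};  // stays unbounded if the state has no edges
    for (const Edge& e : g[state]) {
        Range sub = tokenRange(g, e.target, stop, onPath);
        if (!sub.bounded) {
            onPath[state] = 0;
            return {false, 0, 0};
        }
        int cost = e.consumesToken ? 1 : 0;
        if (!result.bounded) {
            result = {true, sub.min + cost, sub.max + cost};
        } else {
            result.min = std::min(result.min, sub.min + cost);
            result.max = std::max(result.max, sub.max + cost);
        }
    }
    onPath[state] = 0;
    return result;
}

// True (and sets `count`) only if every path uses the same number of tokens.
bool fixedTokenCount(const Graph& g, int start, int stop, int& count) {
    std::vector<char> onPath(g.size(), 0);
    Range r = tokenRange(g, start, stop, onPath);
    if (!r.bounded || r.min != r.max) return false;
    count = r.min;
    return true;
}
```

The plain depth-first walk is exponential in the worst case; for anything beyond small rules, memoizing the range per state would be the obvious refinement.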

How to get current token in yyerror?

My problem is that the message passed to yyerror is already formatted (i.e. it is actually an English explanation what went wrong), and what I would like to get is just the current token (i.e. the one before the error pseudo-token).
So how to get it?
I use gplex/gppg which are lex/yacc implementations in C#.
I am sorry for not being 100% precise -- what I need is token (symbol) not the body (text) which was matched (by the token).
Let's say I have a rule [A-Za-z0-9_]+ constitutes an ID. So I would like to get token ID not a foobar.

Found this in an old project of mine, with a redefined yyerror:
int yyerror (char *msg) {
    printf("oha, %s: '%s' in line %d\n", msg, yytext, yylineno);
    return 0;
}
This was a C++ project using flex/bison. I think the interesting part for you is yytext, which holds the text of the most recently matched token.

There's no standard, but bison and most versions of yacc store the current token in yychar. Unfortunately, this is generally a local variable (of yyparse), so you can't access it in other functions (such as yyerror), only in parser actions.
It might be helpful if you say WHY you want the current token; it's not generally a useful piece of information. You mention the error pseudo-token, which makes no sense, as that is associated with error recovery, not errors as such. By the time it comes into the picture, a bunch of tokens from the input have normally been discarded.

Issues of Error handling with ANTLR3

I tried error reporting in following manner.
@members {
    public String getErrorMessage(RecognitionException e, String[] tokenNames)
    {
        List stack = getRuleInvocationStack(e, this.getClass().getName());
        String msg = null;
        if (e instanceof NoViableAltException) {
            <some code>
        }
        else {
            msg = super.getErrorMessage(e, tokenNames);
        }
        String[] inputLines = e.input.toString().split("\r\n");
        String line = "";
        if (e.token.getCharPositionInLine() == 0)
            line = "at \"" + inputLines[e.token.getLine() - 2];
        else if (e.token.getCharPositionInLine() > 0)
            line = "at \"" + inputLines[e.token.getLine() - 1];
        return ": " + msg.split("at")[0] + line + "\" => [" + stack.get(stack.size() - 1) + "]";
    }

    public String getTokenErrorDisplay(Token t) {
        return t.toString();
    }
}
And now errors are displayed as follows.
line 6:7 : missing CLOSSB at "int a[6;" => [var_declaration]
line 8:0 : missing SEMICOL at "int p" => [var_declaration]
line 8:5 : missing CLOSB at "get(2;" => [call]
I have 2 questions.
1) Is there a proper way to do the same thing I have done?
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
Thank you.

1) Is there a proper way to do the same thing I have done?
I don't know if there is a defined proper way of showing errors. My take on showing errors is a litmus test: if the user can figure out how to fix the error based on what you have given them, then it is good. If the user is confused by the error message, then the message needs more work. Based on the examples given in the question, symbols were only char constants.
My favorite way of seeing errors is with the line with an arrow pointing at the location.
i.e.
Expected closing brace on line 6.
int a[6;
       ^
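A minimal sketch of producing that kind of message, assuming you already have the offending source line and a 0-based column (for example from the token's line and char position). The function name formatWithCaret is made up for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the message, the source line, and a third
// line with a caret under the given 0-based column.
std::string formatWithCaret(const std::string& message,
                            const std::string& sourceLine,
                            std::size_t column) {
    if (column > sourceLine.size()) column = sourceLine.size();
    std::string padding;
    for (std::size_t i = 0; i < column; ++i) {
        // Copy tabs so the caret still lines up when the line is indented with tabs.
        padding += (sourceLine[i] == '\t') ? '\t' : ' ';
    }
    return message + "\n" + sourceLine + "\n" + padding + "^";
}
```

With the message "Expected closing brace on line 6.", the line "int a[6;" and column 7, the caret ends up under the semicolon.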
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
You will have to read the separately generated token file and then make a map, i.e. a dictionary data structure, to translate the token name into the token character(s).
EDIT
First we have to clarify what is meant by symbol. If you limit the definition of symbol to only those tokens that are defined in the tokens file with a char or string, e.g. '!'=13 or 'public'=92, then this can be done. If, however, you choose to define symbol as any text associated with a token, then that is something other than what I have addressed or plan to address.
When ANTLR generates its token map it uses three different sources:
The char or string constants in the lexer
The char or string constants in the parser.
Internal tokens such as Invalid, Down, Up
Since the tokens in the lexer are not the complete set, one should use the tokens file as a starting point. If you look at the tokens file you will note that the lowest value is 4. If you look at the TokenTypes file (This is the C# version name) you will find the remaining defined tokens.
If you find names like T__ in the tokens file, those are the names ANTLR generated for the char/string literals in the parser.
If you are using string and/or char literals in parser rules, then ANTLR must create a new set of lexer rules that include all of the string and/or char literals in the parser rules. Remember that the parser can only see tokens and not raw text. So string and/or char literals cannot be passed to the parser.
To see the new set of lexer rules, use org.antlr.Tool -Xsavelexer, and then open the created grammar file. The name may be like.g . If you have string and/or char literals in your parser rules you will see lexer rules with names starting with T__.
Now that you know all of the tokens and their values you can create a mapping table from the info given in the error to the string you want to output instead for the symbol.
The code at http://markmail.org/message/2vtaukxw5kbdnhdv#query:+page:1+mid:2vtaukxw5kbdnhdv+state:results
is an example.
However, the mapping of the tokens can change, for example when rules in the lexer or char/string literals in the parser change. So if the message suddenly outputs the wrong string for a symbol, you will have to update the mapping table by hand.
While this is not a perfect solution, it is a possible solution depending on how you define symbol.
Note: Last time I looked ANTLR 4.x creates the table automatically for access within the parser because it was such a problem for so many with ANTLR 3.x.
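Building the mapping table from the tokens file could look roughly like the sketch below. It assumes the file contains lines of the form NAME=type as well as 'literal'=type (like the '!'=13 example above), and that a token name and a literal sharing the same type number belong together. The function name loadSymbolMap is made up for this example, and tokens without a literal entry are simply left out of the map.

```cpp
#include <cstddef>
#include <exception>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: maps token names (e.g. SEMICOL) to their literal text
// (e.g. ;) by joining NAME=type and 'literal'=type lines on the type number.
std::map<std::string, std::string> loadSymbolMap(std::istream& tokens) {
    std::map<int, std::string> literalByType;
    std::map<std::string, int> typeByName;
    std::string line;
    while (std::getline(tokens, line)) {
        // Split at the LAST '=' so that literals such as '==' stay intact.
        std::size_t eq = line.rfind('=');
        if (eq == std::string::npos || eq == 0) continue;
        std::string key = line.substr(0, eq);
        int type = 0;
        try {
            type = std::stoi(line.substr(eq + 1));
        } catch (const std::exception&) {
            continue;  // not a "key=number" line
        }
        if (key.size() >= 2 && key.front() == '\'' && key.back() == '\'') {
            literalByType[type] = key.substr(1, key.size() - 2);
        } else {
            typeByName[key] = type;
        }
    }
    std::map<std::string, std::string> symbolByName;
    for (const auto& entry : typeByName) {
        auto it = literalByType.find(entry.second);
        if (it != literalByType.end()) symbolByName[entry.first] = it->second;
    }
    return symbolByName;
}
```

Reading the file at startup instead of hard-coding the table avoids the manual update mentioned above when token numbers shift, as long as the tokens file is regenerated together with the parser.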

Bhathiya wrote:
1) Is there a proper way to do the same thing I have done?
There is no single way to do this. Note that proper error-handling and reporting is tricky. Terence Parr spends a whole chapter on this in The Definitive ANTLR Reference (chapter 10). I recommend you get hold of a copy and read it.
Bhathiya wrote:
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
You can't. For SEMICOL this may seem easy to do, but how would you get this information for a token like FOO:
FOO : (X | Y)+;
fragment X : '4'..'6';
fragment Y : 'a' | 'bc' | . ;

What does it mean when yacc {code} are in the middle?

extdefs:
{$<ttype>$ = NULL_TREE; } extdef
| extdefs {$<ttype>$ = NULL_TREE; } extdef
;
Why is it in the middle?

It can appear anywhere. Sometimes it's useful to have something done in between the tokens, especially in this kind of alternation.
In the standard description of the yacc utility it's said that:
Actions can occur anywhere in a rule (not just at the end); an action can access values returned by actions to its left, and in turn the value it returns can be accessed by actions to its right. An action appearing in the middle of a rule shall be equivalent to replacing the action with a new non-terminal symbol and adding an empty rule with that non-terminal symbol on the left-hand side. The semantic action associated with the new rule shall be equivalent to the original action. The use of actions within rules might introduce conflicts that would not otherwise exist.
