I'm asking this as a follow-up to an earlier question. We have a line-break problem when formatting the for each (... in ...) statement in C++/CLI.
My question is quite simple: in C++/CLI, is for each two tokens for the tokenizer, or a single token with a space (0x20) inside it?
My point is that I thought no token could contain a space, but when the two words are split across two lines (by the clang-format formatter) like this:
for
each ( auto a in b )
{
}
I get the errors:
C2061 syntax error: identifier 'each'
C2143 syntax error: missing ';' before '{'
But when I write:
for each ( a in aa ) {}
I have no error.
I want to handle undefined and unexpected token errors in the yyerror function (or in another function, if that's possible).
For example, I get an error message from Bison:
...
LAC: checking lookahead EXECSQL: S4
Error: popping nterm component_list ()
Stack now 0
Cleanup: discarding lookahead token $undefined ()
Stack now 0
ERRSTAT = "%X0000002C"
But I want to print which token wasn't found, along with the line number. Is it possible to implement this in Bison, and how?
The special token $undefined is reported when yylex returns a token number which doesn't appear in any parser rule. Most of the time, that's the result of the lexer fallback rule:
. { return yytext[0]; }
But it can also happen if you declare a token in your parser file, and the lexer returns that token, but the token is never actually used in any rule.
Unused tokens don't have names, in the sense that the array of names which Bison includes in your parser doesn't include unused tokens, and so there's no way to look up what the token name originally was. You can, however, often get the token number from the variable yychar. If that number is greater than 0 and less than 256, then the token is probably a single-character token, and you could use that to print an additional error message. However, there's no simple way to modify the error message generated by Bison's verbose error messages; if you're using that feature, you'll still see the invalid token message.
In order to print line numbers, you only need to enable line number counting in the lexical scanner, using
%option yylineno
in your Flex (.l) file. Then you can print the value of yylineno in yyerror. (If you're using a "pure" (reentrant) scanner, then the line number is kept in the yyscan_t scanner object and can be read with yyget_lineno(scanner). In the normal use case where that object is an extra parser argument, it will also be available inside yyerror.)
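As a rough illustration of the range check described above, here is a minimal sketch. It is written in Java only to keep it short and self-contained; in a C parser the same test would be applied to yychar inside yyerror, with yylineno supplying the line. The class and method names (TokenDescriber, describe) are hypothetical and not part of Bison or Flex.

```java
// Hypothetical helper, not part of Bison: it only mirrors the range check described above.
final class TokenDescriber {
    private TokenDescriber() {}

    /** Builds an extra error message from a token number and a line number. */
    static String describe(int tokenNumber, int lineNumber) {
        if (tokenNumber > 0 && tokenNumber < 256) {
            // Probably a single-character token returned by the lexer fallback rule.
            return "line " + lineNumber + ": unexpected character '" + (char) tokenNumber + "'";
        }
        // Outside that range the original token name cannot be looked up.
        return "line " + lineNumber + ": unexpected token number " + tokenNumber;
    }
}
```

This only adds a message of your own; it does not change the text produced by Bison's verbose error reporting.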
I know that the above is a bit confusing because there are a lot of different code-generation options with slightly different behaviours. You didn't specify the particular options you're using, so the answer is a bit generic.
I am aware of what the implicit token definition error in the parser means, but I am having difficulty getting rid of it (ANTLR v4).
Stripped-down statements:
enum_decl : GTYPE_ENUM ID LSQUARE STRING STRING* RSQUARE SEMI ;
string_decl: GTYPE_STRING ID (COMMA ID)* SEMI ;
In string_decl, that error appears on SEMI
In enum_decl the same error is on RSQUARE
GTYPE_ENUM, ID, etc. all are defined / accepted correctly, in the Lexer section.
Have you tried typing in just that tiny section to find a small test case that doesn't work? Without a grammar to test, there's nothing we can do. It is either a bug or a problem with your grammar.
I have both books by T. Parr about ANTLR and I see the dollar sign all over them, with references to symbols. It work(ed) for me too:
term : IDENT -> { new TokenNode($IDENT) };
or something more complex:
type_enum : 'enum' name=IDENT '=' val+=IDENT (',' val+=IDENT)* ';'
-> { new EnumNode($name,$val) };
But this line gives me an absurd error:
not_expr : term
| NOT ex=not_expr -> { new UnaryExpression($NOT,$ex) };
The error says missing attribute access on rule scope: ex. You know what the fix is? Removing the dollar sign on "ex". That's it.
Out of curiosity I checked the mentioned rules (above) and removed the dollar sign -- they work as before (i.e. I don't get any error).
QUESTION: so what is this story with dollar sign? Should I not use it? Or should I use it until I get an error?
I would not ask this question if I had not seen this convention used almost as a standard in ANTLR.
QUESTION: so what is this story with dollar sign? Should I not use it? Or should I use it until I get an error?
It depends what you want to reference.
Understand that there are 3 different types of "labels":
name=IDENT, the label name references a CommonToken;
val+=IDENT, the label val references a List containing CommonToken instances, in this case;
ex=not_expr the label ex references a ParserRuleReturnScope
I recommend always using a $. I don't know if it is by design that NOT ex=not_expr -> { new UnaryExpression($NOT,$ex) }; doesn't work, but to get a hold of whatever not_expr matched, I'd simply do this:
not_expr : term
| NOT ex=not_expr -> { new UnaryExpression($NOT, $ex.tree) }
;
I don't see why you'd want to get a hold of the entire ParserRuleReturnScope: the tree holds all the information you need.
HTH
I'm using ANTLR version 3.4.
First question, please see grammar:
request: 'C' DELIM source DELIM target
{ System.out.println("Hi"); }
;
source: ID ;
target: ID ;
DELIM: '|' ;
fragment ALPHA: 'a'..'z' | 'A'..'Z' ;
fragment NUM: '0'..'9' ;
ID: ALPHA (ALPHA | NUM)* ;
"source" and "target" cannot be empty. But my test shows the following:
for input "C|n1|n2" : normal case, no problem.
for input "C||n2" : syntax error, and "Hi" not printed. Expected. Ok
for input "C|n1|" : syntax error, but "Hi" is printed. Not good.
I do need to set other things if "request" token is reached. But from above even for syntax error the code still reaches "request" token. Why?
Second question: how do I specify a rule for fixed length token, for example, a token of exact 10 digits?
Third question is about error handling. I override emitErrorMessage() in parser to set an error flag, but I found another emitErrorMessage() in lexer. I don't want to share the error flag between the parser and lexer objects. Can I override emitErrorMessage() in lexer to do nothing, and totally rely on the parser to report error? Or put another way, if there is an error, will the parser capture it for sure?
And if the error flag is set for one error, can the parser actually recover and match another rule, so that the previous error is a false alarm?
Thanks for any help!
...
for input "C|n1|" : syntax error, but "Hi" is printed. Not good.
I do need to set other things if "request" token is reached. But from above even for syntax error the code still reaches "request" token. Why?
Because the parser tries to recover from this. If you don't want the parser to (try to) recover from mis-matched tokens, simply throw an exception like this:
grammar T;
// options...
@members {
  @Override
  public void emitErrorMessage(String message) {
    throw new RuntimeException(message);
  }
}
request
  : 'C' DELIM source DELIM target { System.out.println("Hi"); }
  ;
// more rules...
Note that @members is short for @parser::members: it will only cause emitErrorMessage(...) to be overridden in the parser, not the lexer. For lexer members, you need to use @lexer::members.
Second question: how do I specify a rule for fixed length token, for example, a token of exact 10 digits?
See: ANTR3 set the number of accepted characters for a token
Third question is about error handling. ...
See the first part of my answer: simply override emitErrorMessage() and do nothing in it (the default action is to print to System.err).
Can I override emitErrorMessage() in lexer to do nothing, and totally rely on the parser to report error?
Well, the parser and lexer handle different types of errors, so ignoring certain problems in the lexer might not cause the parser to produce a warning/error.
Bart, your help is great. I also thought it through and understood the behavior for Question#1 is legitimate. Like a compiler the parser will recover and continue to find as many errors as possible.
For question#2, I also figured out some way to do fixed length. Don't know if it's the popular way:
example : exact3 '|' exact4 ;
// method 1:
exact3 : (d+=DIGIT)+ {$d!=null && $d.size()==3}? ;
// method 2
exact4 : atmost4 {$atmost4.text.length()==4}? ;
atmost4
@init {int n=1;}
: ({n<=4}?=>DIGIT {n++;})+
;
DIGIT:'0'..'9' ;
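Both methods boil down to checking the length of what was matched. A minimal sketch of that check as plain Java, outside ANTLR, in case it helps to see what the predicates are testing; FixedLength and isExactDigits are hypothetical names of mine, not anything ANTLR provides:

```java
// Hypothetical helper: the same test the semantic predicates above perform,
// applied to text that has already been matched.
final class FixedLength {
    private FixedLength() {}

    /** True when text is exactly n characters long and every character is in '0'..'9'. */
    static boolean isExactDigits(String text, int n) {
        if (text == null || text.length() != n) {
            return false;
        }
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }
}
```

A check like this could also run after parsing, on the text of a more permissive digits rule, if predicates in the grammar turn out to be awkward.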
For question#3, I'll do fail on first error, i.e. override emitErrorMessage() in both lexer and parser to throw an exception. The choice of emitErrorMessage(msg) is because it has the error message properly prepared.
Thanks all who are sharing!
I tried error reporting in the following manner.
@members{
public String getErrorMessage(RecognitionException e,String[] tokenNames)
{
List stack=getRuleInvocationStack(e,this.getClass().getName());
String msg=null;
if(e instanceof NoViableAltException){
<some code>
}
else{
msg=super.getErrorMessage(e,tokenNames);
}
String[] inputLines = e.input.toString().split("\r\n");
String line = "";
if(e.token.getCharPositionInLine()==0)
line = "at \"" + inputLines[e.token.getLine() - 2];
else if(e.token.getCharPositionInLine()>0)
line = "at \"" + inputLines[e.token.getLine() - 1];
return ": " + msg.split("at")[0] + line + "\" => [" + stack.get(stack.size() - 1) + "]";
}
public String getTokenErrorDisplay(Token t){
return t.toString();
}
}
And now errors are displayed as follows.
line 6:7 : missing CLOSSB at "int a[6;" => [var_declaration]
line 8:0 : missing SEMICOL at "int p" => [var_declaration]
line 8:5 : missing CLOSB at "get(2;" => [call]
I have 2 questions.
1) Is there a proper way to do the same thing I have done?
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
Thank you.
1) Is there a proper way to do the same thing I have done?
I don't know if there is a defined proper way of showing errors. My take on showing errors is a litmus test. If the user can figure out how to fix the error based on what you have given them, then it is good. If the user is confused by the error message, then the message needs more work. Based on the examples given in the question, symbols were only char constants.
My favorite way of seeing errors is with the line with an arrow pointing at the location.
i.e.
Expected closing brace on line 6.
int a[6;
^
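A minimal sketch of how such a message could be put together in Java, assuming you have the whole input as a string plus a 1-based line number and a 0-based column (which is how the "line 6:7" positions in the question appear to be counted). CaretMessage and format are hypothetical names, not ANTLR API:

```java
// Hypothetical formatter: description, offending source line, then a caret under the column.
final class CaretMessage {
    private CaretMessage() {}

    /** line is 1-based, column is 0-based; falls back to the bare description if line is out of range. */
    static String format(String source, int line, int column, String description) {
        String[] lines = source.split("\r?\n", -1);
        if (line < 1 || line > lines.length) {
            return description; // position unknown, nothing to point at
        }
        String text = lines[line - 1];
        int col = Math.max(0, Math.min(column, text.length()));
        StringBuilder caret = new StringBuilder();
        for (int i = 0; i < col; i++) {
            // Copy tabs so the caret lines up with tab-indented source.
            caret.append(text.charAt(i) == '\t' ? '\t' : ' ');
        }
        caret.append('^');
        return description + "\n" + text + "\n" + caret;
    }
}
```

In an ANTLR 3 parser the line and column would presumably come from e.token.getLine() and e.token.getCharPositionInLine(), as in the code in the question.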
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
You will have to read the separately generated token file and then make a map, i.e. a dictionary data structure, to translate the token name into the token character(s).
EDIT
First we have to clarify what is meant by symbol. If you limit the definition of symbol to only those tokens that are defined in the tokens file with a char or string, i.e. '!'=13 or 'public'=92, then this can be done. If, however, you choose to define symbol as any text associated with a token, then that is something other than what I addressed or plan to address.
When ANTLR generates its token map it uses three different sources:
The char or string constants in the lexer
The char or string constants in the parser.
Internal tokens such as Invalid, Down, Up
Since the tokens in the lexer are not the complete set, one should use the tokens file as a starting point. If you look at the tokens file you will note that the lowest value is 4. If you look at the TokenTypes file (This is the C# version name) you will find the remaining defined tokens.
If you find names like T__ in the tokens file, those are the names ANTLR generated for the char/string literals in the parser.
If you are using string and/or char literals in parser rules, then ANTLR must create a new set of lexer rules that include all of the string and/or char literals in the parser rules. Remember that the parser can only see tokens and not raw text. So string and/or char literals cannot be passed to the parser.
To see the new set of lexer rules, use org.antlr.Tool -Xsavelexer, and then open the created grammar file. The name may be like.g . If you have string and/or char literals in your parser rules you will see lexer rules with names starting with T__ .
Now that you know all of the tokens and their values you can create a mapping table from the info given in the error to the string you want to output instead for the symbol.
The code at http://markmail.org/message/2vtaukxw5kbdnhdv#query:+page:1+mid:2vtaukxw5kbdnhdv+state:results
is an example.
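Here is a minimal sketch of building such a table, independent of the linked code. It assumes the tokens file has one entry per line in the form NAME=12 or ';'=12 (as in the '!'=13 example above), and that a named token and its literal share the same number. TokenLiterals and nameToLiteral are hypothetical names; reading the file into a list of lines is left out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: joins NAME=n and 'literal'=n entries on the shared type number n.
final class TokenLiterals {
    private TokenLiterals() {}

    /** Returns a map from token name to its quoted literal, for names that have one. */
    static Map<String, String> nameToLiteral(List<String> tokenFileLines) {
        Map<Integer, String> literalByType = new HashMap<>();
        Map<Integer, String> nameByType = new HashMap<>();
        for (String raw : tokenFileLines) {
            String line = raw.trim();
            int eq = line.lastIndexOf('='); // last '=' so that a literal such as '=' still parses
            if (eq <= 0) {
                continue; // blank or malformed line
            }
            String key = line.substring(0, eq);
            int type;
            try {
                type = Integer.parseInt(line.substring(eq + 1).trim());
            } catch (NumberFormatException e) {
                continue; // not a NAME=number entry
            }
            if (key.startsWith("'")) {
                literalByType.put(type, key);
            } else {
                nameByType.put(type, key);
            }
        }
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<Integer, String> e : nameByType.entrySet()) {
            String literal = literalByType.get(e.getKey());
            if (literal != null) {
                result.put(e.getValue(), literal);
            }
        }
        return result;
    }
}
```

Tokens without a literal entry (an ID rule, say) simply do not appear in the result, so the error message would keep the token name for those.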
However, the mapping of the tokens can change when, for example, you change rules in the lexer or change char/string literals in the parser. So if the message all of a sudden outputs the wrong string for a symbol, you will have to update the mapping table by hand.
While this is not a perfect solution, it is a possible solution depending on how you define symbol.
Note: Last time I looked ANTLR 4.x creates the table automatically for access within the parser because it was such a problem for so many with ANTLR 3.x.
Bhathiya wrote:
1) Is there a proper way to do the same thing I have done?
There is no single way to do this. Note that proper error-handling and reporting is tricky. Terence Parr spends a whole chapter on this in The Definitive ANTLR Reference (chapter 10). I recommend you get hold of a copy and read it.
Bhathiya wrote:
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
You can't. For SEMICOL this may seem easy to do, but how would you get this information for a token like FOO:
FOO : (X | Y)+;
fragment X : '4'..'6';
fragment Y : 'a' | 'bc' | . ;