ANTLR: length of token and error handling

I'm using ANTLR version 3.4.
First question, please see the grammar:

request : 'C' DELIM source DELIM target
          { System.out.println("Hi"); }
        ;
source  : ID ;
target  : ID ;

DELIM : '|' ;
fragment ALPHA : 'a'..'z' | 'A'..'Z' ;
fragment NUM : '0'..'9' ;
ID : ALPHA (ALPHA | NUM)* ;
"source" and "target" cannot be empty. But my test shows the following:
for input "C|n1|n2" : normal case, no problem.
for input "C||n2" : syntax error, and "Hi" not printed. Expected. Ok
for input "C|n1|" : syntax error, but "Hi" is printed. Not good.
I do need to set other things when the request rule is reached. But, as shown above, even with a syntax error the code still reaches the request rule. Why?
Second question: how do I specify a rule for a fixed-length token, for example a token of exactly 10 digits?
Third question is about error handling. I override emitErrorMessage() in the parser to set an error flag, but I found another emitErrorMessage() in the lexer. I don't want to share the error flag between the parser and lexer objects. Can I override emitErrorMessage() in the lexer to do nothing, and rely entirely on the parser to report errors? Or, put another way, if there is an error, will the parser capture it for sure?
And if the error flag is set for one error, can the parser actually recover and match another rule, so that the previous error turns out to be a false alarm?
Thanks for any help!

...
for input "C|n1|" : syntax error, but "Hi" is printed. Not good.
I do need to set other things when the request rule is reached. But, as shown above, even with a syntax error the code still reaches the request rule. Why?
Because the parser tries to recover from this. If you don't want the parser to (try to) recover from mis-matched tokens, simply throw an exception like this:
grammar T;
// options...
@members {
  @Override
  public void emitErrorMessage(String message) {
    throw new RuntimeException(message);
  }
}
request
: 'C' DELIM source DELIM target { System.out.println("Hi"); }
;
// more rules...
Note that @members is short for @parser::members; it only causes emitErrorMessage(...) to be overridden in the parser, not in the lexer. For lexer members, you need to use @lexer::members.
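If you want the lexer to fail fast as well, a minimal sketch of the corresponding lexer override (my own addition, not part of the original answer) would be:

@lexer::members {
  @Override
  public void emitErrorMessage(String message) {
    // assumption: treat lexer error messages as fatal too
    throw new RuntimeException(message);
  }
}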
Second question: how do I specify a rule for a fixed-length token, for example a token of exactly 10 digits?
See: ANTLR3 set the number of accepted characters for a token
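For example, a minimal sketch (the rule names NUM10 and DIGIT are mine, not from the linked question): a lexer rule that simply spells out the repetition ten times:

// exactly ten digits
NUM10 : DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT ;
fragment DIGIT : '0'..'9' ;

A validating semantic predicate over a looser rule works too, as shown further down in this thread.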
Third question is about error handling. ...
See the first part of my answer: simply override emitErrorMessage() and do nothing in it (the default action is to print to std.err).
Can I override emitErrorMessage() in the lexer to do nothing, and rely entirely on the parser to report errors?
Well, the parser and lexer handle different types of errors, so ignoring certain problems in the lexer might not cause the parser to produce a warning/error.
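With that caveat in mind, a do-nothing lexer override would look like this (again just a sketch of mine):

@lexer::members {
  @Override
  public void emitErrorMessage(String message) {
    // deliberately ignore lexer error messages (see the caveat above)
  }
}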

Bart, your help is great. I also thought it through and understood that the behavior in question #1 is legitimate. Like a compiler, the parser will recover and continue, in order to find as many errors as possible.
For question #2, I also figured out a way to do fixed lengths. I don't know if it's the usual way:
example : exact3 '|' exact4 ;

// method 1: validating semantic predicate over a list label
exact3 : (d+=DIGIT)+ { $d != null && $d.size() == 3 }? ;

// method 2: gated predicate bounded by a counter, plus a length check
exact4 : atmost4 { $atmost4.text.length() == 4 }? ;

atmost4
@init { int n = 1; }
  : ({ n <= 4 }?=> DIGIT { n++; })+
  ;

DIGIT : '0'..'9' ;
For question #3, I'll fail on the first error, i.e. override emitErrorMessage() in both the lexer and the parser to throw an exception. I chose emitErrorMessage(msg) because it already has the error message properly prepared.
Thanks all who are sharing!

Related

Bison: Syntax Error processing, unexpected and undefined <token>

I want to process undefined and unexpected token errors in the yyerror function (or maybe in another function, if that's possible).
For example, I get an error message from Bison:
...
LAC: checking lookahead EXECSQL: S4
Error: popping nterm component_list ()
Stack now 0
Cleanup: discarding lookahead token $undefined ()
Stack now 0
ERRSTAT = "%X0000002C"
But I want to print which token wasn't found, and the line number. Is it possible to implement this in Bison, and how?
The special token $undefined is reported when yylex returns a token number which doesn't appear in any parser rule. Most of the time, that's the result of the lexer fallback rule:
. { return yytext[0]; }
But it can also happen if you declare a token in your parser file, and the lexer returns that token, but the token is never actually used in any rule.
Unused tokens don't have names, in the sense that the array of names which Bison includes in your parser doesn't include unused tokens, and so there's no way to look up what the token name originally was. You can, however, often get the token number from the variable yychar. If that number is greater than 0 and less than 256, then the token is probably a single-character token, and you could use that to print an additional error message. However, there's no simple way to modify the error message generated by Bison's verbose error messages; if you're using that feature, you'll still see the invalid token message.
In order to print line numbers, you only need to enable line number counting in the lexical scanner, using
%option yylineno
in your Flex (.l) file. Then you can print the value of yylineno in yyerror. (If you're using a "pure" (reentrant) scanner, then yylineno will be in the yyscan_t object. In the normal use case where that object is an extra parser argument, it will also be available inside yyerror.)
I know that the above is a bit confusing because there are a lot of different code-generation options with slightly different behaviours. You didn't specify the particular options you're using, so the answer is a bit generic.
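A rough sketch pulling both ideas together (mine, not from the answer above; it assumes the ordinary non-reentrant setup, where yychar and yylineno are globals):

/* print the line number and, for single-character fallback tokens,
   the raw character that was not recognized */
#include <stdio.h>

extern int yychar;    /* current lookahead token (impure parser)  */
extern int yylineno;  /* maintained by flex with %option yylineno */

void yyerror(const char *msg)
{
    fprintf(stderr, "%s near line %d", msg, yylineno);
    if (yychar > 0 && yychar < 256)
        fprintf(stderr, " (unexpected character '%c')", yychar);
    fprintf(stderr, "\n");
}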

Rule with identical string token twice

Using yacc, I want to parse text like
begin foo ... end foo
The string foo is not known at compile time and there can be different
such strings in the same input.
So far, the only option I see is to check for syntactical correctness after parsing:
block : BEGIN IDENT something END IDENT
{ if (strcmp($2, $5) != 0) yyerror("Mismatch"); }
This feels wrong. The parser should already detect the errors. Is there something built-in to yacc?
yacc only knows about tokens which the lexer can identify. Since those are identical, the lexer could only improve this case by using states.
That is, you could tell lex to remember that it saw a BEGIN and to count the tokens itself, and return a different type of IDENT (and do the checking there).
However, yacc is better suited to this sort of thing, so the answer to the original question is "no", there is no better solution.
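For completeness, a rough sketch of the lexer-state idea (mine, not from the answer; the token names T_BEGIN, T_END and IDENT are assumptions, and the yylval handling the parser would need is omitted):

%{
#include <stdio.h>
#include <string.h>
#include "y.tab.h"            /* token definitions generated by yacc */
static char block_name[64];   /* identifier seen after "begin" */
%}
%option noyywrap
%x IN_BEGIN IN_END
%%
begin                 { BEGIN(IN_BEGIN); return T_BEGIN; }
end                   { BEGIN(IN_END);   return T_END; }
<IN_BEGIN>[A-Za-z_]+  { strncpy(block_name, yytext, sizeof block_name - 1);
                        BEGIN(INITIAL); return IDENT; }
<IN_END>[A-Za-z_]+    { if (strcmp(block_name, yytext) != 0)
                          fprintf(stderr, "mismatched block name: %s\n", yytext);
                        BEGIN(INITIAL); return IDENT; }
[ \t\n]+              ;
.                     { return yytext[0]; }
%%

Even so, as noted above, doing the comparison in the yacc action is the simpler route.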

How to check for the end of the line in an ANTLR parser rule even if the newline character is sent to the hidden channel

I have 3 statements as follows
1) IF a==b THEN print(a);
2) IF a==b THEN /* Action block follows */
3) IF a==b THEN
How can I differentiate between these statements using an ANTLR parser rule?
I'm using a rule like:
if_stmt : IF_T LITERAL_T '==' LITERAL_T THEN_T
          {
            /* My Java code goes here */
          }
        ;
I would like to keep the rule the same and differentiate the cases in the rule's action block.
Note: the newline character and comments go to the hidden channel.
Maybe you should not put all the "intelligence" into the parser itself; it might get overcomplicated very quickly. You can traverse the AST (maybe using a tree walker) and check whether getLineNumber() for the if statement returns the same value as for the first statement in the then block.
You can also put a similar condition into the if_stmt rule's action.
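A rough sketch of the in-rule variant (mine, not from the answer; it only separates case 1 from cases 2 and 3, since telling a trailing comment apart from a bare newline would require inspecting the hidden channel):

if_stmt : IF_T LITERAL_T '==' LITERAL_T THEN_T
          {
            // look at the next default-channel token: if it starts on the
            // same line as THEN, an action block follows on that line (case 1)
            Token next = input.LT(1);
            boolean actionOnSameLine = next != null
                && next.getType() != Token.EOF
                && next.getLine() == $THEN_T.getLine();
          }
        ;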

Another implicit token error - how to tweak definitions to address it

I am aware of what the implicit token definition error in the parser means, but I am having difficulty getting rid of it. (ANTLR v4)
Stripped-down statements:
enum_decl : GTYPE_ENUM ID LSQUARE STRING STRING* RSQUARE SEMI ;
string_decl: GTYPE_STRING ID (COMMA ID)* SEMI ;
In string_decl, that error appears on SEMI
In enum_decl the same error is on RSQUARE
GTYPE_ENUM, ID, etc. are all defined/accepted correctly in the lexer section.
Have you tried typing in just that little tiny section to find a small test case that doesn't work? Without a grammar to test, there's nothing we can do. It's either a bug or a problem with your grammar.

Issues of Error handling with ANTLR3

I tried error reporting in the following manner.
@members {
  public String getErrorMessage(RecognitionException e, String[] tokenNames) {
    List stack = getRuleInvocationStack(e, this.getClass().getName());
    String msg = null;
    if (e instanceof NoViableAltException) {
      <some code>
    }
    else {
      msg = super.getErrorMessage(e, tokenNames);
    }
    String[] inputLines = e.input.toString().split("\r\n");
    String line = "";
    if (e.token.getCharPositionInLine() == 0)
      line = "at \"" + inputLines[e.token.getLine() - 2];
    else if (e.token.getCharPositionInLine() > 0)
      line = "at \"" + inputLines[e.token.getLine() - 1];
    return ": " + msg.split("at")[0] + line + "\" => [" + stack.get(stack.size() - 1) + "]";
  }

  public String getTokenErrorDisplay(Token t) {
    return t.toString();
  }
}
And now errors are displayed as follows.
line 6:7 : missing CLOSSB at "int a[6;" => [var_declaration]
line 8:0 : missing SEMICOL at "int p" => [var_declaration]
line 8:5 : missing CLOSB at "get(2;" => [call]
I have 2 questions.
1) Is there a proper way to do the same thing I have done?
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
Thank you.
1) Is there a proper way to do the same thing I have done?
I don't know if there is a defined proper way of showing errors. My take on showing errors is a litmus test: if the user can figure out how to fix the error based on what you have given them, then it is good; if the user is confused by the error message, then the message needs more work. Based on the examples given in the question, the symbols were only char constants.
My favorite way of presenting errors is to show the offending line with an arrow pointing at the location, e.g.:

Expected closing bracket on line 6.
int a[6;
       ^
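A rough sketch of producing that kind of message from ANTLR 3's RecognitionException (mine; it assumes you already have the source split into a String[] of lines, e.g. inside your overridden error-reporting method):

// hypothetical helper: print the offending line with a caret under the error
void printCaretError(RecognitionException e, String[] sourceLines, String expected) {
  System.err.println("Expected " + expected + " on line " + e.line + ".");
  System.err.println(sourceLines[e.line - 1]);
  for (int i = 0; i < e.charPositionInLine; i++) {
    System.err.print(' ');
  }
  System.err.println('^');
}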
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
You will have to read the separately generated token file and then make a map, i.e. a dictionary data structure, to translate the token name into the token character(s).
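A rough sketch of that idea (mine, not from the answer; it assumes the generated file is called MyGrammar.tokens and contains lines such as SEMICOL=21 and ';'=21):

import java.io.*;
import java.util.*;

public class TokenDisplayMap {
  // build a tokenType -> display-text map from the generated .tokens file,
  // preferring quoted literals (';'=21) over symbolic names (SEMICOL=21)
  public static Map<Integer, String> load(String tokensFile) throws IOException {
    Map<Integer, String> display = new HashMap<Integer, String>();
    BufferedReader reader = new BufferedReader(new FileReader(tokensFile));
    String line;
    while ((line = reader.readLine()) != null) {
      int eq = line.lastIndexOf('=');
      if (eq < 0) continue;
      String name = line.substring(0, eq);
      int type = Integer.parseInt(line.substring(eq + 1).trim());
      if (name.startsWith("'") || !display.containsKey(type)) {
        display.put(type, name.replace("'", ""));
      }
    }
    reader.close();
    return display;
  }
}

Something like TokenDisplayMap.load("MyGrammar.tokens").get(tokenType) could then be used in getErrorMessage() in place of the raw token name.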
EDIT
First we have to clarify what is meant by symbol. If you limit the definition of symbol to only tokens that are defined in the tokens file with a char or string, e.g. '!'=13 or 'public'=92, then this can be done; if, however, you take symbol to mean any text associated with a token, then that is something other than what I am addressing here.
When ANTLR generates its token map it uses three different sources:
The char or string constants in the lexer
The char or string constants in the parser.
Internal tokens such as Invalid, Down, Up
Since the tokens in the lexer are not the complete set, one should use the tokens file as a starting point. If you look at the tokens file you will note that the lowest value is 4. If you look at the TokenTypes file (This is the C# version name) you will find the remaining defined tokens.
If you find names like T__ in the tokens file, those are the names ANTLR generated for the char/string literals in the parser.
If you are using string and/or char literals in parser rules, then ANTLR must create a new set of lexer rules that include all of the string and/or char literals in the parser rules. Remember that the parser can only see tokens and not raw text. So string and/or char literals cannot be passed to the parser.
To see the new set of lexer rules, use org.antlr.Tool -Xsavelexer and then open the generated grammar file (its name is based on your grammar's name, with a .g extension). If you have string and/or char literals in your parser rules, you will see lexer rules with names starting with T__.
Now that you know all of the tokens and their values you can create a mapping table from the info given in the error to the string you want to output instead for the symbol.
The code at http://markmail.org/message/2vtaukxw5kbdnhdv#query:+page:1+mid:2vtaukxw5kbdnhdv+state:results is an example.
However, the mapping of the tokens can change when, for example, you change rules in the lexer or change char/string literals in the parser. So if the message suddenly outputs the wrong string for a symbol, you will have to update the mapping table by hand.
While this is not a perfect solution, it is a possible solution depending on how you define symbol.
Note: last time I looked, ANTLR 4.x creates the table automatically for access within the parser, because this was such a problem for so many people with ANTLR 3.x.
Bhathiya wrote:
1) Is there a proper way to do the same thing I have done?
There is no single way to do this. Note that proper error-handling and reporting is tricky. Terence Parr spends a whole chapter on this in The Definitive ANTLR Reference (chapter 10). I recommend you get hold of a copy and read it.
Bhathiya wrote:
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
You can't. For SEMICOL this may seem easy to do, but how would you get this information for a token like FOO:
FOO : (X | Y)+;
fragment X : '4'..'6';
fragment Y : 'a' | 'bc' | . ;