Error in ANTLR4 when parsing for EOF - antlr

I am new to ANTLR (any version) and I am just getting started writing my first grammar file. I'm using the IntelliJ IDE with the ANTLR plugin (v1.6).
My grammar is
grammar TestGrammar;
testfile : message+ EOF;
message : timestamp WS id (NL | EOF);
timestamp : NumericLiteral;
id : NumericLiteral;
NumericLiteral : INTEGER | DECIMAL;
INTEGER : [+-]? [0-9]+;
DECIMAL : [+-]? [0-9]* '.' [0-9]+;
EXPONENT : [eE] [+-]? [0-9]+;
WS: (' ' | '\t')+;
NL: '\r'? '\n';
When I apply the simple test input
123 1231
123 1312
The data is correctly parsed, but I get an error in the IntelliJ preview window.
"extraneous input '<EOF>' expecting {<EOF>, NL}"
What have I done wrong? The EOF seems to be correctly detected... If I add a NL on the last line then the file is parsed correctly but I need to make sure that a final NL is optional.
Additional detail for the format:
We are reverse engineering a data format so I will be honest and say that we don't really know what the constraints are going to be! Our current understanding is that:
Each message must be on its own row
Allow empty lines between messages
Don't require a new line at the end of the file
We have seen evidence of files following these patterns so we know that they are valid input.

In your grammar you explicitly states that a 'new line' must end a line. The question here is: Is the 'new line' at the end of a message part of the language? The same question occurs about white spaces. Are they part of the language? If not, you can skip them:
WS: (' ' | '\t') -> skip;
NL: '\r'? '\n' -> skip;
Then, you can simplify your message rule:
message: timestamp id;
If you really need to keep the end of lines:
NL: '\r'? '\n';
And you add this token as optional at the end of your message rule:
message: timestamp id NL?;
This will work with your example, but will fail for:
123 1231
123 1312
The \n between the two lines will produce an error. The solution that seems the most promising is the first one (skip NL and WS with the simplified message rule) BUT, this entry will be matched as OK:
123 1231 123 1312
It will produce two message rule context.
To sum up, in your example, in order to give you the best way of building your grammar we must know the constraints of your input language.
< EDIT >
Regarding your comments, there is two solutions. Either you are sure your files are well formed and the idea is to extract the information of the file without constraints or you are in a dynamic where you have to be sure the input files conforms to the grammar (in order to also remove "bad files").
I'm pretty sure you are in the first case (as you said you are performing reverse engineering), so you probably want to create a CST from your file to extract information. In that case, considering your input files are always well formed, you don't need to bother about checking if NL are present at the end of messages (by construction, files always have one message by line). In this case, you can skip everything you don't need. The grammar became:
grammar TestGrammar;
testfile : message+ EOF;
message : timestamp id;
timestamp : NumericLiteral;
id : NumericLiteral;
NumericLiteral : INTEGER | DECIMAL;
INTEGER : [+-]? [0-9]+;
DECIMAL : [+-]? [0-9]* '.' [0-9]+;
EXPONENT : [eE] [+-]? [0-9]+;
WS: (' ' | '\t')+ -> skip;
NL: '\r'? '\n' -> skip;
This grammar will recognize
123 1231
123 1312
as well as
123 1231
(as many as \n you want between them)
123 1312
but also
123 1231 123 1312 (-> this will produce two messages as expected)
However, if your input files could be not well formed, with this grammar, you will not be able to exclude them. If you need to ensure that only one message is present by line, you should go with a slightly modified version of the grammar proposed by Raz Friman here:
grammar TestGrammar;
testfile : (message? NL)* message EOF;
message : timestamp id;
timestamp : NumericLiteral;
id : NumericLiteral;
WS: [\t ]+ -> skip;
NL: '\r'? '\n';
NumericLiteral : INTEGER | DECIMAL;
INTEGER : [+-]? [0-9]+;
DECIMAL : [+-]? [0-9]* '.' [0-9]+;
EXPONENT : [eE] [+-]? [0-9]+;
With this grammar:
123 1231
(as many as \n you want between them)
123 1312
will be recognized whereas:
123 1231 123 1312
will throw an error.

If you need to ensure the constraints of the Newlines, you can make the root file check for a final "message" rule and then apply the EOF rule.
The grammar for this is posted below:
testfile : (message NL)* message EOF;
message : timestamp id;
timestamp : NumericLiteral;
id : NumericLiteral;
WS: [\t ]+ -> channel(HIDDEN);
NL: '\r'? '\n';
NumericLiteral : INTEGER | DECIMAL;
INTEGER : [+-]? [0-9]+;
DECIMAL : [+-]? [0-9]* '.' [0-9]+;
EXPONENT : [eE] [+-]? [0-9]+;

Related

How do I force the the parser to match a content as an ID rather than a token?

I have a grammar as the following (It's a partial view with only the relevant parts):
elem_course : INIT_ABSCISSA '=' expression;
expression
: ID
| INT_VALUE
| '(' expression ')'
| expression OPERATOR1 expression
| expression OPERATOR2 expression
;
OPERATOR1 : '*' | '/' ;
OPERATOR2 : '+' | '-' ;
fragment
WORD : LETTER (LETTER | NUM | '_' )*;
ID : WORD;
fragment
NUM : [0-9];
fragment
LETTER : [a-zA-Z];
BEACON_ANTENNA_TRAIN : 'BEACON_ANTENNA_TRAIN';
And, I would like to match the following line :
INIT_ABSCISSA = 40 + BEACON_ANTENNA_TRAIN
But as BEACON_ANTENNA_TRAIN is a lexer token and even the rule states that I except and ID, the parser matchs the token and raise me the following error when parsing:
line 11:29 mismatched input 'BEACON_ANTENNA_TRAIN' expecting {'(', INT_VALUE, ID}
Is there a way to force the parser that it should match the content as an ID rather than a token?
(Quick note: It's nice to abbreviate content in questions, but it really helps if it is functioning, stand-alone content that demonstrates your issue)
In this case, I've had to add the following lever rules to get this to generate, so I'm making some (probably legitimate) assumptions.
INT_VALUE: [\-+]? NUM+;
INIT_ABSCISSA: 'INIT_ABSCISSA';
WS: [ \t\r\n]+ -> skip;
I'm also going to have to assume that BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN'; appears before your ID rule. As posted your token stream is as follows and could not generate the error you show)
[#0,0:12='INIT_ABSCISSA',<ID>,1:0]
[#1,14:14='=',<'='>,1:14]
[#2,16:17='40',<INT_VALUE>,1:16]
[#3,19:19='+',<OPERATOR2>,1:19]
[#4,21:40='BEACON_ANTENNA_TRAIN',<ID>,1:21]
[#5,41:40='<EOF>',<EOF>,1:41]
If I reorder the lexer rules like this:
INIT_ABSCISSA: 'INIT_ABSCISSA';
BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN';
OPERATOR1: '*' | '/';
OPERATOR2: '+' | '-';
fragment WORD: LETTER (LETTER | NUM | '_')*;
ID: WORD;
fragment NUM: [0-9];
fragment LETTER: [a-zA-Z];
INT_VALUE: [\-+]? NUM+;
WS: [ \t\r\n]+ -> skip;
I can get your error message.
The lexer looks at you input stream of characters and attempts to match all lexer rules. To choose the token type, ANTLR will:
select the rule that matches the longest stream of input characters
If multiple Lever rules match the same sequence of input characters, then the rule that appears first will be used (that's why I had to re-order the rules to get your error.
With those assumptions, now to your question.
The short answer is "you can't". The Lexer processes input and determines token types before the parser is involved in any way. There is nothing you can do in parser rules to influence Token Type.
The parser, on the other hand starts with the start rule and then uses a recursive descent algorithm to attempt to match your token stream to parser rules.
You don't really give any idea what really guides whether BEACON_ANTENNA_TRAIN should be a BEACON_ANTENNA_TRAIN or an ID, so I'll put an example together that assumes that it's an ID if it's on the right hand side (rhs) of the elemen_course rule.
Then this grammar:
grammar IDG
;
elem_course: INIT_ABSCISSA '=' rhs_expression;
rhs_expression
: id = (ID | BEACON_ANTENNA_TRAIN | INIT_ABSCISSA)
| INT_VALUE
| '(' rhs_expression ')'
| rhs_expression OPERATOR1 rhs_expression
| rhs_expression OPERATOR2 rhs_expression
;
INIT_ABSCISSA: 'INIT_ABSCISSA';
BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN';
OPERATOR1: '*' | '/';
OPERATOR2: '+' | '-';
fragment WORD: LETTER (LETTER | NUM | '_')*;
ID: WORD;
fragment NUM: [0-9];
fragment LETTER: [a-zA-Z];
INT_VALUE: [\-+]? NUM+;
WS: [ \t\r\n]+ -> skip;
produces this token stream and parse tree:
$ grun IDG elem_course -tokens -tree IDG.txt
[#0,0:12='INIT_ABSCISSA',<'INIT_ABSCISSA'>,1:0]
[#1,14:14='=',<'='>,1:14]
[#2,16:17='40',<INT_VALUE>,1:16]
[#3,19:19='+',<OPERATOR2>,1:19]
[#4,21:40='BEACON_ANTENNA_TRAIN',<'BEACON_ANTENNA_TRAIN'>,1:21]
[#5,41:40='<EOF>',<EOF>,1:41]
(elem_course INIT_ABSCISSA = (rhs_expression (rhs_expression 40) + (rhs_expression BEACON_ANTENNA_TRAIN)))
As a side note: It's possible that, depending on what drives your decision, you might be able to leverage Lexer modes, but there's not anything in your example to leaves that impression.
This is the well known keyword-as-identifier problem and Mike Cargal gave you a working solution. I just want to add that the general approach for this problem is to add all keywords to a parser id rule that should be matched as an id. To restrict which keyword is allowed in certain grammar positions, you can use multiple id rules. For example the MySQL grammar uses this approach to a large extend to define keywords that can go as identifier in general or only as a label, for role names etc.

ANTLR4 grammar: getting mismatched input error

I have defined the following grammar:
grammar Test;
parse: expr EOF;
expr : IF comparator FROM field THEN #comparatorExpr
;
dateTime : DATE_TIME;
number : (INT|DECIMAL);
field : FIELD_IDENTIFIER;
op : (GT | GE | LT | LE | EQ);
comparator : op (number|dateTime);
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
IF : '$IF';
FROM : '$FROM';
THEN : '$THEN';
OR : '$OR';
GT : '>' ;
GE : '>=' ;
LT : '<' ;
LE : '<=' ;
EQ : '=' ;
INT : DIGIT+;
DECIMAL : INT'.'INT;
DATE_TIME : (INT|DECIMAL)('M'|'y'|'d');
FIELD_IDENTIFIER : (LETTER|DIGIT)(LETTER|DIGIT|' ')*;
WS : [ \r\t\u000C\n]+ -> skip;
And I try to parse the following input:
$IF >=15 $FROM AgeInYears $THEN
it gives me the following error:
line 1:6 mismatched input '15 ' expecting {INT, DECIMAL, DATE_TIME}
All SO posts I found point out to the same reason for this error - identical LEXER rules. But I cannot see why 15 can be matched to either DECIMAL - it requires . between 2 ints, or to DATE_TIME - it has m|d|y suffix as well.
Any pointers would be appreciated here.
It's always a good idea to run take a look at the token stream that your Lexer produces:
grun Test parse -tokens -tree Test.txt
[#0,0:2='$IF',<'$IF'>,1:0]
[#1,4:5='>=',<'>='>,1:4]
[#2,6:8='15 ',<FIELD_IDENTIFIER>,1:6]
[#3,9:13='$FROM',<'$FROM'>,1:9]
[#4,15:25='AgeInYears ',<FIELD_IDENTIFIER>,1:15]
[#5,26:30='$THEN',<'$THEN'>,1:26]
[#6,31:30='<EOF>',<EOF>,1:31]
line 1:6 mismatched input '15 ' expecting {INT, DECIMAL, DATE_TIME}
(parse (expr $IF (comparator (op >=) 15 ) $FROM (field AgeInYears ) $THEN) <EOF>)
Here we see that "15 " (1 5 space) has been matched by the FIELD_IDENTIFIER rule. Since that's three input characters long, ANTLR will prefer that Lexer rule to the INT rule that only matches 2 characters.
For this particular input, you can solve this be reworking the FIELD_IDENTIFIER rule to be:
FIELD_IDENTIFIER: (LETTER | DIGIT)+ (' '+ (LETTER | DIGIT))*;
grun Test parse -tokens -tree Test.txt
[#0,0:2='$IF',<'$IF'>,1:0]
[#1,4:5='>=',<'>='>,1:4]
[#2,6:7='15',<INT>,1:6]
[#3,9:13='$FROM',<'$FROM'>,1:9]
[#4,15:24='AgeInYears',<FIELD_IDENTIFIER>,1:15]
[#5,26:30='$THEN',<'$THEN'>,1:26]
[#6,31:30='<EOF>',<EOF>,1:31]
(parse (expr $IF (comparator (op >=) (number 15)) $FROM (field AgeInYears) $THEN) <EOF>)
That said, I suspect that attempting to allow spaces within your FIELD_IDENTIFIER (without some sort of start/stop markers), is likely to be a continuing source of pain as you work on this. (There's a reason why you don't see this is most languages, and it's not that nobody thought it would be handy to allow for multi-word identifiers. It requires a greedy lexer rule that is likely to take precedence over other rules (as it did here)).

ANTLR: too greedy rule

It looks like I have a problem understanding a too greedy rule match. I'm trying to lex a .g4 file for syntax coloring. Here is a minimum (simplified) extract for making this problem reproducible:
lexer grammar ANTLRv4Lexer;
Range
: '[' RangeChar+ ']'
;
fragment EscapedChar
: '\\' ~[u]
| '\\u' EscapedCharHex EscapedCharHex EscapedCharHex EscapedCharHex
;
fragment EscapedCharHex
: [0-9A-Fa-f]
;
fragment RangeChar
: ~']'
| EscapedChar
;
Punctuation
: [:;()+\->*[\]~|]
;
Identifier
: [a-zA-Z0-9]+
;
Whitespace
: [ \t]+
-> skip
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
LineComment
: '//' ~[\r\n]*
;
The (incomplete) test file is following:
: (~ [\]\\] | EscAny)+ -> more
;
// ------
fragment Id
: NameStartChar NameChar*
;
String2Part
: ( ~['\\]
| EscapeSequence
)+
;
I don't understand why it matches Range so greedy:
[#0,3:3=':',<Punctuation>,1:3]
[#1,5:5='(',<Punctuation>,1:5]
[#2,6:6='~',<Punctuation>,1:6]
[#3,8:135='[\]\\] | EscAny)+ -> more\r\n ;\r\n\r\n // ------\r\n\r\nfragment Id\r\n : NameStartChar NameChar*\r\n ;\r\n\r\n\r\nString2Part\r\n\t: ( ~['\\]',<Range>,1:8]
[#4,141:141='|',<Punctuation>,13:3]
[#5,143:156='EscapeSequence',<Identifier>,13:5]
[#6,162:162=')',<Punctuation>,14:3]
[#7,163:163='+',<Punctuation>,14:4]
[#8,167:167=';',<Punctuation>,15:1]
[#9,170:169='<EOF>',<EOF>,16:0]
I understand why in the first line it matches [, \] and \\, but why it obviously treats ] as RangeChar?
Your lexer matches the first \ in \\] using the ~']' alternative and then matches the remaining \] as an EscapedChar. The reason it does this is that this interpretation leads to a longer match than the one where \\ is the EscapedChar and ] is the end of the range and when there are multiple valid ways to match a lexer rule, ANTLR always chooses the longest one (except when *? is involved).
To fix this, you should change RangeChar, so that backslashes are only allowed as part of escape sequences, i.e. replace ~']' with ~[\]\\].

ANTLR4 grammar regex and tilde

I want to have an ANTLR grammar for CSV input.
What's the difference between (~["])+ and (~['"'])+ ?
Why ~ is important?
Here is my grammar:
grammar Exercice4;
csv : ligne+
;
ligne : exp (',' exp)* ('\n' | EOF)
;
exp : ID
| INT
| STRING
;
INT : '0'..'9'+;
ID : ('0'..'9' | 'a'..'z' | 'A'..'Z')+;
STRING : '"' (~["])+ '"';
WS : [ ,\n, \t, \r] -> skip;
In a lexer rule, the characters inside square brackets define a character set. So ["] is the set with the single character ". Being a set, every character is either in the set or not, so defining a character twice, as in [""] makes no difference, it's the same as ["].
~ negates the set, so ~["] means any character except ".

How to allow an identifer which can start with a digit without causing MismatchedTokenException

I want to match the following input:
statement span=1m 0_dur=12
with the following grammar:
options {
language = Java;
output=AST;
ASTLabelType=CommonTree;
}
statement :'statement' 'span' '=' INTEGER 'm' ident '=' INTEGER;
INTEGER
: DIGIT+
;
ident : IDENT | 'AVG' | 'COUNT';
IDENT
: (LETTER | DIGIT | '_')+ ;
WHITESPACE
: ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment
LETTER : ('a'..'z' | 'A'..'Z') ;
fragment
DIGIT : '0'..'9';
but it cause an error:
MismatchedTokenException : line 1:15 mismatched input '1m' expecting '\u0004'
Does anyone has any idea how to solve this?
THanks
Charles
I think your grammar is context sensitive, even at the lexical analyser(Tokenizer) level. The string "1m" is recognized as IDENT, not INTEGER followed by 'm'. You either redefine your syntax, or use predicated parsing, or embed Java code in your grammar to detect the context (e.g. If the number is presented after "span" followed by "=", then parse it as INTEGER).