Why , in pase tree , ANTLR keep mismatching input? - antlr

I am new with ANTLR4 and I am trying to visualize the Parse Tree of a text input in a simple form :
grammar Expr;
contract: (I WS SEND WS quantity WS asset WS TO WS beneficiary WS ON WS send_date WS)*;
asset: '$'| 'TND' | 'USD';
quantity:Q;
beneficiary: B;
send_date : day SLASH month SLASH year;
day: D ;
month: M ;
year: Y ;
B : LETTERUP (LETTERLOW+)+ LETTERLOW*;
Q : DIGITO DIGITZ*|DIGITO DIGITZ* POINT DIGITZ*;
D : DIGIT0 DIGITO|(DIGIT1|DIGIT2)DIGITZ|DIGIT3(DIGIT0|DIGIT1);
M : DIGIT0 DIGITO| DIGIT1(DIGIT0|DIGIT1|DIGIT2);
Y : DIGIT2 DIGIT0((DIGIT1(DIGIT7|DIGIT8|DIGIT9))|(DIGIT2 DIGITZ));
I: 'I';
SEND: 'send';
TO:'to' ;
ON: 'on';
LETTER : [a-zA-Z];
LETTERUP : [A-Z];
LETTERLOW : [a-z];
DIGITZ : [0-9];
DIGITO : [1-9];
DIGIT0 : [0];
DIGIT1 : [1];
DIGIT2 : [2];
DIGIT3 : [3];
DIGIT4 : [4];
DIGIT5 : [5];
DIGIT6 : [6];
DIGIT7 : [7];
DIGIT8 : [8];
DIGIT9 : [9];
SLASH:'/';
POINT:'.'|',';
WS : (' ' | '\t' |'\n' |'\r' )+ ;
But it keeps mismatching the send_date as you can see here:
I know it is a seriously complex numerical grammar I did just want some control the 01<= day <= 31 , 01<= month <= 12 and 2017<= year <= 2029 that's all
is there any help? and thanks

The problem happens because your grammar is ambiguous. 07 can match D and 2017 can match Q.
You can fix it like this:
grammar Expr;
contract: (I WS SEND WS quantity WS asset WS TO WS beneficiary WS ON WS send_date WS)*;
asset: '$'| 'TND' | 'USD';
quantity:Q;
beneficiary: B;
send_date : day month year ;
day: D ;
month: M ;
year: Y ;
D : DIGIT0 DIGITO|(DIGIT1|DIGIT2)DIGITZ|DIGIT3(DIGIT0|DIGIT1);
M : SLASH (DIGIT0 DIGITO| DIGIT1(DIGIT0|DIGIT1|DIGIT2));
Y : SLASH (DIGIT2 DIGIT0((DIGIT1(DIGIT7|DIGIT8|DIGIT9))|(DIGIT2 DIGITZ)));
B : LETTERUP (LETTERLOW+)+ LETTERLOW*;
Q : DIGITO DIGITZ*|DIGITO DIGITZ* POINT DIGITZ*;
I: 'I';
SEND: 'send';
TO:'to' ;
ON: 'on';
LETTER : [a-zA-Z];
LETTERUP : [A-Z];
LETTERLOW : [a-z];
DIGITZ : [0-9];
DIGITO : [1-9];
DIGIT0 : [0];
DIGIT1 : [1];
DIGIT2 : [2];
DIGIT3 : [3];
DIGIT4 : [4];
DIGIT5 : [5];
DIGIT6 : [6];
DIGIT7 : [7];
DIGIT8 : [8];
DIGIT9 : [9];
SLASH:'/';
POINT:'.'|',';
WS : (' ' | '\t' |'\n' |'\r' )+ ;

That's a seriously complex numerical grammar. Perhaps you could simplify:
day: NUMBER ;
month: NUMBER ;
year: NUMBER ;
NUMBER : DIGITZ+ ;
DIGITZ : [0-9];
You could enforce semantics like limiting year to [2017...2020] or whatever in your code. Just an idea. Simplifying often helps and then you can enhance it from there, knowing if you make a mistake you can always revert to something that will at least work.
EDIT:
The reason your grammar doesn't work is because the month is being lexed as a day:
[#0,0:0='I',<'I'>,1:0]
[#1,1:1=' ',<WS>,1:1]
[#2,2:5='send',<'send'>,1:2]
[#3,6:6=' ',<WS>,1:6]
[#4,7:9='300',<Q>,1:7]
[#5,10:10=' ',<WS>,1:10]
[#6,11:11='$',<'$'>,1:11]
[#7,12:12=' ',<WS>,1:12]
[#8,13:14='to',<'to'>,1:13]
[#9,15:15=' ',<WS>,1:15]
[#10,16:20='Ahmed',<B>,1:16]
[#11,21:21=' ',<WS>,1:21]
[#12,22:23='on',<'on'>,1:22]
[#13,24:24=' ',<WS>,1:24]
[#14,25:26='03',<D>,1:25]
[#15,27:27='/',<'/'>,1:27]
[#16,28:29='07',<D>,1:28] <-- see, this is being lexed as a D (day)
[#17,30:30='/',<'/'>,1:30]
[#18,31:34='2017',<Q>,1:31] <-- and this is being lexed as a Q (quantity)
[#19,35:36='\r\n',<WS>,1:35]
[#20,37:36='<EOF>',<EOF>,2:0]
line 1:28 mismatched input '05' expecting M
line 1:31 mismatched input '2017' expecting Y
Lexer rules are applied in the order in which they appear, and Day appears before Month. Quantity appears before Year. Hence the improper lexing. This is a scenario, honestly, where I think you need to simplify and just accept numbers. Then in your code, enforce the semantics (make sure year is in range, etc) in your code and provide a helpful error message to the user if values are not in range. Your total effort spend will be less that way.
NEW VERSION
grammar Test2;
contract: (I SEND quantity asset TO beneficiary ON send_date)*;
asset: '$'| 'TND' | 'USD';
send_date : DATE ;
quantity: NUMBER;
beneficiary: B;
DATE : NUMBER SLASH NUMBER SLASH NUMBER ;
B : LETTERUP (LETTERLOW+)+ LETTERLOW*;
I: 'I';
SEND: 'send';
TO:'to' ;
ON: 'on';
LETTER : [a-zA-Z];
LETTERUP : [A-Z];
LETTERLOW : [a-z];
NUMBER: DIGIT+;
DIGIT : [0-9];
SLASH:'/';
POINT:'.'|',';
WS : [ \t\n\r]+ -> skip;
Improvements:
1. Better handling of whitespace much more conventional.
2. Simplified number syntax.
3. It works
[#0,0:0='I',<'I'>,1:0]
[#1,2:5='send',<'send'>,1:2]
[#2,7:9='300',<NUMBER>,1:7]
[#3,11:11='$',<'$'>,1:11]
[#4,13:14='to',<'to'>,1:13]
[#5,16:20='Ahmed',<B>,1:16]
[#6,22:23='on',<'on'>,1:22]
[#7,25:34='03/07/2017',<DATE>,1:25]
[#8,37:36='<EOF>',<EOF>,2:0]
Problem: I simplified away the ability to do decimal numbers for quantity. You can add that back in as you wish.

Related

ANTLR4: matching token with same rule but with different position in the grammar

I have the following statement I wish to parse:
in(name,(Silver,Gold))
in: is a function.
name: is a ID.
(Silver, Gold): is string array with elements 'Silver', and 'Gold'.
The parser is always confused as ID and string array elements have the same rule. Using quotes or double quotes for string will help, but this is not the case here.
Also, predicates didn't help much.
The grammar:
grammar Rql;
statement
: EOF
| query EOF
;
query
: function
;
function
: FUNCTION_IN OPAR id COMMA OPAR array CPAR CPAR
;
array
: VALUE (COMMA VALUE)*
;
FUNCTION_IN: 'in';
id
: {in(}? ID
;
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
VALUE
: STRING
| INT
| FLOAT
;
OPAR : '(';
CPAR : ')';
COMMA : ',';
INT
: [0-9]+
;
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
SPACE
: [ \t\r\n] -> skip
;
STRING
: [a-zA-Z_] [a-zA-Z_0-9]*
;
OTHER
: .
;
The idea is to change the type of the token under some condition. Here seeing an ID for the first time in a line sets a switch to true. The next time an ID is matched, the lexer will execute the if and set the type to ID_VALUE. I wanted to reset the switch while entering the rule function, but it doesn't work :
function
#init {QuestionLexer.id_seen = false; System.out.println("id_seen has been reset" + QuestionLexer.id_seen);}
: FUNCTION_IN OPAR ID COMMA OPAR array CPAR CPAR
ID=name1 seen ? false
ID=Silver seen ? true
...
ID=Platinum seen ? true
[#0,0:1='in',<'in'>,1:0]
[#1,2:2='(',<'('>,1:2]
[#2,3:7='name1',<ID>,1:3]
[#3,8:8=',',<','>,1:8]
[#4,9:9='(',<'('>,1:9]
[#5,10:15='Silver',<10>,1:10]
...
[#12,27:31='name2',<10>,2:3]
...
[#20,52:51='<EOF>',<EOF>,3:0]
Question last update 1336
id_seen has been reset false
id_seen has been reset false
line 2:3 mismatched input 'name2' expecting ID
.
That's why I reset it in the FUNCTION_IN rule.
Grammar Question.g4 :
grammar Question;
#lexer::members {
static boolean id_seen = false;
}
tokens { ID_VALUE }
question
#init {System.out.println("Question last update 1352");}
: function+ EOF
;
function
: FUNCTION_IN OPAR ID COMMA OPAR array CPAR CPAR
;
array
: value (COMMA value)*
;
value
: ID_VALUE
| INT
| FLOAT
;
FUNCTION_IN: 'in' {id_seen = false;} ;
ID : [a-zA-Z_] [a-zA-Z_0-9]*
{System.out.println("ID=" + getText() + " seen ? " + id_seen);
if (id_seen) setType(QuestionParser.ID_VALUE); id_seen = true; } ;
OPAR : '(';
CPAR : ')';
COMMA : ',';
INT
: [0-9]+
;
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
SPACE
: [ \t\r\n] -> skip
;
OTHER
: .
;
File t.text :
in(name1,(Silver,Gold))
in(name2,(Copper,Platinum))
Execution with ANTLR 4.6 :
$ grun Question question -tokens -diagnostics t.text
ID=name1 seen ? false
ID=Silver seen ? true
ID=Gold seen ? true
ID=name2 seen ? false
ID=Copper seen ? true
ID=Platinum seen ? true
[#0,0:1='in',<'in'>,1:0]
[#1,2:2='(',<'('>,1:2]
[#2,3:7='name1',<ID>,1:3]
[#3,8:8=',',<','>,1:8]
[#4,9:9='(',<'('>,1:9]
[#5,10:15='Silver',<10>,1:10]
[#6,16:16=',',<','>,1:16]
[#7,17:20='Gold',<10>,1:17]
[#8,21:21=')',<')'>,1:21]
[#9,22:22=')',<')'>,1:22]
[#10,24:25='in',<'in'>,2:0]
[#11,26:26='(',<'('>,2:2]
[#12,27:31='name2',<ID>,2:3]
[#13,32:32=',',<','>,2:8]
[#14,33:33='(',<'('>,2:9]
[#15,34:39='Copper',<10>,2:10]
[#16,40:40=',',<','>,2:16]
[#17,41:48='Platinum',<10>,2:17]
[#18,49:49=')',<')'>,2:25]
[#19,50:50=')',<')'>,2:26]
[#20,52:51='<EOF>',<EOF>,3:0]
Question last update 1352
Type <10> is ID_VALUE as can be seen in the .tokens file
$ cat Question.tokens
FUNCTION_IN=1
...
OTHER=9
ID_VALUE=10
'in'=1

"The following sets of rules are mutually left-recursive"

I have tried to write a grammar to recognize expressions like:
(A + MAX(B) ) / ( C - AVERAGE(A) )
IF( A > AVERAGE(A), 0, 1 )
X / (MAX(X)
Unfortunately antlr3 fails with these errors:
error(210): The following sets of rules are mutually left-recursive [unaryExpression, additiveExpression, primaryExpression, formula, multiplicativeExpression]
error(211): DerivedKeywords.g:110:13: [fatal] rule booleanTerm has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
error(206): DerivedKeywords.g:110:13: Alternative 1: after matching input such as decision cannot predict what comes next due to recursion overflow to additiveExpression from formula
I have spent some hours trying to fix these, it would be great if anyone could at least help me fix the first problem. Thanks
Code:
grammar DerivedKeywords;
options {
output=AST;
//backtrack=true;
}
WS : ( ' ' | '\t' | '\n' | '\r' )
{ skip(); }
;
//for numbers
DIGIT
: '0'..'9'
;
//for both integer and real number
NUMBER
: (DIGIT)+ ( '.' (DIGIT)+ )?( ('E'|'e')('+'|'-')?(DIGIT)+ )?
;
// Boolean operatos
AND : 'AND';
OR : 'OR';
NOT : 'NOT';
EQ : '=';
NEQ : '!=';
GT : '>';
LT : '<';
GTE : '>=';
LTE : '<=';
COMMA : ',';
// Token for Functions
IF : 'IF';
MAX : 'MAX';
MIN : 'MIN';
AVERAGE : 'AVERAGE';
VARIABLE : 'A'..'Z' ('A'..'Z' | '0'..'9')*
;
// OPERATORS
LPAREN : '(' ;
RPAREN : ')' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
STAR : '*' ;
expression : formula;
formula
: functionExpression
| additiveExpression
| LPAREN! a=formula RPAREN! // First Problem
;
additiveExpression
: a=multiplicativeExpression ( (MINUS^ | PLUS^ ) b=multiplicativeExpression )*
;
multiplicativeExpression
: a=unaryExpression ( (STAR^ | DIV^ ) b=unaryExpression )*
;
unaryExpression
: MINUS^ u=unaryExpression
| primaryExpression
;
functionExpression
: f=functionOperator LPAREN e=formula RPAREN
| IF LPAREN b=booleanExpression COMMA p=formula COMMA s=formula RPAREN
;
functionOperator :
MAX | MIN | AVERAGE;
primaryExpression
: NUMBER
// Used for scientific numbers
| DIGIT
| VARIABLE
| formula
;
// Boolean stuff
booleanExpression
: orExpression;
orExpression : a=andExpression (OR^ b=andExpression )*
;
andExpression
: a=notExpression (AND^ b=notExpression )*
;
notExpression
: NOT^ t=booleanTerm
| booleanTerm
;
booleanOperator :
GT | LT | EQ | GTE | LTE | NEQ;
booleanTerm : a=formula op=booleanOperator b=formula
| LPAREN! booleanTerm RPAREN! // Second problem
;
error(210): The following sets of rules are mutually left-recursive [unaryExpression, additiveExpression, primaryExpression, formula, multiplicativeExpression]
- this means that if the parser enters unaryExpression rule, it has the possibility to match additiveExpression, primaryExpression, formula, multiplicativeExpression and unaryExpression again without ever consuming a single token from input - so it cannot decide whether to use those rules or not, because even if it uses the rules, the input will be the same.
You're probably trying to allow subexpressions in expressions by this sequence of rules - you need to make sure that path will consume the left parenthesis of the subexpression. Probably the formula alternative in primaryExpression should be changed to LPAREN formula RPAREN, and the rest of grammar be adjusted accordingly.

ParserRule matching the wrong token

I'm trying to learn a bit ANTLR4 and define a grammar for some 4GL language.
This is what I've got:
compileUnit
:
typedeclaration EOF
;
typedeclaration
:
ID LPAREN DATATYPE INT RPAREN
;
DATATYPE
:
DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC
:
'A'
;
DATATYPE_NUMERIC
:
'N'
;
fragment
DIGIT
:
[0-9]
;
fragment
LETTER
:
[a-zA-Z]
;
INT
:
DIGIT+
;
ID
:
LETTER
(
LETTER
| DIGIT
)*
;
LPAREN
:
'('
;
RPAREN
:
')'
;
WS
:
[ \t\f]+ -> skip
;
What I want to be able to parse:
TEST (A10)
what I get:
typedeclaration:1:6: mismatched input 'A10' expecting DATATYPE
I am however able to write:
TEST (A 10)
Why do I need to put a whitespace in here? The LPAREN DATATYPE in itself is working, so there is no need for a space inbetween. Also the INT RPAREN is working.
Why is a space needed between DATATYPE and INT? I'm a bit confused on that one.
I guess that it's matching ID because it's the "longest" match, but there must be some way to force to be lazier here, right?
You should ignore 'A' and 'N' chats at first position of ID. As #CoronA noticed ANTLR matches token as long as possible (length of ID 'A10' more than length of DATATYPE_ALPHANUMERIC 'A'). Also read this: Priority rules. Try to use the following grammar:
grammar expr;
compileUnit
: typedeclaration EOF
;
typedeclaration
: ID LPAREN datatype INT RPAREN
;
datatype
: DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC
: 'A'
;
DATATYPE_NUMERIC
: 'N'
;
INT
: DIGIT+
;
ID
: [b-mo-zB-MO-Z] (LETTER | DIGIT)*
;
LPAREN
: '('
;
RPAREN
: ')'
;
WS
: [ \t\f]+ -> skip
;
fragment
DIGIT
: [0-9]
;
fragment
LETTER
: [a-zA-Z]
;
Also you can use the following grammar without id restriction. Data types will be recognized earlier than letters. it's not clear too:
grammar expr;
compileUnit
: typedeclaration EOF
;
typedeclaration
: id LPAREN datatype DIGIT+ RPAREN
;
id
: (datatype | LETTER) (datatype | LETTER | DIGIT)*
;
datatype
: DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC: 'A';
DATATYPE_NUMERIC: 'N';
// List with another Data types.
LETTER: [a-zA-Z];
LPAREN
: '('
;
RPAREN
: ')'
;
WS
: [ \t\f]+ -> skip
;
DIGIT
: [0-9]
;

Intellij Antlr4 Plugin Left direct recursion doesn't work

I'm trying to make parser using Antlr4 for the sql select statement, in which contains the following part
expr: '1' | expr('*'|'/'|'+'|'-'|'||') expr; // As the re-factored form of expression: compound expression;
WS :[ \t\r\n]+ -> skip ;
I suppose this rule will allow the following sets of result:
1
1+1
1+1-1
....
But in the graph it shows that it cannot be parsed
Does anyone get the idea why it cannot be parsed like what i expected?
This slightly adjusted grammar works for me. Tested on input 1+1-1||1*1-1/1. Tested in ANTLRWorks2.1
grammar myGrammar;
top : expr EOF ;
expr : '1'
| expr '+' expr
| expr '*' expr
| expr '/' expr
| expr '+' expr
| expr '-' expr
| expr '||' expr
;
WS :[ \t\r\n]+ -> skip ;
One : '1' ;
Times : '*' ;
Div : '/' ;
Plus : '+' ;
Minus : '-' ;
Or : '||' ;
EDIT
I was able to get this to work, too, when matching the rule top:
grammar newEmptyCombinedGrammar;
top : expr EOF ;
expr: one
| expr op=(Times|Div|Plus|Minus|Or) expr
;
one : One ;
One : '1' ;
Times : '*' ;
Div : '/' ;
Plus : '+' ;
Minus : '-' ;
Or : '||' ;
WS :[ \t\r\n]+ -> skip ;

Antlr grammar won't work

My grammar won't accept inputs like "x = 4, xy = 1"
grammar Ex5;
prog : stat+;
stat : ID '=' NUMBER;
LETTER : ('a'..'z'|'A'..'Z');
DIGIT : ('0'..'9');
ID : LETTER+;
NUMBER : DIGIT+;
WS : (' '|'\t'| '\r'? '\n' )+ {skip();};
What do i have to do to make it accept a multitude of inputs like ID = NUMBER??
Thank you in advance.
You'll have to account for the comma, ',', inside your grammar. Also, since you (most probably) don't want LETTER and DIGIT tokens to be created since they are only used in other lexer rules, you should make these fragments:
grammar Ex5;
prog : stat (',' stat)*;
stat : ID '=' NUMBER;
ID : LETTER+;
NUMBER : DIGIT+;
fragment LETTER : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
WS : (' '|'\t'| '\r'? '\n')+ {skip();};