ANTLR4: matching token with same rule but with different position in the grammar - antlr

I have the following statement I wish to parse:
in(name,(Silver,Gold))
in: is a function.
name: is a ID.
(Silver, Gold): is string array with elements 'Silver', and 'Gold'.
The parser is always confused as ID and string array elements have the same rule. Using quotes or double quotes for string will help, but this is not the case here.
Also, predicates didn't help much.
The grammar:
grammar Rql;
statement
: EOF
| query EOF
;
query
: function
;
function
: FUNCTION_IN OPAR id COMMA OPAR array CPAR CPAR
;
array
: VALUE (COMMA VALUE)*
;
FUNCTION_IN: 'in';
id
: {in(}? ID
;
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
VALUE
: STRING
| INT
| FLOAT
;
OPAR : '(';
CPAR : ')';
COMMA : ',';
INT
: [0-9]+
;
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
SPACE
: [ \t\r\n] -> skip
;
STRING
: [a-zA-Z_] [a-zA-Z_0-9]*
;
OTHER
: .
;

The idea is to change the type of the token under some condition. Here, seeing an ID for the first time after FUNCTION_IN sets a switch to true. The next time an ID is matched, the lexer executes the if and sets the type to ID_VALUE. I wanted to reset the switch on entering the parser rule function, but that doesn't work:
function
@init {QuestionLexer.id_seen = false; System.out.println("id_seen has been reset" + QuestionLexer.id_seen);}
: FUNCTION_IN OPAR ID COMMA OPAR array CPAR CPAR
ID=name1 seen ? false
ID=Silver seen ? true
...
ID=Platinum seen ? true
[@0,0:1='in',<'in'>,1:0]
[@1,2:2='(',<'('>,1:2]
[@2,3:7='name1',<ID>,1:3]
[@3,8:8=',',<','>,1:8]
[@4,9:9='(',<'('>,1:9]
[@5,10:15='Silver',<10>,1:10]
...
[@12,27:31='name2',<10>,2:3]
...
[@20,52:51='<EOF>',<EOF>,3:0]
Question last update 1336
id_seen has been reset false
id_seen has been reset false
line 2:3 mismatched input 'name2' expecting ID
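The trace shows that both resets are printed only after the lexer has already produced every token, so the switch is reset too late. That's why I reset it in the FUNCTION_IN lexer rule instead.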
Grammar Question.g4 :
grammar Question;
@lexer::members {
static boolean id_seen = false;
}
tokens { ID_VALUE }
question
@init {System.out.println("Question last update 1352");}
: function+ EOF
;
function
: FUNCTION_IN OPAR ID COMMA OPAR array CPAR CPAR
;
array
: value (COMMA value)*
;
value
: ID_VALUE
| INT
| FLOAT
;
FUNCTION_IN: 'in' {id_seen = false;} ;
ID : [a-zA-Z_] [a-zA-Z_0-9]*
{System.out.println("ID=" + getText() + " seen ? " + id_seen);
if (id_seen) setType(QuestionParser.ID_VALUE); id_seen = true; } ;
OPAR : '(';
CPAR : ')';
COMMA : ',';
INT
: [0-9]+
;
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
SPACE
: [ \t\r\n] -> skip
;
OTHER
: .
;
File t.text :
in(name1,(Silver,Gold))
in(name2,(Copper,Platinum))
Execution with ANTLR 4.6 :
$ grun Question question -tokens -diagnostics t.text
ID=name1 seen ? false
ID=Silver seen ? true
ID=Gold seen ? true
ID=name2 seen ? false
ID=Copper seen ? true
ID=Platinum seen ? true
[@0,0:1='in',<'in'>,1:0]
[@1,2:2='(',<'('>,1:2]
[@2,3:7='name1',<ID>,1:3]
[@3,8:8=',',<','>,1:8]
[@4,9:9='(',<'('>,1:9]
[@5,10:15='Silver',<10>,1:10]
[@6,16:16=',',<','>,1:16]
[@7,17:20='Gold',<10>,1:17]
[@8,21:21=')',<')'>,1:21]
[@9,22:22=')',<')'>,1:22]
[@10,24:25='in',<'in'>,2:0]
[@11,26:26='(',<'('>,2:2]
[@12,27:31='name2',<ID>,2:3]
[@13,32:32=',',<','>,2:8]
[@14,33:33='(',<'('>,2:9]
[@15,34:39='Copper',<10>,2:10]
[@16,40:40=',',<','>,2:16]
[@17,41:48='Platinum',<10>,2:17]
[@18,49:49=')',<')'>,2:25]
[@19,50:50=')',<')'>,2:26]
[@20,52:51='<EOF>',<EOF>,3:0]
Question last update 1352
Type <10> is ID_VALUE as can be seen in the .tokens file
$ cat Question.tokens
FUNCTION_IN=1
...
OTHER=9
ID_VALUE=10
'in'=1
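The switch logic above can be tried outside ANTLR. The following is only a rough Python model of it, not generated ANTLR code: the regular expression and the tokenize function are hypothetical stand-ins for the lexer, and the token names are copied from the grammar.

```python
import re

# Hypothetical stand-in for the lexer rules of Question.g4.
TOKEN_RE = re.compile(r"""
    (?P<FLOAT>\d+\.\d*|\.\d+)
  | (?P<INT>\d+)
  | (?P<ID>[A-Za-z_][A-Za-z_0-9]*)
  | (?P<OPAR>\()
  | (?P<CPAR>\))
  | (?P<COMMA>,)
  | (?P<SPACE>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Return (type, text) pairs, retyping every ID after the first one."""
    tokens = []
    id_seen = False  # plays the role of the static id_seen switch
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        pos = match.end()
        kind, value = match.lastgroup, match.group()
        if kind == "SPACE":
            continue  # whitespace is skipped, as in the grammar
        if kind == "ID":
            if value == "in":
                kind = "FUNCTION_IN"
                id_seen = False  # reset in the lexer, not in the parser
            elif id_seen:
                kind = "ID_VALUE"
            else:
                id_seen = True
        tokens.append((kind, value))
    return tokens


if __name__ == "__main__":
    for token in tokenize("in(name1,(Silver,Gold))\nin(name2,(Copper,Platinum))"):
        print(token)
```

Because the reset happens while tokenizing, it does not matter how far ahead of the parser the lexer runs.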

Related

"The following sets of rules are mutually left-recursive"

I have tried to write a grammar to recognize expressions like:
(A + MAX(B) ) / ( C - AVERAGE(A) )
IF( A > AVERAGE(A), 0, 1 )
X / (MAX(X)
Unfortunately antlr3 fails with these errors:
error(210): The following sets of rules are mutually left-recursive [unaryExpression, additiveExpression, primaryExpression, formula, multiplicativeExpression]
error(211): DerivedKeywords.g:110:13: [fatal] rule booleanTerm has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
error(206): DerivedKeywords.g:110:13: Alternative 1: after matching input such as decision cannot predict what comes next due to recursion overflow to additiveExpression from formula
I have spent some hours trying to fix these, it would be great if anyone could at least help me fix the first problem. Thanks
Code:
grammar DerivedKeywords;
options {
output=AST;
//backtrack=true;
}
WS : ( ' ' | '\t' | '\n' | '\r' )
{ skip(); }
;
//for numbers
DIGIT
: '0'..'9'
;
//for both integer and real number
NUMBER
: (DIGIT)+ ( '.' (DIGIT)+ )?( ('E'|'e')('+'|'-')?(DIGIT)+ )?
;
// Boolean operators
AND : 'AND';
OR : 'OR';
NOT : 'NOT';
EQ : '=';
NEQ : '!=';
GT : '>';
LT : '<';
GTE : '>=';
LTE : '<=';
COMMA : ',';
// Token for Functions
IF : 'IF';
MAX : 'MAX';
MIN : 'MIN';
AVERAGE : 'AVERAGE';
VARIABLE : 'A'..'Z' ('A'..'Z' | '0'..'9')*
;
// OPERATORS
LPAREN : '(' ;
RPAREN : ')' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
STAR : '*' ;
expression : formula;
formula
: functionExpression
| additiveExpression
| LPAREN! a=formula RPAREN! // First Problem
;
additiveExpression
: a=multiplicativeExpression ( (MINUS^ | PLUS^ ) b=multiplicativeExpression )*
;
multiplicativeExpression
: a=unaryExpression ( (STAR^ | DIV^ ) b=unaryExpression )*
;
unaryExpression
: MINUS^ u=unaryExpression
| primaryExpression
;
functionExpression
: f=functionOperator LPAREN e=formula RPAREN
| IF LPAREN b=booleanExpression COMMA p=formula COMMA s=formula RPAREN
;
functionOperator :
MAX | MIN | AVERAGE;
primaryExpression
: NUMBER
// Used for scientific numbers
| DIGIT
| VARIABLE
| formula
;
// Boolean stuff
booleanExpression
: orExpression;
orExpression : a=andExpression (OR^ b=andExpression )*
;
andExpression
: a=notExpression (AND^ b=notExpression )*
;
notExpression
: NOT^ t=booleanTerm
| booleanTerm
;
booleanOperator :
GT | LT | EQ | GTE | LTE | NEQ;
booleanTerm : a=formula op=booleanOperator b=formula
| LPAREN! booleanTerm RPAREN! // Second problem
;
error(210): The following sets of rules are mutually left-recursive [unaryExpression, additiveExpression, primaryExpression, formula, multiplicativeExpression]
This means that if the parser enters the unaryExpression rule, it can reach additiveExpression, primaryExpression, formula, multiplicativeExpression and unaryExpression again without ever consuming a single token from the input. It cannot decide whether to apply those rules, because the input is still the same after applying them.
You're probably trying to allow subexpressions in expressions by this sequence of rules - you need to make sure that path will consume the left parenthesis of the subexpression. Probably the formula alternative in primaryExpression should be changed to LPAREN formula RPAREN, and the rest of grammar be adjusted accordingly.
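To illustrate why consuming the left parenthesis removes the left recursion, here is a small hand-written recursive-descent sketch in Python. It is not ANTLR output and covers only the arithmetic part of the grammar (no IF, no boolean rules). The class and function names are made up for the example, and characters the tokenizer does not recognise are silently ignored.

```python
import re

FUNCTIONS = {"MAX", "MIN", "AVERAGE"}


def tokenize(text):
    # Numbers, upper-case names and single-character operators.
    return re.findall(r"\d+(?:\.\d+)?|[A-Z][A-Z0-9]*|[-+*/()]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def formula(self):  # additive level
        node = self.multiplicative()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.multiplicative())
        return node

    def multiplicative(self):
        node = self.unary()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.unary())
        return node

    def unary(self):
        if self.peek() == "-":
            self.take("-")
            return ("neg", self.unary())
        return self.primary()

    def primary(self):
        tok = self.take()  # always consumes a token before recursing
        if tok == "(":
            node = self.formula()
            self.take(")")
            return node
        if tok in FUNCTIONS:
            self.take("(")
            arg = self.formula()
            self.take(")")
            return (tok, arg)
        if tok[0].isalnum():
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.formula()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input at {parser.peek()!r}")
    return tree
```

Every path from formula back to formula now passes through primary, which takes a token first, so the recursion always makes progress.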

ParserRule matching the wrong token

I'm trying to learn a bit of ANTLR4 and define a grammar for a 4GL language.
This is what I've got:
compileUnit
:
typedeclaration EOF
;
typedeclaration
:
ID LPAREN DATATYPE INT RPAREN
;
DATATYPE
:
DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC
:
'A'
;
DATATYPE_NUMERIC
:
'N'
;
fragment
DIGIT
:
[0-9]
;
fragment
LETTER
:
[a-zA-Z]
;
INT
:
DIGIT+
;
ID
:
LETTER
(
LETTER
| DIGIT
)*
;
LPAREN
:
'('
;
RPAREN
:
')'
;
WS
:
[ \t\f]+ -> skip
;
What I want to be able to parse:
TEST (A10)
what I get:
typedeclaration:1:6: mismatched input 'A10' expecting DATATYPE
I am however able to write:
TEST (A 10)
Why do I need to put a whitespace in here? The LPAREN DATATYPE in itself is working, so there is no need for a space inbetween. Also the INT RPAREN is working.
Why is a space needed between DATATYPE and INT? I'm a bit confused on that one.
I guess that it's matching ID because it's the "longest" match, but there must be some way to force to be lazier here, right?
You should exclude the characters 'A' and 'N' from the first position of ID. As @CoronA noticed, ANTLR matches the longest possible token (the ID 'A10' is longer than the DATATYPE_ALPHANUMERIC 'A'). Also read this: Priority rules. Try the following grammar:
grammar expr;
compileUnit
: typedeclaration EOF
;
typedeclaration
: ID LPAREN datatype INT RPAREN
;
datatype
: DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC
: 'A'
;
DATATYPE_NUMERIC
: 'N'
;
INT
: DIGIT+
;
ID
: [b-mo-zB-MO-Z] (LETTER | DIGIT)*
;
LPAREN
: '('
;
RPAREN
: ')'
;
WS
: [ \t\f]+ -> skip
;
fragment
DIGIT
: [0-9]
;
fragment
LETTER
: [a-zA-Z]
;
Alternatively, you can use the following grammar without the ID restriction. The data type tokens are recognized before single letters because their lexer rules come first. This is not very clean either:
grammar expr;
compileUnit
: typedeclaration EOF
;
typedeclaration
: id LPAREN datatype DIGIT+ RPAREN
;
id
: (datatype | LETTER) (datatype | LETTER | DIGIT)*
;
datatype
: DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC: 'A';
DATATYPE_NUMERIC: 'N';
// List with another Data types.
LETTER: [a-zA-Z];
LPAREN
: '('
;
RPAREN
: ')'
;
WS
: [ \t\f]+ -> skip
;
DIGIT
: [0-9]
;
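The longest-match behaviour described in this answer can be reproduced with a few lines of Python. This is only a simplified model of how a lexer picks tokens, not ANTLR's actual implementation. The rule list mirrors the lexer rules from the question, and the tokenize function is a hypothetical helper.

```python
import re

# Lexer rules in grammar order; on equal length the earlier rule wins.
RULES = [
    ("DATATYPE", re.compile(r"[AN]")),
    ("INT", re.compile(r"[0-9]+")),
    ("ID", re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
    ("LPAREN", re.compile(r"\(")),
    ("RPAREN", re.compile(r"\)")),
    ("WS", re.compile(r"[ \t\f]+")),
]


def tokenize(text):
    """Pick the longest match at each position, first rule wins ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strictly longer only, so an earlier rule keeps a tie.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"token recognition error at {text[pos]!r}")
        if best_name != "WS":  # whitespace is skipped
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


if __name__ == "__main__":
    print(tokenize("TEST (A10)"))
    print(tokenize("TEST (A 10)"))
```

With the space, 'A' is matched by both DATATYPE and ID at the same length and the earlier rule wins. Without it, ID matches the longer 'A10' and DATATYPE never gets a chance.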

Guide or approval for ANTLR example

I have an AlgebraRelacional.g4 file with this. I need to read a file with a syntax like a CSV file, put the content in some memory tables and then resolve relational algebra operations with that. Can you tell me if I am doing it right?
Example data file to read:
cod_buy(char);name_suc(char);Import(int);date_buy(date)
“P-11”;”DC Med”;900;01/03/14
“P-14”;”Center”;1500;02/05/14
Current ANTLR grammar:
grammar AlgebraRelacional;
SEL : '\u03C3'
;
PRO : '\u220F'
;
UNI : '\u222A'
;
DIF : '\u002D'
;
PROC : '\u0058'
;
INT : '\u2229'
;
AND : 'AND'
;
OR : 'OR'
;
NOT : 'NOT'
;
EQ : '='
;
DIFERENTE : '!='
;
MAYOR : '>'
;
MENOR : '<'
;
SUMA : '+'
;
MULTI : '*'
;
IPAREN : '('
;
DPAREN : ')'
;
COMA : ','
;
PCOMA : ';'
;
Comillas: '"'
;
file : hdr row+ ;
hdr : row ;
row : field (',' field)* '\r'? '\n' ;
field : TEXT | STRING | ;
TEXT : ~[,\n\r"]+ ;
STRING : '"' ('""'|~'"')* '"' ;
I suggest that you read this document (http://is.muni.cz/th/208197/fi_b/bc_thesis.pdf). It contains useful information about how to write a parser for relational algebra. It does not use ANTLR, but you only have to translate the grammar from BNF to EBNF.

Antlr grammar for parsing simple expression

I would like to parse the following expressions with antlr4
termspannear ( xxx, xxx , 5 , true )
termspannear ( xxx, termspannear ( xxx, xxx , 5 , true ) , 5 , true )
Where termspannear functions can be nested
Here is my grammar:
//Define a gramar to parse TermSpanNear
grammar TermSpanNear;
start : TERMSPAN ;
TERMSPAN : TERMSPANNEAR | 'xxx' ;
TERMSPANNEAR: 'termspannear' OPENP BODY CLOSEP ;
BODY : TERMSPAN COMMA TERMSPAN COMMA SLOP COMMA ORDERED ;
COMMA : ',' ;
OPENP : '(' ;
CLOSEP : ')' ;
SLOP : [0-9]+ ;
ORDERED : 'true' | 'false' ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
After running:
antlr4 TermSpanNear.g4
javac TermSpanNear*.java
grun TermSpanNear start -gui
termspannear ( xxx, xxx , 5 , true )
^D
line 1:0 token recognition error at: 'termspannear '
line 1:13 extraneous input '(' expecting TERMSPAN
and the tree looks like:
Can someone help me with this grammar ?
So the parsed tree contains all params and and also nesting works
NOTE:
Following a suggestion, I rewrote it to
//Define a gramar to parse TermSpanNear
grammar TermSpanNear;
start : termspan EOF;
termspan : termspannear | 'xxx' ;
termspannear: 'termspannear' '(' body ')' ;
body : termspan ',' termspan ',' SLOP ',' ORDERED ;
SLOP : [0-9]+ ;
ORDERED : 'true' | 'false' ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
I think now it works
I'm getting the following trees:
For
termspannear ( xxx, xxx , 5 , true )
For
termspannear ( xxx, termspannear ( xxx, xxx , 5 , true ) , 5 , true )
You're using way too many lexer rules.
When you're defining a token like this:
BODY : TERMSPAN COMMA TERMSPAN COMMA SLOP COMMA ORDERED ;
then the tokenizer (lexer) will try to create the (single!) token: xxx,xxx,5,true. That is, it does not allow any spaces in between. Lexer rules (the ones starting with a capital letter) should really be the "atoms" of your language (the smallest parts). Whenever you start creating elements like a body, you glue atoms together in parser rules, not in lexer rules.
Try something like this:
grammar TermSpanNear;
// parser rules (the elements)
start : termspan EOF ;
termspan : termspannear | 'xxx' ;
termspannear : 'termspannear' OPENP body CLOSEP ;
body : termspan COMMA termspan COMMA SLOP COMMA ORDERED ;
// lexer rules (the atoms)
COMMA : ',' ;
OPENP : '(' ;
CLOSEP : ')' ;
SLOP : [0-9]+ ;
ORDERED : 'true' | 'false' ;
WS : [ \t\r\n]+ -> skip ;
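The split between atoms and elements can also be shown without ANTLR. The sketch below is a hand-written Python approximation of the grammar above, not what ANTLR generates: tokenize stands in for the lexer rules, parse_termspan for the parser rules, and all function names are invented for the example. Characters the tokenizer does not recognise are silently dropped.

```python
import re

# The "atoms": keywords, numbers and punctuation. Whitespace never matches,
# which has the same effect as the WS -> skip rule.
TOKEN_RE = re.compile(r"termspannear|xxx|true|false|[0-9]+|[(),]")


def tokenize(text):
    return TOKEN_RE.findall(text)


def expect(tokens, pos, wanted):
    """Check the token at pos and return the next position."""
    if pos >= len(tokens) or tokens[pos] != wanted:
        found = tokens[pos] if pos < len(tokens) else "<EOF>"
        raise SyntaxError(f"expected {wanted!r}, found {found!r}")
    return pos + 1


def parse_termspan(tokens, pos=0):
    """termspan : termspannear | 'xxx' ; returns (tree, next position)."""
    if pos < len(tokens) and tokens[pos] == "xxx":
        return "xxx", pos + 1
    pos = expect(tokens, pos, "termspannear")
    pos = expect(tokens, pos, "(")
    left, pos = parse_termspan(tokens, pos)   # nesting happens here
    pos = expect(tokens, pos, ",")
    right, pos = parse_termspan(tokens, pos)
    pos = expect(tokens, pos, ",")
    if pos >= len(tokens) or not tokens[pos].isdigit():
        raise SyntaxError("expected SLOP")
    slop = int(tokens[pos])
    pos = expect(tokens, pos + 1, ",")
    if pos >= len(tokens) or tokens[pos] not in ("true", "false"):
        raise SyntaxError("expected ORDERED")
    ordered = tokens[pos] == "true"
    pos = expect(tokens, pos + 1, ")")
    return ("termspannear", left, right, slop, ordered), pos


def parse(text):
    tokens = tokenize(text)
    tree, pos = parse_termspan(tokens)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree


if __name__ == "__main__":
    print(parse("termspannear ( xxx, termspannear ( xxx, xxx , 5 , true ) , 5 , true )"))
```

Spaces between atoms are harmless here because the composite structure is assembled after tokenizing, which is the point of moving body into a parser rule.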

ANTLR grammar error

I'm trying to build a C-- compiler using ANTLR 3.4.
The full set of grammar rules is listed here:
program : (vardeclaration | fundeclaration)* ;
vardeclaration : INT ID (OPENSQ NUM CLOSESQ)? SEMICOL ;
fundeclaration : typespecifier ID OPENP params CLOSEP compoundstmt ;
typespecifier : INT | VOID ;
params : VOID | paramlist ;
paramlist : param (COMMA param)* ;
param : INT ID (OPENSQ CLOSESQ)? ;
compoundstmt : OPENCUR vardeclaration* statement* CLOSECUR ;
statementlist : statement* ;
statement : expressionstmt | compoundstmt | selectionstmt | iterationstmt | returnstmt;
expressionstmt : (expression)? SEMICOL;
selectionstmt : IF OPENP expression CLOSEP statement (options {greedy=true;}: ELSE statement)?;
iterationstmt : WHILE OPENP expression CLOSEP statement;
returnstmt : RETURN (expression)? SEMICOL;
expression : (var EQUAL expression) | sampleexpression;
var : ID ( OPENSQ expression CLOSESQ )? ;
sampleexpression: addexpr ( ( LOREQ | LESS | GRTR | GOREQ | EQUAL | NTEQL) addexpr)?;
addexpr : mulexpr ( ( PLUS | MINUS ) mulexpr)*;
mulexpr : factor ( ( MULTI | DIV ) factor )*;
factor : ( OPENP expression CLOSEP ) | var | call | NUM;
call : ID OPENP arglist? CLOSEP;
arglist : expression ( COMMA expression)*;
The lexer rules I used are as follows:
ELSE : 'else' ;
IF : 'if' ;
INT : 'int' ;
RETURN : 'return' ;
VOID : 'void' ;
WHILE : 'while' ;
PLUS : '+' ;
MINUS : '-' ;
MULTI : '*' ;
DIV : '/' ;
LESS : '<' ;
LOREQ : '<=' ;
GRTR : '>' ;
GOREQ : '>=' ;
EQUAL : '==' ;
NTEQL : '!=' ;
ASSIGN : '=' ;
SEMICOL : ';' ;
COMMA : ',' ;
OPENP : '(' ;
CLOSEP : ')' ;
OPENSQ : '[' ;
CLOSESQ : ']' ;
OPENCUR : '{' ;
CLOSECUR: '}' ;
SCOMMENT: '/*' ;
ECOMMENT: '*/' ;
ID : ('a'..'z' | 'A'..'Z')+/*(' ')*/ ;
NUM : ('0'..'9')+ ;
WS : (' ' | '\t' | '\n' | '\r')+ {$channel = HIDDEN;};
COMMENT: '/*' .* '*/' {$channel = HIDDEN;};
But when I try to save this, it gives me the error:
error(211): /CMinusMinus/src/CMinusMinus/CMinusMinus.g:33:13: [fatal] rule expression has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
|---> expression : (var EQUAL expression) | sampleexpression;
1 error
How can I resolve this problem?
As already mentioned: your grammar rule expression is ambiguous, because both alternatives in that rule can start with a var.
You need to "help" your parser a bit. If the parser can see a var followed by an EQUAL, it should choose alternative 1, else alternative 2. This can be done by using a syntactic predicate (the (var EQUAL)=> part in the rule below).
expression
: (var EQUAL)=> var EQUAL expression
| sampleexpression
;
More about predicates in this Q&A: What is a 'semantic predicate' in ANTLR?
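As a rough model of what the (var EQUAL)=> predicate does, the Python sketch below speculatively parses a var, checks for EQUAL, and then rewinds before committing to an alternative. It is not ANTLR-generated code and is heavily simplified: only '+' is supported in sampleexpression, EQUAL is '==' as in the lexer rules above, and the class, method and node names are invented for the example.

```python
import re

TOKEN_RE = re.compile(r"==|[A-Za-z]+|[0-9]+|[\[\]+]")


class Parser:
    def __init__(self, text):
        self.tokens = TOKEN_RE.findall(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, wanted=None):
        tok = self.peek()
        if tok is None or (wanted is not None and tok != wanted):
            raise SyntaxError(f"expected {wanted!r}, got {tok!r}")
        self.pos += 1
        return tok

    def var(self):  # var : ID ( '[' expression ']' )? ;
        name = self.take()
        if not name.isalpha():
            raise SyntaxError(f"expected ID, got {name!r}")
        if self.peek() == "[":
            self.take("[")
            index = self.expression()
            self.take("]")
            return ("index", name, index)
        return name

    def var_equal_ahead(self):
        """Model of (var EQUAL)=> : try it, then rewind whatever happened."""
        start = self.pos
        try:
            self.var()
            return self.peek() == "=="
        except SyntaxError:
            return False
        finally:
            self.pos = start  # the predicate itself consumes nothing

    def expression(self):
        if self.var_equal_ahead():  # alternative 1
            target = self.var()
            self.take("==")
            return ("var_equal", target, self.expression())
        return self.addexpr()  # alternative 2

    def addexpr(self):
        node = self.factor()
        while self.peek() == "+":
            self.take("+")
            node = ("+", node, self.factor())
        return node

    def factor(self):
        tok = self.peek()
        if tok is not None and tok.isdigit():
            return int(self.take())
        return self.var()
```

The var is parsed twice on the first alternative, once speculatively and once for real, which is the cost of this kind of lookahead.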
The problem is this:
expression : (var EQUAL expression) | sampleexpression;
where an expression starts with either var or sampleexpression. But sampleexpression can itself begin with a var, via sampleexpression -> addexpr -> mulexpr -> factor -> var.
So no fixed lookahead of k tokens is enough for the parser to choose between the two alternatives.
As the error message suggests, you can set backtrack=true to see whether this solves your problem, but it might not produce the AST or parse trees you expect, and it can be slow on certain inputs.
You could also try to refactor your grammar to avoid such recursions.