ANTLR4: wrong lexer rule matches - antlr

I'm at a very beginning of learning ANTLR4 lexer rules. My goal is to create a simple grammar for Java properties files. Here is what I have so far:
lexer grammar PropertiesLexer;
LineComment
: ( LineCommentHash
| LineCommentExcl
)
-> skip
;
fragment LineCommentHash
: '#' ~[\r\n]*
;
fragment LineCommentExcl
: '!' ~[\r\n]*
;
fragment WrappedLine
: '\\'
( '\r' '\n'?
| '\n'
)
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
Key
: KeyLetterStart
( KeyLetter
| Escaped
)*
;
fragment KeyLetterStart
: ~[ \t\r\n:=]
;
fragment KeyLetter
: ~[\t\r\n:=]
;
fragment Escaped
: '\\' .?
;
Equal
: ( '\\'? ':'
| '\\'? '='
)
;
Value
: ValueLetterBegin
( ValueLetter
| Escaped
| WrappedLine
)*
;
fragment ValueLetterBegin
: ~[ \t\r\n]
;
fragment ValueLetter
: ~ [\r\n]+
;
Whitespace
: [ \t]+
-> skip
;
My test file is this one:
# comment 1
# comment 2
#
.key1= value1
key2\:sub=value2
key3 \= value3
key4=value41\
value42
# comment3
#comment4
key=value
When I run grun, I'm getting following output:
[#0,30:42='.key1= value1',<Value>,4:0]
[#1,45:60='key2\:sub=value2',<Value>,5:0]
[#2,63:76='key3 \= value3',<Value>,6:0]
[#3,81:102='key4=value41\\r\nvalue42',<Value>,8:0]
[#4,130:138='key=value',<Value>,13:0]
[#5,141:140='<EOF>',<EOF>,14:0]
I don't understand why the Value definition is matched. When commenting out the Value definition, however, it recognizes the Key and Equal definitions:
[#0,30:34='.key1',<Key>,4:0]
[#1,35:35='=',<Equal>,4:5]
[#2,37:42='value1',<Key>,4:7]
[#3,45:49='key2\',<Key>,5:0]
[#4,50:50=':',<Equal>,5:5]
[#5,51:53='sub',<Key>,5:6]
[#6,54:54='=',<Equal>,5:9]
[#7,55:60='value2',<Key>,5:10]
[#8,63:68='key3 \',<Key>,6:0]
[#9,69:69='=',<Equal>,6:6]
[#10,71:76='value3',<Key>,6:8]
[#11,81:84='key4',<Key>,8:0]
[#12,85:85='=',<Equal>,8:4]
[#13,86:93='value41\',<Key>,8:5]
[#14,96:102='value42',<Key>,9:0]
[#15,130:132='key',<Key>,13:0]
[#16,133:133='=',<Equal>,13:3]
[#17,134:138='value',<Key>,13:4]
[#18,141:140='<EOF>',<EOF>,14:0]
but how to let it recognize the Key, Equal and Value definitons?

ANTLR's lexer rules match as much characters as possible, that is why you're seeing all these Value tokens being created (they match the most characters).
Lexical modes seem like a good fit to use here. Something like this:
lexer grammar PropertiesLexer;
COMMENT
: [!#] ~[\r\n]* -> skip
;
KEY
: ( '\\' ~[\r\n] | ~[\r\n\\=:] )+
;
EQUAL
: [=:] -> pushMode(VALUE_MODE)
;
NL
: [\r\n]+ -> skip
;
mode VALUE_MODE;
VALUE
: ( ~[\\\r\n] | '\\' . )+
;
END_VALUE
: [\r\n]+ -> skip, popMode
;

Related

"The following sets of rules are mutually left-recursive"

I have tried to write a grammar to recognize expressions like:
(A + MAX(B) ) / ( C - AVERAGE(A) )
IF( A > AVERAGE(A), 0, 1 )
X / (MAX(X)
Unfortunately antlr3 fails with these errors:
error(210): The following sets of rules are mutually left-recursive [unaryExpression, additiveExpression, primaryExpression, formula, multiplicativeExpression]
error(211): DerivedKeywords.g:110:13: [fatal] rule booleanTerm has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
error(206): DerivedKeywords.g:110:13: Alternative 1: after matching input such as decision cannot predict what comes next due to recursion overflow to additiveExpression from formula
I have spent some hours trying to fix these, it would be great if anyone could at least help me fix the first problem. Thanks
Code:
grammar DerivedKeywords;
options {
output=AST;
//backtrack=true;
}
WS : ( ' ' | '\t' | '\n' | '\r' )
{ skip(); }
;
//for numbers
DIGIT
: '0'..'9'
;
//for both integer and real number
NUMBER
: (DIGIT)+ ( '.' (DIGIT)+ )?( ('E'|'e')('+'|'-')?(DIGIT)+ )?
;
// Boolean operatos
AND : 'AND';
OR : 'OR';
NOT : 'NOT';
EQ : '=';
NEQ : '!=';
GT : '>';
LT : '<';
GTE : '>=';
LTE : '<=';
COMMA : ',';
// Token for Functions
IF : 'IF';
MAX : 'MAX';
MIN : 'MIN';
AVERAGE : 'AVERAGE';
VARIABLE : 'A'..'Z' ('A'..'Z' | '0'..'9')*
;
// OPERATORS
LPAREN : '(' ;
RPAREN : ')' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
STAR : '*' ;
expression : formula;
formula
: functionExpression
| additiveExpression
| LPAREN! a=formula RPAREN! // First Problem
;
additiveExpression
: a=multiplicativeExpression ( (MINUS^ | PLUS^ ) b=multiplicativeExpression )*
;
multiplicativeExpression
: a=unaryExpression ( (STAR^ | DIV^ ) b=unaryExpression )*
;
unaryExpression
: MINUS^ u=unaryExpression
| primaryExpression
;
functionExpression
: f=functionOperator LPAREN e=formula RPAREN
| IF LPAREN b=booleanExpression COMMA p=formula COMMA s=formula RPAREN
;
functionOperator :
MAX | MIN | AVERAGE;
primaryExpression
: NUMBER
// Used for scientific numbers
| DIGIT
| VARIABLE
| formula
;
// Boolean stuff
booleanExpression
: orExpression;
orExpression : a=andExpression (OR^ b=andExpression )*
;
andExpression
: a=notExpression (AND^ b=notExpression )*
;
notExpression
: NOT^ t=booleanTerm
| booleanTerm
;
booleanOperator :
GT | LT | EQ | GTE | LTE | NEQ;
booleanTerm : a=formula op=booleanOperator b=formula
| LPAREN! booleanTerm RPAREN! // Second problem
;
error(210): The following sets of rules are mutually left-recursive [unaryExpression, additiveExpression, primaryExpression, formula, multiplicativeExpression]
- this means that if the parser enters unaryExpression rule, it has the possibility to match additiveExpression, primaryExpression, formula, multiplicativeExpression and unaryExpression again without ever consuming a single token from input - so it cannot decide whether to use those rules or not, because even if it uses the rules, the input will be the same.
You're probably trying to allow subexpressions in expressions by this sequence of rules - you need to make sure that path will consume the left parenthesis of the subexpression. Probably the formula alternative in primaryExpression should be changed to LPAREN formula RPAREN, and the rest of grammar be adjusted accordingly.

ParserRule matching the wrong token

I'm trying to learn a bit ANTLR4 and define a grammar for some 4GL language.
This is what I've got:
compileUnit
:
typedeclaration EOF
;
typedeclaration
:
ID LPAREN DATATYPE INT RPAREN
;
DATATYPE
:
DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC
:
'A'
;
DATATYPE_NUMERIC
:
'N'
;
fragment
DIGIT
:
[0-9]
;
fragment
LETTER
:
[a-zA-Z]
;
INT
:
DIGIT+
;
ID
:
LETTER
(
LETTER
| DIGIT
)*
;
LPAREN
:
'('
;
RPAREN
:
')'
;
WS
:
[ \t\f]+ -> skip
;
What I want to be able to parse:
TEST (A10)
what I get:
typedeclaration:1:6: mismatched input 'A10' expecting DATATYPE
I am however able to write:
TEST (A 10)
Why do I need to put a whitespace in here? The LPAREN DATATYPE in itself is working, so there is no need for a space inbetween. Also the INT RPAREN is working.
Why is a space needed between DATATYPE and INT? I'm a bit confused on that one.
I guess that it's matching ID because it's the "longest" match, but there must be some way to force to be lazier here, right?
You should ignore 'A' and 'N' chats at first position of ID. As #CoronA noticed ANTLR matches token as long as possible (length of ID 'A10' more than length of DATATYPE_ALPHANUMERIC 'A'). Also read this: Priority rules. Try to use the following grammar:
grammar expr;
compileUnit
: typedeclaration EOF
;
typedeclaration
: ID LPAREN datatype INT RPAREN
;
datatype
: DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC
: 'A'
;
DATATYPE_NUMERIC
: 'N'
;
INT
: DIGIT+
;
ID
: [b-mo-zB-MO-Z] (LETTER | DIGIT)*
;
LPAREN
: '('
;
RPAREN
: ')'
;
WS
: [ \t\f]+ -> skip
;
fragment
DIGIT
: [0-9]
;
fragment
LETTER
: [a-zA-Z]
;
Also you can use the following grammar without id restriction. Data types will be recognized earlier than letters. it's not clear too:
grammar expr;
compileUnit
: typedeclaration EOF
;
typedeclaration
: id LPAREN datatype DIGIT+ RPAREN
;
id
: (datatype | LETTER) (datatype | LETTER | DIGIT)*
;
datatype
: DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC: 'A';
DATATYPE_NUMERIC: 'N';
// List with another Data types.
LETTER: [a-zA-Z];
LPAREN
: '('
;
RPAREN
: ')'
;
WS
: [ \t\f]+ -> skip
;
DIGIT
: [0-9]
;

Intellij Antlr4 Plugin Left direct recursion doesn't work

I'm trying to make parser using Antlr4 for the sql select statement, in which contains the following part
expr: '1' | expr('*'|'/'|'+'|'-'|'||') expr; // As the re-factored form of expression: compound expression;
WS :[ \t\r\n]+ -> skip ;
I suppose this rule will allow the following sets of result:
1
1+1
1+1-1
....
But in the graph it shows that it cannot be parsed
Does anyone get the idea why it cannot be parsed like what i expected?
This slightly adjusted grammar works for me. Tested on input 1+1-1||1*1-1/1. Tested in ANTLRWorks2.1
grammar myGrammar;
top : expr EOF ;
expr : '1'
| expr '+' expr
| expr '*' expr
| expr '/' expr
| expr '+' expr
| expr '-' expr
| expr '||' expr
;
WS :[ \t\r\n]+ -> skip ;
One : '1' ;
Times : '*' ;
Div : '/' ;
Plus : '+' ;
Minus : '-' ;
Or : '||' ;
EDIT
I was able to get this to work, too, when matching the rule top:
grammar newEmptyCombinedGrammar;
top : expr EOF ;
expr: one
| expr op=(Times|Div|Plus|Minus|Or) expr
;
one : One ;
One : '1' ;
Times : '*' ;
Div : '/' ;
Plus : '+' ;
Minus : '-' ;
Or : '||' ;
WS :[ \t\r\n]+ -> skip ;

Small grammar that doesn't work; what am I missing (antlr4)

I have the following grammar. It's supposed to accept the string shown in the comments in the header. It does not. I must be missing something fundamental. Hints on how to debug this would also be appreciated.
/*
Should accept:
b
a:b
a:b^10
b^10
Should not accept:
:b
a:
a:^10
*/
grammar test;
filter:
boostedField EOF
;
boostedField
: qualifiedField (CARET NUMBER)?
;
qualifiedField
: (FIELDNAME COLON)? term
;
term
: TERM
;
FIELDNAME: (LETTER | UNDERSCORE) (ALPHANUM | UNDERSCORE)* ;
NUMBER : NUM_CHAR+ ('.' NUM_CHAR+)? ;
COLON : ':' ;
CARET : '^' ;
WS : (' ' | '\t' | '\n' | '\r' | '\u3000') -> skip ;
UNDERSCORE: '_' ;
// a term may not have a colon or a caret (unless escaped)
TERM : TERM_START_CHAR TERM_CHAR*;
fragment TERM_START_CHAR
: ~( ' ' | '\t' | '\n' | '\r' | '\u3000' | ':' | '^' ) ;
fragment TERM_CHAR : (TERM_START_CHAR | ESCAPED_CHAR) ;
fragment ESCAPED_CHAR : ( '\\' . ) ;
fragment NUM_CHAR: '0'..'9';
fragment LETTER: 'a'..'z' | 'A'..'Z' ;
fragment ALPHANUM: LETTER | NUM_CHAR;

How to have both function calls and parenthetical grouping without backtrack

Is there any way to specify a grammar which allows the following syntax:
f(x)(g, (1-(-2))*3, 1+2*3)[0]
which is transformed into (in pseudo-lisp to show order):
(index
((f x)
g
(* (- 1 -2) 3)
(+ (* 2 3) 1)
)
0
)
along with things like limited operator precedence etc.
The following grammar works with backtrack = true, but I'd like to avoid that:
grammar T;
options {
output=AST;
backtrack=true;
memoize=true;
}
tokens {
CALL;
INDEX;
LOOKUP;
}
prog: (expr '\n')* ;
expr : boolExpr;
boolExpr
: relExpr (boolop^ relExpr)?
;
relExpr
: addExpr (relop^ addExpr)?
| a=addExpr oa=relop b=addExpr ob=relop c=addExpr
-> ^(LAND ^($oa $a $b) ^($ob $b $c))
;
addExpr
: mulExpr (addop^ mulExpr)?
;
mulExpr
: atomExpr (mulop^ atomExpr)?
;
atomExpr
: INT
| ID
| OPAREN expr CPAREN -> expr
| call
;
call
: callable ( OPAREN (expr (COMMA expr)*)? CPAREN -> ^(CALL callable expr*)
| OBRACK expr CBRACK -> ^(INDEX callable expr)
| DOT ID -> ^(INDEX callable ID)
)
;
fragment
callable
: ID
| OPAREN expr CPAREN
;
fragment
boolop
: LAND | LOR
;
fragment
relop
: (EQ|GT|LT|GTE|LTE)
;
fragment
addop
: (PLUS|MINUS)
;
fragment
mulop
: (TIMES|DIVIDE)
;
EQ : '==' ;
GT : '>' ;
LT : '<' ;
GTE : '>=' ;
LTE : '<=' ;
LAND : '&&' ;
LOR : '||' ;
PLUS : '+' ;
MINUS : '-' ;
TIMES : '*' ;
DIVIDE : '/' ;
ID : ('a'..'z')+ ;
INT : '0'..'9' ;
OPAREN : '(' ;
CPAREN : ')' ;
OBRACK : '[' ;
CBRACK : ']' ;
DOT : '.' ;
COMMA : ',' ;
There are a couple of things wrong with your grammar:
1
Only lexer rules can be fragments, not parser rules. Some ANTLR targets simply ignore the fragment keyword in front of parser rules (like the Java target), but better just remove them from your grammar: if you decide to create a parser for a different target-language, you may run into problems because of it.
2
Without the backtrack=true, you cannot mix tree-rewrite operators (^ and !) and rewrite rules (->) because you need to create a single alternative inside relExpr instead of the two alternatives you now have (this is to eliminate an ambiguity).
In your case, you can't create the desired AST with just ^ (inside a single alternative), so you'll need to do it like this:
relExpr
: (a=addExpr -> $a) ( (oa=relOp b=addExpr -> ^($oa $a $b))
( ob=relOp c=addExpr -> ^(LAND ^($oa $a $b) ^($ob $b $c))
)?
)?
;
(yes, I know, it's not particularly pretty, but that can't be helped AFAIK)
Also, you can only put the LAND token in the rewrite rules if it is defined in the tokens { ... } block:
tokens {
// literal tokens
LAND='&&';
...
// imaginary tokens
CALL;
...
}
Otherwise you can only use tokens (and other parser rules) in rewrite rules if they really occur inside the parser rule itself.
3
You did not account for the unary minus in your grammar, implement it like this:
mulExpr
: unaryExpr ((TIMES | DIVIDE)^ unaryExpr)*
;
unaryExpr
: MINUS atomExpr -> ^(UNARY_MINUS atomExpr)
| atomExpr
;
Now, to create a grammar that does not need backtrack=true, remove the ID and '(' expr ')' from your atomExpr rule:
atomExpr
: INT
| call
;
and make everything passed callable optional inside your call rule:
call
: (callable -> callable) ( OPAREN params CPAREN -> ^(CALL $call params)
| OBRACK expr CBRACK -> ^(INDEX $call expr)
| DOT ID -> ^(INDEX $call ID)
)*
;
That way, ID and '(' expr ')' are already matched by call (and there's no ambiguity).
Taken all the remarks above into account, you could get the following grammar:
grammar T;
options {
output=AST;
}
tokens {
// literal tokens
EQ = '==' ;
GT = '>' ;
LT = '<' ;
GTE = '>=' ;
LTE = '<=' ;
LAND = '&&' ;
LOR = '||' ;
PLUS = '+' ;
MINUS = '-' ;
TIMES = '*' ;
DIVIDE = '/' ;
OPAREN = '(' ;
CPAREN = ')' ;
OBRACK = '[' ;
CBRACK = ']' ;
DOT = '.' ;
COMMA = ',' ;
// imaginary tokens
CALL;
INDEX;
LOOKUP;
UNARY_MINUS;
PARAMS;
}
prog
: expr EOF -> expr
;
expr
: boolExpr
;
boolExpr
: relExpr ((LAND | LOR)^ relExpr)?
;
relExpr
: (a=addExpr -> $a) ( (oa=relOp b=addExpr -> ^($oa $a $b))
( ob=relOp c=addExpr -> ^(LAND ^($oa $a $b) ^($ob $b $c))
)?
)?
;
addExpr
: mulExpr ((PLUS | MINUS)^ mulExpr)*
;
mulExpr
: unaryExpr ((TIMES | DIVIDE)^ unaryExpr)*
;
unaryExpr
: MINUS atomExpr -> ^(UNARY_MINUS atomExpr)
| atomExpr
;
atomExpr
: INT
| call
;
call
: (callable -> callable) ( OPAREN params CPAREN -> ^(CALL $call params)
| OBRACK expr CBRACK -> ^(INDEX $call expr)
| DOT ID -> ^(INDEX $call ID)
)*
;
callable
: ID
| OPAREN expr CPAREN -> expr
;
params
: (expr (COMMA expr)*)? -> ^(PARAMS expr*)
;
relOp
: EQ | GT | LT | GTE | LTE
;
ID : 'a'..'z'+ ;
INT : '0'..'9'+ ;
SPACE : (' ' | '\t') {skip();};
which would parse the input "a >= b < c" into the following AST:
and the input "f(x)(g, (1-(-2))*3, 1+2*3)[0]" as follows: