ANTLR4 Grammar - Issue with "dot" in fields and extended expressions

I have the following ANTLR4 grammar:
grammar ExpressionGrammar;
parse: (expr)
;
expr: MIN expr
| expr ( MUL | DIV ) expr
| expr ( ADD | MIN ) expr
| NUM
| function
| '(' expr ')'
;
function : ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL : '*';
DIV : '/';
MIN : '-';
ADD : '+';
OPEN_PAR : '(' ;
CLOSE_PAR : ')' ;
NUM : '0' | [1-9][0-9]*;
ID : [a-zA-Z_] [a-zA-Z]*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
I have an input expression like this:
(Fields.V1)*(Fields.V2) + (Constants.Value1)*(Constants.Value2)
The ANTLR parser generated the following text from the grammar above:
(FieldsV1)*(FieldsV2)+(Constants<missing ')'>
As you can see, the dots in Fields.V1 and Fields.V2 are missing from the text, and there is also a <missing ')'> error node. I believe I need to make ANTLR understand that an expression can also contain fields with dot operators.
A second question on top of this:
(Var1)(Var2)
ANTLR does not report an error for the input above. An expression should never be written as (Var1)(Var2); there should always be an operator between the operands, as in (Var1)*(Var2) or (Var1)+(Var2). The parse tree contains no error node for this case. How should the grammar be modified so that this scenario is also caught?

To recognize IDs like Fields.V1, change your lexer rule for ID to something like this:
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
Since each "node" of the ID follows the same rule, I made it a lexer fragment that can be used to compose the ID rule. I also added 0-9 to the second part of the fragment, since it appears that you want to allow digits in IDs.
The ID rule then uses the fragment to build a lexer rule that allows dots between the nodes of an ID.
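As a rough illustration of the difference between the two ID rules, here is a small Python sketch in which the `re` module stands in for the ANTLR lexer. The names `ORIGINAL_ID`, `DOTTED_ID` and `leading_id` are made up for this example; only the character classes are taken from the grammars above.

```python
import re

# Regex equivalents of the two ID rules (Python's re is only a stand-in
# for the ANTLR lexer; it is not how ANTLR is implemented).
ORIGINAL_ID = re.compile(r"[a-zA-Z_][a-zA-Z]*")
ID_NODE = r"[a-zA-Z_][a-zA-Z0-9]*"
DOTTED_ID = re.compile(rf"{ID_NODE}(?:\.{ID_NODE})*")

def leading_id(pattern, text):
    """Return the ID that the pattern matches at the start of text, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None

# The original rule stops at the dot; the revised rule consumes the dotted name.
print(leading_id(ORIGINAL_ID, "Fields.V1"))
print(leading_id(DOTTED_ID, "Fields.V1"))
```

With the original rule the dot is left over as a character that no lexer rule matches, which is why it disappears from the parser's text.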
You also didn't add ID as a valid expr alternative, so a bare identifier could never be an operand.
To detect the error in (Var1)(Var2), follow Mike's advice and add the built-in EOF token to the end of the parse parser rule. Without EOF, ANTLR stops parsing as soon as it reaches the end of a recognized expr, here (Var1), and silently ignores the rest of the input. The EOF says "and then you need to find the end of the input", so ANTLR continues parsing into (Var2) and reports the error.
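The effect of the trailing EOF can be sketched with a tiny hand-written recognizer in Python. This is not ANTLR and not the full grammar: it assumes a simplified expr (operands joined by binary operators, parentheses, no unary minus or function calls), and `tokenize`, `parse_expr`, `parse_atom` and `accepts` are hypothetical names used only here.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_.]*|[()*/+-])")
OPERATORS = {"*", "/", "+", "-"}

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_expr(tokens, i):
    """expr: atom (operator atom)* ; returns the index just after the expr."""
    i = parse_atom(tokens, i)
    while i < len(tokens) and tokens[i] in OPERATORS:
        i = parse_atom(tokens, i + 1)
    return i

def parse_atom(tokens, i):
    """atom: ID | '(' expr ')' ; returns the index just after the atom."""
    if i < len(tokens) and tokens[i] == "(":
        i = parse_expr(tokens, i + 1)
        if i >= len(tokens) or tokens[i] != ")":
            raise SyntaxError("missing ')'")
        return i + 1
    if i < len(tokens) and tokens[i] not in OPERATORS | {"(", ")"}:
        return i + 1
    raise SyntaxError("expected an operand")

def accepts(text, require_eof):
    """True if a leading expr is recognized and, optionally, nothing follows it."""
    try:
        tokens = tokenize(text)
        end = parse_expr(tokens, 0)
    except SyntaxError:
        return False
    return end == len(tokens) if require_eof else True

print(accepts("(Var1)(Var2)", require_eof=False))  # stops after (Var1)
print(accepts("(Var1)(Var2)", require_eof=True))   # leftover tokens are an error
```

Without the end-of-input check the recognizer is satisfied once (Var1) has been matched, which mirrors what the original parse rule does.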
A revised version that handles both of your examples:
grammar ExpressionGrammar;
parse: expr EOF;
expr:
MIN expr
| expr ( MUL | DIV) expr
| expr ( ADD | MIN) expr
| NUM
| ID
| function
| '(' expr ')';
function: ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL: '*';
DIV: '/';
MIN: '-';
ADD: '+';
OPEN_PAR: '(';
CLOSE_PAR: ')';
NUM: '0' | [1-9][0-9]*;
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
(Now that I've read through the comments, this is pretty much just applying the suggestions in the comments)

Related

Issue with whitespace and int literals / binary operators

I am trying to write a grammar that allows for
Signed integers (i.e. integers with or without a sign; 3, -2, +5)
Unary minus (-)
Binary addition and subtraction (+, -)
Here is the relevant grammar:
expr: INTLITER
| unaryOp expr
| expr binaryOp expr
| OPEN_PAREN expr CLOSE_PAREN
;
unaryOp: MINUS ; // Other operators omitted for clarity
binaryOp: PLUS | MINUS ;
INTLITER: INTSIGN? DIGIT+ ;
fragment INTSIGN : PLUS | MINUS ;
WS: [ \r\n\t] -> skip ; // Ignore whitespace
I'm finding a strange issue concerning whitespace.
Consider the expression (2+ 1); this gives a correct parse tree, as expected.
However, (2+1) gives a different, incorrect parse tree.
Since the WS rule means that whitespace is ignored, how is the whitespace here affecting the parse tree?
How might I fix this problem?
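The problem with the grammar is that it tries to represent signed numbers as a single token in the lexer. The lexer always takes the longest possible match, so in (2+1) the characters +1 become one INTLITER and the parser never sees a PLUS between the two operands. In (2+ 1) the space prevents that, so + is lexed as PLUS. Define INTLITER without "INTSIGN?" and handle signs in the parser instead (for example through unaryOp). The grammar then works: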
grammar arithmetic;
expr: INTLITER
| unaryOp expr
| expr binaryOp expr
| OPEN_PAREN expr CLOSE_PAREN
;
unaryOp: MINUS ; // Other operators omitted for clarity
binaryOp: PLUS | MINUS ;
INTLITER: DIGIT+ ;
WS: [ \r\n\t] -> skip ; // Ignore whitespace
fragment DIGIT
: ('0' .. '9')+
;
OPEN_PAREN
: '('
;
CLOSE_PAREN
: ')'
;
PLUS
: '+'
;
MINUS
: '-'
;
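The whitespace effect comes from longest-match tokenization and can be reproduced with a small Python sketch. The tokenizer below is a simplified stand-in for the ANTLR lexer (longest match wins, ties go to the earlier rule); `tokenize`, `SIGNED` and `UNSIGNED` are names invented for this example.

```python
import re

def tokenize(text, rules):
    """Longest-match tokenizer: at each position the rule matching the most
    characters wins; on a tie the rule listed first wins. WS is discarded."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_text = None, ""
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group(0)) > len(best_text):
                best_name, best_text = name, m.group(0)
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":
            tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens

# INTLITER with an optional sign, as in the question.
SIGNED = [("INTLITER", r"[+-]?[0-9]+"), ("PLUS", r"\+"), ("MINUS", r"-"),
          ("OPEN_PAREN", r"\("), ("CLOSE_PAREN", r"\)"), ("WS", r"[ \r\n\t]")]
# INTLITER without a sign, as in the corrected grammar.
UNSIGNED = [("INTLITER", r"[0-9]+")] + SIGNED[1:]

print(tokenize("(2+1)", SIGNED))    # "+1" is swallowed into one INTLITER
print(tokenize("(2+ 1)", SIGNED))   # the space forces "+" to be a PLUS
print(tokenize("(2+1)", UNSIGNED))  # PLUS is seen regardless of spacing
```

With the signed rule, (2+1) yields two adjacent INTLITER tokens and no operator, so the parser cannot match expr binaryOp expr.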

Antlr4 - Implicit Definitions

I am trying to create a simple expression parser that, for now, handles only integer arithmetic. So far I have:
grammar MyExpr;
input: (expr NEWLINE)+;
expr: '(' expr ')'
| '-' expr
| <assoc = right> expr '^' expr
| expr ('*' | '/') expr
| expr ('+' | '-') expr
| ID '(' ExpressionList? ')'
| INT;
ExpressionList : expr (',' expr)*;
ID : [a-zA-Z]+;
INT : DIGIT+;
DIGIT: [0-9];
NEWLINE : '\r'?'\n';
WS : [\t]+ -> skip;
The rule ExpressionList seems to cause problems. If I remove everything containing ExpressionList, everything compiles and seems to work fine. But with the grammar as shown above, I get errors like:
error(160): MyExpr.g4:14:17: reference to parser rule expr in lexer rule ExpressionList
error(126): MyExpr.g4:7:6: cannot create implicit token for string literal in non-combined grammar: '-'
I am using Eclipse and the ANTLR4 plugin. I am basing my grammar on the Cymbol grammar given in the ANTLR4 book.
Can someone tell me what's going wrong in my little grammar?
I found it out by myself:
Rule names starting with a capital letter denote lexer rules, so all I had to do was rename ExpressionList to expressionList, both in its definition and where expr references it.
Maybe someone else will find this useful some day ;)

Controlling Parameter Slurping

I'm trying to write a grammar that supports function calls without using parentheses:
f x, y
As in Haskell, I'd like function calls to minimally slurp up their parameters. That is, I want
g 5 + 3
to mean
(g 5) + 3
instead of
g (5 + 3)
Unfortunately, I'm getting the second parse with this grammar:
grammar Parameters;
expr
: '(' expr ')'
| expr MULTIPLICATIVE_OPERATOR expr
| expr ADDITIVE_OPERATOR expr
| ID (expr (',' expr)*?)??
| INT
;
MULTIPLICATIVE_OPERATOR: [*/%];
ADDITIVE_OPERATOR: '+';
ID: [a-z]+;
INT: '-'? [0-9]+;
WHITESPACE: [ \t\n\r]+ -> skip;
The parse tree I'm getting corresponds to g (5 + 3).
I had thought that the subrule listed first would get attempted first. In this case, expr ADDITIVE_OPERATOR expr appears before the ID subrule, so why is the ID subrule taking higher precedence?
In this case ANTLR does not perform the correct rule transformation (the one it uses to eliminate left recursion and handle precedences). It generates:
expr
: expr_1[0]
;
expr_1[int p]
: ('(' expr_1[0] ')' | INT | ID (expr_1[0] (',' expr_1[0])*?)??)
( {4 >= $p}? MULTIPLICATIVE_OPERATOR expr_1[5]
| {3 >= $p}? ADDITIVE_OPERATOR expr_1[4]
)*
;
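leading to (expr (expr_1 a (expr_1 5 + (expr_1 3)))), because the arguments are parsed with precedence 0 and therefore absorb the following operators.
The correct transformation would parse the arguments with precedence 5: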
expr
: expr_1[0]
;
expr_1[int p]
: ('(' expr_1[0] ')' | INT | ID (expr_1[5] (',' expr_1[5])*?)??)
( {4 >= $p}? MULTIPLICATIVE_OPERATOR expr_1[5]
| {3 >= $p}? ADDITIVE_OPERATOR expr_1[4]
)*
;
leading to (expr (expr_1 a (expr_1 5) + (expr_1 3)))
I am not certain whether this is a bug in ANTLR4 or a trade-off of the transformation algorithm. Perhaps one should file an issue in the ANTLR4 issue tracker.
To solve your problem, you can simply put the correctly transformed grammar into your code and it should work. The explanation of the rule transformation is found in "The Definitive ANTLR4 Reference" on pages 249ff (and perhaps somewhere on the web).
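The difference between the two transformed rules can be tried out with a hand-written precedence-climbing parser in Python. This is only a sketch that mirrors the shape of expr_1[int p]; it is not ANTLR's generated code, it always takes an argument when one is available, and the names `BINARY`, `parse`, `expr_1`, `starts_expr` and the `arg_prec` parameter are assumptions made for this example.

```python
import re

# operator -> (its own precedence, precedence passed to the right operand)
BINARY = {"*": (4, 5), "/": (4, 5), "%": (4, 5), "+": (3, 4)}

def tokenize(text):
    return re.findall(r"[a-z]+|[0-9]+|[()*/%+,]", text)

def starts_expr(tokens, pos):
    return pos < len(tokens) and (tokens[pos] == "(" or tokens[pos].isalnum())

def expr_1(tokens, pos, p, arg_prec):
    """Parse a primary, then binary operators whose precedence is >= p."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        left, pos = expr_1(tokens, pos + 1, 0, arg_prec)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("missing ')'")
        pos += 1
    elif tok.isdigit():
        left, pos = tok, pos + 1
    elif tok.isalpha():  # ID with optional comma-separated arguments
        pos += 1
        args = []
        if starts_expr(tokens, pos):
            arg, pos = expr_1(tokens, pos, arg_prec, arg_prec)
            args.append(arg)
            while pos < len(tokens) and tokens[pos] == ",":
                arg, pos = expr_1(tokens, pos + 1, arg_prec, arg_prec)
                args.append(arg)
        left = ("call", tok, *args) if args else tok
    else:
        raise SyntaxError(f"unexpected token {tok!r}")
    while pos < len(tokens) and tokens[pos] in BINARY:
        op = tokens[pos]
        prec, right_prec = BINARY[op]
        if prec < p:  # corresponds to the {prec >= $p}? predicates
            break
        right, pos = expr_1(tokens, pos + 1, right_prec, arg_prec)
        left = (op, left, right)
    return left, pos

def parse(text, arg_prec):
    tokens = tokenize(text)
    tree, pos = expr_1(tokens, 0, 0, arg_prec)
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return tree

print(parse("g 5 + 3", arg_prec=0))  # arguments parsed like expr_1[0]
print(parse("g 5 + 3", arg_prec=5))  # arguments parsed like expr_1[5]
```

Parsing the arguments at precedence 5 makes every binary operator fail its precedence test inside the argument, so the operator is left for the enclosing expression.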

In ANTLR how do I skip the value in a simple expression parser?

I have been trying to write a simple expression parser. Here is the grammar:
grammar extremelysimpleexpr ;
stat : expr ;
expr : sub ;
sub : add ( '-' add )* ;
add : VAL ( '+' VAL )*
| VAL
;
VAL : [0-9]+ ;
WS : [ \t\n\r]+ -> skip ;
It matches these expressions
1 + 1
0 + 3
4
But I do not want it to match a single occurrence of VAL. I want it to match 1 + 1 but not 4. How do I do that?
You'd have to insert predicates, something like this (untested):
stat : expr { $expr.start != $expr.stop }? ;
But don't do this! That's not a syntactic issue, but a semantic one. This is something you should validate after parsing, unless you want to complicate your grammar for such a little benefit.
Use visitors instead for all your checks.
By the way, your grammar assigns different precedence levels to the - and + operators... I'm not sure this is what you want.
With ANTLR4 you could just write this:
expr : '(' expr ')'
| '-' expr
| expr ('*'|'/') expr
| expr ('+'|'-') expr
| VAL
;
This grammar enforces non-trivial expressions syntactically:
stat : expr ( '+' expr )+
| expr ( '-' expr )+
;
expr : expr ( '+' expr )+
| expr ( '-' expr )+
| VAL
;

'a-zA-Z' came as a complete surprise to me while matching alternative

I have a problem generating a parser from my grammar definition with ANTLR v4:
grammar TagExpression;
expr : not expr
| expr and expr
| expr or expr
| '(' expr ')'
| tag
;
tag : [a-zA-Z]+ ;
and : '&' ;
or : '|' ;
not : '!' ;
WS : [ \t\n\r]+ -> skip ;
The syntax error happens here: tag : [a-zA-Z]+ ;
error(50): c:\temp\antlr\TagExpression.g4:10:6: syntax error: 'a-zA-Z' came as a complete surprise to me while matching alternative
The examples I saw had very similar constructs. Any idea why this happens?
Thanks
The character set notation can only be used in a lexer rule (rules that start with a capital letter, and produce tokens instead of parse trees).
Tag : [a-zA-Z]+;
the problem is the syntax in ANTLR should be '[a-zA-Z]+'