How to remove left-recursion from grammar rules - antlr

The following set of antlr grammar lines gives me the error for number_operation & number_argument as given below
The following sets of rules are mutually left-recursive
number_funtion : COUNT LEFT_PAREN number_argument RIGHT_PAREN
number_operation :
number_argument (number_operator number_argument)+ | LEFT_PAREN number_argument (number_operator number_argument)+ RIGHT_PAREN
| prefix_operator number_argument | LEFT_PAREN prefix_operator number_argument RIGHT_PAREN;
number_argument : number_column | number_function | digit_constant | number_operation ;
In order to avoid the left recursion, modifying number_operation with all possible combinations of every element of number_argument can be done like below, however will result in a longer rule.
number_operation :
number_column (number_operator number_argument)+ | LEFT_PAREN number_column (number_operator number_column)+ RIGHT_PAREN
| prefix_operator number_column | LEFT_PAREN prefix_operator number_column RIGHT_PAREN
//and other combinations
Can someone suggest what is the best way to remove the left recursion here ?

How about something like this:
number_operation
: number_operation (number_operator number_argument)+
| number_argument_2 (number_operator number_argument)+
| LEFT_PAREN number_argument (number_operator number_argument)+ RIGHT_PAREN
| prefix_operator number_argument
| LEFT_PAREN prefix_operator number_argument RIGHT_PAREN
;
number_argument
: number_column
| number_function
| digit_constant
| number_operation
;
number_argument_2
: number_column
| number_function
| digit_constant
;
?

Related

ANTLR How to differentiate input arguments of the same type

If I have my input message:
name IS (Jon, Ted) IS NOT (Peter);
I want this AST:
name
|
|-----|
IS IS NOT
| |
| Peter
|----|
Jon Ted
But I'm receiving:
name
|
|-----------------|
IS IS NOT
| |
| |
|----|-----| |----|-----|
Jon Ted Peter Jon Ted Peter
My Grammar file has:
...
expression
| NAME 'IS' OParen Identifier (Comma Identifier)* CParen 'IS NOT' OParen
Identifier (Comma Identifier)* CParen
-> ^(NAME ^('IS' ^(Identifier)*) ^('IS NOT' ^(Identifier)*))
;
...
NAME
: 'name'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_' | '.' | Digit)*
;
How can I differentiate what "belongs" to the 'IS' and what to belongs to the 'IS NOT' ?
Something like this should do it:
expression
: NAME IS left=id_list IS NOT right=id_list -> ^(NAME ^(IS $left) ^(NOT $right))
;
id_list
: '(' ID (',' ID)* ')' -> ID+
;
IS : 'IS';
NOT : 'NOT'; // not a single token that is 'IS NOT'
ID
: ('a'..'z' | 'A'..'Z' | '_' | '.' | Digit)+
// Not `(...)*`: it should always match a single char!
;

How to get the Text of a Lexer Rule

I have a Antlr Grammar Lexer Rule Like this,
Letter
: '\u0024' | '\u005f'|
'\u0041'..'\u005a' | '\u0061'..'\u007a' |
'\u00c0'..'\u00d6' | '\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' | '\u0100'..'\u1fff' |
'\u3040'..'\u318f' | '\u3300'..'\u337f' |
'\u3400'..'\u3d2d' | '\u4e00'..'\u9fff' |
'\uf900'..'\ufaff'
;
Name : Letter (Letter | '0'..'9' | '.' | '-')*;
I want to get the String Value of Name. How can I do it?
from a parser rule:
rule
: Name {String s = $Name.text; System.out.println(s);}
;
or
rule
: n=Name {String s = $n.text; System.out.println(s);}
;
from the lexer rule itself:
Name
: Letter (Letter | '0'..'9' | '.' | '-')*
{String s = $text; System.out.println(s);}
;

How can I differentiate between reserved words and variables using ANTLR?

I'm using ANTLR to tokenize a simple grammar, and need to differentiate between an ID:
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
and a RESERVED_WORD:
RESERVED_WORD : 'class' | 'public' | 'static' | 'extends' | 'void' | 'int' | 'boolean' | 'if' | 'else' | 'while' | 'return' | 'null' | 'true' | 'false' | 'this' | 'new' | 'String' ;
Say I run the lexer on the input:
class abc
I receive two ID tokens for "class" and "abc", while I want "class" to be recognized as a RESERVED_WORD. How can I accomplish this?
Whenever 2 (or more) rules match the same amount of characters, the one defined first will "win". So, if you define RESERVED_WORD before ID, like this:
RESERVED_WORD : 'class' | 'public' | 'static' | 'extends' | 'void' | 'int' | 'boolean' | 'if' | 'else' | 'while' | 'return' | 'null' | 'true' | 'false' | 'this' | 'new' | 'String' ;
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
The input "class" will be tokenized as a RESERVED_WORD.
Note that it doesn't make a lot of sense to create a single token that matches any reserved word: usually it is done like this:
// ...
NULL : 'null';
TRUE : 'true';
FALSE : 'false;
// ...
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
Now "false" will become a FALSE token, and "falser" an ID.

Why does my grammar work for operators like *, -, /, but not +?

I'm creating a grammar right now and I had to get rid of left recursion, and it seems work for everything except the addition operator.
Here is the related part of my grammar:
SUBTRACT: '-';
PLUS: '+';
DIVIDE: '/';
MULTIPLY: '*';
expr:
(
IDENTIFIER
| INTEGER
| STRING
| TRUE
| FALSE
)
(
PLUS expr
| SUBTRACT expr
| MULTIPLY expr
| DIVIDE expr
| LESS_THAN expr
| LESS_THAN_OR_EQUAL expr
| EQUALS expr
)*
;
INTEGER: ('0'..'9')*;
IDENTIFIER: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')*;
Then when I try to do something like
x*1
It work's perfectly. However when I try to do something like
x+1
I get an error saying:
MismatchedTokenException: mismatched input '+' expecting '\u001C'
I've been at this for a while but don't get why it works with *, -, and /, but not +. I have the exact same code for all of them.
Edit: If I reorder it and put SUBTRACT above PLUS, the + symbol will now work but the - symbol won't. Why would antlr care about the order of stuff like that?
Avoiding left recursion (in an expression grammar) is usually done like this:
grammar Expr;
parse
: expr EOF
;
expr
: equalityExpr
;
equalityExpr
: relationalExpr (('==' | '!=') relationalExpr)*
;
relationalExpr
: additionExpr (('>=' | '<=' | '>' | '<') additionExpr)*
;
additionExpr
: multiplyExpr (('+'| '-') multiplyExpr)*
;
multiplyExpr
: atom (('*' | '/') atom)*
;
atom
: IDENTIFIER
| INTEGER
| STRING
| TRUE
| FALSE
| '(' expr ')'
;
// ... lexer rules ...
For example, the input A+B+C would be parsed as follows:
Also see this related answer: ANTLR: Is there a simple example?
I fixed it by making a new rule for the part at the end that I made from removing left recursion:
expr:
(
IDENTIFIER
| INTEGER
| STRING
| TRUE
| FALSE
) lr*
;
lr: PLUS expr
| SUBTRACT expr
| MULTIPLY expr
| DIVIDE expr
| LESS_THAN expr
| LESS_THAN_OR_EQUAL expr
| EQUALS expr;

Solving ANTLR Mutually left-recursive rules

the 'expr' rule in the ANTLR grammar below obviously mutually left-recursive. As a ANTLR newbie it's difficult to get my head around solving this. I've read "Resolving Non-LL(*) Conflicts" in the ANTLR reference book, but I still don't see the solution. Any pointers?
LPAREN : ( '(' ) ;
RPAREN : ( ')' );
AND : ( 'AND' | '&' | 'EN' ) ;
OR : ( 'OR' | '|' | 'OF' );
NOT : ('-' | 'NOT' | 'NIET' );
WS : ( ' ' | '\t' | '\r' | '\n' ) {$channel=HIDDEN;} ;
WORD : (~( ' ' | '\t' | '\r' | '\n' | '(' | ')' | '"' ))*;
input : expr EOF;
expr : (andexpr | orexpr | notexpr | atom);
andexpr : expr AND expr;
orexpr : expr OR expr;
notexpr : NOT expr;
phrase : '"' WORD* '"';
atom : (phrase | WORD);
I would suggest to have a look at the example grammers on the antlr site. The java grammar does what you want.
Basicly you can do something like this:
expr : andexpr;
andexpr : orexpr (AND andexpr)*;
orexpr : notexpr (OR orexpr)*;
notexpr : atom | NOT expr;
The key is, that every expression can end to be an atom.