ANTLR4 - Mutually left-recursive grammar - antlr

I am getting the error: The following sets of rules are mutually left-recursive [symbolExpression]. In my grammar, symbolExpression is directly left-recursive so shouldn't ANTLR4 be handling this?
Here are the relevant parts of my parser:
operation:
OPERATOR '(' (operation | values | value | symbolExpression) ')' #OperatorExpression
| bracketedSymbolExpression #BracketedOperatorExpression
;
symbolExpression:
(operation | values | value | symbolExpression) SYMBOL (operation | values | value | symbolExpression);
bracketedSymbolExpression:
'(' (operation | values | value | symbolExpression) SYMBOL (operation | values | value | symbolExpression) ')';
list: '[' (operation | value) (',' (operation | value))* ']';
values: (operation | value) (',' (operation | value))+;
value:
NUMBER
| IDENTIFIER
| list
| object;

The elements 'symbolExpression' and 'operation' in the rule 'symbolExpression' are interdependently left recursive.
Without knowing the language specification, it is impossible to say whether the language itself is irrecoverably ambiguous.
Nonetheless, one avenue to try is to refactor the grammar to move repeated clauses, like
( operation | value )
and
(operation | values | value | symbolExpression)
to their own rules with the goal of unifying the 'operation' and 'symbolExpression' (and perhaps 'bracketedSymbolExpression') rules into a single rule -- a rule that is at most self left-recursive. Something like
a : value
| LPAREN a* RPAREN
| LBRACK a* RBRACK
| a SYMBOL a
| a ( COMMA a )+
;

Related

antlr3 always matching the longest possible token

Suppose I have input that matches two tokens: ANTLR always chooses the longest match. How can I configure it to try the shortest match first and fall back to the longest only if that fails?
Example:
rule
: USER PATH
| PATH
;
USER
: '#' ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+
;
PATH
: URL_ALLOWED_CHARS+ '.config'
;
fragment URL_ALLOWED_CHARS
: ':' | '/' | '?' | '#' | '['
| ']' | '#' |'!' | '$' | '&'
| '\'' | '(' | ')' | '*'
| '+' | ',' | ';' | '='
| '%' | 'A'..'Z' | 'a'..'z'
| '0'..'9' | '_' | '.'
| '\\' | '-' | '~'
;
For the grammar above, given input such as #random_user/file.config,
option 1 of rule should match and I should get two tokens: #random_user for USER and /file.config for PATH.
Instead, the grammar matches option 2 of the rule and the complete input is matched as PATH. How can I avoid this?

How to get the Text of a Lexer Rule

I have an ANTLR lexer rule like this:
Letter
: '\u0024' | '\u005f'|
'\u0041'..'\u005a' | '\u0061'..'\u007a' |
'\u00c0'..'\u00d6' | '\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' | '\u0100'..'\u1fff' |
'\u3040'..'\u318f' | '\u3300'..'\u337f' |
'\u3400'..'\u3d2d' | '\u4e00'..'\u9fff' |
'\uf900'..'\ufaff'
;
Name : Letter (Letter | '0'..'9' | '.' | '-')*;
I want to get the String Value of Name. How can I do it?
From a parser rule:
rule
: Name {String s = $Name.text; System.out.println(s);}
;
or
rule
: n=Name {String s = $n.text; System.out.println(s);}
;
From the lexer rule itself:
Name
: Letter (Letter | '0'..'9' | '.' | '-')*
{String s = $text; System.out.println(s);}
;

How can I differentiate between reserved words and variables using ANTLR?

I'm using ANTLR to tokenize a simple grammar, and need to differentiate between an ID:
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
and a RESERVED_WORD:
RESERVED_WORD : 'class' | 'public' | 'static' | 'extends' | 'void' | 'int' | 'boolean' | 'if' | 'else' | 'while' | 'return' | 'null' | 'true' | 'false' | 'this' | 'new' | 'String' ;
Say I run the lexer on the input:
class abc
I receive two ID tokens for "class" and "abc", while I want "class" to be recognized as a RESERVED_WORD. How can I accomplish this?
Whenever two (or more) rules match the same number of characters, the one defined first will "win". So, if you define RESERVED_WORD before ID, like this:
RESERVED_WORD : 'class' | 'public' | 'static' | 'extends' | 'void' | 'int' | 'boolean' | 'if' | 'else' | 'while' | 'return' | 'null' | 'true' | 'false' | 'this' | 'new' | 'String' ;
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
The input "class" will be tokenized as a RESERVED_WORD.
Note that it doesn't make a lot of sense to create a single token that matches any reserved word. Usually each reserved word gets its own token, like this:
// ...
NULL : 'null';
TRUE : 'true';
FALSE : 'false';
// ...
ID : LETTER (LETTER | DIGIT)* ;
fragment DIGIT : '0'..'9' ;
fragment LETTER : 'a'..'z' | 'A'..'Z' ;
Now "false" will become a FALSE token, and "falser" an ID.
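The tie-breaking behaviour described above can be imitated outside ANTLR. The following is a minimal Python sketch, not ANTLR's actual implementation: the `RULES` table and the `tokenize` helper are hypothetical names introduced for illustration, and regular expressions stand in for the lexer rules. It picks the longest match at each position and, on a tie, keeps the rule that was defined first.

```python
import re

# Token rules in definition order; earlier rules win ties (assumed model of the lexer).
RULES = [
    ("NULL", re.compile(r"null")),
    ("TRUE", re.compile(r"true")),
    ("FALSE", re.compile(r"false")),
    ("ID", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
]


def tokenize(text):
    """Return (rule name, lexeme) pairs using longest match, first rule wins ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace is skipped in this sketch
            continue
        best = None  # (length, rule name, lexeme)
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly greater: a later rule must match MORE characters to win.
            if m and (best is None or len(m.group()) > best[0]):
                best = (len(m.group()), name, m.group())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[1], best[2]))
        pos += best[0]
    return tokens


print(tokenize("false falser null"))
```

With this model, "false" ties between FALSE and ID and goes to FALSE because it is listed first, while "falser" is matched in full only by ID.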

Can ANTLR differentiate between lexer rules based on the following character?

For parsing a test file I'd like to allow identifiers to begin with a number.
my rule is:
ID : ('a'..'z' | 'A'..'Z' | '0'..'9' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '&' | '/' | '-' | '.')*
;
However I also need to match numbers in this file as well. My rule for that is:
INT : '0'..'9'+
;
Obviously Antlr won't let me do this as INT will never be matched.
Is there a way to allow this? Specifically I'd like to match an INTEGER followed by an ID with no spaces as just an ID and create an INT token only if it's followed by a space.
For example:
3BOB -> [ID with text "3BOB"]
3 BOB -> [INT with text "3"] [ID with text "BOB"]
Just change the order in which ID and INT tokens are defined.
grammar qqq;
// Parser's rules.
root:
(integer|identifier)+
;
integer:
INT {System.out.println("INT with text '"+$INT.text+"'.");}
;
identifier:
ID {System.out.println("ID with text '"+$ID.text+"'.");}
;
// Lexer's tokens.
INT: '0'..'9'+
;
ID: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '&' | '/' | '-' | '.')*
;
WS: ' ' {skip();}
;
UNPREDICTED_TOKEN
:
~(' ') {System.out.println("Unpredicted token.");}
;
The order in which tokens are defined in the grammar is significant: if a string can be matched by more than one token rule, it is assigned to the rule that is defined first. In your case, if you want the integer '123' to be an INT even though it also conforms to ID, put the INT definition first.
ANTLR's token matching is greedy, so it won't stop at '123' in '123BOB'; it continues until none of the token rules can match any further and takes the last token matched. So your identifiers can now start with digits.
A remark on tokens order can also be found in this article by Mark Volkmann.
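As a rough illustration of the greedy matching and ordering rule, here is a small Python simulation. It is only a sketch under the assumption that the lexer takes the longest match and breaks ties by definition order; `RULES` and `tokenize` are hypothetical names, and regular expressions replace the INT and ID rules from the grammar above.

```python
import re

# INT is listed before ID, mirroring the grammar in this answer.
RULES = [
    ("INT", re.compile(r"[0-9]+")),
    ("ID", re.compile(r"[A-Za-z0-9_][A-Za-z0-9_&/.\-]*")),
]


def tokenize(text, rules=RULES):
    """Longest match at each position; on equal length the earlier rule wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] == " ":
            pos += 1  # plays the role of the WS rule with skip()
            continue
        best = None  # (length, rule name, lexeme)
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m and (best is None or len(m.group()) > best[0]):
                best = (len(m.group()), name, m.group())
        if best is None:
            raise ValueError(f"unpredicted character at position {pos}")
        tokens.append((best[1], best[2]))
        pos += best[0]
    return tokens


print(tokenize("3BOB"))
print(tokenize("3 BOB"))
# With ID defined first, a lone "3" ties and goes to ID instead:
print(tokenize("3 BOB", rules=list(reversed(RULES))))
```

In this model "3BOB" becomes a single ID because ID matches more characters than INT, whereas the lone "3" is a tie that is decided by which rule comes first.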
The following minor changes in your rules should do the trick:
ID : ('0'..'9')* // optional numbers
('a'..'z' | 'A'..'Z' | '_' | '&' | '/' | '-' | '.') // followed by mandatory character which is not a number
('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '&' | '/' | '-' | '.')* // followed by more stuff (including numbers)
;
INT : '0'..'9'+ // a number
;
You simply allow your identifiers to start with optional digits and make the following non-digit character mandatory.
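One way to sanity-check these two rules is to translate them into regular expressions. The Python sketch below does that by hand, so the translation itself is an assumption, and `classify` is a hypothetical helper. Because the revised ID requires at least one non-digit character, a string of digits can only be an INT, regardless of rule order.

```python
import re

# Hand translation of the revised rules into regular expressions (assumed equivalent).
ID = re.compile(r"[0-9]*[A-Za-z_&/.\-][A-Za-z0-9_&/.\-]*")
INT = re.compile(r"[0-9]+")


def classify(lexeme):
    """Return which rule matches the whole lexeme, or None if neither does."""
    if ID.fullmatch(lexeme):
        return "ID"
    if INT.fullmatch(lexeme):
        return "INT"
    return None


for sample in ["3BOB", "3", "BOB", "123", "-"]:
    print(sample, classify(sample))
```

Note that, as written, the mandatory character set includes '&', '/', '-' and '.', so under this reading an identifier may also begin with one of those characters, which the original ID rule did not allow.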

Why does my grammar work for operators like *, -, /, but not +?

I'm creating a grammar right now and I had to get rid of left recursion, and it seems to work for everything except the addition operator.
Here is the related part of my grammar:
SUBTRACT: '-';
PLUS: '+';
DIVIDE: '/';
MULTIPLY: '*';
expr:
(
IDENTIFIER
| INTEGER
| STRING
| TRUE
| FALSE
)
(
PLUS expr
| SUBTRACT expr
| MULTIPLY expr
| DIVIDE expr
| LESS_THAN expr
| LESS_THAN_OR_EQUAL expr
| EQUALS expr
)*
;
INTEGER: ('0'..'9')*;
IDENTIFIER: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')*;
Then when I try to do something like
x*1
It works perfectly. However, when I try to do something like
x+1
I get an error saying:
MismatchedTokenException: mismatched input '+' expecting '\u001C'
I've been at this for a while but don't get why it works with *, -, and /, but not +. I have the exact same code for all of them.
Edit: If I reorder it and put SUBTRACT above PLUS, the + symbol will now work but the - symbol won't. Why would antlr care about the order of stuff like that?
Avoiding left recursion (in an expression grammar) is usually done like this:
grammar Expr;
parse
: expr EOF
;
expr
: equalityExpr
;
equalityExpr
: relationalExpr (('==' | '!=') relationalExpr)*
;
relationalExpr
: additionExpr (('>=' | '<=' | '>' | '<') additionExpr)*
;
additionExpr
: multiplyExpr (('+'| '-') multiplyExpr)*
;
multiplyExpr
: atom (('*' | '/') atom)*
;
atom
: IDENTIFIER
| INTEGER
| STRING
| TRUE
| FALSE
| '(' expr ')'
;
// ... lexer rules ...
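For example, the input A+B+C is matched by a single additionExpr as multiplyExpr '+' multiplyExpr '+' multiplyExpr, with each multiplyExpr reducing to an atom.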
Also see this related answer: ANTLR: Is there a simple example?
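To make the layering of the rules above concrete, here is a hand-written recursive-descent sketch in Python with one method per grammar rule. It is an illustration rather than what ANTLR generates: the `Parser` class, `tokenize` and `parse` are hypothetical names, STRING, TRUE and FALSE are left out for brevity, and each `operand (op operand)*` loop is folded into a left-associative tuple, which is a choice of this sketch (the ANTLR rule itself produces a flat list of children).

```python
import re

TOKEN = re.compile(r"\s*(==|!=|>=|<=|[-+*/()<>]|[A-Za-z_][A-Za-z0-9_]*|[0-9]+)")
OPERAND = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[0-9]+")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def binary(self, operand, operators):
        # operand (op operand)*, folded left to right
        node = operand()
        while self.peek() in operators:
            op = self.take()
            node = (op, node, operand())
        return node

    # One method per rule; lower-precedence rules call higher-precedence ones.
    def expr(self):
        return self.equality()

    def equality(self):
        return self.binary(self.relational, ("==", "!="))

    def relational(self):
        return self.binary(self.addition, (">=", "<=", ">", "<"))

    def addition(self):
        return self.binary(self.multiply, ("+", "-"))

    def multiply(self):
        return self.binary(self.atom, ("*", "/"))

    def atom(self):
        tok = self.take()
        if tok == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is None or not OPERAND.fullmatch(tok):
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:  # plays the role of EOF in the parse rule
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


print(parse("A+B+C"))
print(parse("x+1*2"))
```

Because `multiply` sits below `addition` in the call chain, `x+1*2` groups the multiplication first without any left-recursive rule being needed.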
I fixed it by moving the trailing alternatives, the part left over from removing left recursion, into a new rule:
expr:
(
IDENTIFIER
| INTEGER
| STRING
| TRUE
| FALSE
) lr*
;
lr: PLUS expr
| SUBTRACT expr
| MULTIPLY expr
| DIVIDE expr
| LESS_THAN expr
| LESS_THAN_OR_EQUAL expr
| EQUALS expr;