How to remove indirect left recursion in an ANTLR grammar

I have a grammar as follows:
expression : scalar
| vector;
scalar : <bunch of rules>
| vector[scalar] #VectorIndex
;
vector : <bunch of rules>
| scalar ('*' | '+' | '-') vector
;
Is there any way to remove the indirect left recursion from this grammar? Replacing vector with all of its sub-rules would make the grammar too repetitive and messy.

Related

Mutually left recursive with simple calculator

When I attempt to compile my antlr4 calculator grammar, it turns out it is left recursive. I need to change it to make it correct.
I have tried rewriting the rules and moving the parentheses around, but none of it works. Here's my latest version of the rules that cause the error:
Parser:
expression: INT | DECIMAL | arithmetic;
arithmetic: expression OPERATION expression;
Lexer:
OPERATION: SUB | ADD | MULT | DIV;
SUB: '-';
ADD: '+';
MULT: '*';
DIV: '/';
DPOINT: '.';
INT: SUB? NUMBER+;
DECIMAL: SUB? NUMBER+ DPOINT NUMBER+;
I expect the compilation to be successful, but the following error occurs:
ANTLR Tool v4.4 (/tmp/antlr-4.4-complete.jar)
hZH.g4 -o /home/heng/workspace/Ultimate ZH Compiler/target/generated-sources/antlr4 -listener -no-visitor -encoding UTF-8
error(119): hZH.g4::: The following sets of rules are mutually left-recursive [expression, arithmetic]
1 error(s)
BUILD FAIL
How can I change my rules for the build to be successful?
Indirect left-recursive rules aren't supported, but direct left recursion is. So try this:
expression
: expression OPERATION expression
| INT
| DECIMAL
;
I wouldn't let the lexer attach the - to a number; let the parser handle it instead, like this:
expression
: SUB expression
| expression ( MULT | DIV ) expression
| expression ( ADD | SUB ) expression
| INT
| DECIMAL
| OPAR expression CPAR
;
SUB: '-';
ADD: '+';
MULT: '*';
DIV: '/';
INT: NUMBER+;
DECIMAL: NUMBER+ '.' NUMBER+;
OPAR: '(';
CPAR: ')';
Also note that I gave * and / a higher precedence by moving them above + and -.
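To illustrate what that ordering of alternatives achieves, here is a minimal Python sketch of a precedence-climbing parser. It is not what ANTLR generates; the PRECEDENCE table, the tuple-based tree and the function names are assumptions made up for this example. It only shows * and / binding tighter than + and -, with unary minus handled in the parser rather than the lexer.

```python
import re

# Binding power per operator; a higher number binds tighter. This table is an
# assumption that mirrors the order of the alternatives in the grammar above.
PRECEDENCE = {"*": 2, "/": 2, "+": 1, "-": 1}


def tokenize(text):
    # Numbers (INT or DECIMAL), operators and parentheses; whitespace is skipped.
    return re.findall(r"\d+(?:\.\d+)?|[-+*/()]", text)


def parse(tokens, min_prec=1):
    """Parse a token list (consumed from the front) into a nested tuple tree."""
    left = parse_atom(tokens)
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Asking for a strictly higher precedence on the right-hand side
        # makes every binary operator left-associative.
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left


def parse_atom(tokens):
    tok = tokens.pop(0)
    if tok == "-":
        # Unary minus is a parser concern here, not part of the number token.
        return ("neg", parse_atom(tokens))
    if tok == "(":
        inner = parse(tokens)
        tokens.pop(0)  # discard the closing ")"
        return inner
    return float(tok) if "." in tok else int(tok)
```

With this sketch, parse(tokenize("1 + 2 * 3")) groups the multiplication first, and parse(tokenize("8 - 3 - 2")) nests to the left.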

What does this ANTLR4 notation mean?

I have a question regarding the notation of a UCB Logo grammar that I found, written for ANTLR4. There are some notations I can't make out, so I thought I'd ask. If anyone is willing to clarify, I would be grateful.
Here are the notations I don't quite understand:
WORD
: {listDepth > 0}? ~[ \t\r\n\[\];] ( ~[ \t\r\n\];~] | LINE_CONTINUATION | '\\' ( [ \t\[\]();~] | LINE_BREAK ) )*
| {arrayDepth > 0}? ~[ \t\r\n{};] ( ~[ \t\r\n};~] | LINE_CONTINUATION | '\\' ( [ \t{}();~] | LINE_BREAK ) )*;
array
: '{' ( ~( '{' | '}' ) | array )* '}';
NAME
: ~[-+*/=<> \t\r\n\[\]()":{}] ( ~[-+*/=<> \t\r\n\[\](){}] | LINE_CONTINUATION | '\\' [-+*/=<> \t\r\n\[\]();~{}] )*;
I guess the array means that it can start with { and have an arbitrary number of levels, but has to end with }.
I take it that the others are some form of regular expressions?
To my knowledge, regex syntax differs between programming languages.
Did I get that right?
Antlr does not do regular expressions. It does implement some of the same operators, but that is where the similarity largely ends.
The first sub-terms ( {listDepth > 0}?) in the WORD rule are semantic predicates, which have no counterpart in the regular expression world. They are defined in the Antlr documentation and explained in detail in The Definitive ANTLR 4 Reference (TDAR).
Your understanding of the array rule is essentially correct.
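As a rough illustration of why the recursive array rule allows arbitrary nesting, here is a small Python sketch that mimics it. It works on raw characters instead of tokens, which is a simplification, and the function name match_array is made up for this example.

```python
def match_array(text, pos=0):
    """Return the index just past a balanced '{...}' that starts at pos, or -1.

    Mirrors the shape of: array : '{' ( ~('{' | '}') | array )* '}' ;
    """
    if pos >= len(text) or text[pos] != "{":
        return -1
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == "}":
            return pos + 1  # the closing brace of this level
        if ch == "{":
            # A nested array: recurse, then continue after its closing brace.
            pos = match_array(text, pos)
            if pos == -1:
                return -1
        else:
            pos += 1  # any character other than '{' or '}'
    return -1  # ran out of input before the closing brace
```

Each opening brace starts a new recursive call, so the nesting depth is limited only by the input.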

What is wrong with this ANTLR Grammar? Conditional statement nested parenthesis

I've been tasked with writing a prototype of my team's DSL in Java, so I thought I would try it out using ANTLR. However I'm having problems with the 'expression' and 'condition' rules.
The DSL is already well defined so I would like to keep as close to the current spec as possible.
grammar MyDSL;
// Obviously this is just a snippet of the whole language, but it should give a
// decent view of the issue.
entry
: condition EOF
;
condition
: LPAREN condition RPAREN
| atomic_condition
| NOT condition
| condition AND condition
| condition OR condition
;
atomic_condition
: expression compare_operator expression
| expression (IS NULL | IS NOT NULL)
| identifier
| BOOLEAN
;
compare_operator
: EQUALS
| NEQUALS
| GT | LT
| GTEQUALS | LTEQUALS
;
expression
: LPAREN expression RPAREN
| atomic_expression
| PREFIX expression
| expression (MULTIPLY | DIVIDE) expression
| expression (ADD | SUBTRACT) expression
| expression CONCATENATE expression
;
atomic_expression
: SUBSTR LPAREN expression COMMA expression (COMMA expression)? RPAREN
| identifier
| INTEGER
;
identifier
: WORD
;
// Function Names
SUBSTR: 'SUBSTR';
// Control Chars
LPAREN : '(';
RPAREN : ')';
COMMA : ',';
// Literals and Identifiers
fragment DIGIT : [0-9] ;
INTEGER: DIGIT+;
fragment LETTER : [A-Za-z#$#];
fragment CHARACTER : DIGIT | LETTER | '_';
WORD: LETTER CHARACTER*;
BOOLEAN: 'TRUE' | 'FALSE';
// Arithmetic Operators
MULTIPLY : '*';
DIVIDE : '/';
ADD : '+';
SUBTRACT : '-';
PREFIX: ADD| SUBTRACT ;
// String Operators
CONCATENATE : '||';
// Comparison Operators
EQUALS : '==';
NEQUALS : '<>';
GTEQUALS : '>=';
LTEQUALS : '<=';
GT : '>';
LT : '<';
// Logical Operators
NOT : 'NOT';
AND : 'AND';
OR : 'OR';
// Keywords
IS : 'IS';
NULL: 'NULL';
// Whitespace
BLANK: [ \t\n\r]+ -> channel(HIDDEN) ;
The phrase I'm testing with is
(FOO == 115 AND (SUBSTR(BAR,2,1) == 1 OR SUBSTR(BAR,4,1) == 1))
However it is breaking on the nested parenthesis, matching the first ( with the first ) instead of the outermost (see below). In ANTLR3 I solved this with semantic predicates but it seems that ANTLR4 is supposed to have fixed the need for those.
I'd really like to keep the condition and the expression rules separate if at all possible. I have been able to get it to work when merged together in a single expression rule (based on examples here and elsewhere) but the current DSL spec has them as different and I'm trying to reduce any possible differences in behaviour.
Can anyone point out how I can get this all working while maintaining separate rules for condition and expression? Many thanks!
The grammar seems fine to me.
There's one thing going wrong in the lexer: the WORD token is defined before various keywords/operators causing it to get precedence over them. Place your WORD rule at the very end of your lexer rules (or at least after the last keywords which WORD could also match).
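The effect of rule order can be sketched in Python. ANTLR lexers pick the longest match and, when several rules match the same length, the rule defined first. The tokenizer below is a simplified stand-in for that behaviour, not ANTLR's implementation; the rule lists and names are assumptions trimmed down from the grammar in the question.

```python
import re


def tokenize(text, rules):
    """rules: (name, regex) pairs in definition order.

    The longest match wins; on a tie, the earlier rule wins.
    """
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    pos, out = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, rx in compiled:
            m = rx.match(text, pos)
            # The strict ">" keeps the earlier rule when lengths are equal.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out


WORD = ("WORD", r"[A-Za-z#$][A-Za-z0-9#$_]*")
INTEGER = ("INTEGER", r"[0-9]+")
KEYWORDS = [("AND", r"AND"), ("OR", r"OR")]

word_first = [WORD, INTEGER] + KEYWORDS   # like the grammar in the question
word_last = KEYWORDS + [INTEGER, WORD]    # like the suggested fix
```

With word_first, the input "FOO AND ORDER" yields three WORD tokens, so the parser never sees an AND token. With word_last, AND is recognized as a keyword while ORDER is still a WORD, because the longer match beats the OR prefix.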

How to get rid of useless nodes from this AST tree?

I have already looked at this question and even though the question titles seem to be the same; it doesn't answer my question, at least not in any way that I can understand.
Parsing Math
Here is what I am parsing:
PI -> 3.14.
Number area(Number radius) -> PI * radius^2.
This is how I want my AST tree to look, minus all the useless root nodes.
how it should look http://vertigrated.com/images/How%20I%20want%20the%20tree%20to%20look.png
Here are what I hope are the relevant fragments of my grammar:
term : '(' expression ')'
| number -> ^(NUMBER number)
| (function_invocation)=> function_invocation
| ATOM
| ID
;
power : term ('^' term)* -> ^(POWER term (term)* ) ;
unary : ('+'! | '-'^)* power ;
multiply : unary ('*' unary)* -> ^(MULTIPLY unary (unary)* ) ;
divide : multiply ('/' multiply)* -> ^(DIVIDE multiply (multiply)* );
modulo : divide ('%' divide)* -> ^(MODULO divide (divide)*) ;
subtract : modulo ('-' modulo)* -> ^(SUBTRACT modulo (modulo)* ) ;
add : subtract ('+' subtract)* -> ^(ADDITION subtract (subtract)*) ;
relation : add (('=' | '!=' | '<' | '<=' | '>=' | '>') add)* ;
expression : relation (and_or relation)*
| string
| container_access
;
and_or : '&' | '|' ;
Precedence
I still want to keep the precedence as illustrated in the following diagrams, but want to eliminate the useless nodes if at all possible.
Source: Number a(x) -> 0 - 1 + 2 * 3 / 4 % 5 ^ 6.
Here are the nodes I want to eliminate:
how I want the precedence tree to look http://vertigrated.com/images/example%202%20desired%20result.png
Basically, I want to eliminate any node that doesn't directly have a binary operation branching under it.
You must realize that the two rules:
add : sub ( ('+' sub)+ -> ^(ADD sub (sub)*) | -> sub ) ;
and
add : sub ('+'^ sub)* ;
do not produce the same AST. Given the input 1+2+3, the first rule will produce:
ADD
|
.--+--.
| | |
1 2 3
where the second rule produces:
(+)
|
.--+--.
| |
(+) 3
|
.--+--.
| |
1 2
The latter makes more sense: infix expressions are expected to have 2 child nodes, not more.
Why not simply remove the literals in your parser rules and just do:
add : sub (ADD^ sub)*;
ADD : '+';
Creating the same AST using a rewrite rule would look like this:
add : (sub -> sub) ('+' s=sub -> ^(ADD $add $s))*;
Also see chapter 7: Tree Construction from The Definitive ANTLR Reference. Especially the paragraphs Rewrite Rules in Subrules (page 173) and Referencing Previous Rule ASTs in Rewrite Rules (page 174/175).
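The difference between the two tree shapes can be sketched in Python, using tuples as stand-ins for AST nodes. The function names flat_tree and nested_tree are made up for this illustration and are not part of ANTLR.

```python
def flat_tree(operands, label="ADD"):
    # One n-ary node holding every operand, like ^(ADD sub (sub)*).
    # Note that a single operand still gets wrapped in a node.
    return (label, *operands)


def nested_tree(operands, label="ADD"):
    # Start with the first operand, then make the tree built so far the left
    # child of each new binary node, like
    # (sub -> sub) ('+' s=sub -> ^(ADD $add $s))*
    tree = operands[0]
    for operand in operands[1:]:
        tree = (label, tree, operand)
    return tree
```

For the operands 1, 2, 3 the first function returns one node with three children, while the second returns a left-nested chain of binary nodes. For a single operand, nested_tree returns the operand itself, which is why that style leaves no useless parent node behind.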
Your rule (and other like it)
add : subtract ('+' subtract)* -> ^(ADDITION subtract (subtract)*) ;
produces the useless production when you don't have a sequence of add operations.
I'm not an ANTLR expert, but I'd guess you need two cases, one for an add term
that is unary, and one for a set of children, the first of which generates your
standard tree, and the second of which simply passes the child tree up to the parent,
without creating a new node?
add : subtract ( ('+' subtract)+ -> ^(ADDITION subtract (subtract)*)
| -> subtract ) ;
Similar changes for other rules with sequences of operands to an operator.
To get rid of the irrelevant nodes, just be explicit:
subtract
:
modulo
(
( '-' modulo)+ -> ^(SUBTRACT modulo+) // no need for parenthesis or asterisk
|
() -> modulo
)
;
Even though I accepted Bart's answer as correct, I wanted to post my own complete answer, with the example code that I got working, for completeness.
Here is what I did based on Bart's answer:
unary : ('+'! | '-'^)? term ;
pow : (unary -> unary) ('^' s=unary -> ^(POWER $pow $s))*;
mod : (pow -> pow) ('%' s=pow -> ^(MODULO $mod $s))*;
mult : (mod -> mod) ('*' s=mod -> ^(MULTIPLY $mult $s))*;
div : (mult -> mult) ('/' s=mult -> ^(DIVIDE $div $s))*;
sub : (div -> div) ('-' s=div -> ^(SUBTRACT $sub $s))*;
add : (sub -> sub) ('+' s=sub -> ^(ADD $add $s))*;
And here is what the resulting tree looks like:
working answer http://vertigrated.com/images/working_answer.png
There is an alternative solution to just not use the rewrites and promote the symbols themselves to roots, but I want all descriptive labels in my tree if at all possible. I am just being anal about how the tree is represented so that my tree walking code will be as clean as possible!
power : unary ('^'^ unary)* ;
mod : power ('%'^ power)* ;
mult : mod ('*'^ mod)* ;
div : mult ('/'^ mult)* ;
sub : div ('-'^ div)* ;
add : sub ('+'^ sub)* ;
And this looks like this:
without rewrites http://vertigrated.com/images/without_the_rewrites.png

Lvalue awareness in ANTLR grammar and syntax predicates

I am implementing a parser with ANTLR for D. This language is based on C, so there is some ambiguity between declarations and expressions. Consider this:
a* b = c; // This is a declaration of the variable b with type pointer-to-a.
c = a * b; // as an expression is a multiplication.
As the second example could only appear on the right of an assignment expression I tried to resolve this problem with the following snippet:
expression
: left = assignOrConditional
(',' right = assignOrConditional)*
;
assignOrConditional
: ( postfixExpression ('=' | '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '~=' | '<<=' | '>>=' | '>>>=' | '^^=') )=> assignExpression
| conditionalExpression
;
assignExpression
: left = postfixExpression
( op = ('=' | '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '~=' | '<<=' | '>>=' | '>>>=' | '^^=')
right = assignOrExpression
)?
;
conditionalExpression
: left = logicalOrExpression
('?' e1 = conditionalExpression ':' e2 = conditionalExpression)?
;
As far as my understanding goes, this should do the trick to avoid the ambiguity but the tests are failing. If I feed the interpreter with any input, starting with the rule assignOrConditional, it will fail with NoViableAltException.
the inputs were
a = b
b-=c
d
Maybe I'm misunderstanding how the predicates are working therefore it would be great if someone could correct my explanation to the code: If the input can be read as a postfixExpression it will check if the next token after the postfixExpression is one of the assignment operators and if it is, it will parse the rule as an assignmentExpression. (Note that the assignmentExpression and the conditionalExpression work well on their own.) If the next token isn't one of them, it tries to parse the input as a conditionalExpression.
EDIT
[solved] Now there's another problem with this solution that I've realized: the assignmentExpression has to decide whether its right-hand expression is an assignment again (that is, a postfix followed by an assignment operator) if assignments are chained.
Any idea what's wrong with my understanding?
If I feed the interpreter with any input, ...
Don't use ANTLRWorks' interpreter: it is buggy, and disregards any type of predicate. Use its debugger: it works flawlessly.
If the input can be read as a postfixExpression it will check if the next token after the postfixExpression is one of the assignment operators and if it is, it will parse the rule as an assignmentExpression.
You are correct.
EDIT [solved] Now there's another problem with this solution that I've realized: the assignmentExpression has to decide whether its right-hand expression is an assignment again (that is, a postfix followed by an assignment operator) if assignments are chained.
What's wrong with that?
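The decision the syntactic predicate makes can be sketched in Python as a speculative lookahead that consumes nothing. This is only an illustration: skip_postfix is a hypothetical, heavily simplified stand-in for postfixExpression (an identifier with optional '.name' suffixes), and the token lists are assumed to be already split.

```python
ASSIGN_OPS = {"=", "+=", "-=", "*=", "/=", "%=", "&=", "|=", "^=", "~=",
              "<<=", ">>=", ">>>=", "^^="}


def skip_postfix(tokens, pos):
    """Hypothetical stand-in for postfixExpression.

    Accepts an identifier followed by any number of '.name' suffixes and
    returns the position after it, or None if nothing matches.
    """
    if pos >= len(tokens) or not tokens[pos].isidentifier():
        return None
    pos += 1
    while pos + 1 < len(tokens) and tokens[pos] == "." and tokens[pos + 1].isidentifier():
        pos += 2
    return pos


def choose_alternative(tokens):
    # Speculative pass: look past a postfix expression without consuming it,
    # then check whether an assignment operator follows.
    end = skip_postfix(tokens, 0)
    if end is not None and end < len(tokens) and tokens[end] in ASSIGN_OPS:
        return "assignExpression"
    return "conditionalExpression"
```

For the three inputs from the question, ["a", "=", "b"] and ["b", "-=", "c"] select assignExpression, while ["d"] falls through to conditionalExpression.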