How do I label expression alternatives with same precedence level? - antlr

With antlr4 I can label rule alternatives like this:
e : e '*' e # Mult
| e '+' e # Add
| INT # Int
;
From what I understand, in the rule above, Mult has higher precedence than Add because Mult comes before Add in the list of alternatives.
So for instance, if I wrote:
e : e '*' e # Mult
| e ('+'|'-') e # Add
| INT # Int
;
The + in 1 + 2 and - in 4 - 2 have the same precedence.
However, the + and - alternatives are no longer at the top level. Is there a way to label e '+' e # Add and e '-' e # Sub separately while keeping both alternatives at the same precedence level?

I'm afraid not. You can label the operator, though, with op=('+'|'-'), then read ctx.op (a token field on the context, not a method) during a tree walk and ask for its token type.
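As a rough illustration of the op= label approach, here is a sketch written against the conventions of ANTLR's Python target, where a labeled token shows up as an attribute on the context and exposes its token type as .type. The constants PLUS and MINUS, the function name visit_add_sub, and the stand-in context object are all assumptions made so the snippet runs without a generated parser; a real grammar would supply its own.

```python
from types import SimpleNamespace

# Hypothetical token type constants; a generated parser defines its own.
PLUS, MINUS = 1, 2


def visit_add_sub(ctx, visit):
    """Handle `e op=('+'|'-') e # AddSub` by dispatching on the labeled token."""
    left = visit(ctx.e(0))
    right = visit(ctx.e(1))
    if ctx.op.type == PLUS:
        return left + right
    return left - right


# Stand-in for a parse-tree context representing "4 - 2" (assumption for the demo).
ctx = SimpleNamespace(op=SimpleNamespace(type=MINUS), e=lambda i: (4, 2)[i])
print(visit_add_sub(ctx, lambda value: value))  # prints 2
```

Both operators stay in one alternative, so they keep the same precedence level; the distinction between Add and Sub moves into the tree walk.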

Related

antlr 4 - warning: rule contains an optional block with at least one alternative that can match an empty string

I work with antlr v4 to write a t-sql parser.
Is this warning a problem?
"rule 'sqlCommit' contains an optional block with at least one alternative that can match an empty string"
My Code:
sqlCommit: COMMIT (TRAN | TRANSACTION | WORK)? id?;
id:
ID | CREATE | PROC | AS | EXEC | OUTPUT| INTTYPE |VARCHARTYPE |NUMERICTYPE |CHARTYPE |DECIMALTYPE | DOUBLETYPE | REALTYPE
|FLOATTYPE|TINYINTTYPE|SMALLINTTYPE|DATETYPE|DATETIMETYPE|TIMETYPE|TIMESTAMPTYPE|BIGINTTYPE|UNSIGNEDBIGINTTYPE..........
;
ID: (LETTER | UNDERSCORE | RAUTE) (LETTER | [0-9]| DOT | UNDERSCORE)*;
In an earlier version I used the lexer rule ID directly in sqlCommit instead of the parser rule id. The warning appeared after I changed ID to id.
(In case ID versus id is confusing: I want to use the parser rule id instead of ID because an identifier can be a literal that is already matched by another lexer rule.)
Regards
EDIT
With the help of "280Z28" I solved the problem. The parser rule "id" contained one vertical bar (|) more than needed:
BITTYPE|CREATE|PROC|
|AS|EXEC|OUTPUT|
So the doubled | | creates an empty alternative, which means the parser rule can match an empty string.
From a Google search:
ErrorType.EPSILON_OPTIONAL
Compiler Warning 154.
rule <rule> contains an optional block with at least one alternative that can match an empty string
A rule contains an optional block ((...)?) around an empty alternative.
The following rule produces this warning.
x : ;
y : x?; // warning 154
z1 : ('foo' | 'bar'? 'bar2'?)?; // warning 154
z2 : ('foo' | 'bar' 'bar2'? | 'bar2')?; // ok
Since:
4.1
The problem described by this warning is primarily a performance problem. By wrapping a zero-length string in an optional block, you added a completely unnecessary decision to the grammar (whether to enter the optional block or not) which has a high likelihood of forcing the prediction algorithm through its slowest path. It's similar to wrapping Java code in the following:
if (slowMethodThatAlwaysReturnsTrue()) {
...
}
I'm struggling to see why this rule also triggers this warning (with ANTLR 4.7.1):
join_type: (INNER | (left_right_full__join_type)? (OUTER)?)? JOIN;
left_right_full__join_type: LEFT | RIGHT | FULL;
JOIN: J O I N;
INNER: I N N E R;
OUTER: O U T E R;
As far as I can tell, it always matches JOIN, optionally preceded by the join type.
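The condition behind warning 154 can be approximated with a small nullability check. This is only a sketch and not ANTLR's actual implementation: the tuple encoding of grammar elements and the helper names (tok, seq, alt, opt, ref, nullable, has_epsilon_optional) are invented for illustration, and rule references are assumed to be non-recursive.

```python
def tok(name): return ("tok", name)
def seq(*items): return ("seq", items)
def alt(*items): return ("alt", items)
def opt(item): return ("opt", item)
def ref(name): return ("ref", name)


def nullable(node, rules):
    """True if the element can match the empty string."""
    kind, body = node
    if kind == "tok":
        return False
    if kind == "opt":
        return True
    if kind == "seq":
        return all(nullable(child, rules) for child in body)
    if kind == "alt":
        return any(nullable(child, rules) for child in body)
    return nullable(rules[body], rules)  # "ref": look up the referenced rule


def has_epsilon_optional(node, rules):
    """True if some (...)? block wraps content that can itself match nothing."""
    kind, body = node
    if kind == "opt":
        return nullable(body, rules) or has_epsilon_optional(body, rules)
    if kind in ("seq", "alt"):
        return any(has_epsilon_optional(child, rules) for child in body)
    return False


RULES = {
    "x": seq(),  # x : ;
    "left_right_full__join_type": alt(tok("LEFT"), tok("RIGHT"), tok("FULL")),
}
y = opt(ref("x"))
z1 = opt(alt(tok("foo"), seq(opt(tok("bar")), opt(tok("bar2")))))
z2 = opt(alt(tok("foo"), seq(tok("bar"), opt(tok("bar2"))), tok("bar2")))
join_type = seq(
    opt(alt(tok("INNER"),
            seq(opt(ref("left_right_full__join_type")), opt(tok("OUTER"))))),
    tok("JOIN"),
)

for name, rule in [("y", y), ("z1", z1), ("z2", z2), ("join_type", join_type)]:
    print(name, has_epsilon_optional(rule, RULES))
```

Under this model join_type is flagged for the same reason as z1: the second alternative, (left_right_full__join_type)? (OUTER)?, can match nothing, and it sits inside an outer (...)? block. Following the z2 pattern from the documentation, a form such as (INNER | left_right_full__join_type OUTER? | OUTER)? JOIN should accept the same input without an empty alternative.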

LL(1) predictive parsing -- Avoid Left recursion

When defining a grammar, say a grammar to evaluate an arithmetic expression: we divide the Expression to Terms and Factors, like so:
E ::= E + T | T
T ::= T * F | F
F ::= num
| (E)
Then we need to resolve left recursion.
So why not define the grammar like so:
E ::= T + E | T
T ::= F * T | F
F ::= num
| (E)
And have only right recursion.
The problem is that it gets the associativity wrong -- a left-recursive grammar is left associative while a right-recursive grammar is right associative. Since associativity doesn't matter for + or * you don't see a problem, but if you add an operator (such as -) for which associativity DOES matter, you see the problem.
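To make the associativity point concrete, here is a small sketch that evaluates the same operand list both ways. The function names eval_left_assoc and eval_right_assoc are made up for the example; they mimic the grouping that a left-recursive and a right-recursive rule would produce for subtraction.

```python
from functools import reduce


def eval_left_assoc(operands):
    """Group as ((a - b) - c), the shape a left-recursive rule yields."""
    return reduce(lambda acc, value: acc - value, operands)


def eval_right_assoc(operands):
    """Group as (a - (b - c)), the shape a right-recursive rule yields."""
    if len(operands) == 1:
        return operands[0]
    return operands[0] - eval_right_assoc(operands[1:])


print(eval_left_assoc([10, 4, 3]))   # (10 - 4) - 3 = 3
print(eval_right_assoc([10, 4, 3]))  # 10 - (4 - 3) = 9
```

With + in place of - both groupings give the same number, which is why the problem stays hidden in the original grammar.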
Note that the way that you deal with left recursion in an LL grammar is essentially by converting to right recursion and then post-processing the parse tree to turn it back into left recursion. Breaking it down, you convert to
E ::= T + E | T
which you then left-factor into
E ::= T E'
E' ::= \epsilon | + E
this will parse the expression T + T + T as
    E
   / \
  T   E'
     / \
    +   E
       / \
      T   E'
         / \
        +   E
           / \
          T   E'
              |
           \epsilon
which you then evaluate by treating it as a linked list of alternating terms and operators which you evaluate/perform top to bottom (left to right):
tmp1 = eval_term(pop list head)
while (list not empty)
    op = pop list head
    tmp2 = eval_term(pop list head)
    tmp1 = tmp1 op tmp2
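The loop above can be sketched end to end in Python. This is a minimal illustration under some assumptions: terms are plain integers, the token list is already split, - is added next to + so that associativity is visible, and the names parse_E, parse_E_prime, parse_T and evaluate are invented for the example.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub}


def parse_T(tokens, pos):
    """T: a single integer term (simplified for the sketch)."""
    if pos >= len(tokens) or not isinstance(tokens[pos], int):
        raise SyntaxError(f"expected a number at position {pos}")
    return tokens[pos], pos + 1


def parse_E(tokens, pos=0):
    """E ::= T E'  -> returns the flat list [term, op, term, ...] and the next position."""
    term, pos = parse_T(tokens, pos)
    rest, pos = parse_E_prime(tokens, pos)
    return [term] + rest, pos


def parse_E_prime(tokens, pos):
    """E' ::= epsilon | op E  (right recursive, so no left recursion is needed)."""
    if pos < len(tokens) and tokens[pos] in OPS:
        items, next_pos = parse_E(tokens, pos + 1)
        return [tokens[pos]] + items, next_pos
    return [], pos


def evaluate(items):
    """Walk the list head to tail, which applies the operators left to right."""
    items = list(items)
    acc = items.pop(0)
    while items:
        op = items.pop(0)
        rhs = items.pop(0)
        acc = OPS[op](acc, rhs)
    return acc


flat, _ = parse_E([10, "-", 4, "-", 3])
print(flat)            # [10, '-', 4, '-', 3]
print(evaluate(flat))  # 3, i.e. (10 - 4) - 3
```

The parse itself is right recursive, but because the evaluation folds from the head of the list, the result is left associative.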
In the specific example you show, order doesn't matter, so you can swap operands.
But that is not the case for all the other grammars, because moving their symbols may change their meaning; so you need to find another way to eliminate left recursion.

Explanations about FOLLOW function - Grammar

I have some trouble understanding the FOLLOW function. I cannot compute the FOLLOW sets of a grammar. I tried exercises to understand it, this one in particular. I have this grammar:
S -> E
E -> T E'
E' -> + T E' | minus T E' | ε
T -> F T'
T' -> * F T' | ε
F -> id | ( F'
F' -> E ) | n )
Here are the computed FOLLOW sets:
S $
E ), $
E' ), $
T +, minus, ), $
T' +, minus, ), $
F *, +, minus, ), $
F' *, +, minus, ), $
I really don't understand why FOLLOW(T) = FOLLOW(T') = { +, minus, ), $ }.
In the grammar I give, the terminal symbols + and minus never appear directly to the right of T or T', so if someone could explain this, that would be great.
Conceptually, FOLLOW(X) is the set of tokens that can come AFTER an X in a legal sentence in the grammar. So to calculate it, you look at where X appears on the right side of a rule (any rule) and see what comes after it. In the case of T', you have
T -> F T'
T' -> * F T'
since T' is the last thing on the rhs in both cases, you end up with FOLLOW(T') = FOLLOW(T) ∪ FOLLOW(T'), which is equivalent to FOLLOW(T') = FOLLOW(T).
For T you have:
E -> T E'
E' -> + T E'
which gives you FOLLOW(T) = FIRST(E') ∪ FOLLOW(E) ∪ FOLLOW(E') -- the FOLLOWs are included because E' expands to ε. Depending on exactly whose formulation of FIRST and FOLLOW you use, that may mean that ε ∈ FIRST(E') (in which case you remove it from FOLLOW(T)) or that NULLABLE(E') = true, but the overall effect on FOLLOW(T) is the same -- it gets + and minus from FIRST(E') and ) and $ from FOLLOW(E)
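The rules described above can be checked mechanically with a fixed-point computation. The sketch below encodes the grammar from the question and uses the formulation in which ε is a member of FIRST; the dictionary layout and the function names first_of_sequence, compute_first and compute_follow are choices made for this example, not a standard API.

```python
EPS, END = "ε", "$"

# Each nonterminal maps to its alternatives; [] is the empty (ε) alternative.
GRAMMAR = {
    "S": [["E"]],
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["minus", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["id"], ["(", "F'"]],
    "F'": [["E", ")"], ["n", ")"]],
}


def first_of_sequence(symbols, first):
    """FIRST of a symbol string; includes EPS only if every symbol is nullable."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in GRAMMAR else {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in GRAMMAR.items():
            for alternative in alternatives:
                new = first_of_sequence(alternative, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(start="S"):
    first = compute_first()
    follow = {nt: set() for nt in GRAMMAR}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in GRAMMAR.items():
            for alternative in alternatives:
                for i, sym in enumerate(alternative):
                    if sym not in GRAMMAR:
                        continue  # FOLLOW is only defined for nonterminals
                    tail = first_of_sequence(alternative[i + 1:], first)
                    new = tail - {EPS}
                    if EPS in tail:  # everything after sym can vanish
                        new |= follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


for nonterminal, symbols in compute_follow().items():
    print(nonterminal, sorted(symbols))
```

For T, the tail E' contributes + and minus through FIRST(E'), and because E' is nullable the FOLLOW sets of E and E' are added as well, which is where ) and $ come from.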

Write an unambiguous Statement grammar that meets the following requirements:

Write a “Statement” grammar that meets the following requirements:
skip is a valid statement
Assignment of the form x := E is a valid statement, where x is an identifier and E is an arithmetic expression
The composition of two statements S0 ; S1 is a valid statement
I have the following solution, but am not sure if it is correct:
x:: E|skip|s0 E|s1 E
S:
  SKIP
| ID ':=' E
| S ';' S
;
There must be another rule for E; SKIP and ID are lexical tokens.
How about this? I'm not sure about what would be considered a "valid" arithmetic expression and what would be considered valid identifiers but how about something like this?
S :: 'skip'
S :: IDENTIFIER ':=' E
S :: S ';' S
A1 :: '+' | '-'
A2 :: '*' | '/'
NBR :: '1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'|'0'
O :: NBR /* remove this if arithm. expression only on identifiers */
O :: IDENTIFIER
O :: '(' E ')'
F :: O
F :: F A2 O
E :: F
E :: E A1 F
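One way to check a grammar like this is to hand-write a recursive-descent parser for it. The sketch below is not the grammar above verbatim: it assumes an unambiguous reading in which a program is a ';'-separated sequence of atomic statements, * and / bind tighter than + and -, numbers may have several digits, and all operators are left associative. The Parser class, the tuple shapes it returns, and the tokenizer are invented for the example.

```python
import re

# Tokens: ':=', ';', parentheses, arithmetic operators, numbers, identifiers.
TOKEN = re.compile(r"\s*(:=|[;()+\-*/]|\d+|[A-Za-z_]\w*)")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, got {token!r}")
        self.pos += 1
        return token

    def program(self):
        """S ::= A (';' A)*  -- composition parsed as a flat sequence."""
        statements = [self.atomic()]
        while self.peek() == ";":
            self.take(";")
            statements.append(self.atomic())
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return statements

    def atomic(self):
        """A ::= 'skip' | IDENTIFIER ':=' E"""
        if self.peek() == "skip":
            self.take()
            return ("skip",)
        name = self.take()
        if not name.isidentifier():
            raise SyntaxError(f"expected identifier, got {name!r}")
        self.take(":=")
        return ("assign", name, self.expr())

    def expr(self):
        """E ::= F (('+' | '-') F)*"""
        node = self.term()
        while self.peek() in ("+", "-"):
            node = (self.take(), node, self.term())
        return node

    def term(self):
        """F ::= O (('*' | '/') O)*"""
        node = self.operand()
        while self.peek() in ("*", "/"):
            node = (self.take(), node, self.operand())
        return node

    def operand(self):
        """O ::= NUMBER | IDENTIFIER | '(' E ')'"""
        token = self.take()
        if token == "(":
            node = self.expr()
            self.take(")")
            return node
        if token.isdigit():
            return int(token)
        if token.isidentifier():
            return token
        raise SyntaxError(f"unexpected token {token!r}")


print(Parser("skip; x := 1 + 2 * y").program())
```

Parsing S0 ; S1 as a flat sequence avoids choosing between (S0 ; S1) ; S2 and S0 ; (S1 ; S2), which is the ambiguity a rule of the form S ';' S introduces.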

How to get rid of useless nodes from this AST tree?

I have already looked at this question and even though the question titles seem to be the same; it doesn't answer my question, at least not in any way that I can understand.
Parsing Math
Here is what I am parsing:
PI -> 3.14.
Number area(Number radius) -> PI * radius^2.
This is how I want my AST tree to look, minus all the useless root nodes.
how it should look http://vertigrated.com/images/How%20I%20want%20the%20tree%20to%20look.png
Here are what I hope are the relevant fragments of my grammar:
term : '(' expression ')'
| number -> ^(NUMBER number)
| (function_invocation)=> function_invocation
| ATOM
| ID
;
power : term ('^' term)* -> ^(POWER term (term)* ) ;
unary : ('+'! | '-'^)* power ;
multiply : unary ('*' unary)* -> ^(MULTIPLY unary (unary)* ) ;
divide : multiply ('/' multiply)* -> ^(DIVIDE multiply (multiply)* );
modulo : divide ('%' divide)* -> ^(MODULO divide (divide)*) ;
subtract : modulo ('-' modulo)* -> ^(SUBTRACT modulo (modulo)* ) ;
add : subtract ('+' subtract)* -> ^(ADDITION subtract (subtract)*) ;
relation : add (('=' | '!=' | '<' | '<=' | '>=' | '>') add)* ;
expression : relation (and_or relation)*
| string
| container_access
;
and_or : '&' | '|' ;
Precedence
I still want to keep the precedence as illustrated in the following diagrams, but want to eliminate the useless nodes if at all possible.
Source: Number a(x) -> 0 - 1 + 2 * 3 / 4 % 5 ^ 6.
Here are the nodes I want to eliminate:
how I want the precedence tree to look http://vertigrated.com/images/example%202%20desired%20result.png
Basically, I want to eliminate any node that does not have a binary operation directly beneath it.
You must realize that the two rules:
add : sub ( ('+' sub)+ -> ^(ADD sub (sub)*) | -> sub ) ;
and
add : sub ('+'^ sub)* ;
do not produce the same AST. Given the input 1+2+3, the first rule will produce:
      ADD
       |
    .--+--.
    |  |  |
    1  2  3
where the second rule produces:
         (+)
          |
       .--+--.
       |     |
      (+)    3
       |
    .--+--.
    |     |
    1     2
The latter makes more sense: infix expressions are expected to have 2 child nodes, not more.
Why not simply remove the literals in your parser rules and just do:
add : sub (ADD^ sub)*;
ADD : '+';
Creating the same AST using a rewrite rule would look like this:
add : (sub -> sub) ('+' s=sub -> ^(ADD $add $s))*;
Also see chapter 7: Tree Construction from The Definitive ANTLR Reference. Especially the paragraphs Rewrite Rules in Subrules (page 173) and Referencing Previous Rule ASTs in Rewrite Rules (page 174/175).
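Outside of ANTLR, the effect of the rewrite rule (sub -> sub) ('+' s=sub -> ^(ADD $add $s))* can be pictured as a left fold over the operands. The sketch below uses nested tuples as a stand-in for AST nodes; the function name build_left_assoc and the tuple representation are assumptions for illustration only.

```python
def build_left_assoc(label, operands):
    """Fold operands into a left-leaning binary tree of (label, left, right) tuples."""
    tree = operands[0]                 # corresponds to (sub -> sub)
    for operand in operands[1:]:
        tree = (label, tree, operand)  # corresponds to ^(ADD $add $s)
    return tree


print(build_left_assoc("ADD", [1, 2, 3]))  # ('ADD', ('ADD', 1, 2), 3)
print(build_left_assoc("ADD", [1]))        # 1, no wrapper node for a lone operand
```

Each loop iteration wraps the tree built so far, so every ADD node has exactly two children, and an input with no + at all produces no ADD node.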
Your rule (and others like it)
add : subtract ('+' subtract)* -> ^(ADDITION subtract (subtract)*) ;
produces the useless production when you don't have a sequence of add operations.
I'm not an ANTLR expert, but I'd guess you need two cases: one for an actual sequence of add operations, which generates your standard tree, and one for a single operand, which simply passes the child tree up to the parent without creating a new node.
add : subtract ( ('+' subtract)+ -> ^(ADDITION subtract (subtract)*)
| -> subtract ) ;
Similar changes for other rules with sequences of operands to an operator.
To get rid of the irrelevant nodes, just be explicit:
subtract
:
modulo
(
( '-' modulo)+ -> ^(SUBTRACT modulo+) // no need for parenthesis or asterisk
|
() -> modulo
)
;
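The two-branch rule above amounts to "only create a node when at least one operator was matched". A rough Python picture of that behavior, again with tuples standing in for AST nodes and with the hypothetical helper name build_flat:

```python
def build_flat(label, operands):
    """Return the lone operand unchanged, or one flat node holding all operands."""
    if len(operands) == 1:
        return operands[0]     # the `() -> modulo` branch: pass the child up
    return (label, *operands)  # the `^(SUBTRACT modulo+)` branch


print(build_flat("SUBTRACT", [7]))        # 7
print(build_flat("SUBTRACT", [7, 2, 1]))  # ('SUBTRACT', 7, 2, 1)
```

Unlike the left-fold form, this keeps all operands as siblings under one node, so a tree walker has to apply the operator across the children itself.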
Even though I accepted Bart's answer as correct, I wanted to post my own complete answer, with the example code that I got working, for completeness.
Here is what I did based on Bart's answer:
unary : ('+'! | '-'^)? term ;
pow : (unary -> unary) ('^' s=unary -> ^(POWER $pow $s))*;
mod : (pow -> pow) ('%' s=pow -> ^(MODULO $mod $s))*;
mult : (mod -> mod) ('*' s=mod -> ^(MULTIPLY $mult $s))*;
div : (mult -> mult) ('/' s=mult -> ^(DIVIDE $div $s))*;
sub : (div -> div) ('-' s=div -> ^(SUBTRACT $sub $s))*;
add : (sub -> sub) ('+' s=sub -> ^(ADD $add $s))*;
And here is what the resulting tree looks like:
working answer http://vertigrated.com/images/working_answer.png
There is an alternative solution to just not use the rewrites and promote the symbols themselves to roots, but I want all descriptive labels in my tree if at all possible. I am just being anal about how the tree is represented so that my tree walking code will be as clean as possible!
power : unary ('^'^ unary)* ;
mod : power ('%'^ power)* ;
mult : mod ('*'^ mod)* ;
div : mult ('/'^ mult)* ;
sub : div ('-'^ div)* ;
add : sub ('+'^ sub)* ;
And this looks like this:
without rewrites http://vertigrated.com/images/without_the_rewrites.png