Defining a left-associative parser with PetitParser - smalltalk

In http://pharobooks.gforge.inria.fr/PharoByExampleTwo-Eng/latest/, an ExpressionGrammar is defined. However, it is right-associative
parser parse: '1 + 2 + 6'. ======> #(1 $+ #(2 $+ 6))
How can I make it left-associative so that
parser parse: '1 + 2 + 6'.
results in
#(#(1 $+ 2) $+ 6)
?

For left associative grammars use:
term := (prod sepratedBy: $+ asParser trim) foldLeft: [ :a :op :b |
...]
For right associative grammars use:
raise := (prod sepratedBy: $^ asParser trim) foldRight: [ :a :op :b |
...]
Alternatively you might want to look at PPExpressionParser, that handles all the details automatically for you. You just tell it what operators are left-associative, right-associative, prefix, or postfix operators. Have a look at the class comment for a in-depth discussion.

look at PPExpressionParser class.
it's designed for that and you have a great example in the class comment

Related

How to disambiguate a subselect from a parenthesized expression?

I have the following expression notation:
expr
: OpenParen expr (Comma expr)* Comma? CloseParen # parenExpr
| OpenParen simpleSelect CloseParen # subSelectExpr
Unfortunately, a simpleSelect can also have a parenthetical around it, and so the following statement becomes ambiguous:
select ((select 1))
Here is the current grammar that I have, simplified down to only showing the issue:
grammar Subselect;
options { caseInsensitive=true; }
statement: query_statement EOF;
query_statement
: query_expr # simple
| query_statement set_op query_statement # set
;
query_expr
: with_clause?
( select | '(' query_statement ')' )
limit_clause?
;
select
: select_clause
(from_clause
where_clause?)?
;
with_clause: 'WITH' expr 'AS (' select ')';
select_clause: 'SELECT' expr (',' expr)*;
from_clause: 'FROM' expr;
where_clause: 'WHERE' expr;
limit_clause: 'LIMIT' expr;
set_op: 'UNION'|'INTERSECT'|'EXCEPT';
expr
: '(' expr ')' # parenExpr
| '(' query_expr ')' # subSelect
| Atom # identifier
;
Atom: [a-z_0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
And on the parse of select ((select 1)), here is the output:
What would be a possible way to disambiguate this?
I suppose the main thing is here:
'(' query_statement ')'
Since that recursively calls itself -- is there a way to do indirection or something else such that a query_statement called from within parens can never itself have parens?
Also, maybe this is a common thing? I get the same ambiguous output when running this on the official MySQL grammar here:
I would be curious whether any of the grammars can solve the issue here: https://github.com/antlr/grammars-v4/tree/master/sql. Maybe the best approach is just to remove duplicate parens before parsing the text? (If so, are there are good tools to do that, or do I need to write an additional antlr parser just to do that?)
Your input generates this parse tree:
That's a reasonable interpretation of your input and it is identified as a subSelect expr. It's a subSelect nested in a parenExpr (both of which are exprs).
If I switch up your rule a bit:
expr: '(' query_expr ')' # subSelect
| '(' expr ')' # parenExpr
| Atom # identifier
;
Now it's a subSelect that interprets the nested (select 1) as a query expression.
It's ambiguous because the outer parenthesized expression could match either of the first two alternatives resulting in different parse trees.
In ANTLR, ambiguities in alternatives are resolved by "using" the first alternative that matches. In this way ANTLR has deterministic behavior where you can control which interpretation is used (with alternative order). It's not uncommon for ANTLR grammars to have ambiguities like this.
IMHO, the IntelliJ plugin has caused many people to stumble over this as an indication that something is "wrong" with the grammar. There's a reason that ANTLR itself does not report an error in this situation. It has defined, deterministic behavior.
So far as "resolving" this ambiguity: the simple fact that the syntax uses parentheses to indicate two different "things" indicates that it is inherently ambiguous, so I don't believe you can "fix" the grammar ambiguity without modifying the syntax. (I might be wrong about this, and would find it interesting if someone provides a refactoring that manages to remove the ambiguity.)
EDIT:
After trying an earlier solution that proved incorrect with some additional test data, I've tried a different approach.
I added Atom as a viable alternative for query_expr since that Atom '1` is being offered as test data. In the full grammar implementation, it's hard to predict if this is necessary, even sufficent. I have only the grammar above with which to test.
I used some semantic predicates to strip parentheses (avoids the effort of writing an additional parser).
For testing purposes only, I added SQL-style line comments so that I could test many different inputs quickly.
The following SQL statements were tested, showing no ambiguity.
select 1
select (1)
select ((select 1))
select ((select (abc)))
select abc from ((select 1 from (select((select(1))))))
(select 1 from (select((select(1)))))
((select (xyz) from (select (((((foo))))) from tableX)))
select a from (select x from xyz)
union
select b from abc
select a from ((select x from xyz ))
intersect
((select b from foo))
select a from (select x from xyz )
intersect
(select b from foo)
The grammar is as follows:
grammar Subselect;
options { caseInsensitive=true; }
#header
{
import java.util.*;
}
#parser::members
{
String stripParens(String phrase)
{
String temp1 = phrase.substring[1];
temp2 = temp1.substring(0, s.length()-1);
return temp2;
}
}
statement: query_statement EOF;
query_statement
: query_expr # simple
| query_statement set_op query_statement # set
;
query_expr
: with_clause?
( select | '(' query_statement ')' )
limit_clause?
| Atom
;
select
: select_clause
(from_clause
where_clause?)?
;
with_clause: 'WITH' expr 'AS (' select ')';
select_clause: 'SELECT' expr (',' expr)*;
from_clause: 'FROM' expr;
where_clause: 'WHERE' expr;
limit_clause: 'LIMIT' expr;
set_op: 'UNION'|'INTERSECT'|'EXCEPT';
lrpExpr
: {stripParens(_input.LT[1].getText())}? query_expr
;
expr
: '(' lrpExpr ')' # parenExpr
| Atom # identifier
;
//---------------------------------------------
Atom: [a-z_0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
LineComment : '--' ~[\r\n]* -> skip ;
I'm not including images of parse trees in this edit to conserve space. However, from the inputs I tested, lrpExpr, being a separate rule, would give e.g. a Visitor class to evaluate what is inside the parentheses before moving further down the parse tree, so order of evaluation e.g. mathematical operator precedence could still be honored.
All still fast and with zero ambiguity.
I hope this suits your needs better.
Attribution: I used this answer as a starting point for the Java code for the semantic predicate.

Yacc conflict i cant fix

i've been trying to fix a shift/reduce conflict in my yacc specification and i can't seem to find where it is.
%union{
char* valueBase;
char* correspondencia;
}
%token pal palT palC
%type <valueBase> pal
%type <correspondencia> palT palC Smth
%%
Dicionario : Traducao
| Dicionario Traducao
;
Traducao : Palavra Correspondencia
;
Palavra : Base Delim
| Exp
;
Delim :
| ':'
;
Correspondencia :
| palC {printf("PT Tradução: %s\n",$1);}
;
Exp : Smth '-' Smth {aux = yylval.valueBase; printf("PT Tradução: %s %s %s\n", $1, aux, $3);}
;
Smth : palT {$$ = strdup($1);}
| {$$ = "";}
;
Base : pal {printf("EN Palavra base: %s\n",$1);}
;
Any help to find and fix this conflict would be extremely appreciated.
So looking at the y.output file from your grammar, you have a shift/reduce conflict in state 13:
State 13
10 Exp: Smth '-' . Smth
palT shift, and go to state 2
palT [reduce using rule 12 (Smth)]
$default reduce using rule 12 (Smth)
Smth go to state 16
Basically, what this is saying is that when parsing an Exp after having seen a Smth '-' and looking at a lookahead of palT, it doesn't know whether it should reduce an empty Smth to finish the Exp (leaving the palT as part of some later construct) OR shift the palT so it can then be reduced (recognized) as a Smth that completes this Exp.
The language you are recognizing is a sequence of one or more Traducao, each of which consists of a Palavra followed by an optional palC (Correspondencia that may be a palC or empty). That means that you might have a Palavra directly following another Palavra (the Correspondencia for the first one is empty). So the parser needs to find the boundary between one Palavra and the next just by looking at its current state and one token of lookahead, which is a problem.
In particular, when you have an input like PalT '-' PalT '-' PalT, that is two consecutive Palavra, but it is not clear whether the middle PalT belongs to the first one or the second. It is ambiguous, because it could be parsed successfully either way.
If you want the parser to just accept as much as possible into the first Palavra, then you can just accept the default resolution (of shift). If that is wrong and you would want the other interpretation, then you are going to need more lookahead to recognize this case, as it depends on whether or not there is a second '-' after the second palT or something else.

how to resolve an ambiguity

I have a grammar:
grammar Test;
s : ID OP (NUMBER | ID);
ID : [a-z]+ ;
NUMBER : '.'? [0-9]+ ;
OP : '/.' | '/' ;
WS : [ \t\r\n]+ -> skip ;
An expression like x/.123 can either be parsed as (s x /. 123), or as (s x / .123). With the grammar above I get the first variant.
Is there a way to get both parse trees? Is there a way to control how it is parsed? Say, if there is a number after the /. then I emit the / otherwise I emit /. in the tree.
I am new to ANTLR.
An expression like x/.123 can either be parsed as (s x /. 123), or as (s x / .123)
I'm not sure. In the ReplaceAll page(*), Possible Issues paragraph, it is said that "Periods bind to numbers more strongly than to slash", so that /.123 will always be interpreted as a division operation by the number .123. Next it is said that to avoid this issue, a space must be inserted in the input between the /. operator and the number, if you want it to be understood as a replacement.
So there is only one possible parse tree (otherwise how could the Wolfram parser decide how to interpret the statement ?).
ANTLR4 lexer and parser are greedy. It means that the lexer (parser) tries to read as much input characters (tokens) that it can while matching a rule. With your OP rule OP : '/.' | '/' ; the lexer will always match the input /. to the /. alternative (even if the rule is OP : '/' | '/.' ;). This means there is no ambiguity and you have no chance the input to be interpreted as OP=/ and NUMBER=.123.
Given my small experience with ANTLR, I have found no other solution than to split the ReplaceAll operator into two tokens.
Grammar Question.g4 :
grammar Question;
/* Parse Wolfram ReplaceAll. */
question
#init {System.out.println("Question last update 0851");}
: s+ EOF
;
s : division
| replace_all
;
division
: expr '/' NUMBER
{System.out.println("found division " + $expr.text + " by " + $NUMBER.text);}
;
replace_all
: expr '/' '.' replacement
{System.out.println("found ReplaceAll " + $expr.text + " with " + $replacement.text);}
;
expr
: ID
| '"' ID '"'
| NUMBER
| '{' expr ( ',' expr )* '}'
;
replacement
: expr '->' expr
| '{' replacement ( ',' replacement )* '}'
;
ID : [a-z]+ ;
NUMBER : '.'? [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
Input file t.text :
x/.123
x/.x -> 1
{x, y}/.{x -> 1, y -> 2}
{0, 1}/.0 -> "zero"
{0, 1}/. 0 -> "zero"
Execution :
$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[#0,0:0='x',<ID>,1:0]
[#1,1:1='/',<'/'>,1:1]
[#2,2:5='.123',<NUMBER>,1:2]
[#3,7:7='x',<ID>,2:0]
[#4,8:8='/',<'/'>,2:1]
[#5,9:9='.',<'.'>,2:2]
[#6,10:10='x',<ID>,2:3]
[#7,12:13='->',<'->'>,2:5]
[#8,15:15='1',<NUMBER>,2:8]
[#9,17:17='{',<'{'>,3:0]
...
[#29,47:47='}',<'}'>,4:5]
[#30,48:48='/',<'/'>,4:6]
[#31,49:50='.0',<NUMBER>,4:7]
...
[#40,67:67='}',<'}'>,5:5]
[#41,68:68='/',<'/'>,5:6]
[#42,69:69='.',<'.'>,5:7]
[#43,71:71='0',<NUMBER>,5:9]
...
[#48,83:82='<EOF>',<EOF>,6:0]
Question last update 0851
found division x by .123
found ReplaceAll x with x->1
found ReplaceAll {x,y} with {x->1,y->2}
found division {0,1} by .0
line 4:10 extraneous input '->' expecting {<EOF>, '"', '{', ID, NUMBER}
found ReplaceAll {0,1} with 0->"zero"
The input x/.123 is ambiguous until the slash. Then the parser has two choices : / NUMBER in the division rule or / . expr in the replace_all rule. I think that NUMBER absorbs the input and so there is no more ambiguity.
(*) the link was yesterday in a comment that has disappeared, i.e. Wolfram Language & System, ReplaceAll

PetitParser evaluator not working properly

when I try running this code on pharo, my answers are somewhat off. I try evaluating 1-2+3 but for some reason, it does 1- (2+3) and I do not understand why this is so. Thanks for your time.
number := #digit asParser plus token trim ==> [ :token | token inputValue asNumber ].
term := PPUnresolvedParser new.
prod := PPUnresolvedParser new.
term2 := PPUnresolvedParser new.
prod2 := PPUnresolvedParser new.
prim := PPUnresolvedParser new.
term def: (prod , $+ asParser trim , term ==> [ :nodes | nodes first + nodes last ]) / term2.
term2 def: (prod , $- asParser trim , term ==> [ :nodes | nodes first - nodes last ])/ prod.
prod def: (prim , $* asParser trim , prod ==> [ :nodes | nodes first * nodes last ])/ prim.
prod2 def: (prim , $/ asParser trim , prod ==> [ :nodes | nodes first / nodes last ])/ prim.
prim def: ($( asParser trim , term , $) asParser trim ==> [ :nodes | nodes second ]) / number.
start := term end.
start parse: '1 - 2 + 3'
Consider the definition of term
term
def: prod , $+ asParser trim , term
==> [:nodes | nodes first + nodes last]
/ term2.
The / term2 part is an OR between
prod , $+ asParser trim, term ==> [something]
and
term2
Let's mentally parse '1 - 2 + 3' according to term.
We first read $1 and have to decide between the two options above. The first one will fail unless prod consumes '1 - 2'. But this is impossible because
prod
def: prim , $* asParser trim , prod
==> [:nodes | nodes first * nodes last]
/ prim.
prim
def: $( asParser trim , term , $) asParser trim
==> [:nodes | nodes second]
/ number
and there is no $* or $( coming, and '1 - 2' is not parsed by #number.
So, we try with term2, which is defined in a similar way
term2
def: prod , $- asParser trim , term
==> [:nodes | nodes first - nodes last]
/ prod.
so now we have a OR between two options.
As above, the first option fails, so we next try prod, then prim and finally number. The number parser begins with
#digit asParser plus
which consumes the digit $1 and produces the integer 1.
We are now back in term2. We consume ' - ' and are left with '2 + 3'; which according to term produces 5. So we get 1 - 5 = -4.
SHORT EXPLANATION
For term to parse as (1 - 2) + 3, prod should consume 1 - 2, which doesn't because prod involves no $-.

bison shift/reduce problem moving add op into a subexpr

Originally in the example there was this
expr:
INTEGER
| expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
;
I wanted it to be 'more simple' so i wrote this (i realize it would do '+' for both add and subtract. But this is an example)
expr:
INTEGER
| expr addOp expr { $$ = $1 + $3; }
;
addOp:
'+' { $$ = $1; }
| '-' { $$ = $1; }
;
Now i get a shift/reduce error. It should be exactly the same -_- (to me). What do i need to do to fix this?
edit: To make things clear. The first has NO warning/error. I use %left to set the precedence (and i will use %right for = and those other right ops). However it seems to not apply when going into sub expressions.
Are you sure the conflicts involve just those two rules? The first one should have more conflicts than the second. At least with one symbol of look-ahead the decision to shift to a state with addOp on the stack is easier the second time around.
Update (I believe I can prove my theory... :-):
$ cat q2.y
%% expr: '1' | expr '+' expr | expr '-' expr;
$ cat q3.y
%% expr: '1' | expr addOp expr;
addOp: '+' | '-';
$ yacc q2.y
conflicts: 4 shift/reduce
$ yacc q3.y
conflicts: 2 shift/reduce
Having said all that, it's normal for yacc grammars to have ambiguities, and any real-life system is likely to have not just a few but literally dozens of shift/reduce conflicts. By definition, this conflict occurs when there is a perfectly valid shift available, so if you don't mind the parser taking that shift, then don't worry about it.
Now, in yacc you should prefer left-recursive rules. You can achieve that and get rid of your grammar ambiguity with:
$ cat q4.y
%% expr: expr addOp '1' | '1';
addOp: '+' | '-';
$ yacc q4.y
$
Note: no conflicts in the example above. If you like your grammar the way it is, just do:
%expect 2
%% expr: '1' | expr addOp expr;
addOp: '+' | '-';
The problem is that the rule
expr: expr addOp expr { ..action.. }
has no precedence. Normally rules get the precedence of the first token on the RHS, but this rule has no tokens on its RHS. You need to add a %prec directive to it:
expr: expr addOp expr %prec '+' { ..action.. }
to explicitly give the rule a precedence.
Note that doing this doesn't get rid of the shift/reduce conflict, which was present in your original grammar. It just resolves it according to the precedence rules you specify, which means that bison won't give you a message about it. In general, using precedence to resolve conflicts can be tricky, since it can hide conflicts that you might have wanted to resolve differently, or might be unresolvable in your grammar as written.
Also see my answer to this question