Shift reduce conflict - grammar

I'm having trouble understanding a shift/reduce conflict in a grammar that I know is not ambiguous. The case is of the if/else type, but it's not the 'dangling else' problem, since I have mandatory END clauses delimiting code blocks.
Here is the grammar for gppg (it's a Bison-like compiler compiler ... and that was not an echo):
%output=program.cs
%start program
%token FOR
%token END
%token THINGS
%token WHILE
%token SET
%token IF
%token ELSEIF
%token ELSE
%%
program : statements
;
statements : /*empty */
| statements stmt
;
stmt : flow
| THINGS
;
flow : '#' IF '(' ')' statements else
;
else : '#' END
| '#' ELSE statements '#' END
| elseifs
;
elseifs : elseifs '#' ELSEIF statements else
| '#' ELSEIF statements else
;
Here is the conflict output:
// Parser Conflict Information for grammar file "program.y"
Shift/Reduce conflict on symbol "'#'", parser will shift
Reduce 10: else -> elseifs
Shift "'#'": State-22 -> State-23
Items for From-state State 22
10 else: elseifs .
-lookahead: '#', THINGS, EOF
11 elseifs: elseifs . '#' ELSEIF statements else
Items for Next-state State 23
11 elseifs: elseifs '#' . ELSEIF statements else
// End conflict information for parser
I have already switched everything around, and I do know how to resolve it, but that solution involves giving up the left recursion on 'elseif' for a right recursion.
I've been through all the scarce documentation I have found on the internet regarding this issue (I post some links at the end) and still have not found an elegant solution. I know about ANTLR and I don't want to consider it right now. Please limit your solution to Yacc/Bison parsers.
I would appreciate elegant solutions. I managed to do it by eliminating the /* empty */ rules and duplicating everything that needed an empty list, but in the larger grammar I'm working on it just ends up like 'spaghetti grammar syndrome'.
Here are some links:
http://nitsan.org/~maratb/cs164/bison.html
http://compilers.iecc.com/comparch/article/98-01-079
GPPG, the parser I'm using
Bison manual

Your revised ELSEIF rule has no markers for a condition -- it should nominally have '(' and ')' added.
More seriously, you now have a rule for
elsebody : else
| elseifs else
;
and
elseifs : /* Nothing */
| elseifs ...something...
;
The 'nothing' is not needed; it is implicitly taken care of by the 'elsebody' without the 'elseifs'.
I would be very inclined to use rules 'opt_elseifs', 'opt_else', and 'end':
flow : '#' IF '(' ')' statements opt_elseifs opt_else end
;
opt_elseifs : /* Nothing */
| opt_elseifs '#' ELSEIF '(' ')' statements
;
opt_else : /* Nothing */
| '#' ELSE statements
;
end : '#' END
;
I've not run this through a parser generator, but I find this relatively easy to understand.

I think the problem is in the elseifs clause.
elseifs : elseifs '#' ELSEIF statements else
| '#' ELSEIF statements else
;
I think the first version is not required, since the else clause refers back to elseifs anyway:
else : '#' END
| '#' ELSE statements '#' END
| elseifs
;
What happens if you change elseifs?:
elseifs : '#' ELSEIF statements else
;

The answer from Jonathan above seems like it would be the best, but since it's not working for you, I have a few suggestions you could try that will help you debug the error.
Firstly, have you considered making the hash/sharp symbol part of the tokens themselves (i.e. #END, #IF, etc.)? The lexer would then absorb it, so it would not have to appear in the parser rules.
Secondly, I would urge you to rewrite the rules without duplicating any token streams (part of the Don't Repeat Yourself principle). So the rule " '#' ELSEIF statements else " should only exist in one place in that file (not two as you have above).
Lastly, I suggest that you look into precedence and associativity of the IF/ELSEIF/ELSE tokens. I know that you should be able to write a parser that doesn't require this, but it might be the thing that you need in this case.
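The first suggestion (fusing '#' with the keyword in the lexer) can be sketched as a small tokenizer. This is only an illustration in Python, not gppg/gplex code, and the token names HASH_IF, HASH_ELSEIF, HASH_ELSE and HASH_END are made up for the example. The point is that a single token of lookahead then distinguishes a nested #IF from #ELSEIF, #ELSE or #END, which a bare '#' cannot do.

```python
import re

# Hypothetical token names: the '#' is fused with the keyword that follows it,
# so the parser never sees a bare '#' token.
TOKEN_SPEC = [
    ("HASH_ELSEIF", r"#\s*ELSEIF\b"),
    ("HASH_ELSE",   r"#\s*ELSE\b"),
    ("HASH_END",    r"#\s*END\b"),
    ("HASH_IF",     r"#\s*IF\b"),
    ("THINGS",      r"THINGS\b"),
    ("LPAREN",      r"\("),
    ("RPAREN",      r"\)"),
    ("SKIP",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return the list of token names for `text`, dropping whitespace."""
    pos = 0
    tokens = []
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            tokens.append(match.lastgroup)
        pos = match.end()
    return tokens


print(tokenize("#IF () THINGS #ELSEIF THINGS #ELSE #END"))
```

With tokens like these, the grammar rules would refer to HASH_IF, HASH_ELSEIF and so on instead of '#' IF and '#' ELSEIF. Whether the remaining conflicts disappear still depends on how the rules themselves are written.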

I'm still switching things around, and my original question had some errors, since the elseifs sequence always had an else at the end, which was wrong. Here is another take at the question; this time I get two shift/reduce conflicts:
flow : '#' IF '(' ')' statements elsebody
;
elsebody : else
| elseifs else
;
else : '#' ELSE statements '#' END
| '#' END
;
elseifs : /* empty */
| elseifs '#' ELSEIF statements
;
The conflicts now are:
// Parser Conflict Information for grammar file "program.y"
Shift/Reduce conflict on symbol "'#'", parser will shift
Reduce 12: elseifs -> /* empty */
Shift "'#'": State-10 -> State-13
Items for From-state State 10
7 flow: '#' IF '(' ')' statements . elsebody
4 statements: statements . stmt
Items for Next-state State 13
10 else: '#' . ELSE statements '#' END
11 else: '#' . END
7 flow: '#' . IF '(' ')' statements elsebody
Shift/Reduce conflict on symbol "'#'", parser will shift
Reduce 13: elseifs -> elseifs, '#', ELSEIF, statements
Shift "'#'": State-24 -> State-6
Items for From-state State 24
13 elseifs: elseifs '#' ELSEIF statements .
-lookahead: '#'
4 statements: statements . stmt
Items for Next-state State 6
7 flow: '#' . IF '(' ')' statements elsebody
// End conflict information for parser
Empty rules just aggravate gppg, I'm afraid. But they seem so natural to use that I keep trying them.
I already know right recursion solves the problem, as 1800 INFORMATION has said. But I'm looking for a solution with left recursion on the elseifs clause.

elsebody : elseifs else
| elseifs
;
elseifs : /* empty */
| elseifs '#' ELSEIF statements
;
else : '#' ELSE statements '#' END
;
I think this should left recurse and always terminate.

OK - here is a grammar (not minimal) for if blocks. I dug it out of some code I have (called adhoc, based on hoc from Kernighan & Pike's "The UNIX Programming Environment"). This outline grammar compiles with Yacc with no conflicts.
%token NUMBER IF ELSE
%token ELIF END
%token THEN
%start program
%%
program
: stmtlist
;
stmtlist
: /* Nothing */
| stmtlist stmt
;
stmt
: ifstmt
;
ifstmt
: ifcond endif
| ifcond else begin
| ifcond eliflist begin
;
ifcond
: ifstart cond then stmtlist
;
ifstart
: IF
;
cond
: '(' expr ')'
;
then
: /* Nothing */
| THEN
;
endif
: END IF begin
;
else
: ELSE stmtlist END IF
;
eliflist
: elifblock
| elifcond eliflist begin /* RIGHT RECURSION */
;
elifblock
: elifcond else begin
| elifcond endif
;
elifcond
: elif cond then stmtlist end
;
elif
: ELIF
;
begin
: /* Nothing */
;
end
: /* Nothing */
;
expr
: NUMBER
;
%%
I used 'NUMBER' as the dummy element, instead of THINGS, and I used ELIF instead of ELSEIF. It includes a THEN, but that is optional. The 'begin' and 'end' operations were used to grab the program counter in the generated program - and therefore should be removable from this without affecting it.
There was a reason I thought I needed to use right recursion instead of the normal left recursion - but I think it was to do with the code generation strategy I was using, rather than anything else. The question mark in the comment was in the original; I remember not being happy with it. The program as a whole does work - it is a project that's been on the back burner for the last decade or so (hmmm...I did some work at the end of 2004 and beginning of 2005; prior to that, it was 1992 and 1993).
I've not spent the time working out why this compiles conflict-free and what I outlined earlier does not. I hope it helps.

Related

How to disambiguate a subselect from a parenthesized expression?

I have the following expression notation:
expr
: OpenParen expr (Comma expr)* Comma? CloseParen # parenExpr
| OpenParen simpleSelect CloseParen # subSelectExpr
Unfortunately, a simpleSelect can also have a parenthetical around it, and so the following statement becomes ambiguous:
select ((select 1))
Here is the current grammar that I have, simplified down to only showing the issue:
grammar Subselect;
options { caseInsensitive=true; }
statement: query_statement EOF;
query_statement
: query_expr # simple
| query_statement set_op query_statement # set
;
query_expr
: with_clause?
( select | '(' query_statement ')' )
limit_clause?
;
select
: select_clause
(from_clause
where_clause?)?
;
with_clause: 'WITH' expr 'AS (' select ')';
select_clause: 'SELECT' expr (',' expr)*;
from_clause: 'FROM' expr;
where_clause: 'WHERE' expr;
limit_clause: 'LIMIT' expr;
set_op: 'UNION'|'INTERSECT'|'EXCEPT';
expr
: '(' expr ')' # parenExpr
| '(' query_expr ')' # subSelect
| Atom # identifier
;
Atom: [a-z_0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
And on the parse of select ((select 1)), here is the output:
What would be a possible way to disambiguate this?
I suppose the main thing is here:
'(' query_statement ')'
Since that recursively calls itself -- is there a way to do indirection or something else such that a query_statement called from within parens can never itself have parens?
Also, maybe this is a common thing? I get the same ambiguous output when running this on the official MySQL grammar here:
I would be curious whether any of the grammars can solve the issue here: https://github.com/antlr/grammars-v4/tree/master/sql. Maybe the best approach is just to remove duplicate parens before parsing the text? (If so, are there are good tools to do that, or do I need to write an additional antlr parser just to do that?)
Your input generates this parse tree:
That's a reasonable interpretation of your input and it is identified as a subSelect expr. It's a subSelect nested in a parenExpr (both of which are exprs).
If I switch up your rule a bit:
expr: '(' query_expr ')' # subSelect
| '(' expr ')' # parenExpr
| Atom # identifier
;
Now it's a subSelect that interprets the nested (select 1) as a query expression.
It's ambiguous because the outer parenthesized expression could match either of the first two alternatives resulting in different parse trees.
In ANTLR, ambiguities in alternatives are resolved by "using" the first alternative that matches. In this way ANTLR has deterministic behavior where you can control which interpretation is used (with alternative order). It's not uncommon for ANTLR grammars to have ambiguities like this.
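The effect of alternative order can be imitated with a small ordered-choice recognizer. This is only a rough analogy in Python with hypothetical function names: ANTLR 4 uses adaptive LL(*) prediction rather than backtracking like this, and the grammar is cut down to the parenthesis handling. It shows the same input producing two different trees depending on which alternative is tried first.

```python
def parse_expr(toks, i, order):
    """expr: '(' expr ')' | '(' query ')' | atom, tried in the given order.

    Returns (tree, next_index) on success or None on failure.
    """
    for alt in order:
        if alt == "parenExpr" and i < len(toks) and toks[i] == "(":
            r = parse_expr(toks, i + 1, order)
            if r and r[1] < len(toks) and toks[r[1]] == ")":
                return ("parenExpr", r[0]), r[1] + 1
        elif alt == "subSelect" and i < len(toks) and toks[i] == "(":
            r = parse_query(toks, i + 1, order)
            if r and r[1] < len(toks) and toks[r[1]] == ")":
                return ("subSelect", r[0]), r[1] + 1
        elif alt == "atom" and i < len(toks) and toks[i] not in ("(", ")", "select"):
            return ("atom", toks[i]), i + 1
    return None


def parse_query(toks, i, order):
    """query: 'select' expr | '(' query ')' (a stand-in for query_expr)."""
    if i < len(toks) and toks[i] == "select":
        r = parse_expr(toks, i + 1, order)
        if r:
            return ("select", r[0]), r[1]
    elif i < len(toks) and toks[i] == "(":
        r = parse_query(toks, i + 1, order)
        if r and r[1] < len(toks) and toks[r[1]] == ")":
            return ("parenQuery", r[0]), r[1] + 1
    return None


# The expression part of: select ((select 1))
tokens = ["(", "(", "select", "1", ")", ")"]

print(parse_expr(tokens, 0, ["parenExpr", "subSelect", "atom"])[0])
# ('parenExpr', ('subSelect', ('select', ('atom', '1'))))

print(parse_expr(tokens, 0, ["subSelect", "parenExpr", "atom"])[0])
# ('subSelect', ('parenQuery', ('select', ('atom', '1'))))
```

Both results consume all six tokens, so neither order is "wrong"; the order only decides which of the two valid trees is produced.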
IMHO, the IntelliJ plugin has caused many people to stumble over this as an indication that something is "wrong" with the grammar. There's a reason that ANTLR itself does not report an error in this situation. It has defined, deterministic behavior.
So far as "resolving" this ambiguity: the simple fact that the syntax uses parentheses to indicate two different "things" indicates that it is inherently ambiguous, so I don't believe you can "fix" the grammar ambiguity without modifying the syntax. (I might be wrong about this, and would find it interesting if someone provides a refactoring that manages to remove the ambiguity.)
EDIT:
After trying an earlier solution that proved incorrect with some additional test data, I've tried a different approach.
I added Atom as a viable alternative for query_expr since the Atom '1' is being offered as test data. In the full grammar implementation, it's hard to predict whether this is necessary, or even sufficient. I have only the grammar above with which to test.
I used some semantic predicates to strip parentheses (avoids the effort of writing an additional parser).
For testing purposes only, I added SQL-style line comments so that I could test many different inputs quickly.
The following SQL statements were tested, showing no ambiguity.
select 1
select (1)
select ((select 1))
select ((select (abc)))
select abc from ((select 1 from (select((select(1))))))
(select 1 from (select((select(1)))))
((select (xyz) from (select (((((foo))))) from tableX)))
select a from (select x from xyz)
union
select b from abc
select a from ((select x from xyz ))
intersect
((select b from foo))
select a from (select x from xyz )
intersect
(select b from foo)
The grammar is as follows:
grammar Subselect;
options { caseInsensitive=true; }
@header
{
import java.util.*;
}
@parser::members
{
String stripParens(String phrase)
{
String temp1 = phrase.substring(1);
String temp2 = temp1.substring(0, temp1.length() - 1);
return temp2;
}
}
statement: query_statement EOF;
query_statement
: query_expr # simple
| query_statement set_op query_statement # set
;
query_expr
: with_clause?
( select | '(' query_statement ')' )
limit_clause?
| Atom
;
select
: select_clause
(from_clause
where_clause?)?
;
with_clause: 'WITH' expr 'AS (' select ')';
select_clause: 'SELECT' expr (',' expr)*;
from_clause: 'FROM' expr;
where_clause: 'WHERE' expr;
limit_clause: 'LIMIT' expr;
set_op: 'UNION'|'INTERSECT'|'EXCEPT';
lrpExpr
: {stripParens(_input.LT(1).getText())}? query_expr
;
expr
: '(' lrpExpr ')' # parenExpr
| Atom # identifier
;
//---------------------------------------------
Atom: [a-z_0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
LineComment : '--' ~[\r\n]* -> skip ;
I'm not including images of parse trees in this edit to conserve space. However, from the inputs I tested, lrpExpr, being a separate rule, would give e.g. a Visitor class to evaluate what is inside the parentheses before moving further down the parse tree, so order of evaluation e.g. mathematical operator precedence could still be honored.
All still fast and with zero ambiguity.
I hope this suits your needs better.
Attribution: I used this answer as a starting point for the Java code for the semantic predicate.

The parser didn't consume all tokens. Is it a bug? [duplicate]

This question already has answers here:
How to force ANTLR to parse all input CharStream
(2 answers)
Closed 4 years ago.
env: antlr 4.7.1
The grammar is:
grammar Whilelang;
program : seqStatement;
seqStatement: statement (';' statement)* ;
statement: ID ':=' expression # attrib
| 'print' Text # print
| '{' seqStatement '}' # block
;
expression: INT # int
| ID # id
| expression ('+'|'-') expression # binOp
| '(' expression ')' # expParen
;
bool: ('true'|'false') # boolean
| expression '=' expression # relOp
| expression '<=' expression # relOp
| 'not' bool # not
| bool 'and' bool # and
| '(' bool ')' # boolParen
;
INT: ('0'..'9')+ ;
ID: ('a'..'z')+;
Text: '"' .*? '"';
Space: [ \t\n\r] -> skip;
The input is:
a := 1
b := 2
According to the grammar, ANTLR 4 should report an error such as "expected ';' at line 1" for the above input. But in fact no error is reported. It seems the parser accepts only part of the input and doesn't consume all input tokens.
Is it a bug in ANTLR 4?
$ grun Whilelang program -trace
a := 1
b := 2
^d
enter program, LT(1)=a
enter seqStatement, LT(1)=a
enter statement, LT(1)=a
consume [#0,0:0='a',<17>,1:0] rule statement
consume [#1,2:3=':=',<2>,1:2] rule statement
enter expression, LT(1)=1
consume [#2,5:5='1',<16>,1:5] rule expression
exit expression, LT(1)=b
exit statement, LT(1)=b
exit seqStatement, LT(1)=b
exit program, LT(1)=b
Not a bug. ANTLR is doing exactly what it was asked to do.
Given the rules
program : seqStatement;
seqStatement: statement (';' statement)* ;
the program rule is complete as soon as at least one statement has been matched. Since the parser cannot validly match another statement (further statements are optional per the grammar), it stops.
Changing to
program : seqStatement EOF;
requires the program rule to match statements until it can also match an EOF token (the lexer automatically adds an EOF at the end of the source text). This is likely the behavior you are looking for.
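The difference can be mimicked with a hand-written recognizer. This is a minimal sketch in Python, not ANTLR-generated code: the function names, the "<EOF>" sentinel and the cut-down statement rule (ID ':=' INT only) are assumptions made for the illustration.

```python
def parse_statement(tokens, pos):
    """statement: ID ':=' INT (a cut-down stand-in for the real rule)."""
    window = tokens[pos:pos + 3]
    if (len(window) == 3 and window[0].isalpha()
            and window[1] == ":=" and window[2].isdigit()):
        return pos + 3
    raise SyntaxError(f"expected a statement at token {pos}")


def parse_seq_statement(tokens, pos):
    """seqStatement: statement (';' statement)*"""
    pos = parse_statement(tokens, pos)
    while tokens[pos] == ";":
        pos = parse_statement(tokens, pos + 1)
    return pos


def parse_program(tokens, require_eof):
    """program: seqStatement        (require_eof=False)
       program: seqStatement EOF    (require_eof=True)
    Returns the number of tokens consumed before EOF."""
    pos = parse_seq_statement(tokens, 0)
    if require_eof and tokens[pos] != "<EOF>":
        raise SyntaxError(f"expected ';' or end of input, got {tokens[pos]!r}")
    return pos


# The input from the question, with the missing ';' between the statements.
tokens = "a := 1 b := 2".split() + ["<EOF>"]

print(parse_program(tokens, require_eof=False))  # 3: stops quietly before 'b'

try:
    parse_program(tokens, require_eof=True)
except SyntaxError as err:
    print("error:", err)
```

Without the EOF check, the first call succeeds after three tokens and simply leaves `b := 2` unread, which matches the trace in the question. With the check, the leftover `b` is reported.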

Reduce/reduce conflict in C-like grammar in Jison

I'm working on a C-like language compiler using the Jison package. It went really well until I introduced classes, which means Type can now be a LITERAL. Here is a simplified grammar:
%lex
%%
\s+ /* skip whitespace */
int return 'INTEGER'
string return 'STRING'
boolean return 'BOOLEAN'
void return 'VOID'
[0-9]+ return 'NUMBER'
[a-zA-Z_][0-9a-zA-Z_]* return 'LITERAL'
"--" return 'DECR'
<<EOF>> return 'EOF'
"=" return '='
";" return ';'
/lex
%%
Program
: EOF
| Stmt EOF
;
Stmt
: Type Ident ';'
| Ident '=' NUMBER ';'
;
Type
: INTEGER
| STRING
| BOOLEAN
| LITERAL
| VOID
;
Ident
: LITERAL
;
And the jison conflict:
Conflict in grammar: multiple actions possible when lookahead token is LITERAL in state 10
- reduce by rule: Ident -> LITERAL
- reduce by rule: Type -> LITERAL
Conflict in grammar: multiple actions possible when lookahead token is = in state 10
- reduce by rule: Ident -> LITERAL
- reduce by rule: Type -> LITERAL
States with conflicts:
State 10
Type -> LITERAL . #lookaheads= LITERAL =
Ident -> LITERAL . #lookaheads= LITERAL =
I've found quite a similar question that has not been answered. Does anyone have any clue how to solve this?
That's evidently a bug in jison, since the grammar is certainly LALR(1), and is handled without problems by bison. Apparently, jison is incorrectly computing the lookahead for the state in which the conflict occurs. (Update: It seems to be bug 205, reported in January 2014.)
If you ask jison to produce an LR(1) parser instead of an LALR(1) parser, then it correctly computes the lookaheads and the grammar passes without warnings. However, I don't think that is a sustainable solution.
Here's another work-around. The Decl and Assign productions are not necessary; the "fix" was to remove LITERAL from Type and add a separate production for it.
Program
: EOF
| Stmt EOF
;
Decl
: Type Ident ';'
| LITERAL Ident ';'
;
Assign
: Ident '=' NUMBER ';'
;
Stmt
: Decl
| Assign
;
Type
: INTEGER
| STRING
| BOOLEAN
| VOID
;
Ident
: LITERAL
;
You might want to consider recognizing more than one statement:
Program
: EOF
| Stmts EOF
;
Stmts
: Stmt
| Stmts Stmt
;

Left-factoring grammar of coffeescript expressions

I'm writing an Antlr/Xtext parser for the CoffeeScript grammar. It's still at an early stage: I have only moved over a subset of the original grammar, and I am stuck with expressions. It's the dreaded "rule expression has non-LL(*) decision" error. I found some related questions here, Help with left factoring a grammar to remove left recursion and ANTLR Grammar for expressions. I also tried How to remove global backtracking from your grammar, but it just demonstrates a very simple case which I cannot use in real life. The post about ANTLR Grammar Tip: LL(*) and Left Factoring gave me more insights, but I still can't get a handle on it.
My question is how to fix the following grammar (sorry, I couldn't simplify it and still keep the error). I guess the troublemaker is the term rule, so I'd appreciate a local fix to it, rather than changing the whole thing (I'm trying to stay close to the rules of the original grammar). Pointers to tips on how to "debug" this kind of erroneous grammar in your head are also welcome.
grammar CoffeeScript;
options {
output=AST;
}
tokens {
AT_SIGIL; BOOL; BOUND_FUNC_ARROW; BY; CALL_END; CALL_START; CATCH; CLASS; COLON; COLON_SLASH; COMMA; COMPARE; COMPOUND_ASSIGN; DOT; DOT_DOT; DOUBLE_COLON; ELLIPSIS; ELSE; EQUAL; EXTENDS; FINALLY; FOR; FORIN; FOROF; FUNC_ARROW; FUNC_EXIST; HERECOMMENT; IDENTIFIER; IF; INDENT; INDEX_END; INDEX_PROTO; INDEX_SOAK; INDEX_START; JS; LBRACKET; LCURLY; LEADING_WHEN; LOGIC; LOOP; LPAREN; MATH; MINUS; MINUS; MINUS_MINUS; NEW; NUMBER; OUTDENT; OWN; PARAM_END; PARAM_START; PLUS; PLUS_PLUS; POST_IF; QUESTION; QUESTION_DOT; RBRACKET; RCURLY; REGEX; RELATION; RETURN; RPAREN; SHIFT; STATEMENT; STRING; SUPER; SWITCH; TERMINATOR; THEN; THIS; THROW; TRY; UNARY; UNTIL; WHEN; WHILE;
}
COMPARE : '<' | '==' | '>';
COMPOUND_ASSIGN : '+=' | '-=';
EQUAL : '=';
LOGIC : '&&' | '||';
LPAREN : '(';
MATH : '*' | '/';
MINUS : '-';
MINUS_MINUS : '--';
NEW : 'new';
NUMBER : ('0'..'9')+;
PLUS : '+';
PLUS_PLUS : '++';
QUESTION : '?';
RELATION : 'in' | 'of' | 'instanceof';
RPAREN : ')';
SHIFT : '<<' | '>>';
STRING : '"' (('a'..'z') | ' ')* '"';
TERMINATOR : '\n';
UNARY : '!' | '~' | NEW;
// Put it at the end, so keywords will be matched earlier
IDENTIFIER : ('a'..'z' | 'A'..'Z')+;
WS : (' ')+ {skip();} ;
root
: body
;
body
: line
;
line
: expression
;
assign
: assignable EQUAL expression
;
expression
: value
| assign
| operation
;
identifier
: IDENTIFIER
;
simpleAssignable
: identifier
;
assignable
: simpleAssignable
;
value
: assignable
| literal
| parenthetical
;
literal
: alphaNumeric
;
alphaNumeric
: NUMBER
| STRING;
parenthetical
: LPAREN body RPAREN
;
// term should be the same as expression except operation to avoid left-recursion
term
: value
| assign
;
questionOp
: term QUESTION?
;
mathOp
: questionOp (MATH questionOp)*
;
additiveOp
: mathOp ((PLUS | MINUS) mathOp)*
;
shiftOp
: additiveOp (SHIFT additiveOp)*
;
relationOp
: shiftOp (RELATION shiftOp)*
;
compareOp
: relationOp (COMPARE relationOp)*
;
logicOp
: compareOp (LOGIC compareOp)*
;
operation
: UNARY expression
| MINUS expression
| PLUS expression
| MINUS_MINUS simpleAssignable
| PLUS_PLUS simpleAssignable
| simpleAssignable PLUS_PLUS
| simpleAssignable MINUS_MINUS
| simpleAssignable COMPOUND_ASSIGN expression
| logicOp
;
UPDATE:
The final solution will use Xtext with an external lexer to avoid the intricacies of handling significant whitespace. Here is a snippet from my Xtext version:
CompareOp returns Operation:
AdditiveOp ({CompareOp.left=current} operator=COMPARE right=AdditiveOp)*;
My strategy is to make a working Antlr parser first, without a usable AST. (Whether this is a feasible approach would deserve a separate question.) So I don't care about tokens at the moment; they are included to make development easier.
I am aware that the original grammar is LR. I don't know how close I can stay to it when transforming to LL.
UPDATE2 and SOLUTION:
I could simplify my problem with the insights gained from Bart's answer. Here is a working toy grammar to handle simple expressions with function calls to illustrate it. The comment before expression shows my insight.
grammar FunExp;
ID: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
NUMBER: '0'..'9'+;
WS: (' ')+ {skip();};
root
: expression
;
// atom and functionCall would go here,
// but they are reachable via operation -> term
// so they are omitted here
expression
: operation
;
atom
: NUMBER
| ID
;
functionCall
: ID '(' expression (',' expression)* ')'
;
operation
: additiveOp
;
additiveOp
: multiOp (('+' | '-') multiOp)*
;
multiOp
: term (('*' | '/') term)*
;
term
: atom
| functionCall
| '(' expression ')'
;
When you generate a lexer and parser from your grammar, you see the following error printed to your console:
error(211): CoffeeScript.g:52:3: [fatal] rule expression has non-LL(*) decision due to recursive rule invocations reachable from alts 1,3. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
warning(200): CoffeeScript.g:52:3: Decision can match input such as "{NUMBER, STRING}" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
(I've emphasized the important bits)
This is only the first error, but you start with the first and with a bit of luck, the errors below that first one will also disappear when you fix the first one.
The error posted above means that when you're trying to parse either a NUMBER or a STRING with the parser generated from your grammar, the parser can go two ways when it ends up in the expression rule:
expression
: value // choice 1
| assign // choice 2
| operation // choice 3
;
Namely, choice 1 and choice 3 both can parse a NUMBER or a STRING, as you can see by the "paths" the parser can follow to match these 2 choices:
choice 1:
expression
value
literal
alphaNumeric : {NUMBER, STRING}
choice 3:
expression
operation
logicOp
compareOp
relationOp
shiftOp
additiveOp
mathOp
questionOp
term
value
literal
alphaNumeric : {NUMBER, STRING}
In the last part of the warning, ANTLR informs you that it ignores choice 3 whenever either a NUMBER or a STRING will be parsed, causing choice 1 to match such input (since it is defined before choice 3).
So, either the CoffeeScript grammar is ambiguous in this respect (and handles this ambiguity somehow), or your implementation of it is wrong (I'm guessing the latter :)). You need to fix this ambiguity in your grammar: i.e. don't let the expression's choices 1 and 3 both match the same input.
I noticed 3 other things in your grammar:
1
Take the following lexer rules:
NEW : 'new';
...
UNARY : '!' | '~' | NEW;
Be aware that the token UNARY can never match the text 'new' since the token NEW is defined before it. If you want to let UNARY match this, remove the NEW rule and do:
UNARY : '!' | '~' | 'new';
2
On many occasions, you're collecting multiple types of tokens in a single one, like LOGIC:
LOGIC : '&&' | '||';
and then you use that token in a parser rule like this:
logicOp
: compareOp (LOGIC compareOp)*
;
But if you're going to evaluate such an expression at a later stage, you don't know what this LOGIC token matched ('&&' or '||') and you'll have to inspect the token's inner text to find that out. You'd better do something like this (at least, if you're doing some sort of evaluating at a later stage):
AND : '&&';
OR : '||';
...
logicOp
: compareOp ( AND compareOp // easier to evaluate, you know it's an AND expression
| OR compareOp // easier to evaluate, you know it's an OR expression
)*
;
3
You're skipping white spaces (and no tabs?) with:
WS : (' ')+ {skip();} ;
but doesn't CoffeeScript indent its code blocks with spaces (and tabs) just like Python? But perhaps you're going to do that at a later stage?
I just saw that the grammar you're looking at is a jison grammar (which is more or less a bison implementation in JavaScript). But bison, and therefore jison, generates LR parsers while ANTLR generates LL parsers. So trying to stay close to the rules of the original grammar will only result in more problems.

ANTLR grammar problem with parenthetical expressions

I'm using ANTLRWorks 1.4.2 to create a simple grammar for evaluating a user-provided expression as a boolean result. This will ultimately be part of a larger grammar, but I have some questions about this current fragment. I want users to be able to use expressions such as:
2 > 1
2 > 1 and 3 < 1
(2 > 1 or 1 < 3) and 4 > 1
(2 > 1 or 1 < 3) and (4 > 1 or (2 < 1 and 3 > 1))
The first two expressions are legal in my grammar, but the last two are not, and I am not sure why. Also, ANTLRworks seems to suggest that input such as ((((1 > 2) with mismatched parentheses is legal, and I am not sure why. So, I seem to be missing out on some insight into the right way to handle parenthetical grouping in a grammar.
How can I change my grammar to properly handle parentheses?
My grammar is below:
grammar conditional_test;
boolean
: boolean_value_expression
EOF
;
boolean_value_expression
: boolean_term (OR boolean_term)*
EOF
;
boolean_term
: boolean_factor (AND boolean_factor)*
;
boolean_factor
: (NOT)? boolean_test
;
boolean_test
: predicate
;
predicate
: expression relational_operator expression
| LPAREN boolean_value_expression RPAREN
;
relational_operator
: EQ
| LT
| GT
;
expression
: NUMBER
;
LPAREN : '(';
RPAREN : ')';
NUMBER : '0'..'9'+;
EQ : '=';
GT : '>';
LT : '<';
AND : 'and';
OR : 'or' ;
NOT : 'not';
Chris Farmer wrote:
The first two expressions are legal in my grammar, but the last two are not, and I am not sure why. ...
You should remove the EOF token from:
boolean_value_expression
: boolean_term (OR boolean_term)*
EOF
;
You normally only use EOF at the end of the entry point of your grammar (boolean in your case). Be careful: boolean is a reserved word in Java and therefore cannot be used as a parser rule name!
So the first two rules should look like:
bool
: boolean_value_expression
EOF
;
boolean_value_expression
: boolean_term (OR boolean_term)*
;
And you may also want to ignore literal spaces by adding the following lexer rule:
SPACE : ' ' {$channel=HIDDEN;};
(you can include tabs and line breaks, of course)
Now all of your example input matches properly (tested with ANTLRWorks 1.4.2 as well).
Chris Farmer wrote:
Also, ANTLRworks seems to suggest that input such as ((((1 > 2) with mismatched parentheses is legal, ...
No, ANTLRWorks does produce errors, perhaps not very noticeable ones. The parse tree ANTLRWorks produces has a NoViableAltException as a leaf, and there are some errors on the "Console" tab.