A yacc reduce/reduce conflict I can't explain - yacc

I'm getting a shift/reduce and reduce/reduce conflict that I believe shouldn't happen. Obviously I'm doing something wrong, so someone explain to me what I'm missing.
My stripped down grammar:
/*
* Test SQL Grammar
*/
%{
#include <stdio.h>
#include <string.h>
%}
/* Yacc's YYSTYPE UNION */
%union {
char* str; /* Pointer to constant string (malloc'd in lex) */
}
%token SELECT FROM AS ROWID ROWNUM NEXTVAL CURRVAL NULL
%token <str> IDENTIFIER STRING NUMBER
%%
query_block
: SELECT
select_list
FROM row_source_list
;
select_list
: '*'
| select_item_list
;
select_item_list
: select_item_list ',' select_item
| select_item
;
select_item
: row_source '.' '*'
| expr
| expr IDENTIFIER
;
row_source_list
: row_source_list ',' row_source
| row_source
;
row_source
: IDENTIFIER
| IDENTIFIER '.' IDENTIFIER
| IDENTIFIER opt_AS IDENTIFIER
| IDENTIFIER '.' IDENTIFIER opt_AS IDENTIFIER
;
opt_AS
: /* Empty */
| AS
;
expr
: IDENTIFIER '.' IDENTIFIER
| IDENTIFIER '.' ROWID
| IDENTIFIER '.' IDENTIFIER '.' IDENTIFIER
| IDENTIFIER '.' IDENTIFIER '.' ROWID
| ROWNUM
| ROWID
| STRING
| NUMBER
| IDENTIFIER '.' CURRVAL
| IDENTIFIER '.' NEXTVAL
| NULL
;
The conflicts seem to arrise because yacc doesn't know if it is working on the select_list (expr list) or the row_source_list. State 26 of y.output details the conflict:
state 26
12 row_source: IDENTIFIER '.' IDENTIFIER .
14 | IDENTIFIER '.' IDENTIFIER . opt_AS IDENTIFIER
17 expr: IDENTIFIER '.' IDENTIFIER .
19 | IDENTIFIER '.' IDENTIFIER . '.' IDENTIFIER
20 | IDENTIFIER '.' IDENTIFIER . '.' ROWID
AS shift, and go to state 16
'.' shift, and go to state 33
IDENTIFIER reduce using rule 15 (opt_AS)
IDENTIFIER [reduce using rule 17 (expr)]
'.' [reduce using rule 12 (row_source)]
$default reduce using rule 17 (expr)
opt_AS go to state 34
Now the basic rule for "query_block" states that a row_source_list must be preceded by the "FROM" keyword, so I don't see why yacc is combining the two into one state.
query_block
: SELECT
select_list
FROM row_source_list
;
I've traced the states and it ends up in this state before finding the "FROM" keyword.
I don't understand why it is considering the row_source_list before it recognized "FROM".

(I flagged this as "no longer reproducible fault" as the OP had solved it trivially, but the flag was aged/timed out).
As it has an answer I'll transcribe the answer so at least the question is noted as answered.
As the OP states:
I found it right after posting. It's the first line in the select_item rule. I should have caught that earlier.
Which to clarify, the select_item rule should be:
select_item
: expr
| expr IDENTIFIER
;
Which removes the ambiguity.

Related

Antlr4: Nested if-then-else in input requires a statement to parse correctly

I have simplified my grammar and am testing it in http://lab.antlr.org/. It works for
// if column not exists CURRENTSCHEMA.TABLE_EXISTS.NO_SUCH_COLUMN
// if column not exists CURRENTSCHEMA.TABLE_EXISTS.NO_SUCH_COLUMN
alter table TABLE_EXISTS add column (NO_SUCH_COLUMN decimal(5,4));
// end if
xyz; <-------- required! Why???
// end if
NOTE: there must be a NL char after the last // end if, else the parsing does not work at all. How to fix that is a separate question. When I add to the line comment a (NL | EOF) as termination sequence it complains about missing a NL.
but if I remove that xyz; (I call that a statement) then it does not. It looks as if each if-then-else must have at least one statement.
However I have defined a statement as either a text with semicolon or a line comment and a line comment with // if ... // end if is considered a directive.
With the xyz; the tree is the expected, a line comment of type directive with a statement which has another line comment with directive. Without the xyz is is a line comment with directive and within is a regular line comment and a statement.
statement :
trailing
| linecomment
| text* ';' (NL | WS)* ;
Can anybody spot my mistake? I have checked the ordering, the length of the tokens but obviously I am missing something essential or I am simply blind.
This is how it should look like, two nested directive nodes.
Complete grammar file:
grammar Sqlscripts;
options { caseInsensitive = true; }
LINE_COMMENT_START: '//' ;
WS: [ \t\r\f]+ ;
NL: '\r\n' | 'r' | '\n' ;
OBJECT: 'OBJECT' ;
COLUMN: 'COLUMN' ;
END: 'END' ;
IF: 'IF' ;
SEMI: ';' ;
NOT: 'NOT' ;
DOT: '.' ;
COMMA: ',' ;
EXISTS: 'EXISTS' ;
WORD: ([A-Z_0-9!#$%&+:<>=?*#{}]
| '\\'
| ']' | '[' | '-')+ ;
trailing:
WS* NL ;
identifier:
WORD DOT WORD (DOT WORD)* ;
directive:
WS* directivecondition
statement*
directiveend ;
directivecondition:
IF WS*
(OBJECT | COLUMN ) WS*
(NOT WS*)? EXISTS WS+
identifier trailing ;
directiveend:
LINE_COMMENT_START WS* END WS+
IF trailing ;
linecomment:
LINE_COMMENT_START
(directive | (~(NL)* NL)) ;
text:
WORD
| identifier
| OBJECT
| COLUMN
| paramclause
| NOT
| EXISTS
| trailing
| linecomment
| WS+
;
// A statement is either a command terminated
statement :
trailing
| linecomment
| text* ';' (NL | WS)* ;
paramclause:
'(' param_list ')' ;
param_list:
text+ (COMMA text*)* ;
script:
statement* (NL | WS)* EOF;

Handling blank lines when White Space is important in ANTLR4

This may be a newbee question, since I don't have a lot of ANTLR experience, but I've done a lot of research and troubleshooting and have not found a solution so resorting to asking. I am trying to write a parser for a very odd format file (PCGEN open source role playing game character editor) that I plan to use for several uses, not the least of which is learning ANTLR. I am to the point that I have everything I want working on the LEX and Parse, except that it stops parsing when it hits blank lines. I know I could add a line to throw away all whitespace, but the file format is such that strings are not really quoted, and white space is usually important, so the only white space that should be ignored is a totally blank line. When I run the Lexer it gives the tokens for the entire file, so I thought the Parser would process the tokens without concern for where they came from, so I am missing something simple. Here is the beggining of my input:
PCGVERSION:2.0
# System Information
CAMPAIGN:Advanced Player's Guide|CAMPAIGN:Ultimate Magic|CAMPAIGN:Ultimate Combat
VERSION:6.07.05
ROLLMETHOD:3|EXPRESSION:2d6+6
PURCHASEPOINTS:N
And this is my current grammar:
grammar PCG;
pcgFile : lines=line+;
line : statement (NEWLINE | EOF)
;
statement : KEYWORD ASSIGN
| KEYWORD ASSIGN YES_NO
| KEYWORD ASSIGN TEXT
| KEYWORD ASSIGN VERSIONNUM
| KEYWORD ( ASSIGN INT )+
| KEYWORD ASSIGN INT
| KEYWORD ASSIGN SUB_START statement SUB_END
| statement SEP statement
;
NEWLINE : '\r\n' | 'r' | '\n' ;
YES_NO : ('Y'|'N');
KEYWORD : [A-Z]+;
INT : [0-9]+;
TEXT : ~(':'|'|'|'\r'|'\n'|'['|']')+;
ASSIGN : ':';
SEP : '|';
COMMENT : '#' ~[\r\n]*->skip ;
VERSIONNUM : ([0-9]+ ('.' [0-9]+)?)
| ('.' [0-9]+)
| ([0-9]+ ('.' [0-9]+) ('.' [0-9]+)?)
;
ROLL : INT [dD] INT (('+'|'-') INT)?;
SUB_START : '[';
SUB_END : ']';
Any help would be appreciated.
You need to allow for more than 1 new line between statements. Do that by removing the rule and rewriting to this:
pcgFile : NEWLINE* statement ( NEWLINE+ statement )* NEWLINE* EOF;
The main problem is that your lexer matches # System Information as a TEXT token. Whenever 2 or more rules match the same amount of characters, the rule defined first will "win" *. So that's TEXT. When you place COMMENT before TEXT, it will work:
grammar PCG;
pcgFile : NEWLINE* statement ( NEWLINE+ statement )* NEWLINE* EOF;
statement : KEYWORD ASSIGN
| KEYWORD ASSIGN YES_NO
| KEYWORD ASSIGN TEXT
| KEYWORD ASSIGN VERSIONNUM
| KEYWORD ( ASSIGN INT )+
| KEYWORD ASSIGN INT
| KEYWORD ASSIGN SUB_START statement SUB_END
| statement SEP statement
;
NEWLINE : '\r\n' | 'r' | '\n' ;
YES_NO : ('Y'|'N');
KEYWORD : [A-Z]+;
INT : [0-9]+;
COMMENT : '#' ~[\r\n]* ->skip ;
TEXT : ~(':'|'|'|'\r'|'\n'|'['|']')+;
ASSIGN : ':';
SEP : '|';
VERSIONNUM : ([0-9]+ ('.' [0-9]+)?)
| ('.' [0-9]+)
| ([0-9]+ ('.' [0-9]+) ('.' [0-9]+)?)
;
ROLL : INT [dD] INT (('+'|'-') INT)?;
SUB_START : '[';
SUB_END : ']';
Keep in mind that ~(':'|'|'|'\r'|'\n'|'['|']')+ is dangerous: it could easily match a lot of characters.
* because the lexer works like this, input like 12 will never be tokenised as a VERSIONNUM token since INT matches this too an occurs before VERSIONNUM. Fix it by doing something like this:
statement : ...
| KEYWORD ASSIGN versionnum
| ...
;
versionnum : VERSIONNUM
| INT
;
...
INT : [0-9]+;
...
VERSIONNUM : [0-9]* '.' [0-9]+ ('.' [0-9]+)?
;
...

Antlr "CASE" error path

I am using antlr 2.7.6.
I am programming a parser for plc 61131-3 ST language and I can't resolve an issue with my grammar.
The grammar is:
case_Stmt : 'CASE' expression 'OF' case_Selection + ( 'ELSE' stmt_List )? 'END_CASE';
case_Selection : case_List ':' stmt_List;
case_List : case_List_Elem ( ',' case_List_Elem )*;
case_List_Elem : subrange | constant_Expr;
constant_Expr : constant | enum_Value;
stmt_List : ( Stmt ? ';' )*;
stmt : assign_Stmt | subprog_Ctrl_Stmt | selection_Stmt | Iteration_Stmt;
assign_Stmt : ( variable ':=' expression )
enum_Value : ( identifier '#' )? identifier;
variable : identifier | ...
The problem occurs with "enum_Value" as "case_Selection", the parser interprets it as a new "stmt" instead of the new "Case_Selection" it was supposed to.
Example:
CASE (enumVariable) OF
enum#literal1: Variable1 := 1;
enum#liteal2: Variable1 := 2;
enum#liteal3: Variable1 := 3;
ELSE
Variable1 := 4;
END_CASE;
In the above example instead of taking " enum.liteal2" as the new "case_Selection" it interprets it as "assign_Stmt" and gives error because it doesn't found the ':='.
Is there a way to try to read the maximum of characthers till we find the ':' or the ':=' to understand if we realy have a new "stmt" or not?
Thank you!
Edit1: better syntax;

Antlr - mismatched input '1' expecting number

I'm new to Antlr and I have the following simplified language:
grammar Hello;
sentence : targetAttributeName EQUALS expression+ (IF relationedExpression (logicalRelation relationedExpression)*)?;
expression :
'(' expression ')' |
expression ('*'|'/') expression |
expression ('+'|'-') expression |
function |
targetAttributeName |
NUMBER;
filterExpression :
'(' filterExpression ')' |
filterExpression ('*'|'/') filterExpression |
filterExpression ('+'|'-') filterExpression |
function |
filterAttributeName |
NUMBER |
DATE;
relationedExpression :
filterExpression ('<'|'<='|'>'|'>='|'=') filterExpression |
filterAttributeName '=' STRING |
STRING '=' filterAttributeName
;
logicalRelation :
'AND' |
'OR'
;
targetAttributeName :
'x'|
'y'
;
filterAttributeName :
'a' |
'a' '1' |
targetAttributeName;
function:
simpleFunction |
complexFunction ;
simpleFunction :
'simpleFunction' '(' expression ')' |
'simpleFunction2' '(' expression ')'
;
complexFunction :
'complexFunction' '(' expression ')' |
'complexFunction2' '(' expression ')'
;
EQUALS : '=';
IF : 'IF';
STRING : '"' [a-zA-z0-9]* '"';
NUMBER : [-]?[0-9]+('.'[0-9]+)?;
DATE: NUMBER NUMBER NUMBER NUMBER '.' NUMBER NUMBER? '.' NUMBER NUMBER? '.';
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
It works with x = y * 2, but it doesn't work with x =y * 1.
The error message is the following:
Hello::sentence:1:7: mismatched input '1' expecting {'simpleFunction', 'complexFunction', 'x', 'y', 'complexFunction2', '(', 'simpleFunction2', NUMBER}
It is very strange for me, because 1 is a NUMBER...
If I change the filterAttribute from 'a' '1' to 'a1', then it works with x=y*1, but I don't understand the difference between the two cases. Could somebody explain it for me?
Thanks.
By doing this:
filterAttributeName :
'a' |
'a' '1' |
targetAttributeName;
ANTLR creates lexer rules from these inline tokens. So you really have a lexer grammar that looks like this:
T_1 : '1': // the rule name will probably be different though
T_a : 'a';
...
NUMBER : [-]?[0-9]+('.'[0-9]+)?;
In other words, the input 1 will be tokenized as T_1, not as a NUMBER.
EDIT
Whenever certain input can match two or more lexer rules, ANTLR chooses the one defined first. The lexer does not "listen" to the parser to see what it needs at a particular time. The lexing and parsing are 2 distinct phases. This is simply how ANTLR works, and many other other parser generators. If this is not acceptable for you, you should google for "scanner-less parsing", or "packrat parsers".

ANTLR Grammar is not backtracking while parsing similar rules

Suppose I have a grammar which takes care of the global variables and some method declarations of some variation of C
program: (declaration)* (procedure)*;
declaration: typespec identifier ';';
procedure: typespec identifier '(' ')' ';';
typespec: 'char' | 'int';
identifier: ('a' .. 'z' | 'A' .. 'Z') ('A' - 'Z' | 'a' .. 'z' | '0' .. '9' | '_')*;
If I feed it something like:
int MAX;
char proc();
the grammar reads int MAX; correctly but then it wants to apply the declaration rule also to the 2nd row, and it fails when it reaches (, and at this point I expect it to backtrack and apply the next rule which is the one for procedure. Could somebody please tell me why this isn't happening?
Did you post all of your grammar? I couldn't get it to compile as you posted...but I played around with what you posted to make it match your example:
program: (declaration)* (procedure)*;
statement: TYPE_SPEC IDENT ;
declaration: statement ';';
procedure: statement '(' ')' ';';
TYPE_SPEC
: 'char' | 'int';
IDENT
: ('a' .. 'z' | 'A' .. 'Z') ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_')*;
WHITESPACE
: ('\r' | '\n' | '\r\n' | ' ' | '\t' ) {$channel=HIDDEN;}
;
I'd recommend that your make lexer rules (The ones in capitals) for your token matching rather than making them part of your parser rules - I've done some of them already for you as you can see.