A stripped down version of the grammar with the conflict:
body: variable_list function_list;
variable_list:
variable_list variable | /* empty */
;
variable:
TYPE identifiers ';'
;
identifiers:
identifiers ',' IDENTIFIER | IDENTIFIER
;
function_list:
function_list function | /* empty */
;
function:
TYPE IDENTIFIER '(' argument_list ')' function_body
;
The problem is that variables and functions both start with TYPE and IDENTIFIER, e.g
int some_var;
int foo() { return 0; }
variables are always declared before functions in this language, but when a parse is attempted, it always gives
parse error: syntax error, unexpected '(', expecting ',' or ';' [after foo]
How can the variable_list be made to be less greedy, or have the parser realize that if the next token is a '(' instead of a ';' or ',' it is obviously a function and not a variable declaration?
The bison debug output for the conflict is
state 17
3 body: variable_list . function_list
27 variable_list: variable_list . variable
T_INT shift, and go to state 27
T_BOOL shift, and go to state 28
T_STR shift, and go to state 29
T_VOID shift, and go to state 30
T_TUPLE shift, and go to state 31
T_INT [reduce using rule 39 (function_list)]
T_BOOL [reduce using rule 39 (function_list)]
T_STR [reduce using rule 39 (function_list)]
T_VOID [reduce using rule 39 (function_list)]
T_TUPLE [reduce using rule 39 (function_list)]
$default reduce using rule 39 (function_list)
variable go to state 32
simpletype go to state 33
type go to state 34
function_list go to state 35
I have tried all sorts of %prec statements to make it prefer reduce (although I am not sure what the difference would be in this case), with no success at making bison use reduce to resolve this, and I have also tried shuffling the rules around making new rules like non_empty_var_list and having body split up into function_list | non_empty_var_list function_list and none of the attempts would fix this issue. I'm new to this and I've run out of ideas of how to fix this, so I'm just completely baffled.
the problem is in that variables and functions both start with TYPE and IDENTIFIER
Not exactly. The problem is that function_list is left-recursive and possibly empty.
When you reach the semi-colon terminating a variable with TYPE in the lookahead, the parser can reduce the variable into a variable_list, as per the first variable_list production. Now the next thing might be function_list, and function_list is allowed to be empty. So it could do an empty reduction to a function_list, which is what would be necessary to start parsing a function. It can't know not to do that until it looks at the '(' which is the third next token. That's far too far away to be relevant.
Here's an easy solution:
function_list: function function_list
| /* EMPTY */
;
Another solution is to make function_list non-optional:
body: variable_list function_list
| variable_list
;
function_list: function_list function
| function
;
If you do that, bison can shift the TYPE token without having to decide whether it's the start of a variable or function definition.
Related
I'm implementing a simple PseudoCode language with ANTLR4, this is my current grammar:
// Define a grammar called PseudoCode
grammar PseudoCode;
prog : FUNCTION SIGNATURE '(' ')'
| FUNCTION SIGNATURE '{' VARB '}' ;
param: VARB | VARB ',' param ;
assignment: VARB '=' NUMBER ;
FUNCTION: 'function' ;
VARB: [a-z0-9]+ ;
SIGNATURE: [a-zA-Z0-9]+ ;
NUMBER: [0-9]+ | [0-9]+ '.' [0-9]+ ;
WS: [ \t\r\n]+ -> skip ;
The problem is after compiling and generating the Parser, Lexer, etc... and then running with grun PseudoCode prog -tree with the input being for example: function bla{bleh}
I keep on getting the following error:
line 1:9 no viable alternative at input 'functionbla'
Can someone point out what is wrong with my grammar?
bla is a VARB, not a SIGNATURE, because it matches both rules and VARB comes first in the grammar. The way you defined your lexer rules, an identifier can only be matched as a SIGNATURE if it contains capital letters.
The simplest solution to this problem would be to have a single lexer rule for identifiers and then use that everywhere where you currently use SIGNATURE or VARB. If you want to disallow capital letters in certain places, you could simply check for this condition in an action or listener, which would also allow you to produce clearer error messages than syntax errors (e.g. "capital letters are not allowed in variable names").
If you absolutely do need capital letters in variable names to be syntax errors, you could define one rule for identifiers with capital letters and one without. Then you could use ID_WITH_CAPITALS | ID_LOWER_CASE_ONLY in places where you want to allow both and ID_LOWER_CASE_ONLY in cases where you only want to allow lower case letters.
PS: You'll also want to make sure that your identifier rule does not match numbers (which both VARB and SIGNATURE currently do). Currently NUMBER tokens will only be generated for numbers with a decimal point.
I am having some issues to related to statement termination for an SQL grammar. Currently it supports statement termination through semicolon ';' token. Making this as optional makes the grammar go out of memory. I want to know whether I can match the first token of next statement as a terminator for previous one without consuming it. Here is a snippet of my parser grammar.
unit_statement
: statement SEMICOLON!
;
statement
options{
backtrack=true;
}
:
declare_statement
| assignment_statement
| sql_statement
;
Here the sql_statement rule must end with a semicolon but declare_statement should not. I am using ANTLR 3
I'm trying to generate a small JavaScript parser which also includes typed variables for a small project.
Luckily, jison already provides a jscore.js which I just adjusted to fit my needs. After adding types I ran into a reduce conflict. I minimized to problem to this minimum JISON:
Jison:
%start SourceElements
%%
// This is up to become more complex soon
Type
: VAR
| IDENT
;
// Can be a list of statements
SourceElements
: Statement
| SourceElements Statement
;
// Either be a declaration or an expression
Statement
: VariableStatement
| ExprStatement
;
// Parses something like: MyType hello;
VariableStatement
: Type IDENT ";"
;
// Parases something like hello;
ExprStatement
: PrimaryExprNoBrace ";"
;
// Parses something like hello;
PrimaryExprNoBrace
: IDENT
;
Actually this script does nothing than parsing two statements:
VariableStatement
IDENT IDENT ";"
ExpStatement
IDENT ";"
As this is a extremly minimized JISON Script, I can't simply replace "Type" be "IDENT" (which btw. worked).
Generating the parser throws the following conflicts:
Conflict in grammar: multiple actions possible when lookahead token is IDENT in state 8
- reduce by rule: PrimaryExprNoBrace -> IDENT
- reduce by rule: Type -> IDENT
Conflict in grammar: multiple actions possible when lookahead token is ; in state 8
- reduce by rule: PrimaryExprNoBrace -> IDENT
- reduce by rule: Type -> IDENT
States with conflicts:
State 8
Type -> IDENT . #lookaheads= IDENT ;
PrimaryExprNoBrace -> IDENT . #lookaheads= IDENT ;
Is there any trick to fix this conflict?
Thank you in advanced!
~Benjamin
This looks like a Jison bug to me. It is complaining about ambiguity in the cases of these two sequences of tokens:
IDENT IDENT
IDENT ";"
The state in question is that reached after shifting the first IDENT token. Jison observes that it needs to reduce that token, and that (it claims) it doesn't know whether to reduce to a Type or to a PrimaryExpressionNoBrace.
But Jison should be able to distinguish based on the next token: if it is a second IDENT then only reducing to a Type can lead to a valid parse, whereas if it is ";" then only reducing to PrimaryExpressionNoBrace can lead to a valid parse.
Are you sure the given output goes with the given grammar? It would be possible either to add rules or to modify the given ones to produce an ambiguity such as the one described. This just seems like such a simple case that I'm surprised Jison gets it wrong. If it in fact does, however, then you should consider filing a bug report.
Surprise, I am building an SQL like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_WITHIN for the language keyword 'pct_within'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic. How the Lexer consumes stuff as it finds it even if it is wrong. How you can use semantic predication to remove ambiguity/how to use lookahead. But everything I read hasn't helped me fix this issue.
Honestly I don't see how it even CAN be an issue. I must be missing something super obvious because other grammars I see have Lexer rules like EXISTS but that doesn't cause the parser to take a string like 'existsOrNot' and spit out and IDENTIFIER with the text of 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword in certain sequences as regular identifier you will have to include it in the accepted identifier rule.
My question is in regards to running the following grammar in ANTLRWorks:
INT :('0'..'9')+;
SEMICOLON: ';';
NEWLINE: ('\r\n'|'\n'|'\r');
STMTEND: (SEMICOLON (NEWLINE)*|NEWLINE+);
statement
: STMTEND
| INT STMTEND
;
program: statement+;
I get the following results with the following input (with program as the start rule), regardless of which newline NL (CR/LF/CRLF) or integer I choose:
"; NL" or "32; NL" parses without error.
";" or "45;" (without newlines) result in EarlyExitException.
"NL" by itself parses without error.
"456 NL", without the semicolon, results in MismatchedTokenException.
What I want is for a statement to be terminated by a newline, semicolon, or semicolon followed by newline, and I want the parser to eat as many contiguous newlines as it can on a termination, so "; NL NL NL NL" is just one termination, not four or five. Also, I would like the end-of-file case to be a valid termination as well, but I don't know how to do that yet.
So what's wrong with this, and how can I make this terminate nicely at EOF? I'm completely new to all of parsing, ANTLR, and EBNF, and I haven't found much material to read on it at a level somewhere in between the simple calculator example and the reference (I have The Definitive ANTLR Reference, but it really is a reference, with a quick start in the front which I haven't yet got to run outside of ANTLRWorks), so any reading suggestions (besides Wirth's 1977 ACM paper) would be helpful too. Thanks!
In case of input like ";" or "45;", the token STMTEND will never be created.
";" will create a single token: SEMICOLON, and "45;" will produce: INT SEMICOLON.
What you (probably) want is that SEMICOLON and NEWLINE never make it to real tokens themselves, but they will always be a STMTEND. You can do that by making them so called "fragment" rules:
program: statement+;
statement
: STMTEND
| INT STMTEND
;
INT : '0'..'9'+;
STMTEND : SEMICOLON NEWLINE* | NEWLINE+;
fragment SEMICOLON : ';';
fragment NEWLINE : '\r' '\n' | '\n' | '\r';
Fragment rules are only available for other lexer rules, so they will never end up in parser (production) rules. To emphasize: the grammar above will only ever create either INT or STMTEND tokens.