Bison grammar occasionally passes, occasionally fails - grammar

I'm working extensively with Bison grammars for the first time. I have my grammar set up, and a little test suite to correlate results.
Occasionally, the test suite passes:
Reducing stack by rule 101 (line 613):
$1 = nterm mathenv ()
-> $$ = nterm closedTerm ()
Stack now 0 5 3
Entering state 120
Reading a token: Next token is token ENDMATH ()
Reducing stack by rule 28 (line 517):
$1 = nterm closedTerm ()
-> $$ = nterm compoundTerm ()
Stack now 0 5 3
Entering state 119
Reducing stack by rule 12 (line 333):
$1 = nterm compoundTerm ()
-> $$ = nterm compoundTermList ()
Stack now 0 5 3
Entering state 198
Next token is token ENDMATH ()
Shifting token ENDMATH ()
Entering state 325
... continues to completion ...
Occasionally, it does not:
Reducing stack by rule 101 (line 613):
$1 = nterm mathenv ()
-> $$ = nterm closedTerm ()
Stack now 0 5 3
Entering state 120
Reading a token: Next token is token MN ()
Reducing stack by rule 28 (line 517):
$1 = nterm closedTerm ()
-> $$ = nterm compoundTerm ()
Stack now 0 5 3
Entering state 119
Reducing stack by rule 12 (line 333):
$1 = nterm compoundTerm ()
-> $$ = nterm compoundTermList ()
Stack now 0 5 3
Entering state 198
Next token is token MN ()
Shifting token MN ()
Entering state 11
... errors eventually ...
Now at end of input.
Line: 9 Error: syntax error at token
ENDMATH is the correct token to shift to, but sometimes, MN is determined. I get inconsistent results whenever I run my test. Is such a "random" ambiguity normal? What could be causing it? Should I define some %precedence rules?
At the top of y.output, I do see several conflicts for states, like
State 0 conflicts: 3 shift/reduce
State 120 conflicts: 2 shift/reduce
State 127 conflicts: 2 shift/reduce
State 129 conflicts: 2 shift/reduce
State 154 conflicts: 1 shift/reduce
State 207 conflicts: 3 shift/reduce
State 265 conflicts: 109 shift/reduce
State 266 conflicts: 109 shift/reduce
State 267 conflicts: 109 shift/reduce
State 268 conflicts: 109 shift/reduce
State 269 conflicts: 109 shift/reduce
State 342 conflicts: 2 shift/reduce
State 390 conflicts: 109 shift/reduce
State 391 conflicts: 109 shift/reduce
State 396 conflicts: 1 shift/reduce
State 397 conflicts: 1 shift/reduce
Is it advisable to eliminate all of these conflicts? Note that state 120 is listed as having a conflict, and is the state right before this random error occurs.

Conflicts in your grammar mean that the grammar is not LALR(1). That may be due to the grammar being ambiguous or it may be due to the grammar requiring more than one token of lookahead. Whenever you have a conflict, bison resolves it by choosing one of the possible actions (either shift or reduce) based on the precedence directives you have; where no precedence applies, it defaults to shift for a shift/reduce conflict. This results in a parser which recognizes (parses) some subset of the language described by the grammar.
If the conflicts are purely due to ambiguity, this may result in just eliminating ambiguous parses and not actually reducing the language at all. For such cases, using precedence rules to resolve the ambiguity is the right way to deal with the problem, since it gives you a grammar that parses the language you want.
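To make the "purely ambiguous" case concrete, here is a small Python sketch (not Bison output; the function names are made up for illustration). It treats a subtraction chain as the ambiguous grammar E: E '-' E | NUM, computes the value of every possible parse, and then computes the single parse that a left-associativity declaration such as %left would keep.

```python
from functools import reduce

def all_parse_values(nums):
    """Value of every parse of nums[0] - nums[1] - ... under E: E '-' E | NUM."""
    if len(nums) == 1:
        return [nums[0]]
    values = []
    # Choose which '-' is the root of the parse tree, then recurse on both sides.
    for split in range(1, len(nums)):
        for left in all_parse_values(nums[:split]):
            for right in all_parse_values(nums[split:]):
                values.append(left - right)
    return values

def left_assoc_value(nums):
    """The one parse kept when '-' is left-associative: ((a - b) - c) ..."""
    return reduce(lambda acc, n: acc - n, nums)

print(all_parse_values([1, 2, 3]))  # one entry per parse tree
print(left_assoc_value([1, 2, 3]))  # the left-associative parse only
```

The input is accepted either way; the declaration only decides which tree is built, which is why precedence rules do not shrink the language in this situation.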
If the conflicts are due to needing more lookahead, precedence rules are generally no help. You need to resolve the problem by rearranging the grammar to not require the lookahead or by using other techniques (hacks) such as having the lexer insert extra synthetic tokens based on further lookahead in the input or other information.
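In your case the immediate problem seems to be in the lexer -- in one case it returns the token ENDMATH and in another it returns MN. The generated parser is deterministic, so for the same token stream it always takes the same actions; results that vary between runs point at whatever produces the tokens. There may also be ambiguity or lookahead issues in the grammar connected to the conflicts you see in y.output, but such problems appear at first glance to be completely independent of the problem with the lexer.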

Related

Antlr grammar not matching expected lexer rule

I'm trying to match a duration string, like for 30 minutes or for 2 hours using the following rules:
durationPhrase: FOR_STR (MINUTE_DURATION | HOUR_DURATION);
MINUTE_DURATION: NONZERO_NUMBER MINUTE_STR;
HOUR_DURATION: NONZERO_NUMBER HOUR_STR;
MINUTE_STR: 'minute'('s')?;
HOUR_STR: 'hour'('s')?;
FOR_STR: 'for';
NONZERO_NUMBER: [0-9]+;
WS: (' '|[\n\t\r]) -> skip;
With the following input:
for 30 minutes
Attempting to debug/match the durationPhrase rule, I'm presented with the error:
line 1:4 mismatched input '30' expecting {MINUTE_DURATION, HOUR_DURATION}
But I can't seem to figure out what lexer rule the '30' is matching? I was under the impression the "longest" lexer rule would win, which would be the MINUTE_DURATION rule.
Is it instead matching NONZERO_NUMBER first? And if so, why?
It's matching NONZERO_NUMBER because neither of the other patterns apply. If you had entered 30minutes, it would have matched MINUTE_DURATION, but as a token pattern, MINUTE_DURATION won't match the space character.
You ignore whitespace by applying -> skip to the token WS. That can only happen after WS is recognised as a token; i.e. after tokenisation. During tokenisation, whitespace characters are just characters.
If you make MINUTE_DURATION and HOUR_DURATION syntax rules rather than lexical rules, it should work as expected.
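The behaviour can be reproduced outside ANTLR with a rough Python sketch of a longest-match tokenizer. This is an approximation of the question's lexer rules using regular expressions, not ANTLR's actual algorithm; the tokenize function and the regex translations are assumptions made for illustration.

```python
import re

# Regex approximations of the question's lexer rules, in the same order.
RULES = [
    ("MINUTE_DURATION", re.compile(r"[0-9]+minutes?")),
    ("HOUR_DURATION", re.compile(r"[0-9]+hours?")),
    ("MINUTE_STR", re.compile(r"minutes?")),
    ("HOUR_STR", re.compile(r"hours?")),
    ("FOR_STR", re.compile(r"for")),
    ("NONZERO_NUMBER", re.compile(r"[0-9]+")),
    ("WS", re.compile(r"[ \n\t\r]")),
]

def tokenize(text):
    """Longest match wins; on a tie the earlier rule wins. WS is dropped after matching."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # the '-> skip' step happens only after WS was recognised
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("for 30 minutes"))
print(tokenize("for 30minutes"))
```

With the space present, no duration pattern can match at '30', so the number and the unit come out as separate tokens; without the space, MINUTE_DURATION is the longest match.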

Shift Reduce Conflict

This is part of a grammar implemented with YACC.
%right NOT
%left TIME DIV REALDIV MOD
%left LT LE
%left GT GE
%left EQ NE
%left AND OR
%right ASSIGN
%%
expression: expression EQ expression {}
| expression PLUS expression {}
| expression TIME expression {}
| PLUS expression {}
| variable PLUS PLUS {}
| variable {}
;
It has 9 shift/reduce conflicts.
How can I fix the conflicts without changing the language (the set of strings generated by this grammar)?
As written, the grammar is ambiguous because there is no way to distinguish the operator ++ from two instances of +.
Normally, this problem is resolved in the lexical scanner using the "maximal munch" rule, so that the expression a+++b would be broken into the lexical items ID PLUSPLUS PLUS ID, resulting in the parse (a++) + b. In that case, if the user really meant a + (+ (+b)) and didn't want to use parentheses, they could simply split the tokens up with whitespace, writing a+ + +b. (In your grammar, <= is apparently scanned as a single lexical token, rather than the tokens < and =, so the inputs a < = b and a <= b result in different parses -- one of which is invalid. That's more or less expected. So there's no obvious reason why + + should be a permitted spelling of the increment operator ++.)
Making that change will change the language slightly. In your grammar, a++b can be interpreted as a+(+b); if ++ were lexically scanned as a single token, the resulting expression would be a syntax error. If that's a problem it can be resolved but I don't think the resulting complexity is worthwhile.
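As an illustration of maximal munch, here is a minimal Python scanner sketch. The scan function and the PLUSPLUS token name are assumptions for this example, and it handles only identifiers, whitespace, + and ++.

```python
# Operators listed longest first, so "++" is tried before "+".
OPERATORS = [("++", "PLUSPLUS"), ("+", "PLUS")]

def scan(text):
    """Return token names, always taking the longest operator that matches."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1  # whitespace separates tokens and is otherwise ignored
        elif ch.isalpha():
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
            tokens.append("ID")
            i = j
        else:
            for spelling, name in OPERATORS:
                if text.startswith(spelling, i):
                    tokens.append(name)
                    i += len(spelling)
                    break
            else:
                raise ValueError(f"unexpected character {ch!r} at position {i}")
    return tokens

print(scan("a+++b"))    # the first two '+' are taken together
print(scan("a+ + +b"))  # whitespace keeps the three '+' apart
print(scan("a++b"))     # would now be a syntax error in the parser
```

With this scanner the grammar rule for increment would read variable PLUSPLUS rather than variable PLUS PLUS, which is what removes the ambiguity.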

ANTLR 4 token rule that matches any characters until it encounters XYZ

I want a token rule that gobbles up all characters until it gets to the characters XYZ.
Thus, if the input is this:
helloXYZ
then the token rule should return this token:
hello
If the input is this:
Blah Blah XYZ
then the token rule should return this token:
Blah Blah
How do I define a token rule to do this?
Using the hint that Terrance gives in his answer, I think this is what Roger is looking for:
grammar UseLookahead;
parserRule : LexerRule;
LexerRule : .+? { (_input.LA(1) == 'X') &&
(_input.LA(2) == 'Y') &&
(_input.LA(3) == 'Z')
}?
;
This gives the answers required, hello and Blah Blah respectively. I confess that I don't understand the significance of the final ?.
How about this?
HELLO : 'hello' {_input.LA(1)!=' '}? ;
If you want good performance, you need to use a form which does not use predicates. I would use code modeled after PositionAdjustingLexer.g4 to reset the position if the token ends with XYZ.
Edit: Don't underestimate the performance hit of the answer using a semantic predicate. The predicate will be evaluated at least once for every character of your entire input stream, and any character where a predicate is evaluated is prevented from using the DFA. The last time I saw something like this in use, it was responsible for more than 95% of the execution time of the entire parsing process, and removing it improved performance from more than 20 seconds to less than 1 second.
tokens {
SpecialToken
}
mode SpecialTokenMode;
// In your position adjusting lexer, if you see a token with the type
// SpecialTokenWithXYZ, reset the position to remove the last 3 characters and set
// the type to SpecialToken
SpecialTokenWithXYZ
: 'XYZ'
-> popMode
;
SpecialTokenCharacterAtEOF
: . EOF
-> type(SpecialToken), popMode
;
SpecialTokenCharacter
: .
-> more
;
If you want even better performance, you can add a couple rules to optimize handling of sequences that do not contain any X characters:
tokens {
SpecialToken
}
mode SpecialTokenMode;
// In your position adjusting lexer, if you see a token with the type
// SpecialTokenWithXYZ, reset the position to remove the last 3 characters and set
// the type to SpecialToken
SpecialTokenWithXYZ
: 'XYZ'
-> popMode
;
SpecialTokenCharacterSpanAtEOF
: ~'X'+ EOF
-> type(SpecialToken), popMode
;
SpecialTokenCharacterSpan
: ~'X'+
-> more
;
SpecialTokenXAtEOF
: 'X' EOF
-> type(SpecialToken), popMode
;
SpecialTokenX
: 'X'
-> more
;
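The position-adjusting idea in the comments above (scan through XYZ, then back the position up by three characters) can be modelled in a few lines of Python. This is only a sketch of the concept, not ANTLR code; scan_until_marker is a hypothetical helper.

```python
def scan_until_marker(text, start=0, marker="XYZ"):
    """Return (token_text, new_position) for the span before the marker.

    Models a lexer that consumes through the marker and then resets its
    position so the marker itself is left for the next token.
    """
    found = text.find(marker, start)
    if found == -1:
        # No marker before end of input: everything left is the token.
        return text[start:], len(text)
    scanned_to = found + len(marker)     # position after matching '...XYZ'
    new_pos = scanned_to - len(marker)   # back up over the marker
    return text[start:new_pos], new_pos

print(repr(scan_until_marker("helloXYZ")))
print(repr(scan_until_marker("Blah Blah XYZ")))
```

Note that for the second input the token keeps the space before XYZ; trimming it would have to be a separate step.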

ANTLR 3, what does LT!* mean?

I was looking at the code for a Javascript grammar written in ANTLR 3,
http://www.antlr3.org/grammar/1206736738015/JavaScript.g
In many instances I found
program
: LT!* sourceElements LT!* EOF!
;
What does LT!* mean?
EDIT:
From
http://ftp.camk.edu.pl/camk/chris/antlrman/antlrman.pdf
I found that LT stands for LOOKAHEAD TOKEN, but that is the Nth lookahead token. Where is the N part in the above?
No, LT does not mean LOOKAHEAD TOKEN in this context. It is a token defined nearly at the end of the grammar:
LT
: '\n' // Line feed.
| '\r' // Carriage return.
| '\u2028' // Line separator.
| '\u2029' // Paragraph separator.
;
The * means that the parser tries to match zero or more of these tokens, and the ! indicates that the generated AST should not include these LT tokens.

Bison shift-reduce conflict

A stripped down version of the grammar with the conflict:
body: variable_list function_list;
variable_list:
variable_list variable | /* empty */
;
variable:
TYPE identifiers ';'
;
identifiers:
identifiers ',' IDENTIFIER | IDENTIFIER
;
function_list:
function_list function | /* empty */
;
function:
TYPE IDENTIFIER '(' argument_list ')' function_body
;
The problem is that variables and functions both start with TYPE and IDENTIFIER, e.g.
int some_var;
int foo() { return 0; }
Variables are always declared before functions in this language, but when a parse is attempted, it always gives
parse error: syntax error, unexpected '(', expecting ',' or ';' [after foo]
How can the variable_list be made to be less greedy, or have the parser realize that if the next token is a '(' instead of a ';' or ',' it is obviously a function and not a variable declaration?
The bison debug output for the conflict is
state 17
3 body: variable_list . function_list
27 variable_list: variable_list . variable
T_INT shift, and go to state 27
T_BOOL shift, and go to state 28
T_STR shift, and go to state 29
T_VOID shift, and go to state 30
T_TUPLE shift, and go to state 31
T_INT [reduce using rule 39 (function_list)]
T_BOOL [reduce using rule 39 (function_list)]
T_STR [reduce using rule 39 (function_list)]
T_VOID [reduce using rule 39 (function_list)]
T_TUPLE [reduce using rule 39 (function_list)]
$default reduce using rule 39 (function_list)
variable go to state 32
simpletype go to state 33
type go to state 34
function_list go to state 35
I have tried all sorts of %prec statements to make it prefer reduce (although I am not sure what the difference would be in this case), with no success at making bison use reduce to resolve this. I have also tried shuffling the rules around, making new rules like non_empty_var_list and splitting body into function_list | non_empty_var_list function_list, but none of the attempts fixed the issue. I'm new to this and I've run out of ideas, so I'm completely baffled.
"The problem is that variables and functions both start with TYPE and IDENTIFIER"
Not exactly. The problem is that function_list is left-recursive and possibly empty.
When you reach the semi-colon terminating a variable with TYPE in the lookahead, the parser can reduce the variable into a variable_list, as per the first variable_list production. Now the next thing might be function_list, and function_list is allowed to be empty. So it could do an empty reduction to a function_list, which is what would be necessary to start parsing a function. It can't know not to do that until it looks at the '(', which is the third token ahead. An LALR(1) parser sees only one token of lookahead, so that is too far away to help.
Here's an easy solution:
function_list: function function_list
| /* EMPTY */
;
Another solution is to make function_list non-optional:
body: variable_list function_list
| variable_list
;
function_list: function_list function
| function
;
If you do that, bison can shift the TYPE token without having to decide whether it's the start of a variable or function definition.
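The lookahead argument can be checked with a small Python sketch that counts how many tokens must be seen before a variable declaration and a function definition can be told apart. The helper name and the example token lists are assumptions for illustration; they mirror int some_var; and int foo() { ... } from the question.

```python
def tokens_needed_to_distinguish(seq_a, seq_b):
    """Number of leading tokens to read before the two sequences differ."""
    for index, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b:
            return index + 1
    return None  # one sequence is a prefix of the other

variable = ["TYPE", "IDENTIFIER", ";"]
function = ["TYPE", "IDENTIFIER", "(", ")", "{", "}"]

print(tokens_needed_to_distinguish(variable, function))
```

Any rewrite that lets the parser postpone the decision until after IDENTIFIER has been shifted, as both solutions above do, avoids needing that extra lookahead.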