Shift/Reduce IF-ELSE conflict in my grammar - grammar

I'm trying to write a small parser. Unfortunately I get a "shift-reduce conflict". Grammars are not my strong point, and I only need to get this one small thingy done. Here's the reduced grammar that produces the error:
stmts_opt -> stmts
;
stmts -> stmt
| stmts stmt
| stmsts
;
stmt -> id
| ITERATE content_stmt
| IF test then content_stmt ELSE content_stmt
| IF test then content_stmt
;
content_stmt: BEGIN stmt_opt END
| stmt
;
A solution giving the modified grammar would be highly appreciated.
Edit:
I modified my grammar to follow @rici's answer but the problem persists. Here are my actual grammar productions:
prog: BEGIN_PROG def_sprogram BEGIN_EXEC stmts_opt END_EXEC END_PROG
{ () }
;
def_sprogram: /* empty */ { () }
| define_new def_sprogram { () }
;
define_new: DEFINE_NEW_INSTRUCTION ID AS content_stmt SEMI { }
;
stmts_opt: /* empty */ { () }
| stmts { () }
;
stmts: stmt { () }
| stmts SEMI stmt { () }
| stmts SEMI { () }
;
content_stmt: BEGIN stmts_opt END { () }
| stmt { () }
;
stmt: open_stmt { () }
| closed_stmt { () }
;
open_stmt: ITERATE INT TIMES open_stmt { () }
| WHILE test DO open_stmt { () }
| IF test THEN closed_stmt ELSE open_stmt { () }
| IF test THEN stmt { () }
;
closed_stmt: simple_stmt { () }
| ITERATE INT TIMES closed_stmt { () }
| WHILE test DO closed_stmt { () }
| IF test THEN closed_stmt ELSE closed_stmt { () }
;
Here is the example I am testing on:
BEGINNING-OF-PROGRAM
BEGINNING-OF-EXECUTION
IF not-next-to-a-beeper THEN
move;
IF not-facing-north THEN
turnleft;
ELSE <--- ERROR
turnleft;
IF not-facing-east THEN
IF not-facing-west THEN
turnleft;
turnoff
END-OF-EXECUTION
END-OF-PROGRAM
I am getting the error in the first ELSE. I also tried to declare a simple precedence as suggested by @rici:
%nonassoc THEN
%nonassoc ELSE
but that didn't resolve the error either.

The simplest solution to the "dangling else" shift-reduce conflict is to force resolution in favour of shifting the ELSE token. Since camlyacc does support precedence declarations (according to the rather sketchy documentation I was able to find) this should be as simple as adding the following to your declaration section (before the first %%):
%nonassoc THEN
%nonassoc ELSE
(The associativity doesn't matter because there is nowhere in the grammar where THEN and ELSE can associate.)
If you want to use the explicit "matched/unmatched" (or "open/closed" statement) grammar instead, it is sufficient to note that BEGIN stmts_opt END is a "matched" (or "closed") statement, since it cannot accept a THEN. The other matched statements are
matched_stmt: BEGIN stmts_opt END
| ITERATE matched_stmt
| IF test THEN matched_stmt ELSE matched_stmt
| /* Any other kind of simple statement */
The unmatched statements are
unmatched_stmt: ITERATE unmatched_stmt
| IF test THEN matched_stmt
| IF test THEN unmatched_stmt
| IF test THEN matched_stmt ELSE unmatched_stmt
Many people prefer to create a non-terminal which includes matched_stmt and unmatched_stmt. However, in your case you seem to not want to nest BEGIN … END blocks, restricting those to the content of a compound statement. So your stmt would be all of the above except for the BEGIN stmts_opt END right-hand side.
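To see what "resolving in favour of shifting the ELSE" means in practice, here is a minimal sketch in Python rather than yacc. It is a hand-written recursive-descent parser for a hypothetical two-rule language (the function name `parse_stmt` and the tuple-shaped tree are my own assumptions, not anything from your grammar). Whenever an ELSE is available it is consumed by the innermost unfinished IF, which is the same outcome the precedence declarations or the matched/unmatched grammar give you.

```python
# Hypothetical mini-language: IF <cond> THEN <stmt> [ELSE <stmt>] | <word>
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos)."""
    if tokens[pos] == "IF":
        cond = tokens[pos + 1]
        if tokens[pos + 2] != "THEN":
            raise SyntaxError("expected THEN")
        then_branch, pos = parse_stmt(tokens, pos + 3)
        # "Shift" the ELSE whenever one is available: the innermost
        # unfinished IF gets first claim on it.
        if pos < len(tokens) and tokens[pos] == "ELSE":
            else_branch, pos = parse_stmt(tokens, pos + 1)
            return ("if", cond, then_branch, else_branch), pos
        return ("if", cond, then_branch), pos
    # Anything else is treated as a simple one-word statement.
    return tokens[pos], pos + 1


tree, end = parse_stmt("IF a THEN IF b THEN x ELSE y".split())
print(tree)  # ('if', 'a', ('if', 'b', 'x', 'y'))
```

The ELSE ends up inside the inner IF, so the outer IF has no else branch.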

Related

Don't read specific token at a given point

At some point in my grammar file, I want ANTLR to read my input as 2 tokens instead of one.
In my source file I have the value
12345.name
and the lexer consumes
12345.
as a FLOAT-Token. At this specific point in the source file I want ANTLR to read this as
12345 (INT)
. (DOT)
name (NAME)
Is there a way to tell ANTLR that it should ignore FLOAT-Types at some given point?
This is my current .g4 file:
grammar Quest;
import Lua;
@header {
package dev.codeflush.m2qc.antlr;
}
/*
prefixed everything with "m2" to avoid nameclashes
*/
m2QuestFile
: m2Define* m2Quest* EOF
;
m2Define
: 'define' NAME m2DefineValue
;
m2DefineValue
: ~('\r\n' | '\r' | '\n')
;
m2Quest
: 'quest' NAME 'begin' m2State* 'end'
;
m2State
: 'state' NAME 'begin' (m2TriggerBlock | m2Function)* 'end'
;
m2TriggerBlock
: 'when' m2Trigger ('or' m2Trigger)* ('with' exp)? 'begin' block 'end'
;
m2Function
: 'function' NAME funcbody
;
m2Trigger
: m2TriggerTarget DOT m2TriggerEvent DOT m2TriggerSubEvent DOT m2TriggerArgument
| m2TriggerTarget DOT m2TriggerEvent DOT m2TriggerArgument
| m2TriggerTarget DOT m2TriggerEvent
| m2TriggerEvent
;
m2TriggerTarget
: NAME
| INT
| NORMALSTRING
;
/*
not complete
*/
m2TriggerEvent
: 'button'
| 'enter'
| 'info'
| 'item_informer'
| 'kill'
| 'leave'
| 'letter'
| 'levelup'
| 'login'
| 'logout'
| 'unmount'
| 'target'
| 'chat'
| 'timer'
| 'server_timer'
;
m2TriggerSubEvent
: 'click'
| 'chat'
| 'arrive'
;
m2TriggerArgument
: exp
;
DOT
: '.'
;
I'm using the Lua grammar from https://github.com/antlr/grammars-v4/blob/master/lua/Lua.g4
My current sample input file looks like this:
quest test begin
state start begin
when kill begin
end
when "12345".kill begin
end
when 12345.kill begin
end
end
end
Where the first two work as intended but the third one doesn't (because the lexer reads '12345.' as one FLOAT-Token)
I had a similar need in my grammar where I wanted to issue multiple tokens (2 actually) for a single match under a specific condition (here: when a dot is directly followed by an identifier, including a keyword).
// Special rule that should also match all keywords if they are directly preceded by a dot.
// Hence it's defined before all keywords.
// Here we make use of the ability in our base lexer to emit multiple tokens with a single rule.
DOT_IDENTIFIER:
DOT_SYMBOL LETTER_WHEN_UNQUOTED_NO_DIGIT LETTER_WHEN_UNQUOTED* { emitDot(); } -> type(IDENTIFIER)
;
A helper function is needed to emit the extra token(s):
/**
* Puts a DOT token onto the pending token list.
*/
void MySQLBaseLexer::emitDot() {
_pendingTokens.emplace_back(_factory->create({this, _input}, MySQLLexer::DOT_SYMBOL, _text, channel,
tokenStartCharIndex, tokenStartCharIndex, tokenStartLine,
tokenStartCharPositionInLine));
++tokenStartCharIndex;
}
which in turn requires custom handling of the token production. You have to override the nextToken method in your lexer so that it considers the pending token list before returning the next real token.
/**
* Allow a grammar rule to emit as many tokens as it needs.
*/
std::unique_ptr<antlr4::Token> MySQLBaseLexer::nextToken() {
// First respond with pending tokens to the next token request, if there are any.
if (!_pendingTokens.empty()) {
auto pending = std::move(_pendingTokens.front());
_pendingTokens.pop_front();
return pending;
}
// Let the main lexer class run the next token recognition.
// This might create additional tokens again.
auto next = Lexer::nextToken();
if (!_pendingTokens.empty()) {
auto pending = std::move(_pendingTokens.front());
_pendingTokens.pop_front();
_pendingTokens.push_back(std::move(next));
return pending;
}
return next;
}
Keep in mind: the lexer rule still issues its own token (which I set to be an IDENTIFIER here), which means you only have to issue the additional tokens.
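The pending-token idea is independent of ANTLR, so here is a minimal sketch of it in plain Python, applied to the 12345.name input from the question. Everything in it (the class `PendingTokenLexer`, the `INT_DOT` rule, the token tuples) is a hypothetical toy and not the ANTLR runtime API. In this toy the matched rule returns its own token first and queues the extra one, which is a simplification of the C++ code above.

```python
import re
from collections import deque

# Alternatives are tried in order, so the "INT." special case wins over FLOAT.
TOKEN_RE = re.compile(r"""
    (?P<INT_DOT>\d+\.(?=[A-Za-z_]))   # digits + dot directly followed by a name
  | (?P<FLOAT>\d+\.\d*)
  | (?P<INT>\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<DOT>\.)
  | (?P<WS>\s+)
""", re.VERBOSE)


class PendingTokenLexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.pending = deque()  # extra tokens waiting to be handed out

    def next_token(self):
        # Serve queued tokens before scanning any further input.
        if self.pending:
            return self.pending.popleft()
        while self.pos < len(self.text):
            m = TOKEN_RE.match(self.text, self.pos)
            if m is None:
                raise SyntaxError(f"unexpected character at offset {self.pos}")
            self.pos = m.end()
            kind, value = m.lastgroup, m.group()
            if kind == "WS":
                continue
            if kind == "INT_DOT":
                # One match, two tokens: return INT now, queue DOT for later.
                self.pending.append(("DOT", "."))
                return ("INT", value[:-1])
            return (kind, value)
        return ("EOF", "")


def tokenize(text):
    lexer = PendingTokenLexer(text)
    tokens = []
    while (tok := lexer.next_token())[0] != "EOF":
        tokens.append(tok)
    return tokens


print(tokenize("12345.kill"))  # [('INT', '12345'), ('DOT', '.'), ('NAME', 'kill')]
print(tokenize("12345.5"))     # [('FLOAT', '12345.5')]
```

An ordinary float such as 12345.5 is still lexed as a single FLOAT, because the special case only fires when a letter or underscore follows the dot.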

In Bison (or yacc for that matter), is there an order defined by a grammar?

I have the following grammar in a Bison file:
item
: "ITEM" t_name t_type v_storage t_prefix t_tag ';'
;
t_name
: [$_A-Za-z][$_A-Z0-9a-z]*
;
t_type
: "BYTE"
| "WORD"
| "LONG"
| "QUAD"
;
v_storage
: %empty
| "TYPEDEF"
;
t_prefix
: %empty
| "PREFIX" t_name
;
t_tag
: %empty
| "TAG" t_name
;
When I attempt to parse the following string ITEM foobar BYTE PREFIX str_ TAG S TYPEDEF; I get an unexpected "TYPEDEF" and it accepts the ";". Is there something I need to do to allow any order to be specified? If so, I'm hoping that there is a simple solution. Otherwise, I'll need to do a little more work.
It is not possible to tell bison (or yacc) that order doesn't matter. Rules are strictly ordered.
So you have two options:
List all possible orders. If you do this, watch out for ambiguities caused by optional productions. You'll actually need to list all orders and subsets. That mounts up exponentially.
Just accept any list of components, as a list. That will accept repeated components so you'll need to catch that in the semantic action if you care.
The second option is almost always the one you want. Implementation is usually trivial, because you will want to store the components somewhere; as long as that somewhere has a unique value (such as NULL) which means "not yet set", then you only need to test that value before setting it. For example (using a simplified grammar rather than the one in the question):
%{
#include <stdbool.h>
#include <stdio.h>   /* fprintf */
#include <stdlib.h>  /* calloc */
enum Type {
TYPE_DEFAULT = 0, TYPE_BYTE, TYPE_WORD, TYPE_LONG, TYPE_QUAD
};
typedef struct Item Item;
struct Item {
const char *name;
enum Type type;
int storage; /* 0: unset, 1: TYPEDEF */
const char *prefix;
const char *tag;
};
// ...
// Relies on the fact that NULL and 0 are converted to boolean
// false. Returns true if it's ok to do the set (i.e. thing
// wasn't set).
bool check_dup(bool already_set, const char* thing) {
if (already_set)
fprintf(stderr, "Duplicate %s ignored at line %d\n", thing, yylineno);
return !already_set;
}
%}
%union {
const char *str;
Item *item;
// ...
}
%type <item> item item-def
%token <str> NAME STRING
%%
/* Many of the actions below depend on $$ having been set to $1.
* If you use a template which doesn't provide that guarantee, you
* will have to add $$ = $1; to some actions.
*/
item: item-def { /* Do whatever is necessary to finalise $1 */ }
item-def
: "ITEM" NAME
{ $$ = calloc(1, sizeof *$$); $$->name = $2; }
| item-def "BYTE"
{ if (check_dup($$->type, "type")) $$->type = TYPE_BYTE; }
| item-def "WORD"
{ if (check_dup($$->type, "type")) $$->type = TYPE_WORD; }
| item-def "LONG"
{ if (check_dup($$->type, "type")) $$->type = TYPE_LONG; }
| item-def "QUAD"
{ if (check_dup($$->type, "type")) $$->type = TYPE_QUAD; }
| item-def "TYPEDEF"
{ if (check_dup($$->storage, "storage")) $$->storage = 1; }
| item-def "PREFIX" STRING
{ if (check_dup($$->prefix, "prefix")) $$->prefix = $3; }
| item-def "TAG" STRING
{ if (check_dup($$->tag, "tag")) $$->tag = $3; }
You can separate all those item-def productions into something like:
item-def: "ITEM" NAME { /* ... */ }
| item-def item-option
item-option: type | storage | prefix | tag
But then in the actions you need to get at the item object, which is not part of the option production. You can do that with a Bison feature which lets you look into the parser stack:
prefix: "PREFIX" STRING { if (check_dup($<item>0->prefix, "prefix"))
$<item>0->prefix = $2; }
In this context, $0 will refer to whatever came before prefix, which is whatever came before item-option, which is an item-def. See the end of this section in the Bison manual, where it describes this practice as "risky", which it is. It also requires you to explicitly specify the tag, because bison doesn't do the grammar analysis necessary to validate the use of $0, which would identify its type.
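The "accept any list of components and reject duplicates in the action" approach can also be sketched outside Bison. The following Python is only an illustration under my own assumptions: `parse_item`, `set_once`, the dict layout and the pre-split token list are hypothetical, and `None` plays the role of the "not yet set" value.

```python
TYPES = {"BYTE", "WORD", "LONG", "QUAD"}


def parse_item(tokens):
    """Parse ITEM name option* ';' with the options in any order (toy parser)."""
    if len(tokens) < 3 or tokens[0] != "ITEM" or tokens[-1] != ";":
        raise SyntaxError("expected: ITEM name option* ;")
    item = {"name": tokens[1], "type": None, "storage": None,
            "prefix": None, "tag": None}
    warnings = []

    def set_once(field, value):
        # None is the "not yet set" marker; a second value is ignored.
        if item[field] is not None:
            warnings.append(f"duplicate {field} ignored")
        else:
            item[field] = value

    i = 2
    end = len(tokens) - 1  # index of the closing ';'
    while i < end:
        tok = tokens[i]
        if tok in TYPES:
            set_once("type", tok)
            i += 1
        elif tok == "TYPEDEF":
            set_once("storage", tok)
            i += 1
        elif tok in ("PREFIX", "TAG"):
            if i + 1 >= end:
                raise SyntaxError(f"{tok} needs a name")
            set_once(tok.lower(), tokens[i + 1])
            i += 2
        else:
            raise SyntaxError(f"unexpected {tok!r}")
    return item, warnings


item, warnings = parse_item("ITEM foobar BYTE PREFIX str_ TAG S TYPEDEF ;".split())
print(item["storage"], item["prefix"], item["tag"])  # TYPEDEF str_ S
print(parse_item("ITEM x BYTE WORD ;".split())[1])   # ['duplicate type ignored']
```

The loop corresponds to the left-recursive item-def rule, and `set_once` corresponds to check_dup.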

String Templates and Syntactic Predicates

I am trying to fix my TreeWalker so that it can implement a different string template depending upon whether a bracket is found or not.
i.e. In the following formula: x - (y - z)
Using the brackets will alter the value of the formula. Without the brackets the formula becomes x - y - z which is wrong. I am trying to use a different string template for minus and minus with a bracket. These are as follows:
minus(op1,op2) ::= "$op1$ - $op2$"
minusb(op1,op2) ::= "($op1$ - $op2$)"
And the TreeWalker section I am trying to use with these is below. This is only a section from a much larger TreeWalker.
additiveExpr
scope { bool aFlag }
@init {bool aFlag = true; }
: ^(PLUS
{ $additiveExpr::aFlag = false;
$formula::mFlag = false;
}
op1=expression op2=expression)
-> {$additiveExpr::aFlag}?
plusb(op1={$op1.st},op2={$op2.st})
-> plus(op1={$op1.st},op2={$op2.st})
| ^(MINUS
{
$additiveExpr::aFlag = false;
$formula::mFlag = false;
}
op1=expression op2=expression)
-> {$additiveExpr::aFlag}?
minusb(op1={$op1.st},op2={$op2.st})
-> minus(op1={$op1.st},op2={$op2.st})
In the above rule aFlag always returns false producing the template with no brackets. I am struggling to understand why it will not return true when there is a bracket.
I can post other parts of the Parser if that will help.
I have worked out that the conditional part is determined by syntactic predicates. I need the syntactic predicate expression to be true when the token before the additiveExpr is a bracket (lexer token OPEN '(' ). Does this need to be passed through from the Parser to the TreeWalker or can I work it out from here? This is the equivalent parser code
absExpr returns [string ret_type]
: ABS^ OPEN! additiveExpr CLOSE!
{$ret_type = "numeric"; }
| additiveExpr
{$ret_type = $additiveExpr.ret_type; }
;
additiveExpr returns [string ret_type]
: m=multiplicativeExpr ((PLUS|MINUS)^ multiplicativeExpr )*
{$ret_type = $m.ret_type; }
;
EDIT:
Latest parser section:
absExpr returns [string ret_type]
: ABS^ OPEN additiveExpr CLOSE
{$ret_type = $additiveExpr.ret_type; }
| (OPEN additiveExpr CLOSE)=> OPEN additiveExpr CLOSE
{$ret_type = $additiveExpr.ret_type; }
| additiveExpr
{$ret_type = $additiveExpr.ret_type; }
;
I would probably try something like the following instead of using semantic predicates. Just make sure not to use the ! operator in the additiveExpr rule in your parser grammar, or the OPEN and CLOSE tokens will not be available for your tree walker to use.
additiveExpr
: ^(PLUS OPEN op1=expression op2=expression CLOSE)
-> plusb(op1={$op1.st},op2={$op2.st})
| ^(PLUS op1=expression op2=expression)
-> plus(op1={$op1.st},op2={$op2.st})
| ^(MINUS OPEN op1=expression op2=expression CLOSE)
-> minusb(op1={$op1.st},op2={$op2.st})
| ^(MINUS op1=expression op2=expression)
-> minus(op1={$op1.st},op2={$op2.st})
;
Previous answer:
It's hard for me to tell exactly how you want this to work. One thing that might be a problem is this initialization:
scope { bool aFlag }
@init {bool aFlag = true; }
This declares 2 different variables with the same name aFlag. One of them will appear in the scope class, and the other will be a local variable in the additiveExpr method. You may have meant to do this instead:
scope { bool aFlag }
@init { $additiveExpr::aFlag = true; }

variable not passed to predicate method in ANTLR

The Java code generated by ANTLR usually has one method per rule. But for the following rule:
switchBlockLabels[ITdcsEntity _entity,TdcsMethod _method,List<IStmt> _preStmts]
: ^(SWITCH_BLOCK_LABEL_LIST switchCaseLabel[_entity, _method, _preStmts]* switchDefaultLabel? switchCaseLabel*)
;
it generates a submethod named synpred125_TreeParserStage3_fragment(), in which the method switchCaseLabel(_entity, _method, _preStmts) is called:
synpred125_TreeParserStage3_fragment(){
......
switchCaseLabel(_entity, _method, _preStmts);//variable not found error
......
}
switchBlockLabels(ITdcsEntity _entity,TdcsMethod _method,List<IStmt> _preStmts){
......
synpred125_TreeParserStage3_fragment();
......
}
The problem is switchCaseLabel has parameters and the parameters come from the parameters of switchBlockLabels() method, so "variable not found error" occurs.
How can I solve this problem?
My guess is that you've enabled global backtracking in your grammar like this:
options {
backtrack=true;
}
in which case you can't pass parameters to ambiguous rules. In order to communicate between ambiguous rules when you have enabled global backtracking, you must use rule scopes. The "predicate-methods" do have access to rule scopes variables.
A demo
Let's say we have this ambiguous grammar:
grammar Scope;
options {
backtrack=true;
}
parse
: atom+ EOF
;
atom
: numberOrName+
;
numberOrName
: Number
| Name
;
Number : '0'..'9'+;
Name : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
(for the record, the atom+ and numberOrName+ make it ambiguous)
If you now want to pass information between the parse and numberOrName rule, say an integer n, something like this will fail (which is the way you tried it):
grammar Scope;
options {
backtrack=true;
}
parse
@init{int n = 0;}
: (atom[++n])+ EOF
;
atom[int n]
: (numberOrName[n])+
;
numberOrName[int n]
: Number {System.out.println(n + " = " + $Number.text);}
| Name {System.out.println(n + " = " + $Name.text);}
;
Number : '0'..'9'+;
Name : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
In order to do this using rule scopes, you could do it like this:
grammar Scope;
options {
backtrack=true;
}
parse
scope{int n; /* define the scoped variable */ }
@init{$parse::n = 0; /* important: initialize the variable! */ }
: atom+ EOF
;
atom
: numberOrName+
;
numberOrName /* increment and print the scoped variable from the parse rule */
: Number {System.out.println(++$parse::n + " = " + $Number.text);}
| Name {System.out.println(++$parse::n + " = " + $Name.text);}
;
Number : '0'..'9'+;
Name : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
Test
If you now run the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "foo 42 Bar 666";
ScopeLexer lexer = new ScopeLexer(new ANTLRStringStream(src));
ScopeParser parser = new ScopeParser(new CommonTokenStream(lexer));
parser.parse();
}
}
you will see the following being printed to the console:
1 = foo
2 = 42
3 = Bar
4 = 666
P.S.
I don't know what language you're parsing, but enabling global backtracking is usually overkill and can have quite an impact on the performance of your parser. Computer languages often are ambiguous in just a few cases. Instead of enabling global backtracking, you really should look into adding syntactic predicates, or enabling backtracking on those rules that are ambiguous. See The Definitive ANTLR Reference for more info.

Handle hidden channel in antlr 3

I am writing an ANTLR grammar for translating one language to another but the documentation on using the HIDDEN channel is very scarce. I cannot find an example anywhere. The only thing I have found is the FAQ on www.antlr.org which tells you how to access the hidden channel but not how best to use this functionality. The target language is Java.
In my grammar file, I pass whitespace and comments through like so:
// Send runs of space and tab characters to the hidden channel.
WHITESPACE
: (SPACE | TAB)+ { $channel = HIDDEN; }
;
// Single-line comments begin with --
SINGLE_COMMENT
: ('--' COMMENT_CHARS NEWLINE) {
$channel=HIDDEN;
}
;
fragment COMMENT_CHARS
: ~('\r' | '\n')*
;
// Treat runs of newline characters as a single NEWLINE token.
NEWLINE
: ('\r'? '\n')+ { $channel = HIDDEN; }
;
In my members section I have defined a method for writing hidden channel tokens to my output StringStream...
@members {
private int savedIndex = 0;
void ProcessHiddenChannel(TokenStream input) {
List<Token> tokens = ((CommonTokenStream)input).getTokens(savedIndex, input.index());
for(Token token: tokens) {
if(token.getChannel() == token.HIDDEN_CHANNEL) {
output.append(token.getText());
}
}
savedIndex = input.index();
}
}
Now to use this, I have to call the method after every single token in my grammar.
myParserRule
: MYTOKEN1 { ProcessHiddenChannel(input); }
MYTOKEN2 { ProcessHiddenChannel(input); }
;
Surely there must be a better way?
EDIT: This is an example of the input language:
-- -----------------------------------------------------------------
--
--
-- Name Description
-- ==================================
-- IFM1/183 Freq Spectrum Inversion
--
-- -----------------------------------------------------------------
PROCEDURE IFM1/183
TITLE "Freq Spectrum Inversion";
HELP
Freq Spectrum Inversion
ENDHELP;
PRIVILEGE CTRL;
WINDOW MANDATORY;
INPUT
$Input : #NO_YES
DEFAULT select %YES when /IFMS1/183.VALUE = %NO;
%NO otherwise
endselect
PROMPT "Spec Inv";
$Forced_Cmd : BOOLEAN
Default FALSE
Prompt "Forced Commanding";
DEFINE
&RetCode : #PSTATUS := %OK;
&msg : STRING;
&Input : BOOLEAN;
REQUIRE AVAILABLE(/IFMS1)
MSG "IFMS1 not available";
REQUIRE /IFMS1/001.VALUE = %MON_AND_CTRL
MSG "IFMS1 not in control mode";
BEGIN -- Procedure Body --
&msg := "IFMS1/183 -> " + toString($Input) + " : ";
-- pre-check
IF /IFMS1/183.VALUE = $Input
AND $Forced_Cmd = FALSE THEN
EXIT (%OK, MSG &msg + "already set");
ENDIF;
-- command
IF $Input = %YES THEN &Input:= TRUE;
ELSE &Input:= FALSE;
ENDIF;
SET &RetCode := SEND IFMS1.FREQPLAN
( $FreqSpecInv := &Input);
IF &RetCode <> %OK THEN
EXIT (&RetCode, MSG &msg + "command failed");
ENDIF;
-- verify
SET &RetCode := VERIFY /IFMS1/183.VALUE = $Input TIMEOUT '10';
IF &RetCode <> %OK THEN
EXIT (&RetCode, MSG &msg + "verification failed");
ELSE
EXIT (&RetCode, MSG &msg + "verified");
ENDIF;
END
Look into inheriting CommonTokenStream and feeding an instance of your subclass into ANTLR. From the code example that you give, I suspect that you might be interested in taking a look at the filter and the rewrite options available in version 3.
Also, take a look at this other related stack overflow question.
I have just been going through some of my old questions and thought it was worth responding with the final solution that worked the best. In the end, the best way to translate a language was to use StringTemplate. This takes care of re-indenting the output for you. There is a very good example called 'cminus' in the ANTLR example pack that shows how to use it.
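For anyone who still wants the token-stream route suggested above, the idea can be sketched in a few lines of Python. This is a toy and not the ANTLR runtime: `EchoingTokenStream`, its `consume` method and the (channel, text) pairs are all hypothetical, and a real CommonTokenStream subclass would also have to cope with parser lookahead. The point is only that the stream itself copies hidden-channel tokens to the output as it skips them, so the grammar does not need a call after every token.

```python
DEFAULT, HIDDEN = 0, 99  # channel numbers are arbitrary in this sketch


class EchoingTokenStream:
    """Hands out on-channel tokens; hidden ones skipped on the way are echoed."""

    def __init__(self, tokens, output):
        self.tokens = tokens  # list of (channel, text) pairs
        self.output = output  # list that collects the translated output
        self.index = 0

    def consume(self):
        """Return the next on-channel token text, or None at end of input."""
        while self.index < len(self.tokens):
            channel, text = self.tokens[self.index]
            self.index += 1
            if channel == HIDDEN:
                # Pass comments and whitespace straight through.
                self.output.append(text)
            else:
                return text
        return None


out = []
stream = EchoingTokenStream(
    [(HIDDEN, "-- pre-check\n"), (DEFAULT, "IF"), (HIDDEN, " "), (DEFAULT, "x")],
    out,
)
# Stand-in for the translation: lower-case every real token.
while (tok := stream.consume()) is not None:
    out.append(tok.lower())
print("".join(out))
```

This prints the comment line unchanged, followed by "if x".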