Yacc/bison: what's wrong with my syntax equations? - syntax-error

I'm writing a "compiler" of sorts: it reads a description of a game (with rooms, characters, things, etc.) Think of it as a visual version of an Adventure-style game, but with much simpler problems.
When I run my "compiler" I'm getting a syntax error on my input, and I can't figure out why. Here's the relevant section of my yacc input:
character
: char-head general-text character-insides { PopChoices(); }
;
character-insides
: LEFTBRACKET options RIGHTBRACKET
;
char-head
: char-namesWT opt-imgsWT char-desc opt-cond
;
char-desc
: general-text { SetText($1); }
;
char-namesWT
: DOTC ID WORD { AddCharacter($3, $2); expect(EXP_TEXT); }
;
opt-cond
: %empty
| condition
;
condition
: condition-reason condition-main general-text
{ AddCondition($1, $2, $3); }
;
condition-reason
: DOTU { $$ = 'u'; }
| DOTV { $$ = 'v'; }
;
condition-main
: money-conditionWT
| have-conditionWT
| moves-conditionWT
| flag-conditionWT
;
have-conditionWT
: PERCENT_SLASH opt-bang ID
{ $$ = MkCondID($1, $2, $3) ; expect(EXP_TEXT); }
;
opt-bang
: %empty { $$ = TRUE; }
| BANG { $$ = FALSE; }
;
ID: WORD
;
Things in all caps are terminal symbols, things in lower or mixed case are non-terminals. If a non-terminal ends in WT, then it "wants text". That is, it expects that what comes after it may be arbitrary text.
Background: I have written my own token recognizer in C++ because(*) I want the syntax to be able to change the way the lexer's behavior. Two types of tokens should be matched only when the syntax expects them: FILENAME (with slashes and other non-alphameric characters) and TEXT, which means "all the text from here to the end of the line" (but not starting with certain keywords).
The function "expect" tells the lexer when to look for these two symbols. The expectation is reset to EXP_NORMAL after each token is returned.
I have added code to yylex that prints out the tokens as it recognizes them, and it looks to me like the tokenizer is working properly -- returning the tokens I expect.
(*) Also because I want to be able to ask the tokenizer for the column where the error occurred, and get the contents of the line being scanned at the time so I can print out a more useful error message.
Here is the relevant part of the input:
.c Wendy wendy
OK, now you caught me, what do you want to do with me?
.u %/lasso You won't catch me like that.
[
Here is the last part of the debugging output from yylex:
token: 262: DOTC/
token: 289: WORD/Wendy
token: 289: WORD/wendy
token: 292: TEXT/OK, now you caught me, what do you want to do with me?
token: 286: DOTU/
token: 274: PERCENT_SLASH/%/
token: 289: WORD/lasso
token: 292: TEXT/You won't catch me like that.
token: 269: LEFTBRACKET/
here's my error message:
: line 124, columns 3-4: syntax error, unexpected LEFTBRACKET, expecting TEXT
[
To help you understand the equations above, here is the relevant part of the description of the input syntax that I wrote the yacc code from.
// Character:
// .c id charactername,[imagename,[animationname]]
// description-text
// .u condition on the character being usable [optional]
// .v condition on the character being visible [optional]
// [
// (options)
// ]
// Conditions:
// %$[-]n Must [not] have at least n dollars
// %/[-]name Must [not] have named thing
// %t-nnn At/before specified number of moves
// %t+nnn At/after specified number of moves
// %#[-]name named flag must [not] be set
// Condition-char: $, /, t, or #, as described above
//
// Condition:
// % condition-char (identifier/int) ['/' text-if-fail ]
// description-text: Can be either on-line text or multi-line text
// On-line text is the rest of the line
brackets mark optional non-terminals, but a bracket standing alone (represented by LEFTBRACKET and RIGHTBRACKET in the yacc) is an actual token, e.g.
// [
// (options)
// ]
above.
What am I doing wrong?

To debug parsing problems in your grammar, you need to understand the shift/reduce machine that yacc/bison produces (described in the .output file produced with the -v option), and you need to look at the trail of states that the parser goes through to reach the problem you see.
To enable debugging code in the parser (which can print the states and the shift and reduce actions as they occur), you need to compile with -DYYDEBUG or put #define YYDEBUG 1 in the top of your grammar file. The debugging code is controlled by the global variable yydebug -- set to non-zero to turn on the trace and zero to turn it off. I often use the following in main:
#ifdef YYDEBUG
extern int yydebug;
if (char *p = getenv("YYDEBUG"))
yydebug = atoi(p);
#endif
Then you can include -DYYDEBUG in your compiler flags for debug builds and turn on the debugging code by something like setenv YYDEBUG 1 to set the envvar prior to running your program.

I suppose your syntax error message was generated by bison. What is striking is that it claims to have found a LEFTBRACKET when it expects a [. Naively, you might expect it to be satisfied with the LEFTBRACKET it found, but of course bison knows nothing about LEFTBRACKET except its numeric value, which will be some integer larger than 256.
The only reason bison might expect [ is if your grammar includes the terminal '['. But since your scanner seems to return LEFTBRACKET when it sees a [, the parser will never see '['.

Related

How can I use the perl6 regex metasyntax, <foo regex>?

In perl6 grammars, as explained here (note, the design documents are not guaranteed to be up-to-date as the implementation is finished), if an opening angle bracket is followed by an identifier then the construct is a call to a subrule, method or function.
If the character following the identifier is an opening paren, then it's a call to a method or function eg: <foo('bar')>. As explained further down the page, if the first char after the identifier is a space, then the rest of the string up to the closing angle will be interpreted as a regex argument to the method - to quote:
<foo bar>
is more or less equivalent to
<foo(/bar/)>
What's the proper way to use this feature? In my case, I'm parsing line oriented data and I'm trying to declare a rule that will instigate a seperate search on the current line being parsed:
#!/usr/bin/env perl6
# use Grammar::Tracer ;
grammar G {
my $SOLpos = -1 ; # Start-of-line pos
regex TOP { <line>+ }
method SOLscan($regex) {
# Start a new cursor
my $cur = self."!cursor_start_cur"() ;
# Set pos and from to start of the current line
$cur.from($SOLpos) ;
$cur.pos($SOLpos) ;
# Run the given regex on the cursor
$cur = $regex($cur) ;
# If pos is >= 0, we found what we were looking for
if $cur.pos >= 0 {
$cur."!cursor_pass"(self.pos, 'SOLscan')
}
self
}
token line {
{ $SOLpos = self.pos ; say '$SOLpos = ' ~ $SOLpos }
[
|| <word> <ws> 'two' { say 'matched two' } <SOLscan \w+> <ws> <word>
|| <word>+ %% <ws> { say 'matched words' }
]
\n
}
token word { \S+ }
token ws { \h+ }
}
my $mo = G.subparse: q:to/END/ ;
hello world
one two three
END
As it is, this code produces:
$ ./h.pl
$SOLpos = 0
matched words
$SOLpos = 12
matched two
Too many positionals passed; expected 1 argument but got 2
in method SOLscan at ./h.pl line 14
in regex line at ./h.pl line 32
in regex TOP at ./h.pl line 7
in block <unit> at ./h.pl line 41
$
Line 14 is $cur.from($SOLpos). If commented out, line 15 produces the same error. It appears as though .pos and .from are read only... (maybe :-)
Any ideas what the proper incantation is?
Note, any proposed solution can be a long way from what I've done here - all I'm really wanting to do is understand how the mechanism is supposed to be used.
It does not seem to be in the corresponding directory in roast, so that would make it a "Not Yet Implemented" feature, I'm afraid.

YACC: Can I generate "syntax error" from my "semantic" processing?

I'm writing a program in YACC and C/C++. It parses a fairly simple grammar and stores the results in some tables.
I have rules like
room: DOTR ID roomname { AddRoom($3, $2); };
and the code for AddRoom is:
void AddRoom(const char* name, const char* id)
{
theRoom = (void)new GameRoom(name, id);
if (!theGame->addRoom(theRoom)) {
?????
}
}
???? would be where I would insert code to generate a syntax error (I hope).
The purpose of this code is that every object in the game (rooms, doors, NPCs, things) has a unique ID. If theGame->addRoom detects that the ID is not unique, it will return false, and I want yacc to display an error message at that point in the input -- just as if an illegal token had been there.
Just call yyerror(), and remember that there was an error so you don't proceed to later stages. But you do not want to treat this as a syntax error: otherwise you will cause the parser to start discarding tokens etc.

How to get the original text using $text at the same time rule returns value

So I have the following grammar:
top_cmd :cmds
{
std::cout << $cmds.text << std::endl;
}
;
cmds returns [char* str]
: cmd+
{
str = new char('a');
}
;
I get g++ compile error:
"str" was not declared in this scope
If I remove this line
std::cout << $cmds.text << std::endl;
Then compile is fine.
I googled how "$text" is used, it seems to me it is expected to use $text for the purpose of rewrite rules. In my example, the function "cmds" returns "char*" when I remove the offending line and some complex structure when I keep it.
I can think of the following workaround:
1. do not have lower level rules return anything, but pass variable into lower level rules.
2. use re-write rules
But both are pretty big change(my real project is fairly large) considering how much time budge I have.
So is there a short-cut? Basically I do not want to change the grammar of top_cmd and cmds, but I can get the full text of cmds' matching.
I am using ANTLR3C, but I believe this is independent of target language.
I think you should do something like this:
top_cmd :cmds
{
std::cout << $cmds.text << std::endl;}
;
cmds returns [char* str]
: cmd+
{
$str = new char('a'); //added $
}
;
$cmds.text will return the text matched for the rule cmds. If you want to return the str, you should change it for $cmds.str
$text: The text matched for a rule or the text matched
from the start of the rule up until the point of
the $text expression evaluation. Note that this
includes the text for all tokens including those
on hidden channels, which is what you want
because usually that has all the whitespace and
comments. When referring to the current rule,
this attribute is available in any action including
any exception actions.
from the Definitive Antrl reference

Lex and Yacc do not report an error when an unexpected character is parsed.

Lex and Yacc are not reporting an error when an unexpected character is parsed. In the code below, there is no error when #set label sample is parsed, but the # is not valid.
Lex portion of code
identifier [\._a-zA-Z0-9\/]+
<INITIAL>{s}{e}{t} {
return SET;
}
<INITIAL>{l}{a}{b}{e}{l} {
return LABEL;
}
<INITIAL>{i}{d}{e}{n}{t}{i}{f}{i}{e}{r} {
strncpy(yylval.str, yytext,1023);
yylval.str[1023] = '\0';
return IDENTIFIER;
}
Yacc portion of code.
definition : SET LABEL IDENTIFIER
{
cout<<"set label "<<$3<<endl;
};
When #set sample label is parsed, there should be an error reported because # is an unexpected character. But there is no error reported. How should I modify the code so an error is reported?
(Comments converted to a SO style Q&A format)
#JonathanLeffler wrote:
That's why you need a default rule in the lexical analyzer (typically the LHS is .) that arranges for an error to be reported. Without it, the default action is just to echo the unmatched character and proceed onwards with the next one.
At the least you would want to include the specific character that is causing trouble in the error message. You might well want to return it as a single-character token, which will generally trigger an error in the grammar. So:
<*>. { cout << "Error: unexpected character " << yytext << endl; return *yytext; }
might be appropriate.

ANTLR: removing clutter

i'm learning ANTLR right now. Let's say, I have a VHDL code and would like to do some processing on the PROCESS blocks. The rest should be completely ignored. I don't want to describe the whole VHDL language, since I'm interested only in the process blocks. So I could write a rule that matches process blocks. But how do I tell ANTLR to match only the process block rule and ignore anything else?
I know next to no VHDL, so let's say you want to replace all single line comments in a (Java) source file with multi-line comments:
//foo
should become:
/* foo */
You need to let the lexer match single line comments, of course. But you should also make sure it recognizes multi-line comments because you don't want //bar to be recognized as a single line comment in:
/*
//bar
*/
The same goes for string literals:
String s = "no // comment";
Finally, you should create some sort of catch-all rule in the lexer that will match any character.
A quick demo:
grammar T;
parse
: (t=. {System.out.print($t.text);})* EOF
;
Str
: '"' ('\\' . | ~('\\' | '"'))* '"'
;
MLComment
: '/*' .* '*/'
;
SLComment
: '//' ~('\r' | '\n')*
{
setText("/* " + getText().substring(2) + " */");
}
;
Any
: . // fall through rule, matches any character
;
If you now parse input like this:
//comment 1
class Foo {
//comment 2
/*
* not // a comment
*/
String s = "not // a // comment"; //comment 3
}
the following will be printed to your console:
/* comment 1 */
class Foo {
/* comment 2 */
/*
* not // a comment
*/
String s = "not // a // comment"; /* comment 3 */
}
Note that this is just a quick demo: a string literal in Java could contain Unicode escapes, which my demo doesn't support, and my demo also does not handle char-literals (the char literal char c = '"'; would break it). All of these things are quite easy to fix, of course.
In the upcoming ANTLR v4, you can do fuzzy parsing. take a look at
http://www.antlr.org/wiki/display/ANTLR4/Wildcard+Operator+and+Nongreedy+Subrules
You can get the beta software here:
http://antlr.org/download/antlr-4.0b3-complete.jar
Terence