I'm writing a program in YACC and C/C++. It parses a fairly simple grammar and stores the results in some tables.
I have rules like
room: DOTR ID roomname { AddRoom($3, $2); };
and the code for AddRoom is:
void AddRoom(const char* name, const char* id)
{
    theRoom = (void)new GameRoom(name, id);
    if (!theGame->addRoom(theRoom)) {
        ?????
    }
}
???? would be where I would insert code to generate a syntax error (I hope).
The purpose of this code is that every object in the game (rooms, doors, NPCs, things) has a unique ID. If theGame->addRoom detects that the ID is not unique, it will return false, and I want yacc to display an error message at that point in the input -- just as if an illegal token had been there.
Just call yyerror(), and remember that there was an error so you don't proceed to later stages. But you do not want to treat this as a syntax error: otherwise you will cause the parser to start discarding tokens etc.
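A minimal sketch of that idea follows. It assumes a hypothetical wrapper `semantic_error`, a hypothetical counter `semantic_error_count`, and a hypothetical check `parse_was_clean`, none of which appear in the question. The `yyerror` shown is only a stand-in for the one you already supply to yacc.

```c
#include <stdio.h>

/* Number of semantic errors reported so far (hypothetical name). */
int semantic_error_count = 0;

/* Stand-in for the yyerror() you already supply to yacc. */
void yyerror(const char *msg)
{
    fprintf(stderr, "error: %s\n", msg);
}

/* Report a semantic error without disturbing the parser: the message goes
 * through yyerror(), the error is counted, and parsing simply continues
 * because no error recovery is triggered. */
void semantic_error(const char *msg)
{
    yyerror(msg);
    semantic_error_count++;
}

/* Later stages should run only if this returns nonzero. */
int parse_was_clean(void)
{
    return semantic_error_count == 0;
}
```

In AddRoom, the ????? line could then be a call such as semantic_error("duplicate room ID"), and main could check parse_was_clean() after yyparse() returns before moving on to later stages.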
Here in this example I tried to capture two Int values and then capture them together as a struct. This gives a "Thread 1: signal SIGABRT" error.
(NOTE: I know that my example could be fixed by simply not nesting the Captures and handling the pattern matching differently. This is just simplified example code for the sake of this question.)
let intCapture = Regex {
    TryCapture(as: Reference(Int.self)) {
        OneOrMore(.digit)
    } transform: { result in
        return Int(result)
    }
}
let mainPattern = Regex {
    TryCapture(as: Reference(Floor.self)) {
        "floor #: "
        intCapture
        " has "
        intCapture
        " rooms"
    } transform: { ( stringMatch, floorInt, roomInt ) in
        return Floor(floorNumber: floorInt, roomCount: roomInt)
    }
}
struct Floor {
    let floorNumber: Int
    let roomCount: Int
}
let testString = "floor #: 34 has 25 rooms"
let floorData = testString.firstMatch(of: mainPattern)
After looking into it, I found that in the mainPattern's 'transform' the 'floorInt' and 'roomInt' are what is causing the problem.
The funny part is that when you look at the 'Quick Help'/Option+click, it shows that they are both type Int! It knows what is there but you are not able to capture it!
Further, when I erase one of them, let's say 'floorInt', it gives this error:
Contextual closure type '(Substring) throws -> Floor?' expects 1 argument, but 2 were used in closure body
So really, even though for SOME reason it does know that there are two captured Int values there, it doesn't let you access them for the sake of the transform.
Not deterred, I was helped out in another question by a very helpful user who pointed me to the Evolution submission where they mentioned a .mapOutput, but sadly it seems this particular feature was never implemented!
Is there no real way to create a new transformed value from nested transformed values like this? Any help would be greatly appreciated.
I'm writing a "compiler" of sorts: it reads a description of a game (with rooms, characters, things, etc.) Think of it as a visual version of an Adventure-style game, but with much simpler problems.
When I run my "compiler" I'm getting a syntax error on my input, and I can't figure out why. Here's the relevant section of my yacc input:
character
: char-head general-text character-insides { PopChoices(); }
;
character-insides
: LEFTBRACKET options RIGHTBRACKET
;
char-head
: char-namesWT opt-imgsWT char-desc opt-cond
;
char-desc
: general-text { SetText($1); }
;
char-namesWT
: DOTC ID WORD { AddCharacter($3, $2); expect(EXP_TEXT); }
;
opt-cond
: %empty
| condition
;
condition
: condition-reason condition-main general-text
{ AddCondition($1, $2, $3); }
;
condition-reason
: DOTU { $$ = 'u'; }
| DOTV { $$ = 'v'; }
;
condition-main
: money-conditionWT
| have-conditionWT
| moves-conditionWT
| flag-conditionWT
;
have-conditionWT
: PERCENT_SLASH opt-bang ID
{ $$ = MkCondID($1, $2, $3) ; expect(EXP_TEXT); }
;
opt-bang
: %empty { $$ = TRUE; }
| BANG { $$ = FALSE; }
;
ID: WORD
;
Things in all caps are terminal symbols, things in lower or mixed case are non-terminals. If a non-terminal ends in WT, then it "wants text". That is, it expects that what comes after it may be arbitrary text.
Background: I have written my own token recognizer in C++ because(*) I want the syntax to be able to change the lexer's behavior. Two types of tokens should be matched only when the syntax expects them: FILENAME (with slashes and other non-alphanumeric characters) and TEXT, which means "all the text from here to the end of the line" (but not starting with certain keywords).
The function "expect" tells the lexer when to look for these two symbols. The expectation is reset to EXP_NORMAL after each token is returned.
I have added code to yylex that prints out the tokens as it recognizes them, and it looks to me like the tokenizer is working properly -- returning the tokens I expect.
(*) Also because I want to be able to ask the tokenizer for the column where the error occurred, and get the contents of the line being scanned at the time so I can print out a more useful error message.
Here is the relevant part of the input:
.c Wendy wendy
OK, now you caught me, what do you want to do with me?
.u %/lasso You won't catch me like that.
[
Here is the last part of the debugging output from yylex:
token: 262: DOTC/
token: 289: WORD/Wendy
token: 289: WORD/wendy
token: 292: TEXT/OK, now you caught me, what do you want to do with me?
token: 286: DOTU/
token: 274: PERCENT_SLASH/%/
token: 289: WORD/lasso
token: 292: TEXT/You won't catch me like that.
token: 269: LEFTBRACKET/
Here's my error message:
: line 124, columns 3-4: syntax error, unexpected LEFTBRACKET, expecting TEXT
[
To help you understand the grammar rules above, here is the relevant part of the description of the input syntax that I wrote the yacc code from.
// Character:
// .c id charactername,[imagename,[animationname]]
// description-text
// .u condition on the character being usable [optional]
// .v condition on the character being visible [optional]
// [
// (options)
// ]
// Conditions:
// %$[-]n Must [not] have at least n dollars
// %/[-]name Must [not] have named thing
// %t-nnn At/before specified number of moves
// %t+nnn At/after specified number of moves
// %#[-]name named flag must [not] be set
// Condition-char: $, /, t, or #, as described above
//
// Condition:
// % condition-char (identifier/int) ['/' text-if-fail ]
// description-text: Can be either on-line text or multi-line text
// On-line text is the rest of the line
Brackets mark optional non-terminals, but a bracket standing alone (represented by LEFTBRACKET and RIGHTBRACKET in the yacc) is an actual token, e.g.
// [
// (options)
// ]
above.
What am I doing wrong?
To debug parsing problems in your grammar, you need to understand the shift/reduce machine that yacc/bison produces (described in the .output file produced with the -v option), and you need to look at the trail of states that the parser goes through to reach the problem you see.
To enable debugging code in the parser (which can print the states and the shift and reduce actions as they occur), you need to compile with -DYYDEBUG or put #define YYDEBUG 1 in the top of your grammar file. The debugging code is controlled by the global variable yydebug -- set to non-zero to turn on the trace and zero to turn it off. I often use the following in main:
#ifdef YYDEBUG
    extern int yydebug;
    if (char *p = getenv("YYDEBUG"))
        yydebug = atoi(p);
#endif
Then you can include -DYYDEBUG in your compiler flags for debug builds and turn on the debugging code by something like setenv YYDEBUG 1 to set the envvar prior to running your program.
I suppose your syntax error message was generated by bison. What is striking is that it claims to have found a LEFTBRACKET when it expects a [. Naively, you might expect it to be satisfied with the LEFTBRACKET it found, but of course bison knows nothing about LEFTBRACKET except its numeric value, which will be some integer larger than 256.
The only reason bison might expect [ is if your grammar includes the terminal '['. But since your scanner seems to return LEFTBRACKET when it sees a [, the parser will never see '['.
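To illustrate the distinction, here is a small sketch. The value 269 for LEFTBRACKET is taken from the yylex trace in the question. The enum and the helper `same_token` are illustrative only and are not what bison generates verbatim.

```c
/* Named tokens are numbered above the character range; 269 is the value
 * shown for LEFTBRACKET in the yylex debugging output. */
enum { LEFTBRACKET = 269 };

/* A character-literal token such as '[' is simply its character code
 * (91 in ASCII), so the parser sees it as a different terminal. */
int same_token(int a, int b)
{
    return a == b;
}
```

So a grammar written with '[' and a scanner that returns LEFTBRACKET are talking about two different terminals, even though both stand for the same input character.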
My grammar allows:
C → id := E // assign a value/expression to a variable (VAR)
C → print(id) // print variables(VAR) values
To get it done, my lex file is:
[a-z] {
    yylval.var_index=get_var_index(yytext);
    return VAR;
}
get_var_index returns the index of the variable in the list; if the variable does not exist, it creates one.
It is working!
The problem is:
Every time a variable is matched in the lex file, it creates an index for that variable.
I have to report if 'print(a)' is called and 'a' was not declared, and that will never happen since print(a) always creates an index for 'a'.
How can I solve it?
Piece of yacc file:
%union {
int int_val;
int var_index;
}
%token <int_val> INTEGER
%token <var_index> VAR
...
| PRINT '(' VAR ')'{
n_lines++;
printf("%d\n",values[$3]);
}
...
| VAR {$$ =values[$1];}
This does seem a bit like a Computer Science class homework question for us to do.
Normally one would not use bison/yacc in this way. One would do the parse with bison/yacc and build a parse tree, which is then walked to perform semantic checks such as declaration before use. The identifiers would normally be managed in a symbol table rather than just a table of values, so that other attributes, such as whether a name has been declared, can be tracked. It's for these reasons that it looks like an exercise rather than a realistic application of the tools. OK; those disclaimers disposed of, let's get to an answer.
The problem would be solved by remembering what has been declared and what not. If one does not plan to use a full symbol table then a simple array of booleans indicating which are the valid values could be used. The array can be initialised to false and set to true on declaration. This value can be checked when a variable is used. As C uses ints for boolean we can use that. The only changes needed are in the bison/yacc. You omitted any syntax for the declarations, but as you indicated they are declared there must be some. I guessed.
%union {
int int_val;
int var_index;
}
%{
int declared[MAX_TABLE_SIZE]; /* file-scope array, so it starts zeroed: nothing declared yet */
%}
%token <int_val> INTEGER
%token <var_index> VAR
...
| DECLARE '(' VAR ')' { n_lines++; declared[$3] = 1; }
...
| PRINT '(' VAR ')'{
n_lines++;
if (declared[$3]) printf("%d\n",values[$3]);
else printf("Variable undeclared\n");
}
...
| VAR {$$ =values[$1]; /* perhaps need to show more syntax to show how VAR used */}
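Pulled out of the grammar, the bookkeeping can be sketched in plain C as follows. The helper names `declare_var` and `print_var` and the table size of 26 are assumptions for illustration; only `values`, `declared`, and MAX_TABLE_SIZE come from the fragments above.

```c
#include <stdio.h>

#define MAX_TABLE_SIZE 26      /* assumed: one slot per letter a-z */

int values[MAX_TABLE_SIZE];    /* variable values, indexed by var_index */
int declared[MAX_TABLE_SIZE];  /* file-scope, so zeroed: nothing declared yet */

/* Would be called from the declaration (or assignment) action. */
void declare_var(int index, int value)
{
    declared[index] = 1;
    values[index] = value;
}

/* Would be called from the PRINT action.
 * Returns 1 if the value was printed, 0 if the variable is undeclared. */
int print_var(int index)
{
    if (!declared[index]) {
        printf("Variable undeclared\n");
        return 0;
    }
    printf("%d\n", values[index]);
    return 1;
}
```

The lexer can keep creating an index for every name it sees; only the flag in `declared` decides whether a use is legal.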
Lex and Yacc are not reporting an error when an unexpected character is parsed. In the code below, there is no error when #set label sample is parsed, but the # is not valid.
Lex portion of code
identifier [\._a-zA-Z0-9\/]+
<INITIAL>{s}{e}{t} {
return SET;
}
<INITIAL>{l}{a}{b}{e}{l} {
return LABEL;
}
<INITIAL>{identifier} {
strncpy(yylval.str, yytext,1023);
yylval.str[1023] = '\0';
return IDENTIFIER;
}
Yacc portion of code.
definition : SET LABEL IDENTIFIER
{
cout<<"set label "<<$3<<endl;
};
When #set label sample is parsed, there should be an error reported because # is an unexpected character. But there is no error reported. How should I modify the code so an error is reported?
(Comments converted to a SO style Q&A format)
@JonathanLeffler wrote:
That's why you need a default rule in the lexical analyzer (typically the LHS is .) that arranges for an error to be reported. Without it, the default action is just to echo the unmatched character and proceed onwards with the next one.
At the least you would want to include the specific character that is causing trouble in the error message. You might well want to return it as a single-character token, which will generally trigger an error in the grammar. So:
<*>. { cout << "Error: unexpected character " << yytext << endl; return *yytext; }
might be appropriate.
How can I fetch the row and column number of an error (i.e., which part of the string does not follow the grammar rules)?
I am using a yacc parser to check the grammar.
Thank you.
You'd better read the Dragon Book (by Aho et al.), which explains, with examples, how to write a lex/yacc-based compiler.
To get the line and column of the error, your lexer has to keep track of them. So in your lexer, declare two globals, SourceLine and SourceCol (of course you can use better, non-camel-cased names).
In each token production, you have to calculate the column of the produced token. For that purpose I use a macro as follows:
#define Return(a, b, c) \
{\
    SourceCol = (SourceCol + yyleng) * c; \
    DPRINT ("## Source line: %u, returned token: "a".\n", SourceLine); \
    return b; \
}
and the token production, with that macro, is:
"for" { Return("FOR", FOR, 1); }
Then, to keep track of lines, for each token that consists of newlines, I use:
{NEWLINES} {
    BEGIN(INITIAL);
    SourceLine += yyleng;
    Return("LINE", LINE, 0);
}
Then in your parser, you can get SourceCol and SourceLine if you declare those as extern globals:
extern unsigned int SourceCol;
extern unsigned int SourceLine;
and now in your parse_error grammar production, you can do:
parse_error : LEXERROR
{
printf("OMG! Your code sucks at line %u and col %u!", SourceLine, SourceCol);
}
Of course, you may want to add yytext, produce a more verbose error message, and so on. But all that's up to you!
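For comparison, the same bookkeeping can be sketched as an ordinary C function instead of a macro. SourceLine and SourceCol are the globals described above. The function name `track_position` and the 1-based line convention are assumptions, and the `text` and `len` arguments stand in for yytext and yyleng.

```c
#include <stddef.h>

unsigned int SourceLine = 1;  /* current line, 1-based (assumed convention) */
unsigned int SourceCol = 0;   /* characters consumed so far on this line */

/* Advance the position past one matched token. Each newline bumps the
 * line counter and resets the column; any other character advances the
 * column by one. */
void track_position(const char *text, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (text[i] == '\n') {
            SourceLine++;
            SourceCol = 0;
        } else {
            SourceCol++;
        }
    }
}
```

Calling such a function at the start of every lexer action would keep both counters current without each rule having to say whether it ends a line.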