yacc: e - line 85 of "tac.y", a token appears on the lhs of a production - yacc

What is the exact meaning of the error:
yacc: e - line 85 of "tac.y", a token appears on the lhs of a production
line 85:
T: INT { strcpy($$.type,"int"); }
|
REAL { strcpy($$.type,"real"); }
I have associated an attribute with T:
%union{
struct attribute{
char name[20];
char type[20];
}attr;
}
%token <attr> ID
%token <attr> E
%token <attr> T

It means what it says: you are attempting to provide a derivation for a token. Tokens come from the lexical analysis, so they cannot have grammatical rules associated with them.
I guess you meant to declare the (semantic value) type of T:
%type <attr> T
You probably need to change the declaration of E too.
Avoiding this kind of confusion is one of the reasons it is common to use ALL_CAPS for terminals (tokens) and lower-case for non-terminals.


Dropping a token in Yacc/Bison when an error is matched in the rhs production

I am writing a simple calculator in which an expression can be reduced to a statement, or list of statements. If a bad expression triggers a syntax error, I try to catch it with a production rule and then ignore it by giving the rule no actions. However, I believe this still reduces to a stmt_list despite it not being a valid statement.
Is there a way to make it simply ignore tokens matched with an error and prevent them from being reduced and used later?
The code below matches error ';' and reduces it to a stmt_list. It will then try to reduce that stmt_list with a valid expression, but since the first production was never called this will trigger a memory exception. My objective is to have Bison literally do nothing if an error is matched, such that a later valid expression can be the first reduction to stmt_list.
stmt_list:
expr ';' {
// Allocate memory for statement array
stmts = (float*)malloc(24 * sizeof(float));
// Assign pointer for array
$$ = stmts;
// At pointer, assign statement value
*$$ = $1;
// Increment pointer (for next job)
$$++;
}
| stmt_list expr ';' {
$$ = $1;
*$$ = $2;
$$++;
}
| error ';' { } // Do nothing (ignore bad stmt)
| stmt_list error ';' { } // Do nothing (ignore bad stmt)
;
If you supply no action for a rule, bison/yacc provides the default action $$ = $1.
In fact, you are not providing no action. You are providing an explicit action which does nothing. As it happens, if you use the C template, the parser will still perform the default action. In other templates, an action which does not assign a value to $$ might provoke a warning during parser generation. But it certainly won't modify your data structures so as to nullify the action. It can't know what that means. If you know, you should write it as the action :-) .
It's not 100% clear to me why you are keeping the results of the evaluations in a fixed-size dynamically-allocated array. You make no attempt to detect when the array fills up, so it's entirely possible that you'll end up overflowing the allocation and overwriting random memory. Moreover, using a global like this isn't usually a good idea because it prevents you from building more than one list at the same time. (For example, if you wanted to implement function calls, since a function's arguments are also a list of expressions.)
On the whole, it's better to put the implementation of the expanding expression list in a simple API which is implemented elsewhere. Here, I'm going to assume that you've done that; for specificity, I'll assume the following API (although it's just one example):
/* The list header structure, which contains all the information
* necessary to use the list. The forward declaration makes it
* possible to use pointers to ExprList objects without having to
* expose its implementation details.
*/
typedef struct ExprList ExprList;
/* Creates a new empty expression-list and returns a pointer to its header. */
ExprList* expr_list_create(void);
/* Resizes the expression list to the supplied size. If the list
* currently has fewer elements, new elements with default values are
* added at the end. If it currently has more elements, the excess
* ones are discarded. Calling with size 0 empties the list (but
* doesn't delete it).
*/
int expr_list_resize(ExprList* list, int new_length);
/* Frees all storage associated with the expression list. The
* argument must have been created with expr_list_create, and its
* value must not be used again after this function returns.
*/
void expr_list_free(ExprList* list);
/* Adds one element to the end of the expression-list.
* I kept the float datatype for expression values, although I
* strongly believe that it's not ideal. But an advantage of using an
* API like this is that it is easier to change.
*/
void expr_list_push(ExprList* list, float value);
/* Returns the number of elements in the expression-list. */
int expr_list_len(ExprList* list);
/* Returns the address of the element in the expression list
* with the given index. If the index is out of range, behaviour
* is undefined; a debugging implementation will report an error.
*/
float* expr_list_at(ExprList* list, int index);
With that API, we can rewrite the productions for valid expressions:
stmt_list: expr ';' { $$ = expr_list_create();
expr_list_push($$, $1);
}
| stmt_list expr ';' { $$ = $1;
/* expr is the second symbol on the right-hand side, so its value is $2 */
expr_list_push($$, $2);
}
Now for the error cases. You have two error rules; one triggers when the error is at the beginning of a list, and the other when the error is encountered after one or more (possibly erroneous) expressions have been handled. Both of these are productions for stmt_list so they must have the same value type as stmt_list does (ExprList*). Thus, they must do whatever you think is appropriate when a syntax error is produced.
The first one, when the error is at the start of the list, only needs to create an empty list. It's hard to see what else it could do.
stmt_list: error ';' { $$ = expr_list_create(); }
It seems to me that there are at least two alternatives for the other error action, when an error is detected after the list has at least one successfully-computed value. One possibility is to ditch the erroneous item, leaving the rest of the list intact. This requires only the default action:
stmt_list: stmt_list error ';'
(Of course, you could add the action { $$ = $1; } explicitly, if you wanted to.)
The other possibility is to empty the entire list, so as to start from scratch with the next element:
stmt_list: stmt_list error ';' { $$ = $1;
expr_list_resize($$, 0);
}
There are undoubtedly other possibilities. As I said, bison cannot figure out what it is that you intended (and neither can I, really). You'll have to implement whatever behaviour you want.
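For completeness, here is a minimal sketch of how the API assumed above could be implemented with a growable array. The struct layout, the growth policy (doubling from 8), the meaning of the return value of expr_list_resize (0 on success, -1 on allocation failure) and the choice to abort when a push cannot allocate are all my assumptions; the API description above does not fix any of them.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct ExprList ExprList;
struct ExprList {
    float *items;  /* heap storage, NULL while capacity is 0 */
    int len;       /* number of elements in use */
    int cap;       /* number of elements allocated */
};

/* Returns a zeroed header (items == NULL, len == 0, cap == 0), or NULL. */
ExprList *expr_list_create(void) {
    return calloc(1, sizeof(ExprList));
}

/* Assumption: returns 0 on success, -1 if memory could not be allocated. */
int expr_list_resize(ExprList *list, int new_length) {
    if (new_length > list->cap) {
        int new_cap = list->cap ? list->cap : 8;
        while (new_cap < new_length) new_cap *= 2;
        float *p = realloc(list->items, (size_t)new_cap * sizeof *p);
        if (p == NULL) return -1;
        list->items = p;
        list->cap = new_cap;
    }
    /* Newly exposed elements get the default value 0. */
    for (int i = list->len; i < new_length; i++) list->items[i] = 0.0f;
    list->len = new_length;
    return 0;
}

void expr_list_free(ExprList *list) {
    if (list == NULL) return;
    free(list->items);
    free(list);
}

/* Assumption: since push returns void, an allocation failure aborts. */
void expr_list_push(ExprList *list, float value) {
    int n = list->len;
    if (expr_list_resize(list, n + 1) != 0) abort();
    list->items[n] = value;
}

int expr_list_len(ExprList *list) { return list->len; }

/* The "debugging implementation": an out-of-range index trips the assert. */
float *expr_list_at(ExprList *list, int index) {
    assert(index >= 0 && index < list->len);
    return &list->items[index];
}
```

Because the list grows on demand, the fixed 24-element allocation and the risk of writing past its end both go away, and the parser can hold as many independent lists as it needs.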

How to synthesise compiler testing data?

I am writing a simple compiler as school work. I am looking for an automated approach to generate both positive and negative test data for my compiler, given the formal grammar and other specifications. The language I am dealing with is of medium size, with 38 or so non-terminals. For the sake of illustration, here is a snapshot of the grammar:
program: const_decl* declaration* ENDMARKER
# statement
stmt: flow_stmt | '{' stmt* '}' | NAME [stmt_trailer] ';' | ';'
stmt_trailer: arglist | ['[' expr ']'] '=' expr
flow_stmt: if_stmt | for_stmt | while_stmt | read_stmt ';' | write_stmt ';' | return_stmt ';'
return_stmt: 'return' ['(' expr ')']
if_stmt: 'if' '(' condition ')' stmt ['else' stmt]
condition: expr ('<'|'<='|'>'|'>='|'!='|'==') expr | expr
for_stmt: ('for' '(' NAME '=' expr ';' condition ';'
NAME '=' NAME ('+'|'-') NUMBER ')' stmt)
Are there any tools that generate input files from the grammar? Hand-written tests are too tedious to write or too weak to discover problems. Here is an example of a program in this language:
void main() {
int N;
int temp;
int i, j;
int array_size;
reset_heap;
scanf(N);
for (i = 0; i < N; i = i + 1) {
scanf(array_size);
if (array_size > max_heap_size) {
printf("array_size exceeds max_heap_size");
} else {
for (j = 0; j < array_size; j = j + 1) {
scanf(temp);
heap[j] = temp;
}
heap_sort(array_size);
print_heap(array_size);
}
}
}
Generating controllable test data automatically could save days. Given the simplicity of the language, there must be some way to do this effectively.
Any pointer and insight is greatly appreciated.
This should have the subtopic of How to avoid combinatorial explosion when generating test data.
I would not be surprised if there are tools to do this, but having had the same need to generate test data for grammars, I have written a few one-off applications instead.
One of the best series of articles I have found on this is by Eric Lippert, Every Binary Tree There Is; think of the BNF converted to binary operators, and then to an AST, wherever you read "tree". However, he uses Catalan (every branch has two leaves) and when I wrote my app I preferred Motzkin (a branch can have one or two leaves).
Also he did his in C# with LINQ and I did mine in Prolog using DCG.
Generating the data based on the BNF or DCG is not hard, the real trick is to limit the area of expansion and the size of the expansion and to inject bad data.
By area of expansion, let's say you want to test nested if statements three levels deep, but you have to have valid code that compiles. Obviously you need the boilerplate code to make it compile; then you start changing the deeply nested if by adding or removing the else clause. So you need to put in constraints so that the boilerplate code is constant and the testing part is variable.
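As a rough illustration of constraining the area of expansion, here is a small C sketch (my own code was Prolog, so this is only an approximation of the idea). It varies just one thing, whether each level of a nested if has an else clause, and keeps everything else fixed. The names nested_if_case and emit_if, the condition a < b and the statements x = 1; and x = 0; are made up for the example; the generated text is meant to be dropped into constant boilerplate that declares a, b and x.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Appends one if statement nested `depth` levels deep to `out`.
 * Bit 0 of else_mask decides whether this level gets an else clause;
 * the remaining bits are passed down to the inner levels. */
static void emit_if(char *out, size_t cap, int depth, unsigned else_mask) {
    size_t used = strlen(out);
    if (depth == 0) {
        snprintf(out + used, cap - used, "x = 1;");
        return;
    }
    snprintf(out + used, cap - used, "if (a < b) { ");
    emit_if(out, cap, depth - 1, else_mask >> 1);
    used = strlen(out);
    snprintf(out + used, cap - used, "%s",
             (else_mask & 1u) ? " } else { x = 0; }" : " }");
}

/* Builds variant number `variant` (0 .. 2^depth - 1) into `out` and
 * returns the total number of variants for this depth. */
unsigned nested_if_case(char *out, size_t cap, int depth, unsigned variant) {
    assert(out != NULL && cap > 0 && depth >= 0 && depth < 16);
    out[0] = '\0';
    emit_if(out, cap, depth, variant);
    return 1u << depth;
}
```

Calling nested_if_case for every variant from 0 up to the returned count minus one gives all else/no-else combinations for that depth, which is eight programs for three levels rather than an open-ended expansion of the whole grammar.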
By size of expansion, let's say that you want to test conditional expressions. You can easily calculate that if you have many operators and you want to test them all in combinations, you soon run into combinatorial explosion. The trick is to ensure you test deep enough and with enough breadth, but not every combination. Again, the judicious use of constraints helps.
So the point of all of this is that you start with a tool that takes in the BNF and generates valid code. Then you modify the BNF to add constraints and modify the generator to understand the constraints to generate the code examples.
Then you modify the BNF for invalid data and likewise the generator to understand those rules.
After that is working you can then start layering on levels of automation.
If you do go this route and decide that you will have to learn Prolog, take a look at Mercury first. I have not done this with Mercury, but if I do it again Mercury is high on the list.
While my actual code is not public, this and this is the closest to it that is public.
Along the way I had some fun with it in Code Golf.
When generating terminals such as reserved words or values for types, you can use a predefined list with both valid and invalid data. For example, for the keyword if, if the language is case sensitive, I would include if, If, IF, iF, etc. in the list. For value types such as unsigned byte I would include -1, 0, 255 and 256.
When I was testing basic binary math expressions with +, -, * and ^, I generated all the tests using five basic numbers: -2, -1, 0, 1 and 2. I thought it would be useless since I already had hundreds of test cases, but since it took only a few minutes to generate all of the test cases and several hours to run them, I did it anyway, and to my surprise it found a pattern I had not covered. The point here is that, contrary to what most people say about having too many test cases, it is only time on a computer and a few changed constraints, so do run the large number of tests.
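The exhaustive run over a small operand set is easy to sketch. The following C fragment is only an illustration (again, my real generator was Prolog); the function names and the choice to parenthesize every operand are mine. It enumerates every "a op b" over the five numbers and four operators mentioned above, which gives 5 * 4 * 5 = 100 cases.

```c
#include <assert.h>
#include <stdio.h>

static const int OPERANDS[] = { -2, -1, 0, 1, 2 };
static const char OPERATORS[] = { '+', '-', '*', '^' };

#define COUNT(a) ((int)(sizeof(a) / sizeof((a)[0])))

/* Formats one test expression, e.g. "(-2) ^ (2)"; returns its length. */
int format_case(char *buf, size_t cap, int a, char op, int b) {
    assert(buf != NULL && cap > 0);
    return snprintf(buf, cap, "(%d) %c (%d)", a, op, b);
}

/* Calls emit (if not NULL) once per generated expression and returns
 * the number of expressions generated. */
int generate_binary_cases(void (*emit)(const char *expr)) {
    char buf[32];
    int count = 0;
    for (int i = 0; i < COUNT(OPERANDS); i++) {
        for (int j = 0; j < COUNT(OPERATORS); j++) {
            for (int k = 0; k < COUNT(OPERANDS); k++) {
                format_case(buf, sizeof buf,
                            OPERANDS[i], OPERATORS[j], OPERANDS[k]);
                if (emit != NULL) emit(buf);
                count++;
            }
        }
    }
    return count;
}
```

The emit callback is where you would write each expression into a test file. Adding an operand or an operator to the tables is the "changing a few constraints" step: the number of cases grows, but only machine time is spent.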

How to modify parsing grammar to allow assignment and non-assignment statements?

So the question is about the grammar below. I'm working on a mini-interpreted language for fun (we learned about some compiler design in class, so I want to take it to the next level and try something on my own). I'm stuck trying to make the non-terminal symbol Expr.
Statement ::= Expr SC
Expr ::= /* I need help here */
Assign ::= Name EQUAL Expr
AddSub ::= MulDiv {(+|-) AddSub}
MulDiv ::= Primary {(*|/) MulDiv}
Primary ::= INT | FLOAT | STR | LP Expr RP | Name
Name ::= ID {. Name}
Expr has to be made such that Statement must allow for the two cases:
x = 789; (regular assignment, followed by semicolon)
x+2; (no assignment, just calculation, discarded; followed by a semicolon)
The purpose of the second case is to setup the foundation for more changes in the future. I was thinking about unary increment and decrement operators, and also function calls; both of which don't require assignment to be meaningful.
I've looked at other grammars (C# namely), but it was too complicated and lengthy to understand. Naturally I'm not looking for solutions, but only for guidance on how I could modify my grammar.
All help is appreciated.
EDIT: I should say that my initial thought was Expr ::= Assign | AddSub, but that wouldn't work since it would create ambiguity since both could start with the non-terminal symbol Name. I have made my tokenizer such that it allows one token look ahead (peek), but I have not made such a thing for the non terminals, since it would be trying to fix a problem that could be avoided (ambiguity). In the grammar, the terminals are the ones that are all-caps.
The simplest solution is the one actually taken by the designers of C, and thus by the various C derivatives: treat assignment simply as yet another operator, without restricting it to being at the top-level of a statement. Hence, in C, the following is unproblematic:
while ((ch = getchar()) != EOF) { ... }
Not everyone will consider that good style, but it is certainly common (particularly in the clauses of the for statement, whose syntax more or less requires that assignment be an expression).
There are two small complications, both of which are relatively easy to handle:
Logically, and unlike most operators, assignment associates to the right so that a = b = 0 is parsed as a = (b = 0) and not (a = b) = 0 (which would be highly unexpected). It also binds very weakly, at least to the right.
Opinions vary as to how tightly it should bind to the left. In C, for the most part a strict precedence model is followed so that a = 2 + b = 3 is rejected since it is parsed as a = ((2 + b) = 3). a = 2 + b = 3 might seem like terrible style, but consider also a < b ? (x = a) : (y = a). In C++, where the result of the ternary operator can be a reference, you could write that as (a < b ? x : y) = a in which the parentheses are required even though assignment has lower precedence than the ternary operator.
None of these options are difficult to implement in a grammar, though.
In many languages, the left-hand side of an assignment has a restricted syntax. In C++, which has reference values, the restriction could be considered semantic, and I believe it is usually implemented with a semantic check, but in many C derivatives lvalue can be defined syntactically. Such definitions are unambiguous, but they are often not amenable to parsing with a top-down grammar, and they can create complications even for a bottom-up grammar. Doing the check post-parse is always a simple solution.
If you really want to distinguish assignment statements from expression statements, then you indeed run into the problem of prediction failure (not ambiguity) if you use a top-down parsing technique such as recursive descent. Since the grammar is not ambiguous, a simple solution is to use an LALR(1) parser generator such as bison/yacc, which has no problems parsing such a grammar since it does not require an early decision as to which kind of statement is being parsed. On the whole, the use of LALR(1) or even GLR parser generators simplifies implementation of a parser by allowing you to specify a grammar in a form which is easily readable and corresponds to the syntactic analysis. (For example, an LALR(1) parser can handle left-associative operators naturally, while an LL(1) grammar can only produce right-associative parses and therefore requires some kind of reconstruction of the syntax tree.)
A recursive descent parser is a computer program, not a grammar, and its expressiveness is thus not limited by the formal constraints of LL(1) grammars. That is both a strength and a weakness: the strength is that you can find solutions which are not limited by the limitations of LL(1) grammars; the weakness is that it is much more complicated (even, sometimes, impossible) to extract a clear statement about the precise syntax of the language. This power, for example, allows recursive descent parsers to handle left associativity in a more-or-less natural way despite the restriction mentioned above.
If you want to go down this road, then the solution is simple enough. You will have some sort of function:
/* This function parses and returns a single expression */
Node expr() {
Node left = value();
while (true) {
switch (lookahead) {
/* handle each possible operator token. I left out
* the detail of handling operator precedence since it's
* not relevant here
*/
case OP_PLUS: {
accept(lookahead);
left = MakeNode(OP_PLUS, left, value());
break;
}
/* If no operator found, return the current expression */
default:
return left;
}
}
}
That can easily be modified to parse both expressions and statements. First, refactor the function so that it parses the "rest" of an expression, given the first operand. (The only change is a new prototype and the deletion of the first line in the body.)
/* This function parses and returns a single expression
* after the first value has been parsed. The value must be
* passed as an argument.
*/
Node expr_rest(Node left) {
while (true) {
switch (lookahead) {
/* handle each possible operator token. I left out
* the detail of handling operator precedence since it's
* not relevant here
*/
case OP_PLUS: {
accept(lookahead);
left = MakeNode(OP_PLUS, left, value());
break;
}
/* If no operator found, return the current expression */
default:
return left;
}
}
}
With that in place, it is straightforward to implement both expr and stmt:
Node expr() {
return expr_rest(value());
}
Node stmt() {
/* Check lookahead for statements which start with
* a keyword. Omitted for simplicity.
*/
/* either first value in an expr or target of assignment */
Node left = value();
switch (lookahead) {
case OP_ASSIGN: {
accept(lookahead);
return MakeAssignment(left, expr());
}
/* Handle += and other mutating assignments if desired */
default: {
/* Not an assignment, just an expression */
return MakeExpressionStatement(expr_rest(left));
}
}
}

Do-While Loop in C doesn't repeat

This code doesn't repeat if I answer a negative number like "-1.01". How can I make it loop so that it will ask again for c?
#include <stdio.h>
main()
{
float c;
do {
printf("O hai! How much change is owed? ");
scanf("%.2f", &c);
} while (c < 0.0);
return(0);
}
The format strings for scanf are subtly different than those for printf. You are only allowed to have (as per C11 7.21.6.2 The fscanf function /3):
an optional assignment-suppressing character *.
an optional decimal integer greater than zero that specifies the maximum field width (in characters).
an optional length modifier that specifies the size of the receiving object.
a conversion specifier character that specifies the type of conversion to be applied.
Hence your format specifier becomes illegal the instant it finds the . character, which is not one of the valid options. As per /13 of that C11 section listed above:
If a conversion specification is invalid, the behaviour is undefined.
For input, you're better off using the most basic format strings so that the format is not too restrictive. A good rule of thumb in I/O is:
Be liberal in what you accept, specific in what you generate.
So, the code is better written as follows, including what a lot of people ignore, the possibility that the scanf itself may fail, resulting in an infinite loop:
#include <stdio.h>
int main (void) {
float c;
do {
printf ("O hai! How much change is owed? ");
if (scanf ("%f", &c) != 1) {
puts ("Error getting a float.");
break;
}
} while (c < 0.0f);
return 0;
}
If you're after a more general purpose input solution, where you want to allow the user to input anything, take care of buffer overflow, handle prompting and so on, every C developer eventually comes up with the idea that the standard ways of getting input all have deficiencies. So they generally go write their own so as to get more control.
For example, here's one that provides all that functionality and more.
Once you have the user's input as a string, you can examine and play with it as much as you like, including doing anything you would have done with scanf, by using sscanf instead (and being able to go back and do it again and again if initial passes over the data are unsuccessful).
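As a rough sketch of that line-then-parse approach (not the linked function, just a minimal illustration with names I made up), the following reads a whole line with fgets and then examines it with sscanf. The trailing " %c" conversion is a common trick: it only succeeds if something other than whitespace follows the number, so a return value of exactly 1 means the line held a single float and nothing else. The 128-byte buffer is an arbitrary choice, and longer lines would be seen as several shorter ones.

```c
#include <assert.h>
#include <stdio.h>

/* Parses a non-negative amount from one line of text.
 * Returns 1 on success, 0 if the line is not exactly one number
 * or if the number is negative. */
int parse_amount(const char *line, float *out) {
    float value;
    char extra;
    assert(line != NULL && out != NULL);
    /* 1 means: a float was read and only whitespace followed it. */
    if (sscanf(line, "%f %c", &value, &extra) != 1) return 0;
    if (!(value >= 0.0f)) return 0;  /* rejects negatives and NaN */
    *out = value;
    return 1;
}

/* Prompts until a valid amount is entered.
 * Returns 1 on success, 0 on end of file or a read error. */
int read_amount(FILE *in, float *out) {
    char line[128];
    for (;;) {
        printf("O hai! How much change is owed? ");
        fflush(stdout);
        if (fgets(line, sizeof line, in) == NULL) return 0;
        if (parse_amount(line, out)) return 1;
    }
}
```

Since a bad line is consumed by fgets before it is examined, the loop asks again after input such as -1.01 or abc instead of spinning on the same unread characters, and it still stops cleanly at end of file.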
scanf("%.2f", &c );
// ^^ <- This seems unnecessary here.
Please stick with the basics.
scanf("%f", &c);
If you want to limit the input field to at most 2 characters (in scanf the number is a maximum field width, not a count of decimal places),
scanf("%2f", &c);

Removing Left Recursion in ANTLR

As is explained in Removing left recursion , there are two ways to remove the left recursion.
Modify the original grammar to remove the left recursion using some procedure
Write the grammar originally not to have the left recursion
What do people normally use for removing (or not having) left recursion with ANTLR? I've used flex/bison for parsers, but I need to use ANTLR. The only thing I'm concerned about in using ANTLR (or an LL parser in general) is left recursion removal.
In practical sense, how serious of removing left recursion in ANTLR? Is this a showstopper in using ANTLR? Or, nobody cares about it in ANTLR community?
I like the idea of AST generation of ANTLR. In terms of getting AST quick and easy way, which method (out of the 2 removing left recursion methods) is preferable?
Added
I did some experiment with the following grammar.
E -> E + T|T
T -> T * F|F
F -> INT | ( E )
After left recursion removal, I get the following one
E -> TE'
E' -> null | + TE'
T -> FT'
T' -> null | * FT'
I could come up with the following ANTLR representation. Even though it's relatively simple and straightforward, it seems that writing the grammar without left recursion in the first place would be the better way to go.
grammar T;
options {
language=Python;
}
start returns [value]
: e {$value = $e.value};
e returns [value]
: t ep
{
$value = $t.value
if $ep.value != None:
$value += $ep.value
}
;
ep returns [value]
: {$value = None}
| '+' t r = ep
{
$value = $t.value
if $r.value != None:
$value += $r.value
}
;
t returns [value]
: f tp
{
$value = $f.value
if $tp.value != None:
$value *= $tp.value
}
;
tp returns [value]
: {$value = None}
| '*' f r = tp
{
$value = $f.value;
if $r.value != None:
$value *= $r.value
}
;
f returns [int value]
: INT {$value = int($INT.text)}
| '(' e ')' {$value = $e.value}
;
INT : '0'..'9'+ ;
WS: (' '|'\n'|'\r')+ {$channel=HIDDEN;} ;
Consider something like a typical parameter list:
parameter_list: parameter
| parameter_list ',' parameter
;
Since you don't care about anything like precedence or associativity with parameters, this is fairly easy to convert to right recursion, at the expense of adding an extra production:
parameter_list: parameter more_params
;
more_params:
| ',' parameter more_params
;
For the most serious cases, you might want to spend some time in the Dragon Book. Doing a quick check, this is covered primarily in chapter 4.
As far as seriousness goes, I'm pretty sure ANTLR simply won't accept a grammar that contains left recursion, which would put it into the "absolute necessity" category. (That holds for ANTLR 3 and earlier; ANTLR 4 accepts directly left-recursive rules.)
In practical sense, how serious of
removing left recursion in ANTLR? Is
this a showstopper in using ANTLR?
I think that you have a misunderstanding of left-recursion. It is a property of the grammar, not of the parser generator or the interaction between the parser generator and the specification. It happens when the first symbol on the right side of a rule is equal to the nonterminal corresponding to the rule itself.
To understand the inherent problem here, you need to know something about how a recursive-descent (LL) parser works. In an LL parser, the rule for each nonterminal symbol is implemented by a function corresponding to that rule. So, suppose I have a grammar like this:
S -> A B
A -> a
B -> b
Then, the parser would look (roughly) like this:
boolean eat(char x) {
// if the next character is x, advance the stream and return true
// otherwise, return false
}
boolean S() {
if (!A()) return false;
if (!B()) return false;
return true;
}
boolean A() {
return eat('a');
}
boolean B() {
return eat('b');
}
However, what happens if I change the grammar to be the following?
S -> A B
A -> A c | null
B -> b
Presumably, I want this grammar to represent a language like c*b. The corresponding function in the LL parser would look like this:
boolean A() {
if (!A()) return false; // stack overflow! We continually call A()
// without consuming any input.
eat('c');
return true;
}
So, we can't have left-recursion. Rewrite the grammar as:
S -> A B
A -> c A | null
B -> b
and the parser changes as such:
boolean A() {
if (!eat('c')) return true;
A();
return true;
}
(Disclaimer: this is my rudimentary approximation of an LL parser, meant only for demonstration purposes regarding this question. It has obvious bugs in it.)
I can't speak for ANTLR, but in general, the way to eliminate a left recursion of the form:
A -> A B
-> B
is to change it to be:
A -> B+
(note that B must appear at least once)
or, if ANTLR doesn't support the Kleene closure, you can do:
A -> B B'
B' -> B B'
->
If you provide an example of your rules that are having conflicts, I can provide a better, more specific answer.
If you are writing the grammar, then of course you try to write it to avoid the pitfalls of your particular parser generator.
Usually, in my experience, I get some reference manual for the (legacy) language of interest, and it already contains a grammar or railroad diagrams, and it is what it is.
In that case, left recursion removal from a grammar is pretty much done by hand. There's no market for left-recursion-removal tools, and if you had one, it would be specialized to a grammar syntax that didn't match the grammar syntax you have.
Doing this removal is mostly a matter of sweat in many cases, and there isn't usually tons of it. So the usual approach is get out your grammar knife and have at it.
I don't think how you remove left recursion changes how ANTLR gets trees. You have to do the left recursion removal first, or ANTLR (or whatever LL parser generator you are using) simply won't accept your grammar.
There are those of us that don't want the parser generator to put any serious constraints on what we can write for a context free grammar. In this case you want to use something like a GLR parser generator, which handles left- or right-recursion with ease. Unreasonable people can even insist on automated AST generation with no effort on the part of the grammar writer. For a tool that can do both, see DMS Software Reengineering Toolkit.
This is only tangentially relevant, but I just published a preprint of a paper on a new parsing method that I call "pika parsing" (cf. packrat parsing) that directly handles left-recursive grammars without the need for rule rewriting.
https://arxiv.org/abs/2005.06444