Dropping a token in Yacc/Bison when an error is matched in the rhs production

I am writing a simple calculator in which an expression can be reduced to a statement, or list of statements. If a bad expression triggers a syntax error, I try to catch it with a production rule and then ignore it by giving the rule no actions. However, I believe this still reduces to a stmt_list despite it not being a valid statement.
Is there a way to make it simply ignore tokens matched with an error and prevent them from being reduced and used later?
The code below matches error ';' and reduces it to a stmt_list. It will then try to reduce that stmt_list with a valid expression, but since the first production was never called this will trigger a memory exception. My objective is to have Bison literally do nothing if an error is matched, such that a later valid expression can be the first reduction to stmt_list.
stmt_list:
    expr ';' {
        // Allocate memory for statement array
        stmts = (float*)malloc(24 * sizeof(float));
        // Assign pointer for array
        $$ = stmts;
        // At pointer, assign statement value
        *$$ = $1;
        // Increment pointer (for next job)
        $$++;
    }
  | stmt_list expr ';' {
        $$ = $1;
        *$$ = $2;
        $$++;
    }
  | error ';' { }           // Do nothing (ignore bad stmt)
  | stmt_list error ';' { } // Do nothing (ignore bad stmt)
  ;

If you supply no action for a rule, bison/yacc provides the default action $$ = $1.
In fact, you are not providing no action. You are providing an explicit action which does nothing. As it happens, if you use the C template, the parser will still perform the default action. In other templates, an action which does not assign a value to $$ might provoke a warning during parser generation. But it certainly won't modify your data structures so as to nullify the action. It can't know what that means. If you know, you should write it as the action :-).
It's not 100% clear to me why you are keeping the results of the evaluations in a fixed-size dynamically-allocated array. You make no attempt to detect when the array fills up, so it's entirely possible that you'll end up overflowing the allocation and overwriting random memory. Moreover, using a global like this isn't usually a good idea because it prevents you from building more than one list at the same time. (For example, if you wanted to implement function calls, since a function's arguments are also a list of expressions.)
On the whole, it's better to put the implementation of the expanding expression list in a simple API which is implemented elsewhere. Here, I'm going to assume that you've done that; for specificity, I'll assume the following API (although it's just one example):
/* The list header structure, which contains all the information
* necessary to use the list. The forward declaration makes it
* possible to use pointers to ExprList objects without having to
* expose its implementation details.
*/
typedef struct ExprList ExprList;
/* Creates a new empty expression-list and returns a pointer to its header. */
ExprList* expr_list_create(void);
/* Resizes the expression list to the supplied size. If the list
* currently has fewer elements, new elements with default values are
* added at the end. If it currently has more elements, the excess
* ones are discarded. Calling with size 0 empties the list (but
* doesn't delete it).
*/
int expr_list_resize(ExprList* list, int new_length);
/* Frees all storage associated with the expression list. The
* argument must have been created with expr_list_create, and its
* value must not be used again after this function returns.
*/
void expr_list_free(ExprList* list);
/* Adds one element to the end of the expression-list.
* I kept the float datatype for expression values, although I
* strongly believe that it's not ideal. But an advantage of using an
* API like this is that it is easier to change.
*/
void expr_list_push(ExprList* list, float value);
/* Returns the number of elements in the expression-list. */
int expr_list_len(ExprList* list);
/* Returns the address of the element in the expression list
* with the given index. If the index is out of range, behaviour
* is undefined; a debugging implementation will report an error.
*/
float* expr_list_at(ExprList* list, int index);
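For reference, here is one way such an API might be implemented: a minimal sketch built on a growable array. The struct layout, the doubling growth strategy, and the lack of allocation-failure handling are my own choices, not part of the answer:

```c
#include <stdlib.h>

struct ExprList {
    float* data;  /* backing array */
    int    len;   /* elements currently in use */
    int    cap;   /* allocated capacity */
};
typedef struct ExprList ExprList;

ExprList* expr_list_create(void) {
    ExprList* list = malloc(sizeof *list);
    if (list) {
        list->data = NULL;
        list->len = 0;
        list->cap = 0;
    }
    return list;
}

void expr_list_push(ExprList* list, float value) {
    if (list->len == list->cap) {
        /* Double the capacity. A production version should check
         * for realloc failure instead of assuming success. */
        int new_cap = list->cap ? 2 * list->cap : 8;
        list->data = realloc(list->data, new_cap * sizeof *list->data);
        list->cap = new_cap;
    }
    list->data[list->len++] = value;
}

int expr_list_resize(ExprList* list, int new_length) {
    while (list->len < new_length)
        expr_list_push(list, 0.0f);  /* pad with default values */
    list->len = new_length;          /* shrinking just drops the excess */
    return list->len;
}

int expr_list_len(ExprList* list) {
    return list->len;
}

float* expr_list_at(ExprList* list, int index) {
    return &list->data[index];  /* no range check in this sketch */
}

void expr_list_free(ExprList* list) {
    free(list->data);
    free(list);
}
```

Because the header is opaque, the grammar actions below never touch the struct members directly, so the representation can be changed later without editing the grammar.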
With that API, we can rewrite the productions for valid expressions:
stmt_list: expr ';'           { $$ = expr_list_create();
                                expr_list_push($$, $1);
                              }
         | stmt_list expr ';' { $$ = $1;
                                expr_list_push($$, $2);
                              }
Now for the error cases. You have two error rules; one triggers when the error is at the beginning of a list, and the other when the error is encountered after one or more (possibly erroneous) expressions have been handled. Both of these are productions for stmt_list so they must have the same value type as stmt_list does (ExprList*). Thus, they must do whatever you think is appropriate when a syntax error is produced.
The first one, when the error is at the start of the list, only needs to create an empty list. It's hard to see what else it could do.
stmt_list: error ';' { $$ = expr_list_create(); }
It seems to me that there are at least two alternatives for the other error action, when an error is detected after the list has at least one successfully-computed value. One possibility is to ditch the erroneous item, leaving the rest of the list intact. This requires only the default action:
stmt_list: stmt_list error ';'
(Of course, you could add the action { $$ = $1; } explicitly, if you wanted to.)
The other possibility is to empty the entire list, so as to start from scratch with the next element:
stmt_list: stmt_list error ';' { $$ = $1;
expr_list_resize($$, 0);
}
There are undoubtedly other possibilities. As I said, bison cannot figure out what it is that you intended (and neither can I, really). You'll have to implement whatever behaviour you want.

Related

How to modify parsing grammar to allow assignment and non-assignment statements?

So the question is about the grammar below. I'm working on a mini-interpreted language for fun (we learned about some compiler design in class, so I want to take it to the next level and try something on my own). I'm stuck trying to make the non-terminal symbol Expr.
Statement ::= Expr SC
Expr ::= /* I need help here */
Assign ::= Name EQUAL Expr
AddSub ::= MulDiv {(+|-) AddSub}
MulDiv ::= Primary {(*|/) MulDiv}
Primary ::= INT | FLOAT | STR | LP Expr RP | Name
Name ::= ID {. Name}
Expr has to be made such that Statement must allow for the two cases:
x = 789; (regular assignment, followed by semicolon)
x+2; (no assignment, just calculation, discarded; followed by a semicolon)
The purpose of the second case is to setup the foundation for more changes in the future. I was thinking about unary increment and decrement operators, and also function calls; both of which don't require assignment to be meaningful.
I've looked at other grammars (C# namely), but it was too complicated and lengthy to understand. Naturally I'm not looking for solutions, but only for guidance on how I could modify my grammar.
All help is appreciated.
EDIT: I should say that my initial thought was Expr ::= Assign | AddSub, but that wouldn't work since it would create ambiguity since both could start with the non-terminal symbol Name. I have made my tokenizer such that it allows one token look ahead (peek), but I have not made such a thing for the non terminals, since it would be trying to fix a problem that could be avoided (ambiguity). In the grammar, the terminals are the ones that are all-caps.
The simplest solution is the one actually taken by the designers of C, and thus by the various C derivatives: treat assignment simply as yet another operator, without restricting it to being at the top-level of a statement. Hence, in C, the following is unproblematic:
while ((ch = getchar()) != EOF) { ... }
Not everyone will consider that good style, but it is certainly common (particularly in the clauses of the for statement, whose syntax more or less requires that assignment be an expression).
There are two small complications, which are relatively easy to accomplish:
Logically, and unlike most operators, assignment associates to the right so that a = b = 0 is parsed as a = (b = 0) and not (a = b) = 0 (which would be highly unexpected). It also binds very weakly, at least to the right.
Opinions vary as to how tightly it should bind to the left. In C, for the most part a strict precedence model is followed, so that a = 2 + b = 3 is rejected because it is parsed as a = ((2 + b) = 3). That might seem like terrible style anyway, but consider also a < b ? (x = a) : (y = a). In C++, where the result of the ternary operator can be a reference, you could write that as (a < b ? x : y) = a, in which the parentheses are required even though assignment has lower precedence than the ternary operator.
None of these options are difficult to implement in a grammar, though.
In many languages, the left-hand side of an assignment has a restricted syntax. In C++, which has reference values, the restriction could be considered semantic, and I believe it is usually implemented with a semantic check, but in many C derivatives lvalue can be defined syntactically. Such definitions are unambiguous, but they are often not amenable to parsing with a top-down grammar, and they can create complications even for a bottom-up grammar. Doing the check post-parse is always a simple solution.
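To make the associativity point concrete, here is a small self-contained toy (my own sketch, not part of the question's grammar) of a recursive descent evaluator in which assignment is just another operator. Putting the assignment check inside the primary rule makes = right-associative for free, while the loop in expr keeps + left-associative; names like eval and vars are invented for the example:

```c
#include <ctype.h>

static const char* p;    /* cursor into the input string */
static double vars[26];  /* one slot per single-letter variable */

static double parse_expr(void);

static void skip_spaces(void) {
    while (*p == ' ') p++;
}

/* primary := NUMBER | '(' expr ')' | NAME | NAME '=' expr */
static double parse_primary(void) {
    skip_spaces();
    if (*p == '(') {
        p++;  /* consume '(' */
        double v = parse_expr();
        skip_spaces();
        if (*p == ')') p++;  /* consume ')' */
        return v;
    }
    if (isalpha((unsigned char)*p)) {
        int slot = tolower((unsigned char)*p) - 'a';
        p++;
        skip_spaces();
        if (*p == '=') {
            p++;
            /* Recursing into the full expression is what makes
             * assignment right-associative: a = b = 0 parses as
             * a = (b = 0). */
            return vars[slot] = parse_expr();
        }
        return vars[slot];
    }
    double v = 0;
    while (isdigit((unsigned char)*p))
        v = v * 10 + (*p++ - '0');
    return v;
}

/* expr := primary { '+' primary } -- the loop makes '+' left-associative */
static double parse_expr(void) {
    double left = parse_primary();
    for (;;) {
        skip_spaces();
        if (*p == '+') {
            p++;
            left += parse_primary();
        } else {
            return left;
        }
    }
}

double eval(const char* src) {
    p = src;
    return parse_expr();
}
```

With this scheme, eval("a = b = 2 + 3") assigns 5 to both variables and yields 5, with no lookahead beyond the current character.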
If you really want to distinguish assignment statements from expression statements, then you indeed run into the problem of prediction failure (not ambiguity) if you use a top-down parsing technique such as recursive descent. Since the grammar is not ambiguous, a simple solution is to use an LALR(1) parser generator such as bison/yacc, which has no problem parsing such a grammar since it does not require an early decision as to which kind of statement is being parsed. On the whole, the use of LALR(1) or even GLR parser generators simplifies implementation of a parser by allowing you to specify a grammar in a form which is easily readable and corresponds to the syntactic analysis. (For example, an LALR(1) parser can handle left-associative operators naturally, while an LL(1) grammar can only produce right-associative parses and therefore requires some kind of reconstruction of the syntax tree.)
A recursive descent parser is a computer program, not a grammar, and its expressiveness is thus not limited by the formal constraints of LL(1) grammars. That is both a strength and a weakness: the strength is that you can find solutions which are not limited by the limitations of LL(1) grammars; the weakness is that it is much more complicated (even, sometimes, impossible) to extract a clear statement about the precise syntax of the language. This power, for example, allows recursive descent grammars to handle left associativity in a more-or-less natural way despite the restriction mentioned above.
If you want to go down this road, then the solution is simple enough. You will have some sort of function:
/* This function parses and returns a single expression */
Node expr() {
    Node left = value();
    while (true) {
        switch (lookahead) {
        /* handle each possible operator token. I left out
         * the detail of handling operator precedence since it's
         * not relevant here
         */
        case OP_PLUS: {
            accept(lookahead);
            left = MakeNode(OP_PLUS, left, value());
            break;
        }
        /* If no operator found, return the current expression */
        default:
            return left;
        }
    }
}
That can easily be modified to parse both expressions and statements. First, refactor the function so that it parses the "rest" of an expression, given the first operand. (The only change is a new prototype and the deletion of the first line in the body.)
/* This function parses and returns a single expression
 * after the first value has been parsed. The value must be
 * passed as an argument.
 */
Node expr_rest(Node left) {
    while (true) {
        switch (lookahead) {
        /* handle each possible operator token. I left out
         * the detail of handling operator precedence since it's
         * not relevant here
         */
        case OP_PLUS: {
            accept(lookahead);
            left = MakeNode(OP_PLUS, left, value());
            break;
        }
        /* If no operator found, return the current expression */
        default:
            return left;
        }
    }
}
With that in place, it is straightforward to implement both expr and stmt:
Node expr() {
    return expr_rest(value());
}

Node stmt() {
    /* Check lookahead for statements which start with
     * a keyword. Omitted for simplicity.
     */
    /* either first value in an expr or target of assignment */
    Node left = value();
    switch (lookahead) {
    case OP_ASSIGN:
        accept(lookahead);
        return MakeAssignment(left, expr());
    /* Handle += and other mutating assignments if desired */
    default:
        /* Not an assignment, just an expression */
        return MakeExpressionStatement(expr_rest(left));
    }
}

Do-While Loop in C doesn't repeat

This code doesn't repeat if I answer a negative number like "-1.01". How can I make it loop so that it will ask again for c?
#include <stdio.h>
main()
{
    float c;
    do {
        printf("O hai! How much change is owed? ");
        scanf("%.2f", &c);
    } while (c < 0.0);
    return(0);
}
The format strings for scanf are subtly different than those for printf. You are only allowed to have (as per C11 7.21.6.2 The fscanf function /3):
an optional assignment-suppressing character *.
an optional decimal integer greater than zero that specifies the maximum field width (in characters).
an optional length modifier that specifies the size of the receiving object.
a conversion specifier character that specifies the type of conversion to be applied.
Hence your format specifier becomes illegal the instant it finds the . character, which is not one of the valid options. As per /13 of that C11 section listed above:
If a conversion specification is invalid, the behaviour is undefined.
For input, you're better off using the most basic format strings so that the format is not too restrictive. A good rule of thumb in I/O is:
Be liberal in what you accept, specific in what you generate.
So, the code is better written as follows, including what a lot of people ignore, the possibility that the scanf itself may fail, resulting in an infinite loop:
#include <stdio.h>

int main (void) {
    float c;
    do {
        printf ("O hai! How much change is owed? ");
        if (scanf ("%f", &c) != 1) {
            puts ("Error getting a float.");
            break;
        }
    } while (c < 0.0f);
    return 0;
}
If you're after a more general purpose input solution, where you want to allow the user to input anything, take care of buffer overflow, handle prompting and so on, every C developer eventually comes up with the idea that the standard ways of getting input all have deficiencies. So they generally go write their own so as to get more control.
For example, here's one that provides all that functionality and more.
Once you have the user's input as a string, you can examine and play with it as much as you like, including doing anything you would have done with scanf, by using sscanf instead (and being able to go back and do it again and again if initial passes over the data are unsuccessful).
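A minimal sketch of that read-a-line-then-parse approach (the function name, return convention, and buffer size are my own choices, not the linked code):

```c
#include <stdio.h>

/* Reads one line from stream and tries to parse a float from it.
 * Returns 1 on success, 0 if the line isn't a number, and -1 on
 * EOF or read error. */
int read_float(FILE* stream, float* out) {
    char line[256];  /* fixed-size line buffer; longer lines are truncated */
    if (fgets(line, sizeof line, stream) == NULL)
        return -1;
    /* Because the input is now a string, we can re-scan it as many
     * times as we like with sscanf. */
    return sscanf(line, "%f", out) == 1 ? 1 : 0;
}
```

In the loop above you would call read_float(stdin, &c), re-prompt on 0, and bail out on -1, so a line of garbage consumes the whole line instead of jamming scanf forever.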
scanf("%.2f", &c );
// ^^ <- This seems unnecessary here.
Please stick with the basics.
scanf("%f", &c);
If you want to limit the input to a maximum field width of 2 characters (note that the width counts characters read, not digits after the decimal point),
scanf("%2f", &c);

ANTLR infinite EOF loop

Please have a look at my grammar: https://bitbucket.org/rstoll/tsphp-parser/raw/cdb41531e86ec66416403eb9c29edaf60053e5df/src/main/antlr/TSPHP.g
Somehow ANTLR produces an infinite loop finding infinite EOF tokens for the following input:
class a{public function void a(}
Although only prog expects EOF, classBody somehow accepts it as well. Does anyone have an idea how I can fix that, i.e. what I have to change so that classBody does not accept EOF tokens?
Code from the generated class:
// D:\\TSPHP-parser\\src\\main\\antlr\\TSPHP.g:287:129: ( classBody )*
loop17:
do {
    int alt17 = 2;
    int LA17_0 = input.LA(1);
    if ( (LA17_0==EOF||LA17_0==Abstract||LA17_0==Const||LA17_0==Final||LA17_0==Function||LA17_0==Private||(LA17_0 >= Protected && LA17_0 <= Public)||LA17_0==Static) ) {
        alt17 = 1;
    }
    switch (alt17) {
        case 1 :
            // D:\\TSPHP-parser\\src\\main\\antlr\\TSPHP.g:287:129: classBody
            {
                pushFollow(FOLLOW_classBody_in_classDeclaration1603);
                classBody38 = classBody();
                state._fsp--;
                if (state.failed) return retval;
                if ( state.backtracking==0 ) stream_classBody.add(classBody38.getTree());
            }
            break;
        default :
            break loop17;
    }
} while (true);
The problem occurs when the token is EOF: the loop is never exited, since EOF is treated as a valid token, even though I have not specified it that way.
EDIT: I do not get the error if I comment out lines 342 and 347 (the empty case in the rules accessModifierWithoutPrivateOrPublic and accessModifierOrPublic, respectively).
EDIT 2: I could solve my problem. I rewrote the methodModifier rule (integrated all the possible modifiers into one rule). This way ANTLR does not believe that EOF is a valid token after /* empty */ in
accessModifierOrPublic
: accessModifier
| /* empty */ -> Public["public"]
;
This type of bug can occur in error handling for ANTLR 3. In ANTLR 4, the IntStream.consume() method was updated to require the following exception be thrown to preempt this problem.
Throws:
IllegalStateException - if an attempt is made to consume the end of the stream (i.e. if LA(1)==EOF before calling consume).
For ANTLR 3 grammars, you can at least prevent an infinite loop by using your own TokenStream implementation (probably easiest to extend CommonTokenStream) and throwing this exception if the condition listed above is violated. Note that you might need to allow this condition to be violated once (reasons are complicated), so keep a count and throw the IllegalStateException if the code tries to consume EOF more than 2 or 3 times. Remember this is just an effort to break the infinite loop so you can be a little "fuzzy" on the actual check.

Can ANTLR return Lines of Code when lexing?

I am trying use ANTLR to analyse a large set of code using full Java grammar. Since ANTLR needs to open all the source files and scan them, I am wondering if it can also return lines of code.
I checked API for Lexer and Parser, it seems they do not return LoC. Is it easy to instrument the grammar rule a bit to get LoC? The full Java rule is complicated, I don't really want to mess a large part of it.
If you have an existing ANTLR grammar, and want to count certain things during parsing, you could do something like this:
grammar ExistingGrammar;

// ...

@parser::members {
  public int loc = 0;
}

// ...

someParserRule
  : SomeLexerRule someOtherParserRule {loc++;}
  ;

// ...
So, whenever your parser encounters a someParserRule, you increase loc by one by placing {loc++;} after (or before) the rule.
So, whatever your definition of a line of code is, simply place {loc++;} in the rule to increase the counter. Be careful not to increase it twice:
statement
  : someParserRule {loc++;}
  | // ...
  ;

someParserRule
  : SomeLexerRule someOtherParserRule {loc++;}
  ;
EDIT
I just noticed that in the title of your question you asked if this can be done during lexing. That won't be possible. Let's say a LoC would always end with a ';'. During lexing, you wouldn't be able to make a distinction between a ';' after, say, an assignment (which is a single LoC), and the 2 ';'s inside a for(int i = 0; i < n; i++) { ... } statement (which wouldn't be 2 LoC).
In the C target the data structure ANTLR3_INPUT_STREAM has a getLine() function which returns the current line from the input stream. It seems the Java version of this is CharStream.getLine(). You should be able to call this at any time and get the current line in the input stream.
Use a visitor to visit the CompilationUnit context, then context.stop.getLine() will give you the last line number of the compilation unit context.
@Override
public Integer visitCompilationUnit(@NotNull JAVAParser.CompilationUnitContext ctx) {
    return ctx.stop.getLine();
}

DBI::sql_type_cast: DBIstcf_DISCARD_STRING - question

My hope was, that DBI::sql_type_cast with the DBIstcf_DISCARD_STRING-flag would modify $sv from '4.8g' to 4.8.
(DBIstcf_DISCARD_STRING:
"If this flag is specified then when the driver successfully casts the bound perl scalar to a non-string type then the string portion of the scalar will be discarded.")
What does the return value "sv could not be case and DBIstcf_STRICT was not used" mean?
#!/usr/bin/env perl
use warnings;
use 5.012;
use DBI qw(:sql_types);
my $dsn = "DBI:Proxy:hostname=horst;port=2000;dsn=DBI:ODBC:db1.mdb";
my $dbh = DBI->connect( $dsn, undef, undef, { RaiseError => 1, PrintError => 0 } )
or die $DBI::errstr;
my $sv = '4.8g';
my $sql_type = SQL_DOUBLE;
my $flags = DBIstcf_DISCARD_STRING;
my $sts = DBI::sql_type_cast( $sv, $sql_type, $flags );
say $sts; # 1 (sv could not be case and DBIstcf_STRICT was not used)
say $sv;
# Argument "4.8g" isn't numeric in subroutine entry at ./perl6.pl line 14.
# 1
# 4.8g
The documentation contains a typo -- the description for $sts == 1 should be "sv could not be cast" -- i.e. a cast to SQL_DOUBLE wasn't possible for the value you provided and so nothing was done.
DBIstcf_DISCARD_STRING means something different from what you want. In Perl internal terms it means that if you pass an SV with POK and NOK and PV part "1.23" and NV part 1.23, you will get back an SV with !POK and NOK and NV part 1.23 -- that is, the stored string part of the scalar will be invalidated, leaving the numeric part intact, so any future attempt to use the scalar as a string will force it to be re-converted from a number to a string. But note that it says that this will only happen if the cast is successful, and a cast to SQL_DOUBLE isn't successful if the value isn't a valid number to begin with. "4.8g" doesn't pass the test.
You can clean up the string part of the value almost as effectively as DBI on your own just by doing $sv = 0 + $sv; which will clear POK and force a reconversion to string in the same way. The difference between this and what DBI does is that it's not actually clearing the PV in the way that DBI would, only marking it invalid. To force the value to be cleared immediately in the same way as DBI, you need to do something like
$sv = do { my $tmp = 0 + $sv; undef $sv; $tmp };
but unless you have some really good explanation for why you need that, you don't -- so don't use it. :)
After reading through the documentation and the code in DBI.xs (the implementation is in sql_type_cast_svpv), the return value of 1 means 'the value could not be cast cleanly and DBIstcf_STRICT was not used'.
Taking the key part of that function, in your case:
case SQL_DOUBLE:
    sv_2nv(sv);
    /* SvNOK should be set but won't if sv is not numeric (in which
     * case perl would have warn'd already if -w or warnings are in effect)
     */
    cast_ok = SvNOK(sv);
    break;
....
if (cast_ok) {
    if (flags & DBIstcf_DISCARD_STRING
        && SvNIOK(sv)  /* we set a numeric value */
        && SvPVX(sv)   /* we have a buffer to discard */
    ) {
        SvOOK_off(sv);
        if (SvLEN(sv))
            Safefree(SvPVX(sv));
        SvPOK_off(sv);
        SvPV_set(sv, NULL);
        SvLEN_set(sv, 0);
        SvCUR_set(sv, 0);
    }
}
if (cast_ok)
    return 2;
SvNOK should be set for you. Without digging in further into sv_2nv, the core of the problem is that "4.8g" is not a numeric type, as the numeric flag in the scalar value is not set (this is what SvNOK checks for).
My suggestion: use a regular expression to strip that input before calling sql_type_cast.
The typo in the documentation is now fixed in the subversion trunk.
Here is a brief explanation of why sql_type_cast was added.
Although there is nothing to stop you using sql_type_cast, it was specifically added for drivers (DBDs) to cast data returned from the database. The original issue it solved was that integers are mostly bound as strings, so when the data is returned from the database the scalar's pv is set. Some modules like JSON::XS are clever and look at the pv to help decide if the scalar is a number or not. Without sql_type_cast, JSON::XS was converting a scalar containing a 1 but with the pv set to "1" into the string "1" instead of the shorter 1 in JSON conversions.
To my knowledge only DBD::Oracle does this right now although it is in the TODO for DBD::ODBC.