conditional compilation in yacc - yacc

how to do conditional compilation in yacc. Similar to done in C using ifdef.
I want to create a rule based on a condition. Is it possible in yacc. Example. based on condition rule A is defined as follows:
ruleA : A | B, /* For condition 1 */
ruleA : C /* If condition 1 is not satisfied */

btyacc has conditional compilation based on defined flags, similar to the C preprocessor. You can say:
%ifdef VERSION_A
ruleA: A | B ;
%endif
%ifdef VERSION_B
ruleA: C ;
%endif
and then use a -DVERSION_A or -DVERSION_B command line argument to get one version or the other. Its pretty primitive (you can only test a single flag per %ifdef, can't nest %ifdefs and there's no %else), but its adequate for simple things.

If you can't preprocess your Yacc grammar with an appropriate preprocessor, then you can use a solution based on actions:
ruleA : A { if (condition1) { process OK; } else YYERROR; }
| B { if (condition1) { process OK; } else YYERROR; }
| C { if (!condition1) { process OK; } else YYERROR; }
;
The YYERROR action triggers Yacc's normal error processing mechanism. This means your grammar must 'work' with both sets of rules in operation as far as Yacc is concerned. If this leads to complexity because of shift/reduce (or even reduce/reduce) conflicts, then preprocessing is the way to go. The result grammar will be smaller and more focussed.

My suggestion is that you expose the conditional flag as a terminal symbol, we can call it the mode terminal. (So, technically, it's a run-time not compile-time condition.)
You need to return the mode terminal at every point that it would make a difference in the parse. Your grammar can ignore it when it receives extra ones. It can return the mode terminal only in the "condition 1" case, only in the "not condition 1" case, or you can return a different one in both cases. So say you have two tokens C1 and C2, one for each mode.
This may not work out so well if you are parsing an existing grammar, but if you are designing the grammar and the parser together all problems are solvable.
Then you end up with:
ruleA : C1 A | C1 B | C2 C ;

Related

How to synthesise compiler testing data?

I am writing a simple compiler as a school work. I am looking for an automated approach to generate both positive and negative testing data to test my compiler, given the formal grammar and other specification. The language I am dealing with is of mediate size with 38 or so non-terminals. For the sake of illustration, here is a snapshot of the grammar:
program: const_decl* declaration* ENDMARKER
# statement
stmt: flow_stmt | '{' stmt* '}' | NAME [stmt_trailer] ';' | ';'
stmt_trailer: arglist | ['[' expr ']'] '=' expr
flow_stmt: if_stmt | for_stmt | while_stmt | read_stmt ';' | write_stmt ';' | return_stmt ';'
return_stmt: 'return' ['(' expr ')']
if_stmt: 'if' '(' condition ')' stmt ['else' stmt]
condition: expr ('<'|'<='|'>'|'>='|'!='|'==') expr | expr
for_stmt: ('for' '(' NAME '=' expr ';' condition ';'
NAME '=' NAME ('+'|'-') NUMBER ')' stmt)
Is there any tools to generate input file with the help of the grammar? The hand-written tests are too tedious or too weak to discover problems. An example of this language here:
void main() {
int N;
int temp;
int i, j;
int array_size;
reset_heap;
scanf(N);
for (i = 0; i < N; i = i + 1) {
scanf(array_size);
if (array_size > max_heap_size) {
printf("array_size exceeds max_heap_size");
} else {
for (j = 0; j < array_size; j = j + 1) {
scanf(temp);
heap[j] = temp;
}
heap_sort(array_size);
print_heap(array_size);
}
}
}
Generating controllable testing data automatically can save the days. Given the simplicity of the language, there must be some way to effectively do this. Any pointer and insight is greatly appreciated.
Any pointer and insight is greatly appreciated.
This should have the subtopic of How to avoid combinatorial explosion when generating test data.
While I would not be surprised if there are tools to do this having had the same need to generate test data for grammars I have created a few one off applications.
One of the best series of articles I have found on this is by Eric Lippert, Every Binary Tree There Is, think BNF converted to binary operators then converted to AST when you read tree. However he uses Catalan (every branch has two leaves) and when I wrote my app I preferred Motzikin (a branch can have one or two leaves).
Also he did his in C# with LINQ and I did mine in Prolog using DCG.
Generating the data based on the BNF or DCG is not hard, the real trick is to limit the area of expansion and the size of the expansion and to inject bad data.
By area of expansion lets say you want to test nested if statements three levels deep, but have to have valid code that compiles. Obviously you need the boilerplate code to make it compile then you start changing the deeply nested if by adding or removing the else clause. So you need to put in constraints so that the boilerplate code is constant and the testing part is variable.
By size of expansion lets say that you want to test conditional expressions. You can easily calculate that if you have many operators and you want to test them all in combinations you soon run into combinatorial explosion. The trick is to ensure you test deep enough and with enough breadth but not every combination. Again the judicial use of constraints helps.
So the point of all of this is that you start with a tool that takes in the BNF and generates valid code. Then you modify the BNF to add constraints and modify the generator to understand the constraints to generate the code examples.
Then you modify the BNF for invalid data and likewise the generator to understand those rules.
After that is working you can then start layering on levels of automation.
If you do go this route and decide that you will have to learn Prolog, take a look at Mercury first. I have not done this with Mercury, but if I do it again Mercury is high on the list.
While my actual code is not public, this and this is the closest to it that is public.
Along the way I had some fun with it in Code Golf.
When generating terminals such as reserved words or values for types, you can use predefined list with both valid and invalid data, e.g. for if if the language is case sensitive I would include in the list if,If,IF,iF, etc. For value types such as unsigned byte I would include -1,0,255 and 256.
When I was testing basic binary math expressions with +, -, * and ^ I generated all the test for with three basic numbers -2,-1,0,1, and 2. I thought it would be useless since I already had hundreds of test cases, but since it only took a few minutes to generate all of the test cases and several hours to run it, to my surprise it found a pattern I did not cover. The point here is that contrary what most people say about having to many test cases, remember that it is only time on a computer by changing a few constraints so do the large number of test.

How to modify parsing grammar to allow assignment and non-assignment statements?

So the question is about the grammar below. I'm working on a mini-interpreted language for fun (we learned about some compiler design in class, so I want to take it to the next level and try something on my own). I'm stuck trying to make the non-terminal symbol Expr.
Statement ::= Expr SC
Expr ::= /* I need help here */
Assign ::= Name EQUAL Expr
AddSub ::= MulDiv {(+|-) AddSub}
MulDiv ::= Primary {(*|/) MulDiv}
Primary ::= INT | FLOAT | STR | LP Expr RP | Name
Name ::= ID {. Name}
Expr has to be made such that Statement must allow for the two cases:
x = 789; (regular assignment, followed by semicolon)
x+2; (no assignment, just calculation, discarded; followed by a semicolon)
The purpose of the second case is to setup the foundation for more changes in the future. I was thinking about unary increment and decrement operators, and also function calls; both of which don't require assignment to be meaningful.
I've looked at other grammars (C# namely), but it was too complicated and lengthy to understand. Naturally I'm not looking for solutions, but only for guidance on how I could modify my grammar.
All help is appreciated.
EDIT: I should say that my initial thought was Expr ::= Assign | AddSub, but that wouldn't work since it would create ambiguity since both could start with the non-terminal symbol Name. I have made my tokenizer such that it allows one token look ahead (peek), but I have not made such a thing for the non terminals, since it would be trying to fix a problem that could be avoided (ambiguity). In the grammar, the terminals are the ones that are all-caps.
The simplest solution is the one actually taken by the designers of C, and thus by the various C derivatives: treat assignment simply as yet another operator, without restricting it to being at the top-level of a statement. Hence, in C, the following is unproblematic:
while ((ch = getchar()) != EOF) { ... }
Not everyone will consider that good style, but it is certainly common (particularly in the clauses of the for statement, whose syntax more or less requires that assignment be an expression).
There are two small complications, which are relatively easy to accomplish:
Logically, and unlike most operators, assignment associates to the right so that a = b = 0 is parsed as a = (b = 0) and not (a = b) = 0 (which would be highly unexpected). It also binds very weakly, at least to the right.
Opinions vary as to how tightly it should bind to the left. In C, for the most part a strict precedence model is followed so that a = 2 + b = 3 is rejected since it is parsed as a = ((2 + b) = 3). a = 2 + b = 3 might seem like terrible style, but consider also a < b ? (x = a) : (y = a). In C++, where the result of the ternary operator can be a reference, you could write that as (a < b ? x : y) = a in which the parentheses are required even thought assignment has lower precedence than the ternary operator.
None of these options are difficult to implement in a grammar, though.
In many languages, the left-hand side of an assignment has a restricted syntax. In C++, which has reference values, the restriction could be considered semantic, and I believe it is usually implemented with a semantic check, but in many C derivatives lvalue can be defined syntactically. Such definitions are unambiguous, but they are often not amenable to parsing with a top-down grammar, and they can create complications even for a bottom-up grammar. Doing the check post-parse is always a simple solution.
If you really want to distinguish assignment statements from expression statements, then you indeed run into the problem of prediction failure (not ambiguity) if you use a top-down parsing technique such as recursive descent. Since the grammar is not ambiguous, a simple solution is to use an LALR(1) parser generator such as bison/yacc, which has no problems parsing such a grammar since it does not require an early decision as to which kind of statement is being parsed. On the whole, the use of LALR(1) or even GLR parser generators simplifies implementation of a parser by allowing you to specify a grammar in a form which is easily readable and corresponds to the syntactic analysis. (For example, an LALR(1) parser can handle left-associative operators naturally, while a LL(1) grammar can only produce right-associative parses and therefore requires some kind of reconstruction of the syntax tree.)
A recursive descent parser is a computer program, not a grammar, and its expressiveness is thus not limited by the formal constraints of LL(1) grammars. That is both a strength and a weakness: the strength is that you can find solutions which are not limited by the limitations of LL(1) grammars; the weakness is that it is much more complicated (even, sometimes, impossible) to extract a clear statement about the precise syntax of the language. This power, for example, allows recursive descent grammars to handle left associativity in a more-or-less natural way despite the restriction mentioned above.
If you want to go down this road, then the solution is simple enough. You will have some sort of function:
/* This function parses and returns a single expression */
Node expr() {
Node left = value();
while (true) {
switch (lookahead) {
/* handle each possible operator token. I left out
* the detail of handling operator precedence since it's
* not relevant here
*/
case OP_PLUS: {
accept(lookahead);
left = MakeNode(OP_PLUS, left, value());
break;
}
/* If no operator found, return the current expression */
default:
return left;
}
}
}
That easily be modified to be able to parse both expressions and statements. First, refactor the function so that it parses the "rest" of an expression, given the first operator. (The only change is a new prototype and the deletion of the first line in the body.)
/* This function parses and returns a single expression
* after the first value has been parsed. The value must be
* passed as an argument.
*/
Node expr_rest(Node left) {
while (true) {
switch (lookahead) {
/* handle each possible operator token. I left out
* the detail of handling operator precedence since it's
* not relevant here
*/
case OP_PLUS: {
accept(lookahead);
left = MakeNode(OP_PLUS, left, value());
break;
}
/* If no operator found, return the current expression */
default:
return left;
}
}
}
With that in place, it is straightforward to implement both expr and stmt:
Node expr() {
return expr_rest(value());
}
Node stmt() {
/* Check lookahead for statements which start with
* a keyword. Omitted for simplicity.
*/
/* either first value in an expr or target of assignment */
Node left = value();
switch (lookahead) {
case OP_ASSIGN:
accept(lookahead);
return MakeAssignment(left, expr())
}
/* Handle += and other mutating assignments if desired */
default: {
/* Not an assignment, just an expression */
return MakeExpressionStatement(expr_rest(left));
}
}
}

antlr infinit eof loop

Please have a look at my grammar: https://bitbucket.org/rstoll/tsphp-parser/raw/cdb41531e86ec66416403eb9c29edaf60053e5df/src/main/antlr/TSPHP.g
Somehow ANTLR produces an infinite loop finding infinite EOF tokens for the following input:
class a{public function void a(}
Although, only prog expects EOF classBody somehow accept it as well. Has someone an idea how I can fix that, what I have to change that classBody does not accept EOF tokens respectively?
Code from the generated class:
// D:\\TSPHP-parser\\src\\main\\antlr\\TSPHP.g:287:129: ( classBody )*
loop17:
do {
int alt17=2;
int LA17_0 = input.LA(1);
if ( (LA17_0==EOF||LA17_0==Abstract||LA17_0==Const||LA17_0==Final||LA17_0==Function||LA17_0==Private||(LA17_0 >= Protected && LA17_0 <= Public)||LA17_0==Static) ) {
alt17=1;
}
switch (alt17) {
case 1 :
// D:\\TSPHP-parser\\src\\main\\antlr\\TSPHP.g:287:129: classBody
{
pushFollow(FOLLOW_classBody_in_classDeclaration1603);
classBody38=classBody();
state._fsp--;
if (state.failed) return retval;
if ( state.backtracking==0 ) stream_classBody.add(classBody38.getTree());
}
break;
default :
break loop17;
}
} while (true);
The problem occurs, when token = EOF, the loop is never quit, since EOF is a valid token, even though I haven not specified like that.
EDIT The do not get the error if I comment line 342 and 347 out (the empty case in rule accessModifierWithoutPrivateOrPublic, accessModifierOrPublic respectively)
EDIT 2 I could solve my problem. I rewrote the methodModifier rule (integrated all the possible modifier in one rule). This way ANTLR does not believe that EOF is a valid token after /empty/ in
accessModifierOrPublic
: accessModifier
| /* empty */ -> Public["public"]
;
This type of bug can occur in error handling for ANTLR 3. In ANTLR 4, the IntStream.consume() method was updated to require the following exception be thrown to preempt this problem.
Throws:
IllegalStateException - if an attempt is made to consume the the end of the stream (i.e. if LA(1)==EOF before calling consume).
For ANTLR 3 grammars, you can at least prevent an infinite loop by using your own TokenStream implementation (probably easiest to extend CommonTokenStream) and throwing this exception if the condition listed above is violated. Note that you might need to allow this condition to be violated once (reasons are complicated), so keep a count and throw the IllegalStateException if the code tries to consume EOF more than 2 or 3 times. Remember this is just an effort to break the infinite loop so you can be a little "fuzzy" on the actual check.

Can ANTLR return Lines of Code when lexing?

I am trying use ANTLR to analyse a large set of code using full Java grammar. Since ANTLR needs to open all the source files and scan them, I am wondering if it can also return lines of code.
I checked API for Lexer and Parser, it seems they do not return LoC. Is it easy to instrument the grammar rule a bit to get LoC? The full Java rule is complicated, I don't really want to mess a large part of it.
If you have an existing ANTLR grammar, and want to count certain things during parsing, you could do something like this:
grammar ExistingGrammar;
// ...
#parser::members {
public int loc = 0;
}
// ...
someParserRule
: SomeLexerRule someOtherParserRule {loc++;}
;
// ...
So, whenever your oparser encounters a someParserRule, you increase the loc by one by placing {loc++;} after (or before) the rule.
So, whatever your definition of a line of code is, simply place {loc++;} in the rule to increase the counter. Be careful not to increase it twice:
statement
: someParserRule {loc++;}
| // ...
;
someParserRule
: SomeLexerRule someOtherParserRule {loc++;}
;
EDIT
I just noticed that in the title of your question you asked if this can be done during lexing. That won't be possible. Let's say a LoC would always end with a ';'. During lexing, you wouldn't be able to make a distinction between a ';' after, say, an assignment (which is a single LoC), and the 2 ';'s inside a for(int i = 0; i < n; i++) { ... } statement (which wouldn't be 2 LoC).
In the C target the data structure ANTLR3_INPUT_STREAM has a getLine() function which returns the current line from the input stream. It seems the Java version of this is CharStream.getLine(). You should be able to call this at any time and get the current line in the input stream.
Use a visitor to visit the CompilationUnit context, then context.stop.getLine() will give you the last line number of the compilation unit context.
#Override public Integer visitCompilationUnit(#NotNull JAVAParser.CompilationUnitContext ctx) {
return ctx.stop.getLine();
}

Removing Left Recursion in ANTLR

As is explained in Removing left recursion , there are two ways to remove the left recursion.
Modify the original grammar to remove the left recursion using some procedure
Write the grammar originally not to have the left recursion
What people normally use for removing (not having) the left recursion with ANTLR? I've used flex/bison for parser, but I need to use ANTLR. The only thing I'm concerned about using ANTLR (or LL parser in genearal) is left recursion removal.
In practical sense, how serious of removing left recursion in ANTLR? Is this a showstopper in using ANTLR? Or, nobody cares about it in ANTLR community?
I like the idea of AST generation of ANTLR. In terms of getting AST quick and easy way, which method (out of the 2 removing left recursion methods) is preferable?
Added
I did some experiment with the following grammar.
E -> E + T|T
T -> T * F|F
F -> INT | ( E )
After left recursion removal, I get the following one
E -> TE'
E' -> null | + TE'
T -> FT'
T' -> null | * FT'
I could come up with the following ANTLR representation. Even though, It's relatively pretty simple and straightforward, it seems the grammar that doesn't have the left recursion should be the better way to go.
grammar T;
options {
language=Python;
}
start returns [value]
: e {$value = $e.value};
e returns [value]
: t ep
{
$value = $t.value
if $ep.value != None:
$value += $ep.value
}
;
ep returns [value]
: {$value = None}
| '+' t r = ep
{
$value = $t.value
if $r.value != None:
$value += $r.value
}
;
t returns [value]
: f tp
{
$value = $f.value
if $tp.value != None:
$value *= $tp.value
}
;
tp returns [value]
: {$value = None}
| '*' f r = tp
{
$value = $f.value;
if $r.value != None:
$value *= $r.value
}
;
f returns [int value]
: INT {$value = int($INT.text)}
| '(' e ')' {$value = $e.value}
;
INT : '0'..'9'+ ;
WS: (' '|'\n'|'\r')+ {$channel=HIDDEN;} ;
Consider something like a typical parameter list:
parameter_list: parameter
| parameter_list ',' parameter
;
Since you don't care about anything like precedence or associativity with parameters, this is fairly easy to convert to right recursion, at the expense of adding an extra production:
parameter_list: parameter more_params
;
more_params:
| ',' parameter more_params
;
For the most serious cases, you might want to spend some time in the Dragon Book. Doing a quick check, this is covered primarily in chapter 4.
As far as seriousness goes, I'm pretty sure ANTLR simply won't accept a grammar that contains left recursion, which would put it into the "absolute necessity" category.
In practical sense, how serious of
removing left recursion in ANTLR? Is
this a showstopper in using ANTLR?
I think that you have a misunderstanding of left-recursion. It is a property of the grammar, not of the parser generator or the interaction between the parser generator and the specification. It happens when the first symbol on the right side of a rule is equal to the nonterminal corresponding to the rule itself.
To understand the inherent problem here, you need to know something about how a recursive-descent (LL) parser works. In an LL parser, the rule for each nonterminal symbol is implemented by a function corresponding to that rule. So, suppose I have a grammar like this:
S -> A B
A -> a
B -> b
Then, the parser would look (roughly) like this:
boolean eat(char x) {
// if the next character is x, advance the stream and return true
// otherwise, return false
}
boolean S() {
if (!A()) return false;
if (!B()) return false;
return true;
}
boolean A(char symbol) {
return eat('a');
}
boolean B(char symbol) {
return eat('b');
}
However, what happens if I change the grammar to be the following?
S -> A B
A -> A c | null
B -> b
Presumably, I want this grammar to represent a language like c*b. The corresponding function in the LL parser would look like this:
boolean A() {
if (!A()) return false; // stack overflow! We continually call A()
// without consuming any input.
eat('c');
return true;
}
So, we can't have left-recursion. Rewrite the grammar as:
S -> A B
A -> c A | null
B -> b
and the parser changes as such:
boolean A() {
if (!eat('c')) return true;
A();
return true;
}
(Disclaimer: this is my rudimentary approximation of an LL parser, meant only for demonstration purposes regarding this question. It has obvious bugs in it.)
I can't speak for ANTLR, but in general, the steps to eliminate a left recursion of the form:
A -> A B
-> B
is to change it to be:
A -> B+
(note that B must appear at least once)
or, if ANTLR doesn't support the Kleene closure, you can do:
A -> B B'
B' -> B B'
->
If you provide an example of your rules that are having conflicts, I can provide a better, more specific answer.
If you are writing the grammar, then of course you try to write it to avoid the pitfalls of your particular parser generator.
Usually, in my experience, I get some reference manual for the (legacy) language of interest, and it already contains a grammar or railroad diagrams, and it is what it is.
In that case, pretty much left recursion removal from a grammar is done by hand. There's no market for left-recursion-removal tools, and if you had one, it would be specialized to a grammar syntax that didn't match the grammar syntax you have.
Doing this removal is mostly a matter of sweat in many cases, and there isn't usually tons of it. So the usual approach is get out your grammar knife and have at it.
I don't think how you remove left recursion changes how ANTLR gets trees. You have to do the left recursion removal first, or ANTLR (or whatever LL parser generator you are using) simply won't accept your grammar.
There are those of us that don't want the parser generator to put any serious constraints on what we can write for a context free grammar. In this case you want to use something like a GLR parser generator, which handles left- or right-recursion with ease. Unreasonable people can even insist on automated AST generation with no effort on the part of the grammar writer. For a tool that can do both, see DMS Software Reengineering Toolkit.
This is only orthogonally relevant, but I just published a preprint of a paper on a new parsing method that I call "pika parsing" (c.f. packrat parsing) that directly handles left recursive grammars without the need for rule rewriting.
https://arxiv.org/abs/2005.06444