How to synthesise compiler testing data? - testing

I am writing a simple compiler as a school project. I am looking for an automated approach to generate both positive and negative test data for my compiler, given the formal grammar and other specifications. The language I am dealing with is of medium size, with 38 or so non-terminals. For the sake of illustration, here is a snapshot of the grammar:
program: const_decl* declaration* ENDMARKER
# statement
stmt: flow_stmt | '{' stmt* '}' | NAME [stmt_trailer] ';' | ';'
stmt_trailer: arglist | ['[' expr ']'] '=' expr
flow_stmt: if_stmt | for_stmt | while_stmt | read_stmt ';' | write_stmt ';' | return_stmt ';'
return_stmt: 'return' ['(' expr ')']
if_stmt: 'if' '(' condition ')' stmt ['else' stmt]
condition: expr ('<'|'<='|'>'|'>='|'!='|'==') expr | expr
for_stmt: ('for' '(' NAME '=' expr ';' condition ';'
NAME '=' NAME ('+'|'-') NUMBER ')' stmt)
Are there any tools that generate input files from the grammar? Hand-written tests are either too tedious to write or too weak to discover problems. Here is an example of this language:
void main() {
int N;
int temp;
int i, j;
int array_size;
reset_heap;
scanf(N);
for (i = 0; i < N; i = i + 1) {
scanf(array_size);
if (array_size > max_heap_size) {
printf("array_size exceeds max_heap_size");
} else {
for (j = 0; j < array_size; j = j + 1) {
scanf(temp);
heap[j] = temp;
}
heap_sort(array_size);
print_heap(array_size);
}
}
}
Generating controllable test data automatically could save days. Given the simplicity of the language, there must be some way to do this effectively. Any pointers and insight are greatly appreciated.

This should have the subtopic of How to avoid combinatorial explosion when generating test data.
While I would not be surprised if there are tools to do this, I have had the same need to generate test data for grammars and have created a few one-off applications.
One of the best series of articles I have found on this is Eric Lippert's Every Binary Tree There Is; think of BNF converted to binary operators and then to an AST when you read "tree". However, he uses Catalan (every branch has two leaves), and when I wrote my app I preferred Motzkin (a branch can have one or two leaves).
Also he did his in C# with LINQ and I did mine in Prolog using DCG.
Generating the data based on the BNF or DCG is not hard; the real trick is to limit the area of expansion and the size of the expansion, and to inject bad data.
By area of expansion, let's say you want to test nested if statements three levels deep, but the result has to be valid code that compiles. Obviously you need the boilerplate code to make it compile; then you start changing the deeply nested if by adding or removing the else clause. So you need to put in constraints so that the boilerplate code is constant and the part under test is variable.
By size of expansion, let's say that you want to test conditional expressions. You can easily calculate that if you have many operators and you want to test them all in combination, you soon run into combinatorial explosion. The trick is to ensure you test deep enough and with enough breadth, but not every combination. Again, the judicious use of constraints helps.
So the point of all of this is that you start with a tool that takes in the BNF and generates valid code. Then you modify the BNF to add constraints and modify the generator to understand the constraints to generate the code examples.
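A minimal sketch of such a generator, in Python rather than the Prolog/DCG used in this answer. The toy grammar, the terminal lists and the `generate` function are all hypothetical; the depth budget is one simple way to limit the size of the expansion: once it is used up, only the first (non-recursive) alternative of each rule is chosen.

```python
import random

# Hypothetical toy grammar, loosely modelled on the question's snapshot.
# The first alternative of every rule must be able to terminate on its own.
GRAMMAR = {
    "stmt": [
        ["NAME", "=", "expr", ";"],
        ["{", "stmt", "stmt", "}"],
        ["if", "(", "condition", ")", "stmt"],
        ["if", "(", "condition", ")", "stmt", "else", "stmt"],
    ],
    "condition": [["expr"], ["expr", "<", "expr"], ["expr", "==", "expr"]],
    "expr": [["NAME"], ["NUMBER"], ["expr", "+", "expr"]],
}

# Hypothetical spellings for the token classes.
TERMINALS = {"NAME": ["i", "j", "x"], "NUMBER": ["0", "1", "42"]}


def generate(symbol, depth, rng):
    """Expand symbol into a list of tokens, using at most `depth` free choices."""
    if symbol in TERMINALS:
        return [rng.choice(TERMINALS[symbol])]
    if symbol not in GRAMMAR:
        return [symbol]  # literal such as 'if' or ';'
    alternatives = GRAMMAR[symbol]
    # Constraint: when the budget is exhausted, force the terminating alternative.
    chosen = alternatives[0] if depth <= 0 else rng.choice(alternatives)
    tokens = []
    for part in chosen:
        tokens.extend(generate(part, depth - 1, rng))
    return tokens


program = generate("stmt", 4, random.Random(0))
print(" ".join(program))
```

Seeding the random generator keeps every generated program reproducible, which matters when a generated test exposes a bug.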
Then you modify the BNF for invalid data and likewise the generator to understand those rules.
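One cheap way to "modify the BNF for invalid data" is to delete a symbol from a rule and generate from the damaged grammar. The Python sketch below assumes a hypothetical three-rule grammar and hypothetical helper names (`drop_symbol`, `expand`); a real tool would still have to confirm that each output is actually invalid, since some damaged rules can produce text that still parses.

```python
import copy

# Hypothetical, non-recursive toy grammar: one alternative per rule.
GRAMMAR = {
    "if_stmt": [["if", "(", "condition", ")", "stmt"]],
    "condition": [["NAME", "<", "NUMBER"]],
    "stmt": [["NAME", "=", "NUMBER", ";"]],
}


def drop_symbol(grammar, nonterminal, alt_index, position):
    """Return a copy of grammar with one symbol removed from one alternative."""
    broken = copy.deepcopy(grammar)
    del broken[nonterminal][alt_index][position]
    return broken


def expand(grammar, symbol):
    """Expand symbol using the first alternative of each rule."""
    if symbol not in grammar:
        return [symbol]
    tokens = []
    for part in grammar[symbol][0]:
        tokens.extend(expand(grammar, part))
    return tokens


valid = expand(GRAMMAR, "if_stmt")
# Remove the ')' from the if rule to get a negative test case.
invalid = expand(drop_symbol(GRAMMAR, "if_stmt", 0, 3), "if_stmt")
print(" ".join(valid))    # if ( NAME < NUMBER ) NAME = NUMBER ;
print(" ".join(invalid))  # if ( NAME < NUMBER NAME = NUMBER ;
```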
After that is working you can then start layering on levels of automation.
If you do go this route and decide that you will have to learn Prolog, take a look at Mercury first. I have not done this with Mercury, but if I do it again Mercury is high on the list.
While my actual code is not public, this and this is the closest to it that is public.
Along the way I had some fun with it in Code Golf.
When generating terminals such as reserved words or values for types, you can use a predefined list with both valid and invalid data. For example, for if, if the language is case sensitive, I would include if, If, IF, iF, etc. in the list. For value types such as unsigned byte I would include -1, 0, 255 and 256.
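Such lists do not have to be typed by hand. A small Python sketch, with hypothetical helper names, that derives the case variants of a keyword and the boundary values of an unsigned type:

```python
import itertools


def case_variants(word):
    """All upper/lower-case spellings of word, sorted."""
    choices = [(ch.lower(), ch.upper()) for ch in word]
    return sorted({"".join(p) for p in itertools.product(*choices)})


def unsigned_boundaries(bits):
    """Values on and just outside the edges of an unsigned range."""
    top = 2 ** bits - 1
    return [-1, 0, top, top + 1]


print(case_variants("if"))     # ['IF', 'If', 'iF', 'if']
print(unsigned_boundaries(8))  # [-1, 0, 255, 256]
```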
When I was testing basic binary math expressions with +, -, * and ^, I generated all the tests using the basic numbers -2, -1, 0, 1, and 2. I thought it would be useless since I already had hundreds of test cases, but generating all of the test cases took only a few minutes and running them took several hours, and to my surprise they found a pattern I had not covered. The point here is that, contrary to what most people say about having too many test cases, it is only computer time and a few changed constraints, so do run the large number of tests.
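Exhaustive generation over such a small domain is a few lines of code. A Python sketch, assuming the operator spellings and the parenthesised output format shown here (both are illustrative choices, not the original tool's):

```python
import itertools

OPERATORS = ["+", "-", "*", "^"]
NUMBERS = [-2, -1, 0, 1, 2]


def binary_cases():
    """Yield every 'a op b' expression over the small number set."""
    for a, op, b in itertools.product(NUMBERS, OPERATORS, NUMBERS):
        yield f"({a}) {op} ({b})"


cases = list(binary_cases())
print(len(cases))  # 100
print(cases[0])    # (-2) + (-2)
```

Chaining a second operator multiplies the count by another 4 × 5, giving 2000 cases, which is still cheap to generate.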

Related

Computation of dependencies (related to the K prelude)

I'm particularly interested to understand the K prelude (how it is structured, why its content is like that, how "kompile" calculates dependencies, etc).
The main question is: what is the criterion for a hooked symbol from the K prelude to be copied into the generated Kore file?
Here are some examples of potential problems:
The symbol andBool is copied with its associated rewrite rules, which does not seem to be the case for the symbol in_keys, which is simply copied without its rewrite rules.
Other symbols seem to be useless (for the IMP semantics) but exist, with or without their rewrite rules, in the generated Kore file, such as countAllOccurrences, findChar, signExtendBitRangeInt or Float2String.
It seems that SortId is generated by the line syntax Id [token]. However, the lines syntax Bool ::= "true" [token] and syntax Bool ::= "false" [token] do not generate true and false symbols.
(Moreover, is it a choice that true and false are values and not constructors?)
The sort named SortId is not generated for the following example, whereas some generated hooked symbols depend on this sort. This problem does not exist with the IMP semantic.
module MAX-OW-SYNTAX
imports INT
imports BOOL
syntax Exp ::= Int | "(" Exp ")" [bracket]
| "max" Exp Exp
endmodule
module MAX-OW
imports MAX-OW-SYNTAX
syntax KResult ::= Int
rule max X Y => Y requires X <Int Y
rule max X _ => X [owise]
endmodule
Is it correct that the K prelude is implemented in each language of each backend, and that an implementation in the Kore language is available in the K prelude?
Do you have the necessary interface to implement for a new backend? (For instance, Bag is obsolete, but not Set, List and Map, but I don't know the list of set operators, map operators, etc. that the new backend must provide.)
Is there a reason why andThenBool and andBool have the same semantics once implemented in the Kore syntax (Booleans module)?
Where are the rewrite rules defined for ==Bool, used in the definition of =/=Bool (Booleans module)?
The best reference point for the K internals is the User Manual, along with the K source for the prelude. To respond to your specific questions as best as I can:
in_keys only has simplification rules that apply on symbolic backends. These will not apply on concrete backends, and so those backends use the hooked implementation MAP.in_keys. Some functions (such as andBool) can be implemented both in K and as an efficient backend hook. For example, on the K LLVM backend, andBool is implemented by code generation. If a backend didn't support that hook, the (relatively) inefficient K rewriting implementation would be used.
The Id sort is built in for convenience. It represents program identifiers.
You haven't imported DOMAINS in this example. Doing so will pull in the Id sort and related rewrites.
Very roughly, and largely for internal purposes. Do you have a hypothetical K backend in mind, or is there a way in which the LLVM / Haskell backends provided by K are inadequate for your specific use case?
andThenBool is required to short-circuit its arguments; andBool is permitted to short-circuit, but may evaluate both arguments strictly. An implementation that makes both perform short-circuiting is valid.
==Bool is implemented only in terms of a hook. In domains.md, you can see the hook(BOOL.eq) attribute that indicates how ==Bool is implemented.
Do let us know if you have further questions, or would like help implementing a specific semantics in K.

How to modify parsing grammar to allow assignment and non-assignment statements?

So the question is about the grammar below. I'm working on a mini-interpreted language for fun (we learned about some compiler design in class, so I want to take it to the next level and try something on my own). I'm stuck trying to make the non-terminal symbol Expr.
Statement ::= Expr SC
Expr ::= /* I need help here */
Assign ::= Name EQUAL Expr
AddSub ::= MulDiv {(+|-) AddSub}
MulDiv ::= Primary {(*|/) MulDiv}
Primary ::= INT | FLOAT | STR | LP Expr RP | Name
Name ::= ID {. Name}
Expr has to be made such that Statement must allow for the two cases:
x = 789; (regular assignment, followed by semicolon)
x+2; (no assignment, just calculation, discarded; followed by a semicolon)
The purpose of the second case is to set up the foundation for more changes in the future. I was thinking about unary increment and decrement operators, and also function calls, both of which don't require assignment to be meaningful.
I've looked at other grammars (C# namely), but it was too complicated and lengthy to understand. Naturally I'm not looking for solutions, but only for guidance on how I could modify my grammar.
All help is appreciated.
EDIT: I should say that my initial thought was Expr ::= Assign | AddSub, but that wouldn't work since it would create ambiguity since both could start with the non-terminal symbol Name. I have made my tokenizer such that it allows one token look ahead (peek), but I have not made such a thing for the non terminals, since it would be trying to fix a problem that could be avoided (ambiguity). In the grammar, the terminals are the ones that are all-caps.
The simplest solution is the one actually taken by the designers of C, and thus by the various C derivatives: treat assignment simply as yet another operator, without restricting it to being at the top-level of a statement. Hence, in C, the following is unproblematic:
while ((ch = getchar()) != EOF) { ... }
Not everyone will consider that good style, but it is certainly common (particularly in the clauses of the for statement, whose syntax more or less requires that assignment be an expression).
There are two small complications, which are relatively easy to handle:
Logically, and unlike most operators, assignment associates to the right so that a = b = 0 is parsed as a = (b = 0) and not (a = b) = 0 (which would be highly unexpected). It also binds very weakly, at least to the right.
Opinions vary as to how tightly it should bind to the left. In C, for the most part a strict precedence model is followed, so that a = 2 + b = 3 is rejected since it is parsed as a = ((2 + b) = 3). a = 2 + b = 3 might seem like terrible style, but consider also a < b ? (x = a) : (y = a). In C++, where the result of the ternary operator can be a reference, you could write that as (a < b ? x : y) = a, in which the parentheses are required even though assignment has lower precedence than the ternary operator.
None of these options are difficult to implement in a grammar, though.
In many languages, the left-hand side of an assignment has a restricted syntax. In C++, which has reference values, the restriction could be considered semantic, and I believe it is usually implemented with a semantic check, but in many C derivatives lvalue can be defined syntactically. Such definitions are unambiguous, but they are often not amenable to parsing with a top-down grammar, and they can create complications even for a bottom-up grammar. Doing the check post-parse is always a simple solution.
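Both complications can be seen in a small recursive descent sketch, written in Python for brevity. The tokenizer, the `Parser` class and the tuple-based tree are hypothetical; the point is only that recursion on the right operand gives right associativity, and that the assignability check is done on the already-parsed left operand.

```python
import re

TOKEN = re.compile(r"\s*(\d+|\w+|[-+=()])")


def tokenize(text):
    return TOKEN.findall(text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def assignment(self):
        left = self.additive()
        if self.peek() == "=":
            self.take()
            # Check the target after parsing it: only a bare name is assignable.
            if not (isinstance(left, str) and left.isidentifier()):
                raise SyntaxError("left-hand side is not assignable")
            # Recursing here (instead of looping) makes '=' right-associative.
            return ("=", left, self.assignment())
        return left

    def additive(self):
        left = self.primary()
        while self.peek() in ("+", "-"):  # looping makes + and - left-associative
            op = self.take()
            left = (op, left, self.primary())
        return left

    def primary(self):
        tok = self.take()
        if tok == "(":
            inner = self.assignment()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if tok is None or tok in ("+", "-", "=", ")"):
            raise SyntaxError("expected a value")
        return tok


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.assignment()
    if parser.peek() is not None:
        raise SyntaxError("unexpected trailing input")
    return tree


print(parse("a = b = 0"))  # ('=', 'a', ('=', 'b', '0'))
```

With this sketch, `parse("a = 2 + b = 3")` raises `SyntaxError`, mirroring the C behaviour described above.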
If you really want to distinguish assignment statements from expression statements, then you indeed run into the problem of prediction failure (not ambiguity) if you use a top-down parsing technique such as recursive descent. Since the grammar is not ambiguous, a simple solution is to use an LALR(1) parser generator such as bison/yacc, which has no problems parsing such a grammar since it does not require an early decision as to which kind of statement is being parsed. On the whole, the use of LALR(1) or even GLR parser generators simplifies implementation of a parser by allowing you to specify a grammar in a form which is easily readable and corresponds to the syntactic analysis. (For example, an LALR(1) parser can handle left-associative operators naturally, while a LL(1) grammar can only produce right-associative parses and therefore requires some kind of reconstruction of the syntax tree.)
A recursive descent parser is a computer program, not a grammar, and its expressiveness is thus not limited by the formal constraints of LL(1) grammars. That is both a strength and a weakness: the strength is that you can find solutions which are not limited by the limitations of LL(1) grammars; the weakness is that it is much more complicated (even, sometimes, impossible) to extract a clear statement about the precise syntax of the language. This power, for example, allows recursive descent parsers to handle left associativity in a more-or-less natural way despite the restriction mentioned above.
If you want to go down this road, then the solution is simple enough. You will have some sort of function:
/* This function parses and returns a single expression */
Node expr() {
Node left = value();
while (true) {
switch (lookahead) {
/* handle each possible operator token. I left out
* the detail of handling operator precedence since it's
* not relevant here
*/
case OP_PLUS: {
accept(lookahead);
left = MakeNode(OP_PLUS, left, value());
break;
}
/* If no operator found, return the current expression */
default:
return left;
}
}
}
That can easily be modified to parse both expressions and statements. First, refactor the function so that it parses the "rest" of an expression, given the first operand. (The only change is a new prototype and the deletion of the first line in the body.)
/* This function parses and returns a single expression
* after the first value has been parsed. The value must be
* passed as an argument.
*/
Node expr_rest(Node left) {
while (true) {
switch (lookahead) {
/* handle each possible operator token. I left out
* the detail of handling operator precedence since it's
* not relevant here
*/
case OP_PLUS: {
accept(lookahead);
left = MakeNode(OP_PLUS, left, value());
break;
}
/* If no operator found, return the current expression */
default:
return left;
}
}
}
With that in place, it is straightforward to implement both expr and stmt:
Node expr() {
return expr_rest(value());
}
Node stmt() {
/* Check lookahead for statements which start with
* a keyword. Omitted for simplicity.
*/
/* either first value in an expr or target of assignment */
Node left = value();
switch (lookahead) {
case OP_ASSIGN: {
accept(lookahead);
return MakeAssignment(left, expr());
}
/* Handle += and other mutating assignments if desired */
default: {
/* Not an assignment, just an expression */
return MakeExpressionStatement(expr_rest(left));
}
}
}
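For readers who want something they can run, here is the same idea as a self-contained Python sketch. It handles only +, =, and ; over a pre-split token list, and the tuple node shapes ("assign", "expr_stmt") are hypothetical stand-ins for MakeAssignment and MakeExpressionStatement.

```python
def parse_stmt(tokens):
    """Parse 'target = expr ;' or 'expr ;' from a list of token strings."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def value():
        tok = take()
        if tok is None or tok in ("+", "=", ";"):
            raise SyntaxError("expected a value")
        return tok

    def expr_rest(left):
        # Parse the remainder of an expression whose first value is `left`.
        while peek() == "+":
            take()
            left = ("+", left, value())
        return left

    def expr():
        return expr_rest(value())

    # Either the first value of an expression or the target of an assignment.
    left = value()
    if peek() == "=":
        take()
        node = ("assign", left, expr())
    else:
        node = ("expr_stmt", expr_rest(left))
    if take() != ";":
        raise SyntaxError("expected ';'")
    return node


print(parse_stmt(["x", "=", "789", ";"]))  # ('assign', 'x', '789')
print(parse_stmt(["x", "+", "2", ";"]))    # ('expr_stmt', ('+', 'x', '2'))
```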

auto generate code based on a packet structure

We have data packets with different structures. They are supposed to be read/written in different languages. Example:
| ClassId | Data |
ClassId = "datapoint" (Data structure):
temperature - 1bytes
elevation - 2bytes
gradient - 1bytes
ClassId = "config" (Data structure):
frequency - 1bytes
deviceId - 3bytes
ClassId = "accelerometer" (Data structure):
time - 2bytes
x - 2bytes
y - 2bytes
z - 2bytes
Rather than manually writing the code that parses each data packet based on its class (which is error prone and time-consuming), I would expect to have a configuration file and then the code (python/c/etc.) is generated automatically that can read and write packets. Something along these lines:
lib.set(packet, "datapoint", {
elevation: 933,
temperature: 18,
gradient: 20
});
lib.get(packet, "datapoint");
=>
{
elevation: 933,
temperature: 18,
gradient: 20
}
Googling it did not bring me anywhere. Any pointers would be very helpful.
You need a code generation system, that compiles a packet specification into code to parse/unparse the packets.
You can build one ad hoc using a parser generator, and write ad hoc code to procedurally walk a parse tree and spit out relevant code.
Or you can use a program transformation system (PTS), which treats your packet specifications like source code, and transforms to source code in your target language. You explain the syntax of the packets to the PTS pretty much the same way you explain the syntax to a parser generator.
But with a PTS, you can write transformation rules in surface syntax notation that recognize the packet-system syntax and map it to target language function syntax. That makes writing and maintaining such a tool a lot easier, especially if the packet syntax changes, and/or you change the target language infrastructure to parse the packets in different ways.
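For comparison, the ad hoc route can be sketched in a few dozen lines of Python: parse the specification text with regular expressions and emit reader functions as source code. Everything here is an assumption made for illustration: the spec format is copied from the question, fields are treated as big-endian unsigned integers, and the names `parse_spec`, `emit_reader` and `read_<class>` are hypothetical.

```python
import re

SPEC = '''
ClassId = "datapoint" (Data structure):
temperature - 1bytes
elevation - 2bytes
gradient - 1bytes
ClassId = "config" (Data structure):
frequency - 1bytes
deviceId - 3bytes
'''


def parse_spec(text):
    """Return {class_name: [(field, size_in_bytes), ...]}."""
    packets, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        header = re.fullmatch(r'ClassId\s*=\s*"(\w+)".*', line)
        if header:
            current = packets.setdefault(header.group(1), [])
            continue
        member = re.fullmatch(r"(\w+)\s*-\s*(\d+)\s*bytes", line)
        if member and current is not None:
            current.append((member.group(1), int(member.group(2))))
    return packets


def emit_reader(name, fields):
    """Emit Python source for a function decoding one packet body."""
    lines = [f"def read_{name}(data):", "    out, pos = {}, 0"]
    for field, size in fields:
        # Assumption: every field is a big-endian unsigned integer.
        lines.append(
            f"    out[{field!r}] = int.from_bytes(data[pos:pos + {size}], 'big')"
        )
        lines.append(f"    pos += {size}")
    lines.append("    return out")
    return "\n".join(lines)


source = "\n\n".join(emit_reader(n, f) for n, f in parse_spec(SPEC).items())
print(source)

# Load the generated code and try it on a hand-built packet body.
namespace = {}
exec(source, namespace)
print(namespace["read_datapoint"](bytes([18, 0x03, 0xA5, 20])))
```

Emitting C or another target instead means changing only the templates in `emit_reader`, which is exactly the part that becomes hard to maintain as the number of targets grows.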
EDIT 10/3: OP asks for a concrete example, presumably with a PTS.
I'm going to show what this looks like for our DMS Software Reengineering Toolkit (see bio for more about DMS).
First you need a (DMS-compatible) grammar for the packet languages. Based on what I see, it is pretty simple:
Packets = Packet ;
Packets = Packets Packet ; -- allow a list of packet definitions
Packet = 'ClassID' '=' STRING members ;
members = ;
members = members member ; -- allow list of members
member = IDENTIFIER '-' NATURAL 'bytes' ;
I think this grammar is naive in that packet members in practice may have different types (maybe strings, floats, booleans, ...); OP's example only shows what I'm assuming are N-byte binary integer numbers. You also need grammars for your various target languages. I'm going to assume you have these grammars (and that's quite the assumption); let's work with C for the moment. [DMS does have many of these].
We also have to assume a representation of the transmitted data. OP suggests something but I think he is trying to hint at generated code ("lib.set...").
Instead I'm going to assume that packet content is being read from a Stream as binary bytes simply appended together; this makes for the smallest packet size possible and thus fast transmission times.
So, now we need to specify our code generator as a set of rewrite rules that map packet definitions to code.
For background, a rewrite rule for a PTS typically looks like this:
if you see *this*, replace it by *that*
So you are essentially replacing one structure by another. These typically operate on ASTs, but use surface syntax for this and that for readability.
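The "if you see this, replace it by that" idea can be illustrated outside any particular PTS. The Python sketch below is not DMS: it uses plain tuples as a hypothetical AST and a single hand-written rule function, applied bottom-up, that replaces each member node with a line of C text. `ReadNByteValue` is the helper name used in this answer's generated code.

```python
def rewrite(tree, rule):
    """Apply rule bottom-up: rewrite the children first, then the node itself."""
    if isinstance(tree, tuple):
        tree = tuple(rewrite(child, rule) for child in tree)
    return rule(tree)


def member_to_c(node):
    # If you see ('member', name, size), replace it by a C assignment.
    if isinstance(node, tuple) and len(node) == 3 and node[0] == "member":
        _, name, size = node
        return f"p->{name} = ReadNByteValue(stream, {size});"
    return node  # anything else is left unchanged


ast = ("members", ("member", "temperature", 1), ("member", "elevation", 2))
print(rewrite(ast, member_to_c))
```

A real PTS adds what this sketch lacks: patterns written in the surface syntax of both languages and type checking of the pattern variables against the grammar.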
What follows are DMS's source to source rewrite rules; they look like they operate on text but in fact they operate on ASTs produced by DMS's parser.
DMS has its own syntax for rules but it essentially follows the typical style above:
rule rule_name( pattern_variables ):
source_syntax_category -> target_syntax_category =
" this_pattern " -> " that_pattern " ;
Source and target patterns are enclosed in metaquotes " "; actual literal quote characters are thus escaped as \".
For DMS rules this is always a fragment of the Packet notation, and that is always a fragment of our chosen target language (C). Pattern variable names in the rule head are given syntactic types and can only match the corresponding type in the AST. Pattern variables found inside metaquotes are written as \variable. Metafunctions can compute derived results; they are invoked inside patterns as "\function( args )". See DMS Rewrite Rules for more details.
source domain Packet; -- the little language we defined
target domain C; -- what we will generate code for
-- you'll write one of these rulesets for each target language
rule top_level(pl: Packets): Packets -> Statements =
" \pl "
-> " ReadPacketType(stream, packet_type);
switch(packet_type) {
\pl
default: Panic(\"unrecognized packet type\");
}" if IsRoot(pl); -- do this once [at root of tree]
rule translate_packet_definitions(p: Packet, pl: packet_list): Packets -> switch_case_list
" \p \pl ";
rule translate_packet_definition(s:STRING, ms: members, pl: Packets): Packets -> switch_case =
" ClassID = \s \ms \pl "
-> " case \concatenate\(\"enum_\"\,\string_to_identifier\(\s\)\): {
\string_to_identifier\(\s\)* p=malloc(sizeof(\string_to_identifier\(\s\)));
\ms
return p;
}
";
rule translate_members(m: member, ms: members) : members -> Statements
= " \m \ms ";
rule translate_member(i: IDENTIFIER, n: NATURAL): member -> StatementList =
" \i - \n bytes " ->
" p-> \toCIdentifier\(\i\) = ReadNByteValue(stream,\toCNatural\(\n\)) ; ";
This isn't complete (in particular, I need another set of rules to generate the enum declaration for the set of packet types), and I doubt it is exactly right, but it gives the flavor of the rules. With these rules, OP's example input would generate this C code:
ReadPacketType(stream, packet_type);
switch(packet_type) {
case enum_datapoint: {
datapoint* p=malloc(sizeof(datapoint));
p->temperature=ReadNByteValue(stream,1);
p->elevation=ReadNByteValue(stream,2);
p->gradient=ReadNByteValue(stream,1);
return p;
}
case enum_config: {
config* p=malloc(sizeof(config));
p->frequency=ReadNByteValue(stream,1);
p->deviceId=ReadNByteValue(stream,3);
return p;
}
case enum_accelerometer: {
accelerometer* p=malloc(sizeof(accelerometer));
p->time=ReadNByteValue(stream,2);
p->x=ReadNByteValue(stream,2);
p->y=ReadNByteValue(stream,2);
p->z=ReadNByteValue(stream,2);
return p;
}
default: Panic("unrecognized packet type");
}

Is there a benefit/penalty in record modification?

In a functional program I have an API that provides functions on complex state implemented as a record:
let remove_number nr {counter ; numbers ; foo } = {counter ; numbers = IntSet.remove nr numbers ; foo}
let add_fresh {counter ; numbers ; foo } = { counter = counter + 1 ; numbers = IntSet.add counter numbers ; foo }
I know, I can use the simplified record modification syntax like this:
let remove_number nr state = { state with numbers = IntSet.remove nr state.numbers }
When the record type grows, the latter style is actually more readable. Hence, I will probably use it anyway. But out of curiosity, I wonder whether it also allows the compiler to detect possible memory reuse more easily (my application is written in a monadic style, so there will usually only be one record that is passed along, hence an optimizing compiler could remove all allocations but one and do in-place mutation instead). In my limited view, the with-syntax gives a good heuristic for places to apply such an optimization, but is that true?
Does OCaml even optimize (unneeded) record allocations?
Is the record modification syntax lowered before any optimizations apply?
And finally, is there any pattern recognition implemented in the OCaml compiler that tells it that there is a "cheap" way to create one record expression by modifying a "dead" value in place (and what is that optimization usually called)?
The two versions of remove_number that you give are equivalent. The { expr with ... } notation doesn't modify a record. It creates a new record.
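Record modification looks like this (the numbers field must be declared mutable, and rec cannot be used as a name because it is a keyword):
let remove_number nr state = state.numbers <- IntSet.remove nr state.numbers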
I don't think OCaml does the sort of optimization you describe. The plan with OCaml is to generate code that's close to what you write.
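The difference between the two forms can be observed directly. As an analogy only (this is Python, not OCaml), `dataclasses.replace` plays the role of `{ state with ... }`: it builds a new object and leaves the old one untouched. The `State` class and its field types are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    counter: int
    numbers: frozenset
    foo: str


def remove_number(nr, state):
    # Counterpart of `{ state with numbers = ... }`: returns a fresh object.
    return replace(state, numbers=state.numbers - {nr})


old = State(counter=3, numbers=frozenset({1, 2}), foo="x")
new = remove_number(2, old)
print(new is old)           # False
print(sorted(old.numbers))  # [1, 2]
print(sorted(new.numbers))  # [1]
```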

Can ANTLR return Lines of Code when lexing?

I am trying to use ANTLR to analyse a large set of code using the full Java grammar. Since ANTLR needs to open all the source files and scan them, I am wondering if it can also return lines of code.
I checked the API for Lexer and Parser, and it seems they do not return LoC. Is it easy to instrument the grammar rules a bit to get LoC? The full Java grammar is complicated, and I don't really want to mess with a large part of it.
If you have an existing ANTLR grammar, and want to count certain things during parsing, you could do something like this:
grammar ExistingGrammar;
// ...
@parser::members {
public int loc = 0;
}
// ...
someParserRule
: SomeLexerRule someOtherParserRule {loc++;}
;
// ...
So, whenever your parser encounters a someParserRule, you increase loc by one by placing {loc++;} after (or before) the rule.
So, whatever your definition of a line of code is, simply place {loc++;} in the rule to increase the counter. Be careful not to increase it twice:
statement
: someParserRule {loc++;}
| // ...
;
someParserRule
: SomeLexerRule someOtherParserRule {loc++;}
;
EDIT
I just noticed that in the title of your question you asked if this can be done during lexing. That won't be possible. Let's say a LoC would always end with a ';'. During lexing, you wouldn't be able to make a distinction between a ';' after, say, an assignment (which is a single LoC), and the 2 ';'s inside a for(int i = 0; i < n; i++) { ... } statement (which wouldn't be 2 LoC).
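The over-counting is easy to demonstrate without ANTLR. A Python sketch, assuming the hypothetical "one ';' per line of code" definition from the paragraph above:

```python
def count_semicolons(source):
    """Lexer-level approximation: every ';' counts as one line of code."""
    return source.count(";")


snippet = "x = 1;\nfor (int i = 0; i < n; i++) { y = y + i; }\n"
print(count_semicolons(snippet))  # 4
```

A parser would see three statements here (the assignment, the for statement, and the assignment in its body), while the token-level count reports four.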
In the C target the data structure ANTLR3_INPUT_STREAM has a getLine() function which returns the current line from the input stream. It seems the Java version of this is CharStream.getLine(). You should be able to call this at any time and get the current line in the input stream.
Use a visitor to visit the CompilationUnit context, then context.stop.getLine() will give you the last line number of the compilation unit context.
@Override public Integer visitCompilationUnit(@NotNull JAVAParser.CompilationUnitContext ctx) {
return ctx.stop.getLine();
}