In a recent question (How to define (and name) the corresponding safe term comparison predicates in ISO Prolog?) #false asked for an implementation of the term ordering predicate lt/2, a variant of the ISO built-in (@<)/2.
The truth value of lt(T1,T2) is to be stable w.r.t. arbitrary variable bindings in T1 and T2.
In various answers, different implementations (based on implicit or explicit term traversal) were proposed. Some caveats and hints were raised in comments, as were counterexamples.
So my question: how can candidate implementations be tested? Some brute-force approach? Or something smarter?
In any case, please share your automatic testing machinery for lt/2! It's for the greater good!
There are two testing strategies: validation and verification.
Validation: The testing process is always the same. First you need a specification of what you want to test. Second you need an implementation of what you want to test.
Then from the implementation you extract the code execution paths, and for each execution path you derive the desired outcome from the specification.
Then you write test cases combining each execution path with its desired outcome. Do not only test positive paths, but also negative paths.
If your code is recursive, there are in theory infinitely many execution paths.
But you might find that a sub-recursion asks more or less the same question as an existing test case, so in many cases a finite test set suffices.
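To make that concrete, here is a small illustration (a sketch of my own, assuming Java with JUnit 5; the recursive helper is made up): each execution path of a recursive membership check gets one test, and deeper recursions just repeat the same paths.

import static org.junit.jupiter.api.Assertions.*;
import org.junit.jupiter.api.Test;
import java.util.List;

// Hypothetical recursive predicate, used only to illustrate path-based test derivation.
class Member {
    static boolean member(int x, List<Integer> xs) {
        if (xs.isEmpty()) return false;              // path 1: base case (negative)
        if (xs.get(0) == x) return true;             // path 2: head match (positive)
        return member(x, xs.subList(1, xs.size()));  // path 3: recursive case
    }
}

class MemberTest {
    @Test void emptyListIsNegativePath()    { assertFalse(Member.member(1, List.of())); }
    @Test void headMatchIsPositivePath()    { assertTrue(Member.member(1, List.of(1, 2))); }
    @Test void recursivePathReducesToHead() { assertTrue(Member.member(2, List.of(1, 2))); }
    // Deeper recursions only exercise the same three paths again,
    // which is why a finite suite suffices here.
}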
Validation gives you confidence.
Verification: You would use some formal methods to derive the correctness of your implementation from the specification.
Verification gives you 100% assurance.
I'm working on a streaming rules engine, and some of my customers have a few hundred rules they'd like to evaluate on every event that arrives at the system. The rules are pure (i.e. non-side-effecting) Boolean expressions, and they can be nested arbitrarily deeply.
Customers are creating, updating and deleting rules at runtime, and I need to detect and adapt to the population of rules dynamically. At the moment, the expression evaluation uses an interpreter over the internal AST, and I haven't started thinking about codegen yet.
As always, some of the predicates in the tree are MUCH cheaper to evaluate than others, and I've been looking for an algorithm or data structure that makes it easier to find the predicates that are cheap, and that are validly interpretable as controlling the entire expression. My mental headline for this pattern is "ANDs all the way to the root", i.e. any predicate for which all ancestors are ANDs can be interpreted as controlling.
Despite several days of literature search, reading about ROBDDs, CNF, DNF, etc., I haven't been able to close the loop from what might be common practice in the industry to my particular use case. One thing I've found that seems related is "Analysis and Optimization for Boolean Expression Indexing",
but it's not clear how I could apply it without implementing the BE-Tree data structure myself, as there doesn't seem to be an open source implementation.
I keep half-jokingly mentioning to my team that we're going to need a SAT solver one of these days. 😅 I guess it would probably suffice to write a recursive algorithm that traverses the tree and keeps track of whether every ancestor is an AND or an OR, but I keep getting the "surely this is a solved problem" feeling. :)
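A minimal sketch of that traversal (my own illustration, assuming Java 21 and a hypothetical Node/And/Or/Predicate AST) might look like this:

import java.util.ArrayList;
import java.util.List;

// Hypothetical AST: a Node is either an And/Or over children, or a leaf Predicate.
sealed interface Node permits And, Or, Predicate {}
record And(List<Node> children) implements Node {}
record Or(List<Node> children) implements Node {}
record Predicate(String name, double cost) implements Node {}

final class ControllingPredicates {
    // Collect predicates whose ancestors are all ANDs: if any of these is
    // false, the whole expression is false, so cheap ones can be tried first.
    static List<Predicate> collect(Node root) {
        List<Predicate> out = new ArrayList<>();
        walk(root, true, out);
        return out;
    }

    private static void walk(Node n, boolean allAncestorsAnd, List<Predicate> out) {
        switch (n) {
            case Predicate p -> { if (allAncestorsAnd) out.add(p); }
            case And a -> a.children().forEach(c -> walk(c, allAncestorsAnd, out));
            case Or o -> o.children().forEach(c -> walk(c, false, out));
        }
    }
}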
Edit: After talking to a couple of friends, I think I may have a sketch of a solution!
Transform the expressions into Conjunctive Normal Form, in which, by definition, every node is in a valid short-circuit position.
Use the Tseitin transformation to try to avoid exponential blowup in expression size as a result of the CNF transform.
For each AND in the tree, sort its operands in ascending order of cost (i.e. cheapest to the left)
???
Profit!^Weval as usual :)
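A sketch of step 3 (reusing the hypothetical Node/And/Or/Predicate AST from the sketch above; the cost model is made up, and the children lists are assumed mutable) might look like:

import java.util.Comparator;

final class CostOrdering {
    // Order each AND's children so the cheapest predicates are evaluated
    // (and can short-circuit) first; recurse into the whole tree.
    static void sortByCost(Node n) {
        switch (n) {
            case And a -> {
                a.children().sort(Comparator.comparingDouble(CostOrdering::estimate));
                a.children().forEach(CostOrdering::sortByCost);
            }
            case Or o -> o.children().forEach(CostOrdering::sortByCost);
            case Predicate p -> { }
        }
    }

    // Hypothetical cost model: a predicate's own cost, or the sum over children.
    static double estimate(Node n) {
        return switch (n) {
            case Predicate p -> p.cost();
            case And a -> a.children().stream().mapToDouble(CostOrdering::estimate).sum();
            case Or o -> o.children().stream().mapToDouble(CostOrdering::estimate).sum();
        };
    }
}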
You should seriously consider compiling the rules (and the predicates). An interpreter is 10-50x slower than machine code for the same thing. This is a good idea if the rule set doesn't change very often. It's even a good idea if the rules can change dynamically, because in practice they still don't change very fast, although now your rule compiler has to be online. That just makes for a bigger application program, and memory isn't much of an issue anymore.
A Boolean expression evaluation using individual machine instructions is even better. Any complex Boolean equation can be compiled into a branchless sequence of individual machine instructions over the leaf values. No branches, no cache misses; this stuff runs pretty damn fast. However, if you have expensive predicates, you probably want to compile code with branches that skip subtrees which cannot affect the result of the expression, if those subtrees contain expensive predicates.
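For illustration (a sketch I'm adding, with a made-up formula): once the leaf truth values are materialized as 0/1 ints, AND/OR/NOT reduce to plain bitwise instructions with no jumps.

// Hypothetical compiled form of ((a AND b) OR (NOT c AND d)) over 0/1 leaves.
final class Branchless {
    static int eval(int a, int b, int c, int d) {
        int t1 = a & b;
        int t2 = (c ^ 1) & d;   // NOT c expressed as XOR with 1
        return t1 | t2;         // no branches anywhere in the evaluation
    }
}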
Within reason, you can generate any equivalent form (I'd run screaming into the night at the idea of using CNF, because it always blows up on you). What you really want is the shortest Boolean equation (deepest expression tree) equivalent to what the clients provided, because that will take the fewest machine instructions to execute. This may sound crazy, but you might consider exhaustive-search code generation, i.e., literally trying every combination that has a chance of working, especially if the number of operators in the equation is relatively small. The VLSI world has been working hard on various optimizations when synthesizing Boolean equations into gates. You should look into the Espresso heuristic logic minimizer (https://en.wikipedia.org/wiki/Espresso_heuristic_logic_minimizer).
One thing that might drive your expression evaluation is literally the cost of the predicates. If I have the formula A AND B, and I know that A is expensive to evaluate and usually returns true, then clearly I want to evaluate B AND A instead.
You should consider common subexpression elimination, so that any common subterm is only computed once. This is especially important when one has expensive predicates; you never want to evaluate the same expensive predicate twice.
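A minimal sketch of such per-event memoization (class and method names are made up):

import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// One context per incoming event: a predicate shared by a hundred rules
// is still evaluated at most once for that event.
final class MemoizingContext<E> {
    private final Map<String, Boolean> cache = new HashMap<>();
    private final E event;

    MemoizingContext(E event) { this.event = event; }

    // 'key' identifies the shared subterm, e.g. a canonicalized expression string.
    boolean eval(String key, Predicate<E> expensive) {
        return cache.computeIfAbsent(key, k -> expensive.test(event));
    }
}

The idea is that all rules evaluate through the same context for a given event, so the cache naturally implements "never evaluate the same expensive predicate twice".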
I implemented these tricks in a PLC emulator (these are basically machines that evaluate buckets [like hundreds of thousands] of boolean equations telling factory actuators when to move) using x86 machine instructions for AND/OR/NOT for Rockwell Automation some 20 years ago. It outran Rockwell's "premier" PLC which had custom hardware but was essentially an interpreter.
You might also consider incremental evaluation of the equations. The basic idea is not to re-evaluate all the equations over and over, but rather to re-evaluate only those equations whose input changed. Details are too long to include here, but a patent I did back then explains how to do it. See https://patents.google.com/patent/US5623401A/en?inventor=Ira+D+Baxter&oq=Ira+D+Baxter
In behavior-based testing, it looks like the number of error scenarios grows exponentially.
As per Aslak Hellesøy, BDD was created to combine automated acceptance tests, functional requirements and software documentation.
In 2003 I became part of a small clique of people from the XP community who were exploring better ways to do TDD. Dan North named this BDD. The idea was to combine automated acceptance tests, functional requirements and software documentation into one format that would be understandable by non-technical people as well as testing tools.
Software development teams use JBehave as a tool for BDD testing (thanks to Dan North).
As there can be many possible negative options, the number of negative scenarios in a JBehave test suite can grow considerably. The time taken to run the test suite, as well as to modify the product, increases with this kind of growth in scenarios. In particular, I feel that it is becoming hard to maintain as documentation of the product.
I am not exactly sure whether this is an abuse of BDD/JBehave concepts due to misunderstandings by different teams, or maybe that is the way it should be.
Let me explain this concern with an example.
Say an application has a behavior to order an item via a REST service.
PUT /order
{
// JSON body with 3 mandatory parameters and 2 optional parameters
}
Happy scenarios
Invoke REST endpoint with correct values for all 3 mandatory parameters
Invoke REST endpoint with correct values for all 5 parameters
Negative scenarios
There are a lot of negative scenarios that we can come up with.
Input value based scenarios
Mandatory parameter 1 is set to null, with correct values for other two mandatory parameters (3 possible scenarios with each mandatory parameter)
Mandatory parameter 1 is set to empty, with correct values for other two mandatory parameters (3 possible scenarios with each mandatory parameter)
Mandatory parameter 1 is set to a value in invalid format, with correct values for other two mandatory parameters (3 possible scenarios with each mandatory parameter)
Mandatory parameters 1 & 2 are set to null, with a correct value for the remaining mandatory parameter (3 possible pairs)
Likewise, with three invalid states per parameter, we can write on the order of 3^3 scenarios just for those three parameters, and the count grows exponentially with the number of parameters.
Then we can combine, optional parameters also into the equation and come up with more scenarios (say optional parameter with null, empty and invalid-format values).
Payment ability based scenarios
Based on available money, there will be scenarios.
Delivery location based scenarios
Based on delivery possibilities, there will be scenarios.
Question/Concern
I would like to learn more about whether all these negative scenarios (and more) should be part of a JBehave-based test suite. If that is the case, any advice/thoughts on how to make it more maintainable?
It helps a lot to know what the tested application does internally in its own validation processes, specifically the order of validation.
In a simplified example of three required parameters, you really only need three scenarios: one for each parameter. If you know that the application will fail if parameter one is invalid, you don't need to check that again when you test parameter two in another scenario, since the second parameter would never be validated after a failure of the first. So instead of three times three, you simply have three:
1) invalid, valid, valid.
2) valid, invalid, valid.
3) valid, valid, invalid.
That is, unless the application DOES check all three parameters and reports accordingly that one or more parameters were invalid. Speaking as a developer who now does automation, I can tell you that unless I thought multiple invalid parameters were a high probability, I would only check parameters one at a time and fail out with an error upon the first invalid parameter. Having written accounting software, there were times when it was logical to validate all parameters and report accordingly, but that was the exception rather than the rule. If you know what the application is checking, and in what order, you can write better test scripts, but I realize that is not always possible.
There is still the question of the seemingly limitless kinds of invalid data, so even in my simplified example you could still have lots of tests. In that situation, though, it can be dealt with by parameterizing the invalid values: you could still limit it to just three scenarios, each taking any number of invalid parameter values to test (see the sketch below).
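For example (a sketch only; the step wording, endpoint, and parameter names are invented), a single parameterized JBehave scenario with an Examples table can cover all the one-invalid-parameter cases, and rows are cheap to add:

Scenario: an order with an invalid mandatory parameter is rejected

When I PUT /order with param1 <param1>, param2 <param2> and param3 <param3>
Then the response status is 400

Examples:
|param1 |param2 |param3 |
|null   |valid  |valid  |
|empty  |valid  |valid  |
|badFmt |valid  |valid  |
|valid  |null   |valid  |
|valid  |valid  |null   |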
I hope I understood your question correctly and offered some useful information.
I created my grammar using ANTLR4, but I want to test its robustness.
Is there an automatic tool, or a good way to do that quickly?
Thanks :)
As it's so hard to find real unit tests for ANTLR, I wrote 2 articles about it:
Unit test for Lexer
Unit test for Parser
A lexer test checks whether a given text is read and converted into the expected token sequence. It's useful, for instance, to avoid ambiguity errors.
A parser test takes a sequence of tokens (that is, it starts after the lexer stage) and checks whether that token sequence traverses the expected rules (Java methods).
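For illustration, a minimal lexer test (a sketch assuming JUnit 5 and a hypothetical generated lexer MyLexer with ID, PLUS and INT token types, and whitespace skipped in the grammar) might look like:

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.Token;
import org.junit.jupiter.api.Test;
import java.util.List;
import static org.junit.jupiter.api.Assertions.assertEquals;

class LexerTest {
    @Test
    void tokenizesSimpleSum() {
        // MyLexer is the hypothetical ANTLR-generated lexer under test.
        var lexer = new MyLexer(CharStreams.fromString("a + 1"));
        List<? extends Token> tokens = lexer.getAllTokens();  // excludes EOF
        assertEquals(List.of(MyLexer.ID, MyLexer.PLUS, MyLexer.INT),
                     tokens.stream().map(Token::getType).toList());
    }
}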
The only way I found to create unit tests for a grammar is to create a number of examples from a written spec of the given language. This is neither fast, nor complete, but I see no other way.
You could be tempted to create test cases directly from the grammar (writing a tool for that isn't hard). But think for a moment about what you would be testing: your unit tests would always succeed, unless you used test cases generated from an earlier version of the grammar.
A special case is when you write a grammar for a language that already has a grammar for another parser generation tool. In that case you could use the original grammar to generate test cases, which you can then use to test your new grammar for conformity.
However, I don't know any tool that can generate the test cases for you.
Update
Meanwhile I got another idea that would allow for better testing: have a sentence generator that generates random sentences from your grammar (I'm currently working on one in my Visual Studio Code ANTLR4 extension). The produced sentences can then be examined for their validity, using a heuristic approach:
Confirm the base structure.
Check for mandatory keywords and their correct order.
Check that identifiers and strings are valid.
Watch out for unusual constructs that are not valid according to the language.
...
This would already cover a good part of the language, but the approach has limits. Matching input and generating it are not 1:1 operations: a grammar rule that matches certain (valid) input might generate much more than that (and can thus produce invalid input).
In one chapter of his book 'Software Testing Techniques', Boris Beizer addresses the topic of 'syntax testing'. The basic idea is to (mentally or actually) take a grammar and represent it as a syntax diagram (aka railroad diagram). For systematic testing, this graph would then be covered: good cases where the input matches the elements, but also bad cases for each node. Iterations and recursive calls would be handled like loops, that is, cases with zero, one, two, one less than max, max, and one above max iterations (i.e. occurrences of the respective syntactic element).
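As a hedged illustration (the rule, the max limit, and the generator are all made up), iteration coverage for a hypothetical rule arglist : ID (',' ID)* ; could be generated like this:

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class SyntaxCases {
    // Build an argument list with n identifiers: "x0,x1,...".
    static String args(int n) {
        return IntStream.range(0, n).mapToObj(i -> "x" + i)
                        .collect(Collectors.joining(","));
    }

    // Beizer-style boundary cases: zero, one, two, max-1, max, max+1 occurrences.
    // args(0) yields "", a bad case, since the rule requires at least one ID.
    static List<String> iterationCases(int max) {
        return List.of(args(0), args(1), args(2),
                       args(max - 1), args(max), args(max + 1));
    }
}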
I have implemented FIFO semaphores, but now I need a way to test/prove that they are working properly. A simple test would be to create some threads that wait on a semaphore and then print a message with a number; if the numbers are in order, it should be FIFO. But this is not good enough to prove it, because that order could have occurred by chance. Thus, I need a better way of testing it.
If necessary, locks or condition variables can be used too.
Thanks
What you describe with your sentence "but this is not good enough to prove it because that order could have occurred by chance" is a well-known dilemma.
1) Even if you have a specification, you cannot ensure that the specification matches your intention. To illustrate this I will take an example from "The Limits of Correctness". Let's consider a specification for a factorization function:
Compute A and B such that A * B = C
But it's not enough, as you could have an implementation that returns A=1 and B=C. Adding the constraint A,B != 1 can still lead to A=-1 and B=-C, so the correct specification must state A,B > 1. That's just to illustrate how complicated it can be to write a specification that matches the real intention.
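Expressed as runnable checks (a sketch I'm adding; the names are made up), the three successive specification attempts look like this:

// Each weaker spec admits a degenerate "factorization" of C.
final class FactorSpec {
    static boolean v1(long a, long b, long c) { return a * b == c; }                     // admits A=1,  B=C
    static boolean v2(long a, long b, long c) { return a != 1 && b != 1 && a * b == c; } // admits A=-1, B=-C
    static boolean v3(long a, long b, long c) { return a > 1 && b > 1 && a * b == c; }   // matches the intent
}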
2) Even having proved an algorithm correct still doesn't mean the implementation is correct in practice. This is best illustrated by this quote from Donald Knuth:
Beware of bugs in the above code; I have only proved it correct, not tried it.
3) Testing can only reveal the presence of bugs, not their absence. This quote goes back to Dijkstra:
Testing can be used to show the presence of bugs but never to show their absence.
Conclusion: you are doomed, and you will never be 100% sure that your code is correct according to its intent! But things aren't that bad. Having high confidence in the code is usually enough. For instance, if using multiple threads is still not enough for you, you can decide to use fuzzing as well, so as to randomize the test execution even more. If your tests always pass, well, you can be pretty confident that your code is good.
because that order could have occurred by chance.
You can run the test a few times, e.g. 10, and check that the order was correct each time. This makes it very unlikely that the ordering happened by chance.
P.S. Using multiple threads in a unit test is usually avoided, though.
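A standalone stress test along these lines (a sketch only; FifoSemaphore with acquire()/release() and a permit-count constructor is the hypothetical class under test) could look like:

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class FifoOrderTest {
    static void runOnce(int threads) throws InterruptedException {
        FifoSemaphore sem = new FifoSemaphore(1);
        List<Integer> order = new CopyOnWriteArrayList<>();

        sem.acquire();                      // hold the permit so all threads queue up
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int my = i;               // arrival number
            ts[i] = new Thread(() -> {
                try { sem.acquire(); } catch (InterruptedException e) { return; }
                order.add(my);
                sem.release();
            });
            ts[i].start();
            Thread.sleep(20);               // give thread i time to block before i+1 starts (not airtight)
        }
        sem.release();                      // release the permit and let the queue drain
        for (Thread t : ts) t.join();

        for (int i = 0; i < threads; i++)
            if (order.get(i) != i) throw new AssertionError("not FIFO: " + order);
    }

    public static void main(String[] args) throws InterruptedException {
        for (int run = 0; run < 10; run++) runOnce(8);  // repetition reduces the chance of a lucky pass
        System.out.println("FIFO order held in all runs");
    }
}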
I would like to know if somebody often uses metrics to validate their code/design.
As an example, I think I will use:
number of lines per method (< 20)
number of variables per method (< 7)
number of parameters per method (< 8)
number of methods per class (< 20)
number of fields per class (< 20)
inheritance tree depth (< 6).
Lack of Cohesion in Methods
Most of these metrics are very simple.
What is your policy about this kind of measure? Do you use a tool to check them (e.g. NDepend)?
Imposing numerical limits on those values (as you seem to imply with the numbers) is, in my opinion, not a very good idea. The number of lines in a method could be very large if there is a significant switch statement, and yet the method is still simple and proper. The number of fields in a class can appropriately be very large if the fields are simple. And five levels of inheritance could sometimes be way too many.
I think it is better to analyze the class cohesion (more is better) and coupling (less is better), but even then I am doubtful of the utility of such metrics. Experience is usually a better guide (though that is, admittedly, expensive).
A metric I didn't see in your list is McCabe's Cyclomatic Complexity. It measures the complexity of a given function, and has a correlation with bugginess. E.g. high complexity scores for a function indicate: 1) It is likely to be a buggy function and 2) It is likely to be hard to fix properly (e.g. fixes will introduce their own bugs).
Ultimately, metrics are best used at a gross level -- like control charts. You look for points above and below the control limits to identify likely special cases, then you look at the details. For example, a function with a high cyclomatic complexity may cause you to look at it, only to discover that it is appropriate because it is a dispatcher method with a number of cases.
Management by metrics does not work for people or for code; no metric or absolute value will always work. Please don't let a fascination with metrics distract you from truly evaluating the quality of the code. Metrics may appear to tell you important things about the code, but the best they can do is hint at areas to investigate.
That is not to say that metrics are not useful. Metrics are most useful when they are changing, to look for areas that may be changing in unexpected ways. For example, if you suddenly go from 3 levels of inheritance to 15, or 4 parms per method to 12, dig in and figure out why.
Example: a stored procedure to update a database table may have as many parameters as the table has columns; an object interface to this procedure may have the same, or it may have one if there is an object to represent the data entity. But the constructor for that data entity may have all of those parameters. So what would the metrics tell you here? Not much! And if you have enough situations like this in the code base, the target averages will be blown out of the water.
So don't rely on metrics as absolute indicators of anything; there is no substitute for reading/reviewing the code.
Personally I think it's very difficult to adhere to these types of requirements (i.e. sometimes you just really need a method with more than 20 lines), but in the spirit of your question I'll mention some of the guidelines used in an essay called Object Calisthenics (part of the ThoughtWorks Anthology, if you're interested).
Levels of indentation per method (<2)
Number of 'dots' per line (<2)
Number of lines per class (<50)
Number of classes per package (<10)
Number of instance variables per class (<3)
He also advocates not using the 'else' keyword nor any getters or setters, but I think that's a bit overboard.
Hard numbers don't work for every solution. Some solutions are more complex than others. I would start with these as your guidelines and see where your project(s) end up.
But regarding these numbers specifically, they seem pretty high. I usually find in my particular coding style that I have:
no more than 3 parameters per method signature
about 5-10 lines per method
no more than 3 levels of inheritance
That isn't to say I never go over these generalities, but I usually think more about the code when I do because most of the time I can break things down.
As others have said, keeping to a strict standard is going to be tough. I think one of the most valuable uses of these metrics is to watch how they change as the application evolves. This helps to give you an idea how good a job you're doing on getting the necessary refactoring done as functionality is added, and helps prevent making a big mess :)
OO metrics are a bit of a pet project for me (they were the subject of my master's thesis). So yes, I'm using these, and I use a tool of my own.
For years the book "Object Oriented Software Metrics" by Mark Lorenz was the best resource for OO metrics. But recently I have seen more resources.
Unfortunately I have other deadlines so no time to work on the tool. But eventually I will be adding new metrics (and new language constructs).
Update
We are using the tool now to detect possible problems in the source. We have added several metrics (not all purely OO):
use of assert
use of magic constants
use of comments, in relation to the complexity of methods
statement nesting level
class dependency
number of public fields in a class
relative number of overridden methods
use of goto statements
There are still more. We keep the ones that give a good picture of the pain spots in the code, so we get direct feedback when these are corrected.