How would you effectively test command-line software with many switches and arguments?

A command line utility/software could potentially consist of many different switches and arguments.
Let's say your software is called CLI, and let's say CLI has the following features:
The general syntax of CLI is:
CLI <data structures> <operation> <required arguments> [optional arguments]
<data structures> could be 'matrix', 'complex numbers', 'int', 'floating point', 'log'
<operation> could be 'add', 'subtract', 'multiply', 'divide'
I can't think of specific required and optional arguments, but let's say your software does support them.
Now you want to test this software, and you wish to test the interface itself, not the logic. Essentially, the interface must return the correct success codes and error codes.
A lot of real-world software still presents a command-line interface with several options, and I am curious whether there is any established formal testing methodology for this. One idea I had was to construct a grammar (e.g. in EBNF) describing the 'language' of the interface, but I have failed to push this idea any further. What good is a grammar in this case? How does it enable the generation of the many, many possible combinations?
I am curious to learn more about any theoretical models which could be applied to such a problem, or whether anyone here has actually done such testing with satisfying coverage.

There is a command-line tool as part of a product I maintain, and I have a situation that's very similar to what you describe. What I did was employ a unit testing framework and encode each combination of arguments as a test method.
The program is implemented in C#/.NET, so I use Microsoft's testing framework that's built into Visual Studio, but the approach would work with any unit testing framework.
Each test invokes a utility function that starts the process, sends in the input, and collects the output. Then each test is responsible for verifying that the output from the CLI matches what was expected. In some cases there's a family of test cases that can be performed by a single test method with a for loop in it; the loop runs the CLI and checks the output for each iteration.
The set of tests I have does not cover every permutation of arguments, but it covers the 80% cases, and I can add new tests if there are ever any defects.
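As a rough sketch of that pattern (shown here in Python with unittest rather than C#/MSTest; the executable name cli, its switches, and the expected messages are hypothetical placeholders), each test drives the real binary and checks the exit code and output:
import subprocess
import unittest
def run_cli(args, stdin_text=""):
    # Start the process, feed it input, and collect exit code, stdout, and stderr.
    result = subprocess.run(["cli"] + args, input=stdin_text,
                            capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr
class CliInterfaceTests(unittest.TestCase):
    def test_matrix_add_succeeds(self):
        code, out, err = run_cli(["matrix", "add", "1", "2"])
        self.assertEqual(code, 0)            # expected success code
        self.assertEqual(err, "")
    def test_unknown_operation_returns_error(self):
        code, out, err = run_cli(["matrix", "frobnicate"])
        self.assertNotEqual(code, 0)         # expected error code
        self.assertIn("usage", err.lower())
    def test_family_of_data_structures(self):
        # A single test method covering a family of cases with a loop.
        for ds in ["matrix", "int", "floating point"]:
            with self.subTest(data_structure=ds):
                code, _, _ = run_cli([ds, "add", "1", "2"])
                self.assertEqual(code, 0)
if __name__ == "__main__":
    unittest.main()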

Using a recursive grammar to generate switches is an interesting idea. If you were to try this, you would first need to write the grammar in such a way that every switch could be produced, and then do a random walk of the grammar.
Randomly walking the grammar gives you an easy way to generate argument combinations and output the results.
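A minimal sketch of that idea in Python (the toy grammar below is made up to match the CLI example from the question; the real grammar and the command you invoke would be your own):
import random
# Toy grammar: each non-terminal maps to a list of alternatives;
# an alternative is a list of terminals and/or non-terminals.
GRAMMAR = {
    "<invocation>": [["<data_structure>", "<operation>", "<optional_args>"]],
    "<data_structure>": [["matrix"], ["complex numbers"], ["int"], ["floating point"], ["log"]],
    "<operation>": [["add"], ["subtract"], ["multiply"], ["divide"]],
    "<optional_args>": [[], ["--verbose"], ["--verbose", "<optional_args>"]],
}
def random_walk(symbol):
    # Expand a symbol by randomly choosing one alternative at each step.
    if symbol not in GRAMMAR:          # terminal symbol
        return [symbol]
    tokens = []
    for part in random.choice(GRAMMAR[symbol]):
        tokens.extend(random_walk(part))
    return tokens
if __name__ == "__main__":
    for _ in range(20):
        args = random_walk("<invocation>")
        print("CLI", *args)
        # Feed args to the real executable (e.g. via subprocess.run) and record the exit code.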

Related

Is it feasible to use Antlr for source code completion?

I don't know if this question is valid, since I'm not very familiar with source code parsing. My goal is to write a source code completion function for an existing programming language (language "X") for learning purposes.
Is ANTLR (v4) suitable for such a task, or should the necessary AST/parse tree creation and parsing be done by hand, assuming no existing solution exists?
I haven't found much information about that specific topic, apart from lists of compiler books, but a compiler is not what I'm after.
The code completion in GoWorks is completely implemented using ANTLR 4. The following video shows the level of completion of this code completion engine. The code completion example runs from 5 minutes through the end of the video.
Intro to Tunnel Vision Labs' GoWorks IDE (Preview Release)
I have been working on code completion algorithms for many years, and strongly believe that there is no better solution (automated or manual) than ANTLR 4 for producing a code completion engine for a new language that meets the requirements of what I would call highly responsive code completion. If you are not interested in that level of performance or accuracy, other solutions may be easier for you to get involved with (I don't work with those personally, because I am too easily disappointed in the results).
Xtext uses ANTLR3 and has good autocomplete facilities. The problem is that it generates a separate parser (again using ANTLR3) for autocomplete processing, derived from AbstractInternalContentAssistParser. That multi-thousand-line piece of code suggests that the error recovery of ANTLR3 alone was found insufficient by the Xtext team.
Meanwhile, ANTLR4 has a method parser.getExpectedTokensWithinCurrentRule() which lists the possible token types for a given position. It works when used in a ParseTreeListener. What remains is semantics, scoping, etc., which is outside ANTLR's scope.

Tool or Eclipse-based plugin available for generating test cases for Salesforce platform Apex classes

Can anyone please tell me whether there are any tools or Eclipse-based plugins available for generating relevant test cases for Apex classes on the Salesforce platform? It seems that with code coverage they are not expecting an outcome like we expect with JUnit; they want to check whether the test cases go through the flows of the source classes (i.e., which code paths are executed).
Please don't take this post the wrong way; I don't want anyone to write test cases for my code :). I have posted this question because of the way Salesforce expects code coverage to be achieved. Thanks.
Although Salesforce requires a certain percentage of code coverage for your test cases, you really need to be writing cases that check the results to ensure that the code behaves as designed.
So, even if there was a tool that could generate code to get 100% coverage of your test class, it wouldn't be able to test the results of those method calls, leaving you with a false sense of having "tested code".
I've found that breaking up long methods into separate, sometimes static, methods makes it easier to do unit testing. You can test each individual method, and not worry so much about tweaking parameters to a single method so that it covers all execution paths.
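The same idea in language-agnostic terms (a hypothetical Python sketch rather than Apex): a long method is split into small pieces that can each be unit tested without elaborate parameter tweaking.
def validate_order(order):
    # Small, independently testable piece: input validation only.
    return bool(order.get("items")) and order.get("customer_id") is not None
def compute_total(order, discount_rate=0.0):
    # Small, independently testable piece: pure arithmetic, no side effects.
    subtotal = sum(item["price"] * item["qty"] for item in order["items"])
    return round(subtotal * (1 - discount_rate), 2)
def process_order(order, discount_rate=0.0):
    # Thin orchestration method built from the pieces above.
    if not validate_order(order):
        raise ValueError("invalid order")
    return compute_total(order, discount_rate)
# Each piece can now be exercised directly, e.g.:
# assert validate_order({"items": [], "customer_id": 1}) is False
# assert compute_total({"items": [{"price": 10.0, "qty": 2}]}) == 20.0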
It's now possible to generate test classes automatically for your class/trigger/batch. You can install the "Test Class Generator" app from the AppExchange and see it working.
This would really help you generate test classes and save a lot of your development time.

Does there exist an established standard for testing command line arguments?

I am developing a command line utility that has a LOT of flags. A typical command looks like this:
mycommand --foo=A --bar=B --jar=C --gnar=D --binks=E
In most cases, a 'success' message is printed but I still want to verify against other sources like an external database to ensure actual success.
I'm starting to create integration tests and I am unsure of the best way to do this. My main concerns are:
There are many, many flag combinations; how do I know which combinations to test? If you do the math for the 10+ flags that can be used together...
Is it necessary to test permutations of flags?
How to build a framework capable of automating the tests and then verifying results.
How to keep track of a large number of flags and provide an ordering, so it is easy to tell which combinations have been implemented and which have not.
The thought of manually writing out individual cases and verifying results in a unit-test like format is daunting.
Does anyone know of a pattern that can be used to automate this type of test? Perhaps even software that attempts to solve this problem? How did the people working on GNU command-line tools test their software?
I think this is very specific to your application.
First, how do you determine the success of the execution of your application? Is it a result code? Is it something printed to the console?
For question 2, it depends on how you parse those flags in your application. Most of the time the order of flags isn't important, but there are cases where it is. I hope you don't need to test permutations of flags, because that would add a lot of cases to test.
In the general case, you should analyse the impact of each flag. It is possible that a flag doesn't interfere with the others, in which case it only needs to be tested once. This is also the case for flags that are meant to be used alone (--help or --version, for example). You also need to analyse which values you should test for each flag. Usually, you want to try each kind of possible valid value and each kind of possible invalid value.
I think a simple bash script could be written to perform the tests, or a script in any scripting language, like Python. Using nested loops, you could try, for each flag, its possible values, including invalid values and the case where the flag isn't set. This will produce a multidimensional matrix of results that should be analysed to see whether the results conform to what is expected.
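A minimal sketch of that nested-loop approach in Python (the executable name mycommand and the flag values are hypothetical placeholders; deciding which exit codes count as expected is up to you):
import itertools
import subprocess
# Hypothetical flags with a sample valid value, a sample invalid value, and "flag absent" (None).
FLAG_VALUES = {
    "--foo": ["A", "bad-foo", None],
    "--bar": ["B", "", None],
    "--jar": ["C", None],
}
def build_args(combo):
    # Turn a (flag, value) combination into a command line, skipping absent flags.
    args = ["mycommand"]
    for flag, value in combo:
        if value is not None:
            args.append("{}={}".format(flag, value))
    return args
results = {}
for values in itertools.product(*FLAG_VALUES.values()):
    combo = tuple(zip(FLAG_VALUES.keys(), values))
    proc = subprocess.run(build_args(combo), capture_output=True, text=True)
    results[combo] = proc.returncode
# The results dict is the "multidimensional matrix": inspect it, or assert on it,
# to check that each combination produced the expected success or error code.
for combo, code in results.items():
    print(code, build_args(combo))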
When I write apps (in scripting languages), I have a function that parses a command line string. I source the file that I'm developing and unit test that function directly rather than involving the shell.
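For example (a hypothetical Python sketch of that approach, using argparse and unittest), the parsing function is imported and exercised directly, with no shell involved:
import argparse
import unittest
def parse_args(argv):
    # The command-line parsing function under test.
    parser = argparse.ArgumentParser(prog="mycommand")
    parser.add_argument("--foo", required=True)
    parser.add_argument("--bar", default="B")
    return parser.parse_args(argv)
class ParseArgsTests(unittest.TestCase):
    def test_defaults_are_applied(self):
        ns = parse_args(["--foo=A"])
        self.assertEqual(ns.foo, "A")
        self.assertEqual(ns.bar, "B")
    def test_missing_required_flag_is_an_error(self):
        # argparse reports parse errors by calling sys.exit(2).
        with self.assertRaises(SystemExit):
            parse_args([])
if __name__ == "__main__":
    unittest.main()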

How would one go about testing an interpreter or a compiler?

I've been experimenting with creating an interpreter for Brainfuck, and while it is quite simple to make and get up and running, part of me wants to be able to run tests against it. I can't seem to fathom how many tests one might have to write to test all the possible instruction combinations to ensure that the implementation is proper.
Obviously, with Brainfuck, the instruction set is small, but I can't help but think that as more instructions are added, your test code would grow exponentially. More so than your typical tests at any rate.
Now, I'm about as newbie as you can get in terms of writing compilers and interpreters, so my assumptions could very well be way off base.
Basically, where do you even begin with testing on something like this?
Testing a compiler is a little different from testing some other kinds of apps, because it's OK for the compiler to produce different assembly-code versions of a program as long as they all do the right thing. However, if you're just testing an interpreter, it's pretty much the same as any other text-based application. Here is a Unix-centric view:
You will want to build up a regression test suite. Each test should have
Source code you will interpret, say test001.bf
Standard input to the program you will interpret, say test001.0
What you expect the interpreter to produce on standard output, say test001.1
What you expect the interpreter to produce on standard error, say test001.2 (you care about standard error because you want to test your interpreter's error messages)
You will need a "run test" script that does something like the following
function fail {
    echo "Unexpected differences on $1:"
    diff "$2" "$3"
    exit 1
}
for testname
do
    tmp1=$(tempfile)
    tmp2=$(tempfile)
    # Run the interpreter, capturing actual stdout and stderr.
    brainfuck "$testname.bf" < "$testname.0" > "$tmp1" 2> "$tmp2"
    # Compare against the expected outputs recorded for this test.
    cmp -s "$testname.1" "$tmp1" || fail "stdout" "$testname.1" "$tmp1"
    cmp -s "$testname.2" "$tmp2" || fail "stderr" "$testname.2" "$tmp2"
done
You will find it helpful to have a "create test" script that does something like
brainfuck "$testname.bf" < "$testname.0" > "$testname.1" 2> "$testname.2"
You run this only when you're totally confident that the interpreter works for that case.
You keep your test suite under source control.
It's convenient to embellish your test script so you can leave out files that are expected to be empty.
Any time anything changes, you re-run all the tests. You probably also re-run them all nightly via a cron job.
Finally, you want to add enough tests to get good test coverage of your compiler's source code. The quality of coverage tools varies widely, but GNU Gcov is an adequate coverage tool.
Good luck with your interpreter! If you want to see a lovingly crafted but not very well documented testing infrastructure, go look at the test2 directory for the Quick C-- compiler.
I don't think there's anything 'special' about testing a compiler; in a sense it's almost easier than testing some programs, since a compiler has such a basic high-level summary - you hand in source, it gives you back (possibly) compiled code and (possibly) a set of diagnostic messages.
Like any complex software entity, there will be many code paths, but since it's all very data-oriented (text in, text and bytes out) it's straightforward to author tests.
I’ve written an article on compiler testing, the original conclusion of which (slightly toned down for publication) was: It’s morally wrong to reinvent the wheel. Unless you already know all about the preexisting solutions and have a very good reason for ignoring them, you should start by looking at the tools that already exist. The easiest place to start is Gnu C Torture, but bear in mind that it’s based on Deja Gnu, which has, shall we say, issues. (It took me six attempts even to get the maintainer to allow a critical bug report about the Hello World example onto the mailing list.)
I’ll immodestly suggest that you look at the following as a starting place for tools to investigate:
Software: Practice and Experience, April 2007. (Payware, not available to the general public; a free preprint is at http://pobox.com/~flash/Practical_Testing_of_C99.pdf.)
http://en.wikipedia.org/wiki/Compiler_correctness#Testing (Largely written by me.)
Compiler testing bibliography (Please let me know of any updates I’ve missed.)
In the case of brainfuck, I think testing it should be done with brainfuck scripts. I would test the following, though:
1: Are all the cells initialized to 0?
2: What happens when you decrement the data pointer while it's pointing to the first cell? Does it wrap? Does it point to invalid memory?
3: What happens when you increment the data pointer while it's pointing at the last cell? Does it wrap? Does it point to invalid memory?
4: Does output function correctly?
5: Does input function correctly?
6: Does the [ ] looping construct work correctly?
7: What happens when you increment a byte more than 255 times? Does it wrap to 0 properly, or is it incorrectly treated as an integer or other value?
More tests are possible too, but this is probably where I'd start. I wrote a BF compiler a few years ago, and it had a few extra tests. In particular, I tested the [ ] construct heavily by putting a lot of code inside the block, since an early version of my code generator had issues there (on x86, using a jxx, I had issues when the block produced more than 128 bytes or so of code, resulting in invalid x86 asm).
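To make those checks concrete, here is a rough Python sketch written against a hypothetical run(source, stdin) function exposed by your interpreter (the expected values assume 8-bit, zero-initialized, wrapping cells; adjust them to whatever semantics your implementation chooses):
import unittest
from brainfuck import run   # hypothetical: run(source, stdin) returns the program's output
class BrainfuckSemanticsTests(unittest.TestCase):
    def test_cells_start_at_zero(self):
        # '.' with no preceding '+' should emit byte 0 if cells are zero-initialized.
        self.assertEqual(run(".", ""), "\x00")
    def test_input_and_output(self):
        # ',.' reads one byte of input and echoes it back.
        self.assertEqual(run(",.", "A"), "A")
    def test_loop_construct(self):
        # '[-]' zeroes the current cell; after three '+' the printed byte should be 0.
        self.assertEqual(run("+++[-].", ""), "\x00")
    def test_cell_wraps_after_256_increments(self):
        # 256 '+' then '.' should print byte 0 if cells are 8-bit and wrap.
        self.assertEqual(run("+" * 256 + ".", ""), "\x00")
if __name__ == "__main__":
    unittest.main()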
You can test with some already written apps.
The secret is to:
Separate the concerns
Observe the law of Demeter
Inject your dependencies
Well, software that is hard to test is a sign that the developer wrote it like it's 1985. Sorry to say that, but utilizing the three principles I presented here, even line-numbered BASIC would be unit testable (it IS possible to inject dependencies into BASIC, because you can do "goto variable").

What is code coverage and how do YOU measure it?

I was asked this question regarding our automated testing and code coverage. It seems that, outside of automated tools, it is more art than science. Are there any real-world examples of how to use code coverage?
Code coverage is a measurement of how many lines/blocks/arcs of your code are executed while the automated tests are running.
Code coverage is collected by using a specialized tool to instrument the binaries to add tracing calls and run a full set of automated tests against the instrumented product. A good tool will give you not only the percentage of the code that is executed, but also will allow you to drill into the data and see exactly which lines of code were executed during a particular test.
Our team uses Magellan - an in-house set of code coverage tools. If you are a .NET shop, Visual Studio has integrated tools to collect code coverage. You can also roll some custom tools, like this article describes.
If you are a C++ shop, Intel has some tools that run on Windows and Linux, though I haven't used them. I've also heard there's the gcov tool for GCC, but I don't know anything about it and can't give you a link.
As to how we use it - code coverage is one of our exit criteria for each milestone. We actually have three code coverage metrics - coverage from unit tests (from the development team), from scenario tests (from the test team), and combined coverage.
BTW, while code coverage is a good metric of how much testing you are doing, it is not necessarily a good metric of how well you are testing your product. There are other metrics you should use along with code coverage to ensure the quality.
Code coverage basically tells you how much of your code is covered under tests. For example, if you have 90% code coverage, it means 10% of the code is not covered under tests.
I know you might be thinking that if 90% of the code is covered, it's good enough, but you have to look from a different angle. What is stopping you from getting 100% code coverage?
A good example is this:
if (customer.IsOldCustomer())
{
    // the "YES" branch
}
else
{
    // the "else" branch
}
Now, in the code above, there are two paths/branches. If you are always hitting the "YES" branch, you are not covering the "else" part, and that will show up in the code coverage results. This is good, because now you know what is not covered and you can write a test to cover the "else" part. If there were no code coverage, you would just be sitting on a time bomb waiting to explode.
NCover is a good tool to measure code coverage.
Just remember, having "100% code coverage" doesn't mean everything is tested completely - while it means every line of code is tested, it doesn't mean every line is tested under every (common) situation.
I would use code-coverage to highlight bits of code that I should probably write tests for. For example, if whatever code-coverage tool shows myImportantFunction() isn't executed while running my current unit-tests, they should probably be improved.
Basically, 100% code-coverage doesn't mean your code is perfect. Use it as a guide to write more comprehensive (unit-)tests.
Complementing a few points to many of the previous answers:
Code coverage means how well your test set covers your source code, i.e. to what extent the source code is covered by the set of test cases.
As mentioned in the answers above, there are various coverage criteria, like paths, conditions, functions, and statements. But additional criteria to be covered are:
Condition coverage: All boolean expressions are evaluated for both true and false.
Decision coverage: Not just boolean expressions evaluated for true and false once, but coverage of each subsequent if-elseif-else body.
Loop coverage: Has every loop been executed zero times, once, and more than once? Also, if we have an assumption about a maximum limit, then, if feasible, test the maximum number of iterations and one more than the maximum.
Entry and exit coverage: Test all possible calls and their return values.
Parameter value coverage (PVC): Check whether all possible kinds of values for a parameter are tested. For example, a string could commonly be: a) null, b) empty, c) whitespace (space, tabs, newline), d) a valid string, e) an invalid string, f) a single-byte string, g) a double-byte string. Failure to test each kind of parameter value may leave a bug. Testing only one of these could still result in 100% code coverage, as each line is covered, but since only one of seven kinds of value is tested, it means only about 14% coverage of parameter values.
Inheritance coverage: In object-oriented code, when a derived object is returned through a base class reference, coverage should also evaluate the case where a sibling derived object is returned.
Note: Static code analysis will find whether there is any unreachable or dangling code, i.e. code not reached by any function call, along with other static checks. But even if static analysis reports that 100% of the code is reachable, it says nothing about whether your test set achieves all the kinds of coverage listed above.
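As a small illustration of the parameter value coverage idea (a hypothetical Python sketch; the validate_username function and the expected results are made up for the example), a single test can walk through each kind of string value rather than just one:
import unittest
def validate_username(name):
    # Hypothetical function under test: accepts non-empty, non-blank strings only.
    return isinstance(name, str) and name.strip() != ""
class ParameterValueCoverageTests(unittest.TestCase):
    def test_each_kind_of_string_value(self):
        cases = [
            (None, False),          # a) null
            ("", False),            # b) empty
            (" \t\n", False),       # c) whitespace only
            ("alice", True),        # d) valid string
            ("\x00oops", True),     # e) "invalid" string (define what invalid means for your function)
            ("a", True),            # f) single-byte string
            ("日本語", True),        # g) double-byte (multi-byte) characters
        ]
        for value, expected in cases:
            with self.subTest(value=value):
                self.assertEqual(validate_username(value), expected)
if __name__ == "__main__":
    unittest.main()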
Code coverage has been explained well in the previous answers. So this is more of an answer to the second part of the question.
We've used three tools to determine code coverage.
JTest - a proprietary tool built over JUnit. (It generates unit tests as well.)
Cobertura - an open source code coverage tool that can easily be coupled with JUnit tests to generate reports.
Emma - another - this one we've used for a slightly different purpose than unit testing. It has been used to generate coverage reports when the web application is accessed by end-users. This coupled with web testing tools (example: Canoo) can give you very useful coverage reports which tell you how much code is covered during typical end user usage.
We use these tools to
Review that developers have written good unit tests
Ensure that all code is traversed during black-box testing
Code coverage is simply a measure of how much of the code is exercised by tests. There are a variety of coverage criteria that can be measured, but typically it is the various paths, conditions, functions, and statements within a program that make up the total coverage. The code coverage metric is then just the percentage of those coverage criteria that the tests execute.
As far as how I go about tracking unit test coverage on my projects, I use static code analysis tools to keep track.
For Perl there's the excellent Devel::Cover module which I regularly use on my modules.
If the build and installation is managed by Module::Build you can simply run ./Build testcover to get a nice HTML site that tells you the coverage per sub, line and condition, with nice colors making it easy to see which code path has not been covered.
Code coverage has been explained well in the previous answers. I am just adding some knowledge related to tools: if you are working on the iOS and OS X platforms, Xcode provides the facility to test and monitor code coverage.
Reference Links:
https://developer.apple.com/library/archive/documentation/DeveloperTools/Conceptual/testing_with_xcode/chapters/07-code_coverage.html
https://medium.com/zendesk-engineering/code-coverage-and-xcode-6b2fb8756a51
Both are helpful links for learning and exploring code coverage with Xcode.
The purpose of code coverage testing is to figure out how much of the code is being tested. A code coverage tool generates a report which shows how much of the application code has been run. Code coverage is measured as a percentage; the closer to 100%, the better. It is an example of a white-box testing technique. Here are some open source tools for code coverage testing (a small Coverage.py usage sketch appears below):
Simplecov - For Ruby
Coverlet - For .NET
Cobertura - For Java
Coverage.py - For Python
Jest - For JavaScript
For PHP you should take a look at the Github from Sebastian Bergmann
Provides collection, processing, and rendering functionality for PHP code coverage information.
https://github.com/sebastianbergmann/php-code-coverage
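To show how one of the tools above is typically driven, here is a minimal Coverage.py sketch using its Python API (the package name mypackage and the tests directory are hypothetical; the same thing is usually done from the shell with coverage run and coverage report):
import unittest
import coverage
# Measure only our own package, not the test code or the standard library.
cov = coverage.Coverage(source=["mypackage"])
cov.start()
# Run the test suite while measurement is active.
suite = unittest.defaultTestLoader.discover("tests")
unittest.TextTestRunner().run(suite)
cov.stop()
cov.save()
# Print a per-file summary; show_missing lists the line numbers that were never executed.
cov.report(show_missing=True)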
What code coverage IS NOT
To truly understand what code coverage is, it is very important to understand what it is not.
A couple of answers/comments here and on related questions have alluded to this:
Franci Penov
BTW, while code coverage is a good metric of how much testing you are doing, it is not necessarily a good metric of how well you are testing your product.
steve
Just because every line of your code is run at some point in your tests, it doesn't mean you have tested every possible scenario that the code could be run under. If you just had a function that took x and returned x/x and you ran the test using my_func(2) you would have 100% coverage (as the function's code will have been run) but you've missed a huge issue when 0 is the parameter. I.e. you haven't tested all necessary scenarios even with 100% coverage.
KeithS:
However, the flip side of coverage is actually twofold: first, a test that adds coverage for coverage's sake is useless; every test must prove that code works as expected in some novel situation. Also, "coverage" is not "exercise"; your test suites may execute every line of code in the SUT, but they may not prove that a line of logic works in every situation.
No one says it more succinctly and to the point than Mark Simpson:
Code coverage tells you what you definitely haven't tested, not what you have.
An Illustrative Example
I spent some time writing a reply to a feature request that Istanbul (a Javascript test coverage tool) "Change definition of coverage to require more than 1 hit" per line. No one will ever see it there 🤣, so I thought it might be useful to reuse the gist of it here:
A coverage tool CANNOT prove that your code is tested adequately. All it can do is tell you that you provided some kind of coverage for every line of code in your codebase, but even then it doesn't prove the coverage means anything, because a test might execute a line of code without making any assertions on its results. Only you as a developer can decide the actual semantically unique input variations and boundary conditions that need to be covered by tests and ensure that the test logic does in fact make the right assertions.
For example, say you have the following Javascript function. A single test that asserts an input of (1, 1) returns 1 would give you 100% line coverage. What does that prove?
function max(a, b) {
return a > b ? a : b
}
Putting aside for a moment the semantically poor coverage of this test, the 100% line coverage is rather misleading too, as it doesn't provide 100% branch coverage. That's easily seen by splitting the branches onto different lines and rerunning the line coverage report:
function max(a, b) {
if (a > b) {
return a
} else {
return b
}
}
or even
function max(a, b) {
return a > b ?
a :
b
}
What this tells us is that the "coverage" metric depends too much on the implementation, whereas ideally testing should be black box. And even then it's a judgement call.
For example, would the following three input cases constitute complete testing of the max function?
(2, 1)
(1, 2)
(1, 1)
You'd get 100% line and 100% branch coverage for the above implementations. But what about non-number inputs? Ok, so you add two more input cases:
(null, 1)
(1, null)
which forces you to update the implementation:
function max(a, b) {
if (typeof a !== 'number' || typeof b !== 'number') {
return undefined
}
return a > b ? a : b
}
Looking good. You have 100% line and branch coverage, and you've covered invalid inputs.
But is that enough? What about negative numbers?
The ideal of 100% blackbox coverage is a fantasy
In my opinion, in this situation, for the simple nature of this function, testing negative number cases is anal overkill. If the situation were different, say the function only existed because we need to implemented some tricky algorithm or optimization, that may or may not work as expected for negative numbers, then I'd add more input cases including negative numbers.
Often, you only discover corner cases because you have hundreds or thousands of users; only through their using your software in unexpected ways, or in conditions and software environments you could not foresee or could not reproduce even if you had foreseen them, are such rare cases exposed. And often those rare cases are artifacts of the nature of your implementation, not something you'd arrive at from analysis of an idealized abstraction of the buggy code's interfaces.
I think what that shows is the ideal of 100% blackbox coverage is a bit of a fantasy. You would waste a lot of time writing unnecessary tests if you treated everything as an idealized black box. In the example above, I know the implementation uses a simple and reliable non-number check and then uses the native Javascript logic to compare values (a > b), and that it would be silly to do anything more complex. Knowing that, I'm not going to test passing in negative numbers, floats, strings, objects, etc.
At the end of the day, you have to be practical and use good judgement, and that judgement usually cannot ignore knowing something about the nature of what's in the black box, or at least the assumptions made inside the black box.
All this said, I don't have a CS degree 😂. What's the equivalent of IANAL for programmer advice?