I am wondering what the relation between random and fuzz testing is. I understand that random testing has been around longer, but I cannot see any differences between them. They both seem to use random input to see if the program goes into an unexpected state (i.e. a crash). Is the main difference that fuzz testing is automated?
Random(ized) testing has no intention of crashing a system. You can randomize valid values. The goals could be to increase coverage or to find out new/unexpected information about the system (possibly bugs, but could be simply unknown behaviour).
Fuzz testing is about sending complete rubbish (e.g. random bytes instead of an HTTP request) into the system and seeing whether it handles it gracefully (does not crash or hang). The data is not necessarily random - it's just meaningless to the software.
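To make the distinction concrete, here is a rough sketch in Python (handle_request is a hypothetical entry point for the system under test, and the HTTP framing is just an example of a "valid" input format):

```python
import random
import string

rng = random.Random(0)  # seeded so runs are reproducible

def random_test(handle_request):
    """Randomized testing: generate a *valid* but randomly chosen input
    and check the system's behaviour on it (coverage, new information)."""
    method = rng.choice(["GET", "POST", "HEAD"])
    path = "/" + "".join(rng.choices(string.ascii_lowercase, k=rng.randint(1, 10)))
    request = f"{method} {path} HTTP/1.1\r\nHost: example.com\r\n\r\n"
    return handle_request(request.encode())  # output can be checked against expectations

def fuzz_test(handle_request):
    """Fuzz testing: feed meaningless bytes and only check that the system
    survives (no crash or hang), not that the output is 'correct'."""
    garbage = bytes(rng.getrandbits(8) for _ in range(rng.randint(0, 4096)))
    try:
        handle_request(garbage)  # any return value is fine
    except ValueError:
        pass                     # graceful rejection is acceptable
    # an unhandled crash or a hang is exactly what fuzzing is hunting for
```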
I am looking for a machine learning strategy that will allow me to load in test data, along with a test outcome, and optimize for a particular scenario to adjust future testing parameters.
(See example in the edit)
Original example: Consider that I have a 3-dimensional space (an environmental chamber) into which I place a physical test device. I then select a number of locations and physical attributes with which to test the device: first, I test the device at every location configuration, across multiple temperatures, humidities, and pressures. At each test increment, or combination of variables, I log the value of each feature, e.g. the x, y, z positional data, as well as the temperature, humidity, pressure, etc. After setting these parameters, I initiate a software test on the physical device, which is affected by the environmental factors in ways too complex for me to predict. This software test produces one of three outcomes, with probabilities that are unknown until tested and depend on the logged physical parameters: one is a failure, one is a success, and one is that the test finishes without any meaningful output (we can ignore this case in processing).
Once the test has run through every physical parameter I provide, I would like to run an algorithm on the test data to select the controllable parameters, e.g. the x, y, z positions or the temperature, so as to maximize the chance of a successful test while also minimizing the chance of a failure (in some cases we find parameters that have both a high chance of success and a high chance of failure; failures are more time-expensive, so we need to avoid them). The algorithm should use these results to suggest an alternative set of test parameter ranges for the next test iteration that it believes will get us closer to our goal.
Currently, I can maximize the success case by using an iterative decision tree and ignoring the results that end in failure.
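For reference, what I do now is roughly along these lines (heavily simplified; the file names, feature layout, and the use of scikit-learn's DecisionTreeClassifier are just placeholders for my actual setup):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# one row per test increment: x, y, z, temperature, humidity, pressure
X = np.loadtxt("test_features.csv", delimiter=",")
# outcome per increment: 1 = success, 0 = failure, -1 = no meaningful output
outcome = np.loadtxt("test_outcomes.csv", dtype=int)

# current approach: model "success vs. everything else", so the expensive
# failure case carries no extra weight -- this is the part I want to improve
tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X, (outcome == 1).astype(int))

# rank the tested parameter settings by predicted success probability and
# use the best ones to narrow the ranges for the next test iteration
p_success = tree.predict_proba(X)[:, 1]
next_candidates = X[p_success > 0.8]
```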
Any ideas are appreciated
Edit:
Another example (this one is contrived; let's not get into the details of PRNGs): say I have an embedded device with a hardware pseudo-random number generator (PRNG) that is affected by environmental factors such as heat and magnetometer data. I have a program on the device that uses this PRNG to give me a random value. Suppose that this PRNG just barely achieves sufficient randomization in the average case, gives me a good random value in the best case, and fails to provide a random value in the worst case. By changing the physical parameters of the environment around the device, I can find values which, with some probability, cause this PRNG to fail to give me a random number, give me an 'ok' random number, or succeed in generating a cryptographically secure random number. Let's suppose that in the cases where it fails to generate a good enough random number, the program enters a long-running loop trying to find one before ultimately failing, which we would like to avoid because it is computationally expensive.

To test this, we first test every combination of the variables we can control (temperature, position, etc.), perhaps jumping several degrees at a time, to get a rough picture of what the device's response looks like over a large range. I would then like to run an algorithm on this test data, narrow my testing windows, and iterate over the newly selected feature parameters to arrive at an optimized solution. In this case, since failures are expensive, the optimized solution would be one that minimizes failures while simultaneously maximizing successes.
When doing software development, you also want to verify the robustness of your code. Especially in image processing - and I'm pretty sure this applies to other fields too, like bioscience simulators - your input data can vary a lot.
So far, I've faced the situation that a rolled-out piece of software crashes and causes some irritation at the customer's site. The framework holding the image processing algorithms is pretty stable; crashes usually occur in the algorithms themselves.
Imagine you use a 3rd-party closed-source image processing library. To figure out any problematic code, you manually go through the code you wrote. Everything around the black-boxed function seems pretty robust.
Unfortunately, as soon as an image comes along with this very special gradient in this very particular region, the black-boxed function crashes.
Wrapping all 3rd-party functions in try-catch blocks does not eliminate all risks. Especially on embedded devices, you may just get a segfault.
To avoid unhappy customers and thereby eradicate possible crashes, I started to do white-noise tests, using randomly generated patterns as input images, and let these tests run for a few days - which actually gave me some confidence (and in some cases, more mistrust) in the robustness of a closed-source function.
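Stripped down, the harness is along these lines (blackbox_process is a stand-in for the closed-source call; a real segfault in the native library would of course kill the interpreter, so a loop like this would typically run in a child process that a supervisor restarts):

```python
import numpy as np

def run_white_noise_tests(blackbox_process, iterations, seed=0):
    """Feed randomly generated 'white noise' images into the black-boxed
    3rd-party function and record every input that makes it blow up."""
    rng = np.random.default_rng(seed)
    failures = []
    for i in range(iterations):
        h, w = rng.integers(1, 2048, size=2)                        # random image dimensions
        image = rng.integers(0, 256, size=(h, w), dtype=np.uint8)   # random grey values
        try:
            blackbox_process(image)
        except Exception as exc:
            failures.append((i, h, w, repr(exc)))
            np.save(f"failure_{i}.npy", image)                      # keep the input for reproduction
    return failures
```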
Compared to an analytical approach (or one using integration/unit tests), it seems like ... steamroller tactics. It is just not very elegant.
Coming to my question: Is this empirical testing approach appropriate? Are there better solutions available?
Imagine that you developed an application that reads/writes/updates/deletes data in a SQL Server database.
What tests do you run to find all possible SQL deadlocks before putting it into production?
You need to follow the basics of testing.
Unit Testing
Positive test cases: simple test cases that get you started.
Negative test cases: add bad data that fails validation, and check that it hasn't been written.
Boundary cases (don't confuse these with negative cases): check all limits and validations - 0, maxint, minint. For GPS, check for +90 degrees, -90 degrees, -180 degrees... Basically, testing is not as easy as you think. Doing it right is as painful as coding, or more so.
All the above tests should focus on code COVERAGE (path coverage too, if you have the guts :)).
And yes, you need to test for add, update, and remove anomalies in your system. Do this per table, per connection, per relation.
Scalability
For each table, you need to add data from different clients, increasing the number of clients until you break the system. IT MUST BREAK. That is the maximum you can scale to. A sketch of such a multi-client harness follows below.
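As a rough sketch of what "adding data from different clients until it breaks" can look like (pyodbc is assumed as the driver; the connection string, table names, and columns are placeholders), a harness like this is also where the deadlocks from the question tend to show up:

```python
import threading
import pyodbc  # assumed driver; connection string and table names below are placeholders

CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=...;DATABASE=...;UID=...;PWD=..."

def client_worker(client_id, rows, errors):
    """One simulated client hammering the same two tables as every other client."""
    conn = pyodbc.connect(CONN_STR, autocommit=False)
    cur = conn.cursor()
    try:
        for i in range(rows):
            # odd/even clients touch the tables in opposite order, which is a
            # cheap way to provoke lock-ordering deadlocks under load
            first, second = ("accounts", "audit") if client_id % 2 == 0 else ("audit", "accounts")
            cur.execute(f"UPDATE {first} SET counter = counter + 1 WHERE id = ?", i % 100)
            cur.execute(f"UPDATE {second} SET counter = counter + 1 WHERE id = ?", i % 100)
            conn.commit()
    except pyodbc.Error as exc:   # deadlock victims surface here (SQL Server error 1205)
        errors.append((client_id, str(exc)))
    finally:
        conn.close()

def run_scale_test(num_clients, rows_per_client=1000):
    errors = []
    threads = [threading.Thread(target=client_worker, args=(c, rows_per_client, errors))
               for c in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors  # keep increasing num_clients until the system breaks
```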
Performance
Performance is comparative in nature. You cannot test for performance by itself; it has to be compared between builds or against a competing product (exactly the same data, exactly the same hardware, exactly the same scenario).
Stress
Add a lot of data from all sides, concurrently (don't use the scalability maximums here; the intent is NOT to break the system, but to find out whether it can deal with a lot of data, users, and concurrency).
Reliability
Here, use normal data and the normal scenario your customer is looking for, and run it over 48/72 hours. The GCs should not go crazy, memory leaks should not keep growing, and performance should be "consistent" (it need not be the best), but it should not "DROP" over time. Look for things going bad with time.
Assume you have access to an "oracle" implementation whose output you trust to be correct.
The most obvious way to do this seems to be to run a set of known plaintext/hash combinations through the implementation and see if they come out as expected. An arbitrary number of these cases could be constructed by generating random plaintexts (using a static seed to keep it deterministic) and using the oracle to find their hashes.
The major problem I see with this is that it's not guaranteed to hit possible corner cases. Generating more cases will reduce the likelihood of missing corner cases, but how many cases is enough?
There's also the side issue of specifying the lengths of these random plaintexts because MD5 takes an arbitrary-length string as input. For my purposes, I don't care about long inputs (say, anything longer than 16 bytes), so you can use the fact that this is a "special purpose" MD5 implementation in your answer if it makes things simpler or you can just answer for the general case if it's all the same.
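For concreteness, the comparison harness I have in mind is roughly this (hashlib.md5 as the trusted oracle; my_md5 is a placeholder for the implementation under test and is assumed to return a hex digest):

```python
import hashlib
import random

def test_against_oracle(my_md5, cases=1000, max_len=16, seed=42):
    """Compare the implementation under test with hashlib.md5 on
    deterministic pseudo-random plaintexts (static seed, lengths 0..max_len)."""
    rng = random.Random(seed)  # static seed keeps the suite deterministic
    for _ in range(cases):
        length = rng.randint(0, max_len)
        plaintext = bytes(rng.getrandbits(8) for _ in range(length))
        expected = hashlib.md5(plaintext).hexdigest()
        actual = my_md5(plaintext)
        assert actual == expected, f"mismatch for input {plaintext!r}"
```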
If you have an algorithmic error, it's extremely likely that every hash will be wrong. Hashes are unforgiving by nature.
Since the majority of possible errors will be exposed quickly, you really won't need that many tests. The main things to cover are the edge cases:
Length=0 (input is empty)
Length=1
Length=16
Input contains at least one byte with value 0
Repeated patterns of bytes in the input (would this be a meaningful edge case for MD5?)
If those all pass, perhaps along with tests for one or two more representative inputs, you could be pretty confident in your algorithm. There aren't that many edge cases (unless someone more familiar with the algorithm's details can think of some more).
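Written as a small table-driven test (again assuming hashlib.md5 as the oracle and my_md5 as a stand-in that returns a hex digest), the whole list above is only a handful of lines:

```python
import hashlib

EDGE_CASES = [
    b"",                   # length 0 (empty input)
    b"a",                  # length 1
    b"0123456789abcdef",   # length 16
    b"ab\x00cd",           # contains a zero byte
    b"\xde\xad" * 8,       # repeated byte pattern
]

def test_edge_cases(my_md5):
    for plaintext in EDGE_CASES:
        assert my_md5(plaintext) == hashlib.md5(plaintext).hexdigest(), plaintext
```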
I've recently become quite interested in identifying patterns for software scalability testing. Due to the variable nature of different software solutions, it seems like there are as many good solutions to the problem of scalability testing as there are to designing and implementing software. To me, that means we can probably distill some patterns for this type of testing that are widely used.
For the purposes of eliminating ambiguity, I'll say in advance that I'm using the Wikipedia definition of scalability testing.
I'm most interested in answers proposing specific pattern names with thorough descriptions.
All the testing scenarios I am aware of use the same basic structure for the test, which involves generating a number of requests from one or more requesters targeted at the processing agent to be tested. Kurt's answer is an excellent example of this process. Generally you will run the tests to find some thresholds, and also run some alternative configurations (fewer nodes, different hardware, etc.) to build up accurate averaged data.
A requester can be a machine, a network card, specific software, or a thread in software that generates the requests. All it does is generate a request that can be processed in some way.
A processing agent is the software, network card, or machine that actually processes the request and returns a result.
However, what you do with the results determines the type of test you are doing. The types are:
Load/Performance Testing: This is the most common one in use. The results are processed to see how much is handled at various load levels or in various configurations. Again, what Kurt is looking for above is an example of this.
Balance Testing: A common practice in scaling is to use a load-balancing agent which directs requests to a processing agent. The setup is the same as for load testing, but the goal is to check the distribution of requests. In some scenarios you need to make sure that an even (or as close to even as is acceptable) balance of requests across processing agents is achieved, and in other scenarios you need to make sure that the processing agent that handled the first request from a specific requester handles all subsequent requests (web farms commonly need to work like this).
Data Safety: With this test the results are collected and the data is compared. What you are looking for here are locking issues (such as a SQL deadlock) that prevent writes, and whether data changes are replicated to the various nodes or repositories you have in use within an acceptable time.
Boundary Testing: This is similar to load testing, except the goal is not processing performance but how the amount of data stored affects performance. For example, if you have a database, how many rows/tables/columns can you have before I/O performance drops below acceptable levels?
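As a bare-bones illustration of the requester / processing-agent structure described above (send_request is a placeholder for whatever actually talks to the system under test):

```python
import concurrent.futures
import time

def requester(send_request, n_requests):
    """One requester: fires requests at the processing agent and records each latency."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        send_request()  # the processing agent does the real work
        latencies.append(time.perf_counter() - start)
    return latencies

def run_load_test(send_request, n_requesters=10, n_requests=100):
    """Run several concurrent requesters and aggregate throughput/latency;
    repeat with alternative configurations to build averaged data."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_requesters) as pool:
        results = list(pool.map(lambda _: requester(send_request, n_requests),
                                range(n_requesters)))
    elapsed = time.perf_counter() - start
    all_latencies = [latency for run in results for latency in run]
    return {
        "throughput_rps": len(all_latencies) / elapsed,
        "avg_latency_s": sum(all_latencies) / len(all_latencies),
    }
```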
I would also recommend The Art of Capacity Planning as an excellent book on the subject.
I can add one more type of testing to Robert's list: soak testing. You pick a suitably heavy test load, and then run it for an extended period of time - if your performance tests usually last for an hour, run it overnight, all day, or all week. You monitor both correctness and performance. The idea is to detect any kind of problem which builds up slowly over time: things like memory leaks, packratting, occasional deadlocks, indices needing rebuilding, etc.
This is a different kind of scalability, but it's important. When your system leaves the development shop and goes live, it doesn't just get bigger 'horizontally', by adding more load and more resources, but in the time dimension too: it's going to be running non-stop on the production machines for weeks, months or years, which it hasn't done in development.
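A rough sketch of such a soak run, where run_load_iteration and sample_metrics are placeholders for your existing load harness and whatever metrics you monitor:

```python
import time

def soak_test(run_load_iteration, sample_metrics, hours=48):
    """Run the normal workload repeatedly for an extended period and log
    per-iteration results so that slow drifts (leaks, growing latencies,
    occasional deadlocks) become visible over time."""
    deadline = time.time() + hours * 3600
    history = []
    iteration = 0
    while time.time() < deadline:
        result = run_load_iteration()   # e.g. one pass of your usual performance test
        metrics = sample_metrics()      # e.g. process memory, DB index stats, error counts
        history.append((iteration, time.time(), result, metrics))
        iteration += 1
    return history                      # plot latency/memory against iteration number
```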