Testing fault-tolerant code

I'm currently working on a server application where we have agreed to try and maintain a certain level of service. The level of service we want to guarantee is: if a request is accepted by the server and the server sends an acknowledgement to the client, we want to guarantee that the request will happen, even if the server crashes. As requests can be long running and the acknowledgement time needs to be short, we implement this by persisting the request, then sending an acknowledgement to the client, then carrying out the various actions to fulfill the request. As actions are carried out they too are persisted, so the server knows the state of a request on startup, and there are also various reconciliation mechanisms with external systems to check the accuracy of our logs.
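For concreteness, here is a minimal sketch of that persist-then-acknowledge flow; every interface and method name below is invented for illustration, not taken from our actual code:

import java.util.List;

// Hypothetical sketch of the persist-then-acknowledge flow described above.
// Journal, Action and Client are placeholder interfaces, not APIs from the real system.
interface Journal {
    void persistRequest(String requestId, List<String> actionNames);
    void persistActionDone(String requestId, String actionName);
    List<String> remainingActions(String requestId);   // reconstructed from the log on startup
}

interface Action {
    String name();
    void execute();
}

interface Client {
    void acknowledge(String requestId);
}

final class RequestProcessor {
    private final Journal journal;

    RequestProcessor(Journal journal) { this.journal = journal; }

    void accept(String requestId, List<Action> actions, Client client) {
        journal.persistRequest(requestId,
                actions.stream().map(Action::name).toList()); // 1. make the request durable first
        client.acknowledge(requestId);                        // 2. only then acknowledge
        for (Action action : actions) {                       // 3. long-running work after the ack
            action.execute();
            journal.persistActionDone(requestId, action.name()); // progress survives a crash
        }
    }
}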
This all seems to work fairly well, but we have difficulty saying this with any conviction as we find it very difficult to test our fault-tolerant code. So far we've come up with two strategies, but neither is entirely satisfactory:
Have an external process watch the server process and then try to kill it at what the external process thinks is an appropriate point in the test
Add code to the application that will cause it to crash at certain known critical points
My problem with the first strategy is that the external process cannot know the exact state of the application, so we cannot be sure we're hitting the most problematic points in the code. My problem with the second strategy, although it gives more control over where the fault takes place, is that I do not like having code to inject faults within my application, even with optional compilation etc. I fear it would be too easy to overlook a fault injection point and have it slip into a production environment.

I think there are a few ways to deal with this. First, I would suggest a comprehensive set of integration tests for these various pieces of code, using dependency injection or factory objects to produce broken actions during these integration runs.
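As a rough illustration of that first idea (all type names are hypothetical), the test wiring could swap in a factory whose actions fail on a chosen invocation:

import java.util.concurrent.atomic.AtomicInteger;

// Illustration of the "broken action via factory" idea: the test wires in a factory that
// returns actions which fail partway through, so recovery paths get exercised.
// Action and ActionFactory are placeholder types, not part of any particular framework.
interface Action {
    void execute();
}

interface ActionFactory {
    Action create(String name);
}

/** The production wiring uses a real factory; the integration test swaps in this one. */
final class FaultyActionFactory implements ActionFactory {
    private final AtomicInteger calls = new AtomicInteger();
    private final int failOnCall;                 // which invocation should blow up

    FaultyActionFactory(int failOnCall) { this.failOnCall = failOnCall; }

    @Override
    public Action create(String name) {
        return () -> {
            if (calls.incrementAndGet() == failOnCall) {
                throw new IllegalStateException("injected failure in action: " + name);
            }
            // otherwise do nothing - stands in for the real action
        };
    }
}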
Secondly, running the application with random kill -9s and disabling of network interfaces may be a good way to test these things.
I would also suggest testing file system failure. How you would do that depends on your OS; on Solaris or FreeBSD I would create a ZFS file system in a file, and then rm the file while the application is running.
If you are using database code, then I would suggest testing failure of the database as well.
Another alternative to dependency injection, and probably the solution I would use, is interceptors. You can enable crash-test interceptors in your code; these would know the state of the application and introduce the failures listed above, or any others you may want to create, at the correct time. It would not require changes to your existing code, just some additional code to wrap it.
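One hedged sketch of such an interceptor, using a plain JDK dynamic proxy so that no existing class changes; the names and the kill-on-nth-call policy are only examples:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicInteger;

// Wraps any interface-typed component and crashes the process once a configured
// method has been entered a given number of times. All names are illustrative.
final class CrashTestInterceptor implements InvocationHandler {
    private final Object target;
    private final String methodToCrashIn;
    private final int crashOnCall;
    private final AtomicInteger calls = new AtomicInteger();

    private CrashTestInterceptor(Object target, String methodToCrashIn, int crashOnCall) {
        this.target = target;
        this.methodToCrashIn = methodToCrashIn;
        this.crashOnCall = crashOnCall;
    }

    @SuppressWarnings("unchecked")
    static <T> T wrap(T target, Class<T> iface, String methodToCrashIn, int crashOnCall) {
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] { iface },
                new CrashTestInterceptor(target, methodToCrashIn, crashOnCall));
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        if (method.getName().equals(methodToCrashIn)
                && calls.incrementAndGet() == crashOnCall) {
            Runtime.getRuntime().halt(137);   // abrupt death, no shutdown hooks - like kill -9
        }
        return method.invoke(target, args);
    }
}

Wiring it up in a test would then look something like journal = CrashTestInterceptor.wrap(realJournal, Journal.class, "persistActionDone", 2), using whatever hypothetical component you want to crash around.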

A possible answer to the first point is to multiply the experiments with your external process so that the probability of hitting problematic parts of the code is increased. You can then analyze the core dump file to determine where the code actually crashed.
Another way is to increase observability and/or controllability by stubbing library or kernel calls, i.e., without modifying your application code.
You can find some resources on the Fault Injection page of Wikipedia, in particular in the Software Implemented Fault Injection section.

Your concern about fault injection is not a fundamental concern. You merely need a foolproof way to prevent such code from ending up in deployment. One way to do so is to design your fault injector as a debugger, i.e. the faults are injected by a process external to your process. This already provides a level of isolation. Furthermore, most OSes provide some kind of access control which prevents debugging unless specifically enabled. In the most primitive form, it's limited to root; on other operating systems it requires a specific "debug privilege". Naturally, in production nobody will have that, and thus your fault injector cannot even run on production.
Practically, the fault injector can set breakpoints at specific addresses, i.e. at a function or even a specific line of code. You can then react to that, e.g. by terminating the process after a certain breakpoint is hit three times.
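If the server happens to run on a JVM, one way to sketch such a debugger-style injector is the Java Debug Interface (JDI): the test environment starts the server with the usual -agentlib:jdwp options (never set in production), and an external process attaches, plants a breakpoint and kills the target on the nth hit. The class, method and port below are purely illustrative:

import com.sun.jdi.Bootstrap;
import com.sun.jdi.Location;
import com.sun.jdi.ReferenceType;
import com.sun.jdi.VirtualMachine;
import com.sun.jdi.connect.AttachingConnector;
import com.sun.jdi.connect.Connector;
import com.sun.jdi.event.BreakpointEvent;
import com.sun.jdi.event.Event;
import com.sun.jdi.event.EventSet;
import com.sun.jdi.request.BreakpointRequest;
import java.util.Map;

// External fault injector: attaches to the server over JDWP, breaks on a chosen method
// and terminates the target after the third hit. Class, method and port are made up.
public final class DebuggerFaultInjector {
    public static void main(String[] args) throws Exception {
        AttachingConnector connector = Bootstrap.virtualMachineManager()
                .attachingConnectors().stream()
                .filter(c -> c.name().equals("com.sun.jdi.SocketAttach"))
                .findFirst().orElseThrow();
        Map<String, Connector.Argument> conn = connector.defaultArguments();
        conn.get("hostname").setValue("localhost");
        conn.get("port").setValue("5005");
        VirtualMachine vm = connector.attach(conn);

        // assumes the class has already been loaded in the target VM
        ReferenceType journal = vm.classesByName("com.example.RequestJournal").get(0);
        Location loc = journal.methodsByName("persistActionDone").get(0).location();
        BreakpointRequest bp = vm.eventRequestManager().createBreakpointRequest(loc);
        bp.enable();

        int hits = 0;
        while (true) {
            EventSet events = vm.eventQueue().remove();
            for (Event event : events) {
                if (event instanceof BreakpointEvent && ++hits == 3) {
                    vm.exit(137);   // terminate the target abruptly at a known point
                    return;
                }
            }
            events.resume();
        }
    }
}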

I was just about to write the same as Justin :)
The component I would suggest replacing during testing is the logging component (if you have one; if not, I'd strongly suggest implementing one...). It's relatively easy to replace it with code that generates errors, and the logger usually gets enough information to know the current application state.
It also seems feasible to make sure that the testing code doesn't go into production. I would discourage conditional compilation, though, and rather go with some configuration file to select the logging component.
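A minimal sketch of that idea, assuming a small home-grown logger interface and a configuration property that names the implementation class; everything here is invented for illustration:

// AppLogger is a hypothetical interface of an in-house logging component; the
// implementation class is chosen from configuration rather than conditional compilation.
interface AppLogger {
    void log(String state, String message);
}

final class ProductionLogger implements AppLogger {
    @Override public void log(String state, String message) {
        System.out.println(state + ": " + message);
    }
}

/** Test-only implementation: crashes the process when a chosen application state is logged. */
final class CrashingLogger implements AppLogger {
    // which application state should trigger the crash, also taken from configuration
    private final String crashWhenStateIs =
            System.getProperty("app.logger.crashState", "REQUEST_PERSISTED");

    @Override public void log(String state, String message) {
        if (state.equals(crashWhenStateIs)) {
            Runtime.getRuntime().halt(137);   // simulate a hard crash at exactly this point
        }
        System.out.println(state + ": " + message);
    }
}

final class LoggerFactory {
    /** The implementation class name comes from configuration, e.g. -Dapp.logger=CrashingLogger. */
    static AppLogger fromConfig() throws Exception {
        String className = System.getProperty("app.logger", ProductionLogger.class.getName());
        return (AppLogger) Class.forName(className).getDeclaredConstructor().newInstance();
    }
}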
Using "random" kills might help to detect errors but is not well suited for systematic testing because of its non-determinism. Therefore I wouldn't use it for automatic tests.


IWantToRunWhenBusStartsAndStops not for production?

New to NServiceBus (4.7.5) and just implemented an NSB host.exe-hosted service (implementing IWantToRunWhenBusStartsAndStops) that detects changes to database tables and notifies subscribing web apps by publishing events, e.g. "CustomerDataWasUpdatedEvent". In the future we will obviously perform the actual updates through message handlers receiving commands, but at the moment this publishing service just polls the database etc.
It all works well. However, approaching production, I noticed that David Boike, in his latest edition of "Learning NServiceBus", states that classes implementing IWantToRunWhenBusStartsAndStops are really mostly for development and rarely used in production. I set up my database change detection in the Start method and it works nicely; does anyone know why this is discouraged?
Here is the comment in the actual book:
https://books.google.se/books?id=rvpzBgAAQBAJ&pg=PA110&lpg=PA110&dq=nservicebus+iwanttorunwhenbusstartsandstops+in+production+david+boike&source=bl&ots=U6sNII0nm3&sig=qIXffOVFhcy-_3qDnSExRpwRlD4&hl=sv&sa=X&ei=lHWRVc2_BKrWywPB65fIBw&ved=0CBsQ6AEwAA#v=onepage&q=nservicebus%20iwanttorunwhenbusstartsandstops%20in%20production%20david%20boike&f=false
The actual quote is:
...it isn't common to have widespread use of them in a production system.
Uncommon is not the same thing as discouraged.
That said I do think there is intent here by the author to highlight the fact that further up the page they assert that this is not a good place to be doing lots of coding, as an unhandled exception can cause the whole process to fail.
The author actually does go on to mention a possible use case where you may want to load a resource or resources to do work within the handler.
Ok, maybe it's just this scenario we have that is a bit uncommon
Agreed - there is nothing fundamentally wrong with your approach. I recently did the same thing as you for wiring up SqlDependency to listen for database events and then publish a message as a result. In these scenarios there is literally nothing else you can do other than to use IWantToRunAtStartup.
Also, David himself often trawls the nservicebus tag, maybe he'll provide a more definitive answer than mine.
I'll copy the answer I gave in the Particular Software Google Group...
I'll quote myself directly here:
An implementation of IWantToRunWhenBusStartsAndStops is a great place to create a quick interface in order to test messages during debugging by allowing you to send messages based on the console input. Apart from this, it isn't common to have widespread use of them in a production system. One possible production use case will be to provision a resource needed by the endpoint at startup and then tear it down when the endpoint stops.
I think if I could add a little bit of emphasis it would be to "widespread use". I'm not trying to say you won't/can't have an IWantToRunWhenBusStartsAndStops in production code or that avoiding them is a best practice. I am trying to say that having a ton of them is probably a code smell.
Above that paragraph in the book, I warn about IWantToRunWhenBusStartsAndStops not having any ambient transactions or try/catch stuff going on. THAT is really the key part. If you end up throwing an exception in an IWantToRunWhenBusStartsAndStops, you can run into big problems. If you use something like a .NET Timer and then throw an exception, you can crash your process!
Let me tell you how I screwed up on this in my first-ever NServiceBus system. The system (still in use today, from what I hear) is responsible for ingesting more than 3000 RSS feeds (probably a lot more than that now) into a CMS. So processing each feed, breaking it up into items, resizing images, encoding attached video for mobile ... all those things were handled in NServiceBus message handlers, which was scaled out to multiple servers, and that was all fantastic.
The problem was the scheduler. I implemented that as an IWantToRunWhenBusStartsAndStops (well, actually IWantToRunAtStartup at that time) and it quickly turned into a mess. I kept a whole table's worth of feed information in memory so that I could calculate when to fire off the next ProcessFeed command. I was using the .NET Timer class, and IIRC, I eventually had to use threading primitives like ManualResetEvent in order to coordinate the activity. And because I was using .NET Timer, if the scheduler threw an exception, that endpoint failed and had to restart. Lots of weird edge cases, and it was always a quagmire of bugs. Plus, this was now a singleton "commander app", so while the feed/item processors could be scaled out, the scheduler could not.
As I got more experienced with NServiceBus, I realized that each feed should have been a saga, starting from a FeedCreated event, controlled through PauseProcessing and ResumeProcessing commands, using timeouts to control the next processing time, and finally (perhaps) ended via a FeedRemoved event. This would have been MUCH more straightforward and everything would have executed inside transactionally-controlled message handlers.
That experience led me to be a little bit distrustful/skeptical of IWantToRunWhenBusStartsAndStops. Not saying it's bad, just something to be aware of. Always be prepared to consider if what you're trying to do couldn't be better accomplished in another way.

Assertions in ABAP

Over the years I've written code in a variety of languages and environments, but one constant seemed to be the consensus on the use of assertions. As I understand it, they are there for the development process when you want to identify "impossible" errors and other situations to which your first reaction would be "that can't be right", and which cannot be handled gracefully, leaving the system in a state where it has no choice but to terminate. Assertions are easy to understand and quick to code, but due to their fail-fast nature are unsuitable for production code. Ideally, assertions are used to discover all development bugs and are then removed or turned off when shipping the code. Input or program states that are wrong but possible (and expected to occur) should instead be handled gracefully via exceptions or other error handling techniques.
However, none of this seems to hold true for writing ABAP code for SAP. I've just spent the better part of an hour trying to track down the precise location where an assert was giving me an unintelligible error. This turned out to be five levels down in standard SAP code, which is apparently riddled with ASSERT statements. I now know that a certain variable identifying a table IS NOT INITIAL while its accompanying variable identifying a field is.
This tells me nothing. The Web Dynpro component running this code actually "catches" this assert, showing me a generic error message, which only serves to prevent the debugger from launching when the assert is tripped.
My question therefore is what the guidelines or best practices are for the use of assertions in ABAP. Is this SAP writing bad code? Is it an accepted practice to fill your custom code with asserts and leave them in when shipping the code? If so, how would we go about handling these asserts at runtime so that the application doesn't crash and burn while still being able to identify the cause of the error?
The guidelines and best practices are virtually the same in ABAP development as in any other language. Assertions should be used as internal guidance checks only, exceptions for regular input validation errors and other stuff. It might be sensible to leave the assertions in the code - after all, you'd probably rather want your program to crash in a controlled fashion than continue in an unforeseen way and probably damage some critical data in the process without anyone noticing. Take a look at checkpoint groups if you don't want your program to abort in a production environment - but in my opinion: what's the use of a sanity check (as a last line of defense) if it's disabled in the environment where it matters most?
Of course I'm assuming that the input is validated properly (so that crashes are prevented) and that all APIs are used according to the intended use and documentation. Unfortunately - as with every other programming language - it's up to the developer to live up to these standards.

How to simulate production concurrency issue using tests

I have an application which receives an object from a queue, transforms it and publishes it on a topic. It's a message-driven bean (Spring message listener container) with a fair number of inner beans.
Some strange activity recently happened on the prod boxes. We want to check if this is a concurrency issue. Which is great, but not something I've done before.
My approach is to pump a load of messages at the application, write a piece of software to listen to its publishing topic, consume these and process them via something similar to JUnit tests which compare the object's attributes with the expected result.
I've added the above for a bit of scope on the problem, but basically: are there any applications on the market that I could plug into either my code or my IDE which would enable me to do that? I think this is somewhat beyond the capabilities of JUnit.
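As a rough sketch of the harness described above, assuming plain JMS with text messages and JUnit doing the final assertion; the topic name and payloads are made up, and the ConnectionFactory comes from whichever broker is in use:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.TextMessage;

// Subscribes to the application's output topic, pulls back the messages it publishes and
// counts how many match the expected payload; a JUnit test then asserts on the result.
// The consumer should be created before the test pumps messages at the application.
final class PublishedMessageChecker {

    static int countMatching(ConnectionFactory factory, String expectedPayload,
                             int expectedCount, long timeoutMillis) throws Exception {
        Connection connection = factory.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer =
                    session.createConsumer(session.createTopic("app.output.topic"));
            int matches = 0;
            for (int i = 0; i < expectedCount; i++) {
                TextMessage message = (TextMessage) consumer.receive(timeoutMillis);
                if (message == null) break;                        // timed out, messages missing
                if (expectedPayload.equals(message.getText())) matches++;
            }
            return matches;                                        // assertEquals(expectedCount, matches)
        } finally {
            connection.close();
        }
    }
}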
I would recommend JavaPathfinder (JPF) for this type of exercise.
JPF is essentially a JVM that simulates execution of your code in every possible way - e.g. simulating all the possible interleavings of the bytecode instructions. JPF works as a model checker that explores the set of possible states that an application can enter and evaluates them based on predefined rules - e.g. deadlock-free, non-null, etc.
Model checkers usually fail when applications become too complex but JPF is able to execute the app without traversing the whole state space and is usually able to reach problem states pretty quickly. Try to have a look at that one.
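To make the idea concrete, here is a minimal example of the kind of code JPF is aimed at; the .jpf configuration mentioned in the comments is the usual jpf-core setup and would need adjusting to your install:

// Tiny example of the kind of defect JPF is good at finding: two threads doing an
// unsynchronized read-modify-write on a shared field. Run under JPF (e.g. a RacyCounter.jpf
// file containing "target=RacyCounter" and, optionally,
// "listener=gov.nasa.jpf.listener.PreciseRaceDetector") and it searches the thread
// interleavings rather than relying on luck.
public class RacyCounter {
    static int count = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> count++);   // count++ is not atomic
        Thread t2 = new Thread(() -> count++);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        assert count == 2 : "lost update - only reachable in some interleavings";
    }
}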

Self-diagnostic test for software layers?

With the increasing effort put into dividing software pieces into independent layers and decoupling them using dynamic discovery and dependency injection, it has become harder to determine which "layer" in the system is contributing to a system-wide failure of the "application".
Unit tests help by making sure that all the "modules" that make up a layer are working as expected. But unit tests are written in such a way that each "module" is isolated by using techniques such as stubbing and mocking.
Consider the simplistic example below:
L1. Database -> L2. Database Layer -> L3. Windows Service -> L4. Client Application
For instance, if the database engine is down, then the system will not operate normally. It will be hard to tell if the database engine is really down or that there is a bug in the database layer (L2) code. To check, you would have to launch some kind of a database management tool to check if the database engine is running.
What we are trying to achieve is a developer tool that can be launched whenever "there is something wrong" with the system, and this tool will "query" each layer for its "integrity" or "diagnostic" data. The tool will provide the list of software layers and their "integrity status". It will then be able to say, right off the bat, that layer X is the cause of the issue (e.g. the database engine is down).
Of course, each layer will be responsible for providing its own "diagnostic method" that can be queried by the tool.
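A minimal sketch of what that contract might look like, with invented names; the transport (in-process call, WCF, HTTP, whatever) is a separate decision:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Every layer exposes the same self-check contract and the developer tool just walks the list.
// The names and the shape of the report are illustrative only.
interface SelfDiagnosable {
    String layerName();
    /** Returns true when the layer considers itself healthy; should be cheap and side-effect free. */
    boolean runSelfTest();
}

final class DiagnosticTool {
    static Map<String, Boolean> diagnose(List<SelfDiagnosable> layers) {
        Map<String, Boolean> report = new LinkedHashMap<>();
        for (SelfDiagnosable layer : layers) {
            boolean healthy;
            try {
                healthy = layer.runSelfTest();
            } catch (RuntimeException e) {
                healthy = false;                  // a failing self-test is itself a diagnosis
            }
            report.put(layer.layerName(), healthy);
        }
        return report;                            // e.g. {Database=false, DatabaseLayer=true, ...}
    }
}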
I guess what we're trying to achieve here is some kind of an "integration test" framework or something similar that can be used at runtime (not at compile/build time like a unit test). The inspiration came from physical devices that have their own "on-board diagnostics", like cars. A good example in the software world is the power-on self-test that runs every time the computer is turned on.
Has anyone seen or heard of something like this? Any suggestions or pointers will surely help a lot!
You could have a common interface that each layer would implement as a WCF service. This way you could connect to every layer and diagnose it. Having these diagnostics available might be useful, but if you want to implement it everywhere (every layer) it will be another thing that may fail - how would you diagnose that? Your system would swarm with WCF services, which is not good, as it requires a lot of maintenance and makes it less stable. Plus it requires a lot of work to implement.
The alternative that I would suggest is to have a good logging system in place. The minimum would be to have every module log an error in all catch sections, but I suggest more than that, especially for debug purposes. I recommend using Log4Net, which is free and very flexible. It is efficient about not logging when it is not required, meaning that you can set the logging level high and it does not impact performance, even in production code. You can change the logging level during runtime by changing a setting in a config file. I use Log4Net a lot and it works well.
Once you have your code logging stuff, you can configure Log4Net so that all logs go into a central database. You would then have one place where you could relatively easily diagnose what's going on, what has failed, where, and what the exception or message was. You can even set up email messages to be sent when something goes wrong.

How do you test your applications for reliability under badly behaving i/o?

Almost every application out there performs i/o operations, either with disk or over network.
As my applications work fine in the development environment, I want to be sure they will still work when the Internet connection is slow or unstable, or when the user attempts to read data from a badly-written CD.
What tools would you recommend to simulate:
slow i/o (opening files, closing files, reading and writing, enumeration of directory items)
occasional i/o errors
occasional 'access denied' responses
packet loss in tcp/ip
etc...
EDIT:
Windows:
The closest solution to doing the job as described seems to be Holodeck, commercial software (>$900).
Linux:
An open solution hasn't been found so far, but the same effect can be achieved as described by smcameron and krosenvold.
Decorator pattern is a good idea.
It would require wrapping my i/o classes, but would result in a testing framework.
The only remaining untested code would be in 3rd party libraries.
Yet I decided not to go this way, but leave my code as it is and simulate i/o errors from outside.
I now know that what I need is called 'fault injection'.
I thought it was a common production-line part with plenty of solutions I just didn't know about.
(By the way, another similar good idea is 'fuzz testing', thanks to Lennart)
To my mind, the problem is still not worth $900.
I'm going to implement my own open-source tool based on hooks (targeting win32).
I'll update this post when I'm done with it. Come back in 3 or 4 weeks or so...
What you need is a fault injecting testing system. James Whittaker's 'How to break software' is a good read on this subject and includes a CD with many of the tools needed.
If you're on linux you can do tons of magic with iptables;
iptables -I OUTPUT -p tcp --dport 7991 -j DROP
It can simulate connections going up/down as well. There are lots of tutorials out there.
Check out "Fuzz testing": http://en.wikipedia.org/wiki/Fuzzing
At a programming level, many frameworks will let you wrap the IO stream classes and delegate calls to the wrapped instance. I'd do this and add in a couple of wait calls in the key methods (writing bytes, closing the stream, throwing IO exceptions, etc). You could write a few of these with different failure or issue types and use the decorator pattern to combine them as needed.
This should give you quite a lot of flexibility with tweaking which operations would be slowed down, inserting "random" errors every so often etc.
The other advantage is that you could develop it in the same code as your software so maintenance wouldn't require any new skills.
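For example, a hedged sketch of such a decorator for Java's InputStream; the delay and error probability are arbitrary knobs:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Random;

// Wraps any InputStream, slows every bulk read down and occasionally throws an IOException,
// simulating a slow or flaky device. Values passed in are arbitrary test parameters.
public class MisbehavingInputStream extends FilterInputStream {
    private final Random random = new Random();
    private final long delayMillis;
    private final double errorProbability;

    public MisbehavingInputStream(InputStream delegate, long delayMillis, double errorProbability) {
        super(delegate);
        this.delayMillis = delayMillis;
        this.errorProbability = errorProbability;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        try {
            Thread.sleep(delayMillis);                     // simulate a slow device or network
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (random.nextDouble() < errorProbability) {
            throw new IOException("injected I/O failure"); // simulate an occasional read error
        }
        return super.read(b, off, len);
    }
}

Wrapping is then just new MisbehavingInputStream(new FileInputStream("data.bin"), 200, 0.05) in the test setup, with the file name obviously being whatever your test actually reads.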
You don't say what OS, but if it's Linux or Unix-ish, you can wrap open(), read(), write(), or any other library or system call, with an LD_PRELOAD-able library to inject faults.
Along these lines:
http://scaryreasoner.wordpress.com/2007/11/17/using-ld_preload-libraries-and-glibc-backtrace-function-for-debugging/
I didn't end up writing my own file system filter, as I had initially thought I would, because there's a simpler solution.
1. Network i/o
I've found at least 2 ways to simulate i/o errors here.
a) Running a virtual machine (such as VMware) allows you to configure bandwidth and packet loss rate. VMware supports on-machine debugging.
b) Running a proxy on the local machine and tunneling all the traffic through it. For the case of udp/tcp communications, a proxifier (e.g. WideCap) can be used.
2. File i/o
I've managed to reduce this scenario to the previous one by mapping a drive letter to a network share which resides inside the virtual machine. The file i/o will be slow.
A cheaper alternative exists: to set up a local ftp server (e.g. FileZilla), configure speeds and use Novell's NetDrive to access it.
You'll want to set up a test lab for this. What type of application are you building anyway? Are you really expecting the application to be fed corrupt data?
A test technique I know the Microsoft Exchange Server people tried was sending noise to the server. Basically feeding every possible input with seemingly random data. They managed to crash the server quite often this way.
But still, if you can't trust input that hasn't been signed then general rules apply. Track every operation which could potentially be untrusted (result of corrupt data) and you should be able to handle most problems gracefully.
Just test your application behavior on random input; that should catch most problems, but you'll never be able to fully protect yourself from corrupt data. That's just not possible, as the data could be part of some internal buffer being handed off within the application itself.
Be mindful of when and how you decode data. That is all.
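A minimal sketch of that "feed it noise" idea; parse() here is only a placeholder for whatever entry point in your application really consumes untrusted input:

import java.util.Random;

// Hammers a parsing/decoding routine with random bytes and reports anything other than a
// controlled rejection. The loop count, sizes and exception policy are arbitrary.
public class NoiseTester {
    public static void main(String[] args) {
        Random random = new Random(42);             // fixed seed so failures can be reproduced
        for (int i = 0; i < 100_000; i++) {
            byte[] input = new byte[random.nextInt(1024)];
            random.nextBytes(input);
            try {
                parse(input);
            } catch (IllegalArgumentException expected) {
                // a controlled rejection of bad input is fine
            } catch (RuntimeException unexpected) {
                System.err.println("input of length " + input.length + " caused: " + unexpected);
            }
        }
    }

    private static void parse(byte[] input) {
        // placeholder for the real decoder under test
        if (input.length == 0) throw new IllegalArgumentException("empty input");
    }
}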
The first thing you'll need to do is define what "correct" means under these circumstances. You can only test against a definition of what behaviour is intended.
The tactics of testing will depend on technology. In the context of automated unit testing, I have found it very useful, in OO languages such as Java, to use various flavors of "mocking" or "stubbing" to pass e.g. misbehaving InputStreams to parts of my code that used file I/O.
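For instance, a small JUnit 5 sketch along those lines; ReportLoader and ReportLoadException are hypothetical stand-ins for the code under test:

import static org.junit.jupiter.api.Assertions.assertThrows;

import java.io.IOException;
import java.io.InputStream;
import org.junit.jupiter.api.Test;

// A hand-rolled stub InputStream that always fails, handed to the code under test so the
// I/O error handling path is exercised deterministically.
class BrokenStreamTest {

    private static final InputStream ALWAYS_FAILS = new InputStream() {
        @Override public int read() throws IOException {
            throw new IOException("simulated disk error");
        }
    };

    @Test
    void loaderTranslatesIoFailureIntoDomainError() {
        assertThrows(ReportLoadException.class, () -> new ReportLoader().load(ALWAYS_FAILS));
    }

    // hypothetical code under test, included only so the example is self-contained
    static class ReportLoadException extends RuntimeException {
        ReportLoadException(Throwable cause) { super(cause); }
    }

    static class ReportLoader {
        String load(InputStream in) {
            try { return new String(in.readAllBytes()); }
            catch (IOException e) { throw new ReportLoadException(e); }
        }
    }
}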
Consider Holodeck for some of the fault injection. If you have access to spare hardware, you can simulate network impairment using netem, or a commercial product based on it, the Mini-Maxwell, which is much more expensive than free but possibly easier to use.