Steps to error-proofing a mission-critical process

I'm writing a program that will continuously process files placed into a hot folder.
This program should have 100% uptime with no admin intervention. In other words, it should not fail on "stupid" errors; e.g., if someone deletes the output directory, it should simply recreate it and move on.
What I'm thinking of doing is coding the entire program, then going through it looking for "error points" and adding code to handle each one.
What I'm trying to avoid is adding spurious or unnecessary error handling, or building error handling into the control flow of the program (i.e. the error handling controls the flow of the program). Well, perhaps it could control the flow to a certain extent, but that would constitute bad design (subjective).
What are some methodologies for "error proofing" a "critical" process?
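For illustration, a minimal sketch of the "recreate the output directory and move on" behavior, assuming Python; the folder paths and the process() function are placeholders, not part of the question:

    # Hypothetical sketch: a hot-folder loop that treats a missing output
    # directory as a recoverable condition rather than a fatal error.
    import os
    import shutil
    import time

    HOT_DIR = "hot"    # placeholder input folder
    OUT_DIR = "out"    # placeholder output folder

    def process(path):
        """Placeholder for the real per-file work."""
        return path

    def main_loop():
        while True:
            # Recreate both folders if someone deleted them, then move on.
            os.makedirs(HOT_DIR, exist_ok=True)
            os.makedirs(OUT_DIR, exist_ok=True)
            for name in os.listdir(HOT_DIR):
                src = os.path.join(HOT_DIR, name)
                try:
                    process(src)
                    shutil.move(src, os.path.join(OUT_DIR, name))
                except OSError:
                    continue   # log and keep going; one bad file must not stop the loop
            time.sleep(1)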

If your process must be error-proof and run with no admin intervention, you must handle all possible errors. If you leave any chance of the program stopping, it will happen (Murphy's Law) and you will not know about it.
Even if you handle all possible errors, I think you'll need some logging, and even a monitor with (mail?) alerts, to be sure your process is always running fine.
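As a sketch of that idea, using Python's standard logging with an SMTP handler (the mail host and addresses are placeholders):

    # Hypothetical sketch: log everything to a file, mail only serious failures.
    import logging
    import logging.handlers

    log = logging.getLogger("hotfolder")
    log.setLevel(logging.INFO)
    log.addHandler(logging.FileHandler("hotfolder.log"))

    mailer = logging.handlers.SMTPHandler(
        mailhost="smtp.example.com",            # placeholder mail host
        fromaddr="hotfolder@example.com",
        toaddrs=["admin@example.com"],
        subject="hot folder processor error",
    )
    mailer.setLevel(logging.ERROR)   # alerts only for errors, not routine activity
    log.addHandler(mailer)

    log.info("processed %s", "some_file.txt")
    log.error("output directory vanished; recreated it and moved on")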

The most important thing to do is to document your assumptions in the form of unit tests. You should write a test that violates each assumption, and then prove that your program successfully recovers or takes action to restore the expected state.
To use your example, if someone could delete the critical folder, make a test that simulates this and then show that your program handles this case without crashing.
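A minimal sketch of such a test, assuming Python's unittest and a hypothetical processor module exposing OUT_DIR and process_one_file():

    # Hypothetical sketch: violate the "output directory exists" assumption
    # and prove the program recovers without crashing.
    import os
    import shutil
    import unittest

    from processor import OUT_DIR, process_one_file   # hypothetical module

    class DeletedOutputDirTest(unittest.TestCase):
        def test_recreates_deleted_output_dir(self):
            shutil.rmtree(OUT_DIR, ignore_errors=True)   # violate the assumption
            process_one_file("dummy_input.txt")          # must not raise
            self.assertTrue(os.path.isdir(OUT_DIR))      # expected state restored

    if __name__ == "__main__":
        unittest.main()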

Unit testing.

One technique for thorough analysis is a HAZOP study, where for each part of the process you consider keywords for that process. For a chemical in a process plant, these might be 'more', 'less', 'missing', 'hotter', 'colder', 'leak', 'pressure', and so on.
When applying HAZOP to software, you would consider keywords appropriate to the objects in your software.
For example, for reading a file you might consider 'more' to be a buffer overrun, 'less' missing data, 'missing' the file not existing, 'leak' a lack of file handles, and so on.
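As an illustration, here is one way those keywords could translate into concrete checks for a file read (a Python sketch; the size limits are invented for the example):

    # Hypothetical sketch: HAZOP-style keywords mapped onto reading a file.
    import os

    MAX_BYTES = 10 * 1024 * 1024   # 'more': cap on input size (invented limit)
    MIN_BYTES = 1                  # 'less': reject empty or truncated input

    def read_job_file(path):
        if not os.path.exists(path):          # 'missing': file does not exist
            raise FileNotFoundError(path)
        size = os.path.getsize(path)
        if size > MAX_BYTES:                  # 'more': oversized input
            raise ValueError("input too large")
        if size < MIN_BYTES:                  # 'less': missing data
            raise ValueError("input truncated or empty")
        with open(path, "rb") as f:           # 'leak': context manager closes
            return f.read()                   # the handle even on error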

Related

LabVIEW program changes behavior after looking at (not changing) the block diagram

My LabVIEW program works like a charm, until I look at the block diagram. No changes are made. I do not save. Just Ctrl+E and then Ctrl+R.
Now it does not work properly. Only a restart of LabVIEW fixes the problem.
My program controls two scanner arrays for laser cutting simultaneously. To force parallel operation, I use the error handler and loops that wait for a signal from the scanner. But suddenly some loops run more often than they should.
What actually happens in LabVIEW when I open the block diagram that messes with my code?
Edit:
It's hard to tell what is happening without violating my non-disclosure agreement.
I'm controlling two independent mirror arrays for laser cutting. While one is running one cutting job, the other is supposed to run the other jobs, just very fast. When the first is finished, they meet at the same position and run the same geometry at the same slow speed. The jobs are provided as *.XML and stored as .NET objects. The device only runs the most recent job and overwrites it when getting a new one.
I can check whether a job is still running. While this is true, I run a while loop for the other jobs. Now this loop runs a few times too often and even ignores WAIT blocks to a degree. It also skips the part where it reads the XML job file, changes the speed part back to fast again, and saves it. It only runs fast one time.
@Joe: No, it does not. It only runs well once; afterwards it does not.
YouTube links
The way it is supposed to move
The wrong way
There is exactly one thing I can think of that changes solely by opening the block diagram.
When the block diagram opens, any commented-out or unreachable-code-compiler-eliminated sections of code will load their subVIs. If one of those commented-out sections of code were to somehow interfere with your running code, you might have an issue.
There are only two ways I know of for that to interfere... both of them are fairly improbable.
a) You have some sort of "check for all VIs in memory" or "check for all types in memory" that you're using as a plug-in system. When the commented-out sections load, that would change the VIs in memory. Such systems are not uncommon when parsing XML, so maybe.
b) You are using the Run VI method for some dynamically invoked VI to execute as a top-level VI, but by loading the diagram, LabVIEW discovers that it is a subVI of your current program. A VI cannot simultaneously be top-level and a subVI, so the call to Run VI returns an error.
That's it. I can't think of anything else. Both ideas seem unlikely, but given your claim and a lack of a block diagram, I figured I'd post it as a hypothesis.
In the improbable case that someone has a similar problem: the problem was an XML file that was read at run time. Sometimes multiple instances tried to access it, and this produced the error.
Quick point to check: are Debug and "retain data in wires" disabled? While it may not change the computations, it may certainly change the timing of very tight loops, and that was one of the unexpected program behaviors the OP was referring to.

LabVIEW: missing block diagram

I have two broken VIs whose front panels open fine, but I can't edit or run them, or open their block diagrams.
One of these was made as a replacement for the first when it started to have this problem. I need to at least find out how to avoid this problem in the future, so I don't lose work on bigger VIs.
I'm not sure if it makes any difference, but I very recently upgraded to LabVIEW 2013.
Thank you in advance.
This is the error I get when I try to run them:
"
VI has a bad connection to or cannot find a subVI or external routine.
This VI has a bad connection to or cannot find a subVI or external routine but
it has no block diagram to show or fix the error. You must find or correct the
subVI or external routine. Check for more information in the Explain dialog box
in Get Info.
"
Before reverting to a previous version (using Dropbox), I got a different error with one of them:
"
LabVIEW: Generic error.
An error occurred loading VI 'sweep harmonics first test.vi', LabVIEW load
error code 6: Could not load the block diagram.
"
One situation where this can happen:
Sometimes LabVIEW crashes and restarts. After the restart, LabVIEW will ask you to recover the autosaved code.
I personally always discard the autosaved code. If you do choose to recover it, there is a chance the recovered code is corrupted. Once you save corrupted code to disk, you are probably going to lose the ability to open/save the block diagram ever again.
Having a version control system is usually a way to minimize the damage when LabVIEW crashes. At worst, you lose maybe an hour's worth of work.
If you can't open the block diagram of your VI, first check the suggestion by @Rodrigo: it is most likely just a "compiled" VI, which has had its block diagram removed.
If you think there is a block diagram inside and it is just corrupted, you may contact NI support. And if you want to look deeper by yourself, extract the VI to XML using pyLabview and look into the XML; there you can modify every single part of the VI. For example, you may start removing parts until it starts working.
I wouldn't go into manual VI editing unless you have at least a dozen affected files, though. For a single file, it will be faster to re-create it in LabVIEW than to try to understand the internals. If many files are affected, it may be worth finding the issue in one, as the other files probably have the same glitch; you can then make a script which extracts, modifies, and re-creates the VIs automatically.
From the sound of it, I believe what happens is that you are trying to run the VIs created as "DATA" for an executable, instead of the actual source VIs.
When you build an executable, LabVIEW creates a copy of all the top-level VI's dependencies in the support (DATA) folder, which should be in the same directory as your executable.
Try opening the VIs that are marked as not having a block diagram and navigate to File>>VI Properties to check the path from which each VI is being loaded. If it's not the original VI, you can just replace it.

Should app crash or continue on normally while noting the problem?

Options:
1) When there is bad input, the app crashes and prints a message to the console saying what happened
2) When there is bad input, the app throws away the input and continues on as if nothing happened (though noting the problem in a separate log file).
While 2 may seem like the obvious solution, the app is an engine and framework for game development, so if a user is writing something and does something wrong, it may be beneficial for that problem to be immediately obvious (the app crashing) rather than ignored, since the user may forget to check the log to see whether there were any problems (easy to do if the programmed behavior isn't very noticeable on screen, so he doesn't catch that it is missing).
There is no one-size-fits-all solution. It really depends on the situation and how bad the input is.
However, since you specifically mentioned this is for an engine or framework, I would say it should never crash. It should raise exceptions or provide notable return codes or whatever is relevant for your environment, and then the application developer using your framework can decide how to handle them. The framework itself should not make this decision for all apps that utilize it.
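A sketch of that division of responsibility (the names are hypothetical, not from the question):

    # Hypothetical sketch: the framework reports, the application decides.
    class BadInputError(Exception):
        """Raised by the engine when user-supplied data is invalid."""

    def load_level(data):
        # Framework code: raise, never exit; the app picks the policy.
        if not data:
            raise BadInputError("empty level data")
        return {"raw": data}       # placeholder for real parsing

    DEFAULT_LEVEL = {"raw": "default"}

    def app_load_level(data, strict=True):
        # Application code: crash loudly in development (strict=True),
        # or log and fall back to a default level in a shipped game.
        try:
            return load_level(data)
        except BadInputError:
            if strict:
                raise
            return DEFAULT_LEVEL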
I would use exceptions, if the language you are using allows them.
Since your framework will be used by other developers, you shouldn't really constrain them to any one approach; you should let the developers catch your exceptions (or errors) and decide what to do.
Generally speaking, nothing should crash on user input. Whether the app can continue with the error logged, or should stop right there, is something that is useful to be able to configure.
If it's too easy to ignore errors, people will just do so, instead of fixing them. On the other hand, sometimes an error is not something you can fix, or it's totally unrelated to what you're working on, and it's holding up your current task. So it depends a bit on who the user is.
Logging libraries often let you switch logs on and off by module and severity. It might be that you want something similar, to let users configure the "stop on error" behaviour for certain modules or only when above a certain level of severity.
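A sketch of what that configuration could look like, piggybacking on Python's logging levels (the module names and thresholds are placeholders):

    # Hypothetical sketch: a configurable "stop on error" policy,
    # per module and per severity.
    import logging

    STOP_ON_ERROR = {                    # placeholder configuration
        "renderer": logging.CRITICAL,    # halt only on critical renderer errors
        "scripting": logging.ERROR,      # halt on any scripting error
    }

    class StopOnError(Exception):
        pass

    def report(module, level, message):
        logging.getLogger(module).log(level, message)   # always log it
        threshold = STOP_ON_ERROR.get(module, logging.CRITICAL)
        if level >= threshold:
            raise StopOnError(f"{module}: {message}")   # stop when configured to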
Personally, I would avoid the crash approach and opt for (2). That said, make sure the error is detected and logged, and above all avoid any swallowing of errors (e.g. an empty catch block).
It is always helpful to have some kind of tracing/logging module, for instance later when you are doing performance tuning or general troubleshooting.
It depends on what the problem is. When I'm programming and writing error handling I use this as my mantra:
Is this exception really exceptional?
Meaning: is the bad input, or whatever condition is "not normal", recoverable? In the case of a game, a File Not Found exception on a texture could be recoverable, and you could show a default texture so you know something broke.
However, if you have textures in a compressed file and you keep getting checksum errors, that would be an exceptional exception and I would crash the game with the details.
It really boils down to: can the application keep running without issue?
The one exception to this rule (ha ha) is: if something is corrupted, you can no longer trust your validation methods, and you should crash as quickly as you can to prevent the corruption from spreading.
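A sketch of that mantra in code (Python, with invented names; the pink placeholder texture stands in for whatever the engine's default is):

    # Hypothetical sketch: recoverable vs. truly exceptional failures.
    import zlib

    PINK = b"\xff\x00\xff"   # stand-in "missing texture" pixel data

    def load_texture(path):
        try:
            with open(path, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return PINK      # recoverable: show a default so the break is visible

    def load_archive_member(data, expected_crc):
        if zlib.crc32(data) != expected_crc:
            # Exceptional: the archive itself is corrupt; stop immediately.
            raise SystemExit("texture archive checksum mismatch; aborting")
        return data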

Error checking overkill?

What error checking do you do? What error checking is actually necessary? Do we really need to check if a file has saved successfully? Shouldn't it always work if it's tested and works ok from day one?
I find myself error checking for every little thing, and most of the time it feels like overkill. Things like checking to see whether a file has been written to the file system successfully, or checking to see whether a database statement failed... shouldn't these be things that either work or don't?
How much error checking do you do? Are there elements of error checking that you leave out because you trust that it'll just work?
I'm sure I remember reading somewhere something along the lines of "don't test for things that'll never really happen"... I can't remember the source, though.
So should everything that could possibly fail be checked for failure? Or should we just trust those simpler operations? For example, if we can open a file, should we check to see if reading each line failed or not? Perhaps it depends on the context within the application or the application itself.
It'd be interesting to hear what others do.
UPDATE: As a quick example: I save an object that represents an image in a gallery. I then save the image to disc. If the saving of the file fails, I'll have no image to display even though the object thinks there is an image. I could check for failure of the image save to disc and then delete the object, or alternatively wrap the image save in a transaction (unit of work), but that can get expensive when using a DB engine that uses table locking.
Thanks,
James.
If you run out of free space and try to write a file without checking errors, your application will fail silently or with unhelpful messages. I hate it when I see this in other apps.
I'm not addressing the entire question, just this part:
So should everything that could possibly fail be checked for failure? Or should we just trust those simpler operations?
It seems to me that error checking is most important when the NEXT step matters. If failure to open a file will allow error messages to get permanently lost, then that is a problem. If the application will simply die and give the user an error, then I would consider that a different kind of problem. But silently dying, or silently hanging, is a problem that you should really do your best to code against. So whether something is a "simple operation" or not is irrelevant to me; it depends on what happens next, or what would be the result if it failed.
I generally follow these rules:
Excessively validate user input.
Validate public APIs.
Use Asserts that get compiled out of production code for everything else.
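A sketch of those three rules in Python (the functions are invented for the example; note that python -O strips asserts, much like asserts compiled out of a production build):

    # Hypothetical sketch of the three rules above.
    def parse_age(text):
        # Rule 1: excessively validate user input; never trust it.
        try:
            age = int(text)
        except ValueError:
            raise ValueError(f"age must be a number, got {text!r}")
        if not 0 <= age <= 150:
            raise ValueError(f"age out of range: {age}")
        return age

    def set_volume(level):
        # Rule 2: validate the public API surface.
        if not 0.0 <= level <= 1.0:
            raise ValueError("level must be between 0.0 and 1.0")
        _apply_volume(level)

    def _apply_volume(level):
        # Rule 3: internal invariants are asserts, stripped by python -O.
        assert 0.0 <= level <= 1.0, "caller bypassed validation"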
Regarding your example...
I save an object that represents an image in a gallery. I then save the image to disc. If the saving of the file fails I'll have [no] image to display even though the object thinks there is an image. I could check for failure of the the image saving to disc and then delete the object, or alternatively wrap the image save in a transaction (unit of work) - but that can get expensive when using a db engine that uses table locking.
In this case, I would recommend saving the image to disk first before saving the object. That way, if the image can't be saved, you don't have to try to roll back the gallery. In general, dependencies should get written to disk (or put in a database) first.
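A sketch of that ordering (Python, with a dict standing in for the real gallery store):

    # Hypothetical sketch: write the dependency (the image file) first;
    # only record the gallery entry once the file is safely on disk.
    import os

    def add_to_gallery(gallery, name, image_bytes, image_dir="images"):
        os.makedirs(image_dir, exist_ok=True)
        path = os.path.join(image_dir, name)
        with open(path, "wb") as f:
            f.write(image_bytes)   # if this raises, the gallery is untouched
        gallery[name] = path       # stand-in for the real database insert
        return path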
As for error checking... check for errors that make sense. If fopen() gives you a file ID and you don't get an error, then you don't generally need to check for fclose() on that file ID returning "invalid file ID". If, however, file opening and closing are disjoint tasks, it might be a good idea to check for that error.
This may not be the answer you are looking for, but there is only ever a 'right' answer when looked at in the full context of what you're trying to do.
If you're writing a prototype for internal use, and the odd error doesn't matter, then you're wasting time and company money by adding in the extra checking.
On the other hand, if you're writing production software for air traffic control, then the extra time to handle every conceivable error may be well spent.
I see it as a trade-off: extra time spent writing the error code versus the benefit of having handled that error if and when it occurs. Religiously handling every error is not necessarily optimal, IMO.

How would I go about taking a snapshot of a process to preserve its state for future investigation? Is this possible?

Whether this is possible I don't know, but it would mighty useful!
I have a process that fails periodically (running in Windows 2000). I then have just one chance to react to it before having to restart it and painfully wait for it to fail again. I didn't write the process, so I don't have the source to debug. The failure is seemingly random.
With a snapshot of the process I could repeatedly and quickly test reactions to the failure.
I had thought of running inside a VM but this isn't possible in this instance.
EDIT:
@Jon Cage asked:
When you say a snapshot, you mean capturing a process when it's about to fail (including memory, program state, etc.)... and then replaying its final few seconds repeatedly to see what effect it has on some other component?
This is exactly what I mean!
I think minidump is what you are looking for.
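For illustration, a sketch of driving the Win32 MiniDumpWriteDump API from Python via ctypes (Windows-only; the PID and output path are placeholders, and a real tool would want proper argtypes):

    # Hypothetical sketch: write a minidump of a running process.
    import ctypes
    import msvcrt

    PROCESS_ALL_ACCESS = 0x001F0FFF
    MiniDumpWithFullMemory = 0x00000002

    def write_minidump(pid, path):
        kernel32 = ctypes.windll.kernel32
        dbghelp = ctypes.windll.dbghelp
        hproc = kernel32.OpenProcess(PROCESS_ALL_ACCESS, False, pid)
        if not hproc:
            raise ctypes.WinError()
        try:
            with open(path, "wb") as f:
                ok = dbghelp.MiniDumpWriteDump(
                    hproc, pid, msvcrt.get_osfhandle(f.fileno()),
                    MiniDumpWithFullMemory, None, None, None)
            if not ok:
                raise ctypes.WinError()
        finally:
            kernel32.CloseHandle(hproc)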
You can also use Userdump:
The User Mode Process Dumper (userdump) dumps any running Win32 process's memory image (including system processes such as csrss.exe, winlogon.exe, services.exe, etc.) on the fly, without attaching a debugger or terminating target processes. The generated dump file can be analyzed or debugged by using the standard debugging tools.
This article shows you how to use it.
My best bet would be to start the process in a debugger (OllyDbg being my preferred tool).
The process will pause on an exception, and you can try to figure out what happened shortly before that.
This requires some understanding of assembler and does not let you create a snapshot of the process for later analysis. You would need to write your own debugger for that; it should be theoretically possible.