What happens if an MPI process crashes? - process

I am evaluating different multiprocessing libraries for a fault tolerant application. I basically need any process to be allowed to crash without stopping the whole application.
I can do it using the fork() system call. The limit here is that the process can be created on the same machine, only.
Can I do the same with MPI? If a process created with MPI crashes, can the parent process keep running and eventually create a new process?
Is there any alternative (possibly multiplatform and open source) library to get the same result?
As reported here, MPI 4.0 will have support for fault tolerance.

If you want collectives, you're going to have to wait for MPI-3.something (as High Performance Mark and Hristo Illev suggest)
If you can live with point-to-point, and you are a patient person willing to raise a bunch of bug reports against your MPI implementation, you can try the following:
disable the default MPI error handler
carefully check every single return code from your MPI programs
keep track in your application which ranks are up and which are down. Oh, and when they go down they can never get back. but you're unable to use collectives anyway (see my opening statement), so that's not a huge deal, right?
Here's an old paper (back when Bill still worked at Argonne. I think it's from 2003):
http://www.mcs.anl.gov/~lusk/papers/fault-tolerance.pdf . It lays out the kinds of fault tolerant things one can do in MPI. Perhaps such a "constrained MPI" might still work for your needs.

If you're willing to go for something research quality, there's two implementations of a potential fault tolerance chapter for a future version of MPI (MPI-4?). The proposal is called User Level Failure Mitigation. There's an experimental version in MPICH 3.2a2 and a branch of Open MPI that also provides the interfaces. Both are far from production quality, but you're welcome to try them out. Just know that since this isn't in the MPI Standard, the function prefixes are not MPI_*. For MPICH, they're MPIX_*, for the Open MPI branch, they're OMPI_* (though I believe they'll be changing theirs to be MPIX_* soon as well.
As Rob Latham mentioned, there will be lots of work you'll need to do within your app to handle failures, though you don't necessarily have to check all of your return codes. You can/should use MPI error handlers as a callback function to simplify things. There's information/examples in the spec available along with the Open MPI branch.

Related

Linux process activities

Is there possibility to show what's going on under specified process in Linux?
For example, i run SQL query -> select evil_function();
and notice that process under Linux uses all cpu.
So is there something with what I can see whats going on under this process?
What I want is to see what queries is running under this process.
Thanks!
strace will tell you what system calls the process is making.
To see what called routines are taking the most CPU, you need to run a profiling tool, and make sure the executable of the process you in compiled correctly (sometimes it needs to be instrumented during compilation for profiling, sometimes it just needs to be compiled with debug symbols, or not stripped of them after compilation).
You might want to look at oprofile, valgrind, gprof and for starters on free tools - there are also commercial products available.
Here are a few links:
http://www.pixelbeat.org/programming/profiling/
http://en.wikipedia.org/wiki/List_of_performance_analysis_tools
You are mixing a whole bunch of things.
If you are talking about MySQL do:
show processlist;
For info specifically about linux processes, you can strace the process to get a list of system function that it calls. Unless you are experienced with linux this will be useless to you.
If the process is paused then you can find out what function it is stopped on, but that's probably not what you want, since you say the process is running.
There are also various tools that can give you info on what parts of the disk the process is reading, and how much memory it's allocating.
And finally you can use gdb to break into the process and single step your way through it to see exactly what it's doing. This will also likely be useless to you since an SQL server does a LOT of things - far to many to understand by this method.

What causes a program to freeze

From what experience I have programming whenever a program has a problem it crashes, whether it is from an unhanded exception or a piece of code that should have been checked for errors, but was not and threw one. What would cause a program to completely freeze a system to the point of requiring a restart.
Edit: Thanks for the answers. As for the language and OS this question was inspired by me playing Fallout and the game freezing twice in an hour causing me to have to restart the xbox, so I am guessing c++.
A million different things. The most common that come to mind are:
Spawning too many threads or processes, which drowns the OS scheduler.
Gobbling too much RAM, which puts the memory manager into page-fault hell.
In a Dotnet/Java type environment its quite difficult to seize a system up, because the Runtime keeps you code at a distance from the OS.
Closer to the metal say C or C++, Assembly etc you have to play fair with the rest of the system - If you dont have it already grab a copy of Petzold and observe/experiment yourself with the amount of 'boilerplate' code to get a single Window running...
Even closer, down at the driver level all sorts of things can happen...
There are number of reasons, being internal or external that leads to deadlocked application, more general case is when something is being asked for by a program but is not given that leads to infinite waiting, the practical example to this is, a program writes some text to a file, but when it is about to open a file for writing, same file is opened by any other application, so the requesting app will wait (freeze in some cases if not coded properly) until it gets exclusive control of the file.
And a critical freeze that leads to restarting the system is when the file which is asked for is something which very important for the OS. However, you may not need to restart the system in order to get it back to normal, unless the program which was frozen is written in a language that produces native binary, i.e. C/C++ to be precise. So if application is written in a language which works with the concept of managed code, like any .NET language, it will not need a system restart to get things back to normal.
page faults, trying to access inaccessible data or memory(acces violation), incompatible data types etc.

Daemon with Clojure/JVM

I'd like to have a small (not doing too damn much) daemon running on a little server, watching a directory for new files being added to it (and any directories in the main one), and calling another Clojure program to deal with that new file.
Ideally, each file would be added to a queue (a list represented by a ref in Clojure?) and the main process would take care of those files in the queue on a FIFO basis.
My question is: is having a JVM up running this little program all the time too much a resource hog? And do you have any suggestions as to how go about doing this?
Thank you very much!
EDIT: Another question I should ask: should I run this as its own instance (using less memory) and have it launch a new JVM when a file is seen, or have it on the same JVM the Clojure code that will process the file?
As long as it is running fine now and it has no memory leaks it should be fine.
From the daemon terminology I gather it is on a unix clone, and in this case best is to start it from an init script, or from the rc.local script. Unfortunately details differ from OS to OS to be more specific.
Limit the memry using -Xmx=64m or something to make sure it fails before taking down the rest of the services. Play a bit with the number to find the lowest reliable size.
Also, since clojures claim to fame is its ability to deal with concurrency it make a lot of sense to only run one JVM with all functionality running on it in multiple threads. The overhead of spawning new processes is already very big and if it is a JVM which needs to JIT and warm up its memory management, doubly so. On a resource constrained machine could pose a problem. and on a resource rich machine this is a waste.
I always found that the JVM is not made to quickly run something script like and exit again. It is really not made for that use case in my opinion
.

What causes difference in VB6 app testing result when running from Dev machine vs installed?

I'm new to VB6 but i'm currently in charge of maintaining a horror of editor like tool with plenty of forms, classes, modules and 3rd party tools all chunk together like the skin faces on that guy in the texas chainsaw massacre...
What i don't understand is why i get different results when i run the app in debugging mode, vs when i compiled it and run it on my devevelopment pc vs when i installed it on a different pc.
Yes i know i'm dumb, so please direct me to where i can find out more about this. I'm hoping to find out something like different linking, registry related etc connection that i'm simply not getting right now, i.e. something like wax on, wax off :P
The main pain in the neck is when i'm trying to debug some errors from my QA and i need to find a spare pc to test this on plus i can't directly debug because i don't know where the code is if i do it that manner.
Thanks.
i run the app in debugging mode vs when i compiled it and run it on my
devevelopment pc
When you compile you have the option of compiling to native code or pcode. The debugger runs using pcode only. Under rare circumstances when you compile to native code there will be a change in behavior. This particular is really rare. I used VB6 since it's release and I may get it once or twice a year. My application is a complex CAD/CAM creating shapes and running metal cutting machine and has two dozen DLLs. Not a typical situation. At home with my hobby software I never ran into this problem.
There are another class of errors that result from event sequencing problems. While VB6 isn't truly multi-tasking it has the ability to jump out of the current code block to process a event. If it re-enters the same block for the new event interesting things (to say the least) can result. I think this is the likely source of your problems as you software is an editor which is a highly interactive type of software.
In general the problem is fixed by reordering the effected areas. You find the effected area by inserting MsgBox or write to a text file to log where you are. I recommend logging to a text file as MsgBox tend to alter behavior that are timing or multi-tasking related.
Remember if a event fire while VB6 in the middle of a code block and there a DoEvents floating around then it will leave the code block process the event and return to the original code block. If it re-enters the same code block and you didn't mean for this to happen then you will have problems. And you will have different problems on different computers as the timing will be different for each.
The easiest way to deal with this type of issues is create some flag variables. In multi-tasking parlance they are known as semaphores or mutexes. WHen you enter a critical section of code, you set it true. When you leave the routine you set it to false. If it is already true when you enter that section of code you don't execute it.
when i installed it on a different pc.
These are usually the result of the wrong DLL installed. Most likely you have an older version while the target has a newer version. I would download the free Virtual PC and create a clean Window XP install to double check this.
If your problem is event timing this too can be different on different computers. This is found by logging (not MsgBox) suspect regions.
If you can display a screen shot or the text of your specific errors then I can help better.
The first thing to check would be the versions of all the dlls that your app depends on - including the service pack version of the VB6 dll.
Have you any more specific details about what's behaving differently?

How would I go about taking a snapshot of a process to preserve its state for future investigation? Is this possible?

Whether this is possible I don't know, but it would mighty useful!
I have a process that fails periodically (running in Windows 2000). I then have just one chance to react to it before having to restart it and painfully wait for it to fail again. I didn't write the process so don't have the source to debug. The failure is seemingly random.
With a snapshot of the process I could repeatedly and quickly test reactions to the failure.
I had thought of running inside a VM but this isn't possible in this instance.
EDIT:
#Jon Cage asked:
When you say a snapshot, you mean capturing a process when it's about to fail (including memory, program state etc. etc.) ...and then replaying it's final few seconds repeatedly to see what effect it has on some other component?
This is exactly what I mean!
I think minidump is what you are looking for.
You can also used Userdump:
The User Mode Process Dumper
(userdump) dumps any running Win32
processes memory image (including
system processes such as csrss.exe,
winlogon.exe, services.exe, etc) on
the fly, without attaching a debugger,
or terminating target processes.
Generated dump file can be analyzed or
debugged by using the standard
debugging tools.
This article shows you how to use it.
My best bet is to start the process in a debugger (OllyDbg being my preferred tool).
The process will pause on an exception, and you can try to figure out what happened shortly before that.
This needs some understanding of assembler and does not allow to create a snapshot of the process for later analysis. You would need to write your own debugger for that - it should be theoretically possible.