I have a script running on one of my Linux servers which handles batch file processing with a ProcessPoolExecutor and generally runs fine for days or even weeks on end without any issue. Sometimes, though, it looks like a few of my child processes just die (I get no error message or exception at all, and I can't reproduce it even by killing child processes from the shell), leaving the parent process waiting indefinitely for results that will never arrive...
This is the call (the initializer has no effect in this case; it's just there to handle the reverse scenario described in another very helpful thread on S.O., and is sketched below):
with ProcessPoolExecutor(max_workers=int(config['PERFORMANCE']['NumberOfProcesses']),
                         initializer=start_thread_to_terminate_when_parent_process_dies,
                         initargs=(os.getpid(),)
                         ) as executor:
    executor.map(process_main, file_list)
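For context, the initializer follows the pattern from that thread: each worker spawns a daemon thread that polls the parent pid and terminates the worker once the parent is gone. A minimal sketch (the body is my reconstruction of that pattern, not the actual script):

import os
import signal
import threading
import time

def start_thread_to_terminate_when_parent_process_dies(ppid):
    pid = os.getpid()

    def watch():
        while True:
            try:
                os.kill(ppid, 0)  # signal 0 only probes whether the parent still exists
            except OSError:
                os.kill(pid, signal.SIGTERM)  # parent is gone, take this worker down
            time.sleep(1)

    threading.Thread(target=watch, daemon=True).start()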
From what I've gathered, the pool should be able to recover in exactly the described scenario:
https://bugs.python.org/issue9205
Anyone got any idea? (I've thought about switching to the pebble library with its timeout functionality, or creating a separate watchdog script.)
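If anyone else lands here: pebble's ProcessPool raises ProcessExpired when a worker dies abruptly, which is exactly this failure mode. A minimal sketch of what the switch might look like (process_main and file_list as above; the 300-second per-item timeout is an arbitrary placeholder):

from concurrent.futures import TimeoutError
from pebble import ProcessPool, ProcessExpired

with ProcessPool(max_workers=4) as pool:
    future = pool.map(process_main, file_list, timeout=300)
    results = future.result()  # an iterator over the per-item results
    while True:
        try:
            print(next(results))
        except StopIteration:
            break  # every item has been processed
        except TimeoutError as error:
            print("item took longer than %d seconds" % error.args[1])
        except ProcessExpired as error:
            print("worker died: %s (exit code %s)" % (error, error.exitcode))

With a per-item timeout, no single item can block the pool indefinitely, which also covers the watchdog idea.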
I'm working on getting Redis to run on Solaris 10, and there are a few integration tests failing. The test I'm looking into works like this:
Start Redis
It forks and the child starts dumping the database to a backup file (RDB)
(There's actually a parent / child / grandchild relationship going on, where the grandchild becomes the zombie, but I only noticed that minutes before I had to head home.)
After a short time the test script sends SIGTERM to the child
The child catches the signal & shuts down gracefully
The parent calls wait3()
In spite of the wait3() call the child ends up in a zombie state.
The test fails around 90% of the time when I run it. Once it gets into a failed state, it never recovers. I tried changing the test to wait significantly longer, and although it appears to call wait3() many times after the process has exited, the child stays a zombie until the parent process(es) are killed.
Unfortunately I won't be able to work on this again until next week, so I'm researching it from home. Most of my googling has only turned up documentation or "why do processes become zombies?" type questions.
This Google Groups thread from the mid-90s may help, though they're mostly talking about older releases of Solaris / SunOS.
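For reference, the reaping contract the test relies on is easy to reduce to a toy. A sketch in Python (whose os module wraps the same POSIX calls; this is my reconstruction of the scenario, not the Redis test code):

import os
import signal
import sys
import time

pid = os.fork()
if pid == 0:
    # child: stand-in for the RDB dump; exit cleanly on SIGTERM
    signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))
    time.sleep(60)
    sys.exit(0)

time.sleep(0.5)  # toy-grade synchronization: let the child install its handler
os.kill(pid, signal.SIGTERM)  # what the test script does

# The parent must collect the child's status, or it stays a zombie.
while True:
    reaped, status, rusage = os.wait3(0)  # blocks until some child exits
    if reaped == pid:
        break

If the loop above never runs, or only ever reaps a different child, ps shows the child as <defunct> forever, which is the state the failing test ends up in.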
I was mistaken. It looks like the master process doesn't see that its child failed, so it never waits on it.
Hi Spring Batch users,
Regarding the documentation, http://docs.spring.io/spring-batch/reference/htmlsingle/#d5e1320:
"If the process died ("kill -9" or server failure) the job is, of course, not running, but the JobRepository has no way of knowing because no-one told it before the process died."
I try to find and restart the stale job executions using:
Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions(jobName);
for (JobExecution jobExecution : jobExecutions) {
    jobExecution.setStatus(BatchStatus.FAILED);
    jobExecution.setEndTime(new Date());
    jobRepository.update(jobExecution);
    jobOperator.restart(jobExecution.getId());
}
But this seems very inconvenient:
1) I have to do this before any other (new) jobs can be started.
2) I have to handle multiple running server instances, so findRunningJobExecutions will not do the trick.
You can find other questions regarding this topic:
https://jira.spring.io/browse/BATCH-2433?jql=project%20%3D%20BATCH%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
Spring Batch after JVM crash
I would love to see a way to register a "clean up jobs on start-up" listener. Even that would not fix the problems caused by the multi-server environment, because Spring Batch does not know whether a JobExecution marked STARTED is actually running on another instance.
Thanks for any advice
Alex
Your job cannot and should not recover "automatically" from a kill -9 scenario. A kill -9 is treated very differently from your application throwing an exception that can be caught and handled. The reason is that kill -9 pulls the rug out from under the application without giving it a chance to reach a synchronization point with the database, commit any necessary information to the ExecutionContext, or update the job/step status(es). Therefore, the last status recorded in the database remains, and the job will still look STARTED.
"OK, fine" you say, "but if I start another execution, I want it to find that STARTED execution, and pick up where it left off." The problem here is that there is no clean way for the application to distinguish a job that is ACTUALLY RUNNING from one that has failed but couldn't up the database. The framework here correctly errs on the side of caution and prevents you from starting a job that already appears running, and this is a GOOD thing.
Why? Because let's assume your job was actually still running and you restarted it by accident. As coded, the framework will start to spin up, see your running execution, and fail with the message "A job execution for this job is already running." I can't tell you how many times we've been saved by this because someone accidentally launched a job twice!
If you were to implement the listener you suggest, the second execution would instead be allowed to start, and you'd have two different JVMs repeating the same work, possibly writing to the same files/tables and causing a huge data mess that could be impossible to clean up.
Trust me, in the event your job is killed from the Linux terminal, or dies because its connection to the database has been severed, you WANT human eyes on those execution states before you attempt a restart.
Finally, on the off chance you actually wanted to kill your job, you can leverage several other standard patterns for stopping jobs:
Stop by throwing an Exception
Stop via JobOperator.stop()
We are running Hangfire single-threaded, using BackgroundJobServerOptions.WorkerCount = 1 (because we have a requirement for ordered processing).
Most of the time this is no problem, but occasionally a job gets stuck for entirely expected reasons (e.g. the code it is running goes into an infinite loop), and because we are running single-threaded this prevents other jobs in the queue from starting.
To try and work around this, we delete the job, but it then stays on the queue, blocking any other job from starting.
The only way I have found to resolve this is to drop and recreate the Hangfire DB, which is obviously not great.
Why does deleting a running job in Hangfire not also remove it from the queue? Is this delete behavior a bug that will be fixed in a later version, or is it by design because we're running single-threaded?
If this is by design, how do you cancel a processing job in a way that removes it from the queue?
Well, it seems that this behavior is by design.
If the IIS app pool worker is recycled, Hangfire will start processing the next task immediately. Without that restart, however, Hangfire will "hang" indefinitely.
An issue was raised on GitHub about this, but it has not been solved yet:
https://github.com/HangfireIO/Hangfire/issues/80
With no way to cancel or manually "fail" a job, this makes Hangfire a lot less useful in a single-threaded scenario.
Update: this has been partially or fully addressed in some later version of Hangfire.
I ran into a very weird problem. I have a VB.NET program which calls another program that runs in the background. We're using a special piece of software here to deliver this application over the web; what it basically does is create a new remote desktop connection, grab the screen, and open up a web server.
While the sub-program / sub-process is running, the screen no longer reacts smoothly: it becomes very slow and then freezes. We figured out that we're triggering so many screen updates at once that we simply flood our connection, which causes the crash in the browser.
Is there any simple way to determine how many screen updates were sent, and what is causing these updates? Ideally we could identify the responsible process so we can investigate further.
The whole thing runs as a BackgroundWorker, which then creates another process.
Edit:
Could it have something to do with the CPU load (which is very high)? Although the subprocess is executed in the background (it is visible in the process list), is there any chance that this is causing the UI updates?
Finally solved it. It was a Timer updating the view every microsecond because the Interval was not set correctly.
I have a VB.NET application that uses threads to asynchronously process some tasks in a "Scheduled Task" (console application).
We are limiting this application to run 10 threads at once, like so:
(pseudo-code)
- Create a generic list of 10 threads
- Spawn off the threadproc for each one
- Do a Thread.Join on each thread to wait for the longest-running one to complete.
I am finding that if the code called by the threadproc contains any Debug.WriteLine or Trace.TraceInformation statements, the thread hangs. I can see the thread in the Debug - Windows - Threads window and switch to it, but it highlights the Debug.WriteLine statement and never gets past it.
Is there something special about the Debug or Trace statements that makes them non-thread-safe?
Why would this hang things up? If I leave the debug statement in, the thread never completes. If I take the debug statement out, the thread completes in less than 5 seconds.
Yes and no.
Internally, Debug.WriteLine ends up calling into TraceInternal.WriteLine. This particular function does not explicitly stop thread execution, but it does acquire a process-global lock for the duration of the call. This lock protects the list of trace listeners and serializes the processing of WriteLine commands.
It's possible for two threads to hit this WriteLine statement simultaneously, and hence for one thread to pause for a short period of time. It is also possible for a custom trace listener to be doing a very long-lived or blocking operation, which would essentially freeze all other threads for a noticeable period of time.
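The shape of the problem is easy to reproduce in miniature. Here is a sketch in Python of the same structure (one process-wide lock guarding a list of listeners; every name is hypothetical, it's the pattern that matters, not the .NET internals):

import threading
import time

trace_lock = threading.Lock()  # stands in for TraceInternal's process-global lock
listeners = []

def write_line(message):
    with trace_lock:  # every tracing thread serializes here
        for listener in listeners:
            listener(message)

def blocking_listener(message):
    time.sleep(60)  # a custom listener stuck in a long or blocking operation

listeners.append(blocking_listener)

# Every thread that traces anything now queues up behind the first one:
for i in range(3):
    threading.Thread(target=write_line, args=("message %d" % i,)).start()

One stuck listener plus the shared lock is enough to make every Debug.WriteLine caller appear to hang, which matches the symptom described above.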
Use Visual Studio to check and see which other threads are currently sitting in this function. That may give you a clue as to what is holding up the process.
You may have a trace listener that is interfering with Debug.WriteLine.
There is a custom trace listener in this application. Once I commented it out, my locking problems were solved. Now if I could only track down the original developer to find out what they were doing with this custom listener...