Hangfire 1.3.4 - deleted jobs stuck in queue - hangfire

We are running hangfire single threaded using BackgroundJobServerOptions.WorkerCount = 1 (because we have a requiement for ordered processing).
Most of the time this is no problem, but occasionally a job gets stuck for entirely expected reasons (eg, the actual code it is running goes into an infintite loop), but because we are running single threaded this prevents other jobs in the queue from starting.
In order to try and work around this, we delete the job, but then it stays on the queue, blocking any other job from starting:
The only way I have found to resolve this is to drop and recreate the hangfire DB which is obviously not great.
Why does deleting a running job in hangfire not also remove it from the queue? Is this weird delete behavior a bug which to be fixed in a later version, or is this behavior by design because we're running single threaded?
If this is by design then how do you cancel a processing job in a way which removes it from the queue?

Well it seems that this behavior is by design.
If the IIS app pool worker is recycled, Hangfire will start processing the next task immediately. However, without this restart Hangfire will "hang" indefinitely.
An issue was raised on github about this, however it has not been solved yet:
https://github.com/HangfireIO/Hangfire/issues/80
With no way to cancel or manually "fail" a job, this makes hangfire a lot less useful in a single threaded scenario.
Update: this has been partially or fully addressed in some later version of Hangfire.

Related

ProcessPoolExecutor stuck indefinitely when child process dies

I have a script running on one of my linux servers which handles batch file processing with a ProcessPoolExecutor and generally runs fine days or even weeks on end without any issue. Sometimes though it looks like a few of my child processes just die (I have no error message or exception at all and can't reproduce it even with killing cp's from the shell) and lead to the parent process just waiting for the return indefinitely...
Thats the call (the initializer doesn't have any effect in this case, it's just to handle the reverse scenario described in another very helpful thread on s.o.)
with ProcessPoolExecutor(max_workers=int(config['PERFORMANCE']['NumberOfProcesses']),
initializer=start_thread_to_terminate_when_parent_process_dies,
initargs=(os.getpid(),)
) as executor:
executor.map(process_main, file_list)
From what I've gathere the Pool should be able to recover in exactly the described scenario:
https://bugs.python.org/issue9205
Anyone got any idea? (thought about switching to the pebble library with it's timeout functionality or creating a separate watchdog script)

Different RecurringJobs executing at the same time

I'm trying to execute a process to update my database, but the problem is that I set different RecurringJobs for it at different hours.
Today when I checked hangfire status, since yesterday that I instanced hangfire, I found the job should execute yesterday and the one task for today, both executed 30 minutes ago at the same time, and this has created duplicates in the database.
Can you help me with this?
If your problem is one of concurrency, you can solve it by running hangfire single threaded. Simply configure the number of hangfire worker threads on startup:
var server = new BackgroundJobServer(new BackgroundJobServerOptions
{
WorkerCount = 1
});
This will force hangfire to process queued jobs sequentially.
Alternatively, if you have the Pro version of hangfire you can control order using batch chaining.
I don't know if a worker can be considered as a thread.
Within a hangfire worker, single threaded code will be run by exactly one thread
This doesn't look like a concurrency issue as has been suggested. It's not completely clear what you are trying to do but I'm assuming you want the job to run at 7, 12:45, and 17:30 and had issues because both the 7am and 17:30 job ran at the same time (7am).
Based on the created time it looks like you created these around 14:30. That means the 17:30 job should have ran but didn't until the next morning around 7am. My best guess is this was hosted in IIS and the site app pool was recycled.
This would cause any recurring jobs that were supposed to run to be delayed until the app pool / site was started again (which I assume was around 7am).
Check out these documents on how to ensure your site is always running: http://docs.hangfire.io/en/latest/deployment-to-production/making-aspnet-app-always-running.html
If it's not an IIS issue something must have caused the BackgroundJobServer to stop monitoring the database for jobs until ~7:00am (server shutdown, error, etc).

Spring Batch restart crashed jobs

Hi spring batch users,
regarding the documentation http://docs.spring.io/spring-batch/reference/htmlsingle/#d5e1320
"If the process died ("kill -9" or server failure) the job is, of course, not running, but the JobRepository has no way of knowing because no-one told it before the process died."
I try to find and restart the stale job executions by using
Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions(jobName);
...
jobExecution.setStatus(FAILED);
jobExecution.setEndTime(new Date());
jobRepository.update(jobExecution);
jobOperator.restart(jobExecution.getId());
But this seems to be very inconvenient.
1) I have to do this before other (new) jobs could be started.
2) I have to handle multiple instances of running servers so findRunningJobExecutions will not do the trick.
You can find other questions regarding this topic:
https://jira.spring.io/browse/BATCH-2433?jql=project%20%3D%20BATCH%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
Spring Batch after JVM crash
I would love to see a solution to register a "start up clean jobs listener". This will still not fix the problems originated by the multi server environment because spring batch does not know if the JobExecution marked by STARTED is not running on an other instance.
Thanks for any advice
Alex
Your job cannot and should not recover "automatically" from a kill -9 scenario. A kill -9 is treated very differently than you application throwing a caught Exception. The reason for this is that you've effectively pulled the carpet out from under the application without giving it a chance to reach a synchronization point with the database to commit any necessary information to the ExecutionContext or update the job/step status(es). Therefore, the last status touchpoint with the database will remain and the job will still look STARTED.
"OK, fine" you say, "but if I start another execution, I want it to find that STARTED execution, and pick up where it left off." The problem here is that there is no clean way for the application to distinguish a job that is ACTUALLY RUNNING from one that has failed but couldn't up the database. The framework here correctly errs on the side of caution and prevents you from starting a job that already appears running, and this is a GOOD thing.
Why? Because let's assume your job was actually still running and you restarted by accident. As coded, the framework will start to spin up, see your running execution and fail with the following message A job execution for this job is already running. I can't tell you how many times we've been saved by this because someone accidentally launched a job twice!
If you were to implement the listener you suggest, the 2nd execution would instead be allowed to start and you'd have 2 different JVMs repeating the same work, possibly writing to the same files/tables and causing a huge data mess that could be impossible to clean up.
Trust me, in the event the Linux terminal kills your job or your job dies because the connection to the database has been severed, you WANT human eyes on those execution states before you attempt a restart.
Finally, on the off chance you actually wanted to kill you job, you can leverage several other standard patterns for stopping jobs:
Stop via throw Exception
Stop via JobOperator.stop()

NServiceBus - Long running handler prevents queue from processing any other messages

I am running NSB 5 and I am using NHibernate Persistence and have MaximumConcurrencyLevel set to 10.
I have a handler that calls a stored proc that executes an SSIS package. This package takes a non trivial amount of time to run. I started to notice that whenever this particular message type is handled all other message handling stops. I noticed via SQL Profiler that the constant querying of the queue table that NSB does in the background stops and that any extra messages put into the queue are not handled even though NSB is only handling one message.
Is there any guidelines or known issues for dealing with handlers that block the queue because database commands take a long time to complete?
Is sounds like 10 threads are busy, so the endpoint is blocked, can you test this?
I would recommend hosting this message handler in its own process
Make sense?

weblogic 10 TimerManager avoiding propagation of security context to the scheduled tasks

We are using weblogic 10 and I am using the commonj's TimerManager which is part of weblogic to schedule a task, everything is fine but I have one problem. The securitycontext of the thread which scheduled the TimerListener task is somehow stored in the TimerListener task and is being used for the work done in the TimeListener task and this is causing the problem for me. Can anyone of you pls point me on how to avoid propagation of security context to the scheduled tasks from the thread which scheduled those tasks?
This is way late but anyway, one way to avoid propagating context is to use unmanaged threads i.e. spawn threads without commonj. Of this throws the baby out with the bathwater.