Celery - automatic retrying of long-running tasks running on a crashed worker - Redis

I'm using Celery with Django over Redis.
Some of my tasks are quite long, taking about 1 hour. I'm aware that this is suboptimal, and preferably I should use shorter tasks, but this is what I've got...
Sometimes the task/worker crashes. This can happen for various unimportant reasons: the worker crashed, a network problem, a spot instance being preempted, an OOM kill, or any other unexpected cause that I can't "catch" and handle.
I want to make sure the task will be retried as fast as possible.
I can use acks_late, but the problem is that this task has a very long visibility timeout (about 90 minutes), which means that if the worker crashes 2 minutes after the task started, I will now wait another 88 minutes until the task goes back onto the queue and starts executing again on another worker.
I'm wondering if there is another solution that will notice the worker has "disappeared" and put the task back in the queue.
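For reference, here is a minimal sketch of the settings involved (standard Celery setting names; the broker URL and values are only illustrative):

# celeryconfig.py - sketch of the current setup (illustrative values)
broker_url = "redis://localhost:6379/0"
# Acknowledge only after the task finishes, so an unfinished task is redelivered.
task_acks_late = True
# With the Redis broker, an unacknowledged task is only redelivered after the
# visibility timeout expires - here the ~90 minutes mentioned above, which is
# why recovery after a crash is so slow.
broker_transport_options = {"visibility_timeout": 90 * 60}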

You could give task_reject_on_worker_lost a try... It is a tricky one, but have a look...
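A minimal sketch of how that setting is typically combined with late acknowledgement (the Celery setting names are real; the broker URL is just an example):

from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")
# Only acknowledge a message once the task has actually finished...
app.conf.task_acks_late = True
# ...and requeue the message if the worker process executing it dies
# (killed, OOM, segfault) instead of waiting for the visibility timeout.
# Caveats: if the whole machine disappears, the visibility timeout still
# applies, and a task that itself crashes the worker will be retried forever.
app.conf.task_reject_on_worker_lost = True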

Related

What is a crashloop?

I'm reading Google's Site Reliability Engineering book and ran across the word crashloop, which I've never heard before and have not been able to locate a definition for.
"If a task tries to use more resources than it requested, Borg kills the task and restarts it (as a slowly crashlooping task is usually preferable to a task that hasn’t been restarted at all)."
What is a crashloop and how does it compare to an infinite loop if at all?
A crashloop is when a process crashes and is restarted by a watchdog daemon, indefinitely.
That is, the history is:
Process starts at time T.
Process crashes at time T+1.
Watchdog daemon restarts process.
Process starts at time T+2.
Process crashes at time T+3.
Watchdog daemon restarts process.
Process starts...etc.
Here, the watchdog daemon is Borg, and the process is encapsulated in a task.
In general, in distributed computing, if you want something to eventually succeed, you have to write down your intent for it to be completed, and you need a worker that loops continually to act on this intent. This is "at least once delivery" of a work item.
Here, the intent is that the task runs (written down into Borg), and Borg itself runs the loop that constantly tries to make sure the task runs. This is why, when a task crashes, it is restarted. When a task crashes repeatedly, you end up with a crashloop.
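A toy illustration of that loop (plain Python standing in for Borg; task.py is a hypothetical crashing program):

import subprocess
import time

# The written-down intent is simply "task.py should run to completion".
# The watchdog loop keeps acting on that intent, so a program that always
# crashes gets restarted forever - a crashloop.
while True:
    exit_code = subprocess.call(["python", "task.py"])  # hypothetical task
    if exit_code == 0:
        break          # intent fulfilled, stop restarting
    time.sleep(5)      # restart slowly rather than spinning at full speed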

How to prevent IronWorker from enqueuing tasks of workers that are still running?

I have a worker whose runtime varies greatly, from 10 seconds up to an hour. I want to run this worker every five minutes. This is fine as long as the job finishes within five minutes. However, if the job takes longer, Iron.io keeps enqueuing the same task over and over, and a bunch of tasks of the same type accumulate while the worker is running.
Furthermore, it is crucial that the task does not run concurrently, so max concurrency for this worker is set to one.
So my question is: Is there a way to prevent Iron.io from enqueuing tasks of workers that are still running?
Answering my own question.
According to Iron.io support it is not possible to prevent IronWorker from enqueuing tasks of workers that are still running. For cases like mine it is better to have master workers that do the scheduling, i.e. create/enqueue tasks from a script via one of the client libraries.
The best option would be to enqueue the new task from the worker's own code. For example, your task runs for 10 seconds to 1 hour and enqueues itself at the end (last line of code), as sketched below. This prevents tasks from accumulating while the worker is running.
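A rough sketch of that self-chaining pattern; enqueue_next_run is a hypothetical helper that would wrap whichever IronWorker client library or API call you use:

import time

def do_work():
    # Placeholder for the real job, which may take 10 seconds to an hour.
    time.sleep(10)

def enqueue_next_run():
    # Hypothetical: call the IronWorker client/API here to queue this same
    # worker again, optionally with a delay to approximate the 5-minute cadence.
    pass

if __name__ == "__main__":
    do_work()
    # Last line of the worker: only schedule the next run once this run has
    # finished, so tasks can never pile up behind a long-running job.
    enqueue_next_run()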

Spring Batch restart crashed jobs

Hi spring batch users,
regarding the documentation http://docs.spring.io/spring-batch/reference/htmlsingle/#d5e1320
"If the process died ("kill -9" or server failure) the job is, of course, not running, but the JobRepository has no way of knowing because no-one told it before the process died."
I try to find and restart the stale job executions with something like this:
Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions(jobName);
for (JobExecution jobExecution : jobExecutions) {
    // mark the stale execution as failed so it becomes restartable
    jobExecution.setStatus(BatchStatus.FAILED);
    jobExecution.setEndTime(new Date());
    jobRepository.update(jobExecution);
    jobOperator.restart(jobExecution.getId());
}
But this seems very inconvenient.
1) I have to do this before other (new) jobs can be started.
2) I have to handle multiple running server instances, so findRunningJobExecutions alone will not do the trick.
You can find other questions regarding this topic:
https://jira.spring.io/browse/BATCH-2433?jql=project%20%3D%20BATCH%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
Spring Batch after JVM crash
I would love to see a solution to register a "start up clean jobs listener". This would still not fix the problems arising from the multi-server environment, because Spring Batch does not know whether a JobExecution marked as STARTED is running on another instance.
Thanks for any advice
Alex
Your job cannot and should not recover "automatically" from a kill -9 scenario. A kill -9 is treated very differently than your application throwing a caught Exception. The reason for this is that you've effectively pulled the carpet out from under the application without giving it a chance to reach a synchronization point with the database to commit any necessary information to the ExecutionContext or update the job/step status(es). Therefore, the last status touchpoint with the database will remain and the job will still look STARTED.
"OK, fine," you say, "but if I start another execution, I want it to find that STARTED execution and pick up where it left off." The problem here is that there is no clean way for the application to distinguish a job that is ACTUALLY RUNNING from one that has failed but couldn't update the database. The framework here correctly errs on the side of caution and prevents you from starting a job that already appears to be running, and this is a GOOD thing.
Why? Because let's assume your job was actually still running and you restarted it by accident. As coded, the framework will start to spin up, see your running execution, and fail with the message "A job execution for this job is already running." I can't tell you how many times we've been saved by this because someone accidentally launched a job twice!
If you were to implement the listener you suggest, the 2nd execution would instead be allowed to start and you'd have 2 different JVMs repeating the same work, possibly writing to the same files/tables and causing a huge data mess that could be impossible to clean up.
Trust me, in the event the Linux terminal kills your job or your job dies because the connection to the database has been severed, you WANT human eyes on those execution states before you attempt a restart.
Finally, on the off chance you actually wanted to kill your job, you can leverage several other standard patterns for stopping jobs:
Stop via throw Exception
Stop via JobOperator.stop()

Hangfire 1.3.4 - deleted jobs stuck in queue

We are running Hangfire single-threaded using BackgroundJobServerOptions.WorkerCount = 1 (because we have a requirement for ordered processing).
Most of the time this is no problem, but occasionally a job gets stuck for entirely expected reasons (e.g. the actual code it is running goes into an infinite loop), and because we are running single-threaded this prevents other jobs in the queue from starting.
In order to try and work around this, we delete the job, but then it stays on the queue, blocking any other job from starting.
The only way I have found to resolve this is to drop and recreate the Hangfire DB, which is obviously not great.
Why does deleting a running job in Hangfire not also remove it from the queue? Is this odd delete behavior a bug that will be fixed in a later version, or is it by design because we're running single-threaded?
If this is by design then how do you cancel a processing job in a way which removes it from the queue?
Well, it seems that this behavior is by design.
If the IIS app pool worker is recycled, Hangfire will start processing the next task immediately. However, without this restart Hangfire will "hang" indefinitely.
An issue was raised about this on GitHub; however, it has not been solved yet:
https://github.com/HangfireIO/Hangfire/issues/80
With no way to cancel or manually "fail" a job, this makes Hangfire a lot less useful in a single-threaded scenario.
Update: this has been partially or fully addressed in some later version of Hangfire.

Celery takes too long to write results to RabbitMQ

Recently, I started celery beat to run a task periodically. The task takes about 2 minutes. The beat interval is 3 minutes. The result backend uses RabbitMQ.
However, the total elapsed time of a task becomes nearly 20 minutes. It looks very strange! After some investigation, I found that the extra time is consumed by sending the task result to RabbitMQ. Why?
And the Celery worker takes another 5 to 7 minutes to receive the next task. I do not know what the worker is doing during this period.
Can anyone help explain this?
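For context, a minimal sketch of the kind of setup described above (module and task names are made up; the relevant parts are the RabbitMQ result backend and the 3-minute beat interval):

from celery import Celery

app = Celery(
    "proj",
    broker="amqp://localhost//",
    backend="rpc://",  # task results are written back through RabbitMQ
)

@app.task(name="tasks.long_task")
def long_task():
    pass  # the real task takes roughly 2 minutes

app.conf.beat_schedule = {
    "run-every-3-minutes": {
        "task": "tasks.long_task",
        "schedule": 180.0,  # beat interval: 3 minutes
    },
}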