Celery Queue Hanging - rabbitmq

I've got one specific job that seems to hang my celery workers every so often. I'm using rabbitmq as a broker. I've tried a couple things to fix this, to no avail:
Autoscaling the workers to allow the hung ones plenty of time to finish execution
Setting a global timeout
So I've come up a little short on what's causing this problem, and how I can fix it. Can anyone give me any pointers? The task in question is simply inserting a record into the database (MongoDB in this case.)
Update: I've added CELERYD_FORCE_EXECV. We'll see if that fixes it.
Update 2: nope!

A specific job making the child processes hang is often a symptom of IO that never completes, e.g. a web request or socket read without a timeout.
Most libraries support setting a timeout, but if yours does not you can always use socket.setdefaulttimeout:
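import socket
import requests
from celery import shared_task  # stands in for the older @task decorator
@shared_task
def http_get(url, timeout=1.0, retry_after=3.0, max_retries=None):
    # Apply a process-wide default so sockets opened during the request time out.
    prev_timeout = socket.getdefaulttimeout()
    socket.setdefaulttimeout(timeout)
    try:
        return requests.get(url)
    except socket.timeout as exc:
        # Re-queue the task instead of letting the worker block forever.
        raise http_get.retry(exc=exc, countdown=retry_after, max_retries=max_retries)
    finally:
        # Always restore the previous default timeout.
        socket.setdefaulttimeout(prev_timeout)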
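The save/set/restore pattern in the task above can be factored into a small context manager. This is only a sketch using the standard library; the name default_socket_timeout is made up for illustration and is not part of Celery or requests. Note that the default timeout is process-wide and only applies to sockets created while it is in effect.

```python
import socket
from contextlib import contextmanager


@contextmanager
def default_socket_timeout(seconds):
    """Temporarily set the process-wide default socket timeout.

    Hypothetical helper: the name is an assumption, not a library API.
    """
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Restore the old default even if the body raised.
        socket.setdefaulttimeout(previous)
```

Inside a task, the blocking call would then go in a "with default_socket_timeout(1.0):" block, so the previous default comes back whether the call succeeds or raises.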

You are most likely hitting an infinite loop bug in Celery / Kombu (see https://github.com/celery/celery/issues/3712) that only got fixed very recently. It has not gotten into a release yet. See pull request https://github.com/celery/kombu/pull/760 for details. If you cannot use a repo build for your installation, a workaround is to either switch to Redis or set CELERY_WORKER_PREFETCH_MULTIPLIER=0 and -P solo for now.

Related

ProcessPoolExecutor stuck indefinitely when child process dies

I have a script running on one of my linux servers which handles batch file processing with a ProcessPoolExecutor and generally runs fine days or even weeks on end without any issue. Sometimes though it looks like a few of my child processes just die (I have no error message or exception at all and can't reproduce it even with killing cp's from the shell) and lead to the parent process just waiting for the return indefinitely...
That's the call (the initializer doesn't have any effect in this case; it's just there to handle the reverse scenario described in another very helpful thread on S.O.):
with ProcessPoolExecutor(max_workers=int(config['PERFORMANCE']['NumberOfProcesses']),
                         initializer=start_thread_to_terminate_when_parent_process_dies,
                         initargs=(os.getpid(),)
                         ) as executor:
    executor.map(process_main, file_list)
From what I've gathered, the pool should be able to recover in exactly the described scenario:
https://bugs.python.org/issue9205
Anyone got any idea? (I thought about switching to the pebble library with its timeout functionality, or creating a separate watchdog script.)
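One way to keep the parent from waiting forever, sketched here with only the standard library, is to submit each file separately and bound the wait on every future. The helper name run_with_deadline and the per_item_timeout parameter are assumptions for illustration. A timeout here only stops the parent from waiting; it does not kill the worker that is still busy.

```python
from concurrent.futures import TimeoutError as FutureTimeout
from concurrent.futures.process import BrokenProcessPool


def run_with_deadline(executor, fn, items, per_item_timeout):
    """Submit fn(item) for each item and wait at most per_item_timeout for each.

    Hypothetical helper. Returns (results, failures), both keyed by item.
    """
    futures = [(item, executor.submit(fn, item)) for item in items]
    results, failures = {}, {}
    for item, future in futures:
        try:
            results[item] = future.result(timeout=per_item_timeout)
        except FutureTimeout:
            # The parent stops waiting; the worker itself may still be running.
            failures[item] = "timed out"
        except BrokenProcessPool:
            # Raised when a child process died abruptly and the pool is unusable.
            failures[item] = "worker process died"
        except Exception as exc:
            # Any exception raised inside fn is re-raised by result().
            failures[item] = repr(exc)
    return results, failures
```

With the call from the question this would look roughly like run_with_deadline(executor, process_main, file_list, 600), where 600 seconds is an arbitrary example value. Items that end up in failures can then be logged or retried in a fresh pool.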

Celery: How to shutdown/timeout a celery worker child/process

I am running a celery worker with --concurrency=10, so there are 10 processes running in parallel in my environment.
When a task is executed and returns, I would like my server to tell the (worker) process to kill itself. Then, because of the concurrency setting, it should create a new one.
How can I do this? I am thinking about SIGNALS -> task_postrun
Edit:
After setting up the application correctly (with --max-tasks-per-child, which you can find below), I encountered an issue where the child processes exited with code 143.
This is because the process is killed and the server does not handle it.
The solution is to revoke the task right after it runs:
from celery.signals import task_postrun
from celery.task.control import revoke
@task_postrun.connect
def setup_task_postrun(task_id, task, *args, **kwargs):
    # Runs after every task: revoke it and terminate the process that ran it.
    revoke(task_id=task_id, terminate=True)
I am not sure this is good practice, but it works.
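For context on the exit code mentioned in the edit: by shell convention a process killed by signal N is reported with exit code 128 + N, so 143 corresponds to signal 15, SIGTERM. A minimal standard-library sketch of that arithmetic (the function name describe_exit_code is made up for illustration):

```python
import signal


def describe_exit_code(code):
    """Return the signal name for a shell-style 128+N exit code, else None."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            # 128+N where N is not a known signal number on this platform.
            return None
    return None
```

Calling describe_exit_code(143) returns "SIGTERM", which matches the child process being terminated rather than crashing on its own.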
That is a relatively odd requirement, but it is quite easy to do with Celery. All you have to do is put the following into your Celery configuration:
worker_max_tasks_per_child=1

Celery with RabbitMQ creating too many queues

When running Django/Celery/RabbitMQ on a production server, some tasks are sent and consumed correctly. However, RabbitMQ starts using up all the CPU after processing is done. I believe this is related to the following report.
RabbitMQ on EC2 Consuming Tons of CPU
In that thread, it is suggested to set these config values:
CELERY_IGNORE_RESULT
CELERY_AMQP_TASK_RESULT_EXPIRES
I forked and customized the celery-haystack package to set both those values when calling apply_async(); however, it seems to have had no effect.
I think Celery is creating a large number (one per task) of uid-named queues automatically to store results. But I don't seem to be able to stop it.
Any ideas?
I just spent a day digging into this problem myself. I think the two options you mentioned can be explained like this:
CELERY_IGNORE_RESULT: if True then the results of tasks will be ignored, hence they won't return anything where you call them with delay or apply_async.
CELERY_AMQP_TASK_RESULT_EXPIRES: the expiration time for a result stored in the result backend. You can set this option to a reasonable value so RabbitMQ can delete expired results.
The many queues generated are for storing results only. So if you don't want to store any results, you can remove the CELERY_RESULT_BACKEND option from your config file.
Have a nice day!

Spring Batch restart crashed jobs

Hi spring batch users,
regarding the documentation http://docs.spring.io/spring-batch/reference/htmlsingle/#d5e1320
"If the process died ("kill -9" or server failure) the job is, of course, not running, but the JobRepository has no way of knowing because no-one told it before the process died."
I try to find and restart the stale job executions by using
Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions(jobName);
...
jobExecution.setStatus(FAILED);
jobExecution.setEndTime(new Date());
jobRepository.update(jobExecution);
jobOperator.restart(jobExecution.getId());
But this seems to be very inconvenient.
1) I have to do this before other (new) jobs could be started.
2) I have to handle multiple instances of running servers so findRunningJobExecutions will not do the trick.
You can find other questions regarding this topic:
https://jira.spring.io/browse/BATCH-2433?jql=project%20%3D%20BATCH%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
Spring Batch after JVM crash
I would love to see a solution to register a "start up clean jobs listener". This would still not fix the problems caused by the multi-server environment, because Spring Batch does not know whether a JobExecution marked as STARTED is still running on another instance.
Thanks for any advice
Alex
Your job cannot and should not recover "automatically" from a kill -9 scenario. A kill -9 is treated very differently than your application throwing a caught Exception. The reason for this is that you've effectively pulled the carpet out from under the application without giving it a chance to reach a synchronization point with the database to commit any necessary information to the ExecutionContext or update the job/step status(es). Therefore, the last status touchpoint with the database will remain and the job will still look STARTED.
"OK, fine" you say, "but if I start another execution, I want it to find that STARTED execution, and pick up where it left off." The problem here is that there is no clean way for the application to distinguish a job that is ACTUALLY RUNNING from one that has failed but couldn't update the database. The framework here correctly errs on the side of caution and prevents you from starting a job that already appears to be running, and this is a GOOD thing.
Why? Because let's assume your job was actually still running and you restarted it by accident. As coded, the framework will start to spin up, see your running execution and fail with the following message: "A job execution for this job is already running". I can't tell you how many times we've been saved by this because someone accidentally launched a job twice!
If you were to implement the listener you suggest, the 2nd execution would instead be allowed to start and you'd have 2 different JVMs repeating the same work, possibly writing to the same files/tables and causing a huge data mess that could be impossible to clean up.
Trust me, in the event the Linux terminal kills your job or your job dies because the connection to the database has been severed, you WANT human eyes on those execution states before you attempt a restart.
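The guard described above can be illustrated with a small sketch. It is written in Python purely for illustration and is not the Spring Batch API; the class and method names (InMemoryJobRepository, launch, set_status) are hypothetical. It shows why a STARTED entry left behind by a killed process blocks new launches until someone marks it FAILED.

```python
from enum import Enum


class Status(Enum):
    STARTED = "STARTED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


class JobAlreadyRunningError(Exception):
    pass


class InMemoryJobRepository:
    """Hypothetical stand-in for a job repository; not the Spring Batch API."""

    def __init__(self):
        self._executions = {}  # job name -> list of Status, one per execution

    def launch(self, job_name):
        history = self._executions.setdefault(job_name, [])
        # The repository cannot tell a live job from a killed one:
        # both simply look STARTED, so it refuses to start another.
        if Status.STARTED in history:
            raise JobAlreadyRunningError(
                "A job execution for this job is already running: " + job_name
            )
        history.append(Status.STARTED)
        return len(history) - 1  # execution id

    def set_status(self, job_name, execution_id, status):
        self._executions[job_name][execution_id] = status
```

After a kill -9 nothing ever calls set_status, so the entry stays STARTED. Only once an operator has checked the state and set it to FAILED, as the snippet in the question does, will launch accept a new execution.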
Finally, on the off chance you actually wanted to kill your job, you can leverage several other standard patterns for stopping jobs:
Stop via throw Exception
Stop via JobOperator.stop()

Hangfire 1.3.4 - deleted jobs stuck in queue

We are running Hangfire single threaded using BackgroundJobServerOptions.WorkerCount = 1 (because we have a requirement for ordered processing).
Most of the time this is no problem, but occasionally a job gets stuck for entirely expected reasons (e.g., the actual code it is running goes into an infinite loop), and because we are running single threaded this prevents other jobs in the queue from starting.
In order to try and work around this, we delete the job, but then it stays on the queue, blocking any other job from starting.
The only way I have found to resolve this is to drop and recreate the hangfire DB which is obviously not great.
Why does deleting a running job in Hangfire not also remove it from the queue? Is this weird delete behavior a bug to be fixed in a later version, or is this behavior by design because we're running single threaded?
If this is by design then how do you cancel a processing job in a way which removes it from the queue?
Well it seems that this behavior is by design.
If the IIS app pool worker is recycled, Hangfire will start processing the next task immediately. However, without this restart Hangfire will "hang" indefinitely.
An issue was raised on github about this, however it has not been solved yet:
https://github.com/HangfireIO/Hangfire/issues/80
With no way to cancel or manually "fail" a job, this makes hangfire a lot less useful in a single threaded scenario.
Update: this has been partially or fully addressed in some later version of Hangfire.