Using django-celery chord, celery.chord_unlock keeps executing forever not calling the provided callback - redis

I'm using Django Celery with Redis to run a few tasks like this:
header = [
    # args must be a tuple, so (user,) rather than (user)
    tasks.invalidate_user.subtask(args=(user,)),
    tasks.invalidate_details.subtask(args=(user,)),
]
callback = tasks.rebuild.subtask()
chord(header)(callback)
So basically the same as shown in the documentation.
My problem is that when this chord is called, the celery.chord_unlock task keeps retrying forever. The tasks in the header finish successfully, but because chord_unlock never completes, the callback is never called.
Guessing that the problem is that Celery cannot detect that the header tasks have finished, I turned to the documentation to see how this can be customized. I found a section describing how the synchronization is implemented, with an example provided. What I'm missing is how to get that example function called (i.e. is there a signal for this?).
Furthermore, there's a note that this method is not used with the Redis backend:
This is used by all result backends except Redis and Memcached, which increment a counter after each task in the header, then applying the callback when the counter exceeds the number of tasks in the set.
But it also says that the Redis approach is better:
The Redis and Memcached approach is a much better solution
What approach is that? How is it implemented?
So, why is chord_unlock never done and how can I make it detect finished header tasks?
I'm using: Django 1.4, celery 2.5.3, django-celery 2.5.5, redis 2.4.12

You don't have an example of your tasks, but I had the same problem and my solution might apply.
I had ignore_result=True on the tasks that I was adding to a chord, defined like so:
@task(ignore_result=True)
Apparently ignoring the result makes it so that the chord_unlock task doesn't know they're complete. After I removed ignore_result (even if the task only returns true) the chord called the callback properly.
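To illustrate why an ignored result can keep the unlock task spinning, here is a toy model of the polling variant described in the question. It is not Celery's code: `FakeResult` and this `chord_unlock` function are hypothetical stand-ins, and the sketch assumes that a task with ignore_result=True never writes anything a poller could see.

```python
class FakeResult:
    """Hypothetical stand-in for a task result stored in a result backend."""

    def __init__(self, ignore_result=False):
        self.ignore_result = ignore_result
        self._stored = False

    def finish(self):
        # A task with ignore_result=True still runs, but stores nothing.
        if not self.ignore_result:
            self._stored = True

    def ready(self):
        return self._stored


def chord_unlock(results, callback, max_retries=5):
    """Toy polling loop: retry until every header result reports ready."""
    for attempt in range(1, max_retries + 1):
        if all(r.ready() for r in results):
            callback()
            return attempt
    return None  # gave up here; a real retrying task would just keep going


called = []

stored = [FakeResult(), FakeResult()]
for r in stored:
    r.finish()
attempts_stored = chord_unlock(stored, lambda: called.append("stored"))

ignored = [FakeResult(ignore_result=True), FakeResult()]
for r in ignored:
    r.finish()
attempts_ignored = chord_unlock(ignored, lambda: called.append("ignored"))
```

In this model the header with stored results unlocks on the first poll, while the header containing an ignored result never does, which matches the behaviour I saw.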

I had the same error. After I changed the broker to RabbitMQ, chord_unlock keeps retrying until my tasks finish (they take 2-3 minutes).
With Redis, the tasks finished but chord_unlock retried only about 8-10 times, once every second, so the callback was not executed correctly.
[2012-08-24 16:31:05,804: INFO/MainProcess] Task celery.chord_unlock[5a46e8ac-de40-484f-8dc1-7cf01693df7a] retry: Retry in 1s
[2012-08-24 16:31:06,817: INFO/MainProcess] Got task from broker: celery.chord_unlock[5a46e8ac-de40-484f-8dc1-7cf01693df7a] eta:[2012-08-24 16:31:07.815719-05:00]
... and so on, about 8-10 times ...
Changing the broker worked for me. I am now testing @Chris's solution, but my callback function never receives the results from the header subtasks, so it does not work for me.
celery==3.0.6
django==1.4
django-celery==3.0.6
redis==2.6
broker: redis-2.4.16 on Mac OS X

This could also be the cause of the problem. From the documentation:
Note:
If you are using chords with the Redis result backend and also overriding the Task.after_return() method, you need to make sure to call the super method or else the chord callback will not be applied.
def after_return(self, *args, **kwargs):
    do_something()
    super(MyTask, self).after_return(*args, **kwargs)
As I understand it, if you have overridden after_return in your task, you must either remove the override or at least call the super method from it.
See the bottom of this section: http://celery.readthedocs.org/en/latest/userguide/canvas.html#important-notes
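A toy model may make the note easier to picture. It assumes, as the quoted note implies, that the base after_return hook is what reports "one header task finished" to a counter. None of the class names below are Celery's; `CounterBackend`, `BaseTask`, `GoodTask` and `BrokenTask` are hypothetical, and an in-memory integer stands in for the Redis counter.

```python
import threading


class CounterBackend:
    """Hypothetical in-memory stand-in for the Redis counter."""

    def __init__(self, header_size, callback):
        self.header_size = header_size
        self.callback = callback
        self.count = 0
        self._lock = threading.Lock()

    def on_part_return(self):
        # Plays the role of an atomic increment followed by a comparison.
        with self._lock:
            self.count += 1
            done = self.count == self.header_size
        if done:
            self.callback()


class BaseTask:
    def __init__(self, backend):
        self.backend = backend

    def after_return(self, *args, **kwargs):
        # The base hook is what tells the backend a header task finished.
        self.backend.on_part_return()


class GoodTask(BaseTask):
    def after_return(self, *args, **kwargs):
        super().after_return(*args, **kwargs)


class BrokenTask(BaseTask):
    def after_return(self, *args, **kwargs):
        pass  # forgets super(): the counter is never incremented


def run_header(task_cls, size=2):
    """Run `size` header tasks; return (final count, whether the callback fired)."""
    fired = []
    backend = CounterBackend(size, lambda: fired.append(True))
    for _ in range(size):
        task_cls(backend).after_return()
    return backend.count, bool(fired)
```

Here run_header(GoodTask) reaches the header size and fires the callback, while run_header(BrokenTask) leaves the counter at zero, so the callback is never applied.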

Related

RabbitMQ + kombu - A long callback blocks the heartbeat leading to aborting the connection

We have been trying to use RabbitMQ to transfer data from Project A to Project B.
We created a producer that takes the data from Project A and puts it in a queue, and that was relatively easy. Then we created a k8s pod for Project B, which listens to the appropriate queue with the ConsumerMixin of kombu.
Overall, the integration was reasonable and straightforward. But when we started to process long messages, we noticed that they were coming back into the queue repeatedly.
After research, we found out that whenever the processing of the message takes more than 20 seconds, the message showed up in the queue again, even though the processing was successful.
The source of this issue lies with the RabbitMQ heartbeat. We set the heartbeat to 10 seconds, and RabbitMQ checks the connection twice before it kills it. However, because the callback takes more than 20 seconds to process, and the .ack() (acknowledge) of the message happens at the end of the callback (to ensure it was successful), the heartbeat is blocked while the message is being processed (as described here: https://github.com/celery/kombu/issues/621#issuecomment-251836611).
We tried to find a workaround with threading, processing the message on a different thread to avoid blocking the heartbeat, but it didn't work. It also felt like we were hacking around the problem rather than solving it.
So my question here is if there is a proper workaround to handle this situation, or what alternatives do we have? RabbitMQ seemed like the right choice since we use it in standalone projects with Celery, and it is also recommended on the internet.
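For reference, this is roughly the shape of the threading workaround we had in mind, reduced to the standard library. It is only a sketch: `send_heartbeat` and `ack` are hypothetical callables standing in for whatever the client library provides, and the sketch assumes the connection is not thread-safe, which is why both stay on the consumer's own thread while only the processing moves to a worker thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_message(body, process, send_heartbeat, ack, poll_interval=0.01):
    """Run `process` in a worker thread; heartbeat and ack from the calling thread."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(process, body)
        while not future.done():
            send_heartbeat()      # hypothetical hook that keeps the connection alive
            time.sleep(poll_interval)
        result = future.result()  # re-raises if processing failed, so no ack happens
    ack()                         # ack only after success, on the consumer's thread
    return result


events = []


def slow_process(body):
    time.sleep(0.05)  # simulates a long-running callback
    return body.upper()


result = handle_message(
    "payload",
    slow_process,
    send_heartbeat=lambda: events.append("beat"),
    ack=lambda: events.append("ack"),
)
```

The message is acknowledged exactly once, after processing has succeeded, and heartbeats continue in the meantime. Whether this maps cleanly onto ConsumerMixin is exactly what we are unsure about.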

How to obtain the return value from a function enqueued in rq?

From the documentation of Redis Queue (https://python-rq.org/docs), I learned that the worker's result only becomes available after a certain time, and until then it is None.
Is there any way to find out that the worker execution is complete (not with time.sleep(), please)?
In my case, the worker keeps running and the data displayed on the UI is None, because control moves to my UI rendering code as soon as the worker is assigned the task and does not wait for the execution to complete.
Please help me.
I know it's been a year, but if someone else needs it:
That depends on your needs. I'd use supervisor, so you can easily see the updated output of each running worker/process, either in the output file or in the browser, with the inet_http_server section.
If you want something done after the current job has finished, just chain jobs in the queue.
You only need to specify the job in the "depends_on" parameter (see the docs).

Should failed (and inactive) shards fail the whole action and how?

I'm developing an Elasticsearch plugin that extracts terms from fields that match a pattern. To get all the scaffolding done, I started out from this plugin: https://github.com/jprante/elasticsearch-index-termlist. So I extend TransportBroadcastOperationAction and have it broadcast the request to activePrimaryShardsGrouped. Then, in newResponse, I merge the shard results, count the failed shards, and eventually pass the counter to the BroadcastOperationResponse constructor.
I call this on the ES client like:
TermListResponse resp = TermListAction.INSTANCE.newRequestBuilder(client)
        .setIndices("foo")
        .setFields("bar", "baaz").setPattern("wombat*")
        .execute().actionGet();
My problem is that the above will not throw an exception when there are failed shards, although it indicates them in resp.getFailedShards(). Is this how it's supposed to be, or am I doing something wrong? Checking resp.getFailedShards() after every invocation doesn't look very safe, because someone can forget to do that and accidentally work with a partial term list.
Furthermore, the cause of the failed shards in my case was that the cluster had recently been restarted, so the client could already connect but some shards weren't ready yet. I think it would be nice if the action just waited for the broadcast target shards to become ready (with some timeout, of course), just like search requests apparently do. Maybe that means waiting for the "yellow" cluster health state, but where am I supposed to do that if I want to stay true to the ES approach?

How to ACK celery tasks with parallel code in reactor?

I have a celery task that, when called, simply ignites the execution of some parallel code inside a twisted reactor. Here's some sample (not runnable) code to illustrate:
def run_task_in_reactor():
    # this takes a while to run
    do_something()
    do_something_more()

@celery.task
def run_task():
    print("Started reactor")
    reactor.callFromThread(run_task_in_reactor)
(For the sake of simplicity, please assume that the reactor is already running when the task is received by the worker; I used the signal @worker_process_init.connect to start my reactor in another thread as soon as the worker comes up.)
When I call run_task.delay(), the task finishes pretty quickly (since it does not wait for run_task_in_reactor() to finish; it only schedules its execution in the reactor). And when run_task_in_reactor() finally runs, do_something() or do_something_more() can throw an exception, which will go unnoticed.
Using pika to consume from my queue, I can use an ACK inside do_something_more() to make the worker signal the correct completion of the task, for instance. However, inside Celery this does not seem to be possible (or, at least, I don't know how to accomplish the same effect).
Also, I cannot remove the reactor, since it is a requirement of some third-party code I'm using. Other ways to achieve the same result are appreciated as well.
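Use blockingCallFromThread from twisted.internet.threads instead of reactor.callFromThread, i.e. threads.blockingCallFromThread(reactor, run_task_in_reactor). It blocks the calling thread until the function has finished in the reactor thread and re-raises any exception it throws, so the task only completes once the reactor work is done.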
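The difference between scheduling and blocking can be seen without Twisted. In the sketch below a single-threaded executor stands in for the reactor thread; that substitution is an assumption made purely for illustration, and `failing_work`, `fire_and_forget` and `blocking_call` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# A single worker thread stands in for the reactor thread (illustration only).
reactor_thread = ThreadPoolExecutor(max_workers=1)


def failing_work():
    raise ValueError("do_something() failed")


def fire_and_forget():
    """Schedule the work and return at once; the error never reaches the caller."""
    reactor_thread.submit(failing_work)
    return "task finished"


def blocking_call():
    """Wait for the work to finish, so its exception is raised in the caller."""
    return reactor_thread.submit(failing_work).result()


outcome = fire_and_forget()

try:
    blocking_call()
    error = None
except ValueError as exc:
    error = str(exc)

reactor_thread.shutdown()
```

With the fire-and-forget variant the task reports success even though the work failed. With the blocking variant the exception surfaces inside the task, where Celery can record the failure.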

How to detect alarm-based blocking RabbitMQ producer?

I have a producer sending durable messages to a RabbitMQ exchange. If the RabbitMQ memory or disk exceeds the watermark threshold, RabbitMQ will block my producer. The documentation says that it stops reading from the socket, and also pauses heartbeats.
What I would like is a way to know in my producer code that I have been blocked. Currently, even with a heartbeat enabled, everything just pauses forever. I'd like to receive some sort of exception so that I know I've been blocked and I can warn the user and/or take some other action, but I can't find any way to do this. I am using both the Java and C# clients and would need this functionality in both. Any advice? Thanks.
Sorry to tell you, but with RabbitMQ (at least with 2.8.6) this isn't possible :-(
I had a similar problem, which centred around trying to establish a channel when the connection was blocked. The result was the same as what you're experiencing.
I did some investigation into the core of the RabbitMQ C# .NET library and discovered that the root cause of the problem is that it goes into an infinite blocking state.
You can see more details on the RabbitMQ mailing list here:
http://rabbitmq.1065348.n5.nabble.com/Net-Client-locks-trying-to-create-a-channel-on-a-blocked-connection-td21588.html
One suggestion (which we didn't implement) was to do the work inside of a thread and have some other component manage the timeout and kill the thread if it is exceeded. We just accepted the risk :-(
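The watchdog idea is language-neutral, so here is a minimal sketch of it in Python rather than C# or Java. `publish_with_timeout` and `PublishBlocked` are hypothetical names, and `blocked_publish` only simulates a client call that never returns. Note that the stuck thread is abandoned rather than killed in this sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class PublishBlocked(Exception):
    """Hypothetical error raised when a publish call does not return in time."""


def publish_with_timeout(publish, message, timeout):
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(publish, message)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        raise PublishBlocked("publish did not return within %.2fs" % timeout)
    finally:
        # Do not wait for a stuck worker thread; it cannot be killed from here.
        pool.shutdown(wait=False)


release = threading.Event()


def blocked_publish(message):
    release.wait()  # simulates a broker that has stopped reading from the socket
    return "sent"


try:
    publish_with_timeout(blocked_publish, "hello", timeout=0.05)
    status = "sent"
except PublishBlocked:
    status = "blocked"
finally:
    release.set()  # unblock the simulated call so the worker thread can exit
```

The caller gets an exception it can act on, for example to warn the user, instead of pausing forever.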
The RabbitMQ client uses a blocking RPC call that waits for a reply indefinitely.
If you look at the Java client API, what it does is:
AMQChannel.BlockingRpcContinuation k = new AMQChannel.SimpleBlockingRpcContinuation();
k.getReply(-1);
Passing -1 as the argument makes the call block until a reply is received.
The good thing is you could pass in your timeout in order to make it return.
The bad thing is you will have to update the client jars.
If you are OK with doing that, you could pass in a timeout wherever a blocking call like above is made.
The code would look something like:
try {
    return k.getReply(200);
} catch (TimeoutException e) {
    throw new MyCustomRuntimeorTimeoutException("RabbitTimeout ex", e);
}
And in your code you could handle this exception and perform your logic in this event.
Some related classes that might require this fix would be:
com.rabbitmq.client.impl.AMQChannel
com.rabbitmq.client.impl.ChannelN
com.rabbitmq.client.impl.AMQConnection
FYI: I have tried this and it works.