RabbitMQ + kombu - A long callback blocks the heartbeat leading to aborting the connection - rabbitmq

We have been trying to use RabbitMQ to transfer data from Project A to Project B.
We created a producer who takes the data from Project A and puts it in a queue, and that was relatively easy. Then, create a k8s pod for Project B, which listens to the appropriate queue with the ConsumerMixin of kombu.
Overall, the integration was reasonable and straightforward. But when we started to process long messages, we noticed that they were coming back into the queue repeatedly.
After research, we found out that whenever the processing of the message takes more than 20 seconds, the message showed up in the queue again, even though the processing was successful.
The source of this issue lies with the heartbeat of RabbitMQ. We set the heartbeat for 10 seconds, and the RabbitMQ checks the connection twice before it kills it. However, because the process of the callback takes more than 20 seconds, and the .ack() (acknowledge) of the message happens at the end of the callback (to ensure it was successful), the heartbeat is being blocked by the process of this message (as described here: https://github.com/celery/kombu/issues/621#issuecomment-251836611).
We have been trying to find a workaround with Threading, to process the message on a different thread and avoid the block of the heartbeat, but it didn't work. Also, it feels like we were trying to hack things and not solve the problem.
So my question here is if there is a proper workaround to handle this situation, or what alternatives do we have? RabbitMQ seemed like the right choice since we use it in standalone projects with Celery, and it is also recommended on the internet.

Related

RabbitMQ how to only have one message at a time and don't requeue on failure

Our system has a bunch of consumers that use rabbit to consume messages for long running tasks. Currently we ack at the end of processing, so that if the consumer crashes, the message gets requeued. What we want is that a consumer only works on one message at a time and does not prefetch so that another consumer can work on the next message, and if a crash occurs we do not requeue, but we'll have our own monitor that will decide whether we need to re-run on a larger EC2 instance or whatever. It looks like we can get CLOSE to this by acking at start of processing with a prefetch of 1, but that is still 1 message in the queue that could have been handled by another consumer. Apparently setting prefetch to 0 makes no sense
according to rabbit devs (I don't understand why), so another option would be to still ack only on completion so that a prefetch doesn't occur, but somehow DON'T requeue on crash.
If we are swimming upstream so to speak then I know we'll have to come up with another plan, but I don't understand why the desire for a consumer to only work on one thing at a time (and not prefetch the next item of work) and to not requeue on crash is so odd
Consider using one of the RabbitTemplate receive() or receiveAndConvert methods instead; that's a better model for this type of workload - fetching records as needed instead of them being pushed into your app.

Setting a long timeout for RabbitMQ ack message

I was wondering if this is possible. I want to pull a task from a queue and have some work that could potentially take anywhere from 3 seconds or longer (possibly) minutes before an ack is sent back to RabbitMQ notifying that the work has been completed. The work is done by a user, hence this is why the time it takes to process the job varies.
I don't want to ack the message immediately after I pop off the queue because I want the message to be requeued if no ack is received. Can anyone give me any insights into how to solve my problem?
Having a long timeout should be fine, and certainly as you say you want redelivery if something goes wrong, so you want to only ack after you finish.
The best way to achieve that, IMO, would be to have multiple consumers on the queue (i.e. multiple threads/processes consuming from the same queue). That should be fine as long as there's no particular ordering constraint on your queue contents (i.e. the way there might be if the queue were to contain contents representing Postgres data that involves FK constraints).
This tutorial on the RabbitMQ website provides more info (Python linked, but there should be similar tutorials for other languages): https://www.rabbitmq.com/tutorials/tutorial-two-python.html
Edit in response to comment from OP:
What's your heartbeat set to? If your worker doesn't acknowledge the heartbeat within the set period of time, the server will consider the connection to be dead.
Not sure which language you're using, but for Java you would use the setRequestedHeartbeat method to specify the heartbeat.
The way you implement your workers, it's vital that the heartbeat can still be sent back to the RabbitMQ server. If something blocks the client from sending the heartbeat, the server will kill the connection after the time interval expires.

MSMQ + WCF - Retry with Growing Delay

I am using MSMQ 4 with WCF. We have a Microsoft Dynamics plugin putting a message on an queue. A service picks up the message and makes an HTTP request to another web server. The web server responds by putting another message on a different queue. A second service picks up the messages and sends the response back to Dynamics...
We have our retry queue set up to retry 3 times and then wait for 5 minutes before retrying again. The Dynamics system some times takes so long (due to other plugins) that we can round-trip before the database transaction commits. The user's aren't seeing the update come through for another 5 minutes.
I am curious if there is a way to configure the retry mechanism to retry incrementally. So, the first time it fails, it only waits a few seconds. If it fails a second time, it waits twice that. And the time between retries just keeps growing.
The problem with just reducing the time between retries is that a bad message could easily fill up a log file.
It turns out there is no built-in way of doing this. One slightly involved option is to create multiple queues, each with its own retry/poison sub-queues, each with a growing retry delay. You can reuse the same handler for each queue - the only thing that changes is the configuration. You also need a handler that can read the poison sub-queues (service) and move the message to the next queue in the chain (client).
So, you set receiveErrorHandling to Move. The maxRetryCycles and receiveRetryCount are just 1. Each queue will use a growing retryCycleDelay. Each queue you create will have a poison sub-queue created for it automatically. You simply read from each poison sub-queue and use a client to move it to the next queue.
I am sure someone could write some code that would automatically create N queues with a growing retryCycleDelay and hook it up all programmatically. Since it is the same handler/client for every queue, it wouldn't be a big deal.

RabbitMQ management

In the console pane rabbitmq one day I had accumulated 8000 posts, but I am embarrassed that their status is idle at the counter ready and total equal to 1. What status should be completed at the job, idle? In what format is registered x-pires? It seems to me that I had something wrong =(
While it's difficult to fully understand what you are asking, it seems that you simply don't have anything pulling messages off of the queue in question.
In general, RabbitMQ will hold on to a message in a queue until a listener pulls it off and successfully ACKs, indicating that the message was successfully processed. You can configure queues to behave differently by setting a Time-To-Live (TTL) on messages or having different queue durabilities (eg. destroyed when there are no more listeners), but the default is to play it safe.

NServiceBus Retry Delay

What is the optimal way to configure/code NServiceBus to delay retrying messages?
In its default configuration retry happens almost immediately up to the number of attempts defined in the configuration file. I'd ideally like to retry again after an hour, etc.
Also, how does HandleCurrentMessageLater() work? What does the Later aspect refer to?
The NSB retries is there to remedy temporary problems like deadlocks etc. Longer retries is better handled by creating another process that monitors the error queue and puts them back into to the source queue at the interval you like. Take a look at the ReturnToSourceQueue.exe that comes with NSB for reference.
Edit: NServiceBus now supports this , we call it Second Level Retries, see http://docs.particular.net/ for more details
Here is a blog post on why NServiceBus doesn't include a retry delay that I wrote after asking Udi this very same question in his distributed systems architecture course:
NServiceBus Retries: Why no back-off delay?
And here is a discussion thread covering some of the points involved in building an error queue monitor/retry endpoint:
http://tech.groups.yahoo.com/group/nservicebus/message/10964
As far as HandleCurrentMessageLater(), all that does is puts the current message back at the end of the queue. If there are no other messages waiting, it's going to be processed again immediately.
As of NServiceBus 3.2.1, they provide an out of the box solution to handle back off delays in the event of consecutive message failures. The previously existing retry mechanism still retries failures without a delay to handle cases like Database deadlocks, quickly self healing network issues, etc.
Once a message has been retried the configured number of times, the message is moved to a "Second Level Retry" queue. This queue, as configured below, will retry after a 10, 20, and 30 second delay, then the message will be moved to the configured error queue. You're free to change these values to something that better suites your environment.
You can also check out this link:
http://docs.particular.net/nservicebus/second-level-retries