This is probably not the best place to ask (maybe Server Fault would be), but I'll try here:
I have Django Celery sending tasks via HAProxy to a RabbitMQ cluster, and we are losing messages (tasks not being executed) every now and then.
Observed
We turned off the workers and monitored the queue size; we started 100 jobs but only 99 showed up in the queue.
It seems to happen when other processes are using RabbitMQ for other jobs.
Tried
I tried flooding RabbitMQ with dummy messages over many connections while putting some proper tasks into the queue, but I couldn't reproduce the issue consistently.
Just wondering if anyone had experienced this before?
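For reference, a flood test along these lines might be used to try to reproduce it; this is only a rough sketch with pika, where the queue name, message counts, and connection details are placeholders:

import pika

def flood(n_connections=10, n_messages=1000):
    """Open several connections and publish persistent dummy messages."""
    for _ in range(n_connections):
        connection = pika.BlockingConnection(
            pika.ConnectionParameters('localhost'))
        channel = connection.channel()
        channel.queue_declare(queue='dummy', durable=True)
        for i in range(n_messages):
            channel.basic_publish(
                exchange='',
                routing_key='dummy',
                body='dummy message %d' % i,
                # delivery_mode=2 marks the message as persistent
                properties=pika.BasicProperties(delivery_mode=2),
            )
        connection.close()

if __name__ == '__main__':
    flood()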
UPDATE:
So I dove into the code and eventually stumbled onto celery/app/amqp.py. I was debugging by adding an extra publish call to a non-existent exchange, see below:
log.warning(111111111)
self.publish(
    body,
    exchange=exchange, routing_key=routing_key,
    serializer=serializer or self.serializer,
    compression=compression or self.compression,
    headers=headers,
    retry=retry, retry_policy=_rp,
    reply_to=reply_to,
    correlation_id=task_id,
    delivery_mode=delivery_mode, declare=declare,
    **kwargs
)
log.warning(222222222)
self.publish(
    body,
    exchange='celery2', routing_key='celery1',
    serializer=serializer or self.serializer,
    compression=compression or self.compression,
    headers=headers,
    retry=retry, retry_policy=_rp,
    reply_to=reply_to,
    correlation_id=task_id,
    delivery_mode=delivery_mode, declare=declare,
    **kwargs
)
log.warning(333333333)
Then I tried to trigger 100 tasks from the project code, and the result was that only 1 message got put into the celery queue. I think it's caused by the ProducerPool or ConnectionPool.
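If the pools are indeed involved, one hedged diagnostic is to disable Celery's broker connection pool and re-run the 100-task test; this is only a sketch of the setting (Celery 3.x name shown, Celery 4+ uses the lowercase broker_pool_limit), not a confirmed fix:

# settings.py / celeryconfig.py
# 0 (or None) disables the broker connection pool, so every publish
# opens and closes its own connection instead of reusing pooled ones.
BROKER_POOL_LIMIT = 0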
Related
We have been trying to use RabbitMQ to transfer data from Project A to Project B.
We created a producer that takes the data from Project A and puts it in a queue, and that was relatively easy. Then we created a k8s pod for Project B, which listens to the appropriate queue with kombu's ConsumerMixin.
Overall, the integration was reasonable and straightforward. But when we started to process long messages, we noticed that they were coming back into the queue repeatedly.
After some research, we found that whenever the processing of a message takes more than 20 seconds, the message shows up in the queue again, even though the processing was successful.
The source of this issue lies with RabbitMQ's heartbeat. We set the heartbeat to 10 seconds, and RabbitMQ checks the connection twice before it kills it. However, because the callback takes more than 20 seconds, and the .ack() (acknowledge) of the message happens at the end of the callback (to ensure it was successful), the heartbeat is blocked while the message is being processed (as described here: https://github.com/celery/kombu/issues/621#issuecomment-251836611).
We have been trying to find a workaround with threading, processing the message on a different thread so the heartbeat is not blocked, but it didn't work. It also felt like we were hacking around the problem rather than solving it.
So my question is whether there is a proper workaround to handle this situation, or what alternatives we have. RabbitMQ seemed like the right choice since we use it in standalone projects with Celery, and it is also widely recommended.
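For what it's worth, one way such a threading workaround might be structured with kombu is sketched below. This is only an illustration: it assumes the ack can be deferred to the run loop's own thread via the on_iteration hook, and do_long_work, the queue name, and the connection details are placeholders.

import threading
from queue import Queue, Empty

from kombu import Connection, Queue as KQueue
from kombu.mixins import ConsumerMixin

task_queue = KQueue('project_b_tasks')  # placeholder queue name

def do_long_work(body):
    ...  # the >20s processing would go here

class Worker(ConsumerMixin):
    def __init__(self, connection):
        self.connection = connection
        self._done = Queue()  # (message, success) pairs finished by threads

    def get_consumers(self, Consumer, channel):
        return [Consumer(queues=[task_queue],
                         callbacks=[self.on_message],
                         prefetch_count=1)]

    def on_message(self, body, message):
        # Return immediately so the mixin's loop keeps calling
        # drain_events()/heartbeat_check() while the work runs.
        threading.Thread(target=self._process, args=(body, message),
                         daemon=True).start()

    def _process(self, body, message):
        try:
            do_long_work(body)
            self._done.put((message, True))
        except Exception:
            self._done.put((message, False))

    def on_iteration(self):
        # Called by the run loop in the connection's own thread,
        # so acking/requeueing here avoids cross-thread channel use.
        try:
            while True:
                message, ok = self._done.get_nowait()
                message.ack() if ok else message.requeue()
        except Empty:
            pass

if __name__ == '__main__':
    with Connection('amqp://guest:guest@localhost:5672//', heartbeat=10) as conn:
        Worker(conn).run()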
I am using Celery and RabbitMQ, but after pushing several tasks into the queue my server's memory utilization goes above 40%, at which point RabbitMQ will not accept any further tasks. I want to delete the messages that have already been executed, but because of RabbitMQ's durable behaviour they are not deleted automatically. I want to set some configuration like autoAck=True, so that once a message is consumed by Celery it is deleted from the RabbitMQ queues and freed from my server's memory. Please explain how we can do that.
OK, so while I don't fully understand why you have the problem you have, it is clear what is going on.
A publisher puts a task message in the queue
Your worker process pulls the message and processes it
The message is never actually removed from the queue
This behavior happens when a consumer fails to acknowledge the processing of a message. To confirm, if you look at the RabbitMQ management plug-in, you'll see a whole bunch of unacknowledged messages. These will be unavailable for consumption, but will continue to be held on the server, taking up disk space and memory.
Further, if you do a Basic.Recover, all of these messages will then get dumped back into the queue to be processed again.
This problem is due to incorrect configuration of your consumer. There are two ways to address this:
You can configure the consumer to auto-ack (i.e. acknowledge the message automatically upon receipt). This is done when you declare the consumer (using Basic.Consume). Edit: It looks like this may be the default behavior of Celery.
You can configure your worker process to submit an acknowledgement (using Basic.Ack). Edit: this is done via the acks_late property in Celery.
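In Celery terms, the two options above look roughly like this; only a sketch, using the Celery 4+ setting name (older versions use the upper-case CELERY_ACKS_LATE form) and a placeholder broker URL:

# celeryconfig.py (or app.conf) -- pick one:
task_acks_late = False   # Option 1 (Celery's default): ack as soon as the
                         # worker receives the task, before running it
# task_acks_late = True  # Option 2: ack only after the task has finished

# The same choice can also be made per task:
from celery import Celery

app = Celery('proj', broker='amqp://guest:guest@localhost//')

@app.task(acks_late=True)
def long_running_task(data):
    ...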
I have made a consumer for RabbitMQ as a console application written in C#.NET. It is programmed to listen to a queue perpetually, and whenever it finds a message in the queue it processes it. The consumer processes on average 35 messages per second. The consumers are scheduled in the task scheduler to run at system startup. The consumers run fine for 3-4 days, but then they keep running without processing any messages, even though the queue has messages in it. When the consumer is stopped and started again, it starts processing the messages properly. But by the time you restart it manually, millions of messages have queued up. Can someone please help me explain this abnormal behavior? I have other queues that have been running for months without stopping.
Requesting prompt response. Thanks in advance to the experts.
I suggest you look at the consumer code; it might be running but stuck in a RabbitMQ exception. It does sound odd that it runs fine for 3-4 days.
I had a similar problem of the consumer not consuming messages from the queue. I was using "RabbitMQ.Client.QueueingBasicConsumer" to dequeue messages, and when the queue was closed abruptly the consumer kept running but was stuck in a System.IO.EndOfStreamException. I now use "RabbitMQ.Client.Events.EventingBasicConsumer", which has solved the issue.
I am trying to do some stress testing on AMQ 5.5.1.
I have created a queue and am using JMeter Point-to-Point to send JMS requests to the queue. Kindly note I haven't configured any consumer, so messages just stack up and are stored in the KahaDB store.
I notice that if I use 200 users in the Thread Group, it creates exactly 400 threads on ActiveMQ, which I can see via jconsole.
JMeter keeps pushing messages to the queue gradually (actually quite fast), as I can see the queue size increasing steadily rather than all at once.
I have ProducerFlowControl set to false and am using the default hybrid store cursor (though I haven't got a consumer ready at the moment).
I am also using Persistent Delivery.
My questions are:
What is stopping JMeter from pushing all 200 messages in one go? Is it ActiveMQ, or do I need to configure something in JMeter to be able to send all 200 at once? I did notice that as soon as I start the test in JMeter, 400 threads are created on ActiveMQ straight away, which makes me think it establishes connections with ActiveMQ for all 200 users at once, but the messages are pushed in batches rather than together.
Why are there 2 threads per user on ActiveMQ, and why do all the threads remain active until all messages have been pushed? Ideally, if the users were pushing messages one by one, each thread should have died as soon as it had done so and received an acknowledgement back. But all 200 x 2 threads die at the same time, when all messages have finally been pushed.
Any help is appreciated.
It seems the longer I keep my RabbitMQ server running, the more trouble I have with unacknowledged messages. I would love to requeue them. In fact there seems to be an AMQP command to do this, but it only applies to the channel that your connection is using. I built a little pika script to at least try it out, but I am either missing something or it cannot be done this way (how about with rabbitmqctl?).
import pika

credentials = pika.PlainCredentials('***', '***')
parameters = pika.ConnectionParameters(host='localhost', port=5672,
                                       credentials=credentials, virtual_host='***')

def handle_delivery(body):
    """Called when we receive a message from RabbitMQ"""
    print body

def on_connected(connection):
    """Called when we are fully connected to RabbitMQ"""
    connection.channel(on_channel_open)

def on_channel_open(new_channel):
    """Called when our channel has opened"""
    global channel
    channel = new_channel
    channel.basic_recover(callback=handle_delivery, requeue=True)

try:
    connection = pika.SelectConnection(parameters=parameters,
                                       on_open_callback=on_connected)
    # Loop so we can communicate with RabbitMQ
    connection.ioloop.start()
except KeyboardInterrupt:
    # Gracefully close the connection
    connection.close()
    # Loop until we're fully closed, will stop on its own
    connection.ioloop.start()
Unacknowledged messages are those which have been delivered across the network to a consumer, have not yet been ack'ed or rejected, and whose consumer hasn't yet closed the channel or connection over which it originally received them. Therefore the broker can't figure out whether the consumer is just taking a long time to process those messages or whether it has forgotten about them. So it leaves them in an unacknowledged state until either the consumer dies or they get ack'ed or rejected.
Since those messages could still be validly processed in the future by the still-alive consumer that originally consumed them, you can't (to my knowledge) insert another consumer into the mix and try to make external decisions about them. You need to fix your consumers to make decisions about each message as they get processed rather than leaving old messages unacknowledged.
If messages are unacked there are only two ways to get them back into the queue:
basic.nack
This command will cause the message to be placed back into the queue and redelivered.
Disconnect from the broker
This action will force all unacked messages from this channel to be put back into the queue.
NOTE: basic.recover will try to republish unacked messages on the same channel (to the same consumer), which is sometimes the desired behaviour.
RabbitMQ spec for basic.recover and basic.nack
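To illustrate the first option, a consumer callback using pika's modern 1.x API might reject and requeue a message it cannot handle like this (the queue name and process() are placeholders):

import pika

def on_message(ch, method, properties, body):
    try:
        process(body)  # placeholder processing function
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # basic.nack with requeue=True puts the message back into the
        # queue so it can be redelivered (possibly to another consumer).
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.basic_consume(queue='task_queue', on_message_callback=on_message)
channel.start_consuming()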
The real question is: Why are the messages unacknowledged?
Possible scenarios to cause unacked messages:
Consumer fetching too many messages, then not processing and acking them quickly enough.
Solution: Prefetch as few messages as appropriate (see the basic.qos sketch after this list).
Buggy client library (I currently have this issue with pika 0.9.13): if the queue has a lot of messages, a certain number of messages will get stuck unacked, even hours later.
Solution: I have to restart the consumer several times until all unacked messages are gone from the queue.
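For the first scenario, the prefetch limit is set with basic.qos; a minimal pika 1.x sketch (the count of 10 is an arbitrary placeholder, to be tuned to what the consumer can ack promptly):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Deliver at most 10 unacknowledged messages to this consumer at a time.
channel.basic_qos(prefetch_count=10)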
All the unacknowledged messages will go to ready state once all the workers/consumers are stopped.
Ensure all workers are stopped by confirming with a grep on ps aux output, and stopping/killing them if found.
If you are managing workers using supervisor and it shows the workers as stopped, you may still want to check for zombies. Supervisor may report a worker as stopped, yet you will find zombie processes running when you grep the ps aux output. Killing the zombie processes will bring the messages back to the ready state.