RabbitMQ logs filling with {{badmatch,{error,access_refused}}, consuming disk space

We were facing a problem in a busy production cluster where the log files were filling up quickly, consuming the entire disk (40 GB) in a matter of days.
The errors we saw were:
=ERROR REPORT==== 19-Jul-2019::12:01:41 ===
** Generic server <0.13892.127> terminating
** Last message in was {'$gen_cast',init}
** When Server state == {state,undefined,undefined,undefined,undefined,
{<<"prod1">>,
<<"Move from My_Queue_Name">>},
dynamic,
{shovel,
{endpoint,
["amqp:///prod1"],
#Fun<rabbit_shovel_parameters.4.75090704>},
{endpoint,
["amqp:///prod1"],
#Fun<rabbit_shovel_parameters.5.120532295>},
1000,on_confirm,
#Fun<rabbit_shovel_parameters.6.48689962>,
#Fun<rabbit_shovel_parameters.7.130815760>,
<<"My_Queue_Name">>,
1,'queue-length'},
undefined,undefined,undefined,undefined,undefined}
** Reason for termination ==
** {{badmatch,{error,access_refused}},
[{rabbit_shovel_worker,make_conn_and_chan,1,[]},
{rabbit_shovel_worker,handle_cast,2,[]},
{gen_server2,handle_msg,2,[]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
While the queue name varied across about 10 different queues, we saw the same error logged over, and over, and over.

What we noticed quite late into the incident was that the virtual host named in the shovel parameters (amqp:///prod1) actually referred to an old vhost which we had deleted weeks ago.
It seems that when we deleted this vhost, there were still queues within it. Internally, this appears to have put RabbitMQ into a bad state, leaving lingering shovel configurations pointing at queues which "sort of" existed. I'm no RabbitMQ expert, but my assumption is that deleting a vhost does not delete its queues; you never see them in the UI again, yet the broker remains confused by them.
The solution here was to re-create the vhost with the same name, at which point the queues became visible again and could be deleted, and then delete the vhost once more. This stopped the errors, and the log files are no longer growing out of control.
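For reference, here is a rough Python sketch of that recovery sequence against the RabbitMQ management HTTP API (this assumes the management plugin is listening on localhost:15672 with guest/guest credentials; the vhost name prod1 comes from the log above):

import requests
from urllib.parse import quote

base = "http://localhost:15672/api"
auth = ("guest", "guest")
vhost = "prod1"

# 1. Re-create the deleted vhost so its orphaned queues become visible again
requests.put(f"{base}/vhosts/{vhost}", auth=auth).raise_for_status()

# 2. List every queue that reappears in the vhost and delete it
for q in requests.get(f"{base}/queues/{vhost}", auth=auth).json():
    requests.delete(f"{base}/queues/{vhost}/{quote(q['name'], safe='')}", auth=auth)

# 3. Delete the vhost again, now that it is genuinely empty
requests.delete(f"{base}/vhosts/{vhost}", auth=auth).raise_for_status()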

Related

Keep queues and exchanges when vhost is corrupted

We've recently come across a problem when using RabbitMQ: when the hard drive of our server is full, RabbitMQ's vhosts get corrupted and become unusable.
The only way to make RabbitMQ functional again is to delete and recreate the corrupted vhosts.
Doing so, all of our queues and exchanges, along with the data in them, are gone.
While this situation should not happen in production, we're searching for a way to prevent data loss if such an event does occur.
We've looked at the official RabbitMQ documentation, as well as Stack Exchange, but haven't found any solution to prevent data loss when a vhost is corrupted.
We plan to set up a cluster at a later stage of development, which should at least help reduce data loss when a vhost is corrupted, but that isn't possible for now.
Is there any reliable way to either prevent vhost corruption, or to fix the vhost without losing data?
Some thoughts on this (in no particular order):
RabbitMQ has multiple high-availability configurations - relying upon a single node provides no protection against data loss.
In general, you can have one of two possible guarantees with a message, but never both (a short sketch after these notes shows the broker-side settings that lean towards the first):
At least once delivery - a message will be delivered at least one time, and possibly more.
At most once delivery - a message may or may not be delivered, but if it is delivered, it will never be delivered a second time.
Monitoring the overall health of your nodes (e.g. disk space, processor use, memory) should be done proactively by a tool dedicated to that purpose. You should never be surprised by running out of a critical system resource.
If you are running one node, and that node is out of disk space, and you have a bunch of messages on it, and you're worried about data loss, wondering how RabbitMQ can help you, I would say you have your priorities mixed up.
RabbitMQ is not a database. It is not designed to reliably store messages for an indefinite time period. Please don't count on it as such.
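To make the first guarantee concrete, here is a minimal pika sketch (assuming a local broker, default credentials, and a hypothetical "orders" queue) combining the broker-side pieces that lean towards at-least-once delivery: a durable queue, persistent messages, publisher confirms, and manual consumer acknowledgements.

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.confirm_delivery()                               # publisher confirms: basic_publish raises if the broker refuses the message
channel.queue_declare(queue="orders", durable=True)      # the queue itself survives a broker restart

channel.basic_publish(
    exchange="",
    routing_key="orders",
    body=b"order-123",
    properties=pika.BasicProperties(delivery_mode=2),    # persistent message, written to disk
)

def on_message(ch, method, properties, body):
    print("processing", body)                            # stand-in for real work
    ch.basic_ack(delivery_tag=method.delivery_tag)       # ack only after the work succeeds

channel.basic_consume(queue="orders", on_message_callback=on_message)
channel.start_consuming()                                # unacked messages are redelivered if the consumer dies

None of this prevents duplicates: if the consumer crashes after doing the work but before acking, the message comes back, which is exactly the at-least-once trade-off described above.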

Artemis vs ActiveMQ 5 message store

In ActiveMQ 5, each queue had a folder containing all of its data and messages.
This meant that, in case of an issue such as an out-of-disk-space error, some files would get corrupted before the server crashed. In that case, in ActiveMQ 5, we would find logs indicating the corrupted files, and we could delete the queue folder that was corrupted, resulting in a small loss of messages instead of losing ALL messages.
In Artemis, it seems that messages are stored in the same files regardless of which queue they belong to, which means that if I get an out-of-disk-space error, I might have to delete all my messages.
First, can you confirm the change of behaviour, and second, is there a way to recover? As a bonus, if anyone knows why this change happened, I would like to understand.
Artemis uses a completely new message journal implementation as compared to 5.x. The same journal is used for all messages. However, it isn't subject to the same corruption problems as you've seen with 5.x. If records from the journal can't be processed then they are simply skipped.
If you get an out-of-disk-space error you should never need to delete all your messages. The journal files themselves are allocated and filled with zeroes to meet their configured size before they are actually used, so if you were going to run out of disk space you'd do so during that process, before any messages were written to them.
The Artemis journal implementation was written from the ground up for high performance specifically in conjunction with the broker's non-blocking architecture.

RabbitMQ - cannot delete queue because the queue cannot be found

Overnight our RabbitMQ queues got full; the node basically ran out of space. However, RabbitMQ now doesn't really work: no component can make a connection to it, because it blocks connections. I want to free up the space it uses, but when I try to purge a queue via the admin GUI, I get the following error:
NOT_FOUND - no queue 'sharding: sharded_queue - rabbit#hostname - 0'
in vhost '/'
If I try to list the queues with the command-line tool, they are not listed. These queues are currently only visible via the GUI, but I cannot interact with them in any way; delete doesn't work either.
Is there a way, a workaround to get the queues cleaned? Should I find the actual messages stored on the disk and delete those?
Update
I have found the following command in this thread:
rabbitmqctl eval 'Q = {resource, <<"/">>, queue, <<"sharding: sharded_queue - rabbit#hostname - 0">>}, rabbit_amqqueue:internal_delete(Q).'
This actually deletes the queue; it no longer shows up in the GUI. However, the disk space is still not freed, which is a huge problem.

celeryev Queue in RabbitMQ Becomes Very Large

I am using Celery on RabbitMQ. I have been sending thousands of messages to the queue and they are being processed successfully; everything is working just fine. However, the number of messages in several RabbitMQ queues is growing quite large (hundreds of thousands of items per queue). The queues are named celeryev.[...]. Is this appropriate behavior? What is the purpose of these queues, and shouldn't they be regularly purged? Is there a way to purge them more regularly? I think they are taking up quite a bit of disk space.
You can use the CELERY_EVENT_QUEUE_TTL Celery option (it only works with amqp); it sets the message expiry time, after which the message is deleted from the queue.
For anyone else who is running into problems with a celeryev queue becoming very large and threatening the disk space on your RabbitMQ server, beware the accepted answer! Here's my suggestion. Just issue this command on your RabbitMQ instance:
rabbitmqctl set_policy limit_celeryev_queues "^celeryev\." '{"max-length":1000000}' --apply-to queues
This will limit any queue whose name begins with "celeryev." to 1 million entries. I did some experimenting with a stuck Flower instance causing a runaway celeryev queue, and setting CELERY_EVENT_QUEUE_TTL / CELERY_EVENT_QUEUE_EXPIRES did not help control the queue size.
In my testing, I started a flower process, then SIGSTOP'ed it, and watched its celeryev queue start running away. Neither of these two settings helped at all. I confirmed SIGCONT'ing the flower process would bring the queue back to 0 rapidly. I am not certain why these two knobs didn't help, but it may have something to do with how RabbitMQ implements these two settings.
First, the Per-Message TTL corresponding to CELERY_EVENT_QUEUE_TTL only establishes an expiration time on each queue entry -- AIUI it will not automatically delete the message out of the queue to save space upon expiration. Second, the Queue TTL corresponding to CELERY_EVENT_QUEUE_EXPIRES says that it "... guarantees that the queue will be deleted, if unused for at least the expiration period". However, I believe that their definition of "unused" may be too strict to kick in for e.g. an overburdened, stuck, or killed flower process.
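For context, those two Celery settings map onto standard RabbitMQ queue arguments. A minimal pika sketch (queue name and values are only illustrative) declaring a queue with both:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(
    queue="celeryev.example",          # hypothetical event-queue name
    arguments={
        "x-message-ttl": 5000,         # per-message TTL in milliseconds (what CELERY_EVENT_QUEUE_TTL controls)
        "x-expires": 60000,            # delete the queue after 60 s without use (what CELERY_EVENT_QUEUE_EXPIRES controls)
    },
)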
EDIT: Unfortunately, one problem with this suggestion is that the set_policy ... apply-to queues will only impact existing queues, and flower can and will create new queues which may overflow.
Celery uses celeryev-prefixed queues (and an exchange) for monitoring; you can configure them as you want or disable them altogether (celery control disable_events).
You just have to add a setting to your Celery configuration.
If you want to prevent Celery from creating celeryev.* queues:
CELERY_SEND_EVENTS = False # Will not create celeryev.* queues
If you need these queues for monitoring purposes (Celery Flower, for instance), you may want to have them cleaned up regularly instead:
CELERY_EVENT_QUEUE_EXPIRES = 60 # Will delete all celeryev. queues without consumers after 1 minute.
The solution came from here: https://www.cloudamqp.com/docs/celery.html
You can limit the queue size in RabbitMQ with the x-max-length queue declaration argument:
http://www.rabbitmq.com/maxlength.html
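If you declare the queue yourself rather than applying a policy, a minimal pika sketch of that argument looks like this (queue name and limit are just examples):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(
    queue="bounded_queue",                   # example queue name
    arguments={"x-max-length": 1000000},     # once full, the oldest messages are dropped from the head of the queue
)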

Temporary queue made in Celery

I am using Celery with RabbitMQ. Lately, I have noticed that a large number of temporary queues are being created.
So, I experimented and found that when a task fails (that is, a task raises an exception), a temporary queue with a random name (like c76861943b0a4f3aaa6a99a6db06952c) is created, and the queue remains.
Some properties of the temporary queue as found in rabbitmqadmin are as follows -
auto_delete : True
consumers : 0
durable : False
messages : 1
messages_ready : 1
One such temporary queue is created every time a task fails (that is, raises an exception). How can I avoid this situation? In my production environment a large number of such queues get created.
It sounds like you're using amqp as the result backend. From the docs, here are the pitfalls of that particular setup:
Every new task creates a new queue on the server, with thousands of tasks the broker may be overloaded with queues and this will affect performance in negative ways. If you're using RabbitMQ then each queue will be a separate Erlang process, so if you're planning to keep many results simultaneously you may have to increase the Erlang process limit, and the maximum number of file descriptors your OS allows.
Old results will not be cleaned automatically, so you must make sure to consume the results or else the number of queues will eventually go out of control. If you're running RabbitMQ 2.1.1 or higher you can take advantage of the x-expires argument to queues, which will expire queues after a certain time limit after they are unused. The queue expiry can be set (in seconds) by the CELERY_AMQP_TASK_RESULT_EXPIRES setting (not enabled by default).
From what I've read in the changelog, this is no longer the default backend in versions >= 2.3.0 because users were getting bitten by this behavior. I'd suggest changing the result backend if this is not the functionality you need.
Well, Philip is right. The following is a description of how I solved it; it is a configuration in celeryconfig.py.
I am still using CELERY_BACKEND = "amqp" as Philip had said. But in addition to that, I am now using CELERY_IGNORE_RESULT = True. This configuration will ensure that the extra queues are not formed for every task.
I was already using this configuration, but the extra queue was still created whenever a task failed. Then I noticed I was using another setting which needed to be removed: CELERY_STORE_ERRORS_EVEN_IF_IGNORED = True. With that setting, results were not stored for all tasks, but they were stored for errors (tasks which failed), hence one extra queue for each failed task.
The CELERY_TASK_RESULT_EXPIRES setting dictates the time to live of the temporary queues. The default is 1 day; you can modify this value.
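Put together, a sketch of that celeryconfig.py (using the old-style setting names discussed above; newer Celery releases use different, lowercase names):

# celeryconfig.py
CELERY_BACKEND = "amqp"                        # amqp result backend, as in the question
CELERY_IGNORE_RESULT = True                    # do not create a result queue per task
# CELERY_STORE_ERRORS_EVEN_IF_IGNORED = True   # removed: this re-created a queue for every failed task
CELERY_TASK_RESULT_EXPIRES = 3600              # expire result queues after an hour instead of the 1-day default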
The reason this is happening is that Celery worker remote control is enabled (it is enabled by default).
You can disable it by setting the CELERY_ENABLE_REMOTE_CONTROL setting to False
However, note that you will lose the ability to do things like add_consumer, cancel_consumer, etc. using the celery command.
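In celeryconfig.py that is simply:
CELERY_ENABLE_REMOTE_CONTROL = False  # disables the remote-control queues, and with them add_consumer/cancel_consumer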
The amqp backend creates a new queue for each task. If you want to avoid that, you can use the rpc backend, which keeps results in a single queue.
In your config, set:
CELERY_RESULT_BACKEND = 'rpc'
CELERY_RESULT_PERSISTENT = True
You can read more about this in the Celery docs.