RabbitMQ delete a corrupted queue after node crash - rabbitmq

RabbitMQ Version 3.7.21
Erlang Version Erlang 21.3.8.10
My team had 2 nodes hit the memory watermark last night and so I rebuilt the bad nodes but it left some queues in a bad state. I want to clear them out so that we can recreate them.
The stats show NaN for Ready, Unacked, and Total and the stats in queue look like:
It looks like the queue's node is one that no longer exists so unfortunately I can't access it. It's completely gone.
I have tried the following commands:
rabbitmqctl eval 'Q = rabbit_misc:r(<<"/">>, queue, <<"QUEUE">>), rabbit_amqqueue:internal_delete(Q).'
rabbitmqctl eval 'Q = {resource, <<"/">>, queue, <<"QUEUE">>}, rabbit_amqqueue:internal_delete(Q).'
but get this error:
{:undef, [{:rabbit_amqqueue, :internal_delete, [{:resource, "/", :queue, "QUEUE"}], []}, {:erl_eval, :do_apply, 6, [file: 'erl_eval.erl', line: 680]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}
Which I assume means it's trying to make an RPC call to a node that no longer exists and it fails. This seems crazy to me because not just is the node gone but it has been forgotten from the cluster but still a couple queues remain.

Looks like there are 3 options:
Comb through the Mnesia tables and delete the corrupted ones
Fully rebuild the cluster and migrate to a new cluster
Rename your queues and ignore corrupted ones
We're going to go with Option 3 for now but I'm sure eventually there will be a breaking change in RabbitMQ that will make Option 2 more appealing but for now the quick fix is best for me.

According to https://groups.google.com/g/rabbitmq-users/c/VSjzvOUfS3s/m/q8OmFTqACAAJ, the internal_delete function in 3.7.x takes two arguments:
In 3.7.x rabbit_amqqueue:internal_delete takes two arguments (acting user name is the second one).
Therefore, the next time you need to delete a queue in a bad state, try
rabbitmqctl eval 'Q = {resource, <<"/">>, queue, <<"QUEUE">>}, rabbit_amqqueue:internal_delete(Q, <<"CLI">>).'

Related

attempt to call field 'replicate_commands' (a nil value)

I use jedis + lua to eval script, here is my lua script:
redis.replicate_commands()
local second = redis.call('TIME')[1]
local currentKey = KEYS[1]..second
if redis.call('EXISTS', currentKey) == 0 then
redis.call('SETEX', currentKey, 1, 1)
return 1
else
return redis.call('INCR', currentKey)
end
As I use 'Time', it reports error:Write commands not allowed after non deterministic commands.
after searching on internet, I add 'redis.replicate_commands()' as first line of lua script, but it still reports error:ERR Error running script (call to f_c89a6ee8ad732a325e530f4a69226851cde302e2): #user_script:1: user_script:1: attempt to call field 'replicate_commands' (a nil value)
Does replicate_commands need arguments or is there a way to solve my problem?
redis version:3.0
jedis version:2.9
lua version: I don't know where to find
The error attempt to call field 'replicate_commands' (a nil value) means replicate_commands() doesn't exists in the redis object. It is a Lua-side error message.
replicate_commands() was introduced until Redis 3.2. See EVAL - Replicating commands instead of scripts. Consider upgrading.
The first error message (Write commands not allowed after non deterministic commands) is a redis-side message, you cannot call write-commands (like SET, SETEX, INCR, etc) after calling non-deterministic commands (like SPOP, SCAN, RANDOMKEY, TIME, etc).
A very important part of scripting is writing scripts that are pure functions.
Scripts executed in a Redis instance are, by default, propagated to
replicas and to the AOF file by sending the script itself -- not the
resulting commands.
This is so if the Redis server is restarted, playing again the AOF log, or also if replicated in a slave, the script should deliver the same dataset.
This is why in Redis 3.2 replicate_commands() was introduced. And starting with Redis 5 scripts are always replicated as effects -- as if replicate_commands() was called when the script started. But for versions before 3.2, you simply cannot do this.
Therefore, either upgrade to 3.2 or later, or pass currentKey already calculated to the script from the client instead.
Note that creating currentKey dynamically makes your script single-instance-only.
All Redis commands must be analyzed before execution to determine
which keys the command will operate on. In order for this to be true
for EVAL, keys must be passed explicitly. This is useful in many ways,
but especially to make sure Redis Cluster can forward your request to
the appropriate cluster node.
Note this rule is not enforced in order to provide the user with
opportunities to abuse the Redis single instance configuration, at the
cost of writing scripts not compatible with Redis Cluster.
Finally, the Lua version at Redis 3.0.0 is Lua 5.1.5, same as all the way up to Redis 6 RC1.

Authentication failures in cassandra when 1 of 16 nodes is down

I have a Cassandra cluster running :
Cassandra 2.0.11.83 | DSE 4.6.0 | CQL spec 3.1.1 | Thrift protocol 19.39.0
The cluster has 18 nodes, split among 3 datacenters, 6 in each. My system_auth keyspace has the following replication defined:
replication = {
'class': 'NetworkTopologyStrategy',
'DC1': '4',
'DC2': '4',
'DC3': '4'}
and my authenticator/authorizer are set to:
authenticator: org.apache.cassandra.auth.PasswordAuthenticator
authorizer: org.apache.cassandra.auth.CassandraAuthorizer
This morning I brought down one of the nodes in DC1 for maintenance. Within a few seconds/minute client applications started logging exceptions like this:
"User my_application_user has no MODIFY permission on or any of its parents"
Running 'LIST ALL PERMISSIONS of my_application_user' on one of the other nodes shows that user to have SELECT and MODIFY on the keyspace xxxxx, so I am rather confused. Do I have a setup issue? Is this a bug of some sort?
Re-posting this as the answer, as BrianC suggested above.
So this is resolved... Here's the sequence of events that seems to have fixed it:
Add 18 more nodes
Run cleanup on original nodes (this was part of the original plan)
Run a scrub on 1 table, since it was throwing exceptions on cleanup
Run a repair on the system_auth KS on the original troubled node
Wait for repair service to complete a full pass on all keyspaces
Decom original 18 nodes.
Honestly, I don't know what fixed it. The system_auth repair makes most sense, but what doesn't make sense is that it had run many passes before, so why work now, I don't know. I hope this at least helps someone.

Why does celery add thousands of queues to rabbitmq that seem to persist long after the tasks completel?

I am using celery with a rabbitmq backend. It is producing thousands of queues with 0 or 1 items in them in rabbitmq like this:
$ sudo rabbitmqctl list_queues
Listing queues ...
c2e9b4beefc7468ea7c9005009a57e1d 1
1162a89dd72840b19fbe9151c63a4eaa 0
07638a97896744a190f8131c3ba063de 0
b34f8d6d7402408c92c77ff93cdd7cf8 1
f388839917ff4afa9338ef81c28aad75 0
8b898d0c7c7e4be4aa8007b38ccc00ea 1
3fb4be51aaaa4ac097af535301084b01 1
This seems to be inefficient, but further I have observed that these queues persist long after processing is finished.
I have found the task that appears to be doing this:
#celery.task(ignore_result=True)
def write_pages(page_generator):
g = group(render_page.s(page) for page in page_generator)
res = g.apply_async()
for rendered_page in res:
print rendered_page # TODO: print to file
It seems that because these tasks are being called in a group, they are being thrown into the queue but never being released. However, I am clearly consuming the results (as I can view them being printed when I iterate through res. So, I do not understand why those tasks are persisting in the queue.
Additionally, I am wondering if the large number queues that are being created is some indication that I am doing something wrong.
Thanks for any help with this!
Celery with the AMQP backend will store task tombstones (results) in an AMQP queue named with the task ID that produced the result. These queues will persist even after the results are drained.
A couple recommendations:
Apply ignore_result=True to every task you can. Don't depend on results from other tasks.
Switch to a different backend (perhaps Redis -- it's more efficient anyway): http://docs.celeryproject.org/en/latest/userguide/tasks.html
Use CELERY_TASK_RESULT_EXPIRES (or on 4.1 CELERY_RESULT_EXPIRES) to have a periodic cleanup task remove old data from rabbitmq.
http://docs.celeryproject.org/en/master/userguide/configuration.html#std:setting-result_expires

celery+rabbitmq empty queue

I am using celery+rabbitmq. I can't find convenient way to clear queue in celery+rabbitmq. I do it with remove and create vhost.
rabbitmqctl delete_vhost <vhostpath>
rabbitmqctl add_vhost <vhostpath>
Is it prefer way to clear some celery queue ?
I'm not quite sure how celery works, but I suspect you want to purge a RabbitMQ queue (you're currently simulating this by deleting the queues and having celery re-create them).
You could install RabbitMQ's Management Plugin. Its WebUI will allow you to purge the required queue. This should also tell you which queue you're aiming for, so you wouldn't need to delete everything.
Once you know which queue it is, you could purge it programatically. For instance, using py-amqplib, you would do something like:
from amqplib import client_0_8 as amqp
conn = amqp.Connection(host="localhost:5672", userid="guest", password="guest", virtual_host="/", insist=False)
conn = conn.channel()
conn.queue_purge("the-target-queue")
There's probably a better way to do it, though.
If you are facing this problem because you used rabbitmq for the result backend and as a result you got too many queues, then i would suggest using a different result backend (redis or mongodb)
This is one well known flaw with the celery. It will create a separate queue for each result if you amqp for result backend.
If you still want to stick to amqp as result backend. It will clear itself in 24 hours. You can however set it to a smaller value using CELERY_AMQP_TASK_RESULT_EXPIRES setting.
If you need to delete ALL items in queue (especially when the list is long)
1) Saves all items into the file
sudo rabbitmqctl list_queues -p /yourvhost name > queues.txt
don't forget to remove first and last lines from 'queues.txt'
2) Use mentioned python code to do the job
from amqplib import client_0_8 as amqp
conn = amqp.Connection(host="127.0.0.1:5672", userid="guest", password="guest", virtual_host="/yourvhost", insist=False)
conn = conn.channel()
queues = None
with open('queues.txt', 'r') as f:
queues = f.readlines()
for q in queues:
if q:
#print 'deleting %s' % q
conn.queue_purge(q.strip())
print 'purged %d items' % len(queues)

How to destroy jobs enqueued by resque workers?

I'm using Resque on a rails-3 project to handle jobs that are scheduled to run every 5 minutes. I recently did something that snowballed the creation of these jobs and the stack has hit over 1000 jobs. I fixed the issue that caused that many jobs to be queued and now the problem I have is that the jobs created by the bug are still there and therefore It becomes difficult to test something since a job is added to a queue with 1000+ jobs.
I can't seem to stop these jobs. I have tried removing the queue from the redis-cli using the flushall command but it didn't work. Am I missing something? coz I can't seem to find a way of getting rid of these jobs.
Playing off of the above answers, if you need to clear all of your queues, you could use the following:
Resque.queues.each{|q| Resque.redis.del "queue:#{q}" }
If you pop open a rails console, you can run this code to clear out your queue(s):
queue_name = "my_queue"
Resque.redis.del "queue:#{queue_name}"
Resque already has a method for doing this - try Resque.remove_queue(queue_name) (see the documentation here). Internally it performs Resque.redis.del(), but it also does other cleanup, and by using an api method (rather than making assumptions about how resque works) you'll be more future-proof.
Updated rake task for clearing (according to latest redis commands changes): https://gist.github.com/1228863
This is what works now:
Resque.remove_queue("...")
Enter redis console:
redis-cli
List databases:
127.0.0.1:6379> KEYS *
1) "resque:schedules_changed"
2) "resque:workers"
3) "resque:queue:your_overloaded_queue"
"resque:queue:your_overloaded_queue" - db which you need.
Then run:
DEL resque:queue:your_overloaded_queue
Or if you want to delete specified jobs in queue then list few values from db with LRANGE command:
127.0.0.1:6379> LRANGE resque:queue:your_overloaded_queue 0 2
1) "{\"class\":\"AppClass\",\"args\":[]}"
2) "{\"class\":\"AppClass\",\"args\":[]}"
3) "{\"class\":\"AppClass\",\"args\":[]}"
Then copy/paste one value to LREM command:
127.0.0.1:6379> LREM resque:queue:your_overloaded_queue 5 "{\"class\":\"AppClass\",\"args\":[]}"
(integer) 5
Where 5 - number of elements to remove.
It's safer and bulletproof to use the Resque API rather than deleting everything on the Resque's Redis. Resque does some cleaning in the inside.
If you want to remove all queues and associated enqueued jobs:
Resque.queues.each {|queue| Resque.remove_queue(queue)}
The queues will be re-created the next time a job is enqueued.
Documentation