Which kind of flow control is RabbitMQ going through?

Background
I am experiencing abnormal behavior in a RabbitMQ cluster: the cluster seems to stop responding in a timely and performant fashion. Producers enter flow control, and consumers do not get messages.
The server has:
Spare CPU capacity (topping out at around 50%)
Lots of free memory
IOPS below provisioned IOPS rates
Yet, under high load I see connections going into flow control, latency increasing across the board, and consumers timing out on acks :(
Problem I am trying to solve
The current metrics I am looking at do not indicate that anything is going wrong.
Which metrics, logs, and events should I look at to understand that the system is approaching max capacity? Is it possible to tell which type of flow control RabbitMQ is going through (per-connection credit flow vs. a memory or disk alarm)?
Versions used
rabbitmq_erlang_version: 1:25.1.2-1
rabbitmq_server_version: 3.10.10-1
Happy to provide any further information needed : )
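For reference, this is roughly how I am polling for alarms and flow state today via the management HTTP API. A minimal sketch: the endpoint, credentials, and the requests dependency are assumptions on my side.

```python
import requests  # assumed dependency; the management plugin must be enabled

API = "http://localhost:15672/api"  # assumed management endpoint
AUTH = ("guest", "guest")           # placeholder credentials

# Memory/disk alarms block all publishing connections cluster-wide.
for node in requests.get(f"{API}/nodes", auth=AUTH).json():
    print(node["name"],
          "mem_alarm:", node["mem_alarm"],
          "disk_free_alarm:", node["disk_free_alarm"])

# Per-connection credit flow shows up as state == "flow" on busy publishers.
for conn in requests.get(f"{API}/connections", auth=AUTH).json():
    if conn.get("state") == "flow":
        print("connection throttled by flow control:", conn["name"])
```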

Related

RabbitMQ slowing down after some time

On our RabbitMQ installation in production, we have a performance issue.
To explain the context, we have an initialization batch that creates around ~60k messages. For business reasons, those messages must be treated in strict order and we can't lose any. As such, we have only one queue, which is durable and lazy, and one consumer (Spring Boot AMQP) with a prefetch of 10. Both are on the same virtual machine.
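For reference, here is a minimal sketch of the equivalent setup in Python/pika (our real consumer is Spring Boot AMQP, so the queue name and host here are placeholders):

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# Durable, lazy queue: messages are kept on disk rather than in memory.
ch.queue_declare(queue="batch-init", durable=True,
                 arguments={"x-queue-mode": "lazy"})
ch.basic_qos(prefetch_count=10)  # at most 10 unacked messages in flight

def handle(channel, method, properties, body):
    # ... strict-order business processing goes here ...
    channel.basic_ack(delivery_tag=method.delivery_tag)

ch.basic_consume(queue="batch-init", on_message_callback=handle)
ch.start_consuming()
```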
At first, the processing is fast enough, around 5 to 10 messages per second. But it progressively slows down until it reaches a cap of fewer than 20 messages per hour. It takes approximately 1 hour to reach this point.
After some investigations, we found out that the problem comes from RabbitMQ. When we simply stop and restart it, the performance goes back to normal and then drops slowly again. Doing the same on just the consumer doesn't change anything.
I'm thinking about some resource bottleneck, but I can't manage to find which one, as RAM, CPU, and disk all look fine. I am not really familiar with the Erlang virtual machine or with managing RabbitMQ itself, so I may have missed something.
Does someone have an idea of the source of the problem, or where I could look for more information on what is happening?
RabbitMQ characteristics :
ERL 23.3.2
RabbitMQ 3.8.14

RabbitMQ HA with Durable features

Background
I have a RabbitMQ cluster that has been running for more than a year without any problems. Lately, I found that the CPU of the machine sometimes touches 100%. I'm investigating ways to increase the throughput of the cluster to serve more customers.
The cluster architecture is that we have HA enabled (exactly 1 replica) and durable messages (for all the queues). As I understand it, the durable feature is the most expensive one in terms of performance, so I am trying to understand whether it is needed in my case.
Question
In my experience, the cluster has been running for more than a year without problems, so I assume the chance of a failure is very low. Even so, I want to create another layer of protection, just in case...
If I have two servers holding the same data but not storing it to disk (durable OFF), is that not safe enough for 99.99% of cases? The two servers are in different regions, so the chance that both of them go down is very low. I'm wondering whether saving to disk is helpful, or just a waste.
Is there a rule of thumb, in percent, about the performance improvement gained by disabling the durable feature?
Thank you!
The influence of durable on performance
For reliable delivery, RabbitMQ uses the publisher confirm mechanism. Every time the publisher publishes a message to the RabbitMQ server, the server responds with a basic.ack RPC to acknowledge the message. For routable messages, the basic.ack is sent once the message has been accepted by all the queues. For persistent messages routed to durable queues, this means persisting to disk. For mirrored queues, this means that all mirrors have accepted the message. So, as you mentioned, I/O may become the performance bottleneck.
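As an illustration, here is roughly what publisher confirms look like with the Python pika client; a minimal sketch, assuming a durable queue named "orders" (a placeholder):

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.queue_declare(queue="orders", durable=True)  # placeholder queue name
ch.confirm_delivery()  # put the channel in confirm mode

try:
    # With delivery_mode=2 the broker must persist the message (and replicate
    # it to all mirrors) before it sends the confirm back.
    ch.basic_publish(exchange="",
                     routing_key="orders",
                     body=b"hello",
                     properties=pika.BasicProperties(delivery_mode=2),
                     mandatory=True)
except pika.exceptions.UnroutableError:
    print("message was returned as unroutable")
except pika.exceptions.NackError:
    print("broker could not take responsibility for the message")
```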
Is it overhead to have both durable and mirrored?
It depends on your trade-off between performance and HA. Imagine you declare a non-durable mirrored queue and both the master and the slave go down: your messages will be lost. So whether it is overhead depends on how important message safety is.
Is the performance bottleneck mainly caused by durable?
As we discussed, if you declare a non-durable queue, throughput may increase. But durability may not be the main cause of the low performance. You said the CPU usage sometimes hits 100%, which implies very little I/O waiting. The high load may instead be due to many connections and high throughput. To determine how to increase throughput, you can use a benchmark tool (such as RabbitMQ's PerfTest) to find the bottleneck.
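A rough way to measure the cost of durability yourself; a hedged Python/pika sketch, where the message count, size, and bench-* queue names are arbitrary (PerfTest gives far more thorough numbers):

```python
import time
import pika

def publish_rate(durable: bool, n: int = 10_000) -> float:
    """Publish n messages with confirms on and return messages/second."""
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    queue = f"bench-{'durable' if durable else 'transient'}"  # hypothetical names
    ch.queue_declare(queue=queue, durable=durable)
    ch.confirm_delivery()  # confirms make the persistence cost visible
    props = pika.BasicProperties(delivery_mode=2 if durable else 1)
    start = time.monotonic()
    for _ in range(n):
        ch.basic_publish(exchange="", routing_key=queue,
                         body=b"x" * 100, properties=props)
    rate = n / (time.monotonic() - start)
    conn.close()
    return rate

print("durable:  ", publish_rate(True))
print("transient:", publish_rate(False))
```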
Pages that may be useful:
https://www.cloudamqp.com/blog/2016-01-25-identify-and-protect-against-high-cpu-and-memory-usage.html
https://www.cloudamqp.com/blog/2018-01-08-part2-rabbitmq-best-practice-for-high-performance.html

RabbitMQ delayed exchange plugin loads and resources

We are using RabbitMQ (3.6.6) to send analyses (millions) to different analyzers. These are very quick, and we were planning to use the delayed message plugin to schedule monitoring checks over the analyzed elements.
We were thinking about rabbitmq-delayed-exchange-plugin; we have already made some tests and we need some clarification.
Currently:
We are scheduling millions of messages
Delays range from a few minutes to 24 hours
As previously said, these are tests, so we are using a machine with one core and 4G of RAM, which also has other apps running on it.
What happened with the high memory watermark set at 2.0G:
RabbitMQ eventually (after a day or so) starts consuming 100% of its single core and does not respond to the management interface or rabbitmqctl. This goes on for at least 18 hours (we always end up killing it, deleting the Mnesia delayed-message file on disk - about 100/200 MB - and restarting).
What happened with the high memory watermark set at 3.6G:
RabbitMQ was killed by the kernel because of high memory usage (4 GB of hardware RAM) about a week after working like this.
The Mnesia file for the delayed exchange is about 1.5G.
RabbitMQ cannot start anymore, giving the trace below (we are assuming that, because it was terminated by a KILL, messages in the delay store somehow ended up corrupted):
{could_not_start,rabbit,
rabbitmq-server[12889]: {{case_clause,{timeout,['rabbit_delayed_messagerabbit#rabbitNode']}},
rabbitmq-server[12889]: [{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,
And right now we are asking ourselves: are we in a little over our heads using the delayed exchange plugin for these volumes of information? If we are, then that is the end of the problem: rethink and restart. But if not, what would be an appropriate hardware and/or configuration setup?
The RabbitMQ delayed exchange plugin is not designed to store millions of messages.
This is also documented on the plugin page:
Current design of this plugin doesn't really fit scenarios with a high number of delayed messages (e.g. 100s of thousands or millions). See issue #72 for details.
Read also here: https://github.com/rabbitmq/rabbitmq-delayed-message-exchange/issues/72
This plugin is often used as if RabbitMQ was a database. It is not.
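For context, every message published through the plugin is parked inside the exchange (in a Mnesia table, the same file mentioned above) until its per-message delay expires; a minimal usage sketch in Python/pika, with exchange and queue names as placeholders:

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# The plugin provides the x-delayed-message exchange type; x-delayed-type
# tells it how to route once a delay expires.
ch.exchange_declare(exchange="monitor-later",
                    exchange_type="x-delayed-message",
                    arguments={"x-delayed-type": "direct"})
ch.queue_declare(queue="monitor", durable=True)
ch.queue_bind(queue="monitor", exchange="monitor-later", routing_key="monitor")

# Each message carries its own delay (in ms) and sits in the exchange until
# it elapses - which is why millions of in-flight delays add up.
ch.basic_publish(exchange="monitor-later", routing_key="monitor",
                 body=b"check analysis 42",
                 properties=pika.BasicProperties(headers={"x-delay": 60000}))
```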

Scalability of Azure Cloud Queue

In our current project we use 8 worker role machines side by side that actually work a little differently than Azure may expect.
Short outline of the system:
each worker starts up to 8 processes that connect to the cloud queue and process messages
each process accesses three different cloud queues for collecting messages for different purposes (delta recognition, backup, metadata)
each message leads to a WCF call to an ERP system to gather information, and finally the retrieved response is added to a Redis cache
this approach was chosen over many smaller machines due to cost and performance: while 24 one-core machines would achieve about 400 calls/s to the ERP system, 8 four-core machines with 8 processes each do over 800 calls/s
Now to the question: when increasing the count of machines further, to push performance to 1,200 calls/s, we experienced outages of Cloud Queue. At the same moment in time, 80% of the machines' processes stop processing messages.
Here we have two problems:
Remote debugging is not possible for these processes, but it was possible to use dile to get some information out.
We use the GetMessages method of Cloud Queue to get up to 4 messages from the queue. Cloud Queue always answers with 0 messages. Reconnecting to the cloud queue does not help.
Restarting the workers does help, but shortly leads to the same problem.
Are we hitting the natural end of the scalability of Cloud Queue, and should we switch to Service Bus?
Update:
I have not been able to fully understand the problem; I described it in terms of the natural borders of Cloud Queue.
To summarize:
The count of TCP connections was impressive. Actually, too impressive (multiple hundreds).
Going back to the original memory size let the system operate normally again.
In my experience, I have been able to get better raw performance out of Azure Cloud Queues than Service Bus, but Service Bus has better enterprise features (reliable messaging, topics, etc.). Azure Cloud Queue should process up to 2K messages/second per queue.
https://azure.microsoft.com/en-us/documentation/articles/storage-scalability-targets/
You can also try partitioning to multiple queues if there is some natural partition key.
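A minimal sketch of that idea, assuming the azure-storage-queue v12 Python SDK; the connection string, shard names, and hash choice are all illustrative:

```python
import hashlib
from azure.storage.queue import QueueClient  # assumed SDK: azure-storage-queue v12

CONN_STR = "..."  # storage account connection string (placeholder)
QUEUE_NAMES = ["work-0", "work-1", "work-2", "work-3"]  # hypothetical shards

clients = [QueueClient.from_connection_string(CONN_STR, name)
           for name in QUEUE_NAMES]

def client_for(partition_key: str) -> QueueClient:
    # A stable hash of the natural partition key spreads load across queues.
    digest = hashlib.sha256(partition_key.encode()).digest()
    return clients[digest[0] % len(clients)]

client_for("customer-42").send_message("hello")
```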
Make sure that your processes don't have some sort of thread deadlock that is the real culprit. You can test this by connecting to the queue when it appears hung and trying to pull messages from it. If that works, it is your process, not the queue.
Also take a look at this to set up some other monitors:
https://azure.microsoft.com/en-us/documentation/articles/storage-monitor-storage-account/
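Here is a hedged sketch of that deadlock test using the current azure-storage-queue v12 Python SDK (the question uses the older .NET GetMessages API; the connection string and queue name are placeholders):

```python
from azure.storage.queue import QueueClient  # assumed SDK: azure-storage-queue v12

CONN_STR = "..."  # storage account connection string (placeholder)
queue = QueueClient.from_connection_string(CONN_STR, "delta-recognition")

# peek_messages does not dequeue, so it is safe to run against a live system.
messages = queue.peek_messages(max_messages=4)
if messages:
    print(f"queue returned {len(messages)} message(s): suspect the workers")
else:
    print("queue really returned nothing: suspect the queue/storage account")
```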
It took some time to solve this issue:
First, a summary of how the storage account was used:
We used the blob storage once a day pretty heavily.
The "normal" diagnostics that Azure provides out of the box also used the same storage account.
Some controlling processes used small tables to store and read information once an hour for ca. 20 minutes
There may be up to 800 calls/s that increment a number to count calls to an ERP system.
When we recognized that the storage account was under heavy load, we split it up.
Now there are three physical storage accounts having 2 queues.
The original one still takes up to 800 calls/s for incrementing counters
Diagnostics are still on the original one
Controlling information has also been moved
The system has now been running for 2 weeks, working like a charm. There are several things we learned from that:
No, the infrastructure is "not just there" and it doesn't scale endlessly.
Even though we thought we didn't use "that much", summed up we used it quite heavily and in an uncontrolled way.
There are no "best practices" anywhere on the net that tell the complete story. Especially when starting to work with storage accounts, a guide from MS would be quite helpful.
Exception handling in storage is quite bad. Even if the storage account is overused, I would expect some kind of exception and not just a return of zero messages without any surrounding information.
Read the complete story here: natural borders of cloud storage scalability
UPDATE:
Scalability has a lot of influences. You may be interested in Azure Service Bus: Massive count of listeners and senders to become aware of some more pitfalls.

How to ensure flow control in RabbitMQ is never triggered?

I have a publisher pushing to a queue at a slightly higher rate than the consumers can consume. For small numbers it is okay, but for a very large number of messages RabbitMQ starts writing them to disk. At a certain point in time, the disk becomes full and flow control is triggered. From then on, the rates are really slow. Is there any way to decrease or share this load between cluster nodes? How should I design my application so that flow control is never triggered?
I am using RabbitMQ 3.2.3 on three nodes with 13G of RAM and 10G of system disk space each, connected to each other in a cluster. Two of these are RAM nodes, and the remaining one is a disk node, also used for the RabbitMQ management plugin.
You can tweak the configuration, upgrade the hardware, etc., and in the end you'd probably want to put a load balancer in front of your RabbitMQ servers to balance the load between multiple RabbitMQ nodes. The problem is that if you are publishing at a higher rate than you are consuming, you will eventually run into this problem again, and again.
I think the best way to prevent this from happening is to implement logic on the publisher side that keeps track of the number of requests waiting to be processed in the queue. If the number of messages exceeds X, the publisher should either wait until the backlog has gone down, or publish new messages at a slower rate. This type of solution depends on where the published messages come from; if they are user-submitted (e.g. through a browser or client), you could show a loading bar while the queue builds up.
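A minimal sketch of that publisher-side throttle with Python/pika; the threshold, queue name, and one-second backoff are illustrative:

```python
import time
import pika

MAX_BACKLOG = 10_000  # the X above: max messages allowed to wait in the queue

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.queue_declare(queue="work", durable=True)  # placeholder queue name

def backlog(channel, queue: str) -> int:
    # A passive declare returns the current message count without side effects.
    return channel.queue_declare(queue=queue, passive=True).method.message_count

def publish_throttled(body: bytes) -> None:
    while backlog(ch, "work") > MAX_BACKLOG:
        time.sleep(1)  # back off until consumers drain the queue
    ch.basic_publish(exchange="", routing_key="work", body=body)
```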
Ideally though you should focus on making the processing on the consumer side faster, and maybe scale that part up, but having something to throttle the publisher when it gets busy should help prevent buildups.