RabbitMQ delayed exchange plugin loads and resources - rabbitmq

We are using RabbitMQ (3.6.6) to send analyses (millions of them) to different analyzers. These are very quick, and we were planning to use the rabbit-message-plugin to schedule monitoring of the analyzed elements.
We were considering rabbitmq-delayed-exchange-plugin, have already run some tests, and need some clarification.
Currently:
We are scheduling millions of messages
Delays range from a few minutes to 24 hours
As previously said, these are tests, so we are using a machine with one core and 4 GB of RAM, which also has other apps running on it.
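For reference, a minimal sketch of how a delay in this range is expressed for the delayed-message plugin, which reads the delay in milliseconds from the x-delay message header. The helper name delay_headers is hypothetical, and the actual publish call (client library, exchange name) is left out because it is not described above.

```python
from datetime import timedelta


def delay_headers(delay: timedelta) -> dict:
    """Build message headers asking the plugin to hold a message for `delay`.

    Hypothetical helper: only the "x-delay" header name comes from the plugin.
    """
    millis = int(delay.total_seconds() * 1000)
    if millis < 0:
        raise ValueError("delay must not be negative")
    # The plugin expects the delay as an integer number of milliseconds.
    return {"x-delay": millis}


# The two ends of the delay range described above.
shortest = delay_headers(timedelta(minutes=5))
longest = delay_headers(timedelta(hours=24))
print(shortest, longest)
```

Every message published this way stays in the plugin's Mnesia-backed store until its delay expires, which is why the number of pending messages matters so much here.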
What happened with the high memory watermark set to 2.0 GB:
RabbitMQ eventually (after a day or so) starts consuming 100% CPU (of the single core) and stops responding to both the management interface and rabbitmqctl. This goes on for at least 18 hours (we always end up killing it, deleting the Mnesia delayed-message file on disk, about 100 to 200 MB, and restarting).
What happened with the high memory watermark set to 3.6 GB:
RabbitMQ was killed by the kernel because of high memory usage (4 GB of hardware RAM) after running like this for about a week.
The Mnesia file for the delayed exchange is about 1.5 GB.
RabbitMQ can no longer start and gives the trace below (we assume that, because it was terminated by a KILL, the messages in the delay store somehow ended up corrupted):
{could_not_start,rabbit,
rabbitmq-server[12889]: {{case_clause,{timeout,['rabbit_delayed_messagerabbit#rabbitNode']}},
rabbitmq-server[12889]: [{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,
And right now we are asking ourselves: are we in over our heads using the RabbitMQ delayed exchange plugin for this volume of information? If we are, then that is the end of the problem: rethink and restart. But if not, what would be an appropriate hardware and/or configuration setup?

The RabbitMQ delayed exchange plugin is not designed to store millions of messages.
This is also documented on the plugin page:
Current design of this plugin doesn't really fit scenarios with a high
number of delayed messages (e.g. 100s of thousands or millions). See
#72 for details.
Read also here: https://github.com/rabbitmq/rabbitmq-delayed-message-exchange/issues/72
This plugin is often used as if RabbitMQ was a database. It is not.

Related

RabbitMQ slowing down after some time

On our RabbitMQ installed in production, we have a performance issue.
To explain the context, we have an initialization batch that creates around ~60k messages. For business reasons, those messages must be treated in strict order and we can't lose any. As such, we have only one queue which is durable and lazy and one consumer (SpringBoot AMQP) with a prefetch of 10. Both are on the same virtual machine.
At first, the processing is fast enough, around 5 to 10 messages per second. But it progressively slows down until it reaches a cap of fewer than 20 messages per hour. It takes approximately 1 hour to reach this point.
After some investigations, we found out that the problem comes from RabbitMQ. When we simply stop and restart it, the performance goes back to normal and then drops slowly again. Doing the same on just the consumer doesn't change anything.
I suspect a resource bottleneck, but I can't work out which one, as RAM, CPU, and disk all look fine. I am not really familiar with the Erlang virtual machine or with managing RabbitMQ itself, so I may have missed something.
Does anyone have an idea of the source of the problem, or where I could look for more information on what is happening?
RabbitMQ characteristics:
ERL 23.3.2
RabbitMQ 3.8.14

scalability of azure cloud queue

In our current project we use 8 worker role machines side by side that work a little differently than Azure may expect.
Short outline of the system:
each worker starts up to 8 processes that connect to the cloud queue and process messages
each process accesses three different cloud queues to collect messages for different purposes (delta recognition, backup, metadata)
each message leads to a WCF call to an ERP system to gather information and finally adds the retrieved response to a Redis cache
this approach has been chosen over many smaller machines due to cost and performance. While 24 one-core machines would perform about 400 calls/s to the ERP system, 8 four-core machines with 8 processes each do over 800 calls/s.
Now to the question: when we increased the number of machines further to raise performance to 1200 calls/s, we experienced outages of Cloud Queue. At that moment, 80% of the machines' processes stopped processing messages.
Here we have two problems:
Remote debugging is not possible for these processes, but it was possible to use dile to get some information out.
We use the GetMessages method of Cloud Queue to get up to 4 messages from the queue. Cloud Queue always answers with 0 messages. Reconnecting to the cloud queue does not help.
Restarting the workers does help, but shortly afterwards leads to the same problem.
Are we hitting the natural end of scalability of Cloud Queue and should switch to Service Bus?
Update:
I have not been able to fully understand the problem. I described it in "the natural borders of Cloud Queue".
To summarize:
The number of TCP connections was impressive, actually too impressive (several hundred).
Going back to the original memory size let the system operate normally again.
In my experience I have been able to get better raw performance out of Azure Cloud Queues than out of Service Bus, but Service Bus has better enterprise features (reliability, topics, etc.). Azure Cloud Queue should process up to 2K messages/second per queue.
https://azure.microsoft.com/en-us/documentation/articles/storage-scalability-targets/
You can also try partitioning to multiple queues if there is some natural partition key.
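As a rough sketch of what partitioning by key could look like: hash the partition key and use the result to pick one of N queues, so the same key always lands on the same queue. The function name queue_for, the "work" prefix, and the queue count are assumptions made for illustration, not part of any Azure API.

```python
import hashlib


def queue_for(partition_key: str, queue_count: int, prefix: str = "work") -> str:
    """Deterministically map a partition key to one of `queue_count` queue names."""
    if queue_count < 1:
        raise ValueError("queue_count must be at least 1")
    # hashlib gives a hash that is stable across processes and machines,
    # unlike the built-in hash(), which is randomized per interpreter run.
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % queue_count
    return f"{prefix}-{index}"


# Hypothetical keys: every worker computes the same queue for the same key.
for key in ("customer-17", "customer-18", "customer-17"):
    print(key, "->", queue_for(key, queue_count=4))
```

Each queue then has its own scalability target, and spreading the queues over separate storage accounts also spreads the per-account limits.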
Make sure that your processes don't have some sort of thread deadlock that is the real culprit. You can test this by connecting to the queue when it appears hung and trying to pull messages from it. If that works, the problem is in your process, not the queue.
Also take a look at this to set up some other monitors:
https://azure.microsoft.com/en-us/documentation/articles/storage-monitor-storage-account/
It took some time to solve this issue:
First, a summary of how the storage account was used:
We used the blob storage once a day pretty heavily.
The "normal" diagnostics that Azure provides out of the box also used the same storage account.
Some controlling processes used small tables to store and read information once an hour for ca. 20 minutes.
There may be up to 800 calls/s that try to increase a number to count calls to an ERP system.
Once we recognized that the storage account was under heavy load, we split it up.
Now there are three physical storage accounts having 2 queues.
The original one still takes up to 800 calls/s for increasing counters.
Diagnostics are still on the original one.
Controlling information has also been moved.
The system has now been running for 2 weeks, working like a charm. There are several things we learned from that:
No, the infrastructure is "not just there" and it doesn't scale endlessly.
Even though we thought we didn't use it "that much", in total we used it quite heavily and in an uncontrolled way.
There are no "best practices" anywhere on the net that tell the complete story. Especially when starting to work with the storage account, a guide from MS would be quite helpful.
Exception handling in storage is quite bad. Even if the storage account is overused, I would expect some kind of exception and not just zero messages returned without any surrounding information.
Read complete story here: natural borders of cloud storage scalability
UPDATE:
Scalability is influenced by many factors. You may be interested in Azure Service Bus: Massive count of listeners and senders to be aware of some more pitfalls.

ActiveMQ performance for producing persistent text messages

As advised on the webpage
activemq-performance-module-users-manual, I tested (on an Intel i7 laptop with Windows 7 and an SSD drive) the performance of producing persistent messages on an ActiveMQ queue:
mvn activemq-perf:producer -Dproducer.destName=queue://TEST.FOO -Dproducer.deliveryMode=persistent
against the default installation of activemq 5.12.1
The performance which I got is around 300-400 messages per second.
On the page activemq-performance I have been reading much higher numbers:
When running the server on one box and a single producer and consumer thread in separate VMs on the other box, using a single topic we got around 21-22,000 messages/second using 1-2K messages.
On the other hand, when the messages are not persistent (-Dproducer.deliveryMode=nonpersistent), the performance of the producer grows to 49000 messages per second.
When the messages are sent asynchronously
-Dproducer.deliveryMode=persistent -Dfactory.useAsyncSend=true
I get around 23000 messages sent per second.
From what I see here stackoverflow-activemq-persistent-performance-on-different-operatiing-systems it makes a difference which OS ActiveMQ runs on.
Can somebody give me some tips for getting better performance when writing persistent ActiveMQ messages?
Performance of sending persistent messages is all about disk based IO as the message must be written to the disk prior to the broker signalling the client that the message send completed. The faster the disk the better your throughput will be, all else being equal.
To work around some of this you can send persistent messages in transactional batches so that the send itself is complete and the synchronization point is reduced to the transaction boundary.
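To illustrate the batching idea in isolation, here is a minimal sketch that commits once per batch instead of once per message. The send and commit callables are hypothetical stand-ins for whatever the client library provides (in JMS terms, a producer send on a transacted session and the session commit); nothing here is ActiveMQ API.

```python
from typing import Callable, Iterable


def send_in_batches(messages: Iterable[str], batch_size: int,
                    send: Callable[[str], None],
                    commit: Callable[[], None]) -> int:
    """Send every message, committing once per `batch_size` messages.

    Returns the number of commits, i.e. the number of synchronization points.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    commits = 0
    pending = 0
    for message in messages:
        send(message)
        pending += 1
        if pending == batch_size:
            commit()
            commits += 1
            pending = 0
    if pending:
        # Commit the final, partially filled batch.
        commit()
        commits += 1
    return commits


# Stand-in callables so the sketch runs without a broker.
sent = []
commits = send_in_batches((f"msg-{i}" for i in range(1050)), 100,
                          sent.append, lambda: None)
print(len(sent), "messages sent with", commits, "commits")
```

The trade-off is that a failure before the commit loses or rolls back the whole uncommitted batch, so the batch size bounds how much work may need to be resent.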
Depending on the size of the text messages, you can also gain some performance by using compression. This can be turned on via an option in the ActiveMQConnectionFactory.

How to ensure flow control in RabbitMQ is never triggered?

I have a publisher pushing to a queue at a slightly larger rate than the consumers can consume. For small numbers, it is okay, but for a very large number of messages, RabbitMQ starts writing it to the disk. At a certain point of time, the disk becomes full, and flow control is triggered. From then on, the rates are really slow. Is there any way to decrease or share this load between cluster nodes? How should I design my application so that flow control is never triggered?
I am using RabbitMQ 3.2.3 on three nodes with 13G RAM, and 10G of system disk space - connected to each other through the cluster. Two of these are RAM nodes, and the remaining one is a disk node, also used for RabbitMQ management plugin.
You can tweak the configuration, upgrade hardware etc and in the end you'd probably want to put a load balancer in front of your RabbitMQ servers to balance the load between multiple RabbitMQ nodes. The problem here is that if you are publishing at a higher rate than you are consuming, eventually you will run into this problem again, and again.
I think the best way to prevent this from happening is to implement logic on the publisher side that keeps track of the number of requests waiting to be processed in the queue. If the number of requests exceeds X, the publisher should either wait until the number of messages has gone down, or publish new messages at a slower rate. This type of solution of course depends on where the published messages are coming from; if they are user submitted (e.g. through a browser or client) you could show a loading bar when the queue builds up.
Ideally though you should focus on making the processing on the consumer side faster, and maybe scale that part up, but having something to throttle the publisher when it gets busy should help prevent buildups.
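A minimal sketch of the publisher-side throttle described above, assuming the publisher has some way to read the current backlog. The queue_depth callable is hypothetical (it could wrap a passive queue declare, which reports the message count, or a call to the management API), and the two thresholds are made-up values.

```python
import time
from typing import Callable


def wait_for_capacity(queue_depth: Callable[[], int],
                      high_water: int, low_water: int,
                      poll_seconds: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Pause the publisher while the queue is too full.

    If the backlog has reached `high_water`, block until it has drained to
    `low_water` or below. Returns how many times we slept.
    """
    waits = 0
    if queue_depth() < high_water:
        return waits  # enough room, publish right away
    while queue_depth() > low_water:
        sleep(poll_seconds)
        waits += 1
    return waits


# Simulated backlog readings so the sketch runs without a broker.
readings = iter([120, 120, 90, 40])
waited = wait_for_capacity(lambda: next(readings), high_water=100,
                           low_water=50, sleep=lambda seconds: None)
print("slept", waited, "times before publishing")
```

Calling this before each publish (or each batch) keeps the backlog bounded. Using two thresholds instead of one avoids the publisher flapping between paused and running when the depth hovers around the limit.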

distributed cluster questions about performance

I'm using 6 servers to form a cluster, and they are all disk nodes. I use RabbitMQ to collect log files for our website. At peak hours, the publish rate is about 30k messages per second. There are 2 main consumers (HDFS and Elasticsearch) and each one needs to handle every message, so the delivery rate reaches about 60k per second.
In my scenario, a single server can sustain a delivery rate of 10k, and I use 6 nodes to balance the load. My solution is to create 2 queues on each node. Each message gets a random routing key (something like message.0, message.1, etc.) to distribute the load across all nodes.
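The random routing-key scheme could be sketched like this. The bucket count of 12 (2 queues on each of 6 nodes) is an assumption derived from the numbers above, and random_routing_key is a hypothetical helper name.

```python
import random
from typing import Optional


def random_routing_key(bucket_count: int, prefix: str = "message",
                       rng: Optional[random.Random] = None) -> str:
    """Pick a routing key such as "message.0" uniformly at random."""
    if bucket_count < 1:
        raise ValueError("bucket_count must be at least 1")
    pick = rng.randrange if rng is not None else random.randrange
    return f"{prefix}.{pick(bucket_count)}"


# Assumed layout: 6 nodes with 2 queues each gives 12 buckets.
print(random_routing_key(12))
```

Random keys spread routing evenly across queues, but they do not change which node a publisher's connection terminates on.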
What confused me is:
All messages are sent to one node. Should I use HAProxy to load balance this publish load?
Is there any performance difference between Durable Queues and Transient Queues?
Is there any performance difference between a Memory Node and a Disk Node? As far as I know, the only difference between a memory node and a disk node concerns metadata such as queue configuration.
How can I improve the performance of my publish and delivery code? I've researched this and know of several methods:
disable the confirm mechanism (in the publish code?)
enable HiPE (I've done that and it helped a lot)
For example, say the input is 1w mps (messages per second; 1w = 10,000) and there are two consumers that each consume every message. Then the output is 2w mps. If one server can handle 1w mps, I need two servers to handle the 2w mps load. Now a new consumer needs to consume every message, too. As a result, the output reaches 3w mps, so I need one more server. In conclusion: one more consumer for all messages means one more server?
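The arithmetic behind that example can be written down as a small sketch. It deliberately mirrors the simplification made above: only delivery load is counted, and each server is assumed to handle a fixed delivery rate. The function name servers_needed is hypothetical.

```python
def servers_needed(publish_rate: int, consumer_groups: int,
                   per_server_rate: int) -> int:
    """Servers required if every consumer group receives every message."""
    if per_server_rate < 1:
        raise ValueError("per_server_rate must be at least 1")
    delivery_rate = publish_rate * consumer_groups
    # Integer ceiling division: round up to whole servers.
    return -(-delivery_rate // per_server_rate)


# The example above: 10,000 mps in, 10,000 mps per server.
print(servers_needed(10_000, 2, 10_000))  # two consumers
print(servers_needed(10_000, 3, 10_000))  # a third consumer is added
# The cluster described earlier: 30k publish rate, 2 consumers.
print(servers_needed(30_000, 2, 10_000))
```

Under these assumptions the delivery rate grows linearly with the number of consumers that each need every message, so the server count does too.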
"All messages are sent to one node. Should I use HAProxy to load balance this publish load?"
This article outlines a number of designs aimed at distributing load in RabbitMQ.
"Is there any performance difference between Durable Queues and Transient Queues?"
Yes, Durable Queues are backed up to disk so that they can be reinstated on server-restart, for example. This adds a nominal overhead, though the actual process occurs asynchronously.
"Is there any performance difference between Memory Node and Disk Node?"
Not that I'm aware of, but that would depend on the machine itself.
"How can I improve the performance of my publish and delivery code?"
Try this out.