RabbitMQ HA with Durable features - rabbitmq

Background
I have a RabbitMQ cluster that has been running for more than a year without any problems. Lately, I have noticed that the machine's CPU usage sometimes reaches 100%. I'm investigating ways to increase the throughput of the cluster to serve more customers.
The cluster has HA enabled (exactly 1 replica) and durable messages (for all the queues). As I understand it, the durable feature is the most expensive one in terms of performance, so I am trying to understand whether I need it.
Question
In my experience, the cluster has been running for more than a year without problems, so I assume that the chance of a problem is very low. Even so, I want another layer of protection, just in case...
If I have two servers holding the same data but not storing it on disk (durable OFF), is that not safe enough for 99.99% of cases? The two servers are in different regions, so the chance that both of them go down is very low. Is saving to disk actually helpful, or just a waste?
Is there a rule of thumb for the performance improvement from disabling the durable feature, in percent?
Thank you!

The influence of durable on performance
For reliable delivery, RabbitMQ uses the publisher confirms mechanism. Every time a publisher publishes a message on a channel in confirm mode, the server responds with a basic.ack to acknowledge the message. For routable messages, the basic.ack is sent when the message has been accepted by all the queues. For persistent messages routed to durable queues, this means persisting to disk. For mirrored queues, this means that all mirrors have accepted the message. So, as you mentioned, disk I/O may become the performance bottleneck.
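The confirm rules above can be written down as a small model. This is only an illustrative sketch, not broker code: QueueInfo and confirm_conditions are made-up names, and the unroutable-message case is deliberately not modeled.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QueueInfo:
    """Hypothetical description of one queue a message is routed to."""
    name: str
    durable: bool
    mirrors: int = 0  # number of mirrors in addition to the master


def confirm_conditions(persistent: bool, queues: list[QueueInfo]) -> list[str]:
    """List what has to happen before basic.ack is sent, following the rules above."""
    conditions = []
    for q in queues:
        # Every queue the message is routed to must accept it.
        conditions.append(f"accepted by queue {q.name}")
        # Disk is only involved for persistent messages on durable queues.
        if persistent and q.durable:
            conditions.append(f"persisted to disk for queue {q.name}")
        # Mirrored queues additionally wait for all mirrors.
        if q.mirrors > 0:
            conditions.append(f"accepted by all {q.mirrors} mirror(s) of queue {q.name}")
    return conditions


for line in confirm_conditions(True, [QueueInfo("orders", durable=True, mirrors=1)]):
    print(line)
```

With durable OFF (or non-persistent messages), the disk step drops out of the list, while the mirror step stays.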
Is using both durable and mirrored queues unnecessary overhead?
It depends on how you weigh performance against HA. If you declare a non-durable mirrored queue and both the master and the slave go down, your messages will be lost. So whether it is overhead depends on how important message safety is.
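To put a rough number on the "both nodes down" case, here is a back-of-the-envelope sketch. It assumes node failures are independent, which may not hold in practice even across regions (a bad deployment or shared configuration can take out both), and the 99% per-node availability is a made-up figure.

```python
def prob_all_down(node_availability: float, nodes: int = 2) -> float:
    """Probability that every node is down at the same time, assuming independent failures."""
    return (1.0 - node_availability) ** nodes


# Hypothetical input: each node is up 99% of the time.
p = prob_all_down(0.99, nodes=2)
print(f"chance that both nodes are down at once: {p:.4%}")
```

Under these assumptions the result is roughly 0.01%, which is where a non-durable mirrored queue would lose its messages.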
Is the performance bottleneck mainly caused by durable?
As we discussed, if you declare non-durable queues, the throughput may increase. But durability may not be the main cause of the low performance. You said the CPU usage sometimes reaches 100%, which suggests very little I/O wait. The high load may be due to many connections and high throughput. To determine how to increase throughput, you can use a benchmark tool to find the bottleneck.
Pages that may be useful:
https://www.cloudamqp.com/blog/2016-01-25-identify-and-protect-against-high-cpu-and-memory-usage.html
https://www.cloudamqp.com/blog/2018-01-08-part2-rabbitmq-best-practice-for-high-performance.html

Related

Handling RabbitMQ node failures in a cluster in order to continue publishing and consuming

I would like to create a cluster for high availability and put a load balancer in front of it. In our configuration, we would like to create exchanges and queues manually, so once exchanges and queues are created, no client should need to redeclare them. I am using a direct exchange with a routing key, so it's possible to route the messages into different queues on different nodes. However, I have some issues with clustering and queues.
As far as I can tell from the RabbitMQ documentation, a queue is specific to the node it was created on. Moreover, there can be only one queue with a given name in a cluster, and it has to be alive at the time of publish/consume operations. If the node dies, the queue on that node is gone and its messages may not be recovered (depending on the configuration, of course). So, even if I route the same message to different queues on different nodes, I still have to figure out how to use them in order to continue consuming messages.
I wonder if it is possible to handle this failover scenario without using mirrored queues. Say I would like to switch to a new node in case of a failure and continue to consume from the same queue. The publisher just uses a routing key, so its messages can go into more than one queue, but the same is not possible for the consumers.
In short, what can I do to cope with failures in the environment described in the first paragraph? Is queue mirroring the best approach, despite the performance penalty in the cluster, or does a more practical solution exist?
Data replication (mirrored queues in RabbitMQ) is the standard approach to achieving high availability. I suggest using it. If you don't replicate your data, you will lose it.
If you are worried about performance, note that RabbitMQ does not scale well.
The only ways I know to improve performance are to make your nodes bigger or to create a second cluster. Adding nodes to a cluster does not really improve things. Also, if you are planning to use TLS, it will decrease throughput significantly as well. If you have a high throughput requirement plus HA, I'd consider Apache Kafka.
If your use case allows you not to care about HA, then just re-declare queues/exchanges whenever your consumers/publishers connect to the broker, which is absolutely fine. When you declare a queue that already exists with the same properties, nothing bad happens: the queue won't be purged, and the same goes for exchanges.
Also, check out the RabbitMQ sharding plugin; maybe that will do for your use case.
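The "re-declare on every connect" approach could look roughly like the sketch below. It assumes a pika-style channel object (exchange_declare, queue_declare and queue_bind with these keyword arguments); the exchange and queue names are invented, and RecordingChannel is only a stand-in so the example runs without a broker.

```python
class RecordingChannel:
    """Stand-in for a real channel: records calls instead of talking to a broker."""

    def __init__(self):
        self.calls = []

    def exchange_declare(self, exchange, exchange_type="direct", durable=False):
        self.calls.append(("exchange_declare", exchange, exchange_type, durable))

    def queue_declare(self, queue, durable=False):
        self.calls.append(("queue_declare", queue, durable))

    def queue_bind(self, queue, exchange, routing_key=None):
        self.calls.append(("queue_bind", queue, exchange, routing_key))


def declare_topology(channel, exchange, bindings):
    """Declare a durable direct exchange and its queues; meant to run on every (re)connect.

    `bindings` maps queue name -> routing key. Declarations must use the same
    properties each time, otherwise the broker rejects the declaration.
    """
    channel.exchange_declare(exchange=exchange, exchange_type="direct", durable=True)
    for queue, routing_key in bindings.items():
        channel.queue_declare(queue=queue, durable=True)
        channel.queue_bind(queue=queue, exchange=exchange, routing_key=routing_key)


# Dry run with hypothetical names; with pika the channel would come from a real connection.
dry_run = RecordingChannel()
declare_topology(dry_run, "orders", {"orders.eu": "eu", "orders.us": "us"})
for call in dry_run.calls:
    print(call)
```

Calling declare_topology from both publishers and consumers right after connecting means neither side depends on the other having started first.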

IBM MQ Multi-Instance Queues

My company uses IBM MQ's Multi-Instance Queues right now. We would like to replicate those queues to a different data center over the WAN for disaster recovery purposes. I'm skeptical it will work, simply because of all the message traffic, and I expect that even a slight delay will cause the queues to fail.
What is the technical reason why this will not work?
Are you talking about storage replication? If so, are you planning to use synchronous or asynchronous replication?
Async will not cause any delay on the replicating end, but there will be some delay before the receiving end receives the data, depending on network distance. Your storage team should be able to tell you how many seconds the async replication delay could be.
With sync, the data is sent over the network by the replicating storage array, and a confirmation comes back over the network before the storage array tells the OS that the write was successful. To be usable, the two arrays have to be within 6 ms of each other. This type of replication adds a delay to each write equal to the network latency in ms.
MQ applications can batch messages into single units of work to improve performance when sync replication is in place, but persistent message performance will still be slower.
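The effect of that per-write delay, and of batching, can be estimated with some simple arithmetic. This is a simplified sketch: it assumes each commit is a single write that must wait for the remote confirmation, ignores concurrent writers, and uses made-up numbers (1 ms local write, 6 ms added network delay).

```python
def max_serial_commits_per_sec(local_write_ms: float, network_delay_ms: float) -> float:
    """Upper bound on back-to-back synchronous commits per second for one writer."""
    return 1000.0 / (local_write_ms + network_delay_ms)


def cost_per_message_ms(local_write_ms: float, network_delay_ms: float, batch_size: int) -> float:
    """Write cost per message when batch_size messages are committed as one unit of work."""
    return (local_write_ms + network_delay_ms) / batch_size


# Hypothetical figures: 1 ms local write, 6 ms added by synchronous replication.
print(max_serial_commits_per_sec(1.0, 0.0))   # no replication
print(max_serial_commits_per_sec(1.0, 6.0))   # with sync replication
print(cost_per_message_ms(1.0, 6.0, batch_size=10))
```

In this model the replication delay caps a single writer's commit rate, and batching spreads that fixed cost over more messages.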
How do you define "slight delay" in your statement?
Async replication will cause a delay, and the RPO will not be zero. Your storage team can advise on the RPO value. If that is not acceptable, async replication is not an option for you.
It is a pragmatic choice from a cost and distance standpoint, but it could cause duplicate or missing transactions.
For sync replication, the distance between data centers is limited (apart from the performance hit on the primary DC). Check with your storage team on the distance limit.

RabbitMQ clustering

I have created a RabbitMQ cluster on a single Windows machine with the HA policy set to all, with two DISC nodes, two RAM nodes and 1 STAT node. I then ran PerfTest (the RabbitMQ client test utility) and the results were disappointing: around 5000 msg/sec. But when I ran the same test against a single RabbitMQ node, it gave a good result, i.e. 25000 msg/sec. I can't work out what is going wrong; I expected the cluster to be faster, but it is the opposite. Has anyone encountered the same thing, or does anyone know the reason behind it?
Thanks
A RabbitMQ Cluster with Mirrored Queues won't go faster than a single node. Why? Clustering is there to improve reliability and fault tolerance, not to improve throughput.
What's the reason for this? When you enable mirrored queues, RabbitMQ needs to coordinate state between nodes, that is, it needs to coordinate publishes, consumers and acks, to not deliver the same message more than once, or to more than one consumer. All this coordination affects performance, but that's the tradeoff with this kind of replication.
If you need decentralised replication, then you could use the Federation Plugin.
The throughput rate depends on a couple of factors. In our perf tests for RabbitMQ in a cluster, we observed that the rate varied depending on whether the RabbitMQ nodes were DISC or RAM, but a big chunk of the performance variation came from running the RabbitMQ cluster with mirrored queues vs without. With mirroring enabled we were seeing a rate of 3500 msg/sec, while without it the rate was 5000 msg/sec. Also, what is your message size when you run PerfTest?
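For comparison, the figures quoted in this thread can be turned into relative drops. The helper below is just arithmetic on those reported numbers; they come from different setups and are not a general rule of thumb.

```python
def percent_drop(baseline: float, observed: float) -> float:
    """Relative throughput loss, in percent, going from baseline to observed."""
    return (baseline - observed) * 100.0 / baseline


# Mirroring off vs on, as reported in the answer above (5000 vs 3500).
print(percent_drop(5000, 3500))   # 30.0
# Single node vs single-machine cluster, as reported in the question (25000 vs 5000).
print(percent_drop(25000, 5000))  # 80.0
```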
As is typical with RabbitMQ, it really depends. Here are a few ways that I have found to improve performance with RabbitMQ clustering:
Push the messages to a set of appropriately sized memory nodes only using a load balancer
Keep the message size very small
Do not use AMQP transactions or Publisher Confirms
Only use HA mirrored queues for the small set of queues whose data you absolutely must keep
Set a TTL on all messages or queues using a policy
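The last two tips are usually applied through policies. The sketch below only builds the JSON bodies that could be sent to the management HTTP API (PUT /api/policies/{vhost}/{name}); it does not contact a broker. The patterns, the 60-second TTL and the "critical." queue prefix are assumptions for illustration, and the ha-* keys apply only to RabbitMQ versions that still support classic mirrored queues.

```python
import json


def policy_body(pattern: str, definition: dict, priority: int = 0) -> str:
    """Build the JSON body for a queue policy (sketch; nothing is sent anywhere)."""
    return json.dumps(
        {
            "pattern": pattern,
            "definition": definition,
            "priority": priority,
            "apply-to": "queues",
        },
        sort_keys=True,
    )


# Hypothetical catch-all policy: expire messages after 60 seconds on every queue.
ttl_policy = policy_body(".*", {"message-ttl": 60000})

# Hypothetical policy for the few queues that must be mirrored. Only one policy
# applies to a queue at a time (the highest priority wins), so the TTL is
# repeated here instead of relying on the catch-all policy.
ha_policy = policy_body(
    r"^critical\.",
    {"ha-mode": "exactly", "ha-params": 2, "message-ttl": 60000},
    priority=1,
)

print(ttl_policy)
print(ha_policy)
```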
Just to add to the above comments, for reference:
http://www.rabbitmq.com/blog/2012/05/11/some-queuing-theory-throughput-latency-and-bandwidth/
http://www.rabbitmq.com/blog/2012/04/25/rabbitmq-performance-measurements-part-2/
The problem is that you are running a cluster on the same machine with the same resources.
The purpose of a rabbit cluster is to scale out and not scale in.
In other words, to have more network connections available, more disk capacity and, of course, more CPU power to handle more messages.
When adding nodes on a single machine, you don't scale your resources, and you add the overhead of running a cluster (as stated above).

Having one slow rabbit mq consumer slows down other consumer

I have one RabbitMQ publisher that publishes to a direct exchange. There are multiple RabbitMQ consumers bound to the direct exchange with different routing keys.
A few of these consumers might take more time to process a message.
My question is: does one slow consumer affect the performance of the other consumers, even though they are bound with different routing keys?
One slow consumer will have no effect on other consumers. Each consumer is independent and can work as fast or as slow as necessary for your application.
It will affect other consumers in the worst case, where that consumer's queue backs up so badly that you hit the server's memory watermark. If that happens, though, you need to review what's going on in your system for such a situation to arise.
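How quickly a backed-up queue could reach the memory watermark can be estimated roughly. The sketch below is a simplification: it assumes all queued messages stay in memory, ignores paging to disk and per-message overhead, and every number in it is hypothetical.

```python
def seconds_until_watermark(publish_rate, consume_rate, avg_msg_bytes,
                            watermark_bytes, current_bytes=0):
    """Rough time until queued messages fill memory up to the watermark.

    Rates are in messages per second. Returns None if the consumer keeps up.
    """
    growth = (publish_rate - consume_rate) * avg_msg_bytes  # backlog growth in bytes/sec
    if growth <= 0:
        return None  # the backlog is not growing
    return max(watermark_bytes - current_bytes, 0) / growth


# Hypothetical load: 1000 msg/s published, 200 msg/s consumed, 1 KiB messages,
# and a 2 GiB memory watermark.
eta = seconds_until_watermark(1000, 200, 1024, 2 * 1024**3)
print(f"watermark reached after about {eta / 60:.0f} minutes")
```

A short estimate like this is a hint to bound the slow consumer's queue, for example with a TTL or a length limit, before it can affect the rest of the broker.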

How distributed should queues be in a RabbitMQ cluster?

Assume you have a small RabbitMQ system of 3 nodes that is supposed to handle 100+ decently high volume queues bound to the same exchange. Given that queues only exist on the node they are created on (we're not using replicated, High Availability queues), what's the best way to create the queues? Is there any benefit to having the queues distributed among the cluster nodes, or is it better to keep them all on one node and have RabbitMQ do the routing?
It depends on your application, really.
RabbitMQ is smart about sending messages, so it'll only send a message to a node in the cluster if
a queue that holds that message resides on that node or
if a consumer has connected to that node and has requested the message.
In general, you should aim to declare queues on the nodes that both the publishers and the consumers for that queue will connect to. In other words, you should aim to connect publishers and consumers to the node that holds the queues they use. This assumes you're trying to conserve the overall bandwidth used.
If you're using clustering to improve throughput (and you probably are), and you don't care about the internal bandwidth used, you should aim to connect your publishers/consumers to the nodes in a balanced way and not worry about the internal routing mechanisms.
One last thing to think about is memory and disk space. Queues store messages in main memory and fall back to disk if that's insufficient. So, if you declare all your queues in one place, you'll end up with one node that's "over-worked" and two nodes with memory to spare.
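One simple way to avoid the single over-worked node is to plan the placement up front. This sketch assigns queue names to nodes round-robin; the node and queue names are made up, and it assumes a non-replicated queue lives on the node through which it is declared, so each queue would then be declared over a connection to its assigned node.

```python
from itertools import cycle


def spread_queues(queue_names, nodes):
    """Assign queues to nodes round-robin and return {node: [queue, ...]}."""
    placement = {node: [] for node in nodes}
    for name, node in zip(queue_names, cycle(nodes)):
        placement[node].append(name)
    return placement


# Hypothetical cluster of 3 nodes and 100 queues.
nodes = ["rabbit@node1", "rabbit@node2", "rabbit@node3"]
queues = [f"work.{i}" for i in range(100)]
plan = spread_queues(queues, nodes)
for node, assigned in plan.items():
    print(node, len(assigned))
```

Round-robin only balances queue counts; if a few queues carry most of the traffic, weighting by expected volume would be a better fit.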
As part of a move towards redundancy and failover in an application I'm working on, I've just finished setting up a RabbitMQ cluster behind a proxy, and have all of my publishers and consumers connect via the proxy, which round robins connections to the individual nodes as they come in from the clients. Prior to upgrading RabbitMQ to 2.7.1, this seemed to pretty evenly distribute queues to the separate nodes, though this would of course depend pretty heavily on how your proxy balances the requests and when your clients try to connect (and declare a queue)...
Having said all that, I just upgraded to RabbitMQ 2.7.1, which was pretty painless, and gave us HA queues, which is a pretty big win for our apps. At any rate, if you're interested in the set up, and think it would be of benefit to your queue problem, I'd be happy to share the setup.