I was reading through the RabbitMQ documentation on their website and came across two terms that seem to do the same thing: "Durable Queues" and "Disk Node". As per the documentation, if I make a node a Disk Node, all data is stored on disk except messages, message store indices, queue indices and other node state (I'm not sure what the other node state is).
So, if I make my node a Disk Node, do I still need to mark my queues as durable for them to survive broker restarts?
The same question applies to durable exchanges.
Disk nodes and durable queues are two different concepts within RabbitMQ.
RabbitMQ maintains certain internal information (such as users, passwords, vhosts, ...) within specific mnesia tables. Disk nodes store these tables on disk. As the related documentation states:
This does not include messages, message store indices, queue indices and other node state.
To ensure durability/persistence of exchanges, queues, or messages, you need to state it explicitly when you declare or publish them.
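For example, with the Python pika client, durability is requested at declare time and message persistence at publish time. This is only a minimal sketch; the exchange, queue, and connection details are illustrative:

```python
import pika

# Connect to a broker (host is illustrative).
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Durable exchange and queue: their definitions survive a broker restart.
channel.exchange_declare(exchange="orders", exchange_type="direct", durable=True)
channel.queue_declare(queue="orders.created", durable=True)
channel.queue_bind(queue="orders.created", exchange="orders", routing_key="created")

# Persistent message: delivery_mode=2 asks the broker to write it to disk,
# so it can survive a restart together with the durable queue it sits in.
channel.basic_publish(
    exchange="orders",
    routing_key="created",
    body=b"order #42 created",
    properties=pika.BasicProperties(delivery_mode=2),
)

connection.close()
```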
What scaling options can we use if RabbitMQ metrics reach a threshold? I have a VM on which RabbitMQ is running. If the queue length exceeds 90% of its maximum, can we increase the instance count by one and add a separate queue whose messages are processed on a priority basis?
In short, what scaling options do we have for RabbitMQ, based on different parameters?
Take a look at the RabbitMQ Sharding Plugin.
From their README:
RabbitMQ Sharding Plugin
This plugin introduces the concept of sharded queues for RabbitMQ.
Sharding is performed by exchanges, that is, messages will be
partitioned across "shard" queues by one exchange that we should
define as sharded. The machinery used behind the scenes implies
defining an exchange that will partition, or shard messages across
queues. The partitioning will be done automatically for you, i.e: once
you define an exchange as sharded, then the supporting queues will be
automatically created on every cluster node and messages will be
sharded across them.
Auto-scaling
One interesting property of this plugin, is that if you add more nodes
to your RabbitMQ cluster, then the plugin will automatically create
more shards in the new node. Say you had a shard with 4 queues in node
a and node b just joined the cluster. The plugin will automatically
create 4 queues in node b and join them to the shard partition.
Already delivered messages will not be rebalanced, but newly arriving
messages will be partitioned to the new queues.
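As a rough sketch of how this might look from a client (assuming the plugin is enabled; the exchange name, policy name, shard settings, and credentials are placeholders based on the plugin's README): the exchange is declared with one of the plugin's hashing types, and a matching policy marks it as sharded, for example via the management HTTP API.

```python
import pika
import requests

# Declare an exchange with a hashing type provided by the sharding plugin.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="shard.images", exchange_type="x-modulus-hash")
connection.close()

# Apply a policy that marks matching exchanges as sharded
# (policy name, pattern, shard count, and credentials are placeholders).
requests.put(
    "http://localhost:15672/api/policies/%2F/images-shard",
    auth=("guest", "guest"),
    json={
        "pattern": "^shard\\.",
        "definition": {"shards-per-node": 2, "routing-key": "1234"},
        "apply-to": "exchanges",
    },
)
```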
I am new to RabbitMQ. I wanted to know how memory is used in case of HA.
For example, in Kafka a partition uses a certain amount of memory whether or not it contains data, and so do its replicas. In RabbitMQ, how is memory allocated to queues? How does HA work? Do the mirrored queues occupy the same amount of memory on each replicating node?
Queues in RabbitMQ don't need a lot of resources per se, but messages will be kept in memory in most cases. When a message is sent to a queue that has mirrors, it is replicated to the other nodes defined by the mirroring policy. The idea of mirrored queues is to provide high availability: if the broker hosting the master queue crashes, a new master is elected among the surviving mirrors. The switch to the new node should happen quite fast, because all messages are already in place and ready to be consumed.
Simple example:
The cluster consists of three nodes. The test queue was created on the node-1.rabbitmq node, and a mirroring policy was applied to replicate messages to all nodes. Approximately 70k messages were then sent to the test queue; the RabbitMQ management tool showed that every node received the messages and kept them in memory.
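For reference, a mirroring policy of that kind can be defined through the management HTTP API. This is only a sketch; the policy name, pattern, host, and credentials below are assumptions:

```python
import requests

# Mirror every queue whose name starts with "test" to all cluster nodes.
# "%2F" is the URL-encoded default vhost "/".
requests.put(
    "http://node-1.rabbitmq:15672/api/policies/%2F/mirror-test-queues",
    auth=("guest", "guest"),
    json={
        "pattern": "^test",
        "definition": {"ha-mode": "all", "ha-sync-mode": "automatic"},
        "apply-to": "queues",
    },
)
```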
Memory consumption of RabbitMQ is a tricky topic and there are many factors which can affect it (type of the queue, the amount of messages in other queues, reaching the defined limits, etc.). In the official documentation it is stated:
RabbitMQ can report on its own memory use, to let you see where your system is using memory. Note that all measurements are somewhat approximate, based on values returned by the underlying Erlang virtual machine; however they should still be accurate enough to be useful.
Just as the title says: when a queue is declared on a server that is part of a group of clustered nodes, is it physically located on a single server, or is it physically spread over the nodes and only logically considered to be on one server?
Quote from the RabbitMQ docs:
All data/state required for the operation of a RabbitMQ broker is
replicated across all nodes. An exception to this are message
queues, which by default reside on one node, though they are visible
and reachable from all nodes.
So unless the queues are mirrored, they are on one node (for mirroring queues see here).
I want to build a RabbitMQ system which is able to scale out for the sake of performance.
I've gone through the official RabbitMQ Clustering documentation. However, clustering doesn't seem to provide scalability, because publishing/consuming only goes through the master queue, even though the master queue is reachable from any node of the cluster. We can't process any publish/consume on a node other than the one where the master queue resides.
Why do we cluster then?
Why do we cluster then?
To ensure availability.
To enforce data replication.
To spread the load/data across queues on different nodes. Master queues can be stored on different nodes and replicated with a factor smaller than the number of cluster nodes.
Other than the node on which a master queue resides, we can't process
any publish/consume.
A client can connect to any node of the cluster. That node will forward the request to the node hosting the master queue and relay the response back. As a downside, this generates an extra hop.
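For illustration, with pika a client can simply be given a list of cluster nodes and connect to whichever one answers (host names are placeholders):

```python
import pika

# Any node of the cluster accepts the connection; if the queue's master lives
# on another node, the connected node forwards operations to it (the extra hop).
nodes = [
    pika.ConnectionParameters(host="rabbit-node-1"),
    pika.ConnectionParameters(host="rabbit-node-2"),
    pika.ConnectionParameters(host="rabbit-node-3"),
]
connection = pika.BlockingConnection(nodes)  # tries the nodes in order
channel = connection.channel()
```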
To answer the question in the title, "Does RabbitMQ Clustering include scalability too?": yes, it does; this is achieved simply by adding nodes to (or removing nodes from) the cluster. Of course, you also have to consider high availability, that is, queue and exchange mirroring, etc.
And just to make something clear regarding:
However, its clustering doesn't seem to support scalability. That's
because only through master queue we can publish/consume, even though
the master queue is reachable from any node of a cluster.
Publishing is done to an exchange; queues have nothing to do with publishing. A publishing client publishes only to an exchange with a routing key; it doesn't need any knowledge of the queues.
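A short pika sketch of that separation (the exchange, queue, and routing-key names are made up): the publisher only names an exchange and a routing key, while a consumer declares and binds whatever queue it wants independently.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="events", exchange_type="topic", durable=True)

# Publisher side: only the exchange and routing key are named, no queue.
# If no queue is bound yet, the exchange simply drops the message.
channel.basic_publish(exchange="events", routing_key="user.signup", body=b"payload")

# Consumer side (could be a completely separate process): it declares its own
# queue and binds it to the exchange; the publisher never knows about it.
channel.queue_declare(queue="signup-handler", durable=True)
channel.queue_bind(queue="signup-handler", exchange="events", routing_key="user.*")

connection.close()
```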
The Scenario:
We have multiple nodes distributed geographically on which we want to have queues collecting messages for that location. And then we want to send this collected data from every queue in every node to their corresponding queues in a central location. In the central node, we will pull out data collected in the queues (from other nodes), process it and store it persistently.
Constraints:
Data is very important to us. Therefore, we have to make sure that we are not losing data in any case.
Therefore, we need persistent queues on every node so that even if the node goes down for some random reason, when we bring it up we have the collected data safe with us and we can send it to the central node where it can be processed.
Similarly, if the central node goes down, the data must remain at all the other nodes so that when the central node comes up we can send all the data to the central node for processing.
Also, the data on the central node must not get duplicated or stored twice; that is, data collected on one of the nodes should be stored on the central node only once.
The data that we are collecting is very important to us and the order of data delivery to the central node is not an issue.
Our Solution
We have considered a couple of solutions, of which I am going to list the one we thought would be the best. A possible solution (in our opinion) is to use Redis to maintain the queues everywhere, because Redis provides persistent storage. Then perhaps have a daemon running on all the geographically separated nodes which reads the data from the queue and sends it to the central node. The central node, on receiving the data, sends an ACK to the node it received the data from (because data is very important to us), and only on receiving the ACK does the node delete the data from its queue. Of course, there will be a timeout period within which the ACK must be received.
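A rough sketch of the daemon side of that protocol (the list names, the HTTP endpoint of the central node, and the use of Redis's RPOPLPUSH as the in-flight buffer are our assumptions, not a built-in Redis synchronization feature):

```python
import redis
import requests

r = redis.Redis(host="localhost", port=6379)

while True:
    # Atomically move one message to a per-node "processing" list so it is not
    # lost if this daemon dies between reading it and getting the ACK.
    raw = r.rpoplpush("collected", "processing")
    if raw is None:
        break  # queue drained

    try:
        # Forward to the central node; a 2xx response plays the role of the ACK.
        resp = requests.post("http://central-node/ingest", data=raw, timeout=30)
        resp.raise_for_status()
    except requests.RequestException:
        # No ACK within the timeout: push the message back for a later retry.
        r.rpush("collected", raw)
        r.lrem("processing", 1, raw)
        break

    # ACK received: the message is safe on the central node, delete our copy.
    r.lrem("processing", 1, raw)
```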
The Problem
The solution stated above will (we think) work fine, but the issue is that we don't want to implement the whole synchronization protocol ourselves, for the simple reason that we might get it wrong. We were unable to find this particular kind of synchronization built into Redis, so we are open to other message queues such as RabbitMQ, ZeroMQ, etc. Again, we were not able to figure out whether we can do this with those solutions.
Do these Message Queues or any other data store provide features that can be the solution to our problem? If yes, then how?
If not, then is our solution good enough?
Can anyone suggest a better solution?
Can there be a better way to do this?
What would be the best way to make it fail safe?
The data that we are collecting is very important to us and the order of data delivery to the central node is not an issue.
You could do this with RabbitMQ by setting up the central node (or cluster of nodes) to be a consumer of messages from the other nodes, and using the message acknowledgement feature. This feature means that the central node(s) can ack delivery, so that other nodes only delete messages after the ack. See for example: http://www.rabbitmq.com/tutorials/tutorial-two-python.html
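A minimal consumer sketch for the central node along the lines of that tutorial (the queue name and the storage function are placeholders):

```python
import pika

def store_persistently(body):
    # Placeholder: write the received data to the central node's persistent store.
    print("storing", body)

def handle(ch, method, properties, body):
    store_persistently(body)
    # Ack only after the data is safely stored; unacked messages are redelivered
    # if this consumer dies first (auto_ack is False by default).
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="collected-data", durable=True)
channel.basic_consume(queue="collected-data", on_message_callback=handle)
channel.start_consuming()
```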
If you have further questions, please email the rabbitmq-discuss mailing list.