I have a RabbitMQ cluster with two nodes and a listener (connected to both) that declares a queue (bound to a topic exchange). The queue was declared (apparently on node 1) and replicated to node 2 (as expected). So far so good.
But when node 1 goes down, the queue is deleted from node 2, crashing my listener. Here are the queue parameters:
exclusive="false"
durable="true"
auto-delete="false"
Clearly this does not look like correct high-availability behavior. The question is: how can I create this queue and keep it available as long as at least one node is up (no matter which one)?
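For reference, a declaration matching those parameters with the RabbitMQ Java client might look like the sketch below (queue, exchange and host names are invented for illustration). Note that durable/non-exclusive/non-auto-delete only makes the queue survive a restart of its owning node; it does not replicate the queue. Depending on the RabbitMQ version, mirroring is enabled either through an x-ha-policy queue argument or through a server-side policy (rabbitmqctl set_policy), not through these flags.

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class DeclareDurableQueue {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("rabbit-node-1"); // hypothetical node address

            Connection conn = factory.newConnection();
            Channel channel = conn.createChannel();

            // durable=true, exclusive=false, autoDelete=false, no extra arguments
            channel.queueDeclare("events.queue", true, false, false, null);
            channel.exchangeDeclare("events.topic", "topic", true);
            channel.queueBind("events.queue", "events.topic", "events.#");

            channel.close();
            conn.close();
        }
    }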
Related
Here's my current architecture
I have a bunch of IoT devices that connect through raw, duplex, persistent TCP connections to one instance of my "worker", which is connected to a RabbitMQ queue.
My publisher publishes messages that look like this:
{
"iot_device_name" : "A",
"command" : "reboot"
}
The worker is then able to map the iot_device_name to the TCP socket.
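(For illustration only: the mapping a worker keeps between device names and sockets could be as simple as a concurrent map; all names below are invented.)

    import java.io.IOException;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class DeviceRegistry {
        // iot_device_name -> open TCP socket of that device on this worker
        private final ConcurrentMap<String, Socket> devices = new ConcurrentHashMap<>();

        public void register(String deviceName, Socket socket) {
            devices.put(deviceName, socket);
        }

        public void unregister(String deviceName) {
            devices.remove(deviceName);
        }

        // Returns false if this worker does not know the device
        // (it is connected to another worker, or offline).
        public boolean forward(String deviceName, String command) {
            Socket socket = devices.get(deviceName);
            if (socket == null) {
                return false;
            }
            try {
                socket.getOutputStream().write((command + "\n").getBytes(StandardCharsets.UTF_8));
                socket.getOutputStream().flush();
                return true;
            } catch (IOException e) {
                unregister(deviceName);
                return false;
            }
        }
    }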
Everything is working nicely, but if we want to add HA and scale out a bit, it would be better to have 4 instances of the worker. Load balancing the TCP side is not a problem (with HAProxy or Nginx).
Now the problem is how to split the load on the queue side, as the set of IoT devices handled by a worker is dynamic (i.e. a device could disconnect and reconnect to another worker).
So is there a way for a worker to say: "Hmmm, no, I can't handle this message because I don't know this device, give me another one", so that another worker can then take it and handle it?
Other information that may be of help:
the workers are all in the same network, which is also the same network as the publisher
the number of workers is not dynamic, and even if we extrapolate the number of devices for the next few years, 8 workers would take us VERY FAR, as they simply route/transcode messages, so their CPU load is negligible.
So if I understand your architecture correctly, you have commands sent to your publisher on one side, which are pushed into rabbitmq.
On the consumer side, you have multiple workers, to which the messages are dispatched, and each worker has a bunch of devices connected to it.
If indeed this is your architecture, I'd propose the following for your rabbitmq configuration:
use a direct exchange
each worker has its own queue (exclusive), and manages the bindings between the exchange and its queue dynamically:
each time a device connects to a worker, that worker adds a binding between its queue and the exchange with as routing key the identifier for the device
each time a worker detects that a device is no longer connected to it, it removes the related binding from the RabbitMQ configuration
regarding the detection of disconnected devices, I'd expect it to be common that a worker only realizes a device is no longer connected to it when it receives a command to push to that device; in such cases, in addition to adapting the bindings, the worker should republish the message to the same exchange with the same routing key, so that it gets another shot at being consumed by the proper worker (see the sketch below)
I'd also consider configuring a TTL on the queues; there's no point in consuming a message that's too old
The publisher will of course also need to be adapted to publish with the intended device's identifier as the routing key
I hope the proposal here makes sense. There are a few other cases to be considered: an alternate exchange to make sure we don't lose requests if there is a (short) period during which the device hasn't reconnected to a worker yet and we get a command for it anyway, adding a property to republished messages to ensure we don't create an infinite loop in the system, ... but what is indicated above should be a reasonable starting point to achieve your goal.
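A minimal sketch of the binding management described above, using the RabbitMQ Java client. The exchange name, queue naming scheme and helper methods are assumptions, not part of the original proposal:

    import com.rabbitmq.client.AMQP;
    import com.rabbitmq.client.Channel;

    import java.io.IOException;

    public class WorkerBindings {
        private static final String EXCHANGE = "device.commands"; // assumed direct exchange name

        private final Channel channel;
        private final String queueName;

        public WorkerBindings(Channel channel, String workerId) throws IOException {
            this.channel = channel;
            channel.exchangeDeclare(EXCHANGE, "direct", true);
            // exclusive, auto-deleted queue owned by this worker instance
            this.queueName = channel.queueDeclare("worker." + workerId, false, true, true, null).getQueue();
        }

        // Called when a device opens its TCP connection to this worker.
        public void onDeviceConnected(String deviceName) throws IOException {
            channel.queueBind(queueName, EXCHANGE, deviceName);
        }

        // Called when the worker notices that a device is gone.
        public void onDeviceDisconnected(String deviceName) throws IOException {
            channel.queueUnbind(queueName, EXCHANGE, deviceName);
        }

        // Called when a command arrives for a device that is no longer connected here:
        // republish it so the worker that now owns the device gets a shot at it.
        // In practice you would also set a property/header here to break potential republish loops.
        public void republish(String deviceName, byte[] body, AMQP.BasicProperties props) throws IOException {
            channel.basicPublish(EXCHANGE, deviceName, props, body);
        }
    }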
I have a RabbitMQ cluster (without HA) set up with nodes on multiple instances. From the documentation, what I understood is that in cluster mode the queues are not mirrored, and each queue is owned by the node on which it was declared.
So, now the question is: what will happen when the node which owns the queue goes down? Correct me if I'm wrong, but since the queues are not mirrored, the client applications will throw errors about missing queues.
Should we write our own logic to figure out that the node has gone down and that the queues have to be re-declared, and in this case, what will happen to the messages?
So, now the question is, what will happen when the node which owns the queue goes down?
From the docs:
When RabbitMQ quits or crashes it will forget the queues and messages unless you tell it not to. Two things are required to make sure that messages aren't lost: we need to mark both the queue and messages as durable.
next question:
Should we write our logic to figure out if the node goes down, the queues have to be re declared and in this case, what will happen to the messages?
Yes, it's a good idea to re-declare your queues.
When your node goes down, all consumers connected to it will be disconnected. Every time a consumer connects, it should assume its queue does not exist, and so it should send a declare-queue request as its first request after connecting.
If a consumer sends a declare-queue request and the queue already exists, then:
the declaration won't affect the queue's messages in any way; if the messages were persisted, they remain in the queue.
under normal circumstances (if you don't change the queue's properties) no errors will be thrown.
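Since a declare-queue request is idempotent when the properties match, the consumer can simply re-declare its queue on every (re)connect. A minimal sketch with the RabbitMQ Java client (host and queue names are placeholders):

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class ReconnectingConsumer {
        public Channel connectAndDeclare() throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("rabbit-node-2"); // whichever node is still up
            Connection connection = factory.newConnection();
            Channel channel = connection.createChannel();
            // Safe to call even if the queue already exists with the same properties;
            // persistent messages already in the queue are left untouched.
            channel.queueDeclare("my-durable-queue", true, false, false, null);
            return channel;
        }
    }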
Scenario: Two ActiveMQ nodes A, B. No master slave, but peers, with network connectors between them.
A durable topic subscriber is registered with both (as it uses failover and at one point connects to A and at another point connects to B).
Issue: while the subscriber is online against A, a copy of each message is also placed in the offline subscription on B.
Question: Is this by design? Can this be configured so that a message is deduplicated and only delivered to the subscriber through one of the subscriptions?
Apparently by design: http://activemq.apache.org/how-do-distributed-queues-work.html
See "Distributed Topics in Store/Forward" where it says:
For topics the above algorithm is followed except, every interested client receives a copy of the message - plus ActiveMQ will check for loops (to avoid a message flowing infinitely around a ring of brokers).
We are facing a random issue with ActiveMQ and its consumers. We observe that a few consumers are not receiving messages even though they are connected to the ActiveMQ queue. But it works fine after restarting the consumer.
We have a queue named testQueue on the ActiveMQ side. A consumer is trying to dequeue messages from this queue. We are using Spring's DefaultMessageListenerContainer for this purpose. The message is being delivered to the consumer node from the ActiveMQ broker. From the tcpdump as well, it is obvious that the message reaches the consumer node, but the actual consumer code is not able to see the message. In other words, the message seems to be stuck either in the ActiveMQ consumer code or in Spring's DefaultMessageListenerContainer.
To restate the issue: the message reaches the consumer node, but it never reaches the "actual consumer class", which means it gets stuck either in the AMQ consumer code or in Spring's DMLC.
Below are the details captured from ActiveMQ admin.
Queue-Name: testQueue
Pending-Message-Count: 9
Consumer-Count: 1
Messages-Enqueued: 9
Messages-Dequeued: 0
Below are the more details.
Connection-ID: ID:bearsvir52-45176-1375519181268-3:5
SessionId: 1
Selector: (empty)
Enqueues: 9
Dequeues: 0
Dispatched: 9
Dispatched-Queue: 9
Prefetch: 250
From the second table it is obvious that messages are being delivered to the consumer, but the consumer is not acknowledging them. Hence the messages are stuck in the Dispatched-Queue on the broker side.
A few points for your notice:
1) There is no time difference between the broker node and the consumer node.
2) We observed the tcpdump on the consumer side. We can see the MessageDispatch (OpenWire) packet being transferred to the consumer node, but could not find the corresponding MessageAck (OpenWire).
3) Sometimes it works on a node, and sometimes it creates the problem on the same node.
One cause of this can be incorrectly using a CachingConnectionFactory (with cached consumers) together with a listener container that dynamically adjusts the number of consumers (max consumers > concurrent consumers). You can end up with a cached consumer just sitting in the pool, not being actively used. You never need to cache consumers with a listener container.
For problems like this, I generally recommend running with TRACE logging, so you can see all the consumer activity.
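As a rough illustration of that advice (broker URL, queue name and concurrency values below are placeholders): either give the listener container the plain connection factory, or at least disable consumer caching on the CachingConnectionFactory.

    import org.apache.activemq.ActiveMQConnectionFactory;
    import org.springframework.jms.connection.CachingConnectionFactory;
    import org.springframework.jms.listener.DefaultMessageListenerContainer;

    import javax.jms.MessageListener;

    public class ListenerConfig {
        public DefaultMessageListenerContainer container() {
            ActiveMQConnectionFactory amqFactory =
                    new ActiveMQConnectionFactory("tcp://localhost:61616");

            // Caching connections/sessions is fine, but never cache consumers:
            // the listener container manages (and dynamically scales) its own consumers.
            CachingConnectionFactory cachingFactory = new CachingConnectionFactory(amqFactory);
            cachingFactory.setCacheConsumers(false);

            DefaultMessageListenerContainer container = new DefaultMessageListenerContainer();
            container.setConnectionFactory(cachingFactory);
            container.setDestinationName("testQueue");
            container.setConcurrentConsumers(1);
            container.setMaxConcurrentConsumers(5);
            container.setMessageListener((MessageListener) message -> {
                // actual consumer logic goes here
            });
            return container;
        }
    }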
It took a lot of time to figure out the solution. There seems to be an issue with the org.apache.activemq.ActiveMQConnection.java class in the case of an AMQ failover: the connection object is not getting started on the consumer side in such cases.
Following is the fix I added in ActiveMQConnection.java, after which I compiled the sources to create activemq-core-x.x.x.jar:
    private final Object startMutex = new Object();

and added a check in the createSession method:

    public Session createSession(boolean transacted, int acknowledgeMode) throws JMSException {
        synchronized (startMutex) {
            if (!isStarted()) {
                start();
            }
        }
        // ... the rest of the original createSession body is unchanged ...
    }
We've been using Rabbit successfully for about a year. Recently we upgraded to v2.6.1 because we want to use clusters with replicated message queues.
My testing has hit a puzzling behavior that smells like a Rabbit bug to me. The test that uncovers it works against a two-node cluster. Both nodes are running v2.6.1. Both are disk nodes. Both are running on Mac OS, though I doubt this is pertinent.
I'm also running Alice on the node that runs the test. The test uses it to programmatically do a stop_app on one of the nodes, because the test is trying to validate that if the cluster master fails and a slave is promoted to take its place, we don't lose messages.
So, the test has a small thread pool, which is given tasks that periodically 1) publish messages, and 2) toggle the state of the Rabbit master node (stopped if running; started if stopped). Other threads are consuming messages from queues.
I'm using publisher confirms, and I'm also acknowledging the messages in the consumers (using autoAck=false for channel.basicConsume()).
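For context, the confirm/ack setup described here looks roughly like this in the Java client (queue name and properties are placeholders, and error handling is omitted):

    import com.rabbitmq.client.AMQP;
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.DefaultConsumer;
    import com.rabbitmq.client.Envelope;
    import com.rabbitmq.client.MessageProperties;

    import java.io.IOException;

    public class ConfirmAndAck {
        void publishWithConfirm(Channel channel, byte[] body) throws Exception {
            channel.confirmSelect(); // enable publisher confirms on this channel
            channel.basicPublish("", "test-queue", MessageProperties.PERSISTENT_BASIC, body);
            channel.waitForConfirmsOrDie(5000); // throws if the broker does not confirm in time
        }

        void consumeWithManualAck(Channel channel) throws IOException {
            channel.basicConsume("test-queue", false /* autoAck */, new DefaultConsumer(channel) {
                @Override
                public void handleDelivery(String consumerTag, Envelope envelope,
                                           AMQP.BasicProperties properties, byte[] body) throws IOException {
                    // process the message, then acknowledge it on the channel it arrived on
                    getChannel().basicAck(envelope.getDeliveryTag(), false);
                }
            });
        }
    }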
When the master node is stopped, I see both the producers and consumers catching ShutdownSignalException. They handle this by attempting to reconnect to the cluster. This works fine. When reconnected, they continue with their business.
Sometimes, what I see is that a consumer has successfully fetched a message from the broker, and is calling channel.basicAck() when it gets that ShutdownSignalException.
Later, when the consumer has reconnected, it again pulls down the same message. (The message bodies are tagged with a UUID, so I know it is the same one.) This time, when the consumer attempts to basicAck() the message, it again gets ShutdownSignalException, but this one has the following text in it: "reply-text=PRECONDITION_FAILED - unknown delivery tag 7".
In fact, that is the same delivery tag that was offered to the consumer by the broker before the master went down and the consumer reconnected.
Googling suggests that this event means that the consumer is attempting to ack the same message more than once.
But, how can this be so? If the first ack succeeded, then the message should have been removed from the broker's queues, and the consumer shouldn't see the same message again.
Yet, if the first ack did not succeed, then the consumer shouldn't be dinged for attempting to re-ack the message.
Anyone seen this before? It smells like a bug in Rabbit's replicated queues to me, but I'm still new to Rabbit, and so am willing to believe there's a subtlety here in consuming from a clustered broker that I haven't yet grokked!
Thanks, --Steve
I'm not sure if my case matches yours, but I have seen a similar "unknown delivery tag" error when acking after a reconnect, and then the same message arrived again. Initially it looked like a bug to me, but in fact this is expected behavior. A consumer with QoS > 1 may have some messages in its local buffer, and their delivery tags will all be invalid after a reconnect. On the other hand, attempting to ack even the current message after a reconnect doesn't make sense, because that message was already requeued automatically when the connection was lost, which is why I received it again.
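A hedged sketch of how a consumer can cope with this: only ack through the channel the delivery actually arrived on, and deduplicate at the application level (here by a message id, which is an assumption about how the messages are tagged):

    import com.rabbitmq.client.AMQP;
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.DefaultConsumer;
    import com.rabbitmq.client.Envelope;

    import java.io.IOException;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class ReconnectAwareConsumer extends DefaultConsumer {
        // ids of messages this application has already processed
        private final Set<String> processed = ConcurrentHashMap.newKeySet();

        public ReconnectAwareConsumer(Channel channel) {
            super(channel);
        }

        @Override
        public void handleDelivery(String consumerTag, Envelope envelope,
                                   AMQP.BasicProperties properties, byte[] body) throws IOException {
            String id = properties.getMessageId(); // assumes the publisher sets a unique message id
            boolean alreadyProcessed = id != null && !processed.add(id);
            if (!alreadyProcessed) {
                // process the message exactly once at the application level
            }
            // Delivery tags are only meaningful on the channel that delivered the message,
            // so ack through getChannel() and never through a channel kept from before a reconnect.
            getChannel().basicAck(envelope.getDeliveryTag(), false);
        }
    }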