How to configure RabbitMQ using an active/passive high availability architecture

I'm trying to set up a cluster of RabbitMQ servers to get highly available queues, using an active/passive server architecture. I'm following these guides:
http://www.rabbitmq.com/clustering.html
http://www.rabbitmq.com/ha.html
http://karlgrz.com/rabbitmq-highly-available-queues-and-clustering-using-amazon-ec2/
My requirement for high availability is simple: I have two nodes (CentOS 6.4) with RabbitMQ (v3.2) and Erlang R15B03. Node1 must be the "active" node, responding to all requests, and Node2 must be the "passive" node that has all the queues and messages replicated from Node1.
To do that, I have configured the following:
Node1 with RabbitMQ working fine in non-cluster mode
Node2 with RabbitMQ working fine in non-cluster mode
The next thing I did was create a cluster between both nodes, joining Node2 to Node1 (guide 1). After that I configured a policy to mirror the queues (guide 2), replicating all queues and messages across all nodes in the cluster. This works: I can connect to either node and publish or consume messages while both nodes are available.
The problem occurs with a queue "queueA" that was created on Node1 (the master for queueA): when Node1 is stopped, I can't connect to queueA on Node2 to produce or consume messages. Node2 throws an error saying that Node1 is not accessible (I think queueA is not replicated to Node2, and Node2 cannot be promoted to master of queueA).
The error is:
{"The AMQP operation was interrupted: AMQP close-reason, initiated by
Peer, code=404, text=\"NOT_FOUND - home node 'rabbit#node1' of durable
queue 'queueA' in vhost 'app01' is down or inaccessible\", classId=50,
methodId=10, cause="}
The sequence of steps used is:
Node1:
1. rabbitmq-server -detached
2. rabbitmqctl start_app
Node2:
3. Copy .erlang.cookie from Node1 to Node2
4. rabbitmq-server -detached
Join the cluster (Node2):
5. rabbitmqctl stop_app
6. rabbitmqctl join_cluster rabbit@node1
7. rabbitmqctl start_app
Configure Queue mirroring policy:
8. rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
Note: The pattern used for queue names is "" (all queues).
When I run 'rabbitmqctl list_policies' and 'rabbitmqctl cluster_status', everything looks OK.
Why can't Node2 respond when Node1 is unavailable? Is there something wrong with this setup?

You haven't specified the virtual host (app01) in your set_policy call, thus the policy will only apply to the default virtual host (/). This command line should work:
rabbitmqctl set_policy -p app01 ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
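To confirm the policy landed on the intended vhost, a quick check (assuming the vhost name app01 from the error message) is:
rabbitmqctl list_policies -p app01
If the vhost was the issue, the policy should now show up there instead of only on the default vhost.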

In the web management console, is queueA listed as Node1 +1?
It sounds like there might be some issue with your setup. I've got a set of Vagrant boxes that are pre-configured to work as a cluster; it might be worth trying those to help identify issues in your setup.

Only mirrors that are synchronised with the master are promoted to master after a failure. This is the default behaviour, but it can be changed by setting ha-promote-on-shutdown to always.
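A sketch of such a policy, reusing the vhost and pattern from the question (and assuming a RabbitMQ version recent enough to support ha-promote-on-shutdown), would be:
rabbitmqctl set_policy -p app01 ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic","ha-promote-on-shutdown":"always"}'
Note that this trades safety for availability: an unsynchronised mirror that gets promoted may be missing messages.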

Read your reference carefully:
http://www.rabbitmq.com/ha.html
You could use a cluster of RabbitMQ nodes to construct your RabbitMQ
broker. This will be resilient to the loss of individual nodes in
terms of the overall availability of service, but some important
caveats apply: whilst exchanges and bindings survive the loss of
individual nodes, queues and their messages do not. This is because a
queue and its contents reside on exactly one node, thus the loss of a
node will render its queues unavailable.

Make sure that your queue is not durable or exclusive.
From the documentation (https://www.rabbitmq.com/ha.html):
Exclusive queues will be deleted when the connection that declared them is closed. For this reason, it is not useful for an exclusive
queue to be mirrored (or durable for that matter) since when the node
hosting it goes down, the connection will close and the queue will
need to be deleted anyway.
For this reason, exclusive queues are never mirrored (even if they
match a policy stating that they should be). They are also never
durable (even if declared as such).
From your error message:
{"The AMQP operation was interrupted: AMQP close-reason, initiated by
Peer, code=404, text=\"NOT_FOUND - home node 'rabbit#node1' of
durable queue 'queueA' in vhost 'app01' is down or inaccessible\", classId=50, methodId=10, cause="}
It looks like you created a durable queue.
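One way to check what was actually declared and mirrored (a sketch, assuming the vhost app01) is to list the queue properties and mirror status:
rabbitmqctl list_queues -p app01 name durable policy slave_pids synchronised_slave_pids
A queue with an empty slave_pids column has no mirrors on other nodes, regardless of what the policy says.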

Related

TCP connection succeeded, Erlang distribution failed

We installed the Erlang VM (erlang-23.2.1-1.el7.x86_64.rpm) and RabbitMQ server (rabbitmq-server-3.8.19-1.el7.noarch.rpm) on 3 different machines and successfully started the RabbitMQ server as a standalone node on each of the 3 machines. But when we try to cluster these RabbitMQ nodes we get an "Erlang distribution failed" error. Searching suggests it might be due to an Erlang cookie mismatch. Can anyone help us solve this mismatch issue, if that is the root cause?
Error message :
Error: unable to perform an operation on node 'rabbit@keng03-dev01-ins01-dmq67-app-1627533565-1'. Please see diagnostics information and suggestions below.
The most common reasons for this are:
Target node is unreachable (e.g. due to hostname resolution, TCP connection, or firewall issues)
CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
Target node is not running
In addition to the diagnostics info below:
See the CLI, clustering, and networking guides on https://rabbitmq.com/documentation.html to learn more
Consult server logs on node rabbit@keng03-dev01-ins01-dmq67-app-1627533565-1
If a target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
attempted to contact: ['rabbit@keng03-dev01-ins01-dmq67-app-1627533565-1']
rabbit@keng03-dev01-ins01-dmq67-app-1627533565-1:
connected to epmd (port 4369) on keng03-dev01-ins01-dmq67-app-1627533565-1
epmd reports node 'rabbit' uses port 25672 for inter-node and CLI tool traffic
TCP connection succeeded but Erlang distribution failed
suggestion: check if the Erlang cookie is identical for all server nodes and CLI tools
suggestion: check if all server nodes and CLI tools use consistent hostnames when addressing each other
suggestion: check if inter-node connections may be configured to use TLS. If so, all nodes and CLI tools must do that
suggestion: see the CLI, clustering, and networking guides on https://rabbitmq.com/documentation.html to learn more
Current node details:
node name: 'rabbitmqcli-616-rabbit@keng03-dev01-ins01-dmq67-app-1627533565-2'
effective user's home directory: /var/lib/rabbitmq
Erlang cookie hash: AFJEXwyuc44Sp8oYi00SOw==
I had the same error description. In my case the Erlang cookies matched across the cluster nodes, but I ran into a case-sensitivity issue with the rabbitmqctl join_cluster command.
With an elevated command prompt on host 2179NBXXXDP
this failed: rabbitmqctl join_cluster rabbit@2179ASXXX02
and this worked: rabbitmqctl join_cluster rabbit@2179asxxx02
(The hostname of the latter indeed turned out to be lowercase in my case.)
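To see the exact node name (including its case) as each running broker reports it, a quick check is to evaluate an Erlang expression on the node (a sketch using rabbitmqctl eval):
rabbitmqctl eval 'node().'
Comparing that output on both hosts with the name passed to join_cluster should expose any hostname or case mismatch; the Erlang cookie hash printed in the DIAGNOSTICS section can be compared across nodes in the same way.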

cluster_formation.classic_config.nodes does not work in RabbitMQ

I have 2 RabbitMQ nodes.
Their node names are rabbit@testhost1 and rabbit@testhost2.
I'd like them to auto-cluster.
On testhost1
# cat /etc/rabbitmq/rabbitmq.conf
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = rabbit@testhost1
cluster_formation.classic_config.nodes.2 = rabbit@testhost2
On testhost2
# cat /etc/rabbitmq/rabbitmq.conf
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = rabbit@testhost1
cluster_formation.classic_config.nodes.2 = rabbit@testhost2
I start rabbit@testhost1 first and then rabbit@testhost2.
The second node didn't join the cluster of the first node.
However, node rabbit@testhost1 can join rabbit@testhost2 with the rabbitmqctl command: rabbitmqctl join_cluster rabbit@testhost2.
So the network between them should not be the problem.
Could you give me some idea why they can't form a cluster? Is the configuration not correct?
I have enabled debug logging, and the info related to rabbit_peer_discovery_classic_config is minimal:
2019-01-28 16:56:47.913 [info] <0.250.0> Peer discovery backend rabbit_peer_discovery_classic_config does not support registration, skipping registration.
The RabbitMQ version is 3.7.8.
Did you start the nodes without cluster config before you attempted clustering?
I had started the individual peers with the default configuration once before I added cluster formation settings to the config file. By starting a node without clustering config, it seems to form a cluster of its own, and on further starts it will only contact the last known cluster (itself).
From https://www.rabbitmq.com/cluster-formation.html
How Peer Discovery Works
When a node starts and detects it doesn't have a previously initialised database, it will check if there's a peer discovery mechanism configured. If that's the case, it will then perform the discovery and attempt to contact each discovered peer in order. Finally, it will attempt to join the cluster of the first reachable peer.
You should be able to reset the node with rabbitmqctl reset (Warning: this removes all data from the management database, such as configured users and vhosts, and deletes all persistent messages along with clustering information) and then use the clustering config, as sketched below.
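A minimal sequence on the node that should join (a sketch; as noted above, the reset wipes that node's data):
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl start_app
With the cluster_formation settings already in rabbitmq.conf, the freshly reset node should run peer discovery again on start and join rabbit@testhost1.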

ActiveMQ network subscription issue

I have a strange behavior in ActiveMQ with network connectors. Here is the setup:
Broker A listening for connections via a nio transport connector on 61616
Broker B establishing a duplex connection to broker A
A producer on A sends messages to a known queue, say q1
A consumer on B subscribes to the queue q1
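For reference, a duplex network connector of this kind is typically declared roughly as follows (a sketch with a hypothetical hostname, not the exact configuration used here):
On broker A: <transportConnector name="nio" uri="nio://0.0.0.0:61616"/>
On broker B: <networkConnector name="toA" uri="static:(tcp://brokerA:61616)" duplex="true"/>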
I can clearly see that the duplex connection is established but the consumer on B doesn't receive any message.
In jconsole I can see that broker A is sending messages, up to the value of the prefetch limit (1000 messages), to the network consumer, which seems fine. The "DispatchedQueue", "DispatchedQueueSize", and more importantly the "MessageCountAwaitingAck" counters have the same value: they are stuck at 1000.
On the broker B, the queue size is 0.
At the system level, I can clearly see an established connection between broker A and broker B:
# On broker A (192.168.x.x)
$ netstat -t -p -n
tcp 89984 135488 192.168.x.x:61616 172.31.x.x:57270 ESTABLISHED 18591/java
# On broker B (172.31.x.x)
$ netstat -t -p -n
tcp 102604 101144 172.31.x.x:57270 192.168.x.x:61616 ESTABLISHED 32455/java
Weird thing: the recv-q and send-q on both brokers A and B seem to have some data not read by the other side. They don't increase or decrease; they are just stuck at these values.
The ActiveMQ logs on both sides don't say much, even in TRACE level.
It seems like neither broker A nor broker B is sending acks for the messages to the other side.
How is that possible? What's a potential cause and fix?
EDIT: I should add that I'm using an embedded ActiveMQ 5.13.4 on both sides.
Thanks.

activemq master not giving up on network failure

I have an ActiveMQ installation with master/slave failover.
Master and Slave are synced using the lease-database-locker
Master and Slave run on 2 different machines and the database is located on a third machine.
Failover and client reconnection works properly on a forced shutdown of the master broker. The slave is taking over properly and the clients reconnect due to their failover setting.
The problems start if I simulate a network outage on the master broker only. This is done with an iptables DROP rule for packets going to the database on the master.
The master now realizes that it cannot connect to the database any longer. The slave starts up, since its network connection is still alive.
It seems from the logs that the clients still try to reconnect to the non-responding master.
To my understanding, the master should inform the clients that there is no connection anymore. The clients should fail over and reconnect to the slave.
But this is not happening.
The clients do reconnect to the slave if I re-establish the DB connection by re-enabling the network connection to the DB for the master. The master then gives up being the master.
I have set a queryTimeout on the lease-database-locker.
I have set updateClusterClients=true for the transport connector.
I have set a validationQueryTimeout of 10s on the db connection.
I have set a testOnBorrow for the db connection
Is there a way to force the master to inform the clients to failover in this particular case ?
After some digging I found the trick.
The broker was not informing the clients due to a missing ioExceptionHandler configuration.
The documentation can be found here
http://activemq.apache.org/configurable-ioexception-handling.html
I needed to specify
<bean id="ioExceptionHandler" class="org.apache.activemq.util.LeaseLockerIOExceptionHandler">
<property name="stopStartConnectors"><value>true</value></property>
<property name="resumeCheckSleepPeriod"><value>5000</value></property>
</bean>
and tell the broker to use the handler:
<broker xmlns="http://activemq.apache.org/schema/core" ....
ioExceptionHandler="#ioExceptionHandler" >
In order to produce an error on network outages I also had to set a queryTimeout on the lease query:
<jdbcPersistenceAdapter dataDirectory="${activemq.base}/data" dataSource="#mysql-ds-db01-st" lockKeepAlivePeriod="3000">
<locker>
<lease-database-locker lockAcquireSleepInterval="10000" queryTimeout="8" />
</locker>
</jdbcPersistenceAdapter>
This will produce an SQL exception if the query takes too long due to a network outage.
I did test the network by dropping packets to the database using an iptables rule:
/sbin/iptables -A OUTPUT -p tcp --destination-port 13306 -j DROP
Sounds like your client doesn't have the address of the slave in its URI, so it doesn't know where to reconnect. The master broker doesn't inform the client where the slave is, as it doesn't know whether there are slaves or where a slave might be on the network, and even if it did, that would be unreliable depending on the conditions that caused the master broker to drop in the first place.
You need to provide the client with the connection information for the master and the slave in the failover URI.
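For example, a client connection URI of this shape (with hypothetical hostnames) lists both brokers so the client can fail over on its own:
failover:(tcp://master-host:61616,tcp://slave-host:61616)?randomize=false
With randomize=false the client prefers the first broker in the list while it is reachable and falls back to the second otherwise.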

HAProxy setup on a system which does not host any RabbitMQ node

I want to set up HAProxy for RabbitMQ cluster. I have following queries on the same:
(1) Suppose I have a scenario where my RabbitMQ servers, client, and HAProxy are on different machines.
RabbitMQ node1 -> Machine1
RabbitMQ node2 -> Machine2
HAPROXY -> Machine3
RabbitMQ client -> Machine4
node1 and node2 have been clustered. Is this a correct configuration? The rationale behind this question is: can HAProxy be set up on a machine that does not host any RabbitMQ node, or does HAProxy have to be set up on a machine that hosts at least one RabbitMQ server node?
(2) If the above setup is valid, then my RabbitMQ client should only know about the HAProxy machine. In that case, how shall I connect my client to HAProxy? The client code that works when the RabbitMQ client connects directly to a machine hosting a RabbitMQ server node will not work here.
I investigated and found the answers to my questions: (1) This setup is valid in the sense that it is a possible scenario. (2) The client will connect to the HAProxy server.
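A minimal TCP-mode HAProxy configuration for this layout could look roughly like this (a sketch with hypothetical hostnames machine1/machine2; the client then opens its AMQP connection to Machine3 on port 5672 exactly as it would to a broker node):
listen rabbitmq
    bind *:5672
    mode tcp
    balance roundrobin
    timeout client 3h
    timeout server 3h
    server node1 machine1:5672 check inter 5s
    server node2 machine2:5672 check inter 5s
The long client/server timeouts are there because AMQP connections are long-lived; without them HAProxy may drop idle connections.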