Activemq stops working - activemq/zookeeper setup - activemq

I've configured 3 zookeepers and 3 activemq instances in 1 cluster.
Scenario
3 activemq instances with only 1 master and other two is slave.
all 3 activemq instances are running, i.e. sudo service activemq status returns running but checking the logs, 1 instance(activemq1) is currently waiting for other cluster members, 1 instance(activemq2) stops, 1 instance(activemq3) has error. Assumming that we only require two instance to elect master, this setup should be able to run successfully .
two activemq instances should be running
zookeeper instances are running fine.
Issue
Below are the stacktraces of the respective activemq instances. Based on my understanding, it needs at least two properly running activemq intances for the cluster to nominate a master instance. Given that all activemq instanes produces running when issued with sudo service activemq status , I'm assuming there is an issue inside each activemq instances - refer to below stacktraces. Now, I noticed on logs, that activemq1 only fails to be properly running since other activemq instances failed internally. Notice the stacktrace on activemq2, it's stucked after it successfully connected to zookeeper and activemq3 has issue, I still need to figure out. The issue is fixed when I restarted activemq2 and activemq3. However, I can't be sure this won't happen again, thus this question.
activem1 show the below stacktrace, which I assume that this is because the other 2 activemq instances are running but has errors
Session establishment complete on server 10.5.4.111/10.5.4.111:2181, sessionid = 0x1582db00708000c, negotiated timeout = 4000
Not enough cluster members connected to elect a master.
Not enough cluster members connected to elect a master.
Not enough cluster members connected to elect a master.
activemq2 has the below stacktrace, which is the one I don't understand. It has stopped after successful connection to zookeeper, which should be detected by other activemq instances belonging to cluster-activem1 and activemq3
Opening socket connection to server 10.5.4.111/10.5.4.111:2181
Socket connection established to 10.5.4.111/10.5.4.111:2181, initiating session
Session establishment complete on server 10.5.4.111/10.5.4.111:2181, sessionid = 0x1582db00708000d, negotiated timeout = 4000
activemq3 has the below stacktrace
org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:568)[apache-jsp-8.0.9.M3.jar:2.3]
Configuration for activemq
the previous config here is with 2s zkSessionTimeout - which is the default. I made it to 4s as per googled to maximize the time needed for an activemq instance registers itself to zookeeper.
<persistenceAdapter>
<replicatedLevelDB
directory="${activemq.data}/leveldb"
replicas="3"
bind="tcp://0.0.0.0:61619"
zkAddress="zookeeper_addresses_here"
hostname="activemq_hostname_here"
zkSessionTimeout="4s"
/>
</persistenceAdapter>
Configuration for zookeeper
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/my/data/dir
clientPort=2181
server.1=activemq1_privateIP:2888:3888
server.2=activemq2_privateIP:2888:3888
server.3=activemq3_privateIP::2888:3888
autopurge.purgeInterval=24
autopurge.snapRetainCount=5
Zookeeper version 3.4.9
ActiveMQ version 5.13.4
Setup via Opswork

The attribute "directory" master-slave mq is need to refer to the same folder

Related

ambari cluster + poor connection between ambari-agent to ambari server

we have ambari cluster with 872 data-nodes machines , when ambari version is 2.6.x
we have for now some network problem ,
after long investigation we found that , ambari agent that runs on some machine not communicate well with the ambari server
therefore we get some strange behaviors as 5 dead data-nodes from ambari dashboard , while for sure datanodes machine are healthy
is it possible to give more tolerated value in ambari agent configuration so the ack between ambari agent to ambari server will be after more little time in order to ignore the network problems ?
something like timeout or time connection between the ambari agent to ambari server
First of all, you need to get the root cause of the issue why Data Node is showing as Dead.
Ambari agent runs on every node. It is responsible for sending
metrics and heartbeat to the Ambari server which then publishes to
your Ambari web UI.
The name node waits for 10 minutes till it declares the data node as dead and copies
the blocks to other data nodes.
If it's showing that data node is dead then please check the Ambari agent status in
the specific node by running-service ambari-agent status. Parallelly you can check the ambari-agent.log in the worker node to check why Ambari agent stopped working.
You can configure your http timeouts in ambari-agents for service tasks, http timeouts
https://github.com/apache/ambari/blob/trunk/ambari-agent/conf/unix/ambari-agent.ini
There's a HTTP Timeout section you can configure it based on your network throughput.
The file should be in /etc/ambari-agent/ambari.properties

Redis cluster node failure not detected on MISCONF

We currently have a redis cache cluster with 3 masters and 3 slaves hosted on 3 windows servers (1 master/slave by server). We are using StackExhange.Redis as our client.
We have RBD disabled but AOF enabled and are experiencing some problems with the cluster in the following situation :
One of our servers became full and the redis node on this server was unable to write to the AOF file (the error returned to the client was MISCONF Errors writing to the AOF file: No space left on device).
The cluster did not detect that the node was failing and so did not exlclude it from the cluster.
All cache operations were blocked until we make some place on the server.
We know that we don't need the AOF, so we have disalbed it after the incident.
But we would like to confirm or infirm our view on redis clustering: for us, if a node was experiencing a failure, the cluster would redirect all requests to another one. We have tested that with a stopped node master, a slave is promoted into a master so we are confident that our cluster is working, but we are not sure why, in our case, the node was not marked as a failure.
Is the cluster capable of detecting a node failure when the failure is only happening when a request is made from a client to the cluster ?

ActiveMQ forwarding bridge with failover

Here is what I try to achieve with ActiveMQ:
I'd like to have 2 clusters of brokers: clusterA and clusterB. Data between these 2 clusters should be mirrored. So, when clusterA receives a message it will be stored at storageA and also this message should be forwarded to clusterB (if there is such demand) and stored in storageB. On the other hand if clusterB receives a message it should be forwarded to clusterA.
I'm wondering whether config like this considered to be valid according to described above:
<networkConnectors>
<networkConnector
uri="static:(failover(tcp://clusterB_broker1:port,tcp://clusterB_broker2:port,tcp://clusterB_broker3:port))"
name="bridge"
duplex="true"
conduitSubscriptions="true"
decreaseNetworkConsumerPriority="false"/>
</networkConnectors>
This is a valid configuration. It indicates (assuming that all ClusterA brokers are configured this way) that brokers in ClusterA will store and forward first to clusterB_broker1, and if it is down will instead store and forward to clusterB_broker2, and then to clusterB_broker3 if clusterB_broker2 is down. But depending on your intra-cluster broker configuration, it is not going to do what you want it to.
The broker configurations must be set up for failover themselves or else you will lose messages when clusterB_broker1 goes down. If clusterB brokers are not working together as described below, then when clusterB_broker1 goes down, any messages sent to it will not be present or accessible on the other clusterB brokers. New messages will forward to them.
How to do failover within the cluster depends on your ActiveMQ version.
The latest version (5.9.0) supports 3 failover (or master/slave) cluster configurations. For quick reference, they are:
Shared File System Master Slave
JDBC Master Slave
Replicated LevelDB Store
Earlier versions supported a master/slave configuration that had one master and one slave node where messages were forwarded to the slave broker. This setup was not well maintained, had bugs, and has been removed from ActiveMQ.

how to use master/slave configuration in activemq using apache zookeeper?

I'm trying to configure master/slave configuration using apache zookeeper. I have 2 application servers only on which I'am running activemq. as per the tutorial given at
[1]: http://activemq.apache.org/replicated-leveldb-store.html we should have atleast 3 zookeeper servers running. since I have only 2 machines , can I run 2 zookeeper servers on 1 machine and remaining one on another ? also can I run just 2 zookeeper servers and 2 activemq servers respectively on my 2 machines ?
I will answer the zookeper parts of the question.
You can run two zookeeper nodes on a single server by specifying different port numbers. You can find more details at http://zookeeper.apache.org/doc/r3.2.2/zookeeperStarted.html under Running Replicated ZooKeeper header.
Remember to use this for testing purposes only, as running two zookeeper nodes on the same server does not help in failure scenarios.
You can have just 2 zookeeper nodes in an ensemble. This is not recommended as it is less fault tolerant. In this case, failure of one zookeeper node makes the zookeeper cluster unavailable since more than half of the nodes in the ensemble should be alive to service requests.
If you want just POC ActiveMQ, one zookeeper server is enough :
zkAddress="192.168.1.xxx:2181"
You need at least 3 AMQ serveur to valid your HA configuration. Yes, you can create 2 AMQ instances on the same node : http://activemq.apache.org/unix-shell-script.html
bin/activemq create /path/to/brokers/mybroker
Note : don't forget du change port number in activemq.xml and jetty.xml files
Note : when stopping one broker I notice that all stopping.

ActiveMQ failover protocol not reconnecting to master after restarting

I am using ActiveMQ version 5.4 and I have a pure master slave configuration. My slave is configured such that starts its network transports connectors in the event of a failure. My clients are configured using the failover protocol, just like the docs say:
failover://(tcp://masterhost:61616,tcp://slavehost:61616)?randomize=false
When my master dies, the clients successfully fail over to the slave perfectly. The problem is that after I recover (i.e. stop the slave, copy over the data, restart the master, then restart the slave), the clients are still trying to connect to the the slave (which does not have any open network connectors at that point). Thus, the clients never reconnect to the master after restarting it. Is this how it's supposed to work?
I've seen this as well. If you're using the PooledConnectionFactory, set an expiry timeout on the pooled connections via setExpiryTimeout. The API documentation here suggests that this will force reconnection to the master broker:
allow connections to expire, irrespective of load or idle time. This is useful with failover to force a reconnect from the pool, to reestablish load balancing or use of the master post recovery