What will an HRegionServer do if the ZooKeeper node it is connected to goes down?

In HBase, when an HRegionServer starts up, it creates an ephemeral znode in the ZooKeeper cluster. When the HRegionServer crashes, the ephemeral znode is deleted, and the HMaster is notified of the crash.
As far as I know, ZooKeeper cleans up the ephemeral znodes of closed sessions. For example, a client C connects to a ZooKeeper node ZK1 and creates an ephemeral znode "/eph". When client C crashes or ZK1 crashes, the ephemeral znode "/eph" will be deleted after a while.
So I'm curious what an HRegionServer will do if the ZooKeeper node it is connected to goes down. Will the HRegionServer recreate the ephemeral znode? Will the HMaster be notified about the deletion of the HRegionServer's ephemeral znode? Will the HMaster start the server shutdown handling process?
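For reference, the ephemeral-znode lifecycle in question can be modelled in a few lines. This is a toy in-memory sketch, not the ZooKeeper client API: ToyEnsemble and all of its methods are hypothetical names. It assumes, as the ZooKeeper documentation describes, that an ephemeral znode belongs to a session rather than to the connection to one particular server, so the znode is removed only when the session is closed or times out. What HBase itself does on session expiry is not modelled here.

```python
class ToyEnsemble:
    """In-memory stand-in for a ZooKeeper ensemble (hypothetical, for illustration)."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_seen = {}   # session id -> time of last heartbeat
        self.ephemerals = {}  # znode path -> owning session id

    def open_session(self, session_id, now):
        self.last_seen[session_id] = now

    def create_ephemeral(self, session_id, path):
        if session_id not in self.last_seen:
            raise KeyError("unknown or expired session")
        self.ephemerals[path] = session_id

    def heartbeat(self, session_id, now):
        # A heartbeat may arrive through any server; only the session id matters.
        if session_id in self.last_seen:
            self.last_seen[session_id] = now

    def expire_sessions(self, now):
        # Drop sessions that have been silent for longer than the timeout,
        # together with every ephemeral znode they own.
        expired = {s for s, t in self.last_seen.items()
                   if now - t > self.session_timeout}
        for s in expired:
            del self.last_seen[s]
        self.ephemerals = {p: s for p, s in self.ephemerals.items()
                           if s not in expired}
        return expired


zk = ToyEnsemble(session_timeout=30)
zk.open_session("C", now=0)
zk.create_ephemeral("C", "/eph")

# C loses its server but reconnects to another one at t=20: the znode survives.
zk.heartbeat("C", now=20)
zk.expire_sessions(now=40)
survived = "/eph" in zk.ephemerals

# No heartbeat after t=20: by t=60 the session has expired and the znode is gone.
zk.expire_sessions(now=60)
removed = "/eph" not in zk.ephemerals
print(survived, removed)
```

In this model the loss of a single server matters only if the client fails to reach another server before the session timeout elapses.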

Related

RabbitMQ HA cluster graceful shutdown of a master node when using 'when-synced' policy

Suppose I use the 'when-synced' policy for both ha-promote-on-failure and ha-promote-on-shutdown on an HA cluster.
In that case, mirror-to-master promotion will never occur, and the master queue is blocked if there are no synchronized mirrors at the time of a controlled master shutdown.
That's what the documentation says.
https://www.rabbitmq.com/ha.html#cluster-shutdown
By default, RabbitMQ will refuse to promote an unsynchronised mirror
on controlled master shutdown (i.e. explicit stop of the RabbitMQ
service or shutdown of the OS) in order to avoid message loss; instead
the entire queue will shut down as if the unsynchronised mirrors were
not there.
If I use the 'when-synced' policy and no mirrors are synchronized when the master is shut down, then according to the documentation the master does not seem to shut down gracefully.
To me it seems there are only two options:
Wait for the master to be restored (however long it takes) if I use 'when-synced'.
Abandon all messages that are not yet synchronized to mirrors (those that exist only on the master) in favor of availability if I use 'always'.
Really?
Is there no option like "block the queue until one of the mirrors is fully synced, and then promote the synced mirror to the new master"?

Redis cluster node failure not detected on MISCONF

We currently have a Redis cache cluster with 3 masters and 3 slaves hosted on 3 Windows servers (1 master and 1 slave per server). We are using StackExchange.Redis as our client.
We have RDB disabled but AOF enabled, and we ran into a problem with the cluster in the following situation:
The disk on one of our servers became full, and the Redis node on that server was unable to write to the AOF file (the error returned to the client was MISCONF Errors writing to the AOF file: No space left on device).
The cluster did not detect that the node was failing and so did not exclude it from the cluster.
All cache operations were blocked until we freed some space on the server.
We know that we don't need the AOF, so we disabled it after the incident.
But we would like to confirm or refute our understanding of Redis clustering: we expected that if a node experienced a failure, the cluster would redirect all requests to another node. We tested this by stopping a master node, and a slave was promoted to master, so we are confident that our cluster is working. What we are not sure about is why, in our case, the node was not marked as failed.
Is the cluster capable of detecting a node failure when the failure only shows up when a client sends a request to the cluster?

ActiveMQ takes a long time to failover

I have 3 ActiveMQ brokers in a networked shared file system (GlusterFS) master/slave configuration, all in VMs.
If the master fails, the client should fail over to the new master.
The issue I have is that the connection to the new master takes about 50 seconds.
Is that reasonable?
How can I improve it?
My client connection looks like this:
failover:(tcp://a1:61616?connectionTimeout=1000,tcp://a2:61616?connectionTimeout=1000,tcp://a3:61616?connectionTimeout=1000)?randomize=false&maxReconnectDelay=10000&backup=true
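The structure of such a URI is easy to get wrong by hand, so here is a small sketch that assembles it. The helper failover_uri is a hypothetical name, not part of any ActiveMQ client. It only illustrates that options inside the parentheses belong to each nested tcp:// transport, while options after the closing parenthesis belong to the failover transport itself.

```python
from urllib.parse import urlencode


def failover_uri(hosts, port=61616, broker_opts=None, failover_opts=None):
    """Builds a failover: URI (hypothetical helper, for illustration only).

    broker_opts are appended to every nested tcp:// URI;
    failover_opts are appended after the closing parenthesis.
    """
    broker_query = "?" + urlencode(broker_opts) if broker_opts else ""
    brokers = ",".join(f"tcp://{h}:{port}{broker_query}" for h in hosts)
    failover_query = "?" + urlencode(failover_opts) if failover_opts else ""
    return f"failover:({brokers}){failover_query}"


uri = failover_uri(
    ["a1", "a2", "a3"],
    broker_opts={"connectionTimeout": 1000},
    failover_opts={"randomize": "false", "maxReconnectDelay": 10000, "backup": "true"},
)
print(uri)
```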
Also, when I disconnect the master by unplugging its network cable, it stops, throws an exception regarding the KahaDB store (which is on GlusterFS), and needs to be restarted.
Is there a workaround for this behavior so that the master broker restarts automatically or reconnects once the network comes back?
The failover time depends on how long the underlying file system takes to release the file lock.
In your case, the NFS cluster waits 50s to detect that the first node is lost and to release the lock on the KahaDB file, which can then be taken by the second node.
You can customize this delay with the NFSD_V4_GRACE and NFSD_V4_LEASE parameters in the NFS server configuration file (/etc/sysconfig/nfs on Red Hat/CentOS systems).
You can also customize the kahadb lockKeepAlivePeriod, see http://activemq.apache.org/pluggable-storage-lockers.html
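The dependency on the lock release can be illustrated with a small local experiment. This is only an analogy under stated assumptions: it uses fcntl.flock on a temporary local file on a Unix system, whereas ActiveMQ takes its lock through Java and, in the setup above, on shared storage. The helper try_lock and the names master_fd and slave_fd are hypothetical.

```python
import fcntl
import os
import tempfile


def try_lock(fd):
    """Attempts a non-blocking exclusive lock; returns True on success."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        return False


# Two separate opens of one file stand in for two brokers sharing a store.
handle, path = tempfile.mkstemp(suffix=".lock")
os.close(handle)
master_fd = os.open(path, os.O_RDWR)
slave_fd = os.open(path, os.O_RDWR)

master_has_lock = try_lock(master_fd)
slave_while_master_holds = try_lock(slave_fd)  # refused while the lock is held

fcntl.flock(master_fd, fcntl.LOCK_UN)          # the lock is released
slave_after_release = try_lock(slave_fd)       # only now can the slave take over

print(master_has_lock, slave_while_master_holds, slave_after_release)

os.close(master_fd)
os.close(slave_fd)
os.remove(path)
```

The slave's attempt succeeds only after the release, which is why any delay before the file system gives up a dead master's lock shows up directly as failover time.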

RabbitMQ cluster fully stopped: queue state is down after force-booting a slave node first

I have 3 nodes in disc mode with ha-mode set to "all". The RabbitMQ version is 3.6.4.
When I stop all nodes, I first stop the two slave nodes and then stop the master node. Assume that the master node is broken and can't be started. When I use rabbitmqctl force_boot to start one of the slave nodes, I find that the queue state is down.
I don't think this is right. I would expect the slave node to become master when it starts and the queue to be available, regardless of whether messages are lost.
But if I first stop the master node, then stop the new master node, and finally stop the last node, I can use rabbitmqctl force_boot to start any node, and any node is available.
Sounds like you're ending up with unsynchronized slaves and by default RabbitMQ will refuse to fail over to an unsynchronised slave on controlled master shutdown.
Stopping master nodes with only unsynchronised slaves
It's possible that when you shut down a master node that all available slaves are unsynchronised. A common situation in which this can occur is rolling cluster upgrades. By default, RabbitMQ will refuse to fail over to an unsynchronised slave on controlled master shutdown (i.e. explicit stop of the RabbitMQ service or shutdown of the OS) in order to avoid message loss; instead the entire queue will shut down as if the unsynchronised slaves were not there. An uncontrolled master shutdown (i.e. server or node crash, or network outage) will still trigger a failover even to an unsynchronised slave.
If you would prefer to have master nodes fail over to unsynchronised slaves in all circumstances (i.e. you would choose availability of the queue over avoiding message loss) then you can set the ha-promote-on-shutdown policy key to always rather than its default value of when-synced.
https://www.rabbitmq.com/ha.html
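The rule quoted above can be written down as a small decision function. This is a sketch of the documented behaviour as quoted, not RabbitMQ code: queue_outcome and its return values are hypothetical names, and the function deliberately ignores the separate ha-promote-on-failure key.

```python
def queue_outcome(controlled_shutdown, synced_mirrors, unsynced_mirrors,
                  promote_on_shutdown="when-synced"):
    """Returns 'promote-synced', 'promote-unsynced', or 'queue-down'.

    Models only the ha-promote-on-shutdown rule from the quoted documentation.
    """
    if synced_mirrors > 0:
        return "promote-synced"
    if unsynced_mirrors == 0:
        return "queue-down"  # no mirror left to promote
    if controlled_shutdown and promote_on_shutdown == "when-synced":
        return "queue-down"  # refuse to promote in order to avoid message loss
    return "promote-unsynced"


# Controlled stop of the master with only unsynchronised mirrors, default policy.
print(queue_outcome(True, 0, 2))
# Same situation, but the master crashed instead of being stopped.
print(queue_outcome(False, 0, 2))
# Controlled stop with ha-promote-on-shutdown set to always.
print(queue_outcome(True, 0, 2, promote_on_shutdown="always"))
```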

ActiveMQ failover protocol not reconnecting to master after restarting

I am using ActiveMQ version 5.4 and I have a pure master/slave configuration. My slave is configured so that it starts its network transport connectors in the event of a failure. My clients are configured using the failover protocol, just like the docs say:
failover://(tcp://masterhost:61616,tcp://slavehost:61616)?randomize=false
When my master dies, the clients fail over to the slave successfully. The problem is that after I recover (i.e. stop the slave, copy over the data, restart the master, then restart the slave), the clients are still trying to connect to the slave (which does not have any open network connectors at that point). Thus, the clients never reconnect to the master after it is restarted. Is this how it's supposed to work?
I've seen this as well. If you're using the PooledConnectionFactory, set an expiry timeout on the pooled connections via setExpiryTimeout. The API documentation here suggests that this will force reconnection to the master broker:
allow connections to expire, irrespective of load or idle time. This is useful with failover to force a reconnect from the pool, to reestablish load balancing or use of the master post recovery
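The effect of an expiry timeout can be sketched with a toy pool. This is not the PooledConnectionFactory implementation: ToyPool, borrow, and the broker table are hypothetical, and the model assumes that a fresh connection attempt tries the brokers in listed order, as with randomize=false.

```python
class ToyPool:
    """Hypothetical pool that recycles its connection after expiry_timeout."""

    def __init__(self, brokers, is_up, expiry_timeout):
        self.brokers = brokers          # tried in order, like randomize=false
        self.is_up = is_up              # callable: broker name -> bool
        self.expiry_timeout = expiry_timeout
        self.current = None             # (broker, created_at)

    def _connect(self, now):
        for broker in self.brokers:
            if self.is_up(broker):
                return (broker, now)
        raise ConnectionError("no broker reachable")

    def borrow(self, now):
        # Reconnect when there is no connection yet or the pooled one is too old.
        expired = (self.current is not None
                   and self.expiry_timeout > 0
                   and now - self.current[1] >= self.expiry_timeout)
        if self.current is None or expired:
            self.current = self._connect(now)
        return self.current[0]


up = {"masterhost": False, "slavehost": True}
pool = ToyPool(["masterhost", "slavehost"], lambda b: up[b], expiry_timeout=60)

first = pool.borrow(now=0)    # master is down, so the slave is chosen
up["masterhost"] = True       # master has recovered
second = pool.borrow(now=30)  # pooled connection has not expired yet
third = pool.borrow(now=60)   # expired: the reconnect tries the master first
print(first, second, third)
```

Without an expiry the pool in this model would keep handing out the slave connection indefinitely, which matches the behaviour described in the question.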