What does "quorum reached" mean in Redis logs? - redis

What does "quorum reached" in the Redis logs technically indicate, and is it a problem? Am I missing a tuning parameter in redis.conf that would fix it?
Redis log message:
Marking node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba as failing (quorum reached).
837:M 05 May 10:30:22.216 # Cluster state changed: fail

The message means that the cluster has reached a consensus about that node's status and has marked it as failing. This happens when a node does not respond to the cluster's internal chatter (gossip) protocol within the node timeout, and could be the result of any kind of failure (e.g. network, process...). You should check that node's logs for more information.
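
To make the "quorum" part of that log line concrete: a node moves from "possibly failing" to "failing" once a majority of the masters have reported it as unreachable. The sketch below only models that counting rule. The function names and master IDs are made up for illustration and are not Redis APIs.

```python
def quorum_needed(master_count: int) -> int:
    """Smallest number of masters that forms a majority."""
    return master_count // 2 + 1


def is_marked_failing(master_count: int, failure_reports: set) -> bool:
    """True once enough distinct masters report the node as unreachable.

    failure_reports holds the IDs of the masters that currently flag the
    node (hypothetical IDs, for illustration only).
    """
    return len(failure_reports) >= quorum_needed(master_count)


if __name__ == "__main__":
    # With 3 masters, 2 agreeing reports are enough for "quorum reached".
    print(quorum_needed(3))                     # prints 2
    print(is_marked_failing(3, {"m1"}))         # prints False
    print(is_marked_failing(3, {"m1", "m2"}))   # prints True
```

So in a 3-master cluster, a single master losing sight of a node is not enough; a second master has to agree before the node is flagged.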

Related

Redis cluster node failure not detected on MISCONF

We currently have a Redis cache cluster with 3 masters and 3 slaves hosted on 3 Windows servers (1 master and 1 slave per server). We are using StackExchange.Redis as our client.
We have RDB disabled but AOF enabled, and we ran into problems with the cluster in the following situation:
The disk on one of our servers became full, and the Redis node on that server was unable to write to the AOF file (the error returned to the client was MISCONF Errors writing to the AOF file: No space left on device).
The cluster did not detect that the node was failing, and so did not exclude it from the cluster.
All cache operations were blocked until we freed up some space on the server.
We know that we don't need the AOF, so we disabled it after the incident.
But we would like to confirm or refute our view of Redis clustering: we assumed that if a node experienced a failure, the cluster would redirect all requests to another one. We have tested that when a master node is stopped, a slave is promoted to master, so we are confident that our cluster is working, but we are not sure why, in our case, the node was not marked as failed.
Is the cluster capable of detecting a node failure when the failure only occurs when a client sends a request to the cluster?

How to configure Akka.Cluster for services that Crash when binding to port 0

What I am testing is the following scenario:
Start 2 Lighthouses, then start a third service that is a member of the cluster. Its seed nodes are configured to be the two Lighthouses that were previously started.
Now this third service has its HOCON set to bind to port 0, which does its job and gives me a random port.
Now when I force-quit this service to simulate a crash, the logging output from Akka.NET gets REAL chatty (important parts):
AssociationError...Tried to associate with unreachable remote address
address is now gated for 5000ms ... No connection could be made because the target machine actively refused it.
And it seems like it just goes on forever. I assume this is probably harmless and it just looks like a terrible error. The message itself makes sense, the service is literally gone so it can not and will never be able to connect.
Now if I restart the service since it's configured to bind to 0 for Akka.Remoting, it will get an entirely new port, so the Unreachable status of the other failed service will never be resolved.
Is this the expected behavior? I also think there is a configuration setting that might come into play here:
auto-down-unreachable-after
Now this comes with its own warning:
Using auto-down implies that two separate clusters will automatically be formed in case of network partition.
Setting this does silence the messages:
auto-down-unreachable-after = 3s
And I get a new message after the node is marked unreachable:
Association to [akka.tcp://ClusterName@localhost:58977] having UID [983892349]is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
"Remote actorsystem must be restarted to recover from this situation" seems pretty serious and something to avoid. At the same time, given that the service joins on a random port, it is irrecoverable. In trying to learn more about the UID, it seems that it is assigned internally. So I can only guess that there would not be any UID collisions later on, so this would be the proper behavior.
This seems to be the only option outside of
log-info = off
to just silence the logs

I assume it is the logging of the Lighthouse services that is chatty, right? That is 'normal' behaviour: the Akka gossip protocol is trying to communicate with the crashed node. When this happens, you must configure what you want to do.
The solution is not the same for every situation. It could depend, for example, on whether you are running the services on a cloud microservices platform. But one of the options is indeed 'auto-downing'. This will mark the service as 'UNREACHABLE' (as you can see). This means that the node isn't out of the cluster, but the cluster continues to operate without the crashed node. That's the reason the same node cannot rejoin: it is still marked as 'UNREACHABLE'.
Be aware that auto-downing could result in a 'split-brain' of the cluster, where the cluster splits into two parts (for example, one cluster of 4 nodes splits into 2 clusters of 2 nodes). This is a situation that you don't want, so this may not be the best solution!
Akka.NET has another solution that you can configure to deal with this correctly: the Split Brain Resolver. More information on how to configure it: https://getakka.net/articles/clustering/split-brain-resolver.html
Its strategies are all designed to prevent 'split-brain' situations and involve sacrificing nodes to keep the cluster consistent. Use them in combination with, for example, a microservices orchestration platform (so that instances restart themselves after crashing or exiting) to create a self-healing Akka cluster.
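
To illustrate the "sacrificing nodes" idea, here is a small Python sketch of a keep-the-majority rule: each side of a partition compares how many members it can still reach with how many it cannot, and only the larger side stays up. This is a simplified model, not Akka.NET code. The function name, the return values, and the lowest-address tie-breaker are assumptions made for the example.

```python
def keep_majority_decision(reachable, unreachable):
    """Decide what one side of a partition should do.

    reachable / unreachable are the node addresses as seen from this side
    (reachable always includes the deciding node itself).
    Returns "down-unreachable" if this side should survive, or "down-self"
    if this side should shut itself down.
    """
    reachable, unreachable = set(reachable), set(unreachable)
    if len(reachable) > len(unreachable):
        return "down-unreachable"
    if len(reachable) < len(unreachable):
        return "down-self"
    # Tie: keep the side that holds the lowest address (assumed tie-breaker).
    lowest = min(reachable | unreachable)
    return "down-unreachable" if lowest in reachable else "down-self"


if __name__ == "__main__":
    # 5 nodes split 3/2: the side with 3 survives, the side with 2 gives up.
    print(keep_majority_decision(["a", "b", "c"], ["d", "e"]))
    print(keep_majority_decision(["d", "e"], ["a", "b", "c"]))
    # 4 nodes split 2/2: only the side holding "a" survives.
    print(keep_majority_decision(["a", "b"], ["c", "d"]))
    print(keep_majority_decision(["c", "d"], ["a", "b"]))
```

Because both sides apply the same rule to mirrored views, at most one of them keeps running, which is what plain auto-downing cannot guarantee.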

RabbitMQ cluster: how to change active/active into active/passive mode?

I have set up a 2-node RabbitMQ cluster with one load balancer in front. After the setup it was working in active/active mode. Then a network partition happened on one node; I took the failed node out of the cluster and rejoined it to the cluster, but then this failed node was not accepting any connections.
Then I tried moving the other node out of the load balancer, and the recovered node began to accept connections, so the cluster is now in active/passive mode.
I don't know what caused this. Is there any way to change it back to active/active? And at which step during setup is the mode specified?
Thanks in advance for your advice!

RabbitMQ really (really) doesn't like network partitions. By default, when you have one, everything pauses. In that situation you must fix it manually. Choosing the loser and restarting it (stop it, then start it) should resume everything once it rejoins the cluster.
If that doesn't work, then shut down the failed node, and use rabbitmqctl to "forget_cluster_node", and then rejoin it to the cluster.
You should read this very carefully
https://www.rabbitmq.com/partitions.html
specifically, "Recovering from a network partition"
Then read the next few paragraphs even more carefully. There are some automatic recovery modes, each with advantages and disadvantages.
At my company we chose autoheal because we value availability, and accept the possible loss of messages.
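
For the manual route described above (stop the failed node, forget it, rejoin it), the sequence could be written down roughly as in the Python sketch below. It only builds the list of rabbitmqctl commands and where each would run; nothing is executed. The function name is made up for this example, and the exact order is an assumption based on the usual clustering workflow, so check it against the partitions guide before using it. Note that reset wipes the node's local state.

```python
def rejoin_commands(failed_node: str, healthy_node: str):
    """Build (run_on_node, argv) steps to forget a node and re-add it.

    Nothing is executed here; the list is meant to be reviewed first.
    """
    return [
        # Stop the RabbitMQ application on the node that lost the partition.
        (failed_node, ["rabbitmqctl", "stop_app"]),
        # From a surviving node, drop the failed node from the cluster.
        (healthy_node, ["rabbitmqctl", "forget_cluster_node", failed_node]),
        # Clear the failed node's stale cluster state (this wipes its data).
        (failed_node, ["rabbitmqctl", "reset"]),
        # Join the surviving node's cluster again and start the application.
        (failed_node, ["rabbitmqctl", "join_cluster", healthy_node]),
        (failed_node, ["rabbitmqctl", "start_app"]),
    ]


if __name__ == "__main__":
    for node, argv in rejoin_commands("rabbit@node2", "rabbit@node1"):
        print(f"on {node}: {' '.join(argv)}")
```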

Why does the PUBLISH command on a Redis slave cause no error?

I have a redis master-slave setup and the configuration of the slave is set to slave_read_only:1, but when I enter a PUBLISH command on the slave node it does not fail. I would expect an error, but it just takes the command and nothing else happens. The message is not propagated to the master either.
The question is, why is that? Did I mis-configure redis? Is that a feature? To what purpose? Or is it just a bug?
The problem arises in a setup where automatic failover occurs. A master may become a slave, and clients of that slave may publish messages without realizing that it is no longer a master. Do I have to check before each message is sent whether the Redis node is still the master?
I use Redis 3.0.5.

You didn't misconfigure anything: this is the defined behavior, as PUBLISH isn't considered a write command.
Also note that published messages are replicated from master to slaves (downstream, as usual), so if you publish to a slave, only clients that are connected to it or to its slaves and subscribed to the relevant channel will get the message.
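
If a client wants to avoid publishing to a slave after a failover, one option (my assumption, not something Redis requires) is to look at the role field reported by INFO replication before publishing. The Python sketch below only parses that text. The helper names are made up, and fetching the INFO output from the server is left to whatever client library you use.

```python
def parse_role(info_text: str) -> str:
    """Extract the value of the 'role' field from INFO replication output."""
    for line in info_text.splitlines():
        line = line.strip()
        if line.startswith("role:"):
            return line.split(":", 1)[1]
    raise ValueError("no role field found in INFO output")


def may_publish(info_text: str) -> bool:
    """True if the node that produced this INFO output is a master."""
    return parse_role(info_text) == "master"


if __name__ == "__main__":
    # Sample text shaped like an INFO replication reply from a slave.
    sample = "# Replication\r\nrole:slave\r\nmaster_host:10.0.0.1\r\n"
    print(parse_role(sample))    # prints slave
    print(may_publish(sample))   # prints False
```

Checking before every single message adds a round trip, so in practice it may be enough to re-check the role when the connection is (re)established.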

RabbitMQ Clustering

I am new to RabbitMQ and I am not able to understand the concept here. Here is the scenario.
I have two machines (RMQ1, RMQ2), and RabbitMQ is installed and running on both. I then made RMQ2 join RMQ1's cluster:
cmd:/> rabbitmqctl join_cluster rabbit@RMQ1
The status of the two machines is shown below.
In RMQ1
c:\> rabbitmqctl cluster_status
Cluster status of node rabbit@RMQ1 ...
[{nodes,[{disc,[rabbit@RMQ1,rabbit@RMQ2]}]},
{running_nodes,[rabbit@RMQ1,rabbit@RMQ2]}]
In RMQ2
c:\> rabbitmqctl cluster_status
Cluster status of node rabbit@RMQ2 ...
[{nodes,[{disc,[rabbit@RMQ1,rabbit@RMQ2]}]},
{running_nodes,[rabbit@RMQ1,rabbit@RMQ2]}]
Then, in order to publish and subscribe to messages, I connect to RMQ1. Now I see that whenever I send a message to RMQ1, the message is mirrored in both RMQ1 and RMQ2. I understand this to mean that, because both nodes are in the same cluster, messages are mirrored across the nodes.
Let's say I bring down RMQ2: I still see messages getting published to RMQ1.
But when I bring down RMQ1, I cannot publish messages anymore. From this I understand that RMQ1 is the master and RMQ2 is the slave.
Now I have the questions below (without changing the code):
How do I make RMQ2 take over the job of accepting messages?
What is the meaning of Highly Available Queues?
What should the strategy be for implementing this kind of scenario?
Please help.

Question #2 is best answered first, since it will clear up a lot of things for you.
What is the meaning of highly available queues?
A good source of information for this is the Rabbit doc on high availability. It's very important to understand that mirroring (which is how you achieve high availability in Rabbit) and clustering are not the same thing. You need to create a cluster in order to mirror, but mirroring doesn't happen automatically just because you create a cluster.
When you cluster Rabbit, the nodes in the cluster share exchanges, bindings, permissions, and other resources. This allows you to manage the cluster as a single logical broker and utilize it for scenarios such as load-balancing. However, even though queues in a cluster are accessible from any machine in the cluster, each queue and its messages are still actually located only on the single node where the queue was declared.
This is why, in your case, bringing down RMQ1 will make the queues and messages unavailable. If that's the node you always connect to, then that's where those queues reside. They simply do not exist on RMQ2.
In addition, even if there are queues and messages on RMQ2, you will not be able to access them unless you specifically connect to RMQ2 after you detect that your connection to RMQ1 has been lost. Rabbit will not automatically connect you to some surviving node in a cluster.
By the way, if you look at a cluster in the RabbitMQ management console, what you see might make you think that the messages and queues are replicated. They are not. You are looking at the cluster in the management console. So regardless of which node you connect to in the console, you will see a cluster-wide view.
So with this background now you know the answer to your other two questions:
What should be the strategy for implementing high availability? / how to make RMQ2 accept messages?
From your description, you are looking for the failover that high availability is intended to provide. You need to enable this on your cluster. This is done through a policy, and there are various ways to do it, but the easiest way is in the management console on the Admin tab in the Policies section.
The previously cited doc has more detail on what it means to configure high availability in Rabbit.
What this will give you is mirroring of queues and messages across your cluster. That way, if RMQ1 fails then RMQ2 will still have your queues and messages since they are mirrored across both nodes.
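
Besides the management console, the same kind of policy can be set from the command line with rabbitmqctl set_policy. The Python sketch below just builds such a command so the pieces are visible. The policy name "ha-all" and the catch-all pattern are illustrative choices, and this only applies to RabbitMQ versions that support classic queue mirroring.

```python
import json


def ha_policy_command(name="ha-all", pattern=".*", mode="all"):
    """Build a rabbitmqctl command that mirrors queues matching `pattern`.

    Returns an argv list; nothing is executed here.
    """
    definition = json.dumps({"ha-mode": mode})
    return ["rabbitmqctl", "set_policy", name, pattern, definition]


if __name__ == "__main__":
    # Show the command that would mirror every queue to all nodes.
    print(ha_policy_command())
```

Passing the argv list to something like subprocess.run on one of the cluster nodes would apply it; a narrower pattern limits mirroring to the queues that really need it.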
An important note is that Rabbit will not automatically detect a loss of connection to RMQ1 and connect you to RMQ2. Your client needs to do this. I see you tagged your question with EasyNetQ. EasyNetQ provides this "failover connect" type of feature for you. You just need to supply both node hosts in the connection string. The EasyNetQ doc on clustering has details. Note that EasyNetQ even lets you inject a simple load balancing strategy in this case as well.
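
If your client library does not do this for you, the "try the next node" logic is simple enough to sketch. The Python below is a generic illustration, not EasyNetQ code: connect_with_failover and fake_connect are made-up names, and the connect argument stands in for whatever call your real client uses to open a connection.

```python
def connect_with_failover(hosts, connect):
    """Try each host in order; return (host, connection) for the first success.

    `connect` is any callable that takes a host name and either returns a
    connection object or raises ConnectionError.
    """
    failed = []
    for host in hosts:
        try:
            return host, connect(host)
        except ConnectionError:
            failed.append(host)
    raise ConnectionError(f"all hosts failed: {failed}")


if __name__ == "__main__":
    def fake_connect(host):
        # Stand-in for a real client call; RMQ1 is pretended to be down.
        if host == "RMQ1":
            raise ConnectionError("RMQ1 is down")
        return f"connection to {host}"

    print(connect_with_failover(["RMQ1", "RMQ2"], fake_connect))
```

The same function would be called again whenever an established connection drops, so the client moves to RMQ2 while RMQ1 is down.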