I am trying to learn akka.net clustering.
I thought I understood that when a node went down, it would be removed from the cluster. But that does not seem to be happening.
I fired up an instance of Lighthouse (as the seed node), made a super simple Akka.NET project, and connected them. It all connected fine.
But when I kill the node, Lighthouse keeps looking for it over and over. Eventually it says something about the Leader not being able to perform its duties.
I know that the node did not leave the cluster gracefully, but I imagine that I will have nodes that crash.
I thought that when that happens, the gossip system was supposed to remove the dead node from the cluster and everything would move on. (Then, if the node came back online, it could ask to be added back into the cluster.)
But I must be missing something, because Lighthouse just keeps retrying over and over.
Why does it do that instead of just waiting for it to connect again?
I added this to the "Cluster" part of my configuration and it caused the node to time out:
auto-down-unreachable-after = 5s
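For completeness, that setting sits under the cluster section of the HOCON, so the relevant part of my config looks roughly like this (the akka { cluster { } } nesting is just the standard layout, nothing special to my project):

akka {
  cluster {
    # mark unreachable nodes as Down after this long (5s is just the value I tested with)
    auto-down-unreachable-after = 5s
  }
}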
Related
I wanted to understand how a node-left event is triggered for an Apache Ignite grid.
Do nodes keep pinging each other constantly to check whether the other nodes are present, or do they ping each other only when required?
If a ping from a client node is not successful, can that also trigger a NODE_LEFT event, or can the event only be triggered by a server node?
Once a node has left, which node triggers the topology update event, i.e. PME? Can it be triggered by a client node, or can only server nodes trigger it?
Yes, nodes ping each other to verify the connection. Here is a more detailed explanation of how a node failure happens. You might also check this video.
The final decision to fail a node (remove it from the cluster) is made on the coordinator node, which issues a special event that has to be acked by the other nodes (NODE_FAILED).
A node might also leave the cluster explicitly by sending a TcpDiscoveryNodeLeftMessage (i.e. triggering a NODE_LEFT event), for example when you stop it gracefully.
Only the coordinator node can change the topology version, meaning that a PME always starts on the coordinator and is spread to the other nodes afterward.
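If you want to observe these transitions from application code, you can subscribe to the discovery events locally. A small Java sketch (the listener API and event type constants are the standard Ignite events facade; depending on your version and configuration you may also need to enable these event types via IgniteConfiguration.setIncludeEventTypes):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

public class TopologyListener {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // Fires on this node when the cluster decides another node has left or failed.
        IgnitePredicate<DiscoveryEvent> listener = evt -> {
            System.out.println(evt.name()
                + " node=" + evt.eventNode().id()
                + " topVer=" + evt.topologyVersion());
            return true; // keep listening
        };

        ignite.events().localListen(listener,
            EventType.EVT_NODE_LEFT,
            EventType.EVT_NODE_FAILED);
    }
}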
I have an OpenLDAP cluster with 6 nodes. When an item is added or deleted on the master, synchronization kicks in and the change is replicated to the other slave nodes in the cluster. But sometimes one of the slave nodes (the same node every time) misses the update, so that slave ends up differing from the master and the rest of the slaves, and when a request goes to the unsynchronized slave it yields invalid results.
In the problematic slave's LDAP logs there is no error during the operation that would explain the miss, so I can't figure out what caused the problem. Bringing that slave down and re-adding it does not help either.
Has anyone faced a similar problem and figured out the cause?
I posted the query to the OpenLDAP project itself and they suggested upgrading, since the version I am using is 2.4.44, which is a couple of years old, and it looks like many replication-related fixes have gone in since 2.4.44.
Below is the OpenLDAP issue link:
https://bugs.openldap.org/show_bug.cgi?id=9701
What I am testing is the following scenario:
Start 2 Lighthouses, then start a 3rd service that is a member of the cluster. Its seed nodes are configured to be the two Lighthouses that were previously started.
Now this 3rd service has its HOCON set to bind to port 0, which does its job and gives me a random port.
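For reference, the relevant HOCON of that 3rd service looks roughly like this (the actor system name and the Lighthouse ports are just placeholders for whatever your seed nodes actually use):

akka {
  actor.provider = cluster      # or the fully-qualified ClusterActorRefProvider type on older versions
  remote.dot-netty.tcp {
    hostname = "localhost"
    port = 0                    # let Akka.Remoting pick a random free port
  }
  cluster {
    seed-nodes = [
      "akka.tcp://MyCluster@localhost:4053",   # Lighthouse 1
      "akka.tcp://MyCluster@localhost:4054"    # Lighthouse 2
    ]
  }
}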
Now when I force-quit this service to simulate a crash, the logging output from Akka.NET gets REALLY chatty (important parts):
AssociationError...Tried to associate with unreachable remote address
address is now gated for 5000ms ... No connection could be made because the target machine actively refused it.
And it seems like it just goes on forever. I assume this is probably harmless and it just looks like a terrible error. The message itself makes sense: the service is literally gone, so it cannot and will never be able to connect.
Now if I restart the service, since it's configured to bind to port 0 for Akka.Remoting it will get an entirely new port, so the Unreachable status of the old, failed instance will never be resolved.
Is this the expected behavior? I also think there is a configuration setting that might come into play here:
auto-down-unreachable-after
Now this comes with its own warning:
Using auto-down implies that two separate clusters will automatically be formed in case of network partition.
Setting this does silence the messages:
auto-down-unreachable-after = 3s
And I get a new message after the node is marked unreachable:
Association to [akka.tcp://ClusterName#localhost:58977] having UID [983892349] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
"Remote actorsystem must be restarted to recover from this situation" seems pretty serious and something to avoid. At the same time, given that the service rejoins on a random port, it really is irrecoverable. In trying to learn more about the UID, it seems to be internally assigned, so I can only guess there would not be any UID collisions later in time, and this would be the proper behavior.
This seems to be the only option outside of
log-info = off
to just silence the logs
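(I assume that refers to the cluster section's logging switch, i.e. something like:

akka.cluster.log-info = off   # turn off info-level logging of cluster events

rather than the general Akka log level.)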
I assume it is the logging of the Lighthouse services that is chatty, right? That is 'normal' behaviour of the Akka gossip protocol trying to communicate with the crashed node. When this happens, you must configure what you want to do.
The solution is not always the same for every situation; it could depend, for example, on whether you are running the services on a cloud microservices platform. But one of the options is indeed 'auto-downing'. This will mark the service as 'UNREACHABLE' (as you can see). This means the node isn't out of the cluster, but the cluster continues to operate without the crashed node. That is also the reason the same node cannot rejoin: it is still marked as 'UNREACHABLE'.
Be aware that auto-downing could result in a 'split-brain' of the cluster, where the cluster splits into two parts that each keep running on their own (for example, one cluster of 4 nodes gets split into 2 clusters of 2 nodes). This is a situation you don't want, so it may not be the best solution!
Akka.NET has another solution you can configure to deal with this correctly: the Split Brain Resolver. More information on how to configure it: https://getakka.net/articles/clustering/split-brain-resolver.html
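The configuration for it looks roughly like this (check the linked page for the exact provider class and options for your Akka.NET version; keep-majority is just one of the available strategies):

akka.cluster {
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-majority   # alternatives include static-quorum, keep-oldest, ...
    stable-after = 20s                # how long membership must be stable before acting
  }
}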
These are all strategies to prevent 'split-brain' situations, and they involve sacrificing nodes to keep the cluster consistent. Use these strategies in combination with, for example, a microservices orchestration platform (so that instances restart themselves after crashing/exiting) to create a self-healing Akka cluster.
I'm having a strange issue today. First of all, everything was still working fine yesterday when I left the office, but today when I went back to work my DC/OS dashboard showed me that there weren't any services running or nodes connected.
I've run into this issue once or twice before, and it was related to Marathon not being able to elect a master. One of the 3 master nodes then also shows a lot of errors in the journal. This can be resolved by stopping/starting the dcos-marathon service on that host, which brings it back into the Marathon group.
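Concretely, what I do on that host is roughly the following (assuming the standard systemd unit names on a DC/OS install):

sudo systemctl restart dcos-marathon     # bounce the Marathon service on the bad master
sudo journalctl -u dcos-marathon -e      # then check the end of its journal for errors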
After that I did see the nodes and services again. But now it sometimes tells me there is only one node connected, then 3 again, then just 1 again, etc.
When I stop the dcos-mesos-master process on the conflicting host, the flapping stops and I have a stable master cluster (but probably not a really resilient one).
It looks like the failing node is trying to become the master, which causes this. I've tried searching for how to rejoin a failed mesos-master, but came up empty.
I'm running DC/OS on a CoreOS environment.
Although a general behavior is described, you may need to provide more specifics such as the kernel version, DC/OS version, specs, etc. The simplest answer I can provide based on what's been given is to reach out via their support channel on Slack ( https://dcos-community.slack.com/ ).
I have set up a 2-node RabbitMQ cluster with one load balancer in front. After the setup it was working in active/active mode. Then a network partition happened on one node; I took the failed node out of the cluster and rejoined it, but after that the failed node was not accepting any connections.
Then I tried moving the other node out of the balancer, and the recovered node began to accept connections, so the cluster is now effectively in active/passive mode.
I don't know what caused this. Is there any way to change it back to active/active? And in which setup step do you specify the mode?
Thanks for your advice in advance!
RabbitMQ really (really) doesn't like network partitions. By default, when you have one, everything pauses, and in that situation you must fix it manually. Choosing the losing node, stopping it, and starting it again should resume everything once it rejoins the cluster.
If that doesn't work, then shut down the failed node, and use rabbitmqctl to "forget_cluster_node", and then rejoin it to the cluster.
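Roughly, that recovery looks like this (node names are placeholders; run cluster_status first to see which nodes the cluster still thinks are partitioned):

rabbitmqctl cluster_status                       # on a healthy node: shows members and partitions
rabbitmqctl forget_cluster_node rabbit@failed    # on a healthy node, after the failed node is stopped

# then on the failed node, reset it and rejoin:
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@healthy
rabbitmqctl start_app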
You should read this very carefully
https://www.rabbitmq.com/partitions.html
specifically, "Recovering from a network partition"
Then read the next few paragraphs even more carefully. There are some automatic recovery modes, each with advantages and disadvantages.
At my company we chose autoheal because we value availability, and accept the possible loss of messages.
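For reference, that is a one-line setting (shown here in the newer rabbitmq.conf format; the older Erlang-style rabbitmq.config uses {cluster_partition_handling, autoheal} instead):

# rabbitmq.conf
cluster_partition_handling = autoheal    # other values: ignore (the default), pause_minority, pause_if_all_down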