Re-join cluster node after seed node got restarted - akka.net

Let's imagine the following scenario. I have three nodes in my Akka cluster (nodes A, B and C). Each node is deployed to a different physical device inside a network.
All of those nodes are wrapped in Topshelf Windows services.
Node A is my seed node; the others are simply 'worker' nodes with a port specified.
When I run the cluster, stop node (service) B or C and then restart it, the node rejoins with no issues.
I'd like to ask whether it's possible to handle another scenario: I stop the seed node (node A) while the other nodes (services) keep running, and then I restart node (service) A. I'd like nodes B and C to rejoin the cluster and get the whole ecosystem working again.
Is such a scenario possible to implement? If yes, how should I do it?

In an Akka.NET cluster, any node can serve as a seed node for others as long as it is part of the cluster. "Seeds" are just a configuration matter: you define a list of well-known addresses of nodes that you know are part of the cluster.
Regarding your case, there are several solutions I can think of:
A quite common approach is to define more than one seed node in the configuration, so that a single node isn't a single point of failure. As long as at least one of the configured seed nodes is alive, everything should work fine. Keep in mind that the seed nodes should be listed in exactly the same order in every node's configuration.
If your "worker" nodes have statically assigned endpoints, they can be used as seed nodes as well.
Since you can initialize the cluster programmatically from code, you can also use a third-party service for node discovery, e.g. Consul. I've started a project that provides this functionality. While it's not yet published, feel free to fork it or contribute if it helps you.
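As a rough sketch of the first two options, each node's HOCON could list the same seed nodes in the same order. The system name MyCluster, the host names and port 4053 below are placeholders, not values from the question, and the dot-netty.tcp transport section assumes a reasonably recent Akka.NET version:

```hocon
akka {
  actor.provider = cluster
  remote.dot-netty.tcp {
    # Placeholder: this node's own reachable host name and fixed port
    hostname = "node-b.example.local"
    port = 4053
  }
  cluster {
    # Same list, same order, in the configuration of nodes A, B and C
    seed-nodes = [
      "akka.tcp://MyCluster@node-a.example.local:4053",
      "akka.tcp://MyCluster@node-b.example.local:4053"
    ]
  }
}
```

With a list like this, a restarted node A should be able to contact node B and join the existing cluster, rather than only knowing about itself.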

Related

Setup docker-swarm to use specific nodes as backup?

Is there a way to set up docker-swarm to use only specific nodes (workers or managers) as fail-over nodes? For instance, only if one specific worker dies (or a service on it dies) would it use another node; until that happens, it's as if that node weren't in the swarm.

No, that is not possible out of the box. However, docker-swarm does have the features to build it yourself. Let's say that you have 3 worker nodes on which you want to run service A. 2 of the 3 nodes will always be available, and node 3 will be the backup.
Add a label to the 3 nodes, e.g. runs=serviceA, and constrain the service to that label. This makes sure that your service runs only on those 3 nodes.
Make the 3rd node unable to schedule tasks by running docker node update --availability drain <NODE-ID>
Whenever you need your node back, run docker node update --availability active <NODE-ID>

Do I need multiple masters on OKD?

So I have a question regarding setting up OKD for our needs - our team has already established that Kubernetes is basically the simplest way for us to manage our stack. We don't have too much workload; probably 3 dedicated servers could work through all of it, but we have a lot of services and tools that are best served by running in docker containers, and we also strongly benefit from running our fairly monolithic core application as a container to make deployment and maintenance simpler.
The question, though, is how many nodes we need; specifically, whether we need HA master nodes.
From the documentation, it seems that Infrastructure nodes are responsible for routing. Does this mean that even if the master node goes down, the other nodes are still available and routing works, so long as domains point at the infrastructure nodes? Or would a failed master make all the other nodes unreachable?

In our environment the router pods run on infra nodes, and we can safely turn off the master node without impact on applications.
master node: api, controllers, etcd
infra node: registry, router, metrics, logging etc.
With the master turned off you just can't manage the cluster; the rest works fine. It is good to have more than one master node for etcd redundancy, but with such a small environment I think it makes no sense to maintain more.

How to configure Akka.Cluster for services that Crash when binding to port 0

What I am testing is the following scenario:
Start 2 Lighthouses, then start a third service that is a member of the cluster. Its seed nodes are configured to be the two Lighthouses that were previously started.
Now this third service has its HOCON set to bind to port 0, which does its job and gives me a random port.
Now when I force quit this service to simulate a crash, the logging output from Akka.NET gets REAL chatty (important parts):
AssociationError...Tried to associate with unreachable remote address
address is now gated for 5000ms ... No connection could be made because the target machine actively refused it.
And it seems like it just goes on forever. I assume this is probably harmless and it just looks like a terrible error. The message itself makes sense, the service is literally gone so it can not and will never be able to connect.
Now if I restart the service since it's configured to bind to 0 for Akka.Remoting, it will get an entirely new port, so the Unreachable status of the other failed service will never be resolved.
Is this the expected behavior? I also think there is a configuration setting that might come into play here:
auto-down-unreachable-after
Now this comes with its own warning:
Using auto-down implies that two separate clusters will automatically be formed in case of network partition.
Setting this does silence the messages:
auto-down-unreachable-after = 3s
And I get a new message after the node is marked unreachable:
Association to [akka.tcp://ClusterName@localhost:58977] having UID [983892349] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
"Remote actorsystem must be restarted to recover from this situation" seems pretty serious and something to avoid. At the same time, given that the service joins on a random port, the old association really is irrecoverable. In trying to learn more about the UID, it seems that it's internally assigned. So I can only guess there would not be any UID collisions later on, which would make this the proper behavior.
This seems to be the only option outside of
log-info = off
to just silence the logs

I assume it is the logging of the Lighthouse services that is chatty, right? That is 'normal' behaviour of the Akka gossip protocol trying to communicate with the crashed node. When this happens, you must configure what you want to do.
The right solution is not the same in every situation. It can depend, for example, on whether you are running the services on a cloud microservices platform. But one of the options is indeed 'auto-downing'. When a node crashes, the failure detector first marks it as 'UNREACHABLE' (as you can see). This means that the node isn't out of the cluster yet: the cluster continues to operate without the crashed node, but still lists it as a member. That's the reason the old entry never recovers when the service comes back on a new port: it stays 'UNREACHABLE' until something marks it as down, which is what auto-downing does after the configured timeout.
Be aware that auto-downing can result in a 'split-brain', where the cluster splits into independent parts (for example, one cluster of 4 nodes gets split into 2 clusters of 2 nodes). This is a situation that you don't want, so this may not be the best solution!
Akka.NET has another solution that you can configure to deal with this correctly: the Split Brain Resolver. More information on how to configure it: https://getakka.net/articles/clustering/split-brain-resolver.html
These are all strategies to prevent 'split-brain' situations and will involve sacrificing nodes to keep the cluster consistent. Use these strategies in combination with for example a microservices orchestration platform (so that instances will restart themselves after crashing/exiting) to create a perfect self-healing Akka cluster.
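A minimal sketch of what enabling the Split Brain Resolver might look like in HOCON. The strategy and timeout are illustrative picks, and the provider class name is an assumption that has differed between Akka.NET versions, so check it against the documentation linked above:

```hocon
akka.cluster {
  # Assumed provider class name; verify it for your Akka.NET version
  downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
  split-brain-resolver {
    # Down the side of a partition that holds the minority of nodes
    active-strategy = keep-majority
    # How long membership must be stable before a downing decision is taken
    stable-after = 20s
  }
}
```

A setup like this would typically be used instead of auto-down-unreachable-after, not together with it.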

Apache ignite node won't join cluster if server nodes start simultaneously

We have a problem with Ignite when we start two Ignite server nodes at the exact same time. We are currently implementing our own discovery mechanism by extending TcpDiscoveryIpFinderAdapter. The first time the TcpDiscoveryIpFinderAdapter is called, neither Ignite server is able to find the other node (due to the nature of our discovery mechanism). Subsequent invocations do report the other node with a correct IP, yet the Ignite nodes do not start talking to each other.
If we start the servers with some delay, the second server will (on the first attempt) find the other node and join the cluster successfully.
Is there a way to get the two nodes to talk to each other even after both of them initially think they are a cluster of one node?

CouchBase 2.5 2 nodes in replica: 1 node fail: the service is no more available

We are testing Couchbase with a two node cluster with one replica.
When we stop the service on one node, the other one does not respond until we restart the service or manually failover the stopped node.
Is there a way to keep the service available from the good node when one node is temporarily unavailable?

If a node goes down then in order to activate the replicas on the other node you will need to manually fail it over. If you want this to happen automatically then you can enable auto-failover, but in order to use that feature I'm pretty sure you must have at least a three node cluster. When you want to add the failed node back then you can just re-add it to the cluster and rebalance.
