We are trying to prevent our application startups from just spinning if we cannot reach the remote cluster. From what I've read, Force Server Mode states:
In this case, discovery will happen as if all the nodes in topology
were server nodes.
What I want to know is:
Does this client then permanently act as a server, which would run computes and store cache data?
If the connection to the cluster does not happen at first, could a later connection to an established cluster cause issues with consistency? What would be the expected behavior with a topology version mismatch? Is there potential for a split-brain scenario?
No, it's still a client node, but it behaves as a server at the discovery protocol level. For example, it can start without any server nodes running.
A client node can never cause data inconsistency, as it never stores data. This does not depend on the forceServerMode flag.
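To illustrate, here is a minimal sketch of such a startup, assuming a plain Java main class; the discovery addresses are placeholders, so adjust them to your environment:

    import java.util.Arrays;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

    public class ClientStartup {
        public static void main(String[] args) {
            // Placeholder addresses of the server nodes' discovery ports.
            TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
            ipFinder.setAddresses(Arrays.asList("10.0.0.1:47500..47509"));

            TcpDiscoverySpi discovery = new TcpDiscoverySpi();
            discovery.setIpFinder(ipFinder);
            // Discovery behaves as it would for a server node (flag is deprecated in recent versions).
            discovery.setForceServerMode(true);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setClientMode(true); // still a client node: it stores no cache data
            cfg.setDiscoverySpi(discovery);

            // With forceServerMode the client can start even before any server is reachable.
            Ignite ignite = Ignition.start(cfg);
        }
    }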
Related
I need some clarity on the forceServerMode flag of TcpDiscoverySpi. As per the documentation below, the discovery SPI will behave the same way for a client node as it would for a server:
If node is configured as client node (see IgniteConfiguration.clientMode) TcpDiscoverySpi starts in client mode as well. In this case node does not take its place in the ring, but it connects to random node in the ring (IP taken from IP finder configured) and use it as a router for discovery traffic. Therefore slow client node or its shutdown will not affect whole cluster. If TcpDiscoverySpi needs to be started in server mode regardless of IgniteConfiguration.clientMode, forceSrvMode should be set to true.
Does that mean:
Would the client node become part of the discovery ring? If yes, would it impact overall grid performance?
Does it impact the client node's near caches, if it has any?
What are the benefits of setting this flag to true?
Added a link to the documentation for reference:
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySpi.html
This option has already been deprecated, but let me answer all the questions.
It could potentially affect cluster stability and performance.
I believe it shouldn't impact anything in terms of functionality.
I don't see any major benefits of the property. Most likely you can start a standalone client without any running servers, but maybe not; honestly, I haven't checked it.
In general I wouldn't recommend using it.
P.S. A similar question was answered a while ago.
I've started exploring Redis Cluster and its C client (hiredis). I've been unable to find much info about the client's interaction with the Redis cluster. I've got some queries in this regard:
Does the client make a connection with all the nodes of the cluster (masters and slaves) in the beginning?
Is there any coordinator node which proxies the client's request to the correct node?
If not, does the client periodically get info about the hash-slot holdings of each node in the cluster (in order to send its request to the correct node)?
Which client-cluster connection specific parameters are configurable?
Does the client make a connection with all the nodes?
Yes, the client maintains a connection with all the masters at least.
Is there a coordinator node which proxies the client's request to the correct node?
No, there isn't. By design, Redis Cluster does not have a proxy.
(Aside: there is some talk of developing a proxy solution for Redis, but I don't expect it to be released any time soon.)
Does the client periodically get info about hash slot bindings?
When a client starts up, it builds a cache of hash-slot mappings. Then, at runtime, if a slot is migrated to another master, Redis Cluster will return a MOVED redirection error telling the client the new owner of that slot. The client is then expected to cache the new owner and retry the request against the new node.
As a result of this design, clients usually have a very good cache of every slot and its owner, and there is very little overhead.
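As a rough illustration of that slot mapping (sketched in Java rather than hiredis, and omitting hash-tag handling), a cluster-aware client hashes each key with CRC16 (XMODEM variant) modulo 16384 and keeps a slot-to-node table that it refreshes whenever a MOVED reply arrives:

    public class RedisSlot {
        // CRC16/XMODEM: polynomial 0x1021, initial value 0, as used by Redis Cluster.
        static int crc16(byte[] data) {
            int crc = 0;
            for (byte b : data) {
                crc ^= (b & 0xFF) << 8;
                for (int i = 0; i < 8; i++) {
                    crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                    crc &= 0xFFFF;
                }
            }
            return crc;
        }

        // Map a key to one of the 16384 hash slots.
        static int slotForKey(String key) {
            return crc16(key.getBytes()) % 16384;
        }

        public static void main(String[] args) {
            // The client would look this slot up in its slot-to-node cache
            // and send the command to the node that currently owns it.
            System.out.println(slotForKey("user:1000"));
        }
    }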
Which client connection parameters are configurable?
The most important parameter is the list of server nodes used to connect to the cluster. You don't have to specify all the nodes; the client can auto-discover all the masters. As long as even one node is reachable, the client will discover all the other nodes.
Apart from that, you have connection timeout parameters and parameters to control TLS.
What I am testing is the following scenario:
Start 2 Lighthouses, then start a 3rd service that is a member of the cluster. Its seed nodes are configured to be the two Lighthouses that were previously started.
Now this 3rd service has its HOCON set to bind to port 0, which does its job and gives me a random port.
Now when I force quit this service to simulate a crash, the logging output from Akka.NET gets REALLY chatty (important parts):
AssociationError...Tried to associate with unreachable remote address
address is now gated for 5000ms ... No connection could be made because the target machine actively refused it.
And it seems like it just goes on forever. I assume this is probably harmless and it just looks like a terrible error. The message itself makes sense: the service is literally gone, so it cannot and will never be able to connect.
Now if I restart the service, since it's configured to bind to port 0 for Akka.Remote, it will get an entirely new port, so the unreachable status of the failed service will never be resolved.
Is this the expected behavior? I also think there is a configuration setting that might come into play here:
auto-down-unreachable-after
Now this comes with its own warning:
Using auto-down implies that two separate clusters will automatically be formed in case of network partition.
Setting this does silence the messages:
auto-down-unreachable-after = 3s
And I get a new message after the node is marked unreachable:
Association to [akka.tcp://ClusterName#localhost:58977] having UID [983892349] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
"Remote actorsystem must be restarted to recover from this situation" seems pretty serious and something to avoid. At the same time, given that the service joins on a random port, it is irrecoverable. In trying to learn more about the UID, it seems that it's internally assigned, so I can only guess there would not be any UID collisions later in time, and this would be the proper behavior.
This seems to be the only option outside of
log-info = off
to just silence the logs.
I assume the logging of the Lighthouse services is chatty, right? That is 'normal' behaviour of the Akka gossip protocol trying to communicate with the crashed node. When this happens, you must configure what you want to do.
The solution is not always the same for each situation; it could depend, for example, on whether you are running the services on a cloud microservices platform. But one of the options is indeed 'auto-downing'. The failure detector marks the crashed service as 'UNREACHABLE' (as you can see); with auto-downing, the cluster then marks the unreachable node as down after the configured timeout and continues to operate without it. That is also why the crashed node's old incarnation cannot simply come back: it is still marked as 'UNREACHABLE' (and later quarantined).
Be aware that auto-downing could result in a 'split-brain' of the cluster, where the cluster breaks into two independent parts (for example, one cluster of 4 nodes gets split into 2 clusters of 2 nodes). This is a situation you don't want, so this may not be the best solution!
Akka.NET has another solution you can configure to deal with this correctly: the Split Brain Resolver. More information on how to configure it: https://getakka.net/articles/clustering/split-brain-resolver.html
These are all strategies to prevent 'split-brain' situations, and they involve sacrificing nodes to keep the cluster consistent. Use these strategies in combination with, for example, a microservices orchestration platform (so that instances restart themselves after crashing/exiting) to create a self-healing Akka cluster.
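As a rough HOCON sketch (the exact provider class and keys depend on your Akka.NET version, so verify against the linked documentation), enabling the Split Brain Resolver with the keep-majority strategy looks something like:

    akka.cluster {
      downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
      split-brain-resolver {
        active-strategy = keep-majority   # down the minority side of a partition
        stable-after = 20s                # wait for membership to settle before acting
      }
    }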
We're currently running Apache Ignite in a Docker container but are having problems with time server sync. Each node reports all of its known IP addresses, which are later used by a remote peer to send time sync messages over UDP.
Is there a way to specify the externally reachable IP address that the peers will use for time sync?
It's possible to set, for every node, the network interface it should use for all network-related communication via the IgniteConfiguration.setLocalHost(...) method. The time server will use the addresses specified this way as well.
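A minimal sketch, assuming the container's externally reachable IP is known at startup (the address below is a placeholder):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class NodeStartup {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();
            // Bind all network communication (discovery, communication, time server)
            // to the externally reachable address of this host/container.
            cfg.setLocalHost("203.0.113.10");
            Ignite ignite = Ignition.start(cfg);
        }
    }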
However, it's not critical if the time server doesn't work on your side, because it's only used for the cache CLOCK mode, which is discouraged anyway.
We have a number of web-apps running on IIS 6 in a cluster of machines. One of those machines is also a state server for the cluster. We do not use sticky IP's.
When we need to take down the state server machine this requires the entire cluster to be offline for a few minutes while it's switched from one machine to another.
Is there a way to switch a state server from one machine to another with zero downtime?
You could use Velocity, which is a distributed caching technology from Microsoft. You would install the cache on two or more servers. Then you would configure your web app to store session data in the Velocity cache. If you needed to reboot one of your servers, the entire state for your cluster would still be available.
You could use the SQL server option to store state. I've used this in the past and it works well as long as the ASPState table it creates is in memory. I don't know how well it would scale as an on-disk table.
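If you go the SQL Server route, the switch is a web.config change on each web server; a minimal sketch (the server name and options are placeholders) would be:

    <sessionState
        mode="SQLServer"
        sqlConnectionString="Data Source=YourStateSqlServer;Integrated Security=SSPI"
        cookieless="false"
        timeout="20" />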
If SQL server is not an option for whatever reason, you could use your load balancer to create a virtual IP for your state server and point it at the new state server when you need to change. There'd be no downtime, but people who are on your site at the time would lose their session state. I don't know what you're using for load balancing, so I don't know how difficult this would be in your environment.