GridGain Server partition loss - ignite

We have a 3-node GridGain server cluster and 3 client nodes deployed in GCP Kubernetes Engine. The cluster has native persistence enabled, and <property name="shutdownPolicy" value="GRACEFUL"/> is set as the shutdown policy. There is one backup for each cache. After an automatic cluster restart we get partition loss and need to reset the lost partitions by executing control commands.
Can you provide a proper solution for this? We have around 60 GB of persistent data.

<property name="shutdownPolicy" value="GRACEFUL"/> is supposed to protect from partition loss if certain conditions are met:
The caches must be either PARTITIONED with backups > 0 or REPLICATED. Check your configs. Default cache config in Ignite is PARTITIONED with backups = 0 (for historical reasons), so the defaults won't work.
There must be more than one baseline node (only baseline nodes store data!). Here is the doc.
You must stop the nodes in a graceful way. This is a bit tricky since you don't always control this.
If you stop with a kill to the process, make sure it uses SIGTERM and not SIGKILL because the later always kills the process immediately
If you stop with Ignite.close() this should just work
If you stop with Java System.exit() it'll work, but if you use System.halt() - it won't (because halt() is not graceful)
If you use orchestrators such as Kubernetes, you need to make sure they'll stop the nodes gracefully. For example, in Kubernetes you normally have to set terminationGracePeriodSeconds to a high value so that Kubernetes waits for the nodes to finish graceful shutdown instead of killing them.
If you use custom startup scripts, you need to make sure they forward signals to the Ignite process.
To debug this, check the points above. I would normally start by looking at the server logs (with IGNITE_QUIET=false!) to see if "Invoking shutdown hook" message is there. If it isn't there then your shutdown hook isn't getting called, and the problem is one of the points under 3. Otherwise, there should be other log messages explaining the situation.
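To make the first and third points concrete, here is a minimal Java sketch (an assumption on my part, not your actual config: the cache name myCache is a placeholder, and the programmatic ShutdownPolicy setting requires a reasonably recent Ignite/GridGain version) of a node whose cache and shutdown settings satisfy the conditions above:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.ShutdownPolicy;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class GracefulNode {
    public static void main(String[] args) {
        // PARTITIONED with backups > 0 (or REPLICATED) is required for GRACEFUL to help.
        CacheConfiguration<Integer, String> cacheCfg = new CacheConfiguration<>("myCache");
        cacheCfg.setCacheMode(CacheMode.PARTITIONED);
        cacheCfg.setBackups(1);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCacheConfiguration(cacheCfg);
        // Programmatic equivalent of <property name="shutdownPolicy" value="GRACEFUL"/>.
        cfg.setShutdownPolicy(ShutdownPolicy.GRACEFUL);

        Ignite ignite = Ignition.start(cfg);

        // ... application work ...

        // Ignite.close() (or a SIGTERM handled by Ignite's shutdown hook) takes the
        // graceful path; Runtime.halt() or SIGKILL bypasses it.
        ignite.close();
    }
}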

Related

Apache Ignite Force Server Mode

We are trying to prevent our application startups from just spinning if we cannot reach the remote cluster. From what I've read, force server mode states:
In this case, discovery will happen as if all the nodes in topology
were server nodes.
What I want to know is:
Does this client then permanently act as a server, which would run computes and store cache data?
If the connection to the cluster does not happen at first, could a later connection to an established cluster cause consistency issues? What would be the expected behavior with a topology version mismatch? Is there potential for a split-brain scenario?
No, it's still a client node, but it behaves as a server at the discovery protocol level. For example, it can start without any server nodes running.
A client node can never cause data inconsistency, as it never stores data. This does not depend on the forceServerMode flag.
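For reference, a minimal sketch of how the flag is usually set programmatically (assuming TcpDiscoverySpi-based discovery; note that forceServerMode is deprecated in recent Ignite versions):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class ForceServerModeClient {
    public static void main(String[] args) {
        TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
        // The client participates in the discovery ring like a server node,
        // so it can start even when no server nodes are reachable yet.
        discoverySpi.setForceServerMode(true);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(true); // still a client node: it stores no cache data
        cfg.setDiscoverySpi(discoverySpi);

        Ignite client = Ignition.start(cfg);
    }
}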

Restarting managed servers by clusters without outage

I want to write a script for restarting WebLogic managed servers, which would do the following:
It would contain a loop which would restart the first nodes of all clusters at one time:
a.) FORCE_SHUTDOWN
b.) wait for status: SHUTDOWN
c.) START managed servers
d.) wait for status: RUNNING
e.) move to the next node of each cluster and repeat until all managed servers are restarted.
So in the first iteration it would restart the first node of each cluster, in the second iteration the second node of each cluster, and so on until all managed servers have been restarted.
I have not started writing the script yet; I am a newbie with WebLogic and this is just a concept. Do you have any suggestions on how to achieve this goal?
Why reinvent the wheel?
rollingRestart
Category: Control Commands
Use with WLST: Online
Description: Initiates a rolling restart of all servers in a domain, or all servers in a specific cluster or clusters, without interrupting the service. This command provides the ability to sequentially restart servers.
This operation involves the graceful shutdown of the servers, with the servers being restarted without interrupting the service for the user.
Syntax
rollingRestart(target, [options])

Best Redis setup for session caching

I see there are multiple modes of operation for Redis (cluster, sentinel, master-slave, etc.). I don't fully understand the implications of each, but my question is this:
If I have a web application that requires distributed session persistence, which configuration of Redis makes the most sense? The main reason I'm using Redis is to achieve some level of fault tolerance. If one of my frontend servers fails, I want the sessions to be available for the other nodes to pick up the workload. If a Redis node goes down, I don't want this to affect the user experience, and I don't want to have to wake up a developer at midnight to correct the matter.
From everything I've read, Redis Sentinel is the way to go for fault tolerance.
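As an illustration of the Sentinel approach, here is a minimal sketch of a session store, assuming the Jedis Java client; the master group name mymaster, the sentinel host names, and the session key are placeholders:

import java.util.HashSet;
import java.util.Set;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisSentinelPool;

public class SessionStore {
    public static void main(String[] args) {
        // The application only knows the sentinel endpoints, not the current master;
        // Sentinel tells the pool which node is master and promotes a replica on failure.
        Set<String> sentinels = new HashSet<>();
        sentinels.add("sentinel-1:26379");
        sentinels.add("sentinel-2:26379");
        sentinels.add("sentinel-3:26379");

        try (JedisSentinelPool pool = new JedisSentinelPool("mymaster", sentinels);
             Jedis jedis = pool.getResource()) {
            // Store a session keyed by its ID with a 30-minute TTL so any frontend can read it.
            jedis.setex("session:abc123", 1800, "{\"userId\": 42}");
            System.out.println(jedis.get("session:abc123"));
        }
    }
}

The point of the Sentinel setup is that a master failover does not require any configuration change on the application side.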

rabbitmq cluster how to change active/active into active/passive mode?

I have set up a 2-node RabbitMQ cluster with one load balancer at the front end. After the setup, it was working in active/active mode. Then a network partition happened on one node; I took the failed node out of the cluster and rejoined it, but afterwards the failed node was not accepting any connections.
Then I moved the other node out of the balancer, and the recovered node began to accept connections, so the cluster is now in active/passive mode.
I don't know what caused this. Is there any way to change it back to active/active? And in which step is the mode specified during setup?
Thanks for your advice in advance!
rabbitmq really (really) doesn't like network partitions. By default, when you have one, everything pauses. In that situation you must fix it manually. Choosing the loser by stopping it and starting it should resume everything once it rejoins the cluster.
If that doesn't work, then shut down the failed node, and use rabbitmqctl to "forget_cluster_node", and then rejoin it to the cluster.
You should read this very carefully: https://www.rabbitmq.com/partitions.html, specifically the section "Recovering from a network partition".
Then read the next few paragraphs even more carefully. There are some automatic recovery modes, each with advantages and disadvantages.
At my company we chose autoheal because we value availability, and accept the possible loss of messages.

Detect cluster node failure Jboss AS 7.1.1-Final

I have configured a 2-node cluster in JBoss AS 7.1.1-Final and am planning to use sticky sessions. I am also recording the number of active online users in an Infinispan cache, along with the IP of the node where each user session was created, for reporting purposes.
I have taken care of the login/logout scenarios, where I clear the cache entries. The problem is that if one of the server nodes goes down, I also need a cleanup routine to clear that node's records from the cache.
One option is to write a client that checks at a specific interval whether each server is alive and, if not, triggers a cleanup routine. This approach would work, but I am looking for a cleaner approach: if a server node failure could be detected and the other live nodes notified, then I could trigger the cleanup.
From the console I know that it shows when a server goes down or comes up. But what would be the listener for such events? Any thoughts?
If you just need to know when a node leaves, from within some server module (inside the JBoss server), you can use the ViewChanged listener; see the sketch below.
You cannot get this information on clients connected via the REST or memcached protocols. With the HotRod protocol it is doable but pretty hackish: you'd have to override TransportFactory.updateServers (probably just extend TcpTransportFactory; see the configuration property infinispan.client.hotrod.transport_factory).
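As a rough sketch of that first option (the listener API is from Infinispan's embedded mode; the cleanup method is a placeholder you'd replace with your own cache maintenance):

import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachemanagerlistener.annotation.ViewChanged;
import org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent;
import org.infinispan.remoting.transport.Address;

@Listener
public class NodeFailureListener {

    @ViewChanged
    public void onViewChanged(ViewChangedEvent event) {
        // Nodes present in the old view but missing from the new one have left or crashed.
        for (Address member : event.getOldMembers()) {
            if (!event.getNewMembers().contains(member)) {
                cleanUpEntriesFor(member);
            }
        }
    }

    private void cleanUpEntriesFor(Address leftNode) {
        // Placeholder: remove the online-user records that were created on this node.
    }
}

The listener is registered on the cache manager, e.g. cacheManager.addListener(new NodeFailureListener()). Note that every node where it is registered will receive the view-change notification, so you may want to run the cleanup on only one of them (for example the coordinator) to avoid doing it several times.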