What happens to data that is pending write behind on ignite node failure with backups configured? - ignite

I have an Apache Ignite cluster with 3 server nodes. The cache is configured with 1 backup and the default PRIMARY_SYNC write synchronization mode. It also has write behind enabled, for writing the data through to the file system.
If a node fails, does any data that is pending write behind on it still get written through to the file system via its backup node?
Also, is there a way to only enable cache backups for data that is pending write behind, and not for data that has successfully been written through?
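For reference, a cache along the lines described here could be configured roughly as follows (a minimal sketch; the cache name and the FileStore class are placeholders for whatever actually writes the data to the file system):

    import javax.cache.Cache;
    import javax.cache.configuration.FactoryBuilder;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CacheWriteSynchronizationMode;
    import org.apache.ignite.cache.store.CacheStoreAdapter;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class WriteBehindSketch {
        /** Placeholder store; a real implementation would persist entries to the file system. */
        public static class FileStore extends CacheStoreAdapter<Long, String> {
            @Override public String load(Long key) { return null; }
            @Override public void write(Cache.Entry<? extends Long, ? extends String> e) { /* write to file */ }
            @Override public void delete(Object key) { /* remove from file */ }
        }

        public static void main(String[] args) {
            CacheConfiguration<Long, String> cacheCfg = new CacheConfiguration<>("myCache");
            cacheCfg.setBackups(1);                                                       // one backup copy per partition
            cacheCfg.setWriteSynchronizationMode(CacheWriteSynchronizationMode.PRIMARY_SYNC); // the default mode
            cacheCfg.setWriteThrough(true);
            cacheCfg.setWriteBehindEnabled(true);                                         // buffer writes, flush asynchronously
            cacheCfg.setWriteBehindFlushFrequency(5_000);                                 // flush pending entries every 5 s
            cacheCfg.setCacheStoreFactory(FactoryBuilder.factoryOf(FileStore.class));

            Ignite ignite = Ignition.start(new IgniteConfiguration().setCacheConfiguration(cacheCfg));
            ignite.getOrCreateCache(cacheCfg).put(1L, "value");
        }
    }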

Related

How to change Ignite to maintenance mode?

What is Ignite maintenance mode, and how do I switch a node into it? I am stuck: the node cannot join the cluster and complains that the persistent data needs to be cleaned up, yet the data can only be cleaned (using control.sh) while the node is in maintenance mode.
This is a special mode, similar to booting Windows into safe mode after a crash or data corruption: most of the node's functionality is disabled and the user is asked to perform some maintenance task to resolve the issue. The most straightforward example I can think of is cleaning (removing) corrupted files on disk, just like in your question. You can refer to the IEP-53: Maintenance Mode proposal for the details.
I don't think there is a way to enter this mode manually unless you trigger one of the preconfigured conditions, such as stopping a node in the middle of a checkpoint with the WAL disabled. Once the state is fixed, maintenance mode should resolve automatically, allowing the node to join the cluster.
Also, from my understanding, this mode applies to a particular node rather than to the whole cluster. I.e. you can have a 4-node cluster with only one node in maintenance mode; in that case, you have to run the control.sh commands locally against the affected node, not from another healthy node. If that's not the case, please provide more details or file a JIRA ticket, because the reported behavior looks quite broken to me.

GridGain Server partition loss

We have a 3-node GridGain server cluster and 3 client nodes deployed in GCP Kubernetes Engine. The cluster has native persistence enabled, uses <property name="shutdownPolicy" value="GRACEFUL"/> as the shutdown policy, and there is one backup for each cache. After an automatic cluster restart we get partition loss and have to reset those partitions by executing control commands.
Can you suggest a proper solution for this? We have around 60GB of persistent data.
<property name="shutdownPolicy" value="GRACEFUL"/> is supposed to protect from partition loss if certain conditions are met:
The caches must be either PARTITIONED with backups > 0 or REPLICATED. Check your configs. Default cache config in Ignite is PARTITIONED with backups = 0 (for historical reasons), so the defaults won't work.
There must be more than one baseline node (only baseline nodes store data!). Here is the doc.
You must stop the nodes in a graceful way. This is a bit tricky since you don't always control this.
If you stop by killing the process, make sure it uses SIGTERM and not SIGKILL, because the latter always kills the process immediately
If you stop with Ignite.close() this should just work
If you stop with Java System.exit() it'll work, but if you use Runtime.halt() it won't (because halt() skips shutdown hooks and is therefore not graceful)
If you use orchestrators such as Kubernetes, you need to make sure they'll stop the nodes gracefully. For example, in Kubernetes you normally have to set terminationGracePeriodSeconds to a high value so that Kubernetes waits for the nodes to finish graceful shutdown instead of killing them.
If you use custom startup scripts, you need to make sure they forward signals to the Ignite process.
To debug this, check the points above. I would normally start by looking at the server logs (with IGNITE_QUIET=false!) to see if the "Invoking shutdown hook" message is there. If it isn't, your shutdown hook isn't getting called, and the problem is one of the sub-points under the third condition (graceful stop). Otherwise, there should be other log messages explaining the situation.
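As a rough illustration of the first condition and of stopping gracefully, a persistence-enabled node with a PARTITIONED cache, one backup, and the GRACEFUL shutdown policy can be configured programmatically along these lines (a minimal sketch; setShutdownPolicy assumes an Ignite/GridGain version that ships the ShutdownPolicy enum, and the cache name is a placeholder):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.ShutdownPolicy;
    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class GracefulNodeSketch {
        public static void main(String[] args) {
            // Native persistence on the default data region (baseline topology applies).
            DataStorageConfiguration storageCfg = new DataStorageConfiguration();
            storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

            // PARTITIONED with backups > 0; the Ignite default of backups = 0 would not qualify.
            CacheConfiguration<?, ?> cacheCfg = new CacheConfiguration<>("myCache")
                .setCacheMode(CacheMode.PARTITIONED)
                .setBackups(1);

            IgniteConfiguration cfg = new IgniteConfiguration()
                .setDataStorageConfiguration(storageCfg)
                .setShutdownPolicy(ShutdownPolicy.GRACEFUL) // same as the XML property above
                .setCacheConfiguration(cacheCfg);

            Ignite ignite = Ignition.start(cfg);
            // ... cluster activation and normal work omitted ...

            // Graceful stop; SIGTERM triggers the same shutdown hook, SIGKILL bypasses it.
            ignite.close();
        }
    }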

Redis cluster node failure not detected on MISCONF

We currently have a Redis cache cluster with 3 masters and 3 slaves hosted on 3 Windows servers (1 master and 1 slave per server). We are using StackExchange.Redis as our client.
We have RDB disabled but AOF enabled and are experiencing some problems with the cluster in the following situation:
One of our servers became full and the redis node on this server was unable to write to the AOF file (the error returned to the client was MISCONF Errors writing to the AOF file: No space left on device).
The cluster did not detect that the node was failing and so did not exclude it from the cluster.
All cache operations were blocked until we freed up some space on the server.
We know that we don't need the AOF, so we disabled it after the incident.
But we would like to confirm or refute our understanding of Redis clustering: in our view, if a node experiences a failure, the cluster should redirect all requests to another node. We have tested that when a master node is stopped, a slave is promoted to master, so we are confident that our cluster is working; but we are not sure why, in our case, the node was not marked as failing.
Is the cluster capable of detecting a node failure when the failure only manifests when a client makes a request to the cluster?

How to clean Apache Ignite caches and sort of start over?

I have a 3-node Ignite cluster and 1 client that creates the cache. During development and testing I had to stop the cluster or interrupt the cache building several times, and the entire system is broken now. Only one node starts, the other nodes crash, and the client is blocked and does not do anything.
Is there any way to clean everything and sort of start fresh?
I am using Ignite 2.1 and using Persistent Cache storage.
Thank you for your help.
Just delete the Ignite work directory - by default, it's ${IGNITE_HOME}/work.
Also, if you configured WAL store path, you need to clean it too:
https://apacheignite.readme.io/docs/distributed-persistent-store#section-write-ahead-log
Note: All data in persistent store will be lost.
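If you are not sure which directories hold the persistent data, they are the ones set (explicitly or by default) in the storage configuration. Below is a sketch using the newer DataStorageConfiguration API (Ignite 2.1 used PersistentStoreConfiguration instead); the paths are examples only:

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class WorkDirSketch {
        public static void main(String[] args) {
            // Directories to wipe (on every node, with the cluster stopped) for a clean start.
            DataStorageConfiguration storageCfg = new DataStorageConfiguration()
                .setWalPath("/data/ignite/wal")                 // WAL store path, if customized
                .setWalArchivePath("/data/ignite/wal-archive"); // WAL archive, if customized
            storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

            IgniteConfiguration cfg = new IgniteConfiguration()
                .setWorkDirectory("/data/ignite/work")          // ${IGNITE_HOME}/work by default
                .setDataStorageConfiguration(storageCfg);

            Ignition.start(cfg);
        }
    }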

ActiveMQ takes a long time to failover

I have 3 ActiveMQ brokers in a networked Shared File System (GlusterFS) / Master-Slave configuration - all in VMs.
If the master fails the client should failover to the new master.
The issue I have is that the connection to the new master takes about 50 seconds.
Is that reasonable?
How to improve it?
My client connection looks like this
failover:(tcp://a1:61616?connectionTimeout=1000,tcp://a2:61616?connectionTimeout=1000,tcp://a3:61616?connectionTimeout=1000)?randomize=false&maxReconnectDelay=10000&backup=true
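(For reference, a minimal Java client using that failover URI might look like the sketch below; the broker host names a1..a3 come from the URI above.)

    import javax.jms.Connection;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class FailoverClientSketch {
        public static void main(String[] args) throws Exception {
            // The same failover URI as above; the options after the closing ')' tune reconnect behaviour.
            String url = "failover:(tcp://a1:61616?connectionTimeout=1000,"
                       + "tcp://a2:61616?connectionTimeout=1000,"
                       + "tcp://a3:61616?connectionTimeout=1000)"
                       + "?randomize=false&maxReconnectDelay=10000&backup=true";

            ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(url);
            Connection connection = factory.createConnection();
            connection.start();  // the failover transport reconnects transparently if the master goes down
            // ... create sessions, producers and consumers as usual ...
            connection.close();
        }
    }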
Also, when the master is disconnected by unplugging the network cable, it stops, throws an exception regarding KahaDB (which is on GlusterFS), and needs to be restarted.
Is there a workaround for this behavior so the master broker auto-restarts or is able to connect automatically once the network comes back?
The failover time depends on how long the underlying file system takes to release the file lock.
In your case, the NFS cluster waits 50s to detect that the first node is lost and only then releases the lock on the KahaDB file, which can then be taken by the second node.
You can customize this delay with the NFSD_V4_GRACE and NFSD_V4_LEASE parameters in the NFS server configuration file (/etc/sysconfig/nfs on redhat/centos systems).
You can also customize the kahadb lockKeepAlivePeriod, see http://activemq.apache.org/pluggable-storage-lockers.html
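If the broker is embedded and configured programmatically rather than via XML, the lock-related settings mentioned above can be set roughly as in the sketch below (the KahaDB path is an example; the XML equivalents are the lockKeepAlivePeriod attribute on kahaDB and lockAcquireSleepInterval on the locker):

    import java.io.File;
    import org.apache.activemq.broker.BrokerService;
    import org.apache.activemq.store.SharedFileLocker;
    import org.apache.activemq.store.kahadb.KahaDBPersistenceAdapter;

    public class SharedStorageBrokerSketch {
        public static void main(String[] args) throws Exception {
            // KahaDB on the shared (GlusterFS/NFS) mount; the path is an example.
            KahaDBPersistenceAdapter kahaDB = new KahaDBPersistenceAdapter();
            kahaDB.setDirectory(new File("/mnt/gluster/activemq/kahadb"));

            // SharedFileLocker is the default for KahaDB; shown explicitly to expose its tuning knob.
            SharedFileLocker locker = new SharedFileLocker();
            locker.setLockAcquireSleepInterval(5000); // how often a slave retries the lock, ms
            kahaDB.setLocker(locker);
            kahaDB.setLockKeepAlivePeriod(5000);      // how often the master re-asserts the lock, ms

            BrokerService broker = new BrokerService();
            broker.setPersistenceAdapter(kahaDB);
            broker.start();
            broker.waitUntilStopped();
        }
    }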