How to clean Apache Ignite caches and sort of start over?

I have a 3-node Ignite cluster and 1 client that creates a cache. During development and testing, I had to stop the cluster or interrupt the cache building several times, and the entire system is broken now. Only one node starts and the other nodes crash. The client is blocked and does not do anything.
Is there any way to clean everything and sort of start fresh?
I am using Ignite 2.1 and using Persistent Cache storage.
Thank you for your help.

Just delete the Ignite work directory - by default, it's ${IGNITE_HOME}/work.
Also, if you configured a WAL store path, you need to clean it too:
https://apacheignite.readme.io/docs/distributed-persistent-store#section-write-ahead-log
Note: all data in the persistent store will be lost.
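If you would rather script the cleanup than delete the directories by hand, here is a minimal Java sketch. It assumes the default work directory under ${IGNITE_HOME} and that all nodes are stopped first; if your WAL store or WAL archive lives outside the work directory, add those paths to the list as well.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class CleanIgnitePersistence {
    public static void main(String[] args) throws IOException {
        // Default location; adjust if you set a custom work directory in IgniteConfiguration.
        Path workDir = Paths.get(System.getenv("IGNITE_HOME"), "work");

        // Add your WAL store / WAL archive paths here if they are configured elsewhere.
        Path[] dirsToWipe = { workDir };

        for (Path dir : dirsToWipe) {
            if (!Files.exists(dir))
                continue;

            // Walk the tree bottom-up so files are removed before their parent directories.
            try (Stream<Path> paths = Files.walk(dir)) {
                paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                    try {
                        Files.delete(p);
                    }
                    catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }

        System.out.println("Persistence files removed - the cluster will start from scratch.");
    }
}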

Related

Rolling restart Ignite full memory mode on K8s without losing data

I am running Ignite on a k8s cluster with 5 pods and have set backups = 1 (https://apacheignite.readme.io/docs/primary-and-backup-copies). Is there any way to do a rolling restart without losing data, and how do I check that the data is synced to the other instances before restarting the nodes one after another?
Thank you
You can monitor the cache rebalancing process via JMX:
https://ignite.apache.org/docs/latest/monitoring-metrics/metrics#monitoring-rebalancing
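As a rough illustration, the rebalancing metrics can be polled with a plain JMX client before you restart the next pod. The sketch below is based on the metrics page linked above; the JMX port, the org.apache domain and the RebalancingPartitionsLeft attribute name are assumptions you should verify against your own nodes (for example with jconsole). The rule of thumb: only restart the next pod once the remaining-partitions counter is 0 everywhere.

import java.util.Set;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RebalanceCheck {
    public static void main(String[] args) throws Exception {
        // JMX endpoint of one Ignite pod (placeholder host/port).
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:49112/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();

            // Scan the Ignite MBeans and print every "RebalancingPartitionsLeft" attribute found.
            Set<ObjectName> names = conn.queryNames(new ObjectName("org.apache:*"), null);

            for (ObjectName name : names) {
                MBeanInfo info = conn.getMBeanInfo(name);

                for (MBeanAttributeInfo attr : info.getAttributes()) {
                    if ("RebalancingPartitionsLeft".equals(attr.getName())) {
                        Object left = conn.getAttribute(name, "RebalancingPartitionsLeft");
                        System.out.println(name + " -> partitions left to rebalance: " + left);
                    }
                }
            }
        }
    }
}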

Redis Cluster Single Node issue

Can Redis run with cluster mode enabled but with a single instance? It seems we get a cluster status of "fail" when we try to deploy with a single node, because no slots are added to it.
https://serverfault.com/a/815813/510599
I understand we can manually add slots after deployment, but I wonder if we can modify the Redis source code to make this change.
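For reference, the manual slot assignment mentioned above (the redis-cli cluster addslots route from the linked answer) can also be scripted. Here is a minimal sketch using the Jedis client; the host and port are placeholders, and it assumes the node was started with cluster-enabled yes.

import java.util.stream.IntStream;
import redis.clients.jedis.Jedis;

public class AssignAllSlots {
    public static void main(String[] args) {
        try (Jedis node = new Jedis("127.0.0.1", 6379)) {
            // Redis Cluster has 16384 hash slots; assign all of them to this single node
            // so the cluster state can move from "fail" to "ok".
            int[] allSlots = IntStream.range(0, 16384).toArray();
            node.clusterAddSlots(allSlots);

            // Should eventually report cluster_state:ok once the node has applied the slots.
            System.out.println(node.clusterInfo());
        }
    }
}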

How to change Ignite to maintenance mode?

What is Ignite maintenance mode, and how do I switch a node to this mode? I was stuck joining a node to the cluster: it complains about cleaning up the persistent data, yet the data can be cleaned (using control.sh) only in maintenance mode.
This is a special mode, similar to running Windows in safe mode after a crash or data corruption: most of the cluster functionality is disabled and the user is asked to perform some maintenance task to resolve the issue. The most straightforward example I can think of is cleaning (removing) some corrupted files on disk, just like in your question. You can refer to the IEP-53: Maintenance Mode proposal for the details.
I don't think that there is a way to enter this mode manually unless you trigger some preconfigured conditions like stopping a node in the middle of checkpointing with WAL disabled. Once the state is fixed, maintenance mode should be resolved automatically allowing a node to join the cluster.
Also, from my understanding, this mode applies to a particular node rather than the complete cluster. I.e. you can have a 4-node cluster with only 1 node in maintenance mode; in that case, you have to run the control.sh commands locally against the concrete failed node, not from another healthy node. If that's not the case, please provide more details or file a JIRA ticket, because the reported behavior looks quite broken to me.

GridGain Server partition loss

We have a 3-node GridGain server cluster and 3 client nodes deployed in GCP Kubernetes Engine. The cluster has native persistence enabled and uses <property name="shutdownPolicy" value="GRACEFUL"/> as the shutdown policy. There is one backup for each cache. After an automatic cluster restart we get partition loss and need to reset these partitions by executing control commands.
Can you suggest a proper solution for this? We have around 60 GB of persistent data.
<property name="shutdownPolicy" value="GRACEFUL"/> is supposed to protect from partition loss if certain conditions are met:
1. The caches must be either PARTITIONED with backups > 0 or REPLICATED. Check your configs. The default cache config in Ignite is PARTITIONED with backups = 0 (for historical reasons), so the defaults won't work - see the sketch after this list.
2. There must be more than one baseline node (only baseline nodes store data!). Here is the doc.
3. You must stop the nodes in a graceful way. This is a bit tricky since you don't always control this.
- If you stop with a kill to the process, make sure it uses SIGTERM and not SIGKILL, because the latter always kills the process immediately.
- If you stop with Ignite.close(), this should just work.
- If you stop with Java's System.exit() it'll work, but if you use Runtime.getRuntime().halt() it won't (because halt() is not graceful).
- If you use orchestrators such as Kubernetes, you need to make sure they stop the nodes gracefully. For example, in Kubernetes you normally have to set terminationGracePeriodSeconds to a high value so that Kubernetes waits for the nodes to finish the graceful shutdown instead of killing them.
- If you use custom startup scripts, you need to make sure they forward signals to the Ignite process.
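To make conditions 1 and 3 concrete, here is a minimal Java sketch of a node that satisfies both. The cache name is a placeholder, and it only mirrors in code what your Spring XML presumably configures (including the GRACEFUL shutdown policy from the question); treat it as an illustration rather than your exact setup.

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.ShutdownPolicy;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class GracefulNode {
    public static void main(String[] args) {
        // Condition 1: PARTITIONED cache with at least one backup.
        // The default is backups = 0, which cannot survive losing a node.
        CacheConfiguration<Integer, String> cacheCfg = new CacheConfiguration<>("myCache");
        cacheCfg.setCacheMode(CacheMode.PARTITIONED);
        cacheCfg.setBackups(1);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCacheConfiguration(cacheCfg);
        // Same as <property name="shutdownPolicy" value="GRACEFUL"/> in the XML config.
        cfg.setShutdownPolicy(ShutdownPolicy.GRACEFUL);

        Ignite ignite = Ignition.start(cfg);

        // ... application work ...

        // Condition 3: stop gracefully. Ignite.close() (or a SIGTERM that reaches the JVM
        // shutdown hook) lets the node complete the graceful shutdown; SIGKILL or
        // Runtime.getRuntime().halt() does not.
        ignite.close();
    }
}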
To debug this, check the points above. I would normally start by looking at the server logs (with IGNITE_QUIET=false!) to see if the "Invoking shutdown hook" message is there. If it isn't, then your shutdown hook isn't getting called and the problem is one of the sub-points under 3. Otherwise, there should be other log messages explaining the situation.

Migrate a (storm+nimbus) cluster to a new Zookeeper, without losing the information or having downtime

I have a nimbus+storm cluster using Zookeeper, and I wish to move my cluster and point it to a new Zookeeper. Do you know if this is possible? Can I keep all the information of the old zookeeper and save it in the new one? Is it possible to do it without downtime?
I have looked in the internet for this procedure but I have not found much.
Would it be as simple as changing the storm.yml file on both the master and worker nodes? Do I need a restart afterwards?
# storm.zookeeper.servers:
# - "server1"
# - "server2"
If you just change storm.yml, you'd be pointing Storm at a new empty Zookeeper cluster, and it will be like you just installed Storm from scratch. More likely, you want to grow your Zookeeper cluster to include your new machines, then update storm.yml to point at the new machines, then shrink the cluster to exclude the machines you want to move away from. That way, your Zookeeper quorum is preserved even though you've moved to other physical machines.
This is easier to do on Zookeeper 3.5 with dynamic reconfiguration http://zookeeper.apache.org/doc/r3.5.5/zookeeperReconfig.html. I'm unsure whether Storm will run on Zookeeper 3.5, but you may consider investigating whether you can upgrade to 3.5 before growing/shrinking the cluster.
Otherwise you will have to do a rolling restart to add the new Zookeeper nodes, then do another one to remove the old machines once the cluster has stabilized.
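If you do end up on ZooKeeper 3.5+, the grow-then-shrink steps can also be driven programmatically through the reconfig API instead of editing config files by hand. Below is a rough Java sketch; the server names, ports and ids are placeholders, and it assumes reconfig is enabled (reconfigEnabled=true) and that your session is authorized to run it.

import org.apache.zookeeper.admin.ZooKeeperAdmin;
import org.apache.zookeeper.data.Stat;

public class GrowThenShrink {
    public static void main(String[] args) throws Exception {
        // Connect to the existing ensemble (placeholder connect string).
        ZooKeeperAdmin admin =
            new ZooKeeperAdmin("server1:2181,server2:2181", 30_000, event -> { });

        try {
            // Step 1: grow the ensemble with the new machines.
            String joining = "server.4=newzk1:2888:3888;2181,server.5=newzk2:2888:3888;2181";
            admin.reconfigure(joining, null, null, -1, new Stat());

            // Point storm.yml at the new servers and roll the Storm daemons here,
            // then wait for the ensemble to stabilize.

            // Step 2: shrink the ensemble by removing the old machines (by server id).
            admin.reconfigure(null, "1,2", null, -1, new Stat());
        }
        finally {
            admin.close();
        }
    }
}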
Let me suggest a hack here. This is a script provided by Microsoft for migration of HDInsight clusters, but you can change it and use it for your needs.
The script can be downloaded from https://github.com/hdinsight/hdinsight-storm-examples/tree/master/tools/zkdatatool-1.0 and you can read more about it here:
https://blogs.msdn.microsoft.com/azuredatalake/2017/02/24/restarting-storm-eventhub/
I have used it in the past when I had to migrate some things between PaaS clusters, and I can confirm it works OK!