The Infinispan cluster node expiration result is inconsistent with the official documentation - infinispan

When I tested cache expiration on an Infinispan cluster, I found that when an entry reached its maximum idle time on one node, that node did not check the other cluster members for "the last time the entry is accessed"; it simply invalidated its local copy. For example: I started two nodes, A and B, and set the cache's maximum idle time to 10s. At the start of the test I sent a request to node A, which read a record from the database and wrote it to the cache; node A then replicated the entry to node B. At 5s I accessed the entry on node A, and after 10s I accessed it on node B. The entry on node B had already expired: node B re-read the record from the database, wrote it to the cache, and replicated it to the other nodes, instead of treating the cached entry as still valid.
Why does this differ from the description in the documentation? http://infinispan.org/docs/stable/user_guide/user_guide.html#expiration_details
My cluster cache expiration configuration is as follows:
Configuration c = new ConfigurationBuilder()
    .expiration().enableReaper().wakeUpInterval(50000L).maxIdle(10000L).build();

It sounds like you are using an older version of Infinispan. Cluster-wide max-idle expiration wasn't introduced until 9.3 in https://issues.jboss.org/browse/ISPN-9003. If this issue still persists with 9.3 or newer, you can log a bug at https://issues.jboss.org/projects/ISPN.
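For anyone on 9.3 or newer, here is a minimal sketch of what a clustered max-idle configuration could look like with the programmatic builder. The cache mode and the timing values are illustrative assumptions, not taken from the question:

```java
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

// Replicated cache whose entries expire 10s after the last access on ANY
// node (cluster-wide max-idle requires Infinispan 9.3+, see ISPN-9003).
Configuration cfg = new ConfigurationBuilder()
    .clustering().cacheMode(CacheMode.REPL_SYNC)
    .expiration()
        .maxIdle(10_000L)          // 10s maximum idle time
        .enableReaper()
        .wakeUpInterval(50_000L)   // background reaper runs every 50s
    .build();
```

Note that with a reaper interval (50s) longer than the max-idle time (10s), expired entries are still rejected on access; the reaper only controls how promptly idle entries are physically removed in the background.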

Related

Apache Ignite Partitioned Mode with 1 backup copy: updates to cache do not get reflected in both partitions?

I have an Apache Ignite cluster with 5 nodes, running in PARTITIONED mode with 1 backup copy for each primary partition (also configured to read from the backup if it's on the local node).
Updates to data in one of the caches are received from a Kafka topic; the updates are processed and the cache is reloaded as required.
However, I occasionally observe that when I request data from the cache, I get the correct updated data a handful of times, alternating with the stale pre-update data.
It seems to me that something fails when syncing between the primary and backup node on update (the configuration is FULL_SYNC, so it is not an async issue). However, I can't spot any errors in the logs suggesting anything like this.
How can I determine if this is the cause of the issue? What else may be going wrong to cause this behaviour?
Running on Ignite 2.9.1
Thanks

Ignite Partition Loss Immediately After Cache Creation

I am using Ignite 2.8.1 in a 4-node cluster with persistence enabled. I was attempting a rolling restart of the cluster, but during that process the cluster ended up with partition loss, seemingly all on one node. I am using the READ_ONLY_SAFE policy. From that point on, even though all the nodes came back up, about 1 in 8 newly created caches would immediately have partition loss: we would create the cache, query it one second later, and the queries would fail with "Failed to execute query because cache partition has been lost". How can partitions be lost immediately after creation if no cluster event, such as a node leaving the cluster, has happened?
Partitions for newly created caches may be lost if the cluster has nodes absent from the baseline topology or existing caches in a "lost partition" state.
This preserves affinity collocation: since partitions of two caches with the same affinity configuration must be collocated on the same nodes, there is nowhere to put these "extra" partitions for the newly created caches.
You need to reset the lost partitions first.
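As a sketch, once all baseline nodes are back online, the lost-partition state can be acknowledged from the public API (the cache name below is a placeholder):

```java
import java.util.Collections;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

// Acknowledge the partition loss so the affected caches (and caches
// created afterwards) become fully usable again. "myCache" is a
// placeholder for the actual cache name.
Ignite ignite = Ignition.ignite();
ignite.resetLostPartitions(Collections.singleton("myCache"));
```

The control script offers the same operation from the command line via `control.sh --cache reset_lost_partitions <cacheName>`.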

Apache Ignite Cache Read Issue (All affinity node left the grid)

Ignite Version: 2.5
Ignite Cluster Size: 10 nodes
One of our Spark jobs writes data to an Ignite cache every hour; the total is 530 million records per hour. Another Spark job reads the cache, but when it tries to read it, we get the error "Failed to execute the query (all affinity nodes left the grid)".
Any pointers will be helpful.
If you are using an "embedded mode" deployment, the server nodes are started when jobs run and stopped when jobs finish. If you do not have enough backups, you will lose data when this happens. Any chance this may be your problem? Be sure to connect to the Ignite cluster with client=true.
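A minimal sketch of joining the cluster as a client node instead of embedding a server (the discovery addresses are placeholders):

```java
import java.util.Arrays;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

IgniteConfiguration cfg = new IgniteConfiguration();
// Join as a client: holds no cache data, so its lifecycle does not
// affect data availability when the Spark job starts and stops.
cfg.setClientMode(true);

TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
ipFinder.setAddresses(Arrays.asList("host1:47500", "host2:47500")); // placeholders
cfg.setDiscoverySpi(new TcpDiscoverySpi().setIpFinder(ipFinder));

Ignite client = Ignition.start(cfg);
```

This way the long-lived server nodes keep the data, and the Spark jobs merely attach and detach.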

Get distributed query data from Ignite local caches in cluster

I have an Ignite cache named "IgniteCache" on each node in a cluster of 2 servers, with LOCAL mode enabled. A number of entries are loaded into these local caches. I then started a separate client node which queries data from "IgniteCache" on the cluster. But whenever I query the data, I get a null result instead of data from both server nodes.
This happens because local caches are not distributed across nodes. When you query a LOCAL cache, you will only see data stored locally on the same node. You don't have any data on the client, so the result is empty.
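To make the data visible from the client, the cache has to be created with a distributed mode rather than LOCAL. A sketch, reusing the cache name from the question (the mode and backup count are illustrative choices):

```java
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

// LOCAL: each node keeps a private, unshared copy -- a client sees nothing.
// PARTITIONED (or REPLICATED): data is visible cluster-wide.
CacheConfiguration<String, Object> ccfg = new CacheConfiguration<>("IgniteCache");
ccfg.setCacheMode(CacheMode.PARTITIONED);
ccfg.setBackups(1); // optional redundancy
```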

Couchbase node failure

My understanding could be amiss here. As I understand it, Couchbase uses a smart client to automatically select which node to write to or read from in a cluster. What I DON'T understand is, when this data is written/read, is it also immediately written to all other nodes? If so, in the event of a node failure, how does Couchbase know to use a different node from the one that was 'marked as the master' for the current operation/key? Do you lose data in the event that one of your nodes fails?
This sentence from the Couchbase Server Manual gives me the impression that you do lose data (which would make Couchbase unsuitable for high availability requirements):
With fewer larger nodes, in case of a node failure the impact to the
application will be greater
Thank you in advance for your time :)
By default, when data is written to Couchbase, the client returns success as soon as the data is written to one node's memory. Couchbase then persists it to disk and replicates it.
If you want to ensure that data is persisted to disk, most client libraries provide functions for that. With those functions you can also ensure that data is replicated to another node. This mechanism is called observe.
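A sketch of what this looks like with the Couchbase Java SDK 2.x, whose durability overloads supersede the older explicit observe calls; the address, bucket, and document names are placeholders:

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.PersistTo;
import com.couchbase.client.java.ReplicateTo;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

Cluster cluster = CouchbaseCluster.create("127.0.0.1"); // placeholder address
Bucket bucket = cluster.openBucket("default");

JsonDocument doc = JsonDocument.create("user::42",
    JsonObject.create().put("name", "Ann"));

// Block until the write is persisted to disk on the active node
// AND replicated to at least one replica.
bucket.upsert(doc, PersistTo.MASTER, ReplicateTo.ONE);
```

Stronger durability requirements mean higher write latency, so they are typically applied only to writes that must survive a node failure.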
When one node goes down, it should be failed over. Couchbase Server can do this automatically if an auto-failover timeout is set in the server settings. For example, if you have a 3-node cluster, the stored data has 2 replicas, and one node goes down, you will not lose data. If a second node fails, you still will not lose all the data: it will be available on the last node.
If a node holding the active copy goes down and is failed over, another live node becomes the active owner. Your client points to all the servers in the cluster, so if it is unable to retrieve data from one node, it tries to get it from another.
Also, if you only have 2 nodes at your disposal, you can install 2 separate Couchbase servers, configure XDCR (cross datacenter replication), and check server availability with HA proxies or something similar. That way you get a single IP to connect to (the proxy's), which automatically serves data from the live server.
Couchbase is in fact a good fit for HA systems.
Let me explain in a few sentences how it works. Suppose you have a 5-node cluster. The application, through the client API/SDK, is always aware of the cluster topology (and of any change to it).
When you set/get a document in the cluster, the client API uses the same algorithm as the server to choose which node the document should be written to: it selects the node using a CRC32 hash of the key and writes to that node. The cluster then asynchronously copies 1 or more replicas to the other nodes (depending on your configuration).
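The hashing step can be sketched in plain Java: the key is CRC32-hashed onto one of 1024 vBuckets, and each vBucket is assigned to a node by a map the client receives from the cluster. The shift and masking constants below follow the commonly documented libvbucket scheme and are meant as an illustration, not as the exact client implementation:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class VBucketSketch {
    static final int NUM_VBUCKETS = 1024;

    // Deterministic: the same key always maps to the same vBucket,
    // hence to the same (active) node, regardless of which client
    // computes it.
    static int vbucketFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        int digest = (int) ((crc.getValue() >> 16) & 0x7fff);
        return digest & (NUM_VBUCKETS - 1); // always in [0, 1023]
    }

    public static void main(String[] args) {
        System.out.println("user::42 -> vBucket " + vbucketFor("user::42"));
    }
}
```

Because client and server compute the same mapping, the client can route each operation directly to the right node without asking the cluster first.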
Couchbase keeps only one active copy of a document at a time, so it is easy to stay consistent: applications get and set against this active copy.
In case of failure, the server has some work to do. Once the failure is detected (automatically or by a monitoring system), a failover occurs: the replicas are promoted to active, and it is then possible to work as before. Usually you then rebalance the cluster to distribute the load properly.
The sentence you are quoting simply means that the fewer nodes you have, the bigger the impact of a failure/rebalance, since the same number of requests has to be routed to a smaller number of nodes. Hopefully you do not lose data ;)
You can find some very detailed information about this way of working on the Couchbase CTO's blog:
http://damienkatz.net/2013/05/dynamo_sure_works_hard.html
Note: I am working as developer evangelist at Couchbase