Cache partition not replicated - ignite

I have 2 nodes with the persistence enabled. I create a cache like so
// all the queues across the frontier instances
CacheConfiguration cacheCfg2 = new CacheConfiguration("queues");
cacheCfg2.setBackups(backups);
cacheCfg2.setCacheMode(CacheMode.PARTITIONED);
globalQueueCache = ignite.getOrCreateCache(cacheCfg2);
where backups is a value > 1
When one of the nodes dies, I get
Exception in thread "Thread-2" javax.cache.CacheException: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostParts [cacheName=queues, part=2]
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryAdapter.executeScanQuery(GridCacheQueryAdapter.java:597)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl$1.applyx(IgniteCacheProxyImpl.java:519)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl$1.applyx(IgniteCacheProxyImpl.java:517)
at org.apache.ignite.internal.util.lang.IgniteOutClosureX.apply(IgniteOutClosureX.java:36)
at org.apache.ignite.internal.processors.query.GridQueryProcessor.executeQuery(GridQueryProcessor.java:3482)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.query(IgniteCacheProxyImpl.java:516)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.query(IgniteCacheProxyImpl.java:843)
at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.query(GatewayProtectedCacheProxy.java:418)
at crawlercommons.urlfrontier.service.ignite.IgniteService$QueueCheck.run(IgniteService.java:270)
Caused by: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostParts [cacheName=queues, part=2]
... 9 more
I expected the content to have been replicated onto the other node. Why isn't that the case?

Most likely there is a misconfiguration somewhere. Check the following:
you are not working with a pre-existing cache that was created earlier with a different configuration (replace getOrCreateCache with createCache to rule this out)
the backup factor actually covers the nodes that can fail (with 2 server nodes, backups must be at least 1)
inspect the logs for the "Detected lost partitions" message and what happened prior to it
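For reference, here is a minimal sketch of an explicit cache creation plus one possible recovery path when partitions are reported lost. The key/value types, the loss policy, and the reset call are illustrative assumptions, not something prescribed by the original answer:
import java.util.Collections;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;

// 'ignite' is the running Ignite instance, as in the question.
// Create the cache explicitly so a stale configuration cannot be picked up,
// and keep one backup so either of the two nodes can fail without data loss.
CacheConfiguration<String, Object> cacheCfg = new CacheConfiguration<>("queues");
cacheCfg.setCacheMode(CacheMode.PARTITIONED);
cacheCfg.setBackups(1);
cacheCfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE); // fail fast instead of serving partial data
IgniteCache<String, Object> queues = ignite.createCache(cacheCfg);

// If partitions were already lost (e.g. the cache existed before with backups=0),
// acknowledge the loss once the data has been restored or accepted as gone:
if (!queues.lostPartitions().isEmpty())
    ignite.resetLostPartitions(Collections.singleton("queues"));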

Related

Spark - Failed to load collect frame - "RetryingBlockFetcher - Exception while beginning fetch"

We have a Scala Spark application that reads roughly 70K records from the DB into a data frame; each record has 2 fields.
After reading the data from the DB, we apply a small mapping and load the result as a broadcast variable for later use.
In the local environment, the following code fails with a timeout exception from the RetryingBlockFetcher:
dataframe.select("id", "mapping_id")
.rdd.map(row => row.getString(0) -> row.getLong(1))
.collectAsMap().toMap
The exception is:
2022-06-06 10:08:13.077 task-result-getter-2 ERROR org.apache.spark.network.shuffle.RetryingBlockFetcher Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /1.1.1.1:62788
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:122)
In the local environment I simply create the Spark session with "spark.master" set to local.
When I limit the records to a maximum of 20K, it works well.
Can you please help? Maybe I need to configure something in my local environment so that the original code works properly?
Update:
I tried changing a lot of Spark-related configurations in my local environment (memory, the number of executors, timeout-related settings, and more), but nothing helped; I just got the timeout after a longer wait.
I realized that the data frame I'm reading from the DB has a single partition of 62K records; after repartitioning it into 2 or more partitions the process worked correctly and I managed to map and collect as needed.
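For clarity, the repartition-before-collect variant looks roughly like this. It is only a sketch, written with Spark's Java API for consistency with the rest of this page (in the original Scala code the equivalent is a repartition call before the .rdd step); the partition count of 4 is arbitrary:
import java.util.Map;
import scala.Tuple2;

// "dataframe" is the Dataset<Row> read from the DB, as in the question.
Map<String, Long> mapping = dataframe
    .select("id", "mapping_id")
    .repartition(4) // spread the ~62K rows over several partitions instead of one
    .javaRDD()
    .mapToPair(row -> new Tuple2<String, Long>(row.getString(0), row.getLong(1)))
    .collectAsMap();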
Any idea why this solves the issue? Is there a Spark configuration that can solve this instead of repartitioning?
Thanks!

How to disable all operations on ignite cache when topology is not valid

I have 2 server nodes and one client node. I am using a TopologyValidator to validate the topology.
If any server node leaves the cluster, I want to disable all operations. TopologyValidator disables only update operations, not get operations. Can you help me do this?
Currently TopologyValidator disables update operations only.
You can use the IgniteCache#close() operation to disable all operations on specific caches.
See: https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/IgniteCache.html#close--
If you do the following:
IgniteCache cache = ignite.getOrCreateCache(config);
cache.put(1L, new Person(1L, "A", "B"));
cache.close();
System.out.println(cache.get(1L)); //exception here.
you will get the following exception on the get call:
[INFO ][exchange-worker-#43%node1%][GridCacheProcessor] Finish proxy initialization, cacheName=test1, localNodeId=...
Exception in thread "main" java.lang.IllegalStateException: Cache has been closed: test1
at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.checkProxyIsValid(GatewayProtectedCacheProxy.java:1548)
at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.onEnter(GatewayProtectedCacheProxy.java:1580)
at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:634)
In addition to Alex's answer, you might implement a custom analog of the TopologyValidator. All you need is to listen for the EVT_NODE_LEFT and EVT_NODE_JOINED events to trigger your custom logic, such as stopping a cache or switching on some application-level access validator, as in the sketch below.
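A minimal sketch of such a listener, assuming the cache name "test1" and a minimum of 2 server nodes (both are illustrative choices, not part of the original answer):
import org.apache.ignite.IgniteCache;
import org.apache.ignite.events.EventType;

// 'ignite' is the running Ignite instance. Node events are disabled by default and
// must be enabled via IgniteConfiguration#setIncludeEventTypes(EVT_NODE_LEFT, EVT_NODE_FAILED).
IgniteCache<Long, Person> cache = ignite.getOrCreateCache("test1");

ignite.events().localListen(evt -> {
    if (ignite.cluster().forServers().nodes().size() < 2) // topology no longer valid for this application
        cache.close(); // from now on, operations through this proxy throw IllegalStateException
    return true;       // keep the listener registered
}, EventType.EVT_NODE_LEFT, EventType.EVT_NODE_FAILED);
In practice you would usually hand the reaction off to another thread (and re-open the cache on EVT_NODE_JOINED once the expected number of servers is back) rather than doing cache operations directly inside the event callback.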

Apache Geode debug Unknown pdx type=2140705

I start a GFSH client and connect to Geode. There is a lot of data in myRegion, and to look through it I run:
query --query="select * from /myRegion"
I am getting the response:
Result : false
startCount : 0
endCount : 20
Message : Unknown pdx type=2140705
How does one troubleshoot / debug this problem?
UPDATE: The error in the Geode server log is:
[info 2018/07/04 10:53:07.275 BST IsGeode <Function Execution Processor1> tid=0x48] Exception occurred:
java.lang.IllegalStateException: Unknown pdx type=1318971
at org.apache.geode.internal.InternalDataSerializer.readPdxSerializable(InternalDataSerializer.java:3042)
at org.apache.geode.internal.InternalDataSerializer.basicReadObject(InternalDataSerializer.java:2859)
at org.apache.geode.DataSerializer.readObject(DataSerializer.java:2961)
at org.apache.geode.internal.util.BlobHelper.deserializeBlob(BlobHelper.java:90)
at org.apache.geode.internal.cache.EntryEventImpl.deserialize(EntryEventImpl.java:1911)
at org.apache.geode.internal.cache.EntryEventImpl.deserialize(EntryEventImpl.java:1904)
at org.apache.geode.internal.cache.PreferBytesCachedDeserializable.getDeserializedValue(PreferBytesCachedDeserializable.java:73)
at org.apache.geode.internal.cache.LocalRegion.getDeserialized(LocalRegion.java:1269)
at org.apache.geode.internal.cache.LocalRegion$NonTXEntry.getValue(LocalRegion.java:8771)
at org.apache.geode.internal.cache.EntriesSet$EntriesIterator.moveNext(EntriesSet.java:179)
at org.apache.geode.internal.cache.EntriesSet$EntriesIterator.next(EntriesSet.java:134)
at org.apache.geode.cache.query.internal.CompiledSelect.doNestedIterations(CompiledSelect.java:837)
at org.apache.geode.cache.query.internal.CompiledSelect.doIterationEvaluate(CompiledSelect.java:699)
at org.apache.geode.cache.query.internal.CompiledSelect.evaluate(CompiledSelect.java:423)
at org.apache.geode.cache.query.internal.CompiledSelect.evaluate(CompiledSelect.java:53)
at org.apache.geode.cache.query.internal.DefaultQuery.executeUsingContext(DefaultQuery.java:558)
at org.apache.geode.cache.query.internal.DefaultQuery.execute(DefaultQuery.java:385)
at org.apache.geode.cache.query.internal.DefaultQuery.execute(DefaultQuery.java:319)
at org.apache.geode.management.internal.cli.functions.DataCommandFunction.select(DataCommandFunction.java:247)
at org.apache.geode.management.internal.cli.functions.DataCommandFunction.select(DataCommandFunction.java:202)
at org.apache.geode.management.internal.cli.functions.DataCommandFunction.execute(DataCommandFunction.java:147)
at org.apache.geode.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:185)
at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:662)
at org.apache.geode.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:1108)
at java.lang.Thread.run(Thread.java:748)
You can tell the immediate cause from the stack trace.
A PDX serialized stream contains a type id which is a reference into a repository of type metadata maintained by a GemFire cluster. In this case, the serialized data of the object contained a typeId that is not in the cluster's metadata repository.
So the question becomes, "what serialized that object and why did it use an invalid type id ?"
The only way I've seen this happen before is when a cluster is fully restarted and the pdx metadata goes away, either because it was not persistent or because it was deleted (by clearing out the locator working directory for example).
GemFire clients cache the mapping between a type and its type ID. This allows them to quickly serialize objects without continually looking up the type ID from the server. Client connections can persist across cluster restarts, and when a client reconnects it does not flush the cached information; it continues to write objects using its cached type ID.
So the combination of a cluster restart that loses the PDX metadata and a client that is not restarted (e.g. an app server) is the only way I have seen this happen before. Does this match your scenario?
If so, one of the best ways to avoid this is to persist your PDX metadata and never delete it.
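For servers configured in Java, persisting the PDX type registry looks roughly like this (a sketch; the disk store name is an assumption):
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;

Cache cache = new CacheFactory()
    .setPdxPersistent(true)          // persist the PDX type registry across restarts
    .setPdxDiskStore("pdxDiskStore") // optional: a dedicated disk store for the registry
    .create();
If the cluster is managed with gfsh, the configure pdx --disk-store command (run before the servers are started) serves the same purpose.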

Apache Ignite sql query returns only cache contents, not complete results from database

My Ignite nodes (2 server nodes - let's call them A and B) are configured as follows:
ccfg.setCacheMode(CacheMode.PARTITIONED);
ccfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
ccfg.setReadThrough(true);
ccfg.setWriteThrough(true);
ccfg.setWriteBehindEnabled(true);
ccfg.setWriteBehindBatchSize(10000);
Node A is started first, from command line as follows:
apache-ignite-fabric-2.2.0-bin>bin/ignite.bat config/default-config.xml
Node B is started from java code by running
public static void main(String[] args) throws Exception {
    Ignite ignite = Ignition.start(ServerConfigurationFactory.createConfiguration());
    ignite.cache("MyCache").loadCache(null);
    ...
}
(the jar containing ServerConfigurationFactory is put in the apache-ignite-fabric-2.2.0-bin\libs directory so that nodes A and B join the same cluster; otherwise there is an error)
I have a query that is supposed to return 9061 results from the database. After the cache loading process in Node B, I went to the Web Console and ran a simple count SQL statement against the caches. There is a button "Execute on selected node" that allows you to choose a specific cache to query. I queried Node A and got a count of 2341, and on Node B I get a count of 2064. If I just use the "Execute" button I get 4405 which is just the total of node A and B. Obviously they are missing 4656 records (9061 total records in db - 4405 in nodes A and B). I also ran the same count query in Java code using SqlFieldsQuery and I also get 4405.
Since readThrough is set to true, I expected Ignite to also return results that are not in memory. But this is not the case; it just returns whatever is in the cache. Am I doing something wrong here? Thank you.
Read-through works only for the key-value APIs, so the SQL engine assumes that all required data is preloaded from the database prior to running a query.
If your data set doesn't fit in memory and you can't preload all the data, you can use Ignite's native persistence store: https://apacheignite.readme.io/docs/distributed-persistent-store
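A minimal sketch of enabling native persistence, assuming Ignite 2.3 or later (the 2.2.0 build used in the question configured this through PersistentStoreConfiguration instead):
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

IgniteConfiguration cfg = new IgniteConfiguration();

// Persist the default data region so SQL can run over the full data set
// without keeping everything in RAM.
DataStorageConfiguration storageCfg = new DataStorageConfiguration();
storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
cfg.setDataStorageConfiguration(storageCfg);

Ignite ignite = Ignition.start(cfg);

// Clusters with persistence start inactive; activate once all baseline nodes are up.
ignite.cluster().active(true);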

HANA hdbindexserver start issue after power outage

There was a power outage for our 5+1 node HANA cell cluster.
After we booted up the servers, we tried to start the HANA DB.
During HDB start as SIDADM we see the following on nodes 2-3-4-5:
FAIL: process hdbindexserver HDB Indexserver not running
So of course we tried to start hdbindexserver by hand as SIDADM:
cd /usr/sap/SIDADM/HDB0x/exe; ./hdbindexserver
But this just produces an error:
/usr/sap/SIDADM/HDB0x/foobar003/trace> cat indexserver_alert_foobar003.trc
...
[14268]{-1}[-1/-1] 2017-10-09 19:55:34.593776 e TrexNet Communication.cpp(00501) : no internal interface found
[14287]{-1}[-1/-1] 2017-10-09 19:56:01.428226 e Checkpoint CheckpointMgr.cc(00244) : Skip versions garbage collection savepoint: transaction distribution work failure: snapshot timestamp synchronization failed
[14287]{-1}[-1/-1] 2017-10-09 19:56:22.467184 e Row_Engine transdtx.cc(01410) : Unexpected ltt exception thrown: transaction distribution work failure (at foobar/ptime/storage/tm/transdtx.cc:1410 )
[14287]{-1}[-1/-1] 2017-10-09 19:56:22.467427 f PersistenceLayer PersistenceController.cpp(00679) : startup failed exception 1: no.71000145 (ptime/storage/tm/transdtx.cc:1512)
snapshot timestamp synchronization failed
...
The IPs are up. There is 1 TB of RAM.
The question: what could cause hdbindexserver to fail to start?
Looks like the indexserver process wasn't able to bind the internal network interface again:
Communication.cpp(00501) : no internal interface found
I'd look into the other trace files and the system log to check whether the configured internal network interface is up and available.
It seems the persistence storage (the disk where the data and log files reside) is not responding in time, so the operation times out. Check whether you can access the data and log files from the server.
Also check whether network I/O or disk I/O is slow on that server, causing the synchronization to time out.
You can try stopping the system completely and bringing HDB up on just that server first to check whether the issue above persists.