Apache Ignite: Reached logical end of the segment for file - ignite

I have enabled Ignite native persistence and disabled the WAL:
<property name="dataStorageConfiguration">
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
<property name="defaultDataRegionConfiguration">
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="persistenceEnabled" value="true"/>
</bean>
</property>
<!-- disabled wal log because query result doesn't need recovery -->
<property name="walMode" value="NONE"/>
</bean>
</property>
<property name="cacheConfiguration">
<bean class="org.apache.ignite.configuration.CacheConfiguration">
<!-- Set the cache name. -->
<property name="name" value="query_cache"/>
<!-- Set the cache mode. -->
<property name="cacheMode" value="PARTITIONED"/>
</bean>
</property>
I start the server from one application and operate on the cache from another class:
public class IgniteServer {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start("examples/config/ignite-server-config.xml");
        ignite.cluster().state(ClusterState.ACTIVE);
    }
}

try (Ignite ignite = Ignition.start("examples/config/ignite-server-config.xml")) {
    IgniteCache cache = ignite.getOrCreateCache("query_cache");
    cache.put("1", "value-1");
    System.out.println(cache.get("1"));
}
This works fine, but after stopping IgniteServer I can't restart it; it fails with the following error:
[15:28:20] Initialized write-ahead log manager in NONE mode, persisted data may be lost in a case of unexpected node failure. Make sure to deactivate the cluster before shutdown.
[2021-03-18 15:28:20,876][ERROR][main][root] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.StorageException: Failed to read checkpoint record from WAL, persistence consistency cannot be guaranteed. Make sure configuration points to correct WAL folders and WAL folder is properly mounted [ptr=FileWALPointer [idx=0, fileOff=0, len=0], walPath=db/wal, walArchive=db/wal/archive]]]
class org.apache.ignite.internal.processors.cache.persistence.StorageException: Failed to read checkpoint record from WAL, persistence consistency cannot be guaranteed. Make sure configuration points to correct WAL folders and WAL folder is properly mounted [ptr=FileWALPointer [idx=0, fileOff=0, len=0], walPath=db/wal, walArchive=db/wal/archive]
at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.performBinaryMemoryRestore(GridCacheDatabaseSharedManager.java:2269)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:873)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:5022)
at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1251)
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2052)
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1698)
at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1114)
at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1032)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:918)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:817)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:687)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:656)
at org.apache.ignite.Ignition.start(Ignition.java:353)
at org.apache.ignite.examples.atest.IgniteServer.main(IgniteServer.java:9)
[2021-03-18 15:28:20,881][ERROR][main][root] JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.StorageException: Failed to read checkpoint record from WAL, persistence consistency cannot be guaranteed. Make sure configuration points to correct WAL folders and WAL folder is properly mounted [ptr=FileWALPointer [idx=0, fileOff=0, len=0], walPath=db/wal, walArchive=db/wal/archive]]]
Sometimes the server shuts down automatically with the message:
FileWriteAheadLogManager: Reached logical end of the segment for file: /ignite/work/db/wal/node-xxxx/xxx.wal
I have disabled the WAL, so I don't know why it still reads a checkpoint and fails. I checked $IGNITE_HOME/work/db/wal/node-xxx/ and found 10 WAL files, each 67.1 MB in size; it seems something loops and fills them all. After I deleted the work folder I could start the server again.
Questions:
How can I fix this problem with native persistence on and the WAL off?
It seems I am shutting the server down the wrong way; how can I stop the server safely from code so that it doesn't fail reading the checkpoint on restart?
Thanks.

I advise against walMode=NONE. If you have to use it, either make sure to remove the whole persistence directory before restarting the node, or try calling ignite.cluster().state(ClusterState.INACTIVE) to deactivate the cluster before shutting down any nodes.
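For illustration, a minimal sketch of the deactivate-before-stop sequence (it reuses the config path from the question; the stop call itself is my assumption, not code from the original post):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;

public class IgniteServerGracefulStop {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start("examples/config/ignite-server-config.xml");
        ignite.cluster().state(ClusterState.ACTIVE);

        // ... work with caches ...

        // With walMode=NONE, deactivate the cluster before shutting the node down,
        // as the startup warning in the log advises; this leaves the persistent
        // store in a consistent state.
        ignite.cluster().state(ClusterState.INACTIVE);

        // Then stop the local node ('true' cancels any ongoing jobs).
        Ignition.stop(ignite.name(), true);
    }
}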

Related

Error in Apache Ignite : Can't restore memory - critical part of WAL archive is missing

I was trying to run my application that uses an Apache Ignite cache on a local Windows machine and got the error below:
ERROR [exchange-worker-#42%ignite-instance-0%] [] - Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException: Restore wal pointer = null, while status.endPtr = FileWALPointer [idx=0, fileOff=3746370, len=53]. Can't restore memory - critical part of WAL archive is missing.]]
class org.apache.ignite.internal.pagemem.wal.StorageException: Restore wal pointer = null, while status.endPtr = FileWALPointer [idx=0, fileOff=3746370, len=53]. Can't restore memory - critical part of WAL archive is missing.
at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readCheckpointAndRestoreMemory(GridCacheDatabaseSharedManager.java:759)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onClusterStateChangeRequest(GridDhtPartitionsExchangeFuture.java:894)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:641)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2419)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2299)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:748)
Any idea what might have gone wrong?
My dataStorageConfiguration is:
<property name="dataStorageConfiguration">
<bean class="org.apache.ignite.configuration.DataStorageConfiguration">
<property name="defaultDataRegionConfiguration">
<bean class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="persistenceEnabled" value="true"/>
</bean>
</property>
<property name="walMode" value="LOG_ONLY"/>
<property name="walCompactionEnabled" value="true" />
</bean>
</property>
Did you lose your WAL? If you don't care about any preexisting data at all, consider removing your Ignite work dir (typically ignite/work or %TMP%/ignite/work).
To summarize the solution, delete the following directory:
Windows: C:\Users\\AppData\Local\Temp\ignite
Linux: find the directory with the hierarchy ignite/work and delete it.
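If the persisted data really is disposable, the cleanup can also be automated. A minimal sketch in plain Java, assuming a default temp-dir location that you would replace with your actual work directory:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class CleanIgniteWorkDir {
    public static void main(String[] args) throws IOException {
        // Hypothetical location; point this at your actual Ignite work directory.
        Path workDir = Paths.get(System.getProperty("java.io.tmpdir"), "ignite");

        if (Files.exists(workDir)) {
            // Delete children before parents by walking in reverse order.
            try (Stream<Path> paths = Files.walk(workDir)) {
                paths.sorted(Comparator.reverseOrder())
                     .forEach(p -> p.toFile().delete());
            }
        }
    }
}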

Apache Ignite Eviction Policy Max Size in Visor shown as <n/a>

My question is really simple.
I am using the example XML config from https://apacheignite.readme.io/docs/evictions#section-first-in-first-out-fifo- to set up the following eviction policy.
<bean class="org.apache.ignite.cache.CacheConfiguration">
<property name="name" value="myCache"/>
<!-- Enabling on-heap caching for this distributed cache. -->
<property name="onheapCacheEnabled" value="true"/>
<property name="evictionPolicy">
<!-- FIFO eviction policy. -->
<bean class="org.apache.ignite.cache.eviction.fifo.FifoEvictionPolicy">
<!-- Set the maximum cache size to 1 million (default is 100,000). -->
<property name="maxSize" value="1000000"/>
</bean>
</property>
...
</bean>
How can I verify that my eviction policy has been set up successfully before I start seeding data into my cache? I have been using Visor and its config command: I can see that Eviction Policy Enabled is set to on and Eviction Policy is set to o.a.i.cache.eviction.fifo.FifoEvictionPolicy, but Eviction Policy Max Size shows <n/a>, although it has been configured in the XML. That leads me to think that the eviction policy max size is not set up correctly. Can someone shed some light on this?
Apache Ignite exposes management beans, so you can verify your cache configuration over JMX: run jconsole from the JDK and inspect the cache configuration MBeans.
Thanks, Alexey
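As an additional rough check (a sketch, not part of the original answer): you can read the configuration back from the cache at runtime and inspect the eviction policy yourself. The config path below is hypothetical; "myCache" is the cache name from the question.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.eviction.fifo.FifoEvictionPolicy;
import org.apache.ignite.configuration.CacheConfiguration;

public class CheckEvictionPolicy {
    public static void main(String[] args) {
        // Hypothetical config path; adjust to your own cluster configuration.
        try (Ignite ignite = Ignition.start("config/example-cache.xml")) {
            IgniteCache<Object, Object> cache = ignite.cache("myCache");

            // Read the cache configuration back as the node actually sees it.
            CacheConfiguration cfg = cache.getConfiguration(CacheConfiguration.class);

            Object policy = cfg.getEvictionPolicy();
            if (policy instanceof FifoEvictionPolicy)
                System.out.println("FIFO maxSize = " + ((FifoEvictionPolicy) policy).getMaxSize());
            else
                System.out.println("Eviction policy: " + policy);
        }
    }
}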

Hibernate Search & Lucene - set write timeout lock

I have an
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock#/XXXXX/User_Index/write.lock
exception, and I read that the write lock timeout should be increased from its default of 1 second.
(
It is interesting that I didn't have this exception before, but I am working on a task to introduce Spring into the project, so there is a small chance that more competing transactions are now trying to access the index. I am not sure whether the Spring transaction management is configured properly:
<!-- for the @Transactional annotations -->
<tx:annotation-driven />
<context:component-scan base-package="XXX.audit, XXX.authorization, XXX.policy, XXX.printing, XXX.provisioning, XXX.service.plainspring" />
<!-- defining Transaction Manager for Spring -->
<bean id="transactionManager" class="org.springframework.orm.hibernate4.HibernateTransactionManager">
    <property name="dataSource" ref="dataSource" />
    <property name="sessionFactory" ref="sessionFactory" />
</bean>
)
So I tried to configure the write lock timeout like this:
<bean id="sessionFactory" class="org.springframework.orm.hibernate4.LocalSessionFactoryBean" lazy-init="true">
...
<property name="hibernateProperties">
<props>
...
<prop key="hibernate.search.lucene_version">LUCENE_35</prop>
<prop key="hibernate.search.default.indexwriter.writeLockTimeout">20000</prop>
...
</property>
<property name="dataSource">
<ref bean="dataSource"/>
</property>
</bean>
but with no success. Apache Lucene doesn't have a config file, and there is no direct Lucene code; only Hibernate Search is used (i.e. it is not possible to set the value on an IndexWriter directly).
How can I configure the write lock timeout?
Apache Lucene 3.5
Hibernate Search 4.1.1
Thanks,
V.
There is no option to configure the IndexWriter lock timeout, as this should never be needed.
If you see such a timeout happening, it's usually for one of two reasons:
There is a lock file in the index directory as a left over from a crashed JVM
The configuration isn't suitable for the architecture of the application
Check the leftover scenario first: shut down your application and see if there is a file named write.lock. If the application is not running, it's safe to delete this file.
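A minimal sketch of that check in plain Java (not part of the original answer; the index path is the placeholder from the exception message):

import java.io.File;

public class StaleLockCheck {
    public static void main(String[] args) {
        // Placeholder path taken from the exception message; adjust to your index directory.
        File lock = new File("/XXXXX/User_Index/write.lock");

        if (lock.exists()) {
            // Only safe when no application is currently using the index.
            boolean deleted = lock.delete();
            System.out.println(deleted ? "Stale write.lock removed" : "Could not remove write.lock");
        } else {
            System.out.println("No write.lock present");
        }
    }
}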
If that's not the case then you probably have two different instances of Hibernate Search attempting to use the same index directory, and both attempting to write to it.
That's not a valid configuration, and you're getting the exception because the index is already locked by the other instance; increasing the lock timeout would only make you wait for a very long time, possibly until the other application is shut down.
Don't share indexes among applications; if you really need to do so, check the manual for the JMS based backends or other non-default backends which allow for multiple applications to share a single IndexWriter.
Finally, please consider upgrading. These versions are extremely old.

SharedRDD code for Ignite works with a single server but fails with an exception when an additional server is added

I have 2 server nodes running collocated with Spark workers. I am using a shared Ignite RDD to save my DataFrame. My code works fine when only one server node is started; if I start both server nodes, the code fails with
Grid is in invalid state to perform this operation. It either not started yet or has already being or have stopped [gridName=null, state=STOPPING]
DiscoverySpi is configured as below:
<property name="discoverySpi">
<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
<property name="ipFinder">
<!--
Ignite provides several options for automatic discovery that can be used
instead os static IP based discovery. For information on all options refer
to our documentation: http://apacheignite.readme.io/docs/cluster-config
-->
<!-- Uncomment static IP finder to enable static-based discovery of initial nodes. -->
<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
<!--<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.multicast.TcpDiscoveryMulticastIpFinder">-->
<property name="shared" value="true"/>
<property name="addresses">
<list>
<!-- In distributed environment, replace with actual host IP address. -->
<value>v-in-spark-01:47500..47509</value>
<value>v-in-spark-02:47500..47509</value>
</list>
</property>
</bean>
</property>
</bean>
</property>
I know this exception generally means the Ignite instance was either not started or already stopped when the operation was attempted, but I don't think that is the case here: with a single server node it works fine, and I am not explicitly closing the Ignite instance in my program.
Also, in my code flow I perform operations in a transaction, which works. The flow is:
create cache1: works fine
create cache2: works fine
put value in cache1: works fine
igniteRDD.saveValues on cache2: this step fails with the above-mentioned exception
Use this link for the complete error trace.
The "Caused by" part is also pasted below:
Caused by: java.lang.IllegalStateException: Grid is in invalid state to perform this operation. It either not started yet or has already being or have stopped [gridName=null, state=STOPPING]
at org.apache.ignite.internal.GridKernalGatewayImpl.illegalState(GridKernalGatewayImpl.java:190)
at org.apache.ignite.internal.GridKernalGatewayImpl.readLock(GridKernalGatewayImpl.java:90)
at org.apache.ignite.internal.IgniteKernal.guard(IgniteKernal.java:3151)
at org.apache.ignite.internal.IgniteKernal.getOrCreateCache(IgniteKernal.java:2739)
at org.apache.ignite.spark.impl.IgniteAbstractRDD.ensureCache(IgniteAbstractRDD.scala:39)
at org.apache.ignite.spark.IgniteRDD$$anonfun$saveValues$1.apply(IgniteRDD.scala:164)
at org.apache.ignite.spark.IgniteRDD$$anonfun$saveValues$1.apply(IgniteRDD.scala:161)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:883)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:883)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
... 3 more
It looks like the node embedded in the executor process is stopped for some reason while you are still trying to run the job. To my knowledge the only way for this to happen is to stop the executor process. Can this be the case? Is there anything in the log except the trace?

Properly Shutting Down ActiveMQ and Spring DefaultMessageListenerContainer

Our system will not shut down when a "Stop" command is issued from the Tomcat Manager. I have determined that it is related to ActiveMQ/Spring. I have even figured out how to get it to shut down, but my solution is a hack (at least I hope this isn't the "correct" way to do it). I would like to know the proper way to shut down ActiveMQ so that I can remove my hack.
I inherited this component and have no information about why certain architectural decisions were made; after a lot of digging I think I understand the original author's thinking, but I could be missing something. In other words, the real problem could be in the way we are trying to use ActiveMQ/Spring.
We run in a servlet container (Tomcat 6/7) and use ActiveMQ 5.9.1 and Spring 3.0.0. Multiple instances of our application can run in a "group", with each instance running on its own server. ActiveMQ is used to facilitate communication between the multiple instances. Each instance has its own embedded broker and its own set of queues. Every queue on every instance has exactly one org.springframework.jms.listener.DefaultMessageListenerContainer listening to it, so 5 queues = 5 DefaultMessageListenerContainers, for example.
Our system shut down properly until we fixed a bug by adding queuePrefetch="0" to the ConnectionFactory. At first I assumed that this change was incorrect in some way, but now that I understand the situation, I am confident that we should not be using the prefetch functionality.
I have created a test application to replicate the issue. Note that the information below makes no mention of message producers; that is because I can replicate the issue without ever sending/processing a single message. Simply creating the broker, ConnectionFactory, queues and listeners during boot is enough to keep the system from stopping properly.
Here is my sample configuration from my Spring XML. I will be happy to provide my entire project if someone wants it:
<amq:broker persistent="false" id="mybroker">
    <amq:transportConnectors>
        <amq:transportConnector uri="tcp://0.0.0.0:61616"/>
    </amq:transportConnectors>
</amq:broker>

<amq:connectionFactory id="ConnectionFactory" brokerURL="vm://localhost?broker.persistent=false">
    <amq:prefetchPolicy>
        <amq:prefetchPolicy queuePrefetch="0"/>
    </amq:prefetchPolicy>
</amq:connectionFactory>

<amq:queue id="lookup.mdb.queue.cat" physicalName="DogQueue"/>
<amq:queue id="lookup.mdb.queue.dog" physicalName="CatQueue"/>
<amq:queue id="lookup.mdb.queue.fish" physicalName="FishQueue"/>

<bean id="messageListener" class="org.springframework.jms.listener.DefaultMessageListenerContainer" abstract="true">
    <property name="connectionFactory" ref="ConnectionFactory"/>
</bean>

<bean parent="messageListener" id="cat">
    <property name="destination" ref="lookup.mdb.queue.dog"/>
    <property name="messageListener">
        <bean class="com.acteksoft.common.remote.jms.WorkerMessageListener"/>
    </property>
    <property name="concurrentConsumers" value="200"/>
    <property name="maxConcurrentConsumers" value="200"/>
</bean>

<bean parent="messageListener" id="dog">
    <property name="destination" ref="lookup.mdb.queue.cat"/>
    <property name="messageListener">
        <bean class="com.acteksoft.common.remote.jms.WorkerMessageListener"/>
    </property>
    <property name="concurrentConsumers" value="200"/>
    <property name="maxConcurrentConsumers" value="200"/>
</bean>

<bean parent="messageListener" id="fish">
    <property name="destination" ref="lookup.mdb.queue.fish"/>
    <property name="messageListener">
        <bean class="com.acteksoft.common.remote.jms.WorkerMessageListener"/>
    </property>
    <property name="concurrentConsumers" value="200"/>
    <property name="maxConcurrentConsumers" value="200"/>
</bean>
My hack involves using a ServletContextListener to manually stop the objects. The hacky part is that I have to create additional threads to stop the DefaultMessageListenerContainers. Perhaps I'm stopping the objects in the wrong order, but I've tried everything that I can imagine. If I attempt to stop the objects in the main thread, then they hang indefinitely.
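A minimal sketch of this kind of listener (simplified, with a hypothetical class name; "mybroker" is the broker bean id from the configuration above):

import java.util.Map;

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

import org.apache.activemq.broker.BrokerService;
import org.springframework.jms.listener.DefaultMessageListenerContainer;
import org.springframework.web.context.WebApplicationContext;
import org.springframework.web.context.support.WebApplicationContextUtils;

public class JmsShutdownListener implements ServletContextListener {

    public void contextInitialized(ServletContextEvent sce) {
        // Nothing to do on startup.
    }

    public void contextDestroyed(ServletContextEvent sce) {
        final WebApplicationContext ctx =
            WebApplicationContextUtils.getWebApplicationContext(sce.getServletContext());
        if (ctx == null)
            return;

        // Stop the listener containers from a separate thread; stopping them
        // on the main shutdown thread hangs indefinitely in our setup.
        Thread stopper = new Thread(new Runnable() {
            public void run() {
                Map<String, DefaultMessageListenerContainer> containers =
                    ctx.getBeansOfType(DefaultMessageListenerContainer.class);
                for (DefaultMessageListenerContainer c : containers.values())
                    c.shutdown();
            }
        }, "jms-shutdown");
        stopper.start();

        try {
            stopper.join(30000);

            // Finally stop the embedded broker ("mybroker" from the Spring config).
            BrokerService broker = ctx.getBean("mybroker", BrokerService.class);
            broker.stop();
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}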
Thank you in advance!
UPDATE
I have tried the following based on boday's recommendation, but it didn't work. I have also tried specifying the amq:transportConnector uri as tcp://0.0.0.0:61616?transport.daemon=true
<amq:broker persistent="false" id="mybroker" brokerName="localhost">
    <amq:transportConnectors>
        <amq:transportConnector uri="tcp://0.0.0.0:61616?daemon=true"/>
    </amq:transportConnectors>
</amq:broker>

<amq:connectionFactory id="connectionFactory" brokerURL="vm://localhost">
    <amq:prefetchPolicy>
        <amq:prefetchPolicy queuePrefetch="0"/>
    </amq:prefetchPolicy>
</amq:connectionFactory>
At one point I tried adding similar properties to the brokerURL parameter in the amq:connectionFactory element, and the shutdown worked properly; however, after further testing I learned that those properties caused an exception to be thrown from VMTransportFactory. This resulted in improper initialization, and basic messaging functionality didn't work.
In case anyone else is wondering, as far as I can see it's not possible to have a daemon ListenerContainer using ActiveMQ.
When the ActiveMQConnection is started, it creates a ThreadPoolExecutor with a non-daemon thread. This is apparently to avoid issues when failing over the connection from one broker to another.
https://issues.apache.org/jira/browse/AMQ-796
executor = new ThreadPoolExecutor(1, 1, 5, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>(), new ThreadFactory() {
    @Override
    public Thread newThread(Runnable r) {
        Thread thread = new Thread(r, "ActiveMQ Connection Executor: " + transport);
        // Don't make these daemon threads - see https://issues.apache.org/jira/browse/AMQ-796
        // thread.setDaemon(true);
        return thread;
    }
});
Try setting daemon=true on your TCP transport; this allows the transport to run on a daemon thread, which won't block the shutdown of your container.
see http://activemq.apache.org/tcp-transport-reference.html