Thread stuck when accessing atomic reference or long on Apache Ignite

This is a fairly recent issue we have been facing. We run 2 client instances and 26 Apache Ignite instances, all on AWS r4.2xlarge nodes. Recently we have been seeing a problem where, when trying to fetch an atomicLong or atomicReference, the executing thread gets stuck and never returns. It usually happens on 1 or 2 Ignite instances. I am not sure why this happens, so any help would be really appreciated.
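For context, the hanging calls are plain data-structure lookups along these lines (a simplified sketch, not our exact code; the structure names and helpers are illustrative):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteAtomicLong;
import org.apache.ignite.IgniteAtomicReference;

public class AtomicLookupSketch {
    // Hypothetical helpers; the real code lives in our location provider / serialization status classes.
    static Object readSavedLocation(Ignite ignite) {
        // Look up an existing atomic reference; this is the call that parks (see the dump below).
        IgniteAtomicReference<Object> ref = ignite.atomicReference("savedAudienceLocation", null, false);
        return ref == null ? null : ref.get();
    }

    static long readSerializeCounter(Ignite ignite) {
        // Same pattern for the atomic long; created on first use.
        IgniteAtomicLong counter = ignite.atomicLong("serializeCounter", 0, true);
        return counter.get();
    }
}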
This is the thread dump while trying to get an atomicReference:
"main" #1 prio=5 os_prio=0 cpu=3528.41ms elapsed=1067.33s allocated=312M defined_classes=9309 tid=0x00007f4ce4046fc0 nid=0x1537 waiting on condition [0x00007f4cece90000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base#11.0.7/Native Method)
- parking to wait for <0x00007f4cbfe7c7d0> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base#11.0.7/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base#11.0.7/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base#11.0.7/AbstractQueuedSynchronizer.java:1039)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base#11.0.7/AbstractQueuedSynchronizer.java:1345)
at java.util.concurrent.CountDownLatch.await(java.base#11.0.7/CountDownLatch.java:232)
at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7612)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.awaitInitialization(DataStructuresProcessor.java:1147)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.getAtomic(DataStructuresProcessor.java:506)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.atomicReference(DataStructuresProcessor.java:744)
at org.apache.ignite.internal.IgniteKernal.atomicReference(IgniteKernal.java:3743)
at org.apache.ignite.internal.IgniteKernal.atomicReference(IgniteKernal.java:3732)
at company.explore.cache.persist.SavedAudienceLocationProvider.getSavedAudienceLocation(SavedAudienceLocationProvider.java:27)
at company.explore.listeners.lifecycle.LifecycleListener.configureSavedAudienceLocation(LifecycleListener.java:45)
at company.explore.listeners.lifecycle.LifecycleListener.onLifecycleEvent(LifecycleListener.java:38)
at org.apache.ignite.internal.IgniteKernal.notifyLifecycleBeans(IgniteKernal.java:725)
at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1156)
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2038)
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1730)
- locked <0x00007f4cbf072a38> (a org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance)
at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
at org.apache.ignite.Ignition.start(Ignition.java:348)
at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)
Since startup is stuck, any Ignition.ignite() calls hang as well, which causes the job not to go through:
"pub-#22" #48 prio=5 os_prio=0 cpu=5.76ms elapsed=1036.50s allocated=421K defined_classes=6 tid=0x00007f4ce4cf3990 nid=0x1607 waiting on condition [0x00007f40375f6000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base#11.0.7/Native Method)
- parking to wait for <0x00007f4cbf16d9e0> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base#11.0.7/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base#11.0.7/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base#11.0.7/AbstractQueuedSynchronizer.java:1039)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base#11.0.7/AbstractQueuedSynchronizer.java:1345)
at java.util.concurrent.CountDownLatch.await(java.base#11.0.7/CountDownLatch.java:232)
at org.apache.ignite.internal.util.IgniteUtils.awaitQuiet(IgniteUtils.java:7657)
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.grid(IgnitionEx.java:1671)
at org.apache.ignite.internal.IgnitionEx.grid(IgnitionEx.java:1389)
at org.apache.ignite.internal.IgnitionEx.grid(IgnitionEx.java:1258)
at org.apache.ignite.Ignition.ignite(Ignition.java:489)
at company.explore.dataload.person.LoadPersonAttributeJob.call(LoadPersonAttributeJob.java:58)
at company.explore.dataload.person.LoadPersonAttributeJob.call(LoadPersonAttributeJob.java:31)
at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C2.execute(GridClosureProcessor.java:1855)
at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:568)
at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6817)
at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:562)
at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base#11.0.7/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base#11.0.7/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base#11.0.7/Thread.java:834)
Similarly, here is an instance where a thread is waiting on the CountDownLatch while trying to get an atomicLong:
"pub-#489" #608 prio=5 os_prio=0 cpu=16.80ms elapsed=7076.10s allocated=2409K defined_classes=17 tid=0x00007f48c8014c60 nid=0x5bd5 waiting on condition [0x00007f48359e1000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base#11.0.7/Native Method)
- parking to wait for <0x00007f518aba6060> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base#11.0.7/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base#11.0.7/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base#11.0.7/AbstractQueuedSynchronizer.java:1039)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base#11.0.7/AbstractQueuedSynchronizer.java:1345)
at java.util.concurrent.CountDownLatch.await(java.base#11.0.7/CountDownLatch.java:232)
at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7612)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.awaitInitialization(DataStructuresProcessor.java:1147)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.getAtomic(DataStructuresProcessor.java:506)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.atomicLong(DataStructuresProcessor.java:463)
at org.apache.ignite.internal.IgniteKernal.atomicLong(IgniteKernal.java:3716)
at org.apache.ignite.internal.IgniteKernal.atomicLong(IgniteKernal.java:3705)
at company.explore.cache.persist.person.SerializationStatus.getSerializeCounter(SerializationStatus.java:86)
at company.explore.cache.persist.person.SerializationStatus.startNodeSerialization(SerializationStatus.java:21)
at company.explore.cache.persist.personv2.PersonSerializationJob.serializePeopleData(PersonSerializationJob.java:98)
at company.explore.cache.persist.personv2.PersonSerializationJob.run(PersonSerializationJob.java:75)
at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C4.execute(GridClosureProcessor.java:1944)
at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:568)
at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6817)
at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:562)
at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base#11.0.7/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base#11.0.7/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base#11.0.7/Thread.java:834)
These issues have only started coming up in the past 2 months or so; the system itself had been very stable for a long time before that.
I haven't posted the entire thread dumps here as they would be quite large. If needed, I can post them on pastebin or upload them somewhere.
Since this isn't a very consistent issue, I am not sure how to create a reproducer project, but I can provide any logs that would help.
EDIT:
The entire thread dumps have been posted on pastebin. Please find the links below:
Atomic Reference related thread dump: pastebin.com/ydNMFSEP
Atomic Long related thread dump: pastebin.com/psJgwi3F

Related

Scaling Apache Ignite Grid

We have a scaling Apache Ignite grid where client nodes scale up and down based on load.
The data nodes are our server nodes, where continuous queries run.
However, this leads to unclean shutdown of some client nodes, since we rely on SIGTERM for Ignite node shutdown.
Unclean shutdown of client nodes impacts execution of the continuous queries, which start giving "Possible starvation in striped pool" warnings, ultimately leading to blocked system-critical threads.
We are currently working on ways to prevent striped pool starvation and have noticed 2 key issues around it:
Continuous query threads trying to connect to nodes which have shut down but are still present in the topology: we are planning to reduce the timeout so that such client nodes are discarded from the grid earlier.
Stacktrace:
Thread [name="sys-stripe-1-#2%App%", id=37, state=RUNNABLE, blockCnt=233817, waitCnt=3343945]
at sun.nio.ch.Net.poll(Native Method)
at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954)
at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3781)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3635)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createCommunicationClient(TcpCommunicationSpi.java:3375)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:3180)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:3013)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2960)
at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:2100)
at o.a.i.i.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:2365)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1964)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1935)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1917)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1324)
at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1261)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler.onEntryUpdate(CacheContinuousQueryHandler.java:1059)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler.access$600(CacheContinuousQueryHandler.java:90)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler$2.onEntryUpdated(CacheContinuousQueryHandler.java:459)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryUpdated(CacheContinuousQueryManager.java:447)
Continuous query threads waiting for a read lock while trying to update the cache. This generally comes up after the retries for the client node connection are exhausted.
Stacktrace:
Possible starvation in striped pool.
Thread name: sys-stripe-12-#13%App%
Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=CacheContinuousQueryBatchAck [routineId=37b43550-d3a5-4518-8745-ece5dc06b1fd, updateCntrs=HashMap {2=7414, 5=8228, 7=7508, 13=7536, 525=7586, 14=7596, 527=7959, 533=7886, 534=7666, 539=9556, 547=7866, 36=8380, 549=8131, 38=7126, 39=7776, 46=7822, 52=7800, 54=8098, 567=7894, 569=7640, 60=7912, 62=8170, 63=7962, 64=8190, 65=7662, 72=7754, 585=7712, 81=8564, 594=8000, 82=7980, 83=7999, 595=7688, 596=7972, 85=7494, 597=7806, 601=7812, 89=7478, 602=7868, 603=7944, 604=7944, 93=7778, 96=8036, 99=7916, 102=7584, 618=7956, 107=7656, 111=7176, 112=8042, 116=7620, 125=7768, 637=7662, 130=7846, 642=7696, 134=11672, 138=7638, 651=7418, 652=7908, 140=7478, 654=9136, 655=8934, 144=8052, 145=7656, 147=7904, 663=7354, 153=7868, 667=8232, 669=7774, 157=7850, 160=8094, 673=8120, 682=7722, 172=7930, 689=7864, 180=8026, 692=7674, 184=7526, 699=7458, 191=8326, 193=7700, 195=7986, 197=8056, 713=7858, 716=7896, 719=7946, 210=7560, 725=7604, 214=7442, 727=7668, 729=7406, 731=7790, 219=7594, 733=7360, 225=7522, 737=7482, 227=7838, 744=8380, 234=7150, 237=7886, 750=7910, 239=8624... and 104 more}]]], Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=CacheContinuousQueryBatchAck [routineId=0e950ae5-1474-4488-9042-80dbddb2f09a, updateCntrs=HashMap {2=7414, 5=8228, 7=7508, 13=7536, 525=7586, 14=7596, 527=7959, 533=7886, 534=7666, 539=9556, 547=7866, 36=8380, 549=8131, 38=7126, 39=7776, 46=7822, 52=7800, 54=8098, 567=7894, 569=7640, 60=7912, 62=8170, 63=7962, 64=8190, 65=7662, 72=7754, 585=7712, 81=8564, 594=8000, 82=7980, 595=7688, 83=7999, 596=7972, 597=7806, 85=7494, 601=7812, 89=7478, 602=7868, 603=7944, 604=7944, 93=7778, 96=8036, 99=7916, 102=7584, 618=7956, 107=7656, 111=7176, 112=8042, 116=7620, 637=7662, 125=7768, 130=7846, 642=7696, 134=11672, 138=7638, 651=7418, 140=7478, 652=7908, 654=9136, 655=8934, 144=8052, 145=7656, 147=7904, 663=7354, 153=7868, 667=8232, 669=7774, 157=7850, 160=8094, 673=8120, 682=7722, 172=7930, 689=7864, 692=7674, 180=8026, 184=7526, 699=7458, 191=8326, 193=7700, 195=7986, 197=8056, 713=7858, 716=7896, 719=7946, 210=7560, 725=7604, 214=7442, 727=7668, 729=7406, 219=7594, 731=7790, 733=7360, 225=7522, 737=7482, 227=7838, 744=8380, 234=7150, 237=7886, 750=7910, 239=8624... and 104 more}]]]]
Deadlock: false
Completed: 3316358
Thread [name="sys-stripe-12-#13%App%", id=48, state=WAITING, blockCnt=106311, waitCnt=1659827]
Lock [object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync#5f611d9a, ownerName=exchange-worker-#71%App%, ownerId=138]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at o.a.i.i.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.readLock(GridDhtPartitionTopologyImpl.java:256)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1837)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1734)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3322)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$400(GridDhtAtomicCache.java:141)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:273)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:268)
at o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1142)
at o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591)
at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392)
at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318)
at o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109)
at o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:308)
at o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1907)
at o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1528)
at o.a.i.i.managers.communication.GridIoManager.access$5300(GridIoManager.java:241)
at o.a.i.i.managers.communication.GridIoManager$9.execute(GridIoManager.java:1421)
at o.a.i.i.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
at o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:565)
at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Thread.java:748)
Here we can see that the lock is owned by "exchange-worker-#71%App%", which seems to be stuck. In a few cases we have seen that the lock has no owner at all:
Thread [name="sys-stripe-2-#3%App%", id=43, state=WAITING, blockCnt=39097, waitCnt=394328]
Lock [object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync#667500d1, ownerName=null, ownerId=-1]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1663)
The continuous queries run on the server nodes, which are our data nodes, and we do not expect the data nodes to be impacted by client nodes in this way (e.g. getting locked up).
Can someone advise on how we can avoid such locks, given that nodes can have unclean shutdowns?
I believe setting the IGNITE_ENABLE_FORCIBLE_NODE_KILL property to true cluster-wide could help with the matter. It streamlines the process of kicking thick client nodes off a cluster, the main case for it being abruptly terminated client nodes.
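As a minimal sketch of applying it, assuming you configure nodes programmatically (you can equally pass -DIGNITE_ENABLE_FORCIBLE_NODE_KILL=true on the JVM command line); the client failure detection timeout shown alongside it is my own suggestion for discarding dead clients sooner, and the value is illustrative:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteSystemProperties;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ForcibleKillConfigSketch {
    public static void main(String[] args) {
        // Allow the cluster to forcibly drop unresponsive thick clients.
        System.setProperty(IgniteSystemProperties.IGNITE_ENABLE_FORCIBLE_NODE_KILL, "true");

        IgniteConfiguration cfg = new IgniteConfiguration();
        // Drop unreachable client nodes from the topology sooner (10s here, purely illustrative).
        cfg.setClientFailureDetectionTimeout(10_000);

        Ignite ignite = Ignition.start(cfg);
    }
}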

Extracting JVM stack trace from a heap dump (hprof) with a command line tool

I want to automate the extraction of the stack trace from the JVM crash heap dump file on a headless server. Is there a non-GUI solution for this?
The MAT project comes with a script, ParseHeapDump.sh (*.bat on Windows). If you just run it on a heap dump like this: /path/to/MAT/ParseHeapDump.sh <HEAPDUMP_NAME>.hprof, you get a set of files, one of them being a text file <HEAPDUMP_NAME>.threads containing stack traces, e.g.:
Thread 0x715883630
locals:
Thread 0x71577b418
at java.lang.Object.wait(J)V (Native Method)
at java.lang.ref.ReferenceQueue.remove(J)Ljava/lang/ref/Reference; (ReferenceQueue.java:155)
at jdk.internal.ref.CleanerImpl.run()V (CleanerImpl.java:148)
at java.lang.Thread.run()V (Thread.java:834)
at jdk.internal.misc.InnocuousThread.run()V (InnocuousThread.java:134)
locals:
objectId=0x715833c30, line=1
objectId=0x716df1990, line=1
objectId=0x715832230, line=2
objectId=0x71577b418, line=2
objectId=0x71577b418, line=2
objectId=0x71577b418, line=3
objectId=0x71577b418, line=4
Thread 0x715883050
at jdk.internal.misc.Unsafe.park(ZJ)V (Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(Ljava/lang/Object;J)V (LockSupport.java:234)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(J)J (AbstractQueuedSynchronizer.java:2123)
at java.util.concurrent.DelayQueue.take()Ljava/util/concurrent/Delayed; (DelayQueue.java:229)
at com.intellij.util.concurrency.AppDelayQueue.lambda$new$0()V (AppDelayQueue.java:26)
at com.intellij.util.concurrency.AppDelayQueue$$Lambda$16.run()V (Unknown Source)
at java.lang.Thread.run()V (Thread.java:834)
locals:
objectId=0x715883050, line=1
objectId=0x715cd8a90, line=2
objectId=0x7291c6618, line=2
objectId=0x715f73520, line=3
objectId=0x715cd8aa8, line=3
objectId=0x715883050, line=3
objectId=0x715f73520, line=4
objectId=0x715cd8ac0, line=5
objectId=0x715883050, line=6
...
You are only interested in lines starting with Thread 0x... and at ... here.
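If you only need those lines on a headless box, a small filter over the .threads file is enough; here is a hedged sketch in Java (an equivalent grep over the same two prefixes works just as well):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ThreadsFileFilter {
    // Usage: java ThreadsFileFilter <HEAPDUMP_NAME>.threads
    public static void main(String[] args) throws IOException {
        Files.lines(Paths.get(args[0]))
             .map(String::trim)
             // Keep only thread headers and stack frames; drop the "locals:" blocks.
             .filter(line -> line.startsWith("Thread 0x") || line.startsWith("at "))
             .forEach(System.out::println);
    }
}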
Notes:
The script does a lot more work than is necessary to extract only a thread dump (it parses the heap as well, which is 99.99...% of the work), and it generates many files that you do not need (all prefixed <HEAPDUMP_NAME>).
Thread names are lost, and the thread ids do not seem to correspond to anything OS-specific (e.g. the native thread id); in other words, you cannot find Thread 0x715883050 in ps or top/htop output, at least I do not know a way.
Open your heap dump in VisualVM.
After the heap dump is loaded, there is a dropdown in the top left corner of the tab. Switch it to Threads to see the thread view.

Striped pool starvation in WAL writing causes cluster node failure

A moderate workload on a 3-node Ignite cluster causes one node to fail with striped pool starvation while archiving the WAL.
This happens once or twice a week.
I have already checked for any IO problems that could hang the WAL rollover, but the issue still persists.
I am using the latest Ignite 2.7 as a library inside a Spring Boot application.
: >>> Possible starvation in striped pool.
Deadlock: false
Completed: 1397
Thread [name="sys-stripe-7-#8%server.node%", id=22, state=WAITING, blockCnt=3, waitCnt=757]
Lock [object=java.util.concurrent.locks.ReentrantLock$NonfairSync#b01791b, ownerName=sys-#214%server.node%, ownerId=248]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager$FileWriteHandle.awaitNext(FileWriteAheadLogManager.java:2871)
at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager$FileWriteHandle.access$2300(FileWriteAheadLogManager.java:2451)
at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager.rollOver(FileWriteAheadLogManager.java:1205)
at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:836)
at o.a.i.i.processors.cache.GridCacheMapEntry.logUpdate(GridCacheMapEntry.java:4267)
at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.update(GridCacheMapEntry.java:6333)
at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:6082)
at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5782)
at o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:3719)
at o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.access$5900(BPlusTree.java:3613)
at o.a.i.i.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1895)
at o.a.i.i.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1779)
at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1638)
at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1621)
at o.a.i.i.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1935)
at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:428)
at o.a.i.i.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:2295)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processDhtAtomicUpdateRequest(GridDhtAtomicCache.java:3242)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$600(GridDhtAtomicCache.java:135)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:309)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:304)
at o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
at o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380)
at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306)
at o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101)
at o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295)
at o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
at o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197)
at o.a.i.i.managers.communication.GridIoManager.access$4200(GridIoManager.java:127)
at o.a.i.i.managers.communication.GridIoManager$9.run(GridIoManager.java:1093)
at o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:505)
at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Thread.java:748)
ERROR --- [tcp-disco-msg-worker-#2%server.node%] [] o.a.i.i.u.t.G : Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-1, blockedFor=10s]
WARN --- [tcp-disco-msg-worker-#2%server.node%] [] o.a.i.i.u.t.G : Thread [name="sys-stripe-1-#2%server.node%", id=16, state=WAITING, blockCnt=0, waitCnt=754]
Lock [object=java.util.concurrent.locks.ReentrantLock$NonfairSync#b01791b, ownerName=sys-#214%server.node%, ownerId=248]
The failure detection feature is not very well configured in Apache Ignite 2.7 by default. You can turn it off (by setting the failure handler to NoOp) or set a large failureDetectionTimeout to avoid such messages (and the shutdown of nodes).
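A minimal sketch of what that could look like programmatically (the timeout value is illustrative, and NoOpFailureHandler disables the automatic node stop entirely, so use it with care):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.NoOpFailureHandler;

public class FailureDetectionConfigSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Option 1: give slow WAL rollover / checkpoint operations more headroom before the node is considered failed.
        cfg.setFailureDetectionTimeout(60_000);

        // Option 2: do not act on failure events at all (no automatic node shutdown).
        cfg.setFailureHandler(new NoOpFailureHandler());

        Ignition.start(cfg);
    }
}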

JobTracker - High memory and native thread usage

We are running hadoop on GCE with HDFS default file system, and data input/output from/to GCS.
Hadoop version: 1.2.1
Connector version: com.google.cloud.bigdataoss:gcs-connector:1.3.0-hadoop1
Observed behavior: JT will accumulate threads in waiting state, leading to OOM:
2015-02-06 14:15:51,206 ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed:
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.initialize(AbstractGoogleAsyncWriteChannel.java:318)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:275)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.create(CacheSupplementedGoogleCloudStorage.java:145)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:184)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:168)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:77)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:655)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:444)
at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:1860)
at org.apache.hadoop.mapred.JobInProgress$3.run(JobInProgress.java:709)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:706)
at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:3890)
at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
After looking through the JT logs I found these warnings:
2015-02-06 14:30:17,442 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt #0 from primary datanode xx.xxx.xxx.xxx:50010
java.io.IOException: Call to /xx.xxx.xxx.xxx:50020 failed on local exception: java.io.IOException: Couldn't set up IO streams
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150)
at org.apache.hadoop.ipc.Client.call(Client.java:1118)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at com.sun.proxy.$Proxy10.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:414)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:392)
at org.apache.hadoop.hdfs.DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:201)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3317)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2783)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2987)
Caused by: java.io.IOException: Couldn't set up IO streams
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642)
at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:205)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1249)
at org.apache.hadoop.ipc.Client.call(Client.java:1093)
... 9 more
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:635)
... 12 more
This appears to be similar to the Hadoop bug reported here: https://issues.apache.org/jira/browse/MAPREDUCE-5606
I tried the proposed solution of disabling saving job logs into the output path, and it solved the problem at the expense of missing logs :)
I also ran jstack on the JT and it showed hundreds of WAITING or TIMED_WAITING threads such as:
pool-52-thread-1" prio=10 tid=0x00007feaec581000 nid=0x524f in Object.wait() [0x00007fead39b3000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x000000074d86ba60> (a java.io.PipedInputStream)
at java.io.PipedInputStream.read(PipedInputStream.java:327)
- locked <0x000000074d86ba60> (a java.io.PipedInputStream)
at java.io.PipedInputStream.read(PipedInputStream.java:378)
- locked <0x000000074d86ba60> (a java.io.PipedInputStream)
at com.google.api.client.util.ByteStreams.read(ByteStreams.java:181)
at com.google.api.client.googleapis.media.MediaHttpUploader.setContentAndHeadersOnCurrentRequest(MediaHttpUploader.java:629)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:409)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:460)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.run(AbstractGoogleAsyncWriteChannel.java:354)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000074d864918> (a java.util.concurrent.ThreadPoolExecutor$Worker)
It appears the JT is having a hard time keeping up with communication to GCS via the GCS connector.
Please advise, thank you.
At the moment, every open FSDataOutputStream in the GCS connector for Hadoop consumes a thread until it's closed, because a separate thread needs to run the "resumable" HttpRequests while the user of the OutputStream writes bytes intermittently. In most cases, (such as in individual Hadoop tasks), there's only ever one long-lived output stream, and possibly a few shorter-lived ones for writing small metadata/marker files, etc.
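To make that concrete, the background upload thread is only released when the stream is closed, so anything that keeps streams open keeps threads. A minimal sketch of the intended lifecycle (the bucket and path are illustrative, not your actual job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GcsStreamSketch {
    // Hypothetical helper; shows one stream = one background upload thread while open.
    static void writeMarker(Configuration conf) throws Exception {
        Path marker = new Path("gs://my-bucket/markers/_SUCCESS");
        FileSystem fs = marker.getFileSystem(conf);
        try (FSDataOutputStream out = fs.create(marker)) {
            out.writeUTF("done");
        } // close() completes the resumable upload and releases that thread
    }
}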
In general, there are two possible causes for the OOM you're running into:
You have lots of queued up jobs; every submitted job holds an unclosed OutputStream, and thus consumes a "waiting" thread. However, since you mention you only need to queue up ~10 jobs, this shouldn't be the root cause.
Something is causing a "leak" of the PrintWriter objects originally created in logSubmitted and added to fileManager. Typically, terminal events (like logFinished) will correctly close() all the PrintWriters before removing them from the map via markCompleted, but in theory there may be bugs here or there that cause one of the OutputStreams to leak without being close()'d. For example, while I haven't had a chance to verify this assertion, it seems that an IOException while trying to do something like logMetaInfo will "removeWriter" without closing it.
I've verified that at least under normal circumstances, the OutputStream seem to get closed correctly, and my sample JobTracker shows a clean jstack after having successfully run a lot of jobs.
TL;DR: There are some working theories as to why some resource may leak and ultimately prevent necessary threads from being created. You should consider changing hadoop.job.history.user.location to some HDFS location in the meantime, as a way to preserve the job logs in the absence of placing them on GCS.
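For example, a hedged sketch of pointing the per-job history location at HDFS instead of the GCS output path (the same property can also go in mapred-site.xml; the namenode path is illustrative):

import org.apache.hadoop.mapred.JobConf;

public class JobHistoryLocationSketch {
    // Hypothetical helper: write per-job history files to HDFS rather than the GCS output path.
    static JobConf withHdfsHistory(JobConf conf) {
        conf.set("hadoop.job.history.user.location", "hdfs://namenode:8020/user/jobtracker/history");
        return conf;
    }
}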

Is this an indication of a blocked finalizer

I see the following call stack for the finalizer thread. Is it normal to have a call to WaitForSingleObject at the top of the finalizer? Is there any way I can determine whether it is deadlocked or has just been waiting for a really long time?
0:009> k
Child-SP RetAddr Call Site
00000000`0a56e5c8 000007fe`fd5010dc ntdll!NtWaitForSingleObject+0xa
00000000`0a56e5d0 000007fe`fdfabeb5 KERNELBASE!WaitForSingleObjectEx+0x79
00000000`0a56e670 000007fe`fe04a576 rpcrt4!UTIL_GetOverlappedResultEx+0x45
00000000`0a56e6b0 000007fe`fdfaf0dd rpcrt4!WS_SyncRecv+0xf6
00000000`0a56e720 000007fe`fdfe7a29 rpcrt4!OSF_CCONNECTION::TransSendReceive+0x18d
00000000`0a56e780 000007fe`fdfa7f61 rpcrt4!OSF_CCONNECTION::SendBindPacket+0xa5c
00000000`0a56e930 000007fe`fdfa8e27 rpcrt4!OSF_CCONNECTION::ActuallyDoBinding+0xc1
00000000`0a56e9d0 000007fe`fdfa8bb6 rpcrt4!OSF_CCONNECTION::OpenConnectionAndBind+0x207
00000000`0a56ea90 000007fe`fdfa8acd rpcrt4!OSF_CCALL::BindToServer+0xc6
00000000`0a56eb40 000007fe`fdfadaeb rpcrt4!OSF_BINDING_HANDLE::InitCCallWithAssociation+0xa5
00000000`0a56eba0 000007fe`fdfad9d0 rpcrt4!OSF_BINDING_HANDLE::AllocateCCall+0x102
00000000`0a56ecd0 000007fe`fdfc74eb rpcrt4!OSF_BINDING_HANDLE::NegotiateTransferSyntax+0x30
00000000`0a56ed20 000007fe`ff462271 rpcrt4!I_RpcNegotiateTransferSyntax+0x9f
00000000`0a56edb0 000007fe`ff45d185 ole32!CRpcChannelBuffer::NegotiateSyntax+0x69
00000000`0a56ee20 000007fe`fe05ba22 ole32!NdrExtNegotiateTransferSyntax+0xe5
00000000`0a56ee60 000007fe`fe05cbbb rpcrt4!Ndr64pClientSetupTransferSyntax+0x453
00000000`0a56eec0 000007fe`ff4621d0 rpcrt4!NdrpClientCall3+0xcb
00000000`0a56f180 000007fe`ff31d8a2 ole32!ObjectStublessClient+0x11d
00000000`0a56f510 000007fe`ff321bb3 ole32!ObjectStubless+0x42
00000000`0a56f560 000007fe`ff321b22 ole32!RemoteReleaseRifRefHelper+0x57
00000000`0a56f5b0 000007fe`ff3217eb ole32!RemoteReleaseRifRef+0xca
00000000`0a56f620 000007fe`ff321417 ole32!CStdMarshal::DisconnectCliIPIDs+0x4c2
00000000`0a56f720 000007fe`ff3194fa ole32!CStdMarshal::Disconnect+0x40c
00000000`0a56f780 000007fe`ff319428 ole32!CStdIdentity::~CStdIdentity+0xa6 [d:\w7rtm\com\ole32\com\dcomrem\stdid.cxx # 312]
00000000`0a56f7b0 000007fe`ff319b49 ole32!CStdIdentity::`scalar deleting destructor'+0x14
00000000`0a56f7e0 000007fe`f2a79f94 ole32!CStdIdentity::CInternalUnk::Release+0xdc [d:\w7rtm\com\ole32\com\dcomrem\stdid.cxx # 767]
00000000`0a56f810 000007fe`f2ba8bea clr!SafeReleasePreemp+0x74
00000000`0a56f860 000007fe`f2ba8acc clr!RCW::ReleaseAllInterfaces+0xda
00000000`0a56f8b0 000007fe`f2ba8c14 clr!RCW::ReleaseAllInterfacesCallBack+0x54
00000000`0a56f930 000007fe`f2ba937e clr!RCW::Cleanup+0x25
00000000`0a56f980 000007fe`f2ba9214 clr!RCWCleanupList::ReleaseRCWListRaw+0x16
00000000`0a56f9b0 000007fe`f2bb005a clr!RCWCleanupList::ReleaseRCWListInCorrectCtx+0x94
00000000`0a56f9f0 000007fe`f2be326e clr!RCWCleanupList::CleanupAllWrappers+0xe5
00000000`0a56fa70 000007fe`f2be319f clr!SyncBlockCache::CleanupSyncBlocks+0xc2
00000000`0a56fae0 000007fe`f2be47c7 clr!Thread::DoExtraWorkForFinalizer+0xdc
00000000`0a56fb10 000007fe`f2a5458c clr!SVR::GCHeap::FinalizerThreadWorker+0x78
00000000`0a56fb50 000007fe`f2a5451a clr!Frame::Pop+0x50
00000000`0a56fb90 000007fe`f2a54491 clr!COMCustomAttribute::PopSecurityContextFrame+0x192
00000000`0a56fc90 000007fe`f2b31bfe clr!COMCustomAttribute::PopSecurityContextFrame+0xbd
00000000`0a56fd20 000007fe`f2b45020 clr!ManagedThreadBase_NoADTransition+0x3f
00000000`0a56fd80 000007fe`f2ab33de clr!SVR::GCHeap::FinalizerThreadStart+0xb4
00000000`0a56fdc0 00000000`772459ed clr!Thread::intermediateThreadProc+0x7d
00000000`0a56fe80 00000000`7747c541 kernel32!BaseThreadInitThunk+0xd
00000000`0a56feb0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
0:009> !CLRStack
OS Thread Id: 0x1338 (6)
Child SP IP Call Site
000000000b67fcd8 00000000774a12fa [DebuggerU2MCatchHandlerFrame: 000000000b67fcd8]
This is the finalizer worker thread and it is blocked waiting for an RPC response. If the thread has not been waiting for more than a few milliseconds (depending on how long the COM server takes to respond to the request), that is normal. But if it has been waiting longer, you might want to investigate what happened to the response packet.
There is no way in user mode to determine the wait tick-count of the thread; you may want to examine the same thread using livekd on a running system if the problem is reproducible.
Yes, this is an indication of a blocked finalizer. A very simple example is when you use COM to interact between Excel and C#:
you hold an Excel module as a member variable on the C# side, and the issue can arise because control of that module is in the GC's hands while Excel is waiting for it to be released by the C# code.
A simple way to avoid it is to explicitly release any unmanaged COM objects held by your C# component before leaving the cleanup to the GC.