Thread stuck when accessing atomic reference or long on Apache Ignite - ignite
This is regarding a rather recent issue that we’ve been facing. We run 2 client instances and 26 apache ignite instances. All are AWS R4.2xLarge nodes. Recently we’ve been seeing this issue where when trying to fetch an atomicLong or atomicReference, the executing thread gets stuck and doesn’t return. This issue usually happens on 1 or 2 ignite instances. I am not sure why this happens and so any help on this would be really appreciated.
This is the thread dump while trying to get an atomicReference:
"main" #1 prio=5 os_prio=0 cpu=3528.41ms elapsed=1067.33s allocated=312M defined_classes=9309 tid=0x00007f4ce4046fc0 nid=0x1537 waiting on condition [0x00007f4cece90000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base#11.0.7/Native Method)
- parking to wait for <0x00007f4cbfe7c7d0> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base#11.0.7/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base#11.0.7/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base#11.0.7/AbstractQueuedSynchronizer.java:1039)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base#11.0.7/AbstractQueuedSynchronizer.java:1345)
at java.util.concurrent.CountDownLatch.await(java.base#11.0.7/CountDownLatch.java:232)
at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7612)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.awaitInitialization(DataStructuresProcessor.java:1147)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.getAtomic(DataStructuresProcessor.java:506)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.atomicReference(DataStructuresProcessor.java:744)
at org.apache.ignite.internal.IgniteKernal.atomicReference(IgniteKernal.java:3743)
at org.apache.ignite.internal.IgniteKernal.atomicReference(IgniteKernal.java:3732)
at company.explore.cache.persist.SavedAudienceLocationProvider.getSavedAudienceLocation(SavedAudienceLocationProvider.java:27)
at company.explore.listeners.lifecycle.LifecycleListener.configureSavedAudienceLocation(LifecycleListener.java:45)
at company.explore.listeners.lifecycle.LifecycleListener.onLifecycleEvent(LifecycleListener.java:38)
at org.apache.ignite.internal.IgniteKernal.notifyLifecycleBeans(IgniteKernal.java:725)
at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1156)
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2038)
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1730)
- locked <0x00007f4cbf072a38> (a org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance)
at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
at org.apache.ignite.Ignition.start(Ignition.java:348)
at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)
Since this is stuck any Ignition.ignite calls fail as well and cause the job not to go through:
"pub-#22" #48 prio=5 os_prio=0 cpu=5.76ms elapsed=1036.50s allocated=421K defined_classes=6 tid=0x00007f4ce4cf3990 nid=0x1607 waiting on condition [0x00007f40375f6000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base#11.0.7/Native Method)
- parking to wait for <0x00007f4cbf16d9e0> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base#11.0.7/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base#11.0.7/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base#11.0.7/AbstractQueuedSynchronizer.java:1039)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base#11.0.7/AbstractQueuedSynchronizer.java:1345)
at java.util.concurrent.CountDownLatch.await(java.base#11.0.7/CountDownLatch.java:232)
at org.apache.ignite.internal.util.IgniteUtils.awaitQuiet(IgniteUtils.java:7657)
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.grid(IgnitionEx.java:1671)
at org.apache.ignite.internal.IgnitionEx.grid(IgnitionEx.java:1389)
at org.apache.ignite.internal.IgnitionEx.grid(IgnitionEx.java:1258)
at org.apache.ignite.Ignition.ignite(Ignition.java:489)
at company.explore.dataload.person.LoadPersonAttributeJob.call(LoadPersonAttributeJob.java:58)
at company.explore.dataload.person.LoadPersonAttributeJob.call(LoadPersonAttributeJob.java:31)
at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C2.execute(GridClosureProcessor.java:1855)
at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:568)
at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6817)
at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:562)
at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base#11.0.7/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base#11.0.7/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base#11.0.7/Thread.java:834)
Similarly this is an instance where the thread is waiting for CountDownLatch when trying to get atomicLong:
"pub-#489" #608 prio=5 os_prio=0 cpu=16.80ms elapsed=7076.10s allocated=2409K defined_classes=17 tid=0x00007f48c8014c60 nid=0x5bd5 waiting on condition [0x00007f48359e1000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base#11.0.7/Native Method)
- parking to wait for <0x00007f518aba6060> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base#11.0.7/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base#11.0.7/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base#11.0.7/AbstractQueuedSynchronizer.java:1039)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base#11.0.7/AbstractQueuedSynchronizer.java:1345)
at java.util.concurrent.CountDownLatch.await(java.base#11.0.7/CountDownLatch.java:232)
at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7612)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.awaitInitialization(DataStructuresProcessor.java:1147)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.getAtomic(DataStructuresProcessor.java:506)
at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.atomicLong(DataStructuresProcessor.java:463)
at org.apache.ignite.internal.IgniteKernal.atomicLong(IgniteKernal.java:3716)
at org.apache.ignite.internal.IgniteKernal.atomicLong(IgniteKernal.java:3705)
at company.explore.cache.persist.person.SerializationStatus.getSerializeCounter(SerializationStatus.java:86)
at company.explore.cache.persist.person.SerializationStatus.startNodeSerialization(SerializationStatus.java:21)
at company.explore.cache.persist.personv2.PersonSerializationJob.serializePeopleData(PersonSerializationJob.java:98)
at company.explore.cache.persist.personv2.PersonSerializationJob.run(PersonSerializationJob.java:75)
at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C4.execute(GridClosureProcessor.java:1944)
at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:568)
at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6817)
at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:562)
at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base#11.0.7/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base#11.0.7/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base#11.0.7/Thread.java:834)
These issues have only started coming up as of the past 2 months or so. The system itself has been very stable for a long time.
I haven’t posted the entire thread dump as it would be quite large. If needed, I can post it on pastebin or upload it somewhere.
Since this really isn’t a very consistent issue I am not sure about how to create a reproducer project. But I can provide any logs or so if needed.
EDIT:
The entire thread dumps have been posted on pastebin. Please find the links below:
Atomic Reference related thread dump: pastebin.com/ydNMFSEP
Atomic Long related thread dump: pastebin.com/psJgwi3F
Related
Scaling Apache Ignite Grid
We having scaling Apache Ignite grid where client nodes scale up and down based on load. Data nodes are our server nodes where continuous queries run. However this leads to unclean shutdown of some client nodes as we rely on SIGTERM for Ignite node shutdown. Unclean shutdown of client nodes impacts excution of continuos query which starts giving "Possible starvation in striped pool" warning ultimately leading to Blocked system-critical threads. We are currently working on ways to prevent striped pool stravation and have noticed 2 key issues around it: Continuos query thread trying to connect to nodes which have shutdown but are still present in topology: We are planning to reduce the timeout so that client node is discarded early from grid. Stacktrace: Thread [name="sys-stripe-1-#2%App%", id=37, state=RUNNABLE, blockCnt=233817, waitCnt=3343945] at sun.nio.ch.Net.poll(Native Method) at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954) at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110) at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3781) at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3635) at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createCommunicationClient(TcpCommunicationSpi.java:3375) at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:3180) at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:3013) at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2960) at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:2100) at o.a.i.i.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:2365) at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1964) at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1935) at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1917) at o.a.i.i.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1324) at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1261) at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler.onEntryUpdate(CacheContinuousQueryHandler.java:1059) at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler.access$600(CacheContinuousQueryHandler.java:90) at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler$2.onEntryUpdated(CacheContinuousQueryHandler.java:459) at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryUpdated(CacheContinuousQueryManager.java:447) Continuos query threads waiting for read lock while trying to update the cache. This is generally comes up after retries for Client node connection are over. Stacktrace: Possible starvation in striped pool. Thread name: sys-stripe-12-#13%App% Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=CacheContinuousQueryBatchAck [routineId=37b43550-d3a5-4518-8745-ece5dc06b1fd, updateCntrs=HashMap {2=7414, 5=8228, 7=7508, 13=7536, 525=7586, 14=7596, 527=7959, 533=7886, 534=7666, 539=9556, 547=7866, 36=8380, 549=8131, 38=7126, 39=7776, 46=7822, 52=7800, 54=8098, 567=7894, 569=7640, 60=7912, 62=8170, 63=7962, 64=8190, 65=7662, 72=7754, 585=7712, 81=8564, 594=8000, 82=7980, 83=7999, 595=7688, 596=7972, 85=7494, 597=7806, 601=7812, 89=7478, 602=7868, 603=7944, 604=7944, 93=7778, 96=8036, 99=7916, 102=7584, 618=7956, 107=7656, 111=7176, 112=8042, 116=7620, 125=7768, 637=7662, 130=7846, 642=7696, 134=11672, 138=7638, 651=7418, 652=7908, 140=7478, 654=9136, 655=8934, 144=8052, 145=7656, 147=7904, 663=7354, 153=7868, 667=8232, 669=7774, 157=7850, 160=8094, 673=8120, 682=7722, 172=7930, 689=7864, 180=8026, 692=7674, 184=7526, 699=7458, 191=8326, 193=7700, 195=7986, 197=8056, 713=7858, 716=7896, 719=7946, 210=7560, 725=7604, 214=7442, 727=7668, 729=7406, 731=7790, 219=7594, 733=7360, 225=7522, 737=7482, 227=7838, 744=8380, 234=7150, 237=7886, 750=7910, 239=8624... and 104 more}]]], Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=CacheContinuousQueryBatchAck [routineId=0e950ae5-1474-4488-9042-80dbddb2f09a, updateCntrs=HashMap {2=7414, 5=8228, 7=7508, 13=7536, 525=7586, 14=7596, 527=7959, 533=7886, 534=7666, 539=9556, 547=7866, 36=8380, 549=8131, 38=7126, 39=7776, 46=7822, 52=7800, 54=8098, 567=7894, 569=7640, 60=7912, 62=8170, 63=7962, 64=8190, 65=7662, 72=7754, 585=7712, 81=8564, 594=8000, 82=7980, 595=7688, 83=7999, 596=7972, 597=7806, 85=7494, 601=7812, 89=7478, 602=7868, 603=7944, 604=7944, 93=7778, 96=8036, 99=7916, 102=7584, 618=7956, 107=7656, 111=7176, 112=8042, 116=7620, 637=7662, 125=7768, 130=7846, 642=7696, 134=11672, 138=7638, 651=7418, 140=7478, 652=7908, 654=9136, 655=8934, 144=8052, 145=7656, 147=7904, 663=7354, 153=7868, 667=8232, 669=7774, 157=7850, 160=8094, 673=8120, 682=7722, 172=7930, 689=7864, 692=7674, 180=8026, 184=7526, 699=7458, 191=8326, 193=7700, 195=7986, 197=8056, 713=7858, 716=7896, 719=7946, 210=7560, 725=7604, 214=7442, 727=7668, 729=7406, 219=7594, 731=7790, 733=7360, 225=7522, 737=7482, 227=7838, 744=8380, 234=7150, 237=7886, 750=7910, 239=8624... and 104 more}]]]] Deadlock: false Completed: 3316358 Thread [name="sys-stripe-12-#13%App%", id=48, state=WAITING, blockCnt=106311, waitCnt=1659827] Lock [object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync#5f611d9a, ownerName=exchange-worker-#71%App%, ownerId=138] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) at o.a.i.i.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.readLock(GridDhtPartitionTopologyImpl.java:256) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1837) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1734) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3322) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$400(GridDhtAtomicCache.java:141) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:273) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:268) at o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1142) at o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591) at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392) at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318) at o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109) at o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:308) at o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1907) at o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1528) at o.a.i.i.managers.communication.GridIoManager.access$5300(GridIoManager.java:241) at o.a.i.i.managers.communication.GridIoManager$9.execute(GridIoManager.java:1421) at o.a.i.i.managers.communication.TraceRunnable.run(TraceRunnable.java:55) at o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:565) at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120) at java.lang.Thread.run(Thread.java:748) Here we can see that lock is owned by "exchange-worker-#71%App%" which seems to be struck. In few cases we have seen that lock has no owner specific: Thread [name="sys-stripe-2-#3%App%", id=43, state=WAITING, blockCnt=39097, waitCnt=394328] Lock [object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync#667500d1, ownerName=null, ownerId=-1] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1663) Continuos query runs on server nodes which are our data nodes and we do not expect data nodes to be impacted by client nodes like getting locked. Can someone advice on how we can avoid such locks given that nodes can have unclean shutdowns?
I believe setting IGNITE_ENABLE_FORCIBLE_NODE_KILL property as true cluster-wide could help with the matter. It streamlines the process of kicking thick client nodes off a cluster, the main case for it is abruptly terminated client nodes.
Extracting JVM stack trace from a heap dump (hprof) with a command line tool
I want to automate the extraction of the stack trace from the JVM crash heap dump file on a headless server. Is there a non-GUI solution for this?
MAT project comes with a script ParseHeapDump.sh (*.bat on Windows), if you just run it on a heap dump like this: /path/to/MAT/ParseHeapDump.sh <HEAPDUMP_NAME>.hprof, you get a set of files, one of them being a text file <HEAPDUMP_NAME>.threads containing stack traces, e.g.: Thread 0x715883630 locals: Thread 0x71577b418 at java.lang.Object.wait(J)V (Native Method) at java.lang.ref.ReferenceQueue.remove(J)Ljava/lang/ref/Reference; (ReferenceQueue.java:155) at jdk.internal.ref.CleanerImpl.run()V (CleanerImpl.java:148) at java.lang.Thread.run()V (Thread.java:834) at jdk.internal.misc.InnocuousThread.run()V (InnocuousThread.java:134) locals: objectId=0x715833c30, line=1 objectId=0x716df1990, line=1 objectId=0x715832230, line=2 objectId=0x71577b418, line=2 objectId=0x71577b418, line=2 objectId=0x71577b418, line=3 objectId=0x71577b418, line=4 Thread 0x715883050 at jdk.internal.misc.Unsafe.park(ZJ)V (Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(Ljava/lang/Object;J)V (LockSupport.java:234) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(J)J (AbstractQueuedSynchronizer.java:2123) at java.util.concurrent.DelayQueue.take()Ljava/util/concurrent/Delayed; (DelayQueue.java:229) at com.intellij.util.concurrency.AppDelayQueue.lambda$new$0()V (AppDelayQueue.java:26) at com.intellij.util.concurrency.AppDelayQueue$$Lambda$16.run()V (Unknown Source) at java.lang.Thread.run()V (Thread.java:834) locals: objectId=0x715883050, line=1 objectId=0x715cd8a90, line=2 objectId=0x7291c6618, line=2 objectId=0x715f73520, line=3 objectId=0x715cd8aa8, line=3 objectId=0x715883050, line=3 objectId=0x715f73520, line=4 objectId=0x715cd8ac0, line=5 objectId=0x715883050, line=6 ... You are only interested in lines starting with Thread 0x... and at ... here. Notes: The script does a lot more work than would be necessary to only extract a thread dump (it parses the heap as well which is 99.99...% of work); and it generates many files that you do not need (all prefixed <HEAPDUMP_NAME>). Thread names are lost, and thread ids seem to not correspond to anything OS-specific (e.g. native thread id); in other words, you cannot find Thread 0x715883050 in the ps or top/htop output — at least I do not know a way.
Open your heap dump in VisualVM After the heap dump is loaded in the top left corner of the tab is a dropdown. Switch it to Threads. Switch the thread view
Stripped pool starvation in WAL writing causes node cluster node failure
Moderate workload on 3 node ignite cluster causes one node to fail with stripped pool startvation while archiving WAL. This happens one or two times in week. I already checked all IO problems which could hang WAL rollover. But this issue still persist I am using latest ignite 2.7 as a library inside spring boot application : >>> Possible starvation in striped pool. Deadlock: false Completed: 1397 Thread [name="sys-stripe-7-#8%server.node%", id=22, state=WAITING, blockCnt=3, waitCnt=757] Lock [object=java.util.concurrent.locks.ReentrantLock$NonfairSync#b01791b, ownerName=sys-#214%server.node%, ownerId=248] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285) at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager$FileWriteHandle.awaitNext(FileWriteAheadLogManager.java:2871) at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager$FileWriteHandle.access$2300(FileWriteAheadLogManager.java:2451) at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager.rollOver(FileWriteAheadLogManager.java:1205) at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:836) at o.a.i.i.processors.cache.GridCacheMapEntry.logUpdate(GridCacheMapEntry.java:4267) at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.update(GridCacheMapEntry.java:6333) at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:6082) at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5782) at o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:3719) at o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.access$5900(BPlusTree.java:3613) at o.a.i.i.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1895) at o.a.i.i.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1779) at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1638) at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1621) at o.a.i.i.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1935) at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:428) at o.a.i.i.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:2295) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processDhtAtomicUpdateRequest(GridDhtAtomicCache.java:3242) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$600(GridDhtAtomicCache.java:135) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:309) at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:304) at o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056) at o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581) at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380) at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306) at o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101) at o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295) at o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569) at o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197) at o.a.i.i.managers.communication.GridIoManager.access$4200(GridIoManager.java:127) at o.a.i.i.managers.communication.GridIoManager$9.run(GridIoManager.java:1093) at o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:505) at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120) at java.lang.Thread.run(Thread.java:748) ERROR --- [tcp-disco-msg-worker-#2%server.node%] [] o.a.i.i.u.t.G : Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-1, blockedFor=10s] WARN --- [tcp-disco-msg-worker-#2%server.node%] [] o.a.i.i.u.t.G : Thread [name="sys-stripe-1-#2%server.node%", id=16, state=WAITING, blockCnt=0, waitCnt=754] Lock [object=java.util.concurrent.locks.ReentrantLock$NonfairSync#b01791b, ownerName=sys-#214%server.node%, ownerId=248]
Failure Detection feature is not very well configured in Apache Ignite 2.7 by default. You can turn it off (by setting to NoOp) or set a large failureDetectionTimeout to avoid such messages (and shutdown of nodes).
JobTracker - High memory and native thread usage
We are running hadoop on GCE with HDFS default file system, and data input/output from/to GCS. Hadoop version: 1.2.1 Connector version: com.google.cloud.bigdataoss:gcs-connector:1.3.0-hadoop1 Observed behavior: JT will accumulate threads in waiting state, leading to OOM: 2015-02-06 14:15:51,206 ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371) at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.initialize(AbstractGoogleAsyncWriteChannel.java:318) at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:275) at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.create(CacheSupplementedGoogleCloudStorage.java:145) at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:184) at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:168) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:77) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:655) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:444) at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:1860) at org.apache.hadoop.mapred.JobInProgress$3.run(JobInProgress.java:709) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190) at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:706) at org.apache.hadoop.mapred.JobTracker.initJob(Jobenter code hereTracker.java:3890) at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) After looking through the JT logs I found these warnings: 2015-02-06 14:30:17,442 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt #0 from primary datanode xx.xxx.xxx.xxx:50010 java.io.IOException: Call to /xx.xxx.xxx.xxx:50020 failed on local exception: java.io.IOException: Couldn't set up IO streams at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150) at org.apache.hadoop.ipc.Client.call(Client.java:1118) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229) at com.sun.proxy.$Proxy10.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:414) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:392) at org.apache.hadoop.hdfs.DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:201) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3317) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2783) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2987) Caused by: java.io.IOException: Couldn't set up IO streams at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642) at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:205) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1249) at org.apache.hadoop.ipc.Client.call(Client.java:1093) ... 9 more Caused by: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:635) ... 12 more This appears to be similar to hadoop bug reporter here: https://issues.apache.org/jira/browse/MAPREDUCE-5606 I tried proposed solution by disabling saving job logs into the output path and it solved the problem at the expense of missing logs :) I also ran jstack on JT and it showed hundreds of WAITING or TIMED_WAITING threads as such: pool-52-thread-1" prio=10 tid=0x00007feaec581000 nid=0x524f in Object.wait() [0x00007fead39b3000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x000000074d86ba60> (a java.io.PipedInputStream) at java.io.PipedInputStream.read(PipedInputStream.java:327) - locked <0x000000074d86ba60> (a java.io.PipedInputStream) at java.io.PipedInputStream.read(PipedInputStream.java:378) - locked <0x000000074d86ba60> (a java.io.PipedInputStream) at com.google.api.client.util.ByteStreams.read(ByteStreams.java:181) at com.google.api.client.googleapis.media.MediaHttpUploader.setContentAndHeadersOnCurrentReque st(MediaHttpUploader.java:629) at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader. java:409) at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336) at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(Abstr actGoogleClientRequest.java:419) at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(Abstr actGoogleClientRequest.java:343) at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogl eClientRequest.java:460) at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.run(AbstractGo ogleAsyncWriteChannel.java:354) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - <0x000000074d864918> (a java.util.concurrent.ThreadPoolExecutor$Worker) It appears JT is having hard time keeping up communicating with GCS via GCS Connector. Please advise, Thank you
At the moment, every open FSDataOutputStream in the GCS connector for Hadoop consumes a thread until it's closed, because a separate thread needs to run the "resumable" HttpRequests while the user of the OutputStream writes bytes intermittently. In most cases, (such as in individual Hadoop tasks), there's only ever one long-lived output stream, and possibly a few shorter-lived ones for writing small metadata/marker files, etc. In general, there are two possible causes for the OOM you're running into: You have lots of queued up jobs; every submitted job holds an unclosed OutputStream, and thus consumes a "waiting" thread. However, since you mention you only need to queue up ~10 jobs, this shouldn't be the root cause. Something is causing a "leak" of the PrintWriter objects, originally created in logSubmitted and added to fileManager. Typically, terminal events (like logFinished will correctly close() all the PrintWriters before removing them from the map via markCompleted, but in theory they may be bugs here or there which can cause one of the OutputStreams to leak without being close()'d. For example, while I haven't had a chance to verify this assertion, it seems that IOException trying to do something like logMetaInfo will "removeWriter" without closing it. I've verified that at least under normal circumstances, the OutputStream seem to get closed correctly, and my sample JobTracker shows a clean jstack after having successfully run a lot of jobs. TL;DR: There are some working theories as to why some resource may leak and ultimately prevent necessary threads from being created. You should consider changing hadoop.job.history.user.location to some HDFS location in the meantime, as a way to preserve the job logs in the absence of placing them on GCS.
is this indication of blocked finalizer
I see following call stack for finalizer thread. Is it normal to have a call to WaitForSingleObject at top in finalizer? Is there anyway I can determine if its not deadlocked or waiting for really long time? 0:009> k Child-SP RetAddr Call Site 00000000`0a56e5c8 000007fe`fd5010dc ntdll!NtWaitForSingleObject+0xa 00000000`0a56e5d0 000007fe`fdfabeb5 KERNELBASE!WaitForSingleObjectEx+0x79 00000000`0a56e670 000007fe`fe04a576 rpcrt4!UTIL_GetOverlappedResultEx+0x45 00000000`0a56e6b0 000007fe`fdfaf0dd rpcrt4!WS_SyncRecv+0xf6 00000000`0a56e720 000007fe`fdfe7a29 rpcrt4!OSF_CCONNECTION::TransSendReceive+0x18d 00000000`0a56e780 000007fe`fdfa7f61 rpcrt4!OSF_CCONNECTION::SendBindPacket+0xa5c 00000000`0a56e930 000007fe`fdfa8e27 rpcrt4!OSF_CCONNECTION::ActuallyDoBinding+0xc1 00000000`0a56e9d0 000007fe`fdfa8bb6 rpcrt4!OSF_CCONNECTION::OpenConnectionAndBind+0x207 00000000`0a56ea90 000007fe`fdfa8acd rpcrt4!OSF_CCALL::BindToServer+0xc6 00000000`0a56eb40 000007fe`fdfadaeb rpcrt4!OSF_BINDING_HANDLE::InitCCallWithAssociation+0xa5 00000000`0a56eba0 000007fe`fdfad9d0 rpcrt4!OSF_BINDING_HANDLE::AllocateCCall+0x102 00000000`0a56ecd0 000007fe`fdfc74eb rpcrt4!OSF_BINDING_HANDLE::NegotiateTransferSyntax+0x30 00000000`0a56ed20 000007fe`ff462271 rpcrt4!I_RpcNegotiateTransferSyntax+0x9f 00000000`0a56edb0 000007fe`ff45d185 ole32!CRpcChannelBuffer::NegotiateSyntax+0x69 00000000`0a56ee20 000007fe`fe05ba22 ole32!NdrExtNegotiateTransferSyntax+0xe5 00000000`0a56ee60 000007fe`fe05cbbb rpcrt4!Ndr64pClientSetupTransferSyntax+0x453 00000000`0a56eec0 000007fe`ff4621d0 rpcrt4!NdrpClientCall3+0xcb 00000000`0a56f180 000007fe`ff31d8a2 ole32!ObjectStublessClient+0x11d 00000000`0a56f510 000007fe`ff321bb3 ole32!ObjectStubless+0x42 00000000`0a56f560 000007fe`ff321b22 ole32!RemoteReleaseRifRefHelper+0x57 00000000`0a56f5b0 000007fe`ff3217eb ole32!RemoteReleaseRifRef+0xca 00000000`0a56f620 000007fe`ff321417 ole32!CStdMarshal::DisconnectCliIPIDs+0x4c2 00000000`0a56f720 000007fe`ff3194fa ole32!CStdMarshal::Disconnect+0x40c 00000000`0a56f780 000007fe`ff319428 ole32!CStdIdentity::~CStdIdentity+0xa6 [d:\w7rtm\com\ole32\com\dcomrem\stdid.cxx # 312] 00000000`0a56f7b0 000007fe`ff319b49 ole32!CStdIdentity::`scalar deleting destructor'+0x14 00000000`0a56f7e0 000007fe`f2a79f94 ole32!CStdIdentity::CInternalUnk::Release+0xdc [d:\w7rtm\com\ole32\com\dcomrem\stdid.cxx # 767] 00000000`0a56f810 000007fe`f2ba8bea clr!SafeReleasePreemp+0x74 00000000`0a56f860 000007fe`f2ba8acc clr!RCW::ReleaseAllInterfaces+0xda 00000000`0a56f8b0 000007fe`f2ba8c14 clr!RCW::ReleaseAllInterfacesCallBack+0x54 00000000`0a56f930 000007fe`f2ba937e clr!RCW::Cleanup+0x25 00000000`0a56f980 000007fe`f2ba9214 clr!RCWCleanupList::ReleaseRCWListRaw+0x16 00000000`0a56f9b0 000007fe`f2bb005a clr!RCWCleanupList::ReleaseRCWListInCorrectCtx+0x94 00000000`0a56f9f0 000007fe`f2be326e clr!RCWCleanupList::CleanupAllWrappers+0xe5 00000000`0a56fa70 000007fe`f2be319f clr!SyncBlockCache::CleanupSyncBlocks+0xc2 00000000`0a56fae0 000007fe`f2be47c7 clr!Thread::DoExtraWorkForFinalizer+0xdc 00000000`0a56fb10 000007fe`f2a5458c clr!SVR::GCHeap::FinalizerThreadWorker+0x78 00000000`0a56fb50 000007fe`f2a5451a clr!Frame::Pop+0x50 00000000`0a56fb90 000007fe`f2a54491 clr!COMCustomAttribute::PopSecurityContextFrame+0x192 00000000`0a56fc90 000007fe`f2b31bfe clr!COMCustomAttribute::PopSecurityContextFrame+0xbd 00000000`0a56fd20 000007fe`f2b45020 clr!ManagedThreadBase_NoADTransition+0x3f 00000000`0a56fd80 000007fe`f2ab33de clr!SVR::GCHeap::FinalizerThreadStart+0xb4 00000000`0a56fdc0 00000000`772459ed clr!Thread::intermediateThreadProc+0x7d 00000000`0a56fe80 00000000`7747c541 kernel32!BaseThreadInitThunk+0xd 00000000`0a56feb0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d 0:009> !CLRStack OS Thread Id: 0x1338 (6) Child SP IP Call Site 000000000b67fcd8 00000000774a12fa [DebuggerU2MCatchHandlerFrame: 000000000b67fcd8]
This is a finalizer worker and it is blocked for a RPC response. In this case if the thread is not waiting for more than a few milliseconds (depending on how long the COM server takes to respond to the request) it is normal. But if we are waiting longer then you might want to investigate what happened to the response packet. There is no way in user mode to understand what's the tick-count of the thread you may want to examine the same thread using livekd on a running system if the problem is re-producible
It is this indication of a blocked finalizer. A very simple example if you use COM to interact between Excel and C#. You have a Module of Excel that is being made a member variable of C# and then the issue might arise as the control of that module is within GC's hands but excel will be waiting for it to be released by the C#. Simple way is to just clear any unmanaged code you have in your C# component before letting the unmanaged code to be cleared.