JVM intermittent crashes during garbage collection

We have a JSF app deployed on JBoss 5 with JVM HotSpot build 1.6.0_14-b08, running on a machine with 4 cores.
In the last few days we have encountered a few sudden JVM crashes; according to the fatal error log, they appear to happen during garbage collection.
We use these JVM flags related to GC:
-Dsun.rmi.dgc.client.gcInterval=600000 -Dsun.rmi.dgc.server.gcInterval=3600000
-XX:ParallelGCThreads=4 -XX:+DisableExplicitGC
How can we track down the root cause? I'm not an expert at examining fatal error logs.
Some excerpts from the fatal error logs:
1.)
--------------- T H R E A D ---------------
Current thread (0x000000004d8dc800): VMThread [stack:0x0000000040d83000,0x0000000040e84000] [id=24601]
siginfo:si_signo=SIGSEGV: si_errno=0, si_code=2 (SEGV_ACCERR), si_addr=0x00002aaaae1ff000
......
VM_Operation (0x000000005ea08b40): ParallelGCFailedAllocation, mode: safepoint, requested by thread 0x0000000051def000
2.)
--------------- T H R E A D ---------------
Current thread (0x0000000041ec8800): GCTaskThread [stack: 0x0000000000000000,0x0000000000000000] [id=19822]
siginfo:si_signo=SIGSEGV: si_errno=0, si_code=2 (SEGV_ACCERR), si_addr=0x00002aaaae1ff008
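One way to get more data for the next crash (a sketch; these are standard HotSpot 6 diagnostic flags, and the paths are placeholders for this setup) is to enable GC logging and redirect the fatal error log to a persistent location:
-XX:ErrorFile=/var/log/jvm/hs_err_pid%p.log
-verbose:gc -Xloggc:/var/log/jvm/gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamPS
The GC log then shows which collection was running around the time of the SIGSEGV, which helps narrow down whether the crashes correlate with ParallelGC activity (as the ParallelGCFailedAllocation VM operation above suggests).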


JVM Runtime.availableProcessors() returns 2 when it should be 4

I'm running OpenJDK 11 on Alpine Linux in a container in an AWS EKS cluster.
The application determines the size of a thread pool based on the number of CPUs, as returned by Runtime.getRuntime().availableProcessors().
This call returns 2 processors even though the container shows that 4 CPUs are available:
# cat /proc/cpuinfo | grep processor
processor : 0
processor : 1
processor : 2
processor : 3
Any idea why and how to solve the problem?
Update
Doing some more digging (prompted by some great questions from @gohm'c in the comments), I found a way to add trace log prints to the JVM with -Xlog:os+container=trace:
[0.001s][trace][os,container] CPU Shares is: 1536
[0.001s][trace][os,container] CPU Share count based on shares: 2
I have resources.requests.cpu: "1500m" defined for this container.
I don't know why there is a slight discrepancy (1500 vs. 1536), but when I change the value of the CPU request, the CPU Shares value in the trace log changes accordingly.
I understand how the resources.limits.cpu value could affect the number of CPUs the JVM sees, but why does the resources.requests.cpu value do that too? This seems like a bug to me. Any thoughts?
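For what it's worth, the cgroup CPU-shares heuristic in the JDK divides the shares value by 1024 and rounds up, which matches the 1536 shares and 2 CPUs seen in the trace above. If the pool should be sized by the node's 4 CPUs regardless of the request, one workaround (a sketch; -XX:ActiveProcessorCount is a standard HotSpot flag, and app.jar stands in for your application) is to set the processor count explicitly instead of letting the JVM derive it from the shares:
java -XX:ActiveProcessorCount=4 -jar app.jar
or disable the cgroup detection entirely (which also affects container memory detection):
java -XX:-UseContainerSupport -jar app.jar
-XX:ActiveProcessorCount only changes what availableProcessors() reports; the Kubernetes CPU request and limit themselves are unchanged.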

Striped pool starvation in WAL writing causes cluster node failure

A moderate workload on a 3-node Ignite cluster causes one node to fail with striped pool starvation while archiving the WAL.
This happens once or twice a week.
I have already checked for I/O problems that could hang the WAL rollover, but the issue still persists.
I am using the latest Ignite 2.7 as a library inside a Spring Boot application.
: >>> Possible starvation in striped pool.
Deadlock: false
Completed: 1397
Thread [name="sys-stripe-7-#8%server.node%", id=22, state=WAITING, blockCnt=3, waitCnt=757]
Lock [object=java.util.concurrent.locks.ReentrantLock$NonfairSync#b01791b, ownerName=sys-#214%server.node%, ownerId=248]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager$FileWriteHandle.awaitNext(FileWriteAheadLogManager.java:2871)
at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager$FileWriteHandle.access$2300(FileWriteAheadLogManager.java:2451)
at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager.rollOver(FileWriteAheadLogManager.java:1205)
at o.a.i.i.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:836)
at o.a.i.i.processors.cache.GridCacheMapEntry.logUpdate(GridCacheMapEntry.java:4267)
at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.update(GridCacheMapEntry.java:6333)
at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:6082)
at o.a.i.i.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5782)
at o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:3719)
at o.a.i.i.processors.cache.persistence.tree.BPlusTree$Invoke.access$5900(BPlusTree.java:3613)
at o.a.i.i.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1895)
at o.a.i.i.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1779)
at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1638)
at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1621)
at o.a.i.i.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1935)
at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:428)
at o.a.i.i.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:2295)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processDhtAtomicUpdateRequest(GridDhtAtomicCache.java:3242)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$600(GridDhtAtomicCache.java:135)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:309)
at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:304)
at o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
at o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380)
at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306)
at o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101)
at o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295)
at o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
at o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197)
at o.a.i.i.managers.communication.GridIoManager.access$4200(GridIoManager.java:127)
at o.a.i.i.managers.communication.GridIoManager$9.run(GridIoManager.java:1093)
at o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:505)
at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Thread.java:748)
ERROR --- [tcp-disco-msg-worker-#2%server.node%] [] o.a.i.i.u.t.G : Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-1, blockedFor=10s]
WARN --- [tcp-disco-msg-worker-#2%server.node%] [] o.a.i.i.u.t.G : Thread [name="sys-stripe-1-#2%server.node%", id=16, state=WAITING, blockCnt=0, waitCnt=754]
Lock [object=java.util.concurrent.locks.ReentrantLock$NonfairSync#b01791b, ownerName=sys-#214%server.node%, ownerId=248]
Failure detection is not very well tuned in Apache Ignite 2.7 by default. You can turn it off (by setting the failure handler to NoOp) or set a large failureDetectionTimeout to avoid such messages (and the shutdown of nodes).
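A minimal sketch of that configuration from the Java side (method names per the Ignite 2.7 API; the timeout values are placeholders to adjust for your workload):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.NoOpFailureHandler;

public class IgniteStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Give slow WAL rollovers more time before the node is considered failed.
        cfg.setFailureDetectionTimeout(120_000);

        // Raise the threshold that triggers the
        // "Blocked system-critical thread has been detected" message.
        cfg.setSystemWorkerBlockedTimeout(120_000);

        // Or disable the automatic reaction entirely; warnings are still logged,
        // but the node is not stopped.
        cfg.setFailureHandler(new NoOpFailureHandler());

        Ignite ignite = Ignition.start(cfg);
    }
}

Note that this only suppresses the node shutdown; if the WAL archiving itself stays slow, the underlying starvation warnings will keep appearing.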

How to build a histogram of methods by time spent inside them with Mono?

I have tried the following:
mono --profile=log myprog.exe
to collect profiler data. Then, to interpret the collected data, I invoke:
> mprof-report output.mlpd
Mono log profiler data
Profiler version: 2.0
Data version: 14
Arguments: log
Architecture: x86-64
Operating system: linux
Mean timer overhead: 51 nanoseconds
Program startup: Fri Jul 20 00:11:12 2018
Program ID: 19840
Server listening on: 59374
JIT summary
Compiled methods: 8349
Generated code size: 2621631
JIT helpers: 0
JIT helpers code size: 0
GC summary
GC resizes: 0
Max heap size: 0
Object moves: 0
Metadata summary
Loaded images: 16
Loaded assemblies: 16
Exception summary
Throws: 0
Thread summary
Thread: 0x7fb49c50a700, name: ""
Thread: 0x7fb49d27b700, name: "Threadpool worker"
Thread: 0x7fb49d07a700, name: "Threadpool worker"
Thread: 0x7fb49ce79700, name: "Threadpool worker"
Thread: 0x7fb49cc78700, name: "Threadpool worker"
Thread: 0x7fb49d6b9700, name: ""
Thread: 0x7fb4bbff1700, name: "Finalizer"
Thread: 0x7fb4bfe3f740, name: "Main"
Domain summary
Domain: (nil), friendly name: "myprog.exe"
Domain: 0x1d037f0, friendly name: "(null)"
Context summary
Context: (nil), domain: (nil)
However, there is no information about which methods were called often and took long to complete, which was the one thing I expected from profiling.
How do I use Mono profiling to gather and output information about the total run time of method calls, like hprof with cpu=times generates?
The Mono docs are "slightly" wrong, as method calls are not tracked by default. This option creates huge profile log output and massively slows down the "total" execution time, and when combined with other options like alloc it affects the execution time of the methods and thus any timings being collected.
Personally, I would recommend using calls profiling by itself, adjusting calldepth to a level that matters to your profiling (i.e. do you need to profile into the framework calls or not?). A smaller call depth also greatly decreases the size of the log produced.
Example:
mono --profile=log:calls,calldepth=10 Console_Ling.exe
Produces:
Method call summary
Total(ms) Self(ms) Calls Method name
53358 0 1 (wrapper runtime-invoke) <Module>:runtime_invoke_void_object (object,intptr,intptr,intptr)
53358 2 1 Console_Ling.MainClass:Main (string[])
53340 2 1 Console_Ling.MainClass:Stuff ()
53337 0 3 System.Linq.Enumerable:ToList<int> (System.Collections.Generic.IEnumerable`1<int>)
53194 13347 1 System.Linq.Enumerable/WhereListIterator`1<int>:ToList ()
33110 13181 20000000 Console_Ling.MainClass/<>c__DisplayClass0_0:<Stuff>b__0 (int)
19928 13243 20000000 System.Collections.Generic.List`1<int>:Contains (int)
6685 6685 20000000 System.Collections.Generic.GenericEqualityComparer`1<int>:Equals (int,int)
Re: http://www.mono-project.com/docs/debug+profile/profile/profiler/#profiler-option-documentation
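For completeness, a typical end-to-end invocation against your own program could look like this (output= is the log profiler option for naming the data file; myprog.exe and the file name are placeholders):
mono --profile=log:calls,calldepth=10,output=myprog.mlpd myprog.exe
mprof-report myprog.mlpd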

HANA hdbindexserver start issue after power outage

There was a power outage affecting our 5+1 node HANA cell cluster.
After we booted the servers, we tried to start the HANA DB.
During HDB start as SIDADM we see the following on nodes 2, 3, 4 and 5:
FAIL: process hdbindexserver HDB Indexserver not running
So, of course, we tried to start hdbindexserver by hand as SIDADM:
cd /usr/sap/SIDADM/HDB0x/exe; ./hdbindexserver
But this just produces errors:
/usr/sap/SIDADM/HDB0x/foobar003/trace> cat indexserver_alert_foobar003.trc
...
[14268]{-1}[-1/-1] 2017-10-09 19:55:34.593776 e TrexNet Communication.cpp(00501) : no internal interface found
[14287]{-1}[-1/-1] 2017-10-09 19:56:01.428226 e Checkpoint CheckpointMgr.cc(00244) : Skip versions garbage collection savepoint: transaction distribution work failure: snapshot timestamp synchronization failed
[14287]{-1}[-1/-1] 2017-10-09 19:56:22.467184 e Row_Engine transdtx.cc(01410) : Unexpected ltt exception thrown: transaction distribution work failure (at foobar/ptime/storage/tm/transdtx.cc:1410 )
[14287]{-1}[-1/-1] 2017-10-09 19:56:22.467427 f PersistenceLayer PersistenceController.cpp(00679) : startup failed exception 1: no.71000145 (ptime/storage/tm/transdtx.cc:1512)
snapshot timestamp synchronization failed
...
The IPs are up. There is 1 TB of RAM.
The question: what could cause hdbindexserver to fail to start?
It looks like the indexserver process wasn't able to bind to the internal network interface again:
Communication.cpp(00501) : no internal interface found
I'd look into the other trace files and the system log to check whether the configured network interface is up and available.
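As a rough sketch of that check (the path and parameter name reflect a typical HANA installation; replace <SID> with your system ID), you could verify the interface and the configured internal network from the OS level:
ip addr show    # is the internal interface up with its expected IP?
grep -i listeninterface /usr/sap/<SID>/SYS/global/hdb/custom/config/global.ini    # which interface class is configured in [communication]?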
It seems the persistence storage (the disk where the data and log files reside) is not responding in time, and hence the startup times out. Can you check whether you can access the data and log files from that server?
Also check whether network I/O or disk I/O is slow on that server, causing the synchronization to time out.
You can try stopping the system completely and bringing HDB up on just that server first to check whether the above issue still exists.

php7.0.2 Program terminated with signal 11, Segmentation fault

I am running PHP 7.0.2 with CodeIgniter (a PHP MVC framework). I got some segmentation faults which caused core dumps, and I found that these segmentation faults occur randomly when the child php-fpm processes shut down and restart. I don't know why.
Using gdb "bt" to display the core dump:
Core was generated by `php-fpm: pool www '.
Program terminated with signal 11, Segmentation fault.
#0 zend_string_release (ht=0x114dae0) at /home/smt/phpng/php-7.0.2/Zend/zend_string.h:269
269 /home/smt/phpng/php-7.0.2/Zend/zend_string.h: No such file or directory.
in /home/smt/phpng/php-7.0.2/Zend/zend_string.h
Missing separate debuginfos, use: debuginfo-install php7-7.0.2-20160407105024.x86_64
(gdb) bt
#0 zend_string_release (ht=0x114dae0) at /home/smt/phpng/php-7.0.2/Zend/zend_string.h:269
#1 zend_hash_destroy (ht=0x114dae0) at /home/smt/phpng/php-7.0.2/Zend/zend_hash.c:1273
#2 0x000000000080647b in module_destructor (module=0x14b6ae0)
at /home/smt/phpng/php-7.0.2/Zend/zend_API.c:2509
#3 0x000000000080075c in module_destructor_zval (zv=<value optimized out>)
at /home/smt/phpng/php-7.0.2/Zend/zend.c:615
#4 0x000000000080dcff in _zend_hash_del_el_ex (ht=0x1154780)
at /home/smt/phpng/php-7.0.2/Zend/zend_hash.c:1013
#5 _zend_hash_del_el (ht=0x1154780) at /home/smt/phpng/php-7.0.2/Zend/zend_hash.c:1037
#6 zend_hash_graceful_reverse_destroy (ht=0x1154780) at /home/smt/phpng/php-7.0.2/Zend/zend_hash.c:1489
#7 0x0000000000800096 in zend_shutdown () at /home/smt/phpng/php-7.0.2/Zend/zend.c:840
#8 0x00000000007a2a6a in php_module_shutdown () at /home/smt/phpng/php-7.0.2/main/main.c:2339
#9 0x000000000089e45d in main (argc=<value optimized out>, argv=<value optimized out>)
at /home/smt/phpng/php-7.0.2/sapi/fpm/fpm/fpm_main.c:1997
(gdb) quit
The php-fpm.log shows the following:
[20-Apr-2016 08:00:02] WARNING: [pool www] child 11751 exited on signal 11 (SIGSEGV - core dumped) after 3600.462022 seconds from start
I am very curious about this bug.
So far, I am sure that the core dumps occurred when the fpm processes restarted. The restarts were caused either by the command 'kill -10 fpm-master-process-ids' or by a child having processed 'pm.max_requests' requests.
However, the core dumps didn't occur on every restart, and the probability of a core dump was very small. I cannot find a pattern.
Fortunately, I have installed version 7.0.5 to replace 7.0.2 in our production environment, and it has run for three days without core dumps.
I cannot find any relevant modification in the changelogs from 7.0.2 to 7.0.5. This is a very strange bug and I want to know the reason. Can anyone tell me something about it?
After updating to 7.0.5, no core dump has occurred for two weeks, so this bug seems to have been fixed in 7.0.5!
I still don't know what caused this bug.
I am a curious cat. #_#