Facing application stop during GC - jvm

Need some advice:
I am using the Linux operating system.
I have 24 CPUs (Intel(R) Xeon(R) CPU X5650 @ 2.67GHz) in my machine.
RAM: 24675112 kB
JDK version: java version "1.6.0_32"
My JVM settings:
-Xmx12288m
-Xms12288m
-XX:MaxPermSize=1024m
-XX:MaxNewSize=4096m
-XX:InitialCodeCacheSize=128m
-XX:ReservedCodeCacheSize=256m
-XX:+UseCodeCacheFlushing
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:+CMSClassUnloadingEnabled
-XX:+UseCodeCacheFlushing
-XX:+OptimizeStringConcat
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:TLABSize=512k
-XX:+PrintGC
-XX:+DisableExplicitGC
-XX:+PrintGCDetails
-XX:-PrintConcurrentLocks
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintReferenceGC
-XX:+PrintJNIGCStalls
-XX:+PrintGCApplicationStoppedTime
-XX:+CMSScavengeBeforeRemark
-XX:ConcGCThreads=12
-XX:ParallelGCThreads=12
My GC log:
2013-08-27T16:59:34.584-0500: 201544.846: [GC [1 CMS-initial-mark: 7548665K(8388608K)] 7973934K(12163520K), 0.5412390 secs] [Times: user=0.54 sys=0.00, real=0.54 secs]
Total time for which application threads were stopped: 0.5427300 seconds
2013-08-27T16:59:35.125-0500: 201545.387: [CMS-concurrent-mark-start]
Total time for which application threads were stopped: 0.0018140 seconds
Total time for which application threads were stopped: 0.0012720 seconds
Total time for which application threads were stopped: 0.0005850 seconds
Total time for which application threads were stopped: 0.0016750 seconds
Total time for which application threads were stopped: 0.0013800 seconds
Total time for which application threads were stopped: 0.0004860 seconds
Total time for which application threads were stopped: 0.0010740 seconds
Total time for which application threads were stopped: 0.0015250 seconds
Total time for which application threads were stopped: 0.0012570 seconds
Total time for which application threads were stopped: 0.0001500 seconds
Total time for which application threads were stopped: 0.0012430 seconds
2013-08-27T16:59:48.240-0500: 201558.502: [CMS-concurrent-mark: 13.095/13.114 secs] [Times: user=72.62 sys=4.61, real=13.12 secs]
2013-08-27T16:59:48.240-0500: 201558.502: [CMS-concurrent-preclean-start]
201558.502: [Preclean SoftReferences, 0.0003750 secs]201558.502: [Preclean WeakReferences, 0.0006710 secs]201558.503: [Preclean FinalReferences, 0.0001780 secs]201558.503: [Preclean PhantomReferences, 0.0000170 secs]2013-08-27T16:59:48.265-0500: 201558.527: [CMS-concurrent-preclean: 0.025/0.025 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
2013-08-27T16:59:48.265-0500: 201558.527: [CMS-concurrent-abortable-preclean-start]
2013-08-27T16:59:53.952-0500: 201564.214: [GC 201564.214: [ParNew202182.528: [SoftReference, 0 refs, 0.0000110 secs]202182.528: [WeakReference, 1015 refs, 0.0001720 secs]202182.528: [FinalReference, 1625 refs, 0.0025210 secs]202182.531: [PhantomReference, 1 refs, 0.0000050 secs]202182.531: [JNI Weak Reference, 0.0000100 secs] (promotion failed): 3774912K->3774912K(3774912K), 619.3859950 secs]202183.600: [CMS CMS: abort preclean due to time 2013-08-27T17:10:14.937-0500: 202185.199: [CMS-concurrent-abortable-preclean: 7.271/626.671 secs] [Times: user=699.20 sys=98.95, real=626.55 secs]
(concurrent mode failure)202189.179: [SoftReference, 614 refs, 0.0000710 secs]202189.179: [WeakReference, 5743 refs, 0.0007600 secs]202189.180: [FinalReference, 3380 refs, 0.0004430 secs]202189.180: [PhantomReference, 212 refs, 0.0000260 secs]202189.180: [JNI Weak Reference, 0.0000180 secs]: 8328471K->4813636K(8388608K), 18.5940310 secs] 11323577K->4813636K(12163520K), [CMS Perm : 208169K->208091K(346920K)], 637.9802050 secs] [Times: user=705.47 sys=98.79, real=637.85 secs]
**Total time for which application threads were stopped: 637.9820120 seconds**
Total time for which application threads were stopped: 0.0047480 seconds
Total time for which application threads were stopped: 0.0006330 seconds
Total time for which application threads were stopped: 0.0052820 seconds
Total time for which application threads were stopped: 0.0008540 seconds
Total time for which application threads were stopped: 0.0008090 seconds
Total time for which application threads were stopped: 0.0002400 seconds
Total time for which application threads were stopped: 0.0058850 seconds
Total time for which application threads were stopped: 0.0008530 seconds
Total time for which application threads were stopped: 0.0010900 seconds
Total time for which application threads were stopped: 0.0006730 seconds
Total time for which application threads were stopped: 0.0006930 seconds
Total time for which application threads were stopped: 0.0012200 seconds
Total time for which application threads were stopped: 0.0016290 seconds
Total time for which application threads were stopped: 0.0007660 seconds
Total time for which application threads were stopped: 0.0005650 seconds
Total time for which application threads were stopped: 0.0002880 seconds
Total time for which application threads were stopped: 0.0005440 seconds
Total time for which application threads were stopped: 0.0006790 seconds
Total time for which application threads were stopped: 0.0007510 seconds
Total time for which application threads were stopped: 0.0003870 seconds
Total time for which application threads were stopped: 0.0007190 seconds
Total time for which application threads were stopped: 0.0016670 seconds
Total time for which application threads were stopped: 0.0007340 seconds
Total time for which application threads were stopped: 0.0014800 seconds
Total time for which application threads were stopped: 0.0019800 seconds
Total time for which application threads were stopped: 0.0009810 seconds
Total time for which application threads were stopped: 0.0010530 seconds
Total time for which application threads were stopped: 0.0006650 seconds
Total time for which application threads were stopped: 0.0009600 seconds
Total time for which application threads were stopped: 0.0007110 seconds
Total time for which application threads were stopped: 0.0011330 seconds
Total time for which application threads were stopped: 0.0006940 seconds
Total time for which application threads were stopped: 0.0008220 seconds
Total time for which application threads were stopped: 0.0015080 seconds
Total time for which application threads were stopped: 0.0007340 seconds
Total time for which application threads were stopped: 0.0003830 seconds
Total time for which application threads were stopped: 0.0005620 seconds
2013-08-27T17:10:39.422-0500: 202209.684: [GC 202209.685: [ParNew202210.354: [SoftReference, 0 refs, 0.0000130 secs]202210.355: [WeakReference, 3827 refs, 0.0003690 secs]202210.355: [FinalReference, 3925 refs, 0.0024380 secs]202210.357: [PhantomReference, 99 refs, 0.0000220 secs]202210.357: [JNI Weak Reference, 0.0000090 secs]: 3355520K->419392K(3774912K), 0.6728950 secs] 8169156K->5597366K(12163520K), 0.6730540 secs] [Times: user=7.13 sys=0.00, real=0.68 secs]
Total time for which application threads were stopped: 0.6743080 seconds
Total time for which application threads were stopped: 0.0013680 seconds
Total time for which application threads were stopped: 0.0004720 seconds
Total time for which application threads were stopped: 0.0006960 seconds
Total time for which application threads were stopped: 0.0015600 seconds
While this is happening, my running application gets "Service Temporarily Unavailable" errors.

This is an old question, but I'm going to answer it for completeness, as the other suggestions/comments miss the point.
The GC log record in question is at 201564.214.
It is a ParNew (young-generation collection) that suffers a promotion failure (the tenured space does not have enough capacity to accept the data being promoted). But I don't think this is the real problem; rather, it is the consequence of another underlying problem.
The odd thing here is that the preclean phase has accumulated 98.95 seconds of system time and the Full GC (concurrent mode failure) is reporting 98.79 seconds of system time. These numbers suggest that the underlying problem isn't with GC but with how the JVM is interacting with the OS. Garbage collectors run in user space and should not accumulate significant system time.
One of the things that can cause this is the amount of work the OS has to perform to manage memory. If the system is in a low-memory condition, the OS may be spending a lot of time managing it. This can be further aggravated by competing/heavy I/O operations or by the use of transparent huge pages in Linux. It's hard to say without further data. That said, in situations like this you need to look at the system you're running on to get a better handle on what is happening.
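For example, a minimal probe along these lines could log cumulative GC time next to the free physical memory the OS reports, so that long pauses can be lined up against memory pressure. This is only a sketch (the class name is made up here), and it assumes a HotSpot JVM, where the standard OperatingSystemMXBean can be cast to the com.sun.management extension:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: every 10 seconds, print cumulative GC counts/time next to the free
// physical memory reported by the OS. Assumes HotSpot, where the
// OperatingSystemMXBean implements the com.sun.management extension.
public class GcVsMemoryProbe {
    public static void main(String[] args) throws InterruptedException {
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        while (true) {
            long gcCount = 0;
            long gcTimeMs = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                gcCount += gc.getCollectionCount();
                gcTimeMs += gc.getCollectionTime();
            }
            System.out.printf("gcCount=%d gcTimeMs=%d freePhysicalMB=%d%n",
                    gcCount, gcTimeMs, os.getFreePhysicalMemorySize() / (1024 * 1024));
            Thread.sleep(10000L);
        }
    }
}

Comparing that output against OS-level metrics (swap activity, I/O wait, THP settings) is what helps separate "GC is slow" from "the OS is making GC slow".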

What does "bw: SpinningDown" mean in a RedisTimeoutException?

What does "bw: SpinningDown" mean in this error -
Timeout performing GET (5000ms), next: GET foo!bar!baz, inst: 5, qu: 0, qs: 0, aw: False, bw: SpinningDown, ....
Does it mean that the Redis server instance is spinning down, or something else?
It actually means something else. The abbreviation bw stands for Backlog-Writer, which contains the status of what the backlog is doing in Redis.
For this particular status, SpinningDown, you actually left out the important bits that relate to it.
There are 4 values being tracked for workers: Busy, Free, Min and Max.
Let's take these hypothetical values: Busy=250, Free=750, Min=200, Max=1000
In this case there are 50 more existing (busy) threads than the minimum.
The cost of spinning up a new thread is high, especially if you hit the .NET-provided global thread pool limit, in which case only one new thread is created every 500 ms due to throttling.
So once the Backlog is done processing an item, instead of just exiting the thread, it will keep it in a waiting state (SpinningDown) for 5 seconds. If during that time there is still more Backlog to process, the same thread will process another item from the Backlog.
If no Backlog item needed to be processed in those 5 seconds, the thread will exit, which will eventually lead to a decrease in Busy (existing) threads.
This only happens for threads above the Min count, of course, as threads within the Min count are kept alive even if there is no work to do.
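As a rough analogy only (this is java.util.concurrent, not StackExchange.Redis code, and the numbers simply mirror the hypothetical values above), the same lifecycle can be expressed with a ThreadPoolExecutor: core ("Min") threads stay alive when idle, extra threads are created on demand up to "Max", and an idle extra thread exits only after the keep-alive window passes with no new work:

import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SpinDownAnalogy {
    public static void main(String[] args) {
        // Core threads (the hypothetical "Min" of 200) are kept alive even when
        // idle; extra threads are created on demand up to "Max" (1000); an idle
        // extra thread is torn down only after the 5-second keep-alive window
        // passes with no new work -- the rough equivalent of "SpinningDown".
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                200, 1000,
                5, TimeUnit.SECONDS,
                new SynchronousQueue<Runnable>());
        pool.submit(new Runnable() {
            public void run() {
                System.out.println("backlog item processed");
            }
        });
        pool.shutdown();
    }
}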

Threadpool not starting any new threads, but is below limit

I have an app that uses async tasks to get data from an MQTT server.
After I updated my system to Windows 11, I started noticing the problem that the async tasks are not starting after the app runs for some days.
This worked fine before I upgraded from Windows 10 to Windows 11.
Checking with a debugger, I noticed that the thread pool is set to a minimum of 16, but it currently only has 1 task, which is waiting for another task that is not starting.
I also checked a dump with WinDbg and got this result from !ThreadPool:
CPU utilization: 9%
Worker Thread: Total: 1 Running: 0 Idle: 1 MaxLimit: 32767 MinLimit: 16
Work Request in Queue: 0
--------------------------------------
Number of Timers: 1
--------------------------------------
Completion Port Thread:Total: 1 Free: 1 MaxFree: 32 CurrentLimit: 1 MaxLimit: 1000 MinLimit: 16
My code is basically:
Sub Test()
    StartUpMQTT().Wait()
End Sub

Private Async Function StartUpMQTT() As Task
    MQTTClient = Await Net.Mqtt.MqttClient.CreateAsync(strWebLink, Configuration) 'This is the line where it waits for the unstarted task.
    sessionState = Await MQTTClient.ConnectAsync(New MqttClientCredentials("TestAuslesung"))
End Function
For now I have changed my StartUpMQTT().Wait() to StartUpMQTT().Wait(5000), but I have to wait and see whether this solves the issue of the whole app hanging.
Is there anything I can try to prevent this from happening and freezing the app, other than my change?

About filebeat delay check

I am using log as the input and Kafka as the output in filebeat.yml.
In the INFO log line that is emitted every 30 seconds, does the part that is not yet published indicate a delay?
If it is indeed a delay, can you tell how much delay there is?
2022-03-02T17:30:14.908+0900 INFO [monitoring] log/log.go:124 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":881300,"time":{"ms":11}},"total":{"ticks":2062840,"time":{"ms":16},"value":2062840},"user":{"ticks":1181540,"time":{"ms":5}}},"info":{"ephemeral_id":"f1g8e7gg-g08g-42g1-a8ga-200866gg510g","uptime":{"ms":4851750021}},"memstats":{"gc_next":9482064,"memory_alloc":6413616,"memory_total":78147157472}},"filebeat":{"events":{"active":4,"added":4},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"active":2,"batches":2,"total":2}},"outputs":{"kafka":{"bytes_write":618}},"pipeline":{"clients":1,"events":{"active":2,"filtered":2,"published":2,"total":4}}},"registrar":{"states":{"current":1}},"system":{"load":{"1":0.19,"15":1.16,"5":0.5,"norm":{"1":0.0119,"15":0.0725,"5":0.0313}}}}}}

Optaplanner - local phase 0 step in total in 1 minute

I have set up a solver with two local search phases and it works fine. However, there was a time when the 2nd phase didn't make any moves for about 1 minute, as the log below shows:
...
2016-05-07 14:14:55,847 [main] DEBUG LS step (10069), time spent (593822), score (0hard/-81medium/5395020soft), best score (0hard/-80medium/5393781soft), accepted/selected move count (5/48), picked move (CL [cID=1147576, id=27246 => SL [id=49, sID=E942648]] <=> CL [cID=1133912, id=14716 => SL [ id=7, sID=E942592]]).
2016-05-07 14:14:55,858 [main] DEBUG LS step (10070), time spent (593833), score (0hard/-81medium/5395390soft), best score (0hard/-80medium/5393781soft), accepted/selected move count (5/18), picked move (CL [cID=1142322, id=22533 => SL [ id=51, sID=E943251]] <=> CL [cID=1134362, id=14118 => SL [ id=49, sID=E942648]]).
2016-05-07 14:14:55,858 [main] INFO Local Search phase (1) ended: step total (10071), time spent (593833), best score (0hard/-80medium/5393781soft).
2016-05-07 14:16:05,042 [main] INFO Local Search phase (2) ended: step total (0), time spent (663017), best score (0hard/-80medium/5393781soft).
2016-05-07 14:16:05,042 [main] INFO Solving ended: time spent (663017), best score (0hard/-80medium/5393781soft), average calculate count per second (2771).
Before phase 1 ended, there wasn't any improvement in the last couple of steps. Phase 2 then started but made 0 steps in a minute. The solver then ended because it had reached the maximum time allowed.
I'm a bit surprised that phase 2 made no steps at all. Is it simply because it didn't manage to find any better score?
If you don't see any moves even in the TRACE log (as suggested in the comments), it could be because you're using a custom move list factory that is taking too long to initialize.

Mono error when load testing

During load testing (using Load UI) of a new .NET Web API running on Mono, hosted on a medium-sized Amazon server, I'm receiving the following results (in chronological order over the course of about ten minutes):
5 connections per second for 60 seconds
No errors
50 connections per second for 60 seconds
No errors
100 connections per second for 60 seconds
Received 3 errors, appearing later during the run
2014-02-07 00:12:10Z Error HttpResponseExtensions Error occured while Processing Request: [IOException] Write failure Write failure|The socket has been shut down
2014-02-07 00:12:10Z Info HttpResponseExtensions Failed to write error to response: {0} Cannot be changed after headers are sent.
5 connections per second for 60 seconds
No errors
100 connections per second for 30 seconds
No errors
100 connections per second for 60 seconds
Received 1 error same as above, appearing later during the run
100 connections per second for 45 seconds
No errors
Doing some research on this, this error seems to be a standard one received when a client closes the connection. As it only occurs during the heavier load tests, I am wondering if it is just reaching the upper limits of what the server instance can support. If not, any suggestions on hunting down the source of the errors?