Can a BLOCKED Thread cause high CPU Consumption - jvm

We saw a high CPU consumption issue in our production environment recently, and saw something strange while debugging it. When I did a "top -H" to see the per-thread CPU stats, I found a thread X consuming high CPU. When I took thread dumps, I saw that this thread X was in BLOCKED state. What does this mean? Can a thread in BLOCKED state consume high CPU? This might be a trivial question, but I am a novice at debugging performance issues and the JVM, and I am not sure what I might be missing here.

Entering and exiting a BLOCKED state can be expensive. Blocking once in a while is not a problem, but if a thread blocks briefly over and over in a busy loop, it can appear BLOCKED in every thread dump while in reality it is burning CPU.
I would look for multiple threads repeatedly competing for a shared resource, each entering BLOCKED very briefly.
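A minimal, hypothetical Java sketch of that pattern (the class and thread names are made up): several threads spin in a tight loop and grab a shared lock for a very short time on each pass, so a thread dump will often catch them in BLOCKED state even though "top -H" shows them burning CPU.

    // Hypothetical reproduction of the pattern described above: threads spin in a
    // tight loop and briefly contend on a shared lock each iteration. Thread dumps
    // frequently catch them BLOCKED, while "top -H" shows high per-thread CPU.
    public class BusyBlockedDemo {
        private static final Object LOCK = new Object();
        private static long counter = 0;

        public static void main(String[] args) {
            for (int i = 0; i < 4; i++) {
                Thread t = new Thread(() -> {
                    while (true) {                 // busy loop: never sleeps or waits
                        synchronized (LOCK) {      // lock is held only very briefly
                            counter++;
                        }
                    }
                }, "spinner-" + i);
                t.start();
            }
        }
    }

Run it for a while, take a few thread dumps, and compare them with "top -H": the spinner threads alternate between RUNNABLE and BLOCKED in the dumps while steadily consuming CPU.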

Peter has already made a good point about busy loops (which could be the JVM's internal adaptive spin-lock optimization for synchronization, or a busy loop created by the application itself while polling some condition) burning CPU. There is another, indirect way in which CPU can go very high because of blocked threads. Typically in a web server, if lots of threads are in a blocked state (not because of a synchronization lock, but, say, waiting for IO from a back-end datastore), it may put a lot of pressure on JVM garbage collection. These worker threads are supposed to finish their work quickly so that the objects they create on the heap are quickly de-referenced and garbage collected. If lots of threads are stuck in this state, the garbage collection threads have to work overtime and may end up taking a lot of CPU.
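A hedged illustration of that second effect (purely synthetic, the pool size and allocation sizes are made-up numbers): each worker allocates some request-scoped data and then blocks "waiting on a back-end" (simulated here with a sleep), so the garbage stays reachable far longer than intended and every GC cycle has more live data to trace.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Synthetic sketch: many worker threads allocate request-scoped data and then
    // block "waiting for a datastore" (simulated with sleep). While they are
    // blocked, their allocations stay reachable, so the collector has more live
    // data to trace on every cycle and GC threads consume noticeably more CPU.
    public class BlockedWorkersGcPressure {
        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(200);
            for (int i = 0; i < 10_000; i++) {
                pool.submit(() -> {
                    List<byte[]> requestScoped = new ArrayList<>();
                    for (int j = 0; j < 100; j++) {
                        requestScoped.add(new byte[16 * 1024]);   // per-request garbage
                    }
                    try {
                        Thread.sleep(2_000);       // stand-in for a slow back-end call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    return requestScoped.size();   // data becomes garbage only now
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);
        }
    }

Watching this with GC logging enabled should show longer collections (and more GC CPU) than the same allocation pattern without the blocking sleep.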

Related

Can a Deadlock occur with CPU as a resource?

I am in my fourth year of Software Engineering and we are covering the topic of deadlocks.
The generalization goes that a deadlock occurs when two processes A and B each hold one of two resources X and Y and wait for the other process to release its resource before releasing their own.
My question is: given that the CPU is a resource in itself, is there a scenario where a deadlock could involve the CPU as a resource?
My first thought is that you would need a system where a process cannot be preempted from the CPU by timer interrupts (it could just be an FCFS algorithm). You would also need no waiting queues for resources, because getting into a queue would release the resource. But then I also ask: can there be deadlocks when there are queues?
A CPU scheduler can be implemented in any way you like; you could build one that uses an FCFS algorithm and lets processes decide when to relinquish control of the CPU. But such implementations are neither practical nor reliable, since the CPU is the single most important resource an operating system has, and allowing a process to take control of it in such a way that it may never be preempted effectively makes that process the owner of the system, which contradicts the basic idea that the operating system should always be in control.
As far as contemporary operating systems (Linux, Windows, etc.) are concerned, this will never happen, because they don't allow such situations.

How to reduce time taken on threads reaching Safepoint - Sync state

About the Issue:
During heavy IO on the VM, we faced JVM pauses/slowness because stopping threads took too long. Looking at the safepoint logs showed that the Sync state takes most of the time.
We also tried printing safepoint traces with a timeout delay (-XX:+SafepointTimeout -XX:SafepointTimeoutDelay=200) to find out which thread is causing the issue, but nothing looked suspicious. Also, with the safepoint timeout set, we do not get the timeout-detected print when the time is spent in the 'Sync' state.
Questions about this safepoint tracing:
How does the safepoint timeout work?
After logging the thread details, does the safepoint end and do all threads resume?
Will that VM operation still be carried out? What happens if the vmop is a GC?
Using Async-profiler:
We tried time-to-safepoint profiling using async-profiler and noticed that the VM Thread spends most of its time in the SafepointSynchronize::begin() method, and that the C2 compiler threads take almost as much time as the VM Thread.
We suspect the C2 compiler threads may be taking a long time to reach the safepoint. Can someone help us resolve this issue and interpret this time-to-safepoint flame graph? Thanks in advance.
The SafepointTimeout option affects nothing but logging, i.e. threads will not be interrupted, the VM operation will run normally, etc.
SafepointTimeout does not always print timed out threads: a thread may already have reached the safepoint by the time printing occurs. Furthermore, SafepointTimeout may not even detect a timeout, if the entire process has been frozen by the Operating System.
For example, such 'freezes' may happen:
when a process has exhausted its cpu quota in a cgroup (container);
when a system is low on physical memory, and direct reclaim occurs;
due to activity of another process (e.g. I observed long JVM pauses when the atop utility inspected the system).
async-profiler indeed has a time-to-safepoint profiling option (--ttsp), though using it correctly may be tricky. It works best in wall-clock profiling mode with JFR output. In this configuration, async-profiler samples all threads (both running and blocked) during safepoint synchronization and records each individual event with a timestamp.
Such a profile can then be analyzed with JDK Mission Control: choose the time interval around the long pause and look at the stack traces of Java threads in that interval.
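One pattern those stack traces can reveal (a generic, hedged example, not necessarily what is happening in your case) is a hot counted loop: depending on JDK version and flags such as -XX:+UseCountedLoopSafepoints, the JIT may emit no safepoint poll inside an int-counted loop, so a thread stuck in it cannot stop until the loop finishes and the whole Sync phase waits for it.

    // Generic sketch of a time-to-safepoint hazard: an int-counted loop that the
    // JIT may compile without safepoint polls (JDK-version dependent), so a thread
    // spinning inside it delays every safepoint's Sync phase until the loop ends.
    public class CountedLoopTtsp {
        public static void main(String[] args) {
            Thread hot = new Thread(() -> {
                long sum = 0;
                while (true) {
                    for (int i = 0; i < Integer.MAX_VALUE; i++) {  // counted loop
                        sum += i;
                    }
                    if (sum == -1) break;   // keep the JIT from dropping the loop
                }
            }, "hot-loop");
            hot.setDaemon(true);
            hot.start();

            // Each System.gc() needs a safepoint; with -Xlog:safepoint you can see
            // how long the Sync phase takes while "hot-loop" is spinning.
            for (int i = 0; i < 5; i++) {
                System.gc();
            }
        }
    }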
Note that if the JVM process is 'frozen', the async-profiler thread does not work either, i.e. you will not see collected samples during this period. Normally, in wall-clock profiling mode, all threads are sampled evenly. But if you see a 'gap' (missed events during some time interval), it apparently means the JVM process did not receive CPU time. In this case, the reason for the JVM pauses is not the Java application, but rather the operating system / environment.

Monitor worker crashes in apache storm

When running in a cluster, if something goes wrong, a worker generally dies (JVM shutdown). It can be caused by many factors, and most of the time it is a challenge (the biggest difficulty with Storm?) to find out what caused the crash.
Of course, storm-supervisor restarts dead workers and liveness is quite good within a Storm cluster; still, a worker crash is a mess we should avoid, as it adds overhead, latency (it can take very long until a worker is found dead and respawned) and data loss if you didn't design your topology to prevent that.
Is there an easy way / tool / methodology to check when and possibly why a storm worker crashes? They are not shown in storm-ui (whereas supervisors are shown), and everything needs manual monitoring (with jstack + JVM opts for instance) with a lot of care.
Here are some cases that can happen:
timeouts, with many possible reasons: slow Java garbage collection, bad network, bad sizing of the timeout configuration. The only output we get natively from the supervisor logs is "state: timeout" or "state: disallowed", which is poor. Also, when a worker dies, the statistics in storm-ui are reset. As you get scared of timeouts, you end up using long ones, which does not seem to be a good solution for real-time processing.
high back pressure with unexpected behaviour, for instance starving worker heartbeats and inducing a timeout. Acking seems to be the only way to deal with back pressure and needs careful crafting of bolts according to your load. Not acking seems to be a no-go, as it would indeed crash workers and produce bad results in the end (even less data processed than an acking topology under pressure?).
code runtime exceptions, sometimes not shown in storm-ui that need manual checking of application logs (the easiest case).
memory leaks that can be found out with JVM dumps.
The Storm supervisor logs restarts caused by timeouts.
You can monitor the supervisor log; you can also monitor your bolt's execute(tuple) method's performance.
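A hedged sketch of that second point (the class name and the 500 ms threshold are made up; it assumes Storm 2.x's org.apache.storm API): time each execute(tuple) call so slow tuples show up in the worker log before a heartbeat timeout kills the worker, and report exceptions so they appear in storm-ui.

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    // Sketch of a bolt that measures its own execute(tuple) latency and logs
    // slow tuples, so stalls are visible in the worker log before timeouts hit.
    public class TimedBolt extends BaseRichBolt {
        private static final long SLOW_MILLIS = 500;   // hypothetical threshold
        private OutputCollector collector;

        @Override
        public void prepare(Map<String, Object> conf, TopologyContext ctx, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            long start = System.nanoTime();
            try {
                // ... real processing of 'tuple' goes here ...
                collector.ack(tuple);
            } catch (RuntimeException e) {
                collector.reportError(e);   // surfaces the exception in storm-ui
                collector.fail(tuple);
            } finally {
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                if (elapsedMs > SLOW_MILLIS) {
                    System.err.println("slow tuple: " + elapsedMs + " ms");   // worker log
                }
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // no output streams in this sketch
        }
    }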
As for memory leaks: since the Storm supervisor kills the worker with kill -9, the heap dump is likely to be corrupted, so I would use tools that monitor your heap dynamically, or kill the supervisor first so you can produce heap dumps of the worker via jmap. Also, try monitoring the GC logs.
I still recommend increasing the default timeouts.

Optimum thread count NServiceBus

We're trying to figure out the optimum number of threads to use for our NServiceBus service. We're running it on a machine with 2 quad cores. We've been having problems with the queue backing up. We started with 100 threads, then bumped it to 200 and things got worse. We backed it down to 75, then 50, and it seemed even better. Is there some optimal number based on how many CPUs we have, or some rule of thumb we should use to determine the number of threads to run?
Every thread you have running has an overhead attached to it. If you have 2 quad cores then you will be able to have exactly 8 threads running at any one time. Each thread will be consuming a core.
If you have more than 8 threads then there is a chance you will start to do LESS useful work, not more. This is because every time Windows decides to give one of the threads not currently consuming a core a turn at doing something, it needs to store the state of one of the running threads and then restore the old state of the thread that is about to run, and only then let it go at it. If you have a huge number of threads you're going to spend a large amount of time just switching between them and doing nothing useful.
If you have a bunch of threads that are blocked waiting for IO (for instance a message to finish writing to disk so it can be got at) then you might be able to run more threads than you have cores and still get something useful done as a number of those threads will be sitting waiting for something else to complete. It's a complex subject and there is no real answer to 'how many threads should I use'. A good rule of thumb is have a thread for every core and then try and play with it a bit if you want to achieve more throughput. Testing it under real conditions is the only real way to find the sweet spot. You might find that you only need one thread to process the messages and half the time that thread is blocked waiting for a message to come in....
Obviously, even what I've described is oversimplified. Windows needs access to the cores to do OS things, so even if you have 8 cores your 8 threads won't always be running, because the Windows threads are taking a turn... then you have IO threads, etc.
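The rule of thumb translates into code roughly like this Java sketch (purely illustrative, not NServiceBus configuration; the names and the 75% "blocked fraction" are assumptions): size the pool to the core count for CPU-bound work, and scale it up by how much of each task is spent blocked for IO-bound work.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Illustrative only (not NServiceBus): the thread-per-core rule of thumb,
    // scaled up when tasks spend most of their time blocked on IO.
    public class PoolSizing {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();   // e.g. 8 on 2 quad cores

            // CPU-bound work: more threads than cores mostly adds context switching.
            ExecutorService cpuBound = Executors.newFixedThreadPool(cores);

            // IO-bound work: if a task is blocked ~75% of the time, roughly
            // cores / (1 - blockedFraction) threads keep the cores busy.
            double blockedFraction = 0.75;                             // assumption; measure it
            int ioThreads = (int) (cores / (1 - blockedFraction));     // 8 -> 32
            ExecutorService ioBound = Executors.newFixedThreadPool(ioThreads);

            // Start near these numbers, then measure throughput under real load;
            // as the answer says, testing is the only way to find the sweet spot.
            cpuBound.shutdown();
            ioBound.shutdown();
        }
    }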

Is it safe to access the hard drive via many different GCD queues?

Is it safe? For instance, if I create a bunch of different GCD queues that each compress (tar cvzf) some files, am I doing something wrong? Will the hard drive be destroyed?
Or does the system properly take care of such things?
Dietrich's answer is correct save for one detail (that is completely non-obvious).
If you were to spin off, say, 100 asynchronous tar executions via GCD, you'd quickly find that you have 100 threads running in your application (which would also be dead slow due to gross abuse of the I/O subsystem).
In a fully asynchronous concurrent system with queues, there is no way to know if a particular unit of work is blocked because it is waiting for a system resource or waiting for some other enqueued unit of work. Therefore, anytime anything blocks, you pretty much have to spin up another thread and consume another unit of work or risk locking up the application.
In such a case, the "obvious" solution is to wait a bit when a unit of work blocks before spinning up another thread to de-queue and process another unit of work with the hope that the first unit of work "unblocks" and continues processing.
Doing so, though, would mean that any asynchronous concurrent system with interaction between units of work -- a common case -- would be so slow as to be useless.
Far more effective is to limit the number of units of work that are enqueued in the global asynchronous queues at any one time. A GCD semaphore makes this quite easy: you have a single serial queue into which all units of work are enqueued. Every time you dequeue a unit of work, you increment the count; every time a unit of work completes, you decrement it. As long as the count is below some maximum value (say, 4), you enqueue a new unit of work.
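Expressed in Java to stay consistent with the other sketches in this write-up (in GCD the equivalent is dispatch_semaphore_wait before dispatch_async and dispatch_semaphore_signal when the block finishes), the bounding pattern looks roughly like this; the limit of 4 is just the example value from the answer, and compressSomething is a hypothetical stand-in for the real work.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    // Same idea as the GCD pattern above, sketched in Java: a semaphore holds the
    // maximum number of in-flight units of work; acquire a permit before submitting,
    // release it when a unit finishes, so no more than 4 run concurrently no matter
    // how many are enqueued.
    public class BoundedSubmitter {
        public static void main(String[] args) throws InterruptedException {
            final Semaphore inFlight = new Semaphore(4);          // max concurrent units
            ExecutorService workers = Executors.newCachedThreadPool();

            for (int i = 0; i < 100; i++) {
                final int id = i;
                inFlight.acquire();                                // blocks at the limit
                workers.submit(() -> {
                    try {
                        compressSomething(id);                     // hypothetical work item
                    } finally {
                        inFlight.release();                        // frees a slot
                    }
                });
            }
            workers.shutdown();
        }

        private static void compressSomething(int id) {
            // placeholder for the actual IO-heavy work (e.g. tar/compress of files)
        }
    }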
If you take something that is normally IO limited, such as tar, and run a bunch of copies in GCD:
It will run more slowly, because you are throwing more CPU at an IO-bound task, meaning the IO will be more scattered and there will be more of it at the same time;
No more than N tasks will run at a time, which is the point of GCD, so "a billion queue entries" and "ten queue entries" give you the same thing if you have fewer than 10 threads;
Your hard drive will be fine.
Even though this question was asked back in May, it's still worth noting that GCD has now provided I/O primitives with the release of 10.7 (OS X Lion). See the man pages for dispatch_read and dispatch_io_create for examples on how to do efficient I/O with the new APIs. They are smart enough to properly schedule I/O against a single disk (or multiple disks) with knowledge of how much concurrency is, or is not, possible in the actual I/O requests.