Monitor worker crashes in Apache Storm (JVM)

When running in a cluster, a worker generally dies (JVM shutdown) when something goes wrong. The crash can have many causes, and most of the time it is a challenge (the biggest difficulty with Storm?) to find out what caused it.
Of course, the Storm supervisor restarts dead workers, and liveness is quite good within a Storm cluster. Still, a worker crash is a mess we should avoid: it adds overhead, latency (it can take a long time until a worker is found dead and respawned), and data loss if you didn't design your topology to prevent it.
Is there an easy way / tool / methodology to check when, and possibly why, a Storm worker crashes? Crashes are not shown in Storm UI (whereas supervisors are), so everything needs careful manual monitoring (with jstack plus JVM options, for instance).
Here are some cases that can happen:
Timeouts, with many possible causes: slow Java garbage collection, bad network, badly sized timeout configuration. The only output we get natively from the supervisor logs is "state: timeout" or "state: disallowed", which is poor. Also, when a worker dies, the statistics in Storm UI are reset. Because you are scared of timeouts, you end up using long ones, which does not seem to be a good solution for real-time processing.
High back pressure with unexpected behaviour, for instance starving worker heartbeats and inducing a timeout. Acking seems to be the only way to deal with back pressure and requires careful crafting of bolts according to your load. Not acking seems to be a no-go, as it would indeed crash workers and give worse results in the end (even less data processed than an acking topology under pressure?).
Code runtime exceptions, sometimes not shown in Storm UI, which require manual checking of the application logs (the easiest case).
Memory leaks, which can be tracked down with JVM heap dumps.

The Storm supervisor log records restarts caused by timeouts.
You can monitor the supervisor log, and you can also monitor the performance of your bolt's execute(tuple) method.
As for memory leaks: since the Storm supervisor kills the worker with kill -9, a heap dump taken at that point is likely to be corrupted, so I would either use tools that monitor the heap dynamically or stop the supervisor and produce heap dumps via jmap. Also, try monitoring the GC logs.
I still recommend increasing the default timeouts.
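To make the "monitor your bolt's execute(tuple) performance" advice concrete, here is a minimal sketch (assuming the Storm 2.x API; the 500 ms threshold and class name are just examples, not anything from the original post) of a bolt that times its own execute() calls and logs slow ones, so GC stalls or back pressure become visible in the worker logs before heartbeats start timing out:

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    // Sketch of a bolt that reports slow execute() calls.
    public class TimedBolt extends BaseRichBolt {
        private static final long SLOW_MS = 500; // arbitrary example threshold
        private OutputCollector collector;

        @Override
        public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            long start = System.nanoTime();
            try {
                // ... actual tuple processing goes here ...
                collector.ack(tuple);
            } finally {
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                if (elapsedMs > SLOW_MS) {
                    System.err.println("slow execute(): " + elapsedMs + " ms for tuple from " + tuple.getSourceComponent());
                }
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // this sketch emits nothing
        }
    }

GC logging itself is better handled outside the bolt, e.g. by adding the usual GC log flags to the worker JVM options (worker.childopts) so each worker writes its own GC log.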

Related

How to reduce time taken on threads reaching Safepoint - Sync state

About the Issue:
During heavy I/O in the VM, we faced JVM pauses/slowness because stopping threads took too long. Looking at the safepoint logs showed that the Sync state takes the most time.
We also tried printing safepoint traces on a timeout delay (-XX:+SafepointTimeout -XX:SafepointTimeoutDelay=200) to find out which threads are causing the issue, but nothing looked suspicious. Also, when setting a timeout for safepoints, we do not get the timeout-detected message when the time is spent in the 'Sync' state.
Questions about this safepoint tracing:
How does the safepoint timeout work?
After logging the thread details, does the safepoint end and do all threads resume?
Will that VM operation still be carried out? What happens if the VM operation is a GC?
Using Async-profiler:
We tried time-to-safepoint profiling using async-profiler and noticed that the VM Thread spends most of its time in the SafepointSynchronize::begin() method, and that the C2 compiler threads take almost as much time as the VM Thread.
We suspect that the C2 compiler threads may be taking a long time to reach the safepoint. Can someone help us resolve this issue and interpret this time-to-safepoint flame graph? Thanks in advance.
The SafepointTimeout option affects nothing but logging: threads will not be interrupted, the VM operation will run normally, and so on.
SafepointTimeout does not always print timed out threads: a thread may already have reached the safepoint by the time printing occurs. Furthermore, SafepointTimeout may not even detect a timeout, if the entire process has been frozen by the Operating System.
For example, such 'freezes' may happen
when a process has exhausted its cpu quota in a cgroup (container);
when a system is low on physical memory, and direct reclaim occurs;
due to the activity of another process (e.g. I observed long JVM pauses while the atop utility inspected the system).
async-profiler indeed has a time-to-safepoint profiling option (--ttsp), though using it correctly can be tricky. It works best in wall-clock profiling mode with JFR output. In this configuration, async-profiler samples all threads (both running and blocked) during safepoint synchronization and records each individual event with a timestamp.
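A collection run along these lines is a reasonable starting point (the launcher path, duration, output file name, and PID are placeholders; the exact flags depend on your async-profiler version):

    ./profiler.sh -e wall --ttsp -d 60 -f ttsp.jfr <pid>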
Such a profile can then be analyzed with JDK Mission Control: choose the time interval around the long pause and look at the stack traces of Java threads in that interval.
Note that if the JVM process is 'frozen', the async-profiler thread does not work either, i.e. you will not see collected samples during that period. Normally, in wall-clock profiling mode, all threads are sampled evenly. But if you see a 'gap' (missed events during some time interval), it most likely means the JVM process did not receive CPU time. In that case, the reason for the JVM pauses is not in the Java application, but rather in the operating system / environment.

Can a BLOCKED Thread cause high CPU Consumption

We saw a high CPU consumption issue in our production environment recently and noticed something strange while debugging it. When I ran "top -H" to see the CPU stats per thread ID, I found a thread X consuming high CPU. When I took thread dumps, I saw that this thread X was in the BLOCKED state. What does this mean? Can a thread that is in the BLOCKED state consume high CPU? This might be a trivial question, but I am a novice at debugging performance issues and the JVM, and I am not sure what I might be missing here.
Entering and exiting the BLOCKED state can be expensive. If you are BLOCKED for even a little while, that is not a problem by itself, but if you block briefly inside a busy loop, your thread can appear blocked while in reality it is burning CPU.
I would look for multiple threads repeatedly competing for a shared resource and entering the BLOCKED state very briefly.
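A tiny, hypothetical reproduction of that pattern: several threads hammer the same lock with no pause between attempts. Thread dumps will frequently catch them as BLOCKED, yet top -H shows each of them burning a core.

    public class BriefBlockingDemo {
        private static final Object LOCK = new Object();
        private static long counter = 0;

        public static void main(String[] args) {
            for (int i = 0; i < 4; i++) {
                new Thread(() -> {
                    while (true) {
                        synchronized (LOCK) {   // lock held only for an instant
                            counter++;
                        }
                        // no sleep or park here: the thread immediately contends
                        // again, so it is often "BLOCKED" but still consumes CPU
                    }
                }, "contender-" + i).start();
            }
        }
    }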
@Peter has already made a good point about busy loops (which could be the JVM's internal adaptive spin-lock optimization during synchronization, or a busy loop created by the application itself around some condition) burning CPU. There is another, indirect way in which CPU can go very high because of blocked threads. Typically, in a web server, if lots of threads are blocked (not on a synchronization lock, but, say, waiting for I/O from a back-end datastore), it can put a lot of pressure on JVM garbage collection. These worker threads are supposed to finish their work quickly so that the objects they create on the heap are quickly de-referenced and garbage collected. If lots of threads are stuck in this state, the garbage collection threads have to work overtime and may end up consuming a lot of CPU.

RabbitMQ - strange synchronization behavior

I have a simple RabbitMQ cluster with 2 identical physical Linux nodes (CentOS, RabbitMQ 3.1.5, Erlang R15B, 2 GB RAM, 1 CPU core). Mirroring and synchronization of nodes are turned on.
I have two problems that bother me:
In a normal situation everything is fine, but after restarting one of the nodes (via stop_app and start_app on the command line), the whole cluster becomes unavailable to producers and consumers: I can't publish or consume messages from a queue during synchronization. Is this situation normal?
During synchronization I observed very high CPU load (almost 100%) on the slave node (the one that was restarted). I also measured the speed of synchronization, and it is dramatically low: synchronizing 2 million messages takes over 3 hours. That is strange, because producing that many messages takes much less time. Is this situation normal too?
I've recently been tasked with looking into RabbitMQ at work and so have been deep in the documentation.
When synchronising, this is indeed the case. This is an extract from the RabbitMQ HA documentation here:
If a queue is set to automatically synchronise it will synchronise whenever a new slave joins - becoming unresponsive until it has done so.
If the messages are being read from and persisted to disk (either by choice or because of memory limitations), there may be overhead there as well. You can see a chart in this blog entry (the last chart before the comments) which indicates that performance changes when reading from and writing to queues holding that many messages. Those charts are for older versions of RabbitMQ, but I've not seen anything more recent.
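For what it's worth, whether that blocking synchronisation happens automatically is controlled by the mirroring policy's ha-sync-mode (available since RabbitMQ 3.1). A policy along these lines (the policy name and pattern are just examples) keeps synchronisation manual, so a newly joined slave stays unsynchronised until you explicitly run rabbitmqctl sync_queue at a time of your choosing (which also blocks the queue while it runs):

    rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all","ha-sync-mode":"manual"}'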
Hope this helps!

Redis - Default blocking VM

The blocking VM performance is better overall, as there is no time lost in synchronization, spawning of threads, and resuming blocked clients waiting for values. So if you are willing to accept an higher latency from time to time, blocking VM can be a good pick. Especially if swapping happens rarely and most of your often accessed data happens to fit in your memory.
This is the default mode of Redis (and the only mode going forward, I believe, now that VM is deprecated in 2.6), leaving the OS to handle paging if/when required. Am I correct in my understanding that it will take some time to get "hot" after being booted/started? When working on a 1 GB RAM node with a 16 GB dataset, does Redis attempt to load it all into virtual memory at boot, so that 90%+ is immediately paged out, and only after a good amount of usage does the above statement hold true?
Redis VM was already deprecated in Redis 2.4, and has been removed in Redis 2.6. It is a dead end: don't use it.
I think you are confusing the blocking VM with OS paging. They are two different things.
OS paging is the default mode of Redis when Redis VM is not configured at all (whatever the blocking mode). The OS will swap Redis memory if it does not fit in physical memory. The event loop can be frozen at any time. When it happens, performance is abysmal because none of the Redis internal data structures is designed for this (no locality, no paging system).
Redis VM can be configured in non blocking mode (using I/O threads). When I/Os are done, the event loop is not blocked, and Redis is still responsive. However, when too many I/Os pile up, the I/O threads will be completely busy, and you end up with a responsive Redis, but unable to process any queries requiring I/Os.
Redis VM can also be configured in blocking mode. In this mode all I/Os are synchronously performed in the main event loop thread. So the event loop is frozen in case of I/O (for instance in case of a key miss). All clients are impacted. However, general performance (CPU consumption and latency) is better than with the non blocking mode because some threading scheduling/synchronization is saved.
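For completeness, this is roughly how the (now removed) VM was configured in the old redis.conf of the 2.0/2.2 era; vm-max-threads is what switched between the blocking and threaded modes. The paths and sizes below are purely illustrative:

    # historical redis.conf directives (Redis 2.0/2.2 era, removed in 2.6)
    vm-enabled yes
    vm-swap-file /var/lib/redis/redis.swap
    # approximate amount of RAM (in bytes) to keep for values before swapping
    vm-max-memory 1073741824
    vm-page-size 32
    vm-pages 134217728
    # 0 = blocking VM (I/O in the event loop); > 0 = that many I/O threads
    vm-max-threads 0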
In practice, the difference between OS paging and the Redis blocking VM is the granularity level. With Redis VM, the granularity is the key. With OS paging, it is the page (a 4 KB block which can span several unrelated keys).
In all 3 cases, the initial load of the dump file will be extremely slow and generate a peak of random I/Os on your system. As you pointed out, most objects will be loaded and then swapped out. The warm-up time will be significant.
Unless you have extreme locality in your data, or you do not care at all about latency, using 1 GB of RAM for a 16 GB dataset with the Redis VM is science fiction IMO.
There is a reason why the Redis VM was phased out. By design, it will never perform as well as a disk-based datastore (which can exploit file mapping or direct I/Os to avoid the double buffering, and use adapted data structures like B-trees).
Redis as an in-memory store is excellent. But if you need to store something which is bigger than RAM, don't use it. Other (disk-based) stores will all perform much better.

Is it safe to access the hard drive via many different GCD queues?

Is it safe? For instance, if I create a bunch of different GCD queues that each compress (tar cvzf) some files, am I doing something wrong? Will the hard drive be destroyed?
Or does the system properly take care of such things?
Dietrich's answer is correct save for one detail (that is completely non-obvious).
If you were to spin off, say, 100 asynchronous tar executions via GCD, you'd quickly find that you have 100 threads running in your application (which would also be dead slow due to gross abuse of the I/O subsystem).
In a fully asynchronous concurrent system with queues, there is no way to know if a particular unit of work is blocked because it is waiting for a system resource or waiting for some other enqueued unit of work. Therefore, anytime anything blocks, you pretty much have to spin up another thread and consume another unit of work or risk locking up the application.
In such a case, the "obvious" solution is to wait a bit when a unit of work blocks before spinning up another thread to de-queue and process another unit of work with the hope that the first unit of work "unblocks" and continues processing.
Doing so, though, would mean that any asynchronous concurrent system with interaction between units of work -- a common case -- would be so slow as to be useless.
Far more effective is to limit the number of units of work that are enqueued in the global asynchronous queues at any one time. A GCD semaphore makes this quite easy: you have a single serial queue into which all units of work are enqueued. Every time you dequeue a unit of work, you increment the semaphore. Every time a unit of work is completed, you decrement the semaphore. As long as the semaphore is below some maximum value (say, 4), you enqueue a new unit of work.
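A sketch of that throttle in C with libdispatch (note the counting runs the other way round from the prose: dispatch_semaphore_t starts at the cap and each in-flight unit of work holds one slot; the cap of 4 and the global queue choice are just examples):

    #include <dispatch/dispatch.h>

    int main(void) {
        // at most 4 compression jobs in flight at any one time
        dispatch_semaphore_t slots = dispatch_semaphore_create(4);
        dispatch_queue_t work =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        for (int i = 0; i < 100; i++) {
            // block the feeding loop until a slot frees up
            dispatch_semaphore_wait(slots, DISPATCH_TIME_FOREVER);
            dispatch_async(work, ^{
                // ... compress one archive here (e.g. run tar) ...
                dispatch_semaphore_signal(slots); // release the slot when done
            });
        }
        dispatch_main(); // keep the process alive for the async work
        return 0;
    }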
If you take something that is normally IO limited, such as tar, and run a bunch of copies in GCD,
It will run more slowly because you are throwing more CPU at an IO-bound task, meaning the IO will be more scattered and there will be more of it at the same time,
No more than N tasks will run at a time, which is the point of GCD, so "a billion queue entries" and "ten queue entries" give you the same thing if you have less than 10 threads,
Your hard drive will be fine.
Even though this question was asked back in May, it's still worth noting that GCD now provides I/O primitives as of 10.7 (OS X Lion). See the man pages for dispatch_read and dispatch_io_create for examples of how to do efficient I/O with the new APIs. They are smart enough to properly schedule I/O against a single disk (or multiple disks), with knowledge of how much concurrency is, or is not, possible in the actual I/O requests.