Safety of generating a thread dump on a production system

We have a production Java system which is using a lot more threads than usual. I would like to use kill -3 pid to get a thread dump, and if necessary get a binary heap dump using JConsole for offline analysis in Eclipse MAT.
I am running Java 1.5.0_10 on RHEL4.
How likely is it that either of these will kill the JVM? What about adverse effects on its performance while the dumps are produced?

It won't kill the JVM. The kill -3 thread dump is cheap: it only needs a brief safepoint pause and writes the stack traces to the JVM's stdout. Generating a heap dump, on the other hand, will freeze the JVM for the duration of the dump, since it has to capture a consistent snapshot. Once the dump has finished, all threads resume from where they were suspended. So neither is destructive, but the heap dump will briefly stop processing.
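As a side note, if you can modify or instrument the application, a very similar thread snapshot can be taken in-process through the java.lang.management API (dumpAllThreads needs Java 6+, so it postdates the 1.5 JVM in the question). A minimal sketch; the class name ThreadDump is mine:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDump {
    /** Render a thread dump similar to what kill -3 prints. */
    public static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false/false: skip monitor and synchronizer details, keeping the call cheap
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            sb.append(info.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump());
    }
}
```

Unlike the SIGQUIT dump, this writes wherever you choose rather than to the JVM's stdout. Note that ThreadInfo.toString() truncates each stack trace to a few frames; format the StackTraceElements yourself if you need full traces.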

To get an actual binary heap dump, use:
jmap -dump:format=b,file=heap.hprof <pid>
(Note that jmap -heap <pid> only prints a heap configuration and usage summary; it does not produce a dump file.)
For details see http://docs.oracle.com/javase/6/docs/technotes/tools/share/jmap.html
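On Java 7+ the same HPROF binary dump can also be triggered from inside the process via the HotSpot diagnostic MXBean. A hedged sketch (the class name HeapDump is mine; the bean is HotSpot-specific):

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;

import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDump {
    /**
     * Write an HPROF binary heap dump of this JVM to the given path.
     * The target file must not already exist.
     */
    public static void dumpHeap(String path, boolean liveObjectsOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Application threads are paused while the dump is written,
        // just as with jmap -dump.
        bean.dumpHeap(path, liveObjectsOnly);
    }

    public static void main(String[] args) throws IOException {
        dumpHeap("heap.hprof", true);
    }
}
```

Passing true for liveObjectsOnly forces a full GC first and dumps only reachable objects, which keeps the file smaller for MAT analysis.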

Related

Websphere - frequent thread/heap dump generation

Our application in the prod environment is generating frequent heap/thread dumps while running very large reports, eventually resulting in JVM failure. WebSphere is the server, and the heap size is set to 1024/2048 MB (initial/max) across all nodes.
What are some ways to tackle this issue? I could think about the following options. Is there anything else I am missing?
Set min/max heap size to 2048 or even higher?
Enable verbose garbage collection in WebSphere and analyze optimal heap size?
Thread analysis:
Runnable: 123 (67%)
Blocked: 16 (9%)
Waiting on condition: 43 (23%)
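A breakdown like the one above can be reproduced at any time from inside the JVM by bucketing threads by state. A minimal sketch using java.lang.management (the class name is mine; Map.merge assumes Java 8+):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateHistogram {
    /** Count live threads grouped by their current Thread.State. */
    public static Map<Thread.State, Integer> histogram() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // getThreadInfo may return null entries for threads that died mid-call
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(histogram());
    }
}
```

Logging this periodically makes it easy to spot a growing pool of BLOCKED or WAITING threads before the report run falls over.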
A good place to start investigating the OOM is this IBM KnowledgeCenter topic.
Since it seems you are experiencing an OutOfMemory issue, there are three possibilities to consider:
Your apps consistently need more memory to handle the current load.
Solution: Load test your application with production-like traffic and tune your min/max heap size accordingly.
You have a memory leak.
Solution: Analyze the heap dumps/core dumps produced, using the IBM Support Assistant tools. A PMR to IBM would help.
WebSphere itself has a memory leak.
Solution: Open a PMR.
Here is a nice read about Java Memory Management in WAS environments.
Try to capture memory and garbage collection information from the production environment. I am not sure whether verbose GC logging has any performance impact, but jstat is an extremely lightweight tool and can be used in a production environment without any noticeable overhead. Dump the output of jstat at regular intervals using the following command (here the interval is set to one hour):
jstat -gc <PID> 3600s
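If attaching jstat from outside is not possible, the same cumulative GC counters can be read in-process through GarbageCollectorMXBean; a minimal sketch (the class name GcStats is mine):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    /** Cumulative GC count and time in ms, summed over all collectors. */
    public static long[] countAndTimeMillis() {
        long count = 0, time = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined, so guard them
            if (gc.getCollectionCount() > 0) count += gc.getCollectionCount();
            if (gc.getCollectionTime() > 0) time += gc.getCollectionTime();
        }
        return new long[] { count, time };
    }

    public static void main(String[] args) {
        long[] s = countAndTimeMillis();
        System.out.println("GC count=" + s[0] + " time(ms)=" + s[1]);
    }
}
```

Sampling this from a scheduled task and logging the deltas gives roughly the same trend data as the periodic jstat dump above.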

Monitor worker crashes in apache storm

When running in a cluster, if something goes wrong, a worker generally dies (JVM shutdown). It can be caused by many factors, and most of the time it is a challenge (the biggest difficulty with Storm?) to find out what caused the crash.
Of course, the Storm supervisor restarts dead workers, and liveness is quite good within a Storm cluster, but a worker crash is still a mess we should avoid: it adds overhead, latency (it can take a long time until a worker is found dead and respawned), and data loss if you didn't design your topology to prevent it.
Is there an easy way/tool/methodology to check when, and possibly why, a Storm worker crashes? Workers are not shown in storm-ui (whereas supervisors are), and everything needs manual monitoring (with jstack + JVM opts, for instance) with a lot of care.
Here are some cases that can happen:
Timeouts, with many possible causes: slow Java garbage collection, a bad network, badly sized timeout configuration. The only output we get natively from the supervisor logs is "state: timeout" or "state: disallowed", which is not much. Also, when a worker dies, the statistics in storm-ui are reset. Because you are scared of timeouts, you end up using long ones, which does not seem like a good solution for real-time processing.
High back pressure with unexpected behaviour, for instance starving worker heartbeats and inducing a timeout. Acking seems to be the only way to deal with back pressure, and it needs careful crafting of bolts according to your load. Not acking seems to be a no-go, as it would indeed crash workers and give worse results in the end (even less data processed than an acking topology under pressure?).
Code runtime exceptions, sometimes not shown in storm-ui, which need manual checking of application logs (the easiest case).
Memory leaks, which can be found with JVM dumps.
The Storm supervisor logs restarts caused by timeouts.
You can monitor the supervisor log, and you can also monitor the performance of your bolts' execute(tuple) method.
As for memory leaks: since the Storm supervisor kills the worker with kill -9, a heap dump taken at that moment is likely to be corrupted, so I would either use tools that monitor your heap dynamically, or stop the supervisor first and then produce heap dumps via jmap. Also, try monitoring the GC logs.
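One way to "monitor your heap dynamically" from inside a worker is to arm a usage threshold on a heap memory pool via java.lang.management and log (or dump) when it trips. A hedged sketch; the class name HeapWatch and the 0.9 fraction are my choices, and not every collector exposes a pool that supports thresholds:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class HeapWatch {
    /**
     * Arm a usage threshold on the first heap pool that supports one.
     * Returns the armed pool, or null if no suitable pool exists.
     */
    public static MemoryPoolMXBean armThreshold(double fraction) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported()) {
                long max = pool.getUsage().getMax();
                if (max > 0) { // max is -1 when undefined
                    pool.setUsageThreshold((long) (max * fraction));
                    return pool;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        MemoryPoolMXBean pool = armThreshold(0.9);
        System.out.println(pool == null ? "no suitable pool" : pool.getName());
    }
}
```

Once armed, poll pool.isUsageThresholdExceeded() (or register a JMX notification listener) and log before the supervisor gets a chance to kill -9 the worker.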
I still recommend increasing the default timeouts.

Pentaho text file input step crashing (out of memory)

I am using Pentaho to read a very large file: 11 GB.
The process sometimes crashes with an out-of-memory exception, and sometimes it will just say "process killed".
I am running the job on a machine with 12 GB of RAM and giving the process 8 GB.
Is there a way to run the Text File Input step with some configuration to use less memory? Maybe use the disk more?
Thanks!
Open up spoon.sh/.bat or pan/kettle .sh or .bat and change the -Xmx figure; search for JAVAMAXMEM. Even though you have spare memory, unless Java is allowed to use it, it won't help. Although, to be fair, in your example above I can't really see why/how it would be consuming much memory anyway!

Can I use MRJob to process big files in local mode?

I have a relatively big file to process: around 10 GB. I suspect it won't fit into my laptop's RAM if MRJob decides to sort it in RAM or something similar.
At the same time, I don't want to set up Hadoop or EMR: the job is not urgent, and I can simply start the worker before going to sleep and get the results the next morning. In other words, I'm quite happy with local mode. I know the performance won't be perfect, but it's OK for now.
So can it process such 'big' files on a single weak machine? If yes, what would you recommend doing (besides setting a custom tmp dir to point to the filesystem, not to the ramdisk, which would be exhausted quickly)? Let's assume we use version 0.4.1.
I don't think RAM size will be an issue with the Python runner of mrjob. The output of each step should be written to a temporary file on disk, so it should not fill up the RAM, I believe. Spilling output to disk is how it is meant to work with Hadoop (and the reason it is slow: IO). So I would just run the job and see how it goes.
If RAM size does become an issue, you can create enough swap space on your laptop to at least make it run, though it will be slow if the swap partition isn't on an SSD.

Storing entire process state on disk and restoring it later? (On Linux/Unix)

I would like to know: Is there a system call, library, kernel module or command line tool I can use to store the complete state of a running program on the disk?
That is: I would like to completely dump the memory, page layout, stack, registers, threads and file descriptors a process is currently using to a file on the hard drive and be able to restore it later seamlessly, just like an emulator "savestate" or a Virtual Machine "snapshot".
I would also like, if possible, to have multiple "backup copies" of the program state, so I can revert to a previous execution point if the program dies for some reason.
Is this possible?
You should take a look at the BLCR project from Berkeley Lab. It is widely used by several MPI implementations to provide checkpoint/restart capabilities for parallel applications.
A core dump is basically this (you can take one of a running process with gcore without killing it), so yes, the state can be captured.
What you really want is a way to restore that dump as a running program. That is the harder part.