I'm trying to keep track of how long an application runs. To do this, I capture each process's ID as it starts and log the time when that process shuts down. However, Google Chrome starts and stops around six processes whenever it is launched and closed, so each run of Chrome gets logged multiple times.
Is there a better way to track the execution of an application than by process ID? Or is there, perhaps, a technique for getting around this particular problem? I'd considered not adding a process ID if a process with the same ID was added within a second or so, but that seems exploitable.
Any ideas?
I'm not 100% sure, but I would assume that one of Chrome's processes must be the parent. Try eliminating processes from your list if they share the same parent (PPID), as long as that parent is not init (PID 1).
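A minimal sketch of that idea in Python, in case it helps. It uses a slight variant of the rule: a process is dropped when its parent is also in the tracked list, so only the top-level process of each application remains. The function name `collapse_to_roots` and the `(pid, ppid)` input shape are made up for illustration; how you obtain the PPIDs depends on your platform.

```python
from typing import Iterable, List, Tuple

INIT_PID = 1  # a process whose parent is init is always treated as a root


def collapse_to_roots(procs: Iterable[Tuple[int, int]]) -> List[int]:
    """Return the PIDs whose parent is not itself one of the tracked processes.

    `procs` is an iterable of (pid, ppid) pairs, e.g. taken from a process
    snapshot (an assumption; the original post does not say how PIDs are captured).
    """
    procs = list(procs)
    tracked = {pid for pid, _ in procs}
    roots = []
    for pid, ppid in procs:
        # Keep the process only if its parent is init or is not being tracked.
        if ppid == INIT_PID or ppid not in tracked:
            roots.append(pid)
    return roots


# One browser launch: a parent (100) plus three helper processes.
snapshot = [(100, 1), (101, 100), (102, 100), (103, 100)]
print(collapse_to_roots(snapshot))  # [100]
```

With this, a launch that spawns several helpers is logged once, under the PID of the parent.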
I ended up just checking if I was adding a duplicate. Not very efficient, but easy and effective. It will serve for now.
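For anyone who wants something a bit cheaper than scanning for duplicates, here is one possible sketch: keep a set of live PIDs per application name, record the start time when the first process appears, and report the run time when the last one exits. The class `AppTracker`, its method names, and keying on the application name are all assumptions for illustration, not what I actually ended up using.

```python
import time
from typing import Callable, Dict, Optional, Set


class AppTracker:
    """Track one run per application, however many processes it spawns."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._live: Dict[str, Set[int]] = {}    # app name -> live PIDs
        self._started: Dict[str, float] = {}    # app name -> start timestamp

    def process_started(self, app: str, pid: int) -> None:
        pids = self._live.setdefault(app, set())
        if not pids:
            # First process of this application: the run starts now.
            self._started[app] = self._clock()
        pids.add(pid)

    def process_stopped(self, app: str, pid: int) -> Optional[float]:
        """Return the run time in seconds when the last process exits, else None."""
        pids = self._live.get(app)
        if not pids or pid not in pids:
            return None  # unknown process, nothing to log
        pids.discard(pid)
        if pids:
            return None  # other processes of this application are still alive
        del self._live[app]
        return self._clock() - self._started.pop(app)
```

Set membership checks are constant time on average, so the duplicate check no longer gets slower as the list grows.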
I'm working on getting Redis to run on Solaris 10 and there are a few integration tests that are failing. The test I'm looking into works like this:
Start Redis
It forks and the child starts dumping the database to a backup file (RDB)
There's actually a parent / child / grandchild relationship going on where the grandchild becomes a zombie, but I noticed that only minutes before I had to head home.
After a short time the test script sends SIGTERM to the child
The child catches the signal & shuts down gracefully
The parent calls wait3()
In spite of the wait3() call the child ends up in a zombie state.
The test fails around 90% of the time when I run it. Once it gets into a failed state it never recovers. I tried changing the test to wait significantly longer and although it appears to call wait3() many times after the process has exited, it stays in that state until the parent process(es) are killed.
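For reference, the sequence the test expects can be sketched in a few lines of Python (this is not Redis code; the function name `terminate_and_reap` and the pipe used for synchronisation are made up for illustration). Note that wait3() only reaps direct children of the caller, so in a parent / child / grandchild setup the top-level process cannot reap the grandchild this way.

```python
import os
import signal
import time
from typing import Optional


def terminate_and_reap(timeout: float = 5.0) -> Optional[int]:
    """Fork a child, send it SIGTERM, and reap it with wait3().

    Returns the child's exit code, or None if it was not reaped in time.
    """
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: shut down gracefully when SIGTERM arrives.
        try:
            os.close(ready_r)
            signal.signal(signal.SIGTERM, lambda signum, frame: os._exit(0))
            os.write(ready_w, b"x")  # tell the parent the handler is installed
            while True:
                signal.pause()
        finally:
            os._exit(1)  # only reached if something above failed

    # Parent: wait until the child is ready, then ask it to stop.
    os.close(ready_w)
    os.read(ready_r, 1)
    os.close(ready_r)
    os.kill(pid, signal.SIGTERM)

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # WNOHANG: return immediately with pid 0 if no child has exited yet.
        reaped, status, _rusage = os.wait3(os.WNOHANG)
        if reaped == pid:
            return os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
        time.sleep(0.01)
    return None
```

Once wait3() has returned the child's PID, the child is gone from the process table. If the process being signalled is not a direct child of the caller, this loop never sees it.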
Unfortunately I won't be able to work on this again until next week, so I'm researching it from home. Most of my googling has only turned up documentation or "why do processes become zombies?" type questions.
This Google Groups thread from the mid-90s may help, though it mostly covers older releases of Solaris / SunOS.
I was mistaken. It looks like the master node doesn't see that its child failed so doesn't wait.
The title is a bit misleading, so let me explain further.
I have a non-thread-safe DLL that I have no choice but to use as part of my back-end
servers. I can't use it directly in my servers, as its threading issues cause
it to crash. So I created an Akka.NET cluster of N nodes, each of which hosts a single
actor. All of my API calls that originally went to that bad DLL are now routed as
messages to these nodes through a round-robin group. As each node has only a single, single-threaded
actor, I get safe access, but as I have N of them running I get parallelism, of a sort.
In production, I have things configured with auto-down = false and default timings on heartbeats
and so on. This works perfectly. I can fire up new nodes as needed, they get added to the group,
I can remove them with Cluster.Leave and that is happy as well.
My issue is with debugging. In our development environment we keep a cluster of 20 nodes each
exposing a single actor as described above that wraps this dll. We also have a set of nodes that act as
seed nodes and do nothing else.
When our application is run it joins the cluster. This allows it to direct requests through the round-robin
router to the nodes we keep up in our cluster. When developing, testing, and debugging the app, if I configure things to use auto-down = false
we end up with problems whenever a test run crashes or we stop the application without going through
the proper cluster-leaving logic, such as when we terminate the app with the stop button in the debugger.
Without auto-down, this leaves us with a missing member of the cluster that causes the leader to disallow
additions to the cluster. This means that the next time I run the app to debug, I can't join the cluster and am
stuck.
It seems that I have to have auto-down set to get debugging to work. If it is set, then when I crash my app
the node is removed from the cluster 5 seconds later. When I next fire up my
app, the cluster is back in a happy state and I can join just fine.
The problem with this is that if I am debugging the application and pause it for any amount of time, it is almost immediately
seen as unreachable and then 5 seconds later is thrown out of the cluster. Basically, I can't debug with these settings.
So, I set failure-detector.acceptable-heartbeat-pause = 600s to give me more time to pause the app
while debugging. I will get shut down in 10 min, but I don't often sit in the debugger for that long, so it's an acceptable
trade-off. The issue with this is of course that when I crash the app, or stop it in the debugger, the cluster thinks it
exists for the next 10 minutes. No one tries to talk to these nodes directly, so in theory that isn't a huge issue, but I keep
running into cases where the test I just ran got itself elected as role leader. So the role leader is now dead, but the cluster
doesn't know it yet. This seems to prevent me from joining anything new to the cluster until my 10 min are up. When I try to leave
the cluster nicely, my dead node gets stuck at the exiting state and doesn't get removed for 10 minutes. And I don't always get
notified of the removal either, forcing me to set a timeout on leaving that will cause it to give up.
There doesn't seem to be any way to say "never let me be the leader". When I have run the app with no role set for the cluster
it seems to often get itself elected as the cluster leader causing the same problem
as when the role leader is dead but unknown to be so, but at a larger level.
So, I don't really see any way around this, but maybe someone has some tricks to pull this off. I want to be able to debug
my cluster member without it being thrown out of the cluster, but I also don't want the cluster to think that leader nodes
are around when they aren't, preventing me from rejoining during my next attempt.
Any ideas?
Hi spring batch users,
regarding the documentation http://docs.spring.io/spring-batch/reference/htmlsingle/#d5e1320
"If the process died ("kill -9" or server failure) the job is, of course, not running, but the JobRepository has no way of knowing because no-one told it before the process died."
I try to find and restart the stale job executions by using
Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions(jobName);
...
jobExecution.setStatus(BatchStatus.FAILED);
jobExecution.setEndTime(new Date());
jobRepository.update(jobExecution);
jobOperator.restart(jobExecution.getId());
But this seems to be very inconvenient.
1) I have to do this before other (new) jobs could be started.
2) I have to handle multiple instances of running servers so findRunningJobExecutions will not do the trick.
You can find other questions regarding this topic:
https://jira.spring.io/browse/BATCH-2433?jql=project%20%3D%20BATCH%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
Spring Batch after JVM crash
I would love to see a way to register a "start up clean jobs listener". This would still not fix the problems caused by the multi-server environment, because Spring Batch does not know whether a JobExecution marked as STARTED is still running on another instance.
Thanks for any advice
Alex
Your job cannot and should not recover "automatically" from a kill -9 scenario. A kill -9 is treated very differently than your application throwing a caught Exception. The reason for this is that you've effectively pulled the carpet out from under the application without giving it a chance to reach a synchronization point with the database to commit any necessary information to the ExecutionContext or update the job/step status(es). Therefore, the last status touchpoint with the database will remain and the job will still look STARTED.
"OK, fine" you say, "but if I start another execution, I want it to find that STARTED execution, and pick up where it left off." The problem here is that there is no clean way for the application to distinguish a job that is ACTUALLY RUNNING from one that has failed but couldn't update the database. The framework here correctly errs on the side of caution and prevents you from starting a job that already appears to be running, and this is a GOOD thing.
Why? Because let's assume your job was actually still running and you restarted it by accident. As coded, the framework will start to spin up, see your running execution and fail with the following message: "A job execution for this job is already running." I can't tell you how many times we've been saved by this because someone accidentally launched a job twice!
If you were to implement the listener you suggest, the 2nd execution would instead be allowed to start and you'd have 2 different JVMs repeating the same work, possibly writing to the same files/tables and causing a huge data mess that could be impossible to clean up.
Trust me, in the event the Linux terminal kills your job or your job dies because the connection to the database has been severed, you WANT human eyes on those execution states before you attempt a restart.
Finally, on the off chance you actually wanted to kill your job, you can leverage several other standard patterns for stopping jobs:
Stop via throw Exception
Stop via JobOperator.stop()
We are experiencing this issue approximately once a month. It is very hard to pinpoint the cause so any help would be appreciated. This causes the App pool to stop and brings the site down. We have gone through all log files and have concluded nothing. We are using the 2.0.3 version on IIS 6.
I've noticed IIS defaults web apps to a 29-hour recycle schedule, which can be troublesome since it may recycle at times your users do not expect.
For example: the web app starts at 12am, which means the next day it recycles at 5am, the day after that at 10am, the day after that at 3pm, etc. (this assumes there is enough request activity against your app to keep it alive so it does not shut down due to inactivity)
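To sanity-check that drift, here is a small Python sketch of the arithmetic (nothing IIS-specific; the function name `recycle_times` and the 2024-01-01 start date are made up for illustration):

```python
from datetime import datetime, timedelta
from typing import List


def recycle_times(start: datetime, interval_hours: int = 29, count: int = 3) -> List[datetime]:
    """Return the first `count` recycle times for a fixed recycle interval."""
    return [start + timedelta(hours=interval_hours * i) for i in range(1, count + 1)]


# App starts at midnight; each recycle lands 5 hours later in the day than the last.
for t in recycle_times(datetime(2024, 1, 1, 0, 0)):
    print(t.strftime("%Y-%m-%d %H:%M"))
# 2024-01-02 05:00
# 2024-01-03 10:00
# 2024-01-04 15:00
```

Because 29 hours is 24 + 5, the recycle moves five hours later every day and will sooner or later fall in the middle of peak usage.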
If your web app relies heavily on in-memory session state this is especially bad because the recycle will kill sessions and possibly force users to re-authenticate and lose any unsaved work. (if you don't design your app to work seamlessly with recycling)
Check the recycle schedule and make sure it recycles at a time that you expect. See this for screenshots: http://remy.supertext.ch/2010/08/iis7-worker-process-reached-its-allowed-processing-time-limit/
Not sure about the infinite loop suggestion... sounds like you just have a recycling configuration issue to resolve.
This likely indicates an infinite loop in your application code.
Basically, every time a request comes into the web server, IIS hands the request off to a worker process. You can configure in IIS how many of those workers there are, and what the timeout value is. The timeout is to keep things moving in case the application code hangs -- it gets killed so the thread can go back in the pool to keep servicing new requests.
So look through your code for likely infinite loops. Or alternatively, it could be an extremely long-running database query that could have eventually finished but exceeded the timeout value. Perhaps your web application offers the end user an opportunity to make too broad of a query that returns too much data or requires too much DB processing time.
It's hard to give a specific cause for you, of course, but try to think along these lines.
If you're experiencing a crash as a result (sounds like you are) then you might want to grab a copy of Debugging Tools for Windows and spend some time reading Tess Ferrandez' blog--she offers great advice on performing post mortem crash analysis and makes WinDbg a whole lot more approachable.
I want to write a program that is notified by the OS whenever any running process on that OS dies.
I don't want to poll and compare every time to see whether a previously existing process has died. I want my program to be alerted by the OS whenever a process terminates.
How do I go about it? Some sample code would be very helpful.
PS: Looking for approaches in Java/C++.
Sounds like you want PsSetCreateProcessNotifyRoutine(). See this article to get started:
http://www.codeproject.com/KB/threads/procmon.aspx
Under Unix, you could use the SIGCHLD signal to get notified of the death of the process. This requires, however, that the process being monitored is a child process of the monitoring process.
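A rough sketch of the SIGCHLD approach, written in Python rather than Java/C++ to keep it short (the same calls exist in C as signal/sigaction and waitpid). The names `on_sigchld`, `exited` and `spawn_and_wait` are made up for illustration, and the sleep loop only stands in for whatever your program would normally be doing.

```python
import os
import signal
import time

exited = {}  # pid -> raw wait status, filled in by the signal handler


def on_sigchld(signum, frame):
    # Reap every child that has exited; several exits can share one signal.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # children exist, but none has exited yet
        exited[pid] = status


def spawn_and_wait(exit_code: int = 3, timeout: float = 5.0) -> int:
    """Fork a child that exits at once; return its PID after SIGCHLD was handled."""
    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(exit_code)  # child: terminate immediately
        deadline = time.monotonic() + timeout
        while pid not in exited and time.monotonic() < deadline:
            time.sleep(0.01)  # stand-in for the program's real work
        return pid
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGCHLD, previous if previous is not None else signal.SIG_DFL)
```

The handler runs only when the kernel delivers SIGCHLD, so the monitoring program does not have to compare process lists itself. It still only hears about its own children.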
Under Windows, you might need to have a valid handle to the process. If you spawn the process yourself using CreateProcess, you get the handle for free; otherwise you must acquire one by other means. It might then be possible to wait for the process to terminate by calling WaitForSingleObject on the handle.
Sorry, I don't have any example code for this. I am not even sure that waiting on the process handle under Windows really awaits termination of the process (as opposed to some other "significant" condition that causes the process handle to enter the "signalled" state).
I don't have a code sample ready but one idea – on Linux – might be to find out the ID of the process you'd like to watch when first starting your watcher program (e.g. using $ pgrep) and then using inotify to watch /proc/<PID>/ – which gets deleted when the process dies. In contrast to polling, this doesn't cost any significant CPU resources.
Now, procfs is not completely supported by inotify, so I can't guarantee this approach would actually work but it is certainly worth looking into.