Monit - Check if process exists and kill when consuming X memory

I have 'someprocess' running on some hosts, and want a single monit check for hosts with and without 'someprocess' running, which kills someprocess when memory usage exceeds a threshold.
The check below works, but on hosts that are not running someprocess, monit continually logs "process is not running" in /var/log/monit.log.
check process someprocess
    matching "someprocess"
    if memory usage > 2% for 1 cycle then exec "/usr/bin/kill someprocess"
I also want to include 'if exists' but keep getting monit syntax errors; I'm not sure I can have more than one if statement.
Does anyone know if I can do this, so have something along the lines of:
if exists AND memory usage > 2% for 1 cycle then exec "/usr/bin/kill someprocess"

You can add an additional test and disable the monitoring when the application is not available, to prevent flooding the logs with messages.
check process someprocess matching "someprocess"
    if memory usage > 2% for 1 cycle then exec "/usr/bin/kill someprocess"
    if not exist for 5 cycles then unmonitor
This will disable the check if the application is not available for 5 monitor cycles. To enable the monitoring again, use "monit monitor someprocess".
See https://mmonit.com/monit/documentation/monit.html#EXISTENCE-TESTS
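Once the process is running again, the check can be re-enabled and verified from the command line, e.g.:
monit monitor someprocess   # resume the unmonitored check
monit status                # verify the current state of the checks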

Why can't I kill an Erlang process?

I am spawning 2 processes and it seems I cannot kill either of them:
restarter - a process that spawns the worker whenever it goes down
worker - a process that gets messages from the shell, accumulates them and returns them in the exit reason to the restarter, which in turn forwards them to the shell
The worker process can't be killed, since the restarter would restart it on any trapped exit message. But what keeps the restarter process alive?
-module(mon).
-compile([debug_info]).
-export([worker/1, init/0, restarter/2]).

% ctrl+g
init() ->
    Pid = spawn(?MODULE, restarter, [self(), []]),
    register(restarter, Pid),
    Pid.

restarter(Shell, Queue) ->
    process_flag(trap_exit, true),
    Wk = spawn_link(?MODULE, worker, [Queue]),
    register(worker, Wk),
    receive
        {'EXIT', Pid, {Queue, normal}} ->
            Shell ! {Queue, "From res: worker died peacefully, wont restart"};
        {'EXIT', Pid, {Queue, horrible}} ->
            Shell ! {Queue, "Processed so far:"},
            Shell ! "will restart in 5 seconds, select fresh/stale -> 1/0",
            receive
                1 ->
                    Shell ! "Will restart fresh",
                    restarter(Shell, []);
                0 ->
                    Shell ! "Will continue work",
                    restarter(Shell, Queue)
            after 5000 ->
                Shell ! "No response -> started with 666",
                restarter(Shell, [666])
            end;
        {MSG} ->
            Shell ! {"Unknown message...closing", MSG}
    end.

worker(Queue) ->
    receive
        die -> exit({Queue, horrible});
        finish -> exit({Queue, normal});
        MSG -> worker([{time(), MSG} | Queue])
    end.
Usage
mon:init().
regs(). %worker and restarter are working
whereis(worker) ! "msg 1", whereis(worker) ! "msg2".
whereis(worker) ! finish.
flush(). % should get the first clause from restarter
regs(). % worker should be up and running again
exit(whereis(restarter),reason).
regs(). % restarter should be dead
In this scenario, the restarter process is trapping exits, so exit(whereis(restarter), reason) doesn't kill it. The exit signal gets converted to a message, and gets put into the message queue of the process:
> process_info(whereis(restarter), messages).
{messages,[{'EXIT',<0.76.0>,reason}]}
The reason it's still in the message queue is that none of the clauses in the receive expression matches this message. The first two clauses are specific to the exit reasons used by the worker process, and the last clause might look like a catch-all clause but it actually isn't - it matches any message that is a tuple with one element. If it were written MSG instead of {MSG}, it would have received the exit reason message, and sent "Unknown message" to the shell.
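For illustration, a minimal sketch of what that last clause would look like with a plain variable (the rest of the receive stays as it is):
    receive
        %% ... the two {'EXIT', ...} clauses stay as they are ...
        MSG ->
            %% a bare variable matches any message, including {'EXIT', Pid, reason}
            Shell ! {"Unknown message...closing", MSG}
    end.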
If you really want to kill the process, use the kill reason:
exit(whereis(restarter), kill).
A kill exit signal is untrappable, even if the process is trapping exits.
Another thing: the first two receive clauses will only match if the worker's queue is still exactly the queue it was started with, i.e. the worker has collected no messages. That is because the patterns reuse the already-bound variable name Queue, so the queue in {'EXIT',Pid,{Queue,normal}} must be equal to the value passed as an argument to the restarter function. In a situation like this, you'd normally use a fresh variable such as NewQueue in the receive clauses.
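A minimal sketch of that change (only the patterns differ; the underscore prefix on _Pid just silences the unused-variable warning):
    receive
        {'EXIT', _Pid, {NewQueue, normal}} ->
            Shell ! {NewQueue, "From res: worker died peacefully, wont restart"};
        {'EXIT', _Pid, {NewQueue, horrible}} ->
            %% ... same body as before, just carrying NewQueue forward ...
            Shell ! {NewQueue, "Processed so far:"}
        %% remaining clauses unchanged
    end.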

Monit for "cron-like" tasks

I have some batch-type jobs that I would like to move from cron to Monit but am struggling to get them to work properly. These scripts typically run once a day, but on occasion have to be re-run later in the day. The goal is to take advantage of the monit & m/monit front-ends to re-run them, as well as be alerted on failure, in similar fashion to other things under monit.
The below was my first attempt. I know the docs say to use a range/wildcard for the minute field, but I have my monit daemon set to cycle every 20 seconds, so I thought I'd be able to get away with this.
check program test.sh
    with path "/usr/local/bin/test.sh"
    every "0 7 * * *"
    if status != 0 then alert
This does not seem to work: it appears to pick up the exit status of the program on the NEXT run. So I have a zombie process sitting around until 7 am the next day, at which time I'll see the status from the previous day's run.
It would be nice if this ran immediately, or if there were a way to schedule something as "batch" that would only run once when started (either from the command line or the web GUI). Example below.
check program test.sh
    with path "/usr/local/bin/test.sh"
    mode batch
    if status != 0 then alert
Is it possible to do what I want? Can a 'check program' be scheduled so that it only runs once when started, or by using the 'every [cron]' syntax supported by monit?
TIA for any suggestions.
The latest version of monit (5.18) now picks up the exit status on the next daemon cycle, not on the next execution of the program like in the past (which might not be until the next day).
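For reference, a sketch of the range-based schedule the docs suggest (the "0-2" range is just an illustration; any minute range wider than the daemon cycle works):
check program test.sh
    with path "/usr/local/bin/test.sh"
    every "0-2 7 * * *"    # range on the minute field, as the docs recommend
    if status != 0 then alert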

Monit false alerts

I am monitoring a Java daemon process via a PID file. Below is the config.
check process SemanticReplication with pidfile "/ngs/app/edwt/opsmonit /monit/scripts/process.pid"
    start = "/ngs/app/edwt/scripts/javadaemon/start_daemon.ksh"
    stop = "/ngs/app/edwt/scripts/javadaemon/stop_daemon.ksh"
Many times, even though the Java daemon process is up and running, I get a false alert saying the process is not running.
In the next monit check cycle (after a minute), another monit alert triggers saying the process is up and running.
Can someone help with how to avoid these false alerts?
Your check statement has monit check for the existence of the pid file (which looks weird with the space in the path, btw). If there isn't one, it sends an alert by default and then runs the start directive.
I get around this by having a check process ... matching statement like so:
check process app-pass matching 'Passenger RubyApp: \/home\/app\/app-name\/public'
Essentially, "matching" does the equivalent of ps aux | grep ... which does a better job when I can't rely on a pid file existing, like with a child process.

Monit - how to identify crashes of a program instead of restarts

I am using monit to monitor my program. The program being monitored can potentially crash in 2 situations:
The program can randomly crash. It just needs to be restarted
It gets into a bad state and crashes each time it is subsequently started
To fix the latter situation, I have a script that stops the program, resets it to a good state by cleaning its data files, and restarts it. I tried the below config:
check process program with pidfile program.pid
    start program = "programStart" as uid username and gid groupname
    stop program = "programStop" as uid username and gid groupname
    if 3 restarts within 20 cycles then exec "cleanProgramAndRestart" as uid username and gid groupname
    if 6 restarts within 20 cycles then timeout
Say monit restarts the program 3 times in 3 cycles. After it is restarted the third time, the cleanProgramAndRestart script runs. However, as the cleanProgramAndRestart script restarts the program yet again, the condition of 3 restarts is met again in the next cycle, and it becomes an infinite loop.
Could anyone suggest any way to fix this?
If any of the actions below is possible, there may be a way around this.
If there were a "crash" keyword instead of "restarts", I would be able to run the clean script after the program crashes 3 times instead of after it is restarted 3 times
If there is a way to reset the "restarts" counter after running the exec script
If there is a way to exec something only when the result of the "3 restarts" condition changes
Monit polls your "tests" every cycle. The cycle length is usually defined in /etc/monitrc, via set daemon cycle_length.
So if your cleanProgramAndRestart takes less than a cycle to run, the loop shouldn't happen.
Since it is happening, I guess your cleanProgramAndRestart takes more than a cycle to run.
You can:
Increase the cycle length in the Monit configuration
Check your program every x cycles (make sure that cycle_length * x > cleanProgramAndRestart_length); see the sketch below
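A rough sketch of those two knobs (the 60-second cycle and "5 cycles" are just placeholder values):
set daemon 60    # in /etc/monitrc: one monitoring cycle every 60 seconds

check process program with pidfile program.pid
    every 5 cycles    # only test every 5th cycle, so a slow clean-up script can finish first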
If you can't modify these variables, there could be a little workaround, with a temp file:
check process program with pidfile program.pid
    start program = "programStart"
        as uid username and gid groupname
    stop program = "programStop"
        as uid username and gid groupname
    if 3 restarts within 20 cycles
        then exec "touch /tmp/program_crash"
    if 6 restarts within 20 cycles then timeout

check file program_crash with path /tmp/program_crash
    every x cycles  # make sure that cycle_length * x > cleanProgramAndRestart_length
    if changed timestamp then exec "cleanProgramAndRestart"
        as uid username and gid groupname

Sun Grid Engine resubmit job stuck in 'Rq' state

I have what I hope is a pretty simple question, but I'm not super familiar with Sun Grid Engine, so I've been having trouble finding the answer. I am currently submitting jobs to a grid using a bash submission script that generates a command and then executes it. I have read online that if a Sun Grid Engine job exits with a code of 99, it gets resubmitted to the grid. I have successfully written my bash script to do this:
# [code to generate command, stores in $command]
$command
STATUS=$?
if [[ $STATUS -ne 0 ]]; then
    exit 99
fi
exit 0
When I submit this job to the grid with a command that I know has a non-zero exit status, the job does indeed appear to be resubmitted; however, the scheduler never sends it to another host. Instead, it just remains stuck in the queue with the status "Rq":
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2150015 0.55500 GridJob.sh my_user Rq 04/08/2013 17:49:00 1
I have a feeling that this is something simple in the config options for the queue, but I haven't been able to find anything googling. I've tried submitting this job with the qsub -r y option, but that doesn't seem to change anything.
Thanks!
Rescheduled jobs will only get run in queues that have their rerun attribute (FALSE by default) set to TRUE, so check your queue configuration (qconf -mq myqueue). Without this, your job remains in the rescheduled-pending state indefinitely because it has nowhere to go.
IIRC, submitting jobs with qsub -r yes only qualifies them for automatic rescheduling in the event of an exec node crash; exiting with status 99 should trigger a reschedule regardless.
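For reference, a rough sketch of inspecting and changing that attribute (my_queue.q is a placeholder for the actual queue name; you can also edit it interactively with qconf -mq my_queue.q as mentioned above):
qconf -sq my_queue.q | grep rerun          # show the current rerun setting
qconf -rattr queue rerun TRUE my_queue.q   # set rerun to TRUE non-interactively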