Monit false alerts - monit

I am monitoring a Java daemon process via its PID file. Below is the configuration:
check process SemanticReplication with pidfile "/ngs/app/edwt/opsmonit /monit/scripts/process.pid"
start = "/ngs/app/edwt/scripts/javadaemon/start_daemon.ksh"
stop = "/ngs/app/edwt/scripts/javadaemon/stop_daemon.ksh"
Many times, even though the Java daemon process is up and running, I get a false alert saying the process is not running.
In the next monit check cycle (after a minute), another alert fires saying the process is up and running again.
Can someone help me avoid these false alerts?

Your check statement tells monit to check for the existence of the pid file (whose path looks weird with the space in it, btw). If the file isn't there, monit sends an alert by default and then runs the start directive.
I get around this by having a check process ... matching statement like so:
check process app-pass matching 'Passenger RubyApp: \/home\/app\/app-name\/public'
Essentially, "matching" does the equivalent of ps aux | grep ..., which does a better job when I can't rely on a pid file existing, as with a child process.
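Applied to the original question, a matching-based check might look like the sketch below. The regex and the reuse of the start/stop scripts are assumptions based on the question, not tested against the actual daemon:

```
check process SemanticReplication matching "java.*SemanticReplication"
    start program = "/ngs/app/edwt/scripts/javadaemon/start_daemon.ksh"
    stop program  = "/ngs/app/edwt/scripts/javadaemon/stop_daemon.ksh"
```

Make the pattern specific enough that it cannot match other Java processes running on the same host.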

Related

Monit - Check if process exists and kill when consuming X memory

I have 'someprocess' running on some hosts, and I want a single monit check that works on hosts both with and without 'someprocess' running, and which kills someprocess when its memory usage exceeds a threshold.
The check below works, but on hosts not running someprocess, monit continually logs "process is not running" in /var/log/monit.log.
check process someprocess
matching "someprocess"
if memory usage > 2% for 1 cycle then exec "/usr/bin/kill someprocess"
I want to also include 'if exists', but I keep getting monit syntax errors; I'm not sure I can have more than one if statement.
Does anyone know if I can do this, so have something along the lines of:
if exists AND memory usage > 2% for 1 cycle then exec "/usr/bin/kill someprocess"
You can add an additional test and disable the monitoring if the application is not available, to avoid flooding the logs with messages.
check process someprocess matching "someprocess"
if memory usage > 2% for 1 cycle then exec "/usr/bin/kill someprocess"
if not exist for 5 cycles then unmonitor
This will disable the check if the application is not available for 5 monitoring cycles. To enable the monitoring again, use "monit monitor someprocess".
See https://mmonit.com/monit/documentation/monit.html#EXISTENCE-TESTS

Can I use "stopsignal=WINCH" to have supervisord gracefully stop an Apache process?

According to the Apache documentation, the WINCH signal can be used to gracefully stop Apache.
So it would seem that, in supervisord, I should be able to use stopsignal=WINCH to configure supervisord to stop Apache gracefully.
However, Google turns up zero results for "stopsignal=WINCH". It seems odd that no one has tried this before.
Just wanted to confirm: is stopsignal=WINCH the way to get supervisord to stop Apache gracefully?
I had the same problem running/stopping apache2 under supervisord inside a Docker container. I don't know if your problem is related to Docker, or how familiar you are with Docker. Just to give some context: when you call docker stop <container-name>, Docker sends SIGTERM to the process with PID 1 running inside the container (some details on the topic), in this case supervisord. I wanted supervisord to pass the signal on to all its programs to terminate them gracefully, because I realized that if you don't gracefully terminate apache2, you might not be able to restart it, since the PID file is not removed. I tried both with and without stopsignal=WINCH, and the result didn't change for me: in both cases apache2 was gently terminated (exit status was 0 and no PID file was left in /var/run/apache2). To stay on the safe side, I kept stopsignal=WINCH in the supervisord config, but as of today I have not been able to find a clear answer online, neither here nor by googling.
According to supervisord's source code:
# all valid signal numbers
SIGNUMS = [ getattr(signal, k) for k in dir(signal) if k.startswith('SIG') ]

def signal_number(value):
    try:
        num = int(value)
    except (ValueError, TypeError):
        name = value.strip().upper()
        if not name.startswith('SIG'):
            name = 'SIG' + name
        num = getattr(signal, name, None)
        if num is None:
            raise ValueError('value %r is not a valid signal name' % value)
    if num not in SIGNUMS:
        raise ValueError('value %r is not a valid signal number' % value)
    return num
It recognizes all signal names, and even if your signal name doesn't start with 'SIG', it adds that prefix automatically.
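As a quick sanity check, supervisord's normalization can be exercised with a self-contained sketch that mirrors the function above (a copy for demonstration, not an import of supervisord itself):

```python
import signal

# Mirror of supervisord's normalization: accept "WINCH", "SIGWINCH",
# or a numeric string, and resolve it to a signal number.
SIGNUMS = [getattr(signal, k) for k in dir(signal) if k.startswith('SIG')]

def signal_number(value):
    try:
        num = int(value)
    except (ValueError, TypeError):
        name = value.strip().upper()
        if not name.startswith('SIG'):
            name = 'SIG' + name          # "WINCH" -> "SIGWINCH"
        num = getattr(signal, name, None)
        if num is None:
            raise ValueError('value %r is not a valid signal name' % value)
    if num not in SIGNUMS:
        raise ValueError('value %r is not a valid signal number' % value)
    return num

print(signal_number('WINCH') == signal.SIGWINCH)   # True
print(signal_number('SIGTERM') == signal.SIGTERM)  # True
```

So stopsignal=WINCH is parsed to SIGWINCH; whether Apache then actually stops gracefully under supervisord is a separate question, as the answer above discusses.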

Monit for "cron-like" tasks

I have some batch-type jobs that I would like to move from cron to monit, but I am struggling to get them to work properly. These scripts typically run once a day, but occasionally have to be re-run later in the day. The goal is to take advantage of the monit and M/Monit front-ends to re-run them, as well as to be alerted on failure, as with other things under monit.
The below was my first attempt. I know the docs say to use a range/wildcard for the minute field, but my monit daemon is set to cycle every 20 seconds, so I thought I could get away with this.
check program test.sh
with path "/usr/local/bin/test.sh"
every "0 7 * * *"
if status != 0 then alert
This does not seem to work: monit picks up the exit status of the program on the NEXT run. So I have a zombie process sitting around until 7am the next day, at which time I see the status from the previous day's run.
It would be nice if this ran immediately, or if there were a way to schedule something as "batch" that would run only once when started (either from the command line or the web GUI). Example below.
check program test.sh
with path "/usr/local/bin/test.sh"
mode batch
if status != 0 then alert
Is it possible to do what I want? Can a 'check program' be scheduled that will only run one time when started or using the 'every [cron]' type syntax supported by monit?
TIA for any suggestions.
The latest version of monit (5.18) now picks up the exit status on the next daemon cycle, not on the next execution of the program as in the past (which might not be until the next day).
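For the scheduling itself, the documentation's advice about the minute field still applies: a cron string with a single minute can be missed if the daemon cycle does not line up exactly, so a small range is safer. A sketch (the 0-2 window is an assumption sized for a 20-second cycle, not a verified recommendation):

```
check program test.sh
    with path "/usr/local/bin/test.sh"
    every "0-2 7 * * *"
    if status != 0 then alert
```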

Monit - how to identify crashes of a program instead of restarts

I am using monit to monitor my program. The program being monitored can potentially crash in 2 situations:
1. The program can crash randomly. It just needs to be restarted.
2. It gets into a bad state and crashes each time it is subsequently started.
To fix the latter situation, I have a script to stop the program, reset it to a good state by cleaning its data files and restart it. I tried the below config
check process program with pidfile program.pid
start program = "programStart" as uid username and gid groupname
stop program = "programStop" as uid username and gid groupname
if 3 restarts within 20 cycles then exec "cleanProgramAndRestart" as uid username and gid groupname
if 6 restarts within 20 cycles then timeout
Say monit restarts the program 3 times in 3 cycles. After the third restart, the cleanProgramAndRestart script runs. However, as cleanProgramAndRestart restarts the program yet again, the condition of 3 restarts is met again in the next cycle, and it becomes an infinite loop.
Could anyone suggest any way to fix this?
If any of the below actions are possible, then there may be a way around.
If there were a "crash" keyword instead of "restarts", I could run the clean script after the program crashes 3 times instead of after it is restarted 3 times
If there were a way to reset the "restarts" counter after running the exec script
If there were a way to exec something only when the outcome of the "3 restarts" condition changes
Monit polls your "tests" every cycle. The cycle length is usually defined in /etc/monitrc, in set daemon cycle_length.
So if your cleanProgramAndRestart takes less than a cycle to complete, this shouldn't happen.
Since it is happening, I guess your cleanProgramAndRestart takes more than a cycle to complete.
You can:
Increase the cycle length in Monit configuration
Check your program every x cycles (make sure that cycle_length * x > cleanProgramAndRestart_length)
If you can't modify these variables, there could be a little workaround, with a temp file:
check process program
with pidfile program.pid
start program = "programStart"
as uid username and gid groupname
stop program = "programStop"
as uid username and gid groupname
if 3 restarts within 20 cycles
then exec "touch /tmp/program_crash"
if 6 restarts within 20 cycles then timeout
check file program_crash with path /tmp/program_crash every x cycles #(make sure that cycle_length*x > cleanProgramAndRestart_length)
if changed timestamp then exec "cleanProgramAndRestart"
as uid username and gid groupname

Sun Grid Engine resubmit job stuck in 'Rq' state

I have what I hope is a pretty simple question, but I'm not very familiar with Sun Grid Engine, so I've had trouble finding the answer. I am currently submitting jobs to a grid using a bash submission script that generates a command and then executes it. I have read online that if a Sun Grid Engine job exits with a code of 99, it gets re-submitted to the grid. I have written my bash script to do this:
[code to generate command, stores in $command]
$command
STATUS=$?
if [[ $STATUS -ne 0 ]]; then
    exit 99
fi
exit 0
When I submit this job to the grid with a command that I know has a non-zero exit status, the job does indeed appear to be resubmitted. However, the scheduler never sends it to another host; it just remains stuck in the queue with the status "Rq":
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2150015 0.55500 GridJob.sh my_user Rq 04/08/2013 17:49:00 1
I have a feeling that this is something simple in the queue's configuration options, but I haven't been able to find anything by googling. I've tried submitting the job with the qsub -r y option, but that doesn't seem to change anything.
Thanks!
Rescheduled jobs will only get run in queues that have their rerun attribute (FALSE by default) set to TRUE, so check your queue configuration (qconf -mq myqueue). Without this, your job remains in the rescheduled-pending state indefinitely because it has nowhere to go.
IIRC, submitting jobs with qsub -r yes only qualifies them for automatic rescheduling in the event of an exec node crash, and that exiting with status 99 should trigger a reschedule regardless.
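Concretely, the fix lives in the queue configuration rather than in the job script. A sketch of the relevant excerpt (the queue name myqueue is taken from the answer above; all other queue attributes stay as they are):

```
# In the editor opened by: qconf -mq myqueue
# change the rerun attribute from its default FALSE to:
rerun                 TRUE
```

After this change, jobs exiting with status 99 should leave the Rq state and be dispatched again on the next scheduling run.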