Application failed to start after timestamp check failed in monit

I have a simple monit control file that contains the following:
check process fooBar1 with pidfile fooBar1PidFile
    start program = "/etc/init.d/fooBar1 start" with timeout 10 seconds
    stop program = "/etc/init.d/fooBar1 stop"
    if 5 restarts within 5 cycles then unmonitor

check process fooBar2 with pidfile fooBar2PidFile
    start program = "/etc/init.d/fooBar2 start" with timeout 10 seconds
    stop program = "/etc/init.d/fooBar2 stop"
    if 5 restarts within 5 cycles then unmonitor

check process fooBar with pidfile fooBarPidFile
    start program = "/etc/init.d/fooBar start" with timeout 10 seconds
    stop program = "/etc/init.d/fooBar stop"
    if 5 restarts within 5 cycles then unmonitor
    if memory usage > 25.0 MB for 4 cycles then alert
    depends on fooBar1
    depends on fooBar2
    depends on checkFile

check file checkFile with path pathToFile
    if timestamp > 4 minutes for 8 cycles then restart
The intention is to restart the fooBar, fooBar1 and fooBar2 applications when the timestamp check for checkFile fails. What actually happens instead is that monit tries to restart checkFile rather than fooBar.
This check worked fine with monit version 5.5, but it does not work with 5.18.
This is what I get when the timestamp check fails and the 8 cycles have elapsed:
'checkFile' timestamp for pathToFile failed -- current timestamp is Fri, 08 Dec 2017 12:47:04
'fooBar' failed to start -- could not start required services: 'checkFile'
'fooBar' start action failed
Am I missing something here?
Thanks in advance

I tried a workaround:
check file checkFile with path pathToFile
    if timestamp > 4 minutes for 8 cycles then
        exec "/etc/init.d/fooBar restart"
        repeat every 1 cycles
This restarts fooBar every cycle while the timestamp check is failing, but I am wondering whether there is a better way to do it.
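One variant I am considering (untested sketch) is to keep the exec action but raise the repeat interval, so that the restart is repeated only every 8 cycles while the error condition persists instead of on every cycle:
check file checkFile with path pathToFile
    if timestamp > 4 minutes for 8 cycles then
        exec "/etc/init.d/fooBar restart"
        repeat every 8 cycles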

Related

Open MPI - ring_c on multiple hosts fails

I have recently installed Open MPI on two Ubuntu 14.04 hosts and I am now testing its functionality with the two provided test programs hello_c and ring_c. The hosts are called 'hermes' and 'zeus', and both have the user 'mpiuser', which can log in non-interactively (via ssh-agent).
The commands mpirun hello_c and mpirun --host hermes,zeus hello_c both work properly.
Running mpirun --host zeus ring_c locally also works. Output for both hermes and zeus:
mpiuser@zeus:/opt/openmpi-1.6.5/examples$ mpirun --host zeus ring_c
Process 0 sending 10 to 0, tag 201 (1 processes in ring)
Process 0 sent to 0
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
But running mpirun --host zeus,hermes ring_c fails and gives the following output:
mpiuser@zeus:/opt/openmpi-1.6.5/examples$ mpirun --host hermes,zeus ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
[zeus:2930] *** An error occurred in MPI_Recv
[zeus:2930] *** on communicator MPI_COMM_WORLD
[zeus:2930] *** MPI_ERR_TRUNCATE: message truncated
[zeus:2930] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
Process 0 sent to 1
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 2930 on
node zeus exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
I haven't found any documentation on how to solve such a problem and I don't have a clue where to look for the mistake on the basis of the error output. How can I fix this?
You've changed two things between the first and second runs - you've increased the number of processes from 1 to 2, and run on multiple hosts rather than a single host.
I'd suggest you first check you can run on 2 processes on the same host:
mpirun -n 2 ring_c
and see what you get.
When debugging on a cluster it's often useful to know where each process is running. You should always print out the total number of processes as well. Try using the following code at the top of ring_c.c:
/* rank and size are the ints already declared in ring_c.c */
char nodename[MPI_MAX_PROCESSOR_NAME];
int namelen;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* this process's rank */
MPI_Comm_size(MPI_COMM_WORLD, &size);        /* total number of processes */
MPI_Get_processor_name(nodename, &namelen);  /* host this rank runs on */
printf("Rank %d out of %d running on node %s\n", rank, size, nodename);
The error you're getting says that the incoming message is too large for the receive buffer, which is odd given that the code always sends and receives a single integer.
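For reference, MPI_ERR_TRUNCATE is the error MPI raises when the posted receive is smaller than the matching incoming message. A minimal, standalone sketch (hypothetical file name, unrelated to ring_c itself) that reproduces it:
/* truncate_demo.c -- rank 0 sends two ints, rank 1 only receives one.
 * Build: mpicc truncate_demo.c -o truncate_demo
 * Run:   mpirun -n 2 ./truncate_demo
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    int payload[2] = {1, 2};
    int small = 0;                /* room for a single int only */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* send two integers ... */
        MPI_Send(payload, 2, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ... but only post a receive for one: the message is truncated
         * and, with MPI_ERRORS_ARE_FATAL, the job aborts. */
        MPI_Recv(&small, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}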

Monit for "cron-like" tasks

I have some batch-type jobs that I would like to move from cron to Monit, but I am struggling to get them to work properly. These scripts typically run once a day, but on occasion have to be re-run later in the day. The goal is to take advantage of the monit and M/Monit front-ends to re-run them, as well as to be alerted on failure, in a similar fashion to other things under monit.
Below was my first attempt. I know the docs say to use a range/wildcard for the minute field, but my monit daemon is set to cycle every 20 seconds, so I thought I could get away with this:
check program test.sh
    with path "/usr/local/bin/test.sh"
    every "0 7 * * *"
    if status != 0 then alert
This does not seem to work: it appears to pick up the exit status of the program on the NEXT run. So I have a zombie process sitting around until 7 am the next day, at which point I see the status from the previous day's run.
It would be nice if this ran immediately, or if there were a way to schedule something as "batch" that would only run once when started (either from the command line or the web GUI). Example below:
check program test.sh
    with path "/usr/local/bin/test.sh"
    mode batch
    if status != 0 then alert
Is it possible to do what I want? Can a 'check program' be scheduled so that it runs only once when started, or by using the 'every [cron]' syntax supported by monit?
TIA for any suggestions.
The latest version of monit (5.18) now picks up the exit status on the next daemon cycle, not on the next execution of the program like in the past (which might not be until the next day).

Monit false alerts

I am monitoring a Java daemon process by PID. Below is the configuration:
check process SemanticReplication with pidfile "/ngs/app/edwt/opsmonit /monit/scripts/process.pid"
    start = "/ngs/app/edwt/scripts/javadaemon/start_daemon.ksh"
    stop = "/ngs/app/edwt/scripts/javadaemon/stop_daemon.ksh"
Many times, even though the Java daemon process is up and running, I get a false alert saying the process is not running.
In the next monit check cycle (after a minute), another monit alert triggers saying the process is up and running.
Can someone help with how to avoid these false alerts?
Your check statement has monit check for the existence of the pid file (whose path looks weird with the space in it, by the way). If the file isn't there, monit sends an alert by default and then runs the start directive.
I get around this by having a check process ... matching statement like so:
check process app-pass matching 'Passenger RubyApp: \/home\/app\/app-name\/public'
Essentially, "matching" does the equivalent of ps aux | grep ... which does a better job when I can't rely on a pid file existing, like with a child process.
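Applied to the Java daemon above, that could look something like the sketch below; the match pattern is only a guess at what the daemon's command line shows in ps output, so adjust it to whatever actually appears there:
check process SemanticReplication matching "java.*SemanticReplication"
    start program = "/ngs/app/edwt/scripts/javadaemon/start_daemon.ksh"
    stop program = "/ngs/app/edwt/scripts/javadaemon/stop_daemon.ksh"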

Expect - wait until a process terminates

I am completely new to Expect, and I want to run my Python script via Telnet.
The script takes about 1 minute to execute, but when I try to run it via Telnet with Expect, it doesn't work.
I have this simple Expect code:
#! /usr/bin/expect
spawn telnet <ip_addr>
expect "login"
send "login\r"
expect "assword"
send "password\r"
expect "C:\\Users\\user>\r"
send "python script.py\r"
expect "C:\\Users\\user>\r"
close
When I replace script.py with one that has a shorter execution time, it works great. Could you tell me what I should change so that the script waits until my script.py process terminates? Should I use timeout or sleep?
If you are sure about the execution time of the script, then you can add a sleep or set the timeout to the desired value:
send "python script.py\r"
sleep 60; # Sleeping for 1 min
expect "C:\\Users\\user>"; # Now expecting for the prompt
Or
set timeout 60;
send "python script.py\r"
expect "C:\\Users\\user>"; # Now expecting for the prompt
But if the execution time varies, it is better to handle the timeout event and keep waiting for the prompt up to some bounded amount of time, i.e.
set timeout 60; # Setting timeout as 1 min
set counter 0
send "python script.py\r"
# Re-arm 'expect' on every timeout until the prompt appears,
# giving up after 5 timeouts (i.e. about 5 minutes).
expect {
    timeout {
        # Increase 'counter' on each timeout and continue with 'expect'
        incr counter
        if {$counter == 5} {
            # We have waited 5 mins already, so exit the program.
            puts "Might be some problem with python script"
            exit 1
        }
        puts "Waiting for the completion of script..."
        exp_continue; # Causes the 'expect' to run again
    }
    # Now expecting the prompt
    "C:\\Users\\user>" {puts "Script execution is completed"}
}
A simpler alternative, if you don't care how long it takes to complete:
set timeout -1
# rest of your code here ...

Monit - how to identify crashes of a program instead of restarts

I am using monit to monitor my program. The program being monitored can potentially crash in two situations:
It can crash randomly; it just needs to be restarted.
It gets into a bad state and crashes each time it is subsequently started.
To fix the latter situation, I have a script that stops the program, resets it to a good state by cleaning its data files, and restarts it. I tried the config below:
check process program with pidfile program.pid
    start program = "programStart" as uid username and gid groupname
    stop program = "programStop" as uid username and gid groupname
    if 3 restarts within 20 cycles then exec "cleanProgramAndRestart" as uid username and gid groupname
    if 6 restarts within 20 cycles then timeout
Say monit restarts the program 3 times in 3 cycles. After it is restarted the third time, the cleanProgramAndRestart script runs. However, since cleanProgramAndRestart restarts the program yet again, the condition of 3 restarts is met again in the next cycle, and it becomes an infinite loop.
Could anyone suggest any way to fix this?
If any of the actions below were possible, there might be a way around this:
If there were a "crash" keyword instead of "restarts", I would be able to run the clean script after the program crashes 3 times rather than after it is restarted 3 times.
If there were a way to reset the "restarts" counter after running the exec script.
If there were a way to exec something only when the outcome of the "3 restarts" condition changes.
Monit polls your "tests" every cycle. The cycle length is usually defined in /etc/monitrc, via set daemon cycle_length.
So if your cleanProgramAndRestart takes less than a cycle to perform, this shouldn't happen.
Since it is happening, I guess your cleanProgramAndRestart takes more than a cycle to perform.
You can:
Increase the cycle length in the Monit configuration
Check your program every x cycles (make sure that cycle_length * x > cleanProgramAndRestart_length), as sketched below
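For example (a rough sketch; the numbers are placeholders and need to be sized against how long cleanProgramAndRestart actually takes):
# /etc/monitrc: make each cycle 60 seconds long
set daemon 60

# poll this service only every 3 cycles, i.e. every 3 minutes here
check process program with pidfile program.pid every 3 cycles
    start program = "programStart" as uid username and gid groupname
    stop program = "programStop" as uid username and gid groupname
    if 3 restarts within 20 cycles then exec "cleanProgramAndRestart" as uid username and gid groupname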
If you can't modify these variables, there could be a little workaround, with a temp file:
check process program with pidfile program.pid
    start program = "programStart" as uid username and gid groupname
    stop program = "programStop" as uid username and gid groupname
    if 3 restarts within 20 cycles then exec "touch /tmp/program_crash"
    if 6 restarts within 20 cycles then timeout

check file program_crash with path /tmp/program_crash every x cycles # (make sure that cycle_length*x > cleanProgramAndRestart_length)
    if changed timestamp then exec "cleanProgramAndRestart" as uid username and gid groupname
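Note that /tmp/program_crash should exist before monit starts (create it once with touch /tmp/program_crash, for instance); otherwise the file check itself is likely to raise a "does not exist" alert before the process has ever crashed.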