I am spawning 2 processes and it seems I cannot kill either of them:
restarter - a process that respawns the worker whenever it goes down
worker - a process that gets messages from the shell, accumulates them and returns them in the exit reason to the restarter, which in turn forwards them to the shell.
The worker process can't be killed, since the restarter restarts it on any trapped exit message. But what keeps the restarter process alive?
-module(mon).
-compile_flags([debug_info]).
-export([worker/1, init/0, restarter/2, clean/1]).

% ctrl+g
init() ->
    Pid = spawn(?MODULE, restarter, [self(), []]),
    register(restarter, Pid),
    Pid.

restarter(Shell, Queue) ->
    process_flag(trap_exit, true),
    Wk = spawn_link(?MODULE, worker, [Queue]),
    register(worker, Wk),
    receive
        {'EXIT', Pid, {Queue, normal}} ->
            Shell ! {Queue, "From res: worker died peacefully, wont restart"};
        {'EXIT', Pid, {Queue, horrible}} ->
            Shell ! {Queue, "Processed so far:"},
            Shell ! "will restart in 5 seconds, select fresh/stale -> 1/0",
            receive
                1 ->
                    Shell ! "Will restart fresh",
                    restarter(Shell, []);
                0 ->
                    Shell ! "Will continue work",
                    restarter(Shell, Queue)
            after 5000 ->
                Shell ! "No response -> started with 666",
                restarter(Shell, [666])
            end;
        {MSG} ->
            Shell ! {"Unknown message...closing", MSG}
    end.

worker(Queue) ->
    receive
        die -> exit({Queue, horrible});
        finish -> exit({Queue, normal});
        MSG -> worker([{time(), MSG} | Queue])
    end.
Usage
mon:init().
regs(). %worker and restarter are working
whereis(worker) ! "msg 1", whereis(worker) ! "msg2".
whereis(worker) ! finish.
flush(). % should get the first clause from restarter
regs(). % worker should be up and running again
exit(whereis(restarter),reason).
regs(). % restarter should be dead
In this scenario, the restarter process is trapping exits, so exit(whereis(restarter), reason) doesn't kill it. The exit signal gets converted to a message, and gets put into the message queue of the process:
> process_info(whereis(restarter), messages).
{messages,[{'EXIT',<0.76.0>,reason}]}
The reason it's still in the message queue is that none of the clauses in the receive expression matches this message. The first two clauses are specific to the exit reasons used by the worker process, and the last clause might look like a catch-all clause, but it actually isn't: it matches any message that is a tuple with one element. If it were written as MSG instead of {MSG}, it would have received the exit reason message and sent "Unknown message" to the shell.
If you really want to kill the process, use the kill reason:
exit(whereis(restarter), kill).
A kill exit signal is untrappable, even if the process is trapping exits.
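For illustration, the difference looks roughly like this in the shell (a sketch, not output copied from a real session):
exit(whereis(restarter), reason).   % trapped: converted to a message, restarter survives
whereis(restarter).                 % still returns a pid
exit(whereis(restarter), kill).     % untrappable: restarter dies
whereis(restarter).                 % now returns undefined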
Another thing: the first two receive clauses will only match if the worker hasn't added anything to its queue. That is because they reuse the variable name Queue, so the queue in {'EXIT',Pid,{Queue,normal}} must be equal to the value passed as an argument to the restarter function. In a situation like this, you'd normally use NewQueue or something similar as the variable in the receive clauses.
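By way of illustration, here is a minimal sketch of the restarter's receive with both fixes applied (fresh NewQueue bindings in the patterns, and a bare Other variable as the real catch-all); the 5-second fresh/stale prompt from the original is left out to keep it short:
restarter(Shell, Queue) ->
    process_flag(trap_exit, true),
    Wk = spawn_link(?MODULE, worker, [Queue]),
    register(worker, Wk),
    receive
        {'EXIT', _Pid, {NewQueue, normal}} ->
            Shell ! {NewQueue, "From res: worker died peacefully, wont restart"};
        {'EXIT', _Pid, {NewQueue, horrible}} ->
            Shell ! {NewQueue, "Processed so far:"},
            restarter(Shell, NewQueue);   % restart, keeping the worker's queue
        Other ->
            Shell ! {"Unknown message...closing", Other}
    end.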
I have a simple monit control file which contains the following:
check process fooBar1 with pidfile fooBar1PidFile
    start program = "/etc/init.d/fooBar1 start" with timeout 10 seconds
    stop program = "/etc/init.d/fooBar1 stop"
    if 5 restarts within 5 cycles then unmonitor

check process fooBar2 with pidfile fooBar2PidFile
    start program = "/etc/init.d/fooBar2 start" with timeout 10 seconds
    stop program = "/etc/init.d/fooBar2 stop"
    if 5 restarts within 5 cycles then unmonitor

check process fooBar with pidfile fooBarPidFile
    start program = "/etc/init.d/fooBar start" with timeout 10 seconds
    stop program = "/etc/init.d/fooBar stop"
    if 5 restarts within 5 cycles then unmonitor
    if memory usage > 25.0 MB for 4 cycles then alert
    depends on fooBar1
    depends on fooBar2
    depends on checkFile

check file checkFile with path pathToFile
    if timestamp > 4 minute for 8 cycles then restart
Here the intention is to restart the fooBar, fooBar1 and fooBar2 applications when the timestamp check for checkFile fails. But what actually happens is that it tries to restart checkFile instead of fooBar.
This check was working fine with monit version 5.5, but it does not work with 5.18.
This is what I get when the timestamp check fails and the 8 cycles have elapsed:
'checkFile' timestamp for pathToFile failed -- current timestamp is Fri, 08 Dec 2017 12:47:04
'fooBar' failed to start -- could not start required services: 'checkFile'
'fooBar' start action failed
Am I missing something here?
Thanks in advance
I tried a workaround:
check file checkFile with path pathToFile
    if timestamp > 4 minute for 8 cycles then
        exec "/etc/init.d/fooBar restart"
        repeat every 1 cycles
This restarts the fooBar application every cycle while the timestamp check fails. But I am wondering, is there a better way to do it?
What I would like to do is the following:
if process-x fails (to (re)start) then execute cmd-x
if it recovers then execute cmd-y
For the alerting via e-mail, a notification is sent by default on recovery. For the exec method, however, I cannot find a way to make this work. If I try this in the monitrc:
check process proc_x with pidfile /var/run/proc_x.pid
    start program = "/bin/sh -c '/etc/init.d/Sxxproc_x start'"
    stop program = "/bin/sh -c '/etc/init.d/Sxxproc_x stop'"
    if 3 restarts within 5 cycles then exec "<some error cmd>"
    else if succeeded then exec "<some restore cmd>"
this results in a "syntax error 'else'". If I remove the else line, the error command is called as expected. Apparently, the 'else' cannot be used with the restarts test. But how can I execute a command when the program starts successfully or recovers?
I found a solution thanks to the answer to this topic:
get monit to alert first and restart later
The "if not exist for ..." with corresponding "else" did the trick for me to report the recover. The error report is separate. My monitrc code now:
check process proc_x with pidfile /var/run/proc_x.pid
    start program = "/bin/sh -c '/etc/init.d/Sxxproc_x start'"
    stop program = "/bin/sh -c '/etc/init.d/Sxxproc_x stop'"
    if 1 restart within 1 cycle then exec "<some error cmd>"
        repeat every 1 cycle
    if not exist for 3 cycles then restart
    else if succeeded 2 times within 2 cycles then exec "<some restore cmd>"
I am completely new to Expect, and I want to run my Python script via Telnet.
The Python script takes about 1 minute to execute, but when I try to run it via Telnet with Expect, it doesn't work.
I have this simple Expect code:
#! /usr/bin/expect
spawn telnet <ip_addr>
expect "login"
send "login\r"
expect "assword"
send "password\r"
expect "C:\\Users\\user>\r"
send "python script.py\r"
expect "C:\\Users\\user>\r"
close
When I replace script.py with one that has a shorter execution time, it works great. Could you tell me what I should change so that Expect waits until my script.py process terminates? Should I use timeout or sleep?
If you are sure about the execution time of the script, then you can add a sleep or set the timeout to the desired value:
send "python script.py\r"
sleep 60; # Sleeping for 1 min
expect "C:\\Users\\user>"; # Now expecting for the prompt
Or
set timeout 60;
send "python script.py\r"
expect "C:\\Users\\user>"; # Now expecting for the prompt
But if the execution time varies, it is better to handle the timeout event and keep waiting for the prompt up to some maximum amount of time, i.e.:
set timeout 60; # Setting timeout as 1 min;
set counter 0
send "python script.py\r"
# Wait for the prompt; on each timeout (1 min) bump the counter and keep
# waiting, giving up after 5 timeouts (i.e. 5 mins).
expect {
    timeout {
        incr counter
        if {$counter == 5} {
            puts "Might be some problem with python script"
            exit 1
        }
        puts "Waiting for the completion of script..."
        exp_continue;   # Causes the 'expect' to run again
    }
    "C:\\Users\\user>" {puts "Script execution is completed"}
}
A simpler alternative, if you don't care how long it takes to complete:
set timeout -1
# rest of your code here ...
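For completeness, here is a sketch of the question's script with the timeout disabled; the <ip_addr>, login, password and prompt strings are the placeholders from the question:
#!/usr/bin/expect
# Same flow as the question's script, but with no expect timeout before the
# final prompt wait, so it blocks until the Python script finishes.
spawn telnet <ip_addr>
expect "login"
send "login\r"
expect "assword"
send "password\r"
expect "C:\\Users\\user>"
send "python script.py\r"
set timeout -1;   # wait indefinitely for the next match
expect "C:\\Users\\user>"
close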
I am using monit to monitor my program. The program being monitored can potentially crash under 2 situations:
The program can crash randomly; it just needs to be restarted.
It gets into a bad state and crashes each time it is subsequently started.
To fix the latter situation, I have a script that stops the program, resets it to a good state by cleaning its data files, and restarts it. I tried the config below:
check process program with pidfile program.pid
    start program = "programStart" as uid username and gid groupname
    stop program = "programStop" as uid username and gid groupname
    if 3 restarts within 20 cycles then exec "cleanProgramAndRestart" as uid username and gid groupname
    if 6 restarts within 20 cycles then timeout
Say monit restarts the program 3 times in 3 cycles. After it is restarted the third time, the cleanProgramAndRestart script runs. However, as the cleanProgramAndRestart script restarts the program yet again, the condition of 3 restarts is met again in the next cycle, and it becomes an infinite loop.
Could anyone suggest any way to fix this?
If any of the actions below were possible, there might be a way around this:
If there were a "crash" keyword instead of "restarts", I could run the clean script after the program crashes 3 times instead of after it is restarted 3 times.
If there were a way to reset the "restarts" counter after running the exec script.
If there were a way to exec something only when the outcome of the "3 restarts" condition changes.
Monit polls your "tests" every cycle. The cycle length is usually defined in /etc/monitrc, via set daemon cycle_length.
So if your cleanProgramAndRestart takes less than a cycle to complete, this shouldn't happen.
Since it is happening, I guess your cleanProgramAndRestart takes more than a cycle to complete.
You can:
Increase the cycle length in Monit configuration
Check your program every x cycles (make sure that cycle_length * x > cleanProgramAndRestart_length); see the sketch below
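As an illustration only, here is a minimal monitrc sketch of those two options; the 30-second cycle and the "every 5 cycles" value are made-up example numbers, not recommendations:
set daemon 30    # example: one cycle = 30 seconds

check process program with pidfile program.pid
    every 5 cycles    # example: run this service's tests only every 5th cycle (150 s here)
    start program = "programStart" as uid username and gid groupname
    stop program = "programStop" as uid username and gid groupname
    if 3 restarts within 20 cycles then exec "cleanProgramAndRestart" as uid username and gid groupname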
If you can't modify these variables, there could be a little workaround, with a temp file:
check process program
    with pidfile program.pid
    start program = "programStart"
        as uid username and gid groupname
    stop program = "programStop"
        as uid username and gid groupname
    if 3 restarts within 20 cycles
        then exec "touch /tmp/program_crash"
    if 6 restarts within 20 cycles then timeout

check file program_crash with path /tmp/program_crash every x cycles # (make sure that cycle_length*x > cleanProgramAndRestart_length)
    if changed timestamp then exec "cleanProgramAndRestart"
        as uid username and gid groupname