Monit: wait for a file before starting a process?

Basically, I want monit to start a process "CAD" when a file "product_id" is ready. My config is as below:
check file product_id with path /etc/platform/product_id
    if does not exist then alert

check process cad with pidfile /var/run/cad.pid
    depends on product_id
    start = "/bin/sh -c 'cd /home/root/cad/scripts; ./run-cad.sh 2>&1 | logger -t CAD'" with timeout 120 seconds
    stop  = "/bin/sh -c 'cd /home/root/cad/scripts; ./stop-cad.sh 2>&1 | logger -t CAD'"
I'm expecting monit to call "start" until the file is available, but instead it seems to restart the process (stop, then start) every cycle.
Is there anything configured wrong here?
Appreciate any help.

The reason it restarts every cycle is that the product_id file is not ready. Anything that depends on product_id will be restarted whenever that check fails.
I would suggest writing a script that checks for the existence of product_id and starts CAD if it's there. You could then run this script from a "check program" block in monit.

This is how I do it:
check program ThisIsMyProgram with path "/home/user/program_check.sh"
    every 30 cycles
    if status == 1 then alert
This runs the shell script every 30 cycles and raises an alert if it exits with status 1.
Shell script:
#!/bin/bash
# File that must exist before the program may run:
FILE=/path/to/file/that/needs/to/exist.json
PID=$(sudo pidof ThisIsMyProgram)

if [ -s "$FILE" ]; then
    if [ -n "$PID" ]; then
        # File present and process already running: all good.
        exit 0
    else
        # File present but process not running: start it and alert once.
        sudo service thisismyprogram start >> /dev/null 2>&1
        exit 1
    fi
else
    # File not there yet: nothing to do.
    exit 0
fi
The shell script checks whether the file exists; once it does, the script starts the process if it isn't already running, and keeps it running.
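Adapted to the question above, a rough sketch might look like this (the paths and pidfile come from the question; the script name cad_check.sh is hypothetical and the sketch is untested):
#!/bin/bash
# cad_check.sh -- start CAD only once /etc/platform/product_id exists.
FILE=/etc/platform/product_id
PIDFILE=/var/run/cad.pid

# File not there yet: report OK and just wait for the next cycle.
[ -s "$FILE" ] || exit 0

# File is ready: nothing to do if CAD is already running.
if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    exit 0
fi

# File is ready but CAD is down: start it and flag the event.
( cd /home/root/cad/scripts && ./run-cad.sh 2>&1 | logger -t CAD )
exit 1
The matching monit stanza is then a check program block instead of the check process/depends pair:
check program cad_check with path "/home/root/cad/scripts/cad_check.sh"
    if status == 1 then alert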


Using tail to monitor an active logging file

I'm running multiple 'shred' commands on multiple hard drives in a workstation. The 'shred' commands are all run in the background so that they run concurrently. The output of each 'shred' is redirected to a text file, and I also have the output directed to the terminal. I'm using tail to monitor the log file for errors and halt the script if any are encountered; if there are no errors, the script should simply continue to completion.

When I test it by forcing a drive failure (disconnecting a drive), it detects the I/O errors and the script halts as expected. The problem is that when there are NO errors, I cannot get 'tail' to terminate once the 'shred' commands have completed, and the script just hangs at that point. Since I put the 'tail' command in the 'while' loop below, I would have thought that 'tail' would continue to run as long as the 'shred' processes were running, but would then halt after the 'shred' processes stopped, thus ending the 'while' loop. That hasn't been the case: the script still hangs even after the 'shred' processes have ended. If I go to another terminal window while the script is "hanging" and kill the 'tail' process, the script continues as normal.

Any ideas how to get the 'tail' process to end when the 'shred' processes are gone?
My code:
shred -n 3 -vz /dev/sda 2>&1 | tee -a logfile &
shred -n 3 -vz /dev/sdb 2>&1 | tee -a logfile &
shred -n 3 -vz /dev/sdc 2>&1 | tee -a logfile &

pids=$(pgrep shred)
while kill -0 $pids 2> /dev/null; do
    tail -qn0 -f logfile | \
    read LINE
    echo "$LINE" | grep -q "error"
    if [ $? = 0 ]; then
        killall shred > /dev/null 2>&1
        echo "Error encountered. Halting."
        exit
    fi
done
wait $pids
There is other code after the 'wait' that does other things, but this is where the script hangs.
Not directly related to the question, but you can use Daggy - Data Aggregation Utility.
In this case, all subprocesses will end together with the main daggy process.
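If GNU coreutils is available, tail's --pid option (with -f, terminate after the given process dies) gives tail a natural way to stop once the shreds are done. A minimal, untested sketch, where a small watcher subshell gives tail a single PID to follow:
#!/bin/bash
shred -n 3 -vz /dev/sda 2>&1 | tee -a logfile &
shred -n 3 -vz /dev/sdb 2>&1 | tee -a logfile &
shred -n 3 -vz /dev/sdc 2>&1 | tee -a logfile &

# Watcher: exits as soon as no shred processes remain.
( while pgrep shred > /dev/null; do sleep 5; done ) &
watcher=$!

# tail exits on its own once the watcher dies, so this cannot hang;
# grep -q succeeds only if an "error" line ever appears in the log.
if tail -qn0 --pid="$watcher" -f logfile | grep -q "error"; then
    killall shred > /dev/null 2>&1
    echo "Error encountered. Halting."
    exit 1
fi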

Can't terminate a process launched at bootup with the at daemon

I have a fooinit.rt process launched at boot (from /etc/init.d/boot.local).
Here is the boot.local file:
...
/bin/fooinit.rt &
...
I create an at job in order to kill fooinit.rt; it is triggered from C code. I also wrote a stop script in which a kill -9 on the pid of fooinit.rt is supposed to run.
Here is the stop script:
#!/bin/sh
proc_file="/tmp/gdg_list$$"
ps -ef | grep $USER > $proc_file
echo "Stop script is invoked!!"
suff=".rt"
pid=`fgrep "$suff" $proc_file | awk '{print $2}'`
echo "pid is '$pid'"
rm $proc_file
When the at job timer expires, the 'kill -9 pid' (of fooinit.rt) command cannot terminate the fooinit.rt process!
I checked the pid number that is printed, and the sentence "Stop script is invoked!!" does appear, so the script runs.
Here is the "at" job command in the C code (I verified that the stop script is called 1 minute later):
...
case 708: /* There is a trigger signal here */
{
    result = APP_RES_PRG_OK;
    system("echo '/sbin/stop' | at now + 1 min");
}
...
On the other hand, it works properly when fooinit.rt is launched manually from the shell as an ordinary command (not from /etc/init.d/boot.local); then kill -9 works and terminates the fooinit.rt process.
Do you have any idea why kill -9 cannot terminate the fooinit.rt process if it is launched from /etc/init.d/boot.local?
Your solution is built around a race condition. There is no guarantee it will kill the right process (an unknowable amount of time can pass between the ps call and the attempt to make use of the pid), plus it's also vulnerable to a tmp exploit: someone could create a few thousand symlinks under /tmp called "gdg_list[1-32767]" that point to /etc/shadow and your script would overwrite /etc/shadow if it runs as root.
Another potential problem is the setting of $USER -- have you made sure it's correct? Your at job will be called as the user your C program runs as, which may not be the same user your fooinit.rt runs as.
Also, your script doesn't include a kill command at all.
A much cleaner way of doing this would be to run your fooinit.rt under some process supervisor like runit and use runit to shut it down when it's no longer needed. That avoids the pid bingo as well as the /tmp attack vector.
But even using pkill -u username -f fooinit.rt would be less racy than the script you provided.
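If you go the runit route, a minimal setup might look like this (the service name fooinit and its directory are assumptions for the sketch, not something from the question):
#!/bin/sh
# /etc/service/fooinit/run -- must be executable. runit's runsv
# supervises whatever this script execs into and tracks its pid.
exec /bin/fooinit.rt
The at job (or the C code's system() call) can then stop it without any pid guessing:
echo 'sv down fooinit' | at now + 1 min
sv down asks the supervisor to TERM the process; sv kill fooinit sends a SIGKILL if it refuses to die.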

simulate nagios notifications

My normal method of testing the notification and escalation chain is to simulate a failure by causing one, for example blocking a port.
But this is thoroughly unsatisfying. I don't want downtime recorded in Nagios where there was none. I also don't want to wait.
Does anyone know a way to test a notification chain without causing the outage? For example something like this:
$ ./check_notifications_chain <service|host> <time down>
at <x> minutes notification email sent to group <people>
at <2x> minutes notification email sent to group <people>
at <3x> minutes escalated to group <management>
at <200x> rm -rf; shutdown -h now executed.
Extending this paradigm, I might make the notification chain a Nagios check in itself, but I'll stop here before my brain explodes.
Anyone?
If you only want to verify that the email alerts are working properly, you could create a simple test service, which generates a warning once a day.
test_alert.sh:
#!/bin/bash
# Goes WARNING between 19:00 and 19:20 UTC, OK the rest of the day.
date=$(date -u +%H%M)
echo $date
echo "Nagios test script. Intentionally generates a warning daily."
# 10# forces base-10: values like "0859" would otherwise be treated
# as (invalid) octal numbers in an arithmetic comparison.
if (( 10#$date >= 1900 && 10#$date <= 1920 )); then
    exit 1
else
    exit 0
fi
commands.cfg:
define command{
    command_name    test_alert
    command_line    /bin/bash /usr/local/scripts/test_alert.sh
}
services.cfg:
define service {
    host_name           localhost
    service_description Test Alert
    check_command       test_alert
    use                 generic-service
}
With this in place, the service goes WARNING between 19:00 and 19:20 UTC each day, which exercises the notification chain without any real outage.
This is an old post, but maybe my solution can help someone.
I use the "check_dummy" plugin, which is in the Nagios plugins pack.
As the name suggests, it is deliberately dumb.
Here are some examples of how it works:
Usage:
check_dummy <integer state> [optional text]
$ ./check_dummy 0
OK
$ ./check_dummy 2
CRITICAL
$ ./check_dummy 3 salut
UNKNOWN: salut
$ ./check_dummy 1 azerty
WARNING: azerty
$ echo $?
1
I create a file which contains the integer state and the optional text:
echo 0 OKAY | sudo tee /usr/local/nagios/libexec/dummy.txt
sudo chown nagios:nagios /usr/local/nagios/libexec/dummy.txt
With this command definition:
# Dummy check (notifications tests)
define command {
    command_name    my_check_dummy
    command_line    $USER1$/check_dummy $(cat /usr/local/nagios/libexec/dummy.txt)
}
Associated with this service definition:
define service {
    use                     generic-service
    host_name               localhost
    service_description     Dummy check
    check_period            24x7
    check_interval          1
    max_check_attempts      1
    retry_interval          1
    notifications_enabled   1
    notification_options    w,u,c,r
    notification_interval   0
    notification_period     24x7
    check_command           my_check_dummy
}
So I just change the contents of the file "dummy.txt" to change the service state:
echo "2 Oups" | sudo tee /usr/local/nagios/libexec/dummy.txt
echo "1 AHHHH" | sudo tee /usr/local/nagios/libexec/dummy.txt
echo "0 Parfait !" | sudo tee /usr/local/nagios/libexec/dummy.txt
This allowed me to debug my notification program.
Hope it helps!

glassfish start script fails through crontab

I have created a script to check whether my GlassFish server is running (installed on a FreeBSD system). If it isn't, the script attempts to kill the java process to ensure it's not hung, and then issues the asadmin start-domain command.
If this script runs from the command line, it is successful 100% of the time. When it is run from the crontab, every line runs except the asadmin start-domain line - it does not seem to execute, or at least does not complete, i.e. the server is not running after this script runs.
For anyone not familiar with GlassFish or the asadmin utility used to start the server, it is my understanding that a forked process is used. Could this be causing a problem via cron?
Again, in all my tests today, the script runs to completion when run from the command line. Once it's executed through cron, it does not complete... what would be different running this from the crontab?
Thanks in advance for any help... I'm pulling my hair out trying to make this work!
#!/bin/bash
JAVA_HOME=/usr/local/diablo-jdk1.6.0/; export JAVA_HOME
timevar=`date +%d-%m-%Y_%H.%M.%S`
process_name='java'
get_contents=`cat urls.txt`

for i in $get_contents
do
    echo checking $i
    statuscode=$(curl --connect-timeout 10 --write-out %{http_code} --silent --output /dev/null $i)
    case $statuscode in
        200)
            echo "$timevar $i $statuscode okay" >> /usr/home/user1/logfile.txt
            ;;
        *)
            echo "$timevar $i $statuscode bad" >> /usr/home/user1/logfile.txt
            echo "Status $statuscode found" | mail -s "Check of $i failed" some.address@gmail.com
            process_id=`ps acx | grep -i $process_name | awk '{print $1}'`
            if [ -z "$process_id" ]
            then
                echo "java wasn't found in the process list"
            else
                echo "Killing java, currently process $process_id"
                kill -9 $process_id
            fi
            /usr/home/user1/glassfish3/bin/asadmin start-domain domain1
            ;;
    esac
done
Also, just for completeness, here is the entry in the crontab:
*/2 * * * * /usr/home/user1/server.check.sh >> /usr/home/user1/cron.log
Ok... found the answer to this on another site, but I thought I'd add the answer in here for future reference.
The problem was the PATH! Even though JAVA_HOME was set, java itself wasn't in the PATH for the cron daemon.
For a quick test to see what PATH is available to your cron jobs, add this line:
*/2 * * * * env > /usr/home/user1/env.output
From what I can gather, the PATH initially available to cron is pretty minimal. Since java was in /usr/local/bin, I added that to the PATH right in the crontab and kaboom! It worked:
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
*/2 * * * * /usr/home/user1/server.check.sh >> /usr/home/user1/cron.log
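A common alternative (not from the original answer, just a widely used pattern) is to set the PATH inside the script itself, so it behaves the same no matter what invokes it:
#!/bin/bash
# Set PATH explicitly so the script no longer depends on cron's
# minimal environment (same directories as the crontab fix above).
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
export PATH
Either spot works; putting it in the script keeps the fix next to the code that needs it.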

bash script to restart Apache automatically

I wrote a bash script to restart Apache when it hangs and send email to the admin. The code is shown below. It restarts Apache if the number of Apache processes is zero. The problem: Apache sometimes hangs while the process count is still non-zero, so in that case the script will not restart Apache.
What I need: how do I modify the code so that it also restarts Apache when it hangs but the process count is non-zero?
#!/bin/bash
if [ `pgrep apache2 -c` -le "0" ]; then
    /etc/init.d/apache2 stop
    pkill -u www-data
    /etc/init.d/apache2 start
    echo "restarting....."
    SUBJECT="Apache auto restart"
    # Email To ?
    EMAIL="me@mydomain.com"
    # Email text/message
    EMAILMESSAGE="apache auto restart done"
    # send an email using /bin/mail; the message goes on stdin
    echo "$EMAILMESSAGE" | /bin/mail -s "$SUBJECT" "$EMAIL"
fi
We used to have Apache segfaulting sometimes on a machine; here's the script we used trying to debug the problem while keeping Apache up. It ran from cron (as root) once every minute or so. It should be self-explanatory.
#!/bin/sh
# Script that checks whether apache is still up, and if not:
# - e-mail the last bit of log files
# - kick some life back into it
# -- Thomas, 20050606
PATH=/bin:/usr/bin
THEDIR=/tmp/apache-watchdog
EMAIL=yourself@example.com
mkdir -p $THEDIR

if ( wget --timeout=30 -q -P $THEDIR http://localhost/robots.txt )
then
    # we are up
    touch ~/.apache-was-up
else
    # down! but if it was down already, don't keep spamming
    if [[ -f ~/.apache-was-up ]]
    then
        # write a nice e-mail
        echo -n "apache crashed at " > $THEDIR/mail
        date >> $THEDIR/mail
        echo >> $THEDIR/mail
        echo "Access log:" >> $THEDIR/mail
        tail -n 30 /var/log/apache2_access/current >> $THEDIR/mail
        echo >> $THEDIR/mail
        echo "Error log:" >> $THEDIR/mail
        tail -n 30 /var/log/apache2_error/current >> $THEDIR/mail
        echo >> $THEDIR/mail
        # kick apache
        echo "Now kicking apache..." >> $THEDIR/mail
        /etc/init.d/apache2 stop >> $THEDIR/mail 2>&1
        killall -9 apache2 >> $THEDIR/mail 2>&1
        /etc/init.d/apache2 start >> $THEDIR/mail 2>&1
        # send the mail
        echo >> $THEDIR/mail
        echo "Good luck troubleshooting!" >> $THEDIR/mail
        mail -s "apache-watchdog: apache crashed" $EMAIL < $THEDIR/mail
        rm ~/.apache-was-up
    fi
fi

rm -rf $THEDIR
We never did figure out the problem...
Can the count of a process really be less than zero?
This should be sufficient:
if ! pgrep apache2 -c >/dev/null; then
You could try to send an HTTP request to Apache (e.g. using wget --timeout=10), and if that request times out or fails (exit status != 0), you kill and restart Apache.
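A minimal sketch of that idea (the URL, timeouts, and the address reused from the question are placeholders to adjust):
#!/bin/bash
# Restart Apache when it stops answering HTTP, even if its processes
# are still alive (i.e. hung rather than dead).
if ! wget --timeout=10 --tries=1 -q -O /dev/null http://localhost/; then
    /etc/init.d/apache2 stop
    pkill -u www-data            # ask any hung workers to exit
    sleep 2
    pkill -9 -u www-data         # force-kill stragglers, if any
    /etc/init.d/apache2 start
    echo "apache auto restart done" | /bin/mail -s "Apache auto restart" me@mydomain.com
fi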
Why would Apache hang? Can you get to the cause?
There are a number of scripts and tools out there to 'daemonize' apps and watch over them. As you seem to be on Debian or Ubuntu, have a look at the packages daemon and daemontools. I am sure there are others too.