Is it possible to have an overall job status set to OK if at least one node reports ok?
Currently my job runs tasks on Docker and will only succeed on the leader, failing on the other nodes. I would like the job to be marked OK as long as it has run successfully on at least one node. Is this possible?
You can do it this way:
You need two jobs. The first one points to your nodes and takes an option that is passed into the node filter. The second one calls the first one via the API from an inline script.
You only need to check the status of the second job.
First job:
<joblist>
  <job>
    <context>
      <options preserveOrder='true'>
        <option name='thenode' />
      </options>
    </context>
    <defaultTab>nodes</defaultTab>
    <description></description>
    <dispatch>
      <excludePrecedence>true</excludePrecedence>
      <keepgoing>false</keepgoing>
      <rankOrder>ascending</rankOrder>
      <successOnEmptyNodeFilter>false</successOnEmptyNodeFilter>
      <threadcount>1</threadcount>
    </dispatch>
    <executionEnabled>true</executionEnabled>
    <id>9fcc183b-b4df-4554-b75b-c660eda706b3</id>
    <loglevel>INFO</loglevel>
    <name>TestFWISE</name>
    <nodeFilterEditable>false</nodeFilterEditable>
    <nodefilters>
      <filter>${option.thenode}</filter>
    </nodefilters>
    <nodesSelectedByDefault>true</nodesSelectedByDefault>
    <scheduleEnabled>true</scheduleEnabled>
    <sequence keepgoing='true' strategy='node-first'>
      <command>
        <exec>sh hello.sh</exec>
      </command>
    </sequence>
    <uuid>9fcc183b-b4df-4554-b75b-c660eda706b3</uuid>
  </job>
</joblist>
Second job:
<joblist>
  <job>
    <defaultTab>nodes</defaultTab>
    <description></description>
    <executionEnabled>true</executionEnabled>
    <id>18bbd45e-5301-4498-8b92-0c4828194b61</id>
    <loglevel>INFO</loglevel>
    <name>Runner</name>
    <nodeFilterEditable>false</nodeFilterEditable>
    <scheduleEnabled>true</scheduleEnabled>
    <sequence keepgoing='false' strategy='node-first'>
      <command>
        <fileExtension>sh</fileExtension>
        <script><![CDATA[#!/bin/bash
# script that fails only when the job fails on all nodes
# https://stackoverflow.com/questions/58798856/rundeck-fail-a-job-only-when-all-nodes-fail
#####################################################
# rundeck instance values
server="localhost"
port="4440"
api="32"
jobid="9fcc183b-b4df-4554-b75b-c660eda706b3"
token="fb4ra5Rc1d71rOYhFxXK9a1vtMXtwVZ1"
#####################################################
# 1 - succeeded on at least one node
# 0 - failed on all nodes
#####################################################
flag="0"
#####################################################
# list of all nodes to pass via the job option
mynodes='node01'
#####################################################
# precautionary wait time between actions
pt="2"
#####################################################
# 1) run the job via the API and store the execution ID.
# 2) take the execution ID and get the execution status via the API.
# 3) use that status to decide whether the job succeeded on at least one node.
for currentnode in $mynodes
do
  sleep $pt
  execid=$(curl -H "Accept: application/json" --data-urlencode "argString=-thenode $currentnode" "http://$server:$port/api/$api/job/$jobid/run?authtoken=$token" | jq -r '.id')
  # only for debug
  echo "execution id: $execid"
  sleep $pt
  status=$(curl -H "Accept: application/json" --location --request GET "http://$server:$port/api/$api/execution/$execid/state?authtoken=$token" | jq -r '.executionState')
  # only for debug
  echo "status is: $status"
  # if the job ran OK, set the flag to 1
  if [ "$status" = "SUCCEEDED" ]
  then
    flag="1"
  fi
done
sleep $pt
# only for debug
echo "flag value: $flag"
#####################################################
# now check the flag
if [ "$flag" -eq "1" ]
then
  echo "the job succeeded on at least one node."
  exit 0 # rundeck job ends normally :3
else
  echo "the job failed on all nodes."
  exit 1 # rundeck job ends with error :(
fi]]></script>
        <scriptargs />
        <scriptinterpreter>/bin/bash</scriptinterpreter>
      </command>
    </sequence>
    <uuid>18bbd45e-5301-4498-8b92-0c4828194b61</uuid>
  </job>
</joblist>
The second job fails only if the first job fails on all nodes. If the first job succeeds on at least one node, the second job reports OK.
If you need the script on its own, here it is:
#!/bin/bash
# script that fails only when the job fails on all nodes
# https://stackoverflow.com/questions/58798856/rundeck-fail-a-job-only-when-all-nodes-fail
#####################################################
# rundeck instance values
server="localhost"
port="4440"
api="32"
jobid="9fcc183b-b4df-4554-b75b-c660eda706b3"
token="fb4ra5Rc1d71rOYhFxXK9a1vtMXtwVZ1"
#####################################################
# 1 - succeeded on at least one node
# 0 - failed on all nodes
#####################################################
flag="0"
#####################################################
# list of all nodes to pass via the job option
mynodes='node00 node01'
#####################################################
# precautionary wait time between actions
pt="2"
#####################################################
# 1) run the job via the API and store the execution ID.
# 2) take the execution ID and get the execution status via the API.
# 3) use that status to decide whether the job succeeded on at least one node.
for currentnode in $mynodes
do
  sleep $pt
  execid=$(curl -H "Accept: application/json" --data-urlencode "argString=-thenode $currentnode" "http://$server:$port/api/$api/job/$jobid/run?authtoken=$token" | jq -r '.id')
  # only for debug
  echo "execution id: $execid"
  sleep $pt
  status=$(curl -H "Accept: application/json" --location --request GET "http://$server:$port/api/$api/execution/$execid/state?authtoken=$token" | jq -r '.executionState')
  # only for debug
  echo "status is: $status"
  # if the job ran OK, set the flag to 1
  if [ "$status" = "SUCCEEDED" ]
  then
    flag="1"
  fi
done
sleep $pt
# only for debug
echo "flag value: $flag"
#####################################################
# now check the flag
if [ "$flag" -eq "1" ]
then
  echo "the job succeeded on at least one node."
  exit 0 # rundeck job ends normally :3
else
  echo "the job failed on all nodes."
  exit 1 # rundeck job ends with error :(
fi
NOTE: The script needs jq installed to work.
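As a refinement, instead of sleeping a fixed two seconds and hoping the execution has finished, you could poll the execution state until it leaves RUNNING. A minimal sketch of that loop, reusing the variables from the script above (the 30-attempt cap is just an illustrative choice, not part of the original answer):
# poll the execution state until it is no longer RUNNING (sketch)
get_state() {
  curl -s -H "Accept: application/json" \
    "http://$server:$port/api/$api/execution/$1/state?authtoken=$token" | jq -r '.executionState'
}
status=$(get_state "$execid")
attempts=0
while [ "$status" = "RUNNING" ] && [ "$attempts" -lt 30 ]; do   # arbitrary cap of 30 tries
  sleep "$pt"
  status=$(get_state "$execid")
  attempts=$((attempts + 1))
done
echo "final status: $status"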
In the tests for ping from iputils, certain tests should fail for a non-root user but pass for root. Thus I need to detect whether the user running the tests is root or not. Current code:
run_as_root = false
r = run_command('id', '-u')
if r.stdout().strip().to_int() == 0
  message('running as root')
  run_as_root = true
else
  message('running as normal user')
endif
...
test(name, cmd, args : args, should_fail : not run_as_root)
This is not working, because the check runs at configure time:
$ meson builddir
The Meson build system
Version: 0.59.4
...
Program xsltproc found: YES (/usr/bin/xsltproc)
Message: running as normal user
and not when the tests are run, so the root user is not detected properly:
# cd builddir/ && meson test
[21/21] Linking target ninfod/ninfod
1/36 arping -V OK 0.03s
...
32/36 ping -c1 -i0.001 127.0.0.1 UNEXPECTEDPASS 0.02s
>>> ./builddir/ping/ping -c1 -i0.001 127.0.0.1
33/36 ping -c1 -i0.001 __1 UNEXPECTEDPASS 0.02s
How can I evaluate the user at the time the tests are run?
This is really a case for skipping rather than expected failure. It would be easy to wrap your tests in a small shell or Python script that checks the effective UID and returns the magic exit code 77 (which Meson interprets as a skip).
Something like:
#!/bin/bash
if [ "$(id -u)" -ne 0 ]; then
  echo "User does not have root, cannot run"
  exit 77
fi
exec "$@"
This will cause meson test to return a status of SKIP if the tests are not run as root.
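For example, assuming the wrapper above is saved as run_as_root_or_skip.sh (the name is only for illustration), running one of the ping tests through it as a normal user would look roughly like this:
$ ./run_as_root_or_skip.sh ./builddir/ping/ping -c1 -i0.001 127.0.0.1
User does not have root, cannot run
$ echo $?
77
The wrapper would then be registered in meson.build as the test executable (e.g. via find_program), with the real command passed as its arguments, and the should_fail logic dropped.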
I'm running multiple 'shred' commands on multiple hard drives in a workstation. The 'shred' commands are all run in the background in order to run them concurrently. The output of each 'shred' is redirected to a text file, and also shown in the terminal. I'm using tail to monitor the log file for errors and halt the script if any are encountered. If there are no errors, the script should simply continue to its conclusion.

When I test it by forcing a drive failure (disconnecting a drive), it detects the I/O errors and the script halts as expected. The problem I'm having is that when there are NO errors, I cannot get 'tail' to terminate once the 'shred' commands have completed, and the script just hangs at that point. Since I put the 'tail' command in the 'while' loop below, I would have thought that 'tail' would continue to run as long as the 'shred' processes were running, but would then halt after the 'shred' processes stopped, thus ending the 'while' loop. That hasn't been the case: the script still hangs even after the 'shred' processes have ended. If I go to another terminal window while the script is "hanging" and kill the 'tail' process, the script continues as normal.

Any ideas how to get the 'tail' process to end when the 'shred' processes are gone?
My code:
shred -n 3 -vz /dev/sda 2>&1 | tee -a logfile &
shred -n 3 -vz /dev/sdb 2>&1 | tee -a logfile &
shred -n 3 -vz /dev/sdc 2>&1 | tee -a logfile &

pids=$(pgrep shred)

while kill -0 $pids 2> /dev/null; do
    tail -qn0 -f logfile | \
    read LINE
    echo "$LINE" | grep -q "error"
    if [ $? = 0 ]; then
        killall shred > /dev/null 2>&1
        echo "Error encountered. Halting."
        exit
    fi
done

wait $pids
There is other code after the 'wait' that does other stuff, but this is where the script hangs.
Not directly related to the question, but you can use Daggy - Data Aggregation Utility.
In this case, all subprocesses will end together with the main daggy process.
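If you would rather stay with plain shell, here is a minimal sketch that sidesteps the hang by polling the log file periodically instead of following it with tail -f. It reuses the shred commands and log file from the question; the 5-second poll interval is an arbitrary choice:
shred -n 3 -vz /dev/sda 2>&1 | tee -a logfile &
shred -n 3 -vz /dev/sdb 2>&1 | tee -a logfile &
shred -n 3 -vz /dev/sdc 2>&1 | tee -a logfile &

# the loop ends on its own once no shred process is left
while pgrep shred > /dev/null; do
    if grep -q "error" logfile; then
        killall shred > /dev/null 2>&1
        echo "Error encountered. Halting."
        exit 1
    fi
    sleep 5
done

wait    # reap the background pipelines, then continue with the rest of the script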
I have a Docker image that starts an entrypoint.sh script.
This script checks whether the project is configured correctly.
If everything is correct, the container starts; otherwise it prints an error and exits:
echo "Danger! bla bla bla"
exit 1000
Now if I start the container in this mode:
docker-compose up
I see the error correctly:
Danger! bla bla bla
but I need to launch the container in daemon mode:
docker-compose up -d
How can I show the log only in case of error?
The -d flag in docker-compose up -d stands for detached mode, not daemon mode.
In detached mode, your service(s) (i.e. your container(s)) run in the background, detached from your terminal. You don't see their logs in this mode.
To see the logs of all your services, you need to run this command:
docker-compose logs -f
The -f flag stands for "Follow log output".
This will output the logs of every running service defined in your docker-compose.yml.
From my understanding, you want to fire up your service(s) with:
docker-compose up -d
in order to let the service(s) run in the background and keep a clean console output.
And you want to print only the errors from the logs. To do so, pipe the output and search for errors with the grep command:
docker-compose logs | grep error
This will output all the errors logged by your Docker service(s).
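If you only want to see output when the container actually failed, another option is to check the container's exit code first and print its logs only on failure. A sketch (the service name myservice is a placeholder for whatever is in your docker-compose.yml):
# print the logs only if the service's container exited with a non-zero code
cid=$(docker-compose ps -q myservice)
code=$(docker inspect --format '{{.State.ExitCode}}' "$cid")
if [ "$code" -ne 0 ]; then
    docker logs "$cid"
fi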
You'll find the official documentation related to the docker-compose up command here and to the logs command here. More info on logs-handling in this article.
Related answer here.
Basically I want monit to start a process "CAD" when a file "product_id" is ready. My config is as below:
check file product_id with path /etc/platform/product_id
    if does not exist then alert

check process cad with pidfile /var/run/cad.pid
    depends on product_id
    start = "/bin/sh -c 'cd /home/root/cad/scripts;./run-cad.sh 2>&1 | logger -t CAD'" with timeout 120 seconds
    stop = "/bin/sh -c 'cd /home/root/cad/scripts;./stop-cad.sh 2>&1 | logger -t CAD'"
I'm expecting monit to call "start" until the file is available, but it seems to restart the process (stop and start) every cycle.
Is there anything configured wrong here?
Appreciate any help.
The reason it's restarting every cycle is because the product_id file is not ready. Anything that depends on product_id will be restarted if the check fails.
I would suggest writing a script that checks for the existence of product_id and starts CAD if it's there. You could then run this script from a "check program" block in monit.
This is how I do it:
check program ThisIsMyProgram with path "/home/user/program_check.sh"
every 30 cycles
if status == 1 then alert
This will run the shell script every 30 cycles and raise an alert if its exit status is 1.
Shell script:
#!/bin/bash

FILE=/path/to/file/that/needs/to/exist.json
PID=$(sudo pidof ThisIsMyProgram)

if [ -s "$FILE" ]; then
    if [ ! -z "$PID" ]; then
        exit 0
    else
        sudo service thisismyprogram start >> /dev/null 2>&1
        exit 1
    fi
else
    exit 0
fi
The shell script checks whether the file exists; if it does, it starts the process and keeps it running.
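Adapted to the CAD/product_id case from the question, it could look roughly like this. The paths, pidfile, and start command are taken from the question's config; treat this as a sketch and adjust it to your environment:
#!/bin/bash
# sketch: start CAD only once /etc/platform/product_id is ready
FILE=/etc/platform/product_id
PIDFILE=/var/run/cad.pid

if [ -s "$FILE" ]; then
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        exit 0      # product_id is ready and CAD is already running
    fi
    # product_id is ready but CAD is not running: start it
    (cd /home/root/cad/scripts && ./run-cad.sh 2>&1 | logger -t CAD)
    exit 1          # non-zero so monit raises an alert for the start
else
    exit 0          # product_id not ready yet: nothing to do
fi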
My normal method of testing the notification and escalation chain is to simulate a failure by causing one, for example blocking a port.
But this is thoroughly unsatisfying. I don't want down time recorded in nagios where there was none. I also don't want to wait.
Does anyone know a way to test a notification chain without causing the outage? For example something like this:
$ ./check_notifications_chain <service|host> <time down>
at <x> minutes notification email sent to group <people>
at <2x> minutes notification email sent to group <people>
at <3x> minutes escalated to group <management>
at <200x> rm -rf; shutdown -h now executed.
Extending this paradigm I might make the notification chain a nagios check in itself, but I'll stop here before my brain explodes.
Anyone?
If you only want to verify that the email alerts are working properly, you could create a simple test service, which generates a warning once a day.
test_alert.sh:
#!/bin/bash
date=$(date -u +%H%M)
echo $date
echo "Nagios test script. Intentionally generates a warning daily."
# force base 10 so early-morning times like 0830 are not treated as invalid octal numbers
if (( 10#$date >= 1900 && 10#$date <= 1920 )); then
    exit 1
else
    exit 0
fi
commands.cfg:
define command {
    command_name    test_alert
    command_line    /bin/bash /usr/local/scripts/test_alert.sh
}
services.cfg:
define service {
    host                    localhost
    service_description     Test Alert
    check_command           test_alert
    use                     generic-service
}
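To sanity-check the script outside Nagios, you can run it by hand and look at the exit code. The run below assumes the current UTC time falls inside the 19:00-19:20 warning window (the timestamp shown is just an example):
$ /bin/bash /usr/local/scripts/test_alert.sh; echo "exit code: $?"
1905
Nagios test script. Intentionally generates a warning daily.
exit code: 1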
This is an old post but maybe my solution can help someone.
I use the check_dummy plugin, which is included in the Nagios plugins pack.
As its name suggests, it does nothing clever: it simply returns the state you give it.
Here are some examples of how it works:
Usage:
check_dummy <integer state> [optional text]
$ ./check_dummy 0
OK
$ ./check_dummy 2
CRITICAL
$ ./check_dummy 3 salut
UNKNOWN: salut
$ ./check_dummy 1 azerty
WARNING: azerty
$ echo $?
1
I create a file which contains the integer state and the optional text:
echo 0 OKAY | sudo tee /usr/local/nagios/libexec/dummy.txt
sudo chown nagios:nagios /usr/local/nagios/libexec/dummy.txt
With the command definition:
# Dummy check (notifications tests)
define command {
    command_name    my_check_dummy
    command_line    $USER1$/check_dummy $(cat /usr/local/nagios/libexec/dummy.txt)
}
Associated with this service definition:
define service {
    use                     generic-service
    host_name               localhost
    service_description     Dummy check
    check_period            24x7
    check_interval          1
    max_check_attempts      1
    retry_interval          1
    notifications_enabled   1
    notification_options    w,u,c,r
    notification_interval   0
    notification_period     24x7
    check_command           my_check_dummy
}
So I just change the contents of the file "dummy.txt" to change the service state:
echo "2 Oups" | sudo tee /usr/local/nagios/libexec/dummy.txt
echo "1 AHHHH" | sudo tee /usr/local/nagios/libexec/dummy.txt
echo "0 Parfait !" | sudo tee /usr/local/nagios/libexec/dummy.txt
This allowed me to debug my notification program.
Hope it helps!