Sun Grid Engine resubmit job stuck in 'Rq' state - grid-computing

I have what I hope is a pretty simple question, but I'm not super familiar with Sun Grid Engine, so I've been having trouble finding the answer. I am currently submitting jobs to a grid using a bash submission script that generates a command and then executes it. I have read online that if a Sun Grid Engine job exits with a code of 99, it gets re-submitted to the grid. I have written my bash script to do this:
# [code to generate command, stores in $command]
$command
STATUS=$?
if [[ $STATUS -ne 0 ]]; then
    exit 99
fi
exit 0
When I submit this job to the grid with a command that I know has a non-zero exit status, the job does indeed appear to be resubmitted; however, the scheduler never sends it to another host, and it just remains stuck in the queue with the status "Rq":
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2150015 0.55500 GridJob.sh my_user Rq 04/08/2013 17:49:00 1
I have a feeling that this is something simple in the config options for the queue, but I haven't been able to find anything googling. I've tried submitting this job with the qsub -r y option, but that doesn't seem to change anything.
Thanks!

Rescheduled jobs will only get run in queues that have their rerun attribute (FALSE by default) set to TRUE, so check your queue configuration (qconf -mq myqueue). Without this, your job remains in the rescheduled-pending state indefinitely because it has nowhere to go.
IIRC, submitting jobs with qsub -r yes only qualifies them for automatic rescheduling in the event of an exec node crash; exiting with status 99 should trigger a reschedule regardless.
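A quick sketch of how to check and change that attribute (the queue name all.q is just an assumption; use whichever queue your job runs in):
# show the current rerun setting for the queue
qconf -sq all.q | grep rerun
# open the queue configuration in an editor and change "rerun FALSE" to "rerun TRUE"
qconf -mq all.q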

Related

SLURM releasing resources using scontrol update results in unknown endtime

I have a program that will dynamically release resources during job execution, using the command:
scontrol update JobId=$SLURM_JOB_ID NodeList=${remaininghosts}
However, this sometimes results in some very weird behavior where the job is re-queued. Below is the output of sacct:
sacct -j 1448590
JobID NNodes State Start End NodeList
1448590 4 RESIZING 20:47:28 01:04:22 [0812,0827],[0663-0664]
1448590.0 4 COMPLETED 20:47:30 20:47:30 [0812,0827],[0663-0664]
1448590.1 4 RESIZING 20:47:30 01:04:22 [0812,0827],[0663-0664]
1448590 3 RESIZING 01:04:22 01:06:42 [0812,0827],0663
1448590 2 RESIZING 01:06:42 1:12:42 0827,tnxt-0663
1448590 4 COMPLETED 05:33:15 Unknown 0805-0807,0809]
The first lines show that everything works fine and nodes are getting released, but the last line shows a completely different set of nodes with an unknown end time. The Slurm logs show the job got requeued:
requeue JobID=1448590 State=0x8000 NodeCnt=1 due to node failure.
I suspect this might happen because the head node is killed, but the slurm documentation doesn't say anything about that.
Does anybody have an idea or suggestion?
Thanks
In this post there was a discussion about resizing jobs.
In your particular case, for shrinking I would proceed as follows.
Assuming that j1 has been submitted with:
$ salloc -N4 bash
Update j1 to the new size:
$ scontrol update jobid=$SLURM_JOBID NumNodes=2
$ scontrol update jobid=$SLURM_JOBID NumNodes=ALL
And update the environment variables of j1 (the script is created by the previous commands):
$ ./slurm_job_${SLURM_JOBID}_resize.sh
Now, j1 has 2 nodes.
In your example, your "remaininghosts" list, as you say, may exclude the head node that is needed by Slurm to shrink the job. If you provide a quantity instead of a list, the resize should work.
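As a minimal sketch of that approach, assuming your program computes the new node count into a variable (new_count below is just an illustration):
# shrink by quantity rather than by an explicit node list
scontrol update JobId=$SLURM_JOB_ID NumNodes=$new_count
# pick up the resized allocation in the current environment (this script is generated by the update)
. ./slurm_job_${SLURM_JOB_ID}_resize.sh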

Read all lines at the same time individually - Solaris ksh

I need some help with a script. Solaris 10 and ksh.
I have a file called /temp.list with this content:
192.168.0.1
192.168.0.2
192.168.0.3
So, I have a script which reads this list and executes some commands using each line's value:
FILE_TMP="/temp.list"
while IFS= read line
do
    ping $line
done < "$FILE_TMP"
It works, but it executes the command for line 1, and only when that finishes does it move on to line 2, and so on until the end. I would like to find a way to execute the ping command for every line of the list at the same time. Is there a way to do it?
Thank you in advance!
Marcus Quintella
As Ari suggested, googling "ksh multithreading" will produce a lot of ideas/solutions.
A simple example:
FILE_TMP="/temp.list"
while IFS= read line
do
    ping $line &
done < "$FILE_TMP"
The trailing '&' says to kick the ping command off in the background, allowing loop processing to continue while the ping command is running in the background.
'course, this is just the tip of the proverbial iceberg, as you now need to consider the following (a short sketch follows this list):
multiple ping commands are going to be dumping output to stdout (i.e., you're going to get a mish-mash of ping output in your console), so you'll need to give some thought to what to do with the multiple streams of output (e.g., redirect to a common file? redirect to separate files?)
you need to have some idea as to how you want to go about managing and (possibly) terminating commands running in the background [ see jobs, ps, fg, bg, kill ]
if running in a shell script you'll likely find yourself wanting to suspend the main shell script processing until all background jobs have completed [ see wait ]
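A minimal sketch pulling those pieces together (the log directory /tmp/ping_logs is just an assumption; pick whatever destination suits you):
#!/bin/ksh
# run all pings in parallel, send each host's output to its own file, then wait for all of them
FILE_TMP="/temp.list"
LOG_DIR="/tmp/ping_logs"    # assumed location for per-host output
mkdir -p "$LOG_DIR"
while IFS= read -r line
do
    ping "$line" > "$LOG_DIR/$line.out" 2>&1 &
done < "$FILE_TMP"
wait    # suspend the script until every background ping has completed
echo "All pings finished; results are in $LOG_DIR"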

Monit for "cron-like" tasks

Have some batch-type jobs that I would like to move from cron to Monit, but am struggling to get them to work properly. These scripts typically run once a day, but on occasion have to be re-run later in the day. The goal is to take advantage of the monit & m/monit front-ends to re-run them, as well as be alerted on failure, in a similar fashion to other things under monit.
The below was my first attempt. I know the docs say to use a range/wildcard for the minute field, but I have my monit daemon set to cycle every 20 seconds, so I thought I'd be able to get away with this.
check program test.sh
    with path "/usr/local/bin/test.sh"
    every "0 7 * * *"
    if status != 0 then alert
This does not work as expected: monit seems to pick up the exit status of the program on the NEXT run. So I have a zombie process sitting around until 7am the next day, at which point I'll see the status from the previous day's run.
It would be nice if this ran immediately, or if there were a way to schedule something as "batch" that would only run once when started (either from the command line or the web GUI). Example below.
check program test.sh
    with path "/usr/local/bin/test.sh"
    mode batch
    if status != 0 then alert
Is it possible to do what I want? Can a 'check program' be scheduled that will only run one time when started or using the 'every [cron]' type syntax supported by monit?
TIA for any suggestions.
The latest version of monit (5.18) now picks up the exit status on the next daemon cycle, not on the next execution of the program like in the past (which might not be until the next day).

Monit false alerts

I am monitoring a java daemon process by its PID. Below is the config.
check process SemanticReplication with pidfile "/ngs/app/edwt/opsmonit /monit/scripts/process.pid"
    start = "/ngs/app/edwt/scripts/javadaemon/start_daemon.ksh"
    stop = "/ngs/app/edwt/scripts/javadaemon/stop_daemon.ksh"
Many times, even though the java daemon process is up and running, I get a false alert saying the process is not running.
In the next monit check cycle (after a minute), another monit alert triggers saying the process is up and running.
Can someone help with how to avoid these false alerts?
Your check statement has monit check for the existence of the pid file (which looks weird with the spaces, btw). If there isn't one, it'll send an alert by default and then run the start directive.
I get around this by having a check process ... matching statement like so:
check process app-pass matching 'Passenger RubyApp: \/home\/app\/app-name\/public'
Essentially, "matching" does the equivalent of ps aux | grep ... which does a better job when I can't rely on a pid file existing, like with a child process.

cron script to act as a queue OR a queue for cron?

I'm betting that someone has already solved this and maybe I'm using the wrong search terms for google to tell me the answer, but here is my situation.
I have a script that I want to run, but I want it to run only when scheduled and only one at a time. (can't run the script simultaneously)
Now the sticky part: say I have a table called "myhappyschedule" which has the data I need and the scheduled time. This table can have multiple scheduled times, even at the same time, and each one would run this script. So essentially I need a queue of each time the script fires, and each run needs to wait for the previous one to finish. (Sometimes the script takes just a minute to execute, sometimes many, many minutes.)
What I'm thinking about doing is making a script that checks myhappyschedule every 5 min, gathers up those that are scheduled, and puts them into a queue where another script can execute each 'job' or occurrence in the queue in order. All of which sounds messy.
To complicate this further: I'm allowing users to schedule things in myhappyschedule, not edit the crontab.
What can be done about this? File locks and scripts calling scripts?
add a column exec_status to myhappytable (maybe also time_started and time_finished, see pseudocode)
run the following cron script every x minutes
pseudocode of cron script:
[create/check pid lock (optional, but see "A potential pitfall" below)]
get number of rows from myhappytable where (exec_status == executing_now)
if it is > 0, exit
begin loop
    get one row from myhappytable
        where (exec_status == not_yet_run) and (scheduled_time <= now)
        order by scheduled_time asc
    if no such row, exit
    set row exec_status to executing_now (maybe set time_started to now)
    execute whatever command the row contains
    set row exec_status to completed
        (maybe also store the command output/return as well, set time_finished to now)
end loop
[delete pid lock file (complementary to the starting pid lock check)]
This way, the script first checks that none of the commands is running, then runs the first not-yet-run command, and repeats until there are no more commands to be run at the given moment. Also, you can see what command is executing by querying the database.
A potential pitfall: if the cron script is killed, a scheduled task will remain in the "executing_now" state. That's what the pid lock at the beginning and end is for: to see whether the cron script terminated properly. Pseudocode of create/check pidlock:
if exists pidlockfile then
    check if process id given in file exists
    if not exists then
        update myhappytable set exec_status = error_cronscript_died_while_executing_this
            where exec_status == executing_now
        delete pidlockfile
    else (previous instance still running)
        exit
    endif
endif
create pidlockfile containing cron script process id
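For concreteness, a rough, runnable sketch of that pseudocode in shell, assuming the table lives in an SQLite database and has id and command columns in addition to the status/time columns above (the database path, lock file path, and the id/command column names are all assumptions; adapt to your real schema and DBMS):
#!/bin/sh
# assumed cron entry: */5 * * * * /usr/local/bin/run_scheduled.sh
DB=/var/lib/myhappytable.db         # assumed database location
LOCK=/var/run/run_scheduled.lock    # lock file standing in for the pid-lock pseudocode

# flock(1) gives the "only one instance at a time" guarantee; if another run
# holds the lock, give up immediately and let the next cron run try again
exec 9>"$LOCK"
flock -n 9 || exit 0

while :; do
    # grab the oldest due job that has not run yet
    ROW=$(sqlite3 "$DB" "SELECT id, command FROM myhappytable
                         WHERE exec_status = 'not_yet_run' AND scheduled_time <= datetime('now')
                         ORDER BY scheduled_time LIMIT 1;")
    [ -z "$ROW" ] && break

    ID=${ROW%%|*}       # sqlite3 separates columns with '|' by default
    CMD=${ROW#*|}

    sqlite3 "$DB" "UPDATE myhappytable SET exec_status = 'executing_now', time_started = datetime('now') WHERE id = $ID;"
    sh -c "$CMD"
    sqlite3 "$DB" "UPDATE myhappytable SET exec_status = 'completed', time_finished = datetime('now') WHERE id = $ID;"
done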
You can use the at(1) command inside your script to schedule its next run. Before it exits, it can check myhappyschedule for the next run time. You don't need cron at all, really.
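A minimal sketch of that idea (get_next_run_time is a hypothetical helper standing in for however you query myhappyschedule, and the script path is just a placeholder; the result only has to be a time spec that at(1) understands, e.g. "07:00 tomorrow"):
# at the end of the script: look up the next scheduled time and re-queue this script with at(1)
next_time=$(get_next_run_time)    # hypothetical helper that queries myhappyschedule
echo "/usr/local/bin/myscript.sh" | at "$next_time"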
I came across this question while researching a solution to the queuing problem. For the benefit of anyone else searching, here is my solution.
Combine this with a cron that starts jobs as they are scheduled (even if they are scheduled to run at the same time) and that solves the problem you described as well.
Problem
At most one instance of the script should be running.
We want to queue up requests and process them as fast as possible.
i.e. we need a pipeline to the script.
Solution:
Create a pipeline to any script. Done using a small bash script (further down).
The script can be called as
./pipeline "<any command and arguments go here>"
Example:
./pipeline sleep 10 &
./pipeline shabugabu &
./pipeline single_instance_script some arguments &
./pipeline single_instance_script some other_arguments &
./pipeline "single_instance_script some yet_other_arguments > output.txt" &
..etc
The script creates a new named pipe for each command. So the above will create named pipes: sleep, shabugabu, and single_instance_script
In this case the initial call will start a reader and run single_instance_script with some arguments as arguments. Once the call completes, the reader will grab the next request off the pipe and execute with some other_arguments, complete, grab the next etc...
This script will block requesting processes, so call it as a background job (& at the end) or as a detached process with at (at now <<< "./pipeline some_script").
#!/bin/bash -Eue
# Using command name as the pipeline name
pipeline=$(basename $(expr "$1" : '\(^[^[:space:]]*\)')).pipe
is_reader=false

function _pipeline_cleanup {
    if $is_reader; then
        rm -f $pipeline
    fi
    rm -f $pipeline.lock
    exit
}
trap _pipeline_cleanup INT TERM EXIT

# Dispatch/initialization section, critical
lockfile $pipeline.lock
if [[ -p $pipeline ]]
then
    echo "$*" > $pipeline
    exit
fi
is_reader=true
mkfifo $pipeline
echo "$*" > $pipeline &
rm -f $pipeline.lock

# Reader section
while read command < $pipeline
do
    echo "$(date) - Executing $command"
    ($command) &> /dev/null
done