SLURM releasing resources using scontrol update results in unknown endtime - batch-processing

I have a program that will dynamically release resources during job execution, using the command:
scontrol update JobId=$SLURM_JOB_ID NodeList=${remaininghosts}
However, this sometimes results in very weird behavior where the job is re-queued. Below is the output of sacct:
sacct -j 1448590
JobID      NNodes  State      Start     End       NodeList
1448590    4       RESIZING   20:47:28  01:04:22  [0812,0827],[0663-0664]
1448590.0  4       COMPLETED  20:47:30  20:47:30  [0812,0827],[0663-0664]
1448590.1  4       RESIZING   20:47:30  01:04:22  [0812,0827],[0663-0664]
1448590    3       RESIZING   01:04:22  01:06:42  [0812,0827],0663
1448590    2       RESIZING   01:06:42  1:12:42   0827,tnxt-0663
1448590    4       COMPLETED  05:33:15  Unknown   [0805-0807,0809]
The first lines show everything works fine, nodes are getting released but in the last line, it shows a completely different set of nodes with an unknown end time. The slurm logs show the job got requeued:
requeue JobID=1448590 State=0x8000 NodeCnt=1 due to node failure.
I suspect this might happen because the head node is killed, but the slurm documentation doesn't say anything about that.
Does anybody have an idea or suggestion?
Thanks

In this post there was a discussion about resizing jobs.
In your particular case, I would shrink the job as follows.
Assume that job j1 has been submitted with:
$ salloc -N4 bash
Update j1 to the new size:
$ scontrol update jobid=$SLURM_JOBID NumNodes=2
$ scontrol update jobid=$SLURM_JOBID NumNodes=ALL
Then update the environment variables of j1 by sourcing the script created by the previous commands:
$ . ./slurm_job_${SLURM_JOBID}_resize.sh
Now, j1 has 2 nodes.
In your example, the remaininghosts list may, as you suspect, exclude the head node, which Slurm needs in order to shrink the job. If you provide a node count instead of a list, the resize should work.
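If you do want to keep passing an explicit node list, you could guard against dropping the head node before calling scontrol. This is only a minimal sketch: it assumes the head node is the host running the batch script and that its short hostname matches the Slurm node name. The helper name contains_host is hypothetical and not part of Slurm.

```shell
#!/bin/sh
# contains_host HOST "h1 h2 ...": succeed if HOST appears in the
# space-separated list. Hypothetical helper, for illustration only.
contains_host() {
    host=$1
    for h in $2; do
        [ "$h" = "$host" ] && return 0
    done
    return 1
}

# Usage inside a job (commented out because it needs a Slurm allocation):
# remaining=$(scontrol show hostnames "$remaininghosts" | tr '\n' ' ')
# if contains_host "$(hostname -s)" "$remaining"; then
#     scontrol update JobId="$SLURM_JOB_ID" NodeList="$remaininghosts"
# else
#     echo "refusing to release the head node" >&2
# fi
```

The check is deliberately done on the expanded host names, so a compressed list such as the one shown by sacct has to go through scontrol show hostnames first.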

Related

Is there a way to get a nice error report summary when running many jobs on DRMAA cluster?

I need to run a snakemake pipeline on a DRMAA cluster with more than 2000 jobs in total. When some of the jobs fail, I would like to receive an easily readable summary report at the end that lists only the failed jobs, instead of the whole job summary given in the log.
Is there a way to achieve this without parsing the log file myself?
These are the (incomplete) cluster options:
jobs: 200
latency-wait: 5
keep-going: True
rerun-incomplete: True
restart-times: 2
I am not sure if there is another way than parsing the log file yourself, but I've done it several times with grep and I am happy with the results:
grep -B 3 -A 3 error .snakemake/log/[TIME].snakemake.log
Of course you should replace the [TIME] placeholder with the timestamp of whichever run you want to check.
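If you want something closer to a summary than raw grep context, the same idea can be condensed into a per-rule failure count. This is a rough sketch, and it assumes that failures show up in the log as lines starting with "Error in rule NAME:"; check your own log first, since the exact wording may differ between Snakemake versions. The function name failed_rules is made up for this example.

```shell
#!/bin/sh
# failed_rules LOGFILE: print "rule count" for each rule that reported an error.
# Assumes failures are logged as lines starting with "Error in rule NAME:".
failed_rules() {
    awk '/^Error in rule /{
             sub(/^Error in rule /, "")   # drop the prefix
             sub(/:.*$/, "")              # drop the colon and anything after it
             count[$0]++
         }
         END { for (r in count) print r, count[r] }' "$1" | sort
}

# Usage (path pattern as in the grep command above):
# failed_rules .snakemake/log/[TIME].snakemake.log
```

With restart-times set, a job that fails on every attempt is counted once per attempt, so the counts are failures rather than distinct jobs.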

Monit for "cron-like" tasks

I have some batch-type jobs that I would like to move from cron to Monit, but I am struggling to get them to work properly. These scripts typically run once a day, but on occasion have to be re-run later in the day. The goal is to take advantage of the Monit and M/Monit front-ends to re-run them, and to be alerted on failure in the same way as for other things under Monit.
The below was my first attempt. I know the docs say to use a range or wildcard for the minute field, but my monit daemon is set to cycle every 20 seconds, so I thought I'd be able to get away with this.
check program test.sh
with path "/usr/local/bin/test.sh"
every "0 7 * * *"
if status != 0 then alert
This does not seem to work, because monit appears to pick up the exit status of the program on the NEXT run. So I have a zombie process sitting around until 7am the next day, at which time I'll see the status from the previous day's run.
It would be nice if this ran immediately, or if there were a way to schedule something as "batch" that would only run once when started (either from the command line or the web GUI). Example below.
check program test.sh
with path "/usr/local/bin/test.sh"
mode batch
if status != 0 then alert
Is it possible to do what I want? Can a 'check program' be scheduled that will only run one time when started or using the 'every [cron]' type syntax supported by monit?
TIA for any suggestions.
The latest version of monit (5.18) now picks up the exit status on the next daemon cycle, not on the next execution of the program like in the past (which might not be until the next day).

Setting SGE for running an executable with different input files on different nodes

I used to work on a cluster with the SLURM scheduler, but now I am more or less forced to switch to an SGE-based cluster, and I'm trying to get the hang of it. My workflow on the SLURM system involved running an executable on N input files, with a SLURM configuration file set up like this:
slurmConf.conf SLURM configuration file
0 /path/to/exec /path/to/input1
1 /path/to/exec /path/to/input2
2 /path/to/exec /path/to/input3
3 /path/to/exec /path/to/input4
4 /path/to/exec /path/to/input5
5 /path/to/exec /path/to/input6
6 /path/to/exec /path/to/input7
7 /path/to/exec /path/to/input8
8 /path/to/exec /path/to/input9
9 /path/to/exec /path/to/input10
My working SLURM submission script contains this line:
srun -n $SLURM_NNODES --multi-prog $slconf
where $slconf is the path to that configuration file.
This setup worked as I wanted: it ran the executable with 10 different inputs at the same time on 10 nodes. Now that I have moved to an SGE system, I want to do the same thing, but I read the manual and found nothing quite like the SLURM feature. Could you please shed some light on how to achieve the same thing on an SGE system?
Thank you very much!
You could use the "job array" feature of the Grid Engine.
Create a shell script sge_job.sh
#!/bin/sh
#
# sge_job.sh -- SGE job description script
#
#$ -t 1-10
/path/to/exec /path/to/input$SGE_TASK_ID
And submit this script to SGE with qsub.
qsub sge_job.sh
Dmitri Chubarov's answer is excellent, and the most robust way to proceed as it places less load on the submit node when submitting many jobs (>1000). Alternatively, you can wrap qsub in a for loop:
for i in {1..10}
do
echo "/path/to/exec /path/to/input${i}" | qsub
done
I sometimes use the above when whatever varies as input is not easily captured as a range of integers.
Example:
for f in /some/path/input*
do
echo "/path/to/exec ${f}" | qsub
done
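The two approaches can also be combined: a job array can handle inputs that are not a range of integers if each task looks up its own line in a list of paths. A minimal sketch, assuming a plain text file with one input path per line; the file name inputs.txt and the helper nth_line are hypothetical.

```shell
#!/bin/sh
# nth_line N FILE: print line N of FILE (1-based, like SGE task IDs).
nth_line() {
    sed -n "${1}p" "$2"
}

# Inside a job-array script submitted with e.g. `qsub -t 1-10 sge_job.sh`
# (commented out because SGE_TASK_ID is only set by Grid Engine):
# input=$(nth_line "$SGE_TASK_ID" inputs.txt)
# /path/to/exec "$input"
```

The list itself can be produced once on the submit host, for example with ls /some/path/input* > inputs.txt, and the -t range set to the number of lines in it.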

Hadoop jobs getting poor locality

I have some fairly simple Hadoop streaming jobs that look like this:
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
-files hdfs:///apps/local/count.pl \
-input /foo/data/bz2 \
-output /user/me/myoutput \
-mapper "cut -f4,8 -d," \
-reducer count.pl \
-combiner count.pl
The count.pl script is just a simple script that accumulates counts in a hash and prints them out at the end - the details are probably not relevant but I can post it if necessary.
The input is a directory containing 5 files encoded with bz2 compression, roughly the same size as each other, for a total of about 5GB (compressed).
When I look at the running job, it has 45 mappers, but they're all running on one node. The particular node changes from run to run, but always only one node. Therefore I'm achieving poor data locality as data is transferred over the network to this node, and probably achieving poor CPU usage too.
The entire cluster has 9 nodes, all the same basic configuration. The blocks of the data for all 5 files are spread out among the 9 nodes, as reported by the HDFS Name Node web UI.
I'm happy to share any requested info from my configuration, but this is a corporate cluster and I don't want to upload any full config files.
It looks like this previous thread [ why map task always running on a single node ] is relevant but not conclusive.
EDIT: at @jtravaglini's suggestion I tried the following variation and saw the same problem - all 45 map jobs running on a single node:
yarn jar \
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.2.0.2.0.6.0-101.jar \
wordcount /foo/data/bz2 /user/me/myoutput
At the end of the output of that task in my shell, I see:
Launched map tasks=45
Launched reduce tasks=1
Data-local map tasks=18
Rack-local map tasks=27
which is the number of data-local tasks you'd expect to see on one node just by chance alone.

Sun Grid Engine resubmit job stuck in 'Rq' state

I have what I hope is a pretty simple question, but I'm not super familiar with Sun Grid, so I've been having trouble finding the answer. I am currently submitting jobs to a grid using a bash submission script that generates a command and then executes it. I have read online that if a sun grid job exits with a code of 99, it gets re-submitted to the grid. I have successfully written my bash script to do this:
[code to generate command, stores in $command]
$command
STATUS=$?
if [[ $STATUS -ne 0 ]]; then
exit 99
fi
exit 0
When I submit this job to the grid with a command that I know has a non-zero exit status, the job does indeed appear to be resubmitted. However, the scheduler never sends it to another host; instead it just remains stuck in the queue with the status "Rq":
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2150015 0.55500 GridJob.sh my_user Rq 04/08/2013 17:49:00 1
I have a feeling that this is something simple in the config options for the queue, but I haven't been able to find anything googling. I've tried submitting this job with the qsub -r y option, but that doesn't seem to change anything.
Thanks!
Rescheduled jobs will only get run in queues that have their rerun attribute (FALSE by default) set to TRUE, so check your queue configuration (qconf -mq myqueue). Without this, your job remains in the rescheduled-pending state indefinitely because it has nowhere to go.
IIRC, submitting jobs with qsub -r yes only qualifies them for automatic rescheduling in the event of an exec node crash; exiting with status 99 should trigger a reschedule regardless.
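To check the attribute without opening an editor, you can inspect the output of qconf -sq instead. The following is a sketch under the assumption that the queue configuration is printed as one "name value" pair per line, including a line for rerun; the function name rerun_enabled is made up for this example.

```shell
#!/bin/sh
# rerun_enabled: read a queue configuration on stdin (as printed by
# `qconf -sq QUEUE`) and succeed only if its rerun attribute is TRUE.
rerun_enabled() {
    awk '$1 == "rerun" { found = ($2 == "TRUE") }
         END { exit (found ? 0 : 1) }'
}

# Usage on a submit host (myqueue is the example queue name used above):
# qconf -sq myqueue | rerun_enabled && echo "rescheduled jobs can run here"
```

If the check fails for every queue the job could run in, the "Rq" state described in the question is the expected result.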