Setting SGE for running an executable with different input files on different nodes - scripting

I used to work with a cluster using SLURM scheduler, but now I am more or less forced to switch to a SGE-based cluster, and I'm trying to get a hang of it. The thing I was working on SLURM system involves running an executable using N input files, and set a SLURM configuration file in this fashion,
slurmConf.conf SLURM configuration file
0 /path/to/exec /path/to/input1
1 /path/to/exec /path/to/input2
2 /path/to/exec /path/to/input3
3 /path/to/exec /path/to/input4
4 /path/to/exec /path/to/input5
5 /path/to/exec /path/to/input6
6 /path/to/exec /path/to/input7
7 /path/to/exec /path/to/input8
8 /path/to/exec /path/to/input9
9 /path/to/exec /path/to/input10
And my working submission script in SLURM contains this line;
srun -n $SLURM_NNODES --multi-prog $slconf
$slconf refers to a path to that configuration file
This setup worked as I wanted - to run the executable with 10 different inputs at the same time with 10 nodes. Now that I just transitioned to SGE system, I want to do the same thing but I tried to read the manual and found nothing quite like SLURM. Could you please give me some light on how to achieve the same thing on SGE system?
Thank you very much!

You could use the "job array" feature of the Grid Engine.
Create a shell script sge_job.sh
#!/bin/sh
#
# sge_job.sh -- SGE job description script
#
#$ -t 1-10
/path/to/exec /path/to/input$SGE_TASK_ID
And submit this script to SGE with qsub.
qsub sge_job.sh

Dmitri Chubarov's answer is excellent, and the most robust way to proceed as it places less load on the submit node when submitting many jobs (>1000). Alternatively, you can wrap qsub in a for loop:
for i in {1..10}
do
echo "/path/to/exec /path/to/input${i}" | qsub
done
I sometimes use the above when whatever varies as input is not easily captured as a range of integers.
Example:
for f in `ls /some/path/input*`
do
echo "/path/to/exec ${f}" | qsub
done

Related

Can Snakemake parallelize the same rule both within and across nodes?

I have a somewhat basic question about Snakemake parallelization when using cluster execution: can jobs from the same rule be parallelized both within a node and across multiple nodes at the same time?
Let's say for example that I have 100 bwa mem jobs and my cluster has nodes with 40 cores each. Could I run 4 bwa mem per node, each using 10 threads, and then have Snakemake submit 25 separate jobs? Essentially, I want to parallelize both within and across nodes for the same rule.
Here is my current snakefile:
SAMPLES, = glob_wildcards("fastqs/{id}.1.fq.gz")
print(SAMPLES)
rule all:
input:
expand("results/{sample}.bam", sample=SAMPLES)
rule bwa:
resources:
time="4:00:00",
partition="short-40core"
input:
ref="/path/to/reference/genome.fa",
fwd="fastqs/{sample}.1.fq.gz",
rev="fastqs/{sample}.2.fq.gz"
output:
bam="results/{sample}.bam"
log:
"results/logs/bwa/{sample}.log"
params:
threads=10
shell:
"bwa mem -t {params.threads} {input.ref} {input.fwd} {input.rev} 2> {log} | samtools view -bS - > {output.bam}"
I've run this with the following command:
snakemake --cluster "sbatch --partition={resources.partition}" -s bwa_slurm_snakefile --jobs 25
With this setup, I get 25 jobs submitted, each to a different node. However, only one bwa mem process (using 10 threads) is run per node.
Is there some straightforward way to modify this so that I could get 4 different bwa mem jobs (each using 10 threads) to run on each node?
Thanks!
Dave
Edit 07/28/22:
In addition to Troy's suggestion below, I found a straightforward way of accomplishing what I was trying to do by simply following the job grouping documentation.
Specifically, I did the following when executing my Snakemake pipeline:
snakemake --cluster "sbatch --partition={resources.partition}" -s bwa_slurm_snakefile --jobs 25 --groups bwa=group0 --group-components group0=4 --rerun-incomplete --cores 40
By specifying a group ("group0") for the bwa rule and setting "--group-components group0=4", I was able to group the jobs such that 4 bwa runs are occurring on each node.
You can try job grouping but note that resources are typically summed together when submitting group jobs like this. Usually that's not what is desired, but in your case it seems to be correct.
Instead you can make a group job with another rule that does the grouping for you in batches of 4.
rule bwa_mem:
group: 'bwa_batch'
output: '{sample}.bam'
...
def bwa_mem_batch(wildcards):
# for wildcard.i, pick 4 bwa_mem outputs to put in this group
return expand('{sample}.bam', sample=SAMPLES[i*4:i*4+4])
rule bwa_mem_batch:
input: bwa_mem_batch_input
output: touch('flag_{i}') # could be temp too
group 'bwa_batch'
The consuming rule must request flag_{i} for i in {0..len(SAMPLES)//4}. With cluster integration, each slurm job gets 1 bwa_mem_batch job and 4 bwa_mem jobs with resources for a single bwa_mem job. This is useful for batching together multiple jobs to increase the runtime.
As a final point, this may do what you want, but I don't think it will help you get around QOS or other job quotas. You are using the same amount of CPU hours either way. You may be waiting in the queue longer because the scheduler can't find 40 threads to give you at once, where it could have given you a few 10 thread jobs. Instead, consider refining your resource values to get better efficiency, which may get your jobs run earlier.

Is there a way to get a nice error report summary when running many jobs on DRMAA cluster?

I need to run a snakemake pipeline on a DRMAA cluster with a total number of >2000 jobs. When some of the jobs have failed, I would like to receive in the end an easy readable summary report, where only the failed jobs are listed instead of the whole job summary as given in the log.
Is there a way to achieve this without parsing the log file by myself?
These are the (incomplete) cluster options:
jobs: 200
latency-wait: 5
keep-going: True
rerun-incomplete: True
restart-times: 2
I am not sure if there is another way than parsing the log file yourself, but I've done it several times with grep and I am happy with the results:
cat .snakemake/log/[TIME].snakemake.log | grep -B 3 -A 3 error
Of course you should change the TIME placeholder for whichever run you want to check.

SLURM releasing resources using scontrol update results in unknown endtime

I have a program that will dynamically release resources during job execution, using the command:
scontrol update JobId=$SLURM_JOB_ID NodeList=${remaininghosts}
However, this results in some very weird behavior sometimes. Where the job is re-queued. Below is the output of sacct
sacct -j 1448590
JobID NNodes State Start End NodeList
1448590 4 RESIZING 20:47:28 01:04:22 [0812,0827],[0663-0664]
1448590.0 4 COMPLETED 20:47:30 20:47:30 [0812,0827],[0663-0664]
1448590.1 4 RESIZING 20:47:30 01:04:22 [0812,0827],[0663-0664]
1448590 3 RESIZING 01:04:22 01:06:42 [0812,0827],0663
1448590 2 RESIZING 01:06:42 1:12:42 0827,tnxt-0663
1448590 4 COMPLETED 05:33:15 Unknown 0805-0807,0809]
The first lines show everything works fine, nodes are getting released but in the last line, it shows a completely different set of nodes with an unknown end time. The slurm logs show the job got requeued:
requeue JobID=1448590 State=0x8000 NodeCnt=1 due to node failure.
I suspect this might happen because the head node is killed, but the slurm documentation doesn't say anything about that.
Does anybody had an idea or suggestion?
Thanks
In this post there was a discussion about resizing jobs.
In your particular case, for shrinking I would use:
Assuming that j1 has been submitted with:
$ salloc -N4 bash
Update j1 to the new size:
$ scontrol update jobid=$SLURM_JOBID NumNodes=2
$ scontrol update jobid=$SLURM_JOBID NumNodes=ALL
And update the environmental variables of j1 (the script is created by the previous commands):
$ ./slurm_job_$SLURM_JOBID_resize.sh
Now, j1 has 2 nodes.
In your example, your "remaininghost" list, as you say, may exclude the head node that is needed by Slurm to shrink the job. If you provide a quantity instead of a list, the resize should work.

Hadoop jobs getting poor locality

I have some fairly simple Hadoop streaming jobs that look like this:
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
-files hdfs:///apps/local/count.pl \
-input /foo/data/bz2 \
-output /user/me/myoutput \
-mapper "cut -f4,8 -d," \
-reducer count.pl \
-combiner count.pl
The count.pl script is just a simple script that accumulates counts in a hash and prints them out at the end - the details are probably not relevant but I can post it if necessary.
The input is a directory containing 5 files encoded with bz2 compression, roughly the same size as each other, for a total of about 5GB (compressed).
When I look at the running job, it has 45 mappers, but they're all running on one node. The particular node changes from run to run, but always only one node. Therefore I'm achieving poor data locality as data is transferred over the network to this node, and probably achieving poor CPU usage too.
The entire cluster has 9 nodes, all the same basic configuration. The blocks of the data for all 5 files are spread out among the 9 nodes, as reported by the HDFS Name Node web UI.
I'm happy to share any requested info from my configuration, but this is a corporate cluster and I don't want to upload any full config files.
It looks like this previous thread [ why map task always running on a single node ] is relevant but not conclusive.
EDIT: at #jtravaglini's suggestion I tried the following variation and saw the same problem - all 45 map jobs running on a single node:
yarn jar \
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.2.0.2.0.6.0-101.jar \
wordcount /foo/data/bz2 /user/me/myoutput
At the end of the output of that task in my shell, I see:
Launched map tasks=45
Launched reduce tasks=1
Data-local map tasks=18
Rack-local map tasks=27
which is the number of data-local tasks you'd expect to see on one node just by chance alone.

File I/O in gnu parallel

I have a program that takes a single argument. I am using gnu parallel to perform parameter sweeps on this argument. Each run generates a single result, and I want to append all results into a single file, say Results.txt.
What would be a correct way to do this?
I should not have each instance open the file and write to it, as this could create conflicts and also mess up the order of results. The only way I can think of doing this is having each run generate its output in a file with a unique name, and then , when gnu parallel finishes running, merge the results into a single file using a script.
Is there a simpler way of achieving this?
What happens when multiple instances write to/read from the same file? Does gnu parallel create multiple copies, one for each instances, as it does for stdout and stderror?
thanks
If your command sends the result to stdout (standard output) the solution is trivial:
seq 1000 | parallel echo > Results.txt
GNU Parallel guarantees the output will not be mixed.
Normally GNU Parallel prints the output of a job as soon as it's completed. When jobs run for a different amount of time, this can lead to their output being mixed.
To keep the output in order, simply add -k / --keep-order parameter.
Try for example:
parallel -j4 sleep {}\; echo {} ::: 2 1 4 3
parallel -j4 -k sleep {}\; echo {} ::: 2 1 4 3