"scancel: error: Invalid job id Submitted batch job" with --cluster-cancel from snakemake - snakemake

I am running snakemake using this command:
snakemake --profile slurm -j 1 --cores 1 --cluster-cancel "scancel"
which writes this to standard out:
Submitted job 224 with external jobid 'Submitted batch job 54174212'.
but after I cancel the run with ctrl + c, I get the following error:
scancel: error: Invalid job id Submitted batch job 54174212
My guess is that the stored jobid is the full string 'Submitted batch job 54174212',
and that snakemake runs scancel 'Submitted batch job 54174212' instead of the expected scancel 54174212. If that is the case, how do I change the jobid to something that works with scancel?

Your suspicion is probably correct: snakemake is trying to cancel a job whose stored id is the full string 'Submitted batch job 54174212'.
Check the slurm profile you invoke (standard location ~/.config/snakemake/slurm/config.yaml):
does the sbatch command it defines include the --parsable flag? With --parsable, sbatch prints only the numeric job id instead of the full 'Submitted batch job ...' line, so snakemake stores an id that scancel can use.
Forgetting that flag is a mistake I have made before; adding it solved the problem for me.
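For reference, a minimal sketch of what such a profile might contain is shown below; the surrounding options are placeholders of my own, and the relevant part is --parsable:
# ~/.config/snakemake/slurm/config.yaml (sketch; real profiles carry more options)
cluster: "sbatch --parsable"
cluster-cancel: "scancel"
jobs: 100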

Related

Erroneous MissingOutputException errors on Google Cloud/Kubernetes

Executing snakemake with --kubernetes on GCP, I am running into erroneous MissingOutputException errors. Looking at the logs, it seems that the jobs ended successfully and the output files were successfully uploaded to the bucket. The reported missing files appear to be intact and look as expected. Unfortunately, I have not been able to reliably recreate this issue, so it's difficult to determine what the cause may be. I have tried increasing --latency-wait to 900, with no effect.
I would appreciate any insight into how Snakemake determines which files may be missing, as that seems to be the best place to start. Digging through the source code myself, I could not quite figure it out.
Edit 2/23/22, adding example rule:
rule dedup:
    input:
        get_bams_for_dedup
    output:
        dedupBam = config['output'] + "{Organism}/{refGenome}/" + config['bamDir'] + "{sample}" + config['bam_suffix'],
        dedupBai = config['output'] + "{Organism}/{refGenome}/" + config['bamDir'] + "{sample}" + "_final.bam.bai",
    conda:
        "../envs/sambamba.yml"
    resources:
        threads = res_config['dedup']['threads'],
        mem_mb = lambda wildcards, attempt: attempt * res_config['dedup']['mem']
    log:
        "logs/{Organism}/dedup/{refGenome}_{sample}.txt"
    benchmark:
        "benchmarks/{Organism}/dedup/{refGenome}_{sample}.txt"
    shell:
        "sambamba markdup -t {threads} {input} {output.dedupBam} 2> {log}"
This issue also leads to an IncompleteFilesException when trying to restart the workflow, which doesn't make sense: when Snakemake is run on Kubernetes, the output file is uploaded to the bucket when the job finishes, so the presence of the output files in the bucket means the job must have completed successfully.
There seems to be something going on with how Snakemake determines whether an output file in the bucket is 'incomplete'. I imagine it may have to do with the timestamps of the file versus the time the Kubernetes job that created the file was submitted, but I'm not sure. I would appreciate feedback.

Is there a way to get a nice error report summary when running many jobs on DRMAA cluster?

I need to run a snakemake pipeline on a DRMAA cluster with >2000 jobs in total. When some of the jobs fail, I would like to receive an easily readable summary report at the end that lists only the failed jobs, instead of the full job summary given in the log.
Is there a way to achieve this without parsing the log file by myself?
These are the (incomplete) cluster options:
jobs: 200
latency-wait: 5
keep-going: True
rerun-incomplete: True
restart-times: 2
I am not sure whether there is another way besides parsing the log file yourself, but I've done it several times with grep and I am happy with the results:
cat .snakemake/log/[TIME].snakemake.log | grep -B 3 -A 3 error
Of course you should replace the [TIME] placeholder with whichever run you want to check.
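If you only care about the failed jobs, grepping for the "Error in rule" blocks that Snakemake writes when a job fails (assuming your version uses that phrasing) gives a tighter summary, for example:
grep -A 10 "Error in rule" .snakemake/log/[TIME].snakemake.log
The -A 10 is an arbitrary amount of trailing context; adjust it so that the output, log and shell lines of each failed job are still included.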

Snakemake: Is it possible to only display "Job counts" using --dryrun?

How can I make Snakemake display only the Job counts fields on a dry run? When performing a real run, that's the first information Snakemake outputs before starting the jobs.
Currently, the way I get job counts is to run Snakemake without the -n flag and immediately cancel it (^C), but that's far from ideal.
Letting the dry run complete will output the Job counts at the end, but that's not feasible for pipelines with hundreds or thousands of jobs.
Desired output:
$ snakemake -n --someflag
Job counts:
count jobs
504 BMO
1 all
504 fit_nbinoms
517 motifs_in_peaks
503 motifs_outside_peaks
2029
$
The -q flag does this:
--quiet, -q Do not output any progress or rule information.
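Combined with the dry-run flag, that would look like this (reusing the --someflag placeholder from the question):
snakemake -n -q --someflag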

Snakemake in cluster mode with --no-shared-fs: How to set cluster-status

I'm running Snakemake in a cluster environment and would like to use S3 as shared file system for writing output files.
Options --default-remote-provider, --default-remote-prefix and --no-shared-fs are set accordingly. The cluster uses UGE as scheduler, so setting --cluster is straightforward, but how do I set --cluster-status, whose use is enforced when using --no-shared-fs?
My best guess was a naive --cluster-status "qstat -j" which resulted in
subprocess.CalledProcessError: Command 'qstat Your job 2 ("snakejob.bwa_map.1.sh") has been submitted' returned non-zero exit status 1.
So I guess my question is, how do I get the actual jobid in there?
Thanks!
Andreas
EDIT 1:
I found https://groups.google.com/forum/#!topic/snakemake/7cyqAIfgeq4, so --cluster-status has to be a script. I therefore wrote a Python script that can parse the above line; however, snakemake still fails with:
/bin/sh: -c: line 0: syntax error near unexpected token `('
/bin/sh: -c: line 0: `/home/ec2-user/clusterstatus.py Your job 2 ("snakejob.bwa_map.1.sh") has been submitted'
...
subprocess.CalledProcessError: Command '/home/ec2-user/clusterstatus.py
Your job 2 ("snakejob.bwa_map.1.sh") has been submitted' returned non-zero exit status 1.
To answer my own question:
First, I needed the -terse option for qsub, so that only the job id is printed on submission (I had not added it at first, and snakemake somehow kept reusing the wrong cluster command).
Second, the --cluster-status argument needs to point to a script that takes the job id as its only argument, determines the job status, and prints "failed", "running" or "success".
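For illustration only, a minimal sketch of such a script for UGE could look like the following; it assumes that qstat -j returns a non-zero exit code once a job has left the queue and that qacct already has the accounting record available (which can lag a little behind job completion):
#!/usr/bin/env bash
# Hypothetical --cluster-status script for UGE (a sketch, not the poster's actual script).
# Usage: clusterstatus.sh <jobid>
jobid="$1"

if qstat -j "$jobid" > /dev/null 2>&1; then
    # The scheduler still knows the job, so it is pending or running.
    echo running
else
    # The job has left the queue; look up its exit status in the accounting records.
    exit_status=$(qacct -j "$jobid" 2> /dev/null | awk '/^exit_status/ {print $2}')
    if [[ "$exit_status" == "0" ]]; then
        echo success
    else
        echo failed
    fi
fi
You would then point Snakemake at it with --cluster-status /path/to/clusterstatus.sh (after making the script executable).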

Sun Grid Engine resubmit job stuck in 'Rq' state

I have what I hope is a pretty simple question, but I'm not super familiar with Sun Grid Engine, so I've been having trouble finding the answer. I am currently submitting jobs to a grid using a bash submission script that generates a command and then executes it. I have read online that if a Sun Grid Engine job exits with code 99, it gets re-submitted to the grid. I have written my bash script to do this:
[code to generate command, stored in $command]
$command
STATUS=$?
if [[ $STATUS -ne 0 ]]; then
    exit 99
fi
exit 0
When I submit this job to the grid with a command that I know has a non-zero exit status, the job does indeed appear to be resubmitted; however, the scheduler never sends it to another host. Instead, it just remains stuck in the queue with the status "Rq":
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2150015 0.55500 GridJob.sh my_user Rq 04/08/2013 17:49:00 1
I have a feeling that this is something simple in the configuration options for the queue, but I haven't been able to find anything by googling. I've tried submitting this job with the qsub -r y option, but that doesn't seem to change anything.
Thanks!
Rescheduled jobs will only get run in queues that have their rerun attribute (FALSE by default) set to TRUE, so check your queue configuration (qconf -mq myqueue). Without this, your job remains in the rescheduled-pending state indefinitely because it has nowhere to go.
IIRC, submitting jobs with qsub -r yes only qualifies them for automatic rescheduling in the event of an exec node crash; exiting with status 99 should trigger a reschedule regardless.
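As a concrete check, using the hypothetical queue name myqueue from above:
qconf -sq myqueue | grep rerun   # show the current setting; expect "rerun FALSE" if it was never changed
qconf -mq myqueue                # opens the queue configuration in an editor; set the rerun line to TRUE
After saving the change, rescheduled jobs should be eligible to run in that queue again.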