Is it possible to see qsub job scripts and command line options etc. that will be issued by Snakemake in the dry run mode? - snakemake

I am new to Snakemake and planning to use it with qsub in my cluster environment. To avoid critical mistakes that might disturb the cluster, I would like to check the qsub job scripts and the qsub command that Snakemake will generate before actually submitting the jobs to the queue.
Is it possible to see the qsub job script files etc. in dry-run mode or in some other way? I searched for relevant questions but could not find the answer. Thank you for your kind help.
Best,
Schuma

Using --printshellcmds (or its short version -p) together with --dry-run will show you the commands snakemake will feed to qsub, but you won't see the qsub options.
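For example, a minimal dry-run invocation (a sketch, assuming you run it from the directory containing your Snakefile):
# Show what would be executed and print each job's shell command; nothing is run
# or submitted, and the qsub wrapper itself is not displayed.
snakemake -n -p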
I don't know of any option that shows which parameters are given to qsub, but snakemake follows a simple set of rules, about which you can find detailed information here and here. As you'll see, you can feed arguments to qsub in several ways:
With default values, via --default-resources resource_name1=<value1> resource_name2=<value2> when invoking snakemake.
On a per-rule basis, using resources in rules (prioritized over default values).
With explicitly set values, either for the whole pipeline using --set-resources resource_name1=<value1> or for a specific rule using --set-resources rule_name:resource_name1=<value1> (prioritized over default and per-rule values); a combined sketch follows below.
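For instance, a hypothetical invocation combining these mechanisms (the rule name heavy_rule and all values here are placeholders, not taken from the question):
# Pipeline-wide defaults, plus explicit overrides for one resource-hungry rule.
snakemake --default-resources mem_mb=1000 runtime_min=60 \
          --set-resources heavy_rule:mem_mb=16000 heavy_rule:runtime_min=480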
Suppose you have the following pipeline:
rule all:
    input:
        "input.txt"
    output:
        "output.txt"
    resources:
        mem_mb=2000,
        runtime_min=240
    shell:
        """
        some_command {input} {output}
        """
If you call qsub using the --cluster option, you can access all keywords of your rules. Your command could then look like this:
snakemake all --cluster "qsub --runtime {resources.runtime} -l mem={resources.mem_mb}mb"
This means snakemake will submit the job to the cluster essentially as if you had typed the following on the command line yourself:
qsub --runtime 240 -l mem=2000mb some_command input.txt output.txt
It is up to you to decide which parameters you define where. You might want to check your cluster's documentation, or ask its administrator, which parameters are required and which should be avoided.
Also note that for cluster use, Snakemake documentation recommends setting up a profile which you can then use with snakemake --profile myprofile instead of having to specify arguments and default values each time.
Such a profile can be written in a ~/.config/snakemake/profile_name/config.yaml file. Here is an example of such a profile:
cluster: "qsub -l mem={resources.mem_mb}mb other_resource={resources.other_resource_name}"
jobs: 256
printshellcmds: true
rerun-incomplete: true
default-resources:
- mem_mb=1000
- other_resource_name="foo"
Invoking snakemake all --profile profile_name corresponds to invoking
snakemake all --cluster "qsub -l mem={resources.mem_mb}mb other_resource= resources.other_resource_name_in_snakefile}" --jobs 256 --printshellcmds --rerun-incomplete --default-resources mem_mb=1000 other_resource_name "foo"
You may also want to define test rules, for instance a minimal example of your pipeline, and try these first to verify that everything goes well before running the full pipeline.
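A cautious progression might look like this (the small target results/test_sample.txt is purely illustrative):
# 1. Dry-run a small target and inspect the printed shell commands.
snakemake -n -p results/test_sample.txt
# 2. Submit only that target through the profile.
snakemake --profile profile_name results/test_sample.txt
# 3. Once satisfied, dry-run and then submit the full pipeline.
snakemake -n -p all
snakemake --profile profile_name all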

Related

Snakemake cluster mode with different CPU count for each rule

I'm working with snakemake on a shared compute space with the SLURM scheduler. Most of my rules require very few resources/CPUs, while a couple require a lot. I want to set up snakemake to use minimal resources for every rule except the rules that require lots of resources. From my testing, snakemake always uses the core counts and memory requests in the config.yaml when submitting each rule as a subjob in cluster mode. When I define threads and mem_mb in each rule, it doesn't consider those values when submitting the rule as a job. Is there a way in cluster mode to have snakemake look at the rule's threads/mem_mb when submitting a job, or some way to customize the SLURM resource requests based on each rule? Below are my config.yaml (I call it cluster.yml in the code) and the line that runs snakemake. Thank you!
__default__:
    partition: 'standard'
    group: 'mygroup'
    M: 'myemail#gmail.com'
    time: '24:00:00'
    n: "2"
    m: '8gb'
    o: 'out/{rule}.{wildcards}.out'
    e: 'err/{rule}.{wildcards}.err'
    N: "1"
snakemake --cluster "sbatch -A {cluster.group} -p {cluster.partition} -n {cluster.n} -t {cluster.time} -N {cluster.N} --mem={cluster.m} -e {cluster.e} -o {cluster.o}" --cluster-config config/cluster.yml -j 30 --latency-wait 30
Actually I think I figured it out thanks to here. Doing
n: "{threads}"
m: "{resources[mem_mb]}mb"
will use the cores/memory defined in each rule.
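To illustrate the substitution (the rule name heavy_rule, the wildcard sample=A, and the numbers are invented for this sketch): a rule declaring threads: 8 and resources: mem_mb=32000 would then be submitted roughly as
# Approximate sbatch call assembled from the command line above; the remaining
# values come from __default__, and the trailing jobscript path is schematic.
sbatch -A mygroup -p standard -n 8 -t 24:00:00 -N 1 --mem=32000mb \
       -e err/heavy_rule.sample=A.err -o out/heavy_rule.sample=A.out /path/to/snakemake/jobscript.sh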

How can I run multiple runs of pipeline with different config files - issue with lock on .snakemake directory

I am running a snakemake pipeline from the same working directory but with different config files, and the input/output are in different directories too. The issue seems to be that, although both runs use data in different folders, snakemake creates a lock on the pipeline folder because of the .snakemake folder and the lock folder within it. Is there a way to force separate .snakemake folders? Code example below:
Both runs are run from within /home/pipelines/qc_pipeline:
run 1:
/home/apps/miniconda3/bin/snakemake -p -k -j 999 --latency-wait 10 --restart-times 3 --use-singularity --singularity-args "-B /pipelines_test/QC_pipeline/PE_trimming/,/clusterTMP/testingQC/,/home/www/codebase/references" --configfile /clusterTMP/testingQC/config.yaml --cluster-config QC_slurm_roadsheet.json --cluster "sbatch --job-name {cluster.name} --mem-per-cpu {cluster.mem-per-cpu} -t {cluster.time} --output {cluster.output}"
run 2:
/home/apps/miniconda3/bin/snakemake -p -k -j 999 --latency-wait 10 --restart-times 3 --use-singularity --singularity-args "-B /pipelines_test/QC_pipeline/SE_trimming/,/clusterTMP/testingQC2/,/home/www/codebase/references" --configfile /clusterTMP/testingQC2/config.yaml --cluster-config QC_slurm_roadsheet.json --cluster "sbatch --job-name {cluster.name} --mem-per-cpu {cluster.mem-per-cpu} -t {cluster.time} --output {cluster.output}"
error:
Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
/home/pipelines/qc_pipeline
If you are sure that no other instances of snakemake are running on this directory, the remaining lock was likely caused by a kill signal or a power loss. It can be removed with the --unlock argument.
Maarten-vd-Sande correctly points to the --nolock option (+1), but in my opinion it's a very bad idea to use --nolock routinely.
As the error says, two snakemake processes are trying to create the same file. Unless the error is a bug in snakemake, I wouldn't blindly proceed and overwrite files.
I think it would be safer to assign to each snakemake execution its own execution directory and working directory, like:
topdir=`pwd`
mkdir -p run1
cd run1
snakemake --configfile /path/to/config1.yaml ...
cd $topdir
mkdir -p run2
cd run2
snakemake --configfile /path/to/config2.yaml ...
cd $topdir
mkdir -p run3
etc...
EDIT
Actually, it should be less clunky and probably better to use the --directory/-d option:
snakemake -d run1 --configfile /path/to/config1.yaml ...
snakemake -d run2 --configfile /path/to/config2.yaml ...
...
As long as the different pipelines do not generate the same output files, you can do it with the --nolock option:
snakemake --nolock [rest of the command]
Take a look here for a short doc about nolock.

How to output the shell script even after all the rules have been finished in `snakemake`?

I have a snakemake workflow and all the rules have already finished; I want to output the shell command lines. Is there a way to do this?
I know -n -p can output the command lines before the rules have been finished.
Thanks in advance.
You simply need to use the -F option, which tells snakemake to rerun all rules even if the targets are already present on the file system.
--forceall, -F      Force the execution of the selected (or the first)
                    rule and all rules it is dependent on regardless of
                    already created output.
Don't forget the -n (dry-run) option if you don't want to actually run your pipeline again, and -p to print the shell commands.
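Putting it together, something like this should print every command of the already-finished workflow without re-executing anything:
# Force all rules + dry run + print shell commands: lists each job's command, runs nothing.
snakemake -F -n -p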

Is it possible to print commands instead of rules in snakemake dry run?

Dry runs are a super important functionality of workflow languages. What I am looking for is mostly what would be executed if I ran the command, which is exactly what one sees when running make -n.
However, the analogous functionality, snakemake -n, prints something like
Building DAG of jobs...
rule produce_output:
    output: my_output
    jobid: 0
    wildcards: var=something

Job counts:
    count   jobs
    1       produce_output
    1
The log contains pretty much everything except the commands that get executed. Is there a way to get the commands from snakemake?
snakemake -p --quiet -n
-p for print shell commands
-n for dry run
--quiet for removing the rest
EDIT 2019-Jan
This solution seems broken in recent versions of snakemake
snakemake -p -n
Avoid the --quiet option reported in the @eric-c answer; at least in some situations the combination of -p -n -q does not print the commands that would be executed without -n.

Error: "attribute "m_numa_nodes" is not a integer value." when running a qsub snakemake

I am trying to run a snakemake workflow with cluster submission for RNA-seq analysis. Here is my script:
#path to gff
GFF = "RNASeq/data/ref_GRCh38.p2_top_level.gff3"

#sample names and classes
CTHP = 'CTHP1 CTHP2'.split()
CYP = 'CYP1 CYP2'.split()
samples = CTHP + CYP

rule all:
    input:
        'CTHP1/mapping_results/out_summary.gtf',
        'CTHP2/mapping_results/out_summary.gtf',
        'CYP2/mapping_results/out_summary.gtf',
        'CYP1/mapping_results/out_summary.gtf',

rule order_sam:
    input:
        '{samples}/mapping_results/mapped.sam'
    output:
        '{samples}/mapping_results/ordered.mapped.bam'
    threads: 12
    params: ppn="nodes=1:ppn=12"
    shell:
        'samtools view -Su {input} | samtools sort > {output}'

rule count_sam:
    input:
        bam='{samples}/mapping_results/ordered.mapped.bam'
    output:
        summary='{samples}/mapping_results/out_summary.gtf',
        abun='{samples}/mapping_results/abun_results.tab',
        cover='{samples}/mapping_results/coveraged.gtf'
    threads: 12
    params: ppn="nodes=1:ppn=12"
    shell:
        'stringtie -o {output.summary} -G {GFF} -C {output.cover} '
        '-A {output.abun} -p {threads} -l {samples} {input.bam}'
I want to submit each rule to a cluster. So, in the Terminal from the working directory, I do this:
snakemake --cluster "qsub -V -l {params.ppn}" -j 6
However, the jobs are not submitted and I get following error:
Unable to run job: attribute "m_numa_nodes" is not a integer value.
Exiting.
Error submitting jobscript (exit code 1):
I have also tried to set the nodes variable directly when running the Snakefile, like this:
snakemake --cluster "qsub -V -l nodes=1:ppn=16" -j 6
and, as expected, it gave me the same error. At this point I am not sure whether it's the local cluster setup or something I am not doing right in the Snakefile. Any help would be great.
Thanks
The error does not look Snakemake related. I am not an SGE/Univa expert so I cannot really help you, but m_numa_nodes is a parameter of the engine. Snakemake does not set it in any way, so it must be either your local configuration or one of the arguments you provide to qsub.
EDIT: 2017/04/12 -- Caught one of the errors in the Google Groups Post. Remove the comma from the last line of input in your "all" rule.
**EDIT: 2017/04/13 -- Was advised the comma is not an issue**
The beauty of Snakemake is that sending it to the cluster just requires additional arguments. To determine whether it's a cluster issue or a Snakemake issue, I recommend running a dry run via
snakemake -n
A dry run will not submit any jobs, but it will return the list of jobs. This is a strong indicator of whether it's a Snakemake issue or a submission issue. I always perform dry runs during development, to ensure my Snakemake code works before I start trying to submit it to the cluster, because cluster submission can be a whole different basket of issues.
As for your submission problems, I use the --drmaa flag within Snakemake to handle my submissions to the cluster. I realize this is not what you asked for, but I really like its functionality, and I am just suggesting it as a robust alternative to your current approach.
https://pypi.python.org/pypi/drmaa OR https://anaconda.org/anaconda/drmaa
snakemake --jobs 10 --cluster-config input/config.json --drmaa "{cluster.clusterSpec}"
Inside config.json, my rules mostly all use this parameter set:
{
    "__default__": {
        "clusterSpec": "-V -S /bin/bash -o log/varScan -e log/varScan -l h_vmem=10G -pe ncpus 1"
    }
}
SGE Cluster Arguments = "-V -S /bin/bash -l h_vmem=10G -pe ncpus 1"
DRMAA Arguments = "-o log/varScan -e log/varScan"
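One practical note, offered as an assumption based on typical SGE behaviour rather than anything from the original post: the -o/-e paths in clusterSpec must exist before jobs start, so it can help to create them up front:
# Create the log directory referenced by clusterSpec, then launch via DRMAA as above.
mkdir -p log/varScan
snakemake --jobs 10 --cluster-config input/config.json --drmaa "{cluster.clusterSpec}"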
P.S. I think you should also post the operating system (e.g. CentOS 5) and the cluster type (e.g. SGE) you are using.