Snakemake cluster mode with different CPU count for each rule - snakemake

I'm working with snakemake on a shared compute space with the SLURM scheduler. Most of my rules require very few resources/CPUs, while a couple require a lot. I want to set up snakemake to use minimal resources for every rule except the few that require lots of resources. From my testing, snakemake will always use the core counts and memory requests in the config.yaml when submitting each rule as a subjob in cluster mode. When I define threads and mem_mb in each rule, it doesn't consider those values when submitting the rule as a job. Is there a way in cluster mode to have snakemake look at the rule's threads/mem_mb when submitting a job, or some other way to customize the SLURM resource requests per rule? Below are my config.yaml (I call it cluster.yml in the code) and the line that runs snakemake. Thank you!
__default__:
    partition: 'standard'
    group: 'mygroup'
    M: 'myemail#gmail.com'
    time: '24:00:00'
    n: "2"
    m: '8gb'
    o: 'out/{rule}.{wildcards}.out'
    e: 'err/{rule}.{wildcards}.err'
    N: "1"
snakemake --cluster "sbatch -A {cluster.group} -p {cluster.partition} -n {cluster.n} -t {cluster.time} -N {cluster.N} --mem={cluster.m} -e {cluster.e} -o {cluster.o}" --cluster-config config/cluster.yml -j 30 --latency-wait 30

Actually I think I figured it out thanks to here. Setting
    n: "{threads}"
    m: "{resources[mem_mb]}mb"
in the cluster config will use the cores/memory defined in each rule.
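For reference, here is a sketch of what the whole cluster.yml could look like with those placeholders (the other keys are just the defaults from the question; threads defaults to 1 if a rule doesn't set it, but rules without a mem_mb resource will need one, or a value via --default-resources, for the substitution to work):
__default__:
    partition: 'standard'
    group: 'mygroup'
    M: 'myemail#gmail.com'
    time: '24:00:00'
    # pull the per-rule values instead of hard-coding a core/memory request
    n: "{threads}"
    m: "{resources[mem_mb]}mb"
    o: 'out/{rule}.{wildcards}.out'
    e: 'err/{rule}.{wildcards}.err'
    N: "1"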

Related

Is it possible to see qsub job scripts and command line options etc. that will be issued by Snakemake in the dry run mode?

I am new to Snakemake and planning to use Snakemake with qsub in my cluster environment. In order to avoid critical mistakes that may disturb the cluster, I would like to check the qsub job scripts and the qsub command that will be generated by Snakemake before actually submitting the jobs to the queue.
Is it possible to see the qsub job script files etc. in dry run mode or in some other way? I searched for relevant questions but could not find the answer. Thank you for your kind help.
Best,
Schuma
Using --printshellcmds, or its short version -p, with --dry-run will allow you to see the commands snakemake will feed to qsub, but you won't see the qsub options.
I don't know of an option that shows which parameters are given to qsub, but snakemake follows a simple set of rules, for which you can find detailed information here and here. As you'll see, you can feed arguments to qsub in multiple ways:
With default values --default-resources resource_name1=<value1> resource_name2=<value2> when invoking snakemake.
On a per-rule basis, using resources in rules (prioritized over default values).
With explicitly set values, either for the whole pipeline using --set-resources resource_name1=<value1> or for a specific rule using --set-resources rule_name:resource_name1=<value1> (prioritized over default and per-rule values); see the sketch just below.
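As a rough sketch (rule and resource names here are illustrative, and --set-resources requires a reasonably recent Snakemake version), the three mechanisms look like this on the command line:
# 1. defaults applied to every rule
snakemake --default-resources mem_mb=1000 runtime_min=60
# 2. per-rule values are declared with "resources:" inside the Snakefile (see the pipeline below)
# 3. explicit overrides from the command line, here for a hypothetical rule "some_rule"
snakemake --set-resources some_rule:mem_mb=16000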
Suppose you have the following pipeline:
rule all:
    input:
        "input.txt"
    output:
        "output.txt"
    resources:
        mem_mb=2000,
        runtime_min=240
    shell:
        """
        some_command {input} {output}
        """
If you call qsub using the --cluster directive, you can access all keywords of your rules. Your command could then look like this:
snakemake all --cluster "qsub --runtime {resources.runtime_min} -l mem={resources.mem_mb}mb"
This means snakemake will submit the job to the cluster just as if you had typed the following directly on your command line:
qsub --runtime 240 -l mem=2000mb some_command input.txt output.txt
It is up to you to decide which parameters you define where. You might want to check your cluster's documentation, or ask its administrator, which parameters are required and which to avoid.
Also note that for cluster use, Snakemake documentation recommends setting up a profile which you can then use with snakemake --profile myprofile instead of having to specify arguments and default values each time.
Such a profile can be written in a ~/.config/snakemake/profile_name/config.yaml file. Here is an example of such a profile:
cluster: "qsub -l mem={resources.mem_mb}mb other_resource={resources.other_resource_name}"
jobs: 256
printshellcmds: true
rerun-incomplete: true
default-resources:
    - mem_mb=1000
    - other_resource_name="foo"
Invoking snakemake all --profile profile_name corresponds to invoking
snakemake all --cluster "qsub -l mem={resources.mem_mb}mb other_resource={resources.other_resource_name}" --jobs 256 --printshellcmds --rerun-incomplete --default-resources mem_mb=1000 other_resource_name="foo"
You may also want to define test rules, like a minimal example of your pipeline for instance, and try these first to verify all goes well before running your full pipeline.
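For example, a throwaway rule along these lines (the rule name, output path, and resource values are made up) lets you exercise the submission machinery cheaply before launching the real workflow:
rule smoke_test:
    output:
        "test/smoke_test.txt"
    resources:
        mem_mb=100,
        runtime_min=5
    shell:
        "echo ok > {output}"
Running snakemake test/smoke_test.txt --profile profile_name should then submit a single, tiny job.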

How to run only one rule in snakemake

I have created a workflow with snakemake, and I have a problem when I want to run just one rule: it also runs the rules whose outputs are the inputs of my rule, even if those outputs were already created before.
Example:
rule A:
    input: input_A
    output: output_A

rule b:
    input: output_A
    output: output_b

rule c:
    input: output_b
    output: output_c
How can I run just the rule C?
If there are dependencies, I have found that only --until works: if you want to run rule c, just run snakemake -R --until c. If there are assumed dependencies, like shared input or output paths, it will force you to run the upstream rules unless you use --until. Always run first with -n for a dry run.
You can use the --allowed-rules option.
snakemake --allowed-rules c
Snakemake will try to rerun upstream rules linked by the input/output chain to your downstream rule if the output file(s) of the upstream rule(s) have changed (including if they've been re-created but the content hasn't changed). This behavior makes Snakemake reproducible, but maybe isn't desirable if you're trying to debug a specific part of your pipeline and don't want to run all the intermediate steps.
See this discussion:
https://bitbucket.org/snakemake/snakemake/issues/688/execute-specified-rule-only-and-not
You just run:
snakemake -R b
To see what this will do in advance:
snakemake -R b -n
-R selects the one rule (and all its dependent rules as well!), -n does a "dry run": it just prints what it would do if run without -n.
I think "--force" = "-f" is what is asked for here:
snakemake --force c
snakemake -f c
--force, -f Force the execution of the selected target or the first rule regardless of already created output. (default: False)
--forceall, -F Force the execution of the selected (or the first) rule and all rules it is dependent on regardless of already created output. (default: False)
--forcerun [TARGET ...], -R [TARGET ...] Force the re-execution or creation of the given rules or files. Use this option if you changed a rule and want to have all its output in your workflow updated. (default: None)
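To see the difference between the two in practice before touching anything, a dry run of each (using rule c from the question) is a cheap check:
# rerun only rule c, even though its output already exists
snakemake -n -f c
# rerun rule c and everything it depends on
snakemake -n -F c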

How to output the shell script even after all the rules have been finished in `snakemake`?

I have a snakemake workflow and all the rules have already finished; I want to output the shell command lines. Is there a way to do this?
I know -n -p can output the command lines before the rules have finished.
Thanks in advance.
You simply need to use the option -F which tells snakemake to rerun all rules even if targets are already present on the file system.
--forceall, -F Force the execution of the selected (or the first) rule and all rules it is dependent on regardless of already created output.
Don't forget the -n (dry-run) option if you don't want to actually run your pipeline again, and -p to print the shell commands.
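Putting those flags together, something like the following should print the shell commands of the already-finished workflow without rerunning anything:
snakemake -F -n -p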

Is it possible to print commands instead of rules in snakemake dry run?

Dry runs are a super important functionality of workflow languages. What I am looking for is mostly what would be executed if I ran the command, which is exactly what one sees when running make -n.
However, the analogous functionality, snakemake -n, prints something like
Building DAG of jobs...

rule produce_output:
    output: my_output
    jobid: 0
    wildcards: var=something

Job counts:
    count   jobs
    1       produce_output
    1
The log contains pretty much everything except the commands that get executed. Is there a way to get the commands from snakemake?
snakemake -p --quiet -n
-p for print shell commands
-n for dry run
--quiet for removing the rest
EDIT 2019-Jan
This solution seems broken in the latest versions of snakemake; use
snakemake -p -n
Avoid the --quiet option reported in eric-c's answer: at least in some situations the combination -p -n -q does not print the commands that would be executed.

Error: "attribute "m_numa_nodes" is not a integer value." when running a qsub snakemake

I am trying to run a snakemake with cluster submission for RNAseq analysis. Here is my script:
# path to gff
GFF = "RNASeq/data/ref_GRCh38.p2_top_level.gff3"

# sample names and classes
CTHP = 'CTHP1 CTHP2'.split()
CYP = 'CYP1 CYP2'.split()
samples = CTHP + CYP

rule all:
    input:
        'CTHP1/mapping_results/out_summary.gtf',
        'CTHP2/mapping_results/out_summary.gtf',
        'CYP2/mapping_results/out_summary.gtf',
        'CYP1/mapping_results/out_summary.gtf',

rule order_sam:
    input:
        '{samples}/mapping_results/mapped.sam'
    output:
        '{samples}/mapping_results/ordered.mapped.bam'
    threads: 12
    params: ppn="nodes=1:ppn=12"
    shell:
        'samtools view -Su {input} | samtools sort > {output}'

rule count_sam:
    input:
        bam='{samples}/mapping_results/ordered.mapped.bam'
    output:
        summary='{samples}/mapping_results/out_summary.gtf',
        abun='{samples}/mapping_results/abun_results.tab',
        cover='{samples}/mapping_results/coveraged.gtf'
    threads: 12
    params: ppn="nodes=1:ppn=12"
    shell:
        'stringtie -o {output.summary} -G {GFF} -C {output.cover} '
        '-A {output.abun} -p {threads} -l {samples} {input.bam}'
I want to submit each rule to a cluster. So, in the Terminal from the working directory, I do this:
snakemake --cluster "qsub -V -l {params.ppn}" -j 6
However, the jobs are not submitted and I get following error:
Unable to run job: attribute "m_numa_nodes" is not a integer value.
Exiting.
Error submitting jobscript (exit code 1):
I have also tried to set the nodes variable directly when running the snake file like this:
snakemake --cluster "qsub -V -l nodes=1:ppn=16" -j 6
and as expected, it gave me the same error. At this point I am not sure if it's the local cluster setup or something that I am not doing right in the Snakefile. Any help would be great.
Thanks
The error does not look Snakemake related. I am not an SGE/Univa expert so I cannot really help you, but m_numa_nodes is a parameter of the engine. Snakemake does not set it in any way, so it must be either your local configuration or one of the arguments you provide to qsub.
EDIT: 2017/04/12 -- Caught one of the errors in the Google Groups Post. Remove the comma from the last line of input in your "all" rule.
EDIT: 2017/04/13 -- Was advised the comma is not an issue.
The beauty of Snakemake is that sending it to the cluster just requires additional arguments. To determine whether it's a cluster issue or a Snakemake issue, I recommend running a dry run via
snakemake -n
A dry run will not submit any jobs, but it will return the list of jobs. This is a strong indicator of whether it's a Snakemake issue or a submission issue. I always perform dry runs during development to ensure my Snakemake code works before I start trying to submit it to the cluster, because cluster submissions can be a whole different basket of issues.
As for your submission problems, I use the "--drmaa" flag within Snakemake to handle my submissions to the cluster. I realize this is not what you asked for, but I really enjoy its functionality, and I am suggesting it as a robust alternative to your current approach.
https://pypi.python.org/pypi/drmaa OR https://anaconda.org/anaconda/drmaa
snakemake --jobs 10 --cluster-config input/config.json --drmaa "{cluster.clusterSpec}"
Inside config.json, my rules mostly all use this parameter set:
{
    "__default__": {
        "clusterSpec": "-V -S /bin/bash -o log/varScan -e log/varScan -l h_vmem=10G -pe ncpus 1"
    }
}
SGE Cluster Arguments = "-V -S /bin/bash -l h_vmem=10G -pe ncpus 1"
DRMAA Arguments = "-o log/varScan -e log/varScan"
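For comparison, if DRMAA is not available on your system, a plain --cluster submission reusing the same clusterSpec key might look roughly like this (untested sketch; qsub accepts the -o/-e options directly as well):
snakemake --jobs 10 --cluster-config input/config.json --cluster "qsub {cluster.clusterSpec}"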
P.S. You should also post the operating system (e.g. CentOS 5) and the cluster type (e.g. SGE) you are using.