snakemake temp() causing unnecessary rerun of rules - snakemake

I'm using snakemake v 5.4.0, and I'm running into a problem with temp(). In a hypothetical scenario:
Rule A --> Rule B1 --> Rule C1
      |
      +--> Rule B2 --> Rule C2
where Rule A generates temp() files used by both pathways 1 (B1 + C1) and 2 (B2 + C2).
If I run the pipeline, the temp() files generated by Rule A are deleted after they are used in both pathways, which is what I expect. However, if I then want to re-run Pathway 2, the temp() files from Rule A must be recreated, which triggers a re-run of the entire pipeline, not just Pathway 2. This becomes very computationally expensive for long pipelines. Is there a good way to prevent this besides not using temp(), which in my case would require many TB of extra hard drive space?

You could create the list of input files to rule all, or whatever the first rule is called, dynamically depending on whether the output of Pathway 2 already exists (and satisfies some sanity checks).
import os  # needed for os.path.exists below

output = ['P1.out']
if not os.path.exists('P2.out'):  # Some more conditions here...
    output.append('P2.out')

rule all:
    input:
        output

rule make_tmp:
    output:
        temp('a.out')
    shell:
        r"""
        touch {output}
        """

rule make_P1:
    input:
        'a.out'
    output:
        'P1.out'
    shell:
        r"""
        touch {output}
        """

rule make_P2:
    input:
        'a.out'
    output:
        'P2.out'
    shell:
        r"""
        touch {output}
        """
However, this somewhat defeats the point of using snakemake. If the input of Pathway 1 has to be recreated, how can you be sure that its output is still up-to-date?
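Depending on your Snakemake version, the --notemp command line option may also be worth a look: it tells Snakemake to ignore temp() declarations for that invocation, which is intended precisely for re-running parts of a workflow without deleting files that other parts may still need. For example:

snakemake --notemp P2.out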

Related

How does snakemake handle possible corruptions due to a rule run in parallel simultaneously appending to a single file?

I would like to learn how snakemake handles the following situations, and what the best practice is to avoid collisions/corruption.
rule something:
    input:
        expand("/path/to/out-{asd}.txt", asd=LIST)
    output:
        "/path/to/merged.txt"
    shell:
        "cat {input} >> {output}"
With snakemake -j10 the command will try to append to the same file simultaneously, and I could not figure out whether this can lead to corruption or whether it is already handled.
Also, how are more complicated cases handled, e.g. where it is not just cat but the return value of another process (computed from each input) being appended to the same file? Is the best practice to first write the results to individual files and then cat them together?
rule get_merged_total_distinct:
    input:
        expand("{dataset_id}/merge_libraries/{tomerge}_merged_rmd.bam", dataset_id=config["dataset_id"], tomerge=list(TOMERGE.keys()))
    output:
        "{dataset_id}/merge_libraries/merged_total_distinct.csv"
    params:
        with_dups="{dataset_id}/merge_libraries/{tomerge}_merged.bam"
    shell:
        """
        RCT=$(samtools view -@4 -c -F1 -F4 -q 30 {params.with_dups})
        RCD=$(samtools view -@4 -c -F1 -F4 -q 30 {input})
        printf "{wildcards.tomerge},${{RCT}},${{RCD}}\n" >> {output}
        """
or cases where an external script is being called to print the result to a single output file?
input:
    expand("infile/{x}", ...) # expanded as above
output:
    "results/all.txt"
shell:
    """
    bash script.sh {params.x} {input} {params.y} >> {output}
    """
With your example, the shell directive will expand to
cat /path/to/out-SAMPLE1.txt /path/to/out-SAMPLE2.txt [...] >> /path/to/merged.txt
where SAMPLE1, etc. come from LIST. In this case there is no collision, corruption, or race condition. One thread will run that command as if you had typed it in your shell, and all the inputs will be concatenated into the output. Since snakemake is pull based, once the output exists, that rule will only run again if the inputs change, at which point the new inputs will be appended to the old contents because of >>. As such, I would recommend using > so the old contents are removed; rules should be deterministic where possible.
Now, if you had done something like
rule something:
    input:
        "/path/to/out-{asd}.txt"
    output:
        touch("/path/to/merged-{asd}.txt")
    params:
        output="/path/to/merged.txt"
    shell:
        "cat {input} >> {params.output}"

# then invoke
snakemake -j10 /path/to/merged-{a..z}.txt
Things are messier. Snakemake will launch the jobs, 10 at a time, all appending to the single merged.txt. Note that the merged file is now a parameter and we are targeting dummy files. This will behave as if you had 10 different shells and executed the commands
cat /path/to/out-a.txt >> /path/to/merged.txt
# ...
cat /path/to/out-z.txt >> /path/to/merged.txt
all at once. The output will have a random order and lines may be interleaved or interrupted.
As some guidance:
Try to make outputs deterministic. Given the same inputs you should always produce the same outputs. If possible, set random seeds and enforce input ordering. In the second example, you have no idea what the output will be.
Don't use the append operator. This follows from the first point. If the output already exists and needs to be updated, start from scratch.
If you need to append a bunch of outputs, say log files or to create a summary, do so in a separate rule (see the sketch below). This again follows from the first point, but it's the only reason I can think of to use append.
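For example, a minimal sketch of that last pattern (rule and file names are made up for illustration): each job writes its own file, and a separate rule builds the summary with > so it is recreated from scratch whenever an input changes.

rule per_sample:
    input:
        "/path/to/out-{asd}.txt"
    output:
        "counts/{asd}.txt"
    # one output per input; no shared file, so running with -j10 is safe
    shell:
        "wc -l {input} > {output}"

rule summarize:
    input:
        expand("counts/{asd}.txt", asd=LIST)
    output:
        "results/summary.txt"
    # overwrite rather than append: deterministic given the same inputs
    shell:
        "cat {input} > {output}"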
Hope that helps. Otherwise you can comment or edit with a more realistic example of what you are worried about.

Snakemake shell command should only take one file at a time, but it's trying to do multiple files at once

First off, I'm sorry if I'm not explaining my problem clearly, English is not my native language.
I'm trying to make a snakemake rule that takes a fastq file and filters it with a program called Filtlong. I have multiple fastq files on which I want to run this rule, and it should output one filtered file per fastq file, but apparently it takes all of the fastq files as input for a single Filtlong command.
The fastq files are in separate directories and the snakemake rule should write the filtered files to separate directories as well.
This is how my code looks right now:
from os import listdir

configfile: "config.yaml"

DATA = config["DATA"]
SAMPLES = listdir(config["RAW_DATA"])
RAW_DATA = config["RAW_DATA"]
FILT_DIR = config["FILTERED_DIR"]

rule all:
    input:
        expand("{FILT_DIR}/{sample}/{sample}_filtered.fastq.gz", FILT_DIR=FILT_DIR, sample=SAMPLES)

rule filter_reads:
    input:
        expand("{RAW_DATA}/{sample}/{sample}.fastq", sample=SAMPLES, RAW_DATA=RAW_DATA)
    output:
        "{FILT_DIR}/{sample}/{sample}_filtered.fastq.gz"
    shell:
        "filtlong --keep_percent 90 --target_bases 300000000 {input} | gzip > {output}"
And this is the config file:
DATA:
    all_samples
RAW_DATA:
    all_samples/raw_samples
FILTERED_DIR:
    all_samples/filtered_samples
The separate directories with the fastq files are in RAW_DATA, and the directories with the filtered files should be in FILTERED_DIR.
When I try to run this, I get an error that looks something like this:
Error in rule filter_reads:
jobid: 30
output: all_samples/filtered_samples/cell_18-07-19_barcode10/cell_18-07-19_barcode10_filtered.fastq.gz
shell:
filtlong --keep_percent 90 --target_bases 300000000 all_samples/raw_samples/cell3_barcode11/cell3_barcode11.fastq all_samples/raw_samples/barcode01/barcode01.fastq all_samples/raw_samples/barcode03/barcode03.fastq all_samples/raw_samples/barcode04/barcode04.fastq all_samples/raw_samples/barcode05/barcode05.fastq all_samples/raw_samples/barcode06/barcode06.fastq all_samples/raw_samples/barcode07/barcode07.fastq all_samples/raw_samples/barcode08/barcode08.fastq all_samples/raw_samples/barcode09/barcode09.fastq all_samples/raw_samples/cell3_barcode01/cell3_barcode01.fastq all_samples/raw_samples/cell3_barcode02/cell3_barcode02.fastq all_samples/raw_samples/cell3_barcode03/cell3_barcode03.fastq all_samples/raw_samples/cell3_barcode04/cell3_barcode04.fastq all_samples/raw_samples/cell3_barcode05/cell3_barcode05.fastq all_samples/raw_samples/cell3_barcode06/cell3_barcode06.fastq all_samples/raw_samples/cell3_barcode07/cell3_barcode07.fastq all_samples/raw_samples/cell3_barcode08/cell3_barcode08.fastq all_samples/raw_samples/cell3_barcode09/cell3_barcode09.fastq all_samples/raw_samples/cell3_barcode10/cell3_barcode10.fastq all_samples/raw_samples/cell3_barcode12/cell3_barcode12.fastq all_samples/raw_samples/cell_18-07-19_barcode01/cell_18-07-19_barcode01.fastq all_samples/raw_samples/cell_18-07-19_barcode02/cell_18-07-19_barcode02.fastq all_samples/raw_samples/cell_18-07-19_barcode03/cell_18-07-19_barcode03.fastq all_samples/raw_samples/cell_18-07-19_barcode04/cell_18-07-19_barcode04.fastq all_samples/raw_samples/cell_18-07-19_barcode05/cell_18-07-19_barcode05.fastq all_samples/raw_samples/cell_18-07-19_barcode06/cell_18-07-19_barcode06.fastq all_samples/raw_samples/cell_18-07-19_barcode07/cell_18-07-19_barcode07.fastq all_samples/raw_samples/cell_18-07-19_barcode08/cell_18-07-19_barcode08.fastq all_samples/raw_samples/cell_18-07-19_barcode09/cell_18-07-19_barcode09.fastq all_samples/raw_samples/cell_18-07-19_barcode10/cell_18-07-19_barcode10.fastq all_samples/raw_samples/cell_18-07-19_barcode11/cell_18-07-19_barcode11.fastq all_samples/raw_samples/cell_18-07-19_barcode12/cell_18-07-19_barcode12.fastq all_samples/raw_samples/cell_18-07-19_barcode13/cell_18-07-19_barcode13.fastq all_samples/raw_samples/cell_18-07-19_barcode14/cell_18-07-19_barcode14.fastq all_samples/raw_samples/cell_18-07-19_barcode15/cell_18-07-19_barcode15.fastq all_samples/raw_samples/cell_18-07-19_barcode16/cell_18-07-19_barcode16.fastq all_samples/raw_samples/cell_18-07-19_barcode17/cell_18-07-19_barcode17.fastq all_samples/raw_samples/cell_18-07-19_barcode18/cell_18-07-19_barcode18.fastq all_samples/raw_samples/cell_18-07-19_barcode19/cell_18-07-19_barcode19.fastq | gzip > all_samples/filtered_samples/cell_18-07-19_barcode10/cell_18-07-19_barcode10_filtered.fastq.gz
(exited with non-zero exit code)
As far as I can tell, the rule takes all of the fastq files as input for a single Filtlong command, but I don't quite understand why.
You shouldn't use the expand function in the input section of your filter_reads rule. What you are doing now is requiring all of your samples to be the input of each filtered file: that is exactly what you can observe in your error message.
There is another complication that you have introduced unnecessarily: you are mixing wildcards and variables. In your example, {FILT_DIR} is just a predefined value, while {sample} is a wildcard that Snakemake uses to match rules. Try the following (pay special attention to the single versus double braces, and to the formatted string literals, the ones of the form f""):
rule filter_reads:
    input:
        f"{RAW_DATA}/{{sample}}/{{sample}}.fastq"
    output:
        f"{FILT_DIR}/{{sample}}/{{sample}}_filtered.fastq.gz"
    shell:
        "filtlong --keep_percent 90 --target_bases 300000000 {input} | gzip > {output}"

Can I have the output as a directory with input having wildcards in Snakemake? To get around jobs that fail and force rule order

I have a rule that runs a tool on multiple samples (some fail), and I use the -k option to proceed with the remaining samples. However, for my next step I need to check the output of the first rule and create a single text summary file. I cannot get the next rule to execute after the first rule.
I have tried various things, including giving rule fastQC_post an output with wildcards. But then, if I use this as input for the next rule, I can't have one output file. If I use expand in the input of rule checkQC, with {sample} as the list of all samples determined at initiation, this breaks, as not all samples successfully reach the fastQC stage.
I really just want to be able to create the post_fastqc_reports folder in the fastQC_post rule and use this as input for my checkQC rule, OR to be able to force checkQC to run after fastQC_post has finished; checkpoints don't work here either, as some of the jobs for fastQC_post fail.
I would like something like below (this does not work, as the output directory does not use the wildcard).
Surely there is an easier way to force rule order?
rule fastQC_post:
    """
    runs FastQC on the trimmed reads
    """
    input:
        projectDir+batch+"_trimmed_reads/{sample}_trimmed.fq.gz"
    output:
        directory(projectDir+batch+"post_fastqc_reports/")
    log:
        projectDir+"logs/{sample}_trimmed_fastqc.log"
    params:
        p = fastqcParams,
    shell:
        """
        /home/apps/pipelines/FastQC/CURRENT {params.p} -o {output} {input}
        """
rule checkQC:
    input:
        rules.parse_sampleFile.output[0],
        rules.parse_sampleFile.output[1],
        rules.parse_sampleFile.output[2],
        directory(projectDir+batch+"post_fastqc_reports/")
    output:
        projectDir+"summaries/"+batch+"_summary_tg.txt",
        projectDir+"summaries/"+batch+"_listForFastqc.txt",
        projectDir+"summaries/"+batch+"_trimmingResults.txt",
        projectDir+"summaries/"+batch+"_summary_fq.txt"
    log:
        projectDir+"logs/"+stamp+"_checkQC.log"
    shell:
        """
        python python_scripts/fastqc_checks.py --input_file {log} --output {output[1]} {batchCmd}
        python python_scripts/trimGalore_checks.py --list_file {input[0]} --single {input[1]} --pair {input[2]} --log {log} --output {output[0]} --trimDir {trimDir} --sampleFile \
            {input[3]} {batchCmd}
        """
With the above I get the error that not all output and log files contain the same wildcards as the input files of the fastQC_post rule.
I just want to be able to run my checkQC rule after my fastQC_post rule (regardless of failures of jobs in the fastQC_post rule).

Run one rule conditionally to the input files

I've just started to learn Snakemake, so this is probably a naive question.
I need a rule to be launched at two different points in the pipeline, using different inputs and producing different outputs.
Let me make a silly example. Let's say I have 3 rules:
rule A:
    input:
        "data/sample.1.txt"
    output:
        "data/sample.2.sorted.txt"
    shell:
        "somethingsomething sort {input}"

rule B:
    input:
        "data/sample.2.sorted.txt"
    output:
        "data/sample.3.man.txt"
    shell:
        "somethingsomething manipulate {input}"

rule C:
    input:
        "data/sample.3.man.txt"
    output:
        "data/sample.4.man.sorted.txt"
    shell:
        "somethingsomething sort {input}"
In this example the pipeline does A > B > C; A and C do exactly the same thing, but one takes file 1 and outputs file 2, while the other takes file 3 and outputs file 4. What is the best solution for having a single sorting rule, so that the pipeline effectively does A > B > A? (Probably there is some way to do it with wildcards, maybe in combination with an if/else, but I am not sure how.)
Thank you for your time
Probably there's some way to re-use a rule, but I suspect it's going to make the code more convoluted than necessary.
If the shell call is the same for rules A and C and you don't want to copy and paste the code, just put the command in a variable and re-use that variable:
sorter= "somethingsomething sort {input}"
rule A:
input:
"data/sample.1.txt"
output:
"data/sample.2.sorted.txt"
shell:
sorter
rule B:
input:
"data/sample.2.sorted.txt"
output:
"data/sample.3.man.txt"
shell:
"somethingsomething manipulate {input}"
rule C:
input:
"data/sample.3.man.txt"
output:
"data/sample.4.man.sorted.txt"
shell:
sorter
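This works because the shell directive is only a format string; each job fills in its own {input} and {output} when it executes, so the shared variable expands differently per rule:

# rule A effectively runs:
somethingsomething sort data/sample.1.txt
# rule C effectively runs:
somethingsomething sort data/sample.3.man.txt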

Understanding and overcoming AmbiguousRuleException in snakemake

I have a complicated workflow which I progressively extended. The last extension resulted in an AmbiguousRuleException. I tried to reproduce the critical structure of the workflow in the following example:
NUMBERS = ["1", "2"]
LETTERS = ["a", "b", "c"]
WORDS = ["foo", "bar", "baz"]
CHOICES = ["yes", "no"]
rule all:
    input:
        # (1)
        expand("results/allthings/{word}_{choice}.md5sum", word=WORDS, choice=CHOICES)
        #expand("results/allthings/{word}_{choice}.md5sum", word=WORDS + ["all"], choice=CHOICES)

rule make_things:
    output:
        "results/{letter}_{number}/{word}_{choice}.txt"
    shell:
        """
        echo "{wildcards.letter}_{wildcards.number}_{wildcards.word}_{wildcards.choice}" > {output}
        """

rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    shell:
        """
        cat {input} > {output}
        """

# (2)
#rule join_all_words:
#    input:
#        expand("results/allthings/{word}_{{choice}}.txt", word=WORDS)
#    output:
#        "results/allthings/all_{choice}.txt"
#    shell:
#        """
#        cat {input} > {output}
#        """

# (3)
#def source_data(wildcards):
#    if wildcards.word == "all":
#        return rules.join_all_words.output
#    else:
#        return rules.gather_things.output

rule compute_md5:
    input:
        # (4)
        rules.gather_things.output,
        #source_data
    output:
        "results/allthings/{word}_{choice}.md5sum"
    shell:
        """
        md5sum {input} > {output}
        """
The above state is functional. Switching (1) and (4) to the commented alternatives and uncommenting (2) and (3) corresponds to the extension I'm trying to make, and results in the following failure:
AmbiguousRuleException:
Rules gather_things and join_all_words are ambiguous for the file results/allthings/all_yes.txt.
Expected input files:
gather_things: results/a_1/all_yes.txt results/a_2/all_yes.txt results/b_1/all_yes.txt results/b_2/all_yes.txt results/c_1/all_yes.txt results/c_2/all_yes.txt
join_all_words: results/allthings/foo_yes.txt results/allthings/bar_yes.txt results/allthings/baz_yes.txt
It seems that snakemake thinks that results/allthings/all_yes.txt can be generated by gather_things.
Why?
How can I avoid that?
Note: The goal of modifications (3) and (4) is to have compute_md5 work both on the direct output of gather_things (for foo, bar and baz) and on the joined output of the three (all), keeping inputs defined in terms of other rules' outputs as much as possible (which makes changes easier than when file names are written out explicitly).
Initially I thought it was just ambiguity. The first three points relate to resolving the ambiguity; afterwards, I explain how to generalize compute_md5 to achieve the desired behaviour.
Controlling ambiguity
1) Controlling the ambiguity: ruleorder
http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=ruleorder#handling-ambiguous-rules
I suggest avoiding this in the following situation. In the grand hopes of modularity, using "ruleorder" essentially couples two rules together. The "ruleorder" functionality can only be used if both rules are present within the Snakefile's scope, which can be a problem for modularization if the rules are not always provided together. If the rules are always provided together, I would argue they are already coupled, so doing this doesn't make the situation worse; in fact, it increases cohesion. Use "ruleorder" when using constraints isn't enough, as sometimes there will be unavoidable ambiguity.
https://en.wikipedia.org/wiki/GRASP_(object-oriented_design)
conditional 'includes'
https://github.com/tboyarski/BCCRC-Snakemake/tree/master/modules/bamGen
The rule order is set in the "_INCLUDE" file.
The outputs of sam2BAM and bamALIGN_bwa are very similar, mainly because sam2BAM is so generic.
Because bamALIGN_bwa and bamALIGN_star are technically interchangeable, and I didn't want users swapping around ruleorder just to switch between them, I keep a boolean in my YAML file that acts as a hard filter, literally preventing Snakemake from even seeing the other rule (see the sketch below). This works great in situations where you can ONLY pick one or the other: in this case the two aligners have their own reference genomes, and I force the user to set the reference genome at the beginning of my pipeline, so users can never actually run both. (I have not implemented functionality to detect which reference genome is being used so that the corresponding aligner is chosen automatically; that would be some overhead Python code, a great idea, but not currently implemented.)
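A minimal sketch of that idea (the config key and file paths are made up for illustration); because include statements are evaluated as ordinary Python control flow, the unchosen rule file is never even parsed:

# config.yaml contains, e.g., aligner: bwa
if config["aligner"] == "bwa":
    include: "modules/bamALIGN_bwa"
else:
    include: "modules/bamALIGN_star"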
2) Asking Snakemake to ignore the ambiguity, with an override. It exists, but I think "--allow-ambiguity" should be avoided whenever possible.
http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=--allow-ambiguity#handling-ambiguous-rules
3) Elegantly preventing the ambiguity.
http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=wildcard_constraints#wildcards
rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    wildcard_constraints:
        # negative lookahead: "word" must not be the string 'all'
        word='(?!all).*'
    ...
This rule needs a wildcard_constraint to prevent it from competing with the "join_all_words" rule. This is easily done by preventing the wildcard "word" here from being the string 'all', which makes "gather_things" and "join_all_words" differentiable (see the illustration below).
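To make the effect concrete (a hand-worked illustration, not Snakemake output):

# results/allthings/foo_yes.txt : word="foo" passes the constraint -> gather_things applies
# results/allthings/all_yes.txt : word would be "all", rejected by the constraint
#                                 -> only join_all_words can produce this file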
compute_md5 generalizability
As for getting "compute_md5" to accept input from both "gather_things" and "join_all_words": this requires making it more generalized, and has nothing to do with ambiguity. What you need to do is define "compute_md5" so that it is not tied to any given rule's output.
https://github.com/tboyarski/BCCRC-Snakemake/blob/master/help/download.svg
I just want to also thank you for providing a TOP-NOTCH example to work from. Brilliant!
NUMBERS = ["1", "2"]
LETTERS = ["a", "b", "c"]
WORDS = ["foo", "bar", "baz"]
CHOICES = ["yes", "no"]
rule all:
input:
expand("results/allthings/all_{choice}.md5sum", choice=CHOICES),
expand("results/allthings/{word}_{choice}.md5sum", word=WORDS, choice=CHOICES)
rule make_things:
output:
"results/{letter}_{number}/{word}_{choice}.txt"
shell:
"""
echo "{wildcards.letter}_{wildcards.number}_{wildcards.word}_{wildcards.choice}" > {output}
"""
rule gather_things:
input:
expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
output:
"results/allthings/{word}_{choice}.txt"
wildcard_constraints:
word='[^(all)][0-9a-zA-Z]*'
shell:
"""
cat {input} > {output}
"""
rule join_all_words:
input:
expand("results/allthings/{word}_{{choice}}.txt", word=WORDS)
output:
"results/allthings/all_{choice}.txt"
shell:
"""
cat {input} > {output}
"""
rule compute_md5:
input:
"{pathCMD5}/{sample}.txt"
output:
"{pathCMD5}/{sample}.md5sum"
#"results/allthings/{word}_{choice}.md5sum"
shell:
"""
md5sum {input} > {output}
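With these changes, requesting results/allthings/all_yes.md5sum resolves unambiguously: join_all_words is the only rule that can produce results/allthings/all_yes.txt (gather_things is ruled out by its wildcard constraint), and compute_md5 matches the target with pathCMD5="results/allthings" and sample="all_yes".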