Run one rule conditionally to the input files - snakemake

I've just started to learn Snakemake, so this is probably a naive question.
I need a rule to be launched at two different points in the pipeline, using different inputs and producing different outputs.
Let me give a silly example. Let's say I have 3 rules:
rule A:
    input:
        "data/sample.1.txt"
    output:
        "data/sample.2.sorted.txt"
    shell:
        "somethingsomething sort {input}"
rule B:
    input:
        "data/sample.2.sorted.txt"
    output:
        "data/sample.3.man.txt"
    shell:
        "somethingsomething manipulate {input}"
rule C:
    input:
        "data/sample.3.man.txt"
    output:
        "data/sample.4.man.sorted.txt"
    shell:
        "somethingsomething sort {input}"
In this example, where the pipeline does A > B > C, A and C do exactly the same thing, but one uses file 1 and outputs file 2, while the other uses file 3 and outputs file 4. What is the best way to have a single sorting rule, so that the pipeline does A > B > A? (Probably there's some way to do it using wildcards, maybe in combination with an if/else, but I am not sure how.)
Thank you for your time

Probably there's some way to re-use a rule but I suspect it's going to make the code more convoluted than necessary.
If the shell call is the same for rule A and C and you don't want to copy and paste the code, just put the code in a variable and re-use that variable:
sorter= "somethingsomething sort {input}"
rule A:
    input:
        "data/sample.1.txt"
    output:
        "data/sample.2.sorted.txt"
    shell:
        sorter
rule B:
    input:
        "data/sample.2.sorted.txt"
    output:
        "data/sample.3.man.txt"
    shell:
        "somethingsomething manipulate {input}"
rule C:
    input:
        "data/sample.3.man.txt"
    output:
        "data/sample.4.man.sorted.txt"
    shell:
        sorter
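To see why one string can serve both rules, here is a small plain-Python sketch. It is only an illustration: Python's str.format stands in for Snakemake's own placeholder substitution, and the render helper is a hypothetical name, not part of Snakemake.

```python
# Shared command template, same text as the sorter variable in the Snakefile.
SORTER = "somethingsomething sort {input}"


def render(template: str, **fields: str) -> str:
    """Hypothetical helper: fill the placeholders of a command template."""
    return template.format(**fields)


# Rules A and C differ only in the file that replaces {input}.
for rule_name, input_file in [("A", "data/sample.1.txt"), ("C", "data/sample.3.man.txt")]:
    print(rule_name, "->", render(SORTER, input=input_file))
```

The template is written once, and each rule supplies its own input, which is all the variable-based approach relies on.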

Related

Snakemake input two variables and output one variable

I want to rename and move my fastq.gz files from these:
NAME-BOB_S1_L001_R1_001.fastq.gz
NAME-BOB_S1_L001_R2_001.fastq.gz
NAME-JOHN_S2_L001_R1_001.fastq.gz
NAME-JOHN_S2_L001_R2_001.fastq.gz
to these:
NAME_BOB/reads/NAME_BOB.R1.fastq.gz
NAME_BOB/reads/NAME_BOB.R2.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R1.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R2.fastq.gz
This is my code. The problem I have is the second variable S which I do not know how to specify in the code as I do not need it in my output filename.
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
S=["S1","S2"]
rule all:
    input:
        expand("NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz", dir=DIR),
        expand("NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz", dir=DIR)
rule rename:
    input:
        fastq1=("fastq/NAME-{dir}_{s}_L001_R1_001.fastq.gz", zip, dir=DIR, s=S),
        fastq2=("fastq/NAME-{dir}_{s}_L001_R2_001.fastq.gz", zip, dir=DIR, s=S)
    output:
        fastq1="NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz",
        fastq2="NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz"
    shell:
        """
        mv {input.fastq1} {output.fastq1}
        mv {input.fastq2} {output.fastq2}
        """
There are several problems in your code. First of all, the {dir} in your output and the {dir} in your input are two different things: the {dir} in the output is a wildcard, while the {dir} in the input is a parameter for the expand function (and you forgot to call this function, which is the second problem).
The third problem is that the rule moves two files in one job. A shell section can hold several commands (the two mv lines as written, or mv {input.fastq1} {output.fastq1}; mv {input.fastq2} {output.fastq2} on one line), but this is not an idiomatic solution. It is much better to create a rule that produces a single file and let Snakemake do the rest of the work.
Finally, the S value depends entirely on the DIR value, so it becomes a function of {dir}, and that can be solved with a lambda in input:
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
dir2s = {"BOB": "S1", "JOHN": "S2"}
rule all:
    input:
        expand("NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz", dir=DIR, r=["R1", "R2"])
rule rename:
    input:
        lambda wildcards:
            "fastq/NAME-{{dir}}_{s}_L001_{{r}}_001.fastq.gz".format(s=dir2s[wildcards.dir])
    output:
        "NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz",
    shell:
        """
        mv {input} {output}
        """

Snakemake: Exchanging variables

I have some ONT sequencing runs that have been basecalled on the MINIT. As such, when I demultiplex with guppy_barcoder, I get a directory of fastq files for each barcode. I want to use snakemake as a workflow manager to take these fastq files through our analyses, but this involves swapping the {barcode} for {sample} at some point.
BARCODE=['barcode01', 'barcode02', 'barcode03', 'barcode04']
SAMPLE=['sample01', 'sample02', 'sample03', 'sample04']
rule all:
    input:
        directory(expand("Sequencing_reads/demultiplexed/{barcode}", barcode=BARCODE)), #guppy_barcoder
        expand("Sequencing_reads/gathered/{sample}_ONT.fastq", sample=SAMPLE), #getting all of the fastq files with the same barcode assigned to the correct sample
rule demultiplex:
    input:
        glob.glob("Sequencing_reads/fastq_pass/*fastq")
    output:
        directory(expand("Sequencing_reads/demultiplexed/{barcode}", barcode=BARCODE))
    shell:
        "guppy_barcoder --input_path Sequencing_reads/fastq_pass --save_path Sequencing_reads/demultiplexed -r "
rule gather:
    input:
        rules.demultiplex.output
    output:
        "Sequencing_reads/gathered/{sample}_ONT.fastq"
    shell:
        "cat Sequencing_reads/demultiplexed/{wildcards.barcode}/*fastq > {output.fastq} "
This does give me an error:
RuleException in line 32 of /home/eriny/sandbox/ONT_unicycler_pipeline/ONT_pipeline.smk:
'Wildcards' object has no attribute 'barcode'
But I actually think I'm missing something conceptually. I would like rule gather to be something like:
cat Sequencing_reads/demultiplexed/barcode01/*fastq > Sequencing_reads/gathered/sample01_ONT.fastq
I have tried setting up some dictionaries so that sample and barcode are given the same key, but my syntax must be broken.
I'm hoping to find a 1:1 way to map one variable name onto another.
I think a dictionary mapping each sample to its barcode is a possibility, combined with a lambda as the input function to get the barcode assigned to a sample. For example:
import glob  # makes glob.glob available regardless of what Snakemake pre-imports
BARCODE=['barcode01', 'barcode02', 'barcode03', 'barcode04']
SAMPLE=['sample01', 'sample02', 'sample03', 'sample04']
sam2bar= dict(zip(SAMPLE, BARCODE))  # sample name -> barcode directory
rule all:
    input:
        expand("Sequencing_reads/gathered/{sample}_ONT.fastq", sample=SAMPLE), #getting all of the fastq files with the same barcode assigned to the correct sample
rule demultiplex:
    input:
        glob.glob("Sequencing_reads/fastq_pass/*fastq"),
    output:
        done= touch('demux.done'), # This signals that guppy has completed
    shell:
        "guppy_barcoder --input_path Sequencing_reads/fastq_pass --save_path Sequencing_reads/demultiplexed -r "
rule gather:
    input:
        done= 'demux.done',
        fastq= lambda wc: glob.glob("Sequencing_reads/demultiplexed/%s/*fastq" % sam2bar[wc.sample])
    output:
        fastq= "Sequencing_reads/gathered/{sample}_ONT.fastq"
    shell:
        "cat {input.fastq} > {output.fastq} "

snakemake temp() causing unnecessary rerun of rules

I'm using snakemake v 5.4.0, and I'm running into a problem with temp(). In a hypothetical scenario:
Rule A --> Rule B1 --> Rule C1
   |
   --> Rule B2 --> Rule C2
where Rule A generates temp() files used by both pathways 1 (B1 + C1) and 2 (B2 + C2).
If I run the pipeline, the temp() files generated by Rule A are deleted after they are used in both pathways, which is what I expect. However, if I then want to re-run Pathway 2, the temp() files for Rule A must be recreated, which triggers the re-run of the entire pipeline, not just Pathway 2. This becomes very computationally expensive for long pipelines. Is there a good way to prevent this besides not using temp(), which in my case would require many TB of extra hard drive space?
You could create the list of input files to rule all, or whatever the first rule is called, dynamically depending on whether the output of Pathway 2 already exists (and satisfies some sanity checks).
import os  # explicit import for os.path.exists
output= ['P1.out']
if not os.path.exists('P2.out'): # Some more conditions here...
    output.append('P2.out')
rule all:
    input:
        output
rule make_tmp:
    output:
        temp('a.out')
    shell:
        r"""
        touch {output}
        """
rule make_P1:
    input:
        'a.out'
    output:
        'P1.out'
    shell:
        r"""
        touch {output}
        """
rule make_P2:
    input:
        'a.out'
    output:
        'P2.out'
    shell:
        r"""
        touch {output}
        """
However, this somewhat defeats the point of using snakemake. If the input of Pathway 1 has to be recreated, how can you be sure that its output is still up-to-date?
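The target-selection logic can be tried without Snakemake. The sketch below is a plain-Python illustration under the same file names; pending_targets is a hypothetical helper, and it checks only for existence, so any further sanity checks would still have to be added.

```python
import os
import tempfile


def pending_targets(workdir, always=("P1.out",), optional=("P2.out",)):
    """Hypothetical helper: request every `always` target, plus each
    optional target that does not exist yet in `workdir`."""
    targets = list(always)
    for name in optional:
        if not os.path.exists(os.path.join(workdir, name)):
            targets.append(name)
    return targets


with tempfile.TemporaryDirectory() as tmp:
    print(pending_targets(tmp))  # P2.out is missing, so both targets are requested
    open(os.path.join(tmp, "P2.out"), "w").close()
    print(pending_targets(tmp))  # P2.out exists, so only P1.out is requested
```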

Understanding and overcoming AmbiguousRuleException in snakemake

I have a complicated workflow which I progressively extended. The last extension resulted in an AmbiguousRuleException. I tried to reproduce the critical structure of the workflow in the following example:
NUMBERS = ["1", "2"]
LETTERS = ["a", "b", "c"]
WORDS = ["foo", "bar", "baz"]
CHOICES = ["yes", "no"]
rule all:
    input:
        # (1)
        expand("results/allthings/{word}_{choice}.md5sum", word=WORDS, choice=CHOICES)
        #expand("results/allthings/{word}_{choice}.md5sum", word=WORDS + ["all"], choice=CHOICES)
rule make_things:
    output:
        "results/{letter}_{number}/{word}_{choice}.txt"
    shell:
        """
        echo "{wildcards.letter}_{wildcards.number}_{wildcards.word}_{wildcards.choice}" > {output}
        """
rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    shell:
        """
        cat {input} > {output}
        """
# (2)
#rule join_all_words:
#    input:
#        expand("results/allthings/{word}_{{choice}}.txt", word=WORDS)
#    output:
#        "results/allthings/all_{choice}.txt"
#    shell:
#        """
#        cat {input} > {output}
#        """
# (3)
#def source_data(wildcards):
#    if wildcards.word == "all":
#        return rules.join_all_words.output
#    else:
#        return rules.gather_things.output
rule compute_md5:
    input:
        # (4)
        rules.gather_things.output,
        #source_data
    output:
        "results/allthings/{word}_{choice}.md5sum"
    shell:
        """
        md5sum {input} > {output}
        """
The above state is functional. Switching (1) and (4) and uncommenting (2) and (3) corresponds to the extension I'm trying to make, and results in the following failure:
AmbiguousRuleException:
Rules gather_things and join_all_words are ambiguous for the file results/allthings/all_yes.txt.
Expected input files:
gather_things: results/a_1/all_yes.txt results/a_2/all_yes.txt results/b_1/all_yes.txt results/b_2/all_yes.txt results/c_1/all_yes.txt results/c_2/all_yes.txt
join_all_words: results/allthings/foo_yes.txt results/allthings/bar_yes.txt results/allthings/baz_yes.txt
It seems that snakemake thinks that results/allthings/all_yes.txt can be generated by gather_things.
Why?
How can I avoid that?
Note: The goal of modifications (3) and (4) is to have the compute_md5 work on both the direct output of gather_things (for foo, bar and baz) and the joined output of the three (all), keeping input defined in terms of other rule's output as much as possible (which makes changes easier than when file names are explicitly used).
2017-07-28 Post edited for brevity
Initially I thought it was just ambiguity. The first 3 points relate to resolving ambiguity. Afterwards, I explain how to generalize 'compute_md5' to achieve the desired behaviour.
Controlling ambiguity
1) Controlling how the ambiguity is resolved:
ruleorder
http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=ruleorder#handling-ambiguous-rules
I suggest avoiding this in the following situation. In the interest of modularity, note that by using "ruleorder" you are essentially coupling two rules together. The "ruleorder" functionality can only be used if both rules are present within the Snakefile's scope. This can be a problem for modularization if the rules are not always provided together. If the rules are always provided together, I would argue they are already coupled, and doing this doesn't make the situation worse; in fact, it increases cohesion. Use "ruleorder" when using constraints isn't enough, as sometimes there will be unavoidable ambiguity.
https://en.wikipedia.org/wiki/GRASP_(object-oriented_design)
conditional 'includes'
https://github.com/tboyarski/BCCRC-Snakemake/tree/master/modules/bamGen
Rule order is in the "_INCLUDE"
Outputs for sam2BAM and bamALIGN_bwa are very similar, mainly because sam2BAM is so generic.
Because bamALIGN_bwa and bamALIGN_star are technically switchable, and I didn't want users swapping around ruleorder just to switch between them, I have a boolean which I store in my YAML file to act as a hard filter that prevents Snakemake from even seeing the rule. This works great in situations where you can ONLY pick one or the other. (In this case, the two aligners have their own reference genomes. I force the user to set the reference genome at the beginning of my pipeline, so users could NEVER actually run both. I have not implemented functionality to detect which reference genome is being used such that the corresponding aligner is then chosen. This would need some overhead Python code; a great idea, but not currently implemented.)
2) Asking Snakemake to ignore the ambiguity.
With an over-ride. It exists, but I think "--allow-ambiguity" should be avoided whenever possible.
http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=--allow-ambiguity#handling-ambiguous-rules
3) Elegantly ~ Preventing the ambiguity.
http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=wildcard_constraints#wildcards
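rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    wildcard_constraints:
        # Only the known words are allowed, so "all" can never match here.
        word="|".join(WORDS)
    ...
This rule needs a wildcard_constraints entry to stop it from competing with the "join_all_words" rule. The constraint prevents the wildcard "word" from being the string 'all', here by restricting it to the entries of WORDS. (A pattern such as '[^(all)][0-9a-zA-Z]*' also keeps 'all' out in this example, but it is a character class: it rejects any word whose first character is "(", "a", "l" or ")", not the word 'all' itself.) This makes "gather_things" and "join_all_words" distinguishable.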
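The difference between a character class and a list of allowed words is easy to check with Python's re module. This is a sketch only: Snakemake embeds the constraint in a larger file-name pattern, so re.fullmatch on the bare word is an approximation, and "apple" is a hypothetical extra word added to show the side effect.

```python
import re

# "apple" is a hypothetical addition to the WORDS of the example.
WORDS = ["foo", "bar", "baz", "apple"]

# Character class: rejects a FIRST CHARACTER among "(", "a", "l", ")".
char_class = re.compile(r"[^(all)][0-9a-zA-Z]*")
# Explicit alternation: accepts exactly the known words.
known_words = re.compile("|".join(map(re.escape, WORDS)))

for candidate in ["foo", "all", "apple"]:
    print(
        candidate,
        "char_class:", bool(char_class.fullmatch(candidate)),
        "known_words:", bool(known_words.fullmatch(candidate)),
    )
```

Both patterns keep "all" out, but only the alternation still accepts a legitimate word that happens to start with "a" or "l".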
compute_md5 generalizability
As for getting "compute_md5" to accept input from both "gather_things" and "join_all_words", this requires making it more general and has nothing to do with ambiguity. The next thing you need to do is adjust the "compute_md5" rule so that it does not depend on ANY particular rule's output.
https://github.com/tboyarski/BCCRC-Snakemake/blob/master/help/download.svg
I just want to also thank you for providing a TOP-NOTCH example to work from. Brilliant!
NUMBERS = ["1", "2"]
LETTERS = ["a", "b", "c"]
WORDS = ["foo", "bar", "baz"]
CHOICES = ["yes", "no"]
rule all:
    input:
        expand("results/allthings/all_{choice}.md5sum", choice=CHOICES),
        expand("results/allthings/{word}_{choice}.md5sum", word=WORDS, choice=CHOICES)
rule make_things:
    output:
        "results/{letter}_{number}/{word}_{choice}.txt"
    shell:
        """
        echo "{wildcards.letter}_{wildcards.number}_{wildcards.word}_{wildcards.choice}" > {output}
        """
rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    wildcard_constraints:
        # Only the known words are allowed, so "all" is left to join_all_words.
        word="|".join(WORDS)
    shell:
        """
        cat {input} > {output}
        """
rule join_all_words:
    input:
        expand("results/allthings/{word}_{{choice}}.txt", word=WORDS)
    output:
        "results/allthings/all_{choice}.txt"
    shell:
        """
        cat {input} > {output}
        """
# Generic rule: works for any .txt file, whichever rule produced it.
rule compute_md5:
    input:
        "{pathCMD5}/{sample}.txt"
    output:
        "{pathCMD5}/{sample}.md5sum"
        #"results/allthings/{word}_{choice}.md5sum"
    shell:
        """
        md5sum {input} > {output}
        """
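To see how one generic pattern can serve both kinds of file, the sketch below mimics wildcard matching in plain Python. It assumes each wildcard behaves like the regular expression .+ (Snakemake's documented default); pattern_to_regex is a hypothetical helper and not Snakemake's actual implementation.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Hypothetical helper: turn "{name}" placeholders into named groups."""
    # re.split with a capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    pieces = [
        f"(?P<{part}>.+)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    ]
    return re.compile("".join(pieces) + "$")


md5_output = pattern_to_regex("{pathCMD5}/{sample}.md5sum")

# The same pattern matches the per-word targets and the joined "all" targets.
for target in ["results/allthings/foo_no.md5sum", "results/allthings/all_yes.md5sum"]:
    print(target, "->", md5_output.match(target).groupdict())
```

Since the rule no longer names gather_things or join_all_words, the choice of which rule produces the matching .txt file is left entirely to those rules' own output patterns.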

Snakemake best practice for demultiplexing

I have a question about best practices. Specifically, about the best snakemake pattern for demultiplexing reads from illumina sequencing. Our workflow needs to demultiplex multiple lanes of sequencing, and then combine these in a single analysis. Obviously, we know lane and sample names, however sample names are not the same across lanes. With only one lane, one can do something like:
SAMPLES = [...]
rule demux:
    input:
        reads="lanes/lanename.fastq.gz",
        key="keys/lanename.txt"
    output:
        reads=expand("reads/{sample}.fastq.gz", sample=SAMPLES)
    ...
However, with multiple lanes, I'm stuck wanting to use a function in the output of a rule. How would the following translate into something that works:
LANES = {
    "lane1": ["S1", "S2"],
    "lane2": ["S3", "S4"],
    "lane3": ["S5", "S6"],
}
rule demux:
    input:
        reads="lanes/{lane}.fastq.gz",
        key="keys/{lane}.txt"
    output:
        reads=lambda wc: expand("reads/{sample}.fastq.gz", sample=LANES[wc.lane])
    ...
Forgive me if this has been answered previously, or if there is some obvious approach I'm missing.
Cheers,
Kevin
I realise this is an old question, but I had the same problem, and came up with the following solution: Split the demux rule into two. The first creates a temporary directory which holds the demultiplexed reads, and then the second moves a single readset from that directory.
Here's a relatively minimal example:
LANES = {
    "lane1": ["S1", "S2"],
    "lane2": ["S3", "S4"],
    "lane3": ["S5", "S6"],
}
SAMPLES = {}
for lane, samples in LANES.items():
    for sample in samples:
        SAMPLES[sample] = lane
rule all:
    input:
        expand("reads/{sample}.fastq.gz", sample=SAMPLES.keys())
rule get_reads:
    output:
        reads="lanes/{lane}.fastq.gz",
        key="keys/{lane}.txt",
    shell:
        "touch {output}"
rule demux:
    input:
        reads="lanes/{lane}.fastq.gz",
        key="keys/{lane}.txt",
    output:
        directory(temp("demux_tmp_{lane}"))
    params:
        f=lambda w: expand("{sample}.fastq.gz", sample=LANES[w.lane]),
    shell:
        "mkdir -p {output} ; cd {output} ; touch {params.f}"
rule demux_files:
    input:
        lambda w: f"demux_tmp_{SAMPLES[w.sample]}",
    output:
        "reads/{sample}.fastq.gz"
    shell:
        "mv {input}/{wildcards.sample}.fastq.gz {output}"