Understanding and overcoming AmbiguousRuleException in snakemake

I have a complicated workflow which I progressively extended. The last extension resulted in an AmbiguousRuleException. I tried to reproduce the critical structure of the workflow in the following example:
NUMBERS = ["1", "2"]
LETTERS = ["a", "b", "c"]
WORDS = ["foo", "bar", "baz"]
CHOICES = ["yes", "no"]

rule all:
    input:
        # (1)
        expand("results/allthings/{word}_{choice}.md5sum", word=WORDS, choice=CHOICES)
        #expand("results/allthings/{word}_{choice}.md5sum", word=WORDS + ["all"], choice=CHOICES)

rule make_things:
    output:
        "results/{letter}_{number}/{word}_{choice}.txt"
    shell:
        """
        echo "{wildcards.letter}_{wildcards.number}_{wildcards.word}_{wildcards.choice}" > {output}
        """

rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    shell:
        """
        cat {input} > {output}
        """

# (2)
#rule join_all_words:
#    input:
#        expand("results/allthings/{word}_{{choice}}.txt", word=WORDS)
#    output:
#        "results/allthings/all_{choice}.txt"
#    shell:
#        """
#        cat {input} > {output}
#        """

# (3)
#def source_data(wildcards):
#    if wildcards.word == "all":
#        return rules.join_all_words.output
#    else:
#        return rules.gather_things.output

rule compute_md5:
    input:
        # (4)
        rules.gather_things.output,
        #source_data
    output:
        "results/allthings/{word}_{choice}.md5sum"
    shell:
        """
        md5sum {input} > {output}
        """
The above state is functional. Switching (1) and (4) to their commented-out alternatives and uncommenting (2) and (3) corresponds to the extension I'm trying to make, and it results in the following failure:
AmbiguousRuleException:
Rules gather_things and join_all_words are ambiguous for the file results/allthings/all_yes.txt.
Expected input files:
gather_things: results/a_1/all_yes.txt results/a_2/all_yes.txt results/b_1/all_yes.txt results/b_2/all_yes.txt results/c_1/all_yes.txt results/c_2/all_yes.txt
join_all_words: results/allthings/foo_yes.txt results/allthings/bar_yes.txt results/allthings/baz_yes.txt
It seems that snakemake thinks that results/allthings/all_yes.txt can be generated by gather_things.
Why?
How can I avoid that?
Note: The goal of modifications (3) and (4) is to have compute_md5 work on both the direct output of gather_things (for foo, bar and baz) and the joined output of the three (all), keeping input defined in terms of other rules' outputs as much as possible (which makes changes easier than when file names are used explicitly).

2017-07-28 Post edited for brevity
Initially I thought this was just ambiguity. The first three points relate to resolving ambiguity; afterwards, I explain how to generalize 'compute_md5' to achieve the desired behaviour.
Controlling ambiguity
1) Controlling the ambiguity with ruleorder
http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=ruleorder#handling-ambiguous-rules
I suggest avoiding this in the following situation. In the grand hopes of modularity, using "ruleorder" essentially couples two rules together: the directive can only be used if both rules are present within the Snakefile's scope, which is a problem for modularization if the rules are not always provided together. If the rules are always provided together, I would argue they are already coupled, so doing this doesn't make the situation worse; in fact, it increases cohesion. Use "ruleorder" when wildcard constraints aren't enough, as sometimes there will be unavoidable ambiguity.
https://en.wikipedia.org/wiki/GRASP_(object-oriented_design)
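For reference, a minimal sketch of how the directive would look for the example above (assuming both rules are defined in the same scope; which rule wins is your choice):
ruleorder: join_all_words > gather_things
With this line, whenever both rules could produce the same file, such as results/allthings/all_yes.txt, Snakemake silently prefers join_all_words.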
conditional 'includes'
https://github.com/tboyarski/BCCRC-Snakemake/tree/master/modules/bamGen
The rule order is set in the "_INCLUDE" file.
Outputs for sam2BAM and bamALIGN_bwa are very similar, mainly because sam2BAM is so generic.
Because bamALIGN_bwa and bamALIGN_star are technically switchable, and I didn't want users swapping around ruleorder just to switch between them, I store a boolean in my YAML file that acts as a hard filter to literally prevent Snakemake from even seeing the other rule. This works great in situations where you can ONLY pick one or the other. In this case, the two aligners have their own reference genomes; I force the user to set the reference genome at the beginning of my pipeline, so users could never actually run both. (I have not implemented functionality to detect which reference genome is being used so that the corresponding aligner is chosen automatically. That would take some overhead Python code; great idea, but not currently implemented.)
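As a minimal sketch of that pattern (the config key and module file names here are hypothetical, not taken from the linked repository):
# config.yaml is assumed to contain a key such as: aligner: "bwa"
if config["aligner"] == "bwa":
    include: "modules/bamGen/bamALIGN_bwa.smk"
else:
    include: "modules/bamGen/bamALIGN_star.smk"
Because only one of the two files is ever included, Snakemake never sees both aligner rules at once, so no ambiguity can arise between them.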
2) Asking Snakemake to ignore the ambiguity, with an override.
It exists, but I think "--allow-ambiguity" should be avoided whenever possible.
http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=--allow-ambiguity#handling-ambiguous-rules
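On the command line it is just a flag, e.g.:
snakemake --allow-ambiguity -j4
Snakemake then picks one of the matching rules itself instead of raising an AmbiguousRuleException.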
3) Elegantly preventing the ambiguity.
http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=wildcard_constraints#wildcards
rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    wildcard_constraints:
        word='|'.join(WORDS)  # {word} may only match the known words, never "all"
    ...
This rule needs a wildcard_constraint to prevent it from competing with the "join_all_words" rule. That is done by preventing the wildcard "word" from ever matching the string 'all'; constraining it to the known words is the simplest safe way. This makes "gather_things" and "join_all_words" distinguishable.
compute_md5 generalizability
As for getting "compute_md5" to accept input from both "gather_things" and "join_all_words": this requires making it more general, and has nothing to do with ambiguity. What you need to do is adjust the "compute_md5" rule so that it is not dependent on any one rule's output.
https://github.com/tboyarski/BCCRC-Snakemake/blob/master/help/download.svg
I just want to also thank you for providing a TOP-NOTCH example to work from. Brilliant!
NUMBERS = ["1", "2"]
LETTERS = ["a", "b", "c"]
WORDS = ["foo", "bar", "baz"]
CHOICES = ["yes", "no"]

rule all:
    input:
        expand("results/allthings/all_{choice}.md5sum", choice=CHOICES),
        expand("results/allthings/{word}_{choice}.md5sum", word=WORDS, choice=CHOICES)

rule make_things:
    output:
        "results/{letter}_{number}/{word}_{choice}.txt"
    shell:
        """
        echo "{wildcards.letter}_{wildcards.number}_{wildcards.word}_{wildcards.choice}" > {output}
        """

rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    wildcard_constraints:
        word='|'.join(WORDS)  # {word} may only match the known words, never "all"
    shell:
        """
        cat {input} > {output}
        """

rule join_all_words:
    input:
        expand("results/allthings/{word}_{{choice}}.txt", word=WORDS)
    output:
        "results/allthings/all_{choice}.txt"
    shell:
        """
        cat {input} > {output}
        """

rule compute_md5:
    input:
        "{pathCMD5}/{sample}.txt"
    output:
        "{pathCMD5}/{sample}.md5sum"
        #"results/allthings/{word}_{choice}.md5sum"
    shell:
        """
        md5sum {input} > {output}
        """

Related

Snakemake pipeline not attempting to produce output?

I have a relatively simple snakemake pipeline, but when run I get "Missing input files" for rule all:
refseq = 'refseq.fasta'
reads = ['_R1_001', '_R2_001']

def getsamples():
    import glob
    test = (glob.glob("*.fastq"))
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit('_', 2)[0])
    return(samples)

def getbarcodes():
    with open('unique.barcodes.txt') as file:
        lines = [line.rstrip() for line in file]
    return(lines)

rule all:
    input:
        expand("grepped/{barcodes}{sample}_R1_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples()),
        expand("grepped/{barcodes}{sample}_R2_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples())
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"

rule fastq_grep:
    input:
        R1 = "{sample}_R1_001.fastq",
        R2 = "{sample}_R2_001.fastq"
    output:
        out1 = "grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2 = "grepped/{barcodes}{sample}_R2_001.plate.fastq"
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
The output files listed in the terminal seem correct, so Snakemake appears to see what I want to produce, but the shell commands never run anything at all.
I want to produce a list of files, grepped using the list of barcodes I have in a file. But I get "Missing input files for rule all:"
There are two issues:
1) You have an impossible wildcard_constraints defined for {barcodes}.
2) Your two wildcards {barcodes} and {sample} are competing with each other.
Remove the wildcard_constraints from your two rules and add the following lines to the top of your Snakefile:
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",
The constraint for {barcodes} now only matches capital letters. Before, it also included an end-of-line anchor (the trailing $), which was impossible to satisfy for this wildcard because additional text follows it in the file path.
The constraint for {sample} ensures that the part of the filename starting with "Well..." is interpreted as the start of the {sample} wildcard. Otherwise you'd get something unwanted like barcodes=ACGGTW instead of barcodes=ACGGT.
A note of advice:
I usually find it easier to separate wildcards into directory structures rather than having multiple wildcards in the same filename. In your case that would mean having a structure like
grepped/{barcodes}/{sample}_R1_001.plate.fastq
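A sketch of what the fastq_grep outputs would look like in that layout (still assuming the wildcard_constraints shown above):
output:
    out1="grepped/{barcodes}/{sample}_R1_001.plate.fastq",
    out2="grepped/{barcodes}/{sample}_R2_001.plate.fastq",
The directory separator makes the boundary between the two wildcards explicit instead of leaving it to the regexes alone.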
Full suggested Snakefile (formatted using snakefmt)
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",

refseq = "refseq.fasta"
reads = ["_R1_001", "_R2_001"]

def getsamples():
    import glob
    test = glob.glob("*.fastq")
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit("_", 2)[0])
    return samples

def getbarcodes():
    with open("unique.barcodes.txt") as file:
        lines = [line.rstrip() for line in file]
    return lines

rule all:
    input:
        expand(
            "grepped/{barcodes}{sample}_R1_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),
        expand(
            "grepped/{barcodes}{sample}_R2_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),

rule fastq_grep:
    input:
        R1="{sample}_R1_001.fastq",
        R2="{sample}_R2_001.fastq",
    output:
        out1="grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2="grepped/{barcodes}{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
In addition to @euronion's answer (+1), I prefer to constrain wildcards to match only and exactly the list of values you expect, which effectively disables regex matching altogether. In your case, I would do something like:
import re

wildcard_constraints:
    barcodes='|'.join([re.escape(x) for x in getbarcodes()]),
    sample='|'.join([re.escape(x) for x in getsamples()]),
Now {barcodes} is allowed to match only the values returned by getbarcodes(), whatever they are, and the same goes for {sample}. In my opinion this is better than trying to anticipate every form a wildcard's regex may need to take.

How does snakemake handle possible corruptions due to a rule run in parallel simultaneously appending to a single file?

I would like to learn how snakemake handles the following situations, and what the best practice is to avoid collisions/corruption.
rule something:
    input:
        expand("/path/to/out-{asd}.txt", asd=LIST)
    output:
        "/path/to/merged.txt"
    shell:
        "cat {input} >> {output}"
With snakemake -j10, the command will try to append to the same file simultaneously, and I could not figure out whether this could lead to corruption or whether it is already handled.
Also, how are more complicated cases handled, e.g. where it is not just cat but the return value of another process, based on the input value, that is appended to the same file? Is the best practice to first write to individual files and then cat them together?
rule get_merged_total_distinct:
    input:
        expand("{dataset_id}/merge_libraries/{tomerge}_merged_rmd.bam", dataset_id=config["dataset_id"], tomerge=list(TOMERGE.keys())),
    output:
        "{dataset_id}/merge_libraries/merged_total_distinct.csv"
    params:
        with_dups="{dataset_id}/merge_libraries/{tomerge}_merged.bam"
    shell:
        """
        RCT=$(samtools view -@4 -c -F1 -F4 -q 30 {params.with_dups})
        RCD=$(samtools view -@4 -c -F1 -F4 -q 30 {input})
        printf "{wildcards.tomerge},${{RCT}},${{RCD}}\n" >> {output}
        """
or cases where an external script is being called to print the result to a single output file?
input:
    expand("infile/{x}", ...)  # expanded as above
output:
    "results/all.txt"
shell:
    """
    bash script.sh {params.x} {input} {params.y} >> {output}
    """
With your example, the shell directive will expand to
cat /path/to/out-SAMPLE1.txt /path/to/out-SAMPLE2.txt [...] >> /path/to/merged.txt
where SAMPLE1, etc. come from LIST. In this case there is no collision, corruption, or race condition: one job runs that command exactly as if you had typed it in your shell, and all inputs get concatenated into the output. Since snakemake is pull based, once the output exists the rule will only run again if the inputs change, at which point the new inputs would be appended to the old contents because of >>. As such, I would recommend using > so the old contents are removed; rules should be deterministic where possible.
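Concretely, the recommendation is a one-character change to the rule from the question:
rule something:
    input:
        expand("/path/to/out-{asd}.txt", asd=LIST)
    output:
        "/path/to/merged.txt"
    shell:
        "cat {input} > {output}"  # > truncates, so a rerun reproduces the same file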
Now, if you had done something like
rule something:
    input:
        "/path/to/out-{asd}.txt"
    output:
        touch("/path/to/merged-{asd}.txt")
    params:
        output="/path/to/merged.txt"
    shell:
        "cat {input} >> {params.output}"

# then invoke
snakemake -j10 /path/to/merged-{a..z}.txt
Things are messier. Snakemake will launch a job for every requested target, up to 10 at a time, all appending to the single merged.txt. Note that the merged file is now a parameter rather than an output, and we are targeting some dummy files instead. This behaves as if you had opened many different shells and executed the commands
cat /path/to/out-a.txt >> /path/to/merged.txt
# ...
cat /path/to/out-z.txt >> /path/to/merged.txt
all at once. The output will have a random order and lines may be interleaved or interrupted.
As some guidance:
1) Try to make outputs deterministic. Given the same inputs you should always produce the same outputs. If possible, set random seeds and enforce input ordering. In the second example, you have no idea what the output will be.
2) Don't use the append operator. This follows from the first point. If the output already exists and needs to be updated, start from scratch.
3) If you need to append a bunch of outputs, say log files, or to create a summary, do so in a separate rule (see the sketch after this list). This again follows from the first point, but it's the only reason I can think of to use append.
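A sketch of the third point, with made-up rule and file names (SAMPLES is assumed to be defined elsewhere): aggregation happens in one dedicated rule, so there is a single writer and a deterministic input order.
rule summarize:
    input:
        expand("results/{sample}.csv", sample=SAMPLES)
    output:
        "results/summary.csv"
    shell:
        "cat {input} > {output}"  # one job, ordered inputs, no append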
Hope that helps. Otherwise you can comment or edit with a more realistic example of what you are worried about.

Snakemake input two variables and output one variable

I want to rename and move my fastq.gz files from these:
NAME-BOB_S1_L001_R1_001.fastq.gz
NAME-BOB_S1_L001_R2_001.fastq.gz
NAME-JOHN_S2_L001_R1_001.fastq.gz
NAME-JOHN_S2_L001_R2_001.fastq.gz
to these:
NAME_BOB/reads/NAME_BOB.R1.fastq.gz
NAME_BOB/reads/NAME_BOB.R2.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R1.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R2.fastq.gz
This is my code. The problem I have is the second variable, S, which I don't know how to handle in the code, since I don't need it in my output filenames.
workdir: "/path/to/workdir/"

DIR=["BOB","JOHN"]
S=["S1","S2"]

rule all:
    input:
        expand("NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz", dir=DIR),
        expand("NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz", dir=DIR)

rule rename:
    input:
        fastq1=("fastq/NAME-{dir}_{s}_L001_R1_001.fastq.gz", zip, dir=DIR, s=S),
        fastq2=("fastq/NAME-{dir}_{s}_L001_R2_001.fastq.gz", zip, dir=DIR, s=S)
    output:
        fastq1="NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz",
        fastq2="NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz"
    shell:
        """
        mv {input.fastq1} {output.fastq1}
        mv {input.fastq2} {output.fastq2}
        """
There are several problems in your code. First of all, the {dir} in your output and the {dir} in your input are two different variables: the {dir} in the output is a wildcard, while the {dir} in the input is a parameter of the expand function (moreover, you forgot to call that function, which is the second problem).
The third problem is that the shell section should contain a single command. You could try mv {input.fastq1} {output.fastq1}; mv {input.fastq2} {output.fastq2}, but this is not an idiomatic solution. Much better is to create a rule that produces a single file, letting Snakemake do the rest of the work.
Finally, the S value fully depends on the DIR value, so it becomes a function of {dir}, and that can be solved with a lambda in the input:
workdir: "/path/to/workdir/"

DIR=["BOB","JOHN"]
dir2s = {"BOB": "S1", "JOHN": "S2"}

rule all:
    input:
        expand("NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz", dir=DIR, r=["R1", "R2"])

rule rename:
    input:
        lambda wildcards:
            "fastq/NAME-{{dir}}_{s}_L001_{{r}}_001.fastq.gz".format(s=dir2s[wildcards.dir])
    output:
        "NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz",
    shell:
        """
        mv {input} {output}
        """

Snakemake: Exchanging variables

I have some ONT sequencing runs that have been basecalled on the MINIT. As such, when I demultiplex with guppy_barcoder, I get a directory of fastq files for each barcode. I want to use snakemake as a workflow manager to take these fastq files through our analyses, but this involves swapping the {barcode} for {sample} at some point.
BARCODE=['barcode01', 'barcode02', 'barcode03', 'barcode04']
SAMPLE=['sample01', 'sample02', 'sample03', 'sample04']

rule all:
    input:
        directory(expand("Sequencing_reads/demultiplexed/{barcode}", barcode=BARCODE)), #guppy_barcoder
        expand("Sequencing_reads/gathered/{sample}_ONT.fastq", sample=SAMPLE), #getting all of the fastq files with the same barcode assigned to the correct sample

rule demultiplex:
    input:
        glob.glob("Sequencing_reads/fastq_pass/*fastq")
    output:
        directory(expand("Sequencing_reads/demultiplexed/{barcode}", barcode=BARCODE))
    shell:
        "guppy_barcoder --input_path Sequencing_reads/fastq_pass --save_path Sequencing_reads/demultiplexed -r "

rule gather:
    input:
        rules.demultiplex.output
    output:
        "Sequencing_reads/gathered/{sample}_ONT.fastq"
    shell:
        "cat Sequencing_reads/demultiplexed/{wildcards.barcode}/*fastq > {output.fastq} "
This does give me an error:
RuleException in line 32 of /home/eriny/sandbox/ONT_unicycler_pipeline/ONT_pipeline.smk:
'Wildcards' object has no attribute 'barcode'
But I actually think I'm missing something conceptually. I would like rule gather to be something like:
cat Sequencing_reads/demultiplexed/barcode01/*fastq > Sequencing_reads/gathered/sample01_ONT.fastq
I have tried setting up some dictionaries so that sample and barcode are given the same key, but my syntax must be broken.
I'm hoping to find a 1:1 way to map one variable name onto another.
I think a sample-to-barcode dictionary, combined with a lambda as input function to get the barcode assigned to a sample, is a possibility. For example:
import glob

BARCODE=['barcode01', 'barcode02', 'barcode03', 'barcode04']
SAMPLE=['sample01', 'sample02', 'sample03', 'sample04']
sam2bar= dict(zip(SAMPLE, BARCODE))

rule all:
    input:
        expand("Sequencing_reads/gathered/{sample}_ONT.fastq", sample=SAMPLE), #getting all of the fastq files with the same barcode assigned to the correct sample

rule demultiplex:
    input:
        glob.glob("Sequencing_reads/fastq_pass/*fastq"),
    output:
        done= touch('demux.done'), # This signals that guppy has completed
    shell:
        "guppy_barcoder --input_path Sequencing_reads/fastq_pass --save_path Sequencing_reads/demultiplexed -r "

rule gather:
    input:
        done= 'demux.done',
        fastq= lambda wc: glob.glob("Sequencing_reads/demultiplexed/%s/*fastq" % sam2bar[wc.sample])
    output:
        fastq= "Sequencing_reads/gathered/{sample}_ONT.fastq"
    shell:
        "cat {input.fastq} > {output.fastq} "

Snakemake best practice for demultiplexing

I have a question about best practices. Specifically, about the best snakemake pattern for demultiplexing reads from illumina sequencing. Our workflow needs to demultiplex multiple lanes of sequencing and then combine these in a single analysis. Obviously we know lane and sample names; however, sample names are not the same across lanes. With only one lane, one can do something like:
SAMPLES = [...]

rule demux:
    input:
        reads="lanes/lanename.fastq.gz",
        key="keys/lanename.txt"
    output:
        reads=expand("reads/{sample}.fastq.gz", sample=SAMPLES)
    ...
However with multiple lanes, I'm stuck wanting to use a function in a rule's output section. How would the following translate to something possible:
LANES = {
    "lane1": ["S1", "S2"],
    "lane2": ["S3", "S4"],
    "lane3": ["S5", "S6"],
}

rule demux:
    input:
        reads="lanes/{lane}.fastq.gz",
        key="keys/{lane}.txt"
    output:
        reads=lambda wc: expand("reads/{sample}.fastq.gz", sample=LANES[wc.lane])
    ...
Forgive me if this has been answered previously, or if there is some obvious approach I'm missing.
Cheers,
Kevin
I realise this is an old question, but I had the same problem and came up with the following solution: split the demux rule into two. The first creates a temporary directory holding the demultiplexed reads, and the second moves a single readset out of that directory.
Here's a relatively minimal example:
LANES = {
    "lane1": ["S1", "S2"],
    "lane2": ["S3", "S4"],
    "lane3": ["S5", "S6"],
}

SAMPLES = {}
for lane, samples in LANES.items():
    for sample in samples:
        SAMPLES[sample] = lane

rule all:
    input:
        expand("reads/{sample}.fastq.gz", sample=SAMPLES.keys())

rule get_reads:
    output:
        reads="lanes/{lane}.fastq.gz",
        key="keys/{lane}.txt",
    shell:
        "touch {output}"

rule demux:
    input:
        reads="lanes/{lane}.fastq.gz",
        key="keys/{lane}.txt",
    output:
        directory(temp("demux_tmp_{lane}"))
    params:
        f=lambda w: expand("{sample}.fastq.gz", sample=LANES[w.lane]),
    shell:
        "mkdir -p {output} ; cd {output} ; touch {params.f}"

rule demux_files:
    input:
        lambda w: f"demux_tmp_{SAMPLES[w.sample]}",
    output:
        "reads/{sample}.fastq.gz"
    shell:
        "mv {input}/{wildcards.sample}.fastq.gz {output}"