I have a question about best practices, specifically about the best Snakemake pattern for demultiplexing reads from Illumina sequencing. Our workflow needs to demultiplex multiple lanes of sequencing and then combine these in a single analysis. We know the lane and sample names in advance, but sample names are not shared across lanes. With only one lane, one can do something like:
SAMPLES = [...]

rule demux:
    input:
        reads="lanes/lanename.fastq.gz",
        key="keys/lanename.txt"
    output:
        reads=expand("reads/{sample}.fastq.gz", sample=SAMPLES)
    ...
However, with multiple lanes I'm stuck wanting to use a function as a rule's output. How would the following translate into something possible:
LANES = {
    "lane1": ["S1", "S2"],
    "lane2": ["S3", "S4"],
    "lane3": ["S5", "S6"],
}

rule demux:
    input:
        reads="lanes/{lane}.fastq.gz",
        key="keys/{lane}.txt"
    output:
        # not possible: Snakemake does not allow functions in output
        reads=lambda wc: expand("reads/{sample}.fastq.gz", sample=LANES[wc.lane])
    ...
Forgive me if this has been answered previously, or if there is some obvious approach I'm missing.
Cheers,
Kevin
I realise this is an old question, but I had the same problem and came up with the following solution: split the demux rule in two. The first rule creates a temporary directory holding the demultiplexed reads, and the second moves a single sample's readset out of that directory.
Here's a relatively minimal example:
LANES = {
    "lane1": ["S1", "S2"],
    "lane2": ["S3", "S4"],
    "lane3": ["S5", "S6"],
}

# invert LANES so we can look up the lane for each sample
SAMPLES = {}
for lane, samples in LANES.items():
    for sample in samples:
        SAMPLES[sample] = lane

rule all:
    input:
        expand("reads/{sample}.fastq.gz", sample=SAMPLES.keys())

rule get_reads:
    output:
        reads="lanes/{lane}.fastq.gz",
        key="keys/{lane}.txt",
    shell:
        "touch {output}"

rule demux:
    input:
        reads="lanes/{lane}.fastq.gz",
        key="keys/{lane}.txt",
    output:
        directory(temp("demux_tmp_{lane}"))
    params:
        f=lambda w: expand("{sample}.fastq.gz", sample=LANES[w.lane]),
    shell:
        "mkdir -p {output} ; cd {output} ; touch {params.f}"

rule demux_files:
    input:
        lambda w: f"demux_tmp_{SAMPLES[w.sample]}",
    output:
        "reads/{sample}.fastq.gz"
    shell:
        "mv {input}/{wildcards.sample}.fastq.gz {output}"
I want to do this:
configfile: "samples.yaml"

rule all:
    input:
        in1="../INPUT/TEST_A_1.fa",
        in2="../INPUT/TEST_A_2.fa"
    output:
        out1="../OUTPUT/TEST_A_1.fa",
        out2="../OUTPUT/TEST_A_2.fa",
        json="../OUTPUT/TEST_A.json"
    shell:
        "fastp --in1 {input.in1} --in2 {input.in2} --out1 {output.out1} --out2 {output.out2} --json {output.json}"
but for all the samples in samples.yaml:
samples:
  TEST_A
  TEST_B
Feels like it should be simple, but I am really struggling to wrap my head around how to do it.
Edit:
I think I figured it out, thanks to this: Snakemake using multi inputs.
The code below seems to work the way I intended:
configfile: "samples.yaml"

SAMPLES = config["samples"].split()

rule all:
    input:
        expand("../OUT_0001_IN_0002_clean_fasta_files/{sample}_{n}.fa", sample=SAMPLES, n=["1", "2"])

rule fastp:
    input:
        in1="../IN_0001_raw_fasta_files/{sample}_1.fa",
        in2="../IN_0001_raw_fasta_files/{sample}_2.fa"
    output:
        out1="../OUT_0001_IN_0002_clean_fasta_files/{sample}_1.fa",
        out2="../OUT_0001_IN_0002_clean_fasta_files/{sample}_2.fa",
        json="../OUT_0001_IN_0002_clean_fasta_files/{sample}.json"
    shell:
        "fastp --in1 {input.in1} --in2 {input.in2} --out1 {output.out1} --out2 {output.out2} --json {output.json}"
I would like to define input file names from different variables extracted from a CSV. I have built the following simplified example:
I have a file test.csv:
data/samples/A.fastq
data/samples/B.fastq
I give the path to test.csv in a json config file:
{
    "samples": {
        "summaryFile": "somepath/test.csv"
    }
}
Now I want to run bwa on each file within a rule. My feeling is that I have to use lambda wildcards but I am not sure. My Snakefile looks like this:
# only for bcf_tools
import pandas

input_table = config["samples"]["summaryFile"]
samplesData = pandas.read_csv(input_table)

def returnSamples(table):
    # Have tried different things here but nothing worked
    return table

rule all:
    input:
        expand("mapped_reads/{sample}.bam", sample=samplesData)

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: returnSamples(wildcards.sample)
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
I have tried a million things, including expand (which works, but then the rule is not called once per file).
Any help will be tremendously appreciated.
Snakemake works by defining which output you want (like you do in rule all). You are very close to a working solution, but a couple of small things went wrong:
By default, pandas.read_csv treats the first line of the file as a header, and iterating over a DataFrame yields its column names, so the expand in rule all does not receive the sample names you expect (try printing samplesData to see what it contains).
You do not need to use a lambda for the input; you can reuse the wildcard directly.
This should work for your example:
import pandas
import re

input_table = config["samples"]["summaryFile"]
samplesData = pandas.read_csv(input_table, header=None).loc[:, 0].tolist()
samples = [re.findall(r"[^/]+\.", sample)[0][:-1] for sample in samplesData]  # overly complicated regex to strip the directory and extension

rule all:
    input:
        expand("mapped_reads/{sample}.bam", sample=samples)

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
However, I think it would be easiest to change the layout of test.csv. Right now we have to do some weird magic to extract the sample name from the file path; it would probably be best to just store the sample names there as well.
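It could look something like this (a sketch; the column names are hypothetical). With an explicit sample column in test.csv:

sample,fastq
A,data/samples/A.fastq
B,data/samples/B.fastq

the regex gymnastics disappear:

import pandas
samplesData = pandas.read_csv("somepath/test.csv")  # the header row is now real
samples = samplesData["sample"].tolist()            # ['A', 'B']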
I have some ONT sequencing runs that have been basecalled on the MINIT. As such, when I demultiplex with guppy_barcoder, I get a directory of fastq files for each barcode. I want to use snakemake as a workflow manager to take these fastq files through our analyses, but this involves swapping the {barcode} for {sample} at some point.
import glob

BARCODE = ['barcode01', 'barcode02', 'barcode03', 'barcode04']
SAMPLE = ['sample01', 'sample02', 'sample03', 'sample04']

rule all:
    input:
        directory(expand("Sequencing_reads/demultiplexed/{barcode}", barcode=BARCODE)),  # guppy_barcoder
        expand("Sequencing_reads/gathered/{sample}_ONT.fastq", sample=SAMPLE),  # all fastq files with the same barcode, assigned to the correct sample

rule demultiplex:
    input:
        glob.glob("Sequencing_reads/fastq_pass/*fastq")
    output:
        directory(expand("Sequencing_reads/demultiplexed/{barcode}", barcode=BARCODE))
    shell:
        "guppy_barcoder --input_path Sequencing_reads/fastq_pass --save_path Sequencing_reads/demultiplexed -r "

rule gather:
    input:
        rules.demultiplex.output
    output:
        "Sequencing_reads/gathered/{sample}_ONT.fastq"
    shell:
        "cat Sequencing_reads/demultiplexed/{wildcards.barcode}/*fastq > {output.fastq} "
This does give me an error:
RuleException in line 32 of /home/eriny/sandbox/ONT_unicycler_pipeline/ONT_pipeline.smk:
'Wildcards' object has no attribute 'barcode'
But I actually think I'm missing something conceptually. I would like rule gather to be something like:
cat Sequencing_reads/demultiplexed/barcode01/*fastq > Sequencing_reads/gathered/sample01_ONT.fastq
I have tried setting up some dictionaries so that sample and barcode are given the same key, but my syntax must be broken.
I'm hoping to find a 1:1 way to map one variable name onto another.
I think a sample-to-barcode dictionary, combined with a lambda input function to get the barcode assigned to each sample, is a possibility. For example:
import glob

BARCODE = ['barcode01', 'barcode02', 'barcode03', 'barcode04']
SAMPLE = ['sample01', 'sample02', 'sample03', 'sample04']

sam2bar = dict(zip(SAMPLE, BARCODE))  # 1:1 mapping, e.g. 'sample01' -> 'barcode01'

rule all:
    input:
        expand("Sequencing_reads/gathered/{sample}_ONT.fastq", sample=SAMPLE),

rule demultiplex:
    input:
        glob.glob("Sequencing_reads/fastq_pass/*fastq"),
    output:
        done=touch('demux.done'),  # flag file signalling that guppy has completed
    shell:
        "guppy_barcoder --input_path Sequencing_reads/fastq_pass --save_path Sequencing_reads/demultiplexed -r "

rule gather:
    input:
        done='demux.done',
        fastq=lambda wc: glob.glob("Sequencing_reads/demultiplexed/%s/*fastq" % sam2bar[wc.sample])
    output:
        fastq="Sequencing_reads/gathered/{sample}_ONT.fastq"
    shell:
        "cat {input.fastq} > {output.fastq} "
I've just started to learn Snakemake so probably this is a naive question.
I need for a rule to be launched in two different moments of the pipeline, using different inputs and producing different outputs.
Let me make a silly example. Let's say I have 3 rules:
rule A:
    input:
        "data/sample.1.txt"
    output:
        "data/sample.2.sorted.txt"
    shell:
        "somethingsomething sort {input}"

rule B:
    input:
        "data/sample.2.sorted.txt"
    output:
        "data/sample.3.man.txt"
    shell:
        "somethingsomething manipulate {input}"

rule C:
    input:
        "data/sample.3.man.txt"
    output:
        "data/sample.4.man.sorted.txt"
    shell:
        "somethingsomething sort {input}"
In this example the pipeline does A > B > C, where A and C do exactly the same thing, but one takes file 1 and outputs 2, while the other takes 3 and outputs 4. What is the best solution for having a single sorting rule, so the pipeline effectively does A > B > A? (Probably there's some way to do it using wildcards, maybe in combination with an if/else, but I'm not sure how.)
Thank you for your time
Probably there's some way to re-use a rule but I suspect it's going to make the code more convoluted than necessary.
If the shell call is the same for rule A and C and you don't want to copy and paste the code, just put the code in a variable and re-use that variable:
sorter = "somethingsomething sort {input}"

rule A:
    input:
        "data/sample.1.txt"
    output:
        "data/sample.2.sorted.txt"
    shell:
        sorter

rule B:
    input:
        "data/sample.2.sorted.txt"
    output:
        "data/sample.3.man.txt"
    shell:
        "somethingsomething manipulate {input}"

rule C:
    input:
        "data/sample.3.man.txt"
    output:
        "data/sample.4.man.sorted.txt"
    shell:
        sorter
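That said, if you are free to rename the intermediate files so that each step only ever appends a suffix, a single wildcard rule can play the part of both A and C; a minimal sketch (the file-naming scheme is hypothetical):

rule sort:
    input:
        "data/{name}.txt"
    output:
        "data/{name}.sorted.txt"
    shell:
        "somethingsomething sort {input}"

rule manipulate:
    input:
        "data/{name}.sorted.txt"
    output:
        "data/{name}.man.txt"
    shell:
        "somethingsomething manipulate {input}"

Requesting data/sample.1.man.sorted.txt then runs sort, manipulate, sort, reusing the same sorting rule twice.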
I have a complicated workflow which I progressively extended. The last extension resulted in an AmbiguousRuleException. I tried to reproduce the critical structure of the workflow in the following example:
NUMBERS = ["1", "2"]
LETTERS = ["a", "b", "c"]
WORDS = ["foo", "bar", "baz"]
CHOICES = ["yes", "no"]

rule all:
    input:
        # (1)
        expand("results/allthings/{word}_{choice}.md5sum", word=WORDS, choice=CHOICES)
        #expand("results/allthings/{word}_{choice}.md5sum", word=WORDS + ["all"], choice=CHOICES)

rule make_things:
    output:
        "results/{letter}_{number}/{word}_{choice}.txt"
    shell:
        """
        echo "{wildcards.letter}_{wildcards.number}_{wildcards.word}_{wildcards.choice}" > {output}
        """

rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    shell:
        """
        cat {input} > {output}
        """

# (2)
#rule join_all_words:
#    input:
#        expand("results/allthings/{word}_{{choice}}.txt", word=WORDS)
#    output:
#        "results/allthings/all_{choice}.txt"
#    shell:
#        """
#        cat {input} > {output}
#        """

# (3)
#def source_data(wildcards):
#    if wildcards.word == "all":
#        return rules.join_all_words.output
#    else:
#        return rules.gather_things.output

rule compute_md5:
    input:
        # (4)
        rules.gather_things.output,
        #source_data
    output:
        "results/allthings/{word}_{choice}.md5sum"
    shell:
        """
        md5sum {input} > {output}
        """
The above state is functional. Switching (1) and (4) and uncommenting (2) and (3) corresponds to the extension I'm trying to make, and results in the following failure:
AmbiguousRuleException:
Rules gather_things and join_all_words are ambiguous for the file results/allthings/all_yes.txt.
Expected input files:
gather_things: results/a_1/all_yes.txt results/a_2/all_yes.txt results/b_1/all_yes.txt results/b_2/all_yes.txt results/c_1/all_yes.txt results/c_2/all_yes.txt
join_all_words: results/allthings/foo_yes.txt results/allthings/bar_yes.txt results/allthings/baz_yes.txt
It seems that snakemake thinks that results/allthings/all_yes.txt can be generated by gather_things.
Why?
How can I avoid that?
Note: the goal of modifications (3) and (4) is to have compute_md5 work both on the direct output of gather_things (for foo, bar and baz) and on the joined output of the three (all), while keeping inputs defined in terms of other rules' outputs as much as possible (which makes changes easier than when file names are spelled out explicitly).
2017-07-28 Post edited for brevity
Initially I thought it was just ambiguity. The first three points relate to resolving the ambiguity; after those, I explain how to generalize "compute_md5" to achieve the desired behaviour.
Controlling ambiguity
1) Controlling the ambiguity:
ruleorder
http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=ruleorder#handling-ambiguous-rules
I suggest avoiding this in the following situation: in the grand hopes of modularity, using "ruleorder" essentially couples two rules together. The "ruleorder" directive can only be used if both rules are present within the Snakefile's scope, which can be a problem for modularization if the rules are not always provided together. If the rules are always provided together, I would argue they are already coupled, so doing this doesn't make the situation worse; in fact, it increases cohesion. Use "ruleorder" when constraints aren't enough, as sometimes there will be unavoidable ambiguity.
https://en.wikipedia.org/wiki/GRASP_(object-oriented_design)
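For the example in this thread, the directive would look something like this (only needed if both rules stay otherwise unconstrained):

ruleorder: join_all_words > gather_things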
conditional 'includes'
https://github.com/tboyarski/BCCRC-Snakemake/tree/master/modules/bamGen
The rule order is in the "_INCLUDE". The outputs of sam2BAM and bamALIGN_bwa are very similar, mainly because sam2BAM is so generic.
Because bamALIGN_bwa and bamALIGN_star are technically switchable, and I didn't want users swapping around ruleorder just to switch between them, I keep a boolean in my YAML file that acts as a hard filter to literally prevent Snakemake from even seeing the other rule. This works great in situations where you can ONLY pick one or the other (in this case, the two aligners have their own reference genomes, and I force the user to set the reference genome at the beginning of my pipeline, so users could never actually run both). I have not implemented functionality to detect which reference genome is being used so that the corresponding aligner is chosen automatically; that would be a bit of overhead Python code, a great idea, but not currently implemented.
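A minimal sketch of that hard-filter idea (the config key and module paths are hypothetical):

# config.yaml contains, e.g.:  aligner: bwa
if config["aligner"] == "bwa":
    include: "modules/bamALIGN_bwa.smk"
else:
    include: "modules/bamALIGN_star.smk"

Because only one include is ever executed, Snakemake never sees the competing rule, and no ruleorder is required.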
2) Asking Snakemake to ignore the ambiguity:
An override exists, but I think "--allow-ambiguity" should be avoided whenever possible.
http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=--allow-ambiguity#handling-ambiguous-rules
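With this flag, Snakemake proceeds despite ambiguous matches instead of raising an AmbiguousRuleException, e.g.:

snakemake --allow-ambiguity results/allthings/all_yes.md5sum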
3) Elegantly preventing the ambiguity:
http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=wildcard_constraints#wildcards
rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    wildcard_constraints:
        word='|'.join(WORDS)  # i.e. 'foo|bar|baz': {word} can never be 'all'
    ...
This rule needs a wildcard_constraint to keep it from competing with the "join_all_words" rule. Constraining the wildcard "word" to the known words means it can never match the string 'all', which makes "gather_things" and "join_all_words" differentiable. (Note that a regex character class such as [^(all)] would exclude single characters, not the string 'all', so listing the allowed words is more reliable.)
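To see why the constraint settles the ambiguity, this is roughly the regular expression Snakemake builds from the constrained output pattern (an illustration, not Snakemake's actual internals):

import re

# {word} is replaced by the constraint, {choice} by the default .+
pattern = re.compile(r"results/allthings/(?P<word>foo|bar|baz)_(?P<choice>.+)\.txt$")

print(pattern.match("results/allthings/all_yes.txt"))  # None: gather_things can no longer produce it
print(pattern.match("results/allthings/foo_yes.txt"))  # a match: gather_things still applies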
compute_md5 generalizability
As for getting "compute_md5" to accept input from both "gather_things" and "join_all_words": this requires making it more generalized, and has nothing to do with ambiguity. The next thing you need to do is adjust the "compute_md5" rule so that its input is not defined in terms of any given rule's output.
https://github.com/tboyarski/BCCRC-Snakemake/blob/master/help/download.svg
I just want to also thank you for providing a TOP-NOTCH example to work from. Brilliant!
NUMBERS = ["1", "2"]
LETTERS = ["a", "b", "c"]
WORDS = ["foo", "bar", "baz"]
CHOICES = ["yes", "no"]

rule all:
    input:
        expand("results/allthings/all_{choice}.md5sum", choice=CHOICES),
        expand("results/allthings/{word}_{choice}.md5sum", word=WORDS, choice=CHOICES)

rule make_things:
    output:
        "results/{letter}_{number}/{word}_{choice}.txt"
    shell:
        """
        echo "{wildcards.letter}_{wildcards.number}_{wildcards.word}_{wildcards.choice}" > {output}
        """

rule gather_things:
    input:
        expand("results/{letter}_{number}/{{word}}_{{choice}}.txt", letter=LETTERS, number=NUMBERS)
    output:
        "results/allthings/{word}_{choice}.txt"
    wildcard_constraints:
        word='|'.join(WORDS)  # {word} can never be 'all'
    shell:
        """
        cat {input} > {output}
        """

rule join_all_words:
    input:
        expand("results/allthings/{word}_{{choice}}.txt", word=WORDS)
    output:
        "results/allthings/all_{choice}.txt"
    shell:
        """
        cat {input} > {output}
        """

rule compute_md5:
    input:
        "{pathCMD5}/{sample}.txt"
    output:
        "{pathCMD5}/{sample}.md5sum"
        #"results/allthings/{word}_{choice}.md5sum"
    shell:
        """
        md5sum {input} > {output}
        """
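A quick dry run should confirm that both branches now resolve without ambiguity, e.g.:

snakemake -n results/allthings/all_yes.md5sum results/allthings/foo_no.md5sum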