Refer to input or output files of another Snakemake rule - snakemake

How can I programmatically refer to attributes of another Snakemake rule? What do I need to replace <whatever is in test1's input> with, for this to work?
rule test1:
    input: 'a.txt'
rule test2:
    output: <whatever is in test1's input> <---- ?
    shell: 'touch {output}'

Other rules can be referred to as rules.<rule_name>, as explained in the documentation. However, the referenced rule must come first, that is, it must be defined above the rule that references it:
rule test1:
    input: 'a.txt'
rule test2:
    output: rules.test1.input  # <---
    shell: 'touch {output}'

Related

Snakemake input two variables and output one variable

I want to rename and move my fastq.gz files from these:
NAME-BOB_S1_L001_R1_001.fastq.gz
NAME-BOB_S1_L001_R2_001.fastq.gz
NAME-JOHN_S2_L001_R1_001.fastq.gz
NAME-JOHN_S2_L001_R2_001.fastq.gz
to these:
NAME_BOB/reads/NAME_BOB.R1.fastq.gz
NAME_BOB/reads/NAME_BOB.R2.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R1.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R2.fastq.gz
This is my code. The problem is the second variable, S: I do not know how to specify it in the code, because I do not need it in my output filename.
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
S=["S1","S2"]
rule all:
    input:
        expand("NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz", dir=DIR),
        expand("NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz", dir=DIR)
rule rename:
    input:
        fastq1=("fastq/NAME-{dir}_{s}_L001_R1_001.fastq.gz", zip, dir=DIR, s=S),
        fastq2=("fastq/NAME-{dir}_{s}_L001_R2_001.fastq.gz", zip, dir=DIR, s=S)
    output:
        fastq1="NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz",
        fastq2="NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz"
    shell:
        """
        mv {input.fastq1} {output.fastq1}
        mv {input.fastq2} {output.fastq2}
        """
There are several problems in your code. First of all, the {dir} in your output and the {dir} in your input are two different things: the {dir} in the output is a wildcard, while the {dir} in the input is a keyword argument meant for the expand function. The second problem is that you never actually call expand.
The third problem is that the rule moves two files in a single job. A multi-line shell block is allowed, and so is mv {input.fastq1} {output.fastq1}; mv {input.fastq2} {output.fastq2}, but neither is idiomatic here. It is much better to create a rule that produces a single file and let Snakemake do the rest of the work.
Finally, the S value fully depends on the DIR value, so it becomes a function of {dir}, and that can be solved with a lambda in input:
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
dir2s = {"BOB": "S1", "JOHN": "S2"}
rule all:
    input:
        expand("NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz", dir=DIR, r=["R1", "R2"])
rule rename:
    input:
        lambda wildcards: "fastq/NAME-{{dir}}_{s}_L001_{{r}}_001.fastq.gz".format(s=dir2s[wildcards.dir])
    output:
        "NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz",
    shell:
        """
        mv {input} {output}
        """
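The dir2s lookup could also be derived from the two original lists instead of being typed by hand. Here is a minimal Python sketch, assuming DIR and S are listed in matching order. The helper name rename_input and the use of SimpleNamespace as a stand-in for Snakemake's wildcards object are illustrative assumptions, not part of the answer above. The sketch also shows what the doubled braces do: str.format fills only {s} and leaves {{dir}} and {{r}} behind as {dir} and {r}.

```python
from types import SimpleNamespace

DIR = ["BOB", "JOHN"]
S = ["S1", "S2"]

# Pair each directory name with its S value (assumes both lists share one order).
dir2s = dict(zip(DIR, S))

# Doubled braces are escapes: .format() fills only {s} and leaves {dir} and {r}.
template = "fastq/NAME-{{dir}}_{s}_L001_{{r}}_001.fastq.gz".format(s=dir2s["BOB"])


def rename_input(wildcards):
    """Hypothetical input function that fills in every field explicitly."""
    return "fastq/NAME-{d}_{s}_L001_{r}_001.fastq.gz".format(
        d=wildcards.dir, s=dir2s[wildcards.dir], r=wildcards.r
    )


# SimpleNamespace stands in for the wildcards object Snakemake would pass.
example = rename_input(SimpleNamespace(dir="JOHN", r="R2"))
print(template)
print(example)
```

In a Snakefile, a named function like rename_input could be given to input: in place of the lambda, which tends to be easier to read and debug.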

NameError Snakemake when I want to run my code [duplicate]

I'm trying to build a Snakemake pipeline, but I'm confused why filename wildcards work for input and output, but not for shell. For example, the following works fine:
samplelist=[ "aa_S1", "bb_S2"]
rule all:
    input: expand("{sample}.out", sample=samplelist)
rule align:
    input: "{sample}.txt"
    output: "{sample}.out"
    shell: "touch {output}"
But let's say that the command I use for shell actually derives from a string I give it, so I can't name the output file directly in the shell command. Then how can I use my filename wildcard (here {sample}) in the shell command?
For example, the following doesn't work:
samplelist=[ "aa_S1", "bb_S2"]
rule all:
    input: expand("{sample}.out", sample=samplelist)
rule align:
    input: "{sample}.txt"
    output: "{sample}.out"
    shell: "touch {sample}.out"
It gives me the following error:
RuleException in line 6 of Snakefile:
NameError: The name 'sample' is unknown in this context. Please make sure
that you defined that variable. Also note that braces not used for variable
access have to be escaped by repeating them, i.e. {{print $1}}
How can I work around this?
(Or if you really want to see some real life code, here is what I'm working with):
samplelist=[ "aa_S1", "bb_S2"]
rule all:
    input: expand("{sample}_aligned.sam", sample=samplelist)
rule align:
    input: "{sample}_R1_001.trimmed.fastq.gz",
           "{sample}_R2_001.trimmed.fastq.gz"
    output: "{sample}_aligned.sam"
    threads: 4
    shell: "STAR --outFileNamePrefix {sample}_aligned --readFilesIn {input[0]} {input[1]} --readFilesCommand zcat --runMode alignReads --runThreadN {threads} --genomeDir /path/to/StarIndex"
But the error message is basically the same. For shell, I can use {input}, {output}, and {threads}, but not {sample}.
I did look at Snakemake: How do I use a function that takes in a wildcard and returns a value?, but that seems to be focused on generating input file names. My issue deals with interpolation of the filename wildcard into the shell command.
Wildcards are available in the shell command via {wildcards.XXXX}:
rule align:
    input: "{sample}.txt"
    output: "{sample}.out"
    shell: "touch {wildcards.sample}.out"
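Why {sample} fails while {wildcards.sample} works can be illustrated with plain Python string formatting. This is only an analogy, assuming the shell string is filled from a fixed set of named objects such as output and wildcards. Snakemake's real formatting code is more involved and reports a NameError rather than the KeyError caught here. The name job_namespace and the use of SimpleNamespace are hypothetical.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the objects available inside a shell command.
job_namespace = {
    "output": "aa_S1.out",
    "wildcards": SimpleNamespace(sample="aa_S1"),
}

# Attribute access on the wildcards object resolves correctly.
ok = "touch {wildcards.sample}.out".format(**job_namespace)

# A bare {sample} is not among the available names, so formatting fails.
try:
    "touch {sample}.out".format(**job_namespace)
    failed = False
except KeyError:
    failed = True

print(ok)
```

The same substitution applies to the STAR command in the question: the prefix would be written as {wildcards.sample}_aligned.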

Run one rule conditionally to the input files

I've just started to learn Snakemake so probably this is a naive question.
I need a rule to run at two different points in the pipeline, using different inputs and producing different outputs.
Let me make a silly example. Let's say I have 3 rules:
rule A:
    input:
        "data/sample.1.txt"
    output:
        "data/sample.2.sorted.txt"
    shell:
        "somethingsomething sort {input}"
rule B:
    input:
        "data/sample.2.sorted.txt"
    output:
        "data/sample.3.man.txt"
    shell:
        "somethingsomething manipulate {input}"
rule C:
    input:
        "data/sample.3.man.txt"
    output:
        "data/sample.4.man.sorted.txt"
    shell:
        "somethingsomething sort {input}"
In this example, where the pipeline does A > B > C, A and C do exactly the same, but one uses the file 1 and outputs 2, while the other uses 3 and outputs 4. What is the best solution to just have a single sorting rule so the pipeline does A > B > A? (Probably there's some way to do it using the wildcards, maybe in combination with an if else, but I am not sure how)
Thank you for your time
Probably there's some way to re-use a rule but I suspect it's going to make the code more convoluted than necessary.
If the shell call is the same for rule A and C and you don't want to copy and paste the code, just put the code in a variable and re-use that variable:
sorter = "somethingsomething sort {input}"
rule A:
    input:
        "data/sample.1.txt"
    output:
        "data/sample.2.sorted.txt"
    shell:
        sorter
rule B:
    input:
        "data/sample.2.sorted.txt"
    output:
        "data/sample.3.man.txt"
    shell:
        "somethingsomething manipulate {input}"
rule C:
    input:
        "data/sample.3.man.txt"
    output:
        "data/sample.4.man.sorted.txt"
    shell:
        sorter
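Sharing the string works because the {input} placeholder is filled in separately for each rule. A rough Python analogy, assuming plain str.format as a stand-in for Snakemake's own substitution (which is more involved):

```python
# The shared command template from the answer above.
sorter = "somethingsomething sort {input}"

# Each rule supplies its own input, so one template yields two commands.
command_a = sorter.format(input="data/sample.1.txt")
command_c = sorter.format(input="data/sample.3.man.txt")

print(command_a)
print(command_c)
```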

Snakemake best practice for demultiplexing

I have a question about best practices, specifically about the best Snakemake pattern for demultiplexing reads from Illumina sequencing. Our workflow needs to demultiplex multiple lanes of sequencing and then combine them in a single analysis. We know the lane and sample names, but sample names are not the same across lanes. With only one lane, one can do something like:
SAMPLES = [...]
rule demux:
    input:
        reads="lanes/lanename.fastq.gz",
        key="keys/lanename.txt"
    output:
        reads=expand("reads/{sample}.fastq.gz", sample=SAMPLES)
    ...
However, with multiple lanes, I'm stuck wanting to use a function for a rule's output. How would the following translate into something that actually works:
LANES = {
    "lane1": ["S1", "S2"],
    "lane2": ["S3", "S4"],
    "lane3": ["S5", "S6"],
}
rule demux:
    input:
        reads="lanes/{lane}.fastq.gz",
        key="keys/{lane}.txt"
    output:
        reads=lambda wc: expand("reads/{sample}.fastq.gz", sample=LANES[wc.lane])
    ...
Forgive me if this has been answered previously, or if there is some obvious approach I'm missing.
Cheers,
Kevin
I realise this is an old question, but I had the same problem, and came up with the following solution: Split the demux rule into two. The first creates a temporary directory which holds the demultiplexed reads, and then the second moves a single readset from that directory.
Here's a relatively minimal example:
LANES = {
    "lane1": ["S1", "S2"],
    "lane2": ["S3", "S4"],
    "lane3": ["S5", "S6"],
}
SAMPLES = {}
for lane, samples in LANES.items():
    for sample in samples:
        SAMPLES[sample] = lane
rule all:
    input:
        expand("reads/{sample}.fastq.gz", sample=SAMPLES.keys())
rule get_reads:
    output:
        reads="lanes/{lane}.fastq.gz",
        key="keys/{lane}.txt",
    shell:
        "touch {output}"
rule demux:
    input:
        reads="lanes/{lane}.fastq.gz",
        key="keys/{lane}.txt",
    output:
        directory(temp("demux_tmp_{lane}"))
    params:
        f=lambda w: expand("{sample}.fastq.gz", sample=LANES[w.lane]),
    shell:
        "mkdir -p {output} ; cd {output} ; touch {params.f}"
rule demux_files:
    input:
        lambda w: f"demux_tmp_{SAMPLES[w.sample]}",
    output:
        "reads/{sample}.fastq.gz"
    shell:
        "mv {input}/{wildcards.sample}.fastq.gz {output}"
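The reverse lookup that drives demux_files can be checked in plain Python before running the workflow. A small sketch, assuming the same LANES dictionary; the helper name demux_dir is hypothetical and simply mirrors the lambda in demux_files.

```python
LANES = {
    "lane1": ["S1", "S2"],
    "lane2": ["S3", "S4"],
    "lane3": ["S5", "S6"],
}

# Invert lane -> samples into sample -> lane (same result as the nested loop).
SAMPLES = {sample: lane for lane, samples in LANES.items() for sample in samples}

# Sample names must be unique across lanes, or a later lane overwrites an earlier one.
assert len(SAMPLES) == sum(len(samples) for samples in LANES.values())


def demux_dir(sample):
    """Hypothetical helper: the temporary directory holding this sample's reads."""
    return f"demux_tmp_{SAMPLES[sample]}"


print(demux_dir("S3"))
```

The uniqueness check matters for this approach: if two lanes shared a sample name, the inverted dictionary would keep only one of them.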