wildcard_constraints between two wildcards with OR - snakemake

I'd like to constrain a rule based on two wildcards so that it runs if (id == 'FOO') || (id == 'BAR' && ver == '2'). However, I am not quite sure how to do it (or whether it is possible). I tried the example below, but it doesn't seem to work:
rule foo:
    input: "{id}{ver}.txt"
    output: "{id}{ver}.out"
    wildcard_constraints:
        id = "FOO"
    wildcard_constraints:
        id = "BAR",
        ver = "2"

I am not sure your current approach will work. Why not simply ask snakemake to make the files you need? e.g.:
rule all:
    input: expand('FOO{ver}.out', ver=somelist), 'BAR2.out'

rule foo:
    input: "{id}{ver}.txt"
    output: "{id}{ver}.out"
    shell: "some_command {input} > {output}"
This should run rule foo for every FOO{ver}.out file you request (somelist being your list of versions) plus BAR2.out, and for nothing else.
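If you would rather keep a constraint-style guard, one workaround (a sketch only: wildcard_constraints applies one regex per wildcard, so it cannot express a condition across two wildcards) is to merge id and ver into a single wildcard and encode the OR in one regex. The \d+ below assumes versions are numeric; adjust it to your naming scheme.
rule foo:
    input: "{idver}.txt"
    output: "{idver}.out"
    wildcard_constraints:
        # FOO with any numeric version, or BAR only with version 2
        idver = r"FOO\d+|BAR2"
    shell: "some_command {input} > {output}"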

snakemake rule error. How can I select rule order?

I am running Snakemake for an RNA-seq analysis. I wrote the Snakefile below, and an error occurred in the terminal. I placed rule salmon_quant_reads last, but it runs first, and Snakemake reports an error in that rule: salmon_quant_reads must run only after salmon_index has finished.
Error in rule salmon_quant_reads:
jobid: 173
output: salmon/WT_Veh_11/quant.sf, salmon/WT_Veh_11/lib_format_counts.json
log: logs/salmon/WT_Veh_11.log (check log file(s) for error message)
conda-env: /home/baelab2/LEEJUNEYOUNG/7.Colesevelam/RNA-seq/.snakemake/conda/ff908de630224c1a4118f5dc69c8a761
RuleException:
CalledProcessError in line 111 of /home/baelab2/LEEJUNEYOUNG/7.Colesevelam/RNA-seq/Snakefile_2:
Command 'source /home/baelab2/miniconda3/bin/activate '/home/baelab2/LEEJUNEYOUNG/7.Colesevelam/RNA-seq/.snakemake/conda/ff908de630224c1a4118f5dc69c8a761'; set -euo pipefail; /home/baelab2/miniconda3/envs/snakemake/bin/python3.10 /home/baelab2/LEEJUNEYOUNG/7.Colesevelam/RNA-seq/.snakemake/scripts/tmpr6r8ryk9.wrapper.py' returned non-zero exit status 1.
File "/home/baelab2/LEEJUNEYOUNG/7.Colesevelam/RNA-seq/Snakefile_2", line 111, in __rule_salmon_quant_reads
File "/home/baelab2/miniconda3/envs/snakemake/lib/python3.10/concurrent/futures/thread.py", line 58, in run
How can I fix it? Here is my Snakefile:
SAMPLES = ["KO_Col_5", "KO_Col_6", "KO_Col_7", "KO_Col_8", "KO_Col_9", "KO_Col_10", "KO_Col_11", "KO_Col_15", "KO_Veh_3", "KO_Veh_4", "KO_Veh_5", "KO_Veh_9", "KO_Veh_11", "KO_Veh_13", "KO_Veh_14", "WT_Col_1", "WT_Col_2", "WT_Col_3", "WT_Col_6", "WT_Col_8", "WT_Col_10", "WT_Col_12", "WT_Veh_1", "WT_Veh_2", "WT_Veh_4", "WT_Veh_7", "WT_Veh_8", "WT_Veh_11", "WT_Veh_14"]
rule all:
    input:
        expand("raw/{sample}_1.fastq.gz", sample=SAMPLES),
        expand("raw/{sample}_2.fastq.gz", sample=SAMPLES),
        expand("qc/fastqc/{sample}_1.before.trim_fastqc.zip", sample=SAMPLES),
        expand("qc/fastqc/{sample}_2.before.trim_fastqc.zip", sample=SAMPLES),
        expand("trimmed/{sample}_1.fastq.gz", sample=SAMPLES),
        expand("trimmed/{sample}_2.fastq.gz", sample=SAMPLES),
        expand("qc/fastqc/{sample}_1.after.trim_fastqc.zip", sample=SAMPLES),
        expand("qc/fastqc/{sample}_2.after.trim_fastqc.zip", sample=SAMPLES),
        expand("salmon/{sample}/quant.sf", sample=SAMPLES),
        expand("salmon/{sample}/lib_format_counts.json", sample=SAMPLES)
rule fastqc_before_trim_1:
    input:
        "raw/{sample}.fastq.gz",
    output:
        html="qc/fastqc/{sample}.before.trim.html",
        zip="qc/fastqc/{sample}.before.trim_fastqc.zip",
    log:
        "logs/fastqc/{sample}.before.log"
    threads: 10
    priority: 1
    wrapper:
        "v1.7.0/bio/fastqc"
rule cutadapt:
    input:
        r1 = "raw/{sample}_1.fastq.gz",
        r2 = "raw/{sample}_2.fastq.gz"
    output:
        fastq1="trimmed/{sample}_1.fastq.gz",
        fastq2="trimmed/{sample}_2.fastq.gz",
        qc="trimmed/{sample}.qc.txt"
    params:
        adapters = "-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
        extra = "--minimum-length 1 -q 20"
    log:
        "logs/cutadapt/{sample}.log"
    threads: 10
    priority: 2
    wrapper:
        "v1.7.0/bio/cutadapt/pe"
rule fastqc_after_trim_2:
    input:
        "trimmed/{sample}.fastq.gz"
    output:
        html="qc/fastqc/{sample}.after.trim.html",
        zip="qc/fastqc/{sample}.after.trim_fastqc.zip"
    log:
        "logs/fastqc/{sample}.after.log"
    threads: 10
    priority: 3
    wrapper:
        "v1.7.0/bio/fastqc"
rule salmon_index:
    input:
        sequences="raw/Mus_musculus.GRCm39.cdna.all.fasta"
    output:
        multiext(
            "salmon/transcriptome_index/",
            "complete_ref_lens.bin",
            "ctable.bin",
            "ctg_offsets.bin",
            "duplicate_clusters.tsv",
            "info.json",
            "mphf.bin",
            "pos.bin",
            "pre_indexing.log",
            "rank.bin",
            "refAccumLengths.bin",
            "ref_indexing.log",
            "reflengths.bin",
            "refseq.bin",
            "seq.bin",
            "versionInfo.json",
        ),
    log:
        "logs/salmon/transcriptome_index.log",
    threads: 10
    priority: 10
    params:
        # optional parameters
        extra="",
    wrapper:
        "v1.7.0/bio/salmon/index"
rule salmon_quant_reads:
    input:
        # If you have multiple fastq files for a single sample (e.g. technical replicates)
        # use a list for r1 and r2.
        r1 = "trimmed/{sample}_1.fastq.gz",
        r2 = "trimmed/{sample}_2.fastq.gz",
        index = "salmon/transcriptome_index"
    output:
        quant = "salmon/{sample}/quant.sf",
        lib = "salmon/{sample}/lib_format_counts.json"
    log:
        "logs/salmon/{sample}.log"
    params:
        # optional parameters
        libtype ="A",
        extra="--validateMappings"
    threads: 10
    priority: 20
    wrapper:
        "v1.7.0/bio/salmon/quant"
The only link between salmon_quant_reads and salmon_index is the directory salmon/transcriptome_index. However, the mere existence of that directory does not guarantee that all the work in salmon_index has been completed. So a quick way to fix this is to add one of the index files as an explicit input:
rule salmon_quant_reads:
    input:
        # If you have multiple fastq files for a single sample (e.g. technical replicates)
        # use a list for r1 and r2.
        r1 = "trimmed/{sample}_1.fastq.gz",
        r2 = "trimmed/{sample}_2.fastq.gz",
        index = "salmon/transcriptome_index",
        _temp_dependency = "salmon/transcriptome_index/info.json",
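An alternative sketch, if you prefer not to carry a dummy input: let salmon_index declare the whole index directory as its output using Snakemake's directory() marker, so the directory only counts as finished once the rule completes, and the quant rule's index input then matches this output directly. Note that this bypasses the v1.7.0 wrapper (which expects the individual index files as outputs), so the shell command below is an assumption about a plain salmon installation, not a drop-in replacement:
rule salmon_index:
    input:
        sequences="raw/Mus_musculus.GRCm39.cdna.all.fasta"
    output:
        # directory() marks the directory itself as the product; it is
        # considered complete only once the rule has exited successfully.
        directory("salmon/transcriptome_index")
    log:
        "logs/salmon/transcriptome_index.log"
    threads: 10
    shell:
        "salmon index -t {input.sequences} -i {output} -p {threads} &> {log}"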

nextflow .collect() method in RNA-seq example workflow

I understand we have to use collect() when we run a process that takes two channels as input, where the first channel has one element and the second one has more than one element:
#!/usr/bin/env nextflow
nextflow.enable.dsl=2

process A {
    input:
    val(input1)

    output:
    path 'index.txt', emit: foo

    script:
    """
    echo 'This is an index' > index.txt
    """
}

process B {
    input:
    val(input1)
    path(input2)

    output:
    path("${input1}.txt")

    script:
    """
    cat <(echo ${input1}) ${input2} > \"${input1}.txt\"
    """
}

workflow {
    A( Channel.from( 'A' ) )
    // This would only run for one element of the first channel:
    B( Channel.from( 1, 2, 3 ), A.out.foo )
    // and this for all of them as intended:
    B( Channel.from( 1, 2, 3 ), A.out.foo.collect() )
}
Now the question: why can this line in the example workflow from nextflow-io (https://github.com/nextflow-io/rnaseq-nf/blob/master/modules/rnaseq.nf#L15) work without using collect() or toList()?
It is the same situation: a channel with one element (the index) and a channel with more than one (the fastq pairs) are consumed by the same process (quant), and it runs for all fastq files. What am I missing compared to my dummy example?
You need to create the first channel with a value factory, which yields a value channel that never gets exhausted.
Your linked example implicitly creates a value channel, which is why it works. The same happens when you call .collect() on A.out.foo.
Channel.from (or the more modern Channel.of) creates a queue channel, which can be exhausted; that is why both A and B only run once.
So
A( Channel.value('A') )
is all you need.
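For reference: Nextflow distinguishes queue channels, whose items are consumed as they are read, from value channels, which are bound to a single value and can be read any number of times. Channel.value creates a value channel directly, and operators such as collect() and first() also return one; that is what lets a single index be reused across all fastq pairs in the linked workflow.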

chain/dependency of some rules by wildcards

I have a particular use case for which I have not found a solution in the Snakemake documentation.
Let's say a given pipeline contains a portion with 3 rules a, b and c, which will run for N samples.
Those rules handle large amounts of data, and for reasons of local storage limits I do not want them to execute at the same time for different samples. For instance, rule a produces the large amount of data and rule c compresses and exports the results.
So what I am looking for is a way to chain those 3 rules for 1 sample/wildcard and only then execute them for the next sample, to make sure the local space is available.
Thanks
I agree that this is a problem that Snakemake still has no solution for. However, you may use a workaround:
rule all:
    input: expand("a{sample}", sample=[1, 2, 3])

rule a:
    input: "b{sample}"
    output: "a{sample}"

rule b:
    input: "c{sample}"
    output: "b{sample}"

rule c:
    input:
        # wildcards are strings, so convert before doing arithmetic
        lambda wildcards: f"a{int(wildcards.sample) - 1}"
    output: "c{sample}"
That means that rule c for sample 2 won't start before the output of rule a for sample 1 is ready. You need to add a pseudo-output a0 though, or make the lambda more complicated.
So building on Dmitry Kuzminov's answer, the following can work (both with numbers as samples and strings).
The execution order will be a3 > b3 > a1 > b1 > a2 > b2.
I used a different sample order to show it can be made different from the sample list.
samples = [1, 2, 3]
sample_order = [3, 1, 2]

def get_previous(wildcards):
    # wildcards are strings, so compare against string versions of the order
    order = [str(s) for s in sample_order]
    if wildcards.sample != order[0]:  # if different from sample 3 in this case
        previous_sample = order[order.index(wildcards.sample) - 1]
        return f'b_out_{previous_sample}'
    else:  # the first sample in the order, i.e. 3
        # return a dummy file that is always present, e.g. the Snakefile itself
        return 'Snakefile'

rule all:
    input: expand("b_out_{S}", S=samples)

rule a:
    input:
        "a_in_{sample}",
        get_previous
    output:
        "a_out_{sample}"

rule b:
    input:
        "a_out_{sample}"
    output:
        "b_out_{sample}"

Snakemake multiple input files with expand but no repetitions

I'm new to Snakemake and I can't figure out this problem.
I've got a rule with two inputs:
rule test:
    input:
        input_file1 = f1,
        input_file2 = f2
f1 is in [A{1}$, A{2}£, B{1}€, B{2}¥]
f2 is in [C{1}, C{2}]
The numbers are wildcards that come from an expand call. I need a way to pass f1 and f2 a pair of files whose numbers match exactly. For example:
f1 = A1
f2 = C1
or
f1 = B1
f2 = C1
I have to avoid combinations such as:
f1 = A1
f2 = C2
I would create a function that makes this kind of match between the files, but it would have to manage input_file1 and input_file2 at the same time. I thought of making a function that creates a dictionary with the allowed combinations, but how would I "iterate" over it during the expand?
Thanks
Assuming rule test gives you an output file named {f1}.{f2}.txt, you need some mechanism that correctly pairs f1 and f2 and creates a list of {f1}.{f2}.txt files.
How you create this list is up to you; expand is just a convenience function for that, and in this case you may want to avoid it.
Here's a super simple example:
import re

fin1 = ['A1$', 'A2£', 'B1€', 'B2¥']
fin2 = ['C1', 'C2']

outfiles = []
for x in fin1:
    for y in fin2:
        ## Here you pair f1 and f2. This is a very trivial way of doing it:
        if y[1] in x:
            outfiles.append('%s.%s.txt' % (x, y))

wildcard_constraints:
    f1 = '|'.join([re.escape(x) for x in fin1]),
    f2 = '|'.join([re.escape(x) for x in fin2]),

rule all:
    input:
        outfiles,

rule test:
    input:
        input_f1 = '{f1}.txt',
        input_f2 = '{f2}.txt',
    output:
        '{f1}.{f2}.txt',
    shell:
        r"""
        cat {input} > {output}
        """
This pipeline will execute the following commands
cat A2£.txt C2.txt > A2£.C2.txt
cat A1$.txt C1.txt > A1$.C1.txt
cat B1€.txt C1.txt > B1€.C1.txt
cat B2¥.txt C2.txt > B2¥.C2.txt
If you touch the starting input files with touch 'A1$.txt' 'A2£.txt' 'B1€.txt' 'B2¥.txt' 'C1.txt' 'C2.txt' you should be able to run this example.
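As an aside, when the paired lists are already aligned index by index, expand accepts zip in place of its default product combinator, which avoids the unwanted combinations directly. A minimal sketch with simplified names:
f1s = ['A1', 'B1']
f2s = ['C1', 'C1']

# zip pairs the lists positionally instead of taking the full product,
# yielding ['A1.C1.txt', 'B1.C1.txt']
outfiles = expand('{f1}.{f2}.txt', zip, f1=f1s, f2=f2s)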

in snakemake, for two inputs, expand pairwise combination of a vector

I am new to Snakemake and have a problem with the expand function.
First, I need to build a group of combinations and use it as a base to expand another vector upon, with pairwise combinations of that vector's elements.
Let's say the set for the pairwise combination is
setC=["A","B","C","D"]
I get the partial group as follows:
part_group1 = expand("TEMPDIR/{setA}_{setB}_", setA = config["setA"], setB = config["setB"])
Then (if that is OK), I would use this partial group to expand another set with its pairwise combinations. But I am not sure how to expand the pairwise combinations of setC, as seen below; it is obviously not correct and just written to clarify the question. Also, how do I pass the name of the expanded estimator to the shell?
rule get_performance:
    input:
        xdata1 = TEMPDIR + part_group1 + "{setC}.rda",
        xdata2 = TEMPDIR + part_group1 + "{setC}.rda",
        estimator1 = {estimator}
    output:
        results = TEMPDIR + "result_" + part_group1 + "{estimator}_{setC}_{setC}.txt"
    params:
        Rfile = FunctionDIR + "function.{estimator}.R"
    shell:
        "Rscript {params.Rfile} {input.xdata1} {input.xdata2} {input.estimator1} "
        "{output.results}"
The expand function will return a list with the product of the variables used. For example, if
setA = ["A", "B"]
setB = ["C", "D"]
then
expand("TEMPDIR/{setA}_{setB}_", setA=setA, setB=setB)
will give you:
["TEMPDIR/A_C_", "TEMPDIR/A_D_", "TEMPDIR/B_C_", "TEMPDIR/B_D_"]
Your question is not very clear on what you want to achieve, but I'll have a guess.
If you want to make pairwise combinations of setC:
import itertools

combiC = list(itertools.combinations(setC, 2))
combiList = list()
for c in combiC:
    combiList.append(c[0] + "_" + c[1])
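For setC = ["A", "B", "C", "D"], this gives combiList = ["A_B", "A_C", "A_D", "B_C", "B_D", "C_D"], i.e. each unordered pair once.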
then you (probably) want the following files:
rule all:
    input: expand(TEMPDIR + "/result_{A}_{B}_estim{estimator}_combi{C}.txt", A=setA, B=setB, estimator=estimators, C=combiList)
I'm putting in words like "estim" and "combi" so as not to confuse the wildcards here. I do not know what the list or set "estimators" is supposed to be, but I suppose you have declared it above.
Then your rule get_performance becomes:
rule get_performance:
    input:
        xdata1 = TEMPDIR + "/{A}_{B}_{firstC}.rda",
        xdata2 = TEMPDIR + "/{A}_{B}_{secondC}.rda"
    output:
        results = TEMPDIR + "/result_{A}_{B}_estim{estimator}_combi{firstC}_{secondC}.txt"
    params:
        Rfile = FunctionDIR + "/function.{estimator}.R"
    shell:
        "Rscript {params.Rfile} {input.xdata1} {input.xdata2} {wildcards.estimator} {output.results}"
Again, this is a guess since you haven't defined all the necessary items.