Why is no AmbiguousRuleException raised with these Snakemake rule and output combinations?

I've run into a few situations where I would expect Snakemake to complain about ambiguous rules, but it doesn't, so I'm trying to figure out exactly what the expected behavior is supposed to be.
For example, with this as rules1.smk:
rule rule1:
    output: "something.txt"

rule rule2:
    output: "{thing}.txt"

rule rule3:
    output: "{thing}.{ext}"
If I request file.txt it complains as expected:
$ snakemake -n --debug-dag -s rules1.smk file.txt
Building DAG of jobs...
candidate job rule2
wildcards: thing=file
candidate job rule3
wildcards: thing=file, ext=txt
AmbiguousRuleException:
Rules rule2 and rule3 are ambiguous for the file file.txt.
...
But if I request something.txt, it goes straight to rule1 and stops at that:
$ snakemake -n --debug-dag -s rules1.smk something.txt
Building DAG of jobs...
candidate job rule1
wildcards:
selected job rule1
...
My question is, why does it do that? Shouldn't it complain that all three of these rules are ambiguous for that output?
My first thought was that rules that yield an output match with no wildcards might implicitly get a higher ruleorder defined than rules that use any number of wildcards, but I can't see anything like that in the documentation for ambiguous rules.
A slightly more complex example shows a little more about the behavior:
if not config.get("noruleorder"):
    ruleorder: rule1 > rule1alt

rule rule1:
    output: "something.txt"

rule rule1alt:
    output: "something.txt"

rule rule2:
    output: "{thing}.txt"

rule rule3:
    output: "{thing}.{ext}"
That works by default, with the ruleorder directive in place:
$ snakemake -n --debug-dag -s rules2.smk something.txt
Building DAG of jobs...
candidate job rule1
wildcards:
selected job rule1
...
And obviously without ruleorder it can't work, since rule1 and rule1alt are as ambiguous as can be:
$ snakemake --config noruleorder=yep -n --debug-dag -s rules2.smk something.txt
Building DAG of jobs...
candidate job rule1alt
wildcards:
candidate job rule1
wildcards:
candidate job rule2
wildcards: thing=something
candidate job rule3
wildcards: thing=something, ext=txt
AmbiguousRuleException:
Rules rule1alt and rule1 are ambiguous for the file something.txt.
...
...but it's interesting that it then considers all the rules I would have thought would be candidates in the first place. I'm just not sure what that says about the candidate job logic. All this seems related to snakemake: Ambiguous rule not detected? but not quite identical.
This is with Snakemake 7.16.0.

Rules without wildcards are implicitly given a higher ruleorder than those with wildcards. This is noted way back in an old changelog for 3.2.2 as "rules without wildcards now beat other rules in case of ambiguity." There's actually a unit test for this that looks almost exactly like what I set up here. I just couldn't find any of this in the docs.
How I found this:
DAG.update loops over each candidate job and breaks out of the loop when it finds a job that compares greater than (>) all the other candidates. Job.__gt__ just calls Rule.__gt__, which calls Ruleorder.compare, which does this:
# if no ruleorder given, prefer rule without wildcards
wildcard_cmp = rule2.has_wildcards() - rule1.has_wildcards()
if wildcard_cmp != 0:
    return wildcard_cmp
If I comment that out, I get the behavior I originally expected:
$ snakemake -n --debug-dag -s rules1.smk something.txt
Building DAG of jobs...
candidate job rule1
wildcards:
candidate job rule2
wildcards: thing=something
candidate job rule3
wildcards: thing=something, ext=txt
AmbiguousRuleException:
Rules rule1 and rule2 are ambiguous for the file something.txt.
The behavior and test were added in this commit. Unless it's already there and I'm just missing it, this should probably get documented in the section about ambiguous rules.
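In the meantime, if you want the preference to be explicit rather than relying on this undocumented implicit behavior, a ruleorder directive spells out the same resolution. A minimal sketch against rules1.smk (my addition, not something Snakemake requires; note that it also resolves the rule2/rule3 ambiguity for files like file.txt, which previously raised):

# An explicit ruleorder is consulted before the implicit
# "rules without wildcards win" fallback, so this pins down the
# same choice for something.txt that Snakemake makes anyway, and
# additionally makes rule2 beat rule3 for targets like file.txt.
ruleorder: rule1 > rule2 > rule3

rule rule1:
    output: "something.txt"

rule rule2:
    output: "{thing}.txt"

rule rule3:
    output: "{thing}.{ext}"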

Related

Snakemake combine ambiguous rules

I am trying to combine some rules. rule1 automatically creates a {sample}_unmapped.bam file, taking information from the library_params.txt file; I cannot specify this file as an output because the program produces it on its own, but I need to use it in rule2. Is there a way to have rule2 wait for rule1 to finish and then run using the output from rule1? Because the error it is giving me now is that the {sample}_unmapped.bam file is missing.
rule rule1:
    input:
        basecalls_dir="/RUN1/Data/Intensities/BaseCalls/",
        barcodes_dir=directory("barcodes"),
        library_params="library_params.txt",
        metrics_file="metrics_output.txt"
    output:
        log="barcodes.log"
    shell:
        """
        java -Djava.io.tmpdir=/path/to/tmp -Xmx2g -jar picard.jar IlluminaBasecallsToSam BASECALLS_DIR={input.basecalls_dir} BARCODES_DIR={input.barcodes_dir} LANE=1 READ_STRUCTURE=151T8B9M8B151T RUN_BARCODE=run1 LIBRARY_PARAMS={input.library_params} MOLECULAR_INDEX_TAG=RX ADAPTERs_TO_CHECK=INDEXED READ_GROUP_ID=BO NUM_PROCESSORS=2 IGNORE_UNEXPECTED_BARCODES=true > {output.log}
        """

rule rule2:
    input:
        log="barcodes.log",
        infile="{sample}_unmapped.bam"
    params:
        ref="ref.fasta"
    output:
        outfile="{sample}.mapped.bam"
    shell:
        """
        java -Djava.io.tmpdir=/path/to/tmp -Xmx2g -jar picard.jar SamToFastq I={input.infile} F=/dev/stdout INTERLEAVE=true | bwa mem -p -t 7 {params.ref} /dev/stdin | java -Djava.io.tmpdir=/path/to/tmp -Xmx4g -jar picard.jar MergeBamAlignment UNMAPPED={input.infile} ALIGNED=/dev/stdin O={output.outfile} R={params.ref} SORT_ORDER=coordinate MAX_GAPS=-1 ORIENTATIONS=FR
        """
In rule2 I would move infile="{sample}_unmapped.bam" from the input directive to the params directive. And of course you would change the shell script from I={input.infile} to I={params.infile}.
rule2 will still wait for rule1 to complete because you give barcodes.log as input to rule2.
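A minimal sketch of that change, reusing the names from the question (the shell command is abbreviated to the parts that matter; note that plain wildcard strings in params are expanded just like in input and output):

rule rule2:
    input:
        log="barcodes.log"  # keeps the dependency on rule1
    params:
        ref="ref.fasta",
        infile="{sample}_unmapped.bam"  # tracked as a parameter, not an input
    output:
        outfile="{sample}.mapped.bam"
    shell:
        """
        java -jar picard.jar SamToFastq I={params.infile} F=/dev/stdout INTERLEAVE=true | bwa mem -p -t 7 {params.ref} /dev/stdin | java -jar picard.jar MergeBamAlignment UNMAPPED={params.infile} ALIGNED=/dev/stdin O={output.outfile} R={params.ref} SORT_ORDER=coordinate MAX_GAPS=-1 ORIENTATIONS=FR
        """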

Snakemake is unable to match wildcard although it's defined and even suggested

I am still very confused about the wildcards concept despite reading the full docs and a few examples, so maybe someone can shed light on this weird behaviour. It might be a bug but it's such a basic example that I am pretty sure I am doing or understanding something wrong.
Here is my Snakefile which should generate a bunch of files defined in a dictionary where the location of the files is stored (those can be served by all kinds of data providers like iRODS, XRootD etc., but it's not important now).
import os

some_files = {
    "foo": "some_location/foo",
    "bar": "another_location/bar",
    "baz": "yet_another_loc/baz"
}

rule all:
    input: ["raw/" + os.path.basename(f) for f in some_files.keys()]

rule generate_files:
    output:
        temp("raw/{fname}")
    shell:
        "echo grabbed file from {some_files[wildcards.fname]} > {output}"
As you can see, I need to use a "trick" similar to the one proposed in my previous question (Array of values as input in Snakemake workflows) to force recognition of the files, by adding a rule that lists them (rule all), which works nicely.
The rule generate_files should then generate (retrieve) those by using the corresponding URL and protocol defined in some_files. For the sake of simplicity, it's now just echoing the origin into the output file.
To achieve this, I thought I could simply use wildcards.fname in the shell section, but when I run the workflow, I get:
$ snakemake -c1
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job             count  min threads  max threads
--------------  -----  -----------  -----------
all                 1            1            1
generate_files      3            1            1
total               4            1            1
Select jobs to execute...
[Fri Feb 18 08:47:38 2022]
rule generate_files:
output: raw/bar
jobid: 2
wildcards: fname=bar
resources: tmpdir=/var/folders/84/mcvklq757tq1nfrkbxvvbq8m0000gn/T
RuleException in line 12 of /Users/tamasgal/Dev/PhD/snakemake/Snakefile:
NameError: The name 'wildcards.fname' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
If I use fname (and not wildcards.fname), Snakemake proposes to use wildcards.fname, which, again, does not work. Here is the output when running with fname:
[Fri Feb 18 08:47:48 2022]
rule generate_files:
output: raw/bar
jobid: 2
wildcards: fname=bar
resources: tmpdir=/var/folders/84/mcvklq757tq1nfrkbxvvbq8m0000gn/T
RuleException in line 12 of /Users/tamasgal/Dev/PhD/snakemake/Snakefile:
NameError: The name 'fname' is unknown in this context. Did you mean 'wildcards.fname'?
Why is this happening? The output of the workflow clearly shows that wildcards: fname=bar, so it exists and is defined. Is this a bug?
Hm, you may have to try and get at some_files[wildcards.fname] outside of the shell part? It looks to me like it can tell what the wildcard is supposed to be for the output to be raw/bar, but it can't handle using it to access the dict in the shell part. It seems to me like this could be handled with an input function.
Off the top of my head:
rule generate_files:
    input:
        some_file = lambda wildcards: some_files[wildcards.fname]
    output:
        temp("raw/{fname}")
    shell:
        "echo grabbed file from {input.some_file} > {output}"
EDIT: if it fails because the file isn't local and Snakemake therefore can't find it, you may supply the path as a parameter instead:
rule generate_files:
    params:
        some_file = lambda wildcards: some_files[wildcards.fname]
    output:
        temp("raw/{fname}")
    shell:
        "echo grabbed file from {params.some_file} > {output}"

Snakemake: MissingInputException with inconsistent naming scheme

I am trying to process MinION cDNA amplicons using Porechop with Minimap2 and I am getting this error.
MissingInputException in line 16 of /home/sean/Desktop/reo/antisera project/20200813/MinIONAmplicon.smk:
Missing input files for rule minimap2:
8413_19_strict/BC01.fastq.g
I understand what the error is telling me; I just don't understand why it isn't trying to run the rule before it. Porechop is used to check for all the possible barcodes and will output more than one fastq file if it finds more than one barcode in the directory. However, since I know which barcode I am looking for, I made a barcodes section in the config.yaml file so I can map them together.
I think the error is happening because my target output for Porechop doesn't match the input for minimap2, but I do not know how to correct this, as there can be multiple outputs from Porechop.
I thought I was building a path for the input file of the minimap2 rule, and that when Snakemake discovered the Porechop output was not there it would make it, but that is not what is happening.
Here is my pipeline so far,
configfile: "config.yaml"

rule all:
    input:
        expand("{sample}.bam", sample = config["samples"])

rule porechop_strict:
    input:
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        directory("{sample}_strict/")
    shell:
        "porechop -i {input} -b {output} --barcode_threshold 85 --threads 8 --require_two_barcodes"

rule minimap2:
    input:
        lambda wildcards: "{sample}_strict/" + config["barcodes"][wildcards.sample]
    output:
        "{sample}.bam"
    shell:
        "minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output}"
and the yaml file
samples: {
    '8413_19': relabeled_reads/8413_19.raw.fastq.gz,
    '8417_19': relabeled_reads/8417_19.raw.fastq.gz,
    '8445_19': relabeled_reads/8445_19.raw.fastq.gz,
    '8466_19_104': relabeled_reads/8466_19_104.raw.fastq.gz,
    '8466_19_105': relabeled_reads/8466_19_105.raw.fastq.gz,
    '8467_20': relabeled_reads/8467_20.raw.fastq.gz,
}

barcodes: {
    '8413_19': BC01.fastq.gz,
    '8417_19': BC02.fastq.gz,
    '8445_19': BC03.fastq.gz,
    '8466_19_104': BC04.fastq.gz,
    '8466_19_105': BC05.fastq.gz,
    '8467_20': BC06.fastq.gz,
}
First of all, you can always debug problems like this by specifying the flag --printshellcmds. That prints every shell command Snakemake runs under the hood; you can then try running them manually and locate the problem.
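For example, a dry run that also prints the shell commands, using one of the sample names from your config (both flags are standard Snakemake options):

snakemake -n -p 8413_19.bam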
As for why your rule doesn't produce any output, my guess is that samtools requires explicit filenames or - to use stdin:
Samtools is designed to work on a stream. It regards an input file '-'
as the standard input (stdin) and an output file '-' as the standard
output (stdout). Several commands can thus be combined with Unix
pipes. Samtools always output warning and error messages to the
standard error output (stderr).
So try that:
shell:
    "minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output} -"
So I am not 100% sure why this way works; I imagine it has to do with the way Snakemake looks at the targets. However, here is the solution I found.
rule minimap2:
    input:
        "{sample}_strict"
    params:
        suffix=lambda wildcards: config["barcodes"][wildcards.sample]
    output:
        "{sample}.bam"
    shell:
        "minimap2 -ax map-ont -t8 ../consensus.fasta "
        "{input}/{params.suffix} | samtools sort -o {output}"
By using the params feature in Snakemake I was able to match up the correct barcode to the sample name. I am not sure why I couldn't just use that as the input itself, but when I changed the input to match the output of the previous rule, it worked.

Workflow always results in "Nothing to do" even when forcing rules

So, as the title says, I can't get my workflow to execute anything except the all rule...
When executing the all rule it correctly finds all the input files, so the config file is okay and every path is correct.
When trying to run without additional flags I get:
Building DAG of jobs...
Checking status of 0 jobs.
Nothing to be done
Things I tried:
-f rcorrector -> only the all rule
filenameR1.fcor_val1.fq -> MissingRuleException (no typos)
--forceall -> only the all rule
Some more fiddling I can't formulate clearly.
Please help!
from os import path

configfile: "config.yaml"

RNA_DIR = config["RAW_RNA_DIR"]
RESULT_DIR = config["OUTPUT_DIR"]
FILES = glob_wildcards(path.join(RNA_DIR, '{sample}R1.fastq.gz')).sample

############################################################################
rule all:
    input:
        r1=expand(path.join(RNA_DIR, '{sample}R1.fastq.gz'), sample=FILES),
        r2=expand(path.join(RNA_DIR, '{sample}R2.fastq.gz'), sample=FILES)

#############################################################################
rule rcorrector:
    input:
        r1=path.join(RNA_DIR, '{sample}R1.fastq.gz'),
        r2=path.join(RNA_DIR, '{sample}R2.fastq.gz')
    output:
        o1=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R1.cor.fq'),
        o2=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R2.cor.fq')
    #group: "cleaning"
    threads: 8
    params: "-t {threads}"
    envmodules:
        "bio/Rcorrector/1.0.4-foss-2019a"
    script:
        "scripts/Rcorrector.py"

############################################################################
rule FilterUncorrectabledPEfastq:
    input:
        r1=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R1.cor.fq'),
        r2=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R2.cor.fq')
    output:
        o1=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R1.fcor.fq"),
        o2=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R2.fcor.fq")
    #group: "cleaning"
    envmodules:
        "bio/Jellyfish/2.2.6-foss-2017a",
        "lang/Python/2.7.13-foss-2017a"
    #TODO: load as module
    script:
        "/scripts/filterUncorrectable.py"

#############################################################################
rule trim_galore:
    input:
        r1=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R1.fcor.fq"),
        r2=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R2.fcor.fq")
    output:
        o1=path.join(RESULT_DIR, "trimmed_reads/{sample}.fcor_val1.fq"),
        o2=path.join(RESULT_DIR, "trimmed_reads/{sample}.fcor_val2.fq")
    threads: 8
    #group: "cleaning"
    envmodules:
        "bio/Trim_Galore/0.6.5-foss-2019a-Python-3.7.4"
    params:
        "--paired --retain_unpaired --phred33 --length 36 -q 5 --stringency 1 -e 0.1 -j {threads}"
    script:
        "scripts/trim_galore.py"
In Snakemake, you define the final output files of the pipeline as targets and list them as inputs of the first rule of the pipeline. This rule is traditionally named all (more recently targets in the Snakemake docs).
In your code, rule all specifies the input files of the pipeline, which already exist, and therefore Snakemake doesn't see anything to do. It instead needs to specify the output files of interest from the pipeline.
rule all:
    input:
        expand(path.join(RESULT_DIR, "trimmed_reads/{sample}.fcor_val{read}.fq"), sample=FILES, read=[1, 2]),
Why your attempted methods didn't work:
-f not working:
As per doc:
--force, -f
Force the execution of the selected target or the first rule regardless of already created output.
Default: False
In your code, this means rule all, which doesn't have output defined, and therefore nothing happened.
filenameR1.fcor_val1.fq
This doesn't match the output of any of the rules, hence the MissingRuleException. (See the example after this list for a target that would match.)
--forceall
Same reasoning as for the -f flag in your case.
--forceall, -F
Force the execution of the selected (or the first) rule and all rules it is dependent on regardless of already created output.
Default: False
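For example, a target that does match the trim_galore output pattern needs the full path under RESULT_DIR, and without the R1 suffix, since the output pattern is {sample}.fcor_val1.fq rather than {sample}R1.fcor_val1.fq. A sketch, assuming OUTPUT_DIR is set to results in config.yaml:

# "filename" stands for a sample stem captured by glob_wildcards
snakemake results/trimmed_reads/filename.fcor_val1.fq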

Snakemake running Subworkflow but not the Rest of my workflow (goes directly to rule All)

I'm a newbie in Snakemake and on StackOverflow. Don't hesitate to tell me if something is unclear or if you want any other detail.
I have written a workflow that converts .BCL Illumina base call files to demultiplexed .FASTQ files and generates QC reports (FastQC files). This workflow is composed of:
Subworkflow "convert_bcl_to_fastq": creates FASTQ files in a directory named Fastq from BCL files. It must be executed before the main workflow, which is why I chose a subworkflow: my second rule depends on these FASTQ files, whose names I don't know in advance. A fake file "convert_bcl_to_fastq.done" is created as an output in order to know when this subworkflow has run as expected.
Rule "generate_fastqc": takes the FASTQ files generated by the subworkflow and creates FastQC files in a directory named FastQC.
Problem
When I try to run my workflow, I don't get any error, but the workflow does not behave as expected: only the subworkflow is run, and then in the main workflow only the rule "all" is executed. My rule "generate_fastqc" is not executed at all. I would like to know where I could possibly have gone wrong.
Here is what I get :
Building DAG of jobs...
Executing subworkflow convert_bcl_to_fastq.
Building DAG of jobs...
Job counts:
    count   jobs
    1       convert_bcl_to_fastq
    1
[...]
Processing completed with 0 errors and 1 warnings.
Touching output file convert_bcl_to_fastq.done.
Finished job 0.
1 of 1 steps (100%) done
Complete log: /path/to/my/working/directory/conversion/.snakemake/log/2020-03-12T171952.799414.snakemake.log
Executing main workflow.
Using shell: /usr/bin/bash
Provided cores: 40
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       all
    1
localrule all:
    input: /path/to/my/working/directory/conversion/convert_bcl_to_fastq.done
    jobid: 0
Finished job 0.
1 of 1 steps (100%) done
And when all of my FASTQ files have been generated, if I run my workflow again, this time it does execute the rule "generate_fastqc":
Building DAG of jobs...
Executing subworkflow convert_bcl_to_fastq.
Building DAG of jobs...
Nothing to be done.
Complete log: /path/to/my/working/directory/conversion/.snakemake/log/2020-03-12T174337.605716.snakemake.log
Executing main workflow.
Using shell: /usr/bin/bash
Provided cores: 40
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       all
    95      generate_fastqc
    96
I wanted my workflow to execute entirely in one go, running rule "generate_fastqc" right after the subworkflow completes, but I am actually forced to execute my workflow twice. I thought this would work, since all the files needed in the second part of the workflow are generated by the subworkflow... Do you have any idea where I could have gone wrong?
My Code
Here is my Snakefile for the main workflow :
subworkflow convert_bcl_to_fastq:
    workdir: WDIR + "conversion/"
    snakefile: WDIR + "conversion/Snakefile"

SAMPLES, = glob_wildcards(FASTQ_DIR + "{sample}_R1_001.fastq.gz")

rule all:
    input:
        convert_bcl_to_fastq("convert_bcl_to_fastq.done"),
        expand(FASTQC_DIR + "{sample}_R1_001_fastqc.html", sample=SAMPLES),
        expand(FASTQC_DIR + "{sample}_R2_001_fastqc.html", sample=SAMPLES)

rule generate_fastqc:
    output:
        FASTQC_DIR + "{sample}_R1_001_fastqc.html",
        FASTQC_DIR + "{sample}_R2_001_fastqc.html",
        temp(FASTQC_DIR + "{sample}_R1_001_fastqc.zip"),
        temp(FASTQC_DIR + "{sample}_R2_001_fastqc.zip")
    shell:
        "mkdir -p "+ FASTQC_DIR +" | "  # Creates a FastQC directory if it is missing
        "fastqc --outdir "+ FASTQC_DIR +" "+ FASTQ_DIR +"{wildcards.sample}_R1_001.fastq.gz "+ FASTQ_DIR +"{wildcards.sample}_R2_001.fastq.gz &"  # Generates FASTQC files for each sample at a time
Here is my Snakefile for the subworkflow "convert_bcl_to_fastq" :
rule all:
    input:
        "convert_bcl_to_fastq.done"

rule convert_bcl_to_fastq:
    output:
        touch("convert_bcl_to_fastq.done")
    shell:
        "mkdir -p "+ FASTQ_DIR +" | "  # Creates a Fastq directory if it is missing
        "bcl2fastq --no-lane-splitting --runfolder-dir "+ INPUT_DIR +" --output-dir "+ FASTQ_DIR  # Demultiplexes and converts BCL files to FASTQ files
Thank you in advance for your help !
The documentation about subworkflows currently states:
When executing, snakemake first tries to create (or update, if necessary)
"test.txt" (and all other possibly mentioned dependencies) by executing the subworkflow.
Then the current workflow is executed.
In your case, the only dependency declared is "convert_bcl_to_fastq.done", which Snakemake happily produces the first time.
Snakemake usually does a one-pass parsing, and the main workflow has not been told to look for sample-files from the subworkflow. Since sample-files do not exist yet during the first execution, the main workflow gets no match in the expand() statements. No match, no work to be done :-)
When you run the main workflow the second time, it finds sample-matches in the expand() of rule all: and produces them.
Side note 1: Be glad you noticed this. Using your code, if you had actually made changes that mandated a re-run of the subworkflow, Snakemake would find an old "convert_bcl_to_fastq.done" and not re-execute the subworkflow.
Side note 2: If you want to make Snakemake less 'one-pass', it has a rule keyword checkpoint that can be used to re-evaluate what needs to be done as a consequence of rule execution. In your case, the checkpoint would be rule convert_bcl_to_fastq. That would require the rules to be in the same logical snakefile (though include permits splitting across multiple files).
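A rough sketch of the checkpoint approach, reusing names from the question (untested; the glob pattern assumes bcl2fastq writes {sample}_R1_001.fastq.gz files directly into the Fastq directory):

checkpoint convert_bcl_to_fastq:
    output:
        directory(FASTQ_DIR)
    shell:
        "bcl2fastq --no-lane-splitting --runfolder-dir " + INPUT_DIR + " --output-dir {output}"

def fastqc_targets(wildcards):
    # Re-evaluated only after the checkpoint has finished, so the glob
    # sees the FASTQ files that bcl2fastq actually produced.
    fastq_dir = checkpoints.convert_bcl_to_fastq.get().output[0]
    samples = glob_wildcards(fastq_dir + "/{sample}_R1_001.fastq.gz").sample
    return expand(FASTQC_DIR + "{sample}_R{read}_001_fastqc.html",
                  sample=samples, read=[1, 2])

rule all:
    input:
        fastqc_targets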