I have a workflow with a subworkflow. The subworkflow's DAG takes a long time to generate. Is there a way to tell Snakemake that the subworkflow is up to date and avoid the long evaluation of its DAG?

I'm not sure that a great way to accomplish this exists, but the following can get the job done, even if it's a bit of a hack.
Suppose your workflow depends on a subworkflow defined by sw/Snakefile, which produces the file sw/test.txt that your main workflow depends on. Then you can take advantage of conditional Snakemake blocks and use the subworkflow to generate sw/test.txt only when this file doesn't already exist:
import os
if os.path.exists("sw/test.txt"):
    rule result:
else:
    subworkflow sw:
    rule result:
In this way, the subworkflow DAG is only evaluated when sw/test.txt does not already exist. Of course, this also means that you'll have to explicitly rm sw/test.txt whenever you need the subworkflow to update it.


AmbiguousRuleException when there is no ambiguity

In this Snakemake script the rule all defines a target, and there are three other rules that claim this target as an output:
rule all:
rule from_non_existing_file:
rule broad_input:
rule narrow_input:
ruleorder: narrow_input > broad_input
The file non_existing_file.txt doesn't exist, so Snakemake should not consider the rule from_non_existing_file. The rule broad_input has no input files, so it can always produce the output, and the rule narrow_input can produce the output whenever the file optional_input.txt exists. The ruleorder directive is defined to resolve the ambiguity between the broad and narrow inputs.
Whenever the file optional_input.txt exists, the script prefers the rule narrow_input:
Job counts:
    count    jobs
    1        all
    1        narrow_input
This script works most of the time, but sometimes it fails:
Rules narrow_input and broad_input are ambiguous for the file target.txt.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Expected input files:
    narrow_input: optional_input.txt
    broad_input:
Expected output files:
    narrow_input: target.txt
    broad_input: target.txt
Here Snakemake ignores the fact that the ruleorder directive is already defined, and advises me to define it.
To confirm this behavior I've designed the test script below:
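import os

def test_snakemake():
    # Repeat the dry run many times because the failure is intermittent.
    for i in range(100):
        rcode = os.system("snakemake --cores=1 --printshellcmds --forceall --dry-run")
        # A non-zero exit status means Snakemake reported an error.
        assert rcode == 0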
This test fails within the first 20 iterations with high confidence.
I've conducted some experiments and got surprising results:
The test passes if optional_input.txt doesn't exist
The test passes if any of the three rules is removed
This problem is confirmed on two different Windows machines with Snakemake versions 5.7.4 and 6.5.3.
My question is whether that is a Snakemake bug. Is there another explanation of this behavior?

snakemake "rule all" is rerunning unnecessary rules

I'm running into a case where snakemake is rerunning rules that have already been run, even though the output from those rules is still present. I am specifying all the desired output files in a "rule all". I ran the pipeline to the point where I had all the desired outputs from "rule B", and wanted to restart the pipeline and just run rule A. But snakemake reruns "rule B" even though all the outputs from "rule B" are already present. This isn't the behavior I expect from snakemake, which should only rerun the rules necessary to get to a target (here specified in the rule all).
When I run snakemake in a dry-run mode, this is at the end of the output.
Job counts:
    count    jobs
    1        count_matrix
    27       picard_fastq2sam
    27       star_align
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
The output from picard_fastq2sam is used by star_align, and all the outputs from star_align are used for count_matrix. So there should only be one rule to run, because the outputs from "picard_fastq2sam" and "star_align" are all already present. My rule all looks like this:
This workflow is based on the template from, but I've modified it enough that I thought I should post here.
snakemake version is 6.0.5
Any hints on where to look? This is really the opposite of the behavior I expect from snakemake.
When I looked further into the debug-dag output, I saw a bunch of these blocks of text:
candidate job star_align
    wildcards: sample=8R_S14, unit=lane1
candidate job picard_fastq2sam
    wildcards: sample=8R_S14, unit=lane1
selected job picard_fastq2sam
    wildcards: sample=8R_S14, unit=lane1
file results/picard_fastq2sam/8R_S14-lane1.unaligned.bam:
    Producer found, hence exceptions are ignored.
selected job star_align
    wildcards: sample=8R_S14, unit=lane1
file results/star/
    Producer found, hence exceptions are ignored.
I'm not sure what this means, but it seems relevant.
I had forgotten that snakemake checks the timestamps of files in addition to whether they are present. I figured this out when I found this post: How to not rerule updated files with Snakemake. In my case, I fixed it by rerunning with "--touch" and "--forceall" to update the timestamps of all the output files.
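The timestamp behavior described above can be illustrated with a small Python sketch. This is only a simplified, make-style model of a freshness check, not Snakemake's actual implementation, and the helper name is_outdated is made up for illustration.

```python
import os


def is_outdated(output, inputs):
    """Return True if `output` is missing or older than any file in `inputs`.

    Simplified model only; `is_outdated` is a hypothetical helper and not
    part of Snakemake.
    """
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    # The output counts as stale as soon as one input is newer than it.
    return any(os.path.getmtime(path) > out_mtime for path in inputs)
```

In this model, an output that exists can still be considered stale when one of its inputs has a newer modification time, which is why refreshing the timestamps of the outputs makes the unwanted reruns disappear.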

How to run rule even when some of its inputs are missing?

In the first step of my process, I am extracting some hourly data from a database. For various reasons, data is sometimes missing for some hours, resulting in missing files. As long as the number of missing files is not too large, I still want to run some of the rules that depend on that data. When running those rules I will check how much data is missing and then decide whether I want to generate an error.
An example below. The Snakefile:
rule parse_data:
"data/1.csv", "data/2.csv", "data/3.csv", "data/4.csv"
"touch {output}"
rule get_data:
"Rscript get_data.R {output}"
And my get_data.R script:
output <- commandArgs(trailingOnly = TRUE)[1]
if (output == "data/1.csv")
  stop("Some error")
writeLines("foo", output)
How do I force the rule parse_data to run even when some of its inputs are missing? I do not want to force any other rules to run when input is missing.
One possible solution would be to generate, for example, an empty file in get_data.R when the query failed. However, in practice I am also using --restart-times 5 when running snakemake as the query can also fail because of database timeouts. When creating an empty file this mechanism of retrying the queries would no longer work.
You need data-dependent conditional execution.
Use a checkpoint on get_data. Then replace parse_data's input with a function that aggregates whatever files do exist.
(note that I am a Snakemake newbie and am just learning this myself, I hope this is helpful)
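The aggregation step mentioned above could look roughly like the following plain Python function. It is a sketch under assumptions: the names existing_inputs and max_missing are made up for illustration, and the wiring into a checkpoint and an input function in the Snakefile is not shown.

```python
import os


def existing_inputs(candidates, max_missing):
    """Return the candidate files that exist on disk, in their given order.

    Raises ValueError when more than `max_missing` candidates are absent.
    `existing_inputs` and `max_missing` are hypothetical names.
    """
    present = [path for path in candidates if os.path.exists(path)]
    missing = len(candidates) - len(present)
    if missing > max_missing:
        raise ValueError(
            f"{missing} input files are missing (allowed: {max_missing})"
        )
    return present
```

Putting the threshold check into the function keeps the decision about how much missing data is acceptable in one place, instead of spreading it over the rules that consume the data.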

Snakemake: The use of --debug-dag for detecting cyclic dependencies

I am using snakemake in a workflow for NGS analyses.
In one rule, I make use of the unique (temporary) output from another rule. The output of this one rule is also unique and contributes to the creation of the final output. A simple wildcard {sample} is used across these rules. I do not see any cyclic dependency, but snakemake tells me there is:
CyclicGraphException in line xxx of Snakefile: Cyclic dependency on rule
I understand that there is an option to investigate this problem: --debug-dag.
How do I interpret the output? What is candidate versus selected?
This is my (pseudo-) code of the rule:
rule split_fasta:
threads: 4
python bin/ {input.dataFile} {input.scaffolds} {input.database} {output.onefasta} {output.twofasta} {output.threefasta}
There is no other connection between input and output than in this rule.
The problem is solved now, further downstream and upstream some subtle dependencies were present.
But for future reference, I would like to know how to interpret the output of the --debug-dag option.
--debug-dag Print candidate and selected jobs (including their wildcards) while inferring DAG. This can help to debug unexpected DAG topology or errors.
It does not seem to have further documentation than this, but I believe the candidate jobs are the jobs whose output patterns can be matched to the requested file through wildcards. The selected job is the one that is chosen from the candidates (either through wildcard constraints, ruleorder, or, with the option --allow-ambiguity, the first candidate).
As an example I have a rule that does adapter trimming, and I have a rule for both paired end and single end:
rule trim_SE:
rule trim_PE:
If I now tell snakemake to generate the output exp_R1_trimmed.fastq.gz it complains that it can use either rule.
Rules trim_PE and trim_SE are ambiguous for the file exp_R1_trimmed.fastq.gz.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
trim_PE: sample=exp
trim_SE: sample=exp_R1
We can solve this problem by, for instance, adding a ruleorder:
ruleorder: trim_PE > trim_SE
And the file gets generated as we want. If we now use the --debug-dag option we get two candidate rules, and one selected rule (based on our ruleorder).
candidate job trim_PE
    wildcards: sample=exp
candidate job trim_SE
    wildcards: sample=exp_R1
selected job trim_PE
    wildcards: sample=exp
If the rules trim_PE and trim_SE depended on other rules, we could use the --debug-dag option to detect in which rule the wildcard expansion goes wrong, instead of only getting an error in the rule where the failure finally surfaces.
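The candidate and selected terminology can be mimicked in a few lines of Python. This is a toy model, not Snakemake's code: the two output patterns are assumptions (the rule bodies are not shown above), each wildcard is taken to match the regular expression .+, and the function names are made up.

```python
import re

# Assumed output patterns; the real rule bodies are not shown in the answer.
OUTPUT_PATTERNS = {
    "trim_PE": "{sample}_R1_trimmed.fastq.gz",
    "trim_SE": "{sample}_trimmed.fastq.gz",
}


def pattern_to_regex(pattern):
    """Compile an output pattern with {name} wildcards into a regex."""
    parts = re.split(r"\{(\w+)\}", pattern)
    # Even indices hold literal text, odd indices hold wildcard names.
    regex = "".join(
        f"(?P<{part}>.+)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    return re.compile(regex)


def candidate_jobs(target):
    """Return {rule: wildcards} for every rule whose pattern matches target."""
    candidates = {}
    for rule, pattern in OUTPUT_PATTERNS.items():
        match = pattern_to_regex(pattern).fullmatch(target)
        if match:
            candidates[rule] = match.groupdict()
    return candidates


def select_job(candidates, ruleorder):
    """Pick the first rule listed in `ruleorder` that is a candidate."""
    for rule in ruleorder:
        if rule in candidates:
            return rule
    raise LookupError("no candidate rule")
```

Calling candidate_jobs("exp_R1_trimmed.fastq.gz") yields both rules, with sample=exp for trim_PE and sample=exp_R1 for trim_SE, and select_job with the order trim_PE, trim_SE picks trim_PE, mirroring the candidate and selected lines in the debug output.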

How can Snakemake be made to update files in a hierarchical rule-based manner when a new file appears at the bottom of the hierarchy?

I have a snakefile with dozens of rules, and it processes thousands of files. This is a bioinformatics pipeline for DNA sequencing analysis. Today I added two more samples to my set of samples, and I expected to be able to run snakemake and it would automatically determine which rules to run on which files to process the new sample files and all files that depend on them on up the hierarchy to the very top level. However, it does nothing. And the -R option doesn't do it either.
The problem is illustrated with this snakefile:
> cat tst
rule A:
    output: "test1.txt"
    input: "test2.txt"
    shell: "cp {input} {output}"
rule B:
    output: "test2.txt"
    input: "test3.txt"
    shell: "cp {input} {output}"
rule C:
    output: "test3.txt"
    input: "test4.txt"
    shell: "cp {input} {output}"
rule D:
    output: "test4.txt"
    input: "test5.txt"
    shell: "cp {input} {output}"
Execute it as follows:
> rm test*.txt
> touch test2.txt
> touch test1.txt
> snakemake -s tst -F
Output is:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count    jobs
    1        A
rule A:
    input: test2.txt
    output: test1.txt
    jobid: 0
Finished job 0.
1 of 1 steps (100%) done
Since test5.txt does not exist, I expected an error message to that effect, but it did not happen. And of course, test3.txt and test4.txt do not exist.
Furthermore, using -R instead of -F results in "Nothing to be done."
Using "-R A" runs rule A only.
This relates to my project in that it shows that Snakemake does not analyze the entire dependent tree if you tell it to build a rule at the top of the tree and that rule's output and input files already exist. And the -R option does not force it either. When I tried -F on my project, it started rebuilding the entire thing, including files that did not need to be rebuilt.
It seems to me that this is fundamental to what Snakemake should be doing, and I just don't understand it. The only way I can see to get my pipeline to analyze the new samples is to individually invoke each rule required for the new files, in order. And that is way too tedious and is one reason why I used Snakemake in the first place.
Snakemake does not automatically trigger re-runs when adding new input files (e.g. samples) to the DAG. However, you can enforce this as outlined in the FAQ.
The reason for not doing this by default is mostly consistency: in order to do this, Snakemake needs to store meta information. Hence, if the meta information is lost, you would have a different behavior than if it was there.
However, I might change this in the future. With such fundamental changes though, I am usually very careful in order to not forget a counter example where the current default behavior is of advantage.
Remember that snakemake wants to satisfy the dependency of the first rule and builds the graph by pulling additional dependencies through the rest of the graph to satisfy that initial dependency. By touching test2.txt you've satisfied the dependency for the first rule, so nothing more needs to be done. Even with -R A nothing else needs to be run to satisfy the dependency of rule A - the files already exist.
Snakemake definitely does do what you want (add new samples and the entire rule graph runs on those samples) and you don't need to individually invoke each rule, but it seems to me that you might be thinking of the dependencies wrong. I'm not sure I fully understand where your new samples fit into the tst example you've given, but I see at least two possibilities.
Your graph dependency runs D->C->B->A, so if you're thinking that you've added new input data at the top (i.e. a new sample as test5.txt in rule D), then you need to be sure that you have a dependency at your endpoint (test2.txt in rule A). By touching test2.txt you've just completed your pipeline, so no dependencies exist. If you touch test5.txt instead (that's your new data), then your example works and the entire graph runs.
Since you touched test1.txt and test2.txt in your example execution, maybe you intended those to represent the new samples. If so, then you need to rethink your dependency graph, because adding them doesn't create a dependency on the rest of the graph. In your example, the test2.txt file is your terminal dependency (the final dependency of your workflow, not the input to it). In your tst example, new data needs to come in as test5.txt, as input to rule D (the top of your graph), and get pulled through the dependency graph to satisfy an input dependency of rule A, which is test2.txt. If you're thinking of either test1.txt or test2.txt as your new input, then you need to remember that snakemake pulls data through the graph to satisfy dependencies at the bottom of the graph; it doesn't push data from the top down. Run snakemake -F --rulegraph to see that the graph runs D->C->B->A, and so new data needs to come in as input to rule D and be pulled through the graph as a dependency of rule A.
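The pull-based resolution described here can be sketched as a toy planner in Python. It only models the four rules of the tst example and whether files exist; it ignores timestamps and the -F and -R flags, so it is an illustration of the idea rather than of Snakemake's real DAG logic. The names RULES and plan are made up.

```python
# Output file -> (rule name, input file) for the four rules in the tst example.
RULES = {
    "test1.txt": ("A", "test2.txt"),
    "test2.txt": ("B", "test3.txt"),
    "test3.txt": ("C", "test4.txt"),
    "test4.txt": ("D", "test5.txt"),
}


def plan(target, existing):
    """Return the rules to run, in execution order, so that `target` exists."""
    if target in existing:
        # The dependency is already satisfied, so nothing further is pulled.
        return []
    if target not in RULES:
        raise FileNotFoundError(
            f"missing input with no producing rule: {target}"
        )
    rule, rule_input = RULES[target]
    # First satisfy the rule's own input, then run the rule itself.
    return plan(rule_input, existing) + [rule]
```

With only test2.txt present, plan("test1.txt", {"test2.txt"}) stops at rule A, because A's input is already there. With only test5.txt present, the same call walks the whole chain and returns D, C, B, A.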