Let's say I have 3 rules with the same input; Snakemake skips 2 of them and only runs one of the rules. Is there a workaround to force all 3 rules to execute, since I need all 3 of them? I could add some other files to the existing input, but that feels like a cheat and would probably confuse other people looking at my code, since I would be declaring an input that is not used at all.
It appears that target files were not defined. By default, Snakemake executes the first rule in the Snakefile.
Example:
rule all:
    input: "a.txt", "b.png"

rule x:
    output: "a.txt"
    shell: "touch {output}"

rule y:
    output: "b.png"
    shell: "touch {output}"
It is customary to name the first rule all and have it list all the desired output files.
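Applied to the question above: give each of the three rules its own distinct output file and list all three outputs in rule all. The shared input is simply declared in each rule, and no dummy inputs are needed. A minimal sketch, with made-up rule and file names:

rule all:
    input: "out1.txt", "out2.txt", "out3.txt"

# All three rules consume the same input but produce distinct outputs,
# so Snakemake runs each of them to satisfy rule all.
rule step1:
    input: "shared.txt"
    output: "out1.txt"
    shell: "touch {output}"

rule step2:
    input: "shared.txt"
    output: "out2.txt"
    shell: "touch {output}"

rule step3:
    input: "shared.txt"
    output: "out3.txt"
    shell: "touch {output}"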
Related
In this Snakemake script the rule all defines a target, and there are three other rules that claim this target as an output:
rule all:
    input:
        "target.txt"

rule from_non_existing_file:
    input:
        "non_existing_file.txt"
    output:
        "target.txt"

rule broad_input:
    output:
        "target.txt"

rule narrow_input:
    input:
        "optional_input.txt"
    output:
        "target.txt"

ruleorder: narrow_input > broad_input
The file non_existing_file.txt doesn't exist, so the rule from_non_existing_file should not be considered by Snakemake. The rule broad_input has no input files, so it can always produce the output, and the rule narrow_input can produce the output whenever the file optional_input.txt exists. To resolve the ambiguity between the broad and narrow inputs, the ruleorder is defined.
Whenever the file optional_input.txt exists, the script prefers the rule narrow_input:
Job counts:
    count    jobs
    1        all
    1        narrow_input
    2
This script works most of the time, but sometimes it fails:
AmbiguousRuleException:
Rules narrow_input and broad_input are ambiguous for the file target.txt.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
    narrow_input:
    broad_input:
Expected input files:
    narrow_input: optional_input.txt
    broad_input:
Expected output files:
    narrow_input: target.txt
    broad_input: target.txt
Here Snakemake ignores the fact that the ruleorder directive is defined and advises defining it again.
To confirm this behavior I've designed the test script below:
import os

def test_snakemake():
    for i in range(100):
        rcode = os.system("snakemake --cores=1 --printshellcmds --forceall --dry-run")
        assert rcode == 0
With high confidence, this test fails within the first 20 iterations.
I've conducted some experiments and got surprising results:
The test passes if optional_input.txt doesn't exist
The test passes if any of the three rules is removed
This problem is confirmed on two different Windows machines with Snakemake versions 5.7.4 and 6.5.3.
My question is whether this is a Snakemake bug. Is there another explanation for this behavior?
I have a workflow that, very simplified for this question, looks as follows:
rule all:
    input: multiext("final", ".a", ".b", ".c", ".d")

rule final_cheap:
    input: "intermediary.{ext}"
    output: "final.{ext}"
    # dummy for a cheap but complicated operation
    shell: "cp {input} {output}"

rule intermediary_cheap:
    input: "start.{ext}"
    output: "intermediary.{ext}"
    # dummy for a cheap but complicated operation
    shell: "cp {input} {output}"

rule start_expensive:
    output: "start.{ext}"
    # dummy for a very expensive operation
    shell: "touch {output}"
There's a very expensive first step, followed by two cheap but complicated steps.
After I've run this workflow once with snakemake -c1, I want to rerun the workflow, but only from the intermediary rule onwards. How can I achieve this goal with command line flags?
snakemake intermediary_cheap all does not work, because intermediary_cheap contains wildcards, even though the inclusion of all actually determines the values of the required wildcards.
Is there a command line flag that tells snakemake to run the rule and ignore all existing output from the rule intermediary_cheap, something like snakemake all --forcerule=intermediary_cheap? (I invented that --forcerule flag; it doesn't exist as far as I know.)
The workaround I'm using right now is manually deleting the output of the rule intermediary_cheap, then forcing execution of that rule with --force, and then running rule all, which notices that some upstream inputs have changed. But this requires knowing the precise file names that are produced, whereas knowledge of rules only would be preferable, because it is at a higher level of abstraction.
I haven't used it before but I think you want:
snakemake -c 1 --forcerun intermediary_cheap
--forcerun [TARGET [TARGET ...]], -R [TARGET [TARGET ...]]
    Force the re-execution or creation of the given rules
    or files. Use this option if you changed a rule and
    want to have all its output in your workflow updated.
    (default: None)
I've read the docs on rules, the FAQ, and this question, but I still can't tell: if a wildcard foo is defined in rule bar, can its values be accessed in rule baz?
If I understood your question correctly, the answer should be "no".
By using a wildcard in a rule you just define a pattern that can be applied to many different files. For example, this rule defines a way to produce files whose names match the pattern "new_name{n}.txt", where {n} can be any string:
rule example:
    input: "old_name{n}.txt"
    output: "new_name{n}.txt"
    shell: "cp {input} {output}"
Of course, this rule will only be considered if the file "old_name{n}.txt" exists for the same string {n}.
Returning to your question: how could you access the value if this is just a pattern that may be applied to different values?
Another possible interpretation of your question is that you need to know all the values (all the inputs) the rule bar was applied to. In this case you probably need to employ a checkpoint: this is the way to delay the pattern application until part of the pipeline has finished. But even in this case you wouldn't be accessing "the values of the wildcard" explicitly.
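For illustration, here is a minimal sketch of the checkpoint pattern; the names (clustering, aggregate, the .txt files) are made up. The input function of aggregate is only evaluated after the checkpoint has finished, so at that point the files (and hence wildcard values) it produced are known:

import os

rule all:
    input: "aggregated.txt"

# Hypothetical checkpoint producing a directory with a set of files
# that is unknown before it runs.
checkpoint clustering:
    output:
        directory("clustered")
    shell:
        "mkdir -p {output} && touch {output}/a.txt {output}/b.txt"

def aggregate_input(wildcards):
    # Evaluated only once the checkpoint has finished, so we can
    # list the files it actually produced.
    ckpt_dir = checkpoints.clustering.get().output[0]
    ids = glob_wildcards(os.path.join(ckpt_dir, "{i}.txt")).i
    return expand(os.path.join(ckpt_dir, "{i}.txt"), i=ids)

rule aggregate:
    input:
        aggregate_input
    output:
        "aggregated.txt"
    shell:
        "cat {input} > {output}"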
I'm not sure I'm answering your question and what follows may not be entirely correct... Snakemake only cares that you have one and only one path that leads to the requested files (i.e., the files defined in the first rule, usually called all).
If rule bar defines wildcards that can lead to the final output, then yes, those wildcards are visible to the following rules.
In the script below we ask for files A.txt and B.txt. To produce A.txt we don't need any wildcard. To produce B.txt we need to go through the wildcard {wc}, defined in rule bar and used in rule B. Note that the wildcard {sample} doesn't appear at all outside rule all. Note also that rule bar produces two files, B.tmp and C.tmp, but rule B only needs B.tmp. (You should be able to dry-run this script with snakemake -p -n.)
rule all:
    input:
        expand('{sample}.txt', sample=['A', 'B']),

rule A:
    output:
        'A.txt',
    shell:
        "touch {output}"

rule bar:
    output:
        expand('{wc}.tmp', wc=['B', 'C'])
    shell:
        r"""
        touch {output}
        """

rule B:
    input:
        '{wc}.tmp',
    output:
        '{wc}.txt',
    shell:
        r"""
        touch {input} {output}
        """
I have a Snakemake rule that works on a data archive and essentially unpacks the data in it. The archives contain a varying number of files that I know before my rule starts, so I would like to exploit this and do something like
rule unpack:
    input: '{id}.archive'
    output:
        lambda wildcards: ARCHIVE_CONTENTS[wildcards.id]
but I can't use functions in output, and for good reason. However, I can't come up with a good replacement. The rule is very expensive to run, so I cannot do
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
and run the rule several times for each archive. Another alternative could be
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
    run:
        if os.path.isfile(output[0]):
            return
        ...
but I am afraid that would introduce a race condition.
Is marking the rule output with dynamic really the only option? I would be fine with auto-generating a separate rule for every archive, but I haven't found a way to do so.
Here it comes in handy that Snakemake is an extension of plain Python. You can generate a separate rule for each archive:
for id, contents in ARCHIVE_CONTENTS.items():
    rule:
        input:
            '{id}.tar.gz'.format(id=id)
        output:
            expand('{id}/{outfile}', id=id, outfile=contents)
        params:
            id=id  # this concrete rule has no wildcards, so pass the archive id via params
        shell:
            'tar -C {params.id} -xf {input}'
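For completeness, ARCHIVE_CONTENTS is assumed here to be a plain Python dict, defined before the loop, that maps each archive id to the files it contains, together with a rule all that collects all expected files (the ids and contents below are invented):

ARCHIVE_CONTENTS = {
    'data1': ['a.txt', 'b.txt'],
    'data2': ['c.txt'],
}

# Declared before the generated rules so it stays the default target.
rule all:
    input:
        [f'{id}/{outfile}'
         for id, contents in ARCHIVE_CONTENTS.items()
         for outfile in contents]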
Depending on what kind of archive this is, you could also have a single rule that just extracts the desired file, e.g.:
rule unpack:
    input:
        '{id}.tar.gz'
    output:
        '{id}/{outfile}'
    shell:
        'tar -C {wildcards.id} -xf {input} {wildcards.outfile}'
I'm new to snakemake and I've been trying to transform my shell-script-based pipeline into a Snakemake-based one today, and I've run into a lot of syntax issues. I think most of my trouble is around getting all the files in particular directories and inferring output names from input names, since that's how I work in shell scripts (with for loops). In particular, I tried to use the expand function in the output section and it always gave me an error.
After checking some example Snakefiles, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used, and if so, why? What if I want to pass a prefix defined in the config.yaml file as part of the output file, and that prefix cannot be inferred from input file names? How can I achieve that, just like what I did below for the log section, where {runid} is my prefix?
Second question, about syntax: I tried to pass a user-defined id from the configuration file (config.yaml) into the log section, and it seems that I have to use expand in the following form. Is there a better way of passing strings defined in the config.yaml file?
log:
    expand("fastq/fastqc/{runid}_fastqc_log.txt", runid=config["run"])
where in the config.yaml:
run:
    "run123"
Third question: I initially tried the following two methods, but they gave me errors. Does this mean that inside the log (and probably the input and output) sections, Python syntax is not followed?
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"

log:
    "fastq/fastqc/{config["run"]}_fastqc_log.txt"
Here is an example of a small workflow:
# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]

rule all:
    input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)

rule peak_calling:
    input: control="{control}.sam", treatment="{treatment}.sam"
    output: "{treatment}_vs_{control}.bed"
    shell: "touch {output}"

rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    shell: "cp {input} {output}"
I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".
As for the last part, the right way to put it would be the first one:
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
But again, snakemake can deduce it from your target (the rule all, in my example).
rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    log: "{samples}.log"
    shell: "cp {input} {output}"
Hope this helps!
You can use f-strings:
If this is your folder_with_configs/some_config.yaml:
var: value
Then simply
configfile:
    "folder_with_configs/some_config.yaml"

rule one_to_rule_all:
    output:
        f"results/{config['var']}.file"
    shell:
        "touch {output[0]}"
Do remember the Python rules about nesting different types of quotation marks.
config in the Snakemake rule is a simple Python dictionary.
If you need to keep additional wildcards in the path, e.g. some_param, escape them with double curly brackets.
rule one_to_rule_all:
    output:
        f"results/{config['var']}.{{some_param}}"
    shell:
        "touch {output[0]}"
Enjoy!