snakemake wildcard scoping between rules

I've read the docs on rules, the FAQ, and this question, but I still can't tell: if a wildcard foo is defined in rule bar, can its values be accessed in rule baz?

If I understood your question correctly, the answer should be "no".
By using a wildcard in a rule you just define a pattern that can be applied to many different files. For example, this rule defines a way to produce files whose names match the pattern "new_name{n}.txt", where {n} can be any string:
rule example:
    input: "old_name{n}.txt"
    output: "new_name{n}.txt"
    shell: "cp {input} {output}"
This rule will only be applied if a file "old_name{n}.txt" actually exists, with the same string substituted for {n}.
Returning to your question: how could you access "the value" when this is just a pattern that may be applied to many different values?
Another possible interpretation of your question is that you need to know all the values (all the inputs) that rule bar was applied to. In that case you probably need a checkpoint: this is the mechanism for delaying the pattern matching until part of the pipeline has finished. But even then you wouldn't be accessing "the values of the wildcard" explicitly.
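To make it concrete why a wildcard has no single "value" outside a job: Snakemake infers wildcard values per target file by matching the output pattern. The following plain-Python sketch (illustrative only, not Snakemake's actual implementation) shows that each concrete file yields its own values:

```python
import re

def pattern_to_regex(pattern):
    # Split on {name} placeholders; even indices are literal text,
    # odd indices are wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    return "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
        for i, p in enumerate(parts)
    )

def infer_wildcards(pattern, filename):
    # Each concrete target yields its own wildcard values.
    m = re.fullmatch(pattern_to_regex(pattern), filename)
    return m.groupdict() if m else None

infer_wildcards("new_name{n}.txt", "new_name42.txt")  # {'n': '42'}
infer_wildcards("new_name{n}.txt", "new_nameX.txt")   # {'n': 'X'}
```

So "the value of {n}" only exists within a single job; two requested files give two independent values.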

I'm not sure I'm answering your question, and what follows may not be entirely correct... Snakemake only cares that you have one and only one path that leads to the requested files (i.e., the files defined in the first rule, usually called all).
If rule bar defines wildcards that can lead to the final output, then yes, those wildcards are visible to the following rules.
In the script below we ask for files A.txt and B.txt. To produce A.txt we don't need any wildcard. To produce B.txt we need to go through the wildcard {wc}, defined in rule bar and used in rule B. Note that the wildcard {sample} doesn't appear at all outside rule all. Note also that rule bar produces two files, B.tmp and C.tmp, but rule B only needs B.tmp. (You should be able to dry-run this script with snakemake -p -n.)
rule all:
    input:
        expand('{sample}.txt', sample=['A', 'B']),

rule A:
    output:
        'A.txt',
    shell:
        "touch {output}"

rule bar:
    output:
        expand('{wc}.tmp', wc=['B', 'C'])
    shell:
        r"""
        touch {output}
        """

rule B:
    input:
        '{wc}.tmp',
    output:
        '{wc}.txt',
    shell:
        r"""
        touch {input} {output}
        """

Related

Possible to have wildcard as dictionary key in snakemake rule output?

I need my output to be dynamic with respect to the input; the best way I could think of is to have the output based on a dictionary. Here is a stripped-down example:
config.yaml:
{'names' : ['foo', 'bar']}
Snakefile:
configfile: "config.yaml"

rule all:
    input:
        expand("{name}", name=config['names'])

rule make_file:
    output:
        lambda wildcards: config[wildcards.name]
    shell:
        """
        touch {output}
        """
I get the error Only input files can be specified as functions.
I also tried making the output of rule make_file config["{name}"].
Snakemake has the opposite logic: it is the output that is used to define the actual wildcard values, which are then used in the other sections.
If you need the output to depend on the input, you may define several rules, each one defining one possible pattern for how an input is converted to an output. There are other possibilities for adding dynamic behavior, like checkpoints and dynamic files. Anyway, there is no one-size-fits-all solution for every problem, so you need to clarify what you are trying to achieve.
When designing a Snakemake pipeline you should think in terms of "what target is my goal", not in terms of "what could I make out of what I have".
Maybe you are making things more complicated than necessary or you simplified the example too much (or I'm misunderstanding the question...).
This will create files foo and bar by running rule make_file on each item in config['names']:
rule all:
    input:
        expand("{name}", name=config['names'])

rule make_file:
    output:
        '{name}'
    shell:
        """
        touch {output}
        """
I need my output to be dynamic to input
I'll add that you can use multiple rules to have snakemake select what to run based on the input to generate a single output file. This may be what you are getting at, but do provide more details if not:
ruleorder: rule1 > rule2 > rule3

rule rule1:
    output: 'foo'
    input: 'bar'
    shell: 'echo rule1 > {output}'

rule rule2:
    output: 'foo'
    input: 'baz'
    shell: 'echo rule2 > {output}'

rule rule3:
    output: 'foo'
    shell: 'echo rule3 > {output}'
The rule order resolves the ambiguity if bar and baz are both present. If neither is present, then rule3 runs as a default (with no input needed).
I've used something similar in cases where say a fasta file can be generated from a vcf, a bam file, or downloaded as a fallback.
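The fallback logic those three rules implement can be sketched in plain Python (illustrative only; rule1/rule2/rule3 and their inputs are taken from the example above):

```python
# Mimic "prefer rule1 over rule2 over rule3", where a rule is usable
# when all of its inputs already exist. Rules are listed in priority
# order, just as in the ruleorder directive.
RULES = [("rule1", {"bar"}), ("rule2", {"baz"}), ("rule3", set())]

def choose_rule(available_files):
    for name, inputs in RULES:
        if inputs <= set(available_files):
            return name

choose_rule({"bar", "baz"})  # 'rule1' wins via rule order
choose_rule({"baz"})         # 'rule2'
choose_rule(set())           # 'rule3', the input-free fallback
```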

Snakemake skips multiple of the rules with same input

Let's say I have 3 rules with the same input; snakemake skips 2 of them and only runs one of the rules. Is there a workaround to force all 3 rules to execute, since I need all 3 of them? I could add some other files to the existing input, but that feels like cheating and would probably confuse other people looking at my code, since I would be declaring an input that is not used at all.
It appears target files were not defined. By default, snakemake executes the first rule in the snakefile.
Example:
rule all:
    input: "a.txt", "b.png"

rule x:
    output: "a.txt"
    shell: "touch {output}"

rule y:
    output: "b.png"
    shell: "touch {output}"
It is customary to name the first rule all which has all the desired output files.

How to get a rule that would work the same on a directory and its sub-directories

I am trying to make a rule that would work the same on a directory and any of its sub-directories (to avoid having to repeat the rule several times). I would like to have access to the name of the subdirectory if there is one.
My approach was to make the sub-directory optional. Given that wildcards can be made to accept an empty string by explicitly giving the ".*" pattern, I therefore tried the following rule:
rule test_optional_sub_dir:
    input:
        "{adir}/{bdir}/a.txt"
    output:
        "{adir}/{bdir,.*}/b.txt"
    shell:
        "cp {input} {output}"
I was hoping that this rule would match both A/b.txt and A/B/b.txt.
However, A/b.txt doesn't match the rule. (Neither does A//b.txt, which would be the literal omission of bdir; I guess the double / gets removed before the matching happens.)
The following rule works with both A/b.txt and A/B/b.txt:
rule test_optional_sub_dir2:
    input:
        "{path}/a.txt"
    output:
        "{path,.*}/b.txt"
    shell:
        "cp {input} {output}"
but the problem in this case is that I don't have easy access to the names of the directories in path. I could use pathlib.Path to break {path} up, but this seems to get overly complicated.
Is there a better way to accomplish what I am trying to do?
Thanks a lot for your help.
How exactly you want to use the sub-directory in your rule might determine the best way to do this. Maybe something like:
def get_subdir(path):
    dirs = path.split('/')
    if len(dirs) > 1:
        return dirs[1]
    else:
        return ''

rule myrule:
    input:
        "{dirpath}/a.txt"
    output:
        "{dirpath}/b.txt"
    params:
        subdir = lambda wildcards: get_subdir(wildcards.dirpath)
    shell:
        # use {params.subdir} in your actual command; echo is a placeholder
        "echo {params.subdir}"
Of course, if your rule uses "run" or "script" instead of "shell" you don't even need that function and the subdir param, and can just figure out the subdir from the wildcard that gets passed into the script.
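As a quick sanity check, the helper above is plain Python and can be exercised outside Snakemake. With the output pattern "{dirpath}/b.txt", the wildcard dirpath would take values like 'A' or 'A/B':

```python
def get_subdir(path):
    """Return the first sub-directory below the top level of the
    {dirpath} wildcard value, or '' if there is none."""
    dirs = path.split('/')
    if len(dirs) > 1:
        return dirs[1]
    else:
        return ''

get_subdir('A')      # '' (file sits directly in A/)
get_subdir('A/B')    # 'B'
get_subdir('A/B/C')  # 'B' (still the first sub-directory)
```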
With some further fiddling, I found something that is close to what I want.
Let's say I want at least one mandatory directory and no more than 2 optional ones below it. The following works. The only downside is that opt_dir1 and opt_dir2 contain the trailing slash rather than just the name of the directory.
rule test_optional_sub_dir3:
    input:
        "{mand_dir}/{opt_dir1}{opt_dir2}a.txt"
    output:
        "{mand_dir}/{opt_dir1}{opt_dir2}b.txt"
    wildcard_constraints:
        mand_dir="[^/]+",
        opt_dir1="([^/]+/)?",
        opt_dir2="([^/]+/)?"
    shell:
        "cp {input} {output}"
Still interested in better approaches if anyone has one.
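One way to see why these constraints behave as described (and where the trailing slash comes from) is to test the equivalent regex directly with Python's re module. The pattern below is the rule's output path with the wildcard_constraints substituted in (an illustrative check of the pattern, not Snakemake itself):

```python
import re

# Output pattern of test_optional_sub_dir3 with constraints inlined:
# one mandatory directory, then up to two optional single-level ones.
pattern = (r"(?P<mand_dir>[^/]+)/"
           r"(?P<opt_dir1>([^/]+/)?)"
           r"(?P<opt_dir2>([^/]+/)?)b\.txt")

m = re.fullmatch(pattern, "A/B/b.txt")
m.group("opt_dir1")              # 'B/' -- the slash is captured too
m.group("opt_dir1").rstrip("/")  # 'B'  -- strip it in the rule body

re.fullmatch(pattern, "A/b.txt") is not None    # True: optional dirs empty
re.fullmatch(pattern, "A/B/C/D/b.txt") is None  # True: too many levels
```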

Varying (known) number of outputs in Snakemake

I have a Snakemake rule that works on a data archive and essentially unpacks the data in it. The archives contain a varying number of files that I know before my rule starts, so I would like to exploit this and do something like
rule unpack:
    input: '{id}.archive'
    output:
        lambda wildcards: ARCHIVE_CONTENTS[wildcards.id]
but I can't use functions in output, and for good reason. However, I can't come up with a good replacement. The rule is very expensive to run, so I cannot do
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
and run the rule several times for each archive. Another alternative could be
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
    run:
        if os.path.isfile(output[0]):
            return
        ...
Is marking the rule output with dynamic really the only option? I would be fine with auto-generating a separate rule for every archive, but I haven't found a way to do so.
Here, it becomes handy that Snakemake is an extension of plain Python. You can generate a separate rule for each archive:
for id, contents in ARCHIVE_CONTENTS.items():
    rule:
        input:
            '{id}.tar.gz'.format(id=id)
        output:
            expand('{id}/{outfile}', id=id, outfile=contents)
        params:
            id=id
        shell:
            'tar -C {params.id} -xf {input}'
(Since the outputs here are fully expanded, concrete paths, the rules have no wildcards; the archive id is therefore passed via params instead of {wildcards.id}.)
Depending on what kind of archive this is, you could also have a single rule that just extracts the desired file, e.g.:
rule unpack:
    input:
        '{id}.tar.gz'
    output:
        '{id}/{outfile}'
    shell:
        'tar -C {wildcards.id} -xf {input} {wildcards.outfile}'

how can I pass a string in config file into the output section?

New to snakemake, and I've been trying to convert my shell-script-based pipeline to Snakemake today, running into a lot of syntax issues. I think most of my trouble is around getting all the files in a particular directory and inferring output names from input names, since that's how I use shell scripts (for loops). In particular, I tried to use the expand function in the output section and it always gave me an error.
After checking some example Snakefiles, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used, and if so, why? What if I want to pass a prefix defined in the config.yaml file as part of the output file, where that prefix cannot be inferred from input file names? How can I achieve that, just like what I did below for the log section where {runid} is my prefix?
Second question, about syntax: I tried to pass a user-defined id from the configuration file (config.yaml) into the log section, and it seems I have to use expand in the following form. Is there a better way of passing strings defined in the config.yaml file?
log:
    expand("fastq/fastqc/{runid}_fastqc_log.txt", runid=config["run"])

where in the config.yaml:

run:
    "run123"
Third question: I initially tried the following two methods but they gave me errors, so does that mean that inside the log (and probably input and output) sections, Python syntax is not followed?
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"

log:
    "fastq/fastqc/{config["run"]}_fastqc_log.txt"
Here is an example of a small workflow:

# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]

rule all:
    input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)

rule peak_calling:
    input: control="{control}.sam", treatment="{treatment}.sam"
    output: "{treatment}_vs_{control}.bed"
    shell: "touch {output}"

rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    shell: "cp {input} {output}"
I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".
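For reference, expand essentially builds the product of the supplied wildcard values. A rough plain-Python equivalent behaves like this (the real helper is snakemake.io.expand, which supports more features):

```python
from itertools import product

def expand(pattern, **wildcards):
    # Substitute every combination of the supplied wildcard values
    # into the pattern, in the order the lists were given.
    keys = list(wildcards)
    return [pattern.format(**dict(zip(keys, combo)))
            for combo in product(*(wildcards[k] for k in keys))]

expand("{treatment}_vs_{control}.bed",
       treatment=["sample2"], control=["sample1"])
# ['sample2_vs_sample1.bed']
```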
As for the last part, the right way to put it would be the first one:
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
But again, snakemake can deduce it from your target (the rule all, in my example).
rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    log: "{samples}.log"
    shell: "cp {input} {output}"
Hope this helps!
You can use f-strings.
If this is your folder_with_configs/some_config.yaml:
var: value
Then simply:

configfile:
    "folder_with_configs/some_config.yaml"

rule one_to_rule_all:
    output:
        f"results/{config['var']}.file"
    shell:
        "touch {output[0]}"
Do remember the Python rules about nesting different types of quotes.
config in the Snakemake rule is a simple Python dictionary.
If you need to use additional variables in a path, e.g. some_param, use doubled curly brackets:
rule one_to_rule_all:
    output:
        f"results/{config['var']}.{{some_param}}"
    shell:
        "touch {output[0]}"
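To see what the doubled curly brackets do, here is the same expression in plain Python, with config standing in for the parsed config.yaml: the f-string fills single braces immediately, while {{...}} survives as a literal brace pair for Snakemake to later treat as a wildcard.

```python
config = {'var': 'value'}  # stands in for the parsed config.yaml

# Single braces are resolved by the f-string; doubled braces are
# emitted as literal {some_param} for Snakemake to match later.
path = f"results/{config['var']}.{{some_param}}"
path  # 'results/value.{some_param}'
```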
enjoy