Possible to have wildcard as dictionary key in snakemake rule output?

I need my output to be dynamic with respect to the input; the best way I could think of is to base the output on a dictionary. Here is a stripped-down example:
config.yaml:
{'names' : ['foo', 'bar']}
Snakefile:
configfile: "config.yaml"

rule all:
    input:
        expand("{name}", name=config['names'])

rule make_file:
    output:
        lambda wildcards: config[wildcards.name]
    shell:
        """
        touch {output}
        """
I get "Only input files can be specified as functions".
I also tried making the output of rule make_file config["{name}"].

Snakemake has the opposite logic: it is the output that defines the actual wildcard values, which are then used in the other sections.
If you need the output to depend on the input, you may define several rules, where each one defines one possible pattern for how an input is converted to an output. There are other ways to add dynamic behavior, such as checkpoints and dynamic files. Still, there is no one-size-fits-all solution for every problem, so you need to clarify what you are trying to achieve.
Anyway, when designing a Snakemake pipeline you should think in terms of "what target is my goal", not in terms of "what could I make out of what I have".
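For illustration, a minimal sketch of the several-rules approach (the file patterns and commands below are made up, not taken from the question): each rule encodes one way of converting an input to an output, and the target you request decides which rule applies.

rule compress:
    input: 'raw/{name}.txt'
    output: 'compressed/{name}.txt.gz'
    shell: 'gzip -c {input} > {output}'

rule summarize:
    input: 'raw/{name}.txt'
    output: 'summary/{name}.counts'
    shell: 'wc -l {input} > {output}'

Requesting compressed/foo.txt.gz runs the first rule; requesting summary/foo.counts runs the second.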

Maybe you are making things more complicated than necessary or you simplified the example too much (or I'm misunderstanding the question...).
This will create files foo and bar by running rule make_file on each item in config['names']:
rule all:
    input:
        expand("{name}", name=config['names'])

rule make_file:
    output:
        '{name}'
    shell:
        """
        touch {output}
        """

I need my output to be dynamic with respect to the input
I'll add that you can use multiple rules to have snakemake select what to run based on the input to generate a single output file. This may be what you are getting at, but do provide more details if not:
ruleorder: rule1 > rule2 > rule3

rule rule1:
    output: 'foo'
    input: 'bar'
    shell: 'echo rule1 > {output}'

rule rule2:
    output: 'foo'
    input: 'baz'
    shell: 'echo rule2 > {output}'

rule rule3:
    output: 'foo'
    shell: 'echo rule3 > {output}'
The rule order resolves the ambiguity if bar and baz are both present. If neither is present, then rule3 runs as a default (with no input required).
I've used something similar in cases where say a fasta file can be generated from a vcf, a bam file, or downloaded as a fallback.
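As a hedged sketch of that fasta case (the converter commands and the download URL below are placeholders, not the actual tools):

ruleorder: fasta_from_vcf > fasta_from_bam > fasta_download

rule fasta_from_vcf:
    input: '{sample}.vcf'
    output: '{sample}.fasta'
    shell: 'vcf_to_fasta {input} > {output}' # placeholder command

rule fasta_from_bam:
    input: '{sample}.bam'
    output: '{sample}.fasta'
    shell: 'bam_to_fasta {input} > {output}' # placeholder command

rule fasta_download:
    output: '{sample}.fasta'
    shell: 'wget -O {output} https://example.com/{wildcards.sample}.fasta' # placeholder URL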

Snakemake: Combine different rules for all files in a certain directory

Recently, I started using snakemake for data analysis.
I am still a beginner, and this is my first post on stackoverflow.
I have different rules that produce different output, but all should run on all files of a certain directory.
Here is a simplified example:
LABELS, = glob_wildcards('{label}.dat')

rule all:
    input:
        expand('{label}-A.out', label=LABELS),
        expand('{label}-B.out', label=LABELS)

rule A:
    input: expand('{label}-A.out', label=LABELS)

rule B:
    input: expand('{label}-B.out', label=LABELS)

rule create_A_out:
    output:
        '{label}-A.out'
    shell:
        'touch {output}'

rule create_B_out:
    input:
        'test.dat'
    output:
        '{label}-B.out'
    shell:
        'touch {output}'
To update all output files at once, do I have to write a rule like 'all' that manually collects all output files that I need?
Or is there a way to combine rules 'A' and 'B' (and more rules), so that I can easily run all at once?
Thank you very much!
I am not sure I follow your question, but as written the rules A and B are superfluous. If you want to directly link rule outputs, you can use rule dependencies, though I prefer to keep the file names explicit. Perhaps you are looking for a more complex expand, such as this:
LABELS, = glob_wildcards('{label}.dat')
analyses = ['A', 'B'] # ...

rule all:
    input:
        expand('{label}-{analysis}.out',
               label=LABELS,
               analysis=analyses)

rule create_A_out:
    input:
        '{label}.dat'
    output:
        '{label}-A.out'
    shell:
        'touch {output}'

rule create_B_out:
    input:
        '{label}.dat'
    output:
        '{label}-B.out'
    shell:
        'touch {output}'
By default, expand produces every combination of the lists you supply. I also added wildcards to your inputs, and added the .dat file you are globbing against as input. If you keep the same general format, just add a rule and an element to analyses.
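As a quick illustration (the labels here are made up), expand is also available as a plain Python function, so you can inspect what it generates:

from snakemake.io import expand

# expand takes the product of all the lists you pass it
files = expand('{label}-{analysis}.out',
               label=['s1', 's2'],
               analysis=['A', 'B'])
print(files)  # the four combinations: s1-A.out, s1-B.out, s2-A.out, s2-B.out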
If you have tons of analyses to do, you can also get fancier with anonymous rules. Say I have Python scripts, analysis-[A-Z].py. You can link those to an output file like so:

LABELS, = glob_wildcards('{label}.dat')
analyses = ['A', 'B', ..., 'Z']

rule all:
    input:
        expand('{label}-{analysis}.out',
               label=LABELS,
               analysis=analyses)

for analysis in analyses:
    rule:
        input: '{label}.dat'
        output: f'{{label}}-{analysis}.out' # becomes {label}-A.out after formatting
        script: f'analysis-{analysis}.py'
You could even make analyses a dict with shell scripts as the values, but that would be really hard to follow!
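If you did go that way, a minimal sketch might look like this (reusing LABELS from above; the shell commands are arbitrary placeholders):

# placeholder commands; each value is a shell template for one analysis
analyses = {
    'A': 'wc -l {input} > {output}',
    'B': 'sort {input} > {output}',
}

rule all:
    input:
        expand('{label}-{analysis}.out',
               label=LABELS,
               analysis=analyses) # iterating the dict yields its keys

for analysis, cmd in analyses.items():
    rule:
        input: '{label}.dat'
        output: f'{{label}}-{analysis}.out'
        shell: cmd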

snakemake wildcard scoping between rules

I've read the docs on rules, the FAQ, and this question, but I still can't tell: if a wildcard foo is defined in rule bar, can its values be accessed in rule baz?
If I understood your question correctly, the answer should be "no".
By using a wildcard in a rule you just define a pattern that can be applied to many different files. For example, this rule defines a way to produce files whose filenames match the pattern "new_name{n}.txt", where {n} can be any string:

rule example:
    input: "old_name{n}.txt"
    output: "new_name{n}.txt"
    shell: "cp {input} {output}"

Of course, this rule will only be applied if the file "old_name{n}.txt" exists for the same string {n}.
Returning to your question: how could you access the value if this is just a pattern that may be applied to different values?
Another possible interpretation of your question is that you need to know all the values (all the inputs) that rule bar was applied to. In this case you probably need a checkpoint: this is the way to delay pattern matching until part of the pipeline has finished. But even then you wouldn't be accessing "the values of the wildcard" explicitly.
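For completeness, a minimal checkpoint sketch (the rule and file names are made up): the input function of the aggregating rule is re-evaluated after the checkpoint has run, and only then are the concrete values known.

import os

checkpoint scatter:
    output:
        directory('scatter_out')
    shell:
        'mkdir -p {output} && touch {output}/a.txt {output}/b.txt'

def gather_input(wildcards):
    # get() defers evaluation until the checkpoint has finished;
    # afterwards we can glob the files it actually produced
    outdir = checkpoints.scatter.get(**wildcards).output[0]
    return expand(os.path.join(outdir, '{i}.txt'),
                  i=glob_wildcards(os.path.join(outdir, '{i}.txt')).i)

rule gather:
    input:
        gather_input
    output:
        'gathered.txt'
    shell:
        'cat {input} > {output}'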
I'm not sure I'm answering your question and what follows may not be entirely correct... Snakemake only cares that you have one and only one path that leads to the requested files (i.e., the files defined in the first rule, usually called all).
If rule bar defines wildcards that can lead to the final output, then yes, those wildcards are visible to the following rules.
In the script below we ask for files A.txt and B.txt. To produce A.txt we don't need any wildcard. To produce B.txt we need to pass by wildcard {wc} defined in rule bar and used in rule B. Note that wildcard {sample} doesn't appear at all outside rule all. Note also that rule bar produces two files, B.tmp and C.tmp, but rule B only needs B.tmp. (You should be able to dry-run this script with snakemake -p -n)
rule all:
    input:
        expand('{sample}.txt', sample=['A', 'B']),

rule A:
    output:
        'A.txt',
    shell:
        "touch {output}"

rule bar:
    output:
        expand('{wc}.tmp', wc=['B', 'C'])
    shell:
        r"""
        touch {output}
        """

rule B:
    input:
        '{wc}.tmp',
    output:
        '{wc}.txt',
    shell:
        r"""
        touch {input} {output}
        """

Snakemake skips multiple of the rules with same input

Let's say I have 3 rules with the same input; snakemake skips 2 of them and only runs one of the rules. Is there a workaround to force all 3 rules to execute, since I need all 3 of them? I could add some other files to the existing input, but that feels like a hack and would probably confuse other people looking at my code, since I would declare an input that is not used at all.
It appears target files were not defined. By default, snakemake executes the first rule in the snakefile.
Example:
rule all:
    input: "a.txt", "b.png"

rule x:
    output: "a.txt"
    shell: "touch {output}"

rule y:
    output: "b.png"
    shell: "touch {output}"
It is customary to name the first rule all and have it list all the desired output files as input.

A WorkflowError with wildcards

I want to use snakemake to QC the fastq file, but it shows:
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files
or a rule without wildcards.
The code I wrote is like this:

SAMPLE = ["A", "B", "C"]

rule trimmomatic:
    input:
        "/data/samples/{sample}.fastq"
    output:
        "/data/samples/{sample}.clean.fastq"
    shell:
        "trimmomatic SE -threads 5 -phred33 -trimlog trim.log {input} {output} LEADING:20 TRAILING:20 MINLEN:16"
I'm a novice; if anyone knows what's going on, please tell me. Thanks so much!
You could do one of the following, but chances are you want the latter.
Explicitly specify output filenames via the command line:
snakemake data/samples/A.clean.fastq
This would run the rule needed to create the file data/samples/A.clean.fastq.
Specify the target output files to be created in the Snakefile itself using rule all (see the Snakemake documentation to learn more about adding targets via rule all):
SAMPLE_NAMES = ["A", "B", "C"]

rule all:
    input:
        expand("data/samples/{sample}.clean.fastq", sample=SAMPLE_NAMES)

rule trimmomatic:
    input:
        "data/samples/{sample}.fastq"
    output:
        "data/samples/{sample}.clean.fastq"
    shell:
        "trimmomatic SE -threads 5 -phred33 -trimlog trim.log {input} {output} LEADING:20 TRAILING:20 MINLEN:16"

Varying (known) number of outputs in Snakemake

I have a Snakemake rule that works on a data archive and essentially unpacks the data in it. The archives contain a varying number of files that I know before my rule starts, so I would like to exploit this and do something like
rule unpack:
    input: '{id}.archive'
    output:
        lambda wildcards: ARCHIVE_CONTENTS[wildcards.id]
but I can't use functions in output, and for good reason. However, I can't come up with a good replacement. The rule is very expensive to run, so I cannot do
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
and run the rule several times for each archive. Another alternative could be
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
    run:
        if os.path.isfile(output[0]):
            return
        ...
but I am afraid that would introduce a race condition.
Is marking the rule output with dynamic really the only option? I would be fine with auto-generating a separate rule for every archive, but I haven't found a way to do so.
Here it comes in handy that Snakemake is an extension of plain Python. You can generate a separate rule for each archive:
for id, contents in ARCHIVE_CONTENTS.items():
    rule:
        input:
            f'{id}.tar.gz'
        output:
            expand('{id}/{outfile}', id=id, outfile=contents)
        # the outputs above are concrete, so this rule has no wildcards;
        # bake the id into the shell command instead of using {wildcards.id}
        shell:
            f'tar -C {id} -xf {{input}}'
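For concreteness, ARCHIVE_CONTENTS could be a plain dict mapping each archive id to its known contents (the values here are hypothetical), with a target rule that requests all the extracted files:

ARCHIVE_CONTENTS = {
    'data1': ['a.txt', 'b.txt'], # hypothetical contents
    'data2': ['c.txt'],
}

# define this before the generated rules so it stays the default target
rule all:
    input:
        [f'{id}/{outfile}'
         for id, contents in ARCHIVE_CONTENTS.items()
         for outfile in contents]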
Depending on what kind of archive this is, you could also have a single rule that just extracts the desired file, e.g.:
rule unpack:
    input:
        '{id}.tar.gz'
    output:
        '{id}/{outfile}'
    shell:
        'tar -C {wildcards.id} -xf {input} {wildcards.outfile}'