I have a Snakemake rule that works on a data archive and essentially unpacks the data in it. The archives contain a varying number of files whose names I know before my rule starts, so I would like to exploit this and do something like
rule unpack:
    input: '{id}.archive'
    output:
        lambda wildcards: ARCHIVE_CONTENTS[wildcards.id]
but I can't use functions in output, and for good reason. However, I can't come up with a good replacement. The rule is very expensive to run, so I cannot do
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
and run the rule several times for each archive. Another alternative could be
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
    run:
        if os.path.isfile(output[0]):
            return
        ...
but I am afraid that would introduce a race condition.
Is marking the rule output with dynamic really the only option? I would be fine with auto-generating a separate rule for every archive, but I haven't found a way to do so.
Here it comes in handy that Snakemake is an extension of plain Python. You can generate a separate rule for each archive:
for id, contents in ARCHIVE_CONTENTS.items():
    rule:
        input:
            '{id}.tar.gz'.format(id=id)
        output:
            expand('{id}/{outfile}', id=id, outfile=contents)
        params:
            id=id
        shell:
            'tar -C {params.id} -xf {input}'
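This assumes ARCHIVE_CONTENTS has already been built before the loop runs. A minimal sketch of how that could look at the top of the Snakefile, assuming the archives sit next to it, using the standard tarfile module:

import tarfile
from glob import glob

# Map each archive id to the list of files it contains (layout assumed).
ARCHIVE_CONTENTS = {}
for path in glob('*.tar.gz'):
    id = path[:-len('.tar.gz')]
    with tarfile.open(path) as tar:
        ARCHIVE_CONTENTS[id] = [m.name for m in tar.getmembers() if m.isfile()]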
Depending on what kind of archive this is, you could also have a single rule that just extracts the desired file, e.g.:
rule unpack:
    input:
        '{id}.tar.gz'
    output:
        '{id}/{outfile}'
    shell:
        'tar -C {wildcards.id} -xf {input} {wildcards.outfile}'
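With this variant each extraction is cheap, and you simply request the concrete files you need as targets, e.g. (id and file name made up for illustration):

snakemake -c1 someid/somefile.txt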
I am struggling with a very simple thing. As input, my snakemake pipeline should take a directory, list its contents, and process each file from that directory in parallel. Naively, I thought something like this should work:
rule all:
    input:
        "in/{test}.txt"
    output:
        "out/{test}.txt"
    shell:
        "echo {input} >> {output}"
This ends with the error
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
All the resources I could find start by hard-coding the list of jobs in the script, which is something I want to avoid to keep the pipeline generic. The idea is to just point the pipeline at a directory with a list of files and let it do its job. Is this possible? It seems fairly simple and intuitive, but I couldn't find an example showing it.
I don't know what command you used for this rule, but the following workflow should serve your purpose:
rule all:
    input:
        expand("out/{prefix}.txt", prefix=glob_wildcards("in/{test}.txt").test)

rule test:
    input:
        "in/{test}.txt"
    output:
        "out/{test}.txt"
    shell:
        "echo {input} >> {output}"
glob_wildcards is a snakemake function that finds all the files matching the specified pattern (in/{test}.txt in this case); .test then gives the list of strings that matched {test} in the filenames (for example, "ab" in "in/ab.txt").
expand then fills those values into the placeholder wrapped in curly brackets, generating the list of input file names.
So rule all asks for a list of files corresponding to all the txt files in the in folder, which makes snakemake execute rule test for every one of them.
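As a concrete illustration, if the in directory contained in/ab.txt and in/cd.txt (file names made up for the example), then inside the Snakefile:

glob_wildcards("in/{test}.txt").test
# -> ['ab', 'cd']

expand("out/{prefix}.txt", prefix=['ab', 'cd'])
# -> ['out/ab.txt', 'out/cd.txt']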
I have a workflow that, very simplified for this question, looks as follows:
rule all:
    input: multiext("final", ".a", ".b", ".c", ".d")

rule final_cheap:
    input: "intermediary.{ext}"
    output: "final.{ext}"
    # dummy for cheap but complicated operation
    shell: "cp {input} {output}"

rule intermediary_cheap:
    input: "start.{ext}"
    output: "intermediary.{ext}"
    # dummy for cheap complicated operation
    shell: "cp {input} {output}"

rule start_expensive:
    output: "start.{ext}"
    # dummy for very expensive operation
    shell: "touch {output}"
There's a very expensive first step and two complicated steps that follow.
After I've run this workflow once with snakemake -c1 I want to rerun the workflow but just from the intermediary rule onwards. How can I achieve this goal with command line flags?
snakemake intermediary_cheap all does not work, because intermediary_cheap contains wildcards, even though including all actually pins down the values of the required wildcards.
Is there a command line flag that tells snakemake to run the rule and ignore all output from the rule intermediary_cheap, something like snakemake all --forcerule=intermediary_cheap? (I invented that --forcerule flag; it doesn't exist as far as I know.)
The workaround I'm using right now is manually deleting the output of the rule intermediary_cheap, forcing execution of that rule with --force, and then running rule all, which notices that some upstream inputs have changed. But this requires knowing the precise file names that are produced, whereas working at the level of rules would be preferable because it is a higher level of abstraction.
I haven't used it before but I think you want:
snakemake -c 1 --forcerun intermediary_cheap
--forcerun [TARGET [TARGET ...]], -R [TARGET [TARGET ...]]
                        Force the re-execution or creation of the given rules
                        or files. Use this option if you changed a rule and
                        want to have all its output in your workflow updated.
                        (default: None)
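Note that the usage line above shows --forcerun greedily accepting multiple targets, so if you also want to name an explicit target rule, put it before the flag, e.g.:

snakemake all -c 1 --forcerun intermediary_cheap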
I need my output to be dynamic with respect to the input; the best way I could think of to do this is to base the output on a dictionary. Here is a stripped-down example:
config.yaml:
{'names' : ['foo', 'bar']}
Snakefile:
configfile: "config.yaml"

rule all:
    input:
        expand("{name}", name=config['names'])

rule make_file:
    output:
        lambda wildcards: config[wildcards.name]
    shell:
        """
        touch {output}
        """
I get the error "Only input files can be specified as functions".
I also tried making the output of rule make_file config["{name}"].
Snakemake has the opposite logic: it is the output that defines the actual wildcard values, which are then used in the other sections.
If you need the output to depend on the input, you may define several rules, where each one defines one possible pattern for how an input is converted to an output. There are other possibilities for adding dynamic behavior, like checkpoints and dynamic files. Anyway, there is no one-size-fits-all solution for every problem, so you need to clarify what you are trying to achieve.
In any case, when designing a Snakemake pipeline you should think in terms of "what target is my goal", not in terms of "what could I make out of what I have".
Maybe you are making things more complicated than necessary or you simplified the example too much (or I'm misunderstanding the question...).
This will create files foo and bar by running rule make_file on each item in config['names']:
rule all:
    input:
        expand("{name}", name=config['names'])

rule make_file:
    output:
        '{name}'
    shell:
        """
        touch {output}
        """
"I need my output to be dynamic to input"
I'll add that you can use multiple rules to have snakemake select what to run based on the input to generate a single output file. This may be what you are getting at, but do provide more details if not:
ruleorder: rule1 > rule2 > rule3

rule rule1:
    output: 'foo'
    input: 'bar'
    shell: 'echo rule1 > {output}'

rule rule2:
    output: 'foo'
    input: 'baz'
    shell: 'echo rule2 > {output}'

rule rule3:
    output: 'foo'
    shell: 'echo rule3 > {output}'
The rule order resolves the ambiguity if bar and baz are both present. If neither is present, then rule3 runs as a default (with no input required).
I've used something similar in cases where, say, a fasta file can be generated from a vcf or a bam file, or downloaded as a fallback.
Let's say I have 3 rules with the same input; snakemake skips 2 of them and only runs one. Is there a workaround to force all 3 rules to execute, since I need all 3 of them? I could add some other files to the existing input, but that feels like a hack and would probably confuse other people looking at my code, since I would be declaring an input that is not used at all.
It appears target files were not defined. By default, snakemake executes the first rule in the snakefile.
Example:
rule all:
    input: "a.txt", "b.png"

rule x:
    output: "a.txt"
    shell: "touch {output}"

rule y:
    output: "b.png"
    shell: "touch {output}"
It is customary to name the first rule all which has all the desired output files.
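Alternatively, you can pass the desired files directly as command-line targets; that also makes all three rules run without touching the snakefile:

snakemake -c1 a.txt b.png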
I'm new to snakemake and have been trying to transform my shell-script-based pipeline into a snakemake-based one, and I've run into a lot of syntax issues. I think most of the trouble I have is around getting all the files in particular directories and inferring output names from input names, since that's how I work in shell scripts (with for loops). In particular, I tried to use the expand function in the output section and it always gave me an error.
After checking some example Snakefiles, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used, and if so, why? What if I want to pass a prefix defined in the config.yaml file as part of the output file name, and that prefix cannot be inferred from the input file names: how can I achieve that, just like what I did below for the log section, where {runid} is my prefix?
Second question, about syntax: I tried to pass a user-defined id from the configuration file (config.yaml) into the log section, and it seems to me that here I have to use expand in the following form. Is there a better way of passing strings defined in the config.yaml file?
log:
    expand("fastq/fastqc/{runid}_fastqc_log.txt", runid=config["run"])
where the config.yaml contains:
run:
    "run123"
Third question: I initially tried the following two methods, but they gave me errors. Does that mean that Python syntax is not followed inside the log (and probably also the input and output) sections?
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"

log:
    "fastq/fastqc/{config["run"]}_fastqc_log.txt"
Here is an example of a small workflow:
# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]

rule all:
    input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)

rule peak_calling:
    input: control="{control}.sam", treatment="{treatment}.sam"
    output: "{treatment}_vs_{control}.bed"
    shell: "touch {output}"

rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    shell: "cp {input} {output}"
I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling": the target sample2_vs_sample1.bed matches the output of "peak_calling" with treatment=sample2 and control=sample1, which in turn requires sample2.sam and sample1.sam from "mapping".
As for the last part, the right way to put it would be the first one:
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
But again, snakemake can deduce it from your target (the rule all, in my example).
rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    log: "{samples}.log"
    shell: "cp {input} {output}"
Hope this helps!
You can use f-strings:
If this is your folder_with_configs/some_config.yaml:
var: value
Then simply:

configfile:
    "folder_with_configs/some_config.yaml"

rule one_to_rule_all:
    output:
        f"results/{config['var']}.file"
    shell:
        "touch {output[0]}"
Do remember Python's rules about nesting different types of quotation marks.
config in the snakemake rule is a simple python dictionary.
If you need to keep additional wildcards in a path, e.g. some_param, use doubled curly brackets:
rule one_to_rule_all:
    output:
        f"results/{config['var']}.{{some_param}}"
    shell:
        "touch {output[0]}"
Enjoy!