I am trying to create a pipeline where a small chain of rules is run on a dynamic number of files produced by an earlier rule using dynamic output. However, I am getting the following error: "wildcards in input files cannot be determined from output files:".
This suggests to me that what I am trying to do is not currently supported. Here is a pseudo example of what I am trying to do:
rule a:
    input: "my static file.txt"
    output: dynamic('my/path/{id}.txt')

rule b:
    input: dynamic('my/path/{id}.txt')
    output: dynamic('my/path/{id}.reprocessed.txt')

rule c:
    input: dynamic('my/path/{id}.reprocessed.txt')
    output: 'gather.txt'
Running snakemake with
rule all:
    input: dynamic('my/path/{id}.txt')
Works without any issues, but when I run snakemake with:
rule all:
    input: dynamic('my/path/{id}.reprocessed.txt')
I get the error: "wildcards in input files cannot be determined from output files:"
Is this feature supported? Has anyone successfully made such a chain? Any considerations I need to take into account?
Thanks!
This was resolved by removing the dynamic statement from rule b.
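For reference, a sketch of what the working chain might then look like (rule b just uses the plain {id} wildcard, while rule a's output and rule c's input keep dynamic()):

rule a:
    input: "my static file.txt"
    output: dynamic('my/path/{id}.txt')

# rule b no longer uses dynamic(); the {id} wildcard is resolved per file
rule b:
    input: 'my/path/{id}.txt'
    output: 'my/path/{id}.reprocessed.txt'

rule c:
    input: dynamic('my/path/{id}.reprocessed.txt')
    output: 'gather.txt'

Note that newer Snakemake versions deprecate dynamic() in favour of checkpoints, so new workflows should prefer the checkpoint mechanism for this pattern.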
Related
I'm working with Snakemake for NGS analysis. I have a list of input files, stored in a YAML file as follows:
DATASETS:
sample1: /path/to/input/bam
.
.
A very simplified skeleton of my Snakemake file, as described earlier in Snakemake: How to use config file efficiently and https://www.biostars.org/p/406452/, is as follows:
rule all:
    input:
        expand("report/{sample}.xlsx", sample = config["DATASETS"])

rule call:
    input:
        lambda wildcards: config["DATASETS"][wildcards.sample]
    output:
        "tmp/{sample}.vcf"
    shell:
        "some mutect2 script"

rule summarize:
    input:
        "tmp/{sample}.vcf"
    output:
        "report/{sample}.xlsx"
    shell:
        "processVCF.py"
This complains about missing input files for rule all. I'm really not sure what I am missing here: could someone point out where I should start looking to solve my problem?
This problem persists even when I execute snakemake -n tmp/sample1.vcf, so it seems the problem is related to the inability to pass the input file to the rule call. I have a nagging feeling that I'm really missing something trivial here.
In the first step of my process, I am extracting some hourly data from a database. For reasons outside my control, data is sometimes missing for some hours, which results in missing files. As long as the amount of missing data is not too large, I still want to run some of the rules that depend on that data. When running those rules I will check how much data is missing and then decide whether or not to raise an error.
An example below. The Snakefile:
rule parse_data:
    input:
        "data/1.csv", "data/2.csv", "data/3.csv", "data/4.csv"
    output:
        "result.csv"
    shell:
        "touch {output}"

rule get_data:
    output:
        "data/{id}.csv"
    shell:
        "Rscript get_data.R {output}"
And my get_data.R script:
output <- commandArgs(trailingOnly = TRUE)[1]
if (output == "data/1.csv")
stop("Some error")
writeLines("foo", output)
How do I force the rule parse_data to run even when some of its inputs are missing? I do not want to force any other rules to run when input is missing.
One possible solution would be to generate, for example, an empty file in get_data.R when the query fails. However, in practice I am also using --restart-times 5 when running snakemake, because the query can also fail due to database timeouts. If get_data.R created an empty file, this retry mechanism would no longer work.
You need data-dependent conditional execution.
Use a checkpoint on get_data, and replace parse_data's input with a function that aggregates whatever files actually exist (see the sketch further below).
(note that I am a Snakemake newbie and am just learning this myself, I hope this is helpful)
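A rough sketch of what that could look like, assuming the download step is reworked so that a single checkpoint fetches all hours into a data/ directory and simply skips the hours that fail. The directory layout, the hours parameter, and the behaviour of get_data.R here are my assumptions, not the original code:

HOURS = ["1", "2", "3", "4"]

# hypothetical: get_data.R creates data/ and writes data/<hour>.csv,
# silently skipping hours whose query fails
checkpoint get_data:
    output:
        directory("data")
    params:
        hours=HOURS
    shell:
        "Rscript get_data.R {output} {params.hours}"

def available_csvs(wildcards):
    # Evaluated only after the checkpoint has finished, so the glob
    # sees exactly the files that were actually produced.
    outdir = checkpoints.get_data.get().output[0]
    hours = glob_wildcards(outdir + "/{hour}.csv").hour
    return expand(outdir + "/{hour}.csv", hour=hours)

rule parse_data:
    input:
        available_csvs
    output:
        "result.csv"
    shell:
        "touch {output}"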
I am using snakemake in a workflow for NGS analyses.
In one rule, I make use of the unique (temporary) output from another rule. The output of this rule is also unique and contributes to the creation of the final output. A single wildcard, {sample}, is used across these rules. I do not see any cyclic dependency, but snakemake tells me there is one:
CyclicGraphException in line xxx of Snakefile: Cyclic dependency on rule
I understand that there is an option to investigate this problem: --debug-dag.
How do I interpret the output? What is candidate versus selected?
This is my (pseudo-)code for the rule:
rule split_fasta:
    input:
        dataFile="data/path1/{sample}.tab",
        scaffolds="data/path2/{sample}.fasta",
        database="path/to/db",
    output:
        onefasta="data/path2/{sample}_one.fasta",
        twofasta="data/path2/{sample}_two.fasta",
        threefasta="data/path2/{sample}_three.fasta",
    conda:
        "envs/env.yaml"
    log:
        "logs/split_fasta_{sample}.log"
    benchmark:
        "logs/benchmark/split_fasta_{sample}.txt"
    threads: 4
    shell:
        """
        python bin/split_fasta.py {input.dataFile} {input.scaffolds} {input.database} {output.onefasta} {output.twofasta} {output.threefasta}
        """
There is no other connection between input and output than in this rule.
The problem is solved now; some subtle dependencies further upstream and downstream were present.
But for future reference I would like to know how to interpret the output of the --debug-dag option.
--debug-dag Print candidate and selected jobs (including their wildcards) while inferring DAG. This can help to debug unexpected DAG topology or errors.
It does not seem to have more documentation than this, but I believe the candidate jobs are the jobs whose output can be matched to the requested file through wildcards. The selected job is the one chosen from among those candidates (through wildcard constraints, ruleorder, or, with the --allow-ambiguity option, simply the first candidate).
As an example, I have two rules that do adapter trimming, one for paired-end and one for single-end data:
rule trim_SE:
    input:
        "{sample}.fastq.gz"
    output:
        "{sample}_trimmed.fastq.gz"
    shell:
        ...

rule trim_PE:
    input:
        "{sample}_R1.fastq.gz",
        "{sample}_R2.fastq.gz"
    output:
        "{sample}_R1_trimmed.fastq.gz",
        "{sample}_R2_trimmed.fastq.gz"
    shell:
        ...
If I now tell snakemake to generate the output exp_R1_trimmed.fastq.gz it complains that it can use either rule.
AmbiguousRuleException:
Rules trim_PE and trim_SE are ambiguous for the file exp_R1_trimmed.fastq.gz.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
trim_PE: sample=exp
trim_SE: sample=exp_R1
We can solve this problem by, for instance, declaring a ruleorder:
ruleorder: trim_PE > trim_SE
And the file gets generated as we want. If we now use the --debug-dag option we get two candidate rules, and one selected rule (based on our ruleorder).
candidate job trim_PE
wildcards: sample=exp
candidate job trim_SE
wildcards: sample=exp_R1
selected job trim_PE
wildcards: sample=exp
If the rules trim_PE and trim_SE depended on other rules downstream, we could use the --debug-dag option to detect in which rule the wildcard expansion goes wrong, instead of only getting an error in the rule where it finally fails.
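As an aside, the error message also suggests constraining wildcards; a sketch of that alternative (the regex is mine, not part of the original answer) would be to forbid trim_SE's {sample} from ending in _R1 or _R2, so that only trim_PE can match the paired-end file names:

rule trim_SE:
    input:
        "{sample}.fastq.gz"
    output:
        "{sample}_trimmed.fastq.gz"
    wildcard_constraints:
        # assumed naming convention: single-end sample names never end in _R1/_R2
        sample=r".*(?<!_R1)(?<!_R2)"
    shell:
        ...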
I am new to Snakemake and I want to write a very simple Snakefile with a rule that processes each input file separately to an output file, but somehow my wildcards aren't interpreted correctly.
I have set up a minimal, reproducible example environment in Ubuntu 18.04 with the input files "test/test1.txt", "test/test2.txt", and a Snakefile. (snakemake version 5.5.4)
Snakefile:
ins = glob_wildcards("test/{f}.txt")

rule all:
    input: expand("out/{f}.txt", f=ins)

rule test:
    input: "test/{f}.txt"
    output: "out/{f}.txt"
    shell: "touch {output}"
This Snakefile throws the following error while building the DAG of jobs:
Missing input files for rule test:
test/['test1', 'test2'].txt
Any ideas how to fix this error?
I think you need to use ins.f or something similar:
expand("out/{f}.txt", f= ins.f)
The reason is explained in the FAQ: glob_wildcards returns "a named tuple that contains a list of values for each wildcard."
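Putting that together, the corrected Snakefile from the question would be:

ins = glob_wildcards("test/{f}.txt")

rule all:
    input: expand("out/{f}.txt", f=ins.f)  # ins.f is ['test1', 'test2']

rule test:
    input: "test/{f}.txt"
    output: "out/{f}.txt"
    shell: "touch {output}"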
I'm new to Snakemake and have been trying to convert my shell-script-based pipeline to Snakemake today, running into a lot of syntax issues along the way. I think most of my trouble comes from collecting all the files in a particular directory and inferring output names from input names, since that is how my shell script works (a for loop). In particular, I tried to use the expand function in the output section and it always gave me an error.
After checking some example Snakefiles, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used, and if so, why? And what if I want to use a prefix defined in the config.yaml file as part of the output file name, where that prefix cannot be inferred from the input file names? How can I achieve that, just like what I did below for the log section, where {runid} is my prefix?
Second question, about syntax: I tried to pass a user-defined id from the configuration file (config.yaml) into the log section, and it seems I have to use expand in the following form. Is there a better way of passing strings defined in the config.yaml file?
log:
    expand("fastq/fastqc/{runid}_fastqc_log.txt", runid=config["run"])
where in the config.yaml
run:
"run123"
Third question: I initially tried the following two methods, but they gave me errors. Does that mean that inside the log (and probably input and output) sections, Python syntax is not followed?
log:
"fastq/fastqc/"+config["run"]+"_fastqc_log.txt"
log:
"fastq/fastqc/{config["run"]}_fastqc_log.txt"
Here is an example of small workflow:
# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]

rule all:
    input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)

rule peak_calling:
    input: control="{control}.sam", treatment="{treatment}.sam"
    output: "{treatment}_vs_{control}.bed"
    shell: "touch {output}"

rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    shell: "cp {input} {output}"
I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".
As for the last part, the right way to put it would be the first one:
log:
"fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
But again, snakemake can deduce it from your target (the rule all, in my example).
rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    log: "{samples}.log"
    shell: "cp {input} {output}"
Hope this helps!
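If you do want both the config prefix and a wildcard in the same log path, the two can be combined with plain string concatenation. A small sketch; the rule and file names here are made up, not from the question:

rule fastqc:
    input: "fastq/{sample}.fastq"
    output: "fastq/fastqc/{sample}_fastqc_done.txt"
    log: "fastq/fastqc/" + config["run"] + "_{sample}_fastqc_log.txt"
    shell: "touch {output}"  # placeholder for the real fastqc command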
You can use f-strings.
If this is your folder_with_configs/some_config.yaml:
var: value
Then simply
configfile:
    "folder_with_configs/some_config.yaml"

rule one_to_rule_all:
    output:
        f"results/{config['var']}.file"
    shell:
        "touch {output[0]}"
Do remember the usual Python rules about nesting different types of quotation marks.
config in the Snakemake rule is a simple Python dictionary.
If you need additional wildcards in the path, e.g. some_param, use double curly brackets so the f-string leaves {some_param} in place for Snakemake to fill in as a wildcard.
rule one_to_rule_all:
    output:
        f"results/{config['var']}.{{some_param}}"
    shell:
        "touch {output[0]}"
enjoy