How to run rule even when some of its inputs are missing? - snakemake

In the first step of my process, I extract hourly data from a database. For various reasons, data is sometimes missing for some hours, resulting in missing files. As long as the number of missing files is not too large, I still want to run some of the rules that depend on that data. When running those rules, I will check how much data is missing and then decide whether to raise an error.
An example below. The Snakefile:
rule parse_data:
    input:
        "data/1.csv", "data/2.csv", "data/3.csv", "data/4.csv"
    output:
        "result.csv"
    shell:
        "touch {output}"
rule get_data:
    output:
        "data/{id}.csv"
    shell:
        "Rscript get_data.R {output}"
And my get_data.R script:
output <- commandArgs(trailingOnly = TRUE)[1]
if (output == "data/1.csv")
  stop("Some error")
writeLines("foo", output)
How do I force the rule parse_data to run even when some of its inputs are missing? I do not want to force any other rules to run when their input is missing.
One possible solution would be to generate, for example, an empty file in get_data.R when the query failed. However, in practice I am also using --restart-times 5 when running snakemake as the query can also fail because of database timeouts. When creating an empty file this mechanism of retrying the queries would no longer work.

You need data-dependent conditional execution.
Use a checkpoint on get_data. Then replace parse_data's input with a function that aggregates whatever files do exist.
(note that I am a Snakemake newbie and am just learning this myself, I hope this is helpful)
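The aggregation half of this suggestion could look roughly like the plain-Python sketch below. It is a minimal sketch under assumptions: the names `existing_inputs`, `CANDIDATES` and `min_present` are hypothetical and do not come from the original posts, and the wiring to a checkpoint is not shown.

```python
import os

# Hypothetical candidate list, mirroring the four hourly files in the question.
CANDIDATES = ["data/1.csv", "data/2.csv", "data/3.csv", "data/4.csv"]

def existing_inputs(candidates, min_present=1):
    """Return the candidate files that exist on disk.

    Raises ValueError when fewer than `min_present` files are present,
    which is one way to decide whether too much data is missing.
    """
    present = [path for path in candidates if os.path.exists(path)]
    if len(present) < min_present:
        raise ValueError(
            f"only {len(present)} of {len(candidates)} input files exist"
        )
    return present
```

In a Snakefile, a function like this would be called from an input function of parse_data. Input functions are normally evaluated while the DAG is built, so on its own this only sees files that already exist, which is presumably why the answer pairs it with a checkpoint.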

Related

Snakemake using the same input and output in a rule

Is it possible to use the same input and output in a rule?
For example,
rule example:
    input:
        "/path/to/my/data"
    output:
        "/path/to/my/data"
    shell:
        "my_command {input}"
I am pulling data from a previous rule, and am trying to move some of its outputs around, and merge files together.
I appreciate any help!
In a nutshell, no. Snakemake builds a DAG (directed acyclic graph) and then makes the dependencies for each node required by a target. In your case you are introducing a loop.
Anyway, from your description I don't see any reason for this cycle:
I am pulling data from a previous rule, and am trying to move some of
its outputs around, and merge files together.
That can be done in a "normal" way.
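To make the "loop" concrete, here is a small sketch of cycle detection on a file dependency graph. It is not how Snakemake itself is implemented; `find_cycle` and the `/path/to/my/merged_data` path are hypothetical names used only for illustration.

```python
def find_cycle(deps):
    """Return a list of nodes forming a cycle in `deps`, or None.

    `deps` maps each file to the list of files it is built from.
    """
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # We came back to a node on the current path: that is a cycle.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for parent in deps.get(node, []):
            cycle = visit(parent)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for node in deps:
        cycle = visit(node)
        if cycle:
            return cycle
    return None

# The rule from the question: the output is built from itself.
looped = {"/path/to/my/data": ["/path/to/my/data"]}
# Writing to a different (hypothetical) path removes the loop.
fixed = {"/path/to/my/merged_data": ["/path/to/my/data"]}

print(find_cycle(looped))
print(find_cycle(fixed))
```

The first graph reports a cycle because the file depends on itself; the second does not, which is the "normal" layout the answer refers to.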

Snakemake, RNA-seq : How can I execute one subpart of a pipeline or another subpart based on the characteristics of the sample that is analysed?

I am using snakemake to design an RNA-seq data analysis pipeline. While I've managed to do that, I want to make my pipeline as adaptable as possible, so that it can deal with single-end (SE) data and paired-end (PE) data within the same run of analyses, instead of analysing SE data in one run and PE data in another.
My pipeline is supposed to be designed like this :
dataset download that gives 1 file (SE data) or 2 files (PE data) -->
set of rules A specific to 1 file OR set of rules B specific to 2 files -->
rule that takes 1 or 2 input files and merges it/them
into a single output -->
final set of rules.
Note : all rules of A have 1 input and 1 output, all rules of B have 2 inputs and 2 outputs and their respective commands look like :
1 input : somecommand -i {input} -o {output}
2 inputs : somecommand -i1 {input1} -i2 {input2} -o1 {output1} -o2 {output2}
Note 2 : except their differences in inputs/outputs, all rules of sets A and B have the same commands, parameters/etc...
In other words, I want my pipeline to be able to switch between the execution of set of rules A or set of rules B depending on the sample, either by giving it information on the sample in a config file at the start (sample 1 is SE, sample 2 is PE... this is known beforehand) or by asking snakemake to count the number of files after the dataset download to choose the proper next set of rules for each sample. If you see another way to do that, you're welcome to tell me about it.
I thought about using checkpoints, input functions and if/else statement, but I haven't managed to solve my problem with these.
Do you have any hints/advice/ways to make that "switch" happen ?
If you know the layout beforehand, then the easiest way would be to store it in some variable, something like this (or alternatively you read this from a config file into a dictionary):
layouts = {"sample1": "paired", "sample2": "single"}  # ... etc
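For the config-file alternative, a sketch along these lines might work. The shape of the config (the `samples` and `layout` keys) and the helper `build_layouts` are assumptions made for illustration; in a Snakefile the `config` dictionary would come from the `configfile:` directive rather than being written inline.

```python
# Hypothetical config contents; the "samples"/"layout" keys are assumptions.
config = {
    "samples": {
        "sample1": {"layout": "paired"},
        "sample2": {"layout": "single"},
    }
}

VALID_LAYOUTS = {"single", "paired"}

def build_layouts(config):
    """Map each sample name to its layout, rejecting unknown values."""
    layouts = {}
    for sample, info in config["samples"].items():
        layout = info["layout"]
        if layout not in VALID_LAYOUTS:
            raise ValueError(f"{sample}: unknown layout {layout!r}")
        layouts[sample] = layout
    return layouts

layouts = build_layouts(config)
print(layouts)
```

Validating the layout up front means a typo in the config fails early instead of silently producing an empty input list later.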
What you can then do is "merge" your rule like this (I am guessing you are talking about trimming and alignment, so that's my example):
ruleorder: B > A
rule A:
    input:
        "{sample}.fastq.gz"
    output:
        "trimmed_{sample}.fastq.gz"
    shell:
        "somecommand -i {input} -o {output}"
rule B:
    input:
        input1="{sample}_R1.fastq.gz",
        input2="{sample}_R2.fastq.gz"
    output:
        output1="trimmed_{sample}_R1.fastq.gz",
        output2="trimmed_{sample}_R2.fastq.gz"
    shell:
        "somecommand -i1 {input.input1} -i2 {input.input2} -o1 {output.output1} -o2 {output.output2}"
def get_fastqs(wildcards):
    # Return the trimmed file(s) for this sample, keyed by input name.
    output = dict()
    if layouts[wildcards.sample] == "single":
        output["input"] = f"trimmed_{wildcards.sample}.fastq.gz"
    elif layouts[wildcards.sample] == "paired":
        output["input1"] = f"trimmed_{wildcards.sample}_R1.fastq.gz"
        output["input2"] = f"trimmed_{wildcards.sample}_R2.fastq.gz"
    return output
rule alignment:
    input:
        unpack(get_fastqs)
    output:
        "somepath/{sample}.bam"
    shell:
        ...
There is a lot of stuff going on here.
First of all, you need a ruleorder so snakemake knows how to handle ambiguous cases.
Rule A and B both have to exist (unless you do something hacky with the output files).
The alignment rule needs an input function to determine which input it requires.
Some self-promotion: I made a snakemake pipeline which does many things, including RNA-seq and downloading of samples online and automatically determining their layout (single-end vs paired-end). Please take a look and see if it solves your problem: https://vanheeringen-lab.github.io/seq2science/content/workflows/rna_seq.html
EDIT:
When you say “merging” rules, do you mean rule A, B and alignment ?
That was unclear wording on my part. By merging I meant to "merge" the single-end and paired-end logic together, so you can continue with a single rule (e.g. count table, you name it).
Rule order : why did you choose B > A ? To make sure that paired samples don’t end up running in the single-end rules?
Exactly! When a rule needs trimmed_sample1_R1.fastq.gz, how would Snakemake know the name of your sample? Is the name of the sample, sample1, or is it sample1_R1? It can be either, and that makes snakemake complain that it does not know how to resolve this. When you add a ruleorder you tell Snakemake, when it is unclear, resolve in this order.
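The ambiguity can be reproduced outside Snakemake with a short sketch. It treats the `{sample}` wildcard as "any non-empty string", which is a simplification; `pattern_to_regex` and `matching_rules` are hypothetical helpers, not Snakemake functions.

```python
import re

def pattern_to_regex(pattern):
    """Turn an output pattern with a {sample} wildcard into a regex.

    Simplified: the wildcard matches any non-empty string.
    """
    prefix, suffix = pattern.split("{sample}")
    return re.compile(
        re.escape(prefix) + r"(?P<sample>.+)" + re.escape(suffix) + r"$"
    )

# Output patterns of rule A and of the R1 output of rule B.
patterns = {
    "A": "trimmed_{sample}.fastq.gz",
    "B": "trimmed_{sample}_R1.fastq.gz",
}

def matching_rules(filename):
    """Return {rule: sample} for every rule whose output pattern matches."""
    matches = {}
    for rule, pattern in patterns.items():
        m = pattern_to_regex(pattern).match(filename)
        if m:
            matches[rule] = m.group("sample")
    return matches

print(matching_rules("trimmed_sample1_R1.fastq.gz"))
```

Both rules match the same file, once with the sample read as sample1_R1 and once as sample1, and that is the tie the ruleorder line breaks.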
The command in the alignment rule needs 1 or 2 inputs. I intend to use an if/else in params directive to choose the inputs. Am I correct to think that? (I think you did that as well in your pipeline)
Yes, that's the way we solved it. We did it that way since we want every rule to have its own environment. If you do not use a separate conda environment for alignment, then you can do it cleaner/prettier, like so:
rule alignment:
    input:
        unpack(get_fastqs)
    output:
        "somepath/{sample}.bam"
    run:
        if layouts[wildcards.sample] == "single":
            shell("single-end command")
        elif layouts[wildcards.sample] == "paired":
            shell("paired-end command")
I feel like this option is much clearer than what we did in the seq2science pipeline. However in the seq2science pipeline we support many different aligners and they all have a different conda environment, so the run directive can not be used.

Snakemake live user input

I have a bunch of R scripts that follow one another and I wanted to connect them using Snakemake. But I'm running into a problem.
One of my R scripts shows two images and asks for the user's input on how many clusters are present. The R function used for this is readline.
The question about the number of clusters is printed, but the next line of code runs immediately, without an opportunity to enter a number. The rest of the program then crashes, since calculating an empty number of clusters doesn't really work. Is there a way around this, for example by getting the value via a function/rule from Snakemake,
or is there another way to work around this issue?
Based on my testing with snakemake v5.8.2 in MacOS, this is not a snakemake issue. Example setup below works without any problem.
File test.R
cat("What's your name? ")
x <- readLines(file("stdin"),1)
print(x)
File Snakefile
rule all:
    input:
        "a.txt",
        "b.txt"
rule test_rule:
    output:
        "{sample}.txt"
    shell:
        "Rscript test.R; touch {output}"
Executing them with the command snakemake -p behaves as expected. That is, it asks for user input and then touches the output file.
I used the function readLines in the R script, but this example shows that the error you are facing is likely not a snakemake issue.

How to properly use batch flag in snakemake to subset DAG

I have a workflow written in snakemake that balloons at one point to a large number of output files. This happens because there are many combinations of my wildcards yielding on the order of 60,000 output files.
With this number of input files, DAG generation is very slow for subsequent rules to the point of being unusable. In the past, I've solved this issue by iterating with bash loops across subsetted configuration files where all but one of the wildcards is commented out. For example, in my case I had one wildcard (primer) that had 12 possible values. By running snakemake iteratively for each value of "primer", it divided up the workflow into digestible chunks (~5000 input files). With this strategy DAG generation is quick and the workflow proceeds well.
Now I'm interested in using the new --batch flag available in snakemake 5.7.4 since Johannes suggested to me on twitter that it should basically do the same thing I did with bash loops and subsetted config files.
However, I'm running into an error that's confusing me. I don't know if this is an issue I should be posting on the github issue tracker or I'm just missing something basic.
The rule my workflow is failing on is as follows:
rule get_fastq_for_subset:
    input:
        fasta="compute-workflow-intermediate/06-subsetted/{sample}.SSU.{direction}.{group}_pyNAST_{primer}.fasta",
        fastq="compute-workflow-intermediate/04-sorted/{sample}.{direction}.SSU.{group}.fastq"
    output:
        fastq=temp("compute-workflow-intermediate/07-subsetted-fastq/{sample}.SSU.{direction}.{group}_pyNAST_{primer}.full.fastq"),
        fastq_revcomp="compute-workflow-intermediate/07-subsetted-fastq/{sample}.SSU.{direction}.{group}_pyNAST_{primer}.full.revcomped.fastq"
    conda:
        "envs/bbmap.yaml"
    shell:
        "filterbyname.sh names={input.fasta} include=t in={input.fastq} out={output.fastq} ; "
        "revcompfastq_according_to_pyNAST.py --inpynast {input.fasta} --infastq {output.fastq} --outfastq {output.fastq_revcomp}"
I set up a target in my all rule to generate the output specified in the rule, then tried running snakemake with the batch flag as follows:
snakemake --configfile config/config.yaml --batch get_fastq_for_subset=1/20 --snakefile Snakefile-compute.smk --cores 40 --use-conda
It then fails with this message:
WorkflowError:
Batching rule get_fastq_for_subset has less input files than batches. Please choose a smaller number of batches.
Now the thing I'm confused about is that my input for this rule should actually be on the order of 60,000 files. Yet it appears snakemake is getting fewer than 20 input files. Perhaps it's only counting 2, as in the number of input files that are specified by the rule? But this doesn't really make sense to me...
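The guess that only the rule's own two inputs are counted can be illustrated with a sketch. It assumes, as I understand the documentation of --batch, that batching splits the input files of the named rule into groups; `select_batch` and the file names below are hypothetical and this is not Snakemake's actual code.

```python
def select_batch(input_files, batch, batches):
    """Return the `batch`-th of `batches` roughly equal slices (1-based)."""
    if len(input_files) < batches:
        raise ValueError(
            f"{len(input_files)} input files cannot be split "
            f"into {batches} batches"
        )
    size = len(input_files) // batches
    start = (batch - 1) * size
    # The last batch takes any remainder.
    end = len(input_files) if batch == batches else start + size
    return input_files[start:end]

# One job of get_fastq_for_subset has two inputs (fasta and fastq),
# so 20 batches cannot be formed from them.
per_job_inputs = ["x.fasta", "x.fastq"]

# An aggregating rule (e.g. "all") would list every target as input.
all_targets = [f"target_{i}.fastq" for i in range(60000)]
print(len(select_batch(all_targets, 1, 20)))
```

If this reading is right, the batch would have to be defined on an aggregating rule whose input lists all ~60,000 targets, rather than on get_fastq_for_subset itself.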
The source code for snakemake is here, showing how snakemake counts things but it's a bit beyond me:
https://snakemake.readthedocs.io/en/stable/_modules/snakemake/dag.html
If anyone has any idea what's going on, I'd be very grateful for your help!
Best,
Jesse

How can Snakemake be made to update files in a hierarchical rule-based manner when a new file appears at the bottom of the hierarchy?

I have a snakefile with dozens of rules, and it processes thousands of files. This is a bioinformatics pipeline for DNA sequencing analysis. Today I added two more samples to my set of samples, and I expected to be able to run snakemake and it would automatically determine which rules to run on which files to process the new sample files and all files that depend on them on up the hierarchy to the very top level. However, it does nothing. And the -R option doesn't do it either.
The problem is illustrated with this snakefile:
> cat tst
rule A:
    output: "test1.txt"
    input: "test2.txt"
    shell: "cp {input} {output}"
rule B:
    output: "test2.txt"
    input: "test3.txt"
    shell: "cp {input} {output}"
rule C:
    output: "test3.txt"
    input: "test4.txt"
    shell: "cp {input} {output}"
rule D:
    output: "test4.txt"
    input: "test5.txt"
    shell: "cp {input} {output}"
Execute it as follows:
> rm test*.txt
> touch test2.txt
> touch test1.txt
> snakemake -s tst -F
Output is:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 A
1
rule A:
input: test2.txt
output: test1.txt
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
Since test5.txt does not exist, I expected an error message to that effect, but it did not happen. And of course, test3.txt and test4.txt do not exist.
Furthermore, using -R instead of -F results in "Nothing to be done."
Using "-R A" runs rule A only.
This relates to my project in that it shows that Snakemake does not analyze the entire dependent tree if you tell it to build a rule at the top of the tree and that rule's output and input files already exist. And the -R option does not force it either. When I tried -F on my project, it started rebuilding the entire thing, including files that did not need to be rebuilt.
It seems to me that this is fundamental to what Snakemake should be doing, and I just don't understand it. The only way I can see to get my pipeline to analyze the new samples is to individually invoke each rule required for the new files, in order. And that is way too tedious and is one reason why I used Snakemake in the first place.
Help!
Snakemake does not automatically trigger re-runs when adding new input files (e.g. samples) to the DAG. However, you can enforce this as outlined in the FAQ.
The reason for not doing this by default is mostly consistency: in order to do this, Snakemake needs to store meta information. Hence, if the meta information is lost, you would have a different behavior than if it was there.
However, I might change this in the future. With such fundamental changes though, I am usually very careful in order to not forget a counter example where the current default behavior is of advantage.
Remember that snakemake wants to satisfy the dependency of the first rule and builds the graph by pulling additional dependencies through the rest of the graph to satisfy that initial dependency. By touching test2.txt you've satisfied the dependency for the first rule, so nothing more needs to be done. Even with -R A nothing else needs to be run to satisfy the dependency of rule A - the files already exist.
Snakemake definitely does do what you want (add new samples and the entire rule graph runs on those samples) and you don't need to individually invoke each rule, but it seems to me that you might be thinking of the dependencies wrong. I'm not sure I fully understand where your new samples fit into the tst example you've given, but I see at least two possibilities.
Your graph dependency runs D->C->B->A, so if you're thinking that you've added new input data at the top (i.e. a new sample as test5.txt in rule D), then you need to be sure that you have a dependency at your endpoint (test2.txt in rule A). By touching test2.txt you've just completed your pipeline, so no dependencies exist. If you touch test5.txt instead (that's your new data), then your example works and the entire graph runs.
Since you touched test1.txt and test2.txt in your example execution, maybe you intended those to represent the new samples. If so, then you need to rethink your dependency graph, because adding them doesn't create a dependency on the rest of the graph. In your example, the test2.txt file is your terminal dependency (the final dependency of your workflow, not the input to it). In your tst example, new data needs to come in as test5.txt as input to rule D (the top of your graph) and get pulled through the dependency graph to satisfy an input dependency of rule A, which is test2.txt. If you're thinking of either test1.txt or test2.txt as your new input, then you need to remember that snakemake pulls data through the graph to satisfy dependencies at the bottom of the graph; it doesn't push data from the top down. Run snakemake -F --rulegraph to see that the graph runs D->C->B->A, so new data needs to come in as input to rule D and be pulled through the graph as a dependency of rule A.
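The "pull" behaviour described in this answer can be sketched in a few lines of Python. This is a simplified model, not Snakemake's algorithm: it ignores timestamps and stored metadata, and `plan` and `force_target` are hypothetical names.

```python
# Rules from the tst example: rule name -> (output, input).
RULES = {
    "A": ("test1.txt", "test2.txt"),
    "B": ("test2.txt", "test3.txt"),
    "C": ("test3.txt", "test4.txt"),
    "D": ("test4.txt", "test5.txt"),
}

def plan(target, existing, force_target=False):
    """Return rule names to run, in order, to produce `target`.

    Works backwards from the target and stops at any file that already
    exists. Timestamps are ignored, which is a simplification.
    """
    producers = {out: (name, inp) for name, (out, inp) in RULES.items()}
    if target in existing and not force_target:
        return []
    if target not in producers:
        raise FileNotFoundError(f"no rule produces missing file {target}")
    name, inp = producers[target]
    return plan(inp, existing) + [name]

# test1.txt and test2.txt were touched: forcing the target runs only A.
print(plan("test1.txt", {"test1.txt", "test2.txt"}, force_target=True))
# Only the new data test5.txt exists: the whole chain is pulled in.
print(plan("test1.txt", {"test5.txt"}))
```

With test2.txt present the walk stops there, so test5.txt is never looked at, which mirrors the run in the question where only rule A executed and no error about test5.txt appeared.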