How to properly use batch flag in snakemake to subset DAG - snakemake

I have a workflow written in snakemake that balloons at one point to a large number of output files. This happens because there are many combinations of my wildcards yielding on the order of 60,000 output files.
With this number of input files, DAG generation is very slow for subsequent rules to the point of being unusable. In the past, I've solved this issue by iterating with bash loops across subsetted configuration files where all but one of the wildcards is commented out. For example, in my case I had one wildcard (primer) that had 12 possible values. By running snakemake iteratively for each value of "primer", it divided up the workflow into digestible chunks (~5000 input files). With this strategy DAG generation is quick and the workflow proceeds well.
Now I'm interested in using the new --batch flag available in snakemake 5.7.4 since Johannes suggested to me on twitter that it should basically do the same thing I did with bash loops and subsetted config files.
However, I'm running into an error that's confusing me. I don't know if this is an issue I should be posting on the github issue tracker or I'm just missing something basic.
The rule my workflow is failing on is as follows:
rule get_fastq_for_subset:
input:
fasta="compute-workflow-intermediate/06-subsetted/{sample}.SSU.{direction}.{group}_pyNAST_{primer}.fasta",
fastq="compute-workflow-intermediate/04-sorted/{sample}.{direction}.SSU.{group}.fastq"
output:
fastq=temp("compute-workflow-intermediate/07-subsetted-fastq/{sample}.SSU.{direction}.{group}_pyNAST_{primer}.full.fastq"),
fastq_revcomp="compute-workflow-intermediate/07-subsetted-fastq/{sample}.SSU.{direction}.{group}_pyNAST_{primer}.full.revcomped.fastq"
conda:
"envs/bbmap.yaml"
shell:
"filterbyname.sh names={input.fasta} include=t in={input.fastq} out={output.fastq} ; "
"revcompfastq_according_to_pyNAST.py --inpynast {input.fasta} --infastq {output.fastq} --outfastq {output.fastq_revcomp}"
I set up a target in my all rule to generate the output specified in the rule, then tried running snakemake with the batch flag as follows:
snakemake --configfile config/config.yaml --batch get_fastq_for_subset=1/20 --snakefile Snakefile-compute.smk --cores 40 --use-conda
It then fails with this message:
WorkflowError:
Batching rule get_fastq_for_subset has less input files than batches. Please choose a smaller number of batches.
Now the thing I'm confused about is that my input for this rule should actually be on the order of 60,000 files. Yet it appears snakemake getting less than 20 input files. Perhaps it's only counting 2, as in the number of input files that are specified by the rule? But this doesn't really make sense to me...
The source code for snakemake is here, showing how snakemake counts things but it's a bit beyond me:
https://snakemake.readthedocs.io/en/stable/_modules/snakemake/dag.html
If anyone has any idea what's going on, I'd be very grateful for your help!
Best,
Jesse

Related

snakemake "rule all" is rerunning unnecessary rules

I'm running into a case where snakemake is rerunning rules that have already been run, even though the output from those rules is still present. I am specifying all the desired output files in a "rule all". I ran the pipeline the point where I had all the desired outputs from "rule B", and wanted to restart the pipeline and just run rule A. But snakemake reruns "rule B" even though all the outputs from "rule B" are already present. This isn't the behavior I expect from snakemake, which should only rerun the rules necessary to get to a target (here specified in the rule all).
When I run snakemake in a dry-run mode, this is at the end of the output.
Job counts:
count jobs
1 count_matrix
27 picard_fastq2sam
27 star_align
55
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
The output from picard_fastq2sam is used by star_align and all the outputs from star_align are used for count_matrix. So there should only one rule to run, because the outputs from "picard_fastq2sam" and "star_align" are all already present. My rule all looks like this:
This workflow is based on the template from https://github.com/snakemake-workflows/rna-seq-star-deseq2, but I've modified it enough that I thought I should post here.
snakemake version is 6.0.5
Any hints on where to look? This is really the opposite of the behavior I expect from snakemake.
When I looked further into the debug-dag output, I saw a bunch of these blocks of text:
candidate job star_align
wildcards: sample=8R_S14, unit=lane1
candidate job picard_fastq2sam
wildcards: sample=8R_S14, unit=lane1
selected job picard_fastq2sam
wildcards: sample=8R_S14, unit=lane1
file results/picard_fastq2sam/8R_S14-lane1.unaligned.bam:
Producer found, hence exceptions are ignored.
selected job star_align
wildcards: sample=8R_S14, unit=lane1
file results/star/8R_S14-lane1.ReadsPerGene.out.tab:
Producer found, hence exceptions are ignored.
I'm not sure what means, but it seems relevant.
I had forgotten that snakemake checks the timestamp of files in addition to whether they are present or not. I figured this out when I found this post: How to not rerule updated files with Snakemake
and fixed in my case by rerunning with "--touch" and "--forceall" to update the timestamps of all the output files.

Snakemake using the same input and output in a rule

Is it possible to use the same input and output in a rule?
For example,
rule example:
input:
"/path/to/my/data"
output:
"/path/to/my/data"
shell:
"my_command {input}"
I am pulling data from a previous rule, and am trying to move some of its outputs around, and merge files together.
I appreciate any help!
In a nutshell, no. Snakemake builds a DAG (directed acyclic graph) and then makes the dependencies for each node required by a target. In your case you are introducing a loop.
Anyway, from your description I don't see any reason for this cycle:
I am pulling data from a previous rule, and am trying to move some of
its outputs around, and merge files together.
That can be done in a "normal" way.

Snakemake live user input

I have an a bunch of R scripts that follow one another and I wanted to connect them using Snakemake. But I’m running in a problem.
One of my R scripts shows two images and asks a user’s input on how many cluster there are present. The R function for this is [readline]
This query on how many clusters is asked but directly after the next line of code is run. Without an opportunity to input a number. the rest of the program crashes, since trying to calculate (empty number) of clusters doesn’t really work. Is there a way around this. By getting the values via a function/rule from Snakemake
or is there a other way to work around this issue?
Based on my testing with snakemake v5.8.2 in MacOS, this is not a snakemake issue. Example setup below works without any problem.
File test.R
cat("What's your name? ")
x <- readLines(file("stdin"),1)
print(x)
File Snakefile
rule all:
input:
"a.txt",
"b.txt"
rule test_rule:
output:
"{sample}.txt"
shell:
"Rscript test.R; touch {output}"
Executing them with command snakemake -p behaves as expected. That is, it asks for user input and then touch output file.
I used function readLines in R script, but this example shows that error you are facing is likely not a snakemake issue.

How to run rule even when some of its inputs are missing?

In the first step of my process, I am extracting some hourly data from a database. Because of things data is sometimes missing for some hours resulting in files. As long as the amount of missing files is not too large I still want to run some of the rules that depend on that data. When running those rules I will check how much data is missing and then decide if I want to generate an error or not.
An example below. The Snakefile:
rule parse_data:
input:
"data/1.csv", "data/2.csv", "data/3.csv", "data/4.csv"
output:
"result.csv"
shell:
"touch {output}"
rule get_data:
output:
"data/{id}.csv"
shell:
"Rscript get_data.R {output}"
And my get_data.R script:
output <- commandArgs(trailingOnly = TRUE)[1]
if (output == "data/1.csv")
stop("Some error")
writeLines("foo", output)
How do I force running of the rule parse_data even when some of it's inputs are missing? I do not want to force running any other rules when input is missing.
One possible solution would be to generate, for example, an empty file in get_data.R when the query failed. However, in practice I am also using --restart-times 5 when running snakemake as the query can also fail because of database timeouts. When creating an empty file this mechanism of retrying the queries would no longer work.
You need data-dependent conditional execution.
Use a checkpoint on get_data. Then you replace parse_data's input with a function, that aggregates whatever files do exist.
(note that I am a Snakemake newbie and am just learning this myself, I hope this is helpful)

How can Snakemake be made to update files in a hierarchical rule-based manner when a new file appears at the bottom of the hierarchy?

I have a snakefile with dozens of rules, and it processes thousands of files. This is a bioinformatics pipeline for DNA sequencing analysis. Today I added two more samples to my set of samples, and I expected to be able to run snakemake and it would automatically determine which rules to run on which files to process the new sample files and all files that depend on them on up the hierarchy to the very top level. However, it does nothing. And the -R option doesn't do it either.
The problem is illustrated with this snakefile:
> cat tst
rule A:
output: "test1.txt"
input: "test2.txt"
shell: "cp {input} {output}"
rule B:
output: "test2.txt"
input: "test3.txt"
shell: "cp {input} {output}"
rule C:
output: "test3.txt"
input: "test4.txt"
shell: "cp {input} {output}"
rule D:
output: "test4.txt"
input: "test5.txt"
shell: "cp {input} {output}"
Execute it as follows:
> rm test*.txt
> touch test2.txt
> touch test1.txt
> snakemake -s tst -F
Output is:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 A
1
rule A:
input: test2.txt
output: test1.txt
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
Since test5.txt does not exist, I expected an error message to that effect, but it did not happen. And of course, test3.txt and test4.txt do not exist.
Furthermore, using -R instead of -F results in "Nothing to be done."
Using "-R A" runs rule A only.
This relates to my project in that it shows that Snakemake does not analyze the entire dependent tree if you tell it to build a rule at the top of the tree and that rule's output and input files already exist. And the -R option does not force it either. When I tried -F on my project, it started rebuilding the entire thing, including files that did not need to be rebuilt.
It seems to me that this is fundamental to what Snakemake should be doing, and I just don't understand it. The only way I can see to get my pipeline to analyze the new samples is to individually invoke each rule required for the new files, in order. And that is way too tedious and is one reason why I used Snakemake in the first place.
Help!
Snakemake does not automatically trigger re-runs when adding new input files (e.g. samples) to the DAG. However, you can enforce this as outlined in the FAQ.
The reason for not doing this by default is mostly consistency: in order to do this, Snakemake needs to store meta information. Hence, if the meta information is lost, you would have a different behavior than if it was there.
However, I might change this in the future. With such fundamental changes though, I am usually very careful in order to not forget a counter example where the current default behavior is of advantage.
Remember that snakemake wants to satisfy the dependency of the first rule and builds the graph by pulling additional dependencies through the rest of the graph to satisfy that initial dependency. By touching test2.txt you've satisfied the dependency for the first rule, so nothing more needs to be done. Even with -R A nothing else needs to be run to satisfy the dependency of rule A - the files already exist.
Snakemake definitely does do what you want (add new samples and the entire rule graph runs on those samples) and you don't need to individually invoke each rule, but it seems to me that you might be thinking of the dependencies wrong. I'm not sure I fully understand where your new samples fit into the tst example you've given but I see at least two possibilites.
Your graph dependency runs D->C->B->A, so if you're thinking that you've added new input data at the top (i.e. a new sample as test5.txt in rule D), then you need to be sure that you have a dependency at your endpoint (test2.txt in rule A). By touching test2.txt you've just completed your pipeline, so no dependencies exist. If touch test5.txt (that's your new data) then your example works and the entire graph runs.
Since you touched test1.txt and test2.txt in your example execution maybe you intended those to represent the new samples. If so then you need to rethink your dependency graph, because adding them doesn't create a dependency on the rest of the graph. In your example, the test2.txt file is your terminal dependency (the final dependency of your workflow not the input to it). In your tst example new data needs come is as test5.txt as input to rule D (the top of your graph) and get pulled through the dependency graph to satisfy an input dependency of rule A which is test2.txt. If you're thinking of either test1.txt or test2.txt as your new input then you need to remember that snakemake pulls data through the graph to satisfy dependencies at the bottom of the graph, it doesn't push data from the top down. Run snakemake -F --rulegraph see that the graph runs D->C->B->A and so new data needs to come is as input to rule D and be pulled through the graph as a dependency to rule A.