Snakemake expand function alternative - snakemake

I have been struggling for some time to build a workflow with many inputs and a single output, as shown below. The code works up to a point, but when there are too many input files the concatenate step invariably fails:
rule generate_text:
    input:
        "data/{name}.csv"
    output:
        "text_files/{name}.txt"
    shell:
        "somecommand {input} -o {output}"

rule concatenate_text:
    input:
        expand("text_files/{name}.txt", name=names)
    output:
        "summaries/summary.txt"
    shell:
        "cat {input} > {output}"
I have done some digging and found that this is attributable to a limitation on the number of characters that can be put in a single command. I am working with increasingly large numbers of inputs and therefore the above solution is not scalable.
Can anybody please propose any solutions to this issue? I haven't been able to find any online.
Ideally the solution wouldn't be one limited to just cat or other shell commands and could be employed within the structure of a rule in cases where --use-conda can be employed. My current fix involves using an onsuccess script as follows, but this doesn't allow use of --use-conda and rule specific conda environments.
One handy thing about the shell command is that you can feed it Snakemake variables, but it's not quite flexible enough for my purposes because of the conda issue mentioned above.
onsuccess:
    shell("cat text_files/*.txt > summaries/summary.txt")
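One possible way around the command-length limit, offered only as a sketch: do the concatenation in Python so the file list is never expanded into a shell command. The helper name `concatenate` is made up for this illustration and is not part of Snakemake.

```python
import shutil
from pathlib import Path


def concatenate(inputs, output):
    """Append each input file to `output` without building a shell command."""
    out_path = Path(output)
    # Create the output directory (e.g. summaries/) if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("wb") as dst:
        for path in inputs:
            with open(path, "rb") as src:
                # Stream the file in chunks, so large inputs are not held in memory.
                shutil.copyfileobj(src, dst)
```

If this lives in a file referenced through the `script:` directive, the call would presumably be `concatenate(snakemake.input, snakemake.output[0])`, and the rule can then declare its own conda environment. The same function could also be called from a `run:` block, but `run:` blocks cannot be combined with rule-specific conda environments.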

Related

How to target intermediary Snakemake rule that contains wildcards

I have a workflow that, very simplified for this question, looks as follows:
rule all:
    input: multiext("final",".a",".b",".c",".d")

rule final_cheap:
    input: "intermediary.{ext}"
    output: "final.{ext}"
    #dummy for cheap but complicated operation
    shell: "cp {input} {output}"

rule intermediary_cheap:
    input: "start.{ext}"
    output: "intermediary.{ext}"
    #dummy for cheap complicated operation
    shell: "cp {input} {output}"

rule start_expensive:
    output: "start.{ext}"
    #dummy for very expensive operation
    shell: "touch {output}"
There's a very expensive first step and two complicated steps that follow.
After I've run this workflow once with snakemake -c1 I want to rerun the workflow but just from the intermediary rule onwards. How can I achieve this goal with command line flags?
snakemake intermediary_cheap all does not work, because intermediary_cheap contains wildcards, even though the inclusion of all really shows the values of the required wildcards.
Is there a command line flag that tells snakemake to run the rule and ignore all existing output from the rule intermediary_cheap, something like snakemake all --forcerule=intermediary_cheap? (I invented that --forcerule flag; it doesn't exist as far as I know.)
The workaround I'm using right now is manually deleting the output of the rule intermediary_cheap, then forcing execution of the rule with --force and then running rule all, which notices that some upstream inputs have changed. But this requires knowledge of the precise file names that are produced, whereas knowledge of rules only would be preferable because it is at a higher level of abstraction.
I haven't used it before but I think you want:
snakemake -c 1 --forcerun intermediary_cheap
--forcerun [TARGET [TARGET ...]], -R [TARGET [TARGET ...]]
    Force the re-execution or creation of the given rules
    or files. Use this option if you changed a rule and
    want to have all its output in your workflow updated.
    (default: None)

Snakemake live user input

I have a bunch of R scripts that follow one another, and I want to connect them using Snakemake. But I'm running into a problem.
One of my R scripts shows two images and asks for the user's input on how many clusters are present. The R function for this is readline.
The question about the number of clusters is asked, but the next line of code runs immediately afterwards, without any opportunity to enter a number. The rest of the program then crashes, since calculating an empty number of clusters doesn't work. Is there a way around this, for example by getting the value via a function or rule from Snakemake, or is there another way to work around this issue?
Based on my testing with snakemake v5.8.2 on macOS, this is not a snakemake issue. The example setup below works without any problem.
File test.R
cat("What's your name? ")
x <- readLines(file("stdin"),1)
print(x)
File Snakefile
rule all:
    input:
        "a.txt",
        "b.txt"

rule test_rule:
    output:
        "{sample}.txt"
    shell:
        "Rscript test.R; touch {output}"
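Running them with snakemake -p behaves as expected: it asks for user input and then touches the output file.
I used the readLines function in the R script rather than readline (readline does not wait for input when R runs non-interactively, as it does under Rscript, whereas readLines on stdin does). Either way, this example shows that the error you are facing is likely not a snakemake issue.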

How to run rule even when some of its inputs are missing?

In the first step of my process, I extract hourly data from a database. For various reasons, data is sometimes missing for some hours, resulting in missing files. As long as the number of missing files is not too large, I still want to run some of the rules that depend on that data. When running those rules, I will check how much data is missing and then decide whether to raise an error.
An example below. The Snakefile:
rule parse_data:
    input:
        "data/1.csv", "data/2.csv", "data/3.csv", "data/4.csv"
    output:
        "result.csv"
    shell:
        "touch {output}"

rule get_data:
    output:
        "data/{id}.csv"
    shell:
        "Rscript get_data.R {output}"
And my get_data.R script:
output <- commandArgs(trailingOnly = TRUE)[1]
if (output == "data/1.csv")
  stop("Some error")
writeLines("foo", output)
How do I force the rule parse_data to run even when some of its inputs are missing? I do not want to force any other rules to run when input is missing.
One possible solution would be to generate, for example, an empty file in get_data.R when the query failed. However, in practice I am also using --restart-times 5 when running snakemake as the query can also fail because of database timeouts. When creating an empty file this mechanism of retrying the queries would no longer work.
You need data-dependent conditional execution.
Use a checkpoint on get_data. Then replace parse_data's input with a function that aggregates whatever files do exist.
(note that I am a Snakemake newbie and am just learning this myself, I hope this is helpful)
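The aggregation function mentioned above might look roughly like the following. This is a sketch in plain Python only: `existing_inputs` and `max_missing` are hypothetical names, and the wiring to the checkpoint inside the Snakefile is not shown.

```python
import os


def existing_inputs(expected, max_missing):
    """Return the expected files that exist; fail if too many are missing."""
    present = [path for path in expected if os.path.exists(path)]
    missing = len(expected) - len(present)
    if missing > max_missing:
        # Too much data is absent, so refuse to run the downstream rule.
        raise ValueError(f"{missing} of {len(expected)} input files are missing")
    return present
```

An input function for parse_data could call this with the list of expected data/{id}.csv paths, so that the rule only receives files that were actually produced and the tolerance check happens in one place.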

Varying (known) number of outputs in Snakemake

I have a Snakemake rule that works on a data archive and essentially unpacks the data in it. The archives contain a varying number of files that I know before my rule starts, so I would like to exploit this and do something like
rule unpack:
    input: '{id}.archive'
    output:
        lambda wildcards: ARCHIVE_CONTENTS[wildcards.id]
but I can't use functions in output, and for good reason. However, I can't come up with a good replacement. The rule is very expensive to run, so I cannot do
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
and run the rule several times for each archive. Another alternative could be
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
    run:
        if os.path.isfile(output[0]):
            return
        ...
but I am afraid that would introduce a race condition.
Is marking the rule output with dynamic really the only option? I would be fine with auto-generating a separate rule for every archive, but I haven't found a way to do so.
Here, it becomes handy that Snakemake is an extension of plain Python. You can generate a separate rule for each archive:
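for id, contents in ARCHIVE_CONTENTS.items():
    rule:
        input:
            '{id}.tar.gz'.format(id=id)
        output:
            # pass id explicitly so expand() can fill both placeholders
            expand('{id}/{outfile}', id=id, outfile=contents)
        params:
            # the generated rule has no wildcards, so carry the directory as a param
            outdir=id
        shell:
            'tar -C {params.outdir} -xf {input}'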
Depending on what kind of archive this is, you could also have a single rule that just extracts the desired file, e.g.:
rule unpack:
    input:
        '{id}.tar.gz'
    output:
        '{id}/{outfile}'
    shell:
        'tar -C {wildcards.id} -xf {input} {wildcards.outfile}'
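To make the first approach more concrete, here is a small plain-Python sketch of the output lists that one generated rule per archive would declare. The archive ids and file names in `ARCHIVE_CONTENTS` are invented for illustration, and `outputs_for` is a hypothetical helper, not a Snakemake function.

```python
# Hypothetical archive listing: archive id -> files contained in that archive.
ARCHIVE_CONTENTS = {
    "run1": ["reads.txt", "meta.txt"],
    "run2": ["reads.txt"],
}


def outputs_for(archive_id, contents):
    """Build the concrete output paths one generated rule would declare."""
    return [f"{archive_id}/{name}" for name in contents]


# One entry per archive, mirroring one generated rule per archive.
RULE_OUTPUTS = {
    archive_id: outputs_for(archive_id, contents)
    for archive_id, contents in ARCHIVE_CONTENTS.items()
}
```

Because each rule lists all files of its archive as outputs, the expensive unpacking runs once per archive rather than once per file.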

How can Snakemake be made to update files in a hierarchical rule-based manner when a new file appears at the bottom of the hierarchy?

I have a snakefile with dozens of rules, and it processes thousands of files. This is a bioinformatics pipeline for DNA sequencing analysis. Today I added two more samples to my set of samples, and I expected to be able to run snakemake and it would automatically determine which rules to run on which files to process the new sample files and all files that depend on them on up the hierarchy to the very top level. However, it does nothing. And the -R option doesn't do it either.
The problem is illustrated with this snakefile:
> cat tst
rule A:
    output: "test1.txt"
    input: "test2.txt"
    shell: "cp {input} {output}"

rule B:
    output: "test2.txt"
    input: "test3.txt"
    shell: "cp {input} {output}"

rule C:
    output: "test3.txt"
    input: "test4.txt"
    shell: "cp {input} {output}"

rule D:
    output: "test4.txt"
    input: "test5.txt"
    shell: "cp {input} {output}"
Execute it as follows:
> rm test*.txt
> touch test2.txt
> touch test1.txt
> snakemake -s tst -F
Output is:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       A
    1
rule A:
    input: test2.txt
    output: test1.txt
    jobid: 0
Finished job 0.
1 of 1 steps (100%) done
Since test5.txt does not exist, I expected an error message to that effect, but it did not happen. And of course, test3.txt and test4.txt do not exist.
Furthermore, using -R instead of -F results in "Nothing to be done."
Using "-R A" runs rule A only.
This relates to my project in that it shows that Snakemake does not analyze the entire dependent tree if you tell it to build a rule at the top of the tree and that rule's output and input files already exist. And the -R option does not force it either. When I tried -F on my project, it started rebuilding the entire thing, including files that did not need to be rebuilt.
It seems to me that this is fundamental to what Snakemake should be doing, and I just don't understand it. The only way I can see to get my pipeline to analyze the new samples is to individually invoke each rule required for the new files, in order. And that is way too tedious and is one reason why I used Snakemake in the first place.
Help!
Snakemake does not automatically trigger re-runs when adding new input files (e.g. samples) to the DAG. However, you can enforce this as outlined in the FAQ.
The reason for not doing this by default is mostly consistency: in order to do this, Snakemake needs to store meta information. Hence, if the meta information is lost, you would have a different behavior than if it was there.
However, I might change this in the future. With such fundamental changes though, I am usually very careful in order to not forget a counter example where the current default behavior is of advantage.
Remember that snakemake wants to satisfy the dependency of the first rule and builds the graph by pulling additional dependencies through the rest of the graph to satisfy that initial dependency. By touching test2.txt you've satisfied the dependency for the first rule, so nothing more needs to be done. Even with -R A nothing else needs to be run to satisfy the dependency of rule A - the files already exist.
Snakemake definitely does do what you want (add new samples and the entire rule graph runs on those samples), and you don't need to invoke each rule individually, but it seems to me that you might be thinking about the dependencies the wrong way. I'm not sure I fully understand where your new samples fit into the tst example you've given, but I see at least two possibilities.
Your dependency graph runs D->C->B->A, so if you're thinking that you've added new input data at the top (i.e. a new sample as test5.txt in rule D), then you need to be sure that you have an unsatisfied dependency at your endpoint (test2.txt in rule A). By touching test2.txt you've just completed your pipeline, so no dependencies remain. If you touch test5.txt (that's your new data), then your example works and the entire graph runs.
Since you touched test1.txt and test2.txt in your example execution, maybe you intended those to represent the new samples. If so, you need to rethink your dependency graph, because adding them doesn't create a dependency on the rest of the graph. In your example, the test2.txt file is your terminal dependency (the final dependency of your workflow, not the input to it). In your tst example, new data needs to come in as test5.txt, as input to rule D (the top of your graph), and get pulled through the dependency graph to satisfy an input dependency of rule A, which is test2.txt. If you're thinking of either test1.txt or test2.txt as your new input, then you need to remember that snakemake pulls data through the graph to satisfy dependencies at the bottom of the graph; it doesn't push data from the top down. Run snakemake -F --rulegraph to see that the graph runs D->C->B->A, so new data needs to come in as input to rule D and be pulled through the graph as a dependency of rule A.
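The pull-based behaviour described above can be illustrated with a toy model in plain Python. This is only a sketch and not Snakemake's actual algorithm: it ignores timestamps and metadata, and `plan` and its `force_target` parameter are hypothetical names invented for this illustration.

```python
# Toy model of the tst example: output file -> (rule name, input file).
RULES = {
    "test1.txt": ("A", "test2.txt"),
    "test2.txt": ("B", "test3.txt"),
    "test3.txt": ("C", "test4.txt"),
    "test4.txt": ("D", "test5.txt"),
}


def plan(target, existing, force_target=False):
    """Return the rule names to run, in execution order, to produce `target`."""
    if target in existing and not force_target:
        return []  # dependency already satisfied, so nothing is pulled
    if target not in RULES:
        raise FileNotFoundError(f"no rule produces missing file {target}")
    name, needed_input = RULES[target]
    # Pull the input first; upstream rules only run if that input is missing.
    return plan(needed_input, existing) + [name]
```

In this model, `plan("test1.txt", {"test1.txt", "test2.txt"})` returns an empty list, and forcing only the target yields just rule A because test2.txt already satisfies its input. With only test5.txt present, the whole chain D, C, B, A is planned.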