Snakemake: dynamic + non-dynamic output

I have a use case in which a rule generates an arbitrary number of "checkpoint" files and a single output file. For example, "example.input" would produce:
example_000.checkpoint
example_001.checkpoint
...
example_093.checkpoint (arbitrary number here)
example.output (guaranteed non-dynamic output)
The checkpoints are intended to be used to restart from that point in the calculation, but I have additional use for them. However, I only need the first (e.g., example_000.checkpoint) and the last (e.g., example_093.checkpoint). How can I construct a rule such that my outputs are defined as:
rule example:
    input:
        "{id}.input"
    output:
        non_dynamic = "{id}.output",
        first = "{id}_{first}.checkpoint",
        last = "{id}_{last}.checkpoint"
        # OR
        checkpoints = dynamic("{id}_{checkpoint}.checkpoint")
If I define new wildcards, I get the error "Not all output files of rule example contain the same wildcards." If I try to use dynamic output, I get the error "A rule with dynamic output may not define any non-dynamic output files."
Thanks in advance for any help!

Related

What are different methods used for naming snakemake pipeline output files that depends on multiple variables?

I wrote a snakemake pipeline which is intended to be run again with different variables provided by the user in a new config file during each run.
config.yml:
param_a: 100 #filter dataset rule1
param_b: 200 #filter sample rule2
param_c: 300 #filter sample again rule3
config2.yml:
param_a: 150 #100->150
param_b: 200
param_c: 300
Snakefile:
rule rule1:
    # dataset is filtered by param_a
    output: "{dataset}_{param_a}/{sample}"
rule rule2:
    # sample is filtered by param_b
    output: "{dataset}_{param_a}/{sample}_{param_b}"
rule rule3:
    # sample is then filtered by param_c
    output: "{dataset}_{param_a}/{sample}_{param_b}_{param_c}"
The aim is to let the user rerun the analyses with different options at different steps, without having to rerun everything upstream of the step whose parameter changed.
When we have too many such parameters, the directory and file names get too long, e.g.:
dataset1/sample-minSize200_samtools-F4-F1024-q20_mosdepth-minDepth4-maxDepth100_bedtools-merge-gap200_angsd-minQ20_loci-maxBase100/mysample.bam
dataset1/sample-minSize200_samtools-F4-F1024-q20_mosdepth-minDepth4-maxDepth100_bedtools-merge-gap200_angsd-minQ20_loci-maxBase200/mysample.bam
Is there any method for easier and more efficient naming, such as auto creating version names and saving parameter details to a text file?
I read about the shadow directory feature but I don't think it does what I am looking for.
If you want to be very fancy, you could encode the params into a SHA hash or similar and use that for the filename, recording the hash and parameter values in a table. You just need a function that takes the keyword params, translates them to the hash, and is used for all your rule inputs. If I were you, I would use directories instead of flat filenames:
dataset1/sample-minSize200/samtools-F4-F1024-q20/mosdepth-minDepth4-maxDepth100/bedtools-merge-gap200/angsd-minQ20/loci-maxBase100/mysample.bam
That would make it easier to discard everything belonging to a parameter set that you no longer need, and it would make directory listings faster.
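The hashing idea from this answer can be sketched in plain Python. This is a minimal illustration, not a tested part of any pipeline: the helper names `param_hash` and `record_params`, the 10-character digest length, and the in-memory `table` dict are all assumptions made for the example.

```python
import hashlib
import json


def param_hash(params, length=10):
    """Return a short, stable hex digest for a dict of parameters (hypothetical helper)."""
    # sort_keys makes the digest independent of the order of the keys
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def record_params(table, params):
    """Remember which parameters belong to which hash and return the hash."""
    key = param_hash(params)
    table[key] = dict(params)
    return key


table = {}  # hash -> parameters; could be written to a TSV or JSON file
run_a = record_params(table, {"param_a": 100, "param_b": 200, "param_c": 300})
run_b = record_params(table, {"param_a": 150, "param_b": 200, "param_c": 300})

# The hash replaces the long parameter string in the output path
output_path = f"dataset1/{run_a}/mysample.bam"
```

Hashing only the parameters of the steps up to a given rule, rather than the whole config, would keep the property the question asks for: changing a late parameter leaves the hashes of earlier steps unchanged.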

How to implement splitting of files in snakemake when number of files is known

Context
rule A uses the split command in a shell directive.
The number of files generated by rule A depends on a user specified value from the config and is thus known.
A related question differs in that the number of output files is unknown there, and it refers to the dynamic() keyword. Apparently dynamic() has been replaced by checkpoints. Is this really the correct way to go in this scenario? There is also something like scattergather, but the example is not clear to me.
Code
chunks = config["chunks"]
sample_list = ["S1", "S2"]
rule all:
    input:
        expand("{sample}_chunk_{chunk}_done_something.tsv", sample=sample_list,
               chunk=[f"{i}".zfill(len(str(chunks))-1) for i in range(0, chunks)])
rule A:
    input:
        "input_file_{sample}.tsv"
    output:
        # the user defined number of chunks, how to specify these?
    params: chunks=chunks
    shell:
        "split -n {params.chunks} --numeric-suffixes=1 --additional-suffix=.tsv {input[0]} some_prefix_{wildcards.sample}_"
rule B:
    input:
        "some_prefix_{sample}_{chunk}.tsv"
    output:
        "{sample}_chunk_{chunk}_done_something.tsv"
    shell:
        "#Do something"
Attempts
I tried using a checkpoint with an input function for rule B and using directory() in rule A. However using directory results in SyntaxError in line 253 of MySnakefile: Unexpected keyword directory in rule definition (Snakefile, line 253) and even if that would not throw an error, I don't know how to get chunks into this input function since it is not a wildcard.
How to implement the splitting of an input file best in Snakemake?
Since the number of chunks is known beforehand, you can build the list of output files in rule A from the chunks value using a list comprehension:
rule A:
    ...
    output:
        chunks = ["some_prefix_{{sample}}_{:02d}.tsv".format(x+1) for x in range(chunks)]
With chunks = 2, this evaluates to chunks = ["some_prefix_{sample}_01.tsv", "some_prefix_{sample}_02.tsv"], matching the syntax of the split output. The {sample} wildcard is then filled in by Snakemake's standard wildcard replacement.
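The list comprehension can be checked outside Snakemake. The sketch below assumes a fixed `chunks = 2` in place of `config["chunks"]` and assumes split keeps its two-digit suffix; `chunk_ids` is a hypothetical name for the matching values that the `expand()` call in rule all would need so that its targets use the same numbering.

```python
chunks = 2  # stands in for config["chunks"]

# Doubled braces survive .format() as a literal {sample} wildcard;
# {:02d} zero-pads the chunk number to two digits.
chunk_outputs = [
    "some_prefix_{{sample}}_{:02d}.tsv".format(x + 1) for x in range(chunks)
]

# Chunk labels starting at 1, matching --numeric-suffixes=1
chunk_ids = [f"{i:02d}" for i in range(1, chunks + 1)]

print(chunk_outputs)  # ['some_prefix_{sample}_01.tsv', 'some_prefix_{sample}_02.tsv']
print(chunk_ids)      # ['01', '02']
```

Note that the expression used in the question's rule all, `f"{i}".zfill(len(str(chunks))-1)` over `range(0, chunks)`, yields "0" and "1" for two chunks, which would not match these file names.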

How to write a Snakemake rule-all, where expand statements can handle the absence of all particular input files

I want to write a Snakemake-Pipeline to process either short or long read sequencing files or both types, depending on which type of files is provided in the input file.
First my Snakefile calls a shell script that creates a config file with the name of all short read files in the input directory under the heading short_reads and all long read files under the heading long_reads.
This is followed by my all rule:
rule all:
    input:
        expand("../qc/id/{sample}/fastqc_raw/{sample}_R1_fastqc.html", sample=config["samples_short"]),
        expand("../qc/id/{sample}/nanoplot_raw/NanoPlot-report.html", sample=config["samples_long"])
        ...
However, if one of the file types (long or short reads) is not provided, Snakemake fails with a KeyError.
If I modify the config file so that the heading is still there but has no sample names under it, Snakemake tries to build the input with the value None, e.g.
Missing input files for rule nanoplot_raw:
../raw_reads/None_ont.fastq.gz
How can I design the rule all so that it can handle only short reads, only long reads, or both sequence types as input?
Thanks for your help!
Does the following work?
if config["samples_short"]:
    fastqc_short = expand("../qc/id/{sample}/fastqc_raw/{sample}_R1_fastqc.html", sample=config["samples_short"])
else:
    fastqc_short = []
if config["samples_long"]:
    nanoplot_long = expand("../qc/id/{sample}/nanoplot_raw/NanoPlot-report.html", sample=config["samples_long"])
else:
    nanoplot_long = []
rule all:
    input:
        fastqc_short,
        nanoplot_long,
        ...
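The snippet above covers a heading that is present but empty; a heading that is missing altogether would still raise the KeyError described in the question. A possible variant, sketched here in plain Python, uses `dict.get` so that both cases fall back to an empty list. The helper name `samples_for` and the example `config` dict are assumptions for illustration only.

```python
def samples_for(config, key):
    """Return the sample list under `key`, or [] if the key is missing or empty."""
    value = config.get(key) or []
    # A single sample written as a plain string becomes a one-element list
    return [value] if isinstance(value, str) else list(value)


# Example config: short reads present, long-read heading present but empty
config = {"samples_short": ["s1", "s2"], "samples_long": None}

fastqc_short = [
    f"../qc/id/{s}/fastqc_raw/{s}_R1_fastqc.html"
    for s in samples_for(config, "samples_short")
]
nanoplot_long = [
    f"../qc/id/{s}/nanoplot_raw/NanoPlot-report.html"
    for s in samples_for(config, "samples_long")
]
```

In the Snakefile, the same helper could feed `expand()` directly, e.g. `sample=samples_for(config, "samples_long")`, which should leave that part of the target list empty when no long reads are configured.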

Snakemake: Error when trying to run a rule for multiple directories and files

I create a dictionary in python and save the path to the directories (that I want the software to run on) as the keys and the corresponding values are a list of the expected output for each directory. Right now I have a structure like this:
dirSampleDict = {'/path_to_directory1': ["sample1","sample2","sample3"],
                 '/path_to_directory2': ["sample1","sample2"],
                 '/path_to_directory3': ["sample1","sample2","sample3"]}
# dirSampleDict looks pretty much like this
# key is a path to the directory that I want the rule to be executed on and the corresponding value dirSampleDict[key] is an array e.g. ["a","b","c"]
def input():
    input=[]
    for key in dirSampleDict:
        input.extend(expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key]))
    return input
rule all:
    input:
        input()
# example should run some software on different directories for each set of directories and their expected output samples
rule example:
    input:
        # the path to each set of samples should be the wildcard
        dir = lambda wildcards: expand("{dir}", dir=dirSampleDict.keys())
    params:
        # some params
    output:
        expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key])
    log:
        log = '{dir}/{sample}.log'
    run:
        cmd = "software {dir}"
        shell(cmd)
Doing this I receive the following error:
No values given for wildcard 'dir'
Edit: Maybe it was not so clear what I actually want to do so I filled in some data.
I also tried using the wildcards I set up in rule all as follows:
dirSampleDict = {'/path_to_directory1': ["sample1","sample2","sample3"],
                 '/path_to_directory2': ["sample1","sample2"],
                 '/path_to_directory3': ["sample1","sample2","sample3"]}
# dirSampleDict looks pretty much like this
# key is a path to the directory that I want the rule to be executed on and the corresponding value dirSampleDict[key] is an array e.g. ["a","b","c"]
def input():
    input=[]
    for key in dirSampleDict:
        input.extend(expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key]))
    return input
rule all:
    input:
        input()
# example should run some software on different directories for each set of directories and their expected output samples
rule example:
    input:
        # the path to each set of samples should be the wildcard
        dir = "{{dir}}"
    params:
        # some params
    output:
        '{dir}/{sample}*.foo'
    log:
        log = '{dir}/{sample}.log'
    run:
        cmd = "software {dir}"
        shell(cmd)
Doing this I receive the following error:
Not all output, log and benchmark files of rule example contain the
same wildcards. This is crucial though, in order to avoid that two or
more jobs write to the same file.
I'm pretty sure the second version is closer to what I actually want to do, since expand() as output would only run the rule once, but I need to run it for every key-value pair in the dictionary.
First of all, what do you expect from the asterisk in the output?
    output:
        '{dir}/{sample}*.foo'
The output has to be a list of valid filenames that can be formed by substituting a string for each wildcard.
The next problem is that you are using "{dir}" in the run: section. There is no variable dir defined in the script used for run. If you want to use the wildcard, you need to address it as wildcards.dir. However, the run: section can be replaced with a shell: section:
    shell:
        "software {wildcards.dir}"
Regarding your first script: there is no dir wildcard defined (actually there are no wildcards at all):
    output:
        expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key])
Both {dir} and {sample} are variables in the context of the expand function, and they are fully substituted with the named parameters.
Now the second script. What did you mean by this input?
    input:
        dir = "{{dir}}"
Here the "{{dir}}" is not a wildcard, but a reference to a global variable (you haven't provided the rest of your script, so I cannot judge whether it is defined or not). Moreover, why is the input needed at all? You never use the {input} variable, and there are no dependencies that connect rule example to any other rule that would produce its input.
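To illustrate the point about concrete filenames, the target list for rule all can be built in plain Python without any asterisk. This sketch assumes that each sample yields exactly one file named `{sample}.foo` in its directory, which the question does not state; `all_targets` is a hypothetical helper name.

```python
dirSampleDict = {
    "/path_to_directory1": ["sample1", "sample2", "sample3"],
    "/path_to_directory2": ["sample1", "sample2"],
}


def all_targets(dir_samples, suffix=".foo"):
    """Pair every directory only with its own samples and return concrete paths."""
    return [
        f"{directory}/{sample}{suffix}"
        for directory, samples in dir_samples.items()
        for sample in samples
    ]


targets = all_targets(dirSampleDict)
print(targets[0])    # /path_to_directory1/sample1.foo
print(len(targets))  # 5
```

Each of these paths matches an output pattern of the form '{dir}/{sample}.foo', so Snakemake can infer the dir and sample wildcards per job from the requested targets.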

Can I add a file to rule all: which is not defined in output

A number of commands silently produce extra files that are not defined in the rule's output section.
When I try to make sure these are produced by adding them to 'rule all:', a re-run of the workflow fails because the files are not found in any rule's output list.
Can I add a supplementary file (not present as {output}) to the 'rule all:'?
Thanks
E.g., STAR index produces a number of files in a folder defined by command arguments; checking for the presence of the folder does not mean that indexing has worked out normally.
Added for clarity: the STAR index example takes 'star_idx_75' as output argument and creates a folder with that name, in which all the following files are stored (their number may vary depending on the index type).
chrLength.txt
chrName.txt
chrNameLength.txt
chrStart.txt
exonGeTrInfo.tab
exonInfo.tab
geneInfo.tab
Genome
genomeParameters.txt
SA
SAindex
sjdbInfo.txt
sjdbList.fromGTF.out.tab
sjdbList.out.tab
transcriptInfo.tab
What I wanted was to check that they are all present, BUT none of them is used to build the command itself, and if I require them in the rule all: a rerun breaks because they are not in any snakemake {output} definition.
This is why I asked whether I could create 'fake' output variables that are not 'used' for running a command but allow placing the corresponding items in the 'rule all:'. Am I clearer now? :-)
Can I add a supplementary file (not present as {output}) to the 'rule all:'?
I don't think so, at least not without resorting to some convoluted solution. Every file in rule all (or more precisely the first rule) must have a rule that lists it in output.
If you don't want to repeat a long list, why not do something like this?
star_index = ['ref.idx1', 'ref.idx2', ...]
rule all:
    input:
        star_index
rule make_index:
    input:
        ...
    output:
        star_index
    shell:
        ...
It's probably better to list them all in the rule's output, but only use the relevant ones in subsequent rules. You could also look into using directory() which could possibly fit here.
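If the index folder is declared with directory(), the presence of the individual files could still be verified with a small check, for example at the end of the rule or in a follow-up step. The sketch below is only an illustration: `missing_index_files` is a hypothetical helper, and the temporary directory with three file names stands in for a real 'star_idx_75' folder.

```python
import tempfile
from pathlib import Path


def missing_index_files(index_dir, expected):
    """Return the names from `expected` that are not regular files in `index_dir`."""
    index_dir = Path(index_dir)
    return [name for name in expected if not (index_dir / name).is_file()]


# Demo on a temporary folder in which one expected file is absent
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Genome").touch()
    (Path(tmp) / "SA").touch()
    missing = missing_index_files(tmp, ["Genome", "SA", "SAindex"])

print(missing)  # ['SAindex']
```

A rule could raise an error when the returned list is non-empty, so that an incomplete index fails the job instead of passing silently.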