Eliminating snakemake temporary directories - snakemake

To save disk space, I am using the temp() function in snakemake. It deletes all the {sample}.dup.bam files inside the dup directory, but not the directory itself. How can I improve this?
rule all:
    input:
        expand("dup/{sample}.dup.bam", sample=SAMPLES),
        "dup/bam_list"

rule samtools_markdup:
    input:
        sortbam = "rg/{sample}.rg.bam"
    output:
        dupbam = temp("dup/{sample}.dup.bam")
    threads: 5
    shell:
        """
        samtools markdup -@ {threads} {input.sortbam} {output.dupbam}
        """

rule bam_list:
    input:
        expand("dup/{sample}.dup.bam", sample=SAMPLES)
    output:
        outlist = "dup/bam_list"
    shell:
        """
        ls dup/*.bam > {output.outlist}
        """

The temp() function deletes all files that are no longer needed by the workflow.
Since you specify in rule all that you need the file dup/bam_list, snakemake will not delete this file, and therefore will not delete the dup directory either. I'm even surprised that all the bam files get deleted, since you're asking for them in rule all.
Tips
You're defining a dependency between your rules:
rule samtools_markdup has to run before rule bam_list. Therefore, you do not need to ask for expand("dup/{sample}.dup.bam", sample=SAMPLES) in rule all. These files will be created (and deleted, since they are marked as temporary) in order to create the file dup/bam_list.
If you need to delete a directory, you can (probably) mark it as temp as well, in combination with the directory() function:
    output: temp(directory("dup"))
But once more, if any file in this folder is requested by rule all, it won't be deleted. Working with directories is always a bit tricky since snakemake uses files (and their timestamps) to define the DAG.
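As a complement to the above, here is a minimal sketch of a plain Python helper that removes a directory only when nothing is left in it. The function name remove_dir_if_empty is made up for illustration; one possible place to call it would be an onsuccess handler in the Snakefile, but that is an assumption about your setup, not something the original workflow does.

```python
import os


def remove_dir_if_empty(path):
    """Remove `path` only if it is an existing, empty directory.

    Returns True when the directory was removed, False otherwise.
    (Hypothetical helper, not part of Snakemake.)
    """
    if os.path.isdir(path) and not os.listdir(path):
        os.rmdir(path)
        return True
    return False
```

With the workflow from the question, dup/bam_list is still present at the end, so this helper would leave dup in place, which matches the behaviour described above.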

Related

Discard part of filename in Snakemake: "Wildcards in input files cannot be determined from output files"

I am running into a WildcardError: Wildcards in input files cannot be determined from output files problem with Snakemake. The issue is that I don't want to keep a variable part of my input file name. For instance, suppose I have these files.
$ mkdir input
$ touch input/a-foo.txt
$ touch input/b-wsdfg.txt
$ touch input/c-3523.txt
And I have a Snakemake file like this:
subjects = ['a', 'b', 'c']
result_pattern = "output/{kind}.txt"

rule all:
    input:
        expand(result_pattern, kind=subjects)

rule step1:
    input:
        "input/{kind}-{fluff}.txt"
    output:
        "output/{kind}.txt"
    shell:
        """
        cp {input} {output}
        """
I want the output file names to just have the part I'm interested in. I understand the principle that every wildcard in input needs a corresponding wildcard in output. So is what I'm trying to do a sort of anti-pattern? For instance, I suppose there could be two files input/a-foo.txt and input/a-bar.txt, and they would overwrite each other. Should I be renaming my input files prior to feeding into snakemake?
Quoting the question: "I want the output file names to just have the part I'm interested in [...]. I suppose there could be two files input/a-foo.txt and input/a-bar.txt, and they would overwrite each other."
It seems to me you need to decide how to resolve such conflicts. If the input files are:
input/a-bar.txt
input/a-foo.txt <- Note duplicate {a}
input/b-wsdfg.txt
input/c-3523.txt
How do you want the output files to be named, and according to what criteria? The answer is independent of snakemake, but depending on your circumstances you could include Python code within the Snakefile to handle such conflicts automatically.
Basically, once you make such decisions you can work on the solution.
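To make the conflict question concrete, here is a small sketch that groups file names by their {kind} prefix and reports the kinds that map to more than one file. The helper names group_by_kind and find_conflicts are hypothetical, and the sketch assumes that the kind is everything before the first "-" in the file name.

```python
import os
from collections import defaultdict


def group_by_kind(paths):
    """Map each kind (the part of the file name before the first '-') to its files."""
    groups = defaultdict(list)
    for path in paths:
        name = os.path.basename(path)
        kind, _, _ = name.partition("-")
        groups[kind].append(path)
    return dict(groups)


def find_conflicts(paths):
    """Return only the kinds that have more than one candidate input file."""
    return {
        kind: sorted(files)
        for kind, files in group_by_kind(paths).items()
        if len(files) > 1
    }
```

Running find_conflicts on the four file names listed above would flag only kind a, which is the case you need a naming policy for.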
Quoting the asker: "But suppose there are no file name conflicts, it seems like the wildcard system doesn't handle cases where you want to remove some variable fluff from a filename."
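The variable part can be handled using Python's glob patterns, wrapped in an input function so that {kind} is filled in from the wildcards:
import glob
...
rule step1:
    input:
        # evaluated per job, once the value of {kind} is known
        lambda wc: glob.glob("input/%s-*.txt" % wc.kind)
    output:
        "output/{kind}.txt"
    shell:
        """
        cp {input} {output}
        """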
You could even be more elaborate and use a dedicated function to match files given the {kind} wildcard:
def get_kind_files(wc):
    ff = glob.glob("input/%s-*.txt" % wc.kind)
    if len(ff) != 1:
        raise Exception('Expected exactly 1 file for kind "%s"' % wc.kind)
    # Possibly more checks that you got the right file
    return ff

rule step1:
    input:
        get_kind_files,
    output:
        "output/{kind}.txt"
    shell:
        """
        cp {input} {output}
        """

How to get a rule that would work the same on a directory and its sub-directories

I am trying to make a rule that works the same on a directory and any of its sub-directories (to avoid having to repeat the rule several times). I would like to have access to the name of the sub-directory if there is one.
My approach was to make the sub-directory optional. Given that wildcards can be made to accept an empty string by explicitly giving the ".*" pattern, I therefore tried the following rule:
rule test_optional_sub_dir:
    input:
        "{adir}/{bdir}/a.txt"
    output:
        "{adir}/{bdir,.*}/b.txt"
    shell:
        "cp {input} {output}"
I was hoping that this rule would match both A/b.txt and A/B/b.txt.
However, A/b.txt doesn't match the rule. (Neither does A//b.txt, which would be the literal omission of bdir; I guess the double / gets removed before the matching happens.)
The following rule works with both A/b.txt and A/B/b.txt:
rule test_optional_sub_dir2:
    input:
        "{path}/a.txt"
    output:
        "{path,.*}/b.txt"
    shell:
        "cp {input} {output}"
but the problem in this case is that I don't have easy access to the name of the directories in path. I could use the function pathlib.Path to break {path} up but this seems to get overly complicated.
Is there a better way to accomplish what I am trying to do?
Thanks a lot for your help.
How exactly you want to use the sub-directory in your rule might determine the best way to do this. Maybe something like:
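def get_subdir(path):
    # return the first sub-directory below the top-level one, or '' if there is none
    dirs = path.split('/')
    if len(dirs) > 1:
        return dirs[1]
    else:
        return ''

rule myrule:
    input:
        "{dirpath}/a.txt"
    output:
        "{dirpath}/b.txt"
    params:
        subdir = lambda wildcards: get_subdir(wildcards.dirpath)
    shell:
        # use {params.subdir} in the command as needed
        "cp {input} {output}"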
Of course, if your rule uses "run" or "script" instead of "shell" you don't even need that function and the subdir param, and can just figure out the subdir from the wildcard that gets passed into the script.
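For the pathlib route mentioned in the question, a rough sketch could look like the following. The names split_dirs and first_subdir are hypothetical, and the sketch assumes relative, "/"-separated paths like the ones in the rules above.

```python
from pathlib import PurePosixPath


def split_dirs(path):
    """Return the directory names that make up a relative POSIX-style path."""
    return list(PurePosixPath(path).parts)


def first_subdir(path):
    """Return the first directory below the top-level one, or '' if there is none."""
    parts = split_dirs(path)
    return parts[1] if len(parts) > 1 else ""
```

Either function could be called from a params lambda or directly inside a run block, with wildcards.dirpath as the argument.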
With some further fiddling, I found something that is close to what I want:
Let's say I want at least one directory and no more than 2 optional ones below it. The following works. The only downside is that opt_dir1 and opt_dir2 contain the trailing slash rather than just the name of the directory.
rule test_optional_sub_dir3:
    input:
        "{mand_dir}/{opt_dir1}{opt_dir2}a.txt"
    output:
        "{mand_dir}/{opt_dir1}{opt_dir2}b.txt"
    wildcard_constraints:
        mand_dir="[^/]+",
        opt_dir1="([^/]+/)?",
        opt_dir2="([^/]+/)?"
    shell:
        "cp {input} {output}"
Still interested in better approaches if anyone has one.
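To see which targets those constraints accept, and to get rid of the trailing slash, the matching can be approximated in plain Python. This is only a sketch: it assumes that each wildcard behaves like a named regex group carrying its constraint, and the names PATTERN and match_dirs are made up for illustration.

```python
import re

# Approximation of the output pattern, one named group per wildcard.
PATTERN = re.compile(
    r"(?P<mand_dir>[^/]+)/"
    r"(?P<opt_dir1>([^/]+/)?)"
    r"(?P<opt_dir2>([^/]+/)?)"
    r"b\.txt"
)


def match_dirs(target):
    """Return the directory names for a matching target, or None if it does not match.

    Trailing slashes are stripped, so optional directories come back as bare
    names, or as '' when they are absent.
    """
    m = PATTERN.fullmatch(target)
    if m is None:
        return None
    return {name: value.rstrip("/") for name, value in m.groupdict().items()}
```

The same rstrip("/") call could be applied to wildcards.opt_dir1 and wildcards.opt_dir2 in a params lambda to obtain the plain directory names inside the rule.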

A WorkflowError with wildcards

I want to use snakemake to QC my fastq files, but it fails with:
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files
or a rule without wildcards.
The code I wrote is like this
SAMPLE = ["A","B","C"]

rule trimmomatic:
    input:
        "/data/samples/{sample}.fastq"
    output:
        "/data/samples/{sample}.clean.fastq"
    shell:
        "trimmomatic SE -threads 5 -phred33 -trimlog trim.log {input} {output} LEADING:20 TRAILING:20 MINLEN:16"
I'm a novice; if anyone knows how to fix this, please tell me. Thanks so much!
You could do one of the following, but chances are you want the latter.
1. Explicitly specify the output file names on the command line:
    snakemake data/samples/A.clean.fastq
   This runs the rule to create the file data/samples/A.clean.fastq.
2. Specify the target output files in the Snakefile itself using rule all. The Snakemake documentation has more on adding targets via rule all.
SAMPLE_NAMES = ["A","B", "C"]

rule all:
    input:
        expand("data/samples/{sample}.clean.fastq", sample=SAMPLE_NAMES)

rule trimmomatic:
    input:
        "data/samples/{sample}.fastq"
    output:
        "data/samples/{sample}.clean.fastq"
    shell:
        "trimmomatic SE -threads 5 -phred33 -trimlog trim.log {input} {output} LEADING:20 TRAILING:20 MINLEN:16"
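If it helps to see what rule all ends up asking for, the expand() call with a single wildcard is roughly equivalent to the plain Python below. This is an illustrative sketch, not Snakemake's implementation; clean_fastq_targets and TARGETS are hypothetical names.

```python
SAMPLE_NAMES = ["A", "B", "C"]


def clean_fastq_targets(samples):
    """Build one concrete output path per sample, with no wildcards left in it."""
    return ["data/samples/{sample}.clean.fastq".format(sample=s) for s in samples]


TARGETS = clean_fastq_targets(SAMPLE_NAMES)
```

Because every entry in TARGETS is a concrete file name, a rule that takes them as input can serve as a target rule, which is what the error message is asking for.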

Varying (known) number of outputs in Snakemake

I have a Snakemake rule that works on a data archive and essentially unpacks the data in it. The archives contain a varying number of files that I know before my rule starts, so I would like to exploit this and do something like
rule unpack:
    input: '{id}.archive'
    output:
        lambda wildcards: ARCHIVE_CONTENTS[wildcards.id]
but I can't use functions in output, and for good reason. However, I can't come up with a good replacement. The rule is very expensive to run, so I cannot do
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
and run the rule several times for each archive. Another alternative could be
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
    run:
        if os.path.isfile(output[0]):
            return
        ...
but I am afraid that would introduce a race condition.
Is marking the rule output with dynamic really the only option? I would be fine with auto-generating a separate rule for every archive, but I haven't found a way to do so.
Here, it becomes handy that Snakemake is an extension of plain Python. You can generate a separate rule for each archive:
for id, contents in ARCHIVE_CONTENTS.items():
    rule:
        input:
            '{id}.tar.gz'.format(id=id)
        output:
            # all placeholders are filled in here, so the rule has no wildcards
            expand('{id}/{outfile}', id=id, outfile=contents)
        params:
            outdir=id
        shell:
            'tar -C {params.outdir} -xf {input}'
Depending on what kind of archive this is, you could also have a single rule that just extracts the desired file, e.g.:
rule unpack:
    input:
        '{id}.tar.gz'
    output:
        '{id}/{outfile}'
    shell:
        'tar -C {wildcards.id} -xf {input} {wildcards.outfile}'
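Both variants rely on ARCHIVE_CONTENTS being known up front. If the archives are tar files, one way to build such a mapping is to read the member list with Python's tarfile module. This is a sketch under that assumption; list_archive_files is a hypothetical helper name.

```python
import tarfile


def list_archive_files(archive_path):
    """Return the names of the regular files stored in a tar archive, in archive order."""
    with tarfile.open(archive_path) as tar:
        return [member.name for member in tar.getmembers() if member.isfile()]
```

A dictionary such as {id: list_archive_files(id + '.tar.gz') for id in ids} could then play the role of ARCHIVE_CONTENTS, where ids is whatever list of archive identifiers you already have.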

Snakemake: dependencies which are not input

I'd like to know if there is a way in Snakemake to define a dependency that is not actually an input file.
What I mean is that some programs expect certain files to exist even though they are not given on the command line.
Let's consider bwa as an example.
This is a rule from Johannes Köster's mapping rules:
rule bwa_mem_map:
    input:
        lambda wildcards: config["references"][wildcards.reference],
        lambda wildcards: config["units"][wildcards.unit]
    output:
        "mapping/{reference}/units/{unit}.bam"
    params:
        sample=lambda wildcards: UNIT_TO_SAMPLE[wildcards.unit],
        custom=config.get("params_bwa_mem", "")
    log:
        "mapping/log/{reference}/{unit}.log"
    threads: 8
    shell:
        "bwa mem {params.custom} "
        r"-R '@RG\tID:{wildcards.unit}\t"
        "SM:{params.sample}\tPL:{config[platform]}' "
        "-t {threads} {input} 2> {log} "
        "| samtools view -Sbh - > {output}"
Here, bwa expects the genome index file to exist even though it is not a command-line argument (the path to the index file is deduced from the genome path).
Is there a way to tell Snakemake that the index file is a dependency, so that Snakemake checks whether one of its rules can generate this file?
I suppose you could still rewrite your rule inputs as:
rule bwa_mem_map:
    input:
        genome=lambda wildcards: config["references"][wildcards.reference],
        fastq=lambda wildcards: config["units"][wildcards.unit],
        index="foo.idx"
And adapt the run part of the rule accordingly.
Is it the best solution?
Thanks in advance.
Benoist
I think the only way snakemake handles dependencies between rules is through files, so I'd say you are doing it correctly when you list the index file explicitly as an input of your mapping rule, even though this file does not appear in the mapping command.
For what it's worth, I do the same for bam index files, which are an implicit dependency for some tools: I put both the sorted bam file and its index as input, but only use the bam file in the shell or run part. And I have a rule that generates both, with the two files as output.
Input and output files do not need to appear in the shell / run part of a rule.
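For the bwa case specifically, the implicit index files can be derived from the genome path, so they can be listed as inputs without hard-coding each name. The sketch below assumes the classic bwa index naming (the genome path plus .amb, .ann, .bwt, .pac and .sa); bwa_index_files is a hypothetical helper name, and other aligners or bwa variants may use different suffixes.

```python
# Suffixes that `bwa index` appends to the genome file name (assumed: classic bwa).
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def bwa_index_files(genome):
    """Return the index file paths expected next to a genome FASTA file."""
    return [genome + suffix for suffix in BWA_INDEX_SUFFIXES]
```

In a Snakefile this could be used as something like index=lambda wildcards: bwa_index_files(config["references"][wildcards.reference]) in the mapping rule, with a separate indexing rule declaring the same files as output.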