Snakemake always rebuilds targets, even when up to date

I'm new to snakemake and running into some behavior I don't understand. I have a set of fastq files with file names following the standard Illumina convention:
SAMPLENAME_SAMPLENUMBER_LANE_READ_001.fastq.gz
In a directory reads/raw_fastq. I'd like to create symbolic links to simplify the names to follow the pattern:
SAMPLENAME_READ.fastq.gz
In a directory reads/renamed_raw_fastq
My aim is that as I add new fastq files to the project, snakemake will create symlinks only for the newly-added files.
My snakefile is as follows:
# Get sample names from read file names in the "raw" directory
readRootDir = 'reads/'
readRawDir = readRootDir + 'raw_fastq/'
import os
samples = list(set([x.split('_', 1)[0] for x in os.listdir(readRawDir)]))
samples.sort()
# Generate simplified names
readRenamedRawDir = readRootDir + 'renamed_raw_fastq/'
newNames = expand(readRenamedRawDir + "{sample}_{read}.fastq.gz", sample = samples, read = ["R1", "R2"])
# Create symlinks
import glob
def getRawName(wildcards):
    rawName = glob.glob(readRawDir + wildcards.sample + "_*_" + wildcards.read + "_001.fastq.gz")[0]
    return rawName
rule all:
    input: newNames
rule rename:
    input: getRawName
    output: "reads/renamed_raw_fastq/{sample}_{read}.fastq.gz"
    shell: "ln -sf {input} {output}"
When I run snakemake, it tries to generate the symlinks as expected but:
Always tries to create the target symlinks, even when they already exist and have later timestamps than the source fastq files.
Throws errors like:
MissingOutputException in line 68 of /work/nick/FAW-MIPs/renameRaw.snakefile:
Missing files after 5 seconds:
reads/renamed_raw_fastq/Ben21_R2.fastq.gz
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
It's almost like snakemake isn't seeing the output files it creates. Can anyone suggest what I might be missing here?
Thanks!

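I think
ln -sf {input} {output}
creates a dangling symlink. {input} is a path relative to the working directory, but a relative symlink target is resolved relative to the directory that contains the link (here reads/renamed_raw_fastq/), so the link doesn't point to the source file. That would explain why snakemake reports the output as missing. You could fix it by using absolute paths, for example:
def getRawName(wildcards):
    rawName = os.path.abspath(glob.glob(readRawDir + wildcards.sample + "_*_" + wildcards.read + "_001.fastq.gz")[0])
    return rawName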
(As an aside, I would make sure that renaming fastq files the way you do doesn't result in a name-collision, for example when the same sample is sequenced on different lanes of the same flow cell.)
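To see the difference between a relative and an absolute link target outside of snakemake, here is a minimal Python sketch. It assumes a POSIX system where symlinks can be created freely; the function name link_resolves and the raw file name are made up for illustration. It recreates the two directories from the question in a temporary folder and makes the link the way ln -sf {input} {output} would.

```python
import os
import tempfile

def link_resolves(use_absolute_target):
    """Create a symlink like `ln -sf {input} {output}` and report
    (is_symlink, target_exists) for the resulting link."""
    with tempfile.TemporaryDirectory() as root:
        raw_dir = os.path.join(root, "reads", "raw_fastq")
        renamed_dir = os.path.join(root, "reads", "renamed_raw_fastq")
        os.makedirs(raw_dir)
        os.makedirs(renamed_dir)

        # Hypothetical raw file name following the Illumina convention.
        raw_rel = os.path.join("reads", "raw_fastq", "Ben21_S1_L001_R2_001.fastq.gz")
        open(os.path.join(root, raw_rel), "w").close()

        # The target string is stored in the link exactly as given.
        target = os.path.join(root, raw_rel) if use_absolute_target else raw_rel
        link = os.path.join(renamed_dir, "Ben21_R2.fastq.gz")
        os.symlink(target, link)

        # os.path.exists() follows the link; a relative target is looked up
        # relative to renamed_dir, not relative to root.
        return os.path.islink(link), os.path.exists(link)

print(link_resolves(False))  # (True, False): the link exists but dangles
print(link_resolves(True))   # (True, True): the link resolves
```

With the relative target the link itself is created, but anything that checks for the file through the link sees nothing there.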

Related

Discard part of filename in Snakemake: "Wildcards in input files cannot be determined from output files"

I am running into a WildcardError: Wildcards in input files cannot be determined from output files problem with Snakemake. The issue is that I don't want to keep a variable part of my input file name. For instance, suppose I have these files.
$ mkdir input
$ touch input/a-foo.txt
$ touch input/b-wsdfg.txt
$ touch input/c-3523.txt
And I have a Snakemake file like this:
subjects = ['a', 'b', 'c']
result_pattern = "output/{kind}.txt"
rule all:
    input:
        expand(result_pattern, kind=subjects)
rule step1:
    input:
        "input/{kind}-{fluff}.txt"
    output:
        "output/{kind}.txt"
    shell:
        """
        cp {input} {output}
        """
I want the output file names to just have the part I'm interested in. I understand the principle that every wildcard in input needs a corresponding wildcard in output. So is what I'm trying to do a sort of anti-pattern? For instance, I suppose there could be two files input/a-foo.txt and input/a-bar.txt, and they would overwrite each other. Should I be renaming my input files prior to feeding into snakemake?
I want the output file names to just have the part I'm interested in [...]. I suppose there could be two files input/a-foo.txt and input/a-bar.txt, and they would overwrite each other.
It seems to me you need to decide how to resolve such conflicts. If the input files are:
input/a-bar.txt
input/a-foo.txt <- Note duplicate {a}
input/b-wsdfg.txt
input/c-3523.txt
How do you want the output files to be named, and according to what criteria? The answer is independent of snakemake, but depending on your circumstances you could include Python code within the Snakefile to handle such conflicts automatically.
Basically, once you make such decisions you can work on the solution.
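As a starting point for that decision, a small check like the following can at least tell you whether any conflicts exist. This is only a sketch: the helper name find_conflicts is made up, and it assumes the kind is everything before the first "-" in the file name.

```python
from collections import defaultdict

def find_conflicts(filenames):
    """Group names of the form '<kind>-<fluff>.txt' by kind and return
    only the kinds that are claimed by more than one file."""
    by_kind = defaultdict(list)
    for name in filenames:
        kind = name.split("-", 1)[0]
        by_kind[kind].append(name)
    return {kind: sorted(names) for kind, names in by_kind.items() if len(names) > 1}

files = ["a-bar.txt", "a-foo.txt", "b-wsdfg.txt", "c-3523.txt"]
print(find_conflicts(files))  # {'a': ['a-bar.txt', 'a-foo.txt']}
```

In a Snakefile the list could come from os.listdir("input"), and a non-empty result could either raise an error or feed whatever tie-breaking rule you settle on.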
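But suppose there are no file name conflicts, it seems like the wildcard system doesn't handle cases where you want to remove some variable fluff from a filename.
The variable part can be handled using Python's glob module. The glob call has to sit inside an input function (here a lambda), because a bare glob.glob("input/{kind}-*.txt") would run once when the Snakefile is parsed, with the literal text {kind} still in the pattern:
import glob
...
rule step1:
    input:
        lambda wc: glob.glob("input/%s-*.txt" % wc.kind)
    output:
        "output/{kind}.txt"
    shell:
        """
        cp {input} {output}
        """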
You could even be more elaborate and use a dedicated function to match files given the {kind} wildcard:
def get_kind_files(wc):
    ff = glob.glob("input/%s-*.txt" % wc.kind)
    if len(ff) != 1:
        raise Exception('Expected exactly 1 file for kind "%s"' % wc.kind)
    # Possibly more checks that you got the right file
    return ff
rule step1:
    input:
        get_kind_files,
    output:
        "output/{kind}.txt"
    shell:
        """
        cp {input} {output}
        """

Snakemake: Error when trying to run a rule for multiple directories and files

I create a dictionary in Python whose keys are the paths to the directories I want the software to run on, and whose values are lists of the expected outputs for each directory. Right now I have a structure like this:
sampleDict = {'/path_to_directory1': ["sample1","sample2","sample3"],
              '/path_to_directory2': ["sample1","sample2"],
              '/path_to_directory3': ["sample1","sample2","sample3"]}
# sampleDict looks pretty much like this
# key is a path to the directory that I want the rule to be executed on and the corresponding value sampleDict[key] is an array e.g. ["a","b","c"]
def input():
    input=[]
    for key in dirSampleDict:
        input.extend(expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key]))
    return input
rule all:
    input:
        input()
# example should run some software on different directories for each set of directories and their expected output samples
rule example:
    input:
        # the path to each set of samples should be the wildcard
        dir = lambda wildcards: expand("{dir}", dir=dirSampleDict.keys())
    params:
        # some params
    output:
        expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key])
    log:
        log = '{dir}/{sample}.log'
    run:
        cmd = "software {dir}"
        shell(cmd)
Doing this I receive the following error:
No values given for wildcard 'dir'
Edit: Maybe it was not so clear what I actually want to do so I filled in some data.
I also tried using the wildcards I set up in rule all as follows:
sampleDict = {'/path_to_directory1': ["sample1","sample2","sample3"],
              '/path_to_directory2': ["sample1","sample2"],
              '/path_to_directory3': ["sample1","sample2","sample3"]}
# sampleDict looks pretty much like this
# key is a path to the directory that I want the rule to be executed on and the corresponding value sampleDict[key] is an array e.g. ["a","b","c"]
def input():
    input=[]
    for key in dirSampleDict:
        input.extend(expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key]))
    return input
rule all:
    input:
        input()
# example should run some software on different directories for each set of directories and their expected output samples
rule example:
    input:
        # the path to each set of samples should be the wildcard
        dir = "{{dir}}"
    params:
        # some params
    output:
        '{dir}/{sample}*.foo'
    log:
        log = '{dir}/{sample}.log'
    run:
        cmd = "software {dir}"
        shell(cmd)
Doing this I receive the following error:
Not all output, log and benchmark files of rule example contain the
same wildcards. This is crucial though, in order to avoid that two or
more jobs write to the same file.
I'm pretty sure the second version is closer to what I actually want, since expand() as output would only run the rule once, but I need to run it for every key-value pair in the dictionary.
First of all, what do you expect from the asterisk in the output?
output:
    '{dir}/{sample}*.foo'
The output has to be a list of valid filenames, each of which can be formed by substituting every wildcard with some string.
The next problem is that you are using "{dir}" in the run: section. No variable dir is defined in the script used for run. If you want to use the wildcard, you need to address it as wildcards.dir. However, the run: section can be replaced with a shell: section:
shell:
    "software {wildcards.dir}"
Regarding your first script: there is no dir wildcard defined (actually there are no wildcards at all):
output:
    expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key])
Both {dir} and {sample} are placeholders in the context of the expand function, and they are fully substituted with the named arguments.
Now the second script. What did you mean by this input?
input:
    dir = "{{dir}}"
Here "{{dir}}" is not the way to write a wildcard. Doubled braces are the escape for literal braces, which is what you use inside expand() to keep a wildcard unexpanded; in a plain input string a wildcard is written with single braces, "{dir}" (you haven't provided the rest of your script, so I cannot judge what was intended). Moreover, why is the input needed at all? You never use the {input} variable, and there are no dependencies that connect rule example to any other rule that would produce its input.
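To illustrate the point about full substitution, here is a minimal sketch in plain Python. It uses str.format as a stand-in for expand(), which as far as I know follows the same brace conventions but additionally builds combinations of the values. The names dir_sample_dict and all_targets are hypothetical, and the pattern deliberately has no asterisk.

```python
# Hypothetical stand-in for the dictionary in the question.
dir_sample_dict = {
    "/path_to_directory1": ["sample1", "sample2", "sample3"],
    "/path_to_directory2": ["sample1", "sample2"],
}

def all_targets(mapping, pattern="{dir}/{sample}.foo"):
    """Build one concrete file name per (directory, sample) pair."""
    return [
        pattern.format(dir=directory, sample=sample)
        for directory, samples in mapping.items()
        for sample in samples
    ]

targets = all_targets(dir_sample_dict)
print(targets[0])  # /path_to_directory1/sample1.foo

# Every placeholder has been filled in, so nothing is left that a rule
# could treat as a wildcard.
print(any("{" in t for t in targets))  # False

# Doubled braces are the escape for a literal brace, so {dir} survives here.
print("{{dir}}/{sample}.foo".format(sample="sample1"))  # {dir}/sample1.foo
```

A list like targets is what belongs in rule all, while the rule that does the work keeps the single-brace pattern '{dir}/{sample}.foo' as its output so that it runs once per file.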

Eliminating snakemake temporary directories

To save disk space, I am using the temp() function in snakemake. It deletes all the {sample}.dup.bam files inside the dup directory, but not the directory itself. How can I also get rid of the directory?
rule all:
    input:
        expand("dup/{sample}.dup.bam", sample=SAMPLES),
        "dup/bam_list"
rule samtools_markdup:
    input:
        sortbam ="rg/{sample}.rg.bam"
    output:
        dupbam = temp("dup/{sample}.dup.bam")
    threads: 5
    shell:
        """
        samtools markdup -@ {threads} {input.sortbam} {output.dupbam}
        """
rule bam_list:
    input:
        expand("dup/{sample}.dup.bam", sample=SAMPLES)
    output:
        outlist = "dup/bam_list"
    shell:
        """
        ls dup/*.bam > {output.outlist}
        """
The temp() function deletes files that are no longer needed in the workflow.
Since you specify in rule all that you need the file dup/bam_list, snakemake will not delete this file, and thus not the dup directory either. I'm even surprised that all the bam files get deleted, since you're asking for them in rule all.
Tips
You're defining a dependency between your rules: rule samtools_markdup has to run before rule bam_list. Therefore, you do not need to ask for expand("dup/{sample}.dup.bam", sample=SAMPLES) in rule all. The bam files will be created (and then deleted, since they are marked as temporary) in order to create the file dup/bam_list.
If you need to delete a directory, you can (probably) mark it as temp too, in combination with the directory() function:
output: temp(directory("dup"))
but once more, if any file in this folder is requested by rule all, it won't be deleted. Working with directories is always a bit tricky since snakemake uses files (and their timestamps) to define the DAG.

Snakemake copy from several directories

Snakemake is super-confusing to me. I have files of the form:
indir/type/name_1/run_1/name_1_processed.out
indir/type/name_1/run_2/name_1_processed.out
indir/type/name_2/run_1/name_2_processed.out
indir/type/name_2/run_2/name_2_processed.out
where type, name, and the numbers are variable. I would like to aggregate files such that all files with the same "name" end up in a single dir:
outdir/type/name/name_1-1.out
outdir/type/name/name_1-2.out
outdir/type/name/name_2-1.out
outdir/type/name/name_2-2.out
How do I write a snakemake rule to do this? I first tried the following
rule rename:
    input:
        "indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out"
    output:
        "outdir/{type}/{name}/{name}_{nameno}-{runno}.out"
    shell:
        "cp {input} {output}"
# example command: snakemake --cores 1 outdir/type/name/name_1-1.out
This worked, but doing it this way doesn't save me any effort because I have to know what the output files are ahead of time, so basically I'd have to pass all the output files as a list of arguments to snakemake, requiring a bit of shell trickery to get the variables.
So then I tried to use directory (as well as give up on preserving runno).
rule rename2:
    input:
        "indir/{type}/{name}_{nameno}"
    output:
        directory("outdir/{type}/{name}")
    shell:
        """
        for d in {input}/run_*; do
            i=0
            for f in ${{d}}/*processed.out; do
                cp ${{f}} {output}/{wildcards.name}_{wildcards.nameno}-${{i}}.out
            done
            let ++i
        done
        """
This gave me the error, Wildcards in input files cannot be determined from output files: 'nameno'. I get it; {nameno} doesn't exist in output. But I don't want it there in the directory name, only in the filename that gets copied.
Also, if I delete {nameno}, then it complains because it can't find the right input file.
What are the best practices here for what I'm trying to do? Also, how does one wrap their head around the fact that in snakemake, you specify outputs, not inputs? I think this latter fact is what is so confusing.
I guess what you need is the expand function:
rule all:
    input: expand("outdir/{type}/{name}/{name}_{nameno}-{runno}.out",
                  type=TYPES,
                  name=NAMES,
                  nameno=NAME_NUMBERS,
                  runno=RUN_NUMBERS)
TYPES, NAMES, NAME_NUMBERS and RUN_NUMBERS are lists of all possible values for these parameters. You either need to hardcode them or use the glob_wildcards function to collect them:
TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS = glob_wildcards("indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out")
This, however, would give you duplicates. If that is not desirable, remove the duplicates:
TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS = map(set, glob_wildcards("indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out"))
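One caveat: expand() combines every value with every other value, so if some combination does not exist on disk (say, a name that has only one run), rule all would ask for a file that cannot be built. A possible alternative, sketched here in plain Python, is to translate each existing input path into its output path one by one. The regular expression and the function name output_for are my own assumptions about the layout described in the question; in a Snakefile the list of inputs could come from glob.glob.

```python
import re

# Assumed layout: indir/<type>/<name>_<nameno>/run_<runno>/<name>_<nameno>_processed.out
PATTERN = re.compile(
    r"indir/(?P<type>[^/]+)/(?P<name>[^/]+)_(?P<nameno>[^/_]+)"
    r"/run_(?P<runno>[^/]+)/(?P=name)_(?P=nameno)_processed\.out"
)

def output_for(input_path):
    """Map one existing input path to its output path, or None if the
    path does not follow the assumed layout."""
    match = PATTERN.fullmatch(input_path)
    if match is None:
        return None
    return "outdir/{type}/{name}/{name}_{nameno}-{runno}.out".format(**match.groupdict())

inputs = [
    "indir/type/name_1/run_1/name_1_processed.out",
    "indir/type/name_1/run_2/name_1_processed.out",
    "indir/type/name_2/run_1/name_2_processed.out",
    "indir/type/name_2/run_2/name_2_processed.out",
]
targets = [t for t in map(output_for, inputs) if t is not None]
print(targets[0])  # outdir/type/name/name_1-1.out
```

The resulting targets list contains exactly one output per input file and could be passed to rule all in place of the expand() call.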

glob_wildcards on multiple directories with different file names

I am trying to write a rule that takes two files from different directories and then puts the output of the rule into the same directory, as in the file structure below:
DIR_A
    dir1
        file1.clean.vcf
    dir2
        file2.clean.vcf
    dir3
        file1.output.vcf
        file2.output.vcf
So far I have tried using glob_wildcards:
(DIR,NAME) = glob_wildcards("DIR_A/{dir}/{name}.clean.vcf")
input: expand("DIR_A/{dir}/{name}.clean.vcf", dir=DIR, name=NAME)
output: "DIR_A/dir3/{name}.output.vcf"
but it throws an error:
MissingInputException in line 80 of DIR_A:
Missing input files for rule convert_output:
DIR_A/dir1/file2.clean.vcf
DIR_A/dir2/file1.clean.vcf
Adding zip to the input:
input: expand("DIR_A/{dir}/{name}.clean.vcf", zip, dir=DIR, name=NAME)
and doing a dry run ($ snakemake -s snakefile -n) gives:
rule conv_output:
    input: DIR_A/dir1/file1.clean.vcf, DIR_A/file2/file2.clean.vcf
This is accepted by snakemake and prevents the above error, but now file1.clean.vcf and file2.clean.vcf are both inputs to the rule, while the {name} wildcard makes the rule run once per file. This ends up as a many-files-to-one-file rule rather than the one-to-one mapping that I am looking for.
Is there a way to set this up so I can get the output of the rule conv_output to act on each of the files then put the output in dir3? Any help would be greatly appreciated!!
Using Python, pair each input vcf sample name with its path, and then use that mapping to specify the input path in the snakemake rule. The example below works for the directory structure given in the question.
from pathlib import Path
def pair_name_to_infiles():
    # get all *.clean.vcf files recursively under DIR_A
    vcf_path = Path('DIR_A').glob('**/*.clean.vcf')
    # pair vcf name to infile path using a dictionary
    vcf_infiles_dict = {}
    for f in vcf_path:
        vcf_name = f.name.replace('.clean.vcf', '')
        vcf_infiles_dict[vcf_name] = str(f)
    return vcf_infiles_dict
# using function written in python code, map vcf name to their infile path
vcf_infiles_dict = pair_name_to_infiles()
rule all:
    input:
        expand('DIR_A/dir3/{vcf_name}.output.vcf', vcf_name=vcf_infiles_dict.keys())
rule foo:
    input:
        lambda wildcards: vcf_infiles_dict[wildcards.vcf_name]
    output:
        'DIR_A/dir3/{vcf_name}.output.vcf'
    shell:
        'touch {output}'
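As for why the first attempt in the question failed: by default expand() builds every combination of its arguments, whereas the zip variant pairs them up position by position. The following plain-Python sketch mimics both behaviours with itertools.product and zip. The lists DIR and NAME are written out by hand here as an assumption of what glob_wildcards would return for the directory structure above.

```python
from itertools import product

# Assumed result of glob_wildcards for the structure in the question.
DIR = ["dir1", "dir2"]
NAME = ["file1", "file2"]
pattern = "DIR_A/{dir}/{name}.clean.vcf"

# Default behaviour: every dir combined with every name.
all_combinations = [pattern.format(dir=d, name=n) for d, n in product(DIR, NAME)]

# zip behaviour: first dir with first name, second with second, and so on.
pairwise = [pattern.format(dir=d, name=n) for d, n in zip(DIR, NAME)]

print(all_combinations)
print(pairwise)  # ['DIR_A/dir1/file1.clean.vcf', 'DIR_A/dir2/file2.clean.vcf']
```

The first list contains DIR_A/dir1/file2.clean.vcf and DIR_A/dir2/file1.clean.vcf, which are exactly the files reported in the MissingInputException. The second list contains only existing files, but it is still the full list for every job, which is why a per-name lookup as in the answer above is needed for a one-to-one rule.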