glob_wildcards on multiple directories with different file names - snakemake

I am trying to write a rule that takes two files from different directories and then puts the output of the rule into one common directory, as in the file structure below:
DIR_A
    dir1
        file1.clean.vcf
    dir2
        file2.clean.vcf
    dir3
        file1.output.vcf
        file2.output.vcf
So far I have tried using glob_wildcards:
(DIR,NAME) = glob_wildcards("DIR_A/{dir}/{name}.clean.vcf")
input: expand("DIR_A/{dir}/{name}.clean.vcf", dir=DIR, name=NAME)
output: "DIR_A/dir3/{name}.output.vcf"
but it throws an error:
MissingInputException in line 80 of DIR_A:
Missing input files for rule convert_output:
DIR_A/dir1/file2.clean.vcf
DIR_A/dir2/file1.clean.vcf
Adding zip to the input:
input: expand("DIR_A/{dir}/{name}.clean.vcf", zip, dir=DIR, name=NAME)
A dry run ($ snakemake -s snakefile -n) then shows:
rule conv_output:
    input: DIR_A/dir1/file1.clean.vcf, DIR_A/file2/file2.clean.vcf
This is accepted by snakemake and prevents the above error, but now file1.clean.vcf and file2.clean.vcf are both inputs to the rule, while the {name} wildcard makes the rule run once per file. This ends up as a many-files-to-one-file rule rather than the one-to-one mapping that I am looking for.
Is there a way to set this up so I can get the output of the rule conv_output to act on each of the files then put the output in dir3? Any help would be greatly appreciated!!
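The two behaviours described above can be reproduced in plain Python. This is only a sketch of what expand does with and without zip, assuming glob_wildcards returned the two lists shown (DIR and NAME as in the question); it does not call Snakemake itself.

```python
from itertools import product

# Values as glob_wildcards would return them for the layout in the question:
# one entry per matched file, kept in matching order (assumed here).
DIR = ["dir1", "dir2"]
NAME = ["file1", "file2"]

pattern = "DIR_A/{dir}/{name}.clean.vcf"

# Default expand behaviour: every dir is combined with every name.
all_combinations = [pattern.format(dir=d, name=n) for d, n in product(DIR, NAME)]

# expand(..., zip, ...) behaviour: the i-th dir is paired with the i-th name.
paired = [pattern.format(dir=d, name=n) for d, n in zip(DIR, NAME)]

print(all_combinations)
print(paired)
```

The first list contains DIR_A/dir1/file2.clean.vcf and DIR_A/dir2/file1.clean.vcf, which is where the MissingInputException comes from. The second list contains only existing files, but it is still the complete list, so every job of the rule receives all of them.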

Using Python, pair each input VCF name with its path, and then use that mapping to specify the input path in the snakemake rule. The example below works for the directory structure given in the question.
from pathlib import Path
def pair_name_to_infiles():
    # get all *.clean.vcf files recursively under DIR_A
    vcf_path = Path('DIR_A').glob('**/*.clean.vcf')
    # pair vcf name to infile path using a dictionary
    vcf_infiles_dict = {}
    for f in vcf_path:
        vcf_name = f.name.replace('.clean.vcf', '')
        vcf_infiles_dict[vcf_name] = str(f)
    return vcf_infiles_dict
# using function written in python code, map vcf name to their infile path
vcf_infiles_dict = pair_name_to_infiles()
rule all:
    input:
        expand('DIR_A/dir3/{vcf_name}.output.vcf', vcf_name=vcf_infiles_dict.keys())
rule foo:
    input:
        lambda wildcards: vcf_infiles_dict[wildcards.vcf_name]
    output:
        'DIR_A/dir3/{vcf_name}.output.vcf'
    shell:
        'touch {output}'
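One caveat with the dictionary approach: if the same file name appears in two directories, the later path silently replaces the earlier one. The variant below is a sketch that raises instead. The root and suffix parameters are additions for illustration and are not part of the answer above.

```python
from pathlib import Path

def pair_name_to_infiles(root="DIR_A", suffix=".clean.vcf"):
    """Map each VCF name to its path, refusing names that occur twice."""
    infiles = {}
    # sorted() makes the order (and therefore the error message) reproducible
    for path in sorted(Path(root).glob(f"**/*{suffix}")):
        name = path.name[: -len(suffix)]
        if name in infiles:
            raise ValueError(f"{name!r} found in both {infiles[name]} and {path}")
        infiles[name] = str(path)
    return infiles
```

Called without arguments it should behave like the original function for the directory layout in the question.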

Related

Discard part of filename in Snakemake: "Wildcards in input files cannot be determined from output files"

I am running into a WildcardError: Wildcards in input files cannot be determined from output files problem with Snakemake. The issue is that I don't want to keep a variable part of my input file name. For instance, suppose I have these files.
$ mkdir input
$ touch input/a-foo.txt
$ touch input/b-wsdfg.txt
$ touch input/c-3523.txt
And I have a Snakemake file like this:
subjects = ['a', 'b', 'c']
result_pattern = "output/{kind}.txt"
rule all:
    input:
        expand(result_pattern, kind=subjects)
rule step1:
    input:
        "input/{kind}-{fluff}.txt"
    output:
        "output/{kind}.txt"
    shell:
        """
        cp {input} {output}
        """
I want the output file names to just have the part I'm interested in. I understand the principle that every wildcard in input needs a corresponding wildcard in output. So is what I'm trying to do a sort of anti-pattern? For instance, I suppose there could be two files input/a-foo.txt and input/a-bar.txt, and they would overwrite each other. Should I be renaming my input files prior to feeding into snakemake?
"I want the output file names to just have the part I'm interested in [...]. I suppose there could be two files input/a-foo.txt and input/a-bar.txt, and they would overwrite each other."
It seems to me you need to decide how to resolve such conflicts. If the input files are:
input/a-bar.txt
input/a-foo.txt <- Note duplicate {a}
input/b-wsdfg.txt
input/c-3523.txt
How do you want the output files to be named and according to what criteria? The answer is independent of snakemake, but depending on your circumstances you could include python code within the Snakefile to handle such conflicts automatically.
Basically, once you make such decisions you can work on the solution.
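As a starting point for that decision, a small helper can at least report which {kind} values are ambiguous. This is a sketch only: group_by_kind and find_conflicts are hypothetical names, and the input/<kind>-<fluff>.txt layout is taken from the question.

```python
import glob
import os
from collections import defaultdict

def group_by_kind(input_dir="input"):
    """Group <kind>-<fluff>.txt files in input_dir by their <kind> prefix."""
    groups = defaultdict(list)
    for path in sorted(glob.glob(os.path.join(input_dir, "*-*.txt"))):
        # everything before the first "-" is treated as the kind
        kind = os.path.basename(path).split("-", 1)[0]
        groups[kind].append(path)
    return dict(groups)

def find_conflicts(groups):
    """Return only the kinds that map to more than one input file."""
    return {kind: paths for kind, paths in groups.items() if len(paths) > 1}
```

With the four files listed above, find_conflicts(group_by_kind()) would report only kind a, together with both candidate paths.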
"But suppose there are no file name conflicts, it seems like the wildcard system doesn't handle cases where you want to remove some variable fluff from a filename"
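The variable part can be handled using Python's glob module. The pattern has to be expanded inside an input function, because a bare glob.glob("input/{kind}-*.txt") in the input section would run when the Snakefile is parsed, before {kind} has a value:
import glob
...
rule step1:
    input:
        lambda wildcards: glob.glob("input/%s-*.txt" % wildcards.kind)
    output:
        "output/{kind}.txt"
    shell:
        """
        cp {input} {output}
        """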
You could even be more elaborate and use a dedicated function to match files given the {kind} wildcard:
def get_kind_files(wc):
    ff = glob.glob("input/%s-*.txt" % wc.kind)
    if len(ff) != 1:
        raise Exception('Expected exactly 1 file for kind "%s"' % wc.kind)
    # Possibly more checks that you got the right file
    return ff
rule step1:
    input:
        get_kind_files,
    output:
        "output/{kind}.txt"
    shell:
        """
        cp {input} {output}
        """

Snakemake copy one file to multiple files

I have several folders
folder_1234
folder_4321
I want to copy one file, myfile.sh, from one folder (where it already exists) to all folders and then change the number inside the file using sed:
SAMPLES=["1234","4321"]
rule all:
    input:
        expand("workdir/folder_{sample}/myfile.sh", sample=SAMPLES)
rule copy:
    input:
        copy_from="/path/to/folder_1234/myfile.sh"
    output:
        copy_to="workdir/folder_{sample}/myfile.sh"
    shell:
        """
        cp {input.copy_from} {output.copy_to}
        sed "s/folder_1234/folder_{sample}/g" folder_{sample}/myfile.sh
        """
This gives me an error:
NameError: The name 'sample' is unknown in this context. Did you mean 'wildcards.sample'?
In the shell command, the wildcard has to be written as {wildcards.sample} instead of {sample}. See the Snakemake documentation on wildcards for details.
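Separately from the wildcard syntax, the sed call in the question has no -i flag, so it would print the edited text to standard output rather than change a file, and it points at folder_{sample}/myfile.sh rather than the declared output under workdir/. One way to sidestep both issues is to do the copy and the substitution in one step in Python. The following is a sketch under that assumption; copy_and_retarget is a hypothetical helper name.

```python
from pathlib import Path

def copy_and_retarget(src, dst, old, new):
    """Write the contents of src to dst with every occurrence of old replaced by new."""
    text = Path(src).read_text()
    dst = Path(dst)
    # create the destination folder if it does not exist yet
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(text.replace(old, new))
```

Inside a run: block it could be called as copy_and_retarget(input.copy_from, output.copy_to, "folder_1234", "folder_" + wildcards.sample).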

Specify input and output files in Snakefile

I'm new to Snakemake and I want to make a pipeline that takes a given input text file and concatenates its content to a given output file. However, I want to be able to specify the names of both the input and output files at run time, so that neither file name is hardcoded in the Snakefile. Right now all I can come up with is:
rule all:
    input:
        "{input}.txt",
        "{output}.txt"
rule output_files:
    input:
        "{input}.txt"
    output:
        "{output}.txt"
    shell:
        "cat {input}.txt > {output}.txt"
I tried running this with "snakemake input1.txt output.txt" but I got the error:
Building DAG of jobs...
WildcardError in line 6 of Snakefile:
Wildcards in input files cannot be determined from output files:
'input'
Any suggestions would be greatly appreciated.
In your example you actually copy a single input file into an output file using a cat shell command. That could be understood as an intention to concatenate several inputs into one output:
rule concatenate:
    input:
        "input1.txt",
        "input2.txt"
    output:
        "output.txt"
    shell:
        "cat {input} > {output}"
"takes a given input text file and concatenates its content to a given output file"
Another way to understand the question is that you are trying to append an input file to the end of the output. That is more challenging: Snakemake "thinks" in terms of goals where each goal is a distinct file. How would Snakemake know if the output file is a raw one or if it is a concatenated version? One way to do that is to have "flag" files: the presence of such a file would mean that the goal is achieved and no concatenation is needed. One more problem: Snakemake clears the output file before running the rule. That means that you need to specify it as an input:
rule append:
    input:
        # "in" is a reserved word in Python, so the inputs are named src and dst
        src = "input.txt",
        dst = "output.txt"
    output:
        flag = "flag"
    shell:
        "cat {input.src} >> {input.dst} && touch {output.flag}"
Now back to your question regarding the error and the way to specify the filenames at runtime. You get this error because the wildcards must be fully inferred from the output section, and both your rules are ill-formed. Let's start with the rule all.
You need to tell Snakemake what goal you are building. No wildcards in the input; everything must be unambiguous:
def getInput():
    # form the actual goal (you may query the database, service, hardcode, etc.)
    pass
rule all:
    input: getInput()
Let's say you decided that the goal should be 3 files: ["output1.txt", "output2.txt", "output3.txt"]:
def getInput():
    magic_numbers_from_oracle = ["1", "2", "3"]
    return magic_numbers_from_oracle
rule all:
    input: expand("output{number}.txt", number=getInput())
Ok, now Snakemake knows the goal. The next step is to write a rule that says how to create a single output{number}.txt file. For simplicity I'm taking your initial approach with cat/copying:
rule cat_copy:
    input:
        "input{n}.txt"
    output:
        "output{n}.txt"
    shell:
        "cat {input} > {output}"
That's it. As long as you have files input1.txt, input2.txt, input3.txt you would get the corresponding outputs.
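The statement that wildcards are inferred from the output section can be illustrated in plain Python. This is a rough sketch of the idea, not Snakemake's actual implementation: infer_wildcards is a hypothetical helper, and it ignores wildcard constraints and patterns that repeat a wildcard name.

```python
import re

def infer_wildcards(output_pattern, target):
    """Match a requested file against an output pattern and return its wildcards."""
    # Split on {name} placeholders; odd positions hold the wildcard names.
    parts = re.split(r"\{(\w+)\}", output_pattern)
    regex = ""
    for i, part in enumerate(parts):
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    match = re.fullmatch(regex, target)
    return match.groupdict() if match else None

# The requested target fixes n, and n then determines the input file.
wildcards = infer_wildcards("output{n}.txt", "output2.txt")
print(wildcards)
print("input{n}.txt".format(**wildcards))
```

Applying the same helper to the original rule shows the problem: matching "{output}.txt" yields only an output wildcard, so filling in "{input}.txt" fails because nothing supplies input.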

Snakemake always rebuilds targets, even when up to date

I'm new to snakemake and running into some behavior I don't understand. I have a set of fastq files with file names following the standard Illumina convention:
SAMPLENAME_SAMPLENUMBER_LANE_READ_001.fastq.gz
In a directory reads/raw_fastq. I'd like to create symbolic links to simplify the names to follow the pattern:
SAMPLENAME_READ.fastq.gz
In a directory reads/renamed_raw_fastq
My aim is that as I add new fastq files to the project, snakemake will create symlinks only for the newly-added files.
My snakefile is as follows:
# Get sample names from read file names in the "raw" directory
readRootDir = 'reads/'
readRawDir = readRootDir + 'raw_fastq/'
import os
samples = list(set([x.split('_', 1)[0] for x in os.listdir(readRawDir)]))
samples.sort()
# Generate simplified names
readRenamedRawDir = readRootDir + 'renamed_raw_fastq/'
newNames = expand(readRenamedRawDir + "{sample}_{read}.fastq.gz", sample = samples, read = ["R1", "R2"])
# Create symlinks
import glob
def getRawName(wildcards):
    rawName = glob.glob(readRawDir + wildcards.sample + "_*_" + wildcards.read + "_001.fastq.gz")[0]
    return rawName
rule all:
input: newNames
rule rename:
input: getRawName
output: "reads/renamed_raw_fastq/{sample}_{read}.fastq.gz"
shell: "ln -sf {input} {output}"
When I run snakemake, it tries to generate the symlinks as expected but:
Always tries to create the target symlinks, even when they already exist and have later timestamps than the source fastq files.
Throws errors like:
MissingOutputException in line 68 of /work/nick/FAW-MIPs/renameRaw.snakefile:
Missing files after 5 seconds:
reads/renamed_raw_fastq/Ben21_R2.fastq.gz
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
It's almost like snakemake isn't seeing the output files it creates. Can anyone suggest what I might be missing here?
Thanks!
I think
ln -sf {input} {output}
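gives a symlink pointing to a missing file, i.e., it doesn't point to the source file. The relative path in {input} is stored in the link as-is and is then resolved relative to the directory containing the link (reads/renamed_raw_fastq/), not relative to the working directory. You could fix it by e.g. using absolute paths, like: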
def getRawName(wildcards):
    rawName = os.path.abspath(glob.glob(readRawDir + wildcards.sample + "_*_" + wildcards.read + "_001.fastq.gz")[0])
    return rawName
(As an aside, I would make sure that renaming fastq files the way you do doesn't result in a name-collision, for example when the same sample is sequenced on different lanes of the same flow cell.)
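The dangling-link behaviour can be checked outside Snakemake. The sketch below assumes a POSIX system where symlinks are allowed, uses a made-up raw file name that follows the Illumina pattern, and compares three ways of writing the link target.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    raw_dir = os.path.join(workdir, "reads", "raw_fastq")
    link_dir = os.path.join(workdir, "reads", "renamed_raw_fastq")
    os.makedirs(raw_dir)
    os.makedirs(link_dir)

    # Hypothetical raw file name, for illustration only.
    raw_name = "Ben21_S1_L001_R2_001.fastq.gz"
    raw_path = os.path.join(raw_dir, raw_name)
    open(raw_path, "w").close()

    # 1. Target stored exactly as the rule passes it (relative to the project root).
    broken = os.path.join(link_dir, "broken.fastq.gz")
    os.symlink(os.path.join("reads", "raw_fastq", raw_name), broken)

    # 2. Absolute target, as suggested in the answer.
    absolute = os.path.join(link_dir, "absolute.fastq.gz")
    os.symlink(os.path.abspath(raw_path), absolute)

    # 3. Target expressed relative to the directory that holds the link.
    relative = os.path.join(link_dir, "relative.fastq.gz")
    os.symlink(os.path.relpath(raw_path, start=link_dir), relative)

    # os.path.exists follows the link, so it is False for a dangling one.
    results = {
        name: os.path.exists(path)
        for name, path in [("broken", broken), ("absolute", absolute), ("relative", relative)]
    }

print(results)
```

Only the first link is dangling. The third variant is an alternative to absolute paths that keeps the links valid if the whole project directory is moved.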

Snakemake: dependencies which are not input

I'd like to know if there is a way in Snakemake to define a dependency which is actually not an input file.
What I mean by that is that there are programs that expect some files to exist even though they are not provided on the command line.
Let's consider bwa as an example.
This is a rule from Johannes Köster's mapping rules:
rule bwa_mem_map:
    input:
        lambda wildcards: config["references"][wildcards.reference],
        lambda wildcards: config["units"][wildcards.unit]
    output:
        "mapping/{reference}/units/{unit}.bam"
    params:
        sample=lambda wildcards: UNIT_TO_SAMPLE[wildcards.unit],
        custom=config.get("params_bwa_mem", "")
    log:
        "mapping/log/{reference}/{unit}.log"
    threads: 8
    shell:
        "bwa mem {params.custom} "
        r"-R '#RG\tID:{wildcards.unit}\t"
        "SM:{params.sample}\tPL:{config[platform]}' "
        "-t {threads} {input} 2> {log} "
        "| samtools view -Sbh - > {output}"
Here, bwa expects the genome index file to exist even though it is not a command-line argument (the path to the index file is deduced from the genome path).
Is there a way to tell Snakemake that the index file is a dependency, so that Snakemake checks whether one of its rules can generate this file?
I suppose you could still rewrite your rule inputs as:
rule bwa_mem_map:
    input:
        genome=lambda wildcards: config["references"][wildcards.reference],
        fastq=lambda wildcards: config["units"][wildcards.unit],
        index="foo.idx"
And adapt the rule's run part accordingly.
Is it the best solution?
Thanks in advance.
Benoist
I think the only way snakemake handles dependencies between rules is through files, so I'd say you are doing it correctly when you put the index file explicitly as an input for your mapping rule, even though this file does not appear in the mapping command.
For what it's worth, I do the same for bam index files, which are an implicit dependency for some tools: I put both the sorted bam file and its index as input, but only use the bam file in the shell or run part. A separate rule generates both and declares the two files as its output.
Input and output files do not need to appear in the shell / run parts of a rule.
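For the bwa case, the index paths can be derived from the genome path and returned from an input function. The sketch below assumes the five suffixes written by the classic bwa index command; other aligners or bwa variants may produce different files. bwa_index_files and the example path are hypothetical.

```python
# Suffixes assumed to be written by `bwa index <genome>` next to the genome file.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")

def bwa_index_files(genome):
    """Return the index files expected alongside the given genome FASTA."""
    return [genome + suffix for suffix in BWA_INDEX_SUFFIXES]

# Example with a made-up reference path.
print(bwa_index_files("refs/genome.fa"))
```

In the mapping rule, this could be used as index=lambda wildcards: bwa_index_files(config["references"][wildcards.reference]), with a separate indexing rule declaring the same files as its output.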