Snakemake - input function exception

I am trying to run Snakemake using a .json file as input. While checking the dry run I got the following error:
InputFunctionException in line 172 of /home/Snakefile_ChIPseq_pe:
KeyError: '130241_1'
Wildcards:
library=130241_1
This is the relevant part of the Snakemake code:
rule findPeaks:
    input:
        sample = os.path.join(HOMERTAG_DIR, "{library}"),
        input = lambda wildcards: os.path.join(HOMERTAG_DIR, config['lib_input'][wildcards.library])
    output:
        os.path.join(HOMERPEAK_DIR, "{library}.all.hpeaks")
    params:
        config['homer_findPeaks_params']
    shell:
        "findPeaks {input.sample} -i {input.input} {params} -o {output}"
There is a single quote around the input sample which is missing in the 'lib_input' part. How do I add that single quote ahead of the variable?
Also, the library names are like 12345_1, 12345_2, etc. I never had this problem before, but this is the first time I have libraries with an underscore in the names.
Snakemake will first try to interpret the given value as a number. Only if that fails does it interpret the value as a string. Here it does not fail, because the underscore _ is interpreted as a thousands separator.
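You can see the parsing behaviour described above in plain Python (this assumes Python 3.6+, where PEP 515 made underscores legal digit separators, including when int() parses strings):

# A library ID like 130241_1 parses cleanly as an integer, so the
# "try a number first" step never falls back to treating it as a string.
print(int("130241_1"))  # 1302411, no ValueError
try:
    int("130241x1")     # a non-digit character, by contrast, does fail
except ValueError as e:
    print(e)            # invalid literal for int() with base 10: '130241x1'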

My guess is that in your json file the library IDs are not quoted. E.g. you have this:
{
    "lib_input": {1234_1: "input.txt"}
}
Instead of:
{
    "lib_input": {"1234_1": "input.txt"}
}
Or maybe library 130241_1 is not in the json at all?
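To check, here is a minimal sketch (assuming your config file is named config.json; adjust the path) that prints what Snakemake will actually see:

import json

# If the library IDs are unquoted, json.load itself fails with a
# JSONDecodeError, which would already confirm the first guess.
with open("config.json") as fh:  # hypothetical config path
    cfg = json.load(fh)
print(list(cfg["lib_input"].keys()))   # the library IDs as loaded
print("130241_1" in cfg["lib_input"])  # False would explain the KeyError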


Snakemake pipeline not attempting to produce output?

I have a relatively simple Snakemake pipeline, but when I run it I get "missing files" errors for everything in rule all:
refseq = 'refseq.fasta'
reads = ['_R1_001', '_R2_001']

def getsamples():
    import glob
    test = (glob.glob("*.fastq"))
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit('_', 2)[0])
    return(samples)

def getbarcodes():
    with open('unique.barcodes.txt') as file:
        lines = [line.rstrip() for line in file]
    return(lines)

rule all:
    input:
        expand("grepped/{barcodes}{sample}_R1_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples()),
        expand("grepped/{barcodes}{sample}_R2_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples())
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"

rule fastq_grep:
    input:
        R1 = "{sample}_R1_001.fastq",
        R2 = "{sample}_R2_001.fastq"
    output:
        out1 = "grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2 = "grepped/{barcodes}{sample}_R2_001.plate.fastq"
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
The output files listed by the terminal seem correct, so Snakemake sees what I want to produce, but the shell commands never run anything at all.
I want to produce a list of files that have been grepped for the list of barcodes I have in a file, but I get "Missing input files for rule all:".
There are two issues:
You have an impossible wildcard constraint defined for {barcodes}
Your two wildcards {barcodes} and {sample} are competing with each other.
Remove the wildcard_constraints from your two rules and add the following lines to the top of your Snakefile:
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",
The constraint for {barcodes} now matches only capital letters. Before, it also required an end-of-line match (the trailing $), which was impossible for this wildcard to satisfy because additional text follows it in the file path.
The constraint for {sample} ensures that the part of the filename starting with "Well..." is interpreted as the start of the {sample} wildcard. Otherwise you'd get something unwanted like barcodes=ACGGTW instead of barcodes=ACGGT.
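To see why the trailing $ was fatal, here is a rough sketch (an approximation, not Snakemake's exact internals) of the regex built for the output path, with the old and new constraints substituted in (character class simplified):

import re

old = r"grepped/(?P<barcodes>[a-zA-Z]+$)(?P<sample>.+)_R1_001\.plate\.fastq"
new = r"grepped/(?P<barcodes>[A-Z]+)(?P<sample>Well.*)_R1_001\.plate\.fastq"

path = "grepped/ACGGTWellA1_R1_001.plate.fastq"  # hypothetical file name
print(re.match(old, path))  # None: $ demands end-of-string in mid-path
print(re.match(new, path).groupdict())
# {'barcodes': 'ACGGT', 'sample': 'WellA1'}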
A note of advice:
I usually find it easier to separate wildcards into directory structures rather than having multiple wildcards in the same filename. In your case that would mean having a structure like
grepped/{barcodes}/{sample}_R1_001.plate.fastq.
Full suggested Snakefile (formatted using snakefmt)
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",

refseq = "refseq.fasta"
reads = ["_R1_001", "_R2_001"]

def getsamples():
    import glob
    test = glob.glob("*.fastq")
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit("_", 2)[0])
    return samples

def getbarcodes():
    with open("unique.barcodes.txt") as file:
        lines = [line.rstrip() for line in file]
    return lines

rule all:
    input:
        expand(
            "grepped/{barcodes}{sample}_R1_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),
        expand(
            "grepped/{barcodes}{sample}_R2_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),

rule fastq_grep:
    input:
        R1="{sample}_R1_001.fastq",
        R2="{sample}_R2_001.fastq",
    output:
        out1="grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2="grepped/{barcodes}{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
In addition to #euronion's answer (+1), I prefer to constrain wildcards to match only and exactly the list of values you expect. This means disabling the regex matching altogether. In your case, I would do something like:
import re

wildcard_constraints:
    barcodes='|'.join([re.escape(x) for x in getbarcodes()]),
    sample='|'.join([re.escape(x) for x in getsamples()]),
Now {barcodes} is allowed to match only the values returned by getbarcodes(), whatever they are, and the same goes for {sample}. In my opinion this is better than trying to anticipate every regex combination a wildcard could take.
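For example (with hypothetical values), the joined constraint is just an alternation of escaped literals:

import re

barcodes = ["ACGGT", "TT.AA"]  # stand-in for getbarcodes()
print('|'.join([re.escape(x) for x in barcodes]))
# ACGGT|TT\.AA  -- the dot is escaped, so it matches only a literal '.'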

How to stop snakemake from adding non file endings to wildcards when using expand function? (.g.vcf fails, .vcf works)

Adding .g.vcf instead of .vcf after the variable in the expand function somehow adds the .g to a wildcard in another module.
I have tried the following in the all rule:
{stuff}.g.vcf
{stuff}"+"g.vcf"
{stuff}_var"+".g.vcf"
{stuff}.t.vcf
All of them fail, but {stuff}.gvcf or {stuff}.vcf work.
Error:
InputFunctionException in line 21 of snake_modules/mark_duplicates.snakefile:
KeyError: 'Mother.g'
Wildcards:
lane=Mother.g
Code:
LANES = config["list2"].split()

rule all:
    input:
        expand(projectDir + "results/alignments/variants/{stuff}.g.vcf", stuff=LANES)

rule mark_duplicates:
    """ this will mark duplicates for bam files from the same sample and library """
    input:
        get_lanes
    output:
        projectDir + "results/alignments/markdups/{lane}.markdup.bam"
    log:
        projectDir + "logs/" + stamp + "_{lane}_markdup.log"
    shell:
        " input=$(echo '{input}' | sed -e s'/ / I=/g') && java -jar /home/apps/pipelines/picard-tools/CURRENT MarkDuplicates I=$input O={projectDir}results/alignments/markdups/{wildcards.lane}.markdup.bam M={projectDir}results/alignments/markdups/{wildcards.lane}.markdup_metrics.txt &> {log}"
I want my final output to have the {stuff}.g.vcf notation. Note that this output is created in another Snakemake module, but the error appears in mark_duplicates, which runs before that module.
I have tried multiple changes, but it is the .g.vcf in the all rule that causes the issue.
My guess is that {lane} is interpreted as a regular expression and is capturing more than it should. Try adding the following before rule all:
import re

wildcard_constraints:
    stuff='|'.join([re.escape(x) for x in LANES]),
    lane='|'.join([re.escape(x) for x in LANES])
(See also this thread https://groups.google.com/forum/#!topic/snakemake/wVlJW9X-9EU)
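The underlying mechanics, as a small sketch: without a constraint, every wildcard defaults to the regex .+, which greedily matches across dots:

import re

# For an output like "{lane}.vcf", Snakemake effectively matches:
m = re.match(r"(?P<lane>.+)\.vcf", "Mother.g.vcf")
print(m.group("lane"))  # Mother.g -- the ".g" is swallowed by {lane}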

Aggregate undetermined number of files for all wildcards in one rule

I have a set of files which will be individually processed to produce multiple files. Exactly how many files is unknown before runtime. (If it matters, this is demultiplexing DNA sequencing results.) I then have a script which takes all of these files at once.
Right now I have something like this:
import glob  # needed by find_outputs below

checkpoint demultiplex:
    input: "{sample}.fastq"
    output: directory("{sample}")
    shell:
        # in reality the number of output files is not known
        "mkdir -p {output} && "
        "touch {output}/{wildcards.sample}-1.fastq && "
        "touch {output}/{wildcards.sample}-2.fastq && "
        "touch {output}/{wildcards.sample}-3.fastq"

def find_outputs(wildcards):
    outdir = checkpoints.demultiplex.get(**wildcards)
    return glob.glob("{sample}/{sample}-*.fastq".format_map(wildcards))

rule analysis:
    input: find_outputs
    output: "results.txt"
    script: "scripts/do_analysis.R"
This obviously doesn't work, because the values of {sample} (assume they should be A, B, C, D) are never defined.
As I was writing the question, I came up with this answer, which seems to work. However, if you have something cleaner, I would be happy to accept it!
For checkpoints.<rule>.get() to work its magic, it has to be called in the body of a function that is passed as a reference, not called directly. Also, this function needs to take exactly one argument, wildcards.
So we make a function that returns closures with the behavior we need. The value of wildcards (which will be empty in this case) is ignored, allowing us to specify the sample values manually.
def find_outputs(sample):
    def f(wildcards):
        checkpoints.demultiplex.get(sample=sample)
        return glob.glob("{sample}/{sample}-*.fastq".format(sample=sample))
    return f

rule analysis:
    input:
        find_outputs("A"),
        find_outputs("B"),
        find_outputs("C"),
        find_outputs("D")
    output: "results.txt"
    script: "scripts/do_analysis.R"
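If you'd rather not spell out find_outputs once per sample, a minimal variant (a sketch; it assumes the sample names are known up front in a SAMPLES list) is to call the closures from a single input function and flatten the results:

SAMPLES = ["A", "B", "C", "D"]  # assumed to be known up front

rule analysis:
    input:
        # call each per-sample closure and flatten the resulting file lists
        lambda wildcards: [f for s in SAMPLES for f in find_outputs(s)(wildcards)]
    output: "results.txt"
    script: "scripts/do_analysis.R"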

Snakemake Using expand with dictionary

I am writing this rule:
rule process_files:
    input:
        dataout=expand("{{dataset}}/{{sample}}.{{ref}}.{{state}}.{{case}}.myresult.{name}.tsv", name=my_list[wildcards.ref])
    output:
        "{dataset}/{sample}.{ref}.{state}.{case}.endresult.tsv"
    shell:
        do something ...
Here expand should get its values from the dictionary my_dictionary based on the ref value. I used wildcards like this: my_dictionary[wildcards.ref]. But it ends up with this error: name 'wildcards' is not defined.
my_dictionary is something like:
{A: [1, 2, 3], B: [s1, s2, ...], ...}
I could use

def myfun(wildcards):
    return expand("{{dataset}}/{{sample}}.{{ref}}.{{state}}.{{case}}.myresult.{name}.tsv", name=my_dictionary[wildcards.ref])

and use myfun as the input, but this does not answer why I cannot use expand in place directly.
Any suggestion how to fix it?
As #dariober mentioned, there is the wildcards object, but it is only accessible in the run/shell portion of a rule; in the input section it can be accessed through an input function.
Here is an example implementation that expands the input based on wildcards.ref:
rule all:
    input:
        expand("{dataset}/{sample}.{ref}.{state}.{case}.endresult.tsv", dataset=["D1", "D2"], sample=["S1", "S2"], ref=["R1", "R2"], state=["STATE1", "STATE2"], case=["C1", "C2"])

my_list = {"R1": [1, 2, 3], "R2": ["s1", "s2"]}

rule process_files:
    input:
        lambda wildcards: expand(
            "{{dataset}}/{{sample}}.{{ref}}.{{state}}.{{case}}.myresult.{name}.tsv",
            name=my_list[wildcards.ref])
    output:
        "{dataset}/{sample}.{ref}.{state}.{case}.endresult.tsv"
    shell:
        "echo '{input}' > {output}"
If you implement it as the lambda function example above, it should resolve the issue you mention:
"The function worked but it did not resolve the variable between double curly braces, so it will ask for input for {dataset}/{sample}.{ref}.{state}.{case} and raise an error."
Your question seems similar to snakemake wildcards or expand command, and the bottom line is that wildcards is not defined in the input section. So your solution of using an input function (or a lambda function) is correct.
(As to why wildcards is not defined in input, I don't know...)
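For reference, expand itself handles the double curly braces as advertised; what is missing in the input section is only the wildcards object. A quick demonstration (inside a Snakefile, expand is available without the import; it is only needed to try this in plain Python):

from snakemake.io import expand

# Double braces survive expansion as single braces, i.e. as wildcards
# still to be resolved by Snakemake's matching step:
print(expand("{{dataset}}/{{sample}}.myresult.{name}.tsv", name=[1, 2]))
# ['{dataset}/{sample}.myresult.1.tsv', '{dataset}/{sample}.myresult.2.tsv']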

InputFunctionException with KeyError

Suppose I have code in Python that generates a dictionary as its result. I need to write each element of the dictionary into a separate folder, which will later be used by another set of rules in Snakemake.
I have written the code as follows, but it does not work!
simulation_index_dict = {1: 'test1', 2: 'test2'}

def indexer(wildcards):
    return(simulation_index_dict[wildcards.simulation_index])

rule SimulateAll:
    input:
        expand("{simulation_index}/ProteinCodingGene/alfsim.drw", simulation_index=simulation_index_dict.keys())

rule simulate_phylogeny:
    output:
        ProteinCodingGeneParams=expand("{{simulation_index}}/ProteinCodingGene/alfsim.drw"),
        IntergenicRegionParams=expand("{{simulation_index}}/IntergenicRegions/dawg_IR.dawg"),
        RNAGeneParams=expand("{{simulation_index}}/IntergenicRegions/dawg_RG.dawg"),
        RepeatRegionParams=expand("{{simulation_index}}/IntergenicRegions/dawg_RR.dawg"),
    params:
        value=indexer,
    shell:
        """
        echo {params.value} > {output.ProteinCodingGeneParams}
        echo {params.value} > {output.IntergenicRegionParams}
        echo {params.value} > {output.RNAGeneParams}
        echo {params.value} > {output.RepeatRegionParams}
        """
The error it returns is:
InputFunctionException in line 14 of /$/test.snake:
KeyError: '1'
Wildcards:
simulation_index=1
It seems that the problem is with the params section of the rule, because deleting it eliminates the error, but I cannot figure out what is wrong with the params!
The solution: using strings as dictionary keys
One can guess from the error message (KeyError: '1') that a dictionary lookup went wrong on the key '1', which happens to be a string.
However, the dictionary used in the indexer "params" function has integers as keys.
Using strings instead of ints as keys of this simulation_index_dict dictionary solves the problem (see the comments below the question).
The cause: loss of type information during workflow inference
The cause of the problem is likely that the integer nature of the value assigned to the simulation_index parameter of the expand in SimulateAll (inherited from simulation_index_dict.keys()) is "forgotten" in subsequent steps of the workflow inference.
Indeed, the expand results in a list of strings, which are then matched against the outputs of the other rules (which also consist of strings) to infer the values of the wildcards attributes (which are also strings). Therefore, when the indexer function is executed, wildcards.simulation_index is a string, and this causes a KeyError when it is looked up in simulation_index_dict.
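A minimal sketch of the two corresponding fixes (use one or the other, not both):

# (a) Use string keys from the start, matching the wildcard's type:
simulation_index_dict = {'1': 'test1', '2': 'test2'}

# (b) Keep integer keys, but cast the wildcard back to int in the lookup:
def indexer(wildcards):
    return simulation_index_dict[int(wildcards.simulation_index)]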