Snakemake: Using expand with dictionary

I am writing this rule:
rule process_files:
    input:
        dataout=expand("{{dataset}}/{{sample}}.{{ref}}.{{state}}.{{case}}.myresult.{name}.tsv", name=my_dictionary[wildcards.ref])
    output:
        "{dataset}/{sample}.{ref}.{state}.{case}.endresult.tsv"
    shell:
        "do something ..."
Where expand should get its values from the dictionary my_dictionary based on the ref value. I used the wildcards object like this: my_dictionary[wildcards.ref]. But it ends up with this error: name 'wildcards' is not defined
my_dictionary is something like:
{'A': [1, 2, 3], 'B': ['s1', 's2', ...], ...}
I could use
def myfun(wildcards):
    return expand("{{dataset}}/{{sample}}.{{ref}}.{{state}}.{{case}}.myresult.{name}.tsv", name=my_dictionary[wildcards.ref])
and use myfun as the input, but this does not answer why I cannot use expand in place directly.
Any suggestion how to fix it?

As #dariober mentioned, there is the wildcards object, but it is only directly accessible in the run/shell portion of a rule; in input it can be accessed through an input function.
Here is an example implementation that will expand the input based on the wildcards.ref:
rule all:
    input: expand("{dataset}/{sample}.{ref}.{state}.{case}.endresult.tsv", dataset=["D1", "D2"], sample=["S1", "S2"], ref=["R1", "R2"], state=["STATE1", "STATE2"], case=["C1", "C2"])

my_list = {"R1": [1, 2, 3], "R2": ["s1", "s2"]}

rule process_files:
    input:
        lambda wildcards: expand(
            "{{dataset}}/{{sample}}.{{ref}}.{{state}}.{{case}}.myresult.{name}.tsv", name=my_list[wildcards.ref])
    output:
        "{dataset}/{sample}.{ref}.{state}.{case}.endresult.tsv"
    shell:
        "echo '{input}' > {output}"
If you implement it as the lambda function example above, it should resolve the issue you mention:
The function worked but it did not resolve the variables between double curly braces, so it will ask for input for {dataset}/{sample}.{ref}.{state}.{case} and raise an error.
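If the double curly braces are indeed left unresolved in your setup, a variant that sidesteps the question entirely is to fill every wildcard inside the function itself, so the returned paths contain no placeholders at all. This is only a sketch reusing the names from the example above:

rule process_files:
    input:
        lambda wc: expand(
            "{dataset}/{sample}.{ref}.{state}.{case}.myresult.{name}.tsv",
            dataset=wc.dataset, sample=wc.sample, ref=wc.ref,
            state=wc.state, case=wc.case, name=my_list[wc.ref])
    output:
        "{dataset}/{sample}.{ref}.{state}.{case}.endresult.tsv"
    shell:
        "echo '{input}' > {output}"

Since every placeholder is given a value, expand returns concrete file names and nothing is left for Snakemake to substitute afterwards.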

Your question seems similar to "snakemake wildcards or expand command", and the bottom line is that wildcards is not defined in the input section. So your solution of using an input function (or a lambda function) seems correct.
(As to why wildcards is not defined in input, I don't know for sure; presumably it is because a plain input expression is evaluated when the Snakefile is parsed, before any wildcard values exist, whereas an input function is only called once the wildcards have been determined from the requested output.)

Snakemake: wildcards for parameter keys

I'm trying to create a Snakemake rule for which the input and output are config parameters specified by a wildcard, but I'm having problems.
I would like to do something like:
config.yaml
cam1:
    raw: "src/in1.avi"
    bg: "out/bg1.png"
cam2:
    raw: "src/in2.avi"
    bg: "out/bg2.png"
cam3:
    raw: "src/in3.avi"
    bg: "out/bg3.png"
Snakefile:
configfile: "config.yml"
...
rule all:
input:
[config[f'cam{id}']['background'] for id in [1, 2, 3]]
rule make_bg:
input:
raw=config["{cam}"]["raw"]
output:
bg=config["{cam}"]["bg"]
shell:
"""
./process.py {input.raw} {output.bg}
"""
But this doesn't seem to work: I would like {cam} to be treated as a wildcard, but instead I get a KeyError for {cam}. Can anyone help?
Is it possible to specify {cam} as a wildcard (or something else) that could then be used as a config key?
I think that there are a few problems with this approach:
Conceptually
It does not make much sense to specify the exact input and output filenames in a config, since this is pretty much diametrically opposed to why you would use Snakemake: infer from the inputs which parts of the pipeline need to be run to create the desired outputs. In this case, you would always have to first edit the config for each input/output pair, and the whole point of automation is lost.
Now, the actual problem is to access config variables from the config for input and output. Typically, you would e.g. provide some paths in the config and use something like:
config.yaml:
raw_input: 'src'
bg_output: 'out'
In the pipeline, you could then use it like this:
input: os.path.join(config['raw_input'], 'in{id}.avi')
output: os.path.join(config['bg_output'], 'bg{id}.png')
Like I said, it makes little sense to specify the outputs, in particular, in the config file.
If you were to specify the inputs in config.yaml:
cam1:
    raw: "src/in1.avi"
cam2:
    raw: "src/in2.avi"
cam3:
    raw: "src/in3.avi"
you could then get the inputs with a function as below:
configfile: "config.yaml"
# create sample data
os.makedirs('src', exist_ok= True)
for i in [1,2,3]:
Path(f'src/in{i}.avi').touch()
ids = [1,2,3]
def get_raw(wildcards):
id = 'cam' + wildcards.id
raw = config[f'{id}']['raw']
return raw
rule all:
input: expand('out/bg{id}.png', id = ids)
rule make_bg:
input:
raw = get_raw
output:
bg='out/bg{id}.png'
shell:
" touch {input.raw} ;"
" cp {input.raw} {output.bg};"

Snakemake - input function exception

I am trying to run Snakemake code using a .json file as input. While checking the dry run I got the following error:
InputFunctionException in line 172 of /home/Snakefile_ChIPseq_pe:
KeyError: '130241_1'
Wildcards:
library=130241_1
This is the relevant part of the Snakemake code:
rule findPeaks:
    input:
        sample = os.path.join(HOMERTAG_DIR, "{library}"),
        input = lambda wildcards: os.path.join(HOMERTAG_DIR, config['lib_input'][wildcards.library])
    output:
        os.path.join(HOMERPEAK_DIR, "{library}.all.hpeaks")
    params:
        config['homer_findPeaks_params']
    shell:
        "findPeaks {input.sample} -i {input.input} {params} -o {output}"
There is a single quote around the input sample which is missing in the 'lib_input' part. How do I add that single quote ahead of the variable?
Also, the library names are like 12345_1, 12345_2 etc. I never had this problem before; however, this is the first time I have libraries with an underscore in the names.
Snakemake will first try to interpret the given value as a number. Only if that fails will it interpret the value as a string. Here, it does not fail, because the underscore _ is interpreted as a thousands separator.
My guess is that in your json file the library IDs are not quoted. E.g. you have this:
{
    "lib_input": {1234_1: "input.txt"}
}
Instead of:
{
    "lib_input": {"1234_1": "input.txt"}
}
Or maybe library 130241_1 is not in the json at all?
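As a side note, a small input function with an explicit check can make this kind of failure easier to diagnose, because it reports which library key is missing instead of raising a bare KeyError. This is only a sketch; homer_input is a hypothetical name, not part of the original workflow:

def homer_input(wildcards):
    # Look up the input tag directory for the matched library; fail with a
    # readable message if the key is absent from the config.
    lib = wildcards.library
    if lib not in config['lib_input']:
        raise ValueError(f"library '{lib}' not found under config['lib_input']")
    return os.path.join(HOMERTAG_DIR, config['lib_input'][lib])

The findPeaks rule could then use input = homer_input in place of the lambda.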

Aggregate undetermined number of files for all wildcards in one rule

I have a set of files which will be individually processed to produce multiple files. Exactly how many files is unknown before runtime. (If it matters, this is demultiplexing DNA sequencing results.) I then have a script which takes all of these files at once.
Right now I have something like this:
checkpoint demultiplex:
    input: "{sample}.fastq"
    output: directory("{sample}")
    shell:
        # in reality the number of output files is not known
        "mkdir -p {output} && "
        "touch {output}/{wildcards.sample}-1.fastq && "
        "touch {output}/{wildcards.sample}-2.fastq && "
        "touch {output}/{wildcards.sample}-3.fastq"
import glob

def find_outputs(wildcards):
    outdir = checkpoints.demultiplex.get(**wildcards)
    return glob.glob("{sample}/{sample}-*.fastq".format_map(wildcards))
rule analysis:
    input: find_outputs
    output: "results.txt"
    script: "scripts/do_analysis.R"
This obviously doesn't work, because the values of {sample} (assume they should be A, B, C, D) are never defined.
As I was writing the question, I came up with this answer, which seems to work. However, if you have something cleaner, I would be happy to accept it!
For checkpoints.<rule>.get() to work its magic, it has to be in the body of a function which is given as a reference, not called. Also, this function needs to take one argument, wildcards.
So we make a function that returns closures having the behavior we need. The value of wildcards (which will be empty in this case) is ignored, allowing us to specify the values manually.
def find_outputs(sample):
    def f(wildcards):
        checkpoints.demultiplex.get(sample=sample)
        return glob.glob("{sample}/{sample}-*.fastq".format(sample=sample))
    return f
rule analysis:
    input:
        find_outputs("A"),
        find_outputs("B"),
        find_outputs("C"),
        find_outputs("D")
    output: "results.txt"
    script: "scripts/do_analysis.R"

InputFunctionException with KeyError

Suppose I have code in Python that generates a dictionary as its result. I need to write each element of the dictionary to a separate folder, which will later be used by another set of rules in Snakemake.
I have written the code as follows, but it does not work!
simulation_index_dict = {1: 'test1', 2: 'test2'}

def indexer(wildcards):
    return(simulation_index_dict[wildcards.simulation_index])

rule SimulateAll:
    input:
        expand("{simulation_index}/ProteinCodingGene/alfsim.drw", simulation_index=simulation_index_dict.keys())

rule simulate_phylogeny:
    output:
        ProteinCodingGeneParams=expand("{{simulation_index}}/ProteinCodingGene/alfsim.drw"),
        IntergenicRegionParams=expand("{{simulation_index}}/IntergenicRegions/dawg_IR.dawg"),
        RNAGeneParams=expand("{{simulation_index}}/IntergenicRegions/dawg_RG.dawg"),
        RepeatRegionParams=expand("{{simulation_index}}/IntergenicRegions/dawg_RR.dawg"),
    params:
        value=indexer,
    shell:
        """
        echo {params.value} > {output.ProteinCodingGeneParams}
        echo {params.value} > {output.IntergenicRegionParams}
        echo {params.value} > {output.RNAGeneParams}
        echo {params.value} > {output.RepeatRegionParams}
        """
The error it returns is:
InputFunctionException in line 14 of /$/test.snake:
KeyError: '1'
Wildcards:
simulation_index=1
It seems that the problem is with the params section of the rule, because deleting it eliminates the error, but I cannot figure out what is wrong with the params!
The solution: using strings as dictionary keys
One can guess from the error message (KeyError: '1') that some query in a dictionary went wrong on a key that is '1', which happens to be a string.
However, the dictionary used in the indexer "params" function has integers as keys.
Apparently, using strings instead of ints as keys to this simulation_index_dict dictionary solves the problem (see comments below the question).
The cause: loss of type information during workflow inference
The cause of the problem is likely that the integer nature (inherited from simulation_index_dict.keys()) of the value assigned to the simulation_index parameter of the expand in SimulateAll is "forgotten" in subsequent steps of the workflow inference.
Indeed, the expand results in a list of strings, which are then matched against the outputs of the other rules (which also consist of strings) to infer the values of the wildcards attributes (which are also strings). Therefore, when the indexer function is executed, wildcards.simulation_index is a string, and this causes a KeyError when looking it up in simulation_index_dict.
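For instance, either of the following sketches would avoid the KeyError; the second keeps the integer keys and converts the wildcard instead (it is just an alternative, not what was done in the comments):

# Option 1: use strings as keys so they match the (string) wildcard values
simulation_index_dict = {'1': 'test1', '2': 'test2'}

def indexer(wildcards):
    return simulation_index_dict[wildcards.simulation_index]

# Option 2: keep integer keys and cast the wildcard back to int for the lookup
# def indexer(wildcards):
#     return simulation_index_dict[int(wildcards.simulation_index)]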

Missing wildcards in S4 snakemake Object in R

I'm running a workflow with a main Snakefile that includes rules from the rules folder and calls R scripts from those included rules.
Here are a few relevant lines and the files they come from:
Snakefile:
import pandas as pd

samples = pd.read_table("samples.csv", header=0, sep=',', index_col=0)

rule extract:
    input:
        'summary/umi_expression_matrix.tsv'

include: "rules/extract_expression_single.smk"
rules/extract_expression_single.smk:
rule merge_umi:
    input:
        expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
    output:
        'summary/umi_expression_matrix.tsv'
    script:
        "../scripts/merge_counts_single.R"
scripts/merge_counts_single.R:
samples = read.csv('samples.csv', header=TRUE, stringsAsFactors=FALSE)$samples
read_list = c()
for (i in 1:length(samples)){
    temp_matrix = read.table(snakemake@input[[i]][1], header=T, stringsAsFactors = F)
    cell_barcodes = colnames(temp_matrix)[-1]
    colnames(temp_matrix) = c("GENE", paste(samples[i], cell_barcodes, sep = "_"))
    read_list = c(read_list, list(temp_matrix))
}
# Little function that allows merging of unequal matrices
merge.all <- function(x, y) {
    merge(x, y, all=TRUE, by="GENE")
}
read_counts <- Reduce(merge.all, read_list)
read_counts[is.na(read_counts)] = 0
rownames(read_counts) = read_counts[,1]
read_counts = read_counts[,-1]
write.table(read_counts, file=snakemake@output[[1]], sep='\t')
The "clean" way to do it would be to call snakemake#wildcard.sample to attribute sample names to the script. But for some reason snakemake#wildcards is an empty vector.
In python:
print(type(snakemake.wildcards))
print(snakemake.wildcards)
print('done')
gives:
<class 'snakemake.io.Wildcards'>
done
which means it's also empty.
So right now I have to rely on going back to the samples.csv file and getting the sample names there. I will also have to double-check that the indexes match, maybe using greps; I don't want the samples and the files to get mixed up.
Any idea why this is happening?
Update:
I've tried adding the sample_name as params to see if this would work and it actually does.
rule merge_umi:
    input:
        expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
    params:
        sample_name = lambda wildcards: samples.index
    output:
        'summary/umi_expression_matrix.tsv'
    script:
        "../scripts/merge_counts_single.R"
I'm gonna use this for now, but my guess is there is still an issue with the scope of wildcards in included rules. Or maybe I'm doing it wrong.
The idea of using wildcards is to call a rule once for each value of the wildcards. If you use the expand function in the input of a rule, then your rule takes all of the wildcard values and creates a list of strings. This means your rule is invoked just once (not once per wildcard value). By default, expand uses the Python itertools function product, which yields all combinations of the provided wildcard values.
By doing so, you cannot use that wildcard inside your rule any longer, because when the rule is invoked, it receives all of the wildcard values as a single list that is handed to your R script in one go (not once per wildcard value).
In your case, using wildcards is not suitable, since your merge_umi rule is run only once (not once per wildcard value).
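To illustrate the point: if the R script really needs a per-sample wildcard, the per-sample work has to live in a rule that runs once per sample, with the aggregation kept as a separate step. A minimal sketch of that split (the process_sample rule, the *_processed.tsv files and process_single.R are hypothetical names, not part of the original workflow):

# Runs once per sample; snakemake@wildcards[["sample"]] is set in the R script.
rule process_sample:
    input:
        'summary/{sample}_umi_expression_matrix.tsv'
    output:
        'summary/{sample}_processed.tsv'
    script:
        "../scripts/process_single.R"

# Runs once in total; it aggregates all samples, so no sample wildcard exists here.
rule merge_umi:
    input:
        expand('summary/{sample}_processed.tsv', sample=samples.index)
    output:
        'summary/umi_expression_matrix.tsv'
    script:
        "../scripts/merge_counts_single.R"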