Missing wildcards in S4 snakemake Object in R

I'm running a workflow with a main Snakefile that includes rules from the rules folder and calls R scripts from those included rules.
Here are the relevant lines from each file:
Snakefile:
samples = pd.read_table("samples.csv", header=0, sep=',', index_col=0)

rule extract:
    input:
        'summary/umi_expression_matrix.tsv'

include: "rules/extract_expression_single.smk"

rules/extract_expression_single.smk:
rule merge_umi:
    input:
        expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
    output:
        'summary/umi_expression_matrix.tsv'
    script:
        "../scripts/merge_counts_single.R"
scripts/merge_counts_single.R:
samples = read.csv('samples.csv', header=TRUE, stringsAsFactors=FALSE)$samples
read_list = c()
for (i in 1:length(samples)){
    temp_matrix = read.table(snakemake@input[[i]][1], header=T, stringsAsFactors = F)
    cell_barcodes = colnames(temp_matrix)[-1]
    colnames(temp_matrix) = c("GENE", paste(samples[i], cell_barcodes, sep = "_"))
    read_list = c(read_list, list(temp_matrix))
}
# Little function that allows merging of unequal matrices
merge.all <- function(x, y) {
    merge(x, y, all=TRUE, by="GENE")
}
read_counts <- Reduce(merge.all, read_list)
read_counts[is.na(read_counts)] = 0
rownames(read_counts) = read_counts[,1]
read_counts = read_counts[,-1]
write.table(read_counts, file=snakemake@output[[1]], sep='\t')
The "clean" way to do it would be to call snakemake#wildcard.sample to attribute sample names to the script. But for some reason snakemake#wildcards is an empty vector.
In python:
print(type(snakemake.wildcards))
print(snakemake.wildcards)
print('done')
gives:
<class 'snakemake.io.Wildcards'>
done
which means it is empty there as well.
So right now I have to fall back on reading samples.csv and getting the sample names there. I will also have to double-check that the indexes match, maybe using greps; I don't want the samples and the files to get mixed up.
Any idea why this is happening?
Update:
I've tried passing the sample names as a params entry to see if this would work, and it actually does:
rule merge_umi:
    input:
        expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
    params:
        sample_name = lambda wildcards: samples.index
    output:
        'summary/umi_expression_matrix.tsv'
    script:
        "../scripts/merge_counts_single.R"
I'm gonna use this for now, but my guess is there is still an issue with the scope of wildcards in included rules. Or maybe I'm doing it wrong.

The idea of wildcards is that a rule is invoked once per wildcard value. If you use the expand function in the input of a rule, your rule takes all of the wildcard values and creates a list of strings, which means your rule is invoked just once (not once per wildcard value). By default, expand uses the Python itertools function product, which yields all combinations of the provided wildcard values.
As a consequence, you can no longer use that wildcard inside your rule: by the time the rule is invoked, all of the wildcard values have already been consumed to build the input list, which is handed to your R script in a single invocation.
In your case wildcards are simply not available, since your merge_umi rule runs only once (not once per sample). The contrast is sketched below.
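A minimal sketch (with hypothetical rule and file names) of the difference between a per-sample rule, where {sample} is a real wildcard, and an aggregating rule, where expand() has already filled it in:

samples = ["s1", "s2"]  # hypothetical sample names

# Runs once per sample: {sample} is a wildcard here, so it is
# available as wildcards.sample (or snakemake@wildcards in R).
rule per_sample:
    input: "counts/{sample}.tsv"
    output: "summary/{sample}_umi_expression_matrix.tsv"
    shell: "cp {input} {output}"

# Runs exactly once: expand() already substituted every sample name,
# so this rule has no {sample} wildcard left.
rule merge_umi:
    input: expand("summary/{sample}_umi_expression_matrix.tsv", sample=samples)
    output: "summary/umi_expression_matrix.tsv"
    shell: "cat {input} > {output}"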

Related

Snakemake pipeline not attempting to produce output?

I have a relatively simple snakemake pipeline, but when I run it every input of rule all is reported as missing:
refseq = 'refseq.fasta'
reads = ['_R1_001', '_R2_001']

def getsamples():
    import glob
    test = (glob.glob("*.fastq"))
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit('_', 2)[0])
    return(samples)

def getbarcodes():
    with open('unique.barcodes.txt') as file:
        lines = [line.rstrip() for line in file]
    return(lines)

rule all:
    input:
        expand("grepped/{barcodes}{sample}_R1_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples()),
        expand("grepped/{barcodes}{sample}_R2_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples())
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"

rule fastq_grep:
    input:
        R1 = "{sample}_R1_001.fastq",
        R2 = "{sample}_R2_001.fastq"
    output:
        out1 = "grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2 = "grepped/{barcodes}{sample}_R2_001.plate.fastq"
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
The output files listed in the terminal look correct, so Snakemake seems to see what I want to produce, but the shell commands never run.
I want to produce files that have been grepped for each barcode in the list in my file, but all I get is "Missing input files for rule all:".
There are two issues:
1. You have an impossible wildcard_constraints defined for {barcodes}.
2. Your two wildcards {barcodes} and {sample} are competing with each other.
Remove the wildcard_constraints from your two rules and add the following lines to the top of your Snakefile:
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",
The constraint for {barcodes} now matches only capital letters. Before, it also included an end-of-line anchor (the trailing $), which was impossible to satisfy because additional text follows this wildcard in the file path.
The constraint for {sample} ensures that the part of the filename starting with "Well..." is interpreted as the start of the {sample} wildcard. Otherwise you'd get something unwanted like barcodes=ACGGTW instead of barcodes=ACGGT.
A note of advice:
I usually find it easier to separate wildcards into directory structures rather than having multiple wildcards in the same filename. In your case that would mean a structure like
grepped/{barcodes}/{sample}_R1_001.plate.fastq, as sketched below.
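What the rule could look like with that layout (a sketch, not part of the original fix; the literal "/" between the wildcards acts as an anchor in the matching regex, so the barcode/sample boundary is unambiguous for typical names without extra constraints):

rule fastq_grep:
    input:
        R1="{sample}_R1_001.fastq",
        R2="{sample}_R2_001.fastq",
    output:
        # one directory per barcode keeps the two wildcards apart
        out1="grepped/{barcodes}/{sample}_R1_001.plate.fastq",
        out2="grepped/{barcodes}/{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && "
        "fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"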
Full suggested Snakefile (formatted using snakefmt)
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",

refseq = "refseq.fasta"
reads = ["_R1_001", "_R2_001"]

def getsamples():
    import glob
    test = glob.glob("*.fastq")
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit("_", 2)[0])
    return samples

def getbarcodes():
    with open("unique.barcodes.txt") as file:
        lines = [line.rstrip() for line in file]
    return lines

rule all:
    input:
        expand(
            "grepped/{barcodes}{sample}_R1_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),
        expand(
            "grepped/{barcodes}{sample}_R2_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),

rule fastq_grep:
    input:
        R1="{sample}_R1_001.fastq",
        R2="{sample}_R2_001.fastq",
    output:
        out1="grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2="grepped/{barcodes}{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
In addition to @euronion's answer (+1), I prefer to constrain wildcards to match only and exactly the list of values you expect. This means disabling the regex matching altogether. In your case, I would do something like:
import re

wildcard_constraints:
    barcodes='|'.join([re.escape(x) for x in getbarcodes()]),
    sample='|'.join([re.escape(x) for x in getsamples()]),
Now {barcodes} is allowed to match only the values in getbarcodes(), whatever they are, and the same holds for {sample}. In my opinion this is better than trying to anticipate every form a wildcard can take with a hand-written regex.

snakemake: define parameter based on sample name or other input

Thank you in advance for all of your help on here!
I have a snakemake file defining steps for processing short-read data, mapping, and variant calling. I'm hoping to use different reference sequences for different samples and I'm wondering how you would recommend defining the reference based on an input sample name?
For example, I defined my run and sample names using wildcards. I hope to define my ref based on the sample (or run) name, so that samples are mapped to the correct reference. My rule map_reads is below.
Thank you in advance for your help!
# Define samples:
RUNS, SAMPLES = glob_wildcards("/xyz/{run}/{samp}_L001_R1_001.fastq.gz")
sample_dict = dict(zip(SAMPLES, RUNS))
print("runs are: ", RUNS)
print("samples are: ", SAMPLES)

# Map reads.
rule map_reads:
    input:
        ref_path='/xyz/refs/{ref}.fasta',
        kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
        kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz'
    output:
        bam='process/bams/{run}_{samp}_{mapper}_{ref}_rg_sorted.bam'
    params:
        mapper='{mapper}'
    log:
        'process/bams/{run}_{samp}_{mapper}_{ref}_map.log'
    threads: 8
    shell:
        "/xyz/scripts/map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}"
You can create a file relating your samples to their reference genomes and then read it into a dictionary (or a pandas dataframe).
The dictionary/dataframe can then be accessed in the input to determine the right reference for the given sample.
Here is a dictionary example.
Given a tab-separated file samples.txt relating sample to reference like so:
sample_A ref_A
sample_B ref_B
sample_C ref_C
Then, using a lambda function, we can access the wildcards object in the input and use the samp wildcard to find the corresponding reference in our dictionary.
# Define samples:
RUNS, SAMPLES = glob_wildcards("/xyz/{run}/{samp}_L001_R1_001.fastq.gz")
sample_dict = dict(zip(SAMPLES, RUNS))
print("runs are: ", RUNS)
print("samples are: ", SAMPLES)

# Read samples.txt into dictionary.
sample_to_ref = {}
with open("samples.txt") as f:
    for line in f:
        line = line.strip().split("\t")
        sample_to_ref[line[0]] = line[1]  # sample_to_ref[sample] = reference

# Map reads.
rule map_reads:
    input:
        # The lambda gives access to wildcards, which index into the dictionary.
        ref_path=lambda wildcards: expand('/xyz/refs/{ref}.fasta', ref=sample_to_ref[wildcards.samp]),
        kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
        kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz'
    output:
        bam='process/bams/{run}_{samp}_{mapper}_{ref}_rg_sorted.bam'
    params:
        mapper='{mapper}'
    log:
        'process/bams/{run}_{samp}_{mapper}_{ref}_map.log'
    threads: 8
    shell:
        "/xyz/scripts/map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}"

Getting wildcard from input files when not used in output files

I have a snakemake rule aggregating several result files into a single file per study. To make it a bit more understandable: I have two roles ['big','small'] that each produce data for 5 studies ['a','b','c','d','e'], and each study produces 3 output files, one per phenotype ['xxx','yyy','zzz']. What I want is a rule that aggregates the phenotype results of each study into a single summary file per study (merging the phenotypes into one table). In the merge_results rule I give the rule a list of files (per study and role), aggregate these into a pandas frame, and then write the result out as a single file.
In the process of merging the results I need the 'pheno' variable from the input file being iterated over. Since pheno is not needed in the aggregated output file, it is not part of the output and, as a consequence, is also not available in the wildcards object. To get hold of the pheno I currently parse the filename, but this all feels very hacky and I suspect there is something here I have not understood properly. Is there a better way to grab wildcards from input files that are not used in output files?
runstudy = ['a','b','c','d','e']
runpheno = ['xxx','yyy','zzz']
runrole = ['big','small']

rule all:
    input:
        expand(os.path.join(output, '{role}-additive', '{study}', '{study}-summary-merge.txt'), role=runrole, study=runstudy)

rule merge_results:
    input:
        expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
    output:
        os.path.join(output, '{role}', '{study}', '{study}-summary-merge.txt')
    run:
        import pandas as pd
        import os
        # Iterate over input files, read into pandas df
        tmplist = []
        for f in input:
            data = pd.read_csv(f, sep='\t')
            # getting the pheno from the input file and adding it to the data frame
            pheno = os.path.split(f)[1].split('.')[0]
            data['pheno'] = pheno
            tmplist.append(data)
        resmerged = pd.concat(tmplist)
        resmerged.to_csv(output[0], sep='\t')
You are doing it the right way!
In your line:
expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
you have to understand that role and study are wildcards, while pheno is not a wildcard: it is filled in by the second argument of the expand function. For a job with role 'big' and study 'a', for example, the input becomes the three paths big/a/xxx/xxx.summary, big/a/yyy/yyy.summary and big/a/zzz/zzz.summary under your output directory.
In order to get the phenotype in your for loop, you can either parse the file name as you are doing, or directly reconstruct the file name, since you know the different values pheno takes and you can access the wildcards:
run:
    import pandas as pd
    import os
    # Iterate over phenotypes, read into pandas df
    tmplist = []
    for pheno in runpheno:
        # Conflicting variable name 'output' between the global variable and
        # the rule variable here; the global var is renamed outputDir, for example.
        file = os.path.join(outputDir, wildcards.role, wildcards.study, pheno, pheno + '.summary')
        data = pd.read_csv(file, sep='\t')
        data['pheno'] = pheno
        tmplist.append(data)
    resmerged = pd.concat(tmplist)
    resmerged.to_csv(output[0], sep='\t')
I don't know whether this is better than parsing the file name as you were doing, though. I wanted to show that you can access wildcards in the code. Either way, you are defining the input and output correctly.
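A third option (a sketch, not from the original answer; it relies on expand() generating the inputs in the order of runpheno) is to pair each input file with its phenotype directly:

run:
    import pandas as pd
    # expand() built the inputs in the order of runpheno,
    # so zipping keeps phenotypes and files aligned.
    tmplist = []
    for pheno, f in zip(runpheno, input):
        data = pd.read_csv(f, sep='\t')
        data['pheno'] = pheno
        tmplist.append(data)
    pd.concat(tmplist).to_csv(output[0], sep='\t')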

Aggregate undetermined number of files for all wildcards in one rule

I have a set of files which will be individually processed to produce multiple files. Exactly how many files there will be is unknown before runtime. (If it matters, this is demultiplexing of DNA sequencing results.) I then have a script which takes all of these files at once.
Right now I have something like this:
checkpoint demultiplex:
    input: "{sample}.fastq"
    output: directory("{sample}")
    shell:
        # in reality the number of output files is not known
        "mkdir -p {output} && "
        "touch {output}/{wildcards.sample}-1.fastq && "
        "touch {output}/{wildcards.sample}-2.fastq && "
        "touch {output}/{wildcards.sample}-3.fastq"

import glob

def find_outputs(wildcards):
    outdir = checkpoints.demultiplex.get(**wildcards)
    return glob.glob("{sample}/{sample}-*.fastq".format_map(wildcards))

rule analysis:
    input: find_outputs
    output: "results.txt"
    script: "scripts/do_analysis.R"
This obviously doesn't work, because the values of {sample} (assume they should be A, B, C, D) are never defined.
As I was writing the question, I came up with this answer, which seems to work. However, if you have something cleaner, I would be happy to accept it!
For checkpoints.<rule>.get() to work its magic, it has to be in the body of a function that is passed as a reference, not called. Also, this function needs to take one argument, wildcards.
So we make a function that returns closures having the behavior we need. The value of wildcards (which will be empty in this case) is ignored, allowing us to specify the values manually.
import glob

def find_outputs(sample):
    def f(wildcards):
        # Ensure the checkpoint for this sample has completed
        # before globbing its output directory.
        checkpoints.demultiplex.get(sample=sample)
        return glob.glob("{sample}/{sample}-*.fastq".format(sample=sample))
    return f

rule analysis:
    input:
        find_outputs("A"),
        find_outputs("B"),
        find_outputs("C"),
        find_outputs("D")
    output: "results.txt"
    script: "scripts/do_analysis.R"

Snakemake Using expand with dictionary

I am writing this rule:
rule process_files:
    input:
        dataout=expand("{{dataset}}/{{sample}}.{{ref}}.{{state}}.{{case}}.myresult.{name}.tsv", name=my_list[wildcards.ref])
    output:
        "{dataset}/{sample}.{ref}.{state}.{case}.endresult.tsv"
    shell:
        do something ...
Here expand should get its values from the dictionary my_dictionary based on the ref value. I used wildcards like this: my_dictionary[wildcards.ref]. But it ends up with this error: name 'wildcards' is not defined.
my_dictionary is something like:
{A: [1, 2, 3], B: [s1, s2, ...], ...}
I could use
def myfun(wildcards):
    return expand("{{dataset}}/{{sample}}.{{ref}}.{{state}}.{{case}}.myresult.{name}.tsv", name=my_dictionary[wildcards.ref])
and use myfun as input, but this does not answer why I cannot use expand in place directly.
Any suggestions on how to fix it?
As @dariober mentioned, there is a wildcards object, but it is only accessible in the run/shell portion of a rule; in the input it can only be reached through an input function.
Here is an example implementation that expands the input based on wildcards.ref:
rule all:
    input:
        expand("{dataset}/{sample}.{ref}.{state}.{case}.endresult.tsv", dataset=["D1", "D2"], sample=["S1", "S2"], ref=["R1", "R2"], state=["STATE1", "STATE2"], case=["C1", "C2"])

my_list = {"R1": [1, 2, 3], "R2": ["s1", "s2"]}

rule process_files:
    input:
        lambda wildcards: expand(
            "{{dataset}}/{{sample}}.{{ref}}.{{state}}.{{case}}.myresult.{name}.tsv",
            name=my_list[wildcards.ref])
    output:
        "{dataset}/{sample}.{ref}.{state}.{case}.endresult.tsv"
    shell:
        "echo '{input}' > {output}"
If you implement it as the lambda function in the example above, it should resolve the issue you mention:
"The function worked but it did not resolve the variables between double curly braces, so it will ask for input for {dataset}/{sample}.{ref}.{state}.{case} and raise an error."
Your question seems similar to "snakemake wildcards or expand command", and the bottom line is that wildcards is not defined in the input. So your solution of using an input function (or a lambda function) is correct.
(As to why wildcards is not defined in the input, I don't know...)
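One plausible explanation, offered here as a sketch rather than coming from the answers above: a plain input: entry is evaluated while the Snakefile is parsed, before any job (and hence any wildcards object) exists, whereas a function or lambda is stored and only called per job once Snakemake has matched the rule's output. The same eager-vs-deferred pattern in plain Python:

my_list = {"R1": [1, 2, 3]}

def build_inputs(wildcards):
    # Called later, when a wildcards mapping actually exists.
    return ["sample.{}.myresult.{}.tsv".format(wildcards["ref"], name)
            for name in my_list[wildcards["ref"]]]

# An eager version would fail at definition time:
#   inputs = my_list[wildcards["ref"]]   # NameError: 'wildcards' is not defined
deferred = build_inputs                  # stored, not called yet
print(deferred({"ref": "R1"}))           # called once the values are known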