Getting wildcard from input files when not used in output files - snakemake

I have a snakemake rule aggregating several result files to a single file, per study. To make it a bit more understandable: I have two roles ['big','small'] that each produce data for 5 studies ['a','b','c','d','e'], and each study produces 3 output files, one per phenotype ['xxx','yyy','zzz']. What I want is a rule to aggregate the phenotype results from each study into a single summary file per study (so merging the phenotypes into a single table). In the merge_results rule I give the rule a list of files (per study and role), aggregate these into a pandas DataFrame, and then write the result out as a single file.
In the process of merging the results I need the 'pheno' variable from the input file being iterated over. Since pheno is not needed in the aggregated output file, it is not provided in output, and as a consequence it is also not available in the wildcards object. To get hold of the pheno I currently parse the filename, but this feels very hacky and I suspect there is something here I have not understood properly. Is there a better way to grab wildcards from input files that are not used in the output files?
runstudy = ['a','b','c','d','e']
runpheno = ['xxx','yyy','zzz']
runrole = ['big','small']

rule all:
    input:
        expand(os.path.join(output, '{role}-additive', '{study}', '{study}-summary-merge.txt'), role=runrole, study=runstudy)

rule merge_results:
    input:
        expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
    output:
        os.path.join(output, '{role}', '{study}', '{study}-summary-merge.txt')
    run:
        import pandas as pd
        import os

        # Iterate over input files, read into pandas df
        tmplist = []
        for f in input:
            data = pd.read_csv(f, sep='\t')
            # get the pheno from the input file name and add it to the data frame
            pheno = os.path.split(f)[1].split('.')[0]
            data['pheno'] = pheno
            tmplist.append(data)
        resmerged = pd.concat(tmplist)
        resmerged.to_csv(output[0], sep='\t')

You are doing it the right way!
In your line:
expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
you have to understand that role and study are wildcards. pheno is not a wildcard and is set by the second argument of the expand function.
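For a quick illustration of what that expand call produces (just a sketch you could run in a Python shell; expand is importable from snakemake.io):
from snakemake.io import expand

# Doubled braces escape a wildcard so it survives expansion;
# only {pheno} is filled in from the provided list:
print(expand('{{role}}/{{study}}/{pheno}/{pheno}.summary', pheno=['xxx', 'yyy', 'zzz']))
# ['{role}/{study}/xxx/xxx.summary', '{role}/{study}/yyy/yyy.summary', '{role}/{study}/zzz/zzz.summary']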
In order to get the phenotype in your for loop, you can either parse the file name like you are doing, or directly reconstruct the file name, since you know the different values that pheno takes and you can access the wildcards:
run:
    import pandas as pd
    import os

    # Iterate over phenotypes, read into pandas df
    tmplist = []
    for pheno in runpheno:
        # Conflicting variable name 'output' between a global variable and the rule's output
        # object here; the global variable was renamed outputDir for this example
        file = os.path.join(outputDir, wildcards.role, wildcards.study, pheno, pheno + '.summary')
        data = pd.read_csv(file, sep='\t')
        data['pheno'] = pheno
        tmplist.append(data)
    resmerged = pd.concat(tmplist)
    resmerged.to_csv(output[0], sep='\t')
I don't know if this is better than parsing the file name like you were doing though. I wanted to show that you can access wildcards in the code. Either way, you are defining the input and output correctly.
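As a small variation on the same idea (just a sketch; it assumes expand preserves the order of runpheno, which it does for a single list), you could also zip the input files with runpheno instead of parsing or rebuilding the paths:
run:
    import pandas as pd

    tmplist = []
    # expand() generates the inputs in the order of runpheno, so the two line up pairwise
    for f, pheno in zip(input, runpheno):
        data = pd.read_csv(f, sep='\t')
        data['pheno'] = pheno
        tmplist.append(data)
    pd.concat(tmplist).to_csv(output[0], sep='\t')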

Related

snakemake: define parameter based on sample name or other input

I have a snakemake file defining steps for processing short-read data, mapping, and variant calling. I'm hoping to use different reference sequences for different samples and I'm wondering how you would recommend defining the reference based on an input sample name?
For example, I defined my run and sample names using wildcards. I hope to define my ref based on the sample (or run) name, so that samples are mapped to the correct reference. My rule map_reads is below.
Thank you in advance for your help!
# Define samples:
RUNS, SAMPLES = glob_wildcards("/xyz/{run}/{samp}_L001_R1_001.fastq.gz")
sample_dict = dict(zip(SAMPLES, RUNS))
print("runs are: ", RUNS)
print("samples are: ", SAMPLES)

# Map reads.
rule map_reads:
    input:
        ref_path='/xyz/refs/{ref}.fasta',
        kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
        kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz'
    output:
        bam='process/bams/{run}_{samp}_{mapper}_{ref}_rg_sorted.bam'
    params:
        mapper='{mapper}'
    log:
        'process/bams/{run}_{samp}_{mapper}_{ref}_map.log'
    threads: 8
    shell:
        "/xyz/scripts/map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}"
You can create a file relating your samples and reference genome and then read that into a dictionary (or pandas dataframe).
The dictionary/dataframe can then be accessed in the input to determine the right reference for the given sample.
Here is a dictionary example.
Given a tab separated file samples.txt relating sample to reference like so:
sample_A ref_A
sample_B ref_B
sample_C ref_C
Then, using a lambda function, we can access the wildcards object in the input and use the samp wildcard to find the corresponding reference in our dictionary.
# Define samples:
RUNS, SAMPLES = glob_wildcards("/xyz/{run}/{samp}_L001_R1_001.fastq.gz")
sample_dict = dict(zip(SAMPLES, RUNS))
print("runs are: ", RUNS)
print("samples are: ", SAMPLES)

# Read samples.txt into dictionary.
sample_to_ref = {}
with open("samples.txt") as f:
    for line in f:
        line = line.strip().split("\t")
        sample_to_ref[line[0]] = line[1]  # sample_to_ref[sample] = reference

# Map reads.
rule map_reads:
    input:
        # lambda allows access to wildcards, to then access the dictionary
        ref_path=lambda wildcards: expand('/xyz/refs/{ref}.fasta', ref=sample_to_ref[wildcards.samp]),
        kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
        kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz'
    output:
        bam='process/bams/{run}_{samp}_{mapper}_{ref}_rg_sorted.bam'
    params:
        mapper='{mapper}'
    log:
        'process/bams/{run}_{samp}_{mapper}_{ref}_map.log'
    threads: 8
    shell:
        "/xyz/scripts/map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}"

Snakemake variable number of files

I'm in a situation, where I would like to scatter my workflow into a variable number of chunks, which I don't know beforehand. Maybe it is easiest to explain the problem by being concrete:
Someone has handed me FASTQ files demultiplexed using bcl2fastq with the no-lane-splitting option. I would like to split these files according to lane, map each lane individually, and then finally gather everything again. However, I don't know the number of lanes beforehand.
Ideally, I would like a solution like this,
rule split_fastq_file: (...) # results in N FASTQ files
rule map_fastq_file: (...) # do this N times
rule merge_bam_files: (...) # merge the N BAM files
but I am not sure this is possible. The expand function requires me to know the number of lanes, and I can't see how it would be possible to use wildcards for this, either.
I should say that I am rather new to Snakemake, and that I may have completely misunderstood how Snakemake works. It has taken me some time to get used to thinking about things "upside-down" by focusing on output files and then working backwards.
One option is to use checkpoint when splitting the fastqs, so that you can dynamically re-evaluate the DAG at a later point to get the resulting lanes.
Here's an MWE step by step:
Setup and make an example fastq file.
# Requires Python 3.6+ for f-strings, Snakemake 5.4+ for checkpoints
import pathlib
import random

random.seed(1)

rule make_fastq:
    output:
        fastq = touch("input/{sample}.fastq")
Create a random number of lanes between 1 and 9, each with a random identifier from 1 to 9. Note that we declare this as a checkpoint, rather than a rule, so that we can later access the result. Also, we declare the output here as a directory specific to the sample, so that we can later glob in it to get the lanes that were created.
checkpoint split_fastq:
    input:
        fastq = rules.make_fastq.output.fastq
    output:
        lane_dir = directory("temp/split_fastq/{sample}")
    run:
        pathlib.Path(output.lane_dir).mkdir(exist_ok=True)
        n_lanes = random.randrange(1, 10)
        lane_numbers = random.sample(range(1, 10), k=n_lanes)
        for lane_number in lane_numbers:
            path = pathlib.Path(output.lane_dir) / f"L00{lane_number}.fastq"
            path.touch()
Do some intermediate processing.
rule map_fastq:
    input:
        fastq = "temp/split_fastq/{sample}/L00{lane_number}.fastq"
    output:
        bam = "temp/map_fastq/{sample}/L00{lane_number}.bam"
    run:
        bam = pathlib.Path(output.bam)
        bam.parent.mkdir(exist_ok=True)
        bam.touch()
To merge all the processed files, we use an input function to access the lanes that were created in split_fastq, so that we can do a dynamic expand on these. We do the expand on the last rule in the chain of intermediate processing steps, in this case map_fastq, so that we ask for the correct inputs.
def get_bams(wildcards):
    lane_dir = checkpoints.split_fastq.get(**wildcards).output[0]
    lane_numbers = glob_wildcards(f"{lane_dir}/L00{{lane_number}}.fastq").lane_number
    bams = expand(rules.map_fastq.output.bam, **wildcards, lane_number=lane_numbers)
    return bams
This input function now gives us easy access to the bam files we wish to merge, however many there are, and whatever they may be called.
rule merge_bam:
    input:
        get_bams
    output:
        bam = "temp/merge_bam/{sample}.bam"
    shell:
        "cat {input} > {output.bam}"
This example runs, and with random.seed(1) it happens to create three lanes (L001, L002, and L005).
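To actually trigger the chain you still need to request a final target; a minimal sketch with a hypothetical sample name, placed at the top of the Snakefile so it becomes the default target:
rule all:
    input:
        "temp/merge_bam/sample_A.bam"  # hypothetical sample name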
If you don't want to use checkpoint, I think you could achieve something similar by creating an input function for merge_bam that opens up the original input fastq, scans the read names for lane info, and predicts what the input files ought to be. This seems less robust, however.
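For completeness, a rough sketch of that checkpoint-free idea (hypothetical code: it assumes plain-text FASTQ named as in the MWE and that the lane is the fourth colon-separated field of each Illumina read name; real gzipped files would need gzip.open):
def get_bams_from_read_names(wildcards):
    # Collect lane numbers by scanning the read headers of the original fastq
    lanes = set()
    with open(f"input/{wildcards.sample}.fastq") as fq:
        for i, line in enumerate(fq):
            if i % 4 == 0:  # header line of each record
                lanes.add(line.split(":")[3])
    return expand("temp/map_fastq/{sample}/L00{lane_number}.bam",
                  sample=wildcards.sample, lane_number=sorted(lanes))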

How to skip duplicate headers in multiple CSV files having identical columns and merge as one big data frame

I have copied 34 CSV files with identical columns into Google Colab and am trying to merge them into one big data frame. However, each CSV has a duplicate header which needs to be skipped.
The actual header will be skipped anyway while concatenating, since my CSV files have identical columns, correct?
dfs = [pd.read_csv(path.join('/content/drive/My Drive/', x), skiprows=1) for x in os.listdir('/content/drive/My Drive/') if path.isfile(path.join('/content/drive/My Drive/', x))]
df = pd.concat(dfs)
The above code throws the error below:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 1: invalid continuation byte
The code below works for sample files, but I need an efficient way to skip the duplicate headers and merge everything into one data frame. Please suggest.
df1=pd.read_csv("./Aug_0816.csv",skiprows=1)
df2=pd.read_csv("./Sep_0916.csv",skiprows=1)
df3=pd.read_csv("./Oct_1016.csv",skiprows=1)
df4=pd.read_csv("./Nov_1116.csv",skiprows=1)
df5=pd.read_csv("./Dec_1216.csv",skiprows=1)
dfs=[df1,df2,df3,df4,df5]
df=pd.concat(dfs)
Have you considered using glob from the standard library?
Try this
import glob
import os
import pandas as pd

path = '/content/drive/My Drive/'
os.chdir(path)
allFiles = glob.glob("*.csv")
dfs = [pd.read_csv(f, header=None, error_bad_lines=False) for f in allFiles]
# or, if you know the specific delimiter for your csv:
# dfs = [pd.read_csv(f, header=None, delimiter='yourdelimiter') for f in allFiles]
df = pd.concat(dfs)
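If you still want to drop the duplicate header line in each file, as in your sample code, a variant of the same glob approach could look like this (the encoding argument is only a guess at the UnicodeDecodeError; use whatever encoding your files are actually in):
import glob
import pandas as pd

allFiles = glob.glob('/content/drive/My Drive/*.csv')
# skiprows=1 drops the duplicate header line at the top of each file
dfs = [pd.read_csv(f, skiprows=1, encoding='latin-1') for f in allFiles]
df = pd.concat(dfs, ignore_index=True)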
Try this, a fairly generic script for concatenating any number of CSV files in a specific path that share a common file name format:
import glob
import os
import pandas as pd

def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)

path = r"C:\Users\Jyotsna\Documents"
fmask = os.path.join(path, 'Detail**.csv')
df = get_merged_csv(glob.glob(fmask), index_col=None)
df.head()
If you want to skip some fixed rows and/or columns in each of the files before concatenating, edit the code accordingly on this line!
return pd.concat([pd.read_csv(f, skiprows=4, usecols=range(9), **kwargs) for f in flist], ignore_index=True)

Missing wildcards in S4 snakemake Object in R

I'm running a workflow with a main Snakefile that includes rules from the rules folder and calls R scripts from those included rules.
Here are the relevant lines from the specific files:
Snakefile:
samples = pd.read_table("samples.csv", header=0, sep=',', index_col=0)

rule extract:
    input:
        'summary/umi_expression_matrix.tsv'

include: "rules/extract_expression_single.smk"
rules/extract_expression_single.smk:
rule merge_umi:
    input:
        expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
    output:
        'summary/umi_expression_matrix.tsv'
    script:
        "../scripts/merge_counts_single.R"
scripts/merge_counts_single.R:
samples = read.csv('samples.csv', header=TRUE, stringsAsFactors=FALSE)$samples

read_list = c()
for (i in 1:length(samples)){
  temp_matrix = read.table(snakemake@input[[i]][1], header=T, stringsAsFactors=F)
  cell_barcodes = colnames(temp_matrix)[-1]
  colnames(temp_matrix) = c("GENE", paste(samples[i], cell_barcodes, sep="_"))
  read_list = c(read_list, list(temp_matrix))
}

# Little function that allows merging of unequal matrices
merge.all <- function(x, y) {
  merge(x, y, all=TRUE, by="GENE")
}

read_counts <- Reduce(merge.all, read_list)
read_counts[is.na(read_counts)] = 0
rownames(read_counts) = read_counts[,1]
read_counts = read_counts[,-1]
write.table(read_counts, file=snakemake@output[[1]], sep='\t')
The "clean" way to do it would be to call snakemake#wildcard.sample to attribute sample names to the script. But for some reason snakemake#wildcards is an empty vector.
In python:
print(type(snakemake.wildcards))
print(snakemake.wildcards)
print('done')
gives:
<class 'snakemake.io.Wildcards'>
done
which means it's also empty.
So right now I have to rely on going back to the samples.csv file and getting the sample names there. I will also have to double-check matching indexes, maybe using greps; I don't want the samples and the files to get mixed up.
Any idea why this is happening?
Update:
I've tried adding the sample_name as params to see if this would work and it actually does.
rule merge_umi:
    input:
        expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
    params:
        sample_name = lambda wildcards: samples.index
    output:
        'summary/umi_expression_matrix.tsv'
    script:
        "../scripts/merge_counts_single.R"
I'm gonna use this for now, but my guess is there is still an issue with the scope of wildcards in included rules. Or maybe I'm doing it wrong.
The idea of using wildcards is to call a rule for each value of the wildcard. If you use the expand function in the input of a rule, then your rule takes all of the wildcard values and creates a list of strings, which means your rule is invoked just once (not once per wildcard value). By default, expand uses the Python itertools function product, which yields all combinations of the provided wildcard values.
By doing so, you cannot use that wildcard inside your rule any longer: when the rule is invoked, it gets all of the wildcard values converted into a list, and that list is given to your R script just once (not once per wildcard value).
In your case, using the wildcard is not suitable, since your merge_umi rule is run only once (not once per wildcard value).
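To illustrate what expand does to the wildcard before the rule is ever invoked (hypothetical sample names, runnable in a plain Python shell):
from snakemake.io import expand

samples = ['s1', 's2']  # hypothetical sample names
print(expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples))
# ['summary/s1_umi_expression_matrix.tsv', 'summary/s2_umi_expression_matrix.tsv']
# The rule receiving this list has no {sample} wildcard left to resolve,
# which is why snakemake@wildcards is empty inside the script.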

How to load lists in pelicanconf.py from an external file

There are different lists available in pelicanconf.py such as
SOCIAL = (('Facebook','www.facebook.com'),)
LINKS =
etc.
I want to manage this content and create my own lists by loading these values from an external file which can be edited independently. I tried importing the data as a text file using Python, but it doesn't work. Is there any other way?
What exactly did not work? Can you provide code?
You can execute arbitrary python code in your pelicanconf.py.
Example for a very simple CSV reader:
# in pelicanconf.py
def fn_to_list(fn):
    with open(fn, 'r') as res:
        return tuple(map(lambda line: tuple(line[:-1].split(';')), res.readlines()))

print(fn_to_list("data"))
CSV file data:
A;1
B;2
C;3
D;4
E;5
F;6
Together, this yields the following when running pelican:
# ...
((u'A', u'1'), (u'B', u'2'), (u'C', u'3'), (u'D', u'4'), (u'E', u'5'), (u'F', u'6'))
# ...
Instead of printing you can also assign this list to a variable, say LINKS.
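For example, assuming the same fn_to_list and data file as above (the second file name is just hypothetical):
# in pelicanconf.py
LINKS = fn_to_list("data")
SOCIAL = fn_to_list("social")  # hypothetical second file in the same format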