Snakemake tabular configuration, expand, and merge - how to expand input files correctly? - snakemake

I would greatly appreciate a little pointer for the following. I have a TSV samples table:
Sample  Unit  Tumor_or_Normal  Fastq1              Fastq2
A       1     T                reads/a.t.1.fastq   reads/a.t.2.fastq
A       2     N                reads/a.n.1.fastq   reads/a.n.2.fastq
B       1     T                reads/b.t1.1.fastq  reads/b.t1.2.fastq
...
which is read in with:
samples = pd.read_table(config["samples"], dtype=str).set_index(["Sample", "Unit", "Tumor_or_Normal"], drop=False)
samples.index = samples.index.set_levels([i.astype(str) for i in samples.index.levels])
I would like to merge all bam files that have the same Sample and Tumor_or_Normal. For example, C-1-T.bam and C-2-T.bam and C-3-T.bam should be merged into C-T.bam. I have a rule
rule merge_recal_by_unit:
    input:
        expand("recal/{{Sample}}-{Unit}-{{Tumor_or_Normal}}.bam",
               Unit=samples.loc[samples.Sample].Unit)
    output:
        bam=protected("merged/{Sample}-{Tumor_or_Normal}.bam")
    params:
        ""
    threads:
        8
    wrapper:
        "0.39.0/bio/samtools/merge"
but this gave an InputFunctionException. I've also tried replacing the expand with
lambda wildcards: expand("recal/{{Sample}}-{Unit}-{{Tumor_or_Normal}}.bam",
                         Unit=samples.loc[wildcards.Sample].Unit)
but this gave me a syntax error, and
expand("recal/{{Sample}}-{Unit}-{{Tumor_or_Normal}}.bam",
Unit=samples.index.get_level_values('Unit').unique().values())
resulted in the message that numpy.ndarray object is not callable. This seems similar to this and this question, but I wasn't able to make it work.
Any help here would be greatly appreciated. Many thanks!

It seems you want to query the samples table to get all the rows sharing the same Sample and Tumor_or_Normal and use the list of Unit to construct the input list of bam files. If so, something like this should do:
rule merge_recal_by_unit:
    input:
        lambda wc: ['recal/{Sample}-%s-{Tumor_or_Normal}.bam' % x for x in
                    samples[(samples.Sample == wc.Sample) & (samples.Tumor_or_Normal == wc.Tumor_or_Normal)].Unit]
    output:
        bam=protected("merged/{Sample}-{Tumor_or_Normal}.bam")
    ...
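A variant of this answer that stays closer to the expand-based attempts in the question is to call expand inside the lambda and fill every placeholder explicitly from the wildcards. This is only a sketch against the same samples table; note that the third attempt in the question failed because .values is an attribute, not a method, hence the "not callable" error.
rule merge_recal_by_unit:
    input:
        lambda wc: expand(
            "recal/{Sample}-{Unit}-{Tumor_or_Normal}.bam",
            Sample=wc.Sample,
            Tumor_or_Normal=wc.Tumor_or_Normal,
            # all units of this sample/tissue combination
            Unit=samples.loc[
                (samples.Sample == wc.Sample)
                & (samples.Tumor_or_Normal == wc.Tumor_or_Normal),
                "Unit",
            ],
        )
    output:
        bam=protected("merged/{Sample}-{Tumor_or_Normal}.bam")
    # params, threads and wrapper as in the question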

Related

Interconnected variables in snakemake

Let's say I have sample SAMPLE_A, divided into two files SAMPLE_A_1 and SAMPLE_A_2 and associated with barcodes AATT and TTAA, and SAMPLE_B, associated with barcodes CCGG, GGCC, GCGC and divided into 4 files SAMPLE_B_1...SAMPLE_B_4.
I can create getSampleNames() to get [SAMPLE_A, SAMPLE_A, SAMPLE_B, SAMPLE_B, SAMPLE_B, SAMPLE_B] and [1, 2, 1, 2, 3, 4] and then zip them to get the combination {sample}_{id}. I can do the same thing for the barcodes: [SAMPLE_A, SAMPLE_A, SAMPLE_B, SAMPLE_B, SAMPLE_B] and [AATT, TTAA, CCGG, GGCC, GCGC].
SAMPLES_ID, IDs = getSampleNames()
SAMPLES_BC, BCs = getBCs(set(SAMPLES_ID))

rule refine:
    input:
        '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
    output:
        bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
    shell:
        "isoseq3 refine {input} "

rule split:
    input:
        expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam', zip, sample = SAMPLES_ID, id = IDs),
    output:
        expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs),
    shell:
        "python {params.script_dir}/split_cells_bam.py"

rule dedup_split:
    input:
        "{sample}/cells/{barcode}_{sample}/fltnc.bam"
    output:
        bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
    shell:
        "isoseq3 dedup {input} {output.bam} "

rule merge:
    input:
        expand("{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
               zip, sample = SAMPLES_BC, barcode = BCs),
How can I prevent the split rule from being a bottleneck in my pipeline? For now it waits for the refine rule to finish for all samples, which is not necessary; every sample should run independently, but I can't do that because the set of barcodes is different for each sample. Is there a way to have something like
expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs[SAMPLES_BC]), where each {sample} of SAMPLES_BC is a key in the BCs dictionary? And the same for the IDs? I know I can use input functions, but then I'm not sure how to propagate the {barcode} through the rules.
Based on your comment, there are a few routes to take which would involve changing your data structure holding the samples, barcodes, and ids. For now, you can just create a rule per sample:
for sample in set(SAMPLES_ID):  # get uniq samples
    # get ids and barcodes for this sample
    ids = [tup[1] for tup in zip(SAMPLES_ID, IDs) if tup[0] == sample]
    bcs = [tup[1] for tup in zip(SAMPLES_BC, BCs) if tup[0] == sample]

    rule:
        name: f'{sample}_split'
        input:
            expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
                   sample = sample, id = ids),
        output:
            expand("{sample}/cells/{barcode}_{sample}/fltnc.bam",
                   sample = sample, barcode = bcs),
        shell:
            "python {params.script_dir}/split_cells_bam.py"
You don't need zip in the expand since the ids and bcs are for a single sample. I don't think this is the best way in general, but it will be easiest for your current workflow.
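Since these dynamically created rules only have generated names, a small aggregate target can be used to request all of their outputs at once. This is a sketch reusing the lists from the question; the rule name split_all is made up:
rule split_all:
    input:
        # same file pattern as the original split outputs, used purely as a target
        expand("{sample}/cells/{barcode}_{sample}/fltnc.bam",
               zip, sample = SAMPLES_BC, barcode = BCs)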
Just noticing your shell command, how are you passing the input/output to your script?
I found out how to use dictionaries through input functions, which solved my problem!
The major drawback of this solution is that you have to create a dummy file as the output of the split rule, instead of checking that each '{sample}/cells/{barcode}_{sample}/fltnc.bam' file is created, so I am still looking for something more elegant (see the sketch after the code below).
IDs = getSampleNames()  # {SAMPLE_A: [1, 2], SAMPLE_B: [1, 2, 3, 4]}
SAMPLES = list(IDs.keys())
BCs = getBCs(SAMPLES)  # {SAMPLE_A: [AATT, TTAA], SAMPLE_B: [CCGG, GGCC, GCGC]}

# function linking IDs and SAMPLE
def sample2ids(wildcards):
    return expand('{{sample}}/polyA_trimming/{{sample}}_{id}.fltnc.bam',
                  id = IDs[wildcards.sample])

# function linking BCs and SAMPLE
def sample2bc(wildcards):
    return expand('{{sample}}/cells/{barcode}_{{sample}}/dedup/dedup.bam',
                  barcode = BCs[wildcards.sample])

rule refine:
    input:
        '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
    output:
        bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',

rule split:
    input:
        sample2ids
    output:
        # cannot use a function here, so I create a dummy file to pipe
        'dummy_file.txt'

rule dedup_split:
    input:
        'dummy_file.txt'
    output:
        bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",

rule merge:
    input:
        sample2bc
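As a possible refinement of the dummy-file workaround above (a sketch, not the original poster's code): if split writes a per-sample flag file via Snakemake's touch() marker, the {sample} wildcard is carried through to the sample2ids input function and each sample's split runs as soon as its own refine jobs finish. The split bam consumed by dedup_split is then rebuilt from the wildcards in params, since no rule declares it as an output:
rule split:
    input:
        sample2ids
    output:
        # per-sample flag instead of one global dummy_file.txt
        flag = touch('{sample}/cells/split.done')
    params:
        script_dir = "scripts"  # placeholder; point this at your own script location
    shell:
        "python {params.script_dir}/split_cells_bam.py"

rule dedup_split:
    input:
        '{sample}/cells/split.done'
    output:
        bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
    params:
        # the per-barcode bam written by split, reconstructed from the wildcards
        fltnc = "{sample}/cells/{barcode}_{sample}/fltnc.bam"
    shell:
        "isoseq3 dedup {params.fltnc} {output.bam}"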

snakemake: define parameter based on sample name or other input

Thank you in advance for all of your help on here!
I have a snakemake file defining steps for processing short-read data, mapping, and variant calling. I'm hoping to use different reference sequences for different samples and I'm wondering how you would recommend defining the reference based on an input sample name?
For example, I defined my run and sample names using wildcards. I hope to define my ref based on the sample (or run) name, so that samples are mapped to the correct reference. My rule map_reads is below.
Thank you in advance for your help!
# Define samples:
RUNS, SAMPLES = glob_wildcards("/xyz/{run}/{samp}_L001_R1_001.fastq.gz")
sample_dict = dict(zip(SAMPLES, RUNS))
print("runs are: ", RUNS)
print("samples are: ", SAMPLES)

# Map reads.
rule map_reads:
    input:
        ref_path='/xyz/refs/{ref}.fasta',
        kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
        kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz'
    output:
        bam='process/bams/{run}_{samp}_{mapper}_{ref}_rg_sorted.bam'
    params:
        mapper='{mapper}'
    log:
        'process/bams/{run}_{samp}_{mapper}_{ref}_map.log'
    threads: 8
    shell:
        "/xyz/scripts/map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}"
You can create a file relating your samples and reference genome and then read that into a dictionary (or pandas dataframe).
The dictionary/dataframe can then be accessed in the input to determine the right reference for the given sample.
Here is a dictionary example.
Given a tab separated file samples.txt relating sample to reference like so:
sample_A ref_A
sample_B ref_B
sample_C ref_C
Then, using a lambda function, we can access the wildcards object in the input and use the samp wildcard to find the corresponding reference in our dictionary.
# Define samples:
RUNS, SAMPLES = glob_wildcards("/xyz/{run}/{samp}_L001_R1_001.fastq.gz")
sample_dict = dict(zip(SAMPLES, RUNS))
print("runs are: ", RUNS)
print("samples are: ", SAMPLES)

# Read samples.txt into dictionary.
sample_to_ref = {}
with open("samples.txt") as f:
    for line in f:
        line = line.strip().split("\t")
        sample_to_ref[line[0]] = line[1]  # sample_to_ref[sample] = reference

# Map reads.
rule map_reads:
    input:
        ref_path= lambda wildcards: expand('/xyz/refs/{ref}.fasta', ref=sample_to_ref[wildcards.samp]),  # lambda allows access to wildcards, to then access dictionary.
        kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
        kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz'
    output:
        bam='process/bams/{run}_{samp}_{mapper}_{ref}_rg_sorted.bam'
    params:
        mapper='{mapper}'
    log:
        'process/bams/{run}_{samp}_{mapper}_{ref}_map.log'
    threads: 8
    shell:
        "/xyz/scripts/map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}"

How to merge multiple grouped output files in nextflow

I have a process that outputs multiple files as a tuple, like this:
[chr1,[[chr1.chunk1.bgen],[chr1.chunk1.stat],[chr1.chunk2.bgen],[chr1.chunk2.stat],[chr1.chunk3.bgen],[chr1.chunk3.stat]]]
How could I get chr1.merged.bgen and chr1.merged.stat? I want to use cat to merge all these chunks.
I tried:
input:
tuple val (chrom), file('*.bgen'),file('*.stat') from my_output
"""
cat "${chrom}.${*.bgen}" > "${chrom}.merged.bgen"
cat "${chrom}.${*.stat}" > "${chrom}.merged.stat"
"""
But got " Input tuple does not match input set cardinality decalred
Also for:
input:
tuple val (chrom), path(bgen),path(stat) from my_output
"""
cat "${bgen}" > "${chrom}.merged.bgen"
cat "${stat}" > "${chrom}.merged.stat"
"""
Same error.
I also tried my_output.collect() and my_output.toList(), but got the same error.
Any help?
In your example process, you defined 3 input variables, but you only provide 2. That's what the error message is trying to tell you - in fact I think this is only a warning.
The way your process' input is defined, nextflow would expect it to be in this form:
[chr1,[chr1.chunk1.bgen, chr1.chunk2.bgen, chr1.chunk3.bgen],[chr1.chunk1.stat, chr1.chunk2.stat, chr1.chunk3.stat]]
So one option would be to reformat your channel before you feed it into that process. Like this for example:
my_output
    .map { chrom, files -> [chrom,
                            files.flatten().findAll { it.getExtension() == "bgen" },
                            files.flatten().findAll { it.getExtension() == "stat" }
                           ]
         }

Snakemake variable number of files

I'm in a situation where I would like to scatter my workflow into a variable number of chunks, which I don't know beforehand. Maybe it is easiest to explain the problem by being concrete:
Someone has handed me FASTQ files demultiplexed using bcl2fastq with the no-lane-splitting option. I would like to split these files according to lane, map each lane individually, and then finally gather everything again. However, I don't know the number of lanes beforehand.
Ideally, I would like a solution like this,
rule split_fastq_file: (...) # results in N FASTQ files
rule map_fastq_file: (...) # do this N times
rule merge_bam_files: (...) # merge the N BAM files
but I am not sure this is possible. The expand function requires me to know the number of lanes, and I can't see how it would be possible to use wildcards for this, either.
I should say that I am rather new to Snakemake, and that I may have completely misunderstood how Snakemake works. It has taken me some time to get used to thinking about things "upside-down" by focusing on output files and then working backwards.
One option is to use a checkpoint when splitting the fastqs, so that you can dynamically re-evaluate the DAG at a later point to get the resulting lanes.
Here's an MWE step by step:
Setup and make an example fastq file.
# Requires Python 3.6+ for f-strings, Snakemake 5.4+ for checkpoints
import pathlib
import random

random.seed(1)

rule make_fastq:
    output:
        fastq = touch("input/{sample}.fastq")
Create a random number of lanes between 1 and 9 each with random identifier from 1 to 9. Note that we declare this as a checkpoint, rather than a rule, so that we can later access the result. Also, we declare the output here as a directory specific to the sample, so that we can later glob in it to get the lanes that were created.
checkpoint split_fastq:
    input:
        fastq = rules.make_fastq.output.fastq
    output:
        lane_dir = directory("temp/split_fastq/{sample}")
    run:
        pathlib.Path(output.lane_dir).mkdir(exist_ok=True)
        n_lanes = random.randrange(1, 10)
        lane_numbers = random.sample(range(1, 10), k = n_lanes)
        for lane_number in lane_numbers:
            path = pathlib.Path(output.lane_dir) / f"L00{lane_number}.fastq"
            path.touch()
Do some intermediate processing.
rule map_fastq:
    input:
        fastq = "temp/split_fastq/{sample}/L00{lane_number}.fastq"
    output:
        bam = "temp/map_fastq/{sample}/L00{lane_number}.bam"
    run:
        bam = pathlib.Path(output.bam)
        bam.parent.mkdir(exist_ok=True)
        bam.touch()
To merge all the processed files, we use an input function to access the lanes that were created in split_fastq, so that we can do a dynamic expand on these. We do the expand on the last rule in the chain of intermediate processing steps, in this case map_fastq, so that we ask for the correct inputs.
def get_bams(wildcards):
    lane_dir = checkpoints.split_fastq.get(**wildcards).output[0]
    lane_numbers = glob_wildcards(f"{lane_dir}/L00{{lane_number}}.fastq").lane_number
    bams = expand(rules.map_fastq.output.bam, **wildcards, lane_number=lane_numbers)
    return bams
This input function now gives us easy access to the bam files we wish to merge, however many there are, and whatever they may be called.
rule merge_bam:
    input:
        get_bams
    output:
        bam = "temp/merge_bam/{sample}.bam"
    shell:
        "cat {input} > {output.bam}"
This example runs, and with random.seed(1) it happens to create three lanes (L001, L002, and L005).
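To drive the whole chain for a set of samples, a top-level target rule can request the merged bams; the sample names below are made up for illustration:
SAMPLES = ["sample1", "sample2"]  # hypothetical sample names

# Put this rule first in the Snakefile (or run `snakemake all`) so it becomes the default target.
rule all:
    input:
        expand("temp/merge_bam/{sample}.bam", sample=SAMPLES)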
If you don't want to use checkpoint, I think you could achieve something similar by creating an input function for merge_bam that opens up the original input fastq, scans the read names for lane info, and predicts what the input files ought to be. This seems less robust, however.
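A rough sketch of that checkpoint-free idea, assuming standard Illumina read headers (@instrument:run:flowcell:lane:...), that the un-split fastq already exists on disk when the DAG is built, and the file layout used above; the helper name get_bams_from_headers is made up:
import gzip

def get_bams_from_headers(wildcards):
    """Predict the per-lane bams by scanning the lane field of the fastq headers."""
    fastq = f"input/{wildcards.sample}.fastq"
    opener = gzip.open if fastq.endswith(".gz") else open
    lanes = set()
    with opener(fastq, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:  # every 4th line is a read header
                lanes.add(int(line.split(":")[3]))  # 4th colon-separated field is the lane
    return expand("temp/map_fastq/{sample}/L00{lane}.bam",
                  sample=wildcards.sample, lane=sorted(lanes))
The merge_bam rule would then take get_bams_from_headers as its input function instead of get_bams.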

Define input files from csv

I would like to define input file names from different variables extracted from a csv. I have built the following simplified example:
I have a file test.csv:
data/samples/A.fastq
data/samples/B.fastq
I give the path to test.csv in a json config file:
{
    "samples": {
        "summaryFile": "somepath/test.csv"
    }
}
Now I want to run bwa on each file within a rule. My feeling is that I have to use lambda wildcards but I am not sure. My Snakefile looks like this:
# only for bcf_tools
import pandas

input_table = config["samples"]["summaryFile"]
samplesData = pandas.read_csv(input_table)

def returnSamples(table):
    # Have tried different things here but nothing worked
    return table

rule all:
    input:
        expand("mapped_reads/{sample}.bam", sample= samplesData)

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: returnSamples(wildcards.sample)
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
I have tried a million things, including using expand (which works, but then the rule is not called on each file).
Any help will be tremendously appreciated.
Snakemake works by defining which output you want (like you do in rule all). You are very close to a working solution; however, there were some small things that went wrong:
Reading the pandas dataframe does not do what you expect (try printing samplesData to see what it contains). Therefore the expand in rule all does not work properly.
You do not need to use a lambda for the input; you can reuse the wildcard.
This should work for your example:
import pandas
import re

input_table = config["samples"]["summaryFile"]
samplesData = pandas.read_csv(input_table, header=None).loc[:, 0].tolist()
samples = [re.findall(r"[^/]+\.", sample)[0][:-1] for sample in samplesData]  # overly complicated regex

rule all:
    input:
        expand("mapped_reads/{sample}.bam", sample=samples)

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
However, I think it would be easiest to change the contents of test.csv. Right now we have to do some regex magic to get the sample name from the file path; it would probably be best to just store the sample names there as well.
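As a sketch of that suggestion (the column names sample and fastq, and the extra column in test.csv, are assumptions, not part of the original question):
import pandas

# test.csv now carries an explicit sample column next to the fastq path:
# sample,fastq
# A,data/samples/A.fastq
# B,data/samples/B.fastq
samplesData = pandas.read_csv(config["samples"]["summaryFile"]).set_index("sample")

rule all:
    input:
        expand("mapped_reads/{sample}.bam", sample=samplesData.index)

rule bwa_map:
    input:
        "data/genome.fa",
        # look the fastq path up in the table instead of deriving it with a regex
        lambda wildcards: samplesData.loc[wildcards.sample, "fastq"]
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"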