Using multiple wildcards in Snakemake to differentiate replicates in a TSV file - pandas

I've successfully implemented a Snakemake workflow and now want to add the possibility to merge specific samples, whether they are technical or biological replicates. In this case, I was told I should use multiple wildcards and work with a proper data frame. However, I'm pretty unsure what the syntax would even look like and how I can differentiate between entries properly.
My samples.tsv used to look like this:
| sample       | fq1          | fq2          |
|--------------|--------------|--------------|
| bCalAnn1_1_1 | path/to/file | path/to/file |
| bCalAnn1_1_2 | path/to/file | path/to/file |
| bCalAnn2_1_1 | path/to/file | path/to/file |
| bCalAnn2_1_2 | path/to/file | path/to/file |
| bCalAnn2_2_1 | path/to/file | path/to/file |
| bCalAnn2_3_1 | path/to/file | path/to/file |
Now it looks like this:
| sample   | bio_unit | tech_unit | fq1          | fq2          |
|----------|----------|-----------|--------------|--------------|
| bCalAnn1 | 1        | 1         | path/to/file | path/to/file |
| bCalAnn1 | 1        | 2         | path/to/file | path/to/file |
| bCalAnn2 | 1        | 1         | path/to/file | path/to/file |
| bCalAnn2 | 1        | 2         | path/to/file | path/to/file |
| bCalAnn2 | 2        | 1         | path/to/file | path/to/file |
| bCalAnn2 | 3        | 1         | path/to/file | path/to/file |
Here bio_unit is the index of the library for a given sample and tech_unit is the index of the lane within a given library.
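For reference, this is how one row of such a sheet can be looked up with pandas; a minimal sketch, assuming the file is called samples.tsv:

import pandas as pd

# dtype=str keeps the unit numbers as strings, which is what
# Snakemake wildcard values will be when we compare against them later.
samples = pd.read_table("samples.tsv", dtype=str)

# Select the fq1 path of one (sample, bio_unit, tech_unit) combination
row = samples[(samples["sample"] == "bCalAnn1")
              & (samples["bio_unit"] == "1")
              & (samples["tech_unit"] == "1")]
print(row.fq1.iloc[0])  # -> path/to/file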
My Snakefile looks like this:
import pandas as pd
import os
import yaml

configfile: "config.yaml"

samples = pd.read_table(config["samples"], index_col="sample")

rule all:
    input:
        expand(config["arima_mapping"] + "{sample}_{unit_bio}_{unit_tech}_R1.bam",
               sample=samples.sample, unit_bio=samples.unit_bio, unit_tech=samples.unit_tech)

rule r1_mapping:
    input:
        read1 = lambda wc:
            samples[(samples.sample == wc.sample) & (samples.unit_bio == wc.unit_bio) & (samples.unit_tech == wc.unit_tech)].fq1.iloc[0],
        ref = config["PacBio_assembly"],
        linker = config["index"] + "pac_bio_assembly.fna.amb"
    output:
        config["arima_mapping"] + "unprocessed_bam/({sample},{unit_bio},{unit_tech})_R1.bam"
    params:
    conda:
        "../envs/arima_mapping.yaml"
    log:
        config["logs"] + "arima_mapping/mapping/({sample},{unit_bio},{unit_tech})_R1.log"
    threads:
        12
    shell:
        "bwa mem -t {threads} {input.ref} {input.read1} | samtools view --threads {threads} -h -b -o {output} 2> {log}"
r1_mapping is basically my first rule, which requires no differentiation between replicates, so I want to run it for every row. I experimented a bit with the syntax but ultimately ran into this error:
MissingInputException in rule r1_mapping in line 20 of /home/path/to/assembly_downstream/workflow/rules/arima_mapping.smk:
Missing input files for rule r1_mapping:
output: /home/path/to/assembly_downstream/intermed/arima_mapping/unprocessed_bam/('bCalAnn1', 1, 1)_R1.bam
wildcards: sample='bCalAnn1', unit_bio= 1, unit_tech= 1
affected files:
/home/path/to/assembly_downstream/data/arima_HiC/('bCalAnn1', 1, 1)_R1.fastq.gz
Am I even reading the table properly? I'm currently very confused; can anyone give me a hint on where to start fixing this?
Fixed by changing:
samples = pd.read_table(config["samples"], index_col=["sample","unit_bio","unit_tech"])
to
samples = pd.read_table(config["samples"], index_col="sample")
EDIT:
New error
Missing input files for rule all:
affected files:
/home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of unit_bio ... fq2
sample ...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 2 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 3 ... /home/path/to/assembly_downstream/data/ari...
[6 rows x 4 columns]>_3_2_R1.bam
/home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of unit_bio ... fq2
sample ...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 2 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 3 ... /home/path/to/assembly_downstream/data/ari...
[6 rows x 4 columns]>_3_1_R1.bam
/home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of unit_bio ...
This goes on for a total of 6 times, which sounds like I'm currently getting every row 6 times. I guess this has something to do with how expand() works..? I'm going to keep researching for now.

This is what I would do, slightly simplified to keep things shorter - not tested, there may be errors!
import re  # needed for re.escape below
import pandas as pd
import os
import yaml

samples = pd.read_table('samples.tsv', dtype=str)

wildcard_constraints:
    # Constraints maybe not needed but in my opinion better to set them
    sample='|'.join([re.escape(x) for x in samples['sample']]),
    unit_bio='|'.join([re.escape(x) for x in samples.unit_bio]),
    unit_tech='|'.join([re.escape(x) for x in samples.unit_tech]),

rule all:
    input:
        # Assume for now you just want to produce these bam files,
        # one per row of the sample sheet:
        expand("scaffolds/{sample}_{unit_bio}_{unit_tech}_R1.bam", zip,
               sample=samples['sample'], unit_bio=samples.unit_bio, unit_tech=samples.unit_tech),

rule r1_mapping:
    input:
        read1 = lambda wc:
            samples[(samples['sample'] == wc.sample) & (samples.unit_bio == wc.unit_bio) & (samples.unit_tech == wc.unit_tech)].fq1.iloc[0],
        # Other input/params that don't need wildcards...
    output:
        "scaffolds/{sample}_{unit_bio}_{unit_tech}_R1.bam",
The main issue is to use an input function (lambda here) to query the sample sheet and get the fastq file corresponding to a given sample, unit_bio, and unit_tech.
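If the inline lambda gets hard to read, the same lookup can be written as a named input function. A minimal sketch under the same assumptions (get_fq1 is a name made up here, untested):

def get_fq1(wc):
    # Reduce the sample sheet to the single row matching the wildcards
    # and return the path stored in its fq1 column.
    row = samples[(samples['sample'] == wc.sample)
                  & (samples.unit_bio == wc.unit_bio)
                  & (samples.unit_tech == wc.unit_tech)]
    return row.fq1.iloc[0]

rule r1_mapping:
    input:
        read1 = get_fq1,
    output:
        "scaffolds/{sample}_{unit_bio}_{unit_tech}_R1.bam",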
If this doesn't work or you can't make sense of it, post a fully reproducible example that we can easily replicate.
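As a side note on the second error above: samples.sample does not return the sample column, because pandas resolves attribute access to the built-in DataFrame.sample method first, and expand() then stringifies that bound method into the output paths; that is where the <bound method NDFrame.sample of ...> fragments come from. A two-line demonstration:

import pandas as pd

df = pd.DataFrame({'sample': ['a', 'b']})
print(df.sample)     # <bound method NDFrame.sample of ...> - the method wins
print(df['sample'])  # the actual column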

Related

Input function for technical / biological replicates in snakemake

I'm currently trying to write a Snakemake workflow that can automatically check, via a sample.tsv file, whether a given sample is a biological or technical replicate, and then, at some point in the workflow, use a rule to merge technical/biological replicates.
My tsv file looks like this:
|sample | unit_bio | unit_tech | fq1 | fq2 |
|----------|----------|-----------|-----|-----|
| bCalAnn1 | 1 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_1_R2.fastq.gz |
| bCalAnn1 | 1 | 2 | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_2_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_2_R2.fastq.gz |
| bCalAnn2 | 1 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_1_R2.fastq.gz |
| bCalAnn2 | 1 | 2 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_2_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_2_R2.fastq.gz |
| bCalAnn2 | 2 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_2_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_2_1_R2.fastq.gz |
| bCalAnn2 | 3 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_3_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_3_1_R2.fastq.gz |
My Pipeline looks like this:
import pandas as pd
import os
import yaml

configfile: "config.yaml"

samples = pd.read_table(config["samples"], dtype=str)

rule all:
    input:
        expand(config["arima_mapping"] + "final/{sample}_{unit_bio}_{unit_tech}.bam", zip,
               sample=samples["sample"], unit_bio=samples["unit_bio"], unit_tech=samples["unit_tech"])

..
some rules
..

rule add_read_groups:
    input:
        config["arima_mapping"] + "paired/{sample}_{unit_bio}_{unit_tech}.bam"
    output:
        config["arima_mapping"] + "paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam"
    params:
        platform = "ILLUMINA",
        sampleName = "{sample}",
        library = "{sample}",
        platform_unit = "None"
    conda:
        "../envs/arima_mapping.yaml"
    log:
        config["logs"] + "arima_mapping/paired_read_groups/{sample}_{unit_bio}_{unit_tech}.log"
    shell:
        "picard AddOrReplaceReadGroups I={input} O={output} SM={params.sampleName} LB={params.library} PU={params.platform_unit} PL={params.platform} 2> {log}"

rule merge_tech_repl:
    input:
        config["arima_mapping"] + "paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam"
    output:
        config["arima_mapping"] + "merge_tech_repl/{sample}_{unit_bio}_{unit_tech}.bam"
    params:
        val_string = "SILENT"
    conda:
        "../envs/arima_mapping.yaml"
    log:
        config["logs"] + "arima_mapping/merged_tech_repl/{sample}_{unit_bio}_{unit_tech}.log"
    threads:
        2  # uses at most 2 anyway
    shell:
        "picard MergeSamFiles -I {input} -O {output} --ASSUME_SORTED true --USE_THREADING true --VALIDATION_STRINGENCY {params.val_string} 2> {log}"

rule mark_duplicates:
    input:
        config["arima_mapping"] + "merge_tech_repl/{sample}_{unit_bio}_{unit_tech}.bam" if config["tech_repl"] else config["arima_mapping"] + "paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam"
    output:
        bam = config["arima_mapping"] + "final/{sample}_{unit_bio}_{unit_tech}.bam",
        metric = config["arima_mapping"] + "final/metric_{sample}_{unit_bio}_{unit_tech}.txt"
    #params:
    conda:
        "../envs/arima_mapping.yaml"
    log:
        config["logs"] + "arima_mapping/mark_duplicates/{sample}_{unit_bio}_{unit_tech}.log"
    shell:
        "picard MarkDuplicates I={input} O={output.bam} M={output.metric} 2> {log}"
At the moment I have set a boolean in a config file that tells the mark_duplicates rule whether to take its input from the add_read_groups or the merge_tech_repl rule. This is of course not optimal, since some samples may have replicates (in any number) while others don't. I therefore want syntax that checks the TSV table for rows whose sample name and unit_bio number are identical while the unit_tech number differs (and later, analogously, for biological replicates), merges those specific samples, and lets samples without replicates skip the merging rule.
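One way to replace the global config["tech_repl"] switch with a per-sample decision is an input function that consults the sample sheet. A minimal sketch, untested, where has_tech_replicates is a hypothetical helper:

def has_tech_replicates(sample, unit_bio):
    # More than one row for this (sample, biological unit) pair means
    # it was sequenced on several lanes, i.e. technical replicates exist.
    subset = samples[(samples["sample"] == sample)
                     & (samples["unit_bio"] == unit_bio)]
    return len(subset) > 1

rule mark_duplicates:
    input:
        lambda wc: (config["arima_mapping"]
                    + ("merge_tech_repl/" if has_tech_replicates(wc.sample, wc.unit_bio)
                       else "paired_read_groups/")
                    + f"{wc.sample}_{wc.unit_bio}_{wc.unit_tech}.bam")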
EDIT
For clarification, since I think I explained my goal confusingly:
My first attempt looks like this. I want "i" to be flexible, in case the number of replicates changes. I don't think my input function returns all matching replicates together; it yields them one by one, which is not what I want. I'm also unsure how to handle samples that have no replicates, since they would have to skip this rule somehow.
def input_function(wildcards):
    return expand("{sample}_{unit_bio}_{i}.bam",
                  sample=wildcards.sample,
                  unit_bio=wildcards.unit_bio,
                  i=samples["sample"].str.count(wildcards.sample))

rule tech_duplicate_check:
    input:
        input_function # (should return a list of 2-n replicates, where n could differ per sample)
    output:
        "{sample}_{unit_bio}.bam"
    shell:
        "MergeTechDupl_tool {input}" # input is a list
rule gather_techdups_of_a_biodup:
    output: "{sample}/{unit_bio}"
    input: gather_techdups_of_a_biodup_input_fn
    shell: "true" # Fill this in

rule gather_biodips_of_a_techdup:
    output: "{sample}/{unit_tech}"
    input: gather_biodips_of_a_techdup_input_fn
    shell: "true" # Fill this in
After some attempts, my main problem is the table checking. As far as I know, Snakemake takes templates as input and matches all samples against them. But I would need to check the table for every sample that shares (e.g. for technical replicates) the sample name and the unit_bio number, take all of those samples, and pass them together as input to one run of the rule. Then I would have to take the next sample that was not already part of a previous run, to avoid merging the same samples multiple times.
The logic you describe here can be implemented in the gather_techdups_of_a_biodup_input_fn and gather_biodips_of_a_techdup_input_fn functions above. For example, read your sample TSV file with pandas, filter for wildcards.sample and wildcards.unit_bio (or wildcards.unit_tech), then extract columns fq1 and fq2 from the filtered data frame.
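For instance, a sketch of what gather_techdups_of_a_biodup_input_fn could look like, untested and assuming the upstream BAMs follow the paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam pattern from your rules:

def gather_techdups_of_a_biodup_input_fn(wildcards):
    # All rows for this sample and biological unit: one row per lane,
    # i.e. one row per technical replicate.
    subset = samples[(samples["sample"] == wildcards.sample)
                     & (samples["unit_bio"] == wildcards.unit_bio)]
    # One upstream BAM per technical replicate. A single-row subset
    # yields a one-element list, so samples without technical replicates
    # are "merged" trivially instead of needing to skip the rule.
    return expand("paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam",
                  sample=wildcards.sample,
                  unit_bio=wildcards.unit_bio,
                  unit_tech=subset["unit_tech"])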

snakemake rule error. How can I select rule order?

I'm running Snakemake for an RNA-seq analysis.
I wrote the Snakefile below, and an error occurred in the terminal.
I put rule salmon_quant_reads last in the file, but it runs first,
so Snakemake reports the error in rule salmon_quant_reads.
salmon_quant_reads must run after salmon_index has finished.
Error in rule salmon_quant_reads:
jobid: 173
output: salmon/WT_Veh_11/quant.sf, salmon/WT_Veh_11/lib_format_counts.json
log: logs/salmon/WT_Veh_11.log (check log file(s) for error message)
conda-env: /home/baelab2/LEEJUNEYOUNG/7.Colesevelam/RNA-seq/.snakemake/conda/ff908de630224c1a4118f5dc69c8a761
RuleException:
CalledProcessError in line 111 of /home/baelab2/LEEJUNEYOUNG/7.Colesevelam/RNA-seq/Snakefile_2:
Command 'source /home/baelab2/miniconda3/bin/activate '/home/baelab2/LEEJUNEYOUNG/7.Colesevelam/RNA-seq/.snakemake/conda/ff908de630224c1a4118f5dc69c8a761'; set -euo pipefail; /home/baelab2/miniconda3/envs/snakemake/bin/python3.10 /home/baelab2/LEEJUNEYOUNG/7.Colesevelam/RNA-seq/.snakemake/scripts/tmpr6r8ryk9.wrapper.py' returned non-zero exit status 1.
File "/home/baelab2/LEEJUNEYOUNG/7.Colesevelam/RNA-seq/Snakefile_2", line 111, in __rule_salmon_quant_reads
File "/home/baelab2/miniconda3/envs/snakemake/lib/python3.10/concurrent/futures/thread.py", line 58, in run
How can I fix it?
Here is my Snakefile:
SAMPLES = ["KO_Col_5", "KO_Col_6", "KO_Col_7", "KO_Col_8", "KO_Col_9", "KO_Col_10", "KO_Col_11", "KO_Col_15", "KO_Veh_3", "KO_Veh_4", "KO_Veh_5", "KO_Veh_9", "KO_Veh_11", "KO_Veh_13", "KO_Veh_14", "WT_Col_1", "WT_Col_2", "WT_Col_3", "WT_Col_6", "WT_Col_8", "WT_Col_10", "WT_Col_12", "WT_Veh_1", "WT_Veh_2", "WT_Veh_4", "WT_Veh_7", "WT_Veh_8", "WT_Veh_11", "WT_Veh_14"]

rule all:
    input:
        expand("raw/{sample}_1.fastq.gz", sample=SAMPLES),
        expand("raw/{sample}_2.fastq.gz", sample=SAMPLES),
        expand("qc/fastqc/{sample}_1.before.trim_fastqc.zip", sample=SAMPLES),
        expand("qc/fastqc/{sample}_2.before.trim_fastqc.zip", sample=SAMPLES),
        expand("trimmed/{sample}_1.fastq.gz", sample=SAMPLES),
        expand("trimmed/{sample}_2.fastq.gz", sample=SAMPLES),
        expand("qc/fastqc/{sample}_1.after.trim_fastqc.zip", sample=SAMPLES),
        expand("qc/fastqc/{sample}_2.after.trim_fastqc.zip", sample=SAMPLES),
        expand("salmon/{sample}/quant.sf", sample=SAMPLES),
        expand("salmon/{sample}/lib_format_counts.json", sample=SAMPLES)

rule fastqc_before_trim_1:
    input:
        "raw/{sample}.fastq.gz",
    output:
        html="qc/fastqc/{sample}.before.trim.html",
        zip="qc/fastqc/{sample}.before.trim_fastqc.zip",
    log:
        "logs/fastqc/{sample}.before.log"
    threads: 10
    priority: 1
    wrapper:
        "v1.7.0/bio/fastqc"

rule cutadapt:
    input:
        r1 = "raw/{sample}_1.fastq.gz",
        r2 = "raw/{sample}_2.fastq.gz"
    output:
        fastq1="trimmed/{sample}_1.fastq.gz",
        fastq2="trimmed/{sample}_2.fastq.gz",
        qc="trimmed/{sample}.qc.txt"
    params:
        adapters = "-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
        extra = "--minimum-length 1 -q 20"
    log:
        "logs/cutadapt/{sample}.log"
    threads: 10
    priority: 2
    wrapper:
        "v1.7.0/bio/cutadapt/pe"

rule fastqc_after_trim_2:
    input:
        "trimmed/{sample}.fastq.gz"
    output:
        html="qc/fastqc/{sample}.after.trim.html",
        zip="qc/fastqc/{sample}.after.trim_fastqc.zip"
    log:
        "logs/fastqc/{sample}.after.log"
    threads: 10
    priority: 3
    wrapper:
        "v1.7.0/bio/fastqc"

rule salmon_index:
    input:
        sequences="raw/Mus_musculus.GRCm39.cdna.all.fasta"
    output:
        multiext(
            "salmon/transcriptome_index/",
            "complete_ref_lens.bin",
            "ctable.bin",
            "ctg_offsets.bin",
            "duplicate_clusters.tsv",
            "info.json",
            "mphf.bin",
            "pos.bin",
            "pre_indexing.log",
            "rank.bin",
            "refAccumLengths.bin",
            "ref_indexing.log",
            "reflengths.bin",
            "refseq.bin",
            "seq.bin",
            "versionInfo.json",
        ),
    log:
        "logs/salmon/transcriptome_index.log",
    threads: 10
    priority: 10
    params:
        # optional parameters
        extra="",
    wrapper:
        "v1.7.0/bio/salmon/index"

rule salmon_quant_reads:
    input:
        # If you have multiple fastq files for a single sample (e.g. technical replicates)
        # use a list for r1 and r2.
        r1 = "trimmed/{sample}_1.fastq.gz",
        r2 = "trimmed/{sample}_2.fastq.gz",
        index = "salmon/transcriptome_index"
    output:
        quant = "salmon/{sample}/quant.sf",
        lib = "salmon/{sample}/lib_format_counts.json"
    log:
        "logs/salmon/{sample}.log"
    params:
        # optional parameters
        libtype ="A",
        extra="--validateMappings"
    threads: 10
    priority: 20
    wrapper:
        "v1.7.0/bio/salmon/quant"
The only link between salmon_quant_reads and salmon_index is the directory salmon/transcriptome_index. However, the creation of that directory is not a sufficient signal that all work in salmon_index has been completed. So a quick way to fix this is to include an explicit file:
rule salmon_quant_reads:
    input:
        # If you have multiple fastq files for a single sample (e.g. technical replicates)
        # use a list for r1 and r2.
        r1 = "trimmed/{sample}_1.fastq.gz",
        r2 = "trimmed/{sample}_2.fastq.gz",
        index = "salmon/transcriptome_index",
        _temp_dependency = "salmon/transcriptome_index/info.json",
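A related sketch, not from the original answer: instead of naming one representative file by hand, the rule can depend on everything salmon_index declares via rules.salmon_index.output:

rule salmon_quant_reads:
    input:
        r1 = "trimmed/{sample}_1.fastq.gz",
        r2 = "trimmed/{sample}_2.fastq.gz",
        index = "salmon/transcriptome_index",
        # Depend on all files produced by salmon_index, not just the directory:
        index_files = rules.salmon_index.output,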

chain/dependency of some rules by wildcards

I have a particular use case for which I have not found the solution in the Snakemake documentation.
Let's say in a given pipeline I have a portion with 3 rules a, b and c which will run for N samples.
Those rules handle large amounts of data, and because of local storage limits I do not want them to execute at the same time. For instance, rule a produces the large amount of data, then rule c compresses and exports the results.
So what I am looking for is a way to chain those 3 rules for one sample/wildcard, and only then execute those 3 rules for the next sample, all to make sure the local space stays available.
Thanks
I agree that this is a problem that Snakemake still has no solution for. However, you may have a workaround.

rule all:
    input: expand("a{sample}", sample=[1, 2, 3])

rule a:
    input: "b{sample}"
    output: "a{sample}"

rule b:
    input: "c{sample}"
    output: "b{sample}"

rule c:
    input:
        # Wildcard values are strings, so cast before doing arithmetic:
        lambda wildcards: f"a{int(wildcards.sample) - 1}"
    output: "c{sample}"

That means that rule c for sample 2 wouldn't start before the output of rule a for sample 1 is ready. You need to add a pseudo-output a0, though, or make the lambda more complicated.
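A sketch of that "more complicated lambda", returning no extra input for the first sample instead of requiring a pseudo-file a0 (untested):

rule c:
    input:
        # Sample 1 has no predecessor, so it gets no extra dependency;
        # every later sample waits for the previous sample's rule a output.
        lambda wildcards: [] if int(wildcards.sample) == 1 else f"a{int(wildcards.sample) - 1}"
    output: "c{sample}"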
So building on Dmitry Kuzminov's answer, the following can work (both with numbers and with strings as sample names).
The execution order will be a3 > b3 > a1 > b1 > a2 > b2.
I used a different sample order to show that it can differ from the sample list.
# Store the names as strings: wildcard values always arrive as strings,
# so the comparison and the index lookup below stay consistent.
samples = ['1', '2', '3']
sample_order = ['3', '1', '2']

def get_previous(wildcards):
    if wildcards.sample != sample_order[0]:  # if different from 3 in this case
        previous_sample = sample_order[sample_order.index(wildcards.sample) - 1]
        return f'b_out_{previous_sample}'
    else:  # if it is the first sample in the order, i.e. 3
        return []  # or a dummy file that is always present, e.g. the Snakefile itself

rule all:
    input:
        expand("b_out_{S}", S=samples)

rule a:
    input:
        "a_in_{sample}",
        get_previous
    output:
        "a_out_{sample}"

rule b:
    input:
        "a_out_{sample}"
    output:
        "b_out_{sample}"

Two variables with inconsistent names as input for a Snakemake rule

How can I pair up input data for rules in snakemake if the naming isn't consistent and they are all in the same folder?
For example if I want to use each pair of samples as input for each rule:
PT1 T5
S6 T7
S1 T20
In this example I would have 3 pairs: PT1 & T5, S6 & T7, S1 & T20. To start with, I would want to create 3 folders:
PT1vsT5
S6vsT7
S1vsT20
And then perform an analysis with manta and output the results into these 3 folders accordingly.
In the following pipeline I want the GERMLINE sample to be the first element of each line (PT1, S6, S1) and the TUMOR sample the second (T5, T7, T20).
rule all:
    input:
        expand("/{samples_g}vs{samples_t}", samples_g = GERMLINE, samples_t = TUMOR),
        expand("/{samples_g}vs{samples_t}/runWorkflow.py", samples_g = GERMLINE, samples_t = TUMOR),

# Create folders
rule folders:
    output: "./{samples_g}vs{samples_t}"
    shell: "mkdir {output}"

# Manta configuration
rule manta_config:
    input:
        g = BAMPATH + "/{samples_g}.bam",
        t = BAMPATH + "/{samples_t}.bam"
    output:
        wf = "{samples_g}vs{samples_t}/runWorkflow.py"
    params:
        ref = IND,
        out_dir = "{samples_g}vs{samples_t}/runWorkflow.py"
    shell:
        "python configManta.py --normalBam {input.g} --tumorBam {input.t} --referenceFasta {params.ref} --runDir {params.out_dir} "
Could I do it by using a .txt file containing the pairs as input and then looping over it? If so, how should I do it? Otherwise, how could it be done?
You can generate the list of input or output files "manually" using any appropriate python code. For instance, you could proceed as follows to generate the first of your input lists:
In [1]: GERMLINE = ("PT1", "S6", "S1")
In [2]: TUMOR = ("T5", "T7", "T20")
In [3]: ["/{}vs{}".format(sample_g, sample_t) for (sample_g, sample_t) in zip(GERMLINE, TUMOR)]
Out[3]: ['/PT1vsT5', '/S6vsT7', '/S1vsT20']
So this would be applied as follows:
rule all:
    input:
        ["/{}vs{}".format(sample_g, sample_t) for (sample_g, sample_t) in zip(GERMLINE, TUMOR)],
        ["/{}vs{}/runWorkflow.py".format(sample_g, sample_t) for (sample_g, sample_t) in zip(GERMLINE, TUMOR)],
(Note that I put sample_g and sample_t in singular form, as it sounded more logical in this context, where those variables represent individual sample names and not lists of several samples.)
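As a side note, expand can do this pairing itself: passing zip as the second argument replaces the default product combinator and is equivalent to the list comprehensions above:

rule all:
    input:
        expand("/{samples_g}vs{samples_t}", zip, samples_g=GERMLINE, samples_t=TUMOR),
        expand("/{samples_g}vs{samples_t}/runWorkflow.py", zip, samples_g=GERMLINE, samples_t=TUMOR),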

antlr3 NOT rule

negExpression : (NOT^)* primitiveElement ;
Is the rule I have. I now have this code:
!!(1==1)
I expected I would end up with this tree:
NOT
 |
NOT
 |
 ==
 / \
1   1
However, in Antlr3, it seems the tree ends up like
   NOT
  /   \
NOT    ==
      /  \
     1    1
I.e. I end up with the second NOT having no children; instead, the child node it should have has become its sibling.
What am I doing wrong?
As I wrote the question, it occurred to me that my rule was perhaps wrong: with (NOT^)*, each matched NOT hoists itself to the root of the tree built so far, so the first NOT and the primitiveElement end up as siblings under the second NOT instead of nesting.
And indeed, this recursive version does exactly what I expected:
negExpression : NOT^ negExpression | primitiveElement^;