Delete unwanted Snakemake outputs

I have looked at a few other posts about Snakemake and deleting unneeded data to clean up disk space. I have designed a rule, BamRemove, whose touched output is requested by my rule all. However, the workflow manager isn't recognizing it. I am getting this error:

WildcardError in line 35 of /PATH:
No values given for wildcard 'SampleID'.

I am not seeing why. Any help getting this to work would be appreciated.
sampleIDs = d.keys()

rule all:
    input:
        expand('bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam', sampleID=sampleIDs),
        expand('bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai', sampleID=sampleIDs),
        expand('logs/{SampleID}_removed.txt', sampleID=sampleIDs) # Line 35

# Some tools require unzipped fastqs
rule AnnotateUMI:
    input: 'bams/{sampleID}_unisamp_L001_001.star_rg_added.sorted.dmark.bam'
    output: 'bams/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bam'
    # Modify each run
    params: '/data/Test/fastqs/{sampleID}_unisamp_L001_UMI.fastq.gz'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.0.jar AnnotateBamWithUmis \
            -i {input} \
            -f {params} \
            -o {output}')

rule SortSam:
    input: rules.AnnotateUMI.output
    output: 'bams/{sampleID}_Qsorted.MarkUMI.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx110g -jar /data/Tools/picard.jar SortSam \
            INPUT={input} \
            OUTPUT={output} \
            SORT_ORDER=queryname')

rule MItag:
    input: rules.SortSam.output
    output: 'bams/{sampleID}_Qsorted.MarkUMI.MQ.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.0.jar SetMateInformation \
            -i {input} \
            -o {output}')

rule GroupUMI:
    input: rules.MItag.output
    output: 'bams/{sampleID}_grouped.Qsorted.MarkUMI.MQ.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.0.jar GroupReadsByUmi \
            -i {input} \
            -s adjacency \
            -e 1 \
            -m 20 \
            -o {output}')

rule ConcensusUMI:
    input: rules.GroupUMI.output
    output: 'bams/{sampleID}_concensus.Qunsorted.MarkUMI.MQ.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.2.jar CallMolecularConsensusReads \
            --input={input} \
            --min-reads=1 \
            --output={output}')

rule STARmap:
    input: rules.ConcensusUMI.output
    output:
        log = 'bams/{sampleID}_UMI_Concensus_Log.final.out',
        bam = 'bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam'
    params: 'bams/{sampleID}_UMI_Concensus_'
    threads: 32
    run:
        # Each user needs to set the genome path
        shell('STAR \
            --runThreadN {threads} \
            --readFilesIn {input} \
            --readFilesType SAM PE \
            --readFilesCommand samtools view -h \
            --genomeDir /data/reference/star/STAR_hg19_v2.7.5c \
            --outSAMtype BAM SortedByCoordinate \
            --outSAMunmapped Within \
            --limitBAMsortRAM 220000000000 \
            --outFileNamePrefix {params}')

rule Index:
    input: rules.STARmap.output.bam
    output: 'bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai'
    threads: 32
    run:
        shell('samtools index {input}')

rule BamRemove:
    input:
        AnnotateUMI_BAM = rules.AnnotateUMI.output,
        AnnotateUMI_BAI = '{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bai',
        SortSam = rules.SortSam.output,
        MItag = rules.MItag.output,
        GroupUMI = rules.GroupUMI.output,
        ConcensusUMI = rules.ConcensusUMI.output
    output: touch('logs/{SampleID}_removed.txt')
    threads: 32
    run:
        shell('rm {input}')

expand('logs/{SampleID}_removed.txt', sampleID=sampleIDs) # Line 35
              ^^^^^^^^                ^^^^^^^^
The error is due to SampleID being different from sampleID; make them consistent throughout the script. Note that the same mismatch also appears in the output of rule BamRemove, touch('logs/{SampleID}_removed.txt'), whose {SampleID} wildcard can never be filled from the {sampleID} wildcard used by its inputs.
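A minimal sketch of the two corrected lines, with everything else left as posted:

# in rule all (line 35): the placeholder now matches the sampleID= keyword
expand('logs/{sampleID}_removed.txt', sampleID=sampleIDs)

# in rule BamRemove: the output wildcard now matches the {sampleID} used by its inputs
output: touch('logs/{sampleID}_removed.txt')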

Related

Have Snakemake recognize complete files upon relaunch

I have created this Snakemake workflow. The pipeline works really well; however, if any rule fails and I relaunch, Snakemake isn't recognizing all completed files. For instance, sample A finishes all the way through and creates all files for rule all, but sample B fails at rule AnnotateUMI. When I relaunch, Snakemake wants to do all jobs for both A and B instead of just B. What do I need to change to get this to work?
sampleIDs = ['A', 'B']

rule all:
    input:
        expand('PATH/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam', sampleID=sampleIDs),
        expand('PATH/bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai', sampleID=sampleIDs),
        expand('/PATH/logfiles/{sampleID}_removed.txt', sampleID=sampleIDs)

# Some tools require unzipped fastqs
rule AnnotateUMI:
    # Modify each run
    input: 'PATH/{sampleID}_unisamp_L001_001.star_rg_added.sorted.dmark.bam'
    # Modify each run
    output: 'PATH/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bam'
    # Modify each run
    params: 'PATH/{sampleID}_unisamp_L001_UMI.fastq.gz'
    threads: 36
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar PATH/fgbio-2.0.0.jar AnnotateBamWithUmis \
            -i {input} \
            -f {params} \
            -o {output}')

rule SortSam:
    input: rules.AnnotateUMI.output
    # Modify each run
    output: 'PATH/{sampleID}_Qsorted.MarkUMI.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx110g -jar PATH/picard.jar SortSam \
            INPUT={input} \
            OUTPUT={output} \
            SORT_ORDER=queryname')

rule MItag:
    input: rules.SortSam.output
    # Modify each run
    output: 'PATH/{sampleID}_Qsorted.MarkUMI.MQ.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar PATH/fgbio-2.0.0.jar SetMateInformation \
            -i {input} \
            -o {output}')

rule GroupUMI:
    input: rules.MItag.output
    # Modify each run
    output: 'PATH/{sampleID}_grouped.Qsorted.MarkUMI.MQ.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar PATH/fgbio-2.0.0.jar GroupReadsByUmi \
            -i {input} \
            -s adjacency \
            -e 1 \
            -m 20 \
            -o {output}')

rule ConcensusUMI:
    input: rules.GroupUMI.output
    # Modify each run
    output: 'PATH/{sampleID}_concensus.Qunsorted.MarkUMI.MQ.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar PATH/fgbio-2.0.2.jar CallMolecularConsensusReads \
            --input={input} \
            --min-reads=1 \
            --output={output}')

rule STARmap:
    input: rules.ConcensusUMI.output
    # Modify each run
    output:
        log = 'PATH/{sampleID}_UMI_Concensus_Log.final.out',
        bam = 'PATH/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam'
    # Modify each run
    params: 'PATH/{sampleID}_UMI_Concensus_'
    threads: 32
    run:
        # Each user needs to set the genome path
        shell('STAR \
            --runThreadN {threads} \
            --readFilesIn {input} \
            --readFilesType SAM PE \
            --readFilesCommand samtools view -h \
            --genomeDir PATH/STAR_hg19_v2.7.5c \
            --outSAMtype BAM SortedByCoordinate \
            --outSAMunmapped Within \
            --limitBAMsortRAM 220000000000 \
            --outFileNamePrefix {params}')

rule Index:
    input: rules.STARmap.output.bam
    # Modify each run
    output: 'PATH/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai'
    threads: 32
    run:
        shell('samtools index {input}')

rule BamRemove:
    input:
        AnnotateUMI_BAM = rules.AnnotateUMI.output,
        # Modify each run and include in a future version to delete
        #AnnotateUMI_BAI = 'PATH/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bai',
        SortSam = rules.SortSam.output,
        MItag = rules.MItag.output,
        GroupUMI = rules.GroupUMI.output,
        ConcensusUMI = rules.ConcensusUMI.output,
        STARmap = rules.STARmap.output.bam,
        Index = rules.Index.output
    # Modify each run
    output: touch('PATH/logfiles/{sampleID}_removed.txt')
    threads: 32
    run:
        shell('rm {input.AnnotateUMI_BAM} {input.SortSam} {input.MItag} {input.GroupUMI} {input.ConcensusUMI}')

nextflow: change part of the script based on a parameter

I have a Nextflow workflow that's like this (reduced):
params.filter_pass = true

// ... more stuff

process concatenate_vcf {
    cpus 6

    input:
    file(vcf_files) from source_vcf.collect()
    file(tabix_files) from source_vcf_tbi.collect()

    output:
    file("assembled.vcf.gz") into decompose_ch

    script:
    """
    echo ${vcf_files} | tr " " "\n" > vcflist
    bcftools merge \
        -l vcflist \
        -m none \
        -f PASS,. \
        --threads ${task.cpus} \
        -O z \
        -o assembled.vcf.gz
    rm -f vcflist
    """
}
Now, I want to include the -f PASS,. part of the bcftools merge call only if params.filter_pass is true.
In other words, if params.filter_pass is true, the script would be executed like this (other lines removed for clarity):

bcftools merge \
    -l vcflist \
    -m none \
    -f PASS,. \
    --threads ${task.cpus} \
    -O z \
    -o assembled.vcf.gz

and like this if params.filter_pass is instead false:

bcftools merge \
    -l vcflist \
    -m none \
    --threads ${task.cpus} \
    -O z \
    -o assembled.vcf.gz
I know I can use conditional scripts but that would mean replicating the whole script stanza just to change one parameter.
Is this use case possible with Nextflow?
The general pattern is to use a local variable in the 'script' block and a ternary operator to add the -f PASS,. filter option when params.filter_pass is true:
process concatenate_vcf {
    ...
    script:
    def filter_pass = params.filter_pass ? '-f PASS,.' : ''
    """
    echo "${vcf_files.join('\n')}" > vcf.list
    bcftools merge \\
        -l vcf.list \\
        -m none \\
        ${filter_pass} \\
        --threads ${task.cpus} \\
        -O z \\
        -o assembled.vcf.gz
    """
}
An if/else statement could also be used in place of the ternary operator if preferred.
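For example, a minimal sketch of that if/else form (functionally equivalent to the ternary version above; the rest of the process is unchanged):

script:
// build the optional flag imperatively instead of with a ternary
def filter_pass = ''
if( params.filter_pass ) {
    filter_pass = '-f PASS,.'
}
"""
bcftools merge -l vcf.list -m none ${filter_pass} --threads ${task.cpus} -O z -o assembled.vcf.gz
"""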

"Input files updated by another job" but not true

I have an issue with a pipeline and I don't know if it is bad usage on my part or a bug. I am using Snakemake 5.26.1. I did not have any problems until a few days ago, with the same version of Snakemake, and I don't understand what changed.
One part of the pipeline comes just after a checkpoint rule and an aggregating step, which produces an output file, e.g. foo.fasta.
The rules that follow the production of foo.fasta use it as an input file, and they are run again with the reason 'Input files updated by another job', even though this is not the case: their output is more recent than foo.fasta.
Additionally, with the --summary option the file foo.fasta is marked 'no update', and the files have the right timestamp order for those rules not to be executed again.
I cannot find the reason why the following rules are run again. Is it possible the checkpoint is causing an issue, making Snakemake think foo.fasta is updated while it is not?
Here is a dry run output:
rule agouti_scaffolding produces tros_v5.pseudohap.fasta.gz (my foo.fasta)
Building DAG of jobs...
Updating job 7 (aggregate_gff3).
Replace aggregate_gff3 with dynamic branch aggregate_gff3
updating depending job agouti_scaffolding
Updating job 3 (agouti_scaffolding).
Replace agouti_scaffolding with dynamic branch agouti_scaffolding
updating depending job unzip_fasta
updating depending job btk_prepare_workdir
updating depending job v6_clean_rename
updating depending job kat_comp
updating depending job assembly_stats
Job counts:
    count   jobs
    1       kat_comp
    1       kat_plot_spectra
    2

[Tue Oct 27 12:08:41 2020]
rule kat_comp:
    input: results/preprocessing/tros/tros_dedup_proc_fastp_R1_001.fastq.gz, results/preprocessing/tros/tros_dedup_proc_fastp_R2_001.fastq.gz, results/fasta/tros_v5.pseudohap.fasta.gz
    output: results/kat/tros_v5/tros_v5_comp-main.mx
    log: logs/kat_comp.tros_v5.log
    jobid: 1
    reason: Input files updated by another job: results/fasta/tros_v5.pseudohap.fasta.gz
    wildcards: sample=tros, version=v5

[Tue Oct 27 12:08:41 2020]
rule kat_plot_spectra:
    input: results/kat/tros_v5/tros_v5_comp-main.mx
    output: results/kat/tros_v5/tros_v5_spectra.pdf
    log: logs/kat_plot.tros_v5.log
    jobid: 0
    reason: Input files updated by another job: results/kat/tros_v5/tros_v5_comp-main.mx
    wildcards: sample=tros, version=v5

Job counts:
    count   jobs
    1       kat_comp
    1       kat_plot_spectra
    2
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
It seems updating depending job kat_comp is the issue.
Even the ancient flag on the kat_comp input does not change anything.
EDIT
Here is part of the pipeline.
As @TroyComi suggested, the latest development version (5.26.1+26.gc2e2b501.dirty) solves the issue with the ancient directive. But when the ancient directive is removed, the rules are still executed.
import glob
import os
import re

rule all:
    input:
        "results/kat/tros_v5/tros_v5_spectra.pdf"

checkpoint split_fa_augustus:
    input:
        "results/fasta/{sample}_v4.pseudohap.fasta.gz"
    output:
        "results/agouti/{sample}/{sample}_v4.pseudohap.fa",
        directory("results/agouti/{sample}/split")
    params:
        split_size = 50000000
    conda:
        "../envs/augustus.yaml"
    shell:
        """
        zcat {input} > {output[0]}
        mkdir {output[1]}
        splitMfasta.pl {output[0]} \
            --outputpath={output[1]} --minsize={params.split_size}
        """

rule augustus:
    input:
        "results/agouti/{sample}/split/{sample}_v4.pseudohap.split.{i}.fa"
    output:
        "results/agouti/{sample}/split/pred_{i}.gff3"
    conda:
        "../envs/augustus.yaml"
    shell:
        """
        augustus --gff3=on --species=caenorhabditis {input} > {output}
        """

def aggregate_input_gff3(wildcards):
    checkpoint_output = checkpoints.split_fa_augustus.get(**wildcards).output[1]
    return expand("results/agouti/{sample}/split/pred_{i}.gff3",
                  sample=wildcards.sample,
                  i=glob_wildcards(os.path.join(checkpoint_output, f"{wildcards.sample}_v4.pseudohap.split." + "{i}.fa")).i)

rule aggregate_gff3:
    input:
        aggregate_input_gff3
    output:
        "results/agouti/{sample}/{sample}_v4.pseudohap.gff3"
    conda:
        "../envs/augustus.yaml"
    shell:
        "cat {input} | join_aug_pred.pl > {output}"

#===============================
# Preprocess RNAseq data
# The RNAseq reads need to be in the folder resources/RNAseq_raw/{sample}
# Files must be named {run}_R1.fastq.gz and {run}_R2.fastq.gz for globbing to work
# globbing is done in the rule merge_RNA_bams
rule rna_rcorrector:
    input:
        expand("resources/RNAseq_raw/{{sample}}/{{run}}_{R}.fastq.gz",
               R=['R1', 'R2'])
    output:
        temp(expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_{R}.cor.fq.gz",
                    R=['R1', 'R2']))
    params:
        outdir = lambda w, output: os.path.dirname(output[0])
    log:
        "logs/rcorrector_{sample}_{run}.log"
    threads:
        config['agouti']['threads']
    conda:
        "../envs/rna_seq.yaml"
    shell:
        """
        run_rcorrector.pl -1 {input[0]} -2 {input[1]} \
            -t {threads} \
            -od {params.outdir} \
            > {log} 2>&1
        """

rule rna_trimgalore:
    input:
        expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_{R}.cor.fq.gz",
               R=['R1', 'R2'])
    output:
        temp(expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_trimgal_val_{i}.fq",
                    i=['1', '2']))
    params:
        outdir = lambda w, output: os.path.dirname(output[0]),
        basename = lambda w: f'{w.run}_trimgal'
    log:
        "logs/trimgalore_{sample}_{run}.log"
    threads:
        config['agouti']['threads']
    conda:
        "../envs/rna_seq.yaml"
    shell:
        """
        trim_galore --cores {threads} \
            --phred33 \
            --quality 20 \
            --stringency 1 \
            -e 0.1 \
            --length 70 \
            --output_dir {params.outdir} \
            --basename {params.basename} \
            --dont_gzip \
            --paired \
            {input} \
            > {log} 2>&1
        """

#===============================
# Map the RNAseq reads
rule index_ref:
    input:
        "results/agouti/{sample}/{sample}_v4.pseudohap.fa"
    output:
        multiext("results/agouti/{sample}/{sample}_v4.pseudohap.fa",
                 ".0123", ".amb", ".ann", ".bwt.2bit.64", ".bwt.8bit.32", ".pac")
    conda:
        "../envs/mapping.yaml"
    shell:
        "bwa-mem2 index {input}"

rule map_RNAseq:
    input:
        expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_trimgal_val_{i}.fq",
               i=['1', '2']),
        "results/agouti/{sample}/{sample}_v4.pseudohap.fa",
        multiext("results/agouti/{sample}/{sample}_v4.pseudohap.fa",
                 ".0123", ".amb", ".ann", ".bwt.2bit.64", ".bwt.8bit.32", ".pac")
    output:
        "results/agouti/{sample}/mapping/{run}.bam"
    log:
        "logs/bwa_rna_{sample}_{run}.log"
    conda:
        "../envs/mapping.yaml"
    threads:
        config['agouti']['threads']
    shell:
        """
        bwa-mem2 mem -t {threads} {input[2]} {input[0]} {input[1]} 2> {log} \
            | samtools view -b -@ {threads} -o {output}
        """

def get_sample_rna_runs(w):
    list_R1_files = glob.glob(f"resources/RNAseq_raw/{w.sample}/*_R1.fastq.gz")
    list_runs = [re.sub(r'_R1\.fastq\.gz$', '', os.path.basename(f)) for f in list_R1_files]
    return [f'results/agouti/{w.sample}/mapping/{run}.bam' for run in list_runs]

rule merge_RNA_bams:
    input:
        get_sample_rna_runs
    output:
        "results/agouti/{sample}/RNAseq_mapped_merged.bam"
    params:
        tmp_merge = lambda w: f'results/agouti/{w.sample}/tmp_merge.bam'
    conda:
        "../envs/mapping.yaml"
    threads:
        config['agouti']['threads']
    shell:
        """
        samtools merge -@ {threads} {params.tmp_merge} {input}
        samtools sort -@ {threads} -n -o {output} {params.tmp_merge}
        rm {params.tmp_merge}
        """

#===============================
# Run agouti on all that
rule agouti_scaffolding:
    input:
        fa = "results/agouti/{sample}/{sample}_v4.pseudohap.fa",
        bam = "results/agouti/{sample}/RNAseq_mapped_merged.bam",
        gff = "results/agouti/{sample}/{sample}_v4.pseudohap.gff3"
    output:
        protected("results/fasta/{sample}_v5.pseudohap.fasta.gz")
    params:
        outdir = lambda w: f'results/agouti/{w.sample}/agouti_out',
        minMQ = 20,
        maxFracMM = 0.05
    log:
        "logs/agouti_{sample}.log"
    conda:
        "../envs/agouti.yaml"
    shell:
        """
        python /opt/agouti/agouti.py scaffold \
            -assembly {input.fa} \
            -bam {input.bam} \
            -gff {input.gff} \
            -outdir {params.outdir} \
            -minMQ {params.minMQ} -maxFracMM {params.maxFracMM} \
            > {log} 2>&1
        gzip -c {params.outdir}/agouti.agouti.fasta > {output}
        """

#===============================================================
# Now do something on output {sample}_{version}.pseudohap.fasta.gz
rule kat_comp:
    input:
        expand("results/preprocessing/{{sample}}/{{sample}}_dedup_proc_fastp_{R}_001.fastq.gz",
               R=["R1", "R2"]),
        ancient("results/fasta/{sample}_{version}.pseudohap.fasta.gz")
    output:
        "results/kat/{sample}_{version}/{sample}_{version}_comp-main.mx"
    params:
        outprefix = lambda w: f'results/kat/{w.sample}_{w.version}/{w.sample}_{w.version}_comp'
    log:
        "logs/kat_comp.{sample}_{version}.log"
    conda:
        "../envs/kat.yaml"
    threads:
        16
    shell:
        """
        kat comp -t {threads} \
            -o {params.outprefix} \
            '{input[0]} {input[1]}' \
            {input[2]} \
            > {log} 2>&1
        """

rule kat_plot_spectra:
    input:
        "results/kat/{sample}_{version}/{sample}_{version}_comp-main.mx"
    output:
        "results/kat/{sample}_{version}/{sample}_{version}_spectra.pdf"
    params:
        title = lambda w: f'{w.sample}_{w.version}'
    log:
        "logs/kat_plot.{sample}_{version}.log"
    conda:
        "../envs/kat.yaml"
    shell:
        """
        kat plot spectra-cn \
            -o {output} \
            -t {params.title} \
            {input} \
            > {log} 2>&1
        """

Create index of a reference genome with bwa and gatk using snakemake touch

I align reads with bwa and call variants with gatk. gatk needs a sequence dictionary created for the reference genome, and bwa needs its index files. When I use touch for both of them I get this error:
AmbiguousRuleException:
Rules bwa_index and gatk_refdict are ambiguous for the file ref.
Expected input files:
bwa_index: ref.fasta
gatk_refdict: ref.fasta
This is the code:
rule bwa_index:
    input:
        database="ref.fasta"
    output:
        done = touch("ref")
    shell:
        """
        bwa index -p ref {input.database}
        """

rule bwa_mem:
    input:
        bwa_index_done = "ref",
        fastq1="{sample}_R1.trimmed.fastq.gz",
        fastq2="{sample}_R2.trimmed.fastq.gz"
    output:
        bam = temp("{sample}.bam")
    shell:
        """
        bwa mem ref {input.fastq1} {input.fastq2} -o {output.bam}
        """

rule_gatk_refdict:
    input:
        ref="ref.fasta"
    output:
        done = touch("ref")
    shell:
        """
        java -jar gatk-package-4.1.9.0-local.jar CreateSequenceDictionary -R {input.ref} -O {output.done}
        """

rule gatk:
    input:
        gatk_refdict_done = "ref",
        bam="bam_list"
    output:
        outf = "{chr}.vcf"
    shell:
        """
        java -jar gatk-package-4.1.9.0-local.jar HaplotypeCaller -L {wildcards.chr} -R ref -I {input.bam} --min-base-quality-score 20 -O {output.outf}
        """
Alternatively, I tried specifying the .dict index as the output, but then it doesn't work either, because gatk calls variants before creating the dict, and so I get an error that there is no dict file:
rule_gatk_refdict:
    input:
        ref="ref.fasta"
    output:
        outf = "ref.dict"
    shell:
        """
        java -jar gatk-package-4.1.9.0-local.jar CreateSequenceDictionary -R {input.ref} -O {output.outf}
        """

rule gatk:
    input:
        ref = "ref.fasta",
        bam="bam_list"
    output:
        outf = "{chr}.vcf"
    shell:
        """
        java -jar gatk-package-4.1.9.0-local.jar HaplotypeCaller -L {wildcards.chr} -R {input.ref} -I {input.bam} --min-base-quality-score 20 -O {output.outf}
        """
How to solve this?
Why don't you simply define the dict file as an input of the gatk rule and the index as an input of the bwa_mem rule?
rule bwa_index:
    input:
        database="ref.fasta"
    output:
        done = touch("ref")
    shell:
        """
        bwa index -p ref {input.database}
        """

rule bwa_mem:
    input:
        bwa_index_done = "ref",
        fastq1="{sample}_R1.trimmed.fastq.gz",
        fastq2="{sample}_R2.trimmed.fastq.gz"
    output:
        bam = temp("{sample}.bam")
    shell:
        """
        bwa mem ref {input.fastq1} {input.fastq2} -o {output.bam}
        """

rule gatk_refdict:
    input:
        ref="ref.fasta"
    output:
        done = "ref.dict"
    shell:
        """
        java -jar gatk-package-4.1.9.0-local.jar CreateSequenceDictionary -R {input.ref} -O {output.done}
        """

rule gatk:
    input:
        ref = "ref.fasta",
        dict = "ref.dict",
        bam="bam_list"
    output:
        outf = "{chr}.vcf"
    shell:
        """
        java -jar gatk-package-4.1.9.0-local.jar HaplotypeCaller -L {wildcards.chr} -R {input.ref} -I {input.bam} --min-base-quality-score 20 -O {output.outf}
        """
The AmbiguousRuleException you get is because snakemake doesn't know which rule to run, since two rules have the same output. Don't forget that snakemake builds the DAG starting from the rule all. When it comes to running rule gatk, you define "ref" as an input. Since two rules can produce this file, snakemake does not know whether it must use rule gatk_refdict or rule bwa_index.
You also have a typo there (an "_" that should not be there):

----v
rule_gatk_refdict:
    input:
        ref="ref.fasta"
    ...
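As an aside, a sketch of a variant that avoids the sentinel file altogether (my assumption, not part of the answer above): declare the actual files that bwa index -p ref writes, using multiext() (which also appears in the checkpoint question earlier), so no touch() is needed:

rule bwa_index:
    input:
        database="ref.fasta"
    output:
        # the five files classic bwa index produces for prefix 'ref'
        multiext("ref", ".amb", ".ann", ".bwt", ".pac", ".sa")
    shell:
        """
        bwa index -p ref {input.database}
        """

rule bwa_mem:
    input:
        idx=multiext("ref", ".amb", ".ann", ".bwt", ".pac", ".sa"),
        fastq1="{sample}_R1.trimmed.fastq.gz",
        fastq2="{sample}_R2.trimmed.fastq.gz"
    output:
        bam = temp("{sample}.bam")
    shell:
        """
        bwa mem ref {input.fastq1} {input.fastq2} -o {output.bam}
        """

With real files as outputs, snakemake can detect a missing or outdated index directly instead of trusting an empty marker file.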

Snakemake problem: Merge all files together with space delimiter instead of iterating through it

I was trying to run a command which ideally looks like this:

minimap2 -a -x map-ont -t 20 /staging/reference.fasta fastq/sample01.fastq | samtools view -bS -F 4 - | samtools sort -o fastq_minon/sample01.bam

Similarly, I have multiple samples (like fastq/sample01.fastq) in the folder.
However, the snakemake file I wrote to automate this passes all the files at once to a single command, like this:

minimap2 -a -x map-ont -t 1 /staging/reference.fasta fastq/sample02.fastq fastq/sample03.fastq fastq/sample01.fastq | samtools view -bS -F 4 - | samtools sort -o fastq_minon/sample02.bam fastq_minon/sample03.bam fastq_minon/sample01.bam

I have pasted the code and log below. Please help me figure out this mistake.
Code
SAMPLES, = glob_wildcards("fastq/{smp}.fastq")

rule minimap:
    input:
        expand("fastq/{smp}.fastq", smp=SAMPLES)
    output:
        expand("fastq_minon/{smp}.bam", smp=SAMPLES)
    params:
        ref = FASTA
    threads: 40
    shell:
        """
        minimap2 -a -x map-ont -t {threads} {params.ref} {input} | samtools view -bS -F 4 - | samtools sort -o {output}
        """
Log
Building DAG of jobs...
Job counts:
count jobs
1 minimap
1
[Tue May 5 03:28:50 2020]
rule minimap:
input: fastq/sample02.fastq, fastq/sample03.fastq, fastq/sample01.fastq
output: fastq_minon/sample02.bam, fastq_minon/sample03.bam, fastq_minon/sample01.bam
jobid: 0
minimap2 -a -x map-ont -t 1 /staging/reference.fasta fastq/sample02.fastq fastq/sample03.fastq fastq/sample01.fastq | samtools view -bS -F 4 - | samtools sort -o fastq_minon/sample02.bam fastq_minon/sample03.bam fastq_minon/sample01.bam
Job counts:
count jobs
1 minimap
1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
The expand function is used to create a list of files. Thus, in your rule minimap, you're telling snakemake that you want all the fastq files at once as input and that the rule will produce as many bam files. What you want instead is a rule that gets triggered once per sample, using a wildcard:
SAMPLES, = glob_wildcards("fastq/{smp}.fastq")

rule all:
    input: expand("fastq_minon/{smp}.bam", smp=SAMPLES)

rule minimap:
    input:
        "fastq/{smp}.fastq"
    output:
        "fastq_minon/{smp}.bam"
    params:
        ref = FASTA
    threads: 40
    shell:
        """
        minimap2 -a -x map-ont -t {threads} {params.ref} {input} | samtools view -bS -F 4 - | samtools sort -o {output}
        """
By defining all the files wanted in rule all, the rule minimap will be triggered as many times as necessary to create ONE bam file from ONE fastq file.
Have a look at my answer to this question to understand the use of wildcards and expand.
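One assumption in both the question and the answer: FASTA must be defined somewhere above the rule, for example with the reference path used in the question's command:

# path taken from the minimap2 command in the question
FASTA = "/staging/reference.fasta"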