"Input files updated by another job" but not true - snakemake

I have an issue with a pipeline and I don't know whether it is a misuse on my part or a bug. I am using snakemake 5.26.1. I did not have any problems until a few days ago, with the same version of snakemake, and I don't understand what changed.
Part of the pipeline comes just after a checkpoint rule and an aggregating step, which produces an output file, e.g. foo.fasta.
I have rules downstream of foo.fasta that use it as an input file; they are run again with the --reason Input files updated by another job, but this is not the case, and their outputs are more recent than foo.fasta.
Additionally, with the --summary option the file foo.fasta is marked as no update, and the timestamps are in the right order, so the downstream rules should not need to be executed again.
I cannot find the reason why the downstream rules are run again. Is it possible that the checkpoint makes snakemake think foo.fasta is updated while it is not?
Here is a dry run output:
The rule agouti_scaffolding produces tros_v5.pseudohap.fasta.gz (my foo.fasta):
Building DAG of jobs...
Updating job 7 (aggregate_gff3).
Replace aggregate_gff3 with dynamic branch aggregate_gff3
updating depending job agouti_scaffolding
Updating job 3 (agouti_scaffolding).
Replace agouti_scaffolding with dynamic branch agouti_scaffolding
updating depending job unzip_fasta
updating depending job btk_prepare_workdir
updating depending job v6_clean_rename
updating depending job kat_comp
updating depending job assembly_stats
Job counts:
count jobs
1 kat_comp
1 kat_plot_spectra
2
[Tue Oct 27 12:08:41 2020]
rule kat_comp:
input: results/preprocessing/tros/tros_dedup_proc_fastp_R1_001.fastq.gz, results/preprocessing/tros/tros_dedup_proc_fastp_R2_001.fastq.gz, results/fasta/tros_v5.pseudohap.fasta.gz
output: results/kat/tros_v5/tros_v5_comp-main.mx
log: logs/kat_comp.tros_v5.log
jobid: 1
reason: Input files updated by another job: results/fasta/tros_v5.pseudohap.fasta.gz
wildcards: sample=tros, version=v5
[Tue Oct 27 12:08:41 2020]
rule kat_plot_spectra:
input: results/kat/tros_v5/tros_v5_comp-main.mx
output: results/kat/tros_v5/tros_v5_spectra.pdf
log: logs/kat_plot.tros_v5.log
jobid: 0
reason: Input files updated by another job: results/kat/tros_v5/tros_v5_comp-main.mx
wildcards: sample=tros, version=v5
Job counts:
count jobs
1 kat_comp
1 kat_plot_spectra
2
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
It seems the updating depending job kat_comp step is the issue.
Even using the ancient flag on the kat_comp input does not change anything.
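As a side note, a possible workaround (which only hides the symptom and does not explain the behaviour) is to mark the existing outputs as up to date with snakemake's --touch flag, for example:
snakemake --touch results/kat/tros_v5/tros_v5_spectra.pdf
This touches the existing output files of the jobs needed for that target instead of running their commands.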
EDIT
Here is part of the pipeline.
As @TroyComi suggested, the latest development version (5.26.1+26.gc2e2b501.dirty) solves the issue with the ancient directive. When the ancient directive is removed, however, the rules are still executed.
rule all:
    input:
        "results/kat/tros_v5/tros_v5_spectra.pdf"
checkpoint split_fa_augustus:
input:
"results/fasta/{sample}_v4.pseudohap.fasta.gz"
output:
"results/agouti/{sample}/{sample}_v4.pseudohap.fa",
directory("results/agouti/{sample}/split")
params:
split_size = 50000000
conda:
"../envs/augustus.yaml"
shell:
"""
zcat {input} > {output[0]}
mkdir {output[1]}
splitMfasta.pl {output[0]} \
--outputpath={output[1]} --minsize={params.split_size}
"""
rule augustus:
input:
"results/agouti/{sample}/split/{sample}_v4.pseudohap.split.{i}.fa"
output:
"results/agouti/{sample}/split/pred_{i}.gff3"
conda:
"../envs/augustus.yaml"
shell:
"""
augustus --gff3=on --species=caenorhabditis {input} > {output}
"""
def aggregate_input_gff3(wildcards):
checkpoint_output = checkpoints.split_fa_augustus.get(**wildcards).output[1]
return expand("results/agouti/{sample}/split/pred_{i}.gff3",
sample=wildcards.sample,
i=glob_wildcards(os.path.join(checkpoint_output, f"{wildcards.sample}_v4.pseudohap.split." + "{i}.fa")).i)
rule aggregate_gff3:
input:
aggregate_input_gff3
output:
"results/agouti/{sample}/{sample}_v4.pseudohap.gff3"
conda:
"../envs/augustus.yaml"
shell:
"cat {input} | join_aug_pred.pl > {output}"
#===============================
# Preprocess RNAseq data
# The RNAseq reads need to be in the folder resources/RNAseq_raw/{sample}
# Files must be named {run}_R1.fastq.gz and {run}_R2.fastq.gz for globbing to work
# globbing is done in the rule merge_RNA_bams
rule rna_rcorrector:
input:
expand("resources/RNAseq_raw/{{sample}}/{{run}}_{R}.fastq.gz",
R=['R1', 'R2'])
output:
temp(expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_{R}.cor.fq.gz",
R=['R1', 'R2']))
params:
outdir = lambda w, output: os.path.dirname(output[0])
log:
"logs/rcorrector_{sample}_{run}.log"
threads:
config['agouti']['threads']
conda:
"../envs/rna_seq.yaml"
shell:
"""
run_rcorrector.pl -1 {input[0]} -2 {input[1]} \
-t {threads} \
-od {params.outdir} \
> {log} 2>&1
"""
rule rna_trimgalore:
input:
expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_{R}.cor.fq.gz",
R=['R1', 'R2'])
output:
temp(expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_trimgal_val_{i}.fq",
i=['1', '2']))
params:
outdir = lambda w, output: os.path.dirname(output[0]),
basename = lambda w: f'{w.run}_trimgal'
log:
"logs/trimgalore_{sample}_{run}.log"
threads:
config['agouti']['threads']
conda:
"../envs/rna_seq.yaml"
shell:
"""
trim_galore --cores {threads} \
--phred33 \
--quality 20 \
--stringency 1 \
-e 0.1 \
--length 70 \
--output_dir {params.outdir} \
--basename {params.basename} \
--dont_gzip \
--paired \
{input} \
> {log} 2>&1
"""
#===============================
# Map the RNAseq reads
rule index_ref:
input:
"results/agouti/{sample}/{sample}_v4.pseudohap.fa"
output:
multiext("results/agouti/{sample}/{sample}_v4.pseudohap.fa",
".0123", ".amb", ".ann", ".bwt.2bit.64", ".bwt.8bit.32", ".pac")
conda:
"../envs/mapping.yaml"
shell:
"bwa-mem2 index {input}"
rule map_RNAseq:
input:
expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_trimgal_val_{i}.fq",
i=['1', '2']),
"results/agouti/{sample}/{sample}_v4.pseudohap.fa",
multiext("results/agouti/{sample}/{sample}_v4.pseudohap.fa",
".0123", ".amb", ".ann", ".bwt.2bit.64", ".bwt.8bit.32", ".pac")
output:
"results/agouti/{sample}/mapping/{run}.bam"
log:
"logs/bwa_rna_{sample}_{run}.log"
conda:
"../envs/mapping.yaml"
threads:
config['agouti']['threads']
shell:
"""
bwa-mem2 mem -t {threads} {input[2]} {input[0]} {input[1]} 2> {log} \
| samtools view -b -@ {threads} -o {output}
"""
def get_sample_rna_runs(w):
list_R1_files = glob.glob(f"resources/RNAseq_raw/{w.sample}/*_R1.fastq.gz")
list_runs = [re.sub('_R1\.fastq\.gz$', '', os.path.basename(f)) for f in list_R1_files]
return [f'results/agouti/{w.sample}/mapping/{run}.bam' for run in list_runs]
rule merge_RNA_bams:
input:
get_sample_rna_runs
output:
"results/agouti/{sample}/RNAseq_mapped_merged.bam"
params:
tmp_merge = lambda w: f'results/agouti/{w.sample}/tmp_merge.bam'
conda:
"../envs/mapping.yaml"
threads:
config['agouti']['threads']
shell:
"""
samtools merge -@ {threads} {params.tmp_merge} {input}
samtools sort -@ {threads} -n -o {output} {params.tmp_merge}
rm {params.tmp_merge}
"""
#===============================
# Run agouti on all that
rule agouti_scaffolding:
input:
fa = "results/agouti/{sample}/{sample}_v4.pseudohap.fa",
bam = "results/agouti/{sample}/RNAseq_mapped_merged.bam",
gff = "results/agouti/{sample}/{sample}_v4.pseudohap.gff3"
output:
protected("results/fasta/{sample}_v5.pseudohap.fasta.gz")
params:
outdir = lambda w: f'results/agouti/{w.sample}/agouti_out',
minMQ = 20,
maxFracMM = 0.05
log:
"logs/agouti_{sample}.log"
conda:
"../envs/agouti.yaml"
shell:
"""
python /opt/agouti/agouti.py scaffold \
-assembly {input.fa} \
-bam {input.bam} \
-gff {input.gff} \
-outdir {params.outdir} \
-minMQ {params.minMQ} -maxFracMM {params.maxFracMM} \
> {log} 2>&1
gzip -c {params.outdir}/agouti.agouti.fasta > {output}
"""
#===============================================================
# Now do something on output {sample}_{version}.pseudohap.fasta.gz
rule kat_comp:
input:
expand("results/preprocessing/{{sample}}/{{sample}}_dedup_proc_fastp_{R}_001.fastq.gz",
R=["R1", "R2"]),
ancient("results/fasta/{sample}_{version}.pseudohap.fasta.gz")
output:
"results/kat/{sample}_{version}/{sample}_{version}_comp-main.mx"
params:
outprefix = lambda w: f'results/kat/{w.sample}_{w.version}/{w.sample}_{w.version}_comp'
log:
"logs/kat_comp.{sample}_{version}.log"
conda:
"../envs/kat.yaml"
threads:
16
shell:
"""
kat comp -t {threads} \
-o {params.outprefix} \
'{input[0]} {input[1]}' \
{input[2]} \
> {log} 2>&1
"""
rule kat_plot_spectra:
input:
"results/kat/{sample}_{version}/{sample}_{version}_comp-main.mx"
output:
"results/kat/{sample}_{version}/{sample}_{version}_spectra.pdf"
params:
title = lambda w: f'{w.sample}_{w.version}'
log:
"logs/kat_plot.{sample}_{version}.log"
conda:
"../envs/kat.yaml"
shell:
"""
kat plot spectra-cn \
-o {output} \
-t {params.title} \
{input} \
> {log} 2>&1
"""

Related

Delete unwanted Snakemake Outputs

I have looked at a few other posts about Snakemake and deleting unneeded data to clean up disk space. I have designed a rule called BamRemove that touches a log file listed in my rule all. However, the workflow manager isn't recognizing it, and I am getting this error: WildcardError in line 35 of /PATH:
No values given for wildcard 'SampleID'. I am not seeing why. Any help getting this to work would be appreciated.
sampleIDs = d.keys()
rule all:
input:
expand('bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam', sampleID=sampleIDs),
expand('bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai', sampleID=sampleIDs),
expand('logs/{SampleID}_removed.txt', sampleID=sampleIDs) #Line 35
# Some tools require unzipped fastqs
rule AnnotateUMI:
input: 'bams/{sampleID}_unisamp_L001_001.star_rg_added.sorted.dmark.bam'
output: 'bams/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bam',
# Modify each run
params: '/data/Test/fastqs/{sampleID}_unisamp_L001_UMI.fastq.gz'
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.0.jar AnnotateBamWithUmis \
-i {input} \
-f {params} \
-o {output}')
rule SortSam:
input: rules.AnnotateUMI.output
output: 'bams/{sampleID}_Qsorted.MarkUMI.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx110g -jar /data/Tools/picard.jar SortSam \
INPUT={input} \
OUTPUT={output} \
SORT_ORDER=queryname')
rule MItag:
input: rules.SortSam.output
output: 'bams/{sampleID}_Qsorted.MarkUMI.MQ.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.0.jar SetMateInformation \
-i {input} \
-o {output}')
rule GroupUMI:
input: rules.MItag.output
output: 'bams/{sampleID}_grouped.Qsorted.MarkUMI.MQ.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.0.jar GroupReadsByUmi \
-i {input} \
-s adjacency \
-e 1 \
-m 20 \
-o {output}')
rule ConcensusUMI:
input: rules.GroupUMI.output
output: 'bams/{sampleID}_concensus.Qunsorted.MarkUMI.MQ.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.2.jar CallMolecularConsensusReads \
--input={input} \
--min-reads=1 \
--output={output}')
rule STARmap:
input: rules.ConcensusUMI.output
output:
log = 'bams/{sampleID}_UMI_Concensus_Log.final.out',
bam = 'bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam'
params: 'bams/{sampleID}_UMI_Concensus_'
threads: 32
run:
# Each user needs to genome path
shell('STAR \
--runThreadN {threads} \
--readFilesIn {input} \
--readFilesType SAM PE \
--readFilesCommand samtools view -h \
--genomeDir /data/reference/star/STAR_hg19_v2.7.5c \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--limitBAMsortRAM 220000000000 \
--outFileNamePrefix {params}')
rule Index:
input: rules.STARmap.output.bam
output: 'bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai'
params:
threads: 32
run:
shell('samtools index {input}')
rule BamRemove:
input:
AnnotateUMI_BAM = rules.AnnotateUMI.output,
AnnotateUMI_BAI = '{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bai',
SortSam = rules.SortSam.output,
MItag = rules.MItag.output,
GroupUMI = rules.GroupUMI.output,
ConcensusUMI = rules.ConcensusUMI.output
output: touch('logs/{SampleID}_removed.txt')
threads: 32
run:
shell('rm {input}')
expand('logs/{SampleID}_removed.txt', sampleID=sampleIDs) #Line 35
^^^ ^^^
The error is due to SampleID being different from sampleID; make them consistent throughout the script.
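For instance, keeping the lower-case sampleID wildcard everywhere (only the affected lines are shown; the rest of the Snakefile stays as it is):
rule all:
    input:
        expand('bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam', sampleID=sampleIDs),
        expand('bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai', sampleID=sampleIDs),
        expand('logs/{sampleID}_removed.txt', sampleID=sampleIDs)  # was {SampleID}
and in rule BamRemove the output must use the same spelling:
    output: touch('logs/{sampleID}_removed.txt')  # was {SampleID}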

Have Snakemake recognize complete files upon relaunch

I have created this Snakemake workflow. The pipeline works really well; however, if any rule fails and I relaunch, Snakemake isn't recognizing all completed files. For instance, sample A finishes all the way through and creates all the files for rule all, but sample B fails at rule AnnotateUMI. When I relaunch, Snakemake wants to run all jobs for both A and B, instead of just B. What do I need to get this to work?
sampleIDs = ['A', 'B']
rule all:
input:
expand('PATH/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam', sampleID=sampleIDs),
expand('PATH/bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai', sampleID=sampleIDs),
expand('/PATH/logfiles/{sampleID}_removed.txt', sampleID=sampleIDs)
# Some tools require unzipped fastqs
rule AnnotateUMI:
# Modify each run
input: 'PATH/{sampleID}_unisamp_L001_001.star_rg_added.sorted.dmark.bam'
# Modify each run
output: 'PATH/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bam'
# Modify each run
params: 'PATH/{sampleID}_unisamp_L001_UMI.fastq.gz'
threads: 36
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar PATH/fgbio-2.0.0.jar AnnotateBamWithUmis \
-i {input} \
-f {params} \
-o {output}')
rule SortSam:
input: rules.AnnotateUMI.output
# Modify each run
output: 'PATH/{sampleID}_Qsorted.MarkUMI.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx110g -jar PATH/picard.jar SortSam \
INPUT={input} \
OUTPUT={output} \
SORT_ORDER=queryname')
rule MItag:
input: rules.SortSam.output
# Modify each run
output: 'PATH/{sampleID}_Qsorted.MarkUMI.MQ.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar PATH/fgbio-2.0.0.jar SetMateInformation \
-i {input} \
-o {output}')
rule GroupUMI:
input: rules.MItag.output
# Modify each run
output: 'PATH/{sampleID}_grouped.Qsorted.MarkUMI.MQ.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar PATH/fgbio-2.0.0.jar GroupReadsByUmi \
-i {input} \
-s adjacency \
-e 1 \
-m 20 \
-o {output}')
rule ConcensusUMI:
input: rules.GroupUMI.output
# Modify each run
output: 'PATH/{sampleID}_concensus.Qunsorted.MarkUMI.MQ.bam'
params:
threads: 32
run:
# Each user needs to set tool path
shell('java -Xmx220g -jar PATH/fgbio-2.0.2.jar CallMolecularConsensusReads \
--input={input} \
--min-reads=1 \
--output={output}')
rule STARmap:
input: rules.ConcensusUMI.output
# Modify each run
output:
log = 'PATH/{sampleID}_UMI_Concensus_Log.final.out',
bam = 'PATH/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam'
# Modify each run
params: 'PATH/{sampleID}_UMI_Concensus_'
threads: 32
run:
# Each user needs to genome path
shell('STAR \
--runThreadN {threads} \
--readFilesIn {input} \
--readFilesType SAM PE \
--readFilesCommand samtools view -h \
--genomeDir PATH/STAR_hg19_v2.7.5c \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--limitBAMsortRAM 220000000000 \
--outFileNamePrefix {params}')
rule Index:
input: rules.STARmap.output.bam
# Modify each run
output: 'PATH/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai'
params:
threads: 32
run:
shell('samtools index {input}')
rule BamRemove:
input:
AnnotateUMI_BAM = rules.AnnotateUMI.output,
# Modify each run and include in future version to delete
#AnnotateUMI_BAI = 'PATH/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bai',
SortSam = rules.SortSam.output,
MItag = rules.MItag.output,
GroupUMI = rules.GroupUMI.output,
ConcensusUMI = rules.ConcensusUMI.output,
STARmap = rules.STARmap.output.bam,
Index = rules.Index.output
# Modify each run
output: touch('PATH/logfiles/{sampleID}_removed.txt')
threads: 32
run:
shell('rm {input.AnnotateUMI_BAM} {input.SortSam} {input.MItag} {input.GroupUMI} {input.ConcensusUMI}')

Create index of a reference genome with bwa and gatk using snakemake touch

I align reads with bwa and call variants with gatk. gatk needs a sequence dictionary (.dict) for the reference genome, and bwa needs its index files. When I use touch("ref") as the output for both of them I get this error:
AmbiguousRuleException:
Rules bwa_index and gatk_refdict are ambiguous for the file ref.
Expected input files:
bwa_index: ref.fasta
gatk_refdict: ref.fasta
This is the code:
rule bwa_index:
input:
database="ref.fasta"
output:
done =touch("ref")
shell:
"""
bwa index -p ref {input.database}
"""
rule bwa_mem:
input:
bwa_index_done = "ref",
fastq1="{sample}_R1.trimmed.fastq.gz",
fastq2="{sample}_R2.trimmed.fastq.gz"
output:
bam = temp("{sample}.bam")
shell:
"""
bwa mem ref {input.fastq1} {input.fastq2} -o {output.bam}
"""
rule_gatk_refdict:
input:
ref="ref.fasta"
output:
done =touch("ref")
shell:
"""
java -jar gatk-package-4.1.9.0-local.jar CreateSequenceDictionary -R {input.ref} -O {output.done}
"""
rule gatk:
input:
gatk_refdict_done = "ref",
bam="bam_list"
output:
outf ="{chr}.vcf"
shell:
"""
java -jar gatk-package-4.1.9.0-local.jar HaplotypeCaller -L {wildcards.chr} -R ref -I {input.bam} --min-base-quality-score 20 -O {output.outf}
"""
Alternatively, I specify the .dict index as the output, but then it doesn't work either, because gatk calls variants before the dict is created and so I get an error that there is no dict file:
rule_gatk_refdict:
input:
ref="ref.fasta"
output:
outf ="ref.dict"
shell:
"""
java -jar gatk-package-4.1.9.0-local.jar CreateSequenceDictionary -R {input.ref} -O {output.outf}
"""
rule gatk:
input:
ref = "ref.fasta",
bam="bam_list"
output:
outf ="{chr}.vcf"
shell:
"""
java -jar gatk-package-4.1.9.0-local.jar HaplotypeCaller -L {wildcards.chr} -R {input.ref} -I {input.bam} --min-base-quality-score 20 -O {output.outf}
"""
How to solve this?
Why don't you simply define the dict file as an input of the gatk rule and the index as an input of the bwa rule?
rule bwa_index:
input:
database="ref.fasta"
output:
done =touch("ref")
shell:
"""
bwa index -p ref {input.database}
"""
rule bwa_mem:
input:
bwa_index_done = "ref",
fastq1="{sample}_R1.trimmed.fastq.gz",
fastq2="{sample}_R2.trimmed.fastq.gz"
output:
bam = temp("{sample}.bam")
shell:
"""
bwa mem ref {input.fastq1} {input.fastq2} -o {output.bam}
"""
rule gatk_refdict:
input:
ref="ref.fasta"
output:
done = "ref.dict"
shell:
"""
java -jar gatk-package-4.1.9.0-local.jar CreateSequenceDictionary -R {input.ref} -O {output.done}
"""
rule gatk:
input:
ref = "ref.fasta",
dict = "ref.dict",
bam="bam_list"
output:
outf ="{chr}.vcf"
shell:
"""
java -jar gatk-package-4.1.9.0-local.jar HaplotypeCaller -L {wildcards.chr} -R {input.ref} -I {input.bam} --min-base-quality-score 20 -O {output.outf}
"""
The AmbiguousRuleException you get is because snakemake doesn't know which rule to run, since two rules have the same output. Don't forget that snakemake builds the DAG starting from the rule all. When it comes to running rule gatk, you define "ref" as an input. Since two rules can produce this file, snakemake does not know whether it must use rule gatk_refdict or rule bwa_index.
You also have a typo there (an "_" that should not be there):
----v
rule_gatk_refdict:
input:
ref="ref.fasta"
...
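If you really want marker files for both indexing steps, giving each rule its own distinct touch target also avoids the collision; the marker file name below is just illustrative:
rule bwa_index:
    input:
        database="ref.fasta"
    output:
        done=touch("ref.bwa_index.done")  # illustrative marker name, unique to this rule
    shell:
        """
        bwa index -p ref {input.database}
        """
rule bwa_mem would then take bwa_index_done = "ref.bwa_index.done" as its input instead of "ref".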

Calling another pipeline within a snakefile results in missing output errors

I am using an assembly pipeline called Canu inside my snakemake pipeline, but when it comes to the rule calling Canu, snakemake exits with the MissingOutputException error: Canu itself submits multiple jobs to the cluster, so it seems snakemake expects the output right after the first job has finished. Is there a way to avoid this? I know I could use a very long --latency-wait option, but this is not optimal.
snakefile code:
#!/miniconda/bin/python
workdir: config["path_to_files"]
wildcard_constraints:
separator = config["separator"],
sample = '|' .join(config["samples"]),
rule all:
input:
expand("assembly-stats/{sample}_stats.txt", sample = config["samples"])
rule short_reads_QC:
input:
f"short_reads/{{sample}}_short{config['separator']}*.fq.gz"
output:
"fastQC-reports/{sample}.html"
conda:
"/home/lamma/env-export/hybrid_assembly.yaml"
shell:
"""
mkdir fastqc-reports
fastqc -o fastqc-reports {input}
"""
rule quallity_trimming:
input:
forward = f"short_reads/{{sample}}_short{config['separator']}1.fq.gz",
reverse = f"short_reads/{{sample}}_short{config['separator']}2.fq.gz",
output:
forward = "cleaned_short-reads/{sample}_short_1-clean.fastq",
reverse = "cleaned_short-reads/{sample}_short_2-clean.fastq"
conda:
"/home/lamma/env-export/hybrid_assembly.yaml"
shell:
"bbduk.sh -Xmx1g in1={input.forward} in2={input.reverse} out1={output.forward} out2={output.reverse} qtrim=rl trimq=10"
rule long_read_assembly:
input:
"long_reads/{sample}_long.fastq.gz"
output:
"canu-outputs/{sample}.subreads.contigs.fasta"
conda:
"/home/lamma/env-export/hybrid_assembly.yaml"
shell:
"canu -p {wildcards.sample} -d canu-outputs genomeSize=8m -pacbio-raw {input}"
rule short_read_alignment:
input:
short_read_fwd = "cleaned_short-reads/{sample}_short_1-clean.fastq",
short_read_rvs = "cleaned_short-reads/{sample}_short_2-clean.fastq",
reference = "canu-outputs/{sample}.subreads.contigs.fasta"
output:
"bwa-output/{sample}_short.bam"
conda:
"/home/lamma/env-export/hybrid_assembly.yaml"
shell:
"bwa mem {input.reference} {input.short_read_fwd} {input.short_read_rvs} | samtools view -S -b > {output}"
rule indexing_and_sorting:
input:
"bwa-output/{sample}_short.bam"
output:
"bwa-output/{sample}_short_sorted.bam"
conda:
"/home/lamma/env-export/hybrid_assembly.yaml"
shell:
"samtools sort {input} > {output}"
rule polishing:
input:
bam_files = "bwa-output/{sample}_short_sorted.bam",
long_assembly = "canu-outputs/{sample}.subreads.contigs.fasta"
output:
"pilon-output/{sample}-improved.fasta"
conda:
"/home/lamma/env-export/hybrid_assembly.yaml"
shell:
"pilon --genome {input.long_assembly} --frags {input.bam_files} --output {output} --outdir pilon-output"
rule assembly_stats:
input:
"pilon-output/{sample}-improved.fasta"
output:
"assembly-stats/{sample}_stats.txt"
conda:
"/home/lamma/env-export/hybrid_assembly.yaml"
shell:
"stats.sh in={input} gc=assembly-stats/{wildcards.sample}/{wildcards.sample}_gc.csv gchist=assembly-stats/{wildcards.sample}/{wildcards.sample}_gchist.csv shist=assembly-stats/{wildcards.sample}/{wildcards.sample}_shist.csv > assembly-stats/{wildcards.sample}/{wildcards.sample}_stats.txt"
The exact error:
Waiting at most 60 seconds for missing files.
MissingOutputException in line 43 of /faststorage/home/lamma/scripts/hybrid_assembly/bacterial-hybrid-assembly.smk:
Missing files after 60 seconds:
canu-outputs/F19FTSEUHT1027.PSU4_ISF1A.subreads.contigs.fasta
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The snakemake command being used:
snakemake --latency-wait 60 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/faststorage/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output}' --cluster-config bacterial-hybrid-assembly-config.json --configfile yaml-config-files/test_experiment3.yaml --use-conda --snakefile bacterial-hybrid-assembly.smk
I surmise that canu is giving you canu-outputs/{sample}.contigs.fasta, not canu-outputs/{sample}.subreads.contigs.fasta. If so, edit the canu command to be
canu -p {wildcards.sample}.subreads ...
(By the way, I don't think #!/miniconda/bin/python is necessary).
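Applied to the rule in question, that would look something like this (a sketch; everything except the -p prefix is unchanged):
rule long_read_assembly:
    input:
        "long_reads/{sample}_long.fastq.gz"
    output:
        "canu-outputs/{sample}.subreads.contigs.fasta"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    # the assembly prefix now includes ".subreads", so canu writes
    # {sample}.subreads.contigs.fasta and the declared output matches
    shell:
        "canu -p {wildcards.sample}.subreads -d canu-outputs genomeSize=8m -pacbio-raw {input}"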

Command not found error in snakemake pipeline despite the package existing in the conda environment

I am getting the following error in the snakemake pipeline:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 long_read_assembly
1
[Wed Jan 15 11:35:18 2020]
rule long_read_assembly:
input: long_reads/F19FTSEUHT1027.PSU4_ISF1A_long.fastq.gz
output: canu-outputs/F19FTSEUHT1027.PSU4_ISF1A.subreads.contigs.fasta
jobid: 0
wildcards: sample=F19FTSEUHT1027.PSU4_ISF1A
/usr/bin/bash: canu: command not found
[Wed Jan 15 11:35:18 2020]
Error in rule long_read_assembly:
jobid: 0
output: canu-outputs/F19FTSEUHT1027.PSU4_ISF1A.subreads.contigs.fasta
shell:
canu -p F19FTSEUHT1027.PSU4_ISF1A -d canu-outputs genomeSize=8m -pacbio-raw long_reads/F19FTSEUHT1027.PSU4_ISF1A_long.fastq.gz
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I assume this means that the canu command cannot be found. But the Canu package does exist inside the conda environment:
(hybrid_assembly) [lamma@fe1 Assembly]$ conda list | grep canu
canu 1.9 he1b5a44_0 bioconda
The snakefile looks like this:
workdir: config["path_to_files"]
wildcard_constraints:
separator = config["separator"],
sample = '|' .join(config["samples"]),
rule all:
input:
expand("assembly-stats/{sample}_stats.txt", sample = config["samples"])
rule short_reads_QC:
input:
f"short_reads/{{sample}}_short{config['separator']}*.fq.gz"
output:
"fastQC-reports/{sample}.html"
conda:
"/home/lamma/env-export/hybrid_assembly.yaml"
shell:
"""
mkdir fastqc-reports
fastqc -o fastqc-reports {input}
"""
rule quallity_trimming:
input:
forward = f"short_reads/{{sample}}_short{config['separator']}1.fq.gz",
reverse = f"short_reads/{{sample}}_short{config['separator']}2.fq.gz",
output:
forward = "cleaned_short-reads/{sample}_short_1-clean.fastq",
reverse = "cleaned_short-reads/{sample}_short_2-clean.fastq"
conda:
"/home/lamma/env-export/hybrid_assembly.yaml"
shell:
"bbduk.sh -Xmx1g in1={input.forward} in2={input.reverse} out1={output.forward} out2={output.reverse} qtrim=rl trimq=10"
rule long_read_assembly:
input:
"long_reads/{sample}_long.fastq.gz"
output:
"canu-outputs/{sample}.subreads.contigs.fasta"
conda:
"/home/lamma/env-export/hybrid_assembly.yaml"
shell:
"canu -p {wildcards.sample} -d canu-outputs genomeSize=8m -pacbio-raw {input}"
rule short_read_alignment:
input:
short_read_fwd = "cleaned_short-reads/{sample}_short_1-clean.fastq",
short_read_rvs = "cleaned_short-reads/{sample}_short_2-clean.fastq",
reference = "canu-outputs/{sample}.subreads.contigs.fasta"
output:
"bwa-output/{sample}_short.bam"
conda:
"/home/lamma/env-export/hybrid_assembly.yaml"
shell:
"bwa mem {input.reference} {input.short_read_fwd} {input.short_read_rvs} | samtools view -S -b > {output}"
rule indexing_and_sorting:
input:
"bwa-output/{sample}_short.bam"
output:
"bwa-output/{sample}_short_sorted.bam"
conda:
"/home/lamma/env-export/hybrid_assembly.yaml"
shell:
"samtools sort {input} > {output}"
rule polishing:
input:
bam_files = "bwa-output/{sample}_short_sorted.bam",
long_assembly = "canu-outputs/{sample}.subreads.contigs.fasta"
output:
"pilon-output/{sample}-improved.fasta"
conda:
"/home/lamma/env-export/hybrid_assembly.yaml"
shell:
"pilon --genome {input.long_assembly} --frags {input.bam_files} --output {output} --outdir pilon-output"
rule assembly_stats:
input:
"pilon-output/{sample}-improved.fasta"
output:
"assembly-stats/{sample}_stats.txt"
conda:
"/home/lamma/env-export/hybrid_assembly.yaml"
shell:
"stats.sh in={input} gc=assembly-stats/{wildcards.sample}/{wildcards.sample}_gc.csv gchist=assembly-stats/{wildcards.sample}/{wildcards.sample}_gchist.csv shist=assembly-stats/{wildcards.sample}/{wildcards.sample}_shist.csv > assembly-stats/{wildcards.sample}/{wildcards.sample}_stats.txt"
The rule calling canu has the correct syntax as far as I am aware, so I am not sure what is causing this error.
Edit:
Adding the snakemake command
snakemake --latency-wait 60 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/faststorage/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output} --wait --parsable' --cluster-config bacterial-hybrid-assembly-config.json --configfile yaml-config-files/test_experiment3.yaml --snakefile bacterial-hybrid-assembly.smk
When running a snakemake workflow, if certain rules are to be run within a rule-specific conda environment, the command line call should be of the form
snakemake [... various options ...] --use-conda [--conda-prefix <some-directory>]
If you don't tell snakemake to use conda, all the conda: <some_path> entries in your rules are ignored, and the rules are run in whatever environment is currently activated.
The --conda-prefix <dir> is optional but tells snakemake where to put and find the installed environments. If you don't specify it, a conda env will be installed within the .snakemake folder, meaning that the .snakemake folder can get pretty huge and that the .snakemake folders of multiple projects may contain a lot of duplicated conda packages.
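Applied to the command from the question, that means adding --use-conda (and optionally --conda-prefix; the directory used below is just an example):
snakemake --latency-wait 60 --rerun-incomplete --keep-going --jobs 99 \
    --cluster-status 'python /home/lamma/faststorage/scripts/slurm-status.py' \
    --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output} --wait --parsable' \
    --cluster-config bacterial-hybrid-assembly-config.json \
    --configfile yaml-config-files/test_experiment3.yaml \
    --snakefile bacterial-hybrid-assembly.smk \
    --use-conda \
    --conda-prefix /home/lamma/conda-envs  # example location for a shared conda env cache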