Have Snakemake recognize complete files upon relaunch - snakemake

I have created this Snakemake workflow. The pipeline works well; however, if any rule fails and I relaunch, Snakemake isn't recognizing all completed files. For instance, sample A finishes all the way through and creates all files for rule all, but sample B fails at rule AnnotateUMI. When I relaunch, Snakemake wants to redo all jobs for both A and B, instead of just B. What do I need to change to get this to work?
sampleIDs = ['A', 'B']

rule all:
    input:
        expand('PATH/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam', sampleID=sampleIDs),
        expand('PATH/bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai', sampleID=sampleIDs),
        expand('/PATH/logfiles/{sampleID}_removed.txt', sampleID=sampleIDs)

# Some tools require unzipped fastqs
rule AnnotateUMI:
    # Modify each run
    input: 'PATH/{sampleID}_unisamp_L001_001.star_rg_added.sorted.dmark.bam'
    # Modify each run
    output: 'PATH/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bam'
    # Modify each run
    params: 'PATH/{sampleID}_unisamp_L001_UMI.fastq.gz'
    threads: 36
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar PATH/fgbio-2.0.0.jar AnnotateBamWithUmis \
            -i {input} \
            -f {params} \
            -o {output}')

rule SortSam:
    input: rules.AnnotateUMI.output
    # Modify each run
    output: 'PATH/{sampleID}_Qsorted.MarkUMI.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx110g -jar PATH/picard.jar SortSam \
            INPUT={input} \
            OUTPUT={output} \
            SORT_ORDER=queryname')

rule MItag:
    input: rules.SortSam.output
    # Modify each run
    output: 'PATH/{sampleID}_Qsorted.MarkUMI.MQ.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar PATH/fgbio-2.0.0.jar SetMateInformation \
            -i {input} \
            -o {output}')

rule GroupUMI:
    input: rules.MItag.output
    # Modify each run
    output: 'PATH/{sampleID}_grouped.Qsorted.MarkUMI.MQ.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar PATH/fgbio-2.0.0.jar GroupReadsByUmi \
            -i {input} \
            -s adjacency \
            -e 1 \
            -m 20 \
            -o {output}')

rule ConcensusUMI:
    input: rules.GroupUMI.output
    # Modify each run
    output: 'PATH/{sampleID}_concensus.Qunsorted.MarkUMI.MQ.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar PATH/fgbio-2.0.2.jar CallMolecularConsensusReads \
            --input={input} \
            --min-reads=1 \
            --output={output}')

rule STARmap:
    input: rules.ConcensusUMI.output
    # Modify each run
    output:
        log = 'PATH/{sampleID}_UMI_Concensus_Log.final.out',
        bam = 'PATH/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam'
    # Modify each run
    params: 'PATH/{sampleID}_UMI_Concensus_'
    threads: 32
    run:
        # Each user needs to set genome path
        shell('STAR \
            --runThreadN {threads} \
            --readFilesIn {input} \
            --readFilesType SAM PE \
            --readFilesCommand samtools view -h \
            --genomeDir PATH/STAR_hg19_v2.7.5c \
            --outSAMtype BAM SortedByCoordinate \
            --outSAMunmapped Within \
            --limitBAMsortRAM 220000000000 \
            --outFileNamePrefix {params}')

rule Index:
    input: rules.STARmap.output.bam
    # Modify each run
    output: 'PATH/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai'
    threads: 32
    run:
        shell('samtools index {input}')

rule BamRemove:
    input:
        AnnotateUMI_BAM = rules.AnnotateUMI.output,
        # Modify each run and include in future version to delete
        #AnnotateUMI_BAI = 'PATH/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bai',
        SortSam = rules.SortSam.output,
        MItag = rules.MItag.output,
        GroupUMI = rules.GroupUMI.output,
        ConcensusUMI = rules.ConcensusUMI.output,
        STARmap = rules.STARmap.output.bam,
        Index = rules.Index.output
    # Modify each run
    output: touch('PATH/logfiles/{sampleID}_removed.txt')
    threads: 32
    run:
        shell('rm {input.AnnotateUMI_BAM} {input.SortSam} {input.MItag} {input.GroupUMI} {input.ConcensusUMI}')
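One likely contributor (a sketch, not a confirmed diagnosis): rule BamRemove deletes intermediate files that Snakemake still tracks as inputs, so on relaunch Snakemake may decide those files need to be regenerated. The usual alternative is to mark intermediates with temp(), so Snakemake deletes them itself once no pending job needs them and does not treat their absence as a reason to re-run a completed sample. Using SortSam from the question as an example:

```
rule SortSam:
    input: rules.AnnotateUMI.output
    # temp(): Snakemake removes this file after all consumers have run,
    # and its absence alone does not force completed samples to re-run
    output: temp('PATH/{sampleID}_Qsorted.MarkUMI.bam')
    threads: 32
    run:
        shell('java -Xmx110g -jar PATH/picard.jar SortSam \
            INPUT={input} \
            OUTPUT={output} \
            SORT_ORDER=queryname')
```

With temp() on each intermediate, the BamRemove rule and its rm call can be dropped entirely.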


Delete unwanted Snakemake Outputs

I have looked at a few other posts about Snakemake and deleting unneeded data to clean up disk space. I have designed a rule called BamRemove that touches a log file listed in my rule all. However, the workflow manager isn't recognizing it. I am getting this error: WildcardError in line 35 of /PATH:
No values given for wildcard 'SampleID'. I am not seeing why. Any help to get this to work would be appreciated.
sampleIDs = d.keys()

rule all:
    input:
        expand('bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam', sampleID=sampleIDs),
        expand('bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai', sampleID=sampleIDs),
        expand('logs/{SampleID}_removed.txt', sampleID=sampleIDs) # Line 35

# Some tools require unzipped fastqs
rule AnnotateUMI:
    input: 'bams/{sampleID}_unisamp_L001_001.star_rg_added.sorted.dmark.bam'
    output: 'bams/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bam'
    # Modify each run
    params: '/data/Test/fastqs/{sampleID}_unisamp_L001_UMI.fastq.gz'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.0.jar AnnotateBamWithUmis \
            -i {input} \
            -f {params} \
            -o {output}')

rule SortSam:
    input: rules.AnnotateUMI.output
    output: 'bams/{sampleID}_Qsorted.MarkUMI.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx110g -jar /data/Tools/picard.jar SortSam \
            INPUT={input} \
            OUTPUT={output} \
            SORT_ORDER=queryname')

rule MItag:
    input: rules.SortSam.output
    output: 'bams/{sampleID}_Qsorted.MarkUMI.MQ.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.0.jar SetMateInformation \
            -i {input} \
            -o {output}')

rule GroupUMI:
    input: rules.MItag.output
    output: 'bams/{sampleID}_grouped.Qsorted.MarkUMI.MQ.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.0.jar GroupReadsByUmi \
            -i {input} \
            -s adjacency \
            -e 1 \
            -m 20 \
            -o {output}')

rule ConcensusUMI:
    input: rules.GroupUMI.output
    output: 'bams/{sampleID}_concensus.Qunsorted.MarkUMI.MQ.bam'
    threads: 32
    run:
        # Each user needs to set tool path
        shell('java -Xmx220g -jar /data/Tools/fgbio-2.0.2.jar CallMolecularConsensusReads \
            --input={input} \
            --min-reads=1 \
            --output={output}')

rule STARmap:
    input: rules.ConcensusUMI.output
    output:
        log = 'bams/{sampleID}_UMI_Concensus_Log.final.out',
        bam = 'bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam'
    params: 'bams/{sampleID}_UMI_Concensus_'
    threads: 32
    run:
        # Each user needs to set genome path
        shell('STAR \
            --runThreadN {threads} \
            --readFilesIn {input} \
            --readFilesType SAM PE \
            --readFilesCommand samtools view -h \
            --genomeDir /data/reference/star/STAR_hg19_v2.7.5c \
            --outSAMtype BAM SortedByCoordinate \
            --outSAMunmapped Within \
            --limitBAMsortRAM 220000000000 \
            --outFileNamePrefix {params}')

rule Index:
    input: rules.STARmap.output.bam
    output: 'bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai'
    threads: 32
    run:
        shell('samtools index {input}')

rule BamRemove:
    input:
        AnnotateUMI_BAM = rules.AnnotateUMI.output,
        AnnotateUMI_BAI = '{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bai',
        SortSam = rules.SortSam.output,
        MItag = rules.MItag.output,
        GroupUMI = rules.GroupUMI.output,
        ConcensusUMI = rules.ConcensusUMI.output
    output: touch('logs/{SampleID}_removed.txt')
    threads: 32
    run:
        shell('rm {input}')
expand('logs/{SampleID}_removed.txt', sampleID=sampleIDs) # Line 35
              ^^^^^^^^                ^^^^^^^^
The error is due to SampleID being spelled differently from sampleID; make the wildcard name consistent throughout the script.
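To see why the wildcard goes unfilled: expand() only substitutes the keyword names it is given, so a placeholder spelled any other way is left over and reported. A plain-Python illustration of that mechanism (stdlib only, no Snakemake required):

```python
import string

def placeholder_names(pattern):
    """Names of the {field} placeholders in a format-style pattern."""
    return {field for _, field, _, _ in string.Formatter().parse(pattern) if field}

# The pattern declares 'SampleID', but expand() was called with sampleID=...
pattern = 'logs/{SampleID}_removed.txt'
unfilled = placeholder_names(pattern) - {'sampleID'}
print(unfilled)  # {'SampleID'}: the wildcard Snakemake reports as having no values
```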

"Input files updated by another job" but not true

I have an issue with a pipeline and I don't know if it is bad use on my part or a bug. I am using snakemake 5.26.1. I did not have any problems until a few days ago, with the same version of snakemake, and I don't understand what changed.
Part of the pipeline comes just after a checkpoint rule and an aggregating step, which produces an output file, e.g. foo.fasta.
I have rules that follow the production of foo.fasta and use it as an input file; they are run again with the reason "Input files updated by another job", but this is not the case, and their output is more recent than foo.fasta.
Additionally, using the --summary option, the file foo.fasta is marked "no update", and the timestamps are in the right order for the following rules not to have to be executed again.
I cannot find the reason why the following rules are run again. Is it possible the checkpoint is causing an issue, making Snakemake think foo.fasta is updated while it is not?
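As a sanity check of the timestamp argument itself (plain Python, stdlib only; the paths below are the question's, used as placeholders), a purely mtime-based comparison like the following agrees with --summary that nothing needs updating, which suggests the re-runs come from the checkpoint's DAG re-evaluation rather than from file times:

```python
import os

def output_is_current(output_path, input_path):
    """True when the output file is at least as new as its input,
    i.e. a plain timestamp-based build tool would leave it alone."""
    return os.path.getmtime(output_path) >= os.path.getmtime(input_path)

# e.g. output_is_current('results/kat/tros_v5/tros_v5_comp-main.mx',
#                        'results/fasta/tros_v5.pseudohap.fasta.gz')
```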
Here is a dry run output:
Rule agouti_scaffolding produces tros_v5.pseudohap.fasta.gz (my foo.fasta).

Building DAG of jobs...
Updating job 7 (aggregate_gff3).
Replace aggregate_gff3 with dynamic branch aggregate_gff3
updating depending job agouti_scaffolding
Updating job 3 (agouti_scaffolding).
Replace agouti_scaffolding with dynamic branch agouti_scaffolding
updating depending job unzip_fasta
updating depending job btk_prepare_workdir
updating depending job v6_clean_rename
updating depending job kat_comp
updating depending job assembly_stats
Job counts:
    count   jobs
    1       kat_comp
    1       kat_plot_spectra
    2

[Tue Oct 27 12:08:41 2020]
rule kat_comp:
    input: results/preprocessing/tros/tros_dedup_proc_fastp_R1_001.fastq.gz, results/preprocessing/tros/tros_dedup_proc_fastp_R2_001.fastq.gz, results/fasta/tros_v5.pseudohap.fasta.gz
    output: results/kat/tros_v5/tros_v5_comp-main.mx
    log: logs/kat_comp.tros_v5.log
    jobid: 1
    reason: Input files updated by another job: results/fasta/tros_v5.pseudohap.fasta.gz
    wildcards: sample=tros, version=v5

[Tue Oct 27 12:08:41 2020]
rule kat_plot_spectra:
    input: results/kat/tros_v5/tros_v5_comp-main.mx
    output: results/kat/tros_v5/tros_v5_spectra.pdf
    log: logs/kat_plot.tros_v5.log
    jobid: 0
    reason: Input files updated by another job: results/kat/tros_v5/tros_v5_comp-main.mx
    wildcards: sample=tros, version=v5

Job counts:
    count   jobs
    1       kat_comp
    1       kat_plot_spectra
    2
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
It seems "updating depending job kat_comp" is the issue.
Even the ancient flag on the kat_comp input does not change anything.
EDIT
Here is part of the pipeline.
As @TroyComi suggested, the latest development version (5.26.1+26.gc2e2b501.dirty) solves the issue with the ancient directive. When the ancient directive is removed, the rules are still executed.
rule all:
    input:
        "results/kat/tros_v5/tros_v5_spectra.pdf"

checkpoint split_fa_augustus:
    input:
        "results/fasta/{sample}_v4.pseudohap.fasta.gz"
    output:
        "results/agouti/{sample}/{sample}_v4.pseudohap.fa",
        directory("results/agouti/{sample}/split")
    params:
        split_size = 50000000
    conda:
        "../envs/augustus.yaml"
    shell:
        """
        zcat {input} > {output[0]}
        mkdir {output[1]}
        splitMfasta.pl {output[0]} \
            --outputpath={output[1]} --minsize={params.split_size}
        """

rule augustus:
    input:
        "results/agouti/{sample}/split/{sample}_v4.pseudohap.split.{i}.fa"
    output:
        "results/agouti/{sample}/split/pred_{i}.gff3"
    conda:
        "../envs/augustus.yaml"
    shell:
        """
        augustus --gff3=on --species=caenorhabditis {input} > {output}
        """

def aggregate_input_gff3(wildcards):
    checkpoint_output = checkpoints.split_fa_augustus.get(**wildcards).output[1]
    return expand("results/agouti/{sample}/split/pred_{i}.gff3",
                  sample=wildcards.sample,
                  i=glob_wildcards(os.path.join(checkpoint_output, f"{wildcards.sample}_v4.pseudohap.split." + "{i}.fa")).i)

rule aggregate_gff3:
    input:
        aggregate_input_gff3
    output:
        "results/agouti/{sample}/{sample}_v4.pseudohap.gff3"
    conda:
        "../envs/augustus.yaml"
    shell:
        "cat {input} | join_aug_pred.pl > {output}"
#===============================
# Preprocess RNAseq data
# The RNAseq reads need to be in the folder resources/RNAseq_raw/{sample}
# Files must be named {run}_R1.fastq.gz and {run}_R2.fastq.gz for globbing to work
# Globbing is done in the rule merge_RNA_bams
rule rna_rcorrector:
    input:
        expand("resources/RNAseq_raw/{{sample}}/{{run}}_{R}.fastq.gz",
               R=['R1', 'R2'])
    output:
        temp(expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_{R}.cor.fq.gz",
                    R=['R1', 'R2']))
    params:
        outdir = lambda w, output: os.path.dirname(output[0])
    log:
        "logs/rcorrector_{sample}_{run}.log"
    threads:
        config['agouti']['threads']
    conda:
        "../envs/rna_seq.yaml"
    shell:
        """
        run_rcorrector.pl -1 {input[0]} -2 {input[1]} \
            -t {threads} \
            -od {params.outdir} \
            > {log} 2>&1
        """

rule rna_trimgalore:
    input:
        expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_{R}.cor.fq.gz",
               R=['R1', 'R2'])
    output:
        temp(expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_trimgal_val_{i}.fq",
                    i=['1', '2']))
    params:
        outdir = lambda w, output: os.path.dirname(output[0]),
        basename = lambda w: f'{w.run}_trimgal'
    log:
        "logs/trimgalore_{sample}_{run}.log"
    threads:
        config['agouti']['threads']
    conda:
        "../envs/rna_seq.yaml"
    shell:
        """
        trim_galore --cores {threads} \
            --phred33 \
            --quality 20 \
            --stringency 1 \
            -e 0.1 \
            --length 70 \
            --output_dir {params.outdir} \
            --basename {params.basename} \
            --dont_gzip \
            --paired \
            {input} \
            > {log} 2>&1
        """

#===============================
# Map the RNAseq reads
rule index_ref:
    input:
        "results/agouti/{sample}/{sample}_v4.pseudohap.fa"
    output:
        multiext("results/agouti/{sample}/{sample}_v4.pseudohap.fa",
                 ".0123", ".amb", ".ann", ".bwt.2bit.64", ".bwt.8bit.32", ".pac")
    conda:
        "../envs/mapping.yaml"
    shell:
        "bwa-mem2 index {input}"

rule map_RNAseq:
    input:
        expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_trimgal_val_{i}.fq",
               i=['1', '2']),
        "results/agouti/{sample}/{sample}_v4.pseudohap.fa",
        multiext("results/agouti/{sample}/{sample}_v4.pseudohap.fa",
                 ".0123", ".amb", ".ann", ".bwt.2bit.64", ".bwt.8bit.32", ".pac")
    output:
        "results/agouti/{sample}/mapping/{run}.bam"
    log:
        "logs/bwa_rna_{sample}_{run}.log"
    conda:
        "../envs/mapping.yaml"
    threads:
        config['agouti']['threads']
    shell:
        """
        bwa-mem2 mem -t {threads} {input[2]} {input[0]} {input[1]} 2> {log} \
            | samtools view -b -@ {threads} -o {output}
        """

def get_sample_rna_runs(w):
    list_R1_files = glob.glob(f"resources/RNAseq_raw/{w.sample}/*_R1.fastq.gz")
    list_runs = [re.sub(r'_R1\.fastq\.gz$', '', os.path.basename(f)) for f in list_R1_files]
    return [f'results/agouti/{w.sample}/mapping/{run}.bam' for run in list_runs]

rule merge_RNA_bams:
    input:
        get_sample_rna_runs
    output:
        "results/agouti/{sample}/RNAseq_mapped_merged.bam"
    params:
        tmp_merge = lambda w: f'results/agouti/{w.sample}/tmp_merge.bam'
    conda:
        "../envs/mapping.yaml"
    threads:
        config['agouti']['threads']
    shell:
        """
        samtools merge -@ {threads} {params.tmp_merge} {input}
        samtools sort -@ {threads} -n -o {output} {params.tmp_merge}
        rm {params.tmp_merge}
        """

#===============================
# Run agouti on all that
rule agouti_scaffolding:
    input:
        fa = "results/agouti/{sample}/{sample}_v4.pseudohap.fa",
        bam = "results/agouti/{sample}/RNAseq_mapped_merged.bam",
        gff = "results/agouti/{sample}/{sample}_v4.pseudohap.gff3"
    output:
        protected("results/fasta/{sample}_v5.pseudohap.fasta.gz")
    params:
        outdir = lambda w: f'results/agouti/{w.sample}/agouti_out',
        minMQ = 20,
        maxFracMM = 0.05
    log:
        "logs/agouti_{sample}.log"
    conda:
        "../envs/agouti.yaml"
    shell:
        """
        python /opt/agouti/agouti.py scaffold \
            -assembly {input.fa} \
            -bam {input.bam} \
            -gff {input.gff} \
            -outdir {params.outdir} \
            -minMQ {params.minMQ} -maxFracMM {params.maxFracMM} \
            > {log} 2>&1
        gzip -c {params.outdir}/agouti.agouti.fasta > {output}
        """
#===============================================================
# Now do something on output {sample}_{version}.pseudohap.fasta.gz
rule kat_comp:
    input:
        expand("results/preprocessing/{{sample}}/{{sample}}_dedup_proc_fastp_{R}_001.fastq.gz",
               R=["R1", "R2"]),
        ancient("results/fasta/{sample}_{version}.pseudohap.fasta.gz")
    output:
        "results/kat/{sample}_{version}/{sample}_{version}_comp-main.mx"
    params:
        outprefix = lambda w: f'results/kat/{w.sample}_{w.version}/{w.sample}_{w.version}_comp'
    log:
        "logs/kat_comp.{sample}_{version}.log"
    conda:
        "../envs/kat.yaml"
    threads:
        16
    shell:
        """
        kat comp -t {threads} \
            -o {params.outprefix} \
            '{input[0]} {input[1]}' \
            {input[2]} \
            > {log} 2>&1
        """

rule kat_plot_spectra:
    input:
        "results/kat/{sample}_{version}/{sample}_{version}_comp-main.mx"
    output:
        "results/kat/{sample}_{version}/{sample}_{version}_spectra.pdf"
    params:
        title = lambda w: f'{w.sample}_{w.version}'
    log:
        "logs/kat_plot.{sample}_{version}.log"
    conda:
        "../envs/kat.yaml"
    shell:
        """
        kat plot spectra-cn \
            -o {output} \
            -t {params.title} \
            {input} \
            > {log} 2>&1
        """

Create index of a reference genome with bwa and gatk using snakemake touch

I align reads with bwa and call variants with gatk. gatk needs a dict created for the reference genome, and bwa needs indices. When I use touch for both of them, I get this error:
AmbiguousRuleException:
Rules bwa_index and gatk_refdict are ambiguous for the file ref.
Expected input files:
    bwa_index: ref.fasta
    gatk_refdict: ref.fasta
This is the code:
rule bwa_index:
    input:
        database="ref.fasta"
    output:
        done = touch("ref")
    shell:
        """
        bwa index -p ref {input.database}
        """

rule bwa_mem:
    input:
        bwa_index_done = "ref",
        fastq1="{sample}_R1.trimmed.fastq.gz",
        fastq2="{sample}_R2.trimmed.fastq.gz"
    output:
        bam = temp("{sample}.bam")
    shell:
        """
        bwa mem ref {input.fastq1} {input.fastq2} -o {output.bam}
        """

rule_gatk_refdict:
    input:
        ref="ref.fasta"
    output:
        done = touch("ref")
    shell:
        """
        java -jar gatk-package-4.1.9.0-local.jar CreateSequenceDictionary -R {input.ref} -O {output.done}
        """

rule gatk:
    input:
        gatk_refdict_done = "ref",
        bam="bam_list"
    output:
        outf = "{chr}.vcf"
    shell:
        """
        java -jar gatk-package-4.1.9.0-local.jar HaplotypeCaller -L {wildcards.chr} -R ref -I {input.bam} --min-base-quality-score 20 -O {output.outf}
        """
Alternatively, I tried specifying the index .dict, but then it doesn't work either, because gatk calls variants before creating the dict, so I get an error that there is no dict file:
rule_gatk_refdict:
    input:
        ref="ref.fasta"
    output:
        outf = "ref.dict"
    shell:
        """
        java -jar gatk-package-4.1.9.0-local.jar CreateSequenceDictionary -R {input.ref} -O {output.outf}
        """

rule gatk:
    input:
        ref = "ref.fasta",
        bam="bam_list"
    output:
        outf = "{chr}.vcf"
    shell:
        """
        java -jar gatk-package-4.1.9.0-local.jar HaplotypeCaller -L {wildcards.chr} -R {input.ref} -I {input.bam} --min-base-quality-score 20 -O {output.outf}
        """
How to solve this?
Why don't you simply define the dict file as an input of the gatk rule and the index as an input of the bwa rule?
rule bwa_index:
    input:
        database="ref.fasta"
    output:
        done = touch("ref")
    shell:
        """
        bwa index -p ref {input.database}
        """

rule bwa_mem:
    input:
        bwa_index_done = "ref",
        fastq1="{sample}_R1.trimmed.fastq.gz",
        fastq2="{sample}_R2.trimmed.fastq.gz"
    output:
        bam = temp("{sample}.bam")
    shell:
        """
        bwa mem ref {input.fastq1} {input.fastq2} -o {output.bam}
        """

rule gatk_refdict:
    input:
        ref="ref.fasta"
    output:
        done = "ref.dict"
    shell:
        """
        java -jar gatk-package-4.1.9.0-local.jar CreateSequenceDictionary -R {input.ref} -O {output.done}
        """

rule gatk:
    input:
        ref = "ref.fasta",
        dict = "ref.dict",
        bam="bam_list"
    output:
        outf = "{chr}.vcf"
    shell:
        """
        java -jar gatk-package-4.1.9.0-local.jar HaplotypeCaller -L {wildcards.chr} -R {input.ref} -I {input.bam} --min-base-quality-score 20 -O {output.outf}
        """
The AmbiguousRuleException you get is because snakemake doesn't know which rule to run when two rules have the same output. Don't forget snakemake tries to build the DAG starting from the rule all. When it comes to run rule gatk, you define "ref" as an input. Since two rules can produce this file, snakemake does not know whether it must use rule gatk_refdict or rule bwa_index.
You have a typo there (an "_" that should not be there):

    v
rule_gatk_refdict:
    input:
        ref="ref.fasta"
    ...

snakemake running single jobs in parallel from all files in folder

My problem is related to Running parallel instances of a single job/rule on Snakemake, but I believe it is different.
I cannot create an all rule for it in advance because the folder of input files will be created by a previous rule and depends on the user's initial data.
Pseudocode:
    rule1: get a big file (OK)
    rule2: split the file in parts in a Split folder (OK)
    rule3: run a program on each file created in Split
I am now at rule3, with Split containing 70 files like:
    Split/file_001.fq
    Split/file_002.fq
    ..
    Split/file_069.fq
Could you please help me create a rule for pigz to compress the 70 files in parallel into 70 .gz files?
I am running with snakemake -j 24 ZipSplit.
config["pigt"] gives 4 threads for each compression job and I give 24 threads to snakemake, so I expect 6 parallel compressions, but my current rule merges the inputs into a single job instead of parallelizing!?
Should I build the full list of inputs in the rule? How?
# parallel job
files, = glob_wildcards("Split/{x}.fq")

rule ZipSplit:
    input: expand("Split/{x}.fq", x=files)
    threads: config["pigt"]
    shell:
        """
        pigz -k -p {threads} {input}
        """
I tried to define the input directly with
    input: glob_wildcards("Split/{x}.fq")
but a syntax error occurs.
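For what it's worth, glob_wildcards does not return a plain list of files; it returns a named tuple of wildcard value lists (hence the files, = unpacking), so it cannot be dropped directly into input:. A stdlib-only sketch of what files, = glob_wildcards("Split/{x}.fq") collects:

```python
import glob
import os

def fq_stems(directory):
    """Base names of all .fq files in a directory, minus the extension,
    mimicking: files, = glob_wildcards(directory + "/{x}.fq")"""
    return sorted(os.path.basename(p)[:-len('.fq')]
                  for p in glob.glob(os.path.join(directory, '*.fq')))

# fq_stems('Split') would give ['file_001', 'file_002', ..., 'file_069']
```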
# InSilico_PCR Snakefile
import os
import re
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider

HTTP = HTTPRemoteProvider()

# source config variables
configfile: "config.yaml"

# single job
rule GetRawData:
    input:
        HTTP.remote(os.path.join(config["host"], config["infile"]), keep_local=True, allow_redirects=True)
    output:
        os.path.join("RawData", config["infile"])
    run:
        shell("cp {input} {output}")

# single job
rule SplitFastq:
    input:
        os.path.join("RawData", config["infile"])
    params:
        lines_per_file = config["lines_per_file"]
    output:
        pfx = os.path.join("Split", config["infile"] + "_")
    shell:
        """
        zcat {input} | split --numeric-suffixes --additional-suffix=.fq -a 3 -l {params.lines_per_file} - {output.pfx}
        """

# parallel job
files, = glob_wildcards("Split/{x}.fq")

rule ZipSplit:
    input: expand("Split/{x}.fq", x=files)
    threads: config["pigt"]
    shell:
        """
        pigz -k -p {threads} {input}
        """
I think the example below should do it, using checkpoints as suggested by @Maarten-vd-Sande.
However, in your particular case of splitting a big file and compressing the output on the fly, you may be better off using the --filter option of split, as in:
    split -a 3 -d -l 4 --filter='gzip -c > $FILE.fastq.gz' bigfile.fastq split/
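To illustrate the --filter approach on a toy input (GNU split; bigfile.fastq and the 4-line chunk size are made up for the demo): split names each chunk, exposes the name to the filter as $FILE, and the filter compresses each chunk as it is written:

```shell
# Toy input: 8 lines = two 4-line "records"
printf 'l1\nl2\nl3\nl4\nl5\nl6\nl7\nl8\n' > bigfile.fastq
mkdir -p split
# Each 4-line chunk is piped through gzip; $FILE is e.g. split/000
split -a 3 -d -l 4 --filter='gzip -c > $FILE.fastq.gz' bigfile.fastq split/
ls split/   # 000.fastq.gz  001.fastq.gz
```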
The snakemake solution, assuming your input file is called bigfile.fastq: split and compressed output will be in directory splitting/bigfile/.

rule all:
    input:
        expand("{sample}.split.done", sample=['bigfile'])

checkpoint splitting:
    input:
        "{sample}.fastq"
    output:
        directory("splitting/{sample}")
    shell:
        r"""
        mkdir splitting/{wildcards.sample}
        split -a 3 -d --additional-suffix .fastq -l 4 {input} splitting/{wildcards.sample}/
        """

rule compress:
    input:
        "splitting/{sample}/{i}.fastq"
    output:
        "splitting/{sample}/{i}.fastq.gz"
    shell:
        r"""
        gzip -c {input} > {output}
        """

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.splitting.get(**wildcards).output[0]
    return expand("splitting/{sample}/{i}.fastq.gz",
                  sample=wildcards.sample,
                  i=glob_wildcards(os.path.join(checkpoint_output, "{i}.fastq")).i)

rule all_done:
    input:
        aggregate_input
    output:
        touch("{sample}.split.done")

rule not picked up by snakemake

I'm starting with snakemake. I managed to define some rules which I can run independently, but not in a workflow. Maybe the issue is that they have unrelated inputs and outputs.
My current workflow is like this:
configfile: './config.yaml'

rule all:
    input: dynamic("task/{job}/taskOutput.tab")

rule split_input:
    input: "input_fasta/snp.fa"
    output: dynamic("task/{job}/taskInput.fa")
    shell:
        "rm -Rf tasktmp task; \
        mkdir tasktmp task; \
        split -l 200 -d {input} ./tasktmp/; \
        ls tasktmp | awk '{{print \"mkdir task/\"$0}}' | sh; \
        ls tasktmp | awk '{{print \"mv ./tasktmp/\"$0\" ./task/\"$0\"/taskInput.fa\"}}' | sh"

rule task:
    input: "task/{job}/taskInput.fa"
    output: "task/{job}/taskOutput.tab"
    shell: "cp {input} {output}"

rule make_parameter_file:
    output:
        "par/parameters.txt"
    shell:
        "rm -Rf par; mkdir par; \
        echo \"\
        minimumFlankLength=5\n\
        maximumFlankLength=200\n\
        alignmentLengthDifference=2\n\
        allowedMismatch=4\n\
        allowedProxyMismatch=2\n\
        allowedIndel=3\n\
        ambiguitiesAsMatch=1\n\" \
        > par/parameters.txt"

rule build_target:
    input:
        "./my_target"
    output:
        touch("build_target.done")
    shell:
        "build_target -template format_nt -source {input} -target my_target"
If I call this as such:
    snakemake -p -s snakefile
the first three rules are executed, but the others are not.
I can run the last rule by specifying it as an argument:
    snakemake -p -s snakefile build_target
But I don't see how I can run them all.
Thanks a lot for any suggestion on how to solve this.
By default, snakemake executes only the first rule of a snakefile, here rule all. In order to produce rule all's input dynamic("task/{job}/taskOutput.tab"), it needs to run the two rules task and split_input, and so it does.
If you want the other rules to be run as well, you should put their output in rule all, e.g.:
rule all:
    input:
        dynamic("task/{job}/taskOutput.tab"),
        "par/parameters.txt",
        "build_target.done"