Calling another pipeline within a snakefile results in missing output errors - snakemake

I am using an assembly pipeline called Canu inside my snakemake pipeline, but when it comes to the rule calling Canu, snakemake exits with a MissingOutputException error. Canu submits multiple jobs to the cluster itself, so it seems snakemake expects the output as soon as the first of those jobs has finished. Is there a way to avoid this? I know I could use a very long --latency-wait value, but that is not optimal.
snakefile code:
#!/miniconda/bin/python
workdir: config["path_to_files"]

wildcard_constraints:
    separator = config["separator"],
    sample = '|'.join(config["samples"])

rule all:
    input:
        expand("assembly-stats/{sample}_stats.txt", sample = config["samples"])

rule short_reads_QC:
    input:
        f"short_reads/{{sample}}_short{config['separator']}*.fq.gz"
    output:
        "fastQC-reports/{sample}.html"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        """
        mkdir fastqc-reports
        fastqc -o fastqc-reports {input}
        """

rule quallity_trimming:
    input:
        forward = f"short_reads/{{sample}}_short{config['separator']}1.fq.gz",
        reverse = f"short_reads/{{sample}}_short{config['separator']}2.fq.gz",
    output:
        forward = "cleaned_short-reads/{sample}_short_1-clean.fastq",
        reverse = "cleaned_short-reads/{sample}_short_2-clean.fastq"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "bbduk.sh -Xmx1g in1={input.forward} in2={input.reverse} out1={output.forward} out2={output.reverse} qtrim=rl trimq=10"

rule long_read_assembly:
    input:
        "long_reads/{sample}_long.fastq.gz"
    output:
        "canu-outputs/{sample}.subreads.contigs.fasta"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "canu -p {wildcards.sample} -d canu-outputs genomeSize=8m -pacbio-raw {input}"

rule short_read_alignment:
    input:
        short_read_fwd = "cleaned_short-reads/{sample}_short_1-clean.fastq",
        short_read_rvs = "cleaned_short-reads/{sample}_short_2-clean.fastq",
        reference = "canu-outputs/{sample}.subreads.contigs.fasta"
    output:
        "bwa-output/{sample}_short.bam"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "bwa mem {input.reference} {input.short_read_fwd} {input.short_read_rvs} | samtools view -S -b > {output}"

rule indexing_and_sorting:
    input:
        "bwa-output/{sample}_short.bam"
    output:
        "bwa-output/{sample}_short_sorted.bam"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "samtools sort {input} > {output}"

rule polishing:
    input:
        bam_files = "bwa-output/{sample}_short_sorted.bam",
        long_assembly = "canu-outputs/{sample}.subreads.contigs.fasta"
    output:
        "pilon-output/{sample}-improved.fasta"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "pilon --genome {input.long_assembly} --frags {input.bam_files} --output {output} --outdir pilon-output"

rule assembly_stats:
    input:
        "pilon-output/{sample}-improved.fasta"
    output:
        "assembly-stats/{sample}_stats.txt"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "stats.sh in={input} gc=assembly-stats/{wildcards.sample}/{wildcards.sample}_gc.csv gchist=assembly-stats/{wildcards.sample}/{wildcards.sample}_gchist.csv shist=assembly-stats/{wildcards.sample}/{wildcards.sample}_shist.csv > assembly-stats/{wildcards.sample}/{wildcards.sample}_stats.txt"
The exact error:
Waiting at most 60 seconds for missing files.
MissingOutputException in line 43 of /faststorage/home/lamma/scripts/hybrid_assembly/bacterial-hybrid-assembly.smk:
Missing files after 60 seconds:
canu-outputs/F19FTSEUHT1027.PSU4_ISF1A.subreads.contigs.fasta
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The snakemake command being used:
snakemake --latency-wait 60 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/faststorage/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output}' --cluster-config bacterial-hybrid-assembly-config.json --configfile yaml-config-files/test_experiment3.yaml --use-conda --snakefile bacterial-hybrid-assembly.smk

I surmise that canu is giving you canu-outputs/{sample}.contigs.fasta, not canu-outputs/{sample}.subreads.contigs.fasta. If so, edit the canu command to be:

canu -p {wildcards.sample}.subreads ...
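For clarity, here is a minimal sketch of the corrected rule, keeping everything else from the question unchanged:

rule long_read_assembly:
    input:
        "long_reads/{sample}_long.fastq.gz"
    output:
        "canu-outputs/{sample}.subreads.contigs.fasta"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "canu -p {wildcards.sample}.subreads -d canu-outputs genomeSize=8m -pacbio-raw {input}"

With -p {wildcards.sample}.subreads, canu names its assembly {wildcards.sample}.subreads.contigs.fasta, which matches the declared output.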
(By the way, I don't think the #!/miniconda/bin/python shebang is necessary.)

Related

Running different snakemake rules in parallel

I show below a pseudocode version of my snakefile. Snakemake rule A creates the input files for rule B2, and I would like to run rules B1 and B2 at the same time, but I am not having success. I can run this snakefile on very small data without a problem (although rules B1 and B2 do not run in parallel), but once I give it larger data it fails to create the output of rule B1. The commands in rules B1 and B2 use the same program but with different arguments and input files, so I didn't think they should be in the same rule.
rule all:
    input: file_A_out, file_B1_out, file_B2_out, file_C_out

rule A:
    input: file_A_in
    output: file_A_out
    log: file_A_log
    shell: 'progA {input} --output {output}'

rule B1:
    input: file_B1_in
    output: file_B1_out
    group: 'groupB'
    log: file_B1_log
    shell: 'progB {input} -x 100 -o {output}'

rule B2:
    input: file_A_out
    output: file_B2_out
    group: 'groupB'
    log: file_B2_log
    shell: 'progB {input} -x 1 --y -o {output}'

rule C:
    input: file_B1_out, file_B2_out
    output: file_C_out
    log: file_C_log
    shell: 'progC {input[0]} {input[1]} -o {output}'
I thought using group to group the rules would indicate to Snakemake that the two rules can be run at once. To execute snakemake I run nohup snakemake --cores 16 > log.txt 2>&1 &; however, it only successfully runs rule B2, while the output of rule B1 is deemed corrupted. I have seen solutions on running one rule in parallel, but what about running different rules in parallel?
Error in rule B1:
    jobid: 2
    input: 'file_B1_in'
    output: 'file_B1_out'
    log: 'file_B1_log' (check log file(s) for error details)
    shell:
        'progB {input} -x 100 -o {output}'
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job B1 since they might be corrupted:
file_B1_out
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The snakefile below runs rules A, B1, and B2 in parallel and then runs rule C, as expected. Maybe there is something you are not showing us? Note also that group: does not force rules to run in parallel: it bundles the grouped jobs into a single submission when executing on a cluster. With local execution, independent jobs already run in parallel whenever enough cores are available.
# Make dummy input files
touch file_A_in file_B1_in
# Run pipeline
snakemake -p -j 10
The snakefile:
rule all:
    input: 'file_A_out', 'file_B1_out', 'file_B2_out', 'file_C_out'

rule A:
    input: 'file_A_in'
    output: 'file_A_out'
    shell: 'sleep 10; echo {input} > {output}'

rule B1:
    input: 'file_B1_in'
    output: 'file_B1_out'
    shell: 'sleep 10; echo {input} > {output}'

rule B2:
    input: 'file_A_in'
    output: 'file_B2_out'
    shell: 'sleep 10; echo {input} > {output}'

rule C:
    input: 'file_B1_out', 'file_B2_out'
    output: 'file_C_out'
    shell: 'sleep 10; echo {input[0]} {input[1]} > {output}'

Snakemake Megahit output issue

A few days ago I started using Snakemake for the first time, and I am having an issue when I try to run the megahit rule in my pipeline.
It gives me the following error: "Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). ......"
So initially it runs and then crashes with the above error. I implemented the solution with the directory() option in my pipeline, but I think it is not good practice since, for various reasons, you can lose files without even knowing it.
Is there a way to run the rule without using directory()?
I would appreciate any help on the issue!
Thanks in advance.
sra = []
with open("run_ids") as f:
    for line in f:
        sra.append(line.strip())

rule all:
    input:
        expand("raw_reads/{sample}/{sample}.fastq", sample=sra),
        expand("trimmo/{sample}/{sample}.trimmed.fastq", sample=sra),
        expand("megahit/{sample}/final.contigs.fa", sample=sra)

rule download:
    output:
        "raw_reads/{sample}/{sample}.fastq"
    params:
        "--split-spot --skip-technical"
    log:
        "logs/fasterq-dump/{sample}.log"
    benchmark:
        "benchmarks/fastqdump/{sample}.fasterq-dump.benchmark.txt"
    threads: 8
    shell:
        """
        fasterq-dump {params} --outdir /home/raw_reads/{wildcards.sample} {wildcards.sample} -e {threads}
        """

rule trim:
    input:
        "raw_reads/{sample}/{sample}.fastq"
    output:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    params:
        "HEADCROP:15 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"
    log:
        "logs/trimmo/{sample}.log"
    benchmark:
        "benchmarks/trimmo/{sample}.trimmo.benchmark.txt"
    threads: 6
    shell:
        """
        trimmomatic SE -phred33 -threads {threads} {input} trimmo/{wildcards.sample}/{wildcards.sample}.trimmed.fastq {params}
        """

rule megahit:
    input:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    output:
        "megahit/{sample}/final.contigs.fa"
    params:
        "-m 0.7 -t"
    log:
        "logs/megahit/{sample}.log"
    benchmark:
        "benchmarks/megahit/{sample}.megahit.benchmark.txt"
    threads: 10
    shell:
        """
        megahit -r {input} -o {output} -t {threads}
        """
IMHO it is bad design of the megahit software that it takes a directory as a parameter and writes into a file with a hardcoded name inside that directory. Flagging the filename with directory() doesn't solve the issue: in that case what you expect to be a file with the .fa extension megahit treats as a directory, and the rest of the pipeline breaks.
But the issue can be solved in Snakemake like this:
rule megahit:
    input:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    output:
        "megahit/{sample}/final.contigs.fa"
    # ...
    shell:
        """
        megahit -r {input} -o megahit/{wildcards.sample} -t {threads}
        """
A better design of the megahit rule would look as follows:
rule megahit:
    input:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    output:
        out_dir = directory("megahit/{sample}/"),
        fasta = "megahit/{sample}/final.contigs.fa"
    log:
        "logs/megahit/{sample}.log"
    benchmark:
        "benchmarks/megahit/{sample}.megahit.benchmark.txt"
    threads: 10
    shell:
        "megahit -r {input} -f -o {output.out_dir} -t {threads}"
This guarantees that the output directory is removed upon failure, while the -f argument tells megahit to ignore the fact that the output folder already exists (it is created by Snakemake automatically because one of the outputs, final.contigs.fa, is a file inside it).
By the way, the -m (--memory) parameter is best implemented as a resource. The only problem is that snakemake's default resource, mem_mb, is in megabytes, while megahit expects bytes. One workaround would be as follows:
resources:
    mem_mb = mem_mb_limit_for_megahit  # could be a fraction of a global constant
params:
    mem_bytes = lambda w, resources: round(resources.mem_mb * 1e6)
shell:
    "megahit ... -m {params.mem_bytes}"

Command not found error in snakemake pipeline despite the package existing in the conda environment

I am getting the following error in the snakemake pipeline:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       long_read_assembly
        1
[Wed Jan 15 11:35:18 2020]
rule long_read_assembly:
    input: long_reads/F19FTSEUHT1027.PSU4_ISF1A_long.fastq.gz
    output: canu-outputs/F19FTSEUHT1027.PSU4_ISF1A.subreads.contigs.fasta
    jobid: 0
    wildcards: sample=F19FTSEUHT1027.PSU4_ISF1A
/usr/bin/bash: canu: command not found
[Wed Jan 15 11:35:18 2020]
Error in rule long_read_assembly:
    jobid: 0
    output: canu-outputs/F19FTSEUHT1027.PSU4_ISF1A.subreads.contigs.fasta
    shell:
        canu -p F19FTSEUHT1027.PSU4_ISF1A -d canu-outputs genomeSize=8m -pacbio-raw long_reads/F19FTSEUHT1027.PSU4_ISF1A_long.fastq.gz
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I assume this means that the command canu cannot be found. But the canu package does exist inside the conda environment:
(hybrid_assembly) [lamma@fe1 Assembly]$ conda list | grep canu
canu                      1.9             he1b5a44_0    bioconda
The snakefile looks like this:
workdir: config["path_to_files"]

wildcard_constraints:
    separator = config["separator"],
    sample = '|'.join(config["samples"])

rule all:
    input:
        expand("assembly-stats/{sample}_stats.txt", sample = config["samples"])

rule short_reads_QC:
    input:
        f"short_reads/{{sample}}_short{config['separator']}*.fq.gz"
    output:
        "fastQC-reports/{sample}.html"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        """
        mkdir fastqc-reports
        fastqc -o fastqc-reports {input}
        """

rule quallity_trimming:
    input:
        forward = f"short_reads/{{sample}}_short{config['separator']}1.fq.gz",
        reverse = f"short_reads/{{sample}}_short{config['separator']}2.fq.gz",
    output:
        forward = "cleaned_short-reads/{sample}_short_1-clean.fastq",
        reverse = "cleaned_short-reads/{sample}_short_2-clean.fastq"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "bbduk.sh -Xmx1g in1={input.forward} in2={input.reverse} out1={output.forward} out2={output.reverse} qtrim=rl trimq=10"

rule long_read_assembly:
    input:
        "long_reads/{sample}_long.fastq.gz"
    output:
        "canu-outputs/{sample}.subreads.contigs.fasta"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "canu -p {wildcards.sample} -d canu-outputs genomeSize=8m -pacbio-raw {input}"

rule short_read_alignment:
    input:
        short_read_fwd = "cleaned_short-reads/{sample}_short_1-clean.fastq",
        short_read_rvs = "cleaned_short-reads/{sample}_short_2-clean.fastq",
        reference = "canu-outputs/{sample}.subreads.contigs.fasta"
    output:
        "bwa-output/{sample}_short.bam"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "bwa mem {input.reference} {input.short_read_fwd} {input.short_read_rvs} | samtools view -S -b > {output}"

rule indexing_and_sorting:
    input:
        "bwa-output/{sample}_short.bam"
    output:
        "bwa-output/{sample}_short_sorted.bam"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "samtools sort {input} > {output}"

rule polishing:
    input:
        bam_files = "bwa-output/{sample}_short_sorted.bam",
        long_assembly = "canu-outputs/{sample}.subreads.contigs.fasta"
    output:
        "pilon-output/{sample}-improved.fasta"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "pilon --genome {input.long_assembly} --frags {input.bam_files} --output {output} --outdir pilon-output"

rule assembly_stats:
    input:
        "pilon-output/{sample}-improved.fasta"
    output:
        "assembly-stats/{sample}_stats.txt"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "stats.sh in={input} gc=assembly-stats/{wildcards.sample}/{wildcards.sample}_gc.csv gchist=assembly-stats/{wildcards.sample}/{wildcards.sample}_gchist.csv shist=assembly-stats/{wildcards.sample}/{wildcards.sample}_shist.csv > assembly-stats/{wildcards.sample}/{wildcards.sample}_stats.txt"
The rule calling canu has the correct syntax as far as I am aware, so I am not sure what is causing this error.
Edit:
Adding the snakemake command
snakemake --latency-wait 60 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/faststorage/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output} --wait --parsable' --cluster-config bacterial-hybrid-assembly-config.json --configfile yaml-config-files/test_experiment3.yaml --snakefile bacterial-hybrid-assembly.smk
When running a snakemake workflow, if certain rules are to be run within a rule-specific conda environment, the command line call should be of the form
snakemake [... various options ...] --use-conda [--conda-prefix <some-directory>]
If you don't tell snakemake to use conda, all the conda: <some_path> entries in your rules are ignored, and the rules are run in whatever environment is currently activated.
The --conda-prefix <dir> is optional, but tells snakemake where to find the installed environments. If you don't specify it, the conda envs will be installed within the .snakemake folder, meaning that the .snakemake folder can get pretty huge and that the .snakemake folders of multiple projects may contain a lot of duplicated conda stuff.
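Concretely, the command from the question only needs --use-conda added; everything else can stay as it is:

snakemake --latency-wait 60 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/faststorage/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output} --wait --parsable' --cluster-config bacterial-hybrid-assembly-config.json --configfile yaml-config-files/test_experiment3.yaml --use-conda --snakefile bacterial-hybrid-assembly.smk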

ChildIOException: error in snakemake after running flye

I have an issue when I run other programs after flye in my snakemake pipeline, because the output from flye is a directory. My rules are as follows:
samples, = glob_wildcards("data/samples/{sample}.fastq")

rule all:
    input:
        [f"assembled/" for sample in samples],
        [f"nanopolish/draft.fa" for sample in samples],
        [f"nanopolish/reads.sorted.bam" for sample in samples],
        [f"nanopolish/reads.indexed.sorted.bam" for sample in samples]

rule fly:
    input:
        "unzipped/read.fastq"
    output:
        directory("assembled/")
    conda:
        "envs/flye.yaml"
    shell:
        "flye --nano-corr {input} --genome-size 5m --out-dir {output}"

rule bwa:
    input:
        "assembled/assembly.fasta"
    output:
        "nanopolish/draft.fa"
    conda:
        "nanopolish.yaml"
    shell:
        "bwa index {input} {output}"

rule nanopolish:
    input:
        "nanopolish/draft.fa",
        "zipped/zipped.gz"
    output:
        "nanopolish/reads.sorted.bam"
    conda:
        "nanopolish.yaml"
    shell:
        "bwa mem -x ont2d -t 8 {input} | samtools sort -o {output}"
There are a few steps before this, but they work just fine. When I run this, it gives the following error:
ChildIOException:
File/directory is a child to another output:
/home/fronglesquad/snakemake_poging_1/assembled
/home/fronglesquad/snakemake_poging_1/assembled/assembly.fasta
I have googled the error. All I could find is that snakemake doesn't work well with output directories. But this tool needs an output directory to work. Does anyone know how to bypass this?
(I think) the problem lies somewhere else in your code. You have defined two rules: the first outputs the directory assembled, the second outputs assembled/assembly.fasta. Since the output of the second rule is a child of the directory assembled, Snakemake complains. You can solve it by using the directory as input:
rule second:
    input:
        "assembled"
    output:
        ...
    shell:
        "cat {input}/assembly.fasta > {output}"
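Applied to the bwa rule from the question, that would look something like the sketch below. Note that bwa index takes only the fasta to index and no separate output argument, so this sketch copies the assembly out of the flye directory first and then indexes the copy; this is just one way to wire it up:

rule bwa:
    input:
        "assembled"  # the directory produced by rule fly
    output:
        "nanopolish/draft.fa"
    conda:
        "nanopolish.yaml"
    shell:
        """
        # copy the assembly out of the flye output directory, then index it
        cp {input}/assembly.fasta {output}
        bwa index {output}
        """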

ImproperOutputException in snakemake 5.5.4

I am new to snakemake, and recently I encountered an error which I get only with very recent versions of snakemake. This is my snakemake rule:
rule fastqc_raw:
    input:
        expand(directory("{fastqc_dir}/{samples}_fastqc/"), fastqc_dir = FASTQC_DIR, samples = SAMPLES_wo_extension)

rule do_fastqc_raw:
    input:
        expand("{fastq_dir}/{{samples}}.fastq.gz", fastq_dir = inputDir)
    output:
        expand(directory("{fastqc_dir}/{{samples}}_fastqc/"), fastqc_dir = FASTQC_DIR)
    log:
        expand("{fastqc_dir}/{{samples}}.log", fastqc_dir = FASTQC_DIR)
    threads: 10
    message:
        "performing fastQC of the sample : {wildcards.samples}"
    shell:
        """mkdir -p {output} && fastqc -q -t {threads} --outdir {output} --contaminants /home/Contaminants/idot.txt {input} 2> {log}"""
I get the following error, which I didn't get when using snakemake versions lower than 5.2.0:
ImproperOutputException in line 10 of /home/FASTQC.snakefile:
Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule do_fastqc_raw:
I think you have directory() in the wrong place in rule do_fastqc_raw.
This:

output:
    expand(directory("{fastqc_dir}/{{samples}}_fastqc/"), fastqc_dir = FASTQC_DIR)

should be:

output:
    directory(expand("{fastqc_dir}/{{samples}}_fastqc/", fastqc_dir = FASTQC_DIR))

(I believe this is because expand() performs string formatting and returns plain strings, so a directory() flag applied to the pattern inside expand() is lost; the flag has to wrap the expanded list instead.)