Snakemake RuleException: command works when run in the CLI, but fails when run by Snakemake

rule bcftools_call:
    input:
        fa="data/ucsc.hg19.fasta",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "bcftools mpileup -f {input.fa} {input.bam} |"
        " bcftools call -mv - > {output}"
When I run this rule, I get an error:
CalledProcessError in line 37 of /public/home/wangdl/duolin_work_dir/test/snakemake/snakefile.smk:
Command ' set -euo pipefail; bcftools mpileup -f data/ucsc.hg19.fasta sorted_reads/A.bam sorted_reads/B.bam | bcftools call -mv - > calls/all.vcf ' returned non-zero exit status 1.
But if I run this command in the CLI, it works fine:
set -euo pipefail; bcftools mpileup -f data/ucsc.hg19.fasta sorted_reads/A.bam sorted_reads/B.bam | bcftools call -mv - > calls/all.vcf
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 2 samples in 2 input files
So why does it fail under Snakemake?

For troubleshooting, try splitting the pipe into two commands and see whether either of them fails on its own. E.g. something like:
shell:
    r"""
    bcftools mpileup -f {input.fa} {input.bam} > {output}.tmp
    bcftools call -mv {output}.tmp > {output}
    rm {output}.tmp
    """

Related

Snakemake problem: Merge all files together with space delimiter instead of iterating through it

I was trying to run a command which ideally looks like this:
minimap2 -a -x map-ont -t 20 /staging/reference.fasta fastq/sample01.fastq | samtools view -bS -F 4 - | samtools sort -o fastq_minon/sample01.bam
I have multiple samples (like fastq/sample01.fastq) in the folder. The Snakemake file I wrote to automate this, however, passes all the files at once into a single command, like:
minimap2 -a -x map-ont -t 1 /staging/reference.fasta fastq/sample02.fastq fastq/sample03.fastq fastq/sample01.fastq | samtools view -bS -F 4 - | samtools sort -o fastq_minon/sample02.bam fastq_minon/sample03.bam fastq_minon/sample01.bam
I have pasted the code and logs below. Please help me figure out the mistake.
Code
SAMPLES, = glob_wildcards("fastq/{smp}.fastq")
rule minimap:
    input:
        expand("fastq/{smp}.fastq", smp=SAMPLES)
    output:
        expand("fastq_minon/{smp}.bam", smp=SAMPLES)
    params:
        ref = FASTA
    threads: 40
    shell:
        """
        minimap2 -a -x map-ont -t {threads} {params.ref} {input} | samtools view -bS -F 4 - | samtools sort -o {output}
        """
Log
Building DAG of jobs...
Job counts:
count jobs
1 minimap
1
[Tue May 5 03:28:50 2020]
rule minimap:
input: fastq/sample02.fastq, fastq/sample03.fastq, fastq/sample01.fastq
output: fastq_minon/sample02.bam, fastq_minon/sample03.bam, fastq_minon/sample01.bam
jobid: 0
minimap2 -a -x map-ont -t 1 /staging/reference.fasta fastq/sample02.fastq fastq/sample03.fastq fastq/sample01.fastq | samtools view -bS -F 4 - | samtools sort -o fastq_minon/sample02.bam fastq_minon/sample03.bam fastq_minon/sample01.bam
Job counts:
count jobs
1 minimap
1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
The expand function is used to create a list. Thus, in your rule minimap, you're telling Snakemake that you want all the fastq files as input and that the rule will produce as many bam files. What you want is a rule that is triggered once per sample, using a wildcard:
SAMPLES, = glob_wildcards("fastq/{smp}.fastq")

rule all:
    input: expand("fastq_minon/{smp}.bam", smp=SAMPLES)

rule minimap:
    input:
        "fastq/{smp}.fastq"
    output:
        "fastq_minon/{smp}.bam"
    params:
        ref = FASTA
    threads: 40
    shell:
        """
        minimap2 -a -x map-ont -t {threads} {params.ref} {input} | samtools view -bS -F 4 - | samtools sort -o {output}
        """
By listing all the wanted files in rule all, the rule minimap will be triggered as many times as necessary to create ONE bam file from ONE fastq file.
Have a look at my answer to this question to understand the use of wildcards and expand.
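To see the difference, note that expand is just a helper that returns a plain Python list, so a rule using it receives every file at once. A minimal illustration with two hypothetical sample names (expand is importable from snakemake.io):

from snakemake.io import expand
print(expand("fastq/{smp}.fastq", smp=["sample01", "sample02"]))
# ['fastq/sample01.fastq', 'fastq/sample02.fastq']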

Calling another pipeline within a snakefile results in missing output errors

I am using an assembly pipeline called Canu inside my Snakemake pipeline, but when it comes to the rule calling Canu, Snakemake exits with a MissingOutputException error. Canu submits multiple jobs to the cluster itself, so it seems Snakemake expects the output as soon as the first job has finished. Is there a way to avoid this? I know I could use a very long --latency-wait option, but that is not very optimal.
Snakefile code:
#!/miniconda/bin/python
workdir: config["path_to_files"]

wildcard_constraints:
    separator = config["separator"],
    sample = '|'.join(config["samples"])

rule all:
    input:
        expand("assembly-stats/{sample}_stats.txt", sample = config["samples"])

rule short_reads_QC:
    input:
        f"short_reads/{{sample}}_short{config['separator']}*.fq.gz"
    output:
        "fastQC-reports/{sample}.html"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        """
        mkdir fastqc-reports
        fastqc -o fastqc-reports {input}
        """

rule quallity_trimming:
    input:
        forward = f"short_reads/{{sample}}_short{config['separator']}1.fq.gz",
        reverse = f"short_reads/{{sample}}_short{config['separator']}2.fq.gz",
    output:
        forward = "cleaned_short-reads/{sample}_short_1-clean.fastq",
        reverse = "cleaned_short-reads/{sample}_short_2-clean.fastq"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "bbduk.sh -Xmx1g in1={input.forward} in2={input.reverse} out1={output.forward} out2={output.reverse} qtrim=rl trimq=10"

rule long_read_assembly:
    input:
        "long_reads/{sample}_long.fastq.gz"
    output:
        "canu-outputs/{sample}.subreads.contigs.fasta"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "canu -p {wildcards.sample} -d canu-outputs genomeSize=8m -pacbio-raw {input}"

rule short_read_alignment:
    input:
        short_read_fwd = "cleaned_short-reads/{sample}_short_1-clean.fastq",
        short_read_rvs = "cleaned_short-reads/{sample}_short_2-clean.fastq",
        reference = "canu-outputs/{sample}.subreads.contigs.fasta"
    output:
        "bwa-output/{sample}_short.bam"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "bwa mem {input.reference} {input.short_read_fwd} {input.short_read_rvs} | samtools view -S -b > {output}"

rule indexing_and_sorting:
    input:
        "bwa-output/{sample}_short.bam"
    output:
        "bwa-output/{sample}_short_sorted.bam"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "samtools sort {input} > {output}"

rule polishing:
    input:
        bam_files = "bwa-output/{sample}_short_sorted.bam",
        long_assembly = "canu-outputs/{sample}.subreads.contigs.fasta"
    output:
        "pilon-output/{sample}-improved.fasta"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "pilon --genome {input.long_assembly} --frags {input.bam_files} --output {output} --outdir pilon-output"

rule assembly_stats:
    input:
        "pilon-output/{sample}-improved.fasta"
    output:
        "assembly-stats/{sample}_stats.txt"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "stats.sh in={input} gc=assembly-stats/{wildcards.sample}/{wildcards.sample}_gc.csv gchist=assembly-stats/{wildcards.sample}/{wildcards.sample}_gchist.csv shist=assembly-stats/{wildcards.sample}/{wildcards.sample}_shist.csv > assembly-stats/{wildcards.sample}/{wildcards.sample}_stats.txt"
The exact error:
Waiting at most 60 seconds for missing files.
MissingOutputException in line 43 of /faststorage/home/lamma/scripts/hybrid_assembly/bacterial-hybrid-assembly.smk:
Missing files after 60 seconds:
canu-outputs/F19FTSEUHT1027.PSU4_ISF1A.subreads.contigs.fasta
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The snakemake command being used:
snakemake --latency-wait 60 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/faststorage/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output}' --cluster-config bacterial-hybrid-assembly-config.json --configfile yaml-config-files/test_experiment3.yaml --use-conda --snakefile bacterial-hybrid-assembly.smk
I surmise that Canu is giving you canu-outputs/{sample}.contigs.fasta, not canu-outputs/{sample}.subreads.contigs.fasta. If so, edit the canu command to be:
canu -p {wildcards.sample}.subreads ...
(By the way, I don't think #!/miniconda/bin/python is necessary).
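For clarity, this is how the assembly rule from the question would look with that change; a sketch in which only the -p prefix differs from the original rule:

rule long_read_assembly:
    input:
        "long_reads/{sample}_long.fastq.gz"
    output:
        "canu-outputs/{sample}.subreads.contigs.fasta"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "canu -p {wildcards.sample}.subreads -d canu-outputs genomeSize=8m -pacbio-raw {input}"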

Include Parameters and source code in Snakemake HTML Report

I want to include the shell command, as well as the source code of external scripts of Snakemake rules, in my HTML report (I have seen reports where these appear in the table of the RULE segment).
The example below is part of the Basic Example from the doc.
https://snakemake.readthedocs.io/en/stable/tutorial/basics.html
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

rule plot_quals:
    input:
        "calls/all.vcf"
    output:
        report("plots/quals.svg", caption="report/quals.rst")
    script:
        "scripts/plot-quals.py"
And the report should include something like
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
as well as the code of "plot-quals.py".
How can I accomplish that?

Snakemake basic issue

I tried to run Snakemake on my local computer. It didn't work even when I used the simplest code structure, like so:
rule fastqc_raw:
    input:
        "raw/A.fastq"
    output:
        "output/fastqc_raw/A.html"
    shell:
        "fastqc {input} -o {output} -t 4"
It displayed this error:
Error in rule fastqc_raw:
    jobid: 1
    output: output/fastqc_raw/A.html
RuleException:
CalledProcessError in line 13 of /Users/01/Desktop/Snakemake/Snakefile:
Command ' set -euo pipefail; fastqc raw/A.fastq -o output/fastqc_raw/A.html -t 4 ' returned non-zero exit status 2.
  File "/Users/01/Desktop/Snakemake/Snakefile", line 13, in __rule_fastqc_raw
  File "/Users/01/miniconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run
However, Snakemake did create a DAG file that looks normal, and when I used the "snakemake --np" command it didn't display any errors.
I also ran fastqc locally without Snakemake, using the same command, and it worked perfectly.
I hope someone can help me with this.
Thanks!!
It looks like Snakemake did its job. It ran the command:
fastqc raw/A.fastq -o output/fastqc_raw/A.html -t 4
But the command returned an error:
Command ' set -euo pipefail;
fastqc raw/A.fastq -o output/fastqc_raw/A.html -t 4 ' returned
non-zero exit status 2.
The next step in debugging is to run the fastqc command manually to see if it gives an error.
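For example, copy the command straight out of the error message and check its exit status; the paths are taken from the question:

fastqc raw/A.fastq -o output/fastqc_raw/A.html -t 4
echo $?  # anything other than 0 confirms that fastqc itself is failing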
I hope you have gotten an answer by now, but I had the exact same issue, so I will offer my solution.
The error is in the shell directive:
shell:
    "fastqc {input} -o {output} -t 4"
The FastQC -o flag expects an output directory, but you have given it an output file. Your code should be:
shell:
    "fastqc {input} -o output/fastqc_raw/ -t 4"
Your error comes from the fact that the output files were written to a different location (most likely the input directory), and rule all failed as a result.
Additionally, FastQC gives an error if the output directory does not already exist, so you will need to create it first.
It is strange, as I have seen Snakemake scripts with no -o flag in the fastqc shell command that worked fine, but I haven't been so lucky.
An additional note: I can see you're using 4 threads there with the -t 4 argument. You should declare this so that Snakemake gives the job 4 threads; otherwise I believe it will run with 1 thread and may fail due to lack of memory. This can be done like so:
rule fastqc_raw:
    input:
        "raw/A.fastq"
    output:
        "output/fastqc_raw/A.html"
    threads: 4
    shell:
        "fastqc {input} -o output/fastqc_raw/ -t {threads}"

rule not picked up by snakemake

I'm starting out with Snakemake. I managed to define some rules which I can run independently, but not in a workflow. Maybe the issue is that they have unrelated inputs and outputs.
My current workflow is like this:
configfile: './config.yaml'

rule all:
    input: dynamic("task/{job}/taskOutput.tab")

rule split_input:
    input: "input_fasta/snp.fa"
    output: dynamic("task/{job}/taskInput.fa")
    shell:
        "rm -Rf tasktmp task; \
        mkdir tasktmp task; \
        split -l 200 -d {input} ./tasktmp/; \
        ls tasktmp | awk '{{print \"mkdir task/\"$0}}' | sh; \
        ls tasktmp | awk '{{print \"mv ./tasktmp/\"$0\" ./task/\"$0\"/taskInput.fa\"}}' | sh"

rule task:
    input: "task/{job}/taskInput.fa"
    output: "task/{job}/taskOutput.tab"
    shell: "cp {input} {output}"

rule make_parameter_file:
    output:
        "par/parameters.txt"
    shell:
        "rm -Rf par; mkdir par; \
echo \"\
minimumFlankLength=5\n\
maximumFlankLength=200\n\
alignmentLengthDifference=2\n\
allowedMismatch=4\n\
allowedProxyMismatch=2\n\
allowedIndel=3\n\
ambiguitiesAsMatch=1\n\" \
> par/parameters.txt"

rule build_target:
    input:
        "./my_target"
    output:
        touch("build_target.done")
    shell:
        "build_target -template format_nt -source {input} -target my_target"
If I call this as such:
snakemake -p -s snakefile
The first three rules are executed; the others are not.
I can run the last rule by specifying it as an argument.
snakemake -p -s snakefile build_target
But I don't see how I can get everything to run in one go.
Thanks a lot for any suggestion on how to solve this.
By default, Snakemake executes only the first rule of the snakefile; here that is rule all. To produce rule all's input dynamic("task/{job}/taskOutput.tab"), it needs to run the two rules split_input and task, and so it does.
If you want the other rules to be run as well, you should put their outputs in rule all, e.g.:
rule all:
    input:
        dynamic("task/{job}/taskOutput.tab"),
        "par/parameters.txt",
        "build_target.done"