Snakemake Error with MissingOutputException - snakemake

I am trying to run STAR with Snakemake on a server.
My .smk file looks like this:
import pandas as pd
configfile: 'config.yaml'
# Read the sample-to-batch dataframe, mapping batch to sample (here with zip)
sample_to_batch = pd.read_csv("/mnt/DataArray1/users/zisis/STAR_mapping/snakemake_STAR_index/all_samples_test.csv", sep = '\t')
# Rule specifying the final output
rule all_STAR:
    input:
        #expand("{sample}/Aligned.sortedByCoord.out.bam", sample = sample_to_batch['sample'])
        expand(config['path_to_output']+"{sample}/Aligned.sortedByCoord.out.bam", sample = sample_to_batch['sample'])
rule STAR_align:
    # Specify input fastq files
    input:
        fq1 = config['path_to_output']+"{sample}_1.fastq.gz",
        fq2 = config['path_to_output']+"{sample}_2.fastq.gz"
    params:
        # Location of the indexed genome and location to save the output
        genome = directory(config['path_to_reference']+config['ref_assembly']+".STAR_index"),
        prefix_outdir = directory(config['path_to_output']+"{sample}/")
    threads: 12
    output:
        config['path_to_output']+"{sample}/Aligned.sortedByCoord.out.bam"
    log:
        config['path_to_output']+"logs/{sample}.log"
    message:
        "--- Mapping STAR---"
    shell:
        """
        STAR --runThreadN {threads} \
        --readFilesCommand zcat \
        --readFilesIn {input} \
        --genomeDir {params.genome} \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMunmapped Within \
        --outSAMattributes Standard
        """
STAR starts normally, but at the end I get this error:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 14 of /mnt/DataArray1/users/zisis/STAR_mapping/snakemake/STAR_snakefile_align.smk:
Job Missing files after 5 seconds:
/mnt/DataArray1/users/zisis/STAR_mapping/snakemake/001_T1/Aligned.sortedByCoord.out.bam
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 1 completed successfully, but some output files are missing. 1
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I tried --latency-wait, but it does not help.
To execute Snakemake I run the command
users/zisis/STAR_mapping/snakemake_STAR_index$ snakemake --snakefile STAR_new_snakefile.smk --cores all --printshellcmds
I am in my own directory with full access and permissions.
Could this be caused by a permissions problem when Snakemake runs or when it tries to create directories?
It creates the directory and the files, but I can see that there is a file named Aligned.sortedByCoord.out.bamAligned.sortedByCoord.out.bam.
Is this the problem?

I think your STAR command is missing the option that says which directory and file prefix to write to, so it is presumably writing the default filenames to the current directory. Try something like:
rule STAR_align:
    input: ...
    output: ...
    ...
    shell:
        r"""
        # STAR prepends the prefix verbatim to its file names, so keep the trailing slash
        outprefix=$(dirname {output})/
        STAR --outFileNamePrefix $outprefix \
        --runThreadN {threads} \
        etc...
        """
I am running the command from my directory, in which I am a sudo user.
I don't think that is the problem, but it is strongly recommended to work as a regular user and use sudo only in special circumstances (e.g. installing system-wide programs, and if you use conda you shouldn't need it even for that).
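The prefix handling suggested above can be checked outside Snakemake. The sketch below is plain Python for illustration only; the helper name star_out_prefix and the results/001_T1 path are hypothetical. It assumes that STAR prepends the value of --outFileNamePrefix verbatim to its fixed file names, which is why the trailing slash matters.

```python
import os

# Fixed name STAR gives to a coordinate-sorted BAM file.
STAR_BAM_NAME = "Aligned.sortedByCoord.out.bam"


def star_out_prefix(output_bam):
    """Hypothetical helper: derive the --outFileNamePrefix value from a declared output path."""
    # The trailing slash makes the prefix a directory rather than a filename stem.
    return os.path.dirname(output_bam) + "/"


declared = "results/001_T1/Aligned.sortedByCoord.out.bam"
prefix = star_out_prefix(declared)

# Prefix plus the fixed name must reproduce the path declared under output:
print(prefix + STAR_BAM_NAME == declared)
```

Without the trailing slash the file would land at results/001_T1Aligned.sortedByCoord.out.bam, and passing the full output path as the prefix would be consistent with the doubled file name mentioned in the question.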

Related

Snakemake - MissingInputException

My snakemake pipeline containing 31 rules is driving me crazy. It's a mapping and snp calling pipeline that uses BWA and HaplotypeCaller among others. I have created a conda environment for each rule, depending on the program used. My code is quite long and can be seen if needed at this address : https://github.com/ltalignani/SHAVE1
Concretely, when I want to build the DAG, snakemake tells me that the haplotype_caller rule doesn't have the reference genome as input. But it is in the file. Here is the concerned code:
rule haplotype_caller_gvcf:
    # Aim: Call germline SNPs and indels via local re-assembly of haplotypes
    # Use: gatk --java-options '-Xmx{MEM_GB}g' HaplotypeCaller \
    #      -R Homo_sapiens_assembly38.fasta \
    #      -I input.bam \
    #      -O output.g.vcf.gz \
    #      -ERC GVCF # Essential to GenotypeGVCFs: produce genotype likelihoods
    message:
        "HaplotypeCaller calling SNVs and Indels for {wildcards.sample} sample ({wildcards.aligner}-{wildcards.mincov})"
    conda:
        GATK4
    input:
        refpath = REFPATH,
        reference = REFERENCE,
        bam = "results/04_Variants/{sample}_{aligner}_{mincov}X_indel-qual.bam"
    output:
        gvcf="results/04_Variants/haplotypecaller/{sample}_{aligner}_{mincov}X_variant-call.g.vcf"
    log:
        "results/11_Reports/haplotypecaller/{sample}_{aligner}_{mincov}X_variant-call.log" # optional
    resources:
        mem_gb= MEM_GB,
    shell:
        "gatk HaplotypeCaller " # --java-options '-Xmx{resources.mem_gb}g'
        "-R {input.refpath}{input.reference} "
        "-I {input.bam} "
        "-O {output.gvcf} "
        "-ERC GVCF" # Essential to GenotypeGVCFs: produce genotype likelihoods
With the REFPATH and REFERENCE variables defined as follows in the snakefile header:
REFPATH = config["consensus"]["path"] # Path to reference genome
REFERENCE = config["consensus"]["reference"] # Genome reference sequence, in fasta format
And the config file in .yaml is like this:
consensus:
    reference: "GCA_018104305.1_AalbF3_genomic.fasta"
    path: "resources/genomes/" # Path to genome reference
When I ask for the DAG :
snakemake -s workflow/rules/shave.smk --dag | dot -Tpng > test.png
I get this error:
`MissingInputException in line 247 of /Users/loic/snakemake/short-read-alignment-vector-pipeline/workflow/rules/shave.smk:`
Missing input files for rule haplotype_caller_gvcf:
GCA_018104305.1_AalbF3_genomic.fasta
Here is the structure of the Snakemake project: (image not included)
I also tried snakemake --lint, but it reported no problems.
I've looked at your Github, and the folder resources/genomes/ only contains a file GCA_018104305.1_AalbF3_genomic.fasta.fai. Have you tried renaming this file to the expected input name GCA_018104305.1_AalbF3_genomic.fasta, e.g. get rid of the .fai extension?
Thanks for your answer,
The fasta is 1.47 GB. That's why it's not in the resources/genomes folder.
The .fai is the fasta index, necessary for some programs like GATK.
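One detail in the error message may be worth noting: the missing file is reported without its resources/genomes/ prefix. That is what one would expect if each input entry is resolved as a path of its own, so that reference = REFERENCE is looked up in the working directory. A minimal sketch of joining the two config values into a single path, in plain Python; the helper name reference_fasta is hypothetical and the dictionary mirrors the config shown above:

```python
import os

# Mirrors the consensus section of the YAML config shown in the question.
config = {
    "consensus": {
        "reference": "GCA_018104305.1_AalbF3_genomic.fasta",
        "path": "resources/genomes/",
    }
}


def reference_fasta(cfg):
    """Hypothetical helper: full path to the reference FASTA."""
    consensus = cfg["consensus"]
    return os.path.join(consensus["path"], consensus["reference"])


print(reference_fasta(config))
```

Declared as a single input (for example reference = reference_fasta(config)), the rule would depend on the file where it is expected to live. The FASTA itself still has to exist at that location, since only the .fai index is in the repository.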

Snakemake Megahit output issue

A few days ago I started using Snakemake for the first time. I am having an issue when I try to run the megahit rule in my pipeline.
It gives me the following error: "Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). ......"
So initially it runs and then crashes with the above error. I implemented the solution with the directory() option in my pipeline, but I think it's not good practice since, for various reasons, you can lose files without even knowing it.
Is there a way to run the rule without using directory()?
I would appreciate any help on the issue!
Thank you in advance.
sra = []
with open("run_ids") as f:
    for line in f:
        sra.append(line.strip())
rule all:
    input:
        expand("raw_reads/{sample}/{sample}.fastq", sample=sra),
        expand("trimmo/{sample}/{sample}.trimmed.fastq", sample=sra),
        expand("megahit/{sample}/final.contigs.fa", sample=sra)
rule download:
    output:
        "raw_reads/{sample}/{sample}.fastq"
    params:
        "--split-spot --skip-technical"
    log:
        "logs/fasterq-dump/{sample}.log"
    benchmark:
        "benchmarks/fastqdump/{sample}.fasterq-dump.benchmark.txt"
    threads: 8
    shell:
        """
        fasterq-dump {params} --outdir /home/raw_reads/{wildcards.sample} {wildcards.sample} -e {threads}
        """
rule trim:
    input:
        "raw_reads/{sample}/{sample}.fastq"
    output:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    params:
        "HEADCROP:15 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"
    log:
        "logs/trimmo/{sample}.log"
    benchmark:
        "benchmarks/trimmo/{sample}.trimmo.benchmark.txt"
    threads: 6
    shell:
        """
        trimmomatic SE -phred33 -threads {threads} {input} trimmo/{wildcards.sample}/{wildcards.sample}.trimmed.fastq {params}
        """
rule megahit:
    input:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    output:
        "megahit/{sample}/final.contigs.fa"
    params:
        "-m 0.7 -t"
    log:
        "logs/megahit/{sample}.log"
    benchmark:
        "benchmarks/megahit/{sample}.megahit.benchmark.txt"
    threads: 10
    shell:
        """
        megahit -r {input} -o {output} -t {threads}
        """
IMHO it is a bad design of the megahit software that it takes a directory as a parameter and outputs into a file in this directory with a hardcoded name. Flagging the filename with directory() doesn't solve the issue, as in this case what you expect to be a file with the .fa extension megahit treats as a directory. The rest of the pipeline is broken in this case.
But this issue can be solved in Snakemake like this:
rule megahit:
    input:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    output:
        "megahit/{sample}/final.contigs.fa"
    # ...
    shell:
        """
        megahit -r {input} -o megahit/{wildcards.sample} -t {threads}
        """
A better design of the megahit rule would look as follows:
rule megahit:
    input:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    output:
        out_dir = directory("megahit/{sample}/"),
        fasta = "megahit/{sample}/final.contigs.fa"
    log:
        "logs/megahit/{sample}.log"
    benchmark:
        "benchmarks/megahit/{sample}.megahit.benchmark.txt"
    threads:
        10
    shell:
        "megahit -r {input} -f -o {output.out_dir} -t {threads}"
This guarantees that the output directory is removed upon failure, while the -f argument to megahit tells it to ignore the fact that the output folder exists (it is created by Snakemake automatically because one of the outputs is a file inside it: final.contigs.fa).
BTW, the -m (--memory) parameter is best implemented as a resource. The only problem though is that snakemake's default resource, mem_mb is in megabytes. One workaround would be as follows:
    resources:
        mem_mb = mem_mb_limit_for_megahit # could be a fraction of a global constant
    params:
        mem_bytes = lambda w, resources: round(resources.mem_mb * 1e6)
    shell:
        "megahit ... -m {params.mem_bytes}"

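The unit conversion in the mem_bytes lambda of the megahit answer can be tried on its own. A minimal sketch in plain Python, assuming decimal megabytes (1 MB = 10^6 bytes) as in that snippet; the function name and the 16000 MB figure are illustrative only:

```python
def mem_mb_to_bytes(mem_mb):
    """Convert a Snakemake-style mem_mb value to bytes (decimal megabytes)."""
    return round(mem_mb * 1e6)


# Illustrative limit of 16000 MB.
example_bytes = mem_mb_to_bytes(16000)
print(example_bytes)
```

Because round() returns an integer, the value is formatted into the shell command as plain digits rather than in float notation.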
Does Snakemake support non-file input?

I get a MissingInputException when I run the following rule:
configfile: "Configs.yaml"
rule download_data_from_ZFIN:
    input:
        anatomy_item = config["ZFIN_url"]["anatomy_item"],
        xpat_stage_anatomy = config["ZFIN_url"]["xpat_stage_anatomy"],
        xpat_fish = config["ZFIN_url"]["xpat_fish"],
        anatomy_synonyms = config["ZFIN_url"]["anatomy_synonyms"]
    output:
        anatomy_item = os.path.join(os.getcwd(), config["download_data_from_ZFIN"]["dir"], "anatomy_item.tsv"),
        xpat_stage_anatomy = os.path.join(os.getcwd(), config["download_data_from_ZFIN"]["dir"], "xpat_stage_anatomy.tsv"),
        xpat_fish = os.path.join(os.getcwd(), config["download_data_from_ZFIN"]["dir"], "xpat_fish.tsv"),
        anatomy_synonyms = os.path.join(os.getcwd(), config["download_data_from_ZFIN"]["dir"], "anatomy_synonyms.tsv")
    shell:
        "wget -O {output.anatomy_item} {input.anatomy_item};" \
        "wget -O {output.anatomy_synonyms} {input.anatomy_synonyms};" \
        "wget -O {output.xpat_stage_anatomy} {input.xpat_stage_anatomy};" \
        "wget -O {output.xpat_fish} {input.xpat_fish};"
And this is the content of my configs.yaml file:
ZFIN_url:
    # Zebrafish Anatomy Term
    anatomy_item: "https://zfin.org/downloads/file/anatomy_item.txt"
    # Zebrafish Gene Expression by Stage and Anatomy Term
    xpat_stage_anatomy: "https://zfin.org/downloads/file/xpat_stage_anatomy.txt"
    # ZFIN Genes with Expression Assay Records
    xpat_fish: "https://zfin.org/downloads/file/xpat_fish.txt"
    # Zebrafish Anatomy Term Synonyms
    anatomy_synonyms: "https://zfin.org/downloads/file/anatomy_synonyms.txt"
download_data_from_ZFIN:
    dir: ZFIN_data
The error message is:
Building DAG of jobs...
MissingInputException in line 10 of /home/zhangdong/works/NGS/coevolution/snakemake/coevolution.rule:
Missing input files for rule download_data_from_ZFIN:
https://zfin.org/downloads/file/anatomy_item.txt
I want to confirm: is this exception caused by using non-file input (URLs) in the rule's input?
Note that you can also use remote files as input so you may avoid rule download_data_from_ZFIN altogether. E.g.:
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()
rule all:
    input:
        'output.txt',
rule one:
    input:
        # Some file from the web
        x= HTTP.remote('https://plasmodb.org/common/downloads/release-49/PbergheiANKA/txt/PlasmoDB-49_PbergheiANKA_CodonUsage.txt', keep_local=True)
    output:
        'output.txt',
    shell:
        r"""
        # Do something with the remote file
        head {input.x} > {output}
        """
The remote file will be downloaded and stored locally under plasmodb.org/common/.../PlasmoDB-49_PbergheiANKA_CodonUsage.txt
Many thanks @dariober, I tried the following code and it worked:
import os
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
configfile: "Configs.yaml"
HTTP = HTTPRemoteProvider()
rule all:
    input:
        expand(os.path.join(os.getcwd(),config["download_data_from_ZFIN"]["dir"],"{item}.tsv"),
               item=list(config["ZFIN_url"].keys()))
rule download_data_from_ZFIN:
    input:
        lambda wildcards: HTTP.remote(config["ZFIN_url"][wildcards.item], keep_local=True)
    output:
        os.path.join(os.getcwd(),config["download_data_from_ZFIN"]["dir"],"{item}.tsv")
    threads:
        1
    shell:
        "mv {input} {output}"
Such code is more Snakemake-like, but I have two further questions:
Is there a way to specify the output file name for the download? For now I use the mv command to achieve that.
Does this remote-file feature support parallel downloads? I tried the above code together with --cores 6, but it still downloads the files one by one.
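To see which targets the expand() call in rule all requests, the same mapping can be written in plain Python. This is only an illustration: the config dictionary is abbreviated to two of the four URLs, and zfin_targets is a hypothetical helper, not a Snakemake function.

```python
import os

# Abbreviated copy of the config shown in the question (two of the four URLs).
config = {
    "ZFIN_url": {
        "anatomy_item": "https://zfin.org/downloads/file/anatomy_item.txt",
        "xpat_fish": "https://zfin.org/downloads/file/xpat_fish.txt",
    },
    "download_data_from_ZFIN": {"dir": "ZFIN_data"},
}


def zfin_targets(cfg, base_dir):
    """Hypothetical helper: map each config key to the .tsv path requested for it."""
    out_dir = os.path.join(base_dir, cfg["download_data_from_ZFIN"]["dir"])
    return {item: os.path.join(out_dir, item + ".tsv") for item in cfg["ZFIN_url"]}


for item, path in zfin_targets(config, os.getcwd()).items():
    print(item, config["ZFIN_url"][item], "->", path)
```

Each key becomes one value of the {item} wildcard, so every URL corresponds to its own job of rule download_data_from_ZFIN.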

Snakemake how to execute downstream rules when an upstream rule fails

Apologies that the title is bad - I can't figure out how best to explain my issue in a few words. I'm having trouble dealing with downstream rules in snakemake when one of the rules fails. In the example below, rule spades fails on some samples. This is expected because some of my input files will have issues, spades will return an error, and the target file is not generated. This is fine until I get to rule eval_ani. Here I basically want to run this rule on all of the successful output of rule ani. But I'm not sure how to do this because I have effectively dropped some of my samples in rule spades. I think using snakemake checkpoints might be useful but I just can't figure out how to apply it from the documentation.
I'm also wondering if there is a way to re-run rule ani without re-running rule spades. Say I prematurely terminated my run, and rule ani didn't run on all the samples. Now I want to re-run my pipeline, but I don't want snakemake to try to re-run all the failed spades jobs because I already know they won't be useful to me and it would just waste resources. I tried -R and --allowed-rules but neither of these does what I want.
rule spades:
    input:
        read1=config["fastq_dir"]+"combined/{sample}_1_combined.fastq",
        read2=config["fastq_dir"]+"combined/{sample}_2_combined.fastq"
    output:
        contigs=config["spades_dir"]+"{sample}/contigs.fasta",
        scaffolds=config["spades_dir"]+"{sample}/scaffolds.fasta"
    log:
        config["log_dir"]+"spades/{sample}.log"
    threads: 8
    shell:
        """
        python3 {config[path_to_spades]} -1 {input.read1} -2 {input.read2} -t 16 --tmp-dir {config[temp_dir]}spades_test -o {config[spades_dir]}{wildcards.sample} --careful > {log} 2>&1
        """
rule ani:
    input:
        config["spades_dir"]+"{sample}/scaffolds.fasta"
    output:
        "fastANI_out/{sample}.txt"
    log:
        config["log_dir"]+"ani/{sample}.log"
    shell:
        """
        fastANI -q {input} --rl {config[reference_dir]}ref_list.txt -o fastANI_out/{wildcards.sample}.txt
        """
rule eval_ani:
    input:
        expand("fastANI_out/{sample}.txt", sample=samples)
    output:
        "ani_results.txt"
    log:
        config["log_dir"]+"eval_ani/{sample}.log"
    shell:
        """
        python3 ./bin/evaluate_ani.py {input} {output} > {log} 2>&1
        """
If I understand correctly, you want to allow spades to fail without stopping the whole pipeline, and you want to ignore the output files from the spades runs that failed. For this you could append || true to the command running spades to catch the non-zero exit status (so Snakemake will not stop). Then you could analyse the output of spades and write to a "flag" file whether that sample succeeded or not. So the rule spades would be something like:
rule spades:
    input:
        read1=config["fastq_dir"]+"combined/{sample}_1_combined.fastq",
        read2=config["fastq_dir"]+"combined/{sample}_2_combined.fastq"
    output:
        contigs=config["spades_dir"]+"{sample}/contigs.fasta",
        scaffolds=config["spades_dir"]+"{sample}/scaffolds.fasta",
        exit= config["spades_dir"]+'{sample}/exit.txt',
    log:
        config["log_dir"]+"spades/{sample}.log"
    threads: 8
    shell:
        """
        python3 {config[path_to_spades]} ... || true
        # ... code that writes to {output.exit} stating whether spades succeeded or not
        """
For the following steps, you use the flag files '{sample}/exit.txt' to decide which spades files should be used and which should be discarded. For example:
rule ani:
    input:
        spades= config["spades_dir"]+"{sample}/scaffolds.fasta",
        exit= config["spades_dir"]+'{sample}/exit.txt',
    output:
        "fastANI_out/{sample}.txt"
    log:
        config["log_dir"]+"ani/{sample}.log"
    shell:
        """
        # Assumes the flag file contains the word PASS when spades succeeded
        if grep -q 'PASS' {input.exit}; then
            fastANI -q {input.spades} --rl {config[reference_dir]}ref_list.txt -o fastANI_out/{wildcards.sample}.txt
        else
            touch {output}
        fi
        """
rule eval_ani:
    input:
        ani= expand("fastANI_out/{sample}.txt", sample=samples),
        exit= expand(config["spades_dir"]+'{sample}/exit.txt', sample= samples),
    output:
        "ani_results.txt"
    log:
        # No {sample} wildcard here, because the output of this rule has none
        config["log_dir"]+"eval_ani/eval_ani.log"
    shell:
        """
        # Parse list of file {input.exit} to decide which files in {input.ani} should be used
        python3 ./bin/evaluate_ani.py {input} {output} > {log} 2>&1
        """
EDIT (not tested) Instead of || true inside the shell directive it may be better to use the run directive and use python's subprocess to run the system commands that are allowed to fail. The reason is that || true will return 0 exit code no matter what error happened; the subprocess solution instead allows more precise handling of exceptions. E.g.
import subprocess  # near the top of the Snakefile
rule spades:
    input:
        ...
    output:
        ...
    run:
        cmd = "spades ..."
        p = subprocess.Popen(cmd, shell= True, stdout= subprocess.PIPE, stderr= subprocess.PIPE)
        stdout, stderr= p.communicate()
        if p.returncode == 0:
            print('OK')
        else:
            # Analyze exit code and stderr and decide what to do next
            print(p.returncode)
            print(stderr.decode())
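A self-contained sketch of how the two ideas in this answer could fit together: run a command that is allowed to fail, record the outcome in a flag file, and later keep only the results whose flag says PASS. The commands are stand-ins (Python one-liners exiting with status 0 and 3) so the example runs anywhere; run_and_flag, select_passed, the sample names and the PASS/FAIL convention are assumptions for illustration, not part of spades or Snakemake.

```python
import os
import subprocess
import sys
import tempfile


def run_and_flag(cmd, flag_path):
    """Run cmd and record PASS or FAIL plus the exit code in flag_path."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    status = "PASS" if result.returncode == 0 else "FAIL"
    with open(flag_path, "w") as handle:
        handle.write(f"{status}\t{result.returncode}\n")
    return result.returncode


def select_passed(files, flag_paths):
    """Keep only the files whose matching flag file starts with PASS."""
    kept = []
    for path, flag in zip(files, flag_paths):
        with open(flag) as handle:
            if handle.read().startswith("PASS"):
                kept.append(path)
    return kept


# Stand-in commands: one succeeds, one fails with exit status 3.
commands = {
    "s1": [sys.executable, "-c", "import sys; sys.exit(0)"],
    "s2": [sys.executable, "-c", "import sys; sys.exit(3)"],
}

with tempfile.TemporaryDirectory() as tmp:
    flags, codes = [], []
    for sample, cmd in commands.items():
        flag = os.path.join(tmp, sample + "_exit.txt")
        codes.append(run_and_flag(cmd, flag))
        flags.append(flag)
    ani_files = ["fastANI_out/s1.txt", "fastANI_out/s2.txt"]
    passed = select_passed(ani_files, flags)

print(codes, passed)
```

In a real rule, the declared contigs and scaffolds outputs would still have to exist for failed samples (for example as empty files), otherwise Snakemake reports them as missing.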

snakemake running single jobs in parallel from all files in folder

My problem is related to "Running parallel instances of a single job/rule on Snakemake", but I believe it is different.
I cannot create an all: rule for it in advance because the folder of input files is created by a previous rule and depends on the user's initial data.
Pseudocode:
rule1: get a big file (OK)
rule2: split the file in parts in Split folder (OK)
rule3: run a program on each file created in Split
I am now at rule3 with Split containing 70 files like
Split/file_001.fq
Split/file_002.fq
..
Split/file_069.fq
Could you please help me create a rule for pigz that compresses the 70 files in parallel into 70 .gz files?
I am running with snakemake -j 24 ZipSplit
config["pigt"] gives 4 threads to each compression job and I give 24 threads to Snakemake, so I expect 6 parallel compressions, but my current rule merges the inputs into one archive in a single job instead of parallelizing!?
Should I build the full list of inputs in the rule? How?
# parallel job
files, = glob_wildcards("Split/{x}.fq")
rule ZipSplit:
    input: expand("Split/{x}.fq", x=files)
    threads: config["pigt"]
    shell:
        """
        pigz -k -p {threads} {input}
        """
I tried to define the input directly with
input: glob_wildcards("Split/{x}.fq")
but a syntax error occurs.
# InSilico_PCR Snakefile
import os
import re
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()
# source config variables
configfile: "config.yaml"
# single job
rule GetRawData:
    input:
        HTTP.remote(os.path.join(config["host"], config["infile"]), keep_local=True, allow_redirects=True)
    output:
        os.path.join("RawData", config["infile"])
    run:
        shell("cp {input} {output}")
# single job
rule SplitFastq:
    input:
        os.path.join("RawData", config["infile"])
    params:
        lines_per_file = config["lines_per_file"]
    output:
        pfx = os.path.join("Split", config["infile"] + "_")
    shell:
        """
        zcat {input} | split --numeric-suffixes --additional-suffix=.fq -a 3 -l {params.lines_per_file} - {output.pfx}
        """
# parallel job
files, = glob_wildcards("Split/{x}.fq")
rule ZipSplit:
    input: expand("Split/{x}.fq", x=files)
    threads: config["pigt"]
    shell:
        """
        pigz -k -p {threads} {input}
        """
I think the example below should do it, using checkpoints as suggested by @Maarten-vd-Sande.
However, in your particular case of splitting a big file and compressing the output on the fly, you may be better off using the --filter option of split, as in
split -a 3 -d -l 4 --filter='gzip -c > $FILE.fastq.gz' bigfile.fastq split/
The Snakemake solution below assumes your input file is called bigfile.fastq; the split and compressed output will be in the directory ./splitting/bigfile/
rule all:
    input:
        expand("{sample}.split.done", sample= ['bigfile']),
checkpoint splitting:
    input:
        "{sample}.fastq"
    output:
        directory("splitting/{sample}")
    shell:
        r"""
        mkdir splitting/{wildcards.sample}
        split -a 3 -d --additional-suffix .fastq -l 4 {input} splitting/{wildcards.sample}/
        """
rule compress:
    input:
        "splitting/{sample}/{i}.fastq",
    output:
        "splitting/{sample}/{i}.fastq.gz",
    shell:
        r"""
        gzip -c {input} > {output}
        """
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.splitting.get(**wildcards).output[0]
    return expand("splitting/{sample}/{i}.fastq.gz",
                  sample=wildcards.sample,
                  i=glob_wildcards(os.path.join(checkpoint_output, "{i}.fastq")).i)
rule all_done:
    input:
        aggregate_input
    output:
        touch("{sample}.split.done")
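What aggregate_input computes once the checkpoint has finished can be imitated without Snakemake. The sketch below uses pathlib in place of glob_wildcards and a temporary directory in place of the real splitting/ folder; gz_targets is a hypothetical name, and the chunk names 000 to 002 are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def gz_targets(checkpoint_dir):
    """List the .fastq.gz targets implied by the .fastq chunks found in checkpoint_dir."""
    chunk_dir = Path(checkpoint_dir)
    # Strip the ".fastq" suffix to recover the value of the {i} wildcard.
    stems = sorted(p.name[: -len(".fastq")] for p in chunk_dir.glob("*.fastq"))
    return [str(chunk_dir / f"{stem}.fastq.gz") for stem in stems]


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "splitting" / "bigfile"
    out_dir.mkdir(parents=True)
    # Pretend the checkpoint produced three chunks.
    for i in range(3):
        (out_dir / f"{i:03d}.fastq").touch()
    targets = gz_targets(out_dir)
    target_names = [Path(t).name for t in targets]

print(target_names)
```

Each returned path matches the output pattern of rule compress, so Snakemake can schedule one compress job per chunk and run them in parallel up to the -j limit.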