Input vs Output: Job completed successfully, but some output files are missing - snakemake

My pipeline is failing I believe due to a conflict between the expected out put of rule all vs the actual final output. I believe snakemake is waiting for the file kma/{sample} without an extension to appear instead it is getting a directory that has multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz") and I am having trouble getting them to play well.
configfile: "config.yaml"
rule all:
input:
expand("kma/{sample}", sample = config["samples"])
#multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz", sample = config["samples"])
rule seqtk_qualtiy_filter:
input:
lambda wildcards: "S5_Raw/" + config["samples"][wildcards.sample]
output:
temp("qtrim/{sample}.qtrim.fq")
shell:
"seqtk trimfq -b 0.01 {input} > {output}"
rule seqtk_clip:
input:
"qtrim/{sample}.qtrim.fq"
output:
temp("clip/{sample}.clip.fq")
shell:
"seqtk trimfq -b20 -L 350 {input} > {output}"
rule bbnorm:
input:
"clip/{sample}.clip.fq"
output:
"S5_processed/{sample}.norm.fq"
shell:
"bbnorm.sh in={input} out={output} target=100"
rule kma_map:
input:
"S5_processed/{sample}.norm.fq"
params:
ref = "ref/consensus.fasta"
output:
directory("kma/{sample}")
#multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz")
shell:
"kma -i {input} -t_db {params.ref} -o {output}"
The error if you run it the way that kma would like to see the handling done
Waiting at most 5 seconds for missing files.
MissingOutputException in line 33 of /home/sean/Desktop/reo/antisera project/ReovirusS1AmpliconS5.smk:
Job completed successfully, but some output files are missing. Missing files after 5 seconds:
kma/BA8359-19
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
File "/home/sean/.local/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 544, in handle_job_success
File "/home/sean/.local/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 231, in handle_job_success
Shutting down, this might take some time.
I have tried increasing the latency time as well, however since the expected file is never actually created it doesn't matter how long you wait.
the error you receive if you using the multiext function
Error in rule kma_map:
jobid: 11
output: kma/BA8359-19.res, kma/BA8359-19.aln, kma/BA8359-19.fsa, kma/BA8359-19.frag.gz
shell:
kma -i S5_processed/BA8359-19.norm.fq -t_db ref/consensus.fasta -o kma/BA8359-19.res kma/BA8359-19.aln kma/BA8359-19.fsa kma/BA8359-19.frag.gz
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
This error happens, I believe, because snakemake lists all the file types in the -o flag and kma says no.

There may be an inconsistency between the input files in rule all and the output files in rule kma_map. Showing the error you get would help.
Anyway, the command multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz", sample = config["samples"]) that you seem you have tried should not compile since multiext does not accept named arguments. Maybe what you want is this:
rule all:
input:
expand("kma/{sample}{ext}", sample= config['samples'], ext= [".res", ".aln", ".fsa", ".gz"]),
rule kma_map:
...
output:
multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz")
shell:
"kmap ... {input} {output}"

EDIT 21SEP2020
I have figured out the correct way to run this application.
first I need the rule all like this
KMA_OUTPUTS = [".res",".aln",".fsa",".frag.gz"]
rule all:
input:
expand("kma/{sample}{ext}", sample = config["samples"], ext = KMA_OUTPUTS)
And then the rule for kma like this
rule kma_map:
input:
"S5_processed/{sample}.norm.fq"
params:
ref = "ref/consensus.fasta",
prefix = "kma/{sample}"
output:
multiext("kma/{sample}", ".res", ".aln", ".fsa", ".frag.gz")
shell:
"kma -i {input} -t_db {params.ref} -o {params.prefix}"
by using the parama.prefix as the output file input I can get the desired output.

Related

Snakemake: MissingInputException with inconsistent naming scheme

I am trying to process MinION cDNA amplicons using Porechop with Minimap2 and I am getting this error.
MissingInputException in line 16 of /home/sean/Desktop/reo/antisera project/20200813/MinIONAmplicon.smk:
Missing input files for rule minimap2:
8413_19_strict/BC01.fastq.g
I understand what the error telling me, I just understand why its being its not trying to make the rule before it. Porechop is being used to check for all the possible barcodes and will output more than one fastq file if it finds more than barcode in the directory. However since I know what barcode I am looking for I made a barcodes section in the config.yaml file so I can map them together.
I think the error is happening because my target output for Porechop doesn't match the input for minimap2 but I do not know how to correct this problem as there can be multiple outputs from porechop.
I thought I was building a path for the input file for the minimap2 rule and when snakemake discovered that the porechop output was not there it would make it, but that is not what is happening.
Here is my pipeline so far,
configfile: "config.yaml"
rule all:
input:
expand("{sample}.bam", sample = config["samples"])
rule porechop_strict:
input:
lambda wildcards: config["samples"][wildcards.sample]
output:
directory("{sample}_strict/")
shell:
"porechop -i {input} -b {output} --barcode_threshold 85 --threads 8 --require_two_barcodes"
rule minimap2:
input:
lambda wildcards: "{sample}_strict/" + config["barcodes"][wildcards.sample]
output:
"{sample}.bam"
shell:
"minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output}"
and the yaml file
samples: {
'8413_19': relabeled_reads/8413_19.raw.fastq.gz,
'8417_19': relabeled_reads/8417_19.raw.fastq.gz,
'8445_19': relabeled_reads/8445_19.raw.fastq.gz,
'8466_19_104': relabeled_reads/8466_19_104.raw.fastq.gz,
'8466_19_105': relabeled_reads/8466_19_105.raw.fastq.gz,
'8467_20': relabeled_reads/8467_20.raw.fastq.gz,
}
barcodes: {
'8413_19': BC01.fastq.gz,
'8417_19': BC02.fastq.gz,
'8445_19': BC03.fastq.gz,
'8466_19_104': BC04.fastq.gz,
'8466_19_105': BC05.fastq.gz,
'8467_20': BC06.fastq.gz,
}
First of all, you can always debug the problems like that specifying the flag --printshellcmds. That would print all shell commands that Snakemake runs under the hood; you may try to run them manually and locate the problem.
As for why your rule doesn't produce any output, my guess is that samtools requires explicit filenames or - to use stdin:
Samtools is designed to work on a stream. It regards an input file '-'
as the standard input (stdin) and an output file '-' as the standard
output (stdout). Several commands can thus be combined with Unix
pipes. Samtools always output warning and error messages to the
standard error output (stderr).
So try that:
shell:
"minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output} -"
So I am not 100% sure why this way works, I imagine it has to do with the way snakemake looks at the targets however here is the solution I found for it.
rule minimap2:
input:
"{sample}_strict"
params:
suffix=lambda wildcards: config["barcodes"][wildcards.sample]
output:
"{sample}.bam"
shell:
"minimap2 -ax map-ont -t8 ../consensus.fasta\
{input}/{params.suffix} | samtools sort -o {output}"
by using the params feature in snakemake I was able to match up the correct barcode to the sample name. I am not sure why I could just do that as the input itself, but when I returned the input to the match the output of the previous rule it works.

Snakemake “Missing files after X seconds” error

I am getting the following error every time I try to run my snakemake script:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 99
Job counts:
count jobs
1 all
1 antiSMASH
1 pear
1 prodigal
4
[Wed Dec 11 14:59:43 2019]
rule pear:
input: Unmap_41_1.fastq, Unmap_41_2.fastq
output: merged_reads/Unmap_41.fastq
jobid: 3
wildcards: sample=Unmap_41, extension=fastq
Submitted job 3 with external jobid 'Submitted batch job 4572437'.
Waiting at most 120 seconds for missing files.
MissingOutputException in line 14 of /faststorage/project/ABR/scripts/antismash.smk:
Missing files after 120 seconds:
merged_reads/Unmap_41.fastq
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
It would seem that the first rule is not executing, but I am unsure as to why as from what I can see all the syntax is correct. Anyone have some advice?
The snakefile is the following:
#!/miniconda/bin/python
workdir: config["path_to_files"]
wildcard_constraints:
separator = config["separator"],
extension = config["file_extension"],
sample = '|' .join(config["samples"])
rule all:
input:
expand("antismash-output/{sample}/{sample}.txt", sample = config["samples"])
# merging the paired end reads (either fasta or fastq) as prodigal only takes single end reads
rule pear:
input:
forward = f"{{sample}}{config['separator']}1.{{extension}}",
reverse = f"{{sample}}{config['separator']}2.{{extension}}"
output:
"merged_reads/{sample}.{extension}"
#conda:
#"/home/lamma/env-export/antismash.yaml"
run:
shell("set +u")
shell("source ~/miniconda3/etc/profile.d/conda.sh")
shell("conda activate antismash")
shell("pear -f {input.forward} -r {input.reverse} -o {output} -t 21")
# If single end then move them to merged_reads directory
rule move:
input:
"{sample}.{extension}"
output:
"merged_reads/{sample}.{extension}"
shell:
"cp {path}/{sample}.{extension} {path}/merged_reads/"
# Setting the rule order on the 3 above rules which should be treated equally and only one run.
ruleorder: pear > move
# annotating the metagenome with prodigal#. Can be done inside antiSMASH but prefer to do it out
rule prodigal:
input:
f"merged_reads/{{sample}}.{config['file_extension']}"
output:
gbk_files = "annotated_reads/{sample}.gbk",
protein_files = "protein_reads/{sample}.faa"
#conda:
#"/home/lamma/env-export/antismash.yaml"
run:
shell("set +u")
shell("source ~/miniconda3/etc/profile.d/conda.sh")
shell("conda activate antismash")
shell("prodigal -i {input} -o {output.gbk_files} -a {output.protein_files} -p meta")
# running antiSMASH on the annotated metagenome
rule antiSMASH:
input:
"annotated_reads/{sample}.gbk"
output:
touch("antismash-output/{sample}/{sample}.txt")
#conda:
#"/home/lamma/env-export/antismash.yaml"
run:
shell("set +u")
shell("source ~/miniconda3/etc/profile.d/conda.sh")
shell("conda activate antismash")
shell("antismash --knownclusterblast --subclusterblast --full-hmmer --smcog --outputfolder antismash-output/{wildcards.sample}/ {input}")
I am running the pipeline on only one file at the moment but the yaml file looks like this if it is of intest:
file_extension: fastq
path_to_files: /home/lamma/ABR/Each_reads
samples:
- Unmap_41
separator: _
I know the error can occure when you use certain flags in snakemake but I dont believe I am using those flags. The command being submited to run the snakefile is:
snakemake --latency-wait 120 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/ABR/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output}' --cluster-config antismash-config.json --configfile yaml-config-files/antismash-on-rawMetagenome.yaml --snakefile antismash.smk
I have tried to -F flag to force a rerun but this seems to do nothing, as does increasing the --latency-wait number. Any help would be appriciated :)
I think it is likely to be something involving the way I am calling the conda environment in the run commands but using the conda: option with a yaml files returns version not found style errors.
As of what I read from pear documentation:
-o Specify the name to be used as base for the output files. PEAR outputs four files. A file containing the assembled reads with a
assembled.fastq extension, two files containing the forward, resp.
reverse, unassembled reads with extensions unassembled.forward.fastq,
resp. unassembled.reverse.fastq, and a file containing the discarded
reads with a discarded.fastq extension
So if the output defined in your rule is just a base, I suggest you put this as a param and the real names of the output as output:
rule pear:
input:
forward = f"{{sample}}{config['separator']}1.{{extension}}",
reverse = f"{{sample}}{config['separator']}2.{{extension}}"
output:
"merged_reads/{sample}.{extension}".assembled.fastq,
"merged_reads/{sample}.{extension}".unassembled.forward.fastq,
"merged_reads/{sample}.{extension}".unassembled.reverse.fastq,
"merged_reads/{sample}.{extension}".discarded.fastq
params:
base="merged_reads/{sample}.{extension}"
#conda:
#"/home/lamma/env-export/antismash.yaml"
run:
shell("set +u")
shell("source ~/miniconda3/etc/profile.d/conda.sh")
shell("conda activate antismash")
shell("pear -f {input.forward} -r {input.reverse} -o {params.base} -t 21")
I haven't tested pear so not sure what the output file names are exactly.

snakemake with prefix as output including a path

How can I make sure in rule all that the output folder was well created?
Should I add each expected result file?
somehow relates to snakemake define folder as output but in my case the specified 'output' is a combination of a path to a dir and a prefix for all results files (they wil be multiple)
the following command creates a folder path Analysis/MosDepth and adds to that path the files:
gt0.mosdepth.global.dist.txt
gt0.mosdepth.region.dist.txt
gt0.per-base.bed.gz
gt0.per-base.bed.gz.csi
gt0.regions.bed.gz
gt0.regions.bed.gz.csi
rule MosDepth:
input:
bam = "Analysis/Minimap2/"+UnpackedRawFastq+".bam",
bed = "ReferenceData/"+UnpackedGenomeGFF+"_exons.bed"
output:
pfx = "Analysis/MosDepth/gt0"
threads: config["threads"]
shell:
"mosdepth -t {threads} -b {input.bed} {output.pfx} {input.bam}"
I currently have only one of the files in rule all:, is this enough or is there a better way to ensure that the mosdepth has run well and not redo it in a later re-run?
rule all:
input:
"Analysis/MosDepth/gt0.regions.bed.gz"
I would recommend sth like this:
mos_out = ['gt0.mosdepth.global.dist.txt', 'gt0.mosdepth.region.dist.txt', 'gt0.per-base.bed.gz', 'gt0.per-base.bed.gz.csi', 'gt0.regions.bed.gz', 'gt0.regions.bed.gz.csi']
rule MosDepth:
input:
bam = "Analysis/Minimap2/"+UnpackedRawFastq+".bam",
bed = "ReferenceData/"+UnpackedGenomeGFF+"_exons.bed"
output:
expand("Analysis/MosDepth/{mos_out}", mos_out=mos_out)
params:
pfx = "Analysis/MosDepth/gt0"
threads: config["threads"]
shell:
"mosdepth -t {threads} -b {input.bed} {params.pfx} {input.bam}"
If one of the output files is not created by the rule, snakemake will remove all the output files for you, and throw an error.

How to pass variable value as input in snakemake?

I want to download the fastq files from SRA database using SRR ID using Snakemake. I read a file to get SRR ID using python code.
I want to parse the Variable one by one as input. My code is below.
I want to run command
fastq-dump SRR390728
#SAMPLES = ['SRR390728','SRR400816']
SAMPLES = [line.strip() for line in open("./srrList", 'r')]
rule all:
input:
expand("fastq/{sample}.fastq.log",sample=SAMPLES)
rule download_fastq:
input:
"{sample}"
output:
"fastq/{sample}.fastq.log"
shell:
"fastq-dump {input} > {output}"
Skip input and just call the wildcard in shell command. input needs to be a filepath that needs to already exist or be created as part of the pipeline - neither are true in your case.
rule download_fastq:
output:
"fastq/{sample}.fastq.log"
shell:
"fastq-dump {wildcards.sample} > {output}"

Snakemake run rule with wildcarded output of previous rules multiple times

I have multiple studies and I must make two files (a .notsad and .txt file) for each of the n number of studies. After these are created, I must run a command which runs per chromosome and uses the same two input files (.notsad, .txt) for each chromosome within a given study. So:
mycommand.py study1.notsad study1_filter.txt chr1.bad.gz --out chr1_filter.bad.gz
mycommand.py study1.notsad study1_filter.txt chr2.bad.gz --out chr2_filter.bad.gz
...
mycommand.py study2.notsad study2_filter.txt chr1.bad.gz --out chr1_filter.bad.gz
...
However Im having trouble getting this to run. Im getting an error:
WildcardError in line 33 of /scripts/Snakefile:
Wildcards in input files cannot be determined from output files:
'ds_lower'
My rules so far:
import os
import glob
ROOT = "/rootdir/"
ORIGINAL_DATA_FOLDER="original/"
PROCESS_DATA_FOLDER="process/"
ORIGINAL_DATA_SOURCE=ROOT+ORIGINAL_DATA_FOLDER
PROCESS_DATA_SOURCE=ROOT+PROCESS_DATA_FOLDER
DATASETS = [name for name in os.listdir(ORIGINAL_DATA_SOURCE) if os.path.isdir(os.path.join(ORIGINAL_DATA_SOURCE, name))]
LOWERCASE_DATASETS = [dataset.lower() for dataset in DATASETS]
CHROMOSOME = list(range(1,23))
rule all:
input:
expand(PROCESS_DATA_SOURCE+"{ds}/chr{chr}_filtered.gen.gz", ds=DATASETS, chr=CHROMOSOME)
rule run_command:
input:
ORIGINAL_DATA_SOURCE+"{ds}/chr{chr}.bad.gz", # Matches 22 chroms
PROCESS_DATA_SOURCE+"{ds}/{ds_lower}_filter.txt", # But this should be common to all chr runs for this study.
PROCESS_DATA_SOURCE+"{ds}/{ds_lower}.notsad" # This one as well.
output:
PROCESS_DATA_SOURCE+"{ds}/chr{chr}_filtered.gen.gz"
shell:
# Run command that uses each of the previous files and runs per chromosome
"mycommand.py {input.2} {input.1} {input.0} --out {output}"
rule write_txt_file:
input:
ORIGINAL_DATA_SOURCE+"{ds}/{ds_lower}_info.txt"
output:
PROCESS_DATA_SOURCE+"{ds}/{ds_lower}_filter.txt"
shell:
"touch {output}"
rule write_notsad_file:
input:
ORIGINAL_DATA_SOURCE+"{ds}/_{ds_lower}.sad"
output:
PROCESS_DATA_SOURCE+"{ds}/{ds_lower}.notsad"
shell:
"touch {output}"
UPDATE
Changing inputs for rule run_command to lambda functions did work.
rule run_command:
input:
ORIGINAL_DATA_SOURCE+"{ds}/chr{chr}.gen.gz",
lambda wildcards: PROCESS_DATA_SOURCE + f"{wildcards.ds}/{wildcards.ds.lower()}_filter.txt",
lambda wildcards: PROCESS_DATA_SOURCE + f"{wildcards.ds}/{wildcards.ds.lower()}.sample"
output:
PROCESS_DATA_SOURCE+"{ds}/chr{chr}_filtered.gen.gz"
run:
# Run command that uses each of the previous files and runs per chromosome
"mycommand.py {input.2} {input.1} {input.0} --out {output}"
All the wildcards used in input need to be present in output. In rule run_command, wildcard {ds_lower} is present only in input but not in output.