MissingOutputException in Snakemake

I am getting a MissingOutputException from my Snakemake workflow. Snakemake creates the required output in the desired directory, but keeps looking for it and then exits.
This is my Snakefile:
rule all:
    input:
        expand('/home/stud9/NAS/results/qc_reports/fastqc/trimmed_{sample}_1_fastqc.html', sample=SAMPLES),
        expand('/home/stud9/NAS/results/qc_reports/fastqc/trimmed_{sample}_2_fastqc.html', sample=SAMPLES),
        expand('home/stud9/NAS/results/non_aligned/{sample}_nm2cov.bam', sample=SAMPLES)

rule nm2cov:
    input:
        '/home/stud9/NAS/results/aligned/to_cov/{sample}_cov.sorted.bam'
    output:
        'home/stud9/NAS/results/non_aligned/{sample}_nm2cov.bam'
    shell:
        "cd /home/stud9/NAS/results/non_aligned && samtools view -b -f 4 {input} > {wildcards.sample}_nm2cov.bam"
I use cd before the actual command because I want the results in that directory; otherwise they would end up in the Snakefile directory.
This is the message I am getting:
Waiting at most 10 seconds for missing files.
MissingOutputException in rule nm2cov in line 50 of /home/stud9/NAS/scripts/wf_1:
Job Missing files after 10 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
home/stud9/NAS/results/non_aligned/148_nm2cov.bam completed successfully, but some output files are missing. 55
Shutting down, this might take some time.
Sorry if my post is a bit messy; this is the first time I have posted here.
I tried raising --latency-wait to 15, but it made no difference.

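A MissingOutputException is often caused by typos or wrong paths in the declared output files.
In your case a leading / is missing, so Snakemake treats the output path as relative rather than absolute. The shell command writes the file to /home/stud9/NAS/results/non_aligned/, while Snakemake looks for home/stud9/NAS/results/non_aligned/ below its working directory. The same / is missing from the third expand() in rule all.
Try this: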
rule nm2cov:
    input:
        '/home/stud9/NAS/results/aligned/to_cov/{sample}_cov.sorted.bam'
    output:
        '/home/stud9/NAS/results/non_aligned/{sample}_nm2cov.bam'
    shell:
        "cd /home/stud9/NAS/results/non_aligned && samtools view -b -f 4 {input} > {wildcards.sample}_nm2cov.bam"
NB: It is generally recommended to use relative rather than absolute paths in a Snakefile, so that the workflow stays reproducible.
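To see why the missing slash matters, here is a small Python sketch of how a relative path is resolved. It is only an illustration, not Snakemake code. The working directory /home/stud9/NAS/scripts is an assumption taken from the Snakefile location in the error message; the two file paths come from the question.

```python
import posixpath

# Path exactly as declared under `output:` in the question (no leading slash).
declared_output = "home/stud9/NAS/results/non_aligned/148_nm2cov.bam"
# Path the shell command really writes to, because of the explicit `cd`.
written_file = "/home/stud9/NAS/results/non_aligned/148_nm2cov.bam"

# Assumed working directory (where the Snakefile lives according to the log).
workdir = "/home/stud9/NAS/scripts"

# A relative path is interpreted relative to the working directory.
expected_location = posixpath.normpath(posixpath.join(workdir, declared_output))

print(posixpath.isabs(declared_output))
print(expected_location)
print(expected_location == written_file)
```

The file that is checked for and the file that is written are two different locations, so no amount of waiting makes the declared output appear.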

Related

Snakemake: catch output file whose name cannot be changed

As part of a Snakemake pipeline that I'm building, I have to use a program that does not allow me to specify the file path or name of an output file.
E.g. when running the program in the working directory workdir/ it produces the following output:
workdir/output.txt
My snakemake rule looks something like this:
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    shell: "somecommand {input} {output}"
So every time the rule NAME runs, I get an additional file output.txt in the snakemake working directory, which is then overwritten if the rule NAME runs multiple times or in parallel.
I'm aware of shadow rules, and adding shadow: "full" allows me to simply ignore the output.txt file. However, I'd like to keep output.txt and save it in the same directory as the outputfile. Is there a way of achieving this, either with the shadow directive or otherwise?
I was also thinking I could prepend somecommand with a cd command, but then I'd probably run into other issues downstream when linking up other rules to the outputs of the rule NAME.
How about simply moving it directly afterwards in the shell part (provided somecommand completes successfully)?
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    params:
        output_dir = "path/to/output_dir",
    shell: "somecommand {input} {output} && mv output.txt {params.output_dir}/output.txt"
EDIT: for multiple executions of NAME in parallel, combining with shadow: "full" could work:
rule NAME:
    input: "path/to/inputfile"
    output:
        output_file = "path/to/outputfile",
        output_txt = "path/to/output_dir/output.txt"
    shadow: "full"
    shell: "somecommand {input} {output.output_file} && mv output.txt {output.output_txt}"
That should run each execution of the rule in its own temporary dir, and by specifying the moved output.txt as an output Snakemake should move it to the real output dir once the rule is done running.
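The idea behind this (run each job in its own scratch directory, then move the fixed-name file to its final place) can be sketched in plain Python. This is a rough analogy under stated assumptions, not how Snakemake implements shadow rules: run_isolated and fake_tool are hypothetical names, and fake_tool merely stands in for somecommand.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_isolated(command, fixed_name, destination):
    """Run `command` in a throwaway directory, then move its fixed-name output."""
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as scratch:
        # cwd= changes the directory of the child process only.
        subprocess.run(command, cwd=scratch, check=True)
        shutil.move(str(Path(scratch) / fixed_name), str(destination))
    return destination


# Hypothetical stand-in for `somecommand`: it always writes ./output.txt.
fake_tool = [sys.executable, "-c", "open('output.txt', 'w').write('done')"]

with tempfile.TemporaryDirectory() as results:
    first = run_isolated(fake_tool, "output.txt", Path(results) / "a" / "output.txt")
    second = run_isolated(fake_tool, "output.txt", Path(results) / "b" / "output.txt")
    contents = [first.read_text(), second.read_text()]

print(contents)
```

Because every run gets its own scratch directory, two runs can never overwrite each other's output.txt, which is the property the shadow directive is used for here.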
"I was also thinking I could prepend somecommand with a cd command, but then I'd probably run into other issues downstream when linking up other rules to the outputs of the rule NAME."
I think you are on the right track here. Each shell block is run in a separate process with the working directory inherited from the snakemake process (specified with the --directory argument on the command line). Accordingly, cd commands in one shell block will not affect other jobs from the same rule or other downstream/upstream jobs.
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    shell:
        """
        input_file=$(realpath "{input}") # get the absolute path, before the `cd`
        base_dir=$(dirname "{output}")
        cd "$base_dir"
        somecommand ...
        """
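The claim that a cd inside one shell block cannot leak into other jobs follows from ordinary process behaviour, which can be checked with a few lines of Python. This is a minimal sketch that assumes a POSIX sh is available; it does not involve Snakemake itself.

```python
import os
import subprocess
import tempfile

parent_before = os.getcwd()

with tempfile.TemporaryDirectory() as job_dir:
    job_dir_real = os.path.realpath(job_dir)
    # Stand-in for a shell block: change directory, then report the location.
    result = subprocess.run(
        ["sh", "-c", 'cd "$1" && pwd', "sh", job_dir],
        capture_output=True,
        text=True,
        check=True,
    )
    child_cwd = os.path.realpath(result.stdout.strip())

parent_after = os.getcwd()

print(child_cwd == job_dir_real)       # the child really moved
print(parent_before == parent_after)   # the parent did not
```

The child shell ends up in the new directory while the parent process stays where it was, so later jobs started by the parent are unaffected.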

running metabat2 with snakemake but not getting the bin files

I have been trying to run metabat2 with Snakemake. The rule runs, but the output files in metabat2/ are missing. The CheckM step that runs after it does use the data and works; I just can't find the files afterwards. There should be numbered files, but it is impossible to predict how many will be created. Is there a way to specify the output so that the files are created in that directory?
rule all:
    input:
        [f"metabat2/" for sample in samples],
        [f"checkm/" for sample in samples]

rule metabat2:
    input:
        "input/consensus.fasta"
    output:
        directory("metabat2/")
    conda:
        "envs/metabat2.yaml"
    shell:
        "metabat2 -i {input} -o {output} -v"

rule checkM:
    input:
        "metabat2/"
    output:
        c = "bacteria/CheckM.txt",
        d = directory("checkm/")
    conda:
        "envs/metabat2.yaml"
    shell:
        "checkm lineage_wf -f {output.c} -t 10 -x fa {input} {output.d}"
Normally, metabat2 would be run like this:
metabat2 -i path/to/consensus.fasta -o /outputdir/bin -v
This creates files named bin.[number].fa in outputdir.
I can't tell what the problem is, but I have a couple of suggestions:
[f"metabat2/" for sample in samples]: I doubt this does what you expect, as it simply creates a list with the string metabat2/ repeated len(samples) times. Maybe you want [f"metabat2/{sample}" for sample in samples]? The same goes for [f"checkm/" for sample in samples].
The samples variable is not used anywhere in the rules following all. I suspect it should be used somewhere, and/or you should use something like output: directory("metabat2/{sample}").
Run snakemake with the -p option to see which commands are executed. It may be useful to post the resulting stdout.
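The first suggestion is easy to verify in plain Python. The sketch below also looks at the file names, assuming (based only on the example in the question, where -o /outputdir/bin yields bin.[number].fa) that metabat2 appends .<number>.fa to whatever is passed to -o. The sample names and the helper bin_path are hypothetical.

```python
import posixpath

samples = ["A", "B", "C"]  # hypothetical sample names

# The comprehension never uses `sample`, so every element is identical.
targets_as_written = [f"metabat2/" for sample in samples]
# Using the loop variable gives one distinct target per sample.
targets_per_sample = [f"metabat2/{sample}" for sample in samples]


def bin_path(prefix, number):
    """Bin file path, ASSUMING metabat2 appends .<number>.fa to the -o value."""
    return f"{prefix}.{number}.fa"


print(targets_as_written)
print(targets_per_sample)
print(posixpath.basename(bin_path("/outputdir/bin", 1)))
print(posixpath.basename(bin_path("metabat2/", 1)))
```

If that naming assumption holds, passing metabat2/ as the prefix would produce file names that start with a dot, which a plain ls does not list. That might explain why the bins seem to be missing while CheckM still finds them; checking with ls -a metabat2/ would confirm or rule this out.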

Does Snakefile location matter?

I am an absolute beginner with Snakemake and am building a pipeline as I learn. My question: if the Snakefile is placed in the same directory as the data files I want to process, a NameError occurs, but if I move the Snakefile to the parent directory and adjust the paths in input: and output:, the code works. What am I missing?
rule sra_convert:
    input:
        "rna/{id}.sra"
    output:
        "rna/fastq/{id}.fastq"
    shell:
        "fastq-dump {input} -O {output}"
The above code works fine when I run it with
snakemake -p rna/fastq/SRR873382.fastq
However, if I move the Snakefile into the "rna" directory, where the SRR873382.sra file is located, and edit the code as below
rule sra_convert:
    input:
        "{id}.sra"
    output:
        "fastq/{id}.fastq"
    message:
        "Converting from {id}.sra to {id}.fastq"
    shell:
        "fastq-dump {input} -O {output}"
and run
snakemake -p fastq/SRR873382.fastq
I get the following error
Building DAG of jobs...
Job counts:
count jobs
1 sra_convert
1
RuleException in line 7 of /home/sarc/Data/rna/Snakefile:
NameError: The name 'id' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
Solution
rule sra_convert:
    input:
        "{id}.sra"
    output:
        "fastq/{id}.fastq"
    message:
        "Converting from {wildcards.id}.sra to {wildcards.id}.fastq"
    shell:
        "fastq-dump {input} -O {output}"
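The above code runs without error. The Snakefile location was not the cause: the second version added a message: directive that used {id} directly, whereas in message and shell strings a wildcard has to be accessed as {wildcards.id}.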
I believe that the best source that answers your actual question is:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
If the rule’s output matches a requested file, the substrings matched
by the wildcards are propagated to the input files and to the variable
wildcards, that is here also used in the shell command. The wildcards
object can be accessed in the same way as input and output, which is
described above.
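The mechanism described in the quoted documentation can be imitated in a few lines of Python. This is a simplified emulation for illustration only: match_wildcards is a hypothetical helper, not part of Snakemake, and it does not handle a wildcard name that occurs twice in one pattern. Snakemake reports the failure as a NameError, whereas plain str.format raises a KeyError.

```python
import re
from types import SimpleNamespace


def match_wildcards(pattern, target):
    """Return the wildcard values if `target` fits `pattern`, else None."""
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        regex += f"(?P<{part}>.+)" if index % 2 else re.escape(part)
    found = re.fullmatch(regex, target)
    return SimpleNamespace(**found.groupdict()) if found else None


wildcards = match_wildcards("fastq/{id}.fastq", "fastq/SRR873382.fastq")

# Works: the value is reached through the `wildcards` object.
message = "Converting from {wildcards.id}.sra to {wildcards.id}.fastq".format(
    wildcards=wildcards
)

# Fails: a bare {id} is not among the names handed to the formatter.
try:
    "Converting from {id}.sra".format(wildcards=wildcards)
    bare_name_ok = True
except KeyError:
    bare_name_ok = False

print(message)
print(bare_name_ok)
```

The input and output patterns are matched against file names, so {id} is meaningful there; message and shell strings are formatted afterwards, where only objects such as input, output and wildcards are available.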

snakemake: Use different folders for input and output files

This is most likely a very basic issue, but I could not find it documented anywhere.
rule all:
    input:
        "fasta_file.fna"
    output:
        "headers.txt"
    shell:
        "grep '^>' {input} > {output}"
I want to run this for a set of files that are not necessarily in the same folder. Is there a way to pass the input file name, possibly from another directory, on the command line (or via a config file)?
Okay, never mind; this was probably not a smart question.
rule all:
    input:
        "input/{sample}.fna"
    output:
        "output/{sample}_headers.txt"
    shell:
        "grep '^>' {input} > {output}"
And with that I can just run snakemake for my target file, something like snakemake output/A1_headers.txt, or build a for loop over my input sequences.
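Instead of a shell for loop, the list of targets can be derived from the input files. The sketch below shows the idea in plain Python; header_targets is a hypothetical helper and the sample names are made up. Inside a Snakefile, the built-in glob_wildcards function serves a similar purpose.

```python
import tempfile
from pathlib import Path


def header_targets(input_dir, output_dir="output"):
    """Map every <input_dir>/<sample>.fna to <output_dir>/<sample>_headers.txt."""
    samples = sorted(path.stem for path in Path(input_dir).glob("*.fna"))
    return [f"{output_dir}/{sample}_headers.txt" for sample in samples]


# Demonstration with two empty placeholder files (hypothetical sample names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("A1.fna", "B2.fna"):
        (Path(tmp) / name).touch()
    targets = header_targets(tmp)

print(targets)
```

A list like this could be given as the input of a target rule, or joined with spaces and passed to snakemake on the command line.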

Snakemake cannot find output file, gives MissingOutputException while latency-wait is seemingly ignored

I have a simple rule to generate a file in Snakemake. Running snakemake results in an immediate error that it cannot find the generated file, even when --latency-wait is specified as a command line option.
However, this does seem to be a latency-related issue, as this Snakefile runs without problems on a local machine. The output below is on a system that has known latency problems.
Contents of Snakefile:
rule generate_file:
    output:
        "dummy.txt"
    shell:
        "head --bytes 1024 < /dev/zero | base64 > '{output}'; ls"
Commands:
$ snakemake --version
5.2.0
$ snakemake -p --latency-wait 10
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 generate_file
1
rule generate_file:
output: dummy.txt
jobid: 0
head --bytes 1024 < /dev/zero | base64 > 'dummy.txt'; ls
dummy.txt Snakefile
MissingOutputException in line 1 of /home/user/project/Snakefile:
[Errno 2] No such file or directory: ''
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Removing output files of failed job generate_file since they might be corrupted:
dummy.txt
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/user/project/.snakemake/log/2018-08-08T101648.774072.snakemake.log
Interestingly, the ls command shows the file is created and visible.
Your rule creates the output file dummy.txt, and Snakemake finishes successfully, when I run it with Snakemake 5.2.2 on Linux. Perhaps it is a bug in version 5.2.0? I don't see anything about it in the changelog, though.
On a related note, using head in a shell command used to result in a non-zero exit status error. Apparently recent versions behave differently in this respect.
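For context, the kind of check that --latency-wait controls can be pictured as a polling loop. The following is a simplified emulation, not Snakemake's actual code; wait_for_files and its parameters are hypothetical names.

```python
import os
import tempfile
import time


def wait_for_files(paths, latency_wait=5.0, poll_interval=0.1):
    """Poll until all paths exist or time runs out; return those still missing."""
    deadline = time.monotonic() + latency_wait
    missing = [path for path in paths if not os.path.exists(path)]
    while missing and time.monotonic() < deadline:
        time.sleep(poll_interval)
        missing = [path for path in missing if not os.path.exists(path)]
    return missing


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "dummy.txt")
    absent = os.path.join(tmp, "never_written.txt")
    open(present, "w").close()
    still_missing = wait_for_files([present, absent], latency_wait=0.3)
    empty_name_missing = wait_for_files([""], latency_wait=0.0)

print(still_missing == [absent])
print(empty_name_missing)
```

A longer wait only helps when the file eventually shows up. In the log above the reported file name is an empty string, and an empty path never exists however long one waits, which fits the suspicion that this was a bug rather than filesystem latency.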