Running external scripts with wildcards in snakemake

I am trying to run a snakemake rule with an external script that contains a wildcard, as noted in the snakemake readthedocs. However, I am running into a KeyError when running snakemake.
For example, if we have the following rule:
SAMPLE = ['test']

rule all:
    input:
        expand("output/{sample}.txt", sample=SAMPLE)

rule NAME:
    input: "workflow/scripts/{sample}.R"
    output: "output/{sample}.txt",
    script: "workflow/scripts/{wildcards.sample}.R"
with the script workflow/scripts/test.R containing the following code:
out.path = snakemake@output[1]
out = "Hello World"
writeLines(out, out.path)
I get the following error when trying to execute snakemake.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 NAME
1 all
2
[Fri May 21 12:04:55 2021]
rule NAME:
input: workflow/scripts/test.R
output: output/test.txt
jobid: 1
wildcards: sample=test
[Fri May 21 12:04:55 2021]
Error in rule NAME:
jobid: 1
output: output/test.txt
RuleException:
KeyError in line 14 of /sc/arion/projects/LOAD/Projects/sandbox/Snakefile:
'wildcards'
File "/sc/arion/work/andres12/conda/envs/py38/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 2231, in run_wrapper
File "/sc/arion/projects/LOAD/Projects/sandbox/Snakefile", line 14, in __rule_NAME
File "/sc/arion/work/andres12/conda/envs/py38/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 560, in _callback
File "/sc/arion/work/andres12/conda/envs/py38/lib/python3.8/concurrent/futures/thread.py", line 57, in run
File "/sc/arion/work/andres12/conda/envs/py38/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 546, in cached_or_run
File "/sc/arion/work/andres12/conda/envs/py38/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 2262, in run_wrapper
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /sc/arion/projects/LOAD/Projects/sandbox/.snakemake/log/2021-05-21T120454.713963.snakemake.log
Does anyone know why this is not working correctly?

I agree with Dmitry Kuzminov that having a script depending on a wildcard is odd. Maybe there are better solutions.
Anyway, the following works for me on snakemake 6.0.0. Note that in your R script snakemake@output[1] should be snakemake@output[[1]], but that doesn't give the problem you report.
SAMPLE = ['test']

rule all:
    input:
        expand("output/{sample}.txt", sample=SAMPLE)

rule make_script:
    output:
        "workflow/scripts/{sample}.R",
    shell:
        r"""
        echo 'out.path = snakemake@output[[1]]' > {output}
        echo 'out = "Hello World"' >> {output}
        echo 'writeLines(out, out.path)' >> {output}
        """

rule NAME:
    input:
        "workflow/scripts/{sample}.R"
    output:
        "output/{sample}.txt",
    script:
        "workflow/scripts/{wildcards.sample}.R"

Related

Snakemake Error with MissingOutputException

I am trying to run STAR with snakemake on a server.
My smk file is this one:
import pandas as pd

configfile: 'config.yaml'

# Read sample-to-batch dataframe mapping batch to sample (here with zip)
sample_to_batch = pd.read_csv("/mnt/DataArray1/users/zisis/STAR_mapping/snakemake_STAR_index/all_samples_test.csv", sep = '\t')

# rule specifying output
rule all_STAR:
    input:
        #expand("{sample}/Aligned.sortedByCoord.out.bam", sample = sample_to_batch['sample'])
        expand(config['path_to_output']+"{sample}/Aligned.sortedByCoord.out.bam", sample = sample_to_batch['sample'])

rule STAR_align:
    # specify input fastq files
    input:
        fq1 = config['path_to_output']+"{sample}_1.fastq.gz",
        fq2 = config['path_to_output']+"{sample}_2.fastq.gz"
    params:
        # location of indexed genome and location to save the output
        genome = directory(config['path_to_reference']+config['ref_assembly']+".STAR_index"),
        prefix_outdir = directory(config['path_to_output']+"{sample}/")
    threads: 12
    output:
        config['path_to_output']+"{sample}/Aligned.sortedByCoord.out.bam"
    log:
        config['path_to_output']+"logs/{sample}.log"
    message:
        "--- Mapping STAR---"
    shell:
        """
        STAR --runThreadN {threads} \
        --readFilesCommand zcat \
        --readFilesIn {input} \
        --genomeDir {params.genome} \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMunmapped Within \
        --outSAMattributes Standard
        """
While STAR starts normally, at the end I get this error:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 14 of /mnt/DataArray1/users/zisis/STAR_mapping/snakemake/STAR_snakefile_align.smk:
Job Missing files after 5 seconds:
/mnt/DataArray1/users/zisis/STAR_mapping/snakemake/001_T1/Aligned.sortedByCoord.out.bam
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 1 completed successfully, but some output files are missing. 1
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I tried --latency-wait but it is not working.
In order to execute snakemake I run the command
users/zisis/STAR_mapping/snakemake_STAR_index$ snakemake --snakefile STAR_new_snakefile.smk --cores all --printshellcmds
Technically I am in my directory with full access and permissions.
Do you think that this is happening due to strange rights in the execution of snakemake, or when it tries to create directories?
It creates the directory and the files, but I can see that there is a file Aligned.sortedByCoord.out.bamAligned.sortedByCoord.out.bam.
Is this the problem?
I think your STAR command does not have the option that says which file and directory to write to, presumably it is writing the default filename to the current directory. Try something like:
rule STAR_align:
    input: ...
    output: ...
    ...
    shell:
        r"""
        outprefix=`dirname {output}`
        STAR --outFileNamePrefix $outprefix/ \
        --runThreadN {threads} \
        etc...
        """
I am running the command from my directory, in which I am a sudo user.
I don't think that is the problem, but it is strongly recommended to work as a regular user and use sudo only in special circumstances (e.g. installing system-wide programs, though if you use conda you shouldn't need that).

Snakemake cannot handle very long command line?

This is a very strange problem.
When the {input} specified in the rule section is a list of <200 files, snakemake worked all right.
But when {input} has more than 500 files, snakemake just quit with the message (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!). The complete log did not provide any error messages.
For the log, please see: https://github.com/snakemake/snakemake/files/5285271/2020-09-25T151835.613199.snakemake.log
The rule that worked is (NOTE the input is capped to 200 files):
rule combine_fastq:
    input:
        lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')[:200]
    output:
        "combined.fastq/{sample}.fastq.gz"
    group: "minion_assemble"
    shell:
        """
        echo {input} > {output}
        """
The rule that failed is:
rule combine_fastq:
    input:
        lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')
    output:
        "combined.fastq/{sample}.fastq.gz"
    group: "minion_assemble"
    shell:
        """
        echo {input} > {output}
        """
My question is also posted on GitHub: https://github.com/snakemake/snakemake/issues/643.
I second Maarten's answer, with that many files you are running up against a shell limit; snakemake is just doing a poor job helping you identify the problem.
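If you want to confirm that limit on your machine, a quick generic check (nothing snakemake-specific; works on Linux and macOS):

import os
# Maximum combined size, in bytes, of command-line arguments plus environment
# that the kernel accepts for a single exec call; exceeding it is what causes
# "Argument list too long" failures.
print(os.sysconf('SC_ARG_MAX'))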
Based on the issue you reference, it seems like you are using cat to combine all of your files. Maybe following the answer here would help:
rule combine_fastq_list:
    input:
        lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')
    output:
        temp("{sample}.tmp.list")
    group: "minion_assemble"
    run:
        with open(output[0], 'w') as out:
            out.write('\n'.join(input))

rule combine_fastq:
    input:
        "{sample}.tmp.list"
    output:
        'combined.fastq/{sample}.fastq.gz'
    group: "minion_assemble"
    shell:
        'cat {input} | '  # this reads the list of files from the list file
        'xargs zcat -f | '
        '...'
Hope it gets you on the right track.
EDIT
The first option executes your command separately for each input file. A different option that executes the command once for the whole list of inputs is:
rule combine_fastq:
    ...
    shell:
        """
        command $(< {input}) ...
        """
For those landing here with similar questions (like Snakemake expand function alternative), snakemake 6 can handle long command lines. The following test fails on snakemake < 6 but succeeds on 6.0.0 on my Ubuntu machine:
rule all:
    input:
        'output.txt',

rule one:
    output:
        'output.txt',
    params:
        x= list(range(0, 1000000))
    shell:
        r"""
        echo {params.x} > {output}
        """

InputFunctionException: unexpected EOF while parsing

Major EDIT:
Having fixed a couple of issues thanks to comments and written a minimal reproducible example to help my helpers, I've narrowed down the issue to a difference between execution locally and using DRMAA.
Here is a minimal reproducible pipeline that does not require any external file download and can be executed out of the box after cloning the following git repository:
git clone git@github.com:kevinrue/snakemake-issue-all.git
When I run the pipeline using DRMAA I get the following error:
Building DAG of jobs...
Using shell: /bin/bash
Provided cluster nodes: 100
Singularity containers: ignored
Job counts:
count jobs
1 all
2 cat
3
InputFunctionException in line 22 of /ifs/research-groups/sims/kevin/snakemake-issue-all/workflow/Snakefile:
SyntaxError: unexpected EOF while parsing (<string>, line 1)
Wildcards:
sample=A
However, if I run the pipeline locally (--cores 1), it works:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Singularity containers: ignored
Job counts:
count jobs
1 all
2 cat
3
[Sat Jun 13 08:49:46 2020]
rule cat:
input: data/A1, data/A2
output: results/A/cat
jobid: 1
wildcards: sample=A
[Sat Jun 13 08:49:46 2020]
Finished job 1.
1 of 3 steps (33%) done
[Sat Jun 13 08:49:46 2020]
rule cat:
input: data/B1, data/B2
output: results/B/cat
jobid: 2
wildcards: sample=B
[Sat Jun 13 08:49:46 2020]
Finished job 2.
2 of 3 steps (67%) done
[Sat Jun 13 08:49:46 2020]
localrule all:
input: results/A/cat, results/B/cat
jobid: 0
[Sat Jun 13 08:49:46 2020]
Finished job 0.
3 of 3 steps (100%) done
Complete log: /ifs/research-groups/sims/kevin/snakemake-issue-all/.snakemake/log/2020-06-13T084945.632545.snakemake.log
My DRMAA profile is the following:
jobs: 100
default-resources: 'mem_free=4G'
drmaa: "-V -notify -p -10 -l mem_free={resources.mem_free} -pe dedicated {threads} -v MKL_NUM_THREADS={threads} -v OPENBLAS_NUM_THREADS={threads} -v OMP_NUM_THREADS={threads} -R y -q all.q"
drmaa-log-dir: /ifs/scratch/kevin
use-conda: true
conda-prefix: /ifs/home/kevin/devel/snakemake/envs
printshellcmds: true
reason: true
Briefly, the Snakefile looks like this:
# The main entry point of your workflow.
# After configuring, running snakemake -n in a clone of this repository should successfully execute a dry-run of the workflow.

report: "report/workflow.rst"

# Allow users to fix the underlying OS via singularity.
singularity: "docker://continuumio/miniconda3"

include: "rules/common.smk"
include: "rules/other.smk"

rule all:
    input:
        # The first rule should define the default target files
        # Subsequent target rules can be specified below. They should start with all_*.
        expand("results/{sample}/cat", sample=samples['sample'])

rule cat:
    input:
        file1="data/{sample}1",
        file2="data/{sample}2"
    output:
        "results/{sample}/cat"
    shell:
        "cat {input.file1} {input.file2} > {output}"
Running snakemake -np gives me what I expect:
$ snakemake -np
sample condition
sample_id
A A untreated
B B treated
Building DAG of jobs...
Job counts:
count jobs
1 all
2 cat
3
[Sat Jun 13 08:51:19 2020]
rule cat:
input: data/B1, data/B2
output: results/B/cat
jobid: 2
wildcards: sample=B
cat data/B1 data/B2 > results/B/cat
[Sat Jun 13 08:51:19 2020]
rule cat:
input: data/A1, data/A2
output: results/A/cat
jobid: 1
wildcards: sample=A
cat data/A1 data/A2 > results/A/cat
[Sat Jun 13 08:51:19 2020]
localrule all:
input: results/A/cat, results/B/cat
jobid: 0
Job counts:
count jobs
1 all
2 cat
3
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
I'm not sure how to debug it further. I'm happy to provide more information as needed.
Note: I use snakemake version 5.19.2
Thanks in advance!
EDIT
Using the --verbose option, Snakemake seems to trip on the default-resources: 'mem_free=4G' and/or the drmaa: "-l mem_free={resources.mem_free}" settings defined in my 'drmaa' profile (see above).
$ snakemake --profile drmaa --verbose
Building DAG of jobs...
Using shell: /bin/bash
Provided cluster nodes: 100
Singularity containers: ignored
Job counts:
count jobs
1 all
2 cat
3
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 100}
Ready jobs (2):
cat
cat
Full Traceback (most recent call last):
File "/ifs/devel/kevin/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/rules.py", line 941, in apply
res, _ = self.apply_input_function(
File "/ifs/devel/kevin/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/rules.py", line 684, in apply_input_function
raise e
File "/ifs/devel/kevin/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/rules.py", line 678, in apply_input_function
value = func(Wildcards(fromdict=wildcards), **_aux_params)
File "/ifs/devel/kevin/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/resources.py", line 10, in callable
value = eval(
File "<string>", line 1
4G
^
SyntaxError: unexpected EOF while parsing
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/ifs/devel/kevin/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/__init__.py", line 626, in snakemake
success = workflow.execute(
File "/ifs/devel/kevin/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/workflow.py", line 951, in execute
success = scheduler.schedule()
File "/ifs/devel/kevin/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/scheduler.py", line 394, in schedule
run = self.job_selector(needrun)
File "/ifs/devel/kevin/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/scheduler.py", line 540, in job_selector
a = list(map(self.job_weight, jobs)) # resource usage of jobs
File "/ifs/devel/kevin/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/scheduler.py", line 613, in job_weight
res = job.resources
File "/ifs/devel/kevin/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/jobs.py", line 267, in resources
self._resources = self.rule.expand_resources(
File "/ifs/devel/kevin/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/rules.py", line 977, in expand_resources
resources[name] = apply(name, res, threads=threads)
File "/ifs/devel/kevin/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/rules.py", line 960, in apply
raise InputFunctionException(e, rule=self, wildcards=wildcards)
snakemake.exceptions.InputFunctionException: SyntaxError: unexpected EOF while parsing (<string>, line 1)
Wildcards:
sample=B
InputFunctionException in line 20 of /ifs/research-groups/sims/kevin/snakemake-issue-all/workflow/Snakefile:
SyntaxError: unexpected EOF while parsing (<string>, line 1)
Wildcards:
sample=B
unlocking
removing lock
removing lock
removed all locks
Thanks to @JohannesKöster I realised that my profile settings were wrong.
--default-resources [NAME=INT [NAME=INT ...]] indicates that only integer values are supported, while I was providing a string (i.e., mem_free=4G), naively hoping it would be supported as well.
I've updated the following settings in my profile, and successfully ran both snakemake --cores 1 and snakemake --profile drmaa.
default-resources: 'mem_free=4'
drmaa: "-V -notify -p -10 -l mem_free={resources.mem_free}G -pe dedicated {threads} -v MKL_NUM_THREADS={threads} -v OPENBLAS_NUM_THREADS={threads} -v OMP_NUM_THREADS={threads} -R y -q all.q"
Note the integer value 4 set as default resources, and how I moved the G to the drmaa: ... -l mem_free=...G setting.
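For anyone adapting this, per-rule overrides must then stay integer-valued as well; a hypothetical example (rule name and value made up for illustration):

rule heavy_step:
    output:
        "results/heavy.txt"
    resources:
        mem_free=16  # integer only; the profile's drmaa string appends the G unit
    shell:
        "some_memory_hungry_command > {output}"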
Thanks a lot for the help everyone!

Command not found error in snakemake pipeline despite the package existing in the conda environment

I am getting the following error in the snakemake pipeline:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 long_read_assembly
1
[Wed Jan 15 11:35:18 2020]
rule long_read_assembly:
input: long_reads/F19FTSEUHT1027.PSU4_ISF1A_long.fastq.gz
output: canu-outputs/F19FTSEUHT1027.PSU4_ISF1A.subreads.contigs.fasta
jobid: 0
wildcards: sample=F19FTSEUHT1027.PSU4_ISF1A
/usr/bin/bash: canu: command not found
[Wed Jan 15 11:35:18 2020]
Error in rule long_read_assembly:
jobid: 0
output: canu-outputs/F19FTSEUHT1027.PSU4_ISF1A.subreads.contigs.fasta
shell:
canu -p F19FTSEUHT1027.PSU4_ISF1A -d canu-outputs genomeSize=8m -pacbio-raw long_reads/F19FTSEUHT1027.PSU4_ISF1A_long.fastq.gz
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I assume it means that the command canu cannot be found. But the canu package does exist inside the conda environment:
(hybrid_assembly) [lamma#fe1 Assembly]$ conda list | grep canu
canu 1.9 he1b5a44_0 bioconda
The snakefile looks like this:
workdir: config["path_to_files"]

wildcard_constraints:
    separator = config["separator"],
    sample = '|'.join(config["samples"]),

rule all:
    input:
        expand("assembly-stats/{sample}_stats.txt", sample = config["samples"])

rule short_reads_QC:
    input:
        f"short_reads/{{sample}}_short{config['separator']}*.fq.gz"
    output:
        "fastQC-reports/{sample}.html"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        """
        mkdir fastqc-reports
        fastqc -o fastqc-reports {input}
        """

rule quallity_trimming:
    input:
        forward = f"short_reads/{{sample}}_short{config['separator']}1.fq.gz",
        reverse = f"short_reads/{{sample}}_short{config['separator']}2.fq.gz",
    output:
        forward = "cleaned_short-reads/{sample}_short_1-clean.fastq",
        reverse = "cleaned_short-reads/{sample}_short_2-clean.fastq"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "bbduk.sh -Xmx1g in1={input.forward} in2={input.reverse} out1={output.forward} out2={output.reverse} qtrim=rl trimq=10"

rule long_read_assembly:
    input:
        "long_reads/{sample}_long.fastq.gz"
    output:
        "canu-outputs/{sample}.subreads.contigs.fasta"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "canu -p {wildcards.sample} -d canu-outputs genomeSize=8m -pacbio-raw {input}"

rule short_read_alignment:
    input:
        short_read_fwd = "cleaned_short-reads/{sample}_short_1-clean.fastq",
        short_read_rvs = "cleaned_short-reads/{sample}_short_2-clean.fastq",
        reference = "canu-outputs/{sample}.subreads.contigs.fasta"
    output:
        "bwa-output/{sample}_short.bam"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "bwa mem {input.reference} {input.short_read_fwd} {input.short_read_rvs} | samtools view -S -b > {output}"

rule indexing_and_sorting:
    input:
        "bwa-output/{sample}_short.bam"
    output:
        "bwa-output/{sample}_short_sorted.bam"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "samtools sort {input} > {output}"

rule polishing:
    input:
        bam_files = "bwa-output/{sample}_short_sorted.bam",
        long_assembly = "canu-outputs/{sample}.subreads.contigs.fasta"
    output:
        "pilon-output/{sample}-improved.fasta"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "pilon --genome {input.long_assembly} --frags {input.bam_files} --output {output} --outdir pilon-output"

rule assembly_stats:
    input:
        "pilon-output/{sample}-improved.fasta"
    output:
        "assembly-stats/{sample}_stats.txt"
    conda:
        "/home/lamma/env-export/hybrid_assembly.yaml"
    shell:
        "stats.sh in={input} gc=assembly-stats/{wildcards.sample}/{wildcards.sample}_gc.csv gchist=assembly-stats/{wildcards.sample}/{wildcards.sample}_gchist.csv shist=assembly-stats/{wildcards.sample}/{wildcards.sample}_shist.csv > assembly-stats/{wildcards.sample}/{wildcards.sample}_stats.txt"
The rule calling canu has the correct syntax as far as I am aware, so I am not sure what is causing this error.
Edit:
Adding the snakemake command
snakemake --latency-wait 60 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/faststorage/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output} --wait --parsable' --cluster-config bacterial-hybrid-assembly-config.json --configfile yaml-config-files/test_experiment3.yaml --snakefile bacterial-hybrid-assembly.smk
When running a snakemake workflow, if certain rules are to be run within a rule-specific conda environment, the command line call should be of the form
snakemake [... various options ...] --use-conda [--conda-prefix <some-directory>]
If you don't tell snakemake to use conda, all the conda: <some_path> entries in your rules are ignored, and the rules are run in whatever environment is currently activated.
The --conda-prefix <dir> is optional, but tells snakemake where to find the installed environment (if you don't specify this, a conda env will be installed within the .snakemake folder, meaning that the .snakemake folder can get pretty huge and that the .snakemake folders for multiple projects may contain a lot of duplicated conda stuff)
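Applied to the command in the question, that would look something like this (the --conda-prefix directory is just an illustrative choice, and the cluster options are omitted for brevity):

snakemake --use-conda --conda-prefix /home/lamma/conda-envs \
    --latency-wait 60 --rerun-incomplete --keep-going --jobs 99 \
    --cluster-config bacterial-hybrid-assembly-config.json \
    --configfile yaml-config-files/test_experiment3.yaml \
    --snakefile bacterial-hybrid-assembly.smk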

Snakemake how to execute downstream rules when an upstream rule fails

Apologies that the title is bad - I can't figure out how best to explain my issue in a few words. I'm having trouble dealing with downstream rules in snakemake when one of the rules fails. In the example below, rule spades fails on some samples. This is expected because some of my input files will have issues, spades will return an error, and the target file is not generated. This is fine until I get to rule eval_ani. Here I basically want to run this rule on all of the successful output of rule ani. But I'm not sure how to do this because I have effectively dropped some of my samples in rule spades. I think using snakemake checkpoints might be useful but I just can't figure out how to apply it from the documentation.
I'm also wondering if there is a way to re-run rule ani without re-running rule spades. Say I prematurely terminated my run, and rule ani didn't run on all the samples. Now I want to re-run my pipeline, but I don't want snakemake to try to re-run all the failed spades jobs because I already know they won't be useful to me and it would just waste resources. I tried -R and --allowed-rules but neither of these does what I want.
rule spades:
    input:
        read1=config["fastq_dir"]+"combined/{sample}_1_combined.fastq",
        read2=config["fastq_dir"]+"combined/{sample}_2_combined.fastq"
    output:
        contigs=config["spades_dir"]+"{sample}/contigs.fasta",
        scaffolds=config["spades_dir"]+"{sample}/scaffolds.fasta"
    log:
        config["log_dir"]+"spades/{sample}.log"
    threads: 8
    shell:
        """
        python3 {config[path_to_spades]} -1 {input.read1} -2 {input.read2} -t 16 --tmp-dir {config[temp_dir]}spades_test -o {config[spades_dir]}{wildcards.sample} --careful > {log} 2>&1
        """

rule ani:
    input:
        config["spades_dir"]+"{sample}/scaffolds.fasta"
    output:
        "fastANI_out/{sample}.txt"
    log:
        config["log_dir"]+"ani/{sample}.log"
    shell:
        """
        fastANI -q {input} --rl {config[reference_dir]}ref_list.txt -o fastANI_out/{wildcards.sample}.txt
        """

rule eval_ani:
    input:
        expand("fastANI_out/{sample}.txt", sample=samples)
    output:
        "ani_results.txt"
    log:
        config["log_dir"]+"eval_ani/{sample}.log"
    shell:
        """
        python3 ./bin/evaluate_ani.py {input} {output} > {log} 2>&1
        """
If I understand correctly, you want to allow spades to fail without stopping the whole pipeline, and you want to ignore the output files from the spades runs that failed. For this you could append || true to the command running spades to catch the non-zero exit status (so snakemake will not stop). Then you could analyse the output of spades and write to a "flag" file whether that sample succeeded or not. So the rule spades would be something like:
rule spades:
    input:
        read1=config["fastq_dir"]+"combined/{sample}_1_combined.fastq",
        read2=config["fastq_dir"]+"combined/{sample}_2_combined.fastq"
    output:
        contigs=config["spades_dir"]+"{sample}/contigs.fasta",
        scaffolds=config["spades_dir"]+"{sample}/scaffolds.fasta",
        exit= config["spades_dir"]+'{sample}/exit.txt',
    log:
        config["log_dir"]+"spades/{sample}.log"
    threads: 8
    shell:
        """
        python3 {config[path_to_spades]} ... || true
        # ... code that writes to {output.exit} stating whether spades succeeded or not (see the sketch below)
        """
For the following steps, you use the flag files '{sample}/exit.txt' to decide which spades files should be used and which should be discarded. For example:
rule ani:
    input:
        spades= config["spades_dir"]+"{sample}/scaffolds.fasta",
        exit= config["spades_dir"]+'{sample}/exit.txt',
    output:
        "fastANI_out/{sample}.txt"
    log:
        config["log_dir"]+"ani/{sample}.log"
    shell:
        """
        # assumes the flag file contains the word PASS on success
        if grep -q PASS {input.exit}
        then
            fastANI -q {input.spades} --rl {config[reference_dir]}ref_list.txt -o fastANI_out/{wildcards.sample}.txt
        else
            touch {output}
        fi
        """
rule eval_ani:
    input:
        ani= expand("fastANI_out/{sample}.txt", sample=samples),
        exit= expand(config["spades_dir"]+'{sample}/exit.txt', sample= samples),
    output:
        "ani_results.txt"
    log:
        config["log_dir"]+"eval_ani/{sample}.log"
    shell:
        """
        # Parse the list of files in {input.exit} to decide which files in {input.ani} should be used (see the sketch below)
        python3 ./bin/evaluate_ani.py {input} {output} > {log} 2>&1
        """
EDIT (not tested): Instead of || true inside the shell directive, it may be better to use the run directive and use python's subprocess to run the system commands that are allowed to fail. The reason is that || true returns a 0 exit code no matter what error happened; the subprocess solution instead allows more precise handling of exceptions. E.g.:
import subprocess

rule spades:
    input:
        ...
    output:
        ...
    run:
        cmd = "spades ..."
        p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, stderr = p.communicate()
        if p.returncode == 0:
            print('OK')
        else:
            # Analyze the exit code and stderr and decide what to do next
            print(p.returncode)
            print(stderr.decode())
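On Python 3.7+ the same check can be written a little more compactly with subprocess.run (same idea, just a sketch):

import subprocess

# capture_output=True collects stdout/stderr; text=True decodes them to str
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
if result.returncode != 0:
    # analyze the exit code and stderr and decide what to do next
    print(result.returncode, result.stderr)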