How to ensure snakemake rule dependency while submitting via qsub - snakemake

I am using Snakemake to submit jobs to a cluster via qsub. I would like to force a particular rule to run only after all other rules have run, because the input files for this job (an R script) are not ready until then.
I saw on the Snakemake documentation page that one can force rule execution order with flag files: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#flag-files
I have several rules, but for the sake of simplicity I am only showing my Snakefile and the last two rules (rsem_model and tximport_rsem). In my qsub cluster workflow, I want tximport_rsem to execute only after rsem_model has finished. I tried the "touchfile" method, but I cannot get it to work.
# Snakefile
rule all:
    input:
        expand("results/fastqc/{sample}_fastqc.zip", sample=samples),
        expand("results/bbduk/{sample}_trimmed.fastq", sample=samples),
        expand("results/bbduk/{sample}_trimmed_fastqc.zip", sample=samples),
        expand("results/bam/{sample}_Aligned.toTranscriptome.out.bam", sample=samples),
        expand("results/bam/{sample}_ReadsPerGene.out.tab", sample=samples),
        expand("results/quant/{sample}.genes.results", sample=samples),
        expand("results/quant/{sample}_diagnostic.pdf", sample=samples),
        "results/multiqc/project_QS_STAR_RSEM_trial.html",
        "results/rsem_tximport/RSEM_GeneLevel_Summarization.csv",
        "mytask.done"
rule clean:
    shell: "rm -rf .snakemake/"

include: 'rules/fastqc.smk'
include: 'rules/bbduk.smk'
include: 'rules/fastqc_after.smk'
include: 'rules/star_align.smk'
include: 'rules/rsem_norm.smk'
include: 'rules/rsem_model.smk'
include: 'rules/tximport_rsem.smk'
include: 'rules/multiqc.smk'
rule rsem_model:
    input:
        'results/quant/{sample}.genes.results'
    output:
        'results/quant/{sample}_diagnostic.pdf'
    params:
        plotmodel = config['rsem_plot_model'],
        prefix = 'results/quant/{sample}',
        touchfile = 'mytask.done'
    threads: 16
    priority: 60
    shell:
        """
        touch {params.touchfile}
        {params.plotmodel} {params.prefix} {output}
        """

rule tximport_rsem:
    input: 'mytask.done'
    output:
        'results/rsem_tximport/RSEM_GeneLevel_Summarization.csv'
    priority: 50
    shell: "Rscript scripts/RSEM_tximport.R"
Here is the error I get when I try to do a dry-run
snakemake -np
Building DAG of jobs...
MissingInputException in line 1 of /home/yh6314/rsem/tutorials/QS_Snakemake/rules/tximport_rsem.smk:
Missing input files for rule tximport_rsem:
mytask.done
One important thing to note: if I run this on the head node instead, I do not need the touch file at all and everything works fine.
I would appreciate suggestions and help figuring out a workaround.
Thanks in advance.

Rule tximport_rsem will be executed only after all jobs from rule rsem_model have completed (based on the comments). Hence, the intermediate file mytask.done is unnecessary here; using the output files of rule rsem_model for all samples as the input of rule tximport_rsem will suffice:
rule rsem_model:
    input:
        'results/quant/{sample}.genes.results'
    output:
        pdf='results/quant/{sample}_diagnostic.pdf'
    params:
        plotmodel = config['rsem_plot_model'],
        prefix = 'results/quant/{sample}'
    shell:
        """
        {params.plotmodel} {params.prefix} {output.pdf}
        """
rule tximport_rsem:
    input:
        expand('results/quant/{sample}_diagnostic.pdf', sample=samples)
    output:
        'results/rsem_tximport/RSEM_GeneLevel_Summarization.csv'
    shell:
        "Rscript scripts/RSEM_tximport.R"

Related

MissingOutputException snakemake

I am getting a MissingOutputException from my Snakemake workflow: Snakemake creates the required output in the desired directory, but keeps looking for it and exits.
This is my Snakefile:
rule all:
    input:
        expand('/home/stud9/NAS/results/qc_reports/fastqc/trimmed_{sample}_1_fastqc.html', sample=SAMPLES),
        expand('/home/stud9/NAS/results/qc_reports/fastqc/trimmed_{sample}_2_fastqc.html', sample=SAMPLES),
        expand('home/stud9/NAS/results/non_aligned/{sample}_nm2cov.bam', sample=SAMPLES)

rule nm2cov:
    input:
        '/home/stud9/NAS/results/aligned/to_cov/{sample}_cov.sorted.bam'
    output:
        'home/stud9/NAS/results/non_aligned/{sample}_nm2cov.bam'
    shell:
        "cd /home/stud9/NAS/results/non_aligned && samtools view -b -f 4 {input} > {wildcards.sample}_nm2cov.bam"
I used cd before the actual command because I want my results there; otherwise they would end up in the Snakefile directory.
This is the message I am getting:
Waiting at most 10 seconds for missing files.
MissingOutputException in rule nm2cov in line 50 of /home/stud9/NAS/scripts/wf_1:
Job 55 completed successfully, but some output files are missing.
Missing files after 10 seconds:
home/stud9/NAS/results/non_aligned/148_nm2cov.bam
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Sorry if my post is a little messy; this is the first time I have posted here.
I tried changing --latency-wait to 15, but that made no difference.
A MissingOutputException is often caused by a typo or a wrong path in the output files.
In your case a preceding / seems to be missing, causing Snakemake to treat your output path as relative rather than absolute.
Try this:
rule nm2cov:
    input:
        '/home/stud9/NAS/results/aligned/to_cov/{sample}_cov.sorted.bam'
    output:
        '/home/stud9/NAS/results/non_aligned/{sample}_nm2cov.bam'
    shell:
        "cd /home/stud9/NAS/results/non_aligned && samtools view -b -f 4 {input} > {wildcards.sample}_nm2cov.bam"
NB: it is generally recommended to use relative rather than absolute paths in your Snakefile, to keep the workflow reproducible.
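For illustration, a relative-path version of the same rule could look like the sketch below (assuming Snakemake is run from /home/stud9/NAS; writing directly to {output} also removes the need for the cd):

rule nm2cov:
    input:
        'results/aligned/to_cov/{sample}_cov.sorted.bam'
    output:
        'results/non_aligned/{sample}_nm2cov.bam'
    shell:
        # redirecting straight to {output} avoids changing directories
        "samtools view -b -f 4 {input} > {output}"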

Snakemake: catch output file whose name cannot be changed

As part of a Snakemake pipeline that I'm building, I have to use a program that does not allow me to specify the file path or name of an output file.
E.g. when running the program in the working directory workdir/ it produces the following output:
workdir/output.txt
My snakemake rule looks something like this:
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    shell: "somecommand {input} {output}"
So every time the rule NAME runs, I get an additional file output.txt in the snakemake working directory, which is then overwritten if the rule NAME runs multiple times or in parallel.
I'm aware of shadow rules, and adding shadow: "full" allows me to simply ignore the output.txt file. However, I'd like to keep output.txt and save it in the same directory as the outputfile. Is there a way of achieving this, either with the shadow directive or otherwise?
I was also thinking I could prepend somecommand with a cd command, but then I'd probably run into other issues downstream when linking up other rules to the outputs of the rule NAME.
How about simply moving it directly afterwards in the shell part (provided somecommand completes successfully)?
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    params:
        output_dir = "path/to/output_dir",
    shell: "somecommand {input} {output} && mv output.txt {params.output_dir}/output.txt"
EDIT: for multiple executions of NAME in parallel, combining with shadow: "full" could work:
rule NAME:
    input: "path/to/inputfile"
    output:
        output_file = "path/to/outputfile",
        output_txt = "path/to/output_dir/output.txt"
    shadow: "full"
    shell: "somecommand {input} {output.output_file} && mv output.txt {output.output_txt}"
That should run each execution of the rule in its own temporary dir, and by specifying the moved output.txt as an output Snakemake should move it to the real output dir once the rule is done running.
I was also thinking I could prepend somecommand with a cd command, but then I'd probably run into other issues downstream when linking up other rules to the outputs of the rule NAME.
I think you are on the right track here. Each shell block runs in a separate process with the working directory inherited from the Snakemake process (which can be set with the --directory command-line argument). Accordingly, a cd in one shell block does not affect other jobs from the same rule, nor any downstream or upstream jobs.
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    shell:
        """
        input_file=$(realpath "{input}")  # get the absolute path, before the `cd`
        base_dir=$(dirname "{output}")
        cd "$base_dir"
        somecommand ...
        """

Snakemake cannot handle very long command line?

This is a very strange problem.
When the {input} in my rule is a list of fewer than 200 files, Snakemake works fine.
But when {input} has more than 500 files, Snakemake quits with the message "one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!". The complete log provides no further error message.
For the log, please see: https://github.com/snakemake/snakemake/files/5285271/2020-09-25T151835.613199.snakemake.log
The rule that worked is (NOTE the input is capped to 200 files):
rule combine_fastq:
    input:
        lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')[:200]
    output:
        "combined.fastq/{sample}.fastq.gz"
    group: "minion_assemble"
    shell:
        """
        echo {input} > {output}
        """
The rule that failed is:
rule combine_fastq:
    input:
        lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')
    output:
        "combined.fastq/{sample}.fastq.gz"
    group: "minion_assemble"
    shell:
        """
        echo {input} > {output}
        """
My question is also posted in GitHub: https://github.com/snakemake/snakemake/issues/643.
I second Maarten's answer: with that many files you are running up against the shell's maximum command-line length; snakemake is just doing a poor job of helping you identify the problem.
Based on the issue you reference, it seems you are using cat to combine all of your files. Maybe following the answer here would help:
rule combine_fastq_list:
    input:
        lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')
    output:
        temp("{sample}.tmp.list")
    group: "minion_assemble"
    run:
        # write the input paths to a file, one per line, instead of
        # passing them all on the command line
        with open(output[0], 'w') as out:
            out.write('\n'.join(input))

rule combine_fastq:
    input:
        "{sample}.tmp.list"
    output:
        'combined.fastq/{sample}.fastq.gz'
    group: "minion_assemble"
    shell:
        'cat {input} | '    # this reads the list of files from the list file
        'xargs zcat -f | '
        '...'
Hope it gets you on the right track.
EDIT: the first option hands the file list to xargs, which executes your command on batches of input files. A different option that executes the command once for the whole list of inputs is:
rule combine_fastq:
    ...
    shell:
        """
        command $(< {input}) ...
        """
For those landing here with similar questions (like Snakemake expand function alternative), snakemake 6 can handle long command lines. The following test fails on snakemake < 6 but succeeds on 6.0.0 on my Ubuntu machine:
rule all:
    input:
        'output.txt',

rule one:
    output:
        'output.txt',
    params:
        x = list(range(0, 1000000))
    shell:
        r"""
        echo {params.x} > {output}
        """

Using directory as input in snakemake for particulars scripts

I'm sorry if my question seems a bit dumb.
I'm currently writing my first Snakemake workflow (as a trainee). I have to automate a couple of steps, all of which depend on Python scripts that already exist.
My trouble is that the inputs and outputs of those scripts are folders themselves (with their contents derived from the contents of the first directory).
So far I have this (which is not working, as one might expect):
configfile: "config.yaml"

rule all:
    input:
        "{dirname}/directory_results/sub_dir2", dirname=config["dirname"]

rule script1:
    input:
        "{dirname}/reference/{files}.gbff", dirname=config["dirname"]
    output:
        "{dirname}/directory_results", dirname=config["dirname"]
    shell:
        "python script_1.py -i {dirname}/reference -o {output}"

rule script2:
    input:
        "{dirname}/directory_results/sub_dir1/{files}.gbff.gff", dirname=config["dirname"]
    output:
        "{dirname}/directory_results/sub_dir2", dirname=config["dirname"]
    shell:
        "python script_2.py -i {dirname}/directory_results/sub_dir1"
As for config.yaml, it's a simple file that I'm using for now to hold the path in question:

dirname:
    Sero_1: /project/work/test_snake/Sero_1
I know there is much to refactor (I'm still not accustomed to Snakemake; besides the tutorial, this is my first workflow ever). I also understand that the problem probably lies in the fact that inputs can't simply be directories. I have been trying things for a couple of days, so I thought I would ask for advice since I'm struggling.
How can I declare inputs and outputs so that the scripts can work on directories?
In case it helps anyone, I solved my rule "script1" by doing:
configfile: "config.yaml"
dirname = config["dirname"]

rule all:
    input:
        expand("{dirname}/directory_results", dirname=dirname),
        expand("{dirname}/directory_results/sub_dir2", dirname=dirname)

rule script1:
    input:
        expand("{dirname}/reference/", dirname=dirname)
    output:
        directory(expand("{dirname}/directory_results", dirname=dirname))
    shell:
        "python script_1.py -i {input} -o {output}"

rule script2:
    input:
        rules.script1.output
    output:
        directory(expand("{dirname}/directory_results/sub_dir2", dirname=dirname))
    shell:
        "python script_2.py -i {input}"
As for the config.yaml file:

dirname:
    - /project/work/test_snake/Sero_1
    - /project/work/test_snake/Sero_2
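One small simplification, in case it is useful: because rule script2 already consumes rules.script1.output, rule all only needs to request the terminal target; Snakemake will schedule script1 automatically. A sketch using the same names as above:

rule all:
    input:
        # requesting only the final directory; script1 runs because script2 needs it
        expand("{dirname}/directory_results/sub_dir2", dirname=dirname)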

Snakemake how to execute downstream rules when an upstream rule fails

Apologies that the title is bad - I can't figure out how best to explain my issue in a few words. I'm having trouble dealing with downstream rules in snakemake when one of the rules fails. In the example below, rule spades fails on some samples. This is expected because some of my input files will have issues, spades will return an error, and the target file is not generated. This is fine until I get to rule eval_ani. Here I basically want to run this rule on all of the successful output of rule ani. But I'm not sure how to do this because I have effectively dropped some of my samples in rule spades. I think using snakemake checkpoints might be useful but I just can't figure out how to apply it from the documentation.
I'm also wondering if there is a way to re-run rule ani without re-running rule spades. Say I prematurely terminated my run, and rule ani didn't run on all the samples. Now I want to re-run my pipeline, but I don't want snakemake to try to re-run all the failed spades jobs because I already know they won't be useful to me and it would just waste resources. I tried -R and --allowed-rules but neither of these does what I want.
rule spades:
    input:
        read1=config["fastq_dir"]+"combined/{sample}_1_combined.fastq",
        read2=config["fastq_dir"]+"combined/{sample}_2_combined.fastq"
    output:
        contigs=config["spades_dir"]+"{sample}/contigs.fasta",
        scaffolds=config["spades_dir"]+"{sample}/scaffolds.fasta"
    log:
        config["log_dir"]+"spades/{sample}.log"
    threads: 8
    shell:
        """
        python3 {config[path_to_spades]} -1 {input.read1} -2 {input.read2} -t 16 --tmp-dir {config[temp_dir]}spades_test -o {config[spades_dir]}{wildcards.sample} --careful > {log} 2>&1
        """
rule ani:
    input:
        config["spades_dir"]+"{sample}/scaffolds.fasta"
    output:
        "fastANI_out/{sample}.txt"
    log:
        config["log_dir"]+"ani/{sample}.log"
    shell:
        """
        fastANI -q {input} --rl {config[reference_dir]}ref_list.txt -o fastANI_out/{wildcards.sample}.txt
        """
rule eval_ani:
    input:
        expand("fastANI_out/{sample}.txt", sample=samples)
    output:
        "ani_results.txt"
    log:
        config["log_dir"]+"eval_ani/{sample}.log"
    shell:
        """
        python3 ./bin/evaluate_ani.py {input} {output} > {log} 2>&1
        """
If I understand correctly, you want to allow spades to fail without stopping the whole pipeline, and you want to ignore the output files of the spades jobs that failed. For this you could append || true to the command running spades to catch the non-zero exit status (so Snakemake will not stop). Then you could analyse the output of spades and record in a "flag" file whether that sample succeeded or not. The rule spades would be something like:
rule spades:
    input:
        read1=config["fastq_dir"]+"combined/{sample}_1_combined.fastq",
        read2=config["fastq_dir"]+"combined/{sample}_2_combined.fastq"
    output:
        contigs=config["spades_dir"]+"{sample}/contigs.fasta",
        scaffolds=config["spades_dir"]+"{sample}/scaffolds.fasta",
        exit=config["spades_dir"]+'{sample}/exit.txt',
    log:
        config["log_dir"]+"spades/{sample}.log"
    threads: 8
    shell:
        """
        python3 {config[path_to_spades]} ... || true
        # ... code that writes to {output.exit} stating whether spades succeeded or not
        """
For the following steps, you then use the flag files '{sample}/exit.txt' to decide which spades files should be used and which should be discarded. For example:
rule ani:
    input:
        spades=config["spades_dir"]+"{sample}/scaffolds.fasta",
        exit=config["spades_dir"]+'{sample}/exit.txt',
    output:
        "fastANI_out/{sample}.txt"
    log:
        config["log_dir"]+"ani/{sample}.log"
    shell:
        """
        if grep -q PASS {input.exit}; then
            fastANI -q {input.spades} --rl {config[reference_dir]}ref_list.txt -o {output}
        else
            touch {output}
        fi
        """
rule eval_ani:
    input:
        ani=expand("fastANI_out/{sample}.txt", sample=samples),
        exit=expand(config["spades_dir"]+'{sample}/exit.txt', sample=samples),
    output:
        "ani_results.txt"
    log:
        config["log_dir"]+"eval_ani.log"
    shell:
        """
        # Parse the list of files in {input.exit} to decide which files in {input.ani} should be used
        python3 ./bin/evaluate_ani.py {input} {output} > {log} 2>&1
        """
EDIT (not tested): instead of || true inside the shell directive, it may be better to use the run directive and Python's subprocess module to run the commands that are allowed to fail. The reason is that || true returns a 0 exit code no matter what error happened; the subprocess solution allows more precise handling of exceptions. E.g.:
import subprocess

rule spades:
    input:
        ...
    output:
        ...
    run:
        cmd = "spades ..."
        p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, stderr = p.communicate()
        if p.returncode == 0:
            print('OK')
        else:
            # Analyze the exit code and stderr and decide what to do next
            print(p.returncode)
            print(stderr.decode())