Snakemake is unable to match wildcard although it's defined and even suggested

I am still very confused about the wildcards concept despite reading the full docs and a few examples, so maybe someone can shed light on this weird behaviour. It might be a bug but it's such a basic example that I am pretty sure I am doing or understanding something wrong.
Here is my Snakefile, which should generate a bunch of files whose locations are defined in a dictionary (those can be served by all kinds of data providers like iRODS, XRootD etc., but it's not important now).
import os

some_files = {
    "foo": "some_location/foo",
    "bar": "another_location/bar",
    "baz": "yet_another_loc/baz"
}

rule all:
    input: ["raw/" + os.path.basename(f) for f in some_files.keys()]

rule generate_files:
    output:
        temp("raw/{fname}")
    shell:
        "echo grabbed file from {some_files[wildcards.fname]} > {output}"
As you can see, I need to use a "trick" similar to the one proposed in my previous question (Array of values as input in Snakemake workflows): forcing recognition of the files by adding a rule and listing them in rule all, which works nicely.
The rule generate_files should then generate (retrieve) those by using the corresponding URL and protocol defined in some_files. For the sake of simplicity, it's now just echoing the origin into the output file.
To achieve this, I thought I could simply use wildcards.fname in the shell section, but when I run the workflow, I get:
░ tamasgal#silentbox-(2):PhD/snakemake  master ●●● snakemake took 16s
░ 08:47:35 > snakemake -c1
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job               count    min threads    max threads
--------------  -------  -------------  -------------
all                   1              1              1
generate_files        3              1              1
total                 4              1              1
Select jobs to execute...
[Fri Feb 18 08:47:38 2022]
rule generate_files:
output: raw/bar
jobid: 2
wildcards: fname=bar
resources: tmpdir=/var/folders/84/mcvklq757tq1nfrkbxvvbq8m0000gn/T
RuleException in line 12 of /Users/tamasgal/Dev/PhD/snakemake/Snakefile:
NameError: The name 'wildcards.fname' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
If I use fname (and not wildcards.fname), Snakemake proposes to use wildcards.fname, which again, does not work. Here is the output when running with fname in output:
[Fri Feb 18 08:47:48 2022]
rule generate_files:
output: raw/bar
jobid: 2
wildcards: fname=bar
resources: tmpdir=/var/folders/84/mcvklq757tq1nfrkbxvvbq8m0000gn/T
RuleException in line 12 of /Users/tamasgal/Dev/PhD/snakemake/Snakefile:
NameError: The name 'fname' is unknown in this context. Did you mean 'wildcards.fname'?
Why is this happening? The output of the workflow clearly shows that wildcards: fname=bar, so it exists and is defined. Is this a bug?

Hm, you may have to try and get at some_files[wildcards.fname] outside of the shell part. It looks to me like Snakemake can tell what the wildcard is supposed to be for the output to be raw/bar, but it can't handle using it to access the dict inside the shell part. It seems like this could be handled with an input function.
Off the top of my head:
rule generate_files:
    input:
        some_file = lambda wildcards: some_files[wildcards.fname]
    output:
        temp("raw/{fname}")
    shell:
        "echo grabbed file from {input.some_file} > {output}"
EDIT: if it fails because the file isn't local and Snakemake therefore can't find it, you may supply the path to it as a parameter instead:
rule generate_files:
    params:
        some_file = lambda wildcards: some_files[wildcards.fname]
    output:
        temp("raw/{fname}")
    shell:
        "echo grabbed file from {params.some_file} > {output}"

Related

Snakemake checkpoint output unknown number of files with no subsequent aggregation but instead rules that perform actions on individual files?

Thanks for any help ahead of time.
I'm trying to use the Snakemake checkpoint functionality to produce an unknown number of files in a directory. I've gotten that to work using the pattern described in the docs, but then I don't want any kind of aggregation rule afterwards; I want rules that perform actions on each individual file (inherently in parallel via wildcards).
Here's a simple reproducible example of my problem:
from os.path import join

rule all:
    input:
        "aggregated.txt",

checkpoint create_gzip_file:
    output:
        directory("my_directory/"),
    shell:
        """
        mkdir my_directory/
        cd my_directory
        for i in 1 2 3; do gzip < /dev/null > $i.txt.gz; done
        """

rule gunzip_file:
    input:
        join("my_directory", "{i}.txt.gz"),
    output:
        join("my_directory", "{i}.txt"),
    shell:
        """
        gunzip -c {input} > {output}
        """

def gather_gunzip_input(wildcards):
    out_dir = checkpoints.create_gzip_file.get(**wildcards).output[0]
    i = glob_wildcards(join(out_dir, "{i}.txt.gz"))
    return expand(f"{out_dir}/{{i}}", i=i)

rule aggregate:
    input:
        gather_gunzip_input,
    output:
        "aggregated.txt",
    shell:
        "cat {input} > {output}"
I'm getting the following error:
$ snakemake --printshellcmds --cores all
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job stats:
job                 count    min threads    max threads
----------------  -------  -------------  -------------
aggregate               1              1              1
all                     1              1              1
create_gzip_file        1              1              1
total                   3              1              1
Select jobs to execute...
[Wed Jul 13 14:57:09 2022]
checkpoint create_gzip_file:
output: my_directory
jobid: 2
reason: Missing output files: my_directory
resources: tmpdir=/tmp
Downstream jobs will be updated after completion.
mkdir my_directory/
cd my_directory
for i in 1 2 3; do gzip < /dev/null > $i.txt.gz; done
[Wed Jul 13 14:57:09 2022]
Finished job 2.
1 of 3 steps (33%) done
MissingInputException in line 20 of /home/hermidalc/projects/github/hermidalc/test/Snakefile:
Missing input files for rule gunzip_file:
output: my_directory/['1', '2', '3'].txt
wildcards: i=['1', '2', '3']
affected files:
my_directory/['1', '2', '3'].txt.gz
I had a syntax issue (which wasn't triggering any syntax check or compiler issues) that was causing the seemingly unrelated MissingInputException. The glob_wildcards line:
i = glob_wildcards(join(out_dir, "{i}.txt.gz"))
needs to be written with a trailing comma (glob_wildcards returns a Wildcards object of lists, so the single list has to be unpacked):
i, = glob_wildcards(join(out_dir, "{i}.txt.gz"))
or
i = glob_wildcards(join(out_dir, "{i}.txt.gz")).i
Also, to answer the other part of the question: I believe if you don't want an aggregation-type rule (one that uses the gather function as its input), you need to put that function as the input to your rule all instead; see the sketch below. As shown in this question, you can still have downstream rules of your checkpoint that do not aggregate but perform actions on the individual unknown files; you just have to use the wildcards created in your gather function and write the expand so that it produces the file structure of the output of the last rule acting on the checkpoint's files.
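A minimal sketch of what that could look like here (my own addition, building on the fixed unpacking above), driving the per-file gunzip_file jobs directly from rule all without any aggregation rule:
def gather_gunzip_output(wildcards):
    # hypothetical helper: the individual .txt targets produced by rule gunzip_file
    out_dir = checkpoints.create_gzip_file.get(**wildcards).output[0]
    i, = glob_wildcards(join(out_dir, "{i}.txt.gz"))
    return expand(join(out_dir, "{i}.txt"), i=i)

rule all:
    input:
        gather_gunzip_output,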

Problems with the VEP snakemake wrapper

I'm experiencing two issues trying to run the VEP wrapper for snakemake.
The first is that I would like to use lambda wildcards in calls like so:
calling_dir = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"])
callings_locations = [calling_dir] * len_samples
callings_dict = dict(zip(sample_names, callings_locations))

def getVCFs(sample):
    return(list(os.path.join(callings_dict[sample],"{0}_sorted_dedupped_snp_varscan.vcf".format(sample,pair)) for pair in ['']))

rule variant_annotation:
    input:
        calls= lambda wildcards: getVCFs(wildcards.sample),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls="variants.annotated.vcf",
        stats="variants.html"
    params:
        plugins=["LoFtool"],
        extra="--everything"
    message: """--- Annotating Variants."""
    resources:
        mem = 30000,
        time = 120
    threads: 4
    wrapper:
        "0.64.0/bio/vep/annotate"
However, I get an error:
When I replace the lambda wildcards with calls=expand('{CALLING_DIR}/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf', CALLING_DIR=dirs_dict["CALLING_DIR"], CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names) (which is not ideal - see this post for the reason), it gives me errors about the resources folder:
(snakemake) [moldach#cedar1 MTG353]$ snakemake -n -r
Building DAG of jobs...
MissingInputException in line 333 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/MTG353/Snakefile:
Missing input files for rule variant_annotation:
resources/vep/cache
resources/vep/plugins
I'm also confused from the documentation as to how it knows which reference genome (version, etc.) should be specified.
UPDATE:
Because of the character limit I cannot even respond to the two respondents so I will continue the issue here:
As @jafors mentioned, the two wrappers solved the issue for cache and plugins - thanks!
Now I get an error from trying to run VEP though from the following rule:
rule variant_annotation:
    input:
        calls= expand('{CALLING_DIR}/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf', CALLING_DIR=dirs_dict["CALLING_DIR"], CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls=expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}.annotated.vcf', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names),
        stats=expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}.html', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names)
    params:
        plugins=["LoFtool"],
        extra="--everything"
    message: """--- Annotating Variants."""
    resources:
        mem = 30000,
        time = 120
    threads: 4
    wrapper:
        "0.64.0/bio/vep/annotate"
this is the error I get from the log:
Building DAG of jobs...
Using shell: /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 variant_annotation
1
[Wed Aug 12 20:22:49 2020]
Job 0: --- Annotating Variants.
Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Traceback (most recent call last):
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpwx1u_776.wrapper.py", line 36, in <module>
if snakemake.output.calls.endswith(".vcf.gz"):
AttributeError: 'Namedlist' object has no attribute 'endswith'
[Wed Aug 12 20:22:53 2020]
Error in rule variant_annotation:
jobid: 0
output: ANNOTATION/VEP/BC1217.annotated.vcf, ANNOTATION/VEP/470.annotated.vcf, ANNOTATION/VEP/MTG109.annotated.vcf, ANNOTATION/VEP/BC1217.html, ANNOTATION/VEP/470.html, ANNOTATION/VEP/MTG$
conda-env: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
RuleException:
CalledProcessError in line 393 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile:
Command 'source /home/moldach/miniconda3/bin/activate '/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f'; set -euo pipefail; python /scratch/moldach/MADDOG/VCF-FILE$
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile", line 393, in __rule_variant_annotation
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.8.0/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
TO BE CLEAR:
This is the code I had running VEP prior to trying out the wrapper so I would like to preserve similar options (e.g. offline, etc.):
vep \
-i {input.sample} \
--species "caenorhabditis_elegans" \
--format "vcf" \
--everything \
--cache_version 100 \
--offline \
--force_overwrite \
--fasta {input.ref} \
--gff {input.annot} \
--tab \
--variant_class \
--regulatory \
--show_ref_allele \
--numbers \
--symbol \
--protein \
-o {params.sample}
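Presumably most of these flags could simply go into the wrapper's extra parameter (my assumption, based on the extra argument already shown in the rule above), something like:
params:
    plugins=["LoFtool"],
    # assumption: extra is appended to the vep call by the wrapper; flags the wrapper
    # already manages itself (input/output, cache/offline handling) might conflict
    extra="--everything --force_overwrite --variant_class --regulatory "
          "--show_ref_allele --numbers --symbol --protein"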
UPDATE 2:
Yes, the use of expand() was the issue. I remember now that this is why I like to use lambda or os.path.join() for rule inputs/outputs, except in rule all as you mentioned.
The following seems to get rid of that problem although I'm met with a new one:
rule variant_annotation:
    input:
        calls= lambda wildcards: getVCFs(wildcards.sample),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.annotated.vcf"),
        stats=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.html")
Not sure why I get the unknown file type error - as I mentioned, this was first tested with the full command on the same input data.
Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Failed to open VARIANT_CALLING/varscan/MTG109_sorted_dedupped_snp_varscan.vcf: unknown file type
Possible precedence issue with control flow operator at /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Traceback (most recent call last):
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpsh388k23.wrapper.py", line 44, in <module>
"(bcftools view {snakemake.input.calls} | "
File "/home/moldach/bin/snakemake/lib/python3.8/site-packages/snakemake/shell.py", line 156, in __new__
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail; (bcftools view VARIANT_CALLING/varscan/MTG109_sorted_dedupped_snp_varscan.vcf | vep --everything --fork 4 --format vcf --vcf --cach$
[Thu Aug 13 09:02:22 2020]
Update 3:
bcftools view is giving the warning from the output of samtools mpileup/varscan pileup2snp:
def getDeduppedBamsIndex(sample):
    return(list(os.path.join(aligns_dict[sample],"{0}.sorted.dedupped.bam.bai".format(sample,pair)) for pair in ['']))

rule mpilup:
    input:
        bam=lambda wildcards: getDeduppedBams(wildcards.sample),
        reference_genome=os.path.join(dirs_dict["REF_DIR"],config["REF_GENOME"])
    output:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_{contig}_samtools_mpileup.log")
    params:
        extra=lambda wc: "-r {}".format(wc.contig)
    resources:
        mem = 1000,
        time = 30
    wrapper:
        "0.65.0/bio/samtools/mpileup"

rule mpileup_to_vcf:
    input:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
    output:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.vcf")
    message:
        "Calling SNP with Varscan2"
    threads:
        2  # Keep threading value to one for unzipped mpileup input
           # Set it to two for zipped mpileup files
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"varscan_{sample}_{contig}.log")
    resources:
        mem = 1000,
        time = 30
    wrapper:
        "0.65.0/bio/varscan/mpileup2snp"

rule vcf_merge:
    input:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_I.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_II.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_III.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_IV.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_V.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_X.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_MtDNA.vcf")
    output:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}.vcf")
    log: os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_vcf-merge.log")
    resources:
        mem = 1000,
        time = 10
    threads: 1
    message: """--- Merge VarScan by Chromosome."""
    shell: """
        awk 'FNR==1 && NR!=1 {{ while (/^<header>/) getline; }} 1 {{print}} ' {input} > {output}
        """

calling_dir = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"])
callings_locations = [calling_dir] * len_samples
callings_dict = dict(zip(sample_names, callings_locations))

def getVCFs(sample):
    return(list(os.path.join(callings_dict[sample],"{0}.vcf".format(sample,pair)) for pair in ['']))

rule annotate_variants:
    input:
        calls=lambda wildcards: getVCFs(wildcards.sample),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls="{sample}.annotated.vcf",
        stats="{sample}.html"
    params:
        # Pass a list of plugins to use, see https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html
        # Plugin args can be added as well, e.g. via an entry "MyPlugin,1,FOO", see docs.
        plugins=["LoFtool"],
        extra="--everything"  # optional: extra arguments
    log:
        "logs/vep/{sample}.log"
    threads: 4
    resources:
        time=30,
        mem=5000
    wrapper:
        "0.65.0/bio/vep/annotate"
If I run bcftools view on the output I get the error:
$ bcftools view variant_calling/varscan/MTG324.vcf
Failed to read from variant_calling/varscan/MTG324.vcf: unknown file type
About using expand vs. a wildcard: it does not matter at all. The biostars post is just advice on how to keep things readable; on the snakemake/programmatic side it should not matter how you define your input, as long as it is correct.
The complaint about resources is that in the input of rule variant_annotation you define resources/vep/cache and resources/vep/plugins as necessary inputs for running variant_annotation. With this error snakemake is effectively telling you that those files do not exist, so it cannot run the rule for you.
When I look at the code in the docs it seems like the cache directory as input should define which genome you use:
entrypath = get_only_child_dir(get_only_child_dir(Path(cache)))
species = entrypath.parent.name
release, build = entrypath.name.split("_")
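So the wrapper apparently expects a cache layout of cache/<species>/<release>_<build>. Illustratively, for the species and cache version from the manual vep command above (the build name here is only an example I am assuming, not something from the thread):
resources/vep/cache/
└── caenorhabditis_elegans/
    └── 100_WBcel235/    # -> species=caenorhabditis_elegans, release=100, build=WBcel235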
Additionally to what Maarten said (resources/vep/cache and resources/vep/plugins are just example paths to the required input, which also define which genome and version you want to use), you can get the cache and plugin directories easily with two other simple rules in your Snakefile using these wrappers:
https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/vep/cache.html
https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/vep/plugins.html
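Roughly something like this (a sketch following the wrapper docs; the wrapper version, species, build and release are assumptions you would adapt to your data):
rule get_vep_cache:
    output:
        directory("resources/vep/cache")
    params:
        species="caenorhabditis_elegans",  # assumption: matches the species in your vep command
        build="WBcel235",                  # assumption: example assembly name
        release="100"
    wrapper:
        "0.64.0/bio/vep/cache"

rule get_vep_plugins:
    output:
        directory("resources/vep/plugins")
    params:
        release=100
    wrapper:
        "0.64.0/bio/vep/plugins"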
EDIT
Glad this worked out for your first problem.
The second error seems to arise from the expand in the output.
Am I understanding correctly that you want to annotate all your vcfs one-by-one? So input is {sample}.vcf and output would be {sample}.annotated.vcf?
If that's the case, you probably don't want to use expand in this rule.
I am also not sure, why you would need the {ANNOT_DIR} and {ANNOT_TOOL} to be wildcards here. I guess if you are using VEP, the ANNOT_TOOL would always be VEP and the ANNOT_DIR will be ANNOTATION?
Then, you could write them directly in the output as ANNOTATION/VEP/{sample}.annotated.vcf.
Same for the {CALLING_DIR}, I guess this will always be the same directory, right? I get that the {CALLING_TOOL} might have more than one value if you used multiple callers on the samples.
If I am still on track, you have two wildcards you could want to expand on when using VEP, the {sample} and the {CALLING_TOOL}.
Just write:
input:
    calls='CALLDIR/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf',
    cache="resources/vep/cache",
    plugins="resources/vep/plugins"
output:
    calls='ANNOTATION/VEP/{CALLING_TOOL}/{sample}.annotated.vcf',
    stats='ANNOTATION/VEP/{CALLING_TOOL}/{sample}.html'
The expand belongs in your rule all or any other target rule that uses all annotated vcfs at once, something like this:
rule all:
    input: expand('ANNOTATION/VEP/{CALLING_TOOL}/{sample}.annotated.vcf', CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names)
Then, the variant_annotation rule will run all the samples you expand on in rule all.
I hope I got your idea correctly and this helps.
EDIT2
Ok, seems like we are nearly done. The error you get is thrown by bcftools view - it indicates that something might be wrong with the vcf.
Did you try bcftools view with your vcf outside of the Snakefile? This would give us an idea if the problem arises during this rule or if the vcf is already somehow problematic.

Workflow always results in "Nothing to do" even when forcing rules

So, as the title says, I can't bring my workflow to execute anything except the all rule...
When executing the all rule it correctly finds all the input files, so the configfile is okay and every path is correct.
When trying to run without additional flags I get:
Building DAG of jobs...
Checking status of 0 jobs.
Nothing to be done
Things I tried:
-f rcorrector -> only all rule
filenameR1.fcor_val1.fq -> MissingRuleException (no typos)
--forceall -> only all rule
Some more fiddling I can't formulate clearly.
Please help.
from os import path

configfile: "config.yaml"

RNA_DIR = config["RAW_RNA_DIR"]
RESULT_DIR = config["OUTPUT_DIR"]
FILES = glob_wildcards(path.join(RNA_DIR, '{sample}R1.fastq.gz')).sample

############################################################################
rule all:
    input:
        r1=expand(path.join(RNA_DIR, '{sample}R1.fastq.gz'), sample=FILES),
        r2=expand(path.join(RNA_DIR, '{sample}R2.fastq.gz'), sample=FILES)

#############################################################################
rule rcorrector:
    input:
        r1=path.join(RNA_DIR, '{sample}R1.fastq.gz'),
        r2=path.join(RNA_DIR, '{sample}R2.fastq.gz')
    output:
        o1=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R1.cor.fq'),
        o2=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R2.cor.fq')
    #group: "cleaning"
    threads: 8
    params: "-t {threads}"
    envmodules:
        "bio/Rcorrector/1.0.4-foss-2019a"
    script:
        "scripts/Rcorrector.py"

############################################################################
rule FilterUncorrectabledPEfastq:
    input:
        r1=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R1.cor.fq'),
        r2=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R2.cor.fq')
    output:
        o1=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R1.fcor.fq"),
        o2=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R2.fcor.fq")
    #group: "cleaning"
    envmodules:
        "bio/Jellyfish/2.2.6-foss-2017a",
        "lang/Python/2.7.13-foss-2017a"
    #TODO: load as module
    script:
        "/scripts/filterUncorrectable.py"

#############################################################################
rule trim_galore:
    input:
        r1=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R1.fcor.fq"),
        r2=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R2.fcor.fq")
    output:
        o1=path.join(RESULT_DIR, "trimmed_reads/{sample}.fcor_val1.fq"),
        o2=path.join(RESULT_DIR, "trimmed_reads/{sample}.fcor_val2.fq")
    threads: 8
    #group: "cleaning"
    envmodules:
        "bio/Trim_Galore/0.6.5-foss-2019a-Python-3.7.4"
    params:
        "--paired --retain_unpaired --phred33 --length 36 -q 5 --stringency 1 -e 0.1 -j {threads}"
    script:
        "scripts/trim_galore.py"
In snakemake, you define the final output files of the pipeline as target files and use them as inputs in the first rule of the pipeline. This rule is traditionally named all (more recently targets in the snakemake docs).
In your code, rule all specifies the input files of the pipeline, which already exist, and therefore snakemake doesn't see anything to do. It instead needs to specify the output files of interest from the pipeline:
rule all:
    input:
        expand(path.join(RESULT_DIR, "trimmed_reads/{sample}.fcor_val{read}.fq"), sample=FILES, read=[1,2]),
Why your attempted methods didn't work:
-f not working:
As per doc:
--force, -f
Force the execution of the selected target or the first rule regardless of already created output.
Default: False
In your code, this means rule all, which doesn't have output defined, and therefore nothing happened.
filenameR1.fcor_val1.fq
This doesn't match the output path of any of the rules (the trim_galore outputs live under RESULT_DIR/trimmed_reads/), hence the MissingRuleException; see the example after this list.
--forceall
Same reasoning as that for -f flag in your case.
--forceall, -F
Force the execution of the selected (or the first) rule and all rules it is dependent on regardless of already created output.
Default: False
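For reference, a target that does match has to include the full path from the rule's output directive. A sketch (results/ here stands in for whatever your OUTPUT_DIR is, and mysample for one of the sample names discovered by glob_wildcards):
snakemake --cores 1 results/trimmed_reads/mysample.fcor_val1.fq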

Snakemake “Missing files after X seconds” error

I am getting the following error every time I try to run my snakemake script:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 99
Job counts:
count jobs
1 all
1 antiSMASH
1 pear
1 prodigal
4
[Wed Dec 11 14:59:43 2019]
rule pear:
input: Unmap_41_1.fastq, Unmap_41_2.fastq
output: merged_reads/Unmap_41.fastq
jobid: 3
wildcards: sample=Unmap_41, extension=fastq
Submitted job 3 with external jobid 'Submitted batch job 4572437'.
Waiting at most 120 seconds for missing files.
MissingOutputException in line 14 of /faststorage/project/ABR/scripts/antismash.smk:
Missing files after 120 seconds:
merged_reads/Unmap_41.fastq
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
It would seem that the first rule is not executing, but I am unsure why, since from what I can see all the syntax is correct. Does anyone have some advice?
The snakefile is the following:
#!/miniconda/bin/python

workdir: config["path_to_files"]

wildcard_constraints:
    separator = config["separator"],
    extension = config["file_extension"],
    sample = '|'.join(config["samples"])

rule all:
    input:
        expand("antismash-output/{sample}/{sample}.txt", sample = config["samples"])

# merging the paired end reads (either fasta or fastq) as prodigal only takes single end reads
rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("pear -f {input.forward} -r {input.reverse} -o {output} -t 21")

# If single end then move them to merged_reads directory
rule move:
    input:
        "{sample}.{extension}"
    output:
        "merged_reads/{sample}.{extension}"
    shell:
        "cp {path}/{sample}.{extension} {path}/merged_reads/"

# Setting the rule order on the 3 above rules which should be treated equally and only one run.
ruleorder: pear > move

# annotating the metagenome with prodigal#. Can be done inside antiSMASH but prefer to do it out
rule prodigal:
    input:
        f"merged_reads/{{sample}}.{config['file_extension']}"
    output:
        gbk_files = "annotated_reads/{sample}.gbk",
        protein_files = "protein_reads/{sample}.faa"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("prodigal -i {input} -o {output.gbk_files} -a {output.protein_files} -p meta")

# running antiSMASH on the annotated metagenome
rule antiSMASH:
    input:
        "annotated_reads/{sample}.gbk"
    output:
        touch("antismash-output/{sample}/{sample}.txt")
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("antismash --knownclusterblast --subclusterblast --full-hmmer --smcog --outputfolder antismash-output/{wildcards.sample}/ {input}")
I am running the pipeline on only one file at the moment, but the yaml file looks like this if it is of interest:
file_extension: fastq
path_to_files: /home/lamma/ABR/Each_reads
samples:
  - Unmap_41
separator: _
I know the error can occur when you use certain flags in snakemake, but I don't believe I am using those flags. The command being submitted to run the snakefile is:
snakemake --latency-wait 120 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/ABR/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output}' --cluster-config antismash-config.json --configfile yaml-config-files/antismash-on-rawMetagenome.yaml --snakefile antismash.smk
I have tried the -F flag to force a rerun but this seems to do nothing, as does increasing the --latency-wait number. Any help would be appreciated :)
I think it is likely to be something involving the way I am calling the conda environment in the run commands, but using the conda: option with a yaml file returns version-not-found style errors.
From what I read of the pear documentation:
-o  Specify the name to be used as base for the output files. PEAR outputs four files: a file containing the assembled reads with an assembled.fastq extension, two files containing the forward, resp. reverse, unassembled reads with extensions unassembled.forward.fastq, resp. unassembled.reverse.fastq, and a file containing the discarded reads with a discarded.fastq extension.
So if the output defined in your rule is just a base, I suggest you put this as a param and the real names of the output as output:
rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}.assembled.fastq",
        "merged_reads/{sample}.{extension}.unassembled.forward.fastq",
        "merged_reads/{sample}.{extension}.unassembled.reverse.fastq",
        "merged_reads/{sample}.{extension}.discarded.fastq"
    params:
        base="merged_reads/{sample}.{extension}"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("pear -f {input.forward} -r {input.reverse} -o {params.base} -t 21")
I haven't tested pear so not sure what the output file names are exactly.
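If those are indeed the file names, downstream rules that consumed the old single output (prodigal in your Snakefile) would presumably need to point at the assembled file explicitly, something like:
rule prodigal:
    input:
        # assumption: prodigal should consume the assembled reads produced by pear
        f"merged_reads/{{sample}}.{config['file_extension']}.assembled.fastq"
    output:
        gbk_files = "annotated_reads/{sample}.gbk",
        protein_files = "protein_reads/{sample}.faa"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("prodigal -i {input} -o {output.gbk_files} -a {output.protein_files} -p meta")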

How to use output directories to aggregate files (and receive more informative error messages)?

The overall problem I'm trying to solve is a way to count the number of reads present in each file at every step of a QC pipeline I'm building. I have a shell script I've used in the past which takes in a directory and outputs the number of reads per file. Since I'm looking to use a directory as input, I tried following the format laid out by Rasmus in this post:
https://bitbucket.org/snakemake/snakemake/issues/961/rule-with-folder-as-input-and-output
Here is some example input created earlier in the pipeline:
$ ls -1 cut_reads/
97_R1_cut.fastq.gz
97_R2_cut.fastq.gz
98_R1_cut.fastq.gz
98_R2_cut.fastq.gz
99_R1_cut.fastq.gz
99_R2_cut.fastq.gz
And a simplified Snakefile to first aggregate all reads by creating symlinks in a new directory, and then use that directory as input for the read counting shell script:
import os

configfile: "config.yaml"

rule all:
    input:
        "read_counts/read_counts.txt"

rule agg_count:
    input:
        cut_reads = expand("cut_reads/{sample}_{rdir}_cut.fastq.gz", rdir=["R1", "R2"], sample=config["forward_reads"])
    output:
        cut_dir = directory("read_counts/cut_reads")
    run:
        os.makedir(output.cut_dir)
        for read in input.cut_reads:
            abspath = os.path.abspath(read)
            shell("ln -s {abspath} {output.cut_dir}")

rule count_reads:
    input:
        cut_reads = "read_counts/cut_reads"
    output:
        "read_counts/read_counts.txt"
    shell:
        '''
        readcounts.sh {input.cut_reads} >> {output}
        '''
Everything's fine in the dry-run, but when I try to actually execute it, I get a fairly cryptic error message:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 agg_count
1 all
1 count_reads
3
[Tue Jun 18 11:31:22 2019]
rule agg_count:
input: cut_reads/99_R1_cut.fastq.gz, cut_reads/98_R1_cut.fastq.gz, cut_reads/97_R1_cut.fastq.gz, cut_reads/99_R2_cut.fastq.gz, cut_reads/98_R2_cut.fastq.gz, cut_reads/97_R2_cut.fastq.gz
output: read_counts/cut_reads
jobid: 2
Job counts:
count jobs
1 agg_count
1
[Tue Jun 18 11:31:22 2019]
Error in rule agg_count:
jobid: 0
output: read_counts/cut_reads
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/douglas/snakemake/scrap_directory/.snakemake/log/2019-06-18T113122.202962.snakemake.log
read_counts/ was created, but there's no cut_reads/ directory inside. No other error messages are present in the complete log. Anyone know what's going wrong or how to receive a more descriptive error message?
I'm also (obviously) fairly new to snakemake, so there might be a better way to go about this whole process. Any help is much appreciated!
... And it was a typo. Typical. os.makedir(output.cut_dir) should be os.makedirs(output.cut_dir). I'm still really curious why snakemake isn't displaying the AttributeError python throws when you try to run this:
AttributeError: module 'os' has no attribute 'makedir'
Is there somewhere this is stored or can be accessed to prevent future headaches?
Are you sure the error message is due to the typo in os.makedir? In this test script os.makedir does throw AttributeError ...:
rule all:
    input:
        'tmp.done',

rule one:
    output:
        x= 'tmp.done',
        xdir= directory('tmp'),
    run:
        os.makedir(output.xdir)
When executed:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1 one
2
[Wed Jun 19 09:05:57 2019]
rule one:
output: tmp.done, tmp
jobid: 1
Job counts:
count jobs
1 one
1
[Wed Jun 19 09:05:57 2019]
Error in rule one:
jobid: 0
output: tmp.done, tmp
RuleException:
AttributeError in line 10 of /home/dario/Tritume/Snakefile:
module 'os' has no attribute 'makedir'
File "/home/dario/Tritume/Snakefile", line 10, in __rule_one
File "/home/dario/miniconda3/envs/tritume/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/dario/Tritume/.snakemake/log/2019-06-19T090557.113876.snakemake.log
Use an f-string to resolve local variables like {abspath}:
for read in input.cut_reads:
    abspath = os.path.abspath(read)
    shell(f"ln -s {abspath} {output.cut_dir}")
Wrap the wildcards that snakemake resolves automatically into double braces inside of f-strings.
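For instance, if you wanted snakemake's shell() rather than the f-string to fill in output.cut_dir, a sketch of that form would be:
for read in input.cut_reads:
    abspath = os.path.abspath(read)
    # {abspath} is resolved by the f-string; {{output.cut_dir}} becomes {output.cut_dir}
    # in the final string and is then resolved by snakemake's shell() formatting
    shell(f"ln -s {abspath} {{output.cut_dir}}")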