Problems with the VEP snakemake wrapper - wrapper

I'm experiencing two issues trying to run the VEP wrapper for snakemake.
The first is that I would like to use lambda wildcards in calls like so:
calling_dir = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"])
callings_locations = [calling_dir] * len_samples
callings_dict = dict(zip(sample_names, callings_locations))
def getVCFs(sample):
return(list(os.path.join(callings_dict[sample],"{0}_sorted_dedupped_snp_varscan.vcf".format(sample,pair)) for pair in ['']))
rule variant_annotation:
input:
calls= lambda wildcards: getVCFs(wildcards.sample),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls="variants.annotated.vcf",
stats="variants.html"
params:
plugins=["LoFtool"],
extra="--everything"
message: """--- Annotating Variants."""
resources:
mem = 30000,
time = 120
threads: 4
wrapper:
"0.64.0/bio/vep/annotate"
However, I get an error:
When I replace lambda wildcards with a calls= expand('{CALLING_DIR}/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf', CALLING_DIR=dirs_dict["CALLING_DIR"], CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names) ([which is not ideal - see this post for reason][1]) it give me errors about resources folder?
(snakemake) [moldach#cedar1 MTG353]$ snakemake -n -r
Building DAG of jobs...
MissingInputException in line 333 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/MTG353/Snakefile:
Missing input files for rule variant_annotation:
resources/vep/cache
resources/vep/plugins
I'm also [confused from the documentation as to how it knows which reference genome (version, _etc.) should be specified][2].
UPDATE:
Because of the character limit I cannot even respond to the two respondents so I will continue the issue here:
As #jafors mentioned the two wrappers solved the issue for cache and plugins - thanks!
Now I get an error from trying to run VEP though from the following rule:
rule variant_annotation:
input:
calls= expand('{CALLING_DIR}/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf', CALLING_DIR=dirs_dict["CALLING_DIR"], CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls=expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}.annotated.vcf', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names),
stats=expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}.html', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names)
params:
plugins=["LoFtool"],
extra="--everything"
message: """--- Annotating Variants."""
resources:
mem = 30000,
time = 120
threads: 4
wrapper:
"0.64.0/bio/vep/annotate"
this is the error I get from the log:
Building DAG of jobs...
Using shell: /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 variant_annotation
1
[Wed Aug 12 20:22:49 2020]
Job 0: --- Annotating Variants.
Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Traceback (most recent call last):
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpwx1u_776.wrapper.py", line 36, in <module>
if snakemake.output.calls.endswith(".vcf.gz"):
AttributeError: 'Namedlist' object has no attribute 'endswith'
[Wed Aug 12 20:22:53 2020]
Error in rule variant_annotation:
jobid: 0
output: ANNOTATION/VEP/BC1217.annotated.vcf, ANNOTATION/VEP/470.annotated.vcf, ANNOTATION/VEP/MTG109.annotated.vcf, ANNOTATION/VEP/BC1217.html, ANNOTATION/VEP/470.html, ANNOTATION/VEP/MTG$
conda-env: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
RuleException:
CalledProcessError in line 393 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile:
Command 'source /home/moldach/miniconda3/bin/activate '/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f'; set -euo pipefail; python /scratch/moldach/MADDOG/VCF-FILE$
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile", line 393, in __rule_variant_annotation
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.8.0/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
TO BE CLEAR:
This is the code I had running VEP prior to trying out the wrapper so I would like to preserve similar options (e.g. offline, etc.):
vep \
-i {input.sample} \
--species "caenorhabditis_elegans" \
--format "vcf" \
--everything \
--cache_version 100 \
--offline \
--force_overwrite \
--fasta {input.ref} \
--gff {input.annot} \
--tab \
--variant_class \
--regulatory \
--show_ref_allele \
--numbers \
--symbol \
--protein \
-o {params.sample}
UPDATE 2:
Yes the use of expand() was the issue. I remember this is why I like to use lambda or os.path.join() as rule input/output except for as you mentioned in rule all:
The following seems to get rid of that problem although I'm met with a new one:
rule variant_annotation:
input:
calls= lambda wildcards: getVCFs(wildcards.sample),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.annotated.vcf"),
stats=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.html")
Not sure why I get the unknown file type error - as I mentioned this was first tested out with the full command with the same input data?
Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Failed to open VARIANT_CALLING/varscan/MTG109_sorted_dedupped_snp_varscan.vcf: unknown file type
Possible precedence issue with control flow operator at /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Traceback (most recent call last):
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpsh388k23.wrapper.py", line 44, in <module>
"(bcftools view {snakemake.input.calls} | "
File "/home/moldach/bin/snakemake/lib/python3.8/site-packages/snakemake/shell.py", line 156, in __new__
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail; (bcftools view VARIANT_CALLING/varscan/MTG109_sorted_dedupped_snp_varscan.vcf | vep --everything --fork 4 --format vcf --vcf --cach$
[Thu Aug 13 09:02:22 2020]
Update 3:
bcftools view is giving the warning from the output of samtools mpileup/varscan pileup2snp:
def getDeduppedBamsIndex(sample):
return(list(os.path.join(aligns_dict[sample],"{0}.sorted.dedupped.bam.bai".format(sample,pair)) for pair in ['']))
rule mpilup:
input:
bam=lambda wildcards: getDeduppedBams(wildcards.sample),
reference_genome=os.path.join(dirs_dict["REF_DIR"],config["REF_GENOME"])
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
log:
os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_{contig}_samtools_mpileup.log")
params:
extra=lambda wc: "-r {}".format(wc.contig)
resources:
mem = 1000,
time = 30
wrapper:
"0.65.0/bio/samtools/mpileup"
rule mpileup_to_vcf:
input:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.vcf")
message:
"Calling SNP with Varscan2"
threads:
2 # Keep threading value to one for unzipped mpileup input
# Set it to two for zipped mipileup files
log:
os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"varscan_{sample}_{contig}.log")
resources:
mem = 1000,
time = 30
wrapper:
"0.65.0/bio/varscan/mpileup2snp"
rule vcf_merge:
input:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_I.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_II.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_III.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_IV.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_V.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_X.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_MtDNA.vcf")
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}.vcf")
log: os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_vcf-merge.log")
resources:
mem = 1000,
time = 10
threads: 1
message: """--- Merge VarScan by Chromosome."""
shell: """
awk 'FNR==1 && NR!=1 {{ while (/^<header>/) getline; }} 1 {{print}} ' {input} > {output}
"""
calling_dir = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"])
callings_locations = [calling_dir] * len_samples
callings_dict = dict(zip(sample_names, callings_locations))
def getVCFs(sample):
return(list(os.path.join(callings_dict[sample],"{0}.vcf".format(sample,pair)) for pair in ['']))
rule annotate_variants:
input:
calls=lambda wildcards: getVCFs(wildcards.sample),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls="{sample}.annotated.vcf",
stats="{sample}.html"
params:
# Pass a list of plugins to use, see https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html
# Plugin args can be added as well, e.g. via an entry "MyPlugin,1,FOO", see docs.
plugins=["LoFtool"],
extra="--everything" # optional: extra arguments
log:
"logs/vep/{sample}.log"
threads: 4
resources:
time=30,
mem=5000
wrapper:
"0.65.0/bio/vep/annotate"
If I run bcftools view on the output I get the error:
$ bcftools view variant_calling/varscan/MTG324.vcf
Failed to read from variant_calling/varscan/MTG324.vcf: unknown file type

About using the expand vs wildcard, it does not matter at all. The biostar post is just advice how to keep things readable. On the snakemake/programmatic side should not matter how you define you input, as long as it is correct.
The complaint about resources is that you define in the input of rule variant_annotation that resources/vep/cache and resources/vep/plugins are necessary inputs to be able to run variant_annotation. With this error snakemake is effectively telling you that those files do not exist, so it can not run the rule for you.
When I look at the code in the docs it seems like the cache directory as input should define which genome you use:
entrypath = get_only_child_dir(get_only_child_dir(Path(cache)))
species = entrypath.parent.name
release, build = entrypath.name.split("_")

Additionally to what Maarten said (the resources/vep/cache and resources/vep/plugins are just example paths to the required input which defines also which genome and version you want to use), you can get the cache and plugin directories easily with two other simple rules in your Snakefile using these wrappers:
https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/vep/cache.htm
https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/vep/plugins.html
EDIT
Glad this worked out for your first problem.
The second error seems to arise from the expand in the output.
Am I understanding correctly that you want to annotate all your vcfs one-by-one? So input is {sample}.vcf and output would be {sample}.annotated.vcf?
If that's the case, you probably don't want to use expand in this rule.
I am also not sure, why you would need the {ANNOT_DIR} and {ANNOT_TOOL} to be wildcards here. I guess if you are using VEP, the ANNOT_TOOL would always be VEP and the ANNOT_DIR will be ANNOTATION?
Then, you could write them directly in the output as ANNOTATION/VEP/{sample}.annotated.vcf.
Same for the {CALLING_DIR}, I guess this will always be the same directory, right? I get that the {CALLING_TOOL} might have more than one value if you used multiple callers on the samples.
If I am still on track, you have two wildcards you could want to expand on when using VEP, the {sample} and the {CALLING_TOOL}.
Just write
input:
calls: 'CALLDIR/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf',
cache="resources/vep/cache",
plugins="resources/vep/plugins"
output:
calls='ANNOTATION/VEP/{CALLING_TOOL}/{sample}.annotated.vcf',
stats='ANNOTATION/VEP/{CALLING_TOOL}/{sample}.html'
The expand belongs in your rule all or any other target rule that uses all annotated vcfs at once, sth. like this:
rule all:
input: expand('ANNOTATION/VEP/{CALLING_TOOL}/{sample}.annotated.vcf', CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names)
Then, the variant_annotation rule will run all the samples you expand on in rule all.
I hope I got your idea correctly and this helps.
EDIT2
Ok, seems like we are nearly done. The error you get is thrown by bcftools view - it indicates that something might be wrong with the vcf.
Did you try bcftools view with your vcf outside of the Snakefile? This would give us an idea if the problem arises during this rule or if the vcf is already somehow problematic.

Related

Snakemake is unable to match wildcard although it's defined and even suggested

I am still very confused about the wildcards concept despite reading the full docs and a few examples, so maybe someone can shed light on this weird behaviour. It might be a bug but it's such a basic example that I am pretty sure I am doing or understanding something wrong.
Here is my Snakefile which should generate a bunch of files defined in a dictionary where the location of the files is stored (those can be served by all kinds of data providers like iRODS, XRootD etc., but it's not important now).
import os
some_files = {
"foo": "some_location/foo",
"bar": "another_location/bar",
"baz": "yet_another_loc/baz"
}
rule all:
input: ["raw/" + os.path.basename(f) for f in some_files.keys()]
rule generate_files:
output:
temp("raw/{fname}")
shell:
"echo grabbed file from {some_files[wildcards.fname]} > {output}"
As you can see, I need to use a similar "trick" which was proposed in my previous question (Array of values as input in Snakemake workflows) to force the recognition of the files by adding a rule and listing those (in rule all), which works nicely.
The rule generate_files should then generate (retrieve) those by using the corresponding URL and protocol defined in some_files. For the sake of simplicity, it's now just echoing the origin into the output file.
To achieve this, I thought I can simply use the wildcards.fname in the shell section but I when I run the workflow, I get:
░ tamasgal#silentbox-(2):PhD/snakemake  master ●●● snakemake took 16s
░ 08:47:35 > snakemake -c1
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count min threads max threads
-------------- ------- ------------- -------------
all 1 1 1
generate_files 3 1 1
total 4 1 1
Select jobs to execute...
[Fri Feb 18 08:47:38 2022]
rule generate_files:
output: raw/bar
jobid: 2
wildcards: fname=bar
resources: tmpdir=/var/folders/84/mcvklq757tq1nfrkbxvvbq8m0000gn/T
RuleException in line 12 of /Users/tamasgal/Dev/PhD/snakemake/Snakefile:
NameError: The name 'wildcards.fname' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
If I use fname (and not wildcards.fname), Snakemake proposes to use wildcards.fname, which again, does not work. Here is the output when running with fname in output:
[Fri Feb 18 08:47:48 2022]
rule generate_files:
output: raw/bar
jobid: 2
wildcards: fname=bar
resources: tmpdir=/var/folders/84/mcvklq757tq1nfrkbxvvbq8m0000gn/T
RuleException in line 12 of /Users/tamasgal/Dev/PhD/snakemake/Snakefile:
NameError: The name 'fname' is unknown in this context. Did you mean 'wildcards.fname'?
Why is this happening? The output of the workflow clearly shows that wildcards: fname=bar, so it exists and is defined. Is this a bug?
Hm, you may have to try and get at some_files[wildcards.fname] outside of the shell part? It looks to me like it can tell what the wildcard is supposed to be for the output to be raw/bar, but it can't handle using it to access the dict in the shell part. It seems like this could be handled with an input function to me.
Off the top of my head:
rule generate_files:
input:
some_file = lambda wildcards: some_files[wildcards.fname]
output:
temp("raw/{fname}")
shell:
"echo grabbed file from {input.some_file} > {output}"
EDIT: if it fails because the file isn't local so Snakemake can't find it, you may supply the path to it as a parameter instead:
rule generate_files:
params:
some_file = lambda wildcards: some_files[wildcards.fname]
output:
temp("raw/{fname}")
shell:
"echo grabbed file from {params.some_file} > {output}"

Workflow always results in "Nothing to do" even when forcing rules

So as the title says I can't bring my workflow to execute anything, except the all rule...
When Executing the all rule it correctly finds all the input files, so the configfile is okay, every path is correct.
when trying to run without additional tags I get
Building DAG of jobs...
Checking status of 0 jobs.
Nothing to be done
Things I tried:
-f rcorrector -> only all rule
filenameR1.fcor_val1.fq -> MissingRuleException (No Typos)
--forceall -> only all rule
Some more fiddling I can't formulate clearly
please Help
from os import path
configfile:"config.yaml"
RNA_DIR = config["RAW_RNA_DIR"]
RESULT_DIR = config["OUTPUT_DIR"]
FILES = glob_wildcards(path.join(RNA_DIR, '{sample}R1.fastq.gz')).sample
############################################################################
rule all:
input:
r1=expand(path.join(RNA_DIR, '{sample}R1.fastq.gz'), sample=FILES),
r2=expand(path.join(RNA_DIR, '{sample}R2.fastq.gz'), sample=FILES)
#############################################################################
rule rcorrector:
input:
r1=path.join(RNA_DIR, '{sample}R1.fastq.gz'),
r2=path.join(RNA_DIR, '{sample}R2.fastq.gz')
output:
o1=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R1.cor.fq'),
o2=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R2.cor.fq')
#group: "cleaning"
threads: 8
params: "-t {threads}"
envmodules:
"bio/Rcorrector/1.0.4-foss-2019a"
script:
"scripts/Rcorrector.py"
############################################################################
rule FilterUncorrectabledPEfastq:
input:
r1=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R1.cor.fq'),
r2=path.join(RESULT_DIR, 'trimmed_reads/corrected/{sample}R2.cor.fq')
output:
o1=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R1.fcor.fq"),
o2=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R2.fcor.fq")
#group: "cleaning"
envmodules:
"bio/Jellyfish/2.2.6-foss-2017a",
"lang/Python/2.7.13-foss-2017a"
#TODO: load as module
script:
"/scripts/filterUncorrectable.py"
#############################################################################
rule trim_galore:
input:
r1=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R1.fcor.fq"),
r2=path.join(RESULT_DIR, "trimmed_reads/filtered/{sample}R2.fcor.fq")
output:
o1=path.join(RESULT_DIR, "trimmed_reads/{sample}.fcor_val1.fq"),
o2=path.join(RESULT_DIR, "trimmed_reads/{sample}.fcor_val2.fq")
threads: 8
#group: "cleaning"
envmodules:
"bio/Trim_Galore/0.6.5-foss-2019a-Python-3.7.4"
params:
"--paired --retain_unpaired --phred33 --length 36 -q 5 --stringency 1 -e 0.1 -j {threads}"
script:
"scripts/trim_galore.py"
In snakemake, you define final output files of the pipeline as target files and define them as inputs in first rule of the pipeline. This rule is traditionally named as all (more recently as targets in snakemake doc).
In your code, rule all specifies input files of the pipeline, which already exists, and therefore snakemake doesn't see anything to do. It just instead needs to specify output files of interest from the pipeline.
rule all:
input:
expand(path.join(RESULT_DIR, "trimmed_reads/{sample}.fcor_val{read}.fq"), sample=FILES, read=[1,2]),
Why your attempted methods didn't work?
-f not working:
As per doc:
--force, -f
Force the execution of the selected target or the first rule regardless of already created output.
Default: False
In your code, this means rule all, which doesn't have output defined, and therefore nothing happened.
filenameR1.fcor_val1.fq
This doesn't match output of any of the rules and therefore the error MissingRuleException.
--forceall
Same reasoning as that for -f flag in your case.
--forceall, -F
Force the execution of the selected (or the first) rule and all rules it is dependent on regardless of already created output.
Default: False

Snakemake “Missing files after X seconds” error

I am getting the following error every time I try to run my snakemake script:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 99
Job counts:
count jobs
1 all
1 antiSMASH
1 pear
1 prodigal
4
[Wed Dec 11 14:59:43 2019]
rule pear:
input: Unmap_41_1.fastq, Unmap_41_2.fastq
output: merged_reads/Unmap_41.fastq
jobid: 3
wildcards: sample=Unmap_41, extension=fastq
Submitted job 3 with external jobid 'Submitted batch job 4572437'.
Waiting at most 120 seconds for missing files.
MissingOutputException in line 14 of /faststorage/project/ABR/scripts/antismash.smk:
Missing files after 120 seconds:
merged_reads/Unmap_41.fastq
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
It would seem that the first rule is not executing, but I am unsure as to why as from what I can see all the syntax is correct. Anyone have some advice?
The snakefile is the following:
#!/miniconda/bin/python
workdir: config["path_to_files"]
wildcard_constraints:
separator = config["separator"],
extension = config["file_extension"],
sample = '|' .join(config["samples"])
rule all:
input:
expand("antismash-output/{sample}/{sample}.txt", sample = config["samples"])
# merging the paired end reads (either fasta or fastq) as prodigal only takes single end reads
rule pear:
input:
forward = f"{{sample}}{config['separator']}1.{{extension}}",
reverse = f"{{sample}}{config['separator']}2.{{extension}}"
output:
"merged_reads/{sample}.{extension}"
#conda:
#"/home/lamma/env-export/antismash.yaml"
run:
shell("set +u")
shell("source ~/miniconda3/etc/profile.d/conda.sh")
shell("conda activate antismash")
shell("pear -f {input.forward} -r {input.reverse} -o {output} -t 21")
# If single end then move them to merged_reads directory
rule move:
input:
"{sample}.{extension}"
output:
"merged_reads/{sample}.{extension}"
shell:
"cp {path}/{sample}.{extension} {path}/merged_reads/"
# Setting the rule order on the 3 above rules which should be treated equally and only one run.
ruleorder: pear > move
# annotating the metagenome with prodigal#. Can be done inside antiSMASH but prefer to do it out
rule prodigal:
input:
f"merged_reads/{{sample}}.{config['file_extension']}"
output:
gbk_files = "annotated_reads/{sample}.gbk",
protein_files = "protein_reads/{sample}.faa"
#conda:
#"/home/lamma/env-export/antismash.yaml"
run:
shell("set +u")
shell("source ~/miniconda3/etc/profile.d/conda.sh")
shell("conda activate antismash")
shell("prodigal -i {input} -o {output.gbk_files} -a {output.protein_files} -p meta")
# running antiSMASH on the annotated metagenome
rule antiSMASH:
input:
"annotated_reads/{sample}.gbk"
output:
touch("antismash-output/{sample}/{sample}.txt")
#conda:
#"/home/lamma/env-export/antismash.yaml"
run:
shell("set +u")
shell("source ~/miniconda3/etc/profile.d/conda.sh")
shell("conda activate antismash")
shell("antismash --knownclusterblast --subclusterblast --full-hmmer --smcog --outputfolder antismash-output/{wildcards.sample}/ {input}")
I am running the pipeline on only one file at the moment but the yaml file looks like this if it is of intest:
file_extension: fastq
path_to_files: /home/lamma/ABR/Each_reads
samples:
- Unmap_41
separator: _
I know the error can occure when you use certain flags in snakemake but I dont believe I am using those flags. The command being submited to run the snakefile is:
snakemake --latency-wait 120 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/ABR/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output}' --cluster-config antismash-config.json --configfile yaml-config-files/antismash-on-rawMetagenome.yaml --snakefile antismash.smk
I have tried to -F flag to force a rerun but this seems to do nothing, as does increasing the --latency-wait number. Any help would be appriciated :)
I think it is likely to be something involving the way I am calling the conda environment in the run commands but using the conda: option with a yaml files returns version not found style errors.
As of what I read from pear documentation:
-o Specify the name to be used as base for the output files. PEAR outputs four files. A file containing the assembled reads with a
assembled.fastq extension, two files containing the forward, resp.
reverse, unassembled reads with extensions unassembled.forward.fastq,
resp. unassembled.reverse.fastq, and a file containing the discarded
reads with a discarded.fastq extension
So if the output defined in your rule is just a base, I suggest you put this as a param and the real names of the output as output:
rule pear:
input:
forward = f"{{sample}}{config['separator']}1.{{extension}}",
reverse = f"{{sample}}{config['separator']}2.{{extension}}"
output:
"merged_reads/{sample}.{extension}".assembled.fastq,
"merged_reads/{sample}.{extension}".unassembled.forward.fastq,
"merged_reads/{sample}.{extension}".unassembled.reverse.fastq,
"merged_reads/{sample}.{extension}".discarded.fastq
params:
base="merged_reads/{sample}.{extension}"
#conda:
#"/home/lamma/env-export/antismash.yaml"
run:
shell("set +u")
shell("source ~/miniconda3/etc/profile.d/conda.sh")
shell("conda activate antismash")
shell("pear -f {input.forward} -r {input.reverse} -o {params.base} -t 21")
I haven't tested pear so not sure what the output file names are exactly.

Parallel execution of a snakemake rule with same input and a range of values for a single parameter

I am transitioning a bash script to snakemake and I would like to parallelize a step I was previously handling with a for loop. The issue I am running into is that instead of running parallel processes, snakemake ends up trying to run one process with all parameters and fails.
My original bash script runs a program multiple times for a range of values of the parameter K.
for num in {1..3}
do
structure.py -K $num --input=fileprefix --output=fileprefix
done
There are multiple input files that start with fileprefix. And there are two main outputs per run, e.g. for K=1 they are fileprefix.1.meanP, fileprefix.1.meanQ. My config and snakemake files are as follows.
Config:
cat config.yaml
infile: fileprefix
K:
- 1
- 2
- 3
Snakemake:
configfile: 'config.yaml'
rule all:
input:
expand("output/{sample}.{K}.{ext}",
sample = config['infile'],
K = config['K'],
ext = ['meanQ', 'meanP'])
rule structure:
output:
"output/{sample}.{K}.meanQ",
"output/{sample}.{K}.meanP"
params:
prefix = config['infile'],
K = config['K']
threads: 3
shell:
"""
structure.py -K {params.K} \
--input=output/{params.prefix} \
--output=output/{params.prefix}
"""
This was executed with snakemake --cores 3. The problem persists when I only use one thread.
I expected the outputs described above for each value of K, but the run fails with this error:
RuleException:
CalledProcessError in line 84 of Snakefile:
Command ' set -euo pipefail; structure.py -K 1 2 3 --input=output/fileprefix \
--output=output/fileprefix ' returned non-zero exit status 2.
File "Snakefile", line 84, in __rule_Structure
File "snake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
When I set K to a single value such as K = ['1'], everything works. So the problem seems to be that {params.K} is being expanded to all values of K when the shell command is executed. I started teaching myself snakemake today, and it works really well, but I'm hitting a brick wall with this.
You need to retrieve the argument for -K from the wildcards, not from the config file. The config file will simply return your list of possible values, it is a plain python dictionary.
configfile: 'config.yaml'
rule all:
input:
expand("output/{sample}.{K}.{ext}",
sample = config['infile'],
K = config['K'],
ext = ['meanQ', 'meanP'])
rule structure:
output:
"output/{sample}.{K}.meanQ",
"output/{sample}.{K}.meanP"
params:
prefix = config['invcf'],
K = config['K']
threads: 3
shell:
"structure.py -K {wildcards.K} "
"--input=output/{params.prefix} "
"--output=output/{params.prefix}"
Note that there are more things to improve here. For example, the rule structure does not define any input file, although it uses one.
There is an option now for parameter space exploration
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#parameter-space-exploration

Snakemake : CalledProcessError when running BWA on multiple files

I have a folder with multiple sub-folders that each contain .fastq files(s) that I would like to align to a genome. I am trying to create a snakemake workflow for it. First I access each sub-directory and the files in them using wildcards. Then I use the expand function to store all the paths to the files and write a rule to map the files to the genome. The code is as follows:
from snakemake.io import glob_wildcards, expand
import sys
import os
directories, files = glob_wildcards("data/samples/{dir}/{file}.fastq")
print(directories, files)
rule all:
input:
expand("data/samples/{dir}/{file}.fastq", zip, dir=directories,
file=files)
rule bwa_map:
input:
G = "data/genome.fa",
r1 = expand("data/samples/{dir}/{file}.fastq", zip,
dir=directories, file=files)
output:
r2 = expand("data/results/{dir}/{file}.bam", zip, dir=directories,
file=files)
shell:
"./bwa mem {input.G} {input.r1} | ./samtools sort -o - > {output.r2}"
However, when I execute this code as "snakemake bwa_map", I get the following error:
Error in job bwa_map while creating output files data/results/SRR5923/A.bam, data/results/SRR5924/B.bam, data/results/SRR5925/C.bam.
RuleException:
CalledProcessError in line 19 of /Users/rewatitappu/PycharmProjects/RNA-seq_Snakemake/Snakefile:
Command './bwa mem data/genome.fa data/samples/SRR5923/A.fastq data/samples/SRR5924/B.fastq data/samples/SRR5925/C.fastq | ./samtools sort -o - > data/results/SRR5923/A.bam data/results/SRR5924/B.bam data/results/SRR5925/C.bam' returned non-zero exit status 1.
File "/Users/rewatitappu/PycharmProjects/RNA-seq_Snakemake/Snakefile", line 19, in __rule_bwa_map
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/thread.py", line 55, in run
Removing output files of failed job bwa_map since they might be corrupted:
data/results/SRR5923/A.bam
Will exit after finishing currently running jobs.
Am I wrongly executing the snakemake command or could there be a problem with the code?
The error message suggests that the error occurred at the execution of the following shell command:
./bwa mem data/genome.fa data/samples/SRR5923/A.fastq data/samples/SRR5924/B.fastq data/samples/SRR5925/C.fastq | ./samtools sort -o - > data/results/SRR5923/A.bam data/results/SRR5924/B.bam data/results/SRR5925/C.bam
The problem could be caused by the fact that you have two bam files as output.
You probably shouldn't use expand in the bwa_map rule. The expand already took place in the all rule.