I'm experiencing two issues trying to run the VEP wrapper for snakemake.
The first is that I would like to use lambda wildcards in calls like so:
calling_dir = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"])
callings_locations = [calling_dir] * len_samples
callings_dict = dict(zip(sample_names, callings_locations))
def getVCFs(sample):
return(list(os.path.join(callings_dict[sample],"{0}_sorted_dedupped_snp_varscan.vcf".format(sample,pair)) for pair in ['']))
rule variant_annotation:
input:
calls= lambda wildcards: getVCFs(wildcards.sample),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls="variants.annotated.vcf",
stats="variants.html"
params:
plugins=["LoFtool"],
extra="--everything"
message: """--- Annotating Variants."""
resources:
mem = 30000,
time = 120
threads: 4
wrapper:
"0.64.0/bio/vep/annotate"
However, I get an error:
When I replace lambda wildcards with a calls= expand('{CALLING_DIR}/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf', CALLING_DIR=dirs_dict["CALLING_DIR"], CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names) ([which is not ideal - see this post for reason][1]) it give me errors about resources folder?
(snakemake) [moldach#cedar1 MTG353]$ snakemake -n -r
Building DAG of jobs...
MissingInputException in line 333 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/MTG353/Snakefile:
Missing input files for rule variant_annotation:
resources/vep/cache
resources/vep/plugins
I'm also [confused from the documentation as to how it knows which reference genome (version, _etc.) should be specified][2].
UPDATE:
Because of the character limit I cannot even respond to the two respondents so I will continue the issue here:
As #jafors mentioned the two wrappers solved the issue for cache and plugins - thanks!
Now I get an error from trying to run VEP though from the following rule:
rule variant_annotation:
input:
calls= expand('{CALLING_DIR}/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf', CALLING_DIR=dirs_dict["CALLING_DIR"], CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls=expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}.annotated.vcf', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names),
stats=expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}.html', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names)
params:
plugins=["LoFtool"],
extra="--everything"
message: """--- Annotating Variants."""
resources:
mem = 30000,
time = 120
threads: 4
wrapper:
"0.64.0/bio/vep/annotate"
this is the error I get from the log:
Building DAG of jobs...
Using shell: /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 variant_annotation
1
[Wed Aug 12 20:22:49 2020]
Job 0: --- Annotating Variants.
Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Traceback (most recent call last):
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpwx1u_776.wrapper.py", line 36, in <module>
if snakemake.output.calls.endswith(".vcf.gz"):
AttributeError: 'Namedlist' object has no attribute 'endswith'
[Wed Aug 12 20:22:53 2020]
Error in rule variant_annotation:
jobid: 0
output: ANNOTATION/VEP/BC1217.annotated.vcf, ANNOTATION/VEP/470.annotated.vcf, ANNOTATION/VEP/MTG109.annotated.vcf, ANNOTATION/VEP/BC1217.html, ANNOTATION/VEP/470.html, ANNOTATION/VEP/MTG$
conda-env: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
RuleException:
CalledProcessError in line 393 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile:
Command 'source /home/moldach/miniconda3/bin/activate '/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f'; set -euo pipefail; python /scratch/moldach/MADDOG/VCF-FILE$
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile", line 393, in __rule_variant_annotation
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.8.0/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
TO BE CLEAR:
This is the code I had running VEP prior to trying out the wrapper so I would like to preserve similar options (e.g. offline, etc.):
vep \
-i {input.sample} \
--species "caenorhabditis_elegans" \
--format "vcf" \
--everything \
--cache_version 100 \
--offline \
--force_overwrite \
--fasta {input.ref} \
--gff {input.annot} \
--tab \
--variant_class \
--regulatory \
--show_ref_allele \
--numbers \
--symbol \
--protein \
-o {params.sample}
UPDATE 2:
Yes the use of expand() was the issue. I remember this is why I like to use lambda or os.path.join() as rule input/output except for as you mentioned in rule all:
The following seems to get rid of that problem although I'm met with a new one:
rule variant_annotation:
input:
calls= lambda wildcards: getVCFs(wildcards.sample),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.annotated.vcf"),
stats=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.html")
Not sure why I get the unknown file type error - as I mentioned this was first tested out with the full command with the same input data?
Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Failed to open VARIANT_CALLING/varscan/MTG109_sorted_dedupped_snp_varscan.vcf: unknown file type
Possible precedence issue with control flow operator at /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Traceback (most recent call last):
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpsh388k23.wrapper.py", line 44, in <module>
"(bcftools view {snakemake.input.calls} | "
File "/home/moldach/bin/snakemake/lib/python3.8/site-packages/snakemake/shell.py", line 156, in __new__
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail; (bcftools view VARIANT_CALLING/varscan/MTG109_sorted_dedupped_snp_varscan.vcf | vep --everything --fork 4 --format vcf --vcf --cach$
[Thu Aug 13 09:02:22 2020]
Update 3:
bcftools view is giving the warning from the output of samtools mpileup/varscan pileup2snp:
def getDeduppedBamsIndex(sample):
return(list(os.path.join(aligns_dict[sample],"{0}.sorted.dedupped.bam.bai".format(sample,pair)) for pair in ['']))
rule mpilup:
input:
bam=lambda wildcards: getDeduppedBams(wildcards.sample),
reference_genome=os.path.join(dirs_dict["REF_DIR"],config["REF_GENOME"])
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
log:
os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_{contig}_samtools_mpileup.log")
params:
extra=lambda wc: "-r {}".format(wc.contig)
resources:
mem = 1000,
time = 30
wrapper:
"0.65.0/bio/samtools/mpileup"
rule mpileup_to_vcf:
input:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.vcf")
message:
"Calling SNP with Varscan2"
threads:
2 # Keep threading value to one for unzipped mpileup input
# Set it to two for zipped mipileup files
log:
os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"varscan_{sample}_{contig}.log")
resources:
mem = 1000,
time = 30
wrapper:
"0.65.0/bio/varscan/mpileup2snp"
rule vcf_merge:
input:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_I.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_II.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_III.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_IV.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_V.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_X.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_MtDNA.vcf")
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}.vcf")
log: os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_vcf-merge.log")
resources:
mem = 1000,
time = 10
threads: 1
message: """--- Merge VarScan by Chromosome."""
shell: """
awk 'FNR==1 && NR!=1 {{ while (/^<header>/) getline; }} 1 {{print}} ' {input} > {output}
"""
calling_dir = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"])
callings_locations = [calling_dir] * len_samples
callings_dict = dict(zip(sample_names, callings_locations))
def getVCFs(sample):
return(list(os.path.join(callings_dict[sample],"{0}.vcf".format(sample,pair)) for pair in ['']))
rule annotate_variants:
input:
calls=lambda wildcards: getVCFs(wildcards.sample),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls="{sample}.annotated.vcf",
stats="{sample}.html"
params:
# Pass a list of plugins to use, see https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html
# Plugin args can be added as well, e.g. via an entry "MyPlugin,1,FOO", see docs.
plugins=["LoFtool"],
extra="--everything" # optional: extra arguments
log:
"logs/vep/{sample}.log"
threads: 4
resources:
time=30,
mem=5000
wrapper:
"0.65.0/bio/vep/annotate"
If I run bcftools view on the output I get the error:
$ bcftools view variant_calling/varscan/MTG324.vcf
Failed to read from variant_calling/varscan/MTG324.vcf: unknown file type
About using the expand vs wildcard, it does not matter at all. The biostar post is just advice how to keep things readable. On the snakemake/programmatic side should not matter how you define you input, as long as it is correct.
The complaint about resources is that you define in the input of rule variant_annotation that resources/vep/cache and resources/vep/plugins are necessary inputs to be able to run variant_annotation. With this error snakemake is effectively telling you that those files do not exist, so it can not run the rule for you.
When I look at the code in the docs it seems like the cache directory as input should define which genome you use:
entrypath = get_only_child_dir(get_only_child_dir(Path(cache)))
species = entrypath.parent.name
release, build = entrypath.name.split("_")
Additionally to what Maarten said (the resources/vep/cache and resources/vep/plugins are just example paths to the required input which defines also which genome and version you want to use), you can get the cache and plugin directories easily with two other simple rules in your Snakefile using these wrappers:
https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/vep/cache.htm
https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/vep/plugins.html
EDIT
Glad this worked out for your first problem.
The second error seems to arise from the expand in the output.
Am I understanding correctly that you want to annotate all your vcfs one-by-one? So input is {sample}.vcf and output would be {sample}.annotated.vcf?
If that's the case, you probably don't want to use expand in this rule.
I am also not sure, why you would need the {ANNOT_DIR} and {ANNOT_TOOL} to be wildcards here. I guess if you are using VEP, the ANNOT_TOOL would always be VEP and the ANNOT_DIR will be ANNOTATION?
Then, you could write them directly in the output as ANNOTATION/VEP/{sample}.annotated.vcf.
Same for the {CALLING_DIR}, I guess this will always be the same directory, right? I get that the {CALLING_TOOL} might have more than one value if you used multiple callers on the samples.
If I am still on track, you have two wildcards you could want to expand on when using VEP, the {sample} and the {CALLING_TOOL}.
Just write
input:
calls: 'CALLDIR/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf',
cache="resources/vep/cache",
plugins="resources/vep/plugins"
output:
calls='ANNOTATION/VEP/{CALLING_TOOL}/{sample}.annotated.vcf',
stats='ANNOTATION/VEP/{CALLING_TOOL}/{sample}.html'
The expand belongs in your rule all or any other target rule that uses all annotated vcfs at once, sth. like this:
rule all:
input: expand('ANNOTATION/VEP/{CALLING_TOOL}/{sample}.annotated.vcf', CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names)
Then, the variant_annotation rule will run all the samples you expand on in rule all.
I hope I got your idea correctly and this helps.
EDIT2
Ok, seems like we are nearly done. The error you get is thrown by bcftools view - it indicates that something might be wrong with the vcf.
Did you try bcftools view with your vcf outside of the Snakefile? This would give us an idea if the problem arises during this rule or if the vcf is already somehow problematic.
The overall problem I'm trying to solve is a way to count the number of reads present in each file at every step of a QC pipeline I'm building. I have a shell script I've used in the past which takes in a directory and outputs the number of reads per file. Since I'm looking to use a directory as input, I tried following the format laid out by Rasmus in this post:
https://bitbucket.org/snakemake/snakemake/issues/961/rule-with-folder-as-input-and-output
Here is some example input created earlier in the pipeline:
$ ls -1 cut_reads/
97_R1_cut.fastq.gz
97_R2_cut.fastq.gz
98_R1_cut.fastq.gz
98_R2_cut.fastq.gz
99_R1_cut.fastq.gz
99_R2_cut.fastq.gz
And a simplified Snakefile to first aggregate all reads by creating symlinks in a new directory, and then use that directory as input for the read counting shell script:
import os
configfile: "config.yaml"
rule all:
input:
"read_counts/read_counts.txt"
rule agg_count:
input:
cut_reads = expand("cut_reads/{sample}_{rdir}_cut.fastq.gz", rdir=["R1", "R2"], sample=config["forward_reads"])
output:
cut_dir = directory("read_counts/cut_reads")
run:
os.makedir(output.cut_dir)
for read in input.cut_reads:
abspath = os.path.abspath(read)
shell("ln -s {abspath} {output.cut_dir}")
rule count_reads:
input:
cut_reads = "read_counts/cut_reads"
output:
"read_counts/read_counts.txt"
shell:
'''
readcounts.sh {input.cut_reads} >> {output}
'''
Everything's fine in the dry-run, but when I try to actually execute it, I get a fairly cryptic error message:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 agg_count
1 all
1 count_reads
3
[Tue Jun 18 11:31:22 2019]
rule agg_count:
input: cut_reads/99_R1_cut.fastq.gz, cut_reads/98_R1_cut.fastq.gz, cut_reads/97_R1_cut.fastq.gz, cut_reads/99_R2_cut.fastq.gz, cut_reads/98_R2_cut.fastq.gz, cut_reads/97_R2_cut.fastq.gz
output: read_counts/cut_reads
jobid: 2
Job counts:
count jobs
1 agg_count
1
[Tue Jun 18 11:31:22 2019]
Error in rule agg_count:
jobid: 0
output: read_counts/cut_reads
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/douglas/snakemake/scrap_directory/.snakemake/log/2019-06-18T113122.202962.snakemake.log
read_counts/ was created, but there's no cut_reads/ directory inside. No other error messages are present in the complete log. Anyone know what's going wrong or how to receive a more descriptive error message?
I'm also (obviously) fairly new to snakemake, so there might be a better way to go about this whole process. Any help is much appreciated!
... And it was a typo. Typical. os.makedir(output.cut_dir) should be os.makedirs(output.cut_dir). I'm still really curious why snakemake isn't displaying the AttributeError python throws when you try to run this:
AttributeError: module 'os' has no attribute 'makedir'
Is there somewhere this is stored or can be accessed to prevent future headaches?
Are you sure the error message is due to the typo in os.makedir? In this test script os.makedir does throw AttributeError ...:
rule all:
input:
'tmp.done',
rule one:
output:
x= 'tmp.done',
xdir= directory('tmp'),
run:
os.makedir(output.xdir)
When executed:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1 one
2
[Wed Jun 19 09:05:57 2019]
rule one:
output: tmp.done, tmp
jobid: 1
Job counts:
count jobs
1 one
1
[Wed Jun 19 09:05:57 2019]
Error in rule one:
jobid: 0
output: tmp.done, tmp
RuleException:
AttributeError in line 10 of /home/dario/Tritume/Snakefile:
module 'os' has no attribute 'makedir'
File "/home/dario/Tritume/Snakefile", line 10, in __rule_one
File "/home/dario/miniconda3/envs/tritume/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/dario/Tritume/.snakemake/log/2019-06-19T090557.113876.snakemake.log
Use f-string to resolve local variables like {abspath}:
for read in input.cut_reads:
abspath = os.path.abspath(read)
shell(f"ln -s {abspath} {output.cut_dir}")
Wrap the wildcards that snakemake resolves automatically into double braces inside of f-strings.
I am running in snakefile on cluster using this default config:
"__default__":
"account" : "myAccount"
"queue" : "myQueue"
"nCPUs" : "16"
"memory" : 20000
"resources" : "\"select[mem>20000] rusage[mem=20000] span[hosts=1]\""
"name" : "JOBNAME.{rule}.{wildcards}"
"output" : "{rule}.{wildcards}.out"
"error" : "{rule}.{wildcards}.err"
"time" : "24:00:00"
It runs ok for some rules, but it raise this error for one of the rules.
Traceback (most recent call last):
File "home/conda3_64/lib/python3.5/site-packages/snakemake/__init__.py", line 537, in snakemake
report=report)
File "home/conda3_64/lib/python3.5/site-packages/snakemake/workflow.py", line 653, in execute
success = scheduler.schedule()
File "home/conda3_64/lib/python3.5/site-packages/snakemake/scheduler.py", line 286, in schedule
self.run(job)
File "home/conda3_64/lib/python3.5/site-packages/snakemake/scheduler.py", line 302, in run
error_callback=self._error)
File "home/conda3_64/lib/python3.5/site-packages/snakemake/executors.py", line 638, in run
jobscript = self.get_jobscript(job)
File "home/conda3_64/lib/python3.5/site-packages/snakemake/executors.py", line 496, in get_jobscript
cluster=self.cluster_wildcards(job))
File "home/conda3_64/lib/python3.5/site-packages/snakemake/executors.py", line 556, in cluster_wildcards
return Wildcards(fromdict=self.cluster_params(job))
File "home/conda3_64/lib/python3.5/site-packages/snakemake/executors.py", line 547, in cluster_params
cluster.update(self.cluster_config.get(job.name, dict()))
ValueError: need more than 1 value to unpack
This is how I run the snakemake;
snakemake -j 20 --cluster-config ./config.yaml --cluster "qsub -A {cluster.account} -l walltime={cluster.time} -q {cluster.queue} -l nodes=1:ppn={cluster.nCPUs},mem={cluster.memory}" -p
If I run it without a cluster snakemake only it will run normally.
similar error is here ValueError: need more than 1 value to unpack python but I could not relate.
I cannot test it right now but based on the error I suspect the issue seems to be with your JOBNAME placeholder not being explicitly specified in call to qsub.
Adding -N some_name to your qsub arguments should resolve it.
I have a folder with multiple sub-folders that each contain .fastq files(s) that I would like to align to a genome. I am trying to create a snakemake workflow for it. First I access each sub-directory and the files in them using wildcards. Then I use the expand function to store all the paths to the files and write a rule to map the files to the genome. The code is as follows:
from snakemake.io import glob_wildcards, expand
import sys
import os
directories, files = glob_wildcards("data/samples/{dir}/{file}.fastq")
print(directories, files)
rule all:
input:
expand("data/samples/{dir}/{file}.fastq", zip, dir=directories,
file=files)
rule bwa_map:
input:
G = "data/genome.fa",
r1 = expand("data/samples/{dir}/{file}.fastq", zip,
dir=directories, file=files)
output:
r2 = expand("data/results/{dir}/{file}.bam", zip, dir=directories,
file=files)
shell:
"./bwa mem {input.G} {input.r1} | ./samtools sort -o - > {output.r2}"
However, when I execute this code as "snakemake bwa_map", I get the following error:
Error in job bwa_map while creating output files data/results/SRR5923/A.bam, data/results/SRR5924/B.bam, data/results/SRR5925/C.bam.
RuleException:
CalledProcessError in line 19 of /Users/rewatitappu/PycharmProjects/RNA-seq_Snakemake/Snakefile:
Command './bwa mem data/genome.fa data/samples/SRR5923/A.fastq data/samples/SRR5924/B.fastq data/samples/SRR5925/C.fastq | ./samtools sort -o - > data/results/SRR5923/A.bam data/results/SRR5924/B.bam data/results/SRR5925/C.bam' returned non-zero exit status 1.
File "/Users/rewatitappu/PycharmProjects/RNA-seq_Snakemake/Snakefile", line 19, in __rule_bwa_map
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/thread.py", line 55, in run
Removing output files of failed job bwa_map since they might be corrupted:
data/results/SRR5923/A.bam
Will exit after finishing currently running jobs.
Am I wrongly executing the snakemake command or could there be a problem with the code?
The error message suggests that the error occurred at the execution of the following shell command:
./bwa mem data/genome.fa data/samples/SRR5923/A.fastq data/samples/SRR5924/B.fastq data/samples/SRR5925/C.fastq | ./samtools sort -o - > data/results/SRR5923/A.bam data/results/SRR5924/B.bam data/results/SRR5925/C.bam
The problem could be caused by the fact that you have two bam files as output.
You probably shouldn't use expand in the bwa_map rule. The expand already took place in the all rule.