Snakemake: Random FileNotFoundError for a shared file in a rule when submitting jobs in parallel - jobs

I have the following rule:
rule run_example:
input:
counts=config['output_dir'] + "/Counts/skin.txt"
params:
chrom=lambda wildcards: gene_chrom()[wildcards.SigGene]
output:
config['out_refbias'] + "/{SigGene}.txt"
script:
"Scripts/run_example.R"
with SigGene=["gene1", "gene2"]
I define the following function:
def gene_chrom(File=config['output_dir'] + "/genes2test.txt", sep=" "):
""" Makes a dictionary with keys gene and values chromosome from a file with first col gene_id and second col CHROM """
data=pd.read_csv(File, sep=sep)
keys=list(data['gene_id'])
values=[str(x) for x in data['CHROM']]
dic=dict(zip(keys,values))
return dic
I submit the rule to a cluster to run jobs in parallel. For some jobs I get the following error message:
FileNotFoundError in line 67 of Snakefile:
[Tue Jun 23 09:47:16 2020] [Errno 2] File b'/scratch/genes2test.txt' does not exist: b'/scratch/genes2test.txt'
The file exist and is shared among all instances of the rule. Most jobs were able to read the file and run to completion but some failed with above error message.

Related

Snakemake checkpoint output unknown number of files with no subsequent aggregation but instead rules that peform actions on individual files?

Thanks for any help ahead of time.
I'm trying to use the Snakemake checkpoint functionality to produce an unknown number of files in a directory, which I've gotten to work using the pattern described in the docs, but then I don't want to do any kind of aggregation rule afterwards, I want to have rules that do actions on each individual file (of course inherently in parallel via wildcards).
Here's a simple reproducible example of my problem:
from os.path import join
rule all:
input:
"aggregated.txt",
checkpoint create_gzip_file:
output:
directory("my_directory/"),
shell:
"""
mkdir my_directory/
cd my_directory
for i in 1 2 3; do gzip < /dev/null > $i.txt.gz; done
"""
rule gunzip_file:
input:
join("my_directory", "{i}.txt.gz"),
output:
join("my_directory", "{i}.txt"),
shell:
"""
gunzip -c {input} > {output}
"""
def gather_gunzip_input(wildcards):
out_dir = checkpoints.create_gzip_file.get(**wildcards).output[0]
i = glob_wildcards(join(out_dir, "{i}.txt.gz"))
return expand(f"{out_dir}/{{i}}", i=i)
rule aggregate:
input:
gather_gunzip_input,
output:
"aggregated.txt",
shell:
"cat {input} > {output}"
I'm getting the following error:
$ snakemake --printshellcmds --cores all
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job stats:
job count min threads max threads
---------------- ------- ------------- -------------
aggregate 1 1 1
all 1 1 1
create_gzip_file 1 1 1
total 3 1 1
Select jobs to execute...
[Wed Jul 13 14:57:09 2022]
checkpoint create_gzip_file:
output: my_directory
jobid: 2
reason: Missing output files: my_directory
resources: tmpdir=/tmp
Downstream jobs will be updated after completion.
mkdir my_directory/
cd my_directory
for i in 1 2 3; do gzip < /dev/null > $i.txt.gz; done
[Wed Jul 13 14:57:09 2022]
Finished job 2.
1 of 3 steps (33%) done
MissingInputException in line 20 of /home/hermidalc/projects/github/hermidalc/test/Snakefile:
Missing input files for rule gunzip_file:
output: my_directory/['1', '2', '3'].txt
wildcards: i=['1', '2', '3']
affected files:
my_directory/['1', '2', '3'].txt.gz
I had a syntax issue (which wasn't triggering any syntax check or compiler issues) that was causing the seemingly unrelated MissingInputException. The glob_wildcards line:
i = glob_wildcards(join(out_dir, "{i}.txt.gz"))
needs to be with a trailing comma:
i, = glob_wildcards(join(out_dir, "{i}.txt.gz"))
or
i = glob_wildcards(join(out_dir, "{i}.txt.gz")).i
Also, in answering the other part of the question - I believe if you don't want an aggregation-type rule (which uses the function that gathers the unknown number of files as its input) then you need to put that function as input to your rule all. As shown in this question, you can continue to have downstream rules of your checkpoint, which do not aggregate, but perform actions on the individual unknown files, you just have to use the wildcards created in your gather function and write the expand in the right way that it outputs the file structure for how the output of the last rule performing actions on files from the checkpoint come out.

Problems with the VEP snakemake wrapper

I'm experiencing two issues trying to run the VEP wrapper for snakemake.
The first is that I would like to use lambda wildcards in calls like so:
calling_dir = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"])
callings_locations = [calling_dir] * len_samples
callings_dict = dict(zip(sample_names, callings_locations))
def getVCFs(sample):
return(list(os.path.join(callings_dict[sample],"{0}_sorted_dedupped_snp_varscan.vcf".format(sample,pair)) for pair in ['']))
rule variant_annotation:
input:
calls= lambda wildcards: getVCFs(wildcards.sample),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls="variants.annotated.vcf",
stats="variants.html"
params:
plugins=["LoFtool"],
extra="--everything"
message: """--- Annotating Variants."""
resources:
mem = 30000,
time = 120
threads: 4
wrapper:
"0.64.0/bio/vep/annotate"
However, I get an error:
When I replace lambda wildcards with a calls= expand('{CALLING_DIR}/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf', CALLING_DIR=dirs_dict["CALLING_DIR"], CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names) ([which is not ideal - see this post for reason][1]) it give me errors about resources folder?
(snakemake) [moldach#cedar1 MTG353]$ snakemake -n -r
Building DAG of jobs...
MissingInputException in line 333 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/MTG353/Snakefile:
Missing input files for rule variant_annotation:
resources/vep/cache
resources/vep/plugins
I'm also [confused from the documentation as to how it knows which reference genome (version, _etc.) should be specified][2].
UPDATE:
Because of the character limit I cannot even respond to the two respondents so I will continue the issue here:
As #jafors mentioned the two wrappers solved the issue for cache and plugins - thanks!
Now I get an error from trying to run VEP though from the following rule:
rule variant_annotation:
input:
calls= expand('{CALLING_DIR}/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf', CALLING_DIR=dirs_dict["CALLING_DIR"], CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls=expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}.annotated.vcf', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names),
stats=expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}.html', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names)
params:
plugins=["LoFtool"],
extra="--everything"
message: """--- Annotating Variants."""
resources:
mem = 30000,
time = 120
threads: 4
wrapper:
"0.64.0/bio/vep/annotate"
this is the error I get from the log:
Building DAG of jobs...
Using shell: /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 variant_annotation
1
[Wed Aug 12 20:22:49 2020]
Job 0: --- Annotating Variants.
Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Traceback (most recent call last):
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpwx1u_776.wrapper.py", line 36, in <module>
if snakemake.output.calls.endswith(".vcf.gz"):
AttributeError: 'Namedlist' object has no attribute 'endswith'
[Wed Aug 12 20:22:53 2020]
Error in rule variant_annotation:
jobid: 0
output: ANNOTATION/VEP/BC1217.annotated.vcf, ANNOTATION/VEP/470.annotated.vcf, ANNOTATION/VEP/MTG109.annotated.vcf, ANNOTATION/VEP/BC1217.html, ANNOTATION/VEP/470.html, ANNOTATION/VEP/MTG$
conda-env: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
RuleException:
CalledProcessError in line 393 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile:
Command 'source /home/moldach/miniconda3/bin/activate '/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f'; set -euo pipefail; python /scratch/moldach/MADDOG/VCF-FILE$
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile", line 393, in __rule_variant_annotation
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.8.0/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
TO BE CLEAR:
This is the code I had running VEP prior to trying out the wrapper so I would like to preserve similar options (e.g. offline, etc.):
vep \
-i {input.sample} \
--species "caenorhabditis_elegans" \
--format "vcf" \
--everything \
--cache_version 100 \
--offline \
--force_overwrite \
--fasta {input.ref} \
--gff {input.annot} \
--tab \
--variant_class \
--regulatory \
--show_ref_allele \
--numbers \
--symbol \
--protein \
-o {params.sample}
UPDATE 2:
Yes the use of expand() was the issue. I remember this is why I like to use lambda or os.path.join() as rule input/output except for as you mentioned in rule all:
The following seems to get rid of that problem although I'm met with a new one:
rule variant_annotation:
input:
calls= lambda wildcards: getVCFs(wildcards.sample),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.annotated.vcf"),
stats=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.html")
Not sure why I get the unknown file type error - as I mentioned this was first tested out with the full command with the same input data?
Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Failed to open VARIANT_CALLING/varscan/MTG109_sorted_dedupped_snp_varscan.vcf: unknown file type
Possible precedence issue with control flow operator at /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Traceback (most recent call last):
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpsh388k23.wrapper.py", line 44, in <module>
"(bcftools view {snakemake.input.calls} | "
File "/home/moldach/bin/snakemake/lib/python3.8/site-packages/snakemake/shell.py", line 156, in __new__
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail; (bcftools view VARIANT_CALLING/varscan/MTG109_sorted_dedupped_snp_varscan.vcf | vep --everything --fork 4 --format vcf --vcf --cach$
[Thu Aug 13 09:02:22 2020]
Update 3:
bcftools view is giving the warning from the output of samtools mpileup/varscan pileup2snp:
def getDeduppedBamsIndex(sample):
return(list(os.path.join(aligns_dict[sample],"{0}.sorted.dedupped.bam.bai".format(sample,pair)) for pair in ['']))
rule mpilup:
input:
bam=lambda wildcards: getDeduppedBams(wildcards.sample),
reference_genome=os.path.join(dirs_dict["REF_DIR"],config["REF_GENOME"])
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
log:
os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_{contig}_samtools_mpileup.log")
params:
extra=lambda wc: "-r {}".format(wc.contig)
resources:
mem = 1000,
time = 30
wrapper:
"0.65.0/bio/samtools/mpileup"
rule mpileup_to_vcf:
input:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.vcf")
message:
"Calling SNP with Varscan2"
threads:
2 # Keep threading value to one for unzipped mpileup input
# Set it to two for zipped mipileup files
log:
os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"varscan_{sample}_{contig}.log")
resources:
mem = 1000,
time = 30
wrapper:
"0.65.0/bio/varscan/mpileup2snp"
rule vcf_merge:
input:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_I.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_II.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_III.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_IV.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_V.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_X.vcf"),
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_MtDNA.vcf")
output:
os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}.vcf")
log: os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_vcf-merge.log")
resources:
mem = 1000,
time = 10
threads: 1
message: """--- Merge VarScan by Chromosome."""
shell: """
awk 'FNR==1 && NR!=1 {{ while (/^<header>/) getline; }} 1 {{print}} ' {input} > {output}
"""
calling_dir = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"])
callings_locations = [calling_dir] * len_samples
callings_dict = dict(zip(sample_names, callings_locations))
def getVCFs(sample):
return(list(os.path.join(callings_dict[sample],"{0}.vcf".format(sample,pair)) for pair in ['']))
rule annotate_variants:
input:
calls=lambda wildcards: getVCFs(wildcards.sample),
cache="resources/vep/cache",
plugins="resources/vep/plugins",
output:
calls="{sample}.annotated.vcf",
stats="{sample}.html"
params:
# Pass a list of plugins to use, see https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html
# Plugin args can be added as well, e.g. via an entry "MyPlugin,1,FOO", see docs.
plugins=["LoFtool"],
extra="--everything" # optional: extra arguments
log:
"logs/vep/{sample}.log"
threads: 4
resources:
time=30,
mem=5000
wrapper:
"0.65.0/bio/vep/annotate"
If I run bcftools view on the output I get the error:
$ bcftools view variant_calling/varscan/MTG324.vcf
Failed to read from variant_calling/varscan/MTG324.vcf: unknown file type
About using the expand vs wildcard, it does not matter at all. The biostar post is just advice how to keep things readable. On the snakemake/programmatic side should not matter how you define you input, as long as it is correct.
The complaint about resources is that you define in the input of rule variant_annotation that resources/vep/cache and resources/vep/plugins are necessary inputs to be able to run variant_annotation. With this error snakemake is effectively telling you that those files do not exist, so it can not run the rule for you.
When I look at the code in the docs it seems like the cache directory as input should define which genome you use:
entrypath = get_only_child_dir(get_only_child_dir(Path(cache)))
species = entrypath.parent.name
release, build = entrypath.name.split("_")
Additionally to what Maarten said (the resources/vep/cache and resources/vep/plugins are just example paths to the required input which defines also which genome and version you want to use), you can get the cache and plugin directories easily with two other simple rules in your Snakefile using these wrappers:
https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/vep/cache.htm
https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/vep/plugins.html
EDIT
Glad this worked out for your first problem.
The second error seems to arise from the expand in the output.
Am I understanding correctly that you want to annotate all your vcfs one-by-one? So input is {sample}.vcf and output would be {sample}.annotated.vcf?
If that's the case, you probably don't want to use expand in this rule.
I am also not sure, why you would need the {ANNOT_DIR} and {ANNOT_TOOL} to be wildcards here. I guess if you are using VEP, the ANNOT_TOOL would always be VEP and the ANNOT_DIR will be ANNOTATION?
Then, you could write them directly in the output as ANNOTATION/VEP/{sample}.annotated.vcf.
Same for the {CALLING_DIR}, I guess this will always be the same directory, right? I get that the {CALLING_TOOL} might have more than one value if you used multiple callers on the samples.
If I am still on track, you have two wildcards you could want to expand on when using VEP, the {sample} and the {CALLING_TOOL}.
Just write
input:
calls: 'CALLDIR/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf',
cache="resources/vep/cache",
plugins="resources/vep/plugins"
output:
calls='ANNOTATION/VEP/{CALLING_TOOL}/{sample}.annotated.vcf',
stats='ANNOTATION/VEP/{CALLING_TOOL}/{sample}.html'
The expand belongs in your rule all or any other target rule that uses all annotated vcfs at once, sth. like this:
rule all:
input: expand('ANNOTATION/VEP/{CALLING_TOOL}/{sample}.annotated.vcf', CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names)
Then, the variant_annotation rule will run all the samples you expand on in rule all.
I hope I got your idea correctly and this helps.
EDIT2
Ok, seems like we are nearly done. The error you get is thrown by bcftools view - it indicates that something might be wrong with the vcf.
Did you try bcftools view with your vcf outside of the Snakefile? This would give us an idea if the problem arises during this rule or if the vcf is already somehow problematic.

A number of Trimmomatic trimming parameters not working in Snakemake wrapper

Previously my Trimmomatic shell command included the following trimmers:
ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
For the Snakemake wrapper for Trimmomatic only LEADING:3 and MINLEN:36 work in the params: trimmer:
rule trimming:
input:
r1 = lambda wildcards: getHome(wildcards.sample)[0],
r2 = lambda wildcards: getHome(wildcards.sample)[1]
output:
r1 = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R1_trim_paired.fastq.gz"),
r1_unpaired = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R1_trim_unpaired.fastq.gz"),
r2 = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R2_trim_paired.fastq.gz"),
r2_unpaired = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R2_trim_unpaired.fastq.gz")
log: os.path.join(dirs_dict["LOG_DIR"],config["TRIM_TOOL"],"{sample}.log")
threads: 32
params:
# list of trimmers (see manual)
trimmer=["LEADING:3", "MINLEN:36"],
# optional parameters
extra="",
compression_level="-9"
resources:
mem = 1000,
time = 120
message: """--- Trimming FASTQ files with Trimmomatic."""
wrapper:
"0.64.0/bio/trimmomatic/pe"
When trying to use any of the other parameters (ILLUMINACLIP:adapters.fa:2:30:10 TRAILING:3 SLIDINGWINDOW:4:15) it fails.
For example, trying only TRAILING:3:
rule trimming:
input:
r1 = lambda wildcards: getHome(wildcards.sample)[0],
r2 = lambda wildcards: getHome(wildcards.sample)[1]
output:
r1 = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R1_trim_paired.fastq.gz"),
r1_unpaired = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R1_trim_unpaired.fastq.gz"),
r2 = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R2_trim_paired.fastq.gz"),
r2_unpaired = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"],"{sample}_R2_trim_unpaired.fastq.gz")
log: os.path.join(dirs_dict["LOG_DIR"],config["TRIM_TOOL"],"{sample}.log")
threads: 32
params:
# list of trimmers (see manual)
trimmer=["TRAILING:3"],
# optional parameters
extra="",
compression_level="-9"
resources:
mem = 1000,
time = 120
message: """--- Trimming FASTQ files with Trimmomatic."""
wrapper:
"0.64.0/bio/trimmomatic/pe"
Results in the following error:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 qc_before_align_r1
1
[Mon Sep 14 13:42:08 2020]
Job 0: --- Quality check of raw data with FastQC before alignment.
Activating conda environment: /home/moldach/wrappers/.snakemake/conda/975fb1fd
Activating conda environment: /home/moldach/wrappers/.snakemake/conda/975fb1fd
Skipping ' 2> logs/fastqc/before_align/MTG324_R1.log' which didn't exist, or couldn't be read
Failed to process file MTG324_R1_trim_paired.fastq.gz
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Ran out of data in the middle of a fastq entry. Your file is probably truncated
at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:179)
at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:125)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:77)
at java.base/java.lang.Thread.run(Thread.java:834)
mv: cannot stat ‘/tmp/tmpsnncjthh/MTG324_R1_trim_paired_fastqc.html’: No such file or directory
Traceback (most recent call last):
File "/home/moldach/wrappers/.snakemake/scripts/tmpp34b98yj.wrapper.py", line 47, in <module>
shell("mv {html_path:q} {snakemake.output.html:q}")
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/shell.py", line 205, in __new__
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail; mv /tmp/tmpsnncjthh/MTG324_R1_trim_paired_fastqc.html qc/fastQC/before_align/MTG324_R1_trim_paired_fastqc.html' returned non-zero exit status $
[Mon Sep 14 13:45:16 2020]
Error in rule qc_before_align_r1:
jobid: 0
output: qc/fastQC/before_align/MTG324_R1_trim_paired_fastqc.html, qc/fastQC/before_align/MTG324_R1_trim_paired_fastqc.zip
log: logs/fastqc/before_align/MTG324_R1.log (check log file(s) for error message)
conda-env: /home/moldach/wrappers/.snakemake/conda/975fb1fd
RuleException:
CalledProcessError in line 181 of /home/moldach/wrappers/Trim:
Command 'source /home/moldach/anaconda3/bin/activate '/home/moldach/wrappers/.snakemake/conda/975fb1fd'; set -euo pipefail; python /home/moldach/wrappers/.snakemake/scripts/tmpp34b98yj.wrapper.py' retu$
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2189, in run_wrapper
File "/home/moldach/wrappers/Trim", line 181, in __rule_qc_before_align_r1
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 529, in _callback
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/concurrent/futures/thread.py", line 57, in run
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 515, in cached_or_run
File "/home/moldach/anaconda3/envs/snakemake/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2201, in run_wrapper
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

How to use output directories to aggregate files (and receive more informative error messages)?

The overall problem I'm trying to solve is a way to count the number of reads present in each file at every step of a QC pipeline I'm building. I have a shell script I've used in the past which takes in a directory and outputs the number of reads per file. Since I'm looking to use a directory as input, I tried following the format laid out by Rasmus in this post:
https://bitbucket.org/snakemake/snakemake/issues/961/rule-with-folder-as-input-and-output
Here is some example input created earlier in the pipeline:
$ ls -1 cut_reads/
97_R1_cut.fastq.gz
97_R2_cut.fastq.gz
98_R1_cut.fastq.gz
98_R2_cut.fastq.gz
99_R1_cut.fastq.gz
99_R2_cut.fastq.gz
And a simplified Snakefile to first aggregate all reads by creating symlinks in a new directory, and then use that directory as input for the read counting shell script:
import os
configfile: "config.yaml"
rule all:
input:
"read_counts/read_counts.txt"
rule agg_count:
input:
cut_reads = expand("cut_reads/{sample}_{rdir}_cut.fastq.gz", rdir=["R1", "R2"], sample=config["forward_reads"])
output:
cut_dir = directory("read_counts/cut_reads")
run:
os.makedir(output.cut_dir)
for read in input.cut_reads:
abspath = os.path.abspath(read)
shell("ln -s {abspath} {output.cut_dir}")
rule count_reads:
input:
cut_reads = "read_counts/cut_reads"
output:
"read_counts/read_counts.txt"
shell:
'''
readcounts.sh {input.cut_reads} >> {output}
'''
Everything's fine in the dry-run, but when I try to actually execute it, I get a fairly cryptic error message:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 agg_count
1 all
1 count_reads
3
[Tue Jun 18 11:31:22 2019]
rule agg_count:
input: cut_reads/99_R1_cut.fastq.gz, cut_reads/98_R1_cut.fastq.gz, cut_reads/97_R1_cut.fastq.gz, cut_reads/99_R2_cut.fastq.gz, cut_reads/98_R2_cut.fastq.gz, cut_reads/97_R2_cut.fastq.gz
output: read_counts/cut_reads
jobid: 2
Job counts:
count jobs
1 agg_count
1
[Tue Jun 18 11:31:22 2019]
Error in rule agg_count:
jobid: 0
output: read_counts/cut_reads
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/douglas/snakemake/scrap_directory/.snakemake/log/2019-06-18T113122.202962.snakemake.log
read_counts/ was created, but there's no cut_reads/ directory inside. No other error messages are present in the complete log. Anyone know what's going wrong or how to receive a more descriptive error message?
I'm also (obviously) fairly new to snakemake, so there might be a better way to go about this whole process. Any help is much appreciated!
... And it was a typo. Typical. os.makedir(output.cut_dir) should be os.makedirs(output.cut_dir). I'm still really curious why snakemake isn't displaying the AttributeError python throws when you try to run this:
AttributeError: module 'os' has no attribute 'makedir'
Is there somewhere this is stored or can be accessed to prevent future headaches?
Are you sure the error message is due to the typo in os.makedir? In this test script os.makedir does throw AttributeError ...:
rule all:
input:
'tmp.done',
rule one:
output:
x= 'tmp.done',
xdir= directory('tmp'),
run:
os.makedir(output.xdir)
When executed:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1 one
2
[Wed Jun 19 09:05:57 2019]
rule one:
output: tmp.done, tmp
jobid: 1
Job counts:
count jobs
1 one
1
[Wed Jun 19 09:05:57 2019]
Error in rule one:
jobid: 0
output: tmp.done, tmp
RuleException:
AttributeError in line 10 of /home/dario/Tritume/Snakefile:
module 'os' has no attribute 'makedir'
File "/home/dario/Tritume/Snakefile", line 10, in __rule_one
File "/home/dario/miniconda3/envs/tritume/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/dario/Tritume/.snakemake/log/2019-06-19T090557.113876.snakemake.log
Use f-string to resolve local variables like {abspath}:
for read in input.cut_reads:
abspath = os.path.abspath(read)
shell(f"ln -s {abspath} {output.cut_dir}")
Wrap the wildcards that snakemake resolves automatically into double braces inside of f-strings.

Parallel execution of a snakemake rule with same input and a range of values for a single parameter

I am transitioning a bash script to snakemake and I would like to parallelize a step I was previously handling with a for loop. The issue I am running into is that instead of running parallel processes, snakemake ends up trying to run one process with all parameters and fails.
My original bash script runs a program multiple times for a range of values of the parameter K.
for num in {1..3}
do
structure.py -K $num --input=fileprefix --output=fileprefix
done
There are multiple input files that start with fileprefix. And there are two main outputs per run, e.g. for K=1 they are fileprefix.1.meanP, fileprefix.1.meanQ. My config and snakemake files are as follows.
Config:
cat config.yaml
infile: fileprefix
K:
- 1
- 2
- 3
Snakemake:
configfile: 'config.yaml'
rule all:
input:
expand("output/{sample}.{K}.{ext}",
sample = config['infile'],
K = config['K'],
ext = ['meanQ', 'meanP'])
rule structure:
output:
"output/{sample}.{K}.meanQ",
"output/{sample}.{K}.meanP"
params:
prefix = config['infile'],
K = config['K']
threads: 3
shell:
"""
structure.py -K {params.K} \
--input=output/{params.prefix} \
--output=output/{params.prefix}
"""
This was executed with snakemake --cores 3. The problem persists when I only use one thread.
I expected the outputs described above for each value of K, but the run fails with this error:
RuleException:
CalledProcessError in line 84 of Snakefile:
Command ' set -euo pipefail; structure.py -K 1 2 3 --input=output/fileprefix \
--output=output/fileprefix ' returned non-zero exit status 2.
File "Snakefile", line 84, in __rule_Structure
File "snake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
When I set K to a single value such as K = ['1'], everything works. So the problem seems to be that {params.K} is being expanded to all values of K when the shell command is executed. I started teaching myself snakemake today, and it works really well, but I'm hitting a brick wall with this.
You need to retrieve the argument for -K from the wildcards, not from the config file. The config file will simply return your list of possible values, it is a plain python dictionary.
configfile: 'config.yaml'
rule all:
input:
expand("output/{sample}.{K}.{ext}",
sample = config['infile'],
K = config['K'],
ext = ['meanQ', 'meanP'])
rule structure:
output:
"output/{sample}.{K}.meanQ",
"output/{sample}.{K}.meanP"
params:
prefix = config['invcf'],
K = config['K']
threads: 3
shell:
"structure.py -K {wildcards.K} "
"--input=output/{params.prefix} "
"--output=output/{params.prefix}"
Note that there are more things to improve here. For example, the rule structure does not define any input file, although it uses one.
There is an option now for parameter space exploration
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#parameter-space-exploration