snakemake: how to implement log directive when using run directive?

Snakemake allows creating a log for each rule with the log directive, which specifies the name of the log file. It is relatively straightforward to pipe results from shell output to this log, but I have not been able to figure out a way to log the output of a run directive (i.e. Python code).
One workaround is to save the Python code in a script and then run it from the shell, but I wonder if there is another way.

I have some rules that use both the log and run directives. In the run directive, I "manually" open and write the log file.
For instance:
rule compute_RPM:
    input:
        counts_table = source_small_RNA_counts,
        summary_table = rules.gather_read_counts_summaries.output.summary_table,
        tags_table = rules.associate_small_type.output.tags_table,
    output:
        RPM_table = OPJ(
            annot_counts_dir,
            "all_{mapped_type}_on_%s" % genome, "{small_type}_RPM.txt"),
    log:
        log = OPJ(log_dir, "compute_RPM_{mapped_type}", "{small_type}.log"),
    benchmark:
        OPJ(log_dir, "compute_RPM_{mapped_type}", "{small_type}_benchmark.txt"),
    run:
        with open(log.log, "w") as logfile:
            logfile.write(f"Reading column counts from {input.counts_table}\n")
            counts_data = pd.read_table(
                input.counts_table,
                index_col="gene")
            logfile.write(f"Reading number of non-structural mappers from {input.summary_table}\n")
            norm = pd.read_table(input.summary_table, index_col=0).loc["non_structural"]
            logfile.write(str(norm))
            logfile.write("Computing counts per million non-structural mappers\n")
            RPM = 1000000 * counts_data / norm
            add_tags_column(RPM, input.tags_table, "small_type").to_csv(output.RPM_table, sep="\t")
For third-party code that writes to stdout, maybe the redirect_stdout context manager could be helpful (found in https://stackoverflow.com/a/40417352/1878788, documented at
https://docs.python.org/3/library/contextlib.html#contextlib.redirect_stdout).
Test snakefile, test_run_log.snakefile:
from contextlib import redirect_stdout

rule all:
    input:
        "test_run_log.txt"

rule test_run_log:
    output:
        "test_run_log.txt"
    log:
        "test_run_log.log"
    run:
        with open(log[0], "w") as log_file:
            with redirect_stdout(log_file):
                print(f"Writing result to {output[0]}")
                with open(output[0], "w") as out_file:
                    out_file.write("result\n")
Running it:
$ snakemake -s test_run_log.snakefile
Results:
$ cat test_run_log.log
Writing result to test_run_log.txt
$ cat test_run_log.txt
result
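Presumably the same trick works for messages printed to stderr, using contextlib.redirect_stderr from the same module. An untested sketch, with a made-up rule name:
from contextlib import redirect_stderr
import sys

rule test_run_log_stderr:
    output:
        "test_run_log_stderr.txt"
    log:
        "test_run_log_stderr.log"
    run:
        with open(log[0], "w") as log_file:
            with redirect_stderr(log_file):
                # anything third-party code prints to stderr ends up in the log
                print(f"Writing result to {output[0]}", file=sys.stderr)
                with open(output[0], "w") as out_file:
                    out_file.write("result\n")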

My solution was the following. This is useful both for a normal log and for logging exceptions with a traceback. You can then wrap the logger setup in a function to make it more organized. It's not very pretty, though. It would be much nicer if snakemake could do it by itself.
import logging

# some stuff

rule logging_test:
    input: 'input.json'
    output: 'output.json'
    log: 'rules_logs/logging_test.log'
    run:
        logger = logging.getLogger('logging_test')
        fh = logging.FileHandler(str(log))
        fh.setLevel(logging.INFO)
        formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
        fh.setFormatter(formatter)
        logger.addHandler(fh)
        try:
            logger.info('Starting operation!')
            # do something
            with open(str(output), 'w') as f:
                f.write('success!')
            logger.info('Ended!')
        except Exception as e:
            logger.error(e, exc_info=True)
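For example, a minimal sketch of what wrapping the setup in a helper could look like (setup_rule_logger is a made-up name, not part of the original code):
import logging

def setup_rule_logger(name, log_path, level=logging.INFO):
    # Create a logger that writes timestamped messages to the rule's log file.
    logger = logging.getLogger(name)
    logger.setLevel(level)
    fh = logging.FileHandler(log_path)
    fh.setLevel(level)
    fh.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
    logger.addHandler(fh)
    return logger

rule logging_test:
    input: 'input.json'
    output: 'output.json'
    log: 'rules_logs/logging_test.log'
    run:
        logger = setup_rule_logger('logging_test', str(log))
        try:
            logger.info('Starting operation!')
            # do something
            with open(str(output), 'w') as f:
                f.write('success!')
            logger.info('Ended!')
        except Exception as e:
            logger.error(e, exc_info=True)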

Related

How does snakemake --show-failed-logs work

I want to use snakemake with the --show-failed-logs parameter, but I am not sure how it works or what to expect.
For example (this is not a working example; I just typed and copied some code parts, but I can create a working example if necessary):
My command is:
snakemake -s Snakefile_test.smk --configfile test.yml -j 4 --restart-times 2 --show-failed-logs
In Snakefile_test.smk I have a rule:
rule testrule:
    input:
        R1_trimmed = rules.trim.output.R1_trimmed,
        R2_trimmed = rules.trim.output.R2_trimmed
    output:
        plot_R1 = output+"/{sample}/figures/{sample}_R1_plot.png",
        plot_R2 = output+"/{sample}/figures/{sample}_R2_plot.png",
    log:
        testrule_log = output+"/{sample}/logs/plot_log.txt"
    run:
        shell("python scripts/createplot.py -r1_trimmed {input.R1_trimmed} -r2_trimmed {input.R2_trimmed} -plot_r1 {output.plot_R1} -plot_r2 {output.plot_R2} -log {log.testrule_log}")
In createplot.py to test I have something like:
if __name__ == "__main__":
    logging.basicConfig(filename=args.log, level=logging.DEBUG, format='%(asctime)s %(levelname)s %(name)s %(message)s')
    logger = logging.getLogger(__name__)
    # to add some content to the log file
    try:
        1/0
    except ZeroDivisionError as err:
        logger.error(err)
    main()
    # to let the script crash
    a = 1/0
Because createplot.py will obviously crash the pipeline, I expected snakemake to print the contents of testrule_log to the screen, but that is not the case. I am using version 5.25.0.
EDIT:
I am familiar with
script:
    "scripts/createplot.py"
but I need to use shell in this case. If that causes the problem, let me know.
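One hedged guess at a workaround: as far as I understand, --show-failed-logs prints the contents of the files declared under log: for failed jobs, so the Python traceback has to actually end up in that file. Redirecting the shell() call's own output into the log, i.e. replacing the run block above with something like the sketch below, might achieve that:
run:
    # also append the command's stdout/stderr to the rule's log, so the
    # traceback from the crash lands in the file that --show-failed-logs reads
    shell("python scripts/createplot.py -r1_trimmed {input.R1_trimmed} -r2_trimmed {input.R2_trimmed} -plot_r1 {output.plot_R1} -plot_r2 {output.plot_R2} -log {log.testrule_log} >> {log.testrule_log} 2>&1")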

Problems with the VEP snakemake wrapper

I'm experiencing two issues trying to run the VEP wrapper for snakemake.
The first is that I would like to use lambda wildcards in calls like so:
calling_dir = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"])
callings_locations = [calling_dir] * len_samples
callings_dict = dict(zip(sample_names, callings_locations))

def getVCFs(sample):
    return(list(os.path.join(callings_dict[sample],"{0}_sorted_dedupped_snp_varscan.vcf".format(sample,pair)) for pair in ['']))
rule variant_annotation:
    input:
        calls= lambda wildcards: getVCFs(wildcards.sample),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls="variants.annotated.vcf",
        stats="variants.html"
    params:
        plugins=["LoFtool"],
        extra="--everything"
    message: """--- Annotating Variants."""
    resources:
        mem = 30000,
        time = 120
    threads: 4
    wrapper:
        "0.64.0/bio/vep/annotate"
However, I get an error.
When I replace the lambda wildcards with calls= expand('{CALLING_DIR}/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf', CALLING_DIR=dirs_dict["CALLING_DIR"], CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names) ([which is not ideal - see this post for reason][1]), it gives me errors about the resources folder:
(snakemake) [moldach#cedar1 MTG353]$ snakemake -n -r
Building DAG of jobs...
MissingInputException in line 333 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/MTG353/Snakefile:
Missing input files for rule variant_annotation:
resources/vep/cache
resources/vep/plugins
I'm also [confused from the documentation as to how it knows which reference genome (version, etc.) should be specified][2].
UPDATE:
Because of the character limit I cannot even respond to the two respondents, so I will continue the issue here:
As @jafors mentioned, the two wrappers solved the issue for cache and plugins - thanks!
Now I get an error when trying to run VEP, though, from the following rule:
rule variant_annotation:
    input:
        calls= expand('{CALLING_DIR}/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf', CALLING_DIR=dirs_dict["CALLING_DIR"], CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls=expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}.annotated.vcf', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names),
        stats=expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}.html', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names)
    params:
        plugins=["LoFtool"],
        extra="--everything"
    message: """--- Annotating Variants."""
    resources:
        mem = 30000,
        time = 120
    threads: 4
    wrapper:
        "0.64.0/bio/vep/annotate"
This is the error I get from the log:
Building DAG of jobs...
Using shell: /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 variant_annotation
1
[Wed Aug 12 20:22:49 2020]
Job 0: --- Annotating Variants.
Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Traceback (most recent call last):
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpwx1u_776.wrapper.py", line 36, in <module>
if snakemake.output.calls.endswith(".vcf.gz"):
AttributeError: 'Namedlist' object has no attribute 'endswith'
[Wed Aug 12 20:22:53 2020]
Error in rule variant_annotation:
jobid: 0
output: ANNOTATION/VEP/BC1217.annotated.vcf, ANNOTATION/VEP/470.annotated.vcf, ANNOTATION/VEP/MTG109.annotated.vcf, ANNOTATION/VEP/BC1217.html, ANNOTATION/VEP/470.html, ANNOTATION/VEP/MTG$
conda-env: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
RuleException:
CalledProcessError in line 393 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile:
Command 'source /home/moldach/miniconda3/bin/activate '/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f'; set -euo pipefail; python /scratch/moldach/MADDOG/VCF-FILE$
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile", line 393, in __rule_variant_annotation
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.8.0/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
TO BE CLEAR:
This is the code I had running VEP prior to trying out the wrapper, so I would like to preserve similar options (e.g. offline, etc.):
vep \
    -i {input.sample} \
    --species "caenorhabditis_elegans" \
    --format "vcf" \
    --everything \
    --cache_version 100 \
    --offline \
    --force_overwrite \
    --fasta {input.ref} \
    --gff {input.annot} \
    --tab \
    --variant_class \
    --regulatory \
    --show_ref_allele \
    --numbers \
    --symbol \
    --protein \
    -o {params.sample}
UPDATE 2:
Yes, the use of expand() was the issue. I remember now that this is why I like to use lambda or os.path.join() for rule input/output, except, as you mentioned, in rule all.
The following seems to get rid of that problem, although I'm met with a new one:
rule variant_annotation:
    input:
        calls= lambda wildcards: getVCFs(wildcards.sample),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.annotated.vcf"),
        stats=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.html")
I'm not sure why I get the unknown file type error; as I mentioned, this was first tested with the full command on the same input data.
Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Failed to open VARIANT_CALLING/varscan/MTG109_sorted_dedupped_snp_varscan.vcf: unknown file type
Possible precedence issue with control flow operator at /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Traceback (most recent call last):
File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpsh388k23.wrapper.py", line 44, in <module>
"(bcftools view {snakemake.input.calls} | "
File "/home/moldach/bin/snakemake/lib/python3.8/site-packages/snakemake/shell.py", line 156, in __new__
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail; (bcftools view VARIANT_CALLING/varscan/MTG109_sorted_dedupped_snp_varscan.vcf | vep --everything --fork 4 --format vcf --vcf --cach$
[Thu Aug 13 09:02:22 2020]
UPDATE 3:
bcftools view is giving the warning on the output of the samtools mpileup/varscan pileup2snp steps below:
def getDeduppedBamsIndex(sample):
    return(list(os.path.join(aligns_dict[sample],"{0}.sorted.dedupped.bam.bai".format(sample,pair)) for pair in ['']))

rule mpilup:
    input:
        bam=lambda wildcards: getDeduppedBams(wildcards.sample),
        reference_genome=os.path.join(dirs_dict["REF_DIR"],config["REF_GENOME"])
    output:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_{contig}_samtools_mpileup.log")
    params:
        extra=lambda wc: "-r {}".format(wc.contig)
    resources:
        mem = 1000,
        time = 30
    wrapper:
        "0.65.0/bio/samtools/mpileup"

rule mpileup_to_vcf:
    input:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.mpileup.gz"),
    output:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_{contig}.vcf")
    message:
        "Calling SNP with Varscan2"
    threads:
        2  # Keep threading value to one for unzipped mpileup input
           # Set it to two for zipped mpileup files
    log:
        os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"varscan_{sample}_{contig}.log")
    resources:
        mem = 1000,
        time = 30
    wrapper:
        "0.65.0/bio/varscan/mpileup2snp"

rule vcf_merge:
    input:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_I.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_II.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_III.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_IV.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_V.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_X.vcf"),
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}_MtDNA.vcf")
    output:
        os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"],"{sample}.vcf")
    log: os.path.join(dirs_dict["LOG_DIR"],config["CALLING_TOOL"],"{sample}_vcf-merge.log")
    resources:
        mem = 1000,
        time = 10
    threads: 1
    message: """--- Merge VarScan by Chromosome."""
    shell: """
        awk 'FNR==1 && NR!=1 {{ while (/^<header>/) getline; }} 1 {{print}} ' {input} > {output}
        """

calling_dir = os.path.join(dirs_dict["CALLING_DIR"],config["CALLING_TOOL"])
callings_locations = [calling_dir] * len_samples
callings_dict = dict(zip(sample_names, callings_locations))

def getVCFs(sample):
    return(list(os.path.join(callings_dict[sample],"{0}.vcf".format(sample,pair)) for pair in ['']))

rule annotate_variants:
    input:
        calls=lambda wildcards: getVCFs(wildcards.sample),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls="{sample}.annotated.vcf",
        stats="{sample}.html"
    params:
        # Pass a list of plugins to use, see https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html
        # Plugin args can be added as well, e.g. via an entry "MyPlugin,1,FOO", see docs.
        plugins=["LoFtool"],
        extra="--everything"  # optional: extra arguments
    log:
        "logs/vep/{sample}.log"
    threads: 4
    resources:
        time=30,
        mem=5000
    wrapper:
        "0.65.0/bio/vep/annotate"
If I run bcftools view on the output I get the error:
$ bcftools view variant_calling/varscan/MTG324.vcf
Failed to read from variant_calling/varscan/MTG324.vcf: unknown file type
About using expand vs. wildcards: it does not matter at all. The Biostars post is just advice on how to keep things readable. On the snakemake/programmatic side it should not matter how you define your input, as long as it is correct.
The complaint about resources is that in the input of rule variant_annotation you define resources/vep/cache and resources/vep/plugins as necessary inputs for running variant_annotation. With this error snakemake is effectively telling you that those files do not exist, so it cannot run the rule for you.
When I look at the code in the docs, it seems like the cache directory given as input defines which genome you use:
entrypath = get_only_child_dir(get_only_child_dir(Path(cache)))
species = entrypath.parent.name
release, build = entrypath.name.split("_")
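In other words, the wrapper walks two directory levels into the cache and parses species, release and build from the path. For the C. elegans setup further down, the cache would presumably need to look something like this (illustrative layout, assuming release 100 and the WBcel235 assembly):
resources/vep/cache/
    caenorhabditis_elegans/
        100_WBcel235/
which would give species = "caenorhabditis_elegans", release = "100" and build = "WBcel235".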
In addition to what Maarten said (resources/vep/cache and resources/vep/plugins are just example paths to the required input, which also defines which genome and version you want to use), you can get the cache and plugin directories easily with two other simple rules in your Snakefile using these wrappers:
https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/vep/cache.htm
https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/vep/plugins.html
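A rough sketch of what those two rules might look like for this C. elegans setup (species, build and release are my guesses taken from the vep command further down, and the wrapper versions are examples; check the wrapper docs for the exact params they accept):
rule get_vep_cache:
    output:
        directory("resources/vep/cache")
    params:
        species="caenorhabditis_elegans",
        build="WBcel235",
        release="100"
    log:
        "logs/vep/cache.log"
    wrapper:
        "0.64.0/bio/vep/cache"

rule download_vep_plugins:
    output:
        directory("resources/vep/plugins")
    params:
        release=100
    log:
        "logs/vep/plugins.log"
    wrapper:
        "0.64.0/bio/vep/plugins"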
EDIT
Glad this worked out for your first problem.
The second error seems to arise from the expand in the output.
Am I understanding correctly that you want to annotate all your vcfs one-by-one? So input is {sample}.vcf and output would be {sample}.annotated.vcf?
If that's the case, you probably don't want to use expand in this rule.
I am also not sure why you would need {ANNOT_DIR} and {ANNOT_TOOL} to be wildcards here. I guess if you are using VEP, the ANNOT_TOOL would always be VEP and the ANNOT_DIR would always be ANNOTATION?
Then, you could write them directly in the output as ANNOTATION/VEP/{sample}.annotated.vcf.
Same for the {CALLING_DIR}, I guess this will always be the same directory, right? I get that the {CALLING_TOOL} might have more than one value if you used multiple callers on the samples.
If I am still on track, you have two wildcards you could want to expand on when using VEP, the {sample} and the {CALLING_TOOL}.
Just write
input:
    calls='CALLDIR/{CALLING_TOOL}/{sample}_sorted_dedupped_snp_varscan.vcf',
    cache="resources/vep/cache",
    plugins="resources/vep/plugins"
output:
    calls='ANNOTATION/VEP/{CALLING_TOOL}/{sample}.annotated.vcf',
    stats='ANNOTATION/VEP/{CALLING_TOOL}/{sample}.html'
The expand belongs in your rule all or in any other target rule that uses all annotated vcfs at once, something like this:
rule all:
    input:
        expand('ANNOTATION/VEP/{CALLING_TOOL}/{sample}.annotated.vcf', CALLING_TOOL=config["CALLING_TOOL"], sample=sample_names)
Then, the variant_annotation rule will run all the samples you expand on in rule all.
I hope I got your idea correctly and this helps.
EDIT2
Ok, seems like we are nearly done. The error you get is thrown by bcftools view - it indicates that something might be wrong with the vcf.
Did you try bcftools view with your vcf outside of the Snakefile? This would give us an idea if the problem arises during this rule or if the vcf is already somehow problematic.

Snakemake: variable that defines whether the process is a submitted cluster job or the main snakefile

My current architecture is that at the start of my Snakefile I have a long-running function somefunc which helps decide the "input" to rule all. I realized when running the workflow with slurm that somefunc is being executed by each job. Is there some variable I can access that defines whether the code is running in a submitted job or in the main process:
if not snakemake.submitted_job:
    config['layout'] = somefunc()
    ...
A solution which I don't really recommend is to make somefunc write the list of inputs to a tmp file, so that slurm jobs read this tmp file rather than reconstructing the list from scratch. The tmp file is created by whatever job is executed first, so the long-running part is done only once.
At the end of the workflow, delete the tmp file so that later executions will start fresh with new input.
Here's a sketch:
def somefunc():
    try:
        all_output = open('tmp.txt').readlines()
        all_output = [x.strip() for x in all_output]
        print('List of input files read from tmp.txt')
    except:
        all_output = ['file1.txt', 'file2.txt'] # Long running part
        with open('tmp.txt', 'w') as fout:
            for x in all_output:
                fout.write(x + '\n')
        print('List of input files created and written to tmp.txt')
    return all_output

all_output = somefunc()

rule all:
    input:
        all_output,

rule one:
    output:
        all_output,
    shell:
        r"""
        touch {output}
        """

onsuccess:
    os.remove('tmp.txt')

onerror:
    os.remove('tmp.txt')
Since jobs will be submitted in parallel, you should make sure that only one job writes tmp.txt and the others read it. I think the try/except above will do it, but I'm not 100% sure. (Probably you want to use a better filename than tmp.txt; see the module tempfile. See also the module atexit for exit handlers.)
As discussed with @dariober, it seems cleanest to check whether the (hidden) .snakemake directory has locks, since they seem not to be generated until the first rule starts (assuming you are not using the --nolock argument).
import os
locked = len(os.listdir(".snakemake/locks")) > 0
However, this results in a problem in my case:
import time
import os

def longfunc():
    time.sleep(10)
    return range(5)

locked = len(os.listdir(".snakemake/locks")) > 0
if not locked:
    info = longfunc()

rule all:
    input:
        expand("test_{sample}", sample=info)

rule test:
    output:
        touch("test_{sample}")
    run:
        """
        sleep 1
        """
Somehow snakemake makes each job re-evaluate the complete Snakefile, with the issue that all the jobs complain that 'info' is not defined. For me it was easiest to store the results and load them for each job (pickle.dump and pickle.load).
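A minimal sketch of that pickle-based caching (the cache file name layout.pkl is made up):
import os
import pickle
import time

def longfunc():
    time.sleep(10)
    return list(range(5))

CACHE = "layout.pkl"  # hypothetical cache file name
locked = os.path.isdir(".snakemake/locks") and len(os.listdir(".snakemake/locks")) > 0

if not locked:
    # main invocation: compute once and store the result
    info = longfunc()
    with open(CACHE, "wb") as fh:
        pickle.dump(info, fh)
else:
    # re-evaluation inside a submitted job: just load the stored result
    with open(CACHE, "rb") as fh:
        info = pickle.load(fh)

rule all:
    input:
        expand("test_{sample}", sample=info)

rule test:
    output:
        touch("test_{sample}")
    shell:
        "sleep 1"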

Parallel execution of a snakemake rule with same input and a range of values for a single parameter

I am transitioning a bash script to snakemake and I would like to parallelize a step I was previously handling with a for loop. The issue I am running into is that instead of running parallel processes, snakemake ends up trying to run one process with all parameters and fails.
My original bash script runs a program multiple times for a range of values of the parameter K.
for num in {1..3}
do
    structure.py -K $num --input=fileprefix --output=fileprefix
done
There are multiple input files that start with fileprefix, and there are two main outputs per run, e.g. for K=1 they are fileprefix.1.meanP and fileprefix.1.meanQ. My config and snakemake files are as follows.
Config:
$ cat config.yaml
infile: fileprefix
K:
  - 1
  - 2
  - 3
Snakemake:
configfile: 'config.yaml'

rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
               sample = config['infile'],
               K = config['K'],
               ext = ['meanQ', 'meanP'])

rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        prefix = config['infile'],
        K = config['K']
    threads: 3
    shell:
        """
        structure.py -K {params.K} \
        --input=output/{params.prefix} \
        --output=output/{params.prefix}
        """
This was executed with snakemake --cores 3. The problem persists when I only use one thread.
I expected the outputs described above for each value of K, but the run fails with this error:
RuleException:
CalledProcessError in line 84 of Snakefile:
Command ' set -euo pipefail; structure.py -K 1 2 3 --input=output/fileprefix \
--output=output/fileprefix ' returned non-zero exit status 2.
File "Snakefile", line 84, in __rule_Structure
File "snake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
When I set K to a single value such as K = ['1'], everything works. So the problem seems to be that {params.K} is being expanded to all values of K when the shell command is executed. I started teaching myself snakemake today, and it works really well, but I'm hitting a brick wall with this.
You need to retrieve the argument for -K from the wildcards, not from the config file. The config file will simply return your list of possible values; it is a plain Python dictionary.
configfile: 'config.yaml'

rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
               sample = config['infile'],
               K = config['K'],
               ext = ['meanQ', 'meanP'])

rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        prefix = config['infile'],
        K = config['K']
    threads: 3
    shell:
        "structure.py -K {wildcards.K} "
        "--input=output/{params.prefix} "
        "--output=output/{params.prefix}"
Note that there are more things to improve here. For example, the rule structure does not define any input file, although it uses one.
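For example, the input could be declared along these lines so that snakemake can track it (the .str extension is a placeholder; use whatever files actually share fileprefix):
rule structure:
    input:
        # placeholder; replace with the real file(s) that share fileprefix
        "output/{sample}.str"
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        prefix = config['infile']
    threads: 3
    shell:
        "structure.py -K {wildcards.K} "
        "--input=output/{params.prefix} "
        "--output=output/{params.prefix}"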
There is now also an option for parameter space exploration:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#parameter-space-exploration
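For completeness, a small sketch of what the Paramspace approach from that doc section could look like here. Note that it encodes parameters into file names as K~1 etc., so the output naming differs from fileprefix.1.meanQ, and params.tsv is a made-up file listing the K values:
from snakemake.utils import Paramspace
import pandas as pd

# params.tsv: a single column named "K" with one row per value (1, 2, 3)
paramspace = Paramspace(pd.read_csv("params.tsv", sep="\t"))

rule all:
    input:
        expand("output/fileprefix.{params}.meanQ", params=paramspace.instance_patterns)

rule structure:
    output:
        f"output/fileprefix.{paramspace.wildcard_pattern}.meanQ",
        f"output/fileprefix.{paramspace.wildcard_pattern}.meanP"
    params:
        instance=paramspace.instance
    shell:
        "structure.py -K {params.instance[K]} "
        "--input=output/fileprefix "
        "--output=output/fileprefix"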

snakemake STAR module issue and extra question

I discovered that the snakemake STAR module outputs as 'BAM Unsorted'.
Q1: Is there a way to change this to:
--outSAMtype BAM SortedByCoordinate
When I add the option to the 'extra' options, I get an error message about a duplicate definition:
EXITING: FATAL INPUT ERROR: duplicate parameter "outSAMtype" in input "Command-Line"
SOLUTION: keep only one definition of input parameters in each input source
Nov 15 09:46:07 ...... FATAL ERROR, exiting
logs/star/se/UY2_S7.log
Should I consider adding a sorting module after STAR instead?
Q2: How can I take a module from the wrapper repo and make it a local module, allowing me to edit it?
The code:
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = [snakemake.input.fq1] if isinstance(snakemake.input.fq1, str) else snakemake.input.fq1
fq2 = snakemake.input.get("fq2")
if fq2:
    fq2 = [snakemake.input.fq2] if isinstance(snakemake.input.fq2, str) else snakemake.input.fq2
    assert len(fq1) == len(fq2), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str = " ".join([input_str_fq1, input_str_fq2])

if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand zcat"
else:
    readcmd = ""

outprefix = os.path.dirname(snakemake.output[0]) + "/"

shell(
    "STAR "
    "{extra} "
    "--runThreadN {snakemake.threads} "
    "--genomeDir {snakemake.params.index} "
    "--readFilesIn {input_str} "
    "{readcmd} "
    "--outSAMtype BAM Unsorted "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "{log}")
Q1: Is there a way to change this to:
--outSAMtype BAM SortedByCoordinate
I would add another sorting rule after the wrapper, as that is the most 'standardized' way of doing it. You can also use another wrapper for sorting.
There is an explanation from the author of snakemake of why the default is unsorted and why there is no option for sorted output in the wrapper:
https://bitbucket.org/snakemake/snakemake/issues/440/pre-post-wrapper
Regarding the SAM/BAM issue, I would say any wrapper should always output the optimal file format. Hence, whenever I write a wrapper for a read mapper, I ensure that output is not SAM. Indexing and sorting should not be part of the same wrapper I think, because such a task has a completely different behavior regarding parallelization. Also, you would lose the mapping output if something goes wrong during the sorting or indexing.
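For instance, a simple follow-up sorting rule could look like this (plain shell version; a samtools sort wrapper could be used instead, and the output path is just an example):
rule samtools_sort:
    input:
        "star/{sample}/Aligned.out.bam"
    output:
        "star/{sample}/Aligned.sortedByCoord.out.bam"
    threads: 4
    shell:
        "samtools sort -@ {threads} -o {output} {input}"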
Q2: How can I take a module from the wrapper repo and make it a local module, allowing me to edit it?
If you wanted to do this, one way would be to download a local copy of the wrapper. In the shell portion of the downloaded wrapper, change Unsorted to {snakemake.params.outsamtype}. In your Snakefile, change wrapper to script, point it at path/to/downloaded/wrapper, and add the outsamtype parameter:
rule star_se:
    input:
        fq1 = "reads/{sample}_R1.1.fastq"
    output:
        # see STAR manual for additional output files
        "star/{sample}/Aligned.out.bam"
    log:
        "logs/star/{sample}.log"
    params:
        # path to STAR reference genome index
        index="index",
        # optional parameters
        extra="",
        outsamtype = "SortedByCoordinate"
    threads: 8
    script:
        "path/to/downloaded/wrapper"
I think a separate rule without a wrapper for sorting, or even making your own STAR rule, is better. Modifying the wrapper defeats the whole purpose of it.
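If you do write your own STAR rule instead, a rough shell-based sketch (the index path and output layout are assumptions) might be:
rule star_se_sorted:
    input:
        fq1="reads/{sample}_R1.1.fastq"
    output:
        "star/{sample}/Aligned.sortedByCoord.out.bam"
    log:
        "logs/star/{sample}.log"
    params:
        index="index",   # path to the STAR genome index
        extra=""         # any additional STAR options
    threads: 8
    shell:
        "STAR {params.extra} "
        "--runThreadN {threads} "
        "--genomeDir {params.index} "
        "--readFilesIn {input.fq1} "
        "--outSAMtype BAM SortedByCoordinate "
        "--outFileNamePrefix star/{wildcards.sample}/ "
        "--outStd Log "
        "> {log} 2>&1"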