Log executed shell command from snakemake

I'd like to save the shell command executed by each snakemake job to a log file.
With --printshellcmds I can print the shell commands to stdout as they are submitted, but I would like to save them to individual files. For example, this Snakefile:
samples = ['a', 'b', 'c']

rule all:
    input:
        expand('{sample}.done', sample=samples),

rule do_stuff:
    output:
        '{sample}.done'
    shell:
        """
        echo {wildcards.sample} > {output}
        """
will execute the 3 jobs:
echo a > a.done
echo c > c.done
echo b > b.done
I'd like to save each of these to a separate file named after a wildcard (or whatever). Like:
rule do_stuff:
    output:
        '{sample}.done'
    log_shell:              # <- save the shell command
        '{sample}.cmd.log'  # <- to this file
    shell:
        """
        echo {wildcards.sample} > {output}
        """
Is this possible? Thank you!

While fiddling with other issues, I came up with this solution: combine the run directive with the shell function to first write the command to a file and then execute the command itself. Example:
rule do_stuff:
    output:
        '{sample}.done'
    log:
        cmd='my/logs/{sample}.log'
    run:
        cmd = """
        echo hello > {output}
        """
        shell("sort --version > {log.cmd} || true")  # Log version
        shell("cat <<'EOF' >> {log.cmd}" + cmd)      # Log command
        shell(cmd)                                   # Exec command
For each {sample} you get a file my/logs/{sample}.log that looks like this:
sort (GNU coreutils) 5.93
Copyright (C) 2005 Free Software Foundation, Inc.
echo hello > a.done
This does the trick. However, it makes the code more cluttered and you effectively lose the --printshellcmds option.
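A less cluttered variant (just a sketch, untested; the params.cmd name and the lambda are my additions, not part of the original answer) builds the command string once as a params entry, so the same string is both logged and executed and --printshellcmds still shows the full shell block:

rule do_stuff:
    output:
        '{sample}.done'
    log:
        cmd='my/logs/{sample}.cmd.log'
    params:
        # build the command once; snakemake passes wildcards and output to the lambda
        cmd=lambda wildcards, output: 'echo {w} > {o}'.format(w=wildcards.sample, o=output[0])
    shell:
        """
        echo '{params.cmd}' > {log.cmd}  # log the command (naive quoting: breaks if the command contains single quotes)
        {params.cmd}
        """

Because the logged string and the executed string come from the same params entry, they cannot drift apart.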

Related

snakemake - replacing command line parameters with wildcards by cluster profile

I am writing a snakemake pipeline to eventually identify coronavirus variants.
Below is a minimal example with three steps:
LOGDIR = '/path/to/logDir'

barcodes = ['barcode49', 'barcode50', 'barcode51']

rule all:
    input:
        expand([
            # guppyplex
            "out/guppyplex/{barcode}/{barcode}.fastq",
            # catFasta
            "out/catFasta/cat_consensus.fasta",
        ], barcode=barcodes)

rule guppyplex:
    input:
        FQ = f"fastq/{{barcode}}"  # FASTQ_PATH is parsed from config.yaml
    output:
        "out/guppyplex/{barcode}/{barcode}.fastq"
    shell:
        "touch {output}"  # variables in CAPITALS are parsed from config.yaml

rule minion:
    input:
        INFQ = rules.guppyplex.output,
        FAST5 = f"fasta/{{barcode}}"
    params:
        OUTDIR = "out/nanopolish/{barcode}"
    output:
        "out/nanopolish/{barcode}/{barcode}.consensus.fasta"
    shell:
        """
        touch {output} && echo {wildcards.barcode} > {output}
        """

rule catFasta:
    input:
        expand("out/nanopolish/{barcode}/{barcode}.consensus.fasta", barcode=barcodes)
    output:
        "out/catFasta/cat_consensus.fasta"
    shell:
        "cat {input} > {output}"
If I run snakemake locally by calling snakemake -p --cores 1 all, everything works. Yet my ultimate goal is to use qsub to run the jobs on a cluster. I also want the stderr and stdout from qsub to have meaningful names, which include the wildcards and the rule name for each job.
However, if I call snakemake with
snakemake -p --cluster "qsub -q onlybngs05b -e {LOGDIR} -o {LOGDIR} -j y" -j 5 --jobname "{wildcards.barcode}.{rule}.{jobid}" all
I will get the following error:
AttributeError: 'Wildcards' object has no attribute 'barcode'
I have recently read in the snakemake documentation that it appears I could replace the command line parameters (--cluster "qsub -q onlybngs05b -e {LOGDIR} -o {LOGDIR} -j y" -j 5 --jobname "{wildcards.barcode}.{rule}.{jobid}") with a YAML file, although the documentation is not all that clear to me.
I have created a config.yaml file at /home/user/.config/snakemake which looks like this:
cluster: 'qsub'
q: 'onlybngs05b'
e: '/home/ngs/tempOutSnakemake'
o: '/home/ngs/tempOutSnakemake'
j: 5
jobname: "{wildcards.barcode}.{rule}.{jobid}"
But then it appears that snakemake is not properly parsing the config.yaml. I am getting
snakemake: error: ambiguous option: --o=/home/ngs/tempOutSnakemake could match --omit-from, --output-wait, --overwrite-shellcmd
I also tried replacing o in the config file with stdout (kind of the long version of the parameter, like -h vs --help in many programs), but that does not work either.
So my question is: how can I replace the command line parameters --cluster "qsub -q onlybngs05b -e {LOGDIR} -o {LOGDIR} -j y" -j 5 --jobname "{wildcards.barcode}.{rule}.{jobid}" with a config.yaml file that accepts wildcards?
I think the problem is that rule catFasta doesn't contain the wildcard barcode. If you think about it, what job name would you expect {wildcards.barcode}.{rule}.{jobid} to produce for it?
Maybe a solution could be to add a jobname parameter to each rule, set to {barcode} for guppyplex and minion and to 'all_barcodes' for catFasta, and then use --jobname "{params.jobname}.{rule}.{jobid}".
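A sketch of that idea applied to two of the rules above (untested; only the params lines are new):

rule guppyplex:
    input:
        FQ = f"fastq/{{barcode}}"
    output:
        "out/guppyplex/{barcode}/{barcode}.fastq"
    params:
        jobname = "{barcode}"        # per-barcode job name
    shell:
        "touch {output}"

rule catFasta:
    input:
        expand("out/nanopolish/{barcode}/{barcode}.consensus.fasta", barcode=barcodes)
    output:
        "out/catFasta/cat_consensus.fasta"
    params:
        jobname = "all_barcodes"     # this rule has no {barcode} wildcard
    shell:
        "cat {input} > {output}"

and then submit with --jobname "{params.jobname}.{rule}.{jobid}" in place of the wildcards-based pattern.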

Snakemake: output files in one output directory

The program I am running requires only the output directory to be specified in order to save the output files. If I run this with snakemake it gives me the error:
IOError: [Errno 2] No such file or directory: 'BOB/call_images/chrX_194_1967_BOB_78.rpkm.png'
My attempt:
rule all:
    input:
        "BOB/call_images/"

rule plot:
    input:
        hdf="BOB/hdf5/analysis.hdf5",
        txt="BOB/calls.txt"
    output:
        directory("BOB/call_images/")
    shell:
        """
        python someprogram.py plotcalls --input {input.hdf} --calls {input.txt} --outputdir {output[0]}
        """
This version doesn't work either:
    output:
        outdir="BOB/call_images/"
Normally, snakemake will create the output directory for specified output files. The snakemake directory() declaration tells snakemake that the directory is the output, so it leaves it up to the rule to create the output directory.
If you can predict what the output files will be called (even if you don't specify them on the command line), you should tell snakemake the output filenames in the output: field, e.g.:
rule plot:
    input:
        hdf="BOB/hdf5/analysis.hdf5",
        txt="BOB/calls.txt"
    output:
        "BOB/call_images/chrX_194_1967_BOB_78.rpkm.png"
    params:
        outdir="BOB/call_images"
    shell:
        """
        python someprogram.py plotcalls --input {input.hdf} --calls {input.txt} --outputdir {params.outdir}
        """
(The advantage of using params to define the output directory, instead of hard-coding it in the shell command, is that you can use wildcards there.)
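For example (a sketch along those lines, assuming the BOB part of the path is really a per-sample wildcard; the {sample} name is my placeholder):

rule plot:
    input:
        hdf="{sample}/hdf5/analysis.hdf5",
        txt="{sample}/calls.txt"
    output:
        "{sample}/call_images/chrX_194_1967_{sample}_78.rpkm.png"
    params:
        outdir="{sample}/call_images"  # the wildcard is expanded here as well
    shell:
        """
        python someprogram.py plotcalls --input {input.hdf} --calls {input.txt} --outputdir {params.outdir}
        """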
If you can't predict the output file name, then you have to manually run mkdir -p {output} as the first step of the shell command.
rule plot:
    input:
        hdf="BOB/hdf5/analysis.hdf5",
        txt="BOB/calls.txt"
    output:
        directory("BOB/call_images")
    shell:
        """
        mkdir -p {output}
        python someprogram.py plotcalls --input {input.hdf} --calls {input.txt} --outputdir {output}
        """

Snakemake cannot handle very long command line?

This is a very strange problem.
When the {input} specified in the rule is a list of fewer than 200 files, snakemake worked all right.
But when {input} has more than 500 files, snakemake just quit with the message (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!). The complete log did not provide any further error messages.
For the log, please see: https://github.com/snakemake/snakemake/files/5285271/2020-09-25T151835.613199.snakemake.log
The rule that worked is (NOTE the input is capped to 200 files):
rule combine_fastq:
    input:
        lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')[:200]
    output:
        "combined.fastq/{sample}.fastq.gz"
    group: "minion_assemble"
    shell:
        """
        echo {input} > {output}
        """
The rule that failed is:
rule combine_fastq:
    input:
        lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')
    output:
        "combined.fastq/{sample}.fastq.gz"
    group: "minion_assemble"
    shell:
        """
        echo {input} > {output}
        """
My question is also posted in GitHub: https://github.com/snakemake/snakemake/issues/643.
I second Maarten's answer: with that many files you are running up against a shell limit; snakemake is just doing a poor job of helping you identify the problem.
Based on the issue you reference, it seems like you are using cat to combine all of your files. Maybe following the answer here would help:
rule combine_fastq_list:
    input:
        lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')
    output:
        temp("{sample}.tmp.list")
    group: "minion_assemble"
    run:
        with open(output[0], 'w') as out:
            out.write('\n'.join(input))
rule combine_fastq:
    input:
        "{sample}.tmp.list"
    output:
        'combined.fastq/{sample}.fastq.gz'
    group: "minion_assemble"
    shell:
        'cat {input} | '    # this is reading the list of files from the list file
        'xargs zcat -f | '
        '...'
Hope it gets you on the right track.
EDIT
The first option executes your command separately for each input file. A different option, which executes the command once for the whole list of inputs, is:
rule combine_fastq:
    ...
    shell:
        """
        command $(< {input}) ...
        """
For those landing here with similar questions (like Snakemake expand function alternative), snakemake 6 can handle long command lines. The following test fails on snakemake < 6 but succeeds on 6.0.0 on my Ubuntu machine:
rule all:
    input:
        'output.txt',

rule one:
    output:
        'output.txt',
    params:
        x = list(range(0, 1000000))
    shell:
        r"""
        echo {params.x} > {output}
        """

Snakemake how to execute downstream rules when an upstream rule fails

Apologies that the title is bad - I can't figure out how best to explain my issue in a few words. I'm having trouble dealing with downstream rules in snakemake when one of the rules fails. In the example below, rule spades fails on some samples. This is expected because some of my input files will have issues, spades will return an error, and the target file is not generated. This is fine until I get to rule eval_ani. Here I basically want to run this rule on all of the successful output of rule ani. But I'm not sure how to do this because I have effectively dropped some of my samples in rule spades. I think using snakemake checkpoints might be useful but I just can't figure out how to apply it from the documentation.
I'm also wondering if there is a way to re-run rule ani without re-running rule spades. Say I prematurely terminated my run, and rule ani didn't run on all the samples. Now I want to re-run my pipeline, but I don't want snakemake to try to re-run all the failed spades jobs because I already know they won't be useful to me and it would just waste resources. I tried -R and --allowed-rules but neither of these does what I want.
rule spades:
    input:
        read1=config["fastq_dir"]+"combined/{sample}_1_combined.fastq",
        read2=config["fastq_dir"]+"combined/{sample}_2_combined.fastq"
    output:
        contigs=config["spades_dir"]+"{sample}/contigs.fasta",
        scaffolds=config["spades_dir"]+"{sample}/scaffolds.fasta"
    log:
        config["log_dir"]+"spades/{sample}.log"
    threads: 8
    shell:
        """
        python3 {config[path_to_spades]} -1 {input.read1} -2 {input.read2} -t 16 --tmp-dir {config[temp_dir]}spades_test -o {config[spades_dir]}{wildcards.sample} --careful > {log} 2>&1
        """

rule ani:
    input:
        config["spades_dir"]+"{sample}/scaffolds.fasta"
    output:
        "fastANI_out/{sample}.txt"
    log:
        config["log_dir"]+"ani/{sample}.log"
    shell:
        """
        fastANI -q {input} --rl {config[reference_dir]}ref_list.txt -o fastANI_out/{wildcards.sample}.txt
        """

rule eval_ani:
    input:
        expand("fastANI_out/{sample}.txt", sample=samples)
    output:
        "ani_results.txt"
    log:
        config["log_dir"]+"eval_ani/{sample}.log"
    shell:
        """
        python3 ./bin/evaluate_ani.py {input} {output} > {log} 2>&1
        """
If I understand correctly, you want to allow spades to fail without stopping the whole pipeline, and you want to ignore the output files of the spades jobs that failed. For this you could append || true to the command running spades in order to catch the non-zero exit status (so snakemake will not stop). Then you could analyse the output of spades and write to a "flag" file whether that sample succeeded or not. So rule spades would be something like:
rule spades:
    input:
        read1=config["fastq_dir"]+"combined/{sample}_1_combined.fastq",
        read2=config["fastq_dir"]+"combined/{sample}_2_combined.fastq"
    output:
        contigs=config["spades_dir"]+"{sample}/contigs.fasta",
        scaffolds=config["spades_dir"]+"{sample}/scaffolds.fasta",
        exit=config["spades_dir"]+'{sample}/exit.txt',
    log:
        config["log_dir"]+"spades/{sample}.log"
    threads: 8
    shell:
        """
        python3 {config[path_to_spades]} ... || true
        # ... code that writes to {output.exit} stating whether spades succeeded or not
        """
For the following steps, you use the flag files '{sample}/exit.txt' to decide which spades files should be used and which should be discarded. For example:
rule ani:
    input:
        spades=config["spades_dir"]+"{sample}/scaffolds.fasta",
        exit=config["spades_dir"]+'{sample}/exit.txt',
    output:
        "fastANI_out/{sample}.txt"
    log:
        config["log_dir"]+"ani/{sample}.log"
    shell:
        """
        # Run fastANI only for samples whose flag file says spades succeeded
        if grep -q PASS {input.exit}
        then
            fastANI -q {input.spades} --rl {config[reference_dir]}ref_list.txt -o fastANI_out/{wildcards.sample}.txt
        else
            touch {output}
        fi
        """
rule eval_ani:
    input:
        ani=expand("fastANI_out/{sample}.txt", sample=samples),
        exit=expand(config["spades_dir"]+'{sample}/exit.txt', sample=samples),
    output:
        "ani_results.txt"
    log:
        config["log_dir"]+"eval_ani/{sample}.log"
    shell:
        """
        # Parse the list of files in {input.exit} to decide which files in {input.ani} should be used
        python3 ./bin/evaluate_ani.py {input} {output} > {log} 2>&1
        """
EDIT (not tested): instead of || true inside the shell directive, it may be better to use the run directive and python's subprocess module to run the system commands that are allowed to fail. The reason is that || true returns a 0 exit code no matter what error happened; the subprocess solution instead allows more precise handling of exceptions. E.g.:
rule spades:
    input:
        ...
    output:
        ...
    run:
        import subprocess

        cmd = "spades ..."
        p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, stderr = p.communicate()
        if p.returncode == 0:
            print('OK')
        else:
            # Analyze the exit code and stderr and decide what to do next
            print(p.returncode)
            print(stderr.decode())
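To tie this back to the flag file used earlier, here is a sketch (untested; touching the other declared outputs is my addition, so that snakemake itself does not fail the job when spades produced nothing) of a run block that records PASS/FAIL in output.exit:

rule spades:
    input:
        read1=config["fastq_dir"]+"combined/{sample}_1_combined.fastq",
        read2=config["fastq_dir"]+"combined/{sample}_2_combined.fastq"
    output:
        contigs=config["spades_dir"]+"{sample}/contigs.fasta",
        scaffolds=config["spades_dir"]+"{sample}/scaffolds.fasta",
        exit=config["spades_dir"]+'{sample}/exit.txt',
    run:
        import subprocess

        cmd = "spades ..."  # the real spades command line goes here
        p = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        # Record the outcome so downstream rules can decide whether to use the scaffolds
        with open(output.exit, 'w') as fh:
            fh.write('PASS\n' if p.returncode == 0 else 'FAIL\n')
        # Make sure the declared outputs exist even if spades failed,
        # otherwise snakemake reports missing output files and fails the job anyway
        for f in [output.contigs, output.scaffolds]:
            open(f, 'a').close()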

snakemake running single jobs in parallel from all files in folder

My problem is related to Running parallel instances of a single job/rule on Snakemake, but I believe it is different.
I cannot create an all: rule for it in advance, because the folder of input files will be created by a previous rule and depends on the user's initial data.
Pseudocode:
rule1: get a big file (OK)
rule2: split the file into parts in the Split folder (OK)
rule3: run a program on each file created in Split
I am now at rule3 with Split containing 70 files like
Split/file_001.fq
Split/file_002.fq
..
Split/file_069.fq
Could you please help me create a rule that uses pigz to compress the 70 files in parallel into 70 .gz files?
I am running with snakemake -j 24 ZipSplit.
config["pigt"] gives 4 threads for each compression job and I give 24 threads to snakemake, so I expect 6 parallel compressions, but my current rule passes all the inputs to a single job instead of parallelizing!?
Should I build the full list of inputs in the rule? How?
# parallel job
files, = glob_wildcards("Split/{x}.fq")

rule ZipSplit:
    input:
        expand("Split/{x}.fq", x=files)
    threads: config["pigt"]
    shell:
        """
        pigz -k -p {threads} {input}
        """
I tried to define input directly with
input: glob_wildcards("Split/{x}.fq")
but a syntax error occurs.
# InSilico_PCR Snakefile
import os
import re
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider

HTTP = HTTPRemoteProvider()

# source config variables
configfile: "config.yaml"

# single job
rule GetRawData:
    input:
        HTTP.remote(os.path.join(config["host"], config["infile"]), keep_local=True, allow_redirects=True)
    output:
        os.path.join("RawData", config["infile"])
    run:
        shell("cp {input} {output}")

# single job
rule SplitFastq:
    input:
        os.path.join("RawData", config["infile"])
    params:
        lines_per_file = config["lines_per_file"]
    output:
        pfx = os.path.join("Split", config["infile"] + "_")
    shell:
        """
        zcat {input} | split --numeric-suffixes --additional-suffix=.fq -a 3 -l {params.lines_per_file} - {output.pfx}
        """

# parallel job
files, = glob_wildcards("Split/{x}.fq")

rule ZipSplit:
    input:
        expand("Split/{x}.fq", x=files)
    threads: config["pigt"]
    shell:
        """
        pigz -k -p {threads} {input}
        """
I think the example below should do it, using checkpoints as suggested by @Maarten-vd-Sande.
However, in your particular case of splitting a big file and compressing the output on the fly, you may be better off using the --filter option of split, as in:
split -a 3 -d -l 4 --filter='gzip -c > $FILE.fastq.gz' bigfile.fastq split/
The snakemake solution, assuming your input file is called bigfile.fastq, follows; the split and compressed output will be in the directory splitting/bigfile/.
rule all:
    input:
        expand("{sample}.split.done", sample=['bigfile']),

checkpoint splitting:
    input:
        "{sample}.fastq"
    output:
        directory("splitting/{sample}")
    shell:
        r"""
        mkdir splitting/{wildcards.sample}
        split -a 3 -d --additional-suffix .fastq -l 4 {input} splitting/{wildcards.sample}/
        """

rule compress:
    input:
        "splitting/{sample}/{i}.fastq",
    output:
        "splitting/{sample}/{i}.fastq.gz",
    shell:
        r"""
        gzip -c {input} > {output}
        """

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.splitting.get(**wildcards).output[0]
    return expand("splitting/{sample}/{i}.fastq.gz",
                  sample=wildcards.sample,
                  i=glob_wildcards(os.path.join(checkpoint_output, "{i}.fastq")).i)

rule all_done:
    input:
        aggregate_input
    output:
        touch("{sample}.split.done")