Is it possible in snakemake to produce reports and the DAG images automatically? - snakemake

I would like to automatically produce the report and DAG images automatically after running the workflow in snakemake. Also I would like to create the report with a given name, e.g. specified in the config.yaml.
I cannot use the snakemake shell command inside the Snakefile which I would usually use to create the reports manually.
The code I would use for creating the report manually:
snakemake --report
The code for manually creating the DAG image:
snakemake --rulegraph | dot -Tpdf > dag.pdf
How can I do this in the Snakefile?
Thanks for any help!

You could do this but it looks pretty ugly to me. At the end of your Snakefile add:
onsuccess:
shell(
r"""
snakemake --unlock
snakemake --report
snakemake --rulegraph | dot -Tpdf > dag.pdf
""")

As suggested by #FGV comment, it can be done by using auto_report and providing the dag from workflow.persistence:
onsuccess:
from snakemake.report import auto_report
auto_report(workflow.persistence.dag, "report/report.html")
For the dag itself, you can export it to a text file and use shell to turn it to a pdf:
with open("report/dag.txt","w") as f:
f.writelines(str(workflow.persistence.dag))
shell("cat report/dag.txt | dot -Tpdf > report/dag.pdf")
Note that it also works with the rules graph workflow.persistence.dag.rule_dot()

Related

How to parallelize jobs for a list of files using snakemake (beginner question)

I am struggling with a very simple thing. On input of my snakemake pipeline I would like to have a directory, list its content, and process each file from that directory in parallel. Naively I thought something like this should work:
rule all:
input:
"in/{test}.txt"
output:
"out/{test}.txt"
shell:
"echo {input} >> {output}"
This ends with the error
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
All the resources I could find start with hard-coding the list of jobs in the script, which is something I want to avoid to keep the pipeline generic. The idea is to just point the pipeline to a directory with a list of files and let it do its job. Is this possible? Seems fairly simple and intuitive, but couldn't find an example showing that.
I don't know what command you used for this rule, but the following workflow should suffice your purpose
rule all:
input:
expand("out/{prefix}.txt", prefix=glob_wildcards("in/{test}.txt").test)
rule test:
input:
"in/{test}.txt"
output:
"out/{test}.txt"
shell:
"echo {input} >> {output}"
glob_wildcards is a function by snakemake to find out all the files that match the specified pattern (in/{test}.txt in this case), then .text is to get the list of strings that match {test} in filenames (example: "ab" in "in/ab.txt").
Then expand can fill the string to the placeholder variable that wrapped by curly bracket, then generate a list of input file names.
So rule all wants a list of input files correspond to all txt files in in folder, then it would let snakemake execute rule test for every file

Accessing the --default-remote-prefix within the Snakefile

When I run snakemake on the google life sciences executor, I run something like:
snakemake --google-lifesciences --default-remote-prefix my_bucket_name --preemption-default 10 --use-conda
Now, my_bucket_name is going to get added to all of the input and output paths.
BUT for reasons I need to recreate the full path within the Snakefile code and therefore I want to be able to access whatever is passed to --default-remote-prefix within the code
Is there a way to do this?
I want to be able to access whatever is passed to --default-remote-prefix within the code
You can use the workflow object like:
print(workflow.default_remote_prefix) # Will print my_bucket_name in your example
rule all:
input: ...
I'm not 100% sure if the workflow object is supposed to be used by the user or if it's private to snakemake and if so it could be changed in the future without warning. But I think it's ok, I use workflow.basedir all the time to get the directory where the Snakefile sits.
Alternatively you could parse the sys.argv list but I think that this is more hacky.
Another option:
bucket_name=foo
snakemake --default-remote-prefix $bucket_name --config bucket_name=$bucket_name ...
then use config["bucket_name"] within the code to get the value foo. But I still prefer the workflow solution.

Accessing file path from a config.yaml in Snakemake

I'm working with Snakemake for NGS analysis. I have a list of input files, stored in a YAML file as follows:
DATASETS:
sample1: /path/to/input/bam
.
.
A very simplified skeleton of my Snakemake file, as described earlier in Snakemake: How to use config file efficiently and https://www.biostars.org/p/406452/, is as follows:
rule all:
input:
expand("report/{sample}.xlsx", sample = config["DATASETS"])
rule call:
input:
lambda wildcards: config["DATASETS"][wildcards.sample]
output:
"tmp/{sample}.vcf"
shell:
"some mutect2 script"
rule summarize:
input:
"tmp/{sample}.vcf"
output:
"report/{sample}.xlsx"
shell:
"processVCF.py"
This complains about missing input files for rule all. I'm really not too sure what I am missing out here: Could someone perhaps point out where I can start looking to try to solve my problem?
This problem persists even when I execute snakemake -n tmp/sample1.vcf, so it seems the problem is related to the inability to pass the input file to the rule call. I have a nagging feeling that I'm really missing something trivial here.

Snakemake live user input

I have an a bunch of R scripts that follow one another and I wanted to connect them using Snakemake. But I’m running in a problem.
One of my R scripts shows two images and asks a user’s input on how many cluster there are present. The R function for this is [readline]
This query on how many clusters is asked but directly after the next line of code is run. Without an opportunity to input a number. the rest of the program crashes, since trying to calculate (empty number) of clusters doesn’t really work. Is there a way around this. By getting the values via a function/rule from Snakemake
or is there a other way to work around this issue?
Based on my testing with snakemake v5.8.2 in MacOS, this is not a snakemake issue. Example setup below works without any problem.
File test.R
cat("What's your name? ")
x <- readLines(file("stdin"),1)
print(x)
File Snakefile
rule all:
input:
"a.txt",
"b.txt"
rule test_rule:
output:
"{sample}.txt"
shell:
"Rscript test.R; touch {output}"
Executing them with command snakemake -p behaves as expected. That is, it asks for user input and then touch output file.
I used function readLines in R script, but this example shows that error you are facing is likely not a snakemake issue.

How to access cluster_config dict within rule?

I'm working on writing a benchmarking report as part of a workflow, and one of the things I'd like to include is information about the amount of resources requested for each job.
Right now, I can manually require the cluster config file ('cluster.json') as a hardcoded input. Ideally, though, I would like to be able to access the per-rule cluster config information that is passed through the --cluster-config arg. In init.py, this is accessed as a dict called cluster_config.
Is there any way of importing or copying this dict directly into the rule?
From the documentation, it looks like you can now use a custom wrapper script to access the job properties (including the cluster config data) when submitting the script to the cluster. Here is an example from the documentation:
#!python
#!/usr/bin/env python3
import os
import sys
from snakemake.utils import read_job_properties
jobscript = sys.argv[1]
job_properties = read_job_properties(jobscript)
# do something useful with the threads
threads = job_properties[threads]
# access property defined in the cluster configuration file (Snakemake >=3.6.0)
job_properties["cluster"]["time"]
os.system("qsub -t {threads} {script}".format(threads=threads, script=jobscript))
During submission (last line of the previous example) you could either pass the arguments you want from the cluster.json to the script or dump the dict into a JSON file, pass the location of that file to the script during submission, and parse the json file inside your script. Here is an example of how I would change the submission script to do the latter (untested code):
#!python
#!/usr/bin/env python3
import os
import sys
import tempfile
import json
from snakemake.utils import read_job_properties
jobscript = sys.argv[1]
job_properties = read_job_properties(jobscript)
job_json = tempfile.mkstemp(suffix='.json')
json.dump(job_properties, job_json)
os.system("qsub -t {threads} {script} -- {job_json}".format(threads=threads, script=jobscript, job_json=job_json))
job_json should now appear as the first argument to the job script. Make sure to delete the job_json at the end of the job.
From a comment on another answer, it appears that you are just looking to store the job_json somewhere along with the job's output. In that case, it might not be necessary to pass job_json to the job script at all. Just store it in a place of your choosing.
You can manage the resources for the cluster easily per Rules.
Indeed you have the keyword "resources:" to use like this :
rule one:
input: ...
output: ...
resources:
gpu=1,
time=HH:MM:SS
threads: 4
shell: "..."
You can specify the resources by the yaml configuration files for the cluster give with the parameter --cluster-config like this:
rule one:
input: ...
output: ...
resources:
time=cluster_config["one"]["time"]
threads: 4
shell: "..."
When you will call snakemake you will just have to access to the resources like this (example for slurm cluster):
snakemake --cluster "sbatch -c {threads} -t {resources.time} " --cluster-config cluster.yml
It will send each rule with its specific resources for the cluster.
For more informations, you can check the documentations with this link : http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
Best regards