How to access cluster_config dict within rule? - snakemake

I'm working on writing a benchmarking report as part of a workflow, and one of the things I'd like to include is information about the amount of resources requested for each job.
Right now, I can manually require the cluster config file ('cluster.json') as a hardcoded input. Ideally, though, I would like to be able to access the per-rule cluster config information that is passed through the --cluster-config argument. In Snakemake's __init__.py, this is accessed as a dict called cluster_config.
Is there any way of importing or copying this dict directly into the rule?

From the documentation, it looks like you can now use a custom wrapper script to access the job properties (including the cluster config data) when submitting the script to the cluster. Here is an example from the documentation:
#!/usr/bin/env python3
import os
import sys
from snakemake.utils import read_job_properties
jobscript = sys.argv[1]
job_properties = read_job_properties(jobscript)
# do something useful with the threads
threads = job_properties["threads"]
# access property defined in the cluster configuration file (Snakemake >=3.6.0)
job_properties["cluster"]["time"]
os.system("qsub -t {threads} {script}".format(threads=threads, script=jobscript))
During submission (the os.system call in the wrapper script above), you could either pass the arguments you want from the cluster.json to the script, or dump the dict into a JSON file, pass the location of that file to the script during submission, and parse the JSON file inside your script. Here is an example of how I would change the submission script to do the latter (untested code):
#!/usr/bin/env python3
import os
import sys
import tempfile
import json
from snakemake.utils import read_job_properties

jobscript = sys.argv[1]
job_properties = read_job_properties(jobscript)
threads = job_properties["threads"]

# dump the full job properties dict into a temporary JSON file
fd, job_json = tempfile.mkstemp(suffix='.json')
with os.fdopen(fd, 'w') as handle:
    json.dump(job_properties, handle)

os.system("qsub -t {threads} {script} -- {job_json}".format(
    threads=threads, script=jobscript, job_json=job_json))
job_json should now appear as the first argument to the job script. Make sure to delete the job_json at the end of the job.
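For illustration, here is a minimal, untested sketch of reading and cleaning up that file; it assumes (hypothetically) that the JSON path arrives as the last command-line argument of whatever consumes it:
import json
import os
import sys

job_json = sys.argv[-1]  # assumed: path to the job-properties JSON

with open(job_json) as handle:
    job_properties = json.load(handle)

# e.g. record the requested cluster resources for a benchmarking report
requested_time = job_properties.get("cluster", {}).get("time")
print("requested walltime:", requested_time)

os.remove(job_json)  # clean up the temporary file when done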
From a comment on another answer, it appears that you are just looking to store the job_json somewhere along with the job's output. In that case, it might not be necessary to pass job_json to the job script at all. Just store it in a place of your choosing.

You can easily manage cluster resources per rule. Snakemake provides the resources keyword for this, which you can use like so:
rule one:
    input: ...
    output: ...
    resources:
        gpu=1,
        time="HH:MM:SS"
    threads: 4
    shell: "..."
You can also take the values from the YAML cluster configuration file passed with the --cluster-config parameter, like this:
rule one:
    input: ...
    output: ...
    resources:
        time=cluster_config["one"]["time"]
    threads: 4
    shell: "..."
When you call snakemake, you just have to reference the resources like this (example for a SLURM cluster):
snakemake --cluster "sbatch -c {threads} -t {resources.time} " --cluster-config cluster.yml
Each rule will then be submitted to the cluster with its own specific resources.
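Tying this back to the original question, here is a hypothetical sketch of exposing those values to a rule for a benchmarking report. It assumes a cluster.yml with a "one" entry and a "__default__" fallback, and that the cluster_config dict is available in the Snakefile because snakemake was started with --cluster-config cluster.yml (as in the rule above):
rule report_requested_resources:
    output: "benchmarks/one_requested.txt"
    params:
        # assumed cluster.yml layout; fall back to __default__ if there is no rule-specific entry
        time=lambda wc: cluster_config.get("one", cluster_config.get("__default__", {})).get("time", "NA")
    shell:
        "echo 'requested time: {params.time}' > {output}"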
For more information, you can check the documentation: http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
Best regards

Related

What is the structure of the executable transformation script for transform_script of GCSFileTransformOperator?

I'm currently working on a task in Airflow that requires pre-processing a large CSV file using GCSFileTransformOperator. I've been reading the documentation on the class and its implementation, but I don't quite understand how the executable transformation script passed as transform_script should be structured.
For example, is the following script structure correct? If so, does that mean that with GCSFileTransformOperator, Airflow calls the executable transformation script and passes it arguments on the command line?
# Import the required modules
import sys
# (import whatever preprocessing modules you need)

# Define the function that takes the source_file and destination_file params
def preprocess_file(source_file, destination_file):
    # (1) code that processes the source_file
    # (2) code that then writes to destination_file
    pass

# Extract source_file and destination_file from the list of command-line arguments
source_file = sys.argv[1]
destination_file = sys.argv[2]
preprocess_file(source_file, destination_file)
GCSFileTransformOperator passes the script to subprocess.Popen, so your script will work, but you will need to add a shebang such as #!/usr/bin/python (or wherever Python is on your path in Airflow).
Your arguments are correct and the format of your script can be anything you want. Airflow passes in the path of the downloaded file, and a temporary new file:
cmd = (
    [self.transform_script]
    if isinstance(self.transform_script, str)
    else self.transform_script
)
cmd += [source_file.name, destination_file.name]
with subprocess.Popen(
    args=cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, close_fds=True
) as process:
    # ...
    process.wait()
(you can see the source here)
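For completeness, here is a hypothetical sketch of wiring the script into a DAG; the bucket and object names are placeholders, and the import path assumes the Airflow 2.x Google provider package:
from airflow.providers.google.cloud.operators.gcs import GCSFileTransformOperator

# inside a DAG definition (with DAG(...) as dag:)
transform = GCSFileTransformOperator(
    task_id="preprocess_csv",
    source_bucket="my-source-bucket",           # placeholder
    source_object="raw/data.csv",               # placeholder
    destination_bucket="my-dest-bucket",        # placeholder
    destination_object="processed/data.csv",    # placeholder
    transform_script="/path/to/preprocess.py",  # your executable script with a shebang
)
Since the operator accepts either a string or a list for transform_script (see the isinstance check above), you could also pass ["python", "/path/to/preprocess.py"] instead of relying on a shebang.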

Accessing the --default-remote-prefix within the Snakefile

When I run snakemake on the google life sciences executor, I run something like:
snakemake --google-lifesciences --default-remote-prefix my_bucket_name --preemption-default 10 --use-conda
Now, my_bucket_name is going to get added to all of the input and output paths.
But, for reasons of my own, I need to recreate the full path within the Snakefile code, and therefore I want to be able to access whatever is passed to --default-remote-prefix within the code.
Is there a way to do this?
I want to be able to access whatever is passed to --default-remote-prefix within the code
You can use the workflow object like:
print(workflow.default_remote_prefix) # Will print my_bucket_name in your example
rule all:
    input: ...
I'm not 100% sure whether the workflow object is meant to be used by the user or is private to snakemake; if it is private, it could change in the future without warning. But I think it's OK: I use workflow.basedir all the time to get the directory where the Snakefile sits.
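For example, a small sketch of rebuilding a full remote path inside the Snakefile (the output path below is made up):
prefix = workflow.default_remote_prefix        # "my_bucket_name" in your example
full_path = f"{prefix}/results/summary.txt"    # hypothetical output path
print(full_path)                               # my_bucket_name/results/summary.txt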
Alternatively, you could parse the sys.argv list, but I think that is more hacky.
Another option:
bucket_name=foo
snakemake --default-remote-prefix $bucket_name --config bucket_name=$bucket_name ...
then use config["bucket_name"] within the code to get the value foo. But I still prefer the workflow solution.

Setting environment variables before the execution of the pyiron wrapper on remote cluster

I use a jobfile for SLURM in ~/pyiron/resources/queues/, which looks roughly like this:
#!/bin/bash
#SBATCH --output=time.out
#SBATCH --job-name={{job_name}}
#SBATCH --workdir={{working_directory}}
#SBATCH --get-user-env=L
#SBATCH --partition=cpu
module load some_python_module
export PYTHONPATH=path/to/lib:$PYTHONPATH
echo {{command}}
As you can see, I need to load a module to access the correct python version before calling "python -m pyiron.base.job.wrappercmd ..." and I also want to set the PYTHONPATH variable.
Setting the environment directly in the SLURM jobfile is of course working, but it seems very inconvenient, because I need a new jobfile under ~/pyiron/resources/queues/ whenever I want to run a calculation with a slightly different environment. Ideally, I would like to be able to adjust the environment directly in the Jupyter notebook. Something like an {{environment}} block in the above jobfile, which can be configured via Jupyter, seems to be a nice solution.
As far as I can tell, this is impossible with the current version of pyiron and pysqa. Is there a similar solution available?
As an alternative, I could also imagine storing the above jobfile close to the Jupyter notebook. This would also ease reproducibility for my colleagues. Is there an option to define a specific file to be used as a jinja2 template for the jobfile?
I could achieve my intended setup by writing a temporary jobfile under ~/pyiron/resources/queues/ via Jupyter before running the pyiron job, but this feels like quite a hacky solution.
Thank you very much,
Florian
To explain the example in a bit more detail:
I create a notebook named readenv.ipynb with the following content:
import subprocess
subprocess.check_output("echo ${My_SPECIAL_VAR}", shell=True)
This reads the environment variable My_SPECIAL_VAR.
I can now submit this job using a second jupyter notebook:
import os
os.environ["My_SPECIAL_VAR"] = "SoSpecial"
from pyiron import Project
pr = Project("envjob")
job = pr.create_job(pr.job_type.ScriptJob, "script")
job.script_path = "readenv.ipynb"
job.server.queue = "cm"
job.run()
In this case I first set the environment variable and then submit a script job. The script job is able to read the corresponding environment variable because it is forwarded via the --get-user-env=L option. So you should be able to define the environment in the Jupyter notebook which you use to submit the calculation.

Is it possible in snakemake to produce reports and the DAG images automatically?

I would like to produce the report and the DAG image automatically after running the workflow in snakemake. I would also like to create the report with a given name, e.g. one specified in the config.yaml.
I cannot use the snakemake shell command inside the Snakefile, which is what I would usually use to create the reports manually.
The code I would use for creating the report manually:
snakemake --report
The code for manually creating the DAG image:
snakemake --rulegraph | dot -Tpdf > dag.pdf
How can I do this in the Snakefile?
Thanks for any help!
You could do this but it looks pretty ugly to me. At the end of your Snakefile add:
onsuccess:
    shell(
        r"""
        snakemake --unlock
        snakemake --report
        snakemake --rulegraph | dot -Tpdf > dag.pdf
        """)
As suggested by FGV's comment, it can be done by using auto_report and providing the DAG from workflow.persistence:
onsuccess:
    from snakemake.report import auto_report
    auto_report(workflow.persistence.dag, "report/report.html")
For the DAG itself, you can export it to a text file and use shell to turn it into a PDF:
with open("report/dag.txt","w") as f:
f.writelines(str(workflow.persistence.dag))
shell("cat report/dag.txt | dot -Tpdf > report/dag.pdf")
Note that this also works with the rule graph, workflow.persistence.dag.rule_dot().
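To pick up the report name from config.yaml, as asked in the question, you could combine either approach with a config lookup. A hypothetical sketch, assuming a report_name key in config.yaml:
onsuccess:
    from snakemake.report import auto_report
    # "report_name" is an assumed key in config.yaml
    report_file = "report/{}.html".format(config.get("report_name", "report"))
    auto_report(workflow.persistence.dag, report_file)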

How can one access Snakemake config variables inside `shell` section?

In snakemake I would like to access keys from the config from within the shell: directive. I can use {input.foo}, {output.bar}, and {params.baz}, but {config.quux} is not supported. Is there a way to achieve this?
rule do_something:
    input: "source.txt"
    output: "target.txt"
    params:
        # access config[] here. parameter tracking is a side effect
        tmpdir = config['tmpdir']
    shell:
        # using {config.tmpdir} or {config['tmpdir']} here breaks the build
        "./scripts/do.sh --tmpdir {params.tmpdir} {input} > {output}; "
I could assign the parts of the config I want to a key under params and then use a {params.x} replacement, but this has unwanted side effects (e.g. the parameter is saved in the snakemake metadata, i.e. .snakemake/params_tracking). Using run: instead of shell: would be another workaround, but accessing the config directly from the shell block would be most desirable.
"./scripts/do.sh --tmpdir {config[tmpdir]} {input} > {output}; "
should work here.
It is stated in the documentation:
http://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#standard-configuration
"For adding config placeholders into a shell command, Python string formatting syntax requires you to leave out the quotes around the key name, like so:"
shell:
    "mycommand {config[foo]} ..."