snakemake: is there a way to specify an output directory for each rule? - snakemake

The scripts I used all put the output files to the current directory where the script was called so in my shell script pipeline I would have cd commands to go to a particular directory to run commands and output files will just be saved in relevant directories. My scripts don't have the parameter for output directory and most of them get the output file names deduced from the input. That has worked pretty well for me.
Now I'm running into this output directory issue consistently as snakemake seem to output the files to the directory where Snakefile is. I could modify all the scripts to take in an additional parameter for output directory but that's gone be a pain for modifying many scripts. I'm wondering if there is any way to specify where the output should go for each specific rule?

One hack would be to first cd into the output directory, i.e. "cd $(dirname {output[0]})". This needs to be the first in your shell commands.
Having said this, it would be better to change the script to accept an output directory as argument.
Andreas

Here is an example rule that I use in one of my snakefiles:
rule link_raw_data:
output:
OPJ(data_dir, "{lib}_{rep}.fastq.gz"),
params:
directory = data_dir,
shell_command = lib2data,
message:
"Making link to raw data {output}."
shell:
"""
(
cd {params.directory}
{params.shell_command}
)
"""
This is probably a bit different from your situation, but hopefully some of the techniques can help. In particular, note the parentheses in the shell section and the usage of a params section to define the output directory.
I'm not sure I'm doing this in the most elegant way, but it works.
data_dir is a parameter read from a config file.
lib2data is a function that generates commands based on the values of some wildcards. I have to ensure that these commands use the correct input file paths of course (and, in this case, also the output in a coherent manner with what the output section says). In your case, it is possible that you will simply have a "hard-coded" shell commands, possibly using some of the rule's input.
More streamlined example
rule run_script1:
input:
path/to/initial/input
output:
script1_out/output1
shell:
""""
cd script1_out
script1 {input}
""""
rule run_script2:
input:
script1/output1
output:
script2/output2
shell:
"""
cd script2_out
script2 {input}
"""
Starting from these examples, you can use functions of the wildcards in the input or output if necessary.

In snakemake documentation:
"All paths in the snakefile are interpreted relative to the directory snakemake is executed in. This behaviour can be overridden by specifying a workdir in the snakefile:"
workdir: "path/to/workdir"
So just put that at the begining of your snakefile and all inputs and outputs will be interpreted relative to this path.

You could try to use a configuration file either in YAML or JSON maybe. Then use the directory as a parameter in your expand or in the input/output of your rules.
See the documentation here

Related

Relative Directories in snakefile using workflow.get_source()

For my workflow, the scripts & snakefile are in a different directory than the output files. I specify the latter using the workdir directive, which seems to work fine.
Now in some cases there are static input files in other directories, whose paths I want to specify relative to the snakefile. According to the documentation, I should use workflow.get_source("path/relative/to/snakefile"). However this gives me
'Workflow' object has no attribute 'get_source'. Do I need some additional imports?
Looking at the documentation, it looks like what's used in the code example is workflow.source_path("path/relative/to/snakefile") rather than get_source - you can give that a try.
Otherwise, though I don't think it's officially documented, workflow.basedir gives you the path of the directory the Snakefile lives in, and you can build the rest of the path off that.

Snakemake shadow rule when program writes to /tmp

I am using Snakemake to run the defense-finder program. This program creates and overwites generic temporary files in /tmp/defense-finder, i.e. the file names do not contain unique identifiers. When running my rule across separate cores on different input files, Snakemake crashes due to clashes in /tmp/defense-finder.
It appears that Shadow rules can help when different jobs write to the same files within the working directory. Is there a way to use Shadow rules when a program writes to the /tmp directory?
Following #Marmaduke's comment that file paths are hard-coded, a temporary workaround is to force snakemake to run the defense-finder jobs one at a time while allowing other jobs to run in parallel. You can do this with the resources directive:
rule defense_finder:
resources:
n_defense= 1,
input: ...
output: ...
shell: ...
then run with:
snakemake --resources n_defense=1 -j 10 ...

Snakemake : subworkflow not playing well with the main DAG

I have a main Snakefile and several subworkflows running in independent subdirectories (with paths relative to their own directories). I've noticed that if I modify one of the input of a subworkflow, it will rerun correctly but all the following rules that come afterwards are not rerun.
If I understand correctly what is going on, there's a different DAG for the main Snakefile and for each subworkflow. The main DAG is not aware of any modification in a subworkflow and therefore won't trigger a rerun since the output of the subworkflow hasn't been modified yet.
I'd like that all the rules depending of the output of a subworkflow are rerun if there's a modification in that subworkflow. Isn't that what the default behaviour should be ?
I've also tried the other modularisation techniques. Using includes works but is super annoying because I have to modify all the paths to be relative to the main directory (and therefore I can't run snakemake independently in one subdirectory anymore). I've also tried using the new module system coming with snakemake v.6 that is supposed to be replacing subworkflows. Maybe I don't use it correctly, but it doesn't seem to work for my use case. If I import a rule from a subdirectory it complains that there are missing inputs. It doesn't find the scripts because they are in the subdirectory and not in the main directory. So in that sense it works more like an include than a subworkflow.
Do you have any idea on how to solve my issue ?
Here's a small working example with the module implementation:
MainDirectory
| - Snakefile
rule all:
input: "Subdirectory/file.txt"
module other_workflow:
snakefile: "Subdirectory/Snakefile"
use rule * from other_workflow as other_*
| - Subdirectory
| | - Snakefile
rule rule_a:
input:
script = 'code.py'
output: 'file.txt'
shell: 'python {input.script}'
| | - code.py
with open('file.txt', 'w') as f:
print('This is a test.', file=f)
This doesn't work as the snakefile in the main directory uses all the rules in the same workdir, whereas I would like it to be running the imported rules in their own workdir. I can make it work by modifying all the relative paths in the subdirectory but that's not what I want. I want to be able to run it without modifications.
The issue is that you mix input files with code here. If the script code.py is defined as an input file, Snakemake expects it to be in the workdir. If you'd use the script directive or the Jupyter notebook directive instead, the path will be automatically relative to the Snakefile. If that should be not an option for whatever reason, you can instead build the path relative to the current Snakefile via Path(workflow.snakefile).parent / "code.py".
Note, there is really no reason to register code as input files. If you intend to get a rerun upon changes in the code, it is better to rely on snakemake --list-code-changes. The reason Snakemake does not automatically trigger reruns upon code changes is that they can be just cosmetic (e.g. formatting). Hence, it is up to the dev to trigger the rerun, e.g. via --list-code-changes, or manually.

Working directory when using include in snakemake for rules that use the report() function

I am using snakemake to program my workflows. In order to reuse code, the simplest way is to use the statement
include: "path/to/other/Snakefile"
This works fine for most cases but fails when creating the reports via the report() function. The problem is that it does not find the .rst file that is specified for the caption.
Thus it seems that report() has the working directory in which the other Snakefile is located and not the one of the main Snakefile.
Is there a flexible workaround for this, so that it behaves as just being loaded into the Snakefile and then being executed as it were in the main Snakefile?
This is an example rule in another Snakemake file:
rule evaluation:
input:
"data/final_feature.model"
output:
report("data/results.txt",caption="report/evaluation.rst",category ="Evaluation")
shell:
"Rscript {scripts}/evaluation.R {input}"
This is included in the main Snakefile via:
include: "../General/subworkflows/evaluation.snakemake"
This is the error message showing that the file is not present:
WorkflowError:
Error loading caption file of output marked for report.
FileNotFoundError: [Errno 2] No such file or directory: '.../workflows/General/subworkflows/report/evaluation.rst'
Thank you for any help in advance!
One option may be to expand relative paths to absolute paths using os.path.abspath(). If the paths are relative to the directory where the Snakefile is, you may need instead to use workflow.basedir which contains the path to the Snakefile. For example:
caption= os.path.join(workflow.basedir, "report/evaluation.rst")

How to add to a config file within a snakemake pipeline and define global variables?

I am new to snakemake and am trying to create a pipeline that would have a sample file in the config. A rule would parse this file and pull out certain information that is needed by the pipeline. So I would like to add this to the config file or create global variables is this possible?or to force it to wait for the rule to finish before defining config file or variable?
I have tried adding to a config file using a rule with shell cat command. But it doesn't try to run as the output is the config file. Also tried declaring config file after rule but It doesn't matter on order. Also tried to touch the output file as part of the rule
include:"fastqc.snake"
IDS, = glob_wildcards("testdata/samples/{id}.fastq.gz")
rule all:
input: expand("fastqc_reports/{id}_fastqc.html", id=IDS)
rule add:
input:
"test_in.txt"
output:
"config.yaml"
shell:
"cat {input} >> {output}"
configfile: "config.yaml"
I would like the contents of the input file to be appended to the config file. In reality it will be a number of variables I find in the input file. OR I would like to be able to create global variables based on the parsing of the input file.