Snakemake : subworkflow not playing well with the main DAG

I have a main Snakefile and several subworkflows running in independent subdirectories (with paths relative to their own directories). I've noticed that if I modify one of the inputs of a subworkflow, the subworkflow reruns correctly, but none of the rules that come after it are rerun.
If I understand correctly what is going on, there's a separate DAG for the main Snakefile and for each subworkflow. The main DAG is not aware of any modification in a subworkflow and therefore won't trigger a rerun, since the output of the subworkflow hasn't been modified yet.
I'd like all the rules that depend on the output of a subworkflow to be rerun if there's a modification in that subworkflow. Isn't that what the default behaviour should be?
I've also tried the other modularisation techniques. Using includes works but is super annoying, because I have to modify all the paths to be relative to the main directory (and therefore I can't run snakemake independently in one subdirectory anymore). I've also tried the new module system that came with Snakemake v6 and is supposed to replace subworkflows. Maybe I'm not using it correctly, but it doesn't seem to work for my use case. If I import a rule from a subdirectory, it complains that there are missing inputs. It doesn't find the scripts because they are in the subdirectory and not in the main directory. So in that sense it works more like an include than a subworkflow.
Do you have any idea how to solve my issue?
Here's a small working example with the module implementation:

MainDirectory
| - Snakefile
        rule all:
            input: "Subdirectory/file.txt"

        module other_workflow:
            snakefile: "Subdirectory/Snakefile"

        use rule * from other_workflow as other_*
| - Subdirectory
| | - Snakefile
        rule rule_a:
            input:
                script = 'code.py'
            output: 'file.txt'
            shell: 'python {input.script}'
| | - code.py
        with open('file.txt', 'w') as f:
            print('This is a test.', file=f)

This doesn't work as the snakefile in the main directory uses all the rules in the same workdir, whereas I would like it to be running the imported rules in their own workdir. I can make it work by modifying all the relative paths in the subdirectory but that's not what I want. I want to be able to run it without modifications.

The issue is that you mix input files with code here. If the script code.py is defined as an input file, Snakemake expects it to be in the workdir. If you use the script directive or the Jupyter notebook directive instead, the path is automatically interpreted relative to the Snakefile. If that is not an option for whatever reason, you can instead build the path relative to the current Snakefile via Path(workflow.snakefile).parent / "code.py".
Note, there is really no reason to register code as input files. If you intend to get a rerun upon changes in the code, it is better to rely on snakemake --list-code-changes. The reason Snakemake does not automatically trigger reruns upon code changes is that they can be just cosmetic (e.g. formatting). Hence, it is up to the dev to trigger the rerun, e.g. via --list-code-changes, or manually.
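The path-building suggestion can be tried outside Snakemake with plain pathlib. This is only a minimal sketch: sibling_of is a hypothetical helper name, and inside a real Snakefile the first argument would be workflow.snakefile, as in the answer above.

```python
from pathlib import Path


def sibling_of(snakefile, filename):
    """Return the path of `filename` located next to `snakefile`.

    `sibling_of` is a hypothetical helper used for illustration only.
    """
    return Path(snakefile).parent / filename


# With the layout from the question, the script path follows the location
# of the sub-Snakefile rather than the directory Snakemake was started in.
script = sibling_of("MainDirectory/Subdirectory/Snakefile", "code.py")
print(script)
```

The same expression can be used directly where the rule refers to the script, so the sub-Snakefile keeps working both on its own and when imported as a module.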

Related

Relative Directories in snakefile using workflow.get_source()

For my workflow, the scripts & snakefile are in a different directory than the output files. I specify the latter using the workdir directive, which seems to work fine.
Now in some cases there are static input files in other directories, whose paths I want to specify relative to the snakefile. According to the documentation, I should use workflow.get_source("path/relative/to/snakefile"). However this gives me
'Workflow' object has no attribute 'get_source'. Do I need some additional imports?

Looking at the documentation, it looks like what's used in the code example is workflow.source_path("path/relative/to/snakefile") rather than get_source - you can give that a try.
Otherwise, though I don't think it's officially documented, workflow.basedir gives you the path of the directory the Snakefile lives in, and you can build the rest of the path off that.

Snakemake shadow rule when program writes to /tmp

I am using Snakemake to run the defense-finder program. This program creates and overwrites generic temporary files in /tmp/defense-finder, i.e. the file names do not contain unique identifiers. When running my rule across separate cores on different input files, Snakemake crashes due to clashes in /tmp/defense-finder.
It appears that Shadow rules can help when different jobs write to the same files within the working directory. Is there a way to use Shadow rules when a program writes to the /tmp directory?

Following @Marmaduke's comment that the file paths are hard-coded, a temporary workaround is to force snakemake to run the defense-finder jobs one at a time while allowing other jobs to run in parallel. You can do this with the resources directive:

    rule defense_finder:
        resources:
            n_defense = 1,
        input: ...
        output: ...
        shell: ...

then run with:

    snakemake --resources n_defense=1 -j 10 ...
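To see why this serializes only the defense-finder jobs, here is a toy model of a resource-limited job picker. It is a sketch for intuition and not Snakemake's actual scheduler; pick_runnable and the job names are hypothetical.

```python
def pick_runnable(jobs, limits):
    """Greedily select jobs whose combined resource use stays within `limits`.

    `jobs` is a list of (name, {resource: amount}) pairs.
    Resources without an entry in `limits` are treated as unlimited.
    """
    used = {}
    selected = []
    for name, needs in jobs:
        fits = all(
            used.get(res, 0) + amount <= limits.get(res, float("inf"))
            for res, amount in needs.items()
        )
        if fits:
            for res, amount in needs.items():
                used[res] = used.get(res, 0) + amount
            selected.append(name)
    return selected


jobs = [
    ("defense_finder_A", {"cores": 1, "n_defense": 1}),
    ("defense_finder_B", {"cores": 1, "n_defense": 1}),
    ("other_1", {"cores": 1}),
    ("other_2", {"cores": 1}),
]
# Mirrors `--resources n_defense=1 -j 10`: only one defense-finder job fits at a time.
print(pick_runnable(jobs, {"cores": 10, "n_defense": 1}))
```

In this model the second defense-finder job has to wait for the first one to release n_defense, while the unrelated jobs still run alongside it.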

Singularity definition file with paths relative to it

Question
When building Singularity images using definition files, is there a way to specify the path to a file on the host system relative to the definition file (i.e. independent of where the build command is called)?
Example to Illustrate the Problem
I have the following files in the same directory (e.g. a git repository):

    foobar.def
    some_file.txt

foobar.def looks as follows:

    Bootstrap: library
    From: ubuntu:20.04
    Stage: build

    %files
        # Add some_file.txt at the root of the image
        some_file.txt /some_file.txt

This works fine when I build with the following command in the directory which contains the files:

    singularity build --fakeroot foobar.sif foobar.def

However, it fails if I call the build command from anywhere else (e.g. from a dedicated "build" directory), because it searches for some_file.txt relative to the current working directory of the build command, not relative to the definition file.
Is there a way to write the definition file such that the build works independently of where the command is called? I know that I could use absolute paths, but this is not a viable solution in my case.
To make it even more complicated: my actual definition file bootstraps from another local image, which is located in the build directory. So ideally I would need a solution where some files are found relative to the working directory while others are found relative to the location of the definition file.

Short answer: Not really
Longer answer: Not really, but there's a reason why and it shouldn't really matter for most use cases. While Docker went the route of letting you specify what your directory context is, Singularity decided to base all of its commands off the current directory where it is being executed. This also follows with $PWD being auto-mounted into the container, so it makes sense for it to be consistent.
That said, is there a reason you can't run singularity build --fakeroot $build_dir/foobar.sif foobar.def from the repo directory? There isn't any other output written besides the final image and it makes more sense for the directory with the data being used to be the context to work from.
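If the build has to be launched from somewhere else, a small wrapper can apply the same idea: run the command with the definition file's directory as working directory and point the output at the build directory. This is a minimal sketch; plan_build and the example paths are hypothetical, and only the singularity build --fakeroot invocation is taken from the question.

```python
from pathlib import Path


def plan_build(def_file, build_dir, image_name):
    """Return (command, working_directory) for building next to the definition file.

    The image path is made absolute so that it still points into `build_dir`
    after the working directory changes.
    """
    def_path = Path(def_file).resolve()
    image_path = Path(build_dir).resolve() / image_name
    command = ["singularity", "build", "--fakeroot", str(image_path), def_path.name]
    return command, def_path.parent


command, cwd = plan_build("repo/foobar.def", "build", "foobar.sif")
print(command, cwd)
# To actually run the build (requires Singularity to be installed):
#   import subprocess
#   subprocess.run(command, cwd=cwd, check=True)
```

This does not cover the case from the question where the definition file also bootstraps from a local image in the build directory; that path would still be resolved against the working directory.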

Working directory when using include in snakemake for rules that use the report() function

I am using snakemake to program my workflows. In order to reuse code, the simplest way is to use the statement
include: "path/to/other/Snakefile"
This works fine for most cases but fails when creating the reports via the report() function. The problem is that it does not find the .rst file that is specified for the caption.
Thus it seems that report() has the working directory in which the other Snakefile is located and not the one of the main Snakefile.
Is there a flexible workaround for this, so that it behaves as just being loaded into the Snakefile and then being executed as it were in the main Snakefile?
This is an example rule in another Snakemake file:
    rule evaluation:
        input:
            "data/final_feature.model"
        output:
            report("data/results.txt", caption="report/evaluation.rst", category="Evaluation")
        shell:
            "Rscript {scripts}/evaluation.R {input}"
This is included in the main Snakefile via:
include: "../General/subworkflows/evaluation.snakemake"
This is the error message showing that the file is not present:
WorkflowError:
Error loading caption file of output marked for report.
FileNotFoundError: [Errno 2] No such file or directory: '.../workflows/General/subworkflows/report/evaluation.rst'
Thank you for any help in advance!

One option may be to expand relative paths to absolute paths using os.path.abspath(). If the paths are relative to the directory where the Snakefile is, you may instead need to use workflow.basedir, which contains the path of the directory the Snakefile lives in. For example:

    caption=os.path.join(workflow.basedir, "report/evaluation.rst")

snakemake: is there a way to specify an output directory for each rule?

The scripts I use all write their output files to the current directory from which the script was called, so in my shell-script pipeline I had cd commands to move into a particular directory before running each command, and the output files would simply end up in the relevant directories. My scripts don't have a parameter for the output directory, and most of them deduce the output file names from the input. That has worked pretty well for me.
Now I'm consistently running into this output directory issue, as snakemake seems to write the files to the directory where the Snakefile is. I could modify all the scripts to take an additional parameter for the output directory, but that's going to be a pain with so many scripts. I'm wondering if there is any way to specify where the output should go for each specific rule?

One hack would be to first cd into the output directory, i.e. "cd $(dirname {output[0]})". This needs to be the first in your shell commands.
Having said this, it would be better to change the script to accept an output directory as argument.
Andreas

Here is an example rule that I use in one of my snakefiles:
    rule link_raw_data:
        output:
            OPJ(data_dir, "{lib}_{rep}.fastq.gz"),
        params:
            directory = data_dir,
            shell_command = lib2data,
        message:
            "Making link to raw data {output}."
        shell:
            """
            (
            cd {params.directory}
            {params.shell_command}
            )
            """
This is probably a bit different from your situation, but hopefully some of the techniques can help. In particular, note the parentheses in the shell section and the usage of a params section to define the output directory.
I'm not sure I'm doing this in the most elegant way, but it works.
data_dir is a parameter read from a config file.
lib2data is a function that generates commands based on the values of some wildcards. Of course, I have to ensure that these commands use the correct input file paths (and, in this case, that they produce the output consistently with what the output section says). In your case, you may simply have "hard-coded" shell commands, possibly using some of the rule's input.
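The body of lib2data is not shown in this answer, so the following is only a guess at its general shape: a function that receives the wildcards and returns a shell command as a string. The name lib2data_sketch, the RAW_FILES table and the paths in it are all hypothetical.

```python
from types import SimpleNamespace

# Hypothetical lookup table: (library, replicate) -> absolute path of the raw file.
RAW_FILES = {
    ("WT", "1"): "/raw/run42/WT_rep1.fastq.gz",
}


def lib2data_sketch(wildcards):
    """Build the shell command that links one raw file under its standardized name."""
    source = RAW_FILES[(wildcards.lib, wildcards.rep)]
    # The link name matches the "{lib}_{rep}.fastq.gz" pattern of the rule's output.
    target = f"{wildcards.lib}_{wildcards.rep}.fastq.gz"
    return f"ln -s {source} {target}"


# Snakemake passes a wildcards object to such functions;
# SimpleNamespace stands in for it here.
print(lib2data_sketch(SimpleNamespace(lib="WT", rep="1")))
```

Because the rule changes into the data directory before running the command, the link target is given as a bare file name, while the source has to be an absolute path.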
More streamlined example
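    rule run_script1:
        input:
            "path/to/initial/input"
        output:
            "script1_out/output1"
        shell:
            """
            # Make the input path absolute before leaving the working directory
            input="$PWD/{input}"
            cd script1_out
            script1 "$input"
            """

    rule run_script2:
        input:
            "script1_out/output1"
        output:
            "script2_out/output2"
        shell:
            """
            # Make the input path absolute before leaving the working directory
            input="$PWD/{input}"
            cd script2_out
            script2 "$input"
            """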
Starting from these examples, you can use functions of the wildcards in the input or output if necessary.

In snakemake documentation:
"All paths in the snakefile are interpreted relative to the directory snakemake is executed in. This behaviour can be overridden by specifying a workdir in the snakefile:"
workdir: "path/to/workdir"
So just put that at the beginning of your snakefile and all inputs and outputs will be interpreted relative to this path.

You could try to use a configuration file either in YAML or JSON maybe. Then use the directory as a parameter in your expand or in the input/output of your rules.
See the documentation here
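As a rough illustration of the idea, the directory for each rule can live in the configuration and be joined with the file name when building input and output paths. This is a sketch in plain Python: the output_dirs key and the output_path helper are hypothetical, and in a Snakefile the config dictionary would be filled by the configfile directive instead of json.loads.

```python
import json
import os

# Stand-in for the contents of a JSON config file; the keys are hypothetical.
CONFIG_TEXT = '{"output_dirs": {"run_script1": "script1_out", "run_script2": "script2_out"}}'
config = json.loads(CONFIG_TEXT)


def output_path(rule_name, filename):
    """Join the configured output directory of a rule with a file name."""
    return os.path.join(config["output_dirs"][rule_name], filename)


# The result can be used wherever a rule expects an input or output path.
print(output_path("run_script1", "output1"))
```

Changing where a rule writes its files then only requires editing the config file, not the rules themselves.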
