Snakemake shadow rule when program writes to /tmp - relative-path

I am using Snakemake to run the defense-finder program. This program creates and overwrites generic temporary files in /tmp/defense-finder, i.e. the file names do not contain unique identifiers. When running my rule across separate cores on different input files, Snakemake crashes due to clashes in /tmp/defense-finder.
It appears that Shadow rules can help when different jobs write to the same files within the working directory. Is there a way to use Shadow rules when a program writes to the /tmp directory?

Following @Marmaduke's comment that the file paths are hard-coded, a temporary workaround is to force Snakemake to run the defense-finder jobs one at a time while still allowing other jobs to run in parallel. You can do this with the resources directive:
rule defense_finder:
    resources:
        n_defense=1,
    input: ...
    output: ...
    shell: ...
then run with:
snakemake --resources n_defense=1 -j 10 ...
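To see why this serialises only the defense-finder jobs, here is a toy model in plain Python. It is not Snakemake's actual scheduler, just a greedy sketch under the assumption that each job takes one core; the function select_jobs and the blast_* job names are made up for illustration.

```python
def select_jobs(pending, cores, limits):
    """Greedily pick jobs whose combined needs stay within the global limits.

    pending: list of (name, resources) pairs; each job is assumed to use one core.
    limits:  global caps, e.g. {"n_defense": 1} as passed via --resources.
    """
    used = {key: 0 for key in limits}
    selected = []
    for name, needs in pending:
        if len(selected) >= cores:
            break
        # A job fits only if every capped resource it requests is still available.
        fits = all(used[k] + needs.get(k, 0) <= cap for k, cap in limits.items())
        if fits:
            for k in limits:
                used[k] += needs.get(k, 0)
            selected.append(name)
    return selected


pending = [
    ("defense_finder_A", {"n_defense": 1}),
    ("defense_finder_B", {"n_defense": 1}),
    ("blast_A", {}),  # hypothetical jobs that do not declare n_defense
    ("blast_B", {}),
]
batch = select_jobs(pending, cores=10, limits={"n_defense": 1})
print(batch)  # ['defense_finder_A', 'blast_A', 'blast_B']
```

The second defense-finder job has to wait for the next round because the n_defense budget is used up, while jobs that do not declare the resource are limited only by the core count.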

Related

How to update output files without changing them in Snakemake

I have changed an initial rule that adds links to database files. Snakemake typically requires all dependent rules to rerun, even though the database change is not really relevant. I tried to avoid this behaviour with
snakemake --touch
Unfortunately it does not work and still wants to rerun all blast processes on top.
How can I check the dependency tree? How can I mark individual files as up to date?
snakemake --version 6.4.1
Best, Michael

Snakemake : subworkflow not playing well with the main DAG

I have a main Snakefile and several subworkflows running in independent subdirectories (with paths relative to their own directories). I've noticed that if I modify one of the inputs of a subworkflow, it will rerun correctly, but the rules that come afterwards are not rerun.
If I understand correctly what is going on, there's a different DAG for the main Snakefile and for each subworkflow. The main DAG is not aware of any modification in a subworkflow and therefore won't trigger a rerun since the output of the subworkflow hasn't been modified yet.
I'd like all the rules that depend on the output of a subworkflow to be rerun if there's a modification in that subworkflow. Isn't that what the default behaviour should be?
I've also tried the other modularisation techniques. Using includes works but is super annoying because I have to modify all the paths to be relative to the main directory (and therefore I can't run snakemake independently in one subdirectory anymore). I've also tried using the new module system coming with snakemake v.6 that is supposed to be replacing subworkflows. Maybe I don't use it correctly, but it doesn't seem to work for my use case. If I import a rule from a subdirectory it complains that there are missing inputs. It doesn't find the scripts because they are in the subdirectory and not in the main directory. So in that sense it works more like an include than a subworkflow.
Do you have any idea on how to solve my issue ?
Here's a small working example with the module implementation:
MainDirectory
| - Snakefile
      rule all:
          input: "Subdirectory/file.txt"
      module other_workflow:
          snakefile: "Subdirectory/Snakefile"
      use rule * from other_workflow as other_*
| - Subdirectory
| | - Snakefile
        rule rule_a:
            input:
                script = 'code.py'
            output: 'file.txt'
            shell: 'python {input.script}'
| | - code.py
        with open('file.txt', 'w') as f:
            print('This is a test.', file=f)
This doesn't work as the snakefile in the main directory uses all the rules in the same workdir, whereas I would like it to be running the imported rules in their own workdir. I can make it work by modifying all the relative paths in the subdirectory but that's not what I want. I want to be able to run it without modifications.
The issue is that you mix input files with code here. If the script code.py is defined as an input file, Snakemake expects it to be in the workdir. If you use the script directive or the Jupyter notebook directive instead, the path will automatically be relative to the Snakefile. If that is not an option for whatever reason, you can instead build the path relative to the current Snakefile via Path(workflow.snakefile).parent / "code.py".
Note, there is really no reason to register code as input files. If you intend to get a rerun upon changes in the code, it is better to rely on snakemake --list-code-changes. The reason Snakemake does not automatically trigger reruns upon code changes is that they can be just cosmetic (e.g. formatting). Hence, it is up to the dev to trigger the rerun, e.g. via --list-code-changes, or manually.
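For what it's worth, the path construction mentioned above can be tried outside Snakemake. This is only a sketch: beside_snakefile is a made-up helper, and the literal Snakefile path stands in for what workflow.snakefile would supply inside a real workflow.

```python
from pathlib import Path


def beside_snakefile(snakefile, name):
    """Return the path of a file that sits next to the given Snakefile."""
    return Path(snakefile).parent / name


# Stand-in for workflow.snakefile of the imported workflow (assumed layout from the question).
script = beside_snakefile("MainDirectory/Subdirectory/Snakefile", "code.py")
print(script.as_posix())  # MainDirectory/Subdirectory/code.py
```

The result no longer depends on the directory Snakemake is executed in, which is why the imported rule can find code.py from the main directory.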

Working directory when using include in snakemake for rules that use the report() function

I am using snakemake to program my workflows. In order to reuse code, the simplest way is to use the statement
include: "path/to/other/Snakefile"
This works fine for most cases but fails when creating the reports via the report() function. The problem is that it does not find the .rst file that is specified for the caption.
Thus it seems that report() has the working directory in which the other Snakefile is located and not the one of the main Snakefile.
Is there a flexible workaround for this, so that it behaves as just being loaded into the Snakefile and then being executed as it were in the main Snakefile?
This is an example rule in another Snakemake file:
rule evaluation:
    input:
        "data/final_feature.model"
    output:
        report("data/results.txt", caption="report/evaluation.rst", category="Evaluation")
    shell:
        "Rscript {scripts}/evaluation.R {input}"
This is included in the main Snakefile via:
include: "../General/subworkflows/evaluation.snakemake"
This is the error message showing that the file is not present:
WorkflowError:
Error loading caption file of output marked for report.
FileNotFoundError: [Errno 2] No such file or directory: '.../workflows/General/subworkflows/report/evaluation.rst'
Thank you for any help in advance!
One option may be to expand relative paths to absolute paths using os.path.abspath(). If the paths are relative to the directory where the main Snakefile is, you may instead need to use workflow.basedir, which contains the path to the directory of the main Snakefile. For example:
caption=os.path.join(workflow.basedir, "report/evaluation.rst")

snakemake: is there a way to specify an output directory for each rule?

The scripts I use all write their output files to the directory from which they are called, so in my shell script pipeline I have cd commands to go to a particular directory before running each command, and the output files are simply saved in the relevant directories. My scripts don't have a parameter for the output directory, and most of them deduce the output file names from the input. That has worked pretty well for me.
Now I'm running into this output directory issue consistently, as snakemake seems to write the files to the directory where the Snakefile is. I could modify all the scripts to take an additional parameter for the output directory, but that's going to be a pain with so many scripts. I'm wondering if there is any way to specify where the output should go for each specific rule?
One hack would be to first cd into the output directory, i.e. "cd $(dirname {output[0]})". This needs to be the first in your shell commands.
Having said this, it would be better to change the script to accept an output directory as argument.
Andreas
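The same cd trick can be sketched in plain Python, for instance if the command is launched with subprocess rather than from a shell section. This is just an illustration under stated assumptions: run_in_output_dir is a made-up helper, and writer is a stand-in for a script that always writes result.txt to its current directory.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_output_dir(command, output_file):
    """Run command with the parent directory of output_file as its working directory."""
    out_dir = Path(output_file).parent
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(command, cwd=out_dir, check=True)


# Stand-in for a script that can only write to the current directory.
writer = [sys.executable, "-c", "open('result.txt', 'w').write('done')"]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results" / "sample1" / "result.txt"
    run_in_output_dir(writer, target)
    created = target.exists()

print(created)  # True
```

As with the shell version, any relative input paths passed to the command would have to be made absolute first, because they are resolved from the new working directory.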
Here is an example rule that I use in one of my snakefiles:
rule link_raw_data:
    output:
        OPJ(data_dir, "{lib}_{rep}.fastq.gz"),
    params:
        directory = data_dir,
        shell_command = lib2data,
    message:
        "Making link to raw data {output}."
    shell:
        """
        (
        cd {params.directory}
        {params.shell_command}
        )
        """
This is probably a bit different from your situation, but hopefully some of the techniques can help. In particular, note the parentheses in the shell section and the usage of a params section to define the output directory.
I'm not sure I'm doing this in the most elegant way, but it works.
data_dir is a parameter read from a config file.
lib2data is a function that generates commands based on the values of some wildcards. I have to ensure that these commands use the correct input file paths of course (and, in this case, that they also write the output where the output section says it will be). In your case, you may simply have hard-coded shell commands, possibly using some of the rule's input.
More streamlined example
rule run_script1:
    input:
        "path/to/initial/input"
    output:
        "script1_out/output1"
    shell:
        """
        # After cd, the relative input path must be reached from one level up.
        cd script1_out
        script1 ../{input}
        """
rule run_script2:
    input:
        "script1_out/output1"
    output:
        "script2_out/output2"
    shell:
        """
        cd script2_out
        script2 ../{input}
        """
Starting from these examples, you can use functions of the wildcards in the input or output if necessary.
In snakemake documentation:
"All paths in the snakefile are interpreted relative to the directory snakemake is executed in. This behaviour can be overridden by specifying a workdir in the snakefile:"
workdir: "path/to/workdir"
So just put that at the beginning of your snakefile and all inputs and outputs will be interpreted relative to this path.
You could try to use a configuration file either in YAML or JSON maybe. Then use the directory as a parameter in your expand or in the input/output of your rules.
See the documentation here
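As a rough sketch of the config idea: in a Snakefile, the configfile directive fills a dictionary named config, and the directory stored there can then be used when building paths. Below, the JSON text is parsed by hand so the snippet runs on its own; the keys output_dir and samples and the helper output_paths are made up for illustration.

```python
import json

# Stand-in for the contents of a config file loaded via the configfile directive.
config_text = '{"output_dir": "results/run1", "samples": ["A", "B"]}'
config = json.loads(config_text)


def output_paths(config, suffix):
    """Build one output path per sample inside the configured directory."""
    return [f"{config['output_dir']}/{sample}{suffix}" for sample in config["samples"]]


paths = output_paths(config, ".txt")
print(paths)  # ['results/run1/A.txt', 'results/run1/B.txt']
```

Changing the output location then only requires editing the config file, not the rules.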

Problem with multiple listings of the same file in RPM spec

I have some problems with an rpm spec file that is listing the same file multiple times. For this spec we do some normal compilation and then we have a script that copies everything to the buildroot. Within this buildroot we have a lot of generic scripts that need to be installed on the final system, so we just list this directory.
However, the problem is that one of the scripts might be edited after installation, because it contains configuration options. So we list this script again with different attributes, as %config. This means the script is defined multiple times with conflicting attributes, so rpmbuild complains and does not include the script in the installation package at all.
Is there a good way to handle this problem and to tell rpmbuild to only use the second definition, or do we have to separate the script into two parts, one containing the configuration and one containing the actual logic?
Instead of specifying the directory, you can create a file list and then prune duplicate files from that.
So where you have something like
%files
%dir foo
%config foo/scriptname
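You modify those parts as follows (the find and sed commands generate the list during the build, typically at the end of %install):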
find $RPM_BUILD_ROOT -type f | sed -e "s|^$RPM_BUILD_ROOT||" > filelist
sed -i "\|^foo/scriptname$|d" filelist
%files -f filelist
%config foo/scriptname
You can also use %{buildroot} in place of $RPM_BUILD_ROOT.
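If the find/sed pipeline is hard to follow, the same pruning logic can be sketched in Python. This is only an illustration of the idea, not something rpmbuild requires: build_filelist is a made-up helper, the temporary directory stands in for the buildroot, and foo/helper is an invented second script. Note that the listed paths start with a slash once the buildroot prefix is stripped.

```python
import os
import tempfile


def build_filelist(buildroot, exclude):
    """List installed paths under buildroot, skipping those that get their own %files entry."""
    entries = []
    for dirpath, _dirs, files in os.walk(buildroot):
        for name in files:
            full = os.path.join(dirpath, name)
            # Strip the buildroot prefix, like the first sed expression does.
            installed = "/" + os.path.relpath(full, buildroot).replace(os.sep, "/")
            if installed not in exclude:
                entries.append(installed)
    return sorted(entries)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "foo"))
    for name in ("scriptname", "helper"):
        open(os.path.join(root, "foo", name), "w").close()
    filelist = build_filelist(root, exclude={"/foo/scriptname"})

print(filelist)  # ['/foo/helper']
```

The excluded script is then listed exactly once, by the separate %config line.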