Snakemake: What will happen if the output file of a rule is already generated? - snakemake

I'm very new to Snakemake. I downloaded a package from GitHub that uses Snakemake and managed to run it once, but because my data is so large, the whole run took 27 hours, and around 99% of that time was spent executing one rule. I would like to skip that particular rule when its output file already exists. Will Snakemake skip the rule automatically if its output file is listed in the rule all section? If not, what should I do to skip it?

From the way you describe it, yes, snakemake will skip that long-running rule if its output is already present AND the output is newer than its input. If this second condition is not met, snakemake will run the rule again. This makes sense, right? If the input has been updated then the output is obsolete and needs to be redone. Note that snakemake checks the timestamps not the content of the files.
In practice, you can execute snakemake with the --dry-run option to confirm it is not going to run that rule again. Look also at the --summary option to see why snakemake wants to execute some rules and skip others.
(If in doubt, make a copy of the output from the long-running rule first, just in case...)
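The timestamp rule described above can be illustrated with a small Python sketch. This is a simplified model of the idea, not Snakemake's actual implementation: the function name is_up_to_date and the file names are made up for illustration, and the handling of equal timestamps is an assumption.

```python
import os
import tempfile


def is_up_to_date(inputs, outputs):
    """Return True if every output exists and none is older than any input.

    Hypothetical helper that mimics a timestamp-only check; file contents
    are never inspected.
    """
    if not all(os.path.exists(path) for path in outputs):
        return False  # a missing output always means the rule must run
    if not inputs:
        return True
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return oldest_output >= newest_input


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    out = os.path.join(tmp, "output.txt")
    for path in (src, out):
        with open(path, "w") as handle:
            handle.write("data\n")

    # Output newer than input: the rule would be skipped.
    os.utime(src, (1000, 1000))
    os.utime(out, (2000, 2000))
    print(is_up_to_date([src], [out]))  # True

    # Input modified after the output was written: the rule would rerun.
    os.utime(src, (3000, 3000))
    print(is_up_to_date([src], [out]))  # False
```

Because only modification times are compared, merely touching an input file is enough to make the output look stale, even if its content is unchanged.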

Related

How does the --touch flag for snakemake work exactly?

I am running a big Snakemake workflow on a cluster, and to fix a bug I had to update a specific rule. I do not want to recreate every file that was created before the update, only those that have not been created yet. When I simply start the workflow, Snakemake wants to regenerate the existing files too, so I added the --touch flag. This fails as soon as it hits a file that has not been created yet.
Can someone explain how the --touch flag works exactly? Is it supposed to behave this way? Is there a way to apply --touch only to files that have already been created, maybe by combining it with --forceall?
Thanks for your help :)

snakemake randomly identifying outputs as incomplete

I'm having some trouble with a snakemake workflow I developed. For a specific rule, the output is sometimes identified as incomplete by snakemake:
IncompleteFilesException:
The files below seem to be incomplete. If you are sure that certain files are not incomplete, mark them as complete with
snakemake --cleanup-metadata <filenames>
To re-generate the files rerun your command with the --rerun-incomplete flag.
Incomplete files:
This rule runs several times (with different wildcard values) and only some fail with this error. Interestingly, if I rerun the workflow from scratch, the same jobs will complete with no error and other ones might produce it. Also, I manually checked the output and don't see anything wrong with it. I can resume the workflow with no problem.
I am aware of the --ignore-incomplete workaround, but still curious as to why this might happen? How does snakemake decide about an output being incomplete? I should also mention that the jobs run on a PBS HPC system - not sure if it's related.
Incomplete in this context probably means that the job did not finish the way it should have, so Snakemake cannot guarantee that the output is as expected. If your rule produces output but then fails, Snakemake will still mark the output as incomplete.
I looked up in the source code when the IncompleteFilesException is raised. Snakemake seems to mark files as complete when persistence.finished() is called, see code here.
finished() is called by postprocess(), which in turn is called from a number of places. Without knowing Snakemake inside out, it is hard to say where the problem lies. Somehow, Snakemake must think that the job didn't complete properly.
I would look into the logs of the Snakemake runs. Possibly some of the jobs fail.
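To make the idea concrete, here is a toy Python model of marker-based completion tracking. It assumes that a marker is written when a job starts and removed only when finished() is reached; the class MarkerStore, the function run_job, and the file names are hypothetical and are not Snakemake's actual code.

```python
import os
import tempfile


class MarkerStore:
    """Toy model: a marker file exists for as long as an output is unfinished."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _marker(self, output):
        # Flatten the output path into a single marker file name.
        return os.path.join(self.root, output.replace("/", "_"))

    def started(self, output):
        open(self._marker(output), "w").close()

    def finished(self, output):
        os.remove(self._marker(output))

    def incomplete(self, output):
        return os.path.exists(self._marker(output))


def run_job(store, output, action):
    store.started(output)
    action()  # if this raises (or the process is killed), finished() never runs
    store.finished(output)


def crash():
    raise RuntimeError("simulated job failure")


with tempfile.TemporaryDirectory() as tmp:
    store = MarkerStore(os.path.join(tmp, "incomplete"))

    run_job(store, "results/a.txt", lambda: None)
    print(store.incomplete("results/a.txt"))  # False

    try:
        run_job(store, "results/b.txt", crash)
    except RuntimeError:
        pass
    print(store.incomplete("results/b.txt"))  # True
```

Under this model, an output is reported as incomplete whenever the bookkeeping step after the job is not reached, even if the output file itself looks fine. That would be consistent with the symptoms described in the question, but it is only a guess at the cause.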

How to disable Perl 6 REPL creating .precomp

Every time I run perl6 to enter the REPL mode, it creates a .precomp directory, which also slows down the appearance of the prompt. If the .precomp directory already exists, the prompt appears almost immediately, otherwise perl6 takes several seconds to create it.
Is there a way to disable this feature?
Check whether you have a PERL6LIB environment variable set and whether it contains "." (the current directory). I can reproduce exactly the behavior you're encountering if I set that. The solution is to remove it from your PERL6LIB.

Snakemake: go back and clean up temp() files

I know variants on this have been asked before (e.g. https://groups.google.com/forum/#!topic/snakemake/4kslVBX2kew), but I don't see a definitive solution.
If I run a long-running and complex Snakemake pipeline with '--notemp' (maybe because I'm debugging), it would be really nice to be able to subsequently run a 'cleanup' command to delete anything that would automatically have been deleted on the first run without --notemp. Is there any easy way of doing this?
The way I'm doing this right now is to re-run after using '--forceall --touch', without '--notemp', such that everything just gets touched, and the temp files then get removed at the end. But it's not ideal to change all the timestamps. Is there a better way?
Jon
Since v5.0.0, --delete-temp-output achieves this.
--delete-temp-output
Remove all temporary files generated by the workflow. Use together with --dry-run to list files without actually deleting anything. Note that this will not recurse into subworkflows.
Default: False

Faster way of testing your prolog program

I am new to Prolog, and launching the Prolog interpreter from the terminal, typing consult('some_prolog_program.pl'), and then testing the predicate I just wrote is very time-consuming. Is there a way to run a scripted test to speed up development?
For example, in C I can write a main that uses the functions I defined and then execute:
make && ./a.out
to test the code, can I do something similar with Prolog?
You can have the interpreter always open and then recompile the file.
You can auto-run a predicate after compiling the file:
:- foo(4,2).
This will run foo(4,2) when the line is encountered in the file.
There are flags that can be used while launching (most) Prolog interpreters that allow you to compile a file and run predicates (check the man page). This way you could make a Bash script. The following will consult file.pl and run foo/0 using SWI-Prolog:
#!/bin/sh
# Load file.pl quietly, run foo/0, and pass any script arguments through unchanged.
exec swipl -q -f none -g "load_files([file],[silent(true)])" \
     -t foo -- "$@"
This predicate will unify Arguments with a list of the flags you gave at the command line:
current_prolog_flag(argv, Arguments)
But unless you are going to run a lot of tests, I don't think that writing all this extra code will be faster.
Personally I really like the flexibility of testing any predicate at any time with or without tracing (see trace/0) without having to write extra code to call them (unlike in C).
P.S. about reloading the file without leaving the interpreter: You might have some problem if you have used dynamic predicates or global variables; you will have to do some cleaning.
You can invoke a test file from the command-line with prolog +l <file>
Also, you can build a single run_tests predicate that exercises a series of calls and validates the actual results against expected results. Here's an article with a good worked-out example: http://kenegozi.com/blog/2008/07/24/unit-testing-in-prolog
In SWI, you can load things as usual. Then, when you edit your files you simply say make. on the toplevel and it checks all dependencies automatically and only reloads the modified files.
For bigger projects it makes a lot of sense to use makefiles, in particular for unit testing. See SWI's package plunit.
For simple scripts in SWI-Prolog, using REPL to test the code manually is usually good enough. Changed files can be reloaded via make/0 (?- make. on toplevel). Just keep the Prolog REPL running while editing, then save the edits, run make. in the REPL and hit ↑, ↑, Enter to execute the last query before the make. from history.
The main benefit of REPL is its interactivity:
You may fiddle with the arguments.
Transition to debugging or tracing (both command line and graphical) is easy.
You don't need to perform I/O to print the result. Output is handled by the toplevel, which prints the substitution. You see the whole substitution, not only its part you just happen to print (possibly accidentally overlooking other parts).
You may interactively choose how many substitutions you want to see for a goal that succeeds multiple times.
It is obvious if there is a choice point left after the last result returned by a non-deterministic predicate, which is hard to observe otherwise. In that case, false. is printed when backtracking beyond the last result.
If you need to preserve the test calls to repeat them later, create a protocol (transcript or "log" of the interactive session) and edit it to become a script, or even a test suite (see below). The protocol is a plain text file with escape sequences for the terminal, containing a verbatim copy of what you see during the interactive session. View the protocol using cat protocol.txt on Linux (and other *NIXes) or type protocol.txt on Windows.
If interactivity is not needed, perform the test calls from the command line non-interactively. Let's test the CLP(FD) factorial example n_factorial/2, saved in factorial.pl (don't forget to add :- use_module(library(clpfd)). when copying the code):
$ swipl -q -t "between(0, 9, N), n_factorial(N, F), format('~D ', F), fail." factorial.pl
1 1 2 6 24 120 720 5,040 40,320 362,880
On Windows, you may need to specify the full path to swipl.exe, as it is probably not in the PATH.
If the call is always the same, you may save it to a shell script or Makefile (run would be a good name for the target).
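As an alternative to a shell script or Makefile, the same call could be wrapped in a small Python helper. This is only a sketch: the function names swipl_command and run_goal are made up for illustration, the flags are the ones used in the command above, and it assumes swipl is installed and that factorial.pl exists.

```python
import subprocess


def swipl_command(goal, source_file, executable="swipl"):
    """Build the argument list for a non-interactive SWI-Prolog call."""
    return [executable, "-q", "-t", goal, source_file]


def run_goal(goal, source_file, executable="swipl"):
    """Run the goal and return whatever the Prolog program printed."""
    completed = subprocess.run(
        swipl_command(goal, source_file, executable),
        capture_output=True,
        text=True,
        check=False,  # a goal ending in "fail" may yield a non-zero exit status
    )
    return completed.stdout


GOAL = "between(0, 9, N), n_factorial(N, F), format('~D ', F), fail."

# Show the command that would be executed; call run_goal(GOAL, "factorial.pl")
# to actually run it once swipl and the source file are available.
print(swipl_command(GOAL, "factorial.pl"))
```

On Windows, the executable parameter can be set to the full path of swipl.exe if it is not on the PATH.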
In your current workflow for testing functions in C, you create a new program and call the function under test from its entry point (main function). Prolog scripts can have an entry point, too. See library(main). Prolog does not require compilation, so you can just directly call the script (./test.pl) without calling Make first.
For larger projects, you may want to create a less ad-hoc test suite. A unit testing framework like PlUnit is needed. Its use is beyond the scope of this answer; see the documentation.