How does the --touch flag for snakemake work exactly? - snakemake

I am running a big snakemake workflow on a cluster and for bug fixing I had to update a specific rule. I do not want to recreate every file that has been created before the update. Just those that were not created. When I just start the workflow snakemake wants to update these files too. So I added the --touch flag. This is failing as soon as it hits a file that has not been created yet.
Can someone explain how the --touch flag works exactly? Is this supposed to happen this way? Is there a way to apply --touch only to files, that have already been created? Maybe be combining with --forceall?
Thanks for your help :)

Related

Snakemake: What will happen if the output file of a rule is already generated?

I'm very new to snakemake, and I downloaded a package from github that utilize snakemake, I managed to run it once, but since my data is so large, it took 27 hours to complete the whole thing, but around 99% of it is spent on executing 1 rule, so I wanted to skip that particular rule, when the output file of that rule has already existed. Is snakemake going to skip that rule automatically if the output file of that rule is listed in the rule all section? else, what should I do to skip it?
From the way you describe it, yes, snakemake will skip that long-running rule if its output is already present AND the output is newer than its input. If this second condition is not met, snakemake will run the rule again. This makes sense, right? If the input has been updated then the output is obsolete and needs to be redone. Note that snakemake checks the timestamps not the content of the files.
In practice, you can execute snakemake with the --dry-run option to confirm it is not going to run that rule again. Look also at the --summary option to see why snakemake wants to execute some rules and skip others.
(In doubt, make a copy of the output from the long-running rule, just in case...)

snakemake randomly identifying outputs as incomplete

I'm having some trouble with a snakemake workflow I developed. For a specific rule, the output is sometimes identified as incomplete by snakemake:
IncompleteFilesException:
The files below seem to be incomplete. If you are sure that certain files are not incomplete, mark them as complete with
snakemake --cleanup-metadata <filenames>
To re-generate the files rerun your command with the --rerun-incomplete flag.
Incomplete files:
This rule runs several times (with different wildcard values) and only some fail with this error. Interestingly, if I rerun the workflow from scratch, the same jobs will complete with no error and other ones might produce it. Also, I manually checked the output and don't see anything wrong with it. I can resume the workflow with no problem.
I am aware of the --ignore-incomplete workaround, but still curious as to why this might happen? How does snakemake decide about an output being incomplete? I should also mention that the jobs run on a PBS HPC system - not sure if it's related.
Incomplete in this context probably means, that the job did not finish how it should have been, so Snakemake cannot guarantee the output is how it should be. If your rule produces output but then fails, Snakemake would still mark the output as incomplete.
I looked up in the source code when the IncompleteFilesException is raised. Snakemake seems to mark files as complete when persistence.finished() is called, see code here.
And finished() is called by postprocess() which again gets called by a number of places. Without knowing Snakemake inside out, it seems hard to know where the problem lies. Somehow, Snakemake must think that the job didn't complete properly.
I would look into the logs of the Snakemake runs. Possibly some of the jobs fail.

Snakemake: go back and clean up temp() files

I know variants on this have been asked before (e.g. https://groups.google.com/forum/#!topic/snakemake/4kslVBX2kew), but I don't see a definitive solution.
If I run a long-running and complex Snakemake pipeline with '--notemp' (maybe because I'm debugging), it would be really nice to be able to subsequently run a 'cleanup' command to delete anything that would automatically have been deleted on the first run without --notemp. Is there any easy way of doing this?
The way I'm doing this right now is to re-run after using '--forceall --touch', without '--notemp', such that everything just gets touched, and the temp files then get removed at the end. But it's not ideal to change all the timestamps. Is there a better way?
Jon
Since v5.0.0, --delete-temp-output achieves this.
--delete-temp-output
Remove all temporary files generated by the workflow. Use together with –dry-run to list files without actually deleting anything. Note that this will not recurse into subworkflows.
Default: False

How can a modified Julia package be used natively?

So, there is this cool package I've found but it leaves a lot to be desired. Since it made more sense to modify it, rather than build a new one myself, I changed the code in the corresponding source directory (C:\Users[my username].julia\v0.4[package name]\src). I made sure to modify not just the base.jl file, but also the [name of package].jl one so that there are no issues with dependencies or the new functions I added. I tried running the package several times to ensure that Julia doesn't spit out any errors or exceptions (the original package had some deprecated stuff, which I also remedied). Still, I fail to use the additional functionality of the package that I augmented. Any help would be greatly appreciated.
I'm using Julia ver 0.4.2, on a Windows 7 machine. As an IDE I use Notepad++. Thanks
I'm not exactly sure what you tried, but here's a guess as to what's going on: if you've already loaded the package in your julia session, edits to the source files won't take effect unless you explicitly reload the package. There are some good workflow tips here, and more explanation of the module system here.
However, for a newbie the easiest thing might be to quit julia and restart.
As far as making changes to a package, as Gnumic commented, your best approach is to make a branch and commit your changes there. Once you become convinced your changes represent an improvement, consider sharing your changes with the rest of the world.

Darcs conflicts

I installed Darcs a few days ago and have a doubt.
I am the only programmer and I usually work on two or three instances of the application, making new feautures. The problems cames because this instances modify the same source code file, so when I finished them and send to main repository they make a conflict.
Is there any way to deal with this? Can I write the same file in multiple instances without making conflict when pushing to main repository?
thanks
First of all when changes occurs at different places of the file there is generally no conflicts when merging. When two patches can be merged without conflicts one says that they commute. In your case it happens that you've modified the same part of the file in two different branches. In this case darcs don't allow you to "push" the second patch that makes the conflict.
There is two ways to resolve such a confilct, but you have to start to locally merge the both patches to get the conflict in your working repo. To do this just pull the patches from the main repository. Then you have to edit the offended file and resolve the conflict.
The first way is simple and the prefered solution, you have to "amend-record" the patch that is not yet on the main repository (look at the usage of the "darcs amend-record" command).
The other solution is to record a resolution patch, by calling "darcs record" and then pushing both the conflicting patch and the resolution patch. This solution tends to complicate the history and can make some later operations longer. However when the branch has been heavily distributed this solution becomes needed.