Snakemake: parameterized runs that re-use intermediate output files

Here is a complex problem that I am struggling to find a clean solution for:
Imagine having a Snakemake workflow with several rules that can be parameterized in some way. Now, we might want to test different parameter settings for some rules, to see how the results differ. However, ideally, if these rules depend on the output of other rules that are not parameterized, we want to re-use these non-changing files, instead of re-computing them for each of our parameter settings. Furthermore, if at all possible, all this should be optional, so that in the default case, a user does not see any of this.
There is inherent complexity in there (specifying which files are re-used, etc.). I am also aware that this is not exactly the intended use case of Snakemake ("reproducible workflows"), but more of a meta-feature for experimentation.
Here are some approaches:
Naive solution: Add wildcards for each possible parameter to the file paths. This gets ugly, hard to maintain, and hard to extend really quickly. Not a solution.
A nicer approach might be to name each run and have an individual config file for that name containing all the settings we need. Then, we only need a wildcard for such a named set of parameter settings. That would probably require reading some table or meta-config file and processing it. It doesn't solve the re-use issue, though. Also, it means we would need multiple config files for one snakemake call, and it seems that this is not possible (they would instead update each other, rather than being treated as individual configs to be run separately).
Somehow use sub-workflows, by specifying individual config files each time, e.g., via a wildcard. I am not sure this can be done (e.g., configfile: path/to/{config_name}.yaml), and it is still not a solution for file re-use.
Quick-and-dirty: Run all the rules up to the last output file that is shared between different configurations. Then, manually (or with some extra script) create directories with symlinks to this "base" run, each with an individual config file that specifies the parameters for that per-config run. This still necessitates calling snakemake individually for each of these directories, making cluster usage harder.
None of these solve all issues though. Any ideas appreciated!
Thanks in advance, all the best
Lucas

Snakemake now offers the Paramspace helper to solve this! https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html?highlight=parameter#parameter-space-exploration
I have not tried it yet, but it seems like the solution to the issue!
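From the documentation, a minimal sketch of how this could look (file names, column names, and rule names here are illustrative, not prescribed):

    from snakemake.utils import Paramspace
    import pandas as pd

    # one row per parameter setting, e.g. columns "alpha" and "beta"
    paramspace = Paramspace(pd.read_csv("params.tsv", sep="\t"))

    rule all:
        input:
            # one output per parameter instance,
            # e.g. results/alpha~1.0/beta~2.0/plot.pdf
            expand("results/{params}/plot.pdf", params=paramspace.instance_patterns)

    rule analyze:
        input:
            # shared input: computed once, re-used by all instances
            "preprocessed/data.txt"
        output:
            f"results/{paramspace.wildcard_pattern}/plot.pdf"
        params:
            # the concrete parameter setting for the current wildcard values
            setting=paramspace.instance
        script:
            "scripts/analyze.py"

Notably for the re-use question: any input that does not mention the wildcard pattern (like preprocessed/data.txt above) is produced once and shared across all parameter instances.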

Related

using require in layer.conf in yocto

Considering all the freedom that Yocto gives the developer, I have a question.
I would like to make this my_file.inc available only to recipes in one particular meta-layer. I know that, for instance, using the INHERIT keyword inside local.conf will make a my_class.bbclass file available globally to every recipe.
Is it a good practice to add this:
require my_file.inc
inside layer.conf? Or should I change my_file.inc into my_file.bbclass and add INHERIT += "my_file" to layer.conf?
Any other possibilities?
Even if it seems to work, neither of your approaches is technically completely correct. The key point is that all .conf files are parsed first, and everything they contain is globally visible throughout the whole build process. So if you add something through the layer.conf file, it is pulled in at an unexpected place, it is not limited to that layer only, and it might therefore cause breakage at other places.
While I do not have a really good and clean solution, maybe the following can help you:
You can make your custom recipes react to certain keywords in DISTRO_FEATURES or MACHINE_FEATURES. Then you can create a two-stage approach (see the sketch after this list):
Add the desired keyword in local.conf (or your MACHINE, or DISTRO, or whatever configuration)
Make the recipes react to it. If you need the mechanism in several places, it might be useful to factor it into a .bbclass that your layer brings along and that you pull in for the respective recipes.
This way the effect is properly contained.
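A minimal sketch of this two-stage approach, assuming a made-up feature name (the :append override syntax is for recent Yocto releases; older ones use _append):

    # stage 1 -- local.conf (or your DISTRO/MACHINE configuration):
    DISTRO_FEATURES:append = " my-custom-feature"

    # stage 2 -- a recipe in your layer reacts to the feature;
    # bb.utils.contains(var, value, when-true, when-false, d) picks one of the two strings
    SRC_URI += "${@bb.utils.contains('DISTRO_FEATURES', 'my-custom-feature', 'file://extra.cfg', '', d)}"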
Maybe section 5.1.3.2 of the Yocto Project documentation answers your question:
Avoid duplicating include files. Use append files (.bbappend) for each recipe that uses an include file. Or, if you are introducing a new recipe that requires the included file, use the path relative to the original layer directory to refer to the file. For example, use require recipes-core/package/file.inc instead of require file.inc. If you're finding you have to overlay the include file, it could indicate a deficiency in the include file in the layer to which it originally belongs. If this is the case, you should try to address that deficiency instead of overlaying the include file. For example, you could address this by getting the maintainer of the include file to add a variable or variables to make it easy to override the parts needing to be overridden.
So to avoid duplicate inclusion later, it would be better not to include your .inc file(s) this way.

CMake find_library: why are paths in PATHS searched after default paths?

The CMake documentation seems clear: unless NO_DEFAULT_PATH is specified, CMake first searches a bunch of default paths, and only if the library is not found does it look in the directories listed in PATHS. A few other NO_* options exist to exclude only some of CMake's default paths.
I know how to overcome this behavior. E.g., I can do what's suggested here, namely, do two searches: the first with NO_DEFAULT_PATH, and the second without it. If the first succeeds, the second will be skipped. Win.
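For reference, that two-search idiom looks roughly like this (FOO_LIB, foo, and FOO_ROOT are placeholder names):

    # first search: only the user-provided locations
    find_library(FOO_LIB foo PATHS "${FOO_ROOT}/lib" NO_DEFAULT_PATH)
    # second search: the default paths; this is a no-op if FOO_LIB was already found
    find_library(FOO_LIB foo)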
However, this question is not about how to achieve what I want (I know the hack). The question is: why do I need to do that? Why doesn't CMake look first in the provided paths? It seems logical to me to use those first: if the user is specifying some hints, perhaps I should use those first, and fall back on defaults if they don't work...
Is there a good reason I am missing for the current implementation? I would assume so; I doubt the folks at CMake do things without a good reason...
Edit: when I said 'the user is specifying some hints', I was not referring to the actual HINTS argument of find_library. It was more of a generic 'hints'. However, as suggested in a comment, HINTS is indeed scanned before system paths. My concern comes from cmake's documentation:
Search the paths specified by the HINTS option. These should be paths computed by system introspection, such as a hint provided by the location of another item already found. Hard-coded guesses should be specified with the PATHS option.
The fact that HINTS was not designed to receive hard-coded full paths makes me think it may backfire in the future, so I'm not sure it is the solution. But maybe it is?
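For completeness, the HINTS variant suggested in the comment would be (same placeholder names):

    # HINTS directories are searched before the system default paths
    find_library(FOO_LIB foo HINTS "${FOO_ROOT}/lib")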

Default memory request with possibility of override in a Snakefile?

I have a Snakefile with several rules and only a few need more than 1 GB/core to run on a cluster. The resources directive is great for this, but I can't find a way of setting a default value. I would prefer not having to write resources: mem_per_cpu = 1024 for every rule that doesn't need more than the default.
I realize that I could get what I want using __default__ in a cluster config file and overriding the mem_per_cpu value for specific rules. I hesitate to do this because the memory requirements are platform-independent, so I would prefer including them in the Snakefile itself. It would also prevent me from being able to specify local resource limits using the --resources command-line option.
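For reference, that cluster-config workaround would look roughly like this (heavy_rule is a hypothetical rule name):

    # cluster.yaml, passed via --cluster-config; fields are referenced in the
    # --cluster submission string, e.g. {cluster.mem_per_cpu}
    __default__:
        mem_per_cpu: 1024
    heavy_rule:
        mem_per_cpu: 8192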
Is there a simple solution with Snakemake that would help me here? Thanks!
I was reading the Snakemake changelog and came across this:
Add --default-resources flag, that allows to define default resources for jobs (e.g. mem_mb, disk_mb), see docs.
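In practice, this could look as follows (rule, file, and tool names are made up; mem_per_cpu is just an arbitrary resource name):

    # Snakefile: only rules that need more than the default declare resources
    rule heavy:
        input: "data/big.txt"
        output: "results/big.out"
        resources:
            mem_per_cpu=8192
        shell: "heavy_tool {input} > {output}"

    # invocation: every other rule gets the default
    # snakemake --default-resources mem_per_cpu=1024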

Should I merge .pbxproj files with git using merge=union?

I'm wondering whether the merge=union option in .gitattributes makes sense for .pbxproj files.
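Concretely, that would be this entry in .gitattributes:

    # use git's built-in union merge driver for Xcode project files
    *.pbxproj merge=union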
The manpage states for this option:
Run 3-way file level merge for text files, but take lines from both versions, instead of leaving conflict markers. This tends to leave the added lines in the resulting file in random order and the user should verify the result.
Normally, this should be fine for the 90% case of adding files to the project. Does anybody have experience with this?
Not a direct experience, but:
This SO question really advises against merging .pbxproj files.
The pbxproj file isn't really human mergable.
While it is plain ASCII text, it's a form of JSON. Essentially you want to treat it as a binary file.
(hence a gitignore solution)
Actually, Peter Hosey adds in the comment:
It's a property list, not JSON. Same ideas, different syntax.
Yet, according to this question:
The truth is that it's way more harmful to disallow merging of that .pbxproj file than it is helpful.
The .pbxproj file is simply JSON (similar to XML). From experience, just about the ONLY merge conflict you will ever get is if two people have added files at the same time. The solution in 99% of the merge conflict cases is to keep both sides of the merge.
So a merge 'union' (via a .gitattributes merge directive) makes sense, but do some tests to see if it does the same thing as the script mentioned in the last question.
See also this question for potential conflicts.
See the Wikipedia article on Property List.
I have been working with a large team lately and tried *.pbxproj merge=union, but ultimately had to remove it.
The problem was that braces would become misplaced on a regular basis, which made the files unreadable. It is true that it does work most of the time, but it fails maybe 1 out of 4 times.
We are back to using *.pbxproj -crlf -merge for now. This seems to be the best workable solution for us.

Haskell IO Testing

I've been trying to figure out if there is already an accepted method for testing file IO operations in Haskell, but I have yet to find any information that is useful for what I am trying to do.
I'm writing a small library that performs various file system operations (recursively traverse a directory and return a list of all files; sync multiple directories so that each directory contains the same files, using inodes as the equality test and hardlinks...) and I want to make sure that they actually work. The only way I can think of to test them is to create a temporary directory with a known structure and compare the results of the functions executed on this temporary directory with known results. The thing is, I would like to get as much test coverage as possible while still being mainly automated: I don't want to have to create the directory structure by hand.
I have searched Google and Hackage, but the packages that I have seen on Hackage do not use any testing -- maybe I just picked the wrong ones -- and anything I find on Google does not deal with IO testing.
Any help would be appreciated
Thanks, James
Maybe you can find a way to make this work for you.
EDIT:
the packages that I have seen on hackage do not use any testing
I have found a unit testing framework for Haskell on Hackage. Using this framework, maybe you could use assertions to verify that the files you require are present in the directories where you want them and that they correspond to their intended purpose.
HUnit is the usual library for IO-based tests. I don't know of a set of properties/combinators for file actions -- that would be useful.
There is no reason why your test code cannot create a temporary directory, and check its contents after running your impure code.
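A minimal sketch of that idea with HUnit plus the temporary package; listFilesRecursively stands in for whatever your library actually exports:

    import Test.HUnit
    import System.IO.Temp (withSystemTempDirectory)  -- "temporary" package
    import System.Directory (listDirectory)
    import System.FilePath ((</>))
    import Data.List (sort)

    -- Build a known layout in a throwaway directory, run the code under
    -- test, and assert on the result.
    testListsFiles :: Test
    testListsFiles = TestCase $
      withSystemTempDirectory "fs-test" $ \dir -> do
        writeFile (dir </> "a.txt") "hello"
        writeFile (dir </> "b.txt") "world"
        found <- listDirectory dir  -- replace with listFilesRecursively dir
        sort found @?= ["a.txt", "b.txt"]

    main :: IO ()
    main = do
      _ <- runTestTT testListsFiles
      return ()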
If you want mainly automated testing of monadic code, you might want to look into Monadic QuickCheck. You can write down properties that you think should be true, such as
If you create a file with read permission, it will be possible to open the file for reading.
If you remove a file, it won't open.
Whatever else you think of...
QuickCheck will then generate random tests.
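A sketch of how such a property might look with Test.QuickCheck.Monadic (simplified: it uses one fixed file name; generating arbitrary valid paths and permissions takes more care):

    import Test.QuickCheck (Property, quickCheck)
    import Test.QuickCheck.Monadic (monadicIO, run, assert)
    import System.IO.Temp (withSystemTempDirectory)
    import System.Directory (doesFileExist, removeFile)
    import System.FilePath ((</>))

    -- After writing a file it exists; after removing it, it does not.
    prop_createThenRemove :: String -> Property
    prop_createThenRemove contents = monadicIO $ do
      ok <- run $ withSystemTempDirectory "qc-test" $ \dir -> do
        let path = dir </> "file.txt"
        -- random contents generated by QuickCheck; a real test
        -- might constrain them to encodable characters
        writeFile path contents
        existed <- doesFileExist path
        removeFile path
        gone <- not <$> doesFileExist path
        return (existed && gone)
      assert ok

    main :: IO ()
    main = quickCheck prop_createThenRemove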