Wildcard constraints scope not used by modules - snakemake

I have a pipeline that combines multiple Snakemake pipelines using Snakemake modules, and I have a question about the scope of wildcard constraints. Let's say the pipelines I include themselves have wildcard constraint definitions:
wildcard_constraints:
    barcode="[A-Z+]+",
    chr="[^_]+",
Is it possible to override these already defined constraints? I have defined wildcard_constraints in my new pipeline, but they don't seem to propagate into the included modules.
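In case it helps anyone experimenting with this, below is a minimal, untested sketch of one thing to try: re-importing a rule from the module and overriding its constraints at the rule level via use rule ... with:. The module name mod, the rule name call_variants, and the path are all made up for illustration, and I am not certain that every Snakemake version accepts wildcard_constraints inside a with: block, although rule-level constraints normally take precedence over global ones.

# Snakefile of the combining pipeline (sketch, untested)
module mod:
    # hypothetical path to the included pipeline
    snakefile: "path/to/other/Snakefile"
    config: config

# Import all rules from the module unchanged ...
use rule * from mod

# ... then re-import one rule under the same name, overriding its
# constraints; the second declaration should replace the first.
# "call_variants" is a hypothetical rule name.
use rule call_variants from mod with:
    wildcard_constraints:
        barcode="[A-Z0-9+]+",
        chr="[a-zA-Z0-9]+",

This gets verbose if many rules need different constraints, but it keeps the override explicit in the combining Snakefile.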

Related

Override YAML element in GitLab CI

I'm currently working on a pipeline for some projects on GitLab. I tried to make use of YAML aliases and anchors to organize my files better and avoid having a huge pipeline file. However, I realized that I cannot use those constructs and must instead use GitLab's own !reference tag.
Even though that's better than nothing, I haven't found a way to override some of the properties of the referenced elements. Does anyone know how I can achieve this?

Configure rules in Detekt

I am adding Detekt to a new project, but I find that some rules are too strict.
How can I set my own thresholds for a few rules?
I don't want to use baseline files, because this is new code and we don't want some of these things to be considered code smells.
It's possible to configure Detekt via configuration files.
As per Detekt's documentation, with my emphasis:
detekt allows easily to just pick the rules you want and configure them the way you like. For example if you want to allow up to 20 functions inside a Kotlin file instead of the default threshold, write:
complexity:
  TooManyFunctions:
    thresholdInFiles: 20
You'll want to create a config folder in the root of your project and add detekt.yaml to it. This config can contain only the rules whose defaults you want to override, or all of the rules (spelling out the defaults as the values you consider correct).
For more information, check the official documentation on configuration.

Terraform - Are single resource modules always bad?

I decided to learn more about Terraform and see if I could replicate, with Terraform, what I had done manually in the console. I set up two VMs, one that was publicly accessible and one that was not and had to be accessed through the first VM. These two VMs are almost identical, apart from the firewall rules.
In the interest of being DRY, I thought I'd create a module, so that I don't have to repeat all the options for the two VMs and just specify the differences. Since I wasn't sure about how to create a module, I checked the documentation and found the following:
When to write a module
[...]
We do not recommend writing modules that are just thin wrappers around single other resource types. If you have trouble finding a name for your module that isn't the same as the main resource type inside it, that may be a sign that your module is not creating any new abstraction and so the module is adding unnecessary complexity. Just use the resource type directly in the calling module instead.
Source: https://www.terraform.io/docs/modules/index.html#when-to-write-a-module
It makes sense to me that publishing a module that is just a wrapper around a single resource may not be that useful, but for internal use in your configuration, it seems like a useful tool to make your configuration DRY. If 9 out of 10 arguments are the same for all of your VMs, why wouldn't you create a module to hide the 9 common arguments from the main configuration and not repeat them?
As I am new to Terraform, I just want to make sure that I am not teaching myself bad practices.

Snakemake: parameterized runs that re-use intermediate output files

Here is a complex problem that I am struggling to find a clean solution for:
Imagine having a Snakemake workflow with several rules that can be parameterized in some way. Now, we might want to test different parameter settings for some rules, to see how the results differ. However, ideally, if these rules depend on the output of other rules that are not parameterized, we want to re-use these non-changing files, instead of re-computing them for each of our parameter settings. Furthermore, if at all possible, all this should be optional, so that in the default case, a user does not see any of this.
There is inherent complexity in there (to specify which files are re-used, etc). I am also aware that this is not exactly the intended use case of Snakemake ("reproducible workflows"), but is more of a meta-feature for experimentation.
Here are some approaches:
Naive solution: Add wildcards for each possible parameter to the file paths. This gets ugly, hard to maintain, and hard to extend really quickly. Not a solution.
A nice approach might be to name each run, and have an individual config file for that name which contains all the settings we need. Then we only need a wildcard for such a named set of parameter settings. That would probably require reading some table or meta-config file and processing it. It doesn't solve the re-use issue, though. Also, it means we need multiple config files for one snakemake call, and it seems that this is not possible (they would instead update each other, rather than being treated as individual configs to be run separately).
Somehow use sub-workflows, by specifying individual config files each time, e.g., via a wildcard. Not sure that this can be done (e.g., configfile: path/to/{config_name}.yaml). Still not a solution for re-using files.
Quick-and-dirty: Run all the rules up to the last output file that is shared between different configurations. Then, manually (or with some extra script) create directories with symlinks to this "base" run, each with an individual config file that specifies the parameters for that per-config run. This still requires calling snakemake individually for each of these directories, making cluster usage harder.
None of these solve all issues though. Any ideas appreciated!
Thanks in advance, all the best
Lucas
Snakemake now offers the Paramspace helper to solve this! https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html?highlight=parameter#parameter-space-exploration
I have not tried it yet, but it seems like the solution to the issue!
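For anyone landing here, below is a rough sketch of what the Paramspace approach might look like, loosely following the linked documentation. The file names (params.tsv, intermediate/..., results/...), column names, and rule names are invented for illustration, and the snippet is untested. The point relevant to the original question is that rules whose outputs do not depend on the parameters keep parameter-free paths, so they are computed once and re-used by every parameter instance.

from snakemake.utils import Paramspace
import pandas as pd

# One row per parameter combination, one column per parameter
# (e.g. columns "alpha" and "beta"); params.tsv is a made-up name.
paramspace = Paramspace(pd.read_csv("params.tsv", sep="\t"))

rule all:
    input:
        # one result per parameter combination, e.g.
        # results/alpha~0.1/beta~2/result.txt
        expand("results/{params}/result.txt",
               params=paramspace.instance_patterns)

# No parameters involved here, so the output path has no parameter
# wildcards and the file is re-used by every parameter instance.
rule preprocess:
    output:
        "intermediate/base_data.txt"
    shell:
        "echo base > {output}"

rule simulate:
    input:
        "intermediate/base_data.txt"
    # wildcard_pattern expands to something like "alpha~{alpha}/beta~{beta}"
    output:
        f"results/{paramspace.wildcard_pattern}/result.txt"
    # paramspace.instance supplies this job's concrete parameter values
    params:
        simulation=paramspace.instance
    # placeholder command; a real rule would use params.simulation
    shell:
        "echo simulated > {output}"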

Default memory request with possibility of override in a Snakefile?

I have a Snakefile with several rules and only a few need more than 1 GB/core to run on a cluster. The resources directive is great for this, but I can't find a way of setting a default value. I would prefer not having to write resources: mem_per_cpu = 1024 for every rule that doesn't need more than the default.
I realize that I could get what I want using __default__ in a cluster config file and overriding the mem_per_cpu value for specific rules. I hesitate to do this because the memory requirements are platform-independent, so I would prefer including them in the Snakefile itself. It would also prevent me from being able to specify local resource limits using the --resources command-line option.
Is there a simple solution with Snakemake that would help me here? Thanks!
I was reading the Snakemake changelog and came across this:
Add --default-resources flag, that allows to define default resources for jobs (e.g. mem_mb, disk_mb), see docs.
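To make that concrete, here is a small sketch of how this could be combined with per-rule overrides. The rule names and values are made up; --default-resources takes key=value pairs for standard resources such as mem_mb, and whether a custom name like mem_per_cpu is accepted there is an assumption on my part that may depend on the Snakemake version.

# Invocation (sketch): set a default for every rule on the command line,
#   snakemake --default-resources mem_per_cpu=1024 ...
# (mem_per_cpu as a default-resource key is an assumption; mem_mb is the
# documented standard resource.)

rule light_task:
    output:
        "results/light.txt"
    # no resources declared, so it inherits the command-line default
    shell:
        "echo light > {output}"

rule heavy_task:
    output:
        "results/heavy.txt"
    # overrides the command-line default for this rule only
    resources:
        mem_per_cpu=8192
    shell:
        "echo heavy > {output}"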