Default memory request with possibility of override in a Snakefile? - snakemake

I have a Snakefile with several rules and only a few need more than 1 GB/core to run on a cluster. The resources directive is great for this, but I can't find a way of setting a default value. I would prefer not having to write resources: mem_per_cpu = 1024 for every rule that doesn't need more than the default.
I realize that I could get what I want using __default__ in a cluster config file and overriding the mem_per_cpu value for specific rules. I hesitate to do this because the memory requirements are platform-independent, so I would prefer including them in the Snakefile itself. It would also prevent me from being able to specify local resource limits using the --resources command-line option.
Is there a simple solution with Snakemake that would help me here? Thanks!

While reading the Snakemake changelog, I came across this:
Add --default-resources flag, that allows to define default resources
for jobs (e.g. mem_mb, disk_mb), see docs.
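On the command line this would look something like `snakemake --default-resources mem_mb=1024`, with rules that need more memory still declaring their own `resources:` entry. The fallback behaviour can be pictured with the following Python sketch. It is only a conceptual illustration, not Snakemake's implementation; the rule names and memory values are hypothetical.

```python
# Conceptual sketch only: this is NOT Snakemake's code.
# Defaults as they might be given via: --default-resources mem_mb=1024
DEFAULT_RESOURCES = {"mem_mb": 1024}

# Hypothetical rules: only the ones that need more memory declare resources.
RULE_RESOURCES = {
    "align_reads": {"mem_mb": 8192},  # overrides the default
    "count_lines": {},                # falls back to the default
}


def effective_resources(rule_name: str) -> dict:
    """Return the defaults, updated with the rule's own resource values."""
    return {**DEFAULT_RESOURCES, **RULE_RESOURCES.get(rule_name, {})}


for name in RULE_RESOURCES:
    print(name, effective_resources(name))
```

Because the per-rule values are merged last, a rule only has to mention a resource when it differs from the default, which is the behaviour asked for in the question.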

Related

Using a reference genome not on ensembl for Snakemake?

I want to process and analyze RNA-seq data for a species that does not have a genome on Ensembl. My lab has an assembly and annotation of it, which is being processed by GenBank but is not public yet. Is there a way to use my own reference genome in Snakemake instead of pointing to one from Ensembl?
I'm using this snakemake workflow:
https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth
And wanting to change the reference genome input for the config.yaml file at this point:
ref:
  # ensembl species name
  species: homo_sapiens
  # ensembl release version
  release: "104"
  # genome build
  build: GRCh38
  # pfam release to use for annotation of domains in differential splicing analysis
  pfam: "33.0"
  # Choose strategy for selecting representative transcripts for each gene.
  # Possible values:
  #   - canonical (use the canonical transcript from ensembl, only works for human at the moment)
  #   - mostsignificant (use the most significant transcript)
  #   - path/to/any/file.txt (a path to a file with ensembl transcript IDs to use;
  #     the user has to ensure that there is only one ID per gene given)
  representative_transcripts: canonical
ontology:
  # gene ontology to download, used e.g. in goatools
  gene_ontology: "http://current.geneontology.org/ontology/go-basic.obo"
This outline may help you see how to adapt the workflow to provide your own resources:
Effectively, you'll complete up to step #3 as written in the Usage. Then, in addition to the config edits discussed in the Usage, you'll want to edit the files found in workflow, and add to the resources directory equivalents of all the files that would normally be fetched from Ensembl for the parts of the workflow you care about. Keep in mind that you may wish to skip some parts if you don't have the necessary equivalent input files.
To make it more reproducible, you'll probably want to fork the repo and make the edits in your fork, i.e., upstream of what you'll deploy. Then, in the snakedeploy step, use your fork as the URL instead of the main one. That should get you the rules with your edits. You could even alter the files in config so you have a record of those, too. The idea with the fork is that you have one place that shows all the changes you made, and perhaps even the version of the config files you use. But that is just an option to consider to make it easier to track everything.
No matter how you go about it, you'll want to change what is in workflow/rules/ref.smk (code on the main repo) so that it does not run for the files you are going to place into resources by hand instead of having them fetched from Ensembl. So just remove those rules. I don't know enough about this to know whether that means all the rules in ref.smk or not. If it does mean all the rules, you can make things easier and just comment out the line include: "rules/ref.smk" in workflow/Snakefile (code in the main repo). As a by-product of these changes, keep in mind that whatever you put in the corresponding portions of the ref section in config.yaml becomes moot.
Keep in mind that the representative_transcripts setting already accepts a path, so you can use that part as-is under the resources: & ref: section in config.yaml.
If things aren't apparent and you want to be cautious, you can do this adaptation gradually. First try to get the workflow running as-is, then substitute in one piece at a time. It will certainly break downstream steps as you do it, but it lets you see more clearly what is happening with the parts you are adapting as you go along.

Configure rules in Detekt

I am adding Detekt to a new project.
But, I find that some rules are too strict.
How can I implement my own thresholds for a few rules?
I don't want to use Baseline files, because this is new code and we don't want to consider some things code smells.
It's possible to configure Detekt via configuration files.
As per Detekt's documentation, with my emphasis:
detekt allows easily to just pick the rules you want and configure them the way you like. For example if you want to allow up to 20 functions inside a Kotlin file instead of the default threshold, write:
complexity:
  TooManyFunctions:
    thresholdInFiles: 20
You'll want to create a config folder in the root of your project, and add detekt.yaml to it. This file can contain either only the rules whose defaults you want to override, or all the rules (which makes the default values explicit).
For more information, check the official documentation on configuration

Snakemake: parameterized runs that re-use intermediate output files

Here is a complex problem that I am struggling to find a clean solution for:
Imagine having a Snakemake workflow with several rules that can be parameterized in some way. Now, we might want to test different parameter settings for some rules, to see how the results differ. However, ideally, if these rules depend on the output of other rules that are not parameterized, we want to re-use these non-changing files, instead of re-computing them for each of our parameter settings. Furthermore, if at all possible, all this should be optional, so that in the default case, a user does not see any of this.
There is inherent complexity in there (to specify which files are re-used, etc). I am also aware that this is not exactly the intended use case of Snakemake ("reproducible workflows"), but is more of a meta-feature for experimentation.
Here are some approaches:
Naive solution: Add wildcards for each possible parameter to the file paths. This gets ugly, hard to maintain, and hard to extend really quickly. Not a solution.
A nice approach might be to name each run, and have an individual config file for that name which contains all the settings we need. Then, we only need a wildcard for such a named set of parameter settings. That would probably require reading and processing some table or meta-config file. That doesn't solve the re-use issue though. Also, it means we need multiple config files for one snakemake call, and it seems that this is not possible (they would update each other instead of being treated as individual configs to be run separately).
Somehow use sub-workflows, by specifying individual config files each time, e.g., via a wildcard. Not sure that this can be done (e.g., configfile: path/to/{config_name}.yaml). Still not a solution for file re-using.
Quick-and-dirty: Run all the rules up to the last output file that is shared between different configurations. Then, manually (or with some extra script) create directories with symlinks to this "base" run, with individual config files that specify the parameters for the per-config runs. This still requires calling snakemake individually for each of these directories, making cluster usage harder.
None of these solve all issues though. Any ideas appreciated!
Thanks in advance, all the best
Lucas
Snakemake now offers the Paramspace helper to solve this! https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html?highlight=parameter#parameter-space-exploration
I have not tried it yet, but it seems like the solution to the issue!
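As far as I understand the linked docs, the helper is `snakemake.utils.Paramspace`: it is built from a table of parameter combinations and exposes a `wildcard_pattern` plus a list of `instance_patterns` that can be used in output paths. The sketch below only emulates that naming scheme with the standard library, to show why it helps with re-use. It is not the Snakemake API; the function names, the parameter table and the results/ paths are hypothetical.

```python
import csv
import io

# Hypothetical parameter table; in practice this would be a TSV file on disk.
PARAMS_TSV = "alpha\tbeta\n1.0\t0.1\n2.0\t0.0\n"


def read_rows(tsv_text):
    """Parse the table into one dict per parameter combination."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def wildcard_pattern(columns):
    """Build a path pattern with one wildcard per parameter column."""
    return "/".join(f"{c}~{{{c}}}" for c in columns)


def instance_patterns(rows):
    """Fill the pattern in for every row of the table."""
    return ["/".join(f"{k}~{v}" for k, v in row.items()) for row in rows]


rows = read_rows(PARAMS_TSV)
pattern = wildcard_pattern(rows[0].keys())

# Only parameterized outputs carry the pattern in their path. Shared upstream
# outputs (e.g. a hypothetical results/preprocessed.tsv) have no such wildcard,
# so they are produced once and reused by every parameter combination.
targets = [f"results/simulations/{p}.tsv" for p in instance_patterns(rows)]

print(pattern)
for target in targets:
    print(target)
```

The point of the design is that a single wildcard pattern stands in for the whole parameter set, so adding a parameter means adding a column to the table rather than another wildcard to every path.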

How can you specify local path of InputPath or OutputPath in Kubeflow Pipelines

I've started using Kubeflow Pipelines to run data processing, training and predicting for a machine learning project, and I'm using InputPath and OutputPath to pass large files between components.
I'd like to know whether it's possible to set the path where OutputPath looks for a file in a component, and the path where InputPath places a file in a component, and if so, how.
Currently, the code stores files in a pre-determined place (e.g. data/my_data.csv), and it would be ideal if I could 'tell' InputPath/OutputPath which file it should copy, instead of having to rename all the files to match what OutputPath expects, as in the minimal example below.
# KFP v1 SDK imports
from kfp import dsl
from kfp.components import OutputPath, create_component_from_func

@dsl.pipeline(name='test_pipeline')
def pipeline():
    pp = create_component_from_func(func=pre_process_data)()
    # use pp.outputs['pre_processed']...

def pre_process_data(pre_processed_path: OutputPath('csv')):
    import os
    print('do some processing which saves file to data/pre_processed.csv')
    # want to avoid this:
    print('move files to OutputPath locations...')
    os.rename('data/pre_processed.csv', pre_processed_path)
Naturally I would prefer not to update the code to adhere to Kubeflow pipeline naming convention, as that seems like very bad practice to me.
Thanks!
Update - See ark-kun's comment, the approach in my original answer is deprecated and should not be used. It is better to let Kubeflow Pipelines specify where you should store your pipeline's artifacts.
For lightweight components (such as the one in your example), Kubeflow Pipelines builds the container image for your component and specifies the paths for inputs and outputs (based upon the types you use to decorate your component function). I would recommend using those paths directly, instead of writing to one location and then renaming the file. The Kubeflow Pipelines samples follow this pattern.
For reusable components, you define the pipeline inputs and outputs as part of the YAML specification for the component. In that case you can specify your preferred location for the output files. That being said, reusable components take a bit more effort to create, since you need to build a Docker container image and component specification in YAML.
This is not supported by the system.
Components should use the system-provided paths.
This is important, because on some execution engines the data is mounted to those paths. And sometimes these paths have certain restrictions or might even be unchangeable. So the system must have the freedom to choose the paths.
Usually, good programs do not hard-code any absolute paths inside their code, but rather receive the paths from the command line.
In any case, it's pretty easy to copy the files from or to the system-provided paths (as you already do in the code).
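One way to follow this advice without renaming files after the fact is to let the existing processing code receive its output location as an argument, so the component can simply pass the system-provided path through. A minimal sketch, assuming the processing step can be wrapped in a function: `run_processing` is a hypothetical stand-in for the real code, and in an actual component the `pre_processed_path` parameter would carry the `OutputPath('csv')` annotation from the question (left out here so the sketch runs without Kubeflow installed).

```python
import csv
import os


def run_processing(output_csv: str) -> None:
    """Hypothetical stand-in for the existing processing code.

    It writes to whatever path it is given instead of a hard-coded
    data/pre_processed.csv.
    """
    parent = os.path.dirname(output_csv)
    if parent:
        # The target directory may not exist yet, so create it if needed.
        os.makedirs(parent, exist_ok=True)
    with open(output_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["feature", "value"])
        writer.writerow(["example", 1])


def pre_process_data(pre_processed_path: str) -> None:
    """Component body: hand the system-provided path to the processing code."""
    run_processing(pre_processed_path)
```

Outside the pipeline the same code can still write to its usual location by calling `run_processing('data/pre_processed.csv')`, so the processing logic does not need to know anything about Kubeflow's naming.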

What is Lithium's equivalent to CakePHP's Configure::load() and Configure::read()?

I'd like to store configuration data in separate files and load it/read it using the proper Lithium way.
Depends on what it's for. We pretty strongly discourage throwing around global configuration unless it's managed carefully.
If it's related to connecting to some kind of external system, I'd suggest you take a look at the Connections, Cache, Session, Auth or Logger classes. Take a look here for more info: http://li3.me/docs/lithium/core/Adaptable
If your configuration doesn't fall into any of those categories and is related to general site operations, take a look at the Environment class: http://li3.me/docs/lithium/core/Environment. It doesn't have any specific methods to load from files, but it works with plain arrays, so if you have a config file that returns an array, you can pass it the value of include "foo.php" as a parameter.
If you go this route though, be sure that you carefully manage your configuration and don't change it once you've written it. Poor management of this kind of global state is the #1 cause of software bugs.