Using a reference genome not on Ensembl for Snakemake?

I want to process and analyze RNA-seq data for a species that does not have a genome on Ensembl. My lab has an assembly and annotation for it, which is being processed by GenBank but is not public yet. Is there a way to use my own reference genome in Snakemake instead of pointing to one from Ensembl?
I'm using this Snakemake workflow:
https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth
and I want to change the reference genome input in the config.yaml file at this point:
ref:
  # ensembl species name
  species: homo_sapiens
  # ensembl release version
  release: "104"
  # genome build
  build: GRCh38
  # pfam release to use for annotation of domains in differential splicing analysis
  pfam: "33.0"
  # Choose strategy for selecting representative transcripts for each gene.
  # Possible values:
  #   - canonical (use the canonical transcript from ensembl, only works for human at the moment)
  #   - mostsignificant (use the most significant transcript)
  #   - path/to/any/file.txt (a path to a file with ensembl transcript IDs to use;
  #     the user has to ensure that there is only one ID per gene given)
  representative_transcripts: canonical

ontology:
  # gene ontology to download, used e.g. in goatools
  gene_ontology: "http://current.geneontology.org/ontology/go-basic.obo"

This outline may help you see how to adapt the workflow to provide your own resources:
Effectively, you'll complete up to step #3 of the Usage instructions as written. Then, in addition to the config edits discussed in the Usage, you'll want to make edits in the files found under workflow, and add to the resources directory local equivalents of all the files that would normally be fetched from Ensembl for the parts of the workflow you care about. Keep in mind that you may wish to skip some parts if you don't have the equivalent input files for them.
To make it more reproducible, you'll probably want to fork the repo and make the edits in your fork, i.e., upstream of what you'll deploy, and then use your fork's URL in the snakedeploy step instead of the main one. That should get you the rules with your edits. You could even commit your config changes there as well, so you have a record of them. The idea with the fork is that you have one place that documents all the changes you made, and perhaps even the exact version of the config files you used. That is just an option to consider to make everything easier to track.
No matter how you go about it, you'll want to change workflow/rules/ref.smk (code in the main repo) so that the download rules no longer run for the files you place into resources by hand instead of having them obtained from Ensembl; in other words, remove those rules. I don't know enough about this workflow to say whether that means all the rules in ref.smk. If it does, you can make things easier and just comment out the line include: "rules/ref.smk" in workflow/Snakefile (code in the main repo). As a by-product of these changes, keep in mind that whatever you put in the corresponding parts of the ref section of config.yaml becomes moot.
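If only some of the download rules need replacing, one option is to swap them for rules that simply copy your local files into the paths the rest of the workflow expects. A rough sketch (the rule names, local file names, and output paths below are hypothetical; match them to what ref.smk actually declares):

# Hypothetical replacements for the Ensembl download rules: provide the
# locally curated assembly and annotation under the output paths that the
# downstream rules expect (check workflow/rules/ref.smk for the real paths).
rule get_genome:
    input:
        "resources/local/my_species.assembly.fasta"
    output:
        "resources/genome.fasta"
    shell:
        "cp {input} {output}"

rule get_annotation:
    input:
        "resources/local/my_species.annotation.gff3"
    output:
        "resources/annotation.gff3"
    shell:
        "cp {input} {output}"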
Keep in mind that the representative_transcripts setting already has a way to provide a path, so you can use that part as-is under the resources: and ref: sections in config.yaml.
If things aren't apparent and you want to be cautious, you can do this adaptation gradually: first try to get the workflow working as-is, then substitute in one piece at a time. Downstream steps will certainly break as you do it, but it lets you see more clearly what is happening with each part you adapt as you go along.

Related

Snakemake: parameterized runs that re-use intermediate output files

Here is a complex problem that I am struggling to find a clean solution for:
Imagine having a Snakemake workflow with several rules that can be parameterized in some way. Now, we might want to test different parameter settings for some rules, to see how the results differ. However, ideally, if these rules depend on the output of other rules that are not parameterized, we want to re-use these non-changing files, instead of re-computing them for each of our parameter settings. Furthermore, if at all possible, all this should be optional, so that in the default case, a user does not see any of this.
There is inherent complexity in there (to specify which files are re-used, etc). I am also aware that this is not exactly the intended use case of Snakemake ("reproducible workflows"), but is more of a meta-feature for experimentation.
Here are some approaches:
Naive solution: Add wildcards for each possible parameter to the file paths. This gets ugly, hard to maintain, and hard to extend really quickly. Not a solution.
A nice approach might be to name each run and have an individual config file for that name which contains all the settings we need. Then we only need a wildcard for such a named set of parameter settings. That would probably require reading some table or meta-config file and processing it. It doesn't solve the re-use issue, though. It also means we would need multiple config files for one snakemake call, and that does not seem to be possible (they would update each other instead of being treated as individual configs to run separately).
Somehow use sub-workflows by specifying individual config files each time, e.g., via a wildcard. I am not sure this can be done (e.g., configfile: path/to/{config_name}.yaml). Still not a solution for file re-use.
Quick and dirty: run all the rules up to the last output file that is shared between the different configurations. Then, manually (or with some extra script) create directories with symlinks to this "base" run, each with an individual config file that specifies the parameters for that per-config run. This still requires calling snakemake separately for each of these directories, which makes cluster usage harder.
None of these solve all issues though. Any ideas appreciated!
Thanks in advance, all the best
Lucas
Snakemake now offers the Paramspace helper to solve this! https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html?highlight=parameter#parameter-space-exploration
I have not tried it yet, but it seems like the solution to the issue!
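For reference, a minimal sketch of how Paramspace is used, closely following the linked docs (the parameter table, file paths, and script name are made up):

import pandas as pd
from snakemake.utils import Paramspace

# params.tsv: one column per parameter (e.g. alpha, beta), one row per setting
paramspace = Paramspace(pd.read_csv("params.tsv", sep="\t"))

rule all:
    input:
        # one result per row of the table, e.g. results/alpha~0.1/beta~2/result.txt
        expand("results/{params}/result.txt", params=paramspace.instance_patterns)

rule simulate:
    input:
        # shared, non-parameterized input is computed once and reused
        "preprocessed/data.txt"
    output:
        f"results/{paramspace.wildcard_pattern}/result.txt"
    params:
        # translate this job's wildcards back into concrete parameter values
        simulation=paramspace.instance
    script:
        "scripts/simulate.py"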

How can you specify local path of InputPath or OutputPath in Kubeflow Pipelines

I've started using Kubeflow Pipelines to run data processing, training and prediction for a machine learning project, and I'm using InputPath and OutputPath to pass large files between components.
I'd like to know how, if it's possible, to set the path where OutputPath looks for a file in a component, and where InputPath loads a file from in a component.
Currently the code stores files in a pre-determined place (e.g. data/my_data.csv), and it would be ideal if I could 'tell' InputPath/OutputPath that this is the file it should copy, instead of having to rename all the files to match what OutputPath expects, as in the minimal example below.
from kfp import dsl
from kfp.components import create_component_from_func, OutputPath

@dsl.pipeline(name='test_pipeline')
def pipeline():
    pp = create_component_from_func(func=pre_process_data)()
    # use pp.outputs['pre_processed']...

def pre_process_data(pre_processed_path: OutputPath('csv')):
    import os
    print('do some processing which saves a file to data/pre_processed.csv')
    # want to avoid this:
    print('move files to OutputPath locations...')
    os.rename('data/pre_processed.csv', pre_processed_path)
Naturally, I would prefer not to update the code to adhere to the Kubeflow Pipelines naming convention, as that seems like very bad practice to me.
Thanks!
Update - See ark-kun's comment, the approach in my original answer is deprecated and should not be used. It is better to let Kubeflow Pipelines specify where you should store your pipeline's artifacts.
For lightweight components (such as the one in your example), Kubeflow Pipelines builds the container image for your component and specifies the paths for inputs and outputs (based upon the types you use to decorate your component function). I would recommend using those paths directly, instead of writing to one location and then renaming the file. The Kubeflow Pipelines samples follow this pattern.
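As an illustration, here is a minimal sketch of that pattern with the kfp v1 SDK (pandas just stands in for whatever actually produces the CSV):

from kfp.components import create_component_from_func, OutputPath

def pre_process_data(pre_processed_path: OutputPath('csv')):
    # write straight to the system-provided path instead of writing to
    # data/pre_processed.csv and renaming afterwards
    import pandas as pd
    df = pd.DataFrame({'a': [1, 2, 3]})  # stand-in for the real processing
    df.to_csv(pre_processed_path, index=False)

pre_process_data_op = create_component_from_func(
    pre_process_data,
    base_image='python:3.9',
    packages_to_install=['pandas'],
)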
For reusable components, you define the pipeline inputs and outputs as part of the YAML specification for the component. In that case you can specify your preferred location for the output files. That being said, reusable components take a bit more effort to create, since you need to build a Docker container image and component specification in YAML.
This is not supported by the system.
Components should use the system-provided paths.
This is important, because on some execution engines the data is mounted to those paths. And sometimes these paths have certain restrictions or might even be unchangeable. So the system must have the freedom to choose the paths.
Usually, good programs do not hard-code any absolute paths inside their code, but rather receive the paths from the command line.
In any case, it's pretty easy to copy the files from or to the system-provided paths (as you already do in the code).

using require in layer.conf in yocto

Considering all the freedom that Yocto gives the developer, I have a question.
I would like to make this my_file.inc available only to recipes in one particular meta-layer. I know that, for instance, using the INHERIT keyword inside local.conf will make a my_class.bbclass file available globally to every recipe.
Is it a good practice to add this:
require my_file.inc
inside layer.conf? Or should I change my_file.inc to the my_file.bbclass, and, add INHERIT = "my_file.bbclass" to the layer.conf?
Any other possibilities?
Even if it seems to work, neither of your approaches is technically completely correct. The key point is that all .conf files are parsed first, and everything they contain is globally visible throughout the whole build process. So even if something you add through the layer.conf file is not being pulled in from an unexpected place, it is still not limited to that layer only and might therefore cause breakage in other places.
While I do not have a really good and clean solution, maybe the following can help you:
You can make your custom recipes react to certain keywords in DISTRO_FEATURES or MACHINE_FEATURES. Then you can create a two-stage approach:
Add the desired keyword in local.conf (or your MACHINE or DISTRO configuration, or wherever fits).
Make the recipes react to it. If you need the mechanism in several places, it might be useful to put it into a .bbclass that your layer brings along and that you pull in for the respective recipes (see the sketch below).
This way the effect is properly contained.
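A rough sketch of that two-stage idea (the feature name, class name, and appended variable are made up, and the exact append syntax depends on your Yocto release):

# Stage 1 -- local.conf (or your DISTRO/MACHINE configuration):
DISTRO_FEATURES:append = " my-feature"

# Stage 2 -- classes/my-feature-support.bbclass in your layer, inherited
# only by the recipes that need it:
python () {
    if bb.utils.contains('DISTRO_FEATURES', 'my-feature', True, False, d):
        d.appendVar('DEPENDS', ' some-extra-dependency')
}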
Maybe section 5.1.3.2 of the Yocto Project documentation answers your question:
Avoid duplicating include files. Use append files (.bbappend) for each recipe that uses an include file. Or, if you are introducing a new recipe that requires the included file, use the path relative to the original layer directory to refer to the file. For example, use require recipes-core/package/file.inc instead of require file.inc. If you're finding you have to overlay the include file, it could indicate a deficiency in the include file in the layer to which it originally belongs. If this is the case, you should try to address that deficiency instead of overlaying the include file. For example, you could address this by getting the maintainer of the include file to add a variable or variables to make it easy to override the parts needing to be overridden.
So to avoid duplicate inclusion later, it would be better not to include your .inc file(s) this way.

Default memory request with possibility of override in a Snakefile?

I have a Snakefile with several rules and only a few need more than 1 GB/core to run on a cluster. The resources directive is great for this, but I can't find a way of setting a default value. I would prefer not having to write resources: mem_per_cpu = 1024 for every rule that doesn't need more than the default.
I realize that I could get what I want using __default__ in a cluster config file and overriding the mem_per_cpu value for specific rules. I hesitate to do this because the memory requirements are platform-independent, so I would prefer including them in the Snakefile itself. It would also prevent me from being able to specify local resource limits using the --resources command-line option.
Is there a simple solution with Snakemake that would help me here? Thanks!
I was reading the Snakemake changelog and came across this:
Add --default-resources flag, that allows to define default resources for jobs (e.g. mem_mb, disk_mb), see docs.
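For example, a minimal sketch of how this might look (the rule and its command are made up; the resource name matches the one in the question):

# Only rules that need more than the default declare resources explicitly:
rule heavy_step:
    input:
        "data/reads.fastq"
    output:
        "results/heavy.out"
    resources:
        mem_per_cpu=8192  # MB per core, overrides the default
    shell:
        "run_heavy_tool {input} > {output}"

# Every other rule falls back to the value given on the command line:
#   snakemake --default-resources mem_per_cpu=1024 --cores 16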

Apache giraph process graph with a custom algorithm

I have a custom algorithm for processing a graph which accepts a txt file as input. Because it is a large-scale graph, I want to implement it in the Apache Giraph framework. I've done a lot of research but I am still not sure whether I am on the right path.
I am reading a .csv file which contains the graph data, converting it to a txt file with a parser, and uploading it to Hadoop's HDFS file system.
I have read the SimpleShortestPathsVertex example from the Apache Giraph quick start guide, and I can see that it processes the data from a file in HDFS using the jar-with-dependencies jar file.
My problem is that I haven't yet understood how to add my algorithm to the Apache Giraph framework and start processing the graph. Can I add my algorithm to the framework using Eclipse and modify it from there, or is there another way?
Thank you!
Have a look here:
https://cwiki.apache.org/confluence/display/GIRAPH/Shortest+Paths+Example
Were you able to run this example?
If yes:
Familiarize yourself with the different Writable formats of Hadoop! Otherwise it will be hard to apply them to your algorithm.
All computation concerning the graph is done in the compute() function.
(If you're more advanced, have a look at the WorkerContext, preSuperstep(), and Aggregators!)
You can change the example, but as soon as you use other data types you have to change your VertexReader and VertexWriter.
If you have a specific algorithm in mind, work out what you need for the computation and specify the layout of your input file. Then adapt your VertexReader and -Writer, and finally start implementing your compute() function!
Of course you can use Eclipse! Simply reference the Giraph jar (for me it is "giraph-0.1-jar-with-dependencies.jar") and start coding.
All you need is an instance of each of these files, specific to your algorithm:
YourGiraphJob (the file starting the Hadoop/Giraph job)
YourVertex (Specifies your compute() function executed on each Vertex)
YourInputFormat (Specifying the Writable formats of YourReader)
YourOutputFormat (Specifying the Writable formats of YourWriter)
YourReader (Specifies how your input file is transformed, e.g. how a Vertex can be initialized from each line using the given information)
YourWriter (Specifies how your outputFile is generated from the vertices)
(optionally, a WorkerContext if you want to use Aggregators)
Simply check out http://giraph.apache.org/source-repository.html using Eclipse and you should have the code, including an example application you can toy around with!