I'm currently working on a pipeline for some projects on GitLab. I tried to make use of YAML aliases and anchors to organize my files better and avoid having a huge pipeline file. Still, I realized that I cannot use those artifacts, but instead I must use the !reference tag from GitLab itself.
Even though it's better than nothing, I haven't found a way to override some of the properties of the referenced elements. Does anyone know how I can achieve this?
Here is a complex problem that I am struggling to find a clean solution for:
Imagine having a Snakemake workflow with several rules that can be parameterized in some way. Now, we might want to test different parameter settings for some rules, to see how the results differ. However, ideally, if these rules depend on the output of other rules that are not parameterized, we want to re-use these non-changing files, instead of re-computing them for each of our parameter settings. Furthermore, if at all possible, all this should be optional, so that in the default case, a user does not see any of this.
There is inherent complexity in there (to specify which files are re-used, etc). I am also aware that this is not exactly the intended use case of Snakemake ("reproducible workflows"), but is more of a meta-feature for experimentation.
Here are some approaches:
Naive solution: Add wildcards for each possible parameter to the file paths. This gets ugly, hard to maintain, and hard to extend really quickly. Not a solution.
A nice approach might be to name each run, and have an individual config file for that name which contains all the settings that we need. Then, we only need a wildcard for such a named set of parameter settings. That would probably require reading some table or meta-config file and processing it. That doesn't solve the re-use issue though. Also, that means we need multiple config files for one snakemake call, and it seems that this is not possible (they would instead update each other, rather than being treated as individual configs to be run separately).
Somehow use sub-workflows, by specifying individual config files each time, e.g., via a wildcard. Not sure that this can be done (e.g., configfile: path/to/{config_name}.yaml). Still not a solution for file re-using.
Quick-and-dirty: Run all the rules up to the last output file that is shared between different configurations. Then, manually (or with some extra script) create directories with symlinks to this "base" run, with individual config files that specify the parameters for the per-config runs. This still requires calling snakemake individually for each of these directories, making cluster usage harder.
None of these solve all issues though. Any ideas appreciated!
Thanks in advance, all the best
Lucas
Snakemake now offers the Paramspace helper to solve this! https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html?highlight=parameter#parameter-space-exploration
I have not tried it yet, but it seems like the solution to the issue!
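For readers who have not seen it, the core idea can be sketched in plain Python: every row of a parameter table is encoded as one path fragment, so a single wildcard pattern covers all settings, while rules whose outputs do not contain that pattern are computed once and shared. This is only an illustration of the idea, not Snakemake's implementation. The function names mirror the `wildcard_pattern` and `instance_patterns` attributes that Paramspace exposes, and the parameter names and values are made up.

```python
from typing import Dict, List

# Hypothetical parameter table: one dict per parameter setting to explore.
PARAM_TABLE: List[Dict[str, object]] = [
    {"alpha": 1.0, "beta": 0.1},
    {"alpha": 2.0, "beta": 0.0},
]


def wildcard_pattern(table: List[Dict[str, object]]) -> str:
    """Build one path pattern with a wildcard per parameter column."""
    keys = list(table[0])
    return "/".join(f"{key}~{{{key}}}" for key in keys)


def instance_patterns(table: List[Dict[str, object]]) -> List[str]:
    """Fill the wildcard pattern with the values of every row."""
    pattern = wildcard_pattern(table)
    return [pattern.format(**row) for row in table]


if __name__ == "__main__":
    # Parameterized rules would write below results/<pattern>/...,
    # while non-parameterized rules keep a path without the pattern.
    print(wildcard_pattern(PARAM_TABLE))
    for instance in instance_patterns(PARAM_TABLE):
        print(f"results/{instance}/output.tsv")
```

Because only the parameterized rules carry the pattern in their output paths, upstream files are naturally re-used across all settings within a single snakemake call.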
I've started using Kubeflow Pipelines to run data processing, training and predicting for a machine learning project, and I'm using InputPath and OutputPath to pass large files between components.
I'd like to know whether it's possible, and if so how, to set the path where OutputPath looks for a file in a component, and the path where InputPath places a file in a component.
Currently, the code stores them in a pre-determined place (e.g. data/my_data.csv), and it would be ideal if I could 'tell' InputPath/OutputPath this is the file it should copy, instead of having to rename all the files to match what OutputPath expects, as per below minimal example.
from kfp import dsl
from kfp.components import OutputPath, create_component_from_func

@dsl.pipeline(name='test_pipeline')
def pipeline():
    pp = create_component_from_func(func=pre_process_data)()
    # use pp.outputs['pre_processed']...

def pre_process_data(pre_processed_path: OutputPath('csv')):
    import os
    print('do some processing which saves file to data/pre_processed.csv')
    # want to avoid this:
    print('move files to OutputPath locations...')
    os.rename('data/pre_processed.csv', pre_processed_path)
Naturally I would prefer not to update the code to adhere to Kubeflow pipeline naming convention, as that seems like very bad practice to me.
Thanks!
Update - See ark-kun's comment, the approach in my original answer is deprecated and should not be used. It is better to let Kubeflow Pipelines specify where you should store your pipeline's artifacts.
For lightweight components (such as the one in your example), Kubeflow Pipelines builds the container image for your component and specifies the paths for inputs and outputs (based upon the types you use to decorate your component function). I would recommend using those paths directly, instead of writing to one location and then renaming the file. The Kubeflow Pipelines samples follow this pattern.
For reusable components, you define the pipeline inputs and outputs as part of the YAML specification for the component. In that case you can specify your preferred location for the output files. That being said, reusable components take a bit more effort to create, since you need to build a Docker container image and component specification in YAML.
This is not supported by the system.
Components should use the system-provided paths.
This is important, because on some execution engines the data is mounted to those paths. And sometimes these paths have certain restrictions or might even be unchangeable. So the system must have the freedom to choose the paths.
Usually, good programs do not hard-code any absolute paths inside their code, but rather receive the paths from the command line.
In any case, it's pretty easy to copy the files from or to the system-provided paths (as you already do in the code).
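As a minimal sketch of that last point: the existing code can keep writing to its usual location, and a small helper copies the result to whatever path the system provides. The helper name `export_to_system_path` is made up for illustration; only standard-library calls are used, and the paths in the usage note are the ones from the question.

```python
import shutil
from pathlib import Path


def export_to_system_path(local_file: str, system_path: str) -> None:
    """Copy a file written by existing code to the system-provided path."""
    target = Path(system_path)
    # The parent directory of a system-provided path may not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Copy instead of rename: this also works across mounted volumes
    # and leaves the original file in place.
    shutil.copyfile(local_file, target)
```

Inside the component from the question this would be called as `export_to_system_path('data/pre_processed.csv', pre_processed_path)`, so the processing code itself stays unaware of the pipeline's paths.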
I am using IntelliJ to develop Java applications that use YAML files for the app properties. These YAML files have some placeholder/template params like:
credentials:
  clientId: ${client.id}
  secretKey: ${secret.key}
My CI/CD pipeline takes care of substituting the actual value for these params (client.id and secret.key) based on the environment on which it is getting deployed.
I'm looking for something similar in IntelliJ: I would configure some static/fixed values for the params (e.g. client.id and secret.key) within the IDE, and when I run locally from the IDE, these values would be substituted into the YAML files before the application runs.
This will actually save me from updating the YAML files with the placeholder params each time I check in some other changes to my version control system.
There is no such feature in IDEA, because IDEA cannot auto-detect every possible known or unknown expression language or template macro that you could use in a YAML file. Furthermore, IDEA would have to create a context in which these template files are evaluated.
For IDEA it's just a normal yaml file.
IDEA has a language injection feature.
That can be used to inject sql into a java string for instance or inject any language into a yaml field.
This is a really nice feature and can help you to rename SQL column names and so on, but it won't solve your particular problem, because you want to make that template "runnable" within a certain context where you define your variables.
My suggestion would be to write a small, simple program that does nearly the same as the template engine does.
When you only need simple string replacements and no macro execution then this could be done via regular expression.
If it's more complicated I would use the same template engine as the "real processor" does.
If you want further help, it would be good to know what your YAML processing pipeline looks like.
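For the simple string-replacement case, such a small program could look like the following sketch. It assumes placeholders always have the form `${name}` as in the question; the function name `substitute` and the example value are made up, and unknown placeholders are deliberately left untouched so that missing values are easy to spot.

```python
import re
from typing import Mapping

# Matches ${some.key} and captures the key between the braces.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def substitute(text: str, values: Mapping[str, str]) -> str:
    """Replace every ${key} in text with values[key], if the key is known."""

    def replace(match: "re.Match[str]") -> str:
        key = match.group(1)
        # Keep the placeholder as-is when no value is configured for it.
        return values.get(key, match.group(0))

    return PLACEHOLDER.sub(replace, text)


if __name__ == "__main__":
    template = "credentials:\n  clientId: ${client.id}\n  secretKey: ${secret.key}\n"
    print(substitute(template, {"client.id": "local-id"}))
```

Such a script could be run as a "before launch" step of a run configuration, writing the substituted file to a location that is not checked in.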
We're trying to migrate from our current Ant build to Maven. In the current project, we have a different properties file for each environment, say
qa.properties, prod.properties and dev.properties.
The property values in these files are substituted wherever the properties are referenced in the config files (present in src\main\resources\config). The current Ant build process replaces all property references in the config files with their corresponding values for the current build environment.
I'm somewhat aware of the Profiles concept in maven. However, I'm not able to figure how to achieve this using maven.
Any help would be appreciated.
Thanks,
Prabhjot
There are several ways to implement this but they are all variations around the same features: combine profiles with filtering. A Maven2 multi-environment filter setup shows one way to implement such a setup (a little variation would be to move the filter declaration inside each profile).
See also
9.3. Resource Filtering
I wonder what is the Maven way in my situation.
My application has a bunch of configuration files, let's call them profiles. Each profile configuration file is a *.properties file, that contains keys/values and some comments on these keys/values semantics. The idea is to generate these *.properties to have unified comments in all of them.
My plan is to create a template.properties file that contains something like
#Comments for key1/value1
key1=${key1.value}
#Comments for key2/value2
key2=${key2.value}
and a bunch of files like
#profile_data_1.properties
key1.value=profile_1_key_1_value
key2.value=profile_1_key_2_value
#profile_data_2.properties
key1.value=profile_2_key_1_value
key2.value=profile_2_key_2_value
Then bind to the generate-resources phase to create a copy of template.properties per profile_data_*.properties file, and filter that copy using the corresponding profile_data_*.properties as the filter.
The easiest way is probably to create an ant build file and use antrun plugin. But that is not a Maven way, is it?
Other option is to create a Maven plugin for that tiny task. Somehow, I don't like that idea (plugin deployment is not what I want very much).
Maven does offer filtering of resources that you can combine with Maven profiles (see for example this post) but I'm not sure this will help here. If I understand your needs correctly, you need to loop over a set of input files and change the name of the output file. And while the first part might be possible using several <execution> elements, I don't think the second part is doable with the resources plugin.
So if you want to do this in one build, the easiest way would be indeed to use the Maven AntRun plugin and to implement the loop and the processing logic with Ant tasks.
And unless you need to reuse this in several places, I wouldn't encapsulate this logic in a Maven plugin; it wouldn't give you much benefit if this is done in a single project, in a single location.
You can extend the way Maven does its filtering, as Maven retrieves its filtering strategy from the Plexus container via dependency injection. So you would have to register a new default strategy. This is heavy stuff and badly documented, but I think it can be done.
Use these URLs as starting point:
http://maven.apache.org/shared/maven-filtering/usage.html
and
http://maven.apache.org/plugins/maven-resources-plugin/
Sean