How can you specify the local path of InputPath or OutputPath in Kubeflow Pipelines?

I've started using Kubeflow Pipelines to run data processing, training, and prediction for a machine learning project, and I'm using InputPath and OutputPath to pass large files between components.
I'd like to know whether it's possible to set the path where OutputPath looks for a file in a component, and the path from which InputPath loads a file in a component.
Currently, the code stores files in a predetermined place (e.g. data/my_data.csv), and it would be ideal if I could 'tell' InputPath/OutputPath which file it should copy, instead of having to rename all the files to match what OutputPath expects, as in the minimal example below.
from kfp import dsl
from kfp.components import create_component_from_func, OutputPath

@dsl.pipeline(name='test_pipeline')
def pipeline():
    pp = create_component_from_func(func=pre_process_data)()
    # use pp.outputs['pre_processed']...

def pre_process_data(pre_processed_path: OutputPath('csv')):
    import os
    print('do some processing which saves a file to data/pre_processed.csv')
    # want to avoid this:
    print('move files to OutputPath locations...')
    os.rename('data/pre_processed.csv', pre_processed_path)
Naturally I would prefer not to update the code to adhere to the Kubeflow Pipelines naming convention, as that seems like very bad practice to me.
Thanks!

Update - see ark-kun's comment: the approach in my original answer is deprecated and should not be used. It is better to let Kubeflow Pipelines specify where you should store your pipeline's artifacts.
For lightweight components (such as the one in your example), Kubeflow Pipelines packages your function into a component and specifies the paths for inputs and outputs (based on the type annotations on your component function's parameters). I would recommend using those paths directly, instead of writing to one location and then renaming the file. The Kubeflow Pipelines samples follow this pattern.
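For example, here is a minimal sketch of that pattern (assuming the kfp v1 SDK; the pandas-based processing and the parameter names are only placeholders for whatever your component actually does):

from kfp.components import create_component_from_func, InputPath, OutputPath

def pre_process_data(raw_path: InputPath('csv'),
                     pre_processed_path: OutputPath('csv')):
    # Read from and write to the paths that Kubeflow Pipelines provides,
    # instead of a hard-coded location such as data/pre_processed.csv.
    import pandas as pd
    df = pd.read_csv(raw_path)
    df = df.dropna()  # placeholder for the real pre-processing
    df.to_csv(pre_processed_path, index=False)

pre_process_op = create_component_from_func(
    pre_process_data,
    base_image='python:3.9',
    packages_to_install=['pandas'],
)

The pipeline then wires components together by connecting outputs to inputs, and the SDK fills in the concrete paths at runtime.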
For reusable components, you define the component's inputs and outputs as part of its YAML specification. In that case you can specify your preferred location for the output files. That said, reusable components take a bit more effort to create, since you need to build a Docker container image and write a component specification in YAML.

This is not supported by the system.
Components should use the system-provided paths.
This is important because on some execution engines the data is mounted at those paths, and sometimes these paths have certain restrictions or might even be unchangeable. So the system must have the freedom to choose the paths.
Usually, good programs do not hard-code any absolute paths inside their code, but rather receive the paths from the command line.
In any case, it's pretty easy to copy the files from or to the system-provided paths (as you already do in the code).
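As an illustration of that advice (a sketch only; the script and argument names are hypothetical), the processing code itself can accept the locations on the command line, so the component can simply forward whatever paths the system provides:

import argparse
import shutil

def main():
    # The concrete paths are injected by the caller (for example, the
    # pipeline system), rather than being hard-coded in the program.
    parser = argparse.ArgumentParser()
    parser.add_argument('--input-path', required=True)
    parser.add_argument('--output-path', required=True)
    args = parser.parse_args()

    # Placeholder "processing": copy the input to the output unchanged.
    shutil.copy(args.input_path, args.output_path)

if __name__ == '__main__':
    main()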

Related

VK_LAYER_PATH does not work as expected

Ubuntu 22.04. I set VK_LAYER_PATH in /etc/environment as VK_LAYER_PATH=/etc/vulkan/implicit_layer.d, a directory that stores a single JSON file with AMDVLK (needed for a test). But when I check vulkaninfo, it says I still have 10 layers (without VK_LAYER_PATH, vulkaninfo shows me 12 layers), among them, for example, MangoHUD. I know where MangoHUD stores its layer JSON file, and it is a completely different folder (/usr/share/vulkan/**). From the documentation I see:
"Setting the VK_LAYER_PATH environment variable overrides the default loader layer search mechanism. When set, the loader will search only the directory(s) identified by the VK_LAYER_PATH environment variable for layer manifest files."
But it looks like the loader still goes through all of the default folders to look for JSON files.
I found information saying that VK_LAYER_PATH only overrides explicit layers, and implicit layers are still searched for in the default folders. Is that correct? If so, is there any option to override the implicit layer search path?
You've probably already found the Loader Layer Interface doc and others like it in the same folder.
Since the loading of implicit layers is supposed to be automatic, I don't think that there is a way to set a search path just for implicit layers by design. You could post a question to the same Vulkan-Loader repo to get a better opinion.
If your goal is to prevent the loading of a set of implicit layers in order to test your own implicit layer by itself, then you can disable implicit layers one by one. Each layer manifest specifies an environment variable that can be set to disable that layer.

Loading fsx files dynamically in an FSX script

We are sharing a build script for FAKE across a set of projects. We want to keep this one build script the same, but make it possible to extend it with other targets. One way I could think of doing this is by loading .fsx files that fit a specific naming pattern, like all files matching build-*.fsx, but I can't seem to find a way to load these files dynamically. Any suggestions on how to do this, or how to accomplish the desired result some other way, are welcome as answers.
If I could, I would have done something like:
#load "build-*.fsx"
It's not completely clear to me why you want to do this but maybe this will help. Refer to a single script in each project:
#load "load-build-scripts.fsx"
And then in the single load-build-scripts.fsx:
#load "build-1.fsx"
#load "build-2.fsx"
#load "build-3.fsx"
...
You will need to change this second file whenever you add a new script.
This isn't generally recommended, because if these separate scripts refer to each other, some scripts will be loaded more than once. Scripts aren't really meant to be used for cases this complex.
Another option is to use FAKE as a console project instead of using scripts and the fake-cli tool. Then you can use normal .NET project dependencies.

Can we do variable substitution on YAML files in IntelliJ?

I am using IntelliJ to develop Java applications which use YAML files for the app properties. These YAML files have some placeholder/template params, like:
credentials:
  clientId: ${client.id}
  secretKey: ${secret.key}
My CI/CD pipeline takes care of substituting the actual values for these params (client.id and secret.key) based on the environment to which the application is deployed.
I'm looking for something similar in IntelliJ. Something like: I configure some static/fixed values for the params (e.g. client.id and secret.key) within the IDE, and when I run locally using the IDE, these values are substituted into the YAML files before the run.
This would save me from having to restore the placeholder params in the YAML files each time I check in other changes to my version control system.
There is no such feature in IDEA, because IDEA cannot auto-detect every possible known or unknown expression language or template macro that you could use in a YAML file. Furthermore, IDEA would have to create a context for such template files.
For IDEA it's just a normal YAML file.
IDEA does have a language injection feature.
It can be used to inject SQL into a Java string, for instance, or to inject any language into a YAML field.
This is a really nice feature and can help you rename SQL column names and so on, but it won't solve your particular problem, because you want to make that template "runnable" within a certain context where you define your variables.
My suggestion would be to write a small, simple program that does nearly the same thing as the template engine.
If you only need simple string replacements and no macro execution, this can be done with a regular expression.
If it's more complicated, I would use the same template engine that the "real" processor uses.
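For the simple-replacement case, here is a minimal sketch of such a program in Python (the value map and the file names are purely illustrative; in practice the values could come from a local properties file or environment variables):

import re
import sys

# Hypothetical local values for the placeholders.
VALUES = {
    'client.id': 'local-client-id',
    'secret.key': 'local-secret-key',
}

def substitute(text):
    # Replace ${name} placeholders with known values; leave unknown
    # placeholders untouched.
    return re.sub(r'\$\{([^}]+)\}',
                  lambda m: VALUES.get(m.group(1), m.group(0)),
                  text)

if __name__ == '__main__':
    # Usage (hypothetical file names):
    #   python substitute.py application-template.yaml > application.yaml
    with open(sys.argv[1], encoding='utf-8') as f:
        sys.stdout.write(substitute(f.read()))

In IDEA you could run such a script as an external tool or as a "Before launch" step of your run configuration, so the substituted file is regenerated each time you run locally.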
If you want further help, it would be good to know what your YAML processing pipeline looks like.

Is it possible to pass a variable from the build process to Visual Basic code?

My goal is to create build definitions within Visual Studio Team Services for both test and production environments. I need to update 2 variables in my code which determine which database and which blob storage the environment uses. Up till now, I've juggled this value in a Resource variable, and pulled that value in code from My.Resources.DB for a library, and Microsoft.Azure.CloudConfigurationManager.GetSetting("DatabaseConnectionString") for an Azure worker role. However, changing 4 variables every time I do a release is getting tiring.
I see a lot of posts that get close to what I want, but they're geared towards C#. For reasons beyond my influence, this project is written in VB.NET. It seems I have 2 options. First, I could call the MSBuild process with a couple of defined properties, passing them to the .metaproj build file, but I don't know how to get them to be used in VB code. That's preferable, but, at this point, I'm starting to doubt that this is possible.
I've been able to set some pre-processor constants, to be recognized in #If-#Else directives.
#If DEBUG = True Then
    BarStaticItemVersion.Caption = String.Format("Version: {0}", "1.18.0.xxx")
#Else
    BarStaticItemVersion.Caption = String.Format("Version: {0}", "1.18.0.133")
#End If
msbuild CalbertNG.sln.metaproj /t:Rebuild /p:DefineConstants="DEBUG=False"
This seems to work, though I need to Rebuild to change the value of that constant. Should I have to? Should Build be enough? Is this normal, or an indication that I don't have something set quite right?
I've seen other posts that talk about pre-processing the source files with some other builder, like Ant, but that seems like overkill. It feels like I'm close here. But I want to zoom out and ask, from a clean sheet of paper, if you're given 2 variables which need to change per environment, you're using VB.NET, and you want to incorporate those variable values in an automated VS Team Services build process upon code check-in, what's the best way to do it? (I want to define the variables in the VSTS panel, but this just passes them to my builder, so I have to know how to parse the call to MSBuild to make these useful.)
I can control picking between 2 static strings, now, via compiler directives, but I'd really like to reference the Build.BuildNumber that comes out of the MSBuild process to display to the user, and, if I can do that, I can just feed the variables for database and blob container via the same mechanism, and skip the pre-processor.
You've already found one way to pass data from the MsBuild arguments directly into the code. An alternative is to use the Condition attribute in your project files to make certain property groups optional; it even allows you to include specific files conditionally. You can control conditions by passing /p:ConditionalProperty=value on the MsBuild command line. This at least ensures people use a set of values that make sense together.
The problem is that when MsBuild runs in incremental mode, it is likely not to process your changes (as you've noticed). The reason is that the input files remain unchanged since the last build and are all older than the last generated output files.
To bypass this behavior you'd normally create a separate solution configuration and override the output location for all projects to be unique for that configuration. Combined with setting the compiler constants for that specific configuration, this ensures that when building that Configuration/Platform combination, incremental builds work as intended.
I do want to echo some of the comments from JerryM and Daniel Mann. Some items are better stored elsewhere or updated before you actually start the compile phase.
Possible solutions:
Store your configuration data in config files and use Configuration Transformation to generate the right config file based on the selected solution configuration. The process is explained on MSDN. To enable configuration transformation on all project types, you can use SlowCheetah.
Store your configuration data in the config files and use MsDeploy, specifying a Parameters.xml file that matches the deploy package. It will perform the transformation at deploy time and will actually allow your solution to contain a standard config file you use at runtime, plus a publish profile which will post-process your configuration. You can use a SetParameters.xml file to override the variables at deploy time.
Create an installer project (such as through WiX) and merge the final configuration at install time (similar to MsDeploy). You could even provide a UI which prompts for specific values (and can supply default values).
Use a CI server, like the new TFS/VSTS 2015 task-based build engine, and combine it with a task that can search and replace tokens, like the Replace Tokens task, Tokenization Task, or Colin's ALM Corner Build and Release Tasks, plus a whole bunch that specifically deal with versioning. Handling these things in the CI server also allows you to do a quick build locally at all times and do these relatively expensive steps on the build server (patching source code breaks incremental builds in MsBuild, because there are always newer input files).
When talking specifically about versioning, there are a number of ways to set the AssemblyVersion and AssemblyFileVersion just before compile time; usually it involves overriding the AssemblyInfo file (AssemblyInfo.vb in a VB project) before compilation. Your code can then use reflection to read the value at runtime. You can use the AssemblyInformationalVersion to specify something like the example above, which contains .xxx or other text. It also ensures that the version displayed always reflects the information obtained when reading the file properties through Windows Explorer.

Apache Giraph: processing a graph with a custom algorithm

I have a custom algorithm for processing a graph which accepts a .txt file as input. Because it is a large-scale graph, I want to implement it in the Apache Giraph framework. I've done a lot of research, but I am still not sure if I am on the right path.
I am reading a .csv file which contains the graph data, and using a parser I am converting it to a .txt file and uploading it to Hadoop's HDFS file system.
I have read the SimpleShortestPathsVertex example from the Apache quick start guide, and I can see that it processes the data from a file in HDFS using the jar-with-dependencies jar file.
My problem is that I haven't yet understood how I can add my algorithm to the Apache Giraph framework and start processing the graph. Can I add my algorithm to the framework using Eclipse and modify it from there, or is there another way?
Thank you!
Have a look here:
https://cwiki.apache.org/confluence/display/GIRAPH/Shortest+Paths+Example
Were you able to run this example?
If yes, familiarize yourself with the different Writable formats of Hadoop! Otherwise it is hard to apply them to your algorithm.
All computation concerning the graph is done in the compute() function.
(If you're more advanced, have a look at the WorkerContext, preSuperstep(), and Aggregators!)
You can change the example, but as soon as you use other data types you have to change your VertexReader and VertexWriter.
If you have a specific algorithm in mind, decide what you need for the computation and specify the layout of your input file. Then adapt your VertexReader and -Writer, and finally start implementing your compute() function!
Of course you can use Eclipse! Simply reference the Giraph jar (for me it is "giraph-0.1-jar-with-dependencies.jar") and start coding.
All you need is an instance of each of these files, specific to your algorithm:
YourGiraphJob (the file starting the Hadoop/Giraph job)
YourVertex (Specifies your compute() function executed on each Vertex)
YourInputFormat (Specifying the Writable formats of YourReader)
YourOutputFormat (Specifying the Writable formats of YourWriter)
YourReader (Specifies how your inputFile is transformed e.g. that for each line a Vertex can be initialized using given information)
YourWriter (Specifies how your outputFile is generated from the vertices)
(optionally a WorkerContext if you want to use Aggregators)
Simply check out http://giraph.apache.org/source-repository.html using Eclipse and you should have the code, including an example application you can toy around with!