Pentaho - Migrating from Database Repository to File Repository

I am in the process of migrating Pentaho from a database repository to a file repository.
I exported the database repository to an XML file, then created a file repository and imported the export into it.
The first issue I noticed after importing is that all my database connections are stored in the .ktr and .kjb files. This is going to be a big problem if I ever have to update a connection string, for example to change a password: I have more than a hundred sub-transformations and jobs, so do I have to update the connection in all of those files?
Is there any way to ignore the password and other connection settings stored in the .ktr and .kjb files and instead use the repository connection, or specify it in kettle.properties?
The other issue is that when I run the master job via Kitchen on the command line, it does not recognize the sub-transformations and jobs. However, when I change the transformation root to ${Internal.Entry.Current.Directory}, the sub-transformations are recognized and processed. As I mentioned, I have more than 100 sub-transformations and jobs - is there any way to update this root for all jobs and transformations at once?
Kitchen.bat /file:"C:\pentaho-8-1\Dev_Repo\home\jobs\MainProcess\MasterJob.kjb" /level:Basic /logfile:"C:\pentaho-8-1\logs\my-job.txt"
This fails with an error saying the .ktr is not a file or the repository is not defined.
However, when I change the root directory to ${Internal.Entry.Current.Directory}, it works!

For the database connections, you can create .kdb connection files in the repository and use variables for all the properties (host, port, schema, user, etc.), defining them in kettle.properties or another properties file.
This works like a more convenient version of JNDI files, with one properties file per environment. You can easily inspect the current values by opening kettle.properties from within the Spoon client (don't edit it there or it will mess up the layout!), and you can also put Kettle-encrypted passwords in the properties file.
PDI will still save copies of the connections into all the .kjb and .ktr files (and should in theory update them from the .kdb files or shared.xml when opening them), but since the contents are just generic variable names (${STAGING_DB_HOST}, etc.) you will almost never run into problems with this.
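As an illustration only (the variable names and values below are invented for this example), such a per-environment kettle.properties fragment might look like:
# kettle.properties for the staging environment (example values)
STAGING_DB_HOST=staging-db.example.com
STAGING_DB_PORT=5432
STAGING_DB_NAME=staging
STAGING_DB_USER=etl_user
# paste here the value produced by PDI's Encr tool (encr.sh -kettle <password>)
STAGING_DB_PASSWORD=Encrypted <output of the Encr tool>
The connection definition then only ever contains ${STAGING_DB_HOST}, ${STAGING_DB_PORT} and so on, so a password change becomes a single edit in one properties file.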
For the transformation filenames, a good text search-and-replace tool should fix most of your transformations in one go. Include part of the surrounding XML tag in the search string to prevent replacing too much.
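For instance, assuming GNU sed and that the entries ended up with a hard-coded repository directory (the tag name and old path below are only an illustration of the idea; open one of your files first to see exactly what was exported):
find . \( -name '*.kjb' -o -name '*.ktr' \) -exec sed -i 's|<directory>/home/jobs/MainProcess</directory>|<directory>${Internal.Entry.Current.Directory}</directory>|g' {} +
Run it on a copy of the repository first and spot-check a couple of files in Spoon afterwards.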

Related

Is there an easy way to move a file to a different folder in dbt Cloud?

Is there an easy way to move a file to a different folder in dbt Cloud, without having to create a new file with the same name in the new folder, copy/paste the contents from the old file, and delete the old file? That is a pain.
Is there a good reason I should NOT do this? I assume my refs will still work as long as the filename remains the same, and I don't have any folder-specific logic tied to this file.
For example, say I have my_model.sql in my 'staging' folder and I want to simply move it to my 'mart' folder instead. In this example I'd like to do this to reflect that the file is really a more 'stable' mart-type table rather than a staging view. I realize I can just change the materialization type, but I'm doing this more to organize things clearly. Thanks!
The way to move a file in the dbt Cloud IDE is not 100% obvious. You can use the rename function to move a file to another location.
Click on the drop-down next to the file name, then select "Rename". That opens the file path, and you can change where the file lives by typing in the new folder's name.
The easiest way I have found to do this is...not using dbt Cloud, but using GitHub Desktop (no command line needed).
Create a new branch
Open the repository in GitHub Desktop
View the files in your file explorer
Move the files or directory locally
Commit the changes in GitHub Desktop
Push to origin for the branch you created
Open a pull request
Merge
Depending on what the file or directory is, you may find creating a new branch and opening a PR to be overkill. For my specific project there is a lot of legacy organization, and models that we aren't totally sure have no downstream dependencies, so creating a new branch for this allowed me to test-run all of our models.
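For reference, if you are comfortable with a terminal, the equivalent workflow is only a few git commands (the branch and folder names below are just examples):
git checkout -b move-my-model
git mv models/staging/my_model.sql models/mart/my_model.sql
git commit -m "Move my_model from staging to mart"
git push -u origin move-my-model
Then open and merge the pull request on GitHub as usual.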
Hope this helps!

How to add a new repository outside graphdb.home

I am cloning a large public triplestore for local development of a client app.
The data is too large to fit on the SSD partition where <graphdb.home>/data lives. How can I create a new repository at a different location to host this data?
On startup, GraphDB reads the value of the graphdb.home.data parameter. By default it points to ${graphdb.home}/data. You have two options:
Move all repositories to the big non-SSD partition
You need to start GraphDB with ./graphdb -Dgraphdb.home.data=/mnt/big-drive/ or edit the value of the graphdb.home.data parameter in ${graphdb.home}/conf/graphdb.properties.
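For example, the properties-file variant is a single line in ${graphdb.home}/conf/graphdb.properties (the target path is just an example):
graphdb.home.data = /mnt/big-drive/graphdb-data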
Move a single repository to a different location
GraphDB does not allow creating a repository if its directory already exists. The easiest way to work around this is to create a new empty repository bigRepo, initialize it by making at least one request to it, and then shut down GraphDB. Then move the directory ${graphdb.home}/data/repositories/bigRepo/storage/ to your new big drive and create a symbolic link on the file system: ln -s /mnt/big-drive/ data/repositories/bigRepo/storage/
You can apply the same technique for moving individual files as well.
Please make sure that all permissions are set correctly by using the same user to start GraphDB.
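Putting the steps together, a rough sketch might look like this (GRAPHDB_HOME stands for your GraphDB home directory and the target path is only an example; stop GraphDB before moving anything):
mv $GRAPHDB_HOME/data/repositories/bigRepo/storage /mnt/big-drive/bigRepo-storage
ln -s /mnt/big-drive/bigRepo-storage $GRAPHDB_HOME/data/repositories/bigRepo/storage
chown -R graphdb:graphdb /mnt/big-drive/bigRepo-storage   # same user that runs GraphDB
Then start GraphDB again and check that the repository comes up.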

FlywayDB ignore sub-folder in migration

I have a situation where I would like Flyway to ignore specific folders inside the directory where it looks for migration files.
Example
/db/Migration
  2.0-newBase.sql
  /oldScripts
    1.1-base.sql
    1.2-foo.sql
I want to ignore everything inside the 'oldScripts' subfolder. Is there a flag I can set in the Flyway configuration, like ignoreFolder=SOME_FOLDER or scanRecursive=false?
An example of why I would do this: say I have 1000 scripts in my migration folder. If we onboard a new member, instead of having them run the migration on 1000 files, they could just run the one script (the new base) and proceed from there. The alternative would be to never sync those files in the first place, but then people would need to remember to check source control for prior migrations instead of just looking on their local drive.
This is not currently supported directly. You could put both directories at the same level in the hierarchy (without nesting them) and selectively configure flyway.locations to achieve the same thing.
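For example, with the old scripts moved out of the scanned location, the relevant line in flyway.conf could look like this (paths are illustrative):
flyway.locations=filesystem:db/Migration
Anything sitting next to db/Migration, such as db/oldScripts, would then simply never be scanned.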
Since Flyway 6.4.0 wildcards are supported in flyway.locations. Examples:
db/**/test
db/release1.*
db/release1.?
More info at https://flywaydb.org/blog/organising-your-migrations

Executing Abaqus Model in Taverna

I'm pretty new to both Taverna and Abaqus, but I am trying to run an Abaqus model using a "Tool" service in Taverna remotely on an HPC. This works fine if I already have my model file and inputs on the HPC, but I need a way of uploading the files dynamically from Taverna (I'm trying to generically wrap Abaqus models).
I've tried adding an input port that takes a file list, but I don't know how I can copy the files to the "location" that I've set for the tool. Could a Beanshell service be the answer, or can I iterate through the file list and copy the files up before executing the Abaqus model?
Thanks
When you say that you created an input port that takes a file list, I guess you mean an input to the tool service.
Assuming the input port is called my_file_list, when the tool service is run it will take a list of data values on port my_file_list. As an example, say "hello", "hi" and "hola" are the three values in the list.
On the machine where the tool service runs, it executes in a temporary directory - a different directory for each execution of the service. It is normally something like /tmp/usecase-2029778474741087696.
Three files will be created in the temporary directory; those files contain (in this example) the three values the tool service received on port my_file_list. The files could be called
/tmp/usecase-2029778474741087696/tempfile.0.tmp containing hello
/tmp/usecase-2029778474741087696/tempfile.1.tmp containing hi
/tmp/usecase-2029778474741087696/tempfile.2.tmp containing hola
There will also be a file called my_input_list. That file will contain
/tmp/usecase-2029778474741087696/tempfile.0.tmp
/tmp/usecase-2029778474741087696/tempfile.1.tmp
/tmp/usecase-2029778474741087696/tempfile.2.tmp
The script of your tool service would normally read the contents of my_input_list line by line and do something with the contents of each listed file.
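A minimal tool-script sketch of that pattern (assuming, as above, that the list file is called my_input_list) would be:
# loop over the temporary file paths listed in my_input_list, one per line
while read -r f; do
    cat "$f"    # or copy the file somewhere, feed it to your command, etc.
done < my_input_list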
I have also seen some scripts that 'cheat' and iterate directly over tempfile*.tmp, but that would be "a bad thing". The problem with that trick is that if you add a second list of files to the tool service, the file my_input_list could contain
/tmp/usecase7932018053449784034/tempfile.4.tmp
/tmp/usecase7932018053449784034/tempfile.5.tmp
/tmp/usecase7932018053449784034/tempfile.6.tmp
as other temporary files were used for the other file list port.
I hope that helps
The tool service allows you to upload files - but if you are using the HPC through a job submission node, then you would have to modify your command-line tool to use the job's file-staging commands to push the files further as part of the job. The files would be available in the current (temporary) directory of the specified tool script.
I would try to do it through the Tool service and not involve a Beanshell - that way you can keep your workflow simpler.
A good thing to remember is that you can write multiple shell commands in the box, as sketched below.
Similarly, you would probably want to retrieve the results back so that you can process them further in the workflow (unless they are massive, in which case you should just output their remote filenames and pass them in again to the next HPC job).
The exact commands to use for staging files and retrieving them depend on the HPC job submission system. Which one are you using?
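As a very rough illustration of putting several commands in the one tool script (every name below is a placeholder; the real staging and submission commands depend on your scheduler):
cp my_model.inp /scratch/$USER/abaqus_run/             # stage the uploaded input where the job expects it
cd /scratch/$USER/abaqus_run
abaqus job=my_model input=my_model.inp interactive     # run the model, or submit it via your scheduler
cp my_model.odb "$OLDPWD"                              # copy the result back into the service's temporary directory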
Thanks for the input, guys.
It was my misunderstanding of how Taverna uses the file list. All the files in the list are copied to the temporary "sandbox" and are therefore available for use.
Another nice, easy way is to zip the directory and pass the zipped file into an input port for the service, then just unzip it inside the command.
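A sketch of that approach, assuming the zip arrives in the temporary directory as model_bundle.zip:
unzip -o model_bundle.zip     # unpack the model files into the service's temporary directory
# ...then run Abaqus against the unpacked files, as in the sketch above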
Thanks again

In Pentaho, how can I set a dynamic file path or folder directory for the file repository in Kettle jobs?

How can I set a dynamic file path or folder directory for Kettle jobs?
Please check the attached screenshot.
Goal: read the path from a config file as a variable [so that we can change the path dynamically based on other parameters].
Details: say we want to use the /web/test directory for the test environment, and fetch the file repository from the normal path when the parameter is not test. I assume there must be a way to keep a config/ini file from which we can read the path and use that variable inside the "File/Directory" section of Pentaho.
I have gone through the variable reference option, but that mainly covers database configuration parameters; some people suggested it is not a good option and that you can instead specify the database configuration in XML.
Please suggest any ideas or solutions.
Sounds like you want to set a parameter/variable in kettle.properties and reference it in the File or directory text box. Note the red dollar sign next to the box; that means the field accepts variables. Here's the wiki entry for variables:
PDI Variables
You can also read from a config file directly (in a transformation) and set the value dynamically with the Set Variables step if you can only have one kettle.properties file. Also check out the Check if connected to repository step (from the Repository branch) and see if that suits your needs.
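For instance (the variable name and paths are made up), kettle.properties could define the root once per environment and the job entry would only ever reference the variable:
# kettle.properties on the test environment
FILE_REPO_DIR=/web/test
# File/Directory field in the job entry:
# ${FILE_REPO_DIR}/MainProcess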
If none of these suit your needs, please add detail to your question describing exactly what you're trying to do and how you're trying to do it.