We currently have multiple file-based Kettle repositories and are planning to move them to a database repository. Some of these repositories contain transformations with the same name. We have an Oracle back end, and when we try to import the repositories into the schema generated by Kettle, the transformations with the same name get overwritten or skipped. Does this mean there needs to be a separate schema in Oracle for each repository?
Thanks
You can create a different folder at the root of your DB repo for each of the file repos. When you import, you specify the target folder; this way they are kept separate.
Just remember to change your schedules or invocation scripts to point to the right location.
Newbie to Liquibase here. We have a requirement to deploy the same changes to multiple databases in Amazon RDS.
What would be the best way to deploy? Using different changelogs for different databases and including them in one master changelog, or using a single changelog file with the context tag in each changeSet? I read a couple of other articles but was not able to find a concrete solution.
We also have .dll files to deploy. Do we convert them to .sql and then use a tag, or are there any tags for .dll files?
Note: we are not using any Java/Maven application.
The answer depends on what type of deployment you are doing. You could have a simple pipeline where you want the same changes made to all of the databases, or one that applies only a subset of the changes to each pipeline database (e.g. dev, test, prod).
Liquibase does not work on .dll files (binary or compiled files); it works only with text-based files (SQL, JSON, YAML and XML). You can read more on it here
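If you go the single-changelog-with-contexts route, a minimal sketch of a SQL-formatted changelog might look like this (the author, table and context names are just examples, not anything prescribed):

```sql
--liquibase formatted sql

--changeset alice:1
-- No context attribute, so this changeset runs against every database you deploy to.
CREATE TABLE customer (
    id   INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

--changeset alice:2 context:dev
-- Runs only when the deployment is invoked with --contexts=dev.
INSERT INTO customer (id, name) VALUES (1, 'sample dev row');
```

You would then run liquibase update against each RDS endpoint, passing the appropriate --contexts value; changesets whose context does not match are skipped. The other option from your question, a master changelog that includes one changelog per database, also works; it just means more files to maintain.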
I'm looking for a lightweight way of executing a Databricks notebook that depends on multiple files having been loaded to Azure Data Lake Storage.
Multiple different ADF packages are loading different files into ADLS, which are then processed by Databricks notebooks. Some of the notebooks depend on multiple files from different packages.
A single file is simple enough with an event trigger. Can this be generalised to more than one file without something like Airflow handling dependencies?
This isn't exactly lightweight since you'll have to provision an Azure SQL table, but this is what I would do:
I would create and store a JSON file in ADLS that details each notebook/pipeline and its file name dependencies.
I'd then provision an Azure SQL table to store the metadata of each of these files. Essentially, this table needs the following columns (a DDL sketch follows the list):
General file name, which matches the file name dependencies in step #1 (e.g. FileName)
Real file name (e.g. FileName_20201007.csv)
Timestamp
Flag (boolean) indicating whether the file is present
Flag (boolean) indicating whether the file has been processed (i.e. its dependent Databricks notebook has run)
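A rough T-SQL sketch of that watermark table (the table and column names are my own invention, adjust to taste):

```sql
CREATE TABLE dbo.FileWatermark (
    GeneralFileName VARCHAR(200) NOT NULL,  -- logical name, matches the dependency list in the JSON file
    RealFileName    VARCHAR(400) NOT NULL,  -- actual blob name, e.g. FileName_20201007.csv
    ArrivedAt       DATETIME2    NOT NULL,  -- when the Logic App saw the blob
    IsPresent       BIT          NOT NULL DEFAULT 0,  -- file has landed in ADLS
    IsProcessed     BIT          NOT NULL DEFAULT 0,  -- dependent Databricks notebook has run
    CONSTRAINT PK_FileWatermark PRIMARY KEY (GeneralFileName, RealFileName)
);
```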
To populate the table in step #2, I'd use an Azure Logic App that watches for a blob meeting your criteria being created and then creates or updates the corresponding entry in the Azure SQL table.
See:
https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-azureblobstorage &
https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-sqlazure
You'll need to ensure that at the end of the Azure pipeline/Databricks notebook run, you update the Azure SQL flags of the respective dependencies to indicate that these versions of the files have been processed. Your Azure SQL table will function as a 'watermark' table.
Before your pipeline triggers the Azure Databricks notebook, it will look up the JSON file in ADLS, identify the dependencies for each notebook, check whether all the dependencies are available AND not yet processed by the Databricks notebook, and only run the notebook once all these criteria are met.
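Assuming the watermark table sketched above, the "all dependencies ready" check can be a single lookup that the pipeline runs before calling the notebook (the dependency names would come from the JSON file; these two are illustrative):

```sql
-- 1 = every dependency is present and not yet processed, 0 = at least one is missing or already done.
SELECT CASE WHEN COUNT(*) = 0 THEN 1 ELSE 0 END AS ReadyToRun
FROM (VALUES ('SalesFile'), ('CustomerFile')) AS deps(GeneralFileName)
LEFT JOIN dbo.FileWatermark w
       ON w.GeneralFileName = deps.GeneralFileName
      AND w.IsPresent = 1
      AND w.IsProcessed = 0
WHERE w.GeneralFileName IS NULL;
```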
In terms of triggering your pipeline, you could either use an Azure Logic App to do this or leverage a tumbling window trigger in ADF.
I have been using AWS CloudFormation and Terraform to manage cloud infrastructure as code (IaC). The benefits are obvious.
1) Template file to concisely describe your infrastructure
2) Versioning
3) Rollbacks
I also have a PostgreSQL DB where I can dump the schema into a single file. Now, it would be amazing if I could edit a dumped SQL file like I do an IaC template. I could then validate my new SQL template and apply the changes to my DB with the same workflow as CloudFormation or Terraform.
Does anyone know if a tool like this exists for any of the various SQL providers?
Have you given Flyway a try?
It supports versioning database migrations as well as rolling back and undoing migrations when needed. It also keeps a schema history table in the database that tracks which migrations have been applied, so you can continuously deploy new scripts and changes to an existing application that uses Flyway.
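Flyway migrations are plain SQL files picked up by a naming convention such as V1__create_orders.sql; a minimal sketch, with a made-up table:

```sql
-- V1__create_orders.sql  (the next change would go in a new file, e.g. V2__add_order_status.sql)
CREATE TABLE orders (
    id         SERIAL PRIMARY KEY,
    customer   TEXT        NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

Running flyway migrate applies any migrations not yet recorded in the history table (flyway_schema_history by default), in version order, so the same set of files can be applied to a fresh database or to one that is already partly migrated.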
I'm new to Pentaho and I'm trying to set up an automatic deployment process for the Pentaho Business Analytics platform repository, but I'm having trouble figuring out how to proceed with the data sources.
I would like to export/import all the data sources, the same way it is explained here for the repository (Reporting, Analyzer, Dashboards, Solution Files...), but for the data connections, Mondrian files, schemas....
I know there's a way to back up and restore the entire repository (explained here), but that's not how I want to proceed, since the entire repository could contain changes that are undesired in production.
This would need to be done via the command line, a REST call, or something else that can be triggered by Jenkins.
Did you try import-export with the -ds (DataSource) qualifier? This will include the data connections, Mondrian schemas and metadata models.
Otherwise, you can export everything, unzip, filter according to some logic (to be defined by whoever is in charge of the deployment), zip it again and import it into prod. A half-day project with Pentaho Data Integration.
I would like to know how to create different (multiple) repositories in the Pentaho Enterprise version.
Below are some points I would like to add.
1. Different repositories for different users, so one user can't access another user's transformations and jobs.
2. One user can't access the DB connections of other users in different repositories.
My main concern here is security: one user should not be able to access or update transformations created by another user.
Is this possible? Please help me on this.
Thanks to all in advance.
This is exactly how my repos are set up. I use database repos on PostgreSQL for all my users. To create a new repo, just click the green + button at the top right of the Repository Connection dialog.
To keep users out of each other's sandboxes, I create a different schema for each user and assign DB permissions accordingly. Note that the schema has to be created before you create the repo. Of course, I'm the DB superuser, so I can get into all their repos.
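For reference, a minimal PostgreSQL sketch of that per-user setup (role and schema names are just examples):

```sql
-- One role and one schema per developer; that developer's repo tables live in the schema.
CREATE ROLE alice LOGIN PASSWORD 'change_me';
CREATE SCHEMA alice_repo AUTHORIZATION alice;

-- Keep everyone else out of it.
REVOKE ALL ON SCHEMA alice_repo FROM PUBLIC;
GRANT  ALL ON SCHEMA alice_repo TO alice;
```

Repeat per developer, then point each repository connection at the matching schema as described below.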
When you create a connection for a repo, go to the Advanced tab and specify that user's schema in the 'Preferred schema name' box. Note that this connection will not appear in your list of connections stored in the repo; it lives in the repositories.xml file in the .kettle directory. I also created a template XML file that I can tweak and give out to anyone who comes on board as a developer. That way they only see their own repo in the connection dialog, but my repositories.xml has all of their repos.
You can do this with file-based repos as well, but of course you'd handle permissions through the file system rather than the DB.
It's also true that repos can have multiple users. I use this feature when members of the same group need to share transforms. For example, the Data Warehouse group is all in one repo, but each member has their own directory; other groups have their own repos, and so on.
I am not sure that you can create multiple instances of the same repository, but I suggest you use a single repository with different users and different user-level permissions. Your concerns can be resolved with user-level permissions on the repo.