Pentaho merge with dynamic key fields

I am trying to perform a merge in Pentaho Kettle. One table will be merged with exactly the same table on another database. The name of the table to be merged will be passed as an argument, so I will have a config table in my database that contains the column(s) on which the two instances of the table have to be joined.
My question is, can I do a merge with dynamic key fields? I could have a separate step/transformation that selects the names of the columns to be used as keys from the config table.
Thanks in advance!

It is better to do the merge with static key fields, but use separate steps beforehand to populate those fields dynamically with the values you want to merge by. Remember to sort the streams after you populate the fields!

How to implement an "insert if not already in the database" in Pentaho?

How do I implement, or what steps do I use to create, a transformation that compares a table and a list? For example, a database table named Schools and an Excel file with a huge list of school names.
If an entry in the Excel file is not found in the database, it should be added to the database table.
I'm not quite sure whether I can use the Database Lookup step, since it does not tell me when a lookup fails. The Insert/Update step doesn't seem to be a solution either, because it requires some ID value, but no ID is present in the list of schools in the Excel file.
Based on the information you provided, a simple join followed by a table insert step will do the task. You can use the Merge Rows step to compare the two data streams (Excel and database). The Merge Rows step uses the key fields to compare the two streams and adds a flag field that marks each row as new, identical, changed, or deleted. In your case you want to insert all the rows that are marked as new, using a table insert step.
Please check the below links for more reference.
Merge rows, Synchronize after merge
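Just to illustrate the comparison that the Merge Rows step performs, the same classification can be sketched in plain SQL; the table and column names below are placeholders for the Excel stream and the database table.

    -- Illustration only: how a keyed comparison flags rows, much like Merge Rows does.
    -- school_list stands in for the Excel stream, schools for the database table.
    SELECT
        COALESCE(src.school_name, dst.school_name) AS school_name,
        CASE
            WHEN dst.school_name IS NULL THEN 'new'      -- only in the Excel list
            WHEN src.school_name IS NULL THEN 'deleted'  -- only in the database
            ELSE 'identical'                             -- key found in both streams
        END AS flagfield
    FROM school_list AS src
    FULL OUTER JOIN schools AS dst
        ON dst.school_name = src.school_name;

A 'changed' flag would additionally compare the non-key columns; for this question only the rows flagged as new matter.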
This is what worked for me:
Excel file input -->
Select Values (to delete unnecessary fields) -->
Database Lookup (this will create a new field, and will set it to null if no match is found) -->
Filter Rows (keep the rows where the lookup output is null) -->
Table Output (insert the filtered records)
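For reference, the set logic behind that flow can be written as a single SQL statement; the table and column names here are made up for the example.

    -- Hypothetical names: school_list holds the rows read from the Excel file,
    -- schools is the existing database table. Insert only the names not already present.
    INSERT INTO schools (school_name)
    SELECT src.school_name
    FROM school_list AS src
    WHERE NOT EXISTS (
        SELECT 1
        FROM schools AS dst
        WHERE dst.school_name = src.school_name
    );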

Transfer data model tables from SQL to CSV

I have lots of SQL tables. The tables are "dependent", i.e. foreign key constraints are defined between them.
I need to transfer the tables from SQL to CSV. What is the correct way to do that?
Define the tables exactly as they are defined in SQL? (What should I do with the foreign keys?)
Generate other tables by joining the existing ones on their foreign keys, in order to hide the foreign key dependencies?
Maybe there are other options? What are the pros and cons?
Thanks,
Note: This is needed for another application that runs some analytics on the data.
I would suggest creating a view in SQL that contains all the information from all the tables you will need in your CSV later.
The view already implements the dependencies (the links between rows from different tables) and joins everything together into one table.
It would be much easier than your second proposal of creating new tables, because the view does all the work for you.
I guess you will need your dependencies, so you should not ignore them.
Here is a quick example of how this works:
Let's say you have two tables: the first one is named persons and the second one is cars. In the persons table you have three columns: ID, Name, Age. In the cars table you have ID, Car. To see which person has which car, you just check which ID from the first table has which value for Car in the second one.
If you link them together in a view, the result is one single table with the columns ID, Person, Age, Car.
Later you can simply export the view to CSV.
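A minimal sketch of such a view, using the table and column names from the example above (adjust the join to your actual keys):

    -- persons(ID, Name, Age) and cars(ID, Car), where cars.ID refers to the person.
    CREATE VIEW person_cars AS
    SELECT p.ID,
           p.Name AS Person,
           p.Age,
           c.Car
    FROM persons AS p
    LEFT JOIN cars AS c
        ON c.ID = p.ID;

    -- The view can then be queried or exported like any ordinary table:
    -- SELECT * FROM person_cars;

The LEFT JOIN keeps persons that have no car; use an inner join if you only want matched rows.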
Maybe I can help you better if you describe your needs in a bit more detail:
What kind of data is in your tables, and how are they linked (what are the primary/foreign keys)?

Alter table or select/copy to new table with new columns

I have a huge BQ table with a complex schema (lots of repeated and record fields). Is there a way for me to add more columns to this table and/or create a select that would copy the entire table into a new one with the addition of one (or more) columns? It appears as if copying a table requires flattening of repeated columns (not good). I need an exact copy of the original table with some new columns.
I found a way to Update Table Schema but it looks rather limited as I can only seem to add nullable or repeated columns. I can't add record columns or remove anything.
If I were to modify my import JSON data (and schema) I could import anything. But my import data is huge and conveniently already in a denormalized gzipped JSON so changing that seems like a huge effort.
I think you can add fields of type RECORD.
Nullable and repeated refer to a field's mode, not its type. So you can add a NULLABLE record or a REPEATED record, but you cannot add a REQUIRED record.
https://cloud.google.com/bigquery/docs/reference/v2/tables#resource
You are correct that you cannot delete anything.
If you want to use a query to copy the table, but don't want nested and repeated fields to be flattened, you can set the flattenResults parameter to false to preserve the structure of your output schema.
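As a rough sketch (assuming the legacy SQL dialect this answer refers to, with made-up dataset and table names), the copy would be an ordinary query job whose destination is the new table and whose flattenResults option is set to false; allowLargeResults typically has to be enabled as well for large destination tables:

    -- Run as a query job with a destination table, flattenResults = false and
    -- allowLargeResults = true, so nested and repeated fields keep their structure.
    -- Dataset and table names are examples only.
    SELECT *
    FROM [mydataset.source_table]

Any additional columns would then be added afterwards through the schema update mentioned in the question.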

SqlDataAdapter: inserting a single row across multiple tables

I am interested in using the SqlDataAdapter with a DataTable and the associated Insert/Update/Delete command operations that I can attach to the adapter object. My question is this: does each row in the DataTable necessarily need to correspond to one physical table? What I would like to do is allow a single row to represent columns that span multiple tables, and then craft each of the insert/update commands to handle their operations across these tables. That would mean that what I assign to the command might actually be a more complex SQL statement, perhaps wrapped in BEGIN/END, so that I can insert into the first "anchor" table and then use that primary key as the foreign key value in the subsequent table.
So far all the examples I see have each DataTable representing a single physical table. I realize that I could perhaps use a DataSet, but then how would I attach a command to each DataTable within the set? Furthermore, how would I then relate the rows of the parent table to the rows of the child table?
Has anyone tried this?
You could create a view with an INSTEAD OF INSERT trigger. Within the trigger you can split the columns as you like and do multiple inserts into different tables.
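A minimal T-SQL sketch of that idea, using made-up parent/child tables: the view exposes columns from both tables, and the trigger writes the parent row first and reuses its new key for the child row.

    -- Hypothetical parent/child tables (names are examples only).
    CREATE TABLE dbo.Customer (
        CustomerId INT IDENTITY(1,1) PRIMARY KEY,
        Name       NVARCHAR(100) NOT NULL
    );

    CREATE TABLE dbo.CustomerAddress (
        AddressId  INT IDENTITY(1,1) PRIMARY KEY,
        CustomerId INT NOT NULL REFERENCES dbo.Customer (CustomerId),
        City       NVARCHAR(100) NOT NULL
    );
    GO

    CREATE VIEW dbo.CustomerWithAddress AS
    SELECT c.CustomerId, c.Name, a.City
    FROM dbo.Customer AS c
    JOIN dbo.CustomerAddress AS a ON a.CustomerId = c.CustomerId;
    GO

    -- INSTEAD OF INSERT: split each incoming row across the two base tables.
    CREATE TRIGGER dbo.CustomerWithAddress_Insert
    ON dbo.CustomerWithAddress
    INSTEAD OF INSERT
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Simple single-row version; a multi-row insert would need a set-based
        -- approach instead of scalar variables.
        DECLARE @Name NVARCHAR(100), @City NVARCHAR(100), @CustomerId INT;
        SELECT @Name = Name, @City = City FROM inserted;

        INSERT INTO dbo.Customer (Name) VALUES (@Name);
        SET @CustomerId = SCOPE_IDENTITY();

        INSERT INTO dbo.CustomerAddress (CustomerId, City)
        VALUES (@CustomerId, @City);
    END;
    GO

With this in place the adapter's InsertCommand is just INSERT INTO dbo.CustomerWithAddress (Name, City) VALUES (@Name, @City), and the trigger takes care of distributing the row across the tables.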

If exists update else insert records in a SQL Server 2008 table

I have a staging table and want to insert its data into a main table. While inserting the data from staging into the main table, I want to check whether each record exists: if it does, update it; otherwise insert it as a new record. The issue is that neither the staging table nor the main table has any key column on which I can compare values.
Is it possible to do this without key columns, i.e. without a primary key on either table? If yes, please suggest how.
Thanks in advance.
If there is no unique key, or no set of data within a row that defines uniqueness, then no.
That set of data can be a combination of the values in several columns, a sum of parts that provides uniqueness; however, without seeing your data, you will have to make that decision yourself.
You write the WHERE clause to include all the fields that make your record unique (i.e. the fields that decide whether the record is new or should be updated).
Take a look at this article (http://blogs.msdn.com/b/miah/archive/2008/02/17/sql-if-exists-update-else-insert.aspx) for hints on how to construct it.
If you are using SQL Server 2008 R2, you could also use the MERGE statement; I haven't tried it on tables without keys, so I don't know whether it would work for you.
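For illustration, a MERGE that uses a combination of columns as the matching condition might look like this; the table and column names are placeholders, and it assumes the chosen combination really is unique in the source (MERGE fails if a target row matches more than one source row).

    -- Hypothetical staging and main tables with no declared keys. Col1 and Col2
    -- together are treated as the identifying fields; Col3 is the updatable data.
    MERGE dbo.MainTable AS target
    USING dbo.StagingTable AS source
        ON  target.Col1 = source.Col1
        AND target.Col2 = source.Col2
    WHEN MATCHED THEN
        UPDATE SET target.Col3 = source.Col3
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Col1, Col2, Col3)
        VALUES (source.Col1, source.Col2, source.Col3);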