Increment - Pentaho Data Integration (PDI)

I'm starting to use Pentaho Data Integration, and I intend to use it to update a data lake with data from a server. However, I only need to add data that does not yet exist in the data lake (an incremental load).
Example SQL:
SELECT COLUM1, COLUM2, COLUM3, COLUM4 FROM TABLEX
I don't know whether I can do this incremental load via SQL, a filter, or some other way.

Let's keep it simple: use a Stream Lookup step and a Filter step.
First step: from the source, look up the target table in the lake by some keys (business key, etc.) and bring back a new field named checker (initialize checker to 1 in the SELECT clause of the lookup query).
Second step: if checker = 1 the record already exists in the target, so do nothing; otherwise insert the new record into the target.
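If both tables are reachable from a single connection, the same logic can also be written as plain SQL. A minimal sketch, assuming the lake table is named LAKE_TABLEX and COLUM1 is the business key (only TABLEX and the column names come from the question):

-- Insert only the rows that do not yet exist in the data lake table.
INSERT INTO LAKE_TABLEX (COLUM1, COLUM2, COLUM3, COLUM4)
SELECT s.COLUM1, s.COLUM2, s.COLUM3, s.COLUM4
FROM TABLEX s
WHERE NOT EXISTS (
    SELECT 1
    FROM LAKE_TABLEX t
    WHERE t.COLUM1 = s.COLUM1   -- business key comparison
);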

There are multiple ways to achieve this.
Example:
Take two Table Input steps (source and target) and two Add a checksum steps, then compare the checksum from the source with the checksum from the target; if they do not match, insert the row into the target.
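The same comparison can be sketched in SQL with a hash over the non-key columns. Everything below is an assumption: the table and column names, and the MD5/CONCAT_WS functions (use whatever hash and concatenation functions your database provides):

-- Rows that are missing from the target or whose content differs from the target.
SELECT s.*
FROM source_table s
LEFT JOIN target_table t ON t.business_key = s.business_key
WHERE t.business_key IS NULL
   OR MD5(CONCAT_WS('|', s.col1, s.col2, s.col3))
   <> MD5(CONCAT_WS('|', t.col1, t.col2, t.col3));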

Related

Unable to implement SCD2

I was working on SCD Type 2 and was unable to implement it fully; some scenarios were not being handled. I did it in IICS and am finding it very difficult to cover all possible scenarios. This is the flow:
Src
---> Lkp (on src.id = tgt.id)
---> Expression (flag = iif(isnull(tgt.surrogatekey), 'Insert', iif(isnotnull(tgt.surrogatekey) and md5(other_non_key_cols) <> tgt.md5, 'Update')))
---> on flag = 'Insert', insert the record (works fine)
On flag = 'Update' I pass the rows to two target instances of the same target table: in one I insert the new version of the record, and in the other I update tgt_end_date = lkp_start_date for the previously stored ids and set active_ind to 'N'.
This works, but not when I receive new updates containing the same records again (duplicates), or when I simply rerun the mapping: duplicates are inserted into the target table. The end_date handling also becomes unstable; when I insert multiple changes of the same record, all active flags are set to 'Y', whereas all of them should be 'N' except the latest one in every run. Could anyone please help with this, even in SQL if you can express it that way?
If I've understood your question correctly, in each run you have multiple records in your source that match a single record in your target.
If this is the case, then you need to process your data so that you have a single source record per target record before you put the data through the SCD2 process.
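A hedged sketch of that pre-processing in SQL, assuming the source table is called src, id is the business key and start_date orders the changes (those names are taken from the flow above, everything else is an assumption):

-- Keep only the most recent source row per business key before the SCD2 logic runs.
SELECT *
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY s.id ORDER BY s.start_date DESC) AS rn
    FROM src s
) latest
WHERE rn = 1;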

SSIS Inserting incrementing ID with starting range into multiple tables at a time

Is there a reliable way to solve this seemingly easy task?
I've got a number of XML files which will be converted into 6 SQL tables (via SSIS).
Before the end of this process I need to add a new column (in fact, common to all tables) into each of them.
This column represents an ID with a starting value and a +1 increment step, like (350000, 1).
Yes, I know how to solve it at the SSMS/SQL stage, but I need a solution at SSIS's pre-SQL conversion level.
I'm sure there are well-known pattern solutions to deal with this.
I am going to take a stab at this. Just to be clear, I don't have a lot of information in your question to go on.
Most XML files that I have dealt with have a common element (let's call it a customer) with one-to-many attributes (these can be invoices, addresses, emails, contacts, etc.).
So your table structure will be somewhat star-shaped around the customer.
Your XML will have core customer information on a 1-to-1 basis that can be loaded into a single main table, plus array information such as invoices and addresses. Those arrays would be their own tables referencing the customer as a key.
I think you are asking how to create that key.
Load the customer data first and return the identity column to be used as a foreign key when loading the other tables.
I find it easiest to do this in a script component. I'm only going to explain how to get the key back; personally I would handle the whole process in C# (deserializing and all).
Add this to the using block:
using System.Data.OleDb;
Add this to your main or row-processing method, depending on where the script task / component sits:
string SQL = @"INSERT INTO Customer(CustName, field1, field2,...)
values(?,?,?,...); SELECT CAST(scope_identity() AS int);";
OleDbCommand cmd = new OleDbCommand();
cmd.Connection = conn; // conn: an OleDbConnection to the target database (e.g. obtained from your connection manager)
cmd.CommandType = System.Data.CommandType.Text;
cmd.CommandText = SQL;
cmd.Parameters.AddWithValue("@p1", Row.CustName); // OLE DB parameters are positional ("?"); assumes an input column named CustName
...
cmd.Connection.Open();
int CustomerKey = (int)cmd.ExecuteScalar(); // ExecuteScalar returns the first row / first column, which here is scope_identity()
cmd.Connection.Close();
Now you can use CustomerKey for all of the other tables.
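If the tables do not exist yet, the "(350000, 1)" range from the question can also be expressed as an identity seed on the main table, so the key is generated for you on insert. A minimal sketch with assumed table and column names (SQL Server syntax, matching the scope_identity() call above):

-- Main table: the key starts at 350000 and increments by 1.
CREATE TABLE Customer (
    CustomerKey int IDENTITY(350000, 1) PRIMARY KEY,
    CustName    nvarchar(200) NOT NULL
);

-- Child tables reuse that key as a foreign key.
CREATE TABLE Invoice (
    InvoiceId   int IDENTITY(1, 1) PRIMARY KEY,
    CustomerKey int NOT NULL REFERENCES Customer (CustomerKey),
    Amount      decimal(18, 2) NULL
);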

Unable to load distinct records via lookup in informatica

This is a very strange thing and I don't know why it happens. I have created a mapping that transforms the data via an Expression and loads the data into the target (file) based on a lookup on that same target.
Source table
CompanyName
Acne Lmtd
Acne Ltd
N/A
None
Abc Ltd
Abc Ltd
X
Mapping
Source
-> Exp (trim, ...)
-> Lookup (source.company_name = tgt.company_name), return port is CompId
-> Filter (ISNULL(CompId))
-> Target: CompId (via Sequence Generator), CompName
The above mapping logic inserts duplicate company names as well: the source has 2 Abc Ltd records and the same duplication is repeated in the target, and I don't know why. I have tried to debug it; the filter condition evaluates to true (company id is null) even when the record has already been inserted into the target.
I also thought it might be a lookup cache issue, so I enabled the dynamic cache as well, but got the same result. It should have worked like the SQL query
SELECT company_id
FROM lkptarget
WHERE company_name IN (SELECT company_name FROM source)
Therefore, for Abc Ltd the filter condition should have evaluated to false:
ISNULL(company_id) -> false
But it is evaluating to true. How do I get unique records via the lookup, without using distinct?
Note: the lookup used is already a dynamic lookup.
It was in fact a dynamic cache issue: NewLookupRow gets assigned a value of 0 for duplicates, so I added the condition ISNULL(COMPANYID) AND NEWLOOKUPROW = 1 to the filter, and that finally did work.
The Lookup transformation has no way to know what happens in later transformations in the mapping. It can't see results in the target itself, because the Lookup cache is loaded once at the beginning of the mapping using a separate connection to the database. Even if you disable caching (which would mean one query for each Lookup input row), data is not immediately committed (and so not visible to other connections) when writing to the target.
That's the reason to use a dynamic Lookup cache, which works by adding new rows to the Lookup cache. However, in your case there is a catch: the company_id is created after the Lookup (which is the right place to do so), so it can't be added to the Lookup cache.
I think you could configure the Lookup so that:
You activate the options Dynamic Lookup Cache, Update Else Insert and Insert Else Update
You use the company_name to make comparison between source data and Lookup data
You create a fake field company_id with value 0 before the Lookup and associate it to the corresponding Lookup field
You check the checkbox Disable in comparison for the company_id field
You can then use the predefined field NewLookupRow (it appears when you check the Dynamic Lookup Cache option), which should have a value of 1 for new rows or 2 for existing rows with updates (0 for identical rows)
The Lookup should now output NewLookupRow = 1 for the first Abc Ltd and then NewLookupRow = 0 for the second. The filter just after the Lookup should have a condition like NewLookupRow = 1.
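For comparison, a hedged SQL sketch of what this configuration achieves, assuming company_id in the target is generated automatically (in the mapping it comes from the Sequence Generator); the table names follow the pseudo-query above:

-- Insert one row per company name that is not yet present in the target;
-- duplicates within the source are collapsed by the GROUP BY.
INSERT INTO lkptarget (company_name)
SELECT s.company_name
FROM source s
LEFT JOIN lkptarget t ON t.company_name = s.company_name
WHERE t.company_name IS NULL
GROUP BY s.company_name;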
For more details you can have a look at the Informatica documentation :
https://docs.informatica.com/data-integration/data-services/10-2/developer-transformation-guide/dynamic-lookup-cache.html

Implementing Pure SCD Type 6 in Pentaho

I have an interesting task: create a Kettle transformation for loading a table which is a pure Type 6 dimension. This is driving me crazy.
Assume the below table structure
|CustomerId|Name|Value|startdate|enddate|
|1|A|value1|01-01-2001|31-12-2199|
|2|B|value2|01-01-2001|31-12-2199|
Then comes my input file
Name,Value,startdate
A,value4,01-01-2010
C,value3,01-01-2010
After the Kettle transformation the data must look like:
|CustomerId|Name|Value|startdate|enddate|
|1|A|value1|01-01-2001|31-12-2009|
|1|A|value4|01-01-2010|31-12-2199|
|2|B|value2|01-01-2001|31-12-2199|
|3|C|value3|01-01-2010|31-12-2199|
Check the existing data and determine whether the incoming record is an insert or an update.
Then generate surrogate keys only for the insert records and perform the inserts.
Retain the surrogate keys for the update records, insert them as new records with an open end date (a very high value), and close the previous corresponding record with an end date of the new record's start date - 1.
Can someone please suggest the best way of doing this? I can see only Type 1 and 2 being covered by the Dimension Lookup/Update step.
I did this using a mixed approach of ETLT.
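For the "T" part of such an ETLT approach, the close-and-insert logic from the question could be sketched roughly like this in SQL (PostgreSQL-style syntax; the staging table name stg_input and the dimension name dim_customer are assumptions, the column names come from the question):

-- Step 1: close the currently open version of customers whose value has changed.
UPDATE dim_customer d
SET enddate = s.startdate - INTERVAL '1 day'
FROM stg_input s
WHERE d.Name = s.Name
  AND d.enddate = DATE '2199-12-31'
  AND d.Value <> s.Value;

-- Step 2: insert the new open version for those customers, reusing their CustomerId.
-- Brand-new customers (like 'C' in the example) still need a separate insert
-- with a freshly generated surrogate key, which is not shown here.
INSERT INTO dim_customer (CustomerId, Name, Value, startdate, enddate)
SELECT d.CustomerId, s.Name, s.Value, s.startdate, DATE '2199-12-31'
FROM stg_input s
JOIN dim_customer d
  ON d.Name = s.Name
 AND d.enddate = s.startdate - INTERVAL '1 day';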

How to look up technical key of dimension using its natural key?

According to the Wiki:
"The Dimension Lookup/Update step allows you to implement Ralph Kimball's slowly changing dimension for both types: Type I (update) and Type II (insert) ..."
"To do the lookup it uses not only the specified natural keys (with an "equals" condition) but also the specified "Stream datefield" (see below)."
"As a result of the lookup or update operation of this step type, a field is added to the stream containing the technical key of the dimension."
So if I understand that correctly, it should be possible to have the "Dimension Lookup/Update" step look up a dimension's technical/surrogate key using a natural key. In case no entry exists yet, the step could also be configured to add the requested natural key to the dimension table under a unique technical key. But for now I would like to use only the lookup functionality - no update and no insert.
Here's my setup:
This is my dimension table (SCD Type 1) named "dims":
The transformation looks as follows:
But if I run this in Preview mode I get:
What I would like to see is the values of id (1, 2, 3) next to the natural keys (a, b, c).
What am I doing wrong here?
Effectively I could achieve this using a join step, but I would like to use the advanced dimension handling functionality once I get this working.
This step expects a table with 3 more attributes:
start_date (date)
end_date (date)
version (int)
Check that the date settings in the "Dimension Lookup / Update" step match your data. Check the version field too.
Below an example:
Table:
Setting for the „Dimension Lookup / Update“ step:
Preview table (the ids that match the date are returned)
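As a hedged illustration, a dimension table with the three extra attributes the step expects might look roughly like this (only start_date, end_date and version come from the list above; the other names and types are assumptions):

CREATE TABLE dims (
    id           integer      NOT NULL,  -- technical / surrogate key returned by the step
    natural_key  varchar(32)  NOT NULL,  -- e.g. the 'a', 'b', 'c' values from the question
    version      integer      NOT NULL,  -- version counter maintained by the step
    start_date   timestamp    NOT NULL,  -- validity interval start
    end_date     timestamp    NOT NULL   -- validity interval end
);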