Pentaho PDI SCD Type 2 inserts a new row when data has not changed

I'm starting to use Pentaho PDI Version 9.3 for experimenting with Type 2 SCD. But when I run the same transformation 2 times, with the same data (no change in data), a new version of each row gets inserted every time, even though the row data has not changed. This is my setup:
Overall view:
Dimension Lookup/Update Setup - Keys
Dimension Lookup/Update Setup - Fields
Expected outcome
No matter how many times I run this, if the values of exercise and short_name have not changed, no new rows should be added.
Actual outcome
A new version of each and every record is created each time I run the transformation, even when the exercise and short_name fields have not changed.
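For reference, a quick check like the following confirms the unwanted duplicates (a minimal sketch; the dimension table name dim_exercise and its natural key column exercise_id are assumptions based on the setup above):

    -- List natural keys that received more than one version row;
    -- dim_exercise and exercise_id are assumed names for illustration.
    SELECT exercise_id, COUNT(*) AS version_count
    FROM dim_exercise
    GROUP BY exercise_id
    HAVING COUNT(*) > 1;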

Related

Unable to implement SCD2

I was working on SCD Type 2 and was unable to fully implement it; some scenarios were not being fulfilled. I had done it in IICS. I am finding it very difficult to cover all possible scenarios. So, below is the flow:
Src
---> Lkp (on src.id = tgt.id)
---> Expression: flag = IIF(ISNULL(tgt.surrogatekey), 'Insert', IIF(ISNOTNULL(tgt.surrogatekey) AND MD5(other_non_key_cols) <> tgt.md5, 'Update'))
----> Insert when flag = 'Insert' (this works fine)
On Update, however, I pass the updates to two target instances of the same target table: in one I insert the update as a new row, and in the other I set tgt_end_date = lkp_start_date for the previously stored ids and active_ind becomes 'N'.
This works, but not when I receive new updates containing the same records again (duplicates), or when I simply rerun the mapping: duplicates get inserted into the target table. The end_date handling also becomes unstable; when I insert multiple changes of the same record, it sets all active flags to 'Y', whereas all of them should be 'N' except the latest one in every run. Could anyone please help with this, even in SQL if you can express the logic that way?
If I've understood your question correctly, in each run you have multiple records in your source that match a single record in your target.
If this is the case, then you need to process your data so that you have a single source record per target record before you put the data through the SCD2 process.
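If that is your situation, a minimal sketch of the de-duplication (assuming a source table src with a business key id and a load timestamp load_ts; both names are hypothetical) could be:

    -- Keep only the most recent source row per business key before the SCD2 step;
    -- src, id and load_ts are assumed names for illustration.
    SELECT *
    FROM (
        SELECT s.*,
               ROW_NUMBER() OVER (PARTITION BY s.id ORDER BY s.load_ts DESC) AS rn
        FROM src s
    ) t
    WHERE rn = 1;

Feeding only these latest rows into the insert/update branches avoids the duplicate inserts and leaves a single row per key eligible to become the active ('Y') version.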

Keep queried SQL data in one state while updating

The use case is as follows:
I scrape data from a bigger database (read-only access) on a fixed schedule, and it takes roughly 30 minutes to 1 hour.
The resulting table will always have >20k rows; the data can be grouped by a dataset_id column which is restricted to 4 values, like an enum.
When I query all of the rows of one dataset_id, say SELECT * FROM db WHERE dataset_id = 'A', it is essential that all the records are from the same scrape (so I shouldn't have mixed data from different scrapes).
The question is how would I persist the old data until the new scrape is finished and only then switch to get the new data while deleting the old scrape?
I have thought of the following option:
have 2 tables and switch between them when a newer scrape is finished
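One way to make that switch atomic is a staging table plus a table swap; a minimal sketch, assuming MariaDB/MySQL and the hypothetical table names scrape_live and scrape_staging:

    -- Load the new scrape into scrape_staging while readers keep querying scrape_live.
    TRUNCATE TABLE scrape_staging;
    -- ... scrape job inserts its rows into scrape_staging here ...
    -- Once the scrape has finished, swap the two tables in one atomic statement.
    RENAME TABLE scrape_live    TO scrape_old,
                 scrape_staging TO scrape_live,
                 scrape_old     TO scrape_staging;

Readers always query scrape_live, so they never see a mix of two scrapes; the previous scrape simply ends up in scrape_staging, where it is overwritten by the next run.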

Import data from csv into database when not all columns are guaranteed

I am trying to build an automatic feature for a database that takes NOAA weather data and imports it into our own database tables.
Currently we have 3 steps:
1. Import the data literally into its own table to preserve the original data
2. Copy its data into a table that better represents our own data in structure
3. Then convert that table into our own data
The problem I am having stems from the data that NOAA gives us. It comes in the following format:
Station Station_Name Elevation Latitude Longitude Date MXPN Measurement_Flag Quality_Flag Source_Flag Time_Of_Observation ...
Starting with MXPN (maximum temperature for water in a pan), each observation is comprised of its own column plus the 4 columns that follow it, and that same group of 5 columns repeats for each form of weather observation. The problem, though, is that if a particular type of weather was not observed at any of the stations reported, that set of 5 columns will be completely omitted.
For example if you look at Central Florida stations, you will find no SNOW (Snowfall measured in mm). However, if you look at stations in New Jersey, you will find this column as they report snowfall. This means a 1:1 mapping of columns is not possible between different reports, and the order of columns may not be guaranteed.
Even worse, some of the weather types include wildcards in their definition, e.g. SN*#, where * is a number from 0-8 representing the type of ground and # is a number from 1-7 representing the depth at which the minimum soil temperature was taken, and we'd like to collect these together.
All of these are column headers, and my instinct is to build a small Java program to map these properly to our data set as we'd like it. However, my superior believes it may be possible to have the database do this on a mass import, but he does not know how to do it.
Is there a way to do this as a mass import, or is it best for me to just write the Java program to convert the data to our format?
Systems in use:
MariaDB for the database.
Centos7 for the operating system (if it really becomes an issue)
Java is being done with JPA and Spring Boot, with Hibernate where necessary.
You are creating a new table for each file.
I presume that the first 6 fields are always present, and that you have 0 or more occurrences of the next 5 fields. If you are using SQL Server, I would approach it as follows:
1. Query the information_schema catalog to get a count of the fields in the table. If the count is 6 then no observations are present; if 11 columns, then you have 1 observation; if 16, then you have 2 observations; and so on.
2. Now that you know the number of observations, you can write some SQL that will loop over the observations and insert them into a child table with a link back to a parent table which holds the first 6 fields.
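As a rough illustration (the question mentions MariaDB, where the same information_schema catalog is available; the schema name weather and table name noaa_import are assumptions), the column count and the implied number of observation groups can be derived like this:

    -- Count the columns of the raw import table and derive the number of
    -- 5-column observation groups; weather/noaa_import are assumed names.
    SELECT COUNT(*)           AS column_count,
           (COUNT(*) - 6) / 5 AS observation_groups
    FROM information_schema.columns
    WHERE table_schema = 'weather'
      AND table_name   = 'noaa_import';

The per-group INSERT ... SELECT statements would then have to be generated dynamically (for example in a stored procedure, or in the Java layer), because the column names differ from file to file.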
Apologies if my assumptions are way off.
-HTH

How should I control the duplication of calculation version using Pentaho?

I have a table result_slalom where data is populated via Pentaho ETL jobs.
When the ETL runs for the first time, it creates version 1.
Now, if the data is changed after new calculations, it becomes version 2.
I need to make changes only in calculation version 2, and no more than 2 versions should exist in the table result_slalom (version 1 and version 2).
So the logic is:
Check if data exists in the table:
o When data exists and the existing version is 1, then set the version of the new data = 2
--> Insert new dataset
o When data exists and the existing version is 2, then set the version of the new data = 2
--> Update existing dataset
o When no data exists, then set version = 1
--> Insert new dataset
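Expressed in SQL, purely to illustrate the intended behaviour (the staging table new_calc and the calc_key/value/version columns are assumptions; only result_slalom comes from the question):

    -- Existing version 2 rows are updated in place; names are illustrative only.
    UPDATE result_slalom r
    JOIN new_calc n ON n.calc_key = r.calc_key AND r.version = 2
    SET r.value = n.value;

    -- Otherwise insert: as version 2 when a version 1 row already exists,
    -- or as version 1 when the key is completely new.
    INSERT INTO result_slalom (calc_key, value, version)
    SELECT n.calc_key,
           n.value,
           CASE WHEN EXISTS (SELECT 1 FROM result_slalom r
                             WHERE r.calc_key = n.calc_key AND r.version = 1)
                THEN 2 ELSE 1 END
    FROM new_calc n
    WHERE NOT EXISTS (SELECT 1 FROM result_slalom r
                      WHERE r.calc_key = n.calc_key AND r.version = 2);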
How do I make my Pentaho formula for this logic?
Currently it is:
if([VersionInDB]=1;[Calculationversion];[VersionInDB]+1)
The Dimension Lookup/Update step does exactly that.
In addition, it has validity dates: at the time version 2 is created, version 1 receives an end date of now and version 2 receives a start date of now. That makes it easy to retrieve historical information with a WHERE date BETWEEN start_date AND end_date clause. Plus, you have a single button that writes the CREATE/ALTER TABLE and CREATE INDEX statements for you.
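For instance, a point-in-time lookup against a dimension maintained this way might look like this (table and column names are hypothetical):

    -- Fetch the version that was valid on a given date; names are illustrative.
    SELECT *
    FROM dim_result
    WHERE business_key = 42
      AND '2023-06-01' BETWEEN start_date AND end_date;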
The other neat solution is to put a trigger on the tables.
Forget about reinventing the wheel in that direction. Although I usually like reinventing the wheel, this is one case in which redeveloping the logic will lead you into an infinite number of tests and bugs.

Implementing Pure SCD Type 6 in Pentaho

I have an interesting task to create a Kettle transformation for loading a table which is a Pure Type 6 dimension. This is driving me crazy.
Assume the below table structure
|CustomerId|Name|Value|startdate|enddate|
|1|A|value1|01-01-2001|31-12-2199|
|2|B|value2|01-01-2001|31-12-2199|
Then comes my input file
Name,Value,startdate
A,value4,01-01-2010
C,value3,01-01-2010
After the kettle transformation the data must look like
|CustomerId|Name|Value|startdate|enddate|
|1|A|value1|01-01-2001|31-12-2009|
|1|A|value4|01-01-2010|31-12-2199|
|2|B|value2|01-01-2001|31-12-2199|
|3|C|value3|01-01-2010|31-12-2199|
Check the existing data and determine whether the incoming record is an insert or an update.
Then generate surrogate keys only for the insert records and perform the inserts.
Retain the surrogate keys for the update records, insert them as new records with an open end date (a very high value), and close the previous corresponding record by setting its end date to the new record's start date - 1.
Can someone please suggest the best way of doing this? I could see only Type 1 and Type 2 supported by the Dimension Lookup/Update option.
I did this using a mixed approach of ETLT.
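For reference, the expire-and-insert part of such an approach can be sketched in SQL (a minimal illustration following the example columns above; dim_customer and the staging table stg_customer are assumed names, and in practice CustomerId would come from a sequence or auto-increment rather than MAX()+1):

    -- 1. Close the current row for customers whose Value has changed.
    UPDATE dim_customer d
    JOIN stg_customer s ON s.Name = d.Name
    SET d.enddate = DATE_SUB(s.startdate, INTERVAL 1 DAY)
    WHERE d.enddate = '2199-12-31'
      AND d.Value <> s.Value;

    -- 2. Insert the new version for changed customers, reusing their surrogate key.
    INSERT INTO dim_customer (CustomerId, Name, Value, startdate, enddate)
    SELECT d.CustomerId, s.Name, s.Value, s.startdate, '2199-12-31'
    FROM stg_customer s
    JOIN dim_customer d
      ON d.Name = s.Name
     AND d.enddate = DATE_SUB(s.startdate, INTERVAL 1 DAY);

    -- 3. Insert brand-new customers with a fresh surrogate key
    --    (MAX()+1 only holds for a single new row per load; use a sequence in practice).
    INSERT INTO dim_customer (CustomerId, Name, Value, startdate, enddate)
    SELECT (SELECT MAX(CustomerId) FROM dim_customer) + 1,
           s.Name, s.Value, s.startdate, '2199-12-31'
    FROM stg_customer s
    WHERE NOT EXISTS (SELECT 1 FROM dim_customer d WHERE d.Name = s.Name);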