I have a table result_slalom that is populated by Pentaho ETL jobs.
When the ETL runs for the first time, it creates version 1.
Now, if the data changes after new calculations, it becomes version 2.
I need to make changes only in calculation version 2, and no more than two versions should exist in the table result_slalom (version 1 and version 2).
So the logic is: check whether data already exists in the table.
o When data exists and the existing version is 1, set the version of the new data to 2
--> Insert a new dataset
o When data exists and the existing version is 2, set the version of the new data to 2
--> Update the existing dataset
o When no data exists, set the version to 1
--> Insert a new dataset
How do I make my Pentaho formula for this logic?
Currently it is:
if([VersionInDB]=1;[Calculationversion];[VersionInDB]+1)
The Dimension lookup/update step does exactly that.
In addition, it maintains validity dates: when version 2 is created, version 1 receives an end date of now and version 2 receives a start date of now. That makes it easy to retrieve historical information with a WHERE date BETWEEN start_date AND end_date. Plus, you have a single button that writes the CREATE/ALTER TABLE and CREATE INDEX statements for you.
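For example, a point-in-time lookup then becomes a simple range query; a sketch, with the validity columns named whatever you configure in the step:
-- Retrieve the version of each row that was valid on a given date
SELECT *
FROM result_slalom
WHERE DATE '2021-06-15' BETWEEN start_date AND end_date;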
Another neat solution is to put a trigger on the tables.
Forget about reinventing the wheel in that direction. Although I usually enjoy reinventing the wheel, this is one case where redeveloping the logic will lead you into an endless series of tests and bugs.
We have a script that should run daily at 12 AM on a GCP Cloud Function with Cloud Scheduler, sending data to a table in BigQuery.
Unfortunately, the cron job used to send the data every minute at 12 AM, which means the file was uploaded 60 times instead of only once.
The cron timer was * * 3 * * * instead of 00 3 * * *.
How can we fix the table?
Note that the transferred data has since been deleted from the source. So far we have relied on selecting the unique values, but the table is getting too large.
Any help would be much appreciated.
I have two options for you, plus a comment on how to avoid this in future. I recommend reading and comparing both options before proceeding.
Option One
If this is a one-off fix, I recommend you simply
navigate to the table (your_dataset.your_table) in the UI
click 'snapshot' and create a snapshot in case you make a mistake in the next part
run SELECT DISTINCT * FROM your_dataset.your_table in the UI
click 'save results' and select 'bigquery table' then save as a new table (e.g. your_dataset.your_table_deduplicated)
navigate back to the old table and click the 'delete' button, then authorise the deletion
navigate to the new table and click the 'copy' button, then save it in the location the old table was in before (i.e. call the copy your_dataset.your_table)
delete your_dataset.your_table_deduplicated
This procedure will replace the current table with one that has the same schema but without duplicated records. You should check that it looks as you expect before you discard your snapshot.
Option Two
A quicker approach, if you're comfortable with it, would be using the Data Manipulation Language (DML).
There is a DELETE statement, but you'd have to construct an appropriate WHERE clause to only delete the duplicate rows.
There is a simpler approach, which is equivalent to option one and just requires you to run this query:
CREATE OR REPLACE TABLE your_dataset.your_table AS
SELECT DISTINCT * FROM your_dataset.your_table
Again, you may wish to take a snapshot before running this.
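If you prefer to do this step in SQL as well, BigQuery can create the snapshot with a DDL statement, for example (the snapshot name is a placeholder):
-- Keep a snapshot of the table before rewriting it
CREATE SNAPSHOT TABLE your_dataset.your_table_backup
CLONE your_dataset.your_table;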
The Future
If you have a cloud function that sends data to BigQuery on a schedule, then best practice would be for this function to be idempotent (i.e. it doesn't matter how many times you run it: if the input is the same, the output is the same).
A typical pattern would be to add a stage to your function to pre-filter the new records.
Depending on your requirements, this stage could
prepare the new records you want to insert, which should have some unique, immutable ID field
SELECT some_unique_id FROM your_dataset.your_table -> old_record_ids
filter the new records, e.g. in python new_records = [record for record in prepared_records if record["id"] not in old_record_ids]
upload only the records that don't exist yet
This will prevent the sort of issues you have encountered here.
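Alternatively, if you stage the prepared records in a temporary BigQuery table first, the same filtering can be done server-side with a MERGE. A minimal sketch, assuming a staging table your_dataset.your_table_staging with the same schema as the target and a unique column some_unique_id:
-- Insert only rows whose some_unique_id is not already in the target;
-- assumes the staging table has exactly the same columns as the target.
MERGE your_dataset.your_table AS t
USING your_dataset.your_table_staging AS s
ON t.some_unique_id = s.some_unique_id
WHEN NOT MATCHED THEN
  INSERT ROW;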
I'm starting to use Pentaho PDI version 9.3 to experiment with Type 2 SCDs. But when I run the same transformation twice with the same data (no change in the data), a new version of each row gets inserted every time, even though the row data has not changed. This is my setup:
Overall view:
Dimension Lookup/Update Setup - Keys
Dimension Lookup/Update Setup - Fields
Expected outcome
No matter how many times I run this, if the values of exercise and short_name have not changed, no new rows should be added.
Actual outcome
A new version of each and every record is created each time I run the transformation, even when the exercise and short_name fields have not changed.
I am responsible for creating period-over-period and trend reporting for our team's Issue Management department. What I need to do is copy the table Issues at month end into a new table IssuesHist and add a column with the current date, for example 1/31/21. Then at the next month end I need to take another copy of the Issues table, append it to the existing IssuesHist table, and again populate the added column with the current date, for example 2/28/21.
I need to do this to be able to run comparative analysis on a period-over-period basis. The goal is to be able to identify any activity (opening new issues, closing old ones, reopening issues, etc.) that occurred over the period.
Example tables below:
Issues Table with the current data from our front-end tool
I need to copy the above into the new IssuesHist and add a date column like so
Then at the following month end I need to do the same thing. For example, if the Issues table looked like this (changes highlighted in red):
I would need to append that to the bottom of the existing IssuesHist table with the new date, so that I could run queries comparing the periods to identify any changes.
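For illustration, something like the sketch below is what I have in mind for the monthly step, assuming SQL Server; the column names are just placeholders:
-- Append this month's snapshot of Issues to the history table,
-- stamping every row with the snapshot date.
INSERT INTO dbo.IssuesHist (IssueID, IssueStatus, OpenedDate, ClosedDate, SnapshotDate)
SELECT IssueID, IssueStatus, OpenedDate, ClosedDate, CAST(GETDATE() AS date)
FROM dbo.Issues;

-- Example period-over-period comparison: issues whose status changed
-- between two snapshots.
SELECT cur.IssueID, prev.IssueStatus AS PrevStatus, cur.IssueStatus AS CurStatus
FROM dbo.IssuesHist AS cur
JOIN dbo.IssuesHist AS prev
  ON prev.IssueID = cur.IssueID
WHERE cur.SnapshotDate = '2021-02-28'
  AND prev.SnapshotDate = '2021-01-31'
  AND cur.IssueStatus <> prev.IssueStatus;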
My research suggests that a temporal table may be the best solution here, but I am unable to alter our existing database's tables to include system versioning.
Please let me know what solution would work best, and whether you have any SQL statement tips.
Thank you!
I have an interesting task: creating a Kettle transformation to load a table that is a pure Type 6 dimension. This is driving me crazy.
Assume the table structure below:
|CustomerId|Name|Value|startdate|enddate|
|1|A|value1|01-01-2001|31-12-2199|
|2|B|value2|01-01-2001|31-12-2199|
Then comes my input file
Name,Value,startdate
A,value4,01-01-2010
C,value3,01-01-2010
After the kettle transformation the data must look like
|CustomerId|Name|Value|startdate|enddate|
|1|A|value1|01-01-2001|31-12-2009|
|1|A|value4|01-01-2010|31-12-2199|
|2|B|value2|01-01-2001|31-12-2199|
|3|C|value3|01-01-2010|31-12-2199|
Check the existing data and determine whether each incoming record is an insert or an update.
Generate surrogate keys only for the insert records and perform the inserts.
For the update records, retain the existing surrogate key, insert the incoming data as a new record with an open end date (a very high value), and close the previous corresponding record with an end date of the new record's start date minus 1.
Can someone please suggest the best way of doing this? I can see only Type 1 and Type 2 supported by the Dimension lookup/update step.
I did this using a mixed approach of ETLT.
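For illustration only (this is not the exact implementation), the database-side part of such an approach might look roughly like the sketch below, assuming the incoming file has been staged into stg_customer, the dimension table is dim_customer, and the staged rows have already been filtered to genuinely new or changed records; date arithmetic may need adjusting for your database:
-- 1. Close the open record for customers that arrive with a newer version
--    (assumes at most one incoming row per customer Name).
UPDATE dim_customer
SET enddate = (SELECT s.startdate - INTERVAL '1' DAY
               FROM stg_customer s
               WHERE s.Name = dim_customer.Name)
WHERE enddate = DATE '2199-12-31'
  AND Name IN (SELECT Name FROM stg_customer);

-- 2. Insert the new version for existing customers, reusing their surrogate key.
INSERT INTO dim_customer (CustomerId, Name, Value, startdate, enddate)
SELECT d.CustomerId, s.Name, s.Value, s.startdate, DATE '2199-12-31'
FROM stg_customer s
JOIN (SELECT Name, MIN(CustomerId) AS CustomerId
      FROM dim_customer
      GROUP BY Name) d
  ON d.Name = s.Name;

-- 3. Insert brand-new customers with a freshly generated surrogate key.
INSERT INTO dim_customer (CustomerId, Name, Value, startdate, enddate)
SELECT (SELECT MAX(CustomerId) FROM dim_customer)
         + ROW_NUMBER() OVER (ORDER BY s.Name),
       s.Name, s.Value, s.startdate, DATE '2199-12-31'
FROM stg_customer s
WHERE NOT EXISTS (SELECT 1 FROM dim_customer d WHERE d.Name = s.Name);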
Tables with transaction data are generated daily, with the date in the name e.g.
data_01_12_2014.
It is clear why this method is undesirable, but presumably the reason for it is that the daily tables are enormous and this is a space-management mechanism. Whatever the reason, my task is to grab data from these tables, do some transformations, and drop the results into a result table.
My problem is that I want to automate the process, and do not want to manually register the new daily table each day. Is there a way to automate this process in SAS/SAS DI?
Much gratitude.
What I do is create a macro variable and give it the value "01_12_2014". You can then register the table in DI Studio with the physical name "libref.Data_&datevar."; the logical name can be anything.
Now the same job will work on the new names, just by changing the value of the datevar macro variable.
In the autoexec, a program can be written that sets the macro variable dynamically. For example, this sets the value to today's date:
data _null_;
  /* e.g. 01_12_2014: today's date with '-' replaced by '_' */
  call symputx("datevar", translate(put(today(), DDMMYYD10.), "_", "-"));
run;
%put &datevar;
Hope this helps!
I hope I'm not too late in answering the question; I only saw it today.
Anyhow, the most important thing to remember is that the registered tables showing up in the metadata folder/inventory are just shortcuts to the physical files. Let's say your DI Studio job takes input from such a table (registered on the metadata server as, say, MYDATA, pointing to the physical file data_2015_10_30 on 30 October).
On 31 October I can run the code below to update the shortcut to point to the 31st dataset, i.e. data_2015_10_31. The tableID macro value is the metadata ID of the table, which is shown in the Basic Properties panel (if it isn't showing, check View -> Basic Properties; it should then appear at the bottom left of the screen). Also, I'm hard-coding 2015_10_31 here, but you can use a macro to pick up today's date instead of hard-coding it; I'll leave that to you.
/* Metadata ID of the registered table, as shown in the Basic Properties panel */
%let tableID=A5LZW6LX.BD000006;

data _null_;
  /* Repoint the registered table at the new physical dataset */
  rc=metadata_setattr("omsobj:PhysicalTable?#Id ='&tableID'",
                      "SASTableName",
                      "DATA_2015_10_31");
  rc=metadata_setattr("omsobj:PhysicalTable?#Id ='&tableID'",
                      "TableName",
                      "DATA_2015_10_31");
run;
Please note that the DI Studio job can be open or closed while you make the changes or run the above code, but if it is open you have to close and reopen it; if it was closed, simply opening it is enough. If you do not reopen the job, the transformations in the job that interact with the dataset MYDATA will still pick up the old table name, not the updated one. Also, the above code cannot be added as precode, because it is the act of opening the job that updates all of the dataset's linkages to the new physical table (31 October here) in the transformations of the DI job. You can create a new job with the above code and add it to the job flow to run before your main job. If you would like to add it as precode, the update code becomes complicated and lengthy, which I would avoid.
Good Reference Link : http://support.sas.com/resources/papers/proceedings09/097-2009.pdf