Issue creating a dimension table with a last update time column in Pentaho Data Integration - pentaho

I am creating a dimension table with a last updated time column (from the Get System Info step) in Pentaho Data Integration (PDI). It works fine, except that it inserts new rows even when there are no changes in a row; the reason is that the lookup is also performed on the last updated time field, which it should not be. But when I remove this field from the key fields in the Dimension Lookup/Update step's attributes, it works as expected, except that the last updated time field is left empty. Thanks in advance for any solution/suggestion.

I expect you are talking about SCD type 2 (slowly changing dimension of type 2) here, and that you want to store the date when a row is inserted into the SCD table.
Instead of obtaining the date from the Get System Info step, you can use the 'Date of last insert (without stream field as source)' type of dimension update in the Fields tab of the Dimension Lookup / Update step, which stores a datetime automatically in the defined table column.
Additional hint: if you need to store the maximum value of some date from a source system table that is relevant for loading new / changed data, store its maximum right after the Dimension Lookup / Update step into a separate table, and use it as a filter when loading updated data at the beginning of the ETL transformation.

I think it is better to use the components below:
Step 1: Using a Table input step, get the max value from the target system and pass the value to the next step.
Step 2: Take one more Table input step, write a source query, and bind the previous value in the WHERE clause (as a ? parameter).
Step 3: Then perform the usual operations at the target level.
I hope the above steps are clear.
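As a sketch of those two queries (table and column names here are placeholders, not from the original post), the first Table input fetches the high-water mark and the second binds it to the ? parameter:

```sql
-- Step 1: Table input on the target DB — fetch the current high-water mark.
SELECT MAX(last_updated) AS max_last_updated
FROM target_table;

-- Step 2: Table input on the source DB. Enable "Insert data from step" and
-- point it at Step 1, so its result binds to the ? below.
SELECT *
FROM source_table
WHERE last_updated > ?;
```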

Related

Add new column to existing table Pentaho

I have a table input and I need to add a calculation to it, i.e. add a new column. I have tried:
doing the calculation and then feeding back. Obviously, that appended the new data to the old data.
doing the calculation and then feeding back, but truncating the table first. As the process got stuck at some point, I assume what happened is that I was truncating the table while the data was still being extracted from it.
using a stream lookup and then feeding back. Of course, it also appended the data on top of the existing data.
using a stream lookup where I pull the data from the table input, do the calculation, and at the same time pull the data from the same table and do a lookup based on the unique combination of date and id, then use the 'Update' step.
As it has been running for a while now, I am positive this is not the right option either, but I have exhausted my options.
It seems that you need to update the table your data came from with this new field. Use the Update step with fields A and B as keys.
Actually, once you connect the hop, the result of the first step is automatically carried forward to the next step. So let's say you have a Table input step and then you add a Calculator step where you create the 3rd column. After writing the logic, right-click the Calculator step and click Preview; you will get the result with all 3 columns.
I'd say your issue is not ONLY in the Pentaho implementation; there are some things you can do before reaching data staging in Pentaho.
'Workin Hard' is correct when he says you shouldn't use the same table: leave the input untouched and just upload / insert the new values into a new table. It doesn't have to be a new table EVERY TIME, but instead of truncating the original, you truncate the staging table (the output table).
How many 'new columns' will you need? Will every iteration of this run create a new column in the output? Or will you always have a 'C' column which is always A+B or some other calculation? I'm sorry, but this isn't clear. If it is the latter, you don't need Pentaho for the transformation: updating the 'C' column with a calculation over A and B can be done directly in most relational DBMSs with a simple UPDATE statement. Yes, it can be done in Pentaho, but you're adding a lot of overhead and processing time.
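For the case where C is always a fixed calculation over A and B, the DBMS-side approach mentioned above could be as simple as the following (table and column names are placeholders, and a numeric sum is assumed):

```sql
-- Populate the derived column in place; no PDI round-trip needed.
UPDATE your_table
SET C = A + B;
```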

Create a historical sequence in talend big data jobs

I have a requirement to create a sequence in Talend.
Basically, records are coming from a source file.
For each source row I want to create a unique number.
Here is where it gets complicated:
when a new file comes the next day, Talend should pick up the last generated number and increment it by 1.
For example:
today the last generated sequence number is 100.
Tomorrow Talend should continue the sequence from 100, i.e. 101, 102, 103, 104, ...
This means Talend should keep a history of the previously generated last sequence number.
Thanks
So, in such a case you have to persist this last sequence value somewhere, in the target database (if any) or in a dedicated file.
If the records are stored in a database, you can also get the max value from the corresponding field using the appropriate Select.
When you have the desired value, you need to store it in a global variable, then reuse this variable to initialize the sequence with something like:
Numeric.sequence("yourSequence", (Integer)globalMap.get("yourGlobal"), 1)
Hope this helps.
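If the last generated value lives in a database table, the "appropriate Select" mentioned above could look like this (table and column names are placeholders; COALESCE covers the very first run, when the table is still empty):

```sql
-- Returns 0 on the first run, otherwise the last persisted sequence value.
SELECT COALESCE(MAX(seq_number), 0) AS last_seq
FROM sequence_store;
```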

Kettle Pentaho backup transformation by latest data

I need to synchronize some data from one database to another using a Kettle/Spoon transformation. The logic is: I need to select the latest date of the data that already exists in the destination DB, then select from the source DB everything from that last date onward. What transformation elements do I need to do this?
Thank you.
There can be many solutions:
If you have timestamp columns in both the source and destination tables, then you can take two Table input steps. In the first one, just select the max last updated timestamp, then use it in the next Table input as a filter on the source data.
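The two Table input steps could be sketched like this (all names are placeholders; the second step binds the first step's result via "Insert data from step"):

```sql
-- First Table input (destination DB): the latest timestamp already loaded.
SELECT MAX(updated_at) AS last_loaded
FROM destination_table;

-- Second Table input (source DB), with "Insert data from step" pointing at
-- the first one, so its value binds to the ? placeholder:
SELECT *
FROM source_table
WHERE updated_at > ?;
```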
If you just want the new data to be updated in the destination table and you don't care much about timestamps, I would suggest you use the Insert / Update step for output. It brings all the data into the stream and, if it finds a match, it won't insert anything. If it doesn't find a match, it inserts the new row. If it finds any modifications to the existing row in the destination table, it updates it accordingly.

how to remove duplicate row from output table using Pentaho DI?

I am creating a transformation that takes input from a CSV file and outputs to a table. It runs correctly, but the problem is that if I run the transformation more than once, the output table contains the duplicate rows again and again.
Now I want to remove all duplicate rows from the output table.
And if I run the transformation repeatedly, it should not affect the output table unless there are new rows.
How can I solve this?
Two solutions come to my mind:
Use the Insert / Update step instead of the Table output step to store data in the output table. It will try to find a row in the output table that matches the incoming stream row according to the key fields (all fields / columns, in your case) you define. It works like this:
If the row can't be found, it inserts the row. If it can be found and the fields to update are the same, nothing is done. If they are not all the same, the row in the table is updated.
Use following parameters:
The keys to look up the values: tableField1 = streamField1; tableField2 = streamField2; tableField3 = streamField3; and so on..
Update fields: tableField1, streamField1, N; tableField2, streamField2, N; tableField3, streamField3, N; and so on..
If you have already stored duplicate values in the output table, you can remove the duplicates using this approach:
Use an Execute SQL step where you define SQL that removes the duplicate entries and keeps only unique rows. You can take inspiration from this question to create such SQL: How can I remove duplicate rows?
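One common shape for such a cleanup statement, assuming SQL Server syntax and placeholder table/column names (other databases need a dialect-appropriate variant), keeps one row per key and deletes the rest:

```sql
-- Keep one row per (id_col, date_col) pair and delete the duplicates.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY id_col, date_col
               ORDER BY id_col
           ) AS rn
    FROM output_table
)
DELETE FROM ranked
WHERE rn > 1;
```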
Another way is to use the Merge rows (diff) step, followed by a Synchronize after merge step.
As long as the number of rows in your CSV that are different from your target table is below 20 - 25% of the total, this is usually the most performance-friendly option.
Merge rows (diff) takes two input streams that must be sorted on the key fields (with a compatible collation), and generates the union of the two inputs with each row marked as "new", "changed", "deleted", or "identical". This means you'll have to put Sort rows steps on the CSV input, and possibly on the input from the target table if you can't use an ORDER BY clause. Mark the CSV input as the "Compare" row origin and the target table as the "Reference".
The Synchronize after merge step then applies the changes marked in the rows to the target table. Note that Synchronize after merge is the only step in PDI (I believe) that requires input be entered in the Advanced tab. There you set the flag field and the values that identify the row operation. After applying the changes the target table will contain the exact same data as the input CSV.
Note also that you can use a Switch/Case or Filter Rows step to do things like remove deletes or updates if you want. I often flow off the "identical" rows and write the rest to a text file so I can examine only the changes.
I looked for visual answers, but the answers were all text, so I'm adding this visual answer for any Kettle newbie like me.
Case
user-updateslog.csv (has duplicate values) ---> users_table, storing only the latest user detail.
Solution
Step 1: Connect the CSV input to an Insert/Update step as in the transformation below.
Step 2: In Insert/Update, add a condition to compare keys to find the candidate row, and set the fields to update to "Y".

Derived date calculation

I am currently entering data into a SQL Server database using SSIS. The plan is for it to run each week, but the day it happens may differ depending on when the data is pushed through.
I use SSIS to grab data from an Excel worksheet and enter each row into the database (about 150 rows per week). The only common denominator between all the rows is the date. I want to add a date to each of the rows on the day it gets pushed through. Because the push date may differ, I can't use the current date; I want to use a week on from the previous date entered for those rows.
But because there are about 150 rows, I don't know how to achieve this. It would be nice if I could set this up in SQL Server so that every time a new set of rows is entered, it adds 7 days to the date of the previous set. But I would also be happy to do this in SSIS.
Does anyone have any clue how to achieve this? Alternatively, I don't mind doing this in C# either.
Here's one way to do what you want:
Create a column for tracking the data entry date in your target table.
Add an Execute SQL Task before the Data Flow Task. This task will retrieve the latest data entry date + 7 days. The query should be something like:
select dateadd(day,7,max(trackdate)) from targettable
Assign the SQL result to a package variable.
Add a Derived Column Transformation between your Source and Destination components in the Data Flow Task. Create a dummy column to hold the tracking date and assign the variable to it.
When you map the Excel columns to the table in the Data Flow Task, map the dummy column created earlier to the tracking date column. Now when you write the data to the DB, your tracking column will have the desired date.
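One edge case worth handling in the Execute SQL Task above: on the very first load the target table is empty, so max(trackdate) is NULL and the dateadd result is NULL too. A hedged variation (same placeholder names as the query above, T-SQL assumed) that falls back to the current date:

```sql
-- Fall back to today's date when the target table is still empty.
select coalesce(dateadd(day, 7, max(trackdate)), getdate())
from targettable;
```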
Derived Column Transformation