How to remove duplicate rows from an output table using Pentaho DI? - pentaho

I am creating a transformation that takes input from a CSV file and writes it to a table. It runs correctly, but if I run the transformation more than once, the output table ends up containing the same rows again and again.
I want to remove all duplicate rows from the output table.
And if I run the transformation repeatedly, it should not change the output table unless the CSV contains a new row.
How can I solve this?

Two solutions come to mind:
First option: use the Insert / Update step instead of the Table output step to store data in the output table. For each incoming row it searches the output table for a row that matches on the key fields you define (all fields / columns in your case). It works like this:
If the row can't be found, it inserts the row. If it can be found and the fields to update are the same, nothing is done. If they are not all the same, the row in the table is updated.
Use the following parameters:
The keys to look up the values: tableField1 = streamField1; tableField2 = streamField2; tableField3 = streamField3; and so on.
Update fields: tableField1, streamField1, N; tableField2, streamField2, N; tableField3, streamField3, N; and so on.
Second option: let the duplicate values be stored in the output table, then remove them afterwards:
Use an Execute SQL step in which you define SQL that removes the duplicate entries and keeps only unique rows. For inspiration on writing such SQL, see: How can I remove duplicate rows?
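As a rough illustration of the second option, here is a minimal sketch of a de-duplicating DELETE, run through Python's sqlite3 module so it is self-contained. The table name output_table and its columns are made up for the example, and rowid is SQLite-specific; on another database you would need a key column or ROW_NUMBER() instead.

```python
import sqlite3

def remove_duplicates(conn, table, columns):
    """Delete duplicate rows, keeping one copy of each distinct row."""
    cols = ", ".join(columns)
    # Keep the row with the smallest rowid in every group of identical rows.
    # rowid is SQLite-specific (an assumption of this sketch).
    conn.execute(
        f"DELETE FROM {table} WHERE rowid NOT IN "
        f"(SELECT MIN(rowid) FROM {table} GROUP BY {cols})"
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE output_table (id INTEGER, name TEXT)")

# Simulate running the CSV load twice: every row is stored two times.
csv_rows = [(1, "ann"), (2, "bob")]
conn.executemany("INSERT INTO output_table VALUES (?, ?)", csv_rows + csv_rows)

remove_duplicates(conn, "output_table", ["id", "name"])
remaining = conn.execute(
    "SELECT id, name FROM output_table ORDER BY id"
).fetchall()
print(remaining)
```

In PDI the DELETE statement itself would go into the Execute SQL step; the Python wrapper is only there to make the sketch runnable.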

Another way is to use the Merge rows (diff) step, followed by a Synchronize after merge step.
As long as the number of rows in your CSV that differ from your target table is below 20-25% of the total, this is usually the most performance-friendly option.
Merge rows (diff) takes two input streams that must be sorted on its key fields (by a compatible collation), and generates the union of the two inputs with each row marked as "new", "changed", "deleted", or "identical". This means you'll have to put Sort rows steps on the CSV input and possibly the input from the target table if you can't use an ORDER BY clause. Mark the CSV input as the "Compare" row origin and the target table as the "Reference".
The Synchronize after merge step then applies the changes marked in the rows to the target table. Note that Synchronize after merge is the only step in PDI (I believe) that requires input be entered in the Advanced tab. There you set the flag field and the values that identify the row operation. After applying the changes the target table will contain the exact same data as the input CSV.
Note also that you can use a Switch/Case or Filter Rows step to do things like remove deletes or updates if you want. I often flow off the "identical" rows and write the rest to a text file so I can examine only the changes.
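To make the flagging logic concrete, here is a small sketch of the kind of comparison Merge rows (diff) performs. It is not PDI code: the function name merge_rows_diff and the sample rows are invented for illustration, and it assumes both inputs are already sorted on a single key field.

```python
def merge_rows_diff(reference, compare, key):
    """Yield (flag, row) pairs from two lists of dicts sorted by `key`.

    reference = rows currently in the target table
    compare   = rows coming from the CSV
    """
    i = j = 0
    while i < len(reference) and j < len(compare):
        ref_row, cmp_row = reference[i], compare[j]
        if ref_row[key] == cmp_row[key]:
            # Same key in both streams: compare the remaining fields.
            flag = "identical" if ref_row == cmp_row else "changed"
            yield flag, cmp_row
            i += 1
            j += 1
        elif ref_row[key] < cmp_row[key]:
            # Key exists only in the target table.
            yield "deleted", ref_row
            i += 1
        else:
            # Key exists only in the CSV.
            yield "new", cmp_row
            j += 1
    # Whatever is left over exists in only one of the two streams.
    for ref_row in reference[i:]:
        yield "deleted", ref_row
    for cmp_row in compare[j:]:
        yield "new", cmp_row

table_rows = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}, {"id": 4, "name": "dan"}]
csv_rows = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bobby"}, {"id": 3, "name": "cat"}]

flagged = list(merge_rows_diff(table_rows, csv_rows, "id"))
for flag, row in flagged:
    print(flag, row)
```

Synchronize after merge would then insert the "new" rows, update the "changed" ones and delete the "deleted" ones; filtering out "identical" rows first, as described above, leaves only the actual changes.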

I looked for visual answers, but the answers were all text, so I am adding this visual answer for any Kettle newbie like me.
Case
user-updateslog.csv (has dup values) ---> users_table , store only latest user detail.
Solution
Step 1: Connect the CSV input to an Insert/Update step, as in the transformation below.
Step 2: In Insert/Update, add a condition that compares the key fields to find the candidate row, and set Update to "Y" for the fields you want to update.
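For readers without the screenshots, the behaviour of the Insert/Update step in this case can be approximated as follows. This is only a sketch of the lookup-then-insert-or-update logic, not PDI's implementation; the columns user_id and email and the sample rows are assumptions made for the example.

```python
import sqlite3

def insert_update(conn, row):
    """Look up the row by key (user_id); insert it if missing, update it if it differs."""
    found = conn.execute(
        "SELECT email FROM users_table WHERE user_id = ?", (row["user_id"],)
    ).fetchone()
    if found is None:
        conn.execute(
            "INSERT INTO users_table (user_id, email) VALUES (?, ?)",
            (row["user_id"], row["email"]),
        )
        return "inserted"
    if found[0] != row["email"]:
        conn.execute(
            "UPDATE users_table SET email = ? WHERE user_id = ?",
            (row["email"], row["user_id"]),
        )
        return "updated"
    return "unchanged"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users_table (user_id INTEGER PRIMARY KEY, email TEXT)")

# Rows as they might appear in user-updateslog.csv: user 1 occurs twice.
log_rows = [
    {"user_id": 1, "email": "a@old.example"},
    {"user_id": 2, "email": "b@example.com"},
    {"user_id": 1, "email": "a@new.example"},
]
actions = [insert_update(conn, row) for row in log_rows]
users = conn.execute("SELECT user_id, email FROM users_table ORDER BY user_id").fetchall()
print(actions, users)
```

Because later rows overwrite earlier ones for the same key, the table ends up holding only the latest detail per user, and re-running the load does not add rows.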

Related

Pentaho Spoon: how to insert a value into a column conditionally?

In my table, I have the columns quantity and comment.
If the value in quantity is more than 0, I need to insert the string "available" into the comment column; if it equals 0, "to order"; and if it is less than zero, "warning". What would be the best way to do this?
Edited:
I guess my question above doesn't show all the work involved.
First, I have a text file from which I read fields, including quantity.
Then I modify the data (in a Formula step, I do some calculations on quantity).
In the end I use a Table output step to insert the rows into the DB. One of the fields to insert is quantity.
My main question is:
Is it better to fill in the comment column after the Table output step (when quantity is already in the DB) using an SQL script step?
You have basically 3 options:
1- A Filter rows step to split the stream based on the value of quantity; each output stream then gets an Add constants step that adds the new field you want, and you combine the streams again by connecting the Add constants steps to a Dummy step.
2- A User Defined Java Expression step.
3- A JavaScript step.
Option 2 is probably the cleanest. Option 3 is basically the same as option 2, but with JavaScript instead of Java code. Option 1 has the advantage of not requiring any code (though, as the alternative is a one-liner, that is not really an issue). Also note that with option 1 the order of rows isn't necessarily maintained.
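The logic that options 2 and 3 would express as a one-liner is shown here in Python purely to illustrate the mapping; in PDI it would be written as a Java or JavaScript expression instead. The function name comment_for and the sample rows are made up.

```python
def comment_for(quantity):
    """Map a quantity to the text that goes into the comment column."""
    if quantity > 0:
        return "available"
    if quantity == 0:
        return "to order"
    return "warning"

# Rows as they might look in the stream just before the Table output step.
rows = [
    {"item": "bolt", "quantity": 12},
    {"item": "nut", "quantity": 0},
    {"item": "washer", "quantity": -3},
]

# Add the new field to every row; the order of rows is preserved.
rows_with_comment = [dict(row, comment=comment_for(row["quantity"])) for row in rows]
for row in rows_with_comment:
    print(row)
```

Computing the field in the stream like this means the comment is written together with quantity, so no second pass over the table is needed.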
** answer no longer applies with new question details **
If you are updating a database table, by far the best and most efficient solution is to do it in a single SQL statement.
In a Pentaho Job, add a SQL step (under scripting).
In that step enter the SQL command. It will be similar to:
UPDATE MyTable
SET comment =
CASE
WHEN quantity > 0 THEN 'available'
WHEN quantity < 0 THEN 'warning'
ELSE 'to order'
END
-- The WHERE clause is optional; add one if you only need to update some of the records.
-- WHERE <your conditions>
;
As an extra comment, it's less than ideal to have two columns that should always be in sync, but depend on an external job to keep them in sync. There are techniques like database triggers or calculating the case/when while retrieving the rows in a select statement that eliminate the chance of having out of sync fields.
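One way to read the "calculate while retrieving" suggestion is a view that derives the comment on the fly. The sketch below runs it through Python's sqlite3 module; the table name items and the view name items_with_comment are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, quantity INTEGER)")

# The comment is not stored; it is derived from quantity whenever the view is read.
conn.execute("""
    CREATE VIEW items_with_comment AS
    SELECT name,
           quantity,
           CASE
               WHEN quantity > 0 THEN 'available'
               WHEN quantity < 0 THEN 'warning'
               ELSE 'to order'
           END AS comment
    FROM items
""")

conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("bolt", 5), ("nut", 0), ("washer", -2)],
)
before = conn.execute(
    "SELECT name, comment FROM items_with_comment ORDER BY name"
).fetchall()

# Changing quantity changes the derived comment without any extra job.
conn.execute("UPDATE items SET quantity = 0 WHERE name = 'bolt'")
after = conn.execute(
    "SELECT comment FROM items_with_comment WHERE name = 'bolt'"
).fetchone()[0]
print(before, after)
```

With this design there is no stored comment column that could drift out of sync with quantity.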

Add new column to existing table Pentaho

I have a Table input step and I need to add a calculation to it, i.e. add a new column. I have tried:
doing the calculation and then feeding the result back into the table. Obviously, it appended the new data to the old data.
doing the calculation and then feeding the result back, but truncating the table first. As the process got stuck at some point, I assume that I was truncating the table while data was still being extracted from it.
using Stream lookup and then feeding the result back. Of course, it also appended the data on top of the existing data.
using Stream lookup where I pull the data from the Table input, do the calculation, and at the same time pull the data from the same table and do a lookup based on the unique combination of date and id, and then using the Update step.
As it has been running for a while, I am fairly sure this is not the right option either, but I have exhausted my ideas.
It seems that you need to update the table your data came from with this new field. Use the Update step with fields A and B as keys.
Actually, once you connect the hop, the result of the first step is automatically carried forward to the next step. So let's say you have a Table input step and then you add a Calculator step in which you create the third column. After writing the logic, right-click on the Calculator step and click Preview; you will get the result with all 3 columns.
I'd say your issue is not ONLY in the Pentaho implementation; there are some things you can do before reaching data staging in Pentaho.
'Workin Hard' is correct when he says you shouldn't use the same table. Instead, leave the input untouched and upload / insert the new values into a separate table. It doesn't have to be a new table EVERY TIME; rather than truncating the original, you truncate the staging table (the output table).
How many 'new columns' will you need? Will every iteration of this run create a new column in the output? Or will you always have a 'C' column which is always A+B or some other calculation? I'm sorry, but this isn't clear. If it is the latter, you don't need Pentaho for the transformation: updating column 'C' with a formula or function of A and B can be done directly in most relational DBMSs with a simple UPDATE statement. Yes, it can be done in Pentaho, but you're adding a lot of overhead and processing time.
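If the new column really is a fixed calculation such as A+B, the in-database route might look like the following sketch, run here through Python's sqlite3 module. The table name readings and the columns day, id, a, b and c are assumptions; ALTER TABLE syntax varies slightly between databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (day TEXT, id INTEGER, a REAL, b REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    [("2020-01-01", 1, 2.0, 3.0), ("2020-01-01", 2, 10.0, 0.5)],
)

# Add the new column once, then fill it for all rows in a single statement.
conn.execute("ALTER TABLE readings ADD COLUMN c REAL")
conn.execute("UPDATE readings SET c = a + b")
conn.commit()

result = conn.execute("SELECT id, c FROM readings ORDER BY id").fetchall()
print(result)
```

No rows leave the database here, which is why this tends to be cheaper than reading the table into PDI and writing it back.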

Kettle Pentaho backup transformation by latest data

I need to synchronize some data from one database to another using a Kettle/Spoon transformation. The logic is: I need to select the latest date that already exists in the destination DB, then select the rows from the source DB from that date onwards. Which transformation steps do I need for this?
Thank you.
There can be many solutions:
If you have timestamp columns in both the source and destination tables, you can use two Table input steps. In the first one, select the maximum last-updated timestamp from the destination; pass it as a variable to the second Table input and use it there as a filter on the source data.
If you just want new data to be reflected in the destination table and you don't care much about timestamps, I would suggest using the Insert/Update step for output. It brings all the data into the stream; if it finds an identical match, it does nothing. If it doesn't find a match, it inserts the new row. If it finds that an existing row in the destination table has been modified, it updates it accordingly.
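The first approach (two Table input steps chained by the maximum timestamp) can be sketched as two queries. The example below uses two in-memory SQLite databases; the table name events and the column updated_at are made up, and timestamps are stored as ISO-formatted text so that string comparison matches chronological order.

```python
import sqlite3

SCHEMA = "CREATE TABLE events (id INTEGER, payload TEXT, updated_at TEXT)"

def sync_new_rows(source, dest):
    """Copy rows from source that are newer than anything already in dest."""
    # First Table input: latest timestamp already present in the destination.
    (last_ts,) = dest.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM events"
    ).fetchone()
    # Second Table input: only source rows after that timestamp.
    # Note: '>' skips source rows that share the exact maximum timestamp.
    new_rows = source.execute(
        "SELECT id, payload, updated_at FROM events "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_ts,),
    ).fetchall()
    dest.executemany("INSERT INTO events VALUES (?, ?, ?)", new_rows)
    dest.commit()
    return len(new_rows)

source = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
source.execute(SCHEMA)
dest.execute(SCHEMA)

source.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, "a", "2020-01-01 10:00:00"),
    (2, "b", "2020-01-02 10:00:00"),
    (3, "c", "2020-01-03 10:00:00"),
])
dest.execute("INSERT INTO events VALUES (1, 'a', '2020-01-01 10:00:00')")

first_run = sync_new_rows(source, dest)
second_run = sync_new_rows(source, dest)
print(first_run, second_run)
```

In PDI the value from the first query would typically be passed into the second Table input as a parameter rather than through Python variables.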

SSIS Check Excel source rows redirect rows to another table on 'x' number of field matches

I work in a sales based environment and our data consists of 'leads'.
Let's say we record CompanyName, PhoneNumber, Address1 & PostCode (ZIP). These rows are seeded with a unique ID in the schema.
The leads come in from various sources and are compiled onto a spread sheet and then imported into SQL 2012 using SSIS.
After a validation check to see if a file exists we then use a simple data flow which consists of an Excel source, Derived Column, Data Conversion and finally an OLE DB Destination.
I'm sure my requirement has a relatively simple solution, and I understand that knowing what I need to achieve is the first step. I need to take a sample of data from the last rolling two months. If two or more fields in a row of the source Excel file match the corresponding fields in the destination SQL table, I want to redirect that row to another table.
I am unsure which combination of components I could use to achieve this. I believe that Fuzzy Lookup may not be what I am looking for, as I want exact field matches. I have looked at the Lookup component, but I am unsure whether it is the way to go.
Could anyone please provide some advice on how I can achieve this as simply as possible?
You can use the Lookup to check for matches in your existing table. However, it will be fairly complicated to implement the requirement of checking for any two or more fields matching. Your expression would be long and complex basically consisting of:
(using pseudo code for readability)
IIF((a=a AND b=b) OR (a=a AND c=c) OR (b=b AND c=c) OR ...and so on
for every combination of two columns you want to test
I would do this by importing the entire spreadsheet to a staging table, and doing the existing rows check in a SQL stored proc that moves the data to the desired destination table.
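The staging-table check could be written roughly as below: count how many fields match per pair of rows and treat two or more as a suspected duplicate. The sketch runs in SQLite via Python rather than in a T-SQL stored procedure, the table and column names are invented, and the rolling two-month filter is left out because the date column is not described in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE leads   (company TEXT, phone TEXT, address1 TEXT, postcode TEXT);
CREATE TABLE staging (company TEXT, phone TEXT, address1 TEXT, postcode TEXT);
INSERT INTO leads VALUES ('Acme Ltd', '555-0100', '1 High St', 'AB1 2CD');
INSERT INTO staging VALUES
  ('Acme Ltd',  '555-0100', '9 New Rd', 'ZZ9 9ZZ'),  -- two fields match
  ('Acme Ltd',  '555-0199', '2 Low St', 'XY1 1XY'),  -- one field matches
  ('Brand New', '555-0123', '3 Mid St', 'CD3 4EF');  -- nothing matches
""")

# Number of matching fields between a staging row (s) and an existing lead (l).
MATCH_COUNT = """
  (CASE WHEN s.company  = l.company  THEN 1 ELSE 0 END
 + CASE WHEN s.phone    = l.phone    THEN 1 ELSE 0 END
 + CASE WHEN s.address1 = l.address1 THEN 1 ELSE 0 END
 + CASE WHEN s.postcode = l.postcode THEN 1 ELSE 0 END)
"""

# Rows to redirect: at least two fields match some existing lead.
suspected = conn.execute(f"""
    SELECT s.company, s.phone FROM staging s
    WHERE EXISTS (SELECT 1 FROM leads l WHERE {MATCH_COUNT} >= 2)
""").fetchall()

# Rows that can go to the normal destination table.
clean = conn.execute(f"""
    SELECT s.company, s.phone FROM staging s
    WHERE NOT EXISTS (SELECT 1 FROM leads l WHERE {MATCH_COUNT} >= 2)
    ORDER BY s.phone
""").fetchall()
print(suspected, clean)
```

Summing CASE expressions avoids spelling out every pair of columns, which is what makes the Lookup-based expression so long.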

stored procedure sql (Excel data to T-SQL)

I need to set up a new company for automated data import. The utility has provided the data in a spreadsheet. (Image 1)
Based on this data, I need to create a stored procedure that will identify the correct meter, if it exists, and perform either an insert or an update on the monthly data table. For automated utility data import, I want to make sure I restrict everything to a particular utility company.
The steps are the following (I am having a hard time converting this to SQL):
1- I want a script that identifies the correct meter and checks whether it exists, basically comparing the Meter# column in the Excel file with the MeterNumber column in the Meters table.
2- The next step is to perform either an insert or an update on the MonthlyData table. This is a screen shot of all its columns.
3- Then I want to make sure that I am restricting everything to the particular company, which in this case is Site1, since 2 different companies might have the same meter#. The UtilityCompany table contains 3 columns: ID, Name, UtilityType
I honestly do not know where to start. Could anybody help me with the script? Thank you
You will want to:
Perform a Bulk Insert operation to load your data from the Excel file into a staging table.
Write a query that selects ALL rows for the corresponding utility company (notice I didn't say iterate over each row...). This select could be an update in which you set an additional column to mark each row as an INSERT or an UPDATE.
Then the last step (2 parts): retrieve all of the rows that were marked as INSERT and insert them into your table. Then grab all rows that were marked as UPDATE and update their corresponding values based on your matching criteria.
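The three steps above might be sketched as set-based statements like the ones below. This runs in SQLite through Python rather than as a T-SQL stored procedure, and every table and column name (meters, monthly_data, staging, usage_kwh, row_action) is an assumption, since the real schema is only shown in the missing screenshots.

```python
import sqlite3

COMPANY_ID = 1  # hypothetical id of the utility company being imported
params = {"c": COMPANY_ID}

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meters (id INTEGER PRIMARY KEY, meter_number TEXT, utility_company_id INTEGER);
CREATE TABLE monthly_data (meter_id INTEGER, month TEXT, usage_kwh REAL);
CREATE TABLE staging (meter_number TEXT, month TEXT, usage_kwh REAL, row_action TEXT);
INSERT INTO meters VALUES (10, 'M-1', 1), (11, 'M-2', 1), (20, 'M-1', 2);
INSERT INTO monthly_data VALUES (10, '2020-01', 100.0), (20, '2020-01', 999.0);
INSERT INTO staging (meter_number, month, usage_kwh) VALUES
  ('M-1', '2020-01', 120.0), ('M-2', '2020-01', 50.0), ('M-9', '2020-01', 7.0);
""")

# Mark every staging row of this company as INSERT or UPDATE in one statement.
# Rows whose meter is unknown for the company keep row_action NULL and are ignored.
conn.execute("""
UPDATE staging SET row_action = CASE
    WHEN EXISTS (SELECT 1 FROM monthly_data d JOIN meters m ON m.id = d.meter_id
                 WHERE m.meter_number = staging.meter_number
                   AND m.utility_company_id = :c AND d.month = staging.month)
    THEN 'UPDATE' ELSE 'INSERT' END
WHERE meter_number IN (SELECT meter_number FROM meters WHERE utility_company_id = :c)
""", params)

# Part 1: insert the rows marked INSERT.
conn.execute("""
INSERT INTO monthly_data (meter_id, month, usage_kwh)
SELECT m.id, s.month, s.usage_kwh FROM staging s
JOIN meters m ON m.meter_number = s.meter_number AND m.utility_company_id = :c
WHERE s.row_action = 'INSERT'
""", params)

# Part 2: update the rows marked UPDATE, again restricted to this company.
MATCH = """
    FROM staging s
    JOIN meters m ON m.meter_number = s.meter_number AND m.utility_company_id = :c
    WHERE m.id = monthly_data.meter_id AND s.month = monthly_data.month
      AND s.row_action = 'UPDATE'
"""
conn.execute(f"""
UPDATE monthly_data SET usage_kwh = (SELECT s.usage_kwh {MATCH})
WHERE EXISTS (SELECT 1 {MATCH})
""", params)
conn.commit()

final = conn.execute(
    "SELECT meter_id, usage_kwh FROM monthly_data ORDER BY meter_id"
).fetchall()
print(final)
```

Meter 20 belongs to a different company with the same meter number, so its row is left untouched; that is the restriction asked for in step 3 of the question.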