Pentaho: execute insert only if there are no duplicates - pentaho

Basically I want to insert a set of rows only if there are no changes from the target row.
I have implemented a blocking step to wait for all rows to be processed before proceeding. After this I want to add a condition that checks whether any data has changed: if it has, abort the process; otherwise insert all rows.
Any suggestions?

This seems to be very easy with only 2 steps
Try this:
Step 1: Use the Database lookup step. Look up on the key columns and retrieve the columns you want to compare, including the key fields, from the target table.
Step 2: Use the Filter rows step. Compare every field retrieved by the database lookup with the corresponding field from the source input, e.g. id (from source input) = id (from target) and name (from source input) = name (from target). Point the false condition to the target table and the true condition to a Dummy step for testing.
Note: If you want to populate the table key as max + 1, use the Combination lookup/update step instead of Table output.

If I understand your question properly, you want to insert rows if they are identical to the rows in target? Would that not result in a PK violation?
Anyways from your code screen shot, you seem to have used a Merge Rows(Diff) step which will give you rows flagged with a 'new', 'changed', 'identical' or 'deleted' status.
From here you want to check for two things: Changed or Identical
If it is changed you have to abort and if it is identical you will insert
Now use a simple Filter rows step with status = 'identical' as the condition. The true branch goes to your insert flow.
The false branch goes to the Abort step.
Note that if even a single row is found to be changed, the entire transformation is aborted.
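To make the flag-then-filter logic concrete, here is a minimal Python sketch of the idea, not Pentaho code. The function names `merge_rows_diff` and `rows_to_insert` are hypothetical, and the sketch assumes every row that is not 'identical' takes the false branch and aborts, as the filter above implies.

```python
def merge_rows_diff(reference, compare, key, value_fields):
    """Flag each row as new, changed, identical or deleted (hypothetical helper)."""
    ref_by_key = {row[key]: row for row in reference}
    cmp_by_key = {row[key]: row for row in compare}
    flagged = []
    for k in sorted(ref_by_key.keys() | cmp_by_key.keys()):
        old, new = ref_by_key.get(k), cmp_by_key.get(k)
        if old is None:
            status = "new"
        elif new is None:
            status = "deleted"
        elif all(old[f] == new[f] for f in value_fields):
            status = "identical"
        else:
            status = "changed"
        flagged.append((new if new is not None else old, status))
    return flagged


def rows_to_insert(flagged):
    """Filter on status = 'identical'; any other status aborts the whole run."""
    rejected = [status for _, status in flagged if status != "identical"]
    if rejected:
        raise RuntimeError(f"aborting, non-identical rows found: {rejected}")
    return [row for row, _ in flagged]
```

Because the check runs over the complete list before anything is returned, it plays the role of the blocking step: nothing is inserted unless every row passes.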

If I understand your use case properly, I would not use the "Table output" step for this kind of move.
"Table output" is a great step for data warehousing, where you usually insert data to tables which are supposed to be empty and are part of a much broader process.
Alternatively, I would use "Execute SQL script" to tweak the INSERT to your own needs.
Consider this to be your desired SQL statement (PostgreSQL syntax in this example):
INSERT INTO ${TargetTable}
(contact_id, request_id, event_time, channel_id)
-- INSERT ... VALUES cannot take a WHERE clause, so select the values instead
SELECT '?', '?', '?', '?'
WHERE
NOT EXISTS (
SELECT contact_id, request_id, event_time, channel_id FROM ${TargetTable}
WHERE contact_id = '?' AND
-- and so on...
);
Then, in the step settings:
Get the required fields for mapping (they replace the question marks in the order listed, as an argument sequence);
Check the "Variable substitution" check box if you intend to use variables that were loaded and/or created earlier in the broader process.
SQL-performance-wise, it may not be the most efficient way, but it looks to me like a better implementation for your use case.
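As a runnable illustration of the INSERT ... WHERE NOT EXISTS pattern, here is a small sketch using Python's built-in sqlite3 module rather than Pentaho and PostgreSQL. The table name `events`, the helper `insert_if_absent` and the sample values are assumptions made for the example, and the values are passed as bound parameters instead of substituted text.

```python
import sqlite3


def insert_if_absent(conn, row):
    """Insert the row only when an identical row is not already present."""
    conn.execute(
        """
        INSERT INTO events (contact_id, request_id, event_time, channel_id)
        SELECT ?, ?, ?, ?
        WHERE NOT EXISTS (
            SELECT 1 FROM events
            WHERE contact_id = ? AND request_id = ?
              AND event_time = ? AND channel_id = ?
        )
        """,
        row + row,  # the four values are needed twice: once to insert, once to compare
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "contact_id INTEGER, request_id INTEGER, event_time TEXT, channel_id INTEGER)"
)
sample = (1, 10, "2020-01-01 00:00:00", 3)
insert_if_absent(conn, sample)
insert_if_absent(conn, sample)  # second call finds the row and inserts nothing
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Note that this pattern silently skips rows that already exist; it does not abort when a row differs.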

The simplest way to do that is to use the Insert / Update step. There is no need to write any query: if the row exists it is updated; if it does not exist, a new row is created.
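The behaviour described here (look up by key, then update or insert) can be sketched in Python with sqlite3. This is an illustration of the idea only, not how the Pentaho step is implemented; the `contacts` table and the `insert_or_update` helper are hypothetical.

```python
import sqlite3


def insert_or_update(conn, contact_id, channel_id):
    """Look the key up first; update the row if found, insert it otherwise."""
    found = conn.execute(
        "SELECT 1 FROM contacts WHERE contact_id = ?", (contact_id,)
    ).fetchone()
    if found:
        conn.execute(
            "UPDATE contacts SET channel_id = ? WHERE contact_id = ?",
            (channel_id, contact_id),
        )
    else:
        conn.execute(
            "INSERT INTO contacts (contact_id, channel_id) VALUES (?, ?)",
            (contact_id, channel_id),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (contact_id INTEGER PRIMARY KEY, channel_id INTEGER)")
insert_or_update(conn, 1, 3)  # key not present: inserts
insert_or_update(conn, 1, 5)  # key present: updates in place
contacts = conn.execute("SELECT contact_id, channel_id FROM contacts").fetchall()
```

This approach overwrites changed rows instead of aborting, so it only fits if overwriting is acceptable.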

Related

Update query in Access to update MANY columns from Null to value

I have a database table with about 100 columns (bulky, I know). I have about half of these columns which I will need to update iteratively to set Is Null or "" values to "TBD".
I compiled all 50 some columns which need to be updated into an update query with Access SQL code that looked something like this...
UPDATE tablename
SET tablename.column1="TBD", tablename.column2="TBD", tablename.column3="TBD"....
WHERE tablename.column1 Is Null OR tablename.column1="" OR tablename.column2 Is Null OR tablename.column2="" OR tablename.column3 Is Null OR tablename.column3=""....
Two issues: This query with 50 columns receives a "query is too complex" error.
This query is also just functionally wrong...because I'm losing data within these columns due to the WHERE statement. Records that had values populated which I did not want to update are being updated because of the OR clause.
My question is how can I go about updating all of these columns and setting their null or empty values to a particular value (in this case, "TBD")?
I know that I can just use a select query to select the columns I need to update, run it, and just CTRL+H to find & replace "" to "TBD". However, I'm worried about the potential for this to introduce errors into my dataset. I also know I could also go through column by column and update these values via an update query. However, this would be quite time consuming with 50+ columns & the iterative updates which I need to run on the entire dataset.
I'm leaning towards this latter route. I am still wondering if there are any other scripted options which I can build into a query to overcome such an issue, and that leads me here to you.
Thank you!
You could just run 50 queries:
UPDATE table SET column1="TBD" WHERE column1 IS NULL OR column1 = "";
An optimization could be:
Create a temporary table which determines which rows actually would need an update: Concatenate all column values such that a single NULL or empty would result in an record in your temp table. This way you only have to scan the base table once.
Use the keys from that table to focus on those rows only.
Etc.
That is safe and only updates your empty values (whereas your previous query would have updated all columns unless you had checked every value first with IIf or Nz).
This query style also does not run into the "too complex" issue.
You could issue one query as:
UPDATE tablename
SET column1 = iif(column1 is null or column1 = "", "TBD", column1),
column2 = iif(column2 is null or column2 = "", "TBD", column2),
. . .;
If you don't mind potentially updating all rows, you can leave out the where clause.
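The single-statement approach can be generated rather than typed out for 50 columns. The sketch below uses Python's sqlite3 and a CASE expression in place of Access's IIf; the helper name `build_fill_update` and the three-column table are assumptions for the example.

```python
import sqlite3


def build_fill_update(table, columns):
    """Build one UPDATE that rewrites each column only where it is NULL or ''."""
    assignments = ",\n    ".join(
        f"{col} = CASE WHEN {col} IS NULL OR {col} = '' THEN :fill ELSE {col} END"
        for col in columns
    )
    return f"UPDATE {table}\nSET {assignments}"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename ("
    "id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT, column3 TEXT)"
)
conn.executemany(
    "INSERT INTO tablename (column1, column2, column3) VALUES (?, ?, ?)",
    [("a", None, ""), (None, "b", "c")],
)
statement = build_fill_update("tablename", ["column1", "column2", "column3"])
conn.execute(statement, {"fill": "TBD"})
filled = conn.execute(
    "SELECT column1, column2, column3 FROM tablename ORDER BY id"
).fetchall()
```

Each column is tested on its own, so populated values are written back unchanged and only the empty ones become "TBD".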

Change column value after INSERT if the value fits criteria?

I have never really worked with Triggers before in MSSQL but I think it'll be what I need for this task.
The structure of the table is as such:
ID|****|****|****|****|****|****|****|TOUROPERATOR
The Tour Operator Code is the code that tells us what company owned the flight we carried out for them. Two of those codes (there are 24 in total) are outdated. Our users requested that those two be changed but the tour operator code is pulled from a database we don't control. The FlightData table however, we do control. So I was thinking a trigger could change the tour operator code if it was one of the two outdated ones, to the correct ones instead respectively when they were inserted.
So I went into good ol' SQL Management Studio and asked to make a trigger. It gave me some sample code and here is my Pseudo Code below:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TRIGGER ChangeProvider
ON FlightData
AFTER INSERT
AS
BEGIN
IF(TheInsertedValue == Criteria)
UPDATE FlightData
SET TheInsertedValue = NewValue
ENDIF
END
GO
I am not that good with this type of Database Programming so excuse my mistakes.
How would I go about doing this?
You could add a computed column to your table instead of adding a trigger.
Then the new column could just use a case statement to either show
the original TourOperator column value or the new value you wanted.
You'd add a new column to your table like this
ALTER TABLE FlightData ADD TourOperatorCorrect AS
CAST(CASE WHEN TourOperator = 'Whatever value' THEN 'ChangedValue'
--I just want to use what I have already in the TourOperator column
ELSE TourOperator
END AS VARCHAR(50))
Basics of computed columns are here - https://msdn.microsoft.com/en-ie/library/ms188300.aspx
Your misconception here is that the trigger runs once per inserted value - it is in fact run once per insert statement, so you can and will find more than one row inserted at once.
You'll find that your inserted values are in the pseudo table inserted, which has the same structure as your FlightData table in this case. You write a select statement against that, specifying any criteria you wish.
However, it's not immediately clear what your logic is. Does the FlightData table you are updating in your trigger only have one row? Do you update every row in the table with the newest inserted value? It is hard to understand what you are trying to do now, and what the purpose of the table and this trigger is, let alone what you would want to do if you inserted more than one row at once.
When the inserted table contains multiple rows, your code will fail, so change the code to work with the inserted table as a whole:
UPDATE F
SET F.TheInsertedValue = i.value
FROM inserted i
JOIN FlightData F
ON F.matchingcolumn = i.matchingcolumn
AND i.somevalue = 'criteria'

Creation of a temporary table in postgres

I'm trying to create a temporary table in Postgres (to speed up joining, as there will be a lot of similar queries throughout a session). The SQL that will be called at the beginning of a session is the following:
CREATE TEMPORARY TABLE extended_point AS
SELECT (
point.id,
local_location,
relative_location,
long_lat,
region,
dataset,
region.name,
region.sub_name,
color,
type)
FROM point, region, dataset
WHERE point.region = region.id AND region.dataset = dataset.id;
The table point has the columns id::int, region::int, local_location::point, relative_location::point, long_lat::point (longitude, latitude).
Region has the columns id::int, color::int, dataset::int, name::varchar, sub_name::varchar.
Dataset has the columns id::int, name::varchar, type::varchar.
When this is run, I get the error message: [25P02] ERROR: current transaction is aborted, commands ignored until end of transaction block.
As a side, the commands are executed in PyCharm, and is part of a Python project.
Any suggestions?
Thanks in advance :)
There is an important difference between these two queries:
select 1, 'abc';
select (1, 'abc');
The first query returns one row with two columns with values 1 and 'abc'. The second one returns a row with one column of pseudo-type record with value (1, 'abc').
Your query tries to create a table with one column of pseudo-type record. This is impossible and should end with
ERROR: column "row" has pseudo-type record
SQL state: 42P16
Just remove brackets from your query.
As a_horse stated, [25P02] ERROR does not apply to the query in question.
Btw, my advice: never use keywords as table/column names.
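For reference, here is a sketch of the query with the parentheses removed, run against SQLite through Python's sqlite3 instead of Postgres. The point columns are stored as TEXT and the sample rows are invented for the example; explicit JOINs and qualified column names are a stylistic choice, not something the fix requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset (id INTEGER, name TEXT, type TEXT);
    CREATE TABLE region (id INTEGER, color INTEGER, dataset INTEGER,
                         name TEXT, sub_name TEXT);
    CREATE TABLE point (id INTEGER, region INTEGER, local_location TEXT,
                        relative_location TEXT, long_lat TEXT);
    INSERT INTO dataset VALUES (1, 'survey', 'grid');
    INSERT INTO region VALUES (7, 255, 1, 'north', 'n1');
    INSERT INTO point VALUES (42, 7, '(0,0)', '(1,1)', '(10.5,59.9)');
    """
)

# No parentheses around the select list: each expression becomes its own column.
conn.execute(
    """
    CREATE TEMPORARY TABLE extended_point AS
    SELECT point.id, point.local_location, point.relative_location,
           point.long_lat, point.region, region.dataset, region.name,
           region.sub_name, region.color, dataset.type
    FROM point
    JOIN region ON point.region = region.id
    JOIN dataset ON region.dataset = dataset.id
    """
)
cursor = conn.execute("SELECT * FROM extended_point")
column_names = [d[0] for d in cursor.description]
rows = cursor.fetchall()
```

The temporary table ends up with ten separate columns rather than a single record-typed one.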

SQL Column Swap Behavior

I'm swapping column values in a table using the following statement:
UPDATE SwapTable
SET ValueA=ValueB
,ValueB=ValueA
This works and the values do get swapped, as can be verified by this SQL Fiddle.
However, if we did such thing in (mostly any) other language, we would end up with both ValueA and ValueB having identical values.
So my question is why/how this works in SQL.
You can just see the execution plan.
Select all the rows from the table and make it as a row set.
Open a transaction
Update each row of the referenced table (SwapTable), located by its row address, assigning the old values read from the row set to the target fields.
Commit -- done updating.
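The difference from a procedural language can be seen side by side in a short Python sketch. It uses SQLite through the sqlite3 module, which evaluates all right-hand-side expressions against the old row values before assigning, as the steps above describe; the sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SwapTable (ValueA INTEGER, ValueB INTEGER)")
conn.executemany("INSERT INTO SwapTable VALUES (?, ?)", [(1, 2), (3, 4)])

# Both assignments read the values the row had before the UPDATE started.
conn.execute("UPDATE SwapTable SET ValueA = ValueB, ValueB = ValueA")
swapped = conn.execute(
    "SELECT ValueA, ValueB FROM SwapTable ORDER BY rowid"
).fetchall()

# Sequential assignment in a procedural language: the second statement
# already sees the new value of a, so both variables end up equal.
a, b = 1, 2
a = b
b = a
sequential = (a, b)
```

Here `swapped` holds the exchanged pairs, while `sequential` ends up with two identical values.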

How to proceed to the next task only if no records exist for a given query?

I have the following piece of SQL that will check if any duplicate records exist. How can I check to see if no records are returned? I'm using this in an SSIS package. I only want it to proceed to the next step if no records exist, otherwise error.
SELECT Number
, COUNT(Number) AS DuplicateCheckresult
FROM [TelephoneNumberManagement].[dbo].[Number]
GROUP BY Number
HAVING COUNT(Number) > 1
The following example, created using SSIS 2008 R2 with a SQL Server 2008 R2 backend, illustrates how you can achieve your requirement in an SSIS package.
Create a table named dbo.Phone and populate it with a couple of records that would return duplicate results.
CREATE TABLE [dbo].[Phone](
[Number] [int] NOT NULL
) ON [PRIMARY]
GO
INSERT INTO dbo.Phone (Number) VALUES
(1234567890),
(1234567890);
GO
You need to slightly modify your query so that it returns the total number of duplicated values instead of the duplicate rows. The query then returns a single scalar value, which is zero if no duplicates are found and non-zero otherwise. This is the query we will use in the SSIS package's Execute SQL Task.
SELECT COUNT(Number) AS Duplicates
FROM
(
SELECT Number
, COUNT(Number) AS NumberCount
FROM dbo.Phone
GROUP BY Number
HAVING COUNT(Number) > 1
) T1
On the SSIS package, create a variable named DuplicatesCount of data type Int32.
On the SSIS package, create an OLE DB Connection manager to connect to the SQL Server database. I have named it as SQLServer.
On the Control Flow tab of the SSIS package, place an Execute SQL Task and configure it as follows. The task should accept a single-row value and assign it to the newly created variable. Set the ResultSet to Single row. Set the Connection to SQLServer and the SQLStatement to SELECT COUNT(Number) AS Duplicates FROM (SELECT Number, COUNT(Number) AS NumberCount FROM dbo.Phone GROUP BY Number HAVING COUNT(Number) > 1) T1.
On the Result Set section, click on the Add button and set the Result Name to 0. Assign the variable User::DuplicatesCount to the result name. Then click OK.
Place another task after the Execute SQL Task. I have chosen a Foreach Loop Container as a sample. Connect the Execute SQL Task to it with a precedence constraint.
Now, the requirement is that if there are no duplicates, meaning the output value of the query in the Execute SQL Task is zero, the package should proceed to the Foreach Loop Container. Otherwise, the package should not proceed to the Foreach Loop Container. To achieve this, we need to add an expression to the precedence constraint (the green arrow between the tasks).
Right-click on the precedence constraint and select Edit...
On the Precedence Constraint Editor, select Expression from the Evaluation operation dropdown. Set the expression to @[User::DuplicatesCount] == 0 in order to check that the variable DuplicatesCount contains the value zero. Value zero means that there were no duplicates in the table dbo.Phone. Test the expression to verify that the syntax is correct. Click OK to close the verification message. Click OK to close the Precedence Constraint Editor.
On the Control Flow, the precedence constraint is now denoted with fx, which indicates that a constraint/expression is in place.
Let's check the rows in the table dbo.Phone. As you see, the value 1234567890 exists twice. It means that there are duplicate rows and the Foreach loop container shouldn't execute.
Let's execute the package. You can notice that the Execute SQL Task executed successfully but it didn't proceed to Foreach Loop container. That's because the variable DuplicatesCount contains a value of 1 and we had written a condition to check that the value should be zero to proceed to Foreach loop container.
Let's delete the rows from the table dbo.Phone and populate it with non-duplicate rows using the following script.
TRUNCATE TABLE dbo.Phone;
INSERT INTO dbo.Phone (Number) VALUES
(1234567890),
(0987654321);
Now the table contains two distinct values.
If we execute the package, it will proceed to the Foreach Loop Container because there are no duplicate rows in the table dbo.Phone.
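The same gate can be sketched outside SSIS. The Python example below uses sqlite3 as a stand-in for SQL Server; `count_duplicates` and `run_next_if_clean` are hypothetical names, and the `if` test plays the role of the precedence-constraint expression.

```python
import sqlite3

DUPLICATES_SQL = """
SELECT COUNT(*) AS Duplicates
FROM (
    SELECT Number
    FROM Phone
    GROUP BY Number
    HAVING COUNT(Number) > 1
) AS T1
"""


def count_duplicates(conn):
    """Return how many distinct numbers occur more than once."""
    return conn.execute(DUPLICATES_SQL).fetchone()[0]


def run_next_if_clean(conn, next_task):
    """Run next_task only when the duplicate count is zero."""
    if count_duplicates(conn) == 0:
        next_task()
        return True
    return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Phone (Number INTEGER NOT NULL)")
conn.executemany("INSERT INTO Phone (Number) VALUES (?)", [(1234567890,), (1234567890,)])

executed = []
with_duplicates = run_next_if_clean(conn, lambda: executed.append("loop"))

conn.execute("DELETE FROM Phone")
conn.executemany("INSERT INTO Phone (Number) VALUES (?)", [(1234567890,), (987654321,)])
without_duplicates = run_next_if_clean(conn, lambda: executed.append("loop"))
```

If the package should fail rather than just stop, the false branch could raise an error instead of returning False.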
Hope that helps.
What you need to do is work with @@ROWCOUNT, but how you do it depends on your data flows. Have a look at this discussion, which points out how to do it with either one or two data flows.
Using Row Count In SSIS