Missing data in Redshift Update after INSERT INTO - sql

The goal is to upsert data from original_table into the target table using a temporary stage table. Upserting works fine. The problem I'm having is that after upserting I'd like to modify the version timestamps in the target, but only for ids that were in the staged table.
The queries below are roughly what I have. When I remove the WHERE from the last update, i.e. WHERE t.id in (SELECT DISTINCT id from stage), everything works as intended. However, we're expecting billions of rows, so doing this every couple of hours is not feasible. (Yes, the subquery will be changed into an INNER JOIN to optimize performance.)
Any idea what's going on? Why are there no rows in the target with an id that is in stage, even though I just copied that table over?
-- Temp table
CREATE TEMP TABLE stage (LIKE target);
-- Lift recent data from one table and put it into stage
INSERT INTO stage SELECT {transformations} FROM original_table WHERE version_start > current_day - 1;
-- Update overlaps in the stage table with ids that are in target
UPDATE stage
SET id = target.id
FROM target WHERE stage.service_id = target.service_id AND stage.region = target.region;
-- Update unique row_id (per id and version)
UPDATE stage
SET row_id = target.row_id
FROM target WHERE stage.id = target.id AND stage.timestamp_start = target.timestamp_start;
-- Delete all duplicate entries
DELETE FROM target USING stage WHERE stage.row_id = target.row_id;
-- Copy over all staged data into cleaned target
INSERT INTO target SELECT * FROM stage s;
-- Update entries in the 'target' to have {current_row}.version_start = {previous_row}.version_end.
-- Look only for ids that were in the stage table.
UPDATE target
SET timestamp_end = versions.timestamp_end
FROM (
SELECT
t.row_id, t.id, t.timestamp_start,
lead(t.timestamp_start) over (partition by t.id order by t.timestamp_start asc) as timestamp_end
FROM target t
WHERE t.id in (SELECT DISTINCT id from stage)
) versions
WHERE target.row_id = versions.row_id;
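To make the intent of that last UPDATE concrete, here is a minimal sketch of what the lead() subquery produces. It runs on SQLite through Python's sqlite3 module rather than Redshift, and the table contents are made-up sample rows, so treat it as an illustration of the window function only, not as a diagnosis of the missing rows.

```python
import sqlite3

# In-memory SQLite stand-in for the Redshift tables (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target (row_id INTEGER, id INTEGER, timestamp_start TEXT, timestamp_end TEXT);
CREATE TABLE stage  (row_id INTEGER, id INTEGER, timestamp_start TEXT, timestamp_end TEXT);
INSERT INTO target VALUES
  (1, 10, '2024-01-01', NULL),
  (2, 10, '2024-01-05', NULL),
  (3, 20, '2024-01-02', NULL);
INSERT INTO stage VALUES (2, 10, '2024-01-05', NULL);
""")

# For every version of an id present in stage, take the next version's start as its end.
versions = conn.execute("""
SELECT t.row_id,
       lead(t.timestamp_start) OVER (PARTITION BY t.id ORDER BY t.timestamp_start) AS timestamp_end
FROM target t
WHERE t.id IN (SELECT DISTINCT id FROM stage)
ORDER BY t.row_id
""").fetchall()

# Apply the computed ends row by row (a portable stand-in for UPDATE ... FROM).
conn.executemany(
    "UPDATE target SET timestamp_end = ? WHERE row_id = ?",
    [(end, row_id) for row_id, end in versions],
)
result = conn.execute("SELECT row_id, timestamp_end FROM target ORDER BY row_id").fetchall()
print(result)  # [(1, '2024-01-05'), (2, None), (3, None)]
```

Only id 10 is touched because it is the only id in stage; the latest version of each id keeps a NULL end.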

Improving insert query for SCD2 solution

I have two insert statements. The first query inserts a new row if the id doesn't exist in the target table. The second query inserts into the target table only if the hash value for the joined id is different (which indicates that the row has been updated in the source table) and the id in the source table is not null. These queries are meant for my SCD2 solution, which will be used for inserts of hundreds of thousands of rows. I'm trying not to use the MERGE statement, for practice.
In the "Current" column, the value 1 indicates that the row is new and 0 indicates that the row has expired. I use this information later to expire rows in the target table with my update queries.
Besides indexing, is there a more effective way to write my insert queries so that they resemble what the SCD2 MERGE statement does when inserting new/updated rows?
Query:
Query 1:
INSERT INTO TARGET
SELECT Name,Middlename,Age, 1 as current,Row_HashValue,id
from Source s
Where s.id not in (select id from TARGET) and s.id is not null
Query 2:
INSERT INTO TARGET
SELECT Name,Middlename,Age,1 as current ,Row_HashValue,id
FROM SOURCE s
LEFT JOIN TARGET t ON s.id = t.id
AND s.Row_HashValue = t.Row_HashValue
WHERE t.Row_HashValue IS NULL and s.ID IS NOT NULL
You can use WHERE NOT EXISTS, and have just one INSERT statement:
INSERT INTO TARGET
SELECT Name,Middlename,Age,1 as current ,Row_HashValue,id
FROM SOURCE s
WHERE NOT EXISTS (
SELECT 1
FROM TARGET t
WHERE s.id = t.id
AND s.Row_HashValue = t.Row_HashValue)
AND s.ID IS NOT NULL;
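A quick way to convince yourself that the single NOT EXISTS insert covers both original queries is to run it on a toy data set. The sketch below uses SQLite through Python's sqlite3 module instead of your actual database, trims the column list, and renames the flag column to is_current; all of that is assumed for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (name TEXT, row_hashvalue TEXT, id INTEGER);
CREATE TABLE target (name TEXT, is_current INTEGER, row_hashvalue TEXT, id INTEGER);
INSERT INTO target VALUES ('Ann', 1, 'h1', 1), ('Bob', 1, 'h2', 2);
INSERT INTO source VALUES
  ('Ann',   'h1',  1),
  ('Bobby', 'h2b', 2),
  ('Cid',   'h3',  3),
  ('Nil',   'h4',  NULL);
""")

# One statement: a source row is inserted when no target row has the same id AND hash.
# That is true both for brand-new ids and for existing ids whose hash changed.
conn.execute("""
INSERT INTO target (name, is_current, row_hashvalue, id)
SELECT s.name, 1, s.row_hashvalue, s.id
FROM source s
WHERE NOT EXISTS (
    SELECT 1 FROM target t
    WHERE s.id = t.id AND s.row_hashvalue = t.row_hashvalue)
  AND s.id IS NOT NULL
""")

rows = conn.execute("SELECT id, name FROM target ORDER BY id, name").fetchall()
print(rows)  # [(1, 'Ann'), (2, 'Bob'), (2, 'Bobby'), (3, 'Cid')]
```

The unchanged row (id 1) and the NULL id are skipped, the changed row (id 2) gets a new version, and the new id 3 is inserted. The old version of id 2 still has to be expired by your separate update.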

MERGE - to replace ON Duplicate with sql server

I've predominantly used MySQL, so moving over to Azure and SQL Server I realise that ON DUPLICATE KEY UPDATE does not work.
I'm trying to do this:
INSERT INTO records (jid, pair, interval, entry) VALUES (1, 'alpha', 3, 'unlimited') ON DUPLICATE KEY UPDATE entry = "limited";
But of course on duplicate key isn't allowed here. So MERGE is the right form.
I've looked at:
https://technet.microsoft.com/en-gb/library/bb522522(v=sql.105).aspx
But honestly the example is a bit excessive and eye watering. Could someone dumb it down for me to fit my example so I can understand it better?
In order to do the merge you need some form of source table/table var for the merge statement. Then you can do the merging. So something along the lines of this maybe (note: not completely syntax checked, apologies in advance):
WITH src AS (
-- This should be your source
SELECT 1 AS Id, 2 AS Val
)
-- The above is not necessary if you have a source table
MERGE Target -- the destination table, so in your case records
USING src -- as defined above
ON (Target.Id = src.Id) -- how do we join the tables
WHEN NOT MATCHED BY TARGET
-- if we don't match, what to do to the destination table. In this case, insert.
THEN INSERT(Id, Val) VALUES(src.Id, src.Val)
WHEN MATCHED
-- what we do if we match. In this case, update Val
THEN UPDATE SET Target.Val = src.Val;
Don't forget to read the proper syntax page: https://msdn.microsoft.com/en-us/library/bb510625.aspx
I think this translates to your example (tm):
WITH src AS (
-- This should be your source
SELECT 1 AS jid, 'alpha' AS pair, 3 AS [interval]
)
MERGE records -- the destination table
USING src -- as defined above
ON (records.jid = src.jid) -- how do we join the tables; this assumes jid is the unique key
WHEN NOT MATCHED BY TARGET
-- if we don't match, what to do to the destination table. In this case, insert.
THEN INSERT(jid, pair, [interval], entry) VALUES(src.jid, src.pair, src.[interval], 'unlimited')
WHEN MATCHED
-- what we do if we match. In this case, update entry
THEN UPDATE SET records.entry = 'limited';

Merge statement inserting instead of updating in SQL Server

I'm using SQL Server 2008 and I'm trying to load a new (target) table from a staging (source) table. The target table is empty.
I think that since my target table is empty, the MERGE statement skips the WHEN MATCHED part, i.e. the result of the INNER JOIN is NULL and so nothing is UPDATED, and it just proceeds to the WHEN NOT MATCHED BY TARGET part (LEFT OUTER JOIN) and inserts all the records from the staging table.
My target table ends up looking exactly like my staging table (see rows #1 and #4). There should be only 3 rows in the target table (3 inserts and one update for row #4). So I'm not sure what's going on.
FileID  client_id  account_name  account_currency  creation_date  last_modified
210     12345      Cars          USD               2013-11-21     2013-11-27
211     23498      Truck         USD               2013-09-22     2013-11-27
212     97652      Cars - 1      USD               2013-09-17     2013-11-27
210     12345      Cars          JPY               2013-11-21     2013-11-29
QUERY
MERGE [AccountSettings] AS tgt -- RIGHT TABLE
USING
(
SELECT * FROM [AccountSettings_Staging]
) AS src -- LEFT TABLE
ON src.client_id = tgt.client_id
AND src.account_name = tgt.account_name
WHEN MATCHED -- INNER JOIN
THEN UPDATE
SET
tgt.[FileID] = src.[FileID]
,tgt.[account_currency] = src.[account_currency]
,tgt.[creation_date] = src.[creation_date]
,tgt.[last_modified] = src.[last_modified]
WHEN NOT MATCHED BY TARGET -- left outer join: A row from the source that has no corresponding row in the target
THEN INSERT
(
[FileID],
[client_id],
[account_name],
[account_currency],
[creation_date],
[last_modified]
)
VALUES
(
src.[FileID],
src.[client_id],
src.[account_name],
src.[account_currency],
src.[creation_date],
src.[last_modified]
);
Since the target table is empty, using MERGE seems to me like hiring a plumber to pour you a glass of water. And MERGE operates only one branch, independently, for every row of a table - it can't see that the key is repeated and so perform an insert and then an update - this betrays that you think SQL always operates on a row-by-row basis, when in fact most operations are performed on the entire set at once.
Why not just insert only the most recent row:
;WITH cte AS
(
SELECT FileID, ... other columns ...,
rn = ROW_NUMBER() OVER (PARTITION BY FileID ORDER BY last_modified DESC)
FROM dbo.AccountSettings_Staging
)
INSERT dbo.AccountSettings(FileID, ... other columns ...)
SELECT FileID, ... other columns ...
FROM cte WHERE rn = 1;
If you have potential for ties on the most recent last_modified, you'll need to find another tie-breaker (not obvious from your sample data).
For future versions, I would say run an UPDATE first:
UPDATE a SET client_id = s.client_id /* , other columns that can change */
FROM dbo.AccountSettings AS a
INNER JOIN dbo.AccountSettings_Staging AS s
ON a.FileID = s.FileID;
(Of course, this will choose an arbitrary row if the source contains multiple rows with the same FileID - you may want to use a CTE here too to make the choice predictable.)
Then add this clause to the INSERT CTE above:
FROM dbo.AccountSettings_Staging AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.AccountSettings
WHERE FileID = s.FileID);
Wrap it all in a transaction at the appropriate isolation level, and you are still avoiding a ton of complicated MERGE syntax, potential bugs, etc.
"I think since my target table is empty, the MERGE statement skips the WHEN MATCHED part"
Well, that's correct, but it's by design - MERGE is not a "progressive" merge. It does not go row-by-row to see if records inserted as part of the MERGE should now be updated. It processes the source in "batches" based on whether or not a match was found in the destination.
You'll need to deal with the "duplicate" records at the source before attempting the MERGE.
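As a rough illustration of dealing with the duplicates at the source, the sketch below loads the four sample rows from the question and keeps only the most recent row per FileID using ROW_NUMBER(). It runs on SQLite through Python's sqlite3 module rather than SQL Server 2008, and the shortened table names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging (FileID INTEGER, client_id INTEGER, account_name TEXT,
                      account_currency TEXT, creation_date TEXT, last_modified TEXT);
CREATE TABLE settings (FileID INTEGER, client_id INTEGER, account_name TEXT,
                       account_currency TEXT, creation_date TEXT, last_modified TEXT);
INSERT INTO staging VALUES
  (210, 12345, 'Cars',     'USD', '2013-11-21', '2013-11-27'),
  (211, 23498, 'Truck',    'USD', '2013-09-22', '2013-11-27'),
  (212, 97652, 'Cars - 1', 'USD', '2013-09-17', '2013-11-27'),
  (210, 12345, 'Cars',     'JPY', '2013-11-21', '2013-11-29');
""")

# Rank rows within each FileID by recency and insert only the newest one.
conn.execute("""
WITH cte AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY FileID ORDER BY last_modified DESC) AS rn
  FROM staging)
INSERT INTO settings
SELECT FileID, client_id, account_name, account_currency, creation_date, last_modified
FROM cte
WHERE rn = 1
""")

loaded = conn.execute(
    "SELECT FileID, account_currency FROM settings ORDER BY FileID").fetchall()
print(loaded)  # [(210, 'JPY'), (211, 'USD'), (212, 'USD')]
```

The target ends up with the three expected rows, and FileID 210 carries the later JPY values, which is the outcome the question was after.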

SQL set operation for update latest record

I am facing a problem and can't find any solution to it. I have a source table (T) where I get data from the field. The data may contain duplicate records with different time stamps. My objective is to take the field data and store it in a final table (F) that has the same structure.
Before inserting, I check whether the key field exists in F. If it does, I update the record in F with the latest one from T. Otherwise I insert the record from T into F. This works fine as long as there is no duplicate record in T. When T has two records with the same key but different time stamps, it always inserts both records (if the key is the primary key, the insert operation fails). I am using the following code for the operation:
IF EXISTS(SELECT * FROM [Final_Table] F, TMP_Source T WHERE T.IKEy =F.IKEY)
begin
print 'Update'
UPDATE [Final_Table]
SET [FULLNAME] = T.FULLNAME
,[FATHERNAME] = T.FATHERNAME
,[MOTHERNAME] = T.MOTHERNAME
,[SPOUSENAME] = T.SPOUSENAME
from TMP_Source T
WHERE Final_Table.IKEy = T.IKEy
and [Final_Table].[RCRD_CRN_DATE] < T.RCRD_CRN_DATE
--Print 'Update'
end
else
begin
INSERT INTO [Final_Table]
([IKEy],[FTIN],[FULLNAME],[FATHERNAME],[MOTHERNAME],[SPOUSENAME]
)
Select IKEy,FTIN,FULLNAME,FATHERNAME,MOTHERNAME,SPOUSENAME
from TMP_Source
end
The problem comes when my T table has entries like these:
IKey RCRD_CRN_DATE ...
123 10-11-2013-12.20.30
123 10-11-2013-12.20.35
345 10-11-2013-01.10.10
All three are inserted in the F table.
Please help.
Remove all but the latest row as a first step (well, in a CTE) using ROW_NUMBER() before attempting to perform the insert:
;WITH UniqueRows AS (
SELECT IKey,RCRD_CRN_DATE,FULLNAME,FATHERNAME,MOTHERNAME,SPOUSENAME,FTIN,
ROW_NUMBER() OVER (PARTITION BY IKey ORDER BY RCRD_CRN_DATE desc) as rn
FROM TMP_Source
)
MERGE INTO Final_Table t
USING (SELECT * FROM UniqueRows WHERE rn = 1) s
ON t.IKey = s.IKey
WHEN MATCHED THEN UPDATE
SET [FULLNAME] = s.FULLNAME
,[FATHERNAME] = s.FATHERNAME
,[MOTHERNAME] = s.MOTHERNAME
,[SPOUSENAME] = s.SPOUSENAME
WHEN NOT MATCHED THEN INSERT
([IKEy],[FTIN],[FULLNAME],[FATHERNAME],[MOTHERNAME],[SPOUSENAME]) VALUES
(s.IKEy,s.FTIN,s.FULLNAME,s.FATHERNAME,s.MOTHERNAME,s.SPOUSENAME);
(I may not have all the columns entirely correct, they seem to keep switching around in your question)
(As you may have noticed, I've also switched to using MERGE since it allows us to express everything as a single declarative statement rather than writing procedural code)
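If you want to see the "keep the latest row, then update or insert" behaviour end to end, here is a small sketch. It is not SQL Server: it uses SQLite through Python's sqlite3 module, where INSERT ... ON CONFLICT DO UPDATE stands in for MERGE. The column list is trimmed and the names and timestamps are invented sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE final_table (ikey INTEGER PRIMARY KEY, fullname TEXT, rcrd_crn_date TEXT);
CREATE TABLE tmp_source  (ikey INTEGER, fullname TEXT, rcrd_crn_date TEXT);
INSERT INTO final_table VALUES (123, 'old name', '2013-11-09 08:00:00');
INSERT INTO tmp_source VALUES
  (123, 'first edit',  '2013-11-10 12:20:30'),
  (123, 'second edit', '2013-11-10 12:20:35'),
  (345, 'new person',  '2013-11-10 01:10:10');
""")

# Step 1 (CTE): keep only the newest source row per key.
# Step 2: insert it, or update the existing row when the key is already present.
conn.execute("""
WITH unique_rows AS (
  SELECT ikey, fullname, rcrd_crn_date,
         ROW_NUMBER() OVER (PARTITION BY ikey ORDER BY rcrd_crn_date DESC) AS rn
  FROM tmp_source)
INSERT INTO final_table (ikey, fullname, rcrd_crn_date)
SELECT ikey, fullname, rcrd_crn_date
FROM unique_rows
WHERE rn = 1
ON CONFLICT (ikey) DO UPDATE SET
  fullname = excluded.fullname,
  rcrd_crn_date = excluded.rcrd_crn_date
""")

final_rows = conn.execute(
    "SELECT ikey, fullname FROM final_table ORDER BY ikey").fetchall()
print(final_rows)  # [(123, 'second edit'), (345, 'new person')]
```

Key 123 is updated once, with the later of its two source rows, and key 345 is inserted, so the duplicate never reaches the final table.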

SQL With... Update

Is there any way to do some kind of "WITH...UPDATE" action on SQL?
For example:
WITH changes AS
(...)
UPDATE table
SET id = changes.target
FROM table INNER JOIN changes ON table.id = changes.base
WHERE table.id = changes.base;
Some context information: What I'm trying to do is to generate a base/target list from a table and then use it to change values in another table (changing values equal to base into target)
Thanks!
You can use merge, with the equivalent of your with clause as the using clause, but because you're updating the field you're joining on you need to do a bit more work; this:
merge into t42
using (
select 1 as base, 10 as target
from dual
) changes
on (t42.id = changes.base)
when matched then
update set t42.id = changes.target;
.. gives error:
ORA-38104: Columns referenced in the ON Clause cannot be updated: "T42"."ID"
Of course, it depends a bit on what you're doing in the CTE, but as long as you can join to your table within it to get the rowid, you can use that for the on clause instead:
merge into t42
using (
select t42.id as base, t42.id * 10 as target, t42.rowid as r_id
from t42
where id in (1, 2)
) changes
on (t42.rowid = changes.r_id)
when matched then
update set t42.id = changes.target;
If I create my t42 table with an id column and have rows with values 1, 2 and 3, this will update the first two to 10 and 20, and leave the third one alone.
It doesn't have to be rowid, it can be a real column if it uniquely identifies the row; normally that would be an id, which would normally never change (as a primary key), you just can't use it and update it at the same time.
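The same idea (locate the row by something stable, then change the id) can be tried outside Oracle. The sketch below is an assumption-laden stand-in: it uses SQLite through Python's sqlite3 module, which has no MERGE but does accept a WITH clause in front of UPDATE and also exposes a rowid. The id column is deliberately not declared as a primary key so that rowid stays independent of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t42 (id INTEGER);
INSERT INTO t42 (id) VALUES (1), (2), (3);
""")

# The CTE builds the base/target list and carries the rowid of each affected row.
# The UPDATE matches on rowid, so it is free to overwrite id itself.
conn.execute("""
WITH changes AS (
  SELECT rowid AS r_id, id * 10 AS target
  FROM t42
  WHERE id IN (1, 2))
UPDATE t42
SET id = (SELECT target FROM changes WHERE changes.r_id = t42.rowid)
WHERE rowid IN (SELECT r_id FROM changes)
""")

ids = [row[0] for row in conn.execute("SELECT id FROM t42 ORDER BY rowid")]
print(ids)  # [10, 20, 3]
```

As in the Oracle version, the first two rows become 10 and 20 and the third is left alone.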