Merging deltas with duplicate keys - SQL

I'm trying to perform a merge into a target table in our Snowflake instance, where the source contains change data with a field denoting the DML operation performed at source, i.e. I=Insert, U=Update, D=Delete.
The problem is dealing with the fact that the log (deltas) source might contain multiple updates for the same record. The merge I've constructed bombs out complaining about duplicate keys.
I'm struggling to think of a solution short of something like GROUP BY and MAX on the updates. I've done a similar setup in Oracle, where the AND clause on the MATCHED branch was enough.
MERGE INTO "DB"."SCHEMA"."TABLE" t
USING (
SELECT * FROM "DB"."SCHEMA"."TABLE_LOG"
ORDER BY RECORD_TIMESTAMP ASC
) s ON t.RECORD_KEY = s.RECORD_KEY
WHEN MATCHED AND s.RECORD_OPERATION = 'D' THEN DELETE
WHEN MATCHED AND s.RECORD_OPERATION = 'U' THEN UPDATE
SET t.ID=COALESCE(s.ID,t.ID),
t.CREATED_AT=COALESCE(s.CREATED_AT,t.CREATED_AT),
t.PRODUCT=COALESCE(s.PRODUCT,t.PRODUCT),
t.SHOP_ID=COALESCE(s.SHOP_ID,t.SHOP_ID),
t.UPDATED_AT=COALESCE(s.UPDATED_AT,t.UPDATED_AT)
WHEN NOT MATCHED AND s.RECORD_OPERATION = 'I' THEN
INSERT (RECORD_KEY, ID, CREATED_AT, PRODUCT,
SHOP_ID, UPDATED_AT)
VALUES (s.RECORD_KEY, s.ID, s.CREATED_AT, s.PRODUCT,
s.SHOP_ID, s.UPDATED_AT);
Is there a way to rewrite the above merge so that it works as is?

The Snowflake docs show the ability to add an AND predicate to the matched clause; it sounds like you tried this and it's not working because of the duplicates, right?
https://docs.snowflake.net/manuals/sql-reference/sql/merge.html#matchedclause-for-updates-or-deletes
There is even an example there which uses the AND keyword:
merge into t1 using t2 on t1.t1key = t2.t2key
when matched and t2.marked = 1 then delete
when matched and t2.isnewstatus = 1 then update set val = t2.newval, status = t2.newstatus
when matched then update set val = t2.newval
when not matched then insert (val, status) values (t2.newval, t2.newstatus);
I think you are going to have to get the "last record" per key and use that as your update, or process these serially, which will be pretty slow...
Another thing to look at would be the last_value() window function applied to each column, ordering by your timestamp and partitioning by your key. If you do that in your inline view, that might work, as sketched below.
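For instance, a minimal sketch of the "last record per key" idea, deduplicating the log inside the inline view with ROW_NUMBER() and Snowflake's QUALIFY clause (table and column names taken from your statement):
MERGE INTO "DB"."SCHEMA"."TABLE" t
USING (
    -- Keep only the most recent log row per key, so each target row
    -- is matched by at most one source row.
    SELECT *
    FROM "DB"."SCHEMA"."TABLE_LOG"
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY RECORD_KEY
        ORDER BY RECORD_TIMESTAMP DESC
    ) = 1
) s ON t.RECORD_KEY = s.RECORD_KEY
WHEN MATCHED AND s.RECORD_OPERATION = 'D' THEN DELETE
WHEN MATCHED AND s.RECORD_OPERATION = 'U' THEN UPDATE
    SET t.ID = COALESCE(s.ID, t.ID)
    -- ...remaining columns as in the original statement...
WHEN NOT MATCHED AND s.RECORD_OPERATION = 'I' THEN
    INSERT (RECORD_KEY, ID, CREATED_AT, PRODUCT, SHOP_ID, UPDATED_AT)
    VALUES (s.RECORD_KEY, s.ID, s.CREATED_AT, s.PRODUCT, s.SHOP_ID, s.UPDATED_AT);
One caveat: collapsing to the last row per key discards earlier operations in the same batch, so a key that is inserted and then updated within one delta load would surface only as a 'U' row and hit neither branch; whether that matters depends on how your log is produced.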
I hope this helps, though I have a feeling it won't help much... Rich
UPDATE:
I found the following: https://docs.snowflake.net/manuals/sql-reference/parameters.html#error-on-nondeterministic-merge
If you run the following command before your merge, I think you'll be OK (testing required, of course):
ALTER SESSION SET ERROR_ON_NONDETERMINISTIC_MERGE=false;

Related

MERGE - to replace ON DUPLICATE with SQL Server

I've predominantly used MySQL, so moving over to Azure and SQL Server I realise that ON DUPLICATE KEY does not work.
I'm trying to do this:
INSERT INTO records (jid, pair, interval, entry) VALUES (1, 'alpha', 3, 'unlimited') ON DUPLICATE KEY UPDATE entry = 'limited';
But of course ON DUPLICATE KEY isn't allowed here, so MERGE is the right form.
I've looked at:
https://technet.microsoft.com/en-gb/library/bb522522(v=sql.105).aspx
But honestly the example is a bit excessive and eye-watering. Could someone dumb it down to fit my example so I can understand it better?
In order to do the merge you need some form of source table/table variable for the MERGE statement. Then you can do the merging. So, something along the lines of this maybe (note: not completely syntax-checked, apologies in advance):
WITH src AS (
    -- This should be your source
    SELECT 1 AS Id, 2 AS Val
)
-- The above is not necessary if you have a source table
MERGE Target -- the destination table, so in your case records
USING src -- as defined above
ON (Target.Id = src.Id) -- how we join the tables
WHEN NOT MATCHED BY TARGET
    -- if we don't match, what to do to the destination table. In this case, insert.
    THEN INSERT (Id, Val) VALUES (src.Id, src.Val)
WHEN MATCHED
    -- what we do if we match. In this case, update Val.
    THEN UPDATE SET Target.Val = src.Val;
Don't forget to read the proper syntax page: https://msdn.microsoft.com/en-us/library/bb510625.aspx
I think this translates to your example (tm):
WITH src AS (
    -- This should be your source
    SELECT 1 AS jid, 'alpha' AS pair, 3 AS [interval]
)
MERGE records -- the destination table
USING src -- as defined above
ON (records.jid = src.jid) -- how we join the tables
WHEN NOT MATCHED BY TARGET
    -- if we don't match, insert into the destination table
    THEN INSERT (jid, pair, interval, entry) VALUES (src.jid, src.pair, src.[interval], 'unlimited')
WHEN MATCHED
    -- if we do match, update entry
    THEN UPDATE SET records.entry = 'limited';

Roll up sparse table in Vertica

I'm using Vertica.
Problem:
I have a sparse table (user_session_tmp2). Each row contains a session_token and a list of about 15 parameter fields, and several rows can describe one session_token. I need to get a table where only one row describes each session (i.e. merge all the data for one session into one row).
The obvious solution is:
merge /*+ direct */ into user_session tgt using user_session_tmp2 src on src.session_token = tgt.session_token
when matched then
    update set time = (case when src.time > tgt.time then tgt.time else src.time end),
               device_id = (case when src.device_id is not null then src.device_id else tgt.device_id end)
when not matched then
    insert values (src.session_token, src.user_id, src.time, src.client_time, src.device_id, src.app_version, ... );
Unfortunately, it's not allowed to use a case expression in the update clause.
Could you try the following solution:
select session_token, max(col1), max(col2),.., max(colN)
from user_session
group by session_token
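Expanding on that idea, a sketch of how the aggregation could feed the merge (column names beyond session_token, time, and device_id are assumptions, and depending on your Vertica version the aggregation may need to be materialized into a temp table rather than a subquery). Pre-aggregating also sidesteps the CASE restriction, because the conditional logic moves into the source query:
merge /*+ direct */ into user_session tgt
using (
    -- Collapse the sparse rows to one row per session first,
    -- so each target row is matched by exactly one source row.
    select session_token,
           min(time)      as time,      -- earliest timestamp per session
           max(device_id) as device_id  -- max() skips NULLs, so any non-NULL value wins
    from user_session_tmp2
    group by session_token
) src on src.session_token = tgt.session_token
when matched then
    update set time = src.time, device_id = src.device_id
when not matched then
    insert values (src.session_token, src.time, src.device_id);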

SQL set operation for update latest record

I am facing a problem and can't find any solution. I have a source table (T) where I get data from the field. The data may contain duplicate records with different timestamps. My objective is to take the field data and store it in a final table (F) having the same structure.
Before inserting, I check whether the key field already exists in F: if yes, I update the record in F with the latest one from T; otherwise I insert the record into F from T. This works fine as long as there is no duplicate record in T. In case T has two records with the same key but different timestamps, it always inserts both records (and if the key is a primary key, the insert operation fails). I am using the following code for the operation -
IF EXISTS (SELECT * FROM [Final_Table] F, TMP_Source T WHERE T.IKEy = F.IKEY)
BEGIN
    PRINT 'Update'
    UPDATE [Final_Table]
    SET [FULLNAME] = T.FULLNAME
       ,[FATHERNAME] = T.FATHERNAME
       ,[MOTHERNAME] = T.MOTHERNAME
       ,[SPOUSENAME] = T.SPOUSENAME
    FROM TMP_Source T
    WHERE Final_Table.IKEy = T.IKEy
      AND [Final_Table].[RCRD_CRN_DATE] < T.RCRD_CRN_DATE
END
ELSE
BEGIN
    INSERT INTO [Final_Table]
        ([IKEy], [FTIN], [FULLNAME], [FATHERNAME], [MOTHERNAME], [SPOUSENAME])
    SELECT IKEy, FTIN, FULLNAME, FATHERNAME, MOTHERNAME, SPOUSENAME
    FROM TMP_Source
END
The problem comes when my T table has entries like -
IKey RCRD_CRN_DATE ...
123 10-11-2013-12.20.30
123 10-11-2013-12.20.35
345 10-11-2013-01.10.10
All three are inserted in the F table.
Please help.
Remove all but the latest row as a first step (well, in a CTE) using ROW_NUMBER() before attempting to perform the insert:
;WITH UniqueRows AS (
    SELECT IKEy, RCRD_CRN_DATE, FULLNAME, FATHERNAME, MOTHERNAME, SPOUSENAME, FTIN,
           ROW_NUMBER() OVER (PARTITION BY IKEy ORDER BY RCRD_CRN_DATE DESC) AS rn
    FROM TMP_Source
)
MERGE INTO Final_Table t
USING (SELECT * FROM UniqueRows WHERE rn = 1) s
ON t.IKEy = s.IKEy
WHEN MATCHED THEN UPDATE
    SET [FULLNAME] = s.FULLNAME
       ,[FATHERNAME] = s.FATHERNAME
       ,[MOTHERNAME] = s.MOTHERNAME
       ,[SPOUSENAME] = s.SPOUSENAME
WHEN NOT MATCHED THEN INSERT
    ([IKEy], [FTIN], [FULLNAME], [FATHERNAME], [MOTHERNAME], [SPOUSENAME]) VALUES
    (s.IKEy, s.FTIN, s.FULLNAME, s.FATHERNAME, s.MOTHERNAME, s.SPOUSENAME);
(I may not have all the columns entirely correct, they seem to keep switching around in your question)
(As you may have noticed, I've also switched to using MERGE, since it allows us to express everything as a single declarative statement rather than writing procedural code.)

How to update a PostgreSQL table with a count of duplicate items

I found two bugs in a program that created a lot of duplicate values:
an 'index' was created instead of a 'unique index'
a duplication check wasn't integrated into one of 4 twisted routines
So I need to go in and clean up my database.
Step one is to decorate the table with a count of all the duplicate values (next I'll look into finding the first value, and then migrating everything over).
The code below works; I just recall doing a similar "update from select count" on the same table years ago in half as much code.
Is there a better way to write this?
UPDATE
shared_link
SET
is_duplicate_of_count = subquery.is_duplicate_of_count
FROM
(
SELECT
count(url) AS is_duplicate_of_count
, url
FROM
shared_link
WHERE
shared_link.url = url
GROUP BY
url
) AS subquery
WHERE
shared_link.url = subquery.url
;
Your query is fine, generally, except for the pointless (but also harmless) WHERE clause in the subquery:
UPDATE shared_link
SET is_duplicate_of_count = subquery.is_duplicate_of_count
FROM (
SELECT url
, count(url) AS is_duplicate_of_count
FROM shared_link
-- WHERE shared_link.url = url
GROUP BY url
) AS subquery
WHERE shared_link.url = subquery.url;
The commented clause is the same as
WHERE shared_link.url = shared_link.url
and therefore only eliminating NULL values (because NULL = NULL is not TRUE), which is most probably neither intended nor needed in your setup.
Other than that, you can only shorten your code further with aliases and shorter names:
UPDATE shared_link s
SET ct = u.ct
FROM (
SELECT url, count(url) AS ct
FROM shared_link
GROUP BY 1
) AS u
WHERE s.url = u.url;
In PostgreSQL 9.1 or later you might be able to do the whole operation (identify dupes, consolidate data, remove dupes) in one SQL statement with aggregate and window functions and data-modifying CTEs - thereby eliminating the need for an additional column to begin with.
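As a rough sketch of that one-statement cleanup (assuming a surrogate primary key column id; which row counts as the one to keep, and how data is consolidated first, depend on your schema):
WITH ranked AS (
    SELECT id,
           row_number() OVER (PARTITION BY url ORDER BY id) AS rn
    FROM shared_link
)
DELETE FROM shared_link s
USING ranked r
WHERE s.id = r.id
  AND r.rn > 1;  -- keep the first row per url, delete the rest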

Update multiple rows using one query

Can I update multiple rows using one query?
How to union following queries:
UPDATE tablename SET col1='34355' WHERE id='2'
UPDATE tablename SET col1='152242' WHERE id='44'
You can use a virtual map table for this update.
update tablename
inner join (
select '34355' col1, '2' id union all
select '152242' col1, '44' id
) map on map.id = tablename.id
set tablename.col1 = map.col1
Using this pattern allows for easy expansion (just add rows to the map). It also allows MySQL to more predictably choose an index on tablename.id for the normal JOIN operation.
Can you? Sure. Should you? No way.
Think about the person looking at your code in five years. What's more readable, this:
UPDATE tablename SET col1='34355' WHERE id='2';
UPDATE tablename SET col1='152242' WHERE id='44';
or this (The Scrum Meister's answer):
UPDATE tablename SET col1 = IF(id='2', '34355','152242') WHERE id='2' OR id='44';
The second one is shorter, but it's a challenge to figure out exactly what it's doing. If you're worried about race conditions, make it a single transaction (in most modern DBMS):
BEGIN;
UPDATE tablename SET col1='34355' WHERE id='2';
UPDATE tablename SET col1='152242' WHERE id='44';
COMMIT;
That way you can be guaranteed no other query will run when row 2 is updated but row 44 is not.
You can use an OR clause combined with the IF() function (or CASE WHEN ... for other RDBMSs):
UPDATE tablename SET col1 = IF(id='2', '34355','152242')
WHERE id='2' OR id='44'
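For portability, the equivalent written with a standard CASE expression might look like this:
UPDATE tablename
SET col1 = CASE id WHEN '2'  THEN '34355'
                   WHEN '44' THEN '152242'
           END
WHERE id IN ('2', '44');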
Generally, the only way you can update multiple rows in a single query is if your WHERE clause matches multiple rows... and then every row will have the same values set.
Past that you can do funky stuff with expressions in your SET clauses, but generally it's cleaner to do multiple queries, unless there's a very specific reason you can't.