Roll up sparse table in vertica - sql

I'm using vertica.
Problem:
I have sparse table (user_session_tmp2). Row contains session_token and a list (about 15 fields) of params. Several rows can describe one session_token. I need to get table where only one row describes one session (i.e. merge all data for one session in one row)
Obvious solution is:
merge /*+ direct */ into user_session tgt using user_session_tmp2 src on src.session_token=tgt.session_token
when matched then
update set time = (case when src.time> tgt.time then tgt.time else src.time)
device_id = (case when src.device_id is not null then src.device_id else tgt.device_id)
when not matched then
insert values(src.session_token, src.user_id, src.time, src.client_time, src.device_id, src.app_version, ... );
Unfortunately, it's not allowed to use case expression in update query.

Could you try the following solution:
select session_token, max(col1), max(col2),.., max(colN)
from user_session
group by session_token

Related

Improving insert query for SCD2 solution

I have two insert statements. The first query is to inserta new row if the id doesn't exist in the target table. The second query inserts to the target table only if the joined id hash value is different (indicates that the row has been updated in the source table) and the id in the source table is not null. These solutions are meant to be used for my SCD2 solution, which will be used for inserts of hundreds thousands of rows. I'm trying not to use the MERGE statement for practices.
The columns "Current" value 1 indicates that the row is new and 0 indicates that the row has expired. I use this information later to expire my rows in the target table with my update queries.
Besides indexing is there a more competent and effective way to improve my insert queries in a way that resembles the like of the SCD2 merge statement for inserting new/updated rows?
Query:
Query 1:
INSERT INTO TARGET
SELECT Name,Middlename,Age, 1 as current,Row_HashValue,id
from Source s
Where s.id not in (select id from TARGET) and s.id is not null
Query 2:
INSERT INTO TARGET
SELECT Name,Middlename,Age,1 as current ,Row_HashValue,id
FROM SOURCE s
LEFT JOIN TARGET t ON s.id = t.id
AND s.Row_HashValue = t.Row_HashValue
WHERE t.Row_HashValue IS NULL and s.ID IS NOT NULL
You can use WHERE NOT EXISTS, and have just one INSERT statement:
INSERT INTO TARGET
SELECT Name,Middlename,Age,1 as current ,Row_HashValue,id
FROM SOURCE s
WHERE NOT EXISTS (
SELECT 1
FROM TARGET t
WHERE s.id = t.id
AND s.Row_HashValue = t.Row_HashValue)
AND s.ID IS NOT NULL;

Need to filter target table in SQL merge query

I am using BigQuery SQL to execute a merge query. Here is the query
MERGE `dataset.target_table` AS Target
USING
(
select
*
from
`dataset.source_table` s_data
WHERE
trans_id is not null and user_id is not null
)
AS Source
ON Source.trans_id = Target.trans_id and Target.start_date IN
(
select distinct start_date from `dataset.source_table`
)
WHEN NOT MATCHED BY Target THEN
INSERT (...)
VALUES (...)
WHEN MATCHED and Target.user_id is null THEN
UPDATE SET ...
I am getting an issue with the using a subquery in the ON statement. In Subquery not supported by join predicate
The reason I have this subquery is because I want to filter the Target table before the Merge happens or bigquery throws a OOM exception. The target table is 10billion rows while the source is 200m rows. I don't need the subquery in the ON statement, but it's a hacky way to esentially filter the Target table before the merge happens. Is there some other approach I can take?
I tried the approach here - https://dba.stackexchange.com/questions/30633/merge-a-subset-of-the-target-table utilizing
WITH TARGET AS
(
SELECT *
FROM `dataset.target_table`
WHERE <filter target_table here>
)
MERGE INTO TARGET
...
but it seems this isn't supported by BigQuery and gave a syntax error. How can I filter my Target table before the Merge happens so it doesn't need to load the entire table in memory?
I'm a little confused. Can you just take the start_date from the matching row in source?
MERGE `dataset.target_table` AS Target USING
(select *
from `dataset.source_table` s_data
WHERE trans_id is not null and user_id is not null
) AS Source
ON Source.trans_id = Target.trans_id and
Target.start_date = source.start_date

Merging deltas with duplicate keys

I'm trying to perform a merge into a target table in our Snowflake instance where the source data contains change data with a field denoting the at source DML operation i.e I=Insert,U=Update,D=Delete.
The problem is dealing with the fact the log (deltas) source might contain multiple updates for the same record. The merge I've constructed bombs out complaining about duplicate keys.
I'm struggling to think of a solution without going the likes of GROUP BY and MAX on the updates. I've done a similar setup with Oracle and the AND clause on the MATCH was enough.
MERGE INTO "DB"."SCHEMA"."TABLE" t
USING (
SELECT * FROM "DB"."SCHEMA"."TABLE_LOG"
ORDER BY RECORD_TIMESTAMP ASC
) s ON t.RECORD_KEY = s.RECORD_KEY
WHEN MATCHED AND s.RECORD_OPERATION = 'D' THEN DELETE
WHEN MATCHED AND s.RECORD_OPERATION = 'U' THEN UPDATE
SET t.ID=COALESCE(s.ID,t.ID),
t.CREATED_AT=COALESCE(s.CREATED_AT,t.CREATED_AT),
t.PRODUCT=COALESCE(s.PRODUCT,t.PRODUCT),
t.SHOP_ID=COALESCE(s.SHOP_ID,t.SHOP_ID),
t.UPDATED_AT=COALESCE(s.UPDATED_AT,t.UPDATED_AT)
WHEN NOT MATCHED AND s.RECORD_OPERATION = 'I' THEN
INSERT (RECORD_KEY, ID, CREATED_AT, PRODUCT,
SHOP_ID, UPDATED_AT)
VALUES (s.RECORD_KEY, s.ID, s.CREATED_AT, s.PRODUCT,
s.SHOP_ID, s.UPDATED_AT);
Is there a way to rewrite the above merge so that it works as is?
The Snowflake docs show the ability for the AND case predicate during the match clause, it sounds like you tried this and it's not working because of the duplicates, right?
https://docs.snowflake.net/manuals/sql-reference/sql/merge.html#matchedclause-for-updates-or-deletes
There is even an example there which is using the AND command:
merge into t1 using t2 on t1.t1key = t2.t2key
when matched and t2.marked = 1 then delete
when matched and t2.isnewstatus = 1 then update set val = t2.newval, status = t2.newstatus
when matched then update set val = t2.newval
when not matched then insert (val, status) values (t2.newval, t2.newstatus);
I think you are going to have to get the "last record" per key and use that as your update, or process these serially which will be pretty slow...
Another thing to look at would be to try to see if you can apply the last_value( ) function to each column, where you order by your timestamp and partition over your key. If you do that in your inline view, that might work.
I hope this helps, I have a feeling it won't help much...Rich
UPDATE:
I found the following: https://docs.snowflake.net/manuals/sql-reference/parameters.html#error-on-nondeterministic-merge
If you run the following command before your merge, I think you'll be OK (testing required of course):
ALTER SESSION SET ERROR_ON_NONDETERMINISTIC_MERGE=false;

MERGE - to replace ON Duplicate with sql server

I've predominantly used mySQL so moving over to azure and sql server I realise that on duplicate does not work.
I'm trying to do this:
INSERT INTO records (jid, pair, interval, entry) VALUES (1, 'alpha', 3, 'unlimited') ON DUPLICATE KEY UPDATE entry = "limited";
But of course on duplicate key isn't allowed here. So MERGE is the right form.
I've looked at:
https://technet.microsoft.com/en-gb/library/bb522522(v=sql.105).aspx
But honestly the example is a bit excessive and eye watering. Could someone dumb it down for me to fit my example so I can understand it better?
In order to do the merge you need some form of source table/table var for the merge statement. Then you can do the merging. So something along the lines of this maybe (note: not completely syntax checked, apologies in advance):
WITH src AS (
-- This should be your source
SELECT 1 AS Id, 2 AS Val
)
-- The above is not neccessary if you have a source table
MERGE Target -- the detination table, so in your case records
USING src -- as defined above
ON (Target.Id = src.Id) -- how do we join the tables
WHEN NOT MATCHED BY TARGET
-- if we dont match, what do to the destination table. This case insert it.
THEN INSERT(Id, Val) VALUES(src.Id, src.Val)
WHEN MATCHED
-- what do we do if we match. This case update Val
THEN UPDATE SET Target.Val = src.Val;
Don't forget to read the proper syntax page: https://msdn.microsoft.com/en-us/library/bb510625.aspx
I think this translates to your example (tm):
WITH src AS (
-- This should be your source
SELECT 1 AS jid, 'alpha' AS pair, 3 as 'interval'
)
MERGE records -- the detination table, so in your case records
USING src -- as defined above
ON (records.Id = src.Id) -- how do we join the tables
WHEN NOT MATCHED BY TARGET
-- if we dont match, what do to the destination table. This case insert it.
THEN INSERT(jid, pair, interval, entry) VALUES(src.jid, src.pair, src.interval, 'unlimited')
WHEN MATCHED
-- what do we do if we match. This case update Val
THEN UPDATE SET records.entry = 'limited';

Oracle SQL, trying to get one value from a select/join to use to update one column in one table?

I have one table with the following columns:
T_RESOLVED_DATE
I_HOUSEHOLD_NUMBER
I_RESOLVED_SET_NUMBER
I_STATION_CODE
I_RESOLVED_START_MIN
I_DURATION
I_PERSON_NUMBER
I_COVIEW_DEMO_ID
Initially, I_COVIEW_DEMO_ID is set to null.
Then I have another table with the following columns:
T_RESOLVED_DATE
I_HOUSEHOLD_NUMBER
I_PERSON_NUMBER
I_AGE
T_GENDER
I_COVIEW_DEMO_ID
I am trying to update I_COVIEW_DEMO_ID in the first table by using the value of I_COVIEW_DEMO_ID in the second table where the T_RESOLVED_DATE, I_HOUSEHOLD_NUMBER, and I_PERSON_NUMBER are equal in both tables. The first table may contain multiple rows with the same DATE, HOUSEHOLD_NUMBER, and PERSON_NUMBER, because the rows can vary by the rest of the columns.
I have tried to do a select and a group by which seems to get me part way there, but I am getting a "single-row subquery returns more than one row" error when I try to update the columns in the first table. This is what I've tried, along with variations of it:
UPDATE
Table1
SET
I_COVIEW_DEMO_ID =
(SELECT
b.I_COVIEW_DEMO_ID
FROM Table1 a,
Table2 b
WHERE a.I_HOUSEHOLD_NUMBER = b.I_HOUSEHOLD_NUMBER AND
a.I_PERSON_NUMBER = b.I_PERSON_NUMBER AND
a.T_RESOLVED_DATE = b.T_RESOLVED_DATE
GROUP BY b.I_COVIEW_DEMO_ID);
Any suggestions?
I was able to get it to work using this statement:
MERGE INTO table1 a
USING
(
SELECT DISTINCT
T_RESOLVED_DATE,
I_HOUSEHOLD_NUMBER,
I_PERSON_NUMBER,
I_COVIEW_DEMO_ID
FROM
table2
) b
ON
(
a.T_RESOLVED_DATE = b.T_RESOLVED_DATE
AND a.I_HOUSEHOLD_NUMBER = b.I_HOUSEHOLD_NUMBER
AND a.I_PERSON_NUMBER = b.I_PERSON_NUMBER
) WHEN MATCHED THEN
UPDATE SET
a.I_COVIEW_DEMO_ID = b.I_COVIEW_DEMO_ID;
As per our discussion on the comments this would be a simple PLSQL block to do what you need. I'm doing direct from my head without test, so you may need to fix some sintaxe mistake.
BEGIN
FOR rs IN ( SELECT I_HOUSEHOLD_NUMBER,
I_PERSON_NUMBER,
I_COVIEW_DEMO_ID,
T_RESOLVED_DATE
FROM Table2 ) LOOP
UPDATE Table1
SET I_COVIEW_DEMO_ID = rs.I_COVIEW_DEMO_ID
WHERE I_PERSON_NUMBER = rs.I_PERSON_NUMBER
AND I_HOUSEHOLD_NUMBER = rs.I_HOUSEHOLD_NUMBER
AND T_RESOLVED_DATE = rs.T_RESOLVED_DATE;
END LOOP;
--commit after all updates, if there is many rows you should consider in
--making commits by blocks. Define a count and increment it whithin the for
--after some number of updates you commit and restart the counter
COMMIT;
END;