CDC with Python and MERGE on BigQuery - google-bigquery

I have written a pipeline using Apache Beam and Google Dataflow that sends changes from MongoDB to BigQuery. I have a BigQuery log table like this:
table          | operation type                       | timestamp
[all columns]  | [insert / update / delete / replace] | timestamp
and a "normal" table without the operation and timestamp column. My goal is to merge the src table (log) and the target table. The problem is as following, when the second last entry to a field is not null and the last one is, how can I check this in the merge statement? For example in other databases you can do something like
create function get_sec_last_value(target_id) as (
  (
    select as struct *
    from (
      select
        *,
        row_number() over (order by timestamp desc) as number
      from my_table
      where id = target_id
    )
    where number = 2
  )
);
merge target trg
using source as src
on trg.id = src.id
...
update set id = case
  when (get_sec_last_value(src.id).id is not null and src.id is not null)
    or (get_sec_last_value(src.id).id is null and src.id is not null) then src.id
  when (get_sec_last_value(src.id).id is not null and src.id is null)
    or (get_sec_last_value(src.id).id is null and src.id is null) then null
end
...
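For what it's worth, the second-to-last row per id can be expressed directly in BigQuery standard SQL with a window function, and the result could then feed the USING side of the MERGE. This is only a minimal sketch; my_dataset.change_log stands in for the log table:
select * except(rn)
from (
  select
    *,
    row_number() over (partition by id order by timestamp desc) as rn
  from my_dataset.change_log
)
where rn = 2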
Has anybody faced the same problem, or does anyone have an idea how to solve it?
Thanks in advance

Related

Insert Non-Duplicate records to copy table with updated date

I am currently working on a project where I have two SQL Server databases and need to copy new records into an archive database, appending an updated date. Example:
Existing DB: dbo.A.Category(Id, Name)
Copy new records (no duplicates) to:
Archive DB: dbo.B.Category(Id, Name, ArchiveDate)
How do I copy only changed records from the existing database to the archive database? This is in SQL Server.
You can use the EXCEPT operator for this. For example:
INSERT INTO archiveCategory (id, name, creationdate)
SELECT id, name, current_timestamp
FROM (
    SELECT id, name
    FROM myDB.dbo.category
    EXCEPT
    SELECT id, name
    FROM archiveDB.dbo.category a
    WHERE creationdate = (SELECT max(creationdate) FROM archiveDB.dbo.category a2 WHERE a.id = a2.id)
) delta
You can achieve this with a MERGE statement.
I have made the following assumptions about what you're trying to achieve:
the [Id] column in dbo.A.Category contains unique values
the [Id] column in dbo.B.Category is not an identity column and its values correspond to matching [Id] values in dbo.A.Category
you only care if updated [name] values in dbo.A.Category have been changed, not if they've been updated with the same value (e.g. not if 'Bob' is changed to 'Bob')
you do not want deleted rows from dbo.A.Category to be likewise deleted from dbo.B.Category
MERGE dbo.B.Category AS tgt
USING dbo.A.Category AS src
    ON tgt.[Id] = src.[Id]
WHEN MATCHED
    AND tgt.[Name] <> src.[Name]
    THEN UPDATE
        SET [Name] = src.[Name]
          , [ArchiveDate] = SYSDATETIME()
WHEN NOT MATCHED BY TARGET
    THEN INSERT ( [Id], [Name], [ArchiveDate] )
         VALUES ( src.[Id], src.[Name], SYSDATETIME() ) ;
GO
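If, contrary to the last assumption above, rows deleted from dbo.A.Category should also be removed from the archive, the same MERGE could presumably take one extra clause just before the terminating semicolon (note that this physically deletes archive rows, which may defeat the purpose of an archive):
WHEN NOT MATCHED BY SOURCE
    THEN DELETE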

SQL - trigger select into after update/insert

I have a table called Audit_Data, and from time to time an update comes in. Every single update consists of around 300 rows (around 20 columns per row), and all of the rows from the same update share the same Audit_ID.
I have a SELECT that pulls out only the data that is relevant to me. It basically transforms the 300x20 grid of fields into one row of data.
Is there a way to create a SQL trigger that would perform the SELECT on the updated Audit_Data table and insert the selected data into a table named Audit_Final?
This is the SELECT statement I use to pull out the relevant data:
SELECT main.Audit_ID
    ,main.Item_19
    ,main.Item_1
    ,main.Item_7
    ,main.Item_8
    ,Item_17
    ,main.Item_13
    ,macaddr.Item_2
    ,macaddr.Item_16
    ,t1.Item_1
FROM dbo.[Audit_Data] AS main
LEFT JOIN
(
    SELECT Audit_ID, Item_2, Item_16
    FROM dbo.[Audit_Data] AS macaddr
    WHERE
        (Item_2 NOT LIKE 'Hyper-V%')
        AND (Item_17 = 'connected')
        AND (Item_18 IN ('10000Mbps', '1000MBps') OR ITEM_9 IS NOT NULL AND ITEM_10 IS NOT NULL)
        AND (Item_18 != '100Mbps')
) macaddr ON main.Audit_ID = macaddr.Audit_ID
LEFT JOIN
(
    SELECT Audit_ID, Category_ID, Item_1, Record_ordinal
    FROM dbo.[Audit_Data] AS t1
    WHERE
        Item_1 = 'Automatyczna konfiguracja sieci przewodowej' OR Item_1 = 'dot3svc' OR Item_1 = 'Wired AutoConfig'
        AND Item_3 = 'Running'
        AND Category_ID = '4100'
) t1 ON main.Audit_ID = t1.Audit_ID
WHERE
    main.Record_Ordinal = '2'
ORDER BY main.Audit_ID
Based on the author's comment, this is what is required here:
CREATE TRIGGER [TR_Audit_Data] ON [Audit_Data]
AFTER UPDATE
AS
BEGIN
    INSERT INTO [Audit_Final] (column_1, column_2, ... all columns you have on the target table)
    /*
    Paste your select query here
    */
END
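One caveat worth adding: an AFTER UPDATE trigger fires once per statement, so the pasted SELECT should normally be restricted to the rows that were actually touched, for example by filtering on the Audit_IDs present in the inserted pseudo-table. A rough sketch, with the column lists abbreviated:
INSERT INTO [Audit_Final] (Audit_ID, Item_19 /* , ... remaining columns */)
SELECT main.Audit_ID, main.Item_19 /* , ... remaining columns */
FROM dbo.[Audit_Data] AS main
WHERE main.Record_Ordinal = '2'
  AND main.Audit_ID IN (SELECT Audit_ID FROM inserted);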

MERGE UPSERT DOES NOT WORK WHEN NOT MATCHED

ABC_TABLE holds history data based on the UPDATED_TS column.
The requirement is to load data from a CSV file, and the conditions are as below:
Fetch the latest EMPLOYEE_NAME based on UPDATED_TS (query inside the USING condition)
In the ON condition, check whether the EMPLOYEE_NAME in the CSV file matches the EMPLOYEE_NAME fetched in the USING query; if it does not match, a new row should be inserted
If a new TABLE_ID is present in the CSV file and the TABLE_ID does not exist in ABC_TABLE, a new record should be inserted
When executing the query below, no rows get inserted for a new TABLE_ID:
MERGE INTO ABC_TABLE T
USING (SELECT EMPLOYEE_NAME
FROM ABC_TABLE
WHERE TABLE_ID = ?
AND UPDATED_TS =
(SELECT MAX(UPDATED_TS) FROM ABC_TABLE WHERE TABLE_ID = ?)) S
ON ((S.EMPLOYEE_NAME IS NULL AND ? IS NULL) OR ? = S.EMPLOYEE_NAME )
WHEN NOT MATCHED THEN
INSERT
/*
insert statement here
*/
Any help would be greatly appreciated.
The merge documentation says:
Use the ON clause to specify the condition upon which the MERGE operation either updates or inserts. For each row in the target table for which the search condition is true, Oracle Database updates the row with corresponding data from the source table. If the condition is not true for any rows, then the database inserts into the target table based on the corresponding source table row.
For both the matched and not-matched cases, there has to be a row in the source table (your S subquery in this case) for anything to happen. If the passed-in values don't exist then your subquery finds no rows, so the source table is empty, and thus nothing happens.
You could add an aggregate function call in your subquery so it always finds something, and use that (e.g. a count of found records) to decide if it's matched; something like:
MERGE INTO ABC_TABLE T
USING (SELECT :table_id AS TABLE_ID, :employee_name AS EMPLOYEE_NAME, count(*) AS FOUND
FROM ABC_TABLE
WHERE TABLE_ID = :table_id
AND ((EMPLOYEE_NAME IS NULL AND :employee_name IS NULL)
OR EMPLOYEE_NAME = :employee_name)
AND UPDATED_TS =
(SELECT MAX(UPDATED_TS) FROM ABC_TABLE WHERE TABLE_ID = :table_id)) S
ON (S.FOUND > 0)
WHEN NOT MATCHED THEN
INSERT (table_id, updated_ts, employee_name)
VALUES (S.TABLE_ID, systimestamp, S.EMPLOYEE_NAME)
But as you're only inserting and never updating, why not just use an insert?
INSERT INTO ABC_TABLE T (table_id, updated_ts, employee_name)
SELECT :table_id, systimestamp, :employee_name
FROM DUAL
WHERE NOT EXISTS (
SELECT null
FROM ABC_TABLE T2
WHERE T2.TABLE_ID = :table_id
AND ((T2.EMPLOYEE_NAME IS NULL AND :employee_name IS NULL)
OR T2.EMPLOYEE_NAME = :employee_name)
AND UPDATED_TS =
(SELECT MAX(UPDATED_TS) FROM ABC_TABLE WHERE TABLE_ID = :table_id)
)
I'm not sure you actually want the UPDATED_TS check in either case though.
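For illustration, dropping that UPDATED_TS check would reduce the plain INSERT above to something like this (a sketch, reusing the same bind variables):
INSERT INTO ABC_TABLE (table_id, updated_ts, employee_name)
SELECT :table_id, systimestamp, :employee_name
FROM DUAL
WHERE NOT EXISTS (
    SELECT null
    FROM ABC_TABLE T2
    WHERE T2.TABLE_ID = :table_id
    AND ((T2.EMPLOYEE_NAME IS NULL AND :employee_name IS NULL)
         OR T2.EMPLOYEE_NAME = :employee_name)
)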

Looking for an alternative method to this query

There are two tables, a Component table and a Log table. The Component table holds the actual (current) value description and a timestamp of when it was last updated.
The Log table contains a component ID that references the component it belongs to:
Component:
Id
Actual
LastUpdated
Log:
Id
ComponentId
Value
Timestamp
The query that used to work, but currently locks the table, looks like this:
update Component set Actual =
(select top 1 Value from Log
where Component.Id = ComponentId
order by Id desc),
LastUpdated =
(select top 1 TimeStamp from Log
where Component.Id = ComponentId
order by Id desc)
Both the Log and Component tables are growing, and this query can't keep up like it used to. There are around 80 components now and a couple of million log records.
Is it possible to keep working this way and just improve the query, or is the entire approach wrong?
PS: the devices that send the data don't have a reliable system time, and therefore having them update the Component table leads to inconsistency. When inserting a log row I take the system time on the SQL Server (default value).
EDIT:
Taking a suggestion from the answers, I'm trying to create a trigger on Log to automatically update Component when a log row is created.
CREATE TRIGGER trg_log_ins
ON Log
AFTER INSERT
AS
BEGIN
update Component
SET Actual = (SELECT i.value FROM inserted as i),
LastUpdated = (SELECT i.Timestamp FROM inserted as i);
END;
but for some reason the query doesn't finish and keeps executing.
I think you're going about this all wrong. A better solution would be a trigger on the Component table, that inserts into the Log table whenever a Component is inserted or updated.
CREATE TRIGGER trg_component_biu
ON Component
AFTER INSERT, UPDATE
AS
BEGIN
INSERT INTO Log(
ComponentId,
Value,
Timestamp
)
SELECT
Id,
Actual,
LastUpdated
FROM inserted;
END;
You can do it by using ROW_NUMBER() like this:
UPDATE t1
SET t1.Actual = t2.value,
t1.LastUpdated = t2.TimeStamp
FROM Component t1
INNER JOIN (SELECT log.*,ROW_NUMBER() OVER (PARTITION BY log.componentID order by log.ID DESC) as rnk
FROM log) t2
ON(t2.componentID = t1.id and t2.rnk = 1)
Based on the TOP 1 in your query, I guess you are using SQL Server. In SQL Server you can use OUTER APPLY:
UPDATE c
SET c.Actual = cs.Value,
    c.LastUpdated = cs.TimeStamp
FROM Component c
OUTER APPLY (SELECT TOP 1 Value,
                    TimeStamp,
                    ComponentId
             FROM Log l
             WHERE c.Id = l.ComponentId
             ORDER BY Id DESC) cs
Adding a non-clustered index on the Log table's Id column that includes TimeStamp and ComponentId will improve query performance.
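A sketch of the suggested index (the name is a placeholder; Value could be added to the INCLUDE list if the statement also reads it, as in the OUTER APPLY version above):
CREATE NONCLUSTERED INDEX IX_Log_Id
ON dbo.Log (Id)
INCLUDE (TimeStamp, ComponentId);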
Another way is using ROW_NUMBER and LEFT OUTER JOIN:
UPDATE c
SET c.Actual = cs.Value,
    c.LastUpdated = cs.TimeStamp
FROM Component c
LEFT OUTER JOIN (SELECT ROW_NUMBER() OVER (PARTITION BY ComponentId
                                           ORDER BY Id DESC) rn, *
                 FROM Log) cs
    ON cs.ComponentId = c.Id
    AND cs.rn = 1
All data in your Component table is coming from the Log table. Instead of making Component an actual table, you can make it a view, indexed if necessary.
CREATE VIEW Component
WITH SCHEMABINDING
AS
SELECT DISTINCT
    ComponentId AS Id,
    FIRST_VALUE(Value)
        OVER (PARTITION BY ComponentId
              ORDER BY Timestamp DESC)
        AS Actual,
    MAX(Timestamp)
        OVER (PARTITION BY ComponentId)
        AS LastUpdated
FROM dbo.Log;
If you are going to use a trigger on the Log table, it has to work even if several rows are inserted. Here is one possible variant.
Also, this variant would not capture values for a new ComponentID that doesn't exist in the Component table yet.
If there is a possibility that such values would be inserted into the Log table, I'd use MERGE instead of a simple UPDATE (a rough sketch follows after the second trigger below).
CREATE TRIGGER trg_log_ins
ON Log
AFTER INSERT
AS
BEGIN
WITH
CTE
AS
(
SELECT
Component.Actual AS OldValue
,Component.LastUpdated AS OldTimestamp
,inserted.Value AS NewValue
,inserted.Timestamp AS NewTimestamp
FROM
Component
INNER JOIN inserted ON inserted.ComponentID = Component.ID
)
UPDATE CTE
SET
OldValue = NewValue,
OldTimestamp = NewTimestamp
;
END
Also, if it is possible to insert into Log several rows with the same ComponentID in the same INSERT statement, you'd better choose explicitly which value to use for update. Likely, the one with the latest Timestamp.
So, the query becomes more complicated:
CREATE TRIGGER trg_log_ins
ON Log
AFTER INSERT
AS
BEGIN
WITH
CTE_InsertedRowNumbers
AS
(
SELECT
inserted.ComponentID
,inserted.Value AS NewValue
,inserted.Timestamp AS NewTimestamp
,ROW_NUMBER() OVER (
PARTITION BY inserted.ComponentID
ORDER BY inserted.Timestamp DESC, inserted.ID DESC) AS rn
FROM inserted
)
,CTE_LatestInsertedComponents
AS
(
SELECT
ComponentID
,NewValue
,NewTimestamp
FROM CTE_InsertedRowNumbers
WHERE rn = 1
)
,CTE
AS
(
SELECT
Component.Actual AS OldValue
,Component.LastUpdated AS OldTimestamp
,CTE_LatestInsertedComponents.NewValue
,CTE_LatestInsertedComponents.NewTimestamp
FROM
Component
INNER JOIN CTE_LatestInsertedComponents
ON CTE_LatestInsertedComponents.ComponentID = Component.ID
)
UPDATE CTE
SET
OldValue = NewValue,
OldTimestamp = NewTimestamp
;
END
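As a rough illustration of the MERGE alternative mentioned earlier, for the case where Log can receive rows for a ComponentID that does not exist in Component yet. This is a sketch only; the trigger name is a placeholder and it assumes the Component(Id, Actual, LastUpdated) schema from the question, with Id not being an IDENTITY column:
CREATE TRIGGER trg_log_ins_merge
ON Log
AFTER INSERT
AS
BEGIN
    MERGE Component AS tgt
    USING (
        SELECT ComponentId, Value, Timestamp
        FROM (
            SELECT ComponentId, Value, Timestamp,
                   ROW_NUMBER() OVER (PARTITION BY ComponentId
                                      ORDER BY Timestamp DESC, Id DESC) AS rn
            FROM inserted
        ) AS latest
        WHERE rn = 1
    ) AS src
    ON tgt.Id = src.ComponentId
    WHEN MATCHED THEN
        UPDATE SET Actual = src.Value, LastUpdated = src.Timestamp
    WHEN NOT MATCHED THEN
        INSERT (Id, Actual, LastUpdated)
        VALUES (src.ComponentId, src.Value, src.Timestamp);
END;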

Remove duplicates from table based on multiple criteria and persist to other table

I have a taccounts table with columns like account_id (PK), login_name, password, last_login. Now I have to remove some duplicate entries according to a new business logic.
So, duplicate accounts are those with either the same email or the same (login_name & password). The account with the latest login must be preserved.
Here are my attempts (some email values are null or blank):
DELETE
FROM taccounts
WHERE email is not null and char_length(trim(both ' ' from email))>0 and last_login NOT IN
(
SELECT MAX(last_login)
FROM taccounts
WHERE email is not null and char_length(trim(both ' ' from email))>0
GROUP BY lower(trim(both ' ' from email)))
Similarly for login_name and password
DELETE
FROM taccounts
WHERE last_login NOT IN
(
SELECT MAX(last_login)
FROM taccounts
GROUP BY login_name, password)
Is there any better way, or any way to combine these two separate queries?
Also, some other tables have account_id as a foreign key. How do I update this change for those tables?
I am using PostgreSQL 9.2.1.
EDIT: Some of the email values are null and some of them are blank (''). So, if two accounts have different login_name & password and their emails are null or blank, then they must be considered two different accounts.
If most of the rows are deleted (mostly dupes) and the table fits into RAM, consider this route:
SELECT surviving rows into a temporary table.
Reroute FK references to survivors
DELETE all rows from the base table.
Re-INSERT survivors.
1a. Distill surviving rows
CREATE TEMP TABLE tmp AS
SELECT DISTINCT ON (login_name, password) *
FROM (
SELECT DISTINCT ON (email) *
FROM taccounts
ORDER BY email, last_login DESC
) sub
ORDER BY login_name, password, last_login DESC;
About DISTINCT ON:
Select first row in each GROUP BY group?
To identify duplicates for two different criteria, use a subquery to apply the two rules one after the other. The first step preserves the account with the latest last_login, so this is "serializable".
Inspect results and test for plausibility.
SELECT * FROM tmp;
Temporary tables are dropped automatically at the end of a session. In pgAdmin (which you seem to be using) the session lives as long as the editor window is open.
1b. Alternative query for updated definition of "duplicates"
SELECT *
FROM taccounts t
WHERE NOT EXISTS (
SELECT 1 FROM taccounts t1
WHERE ( NULLIF(t1.email, '') = t.email
OR (NULLIF(t1.login_name, ''), NULLIF(t1.password, '')) = (t.login_name, t.password))
AND (t1.last_login, t1.account_id) > (t.last_login, t.account_id)
);
This doesn't treat NULL or empty string ('') as identical in any of the "duplicate" columns.
The row expression (t1.last_login, t1.account_id) takes care of the possibility that two dupes could share the same last_login. The one with the bigger account_id is chosen in this case - which is unique, since it is the PK.
2a. How to identify all incoming FKs
SELECT c.confrelid::regclass::text AS referenced_table
, c.conname AS fk_name
, pg_get_constraintdef(c.oid) AS fk_definition
FROM pg_attribute a
JOIN pg_constraint c ON (c.conrelid, c.conkey[1]) = (a.attrelid, a.attnum)
WHERE c.confrelid = 'taccounts'::regclass -- (schema-qualified) table name
AND c.contype = 'f'
ORDER BY 1, contype DESC;
Only building on the first column of the foreign key. More about that:
Find the referenced table name using table, field and schema name
Or inspect the Dependents rider in the right hand window of the object browser of pgAdmin after selecting the table taccounts.
2b. Reroute to new primary
If you have tables referencing taccounts (incoming foreign keys to taccounts) you will want to update all those fields, before you delete the dupes.
Reroute all of them to the new primary row:
UPDATE referencing_tbl r
SET referencing_column = tmp.account_id
FROM tmp
JOIN taccounts t1 USING (email)
WHERE r.referencing_column = t1.account_id
AND r.referencing_column IS DISTINCT FROM tmp.account_id;
UPDATE referencing_tbl r
SET referencing_column = tmp.account_id
FROM tmp
JOIN taccounts t2 USING (login_name, password)
WHERE r.referencing_column = t2.account_id
AND r.referencing_column IS DISTINCT FROM tmp.account_id;
3. & 4. Go in for the kill
Now, dupes are not referenced any more. Go in for the kill.
ALTER TABLE taccounts DISABLE TRIGGER ALL;
DELETE FROM taccounts;
VACUUM taccounts;
INSERT INTO taccounts
SELECT * FROM tmp;
ALTER TABLE taccounts ENABLE TRIGGER ALL;
Disable all triggers for the duration of the operation. This avoids checking for referential integrity during the operation. Everything should be fine once you re-activate triggers. We took care of all incoming FKs above. Outgoing FKs are guaranteed to be sound, since you have no concurrent write access and all values have been there before.
In addition to Erwin's excellent answer, it can often be useful to create an intermediate link table that relates the old keys to the new ones.
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE taccounts
( account_id SERIAL PRIMARY KEY
, login_name varchar
, email varchar
, last_login TIMESTAMP
);
-- create some fake data
INSERT INTO taccounts(last_login)
SELECT gs FROM generate_series('2013-03-30 14:00:00' ,'2013-03-30 15:00:00' , '1min'::interval) gs
;
UPDATE taccounts
SET login_name = 'User_' || (account_id %10)::text
, email = 'Joe' || (account_id %9)::text || '@somedomain.tld'
;
SELECT * FROM taccounts;
--
-- Create (temp) table linking old id <--> new id
-- After inspection this table can be used as a source for the FK updates
-- and for the final delete.
--
CREATE TABLE update_ids AS
WITH pairs AS (
SELECT one.account_id AS old_id
, two.account_id AS new_id
FROM taccounts one
JOIN taccounts two ON two.last_login > one.last_login
AND ( two.email = one.email OR two.login_name = one.login_name)
)
SELECT old_id,new_id
FROM pairs pp
WHERE NOT EXISTS (
SELECT * FROM pairs nx
WHERE nx.old_id = pp.old_id
AND nx.new_id > pp.new_id
)
;
SELECT * FROM update_ids
;
UPDATE other_table_with_fk_to_taccounts dst
SET account_id = ids.new_id
FROM update_ids ids
WHERE dst.account_id = ids.old_id
;
DELETE FROM taccounts del
WHERE EXISTS (
SELECT * FROM update_ids ex
WHERE ex.old_id = del.account_id
);
SELECT * FROM taccounts;
Yet another way to accomplish the same is to add a column with a pointer to the preferred key to the table itself and use that for your updates and deletes.
ALTER TABLE taccounts
ADD COLUMN better_id INTEGER REFERENCES taccounts(account_id)
;
-- find the *better* records for each record.
UPDATE taccounts dst
SET better_id = src.account_id
FROM taccounts src
WHERE src.login_name = dst.login_name
AND src.last_login > dst.last_login
AND src.email IS NOT NULL
AND NOT EXISTS (
SELECT * FROM taccounts nx
WHERE nx.login_name = dst.login_name
AND nx.email IS NOT NULL
AND nx.last_login > src.last_login
);
-- For records without an email address, point to a record that *does* have one
UPDATE taccounts dst
SET better_id = src.account_id
FROM taccounts src
WHERE src.login_name = dst.login_name
AND src.email IS NOT NULL
AND dst.email IS NULL
AND NOT EXISTS (
SELECT * FROM taccounts nx
WHERE nx.login_name = dst.login_name
AND nx.email IS NOT NULL
AND nx.last_login > src.last_login
);
SELECT * FROM taccounts ORDER BY account_id;
UPDATE other_table_with_fk_to_taccounts dst
SET account_id = src.better_id
FROM taccounts src
WHERE dst.account_id = src.account_id
AND src.better_id IS NOT NULL
;
DELETE FROM taccounts del
WHERE EXISTS (
SELECT * FROM taccounts ex
WHERE ex.account_id = del.better_id
);
SELECT * FROM taccounts ORDER BY account_id;
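Once the rerouting and deletes are done, the helper column can presumably be dropped again:
ALTER TABLE taccounts DROP COLUMN better_id;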