How to handle source table row deletions in SQL ETL?

I usually follow this strategy to load fact tables via ETL (a rough sketch follows the steps):
Truncate the staging table
Insert new rows (those added after the previous ETL run) into the fact table and changed rows into the staging table
Perform updates on the fact table based upon the data in the staging table
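A minimal T-SQL sketch of that flow, assuming hypothetical dbo.Source, dbo.FactSales and dbo.Staging tables matched on a BusinessKey column:

    -- @LastRunTime would normally come from an ETL control table
    DECLARE @LastRunTime DATETIME2 = '2024-01-01';

    TRUNCATE TABLE dbo.Staging;

    -- Changed rows (already present in the fact table) go into staging
    INSERT INTO dbo.Staging (BusinessKey, Amount, LastModified)
    SELECT s.BusinessKey, s.Amount, s.LastModified
    FROM dbo.Source AS s
    JOIN dbo.FactSales AS f ON f.BusinessKey = s.BusinessKey
    WHERE s.LastModified > @LastRunTime;

    -- New rows (not yet in the fact table) go straight into the fact table
    INSERT INTO dbo.FactSales (BusinessKey, Amount, LastModified)
    SELECT s.BusinessKey, s.Amount, s.LastModified
    FROM dbo.Source AS s
    WHERE s.LastModified > @LastRunTime
      AND NOT EXISTS (SELECT 1 FROM dbo.FactSales AS f WHERE f.BusinessKey = s.BusinessKey);

    -- Apply the changed rows from staging to the fact table
    UPDATE f
    SET f.Amount = st.Amount,
        f.LastModified = st.LastModified
    FROM dbo.FactSales AS f
    JOIN dbo.Staging AS st ON st.BusinessKey = f.BusinessKey;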
The challenge I am facing is that rows can be deleted from the source table. How do I handle this deletion in the ETL? I want the deleted row to be removed from the fact table as well. I cannot use a MERGE between the source OLTP table and the target data warehouse table, because that would put additional load on the source at each ETL run.
Note: the source table has a last-modified date column, but this does not help here, because the record disappears from the source table when it is deleted.

Related

How to make DELETE faster in fast changing (DELSERT) table in MonetDB?

I am using MonetDB (MDB) for OLAP queries. I am storing source data in PostgreSQL (PGSQL) and syncing it with MonetDB in batches written in Python.
In PGSQL there is a wide table with an ID column (non-unique) and a few other columns. Every few seconds a Python script takes a batch of 10k records changed in PGSQL and uploads them to MDB.
The process of uploading to MDB is as follows:
Create a staging table in MDB
Use the COPY command to upload the 10k records into the staging table.
DELETE from the destination table all IDs that are in the staging table.
INSERT into the destination table all rows from the staging table.
So, it is basically a DELETE & INSERT. I cannot use a MERGE statement, because I do not have a PK: one ID can have multiple rows in the destination. So I need to do a delete and a full insert for all IDs currently being synced; a rough sketch of one batch is below.
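One batch in MonetDB SQL looks roughly like this, with made-up table, column and file names (the real statements are issued from the Python script):

    -- 1. Fresh staging table for this batch
    CREATE TABLE staging (id INT, col1 VARCHAR(100), col2 INT);

    -- 2. Bulk-load the 10k changed records exported from PostgreSQL
    COPY INTO staging FROM '/tmp/batch.csv' USING DELIMITERS ',', '\n';

    -- 3. Drop every destination row whose ID appears in this batch
    DELETE FROM destination WHERE id IN (SELECT id FROM staging);

    -- 4. Re-insert the full current set of rows for those IDs
    INSERT INTO destination SELECT * FROM staging;

    DROP TABLE staging;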
Now to the problem: the DELETE is slow.
When I do a DELETE on the destination table, deleting 10k records from a table of 25M rows, it takes about 500 ms.
However! If I run a simple SELECT * FROM destination WHERE id = 1 first and THEN do the DELETE, it takes 2 ms.
I think that it has something to do with automatic creation of auxiliary indices. But this is where my knowledge ends.
I tried to solve this problem by "pre-heating": doing the lookup myself first, and it works - but only for the first DELETE after the pre-heat.
Once I do a DELETE and INSERT, the next DELETE becomes slow again. And doing the pre-heating before each DELETE does not make sense, because the pre-heat itself takes 500 ms.
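For reference, the pre-heat I am doing is essentially this (hypothetical names again):

    -- Touching the id column first makes the immediately following DELETE fast...
    SELECT COUNT(*) FROM destination WHERE id IN (SELECT id FROM staging);

    -- ...but after the subsequent INSERT, the next DELETE is slow again
    DELETE FROM destination WHERE id IN (SELECT id FROM staging);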
Is there any way to sync data into MDB without invalidating the auxiliary indices already built? Or to make the DELETE faster without the pre-heat? Or should I use some different technique to sync data into MDB without a PK (does MERGE have the same problem)?
Thanks!

Difference between magic table and temporal (system versioned) table in SQL Server?

What is the difference between magic table and temporal (system versioned) table in SQL Server?
Magic Table
Magic tables are virtual and have no physical existence; SQL Server maintains them internally.
There are two such tables, named INSERTED and DELETED.
INSERTED contains the newly inserted or updated version of a record in the table.
DELETED contains the last (previous) state of that record.
If you perform insert, update or delete operations on a table, the INSERTED and DELETED magic tables are populated accordingly and are available inside a trigger.
Apart from triggers, you can also access them with the OUTPUT clause.
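For example, a small sketch showing both ways to read them; dbo.Product, dbo.ProductAudit and their columns are made-up names:

    -- Inside a trigger, inserted holds the new values and deleted the old ones
    CREATE TRIGGER trg_Product_PriceAudit ON dbo.Product
    AFTER UPDATE
    AS
    BEGIN
        INSERT INTO dbo.ProductAudit (ProductId, OldPrice, NewPrice)
        SELECT d.ProductId, d.Price, i.Price
        FROM inserted AS i
        JOIN deleted AS d ON d.ProductId = i.ProductId;
    END;

    -- Outside a trigger, the OUTPUT clause exposes the same pseudo-tables
    UPDATE dbo.Product
    SET Price = Price * 1.1
    OUTPUT deleted.ProductId, deleted.Price AS OldPrice, inserted.Price AS NewPrice;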
Temporal Table
This is a new feature in SQL Server 2016.
It is also called a system-versioned table: SQL Server keeps a versioned history table for the particular base table, so the temporal history does have a physical existence.
You can query it directly, and its purpose is to keep the history of each record.
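A minimal example of declaring and then querying a system-versioned table (table and column names are illustrative):

    CREATE TABLE dbo.Product
    (
        ProductId INT PRIMARY KEY,
        Price     DECIMAL(10, 2) NOT NULL,
        ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
        ValidTo   DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductHistory));

    -- Ask what a row looked like at a point in the past
    SELECT *
    FROM dbo.Product FOR SYSTEM_TIME AS OF '2016-06-01T00:00:00'
    WHERE ProductId = 1;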

BigQuery Write Truncate with a partitioned table causes loss of partition information?

We have recently partitioned most of our tables in BigQuery using the following method:
Run a Dataflow pipeline which reads a table and writes the data to a new partitioned table.
Copy the partitioned table back to the original table using a copy job with write truncate set.
Once complete, the original table is replaced with the data from the newly created partitioned table; however, the original table is still not partitioned. So I tried the copy again, this time deleting the original table first, and it all worked.
The problem is that it takes 20 minutes to copy our partitioned table, which would cause downtime for our production application. So is there any way of doing a write truncate with a partitioned table replacing a non-partitioned table without causing any downtime? Or will we need to delete the table first in order to replace it?
Sorry, but you cannot change a non-partitioned table to a partitioned one, or vice versa. You will have to delete and re-create the table.
A couple of workarounds I can think of:
Keep both tables while you're migrating your queries to the partitioned table. After all queries are migrated, delete the original, non-partitioned table.
If you are using standard SQL, you can replace the original table with a view on top of the partitioned table. Deleting the original table and replacing it with a view should be very quick, and partition pruning should still work through the view, so you're only charged for the queried partitions. Partition pruning might not work in legacy SQL.
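A rough sketch of that second workaround in standard SQL DDL, assuming the partitioned copy is called mydataset.events_partitioned and the original table was mydataset.events (all names made up), run after deleting the original, non-partitioned table:

    CREATE VIEW `myproject.mydataset.events` AS
    SELECT * FROM `myproject.mydataset.events_partitioned`;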

how to dump data from one table to another

I have a staging table in a SQL Server job which dumps data into another table; once the data has been dumped into the transaction table, I truncate the staging table.
Now, the problem occurs if the job fails: the data in the transaction table is rolled back and all the data is placed back in the staging table. So the staging table already contains data, and if I rerun the job it merges the new data with the existing data in the staging table.
I want my staging table to be empty when the job runs.
Can I make use of a temp table in this scenario?
This is a common scenario in data warehousing projects, and the answer is logical rather than technical. You have two approaches to deal with this scenario (sketched after the list):
If your data is important, first check whether the staging table is empty. If it is not empty, the last job failed; in that case, instead of inserting into staging, do an insert-update (upsert) operation and then continue with the job steps. If the table is empty, the last job was successful and the new data can simply be inserted.
If you can afford to lose the data from the last job, make a habit of truncating the staging table before running your package.
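A rough T-SQL sketch of those two options, with dbo.Staging as a placeholder name:

    -- Option 1: keep the leftover data and branch on it at the start of the job
    IF EXISTS (SELECT 1 FROM dbo.Staging)
    BEGIN
        -- Previous run failed: upsert (insert-update) the leftover rows into the
        -- transaction table here before loading the new batch into staging
        PRINT 'Staging is not empty - run the recovery upsert step first';
    END;

    -- Option 2: accept losing the leftover data and always start clean
    TRUNCATE TABLE dbo.Staging;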

How to update a table copy in another database

I have two identical databases - one for development (DEV) and one for production (PROD) (both SQL Server 2008).
I have updated the contents of a certain table in DEV and now I want to sync the corresponding table in PROD.
I have not changed the table schema, only some of the data inside the table (I have both changed existing rows and added some new rows).
How can I easily transfer the changes in DEV to the corresponding table in PROD?
Note that the values in the automatic identity column might not match exactly between the two tables. However, I know that I have only made changes to rows having the same value in another column.
Martin
If you don't want to use replication, you can create UPDATE, INSERT and DELETE triggers in the DEV database and update PROD from those triggers.
Or you can create a view of the DEV database table in the PROD database.
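For instance, a rough sketch of the trigger approach, assuming both databases sit on the same instance, the table is dbo.Customer, and the stable matching column (not the identity) is CustomerCode:

    -- Run this in the DEV database
    CREATE TRIGGER trg_Customer_SyncToProd ON dbo.Customer
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        -- Remove the PROD versions of any rows that were deleted or changed in DEV
        DELETE p
        FROM PROD.dbo.Customer AS p
        JOIN deleted AS d ON d.CustomerCode = p.CustomerCode;

        -- Re-insert the current DEV versions of inserted or updated rows
        INSERT INTO PROD.dbo.Customer (CustomerCode, Name, Email)
        SELECT i.CustomerCode, i.Name, i.Email
        FROM inserted AS i;
    END;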