My DWH is deployed on Azure Synapse SQL pool.
I loaded data to DWH by script that consists of update, insert and delete (u-i-d) operations. The duration of full load to target table was 12minutes for near 50million of rows.
Recently I tried to use MERGE statement instead of u-i-d. And I found that MERGE performance much worse than u-i-d - 1hour for MERGE against 12minutes for u-i-d!
Please share your experience with MERGE statement on Azure synapse, friends!
Does MERGE really work worse in Synapse than separate update-insert-delete operations?
As per MS doc on MERGE (Transact-SQL) - SQL Server | Microsoft Learn, Merge statement works better for complex statements and for simple activities merging using Insert, Update and delete statement works better.
The conditional behavior described for the MERGE statement works best when the two tables have a complex mixture of matching characteristics. For example, inserting a row if it doesn't exist, or updating a row if it matches. When simply updating one table based on the rows of another table, improve the performance and scalability with basic INSERT, UPDATE, and DELETE statements.
I tried to repro and compare both approaches using simple statements.
Sample tables are taken as in below image.
Merge statement is used to merge and it took 9 seconds
MERGE Products AS TARGET
USING UpdatedProducts AS SOURCE
ON (TARGET.ProductID = SOURCE.ProductID)
--When records are matched, update the records if there is any change
WHEN MATCHED AND TARGET.ProductName <> SOURCE.ProductName OR TARGET.Rate <> SOURCE.Rate
THEN UPDATE SET TARGET.ProductName = SOURCE.ProductName, TARGET.Rate = SOURCE.Rate
--When no records are matched, insert the incoming records from source table to target table
WHEN NOT MATCHED BY TARGET
THEN INSERT (ProductID, ProductName, Rate) VALUES (SOURCE.ProductID, SOURCE.ProductName, SOURCE.Rate)
--When there is a row that exists in target and same record does not exist in source then delete this record target
WHEN NOT MATCHED BY SOURCE
THEN DELETE ;
Then tried with Update, Insert and delete statement. It took nearly 0 second.
Update, insert and delete works better for simple scenarios.
I have a Day-Partitioned Table on BigQuery. When I try to delete some rows from the table using a query like:
DELETE FROM `MY_DATASET.partitioned_table` WHERE id = 2374180
I get the following error:
Error: DML statements are not yet supported over partitioned tables.
A quick Google search leads me to: https://cloud.google.com/bigquery/docs/loading-data-sql-dml where it also says: "DML statements that modify partitioned tables are not yet supported."
So for now, is there a workaround that we can use in deleting rows from a partitioned table?
DML has some known issues/limitation in this phase.
Such as:
DML statements cannot be used to modify tables with REQUIRED fields in their schema.
Each DML statement initiates an implicit transaction, which means that changes made by the statement are automatically committed at the end of each successful DML statement. There is no support for multi-statement transactions.
The following combinations of DML statements are allowed to run concurrently on a table:
UPDATE and INSERT
DELETE and INSERT
INSERT and INSERT
Otherwise one of the DML statements will be aborted. For example, if two UPDATE statements execute simultaneously against the table then only one of them will succeed.
Tables that have been written to recently via BigQuery Streaming (tabledata.insertall) cannot be modified using UPDATE or DELETE statements. To check if the table has a streaming buffer, check the tables.get response for a section named streamingBuffer. If it is absent, the table can be modified using UPDATE or DELETE statements.
DML statements that modify partitioned tables are not yet supported.
Also be aware of the quota limits
Maximum UPDATE/DELETE statements per day per table: 48
Maximum UPDATE/DELETE statements per day per project: 500
Maximum INSERT statements per day per table: 1,000
Maximum INSERT statements per day per project: 10,000
What you can do is copy the entire partition to a non-partitioned table and execute the DML statement there. Than write back the temp table to the partition. Also if you ran into DML update limit statements per day per table, you need to create a copy of the table and run the DML on the new table to avoid the limit.
You could delete partitions in partitioned tables using the command-line bq rm, like this:
bq rm 'mydataset.mytable$20160301'
I've already done it without temporary table, steps:
1) prepare query which selects all the rows from particular partition which should be kept:
SELECT * FROM `your_data_set.tablename` WHERE
_PARTITIONTIME = timestamp('2017-12-07')
AND condition_to_keep_rows_which_shouldn't_be_deleted = 'condition'
if necessary run this for other partitions
2) choose Destination table for result of your query where you point TO THE PARTICULAR PARTITION, you need to provide table name like this:
tablename$20171207
3) Check option "Overwrite table" -> it will overwrite only particular partition
4) Run Query, as a result from pointed partition redundant rows will be deleted!
//remember that you could need run this for other partitions, where you rows to deleted are spread across more than one partition
Looks like as of my writing, this is no longer a BigQuery limitation!
In standard SQL, a statement like the above, over a partitioned table, will succeed, assuming rows being deleted weren't recently (within last 30 minutes) inserted via a streaming insert.
Current docs on DML: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-manipulation-language
Example Query that worked for me in the BQ UI:
DELETE
FROM dataset_name.partitioned_table_on_timestamp_column
WHERE
timestamp >= '2020-02-01' AND timestamp < '2020-06-01'
After the hamsters are done spinning, we get the BQ response:
This statement removed 101 rows from partitioned_table_on_timestamp_column
What is needed: I'm needing 25 million records from oracle incrementally loaded to SQL Server 2012. It will need to have an UPDATE, DELETE, NEW RECORDS feature in the package. The oracle data source is always changing.
What I have: I've done this many times before but not anything past 10 million records.First I have an [Execute SQL Task] that is set to grab the result set of the [Max Modified Date]. I then have a query that only pulls data from the [ORACLE SOURCE] > [Max Modified Date] and have that lookup against my destination table.
I have the the [ORACLE Source] connecting to the [Lookup-Destination table], the lookup is set to NO CACHE mode, I get errors if I use partial or full cache mode because I assume the [ORACLE Source] is always changing. The [Lookup] then connects to a [Conditional Split] where I would input an expression like the one below.
(REPLACENULL(ORACLE.ID,"") != REPLACENULL(Lookup.ID,""))
|| (REPLACENULL(ORACLE.CASE_NUMBER,"")
!= REPLACENULL(ORACLE.CASE_NUMBER,""))
I would then have the rows that the [Conditional Split] outputs into a staging table. I then add a [Execute SQL Task] and perform an UPDATE to the DESTINATION-TABLE with the query below:
UPDATE Destination
SET SD.CASE_NUMBER =UP.CASE_NUMBER,
SD.ID = UP.ID,
From Destination SD
JOIN STAGING.TABLE UP
ON UP.ID = SD.ID
Problem: This becomes very slow and takes a very long time and it just keeps running. How can I improve the time and get it to work? Should I use a cache transformation? Should I use a merge statement instead?
How would I use the expression REPLACENULL in the conditional split when it is a data column? would I use something like :
(REPLACENULL(ORACLE.LAST_MODIFIED_DATE,"01-01-1900 00:00:00.000")
!= REPLACENULL(Lookup.LAST_MODIFIED_DATE," 01-01-1900 00:00:00.000"))
PICTURES BELOW:
A pattern that is usually faster for larger datasets is to load the source data into a local staging table then use a query like below to identify the new records:
SELECT column1,column 2
FROM StagingTable SRC
WHERE NOT EXISTS (
SELECT * FROM TargetTable TGT
WHERE TGT.MatchKey = SRC.MatchKey
)
Then you just feed that dataset into an insert:
INSERT INTO TargetTable (column1,column 2)
SELECT column1,column 2
FROM StagingTable SRC
WHERE NOT EXISTS (
SELECT * FROM TargetTable TGT
WHERE TGT.MatchKey = SRC.MatchKey
)
Updates look like this:
UPDATE TGT
SET
column1 = SRC.column1,
column2 = SRC.column2,
DTUpdated=GETDATE()
FROM TargetTable TGT
WHERE EXISTS (
SELECT * FROM TargetTable SRC
WHERE TGT.MatchKey = SRC.MatchKey
)
Note the additional column DTUpdated. You should always have a 'last updated' column in your table to help with auditing and debugging.
This is an INSERT/UPDATE approach. There are other data load approaches such as windowing (pick a trailing window of data to be fully deleted and reloaded) but the approach depends on how your system works and whether you can make assumptions about data (i.e. posted data in the source will never be changed)
You can squash the seperate INSERT and UPDATE statements into a single MERGE statement, although it gets pretty huge, and I've had performance issues with it and there are other documented issues with MERGE
Unfortunately, there's not a good way to do what you're trying to do. SSIS has some controls and documented ways to do this, but as you have found they don't work as well when you start dealing with large amounts of data.
At a previous job, we had something similar that we needed to do. We needed to update medical claims from a source system to another system, similar to your setup. For a very long time, we just truncated everything in the destination and rebuilt every night. I think we were doing this daily with more than 25M rows. If you're able to transfer all the rows from Oracle to SQL in a decent amount of time, then truncating and reloading may be an option.
We eventually had to get away from this as our volumes grew, however. We tried to do something along the lines of what you're attempting, but never got anything we were satisfied with. We ended up with a sort of non-conventional process. First, each medical claim had a unique numeric identifier. Second, whenever the medical claim was updated in the source system, there was an incremental ID on the individual claim that was also incremented.
Step one of our process was to bring over any new medical claims, or claims that had changed. We could determine this quite easily, since the unique ID and the "change ID" column were both indexed in source and destination. These records would be inserted directly into the destination table. The second step was our "deletes", which we handled with a logical flag on the records. For actual deletes, where records existed in destination but were no longer in source, I believe it was actually fastest to do this by selecting the DISTINCT claim numbers from the source system and placing them in a temporary table on the SQL side. Then, we simply did a LEFT JOIN update to set the missing claims to logically deleted. We did something similar with our updates: if a newer version of the claim was brought over by our original Lookup, we would logically delete the old one. Every so often we would clean up the logical deletes and actually delete them, but since the logical delete indicator was indexed, this didn't need to be done too frequently. We never saw much of a performance hit, even when the logically deleted records numbered in the tens of millions.
This process was always evolving as our server loads and data source volumes changed, and I suspect the same may be true for your process. Because every system and setup is different, some of the things that worked well for us may not work for you, and vice versa. I know our data center was relatively good and we were on some stupid fast flash storage, so truncating and reloading worked for us for a very, very long time. This may not be true on conventional storage, where your data interconnects are not as fast, or where your servers are not colocated.
When designing your process, keep in mind that deletes are one of the more expensive operations you can perform, followed by updates and by non-bulk inserts, respectively.
Incremental Approach using SSIS
Get Max(ID) and Max(ModifiedDate) from Destination Table and Store them in a Variables
Create a temporary staging table using EXECUTE SQL TASK and store that temporary staging table name into the variable
Take a Data Flow Task and Use OLEDB Source and OLEDB Destination to pull the data from the Source System and load the
data into the variable of temporary tables
Take Two Execute SQL Task one for Insert Process and other for Update
Drop the Temporary Table
INSERT INTO sales.salesorderdetails
(
salesorderid,
salesorderdetailid,
carriertrackingnumber ,
orderqty,
productid,
specialofferid,
unitprice,
unitpricediscount,
linetotal ,
rowguid,
modifieddate
)
SELECT sd.salesorderid,
sd.salesorderdetailid,
sd.carriertrackingnumber,
sd.orderqty,
sd.productid ,
sd.specialofferid ,
sd.unitprice,
sd.unitpricediscount,
sd.linetotal,
sd.rowguid,
sd.modifieddate
FROM ##salesdetails AS sd WITH (nolock)
LEFT JOIN sales.salesorderdetails AS sa WITH (nolock)
ON sa.salesorderdetailid = sd.salesorderdetailid
WHERE NOT EXISTS
(
SELECT *
FROM sales.salesorderdetails sa
WHERE sa.salesorderdetailid = sd.salesorderdetailid)
AND sa.salesorderdetailid > ?
UPDATE sa
SET SalesOrderID = sd.salesorderid,
CarrierTrackingNumber = sd.carriertrackingnumber,
OrderQty = sd.orderqty,
ProductID = sd.productid,
SpecialOfferID = sd.specialofferid,
UnitPrice = sd.unitprice,
UnitPriceDiscount = sd.unitpricediscount,
LineTotal = sd.linetotal,
rowguid = sd.rowguid,
ModifiedDate = sd.modifieddate
FROM sales.salesorderdetails sa
LEFT JOIN ##salesdetails sd
ON sd.salesorderdetailid = sa.salesorderdetailid
WHERE sa.modifieddate > sd.modifieddate
AND sa.salesorderdetailid < ?
Entire Process took 2 Minutes to Complete
Incremental Process Screenshot
I am assuming you have some identity like (pk)column in your oracle table.
1 Get max identity (Business key) from Destination database (SQL server one)
2 Create two data flow
a) Pull only data >max identity from oracle and put them Destination directly .( As these are new record).
b) Get all record < max identity and update date > last load put them into temp (staging ) table (as this is updated data)
3 Update Destination table with record from temp table record (created at step b)
I am trying to find the best way to perform a SQL Server merge in batches.
TEMPTABLENAME is a temp table with an id column (and all the other columns).
I am thinking some kind of while loop using the id as a counter.
The real problem with the merge statement is that it locks the entire table for an extended amount of time when processing 200K+ records. I want to loop every 100 rows so I can release the lock and let the other applications access the table. This table has millions of rows and also fires off an audit every time data is updated. This causes 160K records to take around 20 to 30 mins in the merge below.
The merge code below is a sample of the code. There is probably about 25 columns that get updated/inserted.
I would be open to finding another way other than the merge to insert the data. I just can not change the audit system or amount of records in the table.
merge Employee as Target
using TEMPTABLENAME as Source on (Target.ClientId = Source.ClientId and
Target.EmployeeReferenceId = Source.EmployeeReferenceId)
when matched then
update
set
Target.FirstName = Source.FirstName,
Target.MiddleName = Source.MiddleName,
Target.LastName = Source.LastName,
Target.BirthDate = Source.BirthDate
when not matched then
INSERT ([FirstName], [MiddleName], [LastName], [BirthDate])
VALUES (Source.FirstName, Source.MiddleName, Source.LastName, Source.BirthDate)
OUTPUT $action INTO #SummaryOfChanges;
SELECT Change, COUNT(*) AS CountPerChange
FROM #SummaryOfChanges
GROUP BY Change;
With a simple UPDATE statement we can do it in batches when dealing with huge tables.
WHILE 1 = 1
BEGIN
UPDATE TOP (5000)
dbo.LargeOrders
SET CustomerID = N'ABCDE'
WHERE CustomerID = N'OLDWO';
IF ##rowcount < 5000
BREAK;
END
When working with MERGE statement, is it possible to do similar things? As I know this is not possible, because you need to do different operations based on the condition. For example to UPDATE when matched and to INSERT when not matched. I just want to confirm on it and I may need to switch to the old-school UPDATE & INSERT if it's true.
Why not use a temp table as the source of your MERGE and then handle batching via the source table. I do this in a project of mine where I'll batch 50,000 rows to a temp table and then use the temp table as the source for the MERGE. I do this in a loop until all rows are processed.
The temp table could be a physical table or an in memory table.