My DWH is deployed on an Azure Synapse SQL pool.
I loaded data into the DWH with a script consisting of separate UPDATE, INSERT and DELETE (u-i-d) operations. A full load of the target table took 12 minutes for nearly 50 million rows.
Recently I tried a MERGE statement instead of u-i-d, and found that MERGE performs much worse: 1 hour for MERGE against 12 minutes for u-i-d!
Please share your experience with the MERGE statement on Azure Synapse, friends!
Does MERGE really perform worse in Synapse than separate UPDATE, INSERT and DELETE operations?
As per the MS doc MERGE (Transact-SQL) - SQL Server | Microsoft Learn, MERGE works better for complex matching scenarios, while for simple activities separate INSERT, UPDATE and DELETE statements work better.
The conditional behavior described for the MERGE statement works best when the two tables have a complex mixture of matching characteristics. For example, inserting a row if it doesn't exist, or updating a row if it matches. When simply updating one table based on the rows of another table, improve the performance and scalability with basic INSERT, UPDATE, and DELETE statements.
I tried to reproduce and compare both approaches using simple statements.
Sample tables Products and UpdatedProducts are as shown in the image below.
A MERGE statement was used first, and it took 9 seconds:
MERGE Products AS TARGET
USING UpdatedProducts AS SOURCE
ON (TARGET.ProductID = SOURCE.ProductID)
--When records are matched, update the records if there is any change
WHEN MATCHED AND (TARGET.ProductName <> SOURCE.ProductName OR TARGET.Rate <> SOURCE.Rate)
THEN UPDATE SET TARGET.ProductName = SOURCE.ProductName, TARGET.Rate = SOURCE.Rate
--When no records are matched, insert the incoming records from source table to target table
WHEN NOT MATCHED BY TARGET
THEN INSERT (ProductID, ProductName, Rate) VALUES (SOURCE.ProductID, SOURCE.ProductName, SOURCE.Rate)
--When there is a row that exists in target and same record does not exist in source then delete this record target
WHEN NOT MATCHED BY SOURCE
THEN DELETE;
Then I tried separate UPDATE, INSERT and DELETE statements; they completed in close to 0 seconds.
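For reference, the separate statements would look roughly like this (a sketch against the same Products and UpdatedProducts tables; support for the joined UPDATE/DELETE syntax can vary between a Synapse dedicated SQL pool and regular SQL Server):
-- Update rows that exist in both tables but have changed
UPDATE Products
SET ProductName = S.ProductName, Rate = S.Rate
FROM UpdatedProducts S
WHERE Products.ProductID = S.ProductID
  AND (Products.ProductName <> S.ProductName OR Products.Rate <> S.Rate);

-- Insert rows that exist only in the source
INSERT INTO Products (ProductID, ProductName, Rate)
SELECT S.ProductID, S.ProductName, S.Rate
FROM UpdatedProducts S
WHERE NOT EXISTS (SELECT 1 FROM Products T WHERE T.ProductID = S.ProductID);

-- Delete rows that no longer exist in the source
DELETE FROM Products
WHERE NOT EXISTS (SELECT 1 FROM UpdatedProducts S WHERE S.ProductID = Products.ProductID);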
UPDATE, INSERT and DELETE work better for simple scenarios.
Related
I have a table adv_days(adv_date date not null primary key, edit_state integer) and I want to run a series of commands: check whether '27.11.2019' exists in the table; if it exists, update edit_state to 1; if it does not exist, insert the date '27.11.2019' with edit_state=1.
I have come up with this Firebird merge statement:
merge into adv_days d
using (select adv_date, edit_state from adv_days where adv_date='27.11.2019') d2
on (d.adv_date=d2.adv_date)
when matched then update set d.edit_state=1
when not matched then
insert(adv_date, edit_state)
values(:adv_date, 1);
It compiles and works as expected, but all the examples I have read for merge involve two tables. So maybe there is a better solution than merge for select-check-update operations on one table?
This is a design question, but so concrete that SO is more appropriate than Software Engineering.
There is a community response that merge for insert/update on one table is a perfectly valid solution. But, of course, update or insert is a more elegant solution; in this case the code is:
update or insert into adv_days (adv_date, edit_state)
values ('27.11.2019', 1)
matching (adv_date)
This solution performs better than the merge from the question, as one can observe from the plan and the number of indexed reads when both statements are repeated several times.
There is still a drawback in both cases: repeated execution performs an update even though it would not be necessary to touch a record that already has all the required values.
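If that last point matters, one option (assuming Firebird 3 or later) is to keep merge but put a condition on the WHEN MATCHED branch so the row is only written when it actually differs, and to drive it from a one-row source such as RDB$DATABASE so the NOT MATCHED branch can fire even when the date is absent. A sketch:
merge into adv_days d
using (select cast('27.11.2019' as date) as adv_date from rdb$database) d2
on (d.adv_date = d2.adv_date)
when matched and d.edit_state is distinct from 1 then
  update set d.edit_state = 1
when not matched then
  insert (adv_date, edit_state) values (d2.adv_date, 1);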
I will be running a MERGE SQL query over a million records in my source table and inserting into my target table. The table I'm doing the SELECT from in the MERGE is in production; an application with many users will be hitting it with SELECT, INSERT, UPDATE and DELETE at the same time. I will NOT be modifying the source table data with my MERGE statement, only the target table. I will have SQL Snapshot Isolation enabled, so there is no reason to use a NOLOCK hint.
Is there a way to have the query run in batches, or is having the MERGE statement scan the entire table more efficient? I have two other MERGE statements that I'll run after the initial INSERT to do INSERT, UPDATE and DELETE on the target table for any changes. Are there any precautions I need to take so as not to cause performance issues for the production application?
I'm going to use a stored procedure because I will be running these queries on multiple tables, doing the same function over and over again.
My sample initial MERGE:
MERGE dl178 AS TARGET
USING dlsd178 AS SOURCE
ON (TARGET.docid = SOURCE.docid AND TARGET.objectid = SOURCE.objectid
    AND TARGET.pagenum = SOURCE.pagenum AND TARGET.subpagenum = SOURCE.subpagenum
    AND TARGET.pagever = SOURCE.pagever AND TARGET.pathid = SOURCE.pathid
    AND TARGET.annote = SOURCE.annote)
WHEN NOT MATCHED BY TARGET
THEN INSERT (docid, pagenum, subpagenum, pagever, objectid, pathid, annote, formatid, ftoffset, ftcount)
VALUES (SOURCE.docid, SOURCE.pagenum, SOURCE.subpagenum, SOURCE.pagever,
        SOURCE.objectid, SOURCE.pathid, SOURCE.annote, SOURCE.formatid, SOURCE.ftoffset, SOURCE.ftcount)
OUTPUT $action, Inserted.*;
Have you considered Partition Switching or Change Data Capture? It sounds like you are trying to start an ongoing ETL process.
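For context, switching a partition is a metadata-only operation, so once a batch has been loaded and validated in a staging table it can be moved into the target almost instantly. A rough sketch (the staging table dl178_stage, the partition number, and the requirement that both tables share the same partition scheme, indexes and constraints are all assumptions):
-- Load and validate the batch in dl178_stage, partitioned like the target,
-- then switch the whole partition into the target as a metadata operation
ALTER TABLE dl178_stage SWITCH PARTITION 3 TO dl178 PARTITION 3;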
I have a Day-Partitioned Table on BigQuery. When I try to delete some rows from the table using a query like:
DELETE FROM `MY_DATASET.partitioned_table` WHERE id = 2374180
I get the following error:
Error: DML statements are not yet supported over partitioned tables.
A quick Google search leads me to: https://cloud.google.com/bigquery/docs/loading-data-sql-dml where it also says: "DML statements that modify partitioned tables are not yet supported."
So for now, is there a workaround that we can use in deleting rows from a partitioned table?
DML has some known issues/limitations in this phase, such as:
DML statements cannot be used to modify tables with REQUIRED fields in their schema.
Each DML statement initiates an implicit transaction, which means that changes made by the statement are automatically committed at the end of each successful DML statement. There is no support for multi-statement transactions.
The following combinations of DML statements are allowed to run concurrently on a table:
UPDATE and INSERT
DELETE and INSERT
INSERT and INSERT
Otherwise one of the DML statements will be aborted. For example, if two UPDATE statements execute simultaneously against the table then only one of them will succeed.
Tables that have been written to recently via BigQuery Streaming (tabledata.insertall) cannot be modified using UPDATE or DELETE statements. To check if the table has a streaming buffer, check the tables.get response for a section named streamingBuffer. If it is absent, the table can be modified using UPDATE or DELETE statements.
DML statements that modify partitioned tables are not yet supported.
Also be aware of the quota limits:
Maximum UPDATE/DELETE statements per day per table: 48
Maximum UPDATE/DELETE statements per day per project: 500
Maximum INSERT statements per day per table: 1,000
Maximum INSERT statements per day per project: 10,000
What you can do is copy the entire partition to a non-partitioned table, execute the DML statement there, and then write the temp table back to the partition. Also, if you run into the UPDATE/DELETE statement limit per day per table, you need to create a copy of the table and run the DML on the new table to avoid the limit.
You could delete partitions in partitioned tables using the command-line bq rm, like this:
bq rm 'mydataset.mytable$20160301'
I've already done it without a temporary table. Steps:
1) Prepare a query which selects all the rows from the particular partition which should be kept:
SELECT * FROM `your_data_set.tablename` WHERE
_PARTITIONTIME = timestamp('2017-12-07')
AND condition_to_keep_rows_which_shouldn't_be_deleted = 'condition'
If necessary, run this for other partitions.
2) Choose a destination table for the result of your query, pointing it TO THE PARTICULAR PARTITION; you need to provide the table name like this:
tablename$20171207
3) Check the option "Overwrite table" -> it will overwrite only the particular partition.
4) Run the query; as a result, the redundant rows will be deleted from the targeted partition!
// Remember that you may need to run this for other partitions if the rows to be deleted are spread across more than one partition.
Looks like as of my writing, this is no longer a BigQuery limitation!
In standard SQL, a statement like the above over a partitioned table will succeed, assuming the rows being deleted weren't recently (within the last 30 minutes) inserted via a streaming insert.
Current docs on DML: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-manipulation-language
Example Query that worked for me in the BQ UI:
DELETE
FROM dataset_name.partitioned_table_on_timestamp_column
WHERE
timestamp >= '2020-02-01' AND timestamp < '2020-06-01'
After the hamsters are done spinning, we get the BQ response:
This statement removed 101 rows from partitioned_table_on_timestamp_column
With a simple UPDATE statement we can do it in batches when dealing with huge tables.
WHILE 1 = 1
BEGIN
    UPDATE TOP (5000) dbo.LargeOrders
    SET CustomerID = N'ABCDE'
    WHERE CustomerID = N'OLDWO';

    IF @@ROWCOUNT < 5000
        BREAK;
END
When working with the MERGE statement, is it possible to do something similar? As far as I know it is not, because you need to do different operations based on the condition, for example UPDATE when matched and INSERT when not matched. I just want to confirm this; I may need to switch to the old-school UPDATE & INSERT if it's true.
Why not use a temp table as the source of your MERGE and handle batching via the source table? I do this in a project of mine: I batch 50,000 rows into a temp table, use the temp table as the source for the MERGE, and repeat in a loop until all rows are processed.
The temp table could be a physical table or an in-memory table.
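A minimal sketch of that pattern (it assumes docid is a unique, increasing key and matches on it alone; the composite key from the question would need the full ON clause and a different batching column):
DECLARE @BatchSize int = 50000,
        @LastKey   bigint = 0,
        @Rows      int = 1;

WHILE @Rows > 0
BEGIN
    -- Stage the next slice of the source
    IF OBJECT_ID('tempdb..#src') IS NOT NULL DROP TABLE #src;

    SELECT TOP (@BatchSize) *
    INTO #src
    FROM dlsd178
    WHERE docid > @LastKey
    ORDER BY docid;

    SET @Rows = @@ROWCOUNT;

    IF @Rows > 0
    BEGIN
        MERGE dl178 AS target
        USING #src AS source
        ON (target.docid = source.docid)
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (docid, pagenum, subpagenum, pagever, objectid, pathid, annote, formatid, ftoffset, ftcount)
            VALUES (source.docid, source.pagenum, source.subpagenum, source.pagever,
                    source.objectid, source.pathid, source.annote, source.formatid, source.ftoffset, source.ftcount);

        SELECT @LastKey = MAX(docid) FROM #src;
    END
END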
We have a trigger that creates audit records for a table by joining the inserted and deleted tables to see if any columns have changed. The join has been working well for small sets, but now I'm updating about 1 million rows and it doesn't finish in days. I tried updating different numbers of rows across several orders of magnitude, and the growth is clearly worse than linear, which would make sense if the inserted/deleted tables are being scanned to do the join.
I tried creating an index but get the error:
Cannot find the object "inserted" because it does not exist or you do not have permissions.
Is there any way to make this any faster?
Inserting into temporary tables indexed on the joining columns could well improve things as inserted and deleted are not indexed.
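A sketch of that idea at the top of the trigger (the key column, audited column and audit table are placeholders for whatever the real trigger compares):
-- Copy the pseudo-tables into temp tables and index them on the join key,
-- since inserted/deleted themselves cannot be indexed
SELECT * INTO #ins FROM inserted;
SELECT * INTO #del FROM deleted;

CREATE CLUSTERED INDEX IX_ins ON #ins (KeyColumn);
CREATE CLUSTERED INDEX IX_del ON #del (KeyColumn);

INSERT INTO AuditTable (KeyColumn, OldValue, NewValue)
SELECT i.KeyColumn, d.SomeColumn, i.SomeColumn
FROM #ins i
JOIN #del d ON d.KeyColumn = i.KeyColumn
WHERE i.SomeColumn <> d.SomeColumn;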
You can check @@ROWCOUNT inside the trigger so you only perform this logic above some threshold number of rows, though on SQL Server 2008 this might overstate the number somewhat if the trigger was fired as the result of a MERGE statement (it will return the total number of rows affected by all MERGE actions, not just the one relevant to that specific trigger).
In that case you can just do something like SELECT @NumRows = COUNT(*) FROM (SELECT TOP 10 * FROM inserted) T to see if the threshold is met.
Addition
One other possibility you could experiment with is simply bypassing the trigger for these large updates. You could use SET CONTEXT_INFO to set a flag and check the value of this inside the trigger. You could then use OUTPUT inserted.*, deleted.* to get the "before" and "after" values for a row without needing to JOIN at all.
DECLARE @TriggerFlag varbinary(128)
SET @TriggerFlag = CAST('Disabled' AS varbinary(128))
SET CONTEXT_INFO @TriggerFlag
UPDATE YourTable
SET Bar = 'X'
OUTPUT inserted.*, deleted.* INTO #T
/*Reset the flag*/
SET CONTEXT_INFO 0x
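And inside the trigger, a corresponding guard along these lines (SUBSTRING is used so the comparison does not depend on how the 128-byte context value is padded):
-- First statement of the trigger: skip the audit logic
-- when the session has flagged itself as 'Disabled'
IF SUBSTRING(CONTEXT_INFO(), 1, 8) = CAST('Disabled' AS varbinary(128))
    RETURN;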