Splitting a merge into batches in SQL Server 2008 - sql

I am trying to find the best way to perform a SQL Server merge in batches.
TEMPTABLENAME is a temp table with an id column (and all the other columns).
I am thinking of some kind of while loop using the id as a counter.
The real problem with the merge statement is that it locks the entire table for an extended amount of time when processing 200K+ records. I want to loop in small batches (say every 100 rows) so I can release the lock and let the other applications access the table. This table has millions of rows and also fires off an audit every time data is updated. This causes 160K records to take around 20 to 30 minutes in the merge below.
The merge code below is a sample of the code. There are probably about 25 columns that get updated/inserted.
I would be open to another way of inserting the data other than the merge. I just cannot change the audit system or the number of records in the table.
MERGE Employee AS Target
USING TEMPTABLENAME AS Source
    ON (Target.ClientId = Source.ClientId
        AND Target.EmployeeReferenceId = Source.EmployeeReferenceId)
WHEN MATCHED THEN
    UPDATE SET
        Target.FirstName = Source.FirstName,
        Target.MiddleName = Source.MiddleName,
        Target.LastName = Source.LastName,
        Target.BirthDate = Source.BirthDate
WHEN NOT MATCHED THEN
    INSERT ([ClientId], [EmployeeReferenceId], [FirstName], [MiddleName], [LastName], [BirthDate])
    VALUES (Source.ClientId, Source.EmployeeReferenceId, Source.FirstName, Source.MiddleName, Source.LastName, Source.BirthDate)
OUTPUT $action INTO #SummaryOfChanges;

SELECT Change, COUNT(*) AS CountPerChange
FROM #SummaryOfChanges
GROUP BY Change;
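For what it's worth, a minimal sketch of the while-loop idea described above, assuming TEMPTABLENAME has a reasonably dense integer id column and that #SummaryOfChanges already exists; the 5000-row batch size and the shortened column list are illustrative, and the join keys are included in the INSERT so newly inserted rows can match on later runs:

DECLARE @MinId int, @MaxId int, @BatchSize int;
SET @BatchSize = 5000;
SELECT @MinId = MIN(id), @MaxId = MAX(id) FROM TEMPTABLENAME;

WHILE @MinId <= @MaxId
BEGIN
    MERGE Employee AS Target
    USING (SELECT * FROM TEMPTABLENAME
           WHERE id >= @MinId AND id < @MinId + @BatchSize) AS Source
        ON (Target.ClientId = Source.ClientId
            AND Target.EmployeeReferenceId = Source.EmployeeReferenceId)
    WHEN MATCHED THEN
        UPDATE SET Target.FirstName = Source.FirstName,
                   Target.LastName  = Source.LastName
    WHEN NOT MATCHED THEN
        INSERT (ClientId, EmployeeReferenceId, FirstName, LastName)
        VALUES (Source.ClientId, Source.EmployeeReferenceId, Source.FirstName, Source.LastName)
    OUTPUT $action INTO #SummaryOfChanges;

    SET @MinId = @MinId + @BatchSize;   -- move on to the next slice of the temp table
END

Each iteration commits on its own, so locks are held only for one batch at a time and the audit work is spread out rather than done in one huge transaction.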

Related

Updating a column in a very large SQL table

I have a task on my dev databases to update some customer PII. Some of these tables have many rows and are quite large (on the order of 51 million+ rows and 25 GB in size). Every row in the table will need to be updated, so right now I'm just running a simple update statement without a WHERE clause on a single column, and this query is, at the time of this post, running 35+ minutes. Is there a faster way to update large tables? Or a better way to mask PII data?
Current query is just
update mytable
set mycolumn = 'some text'
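A common pattern is to batch the update so each statement commits and releases its locks quickly. A rough sketch, assuming already-masked rows can be filtered out so the loop terminates; the 50,000 batch size is arbitrary:

WHILE 1 = 1
BEGIN
    UPDATE TOP (50000) mytable
    SET mycolumn = 'some text'
    WHERE mycolumn <> 'some text'
       OR mycolumn IS NULL;        -- skip rows that are already masked

    IF @@ROWCOUNT < 50000
        BREAK;
END

Each batch is its own transaction, so the log growth and lock duration stay small; an index or filter that identifies unmasked rows keeps later iterations from rescanning the whole table.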

Need to move COMPLETED records from one table to another table in Oracle DB

I have millions of records in a table, accumulating each second, and whenever a record's status is marked COMPLETED I want to move that record to a backup table.
I will be running this job periodically. Also, I don't want this move operation to interrupt other operations in the DB.
I have tried the below query.
insert into Table_Bakup
select * from Table where batch_status = 'COMPLETED'
and ID not in (select ID from Table_Bakup);

delete from Table
where ID in (select ID from Table_Bakup);
But with the above query performance will be impacted. Can anyone suggest how can I achieve this?
"Periodically" you mentioned sounds like a database job which is scheduled to run ... I don't know how often. Once a day? Every 2 hours? You decide.
Procedure might look like this:
update table set cb_to_be_moved = 1 where status = 'COMPLETED';
insert into table_bakup (col1, col2, ...)
select col1, col2, ...
from table
where cb_to_be_moved = 1;
delete from table where cb_to_be_moved = 1;
Why cb_to_be_moved? As there are millions of rows affected, between your insert and delete statements there might be some other rows which are set to be "completed" (and those transactions committed). If you use IDs, yes - that works, but - you have to compare millions of rows in table with billions of rows in table_bakup so this (cb_to_be_moved) approach might be faster.
Database trigger? Yes, but - imagine what happens when it fires for millions of rows ... database might choke. Or maybe not; try to test it.
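A sketch of how that might be packaged as a scheduled job; my_table, table_bakup, the column list, and the two-hour interval are placeholders, and the cb_to_be_moved flag column has to be added first:

-- Add the flag column once (name and type are illustrative).
ALTER TABLE my_table ADD (cb_to_be_moved NUMBER(1) DEFAULT 0);

CREATE OR REPLACE PROCEDURE move_completed AS
BEGIN
  UPDATE my_table
     SET cb_to_be_moved = 1
   WHERE batch_status = 'COMPLETED';

  INSERT INTO table_bakup (col1, col2)
  SELECT col1, col2
    FROM my_table
   WHERE cb_to_be_moved = 1;

  DELETE FROM my_table
   WHERE cb_to_be_moved = 1;

  COMMIT;
END;
/

-- Schedule it to run every 2 hours (adjust the interval to taste).
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'MOVE_COMPLETED_JOB',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'MOVE_COMPLETED',
    repeat_interval => 'FREQ=HOURLY;INTERVAL=2',
    enabled         => TRUE);
END;
/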

SQL Merge and large table data

I will be running a MERGE SQL query to query over a million records in my source table and insert into my target table. This table that I'm doing the SELECT from in the Merge is in production. This table will have an application with many users hitting the table for SELECT, INSERT, UPDATE, DELETE at the same time. I will NOT be modifying the source table data with my MERGE statement, only the target table. I will have SQL Snapshot Isolation enabled, so no reason to use NOLOCK hint. Is there a way to have the query run in batches, or is having the MERGE statement scan the entire table more efficient? I have 2 other merge statements I'll be running after the initial INSERT to do INSERT, UPDATE, DELETE on target table for any changes that were done. Are there any precautions I need to take so as to not cause performance issues with the production application? I'm going to use a stored procedure because I will be running these queries on multiple tables that will be doing the same function over and over again.
My sample initial MERGE:
MERGE dl178 AS TARGET
USING dlsd178 AS SOURCE
    ON (TARGET.docid = SOURCE.docid
        AND TARGET.objectid = SOURCE.objectid
        AND TARGET.pagenum = SOURCE.pagenum
        AND TARGET.subpagenum = SOURCE.subpagenum
        AND TARGET.pagever = SOURCE.pagever
        AND TARGET.pathid = SOURCE.pathid
        AND TARGET.annote = SOURCE.annote)
WHEN NOT MATCHED BY TARGET THEN
    INSERT (docid, pagenum, subpagenum, pagever, objectid, pathid, annote, formatid, ftoffset, ftcount)
    VALUES (SOURCE.docid, SOURCE.pagenum, SOURCE.subpagenum, SOURCE.pagever,
            SOURCE.objectid, SOURCE.pathid, SOURCE.annote, SOURCE.formatid, SOURCE.ftoffset, SOURCE.ftcount)
OUTPUT $action, Inserted.*;
Have you considered Partition Switching or Change Data Capture? Sounds like you are trying to start an ongoing ETL process.
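If Change Data Capture is an option (it typically requires Enterprise/Developer edition and SQL Server Agent running), enabling it on the source table looks roughly like this; the dbo schema is an assumption:

EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'dlsd178',
    @role_name     = NULL;

The generated change functions can then feed the later INSERT/UPDATE/DELETE merges instead of rescanning the whole source table.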

Best incremental load method using SSIS with over 20 million records

What is needed: I need 25 million records from Oracle incrementally loaded to SQL Server 2012. The package will need to handle UPDATEs, DELETEs, and NEW RECORDS. The Oracle data source is always changing.
What I have: I've done this many times before, but never with anything past 10 million records. First, I have an [Execute SQL Task] that is set to grab the result set of the [Max Modified Date]. I then have a query that only pulls data from the [ORACLE SOURCE] greater than the [Max Modified Date], and I have that lookup against my destination table.
I have the [ORACLE Source] connecting to the [Lookup-Destination table]. The lookup is set to NO CACHE mode; I get errors if I use partial or full cache mode, because I assume the [ORACLE Source] is always changing. The [Lookup] then connects to a [Conditional Split] where I would input an expression like the one below.
(REPLACENULL(ORACLE.ID,"") != REPLACENULL(Lookup.ID,""))
|| (REPLACENULL(ORACLE.CASE_NUMBER,"")
    != REPLACENULL(Lookup.CASE_NUMBER,""))
I would then have the rows that the [Conditional Split] outputs go into a staging table. I then add an [Execute SQL Task] and perform an UPDATE to the DESTINATION-TABLE with the query below:
UPDATE SD
SET SD.CASE_NUMBER = UP.CASE_NUMBER,
    SD.ID = UP.ID
FROM Destination SD
JOIN STAGING.TABLE UP
    ON UP.ID = SD.ID
Problem: This becomes very slow, takes a very long time, and just keeps running. How can I improve the time and get it to work? Should I use a cache transformation? Should I use a merge statement instead?
How would I use the expression REPLACENULL in the conditional split when it is a date column? Would I use something like:
(REPLACENULL(ORACLE.LAST_MODIFIED_DATE,"01-01-1900 00:00:00.000")
    != REPLACENULL(Lookup.LAST_MODIFIED_DATE,"01-01-1900 00:00:00.000"))
A pattern that is usually faster for larger datasets is to load the source data into a local staging table then use a query like below to identify the new records:
SELECT column1, column2
FROM StagingTable SRC
WHERE NOT EXISTS (
    SELECT * FROM TargetTable TGT
    WHERE TGT.MatchKey = SRC.MatchKey
)
Then you just feed that dataset into an insert:
INSERT INTO TargetTable (column1, column2)
SELECT column1, column2
FROM StagingTable SRC
WHERE NOT EXISTS (
    SELECT * FROM TargetTable TGT
    WHERE TGT.MatchKey = SRC.MatchKey
)
Updates look like this:
UPDATE TGT
SET
    column1 = SRC.column1,
    column2 = SRC.column2,
    DTUpdated = GETDATE()
FROM TargetTable TGT
JOIN StagingTable SRC
    ON TGT.MatchKey = SRC.MatchKey
Note the additional column DTUpdated. You should always have a 'last updated' column in your table to help with auditing and debugging.
This is an INSERT/UPDATE approach. There are other data load approaches, such as windowing (pick a trailing window of data to be fully deleted and reloaded), but the approach depends on how your system works and whether you can make assumptions about the data (i.e. posted data in the source will never be changed).
You can squash the separate INSERT and UPDATE statements into a single MERGE statement, although it gets pretty huge; I've had performance issues with it, and there are other documented issues with MERGE.
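For completeness, the combined version might look roughly like this, using the same illustrative StagingTable/TargetTable/MatchKey names as above:

MERGE TargetTable AS TGT
USING StagingTable AS SRC
    ON TGT.MatchKey = SRC.MatchKey
WHEN MATCHED THEN
    UPDATE SET column1 = SRC.column1,
               column2 = SRC.column2,
               DTUpdated = GETDATE()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (column1, column2, DTUpdated)
    VALUES (SRC.column1, SRC.column2, GETDATE());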
Unfortunately, there's not a good way to do what you're trying to do. SSIS has some controls and documented ways to do this, but as you have found they don't work as well when you start dealing with large amounts of data.
At a previous job, we had something similar that we needed to do. We needed to update medical claims from a source system to another system, similar to your setup. For a very long time, we just truncated everything in the destination and rebuilt every night. I think we were doing this daily with more than 25M rows. If you're able to transfer all the rows from Oracle to SQL in a decent amount of time, then truncating and reloading may be an option.
We eventually had to get away from this as our volumes grew, however. We tried to do something along the lines of what you're attempting, but never got anything we were satisfied with. We ended up with a sort of non-conventional process. First, each medical claim had a unique numeric identifier. Second, whenever the medical claim was updated in the source system, there was an incremental ID on the individual claim that was also incremented.
Step one of our process was to bring over any new medical claims, or claims that had changed. We could determine this quite easily, since the unique ID and the "change ID" column were both indexed in source and destination. These records would be inserted directly into the destination table. The second step was our "deletes", which we handled with a logical flag on the records. For actual deletes, where records existed in destination but were no longer in source, I believe it was actually fastest to do this by selecting the DISTINCT claim numbers from the source system and placing them in a temporary table on the SQL side. Then, we simply did a LEFT JOIN update to set the missing claims to logically deleted. We did something similar with our updates: if a newer version of the claim was brought over by our original Lookup, we would logically delete the old one. Every so often we would clean up the logical deletes and actually delete them, but since the logical delete indicator was indexed, this didn't need to be done too frequently. We never saw much of a performance hit, even when the logically deleted records numbered in the tens of millions.
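The logical-delete step described above might look something like this sketch; the table, column, and temp table names are made up for illustration:

-- #SourceClaimNumbers holds the DISTINCT claim numbers pulled from the source system.
UPDATE c
SET c.IsDeleted = 1
FROM dbo.Claims AS c
LEFT JOIN #SourceClaimNumbers AS s
    ON s.ClaimNumber = c.ClaimNumber
WHERE s.ClaimNumber IS NULL
  AND c.IsDeleted = 0;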
This process was always evolving as our server loads and data source volumes changed, and I suspect the same may be true for your process. Because every system and setup is different, some of the things that worked well for us may not work for you, and vice versa. I know our data center was relatively good and we were on some stupid fast flash storage, so truncating and reloading worked for us for a very, very long time. This may not be true on conventional storage, where your data interconnects are not as fast, or where your servers are not colocated.
When designing your process, keep in mind that deletes are one of the more expensive operations you can perform, followed by updates and by non-bulk inserts, respectively.
Incremental Approach using SSIS
1. Get Max(ID) and Max(ModifiedDate) from the destination table and store them in variables.
2. Create a temporary staging table using an Execute SQL Task and store that temporary staging table name in a variable (a sketch of these first two statements follows the list).
3. Use a Data Flow Task with an OLE DB Source and OLE DB Destination to pull the data from the source system and load it into the temporary staging table.
4. Use two Execute SQL Tasks, one for the insert process and the other for the update.
5. Drop the temporary table.
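A rough sketch of what the first two Execute SQL Tasks might run; the ##salesdetails name matches the global temp table used in the queries below, and copying the destination's structure with SELECT TOP (0) ... INTO is just one convenient way to create it:

SELECT MAX(salesorderdetailid) AS MaxID,
       MAX(modifieddate)       AS MaxModifiedDate
FROM sales.salesorderdetails;

-- Note: SELECT ... INTO also copies any IDENTITY property; use an explicit
-- CREATE TABLE instead if that gets in the way of the data flow load.
SELECT TOP (0) *
INTO ##salesdetails
FROM sales.salesorderdetails;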
INSERT INTO sales.salesorderdetails
(
    salesorderid,
    salesorderdetailid,
    carriertrackingnumber,
    orderqty,
    productid,
    specialofferid,
    unitprice,
    unitpricediscount,
    linetotal,
    rowguid,
    modifieddate
)
SELECT sd.salesorderid,
       sd.salesorderdetailid,
       sd.carriertrackingnumber,
       sd.orderqty,
       sd.productid,
       sd.specialofferid,
       sd.unitprice,
       sd.unitpricediscount,
       sd.linetotal,
       sd.rowguid,
       sd.modifieddate
FROM ##salesdetails AS sd
WHERE NOT EXISTS
(
    SELECT *
    FROM sales.salesorderdetails sa
    WHERE sa.salesorderdetailid = sd.salesorderdetailid
)
AND sd.salesorderdetailid > ?   -- ? = Max(ID) captured in step 1
UPDATE sa
SET SalesOrderID = sd.salesorderid,
    CarrierTrackingNumber = sd.carriertrackingnumber,
    OrderQty = sd.orderqty,
    ProductID = sd.productid,
    SpecialOfferID = sd.specialofferid,
    UnitPrice = sd.unitprice,
    UnitPriceDiscount = sd.unitpricediscount,
    LineTotal = sd.linetotal,
    rowguid = sd.rowguid,
    ModifiedDate = sd.modifieddate
FROM sales.salesorderdetails sa
JOIN ##salesdetails sd
    ON sd.salesorderdetailid = sa.salesorderdetailid
WHERE sd.modifieddate > sa.modifieddate   -- only overwrite rows where the staged copy is newer
AND sa.salesorderdetailid < ?             -- ? = Max(ID) captured in step 1
The entire process took 2 minutes to complete.
(Incremental process screenshot omitted.)
I am assuming you have some identity-like (PK) column in your Oracle table.
1. Get the max identity (business key) from the destination database (the SQL Server one).
2. Create two data flows:
a) Pull only data > max identity from Oracle and load it directly into the destination (these are new records).
b) Pull all records < max identity with an update date > the last load, and put them into a temp (staging) table (this is updated data).
3. Update the destination table with the records from the temp table created in step b.

Is it possible to do batch operations with TSQL "merge" statement?

With a simple UPDATE statement we can do it in batches when dealing with huge tables.
WHILE 1 = 1
BEGIN
    UPDATE TOP (5000) dbo.LargeOrders
    SET CustomerID = N'ABCDE'
    WHERE CustomerID = N'OLDWO';

    IF @@ROWCOUNT < 5000
        BREAK;
END
When working with the MERGE statement, is it possible to do similar things? As far as I know this is not possible, because you need to do different operations based on the match condition, for example UPDATE when matched and INSERT when not matched. I just want to confirm this, and I may need to switch to the old-school UPDATE & INSERT if it's true.
Why not use a temp table as the source of your MERGE and then handle batching via the source table? I do this in a project of mine where I'll batch 50,000 rows to a temp table and then use the temp table as the source for the MERGE. I do this in a loop until all rows are processed.
The temp table could be a physical table or an in-memory table.
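A sketch of that loop, reusing the LargeOrders table from the question; the SourceTable name, the Processed flag, the OrderID/CustomerID columns, and the 50,000 batch size are all illustrative:

CREATE TABLE #Batch (OrderID int PRIMARY KEY, CustomerID nchar(5));

DECLARE @Rows int;
SET @Rows = 1;

WHILE @Rows > 0
BEGIN
    TRUNCATE TABLE #Batch;

    -- Stage the next slice of unprocessed source rows.
    INSERT INTO #Batch (OrderID, CustomerID)
    SELECT TOP (50000) OrderID, CustomerID
    FROM dbo.SourceTable
    WHERE Processed = 0
    ORDER BY OrderID;

    SET @Rows = @@ROWCOUNT;

    IF @Rows > 0
    BEGIN
        MERGE dbo.LargeOrders AS Target
        USING #Batch AS Source
            ON Target.OrderID = Source.OrderID
        WHEN MATCHED THEN
            UPDATE SET Target.CustomerID = Source.CustomerID
        WHEN NOT MATCHED THEN
            INSERT (OrderID, CustomerID)
            VALUES (Source.OrderID, Source.CustomerID);

        -- Flag the staged rows so the next iteration picks up a fresh slice.
        UPDATE s
        SET Processed = 1
        FROM dbo.SourceTable AS s
        JOIN #Batch AS b ON b.OrderID = s.OrderID;
    END
END

DROP TABLE #Batch;

Each pass merges only one slice, so locks on the target are released between iterations rather than being held for the whole data set.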