How do I optimize writing to a large remote table? - sql

I'm working with a set of servers that all record an event that typically occurs multiple times a day, then later call a stored procedure to copy those records to a matching table on a central remote server. The key part of that stored procedure is as follows:
First, because the events take several minutes, sometimes they won't be complete when they're copied and certain records in the central server will have null values in certain columns. The stored procedure updates the records this happened to last time:
UPDATE r SET r.ToUpdateA = l.ToUpdateA, r.ToUpdateB = l.ToUpdateB
FROM LocalTable l INNER JOIN RemoteServer.RemoteDB.dbo.RemoteTable r
ON l.IdentifierA = r.IdentifierA AND l.IdentifierB = r.IdentifierB
WHERE r.ToUpdateB IS NULL AND l.ToUpdateB IS NOT NULL;
Both IdentifierA and IdentifierB are necessary to identify a given record; the first identifies which server it's from.
Second comes the insert itself, identifying records in the local table that aren't in the remote table and inserting them:
INSERT INTO RemoteServer.RemoteDB.dbo.RemoteTable (A, B, C...)
SELECT l.A, l.B, l.C...
FROM LocalTable l LEFT OUTER JOIN RemoteServer.RemoteDB.dbo.RemoteTable r
ON l.IdentifierA = r.IdentifierA AND l.IdentifierB = r.IdentifierB
WHERE r.uid IS NULL;
These joins are taking too long as the central remote table grows, especially on the larger servers. The estimated execution plan indicates that most of the work is being done in a Remote Scan for the UPDATE's inner join (relating to the r.ToUpdateB IS NULL part) and a Remote Query for the INSERT's left outer join (selecting three columns from the entire RemoteTable). I can think of three types of solutions:
Delete old records. We've never needed to look further back than a month or so.
Split the work between stored procedures on the "spoke" and "hub" servers. This would mean just copying new records blindly to a new intermediate table on the "hub", perhaps with an extra BIT column on the "spokes" to indicate whether a given record has been copied, and having the "hub" weed out duplicates itself.
Modify the joins to be faster. This is what I'd like to do if possible -- there's probably a way to send the recent data to the hub server and instruct it on what to do with it, all in the same query, rather than fetching massive amounts of data from the hub. I tried changing INNER JOIN to INNER REMOTE JOIN, but if I'm interpreting the modified execution plan correctly, that would take orders of magnitude longer.
Is #3 feasible? If so, how?

The best way, by far, that I have found to dramatically increase performance on Linked Server DML statements is to not do them ;-). Fortunately, I am being more cheeky than sarcastic :).
The trick is to do the DML work on the server where the table lives. In order to do that you:
gather the related/relevant data
package it up as XML (but stored in an NVARCHAR(MAX) variable since XML is not a valid datatype for Linked Server calls)
execute a proc on the remote server, passing in that dataset, that unpacks the XML into a Temporary Table and joins to that (hence a local transaction).
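As a rough illustration of that last step, the remote proc might look something like the following. This is only a sketch with assumed details: the proc name matches the UpdateProc call used later, but the XML shape (FOR XML PATH('row'), ROOT('rows')), the datatypes, and the column list are placeholders you would adjust to your schema.
-- Hypothetical remote-side proc: receives the XML as NVARCHAR(MAX), shreds it
-- into a temp table, then performs the UPDATE as a purely local join.
-- Assumes the caller built the XML with FOR XML PATH('row'), ROOT('rows').
CREATE PROCEDURE dbo.UpdateProc
    @UpdateData NVARCHAR(MAX)
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @xml XML = CONVERT(XML, @UpdateData);

    CREATE TABLE #Incoming
    (
        IdentifierA INT NOT NULL,          -- use the real datatypes here
        IdentifierB INT NOT NULL,
        ToUpdateA   VARCHAR(50) NULL,
        ToUpdateB   VARCHAR(50) NULL
    );

    INSERT INTO #Incoming (IdentifierA, IdentifierB, ToUpdateA, ToUpdateB)
    SELECT  x.r.value('(IdentifierA)[1]', 'INT'),
            x.r.value('(IdentifierB)[1]', 'INT'),
            x.r.value('(ToUpdateA)[1]', 'VARCHAR(50)'),
            x.r.value('(ToUpdateB)[1]', 'VARCHAR(50)')
    FROM    @xml.nodes('/rows/row') AS x(r);

    -- Local transaction; no linked server or distributed transaction involved.
    -- The IS NULL filter mirrors the original UPDATE in the question.
    UPDATE rt
    SET    rt.ToUpdateA = i.ToUpdateA,
           rt.ToUpdateB = i.ToUpdateB
    FROM   dbo.RemoteTable rt
    INNER JOIN #Incoming i
            ON  i.IdentifierA = rt.IdentifierA
            AND i.IdentifierB = rt.IdentifierB
    WHERE  rt.ToUpdateB IS NULL;
END;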
I have this method detailed in two answers:
Cross Server Transaction taking too long inside a procedure
DELETE from Linked Server table using OPENQUERY and dynamic criteria
The method described above deals with how to transfer data over faster, but doesn't address an improvement that can be made on identifying what data to move over in the first place.
Scanning the destination table each time to determine missing records, even if it were merely in a different database on the same instance, is very expensive as row counts increase. This expense can be avoided by dumping new records into a queue table. The queue table holds only the records that need to be inserted and potentially updated. Once you know that the records have been synced remotely, you remove them. This is similar to your option #3 in the Question, but not doing it all in a single query, as there is no way to identify the "new" records without either scanning the destination table (simple, but doesn't scale) or capturing them as they come in (a little more effort, but scales quite well).
The queue table can be either:
a user created table that is populated via an INSERT trigger. This table can be just the key fields and a status (needed to keep track of the INSERT vs potential UPDATE)
a system table created by enabling Change Data Capture (CDC) or Change Tracking on the source table
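If you go the Change Tracking route, enabling it looks roughly like this (a sketch; the database name is a placeholder, and note that Change Tracking requires a primary key on the tracked table):
-- Enable Change Tracking at the database level (retention values are examples)
ALTER DATABASE YourLocalDB
SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

-- Enable it on the source table
ALTER TABLE dbo.LocalTable
    ENABLE CHANGE_TRACKING
    WITH (TRACK_COLUMNS_UPDATED = OFF);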
In either case, you would do something along the lines of:
Create the queue table
CREATE TABLE RemoteTableQueue
(
RemoteTableQueueID INT NOT NULL IDENTITY(-2140000000, 1)
CONSTRAINT [PK_RemoteTableQueue] PRIMARY KEY,
IdentifierA DATATYPE NOT NULL,
IdentifierB DATATYPE NOT NULL,
StatusID TINYINT NOT NULL
);
Create an AFTER INSERT trigger
INSERT INTO RemoteTableQueue (IdentifierA, IdentifierB, StatusID)
SELECT IdentifierA, IdentifierB, 1
FROM INSERTED;
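For completeness, that trigger body would live inside a definition along these lines (the trigger name is illustrative):
CREATE TRIGGER trg_LocalTable_QueueNewRows
ON dbo.LocalTable
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- StatusID = 1: newly inserted locally, not yet sent to the remote server
    INSERT INTO RemoteTableQueue (IdentifierA, IdentifierB, StatusID)
    SELECT IdentifierA, IdentifierB, 1
    FROM INSERTED;
END;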
Update your ETL proc (assuming this is single-threaded)
CREATE TABLE #TempUpdate
(
IdentifierA DATATYPE NOT NULL,
IdentifierB DATATYPE NOT NULL,
ToUpdateA DATATYPE NOT NULL,
ToUpdateB DATATYPE NOT NULL
);
BEGIN TRAN;
INSERT INTO #TempUpdate (IdentifierA, IdentifierB, ToUpdateA, ToUpdateB)
SELECT lt.IdentifierA, lt.IdentifierB, lt.ToUpdateA, lt.ToUpdateB
FROM LocalTable lt
INNER JOIN RemoteTableQueue rtq
ON lt.IdentifierA = rtq.IdentifierA
AND lt.IdentifierB = rtq.IdentifierB
WHERE rtq.StatusID = 2 -- rows eligible for UPDATE
AND lt.ToUpdateB IS NOT NULL;
DECLARE @UpdateData NVARCHAR(MAX);
SET @UpdateData = (
SELECT *
FROM #TempUpdate
FOR XML ...);
EXEC RemoteServer.RemoteDB.dbo.UpdateProc @UpdateData;
DELETE rtq
FROM RemoteTableQueue rtq
INNER JOIN #TempUpdate tmp
ON tmp.IdentifierA = rtq.IdentifierA
AND tmp.IdentifierB = rtq.IdentifierB;
TRUNCATE TABLE #TempUpdate;
INSERT INTO #TempUpdate (IdentifierA, IdentifierB, ToUpdateA, ToUpdateB)
SELECT lt.IdentifierA, lt.IdentifierB, lt.ToUpdateA, lt.ToUpdateB
FROM LocalTable lt
INNER JOIN RemoteTableQueue rtq
ON lt.IdentifierA = rtq.IdentifierA
AND lt.IdentifierB = rtq.IdentifierB
WHERE rtq.StatusID = 1; -- rows to INSERT
SET @UpdateData = (
SELECT lt.*
FROM LocalTable lt
INNER JOIN #TempUpdate tmp
ON tmp.IdentifierA = lt.IdentifierA
AND tmp.IdentifierB = lt.IdentifierB
FOR XML ...);
EXEC RemoteServer.RemoteDB.dbo.InsertProc @UpdateData;
-- no need to check for changed value later if it already has it now
DELETE rtq
FROM RemoteTableQueue rtq
INNER JOIN #TempUpdate tmp
ON tmp.IdentifierA = rtq.IdentifierA
AND tmp.IdentifierB = rtq.IdentifierB
WHERE tmp.ToUpdateB IS NOT NULL;
-- we know these records will need to be checked later since they are NULL
UPDATE rtq
SET rtq.StatusID = 2 -- rows eligible for UPDATE
FROM RemoteTableQueue rtq
INNER JOIN #TempUpdate tmp
ON tmp.IdentifierA = rtq.IdentifierA
AND tmp.IdentifierB = rtq.IdentifierB
WHERE tmp.ToUpdateB IS NULL;
COMMIT;
Additional Steps
Add TRY / CATCH logic to ETL proc to properly handle ROLLBACK
Update the remote INSERT and UPDATE procs to batch the incoming data into the destination table (loop through the temp table populated from the incoming XML, processing 1,000 rows at a time until done); a rough sketch of such a loop follows this list.
If there is too much contention between "spoke" servers reporting in at the same time, create an incoming Queue table on the Remote server that the incoming XML data simply gets inserted into with no additional logic. That is a very clean and quick operation. Then create a local job on the Remote server to check every few minutes and if rows exist in the incoming Queue table, process them into the destination table. This separates the transactions between the Source server/table and the Destination server/table, thereby reducing contention.
The [RemoteTableQueueID] field exists in case you change your ETL model to just run every 3 - 10 minutes all day long, grabbing the TOP (@BatchSize) rows to process, in which case you would want to ORDER BY [RemoteTableQueueID] ASC.
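As referenced in the batching step above, the remote proc's "1,000 rows at a time" loop could be sketched like this (the column list and table names are placeholders; #Incoming stands for the temp table already populated from the incoming XML):
DECLARE @BatchSize INT = 1000,
        @RowsAffected INT = 1;

WHILE @RowsAffected > 0
BEGIN
    -- Each pass inserts up to @BatchSize rows that are not yet in the destination;
    -- once no new rows remain, @@ROWCOUNT hits 0 and the loop ends.
    INSERT INTO dbo.RemoteTable (A, B, C)   -- real column list goes here
    SELECT TOP (@BatchSize) i.A, i.B, i.C
    FROM   #Incoming i
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dbo.RemoteTable rt
                       WHERE  rt.IdentifierA = i.IdentifierA
                          AND rt.IdentifierB = i.IdentifierB);

    SET @RowsAffected = @@ROWCOUNT;
END;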

Related

Fastest options for merging two tables in SQL Server

Consider two very large tables: Table A with 20 million rows, and Table B, which has a large overlap with Table A, with 10 million rows. Both have an identifier column and a bunch of other data. I need to move all items from Table B into Table A, updating them where they already exist.
Both table structures
- Identifier int
- Date DateTime,
- Identifier A
- Identifier B
- General decimal data.. (maybe 10 columns)
I can get the items in Table B that are new, and the items in Table B that need to be updated in Table A, very quickly, but I can't get an update or a delete/insert to work quickly. What options are available to merge the contents of Table B into Table A (i.e. updating existing records instead of inserting) in the shortest time?
I've tried pulling out existing records in TableB and running a large update on table A to update just those rows (i.e. an update statement per row), and performance is pretty bad, even with a good index on it.
I've also tried doing a one shot delete of the different values out of TableA that exist in TableB and performance of the delete is also poor, even with the indexes dropped.
I appreciate that this may be difficult to perform quickly, but I'm looking for other options that are available to achieve this.
Since you are dealing with two large tables, in-place updates/inserts/merges can be time-consuming operations. I would recommend using a bulk-logged technique to load the desired content into a new table and then perform a table swap:
Example using SELECT INTO:
SELECT *
INTO NewTableA
FROM (
SELECT * FROM dbo.TableB b WHERE NOT EXISTS (SELECT * FROM dbo.TableA a WHERE a.id = b.id)
UNION ALL
SELECT * FROM dbo.TableA a
) d
exec sp_rename 'TableA', 'BackupTableA'
exec sp_rename 'NewTableA', 'TableA'
Simple or at least Bulk-Logged recovery is highly recommended for such an approach. Also, I assume that it has to be done outside business hours, since plenty of objects will need to be recreated on the new table: indexes, default constraints, primary key, etc.
A MERGE is probably your best bet if you want to do both inserts and updates.
MERGE #TableB AS Tgt
USING (SELECT * FROM #TableA) Src
ON (Tgt.Identifier = Src.Identifier)
WHEN MATCHED THEN
UPDATE SET Date = Src.Date, ...
WHEN NOT MATCHED THEN
INSERT (Identifier, Date, ...)
VALUES (Src.Identifier, Src.Date, ...);
Note that the merge statement must be terminated with a ;

Best incremental load method using SSIS with over 20 million records

What is needed: I need 25 million records from Oracle incrementally loaded to SQL Server 2012. The package will need UPDATE, DELETE, and NEW RECORDS handling. The Oracle data source is always changing.
What I have: I've done this many times before, but not with anything past 10 million records. First, I have an [Execute SQL Task] that is set to grab the result set of the [Max Modified Date]. I then have a query that only pulls data from the [ORACLE SOURCE] > [Max Modified Date] and have that lookup against my destination table.
I have the [ORACLE Source] connecting to the [Lookup-Destination table]; the lookup is set to NO CACHE mode. I get errors if I use partial or full cache mode because I assume the [ORACLE Source] is always changing. The [Lookup] then connects to a [Conditional Split] where I would input an expression like the one below.
(REPLACENULL(ORACLE.ID,"") != REPLACENULL(Lookup.ID,""))
|| (REPLACENULL(ORACLE.CASE_NUMBER,"")
!= REPLACENULL(ORACLE.CASE_NUMBER,""))
I would then have the rows that the [Conditional Split] outputs into a staging table. I then add a [Execute SQL Task] and perform an UPDATE to the DESTINATION-TABLE with the query below:
UPDATE SD
SET SD.CASE_NUMBER = UP.CASE_NUMBER,
SD.ID = UP.ID
FROM Destination SD
JOIN STAGING.TABLE UP
ON UP.ID = SD.ID
Problem: This becomes very slow, takes a very long time, and just keeps running. How can I improve the time and get it to work? Should I use a cache transformation? Should I use a merge statement instead?
How would I use the expression REPLACENULL in the conditional split when it is a date column? Would I use something like:
(REPLACENULL(ORACLE.LAST_MODIFIED_DATE,"01-01-1900 00:00:00.000")
!= REPLACENULL(Lookup.LAST_MODIFIED_DATE," 01-01-1900 00:00:00.000"))
A pattern that is usually faster for larger datasets is to load the source data into a local staging table then use a query like below to identify the new records:
SELECT column1, column2
FROM StagingTable SRC
WHERE NOT EXISTS (
SELECT * FROM TargetTable TGT
WHERE TGT.MatchKey = SRC.MatchKey
)
Then you just feed that dataset into an insert:
INSERT INTO TargetTable (column1, column2)
SELECT column1, column2
FROM StagingTable SRC
WHERE NOT EXISTS (
SELECT * FROM TargetTable TGT
WHERE TGT.MatchKey = SRC.MatchKey
)
Updates look like this:
UPDATE TGT
SET
column1 = SRC.column1,
column2 = SRC.column2,
DTUpdated = GETDATE()
FROM TargetTable TGT
INNER JOIN StagingTable SRC
ON TGT.MatchKey = SRC.MatchKey
Note the additional column DTUpdated. You should always have a 'last updated' column in your table to help with auditing and debugging.
This is an INSERT/UPDATE approach. There are other data load approaches such as windowing (pick a trailing window of data to be fully deleted and reloaded) but the approach depends on how your system works and whether you can make assumptions about data (i.e. posted data in the source will never be changed)
You can squash the separate INSERT and UPDATE statements into a single MERGE statement, although it gets pretty huge; I've had performance issues with it, and there are other documented issues with MERGE.
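For reference, a single-statement version of the INSERT/UPDATE pair above would look roughly like this, using the same placeholder column names; given the caveats just mentioned, treat it as a sketch to test rather than a drop-in replacement:
MERGE TargetTable AS TGT
USING StagingTable AS SRC
    ON TGT.MatchKey = SRC.MatchKey
WHEN MATCHED THEN
    UPDATE SET TGT.column1   = SRC.column1,
               TGT.column2   = SRC.column2,
               TGT.DTUpdated = GETDATE()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (column1, column2, DTUpdated)
    VALUES (SRC.column1, SRC.column2, GETDATE());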
Unfortunately, there's not a good way to do what you're trying to do. SSIS has some controls and documented ways to do this, but as you have found they don't work as well when you start dealing with large amounts of data.
At a previous job, we had something similar that we needed to do. We needed to update medical claims from a source system to another system, similar to your setup. For a very long time, we just truncated everything in the destination and rebuilt every night. I think we were doing this daily with more than 25M rows. If you're able to transfer all the rows from Oracle to SQL in a decent amount of time, then truncating and reloading may be an option.
We eventually had to get away from this as our volumes grew, however. We tried to do something along the lines of what you're attempting, but never got anything we were satisfied with. We ended up with a sort of non-conventional process. First, each medical claim had a unique numeric identifier. Second, whenever the medical claim was updated in the source system, there was an incremental ID on the individual claim that was also incremented.
Step one of our process was to bring over any new medical claims, or claims that had changed. We could determine this quite easily, since the unique ID and the "change ID" column were both indexed in source and destination. These records would be inserted directly into the destination table. The second step was our "deletes", which we handled with a logical flag on the records. For actual deletes, where records existed in destination but were no longer in source, I believe it was actually fastest to do this by selecting the DISTINCT claim numbers from the source system and placing them in a temporary table on the SQL side. Then, we simply did a LEFT JOIN update to set the missing claims to logically deleted. We did something similar with our updates: if a newer version of the claim was brought over by our original Lookup, we would logically delete the old one. Every so often we would clean up the logical deletes and actually delete them, but since the logical delete indicator was indexed, this didn't need to be done too frequently. We never saw much of a performance hit, even when the logically deleted records numbered in the tens of millions.
This process was always evolving as our server loads and data source volumes changed, and I suspect the same may be true for your process. Because every system and setup is different, some of the things that worked well for us may not work for you, and vice versa. I know our data center was relatively good and we were on some stupid fast flash storage, so truncating and reloading worked for us for a very, very long time. This may not be true on conventional storage, where your data interconnects are not as fast, or where your servers are not colocated.
When designing your process, keep in mind that deletes are one of the more expensive operations you can perform, followed by updates and by non-bulk inserts, respectively.
Incremental Approach using SSIS
Get Max(ID) and Max(ModifiedDate) from the destination table and store them in variables
Create a temporary staging table using an Execute SQL Task and store that temporary staging table name in a variable
Use a Data Flow Task with an OLEDB Source and an OLEDB Destination to pull the data from the source system and load it into the temporary staging table
Use two Execute SQL Tasks, one for the insert process and the other for the update
Drop the temporary table
INSERT INTO sales.salesorderdetails
(
salesorderid,
salesorderdetailid,
carriertrackingnumber ,
orderqty,
productid,
specialofferid,
unitprice,
unitpricediscount,
linetotal ,
rowguid,
modifieddate
)
SELECT sd.salesorderid,
sd.salesorderdetailid,
sd.carriertrackingnumber,
sd.orderqty,
sd.productid ,
sd.specialofferid ,
sd.unitprice,
sd.unitpricediscount,
sd.linetotal,
sd.rowguid,
sd.modifieddate
FROM ##salesdetails AS sd WITH (nolock)
WHERE NOT EXISTS
(
SELECT *
FROM sales.salesorderdetails sa
WHERE sa.salesorderdetailid = sd.salesorderdetailid)
AND sd.salesorderdetailid > ?
UPDATE sa
SET SalesOrderID = sd.salesorderid,
CarrierTrackingNumber = sd.carriertrackingnumber,
OrderQty = sd.orderqty,
ProductID = sd.productid,
SpecialOfferID = sd.specialofferid,
UnitPrice = sd.unitprice,
UnitPriceDiscount = sd.unitpricediscount,
LineTotal = sd.linetotal,
rowguid = sd.rowguid,
ModifiedDate = sd.modifieddate
FROM sales.salesorderdetails sa
INNER JOIN ##salesdetails sd
ON sd.salesorderdetailid = sa.salesorderdetailid
WHERE sd.modifieddate > sa.modifieddate
AND sa.salesorderdetailid < ?
Entire Process took 2 Minutes to Complete
Incremental Process Screenshot
I am assuming you have some identity-like (PK) column in your Oracle table.
1. Get the max identity (business key) from the destination database (the SQL Server one).
2. Create two data flows:
a) Pull only data > max identity from Oracle and put it into the destination directly (these are new records).
b) Get all records < max identity with an update date > last load, and put them into a temp (staging) table (this is updated data).
3. Update the destination table with the records from the temp table created at step b. A rough sketch of steps 1 and 3 follows.
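A rough T-SQL sketch of steps 1 and 3 (table and column names are placeholders for your own):
-- Step 1: capture the high-water marks from the destination (SQL Server) table
SELECT  MAX(ID)           AS MaxID,
        MAX(ModifiedDate) AS MaxModifiedDate
FROM    dbo.DestinationTable;

-- Step 3: apply the changed rows that step 2b landed in the staging table
UPDATE  d
SET     d.CASE_NUMBER = s.CASE_NUMBER
        -- , other columns as needed
FROM    dbo.DestinationTable d
INNER JOIN dbo.StagingTable s
        ON s.ID = d.ID;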

How to SELECT COUNT from a table currently being INSERTed into?

Hi, consider there is an INSERT statement running on a table TABLE_A that takes a long time; I would like to see how far it has progressed.
What I tried was to open up a new session (new query window in SSMS) while the long running statement is still in process, I ran the query
SELECT COUNT(1) FROM TABLE_A WITH (nolock)
hoping that it would return right away with the number of rows every time I ran the query, but in my test, even with (nolock), it only returned after the INSERT statement completed.
What have I missed? Do I add (nolock) to the INSERT statement as well? Or is this not achievable?
(Edit)
OK, I have found what I missed. If you first use CREATE TABLE TABLE_A, then INSERT INTO TABLE_A, the SELECT COUNT will work. If you use SELECT * INTO TABLE_A FROM xxx, without first creating TABLE_A, then none of the following will work (not even sysindexes).
Short answer: You can't do this.
Longer answer: A single INSERT statement is an atomic operation. As such, the query has either inserted all the rows or has inserted none of them. Therefore you can't get a count of how far through it has progressed.
Even longer answer: Martin Smith has given you a way to achieve what you want. Whether you still want to do it that way is up to you of course. Personally I still prefer to insert in manageable batches if you really need to track progress of something like this. So I would rewrite the INSERT as multiple smaller statements. Depending on your implementation, that may be a trivial thing to do.
If you are using SQL Server 2016, the Live Query Statistics feature can allow you to see the progress of the insert in real time.
The below screenshot was taken while inserting 10 million rows into a table with a clustered index and a single nonclustered index.
It shows that the insert was 88% complete on the clustered index, and this will be followed by a sort operator to get the values into nonclustered index key order before inserting into the NCI. The sort is a blocking operator and cannot output any rows until all input rows are consumed, so the operators to the left of it are 0% done.
With respect to your question on NOLOCK
It is trivial to test
Connection 1
USE tempdb
CREATE TABLE T2
(
X INT IDENTITY PRIMARY KEY,
F CHAR(8000)
);
-- Spin until Connection 2 starts inserting rows
WHILE NOT EXISTS(SELECT * FROM T2 WITH (NOLOCK))
WAITFOR DELAY '00:00:01';
SELECT COUNT(*) AS CountMethod FROM T2 WITH (NOLOCK);
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('T2');
RAISERROR ('Waiting for 10 seconds',0,1) WITH NOWAIT;
WAITFOR delay '00:00:10';
SELECT COUNT(*) AS CountMethod FROM T2 WITH (NOLOCK);
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('T2');
RAISERROR ('Waiting to drop table',0,1) WITH NOWAIT
DROP TABLE T2
Connection 2
use tempdb;
--Insert 2000 * 2000 = 4 million rows
WITH T
AS (SELECT TOP 2000 'x' AS x
FROM master..spt_values)
INSERT INTO T2
(F)
SELECT 'X'
FROM T v1
CROSS JOIN T v2
OPTION (MAXDOP 1)
Example Results - Showing row count increasing
SELECT queries with NOLOCK allow dirty reads. They don't actually take no locks and can still be blocked, they still need a SCH-S (schema stability) lock on the table (and on a heap it will also take a hobt lock).
The only thing incompatible with a SCH-S is a SCH-M (schema modification) lock. Presumably you also performed some DDL on the table in the same transaction (e.g. perhaps created it in the same tran)
For the use case of a large insert, where an approximate in flight result is fine, I generally just poll sysindexes as shown above to retrieve the count from metadata rather than actually counting the rows (non deprecated alternative DMVs are available)
When an insert has a wide update plan you can even see it inserting to the various indexes in turn that way.
If the table is created inside the inserting transaction this sysindexes query will still block though as the OBJECT_ID function won't return a result based on uncommitted data regardless of the isolation level in effect. It's sometimes possible to get around that by getting the object_id from sys.tables with nolock instead.
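That workaround looks something like this (illustrative, using the T2 table from the test above):
SELECT si.rows
FROM   sysindexes si WITH (NOLOCK)
WHERE  si.id = (SELECT t.object_id
                FROM   sys.tables t WITH (NOLOCK)
                WHERE  t.name = 'T2');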
Use the query below to find the count, in seconds, for any large table, locked table, or table currently being inserted into. Just replace the table name with the one you want.
SELECT
Total_Rows= SUM(st.row_count)
FROM
sys.dm_db_partition_stats st
WHERE
object_name(object_id) = 'TABLENAME' AND (index_id < 2)
For those who just need to see the record count while executing a long-running INSERT script, I found you can see the current record count through SSMS by right-clicking on the destination database table, then Properties -> Storage, and viewing the "Row Count" value.
Close the window and repeat to see the updated record count.

Passing a table to a stored procedure

I have a table with 20 billion rows. The table does not have any indexes, as it was created on the fly for a bulk insert operation. The table is being used in a stored procedure which does the following operations:
Delete A
from master a
inner join (Select distinct Col from TableB ) b
on A.Col = B.Col
Insert into master
Select *
from tableB
group by col1,col2,col3
TableB is the one that has 20 billion rows. I don't want to execute the SP directly because it might take days to complete. Master is also a huge table and has a clustered index on Col.
Can I pass chunks of rows to the stored procedure and perform the operation? This might reduce the log file growth. If yes, how can I do that?
Should I create a clustered index on the table and execute the SP, which might be a little faster? But then again, I think creating a CI on a huge table might take 10 hours to complete.
Or is there any way to perform this operation fast?
I've used a method similar to this one. I'd recommend putting your DB into Bulk Logged recovery mode instead of Full recovery mode if you can.
Blog entry reproduced below to future proof it.
Below is a technique used to transfer a large amount of records from
one table to another. This scales pretty well for a couple reasons.
First, this will not fill up the entire log prior to committing the
transaction. Rather, it will populate the table in chunks of 10,000
records. Second, it’s generally much quicker. You will have to play
around with the batch size. Sometimes it’s more efficient at 10,000,
sometimes 500,000, depending on the system.
If you do not need to insert into an existing table and just need a
copy of the table, it is better to do a SELECT INTO. However for this
example, we are inserting into an existing table.
Another trick you should do is to change the recovery model of the
database to simple. This way, there will be much less logging in the
transaction log.
The WITH (TABLOCK) below only works in SQL 2008.
DECLARE @BatchSize INT = 10000
WHILE 1 = 1
BEGIN
INSERT INTO [dbo].[Destination] --WITH (TABLOCK) -- Uncomment for 2008
(
FirstName
,LastName
,EmailAddress
,PhoneNumber
)
SELECT TOP(@BatchSize)
s.FirstName
,s.LastName
,s.EmailAddress
,s.PhoneNumber
FROM [dbo].[SOURCE] s
WHERE NOT EXISTS (
SELECT 1
FROM dbo.Destination
WHERE PersonID = s.PersonID
)
IF @@ROWCOUNT < @BatchSize BREAK
END
With the above example, it is important to have at least a non
clustered index on PersonID in both tables.
Another way to transfer records is to use multiple threads. Specifying
a range of records as such:
INSERT INTO [dbo].[Destination]
(
FirstName
,LastName
,EmailAddress
,PhoneNumber
)
SELECT TOP(@BatchSize)
s.FirstName
,s.LastName
,s.EmailAddress
,s.PhoneNumber
FROM [dbo].[SOURCE] s
WHERE PersonID BETWEEN 1 AND 5000
GO
INSERT INTO [dbo].[Destination]
(
FirstName
,LastName
,EmailAddress
,PhoneNumber
)
SELECT TOP(@BatchSize)
s.FirstName
,s.LastName
,s.EmailAddress
,s.PhoneNumber
FROM [dbo].[SOURCE] s
WHERE PersonID BETWEEN 5001 AND 10000
For super fast performance however, I’d recommend using SSIS.
Especially in SQL Server 2008. We recently transferred 17 million
records in 5 minutes with an SSIS package executed on the same server
as the two databases it transferred between.
SQL Server 2008 has made changes with regard to its
logging mechanism when inserting records. Previously, to do an insert
that was minimally logged, you would have to perform a SELECT.. INTO.
Now, you can perform a minimally logged insert if you can lock the
table you are inserting into. The example below shows an example of
this. The exception to this rule is if you have a clustered index on
the table AND the table is not empty. If the table is empty and you
acquire a table lock and you have a clustered index, it will be
minimally logged. However if you have data in the table, the insert
will be logged. Now if you have a non clustered index on a heap and
you acquire a table lock then only the non clustered index will be
logged. It is always better to drop indexes prior to inserting
records.
To determine the amount of logging you can use the following statement
SELECT * FROM ::fn_dblog(NULL, NULL)
Credit for above goes to Derek Dieter at SQL Server Planet.
If you're dead set on passing a table to your stored procedure, you can pass a table-valued parameter to a stored procedure in SQL Server 2008. You might have better luck with some other approaches suggested, like partitioning. Select distinct on a table with 20 billion rows might be part of the problem. I wonder if some very basic tuning wouldn't help, too:
Delete A
from master a
where exists (select 1 from TableB b where b.Col = a.Col)
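If you do go the table-valued parameter route mentioned above, the shape of it is roughly as follows. The type name, proc name, and Col datatype are hypothetical placeholders; the real column type should match Col, and the loop that walks TableB in ranges is up to you:
-- Table type holding one chunk of keys (datatype is an assumption)
CREATE TYPE dbo.ColKeyList AS TABLE (Col INT NOT NULL PRIMARY KEY);
GO
CREATE PROCEDURE dbo.ProcessChunk
    @Keys dbo.ColKeyList READONLY
AS
BEGIN
    SET NOCOUNT ON;

    -- Same delete as the original proc, but limited to the keys in this chunk
    DELETE a
    FROM   master a
    WHERE  EXISTS (SELECT 1 FROM @Keys k WHERE k.Col = a.Col);

    -- the corresponding chunked insert into master would follow here
END;
GO
-- Caller: pass one manageable chunk at a time
DECLARE @Chunk dbo.ColKeyList;
INSERT INTO @Chunk (Col)
SELECT DISTINCT TOP (100000) Col FROM TableB;
EXEC dbo.ProcessChunk @Keys = @Chunk;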

How can you track the progress of a SQL update?

Let's say I have an update such as:
UPDATE [db1].[sc1].[tb1]
SET c1 = LEFT(c1, LEN(c1)-1)
WHERE c1 like '%:'
This update is basically going to go through millions of rows and trim the colon if there is one in the c1 column.
How can I track how far along in the table this has progressed?
Thanks
This is sql server 2008
You can use the sysindexes table, which keeps track of how much an index has changed. Because this is done in an atomic update, it won't have a chance to recalc statistics, so rowmodctr will keep growing. This is sometimes not noticeable in small tables, but for millions, it will show.
-- create a test table
create table testtbl (id bigint identity primary key clustered, nv nvarchar(max))
-- fill it up with dummy data. 1/3 will have a trailing ':'
insert testtbl
select
convert(nvarchar(max), right(a.number*b.number+c.number,30)) +
case when a.number %3=1 then ':' else '' end
from master..spt_values a
inner join master..spt_values b on b.type='P'
inner join master..spt_values c on c.type='P'
where a.type='P' and a.number between 1 and 5
-- (20971520 row(s) affected)
update testtbl
set nv = left(nv, len(nv)-1)
where nv like '%:'
Now in another query window, run the below continuously and watch the rowmodctr going up and up. rowmodctr vs rows gives you an idea where you are up to, if you know where rowmodctr needs to end up being. In our case, it is 67% of just over 2 million.
select rows, rowmodctr
from sysindexes with (nolock)
where id = object_id('testtbl')
Please don't run (nolock) counting queries on the table itself while it is being updated.
Not really... you can query with the NOLOCK hint and the same WHERE clause, but this will take resources.
(It isn't an optimal query with a leading wildcard, of course...)
Database queries, particularly Data Manipulation Language (DML) statements, are atomic. That means the INSERT/UPDATE/DELETE either successfully occurs, or it doesn't. There's no means to see which record is being processed -- to the database, they have all been changed once the COMMIT is issued after the UPDATE. Even if you were able to view the records in process, by the time you saw the value, the query would have progressed on to other records.
The only means of knowing where you are in the process is to script the query to occur within a loop, so you can use a counter to know how many rows have been processed. It's common to do this so large data sets are periodically committed, minimizing the risk of a failure requiring the entire query to be run again.
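A sketch of that loop for the UPDATE in the question (the batch size is an arbitrary choice, and it assumes values end in at most one colon, since repeated passes would otherwise keep trimming a value ending in '::'):
DECLARE @BatchSize INT = 50000,
        @Rows      INT = 1,
        @Total     INT = 0,
        @Msg       NVARCHAR(200);

WHILE @Rows > 0
BEGIN
    -- Trim one batch of trailing colons
    UPDATE TOP (@BatchSize) [db1].[sc1].[tb1]
    SET    c1 = LEFT(c1, LEN(c1) - 1)
    WHERE  c1 LIKE '%:';

    SET @Rows  = @@ROWCOUNT;
    SET @Total = @Total + @Rows;

    -- Progress message, flushed to the client immediately
    SET @Msg = CAST(@Total AS NVARCHAR(20)) + N' rows updated so far';
    RAISERROR (@Msg, 0, 1) WITH NOWAIT;
END;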