I have a question on general database/SQL Server design:
There is a table with 3 million rows that is being accessed 24/7. I need to update all the records in the table. Can you give me some methods to do this so that user impact is minimized while I update the table?
Thanks in advance.
Normally you'd write a single update statement to update rows. But in your case you actually want to break it up.
http://www.sqlfiddle.com/#!3/c9c75/6
The link above is a working example of a common pattern. You don't want a batch size of 2; in practice you might want 100,000 or 25,000. You'll have to test on your system to determine the best balance between quick completion and low blocking.
declare @min int, @max int
select @min = min(user_id), @max = max(user_id)
from users

declare @tmp int
set @tmp = @min

declare @batchSize int
set @batchSize = 2

while @tmp <= @max
begin
    print 'from ' + cast(@tmp as varchar(10)) + ' to ' + cast(@tmp + @batchSize as varchar(10)) + ' starting (' + CONVERT(nvarchar(30), GETDATE(), 120) + ')'

    update users
    set name = name + '_foo'
    where user_id >= @tmp and user_id < @tmp + @batchSize and user_id <= @max

    set @tmp = @tmp + @batchSize

    print 'Done (' + CONVERT(nvarchar(30), GETDATE(), 120) + ')'

    WAITFOR DELAY '00:00:01'
end

update users
set name = name + '_foo'
where user_id > @max
We use patterns like this to update a user table about 10x the size of yours. With chunks of 100,000 rows it takes about an hour. Performance depends on your hardware, of course.
To minimize user impact, I would update only a certain number of records at a time. The right number depends more on your hardware than on anything else, in my opinion.
As with all things database, it depends. What is the load pattern (i.e., are users reading mainly from the end of the table)? How are new records added, if at all? What are your index fill factor settings and actual values? Will your update force any index recomputations? Can you split up the update to reduce locking? If so, do you need robust rollback ability in case of a failure? Are you setting the same value in every row, do you need a per-row calculation, or do you have a per-row source to match up?
Go through the table one row at a time using a loop or even a cursor. Make sure each update uses row locks.
If you don't have a way of identifying rows that still have to be updated, first create another table to hold the primary key and an update indicator, copy all primary key values into it, and then keep track of how far along you are in that table.
This is also going to be the slowest method. If you need it to go a little faster, update a few thousand rows at a time, still using ROWLOCK hints; a sketch of that approach follows.
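For concreteness, here is a minimal sketch of that keyset-driven batching, assuming a dbo.users table with an int primary key user_id and the same name update as the example above; the tracking table and the 5,000-row chunk size are illustrative, not part of the original answer.
-- Tracking table: one row per primary key, plus a flag for progress.
CREATE TABLE dbo.users_to_update (
    user_id int NOT NULL PRIMARY KEY,
    updated bit NOT NULL DEFAULT 0
);

INSERT INTO dbo.users_to_update (user_id)
SELECT user_id FROM dbo.users;

DECLARE @batch TABLE (user_id int PRIMARY KEY);

WHILE 1 = 1
BEGIN
    DELETE FROM @batch;

    -- Grab the next chunk of keys that still need updating.
    INSERT INTO @batch (user_id)
    SELECT TOP (5000) user_id
    FROM dbo.users_to_update
    WHERE updated = 0
    ORDER BY user_id;

    IF @@ROWCOUNT = 0 BREAK;

    -- Update only those rows, asking for row locks.
    UPDATE u
    SET u.name = u.name + '_foo'
    FROM dbo.users AS u WITH (ROWLOCK)
    INNER JOIN @batch AS b ON b.user_id = u.user_id;

    -- Record progress so the loop can be stopped and restarted safely.
    UPDATE t
    SET t.updated = 1
    FROM dbo.users_to_update AS t
    INNER JOIN @batch AS b ON b.user_id = t.user_id;
END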
I have a table with 3.4 million rows. I want to copy this whole data into another table.
I am performing this task using the below query:
select *
into new_items
from productDB.dbo.items
I need to know the best possible way to do this task.
I had the same problem, except I have a table with 2 billion rows, so the log file would grow to no end if I did this, even with the recovery model set to bulk-logged:
insert into newtable select * from oldtable
So I operate on blocks of data. This way, if the transfer is interrupted, you just restart it. Also, you don't need a log file as big as the table. You also seem to get less tempdb I/O; I'm not sure why.
set identity_insert newtable on

DECLARE @StartID bigint, @LastID bigint, @EndID bigint

select @StartID = isNull(max(id), 0) + 1
from newtable

select @LastID = max(ID)
from oldtable

while @StartID < @LastID
begin
    set @EndID = @StartID + 1000000

    insert into newtable (FIELDS, GO, HERE)
    select FIELDS, GO, HERE from oldtable (NOLOCK)
    where id BETWEEN @StartID AND @EndID

    set @StartID = @EndID + 1
end

set identity_insert newtable off
go
You might need to change how you deal with IDs; this works best if your table is clustered by ID.
If you are copying into a new table, the quickest way is probably what you have in your question, unless your rows are very large.
If your rows are very large, you may want to use the bulk insert functions in SQL Server. I think you can call them from C#.
Or you can first download that data into a text file, then bulk-copy (bcp) it. This has the additional benefit of allowing you to ignore keys, indexes etc.
Also try the Import/Export utility that comes with SQL Server Management Studio; I'm not sure whether it will be as fast as a straight bulk copy, but it should let you skip the intermediate step of writing out a flat file and just copy directly table to table, which might be a bit faster than your SELECT ... INTO statement.
I have been working with our DBA to copy an audit table with 240M rows to another database.
Using a simple select/insert created a huge tempdb file.
Using the Import/Export wizard worked, but it copied only 8M rows in 10 minutes
Creating a custom SSIS package and adjusting its settings copied 30M rows in 10 minutes
The SSIS package turned out to be the fastest and most efficient for our purposes
Earl
Here's another way of transferring large tables. I've just transferred 105 million rows between two servers using this. Quite quick too.
Right-click on the database and choose Tasks/Export Data.
A wizard will take you through the steps; choosing your SQL Server client as the data source and target will allow you to select the database and table(s) you wish to transfer.
For more information, see https://www.mssqltips.com/sqlservertutorial/202/simple-way-to-export-data-from-sql-server/
If it's a one-time import, the Import/Export utility in SSMS will probably be the easiest and fastest option. SSIS also seems to work better for importing large data sets than a straight INSERT.
BULK INSERT or BCP can also be used to import large record sets.
Another option would be to temporarily remove all indexes and constraints on the table you're importing into and add them back once the import process completes. A straight INSERT that previously failed might work in those cases.
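If you try that route, the index and constraint toggling is roughly as follows. This is only a sketch: the table name dbo.new_items comes from the question, while the index name IX_new_items_name is a placeholder.
-- Disable nonclustered indexes (one statement per index) and constraint checking.
-- Do not disable the clustered index: that would make the table unreadable.
ALTER INDEX IX_new_items_name ON dbo.new_items DISABLE;
ALTER TABLE dbo.new_items NOCHECK CONSTRAINT ALL;

-- ... run the big INSERT / bulk load here ...

-- Rebuild everything and re-enable constraint checking with validation.
ALTER INDEX ALL ON dbo.new_items REBUILD;
ALTER TABLE dbo.new_items WITH CHECK CHECK CONSTRAINT ALL;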
If you're dealing with timeouts or locking/blocking issues when going directly from one database to another, you might consider going from the source database into tempdb and then from tempdb into the other database, as it minimizes the effects of locking and blocking processes on either side. TempDB won't block or lock the source, and it won't hold up the destination.
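A minimal sketch of that two-hop pattern, using the table names from the question; the staging table name is illustrative, and the final INSERT assumes new_items already exists with matching columns.
-- Hop 1: stage the rows in tempdb so the destination isn't touched yet.
SELECT *
INTO tempdb.dbo.items_staging
FROM productDB.dbo.items;

-- Hop 2: load the destination from the staging copy, then clean up.
INSERT INTO new_items
SELECT * FROM tempdb.dbo.items_staging;

DROP TABLE tempdb.dbo.items_staging;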
Those are a few options to try.
-Eric Isaacs
Simple INSERT/SELECT procedures work great until the row count exceeds 1 million. I've watched the tempdb file explode trying to insert/select 20+ million rows. The simplest solution is SSIS, with the batch row size set to 5000 and the commit size set to 1000.
I know this is late, but if you are encountering semaphore timeouts then you can use row_number to set increments for your insert(s) using something like
INSERT INTO DestinationTable (column1, column2, etc)
SELECT column1, column2, etc
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY ID) AS RN, column1, column2, etc
    FROM SourceTable
) AS A
WHERE A.RN >= 1 AND A.RN <= 10000
The size of the log file will grow, so there is that to contend with. You get better performance if you disable constraints and indexes when inserting into an existing table, then re-enable the constraints and rebuild the indexes once the insert is complete.
I like the solution from @Mathieu Longtin to copy in batches, thereby minimising log file issues, and I created a version with OFFSET FETCH as suggested by @CervEd.
Others have suggested using the Import/Export Wizard or SSIS packages, but that's not always possible.
It's probably overkill for many but my solution includes some checks for record counts and outputs progress as well.
USE [MyDB]
GO

SET NOCOUNT ON;

DECLARE @intStart int = 1;
DECLARE @intCount int;
DECLARE @intFetch int = 10000;
DECLARE @strStatus VARCHAR(200);
DECLARE @intCopied int = 0;

SET @strStatus = CONVERT(VARCHAR(30), GETDATE()) + ' Getting count of HISTORY records currently in MyTable...';
RAISERROR (@strStatus, 10, 1) WITH NOWAIT;

SELECT @intCount = COUNT(*) FROM [dbo].MyTable WHERE IsHistory = 1;

SET @strStatus = CONVERT(VARCHAR(30), GETDATE()) + ' Count of HISTORY records currently in MyTable: ' + CONVERT(VARCHAR(20), @intCount);
RAISERROR (@strStatus, 10, 1) WITH NOWAIT; --(note: PRINT resets @@ROWCOUNT to 0 so using RAISERROR instead)

SET @strStatus = CONVERT(VARCHAR(30), GETDATE()) + ' Starting copy...';
RAISERROR (@strStatus, 10, 1) WITH NOWAIT;

WHILE @intStart < @intCount
BEGIN
    INSERT INTO [dbo].[MyTable_History] (
        [PK1], [PK2], [PK3], [Data1], [Data2])
    SELECT
        [PK1], [PK2], [PK3], [Data1], [Data2]
    FROM [MyDB].[dbo].[MyTable]
    WHERE IsHistory = 1
    ORDER BY
        [PK1], [PK2], [PK3]
    OFFSET @intStart - 1 ROWS
    FETCH NEXT @intFetch ROWS ONLY;

    SET @intCopied = @intCopied + @@ROWCOUNT;

    SET @strStatus = CONVERT(VARCHAR(30), GETDATE()) + ' Records copied so far: ' + CONVERT(VARCHAR(20), @intCopied);
    RAISERROR (@strStatus, 10, 1) WITH NOWAIT;

    SET @intStart = @intStart + @intFetch;
END

--Check the record count is correct.
IF @intCopied = @intCount
BEGIN
    SET @strStatus = CONVERT(VARCHAR(30), GETDATE()) + ' Correct record count.';
    RAISERROR (@strStatus, 10, 1) WITH NOWAIT;
END
ELSE
BEGIN
    SET @strStatus = CONVERT(VARCHAR(30), GETDATE()) + ' Only ' + CONVERT(VARCHAR(20), @intCopied) + ' records were copied, expected: ' + CONVERT(VARCHAR(20), @intCount);
    RAISERROR (@strStatus, 10, 1) WITH NOWAIT;
END
GO
If your focus is archiving (DW) and you are dealing with a VLDB with 100+ partitioned tables, and you want to isolate most of this resource-intensive work on a non-production server, here is a suggestion (OLTP -> DW):
1) Use backup/restore to get the data onto the archive server (so now, on the archive/DW server, you will have a Stage and a Target database)
2) Stage database: use a partition switch to move data to the corresponding stage table
3) Use SSIS to transfer data from the stage database to the target database, for each staged table on both sides
4) Target database: use a partition switch on the target database to move data from stage to the base table (see the sketch after this list)
Hope this helps.
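For steps 2 and 4, a partition switch is a metadata-only operation, roughly like the sketch below. The table names and partition number are assumptions for illustration; the stage table must have an identical structure and sit on the same filegroup as the partition being switched.
-- Move partition 3 of the base table into an empty, identically structured stage table
ALTER TABLE dbo.FactSales SWITCH PARTITION 3 TO dbo.FactSales_Stage PARTITION 3;

-- ...and on the target side, switch the loaded stage partition into the base table
ALTER TABLE dbo.FactSales_Stage SWITCH PARTITION 3 TO dbo.FactSales PARTITION 3;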
select * into new_items from productDB.dbo.items
That pretty much is it. This is the most efficient way to do it.
I want to delete 10GB (1%) of data from a 1TB table. I have come across several articles about deleting large amounts of data from a huge table, but didn't find much on deleting a smaller percentage of data from a huge table.
Additional details:
Trying to delete bot data from the visits table. The filter condition is a combination of fields... ip in (list of ips about 20 of them) and useragent like '%SOMETHING%'
useragent is varchar(1024)
The data can be old or new. I can't use a date filter.
Here is a batch delete in chunks that I use regularly. Perhaps it would give you some ideas on how to approach your need. I create a stored proc and call the proc from a SQL Agent Job. I generally schedule it to allow a transaction log backup between executions so the log does not grow too large. You could always just run it interactively if you wish.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

CREATE PROC [DBA_Delete_YourTableName] AS

SET NOCOUNT ON;

---------------------------------------------------------
DECLARE @DaysHistoryToKeep INT
SET @DaysHistoryToKeep = 90
IF @DaysHistoryToKeep < 30
    SET @DaysHistoryToKeep = 30
---------------------------------------------------------

DECLARE @continue INT
DECLARE @rowcount INT
DECLARE @loopCount INT
DECLARE @MaxLoops INT
DECLARE @TotalRows BIGINT
DECLARE @PurgeThruDate DATETIME

SET @PurgeThruDate = DATEADD(dd, (-1) * (@DaysHistoryToKeep + 1), GETDATE())
SET @MaxLoops = 100
SET @continue = 1
SET @loopCount = 0

SELECT @TotalRows = (SELECT COUNT(*) FROM YourTableName (NOLOCK) WHERE CREATEDDATETIME < @PurgeThruDate)
PRINT 'Total Rows = ' + CAST(@TotalRows AS VARCHAR(20))
PRINT ''

WHILE @continue = 1
BEGIN
    SET @loopCount = @loopCount + 1
    PRINT 'Loop # ' + CAST(@loopCount AS VARCHAR(10))
    PRINT CONVERT(VARCHAR(20), GETDATE(), 120)

    BEGIN TRANSACTION
        DELETE TOP (4500) YourTableName WHERE CREATEDDATETIME < @PurgeThruDate
        SET @rowcount = @@rowcount
    COMMIT

    PRINT 'Rows Deleted: ' + CAST(@rowcount AS VARCHAR(10))
    PRINT CONVERT(VARCHAR(20), GETDATE(), 120)
    PRINT ''

    IF @rowcount = 0 OR @loopCount >= @MaxLoops
    BEGIN
        SET @continue = 0
    END
END

SELECT @TotalRows = (SELECT COUNT(*) FROM YourTableName (NOLOCK) WHERE CREATEDDATETIME < @PurgeThruDate)
PRINT 'Total Rows Remaining = ' + CAST(@TotalRows AS VARCHAR(20))
PRINT ''

GO
The filter condition is ... ip in (list of ips about 20 of them) and useragent like '%SOMETHING%'
Regarding table size, it's important to touch as few rows as possible while executing the delete.
I imagine on a table that size you already have an index on the ip column. It might help (or not) to put your list of 20 or so ips in a table instead of in an IN clause, especially if they're parameters. I'd look at the query plan to see.
I hope useragent like '%SOMETHING%' is usually true; otherwise it's an expensive test because SQL Server has to examine every row for an eligible ip. If not, a redesign to allow the query to avoid like would probably be beneficial.
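For illustration, staging the address list in its own table might look like the sketch below. The table dbo.BotIPs and the sample addresses are made up; dbo.visits, ip, and useragent come from the question. Note this still deletes everything in one transaction; the chunked approach discussed below is usually kinder to the server.
CREATE TABLE dbo.BotIPs (ip varchar(45) NOT NULL PRIMARY KEY);

INSERT INTO dbo.BotIPs (ip)
VALUES ('203.0.113.10'), ('203.0.113.11');   -- ...and the rest of the ~20 addresses

-- The delete expressed as a join against the ip list.
DELETE v
FROM dbo.visits AS v
INNER JOIN dbo.BotIPs AS b ON b.ip = v.ip
WHERE v.useragent LIKE '%SOMETHING%';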
[D]eleting a smaller percentage isn't really relevant. Using selective search criteria is (per above), as is the size of the delete transaction in absolute terms. By definition, the size of the deletion in terms of rows and row size determines the size of the transaction. Very large transactions can push against machine resources. Breaking them up into smaller ones can yield better performance in such cases.
The last server I used had 0.25 TB of RAM and was comfortable deleting 1 million rows at a time, but not 10 million. Your mileage will vary; you have to try, and observe, to see.
How much you're willing to tax the machine will depend on what else is (or needs to be able to) run at the same time. The way you break up one logical action -- delete all rows where [condition] -- into "chunks" also depends on what you want the database to look like while the delete procedure is in process, when some chunks are deleted and others remain present.
If you do decide to break it into chunks, I recommend not using a fixed number of rows and a TOP(n) syntax, because that's the least logical solution. Unless you use order by, you're leaving to the server to choose arbitrarily which N rows to delete. If you do use order by, you're requiring the server to sort the result before starting the delete, possibly several times over the whole run. Bleh!
Instead, find some logical subset of rows, ideally distinguishable along the clustered index, that falls beneath your machine's threshold of an acceptable number of rows to delete at one time. Loop over that set. In your case, I would be tempted to iterate over the set of ip values in the in clause. Instead of delete ... where ip in (...), you get (roughly) for each ip: delete ... where ip = @ip.
The advantage of the latter approach is that you always know where the database stands. If you kill the procedure or it gets rolled back partway through its iteration, you can examine the database to see which ips still remain (or whatever criteria you end up using). You avoid any kind of pathological behavior whereby some query gets a partial result because part of your selection criteria (determined by the server alone) is still present and other parts are deleted. In thinking about the problem, you can say, "I'm unable to delete ip 192.168.0.1 because ...", without wondering which portion has already been removed.
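A rough sketch of that per-ip loop, reusing the illustrative dbo.BotIPs table from the earlier sketch; one delete per address keeps every chunk logically identifiable.
DECLARE @ip varchar(45);

DECLARE ip_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT ip FROM dbo.BotIPs;

OPEN ip_cursor;
FETCH NEXT FROM ip_cursor INTO @ip;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- One logical chunk per ip: if the run is interrupted, it is obvious
    -- which addresses have been processed and which remain.
    DELETE FROM dbo.visits
    WHERE ip = @ip
      AND useragent LIKE '%SOMETHING%';

    FETCH NEXT FROM ip_cursor INTO @ip;
END

CLOSE ip_cursor;
DEALLOCATE ip_cursor;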
In sum, I recommend:
Give the server every chance to touch only the rows you want to affect, and verify that's what it will do.
Construct your delete routine, if you need one, to delete logical chunks, so you can reason about the state of the database at any time.
I have a large table with 100,000,000 rows. I'd like to select every n'th row from the table. My first instinct is to use something like this:
SELECT id,name FROM table WHERE id%125000=0
to retrieve an even spread of 800 rows (id is a clustered index).
This technique works fine on smaller data sets, but with my larger table the query takes 2.5 minutes. I assume this is because the modulus operation is applied to every row. Is there a more optimal method of row skipping?
Your query assumes that the IDs are contiguous (and they probably aren't, without you realizing it). Anyway, you should generate the IDs yourself:
select *
from T
where ID in (0, 250000*1, 250000*2, ...)
Maybe you need a TVP to send all the IDs because there are so many. Or you can produce the IDs on the server in T-SQL, a SQLCLR function, or a numbers table.
This technique allows you to perform index seeks and will be the fastest you can possibly produce. It reads the minimal amount of data possible.
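As an illustration of the numbers-table idea, the target IDs can be generated on the fly and joined to the table, which gives the optimizer a chance to do one index seek per ID. This is a sketch only: dbo.BigTable stands in for the question's table (TABLE is a reserved word), and it assumes the even 125,000 spacing from the question.
;WITH tally AS (
    -- 800 sequence numbers built from system views; any numbers table works here.
    SELECT TOP (800) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
SELECT t.id, t.name
FROM tally
INNER JOIN dbo.BigTable AS t
    ON t.id = tally.n * 125000;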
Modulo is not SARGable. SQL Server could support this if Microsoft wanted it, but this is an exotic use case. They will never make modulo SARGable and they shouldn't.
The time is not going into the modulus operation itself, but rather into just reading 124,999 unnecessary rows for every row that you actually want (i.e., the Table Scan or Clustered Index Scan).
Just about the only way to speed up a query like this is something that seems at first illogical: Add an extra non-Clustered index on just that column ([ID]). Additionally, you may have to add an Index Hint to force it to use that index. And finally, it may not actually make it faster, though for a modulus of 125,000+, it should be (though it'll never be truly fast).
If your IDs are not necessarily contiguous (any deleted rows will pretty much cause this) and you really do need exactly every Nth row, in ID order, then you can still use the approach above, but you will have to resequence the IDs for the modulo operation using ROW_NUMBER() OVER (ORDER BY ID) in the query.
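A sketch of that resequencing, again using dbo.BigTable as a stand-in name; note it still has to number every row, so it reads the whole index.
;WITH numbered AS (
    SELECT id, name,
           ROW_NUMBER() OVER (ORDER BY id) AS rn   -- resequence so gaps in id don't matter
    FROM dbo.BigTable
)
SELECT id, name
FROM numbered
WHERE rn % 125000 = 0;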
If id is in an index, then I am thinking of something along these lines:
with ids as (
select 1 as id
union all
select id + 125000
from ids
where id <= 100000000
)
select ids.id,
(select name from table t where t.id = ids.id) as name
from ids
option (MAXRECURSION 1000);
I think this formulation will use the index on table.
EDIT:
As I think about this approach, you can actually use it to get random ids from the table, rather than just evenly spaced ones:
with ids as (
select 1 as cnt,
ABS(CONVERT(BIGINT,CONVERT(BINARY(8), NEWID()))) % 100000000 as id
union all
select cnt + 1, ABS(CONVERT(BIGINT,CONVERT(BINARY(8), NEWID()))) % 100000000
from ids
where cnt < 800
)
select ids.id,
(select name from table t where t.id = ids.id) as name
from ids
option (MAXRECURSION 1000);
The code for the actual random number generator came from here.
EDIT:
Due to quirks in SQL Server, you can still get non-contiguous ids, even in your scenario. This accepted answer explains the cause. In short, identity values are not allocated one at a time, but rather in groups, and the server can fail in a way that causes even unused values to be skipped.
One reason I wanted to do the random sampling was to help avoid this problem. Presumably, the above situation is rather rare on most systems. You can use the random sampling to generate, say, 900 ids. From these, you should be able to find 800 that are actually available for your sample.
DECLARE @i int, @max int, @query VARCHAR(MAX)   -- MAX: ~800 comma-separated ids won't fit in 1000 characters
SET @i = 0
SET @max = (SELECT max(id) / 125000 FROM Table1)
SET @query = 'SELECT id, name FROM Table1 WHERE id in ('
WHILE @i <= @max
BEGIN
    IF @i > 0 SET @query = @query + ','
    SET @query = @query + CAST(@i * 125000 as varchar(12))
    SET @i = @i + 1
END
SET @query = @query + ')'
EXEC(@query)
EDIT :
To avoid any "holes" in a non-contiguous ID situation, you can try something like this:
DECLARE @i int, @start int, @id int, @max int, @query VARCHAR(MAX)
SET @i = 0
SET @max = (SELECT max(id) / 125000 FROM Table1)
SET @query = 'SELECT id, name FROM Table1 WHERE id in ('
WHILE @i <= @max
BEGIN
    SET @start = @i * 125000
    SET @id = (SELECT TOP 1 id FROM Table1 WHERE id >= @start ORDER BY id ASC)
    IF @i > 0 SET @query = @query + ','
    SET @query = @query + CAST(@id as VARCHAR(12))
    SET @i = @i + 1
END
SET @query = @query + ')'
EXEC(@query)
I have a couple of large tables (188m and 144m rows) I need to populate from views, but each view contains a few hundred million rows (pulling together pseudo-dimensionally modelled data into a flat form). The keys on each table are composite, spanning over 50 bytes of columns. If the data were in tables, I could always think about using sp_rename to swap in the new table, but that isn't really an option.
If I do a single INSERT operation, the process uses a huge amount of transaction log space, typically filling it up and prompting a bunch of hassle with the DBAs. (And yes, this is probably a job the DBAs should handle/design/architect.)
I can use SSIS and stream the data into the destination table with batch commits (but this does require the data to be transmitted over the network, since we are not allowed to run SSIS packages on the server).
Is there any approach other than dividing the process into multiple INSERT operations, using some kind of key to distribute the rows into different batches, and looping?
Does the view have ANY kind of unique identifier / candidate key? If so, you could select those rows into a working table using:
SELECT key_columns INTO dbo.temp FROM dbo.HugeView;
(If it makes sense, maybe put this table into a different database, perhaps with SIMPLE recovery model, to prevent the log activity from interfering with your primary database. This should generate much less log anyway, and you can free up the space in the other database before you resume, in case the problem is that you have inadequate disk space all around.)
Then you can do something like this, inserting 10,000 rows at a time, and backing up the log in between:
SET NOCOUNT ON;

DECLARE
    @batchsize INT,
    @ctr INT,
    @rc INT;

SELECT
    @batchsize = 10000,
    @ctr = 0;

WHILE 1 = 1
BEGIN
    WITH x AS
    (
        SELECT key_column, rn = ROW_NUMBER() OVER (ORDER BY key_column)
        FROM dbo.temp
    )
    INSERT dbo.PrimaryTable(a, b, c, etc.)
    SELECT v.a, v.b, v.c, etc.
    FROM x
    INNER JOIN dbo.HugeView AS v
        ON v.key_column = x.key_column
    WHERE x.rn > @batchsize * @ctr
      AND x.rn <= @batchsize * (@ctr + 1);

    IF @@ROWCOUNT = 0
        BREAK;

    BACKUP LOG PrimaryDB TO DISK = 'C:\db.bak' WITH INIT;

    SET @ctr = @ctr + 1;
END
That's all off the top of my head, so don't cut/paste/run, but I think the general idea is there. For more details (and why I backup log / checkpoint inside the loop), see this post on sqlperformance.com:
Break large delete operations into chunks
Note that if you are taking regular database and log backups, you will probably want to take a full backup afterwards to start your log chain over again.
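That closing full backup is just the standard statement; the database name and path below follow the illustrative ones used in the loop above.
BACKUP DATABASE PrimaryDB TO DISK = 'C:\PrimaryDB_full.bak' WITH INIT;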
You could partition your data and insert it in a cursor loop. That would be nearly the same as SSIS batch inserting, but it runs on your server.
create cursor ....
select YEAR(DateCol), MONTH(DateCol) from whatever
while ....
insert into yourtable(...)
select * from whatever
where YEAR(DateCol) = year and MONTH(DateCol) = month
end
I know this is an old thread, but I made a generic version of Arthur's cursor solution:
--Split a batch up into chunks using a cursor.
--This method can be used for most any large table with some modifications
--It could also be refined further with an @Day variable (for example)

DECLARE @Year INT
DECLARE @Month INT

DECLARE BatchingCursor CURSOR FOR
    SELECT DISTINCT YEAR(<SomeDateField>), MONTH(<SomeDateField>)
    FROM <Sometable>;

OPEN BatchingCursor;
FETCH NEXT FROM BatchingCursor INTO @Year, @Month;
WHILE @@FETCH_STATUS = 0
BEGIN
    --All logic goes in here
    --Any select statements from <Sometable> need to be suffixed with:
    --WHERE Year(<SomeDateField>) = @Year AND Month(<SomeDateField>) = @Month

    FETCH NEXT FROM BatchingCursor INTO @Year, @Month;
END;
CLOSE BatchingCursor;
DEALLOCATE BatchingCursor;
GO
This solved the problem on loads of our large tables.
There is no pixie dust, you know that.
Without knowing specifics about the actual schema being transferred, a generic solution would be exactly as you describe it: divide the processing into multiple inserts and keep track of the key(s). This is sort of pseudo-code T-SQL:
create table currentKeys ([table] sysname not null primary key, [key] sql_variant not null);
go

declare @keysInserted table ([key] sql_variant);
declare @key sql_variant;

begin transaction
while (1 = 1)
begin
    select @key = [key] from currentKeys where [table] = '<target>';

    insert into <target> (...)
    output inserted.[key] into @keysInserted ([key])
    select top (<batchsize>) ... from <source>
    where [key] > @key
    order by [key];

    if (0 = @@rowcount)
        break;

    update currentKeys
    set [key] = (select max([key]) from @keysInserted)
    where [table] = '<target>';

    commit;
    delete from @keysInserted;
    set @key = null;
    begin transaction;
end
commit
It would get more complicated if you want to allow for parallel batches and partition the keys.
You could use the BCP command to load the data and use the batch size parameter:
http://msdn.microsoft.com/en-us/library/ms162802.aspx
Two-step process:
BCP OUT the data from the views into text files
BCP IN the data from the text files into the tables, using the batch size parameter (a rough sketch follows this list)
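Roughly, that round trip looks like the following; the server name, file path, column list, trusted-connection auth (-T), and the 10,000-row batch size are assumptions for illustration.
rem Export the view to a native-format file, then load it in 10,000-row batches.
bcp "SELECT a, b, c FROM SourceDB.dbo.HugeView" queryout C:\temp\hugeview.dat -n -T -S MyServer
bcp TargetDB.dbo.PrimaryTable in C:\temp\hugeview.dat -n -T -S MyServer -b 10000
The -b value makes bcp commit every 10,000 rows as its own transaction, which keeps the log impact of any single batch bounded.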
This looks like a job for good ol' BCP.