INSERT 150TB data into the table [duplicate] - sql

I have a table with 3.4 million rows. I want to copy this whole data into another table.
I am performing this task using the below query:
select *
into new_items
from productDB.dbo.items
I need to know the best possible way to do this task.

I had the same problem, except I have a table with 2 billion rows, so the log file would grow without bound if I did this, even with the recovery model set to BULK_LOGGED:
insert into newtable select * from oldtable
So I operate on blocks of data. This way, if the transfer is interrupted, you just restart it. Also, you don't need a log file as big as the table. You also seem to get less tempdb I/O, not sure why.
set identity_insert newtable on
DECLARE @StartID bigint, @LastID bigint, @EndID bigint
-- resume after the highest ID already copied
select @StartID = isNull(max(id),0) + 1
from newtable
select @LastID = max(ID)
from oldtable
while @StartID <= @LastID
begin
    set @EndID = @StartID + 1000000
    insert into newtable (FIELDS,GO,HERE)
    select FIELDS,GO,HERE from oldtable (NOLOCK)
    where id BETWEEN @StartID AND @EndID
    set @StartID = @EndID + 1
end
set identity_insert newtable off
go
You might need to change how you deal with IDs; this works best if your table is clustered by ID.
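To try the restart behaviour without a SQL Server instance, here is a minimal sketch of the same ID-range batching idea using Python's built-in sqlite3 module as a stand-in. The `payload` column, the tiny batch size and the function name `copy_in_id_ranges` are assumptions for illustration only; on SQL Server you would keep the T-SQL loop.

```python
import sqlite3

def copy_in_id_ranges(conn, batch_size):
    """Copy oldtable into newtable one ID range at a time; returns the batch count."""
    cur = conn.cursor()
    # Resume after the highest ID already copied, so a rerun continues where it stopped.
    start_id = cur.execute("SELECT COALESCE(MAX(id), 0) + 1 FROM newtable").fetchone()[0]
    last_id = cur.execute("SELECT COALESCE(MAX(id), 0) FROM oldtable").fetchone()[0]
    batches = 0
    while start_id <= last_id:
        end_id = start_id + batch_size - 1  # BETWEEN is inclusive on both ends
        cur.execute(
            "INSERT INTO newtable (id, payload) "
            "SELECT id, payload FROM oldtable WHERE id BETWEEN ? AND ?",
            (start_id, end_id),
        )
        conn.commit()  # one short transaction per range
        start_id = end_id + 1
        batches += 1
    return batches

# Hypothetical demo data: 25 rows copied in ranges of 10.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oldtable (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE newtable (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO oldtable VALUES (?, ?)", [(i, f"row {i}") for i in range(1, 26)])
conn.commit()

first_run = copy_in_id_ranges(conn, batch_size=10)
second_run = copy_in_id_ranges(conn, batch_size=10)  # nothing left to copy
copied = conn.execute("SELECT COUNT(*) FROM newtable").fetchone()[0]
```

Running the function a second time does no work, because the starting ID is read back from the destination table. That is the property that makes an interrupted transfer safe to restart.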

If you are copying into a new table, the quickest way is probably what you have in your question, unless your rows are very large.
If your rows are very large, you may want to use the bulk insert functions in SQL Server. I think you can call them from C#.
Or you can first download that data into a text file, then bulk-copy (bcp) it. This has the additional benefit of allowing you to ignore keys, indexes etc.
Also try the Import/Export utility that comes with SQL Server Management Studio; not sure whether it will be as fast as a straight bulk-copy, but it should allow you to skip the intermediate step of writing out a flat file and just copy directly table-to-table, which might be a bit faster than your SELECT INTO statement.

I have been working with our DBA to copy an audit table with 240M rows to another database.
Using a simple select/insert created a huge tempdb file.
Using the Import/Export wizard worked, but copied only 8M rows in 10 minutes.
Creating a custom SSIS package and adjusting its settings copied 30M rows in 10 minutes.
The SSIS package turned out to be the fastest and most efficient for our purposes.
Earl

Here's another way of transferring large tables. I've just transferred 105 million rows between two servers using this. Quite quick too.
Right-click on the database and choose Tasks/Export Data.
A wizard will take you through the steps; choosing your SQL Server client as both the data source and the target will allow you to select the database and table(s) you wish to transfer.
For more information, see https://www.mssqltips.com/sqlservertutorial/202/simple-way-to-export-data-from-sql-server/

If it's a one-time import, the Import/Export utility in SSMS will probably be the easiest and fastest option. SSIS also seems to work better for importing large data sets than a straight INSERT.
BULK INSERT or BCP can also be used to import large record sets.
Another option would be to temporarily remove all indexes and constraints on the table you're importing into and add them back once the import process completes. A straight INSERT that previously failed might work in those cases.
If you're dealing with timeouts or locking/blocking issues when going directly from one database to another, you might consider going from one db into TEMPDB and then going from TEMPDB into the other database as it minimizes the effects of locking and blocking processes on either side. TempDB won't block or lock the source and it won't hold up the destination.
Those are a few options to try.
-Eric Isaacs

Simple Insert/Select sp's work great until the row count exceeds 1 mil. I've watched tempdb file explode trying to insert/select 20 mil + rows. The simplest solution is SSIS setting the batch row size buffer to 5000 and commit size buffer to 1000.

I know this is late, but if you are encountering semaphore timeouts then you can use row_number to set increments for your insert(s) using something like
INSERT INTO DestinationTable (column1, column2, etc)
SELECT column1, column2, etc
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY ID) AS RN, column1, column2, etc
FROM SourceTable ) AS A
WHERE A.RN >= 1 AND A.RN <= 10000
The size of the log file will grow, so there is that to contend with. You get better performance if you disable constraints and indexes when inserting into an existing table, then enable the constraints and rebuild the indexes once the insertion is complete.

I like the solution from @Mathieu Longtin to copy in batches, thereby minimising log file issues, and created a version with OFFSET FETCH as suggested by @CervEd.
Others have suggested using the Import/Export Wizard or SSIS packages, but that's not always possible.
It's probably overkill for many but my solution includes some checks for record counts and outputs progress as well.
USE [MyDB]
GO
SET NOCOUNT ON;
DECLARE @intStart int = 1;
DECLARE @intCount int;
DECLARE @intFetch int = 10000;
DECLARE @strStatus VARCHAR(200);
DECLARE @intCopied int = 0;
SET @strStatus = CONVERT(VARCHAR(30), GETDATE()) + ' Getting count of HISTORY records currently in MyTable...';
RAISERROR (@strStatus, 10, 1) WITH NOWAIT;
SELECT @intCount = COUNT(*) FROM [dbo].MyTable WHERE IsHistory = 1;
SET @strStatus = CONVERT(VARCHAR(30), GETDATE()) + ' Count of HISTORY records currently in MyTable: ' + CONVERT(VARCHAR(20), @intCount);
RAISERROR (@strStatus, 10, 1) WITH NOWAIT; --(note: PRINT resets @@ROWCOUNT to 0 so using RAISERROR instead)
SET @strStatus = CONVERT(VARCHAR(30), GETDATE()) + ' Starting copy...';
RAISERROR (@strStatus, 10, 1) WITH NOWAIT;
--Use <= so a final batch that starts exactly on the last row is not skipped.
WHILE @intStart <= @intCount
BEGIN
    INSERT INTO [dbo].[MyTable_History] (
        [PK1], [PK2], [PK3], [Data1], [Data2])
    SELECT
        [PK1], [PK2], [PK3], [Data1], [Data2]
    FROM [MyDB].[dbo].[MyTable]
    WHERE IsHistory = 1
    ORDER BY
        [PK1], [PK2], [PK3]
    OFFSET @intStart - 1 ROWS
    FETCH NEXT @intFetch ROWS ONLY;
    SET @intCopied = @intCopied + @@ROWCOUNT;
    SET @strStatus = CONVERT(VARCHAR(30), GETDATE()) + ' Records copied so far: ' + CONVERT(VARCHAR(20), @intCopied);
    RAISERROR (@strStatus, 10, 1) WITH NOWAIT;
    SET @intStart = @intStart + @intFetch;
END
--Check the record count is correct.
IF @intCopied = @intCount
BEGIN
    SET @strStatus = CONVERT(VARCHAR(30), GETDATE()) + ' Correct record count.';
    RAISERROR (@strStatus, 10, 1) WITH NOWAIT;
END
ELSE
BEGIN
    SET @strStatus = CONVERT(VARCHAR(30), GETDATE()) + ' Only ' + CONVERT(VARCHAR(20), @intCopied) + ' records were copied, expected: ' + CONVERT(VARCHAR(20), @intCount);
    RAISERROR (@strStatus, 10, 1) WITH NOWAIT;
END
GO

If your focus is archiving (DW), you are dealing with a VLDB with 100+ partitioned tables, and you want to isolate most of this resource-intensive work on a non-production server, away from OLTP, here is a suggestion (OLTP -> DW):
1) Use backup / Restore to get the data onto the archive server (so now, on Archive or DW you will have Stage and Target database)
2) Stage database: Use partition switch to move data to corresponding stage table
3) Use SSIS to transfer data from staged database to target database for each staged table on both sides
4) Target database: Use partition switch on target database to move data from stage to base table
Hope this helps.

select * into new_items from productDB.dbo.items
That pretty much is it. This is the most efficient way to do it.

Related

How to force a running t-sql query (half done) to commit?

I have a database on SQL Server 2008 R2.
A delete query against 400 million records has been running on that database for 4 days, but I need to reboot the machine. How can I force it to commit whatever has been deleted so far? I want to discard the data that the running query has deleted so far.
The problem is that the query is still running and will not complete before the server reboot.
Note: I have not set any isolation level or BEGIN/END TRANSACTION for the query. The query is running in SSMS.
If the machine reboots or I cancel the query, the database will go into recovery mode and spend the next 2 days recovering; then I will need to re-run the delete, which will cost me another 4 days.
I really appreciate any suggestion / help or guidance in this.
I am novice user of sql server.
Thanks in Advance
Regards
There is no way to stop SQL Server from trying to bring the database into a transactionally consistent state. Every single statement is implicitly a transaction itself (if not part of an outer transaction) and executes either all or nothing. So if you cancel the query, disconnect, or reboot the server, SQL Server will use the transaction log to write the original values back to the updated data pages.
Next time you delete so many rows, don't do it all at once. Divide the job into smaller chunks (I always use 5,000 as a magic number, meaning I delete 5,000 rows at a time in the loop) to minimize transaction log use and locking.
set rowcount 5000
delete table
while @@rowcount = 5000
delete table
set rowcount 0
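The chunked-delete loop above can be tried out in isolation. The following is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server; the `big_table` table, its `obsolete` flag and the function name `delete_in_chunks` are assumptions for illustration. SQLite has no SET ROWCOUNT, so each chunk is selected with a LIMIT subquery instead.

```python
import sqlite3

def delete_in_chunks(conn, chunk_size):
    """Delete rows flagged obsolete, committing after every chunk; returns rows deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM big_table WHERE id IN ("
            "SELECT id FROM big_table WHERE obsolete = 1 LIMIT ?)",
            (chunk_size,),
        )
        conn.commit()  # chunks already committed are not undone by a later cancel
        total += cur.rowcount
        if cur.rowcount < chunk_size:  # a short chunk means nothing is left
            break
    return total

# Hypothetical demo data: 100 rows, every odd id flagged obsolete.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, obsolete INTEGER)")
conn.executemany("INSERT INTO big_table VALUES (?, ?)", [(i, i % 2) for i in range(1, 101)])
conn.commit()

deleted = delete_in_chunks(conn, chunk_size=15)
remaining = conn.execute("SELECT COUNT(*) FROM big_table").fetchone()[0]
```

Because every chunk is its own transaction, cancelling the loop only rolls back the chunk in flight, not the days of work before it.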
If you are deleting that many rows you may have a better time with truncate. Truncate deletes all rows from the table very efficiently. However, I'm assuming that you would like to keep some of the records in the table. The stored procedure below backs up the data you would like to keep into a temp table then truncates then re-inserts the records that were saved. This can clean a huge table very quickly.
Note that truncate doesn't play well with Foreign Key constraints so you may need to drop those then recreate them after cleaned.
CREATE PROCEDURE [dbo].[deleteTableFast] (
@TableName VARCHAR(100),
@WhereClause varchar(1000))
AS
BEGIN
-- input:
-- table name: is the table to use
-- where clause: is the where clause of the records to KEEP
declare @tempTableName varchar(100);
set @tempTableName = @TableName+'_temp_to_truncate';
-- error checking
if exists (SELECT [Table_Name] FROM Information_Schema.COLUMNS WHERE [TABLE_NAME] =(@tempTableName)) begin
print 'ERROR: already temp table ... exiting'
return
end
if not exists (SELECT [Table_Name] FROM Information_Schema.COLUMNS WHERE [TABLE_NAME] =(@TableName)) begin
print 'ERROR: table does not exist ... exiting'
return
end
-- save wanted records via a temp table to be able to truncate
exec ('select * into '+@tempTableName+' from '+@TableName+' WHERE '+@WhereClause);
exec ('truncate table '+@TableName);
exec ('insert into '+@TableName+' select * from '+@tempTableName);
exec ('drop table '+@tempTableName);
end
GO
You need to understand the D (Durability) in ACID before you can understand why the database goes into recovery mode.
Generally speaking, you should avoid long-running SQL if possible. Long-running SQL means more lock time on resources, a larger transaction log, and a huge rollback time when it fails.
Consider dividing your task by some ID or time range. For example, if you want to insert a large volume of data from TableSrc into TableTarget, you can write a query like
DECLARE @BatchCount INT = 1000;
DECLARE @Id INT = 0;
DECLARE @Max INT = ...; -- e.g. the largest PrimaryKey value
WHILE @Id <= @Max
BEGIN
    INSERT INTO TableTarget
    SELECT * FROM TableSrc
    WHERE PrimaryKey >= @Id AND PrimaryKey < @Id + @BatchCount;
    SET @Id = @Id + @BatchCount;
END
It's uglier, more code, and more error-prone. But it's the only way I know to deal with huge data volumes.

Updating a large table and minimizing user impact

I have a question on general database/sql server designing:
There is a table with 3 million rows that is being accessed 24x7. I need to update all the records in the table. Can you give me some methods to do this so that the user impact is minimized while I update my table?
Thanks in advance.
Normally you'd write a single update statement to update rows. But in your case you actually want to break it up.
http://www.sqlfiddle.com/#!3/c9c75/6 is a working example of a common pattern. You don't want a batch size of 2; maybe you want 100,000 or 25,000. You'll have to test on your system to determine the best balance between quick completion and low blocking.
declare @min int, @max int
select @min = min(user_id), @max = max(user_id)
from users
declare @tmp int
set @tmp = @min
declare @batchSize int
set @batchSize = 2
while @tmp <= @max
begin
    print 'from ' + Cast(@tmp as varchar(10)) + ' to ' + cast(@tmp + @batchSize as varchar(10)) + ' starting (' + CONVERT(nvarchar(30), GETDATE(), 120) + ')'
    update users
    set name = name + '_foo'
    where user_id >= @tmp and user_id < @tmp + @batchSize and user_id <= @max
    set @tmp = @tmp + @batchSize
    print 'Done (' + CONVERT(nvarchar(30), GETDATE(), 120) + ')'
    WAITFOR DELAY '00:00:01'
end
-- pick up any rows added after @max was read
update users
set name = name + '_foo'
where user_id > @max
We use patterns like this to update a user table about 10x your table size. With 100,000 chunks it takes about an hour. Performance depends on your hardware of course.
To minimally impact users, I would update only a certain # of records at a time. The number to update is more dependent on your hardware than anything else in my opinion.
As with all things database, it depends. What is the load pattern (ie, are users reading mainly from the end of the table)? How are new records added, if at all? What are your index fill factor settings and actual values? Will your update force any index re-computes? Can you split up the update to reduce locking? If so, do you need robust rollback ability in case of a failure? Are you setting the same value in every row, or do you need a per row calculation, or do you have a per-row source to match up?
Go through the table one row at a time using a loop or even a cursor. Make sure each update is using row locks.
If you don't have a way of identifying rows that still have to be updated, create another table first to hold the primary key and an update indicator, copy all primary key values in there and then keep track of how far you are along in that table.
This is also going to be the slowest method. If you need it to go a little faster, update a few thousand rows at a time, still using rowlock hints.
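The tracking-table idea can be sketched as follows, using Python's built-in sqlite3 module as a stand-in for SQL Server. The `users_todo` table, its `done` flag, the `_foo` suffix update and the function name `update_with_tracking` are assumptions for illustration, not part of any answer above.

```python
import sqlite3

def update_with_tracking(conn, batch_size):
    """Apply the update in batches, flagging finished keys in a side table; returns batch count."""
    # Side table: one row per primary key plus a done flag.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users_todo ("
        "user_id INTEGER PRIMARY KEY, done INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute("INSERT OR IGNORE INTO users_todo (user_id) SELECT user_id FROM users")
    conn.commit()
    batches = 0
    while True:
        keys = [row[0] for row in conn.execute(
            "SELECT user_id FROM users_todo WHERE done = 0 ORDER BY user_id LIMIT ?",
            (batch_size,),
        )]
        if not keys:
            break
        marks = ",".join("?" * len(keys))
        # Both statements commit together, so the flag never disagrees with the data.
        conn.execute(f"UPDATE users SET name = name || '_foo' WHERE user_id IN ({marks})", keys)
        conn.execute(f"UPDATE users_todo SET done = 1 WHERE user_id IN ({marks})", keys)
        conn.commit()
        batches += 1
    return batches

# Hypothetical demo data: 7 users updated 3 at a time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(i, f"user{i}") for i in range(1, 8)])
conn.commit()

first = update_with_tracking(conn, batch_size=3)
second = update_with_tracking(conn, batch_size=3)  # already done, so no rows change
names = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY user_id")]
```

The side table matters when the update is not idempotent, as here: without the flag, a rerun would append the suffix twice.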

Copy one column to another for over a billion rows in SQL Server database

Database : SQL Server 2005
Problem : Copy values from one column to another column in the same table with a billion+ rows.
test_table (int id, bigint bigid)
Things tried 1: update query
update test_table set bigid = id
fills up the transaction log and rolls back due to lack of transaction log space.
Tried 2 - a procedure along the following lines
set nocount on
declare @rowcount int, @rowsupdated int
select @rowcount = 1, @rowsupdated = 0
set rowcount 500000
while @rowcount > 0
begin
    update test_table set bigid = id where bigid is null
    set @rowcount = @@rowcount
    set @rowsupdated = @rowsupdated + @rowcount
end
print @rowsupdated
The above procedure starts slowing down as it proceeds.
Tried 3 - Creating a cursor for update.
This is generally discouraged in the SQL Server documentation, and this approach updates one row at a time, which is too time-consuming.
Is there an approach that can speed up the copying of values from one column to another? Basically I am looking for some 'magic' keyword or logic that will allow the update query to rip through the billion rows half a million at a time sequentially.
Any hints, pointers will be much appreciated.
I'm going to guess that you are closing in on the 2.1 billion limit of an INT datatype on an artificial key for a column. Yes, that's a pain. Much easier to fix before the fact than after you've actually hit that limit and production is shut down while you are trying to fix it :)
Anyway, several of the ideas here will work. Let's talk about speed, efficiency, indexes, and log size, though.
Log Growth
The log blew up originally because it was trying to commit all 2b rows at once. The suggestions in other posts for "chunking it up" will work, but that may not totally resolve the log issue.
If the database is in SIMPLE mode, you'll be fine (the log will re-use itself after each batch). If the database is in FULL or BULK_LOGGED recovery mode, you'll have to run log backups frequently during the running of your operation so that SQL can re-use the log space. This might mean increasing the frequency of the backups during this time, or just monitoring the log usage while running.
Indexes and Speed
ALL of the where bigid is null answers will slow down as the table is populated, because there is (presumably) no index on the new BIGID field. You could, (of course) just add an index on BIGID, but I'm not convinced that is the right answer.
The key (pun intended) is my assumption that the original ID field is probably the primary key, or the clustered index, or both. In that case, let's take advantage of that fact, and do a variation of Jess' idea:
declare @counter int
set @counter = 1
while @counter < 2000000000 --or whatever
begin
    update test_table set bigid = id
    where id between @counter and (@counter + 499999) --BETWEEN is inclusive
    set @counter = @counter + 500000
end
This should be extremely fast, because of the existing indexes on ID.
The ISNULL check really wasn't necessary anyway, neither is my (-1) on the interval. If we duplicate some rows between calls, that's not a big deal.
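For anyone who wants to check the range arithmetic before running it against a billion rows, here is a minimal sketch of the same walk-the-key update using Python's built-in sqlite3 module as a stand-in for SQL Server. The sparse demo IDs, the step of 100 and the function name `copy_column_by_ranges` are assumptions for illustration.

```python
import sqlite3

def copy_column_by_ranges(conn, step):
    """Set bigid = id walking the primary key in fixed ranges; returns the batch count."""
    low, high = conn.execute("SELECT MIN(id), MAX(id) FROM test_table").fetchone()
    if low is None:  # empty table, nothing to do
        return 0
    batches = 0
    counter = low
    while counter <= high:
        # The range predicate on the key avoids scanning for "bigid IS NULL" each time.
        conn.execute(
            "UPDATE test_table SET bigid = id WHERE id BETWEEN ? AND ?",
            (counter, counter + step - 1),
        )
        conn.commit()
        counter += step
        batches += 1
    return batches

# Hypothetical demo data: ids 1, 4, 7, ..., 1000 (gaps are fine for this approach).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_table (id INTEGER PRIMARY KEY, bigid INTEGER)")
conn.executemany("INSERT INTO test_table (id) VALUES (?)", [(i,) for i in range(1, 1001, 3)])
conn.commit()

batches = copy_column_by_ranges(conn, step=100)
unset = conn.execute("SELECT COUNT(*) FROM test_table WHERE bigid IS NULL").fetchone()[0]
mismatched = conn.execute("SELECT COUNT(*) FROM test_table WHERE bigid <> id").fetchone()[0]
```

Each batch touches a contiguous slice of the key, so the cost per batch stays roughly constant instead of growing as the table fills in.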
Use TOP in the UPDATE statement:
UPDATE TOP (@row_limit) dbo.test_table
SET bigid = id
WHERE bigid IS NULL
You could try to use something like SET ROWCOUNT and do batch updates:
SET ROWCOUNT 5000;
UPDATE dbo.test_table
SET bigid = id
WHERE bigid IS NULL
GO
and then repeat this as many times as you need to.
This way, you're avoiding the RBAR (row-by-agonizing-row) symptoms of cursors and while loops, and yet, you don't unnecessarily fill up your transaction log.
Of course, in between runs, you'd have to do backups (especially of your log) to keep its size within reasonable limits.
Is this a one time thing? If so, just do it by ranges:
declare @counter int
set @counter = 500000
while @counter < 2000000000 --or whatever your max id
begin
    update test_table set bigid = id where id between (@counter - 500000) and @counter and bigid is null
    set @counter = @counter + 500000
end
I didn't run this to try it, but if you can get it to update 500k at a time I think you're moving in the right direction.
set rowcount 500000
update tt1
set bigid = (SELECT tt2.id FROM test_table tt2 WHERE tt1.id = tt2.id)
from test_table tt1
where tt1.bigid IS NULL
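You can also try changing the recovery model to SIMPLE. Changes are still logged, but the log space can be reused after each checkpoint instead of waiting for a log backup. Note that a single UPDATE over the whole table is still one transaction and needs log space for all of it, and that switching back to FULL requires a new full or differential backup to restart the log backup chain.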
ALTER DATABASE db1
SET RECOVERY SIMPLE
GO
update test_table
set bigid = id
GO
ALTER DATABASE db1
SET RECOVERY FULL
GO
First step, if there are any, would be to drop indexes before the operation. This is probably what is causing the speed degrade with time.
The other option, a little outside-the-box thinking... can you express the update in such a way that you could materialize the column values in a select? If you can, then you could create what amounts to a NEW table using SELECT INTO, which is a minimally logged operation (assuming in 2005 that you are set to a recovery model of SIMPLE or BULK_LOGGED). This would be pretty fast, and then you can drop the old table, rename this table to the old table name and recreate any indexes.
select id, CAST(id as bigint) bigid into test_table_temp from test_table
drop table test_table
exec sp_rename 'test_table_temp', 'test_table'
I second the UPDATE TOP(X) statement.
Also, if you're in a loop, add a WAITFOR delay or a COMMIT between batches to give other processes some time to use the table if needed, rather than blocking until all the updates are completed.

Batch commit on large INSERT operation in native SQL?

I have a couple of large tables (188m and 144m rows) I need to populate from views, but each view contains a few hundred million rows (pulling together pseudo-dimensionally modelled data into a flat form). The keys on each table are composite, spanning over 50 bytes of columns. If the data was in tables, I could always think about using sp_rename to make the other new table, but that isn't really an option.
If I do a single INSERT operation, the process uses a huge amount of transaction log space, typically filling it up and prompting a bunch of hassle with the DBAs. (And yes, this is probably a job the DBAs should handle/design/architect.)
I can use SSIS and stream the data into the destination table with batch commits (but this does require the data to be transmitted over the network, since we are not allowed to run SSIS packages on the server).
Are there any options other than dividing the process up into multiple INSERT operations, using some kind of key to distribute the rows into different batches, and doing a loop?
Does the view have ANY kind of unique identifier / candidate key? If so, you could select those rows into a working table using:
SELECT key_columns INTO dbo.temp FROM dbo.HugeView;
(If it makes sense, maybe put this table into a different database, perhaps with SIMPLE recovery model, to prevent the log activity from interfering with your primary database. This should generate much less log anyway, and you can free up the space in the other database before you resume, in case the problem is that you have inadequate disk space all around.)
Then you can do something like this, inserting 10,000 rows at a time, and backing up the log in between:
SET NOCOUNT ON;
DECLARE
    @batchsize INT,
    @ctr INT,
    @rc INT;
SELECT
    @batchsize = 10000,
    @ctr = 0;
WHILE 1 = 1
BEGIN
    WITH x AS
    (
        SELECT key_column, rn = ROW_NUMBER() OVER (ORDER BY key_column)
        FROM dbo.temp
    )
    INSERT dbo.PrimaryTable(a, b, c, etc.)
    SELECT v.a, v.b, v.c, etc.
    FROM x
    INNER JOIN dbo.HugeView AS v
    ON v.key_column = x.key_column
    WHERE x.rn > @batchsize * @ctr
    AND x.rn <= @batchsize * (@ctr + 1);
    IF @@ROWCOUNT = 0
        BREAK;
    BACKUP LOG PrimaryDB TO DISK = 'C:\db.bak' WITH INIT;
    SET @ctr = @ctr + 1;
END
That's all off the top of my head, so don't cut/paste/run, but I think the general idea is there. For more details (and why I backup log / checkpoint inside the loop), see this post on sqlperformance.com:
Break large delete operations into chunks
Note that if you are taking regular database and log backups, you will probably want to take a full backup afterwards to start your log chain over again.
You could partition your data and insert it in a cursor loop. That would be nearly the same as SSIS batch inserting, but it runs on your server.
create cursor ....
select YEAR(DateCol), MONTH(DateCol) from whatever
while ....
insert into yourtable(...)
select * from whatever
where YEAR(DateCol) = year and MONTH(DateCol) = month
end
I know this is an old thread, but I made a generic version of Arthur's cursor solution:
--Split a batch up into chunks using a cursor.
--This method can be used for most any large table with some modifications
--It could also be refined further with an @Day variable (for example)
DECLARE @Year INT
DECLARE @Month INT
DECLARE BatchingCursor CURSOR FOR
SELECT DISTINCT YEAR(<SomeDateField>),MONTH(<SomeDateField>)
FROM <Sometable>;
OPEN BatchingCursor;
FETCH NEXT FROM BatchingCursor INTO @Year, @Month;
WHILE @@FETCH_STATUS = 0
BEGIN
--All logic goes in here
--Any select statements from <Sometable> need to be suffixed with:
--WHERE Year(<SomeDateField>)=@Year AND Month(<SomeDateField>)=@Month
FETCH NEXT FROM BatchingCursor INTO @Year, @Month;
END;
CLOSE BatchingCursor;
DEALLOCATE BatchingCursor;
GO
This solved the problem on loads of our large tables.
There is no pixie dust, you know that.
Without knowing specifics about the actual schema being transferred, a generic solution would be exactly as you describe it: divide processing into multiple inserts and keep track of the key(s). This is sort of pseudo-code T-SQL:
create table currentKeys (table sysname not null primary key, key sql_variant not null);
go
declare @keysInserted table (key sql_variant);
declare @key sql_variant;
begin transaction
do while (1=1)
begin
select @key = key from currentKeys where table = '<target>';
insert into <target> (...)
output inserted.key into @keysInserted (key)
select top (<batchsize>) ... from <source>
where key > @key
order by key;
if (0 = @@rowcount)
break;
update currentKeys
set key = (select max(key) from @keysInserted)
where table = '<target>';
commit;
delete from @keysInserted;
set @key = null;
begin transaction;
end
commit
It would get more complicated if you want to allow for parallel batches and partition the keys.
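As a runnable counterpart to the pseudo-code, here is a minimal sketch of key-tracked batching using Python's built-in sqlite3 module as a stand-in for SQL Server. The `source`, `target` and `current_keys` tables, the integer key `k`, the starting checkpoint of 0 (which assumes positive keys) and the function name `copy_with_checkpoint` are all assumptions for illustration. The rows pass through Python here, whereas the T-SQL version keeps them on the server with INSERT ... SELECT TOP and an OUTPUT clause.

```python
import sqlite3

def copy_with_checkpoint(conn, batch_size):
    """Copy source rows to target in key order, persisting the last key copied; returns batch count."""
    # Seed the checkpoint once; assumes all keys are greater than 0.
    conn.execute("INSERT OR IGNORE INTO current_keys (table_name, last_key) VALUES ('target', 0)")
    conn.commit()
    batches = 0
    while True:
        last_key = conn.execute(
            "SELECT last_key FROM current_keys WHERE table_name = 'target'"
        ).fetchone()[0]
        rows = conn.execute(
            "SELECT k, payload FROM source WHERE k > ? ORDER BY k LIMIT ?",
            (last_key, batch_size),
        ).fetchall()
        if not rows:
            break
        conn.executemany("INSERT INTO target (k, payload) VALUES (?, ?)", rows)
        conn.execute(
            "UPDATE current_keys SET last_key = ? WHERE table_name = 'target'",
            (rows[-1][0],),
        )
        conn.commit()  # the batch and its checkpoint are committed together
        batches += 1
    return batches

# Hypothetical demo data: keys 10, 20, ..., 120 copied 5 at a time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (k INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE target (k INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE current_keys (table_name TEXT PRIMARY KEY, last_key INTEGER NOT NULL)")
conn.executemany("INSERT INTO source VALUES (?, ?)", [(k, f"row {k}") for k in range(10, 130, 10)])
conn.commit()

first = copy_with_checkpoint(conn, batch_size=5)
second = copy_with_checkpoint(conn, batch_size=5)  # checkpoint is at the end, so no work
copied = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
checkpoint = conn.execute("SELECT last_key FROM current_keys").fetchone()[0]
```

Seeking on `k > last_key` keeps each batch cheap even deep into the table, which is the main advantage over row-number or OFFSET windows that have to count past everything already copied.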
You could use the BCP command to load the data and use the Batch Size parameter
http://msdn.microsoft.com/en-us/library/ms162802.aspx
Two step process
BCP OUT data from Views into Text files
BCP IN data from Text files into Tables with batch size parameter
This looks like a job for good ol' BCP.