SQL Server while loop - sql

I have a select query that returns about 10 million rows, and I then need to insert them into a new table.
I want the performance to be acceptable, so I want to insert them into the new table in batches of 10000. To give an example, I created a simple select query below:
INSERT INTO new_table
SELECT TOP 10000 * FROM applications
But now I need to get the next 10000 rows and insert them. Is there a way to iterate through the 10 million rows to insert them in batches of 10000? I'm using SQL Server 2008.

Batching it up will probably not be faster. Probably the opposite: a single statement is the fastest version most of the time. It might require large amounts of tempdb space and log, but it is the fastest measured by the wall clock.
The reason is that SQL Server automatically builds a good plan that efficiently processes all the work at once.
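For reference, the single-statement version of your example would simply drop the TOP and copy everything at once (new_table is the placeholder name from your example):
INSERT INTO new_table
SELECT * FROM applications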
To answer your question: the statement as you wrote it returns an undefined set of rows, because a table has no inherent order. You should probably add a clustering key such as an ID column. That way you can walk the table with a WHILE loop, each time executing the following:
INSERT ...
SELECT TOP 10000 *
FROM T
WHERE ID > @lastMaxID
ORDER BY ID
Note that the ORDER BY is required for correctness.
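For illustration, here is a minimal sketch of that WHILE loop using the table names from the question; the assumption that ID is an indexed, ever-increasing key and that new_table starts out empty is mine, not the original poster's:
DECLARE @lastMaxID INT = 0;
WHILE 1 = 1
BEGIN
-- Copy the next 10,000 rows in ID order.
INSERT INTO new_table
SELECT TOP 10000 *
FROM applications
WHERE ID > @lastMaxID
ORDER BY ID;
-- Stop once a pass copies nothing.
IF @@ROWCOUNT = 0 BREAK;
-- Advance the cursor to the highest ID copied so far.
SELECT @lastMaxID = MAX(ID) FROM new_table;
END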

I wouldn't batch 10 million records.
If you are batching an insert, use an indexed field to define your batches.
DECLARE @intFlag INT
SET @intFlag = 1
WHILE (@intFlag <= 10000000)
BEGIN
INSERT INTO yourTable
SELECT *
FROM applications
WHERE ID BETWEEN @intFlag AND @intFlag + 9999
SET @intFlag = @intFlag + 10000
END
GO

Use a CTE or a WHILE loop to insert in batches, like this:
;WITH q (n) AS (
SELECT 1
UNION ALL
SELECT n + 1
FROM q
WHERE n < 10000
)
INSERT INTO table1
SELECT * FROM q
OPTION (MAXRECURSION 10000)
OR
DECLARE @batch INT,
@rowcounter INT,
@maxrowcount INT
SET @batch = 10000
SET @rowcounter = 1
SELECT @maxrowcount = max(id) FROM table1
WHILE @rowcounter <= @maxrowcount
BEGIN
INSERT INTO table2 (col1)
SELECT col1
FROM table1
WHERE 1 = 1
AND id between @rowcounter and (@rowcounter + @batch)
-- Set the @rowcounter to the next batch start
SET @rowcounter = @rowcounter + @batch + 1;
END

As an option, you can export the query result to a flat file with bcp and BULK INSERT it into the new table.
The BULK INSERT statement has a BATCHSIZE option to limit the number of rows per batch.
In your case BATCHSIZE = 10000 will work.
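A rough sketch of that route; the database, server, file path, and table names below are placeholders, not values from the question:
-- Export the query result to a flat file (run from a command prompt):
-- bcp "SELECT * FROM MyDb.dbo.applications" queryout "C:\temp\applications.dat" -c -T -S MyServer
-- Load it back in, committing every 10,000 rows:
BULK INSERT dbo.new_table
FROM 'C:\temp\applications.dat'
WITH (DATAFILETYPE = 'char', BATCHSIZE = 10000);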
Another option is to create an SSIS package. Select fast load in the OLE DB destination and set “Rows per batch:” to 10000. It is probably the easiest solution.

Related

Copying data from one table to another using Insert Into

I have two tables. They both have identical structures except table2 has an additional column. I currently copy data from table1 into table2 using a stored proc, as shown below.
However, due to the sheer number of records (20 million+), and the structure of the stored proc, this currently takes a couple of hours to run.
Does anyone have any suggestions on how to optimize the code?
CREATE PROCEDURE dbo.insert_period @period INT AS
DECLARE @batchsize INT
DECLARE @start INT
DECLARE @numberofrows INT
SELECT @numberofrows = COUNT(*) from daily_table
SET @batchsize = 150000
SET @start = 1
WHILE @start < @numberofrows
BEGIN
INSERT INTO dbo.main_table WITH (TABLOCK) (
col1,
col2,
....,
col26,
time_period
)
SELECT *, @period FROM dbo.daily_table
ORDER BY id
OFFSET @start ROWS
FETCH NEXT @batchsize ROWS ONLY
SET @start += @batchsize + 1
END
The id that I am using here is not unique. The table itself does not have any keys or unique id's.
First I would like to point out that the logic in your insert is flawed.
With @start starting at 1 you're always skipping the first row of the source table, and adding 1 to it at the end of your loop causes it to skip another row on each subsequent run of the loop.
If you're set on using batched inserts, I suggest you read up on how it works over on MSSQLTips.
To help you with performance I would suggest taking a look at the following:
SELECT *
Remove the SELECT * and replace it with the column names. This will help the optimizer give you a better query plan. Further reading on why SELECT * is bad can be found in this SO question.
ORDER BY
That ORDER BY is probably slowing you down, though without seeing your query plan we cannot know for sure. Each time your loop executes, it queries the source table and has to sort all those records. Sorting 20+ million records that many times is a lot of work. Take a look at my simplified example below.
CREATE TABLE #Test (Id INT);
INSERT INTO #Test VALUES (1), (2), (3), (4), (5);
DECLARE @batchsize INT;
DECLARE @start INT;
DECLARE @numberofrows INT;
SELECT @numberofrows = COUNT(*) FROM #Test;
SET @batchsize = 2;
SET @start = 0;
WHILE @start < @numberofrows
BEGIN
SELECT
*
, 10
FROM
#Test
ORDER BY
Id OFFSET @start ROWS FETCH NEXT @batchsize ROWS ONLY;
SET @start += @batchsize;
END;
Below is a portion of the query plan produced by the sample. Notice the Sort operation highlighted in yellow. Its cost accounts for 78% of that query plan.
If we add an index that is already sorted on the Id column of the source table we can eliminate the sort. Now when the loop runs it doesn't have to do any sorting.
CREATE INDEX ix_Test ON #Test (Id)
Other Options to Research
Columnstore Indexes
Batch Mode on Rowstore
Parallel Inserts
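As a brief, hedged illustration of the first and last items (the index name is made up, the column list is shortened, @period is the procedure parameter from the question, and availability depends on your SQL Server version and edition):
-- Columnstore: a clustered columnstore index compresses the table and is aimed at large scans and bulk loads.
CREATE CLUSTERED COLUMNSTORE INDEX cci_main_table ON dbo.main_table;
-- Parallel insert: on SQL Server 2016+, an INSERT ... SELECT with a TABLOCK hint on the target can qualify for a parallel insert.
INSERT INTO dbo.main_table WITH (TABLOCK) (col1, col2, time_period)
SELECT col1, col2, @period
FROM dbo.daily_table;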
You copy the table row by row; that's why it takes so long. The simplest way to achieve what you want is an 'INSERT' combined with a 'SELECT' statement. This way, you would insert the data in one batch.
CREATE TABLE dbo.daily_table (id INT PRIMARY KEY IDENTITY,
value1 NVARCHAR(100) NULL,
value2 NVARCHAR(100) NULL);
GO
CREATE TABLE dbo.main_table (id INT PRIMARY KEY IDENTITY,
value1 NVARCHAR(100) NULL,
value2 NVARCHAR(100) NULL,
value3 NVARCHAR(100) NULL);
GO
INSERT INTO dbo.daily_table (value1, value2)
VALUES('1', '2');
-- Insert with Select
INSERT INTO dbo.main_table (value1, value2)
SELECT value1, value2
FROM dbo.daily_table;
Also, it's better not to use an asterisk in your 'SELECT' statement since the result could be unpredictable.

SQL Server: Why isn't this logic working when Chunking on inserts?

Fellow Techies--
I've got an endless loop condition happening here. Why is @@ROWCOUNT never getting set back to 0? I must not be understanding what @@ROWCOUNT really does--or I am setting the value in the wrong place. I think the value should be decrementing on each pass until I eventually hit zero.
DECLARE @ChunkSize int = 250000;
WHILE @ChunkSize <> 0
BEGIN
BEGIN TRANSACTION
INSERT TableName
(col1,col2)
SELECT TOP (@ChunkSize)
col1,col2
FROM TableName2
COMMIT TRANSACTION;
SET @ChunkSize = @@ROWCOUNT
END -- transaction block
END -- while-loop block
I'm not sure, from what you posted, how you are going to ensure you catch rows that you haven't already inserted. If you don't, it'll be an infinite loop of course. Here is a way using test data--but naturally you'd want to base it on a PK or other unique column. Perhaps you just left that part off, or I'm missing something altogether. I'm just interested in what your final code is for your chunking and the logic behind it, so this is an answer and an inquiry.
if object_id('tempdb..#source') is not null drop table #source
if object_id('tempdb..#destination') is not null drop table #destination
create table #source(c1 int, c2 int)
create table #destination (c1 int, c2 int)
insert into #source (c1,c2) values
(1,1),
(2,1),
(3,1),
(4,1),
(5,1),
(6,1),
(7,1),
(8,1),
(9,1),
(10,1),
(11,1),
(12,1)
DECLARE @ChunkSize int = 2;
WHILE @ChunkSize <> 0
BEGIN
INSERT INTO #destination (c1,c2)
SELECT TOP (@ChunkSize) c1,c2 FROM #source WHERE c1 NOT IN (SELECT DISTINCT c1 FROM #destination) ORDER BY ROW_NUMBER() OVER (ORDER BY C1)
SET @ChunkSize = @@ROWCOUNT
--SELECT @ChunkSize
END
select * from #source
select * from #destination
Nothing is happening because you're setting @ChunkSize to itself without ever looking at what you've already inserted. Using your example, @ChunkSize = 250000. First, the select performs SELECT TOP 250000 and returns (presumably) 250000 rows. You then use @@ROWCOUNT to update @ChunkSize, but the row count returned will be 250000, so you just set it to 250000 again. Which could be fine, except there is no way that number will ever change without ruling out rows that you've already inserted - you will keep inserting the same 250000 rows over and over.
You need something like NOT EXISTS to filter out the rows you've already inserted:
DECLARE @ChunkSize int = 250000;
WHILE @ChunkSize > 0
BEGIN
BEGIN TRANSACTION
INSERT INTO TableName
(col1,col2)
SELECT TOP (@ChunkSize)
col1,col2
FROM TableName2 T2
WHERE NOT EXISTS (SELECT *
FROM TableName T
WHERE T.Col1 = T2.Col1
AND T.Col2 = T2.Col2)
SET @ChunkSize = @@ROWCOUNT
PRINT CONVERT(nvarchar(10),@ChunkSize) + ' Rows Inserted.';
COMMIT TRANSACTION
END -- while-loop block
Implemented solution
In the end, I decided to pump the SQL through SSIS, where I could set the commit batch size accordingly. Had I not chosen that route, I would have had to follow @scsimon's suggestion and basically maintain a tracking table for the records completed and the records left to cycle through.
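For completeness, here is a rough, hypothetical sketch of what such a tracking-table approach could look like; the RowsToCopy name and the two-column key are mine for illustration, not @scsimon's actual code:
-- Stage the remaining work in a tracking table (illustrative name).
SELECT col1, col2
INTO dbo.RowsToCopy
FROM TableName2;
DECLARE @ChunkSize int = 250000;
-- Move rows in chunks: delete from the tracking table and route the deleted rows into the target.
-- (OUTPUT ... INTO has restrictions, e.g. no triggers or foreign keys on the target table.)
WHILE EXISTS (SELECT 1 FROM dbo.RowsToCopy)
BEGIN
DELETE TOP (@ChunkSize)
FROM dbo.RowsToCopy
OUTPUT DELETED.col1, DELETED.col2 INTO TableName (col1, col2);
END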

How to do Select query range by range on a particular table

I have one temp_table which consists of more than 80K rows.
In Aqua Data Studio I am unable to do a select * on this table, due to a space/memory limitation I guess.
select * from #tmp
Is there any way to do select query range by range?
For example: give me the first 10000 records, then the next 10000, and so on until the end.
Note:-
1) I am using Aqua Data Studio, where I am restricted to selecting at most 5000 rows in one select query.
2) I am using Sybase, which somehow doesn't allow 'except' and 'select top @var from table' syntax, and ROWNUM() is not available.
Thanks!!
You can use something like the following in SQL Server. Just update @FirstRow for each new iteration.
declare @FirstRow int = 0
declare @Rows int = 10000
select top (@FirstRow+@Rows) * from Table
except
select top (@FirstRow) * from Table
set @FirstRow = @FirstRow + @Rows
select top (@FirstRow+@Rows) * from Table
except
select top (@FirstRow) * from Table
Can you not use something like a WHERE clause on some id in the table?
select top n * from table where some_id > current_iteration_starting_point
e.g.
select top 200 * from tablename where some_id > 1
and keep increasing the iteration starting point, say from 1 to 201 in the next iteration, and so on.
Here is documentation on how to increase the memory capacity of Aqua Data Studio:
https://www.aquaclusters.com/app/home/project/public/aquadatastudio/wikibook/Documentation16/page/50/Launcher-Memory-Configuration

delete old records and keep 10 latest in sql compact

I'm using a SQL Compact database (.sdf) with MS SQL 2008.
In the table 'Job', each id has multiple jobs.
A system regularly adds jobs to the table.
I would like to keep the 10 latest records for each id, ordered by their 'datecompleted',
and delete the rest of the records.
How can I construct my query? I failed when using a #temp table and a cursor.
Well it is fast approaching Christmas, so here is my gift to you, an example script that demonstrates what I believe it is that you are trying to achieve. No I don't have a big white fluffy beard ;-)
CREATE TABLE TestJobSetTable
(
ID INT IDENTITY(1,1) not null PRIMARY KEY,
JobID INT not null,
DateCompleted DATETIME not null
);
--Create some test data
DECLARE @iX INT;
SET @iX = 0
WHILE(@iX < 15)
BEGIN
INSERT INTO TestJobSetTable(JobID,DateCompleted) VALUES(1,getDate())
INSERT INTO TestJobSetTable(JobID,DateCompleted) VALUES(34,getDate())
SET @iX = @iX + 1;
WAITFOR DELAY '00:00:00.010'
END
--Create some more test data, for when there may be job groups with less than 10 records.
SET @iX = 0
WHILE(@iX < 6)
BEGIN
INSERT INTO TestJobSetTable(JobID,DateCompleted) VALUES(23,getDate())
SET @iX = @iX + 1;
WAITFOR DELAY '00:00:00.010'
END
--Review the data set
SELECT * FROM TestJobSetTable;
--Apply the deletion to the remainder of the data set.
WITH TenMostRecentCompletedJobs AS
(
SELECT ID, JobID, DateCompleted
FROM TestJobSetTable A
WHERE ID in
(
SELECT TOP 10 ID
FROM TestJobSetTable
WHERE JobID = A.JobID
ORDER BY DateCompleted DESC
)
)
--SELECT * FROM TenMostRecentCompletedJobs ORDER BY JobID,DateCompleted desc;
DELETE FROM TestJobSetTable
WHERE ID NOT IN(SELECT ID FROM TenMostRecentCompletedJobs)
--Now only data of interest remains
SELECT * FROM TestJobSetTable
DROP TABLE TestJobSetTable;
How about something like:
DELETE FROM
Job
WHERE id NOT IN (
SELECT TOP 10 id
FROM Job
ORDER BY datecompleted DESC)
This is assuming you're using 3.5 because nested SELECT is only available in this version or higher.
I did not read the question correctly. I suspect something more along the lines of a CTE will solve the problem, using similar logic. You want to build a query that identifies the records you want to keep, as your starting point.
Using CTE on SQL Server Compact 3.5
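A hypothetical sketch of that keep-set idea against the Job table from the question, adapting the logic of the earlier answer; the column names are assumed, and you would need to check which of this syntax (CTEs, TOP in subqueries) your SQL Server Compact version actually supports:
WITH RecordsToKeep AS
(
-- The 10 most recent datecompleted values per id.
SELECT id, datecompleted
FROM Job AS J
WHERE datecompleted IN
(
SELECT TOP 10 datecompleted
FROM Job
WHERE id = J.id
ORDER BY datecompleted DESC
)
)
DELETE FROM Job
WHERE NOT EXISTS
(
SELECT 1
FROM RecordsToKeep AS K
WHERE K.id = Job.id
AND K.datecompleted = Job.datecompleted
);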

Split query result by half in TSQL (obtain 2 resultsets/tables)

I have a query that returns a large number of heavy rows.
When I transform these rows into a list of CustomObject I get a big memory peak, and this transformation is done by a custom .NET framework that I can't modify.
I need to retrieve fewer rows at a time so I can do "the transform" in two passes and avoid the memory peak.
How can I split the result of a query in half? I need to do it in the DB layer. I thought of doing a "TOP count(*)/2", but how do I get the other half?
Thank you!
If you have an identity field in the table, select the even ids first, then the odd ones.
select * from Table where Id % 2 = 0
select * from Table where Id % 2 = 1
You should have roughly 50% rows in each set.
Here is another way to do it, from http://www.tek-tips.com/viewthread.cfm?qid=1280248&page=5. I think it's more efficient:
Declare @Rows Int
Declare @TopRows Int
Declare @BottomRows Int
Select @Rows = Count(*) From TableName
If @Rows % 2 = 1
Begin
Set @TopRows = @Rows / 2
Set @BottomRows = @TopRows + 1
End
Else
Begin
Set @TopRows = @Rows / 2
Set @BottomRows = @TopRows
End
Set RowCount @TopRows
Select * From TableName Order By DisplayOrder
Set RowCount @BottomRows
Select * From TableName Order By DisplayOrder Desc
--- old answer below ---
Is this a stored procedure call or dynamic SQL? Can you use temp tables?
If so, something like this would work:
select row_number() OVER(order by yourorderfield) as rowNumber, *
INTO #tmp
FROM dbo.yourtable
declare @rowCount int
SELECT @rowCount = count(1) from #tmp
SELECT * from #tmp where rowNumber <= @rowCount / 2
SELECT * from #tmp where rowNumber > @rowCount / 2
DROP TABLE #tmp
SELECT TOP 50 PERCENT WITH TIES ... ORDER BY SomeThing
then
SELECT TOP 50 PERCENT ... ORDER BY SomeThing DESC
However, unless you snapshot the data first, a row in the middle may slip through or be processed twice
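As a concrete sketch of that approach (the table and ordering column names are placeholders, not from the question):
-- First half of the rows.
SELECT TOP 50 PERCENT WITH TIES *
FROM dbo.SourceTable
ORDER BY SomeThing;
-- Second half, read in the opposite order.
SELECT TOP 50 PERCENT *
FROM dbo.SourceTable
ORDER BY SomeThing DESC;
-- As noted above, without a snapshot (or a unique ordering column) a row near the middle
-- can end up in both halves or in neither.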
I don't think you should do that in SQL, unless you can live with the possibility of getting the same record twice.
I would do it in a "software" programming language, not SQL: Java, .NET, C++, etc.