I have two tables. They both have identical structures except table2 has an additional column. I currently copy data from table1 into table2 using a stored proc, as shown below.
However, due to the sheer number of records (20 million+) and the structure of the stored proc, this currently takes a couple of hours to run.
Does anyone have any suggestions on how to optimize the code?
CREATE PROCEDURE dbo.insert_period @period INT AS
DECLARE @batchsize INT
DECLARE @start INT
DECLARE @numberofrows INT

SELECT @numberofrows = COUNT(*) FROM daily_table

SET @batchsize = 150000
SET @start = 1

WHILE @start < @numberofrows
BEGIN
    INSERT INTO dbo.main_table WITH (TABLOCK) (
        col1,
        col2,
        ....,
        col26,
        time_period
    )
    SELECT *, @period FROM dbo.daily_table
    ORDER BY id
    OFFSET @start ROWS
    FETCH NEXT @batchsize ROWS ONLY

    SET @start += @batchsize + 1
END
The id that I am using here is not unique. The table itself does not have any keys or unique IDs.
First, I would like to point out that the logic in your insert is flawed.
With @start starting at 1, you're always skipping the first row of the source table. Then adding 1 to it at the end of your loop causes it to skip another row on every subsequent pass.
If you're set on using batched inserts, I suggest you read up on how they work over on MSSQLTips.
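For reference, a minimal corrected sketch of the loop is below. The shortened column list (col1, col2, time_period) is only a placeholder for your real 26 columns and the procedure name is hypothetical; the point is that @start begins at 0 and advances by exactly @batchsize:

CREATE PROCEDURE dbo.insert_period_fixed @period INT
AS
BEGIN
    DECLARE @batchsize INT = 150000;
    DECLARE @start INT = 0;                 -- 0, so the first OFFSET keeps the first row
    DECLARE @numberofrows INT;

    SELECT @numberofrows = COUNT(*) FROM dbo.daily_table;

    WHILE @start < @numberofrows
    BEGIN
        INSERT INTO dbo.main_table WITH (TABLOCK) (col1, col2, time_period)
        SELECT col1, col2, @period
        FROM dbo.daily_table
        ORDER BY id
        OFFSET @start ROWS
        FETCH NEXT @batchsize ROWS ONLY;

        SET @start += @batchsize;           -- advance by exactly one batch, no extra +1
    END
END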
To help you with performance I would suggest taking a look at the following:
SELECT *
Remove the SELECT * and replace it with the column names. This will help the optimizer produce a better query plan. Further reading on why SELECT * is bad can be found in this SO question.
ORDER BY
That ORDER BY is probably slowing you down, though without seeing your query plan we cannot know for sure. Each time your loop executes, it queries the source table and has to sort all those records. Sorting 20+ million records that many times is a lot of work. Take a look at my simplified example below.
CREATE TABLE #Test (Id INT);
INSERT INTO #Test VALUES (1), (2), (3), (4), (5);

DECLARE @batchsize INT;
DECLARE @start INT;
DECLARE @numberofrows INT;

SELECT @numberofrows = COUNT(*) FROM #Test;
SET @batchsize = 2;
SET @start = 0;

WHILE @start < @numberofrows
BEGIN
    SELECT
        *
        , 10
    FROM
        #Test
    ORDER BY
        Id
    OFFSET @start ROWS FETCH NEXT @batchsize ROWS ONLY;

    SET @start += @batchsize;
END;
The query plan produced by the sample contains a Sort operation whose cost accounts for 78% of the plan.
If we add an index on the Id column of the source table, the data is already stored in sorted order and the sort can be eliminated. Now when the loop runs it doesn't have to do any sorting.
CREATE INDEX ix_Test ON #Test (Id)
Other Options to Research
Columnstore Indexes (see the sketch after this list)
Batch Mode in RowStore
Parallel Inserts
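As a sketch of the first option, converting the target table to a clustered columnstore index looks like the statement below. This is an illustration only: it requires SQL Server 2014 or later and assumes dbo.main_table does not already have a clustered index (otherwise you would rebuild the existing one instead).

-- Illustration: clustered columnstore on the target table (SQL Server 2014+)
CREATE CLUSTERED COLUMNSTORE INDEX cci_main_table
ON dbo.main_table;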
You copy the table row by row, which is why it takes so long. The simplest way to achieve what you want is an 'INSERT' combined with a 'SELECT' statement. This way, you insert the data in one batch.
CREATE TABLE dbo.daily_table (id INT PRIMARY KEY IDENTITY,
value1 NVARCHAR(100) NULL,
value2 NVARCHAR(100) NULL);
GO
CREATE TABLE dbo.main_table (id INT PRIMARY KEY IDENTITY,
value1 NVARCHAR(100) NULL,
value2 NVARCHAR(100) NULL,
value3 NVARCHAR(100) NULL);
GO
INSERT INTO dbo.daily_table (value1, value2)
VALUES('1', '2');
-- Insert with Select
INSERT INTO dbo.main_table (value1, value2)
SELECT value1, value2
FROM dbo.daily_table;
Also, it's better not to use an asterisk in your 'SELECT' statement, since the results depend on the column order and can change unexpectedly if the table structure changes.
I want to populate a list of row numbers into a temp table, up to the maximum number from another table. For example, I want to add 1, 2, 3, 4, 5, etc. up to the max of 45.
The other table where the max comes from misses out some row numbers (i.e. 1, 3, 5, 11), which is why I can't use that table directly.
My poor attempt so far is the following, but this only gives me the max number and not a sequential listing. There is probably some built-in table/function I've forgotten about.
DECLARE @reportTable TABLE (row int, [1] nvarchar(max), [2] nvarchar(max))

INSERT INTO @reportTable (row, [1], [2])
SELECT MAX(row), '', ''
FROM #Days
Your assistance is most appreciated.
Brain switched on finally...
DECLARE @rowmax as int
DECLARE @rowcount as int

SET @rowmax = (SELECT MAX(row) FROM #Days)
SET @rowcount = 1

WHILE @rowcount <= @rowmax
BEGIN
    INSERT @reportTable (row)
    SELECT @rowcount;

    SET @rowcount = @rowcount + 1
END
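If you'd rather avoid the loop, a set-based sketch using a recursive CTE (assuming the same #Days table and @reportTable variable as above) produces the same list:

DECLARE @maxrow int = (SELECT MAX(row) FROM #Days);

WITH numbers (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < @maxrow
)
INSERT INTO @reportTable (row)
SELECT n FROM numbers
OPTION (MAXRECURSION 0);   -- lift the default limit of 100 recursions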
My task is to create an index on a large table in SQL Server (~370 GB). The plan is to
1) create a new table with the same columns,
2) create a clustered index in the new table on three columns, and
3) copy the original data into the new table in small chunks (grouped by the three columns).
I can do 1) and 2) in SQL with the following script:
SELECT TOP 0 *
INTO js_sample_indexed
FROM dbo.js_sample
CREATE CLUSTERED INDEX domain_event_platform_idx
ON dbo.js_sample_indexed (domain ASC, event_type ASC, platform ASC)
GO
But I am stuck on the third step. Presumably there are thousands of distinct values for the index columns; for example, one value might be ('Amazon', 'search', 'mobile').
So I might need to put a WHERE clause inside a loop, updating the selection condition each time.
But I'm stuck at how to store and retrieve the values in each column (e.g. 'domain') using SQL.
Don't know whether I've phrased this question clearly, but any comments would be helpful. Thanks!
I am assuming that there is an identity field of some sort (a sequentially numbered field used as an index) on the table. For this example, I will call this field ID. If this is true, then a simple looping construct will do what you need.
DECLARE @MinID int, @MaxID int, @Step int = 10000 -- Move 10k records per loop

SELECT @MinID = MIN(ID), @MaxID = MAX(ID)
FROM MyTableToCopyFrom

WHILE @MinID <= @MaxID
BEGIN
    INSERT INTO MyTableToCopyTo (Field1, Field2, Field3, Field4)
    SELECT Field1, Field2, Field3, Field4
    FROM MyTableToCopyFrom
    WHERE ID >= @MinID
      AND ID < @MinID + @Step

    SET @MinID = @MinID + @Step
END
So I came up with an answer after some reading and asking. Here is the code:
USE jumpshot_data
GO

DROP TABLE dbo.js_indexed

-- create a new table with existing structure
SELECT TOP 0 *
INTO dbo.js_full_indexed_1
FROM dbo.js_test

CREATE CLUSTERED INDEX domain_event_platform_idx
ON dbo.js_full_indexed_1 (domain ASC, event_type ASC, platform ASC)
GO

CREATE NONCLUSTERED INDEX device_id_idx
ON js_full_indexed_1 (device_id ASC);

-- using cursor to loop through meta-data table, and insert by chunk into the new table
DECLARE @event_type varchar(50)
DECLARE @platform varchar(50)
DECLARE @domain varchar(50)

DECLARE SelectionCursor CURSOR LOCAL FOR
SELECT * FROM dbo.js_index_info

OPEN SelectionCursor
FETCH NEXT FROM SelectionCursor INTO @event_type, @platform, @domain

WHILE (@@FETCH_STATUS = 0)
BEGIN
    -- operation at each row
    INSERT INTO dbo.js_full_indexed_1
    SELECT *
    FROM dbo.js_test
    WHERE event_type = @event_type AND domain = @domain AND platform = @platform

    -- loop condition
    FETCH NEXT FROM SelectionCursor INTO @event_type, @platform, @domain
END

CLOSE SelectionCursor
DEALLOCATE SelectionCursor
GO
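The meta-data table dbo.js_index_info is not defined above. Presumably it holds one row per distinct (event_type, platform, domain) combination; as an assumption, it could be built with something like:

-- Assumed construction of the meta-data table (not part of the original script)
SELECT DISTINCT event_type, platform, domain
INTO dbo.js_index_info
FROM dbo.js_test;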
Could someone please advise on how to repeat the query if it returns no results? I am trying to draw a random person out of the DB using RAND, but only if that person's number was not drawn previously (that info is stored in the column "allready_drawn").
At the moment, when the query lands on a number that was drawn before, the second condition ("is null") means it does not return a result.
I would need the query to re-run until it comes up with a number.
DECLARE @min INTEGER;
DECLARE @max INTEGER;

set @min = (select top 1 id from [dbo].[persons] where sector = 8 order by id ASC);
set @max = (select top 1 id from [dbo].[persons] where sector = 8 order by id DESC);

select
    ordial,
    name_surname
from [dbo].[persons]
where id = ROUND(((@max - @min) * RAND() + @min), 0) and allready_drawn is NULL
The result is one of two possible outcomes: either a person is returned, or nothing at all when the drawn number was already used.
Any suggestion is appreciated and I would like to thank everyone in advance.
Just try this: remove the "id" filter so you only have to run it once.
select TOP 1
ordial,
name_surname
from [dbo].[persons]
where allready_drawn is NULL
ORDER BY NEWID()
@gbn, that's a correct solution, but it may be too expensive. For very large tables with dense keys, randomly picking a key value between the min and max and re-picking until you find a match is also fair, and cheaper than sorting the whole table.
Also, there's a bug in the original post: the min and max rows will be selected only half as often as the others, since each maps to a smaller interval. To fix it, generate a random number from @min to @max + 1 and truncate rather than round. That way you map the interval [N, N+1) to N, giving each N a fair chance. For example, with @min = 1 and @max = 3, rounding sends only [1, 1.5) to 1 and only [2.5, 3] to 3, while truncating a draw from [1, 4) sends [1,2), [2,3) and [3,4) to 1, 2 and 3 with equal probability.
For this selection method, here's how to repeat until you find a match.
--drop table persons
go
create table persons(id int, ordial int, name_surname varchar(2000), sector int, allready_drawn bit)

insert into persons(id, ordial, name_surname, sector, allready_drawn)
values (1,1,'foo',8,null),(2,2,'foo2',8,null),(100,100,'foo100',8,null)
go

declare @min int = (select top 1 id from [dbo].[persons] where sector = 8 order by id ASC);
declare @max int = 1 + (select top 1 id from [dbo].[persons] where sector = 8 order by id DESC);

set nocount on

declare @results table(ordial int, name_surname varchar(2000))
declare @i int = 0
declare @selected bit = 0

while @selected = 0
begin
    set @i += 1

    insert into @results(ordial, name_surname)
    select
        ordial,
        name_surname
    from [dbo].[persons]
    where id = ROUND(((@max - @min) * RAND() + @min), 0, 1) and allready_drawn is NULL

    if @@ROWCOUNT > 0
    begin
        select *, @i tries from @results
        set @selected = 1
    end
end
I have a select query that returns about 10 million rows and I then need to insert them into a new table.
I want the performance to be OK, so I want to insert them into the new table in batches of 10,000. To give an example, I created a simple select query below.
Insert into new table
Select top 10000 * from applications
But now I need to get the next 10,000 rows and insert them. Is there a way to iterate through the 10 million rows and insert them in batches of 10,000? I'm using SQL Server 2008.
It will probably not be faster to batch it up; probably the opposite. One statement is usually the fastest version, measured by the wall clock, although it might require large amounts of temp space and log.
The reason is that SQL Server automatically builds a good plan that efficiently performs all the work at once.
To answer your question: the statement as you wrote it returns undefined rows, because a table has no inherent order. You should add a clustering key such as an ID column. That way you can walk the table with a WHILE loop, each time executing the following:
INSERT ...
SELECT TOP 10000 *
FROM T
WHERE ID > @lastMaxID
ORDER BY ID
Note that the ORDER BY is required for correctness.
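Here is a sketch of that loop. The table and column names are assumptions (adjust them to your schema), ID is the clustering key, and the target table is assumed to receive rows only from this loop:

DECLARE @lastMaxID int = 0;   -- assumes IDs are positive
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    INSERT INTO dbo.new_applications (ID, Col1, Col2)
    SELECT TOP (10000) ID, Col1, Col2
    FROM dbo.applications
    WHERE ID > @lastMaxID
    ORDER BY ID;

    SET @rows = @@ROWCOUNT;

    -- advance the watermark past everything copied so far
    SELECT @lastMaxID = ISNULL(MAX(ID), @lastMaxID) FROM dbo.new_applications;
END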
I wouldn't batch 10 million records.
If you are batching an insert, use an indexed field to define your batches.
DECLARE @intFlag INT
SET @intFlag = 1

WHILE (@intFlag <= 10000000)
BEGIN
    INSERT INTO yourTable
    SELECT *
    FROM applications
    WHERE ID BETWEEN @intFlag AND @intFlag + 9999

    SET @intFlag = @intFlag + 10000
END
GO
Use a CTE or a WHILE loop to insert in batches.
;WITH q (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1
    FROM q
    WHERE n < 10000
)
INSERT INTO table1
SELECT * FROM q
OPTION (MAXRECURSION 10000)  -- the default recursion limit is only 100
OR
DECLARE @batch INT,
        @rowcounter INT,
        @maxrowcount INT

SET @batch = 10000
SET @rowcounter = 1
SELECT @maxrowcount = max(id) FROM table1

WHILE @rowcounter <= @maxrowcount
BEGIN
    INSERT INTO table2 (col1)
    SELECT col1
    FROM table1
    WHERE 1 = 1
      AND id between @rowcounter and (@rowcounter + @batch)

    -- Set @rowcounter to the start of the next batch
    SET @rowcounter = @rowcounter + @batch + 1;
END
As an option, you can export the query results to a flat file with bcp and BULK INSERT the file into the table.
The BULK INSERT statement has a BATCHSIZE option to limit the number of rows per batch.
In your case BATCHSIZE = 10000 will work.
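A rough sketch of that approach is below; the server name, file path, and table names are assumptions, and the bcp line runs from a command prompt rather than T-SQL:

-- Export first (command prompt, not T-SQL; server, database, and path are hypothetical):
--   bcp "SELECT * FROM MyDb.dbo.applications" queryout C:\temp\applications.dat -c -T -S MYSERVER
-- Then load the file in 10,000-row batches:
BULK INSERT dbo.new_table
FROM 'C:\temp\applications.dat'
WITH (
    DATAFILETYPE = 'char',   -- matches the -c (character mode) bcp export
    BATCHSIZE = 10000,       -- commit every 10,000 rows
    TABLOCK                  -- take a bulk-load table lock
);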
There is another option: create an SSIS package. Select fast load in the OLE DB destination and set “Rows per batch” to 10000. It is probably the easiest solution.
I am using MS SQL Server 2005 at work to build a database. I have been told that most tables will hold 1,000,000 to 500,000,000 rows of data in the near future after it is built... I have not worked with datasets this large, and most of the time I don't even know what I should be considering when working out the best way to set up schema and queries.
So... I need to know the start and end dates for something, and a value that is associated with an ID during that time frame. We can set the table up in two different ways:
create table xxx_test2 (id int identity(1,1), groupid int, dt datetime, i int)
create table xxx_test2 (id int identity(1,1), groupid int, start_dt datetime, end_dt datetime, i int)
Which is better? How do I define "better"? I filled the first table with about 100,000 rows of data, and it takes about 10-12 seconds to transform it into the format of the second table, depending on the query...
select y.groupid,
y.dt as [start],
z.dt as [end],
(case when z.dt is null then 1 else 0 end) as latest,
y.i
from #x as y
outer apply (select top 1 *
from #x as x
where x.groupid = y.groupid and
x.dt > y.dt
order by x.dt asc) as z
or
http://consultingblogs.emc.com/jamiethomson/archive/2005/01/10/t-sql-deriving-start-and-end-date-from-a-single-effective-date.aspx
But... with the second table, to insert a new row I have to go look and see if there is a previous row and, if so, update its end date. So is it a question of performance when retrieving data vs. insert/update cost? It seems silly to store that end date twice, but maybe... not? What things should I be looking at?
This is what I used to generate my fake data, if you want to play with it for some reason (if you change the maximum of the random number to something higher, it will generate the fake data a lot faster):
declare @dt datetime
declare @i int
declare @id int
set @id = 1

declare @rowcount int
set @rowcount = 0

declare @numrows int

while (@rowcount < 100000)
begin
    set @i = 1
    set @dt = getdate()
    set @numrows = Cast(((5 + 1) - 1) * Rand() + 1 As tinyint)

    while @i <= @numrows
    begin
        insert into #x values (@id, dateadd(d, @i, @dt), @i)
        set @i = @i + 1
    end

    set @rowcount = @rowcount + @numrows
    set @id = @id + 1
    print @rowcount
end
For your purposes, I think option 2 is the way to go for table design. This gives you flexibility, and will save you tons of work.
Having both the start date and the end date will allow you to write a query that returns only currently effective data by putting this in your WHERE clause:
where GETDATE() between start_dt and end_dt
You can also then use it to join with other tables in a time-sensitive way.
Provided you set up the key properly and provide the right indexes, performance (on this table at least) should not be a problem.
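For example, a time-sensitive join against the second table design might look roughly like this (dbo.orders and its columns are made up for illustration):

SELECT o.order_id, o.order_date, x.i
FROM dbo.orders AS o
JOIN dbo.xxx_test2 AS x
    ON  x.groupid = o.groupid
    AND o.order_date BETWEEN x.start_dt AND x.end_dt;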
For anyone who can use the LEAD analytic function (SQL Server 2012+, Oracle, DB2, ...), retrieving data from the first table (the one with only a single date column) would be much quicker than without this feature:
select
groupid,
dt "start",
lead(dt) over (partition by groupid order by dt) "end",
case when lead(dt) over (partition by groupid order by dt) is null
then 1 else 0 end "latest",
i
from x