MS sql server looping through huge table

I have a table with 9 million records. I need to loop through each row and insert into multiple tables on each iteration.
My example query is:
-- this is the table with 9 million records
create table tablename
(
ROWID INT IDENTITY(1, 1) primary key,
LeadID int,
Title varchar(20),
FirstName varchar(50),
MiddleName varchar(20),
Surname varchar(50)
)
declare @counter int
declare @leadid int
declare @totalcounter int
set @counter = 1
select @totalcounter = count(ROWID) from tablename
while (@counter <= @totalcounter)
begin
select @leadid = leadid from tablename
where ROWID = @counter
-- perform some insert into multiple tables
-- in each iteration I need to do this as well
select * from [sometable]
inner join tablename where leadid = @leadid
set @counter = @counter + 1
end
The problem is that this takes too long, especially the join on each iteration.
Can someone please help me optimize this?

Yes, your join is taking long because there is no join condition specified between your two tables, so you are creating a Cartesian product. That is definitely going to take a while.
If you want to optimize this, specify what you want to join those tables on.
If it is still slow, have a look at appropriate indexes.

It looks like you are trying to find all the rows in sometable that have the same leadid as the rows in tablename. If so, a simple join should work:
select t2.*
from tablename t1 inner join sometable t2
on t1.leadid = t2.leadid
As long as you have an index on leadid you shouldn't have any problems.
What are you really trying to do?
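If the per-row work really is "insert the matching rows into other tables", the loop can usually be dropped altogether in favour of one statement per target table. A minimal sketch, assuming a hypothetical target table dbo.LeadNames, the default dbo schema, and a LeadID column on sometable (none of these are given in the question):

```sql
-- Hypothetical target table; the question does not name the real ones.
CREATE TABLE dbo.LeadNames
(
    LeadID    int,
    FirstName varchar(50),
    Surname   varchar(50)
);

-- Indexes on the join column, as suggested above.
CREATE INDEX IX_tablename_LeadID ON dbo.tablename (LeadID);
CREATE INDEX IX_sometable_LeadID ON dbo.sometable (LeadID);

-- One set-based statement instead of 9 million loop iterations.
INSERT INTO dbo.LeadNames (LeadID, FirstName, Surname)
SELECT t1.LeadID, t1.FirstName, t1.Surname
FROM dbo.tablename AS t1
INNER JOIN dbo.sometable AS t2
    ON t2.LeadID = t1.LeadID;
```

Each additional target table would get its own INSERT ... SELECT of the same shape.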

Related

SQL: Delete Rows from Dynamic list of tables where ID is null

I'm a SQL novice, and usually figure things out via Google and SO, but I can't wrap my head around the SQL required for this.
My question is similar to Delete sql rows where IDs do not have a match from another table, but in my case I have a middle table that I have to query, so here's the scenario:
We have this INSTANCES table that basically lists all the occurrences of files sent to the database, but have to join with CROSS_REF so our reporting application knows which table to query for the report, and we just have orphaned INSTANCES rows I want to clean out. Each DETAIL table contains different fields from the other ones.
I want to delete all single records from INSTANCES if there are no records for that Instance ID in any DETAIL table. The DETAIL table got regularly cleaned of old files, but the Instance record wasn't cleaned up, so we have a lot of INSTANCE records that don't have any associated DETAIL data. The thing is, I have to select the Table Name from CROSS_REF to know which DETAIL_X table to look up the Instance ID.
In the below example then, since DETAIL_1 doesn't have a record with Instance ID = 1001, I want to delete the 1001 record from INSTANCES.
INSTANCES
Instance ID | Detail ID
1000 | 123
1001 | 123
1002 | 234
CROSS_REF
Detail ID | Table Name
123 | DETAIL_1
124 | DETAIL_2
125 | DETAIL_3
DETAIL_1
Instance ID
1000
1000
2999

Storing table names or column names in a database is almost always a sign for a bad database design. You may want to change this and thus get rid of this problem.
However, when knowing the possible table names, the task is not too difficult.
delete from instances i
where not exists
(
select null
from cross_ref cr
left join detail_1 d1 on d1.instance_id = i.instance_id and cr.table_name = 'DETAIL_1'
left join detail_2 d2 on d2.instance_id = i.instance_id and cr.table_name = 'DETAIL_2'
left join detail_3 d3 on d3.instance_id = i.instance_id and cr.table_name = 'DETAIL_3'
where cr.detail_id = i.detail_id
and
(
d1.instance_id is not null or
d2.instance_id is not null or
d3.instance_id is not null
)
);
(You can replace is not null by = i.instance_id, if you find that more readable. In that case you could even remove these criteria from the ON clauses.)

Many thanks to @DougCoats. Here's what I ended up with (@Doug, if you want to update your answer, I'll mark yours as correct).
DECLARE @Count INT, @Sql VARCHAR(MAX), @Max INT;
SET @Count = (SELECT MIN(DetailID) FROM CROSS_REF)
SET @Max = (SELECT MAX(DetailID) FROM CROSS_REF)
WHILE @Count <= @Max
BEGIN
-- only build the DELETE when this DetailID exists in CROSS_REF
IF (select count(*) from CROSS_REF where DetailID = @Count) <> 0
BEGIN
SET @Sql ='DELETE i
FROM Instances i
WHERE NOT EXISTS
(
SELECT InstanceID
FROM '+(SELECT TableName FROM Cross_Ref WHERE DetailID=@Count)+' d
WHERE d.InstanceId=i.InstanceID
AND i.detailID ='+ cast(@Count as varchar) +'
)
AND i.detailID ='+ cast(@Count as varchar)
EXEC(@Sql);
END
-- advance on every pass, otherwise a gap in DetailID would loop forever
SET @Count=@Count+1
END

This answer assumes you have sequential data in the CROSS_REF table. If you do not, you'll need to alter this to account for it (as it will bomb due to a missing object reference).
However, this should give you an idea. It could probably also be written in a more set-based way, but my answer is meant to demonstrate dynamic SQL. Be careful when using dynamic SQL, though.
DECLARE @Count INT, @Sql VARCHAR(MAX), @Max INT;
SET @Count = (SELECT MIN(DetailID) FROM CROSS_REF)
SET @Max = (SELECT MAX(DetailID) FROM CROSS_REF)
WHILE @Count <= @Max
BEGIN
-- only build the DELETE when this DetailID exists in CROSS_REF
IF (select count(*) from CROSS_REF where DetailID = @Count) <> 0
BEGIN
SET @Sql ='DELETE i
FROM Instances i
WHERE NOT EXISTS
(
SELECT InstanceID
FROM '+(SELECT TableName FROM Cross_Ref WHERE DetailID=@Count)+' d
WHERE d.InstanceId=i.InstanceID
AND i.detailID ='+ cast(@Count as varchar) +'
)
AND i.detailID ='+ cast(@Count as varchar)
EXEC(@Sql);
END
-- advance on every pass, otherwise a gap in DetailID would loop forever
SET @Count=@Count+1
END
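The "more set-based" variant mentioned above could look roughly like this. It is only a sketch, assuming SQL Server 2017 or later (for STRING_AGG), the column names used in the loop (DetailID, TableName, InstanceID), and table names stored in CROSS_REF without a schema prefix:

```sql
DECLARE @sql nvarchar(max);

-- Build one DELETE per CROSS_REF row and concatenate them into a single batch.
SELECT @sql = STRING_AGG(
           CAST(N'DELETE i FROM Instances AS i WHERE i.DetailID = '
                + CAST(cr.DetailID AS nvarchar(20))
                + N' AND NOT EXISTS (SELECT 1 FROM '
                + QUOTENAME(cr.TableName)   -- brackets the name so it cannot inject SQL
                + N' AS d WHERE d.InstanceID = i.InstanceID);' AS nvarchar(max)),
           N' ')
FROM Cross_Ref AS cr
WHERE OBJECT_ID(cr.TableName, N'U') IS NOT NULL;   -- skip names with no matching user table

PRINT @sql;                        -- inspect the generated statements first
IF @sql IS NOT NULL
    EXEC sys.sp_executesql @sql;
```

Because it reads the DetailID values straight from CROSS_REF, gaps in the sequence do not matter here.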

Copying data from one table to another using Insert Into

I have two tables. They both have identical structures except table2 has an additional column. I currently copy data from table1 into table2 using a stored proc, as shown below.
However, due to the sheer number of records (20 million+) and the structure of the stored proc, this currently takes a couple of hours to run.
Does anyone have any suggestions on how to optimize the code?
CREATE PROCEDURE dbo.insert_period @period INT AS
DECLARE @batchsize INT
DECLARE @start INT
DECLARE @numberofrows INT
SELECT @numberofrows = COUNT(*) from daily_table
SET @batchsize = 150000
SET @start = 1
WHILE @start < @numberofrows
BEGIN
INSERT INTO dbo.main_table WITH (TABLOCK) (
col1,
col2,
....,
col26,
time_period
)
SELECT *, @period FROM dbo.daily_table
ORDER BY id
OFFSET @start ROWS
FETCH NEXT @batchsize ROWS ONLY
SET @start += @batchsize + 1
END
The id that I am using here is not unique. The table itself does not have any keys or unique id's.

First I would like to point out that the logic in your insert is flawed.
With @start starting at 1, you're always skipping the first row of the source table. Then adding 1 to it at the end of your loop causes it to skip another row on each subsequent run of the loop.
If you're set on using batched inserts, I suggest you read up on how they work over on MSSQLTips.
To help you with performance I would suggest taking a look at the following:
SELECT *
Remove the SELECT * and replace it with the column names. This will help the optimizer get you a better query plan. Further reading on why SELECT * is bad can be found in this SO Question.
ORDER BY
That ORDER BY is probably slowing you down. Without seeing your query plan we cannot know for sure, though. Each time your loop executes it queries the source table and has to sort all those records. Sorting 20+ million records that many times is a lot of work. Take a look at my simplified example below.
CREATE TABLE #Test (Id INT);
INSERT INTO #Test VALUES (1), (2), (3), (4), (5);
DECLARE @batchsize INT;
DECLARE @start INT;
DECLARE @numberofrows INT;
SELECT @numberofrows = COUNT(*) FROM #Test;
SET @batchsize = 2;
SET @start = 0;
WHILE @start < @numberofrows
BEGIN
SELECT
*
, 10
FROM
#Test
ORDER BY
Id OFFSET @start ROWS FETCH NEXT @batchsize ROWS ONLY;
SET @start += @batchsize;
END;
Below is a portion of the query plan produced by the sample. Notice the Sort operation highlighted in yellow. Its cost accounts for 78% of that query plan.
If we add an index that is already sorted on the Id column of the source table we can eliminate the sort. Now when the loop runs it doesn't have to do any sorting.
CREATE INDEX ix_Test ON #Test (Id)
Other Options to Research
Columnstore Indexes
Batch Mode in RowStore
Parallel Inserts

You copy the table piece by piece in a loop, which is why it takes so long. The simplest way to achieve what you want is an 'INSERT' combined with a 'SELECT' statement. This way, you insert the data in one batch.
CREATE TABLE dbo.daily_table (id INT PRIMARY KEY IDENTITY,
value1 NVARCHAR(100) NULL,
value2 NVARCHAR(100) NULL);
GO
CREATE TABLE dbo.main_table (id INT PRIMARY KEY IDENTITY,
value1 NVARCHAR(100) NULL,
value2 NVARCHAR(100) NULL,
value3 NVARCHAR(100) NULL);
GO
INSERT INTO dbo.daily_table (value1, value2)
VALUES('1', '2');
-- Insert with Select
INSERT INTO dbo.main_table (value1, value2)
SELECT value1, value2
FROM dbo.daily_table;
Also, it's better not to use an asterisk in your 'SELECT' statement, because the columns it returns change whenever the table definition changes.
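The example above leaves the additional column empty. A sketch that also fills it, reusing the example tables and assuming value3 stands in for the question's time_period column (the period value is made up):

```sql
DECLARE @period int = 202401;   -- hypothetical period value

-- Single set-based copy: explicit column lists on both sides,
-- with the extra column filled from the variable.
INSERT INTO dbo.main_table WITH (TABLOCK) (value1, value2, value3)
SELECT d.value1, d.value2, CAST(@period AS nvarchar(100))
FROM dbo.daily_table AS d;
```

No ORDER BY or OFFSET is needed here, so the source table is read once instead of being sorted for every batch.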

Performance is slow when Replacing/updating a string of a table row[bulk data] in SQL Server

I want to update the FormattedBody column of the main table, called posts, which has the schema below, with dummy data:
Now, I want to replace a substring (the source URL) with the final URL in the FormattedBody column above. There are 5335 records in total in the Excel sheet.
To do this I've written the query below:
DECLARE @LoopCounter INT = 1
DECLARE @SURL nvarchar(max)
DECLARE @FURL nvarchar(max)
WHILE ( @LoopCounter <= 5335)
BEGIN
SET @SURL = (select sourceURL from temptable where ID = @LoopCounter)
SET @FURL = (select [TargetURL] from temptable where ID = @LoopCounter)
update posts
Set FormattedBody=REPLACE(CAST(FormattedBody as NVarchar(Max)),@SURL,@FURL)
Where SectionID = 95 and postlevel=1 and CAST(FormattedBody as NVarchar(Max)) like '%' + @SURL + '%'
SET @LoopCounter = @LoopCounter + 1
END
temptable contains the data from the Excel sheet (ID, sourceURL, and TargetURL).
The query above works as expected, but performance is poor because it scans all the rows of the posts table (a huge amount of data) once for each of the 5335 records.
Currently, it updates only 3 records per minute.
Any suggestion/help is appreciated! :)
Thanks!

I don't think you need a WHILE loop here; I would use UPDATE ... JOIN instead.
If there isn't any relationship between the temptable and posts tables, you can use a CROSS JOIN (Cartesian product) to pair every sourceURL and [TargetURL] row of temptable with the posts rows, and then update.
UPDATE p
SET FormattedBody = REPLACE(CAST(FormattedBody as NVarchar(Max)),sourceURL,[TargetURL])
FROM posts p
CROSS JOIN
(
SELECT sourceURL,[TargetURL]
FROM temptable
where id <= 5335
) targetDt
Where p.SectionID = 95 and p.postlevel=1
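One caveat with a joined UPDATE like the one above: when several temptable rows pair with the same posts row, SQL Server applies only one of them per statement, and which one is not defined. A sketch of one way around that, joining only on pairs that actually occur in the text and repeating until nothing is left to replace. The cap of 50 passes is an arbitrary assumption, there to stop the loop if a TargetURL itself contains a sourceURL:

```sql
DECLARE @rows int = 1, @pass int = 0;

WHILE @rows > 0 AND @pass < 50      -- 50 is an arbitrary safety cap
BEGIN
    UPDATE p
    SET FormattedBody = REPLACE(CAST(p.FormattedBody AS nvarchar(max)),
                                t.sourceURL, t.TargetURL)
    FROM posts AS p
    INNER JOIN temptable AS t
        -- CHARINDEX is a plain substring test, so % and _ in URLs are not wildcards
        ON CHARINDEX(t.sourceURL, CAST(p.FormattedBody AS nvarchar(max))) > 0
    WHERE p.SectionID = 95
      AND p.postlevel = 1;

    SET @rows = @@ROWCOUNT;         -- rows changed in this pass
    SET @pass += 1;
END;
```

Whether this beats the original loop depends on how many posts rows match the filter, so it is worth measuring on a copy of the data first.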

I would suggest adding
and CAST(FormattedBody as NVarchar(Max)) like '%' + @SURL + '%'
to the WHERE condition in the first place, because the way you wrote it, I think ALL the records are updated EACH time, whether or not FormattedBody contains @SURL.

Find a column where the identity column is breaking

I have a table, e.g.
cust_ord_key
1
2
3
4
5
7
9
How do I write a query to find out if the numbers are in sequence and not breaking anywhere?

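In SQL Server:
SELECT LkUp.SeqID AS MissingSeqID
FROM (SELECT ROW_NUMBER() OVER (ORDER BY column_id) SeqID from sys.columns) LkUp
LEFT JOIN dbo.TestData t ON t.ID = LkUp.SeqID
WHERE t.ID IS NULL -- keep only the sequence numbers with no matching row
AND LkUp.SeqID <= (SELECT MAX(ID) FROM dbo.TestData) -- ignore numbers past the last ID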

You may do something like:
DECLARE @RESULT int;
SET @RESULT = 0;
DECLARE @FIRST_ID int;
DECLARE @LAST_ID int;
DECLARE @THIS_VALUE int;
DECLARE @NEXT_VALUE int;
SELECT @FIRST_ID = min(ID), @LAST_ID = max(ID) from table_name;
WHILE(@FIRST_ID <= @LAST_ID)
BEGIN
SELECT @THIS_VALUE = your_field_with_keys from table_name where ID = @FIRST_ID;
SELECT @NEXT_VALUE = your_field_with_keys from table_name where ID = (@FIRST_ID + 1);
if @THIS_VALUE > @NEXT_VALUE
SET @RESULT = @FIRST_ID;
--break your query here or do anything else
SET @FIRST_ID = @FIRST_ID + 1;
END
What does this query do? We declare the @RESULT variable to hold the ID at which your key breaks. @FIRST_ID and @LAST_ID are the minimum and maximum IDs in your table; we use them later. @THIS_VALUE and @NEXT_VALUE hold the two keys to be compared.
Then we loop over the IDs, setting @THIS_VALUE and @NEXT_VALUE to the corresponding keys (this one and the next). If @THIS_VALUE is greater than @NEXT_VALUE, the key breaks here (the previous key is greater than the next key), and we take the ID of the element where the key is broken. At that point you may stop your query or do whatever logic is required.

This is not perfect, but definitely does the job, and is universal across all DB engines.
SELECT t1.id FROM myTable t1
LEFT JOIN myTable t2 ON t1.id+1 = t2.id
WHERE t2.id IS NULL
When the identity doesn't break anywhere, this will only return the last entry. You can compensate for that with a procedural language, by first getting the MAX(ID) and adding it to the WHERE clause like this:
WHERE t2.id IS NULL AND t1.id<>5643
where 5643 is the max id (either a variable introduced in the query string, or can be a variable in the procedural SQL language of whatever DB engine you're using). The point is that it's the maximum value of the identity on that table.
OR, you can just dismiss the last row from the result set if you're doing it in PHP or whatever.
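The MAX(ID) exclusion can also be folded into the query itself with a subquery, so no procedural step is needed. A small sketch using the sample values from the question in a table variable (the names @myTable and last_id_before_gap are made up for the example):

```sql
DECLARE @myTable TABLE (id int PRIMARY KEY);
INSERT INTO @myTable (id) VALUES (1), (2), (3), (4), (5), (7), (9);

-- Each returned id is the last value before a gap in the sequence.
SELECT t1.id AS last_id_before_gap
FROM @myTable AS t1
LEFT JOIN @myTable AS t2
    ON t2.id = t1.id + 1
WHERE t2.id IS NULL
  AND t1.id <> (SELECT MAX(id) FROM @myTable);   -- the last row is not a gap
```

For the sample data this returns 5 and 7; an empty result means the sequence has no breaks.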

SQL IN operator in update query causes a lot of time

Below is a update query which is to update a table with about 40000 records:
UPDATE tableName
SET colA = val, colB = val
WHERE ID IN (select RecordIDs from tableB where needUpdate = 'Y')
When the above query is executed, I found that the query below takes about 15 seconds:
SELECT RecordIDs
FROM tableB
WHERE needUpdate = 'Y'
But when I take away the WHERE clause (i.e. update tableName set colA = val, colB = val), the query runs smoothly.
Why does this happen? Are there any ways to shorten the execution time?
Edited:
Below is the structure of both tables:
tableName:
ID int,
VehicleBrandID int,
VehicleLicenseExpiryDate nvarchar(25),
LicensePlateNo nvarchar(MAX),
ContactPerson nvarchar(MAX),
ContactPersonID nvarchar(MAX),
ContactPersonPhoneNumber nvarchar(MAX),
ContactPersonAddress nvarchar(MAX),
CreatedDate nvarchar(MAX),
CreatedBy nvarchar(MAX)
PRIMARY KEY (ID)
tableB:
RowNumber int
RecordIDs int
NeedUpdate char(1)
PRIMARY KEY (RowNumber)
Edited
Below screenshot is the execution plan for the update query

The execution plan shows you are using table variables and are missing a useful index.
Keep the existing PK on @output
DECLARE @output TABLE (
ID INT PRIMARY KEY,
VehicleBrandID INT,
VehicleLicenseExpiryDate NVARCHAR(25),
LicensePlateNo NVARCHAR(MAX),
ContactPerson NVARCHAR(MAX),
ContactPersonID NVARCHAR(MAX),
ContactPersonPhoneNumber NVARCHAR(MAX),
ContactPersonAddress NVARCHAR(MAX),
CreatedDate NVARCHAR(MAX), /*<-- Don't store dates as strings*/
CreatedBy NVARCHAR(MAX))
And add a new index to @tenancyEditable
DECLARE @tenancyEditable TABLE (
RowNumber INT PRIMARY KEY,
RecordIDs INT,
NeedUpdate CHAR(1),
UNIQUE (NeedUpdate, RecordIDs, RowNumber))
With these indexes in place, the following query
UPDATE @output
SET LicensePlateNo = ''
WHERE ID IN (SELECT RecordIDs
FROM @tenancyEditable
WHERE NeedUpdate = 'Y')
OPTION (RECOMPILE)
can generate a more efficient-looking plan.
Also you should use appropriate datatypes rather than storing everything as NVARCHAR(MAX). A person name isn't going to need more than nvarchar(100) at most and CreatedDate should be stored as date[time2] for example.
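If tableName and tableB are permanent tables rather than table variables, the same idea could be sketched as follows. The index name and the dbo schema are assumptions, and LicensePlateNo = N'' is just the placeholder update from the example above:

```sql
-- Covering index: seek on NeedUpdate = 'Y', then read RecordIDs in order.
CREATE INDEX IX_tableB_NeedUpdate_RecordIDs
    ON dbo.tableB (NeedUpdate, RecordIDs);

UPDATE t
SET t.LicensePlateNo = N''
FROM dbo.tableName AS t
WHERE EXISTS (SELECT 1
              FROM dbo.tableB AS b
              WHERE b.NeedUpdate = 'Y'
                AND b.RecordIDs = t.ID);
```

With the index in place the subquery no longer has to scan all of tableB to find the 'Y' rows.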

I suppose you are in one of the two cases below:
1/ The statistics are out of date because of recent modifications to your table. In this case you should execute this:
UPDATE STATISTICS tableB
2/ A bad query plan is being used, in which case I recommend executing this to force recompilation of the query:
SELECT RecordIDs
FROM tableB
WHERE needUpdate = 'Y'
OPTION (RECOMPILE)
Tell us the result and we can go into more detail.

This is an alternative. It is worth it to try in your environment as it has been demonstrated for others to be faster.
MERGE INTO tableName tn
USING (
SELECT RecordIDs
FROM tableB
WHERE needUpdate = 'Y'
) tb
ON tn.ID = tb.RecordIDs
WHEN MATCHED THEN
UPDATE
SET colA = val,
colB = val;
EDIT:
I am not claiming this to be faster in every case or in every setup/environment - just that it is worth a try as it has worked for me and others I have worked with or read about.

You can use an inner join instead of the IN clause.
update t
set
t.colA = val, t.colB = val
from tableName t
inner join tableB x on
t.ID = x.RecordIDs
where x.needUpdate = 'Y'
Although the UPDATE...FROM syntax is essential in some circumstances, I prefer to use subqueries (by using the IN clause) whenever possible.
