SQL Query on a table with 30 million records - sql

I have been having problems building a table in my local SQL Server. Originally the query was causing the tempdb database to become full and throw an exception. The query has a lot of joins and outer applies, so to find exactly where the problem lay I ran a select on the first table in the query to see how long it took. That was fast, so I added the table from the first join and reran, and I continued like this until I found the table that stalled.
I found the problem (or at least the first problem) was with the Shipper_Container table. This table is huge, and just showing a SELECT of its results alone throws a System.OutOfMemoryException (it has only 5 columns). The results cut out at 16 million records, but the table has 30 million rows and is 1.2 GB in size. That doesn't seem so big to me that SQL Server Management Studio couldn't handle it.
Using a WHERE clause to collect values between 1 January and 10 January 2015 still resulted in a query that had been executing for over 5 minutes when I cancelled it. I have also added indexes on each of the selected columns and this did not improve performance either.
Here is the SQL Query. You can see I have commented out the other parameters that have yet to be added in other joins and outer applies.
DECLARE @startDate DATETIME
DECLARE @endDate DATETIME
DECLARE @Shipper_Key INT = NULL
DECLARE @Part_Key INT = NULL
SET @startDate = '2015-01-01'
SET @endDate = '2015-01-10'
SET NOCOUNT ON;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
INSERT Shipped_Container
(
Ship_Date,
Invoice_Quantity,
Shipper_No,
Serial_No,
Truck_Key,
Shipper_Key
)
SELECT
S.Ship_Date,
SC.Quantity,
S.Shipper_No,
SC.Serial_No,
S.Truck_Key,
S.Shipper_Key
FROM Shipper AS S
JOIN Shipper_Line AS SL
--ON SL.PCN = S.PCN
ON SL.Shipper_Key = S.Shipper_Key
JOIN Shipper_Container AS SC
--ON SC.PCN = SL.PCN
ON SC.Shipper_Line_Key = SL.Shipper_Line_Key
WHERE S.Ship_Date >= @startDate AND S.Ship_Date <= @endDate
AND S.Shipper_Key = ISNULL(@Shipper_Key, S.Shipper_Key)
AND SL.Part_Key = ISNULL(@Part_Key, SL.Part_Key)
The server instance runs on the local network - could this be an issue? I have minimal experience with this and would really appreciate help that is as detailed and clear as possible. Often in SQL forums people jump straight into technical details I don't follow very well.

Don't do a SELECT ... FROM yourtable in SQL Server Management Studio when it returns hundreds of thousands or millions of rows. 1 GB of data gets a lot bigger when the system has to render and display it in the Management Studio results grid.
The server instance is run on the local network
When you do a SELECT ... FROM yourtable in SSMS, the server must send all the data to your laptop/desktop. This is quite a lot of unneeded pressure on the network.
It should not be an issue when you insert because everything stays on the server. However, staying on the server does not mean it will be fast if your data model is not good enough.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
You may get dirty reads if you use that... It may be better to remove it unless you know why it is there and why you need it.
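If you don't have a specific reason to allow dirty reads, the default isolation level is the safer choice:
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;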
I have also added indexes on each of the select parameters and this did not increase performance either
If you mean indexes on:
S.Ship_Date,
SC.Quantity,
S.Shipper_No,
SC.Serial_No,
S.Truck_Key,
S.Shipper_Key
What are their definitions?
If they are individual single-column indexes, you can drop the ones on SC.Quantity, S.Shipper_No, SC.Serial_No and S.Truck_Key; they are not used.
Ship_Date and Shipper_Key may be useful. It all depends on your model and existing primary keys (which you need to describe, see below).
It will help to give a more accurate answer if you could tell us:
the relations between your 3 tables (which fields link A to B and in which direction)
the primary key on your 3 tables
a complete list of all your indexes (and their columns) on your 3 tables
If none of your indexes is useful or if they are missing, the query will most likely read all 3 tables in full and try to match them. Because the data is pretty big, there is not enough memory to process it and tempdb is used to store the intermediate data.
For now I will suppose that Shipper_Key + PCN is the primary key on each table.
I think you can try this:
You can create an index on S.Ship_Date
Create Index IX_Shipper_Ship_Date On Shipper (Ship_Date) -- subject to change according to your primary key
The query optimizer may not use the indexes (if they exist) with such a where clause:
AND S.Shipper_Key = ISNULL(@Shipper_Key, S.Shipper_Key)
AND SL.Part_Key = ISNULL(@Part_Key, SL.Part_Key)
You can use:
AND (S.Shipper_Key = @Shipper_Key OR @Shipper_Key IS NULL)
AND (SL.Part_Key = @Part_Key OR @Part_Key IS NULL)
It would help to have indexes on Shipper_Key and PCN
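For example, something along these lines (a sketch only; the index names are made up and the best key columns depend on the primary keys you describe below):
Create Index IX_Shipper_Line_Shipper_Key On Shipper_Line (Shipper_Key)
Create Index IX_Shipper_Container_Shipper_Line_Key On Shipper_Container (Shipper_Line_Key)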
Finally
As I already said above, we need to know more about your data model (CREATE TABLE...), primary keys and indexes (CREATE INDEX). You can create a model here http://sqlfiddle.com/ with all 3 CREATE TABLE statements and their indexes, then add the link here.
In SSMS, you can right click on a table and go to Script Table as / Create To / New Query Window and add it here or in http://sqlfiddle.com/. Only keep the CREATE TABLE ... part down to the first GO.
You can then do the same thing for all your indexes.
You should also add a copy of your query plan.
In SSMS, go to Query menu / Display Estimated Execution Plan and right click to save it as xml (xml is better). It is only an estimation and it won't execute the whole query. It should be pretty fast.
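You can also capture the estimated plan as XML directly from a query window; SET SHOWPLAN_XML has to be the only statement in its batch, hence the GO separators (a minimal sketch):
SET SHOWPLAN_XML ON
GO
-- your INSERT ... SELECT query goes here; it is compiled but not executed
GO
SET SHOWPLAN_XML OFF
GO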

Related

Setting recovery to simple still results in Log File running computer out of memory with CREATE and INSERT INTO SQL but not SELECT INTO

I have recently encountered an issue pulling a lot of records from my local database. All tables are in the same database in a local environment on my laptop. I first encountered the problem when trying to run a rather large SQL statement that joins a lot of records between various tables into a single table to be pushed to Azure. What resulted was a memory exception. After looking for the cause I found that the database log file was growing to well over 80 GB. At first I thought it was the tempdb database, so I applied the usual advice to let it autogrow. This did not solve the problem but rather just made my computer run out of memory.
Below I will show the first three tables that by themselves lead to the error when asking for a month's worth of data.
Tables
Shipper: 2 MB, 53180 rows.
Shipper_Line: 2 MB, 63740 rows
Shipper_Container: 1229 MB, 28232977 rows
So pretty big but I still would expect SQL Server Management Studio to easily handle these computations using basic joins.
I originally used CREATE and INSERT INTO statements to populate an existing table. Here is the SQL.
--CREATE TABLE FIRST METHOD
--Creates table
CREATE TABLE Shipper_Test
(
Ship_Date DATETIME,
Shipper_No VARCHAR(50),
Truck_Key INT,
Shipper_Key INT,
Invoice_Quantity INT,
Shipper_Line_Key INT,
Serial_No VARCHAR(25)
)
--declare parameters
DECLARE @startDate DATETIME
DECLARE @endDate DATETIME
SET @startDate = '20150101'
SET @endDate = '20150130'
--inserts data from sql query below into existing table
INSERT Shipper_Test
(
Ship_Date,
Shipper_No,
Truck_Key,
Shipper_Key,
Invoice_Quantity,
Shipper_Line_Key,
Serial_No
)
SELECT
S.Ship_Date,
S.Shipper_No,
S.Truck_Key,
S.Shipper_Key,
SC.Quantity,
SL.Shipper_Line_Key,
SC.Serial_No
FROM Shipper AS S
JOIN Shipper_Line AS SL
ON SL.Shipper_Key = S.Shipper_Key
JOIN Shipper_Container AS SC
ON SC.Shipper_Line_Key = SL.Shipper_Line_Key
WHERE Ship_Date >= @startDate
AND Ship_Date <= @endDate
This is where the problem lies and the log file keeps growing.
I used a SELECT INTO statement and it worked. This is the sql for that.
--CREATE TABLE ON THE FLY METHOD
DECLARE @startDate DATETIME
DECLARE @endDate DATETIME
SET @startDate = '20150101'
SET @endDate = '20150130'
--Uses Select into to create the table as it gets the data
SELECT
S.Ship_Date,
S.Shipper_No,
S.Truck_Key,
S.Shipper_Key,
SL.Shipper_Line_Key,
SC.Serial_No,
SC.Quantity
INTO NewShipperTable
FROM Shipper AS S
JOIN Shipper_Line AS SL
ON SL.Shipper_Key = S.Shipper_Key
JOIN Shipper_Container AS SC
ON SC.Shipper_Line_Key = SL.Shipper_Line_Key
WHERE Ship_Date >= @startDate AND Ship_Date <= @endDate
The difference was quite a contrast. Selecting values from 1 - 30 January 2015 using the CREATE and INSERT INTO method resulted in a timeout after using up all disk space (over 50 GB at the point of timeout). Selecting the same values using a SELECT INTO statement took exactly 1:00 min and created 144 million rows. A simple select statement shows the same result.
I would like to use the existing table if I can. I tried using ALTER DATABASE Plex SET RECOVERY SIMPLE to stop the log file from increasing, but it still grew until the computer ran out of memory. I understand that a SELECT INTO produces minimal logging, as it is a minimally logged operation and uses a bulk insert. But given that step, I still don't understand why this would not have resulted in improved performance (i.e. the log file still grew). Can someone explain to me the best way to approach this? Are my joins and data tables just too big to use a CREATE and INSERT INTO SQL statement, forcing me to use SELECT INTO, or am I doing something wrong? It seems to me that 30 million records is not too much at all for SQL Server to handle, so I suspect I am doing something wrong.
EDIT
I have also tried using indexes on the parameters in the joins and this did not help either.
SQL Server (like many relational databases) implements a set of properties known by the acronym ACID. I'm not going to talk about all of them right now, but the one that matters for this purpose is atomicity. Simply stated, atomicity says that a given operation needs to happen either 100% or 0%, even if the operation fails somewhere in the middle.
Why this is relevant here is that even though your database is in simple recovery, it still needs to log the insertion of the rows into the table. It does this so that, in the case of a failure before the operation completes, it can roll the operation back.
A common workaround for this type of operation is to insert the rows in batches (say 1000 at a time). An easy and efficient way to do that is to use SSIS (if that's available to you). Otherwise, you're going to have to roll your own batching logic in T-SQL, for example along the lines of the sketch below.
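A minimal sketch of that batching logic, reusing the tables from the question and batching one day of Ship_Date at a time (the day-sized batches are just an assumption; any key range would work):
DECLARE @batchStart DATETIME = '20150101'
DECLARE @batchEnd DATETIME
WHILE @batchStart < '20150131'
BEGIN
    SET @batchEnd = DATEADD(DAY, 1, @batchStart)
    -- each day is inserted and logged as its own, much smaller transaction
    INSERT Shipper_Test (Ship_Date, Shipper_No, Truck_Key, Shipper_Key, Invoice_Quantity, Shipper_Line_Key, Serial_No)
    SELECT S.Ship_Date, S.Shipper_No, S.Truck_Key, S.Shipper_Key, SC.Quantity, SL.Shipper_Line_Key, SC.Serial_No
    FROM Shipper AS S
    JOIN Shipper_Line AS SL ON SL.Shipper_Key = S.Shipper_Key
    JOIN Shipper_Container AS SC ON SC.Shipper_Line_Key = SL.Shipper_Line_Key
    WHERE S.Ship_Date >= @batchStart AND S.Ship_Date < @batchEnd
    SET @batchStart = @batchEnd
END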

SQL Query very slow - Suddenly

I have a SQL stored procedure that was running perfectly (0.2 secs execution or less); suddenly today it's taking more than 10 minutes.
I can see that the issue comes from a LEFT JOIN of a documents table (which stores the location of all the digital files associated with records in the DB).
This documents table has today 153,234 records.
The schema was shown as an image in the original post.
The table has 2 indexes:
Primary key (uid)
documenttype (nonclustered)
The stored procedure is:
SELECT
.....,
CASE ISNULL(cd.countdocs,0) WHEN 0 THEN 0 ELSE 1 END as hasdocs
.....
FROM
requests re
JOIN
employee e ON (e.employeeuid = re.employeeuid)
LEFT JOIN
(SELECT
COUNT(0) as countnotes, n.objectuid as objectuid
FROM
notes n
WHERE
n.isactive = 1
GROUP BY
n.objectuid) n ON n.objectuid = ma.authorizationuid
/* IF I COMMENT THIS LEFT JOIN THEN WORKS AMAZING FAST */
LEFT JOIN
(SELECT
COUNT(0) as countdocs, cd.objectuid
FROM
cloud_document cd
WHERE
cd.isactivedocument = 1
AND cd.entity = 'COMPANY'
GROUP BY
cd.objectuid) cd ON cd.objectuid = re.authorizationuid
JOIN ....
So I don't know whether I have to add another INDEX to improve this query, or whether the LEFT JOIN I have is not ideal.
If I run the execution plan I get this:
/*
Missing Index Details from SQLQuery7.sql - (local).db_prod (Test/test (55))
The Query Processor estimates that implementing the following index could improve the query cost by 60.8843%.
*/
/*
USE [db_prod]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[cloud_document] ([objectuid],[entity],[isactivedocument])
GO
*/
Any clue on how to solve this?
Thanks.
Just don't go out and add an index. Do some research first!
Can you grab a picture of the query plan and post it? It will show if the query is using the index or not.
Also, complete details of the table would be awesome, including Primary Keys, Foreign Keys, and any indexes. Just script them out to TSQL. A couple of sample records to boot and we can recreate it in a test environment and help you.
Also, take a look at Glenn Berry's DMVs.
http://sqlserverperformance.wordpress.com/tag/dmv-queries/
Good stuff like top running queries, read/write usages of indexes, etc - to name a few.
Like many things in life, it all depends on your situation!
Just need more information before we can make a judgement call.
I would actually be surprised if an index on that field helps as it likely only has two or three values (0,1, null) and indexes are not generally useful when the data has so few values.
I would suspect that either your statistics are out of date or your current indexes need to be rebuilt.
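If it is one of those, the standard maintenance statements below are a reasonable first step (a minimal sketch; the table name comes from the question, and rebuilding all indexes can be heavy on a busy table, so schedule it accordingly):
UPDATE STATISTICS dbo.cloud_document WITH FULLSCAN
ALTER INDEX ALL ON dbo.cloud_document REBUILD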

Performing SQL updates in single statements vs batches

I'm working with large databases and need advice on how to optimize my selects/updates. Here's an ex:
create table Book (
BookID int,
Description nvarchar(max)
)
-- 8 million rows
create table #BookUpdates (
BookID int,
Description nvarchar(max)
)
-- 2 million rows
Let's assume that there are 8 million Books and I have to update the description for 2 million of them.
Problem: the time to run these updates is very long. It occasionally causes blocking for the users who are also trying to run statements against the database. I've come up with a solution but want to know if there's a better one out there. I have to prepare one-off ad hoc updates like this a lot (for whatever reason).
-- normal update
update b set b.Description = bu.Description
from Book b
join #BookUpdates bu
on bu.BookID = b.BookID
-- batch update
-- (assuming @BookID and @MaxBookID are initialized from the key range of #BookUpdates)
declare @BookID int = (select min(BookID) from #BookUpdates)
declare @MaxBookID int = (select max(BookID) from #BookUpdates)
while (@BookID <= @MaxBookID)
begin
update b set b.Description = bu.Description
from Book b
join #BookUpdates bu
on bu.BookID = b.BookID
where bu.BookID >= @BookID
and bu.BookID < @BookID + 5000
set @BookID = @BookID + 5000
end
The second update works a lot faster. I like this solution because I can print status updates to myself on how much is left, and it doesn't cause performance issues for our customers.
Question: am I missing something important here? Indexes on the temp tables?
I updated the EXAMPLE tables so I don't get more normalization comments. Only 1 description per book :)
You can prevent blocking on the query side by using NOLOCK or READUNCOMMITTED hints on the SQL queries.
The real issue with performance is probably the accumulation of changes in the log. Your method of batching the changes in groups of 5,000 is quite reasonable. Because you are setting up the updates in a batch table, you might as well calculate the batch number in the table and then do the looping based on that.
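For example, a rough sketch of that idea (assuming a BatchNo column can be added to #BookUpdates, and keeping the 5,000-row batch size):
ALTER TABLE #BookUpdates ADD BatchNo int
GO
-- number the rows and assign each block of 5,000 a batch number
;WITH numbered AS (
    SELECT BatchNo,
           (ROW_NUMBER() OVER (ORDER BY BookID) - 1) / 5000 + 1 AS NewBatchNo
    FROM #BookUpdates
)
UPDATE numbered SET BatchNo = NewBatchNo
GO
DECLARE @Batch int = 1
DECLARE @MaxBatch int = (SELECT MAX(BatchNo) FROM #BookUpdates)
WHILE @Batch <= @MaxBatch
BEGIN
    UPDATE b SET b.Description = bu.Description
    FROM Book b
    JOIN #BookUpdates bu ON bu.BookID = b.BookID
    WHERE bu.BatchNo = @Batch
    SET @Batch = @Batch + 1
END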
I would try your own suggestion first and index the temp table before you run the update:
CREATE INDEX IDX_BookID ON #BookUpdates(BookID)
Try it with the index and without the index and see what the impact on the runtime is. If you want to avoid impacting your users for this test, run it outside working hours (if you can) or copy Book to another temp table first and test against that.
Regardless, given the volume, I expect you will still cause blocking for other processes. If you are unable to schedule your updates at a time when no other processes are running against this table (which would be the ideal solution), your existing batch update appears to be a perfectly valid solution. Indexing the temp table will likely help with that too so you may be able to increase the batch size without causing blocking.

DELETE SQL with correlated subquery for table with 42 million rows?

I have a table cats with 42,795,120 rows.
Apparently this is a lot of rows. So when I do:
/* owner_cats is a many-to-many join table */
DELETE FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
the query times out :(
(edit: I need to increase my CommandTimeout value, default is only 30 seconds)
I can't use TRUNCATE TABLE cats because I don't want to blow away cats from other owners.
I'm using SQL Server 2005 with "Recovery model" set to "Simple."
So, I thought about doing something like this (executing this SQL from an application btw):
DELETE TOP (25) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
DELETE TOP(50) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
DELETE FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
My question is: what is the threshold of the number of rows I can DELETE in SQL Server 2005?
Or, if my approach is not optimal, please suggest a better approach. Thanks.
This post didn't help me enough:
SQL Server Efficiently dropping a group of rows with millions and millions of rows
EDIT (8/6/2010):
Okay, I just realized after reading the above link again that I did not have indexes on these tables. Also, some of you have already pointed out that issue in the comments below. Keep in mind this is a fictitious schema, so even id_cat is not a PK, because in my real life schema, it's not a unique field.
I will put indexes on:
cats.id_cat
owner_cats.id_cat
owner_cats.id_owner
I guess I'm still getting the hang of this data warehousing, and obviously I need indexes on all the JOIN fields right?
However, it takes hours for me to do this batch load process. I'm already doing it as a SqlBulkCopy (in chunks, not 42 mil all at once). I have some indexes and PKs. I read the following posts, which confirm my theory that the indexes are slowing down even a bulk copy:
SqlBulkCopy slow as molasses
What’s the fastest way to bulk insert a lot of data in SQL Server (C# client)
So I'm going to DROP my indexes before the copy and then re-CREATE them when it's done.
Because of the long load times, it's going to take me a while to test these suggestions. I'll report back with the results.
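(For reference, the drop and re-create around the bulk copy would look roughly like this; the index names are placeholders since the real definitions aren't shown.)
-- before the SqlBulkCopy load
DROP INDEX IX_cats_id_cat ON cats
DROP INDEX IX_owner_cats_id_cat_id_owner ON owner_cats
-- ... run the bulk copy in chunks ...
-- after the load completes
CREATE INDEX IX_cats_id_cat ON cats (id_cat)
CREATE INDEX IX_owner_cats_id_cat_id_owner ON owner_cats (id_cat, id_owner)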
UPDATE (8/7/2010):
Tom suggested:
DELETE
FROM cats c
WHERE EXISTS (SELECT 1
FROM owner_cats o
WHERE o.id_cat = c.id_cat
AND o.id_owner = 1)
And still with no indexes, for 42 million rows it took 13:21 min:sec versus 22:08 with the way described above. However, for 13 million rows it took 2:13 versus 2:10 my old way. It's a neat idea, but I still need to use indexes!
Update (8/8/2010):
Something is terribly wrong! Now with the indexes on, my first delete query above took 1:09 hrs:min (yes, an hour!) versus 22:08 min:sec, and 13:21 min:sec versus 2:10 min:sec, for 42 mil rows and 13 mil rows respectively. I'm going to try Tom's query with the indexes now, but this is heading in the wrong direction. Please help.
Update (8/9/2010):
Tom's delete took 1:06 hrs:min for 42 mil rows and 10:50 min:sec for 13 mil rows with indexes, versus 13:21 min:sec and 2:13 min:sec respectively. Deletes are taking an order of magnitude longer on my database when I use indexes! I think I know why: my database .mdf and .ldf grew from 3.5 GB to 40.6 GB during the first (42 mil) delete! What am I doing wrong?
Update (8/10/2010):
For lack of any other options, I have come up with what I feel is a lackluster solution (hopefully temporary):
Increase timeout for database connection to 1 hour (CommandTimeout=60000; default was 30 sec)
Use Tom's query: DELETE FROM WHERE EXISTS (SELECT 1 ...) because it performed a little faster
DROP all indexes and PKs before running delete statement (???)
Run DELETE statement
CREATE all indexes and PKs
Seems crazy, but at least it's faster than using TRUNCATE and starting over my load from the beginning with the first owner_id, because one of my owner_id takes 2:30 hrs:min to load versus 17:22 min:sec for the delete process I just described with 42 mil rows. (Note: if my load process throws an exception, I start over for that owner_id, but I don't want to blow away previous owner_id, so I don't want to TRUNCATE the owner_cats table, which is why I'm trying to use DELETE.)
Any more help would still be appreciated :)
There is no practical threshold. It depends on what your command timeout is set to on your connection.
Keep in mind that the time it takes to delete all of these rows is contingent upon:
The time it takes to find the rows of interest
The time it takes to log the transaction in the transaction log
The time it takes to delete the index entries of interest
The time it takes to delete the actual rows of interest
The time it takes to wait for other processes to stop using the table so you can acquire what in this case will most likely be an exclusive table lock
The last point may often be the most significant. Run sp_who2 in another query window to make sure there isn't lock contention going on that is preventing your command from executing.
Improperly configured SQL Servers will do poorly at this type of query. Transaction logs that are too small and/or share the same disks as the data files will often incur severe performance penalties when working with large numbers of rows.
As for a solution, well, like all things, it depends. Is this something you intend to be doing often? Depending on how many rows you have left, the fastest way might be to rebuild the table as another name and then rename it and recreate its constraints, all inside a transaction. If this is just an ad-hoc thing, make sure your ADO CommandTimeout is set high enough and you can just bear the cost of this big delete.
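That rebuild-and-rename approach might look roughly like the sketch below (assuming the keep/delete rule from this question; any real constraints and indexes would need to be re-created to match the original table):
BEGIN TRANSACTION
SELECT * INTO cats_rebuilt FROM cats
WHERE NOT EXISTS (SELECT 1 FROM owner_cats o
                  WHERE o.id_cat = cats.id_cat AND o.id_owner = 1)
DROP TABLE cats
EXEC sp_rename 'cats_rebuilt', 'cats'
-- re-create constraints and indexes on the new cats table here
COMMIT TRANSACTION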
If the delete will remove "a significant number" of rows from the table, this can be an alternative to a DELETE: put the records to keep somewhere else, truncate the original table, put back the 'keepers'. Something like:
SELECT *
INTO #cats_to_keep
FROM cats
WHERE cats.id_cat NOT IN ( -- note the NOT
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
TRUNCATE TABLE cats
INSERT INTO cats
SELECT * FROM #cats_to_keep
Have you tried using a join instead of a subquery?
DELETE cats
FROM
cats c
INNER JOIN owner_cats oc
on c.id_cat = oc.id_cat
WHERE
id_owner =1
And if you have, have you also tried different join hints, e.g.
DELETE cats
FROM
cats c
INNER HASH JOIN owner_cats oc
on c.id_cat = oc.id_cat
WHERE
id_owner =1
If you use an EXISTS rather than an IN, you should get much better performance. Try this:
DELETE
FROM cats c
WHERE EXISTS (SELECT 1
FROM owner_cats o
WHERE o.id_cat = c.id_cat
AND o.id_owner = 1)
There's no threshold as such - you can DELETE all the rows from any table given enough transaction log space - which is where your query is most likely falling over. If you're getting some results from your DELETE TOP (n) PERCENT FROM cats WHERE ... then you can wrap it in a loop as below:
SELECT 1
WHILE @@ROWCOUNT <> 0
BEGIN
DELETE TOP (somevalue) PERCENT FROM cats
WHERE cats.id_cat IN (
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)
END
As others have mentioned, when you delete 42 million rows, the db has to log 42 million deletions against the database. Thus, the transaction log has to grow substantially. What you might try is to break up the delete into chunks. In the following query, I use the NTILE ranking function to break up the rows into 100 buckets. If that is too slow, you can expand the number of buckets so that each delete is smaller. It will help tremendously if there is an index on owner_cats.id_owner, owner_cats.id_cat and cats.id_cat (which I assumed is the primary key and numeric).
Declare @Cats Cursor
Declare @CatId int --assuming an integer PK here
Declare @Start int
Declare @End int
Declare @GroupCount int
Set @GroupCount = 100
Set @Cats = Cursor Fast_Forward For
With CatHerd As
(
Select cats.id_cat
, NTile(@GroupCount) Over ( Order By cats.id_cat ) As Grp
From cats
Join owner_cats
On owner_cats.id_cat = cats.id_cat
Where owner_cats.id_owner = 1
)
Select Grp, Min(id_cat) As MinCat, Max(id_cat) As MaxCat
From CatHerd
Group By Grp
Open @Cats
Fetch Next From @Cats Into @CatId, @Start, @End
While @@Fetch_Status = 0
Begin
Delete cats
Where id_cat Between @Start And @End
Fetch Next From @Cats Into @CatId, @Start, @End
End
Close @Cats
Deallocate @Cats
The notable catch with the above approach is that it is not transactional. Thus, if it fails on the 40th chunk, you will have deleted 40% of the rows and the other 60% will still exist.
Might be worth trying MERGE e.g.
MERGE INTO cats
USING owner_cats
ON cats.id_cat = owner_cats.id_cat
AND owner_cats.id_owner = 1
WHEN MATCHED THEN DELETE;
<Edit> (9/28/2011)
My answer performs basically the same way as Thomas' solution (Aug 6 '10). I missed it when I posted my answer because he uses an actual CURSOR, so I thought to myself "bad" because of the # of records involved. However, when I reread his answer just now, I realized that the WAY he uses the cursor is actually "good". Very clever. I just voted up his answer and will probably use his approach in the future. If you don't understand why, take a look at it again. If you still can't see it, post a comment on this answer and I will come back and try to explain in detail. I decided to leave my answer because someone may have a DBA who refuses to let them use an actual CURSOR regardless of how "good" it is. :-)
</Edit>
I realize that this question is a year old but I recently had a similar situation. I was trying to do "bulk" updates to a large table with a join to a different table, also fairly large. The problem was that the join was resulting in so many "joined records" that it took too long to process and could have led to contention problems. Since this was a one-time update I came up with the following "hack." I created a WHILE LOOP that went through the table to be updated and picked 50,000 records to update at a time. It looked something like this:
DECLARE @RecId bigint
DECLARE @NumRecs bigint
SET @NumRecs = (SELECT MAX(Id) FROM [TableToUpdate])
SET @RecId = 1
WHILE @RecId <= @NumRecs
BEGIN
UPDATE [TableToUpdate]
SET UpdatedOn = GETDATE(),
SomeColumn = t2.[ColumnInTable2]
FROM [TableToUpdate] t
INNER JOIN [Table2] t2 ON t2.Name = t.DBAName
AND ISNULL(t.PhoneNumber,'') = t2.PhoneNumber
AND ISNULL(t.FaxNumber, '') = t2.FaxNumber
LEFT JOIN [Address] d ON d.AddressId = t.DbaAddressId
AND ISNULL(d.Address1,'') = t2.DBAAddress1
AND ISNULL(d.[State],'') = t2.DBAState
AND ISNULL(d.PostalCode,'') = t2.DBAPostalCode
WHERE t.Id BETWEEN @RecId AND (@RecId + 49999)
SET @RecId = @RecId + 50000
END
Nothing fancy but it got the job done. Because it was only processing 50,000 records at a time, any locks that got created were short lived. Also, the optimizer realized that it did not have to do the entire table so it did a better job of picking an execution plan.
<Edit> (9/28/2011)
There is a HUGE caveat to the suggestion that has been mentioned here more than once and is posted all over the place around the web regarding copying the "good" records to a different table, doing a TRUNCATE (or DROP and reCREATE, or DROP and rename) and then repopulating the table.
You cannot do this if the table is the PK table in a PK-FK relationship (or other CONSTRAINT). Granted, you could DROP the relationship, do the clean up, and re-establish the relationship, but you would have to clean up the FK table, too. You can do that BEFORE re-establishing the relationship, which means more "down-time", or you can choose to not ENFORCE the CONSTRAINT on creation and clean up afterwards. I guess you could also clean up the FK table BEFORE you clean up the PK table. Bottom line is that you have to explicitly clean up the FK table, one way or the other.
My answer is a hybrid SET-based/quasi-CURSOR process. Another benefit of this method is that if the PK-FK relationship is set up to CASCADE DELETES, you don't have to do the clean up I mention above because the server will take care of it for you. If your company/DBA discourages cascading deletes, you can ask that it be enabled only while this process is running and then disabled when it is finished. Depending on the permission levels of the account that runs the clean up, the ALTER statements to enable/disable cascading deletes can be tacked onto the beginning and the end of the SQL statement, roughly as illustrated below.
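(A rough illustration of temporarily enabling a cascading delete; the constraint and column names are hypothetical since the real FK definitions aren't part of the question.)
-- drop the existing FK and re-add it with ON DELETE CASCADE (hypothetical names)
ALTER TABLE owner_cats DROP CONSTRAINT FK_owner_cats_cats
ALTER TABLE owner_cats ADD CONSTRAINT FK_owner_cats_cats
    FOREIGN KEY (id_cat) REFERENCES cats (id_cat) ON DELETE CASCADE
-- ... delete from the PK table (cats); the matching owner_cats rows go with it ...
-- then put the constraint back without the cascade
ALTER TABLE owner_cats DROP CONSTRAINT FK_owner_cats_cats
ALTER TABLE owner_cats ADD CONSTRAINT FK_owner_cats_cats
    FOREIGN KEY (id_cat) REFERENCES cats (id_cat)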
</Edit>
Bill Karwin's answer to another question applies to my situation also:
"If your DELETE is intended to eliminate a great majority of the rows in that table, one thing that people often do is copy just the rows you want to keep to a duplicate table, and then use DROP TABLE or TRUNCATE to wipe out the original table much more quickly."
Matt in this answer says it this way:
"If offline and deleting a large %, may make sense to just build a new table with data to keep, drop the old table, and rename."
ammoQ in this answer (from the same question) recommends (paraphrased):
issue a table lock when deleting a large amount of rows
put indexes on any foreign key columns

Unexpected #temp table performance

Bounty open:
Ok people, the boss needs an answer and I need a pay rise. It doesn't seem to be a cold caching issue.
UPDATE:
I've followed the advice below to no avail. However, the client statistics threw up an interesting set of numbers.
@temp vs #temp
Number of INSERT, DELETE and UPDATE statements: 0 vs 1
Rows affected by INSERT, DELETE, or UPDATE statements: 0 vs 7647
Number of SELECT statements: 0 vs 0
Rows returned by SELECT statements: 0 vs 0
Number of transactions: 0 vs 1
The most interesting are the number of rows affected and the number of transactions. To remind you, the queries below return identical result sets, just into different styles of tables.
The following queries are basically doing the same thing. They both select a set of results (about 7000 rows) and populate them into either a temp table or a table variable. In my mind the table variable @temp should be created and populated quicker than the temp table #temp; however, the table variable in the first example takes 1 min 15 sec to execute, whereas the temp table in the second example takes 16 seconds.
Can anyone offer an explanation?
declare @temp table (
id uniqueidentifier,
brand nvarchar(255),
field nvarchar(255),
date datetime,
lang nvarchar(5),
dtype varchar(50)
)
insert into @temp (id, brand, field, date, lang, dtype)
select id, brand, field, date, lang, dtype
from view
where brand = 'myBrand'
-- takes 1:15
vs
select id, brand, field, date, lang, dtype
into #temp
from view
where brand = 'myBrand'
DROP TABLE #temp
-- takes 16 seconds
I believe this almost completely comes down to table variable vs. temp table performance.
Table variables are optimized for having exactly one row. When the query optimizer chooses an execution plan, it does so on the (often false) assumption that the table variable only has a single row.
I can't find a good source for this, but it is at least mentioned here:
http://technet.microsoft.com/en-us/magazine/2007.11.sqlquery.aspx
Other related sources:
http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=125052
http://databases.aspfaq.com/database/should-i-use-a-temp-table-or-a-table-variable.html
Run both with SET STATISTICS IO ON and SET STATISTICS TIME ON. Run 6-7 times each, discard the best and worst results for both cases, then compare the two average times.
I suspect the difference is primarily from a cold cache (first execution) vs. a warm cache (second execution). The output from STATISTICS IO would give away such a case, as a big difference in the physical reads between the runs.
And make sure you have 'lab' conditions for the test: no other tasks running (no lock contention), databases (including tempdb) and logs are pre-grown to required size so you don't hit any log growth or database growth event.
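A minimal harness for that comparison might look like this (a sketch; run it several times, then repeat with the SELECT ... INTO #temp version, and compare the output in the Messages tab):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

declare @temp table (
    id uniqueidentifier, brand nvarchar(255), field nvarchar(255),
    date datetime, lang nvarchar(5), dtype varchar(50)
);
insert into @temp (id, brand, field, date, lang, dtype)
select id, brand, field, date, lang, dtype
from view
where brand = 'myBrand';

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;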
This is not uncommon. Table variables can be (and in a lot of cases ARE) slower than temp tables. Here are some of the reasons for this:
SQL Server maintains statistics for queries that use temporary tables but not for queries that use table variables. Without statistics, SQL Server might choose a poor processing plan for a query that contains a table variable.
Non-clustered indexes cannot be created on table variables, other than the system indexes that are created for a PRIMARY or UNIQUE constraint. That can influence the query performance when compared to a temporary table with non-clustered indexes.
Table variables use internal metadata in a way that prevents the engine from using a table variable within a parallel query (this means it won't take advantage of multi-processor machines).
A table variable is optimized for one row, by SQL Server (it assumes 1 row will be returned).
I'm not 100% sure that this is the cause, but the table variable will not have any statistics whereas the temp table will.
SELECT INTO is a minimally logged operation, which would likely explain most of the performance difference. INSERT creates a log entry for every row it inserts.
Additionally, SELECT INTO is creating the table as part of the operation, so SQL Server knows automatically that there are no constraints on it, which may factor in.
If it takes over a full minute to insert 7000 records into a temp table (persistent or variable), then the perf issue is almost certainly in the SELECT statement that's populating it.
Have you run DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS before profiling? I'm thinking that maybe it's using some cached results for the second query.
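(Roughly, in a dev/test environment only, before each timed run:)
DBCC FREEPROCCACHE      -- clears cached query plans
DBCC DROPCLEANBUFFERS   -- clears clean pages from the buffer pool, forcing a cold-cache read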