Optimize delete SQL query with unordered table - sql

I am attempting a mass delete of old data from a huge table with 80,000,000 rows, about 50,000,000 rows will be removed. This will be done in batches of 50k to avoid database log overflow. Also the rows of the table are not sorted chronologically. I've come up with the following script:
BEGIN
    DECLARE @START_TIME DATETIME,
            @END_TIME DATETIME,
            @DELETE_COUNT NUMERIC(10,0),
            @TOTAL_COUNT NUMERIC(10,0),
            @TO_DATE DATETIME,
            @FROM_DATE DATETIME,
            @TABLE_SIZE INT

    SELECT @START_TIME = GETDATE()
    PRINT 'Delete script Execution START TIME = %1!', @START_TIME

    SELECT @TABLE_SIZE = COUNT(*) FROM HUGE_TABLE
    PRINT 'Number of rows in HUGE_TABLE = %1!', @TABLE_SIZE

    SELECT @DELETE_COUNT = 1,
           @TOTAL_COUNT = 0,
           @TO_DATE = DATEADD(yy, -2, GETDATE())

    CREATE TABLE #TMP_BATCH_FOR_DEL (REQUEST_DT DATETIME)

    WHILE (@DELETE_COUNT > 0)
    BEGIN
        DELETE FROM #TMP_BATCH_FOR_DEL

        INSERT INTO #TMP_BATCH_FOR_DEL (REQUEST_DT)
        SELECT TOP 50000 REQUEST_DT
        FROM HUGE_TABLE
        WHERE REQUEST_DT < @TO_DATE
        ORDER BY REQUEST_DT DESC

        SELECT @FROM_DATE = MIN(REQUEST_DT), @TO_DATE = MAX(REQUEST_DT)
        FROM #TMP_BATCH_FOR_DEL
        PRINT 'Deleting data from %1! to %2!', @FROM_DATE, @TO_DATE

        DELETE FROM HUGE_TABLE
        WHERE REQUEST_DT BETWEEN @FROM_DATE AND @TO_DATE

        SELECT @DELETE_COUNT = @@ROWCOUNT
        SELECT @TOTAL_COUNT = @TOTAL_COUNT + @DELETE_COUNT
        SELECT @TO_DATE = @FROM_DATE

        COMMIT
        CHECKPOINT
    END

    SELECT @END_TIME = GETDATE()
    PRINT 'Delete script Execution END TIME = %1!', @END_TIME
    PRINT 'Total Rows deleted = %1!', @TOTAL_COUNT

    DROP TABLE #TMP_BATCH_FOR_DEL
END
GO
I did a practice run and found the above was deleting around 2,250,000 rows per hour. So, it would take 24+ hours of continuous runtime to delete my data.
I know it's that darn ORDER BY clause within the loop that's slowing things down, but storing the ordered table in another temp table would take up too much memory. But, I can't think of a better way to do this.
Thoughts?

It is probably not the query itself. Your code is deleting about 600+ records per second. A lot is going on in that time -- logging, locking, and so on.
A faster approach is to load the data you want into a new table, truncate the old table, and reload it:
select *
into temp_huge_table
from huge_table
where request_dt > ?; -- whatever the cutoff is
Then -- after validating the results -- truncate the huge table and reload the data:
truncate table huge_table;
insert into huge_table
select *
from temp_huge_table;
If there is an identity column, you will need to enable IDENTITY_INSERT on the table so you can preserve the existing values when you reload. You might have to take other precautions if there are triggers that set values in the table, or if there are foreign key references to rows in the table.
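A minimal sketch of the reload in that case (the column names here are placeholders; substitute the real list, since SET IDENTITY_INSERT requires the column list to be explicit, and id is assumed to be the identity column):
-- Sketch only: id, request_dt and other_col stand in for the real columns.
SET IDENTITY_INSERT huge_table ON;
INSERT INTO huge_table (id, request_dt, other_col)
SELECT id, request_dt, other_col
FROM temp_huge_table;
SET IDENTITY_INSERT huge_table OFF;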
That said, I would not suggest simply reloading and stopping there. After you have truncated the table, you should probably partition the table by date -- by day, week, month, whatever.
Then, in the future, you can simply drop partitions rather than deleting rows. Dropping partitions is much, much faster.
Note that loading a few tens of millions of rows into an empty table is much, much faster than deleting them, but it still takes time (you can test how much time on your system). This requires downtime for the table. However, you hopefully have a maintenance period where this is possible.
And, the downtime can be justified by partitioning the table so you won't have this issue in the future.
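For reference, a minimal partitioning sketch, assuming SQL Server 2005 or later, a monthly grain on REQUEST_DT, placeholder boundary dates and names, and everything on the PRIMARY filegroup:
-- Sketch only: boundary values and object names are placeholders.
CREATE PARTITION FUNCTION pf_request_dt (DATETIME)
AS RANGE RIGHT FOR VALUES ('20120101', '20120201', '20120301');
CREATE PARTITION SCHEME ps_request_dt
AS PARTITION pf_request_dt ALL TO ([PRIMARY]);
-- Rebuild the table (or its clustered index) on ps_request_dt(REQUEST_DT). After that,
-- an old month can be switched out to an identically structured staging table and truncated
-- instead of being deleted row by row:
-- ALTER TABLE huge_table SWITCH PARTITION 2 TO huge_table_archive;
-- TRUNCATE TABLE huge_table_archive;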

Maybe you can optimize your query by inserting the 30,000,000 records you want to keep into another table, which will become your new "huge table", and then dropping the whole old "huge table" altogether.
Best Regards
LK

Related

Query Session no longer respond

I'm trying to execute the following T-SQL Statement:
SET NOCOUNT ON
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

BEGIN TRANSACTION

DECLARE @nRows INT = 1
DECLARE @DataCancellazione DATE = DATEADD(DAY, -7, GETDATE())

CREATE TABLE #IDToDel (ID BIGINT)

WHILE @nRows > 0
BEGIN
    INSERT INTO #IDToDel
    SELECT TOP 5000 LogID
    FROM MioDB.Test
    WHERE CAST(ReceivedDate AS date) < @DataCancellazione

    SELECT @nRows = @@ROWCOUNT

    DELETE RM WITH (PAGLOCK)
    FROM MioDB.Test RM WITH (PAGLOCK)
    INNER JOIN #IDToDel TBD ON RM.LogID = TBD.ID

    TRUNCATE TABLE #IDToDel
END

ROLLBACK
When I launch the execution, the query window seems to stop responding, without any particular increase in CPU time or disk I/O for the process. Can anyone help me? Thanks.
Honestly, I think you're overcomplicating the problem. SQL Server can easily handle processing millions of rows in one go, and I suspect that you could likely do this in a few 1M row batches. If you have at least 4,000,000 rows to delete, then at 5,000 rows per batch that will take 800 iterations.
There is also no need for the temporary table; a DELETE can make use of a TOP, so you can just delete that many rows each cycle. I define the batch size with a variable and set it to 1,000,000 rows, which would mean everything is deleted in 4 iterations, not 800. You may want to reduce the size a little, but I would suggest that 500,000, for instance, is easy pickings.
This gives you the following more succinct batch:
SET NOCOUNT ON;
--The following transaction level seems like a terrible idea when you're performing DDL statements. Don't, just don't.
--SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

DECLARE @BatchSize int = 1000000,
        @DataCancellazione date = DATEADD(DAY, -7, GETDATE());

SELECT 1; --Dataset with 1 row

WHILE @@ROWCOUNT > 0
    DELETE TOP (@BatchSize)
    FROM MioDB.Test --A schema called "MioDB" is a little confusing
    WHERE ReceivedDate < @DataCancellazione; --Casting ReceivedDate would have made no difference to the query
                                             --and could well have slowed it down.

MS-SQL Server selecting rows, locking rows. Unique returns

I'm selecting the available login infos from a DB randomly via the stored procedure below. But when multiple threads want to get the available login infos, duplicate records are returned although I'm updating the timestamp field of the record.
How can I lock the rows here so that the record returned once won't be returned again?
Putting
WITH (HOLDLOCK, ROWLOCK)
didn't help!
SELECT TOP 1 @uid = [LoginInfoUid]
FROM [ZPer].[dbo].[LoginInfos]
WITH (HOLDLOCK, ROWLOCK)
WHERE ([Type] = @type)
...
...
...
ALTER PROCEDURE [dbo].[SelectRandomLoginInfo]
    -- Add the parameters for the stored procedure here
    @type int = 0,
    @expireTimeout int = 86400 -- 24 * 60 * 60 = 24h
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    -- Insert statements for procedure here
    DECLARE @processTimeout int = 10 * 60
    DECLARE @uid uniqueidentifier

    BEGIN TRANSACTION

    -- SELECT [LoginInfos] which are currently not being processed ([Timestamp] is timed out) and which are not expired.
    SELECT TOP 1 @uid = [LoginInfoUid]
    FROM [MyDb].[dbo].[LoginInfos]
    WITH (HOLDLOCK, ROWLOCK)
    WHERE ([Type] = @type) AND ([Uid] IS NOT NULL) AND ([Key] IS NOT NULL) AND
        (
            ([Timestamp] IS NULL OR DATEDIFF(second, [Timestamp], GETDATE()) > @processTimeout) OR
            (
                DATEDIFF(second, [UpdateDate], GETDATE()) <= @expireTimeout OR
                ([UpdateDate] IS NULL AND DATEDIFF(second, [CreateDate], GETDATE()) <= @expireTimeout)
            )
        )
    ORDER BY NEWID()

    -- UPDATE the selected record so that it won't be re-selected.
    UPDATE [MyDb].[dbo].[LoginInfos] SET
        [UpdateDate] = GETDATE(), [Timestamp] = GETDATE()
    WHERE [LoginInfoUid] = @uid

    -- Return the full record data.
    SELECT *
    FROM [MyDb].[dbo].[LoginInfos]
    WHERE [LoginInfoUid] = @uid

    COMMIT TRANSACTION
END
Locking a row in shared mode doesn't help a bit in preventing multiple threads from reading the same row. You want to lock the row exclusively, with the XLOCK hint. Also, you are using a very low-precision marker to determine candidate rows (GETDATE has about 3 ms precision), so you will get a lot of false positives. You must use a precise field, like a bit (processing 0 or 1).
Ultimately you are treating the LoginInfos table as a queue, so I suggest you read Using tables as Queues. The way to achieve what you want is to use UPDATE ... WITH OUTPUT. But you have an additional requirement to select a random login, which throws everything haywire. Are you really, really, 100% convinced that you need randomness? It is an extremely unusual requirement and you will have a heck of a hard time coming up with a solution that is correct and performant. You'll get duplicates and you're going to deadlock till the day after.
A first attempt would go something like:
with cte as (
select top 1 ...
from [LoginInfos] with (readpast)
where processing = 0 and ...
order by newid())
update cte
set processing = 1
output cte...
But because the NEWID ordering requires a full table scan and a sort to pick the one lucky winner row, it will 1) be extremely unperformant and 2) deadlock constantly.
Now you may take this as a random forum rant, but it so happens I've been working with SQL Server backed queues for some years now and I know what you want will not work. You must modify your requirement, specifically the randomness, and then you can go back to the article linked above and use one of the tried and tested schemes.
Edit
If you don't need randomness then it is somewhat simpler. The gist of the tables-as-queues issue is that you must seek your output row; you absolutely cannot scan for it. Scanning over a queue is not only unperformant, it is a guaranteed deadlock because of the way queues are used (highly concurrent dequeue operations where all threads want the same row).
To achieve this, your WHERE clause must be SARG-able, which depends on 1) the expressions in the WHERE clause and 2) the clustered index key. Your expression cannot contain OR conditions, so lose all the IS NULL OR ..., make the fields non-nullable and always populate them. Second, you must compare in an index-friendly manner: not DATEDIFF(..., field, ...) < @variable, but instead always field < DATEADD(..., @variable, ...), because the second form is SARG-able. And you must settle for one of the two fields, [Timestamp] or [UpdateDate]; you cannot seek on both.
All of this, of course, calls for a much more strict and tight state machine in your application, but that is a good thing; the lax conditions and OR clauses are only an indication of poor data input.
select @now = getdate();
select @expired = dateadd(second, -@processTimeout, @now);

with cte as (
    select top (1) *
    from [MyDb].[dbo].[LoginInfos] with (readpast, xlock)
    where ([Type] = @type) and ([Timestamp] < @expired))
update cte
set [Timestamp] = @now
output inserted.*;
For this to work, the clustered index of the table must be on ([Type], [Timestamp]) (which implies making the primary key LoginInfoId a non-clustered index).
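A sketch of that index change follows; the constraint and index names here are hypothetical, so adjust them to whatever the table actually uses:
-- Sketch only: assumes the existing primary key is a clustered constraint named PK_LoginInfos.
ALTER TABLE [MyDb].[dbo].[LoginInfos] DROP CONSTRAINT PK_LoginInfos;
ALTER TABLE [MyDb].[dbo].[LoginInfos]
    ADD CONSTRAINT PK_LoginInfos PRIMARY KEY NONCLUSTERED ([LoginInfoUid]);
CREATE CLUSTERED INDEX IX_LoginInfos_Type_Timestamp
    ON [MyDb].[dbo].[LoginInfos] ([Type], [Timestamp]);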

sql query takes much long time compared to next run

I'm running a procedure which takes around 1 minute for its first execution, but on subsequent runs it drops to around 9-10 seconds. And after some time it again takes around 1 minute.
My procedure works with a single table that has 6 nonclustered indexes and 1 clustered index; the unique id column is of the uniqueidentifier data type, and the table has 1,218,833 rows.
Can you guide me where is the problem/possible performance improvement is?
Thanks in advance.
Here is the procedure.
PROCEDURE [dbo].[Proc] (
    @HLevel NVARCHAR(100),
    @HLevelValue INT,
    @Date DATE,
    @Numbers NVARCHAR(MAX) = NULL
)
AS
declare @LoopCount INT, @DateLastYear DATE
DECLARE @Table1 TABLE ( list of columns )
DECLARE @Table2 TABLE ( list of columns )

-- LOOP FOR 12 MONTH DATA
SET @LoopCount = 12

WHILE (@LoopCount > 0)
BEGIN
    SET @LoopCount = @LoopCount - 1

    -- LAST YEAR DATA
    DECLARE @LastDate DATE;
    SET @LastDate = DATEADD(D, -1, DATEADD(yy, -1, DATEADD(D, 1, @Date)))

    INSERT INTO @Table1
    SELECT list of columns
    FROM Table3
    WHERE Date = @Date
      AND CASE
              WHEN @HLevel = 'crieteria1' THEN col1
              WHEN @HLevel = 'crieteria2' THEN col2
              WHEN @HLevel = 'crieteria3' THEN col3
          END = @HLevelValue

    INSERT INTO @Table2
    SELECT list of columns
    FROM table4
    WHERE Date = @LastDate
      AND ( @Numbers IS NULL OR columnNumber IN ( SELECT * FROM dbo.ConvertNumbersToTable(@Numbers) ) )

    INSERT INTO @Table1
    SELECT list of columns
    FROM @Table2 Prf2
    WHERE Prf2.col1 IN (SELECT col2 FROM @Table1) AND Year(Date) = Year(@Date)

    SET @Date = DATEADD(D, -1, DATEADD(m, -1, DATEADD(D, 1, @Date)));
END

SELECT list of columns FROM @Table1
The first time the query runs, the data is not in the data cache and so has to be retrieved from disk. Also, it has to prepare an execution plan. Subsequent times you run the query, the data will be in the cache and so it will not have to go to disk to read it. It can also reuse the execution plan generated originally. This means execution time can be much quicker and why an ideal situation is to have large amounts of RAM in order to be able to cache as much data in memory as possible (it's the data cache that offers the biggest performance improvements).
If execution times subsequently increase again, it's possible that the data is being removed from the cache (and execution plans can be removed from the cache too) - it depends on how much pressure there is for RAM. If SQL Server needs to free some up, it will remove stuff from the cache. Data and execution plans that are used most often, or have the highest value, will remain cached for longer.
There are, of course, other things that could be a factor, such as the load on the server at the time, whether your query is being blocked by other processes, etc.
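If you want to confirm that caching is what you are seeing, on a test server (never production) you can flush the caches between runs and compare timings; this is only a diagnostic sketch:
-- Test server only: forces the next execution to behave like a cold, first-time run.
CHECKPOINT;              -- write dirty pages so the data cache can be emptied
DBCC DROPCLEANBUFFERS;   -- empty the data (buffer) cache
DBCC FREEPROCCACHE;      -- empty the plan cache
-- Then run the procedure with SET STATISTICS TIME ON and SET STATISTICS IO ON and compare.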
It seems that the stored procedure is being recompiled repeatedly after some time. To reduce the recompilation, please check this article:
http://blog.sqlauthority.com/2010/02/18/sql-server-plan-recompilation-and-reduce-recompilation-performance-tuning/
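One quick way to see whether the plan is being thrown out of cache (assuming SQL Server 2008 or later, where this DMV exists) is to watch cached_time across the fast and slow runs; if it keeps changing, the plan is being evicted or recompiled:
SELECT cached_time, last_execution_time, execution_count
FROM sys.dm_exec_procedure_stats
WHERE database_id = DB_ID() AND object_id = OBJECT_ID('dbo.Proc');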

Stored procs breaking overnight

We are running MS SQL 2005 and we have been experiencing a very peculiar problem the past few days.
I have two procs: one that creates an hourly report of data, and another that calls it, puts its results in a temp table, does some aggregations, and returns a summary.
They work fine...until the next morning.
The next morning, the calling proc suddenly complains about an invalid column name.
The fix is simply a recompile of the calling proc, and all works well again.
How can this happen? It's happened three nights in a row since moving these procs into production.
EDIT: It appears that it's not a recompile of the caller (summary) proc that is needed. I was just able to fix the problem by executing the callee (hourly) proc and then executing the summary proc. This makes less sense than before.
EDIT2:
The hourly proc is rather large, and I'm not posting it here in its entirety. But, at the end, it does a SELECT INTO, then conditionally returns the appropriate result(s) from the created temp table.
Select [large column list]
into #tmpResults
From #DailySales8
Where datepart(hour, RowStartTime) >= @StartHour
  and datepart(hour, RowStartTime) < @EndHour
  and datepart(hour, RowStartTime) <= @LastHour

IF @UntilHour IS NOT NULL
   AND EXISTS (SELECT * FROM #tmpResults WHERE datepart(hour, RowEndTime) = @UntilHour) BEGIN
    SELECT *
    FROM #tmpResults
    WHERE datepart(hour, RowEndTime) = @UntilHour
END ELSE IF @JustLastFullHour = 1 BEGIN
    DECLARE @MaxHour INT
    SELECT @MaxHour = max(datepart(hour, RowEndTime)) FROM #tmpResults
    IF @LastHour > 24 SELECT @LastHour = @MaxHour

    SELECT *
    FROM #tmpResults
    WHERE datepart(hour, RowEndTime) = @LastHour

    IF @@ROWCOUNT = 0 BEGIN
        SELECT *
        FROM #tmpResults
        WHERE datepart(hour, RowEndTime) = @MaxHour
    END
END ELSE BEGIN
    SELECT * FROM #tmpResults
END
Then it drops all temp tables and ends.
The caller (Summary)
It first creates a temp table #tmpTodaysSales to store the results; the column list DOES MATCH the definition of #tmpResults in the other proc. Then it ends up calling the hourly proc a couple of times:
INSERT #tmpTodaysSales
EXEC HourlyProc @LocationCode, @ReportDate, null, 1

INSERT #tmpTodaysSales
EXEC HourlyProc @LocationCode, @LastWeekReportDate, @LastHour, 0
I believe it is these calls that fail. But recompiling the proc, or executing the hourly procedure outside of this, and then calling the summary proc fixes the problem.
Two questions:
Does the schema of #DailySales8 vary at all? Does it have any direct/indirect dependence on the date of execution, or on any of the parameters supplied to HourlyProc?
Which execution of INSERT #tmpTodaysSales EXEC HourlyProc ... in the summary fails - first or second?
What do the overnight maintenance plans look like, and are there any other scheduled overnight jobs that run between 2230 and 1000 the next day? It's possible that step in the maintenance plan or another agent job is causing some kind of corruption that's breaking your SP.

de-duplicating rows in a sql server 2005 table

I have a table with ~17 million rows in it. I need to de-duplicate the rows in the table. Under normal circumstances this wouldn't be a challenge; however, this isn't a normal circumstance. Normally 'duplicate rows' is defined as two or more rows containing the exact same values for all columns. In this case 'duplicate rows' is defined as two or more rows that have the exact same values, but are also within 20 seconds of each other. I wrote a script that is still running after 19.5 hours; this isn't acceptable, but I'm not sure how else to do it. Here's the script:
begin
    create table ##dupes (ID int)

    declare curOriginals cursor for
    select ID, AssociatedEntityID, AssociatedEntityType, [Timestamp] from tblTable

    declare @ID int
    declare @AssocEntity int
    declare @AssocType int
    declare @Timestamp datetime
    declare @Count int

    open curOriginals
    fetch next from curOriginals into @ID, @AssocEntity, @AssocType, @Timestamp
    while @@FETCH_STATUS = 0
    begin
        select @Count = COUNT(*) from tblTable where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType
            and [Timestamp] >= DATEADD(ss, -20, @Timestamp)
            and [Timestamp] <= DATEADD(ss, 20, @Timestamp)
            and ID <> @ID

        if (@Count > 0)
        begin
            insert into ##dupes (ID)
            (select ID from tblHBMLog where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType
                and [Timestamp] >= DATEADD(ss, -20, @Timestamp)
                and [Timestamp] <= DATEADD(ss, 20, @Timestamp)
                and ID <> @ID)
            print @ID
        end

        delete from tblHBMLog where ID = @ID or ID in (select ID from ##dupes)

        fetch next from curOriginals into @ID, @AssocEntity, @AssocType, @Timestamp
    end
    close curOriginals
    deallocate curOriginals

    select * from ##dupes
    drop table ##dupes
end
Any help would be greatly appreciated.
A quick tweak that should gain some speed would be to replace the nasty COUNT section with some EXISTS stuff:
IF EXISTS(SELECT 1 FROM tblTable WHERE AssociatedEntityID = @AssocEntity
          AND AssociatedEntityType = @AssocType AND [Timestamp] >= DATEADD(ss, -20, @Timestamp)
          AND [Timestamp] <= DATEADD(ss, 20, @Timestamp)
          AND ID <> @ID) -- if there are any matching rows...
BEGIN
    DELETE FROM tblHBMLog
    OUTPUT deleted.ID INTO ##dupes
    WHERE AssociatedEntityID = @AssocEntity AND AssociatedEntityType = @AssocType
      AND [Timestamp] >= DATEADD(ss, -20, @Timestamp)
      AND [Timestamp] <= DATEADD(ss, 20, @Timestamp) -- I think this is supposed to be within the block, not outside it
END
I've also now replaced the double references of ##dupes with the OUTPUT clause which will mean you're not scanning a growing ##dupes every time you delete a row. As far as the deletion goes, as you're deleting the ID and its matches in one go you don't need such an elaborate deletion clause. You've already checked that there are entries that need removing, and you seem to want to remove all the entries including the original.
Once you answer Paul's question, we can take a look at completely removing the cursor.
Basically, I agree with Bob.
1st of all, you have way too many things being done in your code that get repeated 17 million times.
2nd, you could crop your set down to the absolute duplicates.
3rd, if you have enough memory (which you should), it would be nicer to try and solve this in your programming language of choice.
Anyway, for the sake of a hardcoded answer, and because your query might still be running, I will try to give a working script which I think (?) does what you want.
First of all, you should have an index.
I would recommend an index on the AssociatedEntityID field.
If you already have one, but your table has been populated with lots of data after you created the index, then drop it and recreate it, in order to have fresh statistics.
Then see the script below, which does the following:
dumps all duplicates in the ##dupes, ignoring the 20 secs rule
it sorts them out (by AssociatedEntityID, Timestamp) and starts the simplest, most straightforward loop it can do.
checks for duplicate AssociatedEntityID and the timestamp inside the 20 sec interval.
if all true, then inserts the id to the ##dupes_to_be_deleted table.
There is the assumption that if you have a set of more than two duplicates, in sequence, then the script eliminates every duplicate in the range of 20 secs from the first one. Then, from the next remaining, if any, it resets and goes for another 20 secs, and so on...
Here is the script; it may be useful to you, though I did not have time to test it:
CREATE TABLE ##dupes (ID INT, AssociatedEntityID INT, [Timestamp] DATETIME)
CREATE TABLE ##dupes_to_be_deleted (ID INT)

-- collect all dupes, ignoring for now the rule of 20 secs
INSERT INTO ##dupes
SELECT ID, AssociatedEntityID, [Timestamp]
FROM tblTable
WHERE AssociatedEntityID IN
    ( SELECT AssociatedEntityID
      FROM tblTable
      GROUP BY AssociatedEntityID
      HAVING COUNT(*) > 1 )

-- then sort and loop on all of them using a cursor
DECLARE c CURSOR FOR
SELECT ID, AssociatedEntityID, [Timestamp]
FROM ##dupes
ORDER BY AssociatedEntityID, [Timestamp]

-- declarations
DECLARE @id INT,
        @AssociatedEntityID INT,
        @ts DATETIME,
        @old_AssociatedEntityID INT,
        @old_ts DATETIME

-- initialisation
SELECT @old_AssociatedEntityID = 0,
       @old_ts = '1900-01-01'

-- start loop
OPEN c
FETCH NEXT FROM c INTO @id, @AssociatedEntityID, @ts
WHILE @@FETCH_STATUS = 0
BEGIN
    -- check for dupe AssociatedEntityID
    IF @AssociatedEntityID = @old_AssociatedEntityID
    BEGIN
        -- check for time interval
        IF @ts <= DATEADD(ss, 20, @old_ts)
        BEGIN
            -- yes! it is a duplicate; store it in ##dupes_to_be_deleted
            INSERT INTO ##dupes_to_be_deleted (id) VALUES (@id)
        END
        ELSE
        BEGIN
            -- IS THIS OK?:
            -- put last timestamp for comparison with the next timestamp
            -- only if the previous one is not going to be deleted.
            -- this way we delete all duplicates 20 secs away from the first
            -- of the set of duplicates, and the next one remaining will be
            -- a duplicate but after the 20 secs interval. and so on ...
            SET @old_ts = @ts
        END
    END

    -- prepare vars for next iteration
    SELECT @old_AssociatedEntityID = @AssociatedEntityID
    FETCH NEXT FROM c INTO @id, @AssociatedEntityID, @ts
END
CLOSE c
DEALLOCATE c

-- now ##dupes_to_be_deleted holds all the ids that are duplicates and within
-- the 20 sec interval of the first duplicate
DELETE
FROM <wherever> -- replace <wherever> with tblHBMLog?
WHERE id IN ( SELECT id FROM ##dupes_to_be_deleted )

DROP TABLE ##dupes_to_be_deleted
DROP TABLE ##dupes
You could give it a try and leave it for a couple of hours. Hope it helps.
If you have enough memory and storage, it may be faster this way:
1. Create a new table with a similar structure.
2. Copy all data into this temp table with a SELECT DISTINCT.
3. Clear the original table (you should drop some constraints before this).
4. Copy the data back to the original table.
Instead of steps 3 and 4 you can drop the original table and rename the temp table.
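A rough sketch of that approach, assuming the table is tblHBMLog, a placeholder column list, and noting that a plain DISTINCT only removes exact duplicates (the ID column and the 20-second rule still have to be handled separately):
-- Sketch only: col1 and col2 stand in for the real column list.
SELECT DISTINCT col1, col2, AssociatedEntityID, AssociatedEntityType, [Timestamp]
INTO tblHBMLog_dedup
FROM tblHBMLog;

TRUNCATE TABLE tblHBMLog;   -- drop or disable constraints first if necessary

INSERT INTO tblHBMLog (col1, col2, AssociatedEntityID, AssociatedEntityType, [Timestamp])
SELECT col1, col2, AssociatedEntityID, AssociatedEntityType, [Timestamp]
FROM tblHBMLog_dedup;

DROP TABLE tblHBMLog_dedup;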
Putting the time differentiator aside, the first thing I would do is knock this list down to a much smaller subset of potential duplicates. For example, if you have 17 million rows, but only, say, 10 million have every field matching but the time, then you have just chopped a large portion of your processing off.
To do this I'd just whip up a query to dump the unique ID's of the potential duplicates into a temp table, then use this as an inner join on your cursor (again, this would be a first step).
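A sketch of that first step, assuming both key columns have to match (per the question) and ignoring the timestamp for now:
-- Collect the key pairs that occur more than once; join this back to the cursor's source.
SELECT AssociatedEntityID, AssociatedEntityType
INTO #potential_dupes
FROM tblHBMLog
GROUP BY AssociatedEntityID, AssociatedEntityType
HAVING COUNT(*) > 1;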
In looking at the cursor, I see a lot of relatively heavy function calls which would explain your slowdowns. There's also a lot of data activity, and I would not be surprised if you were being crushed by an I/O bottleneck.
One thing you could do then is, rather than use the cursor, dump the work into your programming language of choice. Assuming we've already limited all of the fields except for the timestamp down to a manageable set, grab each subset in turn (i.e. ones that match the remaining fields), since any dups would necessarily have all of their other fields matched. Then just snuff out the duplicates you find in these smaller atomic subsets.
So assuming you have 10 million potentials, and each time range has about 20 records or so that need to be worked through with the date logic, you're down to a much smaller number of database calls and some quick code - and from experience, knocking out the datetime comparisons, etc. outside of SQL is generally a lot faster.
Bottom line is to figure out ways to, as quickly as possible, partition your data down into managable subsets.
Hope that helps!
-Bob
In answer to Paul's question:
What happens when you have three entries, a, b, c, where a = 00 secs, b = 19 secs, c = 39 secs? Are these all considered to be the same time? (a is within 20 secs of b, b is within 20 secs of c)
If the other comparisons are equal (AssociatedEntityid and AssociatedEntityType) then yes, they are considered the same thing, otherwise no.
I would add this to the original question, except that I used a different account to post the question and now can't remember my password. It was a very old account and I didn't realize that I had connected to the site with it.
I have been working with some of the answers you guys have given me, and there is one problem: you're using only one key column (AssociatedEntityID) when there are two (AssociatedEntityID and AssociatedEntityType). Your suggestions would work great for a single key column.
What I have done so far is:
Step 1: Determine which AssociatedEntityID and AssociatedEntityType pairs have duplicates and insert them into a temp table:
create table ##stage1 (ID int, AssociatedEntityID int, AssociatedEntityType int, [Timestamp] datetime)
insert into ##stage1 (AssociatedEntityID, AssociatedEntityType)
(select AssociatedEntityID, AssociatedEntityType from tblHBMLog group by AssociatedEntityID, AssociatedEntityType having COUNT(*) > 1)
Step 2: Retrieve the ID of the earliest occurring row with a given AssociatedEntityID and AssociatedEntityType pair:
declare curStage1 cursor for
select AssociatedEntityID, AssociatedEntityType from ##stage1
open curStage1
fetch next from curStage1 into @AssocEntity, @AssocType
while @@FETCH_STATUS = 0
begin
    select top 1 @ID = ID, @Timestamp = [Timestamp] from tblHBMLog where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType order by [Timestamp] asc
    update ##stage1 set ID = @ID, [Timestamp] = @Timestamp where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType
    -- advance to the next key pair
    fetch next from curStage1 into @AssocEntity, @AssocType
end
And this is where things slow down again. Now, granted, the result set has been pared down from ~17 million to just under 400,000, but it is still taking quite a long time to run through.
I guess another question that I should ask is this; If I continue to write this in SQL will it just have to take quite a long time? Should I write this in C# instead? Or am I just stupid and not seeing the forest for the trees of this solution?
Well, after much stomping of feet and gnashing of teeth, I have come up with a solution. It's just a simple, quick and dirty C# command line app, but it's faster than the sql script and it does the job.
I thank you all for your help; in the end the SQL script was just taking too much time to execute, and C# is much better suited for looping.