de-duplicating rows in a sql server 2005 table - sql-server-2005

I have a table with ~17 million rows in it. I need to de-duplicate the rows in the table. Under normal circumstances this wouldn't be a challenge; however, this isn't a normal circumstance. Normally, 'duplicate rows' is defined as two or more rows containing the exact same values for all columns. In this case, 'duplicate rows' is defined as two or more rows that have the exact same values, but are also within 20 seconds of each other. I wrote a script that is still running after 19.5 hours, which isn't acceptable, but I'm not sure how else to do it. Here's the script:
begin
    create table ##dupes (ID int)

    declare curOriginals cursor for
    select ID, AssociatedEntityID, AssociatedEntityType, [Timestamp] from tblTable

    declare @ID int
    declare @AssocEntity int
    declare @AssocType int
    declare @Timestamp datetime
    declare @Count int

    open curOriginals
    fetch next from curOriginals into @ID, @AssocEntity, @AssocType, @Timestamp
    while @@FETCH_STATUS = 0
    begin
        select @Count = COUNT(*) from tblTable where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType
            and [Timestamp] >= DATEADD(ss, -20, @Timestamp)
            and [Timestamp] <= DATEADD(ss, 20, @Timestamp)
            and ID <> @ID
        if (@Count > 0)
        begin
            insert into ##dupes (ID)
            (select ID from tblHBMLog where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType
                and [Timestamp] >= DATEADD(ss, -20, @Timestamp)
                and [Timestamp] <= DATEADD(ss, 20, @Timestamp)
                and ID <> @ID)
            print @ID
        end
        delete from tblHBMLog where ID = @ID or ID in (select ID from ##dupes)
        fetch next from curOriginals into @ID, @AssocEntity, @AssocType, @Timestamp
    end
    close curOriginals
    deallocate curOriginals
    select * from ##dupes
    drop table ##dupes
end
Any help would be greatly appreciated.

A quick tweak that should gain some speed would be to replace the nasty COUNT section with some EXISTS stuff:
IF EXISTS (SELECT 1 FROM tblTable
           WHERE AssociatedEntityID = @AssocEntity
             AND AssociatedEntityType = @AssocType
             AND [Timestamp] >= DATEADD(ss, -20, @Timestamp)
             AND [Timestamp] <= DATEADD(ss, 20, @Timestamp)
             AND ID <> @ID) -- if there are any matching rows...
BEGIN
    DELETE FROM tblHBMLog
    OUTPUT deleted.ID INTO ##dupes
    WHERE AssociatedEntityID = @AssocEntity AND AssociatedEntityType = @AssocType
      AND [Timestamp] >= DATEADD(ss, -20, @Timestamp)
      AND [Timestamp] <= DATEADD(ss, 20, @Timestamp) -- I think this is supposed to be within the block, not outside it
END
I've also now replaced the double references of ##dupes with the OUTPUT clause which will mean you're not scanning a growing ##dupes every time you delete a row. As far as the deletion goes, as you're deleting the ID and its matches in one go you don't need such an elaborate deletion clause. You've already checked that there are entries that need removing, and you seem to want to remove all the entries including the original.
Once you answer Paul's question, we can take a look at completely removing the cursor.
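In the meantime, here is a rough cursor-free sketch, assuming the intent is to keep the earliest row of each (AssociatedEntityID, AssociatedEntityType) group and remove the later rows that fall within 20 seconds of it (the table name follows the later references to tblHBMLog):

;WITH firsts AS (
    SELECT AssociatedEntityID, AssociatedEntityType, MIN([Timestamp]) AS FirstTs
    FROM tblHBMLog
    GROUP BY AssociatedEntityID, AssociatedEntityType
)
DELETE t
FROM tblHBMLog t
JOIN firsts f
    ON  f.AssociatedEntityID = t.AssociatedEntityID
    AND f.AssociatedEntityType = t.AssociatedEntityType
WHERE t.[Timestamp] > f.FirstTs
  AND t.[Timestamp] <= DATEADD(ss, 20, f.FirstTs)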

Basically, I agree with Bob.
First of all, your code does far too much work to be repeated 17 million times.
Second, you could crop your set down to just the absolute duplicates.
Third, it would be nicer if you have enough memory (which you should) to try and solve this in your programming language of choice.
Anyway, for the sake of a hardcoded answer, and because your query might still be running, I will try to give a working script which I think (?) does what you want.
First of all, you should have an index.
I would recommend an index on the AssociatedEntityID field.
If you already have one, but your table has been populated with lots of data after you created the index, then drop it and recreate it, in order to have fresh statistics.
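Something along these lines (the index name is just an example; a wider key such as (AssociatedEntityID, [Timestamp]) would also cover the sort used below):

CREATE INDEX IX_tblTable_AssociatedEntityID
    ON tblTable (AssociatedEntityID)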
Then see the script below, which does the following:
dumps all duplicates into ##dupes, ignoring the 20-second rule
sorts them (by AssociatedEntityID, Timestamp) and starts the simplest, most straightforward loop it can do
checks for a duplicate AssociatedEntityID with a timestamp inside the 20-second interval
if all of that is true, inserts the id into the ##dupes_to_be_deleted table
The assumption is that if you have a set of more than two duplicates in sequence, the script eliminates every duplicate in the range of 20 secs from the first one. Then, from the next remaining one, if any, it resets and goes for another 20 secs, and so on...
Here is the script; it may be useful to you, though I did not have the time to test it:
CREATE TABLE ##dupes
(
    ID INT,
    AssociatedEntityID INT,
    [Timestamp] DATETIME
)

CREATE TABLE ##dupes_to_be_deleted
(
    ID INT
)

-- collect all dupes, ignoring for now the rule of 20 secs
INSERT INTO ##dupes
SELECT ID,
       AssociatedEntityID,
       [Timestamp]
FROM tblTable
WHERE AssociatedEntityID IN
      ( SELECT AssociatedEntityID
        FROM tblTable
        GROUP BY AssociatedEntityID
        HAVING COUNT(*) > 1
      )

-- then sort and loop on all of them
-- using a cursor
DECLARE c CURSOR FOR
SELECT ID,
       AssociatedEntityID,
       [Timestamp]
FROM ##dupes
ORDER BY AssociatedEntityID,
         [Timestamp]

-- declarations
DECLARE @id INT,
        @AssociatedEntityID INT,
        @ts DATETIME,
        @old_AssociatedEntityID INT,
        @old_ts DATETIME

-- initialisation
SELECT @old_AssociatedEntityID = 0,
       @old_ts = '1900-01-01'

-- start loop
OPEN c
FETCH NEXT
FROM c
INTO @id,
     @AssociatedEntityID,
     @ts
WHILE @@FETCH_STATUS = 0
BEGIN
    -- check for dupe AssociatedEntityID
    IF @AssociatedEntityID = @old_AssociatedEntityID
    BEGIN
        -- check for time interval
        IF @ts <= DATEADD(ss, 20, @old_ts)
        BEGIN
            -- yes! it is a duplicate
            -- store it in ##dupes_to_be_deleted
            INSERT INTO ##dupes_to_be_deleted (id)
            VALUES (@id)
        END
        ELSE
        BEGIN
            -- IS THIS OK?:
            -- put last timestamp for comparison
            -- with the next timestamp
            -- only if the previous one is not going to be deleted.
            -- this way we delete all duplicates
            -- 20 secs away from the first of the set of duplicates
            -- and the next one remaining will be a duplicate
            -- but after the 20 secs interval.
            -- and so on ...
            SET @old_ts = @ts
        END
    END
END
-- prepare vars for next iteration
SELECT #old_AssociatedEntityID = #AssociatedEntityID
    FETCH NEXT
    FROM c
    INTO @id,
         @AssociatedEntityID,
         @ts
END
CLOSE c
DEALLOCATE c

-- now you have all the ids that are duplicates and in the 20 sec interval of the first duplicate in the ##dupes_to_be_deleted
DELETE
FROM <wherever> -- replace <wherever> with tblHBMLog?
WHERE id IN
      ( SELECT id
        FROM ##dupes_to_be_deleted
      )

DROP TABLE ##dupes_to_be_deleted
DROP TABLE ##dupes
You could give it a try and leave it running for a couple of hours. Hope it helps.

If you have enough memory and storage, it may be faster this way:
1. Create a new table with a similar structure.
2. Copy all data into this temp table with a SELECT DISTINCT.
3. Clear the original table (you should drop some constraints before this).
4. Copy the data back to the original table.
Instead of steps 3 and 4 you can drop the original table and rename the temp table.
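A rough sketch of that approach, leaving the ID column out so the SELECT DISTINCT can collapse the exact duplicates (note this ignores the 20-second rule):

SELECT DISTINCT AssociatedEntityID, AssociatedEntityType, [Timestamp]
INTO tblHBMLog_tmp
FROM tblHBMLog

TRUNCATE TABLE tblHBMLog -- drop or disable constraints first if needed

INSERT INTO tblHBMLog (AssociatedEntityID, AssociatedEntityType, [Timestamp])
SELECT AssociatedEntityID, AssociatedEntityType, [Timestamp]
FROM tblHBMLog_tmp

DROP TABLE tblHBMLog_tmp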

Putting the time differentiator aside, the first thing I would do is knock this list down to a much smaller subset of potential duplicates. For example, if you have 17 million rows, but only, say, 10 million have every field matching but the time, then you have just chopped a large portion of your processing off.
To do this I'd just whip up a query to dump the unique IDs of the potential duplicates into a temp table, then use this as an inner join on your cursor (again, this would be a first step).
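A sketch of that first step (the temp table name is illustrative, and I'm using the key columns from the question):

SELECT t.ID
INTO #PotentialDupes
FROM tblHBMLog t
JOIN (SELECT AssociatedEntityID, AssociatedEntityType
      FROM tblHBMLog
      GROUP BY AssociatedEntityID, AssociatedEntityType
      HAVING COUNT(*) > 1) d
    ON  d.AssociatedEntityID = t.AssociatedEntityID
    AND d.AssociatedEntityType = t.AssociatedEntityType

The cursor can then inner join on #PotentialDupes instead of walking all 17 million rows.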
In looking at the cursor, I see a lot of relatively heavy function calls which would explain your slowdowns. There's also a lot of data activity and I would not be surprised if you were being crushed by an I/O bottleneck.
One thing you could do then is rather than use the cursor, dump it into your programming language of choice. Assuming we've already limited all of our fields except for the timestamp down to a manageable set, grab each subset in turn (i.e. ones that match the remaining fields), since any dups would necessarily have all of their other fields matched. Then just snuff out the duplicates you find in these smaller atomic subsets.
So assuming you have 10 million potentials, and each time range has about 20 records or so that need to be worked through with the date logic, you're down to a much smaller number of database calls and some quick code - and from experience, knocking out the datetime comparisons, etc. outside of SQL is generally a lot faster.
Bottom line is to figure out ways to, as quickly as possible, partition your data down into manageable subsets.
Hope that helps!
-Bob

In answer to Paul's question:
What happens when you have three entries a, b, c, where a = 00 secs, b = 19 secs, c = 39 secs? Are these all considered to be the same time? (a is within 20 secs of b, b is within 20 secs of c)
If the other comparisons are equal (AssociatedEntityid and AssociatedEntityType) then yes, they are considered the same thing, otherwise no.
I would add to the original question, except that I used a different account to post the question and now can't remember my password. It was a very old account and didn't realize that I had connected to the site with it.
I have been working with some of the answers you guys have given me and there is one problem: you're using only one key column (AssociatedEntityID) when there are two (AssociatedEntityID and AssociatedEntityType). Your suggestions would work great for a single key column.
What I have done so far is:
Step 1: Determine which AssociatedEntityID and AssociatedEntityType pairs have duplicates and insert them into a temp table:
create table ##stage1 (ID int, AssociatedEntityID int, AssociatedEntityType int, [Timestamp] datetime)
insert into ##stage1 (AssociatedEntityID, AssociatedEntityType)
(select AssociatedEntityID, AssociatedEntityType from tblHBMLog group by AssociatedEntityID, AssociatedEntityType having COUNT(*) > 1)
Step 2: Retrieve the ID of the earliest occurring row with a given AssociatedEntityID and AssociatedEntityType pair:
declare @ID int
declare @AssocEntity int
declare @AssocType int
declare @Timestamp datetime
declare curStage1 cursor for
select AssociatedEntityID, AssociatedEntityType from ##stage1
open curStage1
fetch next from curStage1 into @AssocEntity, @AssocType
while @@FETCH_STATUS = 0
begin
    select top 1 @ID = ID, @Timestamp = [Timestamp] from tblHBMLog where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType order by [Timestamp] asc
    update ##stage1 set ID = @ID, [Timestamp] = @Timestamp where AssociatedEntityID = @AssocEntity and AssociatedEntityType = @AssocType
    fetch next from curStage1 into @AssocEntity, @AssocType
end
close curStage1
deallocate curStage1
And this is where things slow down again. Now, granted, the result set has been pared down from ~17 million to just under 400,000, but it is still taking quite a long time to run through.
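For reference, a set-based version of Step 2 might look something like this sketch (CROSS APPLY is available in SQL Server 2005; I haven't measured it against the cursor):

UPDATE s
SET s.ID = x.ID,
    s.[Timestamp] = x.[Timestamp]
FROM ##stage1 s
CROSS APPLY (SELECT TOP 1 t.ID, t.[Timestamp]
             FROM tblHBMLog t
             WHERE t.AssociatedEntityID = s.AssociatedEntityID
               AND t.AssociatedEntityType = s.AssociatedEntityType
             ORDER BY t.[Timestamp] ASC) x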
I guess another question that I should ask is this; If I continue to write this in SQL will it just have to take quite a long time? Should I write this in C# instead? Or am I just stupid and not seeing the forest for the trees of this solution?
Well, after much stomping of feet and gnashing of teeth, I have come up with a solution. It's just a simple, quick and dirty C# command line app, but it's faster than the sql script and it does the job.
I thank you all for your help, in the end the sql script was just taking too much time to execute and C# is much better suited for looping.

Related

Optimize delete SQL query with unordered table

I am attempting a mass delete of old data from a huge table with 80,000,000 rows; about 50,000,000 rows will be removed. This will be done in batches of 50k to avoid database log overflow. Also, the rows of the table are not sorted chronologically. I've come up with the following script:
BEGIN
DECLARE @START_TIME DATETIME,
        @END_TIME DATETIME,
        @DELETE_COUNT NUMERIC(10,0),
        @TOTAL_COUNT NUMERIC(10,0),
        @TO_DATE DATETIME,
        @FROM_DATE DATETIME,
        @TABLE_SIZE INT

SELECT @START_TIME = GETDATE()
PRINT 'Delete script Execution START TIME = %1!', @START_TIME

SELECT @TABLE_SIZE = COUNT(*) FROM HUGE_TABLE
PRINT 'Number of rows in HUGE_TABLE = %1!', @TABLE_SIZE

SELECT @DELETE_COUNT = 1,
       @TOTAL_COUNT = 0,
       @TO_DATE = DATEADD(yy, -2, GETDATE())

CREATE TABLE #TMP_BATCH_FOR_DEL (REQUEST_DT DATETIME)

WHILE (@DELETE_COUNT > 0)
BEGIN
    DELETE FROM #TMP_BATCH_FOR_DEL

    INSERT INTO #TMP_BATCH_FOR_DEL (REQUEST_DT)
    SELECT TOP 50000 REQUEST_DT
    FROM HUGE_TABLE
    WHERE REQUEST_DT < @TO_DATE
    ORDER BY REQUEST_DT DESC

    SELECT @FROM_DATE = MIN(REQUEST_DT), @TO_DATE = MAX(REQUEST_DT)
    FROM #TMP_BATCH_FOR_DEL

    PRINT 'Deleting data from %1! to %2!', @FROM_DATE, @TO_DATE

    DELETE FROM HUGE_TABLE
    WHERE REQUEST_DT BETWEEN @FROM_DATE AND @TO_DATE

    SELECT @DELETE_COUNT = @@ROWCOUNT
    SELECT @TOTAL_COUNT = @TOTAL_COUNT + @DELETE_COUNT
    SELECT @TO_DATE = @FROM_DATE

    COMMIT
    CHECKPOINT
END

SELECT @END_TIME = GETDATE()
PRINT 'Delete script Execution END TIME = %1!', @END_TIME
PRINT 'Total Rows deleted = %1!', @TOTAL_COUNT

DROP TABLE #TMP_BATCH_FOR_DEL
END
GO
I did a practice run and found the above was deleting around 2,250,000 rows per hour. So, it would take 24+ hours of continuous runtime to delete my data.
I know it's that darn ORDER BY clause within the loop that's slowing things down, but storing the ordered table in another temp table would take up too much memory. But, I can't think of a better way to do this.
Thoughts?
It is probably not the query itself. Your code is deleting about 600+ records per second. A lot is going on in that time -- logging, locking, and so on.
A faster approach is to load the data you want into a new table, truncate the old table, and reload it:
select *
into temp_huge_table
from huge_table
where request_dt > ?; -- whatever the cutoff is
Then -- after validating the results -- truncate the huge table and reload the data:
truncate table huge_table;
insert into huge_table
select *
from temp_huge_table;
If there is an identity column you will want to disable that to allow identity insert. You might have to take other precautions if there are triggers that set values in the table. Or if there are foreign key references to rows in the table.
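If it does have an identity column, the reload would need an explicit column list, roughly along these lines (the column names other than request_dt are placeholders):

set identity_insert huge_table on;

insert into huge_table (id, request_dt /* , remaining columns */)
select id, request_dt /* , remaining columns */
from temp_huge_table;

set identity_insert huge_table off;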
I would not suggest doing this directly, though. After you have truncated the table, you should probably partition the table by date -- by day, week, month, whatever.
Then, in the future, you can simply drop partitions rather than deleting rows. Dropping partitions is much, much faster.
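A minimal sketch of what that could look like (the boundary dates and object names are only examples):

create partition function pf_request_dt_monthly (datetime)
as range right for values ('2019-01-01', '2019-02-01', '2019-03-01');

create partition scheme ps_request_dt_monthly
as partition pf_request_dt_monthly all to ([PRIMARY]);

-- rebuild huge_table (or its clustered index) on ps_request_dt_monthly(request_dt);
-- old data can then be removed by switching a partition out to a staging table
-- and truncating it, instead of deleting row by row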
Note that loading a few tens of millions of rows into an empty table is much, much faster than deleting them, but it still takes time (you can test how much time on your system). This requires downtime for the table. However, you hopefully have a maintenance period where this is possible.
And, the downtime can be justified by partitioning the table so you won't have this issue in the future.
Maybe you can optimize your query by inserting the 30,000,000 records you want to keep into another table, which will become your new "huge table", and then drop the whole old "huge table" altogether.
Best Regards
LK

Keep the last 'x' results in a database

I would like to ask if there is a quick way to keep the last 'x' inserted rows in a database.
For example, I have an application through which users can search for items, and I want to keep the last 10 searches that each user has made. I do not want to keep all of their searches in the database, as this will increase the db size. Is there a quick way of keeping only the latest 10 searches in my db, or shall I check every time:
A) how many searches has been stored on database so far
B) if (searches = 10) delete last row
C) insert new row
I think that this approach will have an impact on performance, as it will need three different accesses to the database: check, delete and insert.
I don't think there is an easy/quick way to do this. Based on your conditions I created the stored procedure below.
I assumed a SearchContent table, which is going to store the data.
CREATE TABLE SearchContent (
Id INT IDENTITY (1, 1),
UserId VARCHAR (8),
SearchedOn DATETIME,
Keyword VARCHAR (40)
)
The stored procedure takes the UserId and Keyword and does the calculation. The procedure is:
CREATE PROCEDURE [dbo].[pub_SaveSearchDetails]
(
    @UserId VARCHAR (8),
    @Keyword VARCHAR (40)
)
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @Count AS INT = 0;

    -- A) how many searches has been stored on database so far
    SELECT @Count = COUNT(Keyword) FROM SearchContent WHERE UserId = @UserId

    IF @Count >= 10
    BEGIN
        -- B) if (searches = 10) delete last row
        DELETE FROM SearchContent
        WHERE Id IN (SELECT TOP 1 Id FROM SearchContent WHERE UserId = @UserId ORDER BY SearchedOn ASC)
    END

    -- C) insert new row
    INSERT INTO SearchContent (UserId, SearchedOn, Keyword)
    VALUES (@UserId, GETDATE(), @Keyword)
END
Sample execution: EXEC pub_SaveSearchDetails 'A001', 'angularjs'
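If rows can still accumulate for some reason (for example, if the procedure is bypassed), a periodic clean-up along these lines would also keep only the 10 most recent searches per user (a sketch against the same table):

;WITH Ranked AS (
    SELECT Id,
           ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY SearchedOn DESC) AS RowNo
    FROM SearchContent
)
DELETE FROM Ranked
WHERE RowNo > 10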

MS-SQL Server selecting rows, locking rows. Unique returns

I'm selecting the available login infos from a DB randomly via the stored procedure below. But when multiple threads want to get the available login infos, duplicate records are returned although I'm updating the timestamp field of the record.
How can I lock the rows here so that the record returned once won't be returned again?
Putting
WITH (HOLDLOCK, ROWLOCK)
didn't help!
SELECT TOP 1 @uid = [LoginInfoUid]
FROM [ZPer].[dbo].[LoginInfos]
WITH (HOLDLOCK, ROWLOCK)
WHERE ([Type] = @type)
...
...
...
ALTER PROCEDURE [dbo].[SelectRandomLoginInfo]
    -- Add the parameters for the stored procedure here
    @type int = 0,
    @expireTimeout int = 86400 -- 24 * 60 * 60 = 24h
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    -- Insert statements for procedure here
    DECLARE @processTimeout int = 10 * 60
    DECLARE @uid uniqueidentifier

    BEGIN TRANSACTION

    -- SELECT [LoginInfos] which are currently not being processed ([Timestamp] is timedout) and which are not expired.
    SELECT TOP 1 @uid = [LoginInfoUid]
    FROM [MyDb].[dbo].[LoginInfos]
    WITH (HOLDLOCK, ROWLOCK)
    WHERE ([Type] = @type) AND ([Uid] IS NOT NULL) AND ([Key] IS NOT NULL) AND
          (
            ([Timestamp] IS NULL OR DATEDIFF(second, [Timestamp], GETDATE()) > @processTimeout) OR
            (
              DATEDIFF(second, [UpdateDate], GETDATE()) <= @expireTimeout OR
              ([UpdateDate] IS NULL AND DATEDIFF(second, [CreateDate], GETDATE()) <= @expireTimeout)
            )
          )
    ORDER BY NEWID()

    -- UPDATE the selected record so that it won't be re-selected.
    UPDATE [MyDb].[dbo].[LoginInfos] SET
        [UpdateDate] = GETDATE(), [Timestamp] = GETDATE()
    WHERE [LoginInfoUid] = @uid

    -- Return the full record data.
    SELECT *
    FROM [MyDb].[dbo].[LoginInfos]
    WHERE [LoginInfoUid] = @uid

    COMMIT TRANSACTION
END
Locking a row in shared mode doesn't help a bit in preventing multiple threads from reading the same row. You want to lock the row exclusively with the XLOCK hint. Also, you are using a very low precision marker for determining candidate rows (GETDATE has 3 ms precision), so you will get a lot of false positives. You must use a precise field, like a bit (processing 0 or 1).
Ultimately you are treating LoginInfos as a queue, so I suggest you read Using tables as Queues. The way to achieve what you want is to use UPDATE ... OUTPUT. But you have an additional requirement to select a random login, which throws everything haywire. Are you really, really, 100% convinced that you need randomness? It is an extremely unusual requirement and you will have a heck of a hard time coming up with a solution that is correct and performant. You'll get duplicates and you're going to deadlock till the day after.
A first attempt would go something like:
with cte as (
select top 1 ...
from [LoginInfos] with (readpast)
where processing = 0 and ...
order by newid())
update cte
set processing = 1
output cte...
But because the NEWID order requires a full table scan and a sort to pick the 1 lucky winner row, you will 1) perform extremely poorly and 2) deadlock constantly.
Now you may take this as a random forum rant, but it so happens I've been working with SQL Server backed queues for some years now and I know that what you want will not work. You must modify your requirement, specifically the randomness, and then you can go back to the article linked above and use one of the tried and tested schemes.
Edit
If you don't need randomness then it is somewhat simpler. The gist of the tables-as-queues issue is that you must seek your output row; you absolutely cannot scan for it. Scanning over a queue is not only slow, it is a guaranteed deadlock because of the way queues are used (highly concurrent dequeue operations where all threads want the same row). To achieve this your WHERE clause must be SARG-able, which depends on 1) the expressions in the WHERE clause and 2) the clustered index key. Your expression cannot contain OR conditions, so lose all the IS NULL OR ..., make the fields non-nullable and always populate them. Second, you must compare in an index-friendly manner: not DATEDIFF(..., field, ...) < @variable but instead always field < DATEADD(..., @variable, ...), because the second form is SARG-able. And you must settle for one of the two fields, [Timestamp] or [UpdateDate]; you cannot seek on both. All of this, of course, calls for a much more strict and tight state machine in your application, but that is a good thing; the lax conditions and OR clauses are only an indication of poor data input.
select @now = getdate();
select @expired = dateadd(second, -@processTimeout, @now);

with cte as (
    select *
    from [MyDb].[dbo].[LoginInfos] with (readpast, xlock)
    where [Type] = @type
      and [Timestamp] < @expired)
update cte
set [Timestamp] = @now
output INSERTED.*;
For this to work, the clustered index of the table must be on ([Type], [Timestamp]) (which implies making the primary key LoginInfoId a non-clustered index).
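A hypothetical DDL sketch of that change (the constraint and index names are assumptions):

ALTER TABLE [dbo].[LoginInfos] DROP CONSTRAINT PK_LoginInfos;

ALTER TABLE [dbo].[LoginInfos]
    ADD CONSTRAINT PK_LoginInfos PRIMARY KEY NONCLUSTERED (LoginInfoUid);

CREATE CLUSTERED INDEX IX_LoginInfos_Type_Timestamp
    ON [dbo].[LoginInfos] ([Type], [Timestamp]);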

How to keep a rolling checksum in SQL?

I am trying to keep a rolling checksum to account for order, so take the previous 'checksum' and xor it with the current one and generate a new checksum.
Name Checksum Rolling Checksum
------ ----------- -----------------
foo 11829231 11829231
bar 27380135 checksum(27380135 ^ 11829231) = 93291803
baz 96326587 checksum(96326587 ^ 93291803) = 67361090
How would I accomplish something like this?
(Note that the calculations are completely made up and are for illustration only)
This is basically the running total problem.
Edit:
My original claim was that this is one of the few places where a cursor based solution actually performs best. The problem with the triangular self join solution is that it will repeatedly end up recalculating the same cumulative checksum as a subcalculation for the next step, so it is not very scalable, as the work required grows quadratically with the number of rows.
Corina's answer uses the "quirky update" approach. I've adjusted it to do the checksum and in my test found that it took 3 seconds rather than 26 seconds for the cursor solution. Both produced the same results. Unfortunately, however, it relies on an undocumented aspect of UPDATE behaviour. I would definitely read the discussion here before deciding whether to rely on this in production code.
There is a third possibility described here (using the CLR) which I didn't have time to test. But from the discussion here it seems to be a good possibility for calculating running-total type things at display time, though it is outperformed by the cursor when the result of the calculation must be saved back.
CREATE TABLE TestTable
(
PK int identity(1,1) primary key clustered,
[Name] varchar(50),
[CheckSum] AS CHECKSUM([Name]),
RollingCheckSum1 int NULL,
RollingCheckSum2 int NULL
)
/*Insert some random records (753,571 on my machine)*/
INSERT INTO TestTable ([Name])
SELECT newid() FROM sys.objects s1, sys.objects s2, sys.objects s3
Approach One: Based on the Jeff Moden Article
DECLARE @RCS int

UPDATE TestTable
SET @RCS = RollingCheckSum1 =
    CASE WHEN @RCS IS NULL THEN
        [CheckSum]
    ELSE
        CHECKSUM([CheckSum] ^ @RCS)
    END
FROM TestTable WITH (TABLOCKX)
OPTION (MAXDOP 1)
Approach Two - Using the same cursor options as Hugo Kornelis advocates in the discussion for that article.
SET NOCOUNT ON
BEGIN TRAN
DECLARE @RCS2 INT
DECLARE @PK INT, @CheckSum INT
DECLARE curRollingCheckSum CURSOR LOCAL STATIC READ_ONLY
FOR
SELECT PK, [CheckSum]
FROM TestTable
ORDER BY PK

OPEN curRollingCheckSum
FETCH NEXT FROM curRollingCheckSum
INTO @PK, @CheckSum

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @RCS2 = CASE WHEN @RCS2 IS NULL THEN @CheckSum ELSE CHECKSUM(@CheckSum ^ @RCS2) END

    UPDATE dbo.TestTable
    SET RollingCheckSum2 = @RCS2
    WHERE @PK = PK

    FETCH NEXT FROM curRollingCheckSum
    INTO @PK, @CheckSum
END
COMMIT
Test that they are the same:
SELECT * FROM TestTable
WHERE RollingCheckSum1 <> RollingCheckSum2
I'm not sure about a rolling checksum, but for a rolling sum for instance, you can do this using the UPDATE command:
declare @a table (name varchar(2), value int, rollingvalue int)

insert into @a
select 'a', 1, 0 union all select 'b', 2, 0 union all select 'c', 3, 0

select * from @a

declare @sum int
set @sum = 0

update @a
set @sum = rollingvalue = value + @sum

select * from @a
Select Name, [Checksum]
    , (Select Checksum_Agg(T1.[Checksum])
       From [Table] As T1
       Where T1.Name < T.Name) As RollingChecksum
From [Table] As T
Order By T.Name
To do a rolling anything, you need some semblance of an order to the rows. That can be by name, an integer key, a date or whatever. In my example, I used name (even though the order in your sample data isn't alphabetical). In addition, I'm using the Checksum_Agg function in SQL.
In addition, you would ideally have a unique value on which you compare the inner and outer query. E.g., Where T1.PK < T.PK for an integer key or even string key would work well. In my solution if Name had a unique constraint, it would also work well enough.
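For example, with an integer key (PK is a hypothetical column name here) the correlated subquery would be:

Select T.Name, T.[Checksum]
    , (Select Checksum_Agg(T1.[Checksum])
       From [Table] As T1
       Where T1.PK < T.PK) As RollingChecksum
From [Table] As T
Order By T.PK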

Weird trigger problem when I do an INSERT into a table

I've got a trigger attached to a table.
ALTER TRIGGER [dbo].[UpdateUniqueSubjectAfterInsertUpdate]
ON [dbo].[Contents]
AFTER INSERT,UPDATE
AS
BEGIN
-- Grab the Id of the row just inserted/updated
DECLARE @Id INT
SELECT @Id = Id
FROM INSERTED
END
Every time a new entry is inserted or modified, I wish to update a single field (in this table). For the sake of this question, imagine i'm updating a LastModifiedOn (datetime) field.
Ok, so what I've got is a batch insert thingy..
INSERT INTO [dbo].[Contents]
SELECT Id, a, b, c, d, YouDontKnowMe
FROM [dbo].[CrapTable]
Now all the rows are correctly inserted. The LastModifiedOn field defaults to null. So all the entries for this are null -- EXCEPT the first row.
Does this mean that the trigger is NOT called for each row that is inserted into the table, but once AFTER the insert query is finished, i.e. after ALL the rows are inserted? Which means the INSERTED table (in the trigger) has not one, but 'n' number of rows?!
If so .. er.. :( Would that mean I would need a cursor in this trigger? (if I need to do some unique logic to each single row, which I do currently).
?
UPDATE
I'll add the full trigger code, to see if it's possible to do it without a cursor.
BEGIN
SET NOCOUNT ON
DECLARE @ContentId INTEGER,
        @ContentTypeId TINYINT,
        @UniqueSubject NVARCHAR(200),
        @NumberFound INTEGER

-- Grab the Id. Also, convert the subject to a (first pass, untested)
-- unique subject.
-- NOTE: ToUriCleanText just replaces bad uri chars with a ''.
-- eg. an '#' -> ''
SELECT @ContentId = ContentId, @ContentTypeId = ContentTypeId,
       @UniqueSubject = [dbo].[ToUriCleanText]([Subject])
FROM INSERTED

-- Find out how many items we have, for these two keys.
SELECT @NumberFound = COUNT(ContentId)
FROM [dbo].[Contents]
WHERE ContentId = @ContentId
  AND UniqueSubject = @UniqueSubject

-- If we have at least one identical subject, then we need to make it
-- unique by appending the current found number.
-- Eg. The first instance has no number.
-- Second instance has subject + '1',
-- Third instance has subject + '2', etc...
IF @NumberFound > 0
    SET @UniqueSubject = @UniqueSubject + CAST(@NumberFound AS NVARCHAR(10))

-- Now save this change.
UPDATE [dbo].[Contents]
SET UniqueSubject = @UniqueSubject
WHERE ContentId = @ContentId
END
Why not change the trigger to deal with multiple rows?
No cursor or loops needed: it's the whole point of SQL ...
UPDATE
    dbo.SomeTable
SET
    LastModifiedOn = GETDATE()
WHERE
    EXISTS (SELECT * FROM INSERTED I WHERE I.[ID] = dbo.SomeTable.[ID])
Edit: Something like...
-- assuming a previously declared table variable, e.g.
-- DECLARE @ATableVariable TABLE (ContentId INT, ContentTypeId TINYINT, UniqueSubject NVARCHAR(200))
INSERT @ATableVariable
    (ContentId, ContentTypeId, UniqueSubject)
SELECT
    ContentId, ContentTypeId, [dbo].[ToUriCleanText]([Subject])
FROM
    INSERTED

UPDATE
    C
SET
    C.UniqueSubject = C.UniqueSubject + CAST(foo.NumberFound AS NVARCHAR(10))
FROM
    --Your original COUNT feels wrong and/or trivial
    --Do you expect 0, 1 or many rows.
    --Edit2: I assume 0 or 1 because of original WHERE so COUNT(*) will suffice
    -- .. although, this implies an EXISTS could be used but let's keep it closer to OP post
    (
        SELECT ContentId, UniqueSubject, COUNT(*) AS NumberFound
        FROM @ATableVariable
        GROUP BY ContentId, UniqueSubject
        HAVING COUNT(*) > 0
    ) foo
JOIN
    [dbo].[Contents] C ON C.ContentId = foo.ContentId AND C.UniqueSubject = foo.UniqueSubject
Edit 2: and again with RANKING
UPDATE
C
SET
UniqueSubject = C.UniqueSubject + CAST(foo.Ranking - 1 AS NVARCHAR(10))
FROM
(
SELECT
ContentId, --not needed? UniqueSubject,
ROW_NUMBER() OVER (PARTITION BY ContentId ORDER BY UniqueSubject) AS Ranking
FROM
@ATableVariable
) foo
JOIN
dbo.Contents C ON C.ContentId = foo.ContentId
/* not needed? AND C.UniqueSubject = foo.UniqueSubject */
WHERE
foo.Ranking > 1
The trigger will be run only once for an INSERT INTO query. The INSERTED table will contain multiple rows.
Ok folks, I think I figured it out myself. Inspired by the previous answers and comments, I've done the following. (Can you folks have a quick look over it to see if I've over-engineered this baby?)
1. Created an indexed view representing the 'Subject' field, which needs to be cleaned. This is the field that has to be unique .. but before we can make it unique, we need to group by it.
-- Create the view.
CREATE VIEW ContentsCleanSubjectView with SCHEMABINDING AS
SELECT ContentId, ContentTypeId,
[dbo].[ToUriCleanText]([Subject]) AS CleanedSubject
FROM [dbo].[Contents]
GO
-- Index the view with three indexes. Clustered PK and a non-clustered,
-- which is where most of the joins will be done against.
-- The last one is because the execution plan reckons I was missing statistics
-- against one of the fields, so I added that index and the stats got gen'd.
CREATE UNIQUE CLUSTERED INDEX PK_ContentsCleanSubjectView ON
ContentsCleanSubjectView(ContentId)
CREATE NONCLUSTERED INDEX IX_BlahBlahSnipSnip_A ON
ContentsCleanSubjectView(ContentTypeId, CleanedSubject)
CREATE INDEX IX_BlahBlahSnipSnip_B ON
ContentsCleanSubjectView(CleanedSubject)
2. Created the trigger code, which now
a) grabs all the items 'changed' (nothing new/hard about that)
b) orders all the inserted rows, row-numbered with partitioning by the clean subject
c) updates each affected row in the main update clause
here's the code...
ALTER TRIGGER [dbo].[UpdateUniqueSubjectAfterInsertUpdate]
ON [dbo].[Contents]
AFTER INSERT,UPDATE
AS
BEGIN
SET NOCOUNT ON
DECLARE @InsertRows TABLE (ContentId INTEGER PRIMARY KEY,
                           ContentTypeId TINYINT,
                           CleanedSubject NVARCHAR(300))

DECLARE @UniqueSubjectRows TABLE (ContentId INTEGER PRIMARY KEY,
                                  UniqueSubject NVARCHAR(350))
-- Grab all the records that have been updated/inserted.
INSERT INTO @InsertRows (ContentId, ContentTypeId, CleanedSubject)
SELECT ContentId, ContentTypeId, [dbo].[ToUriCleanText]([Subject])
FROM INSERTED
-- Determine the correct unique subject by using ROW_NUMBER partitioning.
INSERT INTO @UniqueSubjectRows
SELECT SubResult.ContentId, UniqueSubject = CASE SubResult.RowNumber
WHEN 1 THEN SubResult.CleanedSubject
ELSE SubResult.CleanedSubject + CAST(SubResult.RowNumber - 1 AS NVARCHAR(5)) END
FROM (
-- Order all the cleaned subjects, partitioned by the cleaned subject.
SELECT a.ContentId, a.CleanedSubject, ROW_NUMBER() OVER (PARTITION BY a.CleanedSubject ORDER BY a.ContentId) AS RowNumber
FROM ContentsCleanSubjectView a
INNER JOIN @InsertRows b ON a.ContentTypeId = b.ContentTypeId AND a.CleanedSubject = b.CleanedSubject
GROUP BY a.contentId, a.cleanedSubject
) SubResult
INNER JOIN [dbo].[Contents] c ON c.ContentId = SubResult.ContentId
INNER JOIN @InsertRows d ON c.ContentId = d.ContentId
-- Now update all the effected rows.
UPDATE a
SET a.UniqueSubject = b.UniqueSubject
FROM [dbo].[Contents] a INNER JOIN @UniqueSubjectRows b ON a.ContentId = b.ContentId
END
Now, the subquery correctly returns all the cleaned subjects, partitioned correctly and numbered correctly. I never knew about the 'PARTITION' command, so that trick was the big answer here :)
Then I just joined the subquery with the row that is being updated in the parent query. The row number is correct, so now I just do a case: if this is the first time the cleaned subject exists (e.g. row_number = 1), don't modify it. Otherwise, append the row number minus one. This means for the 2nd instance of the same subject, the unique subject will be => cleansubject + '1'.
The reason why I believe I need to have an indexed view is because if I have two very similar subjects, then once you have stripped out (i.e. cleaned) all the bad chars (which I've determined are bad), it's possible that the two clean subjects are the same. As such, I need to do all my joins on a cleanedSubject instead of a subject. Now, for the massive amount of rows I have, this is crap for performance when I don't have the view. :)
So .. is this over engineered?
Edit 1:
Refactored the trigger code so it's way more performant.