Fastest technique for deleting duplicate data - SQL

After searching stackoverflow.com I found several questions asking how to remove duplicates, but none of them addressed speed.
In my case I have a table with 10 columns that contains 5 million exact row duplicates. In addition, I have at least a million other rows with duplicates in 9 of the 10 columns. My current technique is taking (so far) 3 hours to delete these 5 million rows. Here is my process:
-- Step 1: **This step took 13 minutes.** Insert only one of the n duplicate rows into a temp table
select
MAX(prikey) as MaxPriKey, -- identity(1, 1)
a,
b,
c,
d,
e,
f,
g,
h,
i
into #dupTemp
FROM sourceTable
group by
a,
b,
c,
d,
e,
f,
g,
h,
i
having COUNT(*) > 1
Next,
-- Step 2: **This step is taking the 3+ hours**
-- delete the row when all the non-unique columns are the same (duplicates) and
-- have a smaller prikey not equal to the max prikey
delete
from sourceTable
from sourceTable
inner join #dupTemp on
sourceTable.a = #dupTemp.a and
sourceTable.b = #dupTemp.b and
sourceTable.c = #dupTemp.c and
sourceTable.d = #dupTemp.d and
sourceTable.e = #dupTemp.e and
sourceTable.f = #dupTemp.f and
sourceTable.g = #dupTemp.g and
sourceTable.h = #dupTemp.h and
sourceTable.i = #dupTemp.i and
sourceTable.PriKey != #dupTemp.MaxPriKey
Any tips on how to speed this up, or a faster way? Remember I will have to run this again for rows that are not exact duplicates.
Thanks so much.
UPDATE:
I had to stop step 2 from running at the 9 hour mark.
I tried OMG Ponies' method and it finished after only 40 minutes.
I tried my step 2 with Andomar's batch delete; it ran for 9 hours before I stopped it.
UPDATE:
Ran a similar query with one less field to get rid of a different set of duplicates and the query ran for only 4 minutes (8000 rows) using OMG Ponies' method.
I will try the CTE technique the next chance I get; however, I suspect OMG Ponies' method will be tough to beat.

What about EXISTS:
DELETE FROM sourceTable
WHERE EXISTS(SELECT NULL
FROM #dupTemp dt
WHERE sourceTable.a = dt.a
AND sourceTable.b = dt.b
AND sourceTable.c = dt.c
AND sourceTable.d = dt.d
AND sourceTable.e = dt.e
AND sourceTable.f = dt.f
AND sourceTable.g = dt.g
AND sourceTable.h = dt.h
AND sourceTable.i = dt.i
AND sourceTable.PriKey < dt.MaxPriKey)

Can you afford to have the original table unavailable for a short time?
I think the fastest solution is to create a new table without the duplicates. Basically the approach that you use with the temp table, but creating a "regular" table instead.
Then drop the original table and rename the intermediate table to have the same name as the old table.
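For illustration, a minimal sketch of that swap for the table in the question (this assumes the names from the original post; any constraints, indexes, and the identity property on prikey would have to be recreated on the new table, and the table is unavailable between the DROP and the rename):
-- keep one row per duplicate group in a new, regular table
SELECT MAX(prikey) AS prikey, a, b, c, d, e, f, g, h, i
INTO sourceTable_dedup
FROM sourceTable
GROUP BY a, b, c, d, e, f, g, h, i;

-- swap the tables
DROP TABLE sourceTable;
EXEC sp_rename 'sourceTable_dedup', 'sourceTable';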

The bottleneck in bulk row deletion is usually the transaction that SQL Server has to build up. You might be able to speed it up considerably by splitting the removal into smaller transactions. For example, to delete 100 rows at a time:
while 1=1
begin
delete top (100)
from sourceTable
...
if @@rowcount = 0
break
end
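For reference, here is roughly what that loop could look like when combined with the #dupTemp join from Step 2 (a sketch only, reusing the question's table and column names; the batch size is arbitrary):
while 1 = 1
begin
    delete top (10000) s
    from sourceTable s
    inner join #dupTemp d
        on s.a = d.a and s.b = d.b and s.c = d.c and s.d = d.d and s.e = d.e
       and s.f = d.f and s.g = d.g and s.h = d.h and s.i = d.i
    where s.PriKey <> d.MaxPriKey

    if @@rowcount = 0
        break
end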

...based on OMG Ponies' comment above, here is a CTE method that's a little more compact. This method works wonders on tables where (for whatever reason) you have no primary key and can therefore have rows that are identical on all columns.
;WITH cte AS (
SELECT ROW_NUMBER() OVER
(PARTITION BY a,b,c,d,e,f,g,h,i ORDER BY prikey DESC) AS sequence
FROM sourceTable
)
DELETE
FROM cte
WHERE sequence > 1

Well, lots of different things. First, would something like this work (do a select to make sure, maybe even put the results into a temp table of its own, #recordsToDelete):
delete sourceTable
from sourceTable
left join #dupTemp on
sourceTable.PriKey = #dupTemp.MaxPriKey
where #dupTemp.MaxPriKey is null
Next, you can index temp tables; put an index on prikey.
If you have the records you want to delete in a temp table, you can delete in batches, which is often faster than locking up the whole table with a single delete.
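A sketch of both suggestions against the #dupTemp table from the question (the index definition and the #recordsToDelete staging table are assumptions, not something from the original post):
-- index the temp table so the step-2 join can seek instead of scan
CREATE INDEX IX_dupTemp ON #dupTemp (a, b, c, d, e, f, g, h, i) INCLUDE (MaxPriKey);

-- or capture just the keys to delete, then remove them in batches (see the loop above)
SELECT s.PriKey
INTO #recordsToDelete
FROM sourceTable s
INNER JOIN #dupTemp d
    ON s.a = d.a AND s.b = d.b AND s.c = d.c AND s.d = d.d AND s.e = d.e
   AND s.f = d.f AND s.g = d.g AND s.h = d.h AND s.i = d.i
WHERE s.PriKey <> d.MaxPriKey;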

Here's a version that combines both steps into one.
WITH cte AS
( SELECT prikey, ROW_NUMBER() OVER (PARTITION BY a,b,c,d,e,f,g,h,i ORDER BY
prikey DESC) AS sequence
FROM sourceTable
)
DELETE
FROM sourceTable
WHERE prikey IN
( SELECT prikey
FROM cte
WHERE sequence > 1
) ;
By the way, do you have any indexes that can be temporarily removed?
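If there are nonclustered indexes on sourceTable that the delete has to maintain, disabling them for the duration can help; the index name below is hypothetical, and disabling plus rebuilding has its own cost, so test first:
-- disable a nonclustered index before the mass delete (do not disable the clustered index)
ALTER INDEX IX_sourceTable_a ON sourceTable DISABLE;

-- ... run the delete ...

-- rebuild everything afterwards
ALTER INDEX ALL ON sourceTable REBUILD;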

If you're using an Oracle database, I recently found out that the following statement performs best, from both a total duration and a CPU consumption point of view.
I've performed several tests with different data sizes, from tens of rows to thousands, always in a loop. I used the TKProf tool to analyze the results.
When compared to the ROW_NUMBER() solution above, this approach took 2/3 of the original time and consumed about 50% of the CPU time. It seemed to behave linearly, i.e. it should give similar results with any input data size.
Feel free to give me your feedback. I wonder if there is a better method.
DELETE FROM sourceTable
WHERE
ROWID IN(
-- delete all
SELECT ROWID
FROM sourceTable t
MINUS
-- but keep every unique row
SELECT
rid
FROM
(
SELECT a,b,c,d,e,f,g,h,i, MAX(ROWID) KEEP (DENSE_RANK FIRST ORDER BY ROWID) AS RID
FROM sourceTable t
GROUP BY a,b,c,d,e,f,g,h,i
)
)
;

Related

PostgreSQL query seems to be running in an infinite loop

Following my previous question, I am now trying to remove duplicates from my database. I am first running a sub-query to identify the almost identical records (the only difference would be the index column "id"). My table has roughly 9 million records, and the code below had to be interrupted after roughly an hour and a half.
DELETE FROM public."OptionsData"
WHERE id NOT IN
(
SELECT id FROM (
SELECT DISTINCT ON (asofdate, contract, strike, expiry, type, last, bid, ask, volume, iv, moneyness, underlying, underlyingprice) * FROM public."OptionsData"
) AS TempTable
);
Producing the results from the sub-query takes about 1 minute, so maybe running the full query just takes a long time (?), or is there something off in my code?
NOT IN combined with a DISTINCT is usually quite slow.
To delete duplicates using EXISTS is typically faster:
DELETE FROM public."OptionsData" d1
WHERE EXISTS (select *
from public."OptionsData" d2
where d1.id > d2.id
and (d1.asofdate, d1.contract, d1.strike, d1.expiry, d1.type, d1.last, d1.bid, d1.ask, d1.volume, d1.iv, d1.moneyness, d1.underlying, d1.underlyingprice)
= (d2.asofdate, d2.contract, d2.strike, d2.expiry, d2.type, d2.last, d2.bid, d2.ask, d2.volume, d2.iv, d2.moneyness, d2.underlying, d2.underlyingprice)
)
This will keep the rows with the smallest value in id. If you want to keep those with the highest id use where d1.id < d2.id.

Select random value for each row

I'm trying to select a new random value from a column in another table for each row of a table I'm updating. I'm getting a random value; however, I can't get it to change for each row. Any ideas? Here's the code:
UPDATE srs1.courseedition
SET ta_id = teacherassistant.ta_id
FROM srs1.teacherassistant
WHERE (SELECT ta_id FROM srs1.teacherassistant ORDER BY RANDOM()
LIMIT 1) = teacherassistant.ta_id
My guess is that Postgres is optimizing out the subquery, because it has no dependencies on the outer query. Have you simply considered using a subquery?
UPDATE srs1.courseedition
SET ta_id = (SELECT ta.ta_id
FROM srs1.teacherassistant ta
ORDER BY RANDOM()
LIMIT 1
);
I don't think this will fix the problem (smart optimizers, alas). But, if you correlate to the outer query, then it should run each time. Perhaps:
UPDATE srs1.courseedition ce
SET ta_id = (SELECT ta.ta_id
FROM srs1.teacherassistant ta
WHERE ce.ta_id IS NULL -- or something like that
ORDER BY RANDOM()
LIMIT 1
);
You can replace the WHERE clause with something more nonsensical such as WHERE COALESCE(ce.ta_id, '') IS NOT NULL.
The following solution should be faster by order(s) of magnitude than running a correlated subquery for every row: N random sorts over the whole table vs. 1 random sort. The result is just as random, but we get a perfectly even distribution with this method, whereas independent random picks like in Gordon's solution can (and probably will) assign some rows more often than others. There are different kinds of "random"; the actual requirements for "randomness" need to be defined carefully.
Assuming the number of rows in courseedition is bigger than in teacherassistant.
To update all rows in courseedition:
UPDATE srs1.courseedition c1
SET ta_id = t.ta_id
FROM (
SELECT row_number() OVER (ORDER BY random()) - 1 AS rn -- random order
, count(*) OVER () As ct -- total count
, ta_id
FROM srs1.teacherassistant -- smaller table
) t
JOIN (
SELECT row_number() OVER () - 1 AS rn -- arbitrary order
, courseedition_id -- use actual PK of courseedition
FROM srs1.courseedition -- bigger table
) c ON c.rn%t.ct = t.rn -- rownumber of big modulo count of small table
WHERE c.courseedition_id = c1.courseedition_id;
Notes
Match the random rownumber of the bigger table modulo the count of the smaller table to the rownumber of the smaller table.
row_number() - 1 to get a 0-based index. Allows using the modulo operator % more elegantly.
Random sort for one table is enough. The smaller table is cheaper. The second can have any order (arbitrary is cheaper). The assignment after the join is random either way. Perfect randomness would only be impaired indirectly if there are regular patterns in sort order of the bigger table. In this unlikely case, apply ORDER BY random() to the bigger table to eliminate any such effect.

Another Why Is This Nearest Neighbor Spatial Query So Slow?

Following this recommendation for an optimized nearest neighbor update, I'm using the T-SQL below to update a GPS table of 11,000 points with the nearest point of interest to each point.
WHILE (2 > 1)
BEGIN
BEGIN TRANSACTION
UPDATE TOP ( 100 ) s
set
[NEAR_SHELTER]= fname,
[DIST_SHELTER] = Shape.STDistance(fshape)
from(
Select
[dbo].[GRSM_GPS_COLLAR].*,
fnc.NAME as fname,
fnc.Shape as fShape
from
[dbo].[GRSM_GPS_COLLAR]
CROSS APPLY (SELECT TOP 1 NAME, shape
FROM [dbo].[BACK_COUNTRY_SHELTERS] WITH(index ([S50_idx]))
WHERE [BACK_COUNTRY_SHELTERS].Shape.STDistance([dbo].[GRSM_GPS_COLLAR].Shape) IS NOT NULL
ORDER BY BACK_COUNTRY_SHELTERS.Shape.STDistance([dbo].[GRSM_GPS_COLLAR].Shape) ASC) fnc)s;
IF @@ROWCOUNT = 0
BEGIN
COMMIT TRANSACTION
BREAK
END
COMMIT TRANSACTION
-- 1 second delay
WAITFOR DELAY '00:00:01'
END -- WHILE
GO
Note that I'm doing it in chunks of 100 to avoid locking, which I get if I don't chunk it up, and it runs for hours before I have to kill it. The obvious answer is "Have you optimized your spatial indexes?", and the answer is yes: both tables have a spatial index (SQL 2012), Geography Autogrid, 4092 cells per object, which was found to be the most efficient index after many days of testing every possible permutation of index parameters. I have tried this with and without the spatial index hint, and with multiple spatial indexes.
In the above, note the spatial index seek cost and the warning about no column statistics, which I understand is the case with spatial indexes. In each case I eventually have to terminate the T-SQL. It just runs forever (in one case overnight, with 2300 rows updated).
I've tried Isaac's numbers table join solution, but that example doesn't appear to lend itself to looping through n distance searches, just a single user-supplied location (@x).
Update
@Brad D, based on your answer, I tried this, but with some syntax errors that I can't quite figure out... I'm not sure I'm converting your example to mine correctly. Any ideas what I'm doing wrong? Thanks!
;WITH Points as(
SELECT TOP 100 [NAME], [Shape] as GeoPoint
FROM [BACK_COUNTRY_SHELTERS]
WHERE 1=1
SELECT P1.*, CP.[GPS_POS_NUMBER] as DestinationName, CP.Dist
INTO #tmp_Distance
FROM [GRSM_GPS_COLLAR] P1
CROSS APPLY (SELECT [NAME] , Shape.STDistance(P1.GeoPoint)/1609.344 as Dist
FROM [BACK_COUNTRY_SHELTERS] as P2
WHERE 1=1
AND P1.[NAME] <> P2.[NAME] --Don't compare it to itself
) as CP
CREATE CLUSTERED INDEX tmpIX ON #tmp_Distance (name, Dist)
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Dist ASC) as Rnk FROM #tmp_Distance) as tbl1
WHERE rnk = 1
DROP TABLE #tmp_Distance
You're essentially comparing 121 million data points (11K origins to 11K destinations); this isn't going to scale well if you try to do it all in one fell swoop. I like your idea of breaking it into batches, but trying to order a result set of 1.1MM records without an index could be painful.
I suggest breaking this out into a few more operations. I just tried the following and it runs in under a minute per batch in my environment (5,500 location records).
This worked for me without a geospatial index, but with a clustered index on the origin and the distance to the destination.
;WITH Points as(
SELECT TOP 100 Name, AddressLine1,
AddressLatitude, AddressLongitude
, geography::STGeomFromText('POINT(' + CONVERT(varchar(50),AddressLatitude) + ' ' + CONVERT(varchar(50),AddressLongitude) + ')',4326) as GeoPoint
FROM ServiceFacility
WHERE 1=1
AND AddressLatitude BETWEEN -90 AND 90
AND AddressLongitude BETWEEN -90 AND 90)
SELECT P1.*, CP.Name as DestinationName, CP.Dist
INTO #tmp_Distance
FROM Points P1
CROSS APPLY (SELECT Name, AlternateName,
geography::STGeomFromText('POINT(' + CONVERT(varchar(50),P2.AddressLatitude) + ' ' + CONVERT(varchar(50),P2.AddressLongitude) + ')',4326).STDistance(P1.GeoPoint)/1609.344 as Dist
FROM ServiceFacility as P2
WHERE 1=1
AND P1.Name <> P2.Name --Don't compare it to itself
AND P2.AddressLatitude BETWEEN -90 AND 90
AND P2.AddressLongitude BETWEEN -90 AND 90
) as CP
CREATE CLUSTERED INDEX tmpIX ON #tmp_Distance (name, Dist)
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Dist ASC) as Rnk FROM #tmp_Distance) as tbl1
WHERE rnk = 1
DROP TABLE #tmp_Distance
The actual update on 100 records, or even 11,000 records, shouldn't take too long. Spatial indexes are cool, but in case I'm missing something, I don't see a hard requirement for one in this particular exercise.
You should redesign the process, not just tune indexes.
Create a copy of the table with the columns you need. You can work in batches of several thousand if you work with much larger tables.
Then, for each row in the side table, set the "closest" point.
Then run a loop updating the main table in batches of under 5k (so as not to cause table lock escalation) using the clustered index for joining. This is usually much faster, and safer, than running large-scale updates on active tables.
On the side table, add a "handled" column for the loop that updates the main table, and an index on the handled column plus the clustered index columns of the main table to prevent unneeded "sorting" when joining to the main table.
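A sketch of that kind of process, with hypothetical names (GPS_ID as the clustered key of GRSM_GPS_COLLAR, a #work side table holding the precomputed nearest shelter per point, and a handled flag where 0 = pending, 1 = current batch, 2 = done):
-- 1) side table with the precomputed answers
--    (populate NEAR_SHELTER/DIST_SHELTER with the nearest-neighbor query before the loop)
SELECT g.GPS_ID,
       CAST(NULL AS varchar(100)) AS NEAR_SHELTER,
       CAST(NULL AS float)        AS DIST_SHELTER,
       CAST(0 AS tinyint)         AS handled
INTO   #work
FROM   dbo.GRSM_GPS_COLLAR g;

CREATE CLUSTERED INDEX IX_work ON #work (handled, GPS_ID);

-- 2) push the results into the main table in batches of under 5000 rows
WHILE 1 = 1
BEGIN
    UPDATE TOP (4000) w SET handled = 1 FROM #work w WHERE handled = 0;  -- stage a batch
    IF @@ROWCOUNT = 0 BREAK;

    UPDATE g
    SET    g.NEAR_SHELTER = w.NEAR_SHELTER,
           g.DIST_SHELTER = w.DIST_SHELTER
    FROM   dbo.GRSM_GPS_COLLAR g
    JOIN   #work w ON w.GPS_ID = g.GPS_ID
    WHERE  w.handled = 1;

    UPDATE #work SET handled = 2 WHERE handled = 1;                      -- mark the batch done
END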

Remove duplicate rows - Impossible to find a decisive answer

You'd immediately think I came straight here to ask my question, but I googled an awful lot without finding a decisive answer.
Facts: I have a table with 3.3 million rows, 20 columns.
The first column is the primary key and thus unique.
I have to remove all the rows where columns 2 through 11 are duplicated. It is a basic question in essence, but there are so many different approaches, even though everyone is ultimately after the same thing: removing the duplicates.
I was personally thinking about GROUP BY HAVING COUNT(*) > 1
Is that the way to go or what do you suggest?
Thanks a lot in advance!
L
As a generic answer:
WITH cte AS (
SELECT ROW_NUMBER() OVER (
PARTITION BY <groupbyfield> ORDER BY <tiebreaker>) as rn
FROM Table)
DELETE FROM cte
WHERE rn > 1;
I find this more powerful and flexible than the GROUP BY ... HAVING. In fact, GROUP BY ... HAVING only gives you the duplicates, you're still left with the 'trivial' task of choosing a 'keeper' amongst the duplicates.
ROW_NUMBER OVER (...) gives more control over how to distinguish among duplicates (the tiebreaker) and allows for behavior like 'keep first 3 of the duplicates', not only 'keep just 1', which is a behavior really hard to do with GROUP BY ... HAVING.
The other part of your question is how to approach this for 3.3M rows. Well, 3.3M is not really that big, but I would still recommend doing this in batches. Delete TOP 10000 at a time, otherwise you'll push a huge transaction into the log and might overwhelm your log drives.
And the final question is whether this will perform acceptably. It depends on your schema. If the ROW_NUMBER() has to scan the entire table and spool to count, and you have to repeat this in batches N times, then it won't perform. An appropriate index will help. But I can't say anything more without knowing the exact schema involved (structure of clustered index/heap, all non-clustered indexes, etc.).
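A sketch of what the batched version might look like (SQL Server 2005+, keeping the placeholder names from the generic statement above; the batch size is arbitrary):
WHILE 1 = 1
BEGIN
    ;WITH cte AS (
        SELECT ROW_NUMBER() OVER (
            PARTITION BY <groupbyfield> ORDER BY <tiebreaker>) AS rn
        FROM Table)
    DELETE TOP (10000) FROM cte
    WHERE rn > 1;

    IF @@ROWCOUNT = 0
        BREAK;
END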
Group by the fields you want to be unique, and get an aggregate value (like min) for your pk field. Then insert those results into a new table.
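A sketch of that approach with hypothetical names (MyTable with primary key pk; col2, col3, col4 stand in for the columns that define a duplicate):
-- 1) find the keeper pk per duplicate group
SELECT MIN(pk) AS pk
INTO #keepers
FROM MyTable
GROUP BY col2, col3, col4;

-- 2) copy the keeper rows (all columns) into a new table
SELECT t.*
INTO MyTable_dedup
FROM MyTable t
INNER JOIN #keepers k ON k.pk = t.pk;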
If you have SQL Server 2005 or newer, then the easiest way would be to use a CTE (Common Table Expression).
You need to know what criteria you want to "partition" your data by - e.g. create partitions of data that is considered identical/duplicate - and then you need to order those partitions by something - e.g. a sequence ID, a date/time or something.
You didn't provide much details about your tables - so let me just give you a sample:
;WITH Duplicates AS
(
SELECT
OrderID,
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate DESC) AS RowN
FROM
dbo.Orders
)
DELETE FROM Duplicates
WHERE RowN > 1
The CTE (WITH ... AS (...)) gives you an "inline view" for the next SQL statement - it's not persisted or anything - it just lives for that next statement and then it's gone.
Basically, I'm "grouping" (partitioning) my Orders by CustomerID, and ordering by OrderDate. So for each CustomerID, I get a new "group" of data, which gets a row number starting with 1. The ORDER BY OrderDate DESC gives the newest order for each customer the RowN = 1 value - this is the one order I keep.
All other orders for each customer are deleted based on the CTE (the WITH..... expression).
You'll need to adapt this for your own situation, obviously - but the CTE with the PARTITION BY and ROW_NUMBER() are a very reliable and easy technique to get rid of duplicates.
If you don't want to deal with a new-table approach, then just use DELETE TOP(1). Use a subquery to get all the ids of rows that have duplicates, and then use DELETE TOP to delete where there are multiple rows. You might have to run it more than once if there is more than one duplicate, but you get the point.
DELETE TOP(1) FROM Table
WHERE ID IN (SELECT ID FROM Table GROUP BY Field HAVING COUNT(*) > 1)
You get the idea hopefully. This is just some pseudo code to help demonstrate.
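A runnable variant of that idea, with hypothetical names (MyTable with primary key ID and a duplicate-defining column Field); wrapping it in a loop saves re-running it by hand:
WHILE 1 = 1
BEGIN
    DELETE TOP (1) FROM MyTable
    WHERE ID NOT IN (SELECT MIN(ID) FROM MyTable GROUP BY Field);

    IF @@ROWCOUNT = 0
        BREAK;
END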

Delete all but top n from database table in SQL

What's the best way to delete all rows from a table in sql but to keep n number of rows on the top?
DELETE FROM Table WHERE ID NOT IN (SELECT TOP 10 ID FROM Table)
Edit:
Chris brings up a good point about the performance hit, since the TOP 10 query would be run for each row. If this is a one-time thing, then it may not be as big of a deal, but if it is a common operation, then I did take a closer look at it.
I would select the ID column(s) of the set of rows that you want to keep into a temp table or table variable. Then delete all the rows that do not exist in the temp table. The syntax mentioned by another user:
DELETE FROM Table WHERE ID NOT IN (SELECT TOP 10 ID FROM Table)
Has a potential problem. The "SELECT TOP 10" query will be executed for each row in the table, which could be a huge performance hit. You want to avoid making the same query over and over again.
This syntax should work, based what you listed as your original SQL statement:
create table #nuke(NukeID int)
insert into #nuke(NukeID) select top 1000 id from article
delete article where not exists (select 1 from #nuke where #nuke.NukeID = article.id)
drop table #nuke
For future reference, for those of us who don't use MS SQL:
In PostgreSQL use ORDER BY and LIMIT instead of TOP.
DELETE FROM table
WHERE id NOT IN (SELECT id FROM table ORDER BY id LIMIT n);
MySQL -- well...
Error -- This version of MySQL does not yet support 'LIMIT &
IN/ALL/ANY/SOME subquery'
Not yet I guess.
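A common workaround in MySQL is to wrap the LIMIT subquery in a derived table, which sidesteps both that restriction and the "can't specify target table for update in FROM clause" error (a sketch, keeping the 10 smallest ids):
DELETE FROM `table`
WHERE id NOT IN (
    SELECT id FROM (
        SELECT id FROM `table` ORDER BY id LIMIT 10
    ) AS keeper
);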
Here is how I did it. This method is faster and simpler:
Delete all but top n from database table in MS SQL using OFFSET command
WITH CTE AS
(
SELECT ID
FROM dbo.TableName
ORDER BY ID DESC
OFFSET 11 ROWS
)
DELETE CTE;
Replace ID with column by which you want to sort.
Replace number after OFFSET with number of rows which you want to keep.
Choose DESC or ASC - whatever suits your case.
I think using a virtual table would be much better than an IN-clause or temp table.
DELETE
Product
FROM
Product
LEFT OUTER JOIN
(
SELECT TOP 10
Product.id
FROM
Product
) TopProducts ON Product.id = TopProducts.id
WHERE
TopProducts.id IS NULL
This really is going to be language specific, but I would likely use something like the following for SQL server.
declare @n int
SET @n = (SELECT COUNT(*) FROM dTABLE);
DELETE TOP (@n - 10) FROM dTABLE
If you don't care about the exact number of rows, there is always
DELETE TOP (90) PERCENT FROM dTABLE;
I don't know about other flavors but MySQL DELETE allows LIMIT.
If you could order things so that the n rows you want to keep are at the bottom, then you could do a DELETE FROM table LIMIT tablecount-n.
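For example, with 1,000 rows and the 10 smallest ids to keep (MySQL's DELETE ... LIMIT does not accept an expression, so tablecount - n has to be computed beforehand, e.g. in application code or a prepared statement):
-- deletes the 990 largest ids, keeping the 10 smallest
DELETE FROM `table`
ORDER BY id DESC
LIMIT 990;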
Edit
Oooo. I think I like Cory Foy's answer better, assuming it works in your case. My way feels a little clunky by comparison.
I would solve it using the technique below. The example expects an article table with an id on each row.
Delete article where id not in (select top 1000 id from article)
Edit: Too slow to answer my own question ...
Refactored?
Delete a From Table a Inner Join (
Select Top ((Select Count(tableID) From Table) - 10) tableID
From Table Order By tableID Desc
) b On b.tableID = a.tableID
edit: tried them both in the query analyzer, current answer is fastest (damn order by...)
A better way would be to insert the rows you DO want into another table, drop the original table, and then rename the new table so it has the same name as the old one.
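A sketch of that in SQL Server, reusing the dTABLE/ID names from the answer above (constraints and indexes would have to be recreated on the new table):
SELECT TOP 10 *
INTO dTABLE_keep
FROM dTABLE
ORDER BY ID;   -- define what "top" means here

DROP TABLE dTABLE;
EXEC sp_rename 'dTABLE_keep', 'dTABLE';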
I've got a trick to avoid executing the TOP expression for every row. We can combine TOP with MAX to get the MaxId we want to keep. Then we just delete everything greater than MaxId.
-- Declare Variable to hold the highest id we want to keep.
DECLARE @MaxId as int = (
SELECT MAX(temp.ID)
FROM (SELECT TOP 10 ID FROM table ORDER BY ID ASC) temp
)
-- Delete anything greater than MaxId. If MaxId is null, there is nothing to delete.
IF @MaxId IS NOT NULL
DELETE FROM table WHERE ID > @MaxId
Note: It is important to use ORDER BY when declaring MaxId to ensure proper results are queried.