You might assume I came straight here to ask my question, but I googled quite a lot first without finding a decisive answer.
Facts: I have a table with 3.3 million rows, 20 columns.
The first column is the primary key and is therefore unique.
I have to remove all rows where columns 2 through 11 are duplicated. It's a basic question, really, but there are so many different approaches even though everyone is after the same thing in the end: removing the duplicates.
I was personally thinking about GROUP BY HAVING COUNT(*) > 1
Is that the way to go or what do you suggest?
Thanks a lot in advance!
As a generic answer:
WITH cte AS (
    SELECT ROW_NUMBER() OVER (
        PARTITION BY <groupbyfield> ORDER BY <tiebreaker>) AS rn
    FROM Table)
DELETE FROM cte
WHERE rn > 1;
I find this more powerful and flexible than GROUP BY ... HAVING. In fact, GROUP BY ... HAVING only gives you the duplicates; you're still left with the 'trivial' task of choosing a 'keeper' amongst the duplicates.
ROW_NUMBER() OVER (...) gives more control over how to distinguish among duplicates (the tiebreaker) and allows for behavior like 'keep the first 3 of the duplicates' rather than only 'keep just 1', which is really hard to do with GROUP BY ... HAVING.
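For example, keeping the first three rows of each duplicate group is just a change to the filter; a minimal sketch reusing the placeholders from the generic answer above:
WITH cte AS (
    SELECT ROW_NUMBER() OVER (
        PARTITION BY <groupbyfield> ORDER BY <tiebreaker>) AS rn
    FROM Table)
DELETE FROM cte
WHERE rn > 3;   -- keep the first 3 rows of each group instead of just 1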
The other part of your question is how to approach this for 3.3M rows. Well, 3.3M is not really that big, but I would still recommend doing this in batches. Delete TOP 10000 at a time, otherwise you'll push a huge transaction into the log and might overwhelm your log drives.
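A minimal sketch of such a batch loop, reusing the placeholder names from the generic CTE above (the 10000 batch size is just an example):
DECLARE @rows int;
SET @rows = 1;

WHILE @rows > 0
BEGIN
    ;WITH cte AS (
        SELECT ROW_NUMBER() OVER (
            PARTITION BY <groupbyfield> ORDER BY <tiebreaker>) AS rn
        FROM Table)
    DELETE TOP (10000) FROM cte
    WHERE rn > 1;

    SET @rows = @@ROWCOUNT;   -- stop once a pass deletes nothing
END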
And the final question is whether this will perform acceptably. That depends on your schema. If the ROW_NUMBER() has to scan the entire table and spool, and you have to repeat this in batches N times, then it won't perform well. An appropriate index will help, but I can't say anything more without knowing the exact schema involved (structure of the clustered index/heap, all non-clustered indexes, etc.).
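Purely as an illustration of the kind of index that tends to help with this pattern (the table and column names here are hypothetical, standing in for the ten duplicate-defining columns from the question):
-- Hypothetical names: dbo.BigTable with col02 .. col11 defining a duplicate
CREATE NONCLUSTERED INDEX IX_BigTable_DupKey
    ON dbo.BigTable (col02, col03, col04, col05, col06, col07, col08, col09, col10, col11);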
Group by the fields you want to be unique, and get an aggregate value (like min) for your pk field. Then insert those results into a new table.
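A hedged sketch of that idea, assuming a hypothetical table MyTable with primary key pk_id and duplicate-defining columns c2 and c3:
-- Keep one representative primary key per group of duplicate-defining columns
SELECT MIN(pk_id) AS pk_id, c2, c3
INTO dbo.MyTable_dedup
FROM dbo.MyTable
GROUP BY c2, c3;
From there you can either treat MyTable_dedup as the new table (joining back on pk_id to pull the remaining columns), or use the kept pk_id values to delete everything else from the original.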
If you have SQL Server 2005 or newer, then the easiest way would be to use a CTE (Common Table Expression).
You need to know what criteria you want to "partition" your data by - i.e. what makes rows count as identical/duplicate - and then you need to order those partitions by something - e.g. a sequence ID or a date/time.
You didn't provide many details about your tables, so let me just give you a sample:
;WITH Duplicates AS
(
SELECT
OrderID,
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate DESC) AS RowN
FROM
dbo.Orders
)
DELETE FROM Duplicates
WHERE RowN > 1
The CTE (WITH ... AS (...)) gives you an "inline view" for the next SQL statement - it's not persisted or anything - it just lives for that next statement and then it's gone.
Basically, I'm "grouping" (partitioning) my Orders by CustomerID, and ordering by OrderDate. So for each CustomerID, I get a new "group" of data, which gets a row number starting with 1. The ORDER BY OrderDate DESC gives the newest order for each customer the RowN = 1 value - this is the one order I keep.
All other orders for each customer are deleted based on the CTE (the WITH ... expression).
You'll need to adapt this for your own situation, obviously - but the CTE with the PARTITION BY and ROW_NUMBER() are a very reliable and easy technique to get rid of duplicates.
If you don't want to deal with creating a new table, then just use DELETE TOP(1). Use a subquery to get the IDs of rows that are duplicated, and then use DELETE TOP to delete where there are multiple rows. You might have to run it more than once if there is more than one duplicate, but you get the point.
DELETE TOP(1) FROM Table
WHERE ID IN (SELECT MIN(ID) FROM Table GROUP BY Field HAVING COUNT(*) > 1)
You get the idea hopefully. This is just some pseudo code to help demonstrate.
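If you'd rather not re-run it by hand, the same pseudo-code idea can be wrapped in a loop (Table, ID and Field are still the placeholder names from above):
WHILE EXISTS (SELECT 1 FROM Table GROUP BY Field HAVING COUNT(*) > 1)
BEGIN
    -- removes one row from a duplicated group per pass; slow, but simple
    DELETE TOP(1) FROM Table
    WHERE ID IN (SELECT MIN(ID) FROM Table GROUP BY Field HAVING COUNT(*) > 1);
END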
I ordered my results by their IDs with:
CREATE TABLE my_table2 AS SELECT * FROM my_table ORDER BY record_group_id;
Now when I execute:
SELECT DISTINCT record_group_id FROM my_table2 where rownum <=1000000;
I get group IDs in random order, even though my ORDER BY went fine. Here are a few of the records in the result set:
1599890050
1647717203
1647717120
1647717172
1647716972
1647717196
1647717197
1647717205
1599889999
1599889986
What could be the possible reason?
Shouldn't the DISTINCT statement return records in the same order as they are stored in the table?
Neither SELECT nor DISTINCT defines the order of the data.
If you want ordered data, explicitly define the order you need.
SELECT DISTINCT record_group_id
FROM my_table2
WHERE rownum <=1000000
ORDER BY record_group_id;
The ordering only determines the order of the source data that is inserted in the table. If there is no clustered index in the table, that means that the records will be stored in that order physically.
However, how the records are stored doesn't guarantee that they will be selected in that order. The execution planner determines the most efficient way to run the query, which means that the data might not be fetched the way that you think it is, and it can differ from time to time as the data changes, or just the statistics about the data.
For a simple query like in the example, you usually get a predictable result, but there is no guarantee, so you always need to sort the query where you fetch the data to be sure to get a predictable result.
One reason that you don't get the data in the order that they are stored in the table in this case, may be that an index is used for filtering the result, and the records are returned in the order of the index rather than the order of the table.
Use ORDER BY on your SELECT statement:
SELECT DISTINCT record_group_id
FROM my_table2
WHERE rownum <=1000000
ORDER BY record_group_id;
Using DISTINCT has no effect on order, only on uniqueness of values.
If you want to control order too:
SELECT DISTINCT record_group_id
FROM my_table2
WHERE rownum <= 1000000
ORDER BY record_group_id -- Added this line
Your assumption that data in the table is ordered is wrong.
There is no implicit ordering in a database table - it's just a bag of unsorted data.
If you need ordered data, you'll have to use ORDER BY - there's no way around it (neither DISTINCT nor GROUP BY nor ...); see Tom Kyte's blog post on ORDER BY.
I'm having a little problem with a query that I'm writing. I have a lot of joins and a lot of columns that I extract, and in the WHERE clause I compare a date column with the minimum value of that column from the same table. But when I get the same date for two rows, I need to get only one row. The WHERE clause looks like this:
bt.da2 = (select min(btreg.da2) from bt btreg.....
The query returns a lot of customers, and every customer has that bt.da2 date. When one customer has two rows with the same value of bt.da2, I need to take only one of the two rows, not both.
I may not have explained myself clearly. If anyone has a clue what I'm asking and something is not clear, please ask me.
I'm using PostgreSQL 8.3
Regards,
Julian
It's hard to tell with so little information, but I would try something like this:
select *
from (
    select product_id, -- assumed to be the primary key
           ...
           row_number() over (partition by product_id order by bt.da2) as rn
    from products pr
    left join bt on bt.da2 = pr.some_col
) t
where rn = 1
The row_number() function is used to create consecutive numbers for each product. The outer where clause then picks the first one. You can change the order by in the window definition to influence which one you pick.
You should be able to sort out duplicate values of da2 using:
bt.da2 = (select distinct min(btreg.da2) from bt btreg.....
I tried this out using PostgreSQL 9.1, but I am sure the distinct keyword will work as expected in 8.4 as well.
When I need to know the number of rows containing more than n duplicates for a certain column c, I can do it like this:
WITH duplicateRows AS (
    SELECT COUNT(1) AS cnt   -- the CTE column needs a name
    FROM [table]
    GROUP BY c
    HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
This leads to unwanted behaviour: SQL Server counts all rows grouped by c, which (when there is no index on this table) leads to horrible performance.
However, altering the script so that SQL Server doesn't have to count all the rows doesn't solve the problem either:
WITH duplicateRows AS (
    SELECT 1 AS x   -- the CTE column needs a name
    FROM [table]
    GROUP BY c
    HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
Although SQL Server now in theory can stop counting after n + 1, it leads to the same query plan and query cost.
Of course, the reason is that the GROUP BY really introduces the cost, not the counting. But I'm not at all interested in the numbers. Is there another option to speed up the counting of duplicate rows, on a table without indexes?
The greatest two costs in your query are the re-ordering for the GROUP BY (due to lack of appropriate index) and the fact that you're scanning the whole table.
Unfortunately, to identify duplicates, re-ordering the whole table is the cheapest option.
You may get a benefit from the following change, but I highly doubt it would be significant, as I'd expect the execution plan to involve a sort again anyway.
WITH sequenced_data AS
(
    SELECT
        ROW_NUMBER() OVER (PARTITION BY fieldC ORDER BY fieldC) AS sequence_id -- ROW_NUMBER() requires an ORDER BY
    FROM
        yourTable
)
SELECT
    COUNT(*)
FROM
    sequenced_data
WHERE
    sequence_id = (n+1)
Assumes SQL Server 2005+.
Without indexing, the GROUP BY solution is the best; every PARTITION-based solution involves both a table (clustered index) scan and a sort, instead of the simple scan-and-count of the GROUP BY case.
If the only goal is to determine if there are ANY rows in ANY group (or, to rephrase that, "there is a duplicate inside the table, given the distinction of column c"), adding TOP(1) to the SELECT queries could perform some performance magic.
WITH duplicateRows AS (
    SELECT TOP(1) 1 AS x   -- the CTE column needs a name
    FROM [table]
    GROUP BY c
    HAVING COUNT(1) > n
) SELECT 1 FROM duplicateRows
Theoretically, SQL Server doesn't need to determine all groups, so as soon as the first group with a duplicate is found, the query is finished (but worst-case will take as long as the original approach). I have to say though that this is a somewhat imperative way of thinking - not sure if it's correct...
Speed and "without indexes" almost never go together.
Although, as others here have mentioned, I seriously doubt it will have performance benefits. Perhaps you could try restructuring your query with PARTITION BY.
For example:
WITH duplicateRows AS (
    SELECT a.aFK,
           ROW_NUMBER() OVER (PARTITION BY a.aFK ORDER BY a.aFK) AS DuplicateCount
    FROM Address a
) SELECT COUNT(DuplicateCount) FROM duplicateRows
I haven't tested the performance of this against the actual group by clause query. It's just a suggestion of how you could restructure it in another way.
I have a table called Table1 which has 48 records, of which only 24 should be there. For some reason duplicate records got inserted into it. How do I delete the duplicate records from that table?
Here's something you might try if your SQL Server version is 2005 or later.
WITH cte AS
(
    SELECT {list-of-columns-in-table},
           row_number() over (PARTITION BY {list-of-key-columns} ORDER BY {rule-to-determine-row-to-keep}) as sequence
    FROM myTable
)
DELETE FROM cte
WHERE sequence > 1
This uses a common table expression (CTE) and adds a sequence column. {list-of-columns-in-table} is just as it states. Not all columns are needed, but I won't explain here.
The {list-of-key-columns} is the set of columns that you use to define what counts as a duplicate.
{rule-to-determine-row-to-keep} is a sequence so that the first row is the row to keep. For example, if you want to keep the oldest row, you would use a date column for sequence.
Here's an example of the query with real columns.
WITH cte AS
(
SELECT ID, CourseName, DateAdded,
row_number() over (PARTITION BY CourseName ORDER BY DateAdded) as sequence
FROM Courses
)
DELETE FROM cte
WHERE sequence > 1
This example removes duplicate rows based on the CourseName value and keeps the oldest based on the DateAdded value.
http://support.microsoft.com/kb/139444
This section is the key. The primary point you should take away. ;)
This article discusses how to locate and remove duplicate primary keys from a table. However, you should closely examine the process which allowed the duplicates to happen in order to prevent a recurrence.
Identify your records by grouping the data by your logical keys (since you obviously haven't defined them) and applying a HAVING COUNT(*) > 1 clause at the end. The article goes into this in depth.
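As a hedged illustration of that grouping step against the table from this question (Col1 and Col2 are stand-ins for whatever your logical key really is):
SELECT Col1, Col2, COUNT(*) AS DupCount
FROM Table1
GROUP BY Col1, Col2
HAVING COUNT(*) > 1
Each row returned represents one group of duplicates for which you still have to pick a keeper.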
This is an easier way
Select * Into #TempTable FROM YourTable
Truncate Table YourTable
Insert into YourTable Select Distinct * from #TempTable
Drop Table #TempTable
For example I have:
create table a (i int);
Assume there are 10k rows.
I want to count 0's in the last 20 rows.
Something like:
select count(*) from (select i from a limit 20) where i = 0;
Is it possible to make it more efficient? Like a single SQL statement or something?
PS. DB is SQLite3 if that matters at all...
UPDATE
PPS. No need to group by anything in this instance; assume the table literally has 1 column (plus, presumably, the internal DB row ID or something). I'm just curious whether this is possible to do without the nested selects.
You'll need to order by something in order to determine the last 20 rows. When you say last, do you mean by date, by ID, ...?
Something like this should work:
select count(*)
from (
    select i
    from a
    order by j desc   -- j is a placeholder for whatever column defines "last"
    limit 20
) where i = 0;
If you do not remove rows from the table, you may try the following hacky query:
SELECT COUNT(*) as cnt
FROM A
WHERE
ROWID > (SELECT MAX(ROWID)-20 FROM A)
AND i=0;
It operates with ROWIDs only. As the documentation says: Rows are stored in rowid order.
You need to remember to order by when you use limit, otherwise the result is indeterminate. To get the latest rows added, you need to include a column with the insertion date, then you can use that. Without this column you cannot guarantee that you will get the latest rows.
To make it efficient you should ensure that there is an index on the column you order by, possibly even a clustered index.
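In SQLite terms that might look roughly like this (created_at is a hypothetical insertion-date column that the example table doesn't actually have):
-- Hypothetical column: created_at records when each row was inserted
CREATE INDEX idx_a_created_at ON a(created_at);

SELECT COUNT(*)
FROM (SELECT i FROM a ORDER BY created_at DESC LIMIT 20)
WHERE i = 0;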
I'm afraid that you need a nested select to be able to count and restrict to last X rows at a time, because something like this
SELECT count(*) FROM a GROUP BY i HAVING i = 0
will count 0's, but in ALL table records, because a LIMIT in this query will basically have no effect.
However, you can optimize by using COUNT(i), as it is faster to COUNT only one field than two or more (in this case your table will have 2 fields: i, plus the rowid that is automatically created by SQLite in PK-less tables).
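Combining that with the rowid ordering from the earlier answer, a hedged version of the full query might be:
-- COUNT(i) instead of COUNT(*); "last 20" approximated by rowid order
SELECT COUNT(i)
FROM (SELECT i FROM a ORDER BY rowid DESC LIMIT 20)
WHERE i = 0;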