How to delete duplicate rows with SQL?

I have a table with some rows. Every row has a date field, and right now a date may appear more than once. I need to delete all the duplicates and keep only the row with the highest id for each date. How can I do this with a SQL query?
Now:
date id
'07/07' 1
'07/07' 2
'07/07' 3
'07/05' 4
'07/05' 5
What I want:
date id
'07/07' 3
'07/05' 5

DELETE FROM table WHERE id NOT IN
(SELECT MAX(id) FROM table GROUP BY date);
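To try the statement above without touching real data, here is a minimal sketch using Python's built-in sqlite3 module. The table name `events` and the helper `dedupe_keep_max_id` are made up for illustration; the rows are the ones from the question.

```python
import sqlite3


def dedupe_keep_max_id(conn):
    """Delete every row whose id is not the highest id for its date."""
    conn.execute(
        "DELETE FROM events WHERE id NOT IN "
        "(SELECT MAX(id) FROM events GROUP BY date)"
    )
    return conn.execute("SELECT date, id FROM events ORDER BY id").fetchall()


# Hypothetical in-memory table holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (date TEXT, id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO events (date, id) VALUES (?, ?)",
    [("07/07", 1), ("07/07", 2), ("07/07", 3), ("07/05", 4), ("07/05", 5)],
)

remaining = dedupe_keep_max_id(conn)
print(remaining)  # [('07/07', 3), ('07/05', 5)]
```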

I don't have comment rights, so here's my comment as an answer in case anyone comes across the same problem:
In SQLite3, there is an implicit numerical primary key called "rowid", so the same query would look like this:
DELETE FROM table WHERE rowid NOT IN
(SELECT MAX(rowid) FROM table GROUP BY date);
This works with any table, even one that has no primary key column called "id" (unless the table was created WITHOUT ROWID).
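A small sketch of the rowid variant, run through Python's sqlite3 module. The table name `log` is hypothetical and deliberately has no id column, so the only key available is the implicit rowid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with no explicit key column at all.
conn.execute("CREATE TABLE log (date TEXT)")
conn.executemany(
    "INSERT INTO log (date) VALUES (?)",
    [("07/07",), ("07/07",), ("07/07",), ("07/05",), ("07/05",)],
)

# Keep only the row with the highest implicit rowid for each date.
conn.execute(
    "DELETE FROM log WHERE rowid NOT IN "
    "(SELECT MAX(rowid) FROM log GROUP BY date)"
)

rows = conn.execute("SELECT rowid, date FROM log ORDER BY rowid").fetchall()
print(rows)  # [(3, '07/07'), (5, '07/05')]
```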

For MySQL, PostgreSQL, and Oracle, a self join is a better option.
PostgreSQL:
DELETE FROM table t1 USING table t2 WHERE t1.date=t2.date AND t1.id<t2.id;
MySQL:
DELETE FROM table
USING table, table as vtable
WHERE (table.id < vtable.id)
AND (table.date=vtable.date);
Queries built on aggregates (MAX with GROUP BY) are often slow on large tables, although how slow depends on the engine and the available indexes.
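As far as I know SQLite does not accept DELETE ... USING, so the sketch below only previews the rows each approach would delete, using Python's sqlite3 module. It checks that the self-join condition (same date, lower id) and the MAX/GROUP BY condition select the same ids. The table name `events` is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (date TEXT, id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO events (date, id) VALUES (?, ?)",
    [("07/07", 1), ("07/07", 2), ("07/07", 3), ("07/05", 4), ("07/05", 5)],
)

# Self join: a row is a duplicate if another row has the same date and a higher id.
self_join_ids = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT t1.id FROM events t1 "
        "JOIN events t2 ON t1.date = t2.date AND t1.id < t2.id "
        "ORDER BY t1.id"
    )
]

# Aggregate: a row is a duplicate if its id is not the maximum for its date.
aggregate_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM events WHERE id NOT IN "
        "(SELECT MAX(id) FROM events GROUP BY date) ORDER BY id"
    )
]

print(self_join_ids, aggregate_ids)  # [1, 2, 4] [1, 2, 4]
```

Which form is faster is best measured on the real table with the engine's own EXPLAIN output rather than assumed.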

Related

Optimisation of sql query for deleting duplicate items from large table

Could anyone please help me optimise a query that takes more than 20 minutes to run against 3 million rows?
Table Structure
-----------------------------------------------------------------------------------------
|id [INT Auto Inc]| name_id (uuid) | name (varchar)| city (varchar) | name_type(varchar)|
-----------------------------------------------------------------------------------------
Query
The purpose of the query is to eliminate duplicates; here a duplicate means a row with the same name_id and name as another row.
DELETE
FROM records
WHERE id NOT IN
(SELECT DISTINCT
ON (name_id, name) id
FROM records);

I would write your delete using exists logic:
DELETE
FROM records r1
WHERE EXISTS (SELECT 1 FROM records r2
WHERE r2.name_id = r1.name_id AND r2.name = r1.name AND
r2.id < r1.id);
This delete query will spare the duplicate having the smallest id value. To speed this up, you may try adding the following index:
CREATE INDEX idx ON records (name_id, name, id);
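Here is a small sketch of the EXISTS delete plus the supporting index, run against SQLite from Python. The sample rows and the index name `idx_records_dupes` are invented, and the inner comparison is assumed to be between the two copies of the table (`r2.name = records.name`). The outer table is left unaliased only to stay portable across SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, name_id TEXT, name TEXT, city TEXT, name_type TEXT)"
)
# Invented sample rows: ids 2 and 5 repeat the (name_id, name) of ids 1 and 3.
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?, ?)",
    [
        (1, "a", "Ann", "X", "p"),
        (2, "a", "Ann", "Y", "p"),
        (3, "b", "Bob", "X", "p"),
        (4, "a", "Amy", "Z", "p"),
        (5, "b", "Bob", "Q", "p"),
    ],
)

# Index covering the columns the correlated subquery filters on.
conn.execute("CREATE INDEX idx_records_dupes ON records (name_id, name, id)")

# Delete a row when an older row (smaller id) has the same name_id and name.
conn.execute(
    "DELETE FROM records WHERE EXISTS ("
    "SELECT 1 FROM records r2 "
    "WHERE r2.name_id = records.name_id "
    "AND r2.name = records.name "
    "AND r2.id < records.id)"
)

kept_ids = [row[0] for row in conn.execute("SELECT id FROM records ORDER BY id")]
print(kept_ids)  # [1, 3, 4]
```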

You probably already have a primary key on the identity column, so you can use it to exclude redundant rows by id as follows:
WITH cte AS (
SELECT MIN(id) AS id FROM records GROUP BY name_id, name)
DELETE FROM records
WHERE NOT EXISTS (SELECT 1 FROM cte WHERE cte.id=records.id);
Even without the index, this should run relatively fast, probably because the planner can use a merge join.
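The CTE form can be tried the same way. This is a sketch against SQLite, which also accepts a WITH clause in front of DELETE; the sample rows are invented, and the timing claim above is not something a five-row table can confirm.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name_id TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(1, "a", "Ann"), (2, "a", "Ann"), (3, "b", "Bob"), (4, "a", "Amy"), (5, "b", "Bob")],
)

# The CTE lists the one id to keep per (name_id, name); everything else goes.
conn.execute(
    "WITH keep AS ("
    "  SELECT MIN(id) AS id FROM records GROUP BY name_id, name"
    ") "
    "DELETE FROM records "
    "WHERE NOT EXISTS (SELECT 1 FROM keep WHERE keep.id = records.id)"
)

survivors = conn.execute("SELECT id, name_id, name FROM records ORDER BY id").fetchall()
print(survivors)  # [(1, 'a', 'Ann'), (3, 'b', 'Bob'), (4, 'a', 'Amy')]
```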

Oracle SQL Subquery - Usage of NOT EXISTS

I used the query below to find a list of primary keys, one primary key per foreign key in a table.
select foreignKey, min(primaryKey)
from t
group by foreignKey;
Let us say this is the result: 1, 4, 5.
Now I have another table, Table B, that has a list of all primary keys. It has 1, 2, 3, 6, 7, 8, 9.
I want to write a query using the above query so that I get the subset of the original result (above) that does not exist in Table B. I want 4 and 5 back from the new query.

Use a having clause:
select foreignKey, min(primaryKey)
from t
group by foreignKey
having min(primarykey) not in (select pk from b);
You should also be able to express this as not exists:
having not exists (select 1
from b
where b.pk = min(t.primaryKey)
)
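A quick way to check the HAVING ... NOT IN form is to load the numbers from the question into SQLite through Python. The foreign key values 10, 20 and 30 are invented so that the minimum primary keys come out as 1, 4 and 5. This is a sketch in SQLite, not Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (foreignKey INTEGER, primaryKey INTEGER)")
conn.execute("CREATE TABLE b (pk INTEGER)")

# Invented rows: the minimum primaryKey per foreignKey is 1, 4 and 5.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(10, 1), (10, 2), (20, 4), (20, 6), (30, 5), (30, 9)],
)
# Table B from the question.
conn.executemany("INSERT INTO b VALUES (?)", [(n,) for n in (1, 2, 3, 6, 7, 8, 9)])

missing = conn.execute(
    "SELECT foreignKey, MIN(primaryKey) FROM t "
    "GROUP BY foreignKey "
    "HAVING MIN(primaryKey) NOT IN (SELECT pk FROM b) "
    "ORDER BY foreignKey"
).fetchall()
print(missing)  # [(20, 4), (30, 5)]
```

One thing to watch with NOT IN: if b.pk can contain NULL, the comparison never evaluates to true and the query returns no rows, which is a reason to prefer the NOT EXISTS form in that case.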

SQL deleting one of two duplicate records?

I have a DB with a problem: there are two copies of every record. Each copy has a different ID, but the 2 columns holding the actual data are the same. Is there a good way to write a DELETE statement that selects the records where those 2 columns match but the ID differs and deletes one of them (it doesn't matter which one)?
If so, could you give me a code example?

Delete from ...
where id in (select max(id)
from ...
group by data1, data2
having count(*) > 1)
The idea is to select the bigger id of each pair of duplicate rows, by grouping the rows on the columns that are the same and making sure that the group has multiple rows (the having clause).
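A runnable sketch of that idea in SQLite via Python, assuming the subquery returns only max(id) and the having clause tests count(*) > 1. The table name `items` and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, data1 TEXT, data2 TEXT)")
# Made-up rows: two pairs of duplicates and one row that is already unique.
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, "a", "x"), (2, "a", "x"), (3, "b", "y"), (4, "b", "y"), (5, "c", "z")],
)

# Delete the higher id from every (data1, data2) group that has more than one row.
conn.execute(
    "DELETE FROM items WHERE id IN ("
    "SELECT MAX(id) FROM items GROUP BY data1, data2 HAVING COUNT(*) > 1)"
)

left = [row[0] for row in conn.execute("SELECT id FROM items ORDER BY id")]
print(left)  # [1, 3, 5]
```

Each run removes only one row per group, so a group with three or more copies needs the statement repeated until no rows are affected.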

delete from your_table
where id not in
(
select min(id)
from your_table
group by col2
)

SQL Server Sum multiple rows into one - no temp table

I would like to see a most concise way to do what is outlined in this SO question: Sum values from multiple rows into one row
that is, combine multiple rows while summing a column.
But how do I then delete the duplicates? In other words I have data like this:
Person Value
--------------
1 10
1 20
2 15
And I want to sum the values for any duplicates (on the Person col) into a single row and get rid of the other duplicates on the Person value. So my output would be:
Person Value
-------------
1 30
2 15
And I would like to do this without using a temp table. I think that I'll need to use OVER (PARTITION BY ...) but I'm just not sure. I'm just trying to challenge myself by not doing it the temp table way. I'm working with SQL Server 2008 R2.
Simply put, give me a concise statement that gets from my input to my output in the same table. So if my table is named People, a select * from People before the operation returns the first set above, and a select * from People after the operation returns the second set.

Not sure why you want to avoid a temp table, but here's one way to do it (though IMHO this is overkill):
UPDATE MyTable SET VALUE = (SELECT SUM(Value) FROM MyTable MT WHERE MT.Person = MyTable.Person);
WITH DUP_TABLE AS
(SELECT ROW_NUMBER()
OVER (PARTITION BY Person ORDER BY Person) As ROW_NO
FROM MyTable)
DELETE FROM DUP_TABLE WHERE ROW_NO > 1;
The first query sets every row of a person to that person's summed value. The second query removes the duplicate rows for each person.
Demo: http://sqlfiddle.com/#!3/db7aa/11

All you're asking for is a simple SUM() aggregate function and a GROUP BY
SELECT Person, SUM(Value)
FROM myTable
GROUP BY Person
The SUM() by itself would sum up the values in a column, but when you add a secondary column and GROUP BY it, SQL will show distinct values from the secondary column and perform the aggregate function by those distinct categories.
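For reference, here is the grouped SUM run on the sample data from the question, as a sketch using SQLite through Python. The table name `People` comes from the question, and the ORDER BY is added only to make the output order predictable. Note that this only reads the totals and leaves the stored rows unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (Person INTEGER, Value INTEGER)")
conn.executemany("INSERT INTO People VALUES (?, ?)", [(1, 10), (1, 20), (2, 15)])

# One output row per Person, with Value summed within each group.
totals = conn.execute(
    "SELECT Person, SUM(Value) FROM People GROUP BY Person ORDER BY Person"
).fetchall()
print(totals)  # [(1, 30), (2, 15)]
```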

how to delete duplicates from a database table based on a certain field

I have a table that somehow got duplicated. I basically want to delete all records that are duplicates, where a duplicate is defined by a field in my table called SourceId. There should only be one record for each SourceId.
Is there any SQL that I can write that will delete every duplicate so I only have one record per SourceId?

Assuming you have a column ID that can tie-break the duplicate sourceid's, you can use this. Using min(id) causes it to keep just the min(id) per sourceid batch.
delete from tbl
where id NOT in
(
select min(id)
from tbl
group by sourceid
)

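delete from table
where pk in (
select i2.pk
from table i1
inner join table i2
on i1.SourceId = i2.SourceId
and i1.pk < i2.pk
)
The extra condition on pk matters: without it every row matches itself in the join and the statement deletes the whole table. With it, the row with the lowest pk per SourceId is kept.
Good practice is to start with
select * from … and only replace it with delete from … once the result looks right.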
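The preview-then-delete habit can be sketched like this in SQLite via Python. The table name `tbl` and its rows are invented, and the join is assumed to include `i1.pk < i2.pk` so that a row cannot match itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (pk INTEGER PRIMARY KEY, SourceId TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "B"), (4, "A")],
)

# Rows that have an older sibling (lower pk) with the same SourceId.
DUPLICATE_PKS = (
    "SELECT i2.pk FROM tbl i1 "
    "INNER JOIN tbl i2 ON i1.SourceId = i2.SourceId AND i1.pk < i2.pk"
)

# Step 1: look at what would be removed before removing anything.
preview = conn.execute(
    "SELECT * FROM tbl WHERE pk IN (" + DUPLICATE_PKS + ") ORDER BY pk"
).fetchall()
print(preview)  # [(2, 'A'), (4, 'A')]

# Step 2: once the preview looks right, run the delete with the same condition.
conn.execute("DELETE FROM tbl WHERE pk IN (" + DUPLICATE_PKS + ")")
after = conn.execute("SELECT * FROM tbl ORDER BY pk").fetchall()
print(after)  # [(1, 'A'), (3, 'B')]
```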
