How to delete duplicate rows without unique identifier - sql

I have duplicate rows in my table and I want to delete duplicates in the most efficient way since the table is big. After some research, I have come up with this query:
WITH TempEmp AS
(
    SELECT name, ROW_NUMBER() OVER(PARTITION BY name, address, zipcode ORDER BY name) AS duplicateRecCount
    FROM mytable
)
-- Now Delete Duplicate Records
DELETE FROM TempEmp
WHERE duplicateRecCount > 1;
But it only works in SQL Server, not in Netezza. It would seem that Netezza does not like the DELETE after the WITH clause?

I like @erwin-brandstetter's solution, but wanted to show a solution with the USING keyword:
DELETE FROM table_with_dups T1
    USING table_with_dups T2
WHERE T1.ctid < T2.ctid          -- delete the "older" ones
  AND T1.name = T2.name          -- list columns that define duplicates
  AND T1.address = T2.address
  AND T1.zipcode = T2.zipcode;
If you want to review the records before deleting them, then simply replace DELETE with SELECT * and USING with a comma (,), i.e.
SELECT * FROM table_with_dups T1
    , table_with_dups T2
WHERE T1.ctid < T2.ctid          -- select the "older" ones
  AND T1.name = T2.name          -- list columns that define duplicates
  AND T1.address = T2.address
  AND T1.zipcode = T2.zipcode;
Update: I tested some of the different solutions here for speed. If you don't expect many duplicates, then this solution performs much better than the ones that have a NOT IN (...) clause as those generate a lot of rows in the subquery.
If you rewrite the query to use IN (...) then it performs similarly to the solution presented here, but the SQL code becomes much less concise.
Update 2: If you have NULL values in one of the key columns (which you really shouldn't IMO), then you can use COALESCE() in the condition for that column, e.g.
AND COALESCE(T1.col_with_nulls, '[NULL]') = COALESCE(T2.col_with_nulls, '[NULL]')
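As a runnable sketch of the same self-join idea, the pattern can be tried out with SQLite, whose built-in rowid plays the role of Postgres's ctid; the data here is made up, and since SQLite has no DELETE ... USING, the join is phrased as a correlated EXISTS with the identical comparison:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table_with_dups (name TEXT, address TEXT, zipcode TEXT);
    INSERT INTO table_with_dups VALUES
        ('James', 'main street', '123'),
        ('James', 'main street', '123'),
        ('Alice', 'union square', '456');
""")

# Delete every row for which a duplicate with a larger rowid exists,
# i.e. keep the "newest" copy, mirroring T1.ctid < T2.ctid above.
con.execute("""
    DELETE FROM table_with_dups
    WHERE EXISTS (
        SELECT 1 FROM table_with_dups t2
        WHERE table_with_dups.rowid   < t2.rowid    -- delete the "older" ones
          AND table_with_dups.name    = t2.name     -- columns that define duplicates
          AND table_with_dups.address = t2.address
          AND table_with_dups.zipcode = t2.zipcode
    )
""")
print(con.execute("SELECT count(*) FROM table_with_dups").fetchone()[0])  # 2
```

One copy of the duplicated James row survives, plus the unique Alice row.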

If you have no other unique identifier, you can use ctid:
delete from mytable
where exists (select 1
              from mytable t2
              where t2.name = mytable.name and
                    t2.address = mytable.address and
                    t2.zip = mytable.zip and
                    t2.ctid > mytable.ctid
             );
It is a good idea to have a unique, auto-incrementing id in every table. Doing a delete like this is one important reason why.

In a perfect world, every table has a unique identifier of some sort.
In the absence of any unique column (or combination thereof), use the ctid column:
In-order sequence generation
How do I decompose ctid into page and row numbers?
DELETE FROM tbl
WHERE ctid NOT IN (
    SELECT min(ctid)                    -- ctid is NOT NULL by definition
    FROM tbl
    GROUP BY name, address, zipcode);   -- list columns defining duplicates
The above query is short, conveniently listing column names only once. NOT IN (SELECT ...) is a tricky query style when NULL values can be involved, but the system column ctid is never NULL. See:
Find records where join doesn't exist
Using EXISTS as demonstrated by @Gordon is typically faster. So is a self-join with the USING clause like @isapir added later. Both should result in the same query plan.
Important difference: These other queries treat NULL values as not equal, while GROUP BY (or DISTINCT or DISTINCT ON ()) treats NULL values as equal. Does not matter for columns defined NOT NULL. Else, depending on your definition of "duplicate", you'll need one approach or the other. Or use IS NOT DISTINCT FROM to compare values (which may exclude some indexes).
Disclaimer:
ctid is an implementation detail of Postgres, it's not in the SQL standard and can change between major versions without warning (even if that's very unlikely). Its values can change between commands due to background processes or concurrent write operations (but not within the same command).
Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
How to use the physical location of rows (ROWID) in a DELETE statement
Aside:
The target of a DELETE statement cannot be the CTE, only the underlying table. That's a spillover from SQL Server - as is your whole approach.
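To experiment with the NOT IN (SELECT min(ctid) ...) pattern without a Postgres instance at hand, SQLite's rowid serves as the analogous never-NULL row identifier; the table below is invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tbl (name TEXT, address TEXT, zipcode TEXT);
    INSERT INTO tbl VALUES
        ('James', 'main street', '123'),
        ('James', 'main street', '123'),
        ('James', 'void street',  '456'),
        ('Alice', 'union square', '123');
""")

con.execute("""
    DELETE FROM tbl
    WHERE rowid NOT IN (
        SELECT min(rowid)               -- rowid, like ctid, is never NULL
        FROM tbl
        GROUP BY name, address, zipcode -- list columns defining duplicates
    )
""")
print(con.execute("SELECT count(*) FROM tbl").fetchone()[0])  # 3
```

Only the second copy of the duplicated row is removed; the three distinct rows remain.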

Here is what I came up with, using a group by
DELETE FROM mytable
WHERE id NOT IN (
    SELECT MIN(id)
    FROM mytable
    GROUP BY name, address, zipcode
)
It deletes the duplicates, preserving the oldest record that has duplicates.

We can use a window function for very effective removal of duplicate rows:
DELETE FROM tab
WHERE id IN (SELECT id
             FROM (SELECT row_number() OVER (PARTITION BY column_with_duplicate_values), id
                   FROM tab) x
             WHERE x.row_number > 1);
A PostgreSQL-optimized version (with ctid):
DELETE FROM tab
WHERE ctid = ANY(ARRAY(SELECT ctid
                       FROM (SELECT row_number() OVER (PARTITION BY column_with_duplicate_values), ctid
                             FROM tab) x
                       WHERE x.row_number > 1));
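The window-function approach can be exercised with SQLite as well (3.25+ for window functions); `grp` below is a made-up stand-in for column_with_duplicate_values:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tab (id INTEGER, grp TEXT);
    INSERT INTO tab VALUES (1, 'a'), (2, 'a'), (3, 'b');
""")

# Number the rows within each duplicate group, then delete everything
# past the first row of each group.
con.execute("""
    DELETE FROM tab
    WHERE id IN (SELECT id
                 FROM (SELECT id,
                              row_number() OVER (PARTITION BY grp) AS rn
                       FROM tab)
                 WHERE rn > 1)
""")
print(con.execute("SELECT count(*) FROM tab").fetchone()[0])  # 2
```

With no ORDER BY inside the window, which copy survives is unspecified; add one if you care which row to keep.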

The valid syntax is specified at http://www.postgresql.org/docs/current/static/sql-delete.html
I would ALTER your table to add a unique auto-incrementing primary key id so that you can run a query like the following, which will keep the first of each set of duplicates (i.e. the one with the lowest id). Note that adding the key is a bit more complicated in Postgres than in some other DBs.
DELETE FROM mytable d USING (
    SELECT min(id) AS min_id, name, address, zip
    FROM mytable
    GROUP BY name, address, zip
    HAVING count(*) > 1
) AS k
WHERE d.id <> k.min_id
  AND d.name = k.name
  AND d.address = k.address
  AND d.zip = k.zip;

If you want a unique identifier for every row, you could just add one (a serial, or a guid), and treat it like a surrogate key.
CREATE TABLE thenames
( name text not null
, address text not null
, zipcode text not null
);
INSERT INTO thenames(name, address, zipcode) VALUES
  ('James', 'main street', '123')
, ('James', 'main street', '123')
, ('James', 'void street', '456')
, ('Alice', 'union square', '123')
;
SELECT * FROM thenames;

-- add a surrogate key
ALTER TABLE thenames
    ADD COLUMN seq serial NOT NULL PRIMARY KEY;

SELECT * FROM thenames;

DELETE FROM thenames del
WHERE EXISTS (
    SELECT * FROM thenames x
    WHERE x.name = del.name
      AND x.address = del.address
      AND x.zipcode = del.zipcode
      AND x.seq < del.seq
);

-- add the unique constraint, so that new duplicates cannot be created in the future
ALTER TABLE thenames
    ADD UNIQUE (name, address, zipcode);

SELECT * FROM thenames;
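The whole walkthrough can be replayed in SQLite with the same data; since SQLite cannot add a PRIMARY KEY via ALTER TABLE, an ordinary column filled from rowid stands in for the serial, and a unique index for the ADD UNIQUE constraint:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE thenames
    ( name    TEXT NOT NULL
    , address TEXT NOT NULL
    , zipcode TEXT NOT NULL
    );
    INSERT INTO thenames (name, address, zipcode) VALUES
        ('James', 'main street',  '123')
       ,('James', 'main street',  '123')
       ,('James', 'void street',  '456')
       ,('Alice', 'union square', '123');

    -- add a surrogate key (stand-in for the serial primary key)
    ALTER TABLE thenames ADD COLUMN seq INTEGER;
    UPDATE thenames SET seq = rowid;

    -- keep only the lowest seq of each duplicate set
    DELETE FROM thenames
    WHERE EXISTS (
        SELECT 1 FROM thenames x
        WHERE x.name    = thenames.name
          AND x.address = thenames.address
          AND x.zipcode = thenames.zipcode
          AND x.seq     < thenames.seq
    );

    -- stand-in for ADD UNIQUE: no new duplicates in the future
    CREATE UNIQUE INDEX thenames_uniq ON thenames (name, address, zipcode);
""")
print(con.execute("SELECT count(*) FROM thenames").fetchone()[0])  # 3
```

The duplicated James row collapses to one copy, and the unique index rejects any re-insertion of a duplicate.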

To remove duplicates (keep only one entry) from a table "tab" where data looks like this:
fk_id_1 | fk_id_2
--------+--------
     12 |      32
     12 |      32
     12 |      32
     15 |      37
     15 |      37
You can do this:
DELETE FROM tab WHERE ctid IN
    (SELECT ctid FROM
        (SELECT ctid, fk_id_1, fk_id_2,
                row_number() OVER (PARTITION BY fk_id_1, fk_id_2 ORDER BY fk_id_1) AS rnum
         FROM tab) t
     WHERE t.rnum > 1);
Where ctid is the physical location of the row within its table (therefore, a row identifier) and row_number is a window function that assigns a sequential integer to each row in a result set.
PARTITION groups the result set and the sequential integer is restarted for every group.

From the documentation on deleting duplicate rows:
A frequent question in IRC is how to delete rows that are duplicates over a set of columns, keeping only the one with the lowest ID.
This query does that for all rows of tablename having the same column1, column2, and column3.
DELETE FROM tablename
WHERE id IN (SELECT id
             FROM (SELECT id,
                          ROW_NUMBER() OVER (PARTITION BY column1, column2, column3
                                             ORDER BY id) AS rnum
                   FROM tablename) t
             WHERE t.rnum > 1);
Sometimes a timestamp field is used instead of an ID field.

For smaller tables, we can use the rowid pseudocolumn to delete duplicate rows.
You can use the query below:
DELETE FROM table1 t1
WHERE t1.rowid > (SELECT MIN(t2.rowid)
                  FROM table1 t2
                  WHERE t1.column = t2.column);

Related

Unable to delete duplicate rows with PostgreSQL

My query deletes the whole table instead of duplicate rows.
Video as proof: https://streamable.com/3s843
create table customer_info (
    id INT,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    phone_number VARCHAR(50)
);

insert into customer_info (id, first_name, last_name, phone_number) values
(1, 'Kevin', 'Binley', '600-449-1059'),
(1, 'Kevin', 'Binley', '600-449-1059'),
(2, 'Skippy', 'Lam', '779-278-0889');
My query:
with t1 as (
    select *, row_number() over(partition by id order by id) as rn
    from customer_info
)
delete
from customer_info
where id in (select id from t1 where rn > 1);
Your query would delete all rows from each set of dupes (since all of them share the same id, by which you select; that's what @wildplasser hinted at with subtle comments), and only initially unique rows would survive. So if it "deletes the whole table", that means there were no unique rows at all.
In your query, dupes are defined by (id) alone, not by the whole row as your title suggests.
Either way, there is a remarkably simple solution:
DELETE FROM customer_info c
WHERE EXISTS (
    SELECT FROM customer_info c1
    WHERE ctid < c.ctid
      AND c1 = c   -- comparing whole rows
);
Since you deal with completely identical rows, the remaining way to tell them apart is the internal tuple ID ctid.
My query deletes all rows, where an identical row with a smaller ctid exists. Hence, only the "first" row from each set of dupes survives.
Notably, NULL values compare equal in this case - which is most probably as desired. The manual:
The SQL specification requires row-wise comparison to return NULL if
the result depends on comparing two NULL values or a NULL and a
non-NULL. PostgreSQL does this only when comparing the results of two
row constructors (as in Section 9.23.5) or comparing a row constructor
to the output of a subquery (as in Section 9.22). In other contexts
where two composite-type values are compared, two NULL field values
are considered equal, [...]
If dupes are defined by id alone (as your query suggests), then this would work:
DELETE FROM customer_info c
WHERE EXISTS (
    SELECT FROM customer_info c1
    WHERE ctid < c.ctid
      AND id = c.id
);
But then there might be a better way to decide which rows to keep than ctid as a measure of last resort!
Obviously, you would then add a PRIMARY KEY to avoid the initial dilemma from reappearing. For the second interpretation, id is the candidate.
Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
About ctid:
How do I decompose ctid into page and row numbers?
You can't if the table does not have a key.
Tables have "keys" that identify each row uniquely. If your table does not have any key, then you won't be able to identify one row from the other one.
The only workaround to delete duplicate rows I can think of would be to:
Add a key on the table.
Use the key to delete the rows that are in excess.
For example:
create sequence seq1;
alter table customer_info add column k1 int;
update customer_info set k1 = nextval('seq1');

delete from customer_info where k1 in (
    select k1
    from (
        select k1,
               row_number() over(partition by id, first_name, last_name, phone_number) as rn
        from customer_info
    ) x
    where rn > 1
);
Now you only have two rows.

How to remove duplicate entries in postgresql table?

I have a postgresql table without primary key.
I want to remove all entries that have the same id, but retain the most recent one.
The following statement almost works:
DELETE FROM mytable USING mytable t
WHERE mytable.id = t.id AND mytable.modification < t.modification;
Problem: when two entries have the same modification timestamp (which is possible), both are retained.
What would I have to change to just keep one of them, does not matter which one?
I cannot change the condition to AND mytable.modification <= t.modification; as this would then remove all duplicates, not retaining any entry.
If you have rows that are complete duplicates (i.e., no way to distinguish one from the other), then you have two options. One is to use a built-in row identifier such as ctid:
DELETE FROM mytable USING mytable t
WHERE mytable.id = t.id AND
      (mytable.modification < t.modification OR
       mytable.modification = t.modification AND mytable.ctid < t.ctid);
Or use a secondary table:
create table tokeep as
    select distinct on (t.id) t.*
    from mytable t
    order by t.id, t.modification desc;

truncate table mytable;

insert into mytable
    select * from tokeep;
Use EXISTS to see if there are other rows with same id:
DELETE FROM mytable t
WHERE EXISTS (SELECT 1 FROM mytable
              WHERE id = t.id AND modification > t.modification);
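A quick way to convince yourself the ctid tie-break works is SQLite's rowid; the invented data below has two rows sharing the newest timestamp for the same id:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE mytable (id INTEGER, modification INTEGER);
    INSERT INTO mytable VALUES (1, 10), (1, 10), (1, 9), (2, 5);
""")

# Keep, per id, the row with the newest modification; on a timestamp
# tie, the larger rowid (ctid in Postgres) decides which copy survives.
con.execute("""
    DELETE FROM mytable
    WHERE EXISTS (
        SELECT 1 FROM mytable t
        WHERE t.id = mytable.id
          AND (t.modification > mytable.modification OR
               (t.modification = mytable.modification AND t.rowid > mytable.rowid))
    )
""")
print(con.execute("SELECT id, modification FROM mytable ORDER BY id").fetchall())
# [(1, 10), (2, 5)]
```

Exactly one of the two tied (1, 10) rows remains, which is what the plain `modification < t.modification` condition alone could not guarantee.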

How can I delete one of two perfectly identical rows?

I am cleaning out a database table without a primary key (I know, I know, what were they thinking?). I cannot add a primary key, because there is a duplicate in the column that would become the key. The duplicate value comes from one of two rows that are in all respects identical. I can't delete the row via a GUI (in this case MySQL Workbench, but I'm looking for a database-agnostic approach) because it refuses to perform tasks on tables without primary keys (or at least a UQ NN column).
How can I delete one of the twins?
SET ROWCOUNT 1
DELETE FROM [table] WHERE ....
SET ROWCOUNT 0
This will only delete one of the two identical rows
One option to solve your problem is to create a new table with the same schema, and then do:
INSERT INTO new_table (SELECT DISTINCT * FROM old_table)
and then just rename the tables.
You will of course need roughly as much free disk space as the table currently occupies to do this!
It's not efficient, but it's incredibly simple.
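The copy-distinct-and-swap approach can be sketched end to end in SQLite (the table names are placeholders):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE old_table (name TEXT, zipcode TEXT);
    INSERT INTO old_table VALUES ('x', '1'), ('x', '1'), ('y', '2');

    -- copy the de-duplicated rows, then swap the tables
    CREATE TABLE new_table AS SELECT DISTINCT * FROM old_table;
    DROP TABLE old_table;
    ALTER TABLE new_table RENAME TO old_table;
""")
print(con.execute("SELECT count(*) FROM old_table").fetchone()[0])  # 2
```

Note that indexes, constraints, and triggers on the original table are not carried over by CREATE TABLE AS; they would have to be recreated after the rename.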
Note that MySQL has its own extension of DELETE, which is DELETE ... LIMIT, which works in the usual way you'd expect from LIMIT: http://dev.mysql.com/doc/refman/5.0/en/delete.html
The MySQL-specific LIMIT row_count option to DELETE tells the server
the maximum number of rows to be deleted before control is returned to
the client. This can be used to ensure that a given DELETE statement
does not take too much time. You can simply repeat the DELETE
statement until the number of affected rows is less than the LIMIT
value.
Therefore, you could use DELETE FROM some_table WHERE x="y" AND foo="bar" LIMIT 1; note that there isn't a simple way to say "delete everything except one" - just keep checking whether you still have row duplicates.
delete top(1) works on Microsoft SQL Server (T-SQL).
This can be accomplished using a CTE and the ROW_NUMBER() function, as below:
/* Sample Data */
CREATE TABLE #dupes (ID INT, DWCreated DATETIME2(3))
INSERT INTO #dupes (ID, DWCreated) SELECT 1, '2015-08-03 01:02:03.456'
INSERT INTO #dupes (ID, DWCreated) SELECT 2, '2014-08-03 01:02:03.456'
INSERT INTO #dupes (ID, DWCreated) SELECT 1, '2013-08-03 01:02:03.456'
/* Check sample data - returns three rows, with two rows for ID#1 */
SELECT * FROM #dupes
/* CTE to give each row that shares an ID a unique number */
;WITH toDelete AS
(
    SELECT ID, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY DWCreated) AS RN
    FROM #dupes
)
/* Delete any row that is not the first instance of an ID */
DELETE FROM toDelete WHERE RN > 1
/* Check the results: ID is now unique */
SELECT * FROM #dupes
/* Clean up */
DROP TABLE #dupes
Having a column to ORDER BY is handy, but not necessary unless you have a preference for which of the rows to delete. This will also handle all instances of duplicate records, rather than forcing you to delete one row at a time.
For PostgreSQL you can do this:
DELETE FROM tablename
WHERE id IN (SELECT id
             FROM (SELECT id,
                          ROW_NUMBER() OVER (PARTITION BY column1, column2, column3
                                             ORDER BY id) AS rnum
                   FROM tablename) t
             WHERE t.rnum > 1);
column1, column2, column3 make up the column set that defines a duplicate.
Reference here.
This works for PostgreSQL
DELETE FROM tablename
WHERE id = 123
  AND ctid IN (SELECT ctid FROM tablename WHERE id = 123 LIMIT 1)
Have you tried LIMIT 1? It will only delete one of the rows that match your DELETE query:
DELETE FROM `table_name` WHERE `column_name`='value' LIMIT 1;
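Stock SQLite builds lack DELETE ... LIMIT, but the same delete-exactly-one effect falls out of a rowid subquery, much like the Postgres ctid trick above; the table and column below are toy placeholders:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (column_name TEXT);
    INSERT INTO t VALUES ('value'), ('value');
""")

# Pick one matching row's rowid and delete just that row.
con.execute("""
    DELETE FROM t
    WHERE rowid = (SELECT rowid FROM t
                   WHERE column_name = 'value' LIMIT 1)
""")
print(con.execute("SELECT count(*) FROM t").fetchone()[0])  # 1
```

One of the two identical rows is removed; which one is unspecified, which is fine since they are indistinguishable anyway.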
In my case I could get the GUI to give me a string of values of the row in question (alternatively, I could have done this by hand). On the suggestion of a colleague, in whose debt I remain, I used this to create an INSERT statement:
INSERT INTO some_table
VALUES ('ID1219243408800307444663', '2004-01-20 10:20:55', 'INFORMATION', 'admin', ...);
I tested the insert statement, so that I now had triplets. Finally, I ran a simple DELETE to remove all of them...
DELETE FROM some_table WHERE logid = 'ID1219243408800307444663';
followed by the INSERT one more time, leaving me with a single row, and the bright possibilities of a primary key.
In case you can add a column, like
ALTER TABLE yourtable ADD IDCOLUMN bigint NOT NULL IDENTITY (1, 1)
do so.
Then count rows grouped by your problem column, keeping those with count > 1; this will identify your twins (or triplets, or whatever).
Then select the rows whose problem column equals the identified content and check the IDs in IDCOLUMN.
Delete from your table where IDCOLUMN equals one of those IDs.
You could use a max, which was relevant in my case.
DELETE FROM [table] where id in
    (select max(id) from [table]
     group by id, col2, col3
     having count(id) > 1)
Be sure to test your results first, and have a limiting condition in your HAVING clause. With such a sweeping delete query you might want to back up your database first.
DELETE TOP (1) FROM tableName
WHERE -- your conditions for filtering identical rows
I added a Guid column to the table and set it to generate a new id for each row. Then I could delete the rows using a GUI.
In PostgreSQL there is an implicit column called ctid. See the wiki. So you are free to use the following:
WITH cte1 AS (
    SELECT unique_column, max(ctid) AS max_ctid
    FROM table_1
    GROUP BY unique_column
    HAVING count(*) > 1
), cte2 AS (
    SELECT t.ctid AS target_ctid
    FROM table_1 t
    JOIN cte1 USING (unique_column)
    WHERE t.ctid != max_ctid
)
DELETE FROM table_1
WHERE ctid IN (SELECT target_ctid FROM cte2)
I'm not sure how safe it is to use this when there is a possibility of concurrent updates. So one may find it sensible to make a LOCK TABLE table_1 IN ACCESS EXCLUSIVE MODE; before actually doing the cleanup.
In case there are multiple duplicate rows to delete, and all fields are identical (no differing id, and the table has no primary key), one option is to save the duplicate rows once with DISTINCT in a new table, delete all duplicate rows, and insert the rows back. This is helpful if the table is really big and the number of duplicate rows is small.
--- col1, col2 ... coln are the table columns that are relevant.
--- If not sure, add all columns of the table in the select below and in the where clauses later.
--- Make a copy of table T first to be sure you can roll back at any time, if possible.
--- Check @@ROWCOUNT to be sure it's what you want.
--- Use transactions and roll back in case there is an error.

--- First find all duplicate rows that are identical; this statement could be joined
--- with the next one if you choose all columns.
select col1, col2,       --- other columns as needed
       count(*) c
into temp_duplicate
from T
group by col1, col2
having count(*) > 1

--- Save all the rows that are identical, only once (DISTINCT).
select distinct T.*
into temp_insert
from T, temp_duplicate D
where T.col1 = D.col1
  and T.col2 = D.col2    --- and other columns if needed

--- Delete all the rows that are duplicated.
delete T
from T, temp_duplicate D
where T.col1 = D.col1
  and T.col2 = D.col2    --- and other columns if needed

--- Add the duplicate rows back, now only once.
insert into T
select * from temp_insert

--- Drop the temp tables after you check all is ok.
If, like me, you don't want to have to list out all the columns of the database, you can convert each row to JSONB and compare by that.
(NOTE: This is incredibly inefficient - be careful!)
select to_jsonb(a.*), to_jsonb(b.*)
from table a
left join table b
    on a.entry_date < b.entry_date
where (SELECT NOT exists(
    SELECT
    FROM jsonb_each_text(to_jsonb(a.*) - 'unwanted_column') t1
    FULL OUTER JOIN jsonb_each_text(to_jsonb(b.*) - 'unwanted_column') t2 USING (key)
    WHERE t1.value <> t2.value OR t1.key IS NULL OR t2.key IS NULL
))
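The same "compare whole rows without typing the column list" idea can also be done client-side; here is a sketch in Python over SQLite that keys each row on its full value tuple (note that Python tuples, like the JSONB comparison, treat two NULL fields as equal):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tbl (a TEXT, b TEXT, c TEXT);
    INSERT INTO tbl VALUES ('x', '1', NULL), ('x', '1', NULL), ('y', '2', 'q');
""")

# Fetch every row with its rowid, group by the full value tuple, and
# delete all but the first rowid of each identical group.
seen, doomed = set(), []
for rowid, *values in con.execute("SELECT rowid, * FROM tbl"):
    key = tuple(values)
    if key in seen:
        doomed.append((rowid,))
    else:
        seen.add(key)
con.executemany("DELETE FROM tbl WHERE rowid = ?", doomed)
print(con.execute("SELECT count(*) FROM tbl").fetchone()[0])  # 2
```

Like the JSONB query, this avoids naming any column, at the cost of pulling every row over to the client; don't do it on a huge table.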
Suppose we want to delete duplicate records, keeping only one unique record each, from an Employee table, Employee(id, name, age), where id is unique:
delete from Employee
where id not in (select MAX(id)
                 from Employee
                 group by name, age
                );
You can use limit 1
This works perfectly for me with MySQL
delete from `your_table` [where condition] limit 1;
DELETE FROM Table_Name
WHERE ID NOT IN
(
    SELECT MAX(ID) AS MaxRecordID
    FROM Table_Name
    GROUP BY [FirstName], [LastName], [Country]
);

How do I find duplicated values in related table and update them

This is my situation.
TABLE1:
DOCUMENT_ID,
GUID
TABLE2:
DOCUMENT_ID,
FILE
The tables are joined by DOCUMENT_ID, meaning that TABLE2 can have one or many rows with the same DOCUMENT_ID.
My problem is that TABLE2 values for whole bunch of DOCUMENT_ID have same FILE values.
I need a SQL query that will get me all GUID and count how many rows in TABLE2 for this DOCUMENT_ID have EXACTLY THE SAME FILE value (so that I can copy the GUID to Excel).
Then I need to UPDATE TABLE2's FILE columns for these cases.
For instance if DOCUMENT_ID has three rows in TABLE2 with same FILE value, I need to update two of them by adding a postfix like FILEVALUE-1, FILEVALUE-2 and so on.
Hope I make sense.
To all experts thank you in advance.
To get the duplicates you might employ an old-fashioned GROUP BY:
select table1.guid, table1.document_id, table2.[file], count(*) cnt
from table1
inner join table2
    on table1.document_id = table2.document_id
group by table1.guid, table1.document_id, table2.[file]
having count(*) > 1
To directly update duplicates, you might use CTE:
; with t2 as (
    select id,
           [file],
           row_number() over (partition by document_id, [file]
                              order by id) rn
    from table2
)
update t2
set [file] = [file] + '-' + convert(varchar(10), rn - 1)
where t2.rn > 1
Note that I've added ID as a placeholder for primary key. You need a way to identify a record to be updated.
There is a live test at SQL Fiddle.
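To see the renaming in action without SQL Server at hand, here is an equivalent sketch over SQLite (3.25+ for window functions): the ranks are computed with the same window function, then the suffixes are applied row by row from Python, using rowid as the stand-in for the primary key placeholder mentioned above. Table and column names follow the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table2 (document_id INTEGER, file TEXT);
    INSERT INTO table2 VALUES
        (1, 'FILEVALUE'), (1, 'FILEVALUE'), (1, 'FILEVALUE'), (2, 'OTHER');
""")

# Rank duplicates within each (document_id, file) group, then append
# a -1, -2, ... suffix to every copy after the first.
rows = con.execute("""
    SELECT rowid, file,
           row_number() OVER (PARTITION BY document_id, file
                              ORDER BY rowid) AS rn
    FROM table2
""").fetchall()
con.executemany(
    "UPDATE table2 SET file = ? WHERE rowid = ?",
    [(f"{f}-{rn - 1}", rid) for rid, f, rn in rows if rn > 1],
)
print([r[0] for r in con.execute("SELECT file FROM table2 ORDER BY rowid")])
# ['FILEVALUE', 'FILEVALUE-1', 'FILEVALUE-2', 'OTHER']
```

Computing the ranks first and updating afterwards sidesteps the question of whether an UPDATE may reference a window function directly, which many engines disallow.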
This will get you all FILE values that appear with more than one DOCUMENT_ID:
Select FILE, COUNT(DOCUMENT_ID) as DOCUMENT_ID
from table2
group by FILE
Having count(DOCUMENT_ID) > 1
You can use CTE to find out duplicate value from TABLE2:
WITH CTE_1 (DOCUMENT_ID, FILE, DuplicateCount)
AS
(
    SELECT DOCUMENT_ID, FILE,
           ROW_NUMBER() OVER (PARTITION BY DOCUMENT_ID, FILE ORDER BY DOCUMENT_ID) AS DuplicateCount
    FROM table2
)
select *
FROM CTE_1
WHERE DuplicateCount > 1
I have one approach in mind, but I am not sure whether it is feasible at your end. But let me assure you, it is a very effective approach. You can create a table having an identity column and insert your entire data into that table. From there on, handling any duplicate data is child's play.
There are two ways of adding an identity column to a table with existing data:
Create a new table with identity, copy data to this new table then drop the existing table followed by renaming the temp table.
Create a new column with identity & drop the existing column
For reference, I have found two articles:
http://blog.sqlauthority.com/2009/05/03/sql-server-add-or-remove-identity-property-on-column/
http://cavemansblog.wordpress.com/2009/04/02/sql-how-to-add-an-identity-column-to-a-table-with-data/

How to delete duplicate rows from an Oracle Database?

We have a table that has had the same data inserted into it twice by accident meaning most (but not all) rows appears twice in the table. Simply put, I'd like an SQL statement to delete one version of a row while keeping the other; I don't mind which version is deleted as they're identical.
Table structure is something like:
FID, unique_ID, COL3, COL4....
Unique_ID is the primary key, meaning each one appears only once.
FID is a key that is unique to each feature, so if it appears more than once then the duplicates should be deleted.
To select features that have duplicates would be:
select count(*) from TABLE GROUP by FID
Unfortunately I can't figure out how to go from that to a SQL delete statement that will delete extraneous rows leaving only one of each.
This sort of question has been asked before, and I've tried the create-table-with-distinct approach, but how do I get all columns without naming them? The statement below only gets the single column FID, and itemising all the columns to keep gives: ORA-00936: missing expression
CREATE TABLE secondtable NOLOGGING as select distinct FID from TABLE
If you don't care which row is retained
DELETE FROM your_table_name a
WHERE EXISTS (SELECT 1
              FROM your_table_name b
              WHERE a.fid = b.fid
                AND a.unique_id < b.unique_id)
Once that's done, you'll want to add a constraint to the table that ensures that FID is unique.
Try this
DELETE FROM table_name A
WHERE ROWID > (SELECT min(rowid)
               FROM table_name B
               WHERE A.FID = B.FID)
A suggestion
DELETE FROM x WHERE ROWID IN
    (WITH y AS (SELECT xCOL, MIN(ROWID) AS min_rowid
                FROM x
                GROUP BY xCOL
                HAVING COUNT(xCOL) > 1)
     SELECT x.ROWID FROM x, y
     WHERE x.XCOL = y.XCOL AND x.ROWID <> y.min_rowid)
Try with this.
DELETE FROM firsttable WHERE unique_ID NOT IN
    (SELECT MAX(unique_ID) FROM firsttable GROUP BY FID)
EDIT:
One explanation:
SELECT MAX(unique_ID) FROM firsttable GROUP BY FID;
This SQL statement picks the maximum unique_ID row from each group of duplicate rows. The delete statement then keeps these maximum unique_ID rows and deletes the other rows of each duplicate group.
You can try this.
delete from tablename a
where (a.logid, a.pointid, a.routeid) in (select logid, pointid, routeid
                                          from tablename
                                          group by logid, pointid, routeid
                                          having count(*) > 1)
  and rowid not in (select min(rowid)
                    from tablename
                    group by logid, pointid, routeid
                    having count(*) > 1)