Delete duplicate records, keep one - SQL

I have a temp table populated by a COPY from a CSV file, and the result includes some duplicate ids. I need to delete the duplicates. I have tried the following:
delete from my_table where id in
    (select id
     from (select count(*) as count, id
           from my_table
           group by id) as counts
     where count > 1);
However, this deletes both copies of each duplicated record, and I must keep one.
How can I delete only the second record with a duplicated id?
Thanks.

Your query deletes all IDs that have a count greater than 1, so it removes every row that is duplicated. What you need to do is isolate one record from each set of duplicates and preserve it:
delete from my_table
where id in (select id
             from my_table
             where some_field in (select some_field
                                  from my_table
                                  group by some_field
                                  having count(id) > 1))
  and id not in (select min(id)
                 from my_table
                 where some_field in (select some_field
                                      from my_table
                                      group by some_field
                                      having count(id) > 1)
                 group by some_field);

Assuming you don't have foreign key relations...
CREATE TABLE "temp"(*column definitions*);
insert into "temp" (*column definitions*)
select *column definitions*
from (
select *,row_number() over(PARTITION BY id) as rn from "yourtable"
) tm
where rn=1;
drop table "yourtable";
alter table "temp" rename to "yourtable";


How to update an executed query's result into the same table?

I have created a table tbl_Dist with the columns District and DistCode. There were many duplicate values in the District column, so I filtered them out using this statement:
select distinct District from tbl_Dist;
That works, but I don't see how to update the table tbl_Dist itself with the results of the above query.
You can do it as below:
-- Move distinct data into a temp table
SELECT DISTINCT District INTO TmpTable FROM tbl_Dist
-- Delete all data
DELETE FROM tbl_Dist
-- Insert data from temp table
INSERT INTO tbl_Dist
SELECT * FROM TmpTable
Update:
Firstly, run this query. You will have a temp table with the distinct data from the main table (tbl_Dist):
-- Move distinct data into a temp table
SELECT DISTINCT District INTO TmpTable FROM tbl_Dist
Then run the query below to delete all data:
DELETE FROM tbl_Dist
Finally, run the query below to insert all the distinct data back into the main table:
-- Insert data from temp table
INSERT INTO tbl_Dist
SELECT * FROM TmpTable
You need a DELETE, not an UPDATE:
;with cte as
(
    Select row_number() over (partition by District order by (select null)) as rn, *
    From yourtable
)
Delete from cte where rn > 1
To check the records that will be deleted, use this:
;with cte as
(
    Select row_number() over (partition by District order by (select null)) as rn, *
    From yourtable
)
Select * from cte where rn > 1
If you want to keep this logic around, you can put the query in a view and then write the delete through that view; the base table will be updated.
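A sketch of that idea (the view name vw_tbl_Dist_dups is an assumption, and it relies on your SQL Server version permitting DML through a view that contains a window function):

CREATE VIEW vw_tbl_Dist_dups AS
SELECT row_number() over (partition by District order by (select null)) as rn, *
FROM tbl_Dist;

-- Deleting through the view removes the duplicate rows from the base table.
DELETE FROM vw_tbl_Dist_dups WHERE rn > 1;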
Try this script. A self-join is needed so that each row is compared against the other rows sharing its District (comparing a column with itself, as in District = District, never qualifies a row for deletion):
DELETE d1
FROM tbl_Dist d1
INNER JOIN tbl_Dist d2
    ON  d2.District = d1.District
    AND d2.DistCode < d1.DistCode
This keeps the row with the smallest DistCode for each District.

Deleting duplicate rows from Redshift

I am trying to delete some duplicate data in my Redshift table.
Below is my query:
With duplicates As
(
    Select *, ROW_NUMBER() Over (PARTITION by record_indicator
                                 Order by record_indicator) as Duplicate
    From table_name
)
delete from duplicates
Where Duplicate > 1;
This query gives me an error:
Amazon Invalid operation: syntax error at or near "delete";
I'm not sure what the issue is, as the syntax of the WITH clause seems correct.
Has anybody faced this situation before?
Redshift being what it is (no enforced uniqueness for any column), Ziggy's third option is probably best. Once you decide to go the temp table route, it is more efficient to swap things out whole; deletes and inserts are expensive in Redshift.
begin;
create table table_name_new as select distinct * from table_name;
alter table table_name rename to table_name_old;
alter table table_name_new rename to table_name;
drop table table_name_old;
commit;
If space isn't an issue, you can keep the old table around for a while and use the other methods described here to validate that the row count of the original, once duplicates are accounted for, matches the row count of the new table.
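A sketch of that validation (assuming you skip the drop table table_name_old step above so the old table is still around):

SELECT
    (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM table_name_old) d) AS old_distinct_rows,
    (SELECT COUNT(*) FROM table_name) AS new_rows;
-- The two counts should match if only exact duplicates were removed.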
If you're doing constant loads to such a table you'll want to pause that process while this is going on.
If the number of duplicates is a small percentage of a large table, you might want to try copying the distinct records of the duplicates to a temp table, then deleting all records from the original that join to the temp table, and finally appending the temp table back to the original. Make sure you VACUUM the original table afterward (which you should be doing for large tables on a schedule anyway).
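For reference, the post-cleanup step would look like this (table name assumed):

VACUUM table_name;   -- reclaim space and re-sort rows after the deletes
ANALYZE table_name;  -- refresh the planner statistics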
If you're dealing with a lot of data, it's not always possible or smart to recreate the whole table. It may be easier to locate and delete just those rows:
-- First identify all the rows that are duplicate
CREATE TEMP TABLE duplicate_saleids AS
SELECT saleid
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
GROUP BY saleid
HAVING COUNT(*) > 1;
-- Extract one copy of all the duplicate rows
CREATE TEMP TABLE new_sales(LIKE sales);
INSERT INTO new_sales
SELECT DISTINCT *
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Remove all rows that were duplicated (all copies).
DELETE FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
AND saleid IN(
SELECT saleid
FROM duplicate_saleids
);
-- Insert back in the single copies
INSERT INTO sales
SELECT *
FROM new_sales;
-- Cleanup
DROP TABLE duplicate_saleids;
DROP TABLE new_sales;
COMMIT;
Full article: https://elliot.land/post/removing-duplicate-data-in-redshift
That should have worked. Alternatively, you can do:
With duplicates As (
    Select *, ROW_NUMBER() Over (PARTITION by record_indicator
                                 Order by record_indicator) as Duplicate
    From table_name)
delete from table_name
where id in (select id from duplicates Where Duplicate > 1);
or
delete from table_name
where id in (
    select id
    from (
        Select id, ROW_NUMBER() Over (PARTITION by record_indicator
                                      Order by record_indicator) as Duplicate
        From table_name) x
    Where Duplicate > 1);
If you have no primary key, you can do the following:
BEGIN;
CREATE TEMP TABLE mydups ON COMMIT DROP AS
SELECT DISTINCT ON (record_indicator) *
FROM table_name
ORDER BY record_indicator --, other_optional_priority_field DESC
;
DELETE FROM table_name
WHERE record_indicator IN (
SELECT record_indicator FROM mydups);
INSERT INTO table_name SELECT * FROM mydups;
COMMIT;
This method preserves the permissions and the table definition of the original_table; the most upvoted answer preserves neither. In a real-world production environment, this method is how you should do it, as it is the safest and easiest to execute. Note that this approach will have DOWN TIME in PROD.
Create Table with unique rows
CREATE TABLE unique_table as
(
SELECT DISTINCT * FROM original_table
)
;
Backup the original_table
CREATE TABLE backup_table as
(
SELECT * FROM original_table
)
;
Truncate the original_table
TRUNCATE original_table;
Insert records from unique_table into original_table
INSERT INTO original_table
(
SELECT * FROM unique_table
)
;
To avoid the DOWN TIME, run the queries below in a TRANSACTION, and use DELETE instead of TRUNCATE (TRUNCATE commits the transaction it runs in, so it cannot be rolled back):
BEGIN transaction;
CREATE TABLE unique_table as
(
SELECT DISTINCT * FROM original_table
)
;
CREATE TABLE backup_table as
(
SELECT * FROM original_table
)
;
DELETE FROM original_table;
INSERT INTO original_table
(
SELECT * FROM unique_table
)
;
END transaction;
A simple answer to this question:
Firstly, create a temporary table from the main table containing only the rows where row_number = 1.
Secondly, delete from the main table all the rows on which we had duplicates.
Then insert the rows of the temporary table into the main table.
Queries:
Temporary table
select id, date   -- include every column you need here; only these will be re-inserted later
into #temp_a
from (select *
      from (select a.*,
                   row_number() over (partition by id order by etl_createdon desc) as rn
            from table a
            where a.id between 59 and 75 and a.date = '2018-05-24') t
      where rn = 1) a
Deleting the matching rows from the main table:
delete from table a
where a.id between 59 and 75 and a.date = '2018-05-24'
Inserting all values from the temp table back into the main table:
insert into table select * from #temp_a
The following deletes all records in 'tablename' that have a duplicate; it will not deduplicate the table. Because the duplicate copies share the same id values, the outer DELETE matches every copy, including the rows with rnum = 1:
DELETE FROM tablename
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY column1, column2, column3 ORDER BY id) AS rnum
        FROM tablename
    ) t
    WHERE t.rnum > 1);
Your query does not work because Redshift does not allow DELETE after the WITH clause; only SELECT, UPDATE, and a few others are allowed (see the WITH clause documentation).
Solution (in my situation):
My table events has an id column that logically identifies each record (the same as your record_indicator), but it contained duplicate rows, i.e. the same id appeared more than once.
Unfortunately I was unable to create a temporary table, because I ran into the following error using SELECT DISTINCT:
ERROR: Intermediate result row exceeds database block size
But this worked like a charm:
CREATE TABLE temp as (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rownumber
    FROM events
);
resulting in the temp table:
id | rownumber | ...
---+-----------+----
 1 |         1 | ...
 1 |         2 | ...
 2 |         1 | ...
 2 |         2 | ...
Now the duplicates can be deleted by removing the rows having rownumber larger than 1:
DELETE FROM temp WHERE rownumber > 1
After that, rename the tables and you're done.
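The rename itself isn't shown above; here is a sketch of that final swap (dropping the helper rownumber column first is an assumption about the desired end state):

ALTER TABLE temp DROP COLUMN rownumber;  -- remove the helper column
ALTER TABLE events RENAME TO events_old;
ALTER TABLE temp RENAME TO events;
DROP TABLE events_old;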
with duplicates as
(
    select a.*, row_number() over (partition by first_name, last_name, email
                                   order by first_name, last_name, email) as rn
    from contacts a
)
delete from contacts
where contact_id in (
    select contact_id from duplicates where rn > 1
)

Oracle: Why can't I rely on ROWNUM in a DELETE clause?

I have a statement like this:
SELECT MIN(ROWNUM) FROM my_table
GROUP BY NAME
HAVING COUNT(NAME) > 1;
This statement gives me the ROWNUM of the first duplicate, but when I transform it into a DELETE it just deletes everything. Why does this happen?
This is because ROWNUM is a pseudocolumn: it does not exist physically, and its values are assigned anew each time a query runs, so the numbers your SELECT returned do not identify the same rows when the DELETE executes. You should use ROWID instead to delete the records.
To remove the duplicates you can try this:
DELETE FROM mytable a
WHERE EXISTS (SELECT 1
              FROM mytable b
              WHERE a.id = b.id
                AND a.name = b.name
                AND a.rowid > b.rowid)
Using ROWNUM to delete duplicate records does not make much sense. If you need to delete duplicate rows, leaving only one row for each value of name, try the following:
DELETE FROM mytable
WHERE ROWID IN (SELECT ID
                FROM (SELECT ROWID ID, ROW_NUMBER() OVER
                        (PARTITION BY name ORDER BY name) numRows FROM mytable
                     )
                WHERE numRows > 1)
By adding further columns to the ORDER BY clause, you can choose to delete the record with the greatest or smallest ID, or order by some other field.
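For example, a sketch assuming the table also has a numeric id column: ordering the window by id DESC keeps the row with the greatest id per name, since that row receives numRows = 1 and survives the delete:

DELETE FROM mytable
WHERE ROWID IN (SELECT ID
                FROM (SELECT ROWID ID, ROW_NUMBER() OVER
                        (PARTITION BY name ORDER BY id DESC) numRows FROM mytable
                     )
                WHERE numRows > 1)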

Delete rows with duplicate value in one column but different value in other columns

I have a table with 3 columns:
ID, name, role
Some names are duplicated, but each row has a unique ID. How do I delete all the rows with a duplicated name from my table (not all of them; leave one for each name)?
Group by the name and select the lowest id. Then delete all records that are not in that list:
delete from your_table
where id not in
(
    select min(id)
    from your_table
    group by name
)
And if you use MySQL, you need another subquery, since MySQL does not allow you to delete from the same table you are selecting from:
delete from your_table
where id not in
(
    select * from
    (
        select min(id)
        from your_table
        group by name
    ) tmp_tbl
)

Send faulty rows to another table

I have a table with many columns in which I have to find duplicates based on one column, i.e. if I find duplicate customer_name values in the Customer_name column, then I have to:
remove all the repeating rows from the source table (keeping one of each), and
send all those rows to another table with the same structure.
If you have two tables like this:
CREATE TABLE t1 (ID int, customerName varchar(64))
CREATE TABLE t2 (ID int, customerName varchar(64))
you can do something like the following. (The ID column is just there as a basis for the decision about which row to keep; change it as you need.)
--First Copy
WITH CTE_T1 AS
(
    SELECT
        ID,
        customerName,
        ROW_NUMBER() OVER (PARTITION BY customerName ORDER BY ID) as OrderOfCustomer
    FROM t1
)
INSERT INTO t2
SELECT ID, customerName FROM CTE_T1
WHERE OrderOfCustomer > 1;

--Then Delete
WITH CTE_T1 AS
(
    SELECT
        ID,
        customerName,
        ROW_NUMBER() OVER (PARTITION BY customerName ORDER BY ID) as OrderOfCustomer
    FROM t1
)
DELETE FROM CTE_T1
WHERE OrderOfCustomer > 1
Here is an SQLFiddle to show how it works.
I guess each row has a unique Id primary key.
This inserts the duplicated rows into your duplicates table:
Insert into duplicatesTable
select * from myTable t1
where (select count(*) from myTable t2 where t1.customerId = t2.customerId) > 1
Then you delete the good rows from duplicatesTable, i.e. the one copy per customerId that should stay in the source table (keeping the row with the lowest id is an assumption; use whichever rule marks the non-faulty row):
delete from duplicatesTable
where id in (select min(id)          -- this is not the faulty row for each customerId
             from duplicatesTable
             group by customerId)
Finally you delete from your first table:
delete from myTable
where id IN (select id from duplicatesTable)
Try this:
For moving duplicates:
INSERT INTO DuplicatesTable
SELECT *
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY Customer_name ORDER BY Customer_name) AS RowID
      FROM SourceTable) as temp
WHERE RowID > 1
-- Note: SELECT * carries the extra RowID column along; list the original
-- columns explicitly if DuplicatesTable does not have a matching column.
For deleting:
WITH TableCTE AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Customer_name ORDER BY Customer_name) AS RowID
    FROM SourceTable
)
DELETE FROM TableCTE
WHERE RowID > 1