Unable to delete duplicate rows with PostgreSQL

My query deletes the whole table instead of duplicate rows.
Video as proof: https://streamable.com/3s843
create table customer_info (
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
phone_number VARCHAR(50)
);
insert into customer_info (id, first_name, last_name, phone_number) values
(1, 'Kevin', 'Binley', '600-449-1059'),
(1, 'Kevin', 'Binley', '600-449-1059'),
(2, 'Skippy', 'Lam', '779-278-0889');
My query:
with t1 as (
select *, row_number() over(partition by id order by id) as rn
from customer_info)
delete
from customer_info
where id in (select id from t1 where rn > 1);

Your query would delete all rows from each set of dupes (since all of them share the same id, which is what you select by - that's what @wildplasser hinted at in his comments), and only rows that were unique to begin with would survive. So if it "deletes the whole table", there were no unique rows at all.
In your query, dupes are defined by (id) alone, not by the whole row as your title suggests.
Either way, there is a remarkably simple solution:
DELETE FROM customer_info c
WHERE EXISTS (
SELECT FROM customer_info c1
WHERE ctid < c.ctid
AND c1 = c -- comparing whole rows
);
Since you deal with completely identical rows, the remaining way to tell them apart is the internal tuple ID ctid.
My query deletes all rows where an identical row with a smaller ctid exists. Hence, only the "first" row from each set of dupes survives.
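To see what ctid looks like before deleting, a quick peek (just a sketch for illustration):
SELECT ctid, * FROM customer_info ORDER BY ctid;
Each ctid displays as a (block, offset) pair giving the row's physical position in the table.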
Notably, NULL values compare equal in this case - which is most probably as desired. The manual:
The SQL specification requires row-wise comparison to return NULL if the result depends on comparing two NULL values or a NULL and a non-NULL. PostgreSQL does this only when comparing the results of two row constructors (as in Section 9.23.5) or comparing a row constructor to the output of a subquery (as in Section 9.22). In other contexts where two composite-type values are compared, two NULL field values are considered equal, [...]
If dupes are defined by id alone (as your query suggests), then this would work:
DELETE FROM customer_info c
WHERE EXISTS (
SELECT FROM customer_info c1
WHERE ctid < c.ctid
AND id = c.id
);
But then there might be a better way to decide which rows to keep than ctid as a measure of last resort!
Obviously, you would then add a PRIMARY KEY to avoid the initial dilemma from reappearing. For the second interpretation, id is the candidate.
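For instance (a minimal sketch, assuming dupes are defined by id alone and have already been removed):
ALTER TABLE customer_info ADD PRIMARY KEY (id);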
Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
About ctid:
How do I decompose ctid into page and row numbers?

You can't if the table does not have a key.
Tables have "keys" that identify each row uniquely. If your table does not have any key, then you won't be able to tell one row apart from another.
The only workaround to delete duplicate rows I can think of would be to:
Add a key on the table.
Use the key to delete the rows that are in excess.
For example:
create sequence seq1;
alter table customer_info add column k1 int;
update customer_info set k1 = nextval('seq1');
delete from customer_info where k1 in (
select k1
from (
select
k1,
row_number() over(partition by id, first_name, last_name, phone_number) as rn
from customer_info
) x
where rn > 1
)
Now you only have two rows.

Related

Change value of duplicated rows

There is a table with two columns (ID, Data), and there are 3 rows with the same value.
ID Data
4 192.168.0.22
4 192.168.0.22
4 192.168.0.22
Now I want to change the DATA column of the third row. On update, SQL Server generates an error that I cannot change the value.
I can delete all 3 rows, but I cannot delete the third row separately.
This table belongs to software that I bought, and the third server's IP has changed.
You can try the following query
create table #tblSimilarValues(id int, ipaddress varchar(20))
insert into #tblSimilarValues values (4, '192.168.0.22'),
(4, '192.168.0.22'),(4, '192.168.0.22')
Use the query below if you want to change all rows:
with oldData as (
select *,
count(*) over (partition by id, ipaddress) as cnt
from #tblSimilarValues
)
update oldData
set ipaddress = '192.168.0.22_1'
where cnt > 1;
select * from #tblSimilarValues
Use the query below if you want to skip the first row:
;with oldData as (
select *,
ROW_NUMBER () over (partition by id, ipaddress order by id, ipaddress) as cnt
from #tblSimilarValues
)
update oldData
set ipaddress = '192.168.0.22_2'
where cnt > 1;
select * from #tblSimilarValues
drop table #tblSimilarValues
You can find the live demo here.
Since there is no column that allows us to distinguish these rows from each other, there's no "third row" (nor a first or second one for that matter).
We can use a ROW_NUMBER function to apply arbitrary row numbers to these rows, however, and if we place that in a CTE, we can apply DELETE/UPDATE actions via the CTE and use the arbitrary row numbers:
declare @t table (ID int not null, Data varchar(15))
insert into @t(ID,Data) values
(4,'192.168.0.22'),
(4,'192.168.0.22'),
(4,'192.168.0.22')
;With ArbitraryAssignments as (
select *,ROW_NUMBER() OVER (PARTITION BY ID, Data ORDER BY Data) as rn
from @t
)
delete from ArbitraryAssignments where rn > 2
select * from @t
This produces two rows of output - one row was deleted.
Note that I say the ROW_NUMBER is arbitrary: the same expression (Data) appears in both the PARTITION BY and the ORDER BY clauses, so by definition no real order exists (all rows within a partition share the same value for that expression).
In this case the ID column allows duplicate values, which is wrong; ID should be unique.
What you can do is create a new column and make it unique or a primary key, or change the duplicate values of the ID column and make it unique/primary key.
Then, based on your unique/primary key, you can update the DATA column value with a query like the one below:
UPDATE <Table Name>
SET DATA = 'new data'
WHERE ID = 3;

PostgreSQL shuffle column values

In a table with > 100k rows, how can I efficiently shuffle the values of a specific column?
Table definition:
CREATE TABLE person
(
id integer NOT NULL,
first_name character varying,
last_name character varying,
CONSTRAINT person_pkey PRIMARY KEY (id)
)
In order to anonymize data, I have to shuffle the values of the 'first_name' column in place (I'm not allowed to create a new table).
My try:
with
first_names as (
select row_number() over (order by random()),
first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()),
id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names, ids
where id = ref_id;
It takes hours to complete.
Is there an efficient way to do it?
This one takes 5 seconds to shuffle 500,000 rows on my laptop:
with names as (
select id, first_name, last_name,
lead(first_name) over w as first_1,
lag(first_name) over w as first_2
from person
window w as (order by random())
)
update person
set first_name = coalesce(first_1, first_2)
from names
where person.id = names.id;
The idea is to pick the "next" name after sorting the data randomly, which is just as good as picking a random name.
There is a chance that not all names are shuffled, but if you run it two or three times, this should be good enough.
Here is a test setup on SQLFiddle: http://sqlfiddle.com/#!15/15713/1
The query on the right-hand side checks whether any first name stayed the same after the "randomizing".
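A self-contained version of that check (a sketch, assuming a hypothetical snapshot table person_before copied before the update):
SELECT count(*) AS unchanged
FROM person p
JOIN person_before b USING (id)
WHERE p.first_name = b.first_name;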
The problem with Postgres is that every update means a delete + insert.
You can check the plan by running EXPLAIN ANALYZE with a SELECT instead of the UPDATE to see how the CTE performs.
You can drop indexes so updates are faster.
But the best solution I use when I need to update all the rows is to create the table again:
CREATE TABLE new_table AS
SELECT * ....
DROP TABLE old_table;
Rename new_table to old_table.
Recreate indexes and constraints.
Sorry that isn't an option for you :(
EDIT: After reading a_horse_with_no_name's answer, it looks like you need:
with
first_names as (
select row_number() over (order by random()) rn,
first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()) rn,
id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names
join ids
on first_names.rn = ids.rn
where id = ref_id;
Again, for performance questions it is better if you provide the EXPLAIN ANALYZE output.
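For reference, one way to capture the plan without keeping the changes (a sketch; note that EXPLAIN ANALYZE actually executes the statement, hence the rollback):
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE person SET first_name = upper(first_name); -- stand-in statement for illustration
ROLLBACK; -- discard the update that EXPLAIN ANALYZE executed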

How to optimize SELECT some_field, max(primary_key) FROM table GROUP BY some_field

I have SQL query in SQL Azure:
SELECT some_field, max(primary_key) FROM table GROUP BY some_field
The table currently has over 6 million rows. An index on (some_field asc, primary_key desc) exists. The primary_key field is incremental. There are about 700 distinct values of some_field. This select takes at least 30 seconds.
There are only inserts into this table, no updates or deletes.
I can create a separate table to store some_field and the maximal value of the primary key, and write a trigger to maintain it, but I am looking for a more elegant solution. Is there one?
Don't know if this will be performant, but you can give it a shot...
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY some_field ORDER BY primary_key DESC) AS rn
FROM table
)
SELECT *
FROM cte
WHERE rn = 1
Definitely create the secondary table with "somefield" and "highestPK" columns, indexed on the "somefield" column. Build that once up front as a baseline and use that.
Then, whenever new records are inserted into your 6-million-record table, have a simple trigger update your secondary table with something as simple as:
update SecondaryTable
set highestPK = newlyInsertedPKID
where somefield = newlyInsertedSomeFieldValue
This way it stays updated with every insert, as the highest PK for each "somefield" value wins; and if the update affects no row, insert into the secondary table with the new "somefield" value.
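A rough sketch of such a trigger (T-SQL, with hypothetical names; the secondary table is assumed to be SecondaryTable(somefield, highestPK) and the big table BigTable):
CREATE TRIGGER trg_KeepMaxPK ON BigTable
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- raise the stored maximum where a newly inserted row beats it
    UPDATE s
    SET    highestPK = i.maxPK
    FROM   SecondaryTable s
    JOIN  (SELECT some_field, MAX(primary_key) AS maxPK
           FROM   inserted
           GROUP  BY some_field) i ON i.some_field = s.somefield
    WHERE  i.maxPK > s.highestPK;
    -- add some_field values never seen before
    INSERT INTO SecondaryTable (somefield, highestPK)
    SELECT i.some_field, MAX(i.primary_key)
    FROM   inserted i
    WHERE  NOT EXISTS (SELECT 1 FROM SecondaryTable s
                       WHERE  s.somefield = i.some_field)
    GROUP  BY i.some_field;
END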

How to delete duplicate rows without unique identifier

I have duplicate rows in my table and I want to delete duplicates in the most efficient way since the table is big. After some research, I have come up with this query:
WITH TempEmp AS
(
SELECT name, ROW_NUMBER() OVER(PARTITION by name, address, zipcode ORDER BY name) AS duplicateRecCount
FROM mytable
)
-- Now Delete Duplicate Records
DELETE FROM TempEmp
WHERE duplicateRecCount > 1;
But it only works in SQL Server, not in Netezza. It would seem that Netezza does not like a DELETE after the WITH clause?
I like @erwin-brandstetter's solution, but wanted to show a solution with the USING keyword:
DELETE FROM table_with_dups T1
USING table_with_dups T2
WHERE T1.ctid < T2.ctid -- delete the "older" ones
AND T1.name = T2.name -- list columns that define duplicates
AND T1.address = T2.address
AND T1.zipcode = T2.zipcode;
If you want to review the records before deleting them, then simply replace DELETE with SELECT * and USING with a comma (,), i.e.
SELECT * FROM table_with_dups T1
, table_with_dups T2
WHERE T1.ctid < T2.ctid -- select the "older" ones
AND T1.name = T2.name -- list columns that define duplicates
AND T1.address = T2.address
AND T1.zipcode = T2.zipcode;
Update: I tested some of the different solutions here for speed. If you don't expect many duplicates, then this solution performs much better than the ones that have a NOT IN (...) clause as those generate a lot of rows in the subquery.
If you rewrite the query to use IN (...) then it performs similarly to the solution presented here, but the SQL code becomes much less concise.
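For reference, the IN (...) rewrite mentioned above might look like this (a sketch with the same semantics as the USING variant above - rows with a smaller ctid are deleted):
DELETE FROM table_with_dups
WHERE ctid IN (
  SELECT T1.ctid
  FROM table_with_dups T1
  JOIN table_with_dups T2
    ON  T1.ctid < T2.ctid
   AND  T1.name = T2.name
   AND  T1.address = T2.address
   AND  T1.zipcode = T2.zipcode
);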
Update 2: If you have NULL values in one of the key columns (which you really shouldn't IMO), then you can use COALESCE() in the condition for that column, e.g.
AND COALESCE(T1.col_with_nulls, '[NULL]') = COALESCE(T2.col_with_nulls, '[NULL]')
If you have no other unique identifier, you can use ctid:
delete from mytable
where exists (select 1
from mytable t2
where t2.name = mytable.name and
t2.address = mytable.address and
t2.zip = mytable.zip and
t2.ctid > mytable.ctid
);
It is a good idea to have a unique, auto-incrementing id in every table. Doing a delete like this is one important reason why.
In a perfect world, every table has a unique identifier of some sort.
In the absence of any unique column (or combination thereof), use the ctid column:
In-order sequence generation
How do I decompose ctid into page and row numbers?
DELETE FROM tbl
WHERE ctid NOT IN (
SELECT min(ctid) -- ctid is NOT NULL by definition
FROM tbl
GROUP BY name, address, zipcode); -- list columns defining duplicates
The above query is short, conveniently listing column names only once. NOT IN (SELECT ...) is a tricky query style when NULL values can be involved, but the system column ctid is never NULL. See:
Find records where join doesn't exist
Using EXISTS as demonstrated by @Gordon is typically faster. So is a self-join with the USING clause, as @isapir added later. Both should result in the same query plan.
Important difference: These other queries treat NULL values as not equal, while GROUP BY (or DISTINCT or DISTINCT ON ()) treats NULL values as equal. Does not matter for columns defined NOT NULL. Else, depending on your definition of "duplicate", you'll need one approach or the other. Or use IS NOT DISTINCT FROM to compare values (which may exclude some indexes).
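A sketch of the EXISTS variant rewritten with IS NOT DISTINCT FROM, so that NULL values count as equal:
DELETE FROM tbl t
WHERE EXISTS (
   SELECT FROM tbl t1
   WHERE  t1.ctid < t.ctid
   AND    t1.name    IS NOT DISTINCT FROM t.name
   AND    t1.address IS NOT DISTINCT FROM t.address
   AND    t1.zipcode IS NOT DISTINCT FROM t.zipcode
   );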
Disclaimer:
ctid is an implementation detail of Postgres, it's not in the SQL standard and can change between major versions without warning (even if that's very unlikely). Its values can change between commands due to background processes or concurrent write operations (but not within the same command).
Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
How to use the physical location of rows (ROWID) in a DELETE statement
Aside:
In Postgres, the target of a DELETE statement cannot be the CTE, only the underlying table. That expectation is a spillover from SQL Server - as is your whole approach.
Here is what I came up with, using a GROUP BY:
DELETE FROM mytable
WHERE id NOT in (
SELECT MIN(id)
FROM mytable
GROUP BY name, address, zipcode
)
It deletes the duplicates, preserving the record with the lowest id from each set of duplicates.
We can use a window function for very effective removal of duplicate rows:
DELETE FROM tab
WHERE id IN (SELECT id
FROM (SELECT row_number() OVER (PARTITION BY column_with_duplicate_values), id
FROM tab) x
WHERE x.row_number > 1);
A PostgreSQL-optimized version (with ctid):
DELETE FROM tab
WHERE ctid = ANY(ARRAY(SELECT ctid
FROM (SELECT row_number() OVER (PARTITION BY column_with_duplicate_values), ctid
FROM tab) x
WHERE x.row_number > 1));
The valid syntax is specified at http://www.postgresql.org/docs/current/static/sql-delete.html
I would ALTER your table to add a unique auto-incrementing primary key id so that you can run a query like the following, which keeps the first of each set of duplicates (i.e. the one with the lowest id). Note that adding the key is a bit more complicated in Postgres than in some other DBs.
DELETE FROM mytable d USING (
SELECT min(id) AS id, name, address, zip
FROM mytable
GROUP BY name, address, zip HAVING count(*) > 1
) AS k
WHERE d.id <> k.id
AND d.name=k.name
AND d.address=k.address
AND d.zip=k.zip;
If you want a unique identifier for every row, you could just add one (a serial, or a guid), and treat it like a surrogate key.
CREATE TABLE thenames
( name text not null
, address text not null
, zipcode text not null
);
INSERT INTO thenames(name,address,zipcode) VALUES
('James', 'main street', '123' )
,('James', 'main street', '123' )
,('James', 'void street', '456')
,('Alice', 'union square' , '123')
;
SELECT * FROM thenames;
-- add a surrogate key
ALTER TABLE thenames
ADD COLUMN seq serial NOT NULL PRIMARY KEY
;
SELECT * FROM thenames;
DELETE FROM thenames del
WHERE EXISTS(
SELECT * FROM thenames x
WHERE x.name=del.name
AND x.address=del.address
AND x.zipcode=del.zipcode
AND x.seq < del.seq
);
-- add the unique constraint, so that new duplicates cannot be created in the future
ALTER TABLE thenames
ADD UNIQUE (name,address,zipcode)
;
SELECT * FROM thenames;
To remove duplicates (keep only one entry) from a table "tab" where data looks like this:
fk_id_1 | fk_id_2
--------|--------
12      | 32
12      | 32
12      | 32
15      | 37
15      | 37
You can do this:
DELETE FROM tab WHERE ctid IN
(SELECT ctid FROM
(SELECT ctid, fk_id_1, fk_id_2, row_number() OVER (PARTITION BY fk_id_1, fk_id_2 ORDER BY fk_id_1) AS rnum FROM tab) t
WHERE t.rnum > 1);
Where ctid is the physical location of the row within its table (therefore, a row identifier) and row_number is a window function that assigns a sequential integer to each row in a result set.
PARTITION groups the result set and the sequential integer is restarted for every group.
From the documentation on deleting duplicate rows:
A frequent question in IRC is how to delete rows that are duplicates over a set of columns, keeping only the one with the lowest ID.
This query does that for all rows of tablename having the same column1, column2, and column3.
DELETE FROM tablename
WHERE id IN (SELECT id
FROM (SELECT id,
ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename) t
WHERE t.rnum > 1);
Sometimes a timestamp field is used instead of an ID field.
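In that case the same pattern applies; a sketch ordering by a hypothetical created_at column, falling back to ctid as the row identifier:
DELETE FROM tablename
WHERE ctid IN (SELECT ctid
               FROM (SELECT ctid,
                            ROW_NUMBER() OVER (PARTITION BY column1, column2, column3
                                               ORDER BY created_at) AS rnum
                     FROM tablename) t
               WHERE t.rnum > 1);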
For smaller tables, we can use the rowid pseudo-column to delete duplicate rows.
You can use the query below:
Delete from table1 t1
where t1.rowid > (select min(t2.rowid)
                  from table1 t2
                  where t1.column = t2.column);

How can I delete one of two perfectly identical rows?

I am cleaning out a database table without a primary key (I know, I know, what were they thinking?). I cannot add a primary key, because there is a duplicate in the column that would become the key. The duplicate value comes from one of two rows that are in all respects identical. I can't delete the row via a GUI (in this case MySQL Workbench, but I'm looking for a database agnostic approach) because it refuses to perform tasks on tables without primary keys (or at least a UQ NN column).
How can I delete one of the twins?
SET ROWCOUNT 1
DELETE FROM [table] WHERE ....
SET ROWCOUNT 0
This will only delete one of the two identical rows.
One option to solve your problem is to create a new table with the same schema, and then do:
INSERT INTO new_table (SELECT DISTINCT * FROM old_table)
and then just rename the tables.
You will of course need approximately the same amount of space as your table requires spare on your disk to do this!
It's not efficient, but it's incredibly simple.
Note that MySQL has its own extension of DELETE, which is DELETE ... LIMIT, which works in the usual way you'd expect from LIMIT: http://dev.mysql.com/doc/refman/5.0/en/delete.html
The MySQL-specific LIMIT row_count option to DELETE tells the server
the maximum number of rows to be deleted before control is returned to
the client. This can be used to ensure that a given DELETE statement
does not take too much time. You can simply repeat the DELETE
statement until the number of affected rows is less than the LIMIT
value.
Therefore, you could use DELETE FROM some_table WHERE x="y" AND foo="bar" LIMIT 1; note that there isn't a simple way to say "delete everything except one" - just keep checking whether you still have row duplicates.
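A quick way to check for remaining duplicates between DELETE runs (a sketch using the same placeholder columns):
SELECT x, foo, COUNT(*) AS cnt
FROM some_table
GROUP BY x, foo
HAVING COUNT(*) > 1;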
delete top(1) works on Microsoft SQL Server (T-SQL).
This can be accomplished using a CTE and the ROW_NUMBER() function, as below:
/* Sample Data */
CREATE TABLE #dupes (ID INT, DWCreated DATETIME2(3))
INSERT INTO #dupes (ID, DWCreated) SELECT 1, '2015-08-03 01:02:03.456'
INSERT INTO #dupes (ID, DWCreated) SELECT 2, '2014-08-03 01:02:03.456'
INSERT INTO #dupes (ID, DWCreated) SELECT 1, '2013-08-03 01:02:03.456'
/* Check sample data - returns three rows, with two rows for ID#1 */
SELECT * FROM #dupes
/* CTE to give each row that shares an ID a unique number */
;WITH toDelete AS
(
SELECT ID, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY DWCreated) AS RN
FROM #dupes
)
/* Delete any row that is not the first instance of an ID */
DELETE FROM toDelete WHERE RN > 1
/* Check the results: ID is now unique */
SELECT * FROM #dupes
/* Clean up */
DROP TABLE #dupes
Having a column to ORDER BY is handy, but not necessary unless you have a preference for which of the rows to delete. This will also handle all instances of duplicate records, rather than forcing you to delete one row at a time.
For PostgreSQL you can do this:
DELETE FROM tablename
WHERE id IN (SELECT id
FROM (SELECT id, ROW_NUMBER()
OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename) t
WHERE t.rnum > 1);
column1, column2, column3 would be the column set which has duplicate values.
Reference here.
This works for PostgreSQL
DELETE FROM tablename WHERE id = 123 AND ctid IN (SELECT ctid FROM tablename WHERE id = 123 LIMIT 1)
Have you tried LIMIT 1? This will delete only one of the rows that match your DELETE query:
DELETE FROM `table_name` WHERE `column_name`='value' LIMIT 1;
In my case I could get the GUI to give me a string of values of the row in question (alternatively, I could have done this by hand). On the suggestion of a colleague, in whose debt I remain, I used this to create an INSERT statement:
INSERT INTO some_table VALUES
('ID1219243408800307444663', '2004-01-20 10:20:55', 'INFORMATION', 'admin', ...);
I tested the insert statement, so that I now had triplets. Finally, I ran a simple DELETE to remove all of them...
DELETE FROM some_table WHERE logid = 'ID1219243408800307444663';
followed by the INSERT one more time, leaving me with a single row, and the bright possibilities of a primary key.
In case you can add a column like
ALTER TABLE yourtable ADD IDCOLUMN bigint NOT NULL IDENTITY (1, 1)
do so.
Then count rows grouping by your problem column where the count > 1 - this will identify your twins (or triplets or whatever).
Then select your problem column where its content equals the identified content from above, and check the IDs in IDCOLUMN.
Delete from your table where IDCOLUMN equals one of those IDs (see the sketch below).
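A sketch of those steps in T-SQL (hypothetical names; problemcol stands in for the would-be key column; for triplets, repeat the DELETE until no duplicates remain):
-- identify the twins
SELECT problemcol, COUNT(*) AS cnt
FROM yourtable
GROUP BY problemcol
HAVING COUNT(*) > 1;
-- delete one row of each pair via the new IDCOLUMN
DELETE FROM yourtable
WHERE IDCOLUMN IN (SELECT MAX(IDCOLUMN)
                   FROM yourtable
                   GROUP BY problemcol
                   HAVING COUNT(*) > 1);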
You could use MAX, which was relevant in my case.
DELETE FROM [table] where id in
(select max(id) from [table] group by id, col2, col3 having count(id) > 1)
Be sure to test your results first, and keep a limiting condition in your HAVING clause. With such a huge delete query, you might want to back up your database first.
DELETE TOP (1) FROM tableName
WHERE -- your conditions for filtering identical rows
I added a Guid column to the table and set it to generate a new id for each row. Then I could delete the rows using a GUI.
In PostgreSQL there is an implicit column called ctid. See the wiki. So you are free to use the following:
WITH cte1 as(
SELECT unique_column, max( ctid ) as max_ctid
FROM table_1
GROUP BY unique_column
HAVING count(*) > 1
), cte2 as(
SELECT t.ctid as target_ctid
FROM table_1 t
JOIN cte1 USING( unique_column )
WHERE t.ctid != max_ctid
)
DELETE FROM table_1
WHERE ctid IN( SELECT target_ctid FROM cte2 )
I'm not sure how safe it is to use this when there is a possibility of concurrent updates. So it may be sensible to run LOCK TABLE table_1 IN ACCESS EXCLUSIVE MODE; before actually doing the cleanup.
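For example (a sketch; the lock is held until the transaction commits):
BEGIN;
LOCK TABLE table_1 IN ACCESS EXCLUSIVE MODE;
-- run the DELETE above
COMMIT;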
In case there are multiple duplicate rows to delete and all fields are identical (no differing id, and the table has no primary key), one option is to save the duplicate rows once (with DISTINCT) in a new table, delete all duplicate rows, and insert the rows back. This is helpful if the table is really big and the number of duplicate rows is small.
--- col1, col2 ... coln are the table columns that are relevant.
--- if not sure, add all columns of the table in the select below and the where clause later.
--- make a copy of table T to be sure you can roll back anytime, if possible
--- check @@ROWCOUNT to be sure it's what you want
--- use transactions and rollback in case there is an error
--- first find all duplicate rows that are identical; this statement could be joined
--- with the next one if you choose all columns
select col1, col2, --- other columns as needed
count(*) c into temp_duplicate from T group by col1, col2 having count(*) > 1
--- save all the rows that are identical only once ( DISTINCT )
select distinct T.* into temp_insert from T, temp_duplicate D where
T.col1 = D.col1 and
T.col2 = D.col2 --- and other columns if needed
--- delete all the rows that are duplicate
delete T from T, temp_duplicate D where
T.col1 = D.col1 and
T.col2 = D.col2 ---- and other columns if needed
--- add the duplicate rows, now only once
insert into T select * from temp_insert
--- drop the temp tables after you check all is ok
If, like me, you don't want to have to list out all the columns of the database, you can convert each row to JSONB and compare by that.
(NOTE: This is incredibly inefficient - be careful!)
select to_jsonb(a.*), to_jsonb(b.*)
FROM
table a
left join table b
on
a.entry_date < b.entry_date
where (SELECT NOT exists(
SELECT
FROM jsonb_each_text(to_jsonb(a.*) - 'unwanted_column') t1
FULL OUTER JOIN jsonb_each_text(to_jsonb(b.*) - 'unwanted_column') t2 USING (key)
WHERE t1.value<>t2.value OR t1.key IS NULL OR t2.key IS NULL
))
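To turn this into an actual delete, a sketch using ctid with the same whole-row JSONB comparison (hypothetical table tbl; to_jsonb takes the whole row, and the system column ctid is not part of it):
DELETE FROM tbl a
WHERE EXISTS (
   SELECT FROM tbl b
   WHERE b.ctid < a.ctid
   AND   to_jsonb(a) - 'unwanted_column' = to_jsonb(b) - 'unwanted_column'
);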
Suppose we want to delete duplicate records, keeping only one unique record each, from an Employee table Employee(id, name, age):
delete from Employee
where id not in (select MAX(id)
from Employee
group by name, age
);
(Note: group by the non-id columns; if id were included in the GROUP BY, every id would form its own group and nothing would be deleted.)
You can use LIMIT 1.
This works perfectly for me with MySQL:
delete from `your_table` [where condition] limit 1;
DELETE FROM Table_Name
WHERE ID NOT IN
(
SELECT MAX(ID) AS MaxRecordID
FROM Table_Name
GROUP BY [FirstName],
[LastName],
[Country]
);