How to remove duplicate records in a table?

How to remove duplicate records in a table? - sql

I've got a table in a testing DB that someone apparently got a little too trigger-happy on when running INSERT scripts to set it up. The schema looks like this:
ID UNIQUEIDENTIFIER
TYPE_INT SMALLINT
SYSTEM_VALUE SMALLINT
NAME VARCHAR
MAPPED_VALUE VARCHAR
It's supposed to have a few dozen rows. It has about 200,000, most of which are duplicates in which TYPE_INT, SYSTEM_VALUE, NAME and MAPPED_VALUE are all identical and ID is not.
Now, I could probably make a script to clean this up that creates a temporary table in memory, uses INSERT .. SELECT DISTINCT to grab all the unique values, TRUNCATE the original table and then copy everything back. But is there a simpler way to do it, like a DELETE query with something special in the WHERE clause?

You don't give your table name but I think something like this should work. Just leaving the record which happens to have the lowest ID. You might want to test with the ROLLBACK in first!
BEGIN TRAN
DELETE <table_name>
FROM <table_name> T1
WHERE EXISTS(
SELECT * FROM <table_name> T2
WHERE
T1.TYPE_INT = T2.TYPE_INT AND
T1.SYSTEM_VALUE = T2.SYSTEM_VALUE AND
T1.NAME = T2.NAME AND
T1.MAPPED_VALUE = T2.MAPPED_VALUE AND
T2.ID > T1.ID
)
SELECT * FROM <table_name>
ROLLBACK

here is a great article on that: Deleting duplicates, which basically uses this pattern:
WITH q AS
(
SELECT d.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY value) AS rn
FROM t_duplicate d
)
DELETE
FROM q
WHERE rn > 1
SELECT *
FROM t_duplicate

WITH Duplicates(ID , TYPE_INT, SYSTEM_VALUE, NAME, MAPPED_VALUE )
AS
(
SELECT Min(Id) ID TYPE_INT, SYSTEM_VALUE, NAME, MAPPED_VALUE
FROM T1
GROUP BY TYPE_INT, SYSTEM_VALUE, NAME, MAPPED_VALUE
HAVING Count(Id) > 1
)
DELETE FROM T1
WHERE ID IN (
SELECT T1.Id
FROM T1
INNER JOIN Duplicates
ON T1.TYPE_INT = Duplicates.TYPE_INT
AND T1.SYSTEM_VALUE = Duplicates.SYSTEM_VALUE
AND T1.NAME = Duplicates.NAME
AND T1.MAPPED_VALUE = Duplicates.MAPPED_VALUE
AND T1.Id <> Duplicates.ID
)

Related

How to keep only the first unique element and delete its duplicates in oracle?

So this query looks like a very easy by its statement but in actual isnt that easy.
Heres my code what ive tried.
Delete from table where id
In (select id from (select
id, row_number()
over(partition by id)
rn from table where rn>1)
The above can work but thats not standard sql for almost all databases like partition by may not be supported in most of the other databases. What i was trying was below is it possible using group by. I tried below but i am not sure this will work or not. Any suggestions and which one is optimized
//using group by
Delete from table where id
In (select id from(select
id from table
Group by id
Having sum(1)>1)
)

As question says
delete its duplicates in oracle
then
delete from your_table a
where a.rowid > (select min(b.rowid)
from your_table b
where b.id = a.id
);

You can use exists as follows:
Delete from your_table t
Where exists (select 1 from your_table t1
Where t1.id = t.id
And t1.rowid > t.rowid)

Simple update statement so that all rows are assigned a different value

I'm trying to set a column in one table to a random foreign key for testing purposes.
I attempted using the below query
update table1 set table2Id = (select top 1 table2Id from table2 order by NEWID())
This will get one table2Id at random and assign it as the foreign key in table1 for each row.
It's almost what I want, but I want each row to get a different table2Id value.
I could do this by looping through the rows in table1, but I know there's a more concise way of doing it.

On some test table my end your original plan looks as follows.
It just calculates the result once and caches it in a sppol then replays that result. You could try the following so that SQL Server sees the subquery as correlated and in need of re-evaluating for each outer row.
UPDATE table1
SET table2Id = (SELECT TOP 1 table2Id
FROM table2
ORDER BY Newid(),
table1.table1Id)
For me that gives this plan without the spool.
It is important to correlate on a unique field from table1 however so that even if a spool is added it must always be rebound rather than rewound (replaying the last result) as the correlation value will be different for each row.
If the tables are large this will be slow as work required is a product of the two table's rows (for each row in table1 it needs to do a full scan of table2)

I'm having another go at answering this, since my first answer was incomplete.
As there is no other way to join the two tables until you assign the table2_id you can use row_number to give a temporary key to both table1 and table2.
with
t1 as (
select row_number() over (order by table1_id) as row, table1_id
from table1 )
,
t2 as (
select row_number() over (order by NEWID()) as row, table2_id
from table2 )
update table1
set table2_id = t2.table2_id
from t1 inner join t2
on t1.row = t2.row
select * from table1
SQL Fiddle to test it out: http://sqlfiddle.com/#!6/bf414/12

Broke down and used a loop for it. This worked, although it was very slow.
Select *
Into #Temp
From table1
Declare #Id int
While (Select Count(*) From #Temp) > 0
Begin
Select Top 1 #Id = table1Id From #Temp
update table1 set table2Id = (select top 1 table2Id from table2 order by NEWID()) where table1Id = #Id
Delete #Temp Where table1Id = #Id
End
drop table #Temp

I'm going to assume MS SQL based on top 1:
update table1
set table2Id =
(select top 1 table2Id from table2 tablesample(1 percent))
(sorry, not tested)

SQL - using a value in a nested select

Hope the title makes some kind of sense - I'd basically like to do a nested select, based on a value in the original select, like so:
SELECT MAX(iteration) AS maxiteration,
(SELECT column
FROM table
WHERE id = 223652
AND iteration = maxiteration)
FROM table
WHERE id = 223652;
I get an ORA-00904 invalid identifier error.
Would really appreciate any advice on how to return this value, thanks!

It looks like this should be rewritten with a where clause:
select iteration,
col
from tbl
where id = 223652
and iteration = (select max(iteration) from tbl where id = 223652);

You can circumvent the problem alltogether by placing the subselect in an INNER JOIN of its own.
SELECT t.iteration
, t.column
FROM table t
INNER JOIN (
SELECT id, MAX(iteration) AS iteration
FROM table
WHERE id = 223652
) tm ON tm.id = t.id AND tm.iteration = t.iteration

Since you're using Oracle, I'd suggest using analytic functions for this:
SELECT * FROM (
SELECT col,
iteration,
row_number() over (partition by id order by iteration desc) rn
FROM tab
WHERE id = 223652
) WHERE rn = 1

do it like this:
with maxiteration as
(
SELECT MAX(iteration) AS maxiteration
FROM table
WHERE id = 223652
)
select
column,
iteration
from
table
where
id = 223652
AND iteration = maxiteration
;

Not 100% sure on Oracle syntax, but isn't it something like:
select iteration, column from table where id = 223652 order by iteration desc limit 1

I would approach this problem in a slightly different way. You're basically looking for the row that has no other iterations greater than it. There are at least 3 ways I can think of to do this:
SELECT
T1.iteration AS maxiteration,
T1.column
FROM
Table T1
WHERE
T1.id = 223652 AND
NOT EXISTS
(
SELECT *
FROM Table T2
WHERE
T2.id = 223652 AND
T2.iteration > T1.iteration
)
Or...
SELECT
T1.iteration AS maxiteration,
T1.column
FROM
Table T1
LEFT OUTER JOIN Table T2 ON
T2.id = T1.id AND
T2.iteration > T1.iteration
WHERE
T1.id = 223652 AND
T2.id IS NULL
Or...
SELECT
T1.iteration AS maxiteration,
T1.column
FROM
Table T1
INNER JOIN (SELECT id, MAX(iteration) AS maxiteration FROM Table T2 GROUP BY id) SQ ON
SQ.id = T1.id AND
SQ.maxiteration = T1.iteration
WHERE
T1.id = 223652
EDIT: I didn't see the ORA error the first time reading the question and it wasn't tagged as Oracle specific. I think that there may be some differences in the syntax and use of aliases in Oracle, so you may need to tweak some of the above queries.
The Oracle error is telling you that it doesn't know what maxiteration is, because the column alias isn't available yet inside the subquery. You need to refer to it by the table alias and column name instead of the column alias I believe.

You do something like
select maxiteration,column from table a join (select max(iteration) as maxiteration from table where id=1) b using (id) where b.maxiteration=a.iteration;
This could of course return multiple rows for one maxiteration unless your table has a constraint against it.

Avoid duplicates in INSERT INTO SELECT query in SQL Server

I have the following two tables:
Table1
----------
ID Name
1 A
2 B
3 C
Table2
----------
ID Name
1 Z
I need to insert data from Table1 to Table2. I can use the following syntax:
INSERT INTO Table2(Id, Name) SELECT Id, Name FROM Table1
However, in my case, duplicate IDs might exist in Table2 (in my case, it's just "1") and I don't want to copy that again as that would throw an error.
I can write something like this:
IF NOT EXISTS(SELECT 1 FROM Table2 WHERE Id=1)
INSERT INTO Table2 (Id, name) SELECT Id, name FROM Table1
ELSE
INSERT INTO Table2 (Id, name) SELECT Id, name FROM Table1 WHERE Table1.Id<>1
Is there a better way to do this without using IF - ELSE? I want to avoid two INSERT INTO-SELECT statements based on some condition.

Using NOT EXISTS:
INSERT INTO TABLE_2
(id, name)
SELECT t1.id,
t1.name
FROM TABLE_1 t1
WHERE NOT EXISTS(SELECT id
FROM TABLE_2 t2
WHERE t2.id = t1.id)
Using NOT IN:
INSERT INTO TABLE_2
(id, name)
SELECT t1.id,
t1.name
FROM TABLE_1 t1
WHERE t1.id NOT IN (SELECT id
FROM TABLE_2)
Using LEFT JOIN/IS NULL:
INSERT INTO TABLE_2
(id, name)
SELECT t1.id,
t1.name
FROM TABLE_1 t1
LEFT JOIN TABLE_2 t2 ON t2.id = t1.id
WHERE t2.id IS NULL
Of the three options, the LEFT JOIN/IS NULL is less efficient. See this link for more details.

In MySQL you can do this:
INSERT IGNORE INTO Table2(Id, Name) SELECT Id, Name FROM Table1
Does SQL Server have anything similar?

I just had a similar problem, the DISTINCT keyword works magic:
INSERT INTO Table2(Id, Name) SELECT DISTINCT Id, Name FROM Table1

I was facing the same problem recently...
Heres what worked for me in MS SQL server 2017...
The primary key should be set on ID in table 2...
The columns and column properties should be the same of course between both tables. This will work the first time you run the below script. The duplicate ID in table 1, will not insert...
If you run it the second time, you will get a
Violation of PRIMARY KEY constraint error
This is the code:
Insert into Table_2
Select distinct *
from Table_1
where table_1.ID >1

Using ignore Duplicates on the unique index as suggested by IanC here was my solution for a similar issue, creating the index with the Option WITH IGNORE_DUP_KEY
In backward compatible syntax
, WITH IGNORE_DUP_KEY is equivalent to WITH IGNORE_DUP_KEY = ON.
Ref.: index_option

From SQL Server you can set a Unique key index on the table for (Columns that needs to be unique)

A little off topic, but if you want to migrate the data to a new table, and the possible duplicates are in the original table, and the column possibly duplicated is not an id, a GROUP BY will do:
INSERT INTO TABLE_2
(name)
SELECT t1.name
FROM TABLE_1 t1
GROUP BY t1.name

In my case, I had duplicate IDs in the source table, so none of the proposals worked. I don't care about performance, it's just done once.
To solve this I took the records one by one with a cursor to ignore the duplicates.
So here's the code example:
DECLARE #c1 AS VARCHAR(12);
DECLARE #c2 AS VARCHAR(250);
DECLARE #c3 AS VARCHAR(250);
DECLARE MY_cursor CURSOR STATIC FOR
Select
c1,
c2,
c3
from T2
where ....;
OPEN MY_cursor
FETCH NEXT FROM MY_cursor INTO #c1, #c2, #c3
WHILE ##FETCH_STATUS = 0
BEGIN
if (select count(1)
from T1
where a1 = #c1
and a2 = #c2
) = 0
INSERT INTO T1
values (#c1, #c2, #c3)
FETCH NEXT FROM MY_cursor INTO #c1, #c2, #c3
END
CLOSE MY_cursor
DEALLOCATE MY_cursor

I used a MERGE query to fill a table without duplications.
The problem I had was a double key in the tables ( Code , Value ) ,
and the exists query was very slow
The MERGE executed very fast ( more then X100 )
examples for MERGE query

For one table it works perfectly when creating one unique index from multiple field. Then simple "INSERT IGNORE" will ignore duplicates if ALL of 7 fields (in this case) will have SAME values.
Select fields in PMA Structure View and click Unique, new combined index will be created.

A simple DELETE before the INSERT would suffice:
DELETE FROM Table2 WHERE Id = (SELECT Id FROM Table1)
INSERT INTO Table2 (Id, name) SELECT Id, name FROM Table1
Switching Table1 for Table2 depending on which table's Id and name pairing you want to preserve.

Delete Duplicate SQL Records

What is the simplest way to delete records with duplicate name in a table? The answers I came across are very confusing.
Related:
Removing duplicate records from table

I got it! Simple and it worked great.
delete
t1
from
tTable t1, tTable t2
where
t1.locationName = t2.locationName and
t1.id > t2.id
http://www.cryer.co.uk/brian/sql/sql_delete_duplicates.htm

SQL Server 2005:
with FirstKey
AS
(
SELECT MIN(ID), Name, COUNT(*) AS Cnt
FROM YourTable
GROUP BY Name
HAVING COUNT(*) > 1
)
DELETE YourTable
FROM YourTable YT
JOIN FirstKey FK ON FK.Name = YT.Name AND FK.ID != YT.ID

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How to remove duplicate records in a table? - sql

here is a great article on that: Deleting duplicates, which basically uses this pattern: WITH q AS ( SELECT d., ROW_NUMBER() OVER (PARTITION BY id ORDER BY value) AS rn FROM t_duplicate d ) DELETE FROM q WHERE rn > 1 SELECT FROM t_duplicate

Related

How to keep only the first unique element and delete its duplicates in oracle?

Simple update statement so that all rows are assigned a different value

SQL - using a value in a nested select

Avoid duplicates in INSERT INTO SELECT query in SQL Server

Delete Duplicate SQL Records

Categories

Resources

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How to remove duplicate records in a table? - sql

here is a great article on that: Deleting duplicates, which basically uses this pattern: WITH q AS ( SELECT d.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY value) AS rn FROM t_duplicate d ) DELETE FROM q WHERE rn > 1 SELECT * FROM t_duplicate

Related

How to keep only the first unique element and delete its duplicates in oracle?

Simple update statement so that all rows are assigned a different value

SQL - using a value in a nested select

Avoid duplicates in INSERT INTO SELECT query in SQL Server

Delete Duplicate SQL Records

Categories

Resources

here is a great article on that: Deleting duplicates, which basically uses this pattern: WITH q AS ( SELECT d., ROW_NUMBER() OVER (PARTITION BY id ORDER BY value) AS rn FROM t_duplicate d ) DELETE FROM q WHERE rn > 1 SELECT FROM t_duplicate