I need to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows).
The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
How can I do this?
Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:
DELETE MyTable FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
In case you have a GUID instead of an integer, you can replace
MIN(RowId)
with
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
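Before running the GROUP BY delete, it can help to preview which rows it would remove. The following is only a sketch: #DedupDemo and its sample rows are hypothetical stand-ins for MyTable from the question.

```sql
-- Hypothetical sample data matching the MyTable definition in the question.
CREATE TABLE #DedupDemo (
    RowID int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    Col1 varchar(20) NOT NULL,
    Col2 varchar(2048) NOT NULL,
    Col3 tinyint NOT NULL
);

INSERT INTO #DedupDemo (Col1, Col2, Col3)
VALUES ('a', 'x', 1), ('a', 'x', 1), ('b', 'y', 2), ('a', 'x', 1);

-- Preview: every row that is NOT the minimum RowID of its group.
-- These are the rows the DELETE would remove.
SELECT d.RowID, d.Col1, d.Col2, d.Col3
FROM #DedupDemo AS d
LEFT OUTER JOIN (
    SELECT MIN(RowID) AS RowID
    FROM #DedupDemo
    GROUP BY Col1, Col2, Col3
) AS KeepRows ON d.RowID = KeepRows.RowID
WHERE KeepRows.RowID IS NULL;

DROP TABLE #DedupDemo;
```

With this sample data the preview returns the two extra ('a', 'x', 1) rows. Replacing the SELECT list with DELETE d turns the preview into the delete itself.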
Another possible way of doing this is
;
--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
ORDER BY ( SELECT 0)) RN
FROM #MyTable)
DELETE FROM cte
WHERE RN > 1;
I am using ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie.
To preserve the latest one in RowID order for example you could use ORDER BY RowID DESC
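As an illustration of the ORDER BY RowID DESC variant, here is a minimal sketch that keeps the newest copy in each group. #KeepLatestDemo and its rows are hypothetical and only mirror the column layout from the question.

```sql
-- Hypothetical demo table; not part of the original question.
CREATE TABLE #KeepLatestDemo (
    RowID int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    Col1 varchar(20) NOT NULL,
    Col2 varchar(2048) NOT NULL,
    Col3 tinyint NOT NULL
);

INSERT INTO #KeepLatestDemo (Col1, Col2, Col3)
VALUES ('a', 'x', 1), ('a', 'x', 1), ('b', 'y', 2);

-- RN = 1 is the highest RowID in each group, so the newest copy survives.
WITH cte AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                              ORDER BY RowID DESC) AS RN
    FROM #KeepLatestDemo
)
DELETE FROM cte
WHERE RN > 1;

-- One row per distinct (Col1, Col2, Col3) combination remains.
SELECT RowID, Col1, Col2, Col3 FROM #KeepLatestDemo;

DROP TABLE #KeepLatestDemo;
```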
Execution Plans
The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.
This is not always the case however. One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate.
The ROW_NUMBER solution will always give pretty much the same plan whereas the GROUP BY strategy is more flexible.
Factors which might favour the hash aggregate approach would be
No useful index on the partitioning columns
relatively fewer groups with relatively more duplicates in each group
In extreme versions of this second case (if there are very few groups with many duplicates in each) one could also consider simply inserting the rows to keep into a new table then TRUNCATE-ing the original and copying them back to minimise logging compared to deleting a very high proportion of the rows.
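The copy-out, truncate and copy-back idea could look roughly like the sketch below. It is an assumption-laden illustration, not a tuned implementation: #BigDemo and #RowsToKeep are hypothetical names, and the sample is tiny.

```sql
-- Hypothetical demo table standing in for a large table with few distinct groups.
CREATE TABLE #BigDemo (
    RowID int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    Col1 varchar(20) NOT NULL,
    Col2 varchar(2048) NOT NULL,
    Col3 tinyint NOT NULL
);

INSERT INTO #BigDemo (Col1, Col2, Col3)
VALUES ('a', 'x', 1), ('a', 'x', 1), ('a', 'x', 1), ('b', 'y', 2), ('b', 'y', 2);

-- 1. Copy one row per group (lowest RowID) to a holding table.
SELECT MIN(RowID) AS RowID, Col1, Col2, Col3
INTO #RowsToKeep
FROM #BigDemo
GROUP BY Col1, Col2, Col3;

-- 2. Empty the original table.
TRUNCATE TABLE #BigDemo;

-- 3. Copy the kept rows back, preserving their original RowID values.
SET IDENTITY_INSERT #BigDemo ON;

INSERT INTO #BigDemo (RowID, Col1, Col2, Col3)
SELECT RowID, Col1, Col2, Col3
FROM #RowsToKeep;

SET IDENTITY_INSERT #BigDemo OFF;

SELECT RowID, Col1, Col2, Col3 FROM #BigDemo;

DROP TABLE #RowsToKeep;
DROP TABLE #BigDemo;
```

On a real table, TRUNCATE TABLE is refused while foreign keys reference the table, so this only applies where that is not the case.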
There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
The following query is useful to delete duplicate rows. The table in this example has ID as an identity column and the columns which have duplicate data are Column1, Column2 and Column3.
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
nullable. Because of semantics of NOT IN (NULL) including the clause
below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
The following script shows usage of GROUP BY, HAVING, ORDER BY in one query, and returns the results with duplicate column and its count.
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid
Postgres:
delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid
DELETE LU
FROM (SELECT *,
Row_number()
OVER (
partition BY col1, col2, col3
ORDER BY rowid DESC) [Row]
FROM mytable) LU
WHERE [row] > 1
This will delete duplicate rows, except the first row
DELETE
FROM
Mytable
WHERE
RowID NOT IN (
SELECT
MIN(RowID)
FROM
Mytable
GROUP BY
Col1,
Col2,
Col3
)
Refer (http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server)
I would prefer a CTE for deleting duplicate rows from a SQL Server table.
I strongly recommend following this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
Keeping the original:
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
Without keeping the original:
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
To Fetch Duplicate Rows:
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING COUNT(*) > 1
To Delete the Duplicate Rows:
DELETE users
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);
Quick and Dirty to delete exact duplicated rows (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;
I prefer the subquery\having count(*) > 1 solution to the inner join because I found it easier to read and it was very easy to turn into a SELECT statement to verify what would be deleted before you run it.
--DELETE FROM table1
--WHERE id IN (
SELECT MIN(id) FROM table1
GROUP BY col1, col2, col3
-- could add a WHERE clause here to further filter
HAVING count(*) > 1
--)
SELECT DISTINCT *
INTO tempdb.dbo.tmpTable
FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
I thought I'd share my solution since it works under special circumstances.
In my case the table with duplicate values did not have a foreign key (because the values were duplicated from another db).
begin transaction
-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2
-- insert distinct values into temp
insert into #temp
select distinct *
from tableName
-- delete from source
delete from tableName
-- insert into source from temp
insert into tableName
select *
from #temp
rollback transaction
-- if this works, change rollback to commit and execute again to keep your changes!!
PS: When working on things like this I always use a transaction. This not only ensures everything is executed as a whole, but also lets me test without risking anything. Of course, you should take a backup anyway just to be sure...
This query showed very good performance for me:
DELETE tbl
FROM
MyTable tbl
WHERE
EXISTS (
SELECT
*
FROM
MyTable tbl2
WHERE
tbl2.SameValue = tbl.SameValue
AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)
It deleted 1M rows in a little more than 30 seconds from a table of 2M rows (50% duplicates).
Using CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever you like:
;with cte as (
select
min(PrimaryKey) as PrimaryKey,
UniqueColumn1,
UniqueColumn2
from dbo.DuplicatesTable
group by
UniqueColumn1, UniqueColumn2
having count(*) > 1
)
delete d
from dbo.DuplicatesTable d
inner join cte on
d.PrimaryKey > cte.PrimaryKey and
d.UniqueColumn1 = cte.UniqueColumn1 and
d.UniqueColumn2 = cte.UniqueColumn2;
Yet another easy solution can be found at the link pasted here. This one is easy to grasp and seems to be effective for most similar problems. It is written for SQL Server, but the concept it uses applies more widely.
Here are the relevant portions from the linked page:
Consider this data:
EMPLOYEE_ID ATTENDANCE_DATE
A001 2011-01-01
A001 2011-01-01
A002 2011-01-01
A002 2011-01-01
A002 2011-01-01
A003 2011-01-01
So how can we delete those duplicate data?
First, insert an identity column in that table by using the following code:
ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Use the following code to resolve it:
DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID)
FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE)
This is the easiest way to delete duplicate record
DELETE FROM tblemp WHERE id IN
(
SELECT MIN(id) FROM tblemp
GROUP BY title HAVING COUNT(id)>1
)
Use this
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1
Here is another good article on removing duplicates.
It discusses why it's hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
It covers the temp table solution and two MySQL examples.
In the future, are you going to prevent it at the database level or from the application? I would suggest the database level, because your database should be responsible for maintaining referential integrity; developers will just cause problems ;)
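As a minimal sketch of database-level prevention, a UNIQUE constraint on the columns that define a duplicate makes the second insert fail. #UniqueDemo and its columns are hypothetical, and the key is kept short because index keys have a size limit.

```sql
-- Hypothetical demo table; the unnamed UNIQUE constraint covers the
-- columns that together define a duplicate.
CREATE TABLE #UniqueDemo (
    RowID int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    Col1 varchar(20) NOT NULL,
    Col3 tinyint NOT NULL,
    UNIQUE (Col1, Col3)
);

INSERT INTO #UniqueDemo (Col1, Col3) VALUES ('a', 1);

-- The second, identical insert violates the constraint and is rejected.
BEGIN TRY
    INSERT INTO #UniqueDemo (Col1, Col3) VALUES ('a', 1);
END TRY
BEGIN CATCH
    PRINT 'Duplicate rejected: ' + ERROR_MESSAGE();
END CATCH;

DROP TABLE #UniqueDemo;
```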
I had a table where I needed to preserve non-duplicate rows.
I'm not sure on the speed or efficiency.
DELETE FROM myTable WHERE RowID IN (
SELECT MIN(RowID) AS IDNo FROM myTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) = 2 )
Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works" you can go with:
DELETE FROM MyTable WHERE NOT RowID IN
(SELECT
(SELECT TOP 1 RowID FROM MyTable mt2
WHERE mt2.Col1 = mt.Col1
AND mt2.Col2 = mt.Col2
AND mt2.Col3 = mt.Col3)
FROM MyTable mt)
Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
Another way is to create a new table with the same fields and a unique index, then move all the data from the old table to the new one. With the index set to ignore duplicates (there is an option that controls what happens on a duplicate value: ignore, interrupt, etc.), SQL Server skips the duplicate values automatically, so you end up with the same table without duplicate rows. If you don't want the unique index, you can drop it after transferring the data.
Especially for larger tables, you can use DTS (an SSIS package to import/export data) to transfer all the data rapidly to the new uniquely indexed table. For 7 million rows it takes just a few minutes.
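A rough sketch of the unique-index transfer, using plain T-SQL rather than DTS/SSIS. The table and index names (#Source, #Deduped, UX_Deduped) are hypothetical, and IGNORE_DUP_KEY = ON is assumed to be the "ignore" option meant above.

```sql
-- Hypothetical source table containing duplicates.
CREATE TABLE #Source (Col1 varchar(20) NOT NULL, Col3 tinyint NOT NULL);
INSERT INTO #Source (Col1, Col3) VALUES ('a', 1), ('a', 1), ('b', 2);

-- New table with the same fields plus a unique index that silently
-- discards duplicate keys instead of failing the whole insert.
CREATE TABLE #Deduped (Col1 varchar(20) NOT NULL, Col3 tinyint NOT NULL);
CREATE UNIQUE INDEX UX_Deduped ON #Deduped (Col1, Col3)
    WITH (IGNORE_DUP_KEY = ON);

-- Duplicates are dropped with a "Duplicate key was ignored." warning.
INSERT INTO #Deduped (Col1, Col3)
SELECT Col1, Col3
FROM #Source;

SELECT Col1, Col3 FROM #Deduped;

DROP TABLE #Deduped;
DROP TABLE #Source;
```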
Using the query below, we can delete duplicate records based on a single column or multiple columns. The query below deletes based on two columns. The table name is testing and the column names are empno and empname.
DELETE FROM testing WHERE empno not IN (SELECT empno FROM (SELECT empno, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
or empname not in
(select empname from (select empname,row_number() over(PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
Create new blank table with the same structure
Execute query like this
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) > 1
Then execute this query
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) = 1
Another way of doing this:
DELETE A
FROM TABLE A,
TABLE B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.UNIQUEFIELD > B.UNIQUEFIELD
I would mention this approach as well as it can be helpful, and works in all SQL servers:
Pretty often there is only one - two duplicates, and Ids and count of duplicates are known. In this case:
SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0
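On SQL Server 2005 and later the same effect is available with DELETE TOP, which Microsoft's documentation for SET ROWCOUNT recommends over SET ROWCOUNT for DELETE, INSERT and UPDATE statements. A minimal sketch with a hypothetical #TopDemo table:

```sql
-- Hypothetical table with two fully identical rows.
CREATE TABLE #TopDemo (Col1 varchar(20) NOT NULL, Col2 int NOT NULL);
INSERT INTO #TopDemo (Col1, Col2) VALUES ('a', 1), ('a', 1), ('b', 2);

-- Remove exactly one of the two identical ('a', 1) rows.
DELETE TOP (1) FROM #TopDemo
WHERE Col1 = 'a' AND Col2 = 1;

-- One ('a', 1) row and the ('b', 2) row remain.
SELECT Col1, Col2 FROM #TopDemo;

DROP TABLE #TopDemo;
```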
From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005, an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.
I dunno how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like:
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
after INSERT, UPDATE
AS
DECLARE @cnt AS INT
SELECT @cnt = Count(*)
FROM stories
INNER JOIN inserted
ON ( stories.story = inserted.story
AND stories.story_id != inserted.story_id )
IF @cnt > 0
BEGIN
RAISERROR('plagiarism detected',16,1)
ROLLBACK TRANSACTION
END
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
DELETE
FROM
table_name T1
WHERE
rowid > (
SELECT
min(rowid)
FROM
table_name T2
WHERE
T1.column_name = T2.column_name
);
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)
INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)
--SELECT * FROM car
;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)
DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)
If you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep, see http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
with MYCTE as (
SELECT ROW_NUMBER() OVER (
PARTITION BY DuplicateKey1
,DuplicateKey2 -- optional
ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
) RN
FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1
Related
How can I delete duplicate rows where no unique row id exists?
My table is
col1 col2 col3 col4 col5 col6 col7
john 1 1 1 1 1 1
john 1 1 1 1 1 1
sally 2 2 2 2 2 2
sally 2 2 2 2 2 2
I want to be left with the following after the duplicate removal:
john 1 1 1 1 1 1
sally 2 2 2 2 2 2
I've tried a few queries but I think they depend on having a row id as I don't get the desired result. For example:
DELETE
FROM table
WHERE col1 IN (
SELECT id
FROM table
GROUP BY id
HAVING (COUNT(col1) > 1)
)
I like CTEs and ROW_NUMBER as the two combined allow us to see which rows are deleted (or updated), therefore just change the DELETE FROM CTE... to SELECT * FROM CTE:
WITH CTE AS(
SELECT [col1], [col2], [col3], [col4], [col5], [col6], [col7],
RN = ROW_NUMBER()OVER(PARTITION BY col1 ORDER BY col1)
FROM dbo.Table1
)
DELETE FROM CTE WHERE RN > 1
DEMO (result is different; I assume that it's due to a typo on your part)
COL1 COL2 COL3 COL4 COL5 COL6 COL7
john 1 1 1 1 1 1
sally 2 2 2 2 2 2
This example determines duplicates by a single column col1 because of the PARTITION BY col1. If you want to include multiple columns simply add them to the PARTITION BY:
ROW_NUMBER()OVER(PARTITION BY Col1, Col2, ... ORDER BY OrderColumn)
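To illustrate the preview idea, here is a sketch that partitions by all seven columns and selects the rows that would be deleted instead of deleting them. #Table1Demo is a hypothetical stand-in for the table in the question.

```sql
-- Hypothetical copy of the question's sample data.
CREATE TABLE #Table1Demo (
    col1 varchar(10), col2 int, col3 int, col4 int,
    col5 int, col6 int, col7 int
);

INSERT INTO #Table1Demo VALUES
('john', 1, 1, 1, 1, 1, 1),
('john', 1, 1, 1, 1, 1, 1),
('sally', 2, 2, 2, 2, 2, 2),
('sally', 2, 2, 2, 2, 2, 2);

-- Preview only: rows with RN > 1 are the ones a DELETE FROM CTE would remove.
WITH CTE AS (
    SELECT col1, col2, col3, col4, col5, col6, col7,
           RN = ROW_NUMBER() OVER (
               PARTITION BY col1, col2, col3, col4, col5, col6, col7
               ORDER BY col1)
    FROM #Table1Demo
)
SELECT * FROM CTE WHERE RN > 1;

DROP TABLE #Table1Demo;
```

With this data the preview lists one john row and one sally row.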
Without using a CTE and ROW_NUMBER(), you can delete the records using GROUP BY with the MAX function. Here is an example:
DELETE
FROM MyDuplicateTable
WHERE ID NOT IN
(
SELECT MAX(ID)
FROM MyDuplicateTable
GROUP BY DuplicateColumn1, DuplicateColumn2, DuplicateColumn3)
If you have no references, like foreign keys, you can do this. I do it a lot when testing proofs of concept and the test data gets duplicated.
SELECT DISTINCT [col1],[col2],[col3],[col4],[col5],[col6],[col7]
INTO [newTable]
FROM [oldTable]
Go into the object explorer and delete the old table.
Rename the new table with the old table's name.
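The Object Explorer steps can also be scripted. The sketch below assumes a scratch database and uses hypothetical permanent tables dbo.oldTable and dbo.newTable, with only two columns for brevity.

```sql
-- Hypothetical permanent tables; run only in a scratch database.
CREATE TABLE dbo.oldTable (col1 varchar(10), col2 int);
INSERT INTO dbo.oldTable (col1, col2)
VALUES ('john', 1), ('john', 1), ('sally', 2);

-- Copy the distinct rows into a new table.
SELECT DISTINCT col1, col2
INTO dbo.newTable
FROM dbo.oldTable;

-- Scripted equivalent of deleting the old table and renaming the new one.
DROP TABLE dbo.oldTable;
EXEC sp_rename 'dbo.newTable', 'oldTable';

SELECT col1, col2 FROM dbo.oldTable;
```

SELECT ... INTO does not copy indexes, constraints or triggers, so these would have to be recreated on the new table.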
Remove all duplicates, but the very first ones (with min ID)
should work equally in other SQL servers, like Postgres:
DELETE FROM table
WHERE id NOT IN (
select min(id) from table
group by col1, col2, col3, col4, col5, col6, col7
)
DELETE from search
where id not in (
select min(id) from search
group by url
having count(*)=1
union
SELECT min(id) FROM search
group by url
having count(*) > 1
)
There are two solutions in mysql:
A) Delete duplicate rows using DELETE JOIN statement
DELETE t1 FROM contacts t1
INNER JOIN contacts t2
WHERE
t1.id < t2.id AND
t1.email = t2.email;
This query references the contacts table twice, therefore, it uses the table alias t1 and t2.
The output is:
Query OK, 4 rows affected (0.10 sec)
In case you want to delete duplicate rows and keep the lowest id, you can use the following statement:
DELETE c1 FROM contacts c1
INNER JOIN contacts c2
WHERE
c1.id > c2.id AND
c1.email = c2.email;
B) Delete duplicate rows using an intermediate table
The following shows the steps for removing duplicate rows using an intermediate table:
1. Create a new table with the same structure as the original table whose duplicate rows you want to delete.
2. Insert distinct rows from the original table into the intermediate table.
3. Drop the original table and rename the intermediate table to the original table's name.
Step 1. Create a new table whose structure is the same as the original table:
CREATE TABLE source_copy LIKE source;
Step 2. Insert distinct rows from the original table to the new table:
INSERT INTO source_copy
SELECT * FROM source
GROUP BY col; -- column that has duplicate values
Step 3. Drop the original table and rename the intermediate table to the original one:
DROP TABLE source;
ALTER TABLE source_copy RENAME TO source;
Source: http://www.mysqltutorial.org/mysql-delete-duplicate-rows/
Please see the below way of deletion too.
Declare @table table
(col1 varchar(10),col2 int,col3 int, col4 int, col5 int, col6 int, col7 int)
Insert into @table values
('john',1,1,1,1,1,1),
('john',1,1,1,1,1,1),
('sally',2,2,2,2,2,2),
('sally',2,2,2,2,2,2)
Created a sample table named @table and loaded it with the given data.
Delete aliasName from (
Select *,
ROW_NUMBER() over (Partition by col1,col2,col3,col4,col5,col6,col7 order by col1) as rowNumber
From @table) aliasName
Where rowNumber > 1
Select * from @table
Note: If you list all columns in the PARTITION BY clause, the ORDER BY has little significance.
I know the question was asked three years ago, and my answer is another version of what Tim has posted, but I'm posting it just in case it is helpful to anyone.
It can be done in many ways in SQL Server.
The simplest way is:
Insert the distinct rows from the table with duplicates into a new temporary table. Then delete all the data from the original table and insert all the data from the temporary table, which has no duplicates, as shown below.
select distinct * into #tmp From table
delete from table
insert into table
select * from #tmp
drop table #tmp
select * from table
Delete duplicate rows using Common Table Expression(CTE)
With CTE_Duplicates as
(select id,name , row_number()
over(partition by id,name order by id,name ) rownumber from table )
delete from CTE_Duplicates where rownumber!=1
To delete the duplicate rows from the table in SQL Server, you follow these steps:
Find duplicate rows using GROUP BY clause or ROW_NUMBER() function.
Use DELETE statement to remove the duplicate rows.
Setting up a sample table
DROP TABLE IF EXISTS contacts;
CREATE TABLE contacts(
contact_id INT IDENTITY(1,1) PRIMARY KEY,
first_name NVARCHAR(100) NOT NULL,
last_name NVARCHAR(100) NOT NULL,
email NVARCHAR(255) NOT NULL,
);
Insert values
INSERT INTO contacts
(first_name,last_name,email)
VALUES
('Syed','Abbas','syed.abbas@example.com'),
('Catherine','Abel','catherine.abel@example.com'),
('Kim','Abercrombie','kim.abercrombie@example.com'),
('Kim','Abercrombie','kim.abercrombie@example.com'),
('Kim','Abercrombie','kim.abercrombie@example.com'),
('Hazem','Abolrous','hazem.abolrous@example.com'),
('Hazem','Abolrous','hazem.abolrous@example.com'),
('Humberto','Acevedo','humberto.acevedo@example.com'),
('Humberto','Acevedo','humberto.acevedo@example.com'),
('Pilar','Ackerman','pilar.ackerman@example.com');
Query
SELECT
contact_id,
first_name,
last_name,
email
FROM
contacts;
Delete duplicate rows from a table
WITH cte AS (
SELECT
contact_id,
first_name,
last_name,
email,
ROW_NUMBER() OVER (
PARTITION BY
first_name,
last_name,
email
ORDER BY
first_name,
last_name,
email
) row_num
FROM
contacts
)
DELETE FROM cte
WHERE row_num > 1;
Should delete the record now
Try to Use:
SELECT linkorder
,Row_Number() OVER (
PARTITION BY linkorder ORDER BY linkorder DESC
) AS RowNum
FROM u_links
Microsoft has a very neat guide on how to remove duplicates. Check out http://support.microsoft.com/kb/139444
In brief, here is the easiest way to delete duplicates when you have just a few rows to delete:
SET rowcount 1;
DELETE FROM t1 WHERE myprimarykey=1;
myprimarykey is the identifier for the row.
I set rowcount to 1 because I only had two rows that were duplicated. If I had had 3 rows duplicated then I would have set rowcount to 2 so that it deletes the first two that it sees and only leaves one in table t1.
with myCTE
as
(
select productName,ROW_NUMBER() over(PARTITION BY productName order by slno) as Duplicate from productDetails
)
Delete from myCTE where Duplicate>1
The solutions suggested above work for small and medium tables.
For very large tables I can suggest the following solution, since it runs in iterations.
Drop all views that depend on LargeSourceTable.
You can find the dependencies in SQL Server Management Studio: right-click the table and click "View Dependencies".
Rename the table:
sp_rename 'LargeSourceTable', 'LargeSourceTable_Temp';
GO
Create LargeSourceTable again, but this time add a primary key on all the columns that define the duplicates, with WITH (IGNORE_DUP_KEY = ON).
For example:
CREATE TABLE [dbo].[LargeSourceTable]
(
ID int IDENTITY(1,1),
[CreateDate] DATETIME CONSTRAINT [DF_LargeSourceTable_CreateDate] DEFAULT (getdate()) NOT NULL,
[Column1] CHAR (36) NOT NULL,
[Column2] NVARCHAR (100) NOT NULL,
[Column3] CHAR (36) NOT NULL,
PRIMARY KEY (Column1, Column2) WITH (IGNORE_DUP_KEY = ON)
);
GO
Recreate the views that you dropped in the first place, now against the newly created table.
Now run the following SQL script. It reports progress in pages of 1,000,000 rows; you can lower the number of rows per page to see progress more often.
Note that I set IDENTITY_INSERT on and off because one of the columns contains an auto-incrementing ID, which I am also copying.
SET IDENTITY_INSERT LargeSourceTable ON
DECLARE @PageNumber AS INT, @RowspPage AS INT
DECLARE @TotalRows AS INT
declare @dt varchar(19)
SET @PageNumber = 0
SET @RowspPage = 1000000
select @TotalRows = count (*) from LargeSourceTable_TEMP
While ((@PageNumber - 1) * @RowspPage < @TotalRows )
Begin
begin transaction tran_inner
; with cte as
(
SELECT * FROM LargeSourceTable_TEMP ORDER BY ID
OFFSET ((@PageNumber) * @RowspPage) ROWS
FETCH NEXT @RowspPage ROWS ONLY
)
INSERT INTO LargeSourceTable
(
ID
,[CreateDate]
,[Column1]
,[Column2]
,[Column3]
)
select
ID
,[CreateDate]
,[Column1]
,[Column2]
,[Column3]
from cte
commit transaction tran_inner
PRINT 'Page: ' + convert(varchar(10), @PageNumber)
PRINT 'Transfered: ' + convert(varchar(20), @PageNumber * @RowspPage)
PRINT 'Of: ' + convert(varchar(20), @TotalRows)
SELECT @dt = convert(varchar(19), getdate(), 121)
RAISERROR('Inserted on: %s', 0, 1, @dt) WITH NOWAIT
SET @PageNumber = @PageNumber + 1
End
SET IDENTITY_INSERT LargeSourceTable OFF
-- this query will keep only one instance of a duplicate record.
;WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2, col3-- based on what? --can be multiple columns
ORDER BY ( SELECT 0)) RN
FROM Mytable)
delete FROM cte
WHERE RN > 1
You need to group by the duplicate records according to the field(s), then hold one of the records and delete the rest.
For example:
DELETE prg.Person WHERE Id IN (
SELECT dublicateRow.Id FROM
(
select MIN(Id) MinId, NationalCode
from prg.Person group by NationalCode having count(NationalCode ) > 1
) GroupSelect
JOIN prg.Person dublicateRow ON dublicateRow.NationalCode = GroupSelect.NationalCode
WHERE dublicateRow.Id <> GroupSelect.MinId)
Deleting duplicates from a huge table (several million records) might take a long time. I suggest that you do a bulk insert of the selected rows into a temp table rather than deleting.
--REWRITING YOUR CODE (TAKE NOTE OF THE 3RD LINE)
WITH CTE AS (SELECT NAME, ROW_NUMBER() OVER (PARTITION BY NAME ORDER BY NAME) ID FROM @TB)
SELECT * INTO #unique_records
FROM CTE WHERE ID = 1;
This might help in your case
DELETE t1 FROM table t1 INNER JOIN table t2 WHERE t1.id > t2.id AND t1.col1 = t2.col1
With reference to https://support.microsoft.com/en-us/help/139444/how-to-remove-duplicate-rows-from-a-table-in-sql-server
The idea of removing duplicates involves:
a) protecting the rows that are not duplicates;
b) retaining one of the many rows that together qualify as duplicates.
Step-by-step
1) First identify the rows that satisfy the definition of a duplicate and insert them into a temp table, say #tableAll.
2) Select the non-duplicate (single) or distinct rows into a temp table, say #tableUnique.
3) Delete from the source table, joining to #tableAll, to delete the duplicates.
4) Insert all the rows from #tableUnique into the source table.
5) Drop #tableAll and #tableUnique.
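The steps above can be sketched as follows. This is one possible reading of them, not the linked article's exact script: #Source and its columns are hypothetical, #tableAll holds the duplicated keys, and #tableUnique holds one copy of each duplicated row.

```sql
-- Hypothetical source table with fully duplicated rows.
CREATE TABLE #Source (Col1 varchar(20) NOT NULL, Col2 int NOT NULL);
INSERT INTO #Source (Col1, Col2)
VALUES ('a', 1), ('a', 1), ('b', 2), ('c', 3), ('c', 3), ('c', 3);

-- 1) Identify the keys that occur more than once.
SELECT Col1, Col2
INTO #tableAll
FROM #Source
GROUP BY Col1, Col2
HAVING COUNT(*) > 1;

-- 2) Keep one distinct copy of each duplicated row.
SELECT DISTINCT s.Col1, s.Col2
INTO #tableUnique
FROM #Source AS s
INNER JOIN #tableAll AS a
    ON s.Col1 = a.Col1 AND s.Col2 = a.Col2;

-- 3) Delete every copy of the duplicated rows from the source.
--    Rows that were never duplicated are not touched.
DELETE s
FROM #Source AS s
INNER JOIN #tableAll AS a
    ON s.Col1 = a.Col1 AND s.Col2 = a.Col2;

-- 4) Put the single copies back.
INSERT INTO #Source (Col1, Col2)
SELECT Col1, Col2
FROM #tableUnique;

-- 5) Clean up the helper tables.
DROP TABLE #tableAll, #tableUnique;

-- One row each for ('a', 1), ('b', 2) and ('c', 3) remains.
SELECT Col1, Col2 FROM #Source;

DROP TABLE #Source;
```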
If you have the ability to add a column to the table temporarily, this was a solution that worked for me:
ALTER TABLE dbo.DUPPEDTABLE ADD RowID INT NOT NULL IDENTITY(1,1)
Then perform a DELETE using a combination of MIN and GROUP BY
DELETE b
FROM dbo.DUPPEDTABLE b
WHERE b.RowID NOT IN (
SELECT MIN(RowID) AS RowID
FROM dbo.DUPPEDTABLE a WITH (NOLOCK)
GROUP BY a.ITEM_NUMBER,
a.CHARACTERISTIC,
a.INTVALUE,
a.FLOATVALUE,
a.STRINGVALUE
);
Verify that the DELETE performed correctly:
SELECT a.ITEM_NUMBER,
a.CHARACTERISTIC,
a.INTVALUE,
a.FLOATVALUE,
a.STRINGVALUE, COUNT(*)--MIN(RowID) AS RowID
FROM dbo.DUPPEDTABLE a WITH (NOLOCK)
GROUP BY a.ITEM_NUMBER,
a.CHARACTERISTIC,
a.INTVALUE,
a.FLOATVALUE,
a.STRINGVALUE
ORDER BY COUNT(*) DESC
The result should have no rows with a count greater than 1. Finally, remove the rowid column:
ALTER TABLE dbo.DUPPEDTABLE DROP COLUMN RowID;
Oh wow, I feel so stupid reading all these answers; they read like experts' answers, with all the CTEs, temp tables, etc.
And all I did to get it working was simply aggregate the ID column using MAX.
DELETE FROM table WHERE col1 IN (
SELECT MAX(id) FROM table GROUP BY id HAVING ( COUNT(col1) > 1 )
)
NOTE: You might need to run it multiple times to remove all duplicates, as this only deletes one set of duplicate rows at a time.
Please simply add the keyword DISTINCT right after the SELECT command,
for example:
SELECT DISTINCT ColumnOne, ColumnTwo, ColumnThree
FROM YourTable
Another way of removing duplicated rows without losing information, in one step, is the following:
delete t1 from dublicated_table t1
join (
select t2.dublicated_field
, min(len(t2.field_kept)) as min_field_kept
from dublicated_table t2 (nolock)
group by t2.dublicated_field having COUNT(*)>1
) t3
on t1.dublicated_field=t3.dublicated_field
and len(t1.field_kept)=t3.min_field_kept
DECLARE @TB TABLE(NAME VARCHAR(100));
INSERT INTO @TB VALUES ('Red'),('Red'),('Green'),('Blue'),('White'),('White')
--**Delete by Rank**
;WITH CTE AS(SELECT NAME,DENSE_RANK() OVER (PARTITION BY NAME ORDER BY NEWID()) ID FROM @TB)
DELETE FROM CTE WHERE ID>1
SELECT NAME FROM @TB;
--**Delete by Row Number**
;WITH CTE AS(SELECT NAME,ROW_NUMBER() OVER (PARTITION BY NAME ORDER BY NAME) ID FROM @TB)
DELETE FROM CTE WHERE ID>1;
SELECT NAME FROM @TB;
DELETE FROM TBL1 WHERE ID IN
(SELECT ID FROM TBL1 a WHERE ID!=
(select MAX(ID) from TBL1 where DUPVAL=a.DUPVAL
group by DUPVAL
having count(DUPVAL)>1))
DELETE p1 FROM Person p1,
Person p2
WHERE
p1.Email = p2.Email AND p1.Id > p2.Id
How can I delete duplicate rows where no unique row id exists?
My table is
col1 col2 col3 col4 col5 col6 col7
john 1 1 1 1 1 1
john 1 1 1 1 1 1
sally 2 2 2 2 2 2
sally 2 2 2 2 2 2
I want to be left with the following after the duplicate removal:
john 1 1 1 1 1 1
sally 2 2 2 2 2 2
I've tried a few queries but I think they depend on having a row id as I don't get the desired result. For example:
DELETE
FROM table
WHERE col1 IN (
SELECT id
FROM table
GROUP BY id
HAVING (COUNT(col1) > 1)
)
I like CTEs and ROW_NUMBER as the two combined allow us to see which rows are deleted (or updated), therefore just change the DELETE FROM CTE... to SELECT * FROM CTE:
WITH CTE AS(
SELECT [col1], [col2], [col3], [col4], [col5], [col6], [col7],
RN = ROW_NUMBER()OVER(PARTITION BY col1 ORDER BY col1)
FROM dbo.Table1
)
DELETE FROM CTE WHERE RN > 1
DEMO (result is different; I assume that it's due to a typo on your part)
COL1 COL2 COL3 COL4 COL5 COL6 COL7
john 1 1 1 1 1 1
sally 2 2 2 2 2 2
This example determines duplicates by a single column col1 because of the PARTITION BY col1. If you want to include multiple columns simply add them to the PARTITION BY:
ROW_NUMBER()OVER(PARTITION BY Col1, Col2, ... ORDER BY OrderColumn)
I would prefer CTE for deleting duplicate rows from sql server table
strongly recommend to follow this article ::http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
by keeping original
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
without keeping original
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
Without using CTE and ROW_NUMBER() you can just delete the records just by using group by with MAX function here is an example
DELETE
FROM MyDuplicateTable
WHERE ID NOT IN
(
SELECT MAX(ID)
FROM MyDuplicateTable
GROUP BY DuplicateColumn1, DuplicateColumn2, DuplicateColumn3)
If you have no references, like foreign keys, you can do this. I do it a lot when testing proofs of concept and the test data gets duplicated.
SELECT DISTINCT [col1],[col2],[col3],[col4],[col5],[col6],[col7]
INTO [newTable]
FROM [oldTable]
Go into the object explorer and delete the old table.
Rename the new table with the old table's name.
Remove all duplicates, but the very first ones (with min ID)
should work equally in other SQL servers, like Postgres:
DELETE FROM table
WHERE id NOT IN (
select min(id) from table
group by col1, col2, col3, col4, col5, col6, col7
)
DELETE from search
where id not in (
select min(id) from search
group by url
having count(*)=1
union
SELECT min(id) FROM search
group by url
having count(*) > 1
)
There are two solutions in mysql:
A) Delete duplicate rows using DELETE JOIN statement
DELETE t1 FROM contacts t1
INNER JOIN contacts t2
WHERE
t1.id < t2.id AND
t1.email = t2.email;
This query references the contacts table twice, therefore, it uses the table alias t1 and t2.
The output is:
1
Query OK, 4 rows affected (0.10 sec)
In case you want to delete duplicate rows and keep the lowest id, you can use the following statement:
DELETE c1 FROM contacts c1
INNER JOIN contacts c2
WHERE
c1.id > c2.id AND
c1.email = c2.email;
B) Delete duplicate rows using an intermediate table
The following shows the steps for removing duplicate rows using an intermediate table:
1. Create a new table with the structure the same as the original table that you want to delete duplicate rows.
2. Insert distinct rows from the original table to the immediate table.
3. Insert distinct rows from the original table to the immediate table.
Step 1. Create a new table whose structure is the same as the original table:
CREATE TABLE source_copy LIKE source;
Step 2. Insert distinct rows from the original table to the new table:
INSERT INTO source_copy
SELECT * FROM source
GROUP BY col; -- column that has duplicate values
Step 3. drop the original table and rename the immediate table to the original one
DROP TABLE source;
ALTER TABLE source_copy RENAME TO source;
Source: http://www.mysqltutorial.org/mysql-delete-duplicate-rows/
Please see the below way of deletion too.
Declare @table table
(col1 varchar(10),col2 int,col3 int, col4 int, col5 int, col6 int, col7 int)
Insert into @table values
('john',1,1,1,1,1,1),
('john',1,1,1,1,1,1),
('sally',2,2,2,2,2,2),
('sally',2,2,2,2,2,2)
Created a sample table named @table and loaded it with given data.
Delete aliasName from (
Select *,
ROW_NUMBER() over (Partition by col1,col2,col3,col4,col5,col6,col7 order by col1) as rowNumber
From @table) aliasName
Where rowNumber > 1
Select * from @table
Note: If you list all columns in the PARTITION BY clause, then the ORDER BY does not have much significance.
I know the question was asked three years ago, and my answer is another version of what Tim has posted, but I am posting it just in case it is helpful for anyone.
It can be done in many ways in SQL Server.
The simplest way to do so is:
Insert the distinct rows from the table with duplicate rows into a new temporary table. Then delete all the data from the table with duplicates and insert all the data from the temporary table, which has no duplicates, as shown below.
select distinct * into #tmp From table
delete from table
insert into table
select * from #tmp
drop table #tmp
select * from table
Delete duplicate rows using Common Table Expression(CTE)
With CTE_Duplicates as
(select id,name , row_number()
over(partition by id,name order by id,name ) rownumber from table )
delete from CTE_Duplicates where rownumber!=1
To delete the duplicate rows from the table in SQL Server, you follow these steps:
Find duplicate rows using GROUP BY clause or ROW_NUMBER() function.
Use DELETE statement to remove the duplicate rows.
Setting up a sample table
DROP TABLE IF EXISTS contacts;
CREATE TABLE contacts(
contact_id INT IDENTITY(1,1) PRIMARY KEY,
first_name NVARCHAR(100) NOT NULL,
last_name NVARCHAR(100) NOT NULL,
email NVARCHAR(255) NOT NULL,
);
Insert values
INSERT INTO contacts
(first_name,last_name,email)
VALUES
('Syed','Abbas','syed.abbas@example.com'),
('Catherine','Abel','catherine.abel@example.com'),
('Kim','Abercrombie','kim.abercrombie@example.com'),
('Kim','Abercrombie','kim.abercrombie@example.com'),
('Kim','Abercrombie','kim.abercrombie@example.com'),
('Hazem','Abolrous','hazem.abolrous@example.com'),
('Hazem','Abolrous','hazem.abolrous@example.com'),
('Humberto','Acevedo','humberto.acevedo@example.com'),
('Humberto','Acevedo','humberto.acevedo@example.com'),
('Pilar','Ackerman','pilar.ackerman@example.com');
Query
SELECT
contact_id,
first_name,
last_name,
email
FROM
contacts;
Delete duplicate rows from a table
WITH cte AS (
SELECT
contact_id,
first_name,
last_name,
email,
ROW_NUMBER() OVER (
PARTITION BY
first_name,
last_name,
email
ORDER BY
first_name,
last_name,
email
) row_num
FROM
contacts
)
DELETE FROM cte
WHERE row_num > 1;
Should delete the record now
Try to Use:
SELECT linkorder
,Row_Number() OVER (
PARTITION BY linkorder ORDER BY linkorder DESC
) AS RowNum
FROM u_links
Microsoft has a very neat guide on how to remove duplicates. Check out http://support.microsoft.com/kb/139444
In brief, here is the easiest way to delete duplicates when you have just a few rows to delete:
SET rowcount 1;
DELETE FROM t1 WHERE myprimarykey=1;
myprimarykey is the identifier for the row.
I set rowcount to 1 because I only had two rows that were duplicated. If I had had 3 rows duplicated then I would have set rowcount to 2 so that it deletes the first two that it sees and only leaves one in table t1.
with myCTE
as
(
select productName,ROW_NUMBER() over(PARTITION BY productName order by slno) as Duplicate from productDetails
)
Delete from myCTE where Duplicate>1
The solutions suggested above work for small and medium tables.
For very large tables I can suggest the following approach, since it runs in iterations.
Drop all views that depend on LargeSourceTable.
You can find the dependencies in SQL Server Management Studio: right-click the table and click "View Dependencies".
Rename the table:
sp_rename 'LargeSourceTable', 'LargeSourceTable_Temp';
GO
Create LargeSourceTable again, but this time add a primary key over all the columns that define a duplicate, and add WITH (IGNORE_DUP_KEY = ON).
For example:
CREATE TABLE [dbo].[LargeSourceTable]
(
ID int IDENTITY(1,1),
[CreateDate] DATETIME CONSTRAINT [DF_LargeSourceTable_CreateDate] DEFAULT (getdate()) NOT NULL,
[Column1] CHAR (36) NOT NULL,
[Column2] NVARCHAR (100) NOT NULL,
[Column3] CHAR (36) NOT NULL,
PRIMARY KEY (Column1, Column2) WITH (IGNORE_DUP_KEY = ON)
);
GO
Recreate the views that you dropped earlier, now against the newly created table.
Now run the following SQL script. It reports progress once per page of 1,000,000 rows; you can change the number of rows per page to see progress more often.
Note that I set IDENTITY_INSERT on and off because one of the columns contains an auto-incrementing id, which I am also copying.
SET IDENTITY_INSERT LargeSourceTable ON
DECLARE @PageNumber AS INT, @RowspPage AS INT
DECLARE @TotalRows AS INT
declare @dt varchar(19)
SET @PageNumber = 0
SET @RowspPage = 1000000
select @TotalRows = count (*) from LargeSourceTable_TEMP
-- loop while the current page still starts inside the source table
While (@PageNumber * @RowspPage < @TotalRows )
Begin
begin transaction tran_inner
; with cte as
(
SELECT * FROM LargeSourceTable_TEMP ORDER BY ID
OFFSET ((@PageNumber) * @RowspPage) ROWS
FETCH NEXT @RowspPage ROWS ONLY
)
INSERT INTO LargeSourceTable
(
ID
,[CreateDate]
,[Column1]
,[Column2]
,[Column3]
)
select
ID
,[CreateDate]
,[Column1]
,[Column2]
,[Column3]
from cte
commit transaction tran_inner
PRINT 'Page: ' + convert(varchar(10), @PageNumber)
PRINT 'Transferred: ' + convert(varchar(20), @PageNumber * @RowspPage)
PRINT 'Of: ' + convert(varchar(20), @TotalRows)
SELECT @dt = convert(varchar(19), getdate(), 121)
RAISERROR('Inserted on: %s', 0, 1, @dt) WITH NOWAIT
SET @PageNumber = @PageNumber + 1
End
SET IDENTITY_INSERT LargeSourceTable OFF
-- this query will keep only one instance of a duplicate record.
;WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2, col3-- based on what? --can be multiple columns
ORDER BY ( SELECT 0)) RN
FROM Mytable)
delete FROM cte
WHERE RN > 1
You need to group by the duplicate records according to the field(s), then hold one of the records and delete the rest.
For example:
DELETE prg.Person WHERE Id IN (
SELECT dublicateRow.Id FROM
(
select MIN(Id) MinId, NationalCode
from prg.Person group by NationalCode having count(NationalCode ) > 1
) GroupSelect
JOIN prg.Person dublicateRow ON dublicateRow.NationalCode = GroupSelect.NationalCode
WHERE dublicateRow.Id <> GroupSelect.MinId)
Deleting duplicates from a huge table (several million records) might take a long time. I suggest that you do a bulk insert of the selected rows into a temp table rather than deleting.
-- Rewriting your code (take note of the 3rd line)
WITH CTE AS (SELECT NAME, ROW_NUMBER()
OVER (PARTITION BY NAME ORDER BY NAME) ID FROM @TB)
SELECT * INTO #unique_records FROM CTE WHERE ID = 1;
This might help in your case
DELETE t1 FROM table t1 INNER JOIN table t2 WHERE t1.id > t2.id AND t1.col1 = t2.col1
With reference to https://support.microsoft.com/en-us/help/139444/how-to-remove-duplicate-rows-from-a-table-in-sql-server
The idea of removing duplicates involves
a) Protecting the rows that are not duplicated
b) Retaining one row from each set of rows that qualify together as duplicates.
Step-by-step
1) First identify the rows that satisfy the definition of a duplicate and insert them into a temp table, say #tableAll.
2) Select the non-duplicate (single) or distinct rows into a temp table, say #tableUnique.
3) Delete from the source table, joining to #tableAll, to delete the duplicates.
4) Insert into the source table all the rows from #tableUnique.
5) Drop #tableAll and #tableUnique.
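The five steps above can be sketched roughly as follows. This is only an illustration run against SQLite through Python's sqlite3 module (my assumption, chosen so it runs without a server); the src table, its NOT NULL columns col1 and col2, and the temp table names table_all and table_unique are hypothetical stand-ins for #tableAll and #tableUnique.

```python
import sqlite3


def dedupe_via_temp_tables(conn):
    """Follow the five steps: collect duplicate keys, keep one copy, delete, re-insert, drop."""
    cur = conn.cursor()
    # 1) Keys that occur more than once (the rows that count as duplicates).
    cur.execute(
        "CREATE TEMP TABLE table_all AS "
        "SELECT col1, col2 FROM src GROUP BY col1, col2 HAVING COUNT(*) > 1"
    )
    # 2) One distinct copy of each duplicated row.
    cur.execute(
        "CREATE TEMP TABLE table_unique AS "
        "SELECT DISTINCT s.col1, s.col2 FROM src s "
        "JOIN table_all a ON s.col1 = a.col1 AND s.col2 = a.col2"
    )
    # 3) Remove every copy of the duplicated rows from the source table.
    cur.execute(
        "DELETE FROM src WHERE EXISTS ("
        "SELECT 1 FROM table_all a WHERE a.col1 = src.col1 AND a.col2 = src.col2)"
    )
    # 4) Put exactly one copy of each back.
    cur.execute("INSERT INTO src (col1, col2) SELECT col1, col2 FROM table_unique")
    conn.commit()
    # 5) Clean up the helper tables.
    cur.execute("DROP TABLE table_all")
    cur.execute("DROP TABLE table_unique")
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src (col1 TEXT NOT NULL, col2 INTEGER NOT NULL)")
    conn.executemany(
        "INSERT INTO src VALUES (?, ?)",
        [("a", 1), ("a", 1), ("b", 2), ("c", 3), ("c", 3), ("c", 3)],
    )
    dedupe_via_temp_tables(conn)
    return sorted(conn.execute("SELECT col1, col2 FROM src").fetchall())
```

Rows that were never duplicated (the b row here) are not touched at all, which is the "protect" part of the idea. The equality joins would skip NULL values, so this sketch only holds for NOT NULL columns.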
If you have the ability to add a column to the table temporarily, this was a solution that worked for me:
ALTER TABLE dbo.DUPPEDTABLE ADD RowID INT NOT NULL IDENTITY(1,1)
Then perform a DELETE using a combination of MIN and GROUP BY
DELETE b
FROM dbo.DUPPEDTABLE b
WHERE b.RowID NOT IN (
SELECT MIN(RowID) AS RowID
FROM dbo.DUPPEDTABLE a WITH (NOLOCK)
GROUP BY a.ITEM_NUMBER,
a.CHARACTERISTIC,
a.INTVALUE,
a.FLOATVALUE,
a.STRINGVALUE
);
Verify that the DELETE performed correctly:
SELECT a.ITEM_NUMBER,
a.CHARACTERISTIC,
a.INTVALUE,
a.FLOATVALUE,
a.STRINGVALUE, COUNT(*)--MIN(RowID) AS RowID
FROM dbo.DUPPEDTABLE a WITH (NOLOCK)
GROUP BY a.ITEM_NUMBER,
a.CHARACTERISTIC,
a.INTVALUE,
a.FLOATVALUE,
a.STRINGVALUE
ORDER BY COUNT(*) DESC
The result should have no rows with a count greater than 1. Finally, remove the rowid column:
ALTER TABLE dbo.DUPPEDTABLE DROP COLUMN RowID;
Oh wow, I feel so stupid after reading all these answers; they are like experts' answers, with all the CTEs and temp tables and so on.
All I did to get it working was simply aggregate the ID column using MAX.
DELETE FROM table WHERE id IN (
SELECT MAX(id) FROM table GROUP BY col1 HAVING ( COUNT(col1) > 1 )
)
NOTE: you might need to run it multiple times to remove all duplicates, as this will only delete one row from each set of duplicates at a time.
Simply add the keyword DISTINCT right after SELECT,
for example:
SELECT DISTINCT ColumnOne, ColumnTwo, ColumnThree
FROM YourTable
Another way of removing duplicated rows without losing information in one step is the following:
delete t1
from dublicated_table t1
join (
select t2.dublicated_field
, min(len(t2.field_kept)) as min_field_kept
from dublicated_table t2 (nolock)
group by t2.dublicated_field having COUNT(*)>1
) t3
on t1.dublicated_field=t3.dublicated_field
and len(t1.field_kept)=t3.min_field_kept
DECLARE @TB TABLE(NAME VARCHAR(100));
INSERT INTO @TB VALUES ('Red'),('Red'),('Green'),('Blue'),('White'),('White')
--**Delete by Rank**
;WITH CTE AS(SELECT NAME,DENSE_RANK() OVER (PARTITION BY NAME ORDER BY NEWID()) ID FROM @TB)
DELETE FROM CTE WHERE ID>1
SELECT NAME FROM @TB;
--**Delete by Row Number**
;WITH CTE AS(SELECT NAME,ROW_NUMBER() OVER (PARTITION BY NAME ORDER BY NAME) ID FROM @TB)
DELETE FROM CTE WHERE ID>1;
SELECT NAME FROM @TB;
DELETE FROM TBL1 WHERE ID IN
(SELECT ID FROM TBL1 a WHERE ID!=
(select MAX(ID) from TBL1 where DUPVAL=a.DUPVAL
group by DUPVAL
having count(DUPVAL)>1))
DELETE p1 FROM Person p1,
Person p2
WHERE
p1.Email = p2.Email AND p1.Id > p2.Id
There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
The following query is useful to delete duplicate rows. The table in this example has ID as an identity column and the columns which have duplicate data are Column1, Column2 and Column3.
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
nullable. Because of semantics of NOT IN (NULL) including the clause
below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
The following script shows usage of GROUP BY, HAVING, ORDER BY in one query, and returns the results with duplicate column and its count.
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid
Postgres:
delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid
DELETE LU
FROM (SELECT *,
Row_number()
OVER (
partition BY col1, col2, col3
ORDER BY rowid DESC) [Row]
FROM mytable) LU
WHERE [row] > 1
This will delete duplicate rows, except the first row
DELETE
FROM
Mytable
WHERE
RowID NOT IN (
SELECT
MIN(RowID)
FROM
Mytable
GROUP BY
Col1,
Col2,
Col3
)
Refer (http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server)
I prefer a CTE for deleting duplicate rows from a SQL Server table.
I strongly recommend this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
Keeping the original:
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
without keeping original
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
To Fetch Duplicate Rows:
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING COUNT(*) > 1
To Delete the Duplicate Rows:
DELETE users
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);
Quick and Dirty to delete exact duplicated rows (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;
I prefer the subquery\having count(*) > 1 solution to the inner join because I found it easier to read and it was very easy to turn into a SELECT statement to verify what would be deleted before you run it.
--DELETE FROM table1
--WHERE id IN (
SELECT MIN(id) FROM table1
GROUP BY col1, col2, col3
-- could add a WHERE clause here to further filter
HAVING count(*) > 1
--)
SELECT DISTINCT *
INTO tempdb.dbo.tmpTable
FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
I thought I'd share my solution since it works under special circumstances.
In my case the table with duplicate values did not have a foreign key (because the values were duplicated from another db).
begin transaction
-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2
-- insert distinct values into temp
insert into #temp
select distinct *
from tableName
-- delete from source
delete from tableName
-- insert into source from temp
insert into tableName
select *
from #temp
rollback transaction
-- if this works, change rollback to commit and execute again to keep your changes!!
PS: When working on things like this I always use a transaction. This not only ensures everything is executed as a whole, but also allows me to test without risking anything. But of course you should take a backup anyway, just to be sure...
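The same "run it under a rollback first" habit can be sketched in Python with the standard sqlite3 module. This is only an illustration under my own assumptions: SQLite instead of SQL Server, a hypothetical table_name table with no identity column, and a connection opened with isolation_level=None so that BEGIN, COMMIT and ROLLBACK are issued explicitly.

```python
import sqlite3


def dedupe_in_transaction(conn, commit=False):
    """Rebuild table_name from its distinct rows; roll back unless commit=True.

    Expects a connection opened with isolation_level=None (explicit transactions).
    Returns the row count the table has (or would have) after deduplication.
    """
    conn.execute("BEGIN")
    try:
        conn.execute("CREATE TEMP TABLE tmp_distinct AS SELECT DISTINCT * FROM table_name")
        conn.execute("DELETE FROM table_name")
        conn.execute("INSERT INTO table_name SELECT * FROM tmp_distinct")
        remaining = conn.execute("SELECT COUNT(*) FROM table_name").fetchall()[0][0]
    except Exception:
        conn.execute("ROLLBACK")
        raise
    # Dry run by default: nothing is kept unless the caller asks for a commit.
    conn.execute("COMMIT" if commit else "ROLLBACK")
    # After a commit the helper table still exists; after a rollback it is already gone.
    conn.execute("DROP TABLE IF EXISTS tmp_distinct")
    return remaining


def make_demo_connection():
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE table_name (val TEXT)")
    conn.executemany("INSERT INTO table_name VALUES (?)", [("a",), ("a",), ("b",)])
    return conn


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM table_name").fetchall()[0][0]
```

Calling dedupe_in_transaction(conn) reports what the result would be and leaves the table untouched; calling it again with commit=True keeps the change. A backup is still a good idea either way.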
This query showed very good performance for me:
DELETE tbl
FROM
MyTable tbl
WHERE
EXISTS (
SELECT
*
FROM
MyTable tbl2
WHERE
tbl2.SameValue = tbl.SameValue
AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)
it deleted 1M rows in little more than 30sec from a table of 2M (50% duplicates)
Using CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever you like:
;with cte as (
select
min(PrimaryKey) as PrimaryKey,
UniqueColumn1,
UniqueColumn2
from dbo.DuplicatesTable
group by
UniqueColumn1, UniqueColumn2
having count(*) > 1
)
delete d
from dbo.DuplicatesTable d
inner join cte on
d.PrimaryKey > cte.PrimaryKey and
d.UniqueColumn1 = cte.UniqueColumn1 and
d.UniqueColumn2 = cte.UniqueColumn2;
Yet another easy solution can be found at the link pasted here. This one is easy to grasp and seems to be effective for most similar problems. It is written for SQL Server, but the concept it uses is more than acceptable.
Here are the relevant portions from the linked page:
Consider this data:
EMPLOYEE_ID ATTENDANCE_DATE
A001 2011-01-01
A001 2011-01-01
A002 2011-01-01
A002 2011-01-01
A002 2011-01-01
A003 2011-01-01
So how can we delete those duplicate data?
First, insert an identity column in that table by using the following code:
ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Use the following code to resolve it:
DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID)
FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE)
This is the easiest way to delete duplicate record
DELETE FROM tblemp WHERE id IN
(
SELECT MIN(id) FROM tblemp
GROUP BY title HAVING COUNT(id)>1
)
Use this
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1
Here is another good article on removing duplicates.
It discusses why it's hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
The temp table solution, and two MySQL examples.
In the future, are you going to prevent it at the database level or from an application perspective? I would suggest the database level, because your database should be responsible for maintaining referential integrity; developers just will cause problems ;)
I had a table where I needed to preserve non-duplicate rows.
I'm not sure on the speed or efficiency.
DELETE FROM myTable WHERE RowID IN (
SELECT MIN(RowID) AS IDNo FROM myTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) = 2 )
Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works" you can go with:
DELETE FROM MyTable WHERE NOT RowID IN
(SELECT
(SELECT TOP 1 RowID FROM MyTable mt2
WHERE mt2.Col1 = mt.Col1
AND mt2.Col2 = mt.Col2
AND mt2.Col3 = mt.Col3)
FROM MyTable mt)
Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
Another way is to create a new table with the same fields and a unique index, then move all the data from the old table to the new one. SQL Server can ignore the duplicate values automatically (there is an option on the index for what to do when a duplicate value is found: ignore, interrupt, etc.), so you end up with the same table without duplicate rows. If you don't want the unique index, you can drop it after transferring the data.
Especially for larger tables, you can use DTS (an SSIS package to import/export data) to transfer all the data rapidly to the new uniquely indexed table. For 7 million rows it takes just a few minutes.
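To show the unique-index idea in something you can run anywhere, here is a rough sketch using SQLite through Python's sqlite3 module. SQLite is my assumption, not what the answer used: its INSERT OR IGNORE plays the role that an index created WITH (IGNORE_DUP_KEY = ON) plays in SQL Server. The src and dest tables, their columns, and the index name are hypothetical.

```python
import sqlite3


def copy_ignoring_duplicates(conn):
    """Copy src into a new uniquely indexed table, silently skipping duplicate rows."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE dest (col1 TEXT NOT NULL, col2 INTEGER NOT NULL)")
    # The unique index defines what counts as a duplicate.
    cur.execute("CREATE UNIQUE INDEX ux_dest ON dest (col1, col2)")
    # OR IGNORE skips any row that would violate the unique index.
    cur.execute("INSERT OR IGNORE INTO dest (col1, col2) SELECT col1, col2 FROM src")
    conn.commit()
    return cur.execute("SELECT COUNT(*) FROM dest").fetchall()[0][0]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src (col1 TEXT NOT NULL, col2 INTEGER NOT NULL)")
    conn.executemany(
        "INSERT INTO src VALUES (?, ?)",
        [("a", 1), ("a", 1), ("b", 2), ("b", 2), ("c", 3)],
    )
    kept = copy_ignoring_duplicates(conn)
    rows = sorted(conn.execute("SELECT col1, col2 FROM dest").fetchall())
    return kept, rows
```

Which of the duplicate rows survives is simply whichever one gets inserted first, so this fits best when the duplicates are identical in every column you care about.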
Using the query below, we can delete duplicate records based on a single column or on multiple columns. The query below deletes based on two columns. The table name is testing and the column names are empno and empname.
DELETE FROM testing WHERE empno not IN (SELECT empno FROM (SELECT empno, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
or empname not in
(select empname from (select empname,row_number() over(PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
Create new blank table with the same structure
Execute query like this
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) > 1
Then execute this query
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) = 1
Another way of doing this :--
DELETE A
FROM TABLE A,
TABLE B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.UNIQUEFIELD > B.UNIQUEFIELD
I would mention this approach as well as it can be helpful, and works in all SQL servers:
Pretty often there is only one - two duplicates, and Ids and count of duplicates are known. In this case:
SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0
From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005, an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.
I dunno how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like:
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
after INSERT, UPDATE
AS
DECLARE @cnt AS INT
SELECT @cnt = Count(*)
FROM stories
INNER JOIN inserted
ON ( stories.story = inserted.story
AND stories.story_id != inserted.story_id )
IF @cnt > 0
BEGIN
RAISERROR('plagiarism detected',16,1)
ROLLBACK TRANSACTION
END
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
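For what it's worth, the trigger idea can be tried out quickly in SQLite via Python's sqlite3 module. This is a loose sketch and not a port of the T-SQL above: SQLite trigger syntax differs (RAISE(ABORT, ...) instead of RAISERROR plus ROLLBACK), it only guards INSERT and not UPDATE, and the helper names are hypothetical.

```python
import sqlite3


def create_stories_with_guard(conn):
    """Create the stories table plus a trigger that rejects an already stored story."""
    conn.executescript(
        """
        CREATE TABLE stories (story_id INTEGER PRIMARY KEY, story TEXT NOT NULL);
        CREATE TRIGGER prevent_plagiarism
        BEFORE INSERT ON stories
        WHEN EXISTS (SELECT 1 FROM stories WHERE story = NEW.story)
        BEGIN
            SELECT RAISE(ABORT, 'plagiarism detected');
        END;
        """
    )


def try_insert(conn, text):
    """Return True if the story was stored, False if the trigger rejected it."""
    try:
        conn.execute("INSERT INTO stories (story) VALUES (?)", (text,))
        conn.commit()
        return True
    except sqlite3.DatabaseError:
        conn.rollback()
        return False


def story_count(conn):
    return conn.execute("SELECT COUNT(*) FROM stories").fetchall()[0][0]
```

The lookup on the story column is a full scan unless that column is indexed, so the performance concern from the answer applies here as well.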
DELETE
FROM
table_name T1
WHERE
rowid > (
SELECT
min(rowid)
FROM
table_name T2
WHERE
T1.column_name = T2.column_name
);
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)
INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)
--SELECT * FROM car
;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)
DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)
If you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep, see http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
with MYCTE as (
SELECT ROW_NUMBER() OVER (
PARTITION BY DuplicateKey1
,DuplicateKey2 -- optional
ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
) RN
FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1
How can I delete duplicate rows where no unique row id exists?
My table is
col1 col2 col3 col4 col5 col6 col7
john 1 1 1 1 1 1
john 1 1 1 1 1 1
sally 2 2 2 2 2 2
sally 2 2 2 2 2 2
I want to be left with the following after the duplicate removal:
john 1 1 1 1 1 1
sally 2 2 2 2 2 2
I've tried a few queries but I think they depend on having a row id as I don't get the desired result. For example:
DELETE
FROM table
WHERE col1 IN (
SELECT id
FROM table
GROUP BY id
HAVING (COUNT(col1) > 1)
)
I like CTEs and ROW_NUMBER as the two combined allow us to see which rows are deleted (or updated), therefore just change the DELETE FROM CTE... to SELECT * FROM CTE:
WITH CTE AS(
SELECT [col1], [col2], [col3], [col4], [col5], [col6], [col7],
RN = ROW_NUMBER()OVER(PARTITION BY col1 ORDER BY col1)
FROM dbo.Table1
)
DELETE FROM CTE WHERE RN > 1
DEMO (result is different; I assume that it's due to a typo on your part)
COL1 COL2 COL3 COL4 COL5 COL6 COL7
john 1 1 1 1 1 1
sally 2 2 2 2 2 2
This example determines duplicates by a single column col1 because of the PARTITION BY col1. If you want to include multiple columns simply add them to the PARTITION BY:
ROW_NUMBER()OVER(PARTITION BY Col1, Col2, ... ORDER BY OrderColumn)
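The preview-then-delete workflow can be sketched as below. It is only an approximation run on SQLite (3.25 or later, for window functions) through Python's sqlite3 module, which is my assumption. SQLite cannot delete through a CTE the way SQL Server can, so the sketch leans on SQLite's implicit rowid instead; table1 and its two columns are hypothetical.

```python
import sqlite3

# Number the rows inside each duplicate group; rn = 1 is the row that will be kept.
NUMBERED = (
    "SELECT rowid AS rid, col1, col2, "
    "ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY rowid) AS rn "
    "FROM table1"
)


def preview_duplicates(conn):
    """Return the rows that delete_duplicates() would remove, without changing anything."""
    sql = f"SELECT rid, col1, col2, rn FROM ({NUMBERED}) AS numbered WHERE rn > 1 ORDER BY rid"
    return conn.execute(sql).fetchall()


def delete_duplicates(conn):
    """Delete every row numbered 2 or higher within its group; return how many were removed."""
    sql = (
        "DELETE FROM table1 WHERE rowid IN "
        f"(SELECT rid FROM ({NUMBERED}) AS numbered WHERE rn > 1)"
    )
    cur = conn.execute(sql)
    conn.commit()
    return cur.rowcount


def make_demo_connection():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (col1 TEXT, col2 INTEGER)")
    conn.executemany(
        "INSERT INTO table1 VALUES (?, ?)",
        [("john", 1), ("john", 1), ("sally", 2), ("sally", 2)],
    )
    return conn
```

Running preview_duplicates first shows exactly which rows have rn > 1, which mirrors switching DELETE FROM CTE to SELECT * FROM CTE before committing to the delete.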
Without using a CTE and ROW_NUMBER(), you can delete the records just by using GROUP BY with the MAX function. Here is an example:
DELETE
FROM MyDuplicateTable
WHERE ID NOT IN
(
SELECT MAX(ID)
FROM MyDuplicateTable
GROUP BY DuplicateColumn1, DuplicateColumn2, DuplicateColumn3)
If you have no references, like foreign keys, you can do this. I do it a lot when testing proofs of concept and the test data gets duplicated.
SELECT DISTINCT [col1],[col2],[col3],[col4],[col5],[col6],[col7]
INTO [newTable]
FROM [oldTable]
Go into the object explorer and delete the old table.
Rename the new table with the old table's name.
Remove all duplicates, but the very first ones (with min ID)
should work equally in other SQL servers, like Postgres:
DELETE FROM table
WHERE id NOT IN (
select min(id) from table
group by col1, col2, col3, col4, col5, col6, col7
)
DELETE from search
where id not in (
select min(id) from search
group by url
having count(*)=1
union
SELECT min(id) FROM search
group by url
having count(*) > 1
)
There are two solutions in mysql:
A) Delete duplicate rows using DELETE JOIN statement
DELETE t1 FROM contacts t1
INNER JOIN contacts t2
WHERE
t1.id < t2.id AND
t1.email = t2.email;
This query references the contacts table twice, therefore, it uses the table alias t1 and t2.
The output is:
1
Query OK, 4 rows affected (0.10 sec)
In case you want to delete duplicate rows and keep the lowest id, you can use the following statement:
DELETE c1 FROM contacts c1
INNER JOIN contacts c2
WHERE
c1.id > c2.id AND
c1.email = c2.email;
B) Delete duplicate rows using an intermediate table
The following shows the steps for removing duplicate rows using an intermediate table:
1. Create a new table with the structure the same as the original table that you want to delete duplicate rows.
2. Insert distinct rows from the original table to the immediate table.
3. Insert distinct rows from the original table to the immediate table.
Step 1. Create a new table whose structure is the same as the original table:
CREATE TABLE source_copy LIKE source;
Step 2. Insert distinct rows from the original table to the new table:
INSERT INTO source_copy
SELECT * FROM source
GROUP BY col; -- column that has duplicate values
Step 3. drop the original table and rename the immediate table to the original one
DROP TABLE source;
ALTER TABLE source_copy RENAME TO source;
Source: http://www.mysqltutorial.org/mysql-delete-duplicate-rows/
Please see the below way of deletion too.
Declare #table table
(col1 varchar(10),col2 int,col3 int, col4 int, col5 int, col6 int, col7 int)
Insert into #table values
('john',1,1,1,1,1,1),
('john',1,1,1,1,1,1),
('sally',2,2,2,2,2,2),
('sally',2,2,2,2,2,2)
Created a sample table named #table and loaded it with given data.
Delete aliasName from (
Select *,
ROW_NUMBER() over (Partition by col1,col2,col3,col4,col5,col6,col7 order by col1) as rowNumber
From #table) aliasName
Where rowNumber > 1
Select * from #table
Note: If you are giving all columns in the Partition by part, then order by do not have much significance.
I know, the question is asked three years ago, and my answer is another version of what Tim has posted, But posting just incase it is helpful for anyone.
It can be done by many ways in sql server
the most simplest way to do so is:
Insert the distinct rows from the duplicate rows table to new temporary table. Then delete all the data from duplicate rows table then insert all data from temporary table which has no duplicates as shown below.
select distinct * into #tmp From table
delete from table
insert into table
select * from #tmp drop table #tmp
select * from table
Delete duplicate rows using Common Table Expression(CTE)
With CTE_Duplicates as
(select id,name , row_number()
over(partition by id,name order by id,name ) rownumber from table )
delete from CTE_Duplicates where rownumber!=1
To delete the duplicate rows from the table in SQL Server, you follow these steps:
Find duplicate rows using GROUP BY clause or ROW_NUMBER() function.
Use DELETE statement to remove the duplicate rows.
Setting up a sample table
DROP TABLE IF EXISTS contacts;
CREATE TABLE contacts(
contact_id INT IDENTITY(1,1) PRIMARY KEY,
first_name NVARCHAR(100) NOT NULL,
last_name NVARCHAR(100) NOT NULL,
email NVARCHAR(255) NOT NULL,
);
Insert values
INSERT INTO contacts
(first_name,last_name,email)
VALUES
('Syed','Abbas','syed.abbas#example.com'),
('Catherine','Abel','catherine.abel#example.com'),
('Kim','Abercrombie','kim.abercrombie#example.com'),
('Kim','Abercrombie','kim.abercrombie#example.com'),
('Kim','Abercrombie','kim.abercrombie#example.com'),
('Hazem','Abolrous','hazem.abolrous#example.com'),
('Hazem','Abolrous','hazem.abolrous#example.com'),
('Humberto','Acevedo','humberto.acevedo#example.com'),
('Humberto','Acevedo','humberto.acevedo#example.com'),
('Pilar','Ackerman','pilar.ackerman#example.com');
Query
SELECT
contact_id,
first_name,
last_name,
email
FROM
contacts;
Delete duplicate rows from a table
WITH cte AS (
SELECT
contact_id,
first_name,
last_name,
email,
ROW_NUMBER() OVER (
PARTITION BY
first_name,
last_name,
email
ORDER BY
first_name,
last_name,
email
) row_num
FROM
contacts
)
DELETE FROM cte
WHERE row_num > 1;
Should delete the record now
Try to Use:
SELECT linkorder
,Row_Number() OVER (
PARTITION BY linkorder ORDER BY linkorder DESC
) AS RowNum
FROM u_links
Microsoft has a vey ry neat guide on how to remove duplicates. Check out http://support.microsoft.com/kb/139444
In brief, here is the easiest way to delete duplicates when you have just a few rows to delete:
SET rowcount 1;
DELETE FROM t1 WHERE myprimarykey=1;
myprimarykey is the identifier for the row.
I set rowcount to 1 because I only had two rows that were duplicated. If I had had 3 rows duplicated then I would have set rowcount to 2 so that it deletes the first two that it sees and only leaves one in table t1.
with myCTE
as
(
select productName,ROW_NUMBER() over(PARTITION BY productName order by slno) as Duplicate from productDetails
)
Delete from myCTE where Duplicate>1
After trying the suggested solution above, that works for small medium tables.
I can suggest that solution for very large tables. since it runs in iterations.
Drop all dependency views of the LargeSourceTable
you can find the dependecies by using sql managment studio, right click on the table and click "View Dependencies"
Rename the table:
sp_rename 'LargeSourceTable', 'LargeSourceTable_Temp'; GO
Create the LargeSourceTable again, but now, add a primary key with all the columns that define the duplications add WITH (IGNORE_DUP_KEY = ON)
For example:
CREATE TABLE [dbo].[LargeSourceTable]
(
ID int IDENTITY(1,1),
[CreateDate] DATETIME CONSTRAINT [DF_LargeSourceTable_CreateDate] DEFAULT (getdate()) NOT NULL,
[Column1] CHAR (36) NOT NULL,
[Column2] NVARCHAR (100) NOT NULL,
[Column3] CHAR (36) NOT NULL,
PRIMARY KEY (Column1, Column2) WITH (IGNORE_DUP_KEY = ON)
);
GO
Re-create the views that you dropped in the first step, now against the newly created table.
Now run the following SQL script. It reports progress after every page of 1,000,000 rows; you can change the number of rows per page to see progress more often.
Note that I set IDENTITY_INSERT on and off because one of the columns contains an auto-incrementing ID, which I'm also copying.
SET IDENTITY_INSERT LargeSourceTable ON
DECLARE @PageNumber AS INT, @RowspPage AS INT
DECLARE @TotalRows AS INT
declare @dt varchar(19)
SET @PageNumber = 0
SET @RowspPage = 1000000
select @TotalRows = count (*) from LargeSourceTable_TEMP
-- Loop while the current page still starts before the end of the source table
While (@PageNumber * @RowspPage < @TotalRows )
Begin
begin transaction tran_inner
; with cte as
(
SELECT * FROM LargeSourceTable_TEMP ORDER BY ID
OFFSET ((@PageNumber) * @RowspPage) ROWS
FETCH NEXT @RowspPage ROWS ONLY
)
INSERT INTO LargeSourceTable
(
ID
,[CreateDate]
,[Column1]
,[Column2]
,[Column3]
)
select
ID
,[CreateDate]
,[Column1]
,[Column2]
,[Column3]
from cte
commit transaction tran_inner
PRINT 'Page: ' + convert(varchar(10), @PageNumber)
PRINT 'Transferred: ' + convert(varchar(20), @PageNumber * @RowspPage)
PRINT 'Of: ' + convert(varchar(20), @TotalRows)
SELECT @dt = convert(varchar(19), getdate(), 121)
RAISERROR('Inserted on: %s', 0, 1, @dt) WITH NOWAIT
SET @PageNumber = @PageNumber + 1
End
End
SET IDENTITY_INSERT LargeSourceTable OFF
-- this query will keep only one instance of a duplicate record.
;WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2, col3-- based on what? --can be multiple columns
ORDER BY ( SELECT 0)) RN
FROM Mytable)
delete FROM cte
WHERE RN > 1
You need to group by the duplicate records according to the field(s), then hold one of the records and delete the rest.
For example:
DELETE prg.Person WHERE Id IN (
SELECT dublicateRow.Id FROM
(
select MIN(Id) MinId, NationalCode
from prg.Person group by NationalCode having count(NationalCode ) > 1
) GroupSelect
JOIN prg.Person dublicateRow ON dublicateRow.NationalCode = GroupSelect.NationalCode
WHERE dublicateRow.Id <> GroupSelect.MinId)
Deleting duplicates from a huge (several million records) table might take a long time. I suggest that you bulk-insert the selected rows into a temp table rather than deleting.
--REWRITING YOUR CODE (TAKE NOTE OF THE 3RD LINE)
WITH CTE AS(SELECT NAME,ROW_NUMBER()
OVER (PARTITION BY NAME ORDER BY NAME) ID FROM @TB)
SELECT * INTO #unique_records FROM CTE WHERE ID =1;
This might help in your case
DELETE t1 FROM table t1 INNER JOIN table t2 ON t1.col1 = t2.col1 WHERE t1.id > t2.id
With reference to https://support.microsoft.com/en-us/help/139444/how-to-remove-duplicate-rows-from-a-table-in-sql-server
The idea of removing duplicates involves
a) Protecting the rows that are not duplicates
b) Retaining one of the rows that together qualify as duplicates.
Step-by-step
1) First identify the rows that satisfy the definition of a duplicate and insert them into a temp table, say #tableAll.
2) Select the non-duplicate (single) or distinct rows into a temp table, say #tableUnique.
3) Delete from the source table, joining to #tableAll, to remove the duplicates.
4) Insert all the rows from #tableUnique into the source table.
5) Drop #tableAll and #tableUnique.
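The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: dbo.SourceTable and its columns col1, col2, col3 are hypothetical names, and a duplicate is taken to mean that all three columns match.

```sql
-- Assumed (hypothetical) table: dbo.SourceTable(col1, col2, col3), no identity column.

-- 1) Collect the column combinations that occur more than once.
SELECT col1, col2, col3
INTO #tableAll
FROM dbo.SourceTable
GROUP BY col1, col2, col3
HAVING COUNT(*) > 1;

-- 2) Keep exactly one copy of each duplicated row.
SELECT DISTINCT s.col1, s.col2, s.col3
INTO #tableUnique
FROM dbo.SourceTable AS s
JOIN #tableAll AS a
  ON a.col1 = s.col1 AND a.col2 = s.col2 AND a.col3 = s.col3;

BEGIN TRANSACTION;

-- 3) Remove every copy of the duplicated rows from the source table.
DELETE s
FROM dbo.SourceTable AS s
JOIN #tableAll AS a
  ON a.col1 = s.col1 AND a.col2 = s.col2 AND a.col3 = s.col3;

-- 4) Put a single copy of each back.
INSERT INTO dbo.SourceTable (col1, col2, col3)
SELECT col1, col2, col3 FROM #tableUnique;

COMMIT TRANSACTION;

-- 5) Clean up.
DROP TABLE #tableAll;
DROP TABLE #tableUnique;
```

Wrapping steps 3 and 4 in one transaction is a design choice, so that a failure between them does not leave the duplicated rows missing from the source table.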
If you have the ability to add a column to the table temporarily, this was a solution that worked for me:
ALTER TABLE dbo.DUPPEDTABLE ADD RowID INT NOT NULL IDENTITY(1,1)
Then perform a DELETE using a combination of MIN and GROUP BY
DELETE b
FROM dbo.DUPPEDTABLE b
WHERE b.RowID NOT IN (
SELECT MIN(RowID) AS RowID
FROM dbo.DUPPEDTABLE a WITH (NOLOCK)
GROUP BY a.ITEM_NUMBER,
a.CHARACTERISTIC,
a.INTVALUE,
a.FLOATVALUE,
a.STRINGVALUE
);
Verify that the DELETE performed correctly:
SELECT a.ITEM_NUMBER,
a.CHARACTERISTIC,
a.INTVALUE,
a.FLOATVALUE,
a.STRINGVALUE, COUNT(*)--MIN(RowID) AS RowID
FROM dbo.DUPPEDTABLE a WITH (NOLOCK)
GROUP BY a.ITEM_NUMBER,
a.CHARACTERISTIC,
a.INTVALUE,
a.FLOATVALUE,
a.STRINGVALUE
ORDER BY COUNT(*) DESC
The result should have no rows with a count greater than 1. Finally, remove the rowid column:
ALTER TABLE dbo.DUPPEDTABLE DROP COLUMN RowID;
Oh wow, I feel so stupid after reading all these answers; they read like experts' answers, with all the CTEs, temp tables, etc.
And all I did to get it working was simply aggregate the ID column using MAX.
DELETE FROM table WHERE id IN (
SELECT MAX(id) FROM table GROUP BY col1 HAVING ( COUNT(*) > 1 )
)
NOTE: you might need to run it multiple times to remove all duplicates, as this deletes only one row from each set of duplicates at a time.
Simply add the keyword DISTINCT right after SELECT,
for example:
SELECT DISTINCT ColumnOne, ColumnTwo, ColumnThree
FROM YourTable
Another way of removing duplicated rows without losing information, in one step, is the following:
delete t1
from dublicated_table t1
join (
select t2.dublicated_field
, min(len(t2.field_kept)) as min_field_kept
from dublicated_table t2 (nolock)
group by t2.dublicated_field having COUNT(*)>1
) t3
on t1.dublicated_field=t3.dublicated_field
and len(t1.field_kept)=t3.min_field_kept
DECLARE @TB TABLE(NAME VARCHAR(100));
INSERT INTO @TB VALUES ('Red'),('Red'),('Green'),('Blue'),('White'),('White')
--**Delete by Rank**
;WITH CTE AS(SELECT NAME,DENSE_RANK() OVER (PARTITION BY NAME ORDER BY NEWID()) ID FROM @TB)
DELETE FROM CTE WHERE ID>1
SELECT NAME FROM @TB;
--**Delete by Row Number**
;WITH CTE AS(SELECT NAME,ROW_NUMBER() OVER (PARTITION BY NAME ORDER BY NAME) ID FROM @TB)
DELETE FROM CTE WHERE ID>1;
SELECT NAME FROM @TB;
DELETE FROM TBL1 WHERE ID IN
(SELECT ID FROM TBL1 a WHERE ID!=
(select MAX(ID) from TBL1 where DUPVAL=a.DUPVAL
group by DUPVAL
having count(DUPVAL)>1))
DELETE p1 FROM Person p1,
Person p2
WHERE
p1.Email = p2.Email AND p1.Id > p2.Id
I have a table with, say, 3 columns. There's no primary key, so there can be duplicate rows. I need to keep just one and delete the others. Any idea how to do this in SQL Server?
I'd SELECT DISTINCT the rows and throw them into a temporary table, then drop the source table and copy back the data from the temp.
EDIT: now with code snippet!
INSERT INTO TABLE_2
SELECT DISTINCT * FROM TABLE_1
GO
DELETE FROM TABLE_1
GO
INSERT INTO TABLE_1
SELECT * FROM TABLE_2
GO
Add an identity column to act as a surrogate primary key, and use this to identify two of the three rows to be deleted.
I would consider leaving the identity column in place afterwards, or if this is some kind of link table, create a compound primary key on the other columns.
The following example works as well when your PK is just a subset of all table columns.
(Note: I like the approach with inserting another surrogate id column more. But maybe this solution comes handy as well.)
First find the duplicate rows:
SELECT col1, col2, count(*)
FROM t1
GROUP BY col1, col2
HAVING count(*) > 1
If there are only few, you can delete them manually:
set rowcount 1
delete from t1
where col1=1 and col2=1
The value of "rowcount" should be n-1, where n is the number of copies of the duplicated row. In this example there are 2 duplicates, therefore rowcount is 1. If you have several sets of duplicate rows, you have to do this for every duplicated key.
If you have many duplicates, then copy every key once into another table:
SELECT col1, col2, col3=count(*)
INTO holdkey
FROM t1
GROUP BY col1, col2
HAVING count(*) > 1
Then copy the keys, but eliminate the duplicates.
SELECT DISTINCT t1.*
INTO holddups
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2
holddups should now contain unique keys. Check that the following query returns no rows:
SELECT col1, col2, count(*)
FROM holddups
GROUP BY col1, col2
HAVING count(*) > 1
Delete the duplicates from the original table:
DELETE t1
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2
Insert the original rows:
INSERT t1 SELECT * FROM holddups
btw and for completeness: In Oracle there is a hidden field you could use (rowid):
DELETE FROM our_table
WHERE rowid not in
(SELECT MIN(rowid)
FROM our_table
GROUP BY column1, column2, column3...);
see: Microsoft Knowledge Site
Here's the method I used when I asked this question -
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
This is a way to do it with Common Table Expressions, CTE. It involves no loops, no new columns or anything and won't cause any unwanted triggers to fire (due to deletes+inserts).
Inspired by this article.
CREATE TABLE #temp (i INT)
INSERT INTO #temp VALUES (1)
INSERT INTO #temp VALUES (1)
INSERT INTO #temp VALUES (2)
INSERT INTO #temp VALUES (3)
INSERT INTO #temp VALUES (3)
INSERT INTO #temp VALUES (4)
SELECT * FROM #temp
;
WITH [#temp+rowid] AS
(SELECT ROW_NUMBER() OVER (ORDER BY i ASC) AS ROWID, * FROM #temp)
DELETE FROM [#temp+rowid] WHERE rowid IN
(SELECT MIN(rowid) FROM [#temp+rowid] GROUP BY i HAVING COUNT(*) > 1)
SELECT * FROM #temp
DROP TABLE #temp
This is a tough situation to be in. Without knowing your particular situation (table size etc.), I think that your best shot is to add an identity column, populate it and then delete according to it. You may remove the column later, but I would suggest keeping it, as it is really a good thing to have in the table.
After you clean up the current mess you could add a primary key that includes all the fields in the table. That will keep you from getting into the mess again.
Of course this solution could very well break existing code. That will have to be handled as well.
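A rough sketch of that sequence, assuming a hypothetical table dbo.MyTable with NOT NULL columns col1, col2, col3 (the names TempId and PK_MyTable are also made up for illustration):

```sql
-- Add a surrogate identity column; SQL Server populates it for existing rows.
ALTER TABLE dbo.MyTable ADD TempId INT IDENTITY(1,1) NOT NULL;
GO

-- Keep the lowest TempId in each group of identical rows, delete the rest.
DELETE FROM dbo.MyTable
WHERE TempId NOT IN (
    SELECT MIN(TempId)
    FROM dbo.MyTable
    GROUP BY col1, col2, col3
);
GO

-- Optional: drop the helper column again (the answer above suggests keeping it).
-- ALTER TABLE dbo.MyTable DROP COLUMN TempId;

-- Prevent new duplicates; this assumes all three columns are NOT NULL.
ALTER TABLE dbo.MyTable
    ADD CONSTRAINT PK_MyTable PRIMARY KEY (col1, col2, col3);
```

The GO separators are there because the DELETE references a column that does not exist yet when the first batch is compiled.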
Can you add a primary key identity field to the table?
Manrico Corazzi - I specialize in Oracle, not MS SQL, so you'll have to tell me if this is possible as a performance boost:-
Leave the same as your first step - insert distinct values into TABLE2 from TABLE1.
Drop TABLE1. (Drop should be faster than delete I assume, much as truncate is faster than delete).
Rename TABLE2 as TABLE1 (saves you time, as you're renaming an object rather than copying data from one table to another).
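In SQL Server this is possible; a minimal sketch, assuming TABLE1 lives in the dbo schema and nothing (foreign keys, schema-bound views) references it:

```sql
BEGIN TRANSACTION;

-- Copy one instance of every distinct row into a new table.
SELECT DISTINCT *
INTO dbo.TABLE2
FROM dbo.TABLE1;

-- Drop the original and give the copy its name.
DROP TABLE dbo.TABLE1;
EXEC sp_rename 'dbo.TABLE2', 'TABLE1';

COMMIT TRANSACTION;
```

Note that SELECT ... INTO creates a bare table: indexes, constraints, triggers and permissions on the original are not carried over and would have to be re-created.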
Here's another way, with test data
create table #table1 (colWithDupes1 int, colWithDupes2 int)
insert into #table1
(colWithDupes1, colWithDupes2)
Select 1, 2 union all
Select 1, 2 union all
Select 2, 2 union all
Select 3, 4 union all
Select 3, 4 union all
Select 3, 4 union all
Select 4, 2 union all
Select 4, 2
select * from #table1
set rowcount 1
select 1
while @@rowcount > 0
delete #table1 where 1 < (select count(*) from #table1 a2
where #table1.colWithDupes1 = a2.colWithDupes1
and #table1.colWithDupes2 = a2.colWithDupes2
)
set rowcount 0
select * from #table1
What about this solution :
First you execute the following query :
select 'set rowcount ' + convert(varchar,COUNT(*)-1) + ' delete from MyTable where field=''' + field +'''' + ' set rowcount 0' from mytable group by field having COUNT(*)>1
And then you just have to execute the returned result set
set rowcount 3 delete from Mytable where field='foo' set rowcount 0
....
....
set rowcount 5 delete from Mytable where field='bar' set rowcount 0
I've handled the case where you've got only one column, but it's pretty easy to adapt the same approach to more than one column. Let me know if you want me to post the code.
How about:
select distinct * into #t from duplicates_tbl
truncate table duplicates_tbl
insert duplicates_tbl select * from #t
drop table #t
I'm not sure if this works with DELETE statements, but this is a way to find duplicate rows:
SELECT *
FROM myTable t1, myTable t2
WHERE t1.field = t2.field AND t1.id > t2.id
I'm not sure if you can just change the "SELECT" to a "DELETE" (someone wanna let me know?), but even if you can't, you could just make it into a subquery.
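In SQL Server the SELECT can be turned into a DELETE, provided the alias of the rows to remove is named after DELETE. A sketch using the same myTable, field and id names as the query above:

```sql
-- For every pair of rows sharing the same field value,
-- delete the one with the higher id; the lowest id in each group survives.
DELETE t1
FROM myTable t1
JOIN myTable t2
  ON t1.field = t2.field
 AND t1.id > t2.id;
```

A row that matches several lower-id rows is still deleted only once, so a single run should be enough.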