finding duplicates and removing but keeping one value [duplicate] - sql

This question already has answers here:
Delete duplicate records in SQL Server?
(10 answers)
Closed 9 years ago.
I currently have a URL redirect table in my database that contains ~8000 rows and ~6000 of them are duplicates.
Is there a way to delete these duplicates based on the value of a certain column? I want to use my "old_url" column to find duplicates, and so far I have used:
SELECT old_url
,DuplicateCount = COUNT(1)
FROM tbl_ecom_url_redirect
GROUP BY old_url
HAVING COUNT(1) > 1 -- more than one value
ORDER BY COUNT(1) DESC -- sort by most duplicates
However, I'm not sure how to remove them now, as I don't want to lose every row, just the duplicates. The duplicate rows match almost completely, except that new_url is sometimes different and url_id (a GUID) is different every time.

In my opinion ranking functions and a CTE are the easiest approach:
WITH CTE AS
(
SELECT old_url
,Num = ROW_NUMBER()OVER(PARTITION BY old_url ORDER BY DateColumn ASC)
FROM tbl_ecom_url_redirect
)
DELETE FROM CTE WHERE Num > 1
Change ORDER BY DateColumn ASC accordingly to determine which record is kept and which records are deleted. In this case I keep the oldest record and delete all newer duplicates.
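Before deleting from a table of real redirects, it can help to preview the affected rows first. The following is a minimal sketch, assuming the columns url_id, old_url and new_url named in the question. Ordering by url_id is my own assumption (the question mentions no date column), so the surviving row per old_url is arbitrary but deterministic.

```sql
-- Sketch only: column names come from the question; ORDER BY url_id is an
-- assumed tie-breaker, not the answer author's choice.
BEGIN TRANSACTION;

-- 1) Preview the rows that would be deleted (every row after the first per old_url).
WITH CTE AS
(
    SELECT url_id, old_url, new_url,
           Num = ROW_NUMBER() OVER (PARTITION BY old_url ORDER BY url_id)
    FROM tbl_ecom_url_redirect
)
SELECT url_id, old_url, new_url
FROM CTE
WHERE Num > 1;

-- 2) Delete them.
WITH CTE AS
(
    SELECT Num = ROW_NUMBER() OVER (PARTITION BY old_url ORDER BY url_id)
    FROM tbl_ecom_url_redirect
)
DELETE FROM CTE WHERE Num > 1;

-- 3) Verify: this should return no rows.
SELECT old_url, COUNT(*) AS DuplicateCount
FROM tbl_ecom_url_redirect
GROUP BY old_url
HAVING COUNT(*) > 1;

-- Replace with COMMIT TRANSACTION once the result looks right.
ROLLBACK TRANSACTION;
```

Because duplicates here can carry different new_url values, the preview step is also a chance to check that the row being kept points to the redirect target you actually want.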

If your table has a primary key then this is easy:
BEGIN TRAN
CREATE TABLE #T(Id INT, OldUrl VARCHAR(MAX))
INSERT INTO #T VALUES
(1, 'foo'),
(2, 'moo'),
(3, 'foo'),
(4, 'moo'),
(5, 'foo'),
(6, 'zoo'),
(7, 'foo')
DELETE FROM #T WHERE Id NOT IN (
SELECT MIN(Id)
FROM #T
GROUP BY OldUrl
HAVING COUNT(OldUrl) = 1
UNION
SELECT MIN(Id)
FROM #T
GROUP BY OldUrl
HAVING COUNT(OldUrl) > 1)
SELECT * FROM #T
DROP TABLE #T
ROLLBACK
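The two branches of the UNION cover groups with exactly one row and groups with more than one row, so together they cover every group. A shorter form that should behave the same is sketched below with the same sample data; the VARCHAR(100) length and the PRIMARY KEY on Id are my own choices.

```sql
CREATE TABLE #T (Id INT PRIMARY KEY, OldUrl VARCHAR(100));

INSERT INTO #T VALUES
(1, 'foo'), (2, 'moo'), (3, 'foo'), (4, 'moo'),
(5, 'foo'), (6, 'zoo'), (7, 'foo');

-- Keep the lowest Id per OldUrl and delete every other row.
DELETE FROM #T
WHERE Id NOT IN (SELECT MIN(Id) FROM #T GROUP BY OldUrl);

-- Remaining rows: (1, 'foo'), (2, 'moo'), (6, 'zoo')
SELECT Id, OldUrl FROM #T ORDER BY Id;

DROP TABLE #T;
```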

This is a sample that deletes duplicate records identified by a GUID; I hope it helps.
DECLARE @t1 TABLE
(
DupID UNIQUEIDENTIFIER,
DupRecords NVARCHAR(255)
)
INSERT INTO @t1 VALUES
(NEWID(),'A1'),
(NEWID(),'A1'),
(NEWID(),'A2'),
(NEWID(),'A1'),
(NEWID(),'A3')
Now @t1 contains duplicated records, each with its own GUID.
;WITH CTE AS(
SELECT DupID,DupRecords, Rn = ROW_NUMBER()
OVER (PARTITION BY DupRecords ORDER BY DupRecords)
FROM @t1
)
DELETE FROM @t1 WHERE DupID IN (SELECT DupID FROM CTE WHERE Rn>1)
The query above deletes the duplicated records from @t1. ROW_NUMBER() numbers the rows within each group of duplicates, so every row after the first can be removed.
SELECT * FROM @t1

Related

Delete duplicates with different timestamps [duplicate]

How can I delete duplicate rows where no unique row id exists?
My table is
col1 col2 col3 col4 col5 col6 col7
john 1 1 1 1 1 1
john 1 1 1 1 1 1
sally 2 2 2 2 2 2
sally 2 2 2 2 2 2
I want to be left with the following after the duplicate removal:
john 1 1 1 1 1 1
sally 2 2 2 2 2 2
I've tried a few queries but I think they depend on having a row id as I don't get the desired result. For example:
DELETE
FROM table
WHERE col1 IN (
SELECT id
FROM table
GROUP BY id
HAVING (COUNT(col1) > 1)
)
I like CTEs and ROW_NUMBER as the two combined allow us to see which rows are deleted (or updated), therefore just change the DELETE FROM CTE... to SELECT * FROM CTE:
WITH CTE AS(
SELECT [col1], [col2], [col3], [col4], [col5], [col6], [col7],
RN = ROW_NUMBER()OVER(PARTITION BY col1 ORDER BY col1)
FROM dbo.Table1
)
DELETE FROM CTE WHERE RN > 1
DEMO (result is different; I assume that it's due to a typo on your part)
COL1 COL2 COL3 COL4 COL5 COL6 COL7
john 1 1 1 1 1 1
sally 2 2 2 2 2 2
This example determines duplicates by a single column col1 because of the PARTITION BY col1. If you want to include multiple columns simply add them to the PARTITION BY:
ROW_NUMBER()OVER(PARTITION BY Col1, Col2, ... ORDER BY OrderColumn)
I prefer a CTE for deleting duplicate rows from a SQL Server table,
and I strongly recommend this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
Keeping the original:
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
Without keeping the original:
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
Without using a CTE and ROW_NUMBER(), you can delete the records with GROUP BY and the MAX function. Here is an example:
DELETE
FROM MyDuplicateTable
WHERE ID NOT IN
(
SELECT MAX(ID)
FROM MyDuplicateTable
GROUP BY DuplicateColumn1, DuplicateColumn2, DuplicateColumn3)
If you have no references, like foreign keys, you can do this. I do it a lot when testing proofs of concept and the test data gets duplicated.
SELECT DISTINCT [col1],[col2],[col3],[col4],[col5],[col6],[col7]
INTO [newTable]
FROM [oldTable]
Go into the object explorer and delete the old table.
Rename the new table with the old table's name.
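The manual steps in Object Explorer can also be scripted. This is a minimal sketch, assuming both tables live in the dbo schema (the schema is my assumption). Note that SELECT ... INTO copies only columns and data, so indexes, constraints, and triggers on the old table would need to be re-created afterwards.

```sql
-- Copy one instance of each distinct row into a new table.
SELECT DISTINCT [col1],[col2],[col3],[col4],[col5],[col6],[col7]
INTO dbo.[newTable]
FROM dbo.[oldTable];

-- Remove the table that contains the duplicates.
DROP TABLE dbo.[oldTable];

-- sp_rename takes the current (schema-qualified) name and the new bare name.
EXEC sp_rename 'dbo.newTable', 'oldTable';
```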
Remove all duplicates except the very first ones (with the minimum ID).
This should work equally well in other SQL servers, like Postgres:
DELETE FROM table
WHERE id NOT IN (
select min(id) from table
group by col1, col2, col3, col4, col5, col6, col7
)
DELETE from search
where id not in (
select min(id) from search
group by url
having count(*)=1
union
SELECT min(id) FROM search
group by url
having count(*) > 1
)
There are two solutions in MySQL:
A) Delete duplicate rows using DELETE JOIN statement
DELETE t1 FROM contacts t1
INNER JOIN contacts t2
WHERE
t1.id < t2.id AND
t1.email = t2.email;
This query references the contacts table twice, therefore, it uses the table alias t1 and t2.
The output is:
Query OK, 4 rows affected (0.10 sec)
In case you want to delete duplicate rows and keep the lowest id, you can use the following statement:
DELETE c1 FROM contacts c1
INNER JOIN contacts c2
WHERE
c1.id > c2.id AND
c1.email = c2.email;
B) Delete duplicate rows using an intermediate table
The following shows the steps for removing duplicate rows using an intermediate table:
1. Create a new table with the same structure as the original table whose duplicate rows you want to delete.
2. Insert distinct rows from the original table into the intermediate table.
3. Drop the original table and rename the intermediate table to the original name.
Step 1. Create a new table whose structure is the same as the original table:
CREATE TABLE source_copy LIKE source;
Step 2. Insert distinct rows from the original table to the new table:
INSERT INTO source_copy
SELECT * FROM source
GROUP BY col; -- column that has duplicate values
Step 3. Drop the original table and rename the intermediate table to the original name:
DROP TABLE source;
ALTER TABLE source_copy RENAME TO source;
Source: http://www.mysqltutorial.org/mysql-delete-duplicate-rows/
Please see the following way of deleting too.
Declare @table table
(col1 varchar(10),col2 int,col3 int, col4 int, col5 int, col6 int, col7 int)
Insert into @table values
('john',1,1,1,1,1,1),
('john',1,1,1,1,1,1),
('sally',2,2,2,2,2,2),
('sally',2,2,2,2,2,2)
This creates a sample table variable named @table and loads it with the given data.
Delete aliasName from (
Select *,
ROW_NUMBER() over (Partition by col1,col2,col3,col4,col5,col6,col7 order by col1) as rowNumber
From @table) aliasName
Where rowNumber > 1
Select * from @table
Note: If you list all columns in the PARTITION BY clause, the ORDER BY has little significance.
I know the question was asked three years ago and my answer is another version of what Tim has posted, but I am posting it just in case it is helpful for anyone.
This can be done in many ways in SQL Server.
The simplest way is:
Insert the distinct rows from the table that contains duplicates into a new temporary table. Then delete all the data from the original table and insert all the data from the temporary table, which has no duplicates, as shown below.
select distinct * into #tmp From table
delete from table
insert into table
select * from #tmp
drop table #tmp
select * from table
Delete duplicate rows using Common Table Expression(CTE)
With CTE_Duplicates as
(select id,name , row_number()
over(partition by id,name order by id,name ) rownumber from table )
delete from CTE_Duplicates where rownumber!=1
To delete the duplicate rows from the table in SQL Server, you follow these steps:
Find duplicate rows using GROUP BY clause or ROW_NUMBER() function.
Use DELETE statement to remove the duplicate rows.
Setting up a sample table
DROP TABLE IF EXISTS contacts;
CREATE TABLE contacts(
contact_id INT IDENTITY(1,1) PRIMARY KEY,
first_name NVARCHAR(100) NOT NULL,
last_name NVARCHAR(100) NOT NULL,
email NVARCHAR(255) NOT NULL,
);
Insert values
INSERT INTO contacts
(first_name,last_name,email)
VALUES
('Syed','Abbas','syed.abbas@example.com'),
('Catherine','Abel','catherine.abel@example.com'),
('Kim','Abercrombie','kim.abercrombie@example.com'),
('Kim','Abercrombie','kim.abercrombie@example.com'),
('Kim','Abercrombie','kim.abercrombie@example.com'),
('Hazem','Abolrous','hazem.abolrous@example.com'),
('Hazem','Abolrous','hazem.abolrous@example.com'),
('Humberto','Acevedo','humberto.acevedo@example.com'),
('Humberto','Acevedo','humberto.acevedo@example.com'),
('Pilar','Ackerman','pilar.ackerman@example.com');
Query
SELECT
contact_id,
first_name,
last_name,
email
FROM
contacts;
Delete duplicate rows from a table
WITH cte AS (
SELECT
contact_id,
first_name,
last_name,
email,
ROW_NUMBER() OVER (
PARTITION BY
first_name,
last_name,
email
ORDER BY
first_name,
last_name,
email
) row_num
FROM
contacts
)
DELETE FROM cte
WHERE row_num > 1;
This should delete the duplicate records.
Try to Use:
SELECT linkorder
,Row_Number() OVER (
PARTITION BY linkorder ORDER BY linkorder DESC
) AS RowNum
FROM u_links
Microsoft has a very neat guide on how to remove duplicates. Check out http://support.microsoft.com/kb/139444
In brief, here is the easiest way to delete duplicates when you have just a few rows to delete:
SET rowcount 1;
DELETE FROM t1 WHERE myprimarykey=1;
myprimarykey is the identifier for the row.
I set rowcount to 1 because I only had two rows that were duplicated. If I had had 3 rows duplicated then I would have set rowcount to 2 so that it deletes the first two that it sees and only leaves one in table t1.
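Microsoft's documentation notes that SET ROWCOUNT is slated to stop affecting DELETE, INSERT, and UPDATE statements in a future release and recommends TOP instead. Below is a sketch of the same idea with DELETE TOP, reusing the t1 and myprimarykey names from above; the variable @extra is hypothetical.

```sql
-- Number of surplus copies of the row whose key is 1
-- (all copies minus the one to keep).
DECLARE @extra INT = (SELECT COUNT(*) - 1 FROM t1 WHERE myprimarykey = 1);

-- TOP with a variable needs parentheses; skip the delete when nothing is surplus.
IF @extra > 0
    DELETE TOP (@extra) FROM t1 WHERE myprimarykey = 1;
```

If you do stay with SET ROWCOUNT, remember to run SET ROWCOUNT 0 afterwards, since the limit otherwise stays in effect for the rest of the session.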
with myCTE
as
(
select productName,ROW_NUMBER() over(PARTITION BY productName order by slno) as Duplicate from productDetails
)
Delete from myCTE where Duplicate>1
The solutions suggested above work for small and medium tables.
For very large tables I can suggest the following solution, since it runs in iterations.
Drop all dependent views of LargeSourceTable.
You can find the dependencies in SQL Server Management Studio: right-click the table and click "View Dependencies".
Rename the table:
EXEC sp_rename 'LargeSourceTable', 'LargeSourceTable_Temp';
GO
Create LargeSourceTable again, but now add a primary key on all the columns that define a duplicate, and add WITH (IGNORE_DUP_KEY = ON).
For example:
CREATE TABLE [dbo].[LargeSourceTable]
(
ID int IDENTITY(1,1),
[CreateDate] DATETIME CONSTRAINT [DF_LargeSourceTable_CreateDate] DEFAULT (getdate()) NOT NULL,
[Column1] CHAR (36) NOT NULL,
[Column2] NVARCHAR (100) NOT NULL,
[Column3] CHAR (36) NOT NULL,
PRIMARY KEY (Column1, Column2) WITH (IGNORE_DUP_KEY = ON)
);
GO
Re-create the views that you dropped earlier so that they reference the newly created table.
Now run the following SQL script. It reports progress after every page of 1,000,000 rows; you can lower the number of rows per page to see progress more often.
Note that I set IDENTITY_INSERT on and off because one of the columns contains an auto-incrementing ID, which I am also copying.
SET IDENTITY_INSERT LargeSourceTable ON
DECLARE @PageNumber AS INT, @RowspPage AS INT
DECLARE @TotalRows AS INT
declare @dt varchar(19)
SET @PageNumber = 0
SET @RowspPage = 1000000
select @TotalRows = count (*) from LargeSourceTable_TEMP
-- Pages are numbered from 0, so stop once the page offset reaches the total.
While (@PageNumber * @RowspPage < @TotalRows )
Begin
begin transaction tran_inner
; with cte as
(
SELECT * FROM LargeSourceTable_TEMP ORDER BY ID
OFFSET ((@PageNumber) * @RowspPage) ROWS
FETCH NEXT @RowspPage ROWS ONLY
)
INSERT INTO LargeSourceTable
(
ID
,[CreateDate]
,[Column1]
,[Column2]
,[Column3]
)
select
ID
,[CreateDate]
,[Column1]
,[Column2]
,[Column3]
from cte
commit transaction tran_inner
PRINT 'Page: ' + convert(varchar(10), @PageNumber)
PRINT 'Transferred: ' + convert(varchar(20), @PageNumber * @RowspPage)
PRINT 'Of: ' + convert(varchar(20), @TotalRows)
SELECT @dt = convert(varchar(19), getdate(), 121)
RAISERROR('Inserted on: %s', 0, 1, @dt) WITH NOWAIT
SET @PageNumber = @PageNumber + 1
End
SET IDENTITY_INSERT LargeSourceTable OFF
-- this query will keep only one instance of a duplicate record.
;WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2, col3-- based on what? --can be multiple columns
ORDER BY ( SELECT 0)) RN
FROM Mytable)
delete FROM cte
WHERE RN > 1
You need to group by the duplicate records according to the field(s), then hold one of the records and delete the rest.
For example:
DELETE prg.Person WHERE Id IN (
SELECT dublicateRow.Id FROM
(
select MIN(Id) MinId, NationalCode
from prg.Person group by NationalCode having count(NationalCode ) > 1
) GroupSelect
JOIN prg.Person dublicateRow ON dublicateRow.NationalCode = GroupSelect.NationalCode
WHERE dublicateRow.Id <> GroupSelect.MinId)
Deleting duplicates from a huge table (several million records) might take a long time. I suggest that you bulk insert the selected rows into a temp table rather than deleting.
--REWRITING YOUR CODE (TAKE NOTE OF THE 3RD LINE)
WITH CTE AS(SELECT NAME,ROW_NUMBER()
OVER (PARTITION BY NAME ORDER BY NAME) ID FROM @TB)
SELECT * INTO #unique_records FROM CTE WHERE ID =1;
This might help in your case
DELETE t1 FROM table t1 INNER JOIN table t2 WHERE t1.id > t2.id AND t1.col1 = t2.col1
With reference to https://support.microsoft.com/en-us/help/139444/how-to-remove-duplicate-rows-from-a-table-in-sql-server
The idea of removing duplicates involves
a) Protecting the rows that are not duplicates
b) Retaining one of the many rows that together qualify as duplicates.
Step-by-step
1) First identify the rows that satisfy the definition of a duplicate and insert them into a temp table, say #tableAll.
2) Select the non-duplicate (single) or distinct rows into a temp table, say #tableUnique.
3) Delete from the source table, joining to #tableAll, to delete the duplicates.
4) Insert into the source table all the rows from #tableUnique.
5) Drop #tableAll and #tableUnique.
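The steps above can be sketched roughly as follows. The table dbo.SourceTable and its columns are hypothetical; here col1 and col2 are assumed to define a duplicate and col3 is an extra payload column.

```sql
BEGIN TRANSACTION;

-- 1) Key values that occur more than once.
SELECT col1, col2
INTO #tableAll
FROM dbo.SourceTable
GROUP BY col1, col2
HAVING COUNT(*) > 1;

-- 2) One distinct copy of each duplicated row.
SELECT DISTINCT s.col1, s.col2, s.col3
INTO #tableUnique
FROM dbo.SourceTable AS s
JOIN #tableAll AS a
  ON a.col1 = s.col1 AND a.col2 = s.col2;

-- 3) Remove every copy of the duplicated rows from the source.
DELETE s
FROM dbo.SourceTable AS s
JOIN #tableAll AS a
  ON a.col1 = s.col1 AND a.col2 = s.col2;

-- 4) Put a single copy back.
INSERT INTO dbo.SourceTable (col1, col2, col3)
SELECT col1, col2, col3
FROM #tableUnique;

COMMIT TRANSACTION;

-- 5) Clean up.
DROP TABLE #tableAll;
DROP TABLE #tableUnique;
```

If rows share col1 and col2 but differ in col3, the DISTINCT in step 2 keeps each variant, so it is worth checking #tableUnique for repeated keys before step 3.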
If you have the ability to add a column to the table temporarily, this was a solution that worked for me:
ALTER TABLE dbo.DUPPEDTABLE ADD RowID INT NOT NULL IDENTITY(1,1)
Then perform a DELETE using a combination of MIN and GROUP BY
DELETE b
FROM dbo.DUPPEDTABLE b
WHERE b.RowID NOT IN (
SELECT MIN(RowID) AS RowID
FROM dbo.DUPPEDTABLE a WITH (NOLOCK)
GROUP BY a.ITEM_NUMBER,
a.CHARACTERISTIC,
a.INTVALUE,
a.FLOATVALUE,
a.STRINGVALUE
);
Verify that the DELETE performed correctly:
SELECT a.ITEM_NUMBER,
a.CHARACTERISTIC,
a.INTVALUE,
a.FLOATVALUE,
a.STRINGVALUE, COUNT(*)--MIN(RowID) AS RowID
FROM dbo.DUPPEDTABLE a WITH (NOLOCK)
GROUP BY a.ITEM_NUMBER,
a.CHARACTERISTIC,
a.INTVALUE,
a.FLOATVALUE,
a.STRINGVALUE
ORDER BY COUNT(*) DESC
The result should have no rows with a count greater than 1. Finally, remove the rowid column:
ALTER TABLE dbo.DUPPEDTABLE DROP COLUMN RowID;
Oh wow, I feel so stupid after reading all these answers; they read like experts' answers, with all the CTEs, temp tables, and so on.
All I did to get it working was aggregate the ID column using MAX.
DELETE FROM table WHERE id IN (
SELECT MAX(id) FROM table GROUP BY col1 HAVING ( COUNT(col1) > 1 )
)
NOTE: you might need to run it multiple times to remove all duplicates, as each run deletes only one duplicate row per group.
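The repeated runs mentioned in the note can be automated with a loop that stops once a pass deletes nothing. This is a sketch with hypothetical names (dbo.MyTable, a unique id column, and col1 as the column that defines a duplicate), not the author's exact statement.

```sql
WHILE 1 = 1
BEGIN
    -- Each pass removes the highest id from every group that still has duplicates.
    DELETE FROM dbo.MyTable
    WHERE id IN (
        SELECT MAX(id)
        FROM dbo.MyTable
        GROUP BY col1
        HAVING COUNT(*) > 1
    );

    -- @@ROWCOUNT holds the number of rows affected by the DELETE just above.
    IF @@ROWCOUNT = 0 BREAK;
END;
```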
Simply add the keyword DISTINCT right after SELECT,
for example:
SELECT DISTINCT ColumnOne, ColumnTwo, ColumnThree
FROM YourTable
Another way of removing duplicated rows in one step without losing information is the following:
delete t1
from dublicated_table t1
join (
select t2.dublicated_field
, min(len(t2.field_kept)) as min_field_kept
from dublicated_table t2 with (nolock)
group by t2.dublicated_field having COUNT(*)>1
) t3
on t1.dublicated_field=t3.dublicated_field
and len(t1.field_kept)=t3.min_field_kept
DECLARE @TB TABLE(NAME VARCHAR(100));
INSERT INTO @TB VALUES ('Red'),('Red'),('Green'),('Blue'),('White'),('White')
--**Delete by Rank**
;WITH CTE AS(SELECT NAME,DENSE_RANK() OVER (PARTITION BY NAME ORDER BY NEWID()) ID FROM @TB)
DELETE FROM CTE WHERE ID>1
SELECT NAME FROM @TB;
--**Delete by Row Number**
;WITH CTE AS(SELECT NAME,ROW_NUMBER() OVER (PARTITION BY NAME ORDER BY NAME) ID FROM @TB)
DELETE FROM CTE WHERE ID>1;
SELECT NAME FROM @TB;
DELETE FROM TBL1 WHERE ID IN
(SELECT ID FROM TBL1 a WHERE ID!=
(select MAX(ID) from TBL1 where DUPVAL=a.DUPVAL
group by DUPVAL
having count(DUPVAL)>1))
DELETE p1 FROM Person p1,
Person p2
WHERE
p1.Email = p2.Email AND p1.Id > p2.Id

Check two table's records are in the same order

I have two recordsets (temp table data) with some columns. I need to check that the records of both tables are in the same order.
I am not checking for differences or common rows between the two recordsets; I need to check that they are in the same order. (Both tables already have their records ordered by some columns, and I need to check that the order of both tables is the same using a GUID column.)
If the GUID matches, I will insert information into some table, and if not, into a log table; in both cases the comparison should move on to the next record.
I am thinking of a nested loop over both temp tables that checks the order by comparing the GUID columns.
Is there any other approach?
Your question is not very clear. Just some hard facts:
There is no implicit order! You can fill your table in a given order and the next SELECT might return the data in exactly this order, but this is not guaranteed! You should never rely on a sort order! There is none!
The only guaranteed way to enforce a sort order is to use ORDER BY on the outermost query.
One special case might be the use of ranking functions like ROW_NUMBER(), but this is too broad to discuss here.
If I get you correctly, you need to check, for rows existing on both sides, whether they appear in the same order. Try this:
DECLARE @t1 TABLE(YourGuid UNIQUEIDENTIFIER, Descr VARCHAR(100),SomeSortableColumn DATETIME);
INSERT INTO @t1 VALUES('653E6A93-3EBA-4D5E-A8F3-C36462A55FEF','Row 1',{d'2018-01-01'})
,('5461F417-1D14-4CFE-822D-3F028492F839','Row 2',{d'2018-01-02'})
,('E9BDE8C6-237A-49F6-88BD-9EB211FB12F2','Row 3',{d'2018-01-03'})
,('64343D33-8AD2-475F-AC27-66A6BFD011C9','Row 4',{d'2018-01-04'})
,('5778229D-B20E-41FC-9A2E-8694B204E4D3','Row 5',{d'2018-01-05'})
,('9AC0BB10-0F70-488C-A249-45A3C688D877','Row 6',{d'2018-01-06'})
,('330526D6-B931-4CEA-BB03-30F3783E6284','Row 7',{d'2018-01-07'})
,('6F68F260-2F64-4C78-9DA5-20E0FF22B4A1','Row 8',{d'2018-01-08'})
,('E09090F1-FC85-41EE-819B-8275A22BD075','Row 9',{d'2018-01-09'});
DECLARE @t2 TABLE(YourGuid UNIQUEIDENTIFIER, Descr VARCHAR(100),SomeSortableColumn DATETIME);
INSERT INTO @t2 VALUES('653E6A93-3EBA-4D5E-A8F3-C36462A55FEF','Row 1',{d'2018-01-01'})
,('5461F417-1D14-4CFE-822D-3F028492F839','Row 2',{d'2018-01-02'})
--missing in 2: 3 & 4
,('5778229D-B20E-41FC-9A2E-8694B204E4D3','Row 5',{d'2018-01-05'})
--other GUID
,(NEWID(),'Row 6',{d'2018-01-06'})
,('330526D6-B931-4CEA-BB03-30F3783E6284','Row 7',{d'2018-01-07'})
--other date
,('6F68F260-2F64-4C78-9DA5-20E0FF22B4A1','Row 8',{d'2018-01-01'})
,('E09090F1-FC85-41EE-819B-8275A22BD075','Row 9',{d'2018-01-09'})
--missing in 1
,(NEWID(),'Other row',{d'2018-01-03'})
;
--This query uses an INNER JOIN on the GUID column to omit rows that do not exist in both sets. It calls ROW_NUMBER() twice, each time sorted by the same column but taken from a different source. The result shows the rows where these indexes differ.
WITH ColumnsToCompare AS
(
SELECT t1.YourGuid
,t1.Descr AS Descr1
,t2.Descr AS Descr2
,t1.SomeSortableColumn AS Sort1
,t2.SomeSortableColumn AS Sort2
,ROW_NUMBER() OVER(ORDER BY t1.SomeSortableColumn) AS Index1
,ROW_NUMBER() OVER(ORDER BY t2.SomeSortableColumn) AS Index2
FROM @t1 AS t1
INNER JOIN @t2 AS t2 ON t1.YourGuid =t2.YourGuid
)
SELECT *
FROM ColumnsToCompare
WHERE Index1<>Index2
I am not sure I understood your question correctly, but here is what I think you should do.
Quoting from your question:
"If Guid matches then I will insert information in some table and if not then into log table, but it should move to/compare next record in both cases."
You need two INSERT statements.
In the first, do an inner join on the GUIDs and insert the result into sometable.
In the second, do a left join, keep only the rows where the right side is NULL, and insert the result into the log table.
insert into sometable
select t1.* -- only Table1's columns; select * would return the columns of both tables
from Table1 t1
inner join Table2 t2 on T1.GUID = T2.GUID
insert into logtable
select t1.*
from Table1 t1
left join Table2 t2 on T1.GUID = T2.GUID
where t2.guid is null
You can generate row numbers (without ordering) and check for the GUIDs that have different row numbers. Note that ORDER BY (SELECT NULL) does not guarantee any particular numbering, so this only reflects the order in which the engine happens to return the rows.
declare @table1 table(id varchar(MAX))
declare @table2 table(id VARCHAR(MAX))
insert into @table1
select '653E6A93'
union all
select '5461F417'
union all
select '330526D6'
insert into @table2
select '653E6A93'
union all
select '330526D6'
union all
select '5461F417'
;with cte1
AS
(
select *, ROW_NUMBER() OVER (ORDER BY (SELECT null)) AS rn from @table1
)
,
cte2
AS
(
select *, row_number() OVER(order by (SELECT NULL)) rn from @table2
)
select c1.id from cte1 c1
JOIN cte2 c2 on c1.id=c2.id and c1.rn<>c2.rn

How to delete duplicate rows in SQL Server?

How can I delete duplicate rows where no unique row id exists?
My table is
col1 col2 col3 col4 col5 col6 col7
john 1 1 1 1 1 1
john 1 1 1 1 1 1
sally 2 2 2 2 2 2
sally 2 2 2 2 2 2
I want to be left with the following after the duplicate removal:
john 1 1 1 1 1 1
sally 2 2 2 2 2 2
I've tried a few queries but I think they depend on having a row id as I don't get the desired result. For example:
DELETE
FROM table
WHERE col1 IN (
SELECT id
FROM table
GROUP BY id
HAVING (COUNT(col1) > 1)
)
I like CTEs and ROW_NUMBER as the two combined allow us to see which rows are deleted (or updated), therefore just change the DELETE FROM CTE... to SELECT * FROM CTE:
WITH CTE AS(
SELECT [col1], [col2], [col3], [col4], [col5], [col6], [col7],
RN = ROW_NUMBER()OVER(PARTITION BY col1 ORDER BY col1)
FROM dbo.Table1
)
DELETE FROM CTE WHERE RN > 1
DEMO (result is different; I assume that it's due to a typo on your part)
COL1 COL2 COL3 COL4 COL5 COL6 COL7
john 1 1 1 1 1 1
sally 2 2 2 2 2 2
This example determines duplicates by a single column col1 because of the PARTITION BY col1. If you want to include multiple columns simply add them to the PARTITION BY:
ROW_NUMBER()OVER(PARTITION BY Col1, Col2, ... ORDER BY OrderColumn)
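To preview which rows a multi-column version would remove before running the DELETE, the same CTE can feed a SELECT. This is a minimal sketch, assuming the question's dbo.Table1 with columns col1 to col7 and treating rows as duplicates only when all seven columns match:

```sql
-- Preview only: lists the rows the DELETE would remove and changes nothing.
WITH CTE AS (
    SELECT col1, col2, col3, col4, col5, col6, col7,
           RN = ROW_NUMBER() OVER (
                    PARTITION BY col1, col2, col3, col4, col5, col6, col7
                    -- rows within a partition are identical here,
                    -- so the numbering order does not matter
                    ORDER BY (SELECT NULL))
    FROM dbo.Table1
)
SELECT *
FROM CTE
WHERE RN > 1;
```

Once the preview looks right, replacing the final SELECT * FROM CTE with DELETE FROM CTE removes exactly the listed rows.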
I prefer a CTE for deleting duplicate rows from a SQL Server table.
I strongly recommend this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
Keeping the original:
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
Without keeping the original:
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
Without a CTE and ROW_NUMBER(), you can delete the duplicates by using GROUP BY with the MAX function, provided the table has a unique ID column. Here is an example:
DELETE
FROM MyDuplicateTable
WHERE ID NOT IN
(
SELECT MAX(ID)
FROM MyDuplicateTable
GROUP BY DuplicateColumn1, DuplicateColumn2, DuplicateColumn3)
If you have no references, like foreign keys, you can do this. I do it a lot when testing proofs of concept and the test data gets duplicated.
SELECT DISTINCT [col1],[col2],[col3],[col4],[col5],[col6],[col7]
INTO [newTable]
FROM [oldTable]
Go into the object explorer and delete the old table.
Rename the new table with the old table's name.
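The drop and rename steps can also be scripted instead of clicked through in Object Explorer. This is a minimal sketch that reuses the placeholder names newTable and oldTable from above. Keep in mind that SELECT ... INTO copies the column definitions and data but not indexes, constraints, or triggers, so those would have to be recreated on the new table:

```sql
BEGIN TRANSACTION;

-- Copy one instance of every distinct row into a new table
SELECT DISTINCT [col1],[col2],[col3],[col4],[col5],[col6],[col7]
INTO [newTable]
FROM [oldTable];

-- Replace the old table with the de-duplicated copy
DROP TABLE [oldTable];
EXEC sp_rename 'newTable', 'oldTable';

COMMIT TRANSACTION;
```

Wrapping the steps in a transaction is a design choice of this sketch, so that a failure partway through does not leave the database without oldTable.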
Remove all duplicates, but the very first ones (with min ID)
should work equally in other SQL servers, like Postgres:
DELETE FROM table
WHERE id NOT IN (
select min(id) from table
group by col1, col2, col3, col4, col5, col6, col7
)
DELETE from search
where id not in (
select min(id) from search
group by url
having count(*)=1
union
SELECT min(id) FROM search
group by url
having count(*) > 1
)
There are two solutions in mysql:
A) Delete duplicate rows using DELETE JOIN statement
DELETE t1 FROM contacts t1
INNER JOIN contacts t2
WHERE
t1.id < t2.id AND
t1.email = t2.email;
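This query references the contacts table twice, so it uses the table aliases t1 and t2. For each email it deletes every row that has a lower id than another row with the same email, so the row with the highest id is kept.
The output is:
Query OK, 4 rows affected (0.10 sec)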
In case you want to delete duplicate rows and keep the lowest id, you can use the following statement:
DELETE c1 FROM contacts c1
INNER JOIN contacts c2
WHERE
c1.id > c2.id AND
c1.email = c2.email;
B) Delete duplicate rows using an intermediate table
The following shows the steps for removing duplicate rows using an intermediate table:
1. Create a new table with the same structure as the original table whose duplicate rows you want to delete.
2. Insert distinct rows from the original table into the intermediate table.
3. Drop the original table and rename the intermediate table to the original table's name.
Step 1. Create a new table whose structure is the same as the original table:
CREATE TABLE source_copy LIKE source;
Step 2. Insert distinct rows from the original table to the new table:
INSERT INTO source_copy
SELECT * FROM source
GROUP BY col; -- column that has duplicate values
Step 3. Drop the original table and rename the intermediate table to the original one:
DROP TABLE source;
ALTER TABLE source_copy RENAME TO source;
Source: http://www.mysqltutorial.org/mysql-delete-duplicate-rows/
Please see the following way of deleting, too.
Declare @table table
(col1 varchar(10),col2 int,col3 int, col4 int, col5 int, col6 int, col7 int)
Insert into @table values
('john',1,1,1,1,1,1),
('john',1,1,1,1,1,1),
('sally',2,2,2,2,2,2),
('sally',2,2,2,2,2,2)
This creates a sample table variable named @table and loads it with the given data.
Delete aliasName from (
Select *,
ROW_NUMBER() over (Partition by col1,col2,col3,col4,col5,col6,col7 order by col1) as rowNumber
From @table) aliasName
Where rowNumber > 1
Select * from @table
Note: If you list all columns in the PARTITION BY clause, the ORDER BY has little significance.
I know the question was asked three years ago and my answer is another version of what Tim has posted, but I am posting it just in case it is helpful for anyone.
This can be done in many ways in SQL Server.
The simplest way is:
Insert the distinct rows from the table that has duplicates into a new temporary table. Then delete all the data from the original table and insert the rows from the temporary table, which has no duplicates, as shown below.
select distinct * into #tmp From table
delete from table
insert into table
select * from #tmp
drop table #tmp
select * from table
Delete duplicate rows using Common Table Expression(CTE)
With CTE_Duplicates as
(select id,name , row_number()
over(partition by id,name order by id,name ) rownumber from table )
delete from CTE_Duplicates where rownumber!=1
To delete the duplicate rows from the table in SQL Server, you follow these steps:
Find duplicate rows using GROUP BY clause or ROW_NUMBER() function.
Use DELETE statement to remove the duplicate rows.
Setting up a sample table
DROP TABLE IF EXISTS contacts;
CREATE TABLE contacts(
contact_id INT IDENTITY(1,1) PRIMARY KEY,
first_name NVARCHAR(100) NOT NULL,
last_name NVARCHAR(100) NOT NULL,
email NVARCHAR(255) NOT NULL,
);
Insert values
INSERT INTO contacts
(first_name,last_name,email)
VALUES
('Syed','Abbas','syed.abbas@example.com'),
('Catherine','Abel','catherine.abel@example.com'),
('Kim','Abercrombie','kim.abercrombie@example.com'),
('Kim','Abercrombie','kim.abercrombie@example.com'),
('Kim','Abercrombie','kim.abercrombie@example.com'),
('Hazem','Abolrous','hazem.abolrous@example.com'),
('Hazem','Abolrous','hazem.abolrous@example.com'),
('Humberto','Acevedo','humberto.acevedo@example.com'),
('Humberto','Acevedo','humberto.acevedo@example.com'),
('Pilar','Ackerman','pilar.ackerman@example.com');
Query
SELECT
contact_id,
first_name,
last_name,
email
FROM
contacts;
Delete duplicate rows from a table
WITH cte AS (
SELECT
contact_id,
first_name,
last_name,
email,
ROW_NUMBER() OVER (
PARTITION BY
first_name,
last_name,
email
ORDER BY
first_name,
last_name,
email
) row_num
FROM
contacts
)
DELETE FROM cte
WHERE row_num > 1;
Should delete the record now
Try to Use:
SELECT linkorder
,Row_Number() OVER (
PARTITION BY linkorder ORDER BY linkorder DESC
) AS RowNum
FROM u_links
Microsoft has a very neat guide on how to remove duplicates. Check out http://support.microsoft.com/kb/139444
In brief, here is the easiest way to delete duplicates when you have just a few rows to delete:
SET rowcount 1;
DELETE FROM t1 WHERE myprimarykey=1;
SET rowcount 0; -- reset, otherwise later statements in the session stay limited
myprimarykey is the identifier for the row.
I set rowcount to 1 because I only had two rows that were duplicated. If I had had three duplicated rows, I would have set rowcount to 2 so that it deletes the first two that it sees and leaves only one in table t1.
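Microsoft's documentation notes that SET ROWCOUNT is slated to stop affecting DELETE, INSERT, and UPDATE statements in a future release and suggests the TOP clause instead. This is a minimal sketch of the same idea with TOP, assuming the table t1 and column myprimarykey from above and a value that occurs three times:

```sql
-- Three rows share myprimarykey = 1; remove two of them and keep one.
-- TOP without an ordering picks arbitrary rows, which is acceptable
-- here only because the duplicates are identical.
DELETE TOP (2)
FROM t1
WHERE myprimarykey = 1;
```

Unlike SET ROWCOUNT, the limit applies to this one statement only, so nothing has to be reset afterwards.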
with myCTE
as
(
select productName,ROW_NUMBER() over(PARTITION BY productName order by slno) as Duplicate from productDetails
)
Delete from myCTE where Duplicate>1
The solutions suggested above work for small and medium tables.
For very large tables I can suggest the following solution, since it runs in iterations.
Drop all views that depend on LargeSourceTable.
You can find the dependencies in SQL Server Management Studio: right-click the table and click "View Dependencies".
Rename the table:
EXEC sp_rename 'LargeSourceTable', 'LargeSourceTable_Temp';
GO
Create LargeSourceTable again, but now add a primary key on all the columns that define a duplicate, and add WITH (IGNORE_DUP_KEY = ON).
For example:
CREATE TABLE [dbo].[LargeSourceTable]
(
ID int IDENTITY(1,1),
[CreateDate] DATETIME CONSTRAINT [DF_LargeSourceTable_CreateDate] DEFAULT (getdate()) NOT NULL,
[Column1] CHAR (36) NOT NULL,
[Column2] NVARCHAR (100) NOT NULL,
[Column3] CHAR (36) NOT NULL,
PRIMARY KEY (Column1, Column2) WITH (IGNORE_DUP_KEY = ON)
);
GO
Recreate the views that you dropped in the first step so that they point at the newly created table.
Now run the following SQL script. It reports progress after each page of 1,000,000 rows; you can lower the number of rows per page to see progress more often.
Note that I set IDENTITY_INSERT on and off because one of the columns is an auto-incrementing ID, which I am also copying.
SET IDENTITY_INSERT LargeSourceTable ON
DECLARE @PageNumber AS INT, @RowspPage AS INT
DECLARE @TotalRows AS INT
declare @dt varchar(19)
SET @PageNumber = 0
SET @RowspPage = 1000000
select @TotalRows = count (*) from LargeSourceTable_TEMP
-- pages are numbered from 0, so stop once the offset reaches the row count
While (@PageNumber * @RowspPage < @TotalRows )
Begin
begin transaction tran_inner
; with cte as
(
SELECT * FROM LargeSourceTable_TEMP ORDER BY ID
OFFSET ((@PageNumber) * @RowspPage) ROWS
FETCH NEXT @RowspPage ROWS ONLY
)
INSERT INTO LargeSourceTable
(
ID
,[CreateDate]
,[Column1]
,[Column2]
,[Column3]
)
select
ID
,[CreateDate]
,[Column1]
,[Column2]
,[Column3]
from cte
commit transaction tran_inner
PRINT 'Page: ' + convert(varchar(10), @PageNumber)
PRINT 'Transferred: ' + convert(varchar(20), @PageNumber * @RowspPage)
PRINT 'Of: ' + convert(varchar(20), @TotalRows)
SELECT @dt = convert(varchar(19), getdate(), 121)
RAISERROR('Inserted on: %s', 0, 1, @dt) WITH NOWAIT
SET @PageNumber = @PageNumber + 1
End
SET IDENTITY_INSERT LargeSourceTable OFF
-- this query will keep only one instance of a duplicate record.
;WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2, col3-- based on what? --can be multiple columns
ORDER BY ( SELECT 0)) RN
FROM Mytable)
delete FROM cte
WHERE RN > 1
You need to group by the duplicate records according to the field(s), then hold one of the records and delete the rest.
For example:
DELETE prg.Person WHERE Id IN (
SELECT dublicateRow.Id FROM
(
select MIN(Id) MinId, NationalCode
from prg.Person group by NationalCode having count(NationalCode ) > 1
) GroupSelect
JOIN prg.Person dublicateRow ON dublicateRow.NationalCode = GroupSelect.NationalCode
WHERE dublicateRow.Id <> GroupSelect.MinId)
Deleting duplicates from a huge table (several million records) might take a long time. I suggest that you bulk insert the selected rows into a temp table rather than deleting.
-- Rewriting your code (take note of the third line)
WITH CTE AS(SELECT NAME,ROW_NUMBER() OVER (PARTITION BY NAME ORDER BY NAME) ID FROM @TB)
SELECT * INTO #unique_records FROM CTE WHERE ID =1;
This might help in your case:
DELETE t1 FROM table t1 INNER JOIN table t2 ON t1.col1 = t2.col1 WHERE t1.id > t2.id
With reference to https://support.microsoft.com/en-us/help/139444/how-to-remove-duplicate-rows-from-a-table-in-sql-server
The idea of removing duplicates involves:
a) Protecting the rows that are not duplicates.
b) Retaining one of the rows that together qualify as duplicates.
Step by step:
1) Identify the rows that satisfy the definition of a duplicate and insert them into a temp table, say #tableAll.
2) Select the non-duplicate (single) rows, that is, the distinct rows, into a temp table, say #tableUnique.
3) Delete from the source table, joining to #tableAll, to remove the duplicates.
4) Insert all the rows from #tableUnique into the source table.
5) Drop #tableAll and #tableUnique.
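The steps above can be sketched as follows. This is only an illustration under stated assumptions: dbo.SourceTable is a hypothetical table whose only columns are col1 and col2, both NOT NULL, with no identity column, and two rows count as duplicates when both columns match.

```sql
BEGIN TRANSACTION;

-- 1) Key values that occur more than once
SELECT col1, col2
INTO #tableAll
FROM dbo.SourceTable
GROUP BY col1, col2
HAVING COUNT(*) > 1;

-- 2) One copy of each duplicated row
SELECT DISTINCT s.*
INTO #tableUnique
FROM dbo.SourceTable AS s
JOIN #tableAll AS a
  ON a.col1 = s.col1 AND a.col2 = s.col2;

-- 3) Remove every copy of the duplicated rows from the source;
--    rows that were never duplicated do not join and stay untouched
DELETE s
FROM dbo.SourceTable AS s
JOIN #tableAll AS a
  ON a.col1 = s.col1 AND a.col2 = s.col2;

-- 4) Put the single copies back
INSERT INTO dbo.SourceTable
SELECT * FROM #tableUnique;

COMMIT TRANSACTION;

-- 5) Clean up
DROP TABLE #tableAll;
DROP TABLE #tableUnique;
```

If the key columns could contain NULLs, the equality joins would not match those rows, so the sketch would need a different comparison.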
If you have the ability to add a column to the table temporarily, this was a solution that worked for me:
ALTER TABLE dbo.DUPPEDTABLE ADD RowID INT NOT NULL IDENTITY(1,1)
Then perform a DELETE using a combination of MIN and GROUP BY
DELETE b
FROM dbo.DUPPEDTABLE b
WHERE b.RowID NOT IN (
SELECT MIN(RowID) AS RowID
FROM dbo.DUPPEDTABLE a WITH (NOLOCK)
GROUP BY a.ITEM_NUMBER,
a.CHARACTERISTIC,
a.INTVALUE,
a.FLOATVALUE,
a.STRINGVALUE
);
Verify that the DELETE performed correctly:
SELECT a.ITEM_NUMBER,
a.CHARACTERISTIC,
a.INTVALUE,
a.FLOATVALUE,
a.STRINGVALUE, COUNT(*)--MIN(RowID) AS RowID
FROM dbo.DUPPEDTABLE a WITH (NOLOCK)
GROUP BY a.ITEM_NUMBER,
a.CHARACTERISTIC,
a.INTVALUE,
a.FLOATVALUE,
a.STRINGVALUE
ORDER BY COUNT(*) DESC
The result should have no rows with a count greater than 1. Finally, remove the rowid column:
ALTER TABLE dbo.DUPPEDTABLE DROP COLUMN RowID;
Oh wow, I feel so stupid reading all these answers; they read like experts' answers, with all the CTEs and temp tables and so on.
All I did to get it working was to aggregate the ID column using MAX.
DELETE FROM table WHERE id IN (
SELECT MAX(id) FROM table GROUP BY col1 HAVING ( COUNT(*) > 1 )
)
NOTE: you might need to run it multiple times, as each run deletes only one row per set of duplicates.
Simply add the keyword DISTINCT right after SELECT,
for example:
SELECT DISTINCT ColumnOne, ColumnTwo, ColumnThree
FROM YourTable

Delete records which are considered duplicates based on same value on a column and keep the newest

I would like to delete records which are considered duplicates based on them having the same value in a certain column and keep one which is considered the newest based on InsertedDate in my example below. I would like a solution which doesn't use a cursor but is set based. Goal: delete all duplicates and keep the newest.
The ddl below creates some duplicates. The records which need to be deleted are: John1 & John2 because they have the same ID as John3 and John3 is the newest record.
Also record John5 needs to be deleted because there's another record with ID = 3 and is newer (John6).
Create table dbo.TestTable (ID int, InsertedDate DateTime, Name varchar(50))
Insert into dbo.TestTable Select 1, '07/01/2009', 'John1'
Insert into dbo.TestTable Select 1, '07/02/2009', 'John2'
Insert into dbo.TestTable Select 1, '07/03/2009', 'John3'
Insert into dbo.TestTable Select 2, '07/03/2009', 'John4'
Insert into dbo.TestTable Select 3, '07/05/2009', 'John5'
Insert into dbo.TestTable Select 3, '07/06/2009', 'John6'
Just as an academic exercise:
with cte as (
select *, row_number() over (partition by ID order by InsertedDate desc) as rn
from TestTable)
delete from cte
where rn <> 1;
Most of the time the solution proposed by Sam performs much better.
This works:
delete t
from TestTable t
left join
(
select id, InsertedDate = max(InsertedDate) from TestTable
group by id
) as sub on sub.id = t.id and sub.InsertedDate = t.InsertedDate
where sub.id is null
If you have to deal with ties it gets a tiny bit trickier.
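One way to handle ties, sketched under the assumption that Name can serve as a tie-breaker when two rows share both ID and InsertedDate (that choice is illustrative and not from the question): ROW_NUMBER gives exactly one row per ID the number 1, whereas the MAX(InsertedDate) join above would keep every row tied for the newest date.

```sql
WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY ID
               -- newest first; Name breaks ties (assumed tie-breaker)
               ORDER BY InsertedDate DESC, Name DESC
           ) AS rn
    FROM dbo.TestTable
)
DELETE FROM cte
WHERE rn > 1;
```

Any column or combination of columns that is unique within an ID would work as the tie-breaker in place of Name.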