How to delete duplicate rows that are exactly the same in SQL Server [duplicate]

I loaded some data into a SQL Server table from a .CSV file for test purposes. The table has no primary key, unique key, or auto-generated ID.
Below is an example of the situation:
select *
from people
where name in (select name
from people
group by name
having count(name) > 1)
When I run this query, I get these results:
The goal is to keep one row and remove other duplicate rows.
Is there any way to do this other than saving the content somewhere else, deleting all duplicate rows, and inserting one row back?
Thanks for helping!

You could use an updatable CTE for this.
If you want to delete rows that are exact duplicates on the three columns (as shown in your sample data and explained in the question):
with cte as (
select row_number() over(partition by name, age, gender order by (select null)) rn
from people
)
delete from cte where rn > 1
If you want to delete duplicates on name only (as shown in your existing query):
with cte as (
select row_number() over(partition by name order by (select null)) rn
from people
)
delete from cte where rn > 1
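For readers who want to try the pattern without a SQL Server instance, here is a minimal sketch in Python using the standard sqlite3 module. It is an adaptation, not the statement above: SQLite cannot delete through a CTE, so the sketch deletes by SQLite's implicit rowid instead. The age and gender columns come from the query above; the sample rows are invented, and window functions are assumed to be available (SQLite 3.25 or newer).

```python
import sqlite3


def delete_exact_duplicates(conn):
    """Keep one row per (name, age, gender) group and delete the rest."""
    conn.execute(
        """
        DELETE FROM people
        WHERE rowid IN (
            SELECT rid FROM (
                -- Number the rows inside each group of identical values.
                SELECT rowid AS rid,
                       ROW_NUMBER() OVER (PARTITION BY name, age, gender) AS rn
                FROM people
            )
            WHERE rn > 1
        )
        """
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, gender TEXT)")
# Hypothetical sample data: three copies of one row, two of another.
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Ann", 30, "F")] * 3 + [("Bob", 41, "M")] * 2 + [("Ann", 25, "F")],
)
delete_exact_duplicates(conn)
remaining = sorted(conn.execute("SELECT name, age, gender FROM people").fetchall())
```

Because the rows in a group are identical, it does not matter which one survives, which is why no ORDER BY is needed inside the window.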

How are you defining "duplicate"? Based on your code example, it appears to be by name.
For the deletion, you can use an updatable CTE with row_number():
with todelete as (
select p.*,
row_number() over (partition by name order by (select null)) as seqnum
from people p
)
delete from todelete
where seqnum > 1;
If more columns define the duplicate, then adjust the partition by clause.

Related

How to delete duplicate records (SQL) when there is no identification column?

It is a datawarehouse project in which I load a table in which each column refers to another table. The problem is that due to an error in the process, many duplicate records were loaded (approximately 13,000) but they do not have a unique identifier, therefore they are exactly the same. Is there a way to delete only one of the duplicate records so that I don't have to delete everything and repeat the table loading process?
You can use row_number() and a cte:
with cte as (
select row_number() over(
partition by col1, col2, ...
order by (select null)) rn
from mytable
)
delete from cte where rn > 1
The window function guarantees that the same number will not be assigned twice within a partition. You need to enumerate all columns in the partition by clause.
If you are going to delete a significant part of the rows, then it might be simpler to empty and recreate the table:
create table tmptable as select distinct * from mytable;
truncate table mytable; -- back it up first!
insert into mytable select * from tmptable;
drop table tmptable;
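The rebuild approach can also be tried out in a small, self-contained form. The following is only a sketch in Python with the standard sqlite3 module: SQLite has no TRUNCATE, so an unqualified DELETE stands in for it, and the two-column mytable with its sample rows is invented for illustration.

```python
import sqlite3


def rebuild_without_duplicates(conn):
    """Copy the distinct rows aside, empty the table, and copy them back."""
    with conn:  # commit on success, roll back if any step fails
        conn.execute("CREATE TEMP TABLE tmptable AS SELECT DISTINCT * FROM mytable")
        conn.execute("DELETE FROM mytable")  # stand-in for TRUNCATE TABLE
        conn.execute("INSERT INTO mytable SELECT * FROM tmptable")
        conn.execute("DROP TABLE tmptable")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (col1 INTEGER, col2 TEXT)")
# Hypothetical sample data with repeated rows.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "a")] * 3 + [(2, "b")] + [(1, "b")] * 2,
)
rebuild_without_duplicates(conn)
rows = sorted(conn.execute("SELECT col1, col2 FROM mytable").fetchall())
```

Wrapping the steps in one transaction means a failure halfway through does not leave the table empty.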
You can use row_number to delete the duplicate rows by first partitioning them and then ordering by one of the columns within that partition.
You have to list all your columns in partition by if the records are completely identical.
WITH CTE1 AS (
SELECT A.*
, ROW_NUMBER() OVER (PARTITION BY CODDIMALUMNO, (OTHER COLUMNS) ORDER BY CODDIMALUMNO) RN
FROM TABLE1 A
)
DELETE FROM CTE1
WHERE RN > 1;
You can use row_number() and an updatable CTE:
with todelete as (
select t.*, row_number() over (partition by . . . order by (select null)) as seqnum
from t
)
delete from todelete
where seqnum > 1;
The . . . is for the columns that define duplicates.

Delete registers with rownumber greater than specified for each group got from sql

I have a table of people with their comments on a blog. I need to keep the last 10 comments for each person in the table and delete the older ones. Let's say the columns are:
personId
commentId
dateFromComment
I know how to do it with several queries, but not with a single query (subqueries are allowed) that works on any database.
With:
select personId from PeopleComments
group by personId
having count(*) >10
I would get the IDs of the people who have more than 10 comments, but I don't know how to get the comment IDs from there and delete them.
Thanks!
In my other answer the DBMS must find and count rows for every row in the table. This can be slow. It would be better to find all rows we want to keep once and then delete the others. Hence this additional answer.
The following works for Oracle as of version 12c:
delete from peoplecomments
where rowid not in
(
select rowid
from peoplecomments
order by row_number() over (partition by personid order by datefromcomment desc)
fetch first 10 rows with ties
);
Apart from ROWID this is standard SQL.
In other DBMS that support window functions and FETCH WITH TIES:
If your table has a single-column primary key, you'd replace ROWID with it.
If your table has a composite primary key, you'd use where (col1, col2) not in (select col1, col2 ...) provided your DBMS supports this syntax.
You need a correlated subquery counting the following comments:
delete from peoplecomments pc
where
(
select count(*)
from peoplecomments pc2
where pc2.personid = pc.personid
and pc2.datefromcomment > pc.datefromcomment
) >= 10; -- at least 10 newer comments for the person
BTW: While it seems we could simply number our rows and delete accordingly via
delete from
(
select
pc.*, row_number() over (partition by personid order by datefromcomment desc) as rn
from peoplecomments pc
)
where rn > 10;
Oracle doesn't allow this and gives us ORA-01732: data manipulation operation not legal on this view.
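Where deleting from a numbered subquery is not allowed, the row numbers can still drive the delete through a key lookup. The sketch below shows this in Python with the standard sqlite3 module rather than Oracle. It assumes commentId uniquely identifies a comment (the question does not say so explicitly) and uses invented sample data; the keep parameter is a hypothetical convenience.

```python
import sqlite3


def trim_comments(conn, keep=10):
    """Delete every comment that is not among the newest `keep` per person."""
    conn.execute(
        """
        DELETE FROM PeopleComments
        WHERE commentId IN (
            SELECT commentId FROM (
                SELECT commentId,
                       ROW_NUMBER() OVER (
                           PARTITION BY personId
                           ORDER BY dateFromComment DESC
                       ) AS rn
                FROM PeopleComments
            )
            WHERE rn > ?
        )
        """,
        (keep,),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PeopleComments ("
    "personId INTEGER, commentId INTEGER PRIMARY KEY, dateFromComment TEXT)"
)
# Hypothetical data: person 1 has 12 comments, person 2 has 3.
rows = [(1, i, f"2024-01-{i:02d}") for i in range(1, 13)]
rows += [(2, 12 + i, f"2024-02-{i:02d}") for i in range(1, 4)]
conn.executemany("INSERT INTO PeopleComments VALUES (?, ?, ?)", rows)

trim_comments(conn)
counts = dict(
    conn.execute("SELECT personId, COUNT(*) FROM PeopleComments GROUP BY personId")
)
oldest_kept = conn.execute(
    "SELECT MIN(commentId) FROM PeopleComments WHERE personId = 1"
).fetchone()[0]
```

Unlike FETCH ... WITH TIES, ROW_NUMBER breaks ties arbitrarily, so exactly `keep` rows survive per person even when several comments share a timestamp.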

Deleting rows where the Primary key is duplicated - SQL

My issue is how do we delete a primary key row in case it is duplicated. The other fields may/may not be duplicates. I am interested only in the primary key being duplicated and would like to retain the first instance while deleting the other duplicate entries.
For example,
I have 2 tables with the following data:
Table1:- Portfolio
Columns:- PortfolioID(PK), PortfolioName
Sample data :-
1, North America
2, Europe
3, Asia
Table2:- Account
Columns:- AccountID(PK), PortfolioID(FK), AccountName
Sample data :-
1,1,Quake
1,1,Wind
2,1,Fire
3,1,Quake
4,2,Flood
5,2,Wind
Let's say for PortfolioID = 1,
I am trying to delete row number 2 from the Account table where the AccountID 1 is repeated for PortfolioID =1. I have tried using the CTE expression where I use the ROW_NUMBER statement and try to delete ROWNUMBER <> 1. But this query doesn't work as it deletes all the rows in the table.
The query I tried:
WITH CTE AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY [Account].[AccountID] ORDER BY [Account].[AccountID]) AS [ROWNUMBER],
[Account].[AccountID]
FROM [Account]
INNER JOIN [Portfolio] ON [Portfolio].[PortfolioID] = [Account].[PortfolioID]
WHERE [Portfolio].[PortfolioID] = 1
)
DELETE [Account]
FROM [CTE]
WHERE [ROWNUMBER] <> 1
Am I doing something wrong in the query? Thanks in advance for the help.
Firstly, if you define the AccountID column as the primary key in your database, it will prevent this kind of problem going forward.
Secondly, are you using SQL Server? Which version?
Assuming you are using SQL Server and a recent version that supports window functions, you can try something like this to delete any duplicates that you have.
This will delete ALL copies of ALL duplicates:
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY AccountID,PortfolioID)
FROM Account)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
This alternative script will keep one of the duplicates if that is what you prefer:
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY AccountID,PortfolioID ORDER BY AccountID,PortfolioID) AS RN
FROM Account
)
DELETE FROM CTE WHERE RN<>1
Finally, if you want to only delete duplicates for Portfolio Id 1:
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY AccountID,PortfolioID ORDER BY AccountID,PortfolioID) AS RN
FROM Account
Where PortfolioID = 1
)
DELETE FROM CTE WHERE RN<>1
A primary key column never allows duplicate entries.
Try the query below for the desired result based on the given data.
;WITH CTE AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY a.[AccountID],a.PortfolioID ORDER BY a.[AccountID]) AS [ROWNUMBER],*
FROM [Account] a
WHERE a.[PortfolioID] = 1
)
DELETE
FROM [CTE]
WHERE [ROWNUMBER] > 1
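To check the behaviour against the sample data from the question, here is a small sketch in Python with the standard sqlite3 module. It is an approximation of the T-SQL above, not a replacement: SQLite cannot delete through a CTE, so the rows are removed by rowid, and "first instance" is assumed to mean the row inserted first.

```python
import sqlite3


def delete_duplicate_accounts(conn, portfolio_id):
    """Within one portfolio, keep the first row per AccountID and delete the rest."""
    conn.execute(
        """
        DELETE FROM Account
        WHERE rowid IN (
            SELECT rid FROM (
                SELECT rowid AS rid,
                       ROW_NUMBER() OVER (
                           PARTITION BY AccountID, PortfolioID
                           ORDER BY rowid  -- assumption: insertion order defines "first"
                       ) AS rn
                FROM Account
                WHERE PortfolioID = ?
            )
            WHERE rn > 1
        )
        """,
        (portfolio_id,),
    )


conn = sqlite3.connect(":memory:")
# No primary key constraint, otherwise the duplicate could not exist.
conn.execute(
    "CREATE TABLE Account (AccountID INTEGER, PortfolioID INTEGER, AccountName TEXT)"
)
conn.executemany(
    "INSERT INTO Account VALUES (?, ?, ?)",
    [
        (1, 1, "Quake"),
        (1, 1, "Wind"),
        (2, 1, "Fire"),
        (3, 1, "Quake"),
        (4, 2, "Flood"),
        (5, 2, "Wind"),
    ],
)
delete_duplicate_accounts(conn, 1)
accounts = conn.execute(
    "SELECT AccountID, PortfolioID, AccountName FROM Account ORDER BY AccountID"
).fetchall()
```

Only the second row with AccountID 1 is removed; rows in other portfolios are never considered because the filter sits inside the numbered subquery.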

Delete IDs that repeat more then once, but leave the first occurrence [duplicate]

I have a table and need to delete the entire row where an ID occurs a second or subsequent time, but leave the first occurrence of such CustomerID. By the way, my table has ID, which is a primary key, and CustomerID, which is duplicated. So I need to remove all rows with a duplicated CustomerID.
Delete From Table1 where ID IN (select ID From Table1 where count(distinct CustomerID) >=2 group by CustomerID)
The code above will delete all id including the first occurrence of each of the IDs, but I need to keep their first occurrence. Please advise.
This code should give you what you need.
There may be better ways to do it if you can provide the full table schema for Table1
If you obtain the row number and then just ignore the first ones:
;WITH cte
AS
(
SELECT ID,
ROW_NUMBER() OVER(PARTITION BY CustomerID ORDER BY ID) AS Rn
FROM [Table1]
)
DELETE cte WHERE Rn > 1
delete a from(
Select dense_rank() OVER(PARTITION BY CustomerID ORDER BY ID) AS Rn,*
from Table1)a
where a.Rn>1
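Since this table does have a primary key, a portable alternative that needs no window functions is to keep the smallest ID per CustomerID and delete everything else. This is a different technique from the answers above, sketched here in Python with the standard sqlite3 module; the sample rows are invented, and "first occurrence" is assumed to mean the lowest ID.

```python
import sqlite3


def keep_first_per_customer(conn):
    """Delete every row whose ID is not the lowest ID for its CustomerID."""
    conn.execute(
        """
        DELETE FROM Table1
        WHERE ID NOT IN (
            SELECT MIN(ID) FROM Table1 GROUP BY CustomerID
        )
        """
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (ID INTEGER PRIMARY KEY, CustomerID INTEGER)")
# Hypothetical data: customer 100 appears three times, customer 200 twice.
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    [(1, 100), (2, 100), (3, 200), (4, 100), (5, 300), (6, 200)],
)
keep_first_per_customer(conn)
kept = conn.execute("SELECT ID, CustomerID FROM Table1 ORDER BY ID").fetchall()
```

The NOT IN form is safe here because ID is a primary key and therefore never NULL.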

T-SQL Delete half of duplicates with no primary key

In a T-SQL stored procedure I have a complex procedure that is comparing data using temp tables but at the end of everything when I return a single table I end up with duplicate rows. In these rows all columns in the row are EXACTLY the same and there is no primary key within this table. I need to delete only half of these based on the number of times that row occurs. For example if there are eight rows that are all the same value. I want to delete four of them.
There is no way to get rid of them through my SP filtering because the data that is entered is literally duplicate information entered by the user, but I do require half of that information.
I've done some research on the subject and did some testing but it seems as if it's not possible to delete half of the duplicated rows. Is this not possible? Or is there a way?
Here is one way, using a great feature of SQL Server, updatable CTEs:
with todelete as (
select t.*,
row_number() over (partition by col1, col2, col3, . . . order by newid()) as seqnum
from table t
)
delete from todelete
where seqnum % 2 = 0;
This will delete every other value.
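The effect of the modulo filter on groups of different sizes can be checked with a small sketch in Python and the standard sqlite3 module. It is an adaptation under stated assumptions: SQLite cannot delete through a CTE, so rows are removed by rowid; the results table and its rows are hypothetical; and the random ordering is dropped because the rows in a group are identical anyway.

```python
import sqlite3


def delete_every_other_duplicate(conn):
    """Number identical rows 1, 2, 3, ... and delete the even-numbered ones."""
    conn.execute(
        """
        DELETE FROM results
        WHERE rowid IN (
            SELECT rid FROM (
                SELECT rowid AS rid,
                       ROW_NUMBER() OVER (PARTITION BY col1, col2, col3) AS rn
                FROM results
            )
            WHERE rn % 2 = 0
        )
        """
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (col1 TEXT, col2 INTEGER, col3 TEXT)")
# Hypothetical data: groups of 8, 3 and 1 identical rows.
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [("a", 1, "x")] * 8 + [("b", 2, "y")] * 3 + [("c", 3, "z")],
)
delete_every_other_duplicate(conn)
counts = dict(conn.execute("SELECT col1, COUNT(*) FROM results GROUP BY col1"))
```

A group of eight is reduced to four. For an odd-sized group the kept half is rounded up (three rows become two), and a row that occurs only once is left alone.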
Assuming SQL Server 2005+:
;WITH CTE AS
(
SELECT *,
RN=ROW_NUMBER() OVER(PARTITION BY Col1, Col2,...Coln ORDER BY Col1)
FROM YourTempTableHere
)
DELETE FROM CTE
WHERE RN % 2 = 0