delete statement with a having count(*) > 1 - sql

I have some duplicates in a table that I wish to delete. However, SQL doesn't like my query below.
delete from tblS
where Field = 'spread'
group by FundCode, Region, DateEntered
having count(*) > 1
So I tried the query below, but again SQL doesn't like it. How should my query look?
delete s
from tblS s
join
(
select FundCode, Region, DateEntered, count(*)
from tblS
where Field = 'spread'
group by FundCode, Region, DateEntered
having count(*) > 1
) as d on s.FundCode = d.FundCode and s.DateEntered = d.DateEntered and s.Region= d.Region

Normally when you want to delete duplicates, you want to keep one of them. The right function to use is row_number() and SQL Server supports updatable CTEs. So:
with toupdate as (
select t.*,
row_number() over (partition by FundCode, Region, DateEntered order by FundCode) as seqnum
from tblS t
where Field = 'spread'
)
delete toupdate
where seqnum > 1;
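If you actually want to delete all duplicates, including the first row of each group, then use count(*) over (partition by FundCode, Region, DateEntered) instead of row_number() and delete the rows where that count is greater than 1.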
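The difference between keeping one row per group and deleting every row of a duplicated group can be checked on a small in-memory table. This is only a sketch in Python with the standard sqlite3 module, not SQL Server: SQLite has no updatable CTEs, so the delete goes through rowid instead, and the sample rows are made up for illustration. It assumes a SQLite build with window functions (3.25 or later).

```python
import sqlite3

# Hypothetical sample rows: (FundCode, Region, DateEntered, Field).
ROWS = [
    ("F1", "EU", "2020-01-01", "spread"),
    ("F1", "EU", "2020-01-01", "spread"),
    ("F1", "EU", "2020-01-01", "spread"),
    ("F2", "US", "2020-01-01", "spread"),
    ("F1", "EU", "2020-01-01", "yield"),
]

# Keep one row per group: number the rows and delete everything after the first.
KEEP_ONE = """
delete from tblS
where rowid in (
    select rid
    from (select rowid as rid,
                 row_number() over (partition by FundCode, Region, DateEntered
                                    order by rowid) as seqnum
          from tblS
          where Field = 'spread')
    where seqnum > 1)
"""

# Delete every row of a duplicated group: count the rows per group instead.
DELETE_ALL = """
delete from tblS
where rowid in (
    select rid
    from (select rowid as rid,
                 count(*) over (partition by FundCode, Region, DateEntered) as cnt
          from tblS
          where Field = 'spread')
    where cnt > 1)
"""

def rows_left(delete_sql):
    """Load the sample rows, run one delete statement, return the surviving rows."""
    con = sqlite3.connect(":memory:")
    con.execute("create table tblS (FundCode, Region, DateEntered, Field)")
    con.executemany("insert into tblS values (?, ?, ?, ?)", ROWS)
    con.execute(delete_sql)
    left = con.execute(
        "select FundCode, Field from tblS order by FundCode, Field").fetchall()
    con.close()
    return left

print(rows_left(KEEP_ONE))
print(rows_left(DELETE_ALL))
```

With KEEP_ONE a single F1 'spread' row survives; with DELETE_ALL none of them do. Rows with a different Field are untouched in both cases.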

Related

Update Values Returned from Select Statement

I have the following sql query that returns all duplicates (except most recent).
select *
from CodeData
inner join
(select max(cdID) as lastId, cdName
from CodeData
where cdName in (select cdName from CodeData
group by cdName
having count(*) > 1)
group by cdName) duplic on duplic.cdName = CodeData.cdName
where CodeData.cdID < duplic.lastId;
How would I go about updating a column for each result returned in the above query? Say I have an arbitrary column A, I was thinking about something along the lines of this at the end of the query.
update CodeData
set A = 0
where CodeData.cdID < CodeData.duplic.lastId;
But this ends up updating the entire table, and not just the returned results above.
Any tips?
Use an updatable CTE and window functions:
with toupdate as (
select cd.*,
row_number() over (partition by cdName order by cdId desc) as seqnum
from codedata cd
)
update toupdate
set a = 0
where seqnum > 1;
Note: You might find that row_number() is sufficient to handle the "duplication" issue and you don't need to store an additional column at all.
And you might really want:
update toupdate
set a = (case when seqnum > 1 then 0 else 1 end);
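To see the 0/1 flagging in action without a SQL Server instance, here is a rough sketch in Python with the standard sqlite3 module. SQLite cannot update through a CTE, so the row_number() result is fed into an IN list instead; the table layout and the function name flag_older_duplicates are assumptions made for the example, and window functions need SQLite 3.25 or later.

```python
import sqlite3

def flag_older_duplicates(rows):
    """rows: (cdID, cdName) pairs.

    Returns (cdID, A) pairs where A is 0 for every row that has a newer
    row with the same cdName, and 1 for the newest row of each name.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table CodeData (cdID integer primary key, cdName text, A integer)")
    con.executemany("insert into CodeData (cdID, cdName) values (?, ?)", rows)
    con.execute("""
        update CodeData
        set A = case when cdID in (
                    select cdID
                    from (select cdID,
                                 row_number() over (partition by cdName
                                                    order by cdID desc) as seqnum
                          from CodeData)
                    where seqnum > 1)
                then 0 else 1 end
    """)
    result = con.execute("select cdID, A from CodeData order by cdID").fetchall()
    con.close()
    return result

print(flag_older_duplicates([(1, "a"), (2, "a"), (3, "b"), (4, "a")]))
```

Only the rows numbered 2 and higher within each cdName partition (that is, everything except the highest cdID) get 0.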

How to delete 90% of records from each group of a table (postgres)

I have a table called 'sales' in postgres which has a column called 'region'. I am trying to find out a way to delete 90% of records from each 'region' of the same table.
I am using the query below, but it does not work in Postgres. Also, the table does not have a primary/unique key column.
delete from table
( select row_number() over (partition by region) as PAR
from sales
)b
where PAR >=
( select S*0.1 as ninety
from
( select region, count(*) as S
from sales
group by region
)a
and b.region = a.region
Can anyone provide a better solution to this?
If you have a unique id in the table, you can do:
delete
from t
using (select t.*,
row_number() over (partition by region order by region) as seqnum, -- I always include order by
count(*) over (partition by region) as cnt
from t
) tt
where t.id = tt.id and
tt.seqnum <= 0.9 * tt.cnt; -- <= so that a group of 10 loses 9 rows, not 8
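The per-group arithmetic is easy to get off by one, so here is a small sketch for checking it, written in Python with the standard sqlite3 module rather than Postgres. SQLite has no DELETE ... USING, so the numbered rows are matched through an IN list on id; the comparison is done in integers and uses <= so that a group whose size is a multiple of 10 loses exactly 90% of its rows. The table layout and the function name are assumptions for illustration, and window functions need SQLite 3.25 or later.

```python
import sqlite3

def delete_ninety_percent(region_sizes):
    """region_sizes: {region: number of rows to create}.

    Returns {region: number of rows left} after deleting about 90% per region.
    """
    con = sqlite3.connect(":memory:")
    con.execute("create table sales (id integer primary key, region text)")
    for region, size in region_sizes.items():
        con.executemany("insert into sales (region) values (?)",
                        [(region,)] * size)
    con.execute("""
        delete from sales
        where id in (
            select id
            from (select id,
                         row_number() over (partition by region order by id) as seqnum,
                         count(*) over (partition by region) as cnt
                  from sales)
            -- integer form of seqnum <= 0.9 * cnt, avoiding floating point
            where seqnum * 10 <= cnt * 9)
    """)
    left = dict(con.execute("select region, count(*) from sales group by region"))
    con.close()
    return left

print(delete_ninety_percent({"north": 10, "south": 20}))
```

For group sizes that are not multiples of 10 the number of deleted rows is rounded down, so at least one row always survives in every region.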

unable to use alias in Update statement in sql

I am working on an update query with the help of CTEs. I want to update records in a table based on duplicates: among each set of duplicate rows, I want to update just one row. My code is below:
with toupdate as (
select c.*,
count(*) over (partition by c.ConsumerReferenceNumber) as cnt,
max(c.ID) over (partition by c.ID) as onhand_value
from [dbo].[tbl_NADRA_CPS] c
)
update [dbo].[tbl_NADRA_CPS]
set StatusID = 38
where cnt > 1;
I am unable to use 'cnt' in my update where clause.
Thanks in advance.
Because cnt is a column of your CTE, not of [dbo].[tbl_NADRA_CPS]. Update the CTE instead:
with toupdate as (
select c.*,
count(*) over (partition by c.ConsumerReferenceNumber) as cnt,
max(c.ID) over (partition by c.ID) as onhand_value
from [dbo].[tbl_NADRA_CPS] c
)
update toupdate
set StatusID = 38
where cnt > 1;
If you want to update one row among the duplicates, then your query will not do that. Instead:
with toupdate as (
select c.*,
count(*) over (partition by c.ConsumerReferenceNumber) as cnt,
row_number() over (partition by c.ConsumerReferenceNumber order by c.ID) as seqnum
from [dbo].[tbl_NADRA_CPS] c
)
update toupdate
set StatusID = 38
where cnt > 1 and seqnum = 1;
The cnt > 1 gets reference numbers with more than one row. The seqnum = 1 ensures that just one is updated.
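A quick way to convince yourself that cnt > 1 and seqnum = 1 touches exactly one row per duplicated reference number is to run the same logic on a toy table. The sketch below uses Python's standard sqlite3 module instead of SQL Server, so the update goes through an IN list rather than through the CTE; the table name cps and the function name are placeholders invented for the example, and window functions need SQLite 3.25 or later.

```python
import sqlite3

def mark_one_per_duplicate_group(rows):
    """rows: (ID, ConsumerReferenceNumber) pairs.

    Sets StatusID = 38 on the lowest ID of every reference number that
    occurs more than once and returns the IDs that were updated.
    """
    con = sqlite3.connect(":memory:")
    con.execute("""create table cps (
                       ID integer primary key,
                       ConsumerReferenceNumber text,
                       StatusID integer)""")
    con.executemany(
        "insert into cps (ID, ConsumerReferenceNumber) values (?, ?)", rows)
    con.execute("""
        update cps
        set StatusID = 38
        where ID in (
            select ID
            from (select ID,
                         count(*) over (partition by ConsumerReferenceNumber) as cnt,
                         row_number() over (partition by ConsumerReferenceNumber
                                            order by ID) as seqnum
                  from cps)
            where cnt > 1 and seqnum = 1)
    """)
    updated = [r[0] for r in con.execute(
        "select ID from cps where StatusID = 38 order by ID")]
    con.close()
    return updated

sample = [(1, "A"), (2, "A"), (3, "B"), (4, "C"), (5, "C"), (6, "C")]
print(mark_one_per_duplicate_group(sample))
```

Reference number "B" occurs once, so its row is skipped by cnt > 1; for "A" and "C" only the first row by ID is updated.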

How do I delete duplicate rows in SQL Server using the OVER clause?

Here are the columns in my table:
Id
EmployeeId
IncidentRecordedById
DateOfIncident
Comments
TypeId
Description
IsAttenIncident
I would like to delete duplicate rows where EmployeeId, DateOfIncident, TypeId and Description are the same - just to clarify - I do want to keep one of them. I think I should be using the OVER clause with PARTITION, but I am not sure.
Thanks
If you want to keep one row of each duplicate group, you can use ROW_NUMBER. In this example I keep the row with the lowest Id:
WITH CTE AS
(
SELECT rn = ROW_NUMBER()
OVER(
PARTITION BY employeeid, dateofincident, typeid, description
ORDER BY Id ASC), *
FROM dbo.TableName
)
DELETE FROM cte
WHERE rn > 1
You can also use this query, which does not need a CTE:
delete a from
(select id,name,place, ROW_NUMBER() over (partition by id,name,place order by id) row_Count
from dup_table) a
where a.row_Count >1
You can use the following query. It assumes that you want to keep the latest row (the one with the highest ID) and delete the other duplicates.
DELETE [YourTable]
FROM [YourTable]
LEFT OUTER JOIN (
SELECT MAX(ID) as RowId
FROM [YourTable]
GROUP BY EmployeeId, DateOfIncident, TypeId, Description
) as KeepRows ON
[YourTable].ID = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
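The keep-the-highest-ID idea can be tried out on a few rows as follows. This is a sketch in Python with the standard sqlite3 module, not SQL Server, and it swaps the LEFT JOIN ... IS NULL form for an equivalent NOT IN over the same MAX(Id) subquery, since SQLite does not accept a join in DELETE. The table name incidents and the sample rows are made up.

```python
import sqlite3

def keep_latest(rows):
    """rows: (Id, EmployeeId, DateOfIncident, TypeId, Description) tuples.

    Deletes every row that is not the highest Id of its duplicate group
    and returns the surviving Ids.
    """
    con = sqlite3.connect(":memory:")
    con.execute("""create table incidents (
                       Id integer primary key,
                       EmployeeId integer,
                       DateOfIncident text,
                       TypeId integer,
                       Description text)""")
    con.executemany("insert into incidents values (?, ?, ?, ?, ?)", rows)
    con.execute("""
        delete from incidents
        where Id not in (
            select max(Id)
            from incidents
            group by EmployeeId, DateOfIncident, TypeId, Description)
    """)
    survivors = [r[0] for r in con.execute("select Id from incidents order by Id")]
    con.close()
    return survivors

sample = [
    (1, 7, "2021-03-01", 2, "late"),
    (2, 7, "2021-03-01", 2, "late"),
    (3, 7, "2021-03-02", 2, "late"),
    (4, 8, "2021-03-01", 2, "late"),
    (5, 7, "2021-03-01", 2, "late"),
]
print(keep_latest(sample))
```

Rows 1, 2 and 5 form one duplicate group, so only row 5 of that group is kept; rows 3 and 4 are unique and stay.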

SQL: How to find duplicates based on two fields?

I have rows in an Oracle database table which should be unique for a combination of two fields, but the unique constraint is not set up on the table, so I need to find all rows which violate the constraint myself using SQL. Unfortunately my meager SQL skills aren't up to the task.
My table has three columns which are relevant: entity_id, station_id, and obs_year. For each row the combination of station_id and obs_year should be unique, and I want to find out if there are rows which violate this by flushing them out with an SQL query.
I have tried the following SQL (suggested by this previous question) but it doesn't work for me (I get ORA-00918 column ambiguously defined):
SELECT
entity_id, station_id, obs_year
FROM
mytable t1
INNER JOIN (
SELECT entity_id, station_id, obs_year FROM mytable
GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1) dupes
ON
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year
Can someone suggest what I'm doing wrong, and/or how to solve this?
SELECT *
FROM (
SELECT t.*, ROW_NUMBER() OVER (PARTITION BY station_id, obs_year ORDER BY entity_id) AS rn
FROM mytable t
)
WHERE rn > 1
SELECT entity_id, station_id, obs_year
FROM mytable t1
WHERE EXISTS (SELECT 1 from mytable t2 Where
t1.station_id = t2.station_id
AND t1.obs_year = t2.obs_year
AND t1.RowId <> t2.RowId)
Change the 3 fields in the initial select to be
SELECT
t1.entity_id, t1.station_id, t1.obs_year
Re-write of your query
SELECT
t1.entity_id, t1.station_id, t1.obs_year
FROM
mytable t1
INNER JOIN (
SELECT entity_id, station_id, obs_year FROM mytable
GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1) dupes
ON
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year
I think the ambiguous column error (ORA-00918) was because you were selecting columns whose names appear in both the table and the subquery, but you did not specify whether you wanted them from dupes or from mytable (aliased as t1).
Could you not create a new table that includes the unique constraint, and then copy across the data row by row, ignoring failures?
You need to specify the table for the columns in the main select. Also, assuming entity_id is the unique key for mytable and is irrelevant to finding duplicates, you should not be grouping on it in the dupes subquery.
Try:
SELECT t1.entity_id, t1.station_id, t1.obs_year
FROM mytable t1
INNER JOIN (
SELECT station_id, obs_year FROM mytable
GROUP BY station_id, obs_year HAVING COUNT(*) > 1) dupes
ON
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year
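One thing worth knowing when choosing between the approaches in this thread: the join against the grouped subquery returns every row of a duplicated (station_id, obs_year) pair, while filtering on ROW_NUMBER() > 1 returns only the surplus rows and leaves out the first row of each group. A small sketch that shows the difference, using Python's standard sqlite3 module as a stand-in for Oracle and made-up sample rows (window functions need SQLite 3.25 or later):

```python
import sqlite3

# Hypothetical sample rows: (entity_id, station_id, obs_year).
ROWS = [
    (1, 10, 2000), (2, 10, 2000),                 # duplicated pair
    (3, 10, 2001),                                # unique
    (4, 11, 2000), (5, 11, 2000), (6, 11, 2000),  # triplicated pair
]

# Every row that belongs to a duplicated (station_id, obs_year) pair.
ALL_ROWS_IN_DUPLICATED_GROUPS = """
select t1.entity_id
from mytable t1
inner join (select station_id, obs_year
            from mytable
            group by station_id, obs_year
            having count(*) > 1) dupes
  on t1.station_id = dupes.station_id and t1.obs_year = dupes.obs_year
order by t1.entity_id
"""

# Only the second and later rows of each duplicated pair.
SURPLUS_ROWS_ONLY = """
select entity_id
from (select t.*,
             row_number() over (partition by station_id, obs_year
                                order by entity_id) as rn
      from mytable t)
where rn > 1
order by entity_id
"""

def entity_ids(sql):
    """Run one of the queries above against the sample rows."""
    con = sqlite3.connect(":memory:")
    con.execute("create table mytable (entity_id, station_id, obs_year)")
    con.executemany("insert into mytable values (?, ?, ?)", ROWS)
    ids = [r[0] for r in con.execute(sql)]
    con.close()
    return ids

print(entity_ids(ALL_ROWS_IN_DUPLICATED_GROUPS))
print(entity_ids(SURPLUS_ROWS_ONLY))
```

Which result you want depends on the goal: the first is handy for inspecting the conflicting rows side by side, the second for listing the rows you would remove to make the pair unique.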
SELECT *
FROM (
SELECT t.*, ROW_NUMBER() OVER (PARTITION BY station_id, obs_year ORDER BY entity_id) AS rn
FROM mytable t
)
WHERE rn > 1
The query above, by Quassnoi, is the most efficient for large tables.
Here is the cost analysis I ran:
SELECT a.dist_code, a.book_date, a.book_no
FROM trn_refil_book a
WHERE EXISTS (SELECT 1 from trn_refil_book b Where
a.dist_code = b.dist_code and a.book_date = b.book_date and a.book_no = b.book_no
AND a.RowId <> b.RowId)
;
gave a cost of 1322341
SELECT a.dist_code, a.book_date, a.book_no
FROM trn_refil_book a
INNER JOIN (
SELECT b.dist_code, b.book_date, b.book_no FROM trn_refil_book b
GROUP BY b.dist_code, b.book_date, b.book_no HAVING COUNT(*) > 1) c
ON
a.dist_code = c.dist_code and a.book_date = c.book_date and a.book_no = c.book_no
;
gave a cost of 1271699
while
SELECT dist_code, book_date, book_no
FROM (
SELECT t.dist_code, t.book_date, t.book_no, ROW_NUMBER() OVER (PARTITION BY t.book_date, t.book_no
ORDER BY t.dist_code) AS rn
FROM trn_refil_book t
) p
WHERE p.rn > 1
;
gave a cost of 1021984
The table was not indexed.
SELECT entity_id, station_id, obs_year
FROM mytable
GROUP BY entity_id, station_id, obs_year
HAVING COUNT(*) > 1
Specify the fields to find duplicates on in both the SELECT and the GROUP BY.
It works by using GROUP BY to find any rows that match other rows on the specified columns.
The HAVING COUNT(*) > 1 says that we are only interested in rows that occur more than once (and are therefore duplicates).
I thought a lot of the solutions here were cumbersome and tough to understand since I had a 3 column primary key constraint and needed to find the duplicates. So here's an option
SELECT id, name, value, COUNT(*) FROM db_name.table_name
GROUP BY id, name, value
HAVING COUNT(*) > 1
I'm surprised there aren't any answers here that use a CTE (Common Table Expression)
WITH cte as (
SELECT
ROW_NUMBER()
OVER(
PARTITION BY Last_Name, First_Name order by BIRTHDATE)
AS RN,
Employee_number, First_Name, Last_Name, BirthDate,
SUM(1)
OVER(
PARTITION BY Last_Name, First_Name
ORDER BY BIRTHDATE ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
AS CNT
FROM
employment)
select * from cte where cnt > 1
Not only will this find duplicates (on first and last name only), it will tell you how many there are.