Delete Duplicate Rows in SQL

Delete Duplicate Rows in SQL - sql

I have a table with unique id but duplicate row information.
I can find the rows with duplicates using this query
SELECT
PersonAliasId, StartDateTime, GroupId, COUNT(*) as Count
FROM
Attendance
GROUP BY
PersonAliasId, StartDateTime, GroupId
HAVING
COUNT(*) > 1
I can manually delete the rows while keeping the 1 I need with this query
Delete
From Attendance
Where Id IN(SELECT
Id
FROM
Attendance
Where PersonAliasId = 15
and StartDateTime = '9/24/2017'
and GroupId = 1429
Order By ModifiedDateTIme Desc
Offset 1 Rows)
I am not versed in SQL enough to figure out how to use the rows in the first query to delete the duplicates leaving behind the most recent. There are over 3481 records returned by the first query to do this one by one manually.
How can I find the duplicate rows like the first query and delete all but the most recent like the second?

You can use a Common Table Expression to delete the duplicates:
WITH Cte AS(
SELECT *,
Rn = ROW_NUMBER() OVER(PARTITION BY PersonAliasId, StartDateTime, GroupId
ORDER BY ModifiedDateTIme DESC)
FROM Attendance
)
DELETE FROM Cte WHERE Rn > 1;
This will keep the most recent record for each PersonAliasId - StartDateTime - GroupId combination.

Use the MAX aggregate function to identify the latest startdatetime for each group/person combination. Then delete records which do not have that latest time.
DELETE a
FROM attendance as a
INNER JOIN (
SELECT
PersonAliasId, MAX(StartDateTime) AS LatestTime, GroupId,
FROM
Attendance
GROUP BY
PersonAliasId, GroupId
HAVING
COUNT(*) > 1
) as b
on a.personaliasid=b.personaliasid and a.groupid=b.groupid and a.startdatetime < b.latesttime

Same as the CTE answer - give Felix the check
delete
from ( SELECT rn = ROW_NUMBER() OVER(PARTITION BY PersonAliasId, StartDateTime, GroupId
ORDER BY ModifiedDateTIme DESC)
FROM Attendance
) tt
where tt.rn > 1

Related

sql: Select count(*) - nth record from each group

I'm grouping by tenant_id. I want to select the count() - 1000th record (ordered by _updated time) from each GROUPBY group, for the groups where count() is greater than 1000. As follows:
select t1.tenant_id,
(select temp._updated
from trace temp
where temp.tenant_id = t1.tenant_id
order by _updated limit 1 offset
count(*) - 1000
) as timekey
from fgc.trace as t1
group by tenant_id
having count(*) > 1000;
But this is not allowed as count(*) cannot be used inside the subquery.
So I tried the following, which still doesn't work as I don't have access to t1 since this is not a join.
select t1.tenant_id,
(select temp._updated
from trace temp
where temp.tenant_id = t1.tenant_id
order by _updated limit 1 offset
(select count(*)-1000
from trace t2
group by tenant_id
having t2.tenant_id = t1.tenant_id)
) as timekey
from fgc.trace as t1
group by tenant_id
having count(*) > 1000;
So how can I get the following?
tenant_id | timekey
+-----------+----------------------------------+
n7ia6ryc | 2019-07-23 23:09:49.951406+00:00

You seem to want ROW_NUMBER(). Cockroach supports windows functions, so:
SELECT updated
FROM (
SELECT
tenant_id,
updated,
ROW_NUMBER() OVER(PARTITION BY tenant_id ORDER BY updated DESC) rn
FROM trace
) x WHERE rn = 1001
For each tenant_id, this will return the timestamp of the 1001th less recent record. If a given tenant has less than 1000 records, it will not appear in the results.

select x.tenant_id
from (
select t.tenant_id,
row_number() over (partition by t.tenant_id order by t.timekey) as tenant_number
from fgc.trace as t
) x
where x.tenant_number > 1000
group by x.tenant_id
just the one timestamp would look like this:
select min(x.timekey) as min_timestamp
from (
select t.tenant_id, t.timekey,
row_number() over (partition by t.tenant_id order by t.timekey) as tenant_number
from fgc.trace as t
) x
where x.tenant_number > 1000
note that grouping does not matter here because each row can only be in one group and you are only looking at one row.

Remove duplicate records based on timestamp

I'm writing a query to find duplicate records. I have table with following columns
Id, Deliveries, TankId, Timestamp.
I have inserted duplicate records, that is for same tankid, same deliveries with the +1 day offset timestamp.
Now I want to remove duplicate records which is with lesser timestamp.
e.g. I have duplicate deliveries added for same tankid on 24th and 25th july. I need to remove 24th record.
I tried the following query;
SELECT raw.TimeStamp,raw.[Delivery],raw.[TankId]
FROM [dbo].[tObservationData] raw
INNER JOIN (
SELECT [Delivery],[TankSystemId]
FROM [dbo].[ObservationData]
GROUP BY [Delivery],[TankSystemId]
HAVING COUNT([ObservationDataId]) > 1
) dup
ON raw.[Delivery] = dup.[Delivery] AND raw.[TankId] = dup.[TankId]
AND raw.TimeStamp >'2019-06-30 00:00:00.0000000' AND raw.[DeliveryL]>0
ORDER BY [TankSystemId],TimeStamp
But above gives other records too, how can I find and delete those duplicate records?

In this case you can use partition by order by clause. You can partition by TankID and Delivery and order by Timestamp in desc order
Select * from (
Select *,ROW_NUMBER() OVER (PARTITION BY TankID,Delievry ORDER BY [Timestamp] DESC) AS rn
from [dbo].[ObservationData]
)
where rn = 1
In the above code records with rn=1 will have the latest timestamp. So you can only select those and ignore others. Also you can use the same to remove/delete the records from you table.
WITH TempObservationdata (TankID,Delivery,Timestamp)
AS
(
SELECT TankID,Delivery,ROW_NUMBER() OVER(PARTITION by TankID, Delivery ORDER BY Timsetamp desc)
AS Timestamp
FROM dbo.ObservationData
)
--Now Delete Duplicate Rows
DELETE FROM TempObservationdata
WHERE Timestamp > 1

think it will work
SELECT raw.TimeStamp,raw.[Delivery],raw.[TankId]
FROM [dbo].[tObservationData] raw
INNER JOIN (
SELECT [Delivery],[TankSystemId],min([TimeStamp]) as min_ts
FROM [dbo].[ObservationData]
GROUP BY [Delivery],[TankSystemId]
HAVING COUNT([ObservationDataId]) > 1
) dup
ON raw.[Delivery] = dup.[Delivery] AND raw.[TankId] = dup.[TankId] and raw.[TimeStamp] = dup.min_ts
AND raw.TimeStamp >'2019-06-30 00:00:00.0000000' AND raw.[DeliveryL]>0
ORDER BY [TankSystemId],TimeStamp

Are you just looking for this?
SELECT od.*
FROM (SELECT od.*,
ROW_NUMBER() OVER (PARTITION BY od.TankId, od.Delivery ORDER BY od.TimeStamp DESC) as seqnum
FROM [dbo].[tObservationData] od
) od
WHERE seqnum = 1;

How to select max rownumber for each partition in SQL Server

Can anybody tell me how to select the max row number for each partition in SQL Server using CTE.
Suppose any employee is having 4 transaction rows and another is having only one row then how to select max rows for those employees.
I am having job table I want to fetch max row number for employee to fetch the latest transaction for that employee
I'd tried following
With CTE as (
Select
My fields,
Rownum = row_number() over(partition by emplid order by date) from jobtable
Where
Myconditions
)
Select * from CTE B left outer join
CTE A on A.emplid = B.emplid
Where
A.rownum = (select max(a2.rownum) from jobtable a2)
Do left join is required above or it is not at all needed ?
Please tell me how to fetch rownum if only 1 row exist for any employees as above query is fetching only employees which are having.greatest rownum in whole table

With CTE as (
Select
My fields,
Rownum = row_number() over(partition by emplid order by date DESC)
from jobtable
Where
Myconditions
)
SELECT *
FROM
cte
WHERE
RowNum = 1
Just reverse the order of your ROW_NUMBER and and select where it equals 1. Row numbers can be ascending (ASC) or descending (DESC). So if you want the most recent date to get the latest record ORDER BY date DESC, if you want the earliest record first you would choose ORDER BY date ASC (or date)

Delete duplicates but keep 1 with multiple column key

I have the following SQL select. How can I convert it to a delete statement so it keeps 1 of the rows but deletes the duplicate?
select s.ForsNr, t.*
from [testDeleteDublicates] s
join (
select ForsNr, period, count(*) as qty
from [testDeleteDublicates]
group by ForsNr, period
having count(*) > 1
) t on s.ForsNr = t.ForsNr and s.Period = t.Period

Try using following:
Method 1:
DELETE FROM Mytable WHERE RowID NOT IN (SELECT MIN(RowID) FROM Mytable GROUP BY Col1,Col2,Col3)
Method 2:
;WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY ForsNr, period
ORDER BY ( SELECT 0)) RN
FROM testDeleteDublicates)
DELETE FROM cte
WHERE RN > 1
Hope this helps!
NOTE:
Please change the table & column names according to your need!

This is easy as long as you have a generated primary key column (which is a good idea). You can simply select the min(id) of each duplicate group and delete everything else - Note that I have removed the having clause so that the ids of non-duplicate rows are also excluded from the delete.
delete from [testDeleteDublicates]
where id not in (
select Min(Id) as Id
from [testDeleteDublicates]
group by ForsNr, period
)
If you don't have an artificial primary key you may have to achieve the same effect using row numbers, which will be a bit more fiddly as their implementation varies from vendor to vendor.

You can do with 2 option.
Add primary-key and delete accordingly
http://www.mssqltips.com/sqlservertip/1103/delete-duplicate-rows-with-no-primary-key-on-a-sql-server-table/
'2. Use row_number() with partition option, runtime add row to each row and then delete duplicate row.
Removing duplicates using partition by SQL Server
--give group by field in partition.
;with cte(
select ROW_NUMBER() over( order by ForsNr, period partition ForsNr, period) RowNo , * from [testDeleteDublicates]
group by ForsNr, period
having count(*) > 1
)
select RowNo from cte
group by ForsNr, period

Optimizing query with two MAX columns in the same table

I need to optimize below query
SELECT
Id, -- identity
CargoID,
[Status] AS CurrentStatus
FROM
dbo.CargoStatus
WHERE
id IN (SELECT TOP 1 ID
FROM dbo.CargoStatus CS
INNER JOIN STD.StatusMaster S ON CS.ShipStatusID = S.SatusID
WHERE CS.CargoID=CargoStatus.CargoID
ORDER BY YEAR([CS.DATE]) DESC, MONTH([CS.DATE]) DESC,
DAY([CS.DATE]) DESC, S.StatusStageNumber DESC)
There are two tables
CargoStatus, and
StatusMaster
Statusmaster has columns StatusID, StatusName, StatusStageNumber(int)
CargoStatus has columns ID, StatusID (FK StatusMaster StatusID column), Date
Is there any other better way of writing this query.
I want latest status for each cargo (only one entry per cargoID).

Since you seem to be using SQL Server 2005 or newer, you can use a CTE with the ROW_NUMBER() windowing function:
;WITH LatestCargo AS
(
SELECT
cs.Id, -- identity
cs.CargoID,
cs.[Status] AS CurrentStatus
ROW_NUMBER() OVER(PARTITION BY cs.CargoID
ORDER BY cs.[Date], s.StatusStageNumber DESC) AS 'RowNum'
FROM
dbo.CargoStatus cs
INNER JOIN
STD.StatusMaster s ON cs.ShipStatusID = s.[StatusID]
)
SELECT
Id, CargoID, [Status]
FROM
LatestCargo
WHERE
RowNum = 1
This CTE "partitions" your data by CargoID, and for each partition, the ROW_NUMBER function hands out sequential numbers, starting at 1 and ordered by Date DESC - so the latest row gets RowNum = 1 (for each CargoID) which is what I select from the CTE in the SELECT statement after it.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Delete Duplicate Rows in SQL - sql

Same as the CTE answer - give Felix the check delete from ( SELECT rn = ROW_NUMBER() OVER(PARTITION BY PersonAliasId, StartDateTime, GroupId ORDER BY ModifiedDateTIme DESC) FROM Attendance ) tt where tt.rn > 1

Related

sql: Select count(*) - nth record from each group

Remove duplicate records based on timestamp

How to select max rownumber for each partition in SQL Server

Delete duplicates but keep 1 with multiple column key

Optimizing query with two MAX columns in the same table

Categories

Resources