Delete of duplicate records - sql

I have a table where I would like to identify duplicate records based on two columns (id and role), and I use a third column (unit) to select the subset of records to analyze and delete within. Here is the table with a few rows of example data:
id | role | unit
----------------
946| 1001 | 1
946| 1002 | 1
946| 1003 | 1
946| 1001 | 2
946| 1002 | 2
900| 1001 | 3
900| 1002 | 3
900| 1001 | 3
An analysis of units 1 and 2 should identify two rows to delete: 946/1001 and 946/1002. It doesn't matter whether I delete the rows labeled unit 1 or unit 2. In a subsequent step I will update all records labeled unit=2 to unit=1.
I have a SELECT statement that identifies the rows to delete:
SELECT * FROM (SELECT
unit,
id,
role,
ROW_NUMBER() OVER (
PARTITION BY
id,
role
ORDER BY
id,
role
) row_num
FROM thetable WHERE unit IN (1,2) ) as x
WHERE row_num > 1;
This query will give this result:
id | role | unit
----------------
946| 1001 | 2
946| 1002 | 2
Now I would like to combine this with DELETE to delete the identified records. I have come pretty close (I believe) with this statement:
DELETE FROM thetable tp1 WHERE EXISTS
(SELECT
unit,
id,
role,
ROW_NUMBER() OVER (
PARTITION BY
id,
role
ORDER BY
id,
role
) as row_num
FROM
thetable tp2
WHERE unit IN (1,2) AND
tp1.unit=tp2.unit AND
tp1.role=tp2.role AND
tp1.id=tp2.id AND row_num >1
)
However, row_num is not recognized as a column. How should I modify this statement to delete the two identified records?

It is very simple with EXISTS:
DELETE FROM thetable t
WHERE t.unit IN (1,2)
AND EXISTS (
SELECT 1 FROM thetable
WHERE (id, role) = (t.id, t.role) AND unit < t.unit
)
See the demo.
Results:
> id | role | unit
> --: | ---: | ---:
> 946 | 1001 | 1
> 946 | 1002 | 1
> 946 | 1003 | 1
> 900 | 1001 | 3
> 900 | 1002 | 3
> 900 | 1001 | 3
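The EXISTS approach can be tried out locally with a small sketch. It assumes SQLite (through Python's standard sqlite3 module) rather than the asker's database, so the outer table is referenced by name instead of by an alias and the row-value comparison is spelled out column by column. The sample rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thetable (id INTEGER, role INTEGER, unit INTEGER)")
conn.executemany(
    "INSERT INTO thetable VALUES (?, ?, ?)",
    [(946, 1001, 1), (946, 1002, 1), (946, 1003, 1),
     (946, 1001, 2), (946, 1002, 2),
     (900, 1001, 3), (900, 1002, 3), (900, 1001, 3)],
)

# Delete a unit 1/2 row when another row has the same (id, role) and a smaller unit.
deleted = conn.execute(
    """
    DELETE FROM thetable
    WHERE unit IN (1, 2)
      AND EXISTS (
          SELECT 1 FROM thetable AS o
          WHERE o.id = thetable.id
            AND o.role = thetable.role
            AND o.unit < thetable.unit
      )
    """
).rowcount

remaining = conn.execute(
    "SELECT id, role, unit FROM thetable ORDER BY id, role, unit"
).fetchall()
print(deleted, remaining)
```

The two unit 3 rows for 900/1001 stay untouched, both because of the unit filter and because neither has a smaller unit than the other.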

You could phrase this as:
delete from thetable t
where t.unit > (
select min(t1.unit)
from thetable t1
where t1.id = t.id and t1.role = t.role
)
This seems like a simple way to solve the task; it essentially says: delete rows for which another row exists with a smaller unit and the same id and role.
As for the query you wanted to write, using row_number(), I think that would be:
delete from thetable t
using (
select x.*, row_number() over(partition by id, role order by unit) rn
from thetable x
where x.unit in (1, 2)
) t1
where t1.id = t.id and t1.role = t.role and t1.unit = t.unit and t1.rn > 1
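A variant of the row_number() idea that identifies each physical row by a row identifier instead of joining back on (id, role, unit) might look like the sketch below. It assumes SQLite 3.25 or later (for window functions) and uses SQLite's implicit rowid, which other databases do not provide under that name; in PostgreSQL the system column ctid could play a similar role.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thetable (id INTEGER, role INTEGER, unit INTEGER)")
conn.executemany(
    "INSERT INTO thetable VALUES (?, ?, ?)",
    [(946, 1001, 1), (946, 1002, 1), (946, 1003, 1),
     (946, 1001, 2), (946, 1002, 2),
     (900, 1001, 3), (900, 1002, 3), (900, 1001, 3)],
)

# Number the rows of units 1 and 2 within each (id, role) group, lowest unit first,
# then delete every row that is not the first of its group.
conn.execute(
    """
    DELETE FROM thetable
    WHERE rowid IN (
        SELECT rid FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (PARTITION BY id, role ORDER BY unit) AS rn
            FROM thetable
            WHERE unit IN (1, 2)
        )
        WHERE rn > 1
    )
    """
)

remaining_units = conn.execute(
    "SELECT id, role, unit FROM thetable WHERE id = 946 ORDER BY role"
).fetchall()
total_rows = conn.execute("SELECT COUNT(*) FROM thetable").fetchone()[0]
print(remaining_units, total_rows)
```

Because the delete targets row identifiers, exact duplicates inside the same unit would lose only their extra copies rather than every matching row.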

Related

SQL get ALL rows which share foreign keys

I wish to find ALL rows that have at least one sibling with the same foreign key (any value, not a specific one).
mytable
id |fk_id
---|--------
1 |100
2 |200
3 |200
4 |300
5 |300
6 |300
My query should return rows 2 to 6, but not row 1 since it is alone.
I came up with a working solution, but it uses 2 subqueries, which seems like too much. (It runs for a few seconds on 20k+ rows, which implies at least O(n^2).)
SELECT * from mytable
WHERE fk_id IN
(SELECT fk_id FROM
(SELECT fk_id, COUNT(*) as mycnt from mytable GROUP BY fk_id) AS counts
WHERE mycnt >= 2)
What would be a faster solution?
A regular, non-SQL solution would be to sort by fk_id and then drop the singles: O(n log n) for a generic sort plus O(n) for a single pass, so O(n log n) overall.
Using SQLite, but other SQL dialects are fine too.
With EXISTS:
SELECT m.* FROM mytable m
WHERE EXISTS (
SELECT 1 FROM mytable
WHERE id <> m.id AND fk_id = m.fk_id
)
See the demo.
Or with COUNT() window function:
SELECT m.id, m.fk_id
FROM (
SELECT *, COUNT(id) OVER (PARTITION BY fk_id) counter
FROM mytable
) m
WHERE m.counter > 1
See the demo.
Results:
| id | fk_id |
| --- | ----- |
| 2 | 200 |
| 3 | 200 |
| 4 | 300 |
| 5 | 300 |
| 6 | 300 |
Something like this should work:
select *
from mytable
where fk_id in (
select fk_id
from mytable
group by fk_id
having count(*) > 1
)
Another alternative using inner join:
select m.*
from mytable m
inner join
(
select fk_id
from mytable
group by fk_id
having count(*) > 1
) fks on fks.fk_id = m.fk_id
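Since the question mentions SQLite, here is a minimal runnable sketch of the GROUP BY / HAVING variant using Python's standard sqlite3 module. The index name idx_mytable_fk is hypothetical, and adding an index on fk_id is an assumption that is not part of the original answers; it is the usual first thing to try when such a query is slow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, fk_id INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, 100), (2, 200), (3, 200), (4, 300), (5, 300), (6, 300)],
)
# Hypothetical index: lets the engine group and look up fk_id without full scans.
conn.execute("CREATE INDEX idx_mytable_fk ON mytable (fk_id)")

# Keep only rows whose fk_id occurs more than once.
shared = conn.execute(
    """
    SELECT id, fk_id
    FROM mytable
    WHERE fk_id IN (
        SELECT fk_id FROM mytable GROUP BY fk_id HAVING COUNT(*) > 1
    )
    ORDER BY id
    """
).fetchall()
print(shared)
```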

Exclude first record associated with each parent record in Postgres

There are 2 tables, users and job_experiences.
I want to return a list of all job_experiences except the first associated with each user.
users
id
---
1
2
3
job_experiences
id | start_date | user_id
--------------------------
1 | 201001 | 1
2 | 201201 | 1
3 | 201506 | 1
4 | 200901 | 2
5 | 201005 | 2
Desired result
id | start_date | user_id
--------------------------
2 | 201201 | 1
3 | 201506 | 1
5 | 201005 | 2
Current query
select
*
from job_experiences
order by start_date asc
offset 1
But this doesn't work as it would need to apply the offset to each user individually.
You can do this with a lateral join:
select je.*
from users u cross join lateral
(select je.*
from job_experiences je
where u.id = je.user_id
order by id
offset 1 -- all except the first
) je;
For performance, an index on job_experiences(user_id, id) is recommended.
Use the row_number() window function. Numbering each user's rows from the latest start_date backwards and keeping everything except the last number drops the earliest job:
with cte as
(
select e.*,
row_number() over (partition by user_id order by start_date desc) rn,
count(*) over (partition by user_id) cnt
from users u join job_experiences e on u.id = e.user_id
)
select * from cte
where rn <= cnt - 1
You could use an intermediate CTE to get the first (MIN) jobs for each user, and then use that to determine which records to exclude:
WITH user_first_je("user_id", "job_id") AS
(
SELECT "user_id", MIN("id")
FROM job_experiences
GROUP BY "user_id"
)
SELECT job_experiences.*
FROM job_experiences
LEFT JOIN user_first_je ON
user_first_je.job_id = job_experiences.id
WHERE user_first_je.job_id IS NULL;
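For comparison, a compact sketch of the window-function idea: number each user's rows by start_date and keep everything after the first. It is written against SQLite 3.25+ via Python's sqlite3 module so it can be run as is; the inner SELECT should carry over to Postgres unchanged, though that is an assumption worth checking. Ordering by start_date (rather than id) as the definition of "first" is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_experiences "
    "(id INTEGER PRIMARY KEY, start_date INTEGER, user_id INTEGER)"
)
conn.executemany(
    "INSERT INTO job_experiences VALUES (?, ?, ?)",
    [(1, 201001, 1), (2, 201201, 1), (3, 201506, 1),
     (4, 200901, 2), (5, 201005, 2)],
)

# rn = 1 marks each user's earliest job; everything else is returned.
later_jobs = conn.execute(
    """
    SELECT id, start_date, user_id
    FROM (
        SELECT id, start_date, user_id,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY start_date) AS rn
        FROM job_experiences
    )
    WHERE rn > 1
    ORDER BY id
    """
).fetchall()
print(later_jobs)
```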

Query to skip first row after id changes in SQL Server

I have a long table like the following. The table contains two similar rows wherever the id changes. E.g., in the following table, when the ID changes from 1 to 2, a duplicate record is added. All I need is a SELECT query that skips this and all other such duplicate records, but only where the ID changes.
# | name| id
--+-----+---
1 | abc | 1
2 | abc | 1
3 | abc | 1
4 | abc | 1
5 | abc | 1
5 | abc | 2
6 | abc | 2
7 | abc | 2
8 | abc | 2
9 | abc | 2
and so on
You could use NOT EXISTS to eliminate the duplicates:
SELECT *
FROM yourtable AS T
WHERE NOT EXISTS
( SELECT 1
FROM yourtable AS T2
WHERE T.[#] = T2.[#]
AND T2.ID > T.ID
);
This will return:
# name ID
------------------
. ... .
4 abc 1
5 abc 2
6 abc 2
. ... .
... (Some irrelevant rows have been removed from the start and the end)
If you wanted the first record to be retained, rather than the last, then just change the condition T2.ID > T.ID to T2.ID < T.ID.
You can use the following CTEs to simulate LAG window function not available in SQL Server 2008:
;WITH CTE_RN AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY [#], id) AS rn
FROM #mytable
), CTE_LAG AS (
SELECT t1.[#], t1.name,
t1.id AS curId, t2.id AS prevId,
t1.[#] AS cur#, t2.[#] AS lag#
FROM CTE_RN t1
LEFT JOIN CTE_RN t2 ON t1.rn = t2.rn + 1 )
You can now filter out the 'duplicate' records using the above CTE_LAG and the following predicate in your WHERE clause:
;WITH CTE_RN AS (
... -- definition as above
), CTE_LAG AS (
... -- definition as above
)
SELECT *
FROM CTE_LAG
WHERE (NOT ((prevId <> curId) AND (cur# = lag#))) OR (prevId IS NULL)
If prevId <> curId and cur# = lag#, then there is a change in the value of the id column and the following record has the same [#] value as the previous one, i.e. it is a duplicate.
Hence, using NOT on (prevId <> curId) AND (cur# = lag#), filters out all 'duplicate' records. This means record (5, abc, 2) will be eliminated.
SQL Fiddle Demo here
P.S. You can also add column name in the logical expression of the WHERE clause, depending on what defines a 'duplicate'.
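On engines that do have LAG (SQL Server 2012 and later, or SQLite 3.25+ as used here), the same predicate can be sketched without the self-join. This is an illustration run through Python's sqlite3 module, not the original answer's code; the column name num is a hypothetical stand-in for the [#] column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (num INTEGER, name TEXT, id INTEGER)")
conn.executemany(
    "INSERT INTO yourtable VALUES (?, 'abc', ?)",
    [(1, 1), (2, 1), (3, 1), (4, 1), (5, 1),
     (5, 2), (6, 2), (7, 2), (8, 2), (9, 2)],
)

# Drop a row when the id changed compared with the previous row
# while num stayed the same, i.e. the duplicate added at the id change.
kept = conn.execute(
    """
    SELECT num, id
    FROM (
        SELECT num, id,
               LAG(num) OVER (ORDER BY num, id) AS prev_num,
               LAG(id)  OVER (ORDER BY num, id) AS prev_id
        FROM yourtable
    )
    WHERE prev_id IS NULL
       OR NOT (prev_id <> id AND prev_num = num)
    ORDER BY num, id
    """
).fetchall()
print(kept)
```

As in the CTE version, the row (5, abc, 2) is the one that gets filtered out.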
So I achieved it by using the following query in SQL Server:
select [#], name, id
from yourtable
group by [#], name, id
having count(*) > 0

SQL delete almost identical rows

I have a table that has 5 columns, and instead of an update, I did an insert of all rows (stupid mistake). How do I get rid of the duplicated records? They are identical except for the id. I can't remove all records; I want to delete half of them.
ex. table:
+-----+-------+--------+-------+
| id | name | name2 | user |
+-----+-------+--------+-------+
| 1 | nameA | name2A | u1 |
| 12 | nameA | name2A | u1 |
| 2 | nameB | name2B | u2 |
| 192 | nameB | name2B | u2 |
+-----+-------+--------+-------+
How to do this?
I'm using Microsoft Sql Server.
Try the following.
DELETE
FROM MyTable
WHERE ID NOT IN
(
SELECT MAX(ID)
FROM MyTable
GROUP BY Name, Name2, [User])
That is untested so may need adapting. The following video will provide you with some more information about this query.
Video
This is a more specific query than @TechDo's, as I find duplicates where name, name2 and user are all identical, not only name.
with duplicates as
(
select t.id, ROW_NUMBER() over (partition by t.name, t.name2, t.[user] order by t.id) as RowNumber
from YourTable t
)
delete duplicates
where RowNumber > 1
SQLFiddle demo to try it yourself: DEMO
Please try:
with c as
(
select
*, row_number() over(partition by name, name2, [user] order by id) as n
from YourTable
)
delete from c
where n > 1;
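Before running a destructive statement like these, it can help to preview which ids would go. The sketch below does that on the sample rows with SQLite via Python's sqlite3 module, which is an assumption since the question targets SQL Server. It keeps the lowest id per group, as the ROW_NUMBER answers do, whereas the MAX(ID) answer keeps the highest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE YourTable '
    '(id INTEGER PRIMARY KEY, name TEXT, name2 TEXT, "user" TEXT)'
)
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?, ?)",
    [(1, "nameA", "name2A", "u1"), (12, "nameA", "name2A", "u1"),
     (2, "nameB", "name2B", "u2"), (192, "nameB", "name2B", "u2")],
)

# One surviving id (the lowest) per group of otherwise identical rows.
keep_lowest = 'SELECT MIN(id) FROM YourTable GROUP BY name, name2, "user"'

# Preview the ids that the DELETE would remove.
doomed = [row[0] for row in conn.execute(
    f"SELECT id FROM YourTable WHERE id NOT IN ({keep_lowest}) ORDER BY id"
)]

# Run the delete inside a transaction; it commits when the block exits cleanly.
with conn:
    conn.execute(f"DELETE FROM YourTable WHERE id NOT IN ({keep_lowest})")

remaining = [row[0] for row in conn.execute("SELECT id FROM YourTable ORDER BY id")]
print(doomed, remaining)
```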

Grouping SQL Results based on order

I have a table with data something like this:
ID | RowNumber | Data
------------------------------
1 | 1 | Data
2 | 2 | Data
3 | 3 | Data
4 | 1 | Data
5 | 2 | Data
6 | 1 | Data
7 | 2 | Data
8 | 3 | Data
9 | 4 | Data
I want to group each set of RowNumbers so that my result is something like this:
ID | RowNumber | Group | Data
--------------------------------------
1 | 1 | a | Data
2 | 2 | a | Data
3 | 3 | a | Data
4 | 1 | b | Data
5 | 2 | b | Data
6 | 1 | c | Data
7 | 2 | c | Data
8 | 3 | c | Data
9 | 4 | c | Data
The only way I know where each group starts and stops is when the RowNumber starts over. How can I accomplish this? It also needs to be fairly efficient since the table I need to do this on has 52 Million Rows.
Additional Info
ID is truly sequential, but RowNumber may not be. I think RowNumber will always begin with 1 but for example the RowNumbers for group1 could be "1,1,2,2,3,4" and for group2 they could be "1,2,4,6", etc.
For the clarified requirements in the comments:
> The rownumbers for group1 could be "1,1,2,2,3,4" and for group2 they could be "1,2,4,6" ... a higher number followed by a lower would be a new group.
A SQL Server 2012 solution could be as follows.
Use LAG to access the previous row and set a flag to 1 if that row is the start of a new group or 0 otherwise.
Calculate a running sum of these flags to use as the grouping value.
Code
WITH T1 AS
(
SELECT *,
LAG(RowNumber) OVER (ORDER BY ID) AS PrevRowNumber
FROM YourTable
), T2 AS
(
SELECT *,
IIF(PrevRowNumber IS NULL OR PrevRowNumber > RowNumber, 1, 0) AS NewGroup
FROM T1
)
SELECT ID,
RowNumber,
Data,
SUM(NewGroup) OVER (ORDER BY ID
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
FROM T2
SQL Fiddle
Assuming ID is the clustered index the plan for this has one scan against YourTable and avoids any sort operations.
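The flag-then-running-sum technique can be checked on the sequences from the clarified requirements ("1,1,2,2,3,4" followed by "1,2,4,6"). The sketch below assumes SQLite 3.25+ through Python's sqlite3 module instead of SQL Server, uses CASE in place of IIF, and leaves out the Data column for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (ID INTEGER PRIMARY KEY, RowNumber INTEGER)")
# IDs 1..10 paired with the row numbers of two consecutive groups.
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?)",
    list(enumerate([1, 1, 2, 2, 3, 4, 1, 2, 4, 6], start=1)),
)

rows = conn.execute(
    """
    WITH T1 AS (
        SELECT ID, RowNumber,
               LAG(RowNumber) OVER (ORDER BY ID) AS PrevRowNumber
        FROM YourTable
    ), T2 AS (
        -- Flag a row as a group start when there is no previous row
        -- or the previous RowNumber is higher than the current one.
        SELECT ID, RowNumber,
               CASE WHEN PrevRowNumber IS NULL OR PrevRowNumber > RowNumber
                    THEN 1 ELSE 0 END AS NewGroup
        FROM T1
    )
    SELECT ID, RowNumber,
           SUM(NewGroup) OVER (ORDER BY ID
                               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
    FROM T2
    ORDER BY ID
    """
).fetchall()

groups = [grp for _id, _rn, grp in rows]
print(groups)
```

Repeated row numbers such as "1,1" stay in the same group because only a strictly higher previous value starts a new one.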
If the ids are truly sequential and RowNumber also increases by exactly 1 within each group (no repeats or gaps), you can do:
select t.*,
(id - rowNumber) as grp
from t
You can also use a recursive CTE:
;WITH cte AS
(
SELECT ID, RowNumber, Data, 1 AS [Group]
FROM dbo.test1
WHERE ID = 1
UNION ALL
SELECT t.ID, t.RowNumber, t.Data,
CASE WHEN t.RowNumber != 1 THEN c.[Group] ELSE c.[Group] + 1 END
FROM dbo.test1 t JOIN cte c ON t.ID = c.ID + 1
)
SELECT *
FROM cte
OPTION (MAXRECURSION 0) -- lifts the default limit of 100 recursion levels
Demo on SQLFiddle
How about:
select ID, RowNumber, Data, dense_rank() over (order by grp) as Grp
from (
select *, (select min(ID) from [Your Table] where ID > t.ID and RowNumber = 1) as grp
from [Your Table] t
) t
order by ID
This should work on SQL Server 2005. You could also use rank() instead if you don't care about consecutive numbers.