Remove duplicates by multiple column criteria


I have the following table:
CREATE TABLE Test (
ID INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
FIRST VARCHAR(10) NOT NULL,
SECOND VARCHAR(10) NOT NULL
)
The table is filled with some duplicate data. The TestTarget table has the same structure and is filled using the following procedural algorithm:
DECLARE @first varchar(10), @second varchar(10)
DECLARE c CURSOR FAST_FORWARD
FOR
SELECT first, second FROM Test ORDER BY id
OPEN c
FETCH NEXT FROM c INTO @first, @second
WHILE @@FETCH_STATUS = 0
BEGIN
    -- insert only if neither value is already present in the target
    IF NOT EXISTS(SELECT 1 FROM TestTarget WHERE first=@first OR second=@second)
        INSERT INTO TestTarget (first, second) VALUES(@first, @second)
    FETCH NEXT FROM c INTO @first, @second
END
CLOSE c
DEALLOCATE c
In short, before each insert we check whether the target table already contains that 'first' OR 'second' value.
Example:
Source table
ID FIRST SECOND
1 A 2
2 A 1
3 A 3
4 B 2
5 B 1
6 B 3
7 B 2
8 B 4
9 C 2
10 C 3
INSERT INTO Test (first, second)
VALUES ('A', '2'),
('A', '1'),
('A', '3'),
('B', '2'),
('B', '1'),
('B', '3'),
('B', '2'),
('B', '4'),
('C', '2'),
('C', '3')
Target table
ID FIRST SECOND
1 A 2
5 B 1
10 C 3
The real source table has x*100k rows and at least 2 rows for each 'first' or 'second' value.
I'm looking for a set-based solution, if one is possible at all, or at least something faster than this loop, because it takes hours in my real case.
NOTE: Classic duplicate removal via partition/join/etc. does not apply here, because it produces different results, even a different final number of rows.
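For reference, the rule the cursor applies can be modelled outside SQL. The following is only a sketch in Python, not a set-based SQL solution; the function and variable names are made up for illustration. It walks the rows in id order and keeps a row only when neither its 'first' nor its 'second' value has been kept before, which is what the NOT EXISTS check against TestTarget does. Two hash sets replace the lookup in the target table, so the pass is linear in the number of rows.

```python
def greedy_dedupe(rows):
    """rows: iterable of (id, first, second); returns the kept rows in id order."""
    seen_first, seen_second = set(), set()
    kept = []
    for row_id, first, second in sorted(rows):  # ids are unique, so this is id order
        # same test as the cursor's NOT EXISTS (first=@first OR second=@second)
        if first in seen_first or second in seen_second:
            continue
        kept.append((row_id, first, second))
        seen_first.add(first)
        seen_second.add(second)
    return kept


# Sample data from the question (hypothetical constant name).
SAMPLE = [
    (1, "A", "2"), (2, "A", "1"), (3, "A", "3"), (4, "B", "2"), (5, "B", "1"),
    (6, "B", "3"), (7, "B", "2"), (8, "B", "4"), (9, "C", "2"), (10, "C", "3"),
]

if __name__ == "__main__":
    for row in greedy_dedupe(SAMPLE):
        print(row)
```

A model like this can also serve as a reference when checking whether a candidate SQL rewrite returns the same rows as the cursor.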

INSERT INTO TestTarget (first, second)
SELECT first,second
FROM Test t
WHERE NOT EXISTS
(
SELECT 1
FROM Test t2
WHERE t2.id>t.id and (t2.first=t.first or t2.second=t.second)
)

I cannot think of any simple set based solution to your problem, I am afraid, but I would hope that something along the following lines would be much faster than your existing cursor:
declare @test table
(id int,
first varchar(1),
second varchar(1))
declare @target table
(id int,
first varchar(1),
second varchar(1))
declare @temp table
(id int,
first varchar(1),
second varchar(1))
INSERT INTO @test (id, first, second)
VALUES (1, 'A', '2'),
(2, 'A', '1'),
(3, 'A', '3'),
(4, 'B', '2'),
(5, 'B', '1'),
(6, 'B', '3'),
(7, 'B', '2'),
(8, 'B', '4'),
(9, 'C', '2'),
(10, 'C', '3')
-- values that have not yet been used in the target
declare @firsts table
(first varchar(1))
declare @seconds table
(second varchar(1))
INSERT INTO @firsts
SELECT DISTINCT first FROM @test
INSERT INTO @seconds
SELECT DISTINCT second FROM @test
declare @firstcnt int = (SELECT count(*) FROM @firsts)
declare @secondcnt int = (SELECT count(*) FROM @seconds)
WHILE (@firstcnt > 0 AND @secondcnt > 0)
BEGIN
    DELETE FROM @temp
    -- lowest-id row whose first and second values are both still unused
    INSERT INTO @temp
    SELECT TOP 1 t.id, t.first, t.second FROM @test t
    INNER JOIN @firsts f On t.first = f.first
    INNER JOIN @seconds s On t.second = s.second
    ORDER BY t.id
    INSERT INTO @target
    SELECT * FROM @temp
    DELETE FROM @firsts WHERE first = (SELECT first FROM @temp)
    SET @firstcnt = @firstcnt - 1
    DELETE FROM @seconds WHERE second = (SELECT second FROM @temp)
    SET @secondcnt = @secondcnt - 1
END
SELECT * FROM @target
This does produce the desired values and I would expect it to be faster because the while loop only needs to run for the total number of unique value pairs, rather than having to step through the entire table.
It also gives 10 C 3 as the last row, which I take to be correct, despite @Gordon's comment. If I understand the question correctly, the ID order takes precedence: that is to say, although 'A' and 'B' have entries with '3' as the second value, these entries have a greater id than another second value that can legitimately be inserted.
HTH

Using a recursive CTE:
declare @Target table(col1 varchar(20),col2 int)
declare @Test table(col1 varchar(20),col2 int)
INSERT INTO @Test (col1, col2)
VALUES ('A', '2'),
('A', '1'),
('A', '3'),
('B', '1'),
('B', '2'),
('B', '3'),
('B', '2'),
('B', '4'),
('C', '2'),
('C', '3')

;With CTE as
(
select col1 ,col2
,DENSE_RANK()over( ORDER by col1)rn1
from @Test
)
,cte1 AS(
select top 1 c.col1,c.col2,rn1 from cte c where rn1=1
union ALL
select c.col1,c.col2,c.rn1 from cte c
inner join cte1 c1
on c.rn1>c1.rn1
where c.col2!=c1.col2
)
insert into @Target
select col1,col2 FROM(
select *,ROW_NUMBER()over(partition by col1 order by (select null)) rn2 from cte1
)t4
where rn2=1
select * from @Target

Related

SQL select items between LAG and LEAD using as range

Is it possible to select and count items from a table, using LAG and LEAD over another table to define the ranges, as below?
SELECT @Last = MAX(ID) from [dbo].[#Temp]
select opl.Name as [Age Categories] ,
( SELECT count([dbo].udfCalculateAge([BirthDate],GETDATE()))
FROM [dbo].[tblEmployeeDetail] ed
inner join [dbo].[tblEmployee] e
on ed.EmployeeID = e.ID
where convert(int,[dbo].udfCalculateAge(e.[BirthDate],GETDATE()))
between LAG(opl.Name) OVER (ORDER BY opl.id)
and (CASE opl.ID WHEN @Last THEN '100' ELSE opl.Name End )
) as Total
FROM [dbo].[#Temp] opl
tblEmployee contains the employees and their dates of birth
INSERT INTO #tblEmployees VALUES
(1, 'A', 'A1', 'A', '1983/01/02'),
(2, 'B', 'B1', 'BC', '1982/01/02'),
(3, 'C', 'C1', 'JR2', '1982/10/11'),
(4, 'V', 'V1', 'G', '1990/07/12'),
(5, 'VV', 'VV1', 'J', '1992/06/02'),
(6, 'R', 'A', 'D', '1982/05/15'),
(7, 'C', 'Ma', 'C', '1984/09/29')
The next table is a temp table created from the ages entered by the user, e.g. "20;30;50;60" generates the temp table below, using the function Split:
select * FROM [dbo].[Split](';','20;30;50;60')
Temp Table
pn s
1 20
2 30
3 50
4 60
The desired output is below, though the column Age Categories can be renamed in a data-table in C#. I need the Total column to be accurate for the ranges.
Age Categories Total
up to 20 0
21 - 30 2
31 - 50 5
51 - 60 0

Something along these lines should work for you:
declare @tblEmployees table(
ID int,
FirstNames varchar(20),
Surname varchar(20),
Initial varchar(3),
BirthDate date)
INSERT INTO @tblEmployees VALUES
(1, 'A', 'A1', 'A', '1983/01/02'),
(2, 'B', 'B1', 'BC', '1982/01/02'),
(3, 'C', 'C1', 'JR2', '1982/10/11'),
(4, 'V', 'V1', 'G', '1990/07/12'),
(5, 'VV', 'VV1', 'J', '1992/06/02'),
(6, 'R', 'A', 'D', '1982/05/15'),
(7, 'C', 'Ma', 'C', '1984/09/29')
declare @temp table
(id int identity,
age int)
INSERT INTO @temp
SELECT cast(item as int) FROM dbo.fnSplit(';','20;30;50;60')
declare @today date = GetDate()
declare @minBirthCutOff date = (SELECT DATEADD(yy, -MAX(age), @today) FROM @temp)
declare @minBirth date = (SELECT Min(birthdate) from @tblEmployees)
-- add a catch-all bucket only if someone is older than the largest boundary
IF @minBirth < @minBirthCutOff
BEGIN
INSERT INTO @temp VALUES (100)
end
SELECT COALESCE(CAST((LAG(t.age) OVER(ORDER BY t.age) + 1) as varchar(3))
+ ' - ','Up to ')
+ CAST(t.age AS varchar(3)) AS [Age Categories],
COUNT(e.id) AS [Total] FROM @temp t
LEFT JOIN
(SELECT te.id,
te.age,
-- note: with > an age equal to a boundary (e.g. exactly 30) lands in the next bucket;
-- use >= if the upper boundary should be inclusive
(SELECT MIN(age) FROM @temp t WHERE t.age > te.age) AS agebucket
FROM (select id,
dbo.udfCalculateAge(birthdate,@today) age from @tblEmployees) te) e
ON e.agebucket = t.age
GROUP BY t.age ORDER BY t.age
Result set looks like this:
Age Categories Total
Up to 20 0
21 - 30 2
31 - 50 5
51 - 60 0
For future reference, particularly when asking SQL questions, you will get a far faster and better response if you provide much of the work that I have done here, i.e. CREATE statements for the tables concerned and INSERT statements to supply the sample data. It is much easier for you to do this than for us (we have to copy, paste and then re-format), whereas you should be able to do the same via a few choice SELECT statements!
Note also that I handled the case where a birthdate falls outside the given range rather differently. It is a bit more efficient to do a single check once via MAX than to complicate your SELECT statement. It also makes the query much more readable.
Thanks to HABO for suggestion on GetDate()
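The bucketing idea (assign each age to the smallest boundary that covers it, then count per boundary) can be sketched outside SQL as follows. This is an illustration only: the function name and the sample ages are hypothetical, the upper boundary is treated as inclusive, and ages above the largest boundary are simply ignored rather than put in a catch-all bucket.

```python
from bisect import bisect_left
from collections import Counter


def bucket_counts(ages, bounds):
    """Count ages per range; each range ends at one of the given upper bounds (inclusive)."""
    bounds = sorted(bounds)
    counts = Counter()
    for age in ages:
        i = bisect_left(bounds, age)  # index of the first bound >= age
        if i < len(bounds):           # ages above the last bound are skipped
            counts[bounds[i]] += 1
    rows, prev = [], None
    for bound in bounds:
        label = f"Up to {bound}" if prev is None else f"{prev + 1} - {bound}"
        rows.append((label, counts[bound]))
        prev = bound
    return rows


if __name__ == "__main__":
    # hypothetical ages, not computed from the birth dates in the question
    for label, total in bucket_counts([31, 32, 31, 23, 21, 31, 29], [20, 30, 50, 60]):
        print(label, total)
```

Empty ranges still appear with a count of 0, which mirrors the effect of the LEFT JOIN from the boundary table in the query.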

Identifying/comparing sets of rows within groups

I have a problem which seemed simple to solve, but which I now find troublesome.
In simplified form: I need to find a way to identify unique sets of rows within groups defined by another column. In a basic example the source table contains only these columns:
routeID nodeID nodeName
1 1 a
1 2 b
2 1 a
2 2 b
3 1 a
3 2 b
4 1 a
4 2 c
5 1 a
5 2 c
6 1 a
6 2 b
6 3 d
7 1 a
7 2 b
7 3 d
So, the routeID column refers to set of nodes which define a route.
What I need to do is to somehow group the routes, so that there will be only one unique sequence of nodes for one routeID.
In my actual case I tried to use window function to add columns which help to identify nodes sequence, but I still have no idea how to get those unique sequences and group routes.
As a final effect I want to get only unique routes - for example routes 1,2 and 3 aggregated to one route.
Do you have any idea how to do this?
EDIT:
The other table which I would like to join with the one from the example may look like that:
journeyID nodeID nodeName routeID
1 1 a 1
1 2 b 1
2 1 a 1
2 2 b 1
3 1 a 4
3 2 c 4
...........................
...........................

You can try this idea:
DECLARE @DataSource TABLE
(
[routeID] TINYINT
,[nodeID] TINYINT
,[nodeName] CHAR(1)
);
INSERT INTO @DataSource ([routeID], [nodeID], [nodeName])
VALUES ('1', '1', 'a')
,('1', '2', 'b')
,('2', '1', 'a')
,('2', '2', 'b')
,('3', '1', 'a')
,('3', '2', 'b')
,('4', '1', 'a')
,('4', '2', 'c')
,('5', '1', 'a')
,('5', '2', 'c')
,('6', '1', 'a')
,('6', '2', 'b')
,('6', '3', 'd')
,('7', '1', 'a')
,('7', '2', 'b')
,('7', '3', 'd');
SELECT DS.[routeID]
,nodes.[value]
,ROW_NUMBER() OVER (PARTITION BY nodes.[value] ORDER BY [routeID]) AS [rowID]
FROM
(
-- getting unique route ids
SELECT DISTINCT [routeID]
FROM @DataSource DS
) DS ([routeID])
CROSS APPLY
(
-- for each route id creating CSV list with its node names
SELECT STUFF
(
(
SELECT ',' + [nodeName]
FROM @DataSource DSI
WHERE DSI.[routeID] = DS.[routeID]
ORDER BY [nodeID]
FOR XML PATH(''), TYPE
).value('.', 'VARCHAR(MAX)')
,1
,1
,''
)
) nodes ([value]);
The code returns one row per route: the routeID, the comma-separated list of its node names, and a rowID that numbers the routes sharing the same list.
So you simply need to filter by rowID = 1. Of course, you can change the code to satisfy your business criteria (for example, showing not the first route ID with the same nodes, but the last).
Also, the ROW_NUMBER function cannot be used directly in the WHERE clause, so you need to wrap the query before filtering:
;WITH DataSource AS
(
SELECT DS.[routeID]
,nodes.[value]
,ROW_NUMBER() OVER (PARTITION BY nodes.[value] ORDER BY [routeID]) AS [rowID]
FROM
(
-- getting unique route ids
SELECT DISTINCT [routeID]
FROM @DataSource DS
) DS ([routeID])
CROSS APPLY
(
-- for each route id creating CSV list with its node names
SELECT STUFF
(
(
SELECT ',' + [nodeName]
FROM @DataSource DSI
WHERE DSI.[routeID] = DS.[routeID]
ORDER BY [nodeID]
FOR XML PATH(''), TYPE
).value('.', 'VARCHAR(MAX)')
,1
,1
,''
)
) nodes ([value])
)
SELECT DS2.*
FROM DataSource DS1
INNER JOIN @DataSource DS2
ON DS1.[routeID] = DS2.[routeID]
WHERE DS1.[rowID] = 1;
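The underlying idea, building one signature per route from its ordered node names and keeping the first route for each signature, can be sketched outside SQL like this. It is an illustration only; the function and constant names are hypothetical.

```python
from collections import defaultdict


def unique_routes(rows):
    """rows: iterable of (routeID, nodeID, nodeName).

    Returns {node-name sequence: lowest routeID having that sequence}.
    """
    nodes_by_route = defaultdict(list)
    for route_id, node_id, node_name in rows:
        nodes_by_route[route_id].append((node_id, node_name))
    first_route = {}
    for route_id in sorted(nodes_by_route):
        # order the nodes by nodeID, like ORDER BY [nodeID] inside the CSV subquery
        signature = tuple(name for _node_id, name in sorted(nodes_by_route[route_id]))
        first_route.setdefault(signature, route_id)  # keep the lowest routeID only
    return first_route


# Sample data from the question (hypothetical constant name).
ROUTES = [
    (1, 1, "a"), (1, 2, "b"), (2, 1, "a"), (2, 2, "b"), (3, 1, "a"), (3, 2, "b"),
    (4, 1, "a"), (4, 2, "c"), (5, 1, "a"), (5, 2, "c"),
    (6, 1, "a"), (6, 2, "b"), (6, 3, "d"), (7, 1, "a"), (7, 2, "b"), (7, 3, "d"),
]

if __name__ == "__main__":
    for signature, route_id in unique_routes(ROUTES).items():
        print(route_id, ",".join(signature))
```

Using a tuple as the signature avoids the ambiguity a joined string would have if a node name itself contained a comma.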

OK, let's use some recursion to create a complete node list for each routeID.
First of all, let's populate the source table and the journeys table:
-- your source
declare @r as table (routeID int, nodeID int, nodeName char(1))
-- your other table
declare @j as table (journeyID int, nodeID int, nodeName char(1), routeID int)
-- temp results table
declare @routes as table (routeID int primary key, nodeNames varchar(1000))
;with
s as (
select *
from (
values
(1, 1, 'a'),
(1, 2, 'b'),
(2, 1, 'a'),
(2, 2, 'b'),
(3, 1, 'a'),
(3, 2, 'b'),
(4, 1, 'a'),
(4, 2, 'c'),
(5, 1, 'a'),
(5, 2, 'c'),
(6, 1, 'a'),
(6, 2, 'b'),
(6, 3, 'd'),
(7, 1, 'a'),
(7, 2, 'b'),
(7, 3, 'd')
) s (routeID, nodeID, nodeName)
)
insert into @r
select *
from s
;with
s as (
select *
from (
values
(1, 1, 'a', 1),
(1, 2, 'b', 1),
(2, 1, 'a', 1),
(2, 2, 'b', 1),
(3, 1, 'a', 4),
(3, 2, 'c', 4)
) s (journeyID, nodeID, nodeName, routeID)
)
insert into @j
select *
from s
Now let's extract the routes:
;with
d as (
-- n2 = 1 marks the last node of each route
select *, row_number() over (partition by r.routeID order by r.nodeID desc) n2
from @r r
),
r as (
-- start at the first node and append one node name per recursion step
select d.*, cast(nodeName as varchar(1000)) Names, cast(0 as bigint) i2
from d
where nodeId=1
union all
select d.*, cast(r.names + ',' + d.nodeName as varchar(1000)), r.n2
from d
join r on r.routeID = d.routeID and r.nodeId=d.nodeId-1
)
insert into @routes
select routeID, Names
from r
where n2=1
Table @routes will look like this:
routeID nodeNames
1 'a,b'
2 'a,b'
3 'a,b'
4 'a,c'
5 'a,c'
6 'a,b,d'
7 'a,b,d'
And now the final output:
-- the unique routes
select MIN(r.routeID) routeID, nodeNames
from @routes r
group by nodeNames
-- the unique journeys
select MIN(journeyID) journeyID, r.nodeNames
from @j j
inner join @routes r on j.routeID = r.routeID
group by nodeNames
output:
routeID nodeNames
1 'a,b'
4 'a,c'
6 'a,b,d'
and
journeyID nodeNames
1 'a,b'
3 'a,c'

Multiple SQL MAX when items are not in order

I have some data as below:
DECLARE @MyTable AS TABLE
(productName varchar(13), test1 int,test2 int)
INSERT INTO @MyTable
(productName, test1,test2)
VALUES
('a', 1,1),
('a', 2,2),
('a', 3,3),
('b', 1,4),
('b', 2,5),
('b', 3,6),
('a', 1,7),
('a', 4,8),
('a', 5,9)
;
SELECT productname,MAX(test1) from @MyTable group BY productname
A MAX query on the test1 column gives
a,5
b,3
but I need the result to be
a,3
b,3
a,5
when the rows are ordered by test2, i.e. one maximum per consecutive run of the same product.

You can solve this by using a trick with row_numbers, so that you assign 2 different row numbers, one for the whole data and one that is partitioned by productname. If you compare the difference between these numbers, you can figure out when product name has changed, and use that to determine the max values for each group.
select productname, max(test1) from (
SELECT *,
row_number() over (order by test2 asc) -
row_number() over (partition by productname order by test2 asc) as GRP
from @MyTable
) X
group by productname, GRP
You can test this in SQL Fiddle
If the test2 column is always a row number without gaps, you can use it instead of the first row number column. If you need the output in a particular order, you will have to add an ORDER BY yourself, for example on the max of test1.
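To see why the difference of the two row numbers identifies each run, here is a small sketch in Python that computes the same GRP value by hand. The function name is hypothetical and the data is the sample from the question.

```python
from collections import defaultdict


def max_per_island(rows):
    """rows: iterable of (productName, test1, test2).

    Returns (productName, max test1) for each consecutive run in test2 order.
    """
    ordered = sorted(rows, key=lambda r: r[2])
    per_product_rn = defaultdict(int)  # row_number() partitioned by productName
    best = {}                          # (productName, grp) -> max test1
    for overall_rn, (name, test1, _test2) in enumerate(ordered, start=1):
        per_product_rn[name] += 1
        # the difference stays constant while the same product continues uninterrupted
        grp = overall_rn - per_product_rn[name]
        key = (name, grp)
        best[key] = max(best.get(key, test1), test1)
    return [(name, value) for (name, _grp), value in best.items()]


# Sample data from the question (hypothetical constant name).
MY_TABLE = [
    ("a", 1, 1), ("a", 2, 2), ("a", 3, 3),
    ("b", 1, 4), ("b", 2, 5), ("b", 3, 6),
    ("a", 1, 7), ("a", 4, 8), ("a", 5, 9),
]

if __name__ == "__main__":
    print(max_per_island(MY_TABLE))
```

For the sample, the first run of 'a' gets GRP 0, while 'b' and the second run of 'a' both get GRP 3, which is why the grouping has to be on productname and GRP together.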

Please check the following SQL Select statement
DECLARE @MyTable AS TABLE (productName varchar(13), test1 int,test2 int)
INSERT INTO @MyTable
(productName, test1,test2)
VALUES
('a', 1,1),
('a', 2,2),
('a', 3,3),
('b', 1,4),
('b', 2,5),
('b', 3,6),
('a', 1,7),
('a', 4,8),
('a', 5,9)
DECLARE @MyTableNew AS TABLE (id int identity(1,1), productName varchar(13), test1 int,test2 int)
-- ORDER BY makes the identity values follow the test2 order
insert into @MyTableNew (productName, test1, test2)
select productName, test1, test2 from @MyTable order by test2
--select * from @MyTableNew
;with cte as (
SELECT
id, productName, test1, test2,
-- 1 when the product differs from the previous row, otherwise 0
case when (lag(productName,1,'') over (order by id)) = productName then 0 else 1 end ischange
from @MyTableNew
), cte2 as (
-- running sum of the change flags numbers the groups
select t.*,(select sum(ischange) from cte where id <= t.id) grp from cte t
)
select distinct grp, productName, max(test1) over (partition by grp) as maxTest1 from cte2
This is implemented according to the following SQL Server Lag() function tutorial
The Lag() function is used to identify and order the groups in table data

Please try this query
DECLARE @MyTable AS TABLE
(productName varchar(13), test1 int,test2 int)
INSERT INTO @MyTable
(productName, test1,test2)
VALUES
('a', 1,1),
('a', 2,2),
('a', 3,3),
('b', 1,4),
('b', 2,5),
('b', 3,6),
('a', 1,7),
('a', 4,8),
('a', 5,9)
;
SELECT productname,MAX(test1)
from @MyTable
where test1 = test2
group BY productname
union all
SELECT productname,MAX(test1)
from @MyTable
where test1 != test2
group BY productname

How to group rows by their DATEDIFF?

I hope you can help me.
I need to display the records in the HH_Solution_Audit table if 2 or more staff members enter the room within 10 minutes. Here are the requirements:
Display only the events that have a timestamp (LAST_UPDATED) interval of less than or equal to 10 minutes. Therefore, I must compare the current row to the next row and the previous row to check whether their DATEDIFF is less than or equal to 10 minutes. I'm done with this part.
Show only the records if the number of distinct STAFF_GUID inside the room for less than or equal to 10 minutes is at least 2.
HH_Solution_Audit Table Details:
ID - PK
STAFF_GUID - staff id
LAST_UPDATED - datetime when a staff enters a room
Here's what I got so far. This satisfies requirement # 1 only.
CREATE TABLE HH_Solution_Audit (
ID INT PRIMARY KEY,
STAFF_GUID NVARCHAR(1),
LAST_UPDATED DATETIME
)
GO
INSERT INTO HH_Solution_Audit VALUES (1, 'b', '2013-04-25 9:01')
INSERT INTO HH_Solution_Audit VALUES (2, 'b', '2013-04-25 9:04')
INSERT INTO HH_Solution_Audit VALUES (3, 'b', '2013-04-25 9:13')
INSERT INTO HH_Solution_Audit VALUES (4, 'a', '2013-04-25 10:15')
INSERT INTO HH_Solution_Audit VALUES (5, 'a', '2013-04-25 10:30')
INSERT INTO HH_Solution_Audit VALUES (6, 'a', '2013-04-25 10:33')
INSERT INTO HH_Solution_Audit VALUES (7, 'a', '2013-04-25 10:41')
INSERT INTO HH_Solution_Audit VALUES (8, 'a', '2013-04-25 11:02')
INSERT INTO HH_Solution_Audit VALUES (9, 'a', '2013-04-25 11:30')
INSERT INTO HH_Solution_Audit VALUES (10, 'a', '2013-04-25 11:45')
INSERT INTO HH_Solution_Audit VALUES (11, 'a', '2013-04-25 11:46')
INSERT INTO HH_Solution_Audit VALUES (12, 'a', '2013-04-25 11:51')
INSERT INTO HH_Solution_Audit VALUES (13, 'a', '2013-04-25 12:24')
INSERT INTO HH_Solution_Audit VALUES (14, 'b', '2013-04-25 12:27')
INSERT INTO HH_Solution_Audit VALUES (15, 'b', '2013-04-25 13:35')
DECLARE @numOfPeople INT = 2,
--minimum number of people that must be inside
--the room for @lengthOfStay minutes
@lengthOfStay INT = 10,
--number of minutes of stay
@dateFrom DATETIME = '04/25/2013 00:00',
@dateTo DATETIME = '04/25/2013 23:59';
WITH cteSource AS
(
SELECT ID, STAFF_GUID, LAST_UPDATED,
ROW_NUMBER() OVER (ORDER BY LAST_UPDATED) AS row_num
FROM HH_SOLUTION_AUDIT
WHERE LAST_UPDATED >= @dateFrom AND LAST_UPDATED <= @dateTo
)
SELECT [current].ID, [current].STAFF_GUID, [current].LAST_UPDATED
FROM
cteSource AS [current]
LEFT OUTER JOIN
cteSource AS [previous] ON [current].row_num = [previous].row_num + 1
LEFT OUTER JOIN
cteSource AS [next] ON [current].row_num = [next].row_num - 1
WHERE
DATEDIFF(MINUTE, [previous].LAST_UPDATED, [current].LAST_UPDATED)
<= @lengthOfStay
OR
DATEDIFF(MINUTE, [current].LAST_UPDATED, [next].LAST_UPDATED)
<= @lengthOfStay
ORDER BY [current].ID, [current].LAST_UPDATED
Running the query returns IDs:
1, 2, 3, 5, 6, 7, 10, 11, 12, 13, 14
That satisfies requirement # 1 of having less than or equal to 10 minutes interval between the previous row, current row and next row.
Can you help me with the 2nd requirement? If it's applied, the returned IDs should only be:
13, 14

Here's an idea. You don't need ROW_NUMBER and the previous and next records. You just need two queries unioned: one looking for everyone who has someone checked in up to X minutes before them, and another looking up to X minutes after. Each uses a correlated sub-query and a COUNT to find the number of matching people. If that number reaches your @numOfPeople, that's it.
EDIT: new version: Instead of doing two queries, 10 minutes ahead and 10 minutes behind, we only check the 10 minutes behind, selecting the rows that match in cteLastOnes. After that, another part of the query searches for the other people who actually checked in within those 10 minutes. Finally, those rows are unioned with the 'last ones'.
WITH cteSource AS
(
SELECT ID, STAFF_GUID, LAST_UPDATED
FROM HH_SOLUTION_AUDIT
WHERE LAST_UPDATED >= @dateFrom AND LAST_UPDATED <= @dateTo
)
,cteLastOnes AS
(
SELECT * FROM cteSource c1
WHERE @numOfPeople -1 <= (SELECT COUNT(DISTINCT STAFF_GUID)
FROM cteSource c2
WHERE DATEADD(MI,@lengthOfStay,c2.LAST_UPDATED) > c1.LAST_UPDATED
AND C2.LAST_UPDATED <= C1.LAST_UPDATED
AND c1.STAFF_GUID <> c2.STAFF_GUID)
)
SELECT * FROM cteLastOnes
UNION
SELECT * FROM cteSource s
WHERE EXISTS (SELECT * FROM cteLastOnes l
WHERE DATEADD(MI,@lengthOfStay,s.LAST_UPDATED) > l.LAST_UPDATED
AND s.LAST_UPDATED <= l.LAST_UPDATED
AND s.STAFF_GUID <> l.STAFF_GUID)
SQLFiddle DEMO - new version
SQLFiddle DEMO - old version
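The two-step logic of the new version can be traced with a small Python model. This is a sketch only: the function and constant names are hypothetical, the date-range filter is left out, and the nested scans are quadratic, so it is meant for checking the logic on sample data rather than for large inputs.

```python
from datetime import datetime, timedelta


def crowded_records(records, num_people=2, length_of_stay=10):
    """records: list of (id, staff, timestamp). Returns the sorted ids to display."""
    window = timedelta(minutes=length_of_stay)

    def other_staff_before(rec):
        # distinct other staff seen in the window ending at this record (cteLastOnes test)
        _rid, staff, ts = rec
        return {s for _i, s, t in records if s != staff and ts - window < t <= ts}

    last_ones = [r for r in records if len(other_staff_before(r)) >= num_people - 1]
    result = {r[0] for r in last_ones}
    for rid, staff, ts in records:
        # other people who entered within the window before one of the 'last ones'
        if any(staff != ls and ts <= lt < ts + window for _lid, ls, lt in last_ones):
            result.add(rid)
    return sorted(result)


def parse(rows, day="2013-04-25"):
    """Turn (id, staff, 'HH:MM') into (id, staff, datetime) for the given day."""
    return [(i, s, datetime.strptime(f"{day} {hm}", "%Y-%m-%d %H:%M")) for i, s, hm in rows]


# Sample data from the question (hypothetical constant name).
AUDIT = [
    (1, "b", "09:01"), (2, "b", "09:04"), (3, "b", "09:13"), (4, "a", "10:15"),
    (5, "a", "10:30"), (6, "a", "10:33"), (7, "a", "10:41"), (8, "a", "11:02"),
    (9, "a", "11:30"), (10, "a", "11:45"), (11, "a", "11:46"), (12, "a", "11:51"),
    (13, "a", "12:24"), (14, "b", "12:27"), (15, "b", "13:35"),
]

if __name__ == "__main__":
    print(crowded_records(parse(AUDIT)))
```

On the sample data only record 14 qualifies as a 'last one' (staff a entered three minutes earlier), and record 13 is then added by the second step.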

SQL recursive logic

I have a situation where I need to configure existing client data to address a problem where our application was not correctly updating IDs in a table when it should have been.
Here's the scenario. We have a parent table, where rows can be inserted that effectively replace existing rows; the replacement can be recursive. We also have a child table, which has a field that points to the parent table. In existing data, the child table could be pointing at rows that have been replaced, and I need to correct that. I can't simply update each row to the replacing row, however, because that row could have been replaced as well, and I need the latest row to be reflected.
I was trying to find a way to write a CTE that would accomplish this for me, but I'm struggling to find a query that finds what I'm actually looking for. Here's a sample of the tables that I'm working with; the 'ShouldBe' column is what I'd like my update query to end up with, taking into account the recursive replacement of some of the rows.
DECLARE @parent TABLE (SampleID int,
SampleIDReplace int,
GroupID char(1))
INSERT INTO @parent (SampleID, SampleIDReplace, GroupID)
VALUES (1, -1, 'A'), (2, 1, 'A'), (3, -1, 'A'),
(4, -1, 'A'), (5, 4, 'A'), (6, 5, 'A'),
(7, -1, 'B'), (8, 7, 'B'), (9, 8, 'B')
DECLARE @child TABLE (ChildID int, ParentID int)
INSERT INTO @child (ChildID, ParentID)
VALUES (1, 4), (2, 7), (3, 1), (4, 3)
Desired results in child table, after the update script has been applied:
ChildID ParentID ParentID_ShouldBe
1 4 6 (4 replaced by 5, 5 replaced by 6)
2 7 9 (7 replaced by 8, 8 replaced by 9)
3 1 2 (1 replaced by 2)
4 3 3 (unchanged, never replaced)

The following returns what you are looking for:
;with cte as (
select sampleid, sampleidreplace, 1 as num
from @parent
where sampleidreplace <> -1
union all
select p.sampleid, cte.sampleidreplace, cte.num+1
from @parent p join
cte
on p.sampleidreplace = cte.sampleId
)
select c.*, coalesce(p.sampleid, c.parentid)
from @child c left outer join
(select ROW_NUMBER() over (partition by sampleidreplace order by num desc) as seqnum, *
from cte
) p
on c.ParentID = p.SampleIDReplace and p.seqnum = 1
The recursive part keeps track of every correspondence (4-->5, 4-->6). The additional number is a "generation" count. We actually want the last generation. This is identified by using the row_number() function, ordering by num in decreasing order, hence the p.seqnum = 1.
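The same "follow the replacements until the latest row" idea can be sketched outside SQL as a plain chain walk. This is an illustration with hypothetical names; it assumes each row is replaced by at most one other row, as in the sample data.

```python
def latest_parent_ids(parent_rows, child_rows):
    """parent_rows: (SampleID, SampleIDReplace, GroupID); child_rows: (ChildID, ParentID).

    Returns {ChildID: id of the latest row in the replacement chain}.
    """
    # replaced id -> id of the row that replaced it (-1 means "replaces nothing")
    replaced_by = {old: new for new, old, _group in parent_rows if old != -1}
    result = {}
    for child_id, parent_id in child_rows:
        current, visited = parent_id, set()
        # walk the chain; the visited set guards against accidental cycles in the data
        while current in replaced_by and current not in visited:
            visited.add(current)
            current = replaced_by[current]
        result[child_id] = current
    return result


# Sample data from the question (hypothetical constant names).
PARENT = [
    (1, -1, "A"), (2, 1, "A"), (3, -1, "A"),
    (4, -1, "A"), (5, 4, "A"), (6, 5, "A"),
    (7, -1, "B"), (8, 7, "B"), (9, 8, "B"),
]
CHILD = [(1, 4), (2, 7), (3, 1), (4, 3)]

if __name__ == "__main__":
    print(latest_parent_ids(PARENT, CHILD))
```

Each child is resolved with a few dictionary lookups, one per replacement in its chain.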

Ok, so it took me a while and there are probably better ways to do it, but here is one option.
DECLARE @parent TABLE (SampleID int,
SampleIDReplace int,
GroupID char(1))
INSERT INTO @parent (SampleID, SampleIDReplace, GroupID)
VALUES (1, -1, 'A'), (2, 1, 'A'), (3, -1, 'A'),
(4, -1, 'A'), (5, 4, 'A'), (6, 5, 'A'),
(7, -1, 'B'), (8, 7, 'B'), (9, 8, 'B')
DECLARE @child TABLE (ChildID int, ParentID int)
INSERT INTO @child (ChildID, ParentID)
VALUES (1, 4), (2, 7), (3, 1), (4, 3)
;WITH RecursiveParent1 AS
(
SELECT SampleIDReplace, SampleID, 1 RecursionLevel
FROM @parent
WHERE SampleIDReplace != -1
UNION ALL
SELECT A.SampleIDReplace, B.SampleID, RecursionLevel + 1
FROM RecursiveParent1 A
INNER JOIN @parent B
ON A.SampleId = B.SampleIDReplace
),RecursiveParent2 AS
(
SELECT *,
ROW_NUMBER() OVER(PARTITION BY SampleIdReplace ORDER BY RecursionLevel DESC) RN
FROM RecursiveParent1
)
SELECT A.ChildID, ISNULL(B.ParentID,A.ParentID) ParentID
FROM @child A
LEFT JOIN ( SELECT SampleIDReplace, SampleID ParentID
FROM RecursiveParent2
WHERE RN = 1) B
ON A.ParentID = B.SampleIDReplace
OPTION(MAXRECURSION 500)

I've got an iterative SQL loop that I think sorts this out as follows:
WHILE EXISTS (SELECT * FROM #child C INNER JOIN #parent P ON C.ParentID = P.SampleIDReplace WHERE P.SampleIDReplace > -1)
BEGIN
UPDATE #child
SET ParentID = SampleID
FROM #parent
WHERE #child.ParentID = SampleIDReplace
END
Basically, the while condition compares the contents of the parent ID column in the child table and sees if there is a matching value in the SampleIDReplace column of the parent table. If there is, it goes and gets the SampleID of that record. It only stops when the join results in every SampleIDReplace being -1, meaning we have nothing else to do.
On your sample data, the above results in the expected output.
Note that I had to use temp tables rather than table variables here in order for the table to be accessible within the loop. If you have to use table variables then there would need to be a bit more surgery done.
Clearly if you have deep replacement hierarchies then you'll do quite a few updates, which may be a consideration when looking to perform the query against a production database.
