Link all IDs from associative entity that have common ID - sql

An ERD, just for ease of visualization: locations from the table "Location" can be linked to each other via the associative entity "Link Location".
Let's say that there are some current links that look like this:
location_id_1   location_id_2   active
1               5               True
5               3               True
2               6               True
4               6               True
6               7               True
I am trying to write a query that will return a single column with all IDs that might be connected to each other, even if they are removed from each other by one or more links. So, 1 is linked to 5 and 3 is linked to 5. Because of the common ID of 5, 1 is also linked to 3, once removed.
So, in my query I'd like to be able to decide on a "Prime Location", if you will, and have it return all location IDs, in one column, that are connected to my prime location, be it directly, or once or twice or n times removed.
I can do this easily for the first degree of links (see query below), but once I introduce second- or third-degree links, I am struggling to see any way other than manually extending my query to allow for each additional degree of linking.
declare @PrimeLocation int
set @PrimeLocation = 1

Select location_id_1
from [Link Location]
where location_id_1 = @PrimeLocation
   or location_id_2 = @PrimeLocation
union
Select location_id_2
from [Link Location]
where location_id_1 = @PrimeLocation
   or location_id_2 = @PrimeLocation
This query obviously only returns "1" and "5". But how do I get it to return "3" as well, and other IDs, should I add another link maybe to 3 in the future that might then be twice removed from 1? And can I do this without having to add to my query every time?
So, if my "Prime Location" = 1 (or 3 or 5) my result set should be:
location_id
1
3
5
And if my "prime location" is 2 (or 4 or 6 or 7) my result set should be:
location_id
2
4
6
7
Thanks in advance.

Assuming that the order of IDs within pairs has no significance, this will produce the desired results:
-- Sample data.
declare @LinkLocations as Table ( LocationId1 Int, LocationId2 Int );
insert into @LinkLocations ( LocationId1, LocationId2 ) values
    ( 1, 5 ), ( 5, 3 ), ( 2, 6 ), ( 4, 6 ), ( 6, 7 );
select * from @LinkLocations;

-- Search the links.
declare @PrimeLocationId as Int = 1;
with Locations as (
    select @PrimeLocationId as LocationId,
        Cast( '.' + Cast( @PrimeLocationId as VarChar(10) ) + '.' as VarChar(1024) ) as Visited
    union all
    select LL.LocationId1,
        Cast( '.' + Cast( LL.LocationId1 as VarChar(10) ) + L.Visited as VarChar(1024) )
    from @LinkLocations as LL inner join
        Locations as L on L.LocationId = LL.LocationId2
    where L.Visited not like '%.' + Cast( LL.LocationId1 as VarChar(10) ) + '.%'
    union all
    select LL.LocationId2,
        Cast( '.' + Cast( LL.LocationId2 as VarChar(10) ) + L.Visited as VarChar(1024) )
    from @LinkLocations as LL inner join
        Locations as L on L.LocationId = LL.LocationId1
    where L.Visited not like '%.' + Cast( LL.LocationId2 as VarChar(10) ) + '.%' )
select LocationId -- , Visited
from Locations
option ( MaxRecursion 0 );
You can uncomment Visited in the last select to see some of the internals. This will correctly handle even degenerate cases like ( 42, 42 ) that link an ID to itself.
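For example, with @PrimeLocationId = 1 and Visited uncommented, the working column builds up like this (traced by hand from the sample data, so treat it as illustrative):

LocationId   Visited
1            .1.
5            .5.1.
3            .3.5.1.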

Well, after a bunch of experimenting and reading up on recursive CTEs I finally figured something out that works for me.
A few notes:
I had to add the depth column so that I could cancel out of the loop once the depth reaches a certain level. This is still not ideal, as I'd like the recursion to stop on its own when it finishes, but I don't see how that is possible, since I don't have a top level in my hierarchy. In other words, my links aren't child-to-parent links like all the CTE examples I've been able to find, where eventually there is one location that no longer has a link. (See the sketch after the result set below for one way around this.)
Anyway, here is what I did:
declare @PrimeLocation int
set @PrimeLocation = 1
;
WITH LocationLinks AS
(
    SELECT location_id_1, location_id_2, 0 as depth
    FROM [Link Location]
    WHERE location_id_1 = @PrimeLocation
       OR location_id_2 = @PrimeLocation
    UNION ALL
    SELECT T1.location_id_1, T1.location_id_2, T2.depth + 1
    FROM [Link Location] T1
    JOIN LocationLinks T2 on T1.location_id_1 = T2.location_id_2
        or T1.location_id_2 = T2.location_id_1
    WHERE T2.depth <= 4
)
SELECT *
INTO #Links
FROM LocationLinks

------Lumping everything into one column-------
SELECT distinct location_id_1
FROM
(
    SELECT distinct location_id_1
    FROM #Links
    UNION ALL
    SELECT distinct location_id_2
    FROM #Links
) a
ORDER BY location_id_1
This results in my expected output of:
location_id_1
1
3
5
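A way to let the recursion stop on its own rather than at a fixed depth (not part of the original answer, and untested) is to borrow the visited-path guard from the first answer: carry the set of IDs seen so far and only recurse into pairs that still add a new ID.

declare @PrimeLocation int = 1;

WITH LocationLinks AS
(
    SELECT location_id_1, location_id_2,
        CAST('.' + CAST(location_id_1 as varchar(10)) + '.'
                 + CAST(location_id_2 as varchar(10)) + '.' as varchar(1024)) as visited
    FROM [Link Location]
    WHERE location_id_1 = @PrimeLocation
       OR location_id_2 = @PrimeLocation
    UNION ALL
    SELECT T1.location_id_1, T1.location_id_2,
        CAST(T2.visited + CAST(T1.location_id_1 as varchar(10)) + '.'
                        + CAST(T1.location_id_2 as varchar(10)) + '.' as varchar(1024))
    FROM [Link Location] T1
    JOIN LocationLinks T2 on T1.location_id_1 = T2.location_id_2
        or T1.location_id_2 = T2.location_id_1
    -- recurse only while the pair contributes an ID we have not seen yet;
    -- every step then grows the visited set, so the recursion terminates
    WHERE NOT (    T2.visited like '%.' + CAST(T1.location_id_1 as varchar(10)) + '.%'
               AND T2.visited like '%.' + CAST(T1.location_id_2 as varchar(10)) + '.%')
)
SELECT distinct location_id_1
FROM
(
    SELECT location_id_1 FROM LocationLinks
    UNION ALL
    SELECT location_id_2 FROM LocationLinks
) a
ORDER BY location_id_1
OPTION (MAXRECURSION 0);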

Related

Remove duplicated subsets from very large table

The data I'm working with is fairly complicated, so I'm just going to provide a simpler example so I can hopefully expand that out to what I'm working on.
Note: I've already found a way to do it, but it's extremely slow and not scalable. It works great on small datasets, but if I applied it to the actual tables it needs to run on, it would take forever.
I need to remove entire duplicate subsets of data within a table. Removing duplicate rows is easy, but I'm stuck finding an efficient way to remove duplicate subsets.
Example:
GroupID   Subset   Value
-------   ------   -----
1         a        1
1         a        2
1         a        3
1         b        1
1         b        3
1         b        5
1         c        1
1         c        3
1         c        5
2         a        1
2         a        2
2         a        3
2         b        4
2         b        5
2         b        6
2         c        1
2         c        3
2         c        6
So in this example, from GroupID 1, I would need to remove either subset 'b' or subset 'c'; it doesn't matter which, since both contain Values 1, 3, 5. For GroupID 2, none of the sets are duplicated, so none are removed.
Here's the code I used to solve this on a small scale. It works great, but when applied to 10+ million records you can imagine how slow it would be (I was later informed of the number of records; the sample data I was given was much smaller):
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)

INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
       (2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6)

SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value]

SELECT x.GroupID, x.NameValues, MIN(x.SubSet)
FROM (
    SELECT t1.GroupID, t1.SubSet
         , NameValues = (SELECT ',' + CONVERT(VARCHAR(10), t2.[Value]) FROM @values t2 WHERE t1.GroupID = t2.GroupID AND t1.SubSet = t2.SubSet ORDER BY t2.[Value] FOR XML PATH(''))
    FROM @values t1
    GROUP BY t1.GroupID, t1.SubSet
) x
GROUP BY x.GroupID, x.NameValues
All I'm doing here is grouping by GroupID and Subset and concatenating all of the values into a comma delimited string...and then taking that and grouping on GroupID and Value list, and taking the MIN subset.
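As an aside (not part of the original post): on SQL Server 2017 and later, STRING_AGG expresses the same grouping more directly than the FOR XML PATH trick. A sketch against the @values table above:

SELECT GroupID, NameValues, MIN(SubSet) AS KeepSubset
FROM (
    SELECT GroupID, SubSet,
           STRING_AGG(CONVERT(VARCHAR(10), [Value]), ',') WITHIN GROUP (ORDER BY [Value]) AS NameValues
    FROM @values
    GROUP BY GroupID, SubSet
) x
GROUP BY GroupID, NameValues;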
I'd go with something like this:
;with cte as
(
    select v.GroupID, v.SubSet, checksum_agg(v.Value) h, avg(v.Value) a
    from @values v
    group by v.GroupID, v.SubSet
)
delete v
from @values v
join
(
    select c1.GroupID, case when c1.SubSet > c2.SubSet then c1.SubSet else c2.SubSet end SubSet
    from cte c1
    join cte c2 on c1.GroupID = c2.GroupID and c1.SubSet <> c2.SubSet and c1.h = c2.h and c1.a = c2.a
) x on v.GroupID = x.GroupID and v.SubSet = x.SubSet

select *
from @values
From Checksum_Agg:
The CHECKSUM_AGG result does not depend on the order of the rows in the table.
That is because the values are combined symmetrically, like a sum: 1 + 2 + 3 = 3 + 2 + 1 = 6. The flip side is that different sets of values can collide: 3 + 3 = 6 as well.
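A quick demonstration of such a collision (my example, not the answerer's; a sum-based and an XOR-based implementation both agree on this pair):

select (select checksum_agg(v) from (values (1),(2),(3)) as t(v)) as agg_1_2_3,
       (select checksum_agg(v) from (values (3),(3)) as t(v)) as agg_3_3;
-- both columns come back equal, even though {1,2,3} and {3,3} differ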
HashBytes is designed to produce a different value for two inputs that differ only in the order of the bytes, as well as other differences. (There is a small possibility that two inputs, perhaps of wildly different lengths, could hash to the same value. You can't take an arbitrary input and squeeze it down to an absolutely unique 16-byte value.)
The following code demonstrates how to use HashBytes to return a hash for each GroupId/Subset.
-- Thanks for the sample data!
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
       (2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6);

SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value];

with
    DistinctGroups as (
        select distinct GroupId, Subset
        from @values ),
    GroupConcatenatedValues as (
        select GroupId, Subset, Convert( VarBinary(256), (
            select Convert( VarChar(8000), Cast( Value as Binary(4) ), 2 ) AS [text()]
            from @values as V
            where V.GroupId = DG.GroupId and V.SubSet = DG.SubSet
            order by Value
            for XML Path('') ), 2 ) as GroupedBinary
        from DistinctGroups as DG )
-- To see the intermediate results from the CTE you can use one of the
-- following two queries instead of the last select:
--   select * from DistinctGroups;
--   select * from GroupConcatenatedValues;
select GroupId, Subset, GroupedBinary, HashBytes( 'MD4', GroupedBinary ) as Hash
from GroupConcatenatedValues
order by GroupId, Subset;
You can use checksum_agg() over a set of rows. If the checksums are the same, this is strong evidence that the 'values' columns are equal within the grouped fields.
In the 'getChecksums' cte below, I group by the group and subset, with a checksum based on your 'value' column.
In the 'maybeBadSubsets' cte, I put a row_number over each aggregation just to identify the 2nd+ row in the event the checksums match.
Finally, I delete any subgroups so identified.
with
getChecksums as (
select groupId,
subset,
cs = checksum_agg(value)
from @values v
group by groupId,
subset
),
maybeBadSubsets as (
select groupId,
subset,
cs,
deleteSubset =
case
when row_number() over (
partition by groupId, cs
order by subset
) > 1
then 1
end
from getChecksums
)
delete v
from @values v
where exists (
select 0
from maybeBadSubsets mbs
where v.groupId = mbs.groupId
and v.SubSet = mbs.subset
and mbs.deleteSubset = 1
);
I don't know what the exact likelihood is for checksums to match. If you're not comfortable with the false positive rate, you can still use it to eliminate some branches in a more algorithmic approach in order to vastly improve performance.
Note: CTEs can have a quirk performance-wise. If you find that the query engine is running 'maybeBadSubsets' for each row of @values, you may need to put its results into a temp table or table variable before using it. But I believe with 'exists' you're okay as far as that goes.
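If that materialization does turn out to be necessary, a minimal sketch (temp table name assumed):

select groupId, subset, cs = checksum_agg(value)
into #subsetChecksums
from @values
group by groupId, subset;
-- then build maybeBadSubsets from #subsetChecksums instead of the getChecksums CTE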
EDIT:
I didn't catch it, but as the OP noticed, checksum_agg seems to perform very poorly in terms of false hits/misses. I suspect it might be due to the simplicity of the input. I changed
cs = checksum_agg(value)
above to
cs = checksum_agg(convert(int,hashbytes('md5', convert(char(1),value))))
and got better results. But I don't know how it would perform on larger datasets.
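One caveat (my note, not the answerer's): convert(char(1), value) only fits single-digit values, and an int that does not fit its char target converts to '*', so every value of 10 or more would hash identically. A wider target, assuming the values fit in ten characters, avoids that:

cs = checksum_agg(convert(int, hashbytes('md5', convert(varchar(10), value))))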

Select all hierarchy level and below SQL Server

I am having a difficult time with this one. I have seen a few examples on how to obtain all child records from a self referencing table given a parent and even how to get the parents of child records.
What I am trying to do is return a record and all child records given the ID.
To put this into context - I have a corporate hierarchy. Where:
Role        Level
---------   -----
Corporate   0
Region      1
District    2
Rep         3
What I need is a procedure that (1) figures out what level the record is and (2) retrieves that record and all children records.
The idea being a Region can see all districts and reps in a district, Districts can see their reps. Reps can only see themselves.
I have table:
ID   ParentId   Name
-------------------------------------
1    Null       Corporate HQ
2    1          South Region
3    1          North Region
4    1          East Region
5    1          West Region
6    3          Chicago District
7    3          Milwaukee District
8    3          Minneapolis District
9    6          Gold Coast Dealer
10   6          Blue Island Dealer
How do I do this:
CREATE PROCEDURE GetPositions
    @id int
AS
BEGIN
    --What is the most efficient way to do this--
END
GO
For example, for @id = 3 I would want to return:
3, 6, 7, 8, 9, 10
I'd appreciate any help or ideas on this.
You could do this via a recursive CTE:
DECLARE @id INT = 3;

WITH rCTE AS(
    SELECT *, 0 AS Level FROM tbl WHERE Id = @id
    UNION ALL
    SELECT t.*, r.Level + 1 AS Level
    FROM tbl t
    INNER JOIN rCTE r
        ON t.ParentId = r.ID
)
SELECT * FROM rCTE OPTION(MAXRECURSION 0);
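Wrapped up as the procedure the question asks for (a sketch, assuming the table is named tbl as in the query above):

CREATE PROCEDURE GetPositions
    @id INT
AS
BEGIN
    WITH rCTE AS(
        SELECT ID, ParentId, Name FROM tbl WHERE Id = @id
        UNION ALL
        SELECT t.ID, t.ParentId, t.Name
        FROM tbl t
        INNER JOIN rCTE r
            ON t.ParentId = r.ID
    )
    SELECT ID FROM rCTE OPTION(MAXRECURSION 0);
END
GO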
Assuming that you're on a reasonably modern version of SQL Server, you can use the hierarchyid datatype with a little bit of elbow grease. First, the setup:
alter table [dbo].[yourTable] add [path] hierarchyid null;
Next, we'll populate the new column:
with cte as (
select *, cast(concat('/', ID, '/') as varchar(max)) as [path]
from [dbo].[yourTable]
where [ParentID] is null
union all
select child.*,
cast(concat(parent.path, child.ID, '/') as varchar(max)) as [path]
from [dbo].[yourTable] as child
join cte as parent
on child.ParentID = parent.ID
)
update t
set path = c.path
from [dbo].[yourTable] as t
join cte as c
on t.ID = c.ID;
This is just a bog standard recursive table expression with one calculated column that represents the hierarchy. That's the hard part. Now, your procedure can look something like this:
create procedure dbo.GetPositions ( @id int ) as
begin
    declare @h hierarchyid
    set @h = (select Path from [dbo].[yourTable] where ID = @id);

    select ID, ParentID, Name
    from [dbo].[yourTable]
    where Path.IsDescendantOf(@h) = 1;
end
So, to wrap up, all you're doing with the hierarchyid is storing the lineage for a given row so that you don't have to calculate it on the fly at select time.
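A hypothetical call against the sample data (IsDescendantOf treats a node as a descendant of itself, so the region's own row comes back too):

exec dbo.GetPositions @id = 3;
-- returns IDs 3, 6, 7, 8, 9, 10: the North Region, its districts, and their dealers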

Simple SQL: How to calculate unique, contiguous numbers for duplicates in a set?

Let's say I create a table with an int Page field, an int Section field, and an int ID identity field, where Page ranges from 1 to 8 and Section ranges from 1 to 30 for each page. Now let's say that two records have duplicate Page and Section values. How could I renumber those two records so that the sequence of page and section numbering is contiguous?
select page, section
from #fun
group by page, section having count(*) > 1
shows the duplicates:

page   section
1      3
2      3

Page 1 section 4 and page 2 section 4 are missing. Is there a way, without using a cursor, to find and renumber the positions in SQL Server 2000, which doesn't support ROW_NUMBER()?
This rownum below of course produces exactly the same number as in section:
select page, section,
(select count(*) + 1
from #fun b
where b.page = a.page and b.section < a.section) as rownum
from #fun a
I could create a pivot table having values 1 through 100, but what would I join against?
What I want to do is something like this:
update p set section = (expression that gets 4)
from #fun p
where (expression that identifies duplicate sections by page)
I don't have a 2000 server to test this on, but I think it should work.
Create test tables/data:
CREATE TABLE #fun
(Id INT IDENTITY(100,1)
,page INT NOT NULL
,section INT NOT NULL
)
INSERT #fun (page, section)
SELECT 1,1
UNION ALL SELECT 1,3 UNION ALL SELECT 1,2
UNION ALL SELECT 1,3 UNION ALL SELECT 1,5
UNION ALL SELECT 2,1 UNION ALL SELECT 2,2
UNION ALL SELECT 2,3 UNION ALL SELECT 2,5
UNION ALL SELECT 2,3
Now the processing:
-- create a worktable
CREATE TABLE #fun2
(Id INT IDENTITY(1,1)
,funId INT
,page INT NOT NULL
,section INT NOT NULL
)
-- insert data into the second temp table ordered by the relevant columns
-- the identity column will form the basis of the revised section number
INSERT #fun2 (funId, page, section)
SELECT Id,page,section
FROM #fun
ORDER BY page,section,Id
-- write the calculated section value back where it is different
UPDATE p
SET section = y.calc_section
FROM #fun AS p
JOIN
(
SELECT f2.funId, f2.id - x.adjust calc_section
FROM #fun2 AS f2
JOIN (
-- this subquery is used to calculate an offset like
-- PARTITION BY in a 2005+ ROWNUMBER function
SELECT MIN(Id) - 1 adjust, page
FROM #fun2
GROUP BY page
) AS x
ON f2.page = x.page
) AS y
ON p.Id = y.funId
WHERE p.section <> y.calc_section
SELECT * FROM #fun order by page, section
Disclaimer: I don't have SQL Server to test.
If I understand you correctly, if you knew the ROW_NUMBER of your #fun records partitioned over (page, section) duplicates, you could use this relative ranking to increment the "section":
UPDATE p
SET section = section + (rownumber - 1)
FROM #fun AS p
INNER JOIN ( -- SELECT id, ROW_NUMBER() OVER (PARTITION BY page, section) ...
SELECT id, COUNT(1) AS rownumber
FROM #fun a
LEFT JOIN #fun b
ON a.page = b.page AND a.section = b.section AND a.id <= b.id
GROUP BY a.id, a.page, a.section) d
ON p.id = d.id
WHERE rownumber > 1
That won't handle the case where the number of duplicates pushes you past your upper limit of 30. It may also create new duplicates if higher-numbered sections already exist for the page; that is, one instance of (pg 1, sec 3) becomes (pg 1, sec 4), which already existed. But you can run the UPDATE repeatedly until no duplicates exist.
And then add a unique index on (page, section).
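For example (index name assumed):

CREATE UNIQUE INDEX UQ_fun_page_section ON #fun (page, section)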

How do you find a missing number in a table field starting from a parameter and incrementing sequentially?

Let's say I have a SQL Server table:

NumberTaken   CompanyName
2             Fred
3             Fred
4             Fred
6             Fred
7             Fred
8             Fred
11            Fred
I need an efficient way to pass in a parameter [StartingNumber] and to count from [StartingNumber] sequentially until I find a number that is missing.
For example notice that 1, 5, 9 and 10 are missing from the table.
If I supplied the parameter [StartingNumber] = 1, it would check to see if 1 exists, if it does it would check to see if 2 exists and so on and so forth so 1 would be returned here.
If [StartingNumber] = 6 the function would return 9.
In c# pseudo code it would basically be:
int ctr = [StartingNumber]
while([SELECT NumberTaken FROM tblNumbers Where NumberTaken = ctr] != null)
ctr++;
return ctr;
The problem with that code is that it seems really inefficient if there are thousands of numbers in the table. Also, I can write it in C# code or in a stored procedure, whichever is more efficient.
Thanks for the help
Fine, if this question isn't going to be closed, I may as well copy and paste my answer from the other one:
I called my table Blank, and used the following:
declare @StartOffset int = 2
; With Missing as (
    select @StartOffset as N where not exists (select * from Blank where ID = @StartOffset)
), Sequence as (
    select @StartOffset as N from Blank where ID = @StartOffset
    union all
    select b.ID from Blank b inner join Sequence s on b.ID = s.N + 1
)
select COALESCE((select N from Missing), (select MAX(N) + 1 from Sequence))
You basically have two cases: either your starting value is missing (so the Missing CTE will contain one row), or it's present, in which case you count forwards using a recursive CTE (Sequence) and take the max from that plus 1.
Tables:
create table Blank (
ID int not null,
Name varchar(20) not null
)
insert into Blank(ID,Name)
select 2 ,'Fred' union all
select 3 ,'Fred' union all
select 4 ,'Fred' union all
select 6 ,'Fred' union all
select 7 ,'Fred' union all
select 8 ,'Fred' union all
select 11 ,'Fred'
go
I would create a temp table containing all numbers from StartingNumber to EndNumber and LEFT JOIN it to the data, to get the list of numbers that are not present in your table.
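A minimal sketch of that idea (tblNumbers comes from the question; the bounds and the row source used to generate the numbers are assumptions):

declare @StartingNumber int = 1, @EndNumber int = 20;

create table #Numbers (N int not null primary key);

insert into #Numbers (N)
select top (@EndNumber - @StartingNumber + 1)
       @StartingNumber - 1 + row_number() over (order by (select null))
from sys.all_objects;  -- any row source with enough rows

select min(n.N) as FirstMissing
from #Numbers n
left join tblNumbers t on t.NumberTaken = n.N
where t.NumberTaken is null;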
If NumberTaken is indexed you could do it with a join on the same table:
select T.NumberTaken - 1 as MISSING_NUMBER
from myTable T
left outer join myTable T1
    on T.NumberTaken = T1.NumberTaken + 1
where T1.NumberTaken is null and T.NumberTaken >= STARTING_NUMBER
order by T.NumberTaken
EDIT
Edited to get 1 too
select 1+ID as ID from #b as b
where not exists (select 1 from #b where ID = 1+b.ID)
go

ID
-----------
5
9
12
Take max(1+ID) and/or add your starting value to the where clause, depending on what you actually want.
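For the original question (first gap at or after the parameter), that would look something like this sketch:

declare @StartingNumber int = 6;

select min(1 + ID) as FirstMissing
from #b as b
where not exists (select 1 from #b where ID = 1 + b.ID)
  and 1 + ID >= @StartingNumber;
-- note: like the query above, this does not catch the case where
-- @StartingNumber itself is absent; the COALESCE approach earlier handles that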

How to do this data transformation

This is my input data
GroupId   Serial   Action
1         1        Start
1         2        Run
1         3        Jump
1         8        End
2         9        Shop
2         10       Start
2         11       Run
For each activity sequence in a group, I want to find pairs of Actions where Action1.SerialNo = Action2.SerialNo + k, and how many times that happens.
Suppose k = 1, then output will be
FirstAction   NextAction   Frequency
Start         Run          2
Run           Jump         1
Shop          Start        1
How can I do this in SQL, fast enough given that the input table contains millions of entries?
tful, This should produce the result you want, but I don't know if it will be as fast as you'd like. It's worth a try.
create table Actions(
GroupId int,
Serial int,
"Action" varchar(20) not null,
primary key (GroupId, Serial)
);
insert into Actions values
(1,1,'Start'), (1,2,'Run'), (1,3,'Jump'),
(1,8,'End'), (2,9,'Shop'), (2,10,'Start'),
(2,11,'Run');
go
declare @k int = 1;

with ActionsDoubled(GroupId,Serial,Tag,"Action") as (
    select GroupId, Serial, 'a', "Action"
    from Actions as A
    union all
    select GroupId, Serial-@k, 'b', "Action"
    from Actions as B
), Pivoted(GroupId,Serial,a,b) as (
    -- carrying GroupId through the pivot keeps pairs from crossing groups
    select GroupId,Serial,a,b
    from ActionsDoubled
    pivot (
        max("Action") for Tag in ([a],[b])
    ) as P
)
select a, b, count(*) as ct
from Pivoted
where a is not NULL and b is not NULL
group by a,b
order by a,b;
go
drop table Actions;
If you will be doing the same computation for various #k values on stable data, this may work better in the long run:
declare @k int = 1;

select GroupId, Serial, 'a' as Tag, "Action"
into ActionsDoubled
from Actions as A
union all
select GroupId, Serial-@k, 'b', "Action"
from Actions as B;
go
create unique clustered index AD_S on ActionsDoubled(GroupId,Serial,Tag);
create index AD_a on ActionsDoubled(Tag,GroupId,Serial);
go
with Pivoted(GroupId,Serial,a,b) as (
    select GroupId,Serial,a,b
    from ActionsDoubled
    pivot (
        max("Action") for Tag in ([a],[b])
    ) as P
)
select a, b, count(*) as ct
from Pivoted
where a is not NULL and b is not NULL
group by a,b
order by a,b;
go
drop table ActionsDoubled;
SELECT a1.Action AS FirstAction, a2.Action AS NextAction, COUNT(*) AS Frequency
FROM Activities a1 JOIN Activities a2
    ON (a1.GroupId = a2.GroupId AND a1.Serial = a2.Serial + @k)
GROUP BY a1.Action, a2.Action;
The problem is this: Your query has to go through EVERY row regardless.
You can make it more manageable for your database by tackling each group separately as separate queries. Especially if the size of each group is SMALL.
There's a lot going on under the hood, and when a query has to scan the entire table it can end up many times slower than running small chunks that together cover all million rows.
So for instance:
--Stickler for clean formatting...
SELECT
a1.Action AS FirstAction,
a2.Action AS NextAction,
COUNT(*) AS Frequency
FROM
Activities a1 JOIN Activities a2
ON (a1.groupid = a2.groupid
AND a1.Serial = a2.Serial + @k)
WHERE
a1.groupid = 1
GROUP BY
a1.Action,
a2.Action;
By the way, you have an index (GroupId, Serial) on the table, right?
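If not, something along these lines (index name assumed) would support the self-join:

create index IX_Activities_GroupId_Serial on Activities (GroupId, Serial) include ("Action");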