SQL Algorithm for grouping all "equivalent" strings

I have an input table with two columns, id1 and id2, each holding a string that represents an id_number. The two id_numbers that appear as a pair in any given row are defined as being equivalent to each other. If one of those id_numbers also appears in another row, then all the strings in both rows are equivalent to each other, and so on. The goal is to return a table with two columns, one containing all unique id_numbers and another identifying their grouping by equivalency.
Sample Data:
create table input_table (
id1 varchar(100),
id2 varchar(100)
)
insert into input_table(id1,id2)
values
('a','b'),
('b','c'),
('d','a'),
('a','b'),
('f','g'),
('f','k'),
('l','m')
Expected Output:
| Id | Grouping |
| a | 1 |
| b | 1 |
| c | 1 |
| d | 1 |
| f | 2 |
| g | 2 |
| k | 2 |
| l | 3 |
| m | 3 |
To further explain the results:
Row 1 tells us a=b so they are assigned to group 1
Row 2 tells us b=c, since b is already in group 1, c is also assigned to group 1
Row 3 tells us d=a, since a is already in group 1, d is also assigned to group 1
Row 4 tells us a=b, which we already know so we don't need to do anything
Row 5 tells us f=g, since neither are in an existing group, we assign them to group 2
etc.

Your table describes an undirected graph: the set of distinct Id1 and Id2 values over all tuples ("rows") are the nodes, and the tuples themselves are the edges (each one connects, or links, two Id values in a relationship). That means we can apply graph-theory techniques to your problem, which can be stated as finding the disconnected components within your graph (each disconnected component being one set of connected nodes) and assigning each of them an identifier.
...with that formalism out of the way, I'll say I don't think it's necessarily correct to use the word "equivalent" or "equivalence": just because a binary relation (each of your tuples) is transitive (i.e. undirected node edges are a transitive relation) it says nothing about what the nodes themselves represent, so I'm guessing you meant to say the Id1/Id2 values represent things-that-are-equivalent in your problem-domain, which is fair, but in the context of my answer I'll refrain from using that term (not least because "set equivalence" is something else entirely).
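As a cross-check on the expected output, the same grouping can be sketched outside SQL with a union-find (disjoint-set) structure. This is only a minimal Python illustration, not part of the T-SQL below; the function name `group_equivalent` is made up for this example and the pairs are the question's sample rows.

```python
def group_equivalent(pairs):
    """Label every id with a 1-based group number (one number per connected component)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as the representative of the merged set.
            parent[max(ra, rb)] = min(ra, rb)

    roots = sorted({find(x) for x in list(parent)})
    number = {root: i for i, root in enumerate(roots, start=1)}
    return {x: number[find(x)] for x in sorted(parent)}


pairs = [("a", "b"), ("b", "c"), ("d", "a"), ("a", "b"),
         ("f", "g"), ("f", "k"), ("l", "m")]
groups = group_equivalent(pairs)
print(groups)
```

Numbering the groups by their smallest member mirrors the "minimum node as root" idea used in the steps that follow.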
ANYWAY...
Step 1: Normalize your bad data:
Assuming that those Id1/Id2 values are comparable values (as in: every value can be sorted consistently and deterministically by a less-than comparison), such as strings or integers, then do that first so we can generate a normalized representation of your data.
This normalized representation of your data would be the same as your current data, except that:
Every row's Id1 value must be < (i.e. less-than) its Id2 value.
There are no duplicate rows.
So if there's a row where Id2 < Id1 you should swap the values, so ( 'd', 'a' ) becomes ( 'a', 'd' ), and the duplicate ('a', 'b') row would be removed.
This also means that every (disconnected) component in the graph we're representing now has a minimum node which we can treat as the (arbitrary) "root" of that component - which means we will have a target to look for from any node (but we don't know which values are minimums in each component yet, hang on...)
To aid identification, let's call the smaller value in a pair the Smol value and the larger one the Big value.
(You can do this step as another CTE step, but for clarity I'm using a table variable, @normalized, like so):
DECLARE @horribleInputData TABLE (
Id1 char(1) NOT NULL,
Id2 char(1) NOT NULL
);
INSERT INTO @horribleInputData ( Id1, Id2 ) VALUES
( 'a', 'b' ),
( 'b', 'c' ),
( 'd', 'a' ),
( 'a', 'b' ),
( 'f', 'g' ),
( 'f', 'k' ),
( 'l', 'm' );
--
DECLARE @normalized TABLE (
Smol char(1) NOT NULL,
Big char(1) NOT NULL,
INDEX IDX1 ( Smol ),
INDEX IDX2 ( Big ),
CHECK ( Smol < Big )
);
INSERT INTO @normalized ( Smol, Big )
SELECT
DISTINCT -- Exclude duplicate rows.
CASE WHEN Id1 < Id2 THEN Id1 ELSE Id2 END AS Smol, -- Sort the values horizontally, not vertically.
CASE WHEN Id1 < Id2 THEN Id2 ELSE Id1 END AS Big
FROM
@horribleInputData
WHERE
Id1 <> Id2; -- Exclude zero-information rows.
After running the above, the @normalized table looks like this:
| Smol | Big |
| a | b |
| a | d |
| b | c |
| f | g |
| f | k |
| l | m |
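The normalization rule (order each pair, drop self-pairs, drop duplicates) is small enough to sanity-check in a few lines. The following Python sketch is only an illustration of that rule; `normalize` is a hypothetical helper name, not something from the answer.

```python
def normalize(pairs):
    """Return sorted, de-duplicated (smol, big) pairs with smol < big."""
    return sorted({(min(a, b), max(a, b)) for a, b in pairs if a != b})


pairs = [("a", "b"), ("b", "c"), ("d", "a"), ("a", "b"),
         ("f", "g"), ("f", "k"), ("l", "m")]
normalized = normalize(pairs)
for smol, big in normalized:
    print(smol, big)
```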
Step 2: Designate the minimum node as root in each disconnected component:
If we treat the node with the minimum value as the "root" of a directed graph (here, the minimum node in each disconnected component) then we can find connected nodes by finding a path from each node to that root. But what else defines a root? In this case, it is a Smol value that never appears as the Big of any other row, so we can simply take @normalized and do an anti-join to itself (i.e. for every Smol, see if that Smol is a Big to any other Smol; if none exist then that Smol is the Smollest, and is therefore a root):
SELECT
DISTINCT
l.Smol AS Smollest
FROM
@normalized AS l
LEFT OUTER JOIN @normalized AS r ON l.Smol = r.Big
WHERE
r.Big IS NULL;
Which gives us this output:
Smollest
a
f
l
...which tells us that there's 3 disconnected-components in this graph, and what those minimum nodes are.
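The anti-join is just a set difference: the Smol values minus the Big values. A short Python sketch of that idea, assuming the normalized pairs from Step 1 (`find_roots` is an illustrative name):

```python
def find_roots(normalized):
    """Smol values that never appear as a Big value."""
    bigs = {big for _, big in normalized}
    return sorted({smol for smol, _ in normalized} - bigs)


normalized = [("a", "b"), ("a", "d"), ("b", "c"),
              ("f", "g"), ("f", "k"), ("l", "m")]
roots = find_roots(normalized)
print(roots)
```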
Step 3: the srsbsns part
Which means we can now try to trace a route from each node to a root using a recursive CTE, which is a technique in SQL to traverse hierarchical directed graphs (either from top-to-bottom, or bottom-to-top), so it's a good thing we converted the data into a directed graph first.
For example:
SQL Server CTE and recursion example
SQL Server 2012 CTE Find Root or Top Parent of Hierarchical Data
Identifying equivalent sets in SQL Server
Like so:
-- This `smolRoots` CTE is the same as the query from Step 2 above:
WITH smolRoots AS (
SELECT
DISTINCT
l.Smol AS Smollest
FROM
-- This is an anti-join (a LEFT OUTER JOIN combined with an IS NULL predicate):
@normalized AS l
LEFT OUTER JOIN @normalized AS r ON l.Smol = r.Big
WHERE
r.Big IS NULL
),
-- Generate a simple flat list of every Id1/Id2 value:
everyNode AS (
SELECT DISTINCT Smol AS Id FROM @normalized
UNION
SELECT DISTINCT Big AS Id FROM @normalized
),
-- Now do the tree-walk, it's like a nature-walk but more math-y:
recursiveCte AS (
-- Each root (Smollest) value:
SELECT
CONVERT( char(1), NULL ) AS Smol,
s.Smollest AS Big,
s.Smollest AS SmolRoot,
0 AS Depth
FROM
smolRoots AS s
UNION ALL
-- Then recursively UNION ALL (concatenate) all other rows that can be connected by their Smol-to-Big values:
SELECT
f.Smol,
f.Big,
r.SmolRoot,
r.Depth + 1 AS Depth
FROM
@normalized AS f
INNER JOIN recursiveCte AS r ON f.Smol = r.Big
)
SELECT * FROM recursiveCte;
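The above recursiveCte, if evaluated as-is, will return this output:
| Smol | Big | SmolRoot | Depth |
| NULL | a | a | 0 |
| NULL | f | f | 0 |
| NULL | l | l | 0 |
| l | m | l | 1 |
| f | g | f | 1 |
| f | k | f | 1 |
| a | b | a | 1 |
| a | d | a | 1 |
| b | c | a | 2 |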
Notice how for each row, each Smol and Big value is mapped to one of the 3 SmolRoot values identified in Step 2.
...so I hope you can now start to see how this works. But we're not done yet...
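To make the recursion less magical, here is a rough Python imitation of what the recursive CTE does: start with one row per root, then keep appending edges whose Smol equals a Big found in the previous round. It is a sketch of the idea only (SQL Server's actual evaluation order may differ), and `walk` is a made-up name.

```python
def walk(normalized, roots):
    """Emulate the recursive CTE: rows are (smol, big, smol_root, depth)."""
    rows = [(None, root, root, 0) for root in roots]  # anchor member
    frontier = list(rows)
    while frontier:
        # Recursive member: join the edges to the rows found in the previous round.
        frontier = [(smol, big, root, depth + 1)
                    for (_, prev_big, root, depth) in frontier
                    for (smol, big) in normalized
                    if smol == prev_big]
        rows.extend(frontier)
    return rows


normalized = [("a", "b"), ("a", "d"), ("b", "c"),
              ("f", "g"), ("f", "k"), ("l", "m")]
rows = walk(normalized, ["a", "f", "l"])
for row in rows:
    print(row)
```

Because every edge goes from a smaller to a larger value, the frontier eventually becomes empty and the loop stops, which is the same reason the CTE terminates.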
Step 3.2: I lied about Step 3 being Step 3, it's really Step 3.1:
We still need to query the recursiveCte's results to convert those SmolRoot values into unique identifiers for each disconnected component (i.e. each set), so let's generate a new 1-based int value for each distinct value in SmolRoot. We can do this with ROW_NUMBER() but you could also use GENERATE_SERIES, and I'm sure other techniques exist to do this too:
WITH smolRoots AS ( /* same as above */ ),
everyNode AS ( /* same as above */ ),
recursiveCte AS ( /* same as above */ ),
numberForEachRoot AS ( -- Generate distinct numbers (from 1, 2, ...) for each SmolRoot value, i.e. a number for each disconnected-component or disjoint graph:
SELECT
SmolRoot,
ROW_NUMBER() OVER ( ORDER BY SmolRoot ) AS DisconnectedComponentNumber
FROM
recursiveCte AS r
GROUP BY
r.SmolRoot
)
SELECT * FROM numberForEachRoot;
And numberForEachRoot looks like this:
| SmolRoot | DisconnectedComponentNumber |
| a | 1 |
| f | 2 |
| l | 3 |
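For readers less used to window functions: grouping by SmolRoot and numbering the groups in order amounts to enumerating the sorted distinct roots. A tiny Python sketch of that equivalence, using the SmolRoot column from the walk output as assumed input:

```python
# SmolRoot column of the recursive walk, one entry per row (repeats expected).
smol_roots = ["a", "f", "l", "l", "f", "f", "a", "a", "a"]

# GROUP BY SmolRoot + ROW_NUMBER() OVER (ORDER BY SmolRoot), in plain Python:
component_number = {root: n
                    for n, root in enumerate(sorted(set(smol_roots)), start=1)}
print(component_number)
```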
Step 5: There is no Step 4
Now ignore all of the above and start over with that everyNode CTE that was tucked away in Step 3: take the everyNode CTE and JOIN it to recursiveCte and numberForEachRoot to get the actual output you're after:
WITH
smolRoots AS ( /* same as above */ ),
everyNode AS ( /* same as above */ ),
recursiveCte AS ( /* same as above */ ),
numberForEachRoot AS ( /* same as above */ )
SELECT
e.Id,
n.DisconnectedComponentNumber
FROM
everyNode AS e
INNER JOIN recursiveCte AS r ON e.Id = r.Big
INNER JOIN numberForEachRoot AS n ON r.SmolRoot = n.SmolRoot
ORDER BY
e.Id;
Which gives us...
| Id | DisconnectedComponentNumber |
| a | 1 |
| b | 1 |
| d | 1 |
| c | 1 |
| f | 2 |
| g | 2 |
| k | 2 |
| l | 3 |
| m | 3 |
This technique also works on graphs with cycles, though you might get duplicate output rows in that case, but adjusting the above query to filter those out is a trivial exercise left to the reader.
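One input shape that may be worth testing before relying on the query is a component with more than one "root", for example the rows ('a','c') and ('b','c'): neither a nor b ever appears as a Big, so both qualify as roots even though they are connected through c. The Python imitation of Steps 1 to 3 below (an illustration with a made-up function name, not the T-SQL itself) reports which roots reach each node, so such cases show up as nodes with more than one root.

```python
def roots_and_labels(pairs):
    """Return (roots, labels) where labels maps node -> set of roots that reach it."""
    normalized = sorted({(min(a, b), max(a, b)) for a, b in pairs if a != b})
    bigs = {big for _, big in normalized}
    roots = sorted({smol for smol, _ in normalized} - bigs)
    labels = {}
    for root in roots:
        stack = [root]
        while stack:
            node = stack.pop()
            if root in labels.setdefault(node, set()):
                continue  # already reached from this root
            labels[node].add(root)
            stack.extend(big for smol, big in normalized if smol == node)
    return roots, labels


roots, labels = roots_and_labels([("a", "c"), ("b", "c")])
print(roots)   # both 'a' and 'b' are treated as roots
print(labels)  # 'c' is reached from both of them
```

If a node is reached from several roots, those roots belong to one component and would need to be merged (for instance with a union-find pass) before numbering.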
Also it seems I spent too much time on this answer, that means the Vyvanse is working tonight. And yup, it's just gone 4am where I am right now, good job, me. Now where's my Ambien gone?
Step ∞: Just give me the solution I can copy-and-paste into my CS homework and/or Nissan car firmware
Alright, you asked for it...
DECLARE @horribleInputData TABLE (
Id1 char(1) NOT NULL,
Id2 char(1) NOT NULL
);
INSERT INTO @horribleInputData ( Id1, Id2 ) VALUES
('a','b'),
('b','c'),
('d','a'),
('a','b'),
('f','g'),
('f','k'),
('l','m'),
-- Also adding this to show that it works for cycles in graphs too:
( '0', '1' ),
( '1', '2' ),
( '2', '3' ),
( '3', '0' );
-------------
-- 1. Normalize to a table form that's easier to work with: given that transitivity exists, enforce `Id1 < Id2` so we can have a "direction" to look in.
-- This can be a CTE step instead of a TABLE too, if you dare. I'm curious what the execution plan would be in that case.
DECLARE @normalized TABLE (
Smol char(1) NOT NULL,
Big char(1) NOT NULL,
INDEX IDX1 ( Smol ),
INDEX IDX2 ( Big ),
CHECK ( Smol < Big )
);
INSERT INTO @normalized ( Smol, Big )
SELECT
DISTINCT
CASE WHEN Id1 < Id2 THEN Id1 ELSE Id2 END AS Smol,
CASE WHEN Id1 < Id2 THEN Id2 ELSE Id1 END AS Big
FROM
@horribleInputData
WHERE
Id1 <> Id2 -- Exclude zero-information rows.
ORDER BY
Smol, -- Make it easier to read, interactively.
Big;
/*
Smol Big
----------
a b
a d
b c
f g
f k
l m
*/
-- Also, just gonna bury this in here and see what happens:
DECLARE @superImportantPart nvarchar(300) = CONVERT( nvarchar(300), 0x4800450059002000450056004500520059004F004E00450020004900200043004F0050005900200041004E004400200050004100530054004500200043004F00440045002000460052004F004D00200053005400410043004B004F0056004500520046004C004F005700200057004900540048004F0055005400200055004E004400450052005300540041004E00440049004E00470020005700480041005400200049005400200044004F004500530020004F005200200048004F005700200049005400200057004F0052004B00530020004F005400480045005200570049005300450020004900200057004F0055004C004400200048004100560045002000520045004D004F005600450044002000540048004900530020005000520049004E0054002000530054004100540045004D0045004E005400 );
RAISERROR( @superImportantPart, /*severity:*/ 0, /*state:*/ 1 ) WITH NOWAIT;
-- Then trace a route from every Big to its smallest connected Smol.
-- Each Big sharing the same Smol is in the same connected graph, the set of distinct Smol nodes identifies each output set.
-- 1. Get all roots first: these will be `Smol` nodes that never appear in `Big`.
WITH smolRoots AS (
SELECT
DISTINCT
l.Smol AS Smollest
FROM
@normalized AS l
LEFT OUTER JOIN @normalized AS r ON l.Smol = r.Big
WHERE
r.Big IS NULL
/*
Smollest
-----
a
f
l
*/
),
everyNode AS (
SELECT DISTINCT Smol AS Id FROM @normalized
UNION
SELECT DISTINCT Big AS Id FROM @normalized
),
-- The tree-walk:
recursiveCte AS (
-- Each root (Smollest) value:
SELECT
CONVERT( char(1), NULL ) AS Smol,
s.Smollest AS Big,
s.Smollest AS SmolRoot,
0 AS Depth
FROM
smolRoots AS s
UNION ALL
-- Then the magic happens...
SELECT
n.Smol,
n.Big,
r.SmolRoot,
r.Depth + 1 AS Depth
FROM
@normalized AS n
INNER JOIN recursiveCte AS r ON n.Smol = r.Big
/*
Smol Big SmolRoot Depth
-----------------------------
NULL a a 0
NULL f f 0
NULL l l 0
l m l 1
f g f 1
f k f 1
a b a 1
a d a 1
b c a 2
*/
),
numberForEachRoot AS ( -- Generate distinct numbers (from 1, 2, ...) for each SmolRoot number, i.e. a number for each disconnected-component or disjoint graph:
SELECT
SmolRoot,
ROW_NUMBER() OVER ( ORDER BY SmolRoot ) AS DisconnectedComponentNumber
FROM
recursiveCte AS r
GROUP BY
r.SmolRoot
/*
SmolRoot DisconnectedComponentNumber
-------------------
a 1
f 2
l 3
*/
)
-- Then ignore all of the above and start with `everyNode` and JOIN it to `recursiveCte` and `numberForEachRoot`:
SELECT
e.Id,
n.DisconnectedComponentNumber
FROM
everyNode AS e
INNER JOIN recursiveCte AS r ON e.Id = r.Big
INNER JOIN numberForEachRoot AS n ON r.SmolRoot = n.SmolRoot
ORDER BY
e.Id;
/*
Id DisconnectedComponentNumber
a 1
f 2
l 3
m 3
g 2
k 2
b 1
d 1
c 1
*/

Related

Remove duplicated subsets from very large table

The data I'm working with is fairly complicated, so I'm just going to provide a simpler example so I can hopefully expand that out to what I'm working on.
Note: I've already found a way to do it, but it's extremely slow and not scalable. It works great on small datasets, but if I applied it to the actual tables it needs to run on, it would take forever.
I need to remove entire duplicate subsets of data within a table. Removing duplicate rows is easy, but I'm stuck finding an efficient way to remove duplicate subsets.
Example:
GroupID Subset Value
------- ---- ----
1 a 1
1 a 2
1 a 3
1 b 1
1 b 3
1 b 5
1 c 1
1 c 3
1 c 5
2 a 1
2 a 2
2 a 3
2 b 4
2 b 5
2 b 6
2 c 1
2 c 3
2 c 6
So in this example, from GroupID 1, I would need to remove either subset 'b' or subset 'c'; it doesn't matter which, since both contain Values 1,3,5. For GroupID 2, none of the sets are duplicated, so none are removed.
Here's the code I used to solve this on a small scale. It works great, but when applied to 10+ Million records...you can imagine it would be very slow (I was later informed of the number of records, the sample data I was given was much smaller)...:
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6)
SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value]
SELECT x.GroupID, x.NameValues, MIN(x.SubSet)
FROM (
SELECT t1.GroupID, t1.SubSet
, NameValues = (SELECT ',' + CONVERT(VARCHAR(10), t2.[Value]) FROM @values t2 WHERE t1.GroupID = t2.GroupID AND t1.SubSet = t2.SubSet ORDER BY t2.[Value] FOR XML PATH(''))
FROM @values t1
GROUP BY t1.GroupID, t1.SubSet
) x
GROUP BY x.GroupID, x.NameValues
All I'm doing here is grouping by GroupID and Subset and concatenating all of the values into a comma delimited string...and then taking that and grouping on GroupID and Value list, and taking the MIN subset.
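The same "build a canonical key per subset, then keep the smallest subset name per key" logic can be written out in Python to confirm which subsets should survive on the sample data. This is just a reference sketch for checking results; the function name is invented, and it says nothing about how to make the SQL faster.

```python
from collections import defaultdict


def keep_first_of_duplicate_subsets(rows):
    """rows: (group_id, subset, value). Returns the (group_id, subset) pairs to keep."""
    values = defaultdict(set)
    for group_id, subset, value in rows:
        values[(group_id, subset)].add(value)
    keep = {}
    for group_id, subset in sorted(values):
        # First (smallest) subset name wins for each distinct set of values.
        keep.setdefault((group_id, frozenset(values[(group_id, subset)])), subset)
    return sorted((group_id, subset) for (group_id, _), subset in keep.items())


rows = [(1, "a", 1), (1, "a", 2), (1, "a", 3), (1, "b", 1), (1, "b", 3), (1, "b", 5),
        (1, "c", 1), (1, "c", 3), (1, "c", 5), (2, "a", 1), (2, "a", 2), (2, "a", 3),
        (2, "b", 2), (2, "b", 4), (2, "b", 6), (2, "c", 1), (2, "c", 3), (2, "c", 6)]
kept = keep_first_of_duplicate_subsets(rows)
print(kept)
```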

I'd go with something like this:
;with cte as
(
select v.GroupID, v.SubSet, checksum_agg(v.Value) h, avg(v.Value) a
from @values v
group by v.GroupID, v.SubSet
)
delete v
from @values v
join
(
select c1.GroupID, case when c1.SubSet > c2.SubSet then c1.SubSet else c2.SubSet end SubSet
from cte c1
join cte c2 on c1.GroupID = c2.GroupID and c1.SubSet <> c2.SubSet and c1.h = c2.h and c1.a = c2.a
)x on v.GroupID = x.GroupID and v.SubSet = x.SubSet
select *
from @values

From Checksum_Agg:
The CHECKSUM_AGG result does not depend on the order of the rows in
the table.
This is because it is a sum of the values: 1 + 2 + 3 = 3 + 2 + 1 = 3 + 3 = 6.
HashBytes is designed to produce a different value for two inputs that differ only in the order of the bytes, as well as other differences. (There is a small possibility that two inputs, perhaps of wildly different lengths, could hash to the same value. You can't take an arbitrary input and squeeze it down to an absolutely unique 16-byte value.)
The following code demonstrates how to use HashBytes to return a hash for each GroupId/Subset.
-- Thanks for the sample data!
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6);
SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value];
with
DistinctGroups as (
select distinct GroupId, Subset
from @values ),
GroupConcatenatedValues as (
select GroupId, Subset, Convert( VarBinary(256), (
select Convert( VarChar(8000), Cast( Value as Binary(4) ), 2 ) AS [text()]
from @values as V
where V.GroupId = DG.GroupId and V.SubSet = DG.SubSet
order by Value
for XML Path('') ), 2 ) as GroupedBinary
from DistinctGroups as DG )
-- To see the intermediate results from the CTE you can use one of the
-- following two queries instead of the last select :
-- select * from DistinctGroups;
-- select * from GroupConcatenatedValues;
select GroupId, Subset, GroupedBinary, HashBytes( 'MD4', GroupedBinary ) as Hash
from GroupConcatenatedValues
order by GroupId, Subset;

You can use checksum_agg() over a set of rows. If the checksums are the same, this is strong evidence that the 'values' columns are equal within the grouped fields.
In the 'getChecksums' cte below, I group by the group and subset, with a checksum based on your 'value' column.
In the 'maybeBadSubsets' cte, I put a row_number over each aggregation just to identify the 2nd+ row in the event the checksums match.
Finally, I delete any subgroups so identified.
with
getChecksums as (
select groupId,
subset,
cs = checksum_agg(value)
from @values v
group by groupId,
subset
),
maybeBadSubsets as (
select groupId,
subset,
cs,
deleteSubset =
case
when row_number() over (
partition by groupId, cs
order by subset
) > 1
then 1
end
from getChecksums
)
delete v
from @values v
where exists (
select 0
from maybeBadSubsets mbs
where v.groupId = mbs.groupId
and v.SubSet = mbs.subset
and mbs.deleteSubset = 1
);
I don't know what the exact likelihood is for checksums to match. If you're not comfortable with the false positive rate, you can still use it to eliminate some branches in a more algorithmic approach in order to vastly improve performance.
Note: CTEs can have a quirk performance-wise. If you find that the query engine is running 'maybeBadSubsets' for each row of @values, you may need to put its results into a temp table or table variable before using it. But I believe with 'exists' you're okay as far as that goes.
EDIT:
I didn't catch it, but as the OP noticed, checksum_agg seems to perform very poorly in terms of false hits/misses. I suspect it might be due to the simplicity of the input. I changed
cs = checksum_agg(value)
above to
cs = checksum_agg(convert(int,hashbytes('md5', convert(char(1),value))))
and got better results. But I don't know how it would perform on larger datasets.
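The false hits mentioned in the edit are what one would expect from any cheap order-independent aggregate. The Python sketch below uses a plain sum purely as a stand-in (it is not CHECKSUM_AGG's actual algorithm, which this page does not specify) to show both properties: order does not matter, and different sets can still collide, so a match should be confirmed with an exact comparison.

```python
def order_independent_checksum(values):
    """Stand-in aggregate for illustration only; NOT how CHECKSUM_AGG is computed."""
    return sum(values)


def exact_key(values):
    """Collision-free key: the sorted values themselves."""
    return tuple(sorted(values))


same_set_reordered = order_independent_checksum([1, 2, 3]) == order_independent_checksum([3, 2, 1])
collision = order_independent_checksum([1, 3, 5]) == order_independent_checksum([2, 3, 4])
really_equal = exact_key([1, 3, 5]) == exact_key([2, 3, 4])
print(same_set_reordered, collision, really_equal)
```

A common pattern is to use the cheap checksum only to narrow down candidate pairs and then compare the candidates exactly.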

Link all IDs from associative entity that have common ID

ERD, just to make it easier to visualise:
Locations from table "Location" can be linked to each other via the associative entity "Link Location"
Let's say that there are some current links that look like this:
location_id_1 location_id_2 active
1 5 True
5 3 True
2 6 True
4 6 True
6 7 True
I am trying to write a query that will return a single column with all IDs that might be connected to each other even if removed/distanced by one or more links. So, 1 is linked to 5 and 3 is linked to 5. Because of the common ID of 5, 1 is also linked to 3 once removed.
So, in my query I'd like to be able to decide on a "Prime Location", if you will, and then it will return all location ids, in one column, that are connected to my prime location, be it directly, or once or twice or n times removed.
I can do this easily with the 1st degree of links that might happen (See query below), but once I introduce 2nd or 3rd degree links, I am struggling to see another way other than manually updating my query to allow for another degree of linking.
declare @PrimeLocation int
set @PrimeLocation = 1
Select location_id_1
from [Link Location]
where location_id_1 = @PrimeLocation
or location_id_2 = @PrimeLocation
union
Select location_id_2
from [Link Location]
where location_id_1 = @PrimeLocation
or location_id_2 = @PrimeLocation
This query obviously only returns "1" and "5". But how do I get it to return "3" as well, and other IDs, should I add another link maybe to 3 in the future that might then be twice removed from 1? And can I do this without having to add to my query every time?
So, if my "Prime Location" = 1 (or 3 or 5) my result set should be:
location_id
1
3
5
And if my "prime location" is 2 (or 4 or 6 or 7) my result set should be:
location_id
2
4
6
7
Thanks in advance.

Assuming that the order of id's within pairs has no significance, this will produce the desired results:
-- Sample data.
declare @LinkLocations as Table ( LocationId1 Int, LocationId2 Int );
insert into @LinkLocations ( LocationId1, LocationId2 ) values
( 1, 5 ), ( 5, 3 ), ( 2, 6 ), ( 4, 6 ), ( 6, 7 );
select * from @LinkLocations;
-- Search the links.
declare @PrimeLocationId as Int = 1;
with Locations as (
select @PrimeLocationId as LocationId,
Cast( '.' + Cast( @PrimeLocationId as VarChar(10) ) + '.' as VarChar(1024) ) as Visited
union all
select LL.LocationId1,
Cast( '.' + Cast( LL.LocationId1 as VarChar(10) ) + L.Visited as VarChar(1024) )
from @LinkLocations as LL inner join
Locations as L on L.LocationId = LL.LocationId2
where L.Visited not like '%.' + Cast( LL.LocationId1 as VarChar(10) ) + '.%'
union all
select LL.LocationId2,
Cast( '.' + Cast( LL.LocationId2 as VarChar(10) ) + L.Visited as VarChar(1024) )
from @LinkLocations as LL inner join
Locations as L on L.LocationId = LL.LocationId1
where L.Visited not like '%.' + Cast( LL.LocationId2 as VarChar(10) ) + '.%' )
select LocationId -- , Visited
from Locations
option ( MaxRecursion 0 );
You can uncomment Visited in the last select to see some of the internals. This will correctly handle even degenerate cases like 42, 42 that link one id to itself.
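The Visited string plays the role of a "seen" set in an ordinary graph search. For comparison, here is a minimal breadth-first search in Python over the same sample links; `linked_locations` is a hypothetical helper written for this illustration and treats every link as bidirectional, as the answer assumes.

```python
from collections import deque


def linked_locations(links, prime):
    """Return every location reachable from `prime`, directly or via other links."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {prime}
    queue = deque([prime])
    while queue:
        for nxt in neighbours.get(queue.popleft(), ()):
            if nxt not in seen:  # same job as the Visited check in the CTE
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


links = [(1, 5), (5, 3), (2, 6), (4, 6), (6, 7)]
print(linked_locations(links, 1))
print(linked_locations(links, 2))
```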

Well, after a bunch of experimenting and reading up on recursive CTEs I finally figured something out that works for me.
A few notes:
I had to add the depth column so that I could cancel out of the loop once my depth reaches a certain level. This is still not ideal, as I'd like the recursion to cancel itself when it finishes, but I don't see how that is possible as I do not have a top level in my hierarchy. In other words, my links aren't child to parent links like all the CTE examples I've been able to find where eventually there is one location that no longer has a link.
Anyway, here is what I did:
declare @PrimeLocation int
set @PrimeLocation = 1
;
WITH LocationLinks AS
(
SELECT location_id_1, location_id_2, 0 as depth
FROM [Link Location]
WHERE location_id_1 = @PrimeLocation
OR location_id_2 = @PrimeLocation
UNION ALL
SELECT T1.location_id_1, T1.location_id_2, T2.depth + 1
FROM [Link Location] T1
JOIN LocationLinks T2 on T1.location_id_1 = T2.location_id_2
or T1.location_id_2 = T2.location_id_1
WHERE T2.depth <=4
)
SELECT *
INTO #Links
FROM LocationLinks
------Lumping everything into one column-------
SELECT distinct location_id_1
FROM
(
SELECT distinct location_id_1
FROM #Links
UNION ALL
SELECT distinct location_id_2
FROM #Links
) a
ORDER BY location_id_1
This results in my expected output of:
location_id_1
1
3
5

How to select only the next smaller value

I am trying to select smaller number from the database with the SQL.
I have table in which I have records like this
ID NodeName NodeType
4 A A
2 B B
2 C C
1 D D
0 E E
and other columns like name, and type.
If I pass "4" as a parameter then I want to receive the next smallest number records:
ID NodeName NodeType
2 B B
2 C C
Right now if I am using the < sign then it is giving me
ID NodeName NodeType
2 B B
2 C C
1 D D
0 E E
How can I get this working?

You can use WITH TIES clause:
SELECT TOP (1) WITH TIES *
FROM mytable
WHERE ID < 4
ORDER BY ID DESC
TOP clause in conjunction with WHERE and ORDER BY selects the next smallest value to 4. WITH TIES clause guarantees that all these values will be returned, in case there is more than one.
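Spelled out step by step, the query first finds the largest ID below the parameter and then returns every row that shares it. A small Python sketch of that logic on the question's sample rows (the function name is invented for the example):

```python
def next_smaller_rows(rows, limit):
    """rows: (id, node_name, node_type). Return all rows with the largest id below `limit`."""
    smaller = [row for row in rows if row[0] < limit]
    if not smaller:
        return []
    target = max(row[0] for row in smaller)  # what TOP (1) ... ORDER BY ID DESC picks
    return [row for row in smaller if row[0] == target]  # what WITH TIES adds


rows = [(4, "A", "A"), (2, "B", "B"), (2, "C", "C"), (1, "D", "D"), (0, "E", "E")]
result = next_smaller_rows(rows, 4)
print(result)
```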

select ID
from dbo.your_table
where ID in
(
select top 1 ID
from dbo.your_table
where ID < 4
order by ID desc
);
Note: dbo.your_table is your source table.
This uses an inner query to pull the next smallest ID below your selected value. The outer query then pulls all records whose ID matches that next smallest value.
Here's a full working example:
use TestDatabase;
go
create table dbo.TestTable1
(
ID int not null
);
go
insert into dbo.TestTable1 (ID)
values (6), (4), (2), (2), (1), (0);
go
select ID
from dbo.TestTable1
where ID in
(
select top 1 ID
from dbo.TestTable1
where ID < 4
order by ID desc
);
/*
ID
2
2
*/

How to do this data transformation

This is my input data
GroupId Serial Action
1 1 Start
1 2 Run
1 3 Jump
1 8 End
2 9 Shop
2 10 Start
2 11 Run
For each activity sequence in a group, I want to find pairs of Actions where Action2.Serial = Action1.Serial + k, and how many times each pair occurs.
Suppose k = 1, then output will be
FirstAction NextAction Frequency
Start Run 2
Run Jump 1
Shop Start 1
How can I do this in SQL, fast enough given the input table contains millions of entries.

This should produce the result you want, but I don't know if it will be as fast as you'd like. It's worth a try.
create table Actions(
GroupId int,
Serial int,
"Action" varchar(20) not null,
primary key (GroupId, Serial)
);
insert into Actions values
(1,1,'Start'), (1,2,'Run'), (1,3,'Jump'),
(1,8,'End'), (2,9,'Shop'), (2,10,'Start'),
(2,11,'Run');
go
declare @k int = 1;
with ActionsDoubled(Serial,Tag,"Action") as (
select
Serial, 'a', "Action"
from Actions as A
union all
select
Serial-@k, 'b', "Action"
from Actions
as B
), Pivoted(Serial,a,b) as (
select Serial,a,b
from ActionsDoubled
pivot (
max("Action") for Tag in ([a],[b])
) as P
)
select
a, b, count(*) as ct
from Pivoted
where a is not NULL and b is not NULL
group by a,b
order by a,b;
go
drop table Actions;
If you will be doing the same computation for various #k values on stable data, this may work better in the long run:
declare @k int = 1;
select
Serial, 'a' as Tag, "Action"
into ActionsDoubled
from Actions as A
union all
select
Serial-@k, 'b', "Action"
from Actions
as B;
go
create unique clustered index AD_S on ActionsDoubled(Serial,Tag);
create index AD_a on ActionsDoubled(Tag,Serial);
go
with Pivoted(Serial,a,b) as (
select Serial,a,b
from ActionsDoubled
pivot (
max("Action") for Tag in ([a],[b])
) as P
)
select
a, b, count(*) as ct
from Pivoted
where a is not NULL and b is not NULL
group by a,b
order by a,b;
go
drop table ActionsDoubled;

SELECT a1.Action AS FirstAction, a2.Action AS NextAction, COUNT(*) AS Frequency
FROM Activities a1 JOIN Activities a2
ON (a1.GroupId = a2.GroupId AND a2.Serial = a1.Serial + @k)
GROUP BY a1.Action, a2.Action;

The problem is this: Your query has to go through EVERY row regardless.
You can make it more manageable for your database by tackling each group separately as separate queries. Especially if the size of each group is SMALL.
There's a lot going on under the hood and when the query has to do a scan of the entire table, this actually ends up being many times slower than if you did small chunks which effectively cover all million rows.
So for instance:
--Stickler for clean formatting...
SELECT
a1.Action AS FirstAction,
a2.Action AS NextAction,
COUNT(*) AS Frequency
FROM
Activities a1 JOIN Activities a2
ON (a1.groupid = a2.groupid
AND a2.Serial = a1.Serial + @k)
WHERE
a1.groupid = 1
GROUP BY
a1.Action,
a2.Action;
By the way, you have an index (GroupId, Serial) on the table, right?
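Whichever query is used, its output can be checked against a straightforward reference computation on a small sample. The Python sketch below pairs each action with the action k serials later in the same group and counts the pairs; `action_pairs` is a made-up name and the rows are the question's sample data.

```python
from collections import Counter


def action_pairs(rows, k):
    """rows: (group_id, serial, action). Count (first_action, next_action) pairs k serials apart."""
    by_key = {(group_id, serial): action for group_id, serial, action in rows}
    return Counter((action, by_key[(group_id, serial + k)])
                   for (group_id, serial), action in by_key.items()
                   if (group_id, serial + k) in by_key)


rows = [(1, 1, "Start"), (1, 2, "Run"), (1, 3, "Jump"), (1, 8, "End"),
        (2, 9, "Shop"), (2, 10, "Start"), (2, 11, "Run")]
pairs = action_pairs(rows, 1)
print(dict(pairs))
```

Note that End (group 1, serial 8) and Shop (group 2, serial 9) are not paired because the lookup includes the group id.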

SQL query to return rows sorted by key plus empty rows for missing keys

I have a table that has, in essence, this structure:
key value
------ ------
2 val1
3 val2
5 val3
The keys are sequential integers from 1 up to (currently) 1 million, increasing by several thousand each day. Gaps in the keys occur when records have been deleted.
I'm looking for an SQL query that returns this:
key value
------ ------
1
2 val1
3 val2
4
5 val3
I can see how to do this with joining to a second table that has a complete list of keys. However I'd prefer a solution that uses standard SQL (no stored procedures or a second table of keys), and that will work no matter what the upper value of the key is.

SQL queries have no looping mechanism. Procedure languages have loops, but queries themselves can only "loop" over data that they find in a table (or a derived table).
What I do to generate a list of numbers on the fly is to do a cross-join on a small table of digits 0 through 9:
CREATE TABLE n (d NUMERIC);
INSERT INTO n VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9);
Then to generate 00..99:
SELECT n1.d + n2.d*10 AS d
FROM n AS n1 CROSS JOIN n AS n2;
If you want only 00..57:
SELECT n1.d + n2.d*10 AS d
FROM n AS n1 CROSS JOIN n AS n2
WHERE n1.d + n2.d*10 <= 57;
You can of course join the table for the 100's place, 1000's place, etc. Note that you can't use column aliases in the WHERE clause, so you have to repeat the full expression.
Now you can use this as a derived table in a FROM clause and join it to your data table.
SELECT n0.d, mytable.value
FROM
(SELECT n1.d + n2.d*10 + n3.d*100 + n4.d*1000
+ n5.d*10000 + n6.d*100000 AS d
FROM n AS n1 CROSS JOIN n AS n2 CROSS JOIN n AS n3
CROSS JOIN n AS n4 CROSS JOIN n AS n5 CROSS JOIN n AS n6) AS n0
LEFT OUTER JOIN mytable ON (n0.d = mytable.key)
WHERE n0.d <= (SELECT MAX(key) FROM mytable);
You do need to add another CROSS JOIN each time your table exceeds an order of magnitude in size. E.g. when it grows past 1 million, add a join for n7.
Note also we can now use the column alias in the WHERE clause of the outer query.
Admittedly, it can be a pretty expensive query to do this solely in SQL. You might find that it's both simpler and speedier to "fill in the gaps" by writing some application code.
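As a rough picture of what "fill in the gaps in application code" could look like, here is a short Python sketch. It builds 0..99 from a cross product of digits, mirroring the digits-table trick above, and then left-joins the numbers to the data; the dictionary `table` stands in for the real table and only covers the question's three sample rows.

```python
from itertools import product

# Cross join of the digits 0..9 with themselves yields 0..99.
numbers = sorted(ones + 10 * tens for ones, tens in product(range(10), repeat=2))

table = {2: "val1", 3: "val2", 5: "val3"}  # key -> value, sample data only

# Left join: every number from 1 up to the largest key, with None for missing keys.
filled = [(n, table.get(n)) for n in numbers if 1 <= n <= max(table)]
for key, value in filled:
    print(key, value if value is not None else "")
```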

Another method would be to create a resultset of the million numbers, and use it as a basis for the join. That might do the job for you. (stolen from ASKTOMs Blog)
select level
from dual
connect by level <= 1000000
yielding something like this
WITH
upper_limit AS
(
select 1000000 limit from dual
),
fake_table AS
(
select level key
from dual
connect by level <= (select limit from upper_limit)
)
select key, value
from table, fake_table
where fake_table.key = table.key(+)
I'm not at work, so I can't test this. Your mileage may vary. I use Oracle at work.

In MySQL you can find the edges of the gaps by performing left joins against itself with positive and negative offsets.
Eg:
create table seq ( i int primary key, v varchar(10) );
insert into seq values( 2, 'val1' ), (3, 'val2' ), (5, 'val3' );
select s.i-1 from seq s left join seq m on m.i = (s.i -1) where m.i is null;
+-------+
| s.i-1 |
+-------+
| 1 |
| 4 |
+-------+
select s.i+1 from seq s left join seq m on m.i = (s.i +1) where m.i is null;
+-------+
| s.i+1 |
+-------+
| 4 |
| 6 |
+-------+
This doesn't give you exactly what you want, but it gives enough information to work out what the missing rows are.
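The two self-joins amount to asking, for each key, whether its predecessor or successor is missing. A compact Python sketch of the same check on the sample keys:

```python
keys = {2, 3, 5}

# Keys whose predecessor is missing: each k - 1 is the last number of a gap.
missing_before = sorted(k - 1 for k in keys if k - 1 not in keys)
# Keys whose successor is missing: each k + 1 is the first number of a gap.
missing_after = sorted(k + 1 for k in keys if k + 1 not in keys)

print(missing_before)
print(missing_after)
```

Pairing the values from the two lists gives the boundaries of each gap; the final entry (one past the largest key) marks the end of the data rather than a real gap.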

WITH range (num) AS (
SELECT 1 -- use your own lowerbound
UNION ALL
SELECT 1 + num FROM range
WHERE num < 10 -- use your own upper bound
)
SELECT r.num, y.* FROM range r left join yourtable y
on r.num = y.id
