Query to group fully recursive relationships in many-to-many junction table - sql

Apologies if this is a duplicate; searching here and on the web turned up similar but not exact matches to my problem, so I decided to post.
I'm calling this a fully recursive grouping in a many-to-many relationship. I have tried writing joins and CTEs to do this, but without full recursion I only get one level deep, and I'm not exactly thrilled about trying to write nested dynamic cursors.
Assume expressions can be derived from this "junction" table to find distinct students / classes, therefore the question can be summarized by using only one set.
SELECT * FROM student_class
+-------------+----------+--------------+
| student_id | class_id | group_number |
+-------------+----------+--------------+
| 1 | A | null |
| 1 | C | null |
| 2 | A | null |
| 2 | B | null |
| 2 | C | null |
| 3 | E | null |
| 4 | B | null |
| 4 | F | null |
+-------------+----------+--------------+
The question is how to populate a group number through a recursive relationship for each student and for each class. Ex: if student_id 1 has class_id A, then what other students have class_id A? For those other students, what other classes do they have? For each of those other classes, which other students have those classes? Then keep recursing through results until no more dependencies are found.
So in this example the final update would contain only two groups, as shown here, since no other students have class_id E and student_id 3 has no other classes:
+-------------+----------+--------------+
| student_id | class_id | group_number |
+-------------+----------+--------------+
| 1 | A | 1 |
| 1 | C | 1 |
| 2 | A | 1 |
| 2 | B | 1 |
| 2 | C | 1 |
| 3 | E | 2 |
| 4 | B | 1 |
| 4 | F | 1 |
+-------------+----------+--------------+

This is really tricky. It is a graph-walking algorithm, which is not that hard in itself, but you first have to define the graph. You can define a link between two classes (they share a student) with a self-join. Those links can then be walked with a recursive CTE.
So, to get equivalence classes (kind of punny here), you can do:
with cs as (
select *
from (values (1, 'A'), (1, 'C'), (2, 'A'), (2, 'B'), (2, 'C'), (3, 'E'), (4, 'B'), (4, 'F')) v(student, class)
),
cc as (
select distinct cs1.class as class1, cs2.class as class2
from cs cs1 join
cs cs2
on cs1.student = cs2.student
),
cte as (
select cc.class1 as class, cc.class2 as grp, cast(',' + cc.class1 + ',' as varchar(max)) as grps
from cc
union all
select cte.class, cc.class2,
cast(grps + cc.class2 + ',' as varchar(max)) as grps
from cte join
cc
on cc.class1 = cte.grp and cte.grps not like '%,' + cc.class2 + ',%'
)
select cte.class, min(cte.grp)
from cte
group by cte.class;
If you want to convert them to numbers:
select cte.class, min(cte.grp),
dense_rank() over (order by min(cte.grp)) as group_number
from cte
group by cte.class;
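As a cross-check outside the database, the same graph walk can be sketched with a union-find structure, treating every student and every class as a node and every junction row as an edge. This is only a minimal sketch, not the answer's method: the function name assign_groups and the numbering rule (groups are numbered in order of first appearance) are assumptions made for illustration.

```python
def assign_groups(rows):
    """Label each (student_id, class_id) row with a connected-component number."""
    parent = {}

    def find(node):
        # Walk up to the root, halving the path along the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Students and classes are different node types, so tag them to avoid
    # collisions between a student id and a class id.
    for student, cls in rows:
        union(("student", student), ("class", cls))

    # Assumption: groups are numbered in order of first appearance in the input.
    numbers = {}
    result = []
    for student, cls in rows:
        root = find(("student", student))
        group = numbers.setdefault(root, len(numbers) + 1)
        result.append((student, cls, group))
    return result


rows = [(1, "A"), (1, "C"), (2, "A"), (2, "B"), (2, "C"),
        (3, "E"), (4, "B"), (4, "F")]
for row in assign_groups(rows):
    print(row)
```

On the sample data this puts students 1, 2 and 4 in group 1 and student 3 in group 2, which matches the expected table in the question.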

Spark sql query to find many to many mappings between two columns of the same table ordered by maximum overlapedness

I want to write a Spark SQL query or PySpark code to extract many-to-many mappings between two columns of the same table, ordered by maximum overlap.
For example:
SysA SysB
A Y
A Z
B Z
B Y
C W
This means there is an M:M relationship between the two columns.
Is there a way to extract all M:M combinations ordered by maximum overlap, i.e. with the values that share the most mappings at the top, while discarding one-to-one mappings like C W?
Z maps to both A and B
Y maps to both A and B
A maps to both Y and Z
B maps to both Y and Z
Therefore both A,B and Y,Z have M:M relationships, and C W is 1:1. The result would be sorted by the count: in the example above, A, B, Y and Z each map to two values, so every count is 2.
Similar question: https://social.msdn.microsoft.com/Forums/en-US/fa496933-e85a-4dfe-98df-b6c29ad812f4/sql-to-find-manytomany-combinations-of-two-columns
As you requested, here is a simplified version of your similar MSDN question that identifies just the M:M relationships and orders them.
The following approaches may be used in Spark SQL.
CREATE TABLE SampleData (
`SysA` VARCHAR(1),
`SysB` VARCHAR(1)
);
INSERT INTO SampleData
(`SysA`, `SysB`)
VALUES
('A', 'Y'),
('A', 'Z'),
('B', 'Z'),
('B', 'G'),
('B', 'Y'),
('C', 'W');
Query #1
For demo purposes I have used * instead of SysA, SysB in the final projection below.
SELECT
*
FROM
(
SELECT
*,
(
SELECT
count(*)
FROM
SampleData s
WHERE s.SysA=sd.SysA
) SysA_SysB,
(
SELECT
count(*)
FROM
SampleData s
WHERE s.SysB=sd.SysB
) SysB_SysA
FROM
SampleData sd
) t
WHERE t.SysA_SysB > 1 AND t.SysB_SysA>1
ORDER BY t.SysA_SysB DESC, t.SysB_SysA DESC;
| SysA | SysB | SysA_SysB | SysB_SysA |
| ---- | ---- | --------- | --------- |
| B | Z | 3 | 2 |
| B | Y | 3 | 2 |
| A | Y | 2 | 2 |
| A | Z | 2 | 2 |
Query #2
NB: cross joins may need to be enabled in Spark, i.e. by setting spark.sql.crossJoin.enabled to true in your Spark conf.
SELECT
s1.SysA,
s1.SysB
FROM
SampleData s1
CROSS JOIN
SampleData s2
GROUP BY
s1.SysA, s1.SysB
HAVING
SUM(
CASE WHEN s1.SysA = s2.SysA THEN 1 ELSE 0 END
) > 1 AND
SUM(
CASE WHEN s1.SysB = s2.SysB THEN 1 ELSE 0 END
) > 1
ORDER BY
SUM(
CASE WHEN s1.SysA = s2.SysA THEN 1 ELSE 0 END
) DESC,
SUM(
CASE WHEN s1.SysB = s2.SysB THEN 1 ELSE 0 END
) DESC;
| SysA | SysB |
| ---- | ---- |
| B | Z |
| B | Y |
| A | Z |
| A | Y |
Query #3 (Recommended)
WITH SampleDataOcc AS (
SELECT
SysA,
SysB,
COUNT(SysA) OVER (PARTITION BY SysA) as SysAOcc,
COUNT(SysB) OVER (PARTITION BY SysB) as SysBOcc
FROM
SampleData
)
SELECT
SysA,
SysB,
SysAOcc,
SysBOcc
FROM
SampleDataOcc t
WHERE
t.SysAOcc > 1 AND
t.SysBOcc>1
ORDER BY
t.SysAOcc DESC,
t.SysBOcc DESC;
| SysA | SysB | SysAOcc | SysBOcc |
| ---- | ---- | --------- | --------- |
| B | Z | 3 | 2 |
| B | Y | 3 | 2 |
| A | Y | 2 | 2 |
| A | Z | 2 | 2 |
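The counting logic behind Query #3 can also be sketched in plain Python, which may help when checking results on a small sample before running it in Spark. This is a minimal sketch under stated assumptions: the function name many_to_many is hypothetical, and ties are broken alphabetically, which the SQL above does not guarantee.

```python
from collections import Counter


def many_to_many(pairs):
    """Keep pairs whose SysA and SysB values each occur more than once."""
    a_counts = Counter(a for a, _ in pairs)  # like COUNT(SysA) OVER (PARTITION BY SysA)
    b_counts = Counter(b for _, b in pairs)  # like COUNT(SysB) OVER (PARTITION BY SysB)
    kept = [
        (a, b, a_counts[a], b_counts[b])
        for a, b in pairs
        if a_counts[a] > 1 and b_counts[b] > 1
    ]
    # Highest counts first; the alphabetical tie-break is an assumption.
    return sorted(kept, key=lambda r: (-r[2], -r[3], r[0], r[1]))


pairs = [("A", "Y"), ("A", "Z"), ("B", "Z"), ("B", "G"), ("B", "Y"), ("C", "W")]
for row in many_to_many(pairs):
    print(row)
```

With the sample rows from the answer, the B rows (count 3) come before the A rows (count 2), and both B G and C W are dropped.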

Oracle Connect_is_leaf similar in SQL server

Here is my query, which is in Oracle PL/SQL syntax. How can I convert it to SQL Server format?
Are there any alternatives to connect_by_isleaf?
(
select PARTY_KEY, ltrim(sys_connect_by_path(alt_name, '|'), '|') AS alt_name_list
from
(select PARTY_KEY, alt_name, row_number() over(partition by PARTY_KEY order by alt_name) rno
from (
select party_key, (select alt_name_type_desc from "CRMS"."PRJ_APP_ALT_NAME_TYPE" where alt_name_type_cd = alt_name_type) || ' - ' || alt_name as alt_name
from "CDD_PROFILES"."PRJ_PRF_ALT_NAME" order by party_key, alt_name_type
) alt
)
where connect_by_isleaf = 1
connect by PARTY_KEY = prior PARTY_KEY
and rno = prior rno+1
start with rno = 1
)
I tried to use a WITH ... AS clause, but somehow it is not working.
Thanks in advance.
The equivalent in SQL Server is called a "recursive CTE".
You can read about it here:
https://learn.microsoft.com/en-us/sql/t-sql/queries/with-common-table-expression-transact-sql?view=sql-server-2017
Oracle Hierarchical queries can be rewritten as recursive CTE statements in databases that support them (SQL Server included). A classic set of hierarchical data would be an organization hierarchy such as the one below:
MS SQL Server 2017 Schema Setup:
CREATE TABLE ORGANIZATIONS
([ID] int primary key
, [ORG_NAME] varchar(30)
, [ORG_TYPE] varchar(30)
, [PARENT_ID] int foreign key references organizations)
;
INSERT INTO ORGANIZATIONS
([ID], [ORG_NAME], [ORG_TYPE], [PARENT_ID])
VALUES
(1, 'ACME Corp', 'Company', NULL),
(2, 'Finance', 'Division', 1),
(6, 'Accounts Payable', 'Department', 2),
(7, 'Accounts Receivables', 'Department', 2),
(8, 'Payroll', 'Department', 2),
(3, 'Operations', 'Division', 1),
(4, 'Human Resources', 'Division', 1),
(10, 'Benefits Admin', 'Department', 4),
(5, 'Marketing', 'Division', 1),
(9, 'Sales', 'Department', 5)
;
In the recursive CTE t1 below, the select statement before the union all is the anchor query and the select statement after the union all is the recursive part. The recursive part has exactly one reference to t1 in its from clause. The org_path column simulates Oracle's sys_connect_by_path function by concatenating the org_names together. The level column simulates Oracle's LEVEL pseudocolumn and is used in the output query to determine the leaf status (the is_leaf column), similar to Oracle's connect_by_isleaf pseudocolumn:
with t1(id, org_name, org_type, parent_id, org_path, level) as (
select o.*
, cast('|' + org_name as varchar(max))
, 1
from organizations o
where parent_id is null
union all
select o.*
, t1.org_path+cast('|'+o.org_name as varchar(max))
, t1.level+1
from organizations o
join t1
on t1.id = o.parent_id
)
select t1.*
, case when t1.level < lead(t1.level) over (order by org_path) then 0 else 1 end is_leaf
from t1 order by org_path
Results:
| id | org_name | org_type | parent_id | org_path | level | is_leaf |
|----|----------------------|------------|-----------|-------------------------------------------|-------|---------|
| 1 | ACME Corp | Company | (null) | |ACME Corp | 1 | 0 |
| 2 | Finance | Division | 1 | |ACME Corp|Finance | 2 | 0 |
| 6 | Accounts Payable | Department | 2 | |ACME Corp|Finance|Accounts Payable | 3 | 1 |
| 7 | Accounts Receivables | Department | 2 | |ACME Corp|Finance|Accounts Receivables | 3 | 1 |
| 8 | Payroll | Department | 2 | |ACME Corp|Finance|Payroll | 3 | 1 |
| 4 | Human Resources | Division | 1 | |ACME Corp|Human Resources | 2 | 0 |
| 10 | Benefits Admin | Department | 4 | |ACME Corp|Human Resources|Benefits Admin | 3 | 1 |
| 5 | Marketing | Division | 1 | |ACME Corp|Marketing | 2 | 0 |
| 9 | Sales | Department | 5 | |ACME Corp|Marketing|Sales | 3 | 1 |
| 3 | Operations | Division | 1 | |ACME Corp|Operations | 2 | 1 |
To select just the leaf nodes, move the output query above into another CTE (t2), dropping the order by clause or moving it to the final output query, and filter on the is_leaf column:
with t1(id, org_name, org_type, parent_id, org_path, level) as (
select o.*
, cast('|' + org_name as varchar(max))
, 1
from organizations o
where parent_id is null
union all
select o.*
, t1.org_path+cast('|'+o.org_name as varchar(max))
, t1.level+1
from organizations o
join t1
on t1.id = o.parent_id
), t2 as (
select t1.*
, case when t1.level < lead(t1.level) over (order by org_path) then 0 else 1 end is_leaf
from t1
)
select * from t2 where is_leaf = 1
Results:
| id | org_name | org_type | parent_id | org_path | level | is_leaf |
|----|----------------------|------------|-----------|-------------------------------------------|-------|---------|
| 6 | Accounts Payable | Department | 2 | |ACME Corp|Finance|Accounts Payable | 3 | 1 |
| 7 | Accounts Receivables | Department | 2 | |ACME Corp|Finance|Accounts Receivables | 3 | 1 |
| 8 | Payroll | Department | 2 | |ACME Corp|Finance|Payroll | 3 | 1 |
| 10 | Benefits Admin | Department | 4 | |ACME Corp|Human Resources|Benefits Admin | 3 | 1 |
| 9 | Sales | Department | 5 | |ACME Corp|Marketing|Sales | 3 | 1 |
| 3 | Operations | Division | 1 | |ACME Corp|Operations | 2 | 1 |
Alternatively, if you realize that leaf nodes can be identified by their lack of child nodes, you can flip this on its head: start with the leaf nodes and search up the tree, retaining all the original record values, building out the org_path in reverse, and passing along the next parent id as next_id. In the final output stage, selecting only those records whose next_id is null will yield the same results as the prior query:
with t1(id, org_name, org_type, parent_id, org_path, level, next_id) as (
select o.*
, cast('|'+org_name as varchar(max))
, 1
, parent_id
from organizations o
where not exists (select 1 from organizations c where c.parent_id = o.id)
union all
select t1.id
, t1.org_name
, t1.org_type
, t1.parent_id
, cast('|'+p.org_name as varchar(max))+t1.org_path
, level+1
, p.parent_id
from organizations p
join t1
on t1.next_id = p.id
)
select * from t1 where next_id is null order by org_path
Results:
| id | org_name | org_type | parent_id | org_path | level | next_id |
|----|----------------------|------------|-----------|-------------------------------------------|-------|---------|
| 6 | Accounts Payable | Department | 2 | |ACME Corp|Finance|Accounts Payable | 3 | (null) |
| 7 | Accounts Receivables | Department | 2 | |ACME Corp|Finance|Accounts Receivables | 3 | (null) |
| 8 | Payroll | Department | 2 | |ACME Corp|Finance|Payroll | 3 | (null) |
| 10 | Benefits Admin | Department | 4 | |ACME Corp|Human Resources|Benefits Admin | 3 | (null) |
| 9 | Sales | Department | 5 | |ACME Corp|Marketing|Sales | 3 | (null) |
| 3 | Operations | Division | 1 | |ACME Corp|Operations | 2 | (null) |
One of these two methods may prove more performant than the other, but you'll need to try them each out on your data to see which one works better.
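A third variation combines the two ideas: walk down from the root as in the first query, but flag leaves with the NOT EXISTS test from the second instead of lead(). The sketch below runs it through Python's built-in sqlite3 module purely so that it is self-contained. SQLite stands in for SQL Server here, so it uses WITH RECURSIVE and || instead of WITH and +, and the table is trimmed to three columns; the function name leaf_paths is hypothetical.

```python
import sqlite3

ORGS = [
    (1, "ACME Corp", None), (2, "Finance", 1), (6, "Accounts Payable", 2),
    (7, "Accounts Receivables", 2), (8, "Payroll", 2), (3, "Operations", 1),
    (4, "Human Resources", 1), (10, "Benefits Admin", 4), (5, "Marketing", 1),
    (9, "Sales", 5),
]


def leaf_paths(rows):
    """Return (org_path, level) for every organization that has no children."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE organizations "
        "(id INTEGER PRIMARY KEY, org_name TEXT, parent_id INTEGER)"
    )
    conn.executemany("INSERT INTO organizations VALUES (?, ?, ?)", rows)
    query = """
        WITH RECURSIVE t1(id, org_path, level) AS (
            -- anchor: root rows
            SELECT id, '|' || org_name, 1
            FROM organizations
            WHERE parent_id IS NULL
            UNION ALL
            -- recursive part: children of rows found so far
            SELECT o.id, t1.org_path || '|' || o.org_name, t1.level + 1
            FROM organizations o
            JOIN t1 ON t1.id = o.parent_id
        )
        SELECT t1.org_path, t1.level
        FROM t1
        -- a leaf is a row that no other row names as its parent
        WHERE NOT EXISTS (SELECT 1 FROM organizations c WHERE c.parent_id = t1.id)
        ORDER BY t1.org_path
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for path, level in leaf_paths(ORGS):
    print(level, path)
```

Because the leaf test no longer depends on the row order produced by org_path, it should stay correct even if the sort order of the path strings does not follow the tree order.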

Using Limit on Distinct group by values psql

Suppose I have a table that looks like this or maybe I am going nowhere.
create table customers (id text, name text, number int, useless text);
With values
insert into customers (id, name, number, useless)
values
('1','apple',1, 'a'),
('2','banana',3, 'b'),
('3','pear',2, 's'),
('4','apple',1,'e'),
('5','banana',3,'s'),
('6','cherry',3, 'a'),
('7','cherry',4, 's'),
('8','apple',2, 'd'),
('9','banana',4, 'c'),
('10','pear',5, 'e');
My failed psql query is this.
select id, name, number, useless
from customers
where number < 4
group by customers.name limit 2
I want the query to return the rows for the first 2 distinct values of customers.name, not the first 2 rows.
In the end I want it to return
('1','apple',1, 'a'),
('4','apple',1,'e'),
('8','apple',2, 'd'),
('2','banana',3, 'b'),
('5','banana',3,'s'),
so it returns the first 2 grouped names.
How can I make this query?
Thank you.
Edit:
This query is my second try; I know I am kind of close.
select t.id, t.name, t.ranking
from (
SELECT id, name, dense_rank() OVER (order by name) as
ranking
FROM customers
group by name
) t
where t.ranking < 3
Try this (the number < 4 filter has to be repeated in the outer query, otherwise rows such as banana with number 4 come back as well):
select id, name, number, useless
from customers
where number < 4
and name in (
select name
from customers
where number < 4
group by customers.name
order by name limit 2
)
| id | name | number | useless |
|----|--------|--------|---------|
| 1 | apple | 1 | a |
| 2 | banana | 3 | b |
| 4 | apple | 1 | e |
| 5 | banana | 3 | s |
| 8 | apple | 2 | d |
GROUP BY customers.name does not order your output; it only groups the rows by customers.name. What you want is to order the rows by name, right? PostgreSQL also rejects a GROUP BY name that selects non-aggregated columns such as id, so I think what you want is:
select id, name, number, useless
from customers
order by name asc
Use asc for ascending order or desc for descending order, depending on what you want.
Hope it helps.
You can use dense_rank() as:
SELECT * FROM (
SELECT DENSE_RANK() OVER (order by name) AS rank, temp.*
FROM customers temp WHERE number < 4) data
WHERE data.rank <= 2
| rank| id| name | number | useless |
|-----|---|--------|--------|---------|
| 1 | 4 | apple | 1 | e |
| 1 | 1 | apple | 1 | a |
| 1 | 8 | apple | 2 | d |
| 2 | 5 | banana | 3 | s |
| 2 | 2 | banana | 3 | b |
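The intent behind both working answers (filter first, pick the first two distinct names, then return every filtered row for those names) can be spelled out in a few lines of Python. This is only a sketch for checking the expected result on the sample rows; the function name first_n_groups and its parameters are hypothetical.

```python
def first_n_groups(rows, n=2, max_number=4):
    """Return the filtered rows that belong to the first n distinct names."""
    # Same filter as "where number < 4".
    eligible = [r for r in rows if r[2] < max_number]
    # First n distinct names in alphabetical order, like dense_rank() <= n.
    names = sorted({r[1] for r in eligible})[:n]
    # Sort by name, then numeric id, to match the desired output in the question.
    return sorted(
        (r for r in eligible if r[1] in names),
        key=lambda r: (r[1], int(r[0])),
    )


customers = [
    ("1", "apple", 1, "a"), ("2", "banana", 3, "b"), ("3", "pear", 2, "s"),
    ("4", "apple", 1, "e"), ("5", "banana", 3, "s"), ("6", "cherry", 3, "a"),
    ("7", "cherry", 4, "s"), ("8", "apple", 2, "d"), ("9", "banana", 4, "c"),
    ("10", "pear", 5, "e"),
]
for row in first_n_groups(customers):
    print(row)
```

Note that the filter is applied before the names are chosen, so a name whose rows are all filtered out can never take one of the two slots.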

Recursive SQL - count number of descendants in hierarchical structure

Consider a database table with the following columns:
mathematician_id
name
advisor1
advisor2
The database represents data from the Math Genealogy Project, where each mathematician usually has one single advisor, but there are situations when there are two advisors.
Visual aid to make things clearer:
How do I count the number of descendants for each of the mathematicians?
I should probably use Common Table Expressions (WITH RECURSIVE), but I am pretty much stuck at the moment. All the similar examples I found deal with hierarchies having only one parent, not two.
Update:
I adapted the solution for SQL Server provided by Vladimir Baranov to also work in PostgreSQL:
WITH RECURSIVE cte AS (
SELECT m.id as start_id,
m.id,
m.name,
m.advisor1,
m.advisor2,
1 AS level
FROM public.mathematicians AS m
UNION ALL
SELECT cte.start_id,
m.id,
m.name,
m.advisor1,
m.advisor2,
cte.level + 1 AS level
FROM public.mathematicians AS m
INNER JOIN cte ON cte.id = m.advisor1
OR cte.id = m.advisor2
),
cte_distinct AS (
SELECT DISTINCT start_id, id
FROM cte
)
SELECT cte_distinct.start_id,
m.name,
COUNT(*)-1 AS descendants_count
FROM cte_distinct
INNER JOIN public.mathematicians AS m ON m.id = cte_distinct.start_id
GROUP BY cte_distinct.start_id, m.name
ORDER BY cte_distinct.start_id
You didn't say what DBMS you use. I'll use SQL Server for this example, but it will work in other databases that support recursive queries as well.
Sample data
I entered only the right part of your tree, starting from Euler.
The most interesting part is the multiple paths between Lagrange and Dirichlet.
CREATE TABLE #T (ID int, name nvarchar(50), Advisor1ID int, Advisor2ID int);
INSERT INTO #T (ID, name, Advisor1ID, Advisor2ID) VALUES
(1, 'Euler', NULL, NULL),
(2, 'Lagrange', 1, NULL),
(3, 'Laplace', NULL, NULL),
(4, 'Fourier', 2, NULL),
(5, 'Poisson', 2, 3),
(6, 'Dirichlet', 4, 5),
(7, 'Lipschitz', 6, NULL),
(8, 'Klein', NULL, 7),
(9, 'Lindemann', 8, NULL),
(10, 'Furtwangler', 8, NULL),
(11, 'Hilbert', 9, NULL),
(12, 'Taussky-Todd', 10, NULL);
This is what it looks like:
SELECT * FROM #T;
+----+--------------+------------+------------+
| ID | name | Advisor1ID | Advisor2ID |
+----+--------------+------------+------------+
| 1 | Euler | NULL | NULL |
| 2 | Lagrange | 1 | NULL |
| 3 | Laplace | NULL | NULL |
| 4 | Fourier | 2 | NULL |
| 5 | Poisson | 2 | 3 |
| 6 | Dirichlet | 4 | 5 |
| 7 | Lipschitz | 6 | NULL |
| 8 | Klein | NULL | 7 |
| 9 | Lindemann | 8 | NULL |
| 10 | Furtwangler | 8 | NULL |
| 11 | Hilbert | 9 | NULL |
| 12 | Taussky-Todd | 10 | NULL |
+----+--------------+------------+------------+
Query
It is a classic recursive query with two interesting points.
1) The recursive part of the CTE joins back to the CTE on both Advisor1ID and Advisor2ID:
INNER JOIN CTE
ON CTE.ID = T.Advisor1ID
OR CTE.ID = T.Advisor2ID
2) Since there can be multiple paths to a descendant, the recursive query may output the same node several times. To eliminate these duplicates I used DISTINCT in CTE_Distinct. It may be possible to solve this more efficiently.
To better understand how the query works, run each CTE separately and examine the intermediate results.
WITH
CTE
AS
(
SELECT
T.ID AS StartID
,T.ID
,T.name
,T.Advisor1ID
,T.Advisor2ID
,1 AS Lvl
FROM #T AS T
UNION ALL
SELECT
CTE.StartID
,T.ID
,T.name
,T.Advisor1ID
,T.Advisor2ID
,CTE.Lvl + 1 AS Lvl
FROM
#T AS T
INNER JOIN CTE
ON CTE.ID = T.Advisor1ID
OR CTE.ID = T.Advisor2ID
)
,CTE_Distinct
AS
(
SELECT DISTINCT
StartID
,ID
FROM CTE
)
SELECT
CTE_Distinct.StartID
,T.name
,COUNT(*) AS DescendantCount
FROM
CTE_Distinct
INNER JOIN #T AS T ON T.ID = CTE_Distinct.StartID
GROUP BY
CTE_Distinct.StartID
,T.name
ORDER BY CTE_Distinct.StartID;
Result
+---------+--------------+-----------------+
| StartID | name | DescendantCount |
+---------+--------------+-----------------+
| 1 | Euler | 11 |
| 2 | Lagrange | 10 |
| 3 | Laplace | 9 |
| 4 | Fourier | 8 |
| 5 | Poisson | 8 |
| 6 | Dirichlet | 7 |
| 7 | Lipschitz | 6 |
| 8 | Klein | 5 |
| 9 | Lindemann | 2 |
| 10 | Furtwangler | 2 |
| 11 | Hilbert | 1 |
| 12 | Taussky-Todd | 1 |
+---------+--------------+-----------------+
Here DescendantCount counts the node itself as a descendant. You can subtract 1 from this result if you want to see 0 instead of 1 for the leaf nodes.
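To sanity-check the counts independently of the SQL, the same traversal can be sketched in Python: build a map from each advisor to their direct students, then collect the distinct nodes reachable from every starting point. The function name descendant_counts is hypothetical, and unlike DescendantCount above this sketch does not count the node itself, so each value is one lower.

```python
from collections import defaultdict


def descendant_counts(rows):
    """Map each ID to the number of distinct descendants (excluding itself)."""
    children = defaultdict(set)
    for node_id, _name, advisor1, advisor2 in rows:
        for advisor in (advisor1, advisor2):
            if advisor is not None:
                children[advisor].add(node_id)

    counts = {}
    for node_id, *_ in rows:
        seen = set()  # plays the role of DISTINCT in CTE_Distinct
        stack = list(children.get(node_id, ()))
        while stack:
            current = stack.pop()
            if current in seen:
                continue  # reached again through a second path
            seen.add(current)
            stack.extend(children.get(current, ()))
        counts[node_id] = len(seen)
    return counts


mathematicians = [
    (1, "Euler", None, None), (2, "Lagrange", 1, None), (3, "Laplace", None, None),
    (4, "Fourier", 2, None), (5, "Poisson", 2, 3), (6, "Dirichlet", 4, 5),
    (7, "Lipschitz", 6, None), (8, "Klein", None, 7), (9, "Lindemann", 8, None),
    (10, "Furtwangler", 8, None), (11, "Hilbert", 9, None),
    (12, "Taussky-Todd", 10, None),
]
counts = descendant_counts(mathematicians)
for node_id, name, *_ in mathematicians:
    print(node_id, name, counts[node_id])
```

Dirichlet is reachable from Lagrange through both Fourier and Poisson, and the seen set is what keeps that double path from being counted twice.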

Set-based way to calculate family ranges in SQL?

I have a table that contains parents and 0 or more children for each parent, with a flag indicating which records are parents. All of the members of a given family have the same parent id, and the parent always has the lowest id in a given family. Also, each child has a value associated with it. (Specifically, this is a database of emails and attachments, where each parent is an email and the children are the attachments.)
I have two fields I need to calculate:
Range = {lowest id in family} - {highest id in family} [populated for all members]
Value-list = {delimited list of the values of each child, in id order} [only for parent]
So, given this:
Id | Parent| HasChildren| Value | Range | Value-list
----------------------------------------|-----------
1 | 1 | 1 | | |
2 | 1 | 0 | a | |
3 | 1 | 0 | b | |
4 | 4 | 1 | | |
5 | 4 | 0 | c | |
6 | 6 | 0 | | |
I would like to end up with this:
Id | Parent| HasChildren| Value | Range | Value-list
----------------------------------------|-----------
1 | 1 | 1 | | 1-3 | a;b
2 | 1 | 0 | a | 1-3 |
3 | 1 | 0 | b | 1-3 |
4 | 4 | 1 | | 4-5 | c
5 | 4 | 0 | c | 4-5 |
6 | 6 | 0 | | 6-6 |
How can I do this efficiently? Ideally, I'd like to do this with just set-based logic, without cursors, or even stored procedures. Temporary tables are fine.
I'm working in T-SQL, if that makes a difference, though I'd be curious to see platform agnostic answers.
The following SQLFiddle solution should do the job for you; however, as @Allan mentioned, you might want to revise your database structure.
Using CTEs:
Note: my query uses table1 as the name of your table.
with cte as (
select parent
,ValueList = stuff(( select ';' + isnull(t2.Value, '')
from table1 t2
where t1.parent = t2.parent
order by t2.id -- id order, as the question asks; the parent's empty value comes first
FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)'), 1, 2, '') -- strips the leading ';' and the ';' after the parent's empty value
from table1 t1
group by parent
),
cte2 as (
select parent
, min(id) as FirstID
, max(id) as LastID
from table1
group by parent
)
select *
-- assuming id is an integer column: cast so that + concatenates strings instead of adding numbers
,(select cast(FirstID as varchar(20)) from cte2 t2 where t2.parent=t1.parent)+'-'+(select cast(LastID as varchar(20)) from cte2 t2 where t2.parent=t1.parent) as [Range]
,(select ValueList from cte t2 where t1.parent=t2.parent and t1.[haschildren]='1') as [Value-list]
from table1 t1
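For comparison, here is a platform-agnostic sketch of the same two calculations in Python, which may be useful for verifying the SQL output on a small extract. The function name family_fields and the tuple layout (id, parent, has_children, value) are assumptions made for illustration.

```python
def family_fields(rows):
    """Return (id, range, value_list) for each (id, parent, has_children, value) row."""
    families = {}
    # Visit rows in id order so child values are collected in id order.
    for row_id, parent, _has_children, value in sorted(rows, key=lambda r: r[0]):
        family = families.setdefault(parent, {"ids": [], "values": []})
        family["ids"].append(row_id)
        if row_id != parent and value:
            family["values"].append(value)  # only children contribute values

    result = []
    for row_id, parent, has_children, _value in rows:
        family = families[parent]
        id_range = f"{min(family['ids'])}-{max(family['ids'])}"
        # The value list is populated only on the parent row of a family.
        value_list = ";".join(family["values"]) if has_children else None
        result.append((row_id, id_range, value_list))
    return result


emails = [
    (1, 1, 1, None), (2, 1, 0, "a"), (3, 1, 0, "b"),
    (4, 4, 1, None), (5, 4, 0, "c"), (6, 6, 0, None),
]
for row in family_fields(emails):
    print(row)
```

On the sample table from the question this yields ranges 1-3, 4-5 and 6-6, with a;b and c on the two parent rows.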