SQL / BigQuery: How to group by chain of pairs in different columns?

I am trying to aggregate rows on a chain of pairs. For example, given table:
pair0 | pair1
a z
b a
c b
d b
m n
z y
Returns:
matches
[a, z, b, c, d, y]
[m, n]
where order of the pairs doesn't matter.
I've tried joining the table to itself, but I am unable to aggregate this way without putting the join in a loop that runs for the maximum number of possible combinations.
SELECT
[a.pair0, a.pair1, b.pair0, b.pair1] as matches
FROM pairs a
LEFT JOIN pairs b
ON a.pair0 = b.pair1
GROUP BY
matches
I would then filter matches for distinct values. However, this solution only works if the chain is limited to two rows. In the example above, the chain extends for 5 rows. Grouping by an array is also not allowed.

BigQuery supports recursive queries: yay! With this feature at hand, we can try and solve this graph-walking problem.
The idea is to start by generating all possible edges, and then recursively traverse the graph, taking care not to visit the same node twice. Once all paths are visited, we can identify which group each node belongs to by looking at the aggregated list of nodes reachable from it.
with recursive
edges as (
-- list every pair in both directions so the graph can be walked either way
-- (BigQuery has no LATERAL keyword, so a plain union all is used here)
select pair0, pair1 from pairs
union all
select pair1, pair0 from pairs
),
cte as (
select pair0, pair1, [ pair1 ] visited
from edges
union all
select c.pair0, e.pair1, array_concat(c.visited, [ e.pair1 ] )
from cte c
inner join edges e on e.pair0 = c.pair1 and e.pair1 not in unnest(c.visited)
),
res as (
select pair0, array_agg(distinct pair1 order by pair1) grp
from cte
group by pair0
)
select distinct grp from res
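For comparison outside SQL, the same grouping can be sketched in Python. This is not the recursive path traversal used in the query above; it swaps in a union-find structure, which merges the two values of each pair into one set. The function name `group_chained_pairs` is hypothetical and the data is the example from the question.

```python
from collections import defaultdict


def group_chained_pairs(pairs):
    """Group values that are linked, directly or transitively, by any pair."""
    parent = {}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two sets

    groups = defaultdict(list)
    for node in list(parent):
        groups[find(node)].append(node)
    # Sorted only to make the result deterministic; order does not matter.
    return sorted(sorted(members) for members in groups.values())


pairs = [("a", "z"), ("b", "a"), ("c", "b"), ("d", "b"), ("m", "n"), ("z", "y")]
groups = group_chained_pairs(pairs)
print(groups)  # [['a', 'b', 'c', 'd', 'y', 'z'], ['m', 'n']]
```

Unlike the recursive query, this does not enumerate every path, so it avoids the growth in intermediate rows that long chains can cause.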

Related

SQL query for topological sort

I have a directed acyclic graph:
DROP TABLE IF EXISTS #Edges
CREATE TABLE #Edges(from_node int, to_node int);
INSERT INTO #Edges VALUES (1,2),(1,3),(1,4),(5,1);
I want to list all nodes, always listing a to_node before its from_node.
For example: 2, 3, 4, 1, 5.
This is also called a topological ordering. How can it be done in SQL?
You can use a recursive CTE to calculate the depth. Then order by the depth:
with cte as (
select e.from_node, e.to_node, 1 as lev
from edges e
where not exists (select 1 from edges e2 where e2.to_node = e.from_node)
union all
select e.from_node, e.to_node, lev + 1
from cte join
edges e
on e.from_node = cte.to_node
)
select *
from cte
order by lev desc;
EDIT:
I notice that you do not have "1" in your edges list. To handle this:
with cte as (
select 1 as from_node, e.from_node as to_node, 1 as lev
from edges e
where not exists (select 1 from edges e2 where e2.to_node = e.from_node)
union all
select e.from_node, e.to_node, lev + 1
from cte join
edges e
on e.from_node = cte.to_node
-- where lev < 5
)
select *
from cte
order by lev desc;
Here is a db<>fiddle.
DROP TABLE IF EXISTS #topological_sorted
CREATE TABLE #topological_sorted(id int identity(1,1) primary key, n int);
WITH rcte(n) AS (
SELECT e1.to_node
FROM #Edges AS e1
LEFT JOIN #Edges AS e2 ON e1.to_node = e2.from_node
WHERE e2.from_node IS NULL
UNION ALL
SELECT e.from_node
FROM #Edges AS e
JOIN rcte ON e.to_node = rcte.n
)
INSERT INTO #topological_sorted(n)
SELECT *
FROM rcte;
SELECT * FROM #topological_sorted
Nodes might be listed several times. We only want to keep the first occurrence:
DROP TABLE IF EXISTS #topological_sorted_2
SELECT *, MIN(id) OVER (PARTITION BY n) AS idm
INTO #topological_sorted_2
FROM #topological_sorted
ORDER BY id;
SELECT * FROM #topological_sorted_2
WHERE id=idm
ORDER BY id;
I found this question after running into a similar problem. Since neither @Ludovic's nor @Gordon's answer fully answered all my particular questions, and there's not enough room in the comments, I decided to summarize my own answer.
Recursive query solution is good
Basically, @Gordon's answer is based on a graph traversal of all paths. The cte.lev column actually represents the length of the path from some start node to cte.to_node.
What about loops?
What's not clear is what to return when multiple paths into a particular node are possible (i.e. if the undirected version of the DAG had loops). For example, in the following graph
1
^
|\
2 \
^ /
|/
3
node 1 is reachable from the initial node 3 directly at distance 1 and via node 2 at distance 2. Hence node 1 is expanded twice, with different path lengths.
Let v be the greater of the two values. In general, v can be defined as the length of the longest path from a start node to the given node. This value corresponds to a topological ordering: it splits the nodes into chunks so that for any two nodes n1, n2 with values v1, v2 respectively, n1 comes before n2 when v1<v2, and the ordering of n1, n2 is arbitrary when v1=v2. (I have no exact proof, but by contradiction: if this ordering did not hold, there would have to be a counter-directed edge or an edge within a chunk, so v would not be the length of the longest path.)
Hence the SQL is (original example fiddle, my looped example fiddle):
with cte as (
select 0 as from_node, e.from_node as to_node, 1 as lev
from edges e
where not exists (select 1 from edges e2 where e2.to_node = e.from_node)
union all
select e.from_node, e.to_node, lev + 1
from cte join
edges e
on e.from_node = cte.to_node
)
select to_node, max(lev)
from cte
group by to_node
order by max(lev)
(This is close to @Ludovic's answer, but Ludovic relies on ordering by id, which IMHO cannot guarantee the proper ordering in the general case.)
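The longest-path idea can also be sketched outside SQL. The Python below is only an illustration under the assumption that the graph is acyclic; `longest_path_levels` is a hypothetical name. It computes, for each node, one plus the largest level among its direct predecessors, which is the same value as max(lev) in the query above.

```python
from collections import defaultdict
from functools import lru_cache


def longest_path_levels(edges):
    """Map each node to the length of the longest path ending there (roots = 1)."""
    parents = defaultdict(list)
    nodes = set()
    for from_node, to_node in edges:
        parents[to_node].append(from_node)
        nodes.update((from_node, to_node))

    @lru_cache(maxsize=None)
    def level(node):
        # A node without predecessors gets level 1, like the anchor of the CTE.
        return 1 + max((level(p) for p in parents[node]), default=0)

    return {node: level(node) for node in nodes}


# Edges from the original question: (from_node, to_node).
levels = longest_path_levels([(1, 2), (1, 3), (1, 4), (5, 1)])
# Descending level lists every to_node before its from_node.
order = sorted(levels, key=lambda n: (-levels[n], n))
print(order)  # [2, 3, 4, 1, 5]
```

Because each node's level is cached, a node reached by several paths is evaluated only once. The recursion depth grows with the longest path, so very deep graphs would need an iterative variant.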
Optimization
The recursive CTE now generates rows with to_node and the length of the path to it. If a node is reached by multiple paths of the same length, each of those paths expands into new rows at the next level of recursion, which generates duplicate rows and for some graphs can lead to a combinatorial explosion. For example, in the following graph (let the edges be directed from left to right)
  B   E
 / \ / \
A   D   G
 \ / \ /
  C   F
node D is reached from A via two paths, but the algorithm does not take this into account; hence E and F each have two paths as well, and G has four.
For an SQL-based solution in an ideal world, adding distinct would suffice; it would eliminate the duplicate expansion of the D-E and D-F edges:
select distinct 0 as from_node, e.from_node as to_node, 1 as lev
from edges e
where not exists (select 1 from edges e2 where e2.to_node = e.from_node)
union all
select distinct e.from_node, e.to_node, lev + 1
from cte join
edges e
on e.from_node = cte.to_node
Unfortunately this doesn't work in SQL Server, which fails with the error "DISTINCT operator is not allowed in the recursive part of a recursive common table expression 'cte'". (I actually work with Oracle, where the result is analogous: ORA-32486 unsupported operation in recursive branch of recursive WITH clause.) Similarly, neither group by nor query nesting tricks can be used.
At this point I gave up on SQL Server, but for Oracle there is one more solution, based on window functions. In the recursive part of the query it is possible to define a bunch of duplicate rows as a partition, number the rows within that partition, and choose only one of the potentially many duplicates.
with edges (from_node,to_node) as (
select 'A','B' from dual union all
select 'A','C' from dual union all
select 'B','D' from dual union all
select 'C','D' from dual union all
select 'D','E' from dual union all
select 'D','F' from dual union all
select 'E','G' from dual union all
select 'F','G' from dual
)
, cte (from_node, to_node, lev, dup) as (
select distinct null as from_node, e.from_node as to_node, 0 as lev, 1 as dup
from edges e
where not exists (select 1 from edges e2 where e2.to_node = e.from_node)
union all
select e.from_node, e.to_node, cte.lev + 1
, row_number() over (partition by e.to_node, cte.lev order by null) as dup
from cte
join edges e on e.from_node = cte.to_node
where cte.dup = 1
)
select to_node, lev from cte where dup = 1 order by lev
The drawback is that the row_number of the current level of recursion cannot be filtered in the where condition. Hence we must accept that duplicate rows pass through and expand into the next level of recursion, where they are finally pruned. However, this heuristic is still useful: I was querying the Oracle dba_dependencies table and the query didn't terminate at all without it.
I didn't find a way to make this small trick work in SQL Server, since SQL Server handles window functions in recursive queries differently. Sorry for cluttering the question with Oracle issues, but I consider this topic interesting for anyone who finds this question.
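The effect of the deduplication can be illustrated with a level-by-level expansion in Python. This is a rough sketch, not a translation of the Oracle query: it assumes an acyclic graph (it would not terminate on a cycle), the name `levels_by_frontier` is hypothetical, and a set plays the role of the `dup = 1` filter by keeping each node once per level.

```python
from collections import defaultdict


def levels_by_frontier(edges):
    """Longest-path level per node, expanding each node once per level."""
    children = defaultdict(set)
    nodes, has_parent = set(), set()
    for from_node, to_node in edges:
        children[from_node].add(to_node)
        nodes.update((from_node, to_node))
        has_parent.add(to_node)

    level = {}
    frontier = nodes - has_parent  # roots start at level 0, as in the Oracle query
    depth = 0
    while frontier:
        for node in frontier:
            level[node] = depth  # a deeper visit overwrites a shallower one
        # The set collapses duplicates: D is expanded once although both
        # B and C lead to it.
        frontier = {child for node in frontier for child in children[node]}
        depth += 1
    return level


edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "G"), ("F", "G")]
levels = levels_by_frontier(edges)
print(sorted(levels.items(), key=lambda item: item[1]))
```

With this approach the work per level is bounded by the number of nodes, so the diamond-shaped graph above no longer doubles the number of rows at each stage.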
I needed a topological sort for a SQLite application, and the following works for SQLite 3.37.0, using @Tomáš's code and data. In SQLite, DISTINCT works within a recursive CTE. I have added an additional dependency between his nodes 'C' and 'F' to make things a little more interesting, but it works the same without this edge.
I need to determine the order of processing entities in a dependency management system, similar to @Ludovic's need, so I changed the sorting order to DESC so the first item returned is the first item to process.
DROP TABLE IF EXISTS edges;
CREATE TABLE edges(from_node text, to_node text);
INSERT INTO edges VALUES ('A','B'),('A','C'),('B','D'),('C','D')
, ('D','E'),('D','F'),('E','G'),('F','G')
, ('C','F');
with recursive cte as (
select distinct 0 as from_node, e.from_node as to_node, 1 as lev
from edges e
where not exists (select 1 from edges e2 where e2.to_node = e.from_node)
union all
select e.from_node, e.to_node, lev + 1
from cte join
edges e
on e.from_node = cte.to_node
)
select to_node, max(lev) from cte group by to_node order by max(lev) desc
;
Result:
to_node max(lev)
------- --------
G 5
F 4
E 4
D 3
C 2
B 2
A 1
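To reproduce this result without a SQLite shell, the query can be run through Python's built-in sqlite3 module. This is a minimal harness rather than part of the answer; it uses an in-memory database, declares the columns as TEXT (an assumption, since the node names are letters), and leaves the query itself unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges(from_node TEXT, to_node TEXT);
    INSERT INTO edges VALUES ('A','B'),('A','C'),('B','D'),('C','D'),
                             ('D','E'),('D','F'),('E','G'),('F','G'),
                             ('C','F');
""")

query = """
with recursive cte as (
    select distinct 0 as from_node, e.from_node as to_node, 1 as lev
    from edges e
    where not exists (select 1 from edges e2 where e2.to_node = e.from_node)
    union all
    select e.from_node, e.to_node, lev + 1
    from cte join edges e on e.from_node = cte.to_node
)
select to_node, max(lev) from cte group by to_node order by max(lev) desc
"""

rows = conn.execute(query).fetchall()
levels = dict(rows)  # node -> longest path length
for to_node, lev in rows:
    print(to_node, lev)
```

Nodes that share a level (E and F, B and C) may come back in either order, because the query sorts by max(lev) only.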

Count number of points within different ranges SQL

We have real estate point X.
We want to calculate the number of stations within
0-200 m
200-400 m
400-600 m
After I have this, I will later create a new table where these are summarized according to mathematical expressions.
SELECT loc_dist.id, loc_dist.namn1, grps.grp, count(*)
FROM (
SELECT b.id, b.namn1, ST_Distance_Sphere(b.geom, s.geom) AS dist
FROM stations s, bostader b) AS loc_dist
JOIN (
VALUES (1,200.), (2,400.), (3,600.)
) AS grps(grp, dist) ON loc_dist.dist < grps.dist
GROUP BY 1,2,3
ORDER BY 1,2,3;
I have this now, but it takes forever to run and I can't get any results, since I have more than 2000 entries in both b and s. I want the number of s for a specific b, but this calculates it for all of them. How do I add a:
WHERE b.id= 114477
for example? I only get a syntax error on the join when I try to do this. I only want grouped distances for one, or maybe 5, different b, depending on their b.id.
After a lot of help from TA, the answer is here and works nicely. I added ranges and a BETWEEN clause to get the count within each ring:
SELECT loc_dist.id, loc_dist.namn1, grps.grp, count(*)
FROM (
SELECT b.id, b.namn1, ST_Distance_Sphere(b.geom, s.geom) AS dist
FROM stations s, bostader b WHERE b.id=114477) AS loc_dist
JOIN (
VALUES (1,0,200), (2,200,400), (3,400,600)
) AS grps(grp, dist_l, dist_u) ON loc_dist.dist BETWEEN dist_l AND dist_u
GROUP BY 1,2,3
ORDER BY 1,2,3;
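One detail of the ranges is worth checking: BETWEEN includes both endpoints, so with the ranges (0,200) and (200,400) a station at exactly 200 m would be counted in two rings. The Python sketch below shows the binning with half-open rings instead. That boundary choice is an assumption, not part of the answer, and `count_by_ring` and the sample distances are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def count_by_ring(distances, bounds=(0, 200, 400, 600)):
    """Count distances per ring [bounds[i-1], bounds[i]), rings numbered from 1."""
    counts = Counter()
    for dist in distances:
        ring = bisect_right(bounds, dist)  # number of bounds that are <= dist
        if 1 <= ring < len(bounds):        # ignore values outside every ring
            counts[ring] += 1
    return dict(counts)


# Hypothetical distances in metres from one real estate point to its stations.
distances = [50, 150, 200, 399.9, 450, 600, 750]
ring_counts = count_by_ring(distances)
print(ring_counts)  # {1: 2, 2: 2, 3: 1}
```

In SQL the same half-open behaviour could be expressed with `dist >= dist_l AND dist < dist_u` in place of BETWEEN.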

Split array by portions in PostgreSQL

I need to split an array into pairs of adjacent values.
For example I have following array:
select array[1,2,3,4,5]
And I want to get 4 rows with following values:
{1,2}
{2,3}
{3,4}
{4,5}
Can I do it with an SQL query?
select a
from (
select array[e, lead(e) over()] as a
from unnest(array[1,2,3,4,5]) u(e)
) a
where not exists (
select 1
from unnest(a) u (e)
where e is null
);
a
-------
{1,2}
{2,3}
{3,4}
{4,5}
One option is to do this with a recursive cte, starting from the first position in the array and going up to the last.
with recursive cte(a,val,strt,ed,l) as
(select a,a[1:2] as val,1 strt,2 ed,cardinality(a) as l
from t
union all
select a,a[strt+1:ed+1],strt+1,ed+1,l
from cte where ed<l
)
select val from cte
Here, a in the cte is the array.
Another option, if you know the max length of the array, is to use generate_series to get all numbers from 1 to the max length and cross join them with the array table, filtering on cardinality. Then use lead to get slices of the array and omit the last one (as lead on the last row of a given partition would be null).
with nums(n) as (select * from generate_series(1,10))
select a,res
from (select a,t.a[nums.n:lead(nums.n) over(partition by t.a order by nums.n)] as res
from nums
cross join t
where cardinality(t.a)>=nums.n
) tbl
where res is not null
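For reference, the pairing that these queries produce is the pattern of zipping a sequence with itself shifted by one. A small Python sketch, with the hypothetical name `adjacent_pairs`, shows the expected rows for the example array:

```python
def adjacent_pairs(values):
    """Return every pair of neighbouring elements, in order."""
    # zip stops at the shorter input, so the last element never starts a pair.
    return [[left, right] for left, right in zip(values, values[1:])]


pairs = adjacent_pairs([1, 2, 3, 4, 5])
print(pairs)  # [[1, 2], [2, 3], [3, 4], [4, 5]]
```

An array of n elements yields n - 1 pairs, which is why the SQL versions have to drop the final row where lead returns null.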

Get Row's Sequence (Linked-List) in PostgreSQL

I have a submissions table which is essentially a singly linked list. Given the id of a row, I want to return the entire list that row is part of, in the proper order. For example, in the table below, if I had id 2 I would want to get back rows 1,2,3,4 in that order.
(4,3) -> (3,2) -> (2,1) -> (1,null)
I expect 1,2,3,4 here because 4 is essentially the head of the list that 2 belongs to, and I want to traverse all the way through the list.
http://sqlfiddle.com/#!15/c352e/1
Is there a way to do this using PostgreSQL's RECURSIVE CTE? So far I have the following, but it only gives me the parents and not the descendants:
WITH RECURSIVE "sequence" AS (
SELECT * FROM submissions WHERE "submissions"."id" = 2
UNION ALL SELECT "recursive".* FROM "submissions" "recursive"
INNER JOIN "sequence" ON "recursive"."id" = "sequence"."link_id"
)
SELECT "sequence"."id" FROM "sequence"
This approach uses what you have already come up with.
It adds another block to calculate the rest of the list and then combines both doing a custom reverse ordering.
WITH RECURSIVE pathtobottom AS (
-- Get the path from element to bottom list following next element id that matches current link_id
SELECT 1 i, -- add fake order column to reverse retrieved records
* FROM submissions WHERE submissions.id = 2
UNION ALL
SELECT pathtobottom.i + 1 i, -- add fake order column to reverse retrieved records
recursive.* FROM submissions recursive
INNER JOIN pathtobottom ON recursive.id = pathtobottom.link_id
)
, pathtotop AS (
-- Get the path from element to top list following previous element link_id that matches current id
SELECT 1 i, -- add fake order column to reverse retrieved records
* FROM submissions WHERE submissions.id = 2
UNION ALL
SELECT pathtotop.i + 1 i, -- add fake order column to reverse retrieved records
recursive2.* FROM submissions recursive2
INNER JOIN pathtotop ON recursive2.link_id = pathtotop.id
), pathtotoprev as (
-- Reverse path to top using fake 'i' column
SELECT pathtotop.id FROM pathtotop order by i desc
), pathtobottomrev as (
-- Reverse path to bottom using fake 'i' column
SELECT pathtobottom.id FROM pathtobottom order by i desc
)
-- Elements ordered from bottom to top
SELECT pathtobottomrev.id FROM pathtobottomrev where id != 2 -- remove element to avoid duplicate
UNION ALL
SELECT pathtotop.id FROM pathtotop;
/*
-- Elements ordered from top to bottom
SELECT pathtotoprev.id FROM pathtotoprev
UNION ALL
SELECT pathtobottom.id FROM pathtobottom where id != 2; -- remove element to avoid duplicate
*/
It was yet another quest for my brain. Thanks.
with recursive r as (
select *, array[id] as lst from submissions s where id = 6
union all
select s.*, r.lst || s.id
from
submissions s inner join
r on (s.link_id=r.id or s.id=r.link_id)
where (not array[s.id] <@ r.lst)
)
select * from r;
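The two-direction walk that both answers perform can be sketched in Python. The sketch assumes the data from the question, that at most one row links to any given id, and that the list has no cycles; `full_sequence` is a hypothetical name.

```python
def full_sequence(rows, start_id):
    """Return all ids of the list containing start_id, tail (link None) first."""
    link_of = dict(rows)  # id -> link_id
    linked_from = {link: id_ for id_, link in rows if link is not None}

    # Walk towards the tail by following link_id.
    down = []
    current = link_of[start_id]
    while current is not None:
        down.append(current)
        current = link_of[current]

    # Walk towards the head by finding the row whose link_id is the current id.
    up = []
    current = linked_from.get(start_id)
    while current is not None:
        up.append(current)
        current = linked_from.get(current)

    return down[::-1] + [start_id] + up


# (id, link_id) rows from the question: (4,3) -> (3,2) -> (2,1) -> (1,null)
rows = [(4, 3), (3, 2), (2, 1), (1, None)]
sequence = full_sequence(rows, 2)
print(sequence)  # [1, 2, 3, 4]
```

The start row is added once in the middle, which corresponds to the `id != 2` filter that the first query uses to avoid a duplicate.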

Limit and sort, from what word must start next result

My problem is that I use a query which must return N values the first time I call it, then the next N values, and so on. I also have several kinds of sort, such as sort by date and sort by word name, each in asc and desc variants. When I sort by date I can use something like id>N (a>3 in my code): if the first id is 4 and the last id is 9, then in the next query the first will be 10 and the last 15, and so on. But what should I do if I need to sort by word name? How can I determine which word to start from?
select distinct s.a,w._word
from (
select a from edges
where a in
(
select distinct w._id
from edges as e
inner join words as w
on w._id=e.a
where w.lang_id=2
) and b in
(
select distinct w._id
from edges as e
inner join words as w
on w._id=e.b
where w.lang_id=1
)
union
select b from edges
where b in
(
select distinct w._id
from edges as e
inner join words as w
on w._id=e.b
where w.lang_id=2
) and a in
(
select distinct w._id
from edges as e
inner join words as w
on w._id=e.a
where w.lang_id=1
)
) as s
inner join words as w
on s.a=w._id
inner join groups_set as gs
on w._id=gs.word_id
where gs.group_id in (1,2,3) or w._word like '%d%' and a>3
order by w._word desc limit 5
Am I getting the question wrong? LIMIT can be used with an offset.
Either like
... LIMIT 5 OFFSET 5
or
... LIMIT 5, 5
You don't work the offset out in the query, you just increase it in your application.
And with this you can also ORDER BY whatever you want.
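A small sketch of both approaches, using Python's built-in sqlite3 module with a hypothetical words table and page size. `page_by_offset` follows the answer above: the application only tracks the page number. `page_after_word` is a different technique (keyset pagination) that starts from the last word of the previous page, and it assumes _word values are unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words(_id INTEGER PRIMARY KEY, _word TEXT)")
conn.executemany(
    "INSERT INTO words(_word) VALUES (?)",
    [(w,) for w in ["apple", "bird", "cloud", "door", "eagle", "field", "garden"]],
)

PAGE_SIZE = 3  # hypothetical page size


def page_by_offset(page):
    """Fetch page number `page` (0-based); the offset is computed by the caller."""
    sql = "SELECT _word FROM words ORDER BY _word DESC LIMIT ? OFFSET ?"
    return [row[0] for row in conn.execute(sql, (PAGE_SIZE, page * PAGE_SIZE))]


def page_after_word(last_word):
    """Keyset variant: continue after the last word of the previous page."""
    sql = "SELECT _word FROM words WHERE _word < ? ORDER BY _word DESC LIMIT ?"
    return [row[0] for row in conn.execute(sql, (last_word, PAGE_SIZE))]


first = page_by_offset(0)
second = page_by_offset(1)
print(first)   # ['garden', 'field', 'eagle']
print(second)  # ['door', 'cloud', 'bird']
print(page_after_word(first[-1]))
```

The offset version works with any ORDER BY, as the answer says. The keyset version needs the comparison to match the sort direction (`<` for desc, `>` for asc).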