How can i find all linked rows by array values in postgres? - sql

i have a table like this:
id | arr_val | grp
-----------------
1 | {10,20} | -
2 | {20,30} | -
3 | {50,5} | -
4 | {30,60} | -
5 | {1,5} | -
6 | {7,6} | -
I want to find out which rows are in a group together.
In this example 1,2,4 are one group because 1 and 2 have a common element and 2 and 4. 3 and 5 form a group because they have a common element. 6 Has no common elments with anybody else. So it forms a group for itself.
The result should look like this:
id | arr_val | grp
-----------------
1 | {10,20} | 1
2 | {20,30} | 1
3 | {50,5} | 2
4 | {30,60} | 1
5 | {1,5} | 2
6 | {7,6} | 3
I think i need recursive cte because my problem is graphlike but i am not sure how to that.
Additional info and background:
The Table has ~2500000 rows.
In reality the problem i try to solve has more fields and conditions for finding a group:
id | arr_val | date | val | grp
---------------------------------
1 | {10,20} | -
2 | {20,30} | -
Not only do the element of a group need to be linked by common elements in arr_val. They all need to have the same value in val and need to be linked by a timespan in date (gaps and islands). I solved the other two but now the condition of my question was added. If there is an easy way to do all three together in one query that would be awesome but it is not necessary.
----Edit-----
While both answer work for the example of five rows they do not work for a table with a lot more rows. Both answers have the problem that the number of rows in the recursive part explodes and only reduce them at the end.
A solutiuon should work for data like this too:
id | arr_val | grp
-----------------
1 | {1} | -
2 | {1} | -
3 | {1} | -
4 | {1} | -
5 | {1} | -
6 | {1} | -
7 | {1} | -
8 | {1} | -
9 | {1} | -
10 | {1} | -
11 | {1} | -
more rows........
Is there a solution to that problem?

You can handle this as a recursive CTE. Define the edges between the ids based on common values. Then traverse the edges and aggregate:
with recursive nodes as (
select id, val
from t cross join
unnest(arr_val) as val
),
edges as (
select distinct n1.id as id1, n2.id as id2
from nodes n1 join
nodes n2
on n1.val = n2.val
),
cte as (
select id1, id2, array[id1] as visited, 1 as lev
from edges
where id1 = id2
union all
select cte.id1, e.id2, visited || e.id2,
lev + 1
from cte join
edges e
on cte.id2 = e.id1
where e.id2 <> all(cte.visited)
),
vals as (
select id1, array_agg(distinct id2 order by id2) as id2s
from cte
group by id1
)
select *, dense_rank() over (order by id2s) as grp
from vals;
Here is a db<>fiddle.

Here is an approach at this graph-walking problem:
with recursive cte as (
select id, arr_val, array[id] path from mytable
union all
select t.id, t.arr_val, c.path || t.id
from cte c
inner join mytable t on t.arr_val && c.arr_val and not t.id = any(c.path)
)
select c.id, c.arr_val, dense_rank() over(order by min(x.id)) grp
from cte c
cross join lateral unnest(c.path) as x(id)
group by c.id, c.arr_val
order by c.id
The common-table-expression walks the graph, recursively looking for "adjacent" nodes to the current node, while keeping track of already-visited nodes. Then the outer query aggregates, identifies groups using the least node per path, and finally ranks the groups.
Demo on DB Fiddle:
id | arr_val | grp
-: | :------ | --:
1 | {10,20} | 1
2 | {20,30} | 1
3 | {50,5} | 2
4 | {30,60} | 1
5 | {1,5} | 2
6 | {7,6} | 3

While Gordon Linoffs solution is the fastest i found for a small amount of data where the groups are not so big it will not work for bigger datasets and bigger groups.
I changed his solution to make it work.
I moved the edges to an indexed table:
create table edges
(
id1 integer not null,
id2 integer not null,
constraint staffel_group_nodes_pk
primary key (id1, id2)
);
insert into edges(id1, id2) with
nodes(id, arr_val) as (
select id, arr_val
from my_table
)
select n1.id as id1, n2.id as id2
from nodes n1
join
nodes n2
on n1.arr_val && n2.arr_val ;
Tha alone didnt help. I changed his recursive part too:
with recursive
cte as (
select id1, array [id1] as visited
from edges
where id1 = id2
union all
select unnested.id1, array_agg(distinct unnested.vis) as visited
from (
select cte.id1,
unnest(cte.visited || e.id2) as vis
from cte
join
staffel_group_edges e
on e.id1 = any (cte.visited)
and e.id2 <> all (cte.visited)) as unnested
group by unnested.id1
),
vals as (
select id1, array_agg(distinct vis) as id2s
from (
select cte.id1,
unnest(cte.visited) as vis
from cte) as unnested
group by unnested.id1
)
select id1,id2s, dense_rank() over (order by id2s) as grp
from vals;
Every step i group all searches by their starting point. This reduces the amount of parallel walked paths a lot and works surprisingly fast.

Related

T-SQL Return All Combinations of a Table

In a different question, I asked about all possible 3-way combinations of a table, assuming that there are 3 articles. In this question, I would like to further extend the problem to return all n-way combinations of a table with n distinct articles. One article can have multiple suppliers. My goal is to have groups of combinations with each group having each article.
Below is a sample table but keep in mind that there could be more than those 3 articles.
+---------+----------+
| Article | Supplier |
+---------+----------+
| 4711 | A |
| 4712 | B |
| 4712 | C |
| 4712 | D |
| 4713 | C |
| 4713 | E |
+---------+----------+
For the example above, there would be 6 combination pairs possible with 18 datasets. Below is how the solution for the example above is supposed to look like:
+----------------+---------+----------+
| combination_nr | article | supplier |
+----------------+---------+----------+
| 1 | 4711 | A |
| 1 | 4712 | B |
| 1 | 4713 | C |
| 2 | 4711 | A |
| 2 | 4712 | B |
| 2 | 4713 | E |
| 3 | 4711 | A |
| 3 | 4712 | C |
| 3 | 4713 | C |
| 4 | 4711 | A |
| 4 | 4712 | D |
| 4 | 4713 | E |
| 5 | 4711 | A |
| 5 | 4712 | D |
| 5 | 4713 | C |
| 6 | 4711 | A |
| 6 | 4712 | D |
| 6 | 4713 | E |
+----------------+---------+----------+
I am asking this in a different question because in the other question I have not specified that I need this to be dynamic and not only the 3 articles from above.
Here you can find the old question.
If you want to generate all combinations of articles, then you can use a recursive CTE. However, the trick is to store all the ids at each level -- but that requires using a string (of some sort).
So, for the articles, you can do:
with a as (
select distinct article from t
),
cte as (
select a.article, 1 as lev, convert(varchar(max), article) as grp
from a
union all
select a.article, lev + 1, concat(grp, ':', a.article)
from cte join
a
on a.article > cte.article
)
select cte.combination_number, s.value
from (select cte.*, row_number() over (order by lev, grp) as combination_number
from cte
) cte cross apply
string_split(grp, ':') s
order by cte.combination_number, s.value;
You also seem to want one of the suppliers as well. You can arbitrarily choose a supplier -- but I'm not sure what rules you might want for that.
Here is a db<>fiddle.
EDIT:
If you want the article supplier pairs considered as a unit -- but different articles in each combination -- you can tweak the above query:
with a as (
select distinct article from t
),
cte as (
select a.article, 1 as lev, convert(varchar(max), article) as grp
from a
union all
select a.article, lev + 1, concat(grp, ':', a.article)
from cte join
a
on a.article > cte.article
)
select cte.combination_number, s.value
from (select cte.*, row_number() over (order by lev, grp) as combination_number
from cte
) cte cross apply
string_split(grp, ':') s
order by cte.combination_number, s.value;
The db<>fiddle has this solution as well.
To create this result you'll need to determine two things:
A set of data with ids for the the number of possible combinations.
A way to determine when an article/supplier pair corresponds to a combination id.
In order to figure out the set of combinations, you need to first figure out the number of combinations. To do that, you need to figure out the size of each supplier set for each article. Getting the size is accomplished easily with a count aggregate function, but in order to figure out the combinations, I needed a product of all the values, which is not easily done. Fortunately there was an answer on how to do this in another SO questions.
Now that the number of combinations is determined, the ids have to be generated. There is no great way to do this in TSQL. I ended up using a simple recursive CTE. This has the drawback of only being able to produce up to 32767 combos. There are other methods to produce these values if you're going to need more.
To determine when an article/supplier pair lines up with a combination I use ROW_NUMBER window function partition by Article and sorted by Supplier to get a sequence number for each pair. Then it's using an old trick of using Modulo of the combination number by the sequence number to determine if a pair is displayed.
There is still an issue in that there is no guarantee that redundant pairs won't be paired together. In order to address this a CTE was added that calculates the number of possible combinations that come before an article. The intention is so that a value in a later article is repeated the number of times for a combination before the next in sequence is displayed. I called it multiplier (even though I divide the comboId by it) and this is what will ensure distinct results.
WITH ComboId AS (
-- Product from https://stackoverflow.com/questions/3912204/why-is-there-no-product-aggregate-function-in-sql
SELECT 0 [ComboId], exp (sum (log (SequenceCount))) MaxCombos
FROM (
SELECT COUNT(*) SequenceCount
FROM src
GROUP BY Article
) x
UNION ALL
SELECT ComboId + 1, MaxCombos
FROM ComboId
WHERE ComboId + 1 < MaxCombos
)
, Multiplier AS (
SELECT s.Article
, COALESCE(exp (sum (log (SequenceCount)) OVER (ORDER BY Article ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)), 1) [Multiplier]
FROM (
SELECT Article, COUNT(*) SequenceCount
FROM src
GROUP BY Article
) s
)
, Sequenced AS (
SELECT s.Article, s.Supplier, m.Multiplier
, ROW_NUMBER() OVER (PARTITION BY s.Article ORDER BY s.Supplier) - 1 ArtSupplierSeqNum
, COUNT(*) OVER (PARTITION BY s.Article) MaxArtSupplierSeqNum
FROM src s
INNER JOIN Multiplier m ON m.Article = s.Article
)
SELECT c.ComboId + 1 [ComboId], s.Article, s.Supplier
FROM ComboId c
INNER JOIN Sequenced s ON s.ArtSupplierSeqNum = CAST(c.ComboId / Multiplier as INT) % s.MaxArtSupplierSeqNum
ORDER BY ComboId, Article
OPTION (MAXRECURSION 32767)

Find the count of IDs that have the same value

I'd like to get a count of all of the Ids that have have the same value (Drops) as other Ids. For instance, the illustration below shows you that ID 1 and 3 have A drops so the query would count them. Similarly, ID 7 & 18 have B drops so that's another two IDs that the query would count totalling in 4 Ids that share the same values so that's what my query would return.
+------+-------+
| ID | Drops |
+------+-------+
| 1 | A |
| 2 | C |
| 3 | A |
| 7 | B |
| 18 | B |
+------+-------+
I've tried the several approaches but the following query was my last attempt.
With cte1 (Id1, D1) as
(
select Id, Drops
from Posts
),
cte2 (Id2, D2) as
(
select Id, Drops
from Posts
)
Select count(distinct c1.Id1) newcnt, c1.D1
from cte1 c1
left outer join cte2 c2 on c1.D1 = c2.D2
group by c1.D1
The result if written out in full would be a single value output but the records that the query should be choosing should look as follows:
+------+-------+
| ID | Drops |
+------+-------+
| 1 | A |
| 3 | A |
| 7 | B |
| 18 | B |
+------+-------+
Any advice would be great. Thanks
You can use a CTE to generate a list of Drops values that have more than one corresponding ID value, and then JOIN that to Posts to find all rows which have a Drops value that has more than one Post:
WITH CTE AS (
SELECT Drops
FROM Posts
GROUP BY Drops
HAVING COUNT(*) > 1
)
SELECT P.*
FROM Posts P
JOIN CTE ON P.Drops = CTE.Drops
Output:
ID Drops
1 A
3 A
7 B
18 B
If desired you can then count those posts in total (or grouped by Drops value):
WITH CTE AS (
SELECT Drops
FROM Posts
GROUP BY Drops
HAVING COUNT(*) > 1
)
SELECT COUNT(*) AS newcnt
FROM Posts P
JOIN CTE ON P.Drops = CTE.Drops
Output
newcnt
4
Demo on SQLFiddle
You may use dense_rank() to resolve your problem. if drops has the same ID then dense_rank() will provide the same rank.
Here is the demo.
with cte as
(
select
drops,
count(distinct rnk) as newCnt
from
( select
*,
dense_rank() over (partition by drops order by id) as rnk
from myTable
) t
group by
drops
having count(distinct rnk) > 1
)
select
sum(newCnt) as newCnt
from cte
Output:
|newcnt |
|------ |
| 4 |
First group the count of the ids for your drops and then sum the values greater than 1.
select sum(countdrops) as total from
(select drops , count(id) as countdrops from yourtable group by drops) as temp
where countdrops > 1;

SQL select all rows in a single row's "history"

I have a table that looks like this:
ID | PARENT_ID
--------------
0 | NULL
1 | 0
2 | NULL
3 | 1
4 | 2
5 | 4
6 | 3
Being an SQL noob, I'm not sure if I can accomplish what I would like in a single command.
What I would like is to start at row 6, and recursively follow the "history", using the PARENT_ID column to reference the ID column.
The result (in my mind) should look something like:
6|3
3|1
1|0
0|NULL
I already tried something like this:
SELECT T1.ID
FROM Table T1, Table T2
WHERE T1.ID = 6
OR T1.PARENT_ID = T2.PARENT_ID;
but that just gave me a strange result.
With a recursive cte.
If you want to start from the maximum id:
with recursive cte (id, parent_id) as (
select t.*
from (
select *
from tablename
order by id desc
limit 1
) t
union all
select t.*
from tablename t inner join cte c
on t.id = c.parent_id
)
select * from cte
See the demo.
If you want to start specifically from id = 6:
with recursive cte (id, parent_id) as (
select *
from tablename
where id = 6
union all
select t.*
from tablename t inner join cte c
on t.id = c.parent_id
)
select * from cte;
See the demo.
Results:
| id | parent_id |
| --- | --------- |
| 6 | 3 |
| 3 | 1 |
| 1 | 0 |
| 0 | |

Recursive SQL query to find all matching identifiers

I have a table with following structure
CREATE TABLE Source
(
[ID1] INT,
[ID2] INT
);
INSERT INTO Source ([ID1], [ID2])
VALUES (1, 2), (2, 3), (4, 5),
(2, 5), (6, 7)
Example of Source and Result tables:
Source table basically stores which id is matching which another id. From the diagram it can be seen that 1, 2, 3, 4, 5 are identical. And 6, 7 are identical. I need a SQL query to get a Result table with all matches between ids.
I found this item on the site - Recursive query in SQL Server
similar to my task, but with a different result.
I tried to edit the code for my task, but it does not work. "The statement terminated. The maximum recursion 100 has been exhausted before statement completion."
;WITH CTE
AS
(
SELECT DISTINCT
M1.ID1,
M1.ID1 as ID2
FROM Source M1
LEFT JOIN Source M2
ON M1.ID1 = M2.ID2
WHERE M2.ID2 IS NULL
UNION ALL
SELECT
C.ID2,
M.ID1
FROM CTE C
JOIN Source M
ON C.ID1 = M.ID1
)
SELECT * FROM CTE ORDER BY ID1
Thanks a lot for the help!
This is a challenging question. You are trying to walk through a graph in two directions. There are two key ideas:
Add "reverse" edges, so the graph behaves like a digraph but with edges in both directions.
Keep a list of edges that have been visited. In SQL Server, strings are one method.
So:
with s as (
select id1, id2 from source
union -- on purpose
select id2, id1 from source
),
cte as (
select s.id1, s.id2, ',' + cast(s.id1 as varchar(max)) + ',' + cast(s.id2 as varchar(max)) + ',' as ids
from s
union all
select cte.id1, s.id2, ids + cast(s.id2 as varchar(max)) + ','
from cte join
s
on cte.id2 = s.id1
where cte.ids not like '%,' + cast(s.id2 as varchar(max)) + ',%'
)
select *
from cte
order by 1, 2;
Here is a db<>fiddle.
Since all node connections are bidirectional - add reversed relations to the original list
Find all possible paths from each node; almost usual recursion, the only difference is - we need to keep root id1
Avoid cycles - we need to be aware of it because we don't have directions
source:
;with src as(
select id1, id2 from source
union
-- reversed connections
select id2, id1 from source
), rec as (
select id1, id2, CAST(CONCAT('/', src.id1, '/', src.id2, '/') as varchar(8000)) path
from src
union all
-- keep the root id1 from the start of each path
select rec.id1, src.id2, CAST(CONCAT(rec.path, src.id2, '/') as varchar(8000))
from rec
-- usual recursion
inner join src on src.id1 = rec.id2
-- avoid cycles
where rec.path not like CONCAT('%/', src.id2, '/%')
)
select id1, id2, path
from rec
order by 1, 2
output
| id1 | id2 | path |
|-----|-----|-----------|
| 1 | 2 | /1/2/ |
| 1 | 3 | /1/2/3/ |
| 1 | 4 | /1/2/5/4/ |
| 1 | 5 | /1/2/5/ |
| 2 | 1 | /2/1/ |
| 2 | 3 | /2/3/ |
| 2 | 4 | /2/5/4/ |
| 2 | 5 | /2/5/ |
| 3 | 1 | /3/2/1/ |
| 3 | 2 | /3/2/ |
| 3 | 4 | /3/2/5/4/ |
| 3 | 5 | /3/2/5/ |
| 4 | 1 | /4/5/2/1/ |
| 4 | 2 | /4/5/2/ |
| 4 | 3 | /4/5/2/3/ |
| 4 | 5 | /4/5/ |
| 5 | 1 | /5/2/1/ |
| 5 | 2 | /5/2/ |
| 5 | 3 | /5/2/3/ |
| 5 | 4 | /5/4/ |
| 6 | 7 | /6/7/ |
| 7 | 6 | /7/6/ |
http://sqlfiddle.com/#!18/76114/13
source table will contain about 100,000 records
There is nothing that can help you with this. The task is unpleasant - finding all possible connections. Almost CROSS JOIN. With even more connections in the end.
Looks like I came up with a similar answer as the other posters. My approach was to insert the existing value pairs, and then insert the reverse of each pair.
Once you expand the list of value pairs, you can transverse the table to find all the pairs.
CREATE TABLE #Source
([ID1] int, [ID2] int);
INSERT INTO #Source
(
[ID1]
,[ID2]
)
VALUES
(1, 2)
,(2, 3)
,(4, 5)
,(2, 5)
,(6, 7)
INSERT INTO #Source
(
[ID1]
,[ID2]
)
SELECT
[ID2]
,[ID1]
FROM #Source
;WITH expanded AS
(
SELECT DISTINCT
ID1 = s1.ID1
,ID2 = s1.ID2
FROM #Source s1
LEFT JOIN #Source s2 ON s1.ID2 = s2.ID1
UNION
SELECT DISTINCT
ID1 = s1.ID1
,ID2 = s2.ID2
FROM #Source s1
LEFT JOIN #Source s2 ON s1.ID2 = s2.ID1
WHERE s1.ID1 <> s2.ID2
)
,recur AS
(
SELECT DISTINCT
e1.ID1
,e1.ID2
FROM expanded e1
LEFT JOIN expanded e2 ON e1.ID2 = e2.ID1
WHERE e1.ID1 <> e1.ID2
UNION ALL
SELECT DISTINCT
e1.ID1
,e2.ID2
FROM expanded e1
INNER JOIN expanded e2 ON e1.ID2 = e2.ID1
WHERE e1.ID1 <> e2.ID2
)
SELECT DISTINCT
ID1, ID2
FROM recur
ORDER BY ID1, ID2
DROP TABLE #Source
This is a way to get that output by brute force, but may not be the best solution with a different/larger data set:
select sub1.rnk as ID1
,sub2.rnk as ID2
from
(
select a.*
,rank() over (partition by 1 order by id1, id2) as RNK
from source a
) sub1
cross join
(
select a.*
,rank() over (partition by 1 order by id1, id2) as RNK
from source a
) sub2
where sub1.rnk <> sub2.rnk
union all
select id1 as ID1
,id2 as ID2
from source
where id1 = 6
union all
select id2 as ID1
,id1 as ID2
from source
where id1 = 6;

Grouping SQL Results based on order

I have table with data something like this:
ID | RowNumber | Data
------------------------------
1 | 1 | Data
2 | 2 | Data
3 | 3 | Data
4 | 1 | Data
5 | 2 | Data
6 | 1 | Data
7 | 2 | Data
8 | 3 | Data
9 | 4 | Data
I want to group each set of RowNumbers So that my result is something like this:
ID | RowNumber | Group | Data
--------------------------------------
1 | 1 | a | Data
2 | 2 | a | Data
3 | 3 | a | Data
4 | 1 | b | Data
5 | 2 | b | Data
6 | 1 | c | Data
7 | 2 | c | Data
8 | 3 | c | Data
9 | 4 | c | Data
The only way I know where each group starts and stops is when the RowNumber starts over. How can I accomplish this? It also needs to be fairly efficient since the table I need to do this on has 52 Million Rows.
Additional Info
ID is truly sequential, but RowNumber may not be. I think RowNumber will always begin with 1 but for example the RowNumbers for group1 could be "1,1,2,2,3,4" and for group2 they could be "1,2,4,6", etc.
For the clarified requirements in the comments
The rownumbers for group1 could be "1,1,2,2,3,4" and for group2 they
could be "1,2,4,6" ... a higher number followed by a lower would be a
new group.
A SQL Server 2012 solution could be as follows.
Use LAG to access the previous row and set a flag to 1 if that row is the start of a new group or 0 otherwise.
Calculate a running sum of these flags to use as the grouping value.
Code
WITH T1 AS
(
SELECT *,
LAG(RowNumber) OVER (ORDER BY ID) AS PrevRowNumber
FROM YourTable
), T2 AS
(
SELECT *,
IIF(PrevRowNumber IS NULL OR PrevRowNumber > RowNumber, 1, 0) AS NewGroup
FROM T1
)
SELECT ID,
RowNumber,
Data,
SUM(NewGroup) OVER (ORDER BY ID
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
FROM T2
SQL Fiddle
Assuming ID is the clustered index the plan for this has one scan against YourTable and avoids any sort operations.
If the ids are truly sequential, you can do:
select t.*,
(id - rowNumber) as grp
from t
Also you can use recursive CTE
;WITH cte AS
(
SELECT ID, RowNumber, Data, 1 AS [Group]
FROM dbo.test1
WHERE ID = 1
UNION ALL
SELECT t.ID, t.RowNumber, t.Data,
CASE WHEN t.RowNumber != 1 THEN c.[Group] ELSE c.[Group] + 1 END
FROM dbo.test1 t JOIN cte c ON t.ID = c.ID + 1
)
SELECT *
FROM cte
Demo on SQLFiddle
How about:
select ID, RowNumber, Data, dense_rank() over (order by grp) as Grp
from (
select *, (select min(ID) from [Your Table] where ID > t.ID and RowNumber = 1) as grp
from [Your Table] t
) t
order by ID
This should work on SQL 2005. You could also use rank() instead if you don't care about consecutive numbers.