How to merge overlapped groups in Snowflake - sql

I have a many-many relationship table, and I want to find the overlapped groups and merge them into one.
In the example below, user 2 is in groups 7 and 8, so groups 7 and 8 should be merged into one group that contains users 1, 2, and 4. The merged group id can be either 7 or 8; it doesn't matter.
user_id  group
1        7
2        7
2        8
4        8
5        9
6        9
I wish to see output like this:
user_id  group
1        7
2        7
4        7
5        9
6        9

Answering my own question here: below is the SQL I built that fits my needs. It is inspired by @pankaj's answer.
with data(user_id,group_id) as (
select * from values
(1,7),(2,7),(2,8),(4,9),(5,9),(5,8),
(6,9),(70,8),(21,51),(22,51),(23,52),
(24,51),(24,52),(25,26)
), group_members as (
select
group_id, array_agg(user_id) users
from data
group by group_id
), overlapped_group as (
select
c1.group_id g1,
c2.group_id g2,
-- c1.users,
-- c2.users,
least(g1, coalesce(g2, g1)) as min_group,
min(min_group) over (partition by g2) as merge_to
from group_members c1
left join
group_members c2 on arrays_overlap(c1.users, c2.users)
and g1 <> g2
), merge_mapping as (
select distinct
g1 as group_id,
iff(g2 is null, g1, min(merge_to) over (partition by g1)) as merge_to
from overlapped_group
)
select
user_id,
m.merge_to as group_id
from data
left join merge_mapping m using(group_id);

This is similar to a question asked earlier, where grouping needs to be done up to the top level of a hierarchy.
The query below aggregates user_id by group_id into an array and then compares those arrays with each other.
When two arrays overlap, both groups get the same group id.
Once the arrays are matched and each group has been assigned a parent group id based on the minimum group value, we need to get the top of the hierarchy.
There could also be multiple hierarchies in the data set, so we set the starting point of each hierarchy to NULL.
Lastly, we use a hierarchical query to get the final grouping.
with data(user_id,group_id) as (
select * from values
(1,7),(2,7),(2,8),(4,9),(5,9),(5,8),
(6,9),(70,8),(21,51),(22,51),(23,52),
(24,51),(24,52),(25,26)
),cte_1 as
(select group_id,array_agg(user_id) arr
from data
group by group_id
), cte_2 as
(select c1.group_id g1, c2.group_id g2 ,
c1.arr arr1, c2.arr arr2,
case when arrays_overlap(arr1, arr2) then g1 end flag,
min(flag) over (partition by g2) grp,
case when g2 <> grp then grp end final_grp
from cte_1 c1, cte_1 c2
), cte_3 as
(select distinct g2, connect_by_root g2 as parent from cte_2
start with final_grp is null
connect by final_grp = prior g2
order by g2
), cte_4 as
(select c3.parent, c1.arr
from cte_1 c1 left join cte_3 c3
on c1.group_id = c3.g2
) select distinct value, parent as final_group
from cte_4,
lateral flatten(input=>arr)
order by value;
VALUE  FINAL_GROUP
1      7
2      7
4      7
5      7
6      7
21     51
22     51
23     51
24     51
25     26
70     7
Adding another query, that is simpler.
with data(user_id,group_id) as (
select * from values
(1,7),(2,7),(2,8),(4,9),(5,9),(5,8),
(6,9),(70,8),(21,51),(22,51),(22,52),
(22,53),(23,52),(25,26)
), cte_1 as
(select a.group_id grp1, b.group_id grp2
from data a, data b
where a.user_id = b.user_id
and a.group_id < b.group_id
), cte_2 as
(select grp2, connect_by_root grp1 as parent
from cte_1
start with grp1 not in (select grp2 from cte_1)
connect by grp1 = prior grp2
) select a.user_id,
coalesce(b.parent, a.group_id) final_grp
from data a left join cte_2 b
on a.group_id = b.grp2;
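For comparison, the same merge can be sketched outside SQL as a union-find (disjoint-set) pass. This is only an illustration of the logic, not a Snowflake feature: the function name merge_overlapping_groups is made up here, and I'm assuming the rows fit in memory as (user_id, group_id) tuples. The smaller group id is kept as the representative, so chains such as 7-8-9 collapse to 7.

```python
def merge_overlapping_groups(rows):
    """rows: iterable of (user_id, group_id).
    Returns sorted distinct (user_id, merged_group_id) pairs."""
    rows = list(rows)
    parent = {}  # group_id -> parent group_id in the disjoint-set forest

    def find(g):
        parent.setdefault(g, g)
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            lo, hi = sorted((ra, rb))
            parent[hi] = lo  # the smaller id stays the representative

    first_group_of_user = {}
    for user_id, group_id in rows:
        find(group_id)  # register the group even if it never overlaps
        if user_id in first_group_of_user:
            # a user seen in two groups links those groups together
            union(first_group_of_user[user_id], group_id)
        else:
            first_group_of_user[user_id] = group_id

    return sorted({(user_id, find(group_id)) for user_id, group_id in rows})


data = [(1, 7), (2, 7), (2, 8), (4, 9), (5, 9), (5, 8),
        (6, 9), (70, 8), (21, 51), (22, 51), (23, 52),
        (24, 51), (24, 52), (25, 26)]
for user_id, group_id in merge_overlapping_groups(data):
    print(user_id, group_id)
```

On the sample data used in the queries above, this gives the same mapping as the VALUE / FINAL_GROUP result (groups 7, 8, 9 become 7; groups 51, 52 become 51; group 26 stays as is).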

One way (group is a reserved word, so the column name has to be quoted):
select user_id, strtok(listagg("group", ',') within group (order by "group"), ',', 1) as "group"
from <table>
group by user_id
order by user_id;

Related

Closest distance of a column

I need to find the two closest distance of each row based on all the values of the column.
I tried to do cross join and used the lead function to find the distance. I am totally not sure how to write it. Please suggest.
select a.id,lead(a.value,b.value) as distance from cluster a , cluster b
Input table:
ID Values
1 12.1
2 11
3 14
4 10
5 9
6 15
7 16
8 8
Expected output:
ID Values Closest_Value
1 12.1 11,10
2 11 9,10
3 14 15,16
4 10 9,11
5 9 8,10
6 15 14,16
7 16 14,15
8 8 9,10
One method uses a cross join and aggregation:
select id, value,
       listagg(other_value, ',') within group (order by diff) as near_values
from (select c.id, c.value, c2.value as other_value,
             abs(c2.value - c.value) as diff,
             row_number() over (partition by c.id order by abs(c2.value - c.value)) as seqnum
      from cluster c join
           cluster c2
           on c.id <> c2.id
     ) c
where seqnum <= 2
group by id, value;
The above is not particularly efficient for larger amounts of data. An alternative is to use lead() and lag() to get the values, unpivot, and aggregate:
with vals as (
      select c.id, c.value,
             (case when n.n = 1 then prev_value_2
                   when n.n = 2 then prev_value
                   when n.n = 3 then next_value
                   when n.n = 4 then next_value_2
              end) as other_value
      from (select c.*,
                   lag(value, 2) over (order by value) as prev_value_2,
                   lag(value) over (order by value) as prev_value,
                   lead(value) over (order by value) as next_value,
                   lead(value, 2) over (order by value) as next_value_2
            from cluster c
           ) c cross join
           (select rownum as n
            from cluster
            where rownum <= 4
           ) n -- just a list of 4 numbers
     )
select v.id, v.value,
       listagg(other_value, ',') within group (order by diff) as near_values
from (select v.*,
             abs(other_value - value) as diff,
             row_number() over (partition by id order by abs(other_value - value)) as seqnum
      from vals v
     ) v
where seqnum <= 2
group by id, value;
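The lead()/lag() idea rests on one observation: once the values are sorted, the two nearest values to any row must be among its two predecessors and two successors. Here is a minimal sketch of that reasoning in Python, assuming the table fits in a dict of id to value. The name two_nearest is hypothetical, and ties are broken by the smaller value, which is my own choice.

```python
def two_nearest(values_by_id):
    """values_by_id: dict mapping id -> value.
    Returns dict mapping id -> list of the two nearest other values."""
    # (id, value) pairs in value order, like ORDER BY value in the window functions
    ordered = sorted(values_by_id.items(), key=lambda kv: kv[1])
    result = {}
    for pos, (row_id, value) in enumerate(ordered):
        # candidates: up to two neighbours on each side (lag 2, lag 1, lead 1, lead 2)
        neighbours = ordered[max(0, pos - 2):pos] + ordered[pos + 1:pos + 3]
        # nearest first; ties go to the smaller value (an arbitrary choice)
        neighbours.sort(key=lambda kv: (abs(kv[1] - value), kv[1]))
        result[row_id] = [v for _, v in neighbours[:2]]
    return result


cluster = {1: 12.1, 2: 11, 3: 14, 4: 10, 5: 9, 6: 15, 7: 16, 8: 8}
for row_id, nearest in sorted(two_nearest(cluster).items()):
    print(row_id, cluster[row_id], nearest)
```

Because each row only looks at four candidates after one sort, this avoids the quadratic blow-up of the cross join.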

SQL get the closest two rows within duplicate rows

I have following table
ID Name Stage
1 A 1
1 B 2
1 C 3
1 A 4
1 N 5
1 B 6
1 J 7
1 C 8
1 D 9
1 E 10
I need the output below. Given the parameters A and N, I need to select the pair of rows (one A, one N) whose stages are closest together.
ID Name Stage
1 A 4
1 N 5
In other words, I need to select the rows where the difference between the stages is smallest.
This query can make use of an index on (name, stage) efficiently:
WITH cte AS (
   SELECT TOP 1
          a.id AS a_id, a.name AS a_name, a.stage AS a_stage
        , n.id AS n_id, n.name AS n_name, n.stage AS n_stage
   FROM   tbl a
   CROSS  APPLY (
      -- nearest 'N' at or after this 'A'
      SELECT * FROM (
         SELECT TOP 1 *, stage - a.stage AS diff
         FROM   tbl
         WHERE  name = 'N'
         AND    stage >= a.stage
         ORDER  BY stage
         ) nxt
      UNION ALL
      -- nearest 'N' before this 'A'
      SELECT * FROM (
         SELECT TOP 1 *, a.stage - stage AS diff
         FROM   tbl
         WHERE  name = 'N'
         AND    stage < a.stage
         ORDER  BY stage DESC
         ) prv
      ) n
   WHERE  a.name = 'A'
   ORDER  BY diff
   )
SELECT a_id AS id, a_name AS name, a_stage AS stage FROM cte
UNION ALL
SELECT n_id, n_name, n_stage FROM cte;
SQL Server uses CROSS APPLY in place of standard-SQL LATERAL.
In case of ties (equal difference) the winner is arbitrary, unless you add more ORDER BY expressions as tiebreaker.
This solution works if you know the minimum difference is always 1:
SELECT *
FROM myTable AS a
CROSS JOIN myTable AS b
WHERE a.name = 'A' AND b.name = 'N'
  AND ABS(a.stage - b.stage) = 1;
a.ID a.Name a.Stage b.ID b.Name b.Stage
1 A 4 1 N 5
Or, if you don't know the minimum:
SELECT *
FROM myTable AS a
CROSS JOIN myTable AS b
WHERE a.name = 'A' AND b.name = 'N'
  AND ABS(a.stage - b.stage) = (SELECT MIN(ABS(a.stage - b.stage))
                                FROM myTable AS a
                                CROSS JOIN myTable AS b
                                WHERE a.name = 'A' AND b.name = 'N');
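To make the intent of the cross-join approach concrete, here is a small Python sketch that does the same thing by brute force: pair every A row with every N row and keep the pair with the smallest absolute stage gap. The function name closest_pair and the tuple layout (id, name, stage) are assumptions for illustration. On a tie, min() keeps the first pair it sees, so the winner is effectively arbitrary.

```python
def closest_pair(rows, first_name, second_name):
    """rows: list of (id, name, stage) tuples.
    Returns (first_row, second_row) with the smallest stage gap, or None."""
    firsts = [r for r in rows if r[1] == first_name]
    seconds = [r for r in rows if r[1] == second_name]
    if not firsts or not seconds:
        return None  # one of the names does not occur at all
    # same idea as the cross join: consider every (A, N) combination
    pairs = ((a, n) for a in firsts for n in seconds)
    return min(pairs, key=lambda pair: abs(pair[0][2] - pair[1][2]))


table = [(1, 'A', 1), (1, 'B', 2), (1, 'C', 3), (1, 'A', 4), (1, 'N', 5),
         (1, 'B', 6), (1, 'J', 7), (1, 'C', 8), (1, 'D', 9), (1, 'E', 10)]
print(closest_pair(table, 'A', 'N'))  # ((1, 'A', 4), (1, 'N', 5))
```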

Zip/repeat join?

Let's say I have a simple table of documents with a type column:
Documents
Id Type
1 A
2 A
3 B
4 C
5 C
6 A
7 A
8 A
9 B
10 C
Users have permissions to access different types of documents:
Permissions
Type User
A John
A Jane
B Sarah
C Peter
C John
C Mark
And I need to distribute those documents among the users as tasks:
Tasks
Id T DocId UserId
1 A 1 John
2 A 2 Jane
3 B 3 Sarah
4 C 4 Peter
5 C 5 John
6 A 6 John
7 A 7 Jane
8 A 8 John
9 B 9 Sarah
10 C 10 Mark
How do I do that? How do I get the Tasks?
You can enumerate the rows and then use modulo arithmetic for the matching:
with d as (
      select d.*,
             row_number() over (partition by type order by newid()) as seqnum,
             count(*) over (partition by type) as cnt
      from documents d
     ),
     u as (
      select p.*,
             row_number() over (partition by type order by newid()) as seqnum,
             count(*) over (partition by type) as cnt
      from permissions p
     )
select d.id, d.type, u.[user]
from d join
     u
     on d.type = u.type and
        u.seqnum = (d.seqnum % u.cnt) + 1;
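The modulo matching is just a round-robin per document type. A minimal Python sketch of the same idea follows, with two differences I'm assuming for the sake of a reproducible result: documents are taken in id order and users in the order they appear in Permissions, instead of the random newid() ordering. The name distribute is hypothetical.

```python
from collections import defaultdict


def distribute(documents, permissions):
    """documents: list of (doc_id, doc_type); permissions: list of (doc_type, user).
    Returns a list of (doc_id, doc_type, user) tasks."""
    users_by_type = defaultdict(list)
    for doc_type, user in permissions:
        users_by_type[doc_type].append(user)

    handed_out = defaultdict(int)  # documents already assigned, per type
    tasks = []
    for doc_id, doc_type in sorted(documents):
        users = users_by_type[doc_type]
        if not users:
            continue  # nobody is allowed to handle this type
        # modulo arithmetic cycles through the permitted users
        user = users[handed_out[doc_type] % len(users)]
        handed_out[doc_type] += 1
        tasks.append((doc_id, doc_type, user))
    return tasks


documents = [(1, 'A'), (2, 'A'), (3, 'B'), (4, 'C'), (5, 'C'),
             (6, 'A'), (7, 'A'), (8, 'A'), (9, 'B'), (10, 'C')]
permissions = [('A', 'John'), ('A', 'Jane'), ('B', 'Sarah'),
               ('C', 'Peter'), ('C', 'John'), ('C', 'Mark')]
for task in distribute(documents, permissions):
    print(task)
```

With this deterministic ordering the result happens to match the Tasks table in the question row for row.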
Great question.
This solution returns all possible distributions, ordered by a priority that is determined by information such as the number of users involved, the minimum number of documents per user, the standard deviation of tasks per user, etc.
I'm not counting on document.id being a sequence of numbers starting with 1, hence the use of dense_rank.
The core of the solution is the iterative CTE, which generates the record sets of all possible distributions.
Execution time on my laptop is around 20 seconds (the iterative part takes 5 seconds).
with doc_user as
(
select d."id" as docid
,p."user" as userid
,dense_rank () over (order by d."id") as doc_seq
from documents d
left join permissions p
on p.type = d.type
)
,it_cte as
(
select docid
,userid
,doc_seq
,cast (coalesce(userid,'') as varchar(max)) as path
,'A' as cte_part
from doc_user
where doc_seq = 1
union all
select r.docid
,r.userid
,du.doc_seq
,r.path + ',' + coalesce (du.userid,'')
,'B'
from it_cte as r
cross join doc_user as du
where du.doc_seq = r.doc_seq + 1
union all
select du.docid
,du.userid
,du.doc_seq
,r.path + ',' + coalesce (du.userid,'')
,'C'
from it_cte as r
cross join doc_user as du
where du.doc_seq = r.doc_seq + 1
and r.cte_part in ('A','C')
)
,result_sets as
(
select dense_rank () over (order by path) as set_id
,docid
,userid
from it_cte
where doc_seq = (select count(*) from documents)
)
,result_sets_stat as
(
select set_id
,count (distinct userid) as users_involved
from result_sets
group by set_id
)
,result_sets_users_stat as
(
select set_id
,min (doc) min_doc_per_user
,stdevp (doc) stdevp_doc_per_user
from (select set_id
,userid
,count (*) as doc
from result_sets
group by set_id
,userid
) t
group by set_id
)
select s.set_priority
,r.docid
,r.userid
,s.users_involved
,s.min_doc_per_user
,s.stdevp_doc_per_user
from (select s.set_id
,s.users_involved
,u.min_doc_per_user
,u.stdevp_doc_per_user
,row_number () over
(
order by s.users_involved desc
,u.min_doc_per_user desc
,u.stdevp_doc_per_user
,s.set_id
) as set_priority
from result_sets_stat as s
join result_sets_users_stat as u
on u.set_id =
s.set_id
) s
join result_sets as r
on r.set_id =
s.set_id
order by s.set_priority
,r.docid
option (merge join)
;

SELECT records until new value SQL

I have a table
Val | Number
08 | 1
09 | 1
10 | 1
11 | 3
12 | 0
13 | 1
14 | 1
15 | 1
I need to return the last values where Number = 1 (however many that may be) until Number changes, but do not need the first instances where Number = 1. Essentially I need to select back until Number changes to 0 (15, 14, 13)
Is there a proper way to do this in MSSQL?
Based on the following:
I need to return the last values where Number = 1
Essentially I need to select back until Number changes to 0 (15, 14, 13)
Try:
select val, number
from T
where val > (select max(val)
from T
where number<>1)
EDIT: to address all possible combinations:
;with cte1 as
(
select 1 id, max(val) maxOne
from T
where number=1
),
cte2 as
(
select 1 id, isnull(max(val),0) maxOther
from T
where val < (select maxOne from cte1) and number<>1
)
select val, number
from T cross join
(select maxOne, maxOther
from cte1 join cte2 on cte1.id = cte2.id
) X
where val>maxOther and val<=maxOne
I think you can use window functions, something like this:
with cte as (
-- generate two row_number to enumerate distinct groups
select
Val, Number,
row_number() over(partition by Number order by Val) as rn1,
row_number() over(order by Val) as rn2
from Table1
), cte2 as (
-- get groups with Number = 1 and last group
select
Val, Number,
rn2 - rn1 as rn1, max(rn2 - rn1) over() as rn2
from cte
where Number = 1
)
select Val, Number
from cte2
where rn1 = rn2
DEMO: http://sqlfiddle.com/#!3/e7d54/23
DDL
create table T(val int identity(8,1), number int)
insert into T values
(1),(1),(1),(3),(0),(1),(1),(1),(0),(2)
DML
; WITH last_1 AS (
SELECT Max(val) As val
FROM t
WHERE number = 1
)
, last_non_1 AS (
SELECT Coalesce(Max(val), -937) As val
FROM t
WHERE EXISTS (
SELECT val
FROM last_1
WHERE last_1.val > t.val
)
AND number <> 1
)
SELECT t.val
, t.number
FROM t
CROSS
JOIN last_1
CROSS
JOIN last_non_1
WHERE t.val <= last_1.val
AND t.val > last_non_1.val
I know it's a little verbose, but I've deliberately kept it that way to illustrate the methodology.
1. Find the highest val where number = 1.
2. Among all rows where val is less than the value found in step 1, find the largest val where number <> 1.
3. Finally, find the rows that fall between the values we uncovered in steps 1 and 2.
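The same three steps can be sketched in a few lines of Python, which may make the boundaries easier to see. This is only an illustration under the assumption that the rows are (val, number) tuples and that val defines the order. The name last_run is made up for this example.

```python
def last_run(rows, target=1):
    """rows: list of (val, number). Returns the final consecutive run of rows
    whose number equals target, in val order."""
    ordered = sorted(rows)
    # step 1: position of the highest val where number == target
    end = max((i for i, (_, number) in enumerate(ordered) if number == target),
              default=None)
    if end is None:
        return []  # target never occurs
    # step 2: walk back until the previous row has a different number
    start = end
    while start > 0 and ordered[start - 1][1] == target:
        start -= 1
    # step 3: everything between the two boundaries
    return ordered[start:end + 1]


# same values as the DDL above: val starts at 8 and increments by 1
numbers = [1, 1, 1, 3, 0, 1, 1, 1, 0, 2]
t = [(8 + i, n) for i, n in enumerate(numbers)]
print(last_run(t))  # [(13, 1), (14, 1), (15, 1)]
```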
select val, count (number) from
yourtable
group by val
having count(number) > 1
The having clause is the key here, giving you all the vals that have more than one value of 1.
This is a common approach for getting rows until some value changes. For your specific case use desc in proper spots.
Create sample table
select * into #tmp from
(select 1 as id, 'Alpha' as value union all
select 2 as id, 'Alpha' as value union all
select 3 as id, 'Alpha' as value union all
select 4 as id, 'Beta' as value union all
select 5 as id, 'Alpha' as value union all
select 6 as id, 'Gamma' as value union all
select 7 as id, 'Alpha' as value) t
Pull top rows until value changes:
with cte as (select * from #tmp t)
select * from
(select cte.*, ROW_NUMBER() over (order by id) rn from cte) OriginTable
inner join
(
select cte.*, ROW_NUMBER() over (order by id) rn from cte
where cte.value = (select top 1 cte.value from cte order by cte.id)
) OnlyFirstValueRecords
on OriginTable.rn = OnlyFirstValueRecords.rn and OriginTable.id = OnlyFirstValueRecords.id
On the left side we put the original table. On the right side we put only the rows whose value equals the value in the first row.
The records on both sides are the same until the target value changes. After row 3, the same row number maps to a different ID on each side because of the offset, so those rows never join with the original table:
LEFT RIGHT
ID Value RN ID Value RN
1 Alpha 1 | 1 Alpha 1
2 Alpha 2 | 2 Alpha 2
3 Alpha 3 | 3 Alpha 3
----------------------- result set ends here
4 Beta 4 | 5 Alpha 4
5 Alpha 5 | 7 Alpha 5
6 Gamma 6 |
7 Alpha 7 |
The ID must be unique. Ordering by this ID must be same in both ROW_NUMBER() functions.
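Stripped of the join mechanics, the result is simply "take rows from the top while the value equals the first value". A minimal Python sketch of that, assuming rows are (id, value) tuples with unique ids, could look like this (leading_run is a hypothetical name):

```python
from itertools import takewhile


def leading_run(rows):
    """rows: list of (id, value). Returns the rows from the top, in id order,
    up to (not including) the first row whose value differs from the first one."""
    ordered = sorted(rows)  # ids are unique, so this is ORDER BY id
    if not ordered:
        return []
    first_value = ordered[0][1]
    return list(takewhile(lambda row: row[1] == first_value, ordered))


tmp = [(1, 'Alpha'), (2, 'Alpha'), (3, 'Alpha'), (4, 'Beta'),
       (5, 'Alpha'), (6, 'Gamma'), (7, 'Alpha')]
print(leading_run(tmp))  # [(1, 'Alpha'), (2, 'Alpha'), (3, 'Alpha')]
```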

SQL group uniquely by type and by position

Given this dataset:
ID type_id Position
1 2 7
2 1 2
3 3 5
4 1 1
5 3 3
6 2 4
7 2 6
8 3 8
(There are only 3 different possible type_ids) I'd like to return a dataset with one of each type_id in groups, ordered by position.
so it would be grouped like so:
Results (ID): [4, 6, 5], [2, 7, 3], [null, 1, 8]
So the first group would consist of each of the entries type_id's with the highest (Relative) position score, the second group would have the second highest score, the third would only consist of two entries (and a null) because there are not three more of each type_id
Does this make sense? And is it possible?
Something like this:
with CTE as (
select
row_number() over (partition by type_id order by Position) as row_num,
*
from test
)
select array_agg(ID order by type_id)
from CTE
group by row_num
Or, if you absolutely need nulls in your arrays:
with CTE as (
select
row_number() over (partition by type_id order by Position) as row_num,
*
from test
)
select array_agg(c.ID order by t.type_id)
from (select distinct row_num from CTE) as a
cross join (select distinct type_id from test) as t
left outer join CTE as c on c.row_num = a.row_num and c.type_id = t.type_id
group by a.row_num
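To show what the row_number() plus cross join combination produces, here is a rough Python equivalent: rank the rows within each type_id by Position, then build one list per rank with one slot per type_id, padding with None where a type has run out of rows. The name group_by_rank and the tuple layout (id, type_id, position) are assumptions for illustration.

```python
from collections import defaultdict


def group_by_rank(rows):
    """rows: list of (id, type_id, position).
    Returns one list of ids per rank, ordered by type_id, padded with None."""
    by_type = defaultdict(list)
    for row_id, type_id, position in rows:
        by_type[type_id].append((position, row_id))
    type_ids = sorted(by_type)
    for type_id in type_ids:
        by_type[type_id].sort()  # like row_number() ... order by Position
    depth = max((len(members) for members in by_type.values()), default=0)
    return [
        [by_type[t][rank][1] if rank < len(by_type[t]) else None for t in type_ids]
        for rank in range(depth)
    ]


test = [(1, 2, 7), (2, 1, 2), (3, 3, 5), (4, 1, 1),
        (5, 3, 3), (6, 2, 4), (7, 2, 6), (8, 3, 8)]
print(group_by_rank(test))  # [[4, 6, 5], [2, 7, 3], [None, 1, 8]]
```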