SQL: select sets containing exactly given members - sql

I'm sure there is a proper word for this which I fail to remember, but the problem is easy to describe:
I have a table groupmembers, which is a simple relationship between groups and members:
id | groupid | memberid
1 | g1 | m1
2 | g1 | m2
3 | g2 | m1
4 | g2 | m2
5 | g2 | m3
The above describes two groups: one with m1 and m2, and one with m1, m2 and m3.
If I want to select the groupids which have members m1 and m2 but no other members, how do I do it? The approaches I have tried would also return g2, as m1 and m2 are a subset of its members.
UPDATE: Wow, some great answers! Let me first clarify my question a little - I want to be able to select the group that exactly matches the given members m1 and m2. So, it should NOT match if the group contains more members than m1 and m2, and it should NOT match if the group contains fewer members than m1 and m2.

from your phrase
I want to select groupids which has members m1,m2 but no other members
Try this one. The idea is to count the rows in each group that match the WHERE clause and require that count to equal the total number of rows in that group.
SELECT groupid
FROM table1 a
WHERE memberid IN ('m1','m2')
GROUP BY groupid
HAVING COUNT(*) =
(
SELECT COUNT(*)
FROM table1 b
WHERE b.groupid = a.groupid
GROUP BY b.groupID
)
SQLFiddle Demo

You are looking for the intersection between those groups that have m1 and m2 and those groups that have exactly two members. SQL has an operator for that:
select groupid
from group_table
where memberid in ('m1','m2')
group by groupid
having count(distinct memberid) = 2
intersect
select groupid
from group_table
group by groupid
having count(distinct memberid) = 2
(Oracle supports intersect as well; minus is Oracle's name for except, which is a different operator.)
Here is a SQLFiddle demo: http://sqlfiddle.com/#!12/df94d/1
Although I think John Woo's solution could be more efficient in terms of performance.

there is an issue with this query
SELECT groupid
FROM table1 a
WHERE memberid IN ('m1','m2')
GROUP BY groupid
HAVING COUNT(*) =
(
SELECT COUNT(*)
FROM table1 b
WHERE b.groupid = a.groupid
GROUP BY b.groupID
)
It will also match groups that contain only m1 or only m2.
To exclude those, we can add another count check:
SELECT groupid
FROM table1 a
WHERE memberid IN ('m1','m2')
GROUP BY groupid
HAVING COUNT(*) = 2 --since we already know we should have exactly two rows
AND COUNT(*) =
(
SELECT COUNT(*)
FROM table1 b
WHERE b.groupid = a.groupid
GROUP BY b.groupID
)
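For comparison, the same "exactly these members" check can also be written as a single pass with conditional aggregation. This sketch is not taken from any of the answers here. It runs against an in-memory SQLite database from Python, assumes (groupid, memberid) pairs are unique, and the helper name exact_groups and the extra group g3 are made up for illustration.

```python
import sqlite3


def exact_groups(conn, members):
    """Return the groupids whose member set equals `members` exactly.

    Assumes (groupid, memberid) is unique in groupmembers.
    """
    wanted = sorted(set(members))
    marks = ",".join("?" for _ in wanted)
    sql = f"""
        SELECT groupid
        FROM groupmembers
        GROUP BY groupid
        HAVING COUNT(*) = ?                       -- no extra members
           AND SUM(CASE WHEN memberid IN ({marks})
                        THEN 1 ELSE 0 END) = ?    -- every wanted member present
        ORDER BY groupid
    """
    # Placeholders are bound in order: group size, wanted members, match count.
    params = [len(wanted), *wanted, len(wanted)]
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE groupmembers (groupid TEXT, memberid TEXT, "
    "UNIQUE (groupid, memberid))"
)
conn.executemany(
    "INSERT INTO groupmembers VALUES (?, ?)",
    [("g1", "m1"), ("g1", "m2"),
     ("g2", "m1"), ("g2", "m2"), ("g2", "m3"),
     ("g3", "m1")],  # g3 is a hypothetical group holding only m1
)

print(exact_groups(conn, ["m1", "m2"]))  # ['g1']
```

Because the wanted count is derived from the member list, the same helper works for any number of members, and a group holding only m1 is rejected by the first HAVING condition.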

SELECT DISTINCT -- if (groupid, memberid) is unique
-- no need for the DISTINCT
a.groupid
FROM
tableX AS a
JOIN
tableX AS b
ON b.groupid = a.groupid
WHERE a.memberid = 'm1'
AND b.memberid = 'm2'
AND NOT EXISTS
( SELECT *
FROM tableX AS t
WHERE t.groupid = a.groupid
AND t.memberid NOT IN ('m1', 'm2')
) ;

-- sample table for discussion
CREATE TABLE tbl
(id int, groupid varchar(2), memberid varchar(2));
INSERT INTO tbl
(id, groupid, memberid)
VALUES
(6, 'g4', 'm1'),
(7, 'g4', 'm2'),
(8, 'g6', 'm1'),
(9, 'g6', 'm3'),
(1, 'g1', 'm1'),
(2, 'g1', 'm2'),
(3, 'g2', 'm1'),
(4, 'g2', 'm2'),
(5, 'g2', 'm3')
;
-- the query
select a.groupid, b.groupid peer
from (select groupid, count(*) member_count, min(memberid) x, max(memberid) y
from tbl
group by groupid) A
join
(select groupid, count(*) member_count, min(memberid) x, max(memberid) y
from tbl
group by groupid) B
on a.groupid<b.groupid and a.member_count=b.member_count and a.x=b.x and a.y=b.y
join tbl A1
on A1.groupid = A.groupid
join tbl B1
on B1.groupid = B.groupid and A1.memberid = B1.memberid
group by A.groupid, b.groupid, A.member_count
having count(1) = A.member_count;
-- the result
GROUPID PEER
g1 g4
The above shows a way to list groups together with their peers while keeping the expensive comparison small. It first reduces each group to its member count plus its min and max memberid. A direct join on those three values quickly pares down the candidate pairs, and only for the remaining candidates is the full table joined back (as A1 and B1) to check whether the two groups really are identical.
If you had 3 similar groups (101,103,104), they will appear as three separate rows (101,103), (101,104), (103,104), because each pair forms a peering. Such a query is therefore best used when you already know one of the groups you want to find peers for. That filter would go into the first subquery.
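If one row per set of identical groups is preferred over one row per pair, a client-side alternative is to key each group by its full member set. The following is only a sketch of that different technique, not the SQL approach above: it reads the whole table into Python, so it suits tables that fit in memory, and the function name peer_groups is made up.

```python
import sqlite3
from collections import defaultdict


def peer_groups(conn):
    """Return lists of groupids that share exactly the same member set."""
    members = defaultdict(set)
    for groupid, memberid in conn.execute("SELECT groupid, memberid FROM tbl"):
        members[groupid].add(memberid)

    # Groups with an identical member set land under the same frozenset key.
    by_member_set = defaultdict(list)
    for groupid, member_set in members.items():
        by_member_set[frozenset(member_set)].append(groupid)

    return sorted(sorted(ids) for ids in by_member_set.values() if len(ids) > 1)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER, groupid TEXT, memberid TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [(6, "g4", "m1"), (7, "g4", "m2"),
     (8, "g6", "m1"), (9, "g6", "m3"),
     (1, "g1", "m1"), (2, "g1", "m2"),
     (3, "g2", "m1"), (4, "g2", "m2"), (5, "g2", "m3")],
)

print(peer_groups(conn))  # [['g1', 'g4']]
```

With three identical groups this would return a single list of three ids rather than three pair rows.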

id | groupid | memberid
1 | g1 | m1
2 | g1 | m2
3 | g2 | m1
4 | g2 | m2
5 | g2 | m3
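select GRPID from arcv where GRPID in (
select GRPID from arcv
group by GRPID having count(1)=2) and memberid in ('m1','m2')
-- group again so that both m1 and m2 must be present and each group is returned once
group by GRPID having count(distinct memberid)=2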

Related

Conditional probability in SQL

I think I have ended up in a bit of a dead end.
Let's say I have a fairly simple dataset with two columns,
person_id and book_id. It is pretty much a fact table that says person X bought books A, B and C.
I know how to find out how many persons have bought Book X and Book Y together.
This is
select a.book_id as B1, b.book_id as B2, count(b.person_id) as
Bought_Together
from dbo.data a
cross join dbo.data b
where a.book_id != b.book_id and a.person_id = b.person_id
group by a.book_id, b.book_id
Yet again this is where my brain decided to shut down. I know that I would probably need to do it so that
count(b.person_id) / all the people that bought book A * 100
but I'm not entirely sure.
I hope I was clear enough.
EDIT1: I'm using SQL Server 2017 currently, so I think the correct answer is T-SQL?
In the end the format should be something similar to this. Also, there are no cases where person A could have bought three copies of book X.
Book1 Book2 HowManyPeopleBoughtBook2
1 2 50%
1 3 7%
2 3 15%
2 1 40%
3 1 60%
3 2 20%
EDIT2: Let it be said that there are hundreds of thousands of rows in the database. Yes, this is a bit related to a data science course I am taking - hence the huge amounts of data.
You can extend your logic to do this:
select a.book_id as B1, b.book_id as B2,
count(b.book_id) as bought_second_book,
count(b.book_id) * 1.0 / book_cnt as ratio_Bought_Together
from (select a.*, count(*) over (partition by a.book_id) as book_cnt
from dbo.data a
) a left join
dbo.data b
on a.person_id = b.person_id and a.book_id <> b.book_id
group by a.book_id, b.book_id, a.book_cnt;
This assumes that people buy a book only once. If there are duplicates, then count(distinct) would adjust for that.
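To make the arithmetic concrete, here is a small sketch of the same ratio (people who bought both A and B, divided by people who bought A), run against an in-memory SQLite database from Python. It uses a derived table for the per-book buyer count instead of a window function, and the five sample rows and the function name conditional_percentages are assumptions made for illustration. Like the query above, it assumes each person buys a given book at most once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (person_id INTEGER, book_id INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [(300, 1), (300, 2), (301, 1), (301, 2), (301, 3)],
)


def conditional_percentages(conn):
    """Return (book_a, book_b, percent of book_a buyers who also bought book_b)."""
    sql = """
        SELECT a.book_id, b.book_id,
               COUNT(*) * 100.0 / t.buyers AS pct
        FROM data AS a
        JOIN data AS b
          ON b.person_id = a.person_id AND b.book_id <> a.book_id
        JOIN (SELECT book_id, COUNT(*) AS buyers
              FROM data
              GROUP BY book_id) AS t
          ON t.book_id = a.book_id
        GROUP BY a.book_id, b.book_id, t.buyers
        ORDER BY a.book_id, b.book_id
    """
    return conn.execute(sql).fetchall()


for book_a, book_b, pct in conditional_percentages(conn):
    print(book_a, book_b, f"{pct:.0f}%")
```

In this sample, both buyers of book 1 also bought book 2 (100%), while only one of them bought book 3 (50%). The result is not symmetric: the single buyer of book 3 bought book 1, so that direction is 100%.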
If you would like to generate all possible combinations of the pairs of books bought together along with the percentage of the persons who bought that combination the following can help
create table data1(book_id int, person_id int);
insert into data1
select *
from (values(1,300)
,(2,300)
,(2,301)
,(1,301)
,(3,301)
)t(book_id,person_id); -- T-SQL needs the previous statement terminated before WITH
with books
as (select distinct book_id
from data1 a
)
,tot_persons
as (select count(distinct person_id) as tot_cnt
from data1
)
,pairs
as (
select a.book_id as col1 /* This block generates all possible pair combinations of books*/
,b.book_id as col2
from books a
join books b
on a.book_id<b.book_id
)
select a.col1,a.col2
,count(b.person_id)*100/(select tot_cnt from tot_persons) as percent_of_persons_buying_both
from pairs a
join data1 b
on a.col1=b.book_id
where exists(select 1
from data1 b1
where b.person_id=b1.person_id
and a.col2=b1.book_id)
group by a.col1,a.col2
On my phone, apologies for typo's
SELECT
SUM(bought_b) * 100.0 / COUNT(*)
FROM
(
SELECT
person_id,
MAX(CASE WHEN book_id = 'A' THEN 1 END) AS bought_a,
MAX(CASE WHEN book_id = 'B' THEN 1 END) AS bought_b
FROM
data
WHERE
book_id IN ('A', 'B')
GROUP BY
person_id
)
person_stats
WHERE
bought_a = 1
EDIT: just saw that you want all combinations, not just one set combination.
WITH
book AS
(
SELECT DISTINCT book_id FROM data
)
SELECT
book_a_id,
book_b_id,
bought_b * 100.0 / bought_a
FROM
(
SELECT
book_a.book_id AS book_a_id,
book_b.book_id AS book_b_id,
COUNT(DISTINCT data_a.person_id) AS bought_a,
COUNT(DISTINCT data_b.person_id) AS bought_b
FROM
book AS book_a
CROSS JOIN
book AS book_b
INNER JOIN
data AS data_a
ON data_a.book_id = book_a.book_id
LEFT JOIN
data AS data_b
ON data_b.book_id = book_b.book_id
AND data_b.person_id = data_a.person_id -- only count buyers of book A who also bought book B
GROUP BY
book_a.book_id,
book_b.book_id
)
stats

How to group results by count of relationships

Given tables, Profiles, and Memberships where a profile has many memberships, how do I query profiles based on the number of memberships?
For example, I want to get the number of profiles with 2 memberships. I can get the number of memberships for each profile with:
SELECT "memberships"."profile_id", COUNT("profiles"."id") AS "membership_count"
FROM "profiles"
INNER JOIN "memberships" on "profiles"."id" = "memberships"."profile_id"
GROUP BY "memberships"."profile_id"
That returns results like
profile_id | membership_count
_____________________________
1 2
2 5
3 2
...
But how do I group and sum the counts to get the query to return results like:
n | profiles_with_n_memberships
_____________________________
1 36
2 28
3 29
...
Or even just a query for a single value of n that would return
profiles_with_2_memberships
___________________________
28
I don't have your sample data, but I just recreated the scenario here with a single table : Demo
You could LEFT JOIN the counts with generate_series() and get zeroes for missing count of n memberships. If you don't want zeros, just use the second query.
Query1
WITH c
AS (
SELECT profile_id
,count(*) ct
FROM Table1
GROUP BY profile_id
)
,m
AS (
SELECT MAX(ct) AS max_ct
FROM c
)
SELECT n
,COUNT(c.profile_id)
FROM m
CROSS JOIN generate_series(1, m.max_ct) AS i(n)
LEFT JOIN c ON c.ct = i.n
GROUP BY n
ORDER BY n;
Query2
WITH c
AS (
SELECT profile_id
,count(*) ct
FROM Table1
GROUP BY profile_id
)
SELECT ct
,COUNT(*)
FROM c
GROUP BY ct
ORDER BY ct;
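The two-level aggregation in Query2 (count memberships per profile, then count profiles per count) can be tried out with a small in-memory SQLite database from Python. This is a sketch under assumptions: the memberships rows are invented sample data, and the function names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memberships (id INTEGER PRIMARY KEY, profile_id INTEGER)"
)
# Hypothetical data: profiles 1 and 3 have 2 memberships, profile 2 has 3,
# profile 4 has 1.
conn.executemany(
    "INSERT INTO memberships (profile_id) VALUES (?)",
    [(p,) for p in (1, 1, 2, 2, 2, 3, 3, 4)],
)


def profiles_by_membership_count(conn):
    """Return (n, number of profiles with exactly n memberships)."""
    sql = """
        SELECT ct AS n, COUNT(*) AS profiles_with_n_memberships
        FROM (SELECT profile_id, COUNT(*) AS ct
              FROM memberships
              GROUP BY profile_id) AS c
        GROUP BY ct
        ORDER BY ct
    """
    return conn.execute(sql).fetchall()


def profiles_with(conn, n):
    """Return how many profiles have exactly n memberships."""
    sql = """
        SELECT COUNT(*)
        FROM (SELECT profile_id
              FROM memberships
              GROUP BY profile_id
              HAVING COUNT(*) = ?) AS c
    """
    return conn.execute(sql, (n,)).fetchone()[0]


print(profiles_by_membership_count(conn))  # [(1, 1), (2, 2), (3, 1)]
print(profiles_with(conn, 2))              # 2
```

Profiles without any membership never appear in the memberships table, so they are not counted here. Counting them would need a LEFT JOIN from profiles.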

SQL Server - only join if condition is met

I have three tables (at least, something similar) with the following relationships:
Item table:
ID | Val
---------+---------
1 | 12
2 | 5
3 | 22
Group table:
ID | Parent | Range
---------+---------+---------
1 | NULL | [10-30]
2 | 1 | [20-25]
3 | NULL | [0-15]
GroupToItem table:
GroupID | ItemID
---------+---------
1 | 1
1 | 3
And now I want to add rows to the GroupToItem table for Groups 2 and 3, using the same query (since some other conditions not shown here are more complicated). I want to restrict the items through which I search if the new group has a parent, but to look through all items if there is not.
At the moment I am using an IF/ELSE on two statements that are almost exactly the same, but for the addition of another JOIN row when a parent exists. Is it possible to do a join to reduce the number of items to look at, only if a restriction is possible?
My two queries as they stand are given below:
DECLARE @GroupID INT = 2;...
INSERT INTO GroupToItem(GroupID, ItemID)
SELECT g.ID,
i.ID
FROM Group g
JOIN Item i ON i.Val IN g.Range
JOIN GroupToItem gti ON g.Parent = gti.GroupID AND i.ID = gti.ItemID
WHERE g.ID = @GroupID
-
DECLARE @GroupID INT = 3;...
INSERT INTO GroupToItem(GroupID, ItemID)
SELECT g.ID,
i.ID
FROM Group g
JOIN Item i ON i.Val IN g.Range
WHERE g.ID = @GroupID
So essentially I only want to do the second JOIN if the given group has a parent. Is this possible in a single query? It is important that the number of items that are compared against the range is as small as possible, since for me this is an intensive operation.
EDIT: This seems to have solved it in this test setup, similar to what was suggested by Denis Valeev. I'll accept if I can get it to work with my live data. I've been having some weird issues - potentially more questions coming up.
SELECT g.Id,
i.Id
FROM Group g
JOIN Item i ON (i.Val > g.Start AND i.Val < g.End)
WHERE g.Id = 2
AND (
(g.ParentId IS NULL)
OR
(EXISTS(SELECT 1 FROM GroupToItem gti WHERE g.ParentId = gti.GroupId AND i.Id = gti.ItemId))
)
SQL Fiddle
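The "parent is NULL or the item belongs to the parent" condition can be checked on the sample data from the question. This sketch uses an in-memory SQLite database from Python rather than SQL Server. The table name grp and the columns parent_id, lo and hi (the range split into two integer columns) are assumptions for illustration, and it says nothing about how SQL Server's optimizer would order the work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, val INTEGER);
    CREATE TABLE grp (id INTEGER PRIMARY KEY, parent_id INTEGER,
                      lo INTEGER, hi INTEGER);
    CREATE TABLE group_to_item (group_id INTEGER, item_id INTEGER);

    INSERT INTO item VALUES (1, 12), (2, 5), (3, 22);
    INSERT INTO grp VALUES (1, NULL, 10, 30), (2, 1, 20, 25), (3, NULL, 0, 15);
    INSERT INTO group_to_item VALUES (1, 1), (1, 3);
""")


def candidate_items(conn, group_id):
    """Return (group_id, item_id) rows that would be inserted for one group."""
    sql = """
        SELECT g.id, i.id
        FROM grp AS g
        JOIN item AS i ON i.val > g.lo AND i.val < g.hi
        WHERE g.id = ?
          AND (g.parent_id IS NULL
               OR EXISTS (SELECT 1
                          FROM group_to_item AS gti
                          WHERE gti.group_id = g.parent_id
                            AND gti.item_id = i.id))
        ORDER BY i.id
    """
    return conn.execute(sql, (group_id,)).fetchall()


print(candidate_items(conn, 2))  # [(2, 3)]
print(candidate_items(conn, 3))  # [(3, 1), (3, 2)]
```

Group 2 has a parent, so only the parent's items (1 and 3) are considered and item 3 falls in range. Group 3 has no parent, so every item is a candidate.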
Try this:
INSERT INTO GroupToItem(GroupID, ItemID)
SELECT g.ID,
i.ID
FROM Group g
JOIN Item i ON i.Val IN g.Range
WHERE g.ID = @GroupID
and (g.ID in (3) or exists (select top 1 1 from GroupToItem gti where g.Parent = gti.GroupID AND i.ID = gti.ItemID))
If a Range column is a varchar datatype, you can try something like this:
INSERT INTO GROUPTOITEM (GROUPID, ITEMID)
SELECT A.ID, B.ID
FROM GROUP AS A
LEFT JOIN ITEM AS B
ON B.VAL BETWEEN CAST(SUBSTRING(SUBSTRING(A.RANGE,1,CHARINDEX('-',A.RANGE,1)-1),2,10) AS INT)
AND CAST(REPLACE(SUBSTRING(A.RANGE,CHARINDEX('-',A.RANGE,1)+1,10),']','') AS INT)

Query to find mutual likes?

I have a table which has the id of a particular person and id of the person he likes.
Likes
(p1,p2)
id1,id2
id2,id1
id3,id4
id3,id5
expected output
id1,id2
I also have to remove duplicates, meaning id1,id2 should be returned only once.
It is an exercise question.
select hh.id, hh.name, hh.grade as gr
, hh.id2, kk.name, kk.grade as gr1
from ( select id, id2, grade, name
from highschooler ab
, Likes cd
where ab.id = cd.id1 ) hh
, highschooler kk
where hh.id2 = kk.id
This query returns something like this
student id,student name,student grade,friend student likes,friend name,friend grade
This should do it joining on itself:
SELECT p.p1, p.p2
FROM Likes p
INNER JOIN Likes p2 ON
p.p1=p2.p2 AND
p.p2=p2.p1 AND
p.p1<p2.p1
Sample Fiddle Demo
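The self-join can be tried on the sample rows from the question with an in-memory SQLite database from Python. This is only a sketch: the integer ids stand in for id1 to id5, the function name mutual_likes is made up, and DISTINCT is added as a guard in case the table holds repeated rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE likes (p1 INTEGER, p2 INTEGER)")
conn.executemany(
    "INSERT INTO likes VALUES (?, ?)",
    [(1, 2), (2, 1), (3, 4), (3, 5)],
)


def mutual_likes(conn):
    """Return each mutually liking pair once, smaller id first."""
    sql = """
        SELECT DISTINCT a.p1, a.p2
        FROM likes AS a
        JOIN likes AS b
          ON b.p1 = a.p2 AND b.p2 = a.p1   -- the reverse row must exist
        WHERE a.p1 < a.p2                  -- keep one orientation per pair
        ORDER BY a.p1, a.p2
    """
    return conn.execute(sql).fetchall()


print(mutual_likes(conn))  # [(1, 2)]
```

The comparison a.p1 < a.p2 is what removes the mirrored duplicate: of the rows (1,2) and (2,1), only the first passes.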
I think the nicest way to do this is with a group by. In SQL Server, this requires using a case statement:
with l as (
select (case when p1 < p2 then p1 else p2 end) as pfirst,
(case when p1 < p2 then p2 else p1 end) as psecond
from likes
)
select pfirst, psecond
from l
group by pfirst, psecond
having count(*) = 2
If you have duplicates in the original data, also select p1 in the CTE (it is not visible outside the CTE otherwise) and change the having clause to:
having count(distinct p1) = 2

SQL display two results side-by-side

I have two tables, and am doing an ordered select on each of them. I would like to see the results of both orderings in one result.
Example (simplified):
"SELECT * FROM table1 ORDER BY visits;"
name|# of visits
----+-----------
AA | 5
BB | 9
CC | 12
.
.
.
"SELECT * FROM table2 ORDER BY spent;"
name|$ spent
----+-------
AA | 20
CC | 30
BB | 50
.
.
.
I want to display the results as two columns so I can visually get a feeling if the most frequent visitors are also the best buyers. (I know this example is bad DB design and not a real scenario. It is an example)
I want to get this:
name by visits|name by spent
--------------+-------------
AA | AA
BB | CC
CC | BB
I am using SQLite.
Select A.Name as NameByVisits, B.Name as NameBySpent
From (Select C.*, RowId as RowNumber From (Select Name From Table1 Order by visits) C) A
Inner Join
(Select D.*, RowId as RowNumber From (Select Name From Table2 Order by spent) D) B
On A.RowNumber = B.RowNumber
Try this
select
ISNULL(ts.rn,tv.rn),
ts.name,
tv.name
from
(select *, (select count(*) rn from spent s where s.value>=spent.value ) rn from spent) ts
full outer join
(select *, (select count(*) rn from visits v where v.visits>=visits.visits ) rn from visits) tv
on ts.rn = tv.rn
order by ISNULL(ts.rn,tv.rn)
It creates a rank for each entry in the source table, and joins the two on their rank. If there are duplicate ranks they will return duplicates in the results.
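Since the question is about SQLite, here is a sketch of the same rank-and-join idea that avoids FULL OUTER JOIN and ISNULL, run from Python against an in-memory database. It assumes the column names name, visits and spent, uses the sample rows from the question, orders ascending as in the question, and assumes the visits and spent values are unique (ties would get the same rank and produce extra rows).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (name TEXT, visits INTEGER)")
conn.execute("CREATE TABLE table2 (name TEXT, spent INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)",
                 [("AA", 5), ("BB", 9), ("CC", 12)])
conn.executemany("INSERT INTO table2 VALUES (?, ?)",
                 [("AA", 20), ("CC", 30), ("BB", 50)])


def side_by_side(conn):
    """Pair the n-th name by visits with the n-th name by spent."""
    sql = """
        SELECT v.name AS name_by_visits, s.name AS name_by_spent
        FROM (SELECT t.name,
                     (SELECT COUNT(*) FROM table1 AS x
                      WHERE x.visits <= t.visits) AS rn
              FROM table1 AS t) AS v
        JOIN (SELECT u.name,
                     (SELECT COUNT(*) FROM table2 AS y
                      WHERE y.spent <= u.spent) AS rn
              FROM table2 AS u) AS s
          ON s.rn = v.rn
        ORDER BY v.rn
    """
    return conn.execute(sql).fetchall()


print(side_by_side(conn))  # [('AA', 'AA'), ('BB', 'CC'), ('CC', 'BB')]
```

The inner join drops unmatched ranks, so if one table has more rows than the other, the surplus rows of the longer table are left out.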
I know it is not a direct answer, but I was searching for it so in case someone needs it: this is a simpler solution for when the results are only one per column:
select
(select roleid from role where rolename='app.roles/anon') roleid, -- the name of the subselect will be the name of the column
(select userid from users where username='pepe') userid; -- same here
Result:
roleid | userid
--------------------------------------+--------------------------------------
31aa33c4-4e66-4da3-8525-42689e46e635 | 12ad8c95-fbef-4287-9834-7458a4b250ee
For an RDBMS that supports common table expressions and window functions (e.g., SQL Server, Oracle, PostgreSQL), I would use:
WITH most_visited AS
(
SELECT ROW_NUMBER() OVER (ORDER BY num_visits) AS num, name, num_visits
FROM visits
),
most_spent AS
(
SELECT ROW_NUMBER() OVER (ORDER BY amt_spent) AS num, name, amt_spent
FROM spent
)
SELECT mv.name, ms.name
FROM most_visited mv INNER JOIN most_spent ms
ON mv.num = ms.num
ORDER BY mv.num
Just join table1 and table2 with name as the key, like below:
select a.name,
b.name,
a.NumOfVisitField,
b.TotalSpentField
from table1 a
left join table2 b on a.name = b.name