SQL: Jaccard similarity

My table looks as follows:
author | group
daniel | group1,group2,group3,group4,group5,group8,group10
adam | group2,group5,group11,group12
harry | group1,group10,group15,group13,group15,group18
...
...
I want my output to look like:
author1 | author2 | intersection | union
daniel | adam | 2 | 9
daniel | harry| 2 | 11
adam | harry| 0 | 10
THANK YOU

Try the following (for BigQuery Legacy SQL):
SELECT
a.author AS author1,
b.author AS author2,
SUM(a.item=b.item) AS intersection,
EXACT_COUNT_DISTINCT(a.item) + EXACT_COUNT_DISTINCT(b.item) - intersection AS [union]
FROM FLATTEN((
SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS a
CROSS JOIN FLATTEN((
SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS b
WHERE a.author < b.author
GROUP BY 1,2
Added solution for BigQuery Standard SQL
WITH YourTable AS (
SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
SELECT author, SPLIT(grp) AS grp
FROM YourTable
)
SELECT
a.author AS author1,
b.author AS author2,
(SELECT COUNT(1) FROM a.grp) AS count1,
(SELECT COUNT(1) FROM b.grp) AS count2,
(SELECT COUNT(1) FROM UNNEST(a.grp) AS agrp JOIN UNNEST(b.grp) AS bgrp ON agrp = bgrp) AS intersection_count,
(SELECT COUNT(1) FROM (SELECT * FROM UNNEST(a.grp) UNION DISTINCT SELECT * FROM UNNEST(b.grp))) AS union_count
FROM tempTable a
JOIN tempTable b
ON a.author < b.author
What I like about this one:
much simpler / friendlier code
no CROSS JOIN or extra GROUP BY needed
If you try it, make sure to uncheck the Use Legacy SQL checkbox under Show Options.

I propose this option that scales better:
WITH YourTable AS (
SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
SELECT author, grp
FROM YourTable, UNNEST(SPLIT(grp)) as grp
),
intersection AS (
SELECT a.author AS author1, b.author AS author2, COUNT(1) as intersection
FROM tempTable a
JOIN tempTable b
USING (grp)
WHERE a.author > b.author
GROUP BY a.author, b.author
),
count_distinct_groups AS (
SELECT author, COUNT(DISTINCT grp) as count_distinct_groups
FROM tempTable
GROUP BY author
),
join_it AS (
SELECT
intersection.*, cg1.count_distinct_groups AS count_distinct_groups1, cg2.count_distinct_groups AS count_distinct_groups2
FROM
intersection
JOIN
count_distinct_groups cg1
ON
intersection.author1 = cg1.author
JOIN
count_distinct_groups cg2
ON
intersection.author2 = cg2.author
)
SELECT
*,
count_distinct_groups1 + count_distinct_groups2 - intersection AS unionn,
intersection / (count_distinct_groups1 + count_distinct_groups2 - intersection) AS jaccard
FROM
join_it
A full cross join on big data (tens of thousands x millions) fails because of too much shuffling, and the second proposal takes hours to execute. This one takes minutes.
The consequence of this approach, though, is that pairs with no intersection do not appear in the output, so the process that consumes the result is responsible for treating missing pairs as zero (for example with IFNULL).
One last detail: the union of daniel and harry is 10 rather than 11, because group15 is repeated in the initial example.
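To make the missing-pairs caveat concrete, here is a minimal sketch that runs the same join-on-group idea in SQLite through Python. The table name membership, the aliases inter and uni, and the Python post-processing are my own assumptions for illustration; SQLite stands in for BigQuery here, so the rows are split in Python rather than with SPLIT/UNNEST.

```python
import sqlite3
from itertools import combinations

# Sample data from the question; harry's duplicate group15 is kept on purpose.
RAW = {
    "daniel": "group1,group2,group3,group4,group5,group8,group10",
    "adam": "group2,group5,group11,group12",
    "harry": "group1,group10,group15,group13,group15,group18",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE membership (author TEXT, grp TEXT)")
# One row per distinct (author, group); this plays the role of tempTable.
conn.executemany(
    "INSERT INTO membership VALUES (?, ?)",
    sorted({(a, g) for a, gs in RAW.items() for g in gs.split(",")}),
)

QUERY = """
WITH sizes AS (
    SELECT author, COUNT(*) AS n FROM membership GROUP BY author
),
overlap AS (
    -- Joining on the group only produces pairs that share at least one group.
    SELECT a.author AS author1, b.author AS author2, COUNT(*) AS inter
    FROM membership a
    JOIN membership b ON a.grp = b.grp
    WHERE a.author < b.author
    GROUP BY a.author, b.author
)
SELECT o.author1, o.author2, o.inter, s1.n + s2.n - o.inter AS uni
FROM overlap o
JOIN sizes s1 ON s1.author = o.author1
JOIN sizes s2 ON s2.author = o.author2
ORDER BY o.author1, o.author2
"""
rows = conn.execute(QUERY).fetchall()

# The consumer fills in the pairs the query did not return (no shared group).
found = {(a, b): (i, u) for a, b, i, u in rows}
sizes = dict(conn.execute("SELECT author, COUNT(*) FROM membership GROUP BY author"))
result = {}
for a, b in combinations(sorted(RAW), 2):
    inter, union = found.get((a, b), (0, sizes[a] + sizes[b]))
    result[(a, b)] = (inter, union, inter / union)
```

The query itself returns only the adam/daniel and daniel/harry pairs; adam/harry is added afterwards with an intersection of 0.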

Inspired by Mikhail Berlyant's second answer, here is essentially the same method reformatted for Presto (as another example for a different flavor of SQL). Again all credit to Mikhail for this one.
WITH
YourTable AS (
SELECT
'daniel' AS author,
'group1,group2,group3,group4,group5,group8,group10' AS grp
UNION ALL
SELECT
'adam' AS author,
'group2,group5,group11,group12' AS grp
UNION ALL
SELECT
'harry' AS author,
'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
SELECT
author,
SPLIT(grp, ',') AS grp
FROM
YourTable
)
SELECT
a.author AS author1,
b.author AS author2,
CARDINALITY(a.grp) AS count1,
CARDINALITY(b.grp) AS count2,
CARDINALITY(ARRAY_INTERSECT(a.grp, b.grp)) AS intersection_count,
CARDINALITY(ARRAY_UNION(a.grp, b.grp)) AS union_count
FROM tempTable a
JOIN tempTable b ON a.author < b.author
;
Note that the counts for harry and the union_count differ slightly from the expected output in the question, because only unique entries are counted: in the question harry has two group15 values, but only one is counted here:
author1 | author2 | count1 | count2 | intersection_count | union_count
---------+---------+--------+--------+--------------------+-------------
daniel | harry | 7 | 5 | 2 | 10
adam | harry | 4 | 5 | 0 | 9
adam | daniel | 4 | 7 | 2 | 9
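The difference between counting raw entries and counting unique entries can be checked with a few lines of Python. This is only a sketch of the arithmetic on the question's sample data, not part of the Presto query; the variable names are mine.

```python
daniel = "group1,group2,group3,group4,group5,group8,group10".split(",")
harry = "group1,group10,group15,group13,group15,group18".split(",")  # group15 twice

raw_count = len(harry)            # counts the duplicate group15
unique_count = len(set(harry))    # counts each group once

shared = set(daniel) & set(harry)
# Adding raw list lengths reproduces the 11 from the question...
naive_union = len(daniel) + len(harry) - len(shared)
# ...while a real set union gives the deduplicated value.
true_union = len(set(daniel) | set(harry))
```

Here naive_union is 11 and true_union is 10, which is exactly the gap between the question's expected output and the set-based results.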

SQL: Reverse tree traversal on single result

My table looks like this:
id | name | type_id | desc | parent_id
1 | Foo | 1 | Foo | NULL
2 | Bar | 2 | Bar | 1
3 | FB | 2 | FB | 1
4 | Foo1 | 1 | Foo1 | NULL
5 | Bar1 | 2 | Bar1 | 4
6 | FB1 | 2 | FB1 | 4
And I want to provide an ID of the lowest node, returning everything up to the highest node in a single row (There is other data that I'm returning along with this).
For example, I want to provide ID 3, and the results to look like so:
xxxxx (other data) | id | name | type_id | desc | parent_id | id | name | type_id | desc | parent_id
xxxxxxx | 3 | FB | 2 | FB | 1 | 1 | Foo | 1 | Foo | NULL
Unfortunately, I haven't found anything that can work for me. I have a CTE but it goes top down and each node is its own row:
WITH RECURSIVE cte AS (
select T.*
from table as T
where T.id = 3
union all
select T.*
from table as T
inner join cte as C
on T.parent_id = C.id
)
SELECT * FROM cte
When I do this, I only get one result:
id | name | type_id | desc | parent_id
3 | FB | 2 | FB | 1
Any help would be appreciated, thanks!
The common-table expression has the right shape: it should generate one row for the original id, and then one row per ancestor. As written, though, the recursive join (T.parent_id = C.id) walks down to the children of each row, and node 3 has none, which is why you get a single row. To walk up, join each row to its parent with t.id = c.parent_id. To pivot the resulting rows to columns, you can then use conditional aggregation; this requires that you decide in advance the maximum number of levels. For two levels, this would be:
with recursive cte as (
select t.*, 1 lvl
from table as t
where t.id = 3
union all
select t.*, c.lvl + 1
from table as t
inner join cte as c on t.id = c.parent_id -- walk up: join each row to its parent
)
select
max(id) filter(where lvl = 1) id,
max(name) filter(where lvl = 1) name,
max(type_id) filter(where lvl = 1) type_id,
max(descr) filter(where lvl = 1) descr,
max(parent_id) filter(where lvl = 1) parent_id,
max(id) filter(where lvl = 2) id2,
max(name) filter(where lvl = 2) name2,
max(type_id) filter(where lvl = 2) type_id2,
max(descr) filter(where lvl = 2) descr2,
max(parent_id) filter(where lvl = 2) parent_id2
from cte
You might also want to consider accumulating the rows as an array of JSON objects:
with recursive cte as (
select t.*, 1 lvl
from table as t
where t.id = 3
union all
select t.*, c.lvl + 1
from table as t
inner join cte as c on t.id = c.parent_id -- walk up: join each row to its parent
)
select jsonb_agg(to_jsonb(c) order by lvl) res
from cte c
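As a way to try the bottom-up walk without a database server, here is a minimal sketch using SQLite through Python. The table name nodes and the column name descr are assumptions for illustration, and CASE expressions replace the FILTER clause so that older SQLite builds also work; the recursive step joins each row to its parent (t.id = c.parent_id).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT, "
    "type_id INTEGER, descr TEXT, parent_id INTEGER)"
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Foo", 1, "Foo", None),
        (2, "Bar", 2, "Bar", 1),
        (3, "FB", 2, "FB", 1),
        (4, "Foo1", 1, "Foo1", None),
        (5, "Bar1", 2, "Bar1", 4),
        (6, "FB1", 2, "FB1", 4),
    ],
)

# Start at the given node, then repeatedly fetch the parent of the last row.
WALK_UP = """
WITH RECURSIVE cte(id, name, type_id, descr, parent_id, lvl) AS (
    SELECT id, name, type_id, descr, parent_id, 1
    FROM nodes
    WHERE id = ?
    UNION ALL
    SELECT t.id, t.name, t.type_id, t.descr, t.parent_id, c.lvl + 1
    FROM nodes AS t
    JOIN cte AS c ON t.id = c.parent_id
)
"""

# One row per level, lowest node first.
chain = conn.execute(
    WALK_UP + "SELECT id, name, lvl FROM cte ORDER BY lvl", (3,)
).fetchall()

# The same rows pivoted into a single row with conditional aggregation.
pivoted = conn.execute(
    WALK_UP
    + """
    SELECT MAX(CASE WHEN lvl = 1 THEN id END),
           MAX(CASE WHEN lvl = 1 THEN name END),
           MAX(CASE WHEN lvl = 2 THEN id END),
           MAX(CASE WHEN lvl = 2 THEN name END)
    FROM cte
    """,
    (3,),
).fetchone()
```

For id 3, chain holds the node and its parent, and pivoted holds both in one row.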
I have used Oracle 11g using Pivot, Row_number and hierarchical queries to solve this.
WITH CTE1 AS (SELECT A.*, LEVEL AS LVL FROM TABLE1 A
START WITH ID IN (2,3)
CONNECT BY PRIOR PARENT_ID = ID)
select * from (
select x.*, row_number() over (order by id desc) rn from (
SELECT DISTINCT ID, NAME, TYPE_ID, DESCRIPTION, PARENT_ID FROM CTE1 ORDER BY ID DESC) x) y
pivot
( min(id) ID, min(name) name, min(type_id) type_id,
min(description) description, min(parent_id) for rn in (1, 2, 3)
);

SQL Query with combining MAX() and SUM() aggregates

I have tried looking into different topics over here and in other forums but I can't seem to find a solution to my problem.
What I'm trying to achieve is: "Display the net sales (in dollars) of the Product Line with the highest revenue for that Customer. Use a heading of: Best Sales. Format as $999,999.99."
Here's what I've tried so far:
SELECT cc.CustID, cc.CompanyName, cc.ContactName, pl.pl_id,to_char((sum(od.unitprice*od.quantity*(1-discount))), '$9,999,999.99') as rev
FROM corp.customers cc JOIN corp.orders co ON (cc.CustID=co.CustID)
LEFT OUTER JOIN corp.order_details od ON (co.orderID=od.orderID)
LEFT OUTER JOIN corp.products cp ON (od.ProductID=cp.ProductID)
LEFT OUTER JOIN corp.product_lines pl ON (cp.pl_id=pl.pl_id)
GROUP BY cc.CustID, cc.CompanyName, cc.ContactName, pl.pl_id
HAVING sum(od.unitprice*od.quantity*(1-discount))=
(
SELECT max(sum(od.unitprice*od.quantity*(1-discount)))
FROM corp.customers cc JOIN corp.orders co ON (cc.CustID=co.CustID)
JOIN corp.order_details od ON (co.orderID=od.orderID)
JOIN corp.products cp ON (od.ProductID=cp.ProductID)
JOIN corp.product_lines pl ON (cp.pl_id=pl.pl_id)
GROUP BY cc.CustID, cc.CompanyName, cc.ContactName, pl.pl_id);
This returns only one row, the highest revenue across all customers, but I would like it to show, for each customer, the product line with the highest revenue.
The result is shown below.
CustID | Company Name | Contact Name | PL_ID | Revenue
QUICK | QUICK-Stop | Horst Kloss | 1 | $37,161.63
I would like it to show something like.
CustID | Company Name | Contact Name | PL_ID | Revenue
QUICK | QUICK-Stop | Horst Kloss | 1 | $37,161.63
QS | QUICK-Start | Clark Stone | 2 | $50,000.00
QUI | QUICK | Mary Haynes | 1 | $60,000.00
QShelf | QUICK-Shelf | Doreen Lucas | 4 | $35,161.63
Any help is appreciated. Thank you!
This query wraps your original query, uses the rank() function, partitioned by customer, to order each customer's product lines by the rev column, and then keeps only the highest rev per customer. It returns multiple rows for a customer if several product lines tie on the same rev value; change rank() to row_number() if you only want one.
You could also use CTEs instead of the nested queries; it won't make any difference.
select CustID, CompanyName, ContactName, pl_id, rev from (
select CustID, CompanyName, ContactName, pl_id, to_char(rev, '$9,999,999.99') as rev,
rank() over(partition by CustID order by rev desc) r
from (
SELECT cc.CustID, cc.CompanyName, cc.ContactName, pl.pl_id,
sum(od.unitprice*od.quantity*(1-discount)) as rev
FROM corp.customers cc JOIN corp.orders co ON (cc.CustID=co.CustID)
LEFT OUTER JOIN corp.order_details od ON (co.orderID=od.orderID)
LEFT OUTER JOIN corp.products cp ON (od.ProductID=cp.ProductID)
LEFT OUTER JOIN corp.product_lines pl ON (cp.pl_id=pl.pl_id)
GROUP BY cc.CustID, cc.CompanyName, cc.ContactName, pl.pl_id
) q
) q2 where r=1
Since you didn't provide us with sample input data for your tables, I've knocked up a simple example that you can hopefully use to amend your query:
WITH sample_data AS (SELECT 1 ID, 1 id2, 10 val FROM dual UNION ALL
SELECT 1 ID, 1 id2, 20 val FROM dual UNION ALL
SELECT 1 ID, 2 id2, 30 val FROM dual UNION ALL
SELECT 1 ID, 2 id2, 40 val FROM dual UNION ALL
SELECT 2 ID, 1 id2, 50 val FROM dual UNION ALL
SELECT 2 ID, 2 id2, 60 val FROM dual UNION ALL
SELECT 2 ID, 3 id2, 60 val FROM dual)
SELECT ID,
id2,
max_sum_val
FROM (SELECT ID,
id2,
SUM(val) sum_val,
MAX(SUM(val)) OVER (PARTITION BY ID) max_sum_val
FROM sample_data
GROUP BY ID, id2)
WHERE sum_val = max_sum_val;
ID ID2 MAX_SUM_VAL
---------- ---------- -----------
1 2 70
2 2 60
2 3 60
This displays every id2 value whose sum(val) ties for the highest within its ID. If you don't want to display all tied rows, you can use the row_number() analytic function instead:
WITH sample_data AS (SELECT 1 ID, 1 id2, 10 val FROM dual UNION ALL
SELECT 1 ID, 1 id2, 20 val FROM dual UNION ALL
SELECT 1 ID, 2 id2, 30 val FROM dual UNION ALL
SELECT 1 ID, 2 id2, 40 val FROM dual UNION ALL
SELECT 2 ID, 1 id2, 50 val FROM dual UNION ALL
SELECT 2 ID, 2 id2, 60 val FROM dual UNION ALL
SELECT 2 ID, 3 id2, 60 val FROM dual)
SELECT ID,
id2,
sum_val AS max_sum_val
FROM (SELECT ID,
id2,
SUM(val) sum_val,
row_number() OVER (PARTITION BY ID ORDER BY SUM(val) DESC, id2) rn
FROM sample_data
GROUP BY ID, id2)
WHERE rn = 1;
ID ID2 MAX_SUM_VAL
---------- ---------- -----------
1 2 70
2 2 60
ETA:
That means your query would end up something like:
SELECT custid,
companyname,
contactname,
pl_id,
to_char(rev, '$9,999,999.99') rev
FROM (SELECT cc.custid,
cc.companyname,
cc.contactname,
pl.pl_id,
SUM(od.unitprice * od.quantity * (1 - discount)) AS rev,
MAX(SUM(od.unitprice * od.quantity * (1 - discount))) OVER (PARTITION BY cc.custid) max_rev
FROM corp.customers cc
INNER JOIN corp.orders co ON (cc.custid = co.custid)
LEFT OUTER JOIN corp.order_details od ON (co.orderid = od.orderid)
LEFT OUTER JOIN corp.products cp ON (od.productid = cp.productid)
LEFT OUTER JOIN corp.product_lines PL ON (cp.pl_id = pl.pl_id)
GROUP BY cc.custid,
cc.companyname,
cc.contactname,
pl.pl_id)
WHERE rev = max_rev;
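The tie behaviour of the two ranking functions can be reproduced on the sample data above with SQLite through Python. This is a sketch under assumptions: it needs an SQLite build with window-function support (3.25 or later), and it aggregates in a CTE first instead of nesting SUM inside the window function; the CTE names sums and ranked are mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_data (id INTEGER, id2 INTEGER, val INTEGER)")
conn.executemany(
    "INSERT INTO sample_data VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 1, 20), (1, 2, 30), (1, 2, 40),
     (2, 1, 50), (2, 2, 60), (2, 3, 60)],
)

# {ranking} is filled in with either RANK() or ROW_NUMBER() and its window.
TEMPLATE = """
WITH sums AS (
    SELECT id, id2, SUM(val) AS sum_val
    FROM sample_data
    GROUP BY id, id2
),
ranked AS (
    SELECT id, id2, sum_val,
           {ranking} AS rn
    FROM sums
)
SELECT id, id2, sum_val FROM ranked WHERE rn = 1 ORDER BY id, id2
"""

# RANK() gives tied sums the same rank, so both tied rows survive.
with_ties = conn.execute(TEMPLATE.format(
    ranking="RANK() OVER (PARTITION BY id ORDER BY sum_val DESC)"
)).fetchall()

# ROW_NUMBER() with id2 as a tie-breaker keeps exactly one row per id.
one_per_id = conn.execute(TEMPLATE.format(
    ranking="ROW_NUMBER() OVER (PARTITION BY id ORDER BY sum_val DESC, id2)"
)).fetchall()
```

The PARTITION BY clause is what makes the ranking restart for each id; without it only the single overall maximum would be returned.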

Clause group by if two lines return the maximum

My query is:
SELECT ID,B,C,D, SUM(X), SUM(Y)
FROM TABLE
GROUP BY ID,B,C,D
But if the query returns two rows with the same ID, I want to keep only the row with the maximum value of column D.
For example if the query returns:
+----+---+---+---+--------+--------+
| ID | B | C | D | SUM(X) | SUM(Y) |
+----+---+---+---+--------+--------+
| 2 | 1 | 1 | 1 | 70 | 100 |
| 2 | 1 | 1 | 3 | 100 | 150 |
+----+---+---+---+--------+--------+
Then drop the line with D=1, and keep only the line with D=3
If you want one row per id, then that should be the only column in the group by:
SELECT ID, SUM(X), SUM(Y)
FROM TABLE
GROUP BY ID;
Put your query inside of a CTE and then filter the result:
;WITH CTE AS (
SELECT ID,B,C,D, SUM(X) AS X, SUM(Y) AS Y
FROM TABLE
GROUP BY ID,B,C,D
)
SELECT * FROM CTE T
WHERE [D] = (SELECT MAX(D) FROM CTE where id = t.id)
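Here is a minimal sketch of that CTE-plus-filter idea, run in SQLite through Python. The table name readings, the raw rows, and the extra ID 5 are made up for illustration; the rows for ID 2 are chosen so that they aggregate to the two rows shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (id INTEGER, b INTEGER, c INTEGER, d INTEGER, "
    "x INTEGER, y INTEGER)"
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?, ?, ?)",
    [
        (2, 1, 1, 1, 30, 40),
        (2, 1, 1, 1, 40, 60),    # D=1 rows sum to X=70, Y=100
        (2, 1, 1, 3, 100, 150),  # D=3 row: X=100, Y=150
        (5, 1, 2, 2, 7, 9),      # an ID with a single D value
    ],
)

QUERY = """
WITH cte AS (
    SELECT id, b, c, d, SUM(x) AS x, SUM(y) AS y
    FROM readings
    GROUP BY id, b, c, d
)
SELECT t.id, t.b, t.c, t.d, t.x, t.y
FROM cte AS t
-- Keep only the aggregated row whose D is the largest for its ID.
WHERE t.d = (SELECT MAX(m.d) FROM cte AS m WHERE m.id = t.id)
ORDER BY t.id
"""
kept = conn.execute(QUERY).fetchall()
```

The aggregated D=1 row for ID 2 is dropped and the D=3 row is kept, while ID 5 passes through unchanged.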
You want to show one row per ID/B/C combination. In case you find multiple results for one combination (ID=2, B=1, C=1 in your example), you want the one with the greatest D. So you must rank your aggregation results.
In standard SQL you rank your rows with RANK, DENSE_RANK or ROW_NUMBER depending on what you want to do with ties. We don't know your DBMS, but many DBMS feature these functions, so just try.
select id, b, c, d, sum_x, sum_y
from
(
select
id, b, c, d, sum(x) as sum_x, sum(y) as sum_y,
row_number() over (partition by id, b, c order by d desc) as rnk
from table
group by id, b, c, d
) ranked
where rnk = 1;
You can use a JOIN like this:
SELECT t1.ID, t1.B, t1.C, t1.D, SUM(t1.X), SUM(t1.Y)
FROM TABLE1 t1
join (select ID, max(D) D from TABLE1 group by ID) t2
on t1.ID=t2.ID and t1.D=t2.D
GROUP BY t1.ID, t1.B, t1.C, t1.D

Select except where different in SQL

I need a bit of help with a SQL query.
Imagine I've got the following table
id | date | price
1 | 1999-01-01 | 10
2 | 1999-01-01 | 10
3 | 2000-02-02 | 15
4 | 2011-03-03 | 15
5 | 2011-04-04 | 16
6 | 2011-04-04 | 20
7 | 2017-08-15 | 20
What I need is all dates where only one price is present.
In this example I need to get rid of rows 5 and 6 (because there are two different prices for the same date) and either row 1 or row 2 (because they are duplicates).
How do I do that?
select date,
count(distinct price) as prices -- included to test
from MyTable
group by date
having count(distinct price) = 1 -- distinct for the duplicate pricing
The following should work with any DBMS
SELECT id, date, price
FROM TheTable o
WHERE NOT EXISTS (
SELECT *
FROM TheTable i
WHERE i.date = o.date
AND (
i.price <> o.price
OR (i.price = o.price AND i.id < o.id)
)
)
;
JohnHC's answer is more readable and delivers the information the OP asked for ("[...] I need all the dates [...]").
My answer, though less readable at first, is more general (it allows for more complex tie-breaking criteria) and can also return the full row (with id and price, not just date).
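Both approaches can be compared side by side on the question's data with SQLite through Python. This is a sketch for experimentation; the table name prices is an assumption, and the NOT EXISTS condition is written in the equivalent shorter form (different price, or same date with a lower id).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (id INTEGER, date TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        (1, "1999-01-01", 10),
        (2, "1999-01-01", 10),
        (3, "2000-02-02", 15),
        (4, "2011-03-03", 15),
        (5, "2011-04-04", 16),
        (6, "2011-04-04", 20),
        (7, "2017-08-15", 20),
    ],
)

# GROUP BY / HAVING: returns only the dates that have a single distinct price.
dates = [row[0] for row in conn.execute(
    "SELECT date FROM prices GROUP BY date "
    "HAVING COUNT(DISTINCT price) = 1 ORDER BY date"
)]

# NOT EXISTS: returns full rows, keeping the lowest id among duplicates.
full_rows = conn.execute("""
    SELECT o.id, o.date, o.price
    FROM prices AS o
    WHERE NOT EXISTS (
        SELECT 1
        FROM prices AS i
        WHERE i.date = o.date
          AND (i.price <> o.price OR i.id < o.id)
    )
    ORDER BY o.id
""").fetchall()
```

The first query yields four dates; the second yields rows 1, 3, 4 and 7, so row 2 is dropped as the duplicate and rows 5 and 6 are dropped for having conflicting prices.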
;WITH CTE_1(ID ,DATE,PRICE)
AS
(
SELECT 1 , '1999-01-01',10 UNION ALL
SELECT 2 , '1999-01-01',10 UNION ALL
SELECT 3 , '2000-02-02',15 UNION ALL
SELECT 4 , '2011-03-03',15 UNION ALL
SELECT 5 , '2011-04-04',16 UNION ALL
SELECT 6 , '2011-04-04',20 UNION ALL
SELECT 7 , '2017-08-15',20
)
,CTE2
AS
(
SELECT A.*
FROM CTE_1 A
INNER JOIN
CTE_1 B
ON A.DATE=B.DATE AND A.PRICE!=B.PRICE
)
SELECT * FROM CTE_1 WHERE ID NOT IN (SELECT ID FROM CTE2)

SQL intersect with group by

Given these two tables/sets with different groups of items, how can I find which groups in set1 span more than a single group in set2? In other words, how can I find the groups in set1 that cannot be covered by a single group in set2?
E.g. for the tables below, A (1,2,5) is the only group that spans s1 (1,2,3) and s2 (2,3,4,5). B and C are not answers because each is covered by the single group s2.
I would prefer to use SQL (SQL Server 2008 R2 is available).
Thanks.
set1                       set2
+---------+----------+     +---------+----------+
| group   | item     |     | group   | item     |
+---------+----------+     +---------+----------+
| A       | 1        |     | s1      | 1        |
| A       | 2        |     | s1      | 2        |
| A       | 5        |     | s1      | 3        |
| B       | 4        |     | s2      | 2        |
| B       | 5        |     | s2      | 3        |
| C       | 3        |     | s2      | 4        |
| C       | 5        |     | s2      | 5        |
+---------+----------+     +---------+----------+
Use this sqlfiddle to try: http://sqlfiddle.com/#!6/fac8a/3
Or use the script below to generate temp tables to try out the answers:
create table #set1 (grp varchar(5),item int)
create table #set2 (grp varchar(5),item int)
insert into #set1 select 'a',1 union select 'a',2 union select 'a',5 union select 'b',4 union select 'b',5 union select 'c',3 union select 'c',5
insert into #set2 select 's1',1 union select 's1',2 union select 's1',3 union select 's2',2 union select 's2',3 union select 's2',4 union select 's2',5
select * from #set1
select * from #set2
--drop table #set1
--drop table #set2
Select the groups from set1 for which no single group in set2 contains all of their items:
select s1.grp from set1 s1
where not exists(
select * from set2 s2 where not exists(
select item from set1 s11
where s11.grp = s1.grp
except
select item from set2 s22
where s22.grp = s2.grp))
group by s1.grp
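The double NOT EXISTS is a subset test in disguise: a set1 group is reported when it is not a subset of any set2 group. A minimal sketch of that logic in plain Python, using the question's data (the helper name group_items is mine):

```python
from collections import defaultdict

set1 = [("A", 1), ("A", 2), ("A", 5), ("B", 4), ("B", 5), ("C", 3), ("C", 5)]
set2 = [("s1", 1), ("s1", 2), ("s1", 3),
        ("s2", 2), ("s2", 3), ("s2", 4), ("s2", 5)]


def group_items(rows):
    """Collect the items of each group into a set."""
    groups = defaultdict(set)
    for grp, item in rows:
        groups[grp].add(item)
    return groups


g1 = group_items(set1)
g2 = group_items(set2)

# A set1 group is "uncovered" when no set2 group is a superset of it.
uncovered = sorted(
    grp for grp, items in g1.items()
    if not any(items <= other for other in g2.values())
)
```

Only group A remains, because item 1 appears only in s1 and item 5 only in s2.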
Ok. This is ugly, but it should work. I tried it in fiddle. I think it can be done through windowing, but I have to think about it.
Here is the ugly one for now.
WITH d1 AS (
SELECT set1.grp
, COUNT(*) cnt
FROM set1
GROUP BY set1.grp
), d2 AS (
SELECT set1.grp grp1
, set2.grp grp2
, COUNT(set1.item) cnt
FROM set1
INNER JOIN set2
ON set1.item = set2.item
GROUP BY set1.grp
, set2.grp
)
SELECT grp
FROM d1
EXCEPT
SELECT d1.grp
FROM d1
INNER JOIN d2
ON d2.grp1 = d1.grp
AND d2.cnt = d1.cnt
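For anyone who wants to run the counting approach without SQL Server, here is a sketch of the same query in SQLite through Python. It assumes, as the sample data does, that no (group, item) pair is repeated within a table; otherwise the counts would need DISTINCT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE set1 (grp TEXT, item INTEGER)")
conn.execute("CREATE TABLE set2 (grp TEXT, item INTEGER)")
conn.executemany(
    "INSERT INTO set1 VALUES (?, ?)",
    [("A", 1), ("A", 2), ("A", 5), ("B", 4), ("B", 5), ("C", 3), ("C", 5)],
)
conn.executemany(
    "INSERT INTO set2 VALUES (?, ?)",
    [("s1", 1), ("s1", 2), ("s1", 3),
     ("s2", 2), ("s2", 3), ("s2", 4), ("s2", 5)],
)

QUERY = """
WITH d1 AS (
    -- Number of items in each set1 group.
    SELECT grp, COUNT(*) AS cnt FROM set1 GROUP BY grp
),
d2 AS (
    -- Number of items each set1 group shares with each set2 group.
    SELECT set1.grp AS grp1, set2.grp AS grp2, COUNT(*) AS cnt
    FROM set1
    JOIN set2 ON set1.item = set2.item
    GROUP BY set1.grp, set2.grp
)
SELECT grp FROM d1
EXCEPT
-- Groups fully matched by some set2 group are covered, so remove them.
SELECT d1.grp
FROM d1
JOIN d2 ON d2.grp1 = d1.grp AND d2.cnt = d1.cnt
"""
spanning = [row[0] for row in conn.execute(QUERY)]
```

A set1 group is covered when its shared-item count with some set2 group equals its own size, so subtracting the covered groups leaves only A.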
Can you check this
SELECT DISTINCT a.Group1, a.Item, b.CNT
FROM SET1 a
INNER JOIN
(SELECT GroupA, COUNT(*) CNT
FROM
(
SELECT DISTINCT a.Group1 GroupA, b.Group1 GroupB
FROM SET1 a
INNER JOIN SET2 b ON a.Item = b.Item
) a GROUP BY GroupA
) b ON a.Group1 = b.GroupA
WHERE b.CNT > 1
Thanks for the comments. I believe the following edited query will work:
Select distinct grp1, initialRows, max(MatchedRows) from
(
select a.grp as grp1, b.grp as grp2
, count(distinct case when b.item is not null then a.item end) as MatchedRows
, d.InitialRows
from set1 a
left join set2 b
on a.item = b.item
left join
(select grp, count(distinct Item) as InitialRows from set1
group by grp) d
on a.grp = d.grp
group by a.grp, b.grp, InitialRows
) c
group by grp1, InitialRows
having max(MatchedRows) < InitialRows
I think this will do the trick. The subquery returns, for each set1 group, the set2 groups that match all of its items, by counting the matches and comparing that count to the item count of the set1 group.
select s.grp from #set1 s
group by s.grp
having not exists (
select s2.grp from #set2 s2 inner join #set1 s1 on s2.item = s1.item
where s1.grp = s.grp
group by s2.grp
having count(s.item) = count(s2.item)
)
You can find the solution through following query:
SELECT A.GROUP AS G1, A.ITEM AS T1, B.GROUP, B.ITEM
FROM SET1 A RIGHT JOIN SET2 B ON A.ITEM=B.ITEM
WHERE A.GROUP IS NULL
Basically the same as Robert Co's answer.
I did not get this from his answer; I came up with it independently.
select set1.[group]
from set1
except
select set1count.[group]
from ( select set1.[group], count(*) as [count]
from set1
group by set1.[group]
) as set1count
join ( select set1.[group] as [group1], count(*) as [count]
from set1
join set2
on set2.item = set1.item
group by set1.[group], set2.[group] -- this is the magic
) as set12count
on set1count.[group] = set12count.[group1] -- note no set2.group match
and set1count.[count] = set12count.[count] -- the items in set1 are all in at least one set2 group