Calculate correlation between two words - sql

Let's say I have a table in Postgres that stores a column of strings like this.
animal
cat/dog/bird
dog/lion
bird/dog
dog/cat
cat/bird
What I want to do is calculate how "correlated" any two animals are with each other in this column, and store that as its own table so that I can easily look up how often "cat" and "dog" show up together.
For example, "cat" shows up a total of 3 times in all of these strings. Of those instances, "dog" shows up in the same string 2 out of the three times. Therefore, the correlation from cat -> dog would be 66%, and the number of co-occurrence instances (we'll call this instance_count) would be 2.
According to the above logic, the resulting table from this example would look like this.
base_animal  correlated_animal  instance_count  correlation
cat          cat                3               100
cat          dog                2               66
cat          bird               2               66
cat          lion               0               0
dog          dog                4               100
dog          cat                2               50
dog          bird               2               50
dog          lion               1               25
bird         bird               3               100
bird         cat                2               66
bird         dog                2               66
bird         lion               0               0
lion         lion               1               100
lion         cat                0               0
lion         dog                1               100
lion         bird               0               0
I've come up with a working solution in Python, but I have no idea how to do this easily in Postgres. Anybody have any ideas?
Edit:
Based off Erwin's answer, here's the same idea, except this answer doesn't make a record for animal combinations that never intersect.
with flat as (
    select t.id, a
    from (select row_number() over () as id, animal from animals) t,
         unnest(string_to_array(t.animal, '/')) a
), ct as (
    select a, count(*) as ct from flat group by 1
)
select
    f1.a as b_animal,
    f2.a as c_animal,
    count(*) as instance_count,
    round(count(*) * 100.0 / ct.ct, 0) as correlation
from flat f1
join flat f2 using(id)
join ct on f1.a = ct.a
group by f1.a, f2.a, ct.ct

Won't get much simpler or faster than this:
WITH flat AS (
   SELECT t.id, a
   FROM  (SELECT row_number() OVER () AS id, animal FROM tbl) t
       , unnest(string_to_array(t.animal, '/')) a
   )
, ct AS (SELECT a, count(*) AS ct FROM flat GROUP BY 1)
SELECT a AS base_animal
     , b AS corr_animal
     , COALESCE(xc.ct, 0) AS instance_count
     , COALESCE(round(xc.ct * 100.0 / x.ct), 0) AS correlation
FROM  (
   SELECT a.a, b.a AS b, a.ct
   FROM   ct a, ct b
   ) x
LEFT   JOIN (
   SELECT f1.a, f2.a AS b, count(*) AS ct
   FROM   flat f1
   JOIN   flat f2 USING (id)
   GROUP  BY 1,2
   ) xc USING (a,b)
ORDER  BY a, instance_count DESC;
db<>fiddle here
Produces your desired result, except that it adds a consistent sort order and rounds correctly.
This assumes distinct animals per row in the source data. Else it's unclear how to count the same animal in the same row exactly ...
Step-by-step
CTE flat attaches an arbitrary row number as unique id. (If you have a PRIMARY KEY, use that instead and skip the subquery t.) Then it unnests the animals to get one animal per row (and id).
CTE ct gets the list of distinct animals & their total count.
The outer SELECT builds the complete raster of animal pairs (a / b) in subquery x, plus total count for a. LEFT JOIN to the actual pair count in subquery xc. Two steps are needed to keep pairs that never met in the result. Finally, compute and round the "correlation" smartly. See:
Look for percentage of characters in a word/phrase within a block of text
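For readers who find the steps easier to follow outside SQL, here is a minimal Python sketch of the same three steps on the sample data from the question. It is only an illustration of the logic, not the asker's Python solution and not part of the answer; the variable names are made up, and it assumes distinct animals per row, as the query does.

```python
from collections import Counter
from itertools import product

rows = ["cat/dog/bird", "dog/lion", "bird/dog", "dog/cat", "cat/bird"]

# "flat": one (row id, animal) pair per animal; set() assumes distinct animals per row
flat = [(i, a) for i, row in enumerate(rows) for a in set(row.split("/"))]

# "ct": number of rows each animal appears in
totals = Counter(a for _, a in flat)

# "xc": co-occurrence counts, i.e. a self-join of flat on the row id
pair_counts = Counter((a, b) for i, a in flat for j, b in flat if i == j)

# "x" LEFT JOIN "xc": the complete grid of pairs, with 0 for pairs that never met.
# Python's round() breaks exact .5 ties differently from Postgres round(numeric);
# no such tie occurs in this data.
result = {
    (a, b): (pair_counts[(a, b)], round(pair_counts[(a, b)] * 100 / totals[a]))
    for a, b in product(totals, repeat=2)
}
```

Looking up `result[("cat", "dog")]` gives the instance count and the rounded percentage for that ordered pair.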
Updated task
If you need neither pairs that never met nor pairings with self, this could be your query:
-- updated task excluding pairs that never met and pairing with self
WITH flat AS (
   SELECT t.id, a, count(*) OVER (PARTITION BY a) AS ct
   FROM  (SELECT row_number() OVER () AS id, animal FROM tbl) t
       , unnest(string_to_array(t.animal, '/')) a
   )
SELECT f1.a AS base_animal
     , f1.ct AS base_count
     , f2.a AS corr_animal
     , count(*) AS instance_count
     , round(count(*) * 100.0 / f1.ct) AS correlation
FROM   flat f1
JOIN   flat f2 USING (id)
WHERE  f1.a <> f2.a  -- exclude pairing with self
GROUP  BY f1.a, f1.ct, f2.a
ORDER  BY f1.a, instance_count DESC;
db<>fiddle here
I added the total occurrence count of the base animal as base_count.
Most notably, I dropped the additional CTE ct, and get the base_count from the first CTE with a window function. That's about the same cost by itself, but we then don't need another join in the outer query, which should be cheaper overall.
You can still use that if you include pairs with self. Check the fiddle.
Oh, and we don't need COALESCE any more.
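As a cross-check of what the reduced query should return, the following Python sketch computes the same columns for the sample data. It is an illustration only; the names `base_count`, `pair_count` and `report` are hypothetical, and distinct animals per row are assumed.

```python
from collections import Counter
from itertools import permutations

rows = ["cat/dog/bird", "dog/lion", "bird/dog", "dog/cat", "cat/bird"]

base_count = Counter()  # rows containing each animal
pair_count = Counter()  # rows containing both animals of an ordered pair

for row in rows:
    animals = set(row.split("/"))
    base_count.update(animals)
    # permutations() yields ordered pairs of different animals only,
    # so pairings with self and pairs that never met do not appear
    pair_count.update(permutations(animals, 2))

# (base_animal, base_count, corr_animal, instance_count, correlation)
report = sorted(
    (
        (a, base_count[a], b, n, round(n * 100 / base_count[a]))
        for (a, b), n in pair_count.items()
    ),
    key=lambda r: (r[0], -r[3]),
)
```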

The idea is to split the data into rows (using unnest(string_to_array())) and then cross join the result with itself to get all pairs.
with data1 as (
select *
from corr_tab), data2 as (
select distinct un as base_animal, x.correlated_animal
from corr_tab, unnest(string_to_array(animal,'/')) un,
(select distinct un as correlated_animal
from corr_tab, unnest(string_to_array(animal,'/')) un) X)
select base_animal, correlated_animal,
(case
when
data2.base_animal = data2.correlated_animal
then
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL)
else
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL
and substring(animal,data2.correlated_animal) is not NULL)
end) instance_count,
(case
when
data2.base_animal = data2.correlated_animal
then
100
else
ceil(
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL
and substring(animal,data2.correlated_animal) is not NULL) * 100 /
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL) )
end) correlation
from data2
order by base_animal
Refer to fiddle here.

Related

How to rank the closest matches to a record of attributes in Google BigQuery

I have a BigQuery table (Table A) that has around 1,000 records containing an ID and 15 datapoints that range between 0 - 100. Imagine it's like a top-trumps card but with 15 attributes. Here's an example:
Record_ID = 0001
Size = 56
Height = 34
Width = 23
Weight = 78
Color = 42
Volume = 8
Density = 77
Smell = 23
Touch = 67
Hearing = 52
Power = 87
Sensitivity = 3
Strength = 78
Endurance = 45
Reliability = 87
And I have a separate table (Table B) that has exactly the same schema with around 5,000 different records
I need to take each Record_ID from Table A and then somehow rank the records in Table B that most closely match across all attributes. If I were just trying to rank records based on a single attribute such as Size then this would be really easy but I don't know where to start when I'm trying to find the closest matches and rankings across all attributes.
Is there any kind of model or approach that might help me achieve this? I have been reading up on clustering and K-means nearest neighbor but these don't seem to help.
How about a cross join of both tables? 1,000 times 5,000 rows will generate a 5 million row table, which is then aggregated down to the closest match for each record of Table A.
We compare the values in both tables by subtracting and taking the absolute value: abs(A.size-B.size). In the following example I chose only two attributes; you can add more. To penalise large differences more strongly, use pow(...,2) instead of abs. Normalising each variable to a scale from 0 to 1 beforehand would help avoid misleading results; I did not do this here.
with recursive
tblA_tmp as (select id, cast(rand()*100 as int64) as size,cast(rand()*1000 as int64) as height from unnest(generate_array(1,10000) ) id ),
tblB_tmp as (select id, cast(rand()*100 as int64) as size,cast(rand()*1000 as int64) as height from unnest(generate_array(1,10000) ) id),
tblA as (Select * from tblA_tmp union all select * from tblA where false),
tblB as (Select * from tblB_tmp union all select * from tblB where false)
SELECT *
from (
SELECT A.id as id_A,
array_agg(B.id order by abs(A.size-B.size)+abs(A.height-B.height) limit 1)[safe_offset(0)] as id_B,
min(abs(A.size-B.size)+abs(A.height-B.height)) as distance
from tblA A
cross join tblB B
group by 1
)
left join tblA on id_A=tblA.id
left join tblB on id_B=tblB.id
Please ignore the WITH RECURSIVE part. Only by using recursive can I make the example tables A and B permanent, so that they are not regenerated at each step.
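The distance idea, including the normalisation step that the query leaves out, can be sketched in plain Python as follows. This is a small illustration under assumptions, not a BigQuery implementation: the function name `nearest_matches`, the dictionary layout and the sample values are all made up.

```python
def nearest_matches(table_a, table_b, columns):
    """For each record of table_a, return (id of the closest record in table_b, distance)."""
    # Min-max range per column over both tables, so no attribute dominates the sum
    everything = list(table_a.values()) + list(table_b.values())
    low = {c: min(r[c] for r in everything) for c in columns}
    span = {c: (max(r[c] for r in everything) - low[c]) or 1 for c in columns}

    def distance(ra, rb):
        # Sum of normalised absolute differences (Manhattan distance)
        return sum(abs(ra[c] - rb[c]) / span[c] for c in columns)

    return {
        id_a: min(
            ((id_b, distance(ra, rb)) for id_b, rb in table_b.items()),
            key=lambda pair: pair[1],
        )
        for id_a, ra in table_a.items()
    }


# Hypothetical sample data with two attributes on very different scales
table_a = {1: {"size": 50, "height": 500}, 2: {"size": 10, "height": 900}}
table_b = {
    10: {"size": 48, "height": 520},
    11: {"size": 12, "height": 100},
    12: {"size": 15, "height": 880},
}
matches = nearest_matches(table_a, table_b, ["size", "height"])
```

To rank several candidates instead of keeping one, the inner `min(...)` could be replaced by `sorted(...)` over the same pairs.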
Consider the approach below (note the use of the CORR function):
select a_id, b_id from (
select a.Record_ID a_id, b.Record_ID b_id, (
select corr(x.value, y.value)
from (
select as struct value, col
from (select * from unnest([a]))
unpivot (value for col in (Size, Height, Width, Weight, Color, Volume, Density, Smell, Touch, Hearing, Power, Sensitivity, Strength, Endurance, Reliability))
) x
join (
select as struct value, col
from (select * from unnest([b]))
unpivot (value for col in (Size, Height, Width, Weight, Color, Volume, Density, Smell, Touch, Hearing, Power, Sensitivity, Strength, Endurance, Reliability))
) y
using(col)
) a_b_corr
from tableA a
cross join tableB b
)
qualify 1 = row_number() over(partition by a_id order by a_b_corr desc)
As a direction for improvement, you can move all those unpivots out and up into
from tableA a cross join tableB b
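To make the correlation-based ranking concrete, here is a small Python sketch that computes the Pearson correlation between the attribute vectors of two records and picks the best-correlated candidate. It only illustrates the idea; `pearson`, `best_match` and the sample vectors are hypothetical and nothing here calls BigQuery.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences, or None if undefined."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant vector has no defined correlation
    return cov / sqrt(var_x * var_y)


def best_match(record, candidates):
    """Return the id of the candidate whose attributes correlate best with record."""
    scored = {cid: pearson(record, values) for cid, values in candidates.items()}
    scored = {cid: score for cid, score in scored.items() if score is not None}
    return max(scored, key=scored.get)


# Hypothetical attribute vectors (same attribute order in every record)
record_a = [56, 34, 23, 78]
table_b = {101: [55, 35, 20, 80], 102: [10, 90, 70, 5]}
best = best_match(record_a, table_b)
```

One property worth keeping in mind: Pearson correlation compares the shape of two profiles, not their magnitude, so `[1, 2, 3]` and `[2, 4, 6]` correlate perfectly even though the values differ.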

Select column with maximum value in another column but with aggregate SUM calculation

For each name, I need to output the category with the MAX net revenue and I am not sure how to do this. I have tried a bunch of different approaches, but it basically looks like this:
SELECT Name, Category, MAX(CatNetRev)
FROM (
SELECT Name, Category, SUM(Price*(Shipped-Returned)) AS "CatNetRev"
FROM a WITH (NOLOCK)
INNER JOIN b WITH (NOLOCK) ON b.ID = a.ID
...
-- (bunch of additional joins here, not relevant to question)
WHERE ... -- (not relevant to question)
GROUP BY Name, Category
) a GROUP BY Name;
This currently doesn't work because "Category" is not contained in an aggregate function or Group By (and this is obvious) but other approaches I have tried have failed for different reasons.
Each Name can have a bunch of different Categories, and Names can have the same Categories but the overlap is irrelevant to the question. I need to output just each unique Name that I have (we can assume they are already all unique) along with the "Top Selling Category" based on that Net Revenue calculation.
So for example if I have:
Name  Category  CatNetRev
A     1         100
A     2         300
A     3         50
B     1         300
B     2         500
C     1         40
C     2         20
C     3         10
I would want to output:
Name  Category
A     2
B     2
C     1
What's the best way to go about doing this?
Having to guess at your data schema a bit, as you didn't alias any of your columns, or define what table a vs b really was (as Gordon alluded). I'd use CROSS APPLY to get the max value, then bind the revenues in a WHERE clause, like so.
DECLARE @Revenue TABLE
(
    Name VARCHAR(50)
    ,Category VARCHAR(50)
    ,NetRevenue DECIMAL(16, 9)
);
INSERT INTO @Revenue
(
    Name
    ,Category
    ,NetRevenue
)
SELECT Name
    ,Category
    ,SUM(a.Price * (b.Shipped - b.Returned)) AS CatNetRev
FROM Item AS a
INNER JOIN ShipmentDetails AS b ON b.ID = a.ID
WHERE 1 = 1
GROUP BY
    Name
    ,Category;
-- Keep only the rows whose revenue equals the maximum for their Name
SELECT r.Name
    ,r.Category
FROM @Revenue AS r
CROSS APPLY (
    SELECT MAX(r2.NetRevenue) AS MaxRevenue
    FROM @Revenue AS r2
    WHERE r.Name = r2.Name
) AS mr
WHERE r.NetRevenue = mr.MaxRevenue;
You can use window functions:
select * from
(
select * , rank() over (partition by Name order by CatNetRev desc) rn
from table
) t
where t.rn = 1
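The expected result for the sample data in the question can be checked with a few lines of Python. This is only a sketch of the "top row per group" logic; `top_categories` is a hypothetical name and the rows stand in for the already aggregated (Name, Category, CatNetRev) result.

```python
# (Name, Category, CatNetRev) rows from the example in the question
rows = [
    ("A", 1, 100), ("A", 2, 300), ("A", 3, 50),
    ("B", 1, 300), ("B", 2, 500),
    ("C", 1, 40), ("C", 2, 20), ("C", 3, 10),
]


def top_categories(rows):
    """Return {name: category with the highest net revenue}."""
    best = {}
    for name, category, revenue in rows:
        # Strict '>' keeps the first category seen when revenues tie
        if name not in best or revenue > best[name][1]:
            best[name] = (category, revenue)
    return {name: category for name, (category, _) in best.items()}


top = top_categories(rows)
```

Unlike this sketch, the rank() query and the CROSS APPLY query both return every tied category when two categories share the maximum revenue for a name.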

array_agg from two different tables without join

I need to take inputs as array from two different tables which are not related.
Sample Data
CITY1 TABLE
NAME   TOT_POP
city1  10
city2  20
FACILITIES TABLE
NAME  Quota
f1    1
f2    2
f3    3
f4    4
Close solution I found
SELECT ARRAY_AGG(t1."TOT_POP") as City_Pop, ARRAY_AGG(t2."Quota") as Facility_Quota FROM
(SELECT "TOT_POP", row_number() OVER (order by (SELECT 0)) FROM CITY1) as t1 right JOIN
(select "Quota", row_number() OVER (order by (SELECT 0)) FROM FACILITIES) as t2 on t1.row_number = t2.row_number;
The output array contains null and joining on key is unnecessary in my case.
City_Pop           Facility_Quota
{10,20,NULL,NULL}  {1,2,3,4}
I want the following result, without using join if possible
City_Pop  Facility_Quota
{10,20}   {1,2,3,4}
You are overcomplicating things. Just use two scalar subqueries:
select (select array_agg(tot_pop) from city1) as city_pop,
(select array_agg(quota) from facilities) as facility_quot;
or slightly faster:
select (array(select tot_pop from city1)) as city_pop,
(array(select quota from facilities)) as facility_quot;

Convert a categorical column to binary representation in SQL

Suppose a table has a column holding an array of strings that contains categorical data. Is there an easy way to convert this schema so that there is one boolean column per category, giving a binary encoding of that categorical column?
Example:
id type
-------------
1 [A, C]
2 [B, C]
being converted to :
id is_A is_B is_C
1 1 0 1
2 0 1 1
I know I can do this 'by hand', i.e. using:
WITH flat AS (SELECT id, cat FROM t, UNNEST(type) AS cat),
mid AS (SELECT id, IF(cat = 'A', 1, 0) AS is_A, IF(cat = 'B', 1, 0) AS is_B, IF(cat = 'C', 1, 0) AS is_C FROM flat)
SELECT id, SUM(is_A) AS is_A, SUM(is_B) AS is_B, SUM(is_C) AS is_C FROM mid GROUP BY id
But I am looking for a solution that works when the number of categories is around 1-10K
By the way I am using BigQuery SQL.
looking for a solution that works when the number of categories is around 1-10K
Below is for BigQuery SQL
Step 1 - dynamically produce the query (similar to the one used in your question, but now built dynamically based on your table, yourTable)
#standardSQL
WITH categories AS (SELECT DISTINCT cat FROM yourTable, UNNEST(type) AS cat)
SELECT CONCAT(
"WITH categories AS (SELECT DISTINCT cat FROM yourTable, UNNEST(type) AS cat), ",
"ids AS (SELECT DISTINCT id FROM yourTable), ",
"pairs AS (SELECT id, cat FROM ids CROSS JOIN categories), ",
"flat AS (SELECT id, cat FROM yourTable, UNNEST(type) cat), ",
"combinations AS ( ",
" SELECT p.id, p.cat AS col, IF(f.cat IS NULL, 0, 1) AS flag ",
" FROM pairs AS p LEFT JOIN flat AS f ",
" ON p.cat = f.cat AND p.id=f.id ",
") ",
"SELECT id, ",
STRING_AGG(CONCAT("SUM(IF(col = '", cat, "', flag, 0)) as is_", cat) ORDER BY cat),
" FROM combinations ",
"GROUP BY id ",
"ORDER BY id"
) as query
FROM categories
Step 2 - copy result of above query, paste it back to Web UI and run Query
I think you've got the idea. You can implement it as above purely in SQL, or you can generate the final query in any client of your choice.
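If the final query is generated in a client instead, the generation step could look roughly like the Python sketch below. It is a simplified variant, not the query produced by Step 1: it skips the CROSS JOIN of ids and categories, so ids with an empty array would be missing from the result. The function name `build_pivot_query` is hypothetical, and it assumes category names are plain identifiers that need no escaping.

```python
def build_pivot_query(categories, table="yourTable"):
    """Build one SUM(IF(...)) column per category, ordered by category name."""
    select_list = ",\n  ".join(
        f"SUM(IF(cat = '{c}', 1, 0)) AS is_{c}" for c in sorted(categories)
    )
    return (
        f"SELECT id,\n  {select_list}\n"
        f"FROM {table}, UNNEST(type) AS cat\n"
        "GROUP BY id\n"
        "ORDER BY id"
    )


query = build_pivot_query(["C", "A", "B"])
```

The resulting string can then be submitted through whichever BigQuery client is in use.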
I had tried this approach of generating the query (but in Python) the problem is that query can easily reach the 256KB limit of query size in BigQuery
First, let’s see how “easily” it is to reach 256KB limit
Assuming you have 10 chars as average length of category – in this case you can cover about 4750 categories with this approach.
With 20 as average - coverage is about 3480 and for 30 – 2750
If you "compress" the SQL a little by removing spaces, AS, etc., you can reach about 5400, 3800 and 2970 categories for 10, 20 and 30 chars respectively.
So, I would say: yes, agreed, it will most likely reach the limit before 5K categories in a real case.
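A rough way to reproduce this kind of estimate is to count the bytes each generated column adds to the query text. The Python sketch below does that for the uncompressed column template from Step 1; the `fixed_overhead` value is a guess for the rest of the query, single-byte characters are assumed, and the results are only expected to land in the same range as the figures above, not to match them exactly.

```python
LIMIT_BYTES = 256 * 1024  # the query size limit discussed above


def max_categories(avg_len, fixed_overhead=600, limit=LIMIT_BYTES):
    """Estimate how many category columns fit into the query text."""
    # One generated column looks like: SUM(IF(col = '<cat>', flag, 0)) as is_<cat>,
    # so the category name appears twice per column.
    per_column = len("SUM(IF(col = '', flag, 0)) as is_,") + 2 * avg_len
    return (limit - fixed_overhead) // per_column


estimates = {n: max_categories(n) for n in (10, 20, 30)}
```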
So, secondly, let's see if this is actually that big of a problem!
Just as an example, assume you need 6K categories. Let’s see how you can split this to two batches (assuming that 3K scenario does work as per initial solution)
What we need to do is to split categories to two groups – just based on category names
So the first group will be BETWEEN 'cat1' AND 'cat3000',
and the second group will be BETWEEN 'cat3001' AND 'cat6000'.
Now run both groups through Step 1 and Step 2, with the temp1 and temp2 tables as destinations.
In Step 1, add (at the very bottom of the query, after FROM categories)
WHERE cat BETWEEN 'cat1' AND 'cat3000'
for the first batch, and
WHERE cat BETWEEN 'cat3001' AND 'cat6000'
for the second batch.
Now, proceed to Step 3
Step 3 – Combining partial results
#standardSQL
SELECT * EXCEPT(id2)
FROM temp1 FULL JOIN (
SELECT id AS id2, * EXCEPT(id) FROM temp2
) ON id = id2
-- ORDER BY id
You can test last logic with below simple/dummy data
WITH temp1 AS (
SELECT 1 AS id, 1 AS is_A, 0 AS is_B UNION ALL
SELECT 2 AS id, 0 AS is_A, 1 AS is_B UNION ALL
SELECT 3 AS id, 1 AS is_A, 0 AS is_B
),
temp2 AS (
SELECT 1 AS id, 1 AS is_C, 0 AS is_D UNION ALL
SELECT 2 AS id, 1 AS is_C, 0 AS is_D UNION ALL
SELECT 3 AS id, 0 AS is_C, 1 AS is_D
)
Above can easily be extended to more than just two batches
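The combining step behaves like a full outer join on id, which the following Python sketch mimics for any number of batches. It is an illustration only: `combine_batches` is a hypothetical helper, each batch is modelled as a dict from id to its flag columns, and missing values are filled with None the way a FULL JOIN would produce NULLs.

```python
def combine_batches(*batches):
    """Full-outer-join several {id: {column: value}} batches on id."""
    # Collect every column name once, in first-seen order
    columns = []
    for batch in batches:
        for row in batch.values():
            for col in row:
                if col not in columns:
                    columns.append(col)

    merged = {}
    for batch in batches:
        for row_id, row in batch.items():
            # Start each id with all columns set to None, then fill in what exists
            merged.setdefault(row_id, dict.fromkeys(columns)).update(row)
    return merged


temp1 = {1: {"is_A": 1, "is_B": 0}, 2: {"is_A": 0, "is_B": 1}, 3: {"is_A": 1, "is_B": 0}}
temp2 = {1: {"is_C": 1, "is_D": 0}, 2: {"is_C": 1, "is_D": 0}, 3: {"is_C": 0, "is_D": 1}}
combined = combine_batches(temp1, temp2)
```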
Hope this helped

Get up to topmost parent

Let me explain my problem with a sample.
I have three tables ledger, balance, group.
Columns are
ledger ---> no, name, groupno
balance --> ledgerno, balance
group --> groupno, groupname, undergroupno
I want to show the ledger with top most parent which has balance > 0.
ledger
no name groupno
1 A 5
2 B 4
balance
ledgerno balance
1 100
2 200
group
groupno groupname undergroupno
1 AA 0
2 BB 0
3 CC 1
4 DD 1
5 EE 1
6 FF 1
7 GG 2
8 HH 2
9 II 2
10 JJ 2
So I want the result like this:
name balance
AA
CC
DD
B 100
EE
A 100
FF
I tried the query below, but it does not show the right results:
WITH rel AS (
SELECT groupname, amount
FROM (
WITH RECURSIVE rel_tree AS (
SELECT groupno, groupname, undergroupno
FROM "group"
WHERE undergroupno = 0
UNION ALL
SELECT groupno, groupname, undergroupno
FROM balance b
INNER JOIN ledger l ON l.no = b.ledgerno
INNER JOIN "group" g ON g.groupno = l.groupno AS tt
INNER JOIN rel_tree r ON r.groupno = tt.undergroupno
)
SELECT *, 0 AS amount
FROM rel_tree
GROUP BY groupno, groupname, undergroupno
)
SELECT *
FROM rel
UNION ALL
SELECT groupname, amount
FROM (
SELECT name AS groupname, balance AS amount, groupname AS ord
FROM balance b
INNER JOIN ledger l ON l.no = b.ledgerno
INNER JOIN "group" g ON g.groupno = l.groupno) AS ta
INNER JOIN rel ON rel.groupname = ta.ord
Using postgresql 9.3
First of all, NEVER EVER use a SQL reserved word as a name for a table or a column. NEVER. EVER. Below I use grp instead of group.
Second, use column names that are immediately clear. Below I use parent instead of undergroupno.
Third, this is a really nice problem that I happily spent some time on. I am using recursive data structures myself and getting the query right is always something of a puzzle.
Fourth, what you state that you want is rather impossible. You have output on multiple lines from one table (grp), which is interspersed with data from other tables. I have a solution that comes quite close to what you specified. Here it is:
WITH tree AS (
WITH RECURSIVE rel_tree(parent, groupno, refs, path) AS (
SELECT groupno, groupno, 0, lpad(groupno::text, 4, '0') FROM grp WHERE parent > 0
UNION
SELECT g.parent, t.groupno, t.refs+1, lpad(g.parent::text, 4, '0') || '.' || t.path FROM grp g
JOIN rel_tree t ON t.parent = g.groupno)
SELECT * FROM rel_tree WHERE parent > 0
UNION
SELECT groupno, groupno, 0 AS refs, lpad(groupno::text, 4, '0') FROM grp WHERE parent = 0)
SELECT repeat(' ', t.refs) || grp.groupname AS name, l.name AS ledger, b.balance
FROM grp
JOIN (
SELECT DISTINCT ON (groupno) groupno, parent, max(refs) AS refs, path
FROM tree
GROUP BY parent, groupno, path
ORDER BY groupno, path) t USING (groupno)
LEFT JOIN ledger l USING (groupno)
LEFT JOIN balance b ON b.ledgerno = l.no
ORDER BY t.path
This gives the output:
name, ledger, balance
AA
CC
DD, B, 200
EE, A, 100
FF
BB
GG
HH
II
JJ
A few words on the recursive-with query:
This query yields a self-inclusive complete hierarchy. What this means is that it lists for every node of the hierarchy all of its parents, including itself. If you run the tree CTE as a separate query, you will find that it returns more rows than the 10 in the grp table. This is because it lists all grp records with their groupno as groupno but also as parent and additionally all parent nodes higher up in the hierarchy. Such a self-inclusive complete hierarchy is very handy when analyzing other properties of recursive data structures, such as containment and parentage.
Note that the hierarchy is built from the bottom up, starting with every node having itself as a parent, a refs value of 0 (i.e. 0 referrals between parent and self) and a path which is just the groupno as a text value (padded with 0's). The recursion works its way up the hierarchy with increasing refs values and longer paths. The sub-select in the main query trims the complete list down to a single record for each grp record, which can be ordered by path. (Note that the use of refs can be omitted and replaced by length(path), but refs does have uses in other contexts of using a similar query.)
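The role of the zero-padded path can be illustrated outside SQL. The Python sketch below walks from each group up to its top-most parent, builds the same kind of path string, and sorts by it. It is not a translation of the recursive CTE, just a demonstration of why sorting by path yields the tree order; the `groups` dict and `path_of` are hypothetical names, and the data is assumed to contain no cycles.

```python
# groupno -> (groupname, parent), taken from the sample grp table; parent 0 means "no parent"
groups = {
    1: ("AA", 0), 2: ("BB", 0),
    3: ("CC", 1), 4: ("DD", 1), 5: ("EE", 1), 6: ("FF", 1),
    7: ("GG", 2), 8: ("HH", 2), 9: ("II", 2), 10: ("JJ", 2),
}


def path_of(groupno):
    """Zero-padded path from the top-most parent down to groupno, e.g. '0001.0003'."""
    parts = []
    while groupno != 0:
        parts.append(f"{groupno:04d}")  # 4 digits, so groupno values up to 9999
        groupno = groups[groupno][1]
    return ".".join(reversed(parts))


# Sorting by path puts every child directly under its parent
ordered = sorted(groups, key=path_of)
# The number of separators in the path equals the depth, used here for indentation
outline = [" " * path_of(g).count(".") + groups[g][0] for g in ordered]
```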
This query works to greater depths of the hierarchy as well. If you add:
11 KK 8
12 LL 8
13 MM 11
to table grp, the query will output:
name, ledger, balance
AA
CC
DD, B, 200
EE, A, 100
FF
BB
GG
HH
KK
MM
LL
II
JJ
Note that the path works with groupno values up to 9999. For larger groupno values increase the leading 0's in the recursive CTE, or consider using the ltree extension for more flexibility.