Convert a categorical column to binary representation in SQL

Consider a table with a column containing arrays of strings that hold categorical data. Is there an easy way to convert this schema so that there is one boolean column per category, i.e. a binary encoding of that categorical column?
Example:
id type
-------------
1 [A, C]
2 [B, C]
being converted to:
id is_A is_B is_C
1 1 0 1
2 0 1 1
I know I can do this 'by hand', i.e. using:
WITH flat AS (SELECT id, cat FROM t, UNNEST(type) AS cat),
mid AS (SELECT id, IF(cat = 'A', 1, 0) AS is_A, IF(cat = 'B', 1, 0) AS is_B, IF(cat = 'C', 1, 0) AS is_C FROM flat)
SELECT id, SUM(is_A) AS is_A, SUM(is_B) AS is_B, SUM(is_C) AS is_C FROM mid GROUP BY id
But I am looking for a solution that works when the number of categories is around 1-10K.
By the way I am using BigQuery SQL.

looking for a solution that works when the number of categories is around 1-10K
Below is for BigQuery SQL
Step 1 - dynamically generate the query (similar to the one used in your question, but now built dynamically based on your table - yourTable)
#standardSQL
WITH categories AS (SELECT DISTINCT cat FROM yourTable, UNNEST(type) AS cat)
SELECT CONCAT(
"WITH categories AS (SELECT DISTINCT cat FROM yourTable, UNNEST(type) AS cat), ",
"ids AS (SELECT DISTINCT id FROM yourTable), ",
"pairs AS (SELECT id, cat FROM ids CROSS JOIN categories), ",
"flat AS (SELECT id, cat FROM yourTable, UNNEST(type) cat), ",
"combinations AS ( ",
" SELECT p.id, p.cat AS col, IF(f.cat IS NULL, 0, 1) AS flag ",
" FROM pairs AS p LEFT JOIN flat AS f ",
" ON p.cat = f.cat AND p.id=f.id ",
") ",
"SELECT id, ",
STRING_AGG(CONCAT("SUM(IF(col = '", cat, "', flag, 0)) as is_", cat) ORDER BY cat),
" FROM combinations ",
"GROUP BY id ",
"ORDER BY id"
) as query
FROM categories
Step 2 - copy the result of the above query, paste it back into the Web UI, and run it
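To make Step 2 concrete: for the toy table in the question (categories A, B, C), the generated query would look roughly like this (reconstructed by hand from the template above):
#standardSQL
WITH categories AS (SELECT DISTINCT cat FROM yourTable, UNNEST(type) AS cat),
ids AS (SELECT DISTINCT id FROM yourTable),
pairs AS (SELECT id, cat FROM ids CROSS JOIN categories),
flat AS (SELECT id, cat FROM yourTable, UNNEST(type) cat),
combinations AS (
  SELECT p.id, p.cat AS col, IF(f.cat IS NULL, 0, 1) AS flag
  FROM pairs AS p LEFT JOIN flat AS f
  ON p.cat = f.cat AND p.id=f.id
)
SELECT id,
SUM(IF(col = 'A', flag, 0)) as is_A, SUM(IF(col = 'B', flag, 0)) as is_B, SUM(IF(col = 'C', flag, 0)) as is_C
FROM combinations
GROUP BY id
ORDER BY id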
I think you've got the idea. You can implement it purely in SQL as above, or generate the final query in any client of your choice
I had tried this approach of generating the query (but in Python); the problem is that the query can easily reach BigQuery's 256KB limit on query size
First, let's see how easily the 256KB limit is actually reached.
Assuming an average category name length of 10 characters, this approach can cover about 4,750 categories.
With an average of 20 characters, coverage drops to about 3,480, and with 30 characters to about 2,750.
If you "compress" the SQL a little by removing spaces, the AS keywords, etc., you can raise that to roughly 5,400, 3,800, and 2,970 categories for averages of 10, 20, and 30 characters respectively.
So, I would say: yes, agreed - in a real case it will most likely hit the limit before 5K categories.
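To check this up front, a minimal sketch (assuming the same yourTable / type schema as above) can measure the size of the SELECT list that Step 1 would generate:
#standardSQL
-- Estimate the size of the generated SELECT list, to see how close
-- you are to the 256KB query-size limit before running Step 1.
WITH categories AS (SELECT DISTINCT cat FROM yourTable, UNNEST(type) AS cat)
SELECT
  COUNT(*) AS category_count,
  -- +2 accounts for the ', ' separator STRING_AGG puts between entries
  SUM(BYTE_LENGTH(CONCAT("SUM(IF(col = '", cat, "', flag, 0)) as is_", cat)) + 2) AS approx_select_list_bytes
FROM categories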
So, secondly, let's see whether this is actually that big of a problem!
Just as an example, assume you need 6K categories. Let's see how you can split this into two batches (assuming that the 3K scenario works with the initial solution).
What we need to do is split the categories into two groups, based simply on the category names.
So the first group will be BETWEEN 'cat1' AND 'cat3000',
and the second group will be BETWEEN 'cat3001' AND 'cat6000'.
Now run Step 1 and Step 2 for both groups, with temp1 and temp2 as destination tables.
In Step 1, add the following at the very bottom of the query, after FROM categories:
WHERE cat BETWEEN 'cat1' AND 'cat3000'
for the first batch, and
WHERE cat BETWEEN 'cat3001' AND 'cat6000'
for the second batch.
Now, proceed to Step 3.
Step 3 – Combining partial results
#standardSQL
SELECT * EXCEPT(id2)
FROM temp1 FULL JOIN (
SELECT id AS id2, * EXCEPT(id) FROM temp2
) ON id = id2
-- ORDER BY id
You can test the last step with the simple/dummy data below:
WITH temp1 AS (
  SELECT 1 AS id, 1 AS is_A, 0 AS is_B UNION ALL
  SELECT 2 AS id, 0 AS is_A, 1 AS is_B UNION ALL
  SELECT 3 AS id, 1 AS is_A, 0 AS is_B
),
temp2 AS (
  SELECT 1 AS id, 1 AS is_C, 0 AS is_D UNION ALL
  SELECT 2 AS id, 1 AS is_C, 0 AS is_D UNION ALL
  SELECT 3 AS id, 0 AS is_C, 1 AS is_D
)
SELECT * EXCEPT(id2)
FROM temp1 FULL JOIN (
  SELECT id AS id2, * EXCEPT(id) FROM temp2
) ON id = id2
The above can easily be extended to more than two batches.
Hope this helped

Related

Calculate correlation between two words

Let's say I have a table in Postgres that stores a column of strings like this.
animal
cat/dog/bird
dog/lion
bird/dog
dog/cat
cat/bird
What I want to do, is calculate how "correlated" any two animals are to each other in this column, and store that as its own table so that I can easily look up how often "cat" and "dog" show up together.
For example, "cat" shows up a total of 3 times in all of these strings. Of those instances, "dog" shows up in the same string 2 out of the three times. Therefore, the correlation from cat -> dog would be 66%, and the number of co-occurrence instances (we'll call this instance_count) would be 2.
According to the above logic, the resulting table from this example would look like this.
base_animal | correlated_animal | instance_count | correlation
------------+-------------------+----------------+------------
cat         | cat               | 3              | 100
cat         | dog               | 2              | 66
cat         | bird              | 2              | 66
cat         | lion              | 0              | 0
dog         | dog               | 4              | 100
dog         | cat               | 2              | 50
dog         | bird              | 2              | 50
dog         | lion              | 1              | 25
bird        | bird              | 3              | 100
bird        | cat               | 2              | 66
bird        | dog               | 2              | 66
bird        | lion              | 0              | 0
lion        | lion              | 1              | 100
lion        | cat               | 0              | 0
lion        | dog               | 1              | 100
lion        | bird              | 0              | 0
I've come up with a working solution in Python, but I have no idea how to do this easily in Postgres. Anybody have any ideas?
Edit:
Based off Erwin's answer, here's the same idea, except this answer doesn't make a record for animal combinations that never intersect.
with flat as (
select t.id, a
from (select row_number() over () as id, animal from animals) t,
unnest(string_to_array(t.animal, '/')) a
), ct as (select a, count(*) as ct from flat group by 1)
select
f1.a as b_animal,
f2.a as c_animal,
count(*) as instance_count,
round(count(*) * 100.0 / ct.ct, 0) as correlation
from flat f1
join flat f2 using(id)
join ct on f1.a = ct.a
group by f1.a, f2.a, ct.ct
Won't get much simpler or faster than this:
WITH flat AS (
SELECT t.id, a
FROM (SELECT row_number() OVER () AS id, animal FROM tbl) t
, unnest(string_to_array(t.animal, '/')) a
)
, ct AS (SELECT a, count(*) AS ct FROM flat GROUP BY 1)
SELECT a AS base_animal
, b AS corr_animal
, COALESCE(xc.ct, 0) AS instance_count
, COALESCE(round(xc.ct * 100.0 / x.ct), 0) AS correlation
FROM (
SELECT a.a, b.a AS b, a.ct
FROM ct a, ct b
) x
LEFT JOIN (
SELECT f1.a, f2.a AS b, count(*) AS ct
FROM flat f1
JOIN flat f2 USING (id)
GROUP BY 1,2
) xc USING (a,b)
ORDER BY a, instance_count DESC;
db<>fiddle here
Produces your desired result, except for ...
added consistent sort order
rounded correctly
This assumes distinct animals per row in the source data. Else it's unclear how to count the same animal in the same row exactly ...
Step-by-step
CTE flat attaches an arbitrary row number as unique id. (If you have a PRIMARY KEY, use that instead and skip the subquery t.) Then unnest animals to get one pet per row (& id).
CTE ct gets the list of distinct animals & their total count.
The outer SELECT builds the complete raster of animal pairs (a / b) in subquery x, plus total count for a. LEFT JOIN to the actual pair count in subquery xc. Two steps are needed to keep pairs that never met in the result. Finally, compute and round the "correlation" smartly. See:
Look for percentage of characters in a word/phrase within a block of text
Updated task
If you don't need pairs that never met, nor pairing with self, this could be your query:
-- updated task excluding pairs that never met and same pairing with self
WITH flat AS (
SELECT t.id, a, count(*) OVER (PARTITION BY a) AS ct
FROM (SELECT row_number() OVER () AS id, animal FROM tbl) t
, unnest(string_to_array(t.animal, '/')) a
)
SELECT f1.a AS base_animal
, f1.ct AS base_count
, f2.a AS corr_animal
, count(*) AS instance_count
, round(count(*) * 100.0 / f1.ct) AS correlation
FROM flat f1
JOIN flat f2 USING (id)
WHERE f1.a <> f2.a -- exclude pairing with self
GROUP BY f1.a, f1.ct, f2.a
ORDER BY f1.a, instance_count DESC;
db<>fiddle here
I added the total occurrence count of the base animal as base_count.
Most notably, I dropped the additional CTE ct, and get the base_count from the first CTE with a window function. That's about the same cost by itself, but we then don't need another join in the outer query, which should be cheaper overall.
You can still use that if you include pairs with self. Check the fiddle.
Oh, and we don't need COALESCE any more.
The idea is to split the data into rows (using unnest(string_to_array())) and then cross-join the result with itself to get all permutations.
with data1 as (
select *
from corr_tab), data2 as (
select distinct un as base_animal, x.correlated_animal
from corr_tab, unnest(string_to_array(animal,'/')) un,
(select distinct un as correlated_animal
from corr_tab, unnest(string_to_array(animal,'/')) un) X)
select base_animal, correlated_animal,
(case
when
data2.base_animal = data2.correlated_animal
then
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL)
else
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL
and substring(animal,data2.correlated_animal) is not NULL)
end) instance_count,
(case
when
data2.base_animal = data2.correlated_animal
then
100
else
ceil(
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL
and substring(animal,data2.correlated_animal) is not NULL) * 100 /
(select count(*) from data1 where substring(animal,data2.base_animal) is not NULL) )
end) correlation
from data2
order by base_animal
Refer to fiddle here.

How to check the possibility of groups union in a sequence order

I have a table with the following columns:
ID_group, ID_elements
For example with the following records:
1, 1
1, 2
2, 2
2, 4
2, 5
2, 6
3, 7
And I have sets of the elements, for example: 1,2,5; 1,5,2; 1,2,4; 2,7;
I need to check (true or false) whether a common group exists for every pair of adjacent elements.
For example elements:
1,2,5 -> true [i.e. elements 1,2 have common group 1 and elements 2,5 have common group 2]
1,5,2 -> false [i.e. 1,5 do not have a common group, unlike 5,2; the result is false because of the pair 1,5]
1,2,4 -> true
2,7 -> false
First, we need a list of pairs. We can get this by taking your set as an array, turning each element into a row with unnest and then making pairs by matching each row with its previous row using lag.
with nums as (
select *
from unnest(array[1,2,5]) i
)
select lag(i) over() a, i b
from nums
offset 1;
a | b
---+---
1 | 2
2 | 5
(2 rows)
Then we join each pair with each matching row. To avoid counting duplicate data rows twice, we count only the distinct rows.
with nums as (
select *
from unnest(array[1,2,5]) i
), pairs as (
select lag(i) over() a, i b
from nums
offset 1
)
select
count(distinct(id_group,id_elements)) = (select count(*) from pairs)
from pairs
join foo on foo.id_group = a and foo.id_elements = b;
This works on any size array.
dbfiddle
Checking whether the elements in a set evaluate to true could also be done via a procedure/function: take the set representation as a string, split it into substrings, and return the required result (using a record for multiple entries). As a pure SQL workaround, below is a sample that checks whether a single group contains all of the given elements; you can adapt it to your requirement.
select case when exists (
         select id_group
         from table1
         where id_elements in (1, 2, 5)
         group by id_group
         having count(distinct id_elements) = 3
       ) then 'true' else 'false'
       end as result;
#Schwern, thank you, it helped. But I have changed the condition join foo on foo.id_group = a, because, as I understand it, a is an element's ID, not a group's. I have changed that section to the following:
join foo fA on fA.id_elements = a
join foo fB on fB.id_elements = b and fA.id_group = fB.id_group;
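For reference, a sketch of the full corrected query (assuming the same foo table with columns id_group and id_elements; both sides count distinct pairs so that duplicate pairs in the input set don't skew the comparison):
with nums as (
  select *
  from unnest(array[1,2,5]) i
), pairs as (
  select lag(i) over() a, i b
  from nums
  offset 1
)
select
  -- true only if every adjacent pair shares at least one group
  count(distinct (a, b)) = (select count(distinct (a, b)) from pairs) as ok
from pairs
join foo fA on fA.id_elements = a
join foo fB on fB.id_elements = b and fA.id_group = fB.id_group;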

BigQuery - Populating SELECT fields from results of a temp function

In Google BigQuery, I have a query where the same field name appears multiple times in various join subqueries. I would like to abstract this field name out into a temporary function, so that changing it inside the function alone updates it in all places.
This is the query I have:
SELECT *
FROM
(SELECT field1, COUNT(*) sq1_total
FROM table
WHERE condition = 1
GROUP BY field1) sq1
LEFT JOIN
(SELECT field1, COUNT(*) sq2_total
FROM table
WHERE condition = 0
GROUP BY field1) sq2
USING(field1)
This is what I would like to have:
CREATE TEMP FUNCTION replace_field_name() AS (...);
SELECT *
FROM
(SELECT replace_field_name(), COUNT(*) sq1_total
FROM table
WHERE condition = 1
GROUP BY replace_field_name()) sq1
LEFT JOIN
(SELECT replace_field_name(), COUNT(*) sq2_total
FROM table
WHERE condition = 0
GROUP BY replace_field_name()) sq2
USING(replace_field_name())
So that when I want to compare many different fields like this, I only need to change the field name in one place as opposed to five places.
Is this possible?
The thoughts/proposals below are relevant for BigQuery Standard SQL
I would like to abstract out this field name into a temporary function ...
As Tim mentioned in his comment, it is not really possible to do this the way you mocked it up
I want to compare many different fields like this, I only need to change the field name in one place as opposed to five places.
You can try to rewrite your query so that you need to change the field name in fewer places, as in the examples below
#standardSQL
SELECT * FROM (SELECT field1, COUNT(*) sq1_total FROM `project.dataset.table` WHERE condition = 1 GROUP BY 1) sq1
LEFT JOIN (SELECT field1, COUNT(*) sq2_total FROM `project.dataset.table` WHERE condition = 0 GROUP BY 1) sq2
USING (field1)
OR
#standardSQL
SELECT DISTINCT field1,
COUNTIF(condition = 1) OVER(PARTITION BY field1) sq1_total,
COUNTIF(condition = 0) OVER(PARTITION BY field1) sq2_total
FROM `project.dataset.table`
In both queries above there are "just" three places to replace the field name in (as opposed to 5 in the original query)
Obviously, this does not address the problem qualitatively - just quantitatively
Is this possible?
Good news - there is always a workaround - but usually it requires slightly changing something in your requirements or expectations
For example, in the solution below you need to set the field name only once!!! - in the UNNEST(['field1']) field line
#standardSQL
SELECT DISTINCT field, value,
COUNTIF(condition = 1) OVER(PARTITION BY field, value) sq1_total,
COUNTIF(condition = 0) OVER(PARTITION BY field, value) sq2_total
FROM (
SELECT field, REGEXP_EXTRACT(x, CONCAT(r'"', field, '":"?([^",]+)"?')) value, condition
FROM `project.dataset.table` t,
UNNEST([TO_JSON_STRING(t)]) x,
UNNEST(['field1']) field
)
the "price" is that you will get output in the form of (with dummy data):
Row field value sq1_total sq2_total
1 field1 1 1 3
2 field1 2 1 0
instead of the output from the original query:
Row field1 sq1_total sq2_total
1 1 1 3
2 2 1 null
I want to compare many different fields like this ...
The extra value of the above approach is that you can run your comparison (for as many fields as you want) in one shot - by adding the needed fields' names to the UNNEST(['field1']) field list, as in the example below
#standardSQL
SELECT DISTINCT field, value,
COUNTIF(condition = 1) OVER(PARTITION BY field, value) sq1_total,
COUNTIF(condition = 0) OVER(PARTITION BY field, value) sq2_total
FROM (
SELECT field, REGEXP_EXTRACT(x, CONCAT(r'"', field, '":"?([^",]+)"?')) value, condition
FROM `project.dataset.table` t,
UNNEST([TO_JSON_STRING(t)]) x,
UNNEST(['field1', 'field2']) field
)
-- ORDER BY field, value
so the result could look like:
Row field value sq1_total sq2_total
1 field1 1 1 3
2 field1 2 1 0
3 field2 1 1 1
4 field2 2 0 2
5 field2 3 1 0

sql query - difference between the row values of same column

Can anybody tell me how to calculate the difference between the rows of the same column?
ID DeviceID Reading Date Flag
1 2 10 12/02/2015 1
2 3 08 12/02/2015 1
3 2 12 12/02/2015 1
4 2 20 12/02/2015 0
5 4 10 12/02/2015 0
6 2 19 12/02/2015 0
In the above table I want to calculate the difference between the Readings for DeviceID 2 for some date, say 12/02/2015. For example:
(12 - 10 = 2)
(20 - 12 = 8)
(19 - 20 = -1)
and I want to sum up these differences, i.e. 2 + 8 + (-1) = 9
If you use MS Access, here is the approach I tried for your question.
I made 4 queries in MS Access.
Query1 gets the data for deviceid=2 and date=12/2/2015:
select id, reading from table1 where deviceid=2 and date=#12/2/2015#;
Then I made Query2 to get a row number for each row of Query1:
select
(select count(*) from query1 where a.id>=id) as rowno,
a.reading from query1 a;
Then I made Query3 to compute the difference between consecutive reading values from Query2:
select
(tbl2.reading-tbl1.reading) as diff
from query2 tbl1
left join query2 tbl2 on tbl1.rowno=tbl2.rowno-1
And then a final query to sum the differences from Query3:
SELECT sum(diff) as Total_Diff
FROM Query3;
But if you use SQL Server, you can use this query (see the example on sqlfiddle):
;with tbl as(
select row_number()over(order by id) as rowno,
reading
from table1
where deviceid=2 and date='20150212'
)
select sum(diff) as sum_diff
from (
select
(b.reading-a.reading) as diff
from tbl a
left join tbl b on a.rowno=b.rowno-1
) tbl_diff
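On SQL Server 2012 or later you can skip the self-join by using LAG() directly - a minimal sketch, assuming the same table1 schema as above (note the telescoping sum reduces to the last reading minus the first, here 19 - 10 = 9):
select sum(diff) as sum_diff
from (
  select reading - lag(reading) over (order by id) as diff
  from table1
  where deviceid = 2 and date = '20150212'
) t;  -- lag() yields NULL for the first row, which SUM() ignores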
You can try this (replace Table1 with your table name):
SELECT Sum([Diffs].[Difference]) AS FinalReading
FROM (
SELECT IDs.DeviceID, [Table1].Reading AS NextReading, Table1_1.Reading AS PrevReading, [Table1].Reading-Table1_1.Reading AS Difference
FROM (
(
SELECT [Table1].DeviceID,
[Table1].ID,
CLng(Nz(DMax("ID","Table1","[DeviceID] = " & [DeviceID] & " And [ID] < " & [ID]),0)) AS PrevID
FROM Table1
WHERE DeviceID = 2
) AS IDs
INNER JOIN Table1
ON IDs.ID=[Table1].ID)
INNER JOIN Table1 AS Table1_1
ON IDs.PrevID=Table1_1.ID
) AS Diffs;
The IDs table expression calculates the previous ID for the DeviceID in question. (I put the WHERE clause in this table expression, but you can move it to the outer one if you want to calculate FinalReading for ALL devices at once, then filter at the end. Less efficient but more flexible.) We join back to the original tables on the ID and PrevID values from the inner table expression, get their Reading values, and compute the difference in the Diffs table expression. The final outer query just sums the Difference values from the rows.

Select values in SQL that do not have other corresponding values except those that I search for

I have a table in my database:
Name | Element
1 2
1 3
4 2
4 3
4 5
I need to make a query that, for a number of arguments, will select the values of Name that have exactly these values of Element, and no others.
E.g.:
arguments are 2 and 3, the query should return only 1 and not 4 (because 4 also has 5). For arguments 2,3,5 it should return 4.
My query looks like this:
SELECT name FROM aggregations WHERE (element=2 and name in (select name from aggregations where element=3))
What do I have to add to this query to make it not return 4?
A simple way to do it:
SELECT name
FROM aggregations
WHERE element IN (2,3)
GROUP BY name
HAVING COUNT(element) = 2
If you want more elements, you'll need to change both the IN (2,3) part and the HAVING part (and if the same (name, element) pair can occur more than once, use COUNT(DISTINCT element) instead):
SELECT name
FROM aggregations
WHERE element IN (2,3,5)
GROUP BY name
HAVING COUNT(element) = 3
A more robust way is to also check that the name has no elements outside your set:
SELECT name
FROM aggregations
WHERE NOT EXISTS (
SELECT DISTINCT a.element
FROM aggregations a
WHERE a.element NOT IN (2,3,5)
AND a.name = aggregations.name
)
GROUP BY name
HAVING COUNT(element) = 3
It's not very efficient, though.
Create a temporary table, fill it with your values and query like this:
SELECT name
FROM (
SELECT DISTINCT name
FROM aggregations
) n
WHERE NOT EXISTS
(
SELECT 1
FROM (
SELECT element
FROM aggregations aii
WHERE aii.name = n.name
) ai
FULL OUTER JOIN
temptable tt
ON tt.element = ai.element
WHERE ai.element IS NULL OR tt.element IS NULL
)
This is more efficient than using COUNT(*), since it will stop checking a name as soon as it finds the first row that doesn't have a match (either in aggregations or in temptable)
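A minimal sketch of the temptable setup this answer assumes (PostgreSQL-style syntax; the table and column names are illustrative):
create temporary table temptable (element int);
insert into temptable (element) values (2), (3), (5);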
This isn't tested, but I would usually do this with a subquery in my WHERE clause for a small amount of data. Note that this is not efficient for large record counts.
SELECT ag1.Name FROM aggregations ag1
WHERE ag1.Element IN (2,3)
AND 0 = (select COUNT(ag2.Name)
FROM aggregations ag2
WHERE ag1.Name = ag2.Name
AND ag2.Element NOT IN (2,3)
)
GROUP BY ag1.name;
This says "Give me all of the names that have the elements I want, but have no records with elements I don't want"
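To try any of these answers, a minimal setup matching the sample data (assuming PostgreSQL-style syntax and a table named aggregations) might look like:
create table aggregations (name int, element int);
insert into aggregations (name, element) values
  (1, 2), (1, 3),
  (4, 2), (4, 3), (4, 5);
-- Expected: the set (2,3) returns name 1 only; the set (2,3,5) returns name 4 only.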