Stratified random sample from column with standard SQL in BigQuery [duplicate]

How can I do stratified sampling on BigQuery?
For example, we want a 10% proportionate stratified sample using the category_id as the strata. We have up to 11000 category_ids in some of our tables.
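A plain per-row filter is already approximately proportionate per stratum in expectation; as a minimal baseline sketch (assuming a hypothetical table my_table):
SELECT *
FROM my_table
WHERE RAND() < 0.10  -- keeps ~10% of each category_id, in expectation
The approaches below refine this so the per-category sample size can be controlled.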

With #standardSQL, let's define our table and some stats over it:
WITH table AS (
  SELECT *, subreddit category
  FROM `fh-bigquery.reddit_comments.2018_09` a
), table_stats AS (
  SELECT *, SUM(c) OVER() total
  FROM (
    SELECT category, COUNT(*) c
    FROM table
    GROUP BY 1
    HAVING c > 1000000
  )
)
In this setup:
subreddit will be our category
we only want subreddits with more than 1000000 comments
So, if we want 1% of each category in our sample:
SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c), 2) percentage
FROM (
  SELECT id, category, c
  FROM table a
  JOIN table_stats b
  USING (category)
  WHERE RAND() < 1/100
)
GROUP BY 2
Or let's say we want ~80,000 samples, chosen proportionally across all categories:
SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c), 2) percentage
FROM (
  SELECT id, category, c
  FROM table a
  JOIN table_stats b
  USING (category)
  WHERE RAND() < 80000/total
)
GROUP BY 2
Now, if you want to get the ~same number of samples from each group (let's say, 20,000):
SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c), 2) percentage
FROM (
  SELECT id, category, c
  FROM table a
  JOIN table_stats b
  USING (category)
  WHERE RAND() < 20000/c
)
GROUP BY 2
If you want exactly 20,000 elements from each category:
SELECT ARRAY_LENGTH(cat_samples) samples, category, ROUND(100*ARRAY_LENGTH(cat_samples)/c, 2) percentage
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 20000) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING (category)
  GROUP BY category
)
If you want exactly 2% of each group:
SELECT COUNT(*) samples, sample.category, ROUND(100*COUNT(*)/ANY_VALUE(c), 2) percentage
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND()) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING (category)
  GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off < 0.02*c
GROUP BY 2
If this last approach is what you want, you might notice it failing when you actually try to pull the sampled rows out, because ARRAY_AGG without a LIMIT has to sort every group in full. An early LIMIT close to the largest group size makes sure we don't sort more data than needed:
SELECT sample.*
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 105000) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING (category)
  GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off < 0.02*c

I think the simplest way to get a proportionate stratified sample is to order the data by the categories and take an "nth" sample of the data. For a 10% sample, you want every 10th row.
This looks like:
select t.*
from (select t.*,
             row_number() over (order by category, rand()) as seqnum
      from t
     ) t
where seqnum % 10 = 1;
Note: This does not guarantee that all categories will be in the final sample. A category with fewer than 10 rows may not appear.
If you want equal sized samples, then order within each category and just take a fixed number:
select t.*
from (select t.*,
             row_number() over (partition by category order by rand()) as seqnum
      from t
     ) t
where seqnum <= 100;
Note: This does not guarantee that 100 rows exist within each category. It takes all rows for smaller categories and a random sample of larger ones.
Both these methods are quite handy. They can work with multiple dimensions at the same time, and the first has the particularly nice feature that it also works with numeric dimensions, as the sketches below show.
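For instance, a minimal sketch of both extensions (region as a hypothetical second stratum, score as a hypothetical numeric column):
-- equal-sized samples stratified by two dimensions at once
select t.*
from (select t.*,
             row_number() over (partition by category, region order by rand()) as seqnum
      from t
     ) t
where seqnum <= 100;
-- an "nth" sample along a numeric dimension: every 10th row by score
select t.*
from (select t.*,
             row_number() over (order by score, rand()) as seqnum
      from t
     ) t
where seqnum % 10 = 1;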

Related

Stratified random sampling with BigQuery?


SQL Min and Count

What I'm trying to achieve is to display the count of books in each category, but only for the categories whose count is greater than the minimum count.
Example: if the counts per category are
A = 1
B = 2
C = 3
D = 1
E = 1
I'm trying to show the categories whose count is greater than the minimum (here 1) using MIN.
The error I'm getting is:
ORA-00935: group function is nested too deeply
SELECT Count(Category),
Category
From Books
Having Count((Category) > MIN(Count(Category)
Group BY Category
Looking for something like this:
Select Count(Category),
       Category
From Books
Group BY Category
Having Count(Category) > (Select Min(cnt)
                          From (Select Count(Category) AS cnt
                                From Books
                                Group By Category))
This will select all categories having a count that is greater than the minimum count among all categories. With the example data, the minimum count is 1, so only B (2) and C (3) are returned.
Another way is to rank by count starting with the lowest (ties are assigned the same rank) and to only select rows with rank greater than 1:
select * from (
select count(*) cnt, category,
rank() over (order by count(*)) rn
from books
group by category
) t where rn > 1
This should do it (note that ROWNUM in Oracle is assigned before ORDER BY, so the ordering has to happen in an inner query for the filter to be reliable):
SELECT Category, CategoryCount FROM
  (SELECT rownum AS r, Category, CategoryCount
   FROM (SELECT Category, Count(*) AS CategoryCount
         FROM Books
         GROUP BY Category
         ORDER BY CategoryCount ASC))
WHERE r > 1;
Giorgos's answer is the correct one. It can be rearranged (and made slightly more efficient) using subquery factoring:
with ctg (category, categ_count) as (
select category, count(*)
from books
group by category
)
select category, categ_count
from ctg
where categ_count > (select min(categ_count) from ctg);
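As a quick sanity check, here is a hypothetical setup reproducing the sample counts from the question; both forms of the query return B (2) and C (3):
create table books (category varchar2(1));
insert into books values ('A');
insert into books values ('B');
insert into books values ('B');
insert into books values ('C');
insert into books values ('C');
insert into books values ('C');
insert into books values ('D');
insert into books values ('E');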

How do I get the frequency of repeated fields that contain some value?

Supposed I have a data set that looks like this
{"id":15,"classification":"goth","categories":["blackLipstick","hotTopic"]}
{"id":14,"classification":"goth","categories":["drinking","girls","hotTopic"]}
{"id":13,"classification":"jock","categories":["basketball","chicharones","fooball","girls","pop","pregnant","sports","starTrek","tortilla","tostada"]}
{"id":12,"classification":"geek","categories":["academics","cacahuates","computers","glasses","papas","physics","programming","ps4","science"]}
{"id":11,"classification":"geek","categories":["cacahuates","fajitas","math","pregnant","raves","xbox"]}
{"id":10,"classification":"goth","categories":["cutting"]}
{"id":9,"classification":"geek","categories":["cafe","chalupa","chimichangas","manson","physics","pollo","tostada"]}
{"id":8,"classification":"jock","categories":["basketball","chalupa","enchurrito","piercings","running","sports"]}
{"id":7,"classification":"geek","categories":["aguacate","blackLipstick","computers","fajitas","fooball","glasses","lifting","outdoors","physics","pollo","pregnant","ps4"]}
{"id":6,"classification":"none","categories":["brocode","girls","raves","tacos"]}
{"id":5,"classification":"goth","categories":["blackLipstick","blackShirts","drugs","mole","piercings","tattoos","tortilla"]}
{"id":4,"classification":"jock","categories":["girls","tattoos"]}
{"id":3,"classification":"goth","categories":["girls"]}
{"id":2,"classification":"none","categories":["cutting","enchurrito","fooball","pastel","pregnant","tattoos","vampires"]}
{"id":1,"classification":"goth","categories":["cacahuates","cutting","drugs","empanadas","frijoles","manson","nachos","outdoors","piercings","tattoos"]}
{"id":0,"classification":"geek","categories":["pollo","pop","programming","science"]}
How do I write a query where I can say
"If someone has category 'math' what other categories do they often have?"
For this data set I can write something like this to tell me what goths, geeks and jocks like most.
SELECT classification, categories, COUNT(categories) C
FROM [xx.stereotypes]
GROUP BY classification, categories
ORDER BY C DESC
LIMIT 1000
But in my real dataset I don't have the classification field. I want a query that could help me create classifications like "goth", "jock" or "geek".
For example, how do I select the counts of all categories for records whose categories contain "math"? The following only selects "math" itself:
SELECT categories, COUNT(categories) C
FROM [xx.stereotypes]
WHERE categories CONTAINS "math"
GROUP BY categories
ORDER BY C DESC
LIMIT 1000
How do I say select all the counts of the categories where categories
contains "math"
SELECT categories, COUNT(1) AS weight
FROM [xx.stereotypes]
OMIT RECORD IF NOT SOME(categories = 'math')
GROUP BY categories
ORDER BY weight DESC
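OMIT RECORD IF is legacy BigQuery SQL. A rough standard SQL equivalent (my assumption, not part of the original answer) unnests the repeated field and filters with IN UNNEST:
SELECT category, COUNT(1) AS weight
FROM `xx.stereotypes`, UNNEST(categories) AS category
WHERE 'math' IN UNNEST(categories)
GROUP BY category
ORDER BY weight DESC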
How do I write a query to where I can say "If someone has category
'math' what other categories do they often have?"
SELECT category, related_category, weight
FROM (
  SELECT category, related_category, COUNT(1) AS weight
  FROM (
    SELECT a.id AS id, a.categories AS category, b.categories AS related_category
    FROM (FLATTEN([xx.stereotypes], categories)) AS a
    JOIN (FLATTEN([xx.stereotypes], categories)) AS b
    ON a.id = b.id
    HAVING category != related_category
  )
  GROUP BY category, related_category
)
WHERE category = 'math'
ORDER BY category, weight DESC, related_category
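In standard SQL the same co-occurrence counts can be sketched without FLATTEN, by cross-joining two UNNESTs of the same row (again an assumption, not the original legacy query):
SELECT category, related_category, COUNT(1) AS weight
FROM `xx.stereotypes` t,
     UNNEST(t.categories) AS category,
     UNNEST(t.categories) AS related_category
WHERE category != related_category
  AND category = 'math'
GROUP BY category, related_category
ORDER BY weight DESC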
I want a query that could help me create classifications
Below is a simplified way to assign a classification to each id:
SELECT id, category AS classification
FROM (
  SELECT
    x.id AS id, y.category AS category, SUM(weight) AS rate,
    ROW_NUMBER() OVER(PARTITION BY id ORDER BY rate DESC) AS pos
  FROM (FLATTEN([xx.stereotypes], categories)) AS x
  JOIN (
    SELECT category, related_category, COUNT(1) AS weight
    FROM (
      SELECT a.id AS id, a.categories AS category, b.categories AS related_category
      FROM (FLATTEN([xx.stereotypes], categories)) AS a
      JOIN (FLATTEN([xx.stereotypes], categories)) AS b
      ON a.id = b.id
    )
    GROUP BY category, related_category
  ) AS y
  ON x.categories = y.related_category
  GROUP BY 1, 2
)
WHERE pos = 1
ORDER BY id DESC

Aggregate values by range

I have a table of users with profit and number-of-transactions columns:
...
I want to average the profit of users in three groups: those with a relatively large number of transactions, an average number of transactions, and a small number of transactions.
To get range series I use generate_series:
SELECT generate_series(
max(transactions_year)/3,
max(transactions_year),
max(transactions_year)/3
)
FROM portfolios_static
And I do get three categories:
I need a table like this one:
How do I get the average profit of the users in each category, and count the number of users that belong to each category?
This can be simpler and faster. Assuming no entry has 0 deals:
SELECT y.max_deals AS deals
, avg(profit_perc_year) AS avg_profit
, count(*) AS users
FROM (
SELECT (generate_series (0,2) * x.max_t)/3 AS min_deals
,(generate_series (1,3) * x.max_t)/3 AS max_deals
FROM (SELECT max(transactions_year) AS max_t FROM portfolios_static) x
) y
JOIN portfolios_static p ON p.transactions_year > min_deals
AND p.transactions_year <= max_deals
GROUP BY 1
ORDER BY 1;
This will do:
with s as
(SELECT max(transactions_year)/3 series FROM portfolios_static
UNION ALL
SELECT max(transactions_year)/3 * 2 series FROM portfolios_static
UNION ALL
SELECT max(transactions_year) series FROM portfolios_static
),
s1 as
(SELECT generate_series(
max(transactions_year)/3,
max(transactions_year),
max(transactions_year)/3
) AS series
FROM portfolios_static
),
srn as
(SELECT series,
row_number() over (order by series) rn
from s),
prepost as
(select coalesce(pre.series,0) as pre,
post.series as post
from srn post
left join srn pre on pre.rn = post.rn-1)
select pp.post number_of_deals_or_less,
avg(profit_perc_year) average_profit,
count(*) number_of_users
from portfolios_static p INNER JOIN prepost pp
ON p.transactions_year > pp.pre AND p.transactions_year <= pp.post
GROUP by pp.post
order by pp.post;
BTW, I had to ditch generate_series and use a plain UNION ALL, as generate_series will not return the proper MAX() value when the max value is not divisible by 3. For example, if you replace the srn CTE with
srn as
(SELECT series,
row_number() over (order by series) rn
from s1), -- use generate_series
You will notice that in some cases the last value in the series will be less than max(transactions_year).
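To see why, assume max(transactions_year) = 10: integer division makes the step 10/3 = 3, so the series stops short of the true maximum:
SELECT generate_series(10/3, 10, 10/3);  -- returns 3, 6, 9; the boundary 10 is never produced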

In a map table, I need to select a subset on rows for each individual

I have a table with 3 columns a, b and weight representing a person's mappings. The table has approximately 1T rows.
The simple problem can be explained like:
I need the top 20 relations for a given a, something like:
SELECT * FROM my_table WHERE a = ? ORDER BY weight LIMIT 20;
The real problem is that I need to do this for 1000 values of a, and running this query 1000 times is slow. My question is: how can I do it in a single SQL query?
Thanks for your help.
Using the ROW_NUMBER() function. I have assumed the top 20 is based on the smallest weight (i.e. ORDER BY weight) here.
To get every a's top 20 records:
select a, b, weight
from (
  select a, b, weight, row_number() over (partition by a order by weight) rn
  from my_table
) AB
where rn < 21
order by rn
To get a single a's top 20 records (TOP is T-SQL syntax; use LIMIT 20 on PostgreSQL/MySQL):
select top(20) a, b, weight
from my_table
where a = yourValue
order by weight
Try out this solution; it keeps a row only when fewer than 20 rows with the same a have a smaller weight:
SELECT *
FROM my_table t1
WHERE 20 > ( SELECT COUNT(*) FROM my_table t2 WHERE t2.a = t1.a AND t2.weight < t1.weight )
ORDER BY a, weight;
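If the 1000 values of a are known up front, a variant of the ROW_NUMBER() approach that first restricts to those values avoids ranking the whole table; a sketch, assuming the wanted keys are staged in a hypothetical table wanted_a:
SELECT a, b, weight
FROM (
  SELECT t.a, t.b, t.weight,
         ROW_NUMBER() OVER (PARTITION BY t.a ORDER BY t.weight) AS rn
  FROM my_table t
  JOIN wanted_a w ON w.a = t.a
) x
WHERE rn <= 20;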