Aggregate values by range - sql

I have a table of users with profit and number of transactions columns:
...
I want to average the profit of users in three groups: those with a relatively large number of transactions, an average number of transactions, and a small number of transactions.
To get range series I use generate_series:
SELECT generate_series(
max(transactions_year)/3,
max(transactions_year),
max(transactions_year)/3
)
FROM portfolios_static
And I do get three categories:
I need a table like this one:
How do I get the average profit of the users in each category, and count the number of users in each category?

This can be simpler and faster. Assuming no entry has 0 deals:
SELECT y.max_deals AS deals
, avg(profit_perc_year) AS avg_profit
, count(*) AS users
FROM (
SELECT (generate_series (0,2) * x.max_t)/3 AS min_deals
,(generate_series (1,3) * x.max_t)/3 AS max_deals
FROM (SELECT max(transactions_year) AS max_t FROM portfolios_static) x
) y
JOIN portfolios_static p ON p.transactions_year > min_deals
AND p.transactions_year <= max_deals
GROUP BY 1
ORDER BY 1;
SQL Fiddle.
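To illustrate how the derived table y builds the buckets, here is a minimal sketch with the maximum hard-coded to 9 (an assumed value, not from the question):
-- Illustration only: with max_t = 9 this yields the bound pairs
-- (0,3), (3,6), (6,9), i.e. buckets (0..3], (3..6], (6..9]
SELECT (generate_series(0, 2) * 9) / 3 AS min_deals
     , (generate_series(1, 3) * 9) / 3 AS max_deals;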

This will do:
with s as
(SELECT max(transactions_year)/3 series FROM portfolios_static
UNION ALL
SELECT max(transactions_year)/3 * 2 series FROM portfolios_static
UNION ALL
SELECT max(transactions_year) series FROM portfolios_static
),
s1 as
(SELECT generate_series(
max(transactions_year)/3,
max(transactions_year),
max(transactions_year)/3
) AS series
FROM portfolios_static
),
srn as
(SELECT series,
row_number() over (order by series) rn
from s),
prepost as
(select coalesce(pre.series,0) as pre,
post.series as post
from srn post
left join srn pre on pre.rn = post.rn-1)
select pp.post number_of_deals_or_less,
avg(profit_perc_year) average_profit,
count(*) number_of_users
from portfolios_static p INNER JOIN prepost pp
ON p.transactions_year > pp.pre AND p.transactions_year <= pp.post
GROUP by pp.post
order by pp.post;
BTW, I had to ditch generate_series and use just a normal UNION ALL, as generate_series will not return the proper MAX() value when the max value is not divisible by 3. For example, if you replace the srn CTE with
srn as
(SELECT series,
row_number() over (order by series) rn
from s1), -- use generate_series
You will notice that in some cases the last value in the series will be less than max(transactions_year).
SQL Fiddle
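A minimal sketch of that edge case, assuming max(transactions_year) = 10:
-- 10/3 = 3 with integer division, so the series is 3, 6, 9
-- and the actual maximum of 10 is never produced.
SELECT generate_series(10/3, 10, 10/3) AS series;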

Related

BigQuery SQL: Sum of first N related items

I would like to know the sum of a value in the first n items in a related table. For example, I want to get the sum of a company's first 6 invoices (the invoices can be sorted by ID asc).
Current SQL:
SELECT invoices.company_id, SUM(invoices.amount)
FROM invoices
JOIN companies on invoices.company_id = companies.id
GROUP BY invoices.company_id
This seems simple but I can't wrap my head around it.
Consider also below approach
select company_id, (
select sum(amount)
from t.amounts amount
) as top_six_invoices_amount
from (
select invoices.company_id,
array_agg(invoices.amount order by invoices.invoice_id limit 6) amounts
from your_table invoices
group by invoices.company_id
) t
You can assign row numbers to the lines within a partition, ordered by invoice id, and filter on them, something like this:
with array_table as (
select 'a' field, * from unnest([3, 2, 1 ,4, 6, 3]) id
union all
select 'b' field, * from unnest([1, 2, 1, 7]) id
)
select field, sum(id) from (
select field, id, row_number() over (partition by a.field order by id desc) rownum
from array_table a
)
where rownum < 3
group by field
More examples of analytic functions here:
https://medium.com/@aliz_ai/analytic-functions-in-google-bigquery-part-1-basics-745d97958fe2
https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts
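Adapted to the invoice schema from the question (a sketch; the table invoices and the columns company_id, amount, invoice_id are taken from the question, and the limit of 6 from the stated requirement), the same row_number approach might look like:
SELECT company_id, SUM(amount) AS top_six_invoices_amount
FROM (
  -- number each company's invoices by id, then keep the first 6
  SELECT company_id, amount,
         ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY invoice_id) AS rownum
  FROM invoices
) t
WHERE rownum <= 6
GROUP BY company_id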

How to select the max of revenue for each user_id with row number in SQL?

In my dataset there are some user_ids, each of which has several rows (numbered from 1 to n), and each row has a specific revenue. I want to select the maximum revenue for each user_id, along with the row number that this revenue belongs to. I want a query whose result is the highlighted rows.
One method is a correlated subquery:
select t.*
from t
where t.revenue = (select max(t2.revenue) from t t2 where t2.user_id = t.user_id);
If there are ties for the maximum, this returns all the highest value rows.
select *,
case when revenue = max(revenue) over (partition by user_id) then 1 else 0 end as highlight
from T
select tt.*
from #tbl tt
join (select user_Id, max(revenue) as revenue
      from #tbl
      group by user_Id) tm
  on tt.user_Id = tm.user_Id and tt.revenue = tm.revenue
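Since the title asks for the row number specifically, here is a row_number()-based sketch (the column names user_id and revenue are taken from the question) that keeps exactly one row per user even when revenues tie:
select t.*
from (select t.*,
             -- rank each user's rows, highest revenue first
             row_number() over (partition by user_id order by revenue desc) as rn
      from t
     ) t
where rn = 1;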

Stratified random sample from column with standard SQL in BigQuery [duplicate]

How can I do stratified sampling on BigQuery?
For example, we want a 10% proportionate stratified sample using the category_id as the strata. We have up to 11000 category_ids in some of our tables.
With #standardSQL, let's define our table and some stats over it:
WITH table AS (
SELECT *, subreddit category
FROM `fh-bigquery.reddit_comments.2018_09` a
), table_stats AS (
SELECT *, SUM(c) OVER() total
FROM (
SELECT category, COUNT(*) c
FROM table
GROUP BY 1
HAVING c>1000000)
)
In this setup:
subreddit will be our category
we only want subreddits with more than 1000000 comments
So, if we want 1% of each category in our sample:
SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
SELECT id, category, c
FROM table a
JOIN table_stats b
USING(category)
WHERE RAND()< 1/100
)
GROUP BY 2
Or let's say we want ~80,000 samples - but chosen proportionally through all categories:
SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
SELECT id, category, c
FROM table a
JOIN table_stats b
USING(category)
WHERE RAND()< 80000/total
)
GROUP BY 2
Now, if you want to get the ~same number of samples from each group (let's say, 20,000):
SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
SELECT id, category, c
FROM table a
JOIN table_stats b
USING(category)
WHERE RAND()< 20000/c
)
GROUP BY 2
If you want exactly 20,000 elements from each category:
SELECT ARRAY_LENGTH(cat_samples) samples, category, ROUND(100*ARRAY_LENGTH(cat_samples)/c,2) percentage
FROM (
SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 20000) cat_samples, category, ANY_VALUE(c) c
FROM table a
JOIN table_stats b
USING(category)
GROUP BY category
)
If you want exactly 2% of each group:
SELECT COUNT(*) samples, sample.category, ROUND(100*COUNT(*)/ANY_VALUE(c),2) percentage
FROM (
SELECT ARRAY_AGG(a ORDER BY RAND()) cat_samples, category, ANY_VALUE(c) c
FROM table a
JOIN table_stats b
USING(category)
GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off<0.02*c
GROUP BY 2
If this last approach is what you want, you might notice it failing when you actually want to get data out. An early LIMIT similar to the largest group size will make sure we don't sort more data than needed:
SELECT sample.*
FROM (
SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 105000) cat_samples, category, ANY_VALUE(c) c
FROM table a
JOIN table_stats b
USING(category)
GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off<0.02*c
I think the simplest way to get a proportionate stratified sample is to order the data by the categories and do an "nth" sample of the data. For a 10% sample, you want every 10th row.
This looks like:
select t.*
from (select t.*,
row_number() over (order by category, rand()) as seqnum
from t
) t
where seqnum % 10 = 1;
Note: This does not guarantee that all categories will be in the final sample. A category with fewer than 10 rows may not appear.
If you want equal sized samples, then order within each category and just take a fixed number:
select t.*
from (select t.*,
row_number() over (partition by category order by rand()) as seqnum
from t
) t
where seqnum <= 100;
Note: This does not guarantee that 100 rows exist within each category. It takes all rows for smaller categories and a random sample of larger ones.
Both these methods are quite handy. They can work with multiple dimensions at the same time. The first has the particularly nice feature that it also works with numeric dimensions.
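For example, a proportionate sample stratified on two dimensions at once (a sketch; country is a hypothetical second stratum column) only needs both columns in the ordering:
select t.*
from (select t.*,
             -- order by both strata, then randomly within each combination
             row_number() over (order by category, country, rand()) as seqnum
      from t
     ) t
where seqnum % 10 = 1;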

Count the total (N) of duplicates in a column

I'm attempting to count the total number of duplicates in a column (not the individual duplicates).
SELECT COUNT(doi)
FROM outputs
WHERE journal_id = 1
GROUP BY journal_id
HAVING ( COUNT(doi) > 1 )
SQL TABLE
doi journal_id
123 1
123 2
123 1
124 1
The expected answer is 2
The number of entire row duplicates can be calculated by taking the total number of rows and subtracting the number of distinct rows:
select a.cnt_all - d.cnt_individual
from (select count(*) as cnt_all
from outputs
) a cross join
(select count(*) as cnt_individual
from (select distinct *
from outputs
) d
) d;
If you know your columns and your database supports multiple arguments to count(distinct), this can be radically simplified to:
select count(*) - count(distinct doi, journal_id)
from outputs;
Or, if your database doesn't support this:
select sum(cnt - 1)
from (select doi, journal_id, count(*) as cnt
from outputs
group by doi, journal_id
) o;
Just sum up the count of the individual duplicates by journal id.
SELECT
SUM(COUNT(doi)) AS total_duplicates
from
outputs
WHERE
journal_id = 1
GROUP BY
journal_id
HAVING
(COUNT(doi) > 1)

Could this query be optimized?

My goal is to select records by two criteria that depend on each other, and group them by another criterion.
I found a solution that selects records by a single criterion and groups them:
SELECT *
FROM "records"
NATURAL JOIN (
SELECT "group", min("priority1") AS "priority1"
FROM "records"
GROUP BY "group") AS "grouped"
I think I understand the concept of this search - select the properties you care about and match them against the original table - but when I use this concept with two priorities I get this monster:
SELECT *
FROM "records"
NATURAL JOIN (
SELECT *
FROM (
SELECT "group", "priority1", min("priority2") AS "priority2"
FROM "records"
GROUP BY "group", "priority1") AS "grouped2"
NATURAL JOIN (
SELECT "group", min("priority1") AS "priority1"
FROM "records"
NATURAL JOIN (
SELECT "group", "priority1", min("priority2") AS "priority2"
FROM "records"
GROUP BY "group", "priority1") AS "grouped2'"
GROUP BY "group") AS "GroupNested") AS "grouped1"
All I am asking is: couldn't it be written better (optimized and better-looking)?
JSFIDDLE
---- Update ----
The goal is to select a single id for each group: priority1 should be considered first, and then priority2.
Example:
When I have table records with id, group, priority1 and priority2
with data:
id , group , priority1 , priority2
56 , 1 , 1 , 2
34 , 1 , 1 , 3
78 , 1 , 3 , 1
the result should be 56, 1, 1, 2. For each group, search first for the min of priority1, then for the min of priority2.
I tried combining max and min together in one query, but it did not find anything (I do not have that query anymore).
EXISTS() to the rescue! (I did some renaming to avoid reserved words)
SELECT *
FROM zrecords r
WHERE NOT EXISTS (
SELECT *
FROM zrecords nx
WHERE nx.zgroup = r.zgroup
AND ( nx.priority1 < r.priority1
OR nx.priority1 = r.priority1 AND nx.priority2 < r.priority2
)
);
Or, to avoid the AND / OR logic, compare the two-tuples directly:
SELECT *
FROM zrecords r
WHERE NOT EXISTS (
SELECT *
FROM zrecords nx
WHERE nx.zgroup = r.zgroup
AND (nx.priority1, nx.priority2) < (r.priority1 , r.priority2)
);
maybe this is what you expect
with dat as (
SELECT "group" grp
, priority1, priority2, id
, row_number() over (partition by "group" order by priority1) +
row_number() over (partition by "group" order by priority2) as lp
FROM "records")
select dt.grp, priority1, priority2, dt.id
from dat dt
join (select min(lp) lpmin, grp from dat group by grp) dt1 on (dt1.lpmin = dt.lp and dt1.grp =dt.grp)
Simply use row_number() . . . once:
select r.*
from (select r.*,
row_number() over (partition by "group" order by priority1, priority2) as seqnum
from records r
) r
where seqnum = 1;
Note: I would advise you to avoid natural join. You can use using instead (if you don't want to explicitly include equality comparisons).
Queries with natural join are very hard to debug, because the join keys are not listed. Worse, "natural" joins do not use properly declared foreign key relationships. They depend simply on columns that have the same name.
In tables that I design, they would never be useful anyway, because almost all tables have createdAt and createdBy columns.
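For example, the first query from the question could make its join keys explicit with using, so the columns driving the join are visible (a sketch, equivalent to that natural join since the derived table exposes only "group" and "priority1"):
SELECT *
FROM "records"
JOIN (
  SELECT "group", min("priority1") AS "priority1"
  FROM "records"
  GROUP BY "group"
) AS "grouped" USING ("group", "priority1")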