BigQuery equivalent of SQL query

I would like to run the following query in BigQuery, ideally as efficiently as possible. I have rows corresponding to tests (taken daily) by millions of users, and I want to determine, for users who have been active for over a year, how much each user has improved.
"Improvement" in this case is the average of the last N scores minus the average of the first N.
In this example, N is 30. (I've also added the where cnt >= 100 condition because I don't want to consider users who took a test a long time ago and just came back to try once more.)
select user_id,
avg(score) filter (where seqnum_asc <= 30) as first_n_avg,
avg(score) filter (where seqnum_desc <= 30) as last_n_avg
from (select *,
row_number() over (partition by user_id order by created_at) as seqnum_asc,
row_number() over (partition by user_id order by created_at desc) as seqnum_desc,
count(*) over (partition by user_id) as cnt
from tests
) t
where cnt >= 100
group by user_id
having max(created_at) >= min(created_at) + interval '1 year';

Just use conditional aggregation and fix the date functions:
select user_id,
avg(case when seqnum_asc <= 30 then score end) as first_n_avg,
avg(case when seqnum_desc <= 30 then score end) as last_n_avg
from (select *,
row_number() over (partition by user_id order by created_at) as seqnum_asc,
row_number() over (partition by user_id order by created_at desc) as seqnum_desc,
count(*) over (partition by user_id) as cnt
from tests
) t
where cnt >= 100
group by user_id
having max(created_at) >= timestamp_add(min(created_at), interval 1 year);
The function in the having clause could be timestamp_add(), datetime_add(), or date_add(), depending on the type of created_at. Note that BigQuery's timestamp_add() only accepts units up to DAY, so if created_at is a TIMESTAMP you can instead compare at date precision: having date(max(created_at)) >= date_add(date(min(created_at)), interval 1 year).
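If you also want the improvement itself (the last-N average minus the first-N average) as a column, here is a minimal sketch building on the query above, assuming created_at is a TIMESTAMP:
select user_id,
avg(case when seqnum_asc <= 30 then score end) as first_n_avg,
avg(case when seqnum_desc <= 30 then score end) as last_n_avg,
avg(case when seqnum_desc <= 30 then score end) -
avg(case when seqnum_asc <= 30 then score end) as improvement
from (select *,
row_number() over (partition by user_id order by created_at) as seqnum_asc,
row_number() over (partition by user_id order by created_at desc) as seqnum_desc,
count(*) over (partition by user_id) as cnt
from tests
) t
where cnt >= 100
group by user_id
having date(max(created_at)) >= date_add(date(min(created_at)), interval 1 year);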


Trying to add a column that assigns each User ID to a Quadrant

I'm trying to add a column that assigns each user ID to a Quadrant based upon their Progress relative to all other users for that Day (25th, 50th, 75th, 90th, and 99th percentiles).
Here is the current query, though I'm not sure how helpful it will be, as it contains a few joins:
SELECT Day, user_id, progress, Revenue, users
FROM Unlock_Progress
INNER JOIN Daily_Revenue USING(user_id, DAY)
LEFT JOIN DAU USING (Day)
ORDER BY Day, user_id
You can use a window function. You don't specify the exact logic you want to use, but the following should do what you want or be easily modifiable to your specific needs:
SELECT udd.*,
(CASE WHEN seqnum < 0.25 * cnt THEN 0
WHEN seqnum < 0.50 * cnt THEN 25
WHEN seqnum < 0.75 * cnt THEN 50
WHEN seqnum < 0.99 * cnt THEN 75
ELSE 99
END) as percentile_group
FROM (SELECT Day, user_id, progress, Revenue, users,
RANK() OVER (PARTITION BY Day ORDER BY Revenue) as seqnum,
COUNT(*) OVER (PARTITION BY Day) as cnt
FROM Unlock_Progress u JOIN
Daily_Revenue dr
USING (user_id, DAY) LEFT JOIN
DAU
USING (Day)
) udd
ORDER BY Day, user_id
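A sketch of an alternative, assuming your database supports percent_rank(): it yields each row's rank as a fraction of the partition, so the cutoffs can be written directly. This version orders by progress (the question asks for quadrants by Progress, while the query above uses Revenue) and includes the 90th-percentile cutoff; adjust either choice as needed:
SELECT udd.*,
(CASE WHEN pr < 0.25 THEN 0
WHEN pr < 0.50 THEN 25
WHEN pr < 0.75 THEN 50
WHEN pr < 0.90 THEN 75
WHEN pr < 0.99 THEN 90
ELSE 99
END) AS percentile_group
FROM (SELECT Day, user_id, progress, Revenue, users,
PERCENT_RANK() OVER (PARTITION BY Day ORDER BY progress) AS pr
FROM Unlock_Progress
INNER JOIN Daily_Revenue USING (user_id, Day)
LEFT JOIN DAU USING (Day)
) udd
ORDER BY Day, user_id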

How to calculate the median in Postgres?

I have created a basic database (picture attached). I am trying to find the following:
"Median total amount spent per user in each calendar month"
I tried the following, but getting errors:
SELECT
user_id,
AVG(total_per_user)
FROM (SELECT user_id,
ROW_NUMBER() over (ORDER BY total_per_user DESC) AS desc_total,
ROW_NUMBER() over (ORDER BY total_per_user ASC) AS asc_total
FROM (SELECT EXTRACT(MONTH FROM created_at) AS calendar_month,
user_id,
SUM(amount) AS total_per_user
FROM transactions
GROUP BY calendar_month, user_id) AS total_amount
ORDER BY user_id) AS a
WHERE asc_total IN (desc_total, desc_total+1, desc_total-1)
GROUP BY user_id
;
In Postgres, you could just use aggregate function percentile_cont():
select
user_id,
percentile_cont(0.5) within group(order by total_per_user) median_total_per_user
from (
select user_id, sum(amount) total_per_user
from transactions
group by date_trunc('month', created_at), user_id
) t
group by user_id
Note that date_trunc() is probably closer to what you want than extract(month from ...), unless you do want to sum amounts of the same month from different years together, which is not how I understood your requirement.
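To make that difference concrete (hypothetical dates): extract(month from ...) puts March of every year into the same bucket, while date_trunc('month', ...) keeps the years apart:
select extract(month from date '2020-03-15'); -- 3
select extract(month from date '2021-03-15'); -- 3, same bucket
select date_trunc('month', date '2020-03-15'); -- 2020-03-01 00:00:00
select date_trunc('month', date '2021-03-15'); -- 2021-03-01 00:00:00, a distinct bucket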
Just use percentile_cont(). I don't fully understand the question. If you want the median of the monthly spending, then:
SELECT user_id,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_per_user) AS median_total_per_user
FROM (SELECT DATE_TRUNC('month', created_at) AS calendar_month,
user_id, SUM(amount) AS total_per_user
FROM transactions t
GROUP BY calendar_month, user_id
) um
GROUP BY user_id;
There is a built-in function for median. No need for fancier processing.
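If the requirement is instead read as the median across users within each calendar month (the wording is ambiguous, so this is an assumption), the same aggregate works with the grouping swapped:
SELECT calendar_month,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_per_user) AS median_across_users
FROM (SELECT DATE_TRUNC('month', created_at) AS calendar_month,
user_id, SUM(amount) AS total_per_user
FROM transactions
GROUP BY calendar_month, user_id
) um
GROUP BY calendar_month;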

SQL Random 10 Percent based on 2 other fields

I have reviewed the other posts here with no luck finding a solution for selecting a random 10 percent of records based on two other fields. For example, my table contains ID, Date, and User. I want to flag 10 percent of the records for each user for each day.
You can use row_number() and count():
select t.*,
(case when seqnum * 10 <= cnt then 'Y' else 'N' end) as flag
from (select t.*,
row_number() over (partition by user, date order by newid()) as seqnum,
count(*) over (partition by user, date) as cnt
from t
) t;
You don't actually need the subquery; it just makes the logic a bit easier to follow:
select t.*,
(case when row_number() over (partition by user, date order by newid()) * 10 <= count(*) over (partition by user, date)
then 'Y' else 'N'
end) as flag
from t;
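Note that newid() as a random sort key is SQL Server syntax. The same pattern ports to other databases by swapping the random function; here is a sketch for Postgres, using random() and assuming the columns are renamed user_id and dt (user is a reserved word there):
select t.*,
(case when row_number() over (partition by user_id, dt order by random()) * 10 <=
count(*) over (partition by user_id, dt)
then 'Y' else 'N'
end) as flag
from t;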

Count consecutive records with a timestamp interval requirement

Referring to this post (link), I used the answer provided by @Gordon Linoff:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
from (select t.*,
row_number() over (partition by taxi order by time) as seqnum,
row_number() over (partition by taxi, client order by time) as seqnum_c
from t
) t
group by t.taxi, t.client, (seqnum - seqnum_c)
having count(*) >= 2
) t
group by taxi;
and it gave me exactly the answer I wanted:
Tom 3 (AA counts as 1, AAA counts as 1, and BB counts as 1, for a total of 3)
Bob 1
But now I would like to add one more condition: the time between two consecutive clients for the same taxi should not be longer than 2 hours.
I know that I should probably use row_number() again and calculate the time difference with datediff, but I have no idea where to add it or how to do it.
Any suggestions?
This requires a bit more logic. In this case, I would use lag() to calculate the groups:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
from (select t.*,
sum(case when prev_client = client and
prev_time > time - interval '2 hour'
then 0
else 1
end) over (partition by taxi order by time) as grp
from (select t.*,
lag(client) over (partition by taxi order by time) as prev_client,
lag(time) over (partition by taxi order by time) as prev_time
from t
) t
) t
group by t.taxi, t.client, grp
having count(*) >= 2
) t
group by taxi;
Note: You don't specify the database, so this uses ISO/ANSI standard syntax for date/time comparisons. You can adjust this for your actual database.
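To illustrate how the running sum forms groups, take one taxi's rides in time order (hypothetical data). The case expression yields 1 whenever the client changes or more than 2 hours have passed, so every run of closely spaced rides for the same client shares one grp value:
time   client  prev_client  prev_time  flag  grp
09:00  A       (null)       (null)     1     1
09:30  A       A            09:00      0     1
13:00  A       A            09:30      1     2   (same client, but gap > 2 hours)
13:15  B       A            13:00      1     3
13:40  B       B            13:15      0     3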

Running count distinct

I am trying to see how the cumulative number of subscribers changed over time, based on unique email addresses and the dates they were created. Below is an example of a table I am working with.
I am trying to turn it into the table below. Email 1#gmail.com was created twice and I would like to count it once. I cannot figure out how to generate the Running count distinct column.
Thanks for the help.
I would usually do this using row_number():
select date, count(*) as day_total,
sum(count(*)) over (order by date) as running_total,
sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by date) as running_distinct
from (select t.*,
row_number() over (partition by email order by date) as seqnum
from t
) t
group by date
order by date;
This is similar to the version using lag(). However, I get nervous using lag() when the same email can appear multiple times on the same date.
Getting the total count and the cumulative count is straightforward. To get the cumulative distinct count, use lag() to check whether the email already appeared on an earlier row, and set the flag to 0 so that duplicates are ignored by the running sum.
select distinct dt
,count(*) over(partition by dt) as day_total
,count(*) over(order by dt) as cumsum
,sum(flag) over(order by dt) as cumdist
from (select t.*
,case when lag(dt) over(partition by email order by dt) is not null then 0 else 1 end as flag
from tbl t
) t
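For instance, with hypothetical rows in the style of the question, the inner flag and the final output look like this:
dt          email        flag
2021-01-01  1#gmail.com  1
2021-01-01  1#gmail.com  0   (repeat of an email already seen, so flagged 0)
2021-01-01  2#gmail.com  1
2021-01-02  1#gmail.com  0
2021-01-02  3#gmail.com  1

dt          day_total  cumsum  cumdist
2021-01-01  3          3       2
2021-01-02  2          5       3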
Here is a solution that uses neither sum() over nor lag(), and it produces the correct results.
It may therefore be simpler to read and to maintain.
select
t1.date_created,
(select count(*) from my_table where date_created = t1.date_created) emails_created,
(select count(*) from my_table where date_created <= t1.date_created) cumulative_sum,
(select count( distinct email) from my_table where date_created <= t1.date_created) running_count_distinct
from
(select distinct date_created from my_table) t1
order by 1