Trying to add a column that assigns to each User ID a Quadrant - sql

Here is my current
I'm trying to add a column that assigns each user ID to a Quadrant based on their Progress relative to all other users for that Day (25th, 50th, 75th, 90th, and 99th percentiles).
For example, the I would like to
Here is the current query, though I'm not sure how helpful it will be, as it contains a few joins:
SELECT Day, user_id, progress, Revenue, users
FROM Unlock_Progress
INNER JOIN Daily_Revenue USING(user_id, DAY)
LEFT JOIN DAU USING (Day)
ORDER BY Day, user_id

You can use a window function. You don't specify the exact logic you want to use, but the following should do what you want or be easily modifiable to your specific needs:
SELECT udd.*,
       (CASE WHEN seqnum < 0.25 * cnt THEN 0
             WHEN seqnum < 0.50 * cnt THEN 25
             WHEN seqnum < 0.75 * cnt THEN 50
             WHEN seqnum < 0.99 * cnt THEN 75
             ELSE 99
        END) as percentile_group
FROM (SELECT Day, user_id, progress, Revenue, users,
             RANK() OVER (PARTITION BY Day ORDER BY progress) as seqnum,
             COUNT(*) OVER (PARTITION BY Day) as cnt
      FROM Unlock_Progress u JOIN
           Daily_Revenue dr
           USING (user_id, DAY) LEFT JOIN
           DAU
           USING (Day)
     ) udd
ORDER BY Day, user_id
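To make the bucketing concrete, here is a minimal sketch that runs the same RANK() / COUNT(*) OVER pattern against an in-memory SQLite database from Python. The table name `daily`, the sample rows, and the extra 90th-percentile tier are assumptions made up for the illustration (the question lists a 90th percentile), and the ranking is by progress as the question describes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the joined result: one row per user per day.
conn.execute("CREATE TABLE daily (day TEXT, user_id INTEGER, progress INTEGER)")
conn.executemany(
    "INSERT INTO daily VALUES ('2020-01-01', ?, ?)",
    [(u, u * 10) for u in range(1, 11)],  # ten users, progress 10..100
)

BUCKET_SQL = """
SELECT user_id,
       CASE WHEN seqnum < 0.25 * cnt THEN 0
            WHEN seqnum < 0.50 * cnt THEN 25
            WHEN seqnum < 0.75 * cnt THEN 50
            WHEN seqnum < 0.90 * cnt THEN 75   -- assumed extra tier
            WHEN seqnum < 0.99 * cnt THEN 90
            ELSE 99
       END AS percentile_group
FROM (SELECT day, user_id, progress,
             RANK() OVER (PARTITION BY day ORDER BY progress) AS seqnum,
             COUNT(*) OVER (PARTITION BY day) AS cnt
      FROM daily) AS ranked
ORDER BY user_id
"""

# Map each user_id to the lower bound of its percentile band.
groups = dict(conn.execute(BUCKET_SQL).fetchall())
print(groups)
```

With ten users on one day, the two lowest-progress users land in group 0 and the single highest one in group 99. Window functions need SQLite 3.25 or later.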

Related

Rolling NEW active users in SQL (BigQuery)

I have already computed rolling active users (on a weekly basis) as follows:
SELECT
DATE_TRUNC(EXTRACT(DATE FROM tracks.timestamp), WEEK),
COUNT(DISTINCT tracks.user_id)
FROM `company.dataset.tracks` AS tracks
WHERE tracks.timestamp > TIMESTAMP('2020-01-01')
AND tracks.event = 'activation_event'
GROUP BY 1
ORDER BY 1
I am interested in knowing the number of distinct users who performed the activation event for the 1st time on a rolling weekly basis.
If I follow you correctly, you can use two levels of aggregation:
select
    date_trunc(date(activation_timestamp), week) activation_week,
    count(*) cnt_active_users
from (
    select min(timestamp) activation_timestamp
    from `company.dataset.tracks` t
    where event = 'activation_event'
    group by user_id
) t
where activation_timestamp > timestamp('2020-01-01')
group by activation_week
order by activation_week
The subquery computes the date of the first activation event per user, then the outer query counts the number of such events per week.
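The two levels of aggregation can be tried out locally with a rough sketch like the one below. It uses Python's sqlite3 rather than BigQuery, so the table `tracks`, the column `ts`, and the sample events are assumptions, and the week is derived with SQLite's `'weekday 6', '-6 days'` date modifiers (a Sunday-based week start) in place of DATE_TRUNC(..., WEEK):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (user_id TEXT, event TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, 'activation_event', ?)",
    [
        ("u1", "2020-01-06 09:00:00"),
        ("u1", "2020-01-14 10:00:00"),  # repeat event, must not count as new
        ("u2", "2020-01-07 08:30:00"),
        ("u3", "2020-01-15 12:00:00"),
    ],
)

NEW_USERS_SQL = """
SELECT date(first_ts, 'weekday 6', '-6 days') AS activation_week,
       COUNT(*) AS new_users
FROM (SELECT user_id, MIN(ts) AS first_ts      -- level 1: first event per user
      FROM tracks
      WHERE event = 'activation_event'
      GROUP BY user_id) AS firsts
WHERE first_ts > '2020-01-01'
GROUP BY activation_week                        -- level 2: count per week
ORDER BY activation_week
"""

new_users = conn.execute(NEW_USERS_SQL).fetchall()
print(new_users)
```

Here u1 and u2 first activate in the week starting Sunday 2020-01-05 and u3 in the following week, so u1's second event does not inflate the count.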
If you want both the actives and starts in the same query:
SELECT week, COUNT(*) as users_in_week,
       COUNTIF(seqnum = 1) as new_users
FROM (SELECT DATE_TRUNC(EXTRACT(DATE FROM t.timestamp), WEEK) as week,
             t.user_id, COUNT(*) as cnt,
             ROW_NUMBER() OVER (PARTITION BY t.user_id ORDER BY MIN(t.timestamp)) as seqnum
      FROM `company.dataset.tracks` t
      WHERE t.event = 'activation_event'
      GROUP BY 1, 2
     ) t
-- Filter on the derived week: the outer query only sees the subquery's columns,
-- and filtering here keeps seqnum computed over each user's full history.
WHERE week >= DATE '2020-01-01'
GROUP BY 1
ORDER BY 1;
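For anyone who wants to check the "actives and new users in one pass" idea without a BigQuery project, here is a hedged sketch in Python with sqlite3. The table `tracks`, the column `ts`, and the rows are invented for the example; SQLite has no COUNTIF, so `SUM(seqnum = 1)` stands in for it, and the per-user weeks are built in a separate subquery before numbering them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (user_id TEXT, event TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, 'activation_event', ?)",
    [
        ("u2", "2019-12-10 08:00:00"),  # u2 first activated before 2020
        ("u1", "2020-01-06 09:00:00"),
        ("u2", "2020-01-07 08:30:00"),
        ("u1", "2020-01-14 10:00:00"),
        ("u3", "2020-01-15 12:00:00"),
        ("u1", "2020-01-20 11:00:00"),
    ],
)

WEEKLY_SQL = """
SELECT week,
       COUNT(*) AS users_in_week,
       SUM(seqnum = 1) AS new_users          -- SQLite stand-in for COUNTIF
FROM (SELECT week, user_id,
             ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY week) AS seqnum
      FROM (SELECT DISTINCT date(ts, 'weekday 6', '-6 days') AS week, user_id
            FROM tracks
            WHERE event = 'activation_event') AS user_weeks) AS ranked
WHERE week >= '2020-01-01'   -- filter after numbering, so history still counts
GROUP BY week
ORDER BY week
"""

weekly = conn.execute(WEEKLY_SQL).fetchall()
print(weekly)
```

Because the date filter is applied after the weeks are numbered, u2 is active in the first week of the output but is not counted as new there.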

How to pull a list of all visitor_ids that generated more than $500 combined in their first two sessions in the month of January 2020?

Tables:
Sessions: session_ts, visitor_id, vertical, session_id
Transactions: session_ts, session_id, rev_bucket, revenue
Currently have the following query (using SQLite):
SELECT
s.visitor_id,
sub.session_id,
month,
year,
total_rev,
CASE
WHEN (row_num IN (1,2) >= total_rev >= 500) THEN 'Yes'
ELSE 'No' END AS High_Value_Transactions,
sub.row_num
FROM
sessions s
JOIN
(
SELECT
s.visitor_id,
t.session_id,
strftime('%m',t.session_ts) as month,
strftime('%Y',t.session_ts) as year,
SUM(t.revenue) as total_rev,
row_number() OVER(PARTITION BY s.visitor_id ORDER BY s.session_ts) as row_num
FROM
Transactions t
JOIN
sessions s
ON
s.session_id = t.session_id
WHERE strftime('%m',t.session_ts) = '01'
AND strftime('%Y',t.session_ts) = '2020'
GROUP BY 1,2
) sub
ON
s.session_id = sub.session_id
WHERE sub.row_num IN (1,2)
ORDER BY 1
I'm having trouble identifying the first two sessions that combine for $500.
Open to any feedback and simplifying of query. Thanks!
You can use window functions and aggregation:
select visitor_id, sum(t.revenue) total_revenue
from (
select
s.visitor_id,
t.revenue,
row_number() over(partition by s.visitor_id order by t.session_ts) rn
from transactions t
inner join sessions s on s.session_id = t.session_id
where t.session_ts >= '2020-01-01' and t.session_ts < '2020-02-01'
) t
where rn <= 2
group by visitor_id
having sum(t.revenue) >= 500
The subquery joins the two tables, filters on the target month (note that half-open interval predicates are usually more efficient than applying date functions to the column, since they let the database use an index on session_ts), and numbers each row per visitor in order of session_ts.
Then, the outer query filters on the first two visits per visitor, aggregates by visitor, computes the corresponding revenue, and filters it with a having clause.
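One thing worth checking against real data is whether a session can contain more than one transaction. If it can, a possible variation is to sum revenue per session first and only then number the sessions. The sketch below does that in Python with sqlite3; the sample rows are made up, it orders by the session_ts from the sessions table, and sessions without any transaction are ignored, all of which are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (session_id TEXT, visitor_id TEXT, session_ts TEXT)")
conn.execute("CREATE TABLE transactions (session_id TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO sessions VALUES (?, ?, ?)", [
    ("s1", "v1", "2020-01-02"), ("s2", "v1", "2020-01-05"), ("s3", "v1", "2020-01-09"),
    ("s4", "v2", "2020-01-03"), ("s5", "v2", "2020-01-04"),
])
conn.executemany("INSERT INTO transactions VALUES (?, ?)", [
    ("s1", 200), ("s1", 150),   # two transactions in one session
    ("s2", 200), ("s3", 1000),
    ("s4", 100), ("s5", 300),
])

HIGH_VALUE_SQL = """
WITH per_session AS (        -- one row per session with its total revenue
    SELECT s.visitor_id, s.session_id, s.session_ts, SUM(t.revenue) AS session_rev
    FROM sessions s
    JOIN transactions t ON t.session_id = s.session_id
    WHERE s.session_ts >= '2020-01-01' AND s.session_ts < '2020-02-01'
    GROUP BY s.visitor_id, s.session_id, s.session_ts
),
ranked AS (                  -- number each visitor's sessions chronologically
    SELECT *, ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY session_ts) AS rn
    FROM per_session
)
SELECT visitor_id, SUM(session_rev) AS total_revenue
FROM ranked
WHERE rn <= 2
GROUP BY visitor_id
HAVING SUM(session_rev) >= 500
ORDER BY visitor_id
"""

high_value = conn.execute(HIGH_VALUE_SQL).fetchall()
print(high_value)
```

In this sample v1 qualifies with 350 + 200 from the first two sessions, while the large third session is ignored and v2 stays below the threshold.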

BigQuery equivalent of SQL query

I would like to run the following query in BigQuery, ideally as efficiently as possible. The idea is that I have all of these rows corresponding to tests (taken daily) by millions of users and I want to determine, of the users who have been active for over a year, how much each user has improved.
"Improvement" in this case is the average of the first N subtracted from the last N.
In this example, N is 30. (I've also added in the where cnt >= 100 part because I don't want to consider users who took a test a long time ago and just came back to try once more.)
select user_id,
avg(score) filter (where seqnum_asc <= 30) as first_n_avg,
avg(score) filter (where seqnum_desc <= 30) as last_n_avg
from (select *,
row_number() over (partition by user_id order by created_at) as seqnum_asc,
row_number() over (partition by user_id order by created_at desc) as seqnum_desc,
count(*) over (partition by user_id) as cnt
from tests
) t
where cnt >= 100
group by user_id
having max(created_at) >= min(created_at) + interval '1 year';
Just use conditional aggregation and fix the date functions:
select user_id,
avg(case when seqnum_asc <= 30 then score end) as first_n_avg,
avg(case when seqnum_desc <= 30 then score end) as last_n_avg
from (select *,
row_number() over (partition by user_id order by created_at) as seqnum_asc,
row_number() over (partition by user_id order by created_at desc) as seqnum_desc,
count(*) over (partition by user_id) as cnt
from tests
) t
where cnt >= 100
group by user_id
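having max(created_at) >= timestamp_add(min(created_at), interval 365 day);
The function in the having clause could be timestamp_add(), datetime_add(), or date_add(), depending on the type of created_at. Note that timestamp_add() does not accept a YEAR part, which is why the query uses 365 days; datetime_add() and date_add() do accept interval 1 year.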
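The conditional-aggregation pattern itself is portable, so it can be checked on a toy table. Below is a small sketch using Python's sqlite3 with N = 2 and a minimum of 4 tests instead of 30 and 100; the rows are invented, and SQLite's `date(..., '+1 year')` stands in for the BigQuery date function:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tests (user_id TEXT, created_at TEXT, score REAL)")
conn.executemany("INSERT INTO tests VALUES (?, ?, ?)", [
    # u1: five tests spread over more than a year
    ("u1", "2019-01-01", 10), ("u1", "2019-04-01", 20), ("u1", "2019-08-01", 30),
    ("u1", "2020-01-01", 40), ("u1", "2020-02-01", 50),
    # u2: enough tests, but all within a few days
    ("u2", "2020-01-01", 5), ("u2", "2020-01-02", 6),
    ("u2", "2020-01-03", 7), ("u2", "2020-01-04", 8),
    # u3: long time span, but too few tests
    ("u3", "2018-01-01", 1), ("u3", "2020-01-01", 2),
])

IMPROVEMENT_SQL = """
SELECT user_id,
       AVG(CASE WHEN seqnum_asc  <= 2 THEN score END) AS first_n_avg,
       AVG(CASE WHEN seqnum_desc <= 2 THEN score END) AS last_n_avg
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) AS seqnum_asc,
             ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS seqnum_desc,
             COUNT(*) OVER (PARTITION BY user_id) AS cnt
      FROM tests t) AS ranked
WHERE cnt >= 4
GROUP BY user_id
HAVING MAX(created_at) >= date(MIN(created_at), '+1 year')
"""

improvement = conn.execute(IMPROVEMENT_SQL).fetchall()
print(improvement)
```

AVG skips the NULLs produced by the CASE expressions, which is what makes this equivalent to the FILTER version: only u1 survives both conditions, with a first-2 average of 15 and a last-2 average of 45.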

count consecutive record with timestamp interval requirement

Referring to this post (link), I used the answer provided by @Gordon Linoff:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
from (select t.*,
row_number() over (partition by taxi order by time) as seqnum,
row_number() over (partition by taxi, client order by time) as seqnum_c
from t
) t
group by t.taxi, t.client, (seqnum - seqnum_c)
having count(*) >= 2
)
group by taxi;
and got my answer perfectly like this:
Tom 3 (AA count as 1, AAA count as 1 and BB count as 1, so total of 3 count)
Bob 1
But now I would like to add one more condition: the time between two consecutive clients for the same taxi should not be longer than 2 hours.
I know that I should probably use row_number() again and calculate the time difference with datediff, but I have no idea where to add it or how.
Any suggestions?
This requires a bit more logic. In this case, I would use lag() to calculate the groups:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
      from (select t.*,
                   -- a row starts a new group unless it continues the same
                   -- client's run for this taxi within 2 hours
                   sum(case when prev_client = client and
                                 prev_time > time - interval '2 hour'
                            then 0
                            else 1
                       end) over (partition by taxi order by time) as grp
            from (select t.*,
                         lag(client) over (partition by taxi order by time) as prev_client,
                         lag(time) over (partition by taxi order by time) as prev_time
                  from t
                 ) t
           ) t
      group by t.taxi, t.client, grp
      having count(*) >= 2
     )
group by taxi;
Note: You don't specify the database, so this uses ISO/ANSI standard syntax for date/time comparisons. You can adjust this for your actual database.
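Since the database is not known, here is one way the lag-based grouping could be tested on toy data, using Python's sqlite3. The table `rides`, the column `ride_min` (minutes since some start, to avoid database-specific interval syntax), and the rows are all assumptions. The sketch marks a row as the start of a new run unless it has the same client as the previous ride of that taxi less than 120 minutes earlier, then takes a running sum of those marks per taxi:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (taxi TEXT, client TEXT, ride_min INTEGER)")
conn.executemany("INSERT INTO rides VALUES (?, ?, ?)", [
    ("Tom", "A", 0), ("Tom", "A", 60),                      # run of 2
    ("Tom", "A", 300), ("Tom", "A", 330), ("Tom", "A", 360),  # new run: gap > 2h
    ("Tom", "B", 400), ("Tom", "B", 430),                    # run of 2
    ("Bob", "C", 0), ("Bob", "D", 30), ("Bob", "D", 60),     # only D,D qualifies
])

RUNS_SQL = """
SELECT taxi, COUNT(*) AS runs
FROM (SELECT taxi, client, grp
      FROM (SELECT r.*,
                   SUM(CASE WHEN prev_client = client
                             AND ride_min - prev_min < 120
                            THEN 0 ELSE 1 END)
                       OVER (PARTITION BY taxi ORDER BY ride_min) AS grp
            FROM (SELECT rides.*,
                         LAG(client)   OVER (PARTITION BY taxi ORDER BY ride_min) AS prev_client,
                         LAG(ride_min) OVER (PARTITION BY taxi ORDER BY ride_min) AS prev_min
                  FROM rides) AS r) AS g
      GROUP BY taxi, client, grp
      HAVING COUNT(*) >= 2) AS qualifying
GROUP BY taxi
ORDER BY taxi
"""

runs = conn.execute(RUNS_SQL).fetchall()
print(runs)
```

The first row of each taxi has a NULL prev_client, so the CASE falls through to ELSE and opens the first group. Tom ends up with three qualifying runs (AA, AAA, BB) and Bob with one.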

Running count distinct

I am trying to see how the cumulative number of subscribers changed over time based on unique email addresses and date they were created. Below is an example of a table I am working with.
I am trying to turn it into the table below. The email 1@gmail.com was created twice and I would like to count it only once. I cannot figure out how to generate the running count distinct column.
Thanks for the help.
I would usually do this using row_number():
select date, count(*),
sum(count(*)) over (order by date),
sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by date)
from (select t.*,
row_number() over (partition by email order by date) as seqnum
from t
) t
group by date
order by date;
This is similar to the version using lag(). However, I get nervous using lag if the same email appears multiple times on the same date.
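As a quick check of the row_number() idea, the sketch below runs it on a handful of invented rows with Python's sqlite3. The table `subs`, its columns, and the addresses are assumptions, and the per-day aggregation is done in a subquery before the running sums are applied:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subs (dt TEXT, email TEXT)")
conn.executemany("INSERT INTO subs VALUES (?, ?)", [
    ("2020-01-01", "a@example.com"), ("2020-01-01", "b@example.com"),
    ("2020-01-02", "a@example.com"), ("2020-01-02", "c@example.com"),
    ("2020-01-03", "b@example.com"),
])

RUNNING_SQL = """
SELECT dt, day_total,
       SUM(day_total) OVER (ORDER BY dt) AS running_total,
       SUM(day_new)   OVER (ORDER BY dt) AS running_distinct
FROM (SELECT dt,
             COUNT(*) AS day_total,
             -- seqnum = 1 marks the first time an email is seen
             SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) AS day_new
      FROM (SELECT s.*,
                   ROW_NUMBER() OVER (PARTITION BY email ORDER BY dt) AS seqnum
            FROM subs s) AS ranked
      GROUP BY dt) AS per_day
ORDER BY dt
"""

running = conn.execute(RUNNING_SQL).fetchall()
print(running)
```

Each email contributes to the distinct count only on its first row, so the running distinct column goes 2, 3, 3 while the plain running total goes 2, 4, 5.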
Getting the total count and cumulative count is straightforward. To get the cumulative distinct count, use lag to check whether the email already has a row with an earlier date, and if so set the flag to 0 so that the row is ignored in the running sum.
select distinct dt
,count(*) over(partition by dt) as day_total
,count(*) over(order by dt) as cumsum
,sum(flag) over(order by dt) as cumdist
from (select t.*
,case when lag(dt) over(partition by email order by dt) is not null then 0 else 1 end as flag
from tbl t
) t
Here is a solution that uses neither sum() over nor lag(), and still produces the correct results.
It may therefore be simpler to read and maintain.
select
t1.date_created,
(select count(*) from my_table where date_created = t1.date_created) emails_created,
(select count(*) from my_table where date_created <= t1.date_created) cumulative_sum,
(select count( distinct email) from my_table where date_created <= t1.date_created) running_count_distinct
from
(select distinct date_created from my_table) t1
order by 1