SQL cohort calculations - sql

I have my table of players activity like this:
user_id
event_name
install_date
event_date
1
active
2021-03-01
2021-03-01
1
active
2021-03-01
2021-03-01
1
active
2021-03-01
2021-03-02
2
active
2021-03-02
2021-03-02
2
active
2021-03-02
2021-03-04
2
active
2021-03-02
2021-03-04
and I want to calculate cohort retention like this
user_id
install_date
ret0
ret1
ret2
1
2021-03-01
1
1
0
2
2021-03-02
1
0
1
Help me please to write sql query. Thanks)

If I understand correctly, you just want to compare the event_date to the install_date and keep track of when "x" days appear between the two:
select user_id,
max(case when event_date = install_date then 1 else 0 end) as ret1,
max(case when event_date = date_add(install_date, interval 1 day) then 1 else 0 end) as ret1,
max(case when event_date = date_add(install_date, interval 2 day) then 1 else 0 end) as ret2
from t
group by user_id;

Consider below approach - less verbose and better manageable and expandable to more generic cases
select * from (
select user_id, install_date, date_diff(event_date, install_date, day) diff
from `project.dataset.table`
)
pivot (count(diff) as ret for diff in (0, 1, 2))
if applied to sample data in your question - output is
Btw, if you want to output 1 or 0 in respective columns - you can adjust above to
select * from (
select user_id, install_date, date_diff(event_date, install_date, day) diff
from `project.dataset.table`
group by 1,2,3
)
pivot (count(diff) as ret for diff in (0, 1, 2))
in this case - output is

Related

SQL 30 day active user query

I have a table of users and how many events they fired on a given date:
DATE
USERID
EVENTS
2021-08-27
1
5
2021-07-25
1
7
2021-07-23
2
3
2021-07-20
3
9
2021-06-22
1
9
2021-05-05
1
4
2021-05-05
2
2
2021-05-05
3
6
2021-05-05
4
8
2021-05-05
5
1
I want to create a table showing number of active users for each date with active user being defined as someone who has fired an event on the given date or in any of the preceding 30 days.
DATE
ACTIVE_USERS
2021-08-27
1
2021-07-25
3
2021-07-23
2
2021-07-20
2
2021-06-22
1
2021-05-05
5
I tried the following query which returned only the users who were active on the specified date:
SELECT COUNT(DISTINCT USERID), DATE
FROM table
WHERE DATE >= (CURRENT_DATE() - interval '30 days')
GROUP BY 2 ORDER BY 2 DESC;
I also tried using a window function with rows between but seems to end up getting the same result:
SELECT
DATE,
SUM(ACTIVE_USERS) AS ACTIVE_USERS
FROM
(
SELECT
DATE,
CASE
WHEN SUM(EVENTS) OVER (PARTITION BY USERID ORDER BY DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) >= 1 THEN 1
ELSE 0
END AS ACTIVE_USERS
FROM table
)
GROUP BY 1
ORDER BY 1
I'm using SQL:ANSI on Snowflake. Any suggestions would be much appreciated.
This is tricky to do as window functions -- because count(distinct) is not permitted. You can use a self-join:
select t1.date, count(distinct t2.userid)
from table t join
table t2
on t2.date <= t.date and
t2.date > t.date - interval '30 day'
group by t1.date;
However, that can be expensive. One solution is to "unpivot" the data. That is, do an incremental count per user of going "in" and "out" of active states and then do a cumulative sum:
with d as ( -- calculate the dates with "ins" and "outs"
select user, date, +1 as inc
from table
union all
select user, date + interval '30 day', -1 as inc
from table
),
d2 as ( -- accumulate to get the net actives per day
select date, user, sum(inc) as change_on_day,
sum(sum(inc)) over (partition by user order by date) as running_inc
from d
group by date, user
),
d3 as ( -- summarize into active periods
select user, min(date) as start_date, max(date) as end_date
from (select d2.*,
sum(case when running_inc = 0 then 1 else 0 end) over (partition by user order by date) as active_period
from d2
) d2
where running_inc > 0
group by user
)
select d.date, count(d3.user)
from (select distinct date from table) d left join
d3
on d.date >= start_date and d.date < end_date
group by d.date;

SQL How to Query Total & Subtotal

I have a table looks like below where day, order_id, and order_type are stored.
select day, order_id, order_type
from sample_table
day
order_id
order_type
2021-03-01
1
offline
2021-03-01
2
offline
2021-03-01
3
online
2021-03-01
4
online
2021-03-01
5
offline
2021-03-01
6
offline
2021-03-02
7
online
2021-03-02
8
online
2021-03-02
9
offline
2021-03-02
10
offline
2021-03-03
11
offline
2021-03-03
12
offline
Below is desired output:
day
total_order
num_offline_order
num_online_order
2021-03-01
6
4
2
2021-03-02
4
2
2
2021-03-03
2
2
0
Does anybody know how to query to get the desired output?
You need to pivot the data. A simple way to implement conditional aggregation in Vertica uses :::
select day, count(*) as total_order,
sum( (order_type = 'online')::int ) as num_online,
sum( (order_type = 'offline')::int ) as num_offline
from t
group by day;
Use case and sum:
select day,
count(1) as total_order
sum(case when order_type='offline' then 1 end) as num_offline_order,
sum(case when order_type='online' then 1 end) as num_online_order
from sample_table
group by day
order by day
you can also use count to aggregate values that are not null
select
day,
count(*) as total_order,
count(case when order_type='offline' then 1 else null end) as offline_orders,
count(case when order_type='online' then 1 else null end) as online_orders
from sample_table
group by day
order by day;

Google Big Query - Calculating monthly totals by status based on multiple date conditionals

I have table with the following data:
customer_id subscription_id plan status trial_start trial_end activated_at cancelled_at
1 jg1 basic cancelled 2020-06-26 2020-07-14 2020-07-14 2020-09-25
2 ab1 basic cancelled 2020-08-10 2020-08-24 2020-08-24 2021-02-15
3 cf8 basic cancelled 2020-08-25 2020-09-04 2020-09-04 2020-10-24
4 bc2 basic active 2020-10-12 2020-10-26 2020-10-26
5 hg4 basic active 2021-01-09 2021-02-08 2021-02-08
6 cd5 basic in-trial 2021-02-26
As you notice from the table, status = in_trial when a subscription is in trial. When subscription converts from in_trial to active there is activated_at date. When an in_trial or active subscription is cancelled, status switches to cancelled and cancelled_at date is present. Status column always shows only most recent status of a subscription. For every change in status a new row does not appear for subscription. For every change in status, status is changed, and appropriate dates appear to reflect time when status was changed.
My goal is to calculate, month-over-month, how many subscriptions are in status = in_trial, how many are in status = active and how many are in status = cancelled. Because status column reflects the most recent status of subscription, a query has to be able to determine how many subscriptions were in status = in_trial, status = active, and status = active based on available dates column.
If a particular subscription had multiple statuses in a given month (for example, subscription_id = ab1 was in trial in Aug-2020 and also converted to active in Aug-2020), I want only the most recent status to be considered for that subscription. So, as example, for subscription_id = ab1 I want it to be counted as active subscription for the month of Aug-2020.
The output I am looking for is:
date in_trial active cancelled
2020-06-01 1 0 0
2020-07-01 0 1 0
2020-08-01 1 2 0
2020-09-01 0 2 1
2020-10-01 0 2 1
2020-11-01 0 2 0
2020-12-01 0 2 0
2021-01-01 1 2 0
2021-02-01 1 2 1
2021-03-01 1 2 0
Or, results can be displayed in a different format, as long as numbers are correct. Another example of output can be:
date status count
2020-06-01 in_trial 1
2020-06-01 active 0
2020-06-01 cancelled 0
2020-07-01 in_trial 0
2020-07-01 active 1
2020-07-01 cancelled 0
... ... ...
2021-03-01 in_trial 1
2021-03-01 active 2
2021-03-01 cancelled 0
Below is the query you can use to reproduce the example table provided in this question:
SELECT 1 AS customer_id, 'jg1' AS subscription_id, 'basic' AS plan, 'cancelled' AS status, '2020-06-26' AS trial_start, '2020-07-14' AS trial_end, '2020-07-14' AS activated_at, '2020-09-25' AS cancelled_at UNION ALL
SELECT 2 AS customer_id, 'ab1' AS subscription_id, 'basic' AS plan, 'cancelled' AS status, '2020-08-10' AS trial_start, '2020-08-24' AS trial_end, '2020-08-24' AS activated_at, '2021-02-15' AS cancelled_at UNION ALL
SELECT 3 AS customer_id, 'cf8' AS subscription_id, 'basic' AS plan, 'cancelled' AS status, '2020-08-25' AS trial_start, '2020-09-04' AS trial_end, '2020-09-04' AS activated_at, '2020-10-24' AS cancelled_at UNION ALL
SELECT 4 AS customer_id, 'bc2' AS subscription_id, 'basic' AS plan, 'active' AS status, '2020-10-12' AS trial_start, '2020-10-26' AS trial_end, '2020-10-26' AS activated_at, '' AS cancelled_at UNION ALL
SELECT 5 AS customer_id, 'hg4' AS subscription_id, 'basic' AS plan, 'active' AS status, '2021-01-09' AS trial_start, '2021-02-08' AS trial_end, '2021-02-08' AS activated_at, '' AS cancelled_at UNION ALL
SELECT 6 AS customer_id, 'cd5' AS subscription_id, 'basic' AS plan, 'in_trial' AS status, '2021-02-26' AS trial_start, '' AS trial_end, '' AS activated_at, '' AS cancelled_at
I have been working on this problem since yesterday morning and continuing to figure out a way to do this efficiently. Thank you in advance for helping me solve this problem.
Below should work for you
select month,
count(distinct if(status = 0, customer_id, null)) in_trial,
count(distinct if(status = 1, customer_id, null)) active,
count(distinct if(status = 2, customer_id, null)) canceled
from (
select month, customer_id,
array_agg(status order by status desc limit 1)[offset(0)] status
from (
select distinct customer_id, 0 status, date_trunc(date, month) month
from `project.dataset.table`,
unnest(generate_date_array(date(trial_start), ifnull(date(trial_end), current_date()))) date
union all
select distinct customer_id, 1 status, date_trunc(date, month) month
from `project.dataset.table`,
unnest(generate_date_array(date(activated_at), ifnull(date(cancelled_at), current_date()))) date
union all
select distinct customer_id, 2 status, date_trunc(date(cancelled_at), month) month
from `project.dataset.table`
)
where not month is null
group by month, customer_id
)
group by month
# order by month
If applied to sample data in your question - output is

Group Records based on predefined date range in SQL (Oracle)

Is it possible to group records based on a predefined date range differences (e.g. 30 days) based on the start_date of a row and the end_date of the previous row for non-consecutive dates? I want to take the min(start_date) and max(end_date) of each group. I tried the lead and lag function with partition by in Oracle but couldn't come up with a proper solution. A related but unanswered post related to my question can be found here.
E.g.
ROW_NUM PROJECT_ID START_DATE END_DATE
1 1 2016-01-14 2016-08-15
2 1 2016-08-16 2016-09-10 --- Date diff Row 1&2 = 1 Day
3 1 2016-11-15 2017-01-10 --- Date diff Row 2&3 = 66 Days
4 1 2016-01-17 2017-04-10 --- Date diff Row 3&4 = 7 Days
5 2 2018-04-28 2018-06-01 --- Other Project
6 2 2019-02-01 2019-04-05 --- Diff > 30 Days
7 2 2019-04-08 2019-07-28 --- Diff 3 Days
Expected Result:
ROW_NUM PROJECT_ID START_DATE END_DATE
1 1 2016-01-14 2016-09-10
3 1 2016-11-15 2017-04-10
5 2 2018-04-28 2018-06-01
6 2 2019-02-01 2019-07-28
Use lag() and a cumulative sum to define where the groups begin. Then aggregate:
select project_id, min(start_date), max(end_date)
from (select t.*,
sum(case when prev_end_date > start_date - interval '30' day then 0 else 1 end) over
(partition by project_id order by start_date) as grp
from (select t.*,
lag(end_date) over (partition by project_id order by start_date) as prev_end_date
from t
) t
) t
group by project_id, grp;

Skip specific rows using LAG in sql

I have a table that looks like this:
Using the LAG function in SQL, I would like to perform the LAG on only values where star_date=end_date and get the past previous start_date record where start_date=end_date.
That my end table will have an extra column like this:
I hope my question is clear, any help is appreciated.
You can assign a group to these values and use that:
select t.*,
(case when start_date = end_date
then lag(start_date) over (partition by (case when start_date = end_date then 1 else 0 end) order by start_date)
end) as prev_eq_start_date
from t;
Or:
select t.*,
(case when start_date = end_date
then lag(start_date) over (partition by start_date = end_date order by start_date)
end) as prev_eq_start_date
from t;
Note if you data is big and most rows have different dates, then you might have a resources issue. In this case, an additional, unused partition by key can help:
select t.*,
(case when start_date = end_date
then lag(start_date) over (partition by (case when start_date = end_date then 1 else 2 end), (case when start_date <> end_date then start_date end) order by start_date)
end) as prev_eq_start_date
from t;
This has no impact on the result but it can avoid a resources error caused by too many rows with different values.
Below is for BigQuery Standard SQL
#standardSQL
SELECT *, NULL AS lag_result
FROM `project.dataset.table` WHERE start_date != end_date
UNION ALL
SELECT *, LAG(start_date) OVER(ORDER BY start_date)
FROM `project.dataset.table` WHERE start_date = end_date
If to apply to sample data in your question - result is
Row user_id start_date end_date lag_result
1 1 2019-01-01 2019-02-28 null
2 3 2019-02-27 2019-02-28 null
3 4 2019-08-04 2019-09-01 null
4 2 2019-02-01 2019-02-01 null
5 5 2019-08-07 2019-08-07 2019-02-01
6 6 2019-08-27 2019-08-27 2019-08-07
Btw, in case if your start_date and end_date are of STRING data type ('27/02/2019') vs. DATE type ('2019-02-27' as it was assumed in above query) - you should use below one
#standardSQL
SELECT *, NULL AS lag_result
FROM `project.dataset.table` WHERE start_date != end_date
UNION ALL
SELECT *, LAG(start_date) OVER(ORDER BY PARSE_DATE('%d/%m/%Y', start_date))
FROM `project.dataset.table` WHERE start_date = end_date
with result
Row user_id start_date end_date lag_result
1 1 01/01/2019 28/02/2019 null
2 3 27/02/2019 28/02/2019 null
3 4 04/08/2019 01/09/2019 null
4 2 01/02/2019 01/02/2019 null
5 5 07/08/2019 07/08/2019 01/02/2019
6 6 27/08/2019 27/08/2019 07/08/2019
Use JOIN
SQL FIDDLE
SELECT T.*,T1.LAG_Result
FROM TABLE T LEFT JOIN
(
SELECT User_Id,LAG(start_date) OVER(ORDER BY start_date) LAG_Result
FROM TABLE S
WHERE start_date = end_date
) T1 ON T.User_Id = T1.User_Id