Google BigQuery - Calculating monthly totals by status based on multiple date conditionals

I have a table with the following data:
customer_id subscription_id plan status trial_start trial_end activated_at cancelled_at
1 jg1 basic cancelled 2020-06-26 2020-07-14 2020-07-14 2020-09-25
2 ab1 basic cancelled 2020-08-10 2020-08-24 2020-08-24 2021-02-15
3 cf8 basic cancelled 2020-08-25 2020-09-04 2020-09-04 2020-10-24
4 bc2 basic active 2020-10-12 2020-10-26 2020-10-26
5 hg4 basic active 2021-01-09 2021-02-08 2021-02-08
6 cd5 basic in_trial 2021-02-26
As you can see from the table, status = in_trial while a subscription is in trial. When a subscription converts from in_trial to active, the activated_at date is set. When an in_trial or active subscription is cancelled, status switches to cancelled and the cancelled_at date is set. The status column always shows only the most recent status of a subscription: a change in status does not add a new row. Instead, the status value is overwritten and the corresponding date column is filled in to record when the change happened.
My goal is to calculate, month over month, how many subscriptions are in status = in_trial, how many are in status = active, and how many are in status = cancelled. Because the status column reflects only the most recent status of a subscription, the query has to determine how many subscriptions were in status = in_trial, status = active, and status = cancelled in each month from the available date columns.
If a particular subscription had multiple statuses in a given month (for example, subscription_id = ab1 was in trial in Aug-2020 and also converted to active in Aug-2020), I want only the most recent status to be considered for that subscription. So, for example, subscription_id = ab1 should be counted as an active subscription for Aug-2020.
The output I am looking for is:
date in_trial active cancelled
2020-06-01 1 0 0
2020-07-01 0 1 0
2020-08-01 1 2 0
2020-09-01 0 2 1
2020-10-01 0 2 1
2020-11-01 0 2 0
2020-12-01 0 2 0
2021-01-01 1 2 0
2021-02-01 1 2 1
2021-03-01 1 2 0
Or, results can be displayed in a different format, as long as numbers are correct. Another example of output can be:
date status count
2020-06-01 in_trial 1
2020-06-01 active 0
2020-06-01 cancelled 0
2020-07-01 in_trial 0
2020-07-01 active 1
2020-07-01 cancelled 0
... ... ...
2021-03-01 in_trial 1
2021-03-01 active 2
2021-03-01 cancelled 0
Below is the query you can use to reproduce the example table provided in this question:
SELECT 1 AS customer_id, 'jg1' AS subscription_id, 'basic' AS plan, 'cancelled' AS status, '2020-06-26' AS trial_start, '2020-07-14' AS trial_end, '2020-07-14' AS activated_at, '2020-09-25' AS cancelled_at UNION ALL
SELECT 2 AS customer_id, 'ab1' AS subscription_id, 'basic' AS plan, 'cancelled' AS status, '2020-08-10' AS trial_start, '2020-08-24' AS trial_end, '2020-08-24' AS activated_at, '2021-02-15' AS cancelled_at UNION ALL
SELECT 3 AS customer_id, 'cf8' AS subscription_id, 'basic' AS plan, 'cancelled' AS status, '2020-08-25' AS trial_start, '2020-09-04' AS trial_end, '2020-09-04' AS activated_at, '2020-10-24' AS cancelled_at UNION ALL
SELECT 4 AS customer_id, 'bc2' AS subscription_id, 'basic' AS plan, 'active' AS status, '2020-10-12' AS trial_start, '2020-10-26' AS trial_end, '2020-10-26' AS activated_at, '' AS cancelled_at UNION ALL
SELECT 5 AS customer_id, 'hg4' AS subscription_id, 'basic' AS plan, 'active' AS status, '2021-01-09' AS trial_start, '2021-02-08' AS trial_end, '2021-02-08' AS activated_at, '' AS cancelled_at UNION ALL
SELECT 6 AS customer_id, 'cd5' AS subscription_id, 'basic' AS plan, 'in_trial' AS status, '2021-02-26' AS trial_start, '' AS trial_end, '' AS activated_at, '' AS cancelled_at
I have been working on this problem since yesterday morning and continuing to figure out a way to do this efficiently. Thank you in advance for helping me solve this problem.

The query below should work for you:
select month,
count(distinct if(status = 0, customer_id, null)) in_trial,
count(distinct if(status = 1, customer_id, null)) active,
count(distinct if(status = 2, customer_id, null)) canceled
from (
select month, customer_id,
array_agg(status order by status desc limit 1)[offset(0)] status
from (
select distinct customer_id, 0 status, date_trunc(date, month) month
from `project.dataset.table`,
unnest(generate_date_array(date(trial_start), ifnull(date(trial_end), current_date()))) date
union all
select distinct customer_id, 1 status, date_trunc(date, month) month
from `project.dataset.table`,
unnest(generate_date_array(date(activated_at), ifnull(date(cancelled_at), current_date()))) date
union all
select distinct customer_id, 2 status, date_trunc(date(cancelled_at), month) month
from `project.dataset.table`
)
where not month is null
group by month, customer_id
)
group by month
# order by month
If applied to sample data in your question - output is
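The month-by-month logic of that query can also be cross-checked outside BigQuery. Below is a minimal Python sketch, not the answer's own code: it ranks statuses as 0 = in_trial, 1 = active, 2 = cancelled, keeps the highest rank per subscription and month, and counts per month. The function names and the as_of parameter (a stand-in for current_date()) are assumptions made for this illustration, and empty dates are represented as None.

```python
from datetime import date

# Sample rows from the question:
# (subscription_id, trial_start, trial_end, activated_at, cancelled_at)
ROWS = [
    ("jg1", date(2020, 6, 26), date(2020, 7, 14), date(2020, 7, 14), date(2020, 9, 25)),
    ("ab1", date(2020, 8, 10), date(2020, 8, 24), date(2020, 8, 24), date(2021, 2, 15)),
    ("cf8", date(2020, 8, 25), date(2020, 9, 4), date(2020, 9, 4), date(2020, 10, 24)),
    ("bc2", date(2020, 10, 12), date(2020, 10, 26), date(2020, 10, 26), None),
    ("hg4", date(2021, 1, 9), date(2021, 2, 8), date(2021, 2, 8), None),
    ("cd5", date(2021, 2, 26), None, None, None),
]


def months_between(start, end):
    """Yield the first day of every month from start's month through end's month."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def monthly_status_counts(rows, as_of):
    """Return {month: [in_trial, active, cancelled]} using the latest status per month."""
    latest = {}  # (month, subscription_id) -> highest status rank seen in that month
    for sub_id, trial_start, trial_end, activated_at, cancelled_at in rows:
        # Open-ended periods run until as_of (hypothetical stand-in for current_date()).
        spans = [(0, trial_start, trial_end or as_of)]
        if activated_at:
            spans.append((1, activated_at, cancelled_at or as_of))
        if cancelled_at:
            spans.append((2, cancelled_at, cancelled_at))
        for rank, start, end in spans:
            for month in months_between(start, end):
                key = (month, sub_id)
                latest[key] = max(latest.get(key, rank), rank)
    counts = {}
    for (month, _sub_id), rank in latest.items():
        counts.setdefault(month, [0, 0, 0])[rank] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    for month, (in_trial, active, cancelled) in monthly_status_counts(ROWS, date(2021, 3, 1)).items():
        print(month, in_trial, active, cancelled)
```

With as_of set to 2021-03-01, the sketch covers the same ten months as the desired output in the question.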

Related

create additional date after and before current row and create new column based on it

Let's say I have this kind of data:
create table example
(cust_id VARCHAR, product VARCHAR, price float, datetime varchar);
insert into example (cust_id, product, price, datetime)
VALUES
('1', 'scooter', 2000, '2022-01-10'),
('1', 'skateboard', 1500, '2022-01-20'),
('1', 'beefmeat', 300, '2022-06-08'),
('2', 'wallet', 200, '2022-02-25'),
('2', 'hairdryer', 250, '2022-04-28'),
('3', 'skateboard', 1600, '2022-03-29')
I want to generate additional rows and then derive a new column from them.
My expected output looks like this:
cust_id  total_price  date     is_active
1        3500         2022-01  active
1        0            2022-02  active
1        0            2022-03  active
1        0            2022-04  inactive
1        0            2022-05  inactive
1        300          2022-06  active
1        0            2022-07  active
2        0            2022-01  inactive
2        200          2022-02  active
2        0            2022-03  active
2        250          2022-04  active
2        0            2022-05  active
2        0            2022-06  active
2        0            2022-07  inactive
3        0            2022-01  inactive
3        0            2022-02  inactive
3        1600         2022-03  active
3        0            2022-04  active
3        0            2022-05  active
3        0            2022-06  inactive
3        0            2022-07  inactive
The rules are as follows:
The first month in which the customer makes a transaction is active; the months before that transaction are inactive.
ex: if the first transaction is in month 2, then month 2 is active and month 1 is inactive (look at cust_id 2 and 3)
If more than 2 months pass without a transaction, the following months are inactive until a new transaction makes the customer active again.
ex: if the last transaction is in month 1, then month 2 and month 3 are still active, month 4 and month 5 are inactive, and month 6 is active again if there is a new transaction (look at cust_id 1 and 3)
My first thought was to use this code, but I don't know what the next step is:
select *,
date_part('month', age(to_date(date, 'YYYY-MM'), to_date(lag(date) over (partition by cust_id order by date),'YYYY-MM')))date_diff
from(
select
cust_id,
sum(price)total_price,
to_char(to_date(datetime, 'YYYY-MM-DD'),'YYYY-MM')date
from example
group BY
cust_id,
date
order by
cust_id,
date)test
I'm open to any suggestion
Try the following; the explanation is in the query comments:
/* use generate_series to generate a series of dates
starting from the min date of datetime up to the
max datetime with one-month intervals, then do a
cross join with the distinct cust_id to map each cust_id
to each generated date.*/
WITH cust_dates AS
(
SELECT EX.cust_id, to_char(dts, 'YYYY-mm') dts
FROM generate_series
(
(SELECT MIN(datetime)::timestamp FROM example),
(SELECT MAX(datetime)::timestamp + '2 month'::interval FROM example),
'1 month'::interval
) dts
CROSS JOIN (SELECT DISTINCT cust_id FROM example) EX
),
/* do a left join with your table to find prices
for each cust_id/ month, and aggregate for cust_id, month_date
to find the sum of prices for each cust_id, month_date.
*/
monthly_price AS
(
SELECT CD.cust_id,
CD.dts AS month_date,
COALESCE(SUM(price), 0) total_price
FROM cust_dates CD LEFT JOIN example EX
ON CD.cust_id = EX.cust_id AND
CD.dts = to_char(EX.datetime::date, 'YYYY-mm') /* cast needed because datetime is stored as varchar */
GROUP BY CD.cust_id, CD.dts
)
/* Now, we have the sum of monthly prices for each cust_id,
we can use the max window function with "ROWS BETWEEN 2 PRECEDING AND CURRENT ROW"
to check if one of the (current month or the previous two months) has a sum of prices > 0.
*/
SELECT cust_id, month_date, total_price,
CASE MAX(total_price) OVER
(PARTITION BY cust_id ORDER BY month_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
WHEN 0 THEN 'inactive'
ELSE 'active'
END AS is_active
FROM monthly_price
ORDER BY cust_id, month_date
See demo
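The windowing rule in the final SELECT (a month is active if the current month or either of the two preceding months has a total above 0) can be checked by hand with a short Python sketch. This is only an illustration of the same rule, not a translation of the query; the function names and the explicit first/last month arguments are assumptions made for the example.

```python
from collections import defaultdict

# Sample rows from the question: (cust_id, price, datetime)
PURCHASES = [
    ("1", 2000, "2022-01-10"),
    ("1", 1500, "2022-01-20"),
    ("1", 300, "2022-06-08"),
    ("2", 200, "2022-02-25"),
    ("2", 250, "2022-04-28"),
    ("3", 1600, "2022-03-29"),
]


def month_range(first, last):
    """List 'YYYY-MM' labels from first to last, inclusive."""
    year, month = map(int, first.split("-"))
    labels = []
    while f"{year:04d}-{month:02d}" <= last:
        labels.append(f"{year:04d}-{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return labels


def activity(purchases, first, last):
    """Return (cust_id, month, total_price, is_active) for every customer and month."""
    totals = defaultdict(int)
    for cust_id, price, day in purchases:
        totals[(cust_id, day[:7])] += price  # day[:7] is the 'YYYY-MM' part
    months = month_range(first, last)
    rows = []
    for cust_id in sorted({c for c, _, _ in purchases}):
        monthly = [totals.get((cust_id, m), 0) for m in months]
        for i, month in enumerate(months):
            # Same frame as ROWS BETWEEN 2 PRECEDING AND CURRENT ROW.
            window = monthly[max(0, i - 2): i + 1]
            rows.append((cust_id, month, monthly[i], "active" if max(window) > 0 else "inactive"))
    return rows


if __name__ == "__main__":
    for row in activity(PURCHASES, "2022-01", "2022-07"):
        print(*row)
```

Run over 2022-01 through 2022-07, the sketch yields one row per customer and month, which can be compared line by line with the expected output in the question.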

Calculate time between one event and the next one in PostgreSQL

I have a table
id  user_id  created_at           status
1   100      2022-12-13 00:12:12  IN_TRANSIT
2   104      2022-12-13 01:12:12  IN_TRANSIT
3   100      2022-12-13 02:12:12  DONE
4   100      2022-12-13 03:12:12  IN_TRANSIT
5   104      2022-12-13 04:12:12  DONE
6   100      2022-12-13 05:12:12  DONE
7   104      2022-12-13 06:12:12  IN_TRANSIT
7   104      2022-12-13 07:12:12  REJECTED
I am trying to calculate the sum for each user of the idle time, so the time between status DONE and next IN_TRANSIT for that user.
The result should be
user_id  idle_time
100      01:00:00
104      02:00:00
select user_id
,idle_time
from (
select user_id
,status
,created_at-lag(created_at) over(partition by user_id order by created_at) as idle_time
,lag(status) over(partition by user_id order by created_at) as pre_status
from t
) t
where status = 'IN_TRANSIT'
and pre_status = 'DONE'
user_id  idle_time
100      01:00:00
104      02:00:00
Fiddle
Try the following:
select user_id, min(case status when 'IN_TRANSIT' then created_at end) -
min(case status when 'DONE' then created_at end) idle_time
from
(
select user_id, created_at, status,
sum(case status when 'DONE' then 1 end) over (partition by user_id order by created_at) as grp
from table_name
) T
group by user_id, grp
having min(case status when 'IN_TRANSIT' then created_at end) -
min(case status when 'DONE' then created_at end) is not null
This finds the time difference between a 'DONE' status and the first 'IN_TRANSIT' status that follows it. If there are multiple 'IN_TRANSIT' statuses after a 'DONE' and you want the difference to the last one, change min(case status when 'IN_TRANSIT' then created_at end) to max.
Also, if a user_id has multiple 'DONE', 'IN_TRANSIT' pairs, they will appear as separate rows in the result, but you can use this query as a subquery to sum all the differences grouped by user_id.
See a demo.
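To make the intended result concrete, here is a minimal Python sketch of the same idea: walk each user's events in time order, and whenever an IN_TRANSIT event directly follows a DONE event, add the gap to that user's total. It is an illustration under the assumption that "next" means the immediately following event for that user; the function name is hypothetical.

```python
from datetime import datetime, timedelta

# Sample rows from the question: (user_id, created_at, status)
EVENTS = [
    (100, "2022-12-13 00:12:12", "IN_TRANSIT"),
    (104, "2022-12-13 01:12:12", "IN_TRANSIT"),
    (100, "2022-12-13 02:12:12", "DONE"),
    (100, "2022-12-13 03:12:12", "IN_TRANSIT"),
    (104, "2022-12-13 04:12:12", "DONE"),
    (100, "2022-12-13 05:12:12", "DONE"),
    (104, "2022-12-13 06:12:12", "IN_TRANSIT"),
    (104, "2022-12-13 07:12:12", "REJECTED"),
]


def idle_time_per_user(events):
    """Sum, per user, the time between a DONE event and the IN_TRANSIT event right after it."""
    totals = {}
    previous = {}  # user_id -> (timestamp, status) of that user's previous event
    # ISO-formatted timestamps sort correctly as strings.
    for user_id, created_at, status in sorted(events, key=lambda e: (e[0], e[1])):
        ts = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S")
        prev = previous.get(user_id)
        if prev and prev[1] == "DONE" and status == "IN_TRANSIT":
            totals[user_id] = totals.get(user_id, timedelta()) + (ts - prev[0])
        previous[user_id] = (ts, status)
    return totals


if __name__ == "__main__":
    for user_id, idle in idle_time_per_user(EVENTS).items():
        print(user_id, idle)
```

The per-user dictionary plays the role of lag(...) over (partition by user_id order by created_at), and the running total corresponds to wrapping the query in a sum grouped by user_id.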
You need to use the lag function to find the difference between each row and the previous row for the same user:
select user_id, idle_time
from (
select user_id, status,
created_at - lag(created_at) over (partition by user_id order by created_at) as idle_time
from calctime
) as drt
where idle_time > interval '10 sec' and status = 'IN_TRANSIT'

SQL cohort calculations

I have my table of players activity like this:
user_id  event_name  install_date  event_date
1        active      2021-03-01    2021-03-01
1        active      2021-03-01    2021-03-01
1        active      2021-03-01    2021-03-02
2        active      2021-03-02    2021-03-02
2        active      2021-03-02    2021-03-04
2        active      2021-03-02    2021-03-04
and I want to calculate cohort retention like this
user_id  install_date  ret0  ret1  ret2
1        2021-03-01    1     1     0
2        2021-03-02    1     0     1
Please help me write the SQL query. Thanks!
If I understand correctly, you just want to compare the event_date to the install_date and keep track of when "x" days appear between the two:
select user_id,
max(case when event_date = install_date then 1 else 0 end) as ret0,
max(case when event_date = date_add(install_date, interval 1 day) then 1 else 0 end) as ret1,
max(case when event_date = date_add(install_date, interval 2 day) then 1 else 0 end) as ret2
from t
group by user_id;
Consider below approach - less verbose and better manageable and expandable to more generic cases
select * from (
select user_id, install_date, date_diff(event_date, install_date, day) diff
from `project.dataset.table`
)
pivot (count(diff) as ret for diff in (0, 1, 2))
if applied to sample data in your question - output is
Btw, if you want to output 1 or 0 in respective columns - you can adjust above to
select * from (
select user_id, install_date, date_diff(event_date, install_date, day) diff
from `project.dataset.table`
group by 1,2,3
)
pivot (count(diff) as ret for diff in (0, 1, 2))
in this case - output is
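The 0/1 variant can be reproduced by hand with a small Python sketch: collect the distinct day offsets between install_date and event_date for each user, then emit a flag per offset. This only illustrates the logic; the function name and the days parameter are assumptions made for the example.

```python
from datetime import date

# Sample rows from the question: (user_id, install_date, event_date)
EVENTS = [
    (1, date(2021, 3, 1), date(2021, 3, 1)),
    (1, date(2021, 3, 1), date(2021, 3, 1)),
    (1, date(2021, 3, 1), date(2021, 3, 2)),
    (2, date(2021, 3, 2), date(2021, 3, 2)),
    (2, date(2021, 3, 2), date(2021, 3, 4)),
    (2, date(2021, 3, 2), date(2021, 3, 4)),
]


def retention_flags(events, days=(0, 1, 2)):
    """Return {(user_id, install_date): [ret0, ret1, ret2]} as 0/1 flags."""
    seen = {}  # (user_id, install_date) -> set of day offsets with at least one event
    for user_id, install_date, event_date in events:
        offsets = seen.setdefault((user_id, install_date), set())
        offsets.add((event_date - install_date).days)
    return {key: [int(d in offsets) for d in days] for key, offsets in seen.items()}


if __name__ == "__main__":
    for (user_id, install_date), flags in retention_flags(EVENTS).items():
        print(user_id, install_date, *flags)
```

Using a set of offsets has the same effect as the group by 1,2,3 in the query: duplicate events on the same day collapse to a single flag.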

SQL How to Query Total & Subtotal

I have a table that looks like the one below, where day, order_id, and order_type are stored.
select day, order_id, order_type
from sample_table
day         order_id  order_type
2021-03-01  1         offline
2021-03-01  2         offline
2021-03-01  3         online
2021-03-01  4         online
2021-03-01  5         offline
2021-03-01  6         offline
2021-03-02  7         online
2021-03-02  8         online
2021-03-02  9         offline
2021-03-02  10        offline
2021-03-03  11        offline
2021-03-03  12        offline
Below is desired output:
day         total_order  num_offline_order  num_online_order
2021-03-01  6            4                  2
2021-03-02  4            2                  2
2021-03-03  2            2                  0
Does anybody know how to query to get the desired output?
You need to pivot the data. A simple way to implement conditional aggregation in Vertica uses the :: cast operator:
select day, count(*) as total_order,
sum( (order_type = 'online')::int ) as num_online,
sum( (order_type = 'offline')::int ) as num_offline
from t
group by day;
Use case and sum:
select day,
count(1) as total_order,
sum(case when order_type='offline' then 1 else 0 end) as num_offline_order,
sum(case when order_type='online' then 1 else 0 end) as num_online_order
from sample_table
group by day
order by day
you can also use count to aggregate values that are not null
select
day,
count(*) as total_order,
count(case when order_type='offline' then 1 else null end) as offline_orders,
count(case when order_type='online' then 1 else null end) as online_orders
from sample_table
group by day
order by day;
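All three answers rely on the same conditional-aggregation idea: every row adds 1 to the day's total, and adds 1 to a per-type counter only when its order_type matches. A minimal Python sketch of that idea, with a hypothetical function name, looks like this:

```python
# Sample rows from the question: (day, order_id, order_type)
ORDERS = [
    ("2021-03-01", 1, "offline"), ("2021-03-01", 2, "offline"),
    ("2021-03-01", 3, "online"), ("2021-03-01", 4, "online"),
    ("2021-03-01", 5, "offline"), ("2021-03-01", 6, "offline"),
    ("2021-03-02", 7, "online"), ("2021-03-02", 8, "online"),
    ("2021-03-02", 9, "offline"), ("2021-03-02", 10, "offline"),
    ("2021-03-03", 11, "offline"), ("2021-03-03", 12, "offline"),
]


def order_counts_by_day(orders):
    """Return {day: (total_order, num_offline_order, num_online_order)}."""
    counts = {}
    for day, _order_id, order_type in orders:
        total, offline, online = counts.get(day, (0, 0, 0))
        counts[day] = (
            total + 1,
            # A bool adds as 0 or 1, like sum(case when ... then 1 else 0 end).
            offline + (order_type == "offline"),
            online + (order_type == "online"),
        )
    return counts


if __name__ == "__main__":
    for day, row in sorted(order_counts_by_day(ORDERS).items()):
        print(day, *row)
```

Starting every counter at 0 is what keeps a day with no online orders at 0 rather than missing, which is the reason for the else 0 branch (or for using count) in the SQL versions.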

Select start and end dates for changing values in SQL

I have a database with accounts and historical status changes
select Date, Account, OldStatus, NewStatus from HistoricalCodes
order by Account, Date
Date        Account  OldStatus  NewStatus
2020-01-01  12345    1          2
2020-10-01  12345    2          3
2020-11-01  12345    3          2
2020-12-01  12345    2          1
2020-01-01  54321    2          3
2020-09-01  54321    3          2
2020-12-01  54321    2          3
For every account I need to determine Start Date and End Date when Status = 2. An additional challenge is that the status can change back and forth multiple times. Is there a way in SQL to create something like this for at least first two timeframes when account was in 2? Any ideas?
Account  StartDt_1   EndDt_1     StartDt_2   EndDt_2
12345    2020-01-01  2020-10-01  2020-11-01  2020-12-01
54321    2020-09-01  2020-12-01
I would suggest putting this information in separate rows:
select t.*
from (select account, newstatus, date as startdate,
lead(date) over (partition by account order by date) as enddate
from t
) t
where newstatus = 2;
This produces a separate row for each period when an account has a status of 2. This is better than putting the dates in separate pairs of columns, because you do not need to know the maximum number of periods of status = 2 when you write the query.
For a fixed maximum of status changes per account, you can use window functions and conditional aggregation:
select account,
max(case when rn = 1 then date end) as start_dt1,
max(case when rn = 1 then lead_date end) as end_dt1,
max(case when rn = 2 then date end) as start_dt2,
max(case when rn = 2 then lead_date end) as end_dt2
from (
select t.*,
row_number() over(partition by account, newstatus order by date) as rn,
lead(date) over(partition by account order by date) as lead_date
from mytable t
) t
where newstatus = 2
group by account
You can extend the select clause with more conditional expressions to handle more possible ranges per account.
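Both answers pair each change into status 2 with the date of the account's next change. A minimal Python sketch of that pairing, producing one (start, end) tuple per period as in the first answer, is shown below. The function name is hypothetical, and an open period (no later change) is represented with None as an assumption of this example.

```python
from collections import defaultdict

# Sample rows from the question: (Date, Account, OldStatus, NewStatus)
HISTORY = [
    ("2020-01-01", 12345, 1, 2),
    ("2020-10-01", 12345, 2, 3),
    ("2020-11-01", 12345, 3, 2),
    ("2020-12-01", 12345, 2, 1),
    ("2020-01-01", 54321, 2, 3),
    ("2020-09-01", 54321, 3, 2),
    ("2020-12-01", 54321, 2, 3),
]


def status_periods(history, status=2):
    """Return {account: [(start_date, end_date), ...]} for periods spent in `status`."""
    by_account = defaultdict(list)
    for change_date, account, _old_status, new_status in history:
        by_account[account].append((change_date, new_status))
    periods = {}
    for account, changes in by_account.items():
        changes.sort()  # ISO dates sort correctly as strings
        for i, (start, new_status) in enumerate(changes):
            if new_status == status:
                # Equivalent of lead(date): the date of the account's next change, if any.
                end = changes[i + 1][0] if i + 1 < len(changes) else None
                periods.setdefault(account, []).append((start, end))
    return periods


if __name__ == "__main__":
    for account, spans in status_periods(HISTORY).items():
        print(account, spans)
```

Keeping the periods as a list per account avoids fixing the number of start/end column pairs in advance; the position of a tuple in the list corresponds to rn in the second answer.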