I am trying to calculate a monthly churn rate (for a given month: number_of_people_who_unsubscribed / number_of_subscribers_at_beginning_of_month).
The query below gives me a single percentage of users who unsubscribed during the specified month. However, I'd like to output this percentage for every month, so I can display it as a line chart. I'm not sure where to start: it feels like I need to loop and run the query for each month, but that feels wrong. How could I make the same calculation without specifying the month manually? We can assume that there is at least one start_date and one end_date per month, so some kind of GROUP BY might work.
WITH
date_range AS (
    SELECT '2022-10-01' AS start_date, '2022-10-31' AS end_date
),
start_accounts AS (
    SELECT DISTINCT ProductContractId
    FROM HD s
    INNER JOIN date_range d
        ON s.FirstInvoiceDate <= d.start_date
        AND (s.ItemRejectDate > d.start_date OR s.ItemRejectDate IS NULL)
),
end_accounts AS (
    SELECT DISTINCT ProductContractId
    FROM HD s
    INNER JOIN date_range d
        ON s.FirstInvoiceDate <= d.end_date
        AND (s.ItemRejectDate > d.end_date OR s.ItemRejectDate IS NULL)
),
churned_accounts AS (
    SELECT s.ProductContractId
    FROM start_accounts s
    LEFT OUTER JOIN end_accounts e
        ON s.ProductContractId = e.ProductContractId
    WHERE e.ProductContractId IS NULL
),
start_count AS (
    SELECT COUNT(*) AS n_start FROM start_accounts
),
churn_count AS (
    SELECT COUNT(*) AS n_churn FROM churned_accounts
)
SELECT
    CONVERT(numeric(10,4), n_churn * 1.0 / n_start) * 100 AS churn_rate,
    (1.0 - n_churn * 1.0 / n_start) * 100 AS retention_rate,
    n_start,
    n_churn
FROM start_count, churn_count
Ultimately I'm looking for an output that looks something like:
(a table with one row per month: the month and its churn_rate)
I don't have a SQL Server setup yet (working on it), but here is a solution that works on the Postgres schema from the book Fighting Churn With Data (https://www.fightchurnwithdata.com), which is the basis for the original query in the question. The key is to select multiple dates as a constant table in the first CTE (or use some external table defined for this purpose) and then carry it through all the other CTEs so it calculates multiple churn rates at once. The end date of each churn calculation period is derived with a date interval expression (that will definitely have to change for SQL Server, but I think the rest of it should work...).
WITH
calc_dates AS (
    SELECT generate_series(min(start_date),
                           max(start_date) - interval '1 month',
                           interval '1 month') AS start_date
    FROM subscription
),
start_accounts AS (
    SELECT DISTINCT d.start_date, account_id
    FROM subscription s
    INNER JOIN calc_dates d
        ON s.start_date <= d.start_date
        AND (s.end_date > d.start_date OR s.end_date IS NULL)
),
end_accounts AS (
    SELECT DISTINCT d.start_date, account_id
    FROM subscription s
    INNER JOIN calc_dates d
        ON s.start_date <= (d.start_date + interval '1 month')
        AND (s.end_date > (d.start_date + interval '1 month') OR s.end_date IS NULL)
),
churned_accounts AS (
    SELECT s.start_date, s.account_id
    FROM start_accounts s
    LEFT OUTER JOIN end_accounts e
        ON s.account_id = e.account_id
        AND s.start_date = e.start_date
    WHERE e.account_id IS NULL
),
start_count AS (
    SELECT start_date, COUNT(*)::FLOAT AS n_start
    FROM start_accounts
    GROUP BY start_date
),
churn_count AS (
    SELECT start_date, COUNT(*)::FLOAT AS n_churn
    FROM churned_accounts
    GROUP BY start_date
)
SELECT s.start_date,
       (s.start_date + interval '1 month')::date AS end_date,
       (n_churn * 1.0 / n_start) * 100 AS churn_rate
FROM start_count s
INNER JOIN churn_count c
    ON s.start_date = c.start_date
ORDER BY s.start_date;
This gives the following output when run against the book schema:
start_date | end_date   | churn_rate
2020-02-01 | 2020-03-01 | 6.42486011191047
2020-03-01 | 2020-04-01 | 5.63271604938272
2020-04-01 | 2020-05-01 | 4.83752524325317
I'm working on getting a SQL Server instance running and will try to update the solution soon...
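In the meantime, here is a minimal sketch (untested, my guess at the translation, using the FirstInvoiceDate column from the HD table in the question) of how the calc_dates series might be generated in SQL Server: the generate_series call becomes a recursive CTE with DATEADD, and the rest of the CTEs should then need only minor syntax changes.
WITH calc_dates AS (
    -- anchor: first day of the earliest and latest invoice months
    SELECT DATEFROMPARTS(YEAR(MIN(FirstInvoiceDate)), MONTH(MIN(FirstInvoiceDate)), 1) AS start_date,
           DATEFROMPARTS(YEAR(MAX(FirstInvoiceDate)), MONTH(MAX(FirstInvoiceDate)), 1) AS last_date
    FROM HD
    UNION ALL
    -- recursive step: add one month at a time, stopping one month before the latest month
    SELECT DATEADD(MONTH, 1, start_date), last_date
    FROM calc_dates
    WHERE DATEADD(MONTH, 1, start_date) < last_date
)
SELECT start_date
FROM calc_dates
OPTION (MAXRECURSION 1000);  -- raise the default 100-level recursion limit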
I have an SQLite database (with Django as the ORM) with a table of change events (an Account is assigned a new Strategy). I would like to convert it to a time series, so that for each day I have the Strategy the Account was following.
My table:
Expected output:
As shown, there can be more than one change per day. In this case I select the last change of the day, as the desired time-series output must have only one value per day.
My question is similar to this one, but in SQL, not BigQuery (though I'm not sure I understood the unnest part they propose). I have a working solution in Pandas with reindex and fillna, but I'm sure there is an elegant and simple solution in SQL (maybe even better with the Django ORM).
You can use a recursive Common Table Expression to generate all dates between the first and the last, and then join this generated table with your data to get the needed value for each day:
WITH RECURSIVE daterange(d) AS (
    SELECT date(min(created_at)) FROM events
    UNION ALL
    SELECT date(d, '1 day') FROM daterange
    WHERE d < (SELECT max(created_at) FROM events)
)
SELECT d, account_id, strategy_id
FROM daterange CROSS JOIN events
WHERE created_at = (SELECT max(e.created_at)
                    FROM events e
                    WHERE e.account_id = events.account_id
                      AND date(e.created_at) <= d)
GROUP BY account_id, d
ORDER BY account_id, d
The date() function converts a datetime value to a simple date, so you can use it to group your data by date.
date(d, '1 day') applies a modifier of +1 calendar day to d.
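For instance, in SQLite, using one of the timestamps from your data:
SELECT date('2022-10-07 15:00:45');    -- 2022-10-07
SELECT date('2022-10-07', '1 day');    -- 2022-10-08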
Here is an example with your data:
CREATE TABLE events (
created_at,
account_id,
strategy_id
);
insert into events
VALUES ('2022-10-07 12:53:53', 4801323843, 7),
('2022-10-07 08:10:07', 4801323843, 5),
('2022-10-07 15:00:45', 4801323843, 8),
('2022-10-10 13:01:16', 4801323843, 6);
WITH RECURSIVE daterange(d) AS (
    SELECT date(min(created_at)) FROM events
    UNION ALL
    SELECT date(d, '1 day') FROM daterange
    WHERE d < (SELECT max(created_at) FROM events)
)
SELECT d, account_id, strategy_id
FROM daterange CROSS JOIN events
WHERE created_at = (SELECT max(e.created_at)
                    FROM events e
                    WHERE e.account_id = events.account_id
                      AND date(e.created_at) <= d)
GROUP BY account_id, d
ORDER BY account_id, d
d          | account_id | strategy_id
2022-10-07 | 4801323843 | 8
2022-10-08 | 4801323843 | 8
2022-10-09 | 4801323843 | 8
2022-10-10 | 4801323843 | 6
2022-10-11 | 4801323843 | 6
The query could be slow with many rows. In that case, create an index on the created_at column:
CREATE INDEX events_created_idx ON events(created_at);
My final version is the one proposed by @Andrea B., with just a slight performance improvement: joining only the rows that we need, and therefore discarding the WHERE clause.
I also convert NULL end dates to date('now').
Here is the final version I used:
with recursive daterange(day) as
(
    select min(date(created_at)) from events
    union all
    select date(day, '1 day') from daterange
    where day < date('now')
),
-- the CTE is named differently from the base events table to avoid shadowing it
event_ranges as (
    select account_id, strategy_id, created_at as start_date,
           case when lead(created_at) over (partition by account_id order by created_at) is null
                then datetime('now')
                else lead(created_at) over (partition by account_id order by created_at)
           end as end_date
    from events
)
select *
from daterange
join event_ranges
  on event_ranges.start_date < daterange.day
 and daterange.day < event_ranges.end_date
order by event_ranges.account_id
Hope this helps!
I have an SCD Type 2 table that I join with various other tables, and I am looking to aggregate the sum total from any entity that was active during an individual month (by active I mean the ones that don't yet have an end_date).
Currently, I have a query similar to this (let's say aggregating data for the months of April and May 2022):
select
count(1) as enitities_agg,
DATE_TRUNC('Month', h.start) as date,
sum(h.price) filter (where c.name='HIGH') as sum_total,
----....----
from
project as p
join class as c on p.class_id = c.id
join stage as s on s.project_id = p.id
join stage_info as si on si.stage_id = s.id
join history as h on h.stage_info_id = si.id
where
h.start <= '2022-06-01' and
h.end_date >= '2022-04-01' and
COALESCE(p.end_date, '2099-01-01') >= '2022-04-01' and
COALESCE(p.start_date, '2099-01-01') <= '2022-06-01' and
COALESCE(s.end, '2099-01-01') >= '2022-04-01' and
h.price is not null and
h.price != 0
group by DATE_TRUNC('Month', h.start)
It aggregates fine, but only for entities whose history starts in April or May, not the ones that overlap those months and are still active.
The problem is that some history entities start in April, March, etc., and still haven't ended by May. Because I group with group by DATE_TRUNC('Month', h.start), i.e. by history start date, I don't get the entities that start earlier than April or May and continue to be active after May; I only get aggregates in the months they started in.
I tried generating a series and grouping by the generated month, but I didn't find a way that would group them correctly. Here is an example of one experiment I tried:
from
generate_series('2022-03-01', '2022-07-01', INTERVAL '1 month') as mt
join project as p on COALESCE(p.end_date, '2099-01-01') >= mt and
COALESCE(p.start_date, '2099-01-01') <= mt + INTERVAL '1 month'
join class as c on p.class_id = c.id
join stage as stage on stage.project_id = p.id and
COALESCE(stage.end, '2099-01-01') >= mt
join stage_info as si on si.stage_id = stage.id
join history as h on h.stage_info_id = si.id
where
h.start <= mt and
h.end_date >= mt + INTERVAL '1 month' and
h.price is not null and
h.price != 0
group by mt
How would it be possible to iterate through the history table, aggregate any entities active in a month, group them by that month, and get something like this?
"enitities_agg" | "date" | "sum_total"
832 | "2022-04-01 00:00:00" | 15432234
1020 | "2022-05-01 00:00:00" | 19979458
It seems your logic is: if any day of the begin_ .. _end interval falls into the month, count it in. This was the hardest part to guess from the desired results.
So I guess you need this:
with dim as (
select
m::date as month_start
,(date_trunc('month', m) + interval '1 month - 1 day')::date as month_end
,to_char(date_trunc('month', m), 'Mon') as month
from generate_series('2022-01-01', '2022-08-01', INTERVAL '1 month') as m
)
SELECT
dim.month
, sum(coalesce(t.price, 0)) as sum_price
FROM dim
left join test as t
on t.begin_ <= dim.month_end
and t._end >= dim.month_start
group by dim.month_start, dim.month
order by dim.month_start, dim.month
;
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=614030d4db5e03876f693a9a8a5ff122
Do you want all the history entities that were active at some point during May 2022? If so, the following may help:
daterange(h.start, h.end_date, '[]') && daterange('2022-05-01', '2022-05-31', '[]')
demo:
CREATE temp TABLE test (
begin_ date,
_end date
);
INSERT INTO test
VALUES ('2022-01-01', '2022-05-01');
INSERT INTO test
VALUES ('2022-01-01', '2022-05-11');
INSERT INTO test
VALUES ('2022-05-01', '2022-07-11');
INSERT INTO test
VALUES ('2022-06-11', '2022-07-11');
SELECT
*,
daterange(begin_, _end, '[]')
FROM
test t
WHERE
daterange(t.begin_, t._end, '[]') && daterange('2022-05-01', '2022-05-31', '[]');
Reference for the && range overlap operator: https://www.postgresql.org/docs/current/functions-range.html#RANGE-OPERATORS-TABLE
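Purely as an illustration (not part of the original demo), the same overlap test could replace the pair of start/end comparisons on the history table from the question; h.start and h.end_date are the question's column names, and the casts to date are an assumption about their types:
-- hypothetical: all history rows active at any point during May 2022
select h.*
from history as h
where daterange(h.start::date, h.end_date::date, '[]')
      && daterange('2022-05-01', '2022-05-31', '[]');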
If I have a table that has the following format:
purchase_time | user_id | items_purchased
The current query I'm doing is something like this:
SELECT user_id, date(purchase_time), sum(items_purchased)
from user_purchase_metrics
GROUP BY date(purchase_time), user_id;
I'm trying to create a query that will fill in 0 for purchases if there isn't an entry on a given date for a given user. Is this possible?
Side-stepping the valid concern by @xQbert about the performance of generating missing dates: performance must always give way to necessity, and without a convenient calendar table, generating the dates of interest is a necessity. Moreover, in this case the dates must be generated for each user_id. In the following this is done by joining each generated date with the distinct user_ids from the user_purchase_metrics table. The result is then LEFT JOINed to the same table to sum the purchases, giving the desired 0 for the missing dates (see demo; for the dates I just picked March):
with dates( user_id, idate ) as
( select user_id, d::date
from ( select distinct user_id
from user_purchase_metrics
) u
join generate_series( date '2021-03-01' --- start_date
, date '2021-03-31' --- end_date
, interval '1 day'
) gs(d)
on true
) -- select * from dates;
select d.user_id
, d.idate
, coalesce(sum(pm.items_purchased),0)
from dates d
left join user_purchase_metrics pm
on ( pm.user_id = d.user_id
and date(pm.purchase_time) = d.idate
)
group by d.user_id, d.idate
order by d.user_id, d.idate;
To parametrize it, the query can be embedded in a SQL function that returns a table (also in the demo); a sketch of what that could look like follows.
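A minimal sketch of such a function, assuming integer user_id and items_purchased columns; the function name purchases_per_day and the returned column types are assumptions, not taken from the demo:
CREATE FUNCTION purchases_per_day(p_start date, p_end date)
RETURNS TABLE (user_id int, idate date, items_purchased bigint)
LANGUAGE sql AS
$$
    with dates(user_id, idate) as
      ( select u.user_id, d::date
        from ( select distinct user_id
               from user_purchase_metrics
             ) u
        join generate_series(p_start, p_end, interval '1 day') gs(d)
          on true
      )
    select d.user_id
         , d.idate
         , coalesce(sum(pm.items_purchased), 0)::bigint
    from dates d
    left join user_purchase_metrics pm
      on ( pm.user_id = d.user_id
       and date(pm.purchase_time) = d.idate
         )
    group by d.user_id, d.idate
    order by d.user_id, d.idate;
$$;
-- usage:
-- select * from purchases_per_day(date '2021-03-01', date '2021-03-31');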
I have 3 tables that I'm querying to get data based on different conditions. I have from and to params, which I use to create the time range in which I'm looking for data in those tables.
For instance, if from equals '2020-07-01' and to equals '2020-08-01', I expect to receive the row values of the tables grouped by week; if some weeks don't have records I want to return 0, and if several tables have records for the same week, I'd like to sum them.
Currently I have this:
SELECT d.day, COALESCE(t.total, 0)
FROM (
SELECT day::date
FROM generate_series(timestamp '2020-07-01',
timestamp '2020-08-01',
interval '1 week') day
) d
LEFT JOIN (
SELECT date AS day,
SUM(total) AS total
FROM table1
WHERE id = '1'
AND date BETWEEN '2020-07-01' AND '2020-08-01'
GROUP BY day
) t USING (day)
ORDER BY d.day;
I'm generating a series of dates grouped by week, and on top of that I'm adding a left join. Now, for some reason, it only works if the dates match exactly; otherwise COALESCE(t.total, 0) returns 0 even if the SUM(total) for that week is not 0.
In the same way as this LEFT JOIN, I'm using other left joins with other tables in the same query, so I'm running into the same problem there.
Please see if this works for you. Whenever you find yourself aggregating more than once, ask yourself whether it is necessary.
Rather than trying to match on discrete days, use time ranges.
with limits as (
select '2020-07-01'::timestamp as dt_start,
'2020-08-01'::timestamp as dt_end
), weeks as (
SELECT x.day::date as day, least(x.day::date + 7, dt_end::date) as day_end
FROM limits l
CROSS JOIN LATERAL
generate_series(l.dt_start, l.dt_end, interval '1 week') as x(day)
WHERE x.day::date != least(x.day::date + 7, dt_end::date)
), t1 as (
select w.day,
sum(coalesce(t.total, 0)) as t1total
from weeks w
left join table1 t
on t.id = 1
and t.date >= w.day
and t.date < w.day_end
group by w.day
), t2 as (
select w.day,
sum(coalesce(t.sum_measure, 0)) as t2total
from weeks w
left join table2 t
on t.something = 'whatever'
and t.date >= w.day
and t.date < w.day_end
group by w.day
)
select t1.day,
t1.t1total,
t2.t2total
from t1
join t2 on t2.day = t1.day;
You can keep adding tables like that with CTEs.
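For instance, a hypothetical table3 (the table and column names here are assumed purely for illustration) would get one more CTE appended before the final SELECT, which then joins it in:
, t3 as (
    select w.day,
           sum(coalesce(t.total, 0)) as t3total
    from weeks w
    left join table3 t
      on t.date >= w.day
     and t.date < w.day_end
    group by w.day
)
select t1.day,
       t1.t1total,
       t2.t2total,
       t3.t3total
from t1
join t2 on t2.day = t1.day
join t3 on t3.day = t1.day;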
My earlier example with multiple left joins was bad because it blows out the row count due to a lack of join conditions between the left-joined tables.
There is an interesting corner case, e.g. 2019-02-01 to 2019-03-01, which returns an empty interval as the last week; I have updated the query to filter that out.
New here. I am trying to get the daily and weekly active users over time. Users have 30 days before they are considered inactive. My goal is to create graphs that can be split by user_id to show cohorts, regions, categories, etc.
I have created a date table to get every day for the time period, and I have a simplified orders table with the base info I need to calculate this.
I am trying to do a LEFT JOIN to get the status by date using the following SQL query:
WITH daily_use AS (
SELECT
__key__.id AS user_id
, DATE_TRUNC(date(placeOrderDate), day) AS activity_date
FROM `analysis.Order`
where isBuyingGroupOrder = TRUE
AND testOrder = FALSE
GROUP BY 1, 2
),
dates AS (
SELECT DATE_ADD(DATE "2016-01-01", INTERVAL d.d DAY) AS date
FROM
(
SELECT ROW_NUMBER() OVER(ORDER BY __key__.id) -1 AS d
FROM `analysis.Order`
ORDER BY __key__.id
LIMIT 1096
) AS d
ORDER BY 1 DESC
)
SELECT
daily_use.user_id
, wd.date AS date
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
LEFT JOIN daily_use
ON wd.date >= daily_use.activity_date
AND wd.date < DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
I am getting this error: "LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join" in BigQuery, and I was wondering how I can get around this. I am using Standard SQL within BigQuery.
Thank you
Below is for BigQuery Standard SQL and mostly reproduces the logic in your query, with the exception of not including days where no activity at all is found.
#standardSQL
SELECT
daily_use.user_id
, wd.date AS DATE
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
CROSS JOIN daily_use
WHERE wd.date BETWEEN
daily_use.activity_date AND DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
-- ORDER BY 1,2
If for whatever reason you still need to exactly reproduce your logic, you can wrap the above in a final LEFT JOIN as below:
#standardSQL
SELECT *
FROM dates AS wd
LEFT JOIN (
SELECT
daily_use.user_id
, wd.date AS date
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
CROSS JOIN daily_use
WHERE wd.date BETWEEN
daily_use.activity_date AND DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
) AS daily_use
USING (date)
-- ORDER BY 1,2
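As a side note (not part of the original answer): the dates CTE in the question builds its day list by numbering rows of analysis.Order, which only works while that table has at least 1096 rows. A minimal alternative sketch using GENERATE_DATE_ARRAY, keeping the same 1096-day window starting 2016-01-01, would be:
#standardSQL
WITH dates AS (
  SELECT date
  FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2016-01-01',
                                  DATE_ADD(DATE '2016-01-01', INTERVAL 1095 DAY),
                                  INTERVAL 1 DAY)) AS date
)
SELECT * FROM dates ORDER BY date DESC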