How to aggregate data by month when date ranges overlap in PostgreSQL - sql

I have a type 2 SCD table that I join with various other tables, and I want to sum a total over every entity that was active during a given month (by active I mean the ones that don't yet have an end_date).
Currently I have a query similar to this (say, aggregating data for April 2022 and May 2022):
select
count(1) as enitities_agg,
DATE_TRUNC('Month', h.start) as date,
sum(h.price) filter (where c.name='HIGH') as sum_total,
----....----
from
project as p
join class as c on p.class_id = c.id
join stage as s on s.project_id = p.id
join stage_info as si on si.stage_id = s.id
join history as h on h.stage_info_id = si.id
where
h.start <= '2022-06-01' and
h.end_date >= '2022-04-01' and
COALESCE(p.end_date, '2099-01-01') >= '2022-04-01' and
COALESCE(p.start_date, '2099-01-01') <= '2022-06-01' and
COALESCE(s.end, '2099-01-01') >= '2022-04-01' and
h.price is not null and
h.price != 0
group by DATE_TRUNC('Month', h.start)
It correctly aggregates only the rows whose history starts in April or May, not the ones that overlap those months and are still active.
The problem is that some history entities start in March, April, etc. and still haven't ended by May. Because I group by DATE_TRUNC('Month', h.start), i.e. by the history start date, entities that started before April or May and remain active after May are counted only in the month in which they started.
I tried generating a series and grouping by the generated month, but I couldn't find a way to group the rows correctly. Here is one experiment I tried:
from
generate_series('2022-03-01', '2022-07-01', INTERVAL '1 month') as mt
join project as p on COALESCE(p.end_date, '2099-01-01') >= mt and
COALESCE(p.start_date, '2099-01-01') <= mt + INTERVAL '1 month'
join class as c on p.class_id = c.id
join stage as stage on stage.project_id = p.id and
COALESCE(stage.end, '2099-01-01') >= mt
join stage_info as si on si.stage_id = stage.id
join history as h on h.stage_info_id = si.id
where
h.start <= mt and
h.end_date >= mt + INTERVAL '1 month' and
h.price is not null and
h.price != 0
group by mt
How would it be possible to iterate through the history table aggregating any active entities in a month and group them by the same month and get something like this?
"enitities_agg" | "date" | "sum_total"
832 | "2022-04-01 00:00:00" | 15432234
1020 | "2022-05-01 00:00:00" | 19979458

It seems your logic is: if any day of the begin_ to _end interval falls within a month, count the row in that month. This was the hardest part to guess from the desired results.
So I guess you need this:
with dim as (
select
m::date as month_start
,(date_trunc('month', m) + interval '1 month - 1 day')::date as month_end
,to_char(date_trunc('month', m), 'Mon') as month
from generate_series('2022-01-01', '2022-08-01', INTERVAL '1 month') as m
)
SELECT
dim.month
, sum(coalesce(t.price, 0)) as sum_price
FROM dim
left join test as t
on t.begin_ <= dim.month_end
and t._end >= dim.month_start
group by dim.month_start, dim.month
order by dim.month_start, dim.month
;
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=614030d4db5e03876f693a9a8a5ff122
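To sanity-check the join condition outside the database, here is a minimal Python sketch of the same rule: a row counts toward a month when begin_ <= month_end and _end >= month_start. The rows and prices are made up for illustration, and month_bounds and monthly_totals are hypothetical helper names, not part of the answer's query.

```python
from datetime import date, timedelta

def month_bounds(year, month):
    """Return the first and last day of the given month."""
    first = date(year, month, 1)
    # First day of the following month, minus one day.
    next_first = date(year + month // 12, month % 12 + 1, 1)
    return first, next_first - timedelta(days=1)

def monthly_totals(rows, months):
    """Sum price over every row whose [begin, end] interval touches the month."""
    totals = {}
    for year, month in months:
        month_start, month_end = month_bounds(year, month)
        totals[month_start] = sum(
            price
            for begin, end, price in rows
            if begin <= month_end and end >= month_start
        )
    return totals

# Hypothetical rows: (begin_, _end, price)
rows = [
    (date(2022, 3, 10), date(2022, 5, 20), 100),  # spans March to May
    (date(2022, 4, 1), date(2022, 4, 30), 50),    # April only
    (date(2022, 5, 31), date(2022, 7, 1), 7),     # starts on the last day of May
]
totals = monthly_totals(rows, [(2022, 4), (2022, 5), (2022, 6)])
```

The first row is counted in both April and May, which is the behaviour the question asks for: one interval contributes to every month it overlaps, not only to the month in which it starts.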

Do you want all history entities that were active at some point during May 2022? If so, the following may help:
daterange(h.start,
h.end_date,
'[]') && daterange('2022-05-01', '2022-05-31', '[]')
Demo:
CREATE temp TABLE test (
begin_ date,
_end date
);
INSERT INTO test
VALUES ('2022-01-01', '2022-05-01');
INSERT INTO test
VALUES ('2022-01-01', '2022-05-11');
INSERT INTO test
VALUES ('2022-05-01', '2022-07-11');
INSERT INTO test
VALUES ('2022-06-11', '2022-07-11');
SELECT
*,
daterange(begin_, _end, '[]')
FROM
test t
WHERE
daterange(t.begin_, t._end, '[]') && daterange('2022-05-01', '2022-05-31', '[]');
&& range operator reference: https://www.postgresql.org/docs/current/functions-range.html#RANGE-OPERATORS-TABLE
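One detail worth checking with '[]' bounds is that both end points are included, so the choice of upper bound for the month matters. The following Python sketch mirrors the inclusive overlap test on the four demo rows; overlaps_closed is a hypothetical helper, and the extra row starting on 1 June is an assumption added only to show the boundary effect.

```python
from datetime import date

def overlaps_closed(a_start, a_end, b_start, b_end):
    """Overlap test where both ranges include their end points, like '[]'."""
    return a_start <= b_end and b_start <= a_end

# The four rows from the demo table: (begin_, _end)
demo_rows = [
    (date(2022, 1, 1), date(2022, 5, 1)),
    (date(2022, 1, 1), date(2022, 5, 11)),
    (date(2022, 5, 1), date(2022, 7, 11)),
    (date(2022, 6, 11), date(2022, 7, 11)),
]
may_start, may_end = date(2022, 5, 1), date(2022, 5, 31)
in_may = [row for row in demo_rows if overlaps_closed(*row, may_start, may_end)]

# Hypothetical row that begins on 1 June, to show the boundary effect.
june_row = (date(2022, 6, 1), date(2022, 6, 30))
matches_range_to_june_1 = overlaps_closed(*june_row, may_start, date(2022, 6, 1))
matches_range_to_may_31 = overlaps_closed(*june_row, may_start, may_end)
```

Here the first three demo rows overlap May and the fourth does not. The June row matches a closed range that ends on 2022-06-01 but not one that ends on 2022-05-31, so an inclusive upper bound should be the last day of the month.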

Related

Calculating monthly churn - on SQL SERVER

I am trying to calculate a monthly churn rate (for a given month: number_of_people_who_unsubscribed / number_of_subscribers_at_beginnin
That gives me a single percentage of users who unsubscribed during January. However, I'd like to output this percentage for every month, so I can display it as a line chart. I'm not sure where to start - it feels like I need to loop and run the query for each month but that feels wrong. How could I make the same calculation, but without specifying the month manually? We can assume that there is at least one start_date and one end_date per month, so some kind of group by might work.
WITH
date_range AS (
SELECT '2022-10-01' AS start_date, '2022-10-31'AS end_date
),
start_accounts AS
(
SELECT DISTINCT ProductContractId
FROM HD s INNER JOIN date_range d ON
s.FirstInvoiceDate<= d.start_date
AND (s.ItemRejectDate>d.start_date or s.ItemRejectDate is null)
),
end_accounts AS
(
SELECT DISTINCT ProductContractId
FROM HD s INNER JOIN date_range d ON
s.FirstInvoiceDate<= d.end_date
AND (s.ItemRejectDate>d.end_date or s.ItemRejectDate is null)
),
churned_accounts AS
(
SELECT s.ProductContractId
FROM start_accounts s
LEFT OUTER JOIN end_accounts e ON
s.ProductContractId=e.ProductContractId
WHERE e.ProductContractId is null
),
start_count AS (
SELECT COUNT(*) AS n_start FROM start_accounts
),
churn_count AS (
SELECT COUNT(*) AS n_churn FROM churned_accounts
)
SELECT
convert(numeric(10,4),(n_churn* 1.0/n_start ))*100
AS churn_rate,
(1.0 - n_churn * 1.0 / n_start) * 100 -- multiply by 1.0 to avoid integer division
AS retention_rate,
n_start,
n_churn
FROM start_count, churn_count
Ultimately I'm looking for an output that looks something like:
[image not preserved]
I don't have SQL Server set up yet (working on it), but here is a solution that works on the Postgres schema from the book Fighting Churn With Data (https://www.fightchurnwithdata.com), which is the basis for the original query in the question. The key is to select multiple dates as a constant table in the first CTE (or use some external table defined for this purpose) and then carry that date through all the other CTEs, so the query calculates multiple churn rates at once. The end date of each churn calculation period is derived from its start date with a date interval expression (that will definitely have to change for SQL Server, but I think the rest of it should work).
WITH
calc_dates AS (
SELECT generate_series(min(start_date)
, max(start_date)-interval '1month'
, interval '1 month') AS start_date
FROM subscription
),
start_accounts AS
(
SELECT DISTINCT d.start_date, account_id
FROM SUBSCRIPTION s INNER JOIN calc_dates d ON
s.START_DATE<= d.start_date
AND (s.END_DATE>d.start_date or s.END_DATE is null)
),
end_accounts AS
(
SELECT DISTINCT d.start_date, account_id
FROM SUBSCRIPTION s INNER JOIN calc_dates d ON
s.START_DATE<= (d.start_date+interval '1 month')
AND (s.END_DATE>(d.start_date+interval '1 month') or s.end_date is null)
),
churned_accounts AS
(
SELECT s.start_date, s.account_id
FROM start_accounts s
LEFT OUTER JOIN end_accounts e ON
s.account_id=e.account_id
and s.start_date = e.start_date
WHERE e.account_id is null
),
start_count AS (
SELECT start_date, COUNT(*)::FLOAT AS n_start
FROM start_accounts
group by start_date
),
churn_count AS (
SELECT start_date, COUNT(*)::FLOAT AS n_churn
FROM churned_accounts
group by start_date
)
SELECT s.start_date,(s.start_date+interval '1 month')::date as end_date,
(n_churn* 1.0/n_start )*100
AS churn_rate
FROM start_count s
inner join churn_count c
on s.start_date=c.start_date
order by s.start_date;
This gives the following output when run against the book schema:
start_date | end_date | churn_rate
2020-02-01 | 2020-03-01 | 6.42486011191047
2020-03-01 | 2020-04-01 | 4.83752524325317
I'm working on getting a SQLServer running and will try to update the solution soon...
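Until a SQL Server version exists, the per-period logic can be checked independently of any SQL dialect. The Python sketch below follows the same steps as the CTEs (accounts active at the period start, accounts active at the period end, churned = the difference). The subscription rows are invented for illustration, and active_on and churn_rates are hypothetical names. One difference to be aware of: this sketch reports 0.0 for a period with no churn, whereas the inner join between start_count and churn_count in the query above returns no row for such a period.

```python
from datetime import date

def active_on(subscription, day):
    """A subscription is active if it started on or before day and has not ended."""
    _, start, end = subscription
    return start <= day and (end is None or end > day)

def churn_rates(subscriptions, period_starts):
    """Churn percentage for each consecutive pair of period boundaries."""
    rates = {}
    for start, end in zip(period_starts, period_starts[1:]):
        at_start = {s[0] for s in subscriptions if active_on(s, start)}
        at_end = {s[0] for s in subscriptions if active_on(s, end)}
        if at_start:  # skip periods with no accounts to avoid dividing by zero
            churned = at_start - at_end
            rates[start] = 100.0 * len(churned) / len(at_start)
    return rates

# Hypothetical data: (account_id, start_date, end_date or None if still active)
subscriptions = [
    (1, date(2020, 1, 15), None),
    (2, date(2020, 1, 20), date(2020, 2, 10)),
    (3, date(2020, 2, 5), date(2020, 3, 20)),
    (4, date(2020, 1, 1), None),
]
rates = churn_rates(
    subscriptions, [date(2020, 2, 1), date(2020, 3, 1), date(2020, 4, 1)]
)
```

In this data, account 2 churns in the February period and account 3 in the March period, each out of three accounts active at the period start.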

In Postgres how do I write a SQL query to select distinct values overall but aggregated over a set time period

What I mean by this is: if I have a table called payments with a created_at column and a user_id column, I want to select the count of purchases aggregated weekly (it can be any interval I want), but counting only first-time purchases. E.g. if a user purchased for the first time in week 1, they would be counted in week 1, but if they purchased again in week 2, they would not be counted again.
created_at | user_id
timestamp | 1
timestamp | 1
This is the query I came up with. The issue is that if a user purchases multiple times, all of those purchases are included. How can I improve this?
WITH dates AS
(
SELECT *
FROM generate_series(
'2022-07-22T15:30:06.687Z'::DATE,
'2022-11-21T17:04:59.457Z'::DATE,
'1 week'
) date
)
SELECT
dates.date::DATE AS date,
COALESCE(COUNT(DISTINCT(user_id)), 0) AS registrations
FROM
dates
LEFT JOIN
payment ON created_at::DATE BETWEEN dates.date AND dates.date::date + '1 ${dateUnit}'::INTERVAL
GROUP BY
dates.date
ORDER BY
dates.date DESC;
You want to count only first purchases. So get those first purchases in the first step and work with these.
WITH dates AS
(
SELECT *
FROM generate_series(
'2022-07-22T15:30:06.687Z'::DATE,
'2022-11-21T17:04:59.457Z'::DATE,
'1 week'
) date
)
, first_purchases AS
(
SELECT user_id, MIN(created_at::DATE) AS purchase_date
FROM payment
GROUP BY user_id
)
SELECT
d.date,
COALESCE(COUNT(p.purchase_date), 0) AS registrations
FROM
dates d
LEFT JOIN
first_purchases p ON p.purchase_date >= d.date
AND p.purchase_date < d.date + '1 ${dateUnit}'::INTERVAL
GROUP BY
d.date
ORDER BY
d.date DESC;
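The two steps (reduce to each user's first purchase, then bucket those dates into half-open weeks) can be sketched in Python as a quick check of the logic. The payments below are made-up sample data and first_purchases_per_week is a hypothetical name; the week length is fixed at 7 days here, whereas the query takes its interval from the ${dateUnit} placeholder.

```python
from datetime import date, timedelta

def first_purchases_per_week(payments, first_week, weeks):
    """Count users whose earliest purchase falls in each [start, start + 7 days)."""
    # Step 1: earliest purchase date per user (the first_purchases CTE).
    first_seen = {}
    for user_id, day in payments:
        if user_id not in first_seen or day < first_seen[user_id]:
            first_seen[user_id] = day
    # Step 2: bucket those dates into weeks; empty weeks get a count of 0.
    counts = {}
    for i in range(weeks):
        start = first_week + timedelta(weeks=i)
        end = start + timedelta(weeks=1)
        counts[start] = sum(1 for d in first_seen.values() if start <= d < end)
    return counts

# Hypothetical payments: (user_id, purchase date)
payments = [
    (1, date(2022, 7, 23)),
    (1, date(2022, 8, 1)),   # repeat purchase, must not be counted again
    (2, date(2022, 7, 30)),
    (3, date(2022, 8, 4)),
    (2, date(2022, 8, 5)),   # repeat purchase
]
counts = first_purchases_per_week(payments, date(2022, 7, 22), weeks=3)
```

User 1 is counted only in the first week, users 2 and 3 in the second, and the third week is 0 because the only purchase in it is a repeat.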

Calculate the number of active clients for each month (SQL)

I'm currently writing a query for my job and I need a little help with it.
My database is really simple:
client_id | status | start_date | end_date
1 | active | 2020-01-01 | 2020-03-15
1 | inactive | 2020-03-15 | null
2 | active | 2020-01-01 | null
3 | active | 2020-01-01 | 2021-04-28
3 | inactive | 2021-04-28 | 2020-07-28
3 | active | 2021-07-28 | null
For each new status of a given client, the database gets a new row and the previous row is updated. There are many different statuses that a client can be in, but this is irrelevant for this case.
I need to get, for each month, the number of active clients, and I'm struggling a bit with it. I've tried using a partition, and I can get the active clients for a range of dates, but not the monthly distribution.
Can anyone give me some advice? :)
Thanks
You didn't tag your database, so this may not work for you as written. The approach is similar in every database once you change the function names (for example, lateral becomes cross apply in SQL Server, and generate_series may have to be expressed differently in your database). This is basically PostgreSQL:
select client_id, extract(year from d) y, extract(month from d) m
from MyInfo t1,
lateral (select generate_series(make_date(cast(extract(year from t1.start_date) as int),
cast(extract(month from t1.start_date) as int), 1),
case when end_date is null then current_date else end_date end,
'1 month'::interval)
) Days(d)
where t1.status = 'active'
order by client_id, y, m;
Sample DBFiddle demo
EDIT: Noticed you were asking for monthly counts. Can be written easier, but expanding the above:
with baseData as
(
select client_id, extract(year from d) y, extract(month from d) m
from MyInfo t1,
lateral (select generate_series(make_date(cast(extract(year from t1.start_date) as int),
cast(extract(month from t1.start_date) as int), 1),
case when end_date is null then current_date else end_date end,
'1 month'::interval)
) Days(d)
where t1.status = 'active'
)
select y, m, count(*)
from baseData
group by y,m
order by y,m;
EDIT: Another way:
select m, count(*)
from (select generate_series(date_trunc('month', min(start_date)), current_date, '1 month'::interval) from MyInfo) months(m)
inner join MyInfo i on
m >= date_trunc('month', i.start_date) and
m <= date_trunc('month', case when end_date is null then current_date else end_date end)
where i.status = 'active'
group by m
order by m;
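As a cross-check of the last query's join condition, here is a small Python sketch that applies the same rule to the rows from the question: an active row counts for month m when m lies between the month of its start_date and the month of its end_date (or of today when end_date is null). The function names are hypothetical, and today is passed in explicitly so the result is reproducible, whereas the SQL uses current_date.

```python
from datetime import date

def month_floor(d):
    """First day of the month containing d (like date_trunc('month', d))."""
    return d.replace(day=1)

def months_between(first, last):
    """Yield the first day of every month from first's month up to last."""
    m = month_floor(first)
    while m <= last:
        yield m
        m = date(m.year + m.month // 12, m.month % 12 + 1, 1)

def active_per_month(rows, today):
    """Count active rows per month; open-ended rows run until today."""
    active = [(s, e or today) for _, status, s, e in rows if status == "active"]
    first = min(s for s, _ in active)
    return {
        m: sum(1 for s, e in active if month_floor(s) <= m <= month_floor(e))
        for m in months_between(first, today)
    }

# Rows copied from the question as given: (client_id, status, start_date, end_date)
rows = [
    (1, "active", date(2020, 1, 1), date(2020, 3, 15)),
    (1, "inactive", date(2020, 3, 15), None),
    (2, "active", date(2020, 1, 1), None),
    (3, "active", date(2020, 1, 1), date(2021, 4, 28)),
    (3, "inactive", date(2021, 4, 28), date(2020, 7, 28)),
    (3, "active", date(2021, 7, 28), None),
]
counts = active_per_month(rows, today=date(2021, 8, 15))
```

Like count(*) in the query, this counts rows rather than distinct clients, so a client with two active rows touching the same month would be counted twice.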

Postgresql left join date_trunc with default values

I have 3 tables which I'm querying to get the data based on different conditions. I have from and to params and these are the ones I'm using to create a range of time in which I'm looking for the data in those tables.
For instance, if from equals '2020-07-01' and to equals '2020-08-01', I expect to receive the rows of the tables grouped by week. If a week has no records, I want to return 0; if several tables have records for the same week, I'd like to sum them.
Currently I have this:
SELECT d.day, COALESCE(t.total, 0)
FROM (
SELECT day::date
FROM generate_series(timestamp '2020-07-01',
timestamp '2020-08-01',
interval '1 week') day
) d
LEFT JOIN (
SELECT date AS day,
SUM(total) AS total
FROM table1
WHERE id = '1'
AND date BETWEEN '2020-07-01' AND '2020-08-01'
GROUP BY day
) t USING (day)
ORDER BY d.day;
I'm generating a series of dates one week apart, and on top of that I'm adding a left join. For some reason it only works if the dates match exactly; otherwise COALESCE(t.total, 0) returns 0 even if SUM(total) for that week is not 0.
I apply the same kind of LEFT JOIN to other tables in the same query, so I run into the same problem there.
Please see if this works for you. Whenever you find yourself aggregating more than once, ask yourself whether it is necessary.
Rather than try to match on discrete days, use time ranges.
with limits as (
select '2020-07-01'::timestamp as dt_start,
'2020-08-01'::timestamp as dt_end
), weeks as (
SELECT x.day::date as day, least(x.day::date + 7, dt_end::date) as day_end
FROM limits l
CROSS JOIN LATERAL
generate_series(l.dt_start, l.dt_end, interval '1 week') as x(day)
WHERE x.day::date != least(x.day::date + 7, dt_end::date)
), t1 as (
select w.day,
sum(coalesce(t.total, 0)) as t1total
from weeks w
left join table1 t
on t.id = 1
and t.date >= w.day
and t.date < w.day_end
group by w.day
), t2 as (
select w.day,
sum(coalesce(t.sum_measure, 0)) as t2total
from weeks w
left join table2 t
on t.something = 'whatever'
and t.date >= w.day
and t.date < w.day_end
group by w.day
)
select t1.day,
t1.t1total,
t2.t2total
from t1
join t2 on t2.day = t1.day;
You can keep adding tables like that with CTEs.
My earlier example with multiple left join was bad because it blows out the rows due to a lack of join conditions between the left-joined tables.
There is an interesting corner case for e.g. 2019-02-01 to 2019-03-01 which returns an empty interval as the last week. I have updated to filter that out.
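The week boundaries and the corner case can be reproduced with a short Python sketch of the weeks CTE: each week is the half-open range [day, least(day + 7, dt_end)), and a range whose start equals its end is dropped. week_ranges is a hypothetical name used only for this illustration.

```python
from datetime import date, timedelta

def week_ranges(dt_start, dt_end):
    """Half-open weekly ranges from dt_start, capped at dt_end, skipping empty ones."""
    ranges = []
    day = dt_start
    while day <= dt_end:  # generate_series includes the end point when it is hit exactly
        day_end = min(day + timedelta(days=7), dt_end)
        if day != day_end:  # same filter as the WHERE clause in the weeks CTE
            ranges.append((day, day_end))
        day += timedelta(days=7)
    return ranges

july = week_ranges(date(2020, 7, 1), date(2020, 8, 1))
february = week_ranges(date(2019, 2, 1), date(2019, 3, 1))
```

For 2020-07-01 to 2020-08-01 this gives five ranges, the last one shortened to 2020-07-29 up to 2020-08-01. For 2019-02-01 to 2019-03-01 the series lands exactly on 2019-03-01, which would produce an empty range, so only four ranges remain.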

SQL - Unequal left join BigQuery

New here. I am trying to get the daily and weekly active users over time. Users have 30 days before they are considered inactive. My goal is to create graphs that can be split by user_id to show cohorts, regions, categories, etc.
I have created a date table to get every day for the time period and I have the simplified orders table with the base info that I need to calculate this.
I am trying to do a Left Join to get the status by date using the following SQL Query:
WITH daily_use AS (
SELECT
__key__.id AS user_id
, DATE_TRUNC(date(placeOrderDate), day) AS activity_date
FROM `analysis.Order`
where isBuyingGroupOrder = TRUE
AND testOrder = FALSE
GROUP BY 1, 2
),
dates AS (
SELECT DATE_ADD(DATE "2016-01-01", INTERVAL d.d DAY) AS date
FROM
(
SELECT ROW_NUMBER() OVER(ORDER BY __key__.id) -1 AS d
FROM `analysis.Order`
ORDER BY __key__.id
LIMIT 1096
) AS d
ORDER BY 1 DESC
)
SELECT
daily_use.user_id
, wd.date AS date
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
LEFT JOIN daily_use
ON wd.date >= daily_use.activity_date
AND wd.date < DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
I am getting this error in BigQuery: LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join. How can I get around this? I am using Standard SQL within BigQuery.
Thank you
The query below is for BigQuery Standard SQL and mostly reproduces the logic of your query, except that it does not include days on which no activity at all is found:
#standardSQL
SELECT
daily_use.user_id
, wd.date AS DATE
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
CROSS JOIN daily_use
WHERE wd.date BETWEEN
daily_use.activity_date AND DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
-- ORDER BY 1,2
If for whatever reason you still need to reproduce your logic exactly, you can wrap the above in a final left join, as below:
#standardSQL
SELECT *
FROM dates AS wd
LEFT JOIN (
SELECT
daily_use.user_id
, wd.date AS date
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
CROSS JOIN daily_use
WHERE wd.date BETWEEN
daily_use.activity_date AND DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
) AS daily_use
USING (date)
-- ORDER BY 1,2
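The cross join plus filter can be mimicked in a few lines of Python to see which (user, date) pairs survive. This is only a sketch on invented data, and days_since_last_action is a hypothetical function name. It follows the BETWEEN version above, which is inclusive at both ends, so a gap of exactly 30 days is kept; the original join condition used a strict < on the upper bound and would drop it.

```python
from datetime import date

def days_since_last_action(dates, activity):
    """Smallest gap in days per (user, date), for gaps from 0 to 30 inclusive."""
    result = {}
    for day in dates:                              # every calendar date...
        for user_id, activity_date in activity:    # ...paired with every activity row
            gap = (day - activity_date).days
            if 0 <= gap <= 30:                     # BETWEEN is inclusive at both ends
                key = (user_id, day)
                result[key] = min(gap, result.get(key, gap))
    return result

# Hypothetical activity rows: (user_id, activity_date)
activity = [
    ("u1", date(2016, 1, 1)),
    ("u1", date(2016, 1, 10)),
    ("u2", date(2016, 1, 5)),
]
dates = [date(2016, 1, 1), date(2016, 1, 12), date(2016, 1, 31), date(2016, 2, 10)]
gaps = days_since_last_action(dates, activity)
```

Pairs with no activity in the preceding 30 days, such as every user on 2016-02-10 here, simply do not appear in the result, which is why the final left join back to dates is needed if those days must be listed.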