I'm currently writing a query for my job and I need a little help with it.
My database is really simple:
client_id | status | start_date | end_date
1 | active | 2020-01-01 | 2020-03-15
1 | inactive | 2020-03-15 | null
2 | active | 2020-01-01 | null
3 | active | 2020-01-01 | 2021-04-28
3 | inactive | 2021-04-28 | 2021-07-28
3 | active | 2021-07-28 | null
For each new status of a given client, the database gets a new row and the previous row's end_date is updated. A client can be in many different statuses, but that is irrelevant for this case.
I need to get, for each month, the number of active clients, and I'm struggling a bit with it. I've tried using a partition, and I can get the active clients for a range of dates, but not the monthly distribution.
Can anyone give me some advice? :)
Thanks
You didn't tag your database, so this may not work for you as written. The approach is similar everywhere once function names are changed: for example, lateral becomes cross apply in SQL Server, and generate_series may have to be expressed differently in your database. This is basically PostgreSQL:
select client_id, extract(year from d) y, extract(month from d) m
from MyInfo t1,
     lateral (select generate_series(make_date(cast(extract(year from t1.start_date) as int),
                                               cast(extract(month from t1.start_date) as int), 1),
                                     case when end_date is null then current_date else end_date end,
                                     '1 month'::interval)
             ) Days(d)
where t1.status = 'active'
order by client_id, y, m;
Sample DBFiddle demo
EDIT: I noticed you were asking for monthly counts. This could be written more simply, but expanding the above:
with baseData as
(
    select client_id, extract(year from d) y, extract(month from d) m
    from MyInfo t1,
         lateral (select generate_series(make_date(cast(extract(year from t1.start_date) as int),
                                                   cast(extract(month from t1.start_date) as int), 1),
                                         case when end_date is null then current_date else end_date end,
                                         '1 month'::interval)
                 ) Days(d)
    where t1.status = 'active'
)
select y, m, count(*)
from baseData
group by y, m
order by y, m;
EDIT: Another way:
select m, count(*)
from (select generate_series(date_trunc('month', min(start_date)), current_date, '1 month'::interval) from MyInfo) months(m)
inner join MyInfo i on
m >= date_trunc('month', i.start_date) and
m <= date_trunc('month', case when end_date is null then current_date else end_date end)
where i.status = 'active'
group by m
order by m;
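To make the month expansion concrete outside of any SQL dialect, here is a minimal Python sketch of the same idea: expand each active row into the months it touches, then count per month. The rows are modelled on the question's sample table; the helper names (month_starts, active_per_month) and the fixed "today" date are assumptions for illustration only, not part of the original queries.

```python
from collections import Counter
from datetime import date

# Rows modelled on the question's sample table:
# (client_id, status, start_date, end_date). Inactive rows are ignored below.
rows = [
    (1, "active", date(2020, 1, 1), date(2020, 3, 15)),
    (1, "inactive", date(2020, 3, 15), None),
    (2, "active", date(2020, 1, 1), None),
    (3, "active", date(2020, 1, 1), date(2021, 4, 28)),
    (3, "inactive", date(2021, 4, 28), date(2021, 7, 28)),
    (3, "active", date(2021, 7, 28), None),
]


def month_starts(first, last):
    """Yield the first day of every month from first's month through last's month."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def active_per_month(rows, today):
    """Count 'active' rows per month; an open end_date is treated as today."""
    counts = Counter()
    for _client, status, start, end in rows:
        if status == "active":
            counts.update(month_starts(start, end or today))
    return dict(sorted(counts.items()))


# 'today' is a fixed, assumed date so that the result is reproducible.
monthly = active_per_month(rows, today=date(2021, 8, 15))
for month, n in monthly.items():
    print(month, n)
```

As in the queries above, a client whose active period ends mid-month still counts for that month.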
I am trying to calculate a monthly churn rate (for a given month: number_of_people_who_unsubscribed / number_of_subscribers_at_beginnin
That gives me a single percentage of users who unsubscribed during January. However, I'd like to output this percentage for every month, so I can display it as a line chart. I'm not sure where to start - it feels like I need to loop and run the query for each month but that feels wrong. How could I make the same calculation, but without specifying the month manually? We can assume that there is at least one start_date and one end_date per month, so some kind of group by might work.
WITH
date_range AS (
SELECT '2022-10-01' AS start_date, '2022-10-31'AS end_date
),
start_accounts AS
(
SELECT DISTINCT ProductContractId
FROM HD s INNER JOIN date_range d ON
s.FirstInvoiceDate<= d.start_date
AND (s.ItemRejectDate>d.start_date or s.ItemRejectDate is null)
),
end_accounts AS
(
SELECT DISTINCT ProductContractId
FROM HD s INNER JOIN date_range d ON
s.FirstInvoiceDate<= d.end_date
AND (s.ItemRejectDate>d.end_date or s.ItemRejectDate is null)
),
churned_accounts AS
(
SELECT s.ProductContractId
FROM start_accounts s
LEFT OUTER JOIN end_accounts e ON
s.ProductContractId=e.ProductContractId
WHERE e.ProductContractId is null
),
start_count AS (
SELECT COUNT(*) AS n_start FROM start_accounts
),
churn_count AS (
SELECT COUNT(*) AS n_churn FROM churned_accounts
)
SELECT
convert(numeric(10,4),(n_churn* 1.0/n_start ))*100
AS churn_rate,
(1.0-(n_churn/n_start)*100)
AS retention_rate,
n_start,
n_churn
FROM start_count, churn_count
Ultimately I'm looking for an output that looks something like:
I don't have SQL Server set up yet (working on it), but here is a solution that works on the Postgres schema from the book Fighting Churn With Data (https://www.fightchurnwithdata.com), which is the basis for the original query in the question. The key is to select multiple dates as a constant table in the first CTE (or use some external table defined for this purpose) and then carry it through all the other CTEs, so that multiple churn rates are calculated at once. The end date of each churn calculation period is derived with a date interval expression (that will definitely have to change for SQL Server, but I think the rest of it should work):
WITH
calc_dates AS (
SELECT generate_series(min(start_date)
, max(start_date)-interval '1month'
, interval '1 month') AS start_date
FROM subscription
),
start_accounts AS
(
SELECT DISTINCT d.start_date, account_id
FROM SUBSCRIPTION s INNER JOIN calc_dates d ON
s.START_DATE<= d.start_date
AND (s.END_DATE>d.start_date or s.END_DATE is null)
),
end_accounts AS
(
SELECT DISTINCT d.start_date, account_id
FROM SUBSCRIPTION s INNER JOIN calc_dates d ON
s.START_DATE<= (d.start_date+interval '1 month')
AND (s.END_DATE>(d.start_date+interval '1 month') or s.end_date is null)
),
churned_accounts AS
(
SELECT s.start_date, s.account_id
FROM start_accounts s
LEFT OUTER JOIN end_accounts e ON
s.account_id=e.account_id
and s.start_date = e.start_date
WHERE e.account_id is null
),
start_count AS (
SELECT start_date, COUNT(*)::FLOAT AS n_start
FROM start_accounts
group by start_date
),
churn_count AS (
SELECT start_date, COUNT(*)::FLOAT AS n_churn
FROM churned_accounts
group by start_date
)
SELECT s.start_date,(s.start_date+interval '1 month')::date as end_date,
(n_churn* 1.0/n_start )*100
AS churn_rate
FROM start_count s
inner join churn_count c
on s.start_date=c.start_date
order by s.start_date;
This gives the following output when run against the book schema:
start_date | end_date | churn_rate
2020-02-01 | 2020-03-01 | 6.42486011191047
2020-03-01 | 2020-04-01 | 5.63271604938272
2020-04-01 | 2020-05-01 | 4.83752524325317
I'm working on getting SQL Server running and will try to update the solution soon.
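Until a SQL Server version exists, the logic of the multi-period query can be checked independently of any dialect. The following Python sketch mirrors the CTEs (accounts active at the period start, accounts active one month later, churned = the difference). The subscription rows are hypothetical, and the helper names (add_month, active_on, churn_rates) are made up for this illustration.

```python
from datetime import date


def add_month(d):
    """Return the same day one month later (assumes day-of-month <= 28)."""
    if d.month == 12:
        return date(d.year + 1, 1, d.day)
    return date(d.year, d.month + 1, d.day)


def active_on(subs, day):
    """Accounts with a subscription covering `day` (end date exclusive, None = open)."""
    return {acct for acct, start, end in subs
            if start <= day and (end is None or end > day)}


def churn_rates(subs, period_starts):
    """Map each period start to the percentage of its starting accounts gone a month later."""
    rates = {}
    for start in period_starts:
        at_start = active_on(subs, start)
        at_end = active_on(subs, add_month(start))
        if at_start:
            rates[start] = 100.0 * len(at_start - at_end) / len(at_start)
    return rates


# Hypothetical subscriptions: (account_id, start_date, end_date).
subs = [
    ("a", date(2020, 1, 10), None),
    ("b", date(2020, 1, 15), date(2020, 2, 20)),
    ("c", date(2020, 1, 20), date(2020, 3, 5)),
    ("d", date(2020, 2, 25), None),
]
rates = churn_rates(subs, [date(2020, 2, 1), date(2020, 3, 1)])
for start, rate in rates.items():
    print(start, add_month(start), round(rate, 2))
```

One difference from the SQL above: the inner join between start_count and churn_count drops periods with no churned accounts, whereas this sketch reports them with a rate of 0.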
I have a type 2 SCD (slowly changing dimension) table that I join with various other tables, and I want to aggregate a sum total over every entity that was active during an individual month (by active I mean the ones that don't yet have an end_date).
Currently, I have a query similar to this (say, aggregating data for April 2022 and May 2022):
select
count(1) as enitities_agg,
DATE_TRUNC('Month', h.start) as date,
sum(h.price) filter (where c.name='HIGH') as sum_total,
----....----
from
project as p
join class as c on p.class_id = c.id
join stage as s on s.project_id = p.id
join stage_info as si on si.stage_id = s.id
join history as h on h.stage_info_id = si.id
where
h.start <= '2022-06-01' and
h.end_date >= '2022-04-01' and
COALESCE(p.end_date, '2099-01-01') >= '2022-04-01' and
COALESCE(p.start_date, '2099-01-01') <= '2022-06-01' and
COALESCE(stage.end, '2099-01-01') >= '2022-04-01' and
h.price is not null and
h.price != 0
group by DATE_TRUNC('Month', h.start)
It correctly aggregates only the entities whose history starts in April or May, not the ones that overlap those months and are still active.
The problem is that some history entities start in March, April, etc., and still haven't ended by May. Because I group by DATE_TRUNC('Month', h.start), i.e. by the history start date, I don't get the entities that start before April or May and stay active after May; each entity is aggregated only in the month in which it started.
I tried generating a series of months and grouping by the generated month, but I couldn't find a way to group the entities correctly. Here is one of the experiments I tried:
from
generate_series('2022-03-01', '2022-07-01', INTERVAL '1 month') as mt
join project as p on COALESCE(p.end_date, '2099-01-01') >= mt and
COALESCE(p.start_date, '2099-01-01') <= mt + INTERVAL '1 month'
join class as c on p.class_id = c.id
join stage as stage on stage.project_id = p.id and
COALESCE(stage.end, '2099-01-01') >= mt
join stage_info as si on si.stage_id = stage.id
join history as h on h.stage_info_id = si.id
where
h.start <= mt and
h.end_date >= mt + INTERVAL '1 month' and
h.price is not null and
h.price != 0
group by mt
How would it be possible to iterate through the history table aggregating any active entities in a month and group them by the same month and get something like this?
"enitities_agg" | "date" | "sum_total"
832 | "2022-04-01 00:00:00" | 15432234
1020 | "2022-05-01 00:00:00" | 19979458
It seems your logic is: if any day of the begin_ to _end interval falls within a month, count the row in that month. This was the hardest part to guess from the desired results.
So I guess you need this:
with dim as (
select
m::date as month_start
,(date_trunc('month', m) + interval '1 month - 1 day')::date as month_end
,to_char(date_trunc('month', m), 'Mon') as month
from generate_series('2022-01-01', '2022-08-01', INTERVAL '1 month') as m
)
SELECT
dim.month
, sum(coalesce(t.price, 0)) as sum_price
FROM dim
left join test as t
on t.begin_ <= dim.month_end
and t._end >= dim.month_start
group by dim.month_start, dim.month
order by dim.month_start, dim.month
;
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=614030d4db5e03876f693a9a8a5ff122
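As a cross-check of the join condition (begin_ <= month_end and _end >= month_start), here is a small Python sketch that applies the same inclusive overlap test per month. The rows and prices are hypothetical, and month_bounds and sum_price_by_month are names invented for this illustration.

```python
from datetime import date, timedelta


def month_bounds(year, month):
    """Return (first_day, last_day) of the given month."""
    first = date(year, month, 1)
    next_first = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return first, next_first - timedelta(days=1)


def sum_price_by_month(rows, months):
    """Sum price over rows whose inclusive [begin, end] interval touches each month."""
    totals = {}
    for year, month in months:
        first, last = month_bounds(year, month)
        totals[(year, month)] = sum(
            price for begin, end, price in rows
            if begin <= last and end >= first
        )
    return totals


# Hypothetical rows: (begin_, _end, price).
rows = [
    (date(2022, 3, 10), date(2022, 5, 2), 100),
    (date(2022, 4, 20), date(2022, 4, 25), 40),
    (date(2022, 5, 31), date(2022, 7, 1), 7),
]
totals = sum_price_by_month(rows, [(2022, 3), (2022, 4), (2022, 5), (2022, 6)])
for month, total in totals.items():
    print(month, total)
```

A row spanning several months is counted once in every month it touches, which is what distinguishes this from grouping by the start date.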
Do you want all history entities that were active at some point during May 2022? If so, the following may help:
daterange(h.start,
h.end_date,
'[]') && daterange('2022-05-01', '2022-06-01', '[)');
demo:
CREATE temp TABLE test (
begin_ date,
_end date
);
INSERT INTO test
VALUES ('2022-01-01', '2022-05-01');
INSERT INTO test
VALUES ('2022-01-01', '2022-05-11');
INSERT INTO test
VALUES ('2022-05-01', '2022-07-11');
INSERT INTO test
VALUES ('2022-06-11', '2022-07-11');
SELECT
*,
daterange(begin_, _end, '[]')
FROM
test t
WHERE
daterange(t.begin_, t._end, '[]') && daterange('2022-05-01', '2022-05-31', '[]');
&& range operator reference: https://www.postgresql.org/docs/current/functions-range.html#RANGE-OPERATORS-TABLE
I have a table in Snowflake covering a time range from, for example, 2019-01 to 2020-01. An ID can appear multiple times and can be matched with any of the dates.
For example:
my_table has two columns, dddate and id:
dddate | id
2019-02-03 | 607
2019-01-07 | 356
2019-08-06 | 491
2019-01-01 | 607
2019-12-17 | 529
2019-04-15 | 356
......
Is there a way to find the number of IDs that appeared at least once in the current month and also appeared at least once in the previous three months? I want to group by month and show each month's count, starting from 2019-04 (the first month that has three previous months of data available in the table) until 2020-01.
I am thinking of some code like this:
WITH PREV_THREE AS (
SELECT
DATE_TRUNC('MONTH', dddate) AS MONTH,
ID AS CURR_ID
FROM my_table mt
INNER JOIN
(
(
SELECT
MONTH(DATEADD(DATE_TRUNC('MONTH', dddate), -1, GETDATE())) AS PREV_MONTH,
ID AS PREV_3_MON_ID
FROM my_table
)
UNION ALL
(
SELECT
MONTH(DATEADD(DATE_TRUNC('MONTH', dddate), -2, GETDATE())) AS PREV_MONTH,
ID AS PREV_3_MON_ID
FROM my_table
)
UNION ALL
(
SELECT
MONTH(DATEADD(DATE_TRUNC('MONTH', dddate), -3, GETDATE())) AS PREV_MONTH,
ID AS PREV_3_MON_ID
FROM my_table
)
) AS PREV_3_MON
ON mt.CURR_ID = PREV_3_MON.PREV_3_MON_ID
)
SELECT MONTH, COUNT(DISTINCT ID) AS COUNTER
FROM PREV_THREE
GROUP BY 1
ORDER BY 1
However, it returns an error and doesn't seem to work. Could anyone please help me with this? Thank you in advance!
You can use lag():
select distinct id
from (select t.*,
lag(dddate) over (partition by id order by dddate) as prev_dddate
from my_table t
) t
where dddate >= date_trunc('MONTH', current_date) and
prev_dddate < date_trunc('MONTH', current_date) and
prev_dddate >= date_trunc('MONTH', current_date) - interval '3 month';
You can do this for multiple months as:
select date_trunc('MONTH', dddate), count(distinct id)
from (select t.*,
lag(dddate) over (partition by id order by dddate) as prev_dddate
from my_table t
) t
where prev_dddate < date_trunc('MONTH', dddate) and
prev_dddate >= date_trunc('MONTH', dddate) - interval '3 month'
group by date_trunc('MONTH', dddate);
Even if an id appears multiple times in one month, one of those rows will be the first in that month, and for that row lag() returns the id's most recent date from an earlier month.
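The lag() condition can be traced by hand with a few lines of Python. This is only a sketch of the same rule (the previous appearance of an id must fall in one of the three months before the current row's month), run on the rows listed in the question; month_index and returning_ids_by_month are names made up for this example.

```python
from collections import defaultdict
from datetime import date


def month_index(d):
    """Number months consecutively so that month arithmetic is plain subtraction."""
    return d.year * 12 + (d.month - 1)


def returning_ids_by_month(rows):
    """Per month, count ids whose previous appearance fell in the prior three months."""
    dates_by_id = defaultdict(list)
    for d, ident in rows:
        dates_by_id[ident].append(d)

    result = defaultdict(set)
    for ident, dates in dates_by_id.items():
        dates.sort()
        # Pair each date with its predecessor, which is what lag() provides in SQL.
        for prev, cur in zip(dates, dates[1:]):
            gap = month_index(cur) - month_index(prev)
            if 1 <= gap <= 3:
                result[(cur.year, cur.month)].add(ident)
    return {month: len(ids) for month, ids in sorted(result.items())}


# Rows from the question's sample: (dddate, id).
rows = [
    (date(2019, 2, 3), 607),
    (date(2019, 1, 7), 356),
    (date(2019, 8, 6), 491),
    (date(2019, 1, 1), 607),
    (date(2019, 12, 17), 529),
    (date(2019, 4, 15), 356),
]
counts = returning_ids_by_month(rows)
print(counts)
```

On this sample, id 607 counts for 2019-02 (seen in January) and id 356 counts for 2019-04 (seen three months earlier); restricting the output to months from 2019-04 onwards would be an extra filter on the result.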
I am currently trying to compare aggregated numbers from today and exactly 7 days ago (not between today and 7 days ago, but instead simply comparing these two discrete dates).
I already have a way of doing it using a lot of subqueries, but the performance is bad, and I am now trying to optimize.
This is what I have come up with so far (sample query, not with real table names and columns due to confidentiality):
Select current_date, previous_date, current_sum, previous_sum, percentage
From (Select date as current_date, sum(numbers) as current_sum,
lag (sum(numbers)) over (partition by date order by date) as previous_sum,
(Select max(date)-7 From t1 ) as previous_date,
(current_sum - previous_sum)*100/current_sum as percentage
From t1 where date>=sysdate-7 group by date,previous_date)
But I am definitely doing something wrong since in the output the previous_sum appears null, and naturally the percentage too.
Any ideas on what I am doing wrong? I haven't used LAG before so it must be something there.
Thanks!
Using a join of pre-aggregated subqueries:
with agg as (
select sum(numbers) as sum_numbers, date from t1 group by date
)
select curr.sum_numbers as current_sum,
prev.sum_numbers as prev_sum,
curr.date as curr_date,
prev.date as prev_date
from agg curr
left join agg prev on curr.date-7=prev.date
Using lag:
with agg as (
select sum(numbers) as sum_numbers, date from t1 group by date
)
select sum_numbers as current_sum,
lag(sum_numbers, 7) over(order by date) as prev_sum,
a.date as curr_date,
lag(a.date,7) over(order by date) as prev_date
from agg a
If you want exactly 2 dates only (today and today-7) then it can be done much simpler using conditional aggregation and filter:
select sum(case when date = trunc(sysdate) then numbers else null end) as current_sum,
sum(case when date = trunc(sysdate-7) then numbers else null end) as previous_sum,
trunc(sysdate) as curr_date,
trunc(sysdate-7) as prev_date,
(current_sum - previous_sum)*100/current_sum as percentage
from t1 where date = trunc(sysdate) or date = trunc(sysdate-7)
You can do this with window (analytic) functions, which should be the fastest method. Your actual aggregation query is a bit unclear, but I think it is:
select date as current_date, sum(numbers) as current_sum
from t1
group by date;
If you have values for all dates, then use:
select date as current_date, sum(numbers) as current_sum,
lag(sum(numbers), 7) over (order by date) as prev_7_sum
from t1
group by date;
If you don't have data for all days, then use a window frame:
select date as current_date, sum(numbers) as current_sum,
max(sum(numbers)) over (order by date range between interval '7' day preceding and interval '7' day preceding) as prev_7_sum
from t1
group by date;
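The difference between the two variants matters as soon as a date is missing: lag(..., 7) counts rows, while the range frame looks at the calendar. A short Python sketch of the calendar-based lookup, using hypothetical (date, numbers) rows and invented helper names (daily_sums, week_over_week):

```python
from datetime import date, timedelta


def daily_sums(rows):
    """Aggregate numbers per date, like 'select date, sum(numbers) ... group by date'."""
    totals = {}
    for d, n in rows:
        totals[d] = totals.get(d, 0) + n
    return totals


def week_over_week(rows):
    """Pair each date's total with the total from exactly seven days earlier (None if absent)."""
    totals = daily_sums(rows)
    return {d: (total, totals.get(d - timedelta(days=7)))
            for d, total in sorted(totals.items())}


# Hypothetical rows with gaps between the dates.
rows = [
    (date(2022, 1, 1), 5),
    (date(2022, 1, 1), 5),
    (date(2022, 1, 3), 4),
    (date(2022, 1, 8), 20),
    (date(2022, 1, 10), 2),
]
comparison = week_over_week(rows)
for d, (current, previous) in comparison.items():
    print(d, current, previous)
```

With only four distinct dates here, a row-based lag of 7 would find nothing at all, while the date-based lookup still pairs 2022-01-08 with 2022-01-01.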
I have a table with the following columns :
sID, start_date and end_date
Some of the values are as follows:
1 1995-07-28 2003-07-20
1 2003-07-21 2010-05-04
1 2010-05-03 2010-05-03
2 1960-01-01 2011-03-01
2 2011-03-02 2012-03-13
2 2012-03-12 2012-10-21
2 2012-10-22 2012-11-08
3 2003-07-23 2010-05-02
I only want the 2nd and 3rd rows in my result as they are the overlapping date ranges.
I tried this but it would not get rid of the first row. Not sure where I am going wrong?
select a.sID from table a
inner join table b
on a.sID = b.sID
and ((b.start_date between a.start_date and a.end_date)
and (b.end_date between a.start_date and b.end_date ))
order by end_date desc
I am trying to do this in SQL Server.
One way of doing this reasonably efficiently is
WITH T1
AS (SELECT *,
MAX(end_date) OVER (PARTITION BY sID ORDER BY start_date) AS max_end_date_so_far
FROM YourTable),
T2
AS (SELECT *,
range_start = IIF(start_date <= LAG(max_end_date_so_far) OVER (PARTITION BY sID ORDER BY start_date), 0, 1),
next_range_start = IIF(LEAD(start_date) OVER (PARTITION BY sID ORDER BY start_date) <= max_end_date_so_far, 0, 1)
FROM T1)
SELECT SId,
start_date,
end_date
FROM T2
WHERE 0 IN ( range_start, next_range_start );
If you have an index on (sID, start_date) INCLUDE (end_date), this can perform the work with a single ordered scan.
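To see why one ordered pass is enough, here is a Python sketch of the same running-maximum idea, run on the question's sample rows. It is an illustration of the technique rather than a translation of the exact T-SQL; overlapping_rows is a made-up name.

```python
from datetime import date
from itertools import groupby


def overlapping_rows(rows):
    """Return rows that overlap another row of the same sid, in one ordered pass per sid."""
    ordered = sorted(rows)  # tuples sort by (sid, start_date, end_date)
    result = []
    for _sid, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        max_end = None  # running maximum end_date over the rows seen so far
        for i, (sid, start, end) in enumerate(group):
            # Overlaps an earlier row if it starts no later than the latest end so far.
            overlaps_prev = max_end is not None and start <= max_end
            max_end = end if max_end is None else max(max_end, end)
            # Overlaps a later row if the next row starts before the current max end.
            overlaps_next = i + 1 < len(group) and group[i + 1][1] <= max_end
            if overlaps_prev or overlaps_next:
                result.append((sid, start, end))
    return result


# Sample rows from the question: (sID, start_date, end_date).
rows = [
    (1, date(1995, 7, 28), date(2003, 7, 20)),
    (1, date(2003, 7, 21), date(2010, 5, 4)),
    (1, date(2010, 5, 3), date(2010, 5, 3)),
    (2, date(1960, 1, 1), date(2011, 3, 1)),
    (2, date(2011, 3, 2), date(2012, 3, 13)),
    (2, date(2012, 3, 12), date(2012, 10, 21)),
    (2, date(2012, 10, 22), date(2012, 11, 8)),
    (3, date(2003, 7, 23), date(2010, 5, 2)),
]
overlapping = overlapping_rows(rows)
for row in overlapping:
    print(row)
```

For sID 1 this keeps exactly the second and third rows, as the question asks; the analogous pair for sID 2 is kept as well.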
Your logic is not totally correct, although it almost works on your sample data. The specific reason it fails is because between includes the end points, so any given row matches itself. That said, the logic still isn't correct because it doesn't catch this situation:
a-------------a
    b----b
Here is correct logic:
select a.*
from table a
where exists (select 1
from table b
where a.sid = b.sid and
a.start_date < b.end_date and
a.end_date > b.start_date and
(a.start_date <> b.start_date or -- filter out the record itself
a.end_date <> b.end_date
)
)
order by a.end_date;
The rule for overlapping time periods (or ranges of any sort) is that period 1 overlaps with period 2 when period 1 starts before period 2 ends and period 1 ends after period 2 starts. Happily, there is no need or use for between for this purpose. (I strongly discourage using between with date/time operands.)
I should note that this version does not consider two time periods to overlap when one ends on the same day another begins. That is easily adjusted by changing the < and > to <= and >=.
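The rule and its inclusive variant fit in a few lines of Python, which makes the edge cases easy to try out. This is a sketch with hypothetical periods; the function name overlaps and its inclusive flag are assumptions for illustration.

```python
from datetime import date


def overlaps(a, b, inclusive=False):
    """Period a overlaps period b when a starts before b ends and a ends after b starts.

    With inclusive=True, periods that merely touch on a shared day also count.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if inclusive:
        return a_start <= b_end and a_end >= b_start
    return a_start < b_end and a_end > b_start


# Hypothetical periods: (start_date, end_date).
outer = (date(2020, 1, 1), date(2020, 12, 31))
inner = (date(2020, 3, 1), date(2020, 4, 1))        # fully contained in outer
touching = (date(2020, 12, 31), date(2021, 6, 30))  # starts on outer's last day

print(overlaps(outer, inner))                      # containment is caught
print(overlaps(outer, touching))                   # strict comparison: no overlap
print(overlaps(outer, touching, inclusive=True))   # inclusive comparison: overlap
```

The contained case is the one drawn in the diagram above, and the rule is symmetric, so the order of the two periods does not matter.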
Here is a SQL Fiddle.