Using window functions to count by groups of dates in PostgreSQL - sql

I have a table whose columns include id and created_at, and I want to use window functions around the created_at of each entry to count how many entries fall within 48 hours of it. As an example, for the original table:
id | created_at
----|------------
01 | 2016/01/04
02 | 2016/01/05
03 | 2016/01/05
04 | 2016/01/06
05 | 2016/01/07
06 | 2016/01/08
07 | 2016/01/08
08 | 2016/01/09
and the result should be
id | created_at | count
----|------------|-------
01 | 2016/01/04 | 4
02 | 2016/01/05 | 5
03 | 2016/01/05 | 5
04 | 2016/01/06 | 7
05 | 2016/01/07 | 7
06 | 2016/01/08 | 5
07 | 2016/01/08 | 5
08 | 2016/01/09 | 4
The explanation is that since there are 2 transactions on 2016/01/05, 1 on 2016/01/06, 1 on 2016/01/07, 2 on 2016/01/08, and 1 on 2016/01/09, there are a total of 7 transactions within 2 days of transaction 05.

It is better to use a date table that has consecutive dates, in case the dates in your table have gaps.
I am wondering what the role of the id column is. Here is how I would do it without considering the id column.
select row_number() over (order by dt) as id
      ,dt as created_at
      ,cnt1 + cnt2 + cnt3 + cnt4 + cnt5 as cnt
from
(
    select
        date_table.dt
       ,coalesce(lag(cnt, 2)  over (order by date_table.dt), 0) as cnt1
       ,coalesce(lag(cnt, 1)  over (order by date_table.dt), 0) as cnt2
       ,coalesce(cnt, 0)                                        as cnt3
       ,coalesce(lead(cnt, 1) over (order by date_table.dt), 0) as cnt4
       ,coalesce(lead(cnt, 2) over (order by date_table.dt), 0) as cnt5
    from
        date_table left join
        (select created_at, count(*) as cnt from your_table group by created_at) c
        on date_table.dt = c.created_at
) T
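If a permanent calendar table is not available, date_table can be generated on the fly with generate_series; a minimal sketch, assuming your_table is the source table from the question (the query above would then select from this CTE instead of a physical table):
with date_table as (
    select generate_series(min(created_at)::date,
                           max(created_at)::date,
                           interval '1 day')::date as dt
    from your_table
)
select dt from date_table order by dt;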

Using window functions for this purpose is challenging because of the duplicate days. You can get the results using a join or correlated subquery:
select t.*,
       (select count(*)
        from t t2
        where t2.created_at between t.created_at - interval '2 days' and
                                    t.created_at + interval '2 days'
       ) as cnt
from t;
EDIT:
You could use window functions by doing a cumulative sum by date and then joining back. This is, of course, a bit challenging because of holes in the dates. But, something like this:
with c as (
      select d.dte, count(t.created_at) as cnt,
             sum(count(t.created_at)) over (order by d.dte) as cumecnt
      from (select generate_series(min(created_at) - interval '2 day',
                                   max(created_at) + interval '2 day',
                                   interval '1 day')
            from t
           ) d(dte) left join
           t
           on d.dte = t.created_at
      group by d.dte
     )
select t.*, cmax.cumecnt - cmin.cumecnt + cmin.cnt as cnt  -- add cmin.cnt so the lower boundary day is included
from t join
     c cmin
     on t.created_at = cmin.dte + interval '2 day' join
     c cmax
     on t.created_at = cmax.dte - interval '2 day';
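As a side note, on PostgreSQL 11 or later a window frame in RANGE mode with interval offsets handles the duplicate days directly, so neither a calendar table nor a self-join is needed; a minimal sketch, assuming the table is named t as above and created_at is a date:
select id, created_at,
       count(*) over (order by created_at
                      range between interval '2 days' preceding
                                and interval '2 days' following) as cnt
from t
order by id;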

Related

Converting PostgreSQL recursive CTE to SQL Server

I'm having trouble adapting some recursive CTE code from PostgreSQL to SQL Server, from the book "Fighting Churn with Data"
This is the working PostgreSQL code:
with recursive
active_period_params as (
select interval '30 days' as allowed_gap,
'2021-09-30'::date as calc_date
),
active as (
-- anchor
select distinct account_id, min(start_date) as start_date
from subscription inner join active_period_params
on start_date <= calc_date
and (end_date > calc_date or end_date is null)
group by account_id
UNION
-- recursive
select s.account_id, s.start_date
from subscription s
cross join active_period_params
inner join active e on s.account_id=e.account_id
and s.start_date < e.start_date
and s.end_date >= (e.start_date-allowed_gap)::date
)
select account_id, min(start_date) as start_date
from active
group by account_id
This is my attempt at converting to SQL Server. It gets stuck in a loop. I believe the issue has to do with the UNION ALL required by SQL Server.
with
active_period_params as (
select 30 as allowed_gap,
cast('2021-09-30' as date) as calc_date
),
active as (
-- anchor
select distinct account_id, min(start_date) as start_date
from subscription inner join active_period_params
on start_date <= calc_date
and (end_date > calc_date or end_date is null)
group by account_id
UNION ALL
-- recursive
select s.account_id, s.start_date
from subscription s
cross join active_period_params
inner join active e on s.account_id=e.account_id
and s.start_date < e.start_date
and s.end_date >= dateadd(day, -allowed_gap, e.start_date)
)
select account_id, min(start_date) as start_date
from active
group by account_id
The subscription table is a list of subscriptions belonging to customers. A customer can have multiple subscriptions with overlapping dates or gaps between dates. null end_date means the subscription is currently active and has no defined end_date. Example data for a single customer (account_id = 15) below:
subscription
---------------------------------------------------
| id | account_id | start_date | end_date |
---------------------------------------------------
| 6 | 15 | 01/06/2021 | null |
| 5 | 15 | 01/01/2021 | null |
| 4 | 15 | 01/06/2020 | 01/02/2021 |
| 3 | 15 | 01/04/2020 | 15/05/2020 |
| 2 | 15 | 01/03/2020 | 15/05/2020 |
| 1 | 15 | 01/06/2019 | 01/01/2020 |
Expected query result (as produced by PostgreSQL code):
------------------------------
| account_id | start_date |
------------------------------
| 15 | 01/03/2020 |
Issue:
The SQL Server code above gets stuck in a loop and doesn't produce a result.
Description of the PostgreSQL code:
anchor block finds subs that are active as at the calc_date (30/09/2021) (id 5 & 6), and returns the min start_date (01/01/2021)
the recursion block then looks for any earlier subs that existed within the allowed_gap, which is 30 days prior to the min_start date found in 1). id 4 meets this criteria, so the new min start_date is 01/06/2020
recursion repeats and finds two subs within the allowed_gap (01/06/2020 - 30 days). Of these subs (id 2 & 3), the new min start_date is 01/03/2020
recursion fails to find an earlier sub within the allowed_gap (01/03/2020 - 30 days)
query returns a start date of 01/03/2020 for account_id 15
Any help appreciated!
It seems the issue is related to the way SQL Server deals with recursive CTEs.
This is a type of gaps-and-islands problem, and does not actually require recursion.
There are a number of solutions, here is one. Given your requirement, there may be more efficient methods, but this should get you started.
Using LAG we identify rows which are within the specified gap of the next row
We use a running COUNT to give each consecutive set of rows an ID
We group by that ID, and take the minimum start_date, filtering out non-qualifying groups
Group again to get the minimum per account
DECLARE @allowed_gap int = 30,
        @calc_date date = cast('2021-09-30' as date);
WITH PrevValues AS (
SELECT *,
IsStart = CASE WHEN ISNULL(LAG(end_date) OVER (PARTITION BY account_id
ORDER BY start_date), '2099-01-01') < DATEADD(day, -@allowed_gap, start_date)
THEN 1 END
FROM subscription
),
Groups AS (
SELECT *,
GroupId = COUNT(IsStart) OVER (PARTITION BY account_id
ORDER BY start_date ROWS UNBOUNDED PRECEDING)
FROM PrevValues
),
ByGroup AS (
SELECT
account_id,
GroupId,
start_date = MIN(start_date)
FROM Groups
GROUP BY account_id, GroupId
HAVING COUNT(CASE WHEN start_date <= @calc_date
and (end_date > @calc_date or end_date is null) THEN 1 END) > 0
)
SELECT
account_id,
start_date = MIN(start_date)
FROM ByGroup
GROUP BY account_id;
db<>fiddle
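For testing, the sample data from the question can be loaded like this (a sketch; column types are assumed, and the dates are converted from the DD/MM/YYYY notation above):
CREATE TABLE subscription (id int, account_id int, start_date date, end_date date);

INSERT INTO subscription (id, account_id, start_date, end_date) VALUES
(6, 15, '2021-06-01', NULL),
(5, 15, '2021-01-01', NULL),
(4, 15, '2020-06-01', '2021-02-01'),
(3, 15, '2020-04-01', '2020-05-15'),
(2, 15, '2020-03-01', '2020-05-15'),
(1, 15, '2019-06-01', '2020-01-01');
The query above should then return a single row: account_id 15 with start_date 2020-03-01.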

Obtain Name Column Based on Value

I have a table that calculates the number of associated records that fit a given criterion for each parent record. See the example below:
note - morning, afternoon and evening are only weekdays
| id | morning | afternoon | evening | weekend |
| -- | ------- | --------- | ------- | ------- |
| 1 | 0 | 2 | 3 | 1 |
| 2 | 2 | 9 | 4 | 6 |
What I am trying to achieve is to determine which column has the lowest value and return its column name, like so:
| id | time_of_day |
| -- | ----------- |
| 1 | morning |
| 2 | afternoon |
Here is my current SQL code to result in the first table:
SELECT
leads.id,
COALESCE(morning, 0) morning,
COALESCE(afternoon, 0) afternoon,
COALESCE(evening, 0) evening,
COALESCE(weekend, 0) weekend
FROM leads
LEFT OUTER JOIN (
SELECT DISTINCT ON (lead_id) lead_id, COUNT(*) AS morning
FROM lead_activities
WHERE lead_activities.modality = 'Call' AND lead_activities.bound_type = 'outbound' AND extract('dow' from created_at) IN (0,1,2,3,4,5) AND (extract('hour' from created_at) >= 0 AND extract('hour' from created_at) < 12)
GROUP BY lead_id
) morning ON morning.lead_id = leads.id
LEFT OUTER JOIN (
SELECT DISTINCT ON (lead_id) lead_id, COUNT(*) AS afternoon
FROM lead_activities
WHERE lead_activities.modality = 'Call' AND lead_activities.bound_type = 'outbound' AND extract('dow' from created_at) IN (0,1,2,3,4,5) AND (extract('hour' from created_at) >= 12 AND extract('hour' from created_at) < 17)
GROUP BY lead_id
) afternoon ON afternoon.lead_id = leads.id
LEFT OUTER JOIN (
SELECT DISTINCT ON (lead_id) lead_id, COUNT(*) AS evening
FROM lead_activities
WHERE lead_activities.modality = 'Call' AND lead_activities.bound_type = 'outbound' AND extract('dow' from created_at) IN (0,1,2,3,4,5) AND (extract('hour' from created_at) >= 17 AND extract('hour' from created_at) < 25)
GROUP BY lead_id
) evening ON evening.lead_id = leads.id
LEFT OUTER JOIN (
SELECT DISTINCT ON (lead_id) lead_id, COUNT(*) AS weekend
FROM lead_activities
WHERE lead_activities.modality = 'Call' AND lead_activities.bound_type = 'outbound' AND extract('dow' from created_at) IN (6,7)
GROUP BY lead_id
) weekend ON weekend.lead_id = leads.id
You can use CASE/WHEN/ELSE to check for the specific conditions and produce different values. For example:
with
q as (
-- your query here
)
select
id,
case
when morning <= least(afternoon, evening, weekend) then 'morning'
when afternoon <= least(morning, evening, weekend) then 'afternoon'
when evening <= least(morning, afternoon, weekend) then 'evening'
else 'weekend'
end as time_of_day
from q
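If the list of columns grows, an alternative is to unpivot the four values with a VALUES list in a LATERAL join and pick the smallest one; a sketch along the same lines, assuming the same q CTE wrapping your query:
with
q as (
-- your query here
)
select q.id, v.time_of_day
from q
cross join lateral (
    select label as time_of_day
    from (values ('morning',   q.morning),
                 ('afternoon', q.afternoon),
                 ('evening',   q.evening),
                 ('weekend',   q.weekend)) as t(label, val)
    order by val
    limit 1
) v;
Ties are broken arbitrarily here; add a second ORDER BY key if you need a deterministic choice.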

Count days per month from days off table

I have a table which stores a person, the start of a holiday, and the stop of a holiday.
I need to count from it how many working days per month the person was on holiday, so I want to partition this table by month.
To get holidays I'm using: https://github.com/christopherthompson81/pgsql_holidays
Let's assume I have a table for one person only, with start/stop only.
create table data (id int, start date, stop date);
This is the network_days function I wrote:
CREATE OR REPLACE FUNCTION network_days(start_date date, stop_date date) RETURNS bigint AS $$
SELECT count(*) FROM
generate_series(start_date, stop_date - interval '1 minute', interval '1 day') the_day
WHERE
extract('ISODOW' FROM the_day) < 6 AND the_day NOT IN (
SELECT datestamp::timestamptz FROM holidays_poland (extract(year FROM start_date)::int, extract(year FROM stop_date)::int))
$$
LANGUAGE sql
STABLE;
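For reference, the function is called with a start and a stop date, e.g. for the 30.03.2020 - 14.04.2020 range from the sample data further down (this assumes holidays_poland from the linked project is installed):
SELECT network_days(date '2020-03-30', date '2020-04-14');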
and I created a function with a query like:
--$2 = 2020
SELECT
month, year, sum(value_per_day)
FROM (
SELECT to_char(dt , 'mm') AS month, to_char(dt, 'yyyy') AS year, (network_days ((
CASE WHEN EXTRACT(year FROM df.start_date) < 2020 THEN (SELECT date_trunc('year' , df.start_date) + interval '1 year')::date
ELSE df.start_date END) , ( CASE WHEN EXTRACT(year FROM df.stop_date) > $2 THEN (date_trunc('year' , df.stop_date))::date
ELSE
df.stop_date END))::int ::numeric / count(*) OVER (PARTITION BY id))::int AS value_per_day
FROM intranet.dayoff df
LEFT JOIN generate_series((
CASE WHEN EXTRACT(year FROM df.start_date) < $2 THEN (SELECT date_trunc('year' , df.start_date) + interval '1 year')::date ELSE df.start_date
END) , (CASE WHEN EXTRACT(year FROM df.stop_date) > $2 THEN (date_trunc('year' , df.stop_date))::date
ELSE df.stop_date END) - interval '1 day' , interval '1 day') AS t (dt) ON extract('ISODOW' FROM dt) < 6
WHERE
extract(isodow FROM dt) < 6 AND (EXTRACT(year FROM start_date) = $2 OR EXTRACT(year FROM stop_date) = $2)) t
GROUP BY month, year
ORDER BY month;
based on: https://dba.stackexchange.com/questions/237745/postgresql-split-date-range-by-business-days-then-aggregate-by-month?rq=1
and I almost have it:
10 rows returned
| month | year | sum |
| ----- | ---- | ---- |
| 03 | 2020 | 2 |
| 04 | 2020 | 13 |
| 06 | 2020 | 1 |
| 11 | 2020 | 1 |
| 12 | 2020 | 2 |
| 05 | 2020 | 1 |
| 10 | 2020 | 2 |
| 08 | 2020 | 10 |
| 01 | 2020 | 1 |
| 02 | 2020 | 1 |
so in the function I created I'd need to add something like this:
dt NOT IN (SELECT datestamp::timestamptz FROM holidays_poland ($2, $2))
but I end up with many conditions, and I feel like this is the wrong approach.
I feel like I should just somehow split the table from:
id start stop
1 31.12.2019 00:00:00 01.01.2020 00:00:00
2 30.03.2020 00:00:00 14.04.2020 00:00:00
3 01.05.2020 00:00:00 03.05.2020 00:00:00
to
start stop
30.03.2020 00:00:00 01.01.2020 00:00:00
01.01.2020 00:00:00 14.04.2020 00:00:00
01.05.2020 00:00:00 03.05.2020 00:00:00
and just run the network_days function for each of these date ranges, but I couldn't successfully partition my query of the table to get such a result.
What do you think is the best way to achieve what I want to calculate?
demo:db<>fiddle
SELECT
gs::date
FROM person_holidays p,
generate_series(p.start, p.stop, interval '1 day') gs -- 1
WHERE gs::date NOT IN (SELECT holiday FROM holidays) -- 2
AND EXTRACT(isodow from gs::date) < 6 -- 3
Generate date series from person's start and stop date
Exclude all dates from the holidays table
If necessary: Exclude all weekend days (Saturday and Sunday)
Afterwards you are able to GROUP BY months and count the records:
SELECT
date_trunc('month', gs),
COUNT(*)
FROM person_holidays p,
generate_series(p.start, p.stop, interval '1 day') gs
WHERE gs::date NOT IN (SELECT holiday FROM holidays)
and extract(isodow from gs::date) < 6
GROUP BY 1
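If you need the count per person as well, the same query can simply group by the person's id too; a small sketch, assuming person_holidays has an id column identifying the person:
SELECT
    p.id,
    date_trunc('month', gs) AS month,
    COUNT(*) AS working_days_off
FROM person_holidays p,
     generate_series(p.start, p.stop, interval '1 day') gs
WHERE gs::date NOT IN (SELECT holiday FROM holidays)
  AND extract(isodow FROM gs::date) < 6
GROUP BY p.id, date_trunc('month', gs)
ORDER BY p.id, month;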

How to force zero values in Redshift?

If this is my query:
select
(min(timestamp))::date as date,
count(distinct user_id) as user_id_count,
(row_number() over (order by signup_day desc)-1) as days_since
from
data.table
where
timestamp >= current_date - 3
group by
timestamp
order by
timestamp asc;
And these are my results
date | user_id_count | days_since
------------+-----------------+-------------
2018-01-22 | 3 | 1
2018-01-23 | 5 | 0
How can I get the table to also show days where the user ID count is 0, like this?
date | user_id_count | days_since
------------+-----------------+-------------
2018-01-21 | 0 | 0
2018-01-22 | 3 | 1
2018-01-23 | 5 | 0
You need to generate the dates. In Postgres, generate_series() is the way to go:
select g.ts as dte,
       count(distinct t.user_id) as user_id_count,
       row_number() over (order by g.ts desc) - 1 as days_since
from generate_series(current_date::timestamp - interval '3 day', current_date::timestamp, interval '1 day') g(ts) left join
     data.table t
     on t.timestamp::date = g.ts
group by g.ts
order by g.ts;
You have to create a "calendar" table with dates and left join your aggregated result to it, like this:
with
aggregated_result as (
select ...
)
select
t1.date
,coalesce(t2.user_id_count,0) as user_id_count
,coalesce(t2.days_since,0) as days_since
from calendar t1
left join aggregated_result t2
using (date)
more on creating calendar table: How do I create a dates table in Redshift?
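If you have no calendar table yet, one common workaround on Redshift (where generate_series generally cannot be joined against user tables) is to derive the dates from row_number() over any sufficiently large table; a rough sketch, assuming data.table from the question has at least as many rows as days you need and that its user_id column exists:
with
calendar as (
    select (current_date - rn + 1) as date
    from (select cast(row_number() over (order by user_id) as int) as rn
          from data.table) x
    where rn <= 4  -- today plus the 3 previous days
),
aggregated_result as (
    select ...
)
select
    t1.date,
    coalesce(t2.user_id_count, 0) as user_id_count,
    coalesce(t2.days_since, 0) as days_since
from calendar t1
left join aggregated_result t2
using (date)
order by t1.date;
The limit of 4 days matches the 3-day lookback in the question; adjust it to the range you need.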

Cumulative sum of values by month, filling in for missing months

I have this data table and I'm wondering if it is possible to create a query that gets a cumulative sum by month, considering all months up to the current month.
date_added | qty
------------------------------------
2015-08-04 22:28:24.633784-03 | 1
2015-05-20 20:22:29.458541-03 | 1
2015-04-08 14:16:09.844229-03 | 1
2015-04-07 23:10:42.325081-03 | 1
2015-07-06 18:50:30.164932-03 | 1
2015-08-22 15:01:54.03697-03 | 1
2015-08-06 18:25:07.57763-03 | 1
2015-04-07 23:12:20.850783-03 | 1
2015-07-23 17:45:29.456034-03 | 1
2015-04-28 20:12:48.110922-03 | 1
2015-04-28 13:26:04.770365-03 | 1
2015-05-19 13:30:08.186289-03 | 1
2015-08-06 18:26:46.448608-03 | 1
2015-08-27 16:43:06.561005-03 | 1
2015-08-07 12:15:29.242067-03 | 1
I need a result like this:
Jan|0
Feb|0
Mar|0
Apr|5
May|7
Jun|7
Jul|9
Aug|15
This is very similar to other questions, but the best query is still tricky.
Basic query to get the running sum quickly:
SELECT to_char(date_trunc('month', date_added), 'Mon YYYY') AS mon_text
, sum(sum(qty)) OVER (ORDER BY date_trunc('month', date_added)) AS running_sum
FROM tbl
GROUP BY date_trunc('month', date_added)
ORDER BY date_trunc('month', date_added);
The tricky part is to fill in for missing months:
WITH cte AS (
SELECT date_trunc('month', date_added) AS mon, sum(qty) AS mon_sum
FROM tbl
GROUP BY 1
)
SELECT to_char(mon, 'Mon YYYY') AS mon_text
, sum(c.mon_sum) OVER (ORDER BY mon) AS running_sum
FROM (SELECT min(mon) AS min_mon FROM cte) init
, generate_series(init.min_mon, now(), interval '1 month') mon
LEFT JOIN cte c USING (mon)
ORDER BY mon;
The implicit CROSS JOIN LATERAL requires Postgres 9.3+. This starts with the first month in the table.
To start with a given month:
WITH cte AS (
SELECT date_trunc('month', date_added) AS mon, sum(qty) AS mon_sum
FROM tbl
GROUP BY 1
)
SELECT to_char(mon, 'Mon YYYY') AS mon_text
, COALESCE(sum(c.mon_sum) OVER (ORDER BY mon), 0) AS running_sum
FROM generate_series('2015-01-01'::date, now(), interval '1 month') mon
LEFT JOIN cte c USING (mon)
ORDER BY mon;
db<>fiddle here
Old sqlfiddle
Both queries keep months from different years apart. You did not ask for that, but you'll most likely want it.
Note that the "month" to some degree depends on the time zone setting of the current session! Details:
Ignoring time zones altogether in Rails and PostgreSQL
Related:
Calculating Cumulative Sum in PostgreSQL
PostgreSQL: running count of rows for a query 'by minute'
Postgres window function and group by exception