Count days per month from days off table - sql

I have a table that stores a person and the start and stop dates of each holiday.
I need to count how many working days per month the person was on holiday, so I want to partition this table by month.
To get public holidays I'm using: https://github.com/christopherthompson81/pgsql_holidays
Let's assume the table holds one person only, with just start/stop columns.
create table data (id int, start date, stop date);
This is the network_days function I wrote:
CREATE OR REPLACE FUNCTION network_days(start_date date , stop_date date) RETURNS bigint AS $$
SELECT count(*) FROM
generate_series(start_date , stop_date - interval '1 minute' , interval '1 day') the_day
WHERE
extract('ISODOW' FROM the_day) < 6 AND the_day NOT IN (
SELECT datestamp::timestamptz FROM holidays_poland (extract(year FROM start_date)::int, extract(year FROM stop_date)::int))
$$
LANGUAGE sql
STABLE;
and I created a function with a query like:
--$2 = 2020
SELECT
month, year, sum(value_per_day)
FROM (
SELECT to_char(dt , 'mm') AS month, to_char(dt, 'yyyy') AS year, (network_days ((
CASE WHEN EXTRACT(year FROM df.start_date) < $2 THEN (SELECT date_trunc('year' , df.start_date) + interval '1 year')::date
ELSE df.start_date END) , ( CASE WHEN EXTRACT(year FROM df.stop_date) > $2 THEN (date_trunc('year' , df.stop_date))::date
ELSE
df.stop_date END))::int ::numeric / count(*) OVER (PARTITION BY id))::int AS value_per_day
FROM intranet.dayoff df
LEFT JOIN generate_series((
CASE WHEN EXTRACT(year FROM df.start_date) < $2 THEN (SELECT date_trunc('year' , df.start_date) + interval '1 year')::date ELSE df.start_date
END) , (CASE WHEN EXTRACT(year FROM df.stop_date) > $2 THEN (date_trunc('year' , df.stop_date))::date
ELSE df.stop_date END) - interval '1 day' , interval '1 day') AS t (dt) ON extract('ISODOW' FROM dt) < 6
WHERE
extract(isodow FROM dt) < 6 AND (EXTRACT(year FROM start_date) = $2 OR EXTRACT(year FROM stop_date) = $2)) t
GROUP BY month, year
ORDER BY month;
based on: https://dba.stackexchange.com/questions/237745/postgresql-split-date-range-by-business-days-then-aggregate-by-month?rq=1
and I almost have it:
10 rows returned
| month | year | sum |
| ----- | ---- | ---- |
| 03 | 2020 | 2 |
| 04 | 2020 | 13 |
| 06 | 2020 | 1 |
| 11 | 2020 | 1 |
| 12 | 2020 | 2 |
| 05 | 2020 | 1 |
| 10 | 2020 | 2 |
| 08 | 2020 | 10 |
| 01 | 2020 | 1 |
| 02 | 2020 | 1 |
so in the function I created I'd need to add something like this:
dt NOT IN (SELECT datestamp::timestamptz FROM holidays_poland ($2, $2))
but I end up with many conditions and I feel like this is the wrong approach.
I feel like I should just somehow split the table from:
id start stop
1 31.12.2019 00:00:00 01.01.2020 00:00:00
2 30.03.2020 00:00:00 14.04.2020 00:00:00
3 01.05.2020 00:00:00 03.05.2020 00:00:00
to
start stop
30.03.2020 00:00:00 01.04.2020 00:00:00
01.04.2020 00:00:00 14.04.2020 00:00:00
01.05.2020 00:00:00 03.05.2020 00:00:00
and just run the network_days function for each date range, but I couldn't partition the table in my query to get such a result.
What do you think is the best way to calculate what I want?

demo:db<>fiddle
SELECT
gs::date
FROM person_holidays p,
generate_series(p.start, p.stop, interval '1 day') gs -- 1
WHERE gs::date NOT IN (SELECT holiday FROM holidays) -- 2
AND EXTRACT(isodow from gs::date) < 6 -- 3
1. Generate a date series from the person's start and stop dates
2. Exclude all dates found in the holidays table
3. If necessary: exclude all weekend days (Saturday and Sunday)
Afterwards you are able to GROUP BY months and count the records:
SELECT
date_trunc('month', gs),
COUNT(*)
FROM person_holidays p,
generate_series(p.start, p.stop, interval '1 day') gs
WHERE gs::date NOT IN (SELECT holiday FROM holidays)
and extract(isodow from gs::date) < 6
GROUP BY 1
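As a cross-check for the query above, the same logic can be modelled outside the database. The following is only a minimal Python sketch, not part of the original answer: the function name `working_days_per_month`, the sample period, and the single sample holiday are assumptions made for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def working_days_per_month(periods, holidays):
    """Count working days per (year, month) over inclusive (start, stop) periods."""
    counts = Counter()
    for start, stop in periods:
        day = start
        while day <= stop:  # inclusive end, like generate_series(p.start, p.stop, ...)
            # isoweekday(): Monday=1 ... Sunday=7, the same numbering as ISODOW
            if day.isoweekday() < 6 and day not in holidays:
                counts[(day.year, day.month)] += 1
            day += timedelta(days=1)
    return dict(counts)


# Hypothetical sample data: one holiday period and one public holiday
periods = [(date(2020, 3, 30), date(2020, 4, 14))]
holidays = {date(2020, 4, 13)}  # Easter Monday 2020
result = working_days_per_month(periods, holidays)
```

Note that this sketch treats the stop date as inclusive, as the query does; if the stop date is meant to be exclusive, the loop condition would need to be `day < stop`.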

Related

SQL window function over 31 days but including only one first day of the month

I want to calculate the avg or sum of a metric over the past 31 days to use it in a visualization.
The metric varies every day, but it always jumps on the first day of each calendar month.
The problem is that some months (like November) are 30 days long, so on the first of December the 31-day window includes two first days of a month (run the query below and check the row at 2021-12-01T00:00:00Z).
I need avg(metric) over the past 30 days if the window would contain two first days of a month, and over the past 31 days otherwise.
with days as(
select
'2021-10-01' :: timestamptz + (d || ' day') ::interval as "day"
from generate_series(0, 100) d
)
, daily_metrics as (
select
"day"
-- in reality "metric" fluctuates every day. But it jumps on the first day of the months
, case when extract(day from "day") = 1 then 300 else 100 end :: float as metric
from days
)
, result as (
select
"day"
, avg(metric) over (rows between 30 preceding and current row) as metric_roll_avg
from daily_metrics
)
select * from result
where "day" > '2021-10-01' :: timestamptz + '30 day' :: interval
This is what I ended up doing:
First create a calendar view with one row per calendar day:
with
calendar as (
select c."day"
, extract(day from c."day") = 1 and extract(day from date_trunc('month', c."day") - interval '1 day') = 30 as last_month_30
, extract(day from c."day") in (1, 2) and extract(day from date_trunc('month', c."day") - interval '1 day') = 29 as last_month_29
, extract(day from c."day") in (1, 2, 3) and extract(day from date_trunc('month', c."day") - interval '1 day') = 28 as last_month_28
from ... as c ...
)
Which returns this view:
+------------+---------------+---------------+---------------+
| day | last_month_30 | last_month_29 | last_month_28 |
|------------+---------------+---------------+---------------|
| 2022-02-26 | False | False | False |
| 2022-02-27 | False | False | False |
| 2022-02-28 | False | False | False |
| 2022-03-01 | False | False | True |
| 2022-03-02 | False | False | True |
+------------+---------------+---------------+---------------+
And using a case switch:
select
"day"
, case
when last_month_30
then avg(revenue) over (order by "day" rows between 29 preceding and current row )
when last_month_29
then avg(revenue) over (order by "day" rows between 28 preceding and current row)
when last_month_28
then avg(revenue) over (order by "day" rows between 27 preceding and current row)
else avg(revenue) over (order by "day" rows between 30 preceding and current row )
end as monthly_revenue
from ...
Probably not the cleanest way, but it works for all cases.
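The underlying idea (a window of at most 31 days that never contains two first-of-month days) can also be expressed directly. The Python sketch below is a different formulation from the CASE expression above, not a translation of it: it trims the window day by day, so after a 28-day February it uses 28 days on March 1, 29 on March 2, and so on. The names `trimmed_window` and `rolling_avg` are hypothetical.

```python
from datetime import date, timedelta


def trimmed_window(day, max_days=31):
    """Days of a window ending at `day`, at most `max_days` long,
    cut short just before it would include a second first-of-month."""
    window = []
    firsts_seen = 0
    current = day
    while len(window) < max_days:
        if current.day == 1:
            firsts_seen += 1
            if firsts_seen == 2:
                break  # stop before adding a second first-of-month
        window.append(current)
        current -= timedelta(days=1)
    return window


def rolling_avg(metric_by_day, day):
    """Average of the metric over the trimmed window, ignoring days without data."""
    days = [d for d in trimmed_window(day) if d in metric_by_day]
    return sum(metric_by_day[d] for d in days) / len(days)


# Same synthetic metric as in the question: 300 on the 1st, 100 otherwise
start = date(2021, 10, 1)
metrics = {}
for i in range(101):
    d = start + timedelta(days=i)
    metrics[d] = 300.0 if d.day == 1 else 100.0
```

With this data, the window ending on 2021-12-01 has 30 days and the one ending on 2021-11-01 has 31.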

Count if previous month data exists postgres

I'm stuck with a query that should count, per month, the ids that also exist in the previous month.
My table looks like this:
date | id |
2020-02-02| 1 |
2020-03-04| 1 |
2020-03-04| 2 |
2020-04-05| 1 |
2020-04-05| 3 |
2020-05-06| 2 |
2020-05-06| 3 |
2020-06-07| 2 |
2020-06-07| 3 |
I'm stuck with this query:
SELECT date_trunc('month',date), id
FROM table
WHERE id IN
(SELECT DISTINCT id FROM table WHERE date
BETWEEN date_trunc('month', current_date) - interval '1 month' AND date_trunc('month', current_date))
The main problem is that I'm stuck with the current_date function. Is there a dynamic way to replace current_date? Thanks.
What I expect as my result is:
date | count |
2020-02-01| 0 |
2020-03-01| 1 |
2020-04-01| 1 |
2020-05-01| 1 |
2020-06-01| 2 |
Solution 1 with SELF JOIN
SELECT date_trunc('month', c.date) :: date AS date
, count(DISTINCT c.id) FILTER (WHERE p.date IS NOT NULL)
FROM test AS c
LEFT JOIN test AS p
ON c.id = p.id
AND date_trunc('month', c.date) = date_trunc('month', p.date) + interval '1 month'
GROUP BY date_trunc('month', c.date)
ORDER BY date_trunc('month', c.date)
Result :
date count
2020-02-01 0
2020-03-01 1
2020-04-01 1
2020-05-01 1
2020-06-01 2
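To make the self-join logic easier to verify, here is a minimal Python sketch of the same rule: for each month, count the distinct ids that also appear in the month before. It is an illustration only; the function name `count_returning_ids` is an assumption, and the rows are the sample data from the question.

```python
from datetime import date


def count_returning_ids(rows):
    """rows: iterable of (date, id).
    Returns {first day of month: number of ids also seen in the previous month}."""
    ids_by_month = {}
    for day, id_ in rows:
        month = date(day.year, day.month, 1)  # same role as date_trunc('month', ...)
        ids_by_month.setdefault(month, set()).add(id_)

    result = {}
    for month, ids in sorted(ids_by_month.items()):
        if month.month == 1:
            prev = date(month.year - 1, 12, 1)
        else:
            prev = date(month.year, month.month - 1, 1)
        # ids present in both this month and the previous one
        result[month] = len(ids & ids_by_month.get(prev, set()))
    return result


rows = [
    (date(2020, 2, 2), 1),
    (date(2020, 3, 4), 1), (date(2020, 3, 4), 2),
    (date(2020, 4, 5), 1), (date(2020, 4, 5), 3),
    (date(2020, 5, 6), 2), (date(2020, 5, 6), 3),
    (date(2020, 6, 7), 2), (date(2020, 6, 7), 3),
]
counts = count_returning_ids(rows)
```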
Solution 2 with WINDOW FUNCTIONS
SELECT DISTINCT ON (date) date
, count(*) FILTER (WHERE count > 0 AND previous_month) OVER (PARTITION BY date)
FROM
( SELECT DISTINCT ON (id, date_trunc('month', date))
id
, date_trunc('month', date) AS date
, count(*) OVER w AS count
, first_value(date_trunc('month', date)) OVER w = date_trunc('month', date) - interval '1 month' AS previous_month
FROM test
WINDOW w AS (PARTITION BY id ORDER BY date_trunc('month', date) GROUPS BETWEEN 1 PRECEDING AND 1 PRECEDING)
) AS a
Result :
date count
2020-02-01 0
2020-03-01 1
2020-04-01 1
2020-05-01 1
2020-06-01 2
see dbfiddle

Weeks between two dates

I'm attempting to turn two dates into a series of records. One record for each week between the dates.
Additionally the original start and end dates should be used to clip the week in case the range starts or ends mid-week. I'm also assuming that a week starts on Monday.
With a start date of: 05/09/2018 and an end date of 27/09/2018 I would like to retrieve the following results:
| # | Start Date   | End date     |
|---|--------------|--------------|
| 0 | '05/09/2018' | '09/09/2018' |
| 1 | '10/09/2018' | '16/09/2018' |
| 2 | '17/09/2018' | '23/09/2018' |
| 3 | '24/09/2018' | '27/09/2018' |
I have made some progress - at the moment I can get the total number of weeks between the date range with:
SELECT (
EXTRACT(
days FROM (
date_trunc('week', to_date('27/09/2018', 'DD/MM/YYYY')) -
date_trunc('week', to_date('05/09/2018', 'DD/MM/YYYY'))
) / 7
) + 1
) as total_weeks;
total_weeks returns 4 for the above SQL. This is where I'm stuck: going from an integer to an actual set of results.
Window functions are your friend:
SELECT week_num,
min(d) AS start_date,
max(d) AS end_date
FROM (SELECT d,
count(*) FILTER (WHERE new_week) OVER (ORDER BY d) AS week_num
FROM (SELECT DATE '2018-09-05' + i AS d,
extract(dow FROM DATE '2018-09-05'
+ lag(i) OVER (ORDER BY i)
) = 1 AS new_week
FROM generate_series(0, DATE '2018-09-27' - DATE '2018-09-05') AS i
) AS week_days
) AS weeks
GROUP BY week_num
ORDER BY week_num;
week_num | start_date | end_date
----------+------------+------------
0 | 2018-09-05 | 2018-09-09
1 | 2018-09-10 | 2018-09-16
2 | 2018-09-17 | 2018-09-23
3 | 2018-09-24 | 2018-09-27
(4 rows)
Use generate_series():
select gs.*
from generate_series(date_trunc('week', '2018-09-05'::date),
'2018-09-27'::date,
interval '1 week'
) gs(dte)
Ultimately I expanded on Gordon's solution to get to the following, however Laurenz's answer is slightly more concise.
select
(
case when (week_start - interval '6 days' <= date_trunc('week', '2018-09-05'::date)) then '2018-09-05'::date else week_start end
) as start_date,
(
case when (week_start + interval '6 days' >= '2018-09-27'::date) then '2018-09-27'::date else week_start + interval '6 days' end
) as end_date
from generate_series(
date_trunc('week', '2018-09-05'::date),
'2018-09-27'::date,
interval '1 week'
) gs(week_start);
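The clipping rule used in these answers (step through Monday-based weeks, then clamp the first start and the last end to the requested range) can be sketched in a few lines of Python. This is only an illustration; `clipped_weeks` is a hypothetical name, and the dates are the ones from the question.

```python
from datetime import date, timedelta


def clipped_weeks(start, end):
    """Split [start, end] into Monday-based weeks, clipped to the range."""
    weeks = []
    # weekday(): Monday=0, so this is the Monday of the week containing `start`
    week_start = start - timedelta(days=start.weekday())
    while week_start <= end:
        week_end = week_start + timedelta(days=6)
        weeks.append((max(week_start, start), min(week_end, end)))
        week_start += timedelta(days=7)
    return weeks


weeks = clipped_weeks(date(2018, 9, 5), date(2018, 9, 27))
```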

Using windows functions to count by groups of dates PostgreSQL

I have a table amongst whose columns are id and created_at and I want to use window functions around the created_at of each entry to count how many entries there are within 48 hours of them. As an example, for the original table:
id | created_at
----|------------
01 | 2016/01/04
02 | 2016/01/05
03 | 2016/01/05
04 | 2016/01/06
05 | 2016/01/07
06 | 2016/01/08
07 | 2016/01/08
08 | 2016/01/09
and the result should be
id | created_at | count
----|------------|-------
01 | 2016/01/04 | 4
02 | 2016/01/05 | 5
03 | 2016/01/05 | 5
04 | 2016/01/06 | 7
05 | 2016/01/07 | 7
06 | 2016/01/08 | 5
07 | 2016/01/08 | 5
08 | 2016/01/09 | 4
The explanation is that since there are 2 transactions on 2016/01/05, 1 on 2016/01/06, 1 on 2016/01/07, 2 on 2016/01/08, and 1 on 2016/01/09, there are a total of 7 transactions within 2 days of transaction 05.
It is better to use a date table that has consecutive dates, in case the dates in your table have gaps.
I am wondering what the role of the id column is. Here is how I would do it without considering the id column.
select row_number() over (order by dt) as id
,dt as created_at
,cnt1+cnt2+cnt3+cnt4+cnt5 as cnt
from
(
select
date_table.dt
-- coalesce (instead of SQL Server's isnull) covers both days without rows
-- (NULL cnt after the left join) and offsets outside the date range
,coalesce(lag(cnt,2) over (order by date_table.dt), 0) as cnt1
,coalesce(lag(cnt,1) over (order by date_table.dt), 0) as cnt2
,coalesce(cnt,0) as cnt3
,coalesce(lead(cnt,1) over (order by date_table.dt), 0) as cnt4
,coalesce(lead(cnt,2) over (order by date_table.dt), 0) as cnt5
from
date_table left join
(select created_at,count(*) as cnt from your_table group by created_at) c
on date_table.dt = c.created_at
) T
Using window functions for this purpose is challenging because of the duplicate days. You can get the results using a join or correlated subquery:
select t.*,
(select count(*)
from t t2
where t2.created_at between t.created_at - interval '2 day' and
t.created_at + interval '2 day'
) as cnt
from t;
EDIT:
You could use window functions by doing a cumulative sum by date and then joining back. This is, of course, a bit challenging because of holes in the dates. But, something like this:
with c as (
select d.dte, count(t.created_at) as cnt,
sum(count(t.created_at)) over (order by d.dte) as cumecnt
from (select generate_series(min(created_at) - interval '2 day',
max(created_at) + interval '2 day',
'1 day')
from t
) d(dte) left join
t
on d.dte = t.created_at
group by d.dte
)
-- add cmin.cnt back so the day two days earlier is included in the range
select t.*, cmax.cumecnt - cmin.cumecnt + cmin.cnt as cnt
from t join
c cmin
on t.created_at = cmin.dte + interval '2 day' join
c cmax
on t.created_at = cmax.dte - interval '2 day';
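The expected counts from the question can be reproduced with a small Python sketch of the same rule: for every entry, add up the per-day counts from two days before to two days after its created_at. The name `counts_within` is an assumption for illustration; the dates are the sample rows from the question.

```python
from collections import Counter
from datetime import date, timedelta


def counts_within(created_dates, days=2):
    """For each entry, count entries whose date lies within +/- `days` of it."""
    per_day = Counter(created_dates)  # missing days simply count as 0
    result = []
    for d in created_dates:
        total = sum(per_day[d + timedelta(days=offset)]
                    for offset in range(-days, days + 1))
        result.append(total)
    return result


created = [
    date(2016, 1, 4), date(2016, 1, 5), date(2016, 1, 5), date(2016, 1, 6),
    date(2016, 1, 7), date(2016, 1, 8), date(2016, 1, 8), date(2016, 1, 9),
]
counts = counts_within(created)
```

Because the per-day counts come from a `Counter`, gaps in the dates need no special handling, which is the role the date table or generated series plays in the SQL versions.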

GROUP BY next months over N years

I need to aggregate amounts grouped by "horizon": each horizon covers the next 12 months, over 5 years.
Assuming today is 2015-08-15:
SUM amount from 0 to 12 next months (from 2015-08-16 to 2016-08-15)
SUM amount from 12 to 24 next months (from 2016-08-16 to 2017-08-15)
SUM amount from 24 to 36 next months ...
SUM amount from 36 to 48 next months
SUM amount from 48 to 60 next months
Here is a fiddled dataset example:
+----+------------+--------+
| id | date | amount |
+----+------------+--------+
| 1 | 2015-09-01 | 10 |
| 2 | 2015-10-01 | 10 |
| 3 | 2016-10-01 | 10 |
| 4 | 2017-06-01 | 10 |
| 5 | 2018-06-01 | 10 |
| 6 | 2019-05-01 | 10 |
| 7 | 2019-04-01 | 10 |
| 8 | 2020-04-01 | 10 |
+----+------------+--------+
Here is the expected result:
+---------+--------+
| horizon | amount |
+---------+--------+
| 1 | 20 |
| 2 | 20 |
| 3 | 10 |
| 4 | 20 |
| 5 | 10 |
+---------+--------+
How can I get these 12-month "horizons"?
I tagged PostgreSQL but I'm actually using an ORM so it's just to find the idea. (by the way I don't have access to the date formatting functions)
I would split the data into 12-month time frames and group by that:
SELECT
FLOOR(
(EXTRACT(EPOCH FROM date) - EXTRACT(EPOCH FROM now()))
/ EXTRACT(EPOCH FROM INTERVAL '12 month')
) + 1 AS "horizon",
SUM(amount) AS "amount"
FROM dataset
GROUP BY horizon
ORDER BY horizon;
SQL Fiddle
Inspired by: Postgresql SQL GROUP BY time interval with arbitrary accuracy (down to milli seconds)
Assuming you need intervals from current date to this day next year and so on, I would query this like this:
SELECT 1 AS horizon, SUM(amount) FROM dataset
WHERE date > now()
AND date < (now() + '12 months'::INTERVAL)
UNION
SELECT 2 AS horizon, SUM(amount) FROM dataset
WHERE date > (now() + '12 months'::INTERVAL)
AND date < (now() + '24 months'::INTERVAL)
UNION
SELECT 3 AS horizon, SUM(amount) FROM dataset
WHERE date > (now() + '24 months'::INTERVAL)
AND date < (now() + '36 months'::INTERVAL)
UNION
SELECT 4 AS horizon, SUM(amount) FROM dataset
WHERE date > (now() + '36 months'::INTERVAL)
AND date < (now() + '48 months'::INTERVAL)
UNION
SELECT 5 AS horizon, SUM(amount) FROM dataset
WHERE date > (now() + '48 months'::INTERVAL)
AND date < (now() + '60 months'::INTERVAL)
ORDER BY horizon;
You can generalize it and make something like this using additional variable:
SELECT number AS horizon, SUM(amount) FROM dataset
WHERE date > (now() + ((number - 1) * '12 months'::INTERVAL))
AND date < (now() + (number * '12 months'::INTERVAL));
Where number is an integer from range [1,5]
Here is what I get from the Fiddle:
| horizon | sum |
|---------|-----|
| 1 | 20 |
| 2 | 20 |
| 3 | 10 |
| 4 | 20 |
| 5 | 10 |
Perhaps CTE?
WITH RECURSIVE grps AS
(
SELECT 1 AS Horizon, (date '2015-08-15') + interval '1' day AS FromDate, (date '2015-08-15') + interval '1' year AS ToDate
UNION ALL
SELECT Horizon + 1, ToDate + interval '1' day AS FromDate, ToDate + interval '1' year
FROM grps WHERE Horizon < 5
)
SELECT
Horizon,
(SELECT SUM(amount) FROM dataset WHERE date BETWEEN g.FromDate AND g.ToDate) AS SumOfAmount
FROM
grps g
SQL fiddle
Rather simply:
SELECT horizon, sum(amount) AS amount
FROM generate_series(1, 5) AS s(horizon)
JOIN dataset ON "date" >= current_date + (horizon - 1) * interval '1 year'
AND "date" < current_date + horizon * interval '1 year'
GROUP BY horizon
ORDER BY horizon;
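Since the question mentions an ORM, the bucketing in the query above may be easier to port once it is written out procedurally. The Python sketch below is an illustration under assumptions: `add_years` and `sum_by_horizon` are hypothetical names, the reference date is fixed to the question's 2015-08-15 instead of current_date, and horizons without rows are omitted, as with the inner join.

```python
from datetime import date


def add_years(d, years):
    """Shift a date by whole years; Feb 29 falls back to Feb 28 if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def sum_by_horizon(rows, today, horizons=5):
    """rows: iterable of (date, amount). Horizon h covers the half-open range
    [today + (h - 1) years, today + h years)."""
    totals = {}
    for h in range(1, horizons + 1):
        lower, upper = add_years(today, h - 1), add_years(today, h)
        matching = [amount for day, amount in rows if lower <= day < upper]
        if matching:  # skip empty horizons, like the inner join does
            totals[h] = sum(matching)
    return totals


dataset = [
    (date(2015, 9, 1), 10), (date(2015, 10, 1), 10), (date(2016, 10, 1), 10),
    (date(2017, 6, 1), 10), (date(2018, 6, 1), 10), (date(2019, 5, 1), 10),
    (date(2019, 4, 1), 10), (date(2020, 4, 1), 10),
]
totals = sum_by_horizon(dataset, today=date(2015, 8, 15))
```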
You need a union and an aggregate function:
select 1 as horizon,
sum(amount) amount
from the_table
where date >= current_date
and date < current_date + interval '12' month
union all
select 2 as horizon,
sum(amount) amount
from the_table
where date >= current_date + interval '12' month
and date < current_date + interval '24' month
union all
select 3 as horizon,
sum(amount) amount
from the_table
where date >= current_date + interval '24' month
and date < current_date + interval '36' month
... and so on ...
I don't know how to do that with an obfuscation layer (aka ORM), but I'm sure it supports (or should support) aggregation and unions.
This could easily be wrapped up into a PL/PgSQL function where you pass the "horizon" and the SQL is built dynamically so that all you need to call is something like: select * from sum_horizon(5) where 5 indicates the number of years.
Btw: date is a horrible name for a column. For one because it's a reserved word, but more importantly because it doesn't document the meaning of the column. Is it a "release date"? A "due date"? An "order date"?
Try this
select
id,
sum(case when date>=current_date and date<current_date+interval '1 year' then amount else 0 end) as year1,
sum(case when date>=current_date+interval '1 year' and date<current_date+interval '2 year' then amount else 0 end) as year2,
sum(case when date>=current_date+interval '2 year' and date<current_date+interval '3 year' then amount else 0 end) as year3,
sum(case when date>=current_date+interval '3 year' and date<current_date+interval '4 year' then amount else 0 end) as year4,
sum(case when date>=current_date+interval '4 year' and date<current_date+interval '5 year' then amount else 0 end) as year5
from dataset
group by id