Get average duration per week-day from a list of records with start and end date - sql

I have an input table with three columns:
id => string
start_date => timestamptz
end_date => timestamptz
I want to get the average duration in seconds (end_date - start_date) per weekday number across the records.
My problem is: if a record spans 4 days between start_date and end_date, I want its duration attributed to each day it covers, not only to the day of start_date or end_date. And if there are no records for a weekday during some period (3 weeks, for example), that weekday should count as a zero value in the average.
Example :
id                   | start_date               | end_date
---------------------+--------------------------+-------------------------
1 (Friday to Sunday) | 2021-03-12T01:00:00.000Z | 2021-03-14T01:00:00.000Z
2 (Friday)           | 2021-03-12T01:00:00.000Z | 2021-03-12T05:00:00.000Z
3 (Wed.)             | 2021-03-03T16:00:00.000Z | 2021-03-03T17:00:00.000Z
Expected result (european weekday here for example, sunday is 7) :
weekday | avg_duration_seconds
--------+---------------------
1       | 0
2       | 0
3       | 1800
4       | 0
5       | 48600
6       | 86400
7       | 3600
Thanks for your help!

Note: the following works on Postgres, since you tagged that as well. I have no idea whether it also works on CockroachDB.
You can "expand" the start/end timestamps to days by using generate_series(). To calculate the effective duration on each day, the full days need to be treated differently than the partial days at the start and end. Once those timestamps are calculated, it's easy to get the duration per day. Then do a left join on all weekdays and group by them:
select x.weekday,
       avg(extract(epoch from real_end - real_start)) as duration
from generate_series(1,7) as x(weekday)
  left join (
    select t.id,
           extract(isodow from g.dt) as weekday,
           case
             when start_date < g.dt then date_trunc('day', g.dt)
             else start_date
           end as real_start,
           case
             when end_date::date > g.dt then date_trunc('day', g.dt::date + 1)
             else end_date
           end as real_end
    from the_table t
      cross join generate_series(start_date, end_date, interval '1 day') as g(dt)
  ) t on x.weekday = t.weekday
group by x.weekday
order by x.weekday;
I am not 100% sure my expressions for "real_start" and "real_end" cover all corner cases, but it should be enough to get you started.
This gives a slightly different result than your expected one, because you have the weekdays wrong for 2021-03-02 and 2021-03-11.
Online example
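To make the per-day arithmetic easier to check by hand, here is a minimal sketch of the same idea in Python rather than SQL. It clips each record to the calendar days it touches and then averages the pieces per ISO weekday. The function names are made up for illustration, days are split in UTC (an assumption), and weekdays without data come back as None, mirroring the NULL the left join produces; it does not implement the "count empty weeks as zero" rule from the question.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def split_by_day(start, end):
    """Yield (day_start, seconds) for every calendar day the interval touches."""
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while day < end:
        next_day = day + timedelta(days=1)
        # Clip the interval to the current calendar day.
        seg_start = max(start, day)
        seg_end = min(end, next_day)
        yield day, (seg_end - seg_start).total_seconds()
        day = next_day


def avg_seconds_per_weekday(records):
    """Average the per-day pieces for each ISO weekday (1 = Monday .. 7 = Sunday)."""
    pieces = defaultdict(list)
    for start, end in records:
        for day, seconds in split_by_day(start, end):
            pieces[day.isoweekday()].append(seconds)
    # Weekdays without any piece stay None, like NULL after the left join.
    return {wd: (sum(pieces[wd]) / len(pieces[wd]) if wd in pieces else None)
            for wd in range(1, 8)}


utc = timezone.utc
records = [  # the three rows from the question
    (datetime(2021, 3, 12, 1, tzinfo=utc), datetime(2021, 3, 14, 1, tzinfo=utc)),
    (datetime(2021, 3, 12, 1, tzinfo=utc), datetime(2021, 3, 12, 5, tzinfo=utc)),
    (datetime(2021, 3, 3, 16, tzinfo=utc), datetime(2021, 3, 3, 17, tzinfo=utc)),
]
print(avg_seconds_per_weekday(records))
```

With the question's rows this gives 48600 for Friday, 86400 for Saturday and 3600 for Sunday. Wednesday comes out as 3600 rather than 1800, because only days that actually have a record enter the average.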

Related

How can I aggregate time series data in postgres from a specific timestamp & fixed intervals (e.g. 1 hour , 1 day, 7 day ) without using date_trunc()?

I have a postgres table "Generation" with half-hourly timestamps spanning 2009 - present with energy data:
I need to aggregate (average) the data across different intervals from specific points in time, for example data from 2021-01-07T00:00:00.000Z for one year at 7-day intervals, or 3 months at 1-day intervals, or 7 days at 1-hour intervals, etc. date_trunc() partly solves this, but it truncates weeks to the preceding Monday, e.g.
SELECT date_trunc('week', "DATETIME") AS week,
       count(*),
       AVG("GAS") AS gas,
       AVG("COAL") AS coal
FROM "Generation"
WHERE "DATETIME" >= '2021-01-07T00:00:00.000Z' AND "DATETIME" <= '2022-01-06T23:59:59.999Z'
GROUP BY week
ORDER BY week ASC
;
returns the first time series interval as 2021-01-04 with an incorrect count:
week count gas coal
"2021-01-04 00:00:00" 192 18291.34375 2321.4427083333335
"2021-01-11 00:00:00" 336 14477.407738095239 2027.547619047619
"2021-01-18 00:00:00" 336 13947.044642857143 1152.047619047619
EDIT: the following returns the correct weekly intervals by measuring how many days the start date lies after the start of its week (Monday) and shifting the results accordingly:
WITH vars1 AS (
  SELECT '2021-01-07T00:00:00.000Z'::timestamp as start_time,
         '2021-01-28T00:00:00.000Z'::timestamp as end_time
),
vars2 AS (
  SELECT
    ((select start_time from vars1)::date - (date_trunc('week', (select start_time from vars1)::timestamp))::date) as diff
)
SELECT date_trunc('week', "DATETIME" - ((select diff from vars2) || ' day')::interval)::date + ((select diff from vars2) || ' day')::interval AS week,
       count(*),
       AVG("GAS") AS gas,
       AVG("COAL") AS coal
FROM "Generation"
WHERE "DATETIME" >= (select start_time from vars1) AND "DATETIME" < (select end_time from vars1)
GROUP BY week
ORDER BY week ASC
returns:
week count gas coal
"2021-01-07 00:00:00" 336 17242.752976190477 2293.8541666666665
"2021-01-14 00:00:00" 336 13481.497023809523 1483.0565476190477
"2021-01-21 00:00:00" 336 15278.854166666666 1592.7916666666667
And then for any daily or hourly (swap out day with hour) intervals you can use the following:
SELECT date_trunc('day', "DATETIME") AS day,
       count(*),
       AVG("GAS") AS gas,
       AVG("COAL") AS coal
FROM "Generation"
WHERE "DATETIME" >= '2022-01-07T00:00:00.000Z' AND "DATETIME" < '2022-01-10T23:59:59.999Z'
GROUP BY day
ORDER BY day ASC
;
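The offset trick used in the weekly query of the EDIT above (shift back by the start date's distance from Monday, truncate to the week, shift forward again) can be checked with a small Python sketch. This is an illustration under assumptions, not the database's implementation: the helper names are hypothetical, and the half-hourly timestamps are generated here as a stand-in for the "DATETIME" column.

```python
from collections import Counter
from datetime import datetime, timedelta


def week_start(ts):
    """Monday 00:00 of the week containing ts, like date_trunc('week', ts)."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=ts.weekday())


def shifted_week_bucket(ts, start_time):
    """Start of the 7-day bucket containing ts, with buckets anchored on start_time's date."""
    # Whole days between the Monday of start_time's week and start_time itself.
    diff = start_time.date() - week_start(start_time).date()
    # Shift back, truncate to Monday, then shift forward again.
    return week_start(ts - diff) + diff


start = datetime(2021, 1, 7)
end = datetime(2021, 1, 28)

# Hypothetical half-hourly timestamps in [start, end).
stamps = []
current = start
while current < end:
    stamps.append(current)
    current += timedelta(minutes=30)

counts = Counter(shifted_week_bucket(t, start) for t in stamps)
for bucket in sorted(counts):
    print(bucket.date(), counts[bucket])
```

Each of the three buckets (2021-01-07, 2021-01-14, 2021-01-21) receives 336 timestamps, which is 7 days of 48 half-hour readings and agrees with the counts shown in the result above.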
In order to select the complete week, you should change the WHERE clause to something like:
WHERE "DATETIME" >= date_trunc('week','2021-01-07T00:00:00.000Z'::timestamp)
AND "DATETIME" < (date_trunc('week','2022-01-06T23:59:59.999Z'::timestamp) + interval '7' day)::date
This will effectively get the records from January 4, 2021 up to and including January 9, 2022.
Note: I changed <= to < to stop the end-date being included!
EDIT:
when you want your weeks to start on January 7, you can always group by:
(date_part('day',(d-'2021-01-07'))::int-(date_part('day',(d-'2021-01-07'))::int % 7))/7
(where d is the column containing the datetime-value.)
see: dbfiddle
EDIT:
This will get the list from a given date and a specified interval.
see DBFIDDLE
WITH vars AS (
  SELECT
    '2021-01-07T00:00:00.000Z'::timestamp AS qstart,
    '2022-01-06T23:59:59.999Z'::timestamp AS qend,
    7 as qint,
    INTERVAL '1 DAY' as qinterval
)
SELECT
  (select date(qstart) FROM vars)
    + (SELECT qinterval from vars)
      * ((date_part('day', ("DATETIME" - (select date(qstart) FROM vars)))::int
          - (date_part('day', ("DATETIME" - (select date(qstart) FROM vars)))::int % (SELECT qint FROM vars)))::int) AS week,
  count(*),
  AVG("GAS") AS gas,
  AVG("COAL") AS coal
FROM "Generation"
WHERE "DATETIME" >= (SELECT qstart FROM vars) AND "DATETIME" <= (SELECT qend FROM vars)
GROUP BY week
ORDER BY week
;
I added the WITH vars CTE so that the variables sit at the top and the rest of the query needs no changes. (Idea borrowed here)
I only tested with qint=7,qinterval='1 DAY' and qint=14,qinterval='1 DAY' (but other values should work too...)
Using the EXTRACT function, you can calculate the difference in days, weeks and hours between your timestamp ts and the start_date as follows.
Difference in Days
extract (day from ts - start_date)
Difference in Weeks
This is the difference in days divided by 7 and truncated:
trunc(extract (day from ts - start_date)/7)
Difference in Hours
This is the difference in days times 24, plus the hours component of the difference:
extract (day from ts - start_date)*24 + extract (hour from ts - start_date)
The difference can be used in GROUP BY directly. E.g. for week grouping, the first group has difference 0 (the same week), the next group has difference 1 (the following week), and so on.
Sample Example
I'm using a CTE for the start date to avoid multiple copies of the parameter
with start_time as
  (select DATE'2021-01-07' as start_ts),
prep as (
  select
    ts,
    extract (day from ts - (select start_ts from start_time)) day_diff,
    trunc(extract (day from ts - (select start_ts from start_time))/7) week_diff,
    extract (day from ts - (select start_ts from start_time)) *24 + extract (hour from ts - (select start_ts from start_time)) hour_diff,
    value
  from test_table
  where ts >= (select start_ts from start_time)
)
select week_diff, avg(value)
from prep
group by week_diff order by 1
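The three differences above are plain integer arithmetic on the elapsed time, which a short Python sketch can make concrete. The function name is hypothetical, and the sketch assumes ts >= start, matching the WHERE clause of the sample; negative differences are not handled.

```python
from datetime import datetime


def bucket_indexes(ts, start):
    """Return (day_diff, week_diff, hour_diff) of ts relative to start.

    Assumes ts >= start, as in the sample query's WHERE clause.
    """
    delta = ts - start
    day_diff = delta.days                              # whole days elapsed
    week_diff = day_diff // 7                          # whole weeks elapsed
    hour_diff = day_diff * 24 + delta.seconds // 3600  # whole hours elapsed
    return day_diff, week_diff, hour_diff


start = datetime(2021, 1, 7)
print(bucket_indexes(datetime(2021, 1, 7), start))           # (0, 0, 0)
print(bucket_indexes(datetime(2021, 1, 13, 23, 59), start))  # (6, 0, 167)
print(bucket_indexes(datetime(2021, 1, 16, 13, 30), start))  # (9, 1, 229)
```

Grouping rows by one of these indexes puts every row of the same day, week or hour (counted from the start date) into the same bucket.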

BigQuery SQL to change start date and end date into groups of months

I work with a hotel client who has a BigQuery database containing hotel booking data. I've shared the relevant columns in the image below, which list the name of each hotel, the arrival date of the guest, the departure date, and the revenue generated from each booking:
My problem statement is that I have to showcase how many rooms have been booked, and how much revenue has been made for each hotel every month where my final grid would look similar to this:
The important points to remember are:
the depart_dt - arrival_dt are the number of nights that the guest is staying
the Rez_rate_total / (depart_dt - arrival_dt) is the revenue made per night
My problem here is trying to figure out how to change the start date and end date columns into groups of months. The challenge comes when a guest arrives in one month and leaves in the next month. For example, Row 5 in the original data has the guest coming in on 18th July and leaving on 1st Aug - so 13 days of his stay and 13 days of revenue has to be included in July and 1 day has to be included in August.
I haven't used SQL in a while so this is as far as I got:
WITH
  temp_table AS (
    SELECT
      hotel_long_nm,
      arrival_dt,
      depart_dt,
      DATE_DIFF(depart_dt, arrival_dt, day) AS room_nights,
      rez_rate_total
    FROM
      `DATABASE.analytics.bookings` )
SELECT
  *
FROM
  temp_table
Any help would be greatly appreciated!
Consider the following approach:
with bookings as (
  select hotel_long_nm,
         date(arrival_dt) as arrival_dt,
         date(depart_dt) as depart_dt,
         rez_rate_total
  from project.dataset.bookings
),
tmp as (
  -- expose the dates in the reservation (excluding last day of reservation)
  select *, generate_date_array(arrival_dt, date_sub(depart_dt, interval 1 day)) as stay_dates
  from bookings
),
calc as (
  -- unnest and calculate the daily rate
  select
    hotel_long_nm,
    stay_dt,
    1 as stay_nights,
    rez_rate_total/array_length(stay_dates) as rez_rate_daily
  from tmp
  left join unnest(stay_dates) as stay_dt
),
agg as (
  -- aggregate to the year-month level
  select
    date_trunc(stay_dt, month) as year_month,
    hotel_long_nm,
    sum(stay_nights) as room_nights,
    round(sum(rez_rate_daily),2) as rez_rate_total
  from calc
  group by 1,2
)
select * from agg
order by hotel_long_nm, year_month
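The expand-then-aggregate steps of the query above can be sketched in Python to show how one booking is spread over two months. This is only an illustration: the function name and the sample booking are hypothetical. As in the query, each night is counted on the date it begins, so the departure day itself is excluded.

```python
from collections import defaultdict
from datetime import date, timedelta


def monthly_totals(bookings):
    """bookings: iterable of (hotel, arrival, depart, total_rate).

    Returns {(hotel, first_day_of_month): (room_nights, revenue)}.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for hotel, arrival, depart, total_rate in bookings:
        nights = (depart - arrival).days
        if nights <= 0:
            continue  # no night stayed; skipping also avoids dividing by zero
        nightly_rate = total_rate / nights
        for offset in range(nights):
            stay_date = arrival + timedelta(days=offset)
            key = (hotel, stay_date.replace(day=1))
            totals[key][0] += 1             # one room night
            totals[key][1] += nightly_rate  # that night's share of the revenue
    return {key: (n, round(rev, 2)) for key, (n, rev) in totals.items()}


# Hypothetical booking: 4 nights for 400.0, arriving 29 July and leaving 2 August.
sample = [("Hotel A", date(2021, 7, 29), date(2021, 8, 2), 400.0)]
print(monthly_totals(sample))
```

For the sample booking, July receives 3 nights and 300.0 of revenue, and August receives 1 night and 100.0.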
You can consider this approach, which follows this logic:
1. Check whether both dates are in the same month.
2. If they are not, take the last date of the arrival month and subtract the arrival date from it.
3. Take the first date of the departure month and subtract it from the departure date.
The following code shows an example:
SELECT
  /* arrival date */
  CURRENT_DATE() AS the_arival,
  /* departure date */
  DATE_ADD(CURRENT_DATE(), INTERVAL 30 DAY) AS the_depart,
  /* total nights between arrival date and departure date */
  DATE_DIFF(DATE_ADD(CURRENT_DATE(), INTERVAL 30 DAY), CURRENT_DATE(), DAY) AS total_room_nights,
  /* check whether the dates are in the same month: 0 = same month, > 0 = different months */
  DATE_DIFF(DATE_ADD(CURRENT_DATE(), INTERVAL 30 DAY), CURRENT_DATE(), MONTH) AS Same_Month,
  /* the columns below assume the dates fall in different months */
  /* take the last date of the arrival month and subtract the arrival date */
  DATE_DIFF(DATE_SUB(DATE_TRUNC(DATE_ADD(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH), INTERVAL 1 DAY), CURRENT_DATE(), DAY) AS total_room_nights_first_mont,
  /* take the first date of the departure month and subtract it from the departure date; add 1 for the night between the last day of one month and the first day of the next */
  DATE_DIFF(DATE_ADD(CURRENT_DATE(), INTERVAL 30 DAY), DATE_TRUNC(DATE_ADD(CURRENT_DATE(), INTERVAL 30 DAY), MONTH), DAY) + 1 AS total_room_nights_second_month
You can find more information in the documentation of the date functions. Click Here.

how to generate_date_array unnest with end_date after current_date but in results show me till current_date

WITH dates AS (
SELECT `day`
FROM UNNEST(GENERATE_DATE_ARRAY('2020-11-11', CURRENT_DATE(), INTERVAL 1 DAY)) `day`
)
The above gets the dates up to the current day.
The query below, where I add 60 to the end date, gets the dates up to 60 days after the current date.
WITH dates AS (
SELECT `day`
FROM UNNEST(GENERATE_DATE_ARRAY('2020-11-11', CURRENT_DATE()+60, INTERVAL 1 DAY)) `day`
)
I want to count records whose set_at_date lies between current_date and a future date, for example the number of bookings from the current day until 60 days later, but without the future dates showing up in the results. Just dates and counts up to today, like this:
date       | bookings
-----------+---------
2021-02-26 | 30
2021-02-25 | 32
2021-02-24 | 28

Get the number of remaining days after excluding date ranges in a table

create table test (start date ,"end" date);
insert into test values
('2019-05-05','2019-05-10')
,('2019-05-25','2019-06-10')
,('2019-07-05','2019-07-10')
;
I am looking for the following output, where the person is available only on the dates between start and end. For example, in May he is present for 11 days (05/05 to 05/10 and 05/25 to 05/31), and May has 31 days in total, so the output column should contain 31-11 (the total number of days minus the number of days he worked).
MonthDate  | Days
-----------+-----------
2019-05-01 | 20 (31-11)
2019-06-01 | 20 (30-10)
2019-07-01 | 26 (31-5)
I get slightly different results.
But the idea is to generate every date. Then filter out the ones that are used and aggregate:
select date_trunc('month', dte) as yyyymm,
       count(*) filter (where t.startd is null) as available_days
from (select generate_series(date_trunc('month', min(startd)), date_trunc('month', max(endd)) + interval '1 month - 1 day', interval '1 day') dte
      from test
     ) d left join
     test t
     on d.dte between t.startd and t.endd
group by date_trunc('month', dte)
order by date_trunc('month', dte);
Here is a db<>fiddle.
The free days in May are: 1, 2, 3, 4, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24.
I am counting 18 of these. So, I believe the results from this query.
If you do not want to include the end date (which is contrary to your description using "between"), then the on logic would be:
on d.dte >= t.startd and
d.dte < t.endd
But that would only get you up to 19 in May.
Your results are inconsistent. I decided to go with inclusive bounds for the simplest solution:
SELECT date_trunc('month', d)::date, count(*)
FROM  (
   SELECT generate_series(timestamp '2019-05-01', timestamp '2019-07-31', interval '1 day') d
   EXCEPT ALL
   SELECT generate_series(start_date::timestamp, end_date::timestamp, interval '1 day') x
   FROM   test
   ) sub
GROUP  BY date_trunc('month', d);
date_trunc | count
-----------+------
2019-05-01 | 18
2019-06-01 | 20
2019-07-01 | 25
db<>fiddle here
This generates all days of a given time frame (May to July of the year in your case) and excludes the days generated from all your date ranges.
Assuming at least Postgres 10. See:
What is the expected behaviour for multiple set-returning functions in SELECT clause?
Assuming data type date in your table. I cast to timestamp for best results. See:
Generating time series between two dates in PostgreSQL
Aside: don't use the reserved words start and end as identifiers.
Related:
Select rows which are not present in other table
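The counts reported in the answers above can be re-derived with a small Python sketch of the same "generate every day, then exclude the busy ones" idea. The function names are made up for illustration, both bounds of each range are treated as inclusive, and the window from May to July 2019 is taken from the second answer.

```python
from collections import Counter
from datetime import date, timedelta


def days_between(first, last):
    """All dates from first to last, both inclusive."""
    return [first + timedelta(days=i) for i in range((last - first).days + 1)]


def available_days_per_month(window_start, window_end, busy_ranges):
    """Count the days in the window that fall in no busy range, per month."""
    busy = set()
    for start, end in busy_ranges:
        busy.update(days_between(start, end))
    free = [d for d in days_between(window_start, window_end) if d not in busy]
    # Key each free day by the first day of its month.
    return Counter(d.replace(day=1) for d in free)


busy_ranges = [  # the three rows inserted into the test table
    (date(2019, 5, 5), date(2019, 5, 10)),
    (date(2019, 5, 25), date(2019, 6, 10)),
    (date(2019, 7, 5), date(2019, 7, 10)),
]
result = available_days_per_month(date(2019, 5, 1), date(2019, 7, 31), busy_ranges)
for month in sorted(result):
    print(month, result[month])
```

This prints 18 for May, 20 for June and 25 for July, the same figures as the query output above.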

Get value zero if data is not there in PostgreSQL

I have a table employee in Postgres:
Query:
SELECT DISTINCT month_last_date,number_of_cases,reopens,csat
FROM employee
WHERE month_last_date >=(date('2017-01-31') - interval '6 month')
AND month_last_date <= date('2017-01-31')
AND agent_id='analyst'
AND name='SAM';
Output:
But if there is no data in the table for one of the other months, I want the column values to be 0.
Generate all dates you are interested in, LEFT JOIN to the table and default to 0 with COALESCE:
SELECT DISTINCT -- see below
       i.month_last_date
     , COALESCE(number_of_cases, 0) AS number_of_cases -- see below
     , COALESCE(reopens, 0) AS reopens
     , COALESCE(csat, 0) AS csat
FROM  (
   SELECT date '2017-01-31' - i * interval '1 mon' AS month_last_date
   FROM   generate_series(0, 5) i -- see below
   ) i
LEFT   JOIN employee e ON e.month_last_date = i.month_last_date
                      AND e.agent_id = 'analyst' -- see below
                      AND e.name = 'SAM';
Notes
If you add or subtract an interval of 1 month and the same day does not exist in the target month, Postgres defaults to the latest existing day of that month. So this works as desired; you get the last day of each month:
SELECT date '2017-12-31' - i * interval '1 mon' -- note 31
FROM generate_series(0,11) i;
But this does not, you'd get the 28th of each month:
SELECT date '2017-02-28' - i * interval '1 mon' -- note 28
FROM generate_series(0,11) i;
The safe alternative is to subtract 1 day from the first day of the next month, like @Oto demonstrated. Related:
Daily average for the month (needs number of days in month)
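The month-end behaviour described in this note can be reproduced with a minimal Python sketch. Both helper names are hypothetical, and minus_months only imitates the clamping rule stated above (it is not Postgres code): when the same day does not exist in the target month, it falls back to the last existing day.

```python
import calendar
from datetime import date, timedelta


def minus_months(d, months):
    """Subtract whole months, clamping the day to the last day of the target month."""
    index = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def last_day_of_month(d):
    """The safe alternative: first day of the next month minus one day."""
    first_of_next = minus_months(d.replace(day=1), -1)
    return first_of_next - timedelta(days=1)


print(minus_months(date(2017, 12, 31), 1))   # 2017-11-30, still a month end
print(minus_months(date(2017, 2, 28), 1))    # 2017-01-28, no longer a month end
print(last_day_of_month(date(2017, 1, 28)))  # 2017-01-31
```

Starting from the 31st, every result is clamped to a month end, whereas starting from the 28th keeps producing the 28th. Computing the first day of the next month and stepping back one day avoids the problem.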
Here are two optimized ways to generate a series of last days of the month - up to and including a given month:
1.
SELECT (timestamp '2017-01-01' - i * interval '1 month')::date - 1 AS month_last_date
FROM generate_series(-1, 10) i; -- generate 12 months, off-by-1
Input is the first day of the month - or calculate it from a given date or timestamp with date_trunc():
SELECT date_trunc('month', timestamp '2017-01-17')::date AS this_mon1
Subtracting an interval from a date produces a timestamp. After the cast back to date we can simply subtract an integer to subtract days.
2.
SELECT m::date - 1 AS month_last_date
FROM generate_series(timestamp '2017-02-01' - interval '11 month' -- for 12 months
, timestamp '2017-02-01'
, interval '1 mon') m;
Input is the first day of the next month - or calculate it from any given date or timestamp with:
SELECT date_trunc('month', timestamp '2017-01-17' + interval '1 month')::date AS next_mon1
Related:
How do I determine the last day of the previous month using PostgreSQL?
Create list with first and last day of month for given period
Not sure you actually need DISTINCT. Typically, (agent_id, month_last_date) would be defined unique, then remove DISTINCT ...
Be sure to use the LEFT JOIN correctly. Join conditions go into the join clause, not the WHERE clause:
Explain JOIN vs. LEFT JOIN and WHERE condition performance suggestion in more detail
Finally, default to 0 with COALESCE where NULL values are filled in by the LEFT JOIN.
Note that COALESCE cannot distinguish between actual NULL values from the right table and NULL values filled in for missing rows. If your columns are not defined NOT NULL, there may be ambiguity to address.
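The pattern from this answer (generate every wanted month, look each one up, default to 0) and the ambiguity mentioned in the last note can be illustrated with a short Python sketch. The function name and the sample rows are hypothetical, and the sketch assumes at most one row per month_last_date.

```python
from datetime import date


def fill_missing_months(month_ends, rows, columns=("number_of_cases", "reopens", "csat")):
    """Return one record per requested month end; absent values become 0."""
    by_month = {row["month_last_date"]: row for row in rows}
    filled = []
    for month_end in month_ends:
        row = by_month.get(month_end, {})  # empty dict plays the unmatched LEFT JOIN row
        record = {"month_last_date": month_end}
        for col in columns:
            value = row.get(col)
            record[col] = 0 if value is None else value  # same effect as COALESCE(col, 0)
        filled.append(record)
    return filled


# Hypothetical data: one stored month, with an actual NULL (None) in reopens.
rows = [{"month_last_date": date(2017, 1, 31), "number_of_cases": 12, "reopens": None, "csat": 4}]
month_ends = [date(2016, 12, 31), date(2017, 1, 31)]
for record in fill_missing_months(month_ends, rows):
    print(record)
```

December 2016 has no row, so all three columns come back as 0. January 2017 also shows reopens as 0, although there the value was a stored NULL rather than a missing row, which is the ambiguity the note warns about.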
As I see it, you need to generate the last day of each of the last 6 months before a certain date (before "2017-01-31" in this case).
If I understand correctly, you can use this query, which generates all of these days:
SELECT (date_trunc('MONTH', mnth) + INTERVAL '1 MONTH - 1 day')::DATE
FROM
generate_series('2017-01-31'::date - interval '6 month', '2017-01-31'::date, '1 month') as mnth;
You just need to LEFT JOIN this query to your existing query, and you get the desired result.
Please note that this returns 7 records (days), not 6.