How to work out overlap of the union of date intervals in BigQuery - sql

In BigQuery, given a table of date intervals, how can I find the overlap of their union with a single date interval of interest?
For example, given a table of date intervals (call this table A) as:
start_date end_date
2021-02-01 2021-05-01
2021-04-01 2021-07-01
2020-12-01 2021-03-01
2021-09-01 2021-12-01
And the single date interval of interest (call this table B) as:
start_date end_date
2021-01-01 2021-11-01
I would like to calculate the overlap between the union of the intervals in A and the interval in B, which here is 8 months.
When A's intervals are disjoint, I can solve this with the following:
SELECT
SUM(GREATEST(0, DATE_DIFF(LEAST(B.end_date, A.end_date),
GREATEST(B.start_date,A.start_date), MONTH)))
AS months_overlap
FROM
A, B
The problem arises when the date intervals in A overlap with each other, as in the example above. In that case the code double counts the overlapping parts of A's intervals, i.e. it returns 10 months for the example.
Any suggestions on how to calculate the overlap of these intervals without double counting? I thought about introducing LAG into the DATE_DIFF expression, but I haven't been able to get it to work.

Consider the approach below:
select count(1) as months_overlap
from (
select distinct date_trunc(day, month) month
from tableA, unnest(generate_date_array(start_date, end_date - 1)) day
)
join (
select distinct date_trunc(day, month) month
from tableB, unnest(generate_date_array(start_date, end_date - 1)) day
)
using(month)
If applied to the sample data in your question, the output is months_overlap = 8.

One approach is to expand the various intervals into months, join and count:
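with b_months as (
-- end_date is treated as exclusive, so the series stops one month before it;
-- this assumes every interval starts and ends on the first day of a month
select mon
from b cross join
unnest(generate_date_array(b.start_date, date_sub(b.end_date, interval 1 month), interval 1 month)) mon
),
a_months as (
select mon
from a cross join
unnest(generate_date_array(a.start_date, date_sub(a.end_date, interval 1 month), interval 1 month)) mon
)
-- distinct removes months covered by more than one interval in a
select count(distinct mon) as months_overlap
from a_months join
b_months
using (mon);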
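
As a cross-check outside BigQuery, the same figure can be computed by first merging A's intervals into their union and then clipping each merged interval to B. The following is a minimal Python sketch and not part of either answer: the function names are hypothetical, intervals are treated as half-open [start, end), and month_diff counts calendar-month boundaries, which is how DATE_DIFF(..., MONTH) in the question is assumed to behave.

```python
from datetime import date


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals into their union."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extends (or is contained in) the previous merged interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def month_diff(start, end):
    """Number of calendar-month boundaries between start and end."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def months_overlap(intervals, window):
    """Months shared by the union of `intervals` and the single `window`."""
    w_start, w_end = window
    total = 0
    for start, end in merge_intervals(intervals):
        lo, hi = max(start, w_start), min(end, w_end)
        if lo < hi:
            total += month_diff(lo, hi)
    return total


table_a = [
    (date(2021, 2, 1), date(2021, 5, 1)),
    (date(2021, 4, 1), date(2021, 7, 1)),
    (date(2020, 12, 1), date(2021, 3, 1)),
    (date(2021, 9, 1), date(2021, 12, 1)),
]
table_b = (date(2021, 1, 1), date(2021, 11, 1))

print(months_overlap(table_a, table_b))  # 8
```

Merging first is what removes the double counting: the three overlapping intervals collapse into 2020-12-01 to 2021-07-01, which contributes 6 months after clipping, and the remaining interval contributes 2.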

Related

Get days of the week from a date range in Postgres

So I have the following table :
id end_date name number_of_days start_date
1 "2022-01-01" holiday1 1 "2022-01-01"
2 "2022-03-20" holiday2 1 "2022-03-20"
3 "2022-04-09" holiday3 1 "2022-04-09"
4 "2022-05-01" holiday4 1 "2022-05-01"
5 "2022-05-04" holiday5 3 "2022-05-02"
6 "2022-07-20" holiday6 9 "2022-07-12"
I want to check if a week falls in a holiday range.
So far I can select the holidays that overlap with my chosen week (week_start_date, week_end_date), but I can't get the exact days on which the overlap happens.
This is the query I'm using. I want to add a mechanism to detect the days of the week on which the overlap happens:
SELECT * FROM holidays
where daterange(CAST(start_date AS date), CAST(end_date as date), '[]') && daterange('2022-07-18', '2022-07-26','[]')
The current query returns the overlapping holiday (id = 6). However, I'm trying to get the exact days of the week on which the overlap happens (in this case, it should be Monday, Tuesday, Wednesday).
You can use the * (intersection) operator with tsranges, generate a series of dates between the lower and upper bounds of the result, and finally print the days of the week with to_char, e.g.
SELECT
id, name, start_date, end_date, array_agg(dow) AS days
FROM (
SELECT *,
trim(
to_char(
generate_series(lower(overlap), upper(overlap),'1 day'),
'Day')) AS dow
FROM holidays
CROSS JOIN LATERAL (SELECT tsrange(start_date,end_date) *
tsrange('2022-07-18', '2022-07-26')) t (overlap)
WHERE tsrange(start_date,end_date) && tsrange('2022-07-18', '2022-07-26')) j
GROUP BY id,name,start_date,end_date,number_of_days;
id | name | start_date | end_date | days
----+----------+------------+------------+----------------------------
6 | holiday6 | 2022-07-12 | 2022-07-20 | {Monday,Tuesday,Wednesday}
(1 row)
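
The intersect-then-enumerate idea can also be checked outside Postgres. The following is a minimal Python sketch under stated assumptions: the function name overlap_weekdays is hypothetical, and both ranges are treated as inclusive on both ends, like the '[]' dateranges in the question.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def overlap_weekdays(holiday_start, holiday_end, week_start, week_end):
    """Names of the days shared by two inclusive date ranges, in date order."""
    lo = max(holiday_start, week_start)
    hi = min(holiday_end, week_end)
    names = []
    current = lo
    while current <= hi:  # empty when the ranges do not overlap
        names.append(DAY_NAMES[current.weekday()])  # weekday(): Monday == 0
        current += timedelta(days=1)
    return names


days = overlap_weekdays(date(2022, 7, 12), date(2022, 7, 20),
                        date(2022, 7, 18), date(2022, 7, 26))
print(days)  # ['Monday', 'Tuesday', 'Wednesday']
```

Using a fixed list of names instead of strftime keeps the result independent of the locale.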

Query a 30 day interval for every 30 day interval in the last year

I want to query every 30 day interval in 2021, but I don't know how to do it without a for loop in SQL.
Here's pseudocode for what I want to do with a table called _table and a date column called application_date:
for _day in range(335):
select '2021-01-01' + _day as start_date, count(*) as _count
from _table
where '2021-01-01' + _day <= application_date <= ('2021-01-01' + _day + interval '30' day )
It would output something like this:
start_date | _count
-----------+----------------------------------------------------
2021-01-01 | {number of rows between 2021-01-01 and 2021-01-31}
2021-01-02 | {number of rows between 2021-01-02 and 2021-02-01}
... | ...
2021-11-30 | {number of rows between 2021-11-30 and 2021-12-30}
2021-12-01 | {number of rows between 2021-12-01 and 2021-12-31}
Assuming that you have rows for every day, you can group the data by date, count the rows in each group, and then use the sum window function over a frame of 31 rows (the current row plus the next 30 rows; note that {number of rows between 2021-01-01 and 2021-01-31} spans 31 days, not 30):
-- sample data
WITH dataset(start_date) AS (
VALUES (date '2021-01-01'),
(date '2021-01-01'),
(date '2021-01-01'),
(date '2021-01-02'),
(date '2021-01-03'),
(date '2021-01-03')
)
-- query
select start_date
, sum(cnt) over (order by start_date ROWS BETWEEN CURRENT ROW AND 30 FOLLOWING) rolling_count_31_days
from (
select start_date
, count(*) cnt
from dataset
where year(start_date) = 2021
group by start_date
)
Output:
start_date | rolling_count_31_days
-----------+----------------------
2021-01-01 | 6
2021-01-02 | 3
2021-01-03 | 2
If some dates are missing, see the answers describing how to generate the missing dates and add them to the grouped result with cnt set to 0.
Note that Trino (the new name for PrestoSQL) has updated its support for the RANGE frame type, so there you can implement this without needing to insert the missing rows.
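
To see why missing dates matter, it can help to compute the counts by date range rather than by row position. The following is a minimal Python sketch of that idea, not the answer's SQL: the function name rolling_counts and its parameters are hypothetical, and the window is assumed to be inclusive on both ends (31 days by default, matching the frame above).

```python
from bisect import bisect_left, bisect_right
from datetime import date, timedelta


def rolling_counts(dates, first_start, last_start, window_days=31):
    """Count dates in [start, start + window_days - 1] for each daily start."""
    ordered = sorted(dates)
    counts = {}
    start = first_start
    while start <= last_start:
        end = start + timedelta(days=window_days - 1)
        # Binary search for the window edges; gaps in the data are harmless.
        counts[start] = bisect_right(ordered, end) - bisect_left(ordered, start)
        start += timedelta(days=1)
    return counts


sample = [date(2021, 1, 1)] * 3 + [date(2021, 1, 2)] + [date(2021, 1, 3)] * 2
counts = rolling_counts(sample, date(2021, 1, 1), date(2021, 1, 4))
for start, count in counts.items():
    print(start, count)
```

For the sample data this reproduces 6, 3 and 2 for the first three days, and it also yields 0 for 2021-01-04, a start date with no rows that a row-based frame would not emit at all.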

Get count of subscribers for each month in current year even if count is 0

I need to get the count of new subscribers each month of the current year.
DB Structure: Subscriber(subscriber_id, create_timestamp, ...)
Expected result:
date | count
-----------+------
2021-01-01 | 3
2021-02-01 | 12
2021-03-01 | 0
2021-04-01 | 8
2021-05-01 | 0
I wrote the following query:
SELECT
DATE_TRUNC('month',create_timestamp)
AS create_timestamp,
COUNT(subscriber_id) AS count
FROM subscriber
GROUP BY DATE_TRUNC('month',create_timestamp);
Which works but does not include months where the count is 0. It's only returning the ones that are existing in the table. Like:
"2021-09-01 00:00:00" 3
"2021-08-01 00:00:00" 9
The first subquery generates one row for each month of the year. It is LEFT JOINed to a second subquery that computes the count per month, and COALESCE() replaces the NULL for months without subscribers with 0.
-- PostgreSQL (v11)
SELECT t.cdate
, COALESCE(p.total_count, 0) total_count
FROM (select generate_series('2021-01-01'::timestamp, '2021-12-15', '1 month') as cdate) t
LEFT JOIN (SELECT DATE_TRUNC('month',create_timestamp) create_timestamp
, COUNT(subscriber_id) total_count
FROM subscriber
GROUP BY DATE_TRUNC('month',create_timestamp)) p
ON t.cdate = p.create_timestamp
See the demo at https://dbfiddle.uk/?rdbms=postgres_11&fiddle=20dcf6c1784ed0d9c5772f2487bcc221
get the count of new subscribers each month of the current year
SELECT month::date, COALESCE(s.count, 0) AS count
FROM generate_series(date_trunc('year', LOCALTIMESTAMP)
, date_trunc('year', LOCALTIMESTAMP) + interval '11 month'
, interval '1 month') m(month)
LEFT JOIN (
SELECT date_trunc('month', create_timestamp) AS month
, count(*) AS count
FROM subscriber
GROUP BY 1
) s USING (month);
That's assuming every row is a "new subscriber". So count(*) is simplest and fastest.
See:
Join a count query on generate_series() and retrieve Null values as '0'
Generating time series between two dates in PostgreSQL
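
The pattern in both answers is to build the full list of months first and then look up a count for each, defaulting to 0. The following is a minimal Python sketch of the same pattern for cross-checking; the function name monthly_counts and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import date, datetime


def monthly_counts(timestamps, year):
    """One (month_start, count) pair per month of `year`, zeros included."""
    per_month = Counter((ts.year, ts.month) for ts in timestamps)
    # Counter returns 0 for missing keys, which plays the role of COALESCE.
    return [(date(year, month, 1), per_month[(year, month)])
            for month in range(1, 13)]


# Hypothetical create_timestamp values, for illustration only.
created = [
    datetime(2021, 1, 5, 9, 30),
    datetime(2021, 1, 20, 14, 0),
    datetime(2021, 4, 2, 8, 15),
    datetime(2020, 12, 31, 23, 59),  # previous year, must not be counted
]
result = monthly_counts(created, 2021)
for month_start, count in result:
    print(month_start, count)
```

Driving the loop from the list of months, rather than from the data, is what guarantees a row for every month.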

Date range to row in postgres

I have a table in postgres like this:
id open_date close_date
5 2006-08-04 2019-12-31
There are 4897 days between them. I need to expand the date range into individual dates, with one record per day. For example:
id open_date close_date valid_date
5 2006-08-04 2019-12-31 2006-08-04
5 2006-08-04 2019-12-31 2006-08-05
5 2006-08-04 2019-12-31 2006-08-06
... .......... .......... ..........
5 2006-08-04 2019-12-31 2019-12-31
I tried the query provided here like this:
SELECT
id,
open_date,
close_date,
open_date + seq.seqnum * interval '1 day' AS valid_date
FROM
TAB1
LEFT JOIN (
SELECT
row_number() over () AS seqnum
FROM
TAB1) seq ON seqnum <= (close_date - open_date)
TAB1 contains 600 rows. Running this query produces correct records, but at most 600 per id, which means that for this date range the results only go up to 2008-06-08.
In Postgres, you would use generate_series():
select t1.*, gs.valid_date
from tab1 t1 cross join lateral
generate_series(t1.open_date, t1.close_date, interval '1 day') as gs(valid_date);
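
For comparison, the same expansion is a short loop outside the database. The following is a minimal Python sketch, assuming both endpoints are inclusive as in the expected output; the function name expand_range is hypothetical.

```python
from datetime import date, timedelta


def expand_range(row_id, open_date, close_date):
    """One (id, open_date, close_date, valid_date) tuple per day, inclusive."""
    day_count = (close_date - open_date).days + 1  # +1 keeps close_date itself
    return [(row_id, open_date, close_date, open_date + timedelta(days=offset))
            for offset in range(day_count)]


rows = expand_range(5, date(2006, 8, 4), date(2019, 12, 31))
print(rows[0][3], rows[-1][3], len(rows))
```

Because the number of generated rows depends only on the date range, it is not capped by the number of rows in the table, which was the limitation of the row_number() approach.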

Add Missing monthly dates in a timeseries data in Postgresql

I have monthly time series data in a table where each date is the last day of a month. Some of the dates are missing from the data. I want to insert those dates and put a zero value for the other attributes.
Table is as follows:
id report_date price
1 2015-01-31 40
1 2015-02-28 56
1 2015-04-30 34
2 2014-05-31 45
2 2014-08-31 47
I want to convert this table to
id report_date price
1 2015-01-31 40
1 2015-02-28 56
1 2015-03-31 0
1 2015-04-30 34
2 2014-05-31 45
2 2014-06-30 0
2 2014-07-31 0
2 2014-08-31 47
Is there any way we can do this in Postgresql?
Currently we are doing this in Python, but as our data grows day by day it is not efficient to handle the I/O just for this one task.
Thank you
You can do this using generate_series() to generate the dates and then left join to bring in the values:
with m as (
select id, min(report_date) as minrd, max(report_date) as maxrd
from t
group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select m.*, generate_series(minrd, maxrd, interval '1' month) as report_date
from m
) m left join
t
on m.id = t.id and m.report_date = t.report_date;
EDIT:
Turns out that the above doesn't quite work, because adding months to the end of month doesn't keep the last day of the month.
This is easily fixed:
with t as (
select 1 as id, date '2012-01-31' as report_date, 10 as price union all
select 1 as id, date '2012-04-30', 20
), m as (
select id, min(report_date) - interval '1 day' as minrd, max(report_date) - interval '1 day' as maxrd
from t
group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select m.*, generate_series(minrd, maxrd, interval '1' month) + interval '1 day' as report_date
from m
) m left join
t
on m.id = t.id and m.report_date = t.report_date;
The first CTE is just to generate sample data.
This is a slight improvement over Gordon's query which fails to get the last date of a month in some cases.
Essentially you generate all the month end dates between the min and max date for each id (using generate_series) and left join on this generated table to show the missing dates with 0 price.
with minmax as (
select id, min(report_date) as mindt, max(report_date) as maxdt
from t
group by id
)
select m.id, m.report_date, coalesce(t.price, 0) as price
from (select *,
generate_series(date_trunc('MONTH',mindt+interval '1' day),
date_trunc('MONTH',maxdt+interval '1' day),
interval '1' month) - interval '1 day' as report_date
from minmax
) m
left join t on m.id = t.id and m.report_date = t.report_date
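
The month-end logic is easy to get subtly wrong, so it can be useful to check the expected rows independently. The following is a minimal Python sketch that builds the month-end dates with calendar.monthrange instead of interval arithmetic; the function names month_ends and fill_missing are hypothetical, and rows are assumed to be (id, report_date, price) tuples.

```python
import calendar
from datetime import date


def month_ends(first, last):
    """Last day of every month from first's month through last's month."""
    year, month = first.year, first.month
    ends = []
    while (year, month) <= (last.year, last.month):
        # monthrange returns (weekday of day 1, number of days in the month).
        ends.append(date(year, month, calendar.monthrange(year, month)[1]))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return ends


def fill_missing(rows):
    """Add a zero-price row for each missing month end, separately per id."""
    prices = {(row_id, day): price for row_id, day, price in rows}
    dates_by_id = {}
    for row_id, day, _ in rows:
        dates_by_id.setdefault(row_id, []).append(day)
    filled = []
    for row_id in sorted(dates_by_id):
        days = dates_by_id[row_id]
        for day in month_ends(min(days), max(days)):
            filled.append((row_id, day, prices.get((row_id, day), 0)))
    return filled


table = [
    (1, date(2015, 1, 31), 40), (1, date(2015, 2, 28), 56),
    (1, date(2015, 4, 30), 34), (2, date(2014, 5, 31), 45),
    (2, date(2014, 8, 31), 47),
]
filled = fill_missing(table)
for row in filled:
    print(row)
```

Looking prices up by the (id, report_date) pair mirrors joining on both columns, so a price from one id can never be attached to another id that happens to share a date.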