How to fill in rows based on available data - sql

Using Snowflake SQL.
So my table has 2 columns: hour and customerID. Every customer will have 2 rows: one corresponding to the hour that he/she came into the store, and one corresponding to the hour that he/she left the store. With this data, I want to create a table that has every hour that a customer has been in the store. For example, a customer X entered the store at 1PM and left at 5PM, so there would be 5 rows (1 for each hour), like the screenshot below.
Here's my attempt so far:
select
hour
,first_value(customer_id) over (partition by customer_id order by hour rows between unbounded preceding and current row) as customer_id
FROM table

In Snowflake, you would typically use a table of numbers to solve this. You can use the table (generator ...) syntax to generate such a derived table, and then join it to an aggregate query that computes the hour boundaries of each customer, using an inequality condition:
select t.customer_id, dateadd(hour, n.rn, t.min_hour) final_hour
from (
select t.customer_id, min(t.hour) min_hour, max(t.hour) max_hour
from mytable t
group by t.customer_id
) t
inner join (
select row_number() over(order by null) - 1 rn
from table (generator(rowcount => 24))
) n on dateadd(hour, n.rn, t.min_hour) <= t.max_hour
order by customer_id, final_hour
This would handle up to 24 hours of visit per customer. If you need more, then you can increase the parameter to the table generator.
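For example, to cover up to a week per visit, only the rowcount changes (a sketch):
select row_number() over(order by null) - 1 rn
from table (generator(rowcount => 168)) -- 7 days x 24 hours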

So for the example case shown in the test data, where there is only one day's worth of data, GMB's solution works fine.
Once you get into many days (which may or may not have overlapping store visits; let's just pretend you cannot stay overnight in the store), the aggregation needs to happen per day,
which can be fixed via:
select t.hour::date, t.customer_id, min(t.hour) min_hour, max(t.hour) max_hour
from mytable t
group by 1,2
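Folded back into GMB's query, that per-day grouping would look something like this (a sketch):
select t.customer_id, t.day, dateadd(hour, n.rn, t.min_hour) final_hour
from (
select t.hour::date day, t.customer_id, min(t.hour) min_hour, max(t.hour) max_hour
from mytable t
group by 1, 2
) t
inner join (
select row_number() over(order by null) - 1 rn
from table (generator(rowcount => 24))
) n on dateadd(hour, n.rn, t.min_hour) <= t.max_hour
order by t.customer_id, final_hour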
But multiple entries per day either require tagged data like:
with mytable as (
select * from values
('2019-04-01 09:00:00','x','in')
,('2019-04-01 15:00:00','x','out')
,('2019-04-02 12:00:00','x','in')
,('2019-04-02 14:00:00','x','out')
v(hour, customer_id, state)
)
or for it to be inferred:
with mytable as (
select * from values ('2019-04-01 09:00:00','x','in'),('2019-04-01 15:00:00','x','out')
,('2019-04-02 12:00:00','x','in'),('2019-04-02 14:00:00','x','out')
v(hour, customer_id, state)
)
select hour::date as day
,hour
,customer_id
,state
,BITAND(row_number() over(partition by day, customer_id order by hour), 1) = 1 AS in_dir
from mytable
order by 3,1,2;
giving:
DAY HOUR CUSTOMER_ID STATE IN_DIR
2019-04-01 2019-04-01 09:00:00 x in TRUE
2019-04-01 2019-04-01 15:00:00 x out FALSE
2019-04-02 2019-04-02 12:00:00 x in TRUE
2019-04-02 2019-04-02 14:00:00 x out FALSE
Now this can be used with LEAD and QUALIFY to get true ranges that can handle multiple entries:
select customer_id
,day
,hour
,lead(hour) over (partition by customer_id, day order by hour) as exit_time
from infer_direction
qualify in_dir = true
which works by getting the next time for all rows of each day/customer, and then (via the QUALIFY) keeping only the 'in' rows.
Then we can join to the hours of the day:
select dateadd('hour', row_number() over(order by null) - 1, '00:00:00'::time) as hour
from table (generator(rowcount => 24))
Thus, with it all woven together:
with mytable as (
select hour::timestamp as hour, customer_id, state
from values
('2019-04-01 09:00:00','x','in')
,('2019-04-01 12:00:00','x','out')
,('2019-04-02 13:00:00','x','in')
,('2019-04-02 14:00:00','x','out')
,('2019-04-02 9:00:00','x','in')
,('2019-04-02 10:00:00','x','out')
v(hour, customer_id, state)
), infer_direction AS (
select hour::date as day
,hour::time as hour
,customer_id
,state
,BITAND(row_number() over(partition by day, customer_id order by hour), 1) = 1 AS in_dir
from mytable
), visit_ranges as (
select customer_id
,day
,hour
,lead(hour) over (partition by customer_id, day order by hour) as exit_time
from infer_direction
qualify in_dir = true
), time_of_day AS (
select dateadd('hour', row_number() over(order by null) - 1, '00:00:00'::time) as hour
from table (generator(rowcount => 24))
)
select t.customer_id
,t.day
,h.hour
from visit_ranges as t
join time_of_day h on h.hour between t.hour and t.exit_time
order by 1,2,3;
we get:
CUSTOMER_ID DAY HOUR
x 2019-04-01 09:00:00
x 2019-04-01 10:00:00
x 2019-04-01 11:00:00
x 2019-04-01 12:00:00
x 2019-04-02 09:00:00
x 2019-04-02 10:00:00
x 2019-04-02 13:00:00
x 2019-04-02 14:00:00
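Note that BETWEEN is inclusive on both ends, so the exit hour itself appears in the output (e.g. 12:00:00 on 2019-04-01 above); if you'd rather exclude it, change the join to h.hour >= t.hour and h.hour < t.exit_time.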

Related

Taking Count Based On Year and Month from Date Columns

I want to take a count based on from and to dates. Using the from and to dates I am trying to extract the year and month, then take a count based on that month and year. Can someone suggest how I can implement this?
Database : Snowflake
You want more or less the solution to this other question,
but here, let me do all the work for you:
WITH data_table(start_date, end_date) as (
SELECT * from values
('2022-01-15'::date, '2022-02-12'::date),
('2021-12-25'::date, '2022-03-18'::date),
('2022-02-25'::date, '2022-03-06'::date),
('2021-10-20'::date, '2022-01-07'::date)
), large_range as (
SELECT row_number() over (order by null)-1 as rn
FROM table(generator(ROWCOUNT => 1000))
), pre_condition as (
SELECT
date_trunc('month', start_date) as month_start
,datediff('month', month_start, date_trunc('month', end_date)) as m
FROM data_table
)
SELECT
to_char(dateadd('month', r.rn, d.month_start),'MON-YY') as month_yr
,count(*) as count
FROM pre_condition as d
JOIN large_range as r ON r.rn <= d.m
GROUP BY 1;
MONTH_YR  COUNT
Jan-22    3
Dec-21    2
Feb-22    3
Oct-21    1
Nov-21    1
Mar-22    2
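Note that the rows above come back in no guaranteed order; if chronological output matters, one option (a sketch) is to order by the underlying month:
SELECT
to_char(dateadd('month', r.rn, d.month_start),'MON-YY') as month_yr
,count(*) as count
FROM pre_condition as d
JOIN large_range as r ON r.rn <= d.m
GROUP BY 1
ORDER BY MIN(dateadd('month', r.rn, d.month_start));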

Select latest available SQL entry state

Consider this DDL:
CREATE TABLE cash_depot_state
(
id INTEGER NOT NULL PRIMARY KEY,
date DATE,
amount REAL,
cash_depot_id INTEGER
);
INSERT INTO cash_depot_state (date, amount, cash_depot_id)
VALUES (DATE('2022-03-02'), 382489, 5);
INSERT INTO cash_depot_state (date, amount, cash_depot_id)
VALUES (DATE('2022-03-03'), 750, 2);
INSERT INTO cash_depot_state (date, amount, cash_depot_id)
VALUES (DATE('2022-03-04'), 750, 3);
INSERT INTO cash_depot_state (date, amount, cash_depot_id)
VALUES (DATE('2022-03-05'), 0, 5);
For an array of dates I need to select sum of all cash depots' actual amounts:
2022-03-01 - no data available - expect 0
2022-03-02 - cash depot #5 has changed its value to 382489 - expect 382489
2022-03-03 - cash depot #2 has changed its value to 750 - expect 382489 + 750
2022-03-04 - cash depot #3 has changed its value to 750 - expect 382489 + 750 + 750
2022-03-05 - cash depot #5 has changed its value to 0 - expect 0 + 750 + 750
My best attempt: http://sqlfiddle.com/#!5/94ad0d/1
But I can't figure out how to pick the winner of a subgroup.
You could define the latest amount per cash depot as the record that has row number 1, when you divvy up records by cash_depot_id, and order them descending by date:
SELECT
id,
cash_depot_id,
date,
amount,
ROW_NUMBER() OVER (PARTITION BY cash_depot_id ORDER BY date DESC) rn
FROM
cash_depot_state
This will highlight the latest data from your table - all the relevant rows will have rn = 1:
id  cash_depot_id  date        amount    rn
2   2              2022-03-03  750.0     1
3   3              2022-03-04  750.0     1
4   5              2022-03-05  0.0       1
1   5              2022-03-02  382489.0  2
Now you can use a WHERE clause to filter records to a certain date, e.g. WHERE date <= '2022-03-05':
SELECT
SUM(amount) sum_amount
FROM
(
SELECT amount, ROW_NUMBER() OVER (PARTITION BY cash_depot_id ORDER BY date DESC) rn
FROM cash_depot_state
WHERE date <= '2022-03-05'
) latest
WHERE
rn = 1;
will return 1500.
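As a sanity check against the expected values listed in the question: with WHERE date <= '2022-03-03' instead, only depots 5 (382489) and 2 (750) have rows yet, and the same query returns 383239.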
A more traditional way to solve this would be a correlated sub-query:
SELECT
SUM(amount) sum_amount
FROM
cash_depot_state s
WHERE
date = (
SELECT MAX(date)
FROM cash_depot_state
WHERE date <= '2022-03-05' AND cash_depot_id = s.cash_depot_id
)
or a join against a materialized sub-query:
SELECT
SUM(amount) sum_amount
FROM
cash_depot_state s
INNER JOIN (
SELECT MAX(date) date, cash_depot_id
FROM cash_depot_state
WHERE date <= '2022-03-05'
GROUP BY cash_depot_id
) latest ON latest.cash_depot_id = s.cash_depot_id AND latest.date = s.date
In large tables, these are potentially faster than the ROW_NUMBER() variant. YMMV, take measurements.
An index that covers date, cash_depot_id, and amount helps all shown approaches:
CREATE INDEX ix_latest_cash ON cash_depot_state (date DESC, cash_depot_id ASC, amount);
To run against a CTE that produces a calendar, any of the above can be correlated as a subquery:
WITH RECURSIVE dates(date) AS (
SELECT '2022-03-01'
UNION ALL
SELECT date(date, '+1 day') FROM dates WHERE date < DATE('now')
)
SELECT
date,
IFNULL(
(
-- any of the above approaches with `WHERE date <= dates.date`
), 0
) balance
FROM
dates;
e.g. http://sqlfiddle.com/#!5/94ad0d/12
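For concreteness, here is the correlated-subquery variant plugged into that calendar template (a sketch along the lines of the linked fiddle):
WITH RECURSIVE dates(date) AS (
SELECT '2022-03-01'
UNION ALL
SELECT date(date, '+1 day') FROM dates WHERE date < DATE('now')
)
SELECT
dates.date,
IFNULL(
(
SELECT SUM(amount)
FROM cash_depot_state s
WHERE s.date = (
SELECT MAX(date)
FROM cash_depot_state
WHERE date <= dates.date AND cash_depot_id = s.cash_depot_id
)
), 0
) balance
FROM dates;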

Collapse multiple rows based on time values

I'm trying to collapse rows with consecutive time ranges within the same day into one row, but I'm having an issue because of gaps in time. For example, my dataset looks like this:
Date StartTime EndTime ID
2017-12-1 09:00:00 11:00:00 12345
2017-12-1 11:00:00 13:00:00 12345
2018-09-08 09:00:00 10:00:00 78465
2018-09-08 10:00:00 12:00:00 78465
2018-09-08 15:00:00 16:00:00 78465
2018-09-08 16:00:00 18:00:00 78465
As you can see, the first two rows can just be combined without any issue because there's no time gap within that day. However, for the entries on 2018-09-08, there is a gap between 12:00 and 15:00, and I'd like to merge these four records into two rows like this:
Date StartTime EndTime ID
2017-12-1 09:00:00 13:00:00 12345
2018-09-08 09:00:00 12:00:00 78465
2018-09-08 15:00:00 18:00:00 78465
In other words, I want to collapse the rows only when the time values are consecutive within the same day for the same ID.
Could anyone please help me with this? I tried to generate unique groups using the LAG and LEAD functions but it didn't work.
You can use a recursive CTE. Put rows into the same group if the EndTime is the same as the next row's StartTime, and then find the MIN() and MAX():
with cte as
(
select rn = row_number() over (partition by [ID], [Date] order by [StartTime]),
*
from tbl
),
rcte as
(
-- anchor member
select rn, [ID], [Date], [StartTime], [EndTime], grp = 1
from cte
where rn = 1
union all
-- recursive member
select c.rn, c.[ID], c.[Date], c.[StartTime], c.[EndTime],
grp = case when r.[EndTime] = c.[StartTime]
then r.grp
else r.grp + 1
end
from rcte r
inner join cte c on r.[ID] = c.[ID]
and r.[Date] = c.[Date]
and r.rn = c.rn - 1
)
select [ID], [Date],
min([StartTime]) as StartTime,
max([EndTime]) as EndTime
from rcte
group by [ID], [Date], grp
db<>fiddle demo
Unless you have a particular objection to collapsing non-consecutive rows, which are consecutive for that ID, you can just use GROUP BY:
SELECT
Date,
StartTime = MIN(StartTime),
EndTime = MAX(EndTime),
ID
FROM table
GROUP BY ID, Date
Otherwise you can use a solution based on ROW_NUMBER:
SELECT
Date,
StartTime,
EndTime,
ID
FROM (
SELECT *,
rn = ROW_NUMBER() OVER (PARTITION BY Date, ID ORDER BY StartTime)
FROM table
) t
WHERE rn = 1
This is an example of a gaps-and-islands problem -- actually a pretty simple example. The idea is to assign an "island" grouping to each row specifying that they should be combined because they overlap. Then aggregate.
How do you assign the island? In this case, look at the previous endtime, and if it is different from the starttime, then the row starts a new island. Voila! A cumulative sum of the start flag identifies each island.
As SQL:
select id, date, min(starttime), max(endtime)
from (select t.*,
sum(case when prev_endtime = starttime then 0 else 1 end) over (partition by id, date order by starttime) as grp
from (select t.*,
lag(endtime) over (partition by id, date order by starttime) as prev_endtime
from t
) t
) t
group by id, date, grp;
Here is a db<>fiddle.
Note: This assumes that the time periods never span multiple days. The code can easily be modified to handle that, but with a caveat: the start and end times should be stored as datetime (or a related timestamp) rather than splitting the date and times into different columns. Why? SQL Server doesn't support '24:00:00' as a valid time.
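A minimal sketch of that pre-step, assuming SQL Server date/time column types and the column names from the question (tbl as a stand-in table name):
-- merge the separate date and time columns into datetime values
SELECT ID,
CAST([Date] AS datetime) + CAST(StartTime AS datetime) AS StartDT,
CAST([Date] AS datetime) + CAST(EndTime AS datetime) AS EndDT
FROM tbl;
The gaps-and-islands query above can then compare StartDT/EndDT directly, and spans past midnight fall out naturally.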

Google Big Query SQL - Get most recent unique value by date

#EDIT - Following the comments, I have rephrased my question.
I have a BigQuery table that I want to use to get some KPIs for my application.
In this table, I save each create or update as a new line in order to keep a better history.
So the same data appears several times, each time with a different state.
Example of the table:
uuid |status |date
––––––|–––––––––––|––––––––––
3 |'inactive' |2018-05-12
1 |'active' |2018-05-10
1 |'inactive' |2018-05-08
2 |'active' |2018-05-08
3 |'active' |2018-05-04
2 |'inactive' |2018-04-22
3 |'inactive' |2018-04-18
We can see that there are multiple entries for each uuid.
What I would like to get:
I would like to have the number of currently 'active' entries (so there must be no later 'inactive' entry with the same uuid). And to complicate everything, I need this total per day.
So, for each day: the number of 'active' entries, including those from previous days.
So with this example I should have this result :
date | actives
____________|_________
2018-05-02 | 0
2018-05-03 | 0
2018-05-04 | 1
2018-05-05 | 1
2018-05-06 | 1
2018-05-07 | 1
2018-05-08 | 2
2018-05-09 | 2
2018-05-10 | 3
2018-05-11 | 3
2018-05-12 | 2
Actually I've managed to get the right number of actives for one day. But my problem is when I want the results for each day.
What I've tried:
I'm stuck with two solutions that each return a different error.
First solution:
WITH
dates AS(
SELECT GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)
arr_dates )
SELECT
i_date date,
(
SELECT COUNT(uuid)
FROM (
SELECT
uuid, status, date,
RANK() OVER(PARTITION BY uuid ORDER BY date DESC) rank
FROM users
WHERE
PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d",date)) <= i_date
)
WHERE
status = 'active'
and rank = 1
## rank is the condition which causes the error
) users
FROM
dates, UNNEST(arr_dates) i_date
ORDER BY i_date;
The SELECT with the RANK() OVER correctly returns the users with a rank column that lets me know which entry is the last one for each uuid.
But when I try this, I get a:
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN. error, because of the rank = 1 condition.
Second solution:
WITH
dates AS(
SELECT GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)
arr_dates )
SELECT
i_date date,
(
SELECT
COUNT(t1.uuid)
FROM
users t1
WHERE
t1.date = (
SELECT MAX(t2.date)
FROM users t2
WHERE
t2.uuid = t1.uuid
## Here it's the i_date condition which causes the problem
AND PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d", t2.date)) <= i_date
)
AND status='active' ) users
FROM
dates,
UNNEST(arr_dates) i_date
ORDER BY i_date;
Here, the second select works too, correctly returning the number of active users for the current day.
But the problem is when I try to use i_date to retrieve data across the multiple days.
And here I get a LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join. error...
Which solution is more likely to succeed? What should I change?
And, if my way of storing the data isn't good, how should I proceed in order to keep a precise history?
Below is for BigQuery Standard SQL
#standardSQL
SELECT date, COUNT(DISTINCT uuid) total_active
FROM `project.dataset.table`
WHERE status = 'active'
GROUP BY date
-- ORDER BY date
Update to address your "rephrased" question :o)
The example below uses the dummy data from your question.
#standardSQL
WITH `project.dataset.users` AS (
SELECT 3 uuid, 'inactive' status, DATE '2018-05-12' date UNION ALL
SELECT 1, 'active', '2018-05-10' UNION ALL
SELECT 1, 'inactive', '2018-05-08' UNION ALL
SELECT 2, 'active', '2018-05-08' UNION ALL
SELECT 3, 'active', '2018-05-04' UNION ALL
SELECT 2, 'inactive', '2018-04-22' UNION ALL
SELECT 3, 'inactive', '2018-04-18'
), dates AS (
SELECT day FROM UNNEST((
SELECT GENERATE_DATE_ARRAY(MIN(date), MAX(date))
FROM `project.dataset.users`
)) day
), active_users AS (
SELECT uuid, status, date first, DATE_SUB(next_status.date, INTERVAL 1 DAY) last FROM (
SELECT uuid, date, status, LEAD(STRUCT(status, date)) OVER(PARTITION BY uuid ORDER BY date ) next_status
FROM `project.dataset.users` u
)
WHERE status = 'active'
)
SELECT day, COUNT(DISTINCT uuid) actives
FROM dates d JOIN active_users u
ON day BETWEEN first AND IFNULL(last, day)
GROUP BY day
-- ORDER BY day
with result
Row day actives
1 2018-05-04 1
2 2018-05-05 1
3 2018-05-06 1
4 2018-05-07 1
5 2018-05-08 2
6 2018-05-09 2
7 2018-05-10 3
8 2018-05-11 3
9 2018-05-12 2
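Note the IFNULL(last, day) in the join: a user whose most recent row is 'active' has no next status, so last is NULL and that user keeps counting as active on every following day.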
I think this -- or something similar -- will do what you want:
SELECT day,
coalesce(running_actives, 0) - coalesce(running_inactives, 0)
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2015-05-11'), DATE('2018-06-29'), INTERVAL 1 DAY)
) AS day left join
(select date, sum(countif(status = 'active')) over (order by date) as running_actives,
sum(countif(status = 'inactive')) over (order by date) as running_inactives
from t
group by date
) a
on a.date = day
order by day;
The exact solution depends on whether the "inactive" is inclusive of the day (as above) or takes effect the next day. Either is handled the same way, by using cumulative sums of actives and inactives and then taking the difference.
In order to get data on all days, this generates the days using arrays and unnest(). If you have data on all days, that step may be unnecessary.
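If every date does appear, a sketch of the simplified form (same cumulative actives/inactives logic, minus the generated calendar):
select date,
sum(countif(status = 'active')) over (order by date)
- sum(countif(status = 'inactive')) over (order by date) as actives
from t
group by date
order by date;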

Exclude overlapping periods in time aggregate function

I have a table where each row contains a start and an end date:
DROP TABLE temp_period;
CREATE TABLE public.temp_period
(
id integer NOT NULL,
"startDate" date,
"endDate" date
);
INSERT INTO temp_period(id,"startDate","endDate") VALUES(1,'2010-01-01','2010-03-31');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(2,'2013-05-17','2013-07-18');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(3,'2010-02-15','2010-05-31');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(7,'2014-01-01','2014-12-31');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(56,'2014-03-31','2014-06-30');
Now I want to know the total duration of all periods stored there. I need just the time as an interval. That's pretty easy:
SELECT sum(age("endDate","startDate")) FROM temp_period;
However, the problem is: Those periods do overlap. And I want to eliminate all overlapping periods, so that I get the total amount of time which is covered by at least one record in the table.
You see, there are quite a few gaps between the periods, so passing the smallest start date and the most recent end date to the age function won't do the trick. I thought about doing that and subtracting the total length of the gaps, but no elegant way to do that came to mind.
I use PostgreSQL 9.6.
What about this:
WITH
/* get all time points where something changes */
points AS (
SELECT "startDate" AS p
FROM temp_period
UNION SELECT "endDate"
FROM temp_period
),
/*
* Get all date ranges between these time points.
* The first time range will start with NULL,
* but that will be excluded in the next CTE anyway.
*/
inter AS (
SELECT daterange(
lag(p) OVER (ORDER BY p),
p
) i
FROM points
),
/*
* Get all date ranges that are contained
* in at least one of the intervals.
*/
overlap AS (
SELECT DISTINCT i
FROM inter
CROSS JOIN temp_period
WHERE i <# daterange("startDate", "endDate")
)
/* sum the lengths of the date ranges */
SELECT sum(age(upper(i), lower(i)))
FROM overlap;
For your data it will return:
┌──────────┐
│ interval │
├──────────┤
│ 576 days │
└──────────┘
(1 row)
You could try using a recursive CTE to calculate the period. For each record, we check whether it overlaps previous records; if it does, we only count the part of the period that does not overlap.
WITH RECURSIVE days_count AS
(
SELECT startDate,
endDate,
AGE(endDate, startDate) AS total_days,
rowSeq
FROM ordered_data
WHERE rowSeq = 1
UNION ALL
SELECT GREATEST(curr.startDate, prev.endDate) AS startDate,
GREATEST(curr.endDate, prev.endDate) AS endDate,
AGE(GREATEST(curr.endDate, prev.endDate), GREATEST(curr.startDate, prev.endDate)) AS total_days,
curr.rowSeq
FROM ordered_data curr
INNER JOIN days_count prev
ON curr.rowSeq > 1
AND curr.rowSeq = prev.rowSeq + 1),
ordered_data AS
(
SELECT *,
ROW_NUMBER() OVER (ORDER BY startDate) AS rowSeq
FROM temp_period)
SELECT SUM(total_days) AS total_days
FROM days_count;
I've created a demo here
Actually, there is a case that is not covered by the previous examples.
What if we have a period like this?
INSERT INTO temp_period(id,"startDate","endDate") VALUES(100,'2010-01-03','2010-02-10');
We have the following intervals:
Interval No. | start_date | end_date
-------------+------------+-----------
1            | 2010-01-01 | 2010-03-31
2            | 2010-01-03 | 2010-02-10
3            | 2010-02-15 | 2010-05-31
4            | 2013-05-17 | 2013-07-18
5            | 2014-01-01 | 2014-12-31
6            | 2014-03-31 | 2014-06-30
Even though segment 3 overlaps segment 1, the lag-based query only compares it to the immediately preceding end date (that of segment 2), so it is seen as a new segment, hence the (wrong) result:
sum
-----
620
(1 row)
The solution is to tweak the core of the query: the expression
CASE WHEN start_date < lag(end_date) OVER (ORDER BY start_date, end_date) then NULL ELSE start_date END
needs to be replaced by
CASE WHEN start_date < max(end_date) OVER (ORDER BY start_date, end_date rows between unbounded preceding and 1 preceding) then NULL ELSE start_date END
then it works as expected
sum
-----
576
(1 row)
Summary:
SELECT sum(e - s)
FROM (
SELECT left_edge as s, max(end_date) as e
FROM (
SELECT start_date, end_date, max(new_start) over (ORDER BY start_date, end_date) as left_edge
FROM (
SELECT start_date, end_date, CASE WHEN start_date < max(end_date) OVER (ORDER BY start_date, end_date rows between unbounded preceding and 1 preceding) then NULL ELSE start_date END AS new_start
FROM temp_period
) s1
) s2
GROUP BY left_edge
) s3;
This one requires two outer joins on a complex query. One join identifies all overlaps with a start date later than the current row's and expands the timespan to match the larger of the two. The second join is needed to match records with no overlaps. Then take the min of the min and the max of the max, including non-matched rows. I was using MSSQL, so the syntax may be a bit different.
DECLARE @temp_period TABLE
(
id int NOT NULL,
startDate datetime,
endDate datetime
)
INSERT INTO @temp_period(id,startDate,endDate) VALUES(1,'2010-01-01','2010-03-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(2,'2013-05-17','2013-07-18')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(3,'2010-02-15','2010-05-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(3,'2010-02-15','2010-07-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(7,'2014-01-01','2014-12-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(56,'2014-03-31','2014-06-30')
;WITH OverLaps AS
(
SELECT
Main.id,
OverlappedID=Overlaps.id,
OverlapMinDate,
OverlapMaxDate
FROM
@temp_period Main
LEFT OUTER JOIN
(
SELECT
This.id,
OverlapMinDate=CASE WHEN This.StartDate<Prior.StartDate THEN This.StartDate ELSE Prior.StartDate END,
OverlapMaxDate=CASE WHEN This.EndDate>Prior.EndDate THEN This.EndDate ELSE Prior.EndDate END,
PriorID=Prior.id
FROM
@temp_period This
LEFT OUTER JOIN @temp_period Prior ON Prior.endDate > This.startDate AND Prior.startdate < this.endDate AND This.Id<>Prior.ID
) Overlaps ON Main.Id=Overlaps.PriorId
)
SELECT
T.Id,
--If has overlapped then sum all overlapped records prior to this one, else not and overlap get the start and end
MinDate=MIN(COALESCE(HasOverlapped.OverlapMinDate,startDate)),
MaxDate=MAX(COALESCE(HasOverlapped.OverlapMaxDate,endDate))
FROM
@temp_period T
LEFT OUTER JOIN OverLaps IsAOverlap ON IsAOverlap.OverlappedID=T.id
LEFT OUTER JOIN OverLaps HasOverlapped ON HasOverlapped.Id=T.id
WHERE
IsAOverlap.OverlappedID IS NULL -- Exclude older records that have overlaps
GROUP BY
T.Id
Beware: the answer by Laurenz Albe has a huge scalability issue.
I was more than happy when I found it and customized it for our needs. But once we deployed it to staging, the server soon took several minutes to return the results.
Then I found this answer on postgresql.org, which is much more efficient:
https://wiki.postgresql.org/wiki/Range_aggregation
SELECT sum(e - s)
FROM (
SELECT left_edge as s, max(end_date) as e
FROM (
SELECT start_date, end_date, max(new_start) over (ORDER BY start_date, end_date) as left_edge
FROM (
SELECT start_date, end_date, CASE WHEN start_date < lag(end_date) OVER (ORDER BY start_date, end_date) then NULL ELSE start_date END AS new_start
FROM temp_period
) s1
) s2
GROUP BY left_edge
) s3;
Result:
sum
-----
576
(1 row)