Sum and segment overlapping date ranges - sql

Our HR system specifies employee assignments, which can be concurrent. Our rostering system only allows one summary assignment for a person. Therefore I need to pre-process the HR records, so rostering can determine the number of shifts a worker is expected to work on a given day.
Looking just at worker A who has two assignments, the first is for a quarter shift and the second for a half shift, but overlapping in the middle where they work .75 shifts.
Person StartDate EndDate Shifts
A 01/01/21 04/01/21 .25
A 03/01/21 06/01/21 .5
      01---02---03---04---05---06---07
Rec 1 |-------------------|
Rec 2 |         |===================|
Total |  0.25   |  0.75   |   0.5   |
Required output.
Person StartDate EndDate ShiftCount
A 01/01/21 02/01/21 0.25
A 03/01/21 04/01/21 0.75
A 05/01/21 06/01/21 0.5
Given this data, how do we sum and segment the data? I found the same question asked for MySQL, but the MySQL version there was too old, so procedural code was suggested instead. I also found a Postgres solution, but we don't have range types.
select * from (
values
('A','01/01/21','04/01/21',0.25),
('A','03/01/21','05/01/21',0.5)
) AS Data (Person,StartDate,EndDate,Shifts);

It looks like a Gaps-and-Islands problem to me.
If it helps, cte1 is used to expand the date ranges via an ad-hoc tally table. Then cte2 is used to create the Gaps-and-Islands. The final result is then a small matter of aggregation.
Example
Set Dateformat DMY
Declare @YourTable table (Person varchar(50),StartDate Date,EndDate date,Shifts decimal(10,2))
Insert Into @YourTable values
('A','01/01/21','04/01/21',0.25)
,('A','03/01/21','05/01/21',0.5)
-- cte1: one row per person per day, summing the shifts of all assignments covering that day
;with cte1 as (
Select [Person]
,[d] = dateadd(DAY,N,StartDate)
,Shifts = sum(Shifts)
From @YourTable A
Join (
-- ad-hoc tally table: N = 0..999
Select Top 1000 N=-1+Row_Number() Over (Order By (Select Null))
From master..spt_values n1,master..spt_values n2
) B on N <= datediff(DAY,[StartDate],[EndDate])
Group By Person,dateadd(DAY,N,StartDate)
-- cte2: consecutive days with the same total get the same Grp value (an island)
), cte2 as (
Select *
,Grp = datediff(day,'1900-01-01',d)-row_number() over (partition by Person,Shifts Order by d)
From cte1
)
-- collapse each island into a single date range
Select Person
,StartDate = min(d)
,EndDate = max(d)
,Shifts = max(Shifts)
From cte2
Group By Person,Grp
Returns
Person StartDate EndDate Shifts
A 2021-01-01 2021-01-02 0.25
A 2021-01-03 2021-01-04 0.75
A 2021-01-05 2021-01-05 0.50
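To sanity-check the expected segments outside the database, the same two steps (expand to days and sum, then collapse runs of equal totals) can be sketched in Python. This is only an illustration of the idea, not the T-SQL above; the function name `segment_shifts` and the tuple layout are assumptions made for the example, and it detects islands by comparing neighbouring days rather than with the row_number() trick.

```python
from collections import defaultdict
from datetime import date, timedelta
from decimal import Decimal


def segment_shifts(rows):
    """rows: iterable of (person, start, end, shifts); dates are inclusive."""
    # Step 1: expand every range to single days and sum the shifts per person/day.
    per_day = defaultdict(Decimal)
    for person, start, end, shifts in rows:
        for offset in range((end - start).days + 1):
            per_day[(person, start + timedelta(days=offset))] += shifts

    # Step 2: walk the days in order; extend the current segment while the
    # person and total are unchanged and the day is consecutive.
    segments = []
    for (person, day), total in sorted(per_day.items()):
        last = segments[-1] if segments else None
        if (last and last[0] == person and last[3] == total
                and last[2] + timedelta(days=1) == day):
            last[2] = day
        else:
            segments.append([person, day, day, total])
    return [tuple(s) for s in segments]


rows = [
    ("A", date(2021, 1, 1), date(2021, 1, 4), Decimal("0.25")),
    ("A", date(2021, 1, 3), date(2021, 1, 5), Decimal("0.5")),
]
for segment in segment_shifts(rows):
    print(segment)
```

With the sample rows this yields three segments (01 to 02, 03 to 04, and 05 alone), matching the result set above.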

Related

PARTITION BY with date between 2 date

I am working with Azure SQL Database (SQL Server).
I am trying to build a table with one row per day, but the individual days are not stored in the source table.
The example below explains it:
TABLE STARTER: (Format Date: YYYY-MM-DD)
Date begin  Date End    Category  Value
2021-01-01  2021-01-03  1         0.2
2021-01-02  2021-01-03  1         0.1
2021-01-01  2021-01-02  2         0.3
For the result, I try to have this TABLE RESULT:
Date        Category  Value
2021-01-01  1         0.2
2021-01-01  2         0.3
2021-01-02  1         0.3 (0.2+0.1)
2021-01-02  2         0.3
2021-01-03  1         0.3 (0.2+0.1)
For each day, I want to sum the value if the day is between the beginning and the end of the date. I need to do that for each category.
In terms of SQL code, I tried something like this:
SELECT SUM(CAST(Value as float)) OVER (PARTITION BY [Date begin], Category) as value,
[Date begin],
Category,
Value
FROM [TABLE STARTER]
This code only sums the values that share the same Date begin; it does not consider all the dates between Date begin and Date End.
So my code doesn't calculate the sum for 2021-01-02 in Category 1, because that date is not written explicitly in the table (it only falls between 2021-01-01 and 2021-01-03).
Is it possible to do that in SQL?
Thanks so much for your help!
You can use a recursive CTE to expand the date ranges into a list of separate days. Then it's a matter of joining and aggregating.
For example:
with
r as (
select category,
min(date_begin) as date_begin, max(date_end) as date_end
from starter
group by category
),
d as (
select category, date_begin as d from r
union all
select d.category, dateadd(day, 1, d.d)
from d
join r on r.category = d.category
where d.d < r.date_end
)
select d.d, d.category, sum(s.value) as value
from d
join starter s on s.category = d.category
and d.d between s.date_begin and s.date_end
group by d.category, d.d;
Result:
d category value
----------- --------- -----
2021-01-01 1 0.20
2021-01-01 2 0.30
2021-01-02 1 0.30
2021-01-02 2 0.30
2021-01-03 1 0.30
See running example at db<>fiddle.
Note: SQL Server 2022 adds a GENERATE_SERIES() function that makes this query much shorter.

SQL: Getting Missing Date Values and Copy Data to Those New Dates

This seems somewhat weird, but this use case came up, and I have been struggling to come up with a solution. Let's say I have this data set:
date        value1  value2
2020-01-01  50      2
2020-01-04  23      5
2020-01-07  14      8
My goal is to try and fill in the gap between the two dates while copying whatever values were from the date before it. So for example, the data output I would want is:
date        value1  value2
2020-01-01  50      2
2020-01-02  50      2
2020-01-03  50      2
2020-01-04  23      5
2020-01-05  23      5
2020-01-06  23      5
2020-01-07  14      8
Not sure if this is something I can do with SQL but would definitely take any suggestions.
One approach is to use the window function lead() in concert with an ad-hoc tally table if you don't have a calendar table (highly suggested).
Example
;with cte as (
Select *
,nrows = datediff(day,[date],lead([date],1,[date]) over (order by [date]))
From YourTable A
)
Select date = dateadd(day,coalesce(N-1,0),[date])
,value1
,value2
From cte A
left Join (Select Top 1000 N=Row_Number() Over (Order By (Select NULL)) From master..spt_values n1 ) B
on N<=nRows
Results
date value1 value2
2020-01-01 50 2
2020-01-02 50 2
2020-01-03 50 2
2020-01-04 23 5
2020-01-05 23 5
2020-01-06 23 5
2020-01-07 14 8
EDIT: If you have a calendar table
Select Date = coalesce(B.Date,A.Date)
,value1
,value2
From (
Select Date
,value1
,value2
,Date2 = lead([date],1,[date]) over (order by [date])
From YourTable A
) A
left Join CalendarTable B on B.Date >=A.Date and B.Date< A.Date2
Another option is to use CROSS APPLY. I am not sure how you are determining what range you want from the table, but you can easily override my guess by explicitly defining @s and @e:
DECLARE @s date, @e date;
SELECT @s = MIN(date), @e = MAX(date) FROM dbo.TheTable;
-- recursive CTE: one row per day from @s to @e
;WITH d(d) AS
(
SELECT @s UNION ALL
SELECT DATEADD(DAY,1,d) FROM d
WHERE d < @e
)
SELECT d.d, x.value1, x.value2
FROM d CROSS APPLY
(
-- most recent row on or before this day
SELECT TOP (1) value1, value2
FROM dbo.TheTable
WHERE date <= d.d
AND value1 IS NOT NULL
ORDER BY date DESC
) AS x
-- OPTION (MAXRECURSION 32767) -- if date range can be > 100 days but < 89 years
-- OPTION (MAXRECURSION 0) -- if date range can be > 89 years
If you don't like the recursive CTE, you could easily use a calendar table (but presumably you'd still need a way to define the overall date range you're after as opposed to all of time).
Example db<>fiddle
In SQL Server you can also use a cursor that iterates over the dates. If it finds values for a given date, it uses them and stores them for later. In the next iteration it can then fall back to the stored values if the table has no row for that date.
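The carry-forward loop described above is easy to picture outside SQL. Here is a minimal Python sketch of it, assuming a non-empty list of (date, value1, value2) rows; the function name `fill_forward` is made up for the example and it is not a translation of any of the queries above.

```python
from datetime import date, timedelta


def fill_forward(rows):
    """rows: non-empty list of (day, value1, value2); returns one row per day."""
    known = {day: (value1, value2) for day, value1, value2 in rows}
    day, last_day = min(known), max(known)
    filled, carried = [], None
    while day <= last_day:
        # Use the stored row for this day if present, else keep the previous values.
        carried = known.get(day, carried)
        filled.append((day, *carried))
        day += timedelta(days=1)
    return filled


rows = [
    (date(2020, 1, 1), 50, 2),
    (date(2020, 1, 4), 23, 5),
    (date(2020, 1, 7), 14, 8),
]
for row in fill_forward(rows):
    print(row)
```

For the sample data this produces seven rows, one per day from 2020-01-01 to 2020-01-07, with each gap taking the values of the last known date before it.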

Add Missing Dates in Running Total

If I have a table in SQL SERVER:
DATE ITEM Quantity_Received
1/1/2016 Hammer 20
1/3/2016 Hammer 50
1/5/2016 Hammer 100
...
And I want to output:
DATE ITEM Quantity_Running
1/1/2016 Hammer 20
1/2/2016 Hammer 20
1/3/2016 Hammer 70
1/4/2016 Hammer 70
1/5/2016 Hammer 120
Where Quantity_Running is just a cumulative sum. I am confused on how I would add the missing dates to the output table that do not exist in the initial table. Note, I would need to do this for many items and would probably have a temp-table with the dates I want.
Thank you!
EDIT
And is there a way to do it such that you use an inner join?
SELECT TD.Date,
FT1.Item,
SUM(FT2.QTY) AS Cumulative
FROM tempDates TD
LEFT JOIN FutureOrders FT1
ON TD.SETTLE_DATE = FT1.SETTLE_DATE
INNER JOIN FutureOrders FT2
ON FT1.Settle_Date < ft2.Settle_Date AND ft1.ITEM= ft2.ITEM
GROUP BY ft1.ITEM, ft1.Settle_Date
tempDates is a CTE that has the dates I want. I ask because the query above returns NULL values. And sorry, but to make it a bit more complicated I actually want:
DATE ITEM Quantity_Running
1/1/2016 Hammer 150
1/2/2016 Hammer 100
1/3/2016 Hammer 100
1/4/2016 Hammer 0
1/5/2016 Hammer 0
Thought I would be able to get the answer by myself based on my simpler question.
Generate all the dates (for a specified range of dates) using a recursive cte. Thereafter left join on the generated dates cte and get the running sum for missing dates as well.
with dates_cte as
(select cast('2016-01-01' as date) dt
union all
select dateadd(dd,1,dt) from dates_cte where dt < '2016-01-31' --change this date to the maximum date needed
)
select dt,item,sum(quantity) over(partition by item order by dt) quantity_running
from (select d.dt,i.item,
coalesce(t.quantity_received,0) quantity
from dates_cte d
cross join (select distinct item from t) i
left join t on t.dt=d.dt and i.item=t.item) x
order by 2,1
Edit: To get the reverse cumulative sum excluding the current row, use
with dates_cte as
(select cast('2016-01-01' as date) dt
union all
select dateadd(dd,1,dt) from dates_cte where dt < '2016-01-31' --change this date to the maximum date needed
)
select dt,item,
sum(quantity) over(partition by item order by dt desc rows between unbounded preceding and 1 preceding) quantity_running
from (select d.dt,i.item,
coalesce(t.quantity_received,0) quantity
from dates_cte d
cross join (select distinct item from t) i
left join t on t.dt=d.dt and i.item=t.item) x
order by 2,1
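For comparison, both running sums can be checked with a few lines of Python. This is a rough sketch for a single item over an explicit date range; the function name `running_totals` and the dict-based input are assumptions for the example. One difference from the window-function query: the last day's reverse sum comes out as 0 here, whereas the SQL frame is empty for that row and returns NULL.

```python
from datetime import date, timedelta
from itertools import accumulate


def running_totals(received, first, last):
    """received: {day: quantity} for one item; first/last are inclusive dates."""
    days = [first + timedelta(days=i) for i in range((last - first).days + 1)]
    # Missing dates contribute 0, like coalesce(quantity_received, 0).
    quantities = [received.get(day, 0) for day in days]
    forward = list(accumulate(quantities))
    total = forward[-1] if forward else 0
    # Sum of everything strictly after the current day.
    reverse_exclusive = [total - running for running in forward]
    return list(zip(days, forward, reverse_exclusive))


hammer = {date(2016, 1, 1): 20, date(2016, 1, 3): 50, date(2016, 1, 5): 100}
for row in running_totals(hammer, date(2016, 1, 1), date(2016, 1, 5)):
    print(row)
```

For several items, the same function could be called once per item, which mirrors the partition by item in the queries above.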

Find From/To Dates across multiple rows - SQL Postgres

I want to be able to "book" within range of dates, but you can't book across gaps of days. So booking across multiple rates is fine as long as they are contiguous.
I am happy to change data structure/index, if there are better ways of storing start/end ranges.
So far I have a "rates" table which contains Start/End Periods of time with a daily rate.
e.g. Rates Table.
ID Price From To
1 75.00 2015-04-12 2016-04-15
2 100.00 2016-04-16 2016-04-17
3 50.00 2016-04-18 2016-04-30
For the above data I would want to return:
From To
2015-04-12 2016-04-30
For simplicity's sake, it is safe to assume that the rows never overlap. For contiguous rows, To is always 1 day before the next row's From.
For the case there is only 1 row, I would want it to return the From/To of that single row.
Also to clarify if I had the following data:
ID Price From To
1 75.00 2015-04-12 2016-04-15
2 100.00 2016-04-17 2016-04-18
3 50.00 2016-04-19 2016-04-30
4 50.00 2016-05-01 2016-05-21
Meaning where there is a gap >= 1 day it would count as a separate range.
In which case I would expect the following:
From To
2015-04-12 2016-04-15
2016-04-17 2016-05-21
Edit 1
After playing around I have come up with the following SQL which seems to work. Although I'm not sure if there are better ways/issues with it?
WITH grouped_rates AS
(SELECT
from_date,
to_date,
-- running count of group starts = group number
SUM(grp_start) OVER (ORDER BY from_date, to_date) AS grp
FROM (SELECT
from_date,
to_date,
-- 0 when this row starts the day after the previous row ends, else 1
CASE WHEN (from_date - INTERVAL '1 DAY') = lag(to_date)
OVER (ORDER BY from_date, to_date)
THEN 0
ELSE 1
END grp_start
FROM rates
GROUP BY from_date, to_date) AS start_groups)
SELECT
min(from_date) from_date,
max(to_date) to_date
FROM grouped_rates
GROUP BY grp;
This is identifying contiguous overlapping groups in the data. One approach is to find where each group begins and then do a cumulative sum. The following query adds a flag indicating if a row starts a group:
select r.*,
(case when not exists (select 1
from rates r2
-- an earlier row that overlaps or directly precedes this one
where (r2."from" < r."from" and r2."to" >= r."from" - interval '1 day') or
(r2."from" = r."from" and r2.id < r.id)
)
then 1 else 0 end) as StartFlag
from rates r;
The or in the correlation condition is to handle the situation where intervals that define a group overlap on the start date for the interval.
You can then do a cumulative sum on this flag and aggregate by that sum:
with r as (
select r.*,
(case when not exists (select 1
from rates r2
where (r2."from" < r."from" and r2."to" >= r."from" - interval '1 day') or
(r2."from" = r."from" and r2.id < r.id)
)
then 1 else 0 end) as StartFlag
from rates r
)
select min("from"), max("to")
from (select r.*,
sum(r.StartFlag) over (order by r."from") as grp
from r
) r
group by grp;
CREATE TABLE prices( id INTEGER NOT NULL PRIMARY KEY
, price MONEY
, date_from DATE NOT NULL
, date_upto DATE NOT NULL
);
-- some data (upper limit is EXCLUSIVE)
INSERT INTO prices(id, price, date_from, date_upto) VALUES
( 1, 75.00, '2015-04-12', '2016-04-16' )
,( 2, 100.00, '2016-04-17', '2016-04-19' )
,( 3, 50.00, '2016-04-19', '2016-05-01' )
,( 4, 50.00, '2016-05-01', '2016-05-22' )
;
-- SELECT * FROM prices;
-- Recursive query to "connect the dots"
WITH RECURSIVE rrr AS (
SELECT date_from, date_upto
, 1 AS nperiod
FROM prices p0
WHERE NOT EXISTS (SELECT * FROM prices nx WHERE nx.date_upto = p0.date_from) -- no preceding segment
UNION ALL
SELECT r.date_from, p1.date_upto
, 1+r.nperiod AS nperiod
FROM prices p1
JOIN rrr r ON p1.date_from = r.date_upto
)
SELECT * FROM rrr r
WHERE NOT EXISTS (SELECT * FROM prices nx WHERE nx.date_from = r.date_upto) -- no following segment
;
Result:
date_from | date_upto | nperiod
------------+------------+---------
2015-04-12 | 2016-04-16 | 1
2016-04-17 | 2016-05-22 | 3
(2 rows)
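The rule from the question (a row joins the previous range when it starts no later than the day after that range ends) can also be written as a single sorted pass. The Python below is a small sketch under that assumption, using the question's inclusive From/To dates rather than the exclusive upper bound of the last answer; `merge_contiguous` is a name invented for the example.

```python
from datetime import date, timedelta


def merge_contiguous(ranges):
    """ranges: iterable of (start, end) with inclusive dates."""
    merged = []
    for start, end in sorted(ranges):
        # Contiguous (or overlapping) when it starts by the day after the current end.
        if merged and start <= merged[-1][1] + timedelta(days=1):
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


rates = [
    (date(2015, 4, 12), date(2016, 4, 15)),
    (date(2016, 4, 17), date(2016, 4, 18)),
    (date(2016, 4, 19), date(2016, 4, 30)),
    (date(2016, 5, 1), date(2016, 5, 21)),
]
for merged_range in merge_contiguous(rates):
    print(merged_range)
```

With the second data set from the question this gives two ranges, the first ending 2016-04-15 and the second running from 2016-04-17 to 2016-05-21, because of the one-day gap on 2016-04-16.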

Group table into 15 minute intervals

T-SQL, SQL Server 2008 and up
Given a sample table of
StatusSetDateTime | UserID | Status | StatusEndDateTime | StatusDuration(in seconds)
============================================================================
2012-01-01 12:00:00 | myID | Available | 2012-01-01 13:00:00 | 3600
I need to break that down into a view that uses 15 minute intervals for example:
IntervalStart | UserID | Status | Duration
===========================================
2012-01-01 12:00:00 | myID | Available | 900
2012-01-01 12:15:00 | myID | Available | 900
2012-01-01 12:30:00 | myID | Available | 900
2012-01-01 12:45:00 | myID | Available | 900
2012-01-01 13:00:00 | myID | Available | 0
etc....
Now I've been able to search around and find some queries that will break down
I found something similar for MySql Here :
And something for T-SQL Here
But on the second example they are summing the results whereas I need to divide the total duration by the interval time (900 seconds) by user by status.
I was able to adapt the examples in the second link to split everything into intervals but the total duration time is returned and I cannot quite figure out how to get the Interval durations to split (and still sum up to the total original duration).
Thanks in advance for any insight!
edit : First Attempt
;with cte as
(select MIN(StatusDateTime) as MinDate
, MAX(StatusDateTime) as MaxDate
, convert(varchar(14),StatusDateTime, 120) as StartDate
, DATEPART(minute, StatusDateTime) /15 as GroupID
, UserID
, StatusKey
, avg(StateDuration) as AvgAmount
from AgentActivityLog
group by convert(varchar(14),StatusDateTime, 120)
, DATEPART(minute, StatusDateTime) /15
, Userid,StatusKey)
select dateadd(minute, 15*GroupID, CONVERT(datetime,StartDate+'00'))
as [Start Date]
, UserID, StatusKey, AvgAmount as [Average Amount]
from cte
edit : Second Attempt
;With cte As
(Select DateAdd(minute
, 15 * (DateDiff(minute, '20000101', StatusDateTime) / 15)
, '20000101') As StatusDateTime
, userid, statuskey, StateDuration
From AgentActivityLog)
Select StatusDateTime, userid,statuskey,Avg(StateDuration)
From cte
Group By StatusDateTime,userid,statuskey;
;with cte_max as
(
select dateadd(mi, -15, max(StatusEndDateTime)) as EndTime, min(StatusSetDateTime) as StartTime
from AgentActivityLog
), times as
(
select StartTime as Time from cte_max
union all
select dateadd(mi, 15, c.Time)
from times as c
cross join cte_max as cm
where c.Time <= cm.EndTime
)
select
t.Time, A.UserID, A.Status,
case
when t.Time = A.StatusEndDateTime then 0
else A.StatusDuration / (count(*) over (partition by A.StatusSetDateTime, A.UserID, A.Status) - 1)
end as Duration
from AgentActivityLog as A
left outer join times as t on t.Time >= A.StatusSetDateTime and t.Time <= A.StatusEndDateTime
sql fiddle demo
I've never been comfortable with using date math to split things up into partitions. It seems like there are all kinds of pitfalls to fall into.
What I prefer to do is to create a table (pre-defined, table-valued function, table variable) where there's one row for each date partition range. The table-valued function approach is particularly useful because you can build it for arbitrary ranges and partition sizes as you need. Then, you can join to this table to split things out.
partitionid starttime endtime
---------- ------------- -------------
1 8/1/2012 5:00 8/1/2012 5:15
2 8/1/2012 5:15 8/1/2012 5:30
...
I can't speak to the performance of this method, but I find the queries are much more intuitive.
It is relatively simple if you have a helper table with every 15-minute timestamp, which you JOIN to your base table via BETWEEN. You can build the helper table on the fly or keep it permanently in your database. Simple for the next guy at your company to figure out too:
-- declare a table and a timestamp variable
declare @timetbl table(t datetime)
declare @t datetime
-- set the first timestamp
set @t = '2012-01-01 00:00:00'
-- set the last timestamp, can easily be extended to cover many years
while @t <= '2013-01-01'
begin
-- populate the table with a new row, every 15 minutes
insert into @timetbl values (@t)
set @t = dateadd(mi, 15, @t)
end
-- now the Select query:
select
tt.t, aal.UserID, aal.Status,
case when aal.StatusEndDateTime <= tt.t then 0 else 900 end as Duration
-- using a shortcut for Duration, based on your comment that Start/End are always on the quarter-hour, and thus always 900 seconds or zero
from
@timetbl tt
INNER JOIN AgentActivityLog aal
on tt.t between aal.StatusSetDateTime and aal.StatusEndDateTime
order by
aal.UserID, tt.t
You can use a recursive Common Table Expression, where you keep adding your duration while the StatusEndDateTime is greater than the IntervalStart e.g.
;with cte as (
select StatusSetDateTime as IntervalStart
,UserID
,Status
,StatusDuration/(datediff(mi, StatusSetDateTime, StatusEndDateTime)/15) as Duration
, StatusEndDateTime
From AgentActivityLog
Union all
Select DATEADD(ss, Duration, IntervalStart) as IntervalStart
, UserID
, Status
, case when DATEADD(ss, Duration, IntervalStart) = StatusEndDateTime then 0 else Duration end as Duration
, StatusEndDateTime
From cte
Where IntervalStart < StatusEndDateTime
)
select IntervalStart, UserID, Status, Duration from cte
Here's a query that will do the job for you without requiring helper tables. (I have nothing against helper tables, they are useful and I use them. It is also possible to not use them sometimes.) This query allows for activities to start and end at any times, even if not whole minutes ending in :00, :15, :30, :45. If there will be millisecond portions then you'll have to do some experimenting because, following your model, I only went to second resolution.
If you have a known hard maximum duration, then remove @MaxDuration and replace it with that value, in minutes. N <= @MaxDuration is crucial to the query performing well.
DECLARE @MaxDuration int;
SET @MaxDuration = (SELECT Max(StatusDuration) / 60 FROM #AgentActivityLog);
WITH
L0 AS(SELECT 1 c UNION ALL SELECT 1),
L1 AS(SELECT 1 c FROM L0, L0 B),
L2 AS(SELECT 1 c FROM L1, L1 B),
L3 AS(SELECT 1 c FROM L2, L2 B),
L4 AS(SELECT 1 c FROM L3, L3 B),
L5 AS(SELECT 1 c FROM L4, L4 B),
Nums AS(SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) n FROM L5)
SELECT
S.IntervalStart,
Duration = DateDiff(second, S.IntervalStart, E.IntervalEnd)
FROM
#AgentActivityLog L
CROSS APPLY (
SELECT N, Offset = (N.N - 1) * 900
FROM Nums N
WHERE N <= @MaxDuration
) N
CROSS APPLY (
SELECT Edge =
DateAdd(second, N.Offset, DateAdd(minute,
DateDiff(minute, '20000101', L.StatusSetDateTime)
/ 15 * 15, '20000101')
)
) G
CROSS APPLY (
SELECT IntervalStart = Max(T.BeginTime)
FROM (
SELECT L.StatusSetDateTime
UNION ALL SELECT G.Edge
) T (BeginTime)
) S
CROSS APPLY (
SELECT IntervalEnd = Min(T.EndTime)
FROM (
SELECT L.StatusEndDateTime
UNION ALL SELECT G.Edge + '00:15:00'
) T (EndTime)
) E
WHERE
N.Offset <= L.StatusDuration
ORDER BY
L.StatusSetDateTime,
S.IntervalStart;
Here is setup script if you want to try it:
CREATE TABLE #AgentActivityLog (
StatusSetDateTime datetime,
StatusEndDateTime datetime,
StatusDuration AS (DateDiff(second, 0, StatusEndDateTime - StatusSetDateTime))
);
INSERT #AgentActivityLog -- weird end times
SELECT '20120101 12:00:00', '20120101 13:00:00'
UNION ALL SELECT '20120101 13:00:00', '20120101 13:27:56'
UNION ALL SELECT '20120101 13:27:56', '20120101 13:28:52'
UNION ALL SELECT '20120101 13:28:52', '20120120 11:00:00'
INSERT #AgentActivityLog -- 15-minute quantized end times
SELECT '20120101 12:00:00', '20120101 13:00:00'
UNION ALL SELECT '20120101 13:00:00', '20120101 13:30:00'
UNION ALL SELECT '20120101 13:30:00', '20120101 14:00:00'
UNION ALL SELECT '20120101 14:00:00', '20120120 11:00:00'
Also, here's a version that expects ONLY times that have whole minutes ending in :00, :15, :30, or :45.
DECLARE @MaxDuration int;
SET @MaxDuration = (SELECT Max(StatusDuration) / 60 FROM #AgentActivityLog);
WITH
L0 AS(SELECT 1 c UNION ALL SELECT 1),
L1 AS(SELECT 1 c FROM L0, L0 B),
L2 AS(SELECT 1 c FROM L1, L1 B),
L3 AS(SELECT 1 c FROM L2, L2 B),
L4 AS(SELECT 1 c FROM L3, L3 B),
L5 AS(SELECT 1 c FROM L4, L4 B),
Nums AS(SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) n FROM L5)
SELECT
S.IntervalStart,
Duration = CASE WHEN Offset = StatusDuration THEN 0 ELSE 900 END
FROM
#AgentActivityLog L
CROSS APPLY (
SELECT N, Offset = (N.N - 1) * 900
FROM Nums N
WHERE N <= @MaxDuration
) N
CROSS APPLY (
SELECT IntervalStart = DateAdd(second, N.Offset, L.StatusSetDateTime)
) S
WHERE
N.Offset <= L.StatusDuration
ORDER BY
L.StatusSetDateTime,
S.IntervalStart;
It really seems like having the final 0 Duration row is not correct, because then you can't just order by IntervalStart as there are duplicate IntervalStart values. What is the benefit of having rows that add 0 to the total?
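To see what the split should produce for start and end times that do not fall on a quarter-hour, here is a small Python sketch of the same idea: cut [start, end) at every 15-minute edge and report the seconds in each piece. The function name `split_into_buckets` is an assumption for the example, and unlike the requested output it emits no trailing 0-second row.

```python
from datetime import datetime, timedelta

BUCKET = timedelta(minutes=15)


def split_into_buckets(start, end):
    """Return (interval_start, seconds) pieces of [start, end) cut at quarter-hour edges."""
    # Round the start down to the quarter-hour edge that contains it.
    edge = start.replace(minute=start.minute - start.minute % 15,
                         second=0, microsecond=0)
    pieces = []
    while edge < end:
        piece_start = max(start, edge)
        piece_end = min(end, edge + BUCKET)
        pieces.append((piece_start, int((piece_end - piece_start).total_seconds())))
        edge += BUCKET
    return pieces


for piece in split_into_buckets(datetime(2012, 1, 1, 13, 0, 0),
                                datetime(2012, 1, 1, 13, 27, 56)):
    print(piece)
```

The pieces always sum to the original duration: the 13:00:00 to 13:27:56 row from the setup script splits into 900 and 776 seconds.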