SQL grouping data with overlapping timespans

I need to group together records that are related to each other by overlapping timespans, based on each record's start and end times. SQL fiddle here: http://sqlfiddle.com/#!18/87e4b/1/0
The query I have built so far gives incorrect results. Callid 3 should have a callCount of 4. It does not, because record 6 is excluded: it does not overlap with record 3 directly, but it should be counted because it overlaps with one of the other related records. So I believe a recursive CTE may be needed, but I am unsure how to write it.
Schema:
CREATE TABLE Calls
([callid] int, [src] varchar(10), [start] datetime, [end] datetime, [conf] varchar(5));
INSERT INTO Calls
([callid],[src],[start],[end],[conf])
VALUES
('1','5555550001','2019-07-09 10:00:00', '2019-07-09 10:10:00', '111'),
('2','5555550002','2019-07-09 10:00:01', '2019-07-09 10:11:00', '111'),
('3','5555550011','2019-07-09 11:00:00', '2019-07-09 11:10:00', '111'),
('4','5555550012','2019-07-09 11:00:01', '2019-07-09 11:11:00', '111'),
('5','5555550013','2019-07-09 11:01:00', '2019-07-09 11:15:00', '111'),
('6','5555550014','2019-07-09 11:12:00', '2019-07-09 11:16:00', '111'),
('7','5555550014','2019-07-09 15:00:00', '2019-07-09 15:01:00', '111');
Current query:
SELECT
    detail_record.callid,
    detail_record.conf,
    MIN(related_record.start) AS sessionStart,
    MAX(related_record.[end]) AS sessionEnd,
    COUNT(related_record.callid) AS callCount
FROM
    Calls AS detail_record
    INNER JOIN Calls AS related_record
        ON related_record.conf = detail_record.conf
        AND ((related_record.start >= detail_record.start
              AND related_record.start < detail_record.[end])
          OR (related_record.[end] > detail_record.start
              AND related_record.[end] <= detail_record.[end])
          OR (related_record.start <= detail_record.start
              AND related_record.[end] >= detail_record.[end]))
WHERE
    detail_record.start > '1/1/2019'
    AND detail_record.conf = '111'
GROUP BY
    detail_record.callid,
    detail_record.start,
    detail_record.conf
HAVING
    MIN(related_record.start) >= detail_record.start
ORDER BY sessionStart DESC
Expected Results:
callid conf sessionStart sessionEnd callCount
7 111 2019-07-09T15:00:00Z 2019-07-09T15:01:00Z 1
3 111 2019-07-09T11:00:00Z 2019-07-09T11:15:00Z 4
1 111 2019-07-09T10:00:00Z 2019-07-09T10:11:00Z 2

This is a gaps-and-islands problem. It does not require a recursive CTE. You can use window functions:
select min(callid), conf, grouping, min([start]), max([end]), count(*)
from (select c.*,
             sum(case when prev_end < [start] then 1 else 0 end) over (order by [start]) as grouping
      from (select c.*,
                   max([end]) over (partition by conf order by [start]
                                    rows between unbounded preceding and 1 preceding) as prev_end
            from calls c
           ) c
     ) c
group by conf, grouping;
The innermost subquery calculates the previous end for each row. The middle subquery compares this to the current start to flag rows that begin a new group; a cumulative sum of those flags then assigns the group number.
The outer query then aggregates to summarize each group.
Here is a db<>fiddle.
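If it helps to see what the window functions produce before the final aggregation, here is a minimal sketch (assuming the same Calls table as above; the grp alias is just illustrative) that exposes prev_end and the running group number. For the sample data it yields the groups {1, 2}, {3, 4, 5, 6} and {7}:
select c.callid, c.[start], c.[end], c.prev_end,
       -- a new group starts whenever every earlier call has already ended
       sum(case when c.prev_end < c.[start] then 1 else 0 end)
           over (partition by c.conf order by c.[start]) as grp
from (select c.*,
             max(c.[end]) over (partition by c.conf order by c.[start]
                                rows between unbounded preceding and 1 preceding) as prev_end
      from Calls c
     ) c
order by c.conf, c.[start];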

Related

Calculate date difference between dates based on a specific condition

I have a table History with the columns date, person and status, and I need to know the total amount of time spent from when an item is created until it reaches the Finished status (Finished can occur multiple times). I need the datediff from the first time the item is created until the first time its status is Finished; after that, I need the next date where it is not Finished and take the datediff again up to the next Finished date, and so on. Another condition is to include a calculation only if the Person who changed the status is not null. After that I need to sum all the times to get the total.
I tried with the LEAD and LAG functions but was not getting the results I need.
First let's talk about providing demo data. Here's a good way to do it:
Create a table variable similar to your actual object(s) and then populate it:
DECLARE #statusTable TABLE (Date DATETIME, Person INT, Status NVARCHAR(10), KeyID NVARCHAR(7))
INSERT INTO #statusTable (Date, Person, Status, KeyID) VALUES
('2022-10-07 07:01:17.463', 1, 'Start', 'AAA-111'),
('2022-10-07 07:01:17.463', 1, 'Waiting', 'AAA-111'),
('2022-10-11 14:01:44.463', 1, 'Waiting', 'AAA-111'),
('2022-10-14 10:04:17.463', 1, 'Waiting', 'AAA-111'),
('2022-10-14 10:04:17.463', 1, 'Finished','AAA-111'),
('2022-10-14 10:04:17.463', 1, 'Waiting', 'AAA-111'),
('2022-10-17 17:01:17.463', 1, 'Waiting', 'AAA-111'),
('2022-10-21 11:03:17.463', 1, 'Waiting', 'AAA-111'),
('2022-10-21 11:03:17.463', 1, 'Finished','AAA-111'),
('2022-10-21 11:03:17.463', 1, 'Waiting', 'AAA-111'),
('2022-10-21 11:04:17.463', NULL, 'Waiting', 'AAA-111'),
('2022-10-21 11:05:17.463', 1, 'Finished','AAA-111')
Your problem is recursive in nature, so we can use a recursive CTE (rCTE) to solve it.
;WITH base AS (
    SELECT *,
           CASE WHEN LAG(Status, 1) OVER (PARTITION BY KeyID ORDER BY Date) <> 'Waiting'
                 AND Status = 'Waiting' THEN 1 END AS isStart,
           ROW_NUMBER() OVER (PARTITION BY KeyID ORDER BY Date) AS rn
    FROM #statusTable
), rCTE AS (
    SELECT date AS startDate, date, Person, Status, KeyID, isStart, rn
    FROM base
    WHERE isStart = 1
    UNION ALL
    SELECT a.startDate, r.date, r.Person, r.Status, a.KeyID, r.isStart, r.rn
    FROM rCTE a
    INNER JOIN base r
        ON a.rn + 1 = r.rn
       AND a.KeyID = r.KeyID
       AND r.isStart IS NULL
)
SELECT startDate, MAX(date) AS FinishDate, KeyID, DATEDIFF(MINUTE, startDate, MAX(Date)) AS Minutes
FROM rCTE
GROUP BY rCTE.startDate, KeyID
HAVING COUNT(Person) = COUNT(KeyID)
StartDate FinishDate KeyID Minutes
---------------------------------------------------------------
2022-10-07 07:01:17.463 2022-10-14 10:04:17.463 AAA-111 10263
2022-10-14 10:04:17.463 2022-10-21 11:03:17.463 AAA-111 10139
What we're doing here is finding and marking the starts. Since a Start row, when present, shares its timestamp with the first Waiting row, and a Start row isn't always there, we use the first Waiting row as the start marker.
Then we walk forward and find the next Finished row for that KeyID.
Using this we can group on the StartDate, take MAX(Date) as the FinishDate, and then use DATEDIFF to calculate the difference.
Finally, we compare the count of KeyID values to the count of Person values. If any row in the group has a NULL Person, the counts will not match and that group is discarded.
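The HAVING trick works because COUNT(column) skips NULLs while counting a non-null column counts every row. A minimal, hypothetical illustration (the inline VALUES rows are made up just for this demo):
-- COUNT(Person) ignores the NULL row, so it comes out lower than COUNT(KeyID)
-- and HAVING COUNT(Person) = COUNT(KeyID) rejects that group.
SELECT COUNT(Person) AS personCount, COUNT(KeyID) AS keyIdCount
FROM (VALUES (1, 'AAA-111'), (NULL, 'AAA-111')) AS t(Person, KeyID);
-- personCount = 1, keyIdCount = 2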
select min(date) as start
,max(date) as finish
,datediff(millisecond, min(date), max(date)) as diff_in_millisecond
,sum(datediff(millisecond, min(date), max(date))) over() as total_diff_in_millisecond
from
(
select *
,count(case when Status = 'Finished' then 1 end) over(order by date desc, status desc) as grp
,case when person is null then 0 else 1 end as flg
from t
) t
group by grp
having min(flg) = 1
order by start
start                        finish                       diff_in_millisecond  total_diff_in_millisecond
---------------------------------------------------------------------------------------------------------
2022-10-07 07:01:17.4630000  2022-10-14 10:04:28.4730000  615791010            1242093518
2022-10-14 10:04:28.4730000  2022-10-21 11:03:06.7170000  608318244            1242093518
2022-10-26 12:46:14.7730000  2022-10-26 17:45:59.0370000  17984264             1242093518
Fiddle

Collapse multiple rows into a single row based upon a break condition

I have a simple-sounding requirement that has had me stumped for a day or so now, so it's time to seek help from the experts.
My requirement is simply to roll up multiple rows into a single row based upon a break condition: when any of the columns Employee ID, Allowance Plan, Allowance Amount or To Date changes, the row is to be kept, if that makes sense.
An example source data set is shown below:
and the target data after collapsing the rows should look like this:
As you can see, I don't need any kind of running total calculated; I just need to collapse the rows into a single record per From Date/To Date combination.
So far I have tried the following SQL using a GROUP BY and MIN function
select [Employee ID], [Allowance Plan],
min([From Date]), max([To Date]), [Allowance Amount]
from [dbo].[#AllowInfo]
group by [Employee ID], [Allowance Plan], [Allowance Amount]
but that just gives me a single row and does not take into account the break condition.
What do I need to do so that the records are rolled up correctly (correct me if that is not the right terminology), taking the break condition into account?
Any help is appreciated.
Thank you.
Note that your test data does not really exercise the algorithm that well; e.g. you only have one employee and one plan. Also, as you described it, you would end up with 4 rows, as ToDate changes between rows 7->8, 8->9, 9->10 and 10->11.
But I can see what you are trying to do, so this should at least get you on the right track, and it returns the expected 3 rows. I have taken the end of a group to be where employee/plan/amount has changed, where ToDate is not null, or where we reach the end of the data.
CREATE TABLE #data
(
RowID INT,
EmployeeID INT,
AllowancePlan VARCHAR(30),
FromDate DATE,
ToDate DATE,
AllowanceAmount DECIMAL(12,2)
);
INSERT INTO #data(RowID, EmployeeID, AllowancePlan, FromDate, ToDate, AllowanceAmount)
VALUES
(1,200690,'CarAllowance','30/03/2017', NULL, 1000.0),
(2,200690,'CarAllowance','01/08/2017', NULL, 1000.0),
(6,200690,'CarAllowance','23/04/2018', NULL, 1000.0),
(7,200690,'CarAllowance','30/03/2018', NULL, 1000.0),
(8,200690,'CarAllowance','21/06/2018', '01/04/2019', 1000.0),
(9,200690,'CarAllowance','04/11/2021', NULL, 1000.0),
(10,200690,'CarAllowance','30/03/2017', '13/05/2022', 1000.0),
(11,200690,'CarAllowance','14/05/2022', NULL, 850.0);
-- find where the break points are
WITH chg AS
(
SELECT *,
CASE WHEN LAG(EmployeeID, 1, -1) OVER(ORDER BY RowID) != EmployeeID
OR LAG(AllowancePlan, 1, 'X') OVER(ORDER BY RowID) != AllowancePlan
OR LAG(AllowanceAmount, 1, -1) OVER(ORDER BY RowID) != AllowanceAmount
OR LAG(ToDate, 1) OVER(ORDER BY RowID) IS NOT NULL
THEN 1 ELSE 0 END AS NewGroup
FROM #data
),
-- count the number of break points as we go to group the related rows
grp AS
(
SELECT chg.*,
ISNULL(
SUM(NewGroup)
OVER (ORDER BY RowID
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
0) AS grpNum
FROM chg
)
SELECT MIN(grp.RowID) AS RowID,
MAX(grp.EmployeeID) AS EmployeeID,
MAX(grp.AllowancePlan) AS AllowancePlan,
MIN(grp.FromDate) AS FromDate,
MAX(grp.ToDate) AS ToDate,
MAX(grp.AllowanceAmount) AS AllowanceAmount
FROM grp
GROUP BY grpNum
One way is to get, for each row, its effective last ToDate (the next non-null ToDate), and then group on that:
select min(t.RowID) as RowID,
t.EmployeeID,
min(t.AllowancePlan) as AllowancePlan,
min(t.FromDate) as FromDate,
max(t.ToDate) as ToDate,
min(t.AllowanceAmount) as AllowanceAmount
from ( select t.RowID,
t.EmployeeID,
t.FromDate,
t.AllowancePlan,
t.AllowanceAmount,
case when t.ToDate is null then ( select top 1 t2.ToDate
from test t2
where t2.EmployeeID = t.EmployeeID
and t2.ToDate is not null
and t2.FromDate > t.FromDate -- t2.RowID > t.RowID
order by t2.RowID, t2.FromDate
)
else t.ToDate
end as todate
from test t
) t
group by t.EmployeeID, t.ToDate
order by t.EmployeeID, min(t.RowID)
See and test yourself in this DBFiddle
The result is:
RowID  EmployeeID  AllowancePlan  FromDate    ToDate      AllowanceAmount
1      200690      CarAllowance   2017-03-30  2019-04-01  1000
9      200690      CarAllowance   2021-11-04  2022-05-13  1000
11     200690      CarAllowance   2022-05-14  (null)      850

Insert missing dates into existing table

I have a query that finds missing dates from a table.
The query is:
;WITH NullGaps AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY ChannelName, ReadingDate) AS ID,
SerialNumber, ReadingDate, ChannelName, uid
FROM
[UriData]
)
SELECT
(DATEDIFF(MINUTE, g1.ReadingDate , g2.ReadingDate) / 15) -1 AS 'MissingCount',
g1.ReadingDate AS 'FromDate', g2.ReadingDate AS 'ToDate'
FROM
NullGaps g1
INNER JOIN
NullGaps g2 ON g1.ID = (g2.ID - 1)
WHERE
DATEADD(MINUTE, 15, g1.ReadingDate) < g2.ReadingDate
The output is:
--------------------------------------------------------------
| MissingCount | FromDate | ToDate |
--------------------------------------------------------------
| 2 | 2018-09-20 14:30:00 | 2018-09-20 15:15:00 |
| 1 | 2018-09-20 15:30:00 | 2018-09-20 16:00:00 |
| 1 | 2018-09-20 20:30:00 | 2018-09-20 21:00:00 |
--------------------------------------------------------------
The output is the number of datetimes that are missing from the FromDate to the ToDate (which both exist). For example, in the first row of the output (above), the times I want to create and insert will be '2018-09-20 14:45:00' and '2018-09-20 15:00:00' (they are all 15-minute intervals)
I need to understand how to create the new dates and insert them into an existing table. I can create one date, but I can't create the dates where there are multiple missing values between two times.
TIA
SQL Fiddle
What if you also want to find the missing datetimes at the start and at the end of a day?
Then comparing against generated datetimes should be a viable method.
Such dates can be generated via a Recursive CTE.
Then you can join your data to the Recursive CTE and select those that are missing.
Or use a NOT EXISTS.
For example:
WITH RCTE AS
(
select [SerialNumber], [ChannelName], 0 as Lvl, cast(cast([ReadingDate] as date) as datetime) as ReadingDate
from [UriData]
group by SerialNumber, [ChannelName], cast([ReadingDate] as date)
union all
select [SerialNumber], [ChannelName], Lvl + 1, DATEADD(MINUTE,15,[ReadingDate])
from RCTE
where cast([ReadingDate] as date) = cast(DATEADD(MINUTE,15,[ReadingDate]) as date)
)
SELECT [SerialNumber], [ChannelName], [ReadingDate] AS FromDate
FROM RCTE r
WHERE NOT EXISTS
(
select 1
from [UriData] t
where t.[SerialNumber] = r.[SerialNumber]
and t.[ChannelName] = r.[ChannelName]
and t.[ReadingDate] = r.[ReadingDate]
);
A test can be found here
And here's another query that takes a different approach:
WITH CTE AS
(
SELECT SerialNumber, ChannelName, ReadingDate,
LAG(ReadingDate) OVER (PARTITION BY SerialNumber, ChannelName ORDER BY ReadingDate) AS prevReadingDate
FROM [UriData]
)
, RCTE AS
(
select SerialNumber, ChannelName, 0 as Lvl,
prevReadingDate AS ReadingDate,
prevReadingDate AS MinReadingDate,
ReadingDate AS MaxReadingDate
from CTE
where DATEDIFF(MINUTE, prevReadingDate, ReadingDate) > 15
union all
select SerialNumber, ChannelName, Lvl + 1,
DATEADD(MINUTE,15,ReadingDate),
MinReadingDate,
MaxReadingDate
from RCTE
where ReadingDate < DATEADD(MINUTE,-15,MaxReadingDate)
)
select SerialNumber, ChannelName,
ReadingDate AS FromDate,
DATEADD(MINUTE,15,ReadingDate) AS ToDate,
dense_rank() over (partition by SerialNumber, ChannelName order by MinReadingDate) as GapRank,
(DATEDIFF(MINUTE, MinReadingDate, MaxReadingDate) / 15) AS TotalMissingQuarterGaps
from RCTE
where Lvl > 0 AND MinReadingDate < MaxReadingDate
ORDER BY SerialNumber, ChannelName, MinReadingDate;
You can test that one here
I don't understand your query for calculating missing values. Your question doesn't have sample data or explain the logic. I'm pretty sure that lag() would be much simpler.
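For what it's worth, here is a hedged sketch of that lag() idea, assuming the UriData columns referenced in your query (SerialNumber, ChannelName, ReadingDate) and a 15-minute cadence per SerialNumber/ChannelName pair; adjust the partition to your actual keys:
select SerialNumber, ChannelName,
       prevReadingDate as FromDate,
       ReadingDate as ToDate,
       (datediff(minute, prevReadingDate, ReadingDate) / 15) - 1 as MissingCount
from (select u.*,
             -- previous reading for the same device/channel
             lag(ReadingDate) over (partition by SerialNumber, ChannelName
                                    order by ReadingDate) as prevReadingDate
      from UriData u
     ) u
where datediff(minute, prevReadingDate, ReadingDate) > 15;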
But given your query (or any other), one method to expand out the data is to use a recursive CTE:
with missing as (<your query here>),
cte as (
select dateadd(minute, 15, fromdate) as dte, missingcount - 1 as missingcount
from missing
union all
select dateadd(minute, 15, dte), missingcount - 1
from cte
where missingcount > 0
)
select *
from cte;
If you have more than 100 missing times in one row, then add option (maxrecursion 0) to the end of the query.
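For example, hedged on the CTE name used above, the hint goes on the outermost statement:
select *
from cte
option (maxrecursion 0);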
Based on the information shared with me, I did the following which does what I need.
The first part finds the missing date ranges by locating the from and to dates that have missing dates between them, and inserts them into a table. That table is for auditing, but it also holds the missing-date ranges I am looking for:
;WITH NullGaps AS(
SELECT ROW_NUMBER() OVER (ORDER BY ChannelName, ReadingDate) AS ID,SerialNumber, ReadingDate, ChannelName, uid
FROM [Staging].[UriData]
)
INSERT INTO [Staging].[MissingDates]
SELECT (DATEDIFF(MINUTE, g1.ReadingDate , g2.ReadingDate) / 15) -1 AS 'MissingCount',
g1.ChannelName,
g1.SerialNumber,
g1.ReadingDate AS FromDate,
g2.ReadingDate AS ToDate
FROM NullGaps g1
INNER JOIN NullGaps g2
ON g1.ID = (g2.ID - 1)
WHERE DATEADD(MINUTE, 15, g1.ReadingDate) < g2.ReadingDate
AND g1.ChannelName IN (SELECT ChannelName FROM staging.ActiveChannels)
AND NOT EXISTS(
SELECT 1 FROM [Staging].[MissingDates] m
WHERE m.Channel = g1.ChannelName
AND m.Serial = g1.SerialNumber
AND m.FromDate = g1.ReadingDate
AND m.ToDate = g2.ReadingDate
)
Now that I have the ranges to look for, I can now create the missing dates and insert them into the table that holds real data.
;WITH MissingDateTime AS(
SELECT DATEADD(MINUTE, 15, FromDate) AS dte, MissingCount -1 AS MissingCount, Serial, Channel
FROM [Staging].[MissingDates]
UNION ALL
SELECT DATEADD(MINUTE, 15, dte), MissingCount - 1, Serial, Channel
FROM MissingDateTime
WHERE MissingCount > 0
) -- END CTE
INSERT INTO [Staging].[UriData]
SELECT NEWID(), Serial, Channel, '999', '0', dte, CURRENT_TIMESTAMP, 0,1,0 FROM MissingDateTime m
WHERE NOT EXISTS(
SELECT 1 FROM [Staging].[UriData] u
WHERE u.ChannelName = m.Channel
AND u.SerialNumber = m.Serial
AND u.ReadingDate = m.dte
) -- END SELECT
I am sure you can offer improvements to this. This solution finds only the missing dates and allows me to backfill my data table with just those dates. I can also change the interval later, should other devices need different intervals. I have put the queries in two separate stored procedures so I can control both aspects: one for auditing and one for backfilling.

Group records when date is within N minutes

It's not as simple as creating intervals of time that are N minutes long. One record might be 10:04, and the other 10:17 where N is 15.
Perhaps a user-function will work, maybe a CTE. It could require multiple joins on the same source table.
I'm looking for the most "elegant" solution. Maybe there's a feature in SQL I didn't know about which makes this easy.
Here is a reference scenario to make answers more consistent with each other:
create table Comparisons (
    DateField DateTime NOT NULL,
    Amount int not null default 5 -- so the sample inserts below get Amount = 5
)
insert into Comparisons (DateField) values ('2000-01-01 10:04'),('2000-01-01 10:17'),
('2000-01-01 12:01'),('2000-01-01 11:54'),('2000-01-01 03:02'),('2000-01-01 03:05'),
('2000-01-01 05:02'),('2000-01-01 05:05'),('2000-01-01 05:19')
output expected:
min: .. 10:04, max: .. 10:17, sum: 10
min: .. 11:54, max: .. 12:01, sum: 10
min: .. 03:02, max: .. 03:05, sum: 10
min: .. 05:02, max: .. 05:19, sum: 15 [optional]
The last output is optional, but if an elegant solution has that as a side-effect, it's acceptable. If an elegant solution can't achieve that optional last output, it won't be a deal breaker.
I believe this produces the results you want:
DECLARE #Comparisons TABLE (i DATETIME, amt INT NOT NULL DEFAULT(5));
INSERT #Comparisons (i) VALUES ('2016-01-01 10:04:00.000')
, ('2016-01-01 10:17:00.000')
, ('2016-01-01 10:25:00.000')
, ('2016-01-01 10:37:00.000')
, ('2016-01-01 10:44:00.000')
, ('2016-01-01 11:52:00.000')
, ('2016-01-01 11:59:00.000')
, ('2016-01-01 12:10:00.000')
, ('2016-01-01 12:22:00.000')
, ('2016-01-01 13:00:00.000')
, ('2016-01-01 09:00:00.000');
DECLARE #N INT = 15;
WITH T AS (
SELECT i
, amt
, CASE WHEN DATEDIFF(MINUTE, previ, i) <= #N THEN 0 ELSE 1 END RN1
, CASE WHEN DATEDIFF(MINUTE, i, nexti) > #N THEN 1 ELSE 0 END RN2
FROM #Comparisons t
OUTER APPLY (SELECT MAX(i) FROM #Comparisons WHERE i < t.i)x(previ)
OUTER APPLY (SELECT MIN(i) FROM #Comparisons WHERE i > t.i)y(nexti)
)
, T2 AS (
SELECT CASE RN1 WHEN 1 THEN i ELSE (SELECT MAX(i) FROM T WHERE RN1 = 1 AND i < T1.i) END mintime
, CASE WHEN RN2 = 1 THEN i ELSE ISNULL((SELECT MIN(i) FROM T WHERE RN2 = 1 AND i > T1.i), i) END maxtime
, amt
FROM T T1
)
SELECT mintime, maxtime, sum(amt) total
FROM T2
GROUP BY mintime, maxtime
ORDER BY mintime;
It's probably a little clunkier than it could be, but it's basically just grouping anything within an #N-minute chain.
It looks like you want to group records based on gaps between them of at least <N> minutes.
In SQL Server 2012+, you would use lag() to identify when groups start and cumulative sum to identify the groups:
select min(datefield), max(datefield), count(*) as num, sum(amount)
from (select c.*,
             sum(case when prev_datefield < dateadd(minute, -N, datefield)
                      then 1 else 0
                 end) over (order by datefield) as grp
      from (select c.*,
                   lag(datefield) over (order by datefield) as prev_datefield
            from Comparisons c
           ) c
     ) c
group by grp;
In earlier versions you can use correlated subqueries or apply for the same functionality (albeit at much worse performance).
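As a hedged sketch of that pre-2012 route (assuming the Comparisons table from the question, Amount defaulting to 5, and N fixed at 15 for brevity): a row starts a new island when no earlier row falls within the previous 15 minutes, and each row is assigned to the latest island start at or before it.
select min(datefield) as mintime, max(datefield) as maxtime, sum(amount) as total
from (select c.datefield, c.amount,
             -- number of island starts at or before this row = island id
             (select count(*)
              from Comparisons s
              where s.datefield <= c.datefield
                and not exists (select 1
                                from Comparisons p
                                where p.datefield < s.datefield
                                  and p.datefield >= dateadd(minute, -15, s.datefield))
             ) as grp
      from Comparisons c
     ) c
group by grp
order by mintime;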
Intervals could be used, if adjacent intervals are checked. This would require multiplying the source table records by 3
Pseudo-code:
select *
from Comparisons C, {-1, 0, 1} M
group by (datediff(mi, C.DateField, 0) / N) + M
The problem with this approach is how to eliminate the extra results. I suspect this is a deadend approach but someone else might see value in it.
Update: This approach would not work with the 4th expected output [min: .. 05:02, max: .. 05:19, sum: 15]

Calculate Average time spend based on a change in location zone

I have a table similar to
create table LOCHIST
(
RES_ID VARCHAR(10) NOT NULL,
LOC_DATE TIMESTAMP NOT NULL,
LOC_ZONE VARCHAR(10)
)
with values such as
insert into LOCHIST values('0911','2015-09-23 12:27:00.000000','SYLVSYLGA');
insert into LOCHIST values('5468','2013-02-15 13:13:24.000000','30726');
insert into LOCHIST values('23894','2013-02-15 13:12:13.000000','BECTFOUNC');
insert into LOCHIST values('24119','2013-02-15 13:12:09.000000','30363');
insert into LOCHIST values('7101','2013-02-15 13:11:37.000000','37711');
insert into LOCHIST values('26083','2013-02-15 13:11:36.000000','SHAWANDAL');
insert into LOCHIST values('24978','2013-02-15 13:11:36.000000','38132');
insert into LOCHIST values('26696','2013-02-15 13:11:27.000000','29583');
insert into LOCHIST values('5468','2013-02-15 13:11:00.000000','37760');
insert into LOCHIST values('5552','2013-02-15 13:10:55.000000','30090');
insert into LOCHIST values('24932','2013-02-15 13:10:48.000000','JBTTLITGA');
insert into LOCHIST values('23894','2013-02-15 13:10:42.000000','47263');
insert into LOCHIST values('26803','2013-02-15 13:10:25.000000','32534');
insert into LOCHIST values('24434','2013-02-15 13:10:03.000000','PLANSUFVA');
insert into LOCHIST values('26696','2013-02-15 13:10:00.000000','GEORALBGA');
insert into LOCHIST values('5468','2013-02-15 13:09:54.000000','19507');
insert into LOCHIST values('23894','2013-02-15 13:09:48.000000','37725');
This table literally goes on for millions of records.
Each RES_ID represents the ID of a trailer who pings their location to a LOC_ZONE which is then stored at the time in LOC_DATE.
What I am trying to find is the average amount of time spent by all trailers in a specific location zone. For example, if trailer x spent 4 hours in loc zone PLANSUFVA and trailer y spent 6 hours in loc zone PLANSUFVA, I would want to return
Loc Zone Avg Time
PLANSUFVA 5
Is there any way to do this without cursors?
I really appreciate your help.
This needs SQL 2012:
with data as (
    select *,
           (case when LOC_ZONE != PREVIOUS_LOC_ZONE or PREVIOUS_LOC_ZONE is null
                 then ROW_ID else null end) as STAY_START,
           (case when LOC_ZONE != NEXT_LOC_ZONE or NEXT_LOC_ZONE is null
                 then ROW_ID else null end) as STAY_END
    from (
        select RES_ID, LOC_ZONE, LOC_DATE,
               lead(LOC_DATE, 1) over (partition by RES_ID, LOC_ZONE order by LOC_DATE) as NEXT_LOC_DATE,
               lag(LOC_ZONE, 1) over (partition by RES_ID order by LOC_DATE) as PREVIOUS_LOC_ZONE,
               lead(LOC_ZONE, 1) over (partition by RES_ID order by LOC_DATE) as NEXT_LOC_ZONE,
               ROW_NUMBER() over (order by RES_ID, LOC_ZONE, LOC_DATE) as ROW_ID
        from LOCHIST
    ) t
), stays as (
    select *
    from (
        select RES_ID, LOC_ZONE, STAY_START,
               lead(STAY_END, 1) over (order by ROWID) as STAY_END
        from (
            select RES_ID, LOC_ZONE, STAY_START, STAY_END,
                   ROW_NUMBER() over (order by RES_ID, LOC_ZONE, STAY_START desc) as ROWID
            from data
            where STAY_START is not null or STAY_END is not null
        ) t
    ) t
    where STAY_START is not null and STAY_END is not null
)
select s.LOC_ZONE, avg(datediff(second, LOC_DATE, NEXT_LOC_DATE)) / 60 / 60 as AVG_IN_HOURS
from data d
inner join stays s
    on d.RES_ID = s.RES_ID
   and d.LOC_ZONE = s.LOC_ZONE
   and d.ROW_ID >= s.STAY_START
   and d.ROW_ID < s.STAY_END
group by s.LOC_ZONE
To solve this problem, you need the amount of time spent at each location.
One way to do this is with a correlated subquery. You need to group adjacent values. The idea is to find the next value in the sequence:
select res_id, min(loc_zone) as loc_zone, min(loc_date) as StartTime,
max(loc_date) as EndTime,
nextdate as NextStartTime
from (select lh.*,
(select min(loc_date) from lochist lh2
where lh2.res_id = lh.res_id and lh2.loc_zone <> lh.loc_zone and
lh2.loc_date > lh.loc_date
) as nextdate
from lochist lh
) lh
group by lh.res_id, nextdate
With this data, you can then get the average that you want.
I am not clear if the time should be based on EndTime - StartTime (last recorded time at the location minus the first recorded time) or NextStartTime - startTime (first time at next location minus first time at this location).
Also, this returns NULL for the last location for each res_id. You don't say what to do about the last in the sequence.
If you build an index on res_id, loc_date, loc_zone, it might run faster.
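For instance, a hedged sketch of that index (the name is illustrative only):
create index IX_LOCHIST_resid_date_zone
    on LOCHIST (RES_ID, LOC_DATE, LOC_ZONE);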
If you had Oracle or SQL Server 2012, the right query is:
select lh.*,
lead(loc_date) over (partition by res_id order by loc_date) as nextdate
from (select lh.*,
lag(loc_zone) over (partition by res_id order by loc_date) as prevzone
from lochist lh
) lh
where prevzone is null or prevzone <> loc_zone
Now you have one row per stay and nextdate is the date at the next zone.
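If it helps, a hedged sketch of the final averaging step on top of that query (SQL Server syntax; the last stay per res_id is dropped because it has no next date):
select loc_zone,
       avg(datediff(minute, loc_date, nextdate) / 60.0) as avg_time_in_hours
from (select lh.*,
             lead(loc_date) over (partition by res_id order by loc_date) as nextdate
      from (select lh.*,
                   lag(loc_zone) over (partition by res_id order by loc_date) as prevzone
            from lochist lh
           ) lh
      where prevzone is null or prevzone <> loc_zone
     ) stays
where nextdate is not null
group by loc_zone;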
This should get you each zone ordered by the average number of minutes spent in it. The CROSS APPLY returns the next ping in a different zone.
SELECT
loc.LOC_ZONE
,AVG(DATEDIFF(mi,loc.LOC_DATE,nextPing.LOC_DATE)) AS avgMinutes
FROM LOCHIST loc
CROSS APPLY(
SELECT TOP 1 loc2.LOC_DATE
FROM LOCHIST loc2
WHERE loc2.RES_ID = loc.RES_ID
AND loc2.LOC_DATE > loc.LOC_DATE
AND loc2.LOC_ZONE <> loc.LOC_ZONE
ORDER BY loc2.LOC_DATE ASC
) AS nextPing
GROUP BY loc.LOC_ZONE
ORDER BY avgMinutes DESC
My variation of the solution:
select LOC_ZONE, avg(TOTAL_TIME) AVG_TIME from (
select RES_ID, LOC_ZONE, sum(TIME_SPENT) TOTAL_TIME
from (
select RES_ID, LOC_ZONE, datediff(mi, lag(LOC_DATE, 1) over (
partition by RES_ID order by LOC_DATE), LOC_DATE) TIME_SPENT
from LOCHIST
) t
where TIME_SPENT is not null
group by RES_ID, LOC_ZONE) f
group by LOC_ZONE
This accounts for multiple stays at the same location. The choice between lag and lead depends on whether a stay should start or end with the ping (i.e., if a trailer sends a ping from A and then x hours later from B, does that time count for A or B?).
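As a hedged sketch of the alternative (same LOCHIST table assumed), swapping lag for lead attributes each interval to the zone of the earlier ping, i.e. the time between the ping from A and the next ping from B counts towards A:
select RES_ID, LOC_ZONE,
       -- minutes from this ping until the trailer's next ping anywhere
       datediff(mi, LOC_DATE,
                lead(LOC_DATE, 1) over (partition by RES_ID order by LOC_DATE)) as TIME_SPENT
from LOCHIST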
To do this without using either a cursor or a correlated subquery, try:
with rl as
(select l.*, rank() over (partition by res_id order by loc_date) rn
from lochist l),
fdr as
(select rc.*, coalesce(rn.loc_date, getdate()) next_date
from rl rc
left join rl rn on rc.res_id = rn.res_id and rc.rn + 1 = rn.rn)
select loc_zone, avg(datediff(second, loc_date, next_date))/3600 avg_time
from fdr
group by loc_zone
SQLFiddle here.
(Because of the way SQL Server calculates time differences, it's probably better to calculate the average time in seconds and then divide by 60*60. With the exception of the getdate() and datediff clauses, which can be replaced by sysdate and next_date - loc_date respectively, this should work in both SQL Server 2005 onwards and Oracle 10g onwards.)