WHERE [Column] Has Not Changed in N Time - sql

So, I've got a view that is admittedly not well-indexed and there's not much I can do about it.
The view has data that looks a bit like the one in this question, but my problem is essentially the opposite of theirs and I'm not sure their solution will work here, though a similar TVF or CTE is probably in the forecast.
My data looks like this at the moment:
CustomPollerAssignmentId DateTime Status
[Some Id B] 2013-11-18 08:54:00 IDLE
[Some Id A] 2013-11-18 08:54:00 DORMANT
[Some Id B] 2013-11-18 08:53:00 IDLE
[Some Id A] 2013-11-18 08:53:00 NOMINAL
Unlike the other question, I need to see that the status hasn't changed. The view comprises three separate tables. One with minute statistics, one with fifteen minute statistics (for between three and six months ago), and one for hourly statistics (for up to a year ago).
The goal here is to check which modems have been idle for at least the last 10 minutes. We've got about 1200 active modems, so this could be up to 12000 rows, which is why I'd prefer not to do it with C#, but I'm still kind of new to SQL and set-based thinking. I'm currently working with an instance of SQL Server 2012, but it's very new here and I'm not really experienced with the newer windowed functions since we were on 2008R2 until about a month ago.
To be honest, I'm not even sure where to begin here because my OOP background wants me to just grab the TOP 10 statuses for each and loop through. If all 10 == idle || dormant, add to the result set, but I know there's got to be a better way to do it in SQL. Can someone point me in the right direction?
EDIT
To try to clarify a bit:
I'm using T-SQL.
This isn't as simple as a WHERE NOT EXISTS clause.
Regardless of whether or not the status has changed, there should be an entry for the remote's status unless it has been deactivated. This means that it could have (idle, idle, idle, idle, idle, nominal, idle, idle, idle, idle) statuses for the last 10 minutes and that example is a case I would not want to include. The result set should include ONLY those remotes which have had statuses which are only idle or dormant for the last 10 minutes. If the last status is more than three months ago, it will only have one status for a fifteen minute interval.

If I understand correctly what you're asking:
SELECT [theView].CustomPollerAssignmentId
FROM
    [theView]
    LEFT JOIN
    (
        -- Latest non-idle time per ID, up to @Now
        SELECT CustomPollerAssignmentId, MAX([DateTime]) AS LastTime
        FROM [theView]
        WHERE [Status] <> 'IDLE' AND [DateTime] <= @Now
        GROUP BY CustomPollerAssignmentId
    ) AS NotIdleStatus ON
        [theView].CustomPollerAssignmentId = NotIdleStatus.CustomPollerAssignmentId
WHERE
    [theView].[DateTime] <= @Now AND
    [theView].[Status] = 'IDLE' AND
    (
        [theView].[DateTime] > NotIdleStatus.LastTime OR
        NotIdleStatus.LastTime IS NULL
    )
GROUP BY [theView].CustomPollerAssignmentId
HAVING MIN([theView].[DateTime]) <= DATEADD(MINUTE, -10, @Now)
The concept here is to:
For each ID, select the latest time of a non-idle status up to the current time.
For each ID, join the idle rows against that latest non-idle time.
Keep only the idle rows that are later than the ID's latest non-idle row, or whose ID has no non-idle row at all.
Group the remaining rows by ID.
Keep the IDs whose earliest remaining idle row is at least 10 minutes old.
The following code selects the last non-idle time for each ID:
SELECT CustomPollerAssignmentId, MAX([DateTime]) AS LastTime
FROM [theView]
WHERE [Status] <> 'IDLE' AND [DateTime] <= @Now
GROUP BY CustomPollerAssignmentId
We use a LEFT JOIN so that any ID that is IDLE and has never had any other status is still captured. The ON clause joins the two sets by ID.
The following code selects the rows that are idle and not later than the current time:
WHERE
[theView].[DateTime] <= @Now AND
[theView].[Status] = 'IDLE' AND ...
The following code groups the records by ID and keeps the IDs whose earliest such time is at least 10 minutes before the current time:
GROUP BY [theView].CustomPollerAssignmentId
HAVING MIN([theView].[DateTime]) <= DATEADD(MINUTE, -10, @Now)
Also, you will need to pass in the @Now value, which is the point in time you want to check against.
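As a cross-check of the steps above, here is a minimal Python sketch of the same logic on in-memory rows. It is not T-SQL and not the answerer's code: the function name `idle_since`, the tuple layout and the sample rows are assumptions made for illustration. Like the query, it treats only 'IDLE' as idle by default; pass a larger set to also accept 'DORMANT'.

```python
from datetime import datetime, timedelta

def idle_since(rows, now, minutes=10, idle_statuses=frozenset({"IDLE"})):
    """Return the IDs that have been continuously idle for at least `minutes` up to `now`.

    rows: iterable of (poller_id, timestamp, status) tuples (hypothetical layout).
    """
    # Step 1: latest non-idle timestamp per ID (the derived table in the query).
    last_non_idle = {}
    for pid, ts, status in rows:
        if ts <= now and status not in idle_statuses:
            if pid not in last_non_idle or ts > last_non_idle[pid]:
                last_non_idle[pid] = ts

    # Steps 2-4: earliest idle timestamp that is later than the last non-idle one.
    earliest_idle = {}
    for pid, ts, status in rows:
        if ts > now or status not in idle_statuses:
            continue
        cutoff = last_non_idle.get(pid)  # None plays the role of the LEFT JOIN's NULL
        if cutoff is None or ts > cutoff:
            if pid not in earliest_idle or ts < earliest_idle[pid]:
                earliest_idle[pid] = ts

    # Step 5: the idle streak must have started at least `minutes` ago.
    threshold = now - timedelta(minutes=minutes)
    return sorted(pid for pid, ts in earliest_idle.items() if ts <= threshold)

# Made-up sample: one row per minute from 08:44 to 08:54 for two pollers.
now = datetime(2013, 11, 18, 8, 54)
rows = []
for i in range(11):
    ts = now - timedelta(minutes=i)
    rows.append(("A", ts, "IDLE"))
    rows.append(("B", ts, "NOMINAL" if i == 5 else "IDLE"))  # B blips at 08:49

result = idle_since(rows, now)
print(result)
```

Poller B is dropped because its single NOMINAL row resets the start of its idle streak to 08:50, which is less than 10 minutes before `now`.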

SELECT v1.*
FROM (
SELECT *
FROM vdata
WHERE OnlineStatus IN ('IDLE', 'DORMANT')
) v1
WHERE NOT EXISTS (
SELECT 1
FROM vdata v2
WHERE v2.ModemId = v1.ModemId
AND v2.MinuteMarker > v1.MinuteMarker
AND v2.MinuteMarker <= DATEADD(MINUTE, 10, v1.MinuteMarker )
AND v2.OnlineStatus NOT IN ('IDLE', 'DORMANT')
)

Related

Count the time difference between specific events in SQL Server 2019 or pandas

I am working on an infusion dataset in which I need to find the time duration between the infusion stop and other infusion events.
This is a screenshot of the dataset:
In the screenshot, the first event status is STOPPED at 06:28:31 and the infusion started to run at 09:10:54. Hence the total from stop to run is 9743 seconds, which has to be populated for row 1 in a new column. Likewise, the pump stopped at 16:50:38 and there was an alarm at 06:04:07, so the difference would be approximately 13 hours; on row 5, I need that difference value of 13 hours. I need this difference for the entire data set, wherever I have a STOPPED status followed by a RUNNING or STOPPED_ALARM infusion status.
I was able to find the difference between each RUNNING or STOPPED_ALARM status and the preceding STOPPED event. However, it is getting populated for every row where the status is STOPPED.
This is the SQL code I use:
SELECT
InfusionStatus, InfusionID, EventDescription,
Time AS event_time,
IIF ((InfusionStatus = 'STOPPED'),
DATEDIFF(SECOND, A1.Time, ISNULL((SELECT TOP 1 Time
FROM table1
WHERE InfusionID = A1.InfusionID
AND SiteNumber = A1.SiteNumber
AND SerialNumber = A1.SerialNumber
AND Time >= A1.Time
AND (InfusionStatus = 'RUNNING' OR
InfusionStatus = 'STOPPED_ALARM')
ORDER BY Time ASC), A1.time)), 0) AS stop_run_event_duration_secs
FROM
dbo.table1 A1
The output that I am getting is like this:
Basically I don't want the difference to be populated in the areas I have marked as "X". The difference has to be populated only for the first stopped event.
Link for the data:
Data Link
I can also go with Python code for determining this.
Any help would be greatly appreciated.
Using Common Table Expression (CTE):
;WITH cte AS
(
SELECT
*
-- Every time the InfusionStatus changes into STOPPED, RUNNING or
-- STOPPED_ALARM for the first time, set IsFirstStatusChange = 1
, CASE
WHEN InfusionStatus IN ('STOPPED', 'RUNNING', 'STOPPED_ALARM')
AND InfusionStatus != LAG(InfusionStatus, 1, '') OVER (ORDER BY Time)
THEN 1
ELSE 0
END AS IsFirstStatusChange
FROM Infusion_data
)
-- For each status change, calculate the duration from the previous change
-- I don't know your requirements around handling STOPPED so I set Duration to
-- NULL for these rows
SELECT *
, IIF(InfusionStatus = 'STOPPED', NULL, DATEDIFF(SECOND, LAG(Time) OVER (ORDER BY Time), Time)) AS Duration
FROM cte
WHERE IsFirstStatusChange = 1
ORDER BY Time
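Since the question also accepts Python, here is a minimal sketch of the requirement as stated there: only the first STOPPED row of a run receives the number of seconds until the next RUNNING or STOPPED_ALARM row. It is not a translation of the CTE above. The function name `stop_durations`, the tuple layout and the dates are assumptions (the question only shows times), and the events are assumed to be sorted by time for a single infusion.

```python
from datetime import datetime

RESUME = {"RUNNING", "STOPPED_ALARM"}

def stop_durations(events):
    """events: list of (time, status) tuples sorted by time (hypothetical layout).

    Returns one value per event: the seconds from the first STOPPED of a run to
    the next RUNNING / STOPPED_ALARM event, and None for every other row.
    """
    out = [None] * len(events)
    pending = None  # index of the first STOPPED still waiting for a resume event
    for i, (ts, status) in enumerate(events):
        if status == "STOPPED":
            if pending is None:
                pending = i
        elif status in RESUME and pending is not None:
            out[pending] = int((ts - events[pending][0]).total_seconds())
            pending = None
    return out

# Made-up rows using the times quoted in the question; the dates are assumed.
events = [
    (datetime(2021, 1, 1, 6, 28, 31), "STOPPED"),
    (datetime(2021, 1, 1, 6, 30, 0), "STOPPED"),       # repeated STOPPED stays None
    (datetime(2021, 1, 1, 9, 10, 54), "RUNNING"),
    (datetime(2021, 1, 1, 16, 50, 38), "STOPPED"),
    (datetime(2021, 1, 2, 6, 4, 7), "STOPPED_ALARM"),
]
durations = stop_durations(events)
print(durations)
```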

Get time difference between Log records

I have a log table that tracks the bug's status. I would like to extract the amount of time spent when the log changes from OPEN (OldStatus) to FIXED or REQUEST CLOSE (NewStatus). Right now, my query looks at the max and min of the log which does not produce the result I want. For example, the bug #1 was fixed in 2 hours on 2020-01-01, then reopened (OldStatus) and got a REQUEST CLOSE (NewStatus) in 3 hours on 2020-12-12. I want the query result to return two rows with date and number of hours spent to fix the bug since its most recently opened time.
Here's the data:
CREATE TABLE Log (
BugID int,
CurrentTime timestamp,
Person varchar(20),
OldStatus varchar(20),
NewStatus varchar(20)
);
INSERT INTO Log (BugID, CurrentTime, Person, OldStatus, NewStatus)
VALUES (1, '2020-01-01 00:00:00', 'A', 'OPEN', 'In Progress'),
(1, '2020-01-01 00:00:01', 'A', 'In Progress', 'REVIEW In Progress'),
(1, '2020-01-01 02:00:00', 'A', 'In Progress', 'FIXED'),
(1, '2020-01-01 06:00:00', 'B', 'OPEN', 'In Progress'),
(1, '2020-01-01 00:00:00', 'B', 'In Progress', 'REQUEST CLOSE')
SELECT DATEDIFF(HOUR, start_time, finish_time) AS Time_Spent_Min
FROM (
    SELECT BugId,
           MAX(CurrentTime) as finish_time,
           MIN(CurrentTime) as start_time
    FROM Log
    WHERE (OldStatus = 'OPEN' AND NewStatus = 'In Progress') OR NewStatus = 'FIXED'
    GROUP BY BugId
) AS TEMP
The actual data looks as below:
FYI @Charlieface
This is a type of gaps-and-islands problem.
There are a number of solutions, here is one:
We need to assign a grouping ID to each island of OPEN -> In Progress. We can use windowed conditional COUNT to get a grouping number for each start point.
To get a grouping for the end point, we need to assign the previous row's NewStatus using LAG, then do another conditional COUNT on that.
We then simply group by BugId and our calculated grouping and return the start and end times
WITH IslandStart AS (
SELECT *,
COUNT(CASE WHEN OldStatus = 'OPEN' AND NewStatus = 'In Progress' THEN 1 END)
OVER (PARTITION BY BugID ORDER BY CurrentTime ROWS UNBOUNDED PRECEDING) AS GroupStart,
LAG(NewStatus) OVER (PARTITION BY BugID ORDER BY CurrentTime) AS Prev_NewStatus
FROM Log l
),
IslandEnd AS (
SELECT *,
COUNT(CASE WHEN Prev_NewStatus IN ('CLAIM FIXED', 'REQUEST CLOSE') THEN 1 END)
OVER (PARTITION BY BugID ORDER BY CurrentTime ROWS UNBOUNDED PRECEDING) AS GroupEnd
FROM IslandStart l
)
SELECT
BugId,
MAX(CurrentTime) as finish_time,
MIN(CurrentTime) as start_time,
DATEDIFF(minute, MIN(CurrentTime), MAX(CurrentTime)) AS Time_Spent_Min
FROM IslandEnd l
WHERE GroupStart = GroupEnd + 1
GROUP BY
BugId,
GroupStart;
Notes:
timestamp is not meant for actual dates and times, instead use datetime or datetime2
You may need to adjust the COUNT condition if OPEN -> In Progress is not always the first row of an island
You have a few competing factors here:
You should use SmallDateTime, DateTime2 or DateTimeOffset typed columns to store the actual time in the log. These types allow you to calculate the difference between values using DateDiff() and DateAdd() and other date/time based comparison logic, whereas Timestamp is designed to be used as a concurrency token: you can use it to determine if one record is more recent than another, but you shouldn't try to use it to determine the actual time of the event.
What is difference between datetime and timestamp
DATETIMEOFFSET, DATE, TIME, SMALLDATETIME, DATETIME SYSUTCDATETIME and SYSUTCDATE
You have not explained the expected workflow, we can only assume that the flow is [OPEN]=>[In Progress]=>[CLAIM FIXED]. There is also no mention of 'In Progress', which we assume is an interim state. What actually happens here is that this structure can really only tell you the time spent in the 'In Progress' state, which is probably OK for your needs as this is the time spent actually working, but it is important to recognise that we do not know when the bug is changed to 'OPEN' in the first place, unless that is also logged but we need to see the data to explain that.
Your example dataset does not cover enough combinations for you to notice that the existing logic will fail as soon as you add more than 1 bug. What is more, you have asked to calculate the number of hours, but your example data only shows a variation of minutes and has no example where the bug is completed at all.
Without a realistic set of data to test with, you will find it hard to debug your logic and hard to accept that it actually works before you execute this against a larger dataset. It can help to have a scripted scenario, much like your post here, but you should create the data to reflect that script.
You use 'FIXED' in your example, but 'CLAIM FIXED' in query, so which one is it?
Step 1: Structure
Change the datatype of CurrentTime to a DateTime based column. Your application logic may drive requirements here. If your system is cloud based or international, then you may see benefits from using DateTimeOffset instead of having to convert into UTC, otherwise if you do not need high precision timing in your logs, it is very common to use SmallDateTime for logging.
Many ORM and application frameworks will allow you to configure a DateTime based column as the concurrency token, if you need one at all. If you are not happy using a lower precision value for concurrency, then you could have the two columns side by side. Either way, to compare the time difference between two records, we need to use a DateTime based type.
In the case of logs, we rarely allow or expect them to be edited. If your logs are read-only then having a concurrency token at all may not be necessary, especially if you only use the concurrency token to determine concurrency during edits of individual records.
NOTE: You should consider using an enum or FK for the Status concept. Already in your example dataset there was a typo, 'In Progerss'. Using a numeric comparison for the status may provide some performance benefits, and it will also help to prevent spelling mistakes, especially when FK or lookup lists are used from any application logic.
Step 2: Example Data
If the requirement is to calculate the number of hours spent between records, then we need to create some simple examples that show a difference of a few hours, and then add some examples where the same bug is opened, fixed and then re-opened.
bug #1 was fixed in 2 hours on 2020-01-01, then reopened and got fixed in 3 hours on 2020-12-12
The following table shows the known data states and the expected hrs, we need to throw in a few more data stories to validate that the end query handles obvious boundary conditions like multiple Bugs and overlapping dates
BUG #  Time                 Previous State  New State    Hrs In Progress
-----  -------------------  --------------  -----------  ---------------
1      2020-01-01 08:00:00  OPEN            In Progress
1      2020-01-01 10:00:00  In Progress     FIXED        (2 hrs)
1      2020-12-10 09:00:00  FIXED           OPEN
1      2020-12-12 09:30:00  OPEN            In Progress
1      2020-12-12 12:30:00  In Progress     FIXED        (3 hrs)
2      2020-03-17 11:15:00  OPEN            In Progress
2      2020-03-17 14:30:00  In Progress     FIXED        (3.25 hrs)
3      2020-08-22 10:00:00  OPEN            In Progress
3      2020-08-22 16:30:00  In Progress     FIXED        (6.5 hrs)
Step 3: Query
What is interesting to notice here is that 'In Progress' is actually the significant state to query against. What we actually want is to see all rows where the OldStatus is 'In Progress' and we want to link that row to the most recent record before this one with the same BugID and with a NewStatus equal to 'In Progress'
What is interesting in the above table is that not all the expected hours are whole numbers (integers), which makes using DateDiff a little bit tricky because it only counts the boundary changes, not the total number of hours. To highlight this, look at the next two queries: the first one spans 59 minutes, the other only 2 minutes:
SELECT DateDiff(HOUR, '2020-01-01 08:00:00', '2020-01-01 08:59:00') -- 0 (59m)
SELECT DateDiff(HOUR, '2020-01-01 08:59:00', '2020-01-01 09:01:00') -- 1 (2m)
The first query returns 0 hours, but the second returns 1 hour. That is because DateDiff only counts how many hour boundaries are crossed between the two values; it does not actually subtract the time values.
To work around this, we can use MINUTE or MI as the date part argument and divide the result by 60.
SELECT CAST(ROUND(DateDiff(MI, '2020-01-01 08:00:00', '2020-01-01 08:59:00')/60.0,2) as Numeric(10,2)) -- 0.98
SELECT CAST(ROUND(DateDiff(MI, '2020-01-01 08:59:00', '2020-01-01 09:01:00')/60.0,2) as Numeric(10,2)) -- 0.03
You can choose to format this in other ways by calculating the modulo to get the minutes in whole numbers instead of a fraction but that is out of scope for this post, understanding the limitations of DateDiff is what is important to take this further.
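To make the boundary-counting behaviour concrete, here is a small Python sketch that imitates it. The helpers `datediff_hour` and `elapsed_hours` are hypothetical names written for this illustration, not SQL Server functions; they only mirror what the four queries above return.

```python
from datetime import datetime

def _floor_to_hour(d):
    return d.replace(minute=0, second=0, microsecond=0)

def _floor_to_minute(d):
    return d.replace(second=0, microsecond=0)

def datediff_hour(start, end):
    """Count the hour boundaries crossed, like DATEDIFF(HOUR, start, end)."""
    return int((_floor_to_hour(end) - _floor_to_hour(start)).total_seconds()) // 3600

def elapsed_hours(start, end):
    """Count the minute boundaries crossed, divide by 60.0, round to 2 decimals."""
    minutes = int((_floor_to_minute(end) - _floor_to_minute(start)).total_seconds()) // 60
    return round(minutes / 60.0, 2)

a = datetime(2020, 1, 1, 8, 0)
b = datetime(2020, 1, 1, 8, 59)
c = datetime(2020, 1, 1, 9, 1)

print(datediff_hour(a, b), datediff_hour(b, c))   # 0 1
print(elapsed_hours(a, b), elapsed_hours(b, c))   # 0.98 0.03
```

The 59-minute span crosses no hour boundary while the 2-minute span crosses one, which is why the minute-based calculation is the safer basis for fractional hours.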
There are a number of ways to correlate a previous record within the same table. If you need other values from the record, then you might use a join with a sub-query to return the TOP 1 from all the records before the current one, or you could use window queries or a CROSS APPLY to perform a nested lookup. The following uses CROSS APPLY, which is NOT standard across all RDBMS, but I feel it keeps MS SQL queries really clean:
SELECT [Fixed].BugID, [start_time], [Fixed].[CurrentTime] as [finish_time]
, DATEDIFF(MI, [start_time], [Fixed].[CurrentTime]) / 60 AS Time_Spent_Hr
, DATEDIFF(MI, [start_time], [Fixed].[CurrentTime]) % 60 AS Time_Spent_Min
FROM Log as Fixed
CROSS APPLY (SELECT MAX(CurrentTime) AS start_time
FROM Log as Started
WHERE Fixed.BugID = Started.BugID
AND Started.NewStatus = 'In Progress'
AND CurrentTime < Fixed.CurrentTime) as Started
WHERE Fixed.OldStatus = 'In Progress'
You can play with this fiddle: http://sqlfiddle.com/#!18/c408d4/3
However the results show this:
BugID  start_time            finish_time           Time_Spent_Hr  Time_Spent_Min
-----  --------------------  --------------------  -------------  --------------
1      2020-01-01T08:00:00Z  2020-01-01T10:00:00Z  2              0
1      2020-12-12T09:30:00Z  2020-12-12T12:30:00Z  3              0
2      2020-03-17T11:15:00Z  2020-03-17T14:30:00Z  3              15
3      2020-08-22T10:00:00Z  2020-08-22T16:30:00Z  6              30
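The lookup performed by the CROSS APPLY can also be sketched in Python: for each row that leaves 'In Progress', find the latest earlier row of the same bug that entered 'In Progress'. The function name `time_in_progress` and the tuple layout are assumptions for illustration; the rows are the ones from the Step 2 table.

```python
from datetime import datetime

def time_in_progress(log):
    """log: list of (bug_id, time, old_status, new_status) tuples (hypothetical layout).

    Returns (bug_id, start, finish, hours, minutes) for every row whose
    old status is 'In Progress'.
    """
    results = []
    for bug_id, finish, old_status, _new_status in log:
        if old_status != "In Progress":
            continue
        starts = [t for b, t, _old, new in log
                  if b == bug_id and new == "In Progress" and t < finish]
        if not starts:
            continue  # the SQL would yield a NULL start_time here; this sketch skips the row
        start = max(starts)
        total_minutes = int((finish - start).total_seconds()) // 60
        results.append((bug_id, start, finish, total_minutes // 60, total_minutes % 60))
    return results

log = [
    (1, datetime(2020, 1, 1, 8, 0), "OPEN", "In Progress"),
    (1, datetime(2020, 1, 1, 10, 0), "In Progress", "FIXED"),
    (1, datetime(2020, 12, 10, 9, 0), "FIXED", "OPEN"),
    (1, datetime(2020, 12, 12, 9, 30), "OPEN", "In Progress"),
    (1, datetime(2020, 12, 12, 12, 30), "In Progress", "FIXED"),
    (2, datetime(2020, 3, 17, 11, 15), "OPEN", "In Progress"),
    (2, datetime(2020, 3, 17, 14, 30), "In Progress", "FIXED"),
    (3, datetime(2020, 8, 22, 10, 0), "OPEN", "In Progress"),
    (3, datetime(2020, 8, 22, 16, 30), "In Progress", "FIXED"),
]
spent = time_in_progress(log)
for row in spent:
    print(row[0], row[3], row[4])
```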
If I assume that every "open" is followed by one "fixed" before the next open, then you can basically use lead() to solve this problem.
This version unpivots the data, so you could have "open" and "fixed" in the same row:
select l.*, datediff(hour, currenttime, fixed_time)
from (select v.*,
lead(v.currenttime) over (partition by v.bugid order by v.currenttime) as fixed_time
from log l cross apply
(values (bugid, currentTime, oldStatus),
(bugid, currentTime, newStatus)
) v(bugid, currentTime, status)
where v.status in ('OPEN', 'FIXED')
) l
where status = 'OPEN';
Here is a db<>fiddle, which uses data compatible with your explanation. (Your sample data is not correct.)

Find total time worked with multiple jobs / orders with overlap / overlapping times on each worker and job / order

I searched night and day back when I was first starting out in the sql world for an answer to this question. Could not find anything similar to this for my needs so I decided to ask and answer my own question in case others need help like I did.
Here is an example of the data I have. For simplicity, it is all from the Job table. Each JobID has its own Start and End time that are basically random and can overlap, have gaps, start and end at the same time as other jobs etc.
--Available--
JobID WorkerID JobStart JobEnd
1 25 '2012-11-17 16:00' '2012-11-17 17:00'
2 25 '2012-11-18 16:00' '2012-11-18 16:50'
3 25 '2012-11-19 18:00' '2012-11-19 18:30'
4 25 '2012-11-19 17:30' '2012-11-19 18:10'
5 26 '2012-11-18 16:00' '2012-11-18 17:10'
6 26 '2012-11-19 16:00' '2012-11-19 16:50'
What I wanted the result of the query to show would be:
WorkerID TotalTime(in Mins)
25 170
26 120
EDIT: Forgot to mention that the overlaps need to be ignored. Basically this is supposed to treat these workers and their jobs like you would an hourly employee and not a contractor. Like if I worked two jobIDs and started and finished them both from 12:00pm to 12:30pm, as an employee I would only get paid for 30 mins, whereas a contractor would likely get paid 60 mins, since their jobs are treated individually and get paid per job. The point of this query is to analyze jobs in a database that are tied to a worker, and need to find out if that worker was treated as an employee, what would his total hours worked in a given set of time come out to be.
EDIT2: won't let me answer my own question for 7 hours, will move it there later.
Ok, answering the question now. Basically, I use a temp table to build each minute between the min and max datetime of the jobs I am looking up.
IF OBJECT_ID('tempdb..#time') IS NOT NULL
BEGIN
    drop table #time
END
DECLARE @FromDate AS DATETIME,
        @ToDate AS DATETIME,
        @Current AS DATETIME
SET @FromDate = '2012-11-17 16:00'
SET @ToDate = '2012-11-19 18:30'
create table #time (cte_start_date datetime)
set @Current = @FromDate
while (@Current < @ToDate)
begin
    insert into #time (cte_start_date)
    values (@Current)
    set @Current = DATEADD(n, 1, @Current)
end
Now I have all the mins in a temp table. Now I need to join all the Job table info into it and select out what I need in one go.
SELECT J.WorkerID
,COUNT(DISTINCT t.cte_start_date) AS TotalTime
FROM #time AS t
INNER JOIN Job AS J ON t.cte_start_date >= J.JobStart AND t.cte_start_date < J.JobEnd --Thanks ErikE
GROUP BY J.WorkerID --Thanks Martin Parkin
drop table #time
That is the very simplified answer and is good to get someone started.
This query does the job as well. Its performance is very good (while the execution plan looks not so great, the actual CPU and IO beat many other queries).
See it working in a Sql Fiddle.
WITH Times AS (
SELECT DISTINCT
H.WorkerID,
T.Boundary
FROM
dbo.JobHistory H
CROSS APPLY (VALUES (H.JobStart), (H.JobEnd)) T (Boundary)
), Groups AS (
SELECT
WorkerID,
T.Boundary,
Grp = Row_Number() OVER (PARTITION BY T.WorkerID ORDER BY T.Boundary) / 2
FROM
Times T
CROSS JOIN (VALUES (1), (1)) X (Dup)
), Boundaries AS (
SELECT
G.WorkerID,
TimeStart = Min(Boundary),
TimeEnd = Max(Boundary)
FROM
Groups G
GROUP BY
G.WorkerID,
G.Grp
HAVING
Count(*) = 2
)
SELECT
B.WorkerID,
WorkedMinutes = Sum(DateDiff(minute, 0, B.TimeEnd - B.TimeStart))
FROM
Boundaries B
WHERE
EXISTS (
SELECT *
FROM dbo.JobHistory H
WHERE
B.WorkerID = H.WorkerID
AND B.TimeStart < H.JobEnd
AND B.TimeEnd > H.JobStart
)
GROUP BY
WorkerID
;
With a clustered index on WorkerID, JobStart, JobEnd, JobID, and with the sample 7 rows from the above fiddle used as a template for new worker/job data, repeated enough times to yield a table with 14,336 rows, here are the performance results. I've included the other working/correct answers on the page (so far):
Author CPU Elapsed Reads Scans
------ --- ------- ------ -----
Erik 157 166 122 2
Gordon 375 378 106964 53251
I did a more exhaustive test from a different (slower) server (where each query was run 25 times, the best and worst values for each metric were thrown out, and the remaining 23 values were averaged) and got the following:
Query CPU Duration Reads Notes
-------- ---- -------- ------ ----------------------------------
Erik 1 215 231 122 query as above
Erik 2 326 379 116 alternate technique with no EXISTS
Gordon 1 578 682 106847 from j
Gordon 2 584 673 106847 from dbo.JobHistory
I thought the alternate technique would be sure to improve things. Well, it saved 6 reads, but cost a lot more CPU (which makes sense). Instead of carrying the start/end statistics of each timeslice through to the end, it is best to just recalculate which slices to keep with the EXISTS against the original data. It may be that a different profile, such as few workers with many jobs, would change the performance statistics for the different queries.
In case anyone wants to try it, use the CREATE TABLE and INSERT statements from my fiddle and then run this 11 times:
INSERT dbo.JobHistory
SELECT
H.JobID + A.MaxJobID,
H.WorkerID + A.WorkerCount,
DateAdd(minute, Elapsed + 45, JobStart),
DateAdd(minute, Elapsed + 45, JobEnd)
FROM
dbo.JobHistory H
CROSS JOIN (
SELECT
MaxJobID = Max(JobID),
WorkerCount = Max(WorkerID) - Min(WorkerID) + 1,
Elapsed = DateDiff(minute, Min(JobStart), Min(JobEnd))
FROM dbo.JobHistory
) A
;
I built two other solutions to this query but the best one with about double the performance had a fatal flaw (not correctly handling fully enclosed time ranges). The other had very high/bad statistics (which I knew but had to try).
Explanation
Using all the endpoint times from each row, build up a distinct list of all possible time ranges of interest by duplicating each endpoint time and then grouping in such a way as to pair each time with the next possible time. Sum the elapsed minutes of these ranges wherever they coincide with any actual worker's working time.
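The explanation above can be sketched in Python as follows. This is an illustration of the slicing idea under stated assumptions, not a port of the query: the function name `worked_minutes` and the tuple layout are made up, and the sample rows are the six jobs from the question.

```python
from collections import defaultdict
from datetime import datetime

def worked_minutes(jobs):
    """jobs: list of (worker_id, start, end) tuples (hypothetical layout).

    Overlapping jobs of the same worker are counted only once.
    """
    by_worker = defaultdict(list)
    for worker, start, end in jobs:
        by_worker[worker].append((start, end))

    totals = {}
    for worker, spans in by_worker.items():
        # Distinct, sorted endpoints; each consecutive pair is a candidate slice.
        points = sorted({p for span in spans for p in span})
        total = 0
        for lo, hi in zip(points, points[1:]):
            # Keep the slice only if at least one job of this worker covers it.
            if any(s < hi and e > lo for s, e in spans):
                total += int((hi - lo).total_seconds()) // 60
        totals[worker] = total
    return totals

jobs = [
    (25, datetime(2012, 11, 17, 16, 0), datetime(2012, 11, 17, 17, 0)),
    (25, datetime(2012, 11, 18, 16, 0), datetime(2012, 11, 18, 16, 50)),
    (25, datetime(2012, 11, 19, 18, 0), datetime(2012, 11, 19, 18, 30)),
    (25, datetime(2012, 11, 19, 17, 30), datetime(2012, 11, 19, 18, 10)),
    (26, datetime(2012, 11, 18, 16, 0), datetime(2012, 11, 18, 17, 10)),
    (26, datetime(2012, 11, 19, 16, 0), datetime(2012, 11, 19, 16, 50)),
]
totals = worked_minutes(jobs)
print(totals)
```

Because every slice lies between two consecutive endpoints, a job that overlaps a slice necessarily covers all of it, so the overlap test is enough to decide whether the whole slice counts.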
A query such as the following should provide the answer you are looking for:
SELECT WorkerID,
SUM(DATEDIFF(minute, JobStart, JobEnd)) AS TotalTime
FROM Job
GROUP BY WorkerID
Apologies that it is untested (I have no SQL Server to test it here) but it should do the trick.
This is a complicated query. Explanation follows.
with j as (
select j.*,
(select 1
from jobs j2
where j2.workerid = j.workerid and
j2.starttime < j.endtime and
j2.starttime > j.starttime
) as HasOverlap
from jobs j
)
select workerId,
sum(datediff(minute, periodStart, PeriodEnd)) as NumMinutes
from (select workerId, min(startTime) as periodStart, max(endTime) as PeriodEnd
from (select j.*,
(select min(starttime)
from j j2
where j2.workerid = j.workerid and
j2.starttime >= j.starttime and
j2.HasOverlap is null
) as thegroup
from j
) j
group by workerId, thegroup
) j
group by workerId;
The key to understanding this approach is to understand the "overlap" logic. One time period overlaps with the next when the next start time is before the previous end time. By assigning an overlap flag to each record, we know if it overlaps with the "next" record. The above logic is using the start time for this. It might be better to use the JobId, especially if two jobs for the same worker could start at the same time.
The calculation of the overlap flag uses a correlated subquery (this is j in the with clause).
Then, for each record we go back and find the first record afterwards where the overlap value is NULL. This provides a grouping key for all records in a given overlap set.
The rest, then, is just to aggregate the results, first at the workerId/group level and then at the workerId level to get the final results.
I have not run this SQL, so it might have syntax errors.

How do I analyse time periods between records in SQL data without cursors?

The root problem: I have an application which has been running for several months now. Users have been reporting that it's been slowing down over time (so in May it was quicker than it is now). I need to get some evidence to support or refute this claim. I'm not interested in precise numbers (so I don't need to know that a login took 10 seconds), I'm interested in trends - that something which used to take x seconds now takes of the order of y seconds.
The data I have is an audit table which stores a single row each time the user carries out any activity - it includes a primary key, the user id, a date time stamp and an activity code:
create table AuditData (
AuditRecordID int identity(1,1) not null,
DateTimeStamp datetime not null,
DateOnly datetime null,
UserID nvarchar(10) not null,
ActivityCode int not null)
(Notes: DateOnly (datetime) is the DateTimeStamp with the time stripped off to make group by for daily analysis easier - it's effectively duplicate data to make querying faster.
Also, for the sake of ease you can assume that the ID is assigned in date time order, that is 1 will always be before 2 which will always be before 3 - if this isn't true I can make it so.)
ActivityCode is an integer identifying the activity which took place, for instance 1 might be user logged in, 2 might be user data returned, 3 might be search results returned and so on.
Sample data for those who like that sort of thing...:
1, 01/01/2009 12:39, 01/01/2009, P123, 1
2, 01/01/2009 12:40, 01/01/2009, P123, 2
3, 01/01/2009 12:47, 01/01/2009, P123, 3
4, 01/01/2009 13:01, 01/01/2009, P123, 3
User data is returned (Activity Code 2) immediately after login (Activity Code 1) so this can be used as a rough benchmark of how long the login takes (as I said, I'm interested in trends so as long as I'm measuring the same thing for May as July it doesn't matter so much if this isn't the whole login process - it takes in enough of it to give a rough idea).
(Note: User data can also be returned under other circumstances so it's not a one to one mapping).
So what I'm looking to do is select the average time between login (say ActivityID 1) and the first instance after that for that user on that day of user data being returned (say ActivityID 2).
I can do this by going through the table with a cursor, getting each login instance and then for that doing a select to say get the minimum user data return following it for that user on that day but that's obviously not optimal and is slow as hell.
My question is (finally) - is there a "proper" SQL way of doing this using self joins or similar without using cursors or some similar procedural approach? I can create views and whatever to my heart's content, it doesn't have to be a single select.
I can hack something together but I'd like to make the analysis I'm doing a standard product function so would like it to be right.
SELECT TheDay, AVG(TimeTaken) AvgTimeTaken
FROM (
    SELECT
        CONVERT(DATE, logins.DateTimeStamp) TheDay
        , DATEDIFF(SS, logins.DateTimeStamp,
            (SELECT TOP 1 userinfo.DateTimeStamp
             FROM AuditData userinfo
             WHERE userinfo.UserID = logins.UserID
               AND userinfo.ActivityCode = 2
               AND userinfo.DateTimeStamp > logins.DateTimeStamp
             ORDER BY userinfo.DateTimeStamp) -- without ORDER BY, TOP 1 may pick any later row
          ) TimeTaken
    FROM AuditData logins
    WHERE
        logins.ActivityCode = 1
) LogInTimes
GROUP BY TheDay
This might be dead slow in real world though.
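For comparison, here is the same pairing written as a minimal Python sketch. Unlike the query, it also applies the "for that user on that day" restriction from the question. The function name `avg_login_seconds`, the tuple layout and all sample rows except the first three are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime

def avg_login_seconds(audit):
    """audit: list of (user_id, timestamp, activity_code) tuples (hypothetical layout).

    Pairs each login (code 1) with the first later 'user data returned' row
    (code 2) of the same user on the same day, then averages the gap per day.
    """
    data_back = defaultdict(list)
    for user, ts, code in audit:
        if code == 2:
            data_back[user].append(ts)
    for times in data_back.values():
        times.sort()

    gaps = defaultdict(list)
    for user, ts, code in audit:
        if code != 1:
            continue
        later = [t for t in data_back.get(user, []) if t > ts and t.date() == ts.date()]
        if later:  # logins with no matching code-2 row that day are ignored
            gaps[ts.date()].append((later[0] - ts).total_seconds())
    return {day: sum(values) / len(values) for day, values in gaps.items()}

audit = [
    ("P123", datetime(2009, 1, 1, 12, 39), 1),
    ("P123", datetime(2009, 1, 1, 12, 40), 2),
    ("P123", datetime(2009, 1, 1, 12, 47), 3),
    ("P456", datetime(2009, 1, 1, 9, 0), 1),       # made-up second user
    ("P456", datetime(2009, 1, 1, 9, 3), 2),
    ("P123", datetime(2009, 1, 2, 8, 0, 0), 1),    # made-up second day
    ("P123", datetime(2009, 1, 2, 8, 1, 30), 2),
]
averages = avg_login_seconds(audit)
print(averages)
```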
In Oracle this would be a cinch, because of analytic functions. In this case, LAG() makes it easy to find the matching pairs of activity codes 1 and 2 and also to calculate the trend. As you can see, things got worse on 2nd JAN and improved quite a bit on the 3rd (I'm working in seconds rather than minutes).
select DateOnly
     , elapsed_time
     , elapsed_time - lag (elapsed_time) over (order by DateOnly) as trend
from
  (
    select DateOnly
         , avg(databack_time - prior_login_time) as elapsed_time
    from
      ( select DateOnly
             , databack_time
             , ActivityCode
             , lag(login_time) over (order by DateOnly,UserID, AuditRecordID, ActivityCode) as prior_login_time
        from
          (
            select a1.AuditRecordID
                 , a1.DateOnly
                 , a1.UserID
                 , a1.ActivityCode
                 , to_number(to_char(a1.DateTimeStamp, 'SSSSS')) as login_time
                 , 0 as databack_time
            from AuditData a1
            where a1.ActivityCode = 1
            union all
            select a2.AuditRecordID
                 , a2.DateOnly
                 , a2.UserID
                 , a2.ActivityCode
                 , 0 as login_time
                 , to_number(to_char(a2.DateTimeStamp, 'SSSSS')) as databack_time
            from AuditData a2
            where a2.ActivityCode = 2
          )
      )
    where ActivityCode = 2
    group by DateOnly
  )
/
DATEONLY  ELAPSED_TIME      TREND
--------- ------------ ----------
01-JAN-09          120
02-JAN-09          600        480
03-JAN-09          150       -450
Like I said in my comment I guess you're working in MSSQL. I don't know whether that product has any equivalent of LAG().
If the assumptions are that:
Users will perform various tasks in no mandated order, and
That the difference between any two activities reflects the time it takes for the first of those two activities to execute,
Then why not create a table with two timestamps, the first column containing the activity start time, the second column containing the next activity start time. Thus the difference between these two will always be total time of the first activity. So for the logout activity, you would just have NULL for the second column.
So it would be kind of weird and interesting, for each activity (other than logging in and logging out), the time stamp would be recorded in two different rows--once for the last activity (as the time "completed") and again in a new row (as time started). You would end up with a jacob's ladder of sorts, but finding the data you are after would be much more simple.
In fact, to get really wacky, you could have each row have the time that the user started activity A and the activity code, and the time started activity B and the time stamp (which, as mentioned above, gets put down again for the following row). This way each row will tell you the exact difference in time for any two activities.
Otherwise, you're stuck with a query that says something like
SELECT TIME_IN_SEC(row2-timestamp) - TIME_IN_SEC(row1-timestamp)
which would be pretty slow, as you have already suggested. By swallowing the redundancy, you end up just querying the difference between the two columns. You probably would have less need of knowing the user info as well, since you'd know that any row shows both activity codes, thus you can just query the average for all users on any given day and compare it to the next day (unless you are trying to find out which users are having the problem as well).
This is a faster query: each row ends up holding both its own datetime value and that of the row before it, after which you can use DATEDIFF ( datepart , startdate , enddate ). I use @DammyVariable and DammyField because, as I remember, there is some problem if the first assignment in the UPDATE statement is not of the form @variable = Field.
SELECT *, Cast(NULL AS DateTime) LastRowDateTime, Cast(NULL As INT) DammyField INTO #T FROM AuditData
GO
CREATE CLUSTERED INDEX IX_T ON #T (AuditRecordID)
GO
DECLARE @LastRowDateTime DateTime
DECLARE @DammyVariable INT
SET @LastRowDateTime = NULL
SET @DammyVariable = 1
UPDATE #T SET
@DammyVariable = DammyField = @DammyVariable
, LastRowDateTime = @LastRowDateTime
, @LastRowDateTime = DateTimeStamp
option (maxdop 1)

Calculating different tariff-periods for a call in SQL Server

For a call-rating system, I'm trying to split a telephone call duration into sub-durations for different tariff-periods. The calls are stored in a SQL Server database and have a starttime and total duration. Rates are different for night (0000 - 0800), peak (0800 - 1900) and offpeak (1900-235959) periods.
For example:
A call starts at 18:50:00 and has a duration of 1000 seconds. This would make the call end at 19:06:40, making it 10 minutes / 600 seconds in the peak-tariff and 400 seconds in the off-peak tariff.
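The arithmetic of this example can be checked with a few lines of Python. This is only a sketch of the single boundary crossing in the example; the calendar date is arbitrary and the variable names are mine.

```python
from datetime import datetime, timedelta

start = datetime(2009, 1, 1, 18, 50, 0)  # arbitrary date; only the time of day matters
duration = 1000                           # call length in seconds
end = start + timedelta(seconds=duration)

# The peak tariff stops at 19:00 on the same day.
peak_end = start.replace(hour=19, minute=0, second=0)

# Seconds before the boundary are peak, the rest is off-peak.
peak_seconds = int((min(end, peak_end) - start).total_seconds())
offpeak_seconds = duration - peak_seconds
```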
Obviously, a call can wrap over an unlimited number of periods (we do not enforce a maximum call duration). A call lasting > 24 h can wrap all 3 periods, starting in peak, going through off-peak, night and back into peak tariff.
Currently, we are calculating the different tariff-periods using recursion in VB. We calculate how much of the call goes in the same tariff-period the call starts in, change the starttime and duration of the call accordingly and repeat this process till the full duration of the call has been reached (peakDuration + offpeakDuration + nightDuration == callDuration).
Regarding this issue, I have 2 questions:
1. Is it possible to do this effectively in a SQL Server statement? (I can think of subqueries or lots of coding in stored procedures, but that would not generate any performance improvement)
2. Will SQL Server be able to do such calculations in a way more resource-effective than the current VB scripts are doing it?
It seems to me that this is an operation with two phases.
1. Determine which parts of the phone call use which rates at which time.
2. Sum the times in each of the rates.
Phase 1 is trickier than Phase 2. I've worked the example in IBM Informix Dynamic Server (IDS) because I don't have MS SQL Server. The ideas should translate easily enough. The INTO TEMP clause creates a temporary table with an appropriate schema; the table is private to the session and vanishes when the session ends (or you explicitly drop it). In IDS, you can also use an explicit CREATE TEMP TABLE statement and then INSERT INTO temp-table SELECT ... as a more verbose way of doing the same job as INTO TEMP.
As so often in SQL questions on SO, you've not provided us with a schema, so everyone has to invent a schema that might, or might not, match what you describe.
Let's assume your data is in two tables. The first table has the call log records, the basic information about the calls made, such as the phone making the call, the number called, the time when the call started and the duration of the call:
CREATE TABLE clr -- call log record
(
phone_id VARCHAR(24) NOT NULL, -- billing plan
called_number VARCHAR(24) NOT NULL, -- needed to validate call
start_time TIMESTAMP NOT NULL, -- date and time when call started
duration INTEGER NOT NULL -- duration of call in seconds
CHECK(duration > 0),
PRIMARY KEY(phone_id, start_time)
-- other complicated range-based constraints omitted!
-- foreign keys omitted
-- there would probably be an auto-generated number here too.
);
INSERT INTO clr(phone_id, called_number, start_time, duration)
VALUES('650-656-3180', '650-794-3714', '2009-02-26 15:17:19', 186234);
For convenience (mainly to save writing the addition multiple times), I want a copy of the clr table with the actual end time:
SELECT phone_id, called_number, start_time AS call_start, duration,
start_time + duration UNITS SECOND AS call_end
FROM clr
INTO TEMP clr_end;
The tariff data is stored in a simple table:
CREATE TABLE tariff
(
tariff_code CHAR(1) NOT NULL -- code for the tariff
CHECK(tariff_code IN ('P','N','O'))
PRIMARY KEY,
rate_start TIME NOT NULL, -- time when rate starts
rate_end TIME NOT NULL, -- time when rate ends
rate_charged DECIMAL(7,4) NOT NULL -- rate charged (cents per second)
);
INSERT INTO tariff(tariff_code, rate_start, rate_end, rate_charged)
VALUES('N', '00:00:00', '08:00:00', 0.9876);
INSERT INTO tariff(tariff_code, rate_start, rate_end, rate_charged)
VALUES('P', '08:00:00', '19:00:00', 2.3456);
INSERT INTO tariff(tariff_code, rate_start, rate_end, rate_charged)
VALUES('O', '19:00:00', '23:59:59', 1.2345);
I debated whether the tariff table should use TIME or INTERVAL values; in this context, the times are very similar to intervals relative to midnight, but intervals can be added to timestamps where times cannot. I stuck with TIME, but it made things messy.
The tricky part of this query is generating the relevant date and time ranges for each tariff without loops. In fact, I ended up using a loop embedded in a stored procedure to generate a list of integers. (I also used a technique that is specific to IBM Informix Dynamic Server, IDS, using the table ID numbers from the system catalog as a source of contiguous integers in the range 1..N, which works for numbers from 1 to 60 in version 11.50.)
CREATE PROCEDURE integers(lo INTEGER DEFAULT 0, hi INTEGER DEFAULT 0)
RETURNING INT AS number;
DEFINE i INTEGER;
FOR i = lo TO hi STEP 1
RETURN i WITH RESUME;
END FOR;
END PROCEDURE;
In the simple case (and the most common case), the call falls in a single-tariff period; the multi-period calls add the excitement.
Let's assume we can create a table expression that matches this schema and covers all the timestamp values we might need:
CREATE TEMP TABLE tariff_date_time
(
tariff_code CHAR(1) NOT NULL,
rate_start TIMESTAMP NOT NULL,
rate_end TIMESTAMP NOT NULL,
rate_charged DECIMAL(7,4) NOT NULL
);
Fortunately, you haven't mentioned weekend rates, so you charge the customers the same rates at the weekend as during the week. However, the answer should adapt to such situations if at all possible. If you were to get as complex as giving weekend rates on public holidays, except that at Christmas or New Year, you charge peak rate instead of weekend rate because of the high demand, then you would be best off storing the rates in a permanent tariff_date_time table.
The first step in populating tariff_date_time is to generate a list of dates which are relevant to the calls:
SELECT DISTINCT EXTEND(DATE(call_start) + number, YEAR TO SECOND) AS call_date
FROM clr_end,
TABLE(integers(0, (SELECT DATE(call_end) - DATE(call_start) FROM clr_end)))
AS date_list(number)
INTO TEMP call_dates;
The difference between the two date values is an integer number of days (in IDS).
The procedure integers generates values from 0 to the number of days covered by the call and stores the result in a temp table. For the more general case of multiple records, it might be better to calculate the minimum and maximum dates and generate the dates in between rather than generate dates multiple times and then eliminate them with the DISTINCT clause.
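For readers without IDS to hand, the date-list step can be sketched in Python. This only mirrors the idea (add 0..N days to the call's first date); `dates_covered` is a hypothetical name, and the call is the sample row inserted into clr earlier.

```python
from datetime import date, datetime, timedelta

def dates_covered(call_start, call_end):
    """Return every calendar date from the call's first day to its last day."""
    days = (call_end.date() - call_start.date()).days
    # Same role as integers(0, days): one offset per day touched by the call.
    return [call_start.date() + timedelta(days=n) for n in range(days + 1)]

call_start = datetime(2009, 2, 26, 15, 17, 19)
call_end = call_start + timedelta(seconds=186234)
call_dates = dates_covered(call_start, call_end)
```

The sample call ends a little over two days after it starts, so three dates come back.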
Now use a cartesian product of the tariff table with the call_dates table to generate the rate information for each day. This is where the tariff times would be neater as intervals.
SELECT r.tariff_code,
d.call_date + (r.rate_start - TIME '00:00:00') AS rate_start,
d.call_date + (r.rate_end - TIME '00:00:00') AS rate_end,
r.rate_charged
FROM call_dates AS d, tariff AS r
INTO TEMP tariff_date_time;
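The same cartesian product, sketched in Python so the shape of tariff_date_time is easy to see. The tariff rows are the three from the sample data (rates left out for brevity) and the dates are those covered by the sample call; the tuple layout is my own choice.

```python
from datetime import date, datetime, time
from itertools import product

# Tariff rows as in the sample data: (code, rate_start, rate_end).
tariffs = [
    ("N", time(0, 0, 0), time(8, 0, 0)),
    ("P", time(8, 0, 0), time(19, 0, 0)),
    ("O", time(19, 0, 0), time(23, 59, 59)),
]
call_dates = [date(2009, 2, 26), date(2009, 2, 27), date(2009, 2, 28)]

# Every date paired with every tariff, with the times anchored to that date.
tariff_date_time = [
    (code, datetime.combine(d, t_start), datetime.combine(d, t_end))
    for d, (code, t_start, t_end) in product(call_dates, tariffs)
]
```

Three dates times three tariffs gives nine dated rate windows.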
Now we need to match the call log record with the tariffs that apply. The condition is a standard way of dealing with overlaps - two time periods overlap if the end of the first is later than the start of the second and if the start of the first is before the end of the second:
SELECT tdt.*, clr_end.*
FROM tariff_date_time tdt, clr_end
WHERE tdt.rate_end > clr_end.call_start
AND tdt.rate_start < clr_end.call_end
INTO TEMP call_time_tariff;
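The overlap test in that WHERE clause is easy to get backwards, so here is the same condition as a small Python predicate. The function name and the sample periods are illustrative only.

```python
from datetime import datetime

def overlaps(a_start, a_end, b_start, b_end):
    """True when period A and period B share some time.

    Same test as the WHERE clause: A ends after B starts, and A starts
    before B ends. Periods that merely touch do not count as overlapping.
    """
    return a_end > b_start and a_start < b_end

night = (datetime(2009, 2, 26, 0, 0, 0), datetime(2009, 2, 26, 8, 0, 0))
peak = (datetime(2009, 2, 26, 8, 0, 0), datetime(2009, 2, 26, 19, 0, 0))
call = (datetime(2009, 2, 26, 18, 50, 0), datetime(2009, 2, 26, 19, 6, 40))

call_in_peak = overlaps(*peak, *call)     # call starts before peak ends
call_in_night = overlaps(*night, *call)   # night ended long before the call
night_vs_peak = overlaps(*night, *peak)   # adjacent periods only touch at 08:00
```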
Then we need to establish the start and end times for the rate. The start time for the rate is the later of the start time for the tariff and the start time of the call. The end time for the rate is the earlier of the end time for the tariff and the end time of the call:
SELECT phone_id, called_number, tariff_code, rate_charged,
call_start, duration,
CASE WHEN rate_start < call_start THEN call_start
ELSE rate_start END AS rate_start,
CASE WHEN rate_end >= call_end THEN call_end
ELSE rate_end END AS rate_end
FROM call_time_tariff
INTO TEMP call_time_tariff_times;
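The two CASE expressions amount to "latest start, earliest end". A Python sketch of that clipping, using the 18:50 call of 1000 seconds from the question as the test case; `clip_to_call` is a hypothetical name.

```python
from datetime import datetime

def clip_to_call(rate_start, rate_end, call_start, call_end):
    """Clip a tariff window to the part actually covered by the call."""
    return max(rate_start, call_start), min(rate_end, call_end)

call_start = datetime(2009, 2, 26, 18, 50, 0)
call_end = datetime(2009, 2, 26, 19, 6, 40)

peak = clip_to_call(datetime(2009, 2, 26, 8, 0, 0), datetime(2009, 2, 26, 19, 0, 0),
                    call_start, call_end)
offpeak = clip_to_call(datetime(2009, 2, 26, 19, 0, 0), datetime(2009, 2, 26, 23, 59, 59),
                       call_start, call_end)

peak_seconds = (peak[1] - peak[0]).total_seconds()
offpeak_seconds = (offpeak[1] - offpeak[0]).total_seconds()
```

Clipping only makes sense for windows that already passed the overlap test; otherwise the clipped end can fall before the clipped start.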
Finally, we need to sum the times spent at each tariff rate, and take that time (in seconds) and multiply by the rate charged. Since the result of SUM(rate_end - rate_start) is an INTERVAL, not a number, I had to invoke a conversion function to convert the INTERVAL into a DECIMAL number of seconds, and that (non-standard) function is iv_seconds:
SELECT phone_id, called_number, tariff_code, rate_charged,
call_start, duration,
SUM(rate_end - rate_start) AS tariff_time,
rate_charged * iv_seconds(SUM(rate_end - rate_start)) AS tariff_cost
FROM call_time_tariff_times
GROUP BY phone_id, called_number, tariff_code, rate_charged,
call_start, duration;
For the sample data, this yielded the following (the phone number and called number are omitted for compactness):
tariff_code  rate_charged  call_start           duration  tariff_time  tariff_cost
N            0.9876        2009-02-26 15:17:19  186234    0 16:00:00   56885.760000000
O            1.2345        2009-02-26 15:17:19  186234    0 10:01:11   44529.649500000
P            2.3456        2009-02-26 15:17:19  186234    1 01:42:41   217111.081600000
That's a very expensive call, but the telco will be happy with that. You can poke at any of the intermediate results to see how the answer is derived. You can use fewer temporary tables at the cost of some clarity.
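As a sanity check on those figures, the costs can be recomputed from the tariff times with a few lines of Python. This is only a sketch over the three result rows; the helper name is mine, and Decimal is used to avoid floating-point noise.

```python
from decimal import Decimal

# (tariff_code, rate in cents per second, days, hours, minutes, seconds) from the output above.
results = [
    ("N", Decimal("0.9876"), 0, 16, 0, 0),
    ("O", Decimal("1.2345"), 0, 10, 1, 11),
    ("P", Decimal("2.3456"), 1, 1, 42, 41),
]

def to_seconds(days, hours, minutes, seconds):
    """Convert a day-to-second interval into a plain number of seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

costs = {code: rate * to_seconds(d, h, m, s) for code, rate, d, h, m, s in results}
billed_seconds = sum(to_seconds(d, h, m, s) for _, _, d, h, m, s in results)
```

The costs match the output, but the billed time adds up to 186232 seconds against a call duration of 186234. The two missing seconds appear to come from the off-peak tariff ending at 23:59:59: the last second before each of the two midnights the call crosses belongs to no tariff row. Ending the off-peak window at the following midnight would close that gap.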
For a single call, this will not be much different than running the code in VB in the client. For a lot of calls, this has the potential to be more efficient. I'm far from convinced that recursion is necessary in VB - straight iteration should be sufficient.
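To illustrate the point about iteration, here is a hedged Python sketch of the client-side split as a plain loop. It is not the asker's VB code. It assumes start times fall on whole seconds and, unlike the tariff table above, treats off-peak as running right up to midnight so that every second is billed. `PERIODS` and `split_call` are hypothetical names.

```python
from datetime import datetime, time, timedelta

# Tariff periods as (code, start, end) in seconds since midnight; end is exclusive.
# Assumption: off-peak runs to 24:00:00 rather than 23:59:59.
PERIODS = [("night", 0, 8 * 3600), ("peak", 8 * 3600, 19 * 3600), ("offpeak", 19 * 3600, 24 * 3600)]

def split_call(start, duration):
    """Split a call of `duration` seconds into seconds per tariff, iteratively."""
    totals = {code: 0 for code, _, _ in PERIODS}
    current, remaining = start, duration
    while remaining > 0:
        midnight = datetime.combine(current.date(), time(0, 0, 0))
        offset = int((current - midnight).total_seconds())
        # Find the period the current instant falls in.
        code, _, period_end = next(p for p in PERIODS if p[1] <= offset < p[2])
        # Bill up to the end of this period or the end of the call, whichever is first.
        chunk = min(remaining, period_end - offset)
        totals[code] += chunk
        current += timedelta(seconds=chunk)
        remaining -= chunk
    return totals

short_call = split_call(datetime(2009, 1, 1, 18, 50, 0), 1000)
long_call = split_call(datetime(2009, 2, 26, 15, 17, 19), 186234)
```

Each pass through the loop consumes at most one tariff period, so a call of any length needs roughly three iterations per day it spans.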
-- Table: kar_vasile(id, vid, datein, timein, timeout, bikari, tozihat)
-- bikari is idle (non-working) time in minutes; remove it wherever you don't need it.
select
    id,
    vid,
    datein,
    timein,
    timeout,
    bikari,
    hourwork =
        case when timein <= timeout
            -- whole hours worked when the shift ends on the same day
            then SUM(abs(DATEDIFF(mi, timein, timeout)) - bikari) / 60
            -- whole hours worked when the start time is later than the end time (shift crosses midnight)
            else SUM(abs(DATEDIFF(mi, timein, '23:59:00:00') + DATEDIFF(mi, '00:00:00', timeout) + 1) - bikari) / 60
        end,
    minwork =
        case when timein <= timeout
            -- leftover minutes when the shift ends on the same day
            then SUM(abs(DATEDIFF(mi, timein, timeout)) - bikari) % 60
            -- leftover minutes when the start time is later than the end time
            else SUM(abs(DATEDIFF(mi, timein, '23:59:00:00') + DATEDIFF(mi, '00:00:00', timeout) + 1) - bikari) % 60
        end,
    tozihat
from kar_vasile
group by id, vid, datein, timein, timeout, tozihat, bikari
Effectively in T-SQL? I suspect not, with the schema as described at present.
It might be possible, however, if your rate table stores the three tariffs for each date. There is at least one reason why you might do this, apart from the problem at hand: it's likely at some point that rates for one period or another might change and you may need to have the historic rates available.
So say we have these tables:
CREATE TABLE rates (
from_date_time DATETIME
, to_date_time DATETIME
, rate MONEY
)
CREATE TABLE calls (
id INT
, started DATETIME
, ended DATETIME
)
I think there are three cases to consider (may be more, I'm making this up as I go):
1. a call occurs entirely within one rate period
2. a call starts in one rate period (a) and ends in the next (b)
3. a call spans at least one complete rate period
Assuming rate is per second, I think you might produce something like the following (completely untested) query
/* UNION ALL rather than UNION, so that two periods with the same cost for one call are both kept */
SELECT id, DATEDIFF(ss, started, ended) * rate /* case 1 */
FROM rates JOIN calls ON started > from_date_time AND ended < to_date_time
UNION ALL
SELECT id, DATEDIFF(ss, started, to_date_time) * rate /* case 2a and the start of case 3 */
FROM rates JOIN calls ON started > from_date_time AND ended > to_date_time
UNION ALL
SELECT id, DATEDIFF(ss, from_date_time, ended) * rate /* case 2b and the last part of case 3 */
FROM rates JOIN calls ON started < from_date_time AND ended < to_date_time
UNION ALL
SELECT id, DATEDIFF(ss, from_date_time, to_date_time) * rate /* case 3 for entire rate periods, should pick up all complete periods */
FROM rates JOIN calls ON started < from_date_time AND ended > to_date_time
You could apply a SUM..GROUP BY over that in SQL or handle it in your code. Alternatively, with carefully-constructed logic, you could probably merge the UNIONed parts into a single WHERE clause with lots of ANDs and ORs. I thought the UNION showed the intent rather more clearly.
HTH & HIW (Hope It Works...)
This is a thread about your problem we had over at sqlteam.com. Take a look, because it includes some pretty slick solutions.
Following on from Mike Woodhouse's answer, this may work for you:
SELECT id, SUM(DATEDIFF(ss, started, ended) * rate)
FROM rates
JOIN calls ON
    CASE WHEN started < from_date_time
        THEN DATEADD(ss, 1, from_date_time)
        ELSE started END
    > from_date_time
    AND
    CASE WHEN ended > to_date_time
        THEN DATEADD(ss, -1, to_date_time)
        ELSE ended END
    < to_date_time
GROUP BY id
An actual schema for the relevant tables in your database would have been very helpful. I'll take my best guesses. I've assumed that the Rates table has start_time and end_time as the number of minutes past midnight.
Using a calendar table (a VERY useful table to have in most databases):
SELECT
C.id,
R.rate,
SUM(DATEDIFF(ss,
CASE
WHEN C.start_time < R.rate_start_time THEN R.rate_start_time
ELSE C.start_time
END,
CASE
WHEN C.end_time > R.rate_end_time THEN R.rate_end_time
ELSE C.end_time
END)) AS seconds_at_rate
FROM
Calls C
INNER JOIN
(
SELECT
DATEADD(mi, Rates.start_time, CAL.calendar_date) AS rate_start_time,
DATEADD(mi, Rates.end_time, CAL.calendar_date) AS rate_end_time,
Rates.rate
FROM
Calendar CAL
INNER JOIN Rates ON
1 = 1
WHERE
CAL.calendar_date >= DATEADD(dy, -1, C.start_time) AND
CAL.calendar_date <= C.start_time
) AS R ON
R.rate_start_time < C.end_time AND
R.rate_end_time > C.start_time
GROUP BY
C.id,
R.rate
I just came up with this as I was typing, so it's untested and you will very likely need to tweak it, but hopefully you can see the general idea.
I also just realized that you use a start_time and a duration for your calls. You can just replace C.end_time wherever you see it with DATEADD(ss, C.duration, C.start_time), assuming that the duration is in seconds.
This should perform pretty quickly in any decent RDBMS assuming proper indexes, etc.
Provided that your calls last less than 100 days:
WITH generate_range(item) AS
(
    SELECT 0
    UNION ALL
    SELECT item + 1
    FROM generate_range
    WHERE item < 100
)
SELECT tday, id, span
FROM (
    SELECT tday, id,
        DATEDIFF(minute,
            CASE WHEN tbegin < clbegin THEN clbegin ELSE tbegin END,
            CASE WHEN tend < clend THEN tend ELSE clend END
        ) AS span
    FROM (
        -- clbegin and clend are carried through so the outer query can clip against them
        SELECT DATEADD(day, item, DATEDIFF(day, 0, clbegin)) AS tday,
            ti.id, clbegin, clend,
            DATEADD(minute, rangestart, DATEADD(day, item, DATEDIFF(day, 0, clbegin))) AS tbegin,
            DATEADD(minute, rangeend, DATEADD(day, item, DATEDIFF(day, 0, clbegin))) AS tend
        FROM calls, generate_range, tariff ti
        WHERE DATEADD(day, 1, DATEDIFF(day, 0, clend)) > DATEADD(day, item, DATEDIFF(day, 0, clbegin))
    ) t1
) t2
WHERE span > 0
I'm assuming you keep your tariffs ranges in minutes from midnight and count lengths in minutes too.
The big problem with performing this kind of calculation at the database level is that it takes resource away from your database while it's going on, both in terms of CPU and availability of rows and tables via locking. If you were calculating 1,000,000 tariffs as part of a batch operation, then that might run on the database for a long time and during that time you'd be unable to use the database for anything else.
If you have the resource, retrieve all the data you need with one transaction and do all the logic calculations outside the database, in a language of your choice. Then insert all the results. Databases are for storing and retrieving data, and any business logic they perform should be kept to an absolute bare minimum at all times. Whilst brilliant at some things, SQL isn't the best language for date or string manipulation work.
I suspect you're already on the right lines with your VB work, and without knowing more it certainly feels like a recursive, or at least an iterative, problem to me. When done correctly, recursion can be a powerful and elegant solution to a problem. Tying up the resources of your database very rarely is.