How to create start & stop times when multiple start times are given? - sql

I'm trying to quantify active vs. idle times, and the first thing I need to do is to create distinct and discrete start and end times. The issue is that the database is (I'm told this is a bug) creating multiple "start" times for the events. To make it even more complicated, a "report" can have multiple instances of being worked on, and each one should be logged as a discrete duration.
For instance,
WorkflowID  ReportID  User  Action  Timestamp
1           1         A     Start   1:00
2           1         A     Stop    1:03
3           1         B     Start   1:05
4           1         B     Start   1:06
5           1         B     Stop    1:08
6           1         B     Start   1:10
7           1         B     Start   1:11
8           1         B     Stop    1:14
I want to write a SQL query that would output the following:
User  StartTime  EndTime
A     1:00       1:03
B     1:05       1:08
B     1:10       1:14
The issue I'm running into is that the number of start/stop events is arbitrary (per ReportID, per User). In addition, the superfluous "Start" times between the first "Start" in a series and the following "Stop" need to be removed so they don't mess things up.
Maybe I'm missing something, but this is tricky to me. Any thoughts? Thank you.

To deduplicate, use lag() to compare the previous action for a user and a report with the current one. If they are the same, it's a duplicate; mark it as such. Then number the starts and stops using row_number(), so that each start and the stop belonging to it share a number (per report and user). Then join on the report, the user and that number.
For convenience you can use CTEs to structure the query and avoid duplicating some subqueries.
WITH [DeduplicatedAndNumbered] AS
(
    SELECT [WorkflowID],
           [ReportID],
           [User],
           [Action],
           [Timestamp],
           row_number() OVER (PARTITION BY [ReportID], [User], [Action]
                              ORDER BY [Timestamp]) [Number]
    FROM (SELECT [WorkflowID],
                 [ReportID],
                 [User],
                 [Action],
                 [Timestamp],
                 CASE
                     WHEN lag([Action]) OVER (PARTITION BY [ReportID], [User]
                                              ORDER BY [Timestamp]) = [Action] THEN 1
                     ELSE 0
                 END [IsDuplicate]
          FROM [elbaT]) [x]
    WHERE [IsDuplicate] = 0
),
[DeduplicatedAndNumberedStart] AS
(
    SELECT [WorkflowID], [ReportID], [User], [Action], [Timestamp], [Number]
    FROM [DeduplicatedAndNumbered]
    WHERE [Action] = 'Start'
),
[DeduplicatedAndNumberedStop] AS
(
    SELECT [WorkflowID], [ReportID], [User], [Action], [Timestamp], [Number]
    FROM [DeduplicatedAndNumbered]
    WHERE [Action] = 'Stop'
)
SELECT [DeduplicatedAndNumberedStart].[User],
       [DeduplicatedAndNumberedStart].[Timestamp] [StartTime],
       [DeduplicatedAndNumberedStop].[Timestamp] [EndTime]
FROM [DeduplicatedAndNumberedStart]
INNER JOIN [DeduplicatedAndNumberedStop]
    ON [DeduplicatedAndNumberedStart].[ReportID] = [DeduplicatedAndNumberedStop].[ReportID]
    AND [DeduplicatedAndNumberedStart].[User] = [DeduplicatedAndNumberedStop].[User]
    AND [DeduplicatedAndNumberedStart].[Number] = [DeduplicatedAndNumberedStop].[Number];
db<>fiddle

OP has tagged their question with sql-server-2008.
Since SQL Server 2008 lacks the lag() function (it was added in SQL Server 2012), here is a solution that uses Common Table Expressions and row_number() which were available from SQL Server 2005 onwards...
;with [StopEvents] as (
select [WorkflowID],
[ReportID],
[User],
[EndTime] = [Timestamp],
[StopEventSeq] = row_number() over (
partition by [ReportID], [User]
order by [Timestamp])
from Workflow
where [Action] = 'Stop'
)
select this.[User], [StartTime], this.[EndTime]
from [StopEvents] this
-- Left join here because first Stop event won't have a previous Stop event
left join [StopEvents] previous
on previous.[ReportID] = this.[ReportID]
and previous.[User] = this.[User]
and previous.[StopEventSeq] = this.[StopEventSeq] - 1
outer apply (
select [StartTime] = min([Timestamp])
from Workflow W
where W.[ReportID] = this.[ReportID]
and W.[User] = this.[User]
and W.[Timestamp] < this.[EndTime]
-- First Stop event won't have a previous, so just get the min([Timestamp])
and (previous.[EndTime] is null or W.[Timestamp] >= previous.[EndTime])
) thisStart
order by this.[User], this.[EndTime]


Collapse multiple rows into a single row based upon a break condition

I have a simple-sounding requirement that has had me stumped for a day or so now, so it's time to seek help from the experts.
My requirement is simply to roll up multiple rows into a single row based upon a break condition: when any of the columns Employee ID, Allowance Plan, Allowance Amount or To Date change, the row is to be kept, if that makes sense.
An example source data set is shown below:
and the target data after collapsing the rows should look like this:
As you can see, I don't need any running totals calculated; I just need to collapse the rows into a single record per From Date/To Date combination.
So far I have tried the following SQL using GROUP BY and the MIN function
select [Employee ID], [Allowance Plan],
min([From Date]), max([To Date]), [Allowance Amount]
from [dbo].[#AllowInfo]
group by [Employee ID], [Allowance Plan], [Allowance Amount]
but that just gives me a single row and does not take the break condition into account.
What do I need to do so that the records are rolled up (correct me if that is not the right terminology) correctly, taking the break condition into account?
Any help is appreciated.
Thank you.
Note that your test data does not really exercise the algorithm that well - e.g. you only have one employee and one plan. Also, as you described it, you would end up with 4 rows, as there is a change of ToDate between rows 7->8, 8->9, 9->10 and 10->11.
But I can see what you are trying to do, so this should at least get you on the right track, and it returns the expected 3 rows. I have taken the end of a group to be where either employee/plan/amount has changed, or where ToDate is not null (or where we reach the end of the data).
CREATE TABLE #data
(
RowID INT,
EmployeeID INT,
AllowancePlan VARCHAR(30),
FromDate DATE,
ToDate DATE,
AllowanceAmount DECIMAL(12,2)
);
INSERT INTO #data(RowID, EmployeeID, AllowancePlan, FromDate, ToDate, AllowanceAmount)
VALUES
(1,200690,'CarAllowance','30/03/2017', NULL, 1000.0),
(2,200690,'CarAllowance','01/08/2017', NULL, 1000.0),
(6,200690,'CarAllowance','23/04/2018', NULL, 1000.0),
(7,200690,'CarAllowance','30/03/2018', NULL, 1000.0),
(8,200690,'CarAllowance','21/06/2018', '01/04/2019', 1000.0),
(9,200690,'CarAllowance','04/11/2021', NULL, 1000.0),
(10,200690,'CarAllowance','30/03/2017', '13/05/2022', 1000.0),
(11,200690,'CarAllowance','14/05/2022', NULL, 850.0);
-- find where the break points are
WITH chg AS
(
SELECT *,
CASE WHEN LAG(EmployeeID, 1, -1) OVER(ORDER BY RowID) != EmployeeID
OR LAG(AllowancePlan, 1, 'X') OVER(ORDER BY RowID) != AllowancePlan
OR LAG(AllowanceAmount, 1, -1) OVER(ORDER BY RowID) != AllowanceAmount
OR LAG(ToDate, 1) OVER(ORDER BY RowID) IS NOT NULL
THEN 1 ELSE 0 END AS NewGroup
FROM #data
),
-- count the number of break points as we go to group the related rows
grp AS
(
SELECT chg.*,
ISNULL(
SUM(NewGroup)
OVER (ORDER BY RowID
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
0) AS grpNum
FROM chg
)
SELECT MIN(grp.RowID) AS RowID,
MAX(grp.EmployeeID) AS EmployeeID,
MAX(grp.AllowancePlan) AS AllowancePlan,
MIN(grp.FromDate) AS FromDate,
MAX(grp.ToDate) AS ToDate,
MAX(grp.AllowanceAmount) AS AllowanceAmount
FROM grp
GROUP BY grpNum
One way is to give each row its ending ToDate (the next non-null one), and then group on that:
select min(t.RowID) as RowID,
t.EmployeeID,
min(t.AllowancePlan) as AllowancePlan,
min(t.FromDate) as FromDate,
max(t.ToDate) as ToDate,
min(t.AllowanceAmount) as AllowanceAmount
from ( select t.RowID,
t.EmployeeID,
t.FromDate,
t.AllowancePlan,
t.AllowanceAmount,
case when t.ToDate is null then ( select top 1 t2.ToDate
from test t2
where t2.EmployeeID = t.EmployeeID
and t2.ToDate is not null
and t2.FromDate > t.FromDate -- t2.RowID > t.RowID
order by t2.RowID, t2.FromDate
)
else t.ToDate
end as todate
from test t
) t
group by t.EmployeeID, t.ToDate
order by t.EmployeeID, min(t.RowID)
See and test yourself in this DBFiddle
the result is
RowID  EmployeeID  AllowancePlan  FromDate    ToDate      AllowanceAmount
1      200690      CarAllowance   2017-03-30  2019-04-01  1000
9      200690      CarAllowance   2021-11-04  2022-05-13  1000
11     200690      CarAllowance   2022-05-14  (null)      850

MSSQL - Delete duplicate rows using common column values

I haven't used SQL in quite a while, so I'm a bit lost here. I want to check for rows with duplicate values in the "Duration" and "date" columns and remove them from the query results. I need to keep the rows where Status = "Transfer", since these hold more information about the call and how it was routed through our system.
I want to use this for a dashboard, which would include counting the total number of calls from that query, which is why I cannot have both.
Here's the (Simplified) code used:
SELECT status, user, duration, phonenumber, date
FROM (SELECT * FROM view_InboundPhoneCalls) as Phonecalls
WHERE date>=DATEADD(dd, -15, getdate())
--GROUP BY duration
Which gives something of the sort:
Status    User           Duration  phonenumber                       date
Received  Receptionnist  00:34:03  from: +1234567890                 2021-09-30 16:01:57
Received  Receptionnist  00:03:12  from: +9876543210                 2021-09-30 16:02:40
Transfer  User1          00:05:12  +14161654965;Receptionnist;User1  2021-09-30 16:01:57
Received  Receptionnist  00:05:12  from: +14161654965                2021-09-30 16:01:57
The end result would be something like this:
Status    User           Duration  phonenumber                       date
Received  Receptionnist  00:34:03  from: +1234567890                 2021-09-30 16:01:57
Received  Receptionnist  00:03:12  from: +9876543210                 2021-09-30 16:02:40
Transfer  Receptionnist  00:05:12  +14161654965;Receptionnist;User1  2021-09-30 16:01:57
The normal "trick" is to detect duplicates first. One of the easier ways is a CTE (Common Table Expression) along with the ROW_NUMBER() function.
Part One - Mark the duplicates
WITH
cte_Sorted_List
(
status, usertype, duration, phonenumber, dated, duplicate_check
)
AS
( -- only use required fields to speed up
SELECT status, [user], duration, phonenumber, date,
-- marks depend on correct columns!
Row_Number() OVER
( -- one numbering per potential duplicate group
PARTITION BY [user], phonenumber, date, duration
-- with correct sort order
-- bit of a hack: as 'Transfer' sorts after 'Received',
-- DESC marks the record to keep as row number 1
ORDER BY status DESC
) AS duplicate_check
FROM view_InboundPhoneCalls
-- and lose all unnecessary data
WHERE date >= DATEADD(dd, -15, getdate())
)
Part two - show relevant rows
SELECT
status, usertype, duration, phonenumber, dated
FROM
cte_Sorted_List
WHERE
Duplicate_Check = 1
;
The CTE extracts the required fields in a single pass; only that data is then used for the output.
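Since a CTE is only in scope for the single statement that immediately follows it, Part One and Part Two above must be submitted together as one statement. A condensed sketch of the combined query (same view and columns as above):

```sql
-- Parts one and two combined into the single statement SQL Server requires.
WITH cte_Sorted_List (status, usertype, duration, phonenumber, dated, duplicate_check)
AS
(
    SELECT status, [user], duration, phonenumber, date,
           ROW_NUMBER() OVER
           (
               PARTITION BY [user], phonenumber, date, duration
               ORDER BY status DESC   -- 'Transfer' sorts after 'Received', so it gets row 1
           )
    FROM view_InboundPhoneCalls
    WHERE date >= DATEADD(dd, -15, getdate())
)
SELECT status, usertype, duration, phonenumber, dated
FROM cte_Sorted_List
WHERE duplicate_check = 1;
```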
You could go for a blacklist, say with a CTE, then filter out the undesired rows.
Something like:
WITH Blacklist ([date], [duration]) AS (
SELECT [date], [duration] FROM view_InboundPhoneCalls
GROUP BY [date], [duration]
HAVING count(*) > 1
)
SELECT Phonecalls.status, Phonecalls.[user], Phonecalls.duration, Phonecalls.phonenumber, Phonecalls.[date]
FROM view_InboundPhoneCalls as Phonecalls
LEFT JOIN Blacklist
ON Phonecalls.[date] = Blacklist.[date]
AND Phonecalls.[duration] = Blacklist.[duration]
WHERE Blacklist.[date] is null
OR (Blacklist.[date] is not null AND Phonecalls.[Status] = 'Transfer')
You can use row-numbering for this, along with a custom ordering. There is no need for any joins.
SELECT status, [user], duration, phonenumber, date
FROM (
SELECT *,
rn = ROW_NUMBER() OVER (PARTITION BY duration, date
ORDER BY CASE WHEN Status = 'Transfer' THEN 1 ELSE 2 END)
FROM view_InboundPhoneCalls
WHERE date >= DATEADD(day, -15, getdate())
) as Phonecalls
WHERE rn = 1

Query without WHILE Loop

We have an appointment table as shown below. Each appointment needs to be categorized as "New" or "Followup". Any appointment (for a patient) within 30 days of that patient's first appointment is a "Followup". After 30 days, an appointment is again "New", and any appointment within 30 days of that one becomes a "Followup".
I am currently doing this with a WHILE loop.
How can I achieve this without a WHILE loop?
Table
CREATE TABLE #Appt1 (ApptID INT, PatientID INT, ApptDate DATE)
INSERT INTO #Appt1
SELECT 1,101,'2020-01-05' UNION
SELECT 2,505,'2020-01-06' UNION
SELECT 3,505,'2020-01-10' UNION
SELECT 4,505,'2020-01-20' UNION
SELECT 5,101,'2020-01-25' UNION
SELECT 6,101,'2020-02-12' UNION
SELECT 7,101,'2020-02-20' UNION
SELECT 8,101,'2020-03-30' UNION
SELECT 9,303,'2020-01-28' UNION
SELECT 10,303,'2020-02-02'
You need to use a recursive query.
The 30-day period is counted starting from the previous "New" appointment (and no, it is not possible to do this without recursion/a quirky update/a loop). That is why all the existing answers using only ROW_NUMBER fail.
WITH f AS (
SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY PatientId ORDER BY ApptDate)
FROM Appt1
), rec AS (
SELECT Category = CAST('New' AS NVARCHAR(20)), ApptId, PatientId, ApptDate, rn, startDate = ApptDate
FROM f
WHERE rn = 1
UNION ALL
SELECT CAST(CASE WHEN DATEDIFF(DAY, rec.startDate,f.ApptDate) <= 30 THEN N'FollowUp' ELSE N'New' END AS NVARCHAR(20)),
f.ApptId,f.PatientId,f.ApptDate, f.rn,
CASE WHEN DATEDIFF(DAY, rec.startDate, f.ApptDate) <= 30 THEN rec.startDate ELSE f.ApptDate END
FROM rec
JOIN f
ON rec.rn = f.rn - 1
AND rec.PatientId = f.PatientId
)
SELECT ApptId, PatientId, ApptDate, Category
FROM rec
ORDER BY PatientId, ApptDate;
db<>fiddle demo
Output:
+--------+-----------+------------+----------+
| ApptId | PatientId | ApptDate   | Category |
+--------+-----------+------------+----------+
|      1 |       101 | 2020-01-05 | New      |
|      5 |       101 | 2020-01-25 | FollowUp |
|      6 |       101 | 2020-02-12 | New      |
|      7 |       101 | 2020-02-20 | FollowUp |
|      8 |       101 | 2020-03-30 | New      |
|      9 |       303 | 2020-01-28 | New      |
|     10 |       303 | 2020-02-02 | FollowUp |
|      2 |       505 | 2020-01-06 | New      |
|      3 |       505 | 2020-01-10 | FollowUp |
|      4 |       505 | 2020-01-20 | FollowUp |
+--------+-----------+------------+----------+
How it works:
f - gets the starting point (the anchor), one per PatientId
rec - the recursive part; if the difference between the current value and the starting point is > 30, change the category and the starting point, in the context of the PatientId
Main - displays the sorted result set
Similar cases:
Conditional SUM on Oracle - Capping a windowed function
Session window (Azure Stream Analytics)
Running Total until specific condition is true - Quirky update
Addendum
Do not ever use this code in production!
Another option worth mentioning besides the CTE is to use a temp table and update it in "rounds".
It can even be done in a "single" round (a quirky update):
CREATE TABLE Appt_temp (ApptID INT , PatientID INT, ApptDate DATE, Category NVARCHAR(10))
INSERT INTO Appt_temp(ApptId, PatientId, ApptDate)
SELECT ApptId, PatientId, ApptDate
FROM Appt1;
CREATE CLUSTERED INDEX Idx_appt ON Appt_temp(PatientID, ApptDate);
Query:
DECLARE @PatientId INT = 0,
        @PrevPatientId INT,
        @FirstApptDate DATE = NULL;
UPDATE Appt_temp
SET @PrevPatientId = @PatientId
   ,@PatientId = PatientID
   ,@FirstApptDate = CASE WHEN @PrevPatientId <> @PatientId THEN ApptDate
                          WHEN DATEDIFF(DAY, @FirstApptDate, ApptDate) > 30 THEN ApptDate
                          ELSE @FirstApptDate
                     END
   ,Category = CASE WHEN @PrevPatientId <> @PatientId THEN 'New'
                    WHEN @FirstApptDate = ApptDate THEN 'New'
                    ELSE 'FollowUp'
               END
FROM Appt_temp WITH(INDEX(Idx_appt))
OPTION (MAXDOP 1);
SELECT * FROM Appt_temp ORDER BY PatientId, ApptDate;
db<>fiddle Quirky update
You could do this with a recursive cte. You should first order by apptDate within each patient. That can be accomplished by a run-of-the-mill cte.
Then, in the anchor portion of your recursive cte, select the first ordering for each patient, mark the status as 'new', and also mark the apptDate as the date of the most recent 'new' record.
In the recursive portion of your recursive cte, increment to the next appointment, calculate the difference in days between the present appointment and the most recent 'new' appointment date. If it's greater than 30 days, mark it 'new' and reset the most recent new appointment date. Otherwise mark it as 'follow up' and just pass along the existing days since new appointment date.
Finally, in the base query, just select the columns you want.
with orderings as (
select *,
rn = row_number() over(
partition by patientId
order by apptDate
)
from #appt1 a
),
markings as (
select apptId,
patientId,
apptDate,
rn,
type = convert(varchar(10),'new'),
dateOfNew = apptDate
from orderings
where rn = 1
union all
select o.apptId, o.patientId, o.apptDate, o.rn,
type = convert(varchar(10),iif(ap.daysSinceNew > 30, 'new', 'follow up')),
dateOfNew = iif(ap.daysSinceNew > 30, o.apptDate, m.dateOfNew)
from markings m
join orderings o
on m.patientId = o.patientId
and m.rn + 1 = o.rn
cross apply (select daysSinceNew = datediff(day, m.dateOfNew, o.apptDate)) ap
)
select apptId, patientId, apptDate, type
from markings
order by patientId, rn;
I should mention that I initially deleted this answer because Abhijeet Khandagale's answer seemed to meet your needs with a simpler query (after reworking it a bit). But with your comment to him about your business requirement and your added sample data, I undeleted mine because believe this one meets your needs.
I'm not sure that it's exactly what you implemented, but another option worth mentioning besides a CTE is to use a temp table and update it in "rounds". We keep updating the temp table while any statuses are still unset, building the result iteratively, and we can cap the number of iterations with a simple local variable.
Each iteration is split into two stages.
Set all Followup values that are near New records. That's easy to do with the right filter.
For the remaining records that don't have a status set, take the first in each group with the same PatientID and mark it New, since it was not processed by the first stage.
So
CREATE TABLE #Appt2 (ApptID INT, PatientID INT, ApptDate DATE, AppStatus nvarchar(100))
--select * from #Appt1
insert into #Appt2 (ApptID, PatientID, ApptDate, AppStatus)
select a1.ApptID, a1.PatientID, a1.ApptDate, null from #Appt1 a1
declare #limit int = 0;
while (exists(select * from #Appt2 where AppStatus IS NULL) and #limit < 1000)
begin
set #limit = #limit+1;
update a2
set
a2.AppStatus = IIF(exists(
select *
from #Appt2 a
where
0 > DATEDIFF(day, a2.ApptDate, a.ApptDate)
and DATEDIFF(day, a2.ApptDate, a.ApptDate) > -30
and a.ApptID != a2.ApptID
and a.PatientID = a2.PatientID
and a.AppStatus = 'New'
), 'Followup', a2.AppStatus)
from #Appt2 a2
--select * from #Appt2
update a2
set a2.AppStatus = 'New'
from #Appt2 a2 join (select a.*, ROW_NUMBER() over (Partition By PatientId order by ApptId) rn from (select * from #Appt2 where AppStatus IS NULL) a) ar
on a2.ApptID = ar.ApptID
and ar.rn = 1
--select * from #Appt2
end
select * from #Appt2 order by PatientID, ApptDate
drop table #Appt1
drop table #Appt2
Update: read the comment provided by Lukasz. It's a far smarter way. I leave my answer just as an idea.
I believe the recursive common table expression is a great way to optimize queries while avoiding loops, but in some cases it can lead to bad performance and should be avoided if possible.
I used the code below to solve the issue and tested it with more values, but I encourage you to test it with your real data, too.
WITH DataSource AS
(
SELECT *
,CEILING(DATEDIFF(DAY, MIN([ApptDate]) OVER (PARTITION BY [PatientID]), [ApptDate]) * 1.0 / 30 + 0.000001) AS [GroupID]
FROM #Appt1
)
SELECT *
,IIF(ROW_NUMBER() OVER (PARTITION BY [PatientID], [GroupID] ORDER BY [ApptDate]) = 1, 'New', 'Followup')
FROM DataSource
ORDER BY [PatientID]
,[ApptDate];
The idea is pretty simple - I want to separate the records into groups (of 30 days), where the earliest record in each group is New and the others are follow-ups. Check how the statement is built:
SELECT *
,DATEDIFF(DAY, MIN([ApptDate]) OVER (PARTITION BY [PatientID]), [ApptDate])
,DATEDIFF(DAY, MIN([ApptDate]) OVER (PARTITION BY [PatientID]), [ApptDate]) * 1.0 / 30
,CEILING(DATEDIFF(DAY, MIN([ApptDate]) OVER (PARTITION BY [PatientID]), [ApptDate]) * 1.0 / 30 + 0.000001)
FROM #Appt1
ORDER BY [PatientID]
,[ApptDate];
So:
first, we get the earliest date per patient and calculate the difference in days between it and the current row
then, to form the groups, * 1.0 / 30 is added
since for 30, 60, 90, etc. days we would get a whole number but want to start a new period there, + 0.000001 is added; the CEILING function then returns the smallest integer greater than or equal to the expression
That's it. Having such a group, we simply use ROW_NUMBER to find the first date in it, mark it as New, and leave the rest as follow-ups.
With due respect to everybody, and IMHO:
There is not much difference between a WHILE loop and a recursive CTE in terms of RBAR.
There is not much performance gain in combining a recursive CTE with window/partition functions.
ApptID should be int identity(1,1), or at least an ever-increasing clustered index.
Apart from other benefits, this ensures that every successive ApptDate for a patient is greater than the last.
This way you can play with ApptID in your query, which is more efficient than putting inequality operators like >, < on ApptDate.
Putting inequality operators like >, < on ApptID will aid the SQL Server optimizer.
Also, there should be two date columns in the table:
APPDateTime datetime2(0) not null,
Appdate date not null
As these are the most important columns in the most important table, this avoids most casting and converting, and a nonclustered index can be created on Appdate:
Create NonClustered index ix_PID_AppDate_App on APP (patientid, APPDate) include (any non-key columns in the predicates, except APPID)
Test my script with other sample data and let me know of any sample data for which it does not work.
Even if it does not work, I am sure it can be fixed in the script's logic itself.
CREATE TABLE #Appt1 (ApptID INT, PatientID INT, ApptDate DATE)
INSERT INTO #Appt1
SELECT 1,101,'2020-01-05' UNION ALL
SELECT 2,505,'2020-01-06' UNION ALL
SELECT 3,505,'2020-01-10' UNION ALL
SELECT 4,505,'2020-01-20' UNION ALL
SELECT 5,101,'2020-01-25' UNION ALL
SELECT 6,101,'2020-02-12' UNION ALL
SELECT 7,101,'2020-02-20' UNION ALL
SELECT 8,101,'2020-03-30' UNION ALL
SELECT 9,303,'2020-01-28' UNION ALL
SELECT 10,303,'2020-02-02'
;With CTE as
(
select a1.* ,a2.ApptDate as NewApptDate
from #Appt1 a1
outer apply(select top 1 a2.ApptID ,a2.ApptDate
from #Appt1 A2
where a1.PatientID=a2.PatientID and a1.ApptID>a2.ApptID
and DATEDIFF(day,a2.ApptDate, a1.ApptDate)>30
order by a2.ApptID desc )A2
)
,CTE1 as
(
select a1.*, a2.ApptDate as FollowApptDate
from CTE A1
outer apply(select top 1 a2.ApptID ,a2.ApptDate
from #Appt1 A2
where a1.PatientID=a2.PatientID and a1.ApptID>a2.ApptID
and DATEDIFF(day,a2.ApptDate, a1.ApptDate)<=30
order by a2.ApptID desc )A2
)
select *
,case when FollowApptDate is null then 'New'
when NewApptDate is not null and FollowApptDate is not null
and DATEDIFF(day,NewApptDate, FollowApptDate)<=30 then 'New'
else 'Followup' end
as Category
from cte1 a1
order by a1.PatientID
drop table #Appt1
Although it's not clearly addressed in the question, it's easy to figure out that the appointment dates cannot be simply categorized by 30-day groups. It makes no business sense. And you cannot use the appt id either. One can make a new appointment today for 2020-09-06.
Here is how I address this issue. First, get the first appointment, then calculate the date difference between each appointment and the first appt. If it's 0, set to 'New'. If <= 30 'Followup'. If > 30, set as 'Undecided' and do the next round check until there is no more 'Undecided'. And for that, you really need a while loop, but it does not loop through each appointment date, rather only a few datasets. I checked the execution plan. Even though there are only 10 rows, the query cost is significantly lower than that using recursive CTE, but not as low as Lukasz Szozda's addendum method.
IF OBJECT_ID('tempdb..#TEMPTABLE') IS NOT NULL DROP TABLE #TEMPTABLE
SELECT ApptID, PatientID, ApptDate
,CASE WHEN (DATEDIFF(DAY, MIN(ApptDate) OVER (PARTITION BY PatientID), ApptDate) = 0) THEN 'New'
WHEN (DATEDIFF(DAY, MIN(ApptDate) OVER (PARTITION BY PatientID), ApptDate) <= 30) THEN 'Followup'
ELSE 'Undecided' END AS Category
INTO #TEMPTABLE
FROM #Appt1
WHILE EXISTS(SELECT TOP 1 * FROM #TEMPTABLE WHERE Category = 'Undecided') BEGIN
;WITH CTE AS (
SELECT ApptID, PatientID, ApptDate
,CASE WHEN (DATEDIFF(DAY, MIN(ApptDate) OVER (PARTITION BY PatientID), ApptDate) = 0) THEN 'New'
WHEN (DATEDIFF(DAY, MIN(ApptDate) OVER (PARTITION BY PatientID), ApptDate) <= 30) THEN 'Followup'
ELSE 'Undecided' END AS Category
FROM #TEMPTABLE
WHERE Category = 'Undecided'
)
UPDATE #TEMPTABLE
SET Category = CTE.Category
FROM #TEMPTABLE t
LEFT JOIN CTE ON CTE.ApptID = t.ApptID
WHERE t.Category = 'Undecided'
END
SELECT ApptID, PatientID, ApptDate, Category
FROM #TEMPTABLE
I hope this will help you.
WITH CTE AS
(
SELECT #Appt1.*, RowNum = ROW_NUMBER() OVER (PARTITION BY PatientID ORDER BY ApptDate, ApptID) FROM #Appt1
)
SELECT A.ApptID , A.PatientID , A.ApptDate ,
Expected_Category = CASE WHEN (DATEDIFF(MONTH, B.ApptDate, A.ApptDate) > 0) THEN 'New'
WHEN (DATEDIFF(DAY, B.ApptDate, A.ApptDate) <= 30) then 'Followup'
ELSE 'New' END
FROM CTE A
LEFT OUTER JOIN CTE B on A.PatientID = B.PatientID
AND A.rownum = B.rownum + 1
ORDER BY A.PatientID, A.ApptDate
You could use a Case statement.
select
*,
CASE
WHEN DATEDIFF(d,A1.ApptDate,A2.ApptDate)>30 THEN 'New'
ELSE 'FollowUp'
END 'Category'
from
(SELECT PatientId, MIN(ApptId) 'ApptId', MIN(ApptDate) 'ApptDate' FROM #Appt1 GROUP BY PatientID) A1,
#Appt1 A2
where
A1.PatientID=A2.PatientID AND A1.ApptID<A2.ApptID
The question is: should this category be assigned based off the initial appointment, or the one prior? That is, if a patient has had three appointments, should we compare the third appointment to the first, or to the second?
Your problem states the first, which is how I've answered. If that's not the case, you'll want to use lag.
Also, keep in mind that DATEDIFF makes no exception for weekends. If this should count weekdays only, you'll need to create your own scalar-valued function.
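A minimal sketch of such a scalar-valued function, using the classic DATEDIFF arithmetic for an inclusive weekday (Mon-Fri) count. The function name is my own, and the DATENAME comparisons assume an English-language server; verify against your settings before relying on it:

```sql
-- Hypothetical helper: inclusive count of weekdays between two dates.
-- Total days, minus two per crossed week boundary, adjusted when the
-- range starts on a Sunday or ends on a Saturday.
CREATE FUNCTION dbo.WeekdayDiff (@StartDate DATE, @EndDate DATE)
RETURNS INT
AS
BEGIN
    RETURN (DATEDIFF(dd, @StartDate, @EndDate) + 1)
         - (DATEDIFF(wk, @StartDate, @EndDate) * 2)
         - (CASE WHEN DATENAME(dw, @StartDate) = 'Sunday'   THEN 1 ELSE 0 END)
         - (CASE WHEN DATENAME(dw, @EndDate)   = 'Saturday' THEN 1 ELSE 0 END);
END
```

A call such as dbo.WeekdayDiff(A1.ApptDate, A2.ApptDate) could then stand in for the plain DATEDIFF in the query above.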
Using the LAG function:
select apptID, PatientID , Apptdate ,
case when date_diff IS NULL THEN 'NEW'
when date_diff < 30 and (date_diff_2 IS NULL or date_diff_2 < 30) THEN 'Follow Up'
ELSE 'NEW'
END AS STATUS FROM
(
select
apptID, PatientID , Apptdate ,
DATEDIFF (day,lag(Apptdate) over (PARTITION BY PatientID order by ApptID asc),Apptdate) date_diff ,
DATEDIFF(day,lag(Apptdate,2) over (PARTITION BY PatientID order by ApptID asc),Apptdate) date_diff_2
from #Appt1
) SRC
Demo --> https://rextester.com/TNW43808
with cte
as
(
select
tmp.*,
IsNull(Lag(ApptDate) Over (partition by PatientID Order by PatientID,ApptDate),ApptDate) PriorApptDate
from #Appt1 tmp
)
select
PatientID,
ApptDate,
PriorApptDate,
DateDiff(d,PriorApptDate,ApptDate) Elapsed,
Case when DateDiff(d,PriorApptDate,ApptDate)>30
or DateDiff(d,PriorApptDate,ApptDate)=0 then 'New' else 'Followup' end Category from cte
Mine is correct; the author's was incorrect - see the Elapsed column.

Retrieving data dependent on attributes

Everyone, I can't get the following query to work. Please help.
The initial data and output are in the following Excel file: initial data/output google/drive
Here is the logic: for 'Rest' = 2500, it takes the minimum value of 'Date', increments it by one and puts it into the Date1 column of the output; Date2 receives the minimum date of the next 'Rest' value (1181.85), and so on: for 'Rest' = 1181.85, Date1 receives its minimum 'Date' (14.01.2013) incremented by one (15.01.2013), and so on. It should not do the above operations for a 'Rest' value of zero (it should just skip it). We can't simply delete rows with a 'Rest' value of zero, because their Date is used in Date2, as I have explained above. There are many 'accNumber's; it should list all of them. I hope you understood; if not, ask for more details. Thanks in advance. I'm using SQL Server.
If I've understood you correctly, you want to group the items by rest number, and then display the minimum date + 1 day, as well as the minimum date for the "next" rest number. What are you expecting to happen when the Rest number is 0 in two different places?
with Base as
(
select t.AccNum,
t.Rest,
DATEADD(day, 1, MIN(t.Date)) as [StartDate],
ROW_NUMBER() OVER (ORDER BY MIN(t.Date)) as RowNumber
from Accounts t
where t.Rest <> 0
group by t.AccNum, t.Rest
)
select a.AccNum, a.Rest, a.StartDate, DATEADD(DAY, -1, b.StartDate) as [EndDate]
from Base a
left join Base b
on a.RowNumber = b.RowNumber - 1
order by a.[StartDate]
If there's the possibility of the Rest number being duplicated further down, but that needing to be a separate item, then we need to be a bit cleverer in our initial select query.
with Base as
(
select b.AccNum, b.Rest, MIN(DATEADD(day, 1, b.Date)) as [StartDate], ROW_NUMBER() OVER (ORDER BY MIN(Date)) as [RowNumber]
from (
select *, ROW_NUMBER() OVER (PARTITION BY Rest ORDER BY Date) - ROW_NUMBER() OVER (ORDER BY Date) as [Order]
from Accounts a
-- where a.Rest <> 0
-- If we're still filtering out Rest 0 uncomment the above line
) b
group by [order], AccNum, Rest
)
select a.RowNumber, a.AccNum, a.Rest, a.StartDate, DATEADD(DAY, -1, b.StartDate) as [EndDate]
from Base a
left join Base b
on a.RowNumber = b.RowNumber - 1
order by a.[StartDate]
Results for both queries:
Account Number        REST  Start Date  End Date
45817840200000057948  2500  2013-01-01  2013-01-14
45817840200000057948  1181  2013-01-15  2013-01-31
45817840200000057948  2431  2013-02-01  2013-02-09
45817840200000057948  1563  2013-02-10  NULL

SQL query to identify paired items (challenging)

Assume there is a relation database with one table:
{datetime, tapeID, backupStatus}
2012-07-09 3:00, ID33, Start
2012-07-09 3:05, ID34, Start
2012-07-09 3:10, ID35, Start
2012-07-09 4:05, ID34, End
2012-07-09 4:10, ID33, Start
2012-07-09 5:05, ID33, End
2012-07-09 5:10, ID34, Start
2012-07-09 6:00, ID34, End
2012-07-10 4:00, ID35, Start
2012-07-11 5:00, ID35, End
tapeID = any of 100 different tapes each with their own unique ID.
backupStatus = one of two assignments either Start or End.
I want to write a SQL query that returns five fields
{startTime,endTime,tapeID,totalBackupDuration,numberOfRestarts}
2012-07-09 3:00,2012-07-09 5:05, ID33, 0days2hours5min,1
2012-07-09 3:05,2012-07-09 4:05, ID34, 0days1hours0min,0
2012-07-09 3:10,2012-07-10 5:00, ID35, 0days0hours50min,1
2012-07-09 5:10,2012-07-09 6:00, ID34, 0days0hours50min,0
I'm looking to pair the Start and End dates to identify when each backup set has truly completed. The caveat here is that the backup of a single backup set may be restarted, so there may be multiple Start times that are not considered complete until the following End event. A single backup set may also be backed up multiple times a day, and each run needs to be identified with its own separate start and end time.
Thank you for your assistance in advance!
B
Here's my version. If you add INSERT #T SELECT '2012-07-11 12:00', 'ID35', 'Start' to the table, you'll see unfinished backups in this query as well. OUTER APPLY is a natural way to solve the problem.
SELECT
Min(T.dt) StartTime,
Max(E.dt) EndTime,
T.tapeID,
Datediff(Minute, Min(T.dt), Max(E.dt)) TotalBackupDuration,
Count(*) - 1 NumberOfRestarts
FROM
#T T
OUTER APPLY (
SELECT TOP 1 E.dt
FROM #T E
WHERE
T.tapeID = E.tapeID
AND E.BackupStatus = 'End'
AND E.dt > T.dt
ORDER BY E.dt
) E
WHERE
T.BackupStatus = 'Start'
GROUP BY
T.tapeID,
IsNull(E.dt, T.dt)
One thing about CROSS APPLY is that if you're only returning one row and the outer references are all real tables, you have an equivalent in SQL 2000 by moving it into the WHERE clause of a derived table:
SELECT
Min(T.dt) StartTime,
Max(T.EndTime) EndTime,
T.tapeID,
Datediff(Minute, Min(T.dt), Max(T.EndTime)) TotalBackupDuration,
Count(*) - 1 NumberOfRestarts
FROM (
SELECT
T.*,
(SELECT TOP 1 E.dt
FROM #T E
WHERE
T.tapeID = E.tapeID
AND E.BackupStatus = 'End'
AND E.dt > T.dt
ORDER BY E.dt
) EndTime
FROM #T T
WHERE T.BackupStatus = 'Start'
) T
GROUP BY
T.tapeID,
IsNull(T.EndTime, T.dt)
For outer references that are not all real tables (you want a calculated value from another subquery's outer reference) you have to add nested derived tables to accomplish this.
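For example, here is a sketch of that nesting against the backup table above; the CutoffTime column is purely hypothetical, standing in for "a value computed per row" that the correlated subquery needs to reference:

```sql
-- Sketch only: [CutoffTime] is a made-up computed value, not from the question.
SELECT T.tapeID, T.dt, T.EndTime
FROM (
    SELECT D.*,
           (SELECT TOP 1 E.dt
            FROM #T E
            WHERE E.tapeID = D.tapeID
              AND E.BackupStatus = 'End'
              AND E.dt > D.CutoffTime   -- correlates on the computed value
            ORDER BY E.dt) AS EndTime
    FROM (
        -- the inner derived table computes the value the subquery needs
        SELECT T.*, DATEADD(MINUTE, 5, T.dt) AS CutoffTime
        FROM #T T
        WHERE T.BackupStatus = 'Start'
    ) D
) T;
```

Because SQL Server 2000 has no APPLY, materializing the computed value in the inner derived table is the only way to make it visible to the correlated subquery one level up.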
I finally bit the bullet and did some real testing. I used SPFiredrake's table population script to see the actual performance with a large amount of data. I did it programmatically so there are no typing errors. I took 10 executions each, and threw out the worst and best value for each column, then averaged the remaining 8 column values for that statistic.
The indexes were created after populating the table, with 100% fill factor. The Indexes column shows 1 when just the clustered index is present. It shows 2 when the nonclustered index on BackupStatus is added.
To exclude client network data transfer from the testing, I selected each query into variables like so:
DECLARE
@StartTime datetime,
@EndTime datetime,
@TapeID varchar(5),
@Duration int,
@Restarts int;
WITH A AS (
-- The query here
)
SELECT
@StartTime = StartTime,
@EndTime = EndTime,
@TapeID = TapeID,
@Duration = TotalBackupDuration,
@Restarts = NumberOfRestarts
FROM A;
I also trimmed the table column lengths to something more reasonable: tapeID varchar(5), BackupStatus varchar(5). In fact, the BackupStatus should be a bit column, and the tapeID should be an integer. But we'll stick with varchar for the time being.
Server Indexes UserName Reads Writes CPU Duration
--------- ------- ------------- ------ ------ ----- --------
x86 VM 1 ErikE 97219 0 599 325
x86 VM 1 Gordon Linoff 606 0 63980 54638
x86 VM 1 SPFiredrake 344927 260 23621 13105
x86 VM 2 ErikE 96388 0 579 324
x86 VM 2 Gordon Linoff 251443 0 22775 11830
x86 VM 2 SPFiredrake 197845 0 11602 5986
x64 Beefy 1 ErikE 96745 0 919 61
x64 Beefy 1 Gordon Linoff 320012 70 62372 13400
x64 Beefy 1 SPFiredrake 362545 288 20154 1686
x64 Beefy 2 ErikE 96545 0 685 164
x64 Beefy 2 Gordon Linoff 343952 72 65092 17391
x64 Beefy 2 SPFiredrake 198288 0 10477 924
Notes:
x86 VM: an almost idle virtual machine, Microsoft SQL Server 2008 (RTM) - 10.0.1600.22 (Intel X86)
x64 Beefy: a quite beefy and possibly very busy Microsoft SQL Server 2008 R2 (RTM) - 10.50.1765.0 (X64)
The second index helped all the queries, mine the least.
It is interesting that Gordon's initially low number of reads on one server was high on the second, yet with a lower duration: it clearly picked a different execution plan, probably because the beefier server had more resources to search the possible plan space faster. And the index raised the number of reads because the plan it enabled lowered the CPU cost by a ton, so it costed out cheaper in the optimizer.
What you need to do is assign the next end date to each start, then count the number of starts in between.
select tstart.dt as starttime, min(tend.dt) as endtime, tstart.tapeid
from (select *
from #t
where BackupStatus = 'Start'
) tstart join
(select *
from #t
where BackupStatus = 'End'
) tend
on tstart.tapeid = tend.tapeid and
tend.dt >= tstart.dt
group by tstart.dt, tstart.tapeid
This is close, but we have multiple rows for each end time (depending on the number of starts). To handle this, we need to group by the tapeid and the end time:
select min(a.starttime) as starttime, a.endtime, a.tapeid,
datediff(s, min(a.starttime), endtime), -- NOT CORRECT, DATABASE SPECIFIC
count(*) - 1 as NumRestarts
from (select tstart.dt as starttime, min(tend.dt) as endtime, tstart.tapeid
from (select *
from #t
where BackupStatus = 'Start'
) tstart join
(select *
from #t
where BackupStatus = 'End'
) tend
on tstart.tapeid = tend.tapeid and
tend.dt >= tstart.dt
group by tstart.dt, tstart.tapeid
) a
group by a.endtime, a.tapeid
I've written this version using SQL Server syntax. To create the test table, you can use:
create table #t (
dt datetime,
tapeID varchar(255),
BackupStatus varchar(255)
)
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 3:00', 'ID33', 'Start')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 3:05', 'ID34', 'Start')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 3:10', 'ID35', 'Start')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 4:05', 'ID34', 'End')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 4:10', 'ID33', 'Start')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 5:05', 'ID33', 'End')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 5:10', 'ID34', 'Start')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-09 6:00', 'ID34', 'End')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-10 4:00', 'ID35', 'Start')
insert into #t (dt, tapeID, BackupStatus) values ('2012-07-11 5:00', 'ID35', 'End')
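The T-SQL above targets SQL Server, but as a sanity check you can run essentially the same nested query against SQLite from Python, swapping the database-specific datediff for strftime('%s') arithmetic and zero-padding the timestamps so SQLite's date functions can parse them (a sketch, not part of the original answer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same sample data as the insert script above, with zero-padded times.
conn.executescript("""
CREATE TABLE t (dt TEXT, tapeID TEXT, BackupStatus TEXT);
INSERT INTO t VALUES
 ('2012-07-09 03:00','ID33','Start'),
 ('2012-07-09 03:05','ID34','Start'),
 ('2012-07-09 03:10','ID35','Start'),
 ('2012-07-09 04:05','ID34','End'),
 ('2012-07-09 04:10','ID33','Start'),
 ('2012-07-09 05:05','ID33','End'),
 ('2012-07-09 05:10','ID34','Start'),
 ('2012-07-09 06:00','ID34','End'),
 ('2012-07-10 04:00','ID35','Start'),
 ('2012-07-11 05:00','ID35','End');
""")

out = conn.execute("""
SELECT min(a.starttime) AS starttime, a.endtime, a.tapeid,
       CAST(strftime('%s', a.endtime)
            - strftime('%s', min(a.starttime)) AS INTEGER) AS seconds,
       count(*) - 1 AS NumRestarts
FROM (SELECT tstart.dt AS starttime, min(tend.dt) AS endtime,
             tstart.tapeID AS tapeid
      FROM (SELECT * FROM t WHERE BackupStatus = 'Start') tstart
      JOIN (SELECT * FROM t WHERE BackupStatus = 'End') tend
        ON tstart.tapeID = tend.tapeID AND tend.dt >= tstart.dt
      GROUP BY tstart.dt, tstart.tapeID) a
GROUP BY a.endtime, a.tapeid
ORDER BY a.tapeid, a.endtime
""").fetchall()
```

Each result row is (starttime, endtime, tapeid, duration in seconds, restarts); ID33 comes out as one run with one restart, ID34 as two separate runs, and ID35 as one long run spanning into 2012-07-11.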
Thought I'd take a stab at it. Tested out Gordon Linoff's solution, and it doesn't quite calculate correctly for tapeID 33 in his own example (matches to the next start, not the corresponding end).
My attempt assumes you're using SQL server 2005+, as it utilizes CROSS/OUTER APPLY. If you need it for server 2000 I could probably swing it, but this seemed like the cleanest solution to me (as you're starting with all end elements and matching the first start elements to get the result). I'll annotate as well so you can understand what I'm doing.
SELECT
startTime, endT.dt endTime, endT.tapeID, DATEDIFF(s, startTime, endT.dt), restarts
FROM
#t endT -- Main source, getting all 'End' records so we can match.
OUTER APPLY ( -- Match possible previous 'End' records for the tapeID
SELECT TOP 1 dt
FROM #t
WHERE dt < endT.dt AND tapeID = endT.tapeID
AND BackupStatus = 'End') g
CROSS APPLY (SELECT ISNULL(g.dt, CAST(0 AS DATETIME)) dt) t
CROSS APPLY (
-- Match 'Start' records between our 'End' record
-- and our possible previous 'End' record.
SELECT MIN(dt) startTime,
COUNT(*) - 1 restarts -- Restarts, so -1 for the first 'Start'
FROM #t
WHERE tapeID = endT.tapeID AND BackupStatus = 'Start'
-- This is where our previous possible 'End' record is considered
AND dt > t.dt AND dt < endT.dt) starts
WHERE
endT.BackupStatus = 'End'
Edit: Test data generation script found at this link.
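The end-anchored matching above can likewise be sketched in plain Python: anchor on each End row, find the previous End for the same tape (or a floor value when there is none), and collect the Starts strictly between the two boundaries (toy data and names are illustrative only):

```python
# Toy rows (dt, tapeID, status); zero-padded timestamps compare as strings.
rows = [
    ("2012-07-09 03:00", "ID33", "Start"),
    ("2012-07-09 03:05", "ID34", "Start"),
    ("2012-07-09 04:05", "ID34", "End"),
    ("2012-07-09 04:10", "ID33", "Start"),   # restart of ID33
    ("2012-07-09 05:05", "ID33", "End"),
    ("2012-07-09 05:10", "ID34", "Start"),
    ("2012-07-09 06:00", "ID34", "End"),
]

def pair_by_end(rows):
    """Anchor on each 'End' row (the main #t scan), look up the previous
    'End' for the same tape via the OUTER APPLY, and count the 'Start'
    rows that fall strictly between the two boundaries."""
    ends = sorted((dt, t) for dt, t, s in rows if s == "End")
    out = []
    for end_dt, tape in ends:
        # Previous End for this tape, or "" which sorts before any time
        # (the CAST(0 AS DATETIME) fallback in the SQL).
        lower = max((d for d, t in ends if t == tape and d < end_dt),
                    default="")
        starts = [d for d, t, s in rows
                  if t == tape and s == "Start" and lower < d < end_dt]
        out.append((min(starts), end_dt, tape, len(starts) - 1))
    return out
```

Because every row in the result comes from an End anchor, unfinished backups simply never appear, which matches the WHERE endT.BackupStatus = 'End' filter.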
So decided to run some data against the three methods, and found that ErikE's solution is fastest, mine is a VERY close second, and Gordon's is just inefficient for any sizable set (even when working with 1000 records, it started showing slowness). For smaller sets (at about 5k records), my method wins over Erik's, but not by much. Honestly, I like my method as it doesn't require any additional aggregate functions to get the data, but ErikE's wins in the efficiency/speed battle.
Edit 2: For 55k records in the table (and 12k matching start/end pairs), Erik's takes ~0.307s and mine takes ~0.157s (averaging over 50 attempts). I was a little surprised about this, because I would've assumed that individual runs would've translated to the overall, but I guess the index cache is being better utilized by my query so subsequent hits are less expensive. Looking at the execution plans, ErikE's only has 1 branch off the main path, so he's ultimately working with a larger set for most of the query. I have 3 branches that combine closer to the output, so I'm churning on less data at any given moment and combine right at the end.
Make it very simple: build one subquery for the Start events and another for the End events, use the RANK function to number the rows in each set in time order, then LEFT JOIN the two subqueries on ID and rank:
SELECT *
FROM
(
SELECT dt
, ID
, status
, rank () over (PARTITION by ID ORDER BY dt) as rnk
FROM INT_backup
WHERE status = 'start'
) START_SET
LEFT JOIN
(
SELECT dt
, ID
, status
, rank () over (PARTITION by ID ORDER BY dt) as rnk
FROM INT_backup
where status = 'END'
) END_SET
ON Start_Set.ID = End_SET.ID
AND Start_Set.Rnk = End_Set.rnk
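This rank-and-join idea can be sketched in Python as follows. One caveat worth stating: pairing equal ranks only works when Starts and Ends strictly alternate; with restarts (duplicate Starts) the ranks drift apart, so duplicates must be collapsed first. The toy rows here are illustrative, not from the answer:

```python
from collections import defaultdict

# Toy rows (dt, ID, status); ID33 has a Start with no matching End.
rows = [
    ("2012-07-09 03:00", "ID33", "Start"),   # unfinished backup
    ("2012-07-09 03:05", "ID34", "Start"),
    ("2012-07-09 04:05", "ID34", "End"),
    ("2012-07-09 05:10", "ID34", "Start"),
    ("2012-07-09 06:00", "ID34", "End"),
]

def pair_by_rank(rows):
    """Number the Starts and Ends per ID in time order (the RANK() OVER
    (PARTITION BY ID ORDER BY dt) step), then join equal ranks, mirroring
    the LEFT JOIN of the two ranked subqueries."""
    starts, ends = defaultdict(list), defaultdict(list)
    for dt, ident, status in sorted(rows):
        (starts if status == "Start" else ends)[ident].append(dt)
    pairs = []
    for ident in sorted(starts):
        for rank, s in enumerate(starts[ident]):
            end_list = ends.get(ident, [])
            # No End at this rank => LEFT JOIN miss, end stays None.
            e = end_list[rank] if rank < len(end_list) else None
            pairs.append((ident, rank + 1, s, e))
    return pairs
```

The LEFT JOIN (rather than an inner join) is what lets a still-running backup show up with a NULL end time instead of disappearing from the result.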