Determine the end time continuously based on the average duration (SQL)

I'm trying to predict the start and end times of an order's processes in SQL. I have already determined the average duration of each process from historical data.
The processes run in several parallel rows (RNr), and the rows are independent of each other. Each row can have 1-30 processes (PNr) with different durations. The duration of a process may vary and is known only as an average.
As soon as one process is completed, the next one starts automatically.
So PNr 1 finish = PNr 2 start.
The start time of the first process in each row is known at the beginning and is the same for every row.
Once some processes have completed, their actual times are known and should be used to make the predictions for the upcoming processes more accurate.
How can I predict when a process will start or finish?
I used a large subquery to get this table:
RNr PNr Duration_avg_h Start Finish
1 1 1 2019-06-06 16:32:11 2019-06-06 16:33:14
1 2 262 2019-06-06 16:33:14 NULL
1 3 51 NULL NULL
1 4 504 NULL NULL
1 5 29 NULL NULL
2 1 1 2019-06-06 16:32:11 NULL
2 2 124 NULL NULL
2 3 45 NULL NULL
2 4 89 NULL NULL
2 5 19 NULL NULL
2 6 1565 NULL NULL
2 7 24 NULL NULL
Now I want to find the values for the prediction.
SELECT
RNr,
PNr,
Duration_avg_h,
Start,
Finish,
Predicted_Start = CASE
WHEN Start IS NULL
THEN DATEADD(HH,LAG(Duration_avg_h, 1,NULL) OVER (ORDER BY RNr,PNr), LAG(Start, 1,NULL) OVER (ORDER BY RNr,PNr))
ELSE Start END,
Predicted_Finish = CASE
WHEN Finish IS NULL
THEN DATEADD(HH,Duration_avg_h,Start)
ELSE Finish END,
SUM(Duration_avg_h) over (PARTITION BY RNr ORDER BY RNr, PNr ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Duration_row_h
FROM (...)
ORDER BY RNr, PNr
I tried LAG(), but that only gives me values from the directly preceding row. I also got nowhere with ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
RNr PNr Duration_avg_h Start Finish Predicted_Start Predicted_Finish Duration_row_h
1 1 1 2019-06-06 16:32:11 2019-06-06 16:33:14 2019-06-06 16:32:11 2019-06-06 16:33:14 1
1 2 262 2019-06-06 16:33:14 NULL 2019-06-06 16:33:14 2019-06-17 14:33:14 263
1 3 51 NULL NULL 2019-06-17 14:33:14 NULL 314
1 4 504 NULL NULL NULL NULL 818
1 5 29 NULL NULL NULL NULL 847
2 1 1 2019-06-06 16:32:11 NULL 2019-06-06 16:32:11 2019-06-06 17:32:11 1
2 2 124 NULL NULL 2019-06-06 17:32:11 NULL 125
2 3 45 NULL NULL NULL NULL 170
2 4 89 NULL NULL NULL NULL 259
2 5 19 NULL NULL NULL NULL 278
So can somebody help me to fill the columns Predicted_Start and Predicted_Finish ?
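The cascade described above (each predicted finish becomes the next predicted start, and known actual times override the averages) can be sketched procedurally. This is an illustrative Python sketch with a hypothetical function name, not part of the SQL solution:

```python
from datetime import datetime, timedelta

def predict(rows, anchor_start):
    """Cascade predicted start/finish times through one row (RNr) of processes.

    rows: list of (duration_avg_h, start, finish) tuples ordered by PNr;
    start/finish are datetimes or None when not yet known.
    anchor_start: the known start time of the first process.
    """
    out = []
    current = anchor_start
    for dur, start, finish in rows:
        # a known actual time always wins over the prediction
        pred_start = start if start is not None else current
        pred_finish = finish if finish is not None else pred_start + timedelta(hours=dur)
        out.append((pred_start, pred_finish))
        current = pred_finish  # the next process starts when this one ends
    return out
```

Running this on the first two processes of RNr 2 above reproduces the expected cascade: PNr 1 is predicted to finish one hour after its start, and PNr 2 inherits that time as its predicted start.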

LAG only works when all of your rows already have values. For this use case you need to cascade results from one row to the next. One way of doing that is with a self join that builds running totals:
--Sample Data
DECLARE @dataset TABLE
(
RNr INT
,PNr INT
,Duration_avg_h INT
,Start DATETIME
,Finish DATETIME
)
INSERT INTO @dataset
(
RNr
,PNr
,Duration_avg_h
,Start
,Finish
)
VALUES
(1, 1, 1, '2019-06-06 16:32:11',NULL)
,(1, 2, 262, NULL,NULL)
,(1, 3, 51, NULL,NULL)
,(1, 4, 504, NULL,NULL)
,(1, 5, 29, NULL,NULL)
,(2, 1, 1, '2019-06-06 16:32:11', NULL)
,(2, 2, 124, NULL,NULL)
,(2, 3, 45, NULL,NULL)
,(2, 4, 89, NULL,NULL)
,(2, 5, 19, NULL,NULL)
,(2, 6, 1565, NULL,NULL)
,(2, 7, 24, NULL,NULL)
SELECT
d.RNr,
d.PNr,
d.Duration_avg_h,
d.Start,
d.Finish,
--SUM() gives us the total time up to and including this step;
--take off the current step and you get the total time of all the previous steps.
--That gives us the start time, i.e. when the previous step ended.
SUM(running_total.Duration_avg_h) - d.Duration_avg_h AS running_total_time,
--MIN() gives us the earliest known start time per row (RNr).
MIN(running_total.Start) AS min_start,
ISNULL(
d.Start
--start = the row's first start plus the duration of all previous steps
,DATEADD(HH,SUM(running_total.Duration_avg_h) - d.Duration_avg_h,MIN(running_total.Start) )
) AS Predicted_Start,
ISNULL(
d.Finish
--finish = the row's first start plus the duration of all steps up to and including this one
,DATEADD(HH,SUM(running_total.Duration_avg_h),MIN(running_total.Start) )
) AS Predicted_Finish
FROM @dataset AS d
LEFT JOIN @dataset AS running_total
ON d.RNr = running_total.RNr
AND
--the running total for all steps up to and including the current one
running_total.PNr <= d.PNr
GROUP BY
d.RNr,
d.PNr,
d.Duration_avg_h,
d.Start,
d.Finish
ORDER BY
RNr,
PNr
This code will not work once you have actual finish times unless you update the Duration_avg_h to be the actual hours taken.

Jonathan, thanks for your help.
Your idea of using MIN(running_total.Start) AS min_start brought me to the idea of using MAX(d.Start) OVER (PARTITION BY RNr), which resulted in the following query:
--Sample Data
DECLARE @dataset TABLE
(
RNr INT
,PNr INT
,Duration_avg_h INT
,Start DATETIME
,Finish DATETIME
)
INSERT INTO @dataset
(
RNr
,PNr
,Duration_avg_h
,Start
,Finish
)
VALUES
(1, 1, 1, '2019-06-06 16:32:11','2019-06-06 16:33:14')
,(1, 2, 262, '2019-06-06 16:33:14','2019-08-22 17:30:00')
,(1, 3, 51, '2019-08-22 17:30:00',NULL)
,(1, 4, 504, NULL,NULL)
,(1, 5, 29, NULL,NULL)
,(2, 1, 1, '2019-06-06 16:32:11', NULL)
,(2, 2, 124, NULL,NULL)
,(2, 3, 45, NULL,NULL)
,(2, 4, 89, NULL,NULL)
,(2, 5, 19, NULL,NULL)
,(2, 6, 1565, NULL,NULL)
,(2, 7, 24, NULL,NULL)
SELECT RNr,
PNr,
Duration_avg_h,
Start,
Finish,
--Start_max,
--Finish_bit,
--Duration_row_h,
CASE WHEN Start IS NOT NULL THEN Start ELSE DATEADD(HH,(Duration_row_h - MAX(Duration_row_h*Finish_bit) OVER (PARTITION BY RNr) - Duration_avg_h), Start_max) END AS Predicted_Start,
CASE WHEN Finish IS NOT NULL THEN Finish ELSE DATEADD(HH,(Duration_row_h - MAX(Duration_row_h*Finish_bit) OVER (PARTITION BY RNr)), Start_max) END AS Predicted_Finish
FROM ( SELECT
RNr,
PNr,
Duration_avg_h,
--Convert to a short DATETIME format
CONVERT(DATETIME2(0),Start) AS Start,
CONVERT(DATETIME2(0),Finish) AS Finish,
--Get the latest known start time for each row
Start_max = MAX(CONVERT(DATETIME2(0),d.Start)) OVER (PARTITION BY RNr),
--1 if the process is finished, otherwise 0
Finish_bit = (CASE WHEN d.Finish IS NULL THEN 0 ELSE 1 END),
--running total of the durations of all processes in the row
SUM(Duration_avg_h) OVER (PARTITION BY RNr ORDER BY PNr ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Duration_row_h
FROM @dataset AS d
) AS e
ORDER BY
RNr,
PNr
This query takes changed start and finish times into account and uses them to calculate the prediction for the upcoming processes.
RNr PNr Duration_avg_h Start Finish Predicted_Start Predicted_Finish
1 1 1 2019-06-06 16:32:11 2019-06-06 16:33:14 2019-06-06 16:32:11 2019-06-06 16:33:14
1 2 262 2019-06-06 16:33:14 2019-08-22 17:30:00 2019-06-06 16:33:14 2019-08-22 17:30:00
1 3 51 2019-08-22 17:30:00 NULL 2019-08-22 17:30:00 2019-08-24 20:30:00
1 4 504 NULL NULL 2019-08-24 20:30:00 2019-09-14 20:30:00
1 5 29 NULL NULL 2019-09-14 20:30:00 2019-09-16 01:30:00
2 1 1 2019-06-06 16:32:11 NULL 2019-06-06 16:32:11 2019-06-06 17:32:11
2 2 124 NULL NULL 2019-06-06 17:32:11 2019-06-11 21:32:11
2 3 45 NULL NULL 2019-06-11 21:32:11 2019-06-13 18:32:11
2 4 89 NULL NULL 2019-06-13 18:32:11 2019-06-17 11:32:11
2 5 19 NULL NULL 2019-06-17 11:32:11 2019-06-18 06:32:11
2 6 1565 NULL NULL 2019-06-18 06:32:11 2019-08-22 11:32:11
2 7 24 NULL NULL 2019-08-22 11:32:11 2019-08-23 11:32:11
I still find this approach complicated. Does anyone know a simpler query?
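To see why the query above works, the arithmetic it performs per RNr (anchor at the latest known Start, subtract the running total consumed up to the last finished step) can be replicated in a small Python sketch. This is illustrative only, with a hypothetical function name; the assertions below reproduce the predicted finishes for RNr 1 from the result table:

```python
from datetime import timedelta

def predict_finish(durations, starts, finishes):
    """Replicates the query's arithmetic for one RNr:
    anchor = latest known Start; offset = running total at the last finished step."""
    running, run_tot = 0, []
    for d in durations:
        running += d
        run_tot.append(running)  # Duration_row_h
    start_max = max(s for s in starts if s is not None)  # Start_max
    # MAX(Duration_row_h * Finish_bit): cumulative hours consumed by finished steps
    offset = max((rt for rt, f in zip(run_tot, finishes) if f is not None), default=0)
    return [f if f is not None else start_max + timedelta(hours=rt - offset)
            for rt, f in zip(run_tot, finishes)]
```

With the RNr 1 sample data this yields 2019-08-24 20:30 for PNr 3 and 2019-09-14 20:30 for PNr 4, matching the result table above.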

Related

How to prevent SQL query from returning overlapping groups?

I'm trying to generate a report that displays the number of failed login attempts that happen within 30 minutes of each other. The data for this report is in a SQL database.
This is the query I'm using to pull the data out.
SELECT
A.LoginID,
A.LogDatetime AS firstAttempt,
MAX(B.LogDatetime) AS lastAttempt,
COUNT(B.LoginID) + 1 AS attempts
FROM
UserLoginHistory A
JOIN UserLoginHistory B ON A.LoginID = B.LoginID
WHERE
A.SuccessfulFlag = 0
AND B.SuccessfulFlag = 0
AND A.LogDatetime < B.LogDatetime
AND B.LogDatetime <= DATEADD(minute, 30, A.LogDatetime)
GROUP BY
A.LoginID, A.LogDatetime
ORDER BY
A.LoginID, A.LogDatetime
This returns results that look something like this:
Row LoginID firstAttempt     lastAttempt      attempts
1   1       2022-05-01 00:00 2022-05-01 00:29 6
2   1       2022-05-01 00:06 2022-05-01 00:33 6
3   1       2022-05-01 00:13 2022-05-01 00:39 6
4   1       2022-05-01 00:15 2022-05-01 00:45 6
5   1       2022-05-01 00:20 2022-05-01 00:50 6
6   1       2022-05-01 00:29 2022-05-01 00:55 6
7   1       2022-05-01 00:33 2022-05-01 01:01 6
8   1       2022-05-01 00:39 2022-05-01 01:04 6
... ...     ...              ...              ...
However, you can see that the rows overlap a lot. For example, row 1 shows attempts from 00:00 to 00:29, which overlaps with row 2 showing attempts from 00:06 to 00:33. Row 2 ought to be like row 7 (00:33 - 01:01), since that row's firstAttempt is the next one after row 1's lastAttempt.
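The de-overlapped report described here amounts to a greedy pass: open a window at the first attempt, count everything within 30 minutes of it, then open the next window at the first attempt after the window ends. A rough Python sketch of that logic (illustrative only; in SQL this is what a recursive CTE or loop would implement):

```python
from datetime import timedelta

def group_attempts(times, window=timedelta(minutes=30)):
    """Greedy, non-overlapping windows: each window starts at the
    first attempt after the previous window's first attempt + 30 min."""
    times = sorted(times)
    groups, i = [], 0
    while i < len(times):
        first = times[i]
        j = i
        # extend the window while the next attempt is within 30 minutes of the first
        while j + 1 < len(times) and times[j + 1] - first <= window:
            j += 1
        groups.append((first, times[j], j - i + 1))  # (firstAttempt, lastAttempt, attempts)
        i = j + 1  # next window starts after this one
    return groups
```

On the attempts 00:00, 00:06, 00:13, 00:29, 00:33, 00:39 this produces two non-overlapping groups instead of one row per attempt.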
You might need to use recursive CTEs, or insert your data into a temp table and loop over it with updates to remove the overlaps.
Do you need fixed starting times? As a quick workaround you could round the DATETIME down to 30-minute intervals. That ensures the windows don't overlap, but it only groups the attempts into fixed 30-minute buckets:
SELECT
A.LoginID,
--integer division truncates, so this rounds down to the start of the 30-minute bucket
DATEADD(MINUTE, (DATEDIFF(MINUTE, '2022-01-01', A.LogDatetime) / 30) * 30, '2022-01-01') AS LoginInterval,
MIN(A.LogDatetime) AS firstAttempt,
MAX(A.LogDatetime) AS lastAttempt,
COUNT(*) AS attempts
FROM
UserLoginHistory A
WHERE
A.SuccessfulFlag = 0
GROUP BY
A.LoginID, DATEADD(MINUTE, (DATEDIFF(MINUTE, '2022-01-01', A.LogDatetime) / 30) * 30, '2022-01-01')
ORDER BY
A.LoginID, LoginInterval

Calculate the duration from start date and end date in SQL

I am using SQL Server.
select DISTINCT caseNumber, dateStarted,dateStopped from patientView where dateStarted !='' and dateStopped != '';
We get the following output,
CaseNumber dateStarted dateStopped
1          2022-01-01  2022-01-04
1          2022-01-05  2022-01-19
2          2022-01-03  2022-01-10
4          2022-01-05  2022-01-11
4          2022-01-13  2022-01-14
4          2022-01-21  2022-01-23
5          2022-01-15  2022-01-16
5          2022-01-17  2022-01-24
5          2022-01-24  2022-01-26
8          2022-01-17  2022-01-20
8          2022-01-21  2022-01-28
11         2022-01-18  2022-01-25
11         2022-01-26  2022-01-27
I want to calculate the duration for each caseNumber. For example, caseNumber 1 has 2 rows, and hence its total duration would be 18 days.
I would suggest using GROUP BY to collapse the repeated case numbers, taking the MIN of the start dates and the MAX of the stop dates. You can do something like:
SELECT caseNumber, max(dateStopped)-min(dateStarted)
from patientView
where dateStarted != '' and dateStopped != ''
GROUP BY caseNumber;
It is not clear whether you want the sum of the durations for individual patientView records or the duration from the earliest start to the latest end. It is also not clear whether the stop date is inclusive or exclusive. Is 2022-01-01 to 2022-01-04 considered 3 days or 4 days?
Here is code that shows 4 different calculations:
DECLARE @patientView TABLE (CaseNumber INT, dateStarted DATETIME, dateStopped DATETIME)
INSERT @patientView
VALUES
(1, '2022-01-01', '2022-01-04'),
(1, '2022-01-05', '2022-01-19'),
(2, '2022-01-03', '2022-01-10'),
(4, '2022-01-05', '2022-01-11'),
(4, '2022-01-13', '2022-01-14'),
(4, '2022-01-21', '2022-01-23'),
(5, '2022-01-15', '2022-01-16'),
(5, '2022-01-17', '2022-01-24'),
(5, '2022-01-24', '2022-01-26'),
(8, '2022-01-17', '2022-01-20'),
(8, '2022-01-21', '2022-01-28'),
(11, '2022-01-18', '2022-01-25'),
(11, '2022-01-26', '2022-01-27')
SELECT
CaseNumber,
SumDaysExclusive = SUM(DATEDIFF(day, dateStarted, dateStopped)),
SumDaysInclusive = SUM(DATEDIFF(day, dateStarted, dateStopped) + 1),
RangeDaysExclusive = DATEDIFF(day, MIN(dateStarted), MAX(dateStopped)),
RangeDaysInclusive = DATEDIFF(day, MIN(dateStarted), MAX(dateStopped)) + 1
FROM @patientView
GROUP BY CaseNumber
ORDER BY CaseNumber
Results:
CaseNumber SumDaysExclusive SumDaysInclusive RangeDaysExclusive RangeDaysInclusive
1          17               19               18                 19
2          7                8                7                  8
4          9                12               18                 19
5          10               13               11                 12
8          10               12               11                 12
11         8                10               9                  10
db<>fiddle
The test data above uses DATETIME types. (DATE would also work.) If you have dates stored as character data (not a good practice), you may need to add CAST or CONVERT statements.

Postgresql compare two rows recursively

I want to write a query that tracks the downgraded versions for each id.
Here is the table:
id version ts
1 3 2021-09-01 10:47:50+00
1 5 2021-09-05 10:47:50+00
1 1 2021-09-11 10:47:50+00
2 2 2021-09-11 10:47:50+00
2 6 2021-09-15 10:47:50+00
3 2 2021-09-01 10:47:50+00
3 4 2021-09-05 10:47:50+00
3 6 2021-09-15 10:47:50+00
3 1 2021-09-16 10:47:50+00
I want to print out something like this:
id:1 downgraded their version from 5 to 1 at 2021-09-11 10:47:50+00
id:3 downgraded their version from 6 to 1 at 2021-09-16 10:47:50+00
So when I run the query the output should be:
id version downgraded_to ts
1 5 1 2021-09-11 10:47:50+00
3 6 1 2021-09-16 10:47:50+00
but I'm completely lost here.
Does it make sense to handle this situation in Postgresql? Is it possible to do it?
You may use the LEAD analytic function to get the next version and compare it with the current version, assuming that the version is of a numeric type.
with next_vers as (
select t.*, lead(version) over(partition by id order by ts asc) as next_version
from(values
(1, 3, timestamp '2021-09-01 10:47:50'),
(1, 5, timestamp '2021-09-05 10:47:50'),
(1, 1, timestamp '2021-09-11 10:47:50'),
(2, 2, timestamp '2021-09-11 10:47:50'),
(2, 6, timestamp '2021-09-15 10:47:50'),
(3, 2, timestamp '2021-09-01 10:47:50'),
(3, 4, timestamp '2021-09-05 10:47:50'),
(3, 6, timestamp '2021-09-15 10:47:50'),
(3, 1, timestamp '2021-09-16 10:47:50')
) as t(id, version, ts)
)
select *
from next_vers
where version > next_version
id | version | ts | next_version
-: | ------: | :------------------ | -----------:
1 | 5 | 2021-09-05 10:47:50 | 1
3 | 6 | 2021-09-15 10:47:50 | 1
db<>fiddle here
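For comparison, the pairwise check that LEAD performs can be written out procedurally. Note that this sketch reports the timestamp of the row where the downgrade happened, as in the desired output above, whereas the SQL result shows the timestamp of the earlier row. Illustrative Python with a hypothetical function name:

```python
def downgrades(events):
    """events: list of (id, version, ts) sorted by id, then ts.
    Returns (id, version, downgraded_to, ts) whenever the next version is lower."""
    out = []
    for (i1, v1, _), (i2, v2, t2) in zip(events, events[1:]):
        if i1 == i2 and v2 < v1:  # same id and the version went down
            out.append((i1, v1, v2, t2))  # ts of the downgrade itself
    return out
```

Run on the sample data, this yields exactly the two downgrade rows for id 1 and id 3.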

Sum and Count by month, shown with last day of that month

I have a transaction table like this:
Trandate channelID branch amount
--------- --------- ------ ------
01/05/2019 1 2 2000
11/05/2019 1 2 2200
09/03/2020 1 2 5600
15/03/2020 1 2 600
12/10/2019 2 10 12000
12/10/2019 2 10 12000
15/11/2019 4 7 4400
15/02/2020 4 2 2500
I need to sum amount and count transactions by year and month. I tried this:
select DISTINCT
DATEPART(YEAR,a.TranDate) as [YearT],
DATEPART(MONTH,a.TranDate) as [monthT],
count(*) as [countoftran],
sum(a.Amount) as [amount],
a.Name as [branch],
a.ChannelName as [channelID]
from transactions as a
where a.TranDate>'20181231'
group by a.Name, a.ChannelName, DATEPART(YEAR,a.TranDate), DATEPART(MONTH,a.TranDate)
order by a.Name, YearT, MonthT
It works like a charm. However, I will use this data in Power BI, and I cannot show these results in a line chart because the year and month end up in separate columns.
I tried changing the format in SQL to 'YYYYMM', but Power BI doesn't recognise that column as a date.
So, in the end, I need a result table that looks like this:
YearT      channelID branch Tamount TranT
---------  --------- ------ ------- -----
31/05/2019 1         2      4200    2
31/03/2020 1         2      6200    2
31/10/2019 2         10     24000   2
30/11/2019 4         7      4400    1
29/02/2020 4         2      2500    1
I have tried several little changes with no result.
Help is much appreciated.
You may try the following statement:
SELECT
EOMONTH(DATEFROMPARTS(YEAR(Trandate), MONTH(Trandate), 1)) AS YearT,
branch, channelID,
SUM(amount) AS TAmount,
COUNT(*) AS TranT
FROM (VALUES
('20190501', 1, 2, 2000),
('20190511', 1, 2, 2200),
('20200309', 1, 2, 5600),
('20200315', 1, 2, 600),
('20191012', 2, 10, 12000),
('20191012', 2, 10, 12000),
('20191115', 4, 7, 4400),
('20200215', 4, 2, 2500)
) v (Trandate, channelID, branch, amount)
GROUP BY DATEFROMPARTS(YEAR(Trandate), MONTH(Trandate), 1), branch, channelID
ORDER BY DATEFROMPARTS(YEAR(Trandate), MONTH(Trandate), 1)
Result:
YearT branch channelID TAmount TranT
2019-05-31 2 1 4200 2
2019-10-31 10 2 24000 2
2019-11-30 7 4 4400 1
2020-02-29 2 4 2500 1
2020-03-31 2 1 6200 2
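The grouping key that EOMONTH produces (the last calendar day of each transaction's month) can be mimicked outside SQL as well. A small illustrative Python sketch of the same aggregation, with hypothetical names:

```python
import calendar
from collections import defaultdict
from datetime import date

def month_end_summary(rows):
    """rows: iterable of (trandate: date, channelID, branch, amount).
    Groups by (month-end date, channelID, branch) -> [sum, count]."""
    agg = defaultdict(lambda: [0, 0])
    for d, ch, br, amt in rows:
        # last day of the month, e.g. 29 for February 2020 (leap year)
        eom = date(d.year, d.month, calendar.monthrange(d.year, d.month)[1])
        agg[(eom, ch, br)][0] += amt
        agg[(eom, ch, br)][1] += 1
    return dict(agg)
```

The May 2019 transactions for channel 1 / branch 2 sum to 4200 across 2 rows, matching the result table above.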

Set based solution to generate batch number based on proximity and type of record in SQL server

I have a table that holds transactions. Each transaction is represented by a row. The row has a field TranCode indicating the type of transaction, and the date of the transaction is also recorded. Following are the table definition and the corresponding data.
create table t
(
id int identity(1,1),
TranDate datetime,
TranCode int,
BatchNo int
)
GO
insert into t (TranDate, TranCode)
VALUES(GETDATE(), 1),
(DATEADD(MINUTE, 1, GETDATE()), 1),
(DATEADD(MINUTE, 2, GETDATE()), 1),
(DATEADD(MINUTE, 3, GETDATE()), 1),
(DATEADD(MINUTE, 4, GETDATE()), 2),
(DATEADD(MINUTE, 5, GETDATE()), 2),
(DATEADD(MINUTE, 6, GETDATE()), 2),
(DATEADD(MINUTE, 7, GETDATE()), 2),
(DATEADD(MINUTE, 8, GETDATE()), 2),
(DATEADD(MINUTE, 9, GETDATE()), 1),
(DATEADD(MINUTE, 10, GETDATE()), 1),
(DATEADD(MINUTE, 11, GETDATE()), 1),
(DATEADD(MINUTE, 12, GETDATE()), 2),
(DATEADD(MINUTE, 13, GETDATE()), 2),
(DATEADD(MINUTE, 14, GETDATE()), 1),
(DATEADD(MINUTE, 15, GETDATE()), 1),
(DATEADD(MINUTE, 16, GETDATE()), 1),
(DATEADD(MINUTE, 17, GETDATE()), 2),
(DATEADD(MINUTE, 18, GETDATE()), 2),
(DATEADD(MINUTE, 19, GETDATE()), 1),
(DATEADD(MINUTE, 20, GETDATE()), 1),
(DATEADD(MINUTE, 21, GETDATE()), 1),
(DATEADD(MINUTE, 21, GETDATE()), 1)
After running the above code, the table contains the following data. The values in the TranDate field will be different for you, but that is fine.
id TranDate TranCode BatchNo
----------- ----------------------- ----------- -----------
1 2015-02-12 20:40:47.547 1 NULL
2 2015-02-12 20:41:47.547 1 NULL
3 2015-02-12 20:42:47.547 1 NULL
4 2015-02-12 20:43:47.547 1 NULL
5 2015-02-12 20:44:47.547 2 NULL
6 2015-02-12 20:45:47.547 2 NULL
7 2015-02-12 20:46:47.547 2 NULL
8 2015-02-12 20:47:47.547 2 NULL
9 2015-02-12 20:48:47.547 2 NULL
10 2015-02-12 20:49:47.547 1 NULL
11 2015-02-12 20:50:47.547 1 NULL
12 2015-02-12 20:51:47.547 1 NULL
13 2015-02-12 20:52:47.547 2 NULL
14 2015-02-12 20:53:47.547 2 NULL
15 2015-02-12 20:54:47.547 1 NULL
16 2015-02-12 20:55:47.547 1 NULL
17 2015-02-12 20:56:47.547 1 NULL
18 2015-02-12 20:57:47.547 2 NULL
19 2015-02-12 20:58:47.547 2 NULL
20 2015-02-12 20:59:47.547 1 NULL
21 2015-02-12 21:00:47.547 1 NULL
22 2015-02-12 21:01:47.547 1 NULL
23 2015-02-12 21:01:47.547 1 NULL
I want a set-based solution, not a cursor or row-based solution, to update the batch number for the rows. For example, the first 4 records should get a BatchNo of 1, as they have TranCode 1; the next 5 (having TranCode 2 and being close to each other in time) should have BatchNo 2; the next 3 should have 3; and so on. Following is the expected output.
id TranDate TranCode BatchNo
----------- ----------------------- ----------- -----------
1 2015-02-12 20:43:59.123 1 1
2 2015-02-12 20:44:59.123 1 1
3 2015-02-12 20:45:59.123 1 1
4 2015-02-12 20:46:59.123 1 1
5 2015-02-12 20:47:59.123 2 2
6 2015-02-12 20:48:59.123 2 2
7 2015-02-12 20:49:59.123 2 2
8 2015-02-12 20:50:59.123 2 2
9 2015-02-12 20:51:59.123 2 2
10 2015-02-12 20:52:59.123 1 3
11 2015-02-12 20:53:59.123 1 3
12 2015-02-12 20:54:59.123 1 3
13 2015-02-12 20:55:59.123 2 4
14 2015-02-12 20:56:59.123 2 4
15 2015-02-12 20:57:59.123 1 5
16 2015-02-12 20:58:59.123 1 5
17 2015-02-12 20:59:59.123 1 5
18 2015-02-12 21:00:59.123 2 6
19 2015-02-12 21:01:59.123 2 6
20 2015-02-12 21:02:59.123 1 7
21 2015-02-12 21:03:59.123 1 7
22 2015-02-12 21:04:59.123 1 7
23 2015-02-12 21:04:59.123 1 7
I have tried very hard with ROW_NUMBER, RANK, and DENSE_RANK, and none of them came to my rescue. I am looking for a set-based solution because I want really good performance.
Your help is very much appreciated.
You could do this using a recursive CTE. I also used the LEAD function to look at the next row and determine whether the TranCode changed.
Query:
WITH A
AS (
SELECT id
,trancode
,trandate
,lead(trancode) OVER (ORDER BY id,trancode) leadcode
FROM t
)
,cte
AS (
SELECT id
,trandate
,trancode
,lead(trancode) OVER (ORDER BY id,trancode) leadcode
,1 batchnum
,1 nextbatchnum
,id + 1 nxtId
FROM t
WHERE id = 1
UNION ALL
SELECT A.id
,A.trandate
,A.trancode
,A.leadcode
,nextbatchnum
,CASE
WHEN A.trancode <> A.leadcode THEN nextbatchnum + 1 ELSE nextbatchnum END nextbatchnum
,A.id + 1 nxtid
FROM A
INNER JOIN CTE B ON A.id = B.nxtId
)
SELECT id
,trandate
,trancode
,batchnum
FROM CTE
OPTION (MAXRECURSION 100)
Result:
id trandate trancode batchnum
1 2015-02-12 10:19:06.717 1 1
2 2015-02-12 10:20:06.717 1 1
3 2015-02-12 10:21:06.717 1 1
4 2015-02-12 10:22:06.717 1 1
5 2015-02-12 10:23:06.717 2 2
6 2015-02-12 10:24:06.717 2 2
7 2015-02-12 10:25:06.717 2 2
8 2015-02-12 10:26:06.717 2 2
9 2015-02-12 10:27:06.717 2 2
10 2015-02-12 10:28:06.717 1 3
11 2015-02-12 10:29:06.717 1 3
12 2015-02-12 10:30:06.717 1 3
13 2015-02-12 10:31:06.717 2 4
14 2015-02-12 10:32:06.717 2 4
15 2015-02-12 10:33:06.717 1 5
16 2015-02-12 10:34:06.717 1 5
17 2015-02-12 10:35:06.717 1 5
18 2015-02-12 10:36:06.717 2 6
19 2015-02-12 10:37:06.717 2 6
20 2015-02-12 10:38:06.717 1 7
21 2015-02-12 10:39:06.717 1 7
22 2015-02-12 10:40:06.717 1 7
23 2015-02-12 10:40:06.717 1 7
I think the operation you wish to perform on the data is ultimately not relational, so a tidy set-based solution may not exist. What you are trying to do depends on ordering, and on the row sequentially before/after the current one, and so needs something cursor-like somewhere.
I've managed to get your desired output using a recursive CTE. It's not optimised, but it might give you something to work with.
The issue I have with this is the GROUP BY and MAX I'm using on the result set to get the correct values. I'm sure it can be done in a better way.
;WITH cte
AS ( SELECT ID ,
TranDate ,
TranCode ,
1 AS BatchNumber
FROM t
UNION ALL
SELECT t.ID ,
t.TranDate ,
t.TranCode ,
CASE WHEN t.TranCode != cte.TranCode
THEN cte.BatchNumber + 1
ELSE cte.BatchNumber
END AS BatchNumber
FROM t
INNER JOIN cte ON t.id = cte.Id + 1
)
SELECT id ,
trandate ,
trancode ,
MAX(cte.BatchNumber) AS BatchNumber
FROM cte
GROUP BY id ,
tranDate ,
trancode
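Despite the doubt above, a fully set-based, non-recursive formulation of this problem does exist: it is the classic gaps-and-islands pattern, where you flag each row whose TranCode differs from the previous row (LAG in SQL) and take a running SUM of those flags as the batch number. The same island-numbering logic, sketched in Python for illustration:

```python
def batch_numbers(trancodes):
    """Gaps-and-islands: start a new batch whenever TranCode differs from
    the previous row (ordered by id); the batch number is the running
    count of change-points."""
    batches = []
    batch = 0
    prev = None
    for code in trancodes:
        if code != prev:  # change-point flag (the LAG comparison)
            batch += 1    # running SUM of the flags
        batches.append(batch)
        prev = code
    return batches
```

Applied to the 23 TranCode values in the question, this reproduces the expected BatchNo column 1,1,1,1,2,...,7 without any recursion.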