(This is my first question on Stack Overflow, and I am new to SQL with MS Access. Please advise if I am missing anything or using the wrong format.)
I have two tables, [Summary] and [Detail], with the following layout:
[Summary]

Driver ID  DateOfOperation  SalaryMonth
24         1/21/2023        2/1/2023
24         1/23/2023        2/1/2023
30         1/21/2023        2/1/2023
30         1/23/2023        2/1/2023
...Record Total: 18734
[Detail]

Driver ID  DateOfOperation  WorkOrder  Points  SalaryMonth
24         1/21/2023        1          400     2/1/2023
24         1/21/2023        2          118     2/1/2023
24         1/21/2023        3          118     2/1/2023
24         1/21/2023        4          118     2/1/2023
30         1/21/2023        1          462     2/1/2023
30         1/21/2023        2          1264    2/1/2023
30         1/23/2023        1          924     2/1/2023
30         1/23/2023        2          1264    2/1/2023
24         1/21/2023        1          260     2/1/2023
24         1/21/2023        2          354     2/1/2023
24         1/21/2023        3          236     2/1/2023
24         1/21/2023        4          260     2/1/2023
24         1/21/2023        5          236     2/1/2023
24         1/21/2023        6          236     2/1/2023
24         1/21/2023        7          236     2/1/2023
24         1/21/2023        8          236     2/1/2023
24         1/21/2023        9          236     2/1/2023
...Record Total: 52838
I attempted to:

- count the total days a driver worked in a period (e.g. a month);
- calculate the total points a driver got in a period; and
- work out the average points a driver got in a period.

I ran the query with the SQL below. The query ran unusually long, and the numbers in CountDateOfOperation and Month_Points went haywire, e.g. 1003922 days in a month.
SELECT Summary.[Driver ID], Count(Summary.DateOfOperation) AS CountDateOfOperation, Sum([Points]) AS Month_Points
FROM Summary, Detail
WHERE (((Summary.DateOfOperation) Between [Begin Date?] And [end date?]))
GROUP BY Summary.[Driver ID];
Expected result:

[Begin Date?] - 12/21/2022
[end date?] - 1/20/2023

Driver ID  CountDateOfOperation  Month_Points  SalaryMonth
24         19                    18794         1/1/2023
30         25                    26548         1/1/2023
...Record Total: 39
Actual result:

[Begin Date?] - 12/21/2022
[end date?] - 1/20/2023

Driver ID  CountDateOfOperation  Month_Points  SalaryMonth
24         1003922               293134356     1/1/2023
30         1320950               385703100     1/1/2023
...Record Total: 39
Can anyone tell me what's wrong with the SQL and how to resolve this issue?
#################################
Thank you for your prompt reply (which scared me a bit...).
I used Access to link up the tables, and the SQL turned out like below:
SELECT Summary.[Driver ID], Count(Summary.DateOfOperation) AS CountDateOfOperation, Sum([Points]) AS Month_Points, Summary.SalaryMonth
FROM Drivers INNER JOIN (Summary INNER JOIN Detail ON (Summary.SalaryMonth = Detail.Salary_month) AND (Summary.DateOfOperation = Detail.[Date of Operation]) AND (Summary.[Driver ID] = Detail.[Driver ID])) ON (Drivers.[Driver ID] = Summary.[Driver ID]) AND (Drivers.[Driver ID] = Detail.[Driver ID])
WHERE (((Summary.DateOfOperation) Between [Begin Date?] And [end date?]))
GROUP BY Summary.[Driver ID], Summary.SalaryMonth;
The outcome makes a lot more sense, but it is still not accurate...

Actual result:

[Begin Date?] - 12/21/2022
[end date?] - 1/20/2023

Driver ID  CountDateOfOperation  Month_Points  SalaryMonth
24         80                    18794         1/1/2023
30         50                    26548         1/1/2023
...Record Total: 39
I just found that CountDateOfOperation is now counting Detail.WorkOrder instead of Summary.DateOfOperation.
Does anyone know what went wrong?
Thank you all.
I'm not sure why you need the Summary table. There's nothing in it that is not in Detail.
Test Data (Fiddle):
CREATE TABLE #detail ( Driver_ID INT
, DateOfOperation DATE
, WorkOrder INT
, Points INT
, SalaryMonth DATE);
INSERT INTO #detail
VALUES
( 24, '1/21/2023', 1, 400, '2/1/2023')
,( 24, '1/21/2023', 2, 118, '2/1/2023')
,( 24, '1/21/2023', 3, 118, '2/1/2023')
,( 24, '1/21/2023', 4, 118, '2/1/2023')
,( 30, '1/21/2023', 1, 462, '2/1/2023')
,( 30, '1/21/2023', 2, 1264, '2/1/2023')
,( 30, '1/23/2023', 1, 924, '2/1/2023')
,( 30, '1/23/2023', 2, 1264, '2/1/2023')
,( 24, '1/21/2023', 1, 260, '2/1/2023')
,( 24, '1/21/2023', 2, 354, '2/1/2023')
,( 24, '1/21/2023', 3, 236, '2/1/2023')
,( 24, '1/21/2023', 4, 260, '2/1/2023')
,( 24, '1/21/2023', 5, 236, '2/1/2023')
,( 24, '1/21/2023', 6, 236, '2/1/2023')
,( 24, '1/21/2023', 7, 236, '2/1/2023')
,( 24, '1/21/2023', 8, 236, '2/1/2023')
,( 24, '1/21/2023', 9, 236, '2/1/2023');
Query - I wrote this in SQL Server but I believe it should work in Access.
DECLARE @BeginDate DATE = '2023-01-01'
, @EndDate DATE = '2023-02-01'
SELECT Driver_ID
, COUNT(DISTINCT DateOfOperation) DaysWorked
, SUM(Points) TotalPointsEarned
, SUM(Points)/COUNT(DISTINCT DateOfOperation) AvgPointsPerWorkDay
, SalaryMonth
FROM #detail
WHERE DateOfOperation BETWEEN @BeginDate AND @EndDate
GROUP BY Driver_ID, SalaryMonth
Returns:

Driver_ID  DaysWorked  TotalPointsEarned  AvgPointsPerWorkDay  SalaryMonth
24         1           3044               3044                 2023-02-01
30         2           3914               1957                 2023-02-01
I'm trying to predict the start / end time of an order's processes in SQL. I have determined the average duration for processes from the past.
The processes run in several parallel rows (RNr) and rows are independent of each other. Each row can have 1-30 processes (PNr) that have different durations. The duration of a process may vary and is known only as an average duration.
After one process is completed, the next automatically starts.
So PNr 1 finish = PNr 2 start.
The start time of the first process in each row is known at the beginning and is the same for each row.
When some processes are completed, their times are known and should be used to calculate a more accurate prediction of the upcoming processes.
How can I predict the time when a process will be started or stopped?
I used a large subquery to get this table.
RNr PNr Duration_avg_h Start Finish
1 1 1 2019-06-06 16:32:11 2019-06-06 16:33:14
1 2 262 2019-06-06 16:33:14 NULL
1 3 51 NULL NULL
1 4 504 NULL NULL
1 5 29 NULL NULL
2 1 1 2019-06-06 16:32:11 NULL
2 2 124 NULL NULL
2 3 45 NULL NULL
2 4 89 NULL NULL
2 5 19 NULL NULL
2 6 1565 NULL NULL
2 7 24 NULL NULL
Now I want to find the values for the prediction.
SELECT
RNr,
PNr,
Duration_avg_h,
Start,
Finish,
Predicted_Start = CASE
WHEN Start IS NULL
THEN DATEADD(HH,LAG(Duration_avg_h, 1,NULL) OVER (ORDER BY RNr,PNr), LAG(Start, 1,NULL) OVER (ORDER BY RNr,PNr))
ELSE Start END,
Predicted_Finish = CASE
WHEN Finish IS NULL
THEN DATEADD(HH,Duration_avg_h,Start)
ELSE Finish END,
SUM(Duration_avg_h) over (PARTITION BY RNr ORDER BY RNr, PNr ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Duration_row_h
FROM (...)
ORDER BY RNr, PNr
I tried LAG(), but with that I only get the values from the neighbouring row. I also got nowhere with "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW".
RNr PNr Duration_avg_h Start Finish Predicted_Start Predicted_Finish Duration_row_h
1 1 1 2019-06-06 16:32:11 2019-06-06 16:33:14 2019-06-06 16:32:11 2019-06-06 16:33:14 1
1 2 262 2019-06-06 16:33:14 NULL 2019-06-06 16:33:14 2019-06-17 14:33:14 263
1 3 51 NULL NULL 2019-06-17 14:33:14 NULL 314
1 4 504 NULL NULL NULL NULL 818
1 5 29 NULL NULL NULL NULL 847
2 1 1 2019-06-06 16:32:11 NULL 2019-06-06 16:32:11 2019-06-06 17:32:11 1
2 2 124 NULL NULL 2019-06-06 17:32:11 NULL 125
2 3 45 NULL NULL NULL NULL 170
2 4 89 NULL NULL NULL NULL 259
2 5 19 NULL NULL NULL NULL 278
So can somebody help me fill in the columns Predicted_Start and Predicted_Finish?
LAG only works if all your rows have values. For this use case you need to cascade the results from one row to another. One way of doing this is with a self join to get running totals.
--Sample Data
DECLARE @dataset TABLE
(
RNr INT
,PNr INT
,Duration_avg_h INT
,START DATETIME
,Finish DATETIME
)
INSERT INTO @dataset
(
RNr
,PNr
,Duration_avg_h
,START
,Finish
)
VALUES
(1, 1, 1, '2019-06-06 16:32:11',NULL)
,(1, 2, 262, NULL,NULL)
,(1, 3, 51, NULL,NULL)
,(1, 4, 504, NULL,NULL)
,(1, 5, 29, NULL,NULL)
,(2, 1, 1, '2019-06-06 16:32:11', NULL)
,(2, 2, 124, NULL,NULL)
,(2, 3, 45, NULL,NULL)
,(2, 4, 89, NULL,NULL)
,(2, 5, 19, NULL,NULL)
,(2, 6, 1565, NULL,NULL)
,(2, 7, 24, NULL,NULL)
SELECT
d.RNr,
d.PNr,
d.Duration_avg_h,
d.Start,
d.Finish,
--SUM() gives us the total time up to and including this step;
--take off the current step and you get the total time of all the previous steps.
--This can give us our start time, i.e. when the previous step ended.
SUM(running_total.Duration_avg_h) - d.Duration_avg_h AS running_total_time,
--MIN() gives us the earliest start time we have per row.
MIN(running_total.Start) AS min_start,
ISNULL(
d.Start
--exclude the current step's duration, otherwise this would be the finish time
,DATEADD(HH,SUM(running_total.Duration_avg_h) - d.Duration_avg_h,MIN(running_total.Start) )
) AS Predicted_Start,
ISNULL(
d.Finish
,DATEADD(HH,SUM(running_total.Duration_avg_h),MIN(running_total.Start) )
) AS Predicted_Finish
FROM @dataset AS d
LEFT JOIN @dataset AS running_total
ON d.RNr = running_total.RNr
AND
--the running total for all steps.
running_total.PNr <= d.PNr
GROUP BY
d.RNr,
d.PNr,
d.Duration_avg_h,
d.Start,
d.Finish
ORDER BY
RNr,
PNr
This code will not work once you have actual finish times unless you update the Duration_avg_h to be the actual hours taken.
Jonathan, thanks for your help.
Your idea of using "MIN(running_total.Start) AS min_start" brought me to the idea of using "MAX(d.Start) OVER (PARTITION BY RNr)", which resulted in the following query:
--Sample Data
DECLARE @dataset TABLE
(
RNr INT
,PNr INT
,Duration_avg_h INT
,START DATETIME
,Finish DATETIME
)
INSERT INTO @dataset
(
RNr
,PNr
,Duration_avg_h
,START
,Finish
)
VALUES
(1, 1, 1, '2019-06-06 16:32:11','2019-06-06 16:33:14')
,(1, 2, 262, '2019-06-06 16:33:14','2019-08-22 17:30:00')
,(1, 3, 51, '2019-08-22 17:30:00',NULL)
,(1, 4, 504, NULL,NULL)
,(1, 5, 29, NULL,NULL)
,(2, 1, 1, '2019-06-06 16:32:11', NULL)
,(2, 2, 124, NULL,NULL)
,(2, 3, 45, NULL,NULL)
,(2, 4, 89, NULL,NULL)
,(2, 5, 19, NULL,NULL)
,(2, 6, 1565, NULL,NULL)
,(2, 7, 24, NULL,NULL)
SELECT RNr,
PNr,
Duration_avg_h,
Start,
Finish,
--Start_max,
--Finish_bit,
--Duration_row_h,
CASE WHEN Start IS NOT NULL THEN Start ELSE DATEADD(HH,(Duration_row_h - MAX(Duration_row_h*Finish_bit) OVER (PARTITION BY RNr) - Duration_avg_h), Start_max) END as Predicted_Start,
CASE WHEN Finish IS NOT NULL THEN Finish ELSE DATEADD(HH,(Duration_row_h - MAX(Duration_row_h*Finish_bit) OVER (PARTITION BY RNr)), Start_max) END as Predicted_Finish
FROM ( SELECT
RNr,
PNr,
Duration_avg_h,
--Convert to a short DATETIME format
CONVERT(DATETIME2(0),Start) as Start,
CONVERT(DATETIME2(0),Finish) as Finish,
--Get MAX start time for each row
Start_max = MAX (CONVERT(DATETIME2(0),d.Start)) OVER (PARTITION BY RNr),
--If process is finished then 1
Finish_bit = (CASE WHEN d.Finish IS NULL THEN 0 ELSE 1 END),
--continuously count the Duration of all processes in the row
SUM(Duration_avg_h) over (PARTITION BY RNr ORDER BY RNr, PNr ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Duration_row_h
FROM @dataset AS d
) AS e
ORDER BY
RNr,
PNr
This query takes into account changes in start and stop times and, based on those, calculates the prediction for the upcoming processes.
RNr PNr Duration_avg_h Start Finish Predicted_Start Predicted_Finish
1 1 1 2019-06-06 16:32:11 2019-06-06 16:33:14 2019-06-06 16:32:11 2019-06-06 16:33:14
1 2 262 2019-06-06 16:33:14 2019-08-22 17:30:00 2019-06-06 16:33:14 2019-08-22 17:30:00
1 3 51 2019-08-22 17:30:00 NULL 2019-08-22 17:30:00 2019-08-24 20:30:00
1 4 504 NULL NULL 2019-08-24 20:30:00 2019-09-14 20:30:00
1 5 29 NULL NULL 2019-09-14 20:30:00 2019-09-16 01:30:00
2 1 1 2019-06-06 16:32:11 NULL 2019-06-06 16:32:11 2019-06-06 17:32:11
2 2 124 NULL NULL 2019-06-06 17:32:11 2019-06-11 21:32:11
2 3 45 NULL NULL 2019-06-11 21:32:11 2019-06-13 18:32:11
2 4 89 NULL NULL 2019-06-13 18:32:11 2019-06-17 11:32:11
2 5 19 NULL NULL 2019-06-17 11:32:11 2019-06-18 06:32:11
2 6 1565 NULL NULL 2019-06-18 06:32:11 2019-08-22 11:32:11
2 7 24 NULL NULL 2019-08-22 11:32:11 2019-08-23 11:32:11
I think this approach is still complicated. Does anyone know a simpler query?
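Not a definitive answer, but a possibly leaner variant of the same idea (a sketch; run_h, anchor_h and anchor_t are names I've introduced, and it assumes the first process of each row has a known Start, as in the sample): compute the running duration once, anchor every row on the latest known Start in its RNr plus the hours accumulated before that step, and let COALESCE fall back to the prediction.

SELECT RNr, PNr, Duration_avg_h, Start, Finish,
       --anchor time plus the hours scheduled between the anchor and this step
       COALESCE(Start, DATEADD(HOUR, run_h - Duration_avg_h - anchor_h, anchor_t)) AS Predicted_Start,
       COALESCE(Finish, DATEADD(HOUR, run_h - anchor_h, anchor_t)) AS Predicted_Finish
FROM (
    SELECT *,
           --hours accumulated before the latest step with a known Start
           MAX(CASE WHEN Start IS NOT NULL THEN run_h - Duration_avg_h END)
               OVER (PARTITION BY RNr) AS anchor_h,
           --the latest known Start itself (Start values increase with PNr)
           MAX(Start) OVER (PARTITION BY RNr) AS anchor_t
    FROM (
        SELECT *,
               --running total of average durations within the row
               SUM(Duration_avg_h) OVER (PARTITION BY RNr ORDER BY PNr) AS run_h
        FROM @dataset
    ) AS x
) AS y
ORDER BY RNr, PNr;

On the sample above this reproduces the same predictions; whether it is actually simpler is a matter of taste.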
I have a data set which contains account_number, date, balance, interest charged, and code. This is accounting data, so transactions are posted and then reversed if there was a mistake by the data provider, meaning things can be posted and reversed multiple times.
Account_Number  Date        Balance    Interest Charged  Code
0012            01/01/2017  1,000,000  $50.00            Posted
0012            01/05/2017  1,000,000  -$50.00           Reversed
0012            01/07/2017  1,000,000  $50.00            Posted
0012            01/10/2017  1,000,000  -$50.00           Reversed
0012            01/15/2017  1,000,000  $50.00            Posted
0012            01/17/2017  1,500,000  $25.00            Posted
0012            01/18/2017  1,500,000  -$25.00           Reversed
Looking at the data set above, I am trying to figure out a way to look at every row by account number and balance; if there is an inverse charge, it should remove both of those rows and only keep a charge if there is no corresponding reversal for it (01/15/2017). For example, on 01/01/2017 a charge of 50.00 dollars was posted on a balance of 1,000,000, and on 01/05/2017 the charge was reversed on the same balance -- so both of these rows should be thrown out. The same goes for 01/07 and 01/10.
I am not too sure how to code this problem - any ideas or tips would be great!
So the problem with a question like this is that there are many corner cases, and optimizing for them may or may not depend on how the data is already processed. That being said, here is one solution, assuming that:
For each account number and balance, the row for each Reversed transaction comes just after the corresponding posting.
>>import pandas as pd
>>from datetime import date
>>df = pd.DataFrame(data = [
['0012', date(2017, 1, 1), 1000000, 50, 'Posted'],['0012', date(2017, 1, 5), 1000000, -50, 'Reversed'],
['0012', date(2017, 1, 7), 1000000, 50, 'Posted'],['0012', date(2017, 1, 10), 1000000, -50, 'Reversed'],
['0012', date(2017, 1, 15), 1000000, 50, 'Posted'],['0012', date(2017, 1, 17), 1500000, 25, 'Posted'],
['0012', date(2017, 1, 18), 1500000, -25, 'Reversed'],],
columns=['Account_Number', 'Date', 'Balance', 'Interest Charged', 'Code'])
>>df
Account_Number Date Balance Interest Charged Code
0 0012 2017-01-01 1000000 50 Posted
1 0012 2017-01-05 1000000 -50 Reversed
2 0012 2017-01-07 1000000 50 Posted
3 0012 2017-01-10 1000000 -50 Reversed
4 0012 2017-01-15 1000000 50 Posted
5 0012 2017-01-17 1500000 25 Posted
6 0012 2017-01-18 1500000 -25 Reversed
>> def f(df_g):
       idx = df_g[df_g['Code'] == 'Reversed'].index
       return df_g.loc[~df_g.index.isin(idx.union(idx - 1)), ['Date', 'Interest Charged', 'Code']]
>>df.groupby(['Account_Number', 'Balance']).apply(f).reset_index().loc[:, df.columns]
Account_Number Date Balance Interest Charged Code
0 0012 2017-01-15 1000000 50 Posted
How it works - basically, for each combination of account number and balance, I look at the rows with Reversed and remove them plus the row just before each.
EDIT: To make it slightly more robust (it now pairs up rows based on amount, balance, and account number):
>>df = pd.DataFrame(data = [
['0012', date(2017, 1, 1), 1000000, 53, 'Posted'],['0012', date(2017, 1, 7), 1000000, 50, 'Posted'],['0012', date(2017, 1, 5), 1000000, -50, 'Reversed'],
['0012', date(2017, 1, 10), 1000000, -53, 'Reversed'],
['0012', date(2017, 1, 15), 1000000, 50, 'Posted'],['0012', date(2017, 1, 17), 1500000, 25, 'Posted'],
['0012', date(2017, 1, 18), 1500000, -25, 'Reversed'],],
columns=['Account_Number', 'Date', 'Balance', 'Interest Charged', 'Code'])
>>df
Account_Number Date Balance Interest Charged Code
0 0012 2017-01-01 1000000 53 Posted
1 0012 2017-01-07 1000000 50 Posted
2 0012 2017-01-05 1000000 -50 Reversed
3 0012 2017-01-10 1000000 -53 Reversed
4 0012 2017-01-15 1000000 50 Posted
5 0012 2017-01-17 1500000 25 Posted
6 0012 2017-01-18 1500000 -25 Reversed
>>output_cols = df.columns
>>df['ABS_VALUE'] = df['Interest Charged'].abs()
>> def f(df_g):
       df_g = df_g.reset_index()  # Added this new line
       idx = df_g[df_g['Code'] == 'Reversed'].index
       return df_g.loc[~df_g.index.isin(idx.union(idx - 1)), ['Date', 'Interest Charged', 'Code']]
>>df.groupby(['Account_Number', 'Balance', 'ABS_VALUE']).apply(f).reset_index().loc[:, output_cols]
Account_Number Date Balance Interest Charged Code
0 0012 2017-01-15 1000000 50 Posted
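If the "reversal sits right next to its posting" assumption is still too strict, a different sketch (my own variation, not part of the answer above) pairs each reversal with the earliest unmatched posting of the same account, balance, and absolute amount, using an occurrence counter; abs_amt, occ, and n_rev are helper names I've introduced:

import pandas as pd

def drop_reversed_pairs(df):
    out = df.copy()
    out['abs_amt'] = out['Interest Charged'].abs()
    grp = ['Account_Number', 'Balance', 'abs_amt']
    # occurrence number (0, 1, 2, ...) of each row within its group and code,
    # in date order, so a reversal cancels the earliest unmatched posting
    out = out.sort_values('Date')
    out['occ'] = out.groupby(grp + ['Code']).cumcount()
    # number of reversals per group
    n_rev = (out[out['Code'] == 'Reversed']
             .groupby(grp).size().rename('n_rev').reset_index())
    posted = out[out['Code'] == 'Posted'].merge(n_rev, on=grp, how='left')
    posted['n_rev'] = posted['n_rev'].fillna(0)
    # the first n_rev postings in each group are considered cancelled
    keep = posted[posted['occ'] >= posted['n_rev']]
    return keep.drop(columns=['abs_amt', 'occ', 'n_rev'])

print(drop_reversed_pairs(df))  # on the sample above, only the 01/15 posting survives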
I have a data frame (df) with multi column headers:
yearQ YearC YearS Type1 Type2
index City State Year1 Year2 Year3 Year4 Year5 Year6
0 New York NY 355 189 115 234 178 422
1 Los Angeles CA 100 207 298 230 214 166
2 Chicago IL 1360 300 211 121 355 435
3 Philadelphia PA 270 156 455 232 532 355
4 Phoenix AZ 270 234 112 432 344 116
I want to compute the average number for each type. The final format should be like the following:
City State Type1 Type2
New York NY avg of(355+189+115) avg of (234+178+422)
.......
Can anybody give me a hint?
Many thanks.
Kath
I think you can use groupby by the first level of the MultiIndex in columns with aggregate sum:
print (df.index)
MultiIndex(levels=[[0, 1, 2, 3, 4],
['Chicago', 'Los Angeles', 'New York', 'Philadelphia', 'Phoenix'],
['AZ', 'CA', 'IL', 'NY', 'PA']],
labels=[[0, 1, 2, 3, 4], [2, 1, 0, 3, 4], [3, 1, 2, 4, 0]])
print (df.columns)
MultiIndex(levels=[['Type1', 'Type2'],
['Year1', 'Year2', 'Year3', 'Year4', 'Year5', 'Year6']],
labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 3, 4, 5]],
names=['YearQ', 'index'])
df = df.groupby(axis=1, level=0).sum()
print (df)
YearQ Type1 Type2
0 New York NY 659 834
1 Los Angeles CA 605 610
2 Chicago IL 1871 911
3 Philadelphia PA 881 1119
4 Phoenix AZ 616 892
But maybe it is necessary to set_index first:
print (df.index)
Int64Index([0, 1, 2, 3, 4], dtype='int64')
print (df.columns)
MultiIndex(levels=[['Type1', 'Type2', 'YearC', 'YearS'],
['City', 'State', 'Year1', 'Year2', 'Year3', 'Year4', 'Year5', 'Year6']],
labels=[[2, 3, 0, 0, 0, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7]],
names=['YearQ', 'index'])
df = df.set_index([('YearC','City'), ('YearS','State')])
df = df.groupby(axis=1, level=0).sum()
print (df)
YearQ Type1 Type2
(YearC, City) (YearS, State)
New York NY 659 834
Los Angeles CA 605 610
Chicago IL 1871 911
Philadelphia PA 881 1119
Phoenix AZ 616 892
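One note: the question asks for the average, and the expected output shows "avg of (...)", while the code above aggregates with sum. Swapping sum for mean gives the averages, e.g. after the set_index step (a small sketch; df_avg is a name I've introduced):

df_avg = df.groupby(axis=1, level=0).mean()

# equivalent on recent pandas versions, where groupby(axis=1) is deprecated:
df_avg = df.T.groupby(level=0).mean().T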
I have a table which holds transactions. Each transaction is represented by a row; the row has a field TranCode indicating the type of transaction, and the date of the transaction is also recorded. Following are the table and the corresponding data.
create table t
(
id int identity(1,1),
TranDate datetime,
TranCode int,
BatchNo int
)
GO
insert into t (TranDate, TranCode)
VALUES(GETDATE(), 1),
(DATEADD(MINUTE, 1, GETDATE()), 1),
(DATEADD(MINUTE, 2, GETDATE()), 1),
(DATEADD(MINUTE, 3, GETDATE()), 1),
(DATEADD(MINUTE, 4, GETDATE()), 2),
(DATEADD(MINUTE, 5, GETDATE()), 2),
(DATEADD(MINUTE, 6, GETDATE()), 2),
(DATEADD(MINUTE, 7, GETDATE()), 2),
(DATEADD(MINUTE, 8, GETDATE()), 2),
(DATEADD(MINUTE, 9, GETDATE()), 1),
(DATEADD(MINUTE, 10, GETDATE()), 1),
(DATEADD(MINUTE, 11, GETDATE()), 1),
(DATEADD(MINUTE, 12, GETDATE()), 2),
(DATEADD(MINUTE, 13, GETDATE()), 2),
(DATEADD(MINUTE, 14, GETDATE()), 1),
(DATEADD(MINUTE, 15, GETDATE()), 1),
(DATEADD(MINUTE, 16, GETDATE()), 1),
(DATEADD(MINUTE, 17, GETDATE()), 2),
(DATEADD(MINUTE, 18, GETDATE()), 2),
(DATEADD(MINUTE, 19, GETDATE()), 1),
(DATEADD(MINUTE, 20, GETDATE()), 1),
(DATEADD(MINUTE, 21, GETDATE()), 1),
(DATEADD(MINUTE, 21, GETDATE()), 1)
After the above code, the table contains the following data. The values in the TranDate field will be different for you, but that is fine.
id TranDate TranCode BatchNo
----------- ----------------------- ----------- -----------
1 2015-02-12 20:40:47.547 1 NULL
2 2015-02-12 20:41:47.547 1 NULL
3 2015-02-12 20:42:47.547 1 NULL
4 2015-02-12 20:43:47.547 1 NULL
5 2015-02-12 20:44:47.547 2 NULL
6 2015-02-12 20:45:47.547 2 NULL
7 2015-02-12 20:46:47.547 2 NULL
8 2015-02-12 20:47:47.547 2 NULL
9 2015-02-12 20:48:47.547 2 NULL
10 2015-02-12 20:49:47.547 1 NULL
11 2015-02-12 20:50:47.547 1 NULL
12 2015-02-12 20:51:47.547 1 NULL
13 2015-02-12 20:52:47.547 2 NULL
14 2015-02-12 20:53:47.547 2 NULL
15 2015-02-12 20:54:47.547 1 NULL
16 2015-02-12 20:55:47.547 1 NULL
17 2015-02-12 20:56:47.547 1 NULL
18 2015-02-12 20:57:47.547 2 NULL
19 2015-02-12 20:58:47.547 2 NULL
20 2015-02-12 20:59:47.547 1 NULL
21 2015-02-12 21:00:47.547 1 NULL
22 2015-02-12 21:01:47.547 1 NULL
23 2015-02-12 21:01:47.547 1 NULL
I want a set-based solution, not a cursor- or row-based one, to update the batch number for the rows. For example, the first 4 records should get a BatchNo of 1 as they have TranCode 1, the next 5 (having TranCode 2 and being close to each other in time) should have BatchNo 2, the next 3 should have 3, and so on. Following is the expected output.
id TranDate TranCode BatchNo
----------- ----------------------- ----------- -----------
1 2015-02-12 20:43:59.123 1 1
2 2015-02-12 20:44:59.123 1 1
3 2015-02-12 20:45:59.123 1 1
4 2015-02-12 20:46:59.123 1 1
5 2015-02-12 20:47:59.123 2 2
6 2015-02-12 20:48:59.123 2 2
7 2015-02-12 20:49:59.123 2 2
8 2015-02-12 20:50:59.123 2 2
9 2015-02-12 20:51:59.123 2 2
10 2015-02-12 20:52:59.123 1 3
11 2015-02-12 20:53:59.123 1 3
12 2015-02-12 20:54:59.123 1 3
13 2015-02-12 20:55:59.123 2 4
14 2015-02-12 20:56:59.123 2 4
15 2015-02-12 20:57:59.123 1 5
16 2015-02-12 20:58:59.123 1 5
17 2015-02-12 20:59:59.123 1 5
18 2015-02-12 21:00:59.123 2 6
19 2015-02-12 21:01:59.123 2 6
20 2015-02-12 21:02:59.123 1 7
21 2015-02-12 21:03:59.123 1 7
22 2015-02-12 21:04:59.123 1 7
23 2015-02-12 21:04:59.123 1 7
I have tried very hard with ROW_NUMBER, RANK, and DENSE_RANK, and none of them came to my rescue. I am looking for a set-based solution as I want really good performance.
Your help is very much appreciated.
You could do this using a recursive CTE. I also used the LEAD function to check the next row and determine if the TranCode changed.
Query:
WITH A
AS (
SELECT id
,trancode
,trandate
,lead(trancode) OVER (ORDER BY id,trancode) leadcode
FROM t
)
,cte
AS (
SELECT id
,trandate
,trancode
,lead(trancode) OVER (ORDER BY id,trancode) leadcode
,1 batchnum
,1 nextbatchnum
,id + 1 nxtId
FROM t
WHERE id = 1
UNION ALL
SELECT A.id
,A.trandate
,A.trancode
,A.leadcode
,nextbatchnum
,CASE
WHEN A.trancode <> A.leadcode THEN nextbatchnum + 1 ELSE nextbatchnum END nextbatchnum
,A.id + 1 nxtid
FROM A
INNER JOIN CTE B ON A.id = B.nxtId
)
SELECT id
,trandate
,trancode
,batchnum
FROM CTE
OPTION (MAXRECURSION 100)
Result:
id trandate trancode batchnum
1 2015-02-12 10:19:06.717 1 1
2 2015-02-12 10:20:06.717 1 1
3 2015-02-12 10:21:06.717 1 1
4 2015-02-12 10:22:06.717 1 1
5 2015-02-12 10:23:06.717 2 2
6 2015-02-12 10:24:06.717 2 2
7 2015-02-12 10:25:06.717 2 2
8 2015-02-12 10:26:06.717 2 2
9 2015-02-12 10:27:06.717 2 2
10 2015-02-12 10:28:06.717 1 3
11 2015-02-12 10:29:06.717 1 3
12 2015-02-12 10:30:06.717 1 3
13 2015-02-12 10:31:06.717 2 4
14 2015-02-12 10:32:06.717 2 4
15 2015-02-12 10:33:06.717 1 5
16 2015-02-12 10:34:06.717 1 5
17 2015-02-12 10:35:06.717 1 5
18 2015-02-12 10:36:06.717 2 6
19 2015-02-12 10:37:06.717 2 6
20 2015-02-12 10:38:06.717 1 7
21 2015-02-12 10:39:06.717 1 7
22 2015-02-12 10:40:06.717 1 7
23 2015-02-12 10:40:06.717 1 7
I think that ultimately the operation you wish to perform on the data is not relational, so a nice set-based solution doesn't exist. What you are trying to do relies on the order, and on the rows sequentially before/after each row, and so needs a cursor somewhere.
I've managed to get your desired output using a recursive CTE; it's not optimised, but I thought it might be useful to post what I've done to give you something to work with.
The issue I have with this is the GROUP BY and MAX I'm using on the result set to get the correct values. I'm sure it can be done in a better way.
;WITH cte
AS ( SELECT ID ,
TranDate ,
TranCode ,
1 AS BatchNumber
FROM t
UNION ALL
SELECT t.ID ,
t.TranDate ,
t.TranCode ,
CASE WHEN t.TranCode != cte.TranCode
THEN cte.BatchNumber + 1
ELSE cte.BatchNumber
END AS BatchNumber
FROM t
INNER JOIN cte ON t.id = cte.Id + 1
)
SELECT id ,
trandate ,
trancode ,
MAX(cte.BatchNumber) AS BatchNumber
FROM cte
GROUP BY id ,
tranDate ,
trancode
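For reference, on SQL Server 2012 and later this is the classic gaps-and-islands problem, and it can be solved fully set-based with no recursion: flag each row whose TranCode differs from the previous row's, then take a running sum of the flags. A sketch against the table t above (chg and x are names I've introduced):

SELECT id, TranDate, TranCode,
       --running count of code changes = batch number
       SUM(chg) OVER (ORDER BY TranDate, id ROWS UNBOUNDED PRECEDING) AS BatchNo
FROM (
    SELECT *,
           --1 when this row starts a new batch, else 0 (first row: LAG is NULL, so 1)
           CASE WHEN LAG(TranCode) OVER (ORDER BY TranDate, id) = TranCode
                THEN 0
                ELSE 1
           END AS chg
    FROM t
) AS x
ORDER BY id;

To persist the result, join this derived table back to t on id in an UPDATE ... FROM.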