Grouping by timeframes with a modifier that changes over time - sql

After poring over a similar question that never arrived at a complete solution, I have finally gotten to the heart of the problem I can't solve. I'm looking for the number of consecutive days that a person is prescribed a certain number of drugs. Because prescriptions begin and end, there can be multiple non-contiguous intervals during which a person is on X drugs at once. The following SQL script reproduces the result set of the query I'll post momentarily. Also, I don't have SQL Server 2012.
create table test
(pat_id int, cal_date date, grp_nbr int, drug_qty int,[ranking] int)
go
insert into test(pat_id,cal_date, grp_nbr,drug_qty,[ranking])
values
(1, '1/8/2007',7,2, 1),
(1, '1/9/2007',7,2, 1),
(1, '1/10/2007',7, 2,1),
(1, '1/11/2007',7, 2,1),
(1, '1/12/2007',7, 2,1),
(1, '1/13/2007',7, 2,1),
(1, '1/14/2007',7, 2,1),
(1, '1/15/2007',7, 2,1),
(1, '6/1/2007',7,2, 1),
(1, '6/2/2007',7,2, 1),
(1, '6/3/2007',7,2, 1)
Notice here that there are two non-contiguous intervals where this person was on two drugs at once. In the days that are omitted, drug_qty was more than two. The last column in this example was my attempt at adding another field I could group by to help solve the problem (it didn't work).
Query to create tables:
CREATE TABLE [dbo].[rx](
[clmid] [int] IDENTITY NOT NULL, -- surrogate key referenced by the primary key below
[pat_id] [int] NOT NULL,
[fill_Date] [date] NOT NULL,
[script_End_Date] AS (dateadd(day,[days_Sup],[fill_Date])),
[drug_Name] [varchar](50) NULL,
[days_Sup] [int] NOT NULL,
[quantity] [float] NOT NULL,
[drug_Class] [char](3) NOT NULL,
CHECK (fill_Date <= script_End_Date),
PRIMARY KEY CLUSTERED ([clmid] ASC)
);
CREATE TABLE [dbo].[Calendar](
[cal_date] [date] PRIMARY KEY,
[Year] AS YEAR(cal_date) PERSISTED,
[Month] AS MONTH(cal_date) PERSISTED,
[Day] AS DAY(cal_date) PERSISTED,
[julian_seq] AS 1+DATEDIFF(DD, CONVERT(DATE, CONVERT(varchar,YEAR(cal_date))+'0101'),cal_date),
id int identity);
The query I'm using to produce my result sets:
;WITH x
AS (SELECT rx.pat_id,
           c.cal_date,
           Count(DISTINCT rx.drug_name) AS distinctDrugs
    FROM rx,
         calendar AS c
    WHERE c.cal_date BETWEEN rx.fill_date AND rx.script_end_date
          AND rx.ofinterest = 1
    GROUP BY rx.pat_id,
             c.cal_date
    -- the example uses HAVING Count(1) = 2, but to illustrate the
    -- non-contiguous intervals, in practice I need the HAVING below
    HAVING Count(*) > 1),
y
AS (SELECT x.pat_id,
           x.cal_date,
           -- c2.id is the row number in the calendar table
           c2.id - Row_number() OVER(partition BY x.pat_id
                                     ORDER BY x.cal_date) AS grp_nbr,
           distinctdrugs
    FROM x,
         calendar AS c2
    WHERE c2.cal_date = x.cal_date)
SELECT *,
       Rank() OVER(partition BY pat_id, grp_nbr
                   ORDER BY distinctdrugs) AS [ranking]
FROM y
WHERE y.pat_id = 1604012867
      AND distinctdrugs = 2
Besides the fact that I shouldn't have a column in the calendar table named 'id', is there anything egregiously wrong with this approach? I can get the query to show me the distinct intervals for distinctDrugs = x, but it only works for a single integer x, not for anything > 1. By this I mean that I can find the separate intervals where a patient is on two drugs, but only when I use = 2 in the HAVING clause, not > 1. I can't do something like
SELECT pat_id,
Min(cal_date),
Max(cal_date),
distinctdrugs
FROM y
GROUP BY pat_id,
grp_nbr
because this will pick up that second group of non-contiguous dates. Does anyone know of an elegant solution to this problem?

The key to this is a simple observation: if you have a sequence of consecutive dates, then the difference between them and an increasing sequence is constant. The following does this, assuming you are using SQL Server 2005 or greater:
select pat_id, MIN(cal_date), MAX(cal_date), MIN(drug_qty)
from (select t.*,
             cast(cal_date as datetime) - ROW_NUMBER() over (partition by pat_id, drug_qty
                                                             order by cal_date) as grouping
      from test t
     ) t
group by pat_id, grouping
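To see why this works, here is the grouping expression for the sample rows (pat_id = 1, drug_qty = 2):

cal_date     row_number   cal_date - row_number
2007-01-08    1           2007-01-07
2007-01-09    2           2007-01-07   -- same constant: same island
...
2007-01-15    8           2007-01-07
2007-06-01    9           2007-05-23   -- constant jumps: new island
2007-06-02   10           2007-05-23
2007-06-03   11           2007-05-23

Grouping by pat_id and this constant therefore returns one row per contiguous interval: 2007-01-08 to 2007-01-15 and 2007-06-01 to 2007-06-03.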


How to efficiently compute in step 1 differences between columns and in step 2 aggregate those differences?

This is a follow-up to the question Is there a way to use functions that can take multiple columns as input (e.g. `GREATEST`) without typing out every column name?, where I asked only about the second part of my problem. The feedback was that the data model is most likely not appropriate.
I have been thinking again about the data model, but I still have trouble figuring out a good way to do what I want.
The complete problem is as follows:
I got time series data for multiple technical devices with columns like energy_consumption and voltage.
Furthermore, I have columns with sensitivities towards multiple external factors for each device, which I just added as additional columns (denoted with the cc_ prefix in the example).
There are queries where I want to operate on the raw sensitivities. However, there are also queries for which I first need to take some differences, such as cc_a - cc_b and cc_b - cc_c, and then compute the max of those differences. The combinations for which the differences are to be computed are a predefined subset (around 30) of all possible combinations. The set of combinations that is of interest might change in the future, so that different combination sets have to be applied to different time intervals (e.g. from 2022-01-01 to 2024-12-31 take combination set A and from 2025-01-01 onwards take combination set B). However, it is very unlikely that the combinations change very often.
Here is an example of how I am doing it at the moment:
CREATE TEMP TABLE foo (device_id int, voltage int, energy_consumption int, cc_a int, cc_b int, cc_c int);
INSERT INTO foo VALUES (3, 12, 5, 1, 2, 3), (4, 6, 3, 15, 4, 100);
WITH diff_table AS (
SELECT
device_id,
(cc_a - cc_b) as diff_ab,
(cc_a - cc_c) as diff_ac,
(cc_b - cc_c) as diff_bc
FROM foo
)
SELECT
device_id,
GREATEST(diff_ab, diff_ac, diff_bc) as max_cc
FROM diff_table
Since I have more than 100 sensitivities and also differences, I am looking for a way to do this efficiently, both computationally and in terms of typical query length.
What would be a good data model to perform such operations?
The solution I give below assumes that all pairings are considered and that you don't need to know where the extremes are reached.
CREATE TABLE sources (
source_id int
,source_name varchar(10)
,PRIMARY KEY(source_id))
CREATE TABLE foo_values(
device_id int not null --device_id for "foo"
,source_id int -- you may change that with a foreign key
,value int
,CONSTRAINT fk_source_id
FOREIGN KEY(source_id )
REFERENCES sources(source_id ) )
With the example set you gave:
INSERT INTO sources ( source_id, source_name ) VALUES
(1,'cc_a')
,(2,'cc_b')
,(3,'cc_c')
-- and so on
INSERT INTO foo_values (device_id, source_id, value) VALUES
(3,1,1),(3,2,2),(3,3,3)
,(4,1,15),(4,2,4),(4,3,100)
Done this way, the query becomes:
SELECT device_id
, MAX(value)-MIN(value) as greatest_diff
FROM foo_values
group by device_id
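With the sample values, this yields:

device_id   greatest_diff
3           2
4           96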
Bonus: with such a schema, you can even tell where the maximum and minimum are reached:
WITH ranked as (
SELECT
f.device_id
,f.value
,f.source_id
,RANK() OVER (PARTITION BY f.device_id ORDER BY f.value ) as low_first
,RANK() OVER (PARTITION BY f.device_id ORDER BY f.value DESC) as high_first
FROM foo_values as f)
SELECT h.device_id
, hs.source_name as source_high
, ls.source_name as source_low
, h.value as value_high
, l.value as value_low
, h.value - l.value as greatest_diff
FROM ranked l
INNER JOIN ranked h
on l.device_id = h.device_id
INNER JOIN sources ls
on ls.source_id = l.source_id
INNER JOIN sources hs
on hs.source_id = h.source_id
WHERE l.low_first =1 AND h.high_first = 1
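For the sample values this reports:

device_id   source_high   source_low   value_high   value_low   greatest_diff
3           cc_c          cc_a         3            1           2
4           cc_c          cc_b         100          4           96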
Here is a fiddle for this solution.
EDIT: since you need to control the pairings, you must add a table that lists them:
CREATE TABLE high_low_source
(high_source_id int
,low_source_id int
, constraint fk_low
FOREIGN KEY(low_source_id )
REFERENCES sources(source_id )
,constraint fk_high
FOREIGN KEY(high_source_id )
REFERENCES sources(source_id )
);
INSERT INTO high_low_source VALUES
(1,2),(1,3),(2,3)
The query looking for the greatest difference becomes:
SELECT h.device_id
, hs.source_name as source_high
, ls.source_name as source_low
, h.value as value_high
, l.value as value_low
, h.value - l.value as my_diff
, RANK() OVER (PARTITION BY h.device_id ORDER BY (h.value - l.value) DESC) as greatest_first
FROM foo_values l
INNER JOIN foo_values h
on l.device_id = h.device_id
INNER JOIN high_low_source hl
on hl.low_source_id = l.source_id
AND hl.high_source_id = h.source_id
INNER JOIN sources ls
on ls.source_id = l.source_id
INNER JOIN sources hs
on hs.source_id = h.source_id
ORDER BY device_id, greatest_first
With the values you have listed, there is a tie for device 3.
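Concretely, the ranked differences for the sample values are:

device_id   source_high   source_low   value_high   value_low   my_diff   greatest_first
3           cc_a          cc_b         1            2           -1        1
3           cc_b          cc_c         2            3           -1        1
3           cc_a          cc_c         1            3           -2        3
4           cc_a          cc_b         15           4           11        1
4           cc_a          cc_c         15           100         -85       2
4           cc_b          cc_c         4            100         -96       3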
Extended fiddle

Identify Consecutive Chunks in SQL Server Table

I have this table:
ValueId bigint // (identity) item ID
ListId bigint // group ID
ValueDelta int // item value
ValueCreated datetime2 // item created
What I need is to find consecutive Values within the same Group ordered by Created, not ID. Created and ID are not guaranteed to be in the same order.
So the output should be:
ListID bigint
FirstId bigint // from this ID (first in LID with Value ordered by Date)
LastId bigint // to this ID (last in LID with Value ordered by Date)
ValueDelta int // all share this value
ValueCount // and this many occurrences (number of items between FirstId and LastId)
I can do this with Cursors but I'm sure that's not the best idea so I'm wondering if this can be done in a query.
Please, for the answer (if any), explain it a bit.
UPDATE: SQLfiddle basic data set
It does look like a gaps-and-islands problem.
Here is one way to do it. It would likely work faster than your variant.
The standard idea for gaps-and-islands is to generate two sets of row numbers partitioning them in two ways. The difference between such row numbers (rn1-rn2) would remain the same within each consecutive chunk. Run the query below CTE-by-CTE and examine intermediate results to see what is going on.
WITH
CTE_RN
AS
(
SELECT
[ValueId]
,[ListId]
,[ValueDelta]
,[ValueCreated]
,ROW_NUMBER() OVER (PARTITION BY ListID ORDER BY ValueCreated) AS rn1
,ROW_NUMBER() OVER (PARTITION BY ListID, [ValueDelta] ORDER BY ValueCreated) AS rn2
FROM [Value]
)
SELECT
ListID
,MIN(ValueID) AS FirstID
,MAX(ValueID) AS LastID
,MIN(ValueCreated) AS FirstCreated
,MAX(ValueCreated) AS LastCreated
,ValueDelta
,COUNT(*) AS ValueCount
FROM CTE_RN
GROUP BY
ListID
,ValueDelta
,rn1-rn2
ORDER BY
FirstCreated
;
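To see what the GROUP BY keys look like, here is the intermediate result of CTE_RN for ListId 1, using the sample data from the question's fiddle (reproduced in full further down):

ValueId   ValueDelta   ValueCreated   rn1   rn2   rn1-rn2
1         1            01:01:01       1     1     0
7         1            01:01:02       2     2     0
2         0            01:02:01       3     1     2
3         0            01:03:01       4     2     2
4         0            01:04:01       5     3     2
5         -1           01:05:01       6     1     5
6         -1           01:06:01       7     2     5
8         1            01:08:01       8     3     5

Each consecutive chunk gets its own (ValueDelta, rn1-rn2) combination (note that ValueId 8 does not collapse into the -1 chunk because the delta differs), so the GROUP BY returns exactly one row per chunk.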
This query produces the same result as yours on your sample data set.
It is not quite clear whether FirstID and LastID can be MIN and MAX, or they indeed must be from the first and last rows (when ordered by ValueCreated). If you need really first and last, the query would become a bit more complicated.
In your original sample data set the "first" and "min" for the FirstID are the same. Let's change the sample data set a little to highlight this difference:
insert into [Value]
([ListId], [ValueDelta], [ValueCreated])
values
(1, 1, '2019-01-01 01:01:02'), -- 1.1
(1, 0, '2019-01-01 01:02:01'), -- 2.1
(1, 0, '2019-01-01 01:03:01'), -- 2.2
(1, 0, '2019-01-01 01:04:01'), -- 2.3
(1, -1, '2019-01-01 01:05:01'), -- 3.1
(1, -1, '2019-01-01 01:06:01'), -- 3.2
(1, 1, '2019-01-01 01:01:01'), -- 1.2
(1, 1, '2019-01-01 01:08:01'), -- 4.2
(2, 1, '2019-01-01 01:08:01') -- 5.1
;
All I did was swap the ValueCreated between the first and seventh rows, so now the FirstID of the first group is 7 and the LastID is 1. Your query returns the correct result; my simple query above doesn't.
Here is a variant that produces the correct result. I decided to use the FIRST_VALUE and LAST_VALUE functions to get the appropriate IDs. Note the explicit ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING frame: without it, LAST_VALUE would use the default frame, which ends at the current row, and would typically just return the current row's ValueID. Again, run the query CTE-by-CTE and examine intermediate results to see what is going on.
This variant produces the same result as your query even with the adjusted sample data set.
WITH
CTE_RN
AS
(
SELECT
[ValueId]
,[ListId]
,[ValueDelta]
,[ValueCreated]
,ROW_NUMBER() OVER (PARTITION BY ListID ORDER BY ValueCreated) AS rn1
,ROW_NUMBER() OVER (PARTITION BY ListID, ValueDelta ORDER BY ValueCreated) AS rn2
FROM [Value]
)
,CTE2
AS
(
SELECT
ValueId
,ListId
,ValueDelta
,ValueCreated
,rn1
,rn2
,rn1-rn2 AS Diff
,FIRST_VALUE(ValueID) OVER(
PARTITION BY ListID, ValueDelta, rn1-rn2 ORDER BY ValueCreated
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS FirstID
,LAST_VALUE(ValueID) OVER(
PARTITION BY ListID, ValueDelta, rn1-rn2 ORDER BY ValueCreated
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS LastID
FROM CTE_RN
)
SELECT
ListID
,FirstID
,LastID
,MIN(ValueCreated) AS FirstCreated
,MAX(ValueCreated) AS LastCreated
,ValueDelta
,COUNT(*) AS ValueCount
FROM CTE2
GROUP BY
ListID
,ValueDelta
,rn1-rn2
,FirstID
,LastID
ORDER BY FirstCreated;
Use a CTE that adds a Row_Number column, partitioned by GroupId and Value and ordered by Created.
Then select from the CTE, grouping by GroupId and Value; use COUNT(*) to get the count, and use correlated subqueries to select the ValueId with the MIN(RowNumber) (which will always be 1, so you can just use that instead of MIN) and with the MAX(RowNumber) to get FirstId and LastId.
Although, now that I've noticed you're using SQL Server 2017, you should be able to use FIRST_VALUE() and LAST_VALUE() instead of correlated subqueries.
After many iterations I think I have a working solution. I'm absolutely sure it's far from optimal, but it works.
Link is here: http://sqlfiddle.com/#!18/4ee9f/3
Sample data:
create table [Value]
(
[ValueId] bigint not null identity(1,1),
[ListId] bigint not null,
[ValueDelta] int not null,
[ValueCreated] datetime2 not null,
constraint [PK_Value] primary key clustered ([ValueId])
);
insert into [Value]
([ListId], [ValueDelta], [ValueCreated])
values
(1, 1, '2019-01-01 01:01:01'), -- 1.1
(1, 0, '2019-01-01 01:02:01'), -- 2.1
(1, 0, '2019-01-01 01:03:01'), -- 2.2
(1, 0, '2019-01-01 01:04:01'), -- 2.3
(1, -1, '2019-01-01 01:05:01'), -- 3.1
(1, -1, '2019-01-01 01:06:01'), -- 3.2
(1, 1, '2019-01-01 01:01:02'), -- 1.2
(1, 1, '2019-01-01 01:08:01'), -- 4.2
(2, 1, '2019-01-01 01:08:01') -- 5.1
The Query that seems to work:
-- this is the actual order of data
select *
from [Value]
order by [ListId] asc, [ValueCreated] asc;
-- there are 4 sets here
-- set 1 GroupId=1, Id=1&7, Value=1
-- set 2 GroupId=1, Id=2-4, Value=0
-- set 3 GroupId=1, Id=5-6, Value=-1
-- set 4 GroupId=1, Id=8-8, Value=1
-- set 5 GroupId=2, Id=9-9, Value=1
with [cte1] as
(
select [v1].[ListId]
,[v2].[ValueId] as [FirstId], [v2].[ValueCreated] as [FirstCreated]
,[v1].[ValueId] as [LastId], [v1].[ValueCreated] as [LastCreated]
,isnull([v1].[ValueDelta], 0) as [ValueDelta]
from [dbo].[Value] [v1]
join [dbo].[Value] [v2] on [v2].[ListId] = [v1].[ListId]
and isnull([v2].[ValueDelta], 0) = isnull([v1].[ValueDelta], 0)
and [v2].[ValueCreated] <= [v1].[ValueCreated] and not exists (
select 1
from [dbo].[Value] [v3]
where 1=1
and ([v3].[ListId] = [v1].[ListId])
and ([v3].[ValueCreated] between [v2].[ValueCreated] and [v1].[ValueCreated])
and [v3].[ValueDelta] != [v1].[ValueDelta]
)
), [cte2] as
(
select [t1].*
from [cte1] [t1]
where not exists (select 1 from [cte1] [t2] where [t2].[ListId] = [t1].[ListId]
and ([t1].[FirstId] != [t2].[FirstId] or [t1].[LastId] != [t2].[LastId])
and [t1].[FirstCreated] between [t2].[FirstCreated] and [t2].[LastCreated]
and [t1].[LastCreated] between [t2].[FirstCreated] and [t2].[LastCreated]
)
)
select [ListId], [FirstId], [LastId], [FirstCreated], [LastCreated], [ValueDelta] as [ValueDelta]
,(select count(*) from [dbo].[Value] where [ListId] = [t].[ListId] and [ValueCreated] between [t].[FirstCreated] and [t].[LastCreated]) as [ValueCount]
from [cte2] [t];
How it works:
join the table to itself on the same list, but only on older (or equal, to handle single-row sets) dates
join again on itself and exclude any overlaps, keeping only the largest date set
once we have identified the largest sets, count the entries between the set dates
If anyone can find a better / friendlier solution, you get the answer.
PS: The dumb straightforward Cursor approach seems a lot faster than this. Still testing.

Transform column into rows in SQL Server table

I have a query in which I want some columns to appear as rows.
The query is
Select *
From Emp_mon_day
Where emp_mkey IN (select emp_card_no
from emp_mst
where comp_mkey in (7, 110))
and Year = 2016 and month = 2
and Emp_mkey = 2492
with this output being returned:
Now, I need to show the columns Day1, Day2, Day3 as rows in the output of the above query.
How to achieve that?
You can use an UNPIVOT query like the one below:
Select
comp_mkey,
fmodule_id,
fdepartment_id,
branch_mkey,
entry_department,
dept_mkey,
mkey,
emp_mkey,
entry_date,
month,
year,
day,
data
from
(select * from Emp_mon_day where emp_mkey IN
(select emp_card_no from emp_mst where comp_mkey in
(7,110)) and Year = 2016 and month = 2
and Emp_mkey = 2492) s
unpivot
(
data for day in ([Day1],[Day2]) -- dynamic query can generate all days data
)up
Below is a sample test script and its output
create table t(comp_mkey int,mont int,yea int,day1 varchar(10),day2 varchar(10))
insert into t values (2,2,2016,'AB','AC')
Select
comp_mkey,
mont,
yea,
day,
data
from
(select * from t) s
unpivot
(
data for day in ([Day1],[Day2]) -- dynamic query can generate all days data
)up
drop table t
Output
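For the single test row inserted above, the unpivoted result is:

comp_mkey   mont   yea    day    data
2           2      2016   Day1   AB
2           2      2016   Day2   AC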
If you need every day's data, you can either type out all the expected columns in this statement
data for day in ([Day1],[Day2],[Day3],[Day4])
or, better, convert this into a dynamic query and apply logic for the number of days expected in a month, as sketched below.
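A minimal sketch of that dynamic approach against the test table t above (the dbo.t name and the LIKE 'day%' filter are assumptions; adapt them to Emp_mon_day and its real column naming):

DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- build the "[Day1],[Day2],..." list from the table's metadata
SELECT @cols = STUFF((
    SELECT ',' + QUOTENAME(c.name)
    FROM sys.columns c
    WHERE c.object_id = OBJECT_ID('dbo.t')   -- assumed target table
      AND c.name LIKE 'day%'                 -- assumed column naming pattern
    ORDER BY c.column_id
    FOR XML PATH('')), 1, 1, '');

SET @sql = N'SELECT comp_mkey, mont, yea, [day], data
FROM dbo.t
UNPIVOT (data FOR [day] IN (' + @cols + ')) up;';

EXEC sys.sp_executesql @sql;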
CROSS APPLY can prove helpful here.
CREATE TABLE [dbo].[emp]
(
[comp_mkey] [INT] NULL,
[mont] [INT] NULL,
[year] [INT] NULL,
[day1] [VARCHAR](10) NULL,
[day2] [VARCHAR](10) NULL
)
INSERT INTO emp VALUES (2, 2, 2016, 'AB', 'AC')
Use the following SELECT statement:
SELECT emp.comp_mkey
, emp.mont
, emp.year
, emp_ext.[Day]
, emp_ext.Value
FROM emp
CROSS APPLY
(
VALUES('Day1', emp.day1), ('Day2', emp.day2)
)emp_ext([Day], Value)
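For the sample row this produces the same two rows as the UNPIVOT version:

comp_mkey   mont   year   Day    Value
2           2      2016   Day1   AB
2           2      2016   Day2   AC

One practical difference: the VALUES/CROSS APPLY form keeps entries whose day column is NULL, while UNPIVOT silently drops them.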

Drop rows identified within moving time window

I have a dataset of hospitalisations ('spells') - 1 row per spell. I want to drop any spells recorded within a week after another (there could be multiple) - the rationale being that they're likely symptomatic of the same underlying cause. Here is some play data:
create table hif_user.rzb_recurse_src (
patid integer not null,
eventdate integer not null,
type smallint not null
);
insert into hif_user.rzb_recurse_src values (1,1,1);
insert into hif_user.rzb_recurse_src values (1,3,2);
insert into hif_user.rzb_recurse_src values (1,5,2);
insert into hif_user.rzb_recurse_src values (1,9,2);
insert into hif_user.rzb_recurse_src values (1,14,2);
insert into hif_user.rzb_recurse_src values (2,1,1);
insert into hif_user.rzb_recurse_src values (2,5,1);
insert into hif_user.rzb_recurse_src values (2,19,2);
Only spells of type 2 - within a week after any other - are to be dropped. Type 1 spells are to remain.
For patient 1, dates 1 & 9 should be kept. For patient 2, all rows should remain.
The issue is with patient 1. Spell date 9 is identified for dropping because it is close to spell date 5; however, as spell date 5 is close to spell date 1, it should itself be dropped, therefore allowing spell date 9 to live...
So it seems to be a recursive problem. However, I've not used recursive programming in SQL before and I'm struggling to really picture how to do it. Can anyone help? I should add that I'm using Teradata, which has more restrictions than most with recursive SQL (only UNION ALL sets allowed, I believe).
This is cursor logic: check one row after the other to see whether it fits your rules. Recursion is the easiest (maybe the only) way to solve your problem in SQL.
To get decent performance you need a Volatile Table to facilitate this row-by-row processing:
CREATE VOLATILE TABLE vt (patid, eventdate, exac_type, rn) AS
(
SELECT r.*
,ROW_NUMBER() -- needed to facilitate the join
OVER (PARTITION BY patid ORDER BY eventdate) AS rn
FROM hif_user.rzb_recurse_src AS r
) WITH DATA ON COMMIT PRESERVE ROWS;
WITH RECURSIVE cte (patid, eventdate, exac_type, rn, startdate) AS
(
SELECT vt.*
,eventdate AS startdate
FROM vt
WHERE rn = 1 -- start with the first row
UNION ALL
SELECT vt.*
-- check if type = 1 or more than 7 days from the last eventdate
,CASE WHEN vt.eventdate > cte.startdate + 7
OR vt.exac_type = 1
THEN vt.eventdate -- new start date
ELSE cte.startdate -- keep old date
END
FROM vt JOIN cte
ON vt.patid = cte.patid
AND vt.rn = cte.rn + 1 -- proceed to next row
)
SELECT *
FROM cte
WHERE eventdate - startdate = 0 -- only new start days
order by patid, eventdate
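Running this against the play data returns exactly the rows that survive the rule:

patid   eventdate   exac_type   rn   startdate
1       1           1           1    1
1       9           2           4    9
2       1           1           1    1
2       5           1           2    5
2       19          2           3    19

i.e. patient 1 keeps dates 1 and 9, and patient 2 keeps all rows, as required.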
I think the key to solving this is getting the first date more than 7 days from the current date and then doing a recursive subquery:
with recursive rrs as (
      select r.*,
             (select min(r2.eventdate)
              from hif_user.rzb_recurse_src r2
              where r2.patid = r.patid and
                    r2.eventdate > r.eventdate + 7
             ) as eventdate7
      from hif_user.rzb_recurse_src r
     ),
     cte as (
      select patid, min(eventdate) as eventdate, min(eventdate7) as eventdate7
      from rrs
      group by patid
      union all
      select cte.patid, cte.eventdate7, rrs.eventdate7
      from cte join
           rrs
           on rrs.patid = cte.patid and
              rrs.eventdate = cte.eventdate7
     )
select cte.patid, cte.eventdate
from cte;
If you want additional columns, then join in the original table at the last step.

SQL Server: row present in one query, missing in another

Ok so I think I must be misunderstanding something about SQL queries. This is a pretty wordy question, so thanks for taking the time to read it (my problem is right at the end, everything else is just context).
I am writing an accounting system that works on the double-entry principle: money always moves between accounts, and a transaction is 2 or more TransactionParts rows decrementing one account and incrementing another.
Some TransactionParts rows may be flagged as tax related so that the system can produce a report of total VAT sales/purchases etc, so it is possible that a single Transaction may have two TransactionParts referencing the same Account -- one VAT related, and the other not. To simplify presentation to the user, I have a view to combine multiple rows for the same account and transaction:
create view Accounting.CondensedEntryView as
select p.[Transaction], p.Account, sum(p.Amount) as Amount
from Accounting.TransactionParts p
group by p.[Transaction], p.Account
I then have a view to calculate the running balance column, as follows:
create view Accounting.TransactionBalanceView as
with cte as
(
select ROW_NUMBER() over (order by t.[Date]) AS RowNumber,
t.ID as [Transaction], p.Amount, p.Account
from Accounting.Transactions t
inner join Accounting.CondensedEntryView p on p.[Transaction]=t.ID
)
select b.RowNumber, b.[Transaction], a.Account,
coalesce(sum(a.Amount), 0) as Balance
from cte a, cte b
where a.RowNumber <= b.RowNumber AND a.Account=b.Account
group by b.RowNumber, b.[Transaction], a.Account
For reasons I haven't yet worked out, a certain transaction (ID=30) doesn't appear on an account statement for the user. I confirmed this by running
select * from Accounting.TransactionBalanceView where [Transaction]=30
This gave me the following result:
RowNumber Transaction Account Balance
-------------------- ----------- ------- ---------------------
72 30 23 143.80
As I said before, there should be at least two TransactionParts for each Transaction, so one of them isn't being presented in my view. I assumed there must be an issue with the way I've written my view, and ran a query to see if there's anything else missing:
select [Transaction], count(*)
from Accounting.TransactionBalanceView
group by [Transaction]
having count(*) < 2
This query returns no results -- not even for Transaction 30! Thinking I must be an idiot, I ran the following query:
select [Transaction]
from Accounting.TransactionBalanceView
where [Transaction]=30
It returns two rows! So select * returns only one row and select [Transaction] returns both. After much head-scratching and re-running the last two queries, I concluded I don't have the faintest idea what's happening. Any ideas?
Thanks a lot if you've stuck with me this far!
Edit:
Here are the execution plans:
select *
select [Transaction]
1000 lines each, hence finding somewhere else to host.
Edit 2:
For completeness, here are the tables I used:
create table Accounting.Accounts
(
ID smallint identity primary key,
[Name] varchar(50) not null
constraint UQ_AccountName unique,
[Type] tinyint not null
constraint FK_AccountType foreign key references Accounting.AccountTypes
);
create table Accounting.Transactions
(
ID int identity primary key,
[Date] date not null default getdate(),
[Description] varchar(50) not null,
Reference varchar(20) not null default '',
Memo varchar(1000) not null
);
create table Accounting.TransactionParts
(
ID int identity primary key,
[Transaction] int not null
constraint FK_TransactionPart foreign key references Accounting.Transactions,
Account smallint not null
constraint FK_TransactionAccount foreign key references Accounting.Accounts,
Amount money not null,
VatRelated bit not null default 0
);
Demonstration of possible explanation.
Create table Script
SELECT *
INTO #T
FROM master.dbo.spt_values
CREATE NONCLUSTERED INDEX [IX_T] ON #T ([name] DESC,[number] DESC);
Query one (Returns 35 results)
WITH cte AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY NAME) AS rn
FROM #T
)
SELECT c1.number,c1.[type]
FROM cte c1
JOIN cte c2 ON c1.rn=c2.rn AND c1.number <> c2.number
Query Two (Same as before but adding c2.[type] to the select list makes it return 0 results)
;
WITH cte AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY NAME) AS rn
FROM #T
)
SELECT c1.number,c1.[type] ,c2.[type]
FROM cte c1
JOIN cte c2 ON c1.rn=c2.rn AND c1.number <> c2.number
Why?
The order of row_number() for duplicate NAMEs isn't specified, so it just chooses whichever order fits in with the best execution plan for the required output columns. In the second query this is the same for both CTE invocations; in the first, it chooses a different access path, with resultant different row numbering.
Suggested Solution
You are self-joining the CTE on ROW_NUMBER() over (order by t.[Date]).
Contrary to what may have been expected, the CTE will likely not be materialised (which would have ensured consistency for the self join), and thus you assume a correlation between ROW_NUMBER() on both sides that may well not exist for records where a duplicate [Date] exists in the data.
What if you try ROW_NUMBER() over (order by t.[Date], t.ID) to ensure that, in the event of tied dates, the row numbering is in a guaranteed consistent order? (Or some other column/combination of columns that can differentiate records, if ID won't do it.)
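A minimal sketch of that change inside the view's CTE (the rest of the view stays the same; t.ID is the identity primary key from the table definitions above, so it breaks ties deterministically):

with cte as
(
select ROW_NUMBER() over (order by t.[Date], t.ID) AS RowNumber,
t.ID as [Transaction], p.Amount, p.Account
from Accounting.Transactions t
inner join Accounting.CondensedEntryView p on p.[Transaction]=t.ID
)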
If the purpose of this part of the view is just to make sure that the same row isn't joined to itself
where a.RowNumber <= b.RowNumber
then how does changing this part to
where a.RowNumber <> b.RowNumber
affect the results?
It seems you are reading dirty entries (someone else deletes/inserts new data).
Try SET TRANSACTION ISOLATION LEVEL READ COMMITTED.
I've tried this code (it seems equivalent to yours)
IF object_id('tempdb..#t') IS NOT NULL DROP TABLE #t
CREATE TABLE #t(i INT, val INT, acc int)
INSERT #t
SELECT 1, 2, 70
UNION ALL SELECT 2, 3, 70
;with cte as
(
select ROW_NUMBER() over (order by t.i) AS RowNumber,
t.val as [Transaction], t.acc Account
from #t t
)
select b.RowNumber, b.[Transaction], a.Account
from cte a, cte b
where a.RowNumber <= b.RowNumber AND a.Account=b.Account
group by b.RowNumber, b.[Transaction], a.Account
and got two rows
RowNumber Transaction Account
1 2 70
2 3 70