Drop rows identified within moving time window - sql

I have a dataset of hospitalisations ('spells') - 1 row per spell. I want to drop any spells recorded within a week after another (there could be multiple) - the rationale being that they're likely symptomatic of the same underlying cause. Here is some play data:
create table hif_user.rzb_recurse_src (
patid integer not null,
eventdate integer not null,
type smallint not null
);
insert into hif_user.rzb_recurse_src values (1,1,1);
insert into hif_user.rzb_recurse_src values (1,3,2);
insert into hif_user.rzb_recurse_src values (1,5,2);
insert into hif_user.rzb_recurse_src values (1,9,2);
insert into hif_user.rzb_recurse_src values (1,14,2);
insert into hif_user.rzb_recurse_src values (2,1,1);
insert into hif_user.rzb_recurse_src values (2,5,1);
insert into hif_user.rzb_recurse_src values (2,19,2);
Only spells of type 2 - within a week after any other - are to be dropped. Type 1 spells are to remain.
For patient 1, dates 1 & 9 should be kept. For patient 2, all rows should remain.
The issue is with patient 1. Spell date 9 is identified for dropping as it is close to spell date 5; however, as spell date 5 is close to spell date 1, it should itself be dropped, thereby allowing spell date 9 to live...
So, it seems a recursive problem. However, I've not used recursive programming in SQL before and I'm struggling to really picture how to do it. Can anyone help? I should add that I'm using Teradata which has more restrictions than most with recursive SQL (only UNION ALL sets allowed I believe).
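Before reaching for recursive SQL, the rule can be pinned down as a sequential scan over each patient's spells. A rough Python sketch of the intended behaviour (a hypothetical helper, included only to make the requirement concrete; it matches the expected results above):

```python
def keep_spells(rows):
    """rows: list of (patid, eventdate, type), pre-sorted by patid, eventdate.
    Keep a row if it is type 1, or if it starts more than 7 days after the
    last KEPT row for the same patient; otherwise drop it."""
    kept, last_start = [], {}
    for patid, eventdate, typ in rows:
        start = last_start.get(patid)
        if start is None or typ == 1 or eventdate > start + 7:
            kept.append((patid, eventdate, typ))
            last_start[patid] = eventdate  # kept row resets the 7-day window
    return kept

rows = [(1, 1, 1), (1, 3, 2), (1, 5, 2), (1, 9, 2), (1, 14, 2),
        (2, 1, 1), (2, 5, 1), (2, 19, 2)]
print(keep_spells(rows))
# [(1, 1, 1), (1, 9, 2), (2, 1, 1), (2, 5, 1), (2, 19, 2)]
```

Note that dropped rows do not push the window forward, which is exactly why a plain self-join fails and the problem looks recursive.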

This is cursor logic: check one row after the other to see whether it fits your rules, so recursion is the easiest (maybe the only) way to solve your problem.
To get a decent performance you need a Volatile Table to facilitate this row-by-row processing:
CREATE VOLATILE TABLE vt (patid, eventdate, exac_type, rn) AS
(
SELECT r.*
,ROW_NUMBER() -- needed to facilitate the join
OVER (PARTITION BY patid ORDER BY eventdate) AS rn
FROM hif_user.rzb_recurse_src AS r
) WITH DATA ON COMMIT PRESERVE ROWS;
WITH RECURSIVE cte (patid, eventdate, exac_type, rn, startdate) AS
(
SELECT vt.*
,eventdate AS startdate
FROM vt
WHERE rn = 1 -- start with the first row
UNION ALL
SELECT vt.*
-- check if type = 1 or more than 7 days from the last eventdate
,CASE WHEN vt.eventdate > cte.startdate + 7
OR vt.exac_type = 1
THEN vt.eventdate -- new start date
ELSE cte.startdate -- keep old date
END
FROM vt JOIN cte
ON vt.patid = cte.patid
AND vt.rn = cte.rn + 1 -- proceed to next row
)
SELECT *
FROM cte
WHERE eventdate - startdate = 0 -- only new start days
order by patid, eventdate
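For readers without a Teradata system, the same recursion can be exercised in SQLite (which supports recursive CTEs), with the ROW_NUMBER step precomputed in Python. This is a rough sketch for experimentation, not Teradata-tested:

```python
import sqlite3
from itertools import groupby

rows = [(1, 1, 1), (1, 3, 2), (1, 5, 2), (1, 9, 2), (1, 14, 2),
        (2, 1, 1), (2, 5, 1), (2, 19, 2)]
# Precompute rn per patient (stand-in for the ROW_NUMBER step above).
numbered = [(p, d, t, i + 1)
            for p, grp in groupby(sorted(rows), key=lambda r: r[0])
            for i, (_, d, t) in enumerate(grp)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE vt (patid INT, eventdate INT, exac_type INT, rn INT)")
con.executemany("INSERT INTO vt VALUES (?, ?, ?, ?)", numbered)

result = con.execute("""
    WITH RECURSIVE cte (patid, eventdate, exac_type, rn, startdate) AS (
        SELECT patid, eventdate, exac_type, rn, eventdate FROM vt WHERE rn = 1
        UNION ALL
        SELECT vt.patid, vt.eventdate, vt.exac_type, vt.rn,
               CASE WHEN vt.eventdate > cte.startdate + 7 OR vt.exac_type = 1
                    THEN vt.eventdate ELSE cte.startdate END
        FROM vt JOIN cte ON vt.patid = cte.patid AND vt.rn = cte.rn + 1
    )
    SELECT patid, eventdate FROM cte
    WHERE eventdate = startdate
    ORDER BY patid, eventdate""").fetchall()
print(result)  # [(1, 1), (1, 9), (2, 1), (2, 5), (2, 19)]
```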

I think the key to solving this is getting the first date more than 7 days from the current date and then doing a recursive subquery:
with recursive rrs as (
select s.*,
(select min(s2.eventdate)
from hif_user.rzb_recurse_src s2
where s2.patid = s.patid and
s2.eventdate > s.eventdate + 7
) as eventdate7
from hif_user.rzb_recurse_src s
),
cte as (
select patid, min(eventdate) as eventdate, min(eventdate7) as eventdate7
from rrs
group by patid
union all
select cte.patid, cte.eventdate7, rrs.eventdate7
from cte join
rrs
on rrs.patid = cte.patid and
rrs.eventdate = cte.eventdate7
)
select cte.patid, cte.eventdate
from cte;
If you want additional columns, then join in the original table at the last step.
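The jump logic on its own can be sketched in ordinary code. Note this models only the date-jumping part for a single patient, not the type 1 exemption (illustrative only):

```python
def keep_by_jumps(dates, gap=7):
    """For one patient: keep the earliest date, then repeatedly jump to the
    first date more than `gap` days after the last kept one."""
    dates = sorted(dates)
    kept, cur = [], dates[0]
    while cur is not None:
        kept.append(cur)
        # first date strictly more than `gap` days after the current one
        cur = next((d for d in dates if d > cur + gap), None)
    return kept

print(keep_by_jumps([1, 3, 5, 9, 14]))  # [1, 9]
```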


Count length of consecutive duplicate values for each id

I have a table as shown in the screenshot (first two columns) and I need to create a column like the last one. I'm trying to calculate the length of each sequence of consecutive values for each id.
For this, the last column is required. I played around with
row_number() over (partition by id, value)
but did not have much success, since the circled number was (quite predictably) computed as 2 instead of 1.
Please help!
First of all, we need a way to define how the rows are ordered. For example, in your sample data there is no way to be sure that the 'first' row (1, 1) will always be displayed before the 'second' row (1, 0).
That's why in my sample data I have added an identity column. In your real case, the rows might be ordered by a row ID, a date column or something else, but you need to ensure the rows can be sorted by unique criteria.
So, the task is pretty simple:
calculate a trigger switch (whenever the value changes)
calculate groups (a running sum of the switch)
calculate row numbers within each group
That's it. I have used common table expressions and left all columns in so the logic is easy to understand. You are free to break this into separate statements and remove some of the columns.
DECLARE @DataSource TABLE
(
[RowID] INT IDENTITY(1, 1)
,[ID] INT
,[value] INT
);
INSERT INTO @DataSource ([ID], [value])
VALUES (1, 1)
,(1, 0)
,(1, 0)
,(1, 1)
,(1, 1)
,(1, 1)
--
,(2, 0)
,(2, 1)
,(2, 0)
,(2, 0);
WITH DataSourceWithSwitch AS
(
SELECT *
,IIF(LAG([value]) OVER (PARTITION BY [ID] ORDER BY [RowID]) = [value], 0, 1) AS [Switch]
FROM @DataSource
), DataSourceWithGroup AS
(
SELECT *
,SUM([Switch]) OVER (PARTITION BY [ID] ORDER BY [RowID]) AS [Group]
FROM DataSourceWithSwitch
)
SELECT *
,ROW_NUMBER() OVER (PARTITION BY [ID], [Group] ORDER BY [RowID]) AS [GroupRowID]
FROM DataSourceWithGroup
ORDER BY [RowID];
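The three steps can be mirrored in plain code to verify the expected last column. A rough Python sketch of the same Switch -> Group -> GroupRowID pipeline (illustrative only; pairs are assumed to arrive in RowID order):

```python
def group_row_ids(pairs):
    """pairs: list of (id, value) in source (RowID) order.
    Start a new group whenever the id or the value changes,
    then number the rows inside each group."""
    out, prev, n = [], None, 0
    for ident, value in pairs:
        if prev != (ident, value):
            n = 0                  # switch fired: new group starts
        n += 1
        out.append(n)
        prev = (ident, value)
    return out

data = [(1, 1), (1, 0), (1, 0), (1, 1), (1, 1), (1, 1),
        (2, 0), (2, 1), (2, 0), (2, 0)]
print(group_row_ids(data))  # [1, 1, 2, 1, 2, 3, 1, 1, 1, 2]
```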
You want results that depend on the actual data ordering in the data source. In SQL you operate on relations, sometimes on ordered sets of relation rows. Your desired end result is not well-defined in terms of SQL unless you introduce an additional column in your source table over which your data is ordered (e.g. an auto-increment or timestamp column).
Note: this answers the original question and doesn't take into account additional timestamp column mentioned in the comment. I'm not updating my answer since there is already an accepted answer.
One way to solve it could be through a recursive CTE:
create table #tmp (i int identity,id int, value int, rn int);
insert into #tmp (id,value) VALUES
(1,1),(1,0),(1,0),(1,1),(1,1),(1,1),
(2,0),(2,1),(2,0),(2,0);
WITH numbered AS (
SELECT i,id,value, 1 seq FROM #tmp WHERE i=1 UNION ALL
SELECT a.i,a.id,a.value, CASE WHEN a.id=b.id AND a.value=b.value THEN b.seq+1 ELSE 1 END
FROM #tmp a INNER JOIN numbered b ON a.i=b.i+1
)
SELECT * FROM numbered -- OPTION (MAXRECURSION 1000)
This will return the following:
i id value seq
1 1 1 1
2 1 0 1
3 1 0 2
4 1 1 1
5 1 1 2
6 1 1 3
7 2 0 1
8 2 1 1
9 2 0 1
10 2 0 2
See my little demo here: https://rextester.com/ZZEIU93657
A prerequisite for the CTE to work is a sequenced table (e.g. a table with an identity column in it) as a source. In my example I introduced the column i for this. As a starting point I need to find the first entry of the source table. In my case this was the entry with i=1.
For a longer source table you might run into a recursion-limit error as the default for MAXRECURSION is 100. In this case you should uncomment the OPTION setting behind my SELECT clause above. You can either set it to a higher value (like shown) or switch it off completely by setting it to 0.
IMHO, this is easier to do with a cursor and a loop.
Maybe there is a way to do the job with a self-join:
declare @t table (id int, val int)
insert into @t (id, val)
select 1 as id, 1 as val
union all select 1, 0
union all select 1, 0
union all select 1, 1
union all select 1, 1
union all select 1, 1
;with cte1 (id, val, num) as
(
select id, val, row_number() over (ORDER BY (SELECT 1)) as num from @t
)
)
, cte2 (id, val, num, N) as
(
select id, val, num, 1 from cte1 where num = 1
union all
select t1.id, t1.val, t1.num,
case when t1.id=t2.id and t1.val=t2.val then t2.N + 1 else 1 end
from cte1 t1 inner join cte2 t2 on t1.num = t2.num + 1 where t1.num > 1
)
select * from cte2

Identify Consecutive Chunks in SQL Server Table

I have this table:
ValueId bigint // (identity) item ID
ListId bigint // group ID
ValueDelta int // item value
ValueCreated datetime2 // item created
What I need is to find consecutive Values within the same Group ordered by Created, not ID. Created and ID are not guaranteed to be in the same order.
So the output should be:
ListID bigint
FirstId bigint // from this ID (first in LID with Value ordered by Date)
LastId bigint // to this ID (last in LID with Value ordered by Date)
ValueDelta int // all share this value
ValueCount // and this many occurrences (number of items between FirstId and LastId)
I can do this with Cursors but I'm sure that's not the best idea so I'm wondering if this can be done in a query.
Please, for the answer (if any), explain it a bit.
UPDATE: SQLfiddle basic data set
It does look like a gaps-and-island problem.
Here is one way to do it. It would likely work faster than your variant.
The standard idea for gaps-and-islands is to generate two sets of row numbers partitioning them in two ways. The difference between such row numbers (rn1-rn2) would remain the same within each consecutive chunk. Run the query below CTE-by-CTE and examine intermediate results to see what is going on.
WITH
CTE_RN
AS
(
SELECT
[ValueId]
,[ListId]
,[ValueDelta]
,[ValueCreated]
,ROW_NUMBER() OVER (PARTITION BY ListID ORDER BY ValueCreated) AS rn1
,ROW_NUMBER() OVER (PARTITION BY ListID, [ValueDelta] ORDER BY ValueCreated) AS rn2
FROM [Value]
)
SELECT
ListID
,MIN(ValueID) AS FirstID
,MAX(ValueID) AS LastID
,MIN(ValueCreated) AS FirstCreated
,MAX(ValueCreated) AS LastCreated
,ValueDelta
,COUNT(*) AS ValueCount
FROM CTE_RN
GROUP BY
ListID
,ValueDelta
,rn1-rn2
ORDER BY
FirstCreated
;
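The rn1-rn2 trick can be sanity-checked outside SQL. A rough Python sketch using the sample rows shown later in this thread (assumed to match the fiddle data); the difference of the two row numbers stays constant within each consecutive run and so can serve as a grouping key:

```python
from collections import defaultdict

def consecutive_chunks(rows):
    """rows: (list_id, value, created). Returns one (list_id, value, count)
    per consecutive run of equal values within a list, in run order."""
    rn1 = defaultdict(int)     # ROW_NUMBER per ListId
    rn2 = defaultdict(int)     # ROW_NUMBER per (ListId, ValueDelta)
    counts, order = defaultdict(int), []
    for lid, val, _ in sorted(rows, key=lambda r: (r[0], r[2])):
        rn1[lid] += 1
        rn2[(lid, val)] += 1
        key = (lid, val, rn1[lid] - rn2[(lid, val)])   # constant per run
        if key not in counts:
            order.append(key)
        counts[key] += 1
    return [(k[0], k[1], counts[k]) for k in order]

rows = [(1, 1, '2019-01-01 01:01:01'), (1, 0, '2019-01-01 01:02:01'),
        (1, 0, '2019-01-01 01:03:01'), (1, 0, '2019-01-01 01:04:01'),
        (1, -1, '2019-01-01 01:05:01'), (1, -1, '2019-01-01 01:06:01'),
        (1, 1, '2019-01-01 01:01:02'), (1, 1, '2019-01-01 01:08:01'),
        (2, 1, '2019-01-01 01:08:01')]
print(consecutive_chunks(rows))
# [(1, 1, 2), (1, 0, 3), (1, -1, 2), (1, 1, 1), (2, 1, 1)]
```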
This query produces the same result as yours on your sample data set.
It is not quite clear whether FirstID and LastID can be MIN and MAX, or they indeed must be from the first and last rows (when ordered by ValueCreated). If you need really first and last, the query would become a bit more complicated.
In your original sample data set the "first" and "min" for the FirstID are the same. Let's change the sample data set a little to highlight this difference:
insert into [Value]
([ListId], [ValueDelta], [ValueCreated])
values
(1, 1, '2019-01-01 01:01:02'), -- 1.1
(1, 0, '2019-01-01 01:02:01'), -- 2.1
(1, 0, '2019-01-01 01:03:01'), -- 2.2
(1, 0, '2019-01-01 01:04:01'), -- 2.3
(1, -1, '2019-01-01 01:05:01'), -- 3.1
(1, -1, '2019-01-01 01:06:01'), -- 3.2
(1, 1, '2019-01-01 01:01:01'), -- 1.2
(1, 1, '2019-01-01 01:08:01'), -- 4.2
(2, 1, '2019-01-01 01:08:01') -- 5.1
;
All I did was swap the ValueCreated between the first and seventh rows, so now the FirstID of the first group is 7 and LastID is 1. Your query returns the correct result. My simple query above doesn't.
Here is the variant that produces correct result. I decided to use FIRST_VALUE and LAST_VALUE functions to get the appropriate IDs. Again, run the query CTE-by-CTE and examine intermediate results to see what is going on.
This variant produces the same result as your query even with the adjusted sample data set.
WITH
CTE_RN
AS
(
SELECT
[ValueId]
,[ListId]
,[ValueDelta]
,[ValueCreated]
,ROW_NUMBER() OVER (PARTITION BY ListID ORDER BY ValueCreated) AS rn1
,ROW_NUMBER() OVER (PARTITION BY ListID, ValueDelta ORDER BY ValueCreated) AS rn2
FROM [Value]
)
,CTE2
AS
(
SELECT
ValueId
,ListId
,ValueDelta
,ValueCreated
,rn1
,rn2
,rn1-rn2 AS Diff
,FIRST_VALUE(ValueID) OVER(
PARTITION BY ListID, ValueDelta, rn1-rn2 ORDER BY ValueCreated
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS FirstID
,LAST_VALUE(ValueID) OVER(
PARTITION BY ListID, ValueDelta, rn1-rn2 ORDER BY ValueCreated
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS LastID
FROM CTE_RN
)
SELECT
ListID
,FirstID
,LastID
,MIN(ValueCreated) AS FirstCreated
,MAX(ValueCreated) AS LastCreated
,ValueDelta
,COUNT(*) AS ValueCount
FROM CTE2
GROUP BY
ListID
,ValueDelta
,rn1-rn2
,FirstID
,LastID
ORDER BY FirstCreated;
Use a CTE that adds a Row_Number column, partitioned by GroupId and Value and ordered by Created.
Then select from the CTE, GROUP BY GroupId and Value; use COUNT(*) to get the Count, and use correlated subqueries to select the ValueId with the MIN(RowNumber) (which will always be 1, so you can just use that instead of MIN) and the MAX(RowNumber) to get FirstId and LastId.
Although, now that I've noticed you're using SQL Server 2017, you should be able to use First_Value() and Last_Value() instead of correlated subqueries.
After many iterations I think I have a working solution. I'm absolutely sure it's far from optimal but it works.
Link is here: http://sqlfiddle.com/#!18/4ee9f/3
Sample data:
create table [Value]
(
[ValueId] bigint not null identity(1,1),
[ListId] bigint not null,
[ValueDelta] int not null,
[ValueCreated] datetime2 not null,
constraint [PK_Value] primary key clustered ([ValueId])
);
insert into [Value]
([ListId], [ValueDelta], [ValueCreated])
values
(1, 1, '2019-01-01 01:01:01'), -- 1.1
(1, 0, '2019-01-01 01:02:01'), -- 2.1
(1, 0, '2019-01-01 01:03:01'), -- 2.2
(1, 0, '2019-01-01 01:04:01'), -- 2.3
(1, -1, '2019-01-01 01:05:01'), -- 3.1
(1, -1, '2019-01-01 01:06:01'), -- 3.2
(1, 1, '2019-01-01 01:01:02'), -- 1.2
(1, 1, '2019-01-01 01:08:01'), -- 4.2
(2, 1, '2019-01-01 01:08:01') -- 5.1
The Query that seems to work:
-- this is the actual order of data
select *
from [Value]
order by [ListId] asc, [ValueCreated] asc;
-- there are 4 sets here
-- set 1 GroupId=1, Id=1&7, Value=1
-- set 2 GroupId=1, Id=2-4, Value=0
-- set 3 GroupId=1, Id=5-6, Value=-1
-- set 4 GroupId=1, Id=8-8, Value=1
-- set 5 GroupId=2, Id=9-9, Value=1
with [cte1] as
(
select [v1].[ListId]
,[v2].[ValueId] as [FirstId], [v2].[ValueCreated] as [FirstCreated]
,[v1].[ValueId] as [LastId], [v1].[ValueCreated] as [LastCreated]
,isnull([v1].[ValueDelta], 0) as [ValueDelta]
from [dbo].[Value] [v1]
join [dbo].[Value] [v2] on [v2].[ListId] = [v1].[ListId]
and isnull([v2].[ValueDeltaPrev], 0) = isnull([v1].[ValueDeltaPrev], 0)
and [v2].[ValueCreated] <= [v1].[ValueCreated] and not exists (
select 1
from [dbo].[Value] [v3]
where 1=1
and ([v3].[ListId] = [v1].[ListId])
and ([v3].[ValueCreated] between [v2].[ValueCreated] and [v1].[ValueCreated])
and [v3].[ValueDelta] != [v1].[ValueDelta]
)
), [cte2] as
(
select [t1].*
from [cte1] [t1]
where not exists (select 1 from [cte1] [t2] where [t2].[ListId] = [t1].[ListId]
and ([t1].[FirstId] != [t2].[FirstId] or [t1].[LastId] != [t2].[LastId])
and [t1].[FirstCreated] between [t2].[FirstCreated] and [t2].[LastCreated]
and [t1].[LastCreated] between [t2].[FirstCreated] and [t2].[LastCreated]
)
)
select [ListId], [FirstId], [LastId], [FirstCreated], [LastCreated], [ValueDelta] as [ValueDelta]
,(select count(*) from [dbo].[Value] where [ListId] = [t].[ListId] and [ValueCreated] between [t].[FirstCreated] and [t].[LastCreated]) as [ValueCount]
from [cte2] [t];
How it works:
join table to self on same list but only on older (or equal date to handle single sets) values
join again on self and exclude any overlaps keeping only largest date set
once we identify largest sets, we then count entries in set dates
If anyone can find a better / friendlier solution, you get the answer.
PS: The dumb straightforward Cursor approach seems a lot faster than this. Still testing.

Query to return the lowest SUM of values over X consecutive days

I'm not even sure how to word this one!...
I have a table with two columns, Price (double) and StartDate (Date). I need to be able to query the table and return X number of rows, lets say 3 for this example - I need to pull back the 3 rows that have consecutive dates e.g. 7th, 8th, 9th of May 2019 which have the lowest sum'd price values from a date range.
I'm thinking a function which takes startDateRange, endDateRange, duration.
It'll return a number of rows (duration) between startDateRange and endDateRange and those three rows when sum'd up would be the cheapest (lowest) sum of any number of rows within that date range for consecutive dates.
So as an example, if I wanted the cheapest 3 dates from between 1st May 2019 and 14th May 2019, the highlighted 3 rows would be returned;
I think possibly LEAD() and LAG() might be a starting point, but I'm not really a SQL person, so not sure if there's a better way around this.
I've developed some c# on my business layer to do this currently, but over large datasets its a bit sluggish - it would be nice to get a list of records straight from my data layer.
Any ideas would be greatly appreciated!
Thanks in advance.
You can calculate averages over 3 days with a window function. Then use top 1 to pick the set of 3 rows with the lowest average:
select top 1 StartDt
, AvgPrice
from (
select StartDt
, avg(Price) over (order by StartDt rows between 2 preceding
and current row) AvgPrice
, count(*) over (order by StartDt rows between 2 preceding
and current row) RowCnt
from prices
) sets_of_3_days
where RowCnt = 3 -- ignore first two rows
order by
AvgPrice -- ascending, so the lowest average (and therefore lowest sum) comes first
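Since the window size is fixed, the lowest average is also the lowest sum, and the whole search is easy to express procedurally. A rough Python sketch (assumes one row per date and consecutive dates; the price list below mirrors the sample table in the next answer):

```python
def cheapest_window(prices, k=3):
    """prices: (date, price) pairs sorted by date, dates assumed consecutive.
    Returns the k consecutive rows whose prices sum to the smallest total."""
    end = min(range(k - 1, len(prices)),
              key=lambda i: sum(p for _, p in prices[i - k + 1:i + 1]))
    return prices[end - k + 1:end + 1]

prices = [('2015-01-%02d' % d, p) for d, p in zip(range(1, 15),
          [632, 649, 632, 607, 598, 624, 641, 598, 556, 480, 510, 541, 634, 634])]
print(cheapest_window(prices))
# [('2015-01-10', 480), ('2015-01-11', 510), ('2015-01-12', 541)]
```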
Here is a solution; the logic proper starts where the dates are declared. All the best.
--table example
declare @laVieja table (price float, fecha date)
insert into @laVieja values (632,'20150101')
insert into @laVieja values (649,'20150102')
insert into @laVieja values (632,'20150103')
insert into @laVieja values (607,'20150104')
insert into @laVieja values (598,'20150105')
insert into @laVieja values (624,'20150106')
insert into @laVieja values (641,'20150107')
insert into @laVieja values (598,'20150108')
insert into @laVieja values (556,'20150109')
insert into @laVieja values (480,'20150110')
insert into @laVieja values (510,'20150111')
insert into @laVieja values (541,'20150112')
insert into @laVieja values (634,'20150113')
insert into @laVieja values (634,'20150114')
-- end of setting up table example
--declaring dates
declare @fechaIni date, @fechaEnds date
set @fechaIni = '20150101'
set @fechaEnds = '20150114'
--assigning order based on price
select *, ROW_NUMBER() over (order by price) as unOrden
into #laVieja
from @laVieja
where fecha between @fechaIni and @fechaEnds
-- declaring variables for cycle
declare @iteracion float = 1, @iteracionMaxima float, @fechaPrimera date, @fechaSegunda date, @fechaTercera date
select @iteracionMaxima = max(unOrden) from #laVieja
--starting cycle
while (@iteracion <= @iteracionMaxima)
begin
--assigning dates to variables
select @fechaPrimera = fecha from #laVieja where unOrden = @iteracion
select @fechaSegunda = fecha from #laVieja where unOrden = @iteracion + 1
select @fechaTercera = fecha from #laVieja where unOrden = @iteracion + 2
--comparing variables
if (@fechaTercera = DATEADD(day,1,@fechaSegunda) and @fechaSegunda = DATEADD(day,1,@fechaPrimera))
begin
select * from #laVieja
where unOrden in (@iteracion, @iteracion + 1, @iteracion + 2)
set @iteracion = @iteracionMaxima
end
set @iteracion += 1
end
You can use window functions with OVER (... ROWS BETWEEN) clause to calculate the sum/average over specific number of rows. You can then use ROW_NUMBER to find the other two rows.
WITH cte1 AS (
SELECT *
, SUM(Price) OVER (ORDER BY Date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS wsum
, ROW_NUMBER() OVER (ORDER BY Date) AS rn
FROM #t
), cte2 AS (
SELECT TOP 1 rn
FROM cte1
WHERE rn > 2
ORDER BY wsum, Date
)
SELECT *
FROM cte1
WHERE rn BETWEEN (SELECT rn FROM cte2) - 2 AND (SELECT rn FROM cte2)
In the above query, replace 2 with the window size minus 1.

TSQL - Run date comparison for "duplicates"/false positives on initial query?

I'm pretty new to SQL and am working on pulling some data from several very large tables for analysis. The data is basically triggered events for assets on a system. The events all have a created_date (datetime) field that I care about.
I was able to put together the query below to get the data I need (YAY):
SELECT
event.efkey
,event.e_id
,event.e_key
,l.l_name
,event.created_date
,asset.a_id
,asset.asset_name
FROM event
LEFT JOIN asset
ON event.a_key = asset.a_key
LEFT JOIN l
ON event.l_key = l.l_key
WHERE event.e_key IN (350, 352, 378)
ORDER BY asset.a_id, event.created_date
However, while this gives me the data for the specific events I want, I still have another problem. Assets can trigger these events repeatedly, which can result in large numbers of "false positives" for what I'm looking at.
What I need to do is go through the result set of the query above and remove any events for an asset that occur closer than N minutes together (say 30 minutes for this example). So IF the asset_ID is the same AND the event.created_date is within 30 minutes of another event for that asset in the set THEN I want that removed. For example:
For the following records
a_id 1124 created 2016-02-01 12:30:30
a_id 1124 created 2016-02-01 12:35:31
a_id 1124 created 2016-02-01 12:40:33
a_id 1124 created 2016-02-01 12:45:42
a_id 1124 created 2016-02-02 12:30:30
a_id 1124 created 2016-02-02 13:00:30
a_id 1115 created 2016-02-01 12:30:30
I'd want to return only:
a_id 1124 created 2016-02-01 12:30:30
a_id 1124 created 2016-02-02 12:30:30
a_id 1124 created 2016-02-02 13:00:30
a_id 1115 created 2016-02-01 12:30:30
I tried referencing this and this but I can't make the concepts there work for me. I know I probably need to do a SELECT * FROM (my existing query) but I can't seem to do that without ending up with tons of "multi-part identifier can't be bound" errors (and I have no experience creating temp tables, my attempts at that have failed thus far). I also am not exactly sure how to use DATEDIFF as the date filtering function.
Any help would be greatly appreciated! If you could dumb it down for a novice (or link to explanations) that would also be helpful!
This is a trickier problem than it initially appears. The hard part is capturing the previous good row and removing the next bad rows but not allowing those bad rows to influence whether or not the next row is good. Here is what I came up with. I've tried to explain what is going on with comments in the code.
--sample data since I don't have your table structure and your original query won't work for me
declare @events table
(
id int,
timestamp datetime
)
--note that I changed some of your sample data to test some different scenarios
insert into @events values( 1124, '2016-02-01 12:30:30')
insert into @events values( 1124, '2016-02-01 12:35:31')
insert into @events values( 1124, '2016-02-01 12:40:33')
insert into @events values( 1124, '2016-02-01 13:05:42')
insert into @events values( 1124, '2016-02-02 12:30:30')
insert into @events values( 1124, '2016-02-02 13:00:30')
insert into @events values( 1115, '2016-02-01 12:30:30')
--using a cte here to split the result set of your query into groups
--by id (you would want to partition by whatever criteria you use
--to determine that rows are talking about the same event)
--the row_number function gets the row number for each row within that
--id partition
--the over clause specifies how to break up the result set into groups
--(partitions) and what order to put the rows in within that group so
--that the numbering stays consistent
;with orderedEvents as
(
select id, timestamp, row_number() over (partition by id order by timestamp) as rn
from @events
--you would replace @events here with your query
)
--using a second recursive cte here to determine which rows are "good"
--and which ones are not.
, previousGoodTimestamps as
(
--this is the "seeding" part of the recursive cte where I pick the
--first rows of each group as being a desired result. Since they
--are the first in each group, I know they are good. I also assign
--their timestamp as the previous good timestamp since I know that
--this row is good.
select id, timestamp, rn, timestamp as prev_good_timestamp, 1 as is_good
from orderedEvents
where rn = 1
union all
--this is the recursive part of the cte. It takes the rows we have
--already added to this result set and joins those to the "next" rows
--(as defined by our ordering in the first cte). Then we output
--those rows and do some calculations to determine if this row is
--"good" or not. If it is "good" we set it's timestamp as the
--previous good row timestamp so that rows that come after this one
--can use it to determine if they are good or not. If a row is "bad"
--we just forward along the last known good timestamp to the next row.
--
--We also determine if a row is good by checking if the last good row
--timestamp plus 30 minutes is less than or equal to the current row's
--timestamp. If it is then the row is good.
select e2.id
, e2.timestamp
, e2.rn
, last_good_timestamp.timestamp
, case
when dateadd(mi, 30, last_good_timestamp.timestamp) <= e2.timestamp then 1
else 0
end
from previousGoodTimestamps e1
inner join orderedEvents e2 on e2.id = e1.id and e2.rn = e1.rn + 1
--I used a cross apply here to calculate the last good row timestamp
--once. I could have used two identical subqueries above in the select
--and case statements, but I would rather not duplicate the code.
cross apply
(
select case
when e1.is_good = 1 then e1.timestamp --if the last row is good, just use its timestamp
else e1.prev_good_timestamp --the last row was bad, forward on what it had for the last good timestamp
end as timestamp
) last_good_timestamp
)
select *
from previousGoodTimestamps
where is_good = 1 --only take the "good" rows
Links to MSDN for some of the more complicated things here:
CTEs and Recursive CTEs
CROSS APPLY
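The recursive "previous good timestamp" rule is easy to sanity-check in procedural code. A rough Python equivalent of the logic (illustrative, not T-SQL), run against the question's original sample data:

```python
from datetime import datetime, timedelta

def drop_close_events(events, minutes=30):
    """events: (a_id, created) pairs. A row survives only if it is at least
    `minutes` after the last KEPT row for the same a_id; dropped rows do not
    push the window forward (the 'previous good timestamp' carries through)."""
    kept, last_good = [], {}
    for a_id, created in sorted(events):   # sort by a_id, then created
        prev = last_good.get(a_id)
        if prev is None or created >= prev + timedelta(minutes=minutes):
            kept.append((a_id, created))
            last_good[a_id] = created
    return kept

E = lambda s: datetime.strptime(s, '%Y-%m-%d %H:%M:%S')
events = [(1124, E('2016-02-01 12:30:30')), (1124, E('2016-02-01 12:35:31')),
          (1124, E('2016-02-01 12:40:33')), (1124, E('2016-02-01 12:45:42')),
          (1124, E('2016-02-02 12:30:30')), (1124, E('2016-02-02 13:00:30')),
          (1115, E('2016-02-01 12:30:30'))]
for a_id, created in drop_close_events(events):
    print(a_id, created)
```

The `>=` mirrors the SQL's `dateadd(mi, 30, ...) <= timestamp`: a row exactly 30 minutes after the last good one is kept, which is why the 13:00:30 row survives.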
-- Sample data.
declare @Samples as Table ( Id Int Identity, A_Id Int, CreatedDate DateTime );
insert into @Samples ( A_Id, CreatedDate ) values
( 1124, '2016-02-01 12:30:30' ),
( 1124, '2016-02-01 12:35:31' ),
( 1124, '2016-02-01 12:40:33' ),
( 1124, '2016-02-01 12:45:42' ),
( 1124, '2016-02-02 12:30:30' ),
( 1124, '2016-02-02 13:00:30' ),
( 1125, '2016-02-01 12:30:30' );
select * from @Samples;
-- Calculate the windows of 30 minutes before and after each CreatedDate and check for conflicts with other rows.
with Ranges as (
select Id, A_Id, CreatedDate,
DateAdd( minute, -30, S.CreatedDate ) as RangeStart, DateAdd( minute, 30, S.CreatedDate ) as RangeEnd
from @Samples as S )
select Id, A_Id, CreatedDate, RangeStart, RangeEnd,
-- Check for a conflict with another row with:
-- the same A_Id value and an earlier CreatedDate that falls inside the +/-30 minute range.
case when exists ( select 42 from @Samples where A_Id = R.A_Id and CreatedDate < R.CreatedDate and R.RangeStart < CreatedDate and CreatedDate < R.RangeEnd ) then 1
else 0 end as Conflict
from Ranges as R;
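Note this rule differs subtly from the recursive answer above: here a row is flagged whenever ANY strictly earlier row (kept or dropped) falls inside the preceding window, so a chain of close events can suppress rows the recursive version would keep. A small Python sketch makes the semantics explicit (illustrative only); on the question's sample data the two approaches happen to agree:

```python
from datetime import datetime, timedelta

def flag_conflicts(events, minutes=30):
    """events: (a_id, created) pairs. Marks a row as a conflict when some
    other row with the same a_id has a strictly earlier created time lying
    inside the preceding `minutes` window."""
    window = timedelta(minutes=minutes)
    return [(a_id, created,
             any(a == a_id and c < created and created - window < c
                 for a, c in events))
            for a_id, created in events]

E = lambda s: datetime.strptime(s, '%Y-%m-%d %H:%M:%S')
events = [(1124, E('2016-02-01 12:30:30')), (1124, E('2016-02-01 12:35:31')),
          (1124, E('2016-02-01 12:40:33')), (1124, E('2016-02-01 12:45:42')),
          (1124, E('2016-02-02 12:30:30')), (1124, E('2016-02-02 13:00:30')),
          (1115, E('2016-02-01 12:30:30'))]
for a_id, created, conflict in flag_conflicts(events):
    print(a_id, created, conflict)
```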

Create Historical Table from Dates with Ranked Contracts (Gaps and Islands?)

I have an issue in Teradata where I am trying to build a historical contract table that lists a system, its corresponding contracts, and the start and end dates of each contract. This table would then be queried for reporting as a point-in-time table. Here is some code to better explain.
CREATE TABLE TMP_WORK_DB.SOLD_SYSTEMS
(
SYSTEM_ID varchar(5),
CONTRACT_TYPE varchar(10),
CONTRACT_RANK int,
CONTRACT_STRT_DT date,
CONTRACT_END_DT date
);
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('AAA', 'BEST', 10, '2012-01-01', '2012-06-30');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('AAA', 'BEST', 9, '2012-01-01', '2012-06-30');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('AAA', 'OK', 1, '2012-08-01', '2012-12-30');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('BBB', 'BEST', 10, '2013-12-01', '2014-03-02');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('BBB', 'BETTER', 7, '2013-12-01', '2017-03-02');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('BBB', 'GOOD', 4, '2016-12-02', '2017-12-02');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('CCC', 'BEST', 10, '2009-10-13', '2014-10-14');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('CCC', 'BETTER', 7, '2009-10-13', '2016-10-14');
INSERT INTO TMP_WORK_DB.SOLD_SYSTEMS VALUES ('CCC', 'OK', 2, '2008-10-13', '2017-10-14');
The required output would be:
SYSTEM_ID CONTRACT_TYPE CONTRACT_STRT_DT CONTARCT_END_DT CONTRACT_RANK
AAA BEST 01/01/2012 06/30/2012 10
AAA OK 08/01/2012 12/30/2012 1
BBB BEST 12/01/2013 03/02/2014 10
BBB BETTER 03/03/2014 03/02/2017 7
BBB GOOD 03/03/2017 12/02/2017 4
CCC OK 10/13/2008 10/12/2009 2
CCC BEST 10/13/2009 10/14/2014 10
CCC BETTER 10/15/2014 10/14/2016 7
CCC OK 10/15/2016 10/14/2017 2
I'm not necessarily looking to reduce rows but am looking to get the correct state of the system_id at any given point in time. Note that when a higher ranked contract ends and a lower ranked contract is still active the lower ranked picks up where the higher one left off.
We are using TD 14 and I have been able to get the easy records where the dates flow sequentially and are of higher rank but am having trouble with the overlaps where two different ranked contracts cover multiple date spans.
I found this blog post (Sharpening Stones) and got it working for the most part but am still having trouble setting the new start dates for the overlapping contracts.
Any help would be appreciated. Thanks.
UPDATE 04/04/2014
I came up with the following code which gives me exactly what I want, but I'm not sure of the performance. It works on smaller data sets of a few hundred rows but I haven't tested it on several million:
UPDATE 04/07/2014
Updated the date subquery due to spool issues. This query explodes all days where the contract is possibly active and then uses the ROW_NUMBER function to get the highest ranked CONTRACT_TYPE per day. The MIN/MAX functions are then partitioned over the system and contract type to pick up when the highest ranked contract type changes.
UPDATE 2 - 04/07/2014
I cleaned up the query and it seems to perform a little better.
SELECT
SYSTEM_ID
, CONTRACT_TYPE
, MIN(CALENDAR_DATE) NEW_START_DATE
, MAX(CALENDAR_DATE) NEW_END_DATE
, CONTRACT_RANK
FROM (
SELECT
CALENDAR_DATE
, SYSTEM_ID
, CONTRACT_TYPE
, CONTRACT_RANK
, ROW_NUMBER() OVER (PARTITION BY SYSTEM_ID, CALENDAR_DATE ORDER BY CONTRACT_RANK DESC, CONTRACT_STRT_DT DESC, CONTRACT_END_DT DESC) AS RNK
FROM SOLD_SYSTEMS t1
JOIN (
SELECT CALENDAR_DATE
FROM FULL_CALENDAR_TABLE ia
WHERE CALENDAR_DATE > DATE'2013-01-01'
)dt
ON CALENDAR_DATE BETWEEN CONTRACT_STRT_DT AND CONTRACT_END_DT
QUALIFY RNK = 1
)z1
GROUP BY 1,2,5
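The explode-then-rank idea is straightforward to mirror in ordinary code, which also makes it easy to check against the required output above. A rough Python sketch (illustrative only; expands every active day, so like the SQL it trades memory for simplicity):

```python
from datetime import date, timedelta

def point_in_time(contracts):
    """contracts: (system_id, ctype, rank, start, end), dates inclusive.
    Explode each contract into days, keep the highest rank per (system, day),
    then collapse consecutive winning days into ranges."""
    by_day = {}
    for sid, ctype, rank, start, end in contracts:
        d = start
        while d <= end:
            if by_day.get((sid, d), (-1, None))[0] < rank:
                by_day[(sid, d)] = (rank, ctype)
            d += timedelta(days=1)
    out = []  # rows of (system_id, ctype, start, end, rank)
    for (sid, d), (rank, ctype) in sorted(by_day.items()):
        if out and out[-1][0] == sid and out[-1][1] == ctype \
                and out[-1][4] == rank and out[-1][3] == d - timedelta(days=1):
            out[-1] = (sid, ctype, out[-1][2], d, rank)   # extend current range
        else:
            out.append((sid, ctype, d, d, rank))          # start a new range
    return out

contracts = [
    ('AAA', 'BEST',   10, date(2012, 1, 1),   date(2012, 6, 30)),
    ('AAA', 'BEST',    9, date(2012, 1, 1),   date(2012, 6, 30)),
    ('AAA', 'OK',      1, date(2012, 8, 1),   date(2012, 12, 30)),
    ('BBB', 'BEST',   10, date(2013, 12, 1),  date(2014, 3, 2)),
    ('BBB', 'BETTER',  7, date(2013, 12, 1),  date(2017, 3, 2)),
    ('BBB', 'GOOD',    4, date(2016, 12, 2),  date(2017, 12, 2)),
    ('CCC', 'BEST',   10, date(2009, 10, 13), date(2014, 10, 14)),
    ('CCC', 'BETTER',  7, date(2009, 10, 13), date(2016, 10, 14)),
    ('CCC', 'OK',      2, date(2008, 10, 13), date(2017, 10, 14)),
]
for row in point_in_time(contracts):
    print(row)
```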
The following approach uses the new PERIOD functions in TD 13.10.
-- 1. TD_SEQUENCED_COUNT can't be used in joins, so create a Volatile Table
-- 2. TD_SEQUENCED_COUNT can't use additional columns (e.g. CONTRACT_RANK),
-- so simply create a new row whenever a period starts or ends without
-- considering CONTRACT_RANK
CREATE VOLATILE TABLE vt AS
(
WITH cte
(
SYSTEM_ID
,pd
)
AS
(
SELECT
SYSTEM_ID
-- PERIODs can easily be constructed on-the-fly, but the end date is not inclusive,
-- so I had to adjust to your implementation, CONTRACT_END_DT +/- 1:
,PERIOD(CONTRACT_STRT_DT, CONTRACT_END_DT + 1) AS pd
FROM SOLD_SYSTEMS
)
SELECT
SYSTEM_ID
,BEGIN(pd) AS CONTRACT_STRT_DT
,END(pd) - 1 AS CONTRACT_END_DT
FROM
TABLE (TD_SEQUENCED_COUNT
(NEW VARIANT_TYPE(cte.SYSTEM_ID)
,cte.pd)
RETURNS (SYSTEM_ID VARCHAR(5)
,Policy_Count INTEGER
,pd PERIOD(DATE))
HASH BY SYSTEM_ID
LOCAL ORDER BY SYSTEM_ID ,pd) AS dt
)
WITH DATA
PRIMARY INDEX (SYSTEM_ID)
ON COMMIT PRESERVE ROWS
;
-- Find the matching CONTRACT_RANK
SELECT
vt.SYSTEM_ID
,t.CONTRACT_TYPE
,vt.CONTRACT_STRT_DT
,vt.CONTRACT_END_DT
,t.CONTRACT_RANK
FROM vt
-- If both vt and SOLD_SYSTEMS have a NUPI on SYSTEM_ID this join should be
-- quite efficient
JOIN SOLD_SYSTEMS AS t
ON vt.SYSTEM_ID = t.SYSTEM_ID
AND ( t.CONTRACT_STRT_DT, t.CONTRACT_END_DT)
OVERLAPS (vt.CONTRACT_STRT_DT, vt.CONTRACT_END_DT)
QUALIFY
-- As multiple contracts for the same period are possible:
-- find the row with the highest rank
ROW_NUMBER()
OVER (PARTITION BY vt.SYSTEM_ID,vt.CONTRACT_STRT_DT
ORDER BY t.CONTRACT_RANK DESC, vt.CONTRACT_END_DT DESC) = 1
ORDER BY 1,3
;
-- Previous query might return consecutive rows with the same CONTRACT_RANK, e.g.
-- BBB BETTER 2014-03-03 2016-12-01 7
-- BBB BETTER 2016-12-02 2017-03-02 7
-- If you don't want that you have to normalize the data:
WITH cte
(
SYSTEM_ID
,CONTRACT_STRT_DT
,CONTRACT_END_DT
,CONTRACT_RANK
,CONTRACT_TYPE
,pd
)
AS
(
SELECT
vt.SYSTEM_ID
,vt.CONTRACT_STRT_DT
,vt.CONTRACT_END_DT
,t.CONTRACT_RANK
,t.CONTRACT_TYPE
,PERIOD(vt.CONTRACT_STRT_DT, vt.CONTRACT_END_DT + 1) AS pd
FROM vt
JOIN SOLD_SYSTEMS AS t
ON vt.SYSTEM_ID = t.SYSTEM_ID
AND ( t.CONTRACT_STRT_DT, t.CONTRACT_END_DT)
OVERLAPS (vt.CONTRACT_STRT_DT, vt.CONTRACT_END_DT)
QUALIFY
ROW_NUMBER()
OVER (PARTITION BY vt.SYSTEM_ID,vt.CONTRACT_STRT_DT
ORDER BY t.CONTRACT_RANK DESC, vt.CONTRACT_END_DT DESC) = 1
)
SELECT
SYSTEM_ID
,CONTRACT_TYPE
,BEGIN(pd) AS CONTRACT_STRT_DT
,END(pd) - 1 AS CONTRACT_END_DT
,CONTRACT_RANK
FROM
TABLE (TD_NORMALIZE_MEET
(NEW VARIANT_TYPE(cte.SYSTEM_ID
,cte.CONTRACT_RANK
,cte.CONTRACT_TYPE)
,cte.pd)
RETURNS (SYSTEM_ID VARCHAR(5)
,CONTRACT_RANK INT
,CONTRACT_TYPE VARCHAR(10)
,pd PERIOD(DATE))
HASH BY SYSTEM_ID
LOCAL ORDER BY SYSTEM_ID, CONTRACT_RANK, CONTRACT_TYPE, pd ) AS dt
ORDER BY 1, 3;
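TD_NORMALIZE_MEET itself is easy to picture: rows whose other columns are identical and whose periods meet (or overlap) get coalesced into one row. A small Python sketch of the same logic, again just illustrative and using integer dates with exclusive ends:

```python
def normalize_meet(rows):
    """Rough analogue of TD_NORMALIZE_MEET: merge rows whose periods
    meet (or overlap) and whose other columns are identical.
    Each row is (sys_id, rank, ctype, start, end) with exclusive end."""
    out = []
    for sys_id, rank, ctype, start, end in sorted(rows):
        # sorting puts same-key rows next to each other in period order,
        # so a meet/overlap can only happen with the last emitted row
        if out and out[-1][:3] == (sys_id, rank, ctype) and start <= out[-1][4]:
            prev = out[-1]
            out[-1] = (sys_id, rank, ctype, prev[3], max(prev[4], end))
        else:
            out.append((sys_id, rank, ctype, start, end))
    return out

# The two consecutive rank-7 'BETTER' rows from the example collapse into one:
print(normalize_meet([('BBB', 7, 'BETTER', 3, 12), ('BBB', 7, 'BETTER', 12, 15)]))
# -> [('BBB', 7, 'BETTER', 3, 15)]
```

Rows of the same key separated by a gap stay separate, which matches the MEET semantics.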
Edit: This is another way to get the result of the 2nd query without a Volatile Table and TD_SEQUENCED_COUNT:
SELECT
t.SYSTEM_ID
,t.CONTRACT_TYPE
,BEGIN(CONTRACT_PERIOD) AS CONTRACT_STRT_DT
,END(CONTRACT_PERIOD)- 1 AS CONTRACT_END_DT
,t.CONTRACT_RANK
,dt.p P_INTERSECT PERIOD(t.CONTRACT_STRT_DT,t.CONTRACT_END_DT + 1) AS CONTRACT_PERIOD
FROM
(
SELECT
dt.SYSTEM_ID
,PERIOD(d, MIN(d)
OVER (PARTITION BY dt.SYSTEM_ID
ORDER BY d
ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)) AS p
FROM
(
SELECT
SYSTEM_ID
,CONTRACT_STRT_DT AS d
FROM SOLD_SYSTEMS
UNION
SELECT
SYSTEM_ID
,CONTRACT_END_DT + 1 AS d
FROM SOLD_SYSTEMS
) AS dt
QUALIFY p IS NOT NULL
) AS dt
JOIN SOLD_SYSTEMS AS t
ON dt.SYSTEM_ID = t.SYSTEM_ID
WHERE CONTRACT_PERIOD IS NOT NULL
QUALIFY
ROW_NUMBER()
OVER (PARTITION BY dt.SYSTEM_ID,p
ORDER BY t.CONTRACT_RANK DESC, t.CONTRACT_END_DT DESC) = 1
ORDER BY 1,3
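The derived table in this query builds the elementary periods by hand: collect every start date plus every end date + 1, and pair each boundary with the next one. In Python the trick looks like this (illustrative only; integer dates, exclusive ends):

```python
from collections import defaultdict

def elementary_periods(contracts):
    """The derived-table trick from the query above: collect every start
    date and every end date + 1 per id, then pair each boundary with the
    next one to get gap-free elementary periods."""
    bounds = defaultdict(set)
    for sys_id, start, end in contracts:   # inclusive end, like CONTRACT_END_DT
        bounds[sys_id].add(start)
        bounds[sys_id].add(end + 1)
    out = []
    for sys_id, ds in bounds.items():
        ds = sorted(ds)                    # the MIN(d) OVER (... 1 FOLLOWING ...)
        out.extend((sys_id, lo, hi) for lo, hi in zip(ds, ds[1:]))
    return out

print(elementary_periods([('AAA', 1, 9), ('AAA', 5, 19)]))
# -> [('AAA', 1, 5), ('AAA', 5, 10), ('AAA', 10, 20)]
```

Note that slices falling into a gap between contracts are still produced here; in the SQL they are discarded by the P_INTERSECT join condition (`CONTRACT_PERIOD IS NOT NULL`).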
And based on that you can also include the normalization in a single query:
WITH cte
(
SYSTEM_ID
,CONTRACT_TYPE
,CONTRACT_STRT_DT
,CONTRACT_END_DT
,CONTRACT_RANK
,pd
)
AS
(
SELECT
t.SYSTEM_ID
,t.CONTRACT_TYPE
,BEGIN(CONTRACT_PERIOD) AS CONTRACT_STRT_DT
,END(CONTRACT_PERIOD)- 1 AS CONTRACT_END_DT
,t.CONTRACT_RANK
,dt.p P_INTERSECT PERIOD(t.CONTRACT_STRT_DT,t.CONTRACT_END_DT + 1) AS CONTRACT_PERIOD
FROM
(
SELECT
dt.SYSTEM_ID
,PERIOD(d, MIN(d)
OVER (PARTITION BY dt.SYSTEM_ID
ORDER BY d
ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)) AS p
FROM
(
SELECT
SYSTEM_ID
,CONTRACT_STRT_DT AS d
FROM SOLD_SYSTEMS
UNION
SELECT
SYSTEM_ID
,CONTRACT_END_DT + 1 AS d
FROM SOLD_SYSTEMS
) AS dt
QUALIFY p IS NOT NULL
) AS dt
JOIN SOLD_SYSTEMS AS t
ON dt.SYSTEM_ID = t.SYSTEM_ID
WHERE CONTRACT_PERIOD IS NOT NULL
QUALIFY
ROW_NUMBER()
OVER (PARTITION BY dt.SYSTEM_ID,p
ORDER BY t.CONTRACT_RANK DESC, t.CONTRACT_END_DT DESC) = 1
)
SELECT
SYSTEM_ID
,CONTRACT_TYPE
,BEGIN(pd) AS CONTRACT_STRT_DT
,END(pd) - 1 AS CONTRACT_END_DT
,CONTRACT_RANK
FROM
TABLE (TD_NORMALIZE_MEET
(NEW VARIANT_TYPE(cte.SYSTEM_ID
,cte.CONTRACT_RANK
,cte.CONTRACT_TYPE)
,cte.pd)
RETURNS (SYSTEM_ID VARCHAR(5)
,CONTRACT_RANK INT
,CONTRACT_TYPE VARCHAR(10)
,pd PERIOD(DATE))
HASH BY SYSTEM_ID
LOCAL ORDER BY SYSTEM_ID, CONTRACT_RANK, CONTRACT_TYPE, pd ) AS dt
ORDER BY 1, 3;
SEL system_id, contract_type, MAX(contract_rank) AS contract_rank,
    CASE WHEN contract_strt_dt < prev_end_dt THEN prev_end_dt + 1
         ELSE contract_strt_dt
    END AS new_start,
    contract_strt_dt, contract_end_dt,
    MIN(contract_end_dt)
    OVER (PARTITION BY system_id
          ORDER BY contract_strt_dt, contract_end_dt
          ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS prev_end_dt
    -- Teradata resolves the prev_end_dt alias within the same SELECT list,
    -- so the CASE can reference it even though it is defined further down
FROM sold_systems
GROUP BY system_id, contract_type, contract_strt_dt, contract_end_dt
ORDER BY contract_strt_dt, contract_end_dt, prev_end_dt
I think I've got it. Try this:
select SYSTEM_ID, CONTRACT_TYPE,CONTRACT_RANK,
case
when CONTRACT_STRT_DT < NEW_START_DATE then NEW_START_DATE /* if NEW_START_DATE overlaps the start date then use NEW_START_DATE */
else CONTRACT_STRT_DT
end as new_contract_str_dt,
CONTRACT_END_DT
from
(select t1.SYSTEM_ID, t1.CONTRACT_TYPE, t1.CONTRACT_RANK,
        t1.CONTRACT_STRT_DT, t1.CONTRACT_END_DT,
        -- highest end date among the higher-ranked contracts seen so far
        coalesce(max(t1.CONTRACT_END_DT)
                 over (partition by t1.SYSTEM_ID
                       order by t1.CONTRACT_RANK desc
                       rows between unbounded preceding and 1 preceding),
                 t1.CONTRACT_STRT_DT) as NEW_START_DATE
 from SOLD_SYSTEMS t1
) as temp1
/*you may remove fully overlapped contracts*/
where NEW_START_DATE<=CONTRACT_END_DT
It's simpler and has a good execution plan. It also works well on large tables (don't forget to collect statistics).
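The idea behind this last query can be sketched in a few lines of Python (illustrative only; integer dates, inclusive ends): walk each system's contracts in descending rank order, keep a running max of the end dates seen so far in place of the MAX(...) OVER (... ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) window, push overlapping start dates forward, and drop contracts that are fully covered by higher-ranked ones:

```python
def trim_overlaps(contracts):
    """Sketch of the final query's logic. Each row is
    (sys_id, rank, start, end) with inclusive end."""
    out = []
    max_end = {}                       # sys_id -> latest end date seen so far
    for sys_id, rank, start, end in sorted(contracts,
                                           key=lambda r: (r[0], -r[1])):
        # COALESCE(MAX(end) OVER (...), start), then the CASE expression
        new_start = max(start, max_end.get(sys_id, start))
        if new_start <= end:           # WHERE NEW_START_DATE <= CONTRACT_END_DT
            out.append((sys_id, rank, new_start, end))
        max_end[sys_id] = max(max_end.get(sys_id, end), end)
    return out

print(trim_overlaps([('AAA', 9, 1, 10), ('AAA', 7, 5, 20)]))
# -> [('AAA', 9, 1, 10), ('AAA', 7, 10, 20)]
```

As in the SQL, the trimmed row starts on the previous contract's end date (the CASE uses NEW_START_DATE rather than NEW_START_DATE + 1), so adjacent rows share one boundary day.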