Find conflicted date intervals using SQL - sql

Suppose I have following table in Sql Server 2008:
ItemId StartDate EndDate
1 NULL 2011-01-15
2 2011-01-16 2011-01-25
3 2011-01-26 NULL
As you can see, this table has StartDate and EndDate columns. I want to validate data in these columns. Intervals cannot conflict with each other. So, the table above is valid, but the next table is invalid, becase first row has End Date greater than StartDate in the second row.
ItemId StartDate EndDate
1 NULL 2011-01-17
2 2011-01-16 2011-01-25
3 2011-01-26 NULL
NULL means infinity here.
Could you help me to write a script for data validation?
[The second task]
Thanks for the answers.
I have a complication. Let's assume, I have such table:
ItemId IntervalId StartDate EndDate
1 1 NULL 2011-01-15
2 1 2011-01-16 2011-01-25
3 1 2011-01-26 NULL
4 2 NULL 2011-01-17
5 2 2011-01-16 2011-01-25
6 2 2011-01-26 NULL
Here I want to validate intervals within a groups of IntervalId, but not within the whole table. So, Interval 1 will be valid, but Interval 2 will be invalid.
And also. Is it possible to add a constraint to the table in order to avoid such invalid records?
[Final Solution]
I created function to check if interval is conflicted:
CREATE FUNCTION [dbo].[fnIntervalConflict]
(
#intervalId INT,
#originalItemId INT,
#startDate DATETIME,
#endDate DATETIME
)
RETURNS BIT
AS
BEGIN
SET #startDate = ISNULL(#startDate,'1/1/1753 12:00:00 AM')
SET #endDate = ISNULL(#endDate,'12/31/9999 11:59:59 PM')
DECLARE #conflict BIT = 0
SELECT TOP 1 #conflict = 1
FROM Items
WHERE IntervalId = #intervalId
AND ItemId <> #originalItemId
AND (
(ISNULL(StartDate,'1/1/1753 12:00:00 AM') >= #startDate
AND ISNULL(StartDate,'1/1/1753 12:00:00 AM') <= #endDate)
OR (ISNULL(EndDate,'12/31/9999 11:59:59 PM') >= #startDate
AND ISNULL(EndDate,'12/31/9999 11:59:59 PM') <= #endDate)
)
RETURN #conflict
END
And then I added 2 constraints to my table:
ALTER TABLE dbo.Items ADD CONSTRAINT
CK_Items_Dates CHECK (StartDate IS NULL OR EndDate IS NULL OR StartDate <= EndDate)
GO
and
ALTER TABLE dbo.Items ADD CONSTRAINT
CK_Items_ValidInterval CHECK (([dbo].[fnIntervalConflict]([IntervalId], ItemId,[StartDate],[EndDate])=(0)))
GO
I know, the second constraint slows insert and update operations, but it is not very important for my application.
And also, now I can call function fnIntervalConflict from my application code before inserts and updates of data in the table.

Something like this should give you all overlaping periods
SELECT
*
FROM
mytable t1
JOIN mytable t2 ON t1.EndDate>t2.StartDate AND t1.StartDate < t2.StartDate
Edited for Adrians comment bellow

This will give you the rows that are incorrect.
Added ROW_NUMBER() as I didnt know if all entries where in order.
-- Testdata
declare #date datetime = '2011-01-17'
;with yourTable(itemID, startDate, endDate)
as
(
SELECT 1, NULL, #date
UNION ALL
SELECT 2, dateadd(day, -1, #date), DATEADD(day, 10, #date)
UNION ALL
SELECT 3, DATEADD(day, 60, #date), NULL
)
-- End testdata
,tmp
as
(
select *
,ROW_NUMBER() OVER(order by startDate) as rowno
from yourTable
)
select *
from tmp t1
left join tmp t2
on t1.rowno = t2.rowno - 1
where t1.endDate > t2.startDate
EDIT:
As for the updated question:
Just add a PARTITION BY clause to the ROW_NUMBER() query and alter the join.
-- Testdata
declare #date datetime = '2011-01-17'
;with yourTable(itemID, startDate, endDate, intervalID)
as
(
SELECT 1, NULL, #date, 1
UNION ALL
SELECT 2, dateadd(day, 1, #date), DATEADD(day, 10, #date),1
UNION ALL
SELECT 3, DATEADD(day, 60, #date), NULL, 1
UNION ALL
SELECT 4, NULL, #date, 2
UNION ALL
SELECT 5, dateadd(day, -1, #date), DATEADD(day, 10, #date),2
UNION ALL
SELECT 6, DATEADD(day, 60, #date), NULL, 2
)
-- End testdata
,tmp
as
(
select *
,ROW_NUMBER() OVER(partition by intervalID order by startDate) as rowno
from yourTable
)
select *
from tmp t1
left join tmp t2
on t1.rowno = t2.rowno - 1
and t1.intervalID = t2.intervalID
where t1.endDate > t2.startDate

declare #T table (ItemId int, IntervalID int, StartDate datetime, EndDate datetime)
insert into #T
select 1, 1, NULL, '2011-01-15' union all
select 2, 1, '2011-01-16', '2011-01-25' union all
select 3, 1, '2011-01-26', NULL union all
select 4, 2, NULL, '2011-01-17' union all
select 5, 2, '2011-01-16', '2011-01-25' union all
select 6, 2, '2011-01-26', NULL
select T1.*
from #T as T1
inner join #T as T2
on coalesce(T1.StartDate, '1753-01-01') < coalesce(T2.EndDate, '9999-12-31') and
coalesce(T1.EndDate, '9999-12-31') > coalesce(T2.StartDate, '1753-01-01') and
T1.IntervalID = T2.IntervalID and
T1.ItemId <> T2.ItemId
Result:
ItemId IntervalID StartDate EndDate
----------- ----------- ----------------------- -----------------------
5 2 2011-01-16 00:00:00.000 2011-01-25 00:00:00.000
4 2 NULL 2011-01-17 00:00:00.000

Not directly related to the OP, but since Adrian's expressed an interest. Here's a table than SQL Server maintains the integrity of, ensuring that only one valid value is present at any time. In this case, I'm dealing with a current/history table, but the example can be modified to work with future data also (although in that case, you can't have the indexed view, and you need to write the merge's directly, rather than maintaining through triggers).
In this particular case, I'm dealing with a link table that I want to track the history of. First, the tables that we're linking:
create table dbo.Clients (
ClientID int IDENTITY(1,1) not null,
Name varchar(50) not null,
/* Other columns */
constraint PK_Clients PRIMARY KEY (ClientID)
)
go
create table dbo.DataItems (
DataItemID int IDENTITY(1,1) not null,
Name varchar(50) not null,
/* Other columns */
constraint PK_DataItems PRIMARY KEY (DataItemID),
constraint UQ_DataItem_Names UNIQUE (Name)
)
go
Now, if we were building a normal table, we'd have the following (Don't run this one):
create table dbo.ClientAnswers (
ClientID int not null,
DataItemID int not null,
IntValue int not null,
Comment varchar(max) null,
constraint PK_ClientAnswers PRIMARY KEY (ClientID,DataItemID),
constraint FK_ClientAnswers_Clients FOREIGN KEY (ClientID) references dbo.Clients (ClientID),
constraint FK_ClientAnswers_DataItems FOREIGN KEY (DataItemID) references dbo.DataItems (DataItemID)
)
But, we want a table that can represent a complete history. In particular, we want to design the structure such that overlapping time periods can never appear in the database. We always know which record was valid at any particular time:
create table dbo.ClientAnswerHistories (
ClientID int not null,
DataItemID int not null,
IntValue int null,
Comment varchar(max) null,
/* Temporal columns */
Deleted bit not null,
ValidFrom datetime2 null,
ValidTo datetime2 null,
constraint UQ_ClientAnswerHistories_ValidFrom UNIQUE (ClientID,DataItemID,ValidFrom),
constraint UQ_ClientAnswerHistories_ValidTo UNIQUE (ClientID,DataItemID,ValidTo),
constraint CK_ClientAnswerHistories_NoTimeTravel CHECK (ValidFrom < ValidTo),
constraint FK_ClientAnswerHistories_Clients FOREIGN KEY (ClientID) references dbo.Clients (ClientID),
constraint FK_ClientAnswerHistories_DataItems FOREIGN KEY (DataItemID) references dbo.DataItems (DataItemID),
constraint FK_ClientAnswerHistories_Prev FOREIGN KEY (ClientID,DataItemID,ValidFrom)
references dbo.ClientAnswerHistories (ClientID,DataItemID,ValidTo),
constraint FK_ClientAnswerHistories_Next FOREIGN KEY (ClientID,DataItemID,ValidTo)
references dbo.ClientAnswerHistories (ClientID,DataItemID,ValidFrom),
constraint CK_ClientAnswerHistory_DeletionNull CHECK (
Deleted = 0 or
(
IntValue is null and
Comment is null
)),
constraint CK_ClientAnswerHistory_IntValueNotNull CHECK (Deleted=1 or IntValue is not null)
)
go
That's a lot of constraints. The only way to maintain this table is through merge statements (see examples below, and try to reason about why yourself). We're now going to build a view that mimics that ClientAnswers table defined above:
create view dbo.ClientAnswers
with schemabinding
as
select
ClientID,
DataItemID,
ISNULL(IntValue,0) as IntValue,
Comment
from
dbo.ClientAnswerHistories
where
Deleted = 0 and
ValidTo is null
go
create unique clustered index PK_ClientAnswers on dbo.ClientAnswers (ClientID,DataItemID)
go
And we have the PK constraint we originally wanted. We've also used ISNULL to reinstate the not null-ness of the IntValue column (even though the check constraints already guarantee this, SQL Server is unable to derive this information). If we're working with an ORM, we let it target ClientAnswers, and the history gets automatically built. Next, we can have a function that lets us look back in time:
create function dbo.ClientAnswers_At (
#At datetime2
)
returns table
with schemabinding
as
return (
select
ClientID,
DataItemID,
ISNULL(IntValue,0) as IntValue,
Comment
from
dbo.ClientAnswerHistories
where
Deleted = 0 and
(ValidFrom is null or ValidFrom <= #At) and
(ValidTo is null or ValidTo > #At)
)
go
And finally, we need the triggers on ClientAnswers that build this history. We need to use merge statements, since we need to simultaneously insert new rows, and update the previous "valid" row to end date it with a new ValidTo value.
create trigger T_ClientAnswers_I
on dbo.ClientAnswers
instead of insert
as
set nocount on
;with Dup as (
select i.ClientID,i.DataItemID,i.IntValue,i.Comment,CASE WHEN cah.ClientID is not null THEN 1 ELSE 0 END as PrevDeleted,t.Dupl
from
inserted i
left join
dbo.ClientAnswerHistories cah
on
i.ClientID = cah.ClientID and
i.DataItemID = cah.DataItemID and
cah.ValidTo is null and
cah.Deleted = 1
cross join
(select 0 union all select 1) t(Dupl)
)
merge into dbo.ClientAnswerHistories cah
using Dup on cah.ClientID = Dup.ClientID and cah.DataItemID = Dup.DataItemID and cah.ValidTo is null and Dup.Dupl = 0 and Dup.PrevDeleted = 1
when matched then update set ValidTo = SYSDATETIME()
when not matched and Dup.Dupl=1 then insert (ClientID,DataItemID,IntValue,Comment,Deleted,ValidFrom)
values (Dup.ClientID,Dup.DataItemID,Dup.IntValue,Dup.Comment,0,CASE WHEN Dup.PrevDeleted=1 THEN SYSDATETIME() END);
go
create trigger T_ClientAnswers_U
on dbo.ClientAnswers
instead of update
as
set nocount on
;with Dup as (
select i.ClientID,i.DataItemID,i.IntValue,i.Comment,t.Dupl
from
inserted i
cross join
(select 0 union all select 1) t(Dupl)
)
merge into dbo.ClientAnswerHistories cah
using Dup on cah.ClientID = Dup.ClientID and cah.DataItemID = Dup.DataItemID and cah.ValidTo is null and Dup.Dupl = 0
when matched then update set ValidTo = SYSDATETIME()
when not matched then insert (ClientID,DataItemID,IntValue,Comment,Deleted,ValidFrom)
values (Dup.ClientID,Dup.DataItemID,Dup.IntValue,Dup.Comment,0,SYSDATETIME());
go
create trigger T_ClientAnswers_D
on dbo.ClientAnswers
instead of delete
as
set nocount on
;with Dup as (
select d.ClientID,d.DataItemID,t.Dupl
from
deleted d
cross join
(select 0 union all select 1) t(Dupl)
)
merge into dbo.ClientAnswerHistories cah
using Dup on cah.ClientID = Dup.ClientID and cah.DataItemID = Dup.DataItemID and cah.ValidTo is null and Dup.Dupl = 0
when matched then update set ValidTo = SYSDATETIME()
when not matched then insert (ClientID,DataItemID,Deleted,ValidFrom)
values (Dup.ClientID,Dup.DataItemID,1,SYSDATETIME());
go
Obviously, I could have built a simpler table (not a join table), but this is my standard go-to example (albeit it took me a while to reconstruct it - I forgot the set nocount on statements for a while). But the strength here is that, the base table, ClientAnswerHistories is incapable of storing overlapping time ranges for the same ClientID and DataItemID values.
Things get more complex when you need to deal with temporal foreign keys.
Of course, if you don't want any real gaps, then you can remove the Deleted column (and associated checks), make the not null columns really not null, modify the insert trigger to do a plain insert, and make the delete trigger raise an error instead.

I've always taken a slightly different approach to the design if I have data that is never to have overlapping intervals... namely don't store intervals, but only start times. Then, have a view that helps with displaying the intervals.
CREATE TABLE intervalStarts
(
ItemId int,
IntervalId int,
StartDate datetime
)
CREATE VIEW intervals
AS
with cte as (
select ItemId, IntervalId, StartDate,
row_number() over(partition by IntervalId order by isnull(StartDate,'1753-01-01')) row
from intervalStarts
)
select c1.ItemId, c1.IntervalId, c1.StartDate,
dateadd(dd,-1,c2.StartDate) as 'EndDate'
from cte c1
left join cte c2 on c1.IntervalId=c2.IntervalId
and c1.row=c2.row-1
So, sample data might look like:
INSERT INTO intervalStarts
select 1, 1, null union
select 2, 1, '2011-01-16' union
select 3, 1, '2011-01-26' union
select 4, 2, null union
select 5, 2, '2011-01-26' union
select 6, 2, '2011-01-14'
and a simple SELECT * FROM intervals yields:
ItemId | IntervalId | StartDate | EndDate
1 | 1 | null | 2011-01-15
2 | 1 | 2011-01-16 | 2011-01-25
3 | 1 | 2011-01-26 | null
4 | 2 | null | 2011-01-13
6 | 2 | 2011-01-14 | 2011-01-25
5 | 2 | 2011-01-26 | null

Related

SCD Type 2 - Handling Intraday changes?

I have a merge statement that builds my SCD type 2 table each night. This table must house all historical changes made in the source system and create a new row with the date from/date to columns populated along with the "islatest" flag. I have come across an issue today that I am not really sure how to handle.
There looks to have been multiple changes to the source table within a 24 hour period.
ID Code PAN EnterDate Cost Created
16155 1012401593331 ENRD 2015-11-05 7706.3 2021-08-17 14:34
16155 1012401593331 ENRD 2015-11-05 8584.4 2021-08-17 16:33
I use a basic merge statement to identify my changes however what would be the best approach to ensure all changes get picked up correctly? The above is giving me an error as it's trying to insert/update multiple rows with the same value
DECLARE #DateNow DATETIME = Getdate()
IF Object_id('tempdb..#meteridinsert') IS NOT NULL
DROP TABLE #meteridinsert;
CREATE TABLE #meteridinsert
(
meterid INT,
change VARCHAR(10)
);
MERGE
INTO [DIM].[Meters] AS target
using stg_meters AS source
ON target.[ID] = source.[ID]
AND target.latest=1
WHEN matched THEN
UPDATE
SET target.islatest = 0,
target.todate = #Datenow
WHEN NOT matched BY target THEN
INSERT
(
id,
code,
pan,
enterdate,
cost,
created,
[FromDate] ,
[ToDate] ,
[IsLatest]
)
VALUES
(
source.id,
source.code ,
source.pan ,
source.enterdate ,
source.cost ,
source.created ,
#Datenow ,
NULL ,
1
)
output source.id,
$action
INTO #meteridinsert;INSERT INTO [DIM].[Meters]
(
[id] ,
[code] ,
[pan] ,
[enterdate] ,
[cost] ,
[created] ,
[FromDate] ,
[ToDate] ,
[IsLatest]
)
SELECT ([id] ,[code] ,[pan] ,[enterdate] ,[cost] ,[created] , #DateNow ,NULL ,1 FROM stg_meters a
INNER JOIN #meteridinsert cid
ON a.id = cid.meterid
AND cid.change = 'UPDATE'
Maybe you can do it using merge statement, but I would prefer to use typicall update and insert approach in order to make it easier to understand (also I am not sure that merge allows you to use the same source record for update and insert...)
First of all I create the table dimscd2 to represent your dimension table
create table dimscd2
(naturalkey int, descr varchar(100), startdate datetime, enddate datetime)
And then I insert some records...
insert into dimscd2 values
(1,'A','2019-01-12 00:00:00.000', '2020-01-01 00:00:00.000'),
(1,'B','2020-01-01 00:00:00.000', NULL)
As you can see, the "current" is the one with descr='B' because it has an enddate NULL (I do recommend you to use surrogate keys for each record... This is just an incremental key for each record of your dimension, and the fact table must be linked with this surrogate key in order to reflect the status of the fact in the moment when happened).
Then, I have created some dummy data to represent the source data with the changes for the same natural key
-- new data (src_data)
select 1 as naturalkey,'C' as descr, cast('2020-01-02 00:00:00.000' as datetime) as dt into src_data
union all
select 1 as naturalkey,'D' as descr, cast('2020-01-03 00:00:00.000' as datetime) as dt
After that, I have created a temp table (##tmp) with this query to set the enddate for each record:
-- tmp table
select naturalkey, descr, dt,
lead(dt,1,0) over (partition by naturalkey order by dt) enddate,
row_number() over (partition by naturalkey order by dt) rn
into ##tmp
from src_data
The LEAD function takes the next start date for the same natural key, ordered by date (dt).
The ROW_NUMBER marks with 1 the oldest record in the source data for the natural key in the dimension.
Then, I proceed to close the "current" record using update
update d
set enddate = t.dt
from dimscd2 d
join ##tmp t
on d.naturalkey = t.naturalkey
and d.enddate is null
and t.rn = 1
And finally I add the new source data to the dimension with insert
insert into dimscd2
select naturalkey, descr, dt,
case enddate when '1900-00-00' then null else enddate end
from ##tmp
Final result is obtained with the query:
select * from dimscd2
You can test on this db<>fiddle

Efficient way of storing date ranges

I need to store simple data - suppose I have some products with codes as a primary key, some properties and validity ranges. So data could look like this:
Products
code value begin_date end_date
10905 13 2005-01-01 2016-12-31
10905 11 2017-01-01 null
Those ranges are not overlapping, so on every date I have a list of unique products and their properties. So to ease the use of it I've created the function:
create function dbo.f_Products
(
#date date
)
returns table
as
return (
select
from dbo.Products as p
where
#date >= p.begin_date and
#date <= p.end_date
)
This is how I'm going to use it:
select
*
from <some table with product codes> as t
left join dbo.f_Products(#date) as p on
p.code = t.product_code
This is all fine, but how I can let optimizer know that those rows are unique to have better execution plan?
I did some googling, and found a couple of really nice articles for DDL which prevents storing overlapping ranges in the table:
Self-maintaining, Contiguous Effective Dates in Temporal Tables
Storing intervals of time with no overlaps
But even if I try those constraint I see that optimizer cannot understand that resulting recordset will return unique codes.
What I'd like to have is certain approach which gives me basically the same performance as if I stored those products list on certain date and selected it with date = #date.
I know that some RDMBS (like PostgreSQL) have special data types for this (Range Types). But SQL Server doesn't have anything like this.
Am I missing something or there're no way to do this properly in SQL Server?
You can create an indexed view that contains a row for each code/date in the range.
ProductDate (indexed view)
code value date
10905 13 2005-01-01
10905 13 2005-01-02
10905 13 ...
10905 13 2016-12-31
10905 11 2017-01-01
10905 11 2017-01-02
10905 11 ...
10905 11 Today
Like this:
create schema digits
go
create table digits.Ones (digit tinyint not null primary key)
insert into digits.Ones (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)
create table digits.Tens (digit tinyint not null primary key)
insert into digits.Tens (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)
create table digits.Hundreds (digit tinyint not null primary key)
insert into digits.Hundreds (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)
create table digits.Thousands (digit tinyint not null primary key)
insert into digits.Thousands (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)
create table digits.TenThousands (digit tinyint not null primary key)
insert into digits.TenThousands (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)
go
create schema info
go
create table info.Products (code int not null, [value] int not null, begin_date date not null, end_date date null, primary key (code, begin_date))
insert into info.Products (code, [value], begin_date, end_date) values
(10905, 13, '2005-01-01', '2016-12-31'),
(10905, 11, '2017-01-01', null)
create table info.DateRange ([begin] date not null, [end] date not null, [singleton] bit not null default(1) check ([singleton] = 1))
insert into info.DateRange ([begin], [end]) values ((select min(begin_date) from info.Products), getdate())
go
create view info.ProductDate with schemabinding
as
select
p.code,
p.value,
dateadd(day, ones.digit + tens.digit*10 + huns.digit*100 + thos.digit*1000 + tthos.digit*10000, dr.[begin]) as [date]
from
info.DateRange as dr
cross join
digits.Ones as ones
cross join
digits.Tens as tens
cross join
digits.Hundreds as huns
cross join
digits.Thousands as thos
cross join
digits.TenThousands as tthos
join
info.Products as p on
dateadd(day, ones.digit + tens.digit*10 + huns.digit*100 + thos.digit*1000 + tthos.digit*10000, dr.[begin]) between p.begin_date and isnull(p.end_date, datefromparts(9999, 12, 31))
go
create unique clustered index idx_ProductDate on info.ProductDate ([date], code)
go
select *
from info.ProductDate with (noexpand)
where
date = '2014-01-01'
drop view info.ProductDate
drop table info.Products
drop table info.DateRange
drop table digits.Ones
drop table digits.Tens
drop table digits.Hundreds
drop table digits.Thousands
drop table digits.TenThousands
drop schema digits
drop schema info
go
A solution without gaps might be this:
DECLARE #tbl TABLE(ID INT IDENTITY,[start_date] DATE);
INSERT INTO #tbl VALUES({d'2016-10-01'}),({d'2016-09-01'}),({d'2016-08-01'}),({d'2016-07-01'}),({d'2016-06-01'});
SELECT * FROM #tbl;
DECLARE #DateFilter DATE={d'2016-08-13'};
SELECT TOP 1 *
FROM #tbl
WHERE [start_date]<=#DateFilter
ORDER BY [start_date] DESC
Important: Be sure that there is an (unique) index on start_date
UPDATE: for different products
DECLARE #tbl TABLE(ID INT IDENTITY,ProductID INT,[start_date] DATE);
INSERT INTO #tbl VALUES
--product 1
(1,{d'2016-10-01'}),(1,{d'2016-09-01'}),(1,{d'2016-08-01'}),(1,{d'2016-07-01'}),(1,{d'2016-06-01'})
--product 1
,(2,{d'2016-10-17'}),(2,{d'2016-09-16'}),(2,{d'2016-08-15'}),(2,{d'2016-07-10'}),(2,{d'2016-06-11'});
DECLARE #DateFilter DATE={d'2016-08-13'};
WITH PartitionedCount AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY ProductID ORDER BY [start_date] DESC) AS Nr
,*
FROM #tbl
WHERE [start_date]<=#DateFilter
)
SELECT *
FROM PartitionedCount
WHERE Nr=1
First you need to create a unique clustered index for (begin_date, end_date, code)
Then SQL engine will be able to do INDEX SEEK.
Additionally, you can also try to create a view for dbo.Products table to join that table with pre-populated dbo.Dates table.
select p.code, p.val, p.begin_date, p.end_date, d.[date]
from dbo.Product as p
inner join dbo.dates d on p.begin_date <= d.[date] and d.[date] <= p.end_date
Then in your function, you use that view as "where #date = view.date". The result can be either better or slightly worse... it depends on the actual data.
You also can try to make that view indexed (depends on how often it is being updated).
Alternatively, you can have better performance if you populate dbo.Products table for every date in the [begin_date] .. [end_date] range.
Approach with ROW_NUMBER scans the whole Products table once. It is the best method if you have a lot of product codes in the Products table and few validity ranges for each code.
WITH
CTE_rn
AS
(
SELECT
code
,value
,ROW_NUMBER() OVER (PARTITION BY code ORDER BY begin_date DESC) AS rn
FROM Products
WHERE begin_date <= #date
)
SELECT *
FROM
<some table with product codes> as t
LEFT JOIN CTE_rn ON CTE_rn.code = t.product_code AND CTE_rn.rn = 1
;
If you have few product codes and a lot of validity ranges for each code in the Products table, then it is better to seek the Products table for each code using OUTER APPLY.
SELECT *
FROM
<some table with product codes> as t
OUTER APPLY
(
SELECT TOP(1)
Products.value
FROM Products
WHERE
Products.code = t.product_code
AND Products.begin_date <= #date
ORDER BY Products.begin_date DESC
) AS A
;
Both variants need unique index on (code, begin_date DESC) include (value).
Note how the queries don't even look at end_date, because they assume that intervals don't have gaps. They will work in SQL Server 2008.
EDIT: My original answer was using an INNER JOIN, but the questioner wanted a LEFT JOIN.
CREATE TABLE Products
(
[Code] INT NOT NULL
, [Value] VARCHAR(30) NOT NULL
, Begin_Date DATETIME NOT NULL
, End_Date DATETIME NULL
)
/*
Products
code value begin_date end_date
10905 13 2005-01-01 2016-12-31
10905 11 2017-01-01 null
*/
INSERT INTO Products ([Code], [Value], Begin_Date, End_Date) VALUES (10905, 13, '2005-01-01', '2016-12-31')
INSERT INTO Products ([Code], [Value], Begin_Date, End_Date) VALUES (10905, 11, '2017-01-01', NULL)
CREATE NONCLUSTERED INDEX SK_ProductDate ON Products ([Code], Begin_Date, End_Date) INCLUDE ([Value])
CREATE TABLE SomeTableWithProductCodes
(
[CODE] INT NOT NULL
)
INSERT INTO SomeTableWithProductCodes ([Code]) VALUES (10905)
Here is a prototypical query, with a date predicate. Note that there are more optimal ways to do this in a bulletproof fashion, using a "less than" operator on the upper bound, but that's a different discussion.
SELECT
P.[Code]
, P.[Value]
, P.[Begin_Date]
, P.[End_Date]
FROM
SomeTableWithProductCodes ST
LEFT JOIN Products AS P ON
ST.[Code] = P.[Code]
AND '2016-06-30' BETWEEN P.[Begin_Date] AND ISNULL(P.[End_Date], '9999-12-31')
This query will perform an Index Seek on the Product table.
Here is a SQL Fiddle: SQL Fiddle - Products and Dates

SQL Server: Populate (Minute) Date Dimension table

I'm working on a script to populate a very simple date dimension table whose granularity is down to the minute level. This table should ultimately contain a smalldatetime representing every minute from 1/1/2000 to 12/31/2015 23:59.
Here is the definition for the table:
CREATE TABLE [dbo].[REF_MinuteDimension] (
[TimeStamp] SMALLDATETIME NOT NULL,
CONSTRAINT [PK_REF_MinuteDimension] PRIMARY KEY CLUSTERED ([TimeStamp] ASC) WITH (FILLFACTOR = 100)
);
Here is the latest revision of the script:
DECLARE #CurrentTimeStamp AS SMALLDATETIME;
SELECT TOP(1) #CurrentTimeStamp = MAX([TimeStamp]) FROM [dbo].[REF_MinuteDimension];
IF #CurrentTimeStamp IS NOT NULL
SET #CurrentTimeStamp = DATEADD(MINUTE, 1, #CurrentTimeStamp);
ELSE
SET #CurrentTimeStamp = '1/1/2000 00:00';
ALTER TABLE [dbo].[REF_MinuteDimension] DROP CONSTRAINT [PK_REF_MinuteDimension];
WHILE #CurrentTimeStamp < '12/31/2050 23:59'
BEGIN
;WITH DateIndex ([TimeStamp]) AS
(
SELECT #CurrentTimeStamp
UNION ALL
SELECT DATEADD(MINUTE, 1, [TimeStamp]) FROM DateIndex di WHERE di.[TimeStamp] < dbo.fGetYearEnd(#CurrentTimeStamp)
)
INSERT INTO [dbo].[REF_MinuteDimension] ([TimeStamp])
SELECT di.[TimeStamp] FROM DateIndex di
OPTION (MAXRECURSION 0);
SET #CurrentTimeStamp = DATEADD(YEAR, 1, dbo.fGetYearBegin(#CurrentTimeStamp))
END
ALTER TABLE [dbo].[REF_MinuteDimension] ADD CONSTRAINT [PK_REF_MinuteDimension] PRIMARY KEY CLUSTERED ([TimeStamp] ASC) WITH (FILLFACTOR = 100);
A couple of things to point out:
I've added logic to drop and subsequently re-add the primary key constraint on the table, hoping to boost the performance.
I've added logic to chunk the INSERTS into yearly batches to minimize the impact on the transaction log. On a side note, we're using the SIMPLE recovery model.
Performance is so-so and takes around 15-20 minutes to complete. Any hints/suggestions on how this script could be "tuned up" or improved?
Also, for completeness here are fGetYearBegin and fGetYearEnd:
CREATE FUNCTION dbo.fGetYearBegin
(
#dtConvertDate datetime
)
RETURNS smalldatetime
AS
BEGIN
RETURN DATEADD(YEAR, DATEDIFF(YEAR, 0, #dtConvertDate), 0)
END
CREATE FUNCTION dbo.fGetYearEnd
(
#dtConvertDate datetime
)
RETURNS smalldatetime
AS
BEGIN
RETURN DATEADD(MINUTE, -1, DATEADD(YEAR, 1, dbo.fGetYearBegin(#dtConvertDate)))
END
This takes 11 seconds on my server...
If Object_ID('tempdb..#someNumbers') Is Not Null Drop Table #someNumbers;
Create Table #someNumbers (id Int);
Declare #minutes Int,
#days Int;
Select #minutes = DateDiff(Minute,'1/1/2000 00:00','1/2/2000 00:00'),
#days = DateDiff(Day,'1/1/2000 00:00','1/1/2051 00:00');
With Base As
(
Select 1 As seedID
Union All
Select 1
), Build As
(
Select seedID
From Base
Union All
Select b.seedID + 1
From Build b
Cross Join Base b2
Where b.SeedID < 14
)
Insert #someNumbers
Select Row_Number() Over (Order By seedID) As id
From Build
Option (MaxRecursion 0);
If Object_ID('tempdb..#values') Is Not Null Drop Table #values;
Create Table #values ([TimeStamp] SmallDateTime NOT NULL);
With Dates As
(
Select DateAdd(Day,id-1,'1/1/2000 00:00') As [TimeStamp]
From #someNumbers
Where id <= #days
)
Insert #values
Select Convert(SmallDateTime,DateAdd(Minute,id-1,[TimeStamp]))
From Dates d
Join #someNumbers sn
On sn.id <= #minutes
Order By 1

Select one row per specific time

I have a table that looks like this:
ID UserID DateTime TypeID
1 1 1/1/2010 10:00:00 1
2 2 1/1/2010 10:01:50 1
3 1 1/1/2010 10:02:50 1
4 1 1/1/2010 10:03:50 1
5 1 1/1/2010 11:00:00 1
6 2 1/1/2010 11:00:50 1
I need to query all users where their typeID is 1, but have only one row per 15 mins
For example, the result should be:
1 1 1/1/2010 10:00:00 1
2 2 1/1/2010 10:01:50 1
5 1 1/1/2010 11:00:00 1
6 2 1/1/2010 11:00:50 1
IDs 3 & 4 are not shown because 15 min haven't been passed since the last record for the specific userID.
IDs 1 & 5 are shown because 15 minutes has been passed for this specific userID
Same as for IDs 2 & 6.
How can I do it?
Thanks
Try this:
select * from
(
select ID, UserID,
Max(DateTime) as UpperBound,
Min(DateTime) as LowerBound,
TypeID
from the_table
where TypeID=1
group by ID,UserID,TypeID
) t
where datediff(mi,LowerBound,UpperBound)>=15
EDIT: SINCE MY ABOVE ATTEMPT WAS WRONG, I'm adding one more approach using a Sql table-valued Function that does not require recursion, since, understandable, it's a big concern.
Step 1: Create a table-type as follows (LoginDate is the DateTime column in Shay's example - DateTime name conflicts with a SQL data type and I think it's wise to avoid these conflicts)
CREATE TYPE [dbo].[TVP] AS TABLE(
[ID] [int] NOT NULL,
[UserID] [int] NOT NULL,
[LoginDate] [datetime] NOT NULL,
[TypeID] [int] NOT NULL
)
GO
Step 2: Create the following Function:
CREATE FUNCTION [dbo].[fnGetLoginFreq]
(
-- notice: TVP is the type (declared above)
#TVP TVP readonly
)
RETURNS
#Table_Var TABLE
(
-- This will be our result set
ID int,
UserId int,
LoginTime datetime,
TypeID int,
RowNumber int
)
AS
BEGIN
--We will insert records in this table as we go through the rows in the
--table passed in as parameter and decide that we should add an entry because
--15' had elapsed between logins
DECLARE #temp table
(
ID int,
UserId int,
LoginTime datetime,
TypeID int
)
-- seems silly, but is not because we need to add a row_number column to help
-- in our iteration and table-valued paramters cannot be modified inside the function
insert into #Table_var
select ID,UserID,Logindate,TypeID,row_number() OVER(ORDER BY UserID,LoginDate) AS [RowNumber]
from #TVP order by UserID asc,LoginDate desc
declare #Index int,#End int,#CurrentLoginTime datetime, #NextLoginTime datetime, #CurrentUserID int , #NextUserID int
select #Index=1,#End=count(*) from #Table_var
while(#Index<=#End)
begin
select #CurrentLoginTime=LoginTime,#CurrentUserID=UserID from #Table_var where RowNumber=#Index
select #NextLoginTime=LoginTime,#NextUserID=UserID from #Table_var where RowNumber=(#Index+1)
if(#CurrentUserID=#NextUserID)
begin
if( abs(DateDiff(mi,#CurrentLoginTime,#NextLoginTime))>=15)
begin
insert into #temp
select ID,UserID,LoginTime,TypeID
from #Table_var
where RowNumber=#Index
end
END
else
bEGIN
insert into #temp
select ID,UserID,LoginTime,TypeID
from #Table_var
where RowNumber=#Index and UserID=#CurrentUserID
END
if(#Index=#End)--last element?
begin
insert into #temp
select ID,UserID,LoginTime,TypeID
from #Table_var
where RowNumber=#Index and not
abs((select datediff(mi,#CurrentLoginTime,max(LoginTime)) from #temp where UserID=#CurrentUserID))<=14
end
select #Index=#Index+1
end
delete from #Table_var
insert into #Table_var
select ID, UserID ,LoginTime ,TypeID ,row_number() OVER(ORDER BY UserID,LoginTime) AS 'RowNumber'
from #temp
return
END
Step 3: Give it a spin
declare #TVP TVP
INSERT INTO #TVP
select ID,UserId,[DateType],TypeID from Shays_table where TypeID=1 --AND any other date restriction you want to add
select * from fnGetLoginFreq(#TVP) order by LoginTime asc
My tests returned this:
ID UserId LoginTime TypeID RowNumber
2 2 2010-01-01 10:01:50.000 1 3
4 1 2010-01-01 10:03:50.000 1 1
5 1 2010-01-01 11:00:00.000 1 2
6 2 2010-01-01 11:00:50.000 1 4
How about this, it's fairly straight forward and gives you the result you need:
SELECT ID, UserID, [DateTime], TypeID
FROM Users
WHERE Users.TypeID = 1
AND NOT EXISTS (
SELECT TOP 1 1
FROM Users AS U2
WHERE U2.ID <> Users.ID
AND U2.UserID = Users.UserID
AND U2.[DateTime] BETWEEN DATEADD(MI, -15, Users.[DateTime]) AND Users.[DateTime]
AND U2.TypeID = 1)
The NOT EXISTS restricts to only show records that have no record within 15minutes before them, so you will see the first record in a block rather than one every 15mins.
Edit: Since you want to see one every 15mins this should do without using recursion:
SELECT Users.ID, Users.UserID, Users.[DateTime], Users.TypeID
FROM
(
SELECT MIN(ID) AS ID, UserID,
DATEADD(minute, DATEDIFF(minute,0,[DateTime]) / 15 * 15, 0) AS [DateTime]
FROM Users
GROUP BY UserID, DATEADD(minute, DATEDIFF(minute,0,[DateTime]) / 15 * 15, 0)
) AS Dates
INNER JOIN Users AS Users ON Users.ID = Dates.ID
WHERE Users.TypeID = 1
AND NOT EXISTS (
SELECT TOP 1 1
FROM
(
SELECT MIN(ID) AS ID, UserID,
DATEADD(minute, DATEDIFF(minute,0,[DateTime]) / 15 * 15, 0) AS [DateTime]
FROM Users
GROUP BY UserID, DATEADD(minute, DATEDIFF(minute,0,[DateTime]) / 15 * 15, 0)
) AS Dates2
INNER JOIN Users AS U2 ON U2.ID = Dates2.ID
WHERE U2.ID <> Users.ID
AND U2.UserID = Users.UserID
AND U2.[DateTime] BETWEEN DATEADD(MI, -15, Users.[DateTime]) AND Users.[DateTime]
AND U2.TypeID = 1
)
ORDER BY Users.DateTime
If this doesn't work please post more sample data so that I can see what is missing.
Edit2 same as directly above but just using CTE now instead for improved readability and help improve maintainability, also I improved it to highlighted where you would also restrict the Dates table by whatever DateTime range that you would be restricting to the main query:
WITH Dates(ID, UserID, [DateTime])
AS
(
SELECT MIN(ID) AS ID, UserID,
DATEADD(minute, DATEDIFF(minute,0,[DateTime]) / 15 * 15, 0) AS [DateTime]
FROM Users
WHERE Users.TypeID = 1
--AND Users.[DateTime] BETWEEN #StartDateTime AND #EndDateTime
GROUP BY UserID, DATEADD(minute, DATEDIFF(minute,0,[DateTime]) / 15 * 15, 0)
)
SELECT Users.ID, Users.UserID, Users.[DateTime], Users.TypeID
FROM Dates
INNER JOIN Users ON Users.ID = Dates.ID
WHERE Users.TypeID = 1
--AND Users.[DateTime] BETWEEN #StartDateTime AND #EndDateTime
AND NOT EXISTS (
SELECT TOP 1 1
FROM Dates AS Dates2
INNER JOIN Users AS U2 ON U2.ID = Dates2.ID
WHERE U2.ID <> Users.ID
AND U2.UserID = Users.UserID
AND U2.[DateTime] BETWEEN DATEADD(MI, -15, Users.[DateTime]) AND Users.[DateTime]
AND U2.TypeID = 1
)
ORDER BY Users.DateTime
Also as a performance note, whenever dealing with something that might end up being recursive like this potentially could be (from other answers), you should straight away be considering if you are able to restrict the main query by a date range in general even if it's a whole year or longer range
You can use a recursive CTE for this though I would also evaluate a cursor if the result set is at all large as it may work out more efficient.
I've left out the ID column in my answer. If you really need it it would be possible to add it. It just makes the anchor part of the recursive CTE a bit more unwieldy.
DECLARE #T TABLE
(
ID INT PRIMARY KEY,
UserID INT,
[DateTime] DateTime,
TypeID INT
)
INSERT INTO #T
SELECT 1,1,'20100101 10:00:00', 1 union all
SELECT 2,2,'20100101 10:01:50', 1 union all
SELECT 3,1,'20100101 10:02:50', 1 union all
SELECT 4,1,'20100101 10:03:50', 1 union all
SELECT 5,1,'20100101 11:00:00', 1 union all
SELECT 6,2,'20100101 11:00:50', 1;
WITH RecursiveCTE
AS (SELECT UserID,
MIN([DateTime]) As [DateTime],
1 AS TypeID
FROM #T
WHERE TypeID = 1
GROUP BY UserID
UNION ALL
SELECT UserID,
[DateTime],
TypeID
FROM (
--Can't use TOP directly
SELECT T.*,
rn = ROW_NUMBER() OVER (PARTITION BY T.UserID ORDER BY
T.[DateTime])
FROM #T T
JOIN RecursiveCTE R
ON R.UserID = T.UserID
AND T.[DateTime] >=
DATEADD(MINUTE, 15, R.[DateTime])) R
WHERE R.rn = 1)

How to delete when the parameter varies by group without looping? (T-SQL)

Imagine I have these columns in a table:
id int NOT NULL IDENTITY PRIMARY KEY,
instant datetime NOT NULL,
foreignId bigint NOT NULL
For each group (grouped by foreignId) I want to delete all the rows which are 1 hour older than the max(instant). Thus, for each group the parameter is different.
Is it possible without looping?
Yep, it's pretty straightforward. Try this:
DELETE mt
FROM MyTable AS mt
WHERE mt.instant <= DATEADD(hh, -1, (SELECT MAX(instant)
FROM MyTable
WHERE ForeignID = mt.ForeignID))
Or this:
;WITH MostRecentKeys
AS
(SELECT ForeignID, MAX(instant) AS LatestInstant
FROM MyTable)
DELETE mt
FROM MyTable AS mt
JOIN MostRecentKeys mrk ON mt.ForeignID = mrt.ForeignID
AND mt.Instant <= DATEADD(hh, -1, mrk.LatestInstant)
DELETE
FROM mytable
FROM mytable mto
WHERE instant <
(
SELECT DATEADD(hour, -1, MAX(instant))
FROM mytable mti
WHERE mti.foreignid = mto.foreignid
)
Note double FROM clause, it's on purpose, otherwise you won't be able to alias the table you're deleting from.
The sample data to check:
DECLARE #mytable TABLE
(
id INT NOT NULL PRIMARY KEY,
instant DATETIME NOT NULL,
foreignID INT NOT NULL
)
INSERT
INTO #mytable
SELECT 1, '2009-22-07 10:00:00', 1
UNION ALL
SELECT 2, '2009-22-07 09:30:00', 1
UNION ALL
SELECT 3, '2009-22-07 08:00:00', 1
UNION ALL
SELECT 4, '2009-22-07 10:00:00', 2
UNION ALL
SELECT 5, '2009-22-07 08:00:00', 2
UNION ALL
SELECT 6, '2009-22-07 07:30:00', 2
DELETE
FROM #mytable
FROM #mytable mto
WHERE instant <
(
SELECT DATEADD(hour, -1, MAX(instant))
FROM #mytable mti
WHERE mti.foreignid = mto.foreignid
)
SELECT *
FROM #mytable
1 2009-07-22 10:00:00.000 1
2 2009-07-22 09:30:00.000 1
4 2009-07-22 10:00:00.000 2
I'm going to assume when you say '1 hour older than the max(instant)' you mean '1 hour older than the max(instant) for that foreignId'.
Given that, there's almost certainly a more succinct way than this, but it will work:
DELETE
TableName
WHERE
DATEADD(hh, 1, instant) < (SELECT MAX(instant)
FROM TableName T2
WHERE T2.foreignId = TableName.foreignId)
The inner subquery is called a 'correlated subquery', if you want to look for more info. The way it works is that for each row under consideration by the outer query, it is the foreignId of that row that gets referenced by the subquery.