Search for closes index on SQL table - sql

I have a hypothetical SQL table "EVENTS", with two columns, a UUID index column, and a DateTime column,
The table is populated with values ranging from 1900-01-01 to today, it is not ordered, there are numerous dates missing.
The query that I have to run is basically 'retrieve all events that happened at the requested date (start to the end of the day) or the closest previous date'
If I were looking for all events in a day that I know that exists in the database it would be something as simple as:
SELECT * FROM Events e
WHERE
e.date BETWEEN $START_OF_DAY AND $END_OF_DAY;
But if that date doesn't exist I must retrieve the latest date up to the requested date.

Grab current day, but if no records found, will return all records from the nearest previous day with records.
So in my sample data, Jan 2 returns 3 events dated Jan 1
SQL Server Solution
DECLARE #Input DATE = '2022-01-02' /*Try Jan 1,2,3, or 4*/
DROP TABLE IF EXISTS #Event
CREATE TABLE #Event (ID INT IDENTITY(1,1),EventDateTime DATETIME)
INSERT INTO #Event
VALUES
('2022-01-01 08:00')
,('2022-01-01 09:00')
,('2022-01-01 10:00')
,('2022-01-03 12:00')
SELECT TOP (1) WITH TIES *
FROM #Event AS A
CROSS APPLY (SELECT EventDate = CAST(EventDateTime AS DATE)) AS B
WHERE B.EventDate <= #Input
ORDER BY B.EventDate DESC
SQL Fiddle wasn't letting me create a variable, but here's a the code conceptually for a more efficient version for MySQL. It grabs the desired date range in the first query, then uses it to filter in the second query. I think it should perform far better than the accepted answer assuming you have an index on EventDateTime
CREATE TABLE Event (
ID MEDIUMINT NOT NULL AUTO_INCREMENT
,EventDateTime DATETIME
,PRIMARY KEY (ID));
INSERT INTO Event (EventDateTime)
VALUES
('2022-01-01 08:00')
,('2022-01-01 09:00')
,('2022-01-01 10:00')
,('2022-01-03 12:00');
/*Need to save these off to variables to use in later query*/
SELECT TIMESTAMP(CAST(EventDateTime AS DATE)) AS StartRange
,TIMESTAMP(CAST(EventDateTime AS DATE)) + INTERVAL 1 DAY AS EndRange
FROM Event
WHERE EventDateTime < DATE_ADD('2022-01-04' /*Input*/,INTERVAL 1 DAY)
ORDER BY EventDateTime DESC
LIMIT 1;
SELECT *
FROM Event
WHERE EventDateTime >= StartRange
AND EventDateTime < EndRange

Calculate the most recent date, and do a self join. Although I'm using MYSQL, I believe this is the most generic workaround
CREATE TABLE d0207Event (ID INT ,EventDateTime DATETIME)
INSERT INTO d0207Event
VALUES
(1,'2022-01-01 08:00')
,(2,'2022-01-01 09:00')
,(3,'2022-01-01 10:00')
,(4,'2022-01-03 12:00')
INSERT INTO d0207Event
VALUES
(5, '2021-12-12 08:00');
select t1.*
from d0207Event t1,
(
select min(t1.dat) mindat
from (
select t1.*,
DATEDIFF('2022-01-02', cast(t1.EventDateTime as date)) dat
from d0207Event t1
) t1
where t1.dat >= 0
) t2
where DATEDIFF('2022-01-02', cast(t1.EventDateTime as date)) = t2.mindat
;
There are also many advanced syntaxes that can solve this problem better, depending on which DB you use and your specific application scenario
It seems that you can also choose a database with more syntax, then using an analytic function usually solves the efficiency problem well, since the EVENT table only needs to be queried once.
CREATE TABLE Event (
ID MEDIUMINT NOT NULL AUTO_INCREMENT
,EventDateTime DATETIME
,PRIMARY KEY (ID));
INSERT INTO Event (EventDateTime)
VALUES
('2022-01-01 08:00')
,('2022-01-01 09:00')
,('2022-01-01 10:00')
,('2022-01-03 12:00');
select *
from (
select t1.*,
first_value(cast(t1.EventDateTime as date))
over(order by cast(t1.EventDateTime as date) desc) fv
from event t1
where cast(t1.EventDateTime as date) <= '2022-01-03'
) t1
where cast(t1.EventDateTime as date) = fv
Creating a functional index cast(t1.EventDateTime as date), or creating a virtual column directly can make the query easier, otherwise using date_add() is a good way

Related

SCD Type 2 - Handling Intraday changes?

I have a merge statement that builds my SCD type 2 table each night. This table must house all historical changes made in the source system and create a new row with the date from/date to columns populated along with the "islatest" flag. I have come across an issue today that I am not really sure how to handle.
There looks to have been multiple changes to the source table within a 24 hour period.
ID Code PAN EnterDate Cost Created
16155 1012401593331 ENRD 2015-11-05 7706.3 2021-08-17 14:34
16155 1012401593331 ENRD 2015-11-05 8584.4 2021-08-17 16:33
I use a basic merge statement to identify my changes however what would be the best approach to ensure all changes get picked up correctly? The above is giving me an error as it's trying to insert/update multiple rows with the same value
DECLARE #DateNow DATETIME = Getdate()
IF Object_id('tempdb..#meteridinsert') IS NOT NULL
DROP TABLE #meteridinsert;
CREATE TABLE #meteridinsert
(
meterid INT,
change VARCHAR(10)
);
MERGE
INTO [DIM].[Meters] AS target
using stg_meters AS source
ON target.[ID] = source.[ID]
AND target.latest=1
WHEN matched THEN
UPDATE
SET target.islatest = 0,
target.todate = #Datenow
WHEN NOT matched BY target THEN
INSERT
(
id,
code,
pan,
enterdate,
cost,
created,
[FromDate] ,
[ToDate] ,
[IsLatest]
)
VALUES
(
source.id,
source.code ,
source.pan ,
source.enterdate ,
source.cost ,
source.created ,
#Datenow ,
NULL ,
1
)
output source.id,
$action
INTO #meteridinsert;INSERT INTO [DIM].[Meters]
(
[id] ,
[code] ,
[pan] ,
[enterdate] ,
[cost] ,
[created] ,
[FromDate] ,
[ToDate] ,
[IsLatest]
)
SELECT ([id] ,[code] ,[pan] ,[enterdate] ,[cost] ,[created] , #DateNow ,NULL ,1 FROM stg_meters a
INNER JOIN #meteridinsert cid
ON a.id = cid.meterid
AND cid.change = 'UPDATE'
Maybe you can do it using merge statement, but I would prefer to use typicall update and insert approach in order to make it easier to understand (also I am not sure that merge allows you to use the same source record for update and insert...)
First of all I create the table dimscd2 to represent your dimension table
create table dimscd2
(naturalkey int, descr varchar(100), startdate datetime, enddate datetime)
And then I insert some records...
insert into dimscd2 values
(1,'A','2019-01-12 00:00:00.000', '2020-01-01 00:00:00.000'),
(1,'B','2020-01-01 00:00:00.000', NULL)
As you can see, the "current" is the one with descr='B' because it has an enddate NULL (I do recommend you to use surrogate keys for each record... This is just an incremental key for each record of your dimension, and the fact table must be linked with this surrogate key in order to reflect the status of the fact in the moment when happened).
Then, I have created some dummy data to represent the source data with the changes for the same natural key
-- new data (src_data)
select 1 as naturalkey,'C' as descr, cast('2020-01-02 00:00:00.000' as datetime) as dt into src_data
union all
select 1 as naturalkey,'D' as descr, cast('2020-01-03 00:00:00.000' as datetime) as dt
After that, I have created a temp table (##tmp) with this query to set the enddate for each record:
-- tmp table
select naturalkey, descr, dt,
lead(dt,1,0) over (partition by naturalkey order by dt) enddate,
row_number() over (partition by naturalkey order by dt) rn
into ##tmp
from src_data
The LEAD function takes the next start date for the same natural key, ordered by date (dt).
The ROW_NUMBER marks with 1 the oldest record in the source data for the natural key in the dimension.
Then, I proceed to close the "current" record using update
update d
set enddate = t.dt
from dimscd2 d
join ##tmp t
on d.naturalkey = t.naturalkey
and d.enddate is null
and t.rn = 1
And finally I add the new source data to the dimension with insert
insert into dimscd2
select naturalkey, descr, dt,
case enddate when '1900-00-00' then null else enddate end
from ##tmp
Final result is obtained with the query:
select * from dimscd2
You can test on this db<>fiddle

Optimize this query without using not exist repeatably, is there a better way to write this query?

For example I have three table where say DataTable1, DataTable2 and DataTable3
and need to filter it from DataRange table, every time I have used NOT exist as shown below,
Is there a better way to write this.
Temp table to hold some daterange which is used for fiter:
Declare #DateRangeTable as Table(
StartDate datetime,
EndDate datetime
)
Some temp table which will hold data on which we need to apply date range filter
INSERT INTO #DateRangeTable values
('07/01/2020','07/04/2020'),
('07/06/2020','07/08/2020');
/*Table 1 which will hold some data*/
Declare #DataTable1 as Table(
Id numeric,
Date datetime
)
INSERT INTO #DataTable1 values
(1,'07/09/2020'),
(2,'07/06/2020');
Declare #DataTable2 as Table(
Id numeric,
Date datetime
)
INSERT INTO #DataTable2 values
(1,'07/10/2020'),
(2,'07/06/2020');
Declare #DataTable3 as Table(
Id numeric,
Date datetime
)
INSERT INTO #DataTable3 values
(1,'07/11/2020'),
(2,'07/06/2020');
Now I want to filter data based on DateRange table, here I need some optimized way so that i don't have to use not exists mutiple times, In real senario, I have mutiple tables where I have to filter based on the daterange table.
Select * from #DataTable1
where NOT EXISTS(
Select 1 from #DateRangeTable
where [Date] between StartDate and EndDate
)
Select * from #DataTable2
where NOT EXISTS(
Select 1 from #DateRangeTable
where [Date] between StartDate and EndDate
)
Select * from #DataTable3
where NOT EXISTS(
Select 1 from #DateRangeTable
where [Date] between StartDate and EndDate
)
Instead of using NOT EXISTS you could join the date range table:
SELECT dt.*
FROM #DataTable1 dt
LEFT JOIN #DateRangeTable dr ON dt.[Date] BETWEEN dr.StartDate and dr.EndDate
WHERE dr.StartDate IS NULL
It may perform better on large tables but you would have to compare the execution plans and make sure you have indexes on the date columns.
I would write the same query... but if you can change table structure I would try to improve performance adding two columns to specify the month as an integer (I suppose is the first couple of figures).
Obviously you have to test with your data and compare the timings.
Declare #DateRangeTable as Table(
StartDate datetime,
EndDate datetime,
StartMonth tinyint,
EndMonth tinyint
)
INSERT INTO #DateRangeTable values
('07/01/2020','07/04/2020', 7, 7),
('07/06/2020','07/08/2020', 7, 7),
('07/25/2020','08/02/2020', 7, 8); // (another record with different months)
Now your queries can use the new column to try to reduce comparisons (is a tinyint, sql server can partition records if you define a secondary index for StartMonth and EndMonth):
Select * from #DataTable1
where NOT EXISTS(
Select 1 from #DateRangeTable
where (DATEPART('month', [Date]) between StartMonth and EndMonth)
and ([Date] between StartDate and EndDate)
)

Counting number of transactions within past 1 hour on a particular user

Is there any way how to (in the best case, without using cursor) count number of transactions that the same user made in previous 1 hour.
That means that for this table
CREATE TABLE #TR (PK INT, TR_DATE DATETIME, USER_PK INT)
INSERT INTO #TR VALUES (1,'2018-07-31 06:02:00.000',10)
INSERT INTO #TR VALUES (2,'2018-07-31 06:36:00.000',10)
INSERT INTO #TR VALUES (3,'2018-07-31 06:55:00.000',10)
INSERT INTO #TR VALUES (4,'2018-07-31 07:10:00.000',10)
INSERT INTO #TR VALUES (5,'2018-07-31 09:05:00.000',10)
INSERT INTO #TR VALUES (6,'2018-07-31 06:05:00.000',11)
INSERT INTO #TR VALUES (7,'2018-07-31 06:55:00.000',11)
INSERT INTO #TR VALUES (8,'2018-07-31 07:10:00.000',11)
INSERT INTO #TR VALUES (9,'2018-07-31 06:12:00.000',12)
The result should be:
The solution could be something like: COUNT(*) OVER (PARTITION BY USER_PK ORDER BY TR_DATE ROWS BETWEEN ((WHERE DATEADD(HH,-1,PRECENDING.TR_DATE) > CURRENT ROW.TR_DATE) AND CURRENT ROW ...but I know that ROWS BETWEEN can not be used like that...
I am guessing SQL Server based on the syntax. In SQL Server, you can use apply:
select t.*, tr2.result
from #tr tr outer apply
(select count(*) as result
from #tr tr2
where tr2.user_id = tr.user_id and
tr2.tr_date > dateadd(hour, -1, tr.date) and
tr2.tr_date <= tr.tr_date
) tr2;
SELECT USER_PK, COUNT(*) AS TransactionCount
FROM #TR
WHERE DATEDIFF(MINUTE, TR_DATE, GETDATE()) <= 60
AND DATEDIFF(MINUTE, TR_DATE, GETDATE()) >= 0
GROUP BY USER_PK
You can change GETDATE() with whatever you want, but they need to have the same value

Efficient way of storing date ranges

I need to store simple data - suppose I have some products with codes as a primary key, some properties and validity ranges. So data could look like this:
Products
code value begin_date end_date
10905 13 2005-01-01 2016-12-31
10905 11 2017-01-01 null
Those ranges are not overlapping, so on every date I have a list of unique products and their properties. So to ease the use of it I've created the function:
create function dbo.f_Products
(
#date date
)
returns table
as
return (
select
from dbo.Products as p
where
#date >= p.begin_date and
#date <= p.end_date
)
This is how I'm going to use it:
select
*
from <some table with product codes> as t
left join dbo.f_Products(#date) as p on
p.code = t.product_code
This is all fine, but how I can let optimizer know that those rows are unique to have better execution plan?
I did some googling, and found a couple of really nice articles for DDL which prevents storing overlapping ranges in the table:
Self-maintaining, Contiguous Effective Dates in Temporal Tables
Storing intervals of time with no overlaps
But even if I try those constraint I see that optimizer cannot understand that resulting recordset will return unique codes.
What I'd like to have is certain approach which gives me basically the same performance as if I stored those products list on certain date and selected it with date = #date.
I know that some RDMBS (like PostgreSQL) have special data types for this (Range Types). But SQL Server doesn't have anything like this.
Am I missing something or there're no way to do this properly in SQL Server?
You can create an indexed view that contains a row for each code/date in the range.
ProductDate (indexed view)
code value date
10905 13 2005-01-01
10905 13 2005-01-02
10905 13 ...
10905 13 2016-12-31
10905 11 2017-01-01
10905 11 2017-01-02
10905 11 ...
10905 11 Today
Like this:
create schema digits
go
create table digits.Ones (digit tinyint not null primary key)
insert into digits.Ones (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)
create table digits.Tens (digit tinyint not null primary key)
insert into digits.Tens (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)
create table digits.Hundreds (digit tinyint not null primary key)
insert into digits.Hundreds (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)
create table digits.Thousands (digit tinyint not null primary key)
insert into digits.Thousands (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)
create table digits.TenThousands (digit tinyint not null primary key)
insert into digits.TenThousands (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)
go
create schema info
go
create table info.Products (code int not null, [value] int not null, begin_date date not null, end_date date null, primary key (code, begin_date))
insert into info.Products (code, [value], begin_date, end_date) values
(10905, 13, '2005-01-01', '2016-12-31'),
(10905, 11, '2017-01-01', null)
create table info.DateRange ([begin] date not null, [end] date not null, [singleton] bit not null default(1) check ([singleton] = 1))
insert into info.DateRange ([begin], [end]) values ((select min(begin_date) from info.Products), getdate())
go
create view info.ProductDate with schemabinding
as
select
p.code,
p.value,
dateadd(day, ones.digit + tens.digit*10 + huns.digit*100 + thos.digit*1000 + tthos.digit*10000, dr.[begin]) as [date]
from
info.DateRange as dr
cross join
digits.Ones as ones
cross join
digits.Tens as tens
cross join
digits.Hundreds as huns
cross join
digits.Thousands as thos
cross join
digits.TenThousands as tthos
join
info.Products as p on
dateadd(day, ones.digit + tens.digit*10 + huns.digit*100 + thos.digit*1000 + tthos.digit*10000, dr.[begin]) between p.begin_date and isnull(p.end_date, datefromparts(9999, 12, 31))
go
create unique clustered index idx_ProductDate on info.ProductDate ([date], code)
go
select *
from info.ProductDate with (noexpand)
where
date = '2014-01-01'
drop view info.ProductDate
drop table info.Products
drop table info.DateRange
drop table digits.Ones
drop table digits.Tens
drop table digits.Hundreds
drop table digits.Thousands
drop table digits.TenThousands
drop schema digits
drop schema info
go
A solution without gaps might be this:
DECLARE #tbl TABLE(ID INT IDENTITY,[start_date] DATE);
INSERT INTO #tbl VALUES({d'2016-10-01'}),({d'2016-09-01'}),({d'2016-08-01'}),({d'2016-07-01'}),({d'2016-06-01'});
SELECT * FROM #tbl;
DECLARE #DateFilter DATE={d'2016-08-13'};
SELECT TOP 1 *
FROM #tbl
WHERE [start_date]<=#DateFilter
ORDER BY [start_date] DESC
Important: Be sure that there is an (unique) index on start_date
UPDATE: for different products
DECLARE #tbl TABLE(ID INT IDENTITY,ProductID INT,[start_date] DATE);
INSERT INTO #tbl VALUES
--product 1
(1,{d'2016-10-01'}),(1,{d'2016-09-01'}),(1,{d'2016-08-01'}),(1,{d'2016-07-01'}),(1,{d'2016-06-01'})
--product 1
,(2,{d'2016-10-17'}),(2,{d'2016-09-16'}),(2,{d'2016-08-15'}),(2,{d'2016-07-10'}),(2,{d'2016-06-11'});
DECLARE #DateFilter DATE={d'2016-08-13'};
WITH PartitionedCount AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY ProductID ORDER BY [start_date] DESC) AS Nr
,*
FROM #tbl
WHERE [start_date]<=#DateFilter
)
SELECT *
FROM PartitionedCount
WHERE Nr=1
First you need to create a unique clustered index for (begin_date, end_date, code)
Then SQL engine will be able to do INDEX SEEK.
Additionally, you can also try to create a view for dbo.Products table to join that table with pre-populated dbo.Dates table.
select p.code, p.val, p.begin_date, p.end_date, d.[date]
from dbo.Product as p
inner join dbo.dates d on p.begin_date <= d.[date] and d.[date] <= p.end_date
Then in your function, you use that view as "where #date = view.date". The result can be either better or slightly worse... it depends on the actual data.
You also can try to make that view indexed (depends on how often it is being updated).
Alternatively, you can have better performance if you populate dbo.Products table for every date in the [begin_date] .. [end_date] range.
Approach with ROW_NUMBER scans the whole Products table once. It is the best method if you have a lot of product codes in the Products table and few validity ranges for each code.
WITH
CTE_rn
AS
(
SELECT
code
,value
,ROW_NUMBER() OVER (PARTITION BY code ORDER BY begin_date DESC) AS rn
FROM Products
WHERE begin_date <= #date
)
SELECT *
FROM
<some table with product codes> as t
LEFT JOIN CTE_rn ON CTE_rn.code = t.product_code AND CTE_rn.rn = 1
;
If you have few product codes and a lot of validity ranges for each code in the Products table, then it is better to seek the Products table for each code using OUTER APPLY.
SELECT *
FROM
<some table with product codes> as t
OUTER APPLY
(
SELECT TOP(1)
Products.value
FROM Products
WHERE
Products.code = t.product_code
AND Products.begin_date <= #date
ORDER BY Products.begin_date DESC
) AS A
;
Both variants need unique index on (code, begin_date DESC) include (value).
Note how the queries don't even look at end_date, because they assume that intervals don't have gaps. They will work in SQL Server 2008.
EDIT: My original answer was using an INNER JOIN, but the questioner wanted a LEFT JOIN.
CREATE TABLE Products
(
[Code] INT NOT NULL
, [Value] VARCHAR(30) NOT NULL
, Begin_Date DATETIME NOT NULL
, End_Date DATETIME NULL
)
/*
Products
code value begin_date end_date
10905 13 2005-01-01 2016-12-31
10905 11 2017-01-01 null
*/
INSERT INTO Products ([Code], [Value], Begin_Date, End_Date) VALUES (10905, 13, '2005-01-01', '2016-12-31')
INSERT INTO Products ([Code], [Value], Begin_Date, End_Date) VALUES (10905, 11, '2017-01-01', NULL)
CREATE NONCLUSTERED INDEX SK_ProductDate ON Products ([Code], Begin_Date, End_Date) INCLUDE ([Value])
CREATE TABLE SomeTableWithProductCodes
(
[CODE] INT NOT NULL
)
INSERT INTO SomeTableWithProductCodes ([Code]) VALUES (10905)
Here is a prototypical query, with a date predicate. Note that there are more optimal ways to do this in a bulletproof fashion, using a "less than" operator on the upper bound, but that's a different discussion.
SELECT
P.[Code]
, P.[Value]
, P.[Begin_Date]
, P.[End_Date]
FROM
SomeTableWithProductCodes ST
LEFT JOIN Products AS P ON
ST.[Code] = P.[Code]
AND '2016-06-30' BETWEEN P.[Begin_Date] AND ISNULL(P.[End_Date], '9999-12-31')
This query will perform an Index Seek on the Product table.
Here is a SQL Fiddle: SQL Fiddle - Products and Dates

Returning a set of the most recent rows from a table

I'm trying to retrieve the latest set of rows from a source table containing a foreign key, a date and other fields present. A sample set of data could be:
create table #tmp (primaryId int, foreignKeyId int, startDate datetime,
otherfield varchar(50))
insert into #tmp values (1, 1, '1 jan 2010', 'test 1')
insert into #tmp values (2, 1, '1 jan 2011', 'test 2')
insert into #tmp values (3, 2, '1 jan 2013', 'test 3')
insert into #tmp values (4, 2, '1 jan 2012', 'test 4')
The form of data that I'm hoping to retrieve is:
foreignKeyId maxStartDate otherfield
------------ ----------------------- -------------------------------------------
1 2011-01-01 00:00:00.000 test 2
2 2013-01-01 00:00:00.000 test 3
That is, just one row per foreignKeyId showing the latest start date and associated other fields - the primaryId is irrelevant.
I've managed to come up with:
select t.foreignKeyId, t.startDate, t.otherField from #tmp t
inner join (
select foreignKeyId, max(startDate) as maxStartDate
from #tmp
group by foreignKeyId
) s
on t.foreignKeyId = s.foreignKeyId and s.maxStartDate = t.startDate
but (a) this uses inner queries, which I suspect may lead to performance issues, and (b) it gives repeated rows if two rows in the original table have the same foreignKeyId and startDate.
Is there a query that will return just the first match for each foreign key and start date?
Depending on your sql server version, try the following:
select *
from (
select *, rnum = ROW_NUMBER() over (
partition by #tmp.foreignKeyId
order by #tmp.startDate desc)
from #tmp
) t
where t.rnum = 1
If you wanted to fix your attempt as opposed to re-engineering it then
select t.foreignKeyId, t.startDate, t.otherField from #tmp t
inner join (
select foreignKeyId, max(startDate) as maxStartDate, max(PrimaryId) as Latest
from #tmp
group by foreignKeyId
) s
on t.primaryId = s.latest
would have done the job, assuming PrimaryID increases over time.
Qualms about inner query would have been laid to rest as well assuming some indexes.