What's the most efficient way to match values between 2 tables based on most recent prior date? - sql

I've got two tables in MS SQL Server:
dailyt - which contains daily data:
date val
---------------------
2014-05-22 10
2014-05-21 9.5
2014-05-20 9
2014-05-19 8
2014-05-18 7.5
etc...
And periodt - which contains data coming in at irregular periods:
date val
---------------------
2014-05-21 2
2014-05-18 1
Given a row in dailyt, I want to adjust its value by adding the corresponding value in periodt with the closest date prior or equal to the date of the dailyt row. So, the output would look like:
addt
date val
---------------------
2014-05-22 12 <- add 2 from 2014-05-21
2014-05-21 11.5 <- add 2 from 2014-05-21
2014-05-20 10 <- add 1 from 2014-05-18
2014-05-19 9 <- add 1 from 2014-05-18
2014-05-18 8.5 <- add 1 from 2014-05-18
I know that one way to do this is to join the dailyt and periodt tables on periodt.date <= dailyt.date and then imposing a ROW_NUMBER() (PARTITION BY dailyt.date ORDER BY periodt.date DESC) condition, and then having a WHERE condition on the row number to = 1.
Is there another way to do this that would be more efficient? Or is this pretty much optimal?

I think using APPLY would be the most efficient way:
SELECT d.Val,
p.Val,
NewVal = d.Val + ISNULL(p.Val, 0)
FROM Dailyt AS d
OUTER APPLY
( SELECT TOP 1 Val
FROM Periodt p
WHERE p.Date <= d.Date
ORDER BY p.Date DESC
) AS p;
Example on SQL Fiddle

If there relatively very few periodt rows, then there is an option that may prove quite efficient.
Convert periodt into a From/To ranges table using subqueries or CTEs. (Obviously performance depends on how efficiently this initial step can be done, which is why a small number of periodt rows is preferable.) Then the join to dailyt will be extremely efficient. E.g.
;WITH PIds AS (
SELECT ROW_NUMBER() OVER(ORDER BY PDate) RN, *
FROM #periodt
),
PRange AS (
SELECT f.PDate AS FromDate, t.PDate as ToDate, f.PVal
FROM PIds f
LEFT OUTER JOIN PIds t ON
t.RN = f.RN + 1
)
SELECT d.*, p.PVal
FROM #dailyt d
LEFT OUTER JOIN PRange p ON
d.DDate >= p.FromDate
AND (d.DDate < p.ToDate OR p.ToDate IS NULL)
ORDER BY 1 DESC
If you want to try the query, the following produces the sample data using table variables. Note I added an extra row to dailyt to demonstrate no periodt entries with a smaller date.
DECLARE #dailyt table (
DDate date NOT NULL,
DVal float NOT NULL
)
INSERT INTO #dailyt(DDate, DVal)
SELECT '20140522', 10
UNION ALL SELECT '20140521', 9.5
UNION ALL SELECT '20140520', 9
UNION ALL SELECT '20140519', 8
UNION ALL SELECT '20140518', 7.5
UNION ALL SELECT '20140517', 6.5
DECLARE #periodt table (
PDate date NOT NULL,
PVal int NOT NULL
)
INSERT INTO #periodt
SELECT '20140521', 2
UNION ALL SELECT '20140518', 1

Related

t-SQL: calculate date difference with dynamic lag

Is there a way to compute the duration between consequent dates that are not the same, using SQL Server 2017's OVER clause and without joins or subqueries? Could this possibly be done with a LAG function using some dynamically computed lag argument?
For example, trx 2 & 3 are on the same day, so we compute the duration from 1 to 2 and from 1 to 3. Since 4 occurred on a different day, its duration is from 3 to 4. Since trx 5 is on the same day as 4, we compute its duration from 3 to 5 and so on.
CREATE TABLE #t(Trx TINYINT, DT DATE);
INSERT INTO #t SELECT 1, '1/1/17';
INSERT INTO #t SELECT 2, '1/5/17';
INSERT INTO #t SELECT 3, '1/5/17';
INSERT INTO #t SELECT 4, '1/15/17';
INSERT INTO #t SELECT 5, '1/15/17';
INSERT INTO #t SELECT 6, '1/20/17';
Below is an easy implementation with a join, but can this be done inline with some OVER clause function (no join or subqueries)?
SELECT c.Trx, c.DT,
DurO=DATEDIFF(DAY, LAG(c.DT,1) OVER(ORDER BY c.DT), c.DT), -- does not use join
DurJ=DATEDIFF(DAY, MAX(p.DT), c.DT) -- uses join
FROM #t c LEFT JOIN #t p ON c.DT > p.DT
GROUP BY c.Trx, c.DT
ORDER BY c.DT
Note that DurJ is computed correctly, but DurO is not:
Trx DT DurO DurJ
1 2017-01-01 NULL NULL
2 2017-01-05 4 4
3 2017-01-05 0 4
4 2017-01-15 10 10
5 2017-01-15 0 10
6 2017-01-20 5 5
I'll clarify further any specifics, if needed.
NOTE: Not a dup question. This question is concerned with one date column only and no project grouping. Btw, neither question has a satisfiable solution just yet.
Use dense_rank to treat same dates as one group and use it to get the same difference.
select trx,beg,sum(diff) over(partition by grp) as diff
from (select trx,beg,datediff(day,lag(beg) over(order by beg),beg) as diff,
dense_rank() over(order by beg) as grp
from #t
) t
Per #Alexey's comment's, dense_rank isn't actually needed. You can just use beg date for grouping.
select trx,beg,sum(diff) over(partition by beg) as diff
from (select trx,beg,datediff(day,lag(beg) over(order by beg),beg) as diff
from #t
) t

Aggregate column text where dates in table a are between dates in table b

Sample data
CREATE TEMP TABLE a AS
SELECT id, adate::date, name
FROM ( VALUES
(1,'1/1/1900','test'),
(1,'3/1/1900','testing'),
(1,'4/1/1900','testinganother'),
(1,'6/1/1900','superbtest'),
(2,'1/1/1900','thebesttest'),
(2,'3/1/1900','suchtest'),
(2,'4/1/1900','test2'),
(2,'6/1/1900','test3'),
(2,'7/1/1900','test4')
) AS t(id,adate,name);
CREATE TEMP TABLE b AS
SELECT id, bdate::date, score
FROM ( VALUES
(1,'12/31/1899', 7 ),
(1,'4/1/1900' , 45),
(2,'12/31/1899', 19),
(2,'5/1/1900' , 29),
(2,'8/1/1900' , 14)
) AS t(id,bdate,score);
What I want
What I need to do is aggregate column text from table a where the id matches table b and the date from table a is between the two closest dates from table b. Desired output:
id date score textagg
1 12/31/1899 7 test, testing
1 4/1/1900 45 testinganother, superbtest
2 12/31/1899 19 thebesttest, suchtest, test2
2 5/1/1900 29 test3, test4
2 8/1/1900 14
My thoughts are to do something like this:
create table date_join
select a.id, string_agg(a.text, ','), b.*
from tablea a
left join tableb b
on a.id = b.id
*having a.date between b.date and b.date*;
but I am really struggling with the last line, figuring out how to aggregate only where the date in table b is between the closest two dates in table b. Any guidance is much appreciated.
I can't promise it's the best way to do it, but this is a way to do it.
with b_values as (
select
id, date as from_date, score,
lead (date, 1, '3000-01-01')
over (partition by id order by date) - 1 as thru_date
from b
)
select
bv.id, bv.from_date, bv.score,
string_agg (a.text, ',')
from
b_values as bv
left join a on
a.id = bv.id and
a.date between bv.from_date and bv.thru_date
group by
bv.id, bv.from_date, bv.score
order by
bv.id, bv.from_date
I'm presupposing you will never have a date in your table greater than 12/31/2999, so if you're still running this query after that date, please accept my apologies.
Here is the output I got when I ran this:
id from_date score string_agg
1 0 7 test,testing
1 92 45 testinganother,superbtest
2 0 19 thebesttest,suchtest,test2
2 122 29 test3,test4
2 214 14
I might also note that between in a join is a performance killer. IF you have large data volumes, there might be better ideas on how to approach this, but that depends largely on what your actual data looks like.

Joining next Sequential Row

I am planing an SQL Statement right now and would need someone to look over my thougts.
This is my Table:
id stat period
--- ------- --------
1 10 1/1/2008
2 25 2/1/2008
3 5 3/1/2008
4 15 4/1/2008
5 30 5/1/2008
6 9 6/1/2008
7 22 7/1/2008
8 29 8/1/2008
Create Table
CREATE TABLE tbstats
(
id INT IDENTITY(1, 1) PRIMARY KEY,
stat INT NOT NULL,
period DATETIME NOT NULL
)
go
INSERT INTO tbstats
(stat,period)
SELECT 10,CONVERT(DATETIME, '20080101')
UNION ALL
SELECT 25,CONVERT(DATETIME, '20080102')
UNION ALL
SELECT 5,CONVERT(DATETIME, '20080103')
UNION ALL
SELECT 15,CONVERT(DATETIME, '20080104')
UNION ALL
SELECT 30,CONVERT(DATETIME, '20080105')
UNION ALL
SELECT 9,CONVERT(DATETIME, '20080106')
UNION ALL
SELECT 22,CONVERT(DATETIME, '20080107')
UNION ALL
SELECT 29,CONVERT(DATETIME, '20080108')
go
I want to calculate the difference between each statistic and the next, and then calculate the mean value of the 'gaps.'
Thougts:
I need to join each record with it's subsequent row. I can do that using the ever flexible joining syntax, thanks to the fact that I know the id field is an integer sequence with no gaps.
By aliasing the table I could incorporate it into the SQL query twice, then join them together in a staggered fashion by adding 1 to the id of the first aliased table. The first record in the table has an id of 1. 1 + 1 = 2 so it should join on the row with id of 2 in the second aliased table. And so on.
Now I would simply subtract one from the other.
Then I would use the ABS function to ensure that I always get positive integers as a result of the subtraction regardless of which side of the expression is the higher figure.
Is there an easier way to achieve what I want?
The lead analytic function should do the trick:
SELECT period, stat, stat - LEAD(stat) OVER (ORDER BY period) AS gap
FROM tbstats
The average value of the gaps can be done by calculating the difference between the first value and the last value and dividing by one less than the number of elements:
select sum(case when seqnum = num then stat else - stat end) / (max(num) - 1);
from (select period, row_number() over (order by period) as seqnum,
count(*) over () as num
from tbstats
) t
where seqnum = num or seqnum = 1;
Of course, you can also do the calculation using lead(), but this will also work in SQL Server 2005 and 2008.
By using Join also you achieve this
SELECT t1.period,
t1.stat,
t1.stat - t2.stat gap
FROM #tbstats t1
LEFT JOIN #tbstats t2
ON t1.id + 1 = t2.id
To calculate the difference between each statistic and the next, LEAD() and LAG() may be the simplest option. You provide an ORDER BY, and LEAD(something) returns the next something and LAG(something) returns the previous something in the given order.
select
x.id thisStatId,
LAG(x.id) OVER (ORDER BY x.id) lastStatId,
x.stat thisStatValue,
LAG(x.stat) OVER (ORDER BY x.id) lastStatValue,
x.stat - LAG(x.stat) OVER (ORDER BY x.id) diff
from tbStats x

How to write Oracle query to find a total length of possible overlapping from-to dates

I'm struggling to find the query for the following task
I have the following data and want to find the total network day for each unique ID
ID From To NetworkDay
1 03-Sep-12 07-Sep-12 5
1 03-Sep-12 04-Sep-12 2
1 05-Sep-12 06-Sep-12 2
1 06-Sep-12 12-Sep-12 5
1 31-Aug-12 04-Sep-12 3
2 04-Sep-12 06-Sep-12 3
2 11-Sep-12 13-Sep-12 3
2 05-Sep-12 08-Sep-12 3
Problem is the date range can be overlapping and I can't come up with SQL that will give me the following results
ID From To NetworkDay
1 31-Aug-12 12-Sep-12 9
2 04-Sep-12 08-Sep-12 4
2 11-Sep-12 13-Sep-12 3
and then
ID Total Network Day
1 9
2 7
In case the network day calculation is not possible just get to the second table would be sufficient.
Hope my question is clear
We can use Oracle Analytics, namely the "OVER ... PARTITION BY" clause, in Oracle to do this. The PARTITION BY clause is kind of like a GROUP BY but without the aggregation part. That means we can group rows together (i.e. partition them) and them perform an operation on them as separate groups. As we operate on each row we can then access the columns of the previous row above. This is the feature PARTITION BY gives us. (PARTITION BY is not related to partitioning of a table for performance.)
So then how do we output the non-overlapping dates? We first order the query based on the (ID,DFROM) fields, then we use the ID field to make our partitions (row groups). We then test the previous row's TO value and the current rows FROM value for overlap using an expression like: (in pseudo code)
max(previous.DTO, current.DFROM) as DFROM
This basic expression will return the original DFROM value if it doesnt overlap, but will return the previous TO value if there is overlap. Since our rows are ordered we only need to be concerned with the last row. In cases where a previous row completely overlaps the current row we want the row then to have a 'zero' date range. So we do the same thing for the DTO field to get:
max(previous.DTO, current.DFROM) as DFROM, max(previous.DTO, current.DTO) as DTO
Once we have generated the new results set with the adjusted DFROM and DTO values, we can aggregate them up and count the range intervals of DFROM and DTO.
Be aware that most date calculations in database are not inclusive such as your data is. So something like DATEDIFF(dto,dfrom) will not include the day dto actually refers to, so we will want to adjust dto up a day first.
I dont have access to an Oracle server anymore but I know this is possible with the Oracle Analytics. The query should go something like this:
(Please update my post if you get this to work.)
SELECT id,
max(dfrom, LAST_VALUE(dto) OVER (PARTITION BY id ORDER BY dfrom) ) as dfrom,
max(dto, LAST_VALUE(dto) OVER (PARTITION BY id ORDER BY dfrom) ) as dto
from (
select id, dfrom, dto+1 as dto from my_sample -- adjust the table so that dto becomes non-inclusive
order by id, dfrom
) sample;
The secret here is the LAST_VALUE(dto) OVER (PARTITION BY id ORDER BY dfrom) expression which returns the value previous to the current row.
So this query should output new dfrom/dto values which dont overlap. It's then a simple matter of sub-querying this doing (dto-dfrom) and sum the totals.
Using MySQL
I did haves access to a mysql server so I did get it working there. MySQL doesnt have results partitioning (Analytics) like Oracle so we have to use result set variables. This means we use #var:=xxx type expressions to remember the last date value and adjust the dfrom/dto according. Same algorithm just a little longer and more complex syntax. We also have to forget the last date value any time the ID field changes!
So here is the sample table (same values you have):
create table sample(id int, dfrom date, dto date, networkDay int);
insert into sample values
(1,'2012-09-03','2012-09-07',5),
(1,'2012-09-03','2012-09-04',2),
(1,'2012-09-05','2012-09-06',2),
(1,'2012-09-06','2012-09-12',5),
(1,'2012-08-31','2012-09-04',3),
(2,'2012-09-04','2012-09-06',3),
(2,'2012-09-11','2012-09-13',3),
(2,'2012-09-05','2012-09-08',3);
On to the query, we output the un-grouped result set like above:
The variable #ld is "last date", and the variable #lid is "last id". Anytime #lid changes, we reset #ld to null. FYI In mysql the := operators is where the assignment happens, an = operator is just equals.
This is a 3 level query, but it could be reduced to 2. I went with an extra outer query to keep things more readable. The inner most query is simple and it adjusts the dto column to be non-inclusive and does the proper row ordering. The middle query does the adjustment of the dfrom/dto values to make them non-overlapped. The outer query simple drops the non-used fields, and calculate the interval range.
set #ldt=null, #lid=null;
select id, no_dfrom as dfrom, no_dto as dto, datediff(no_dto, no_dfrom) as days from (
select if(#lid=id,#ldt,#ldt:=null) as last, dfrom, dto, if(#ldt>=dfrom,#ldt,dfrom) as no_dfrom, if(#ldt>=dto,#ldt,dto) as no_dto, #ldt:=if(#ldt>=dto,#ldt,dto), #lid:=id as id,
datediff(dto, dfrom) as overlapped_days
from (select id, dfrom, dto + INTERVAL 1 DAY as dto from sample order by id, dfrom) as sample
) as nonoverlapped
order by id, dfrom;
The above query gives the results (notice dfrom/dto are non-overlapping here):
+------+------------+------------+------+
| id | dfrom | dto | days |
+------+------------+------------+------+
| 1 | 2012-08-31 | 2012-09-05 | 5 |
| 1 | 2012-09-05 | 2012-09-08 | 3 |
| 1 | 2012-09-08 | 2012-09-08 | 0 |
| 1 | 2012-09-08 | 2012-09-08 | 0 |
| 1 | 2012-09-08 | 2012-09-13 | 5 |
| 2 | 2012-09-04 | 2012-09-07 | 3 |
| 2 | 2012-09-07 | 2012-09-09 | 2 |
| 2 | 2012-09-11 | 2012-09-14 | 3 |
+------+------------+------------+------+
How about constructing an SQL which merges intervals by removing holes and considering only maximum intervals. It goes like this (not tested):
SELECT DISTINCT F.ID, F.From, L.To
FROM Temp AS F, Temp AS L
WHERE F.From < L.To AND F.ID = L.ID
AND NOT EXISTS (SELECT *
FROM Temp AS T
WHERE T.ID = F.ID
AND F.From < T.From AND T.From < L.To
AND NOT EXISTS ( SELECT *
FROM Temp AS T1
WHERE T1.ID = F.ID
AND T1.From < T.From
AND T.From <= T1.To)
)
AND NOT EXISTS (SELECT *
FROM Temp AS T2
WHERE T2.ID = F.ID
AND (
(T2.From < F.From AND F.From <= T2.To)
OR (T2.From < L.To AND L.To < T2.To)
)
)
with t_data as (
select 1 as id,
to_date('03-sep-12','dd-mon-yy') as start_date,
to_date('07-sep-12','dd-mon-yy') as end_date from dual
union all
select 1,
to_date('03-sep-12','dd-mon-yy'),
to_date('04-sep-12','dd-mon-yy') from dual
union all
select 1,
to_date('05-sep-12','dd-mon-yy'),
to_date('06-sep-12','dd-mon-yy') from dual
union all
select 1,
to_date('06-sep-12','dd-mon-yy'),
to_date('12-sep-12','dd-mon-yy') from dual
union all
select 1,
to_date('31-aug-12','dd-mon-yy'),
to_date('04-sep-12','dd-mon-yy') from dual
union all
select 2,
to_date('04-sep-12','dd-mon-yy'),
to_date('06-sep-12','dd-mon-yy') from dual
union all
select 2,
to_date('11-sep-12','dd-mon-yy'),
to_date('13-sep-12','dd-mon-yy') from dual
union all
select 2,
to_date('05-sep-12','dd-mon-yy'),
to_date('08-sep-12','dd-mon-yy') from dual
),
t_holidays as (
select to_date('01-jan-12','dd-mon-yy') as holiday
from dual
),
t_data_rn as (
select rownum as rn, t_data.* from t_data
),
t_model as (
select distinct id,
start_date
from t_data_rn
model
partition by (rn, id)
dimension by (0 as i)
measures(start_date, end_date)
rules
( start_date[for i
from 1
to end_date[0]-start_date[0]
increment 1] = start_date[0] + cv(i),
end_date[any] = start_date[cv()] + 1
)
order by 1,2
),
t_network_days as (
select t_model.*,
case when
mod(to_char(start_date, 'j'), 7) + 1 in (6, 7)
or t_holidays.holiday is not null
then 0 else 1
end as working_day
from t_model
left outer join t_holidays
on t_holidays.holiday = t_model.start_date
)
select id,
sum(working_day) as network_days
from t_network_days
group by id;
t_data - your initial data
t_holidays - contains list of holidays
t_data_rn - just adds unique key (rownum) to each row of t_data
t_model - expands t_data date ranges into a flat list of dates
t_network_days - marks each date from t_model as working day or weekend based on day of week (Sat and Sun) and holidays list
final query - calculates number of network day per each group.

Best way to interpolate values in SQL

I have a table with rate at certain date :
Rates
Id | Date | Rate
----+---------------+-------
1 | 01/01/2011 | 4.5
2 | 01/04/2011 | 3.2
3 | 04/06/2011 | 2.4
4 | 30/06/2011 | 5
I want to get the output rate base on a simple linear interpolation.
So if I enter 17/06/2011:
Date Rate
---------- -----
01/01/2011 4.5
01/04/2011 3.2
04/06/2011 2.4
17/06/2011
30/06/2011 5.0
the linear interpolation is (5 + 2,4) / 2 = 3,7
Is there a way to do a simple query (SQL Server 2005), or this kind of stuff need to be done in a programmatic way (C#...) ?
Something like this (corrected):
SELECT CASE WHEN next.Date IS NULL THEN prev.Rate
WHEN prev.Date IS NULL THEN next.Rate
WHEN next.Date = prev.Date THEN prev.Rate
ELSE ( DATEDIFF(d, prev.Date, #InputDate) * next.Rate
+ DATEDIFF(d, #InputDate, next.Date) * prev.Rate
) / DATEDIFF(d, prev.Date, next.Date)
END AS interpolationRate
FROM
( SELECT TOP 1
Date, Rate
FROM Rates
WHERE Date <= #InputDate
ORDER BY Date DESC
) AS prev
CROSS JOIN
( SELECT TOP 1
Date, Rate
FROM Rates
WHERE Date >= #InputDate
ORDER BY Date ASC
) AS next
As #Mark already pointed out, the CROSS JOIN has its limitations. As soon as the target value falls outside the range of defined values no records will be returned.
Also the above solution is limited to one result only. For my project I needed an interpolation for a whole list of x values and came up with the following solution. Maybe it is of interested to other readers too?
-- generate some grid data values in table #ddd:
CREATE TABLE #ddd (id int,x float,y float, PRIMARY KEY(id,x));
INSERT INTO #ddd VALUES (1,3,4),(1,4,5),(1,6,3),(1,10,2),
(2,1,4),(2,5,6),(2,6,5),(2,8,2);
SELECT * FROM #ddd;
-- target x-values in table #vals (results are to go into column yy):
CREATE TABLE #vals (xx float PRIMARY KEY,yy float null, itype int);
INSERT INTO #vals (xx) VALUES (1),(3),(4.3),(9),(12);
-- do the actual interpolation
WITH valstyp AS (
SELECT id ii,xx,
CASE WHEN min(x)<xx THEN CASE WHEN max(x)>xx THEN 1 ELSE 2 END ELSE 0 END flag,
min(x) xmi,max(x) xma
FROM #vals INNER JOIN #ddd ON id=1 GROUP BY xx,id
), ipol AS (
SELECT v.*,(b.x-xx)/(b.x-a.x) f,a.y ya,b.y yb
FROM valstyp v
INNER JOIN #ddd a ON a.id=ii AND a.x=(SELECT max(x) FROM #ddd WHERE id=ii
AND (flag=0 AND x=xmi OR flag=1 AND x<xx OR flag=2 AND x<xma))
INNER JOIN #ddd b ON b.id=ii AND b.x=(SELECT min(x) FROM #ddd WHERE id=ii
AND (flag=0 AND x>xmi OR flag=1 AND x>xx OR flag=2 AND x=xma))
)
UPDATE v SET yy=ROUND(f*ya+(1-f)*yb,8),itype=flag FROM #vals v INNER JOIN ipol i ON i.xx=v.xx;
-- list the interpolated results table:
SELECT * FROM #vals
When running the above script you will get the following data grid points in table #ddd
id x y
-- -- -
1 3 4
1 4 5
1 6 3
1 10 2
2 1 4
2 5 6
2 6 5
2 8 2
[[ The table contains grid points for two identities (id=1 and id=2). In my example I referenced only the 1-group by using where id=1 in the valstyp CTE. This can be changed to suit your requirements. ]]
and the results table #vals with the interpolated data in column yy:
xx yy itype
--- ---- -----
1 2 0
3 4 0
4.3 4.7 1
9 2.25 1
12 1.5 2
The last column itype indicates the type of interpolation/extrapolation that was used to calculate the value:
0: extrapolation to lower end
1: interpolation within given data range
2: extrapolation to higher end
This working example can be found here.
The trick with CROSS JOIN here is it wont return any records if either of the table does not have rows (1 * 0 = 0) and the query may break. Better way to do is use FULL OUTER JOIN with inequality condition (to avoid getting more than one row)
( SELECT TOP 1
Date, Rate
FROM Rates
WHERE Date <= #InputDate
ORDER BY Date DESC
) AS prev
FULL OUTER JOIN
( SELECT TOP 1
Date, Rate
FROM Rates
WHERE Date >= #InputDate
ORDER BY Date ASC
) AS next
ON (prev.Date <> next.Date) [or Rate depending on what is unique]