t-SQL: calculate date difference with dynamic lag

Is there a way to compute the duration between consecutive dates that are not the same, using SQL Server 2017's OVER clause and without joins or subqueries? Could this possibly be done with a LAG function using some dynamically computed lag argument?
For example, trx 2 & 3 are on the same day, so we compute the duration from 1 to 2 and from 1 to 3. Since trx 4 occurred on a different day, its duration is from 3 to 4. Since trx 5 is on the same day as 4, we compute its duration from 3 to 5, and so on.
CREATE TABLE #t(Trx TINYINT, DT DATE);
INSERT INTO #t SELECT 1, '1/1/17';
INSERT INTO #t SELECT 2, '1/5/17';
INSERT INTO #t SELECT 3, '1/5/17';
INSERT INTO #t SELECT 4, '1/15/17';
INSERT INTO #t SELECT 5, '1/15/17';
INSERT INTO #t SELECT 6, '1/20/17';
Below is an easy implementation with a join, but can this be done inline with some OVER clause function (no join or subqueries)?
SELECT c.Trx, c.DT,
DurO=DATEDIFF(DAY, LAG(c.DT,1) OVER(ORDER BY c.DT), c.DT), -- does not use join
DurJ=DATEDIFF(DAY, MAX(p.DT), c.DT) -- uses join
FROM #t c LEFT JOIN #t p ON c.DT > p.DT
GROUP BY c.Trx, c.DT
ORDER BY c.DT
Note that DurJ is computed correctly, but DurO is not:
Trx DT DurO DurJ
1 2017-01-01 NULL NULL
2 2017-01-05 4 4
3 2017-01-05 0 4
4 2017-01-15 10 10
5 2017-01-15 0 10
6 2017-01-20 5 5
I'll clarify further any specifics, if needed.
NOTE: This is not a duplicate question. This question is concerned with one date column only and no project grouping. Btw, neither question has a satisfactory solution just yet.

Use dense_rank to treat rows with the same date as one group, then sum the lag difference over that group so every row in the group gets the same difference.
select trx,DT,sum(diff) over(partition by grp) as diff
from (select trx,DT,datediff(day,lag(DT) over(order by DT),DT) as diff,
dense_rank() over(order by DT) as grp
from #t
) t
Per Alexey's comment, dense_rank isn't actually needed. You can just partition by the date column itself for grouping.
select trx,DT,sum(diff) over(partition by DT) as diff
from (select trx,DT,datediff(day,lag(DT) over(order by DT),DT) as diff
from #t
) t
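For the sample data above, this produces the same values as the DurJ column in the question:
Trx  DT          diff
1    2017-01-01  NULL
2    2017-01-05  4
3    2017-01-05  4
4    2017-01-15  10
5    2017-01-15  10
6    2017-01-20  5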

Related

How to find observations that occur at least 3 times, spanning at least 15 days but no more than 90 days, for each unique ID in SQL?

Suppose I have this table:
CREATE TABLE #t1
(
PersonID int,
ExamDates date,
Score varchar(50) SPARSE NULL
);
SET dateformat mdy;
INSERT INTO #t1 (PersonID, ExamDates, Score)
VALUES (1, '1.1.2018',70),
(1, '1.13.2018', 100),
(1, '1.18.2018', 85),
(2, '1.1.2018', 90),
(2, '2.1.2018', 95),
(2, '3.15.2018', 95),
(2, '7.30.2018', 100),
(3, '1.1.2018', 80),
(3, '1.2.2018', 80),
(3, '5.3.2018', 50),
(4, '2.1.2018', 90),
(4, '2.20.2018', 100);
I would like to find observations that occur at least 3 times, spanning at least 15 days but no more than 90 days, for each unique ID.
My final table should look like this:
PersonID  ExamDates  Score
1         1/1/2018   70
1         1/13/2018  100
1         1/18/2018  85
2         1/1/2018   90
2         2/1/2018   95
2         3/15/2018  95
We have code working for this using R, but would like to avoid pulling large datasets into R just to run it. We are doing this on a very large dataset and are concerned about the efficiency of the query.
Thanks!
-Peter
To start with, the common name for this situation is Gaps and Islands. That will help you as you search for answers or come up with similar problems in the future.
That out of the way, here is my solution. Start with this:
WITH Leads As (
SELECT #t1.*
, datediff(day, ExamDates, lead(ExamDates, 2, NULL) over (partition by PersonID ORDER BY ExamDates)) As Diff
FROM #t1
)
SELECT *
FROM Leads
WHERE Diff BETWEEN 15 AND 90
I have to use the CTE, because you can't put a windowing function in a WHERE clause. It produces this result, which is only part of what you want:
PersonID  ExamDates   Score  Diff
1         2018-01-01  70     17
2         2018-01-01  90     73
This shows the first record in each group. We can use it to join back to the original table and find all the records that meet the requirements.
But first, we have a problem. The sample data only has groups with exactly three records. However, the real data might end up with groups with more than three items. In that case this would find multiple first records from the same group.
You can see it in this updated SQL Fiddle, which adds an additional record for PersonID #1 that is still inside the date range.
PersonID  ExamDates   Score  Diff
1         2018-01-01  70     17
1         2018-01-13  100    29
2         2018-01-01  90     73
I'll be using this additional record in every step from now on.
To account for this, we also need to check that each record is not in the middle or at the end of a valid group. That is, we also look a couple of records both ahead and behind.
WITH Diffs As (
SELECT #t1.*
, datediff(day, ExamDates, lead(ExamDates, 2, NULL) over (partition by PersonID ORDER BY ExamDates)) As LeadDiff2
, datediff(day, ExamDates, lead(ExamDates, 1, NULL) over (partition by PersonID ORDER BY ExamDates)) As LeadDiff1
, datediff(day, lag(ExamDates, 1, NULL) over (partition by PersonID ORDER BY ExamDates), ExamDates) as LagDiff1
, datediff(day, lag(ExamDates, 2, NULL) over (partition by PersonID ORDER BY ExamDates), ExamDates) as LagDiff2
FROM #t1
)
SELECT *
FROM Diffs
WHERE LeadDiff2 BETWEEN 15 AND 90
AND coalesce(LeadDiff1 + LagDiff1,100) > 90 /* Not in the middle of a valid group */
AND coalesce(Lagdiff2, 100) > 90 /* Not at the end of a valid group */
This code gets us back to the original results, even with the additional record. Here's the updated fiddle:
http://sqlfiddle.com/#!18/ea12ad/23
Now we can join back to the original table and find all records in each group:
WITH Diffs As (
SELECT #t1.*
, datediff(day, ExamDates, lead(ExamDates, 2, NULL) over (partition by PersonID ORDER BY ExamDates)) As LeadDiff2
, datediff(day, ExamDates, lead(ExamDates, 1, NULL) over (partition by PersonID ORDER BY ExamDates)) As LeadDiff1
, datediff(day, lag(ExamDates, 1, NULL) over (partition by PersonID ORDER BY ExamDates), ExamDates) as LagDiff1
, datediff(day, lag(ExamDates, 2, NULL) over (partition by PersonID ORDER BY ExamDates), ExamDates) as LagDiff2
FROM #t1
), FirstRecords AS (
SELECT PersonID, ExamDates, DATEADD(day, 90, ExamDates) AS FinalDate
FROM Diffs
WHERE LeadDiff2 BETWEEN 15 AND 90
AND coalesce(LeadDiff1 + LagDiff1,100) > 90 /* Not in the middle of a valid group */
AND coalesce(lagdiff2, 100) > 90 /* Not at the end of a valid group */
)
SELECT t.*
FROM FirstRecords f
INNER JOIN #t1 t ON t.PersonID = f.PersonID
AND t.ExamDates >= f.ExamDates
AND t.ExamDates <= f.FinalDate
ORDER BY t.PersonID, t.ExamDates
That gives me this, which matches your desired output and my extra record:
PersonID  ExamDates   Score
1         2018-01-01  70
1         2018-01-13  100
1         2018-01-18  85
1         2018-02-11  89
2         2018-01-01  90
2         2018-02-01  95
2         2018-03-15  95
See it work here:
http://sqlfiddle.com/#!18/ea12ad/26
Here's Eli's idea done a bit more simply, moving all of the heavy computation into the CTE, where it may possibly be more efficient:
With cte As (
Select PersonID, ExamDates
,Case When Datediff(DAY,ExamDates, Lead(ExamDates,2,Null) Over (Partition by PersonID Order by ExamDates)) Between 15 and 90
Then Lead(ExamDates,2,Null) Over (Partition by PersonID Order by ExamDates)
Else NULL End as EndDateRange
From #t1
)
Select Distinct B.*
From cte Inner Join #t1 B On B.PersonID=cte.PersonID
And B.ExamDates Between cte.ExamDates and cte.EndDateRange
The Case statement in the CTE only returns a valid date if the entry two items later satisfies the overall condition; that date is used to form a range with the current record's ExamDate. By returning NULL on non-qualifying ranges we ensure the join in the outer part of the SQL is not satisfied. The Distinct clause is needed to collapse duplicates when there are 4+ consecutive observations within the 15-90 day range.
You'll need a CTE to identify the base rows for the conditions you described.
This code works with your sample set, and should work even when you have a larger set, though it may require a DISTINCT if you have overlapping results, i.e. 5+ exam dates in the 15-90 day range.
WITH cte AS(
SELECT
PERSONID
,EXAMDATES
,Score
,COUNT(*) OVER (PARTITION BY PERSONID ORDER BY ExamDates ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW )AS COUNTS
,LAG(ExamDates,2,NULL) OVER (PARTITION BY PERSONID ORDER BY ExamDates) DIFFS
FROM #t1
)
SELECT B.*
FROM CTE
INNER JOIN #T1 B ON CTE.PERSONID = B.PERSONID
WHERE CTE.COUNTS >=3
AND DATEDIFF(DAY,CTE.DIFFS,CTE.EXAMDATES) BETWEEN 15 AND 90
AND B.EXAMDATES BETWEEN CTE.DIFFS AND CTE.EXAMDATES

Find date in specific group which is next to current group avoiding undesired groups

Assume table called t1:
create table t1(
dates date,
groups number
);
insert into t1 values('01.03.2020', 1);
insert into t1 values('02.03.2020', 2);
insert into t1 values('10.03.2020', 3);
insert into t1 values('01.04.2020', 10);
insert into t1 values('02.04.2020', 20);
insert into t1 values('10.04.2020', 3);
DATES GROUPS
01.03.2020 1
02.03.2020 2
10.03.2020 3
01.04.2020 10
02.04.2020 20
10.04.2020 3
I need to add a column which stores the value from the DATES column of the nearest following row whose GROUPS value equals 3, i.e. the date of the nearest group-3 row in terms of time.
Desired result:
DATES GROUPS DATE_OF_NEXT_3D_GROUP
01.03.2020 1 10.03.2020
02.03.2020 2 10.03.2020
10.03.2020 3 NULL(or could be 10.04.2020 from next 3d group)
01.04.2020 10 10.04.2020
02.04.2020 20 10.04.2020
10.04.2020 3 NULL(or date from next 3d group)
... ... ...
Appreciate your help
I strongly, strongly recommend using analytic functions for this rather than a correlated subquery:
select dates, groups,
(case when groups <> 3
then min(case when groups = 3 then dates end) over (order by dates desc)
end)
from t1
order by 1;
Analytic functions are designed for this type of operation and should have much better performance.
Here is a db<>fiddle.
You can achieve this with a subquery:
select dates,
groups,
(select min(dates)
from t1 b
where b.groups = 3
and b.dates > a.dates) as next_g3_date
from t1 a;

Joining next Sequential Row

I am planning an SQL statement right now and would need someone to look over my thoughts.
This is my Table:
id stat period
--- ------- --------
1 10 1/1/2008
2 25 2/1/2008
3 5 3/1/2008
4 15 4/1/2008
5 30 5/1/2008
6 9 6/1/2008
7 22 7/1/2008
8 29 8/1/2008
Create Table
CREATE TABLE tbstats
(
id INT IDENTITY(1, 1) PRIMARY KEY,
stat INT NOT NULL,
period DATETIME NOT NULL
)
go
INSERT INTO tbstats
(stat,period)
SELECT 10,CONVERT(DATETIME, '20080101')
UNION ALL
SELECT 25,CONVERT(DATETIME, '20080102')
UNION ALL
SELECT 5,CONVERT(DATETIME, '20080103')
UNION ALL
SELECT 15,CONVERT(DATETIME, '20080104')
UNION ALL
SELECT 30,CONVERT(DATETIME, '20080105')
UNION ALL
SELECT 9,CONVERT(DATETIME, '20080106')
UNION ALL
SELECT 22,CONVERT(DATETIME, '20080107')
UNION ALL
SELECT 29,CONVERT(DATETIME, '20080108')
go
I want to calculate the difference between each statistic and the next, and then calculate the mean value of the 'gaps.'
Thoughts:
I need to join each record with its subsequent row. I can do that using the ever-flexible joining syntax, thanks to the fact that I know the id field is an integer sequence with no gaps.
By aliasing the table I could incorporate it into the SQL query twice, then join them together in a staggered fashion by adding 1 to the id of the first aliased table. The first record in the table has an id of 1. 1 + 1 = 2 so it should join on the row with id of 2 in the second aliased table. And so on.
Now I would simply subtract one from the other.
Then I would use the ABS function to ensure that I always get positive integers as a result of the subtraction regardless of which side of the expression is the higher figure.
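For reference, a minimal sketch of that staggered self-join against the tbstats table defined above (alias names are just for illustration):
-- difference between each stat and the next, using the gap-free id sequence
SELECT a.id, a.stat, b.stat AS next_stat, ABS(b.stat - a.stat) AS gap
FROM tbstats a
JOIN tbstats b ON b.id = a.id + 1;
-- mean of the gaps (the cast avoids integer division)
SELECT AVG(CAST(ABS(b.stat - a.stat) AS FLOAT)) AS mean_gap
FROM tbstats a
JOIN tbstats b ON b.id = a.id + 1;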
Is there an easier way to achieve what I want?
The lead analytic function should do the trick:
SELECT period, stat, stat - LEAD(stat) OVER (ORDER BY period) AS gap
FROM tbstats
The average value of the gaps can be done by calculating the difference between the first value and the last value and dividing by one less than the number of elements:
select sum(case when seqnum = num then stat else - stat end) / (max(num) - 1)
from (select period, row_number() over (order by period) as seqnum,
count(*) over () as num
from tbstats
) t
where seqnum = num or seqnum = 1;
Of course, you can also do the calculation using lead(), but this will also work in SQL Server 2005 and 2008.
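For comparison, a sketch of the lead() version (SQL Server 2012+), which averages the individual gaps directly; the last row's gap is NULL and is excluded:
select avg(cast(gap as float)) as avg_gap
from (select lead(stat) over (order by period) - stat as gap
from tbstats
) t
where gap is not null;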
You can also achieve this by using a join:
SELECT t1.period,
t1.stat,
t1.stat - t2.stat gap
FROM tbstats t1
LEFT JOIN tbstats t2
ON t1.id + 1 = t2.id
To calculate the difference between each statistic and the next, LEAD() and LAG() may be the simplest option. You provide an ORDER BY, and LEAD(something) returns the next something and LAG(something) returns the previous something in the given order.
select
x.id thisStatId,
LAG(x.id) OVER (ORDER BY x.id) lastStatId,
x.stat thisStatValue,
LAG(x.stat) OVER (ORDER BY x.id) lastStatValue,
x.stat - LAG(x.stat) OVER (ORDER BY x.id) diff
from tbStats x

What's the most efficient way to match values between 2 tables based on most recent prior date?

I've got two tables in MS SQL Server:
dailyt - which contains daily data:
date val
---------------------
2014-05-22 10
2014-05-21 9.5
2014-05-20 9
2014-05-19 8
2014-05-18 7.5
etc...
And periodt - which contains data coming in at irregular periods:
date val
---------------------
2014-05-21 2
2014-05-18 1
Given a row in dailyt, I want to adjust its value by adding the corresponding value in periodt with the closest date prior or equal to the date of the dailyt row. So, the output would look like:
addt
date val
---------------------
2014-05-22 12 <- add 2 from 2014-05-21
2014-05-21 11.5 <- add 2 from 2014-05-21
2014-05-20 10 <- add 1 from 2014-05-18
2014-05-19 9 <- add 1 from 2014-05-18
2014-05-18 8.5 <- add 1 from 2014-05-18
I know that one way to do this is to join the dailyt and periodt tables on periodt.date <= dailyt.date, apply a ROW_NUMBER() OVER (PARTITION BY dailyt.date ORDER BY periodt.date DESC), and then keep only the rows where the row number = 1.
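For reference, a sketch of that ROW_NUMBER approach, assuming the dailyt and periodt table and column names shown above:
SELECT date, val
FROM (
SELECT d.date, d.val + ISNULL(p.val, 0) AS val,
ROW_NUMBER() OVER (PARTITION BY d.date ORDER BY p.date DESC) AS rn
FROM dailyt d
LEFT JOIN periodt p ON p.date <= d.date
) t
WHERE rn = 1
ORDER BY date DESC;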
Is there another way to do this that would be more efficient? Or is this pretty much optimal?
I think using APPLY would be the most efficient way:
SELECT d.Val,
p.Val,
NewVal = d.Val + ISNULL(p.Val, 0)
FROM Dailyt AS d
OUTER APPLY
( SELECT TOP 1 Val
FROM Periodt p
WHERE p.Date <= d.Date
ORDER BY p.Date DESC
) AS p;
Example on SQL Fiddle
If there are relatively few periodt rows, then there is an option that may prove quite efficient.
Convert periodt into a From/To ranges table using subqueries or CTEs. (Obviously performance depends on how efficiently this initial step can be done, which is why a small number of periodt rows is preferable.) Then the join to dailyt will be extremely efficient. E.g.
;WITH PIds AS (
SELECT ROW_NUMBER() OVER(ORDER BY PDate) RN, *
FROM #periodt
),
PRange AS (
SELECT f.PDate AS FromDate, t.PDate as ToDate, f.PVal
FROM PIds f
LEFT OUTER JOIN PIds t ON
t.RN = f.RN + 1
)
SELECT d.*, p.PVal
FROM #dailyt d
LEFT OUTER JOIN PRange p ON
d.DDate >= p.FromDate
AND (d.DDate < p.ToDate OR p.ToDate IS NULL)
ORDER BY 1 DESC
If you want to try the query, the following produces the sample data using table variables. Note I added an extra row to dailyt to demonstrate no periodt entries with a smaller date.
DECLARE #dailyt table (
DDate date NOT NULL,
DVal float NOT NULL
)
INSERT INTO #dailyt(DDate, DVal)
SELECT '20140522', 10
UNION ALL SELECT '20140521', 9.5
UNION ALL SELECT '20140520', 9
UNION ALL SELECT '20140519', 8
UNION ALL SELECT '20140518', 7.5
UNION ALL SELECT '20140517', 6.5
DECLARE #periodt table (
PDate date NOT NULL,
PVal int NOT NULL
)
INSERT INTO #periodt
SELECT '20140521', 2
UNION ALL SELECT '20140518', 1

Build table with previous months (cumulative)

I'm a bit lost with the following problem that I need to solve with an SQL query, no PL/SQL. The idea is to build a cumulative column that enumerates all previous months for each month. The input table looks like:
Month
1
2
3
..
24
I need build the following table :
Month Cum_Month
1 1
2 1
2 2
3 1
3 2
3 3
..
24 1
...
24 23
All this in SQL Server 2008, thanks in advance
You can do it like this:
DECLARE @tbl TABLE ([Month] INT)
INSERT @tbl VALUES
(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24)
SELECT Month
, ROW_NUMBER() OVER (PARTITION BY Month ORDER BY Month) num
FROM @tbl a
JOIN
(
SELECT *
FROM master..spt_values
WHERE type = 'P'
)
b ON b.number < a.Month
master..spt_values is used to generate numbers. After the numbers are generated, the result of the subquery is joined to the table variable to get the number of rows that corresponds to each month. After that, ROW_NUMBER is used to create the appropriate ordinal numbers for each month.
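If you'd rather not depend on master..spt_values, a small recursive CTE can generate the numbers instead; a sketch that produces the same result for months up to 24:
;WITH Numbers AS (
SELECT 1 AS n
UNION ALL
SELECT n + 1 FROM Numbers WHERE n < 24
)
SELECT a.[Month], b.n AS num
FROM @tbl a
JOIN Numbers b ON b.n <= a.[Month]
ORDER BY a.[Month], b.n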
Here's a pretty cool trick that doesn't use any user tables:
SELECT N.Number as Month, N2.Number as Cum_Month
FROM
(SELECT Number FROM master..spt_values WHERE Number BETWEEN 1 AND 24 AND Type = 'P') N
JOIN (SELECT Number FROM master..spt_values WHERE Number BETWEEN 1 AND 24 AND Type = 'P') N2 ON N.Number >= N2.Number
ORDER BY N.Number, N2.Number
And the Fiddle.
And if you really don't want the last row (24, 24) in the output (though why not?), just change the second subquery to BETWEEN 1 AND 23.