How to combine date ranges in SQL with small gaps

I have a dataset where each row has a date range. I want to combine records into single date ranges if they overlap or there's a gap of less than 30 days and they share the same ID number. If it's more than 30 days, I want them to remain separate. I can figure out how to do it if they are overlapping, and I can figure out how to do it no matter the size of the gap, but I can't figure out how to do it with a limited gap allowance.
So, for example, if my data looks like this:
ID Date1 Date2
ABC 2018-01-01 2018-02-14
ABC 2018-02-13 2018-03-17
ABC 2018-04-01 2018-07-24
DEF 2017-01-01 2017-06-30
DEF 2017-10-01 2017-12-01
I want it to come out like this:
ID Date1 Date2
ABC 2018-01-01 2018-07-24
DEF 2017-01-01 2017-06-30
DEF 2017-10-01 2017-12-01
The three date ranges for ABC are combined, because they either overlap or the gaps are less than 30 days. The two date ranges for DEF stay separate, because the gap between them is larger than 30 days.
I'm using Microsoft SSMS.

You can identify where the new periods begin. For a general problem, I would go with not exists. Then you can assign a group using cumulative sums:
-- A row starts a new period when no earlier-starting row for the same id
-- ends less than 30 days before this row begins.
select t.*, sum(is_start) over (partition by id order by date1) as grp
from (select t.*,
             (case when not exists (select 1
                                    from t t2
                                    where t2.id = t.id and
                                          t2.date1 < t.date1 and
                                          t2.date2 > dateadd(day, -30, t.date1)
                                   )
                   then 1 else 0
              end) as is_start
      from t
     ) t;
The final step is aggregation:
with g as (
      select t.*, sum(is_start) over (partition by id order by date1) as grp
      from (select t.*,
                   (case when not exists (select 1
                                          from t t2
                                          where t2.id = t.id and
                                                t2.date1 < t.date1 and
                                                t2.date2 > dateadd(day, -30, t.date1)
                                         )
                         then 1 else 0
                    end) as is_start
            from t
           ) t
     )
select id, min(date1) as date1, max(date2) as date2
from g
group by id, grp;
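To check the expected output outside the database, the same rule (merge when ranges overlap or the gap is under 30 days) can be sketched in Python. This is only an illustration of the logic, not part of the SQL answer; the function name merge_ranges and the max_gap_days parameter are made up for this sketch.

```python
from datetime import date

def merge_ranges(rows, max_gap_days=30):
    """Merge (id, start, end) rows per id when the gap to the running end is under max_gap_days."""
    merged = []
    for rid, start, end in sorted(rows):
        last = merged[-1] if merged else None
        if last and last[0] == rid and (start - last[2]).days < max_gap_days:
            # Overlap or small gap: extend the current range.
            last[2] = max(last[2], end)
        else:
            # First row for this id, or the gap is too large: open a new range.
            merged.append([rid, start, end])
    return [tuple(r) for r in merged]

rows = [
    ("ABC", date(2018, 1, 1), date(2018, 2, 14)),
    ("ABC", date(2018, 2, 13), date(2018, 3, 17)),
    ("ABC", date(2018, 4, 1), date(2018, 7, 24)),
    ("DEF", date(2017, 1, 1), date(2017, 6, 30)),
    ("DEF", date(2017, 10, 1), date(2017, 12, 1)),
]
result = merge_ranges(rows)
for rid, start, end in result:
    print(rid, start, end)
```

With the sample data from the question, the three ABC rows collapse into one range and the two DEF rows stay separate.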

Related

For multiple rows with some identical fields, keep the one with updated values, and mark the others

For multiple rows with identical features, I would like to add a few marker/new columns to the original table.
The original table is as below:
ID Start_date End_Date Amount
1 2005-01-01 2010-01-01 5
1 2000-07-01 2009-06-01 10
1 2017-08-01 2018-03-01 30
I wish to keep one record with the earliest start date, latest end date, added amount and an indicator to tell me to use this record. For the others, just use the indicator to tell me not to use.
The updated table should be as below:
ID Start_date End_Date Amount Amount_new Usable Start End
1 2005-01-01 2010-01-01 5 45 0 2000-07-01 2018-03-01
1 2000-07-01 2009-06-01 10 1
1 2017-08-01 2018-03-01 30 1
It does not matter which row to keep, as long as there is one row with Usable=0, and Amount_new, Start and End are updated.
Ignoring the end date for a moment, I was thinking of grouping by ID and Start_date and then updating the Usable and Amount_new columns of the first row. However, I still don't know how to select the first row of each group, and taking End_Date into account confuses me even more!
Could anyone help to shed some light upon this issue?
You seem to want something like this:
alter table original
    add amount_new int,
        usable bit,
        new_start date,
        new_end date;
Then, you can update it using window functions:
with toupdate as (
      select o.*,
             sum(amount) over (partition by id) as x_amount,
             -- 0 marks the single row to use, 1 marks the rows to ignore
             (case when row_number() over (partition by id order by start_date) = 1
                   then 0 else 1
              end) as x_usable,
             min(start_date) over (partition by id) as x_start_date,
             max(end_date) over (partition by id) as x_end_date
      from original o
     )
update toupdate
    set amount_new = x_amount,
        usable = x_usable,
        new_start = x_start_date,
        new_end = x_end_date;
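For comparison, here is a rough Python sketch of the same idea: aggregate per ID, then flag one row as the one to use. It is an illustration only. The dictionary keys are made up for the sketch, and unlike the window-function update above it writes the aggregates only to the kept row, as in the question's expected table.

```python
from collections import defaultdict

rows = [
    {"id": 1, "start": "2005-01-01", "end": "2010-01-01", "amount": 5},
    {"id": 1, "start": "2000-07-01", "end": "2009-06-01", "amount": 10},
    {"id": 1, "start": "2017-08-01", "end": "2018-03-01", "amount": 30},
]

# Group the rows by id (the equivalent of "partition by id").
groups = defaultdict(list)
for row in rows:
    groups[row["id"]].append(row)

for members in groups.values():
    # ISO-formatted date strings sort chronologically, so min/max work on them.
    first = min(members, key=lambda r: r["start"])
    for row in members:
        row["usable"] = 1  # 1 means "do not use", as in the question
    first.update(
        usable=0,  # 0 marks the single row to use
        amount_new=sum(r["amount"] for r in members),
        new_start=first["start"],
        new_end=max(r["end"] for r in members),
    )

for row in rows:
    print(row)
```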
The following query should do what you want:
CREATE TABLE #temp (ID INT, [Start_date] DATE, End_Date DATE, Amount NUMERIC(28,0), Amount_new NUMERIC(28,0), Usable BIT, Start [Date], [End] [Date])
INSERT INTO #temp (ID, [Start_date], End_Date, Amount) VALUES
(1,'2005-01-01','2010-01-01',5),
(1,'2000-07-01','2009-06-01',10),
(1,'2017-08-01','2018-03-01',30),
(2,'2001-07-01','2009-06-01',5),
(2,'2017-08-01','2019-03-01',35)
UPDATE t1
SET Amount_new = t2.[Amount_new],
Usable = 1,
Start = t2.[Start],
[End] = t2.[End]
FROM (SELECT *,ROW_NUMBER() OVER (PARTITION BY ID ORDER BY (SELECT 1)) AS RNO FROM #temp) t1
INNER JOIN
(
SELECT ID,[Start_date],[End_Date],[Amount]
,SUM(Amount) OVER(PARTITION BY ID) AS [Amount_new]
,MIN([Start_date]) OVER(PARTITION BY ID) AS [Start]
,MAX(End_Date) OVER(PARTITION BY ID) AS [End]
,ROW_NUMBER() OVER (PARTITION BY ID ORDER BY (SELECT 1)) AS RNO
FROM #temp ) t2 ON t1.id = t2.id AND t2.rno = t1.RNO AND t2.RNO = 1
SELECT * FROM #temp
The result is as below,
ID Start_date End_Date Amount Amount_new Usable Start End
1 2005-01-01 2010-01-01 5 45 1 2000-07-01 2018-03-01
1 2000-07-01 2009-06-01 10 NULL NULL NULL NULL
1 2017-08-01 2018-03-01 30 NULL NULL NULL NULL
2 2001-07-01 2009-06-01 5 40 1 2001-07-01 2019-03-01
2 2017-08-01 2019-03-01 35 NULL NULL NULL NULL

select rows in sql with end_date >= start_date for each ID repeated multiple times

My table has 3 columns (id, start date, and end date) with values like this:
id start date end date
-------------------------------
100 2015-01-01 2015-12-31
100 2016-01-10 2018-12-31
200 2015-02-15 2016-03-15
200 2016-03-15 2016-12-31
300 2016-01-01 2016-12-31
400 2017-01-01 2017-12-31
500 2017-02-01 2017-12-31
600 2017-01-15 2017-03-05
600 2017-02-01 2018-12-31
I want my output to be
id start date end date
--------------------------------
100 2015-01-01 2015-12-31
100 2016-01-10 2018-12-31
200 2015-02-15 2016-12-31
300 2016-01-01 2016-12-31
400 2017-01-01 2017-12-31
500 2017-02-01 2017-12-31
600 2017-01-15 2018-12-31
Query:
select
id, *
from
dbo.test_sl
where
id in (select id
from dbo.test_sl
where end_date >= start_date
group by id)
Please help me get the output I am looking for.
This is an example of a gaps-and-islands problem. In this case, you want to find the rows that are not overlapped by an earlier row for the same id. These are the starts of groups. A cumulative sum of the start flags then provides a grouping number, which can be used for aggregation.
In a query, this looks like:
select id, min(start_date) as start_date, max(end_date) as end_date
from (select t.*,
             sum(isstart) over (partition by id order by start_date) as grp
      from (select t.*,
                   (case when exists (select 1
                                      from test_sl t2
                                      where t2.id = t.id and
                                            t2.start_date < t.start_date and
                                            t2.end_date >= t.start_date
                                     )
                         then 0 else 1
                    end) as isstart
            from test_sl t
           ) t
     ) t
group by id, grp;
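The two ingredients of that query, the start flag and its cumulative sum, can be traced by hand with a small Python sketch. This is an illustration under assumptions: it uses a subset of the question's rows, and it tracks the latest end date seen so far instead of running an exists subquery.

```python
from itertools import groupby

rows = [
    (100, "2015-01-01", "2015-12-31"),
    (100, "2016-01-10", "2018-12-31"),
    (200, "2015-02-15", "2016-03-15"),
    (200, "2016-03-15", "2016-12-31"),
    (600, "2017-01-15", "2017-03-05"),
    (600, "2017-02-01", "2018-12-31"),
]

islands = []  # each entry: [id, grp, start_date, end_date]
for rid, members in groupby(sorted(rows), key=lambda r: r[0]):
    grp = 0            # cumulative sum of the start flags
    latest_end = None  # latest end date seen so far for this id
    for _, start, end in members:
        # A row starts a new island when no earlier row reaches its start date.
        is_start = 1 if latest_end is None or latest_end < start else 0
        grp += is_start
        latest_end = end if latest_end is None else max(latest_end, end)
        if is_start:
            islands.append([rid, grp, start, end])
        else:
            islands[-1][3] = max(islands[-1][3], end)

for island in islands:
    print(island)
```

The ISO date strings compare correctly as text, which keeps the sketch short; real code would use proper date values.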
Assuming that only two records can be combined together, you can LEFT JOIN the table with itself and then use a CASE to display the end date of the self-joined record, if available.
SELECT
t1.id,
min(t1.start_date),
CASE WHEN t2.end_date IS NULL THEN t1.end_date ELSE t2.end_date END
FROM
test_sl t1
LEFT JOIN test_sl t2
ON t1.id = t2.id
AND t2.start_date > t1.start_date
AND t2.start_date <= t1.end_date
GROUP BY
t1.id,
CASE WHEN t2.end_date IS NULL THEN t1.end_date ELSE t2.end_date END
ORDER BY 1
Tested in this SQL Fiddle
Here's a solution that uses a Recursive CTE.
It basically loops through the dates per id, and keeps the smallest start_date for the overlapping end_date/start_date.
Then the result is grouped so there are no more overlaps.
Test here on rextester.
WITH SRC AS
(
SELECT id, start_date, end_date,
row_number() over (partition by id order by start_date) as rn
FROM test_sl
)
, RCTE AS
(
SELECT id, rn, start_date, end_date
FROM SRC
WHERE rn = 1
UNION ALL
SELECT t.id, t.rn, iif(r.end_date >= t.start_date, r.start_date, t.start_date), t.end_date
FROM RCTE r
JOIN SRC t ON t.id = r.id AND t.rn = r.rn + 1
)
SELECT id, start_date, max(end_date) as end_date
FROM RCTE
GROUP BY id, start_date
ORDER BY id, start_date;

Retrieve rows for time interval but also previous row of each - how to?

I have a table like this:
Id FKId Amount1 Amount2 Date
-----------------------------------------------------
1 1 100,0000 33,0000 2018-01-18 19:57:39.403
2 2 50,0000 10,0000 2018-01-19 19:57:57.097
3 1 130,0000 40,0000 2018-01-20 19:58:13.660
5 2 44,0000 2,0000 2018-01-21 11:11:00.000
How can I get rows 3 and 5 (all rows dated 2018-01-20 or 2018-01-21) together with the previous row for each FKId (rows 1 and 2)?
Thank you
In most databases, you can use the ANSI standard lead() function:
select t.*
from (select t.*, lead(date) over (partition by fkid order by date) as next_date
      from t
     ) t
-- range comparisons, because the stored values include a time of day
where (date >= '2018-01-20' and date < '2018-01-22') or
      (next_date >= '2018-01-20' and next_date < '2018-01-22');
Alternatively, if you just want all records on or after some date plus the previous record for each FKId, this logic also works:
select t.*
from t
where t.date >= (select max(t2.date)
from t t2
where t2.fkid = t.fkid and t2.date < '2018-01-20'
);
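The "rows from a cutoff onward plus the one row before it" rule can also be sketched in Python, which makes it easy to test against the sample rows. The function name rows_with_previous and the tuple layout (Id, FKId, Date) are assumptions made for this sketch. One difference from the correlated subquery: it also keeps rows for an FKId that has no record before the cutoff, whereas the subquery returns NULL in that case and drops them.

```python
from datetime import datetime

rows = [
    (1, 1, datetime(2018, 1, 18, 19, 57, 39)),
    (2, 2, datetime(2018, 1, 19, 19, 57, 57)),
    (3, 1, datetime(2018, 1, 20, 19, 58, 13)),
    (5, 2, datetime(2018, 1, 21, 11, 11, 0)),
]
cutoff = datetime(2018, 1, 20)

def rows_with_previous(rows, cutoff):
    """Return, per FKId, the last row before the cutoff plus every row from the cutoff on."""
    selected = []
    for fkid in sorted({r[1] for r in rows}):
        group = sorted((r for r in rows if r[1] == fkid), key=lambda r: r[2])
        earlier = [r for r in group if r[2] < cutoff]
        later = [r for r in group if r[2] >= cutoff]
        # earlier[-1:] is empty when there is no previous row, so nothing breaks.
        selected.extend(earlier[-1:] + later)
    return selected

selected = rows_with_previous(rows, cutoff)
for r in selected:
    print(r)
```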

Select min/max dates for periods that don't intersect

For example, I have a table with 4 columns (date format dd.MM.yy):
id ban start end
1 1 01.01.15 31.12.18
1 1 02.02.15 31.12.18
1 1 05.04.15 31.12.17
In this case dates from rows 2 and 3 are included in dates from row 1
1 1 02.04.19 31.12.20
1 1 05.05.19 31.12.20
In this case dates from row 5 are included in dates from rows 4. Basically we have 2 periods that don't intersect.
01.01.15 31.12.18
and
02.04.19 31.12.20
A situation where a row starts in one period and ends in another is impossible. The end result should look like this:
1 1 01.01.15 31.12.18
1 1 02.04.19 31.12.20
I tried using analytic functions (LAG):
select id
, ban
, case
when start >= nvl(lag(start) over (partition by id, ban order by start, end asc), start)
and end <= nvl(lag(end) over (partition by id, ban order by start, end asc), end)
then nvl(lag(start) over (partition by id, ban order by start, end asc), start)
else start
end as start
, case
when start >= nvl(lag(start) over (partition by id, ban order by start, end asc), start)
and end <= nvl(lag(end) over (partition by id, ban order by start, end asc), end)
then nvl(lag(end) over (partition by id, ban order by start, end asc), end)
else end
end as end
from table
Here I order the rows and, if the current row's dates fall within the previous row's dates, I replace them with the previous row's dates. It works if I have just 2 rows. For example, this
1 1 08.09.15 31.12.99
1 1 31.12.15 31.12.99
turns into this
1 1 08.09.15 31.12.99
1 1 08.09.15 31.12.99
which I can then group by all fields and get what I want, but if there are more
1 2 13.11.15 31.12.99
1 2 31.12.15 31.12.99
1 2 16.06.15 31.12.99
I get
1 2 16.06.15 31.12.99
1 2 16.06.15 31.12.99
1 2 13.11.15 31.12.99
I understand why this happens, but how do I work around it? Running the query multiple times is not an option.
This query looks promising:
-- test data
with t(id, ban, dtstart, dtend) as (
select 1, 1, date '2015-01-01', date '2015-03-31' from dual union all
select 1, 1, date '2015-02-02', date '2015-03-31' from dual union all
select 1, 1, date '2015-03-15', date '2015-03-31' from dual union all
select 1, 1, date '2015-08-05', date '2015-12-31' from dual union all
select 1, 2, date '2015-01-01', date '2016-12-31' from dual union all
select 2, 1, date '2016-01-01', date '2017-12-31' from dual),
-- end of test data
step1 as (select id, ban, dt, to_number(inout) direction
from t unpivot (dt for inout in (dtstart as '1', dtend as '-1'))),
step2 as (select distinct id, ban, dt, direction,
sum(direction) over (partition by id, ban order by dt) sm
from step1),
step3 as (select id, ban, direction, dt dt1,
lead(dt) over (partition by id, ban order by dt) dt2
from step2
where (direction = 1 and sm = 1) or (direction = -1 and sm = 0) )
select id, ban, dt1, dt2
from step3 where direction = 1 order by id, ban, dt1
step1 - unpivot the dates and assign 1 to each start date and -1 to each end date (column direction)
step2 - add a cumulative sum of direction
step3 - keep only the boundary dates (where a period opens or closes) and pair each start with its end using lead()
You can shorten this syntax; I divided it into steps to show what's going on.
Result:
ID BAN DT1 DT2
------ ---------- ----------- -----------
1 1 2015-01-01 2015-03-31
1 1 2015-08-05 2015-12-31
1 2 2015-01-01 2016-12-31
2 1 2016-01-01 2017-12-31
I assumed that for different (ID, BAN) we have to make calculations separately. If not - change partitioning and ordering in sum() and lead().
Pivot and unpivot work in Oracle 11g and later; for earlier versions you need case when.
BTW, START is a reserved word in Oracle, so I changed the column names slightly in my example.
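The +1/-1 running-sum idea behind step1 to step3 can be sketched in a few lines of Python for a single (id, ban) partition. This is an illustration, not a translation of the Oracle query: the function name merge_periods is made up, and start events are explicitly sorted before end events on the same day so that touching periods merge.

```python
from datetime import date

def merge_periods(periods):
    """Merge overlapping (start, end) periods using a +1/-1 running sum."""
    events = []
    for start, end in periods:
        events.append((start, 0, 1))   # start event, sorted before end events on the same day
        events.append((end, 1, -1))    # end event
    merged, open_count, current_start = [], 0, None
    for dt, _, direction in sorted(events):
        if direction == 1 and open_count == 0:
            current_start = dt                  # running sum goes 0 -> 1: a period opens
        open_count += direction
        if open_count == 0:
            merged.append((current_start, dt))  # running sum back to 0: the period closes
    return merged

periods = [
    (date(2015, 1, 1), date(2015, 3, 31)),
    (date(2015, 2, 2), date(2015, 3, 31)),
    (date(2015, 3, 15), date(2015, 3, 31)),
    (date(2015, 8, 5), date(2015, 12, 31)),
]
merged = merge_periods(periods)
for start, end in merged:
    print(start, end)
```

The input is the (1, 1) partition of the test data above, and the output matches the first two rows of the result table.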
I like to do this by identifying the period starts, then doing a cumulative sum to define the group, and a final aggregation:
select id, ban, min(start), max(end)
from (select t.*, sum(start_flag) over (partition by id, ban order by start) as grp
      from (select t.*,
                   -- a period starts where no earlier-starting row reaches this row's start
                   (case when exists (select 1
                                      from t t2
                                      where t2.id = t.id and t2.ban = t.ban and
                                            t2.start < t.start and t2.end >= t.start
                                     )
                         then 0 else 1
                    end) as start_flag
            from t
           ) t
     ) t
group by id, ban, grp;

SQL Server- find all records within a certain date (not that straightforward!)

Ok. My SQL is pretty pants so I'm struggling to get my head around this.
I have a table that stores records complete with a time stamp.
What I want is a list of uids where there are 2 or more records for that user within 1 second of each other. Maybe I've made it more complicated in my head, but I just cannot figure it out.
Shortened version of table (pk ignored)
uid date
1 2015-01-01 10:00:30.020*
1 2015-01-01 10:00:30.300*
1 2015-01-01 10:00:30.500*
1 2015-01-01 10:00:39.000
1 2015-01-01 10:00:35.000
1 2015-01-01 10:00:37.800
2 2015-02-02 12:00:30.000
2 2015-02-02 14:00:30.000
2 2015-02-02 15:00:30.000
2 2015-02-02 18:00:30.000
3 2015-03-02 15:00:24.000
3 2015-03-02 15:00:20.000 *
3 2015-03-02 15:00:20.300 *
I've marked * next to the records I'd expect to match.
The results list I'd like is just a list of uid, so the result I'd want would just be
1
3
You can do this with exists:
select distinct uid
from t
where exists (select 1
from t t2
where t2.uid = t.uid and
t2.date > t.date and
t2.date <= t.date + interval 1 second
);
Note: The syntax for adding 1 second varies by database. But the above gives the idea for the logic.
In SQL Server, the syntax is:
select distinct uid
from t
where exists (select 1
from t t2
where t2.uid = t.uid and
t2.date > t.date and
t2.date <= dateadd(second, 1, t.date)
);
EDIT:
Or, in SQL Server 2012+, a faster alternative is to use lead() or lag():
select distinct uid
from (select t.*, lead(date) over (partition by uid order by date) as next_date
from t
) t
where next_date <= dateadd(second, 1, date);
If you want the records, not just the uids, then you need to get both:
select t.*
from (select t.*,
lag(date) over (partition by uid order by date) as prev_date,
lead(date) over (partition by uid order by date) as next_date
from t
) t
where next_date <= dateadd(second, 1, date) or
prev_date >= dateadd(second, -1, date);
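The lead() approach relies on one observation: once a user's timestamps are sorted, any two records within one second of each other imply that some adjacent pair is within one second too. A small Python sketch of that check, using a subset of the question's data (the function name uids_with_close_records is made up for the illustration):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def uids_with_close_records(records, window=timedelta(seconds=1)):
    """Return the uids that have two records no more than `window` apart."""
    by_uid = defaultdict(list)
    for uid, ts in records:
        by_uid[uid].append(ts)
    hits = []
    for uid, stamps in by_uid.items():
        stamps.sort()
        # Compare each timestamp with the next one, like lead() over an ordered partition.
        if any(b - a <= window for a, b in zip(stamps, stamps[1:])):
            hits.append(uid)
    return sorted(hits)

records = [
    (1, datetime(2015, 1, 1, 10, 0, 30, 20000)),
    (1, datetime(2015, 1, 1, 10, 0, 30, 300000)),
    (1, datetime(2015, 1, 1, 10, 0, 39)),
    (2, datetime(2015, 2, 2, 12, 0, 30)),
    (2, datetime(2015, 2, 2, 14, 0, 30)),
    (3, datetime(2015, 3, 2, 15, 0, 24)),
    (3, datetime(2015, 3, 2, 15, 0, 20)),
    (3, datetime(2015, 3, 2, 15, 0, 20, 300000)),
]
close_uids = uids_with_close_records(records)
print(close_uids)
```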