Display only records with continuous coverage - SQL

I have sample data as below. Even though there are duplicates in the data, I want the result as shown below; duplicates can be ignored.
Sample data:
123456 2019-03-01 2199-12-31
123456 2019-03-01 2019-12-31
123456 2019-03-01 2199-12-31
123456 2020-01-01 2199-12-31
123456 1920-01-01 2019-02-28
The required output is as below:
123456 1920-01-01 2019-02-28
123456 2019-03-01 2199-12-31
Can someone please help me write SQL to display the output with continuous coverage records, with end date 2199-12-31?

You can solve this by unpivoting the data and keeping track of periods when the net number of active records is greater than 0:
with t as (
      select id, start as dt, 1 as inc
      from <table> t
      union all
      select id, end as dt, -1 as inc
      from <table> t
     )
select id, min(dt), max(next_dt)
from (select t.*,
             sum(case when net_ins = 0 then 1 else 0 end) over (partition by id order by dt) as grp,
             lead(dt) over (partition by id order by dt) as next_dt
      from (select id, dt,
                   sum(sum(inc)) over (partition by id order by dt) as net_ins
            from t
            group by id, dt
           ) t
     ) t
where net_ins > 0
group by id, grp;
Here is a db<>fiddle.
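For reference, here is a minimal setup you could use to try the query against the question's sample data. The table name coverage and the column names start_dt and end_dt are assumptions (the question does not give them); substitute them for <table>, start and end in the query above, which also sidesteps start/end being reserved words in many databases:
-- hypothetical table and column names; adjust to match your schema
create table coverage (id int, start_dt date, end_dt date);
insert into coverage values
    (123456, date '2019-03-01', date '2199-12-31'),
    (123456, date '2019-03-01', date '2019-12-31'),
    (123456, date '2019-03-01', date '2199-12-31'),
    (123456, date '2020-01-01', date '2199-12-31'),
    (123456, date '1920-01-01', date '2019-02-28');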


SQL, rank for each instance of a partition

I am trying to create a rank for each instance of a status occurring, for example:
ID  Status       From_date   To_date     rank
1   Available    2022-01-01  2022-01-02  1
1   Available    2022-01-02  2022-01-03  1
1   Unavailable  2022-01-03  2022-01-10  2
1   Available    2022-01-10  2022-01-20  3
The rank should increment for each ID, for each instance of a status occurring, ordered by from_date ascending.
I want to do this as I see it as the best way of getting to the final result I want, which is:
ID  Status       From_date   To_date     rank
1   Available    2022-01-01  2022-01-03  1
1   Unavailable  2022-01-03  2022-01-10  2
1   Available    2022-01-10  2022-01-20  3
I tried dense_rank() over (partition by id order by status, from_date) but can see now why that wouldn't work. Not sure how to get to this result.
So with this CTE for the data:
with data(ID, Status, From_date, To_date) as (
select * from values
(1, 'Available', '2022-01-01', '2022-01-02'),
(1, 'Available', '2022-01-02', '2022-01-03'),
(1, 'Unavailable', '2022-01-03', '2022-01-10'),
(1, 'Available', '2022-01-10', '2022-01-20')
)
The first result, being the rank, can be done with CONDITIONAL_CHANGE_EVENT:
select *
,CONDITIONAL_CHANGE_EVENT( Status ) OVER ( PARTITION BY ID ORDER BY From_date ) as rank
from data;
ID  STATUS       FROM_DATE   TO_DATE     RANK
1   Available    2022-01-01  2022-01-02  0
1   Available    2022-01-02  2022-01-03  0
1   Unavailable  2022-01-03  2022-01-10  1
1   Available    2022-01-10  2022-01-20  2
Keeping the first row of each rank can then be achieved with QUALIFY/ROW_NUMBER. Because CONDITIONAL_CHANGE_EVENT is a complex operation, it needs wrapping in a sub-select, so the answer is not as short as I would like:
select * from (
select *
,CONDITIONAL_CHANGE_EVENT( Status ) OVER ( PARTITION BY ID ORDER BY From_date ) as rank
from data
)
qualify row_number() over(partition by id, rank ORDER BY From_date ) = 1
gives:
ID  STATUS       FROM_DATE   TO_DATE     RANK
1   Available    2022-01-01  2022-01-02  0
1   Unavailable  2022-01-03  2022-01-10  1
1   Available    2022-01-10  2022-01-20  2
Also, the final result minus the ranking can be done with:
select *
from data
qualify nvl(Status <> lag(status) over ( PARTITION BY ID ORDER BY From_date ), true)
ID  STATUS       FROM_DATE   TO_DATE
1   Available    2022-01-01  2022-01-02
1   Unavailable  2022-01-03  2022-01-10
1   Available    2022-01-10  2022-01-20
And thus a rank can be added at the end:
select *
,rank() over ( PARTITION BY ID ORDER BY From_date ) as rank
from (
select *
from data
qualify nvl(Status <> lag(status) over ( PARTITION BY ID ORDER BY From_date ), true)
)
ID  STATUS       FROM_DATE   TO_DATE     RANK
1   Available    2022-01-01  2022-01-02  1
1   Unavailable  2022-01-03  2022-01-10  2
1   Available    2022-01-10  2022-01-20  3
This is a typical gaps-and-islands problem, where islands are groups of consecutive rows that have the same status.
Here is one way to solve it with window functions:
select id, status,
    min(from_date) from_date, max(to_date) to_date,
    row_number() over (partition by id order by min(from_date)) rn
from (
    select t.*,
        row_number() over (partition by id order by from_date) rn1,
        row_number() over (partition by id, status order by from_date) rn2
    from mytable t
) t
group by id, status, rn1 - rn2
order by min(from_date)
This works by ranking rows within two different partitions (with and without the status); the difference between the row numbers defines the islands.
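To see why this works, here are the two row numbers worked out by hand over the sample data from the question (id 1, ordered by from_date):
Status       From_date   rn1  rn2  rn1 - rn2
Available    2022-01-01  1    1    0
Available    2022-01-02  2    2    0
Unavailable  2022-01-03  3    1    2
Available    2022-01-10  4    3    1
Rows of the same status with the same difference form one island, which the outer query then collapses with min(from_date) and max(to_date).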
You can group consecutive statuses using conditional_change_event, then collapse the dates using min and max, and finally use row_number() to rank the events:
with cte as (
    select *,
        conditional_change_event(status) over (partition by id order by from_date) as rn
    from t
)
select id,
    status,
    min(from_date) as from_date,
    max(to_date) as to_date,
    row_number() over (partition by id order by min(from_date), max(to_date)) as rank
from cte
group by id, status, rn
order by rank

Need to get maximum date range which is overlapping in SQL

I have a table with 3 columns: id, start_date, end_date.
Some of the values are as follows:
1 2018-01-01 2030-01-01
1 2017-10-01 2018-10-01
1 2019-01-01 2020-01-01
1 2015-01-01 2016-01-01
2 2010-01-01 2011-02-01
2 2010-10-01 2010-12-01
2 2008-01-01 2009-01-01
I have the above kind of data set, where I have to filter out overlapping date ranges by keeping the maximum date range, and keep the other date ranges which do not overlap, for a particular id.
Hence desired output should be:
1 2018-01-01 2030-01-01
1 2015-01-01 2016-01-01
2 2010-01-01 2011-02-01
2 2008-01-01 2009-01-01
I am unable to find the right way to code this in Impala. Can someone please help me?
I have tried:
with cte as (
    select a.*,
        row_number() over (partition by id order by datediff(end_date, start_date) desc) as flag
    from mytable a
)
select * from cte where flag = 1
but this will also remove the other date ranges which are not overlapping. Please help.
Use row_number together with a count of items (countItem) for each id:
with cte as(
select *,
row_number() over(partition by id order by id) as seq,
count(*) over(partition by id order by id) as countItem
from mytable
)
select id,start_date,end_date
from cte
where seq = 1 or seq = countItem
Or without a CTE:
select id,start_date,end_date
from
(select *,
row_number() over(partition by id order by id) as seq,
count(*) over(partition by id order by id) as countItem
from mytable) t
where seq = 1 or seq = countItem
demo in db<>fiddle
You can use a cumulative max to see if there is any overlap with preceding rows. If there is not, then you have the first row of a new group (row in the result set).
A cumulative sum of the starts assigns each row in the source to a group. Then aggregate:
select id, min(start_date), max(end_date)
from (select t.*,
             sum(case when prev_end_date >= start_date then 0 else 1 end) over
                 (partition by id
                  order by start_date
                  rows between unbounded preceding and current row
                 ) as grp
      from (select t.*,
                   max(end_date) over (partition by id
                                       order by start_date
                                       rows between unbounded preceding and 1 preceding
                                      ) as prev_end_date
            from t
           ) t
     ) t
group by id, grp;
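To illustrate the grouping, here are the intermediate values worked out by hand for id 2, ordered by start_date:
start_date  end_date    prev_end_date  grp
2008-01-01  2009-01-01  null           1
2010-01-01  2011-02-01  2009-01-01     2
2010-10-01  2010-12-01  2011-02-01     2
The third row starts before the running maximum end date, so it stays in group 2, and the final aggregation returns (2008-01-01, 2009-01-01) and (2010-01-01, 2011-02-01) for that id.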

Select start and end dates for changing values in SQL

I have a database with accounts and historical status changes
select Date, Account, OldStatus, NewStatus from HistoricalCodes
order by Account, Date
Date        Account  OldStatus  NewStatus
2020-01-01  12345    1          2
2020-10-01  12345    2          3
2020-11-01  12345    3          2
2020-12-01  12345    2          1
2020-01-01  54321    2          3
2020-09-01  54321    3          2
2020-12-01  54321    2          3
For every account I need to determine the start date and end date when Status = 2. An additional challenge is that the status can change back and forth multiple times. Is there a way in SQL to create something like this, for at least the first two timeframes when an account was in status 2? Any ideas?
Account  StartDt_1   EndDt_1     StartDt_2   EndDt_2
12345    2020-01-01  2020-10-01  2020-11-01  2020-12-01
54321    2020-09-01  2020-12-01
I would suggest putting this information in separate rows:
select t.*
from (select account, newstatus, date as startdate,
             lead(date) over (partition by account order by date) as enddate
      from t
     ) t
where newstatus = 2;
This produces a separate row for each period when an account has a status of 2. This is better than putting the dates in separate pairs of columns, because you do not need to know the maximum number of periods of status = 2 when you write the query.
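Against the sample data in the question, this returns one row per status-2 period (worked out by hand):
account  newstatus  startdate   enddate
12345    2          2020-01-01  2020-10-01
12345    2          2020-11-01  2020-12-01
54321    2          2020-09-01  2020-12-01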
For a fixed maximum of status changes per account, you can use window functions and conditional aggregation:
select account,
max(case when rn = 1 then date end) as start_dt1,
max(case when rn = 1 then lead_date end) as end_dt1,
max(case when rn = 2 then date end) as start_dt2,
max(case when rn = 2 then lead_date end) as end_dt2
from (
select t.*,
row_number() over(partition by account, newstatus order by date) as rn,
lead(date) over(partition by account order by date) as lead_date
from mytable t
) t
where newstatus = 2
group by account
You can extend the select clause with more conditional expressions to handle more possible ranges per account.
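For example, a third status-2 period per account could be captured like this (the same query with two more conditional aggregates; only useful if an account can have three or more such stretches):
select account,
    max(case when rn = 1 then date end) as start_dt1,
    max(case when rn = 1 then lead_date end) as end_dt1,
    max(case when rn = 2 then date end) as start_dt2,
    max(case when rn = 2 then lead_date end) as end_dt2,
    max(case when rn = 3 then date end) as start_dt3,
    max(case when rn = 3 then lead_date end) as end_dt3
from (
    select t.*,
        row_number() over(partition by account, newstatus order by date) as rn,
        lead(date) over(partition by account order by date) as lead_date
    from mytable t
) t
where newstatus = 2
group by account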

Making groups of dates in SQL Server

I have a table that contains ids and dates, and I want to build groups of dates for each id:
id date
------------------
1 2019-01-01
2 2019-01-01
1 2019-01-02
2 2019-01-02
2 2019-01-03
1 2019-01-04
1 2019-01-05
2 2019-01-05
2 2019-01-06
I want to check where there are gaps in the dates for each id, to get output like:
id from to
------------------------------------
1 2019-01-01 2019-01-02
1 2019-01-04 2019-01-05
2 2019-01-01 2019-01-03
2 2019-01-05 2019-01-06
This is a form of gaps-and-islands problem. The simplest solution is to generate a sequential number for each id and subtract that from the date. The difference is constant for dates that are sequential.
So:
select id, min(date), max(date)
from (select t.*, row_number() over (partition by id order by date) as seqnum
from t
) t
group by id, dateadd(day, -seqnum, date)
order by id, min(date);
Here is a db<>fiddle.
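For example, for id 1 the intermediate values look like this (worked out by hand):
date        seqnum  dateadd(day, -seqnum, date)
2019-01-01  1       2018-12-31
2019-01-02  2       2018-12-31
2019-01-04  3       2019-01-01
2019-01-05  4       2019-01-01
Consecutive dates share the same derived value, so grouping on it yields the islands 2019-01-01 to 2019-01-02 and 2019-01-04 to 2019-01-05.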
A typical approach to this gaps-and-islands problem is to build the groups by comparing the date of the current record to the "previous" date of the same id. When dates are not consecutive, a new group starts:
select id, min(date) from_date, max(date) to_date
from (
    select t.*,
        sum(case when date = dateadd(day, 1, lag_date) then 0 else 1 end)
            over(partition by id order by date) grp
    from (
        select t.*,
            lag(date) over(partition by id order by date) lag_date
        from mytable t
    ) t
) t
group by id, grp
order by id, from_date
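For id 2 the intermediate values would be (worked out by hand):
date        lag_date    grp
2019-01-01  null        1
2019-01-02  2019-01-01  1
2019-01-03  2019-01-02  1
2019-01-05  2019-01-03  2
2019-01-06  2019-01-05  2
The jump from 2019-01-03 to 2019-01-05 starts a new group, giving the islands 2019-01-01 to 2019-01-03 and 2019-01-05 to 2019-01-06.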

Skip specific rows using LAG in sql

I have a table that looks like this:
Using the LAG function in SQL, I would like to perform the LAG only on rows where start_date = end_date and get the previous start_date record where start_date = end_date.
So that my end table will have an extra column like this:
I hope my question is clear, any help is appreciated.
You can assign a group to these values and use that:
select t.*,
(case when start_date = end_date
then lag(start_date) over (partition by (case when start_date = end_date then 1 else 0 end) order by start_date)
end) as prev_eq_start_date
from t;
Or:
select t.*,
(case when start_date = end_date
then lag(start_date) over (partition by start_date = end_date order by start_date)
end) as prev_eq_start_date
from t;
Note that if your data is big and most rows have different dates, then you might have a resources issue. In this case, an additional, unused partition by key can help:
select t.*,
(case when start_date = end_date
then lag(start_date) over (partition by (case when start_date = end_date then 1 else 2 end), (case when start_date <> end_date then start_date end) order by start_date)
end) as prev_eq_start_date
from t;
This has no impact on the result but it can avoid a resources error caused by too many rows with different values.
Below is for BigQuery Standard SQL
#standardSQL
SELECT *, NULL AS lag_result
FROM `project.dataset.table` WHERE start_date != end_date
UNION ALL
SELECT *, LAG(start_date) OVER(ORDER BY start_date)
FROM `project.dataset.table` WHERE start_date = end_date
If applied to the sample data in your question, the result is:
Row user_id start_date end_date lag_result
1 1 2019-01-01 2019-02-28 null
2 3 2019-02-27 2019-02-28 null
3 4 2019-08-04 2019-09-01 null
4 2 2019-02-01 2019-02-01 null
5 5 2019-08-07 2019-08-07 2019-02-01
6 6 2019-08-27 2019-08-27 2019-08-07
By the way, in case your start_date and end_date are of STRING data type ('27/02/2019') rather than DATE type ('2019-02-27', as was assumed in the above query), you should use the below one:
#standardSQL
SELECT *, NULL AS lag_result
FROM `project.dataset.table` WHERE start_date != end_date
UNION ALL
SELECT *, LAG(start_date) OVER(ORDER BY PARSE_DATE('%d/%m/%Y', start_date))
FROM `project.dataset.table` WHERE start_date = end_date
with result
Row user_id start_date end_date lag_result
1 1 01/01/2019 28/02/2019 null
2 3 27/02/2019 28/02/2019 null
3 4 04/08/2019 01/09/2019 null
4 2 01/02/2019 01/02/2019 null
5 5 07/08/2019 07/08/2019 01/02/2019
6 6 27/08/2019 27/08/2019 07/08/2019
Use JOIN
SQL FIDDLE
SELECT T.*, T1.LAG_Result
FROM TABLE T
LEFT JOIN (
    SELECT User_Id, LAG(start_date) OVER(ORDER BY start_date) LAG_Result
    FROM TABLE S
    WHERE start_date = end_date
) T1 ON T.User_Id = T1.User_Id