Partition rows where dates are between the previous dates - sql

I have the below table.
I want to identify overlapping intervals of start_date and end_date.
*edit I would like to remove the row that has the least amount of days between the start and end date where those rows overlap.
Example:
pgid 1 & pgid 2 have overlapping days. Remove the row that has the least amount of days between start_date and end_date.
Table A
id pgid Start_date End_date Days
1 1 8/4/2018 9/10/2018 37
1 2 9/8/2018 9/8/2018 0
1 3 10/29/2018 11/30/2018 32
1 4 12/1/2018 sysdate 123
Expected Results:
id Start_date End_date Days
1 8/4/2018 9/10/2018 37
1 10/29/2018 11/30/2018 32
1 12/1/2018 sysdate 123

I am thinking exists:
select t.*,
(case when exists (select 1
from t t2
where t2.start_date < t.start_date and
t2.end_date > t.end_date and
t2.id = t.id
)
then 2 else 1
end) as overlap_flag
from t;

Maybe lead and lag:
SELECT
CASE
WHEN END_DATE > LEAD (START_DATE) OVER (PARTITION BY id ORDER BY START_DATE) THEN 1
WHEN START_DATE < LAG (END_DATE) OVER (PARTITION BY id ORDER BY START_DATE) THEN 1
ELSE 0
END OVERLAP_FLAG
FROM A

Related

SQL 30 day active user query

I have a table of users and how many events they fired on a given date:
DATE
USERID
EVENTS
2021-08-27
1
5
2021-07-25
1
7
2021-07-23
2
3
2021-07-20
3
9
2021-06-22
1
9
2021-05-05
1
4
2021-05-05
2
2
2021-05-05
3
6
2021-05-05
4
8
2021-05-05
5
1
I want to create a table showing number of active users for each date with active user being defined as someone who has fired an event on the given date or in any of the preceding 30 days.
DATE
ACTIVE_USERS
2021-08-27
1
2021-07-25
3
2021-07-23
2
2021-07-20
2
2021-06-22
1
2021-05-05
5
I tried the following query which returned only the users who were active on the specified date:
SELECT COUNT(DISTINCT USERID), DATE
FROM table
WHERE DATE >= (CURRENT_DATE() - interval '30 days')
GROUP BY 2 ORDER BY 2 DESC;
I also tried using a window function with rows between but seems to end up getting the same result:
SELECT
DATE,
SUM(ACTIVE_USERS) AS ACTIVE_USERS
FROM
(
SELECT
DATE,
CASE
WHEN SUM(EVENTS) OVER (PARTITION BY USERID ORDER BY DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) >= 1 THEN 1
ELSE 0
END AS ACTIVE_USERS
FROM table
)
GROUP BY 1
ORDER BY 1
I'm using SQL:ANSI on Snowflake. Any suggestions would be much appreciated.
This is tricky to do as window functions -- because count(distinct) is not permitted. You can use a self-join:
select t1.date, count(distinct t2.userid)
from table t join
table t2
on t2.date <= t.date and
t2.date > t.date - interval '30 day'
group by t1.date;
However, that can be expensive. One solution is to "unpivot" the data. That is, do an incremental count per user of going "in" and "out" of active states and then do a cumulative sum:
with d as ( -- calculate the dates with "ins" and "outs"
select user, date, +1 as inc
from table
union all
select user, date + interval '30 day', -1 as inc
from table
),
d2 as ( -- accumulate to get the net actives per day
select date, user, sum(inc) as change_on_day,
sum(sum(inc)) over (partition by user order by date) as running_inc
from d
group by date, user
),
d3 as ( -- summarize into active periods
select user, min(date) as start_date, max(date) as end_date
from (select d2.*,
sum(case when running_inc = 0 then 1 else 0 end) over (partition by user order by date) as active_period
from d2
) d2
where running_inc > 0
group by user
)
select d.date, count(d3.user)
from (select distinct date from table) d left join
d3
on d.date >= start_date and d.date < end_date
group by d.date;

Skip specific rows using LAG in sql

I have a table that looks like this:
Using the LAG function in SQL, I would like to perform the LAG on only values where star_date=end_date and get the past previous start_date record where start_date=end_date.
That my end table will have an extra column like this:
I hope my question is clear, any help is appreciated.
You can assign a group to these values and use that:
select t.*,
(case when start_date = end_date
then lag(start_date) over (partition by (case when start_date = end_date then 1 else 0 end) order by start_date)
end) as prev_eq_start_date
from t;
Or:
select t.*,
(case when start_date = end_date
then lag(start_date) over (partition by start_date = end_date order by start_date)
end) as prev_eq_start_date
from t;
Note if you data is big and most rows have different dates, then you might have a resources issue. In this case, an additional, unused partition by key can help:
select t.*,
(case when start_date = end_date
then lag(start_date) over (partition by (case when start_date = end_date then 1 else 2 end), (case when start_date <> end_date then start_date end) order by start_date)
end) as prev_eq_start_date
from t;
This has no impact on the result but it can avoid a resources error caused by too many rows with different values.
Below is for BigQuery Standard SQL
#standardSQL
SELECT *, NULL AS lag_result
FROM `project.dataset.table` WHERE start_date != end_date
UNION ALL
SELECT *, LAG(start_date) OVER(ORDER BY start_date)
FROM `project.dataset.table` WHERE start_date = end_date
If to apply to sample data in your question - result is
Row user_id start_date end_date lag_result
1 1 2019-01-01 2019-02-28 null
2 3 2019-02-27 2019-02-28 null
3 4 2019-08-04 2019-09-01 null
4 2 2019-02-01 2019-02-01 null
5 5 2019-08-07 2019-08-07 2019-02-01
6 6 2019-08-27 2019-08-27 2019-08-07
Btw, in case if your start_date and end_date are of STRING data type ('27/02/2019') vs. DATE type ('2019-02-27' as it was assumed in above query) - you should use below one
#standardSQL
SELECT *, NULL AS lag_result
FROM `project.dataset.table` WHERE start_date != end_date
UNION ALL
SELECT *, LAG(start_date) OVER(ORDER BY PARSE_DATE('%d/%m/%Y', start_date))
FROM `project.dataset.table` WHERE start_date = end_date
with result
Row user_id start_date end_date lag_result
1 1 01/01/2019 28/02/2019 null
2 3 27/02/2019 28/02/2019 null
3 4 04/08/2019 01/09/2019 null
4 2 01/02/2019 01/02/2019 null
5 5 07/08/2019 07/08/2019 01/02/2019
6 6 27/08/2019 27/08/2019 07/08/2019
Use JOIN
SQL FIDDLE
SELECT T.*,T1.LAG_Result
FROM TABLE T LEFT JOIN
(
SELECT User_Id,LAG(start_date) OVER(ORDER BY start_date) LAG_Result
FROM TABLE S
WHERE start_date = end_date
) T1 ON T.User_Id = T1.User_Id

Count days from start_date to end_date or end of month

With datediff() I can count the days between two dates, but how can I count the days between the later date or the end of the month and the start date?
CREATE TABLE table1 (id int, start_date datetime, end_date datetime, jan int);
INSERT INTO table1 (id, start_date, end_date) VALUES
(1, '2016-12-12', '2017-01-17'),
(2, '2017-01-10', '2017-01-10'),
(3, '2017-01-10', '2017-02-10'),
(4, '2017-01-03', '2017-02-03'),
(5, '2016-12-03', '2017-02-03');
If I run:
select id, month(start_date) as month, datediff(end_date, start_date) as diff
from table1;
it returns
id month diff
1 12 36
2 1 0
3 1 31
4 1 31
5 12 62
but I would like it to return:
id month diff
1 12 19
5 12 28
1 1 17
2 1 0
3 1 21
4 1 28
5 1 31
3 2 10
4 2 3
5 2 3
I'm trying to get the amount of days in a month a event occurs by month.
I've created a separated query to update a new column with the values, but ideally it shouldn't have a new column, since I would need several new columns for each year-month combination and one for each year-month combination:
update table1 set jan= case
when start_date >= "2017-01-01" and end_date <= last_day("2017-01-01") then datediff(end_date, start_date)+1
when start_date >= "2017-01-01" and start_date <= last_day("2017-01-01") and end_date > last_day("2017-01-01") then datediff(last_day("2017-01-01"), start_date)+1
when start_date < "2017-01-01" and end_date between "2017-01-01" and last_day("2017-01-01") then datediff(end_date, "2017-01-01")+1
when start_date < "2017-01-01" and end_date > last_day("2017-01-01") then day(last_day("2017-01-01"))
else null
end;
Your problem is going to be getting multiple rows... so let's take a different tack.
This ends up being trivial if you have a calendar table: a table with a row-per-date (and a bunch of individual columns and indices):
SELECT Table1.id, Calendar.calendar_month, COUNT(*)
FROM Table1
JOIN Calendar
ON Calendar.calendar_date >= start_date
AND Calendar.calendar_date < end_date
GROUP BY Table1.id, Calendar.calendar_month
ORDER BY Table1.id, MIN(Calendar.calendar_date)
Fiddle Demo
I don't know if this is what you're looking for.
select month(start_date) as month,
datediff(LAST_DAY(start_date), start_date) as diff
from table1
UNION ALL
select month(end_date) as month,
IF(end_date < LAST_DAY(start_date), datediff(start_date, end_date),
datediff(end_date, LAST_DAY(start_date)))
from table1;
DEMO

Oracle SQL overlap between begin date and end date in 2 or more records

Database my_table:
id seq start_date end_date
1 1 01-01-2017 02-01-2017
1 2 07-01-2017 09-01-2017
1 3 11-01-2017 11-01-2017
2 1 20-01-2017 20-01-2017
3 1 01-02-2017 02-02-2017
3 2 03-02-2017 04-02-2017
3 3 08-01-2017 09-02-2017
3 4 09-01-2017 10-02-2017
3 5 10-01-2017 12-02-2017
My requirement is to get the first date (normally seq 1 start date) and end date (normally last seq end date) and the number of dates occurred during all seq for each unique ID.
Date occurred:
id 1 2 3
01-01-2017 20-01-2017 01-02-2017
02-01-2017 02-02-2017
07-01-2017 03-02-2017
08-01-2017 04-02-2017
09-01-2017 08-02-2017
11-01-2017 09-02-2017
10-02-2017
11-02-2017
12-02-2017
total 6 1 9
Here is the result I want:
id start_date end_date num_date
1 01-01-2017 11-01-2017 6
2 20-01-2017 20-01-2017 1
3 01-02-2017 12-02-2017 9
I have tried
SELECT id
, MIN(start_date)
, MAX(end_date)
, SUM(end_date - start_date + 1)
FROM my_table
GROUP BY id
and this SQL statement work fine in id 1 and 2 since there is no overlap date between begin date and end date. But for id 3, the result num_date is 11. Could you please suggest the SQL statement to solve this problem? Thank you.
One more question: The date in database is in datetime format. How do I convert it to date. I tried to use TRUNC function but it sometimes convert date to yesterday instead.
You need to count how many times an end_date equals the following start_date. For this you need to use the lag() or the lead() analytic function. You can use a case expression for the comparison, but alas you can't wrap the case expression within a COUNT or SUM in the same query; you need a subquery and an outer query.
Something like this; not tested, since you didn't provide CREATE TABLE and INSERT statements to recreate your sample data.
select id, min(start_date) as start_date, max(end_date) as end_date,
sum(end_date - start_date + 1 - flag) as num_days
from ( select id, start_date, end_date,
case when start_date = lag(end_date)
over (partition by id order by end_date) then 1
else 0 end as flag
from my_table
)
group by id;
SELECT id,
MIN( start_date ) AS start_date,
MAX( end_date ) AS end_date,
SUM( end_date - start_date + 1 ) AS num_days
FROM (
SELECT id,
GREATEST(
start_date,
COALESCE(
LAG( end_date ) OVER ( PARTITION BY id ORDER BY seq ) + 1,
start_date
)
) AS start_date,
end_date
FROM your_table
)
WHERE start_date <= end_date
GROUP BY id;

start date end date combine rows

In Redshift, through SQL script want to consolidate monthly records as long as gap between the end date of first and the start date of the next record is 32 days or less (<=32) into single record with minimum startdate of continuous month as output startdate and maximum of end date of continuous month as output enddate.
The below input data refers to the table's data and also listed the expected output. The input data is listed ORDER BY ID,STARTDT,ENDDT in ASC.
For example, in below table, consider ID 100, the gab between the end of the first record and start of the next record <=32, however gap between the second record end date and third records start date falls more than 32 days, hence the first two records to be consolidate into one record i.e. (ID),MIN(STARTSDT),MAX(ENDDT) which corresponds to first record in the expected output. Similarly gab between 3 and 4 record in the input data falls within the 32 days and thus these 2 records to be consolidated into single records which corresponds to the second record in the expected output.
INPUT DATA:
ID STARTDT ENDDT
100 2000-01-01 2000-01-31
100 2000-02-01 2000-02-29
100 2000-05-01 2000-05-31
100 2000-06-01 2000-06-30
100 2000-09-01 2000-09-30
100 2000-10-01 2000-10-31
101 2012-06-01 2012-06-30
101 2012-07-01 2012-07-31
102 2000-01-01 2000-01-31
103 2013-03-01 2013-03-31
103 2013-05-01 2013-05-31
EXPECTED OUTPUT:
ID MIN_STARTDT MAX_END_DT
100 2000-01-01 2000-02-29
100 2000-05-01 2000-06-30
100 2000-09-01 2000-10-31
101 2012-06-01 2012-07-31
102 2000-01-01 2000-01-31
103 2013-03-01 2013-03-31
103 2013-05-01 2013-05-31
You can do this in steps:
Use a join to identify where two adjacent records should be combined.
Then do a cumulative sum to assign all such adjacent records a grouping identifier.
Aggregate.
It looks like:
select id, min(startdt), max(enddte)
from (select t.*,
count(case when tprev.id is null then 1 else 0 end) over
(partition by t.idid
order by t.startdt
rows between unbounded preceding and current row
) as grp
from t left join
t tprev
on t.id = tprev.id and
t.startdt = tprev.enddt + interval '1 day'
) t
group by id, grp;
The question is very similar to this one and my answer is also similar: Fetch rows based on condition
The gist of the idea is to use Window Functions to identify transitions between period (events which are less than 33 days apart), and then do some filtering to remove the rows within the period, and then Window Functions again.
Complete solution:
SELECT
id,
startdt AS period_start,
period_end
FROM (
SELECT
id,
startdt,
enddt,
lead(enddt, 1)
OVER (PARTITION BY id
ORDER BY enddt) AS period_end,
period_boundary
FROM (
SELECT
id,
startdt,
enddt,
CASE WHEN period_switch = 0 AND reverse_period_switch = 1
THEN 'start'
ELSE 'end' END AS period_boundary
FROM (
SELECT
id,
startdt,
enddt,
CASE WHEN datediff(days, enddt, lead(startdt, 1)
OVER (PARTITION BY id
ORDER BY enddt ASC)) > 32
THEN 1
ELSE 0 END AS period_switch,
CASE WHEN datediff(days, lead(enddt, 1)
OVER (PARTITION BY id
ORDER BY enddt DESC), startdt) > 32
THEN 1
ELSE 0 END AS reverse_period_switch
FROM date_test
)
AS sessioned
WHERE period_switch != 0 OR reverse_period_switch != 0
UNION
SELECT -- adding start rows without transition
id,
startdt,
enddt,
'start'
FROM (
SELECT
id,
startdt,
enddt,
row_number()
OVER (PARTITION BY id
ORDER BY enddt ASC) AS row_num
FROM date_test
) AS with_row_number
WHERE row_num = 1
UNION
SELECT -- adding end rows without transition
id,
startdt,
enddt,
'end'
FROM (
SELECT
id,
startdt,
enddt,
row_number()
OVER (PARTITION BY id
ORDER BY enddt desc) AS row_num
FROM date_test
) AS with_row_number
WHERE row_num = 1
) AS with_boundary -- data set containing start/end boundaries
) AS with_end -- data set where end date is propagated into the start row of the period
WHERE period_boundary = 'start'
ORDER BY id, startdt ASC;
Note that in your expected output, you had a row for 103 2013-05-01 2013-05-31, however its start date is 31 days apart from end date of the previous row, so this row should instead be merged with the previous row for id 103 according to your requirements.
So the output that I get looks like this:
id start end
100 2000-01-01 2000-02-29
100 2000-05-01 2000-06-30
100 2000-09-01 2000-10-31
101 2012-06-01 2012-07-31
102 2000-01-01 2000-01-31
103 2013-03-01 2013-05-31