Create membership timeseries from static table - sql

I have a table (members) which shows users and any additions or deletions to a specific membership and on what date as below
User  Membership  Change  Date
1     100         A       01/01/1900
1     101         A       01/01/1990
1     100         D       01/01/2000
2     100         A       01/12/1990
2     101         A       01/01/1991
2     101         D       01/12/1991
2     100         D       01/01/1993
3     100         A       01/01/2000
I'm looking to find cases where one user is in multiple memberships during overlapping time periods. I'm using the query below, but it pulls the wrong end dates when I run it:
With membership As
(Select user, membership, date as Start_Date,
LEAD (date, 1, '31/12/9999') OVER (PARTITION BY membership ORDER BY
date) AS End_Date
FROM
(Select *, LAG(change,1,-1) OVER(PARTITION BY membership ORDER BY
date) AS Previous_change
From members) withprevious
Where change != previous_change),
MemberTimeSeries AS
(Select *
From membership
Where Start_Date IN
(Select a.Date
From members a
join membership b
on a.user = b.user
and a.membership = b.membership
Where a.change = 'A')),
DupeIDs AS
(Select Distinct a.user, a.membership, cast(a.start_date as date) as
start_date, cast(a.end_date as date) as end_date
from membertimeseries a
join membertimeseries b
on a.user = b.user
and ((a.start_date >= b.start_date and a.start_date < b.end_date)
Or (a.end_date > b.start_date and a.end_date <= b.end_date)
OR (b.start_date >= a.start_date and b.start_date < a.end_date)
Or (b.end_date > a.start_date and b.end_date <= a.end_date)
I'm looking to see every user and membership combination, with its start and end dates, that overlaps with another membership for the same user during any part of its active period. If there is no deletion in the table, I want the end date to default to 31/12/9999.
For example, I want to see the below from my example:
User  Membership  Start_Date  End_Date
1     100         01/01/1900  01/01/2000
1     101         01/01/1990  31/12/9999
2     100         01/12/1990  01/01/1993
2     101         01/01/1991  01/12/1991

After reviewing your updated code, I decided it might be better to take a different approach.
You can combine the add and drop rows into ranges by first selecting the adds and then using an OUTER APPLY to look up the first following drop row with the same user and membership (if one exists).
If the above is wrapped up in a Common Table Expression (CTE), overlaps can then be identified by joining the ranges with themselves and limiting the join to those with the same user, overlapping date ranges, and excluding same row joins.
A standard test for overlapping date ranges is Start1 < End2 AND Start2 < End1. For exclusive end dates (as appears to be the case here), < is used. If the ranges were defined with inclusive end dates, <= would be used. Omitted end dates can be handled with an end-of-time default: ISNULL(EndDate, '9999-12-31').
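As a rough illustration of that overlap test outside SQL, here is a minimal Python sketch. The names ranges_overlap and END_OF_TIME are hypothetical, and None stands in for a missing end date in the same way ISNULL does above.

```python
from datetime import date

# Assumed stand-in for a missing end date (mirrors ISNULL(EndDate, '9999-12-31')).
END_OF_TIME = date(9999, 12, 31)

def ranges_overlap(start1, end1, start2, end2):
    """Overlap test for ranges with exclusive end dates: [start, end)."""
    end1 = end1 or END_OF_TIME
    end2 = end2 or END_OF_TIME
    return start1 < end2 and start2 < end1

# User 1: membership 100 (1900 to 2000) and membership 101 (1990, still open).
overlapping = ranges_overlap(date(1900, 1, 1), date(2000, 1, 1),
                             date(1990, 1, 1), None)

# Back-to-back ranges do not overlap when end dates are exclusive.
adjacent = ranges_overlap(date(1990, 1, 1), date(2000, 1, 1),
                          date(2000, 1, 1), None)
```

With inclusive end dates, the two comparisons would become <= instead.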
To omit self matches, the following uses ROW_NUMBER() to assign distinct row IDs. You could alternately check for "not (all values equal)".
;WITH Ranges AS (
    SELECT
        A.[User], A.Membership, A.Date AS StartDate, D.Date AS EndDate,
        ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS RowId
    FROM Memberships A
    OUTER APPLY (
        -- first drop row that follows this add for the same user and membership
        SELECT TOP 1 D.*
        FROM Memberships D
        WHERE D.Change = 'D'
        AND D.[User] = A.[User]
        AND D.Membership = A.Membership
        AND D.Date > A.Date
        ORDER BY D.Date
    ) D
    WHERE A.Change = 'A'
)
SELECT DISTINCT R.[User], R.Membership, R.StartDate, R.EndDate
FROM Ranges R
JOIN Ranges R2
    ON R2.[User] = R.[User]
    AND R2.RowId <> R.RowId
    -- open-ended ranges are treated as running to the end of time
    AND R2.StartDate < ISNULL(R.EndDate, '9999-12-31')
    AND R.StartDate < ISNULL(R2.EndDate, '9999-12-31')
ORDER BY R.[User], R.StartDate
In case of multiple overlaps, DISTINCT eliminates dups. An alternative would be to replace the JOIN with a WHERE EXISTS(...).
Results:
User  Membership  StartDate   EndDate
1     100         1900-01-01  2000-01-01
1     101         1990-01-01  null
2     100         1990-12-01  1993-01-01
2     101         1991-01-01  1991-12-01
5     100         2020-01-01  2020-02-01
5     102         2020-01-15  2020-02-15
5     101         2020-02-01  2020-03-01
See this db<>fiddle.

Related

SQL 30 day active user query

I have a table of users and how many events they fired on a given date:
DATE        USERID  EVENTS
2021-08-27  1       5
2021-07-25  1       7
2021-07-23  2       3
2021-07-20  3       9
2021-06-22  1       9
2021-05-05  1       4
2021-05-05  2       2
2021-05-05  3       6
2021-05-05  4       8
2021-05-05  5       1
I want to create a table showing number of active users for each date with active user being defined as someone who has fired an event on the given date or in any of the preceding 30 days.
DATE        ACTIVE_USERS
2021-08-27  1
2021-07-25  3
2021-07-23  2
2021-07-20  2
2021-06-22  1
2021-05-05  5
I tried the following query which returned only the users who were active on the specified date:
SELECT COUNT(DISTINCT USERID), DATE
FROM table
WHERE DATE >= (CURRENT_DATE() - interval '30 days')
GROUP BY 2 ORDER BY 2 DESC;
I also tried using a window function with ROWS BETWEEN, but it seems to end up with the same result:
SELECT
DATE,
SUM(ACTIVE_USERS) AS ACTIVE_USERS
FROM
(
SELECT
DATE,
CASE
WHEN SUM(EVENTS) OVER (PARTITION BY USERID ORDER BY DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) >= 1 THEN 1
ELSE 0
END AS ACTIVE_USERS
FROM table
)
GROUP BY 1
ORDER BY 1
I'm using SQL:ANSI on Snowflake. Any suggestions would be much appreciated.
This is tricky to do with window functions because count(distinct) is not permitted. You can use a self-join:
select t1.date, count(distinct t2.userid)
from table t1 join
     table t2
     on t2.date <= t1.date and
        t2.date > t1.date - interval '30 day'
group by t1.date;
However, that can be expensive. One solution is to "unpivot" the data. That is, do an incremental count per user of going "in" and "out" of active states and then do a cumulative sum:
with d as ( -- calculate the dates with "ins" and "outs"
      select userid, date, +1 as inc
      from table
      union all
      select userid, date + interval '30 day', -1 as inc
      from table
     ),
     d2 as ( -- accumulate to get the net actives per day
      select date, userid, sum(inc) as change_on_day,
             sum(sum(inc)) over (partition by userid order by date) as running_inc
      from d
      group by date, userid
     ),
     d3 as ( -- summarize into active periods; each period ends on the row where running_inc returns to 0
      select userid, min(date) as start_date, max(date) as end_date
      from (select d2.*,
                   -- number of completed periods before this row
                   sum(case when running_inc = 0 then 1 else 0 end) over (partition by userid order by date)
                       - case when running_inc = 0 then 1 else 0 end as active_period
            from d2
           ) d2
      group by userid, active_period
     )
select d.date, count(d3.userid)
from (select distinct date from table) d left join
     d3
     on d.date >= d3.start_date and d.date < d3.end_date
group by d.date;
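To make the "in"/"out" idea easier to follow, here is a minimal Python sketch of the same sweep, checked against the sample data from the question. The function name active_users_by_date and the window_days parameter are hypothetical; the sketch assumes, like the query above, that an event on date e keeps the user active on every date from e up to, but not including, e + 30 days.

```python
from collections import defaultdict
from datetime import date, timedelta

def active_users_by_date(events, window_days=30):
    """events: list of (date, user_id). Returns {date: active user count}."""
    # Per user: +1 on the event date ("in"), -1 when the window expires ("out").
    deltas = defaultdict(lambda: defaultdict(int))
    for d, user in events:
        deltas[user][d] += 1
        deltas[user][d + timedelta(days=window_days)] -= 1

    # Running sum per user; a period opens when the sum leaves 0
    # and closes when it returns to 0.
    periods = []
    for per_day in deltas.values():
        running, start = 0, None
        for d in sorted(per_day):
            previous = running
            running += per_day[d]
            if previous == 0 and running > 0:
                start = d
            elif previous > 0 and running == 0:
                periods.append((start, d))  # end date is exclusive

    # Periods of one user never overlap, so counting periods counts users.
    report_dates = sorted({d for d, _ in events})
    return {d: sum(1 for s, e in periods if s <= d < e) for d in report_dates}

events = [
    (date(2021, 8, 27), 1), (date(2021, 7, 25), 1), (date(2021, 7, 23), 2),
    (date(2021, 7, 20), 3), (date(2021, 6, 22), 1), (date(2021, 5, 5), 1),
    (date(2021, 5, 5), 2), (date(2021, 5, 5), 3), (date(2021, 5, 5), 4),
    (date(2021, 5, 5), 5),
]
result = active_users_by_date(events)
```

The final counting step is a plain nested loop for clarity; in SQL that step is the left join on the date range.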

MS-SQL how to add missing month in a table values

I have a table with the following entries,
ID  date          Frequency
1   '2012-04-30'  5
1   '2012-06-30'  4
1   '2012-07-31'  25
2   '2012-04-30'  7
2   '2012-05-31'  4
2   '2012-06-30'  1
2   '2012-07-31'  6
I need to add each missing month; the date that gets added should be the last date of that month, with a frequency value of 0.
The expected output is
ID  date          Frequency
1   '2012-04-30'  5
1   '2012-05-31'  0
1   '2012-06-30'  4
1   '2012-07-31'  25
2   '2012-04-30'  7
2   '2012-05-31'  4
2   '2012-06-30'  1
2   '2012-07-31'  6
I would suggest recursive CTEs:
with cte as (
select id, date, frequency,
lead(date) over (partition by id order by date) as next_date
from t
union all
select id, eomonth(date, 1), 0, next_date
from cte
where eomonth(date, 1) < dateadd(day, -1, next_date)
)
select id, date, frequency
from cte
order by id, date;
The anchor part of the CTE finds the next existing date for each row. The recursive part then keeps adding month-end rows until it reaches that date (adding none if no month is missing). The use of eomonth(date, 1) is just a handy way of getting the last day of the next month.
Here is a db<>fiddle.
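For readers who want to check the fill logic outside the database, here is a minimal Python sketch of the same idea: walk each row towards the next existing row of the same ID and emit a zero-frequency month-end row for every month in between. The helper names next_month_end and fill_missing_months are hypothetical, and the sketch assumes every stored date is already a month-end date.

```python
import calendar
from datetime import date

def next_month_end(d):
    """Last day of the month after d (the role eomonth(date, 1) plays in the query)."""
    year, month = (d.year + 1, 1) if d.month == 12 else (d.year, d.month + 1)
    return date(year, month, calendar.monthrange(year, month)[1])

def fill_missing_months(rows):
    """rows: list of (id, month_end_date, frequency). Gaps are filled with frequency 0."""
    ordered = sorted(rows)
    filled = []
    for i, (row_id, d, freq) in enumerate(ordered):
        filled.append((row_id, d, freq))
        following = ordered[i + 1] if i + 1 < len(ordered) else None
        if following is None or following[0] != row_id:
            continue  # last row for this id: nothing to fill towards
        step = next_month_end(d)
        while step < following[1]:
            filled.append((row_id, step, 0))
            step = next_month_end(step)
    return filled

rows = [
    (1, date(2012, 4, 30), 5), (1, date(2012, 6, 30), 4), (1, date(2012, 7, 31), 25),
    (2, date(2012, 4, 30), 7), (2, date(2012, 5, 31), 4),
    (2, date(2012, 6, 30), 1), (2, date(2012, 7, 31), 6),
]
filled = fill_missing_months(rows)
```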
If you have all dates in the table, you can also use cross join to generate the rows and then left join to bring in the existing data:
select i.id, d.date, coalesce(t.frequency, 0) as frequency
from (select distinct id from t) i cross join
(select distinct date from t) d left join
t
on i.id = t.id and d.date = t.date
order by i.id, d.date;
If you have a large amount of data, you can compare performance. This may be a case where a recursive CTE is faster than alternative methods.

Is there a way to find the latest date that is more than n days in SQL?

I'm trying to find which assets have not been borrowed in the last 90 days.
The logic would be something like
IF latest date of an asset returned_date > 90 days
more than 90 days
ELIF created_date > 90 days
more than 90 days
ELSE
not more than 90 days
How do I write all of that into a single query?
loan
loan_id asset_id returned_date
1 1 2019-12-14 12:00:00.000
2 1 2019-12-10 12:00:00.000
3 2 2020-11-10 12:00:00.000
asset
asset_id created_date
1 2019-12-05 12:00:00.000
2 2019-12-05 12:00:00.000
3 2019-12-05 12:00:00.000
If I understand correctly, this is just a not exists query. There are two conditions:
The asset was created at least 90 days ago.
There have been no returns in the last 90 days.
This would be:
select a.*
from asset a
where a.created_date < dateadd(day, -90, getdate()) and
      not exists (select 1
                  from loan l
                  where l.asset_id = a.asset_id and
                        l.returned_date >= dateadd(day, -90, getdate())
                 );
The following query returns only those assets that have not been borrowed for 90 days.
Notes:
If an asset has not been borrowed at all, its created date is used in the calculation (marked by ** in the code).
If an asset has been borrowed multiple times, the most recent return is used in the calculation (marked by *** in the code).
select * from(
    select
        a.asset_id,
        l.loan_id,
        isnull(l.returned_date, a.created_date) as returned_date, -- **
        rank() over(partition by a.asset_id order by l.returned_date desc) as rnk -- ***
    from asset a
    left join loan l on a.asset_id = l.asset_id
)x
where
    rnk = 1 -- ***
    and datediff(day, returned_date, getdate()) >= 90
You can use a CASE expression along with DATEADD (here dd stands for days) to categorize each asset as borrowed or not (Solution 1). If you only want to show one category, move the condition into a WHERE clause (Solution 2).
Solution 1
select
t1.asset_id,
t2.returned_date,
case when t2.returned_date > dateadd(dd,90,t1.created_date) then 'more than 90 days'
else 'not more than 90 days'
end as 'borrow window'
from asset t1
join loan t2
on t2.asset_id = t1.asset_id
Solution 2
select t1.*, t2.returned_date
from asset t1
join loan t2
on t2.asset_id = t1.asset_id
where t2.returned_date > dateadd(dd, 90, t1.created_date) -- only > 90

Query to remove continuous eligibility records

I need to get rid of the records whose eligibility is already covered by another record for the same memid. In the example below, I need the output to contain only the rows I have marked with Y. The table has memid, effdate and termdate; the Y prefix is only there to mark the records I need as output. How can we do this? Thanks.
   MEMID  EFFDATE     TERMDATE
Y  A1     2012-01-01  2078-12-31
   A1     2012-02-01  2078-12-31
Y  B1     2007-05-01  2008-12-31
Y  B1     2009-10-01  2010-04-30
Y  A2     1999-01-01  2078-12-31
   A2     2006-01-01  2011-04-28
   B2     1999-01-01  1999-10-01
Y  B2     1999-01-01  2000-09-30
Y  B2     2006-01-01  2006-01-01
Y  B2     2009-08-01  2078-12-31
Y  A3     2000-03-01  2009-01-31
   A3     2002-04-01  2009-01-31
   A3     2003-01-01  2006-06-30
   A3     2006-01-01  2009-01-31
Y  A3     2009-10-01  2010-07-31
Y  A3     2011-06-01  2012-09-30
   A3     2011-09-01  2012-09-30
Y  A3     2013-06-01  2078-12-31
   A3     2013-07-01  2078-12-31
   B3     1999-01-01  2008-11-30
Y  B3     1999-01-01  2078-12-31
   B3     2006-01-01  2008-11-30
Select all ranges for which no bigger covering range exists, using NOT EXISTS. Then remove duplicates with DISTINCT.
select distinct memid, effdate, termdate
from mytable
where not exists
(
select *
from mytable bigger
where bigger.memid = mytable.memid
and
(
(bigger.effdate <= mytable.effdate and bigger.termdate > mytable.termdate)
or
(bigger.effdate < mytable.effdate and bigger.termdate >= mytable.termdate)
)
);
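The same covering test can be sketched in Python to check it against a few rows of the sample data. The function name drop_covered_ranges is hypothetical, and dates are kept as ISO strings, which compare correctly as text. Note that, like the query, this only removes ranges fully covered by a bigger one; it does not merge ranges that merely overlap.

```python
def drop_covered_ranges(rows):
    """rows: iterable of (memid, effdate, termdate); returns the uncovered ranges."""
    unique = set(rows)  # plays the role of DISTINCT

    def is_covered(row):
        memid, eff, term = row
        # A bigger range starts no later and ends no earlier,
        # and is strictly wider on at least one side.
        return any(
            m == memid and ((e <= eff and t > term) or (e < eff and t >= term))
            for m, e, t in unique
        )

    return sorted(r for r in unique if not is_covered(r))

rows = [
    ("A1", "2012-01-01", "2078-12-31"),
    ("A1", "2012-02-01", "2078-12-31"),
    ("B2", "1999-01-01", "1999-10-01"),
    ("B2", "1999-01-01", "2000-09-30"),
    ("B2", "2006-01-01", "2006-01-01"),
]
kept = drop_covered_ranges(rows)
```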
This is a variation of the Gaps and Islands problem. Since you have tagged both MySQL and SQL-Server, and have not responded to my question asking you to clarify which, I will give a solution for both.
Your first step is to expand all your ranges into continuous data by joining with a numbers table. This will turn a single row into one row for every date in the range:
SQL Server
SELECT t.memid, Date = DATEADD(DAY, n.Number, t.EffDate)
FROM YourTable t
INNER JOIN Numbers n
ON n.Number BETWEEN 0 AND DATEDIFF(DAY, t.EffDate, t.TermDate);
MySQL
SELECT t.memid, DATE_ADD(t.EffDate, INTERVAL n.Number DAY) AS Date
FROM YourTable t
INNER JOIN Numbers n
ON n.Number BETWEEN 0 AND DATEDIFF(t.TermDate, t.EffDate);
This will turn this row:
memid EFFDATE TERMDATE
A3 2009-10-01 2009-10-05
Into
memid Date
A3 2009-10-01
A3 2009-10-02
A3 2009-10-03
A3 2009-10-04
A3 2009-10-05
If you don't have a numbers table, then you should probably create one. (In each of the SQL Fiddles below I have created a numbers table, so you can find ways of doing so there.)
Now you have your continuous ranges you can apply the appropriate gaps and islands solution.
If you are using SQL Server then you can use ranking functions to resolve this:
WITH ContinuousRange AS
( SELECT t.memid,
d.Date,
GroupingSet = DATEADD(DAY, -DENSE_RANK() OVER(PARTITION BY memid
ORDER BY d.Date), d.Date)
FROM T
INNER JOIN Numbers n
ON n.Number BETWEEN 0 AND DATEDIFF(DAY, t.EffDate, t.TermDate)
OUTER APPLY (SELECT Date = DATEADD(DAY, n.Number, t.EffDate)) d
)
SELECT cr.MemID,
EffDate = MIN(cr.Date),
TermDate = MAX(cr.Date)
FROM ContinuousRange cr
GROUP BY cr.MemID, cr.GroupingSet
ORDER BY cr.MemID, cr.GroupingSet;
Simplified Example on SQL Fiddle
This works on the basis that each number in a sequence minus its position in the sequence gives a constant for a continuous range, e.g.:
Sequence | OrderInSequence | (Sequence - OrderInSequence)
---------+-----------------+------------------------------
1 | 1 | 0
2 | 2 | 0
3 | 3 | 0
5 | 4 | 1
6 | 5 | 1
As you can see, where there is a gap in the sequence (between 3 and 5) the value in the 3rd column changes, this is how the column GroupingSet is calculated:
GroupingSet = DATEADD(DAY, -DENSE_RANK() OVER(PARTITION BY memid ORDER BY d.Date), d.Date)
You can then use this column to get the minimum and maximum value of each consecutive sequence (or island).
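The "value minus position" trick can be tried outside the database as well. Below is a minimal Python sketch using the sequence from the table above; the function name islands is hypothetical, and duplicates are removed first, which is roughly what DENSE_RANK achieves in the query.

```python
from itertools import groupby

def islands(values):
    """Return (first, last) for each run of consecutive integers in values."""
    ordered = sorted(set(values))
    result = []
    # value - position is constant within a run of consecutive values
    for _, group in groupby(enumerate(ordered, start=1), key=lambda p: p[1] - p[0]):
        run = [value for _, value in group]
        result.append((run[0], run[-1]))
    return result

found = islands([1, 2, 3, 5, 6])  # [(1, 3), (5, 6)]
```

For dates, the same grouping key would be the date minus its rank in days, as in the GroupingSet expression.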
Since MySQL (before version 8.0) does not have ranking functions, you will need to use user-defined variables to emulate them:
SELECT MemID,
       MIN(Date) AS EffDate,
       MAX(Date) AS TermDate
FROM ( SELECT t.memid,
              @i:= CASE WHEN t.MemID = @m
                        AND DATE_ADD(t.EffDate, INTERVAL n.Number DAY)
                            <= DATE_ADD(@d, INTERVAL 1 DAY)
                        THEN @i
                        ELSE @i + 1
                   END AS GroupingSet,
              @m:= t.memid,
              @d:= DATE_ADD(t.EffDate, INTERVAL n.Number DAY) AS Date
       FROM t
       INNER JOIN Numbers n
       ON n.Number BETWEEN 0 AND DATEDIFF(t.TermDate, t.EffDate)
       CROSS JOIN (SELECT @m:= '', @i:= 0, @d:= NULL) i
       ORDER BY t.memid, DATE_ADD(t.EffDate, INTERVAL n.Number DAY)
     ) t
GROUP BY MemID, GroupingSet;
Simplified Example on SQL Fiddle
Getting the grouping set here is a more iterative process. The data is ordered by MemID and date, and at each row the date and the memid are stored in the variables @d and @m respectively. If the memid in the new row is the same as @m (i.e. still in the same group of memids) and the date in the new row is the same as @d or one day ahead of it, the grouping set does not increment. If it is a new memid, or the date is more than one day ahead of the previous date, it is a new 'island' and the grouping set increments.
EDIT
To help with memory issues, you can handle different scenarios differently. The first step is to remove any records contained entirely within another, e.g.
MemID EffDate TermDate
A1 2012-01-01 2078-12-31
A1 2012-02-01 2078-12-31
With these two, the second row is not required since its date range is contained entirely within the first. So this can be removed (This is done in the CTE called Filtered in the below query).
The second way to help is to remove the range expansion where it is not required, so with the above example we are just left with:
MemID EffDate TermDate
A1 2012-01-01 2078-12-31
And this is the only row for A1, it is therefore not necessary to expand this out into all the days between the EffDate and TermDate, and then take a minimum and maximum, we can simply use the EffDate and TermDate as they are. This is the query below the UNION ALL in the query below.
WITH Filtered AS
( SELECT MemID, EffDate, TermDate
FROM T
WHERE NOT EXISTS
( SELECT 1
FROM T T2
WHERE T.MemID = T2.MemID
AND T2.EffDate < T.EffDate
AND T2.TermDate >= T.TermDate
)
), ContinuousRange AS
( SELECT t.memid,
d.Date,
GroupingSet = DATEADD(DAY, -DENSE_RANK() OVER(PARTITION BY memid
ORDER BY d.Date), d.Date)
FROM Filtered T
INNER JOIN Numbers n
ON n.Number BETWEEN 0 AND DATEDIFF(DAY, t.EffDate, t.TermDate)
OUTER APPLY (SELECT Date = DATEADD(DAY, n.Number, t.EffDate)) d
WHERE EXISTS
( SELECT 1
FROM Filtered T2
WHERE T.MemID = T2.MemID
AND ( (T2.EffDate > T.EffDate AND T2.EffDate < T.TermDate)
OR (T2.TermDate > T.EffDate AND T2.TermDate < T.TermDate)
)
)
)
SELECT cr.MemID,
EffDate = MIN(cr.Date),
TermDate = MAX(cr.Date), 1
FROM ContinuousRange cr
GROUP BY cr.MemID, cr.GroupingSet
UNION ALL
SELECT T.MemID, T.EffDate, T.TermDate, 0
FROM Filtered T
WHERE NOT EXISTS
( SELECT 1
FROM Filtered T2
WHERE T.MemID = T2.MemID
AND ( (T2.EffDate > T.EffDate AND T2.EffDate < T.TermDate)
OR (T2.TermDate > T.EffDate AND T2.TermDate < T.TermDate)
)
)
ORDER BY MemID, EffDate;
Example on SQL Fiddle

Efficient join with a "correlated" subquery

Given three tables Dates(date aDate, doUse boolean), Days(rangeId int, day int, qty int) and Range(rangeId int, startDate date) in Oracle
I want to join these so that Range is joined with Dates from aDate = startDate where doUse = 1 with each day in Days.
Given a single range it might be done something like this
SELECT rangeId, aDate, CASE WHEN doUse = 1 THEN qty ELSE 0 END AS qty
FROM (
SELECT aDate, doUse, SUM(doUse) OVER (ORDER BY aDate) day
FROM Dates
WHERE aDate >= :startDate
) INNER JOIN (
SELECT rangeId, day,qty
FROM Days
WHERE rangeId = :rangeId
) USING (day)
ORDER BY day ASC
What I want to do is make a query for all ranges in Range, not just one.
The problem is that the join value "day" depends on the range's startDate, which gives me some trouble in formulating a query.
Keep in mind that the Dates table is pretty huge so I would like to avoid calculating the day value from the first date in the table, while each Range Days shouldn't be more than a 100 days or so.
Edit: Sample data
Dates Days
aDate doUse rangeId day qty
2008-01-01 1 1 1 1
2008-01-02 1 1 2 10
2008-01-03 0 1 3 8
2008-01-04 1 2 1 2
2008-01-05 1 2 2 5
Ranges
rangeId startDate
1 2008-01-02
2 2008-01-03
Result
rangeId aDate qty
1 2008-01-02 1
1 2008-01-03 0
1 2008-01-04 10
1 2008-01-05 8
2 2008-01-03 0
2 2008-01-04 2
2 2008-01-05 5
Try this:
SELECT rt.rangeId, aDate, CASE WHEN doUse = 1 THEN qty ELSE 0 END AS qty
FROM (
SELECT *
FROM (
SELECT r.*, t.*, SUM(doUse) OVER (PARTITION BY rangeId ORDER BY aDate) AS span
FROM (
SELECT r.rangeId, startDate, MAX(day) AS dm
FROM Range r, Days d
WHERE d.rangeid = r.rangeid
GROUP BY
r.rangeId, startDate
) r, Dates t
WHERE t.adate >= startDate
ORDER BY
rangeId, t.adate
)
WHERE
span <= dm
) rt, Days d
WHERE d.rangeId = rt.rangeID
AND d.day = GREATEST(rt.span, 1)
P. S. It seems to me that the only point to keep all these Dates in the database is to get a continuous calendar with holidays marked.
You may generate a calendar of arbitrary length in Oracle using following construction:
SELECT :startDate + ROWNUM
FROM dual
CONNECT BY
1 = 1
WHERE rownum < :length
and keep only holidays in Dates. A simple join will show you which Dates are holidays and which are not.
Ok, so maybe I've found a way. Something like this:
SELECT irangeId, aDate + sum(case when doUse = 1 then 0 else 1 end) over (partition by rangeId order by aDate) as aDate, qty
FROM Days INNER JOIN (
select irangeId, startDate + day - 1 as aDate, qty
from Range inner join Days using (irangeid)
) USING (aDate)
Now I just need a way to fill in the missing dates...
Edit: Nah, this way means that I'll miss the doUse value of the last dates...