Converting PostgreSQL recursive CTE to SQL Server

I'm having trouble adapting some recursive CTE code from PostgreSQL to SQL Server, from the book "Fighting Churn with Data".
This is the working PostgreSQL code:
with recursive
active_period_params as (
select interval '30 days' as allowed_gap,
'2021-09-30'::date as calc_date
),
active as (
-- anchor
select distinct account_id, min(start_date) as start_date
from subscription inner join active_period_params
on start_date <= calc_date
and (end_date > calc_date or end_date is null)
group by account_id
UNION
-- recursive
select s.account_id, s.start_date
from subscription s
cross join active_period_params
inner join active e on s.account_id=e.account_id
and s.start_date < e.start_date
and s.end_date >= (e.start_date-allowed_gap)::date
)
select account_id, min(start_date) as start_date
from active
group by account_id
This is my attempt at converting it to SQL Server. It gets stuck in a loop; I believe the issue has to do with the UNION ALL required by SQL Server.
with
active_period_params as (
select 30 as allowed_gap,
cast('2021-09-30' as date) as calc_date
),
active as (
-- anchor
select distinct account_id, min(start_date) as start_date
from subscription inner join active_period_params
on start_date <= calc_date
and (end_date > calc_date or end_date is null)
group by account_id
UNION ALL
-- recursive
select s.account_id, s.start_date
from subscription s
cross join active_period_params
inner join active e on s.account_id=e.account_id
and s.start_date < e.start_date
and s.end_date >= dateadd(day, -allowed_gap, e.start_date)
)
select account_id, min(start_date) as start_date
from active
group by account_id
The subscription table is a list of subscriptions belonging to customers. A customer can have multiple subscriptions with overlapping dates or gaps between dates. A null end_date means the subscription is currently active and has no defined end_date. Example data for a single customer (account_id = 15) is below:
subscription
---------------------------------------------------
| id | account_id | start_date | end_date |
---------------------------------------------------
| 6 | 15 | 01/06/2021 | null |
| 5 | 15 | 01/01/2021 | null |
| 4 | 15 | 01/06/2020 | 01/02/2021 |
| 3 | 15 | 01/04/2020 | 15/05/2020 |
| 2 | 15 | 01/03/2020 | 15/05/2020 |
| 1 | 15 | 01/06/2019 | 01/01/2020 |
Expected query result (as produced by PostgreSQL code):
------------------------------
| account_id | start_date |
------------------------------
| 15 | 01/03/2020 |
Issue:
The SQL Server code above gets stuck in a loop and doesn't produce a result.
Description of the PostgreSQL code:
1. The anchor block finds subs that are active as at the calc_date (30/09/2021) (ids 5 & 6) and returns the min start_date (01/01/2021).
2. The recursive block then looks for any earlier subs that existed within the allowed_gap, i.e. within 30 days prior to the min start_date found in 1). Id 4 meets this criterion, so the new min start_date is 01/06/2020.
3. Recursion repeats and finds two subs within the allowed_gap (01/06/2020 - 30 days). Of these subs (ids 2 & 3), the new min start_date is 01/03/2020.
4. Recursion fails to find an earlier sub within the allowed_gap (01/03/2020 - 30 days).
5. The query returns a start_date of 01/03/2020 for account_id 15.
Any help appreciated!

It seems the issue is related to the way SQL Server deals with recursive CTEs.
This is a type of gaps-and-islands problem, and does not actually require recursion.
There are a number of solutions; here is one. Given your requirement, there may be more efficient methods, but this should get you started.
1. Using LAG, we flag rows that start a new group, i.e. rows whose previous subscription (ordered by start_date) ended more than the allowed gap before the current start_date.
2. We use a running COUNT of those flags to give each consecutive set of rows a group ID.
3. We group by that ID and take the minimum start_date, filtering out non-qualifying groups (those with no subscription active on the calc date).
4. We group again to get the minimum per account.
DECLARE @allowed_gap int = 30,
        @calc_date datetime = cast('2021-09-30' as date);
WITH PrevValues AS (
SELECT *,
IsStart = CASE WHEN ISNULL(LAG(end_date) OVER (PARTITION BY account_id
                                               ORDER BY start_date), '2099-01-01')
                    < DATEADD(day, -@allowed_gap, start_date)
               THEN 1 END
FROM subscription
),
Groups AS (
SELECT *,
GroupId = COUNT(IsStart) OVER (PARTITION BY account_id
ORDER BY start_date ROWS UNBOUNDED PRECEDING)
FROM PrevValues
),
ByGroup AS (
SELECT
account_id,
GroupId,
start_date = MIN(start_date)
FROM Groups
GROUP BY account_id, GroupId
HAVING COUNT(CASE WHEN start_date <= @calc_date
                   and (end_date > @calc_date or end_date is null) THEN 1 END) > 0
)
SELECT
account_id,
start_date = MIN(start_date)
FROM ByGroup
GROUP BY account_id;
db<>fiddle
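If you want to reproduce this locally rather than via the fiddle, a minimal test harness built from the sample rows in the question might look like this (the column types are assumptions):
CREATE TABLE subscription (
    id         int  NOT NULL,
    account_id int  NOT NULL,
    start_date date NOT NULL,
    end_date   date NULL
);

-- sample rows for account_id 15 from the question (dates converted from DD/MM/YYYY)
INSERT INTO subscription (id, account_id, start_date, end_date) VALUES
(1, 15, '2019-06-01', '2020-01-01'),
(2, 15, '2020-03-01', '2020-05-15'),
(3, 15, '2020-04-01', '2020-05-15'),
(4, 15, '2020-06-01', '2021-02-01'),
(5, 15, '2021-01-01', NULL),
(6, 15, '2021-06-01', NULL);
Running the query above against this table should return account_id 15 with a start_date of 2020-03-01 (01/03/2020 in the question's date format), matching the PostgreSQL result.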

Related

Finding gaps between date ranges spanning records

I'm trying to write a query where I can find any gap in the date ranges for a given ID when passing in two dates.
EDIT: I need to know if a whole gap or part of a gap exists in my date range.
I have data in this format:
Example 1:
| ID | START_DATE | END_DATE |
|----|------------|------------|
| 1 | 01/01/2019 | 30/09/2019 |
| 1 | 01/03/2020 | (null) |
Example 2:
| ID | START_DATE | END_DATE |
|----|------------|------------|
| 2 | 01/01/2019 | 30/09/2019 |
| 2 | 01/10/2019 | 01/12/2019 |
| 2 | 02/12/2019 | (null) |
NB. A null end date essentially means "still active up to current day".
E.g. Example 1 has a gap of 152 days between 30/09/2019 and 01/03/2020. If I queried in the range of 05/05/2019 - 01/09/2019 there's no gap in that range. Whereas if I'm looking at the date range 05/05/2019 - 02/10/2019 there's a single day gap in that range.
For what it's worth, I don't actually care how many days gap, just whether there is one or not.
I've tried doing something like this but it doesn't work when my date falls into a gap:
SELECT SUM(START_DATE - PREV_END_DATE - 1)
FROM
(
SELECT ID, START_DATE, END_DATE, LAG(END_DATE) OVER (ORDER BY START_DATE) AS PREV_END_DATE
FROM TBL
WHERE ID = X_ID
)
WHERE START_DATE >= Y_FIRST_DATE
AND START_DATE <= Z_SECOND_DATE;
X_ID, Y_FIRST_DATE, and Z_SECOND_DATE are just any different ID or date range I might want to pass in.
How could I go about this?
Another option is to test for the EXISTence of gaps by INTERSECTing two date sets built with the SELECT .. FROM dual CONNECT BY LEVEL <= syntax: one generates every date between the passed-in bounds, the other every date covered by the ranges stored in the table:
SELECT CASE WHEN
SUM( 1 + LEAST(Z_SECOND_DATE,NVL(END_DATE,TRUNC(SYSDATE)))
- GREATEST(Y_FIRST_DATE,START_DATE) ) = Z_SECOND_DATE - Y_FIRST_DATE + 1 THEN
'NO Gap'
ELSE
'Gap Exists'
END "gap?"
FROM TBL t
WHERE ID = X_ID
AND EXISTS ( SELECT Y_FIRST_DATE + LEVEL - 1
FROM dual
CONNECT BY LEVEL <= Z_SECOND_DATE - Y_FIRST_DATE + 1
INTERSECT
SELECT t.START_DATE + LEVEL - 1
FROM dual
CONNECT BY LEVEL <= NVL(t.END_DATE,TRUNC(SYSDATE))- t.START_DATE + 1
)
START_DATE values are assumed to be non-null based on the sample data.
Demo
This is another variation of the islands-and-gaps problem that pops up a lot here. I think this fits well with Oracle's pattern matching functionality. Take this example:
WITH tbl AS
(
SELECT 1 AS ID, to_date('01/01/2019', 'DD/MM/YYYY') AS START_DATE, to_date('30/09/2019', 'DD/MM/YYYY') AS END_DATE FROM DUAL
UNION ALL
SELECT 1 AS ID, to_date('01/03/2020', 'DD/MM/YYYY') AS START_DATE, NULL AS END_DATE FROM DUAL
UNION ALL
SELECT 2 AS ID, to_date('01/01/2019', 'DD/MM/YYYY') AS START_DATE, to_date('30/09/2019', 'DD/MM/YYYY') AS END_DATE FROM DUAL
UNION ALL
SELECT 2 AS ID, to_date('01/10/2019', 'DD/MM/YYYY') AS START_DATE, to_date('01/12/2019', 'DD/MM/YYYY') AS END_DATE FROM DUAL
UNION ALL
SELECT 2 AS ID, to_date('02/12/2019', 'DD/MM/YYYY') AS START_DATE, NULL AS END_DATE FROM DUAL
)
SELECT *
FROM tbl
MATCH_RECOGNIZE(ORDER BY ID, start_date
MEASURES b.id AS ID,
a.end_date+1 AS GAP_START,
b.start_date-1 AS GAP_END
PATTERN (A B+)
DEFINE B AS start_date > PREV(end_date)+1 AND ID = PREV(ID));
I know it looks long, but most of it is creating the WITH clause. The pattern matching allows you to define what a gap is and pull the information accordingly. Notice that in order to have a gap, the start date must be greater than the previous end date + 1, within the same ID.
To enhance this to answer your updated/edited question, just add this line of code to the end:
WHERE GREATEST(gap_start, TO_DATE('15/09/2019', 'DD/MM/YYYY' /*Y_FIRST_DATE*/)) <= LEAST(gap_end, to_date('15/10/2019', 'DD/MM/YYYY')/*Z_SECOND_DATE*/)
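To be explicit about the placement: the filter goes after the closing parenthesis of MATCH_RECOGNIZE and before the semicolon, so the tail of the statement would look like this (using the example bounds from the line above):
  PATTERN (A B+)
  DEFINE B AS start_date > PREV(end_date)+1 AND ID = PREV(ID))
WHERE GREATEST(gap_start, TO_DATE('15/09/2019', 'DD/MM/YYYY') /*Y_FIRST_DATE*/)
   <= LEAST(gap_end, TO_DATE('15/10/2019', 'DD/MM/YYYY') /*Z_SECOND_DATE*/);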
You can split the date range you are passing in into individual dates and then compare them with the date ranges in your table as follows:
SELECT
CASE WHEN SUM(CASE WHEN T.ID IS NULL THEN 1 END) > 0
THEN 'THERE IS GAP'
ELSE 'THERE IS NO GAP'
END AS RESULT_
FROM ( SELECT P_IN_FROM_DATE + LEVEL - 1 AS CUST_DATES
FROM DUAL
CONNECT BY LEVEL <= P_IN_TO_DATE - P_IN_FROM_DATE + 1
) CUST_TBL
LEFT JOIN TBL T
ON CUST_TBL.CUST_DATES BETWEEN T.START_DATE AND T.END_DATE
OR ( CUST_TBL.CUST_DATES >= T.START_DATE AND T.END_DATE IS NULL )
I would suggest finding the maximum end date before the current record -- based on the start date.
That would be:
select t.*
from (select t.*,
max(end_date) over (order by start_date
rows between unbounded preceding and 1 preceding
) as max_prev_end_date
from tbl t
where start_date <= :input_end_date and
end_date >= :input_start_date
) t
where max_prev_end_date < start_date;
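Since the question only needs to know whether a gap exists, not how long it is, the row list above could be reduced to a single flag by wrapping the same query in a count (a sketch that reuses the query as written, bind names included):
select case when count(*) > 0 then 'Gap Exists' else 'No Gap' end as has_gap
from (select t.*
      from (select t.*,
                   max(end_date) over (order by start_date
                                       rows between unbounded preceding and 1 preceding
                                      ) as max_prev_end_date
            from tbl t
            where start_date <= :input_end_date and
                  end_date >= :input_start_date
           ) t
      where max_prev_end_date < start_date
     );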

Get detailed days between two dates (MySQL query)

I have data like this:
id | start_date | end_date
----------------------------
1 | 16-09-2019 | 22-12-2019
I want to get the following results:
id | month | year | days
------------------------
1 | 09 | 2019 | 15
1 | 10 | 2019 | 31
1 | 11 | 2019 | 30
1 | 12 | 2019 | 22
Is there a way to get that result?
This is what you want to do:
SELECT id,
       EXTRACT(MONTH FROM start_date) AS month,
       EXTRACT(YEAR FROM start_date) AS year,
       DATEDIFF(end_date, start_date) AS days
FROM tbl
You can use the MONTH(), YEAR() and DATEDIFF() functions:
SELECT id, MONTH(start_date) AS month, YEAR(start_date) AS year, DATEDIFF(end_date, start_date) AS days FROM table_name
One way is to create a Calendar table and use that.
select month,year, count(*)
from Calendar
where db_date between '2019-09-16'
and '2019-12-22'
group by month,year
CHECK DEMO HERE
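The answer above does not show how the Calendar table itself is built; a minimal sketch for MySQL 8+ (the table and column names db_date, month and year follow the query above, the date range and everything else are assumptions) could be:
-- build a one-year calendar covering the question's range
CREATE TABLE Calendar (db_date DATE PRIMARY KEY, month INT, year INT);

INSERT INTO Calendar (db_date, month, year)
WITH RECURSIVE d AS (
  SELECT CAST('2019-01-01' AS DATE) AS db_date
  UNION ALL
  SELECT db_date + INTERVAL 1 DAY FROM d WHERE db_date < '2019-12-31'
)
SELECT db_date, MONTH(db_date), YEAR(db_date) FROM d;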
You can also use a recursive CTE to achieve the same result.
You can use a recursive CTE and aggregation:
with recursive cte as (
select id, start_date, end_date
from t
union all
select id, start_date + interval 1 day, end_date
from cte
where start_date < end_date
)
select id, year(start_date), month(start_date), count(*) as days
from cte
group by id, year(start_date), month(start_date);
Here is a db<>fiddle.
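As a side note, MySQL limits recursive CTEs via cte_max_recursion_depth (1000 iterations by default), so for date spans longer than about 1000 days the session limit would need to be raised first, for example:
SET SESSION cte_max_recursion_depth = 100000;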

Get last known record per month in BigQuery

An account balance collection that shows the account balance of a customer on a given day:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 1 | -300 | 2019-10-11 |
| 1 | -200 | 2019-10-10 |
| 1 | 0 | 2019-10-09 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Notice that customer #2 had no updates to his account balance in October.
I want to get the last account balance per customer per month. If there has been no account balance update for a customer in a given month, the last known account balance should be transferred to the current month. The result should look like that:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 2 | 200 | 2019-10-10 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Since the account balance of customer #2 was not updated in October but in September, we create a copy of the row from September changing the date to October. Any ideas how to achieve this in BigQuery?
Below is for BigQuery Standard SQL
#standardSQL
WITH customers AS (
SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
SELECT month FROM (
SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
IFNULL(value, LEAD(value) OVER(win)) value,
IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table`
GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
If applied to the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 customer_id, -500 value, DATE '2019-10-12' timestamp UNION ALL
SELECT 1, -300, '2019-10-11' UNION ALL
SELECT 1, -200, '2019-10-10' UNION ALL
SELECT 2, 200, '2019-09-10' UNION ALL
SELECT 2, 100, '2019-08-11' UNION ALL
SELECT 2, 50, '2019-07-12' UNION ALL
SELECT 1, 600, '2019-09-02'
), customers AS (
SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
SELECT month FROM (
SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
IFNULL(value, LEAD(value) OVER(win)) value,
IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table`
GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
-- ORDER BY month DESC, customer_id
The result is:
Row | customer_id | value | timestamp
1   | 1           | -500  | 2019-10-12
2   | 2           | 200   | 2019-10-10
3   | 1           | 600   | 2019-09-02
4   | 2           | 200   | 2019-09-10
5   | 1           | null  | null
6   | 2           | 100   | 2019-08-11
7   | 1           | null  | null
8   | 2           | 50    | 2019-07-12
The following query should mostly answer your question by creating a 'month-end' record for each customer for every month and getting the most recent balance:
with
-- Generate a set of months
month_begins as (
select dt from unnest(generate_date_array('2019-01-01','2019-12-01', interval 1 month)) dt
),
-- Get the month ends
month_ends as (
select date_sub(date_add(dt, interval 1 month), interval 1 day) as month_end_date from month_begins
),
-- Cross Join and group so we get 1 customer record for every month to account for
-- situations where customer doesn't change balance in a month
user_month_ends as (
select
customer_id,
month_end_date
from `project.dataset.table`
cross join month_ends
group by 1,2
),
-- Fan out so for each month end, you get all balances prior to month end for each customer
values_prior_to_month_end as (
select
customer_id,
value,
timestamp,
month_end_date
from `project.dataset.table`
inner join user_month_ends using(customer_id)
where timestamp <= month_end_date
),
-- Order by most recent balance before month end, even if it was more than 1+ months ago
ordered as (
select
*,
row_number() over (partition by customer_id, month_end_date order by timestamp desc) as my_row
from values_prior_to_month_end
),
-- Finally, select only the most recent record for each customer per month
final as (
select
* except(my_row)
from ordered
where my_row = 1
)
select * from final
order by customer_id, month_end_date desc
A few caveats:
I did not order results to match your desired result set, and I also kept a month-end date to illustrate the concept. You can easily change the ordering and exclude unneeded fields.
In the month_begins CTE, I set a range of months into the future, so your result set will contain the most recent balance for 'future' months. To make this a bit prettier, consider changing '2019-12-01' to current_date() so your query always runs through the end of the current month (see the sketch after this list).
Your timestamp field looks to be dates, so I used date logic, but you should be able to apply the same principles to use timestamp logic if your underlying fields are actual timestamps.
In your result set, I'm not sure why your 2nd row (customer 2) would have a timestamp of '2019-10-10'; that seems arbitrary, as customer 2 has no second balance record.
I purposefully split the logic into several CTEs so I could comment on each step easier, you could definitely perform several steps in the same code block for a more condensed query.
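Applying that month_begins change from the caveat above, the body of the CTE becomes the following (runnable on its own in BigQuery to inspect the generated series):
select dt
from unnest(generate_date_array('2019-01-01', current_date(), interval 1 month)) dt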

GROUP BY ignore groups containing NULL value and fetch recent record by date from each group

I have a user_details table as below.
id user_id start_date end_date
1 55 5-1-2017 NULL
2 55 3-1-2017 4-30-2017
3 66 1-1-2018 1-31-2018
4 66 2-1-2018 4-12-2018
5 77 11-1-2016 11-30-2016
6 77 12-1-2016 NULL
7 99 8-1-2016 1-31-2017
8 99 7-1-2016 7-31-2016
I have to fetch the latest record by start_date for each user but fetch only those users having end_date set for all records of that user.
The output should be as below:
id user_id start_date end_date
4 66 2-1-2018 4-12-2018
7 99 8-1-2016 1-31-2017
How can I achieve this result?
You can use DISTINCT ON and an ORDER BY clause to get the row with the latest start_date per group.
Then eliminate the results with end_date IS NULL.
SELECT id, user_id, start_date, end_date
FROM (SELECT DISTINCT ON (user_id)
id, user_id, start_date, end_date
FROM user_detail
ORDER BY user_id, start_date DESC, end_date, id) AS q
WHERE end_date IS NOT NULL;
One approach is to aggregate by user_id and then identify which users have a non-NULL end_date on their latest record. We use a CTE to find the max start_date value for each user. In the HAVING clause we assert that for the row with the latest starting date, the end_date is also not NULL.
WITH cte AS (
SELECT id, user_id, start_date, end_date,
MAX(start_date) OVER (PARTITION BY user_id) max_start_date
FROM yourTable
)
SELECT
user_id,
MIN(start_date) AS start_date,
MAX(end_date) AS end_date
FROM cte
GROUP BY
user_id
HAVING
COUNT(CASE WHEN start_date = max_start_date AND
end_date IS NOT NULL THEN 1 END) > 0;
Demo
With NOT EXISTS:
select u.*
from user_details u
where not exists (
select 1 from user_details
where user_id = u.user_id and (start_date > u.start_date or end_date is null)
)
See the demo.
Results:
| id | user_id | start_date | end_date |
| --- | ------- | ---------- | --------- |
| 4 | 66 | 2018-02-01 | 2018-04-12|
| 7 | 99 | 2016-08-01 | 2017-01-31|
DISTINCT ON with the right index might be the most efficient method. But that index is quite specific: (user_id, start_date DESC, end_date, id). The following should have similar performance but with a simpler index:
select ud.*
from user_details ud
where ud.id = (select ud2.id
from user_details ud2
where ud2.user_id = ud.user_id
order by ud2.start_date desc
limit 1
) and
ud.end_date is not null;
For this, you want an index on user_details(user_id, start_date desc, id).
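A sketch of that index (the index name is just an example):
create index user_details_user_start_idx on user_details (user_id, start_date desc, id);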

SQL Server - find absence date occurrences [duplicate]

This question already has an answer here: SQL: Gaps and Islands, Grouped dates
I have the following dataset:
Here is a script for this data:
;with dataset AS (
select 'EMP01' AS EMP_ID,CAST('2018-01-01' AS DATE) AS PERIOD_START,CAST('2018-01-31' AS DATE) AS PERIOD_END,CAST('2018-01-07' AS DATE) AS CUT_DATE
UNION
select 'EMP01' AS EMP_ID,CAST('2018-01-01' AS DATE) AS PERIOD_START,CAST('2018-01-31' AS DATE) AS PERIOD_END,CAST('2018-01-15' AS DATE) AS CUT_DATE
UNION
select 'EMP02' AS EMP_ID,CAST('2018-01-01' AS DATE) AS PERIOD_START,CAST('2018-01-31' AS DATE) AS PERIOD_END,CAST('2018-01-09' AS DATE) AS CUT_DATE
)
select *
from dataset
I need to divide these periods (PERIOD_START and PERIOD_END) by CUT_DATE, excluding the cut dates from those periods. The number of cut dates could be anything (3, 5, 8, etc.).
The expected result for the dataset above is:
If your version of SQL Server supports LAG, you can use this.
SELECT EMPLOYEE_ID,
ITEM_TYPE,
MIN(APPLY_DATE) AS STARTDATE,
MAX(APPLY_DATE) AS ENDDATE
FROM
(SELECT T.*,
SUM(CASE WHEN PREV_TYPE=ITEM_TYPE THEN 0 ELSE 1 END)
OVER(PARTITION BY EMPLOYEE_ID ORDER BY APPLY_DATE) AS GRP
FROM (SELECT D.*,
LAG(ITEM_TYPE) OVER(PARTITION BY EMPLOYEE_ID ORDER BY APPLY_DATE) AS PREV_TYPE
FROM DATA D
) T
) T
WHERE ITEM_TYPE IN ('Sickness','Vacation')
GROUP BY EMPLOYEE_ID,ITEM_TYPE,GRP
The logic is to get the previous row's item_type (based on ascending order of apply_date) and compare it with the current row's value. If they are equal, they belong to the same group; else you start a new group. This is done in the sum window function. After groups are assigned, you just need to get the max and min date for each employee_id, item_type and group.
Sample Demo
You would use the LAG function.
If you order by something, the LAG function gives the previous value;
a full description can be found at: http://www.sqlservercentral.com/articles/T-SQL/106783/
Take a look at vkp's answer for a full query
This is another way, if LAG is supported.
Rextester Sample
with tbl as
(select d.*
,case when (item_type = lag(item_type) over (partition by employee_id order by apply_date))
then 0
else 1
end grp_tmp
from DATA2 d
where
item_type <> 'Worked'
)
,tbl2 as
(select t.*
,sum(grp_tmp) over (order by employee_id,apply_date
rows between unbounded preceding and current row
)
as grp
from tbl t
)
select
EMPLOYEE_ID
,ITEM_TYPE
,(CONVERT(VARCHAR(24),min(apply_date),103)
+' - '
+CONVERT(VARCHAR(24),max(apply_date),103)
) as range
from tbl2
group by EMPLOYEE_ID,
ITEM_TYPE
,grp
order by
employee_id
,min(apply_date);
Output
+-------------+-----------+-------------------------+
| EMPLOYEE_ID | ITEM_TYPE | range |
+-------------+-----------+-------------------------+
| 1 | Sickness | 23/05/2017 - 24/05/2017 |
| 1 | Vacation | 26/05/2017 - 29/05/2017 |
| 1 | Sickness | 01/06/2017 - 01/06/2017 |
| 2 | Sickness | 25/05/2017 - 30/05/2017 |
+-------------+-----------+-------------------------+