Date filtering in SQL - sql

The table below has two columns: a unique identifier (EMID) and a date. I am trying to build a new column of episodes, where a new episode is triggered when there is a gap of 3 months or more between consecutive dates. This should be done separately for each unique EMID. In the attached table, the EMID ending in 98 would have only 1 episode, since there are no intervals >2 months between consecutive rows in the date column. However, the EMID ending in 03 would have 2 episodes, as there is almost a 3 year gap between rows 12 and 13. I have tried the following code, which doesn't work.
Table:
SELECT TOP (1000) [EMID],[Date]
CASE
WHEN DATEDIFF(month, Date, LEAD Date) <3
THEN "1"
ELSE IF DATEDIFF(month, Date, LEAD Date) BETWEEN 3 AND 5
THEN "2"
ELSE "3"
END episode
FROM [res_treatment_escalation].[dbo].[cspine42920a]
EDIT: Using Microsoft SQL Server Management Studio.
EDIT 2: I have made some progress but the output is not exactly what I am looking for. Here is the query I used:
SELECT TOP (1000) [EMID],[visit_date_01],
CASE
WHEN DATEDIFF(DAY, visit_date_01, LAG(visit_date_01,1,getdate()) OVER (partition by EMID order by EMID)) <= 90 THEN '1'
WHEN DATEDIFF(DAY, visit_date_01, LAG(visit_date_01,1,getdate()) OVER (PARTITION BY EMID ORDER BY EMID)) BETWEEN 90 AND 179 THEN '2'
WHEN DATEDIFF(DAY, visit_date_01, LAG(visit_date_01,1,getdate()) OVER (PARTITION BY EMID order by EMID)) > 180 THEN '3'
END AS EPISODE
FROM [res_treatment_escalation].[dbo].['c-spine_full_dataset_4#29#20_wi$']
Here is the actual vs expected output.
The partition by EMID does not seem to be working correctly: every time there is a new EMID, a new episode is triggered. I am using day instead of month as the unit in DATEDIFF, but this does not seem to recognize new episodes within the same EMID.

Hmmm: Use LAG() to get the previous date. Use a date comparison to assign a flag and then a cumulative sum:
select c.*,
sum(case when prev_date > dateadd(month, -3, date) then 0 else 1 end) over
(partition by emid order by date) as episode_number
from (select c.*, lag(date) over (partition by emid order by date) as prev_date
from res_treatment_escalation.dbo.cspine42920a c
) c;
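To make the flag-plus-cumulative-sum idea concrete outside of SQL, here is a minimal Python sketch of the same logic. The function names `add_months` and `episode_numbers` and the sample rows are made up for illustration; `add_months` only approximates `DATEADD(month, ...)` by clamping the day to the end of the target month.

```python
import calendar
from datetime import date
from itertools import groupby


def add_months(d, months):
    # Rough stand-in for DATEADD(month, ...): clamp the day to the month's end.
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def episode_numbers(rows):
    """rows: iterable of (emid, visit_date). Returns (emid, visit_date, episode) tuples."""
    out = []
    for emid, group in groupby(sorted(rows), key=lambda r: r[0]):
        prev = None   # plays the role of LAG(date)
        episode = 0   # plays the role of the cumulative SUM of the flag
        for _, d in group:
            # Flag = 1 on the first row, or when the previous visit is not
            # later than (current date - 3 months); otherwise flag = 0.
            if prev is None or not (prev > add_months(d, -3)):
                episode += 1
            out.append((emid, d, episode))
            prev = d
    return out


# Hypothetical sample data, not the asker's table.
sample = [
    ("98", date(2017, 1, 10)), ("98", date(2017, 2, 15)), ("98", date(2017, 3, 20)),
    ("03", date(2015, 5, 1)), ("03", date(2015, 6, 1)), ("03", date(2018, 4, 15)),
]
for row in episode_numbers(sample):
    print(row)
```

The first row of each EMID has no previous date, so it always starts episode 1, which mirrors how the `NULL` from `lag()` falls into the `else 1` branch in the query.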

Related

Sum of unique customers in rolling trailing 30d window displayed by week

I'm working in SQL Workbench.
I'd like to track every time a unique customer clicks the new feature in trailing 30 days, displayed week over week. An example of the data output would be as follows:
Week 51: Reflects usage through the end of week 51 (Dec 20th) - 30 days. aka Nov 20-Dec 20th
Week 52: Reflects usage through the end of week 52 (Dec 31st) - 30 days. aka Dec 1 - Dec 31st.
Say there are 22MM unique customer clicks that occurred from Nov 20-Dec 20th. Week 51 data = 22MM.
Say there are 25MM unique customer clicks that occurred from Dec 1-Dec 31st. Week 52 data = 25MM. The customer uniqueness is only relevant to that particular week. Aka, if a customer clicks twice in Week 51 they're only counted once. If they click once in Week 51 and once in Week 52, they are counted once in each week.
Here is what I have so far:
select
min_e_date
,sum(count(*)) over (order by min_e_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, min(DATE_TRUNC('week', event_date)) as min_e_date
from final
group by 1
) c
group by
min_e_date
I don't think a rolling count is the right way to go. As I add in additional parameters (country, subscription), the rolling count doesn't distinguish between them - the figures just get added to the prior row.
Any suggestions are appreciated!
edit Additional data below. Data collection begins on 11/23. No data precedes that date.
You can get the count of distinct customers per week like so:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt
from final
group by 1
Now if you want a rolling sum of that count (say, the current week and the three preceding weeks), you can use window functions:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt,
sum(count(distinct customer_id)) over(
order by date_trunc('week', event_date)
range between 3 week preceding and current row
) as rolling_cnt
from final
group by 1
Rolling distinct counts are quite difficult in RedShift. One method is a self-join and aggregation:
select t.date,
count(distinct case when tprev.date >= t.date - interval '6 day' then tprev.customer_id end) as trailing_7,
count(distinct tprev.customer_id) as trailing_30
from t join
t tprev
on tprev.date >= t.date - interval '29 day' and
tprev.date <= t.date
group by t.date;
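For checking the numbers on a small sample, the trailing-window distinct count can be sketched in a few lines of Python. This is only an illustration of what the self-join computes for a single end date; the function name `trailing_distinct` and the events are invented, and the window is taken as inclusive at both ends (30 days = the end date plus the 29 days before it).

```python
from datetime import date, timedelta


def trailing_distinct(events, as_of, days=30):
    """Distinct customers with an event in the `days`-day window ending on as_of."""
    start = as_of - timedelta(days=days - 1)
    return len({customer for customer, d in events if start <= d <= as_of})


# Hypothetical click events: (customer_id, event_date).
events = [
    ("a", date(2020, 11, 23)),
    ("a", date(2020, 12, 1)),
    ("b", date(2020, 12, 20)),
    ("c", date(2020, 12, 30)),
]
week_51 = trailing_distinct(events, date(2020, 12, 20))
week_52 = trailing_distinct(events, date(2020, 12, 31))
print(week_51, week_52)
```

A customer who clicks several times inside one window is counted once for that window, and is counted again in a later window only if one of their clicks falls inside it.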
If you can get this to work, you can just select every 7th row to get the weekly values.
EDIT:
An entirely different approach is to use aggregation and keep track of when customers enter and end time periods of being counted. This is a pain with two different time frames. Here is what it looks like for one.
The idea is to
Create an enter/exit record for each record being counted. The "exit" is n days after the enter.
Summarize these into periods of activity for each customer. So, there is one record with an enter and exit date. This is a type of gaps-and-islands problem.
Unpivot this result to count +1 for a customer being counted and -1 for a customer not being counted.
Do a cumulative sum of this count.
The code looks something like this:
with cd as (
select customer_id, date,
lead(date) over (partition by customer_id order by date) as next_date,
sum(sum(inc)) over (partition by customer_id order by date) as cnt
from ((select t.customer_id, t.date, 1 as inc
from t
) union all
(select t.customer_id, t.date + interval '7 day', -1
from t
)
) tt
group by customer_id, date
),
cd2 as (
select customer_id, min(date) as enter_date, max(date) as exit_date
from (select cd.*,
sum(case when coalesce(prev_cnt, 0) = 0 then 1 else 0 end) over (partition by customer_id order by date) as grp
from (select cd.*,
lag(cnt) over (partition by customer_id order by date) as prev_cnt
from cd
) cd
) cd
group by customer_id, grp
having max(cnt) > 0
)
select dte, sum(sum(inc)) over (order by dte)
from ((select customer_id, enter_date as dte, 1 as inc
from cd2
) union all
(select customer_id, exit_date as dte, -1 as inc
from cd2
)
) cd2
group by dte;
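The enter/exit steps listed above can also be sketched procedurally. The Python below is a rough illustration under the same assumptions as the query (one window length, here 7 days, with the exit placed n days after the enter); the name `active_customer_counts` and the sample events are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta
from itertools import accumulate


def active_customer_counts(events, n_days=7):
    """events: iterable of (customer_id, date). Returns [(date, customers being counted)]."""
    by_customer = defaultdict(list)
    for customer, d in events:
        by_customer[customer].append(d)

    deltas = defaultdict(int)  # +1 when a customer enters, -1 when they exit
    for dates in by_customer.values():
        dates.sort()
        enter = dates[0]
        exit_ = enter + timedelta(days=n_days)
        for d in dates[1:]:
            if d <= exit_:
                # Still inside the current period: push the exit out.
                exit_ = d + timedelta(days=n_days)
            else:
                # Gap found: close this period and start a new one.
                deltas[enter] += 1
                deltas[exit_] -= 1
                enter, exit_ = d, d + timedelta(days=n_days)
        deltas[enter] += 1
        deltas[exit_] -= 1

    days = sorted(deltas)
    running = accumulate(deltas[day] for day in days)  # cumulative sum of +1/-1
    return list(zip(days, running))


# Hypothetical events.
events = [
    ("a", date(2020, 12, 1)), ("a", date(2020, 12, 5)), ("a", date(2020, 12, 20)),
    ("b", date(2020, 12, 3)),
]
for day, count in active_customer_counts(events):
    print(day, count)
```

Each customer contributes one +1 and one -1 per period of activity, so the running total never counts the same customer twice at the same moment.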

Assign a Y/N flag based on last 12 months of activity

I'm working with a list of hospital patients and would like to flag each patient account with a "Y" if they were seen in the hospital nine or more times over the past 12 months.
I've come up with this, which would work fine if the patient list were static and only included a 12 month period:
SELECT
ENC.HSP_ACCOUNT_ID,
ENC.PAT_MRN_ID,
ENC.ADT_ARRIVAL_DTTM,
case when count(distinct txn.hsp_account_id) over(partition by PAT.PAT_MRN_ID) >= 9 then 'Y' else 'N' end as familiar_face_yn
FROM CLARITY.F_ED_ENCOUNTERS ENC
WHERE ENC.SERVICE_DATE BETWEEN '1-JUL-17' AND '31-OCT-18'
But I'd like to query the prior two years worth of data but only use the 12 months prior to the arrival date (ENC.ADT_ARRIVAL_DTTM) in calculating the Y or N.
The problem I'm running into with the above query is that it's going back and counting all visits by a particular patient between 7/1/17 and 10/31/18.
What I'd like is that if the arrival date for a record is 8/1/18, it should count all visits between 8/1/17 and 8/1/18, ignoring anything with an arrival date earlier than 8/1/17 or later than 8/1/18.
Is this sort of "rolling" calculation possible? Many thanks!
You can use a windowing clause:
SELECT ENC.HSP_ACCOUNT_ID, ENC.PAT_MRN_ID, ENC.ADT_ARRIVAL_DTTM,
(CASE WHEN COUNT(DISTINCT txn.hsp_account_id) OVER
(PARTITION BY PAT.PAT_MRN_ID
ORDER BY ENC.SERVICE_DATE
RANGE BETWEEN 365 PRECEDING AND CURRENT ROW
) >= 9
THEN 'Y' ELSE 'N'
END) as familiar_face_yn
FROM CLARITY.F_ED_ENCOUNTERS ENC
WHERE ENC.SERVICE_DATE BETWEEN DATE '2017-07-01' AND DATE '2018-10-31'
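To see what the `RANGE BETWEEN 365 PRECEDING AND CURRENT ROW` frame does for each row, here is a small Python sketch of the same rolling count. It is a simplification with stated assumptions: it counts visits (rows) rather than distinct accounts, it assumes arrival times are distinct within a patient, and the name `familiar_face_flags` and the sample data are invented.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def familiar_face_flags(encounters, threshold=9, window_days=365):
    """encounters: iterable of (mrn, arrival). Returns (mrn, arrival, 'Y'/'N') tuples."""
    by_patient = defaultdict(list)
    for mrn, arrival in encounters:
        by_patient[mrn].append(arrival)

    flags = []
    for mrn, arrivals in by_patient.items():
        arrivals.sort()
        left = 0  # index of the oldest visit still inside the window
        for right, arrival in enumerate(arrivals):
            # Drop visits older than window_days before the current arrival.
            while arrivals[left] < arrival - timedelta(days=window_days):
                left += 1
            visits = right - left + 1  # visits in [arrival - window, arrival]
            flags.append((mrn, arrival, "Y" if visits >= threshold else "N"))
    return flags


# Hypothetical patients: P1 visits every 30 days, P2 every 60 days.
start = datetime(2018, 1, 1, 8, 0)
encounters = [("P1", start + timedelta(days=30 * i)) for i in range(9)]
encounters += [("P2", start + timedelta(days=60 * i)) for i in range(9)]
for row in familiar_face_flags(encounters):
    print(row)
```

Only the ninth visit of P1 is flagged, because that is the first row with nine visits in the preceding 365 days; P2 never has more than seven visits inside any one window.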
with cte as
(
SELECT
ENC.HSP_ACCOUNT_ID,
ENC.PAT_MRN_ID,
ENC.ADT_ARRIVAL_DTTM,
-- find the most recent visit
max(ENC.ADT_ARRIVAL_DTTM) over(partition by PAT.PAT_MRN_ID) as last_date
FROM CLARITY.F_ED_ENCOUNTERS ENC
WHERE ENC.SERVICE_DATE BETWEEN '1-JUL-17' AND '31-OCT-18'
)
select ...
-- count all rows within a 12 month range before the most recent visit
case when count(distinct case when ADT_ARRIVAL_DTTM >= add_months(last_date, -12) then txn.hsp_account_id end)
over (partition by PAT.PAT_MRN_ID) >= 9
then 'Y'
else 'N'
end as familiar_face_yn
from cte
I don't know if you really need the DISTINCT count...

How to calculate number of days working on tasks if we have many tasks and the date range of each tasks could have overlap

I ran into a problem at work and would really appreciate it if anyone could give me some ideas.
We have a table that keeps track of the tasks each employee has finished. The table structure is as follows:
EmployeeNum | TaskID |Start Date of task | End Date of task
I want to calculate how many days each employee has invested in each task using this table. At first my code looks like this:
Select
EmployeeNum,TaskID,DateDiff(day,StartDate,EndDate)+1 as PureDay
from
TaskTable
Group by
EmployeeNum,TaskID
But then I found a problem that there are overlaps in the date range for each task.
For example, we have TaskA, TaskB, TaskC for one employee.
TaskA is from 2018-10-01 to 2018-10-05
TaskB from 2018-10-02 to 2018-10-07
TaskC from 2018-10-09 to 2018-10-10
In this way, the actual working days of this employee should be from 2018-10-01 to 2018-10-07, and then 2018-10-09 to 2018-10-10 which is 9 days. If I calculate date range of each task then add them together then actual working days become 5+6+2=13 days instead of 9.
I'm wondering if there is a good way to solve this overlapping problem? Thank you very much for any ideas!
The following query counts how many working days (weekdays) each employee spent on each task:
SELECT
EmployeeNum,
TaskID,
(DATEDIFF(dd, StartDate, EndDate) + 1)
-(DATEDIFF(wk, StartDate, EndDate) * 2)
-(CASE WHEN DATENAME(dw, StartDate) = 'Sunday' THEN 1 ELSE 0 END)
-(CASE WHEN DATENAME(dw, EndDate) = 'Saturday' THEN 1 ELSE 0 END) as PureDay
FROM
TaskTable
See this link for an explanation of how this computation works.
Once you know the date when a task starts, you can use a cumulative sum to assign a group to each record and then simply aggregate by that group (and other information).
The following query should do what you want:
with starts as (
select sm.*,
(case when exists (select 1
from tb_TaskMaster sm2
where sm2.EmpID = sm.EmpID and
sm2.StartDate < sm.StartDate and
sm2.EndDate >= sm.StartDate
)
then 0 else 1
end) as isstart
from tb_TaskMaster sm
)
select EmpID, count(TaskId) as cnt_TaskID, min(StartDate) as StartDate, max(EndDate) as EndDate,
datediff(Day, min(StartDate), max(EndDate)) + 1 as PureDay
from (select s.*, sum(isstart) over (partition by EmpID order by StartDate) as grp
from starts s
) s
group by EmpID, grp
order by EmpID
In this db<>fiddle, you could find the DDL & DML for my example data and the working of the code.
You can try this.
Im not sure it will work all the way but you can give it a try :)
declare @table table (empid int,taskid nvarchar(50),startdate date, enddate date)
insert into @table
values
(1,'TaskA','2018-10-01','2018-10-05'),
(1,'TaskB','2018-10-02','2018-10-07'),
(1,'TaskC','2018-10-09','2018-10-10')
select *,case when comparedate > startdate then datediff(dd,comparedate,enddate) else datediff(dd,startdate,enddate)+1 end as countofworkingdays from (
Select empid,taskid,startdate,enddate,lag(enddate,1,'1900-01-01') over(partition by empid order by startdate) as CompareDate from @table
)x
This eliminates overlapping ranges by adjusting the start date based on all previous end dates:
with maxEndDates as
( -- find the maximum previous end date
Select empid,taskid,startdate,enddate,
max(EndDate)
over (partition by EmpID
order by StartDate, EndDate desc
rows between unbounded preceding and 1 preceding) as maxEndDate
from TaskTable
),
daysPerTask as
( -- calculate the difference based on the adjusted start date to eliminate overlapping days
select *,
case when maxEndDate >= enddate then 0 -- range already fully covered
when maxEndDate >= startdate then datediff(dd, maxEndDate, enddate) -- range partially overlapping (including a start on the previous end date)
else datediff(dd, startdate, enddate)+1 -- new range
end as dayCount
from maxEndDates
)
-- get the final count
select EmpID, sum(dayCount)
from daysPerTask
group by EmpID;
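As a sanity check on the running-maximum idea, here is a minimal Python sketch for a single employee, run against the three tasks from the question. The function name `distinct_days` is made up. The sketch uses `>=` for the partial-overlap test so that a task starting on the very day the previous one ended does not count that day twice.

```python
from datetime import date


def distinct_days(ranges):
    """ranges: list of (start, end) inclusive dates for one employee."""
    total = 0
    max_end = None  # latest end date seen so far (the running maximum)
    # Same ordering as the query: start date ascending, end date descending.
    for start, end in sorted(ranges, key=lambda r: (r[0], -r[1].toordinal())):
        if max_end is not None and max_end >= end:
            days = 0                          # range already fully covered
        elif max_end is not None and max_end >= start:
            days = (end - max_end).days       # only the uncovered tail counts
        else:
            days = (end - start).days + 1     # brand new range
        total += days
        max_end = end if max_end is None else max(max_end, end)
    return total


# Tasks A, B and C from the question.
tasks = [
    (date(2018, 10, 1), date(2018, 10, 5)),
    (date(2018, 10, 2), date(2018, 10, 7)),
    (date(2018, 10, 9), date(2018, 10, 10)),
]
print(distinct_days(tasks))
```

TaskA contributes 5 days, TaskB only the 2 days after 2018-10-05, and TaskC its full 2 days.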
See db<>fiddle
Thank you all very much for your responses and help. I found a solution while searching Stack Overflow; here is the link:
T-SQL date range in a table split and add the individual date to the table
The Tally table suggested by Felix in the above question is a great way to solve my problem since I have millions of records and the real situation is really complicated.
Thank you all again for your help!

SQL - Find the two closest date after a specific date

Dear Stack Overflow community,
I am looking for the patient id where the two consecutive dates after the very first one are less than 7 days.
So differences between 2nd and 1st date <= 7 days
and differences between 3rd and 2nd date <= 7 days
Example:
ID Date
1 9/8/2014
1 9/9/2014
1 9/10/2014
2 5/31/2014
2 7/20/2014
2 9/8/2014
For patient 1, the two dates following the first are each within 7 days of the previous date.
For patient 2, however, the following dates are more than 7 days apart (50 days).
I am trying to write an SQL query that outputs just the patient id "1".
Thanks for your help :)
You want to use lead(), but this is complicated because you want this only for the first three rows. I think I would go for:
select t.*
from (select t.*,
lead(date, 1) over (partition by id order by date) as next_date,
lead(date, 2) over (partition by id order by date) as next_date_2,
row_number() over (partition by id order by date) as seqnum
from t
) t
where seqnum = 1 and
next_date <= date + interval '7' day and
next_date_2 <= next_date + interval '7' day;
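The same check can be sketched in Python, which makes the "first three rows only" restriction explicit. The function name `ids_with_tight_first_three` is hypothetical; the rows are the six from the question.

```python
from collections import defaultdict
from datetime import date


def ids_with_tight_first_three(rows, max_gap_days=7):
    """rows: iterable of (patient_id, date). Returns ids whose first three
    dates are each at most max_gap_days after the previous one."""
    by_id = defaultdict(list)
    for pid, d in rows:
        by_id[pid].append(d)

    result = []
    for pid, dates in by_id.items():
        first = sorted(dates)[:3]  # only the earliest three dates matter
        gaps_ok = all((b - a).days <= max_gap_days for a, b in zip(first, first[1:]))
        if len(first) == 3 and gaps_ok:
            result.append(pid)
    return result


rows = [
    (1, date(2014, 9, 8)), (1, date(2014, 9, 9)), (1, date(2014, 9, 10)),
    (2, date(2014, 5, 31)), (2, date(2014, 7, 20)), (2, date(2014, 9, 8)),
]
print(ids_with_tight_first_three(rows))
```

Patients with fewer than three dates are skipped, which matches the query: `lead(date, 2)` would be `NULL` for them and the comparison would not be true.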
You can try using window function lag()
select * from
(
select id,date,lag(date) over(partition by id order by date) as prevdate
from tablename
)A where datediff(day,prevdate,date)<=7

How to take only one entry from a table based on an offset to a date column value

I have a requirement to get values from a table based on an offset condition on a date column.
For example, in the table attached below, if any dates fall within 15 days of each other based on the effectivedate column, I should return only the first one.
So my expected result would be as below:
Here for A1234 policy, it returns 6/18/16 entry and skipped 6/12/16 entry as the offset between these 2 dates is within 15 days and I took the latest one from the list.
If you want to group rows together that are within 15 days of each other, then you have a variant of the gaps-and-islands problem. I would recommend lag() and cumulative sum for this version:
select polno, min(effectivedate), max(expirationdate)
from (select t.*,
sum(case when prev_ed >= dateadd(day, -15, effectivedate)
then 0 else 1
end) over (partition by polno order by effectivedate) as grp
from (select t.*,
lag(expirationdate) over (partition by polno order by effectivedate) as prev_ed
from t
) t
) t
group by polno, grp;
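For comparison, here is a small Python sketch of the grouping step. It is a simplification and an assumption on my part: it compares each effective date with the previous effective date for the same policy (the query above compares against the previous expiration date), and it starts a new group whenever the gap exceeds 15 days. The name `islands` and all rows other than the two A1234 dates mentioned in the question are invented.

```python
from datetime import date
from itertools import groupby


def islands(rows, max_gap_days=15):
    """rows: iterable of (polno, effectivedate). Returns (polno, first, last) per group."""
    out = []
    for polno, grp in groupby(sorted(rows), key=lambda r: r[0]):
        dates = [d for _, d in grp]
        start = prev = dates[0]
        for d in dates[1:]:
            if (d - prev).days > max_gap_days:
                # Gap is too large: close the current group, start a new one.
                out.append((polno, start, prev))
                start = d
            prev = d
        out.append((polno, start, prev))
    return out


rows = [
    ("A1234", date(2016, 6, 12)), ("A1234", date(2016, 6, 18)),
    ("A1234", date(2016, 9, 1)),   # hypothetical later entry
    ("B5678", date(2016, 1, 1)),   # hypothetical second policy
]
for group in islands(rows):
    print(group)
```

Once the groups exist, picking the first or the latest row of each group is just a matter of reading the second or third element of the tuple.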