Assign a Y/N flag based on last 12 months' activity - SQL

I'm working with a list of hospital patients and would like to flag each patient account with a "Y" if they were seen in the hospital nine or more times over the past 12 months.
I've come up with this, which would work fine if the patient list were static and only included a 12-month period:
SELECT
ENC.HSP_ACCOUNT_ID,
ENC.PAT_MRN_ID,
ENC.ADT_ARRIVAL_DTTM,
case when count(distinct ENC.HSP_ACCOUNT_ID) over(partition by ENC.PAT_MRN_ID) >= 9 then 'Y' else 'N' end as familiar_face_yn
FROM CLARITY.F_ED_ENCOUNTERS ENC
WHERE ENC.SERVICE_DATE BETWEEN '1-JUL-17' AND '31-OCT-18'
But I'd like to query the prior two years' worth of data and only use the 12 months prior to the arrival date (ENC.ADT_ARRIVAL_DTTM) in calculating the Y or N.
The problem I'm running into with the above query is that it goes back and counts all visits by a particular patient between 7/1/17 and 10/31/18.
What I'd like is that if the arrival date for a record is 8/1/18, it should count all visits between 8/1/17 and 8/1/18, ignoring anything with an arrival date earlier than 8/1/17 or later than 8/1/18.
Is this sort of "rolling" calculation possible? Many thanks!

You can use a windowing clause:
SELECT ENC.HSP_ACCOUNT_ID, ENC.PAT_MRN_ID, ENC.ADT_ARRIVAL_DTTM,
(CASE WHEN COUNT(DISTINCT ENC.HSP_ACCOUNT_ID) OVER
(PARTITION BY ENC.PAT_MRN_ID
ORDER BY ENC.SERVICE_DATE
RANGE BETWEEN 365 PRECEDING AND CURRENT ROW
) >= 9
THEN 'Y' ELSE 'N'
END) as familiar_face_yn
FROM CLARITY.F_ED_ENCOUNTERS ENC
WHERE ENC.SERVICE_DATE BETWEEN DATE '2017-07-01' AND DATE '2018-10-31'
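Note that some databases, Oracle included, do not allow DISTINCT in an analytic function together with an ORDER BY/windowing clause, so the query above may be rejected as written. A minimal variant that drops the DISTINCT, assuming each row of F_ED_ENCOUNTERS already represents one HSP_ACCOUNT_ID:
SELECT ENC.HSP_ACCOUNT_ID, ENC.PAT_MRN_ID, ENC.ADT_ARRIVAL_DTTM,
(CASE WHEN COUNT(*) OVER
(PARTITION BY ENC.PAT_MRN_ID
ORDER BY ENC.SERVICE_DATE
RANGE BETWEEN 365 PRECEDING AND CURRENT ROW
) >= 9
THEN 'Y' ELSE 'N'
END) as familiar_face_yn
FROM CLARITY.F_ED_ENCOUNTERS ENC
WHERE ENC.SERVICE_DATE BETWEEN DATE '2017-07-01' AND DATE '2018-10-31'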

with cte as
(
SELECT
ENC.HSP_ACCOUNT_ID,
ENC.PAT_MRN_ID,
ENC.ADT_ARRIVAL_DTTM,
-- find the most recent visit
max(ENC.ADT_ARRIVAL_DTTM) over(partition by ENC.PAT_MRN_ID) as last_date
FROM CLARITY.F_ED_ENCOUNTERS ENC
WHERE ENC.SERVICE_DATE BETWEEN '1-JUL-17' AND '31-OCT-18'
)
select ...
-- count all rows within a 12 month range before the most recent visit
case when count(distinct case when ADT_ARRIVAL_DTTM >= add_months(last_date, -12) then HSP_ACCOUNT_ID end)
over (partition by PAT_MRN_ID) >= 9
then 'Y'
else 'N'
end as familiar_face_yn
from cte
I don't know if you really need the DISTINCT count...

Related

SQL - Get historic count of rows collected within a certain period by date

For many years I've been collecting data and I'm interested in knowing the historic counts of IDs that appeared in the last 30 days. The source looks like this
id   dates
1    2002-01-01
2    2002-01-01
3    2002-01-01
...  ...
3    2023-01-10
If I wanted to know the historic count of IDs that appeared in the last 30 days, I would do something like this:
with total_counter as (
select id, count(id) counts
from source
group by id
),
unique_obs as (
select id
from source
where dates >= DATEADD(Day ,-30, current_date)
group by id
)
select count(distinct(id))
from unique_obs
left join total_counter
on total_counter.id = unique_obs.id;
The problem is that this returns a single result: today's count, as given by current_date.
I would like to see a table with such counts as if, for example, I had run this analysis yesterday, and the day before, and so on. So the expected result would be something like
counts  date
1235    2023-01-10
1234    2023-01-09
1265    2023-01-08
...     ...
7383    2022-12-11
So, for example, if current_date were 2023-01-10, my query would have returned 1235.
If you need a distinct count of IDs from the 30 days up to and including each date, the below should work:
WITH CTE_DATES
AS
(
--Create a list of anchor dates
SELECT DISTINCT
dates
FROM source
)
SELECT COUNT(DISTINCT s.id) AS "counts"
,D.dates AS "date"
FROM CTE_DATES D
LEFT JOIN source S ON S.dates BETWEEN DATEADD(DAY,-29,D.dates) AND D.dates --30 DAYS INCLUSIVE
GROUP BY D.dates
ORDER BY D.dates DESC
;
If the distinct count didn't matter, you could likely simplify with a rolling sum, hitting the source table only once:
SELECT S.dates AS "date"
,COUNT(1) AS "count_daily"
,SUM("count_daily") OVER(ORDER BY S.dates DESC ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING) AS "count_rolling" --assumes there is at least one row for every day.
FROM source S
GROUP BY S.dates
ORDER BY S.dates DESC;
;
This won't work, though, if you have gaps in your list of dates, as it will just include the latest 30 dates available. In that case the first example, without DISTINCT in the count, will do the trick.
SELECT count(*) AS Counts
,dates AS Date
FROM source
WHERE dates >= DATEADD(DAY, -30, CURRENT_DATE)
GROUP BY dates
ORDER BY dates DESC
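If there are gaps and you still want one row per calendar day, the anchor dates can come from a date spine instead of the source table. A rough sketch, assuming a helper set calendar(dates) with one row per day of the range (a permanent calendar table, a generator function, or a recursive CTE, whatever the dialect offers):
SELECT COUNT(DISTINCT S.id) AS "counts"
,D.dates AS "date"
FROM calendar D
LEFT JOIN source S ON S.dates BETWEEN DATEADD(DAY,-29,D.dates) AND D.dates --30 days inclusive
GROUP BY D.dates
ORDER BY D.dates DESC;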

Retrieve Customers with a Monthly Order Frequency greater than 4

I am trying to optimize the below query to fetch all customers in the last three months who have a monthly order frequency of 4 or more in each of the past three months.
Customer ID  Feb  Mar  Apr
0001         4    5    6
0002         3    2    4
0003         4    2    3
In the above table, only the customer with Customer ID 0001 should be picked, as they consistently have 4 or more orders per month.
Below is a query I have written, which pulls all customers with an average purchase frequency of 4 in the last 90 days, but it does not check that there were consistently 4 or more purchases in each of the last three months.
Query:
SELECT distinct lines.customer_id Customer_ID, (COUNT(lines.order_id)/90) PurchaseFrequency
from fct_customer_order_lines lines
LEFT JOIN product_table product
ON lines.entity_id= product.entity_id
AND lines.vendor_id= product.vendor_id
WHERE LOWER(product.country_code)= "IN"
AND lines.date >= DATE_SUB(CURRENT_DATE() , INTERVAL 90 DAY )
AND lines.date < CURRENT_DATE()
GROUP BY Customer_ID
HAVING PurchaseFrequency >=4;
I tried to use window functions; however, I'm not sure if they are needed in this case.
I would sum the orders per month instead of computing the average, and then retrieve those who have that sum of 4 or more in each of the last three months.
Also, I think you should select your interval using "month(CURRENT_DATE()) - 3" instead of using a window of 90 days. Of course, if needed, you should handle the case where current_date falls in Jan-Feb-Mar and go back to Oct-Nov-Dec of the previous year.
I'm not familiar with Google BigQuery so I can't write your query but I hope this helps.
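For illustration only, that month-based filter could look roughly like this in BigQuery syntax (column names taken from the question); DATE_TRUNC keeps the last three complete calendar months and handles the year rollover automatically:
AND lines.date >= DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH), MONTH)
AND lines.date < DATE_TRUNC(CURRENT_DATE(), MONTH)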
So I've found the solution to this using the WITH operator, as below:
WITH filtered_orders AS (
select
distinct customer_id ID,
extract(MONTH from date) Order_Month,
count(order_id) CountofOrders
from customer_order_lines lines
where EXTRACT(YEAR FROM date) = 2022 AND EXTRACT(MONTH FROM date) IN (2,3,4)
group by ID, Order_Month
having CountofOrders>=4)
select distinct ID
from filtered_orders
group by ID
having count(Order_Month) =3;
Hope this helps!
An option could be to first count the orders by month and then keep the users whose purchase count is above your threshold in all months:
WITH ORDERS_BY_MONTH AS (
SELECT
DATE_TRUNC(lines.date, MONTH) PurchaseMonth,
lines.customer_id Customer_ID,
COUNT(lines.order_id) PurchaseFrequency
FROM fct_customer_order_lines lines
LEFT JOIN product_table product
ON lines.entity_id= product.entity_id
AND lines.vendor_id= product.vendor_id
WHERE LOWER(product.country_code)= "IN"
AND lines.date >= DATE_SUB(CURRENT_DATE() , INTERVAL 90 DAY )
AND lines.date < CURRENT_DATE()
GROUP BY PurchaseMonth, Customer_ID
)
SELECT
Customer_ID,
AVG(PurchaseFrequency) AvgPurchaseFrequency
FROM ORDERS_BY_MONTH
GROUP BY Customer_ID
HAVING COUNT(1) = COUNTIF(PurchaseFrequency >= 4)
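Here COUNT(1) is the number of months the customer appears in and COUNTIF(PurchaseFrequency >= 4) (BigQuery's conditional count) is the number of those months meeting the threshold, so the equality keeps only customers who hit 4 or more orders in every month they appear in. A month with zero orders produces no row at all, so if you also want to insist on activity in all three months you could additionally require COUNT(1) = 3.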

Select rows for last n days after event occurs

I have the following table and data:
PatientID  PatientName  Diagnosed  ReportDate  ...
1          ...          0          ...
1          ...          0          ...
1          ...          0          ...
1          ...          1          ...
So there are multiple rows for each patient, as the reports come in a few times a day.
Whenever the diagnosed field is changed to 1 for that patient, I'd like to get the past 3 days of data. So when Diagnosed == 1, get the data from report time minus 3 days for each patient.
SELECT Patients.ReportDate
FROM Patients
WHERE Diagnosed = 1 and date > ReportDate - interval '3' day;
So getting the past 3 days of data can be done with ReportDate minus an interval, but how do I specify that for every patient (since there can be multiple rows per patient) based on the diagnosed field?
I usually do this filtering after getting CSVs in Python, but the data set is too large, so I'd like to filter before I convert them to DataFrames.
You can look at this another way, which is whether diagnosed = 1 in the next three days -- and take all rows where that is true:
select p.*
from (select p.*,
count(*) filter (where diagnosed = 1) over (partition by patientId order by reportDate range between interval '0 day' following and interval '3 day' following) as cnt_diagnosed_3
from patients p
) p
where cnt_diagnosed_3 > 0
order by patientId, reportDate;
Whenever the diagnosed field is changed to 1, for that patient, I'd like to get the past 3 days of data.
SELECT (p).*
FROM (
SELECT p
, diagnosed
, bool_or(diagnosed = 1) OVER (w RANGE BETWEEN CURRENT ROW AND '3 days' FOLLOWING) AS in_range
, lag(diagnosed) OVER w AS last_diagnosed
FROM patients p
WINDOW w AS (PARTITION BY patientid ORDER BY reportdate)
) sub
WHERE diagnosed = 0 AND in_range
OR diagnosed = 1 AND last_diagnosed = 0
ORDER BY patientid, reportdate;
db<>fiddle here
Returns the "past 3 days of data" where the "field is changed to 1" (previous row had "0").
The WINDOW clause is just syntactic sugar to avoid spelling out the same window definition repeatedly. (No additional benefit for performance.)
SELECT p in the innermost subquery is a neat way to get the whole row. The outer SELECT (p).* returns complete rows without auxiliary columns added in the subquery. This way we get whole rows without spelling out all columns (or even needing to know all of them).
RANGE distance PRECEDING/FOLLOWING requires Postgres 11 or later.
Here is a slower alternative that also works for older versions:
SELECT p.*
FROM (
SELECT patientid, reportdate
FROM (
SELECT patientid, reportdate, diagnosed
, lag(diagnosed) OVER (PARTITION BY patientid ORDER BY reportdate) AS last_diagnosed
FROM patients
) p0
WHERE diagnosed = 1
AND last_diagnosed = 0
) d
JOIN patients p USING (patientid)
WHERE p.reportdate BETWEEN d.reportdate - interval '3 days' AND d.reportdate
ORDER BY p.patientid, p.reportdate;
Subquery d selects rows where Diagnosed just switched to 1. Then self-join to select your time frame.
For gaps-and-islands basics, see:
Select longest continuous sequence
You also added:
So when Diagnosed ==1, get report time -3 days of data for each patient.
That's a wider definition, and that's what Gordon's query does. Goes to show the importance of an exact definition of requirements.

Flag 2 actual vs benchmark readings every rolling 12 hours in SQL Developer Query

Looking for some help with a SQL Developer query to flag any 2 temperature readings - in every rolling 12 hours - if they are greater than the acceptable benchmark of 101 deg F.
The given data fields are:
Temp Recorded (DT/TM data type ; down to seconds)
Reading Value (number data type)
Patient ID
There are multiple readings taken throughout a patient's stay, at random times.
Logically, we can check whether two readings fall within 12 hours of each other and EACH of their temp readings is > 101, but I'm not sure how to put it into a CASE statement (unless there's better SQL syntax).
Will really appreciate if a SQL only solution can be recommended.
Many Thanks
Giving the code below, including the CASE statement provided by Gordon Linoff underneath. The sample code below is part of a bigger query joining multiple tables:
SELECT CE.PatientID, CE.ReadingValue, CE.TempRecorded_DT_TM,
(case when sum(case when CE.ReadingValue > 101 then 1 else 0 end) over (
partition by CE.PatientID
order by CE.TempRecorded_DT_TM
range between interval '12' hour preceding and current row
) >= 2
then 'Y' else 'N'
end) as temp_flag
FROM
edw.se_clinical_event CE
WHERE
CE.PatientID = '176660214'
AND
CE.TempRecorded_DT_TM >= '01-JAN-20'
ORDER BY
TempRecorded_DT_TM
If you want two readings in 12 hours that are greater than 101, then you can use a rolling sum with a window frame:
select t.*,
(case when sum(case when readingvalue > 101 then 1 else 0 end) over (
partition by patientid
order by dt
range between interval '12' hour preceding and current row
) >= 2
then 'Y' else 'N'
end) as temp_flag
from t;
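In both versions, the inner CASE maps each reading to 1 when it is above 101 and 0 otherwise, and the windowed SUM adds those flags over the 12 hours ending at the current reading, so temp_flag is 'Y' exactly when at least two high readings fall inside that rolling window. For the interval frame to work, the ORDER BY column must be the actual date/time column (TempRecorded_DT_TM in the question's table).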

Date filtering in SQL

The table below consists of 2 columns: a unique identifier and a date. I am trying to build a new column of episodes, where a new episode is triggered when there are >= 3 months between dates. This process should occur for each unique EMID. In the table attached, the EMID ending in 98 would only have 1 episode, as there are no intervals > 2 months between consecutive rows in the date column. However, the EMID ending in 03 would have 2 episodes, as there is almost a 3-year gap between rows 12 and 13. I have tried the following code, which doesn't work.
SELECT TOP (1000) [EMID], [Date],
CASE
WHEN DATEDIFF(month, [Date], LEAD([Date]) OVER (PARTITION BY [EMID] ORDER BY [Date])) < 3
THEN '1'
WHEN DATEDIFF(month, [Date], LEAD([Date]) OVER (PARTITION BY [EMID] ORDER BY [Date])) BETWEEN 3 AND 5
THEN '2'
ELSE '3'
END AS episode
FROM [res_treatment_escalation].[dbo].[cspine42920a]
EDIT: Using Microsoft SQL Server Management Studio.
EDIT 2: I have made some progress but the output is not exactly what I am looking for. Here is the query I used:
SELECT TOP (1000) [EMID],[visit_date_01],
CASE
WHEN DATEDIFF(DAY, visit_date_01, LAG(visit_date_01,1,getdate()) OVER (partition by EMID order by EMID)) <= 90 THEN '1'
WHEN DATEDIFF(DAY, visit_date_01, LAG(visit_date_01,1,getdate()) OVER (PARTITION BY EMID ORDER BY EMID)) BETWEEN 90 AND 179 THEN '2'
WHEN DATEDIFF(DAY, visit_date_01, LAG(visit_date_01,1,getdate()) OVER (PARTITION BY EMID order by EMID)) > 180 THEN '3'
END AS EPISODE
FROM [res_treatment_escalation].[dbo].['c-spine_full_dataset_4#29#20_wi$']
Here is the actual vs expected output.
The PARTITION BY EMID does not seem to be working correctly; every time there is a new EMID, a new episode is triggered. I am using DAY instead of MONTH as the filter in DATEDIFF; this does not seem to recognize new episodes within the same EMID.
Hmmm: Use LAG() to get the previous date. Use a date comparison to assign a flag and then a cumulative sum:
select c.*,
sum(case when prev_date > dateadd(month, -3, date) then 0 else 1 end) over
(partition by emid order by date) as episode_number
from (select c.*, lag(date) over (partition by emid order by date) as prev_date
from res_treatment_escalation.dbo.cspine42920a c
) c;
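For what it's worth on how this works: the inner query attaches each row's previous visit date per EMID; the outer CASE yields 0 when that previous visit was within the last 3 months and 1 otherwise (including the first visit, where prev_date is NULL), and the running SUM of those 1s, ordered by date, numbers the episodes 1, 2, 3, ... within each EMID.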