Window functions and calculating averages with tricky data manipulation


I have a SQL Server programming challenge involving some manipulations of healthcare patient pulse readings.
The goal is to do an average of readings within a certain time period and to only include the latest pulse reading of the day.
As an example (the times are appt_time values):
PATIENT 1
1/1/2019 80
1/2/2019 10 am 78
1/2/2019 1 pm 85
1/3/2019 90
PATIENT 2
1/3/2019 90
1/4/2019 85
A patient may or may not have a second reading in a day. Only the three most recent readings (one per day) are used for the average. If fewer than 3 readings are available, the average is computed over 2 readings, or the single reading is used as the average.
Can this be done with SQL window functions? I expect that to be a little more efficient than using a subquery.
I have used first_VALUE desc statements successfully to pick the last pulse in a day. I then have tried various row_number commands to exclude the marked off row (first pulse of the day when 2 readings are present). I cannot seem to correctly calculate the average. I have used row_number in select and from clauses.
with CTEBPI3
AS (
SELECT pat_id
,appt_time
,bp_pulse
,first_VALUE (bp_pulse) over(partition by appt_time order by appt_time desc ) fv
,ROW_NUMBER() OVER (PARTITION BY appt_time ORDER BY APPT_time DESC)RN1
,,Round(Sum(bp_pulse) OVER (PARTITION BY Pat_id) / COUNT (appt_time) OVER (PARTITION BY Pat_id), 0) AS adJAVGSYS3
FROM
pat_enc
WHERE appt_time > '07/15/2018'
)
select *,
WHEN rn=1
Expected results:
The average for patient 1 should be 85.
The average for patient 2 should be 87.5.

You can do this with two window functions:
MAX(appt_time) OVER ... to get the latest time per day
DENSE_RANK() OVER ... to get the last three days
You get the date part from your datetime with CONVERT(DATE, appt_time). The average function AVG is already built in :-)
The complete query:
select pat_id, avg(bp_pulse) as average_pulse
from
(
select
pat_id, appt_time, bp_pulse,
max(appt_time) over (partition by pat_id, convert(date, appt_time)) as max_time,
dense_rank() over (partition by pat_id order by convert(date, appt_time) desc) as rn
from pat_enc
) evaluated
where appt_time = max_time -- last row per day
and rn <= 3 -- last three days
group by pat_id
order by pat_id;
If the column bp_pulse is defined as an integer, you must convert it to a decimal to avoid integer arithmetic:
select pat_id, avg(convert(decimal, bp_pulse)) as average_pulse
Demo: https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=3df744fcf2af89cdfd8b3cd8b6546d89
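To try the query above locally, a minimal setup along these lines could be used. This is only a sketch: the column types are assumptions, and the 09:00 times are made up for the rows where the question gives no time of day.

```sql
-- Hypothetical schema: the column types are assumptions, not taken from the question.
create table pat_enc (
    pat_id    int      not null,
    appt_time datetime not null,
    bp_pulse  int      not null
);

-- Sample readings from the question; 09:00 is an invented time for date-only rows.
insert into pat_enc (pat_id, appt_time, bp_pulse) values
    (1, '2019-01-01T09:00:00', 80),
    (1, '2019-01-02T10:00:00', 78),  -- earlier reading on 1/2, should be ignored
    (1, '2019-01-02T13:00:00', 85),  -- latest reading on 1/2, should be kept
    (1, '2019-01-03T09:00:00', 90),
    (2, '2019-01-03T09:00:00', 90),
    (2, '2019-01-04T09:00:00', 85);
```

With bp_pulse declared as an integer like this, the unconverted avg(bp_pulse) truncates patient 2 to 87, while the decimal conversion gives 87.5; patient 1 is 85 either way.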

Actually, window functions are not necessarily more efficient. It is worth comparing:
select p.pat_id, avg(p.bp_pulse)
from pat_enc p
where -- appt_time > '2018-07-15' and -- don't know if this is necessary
      -- keep only rows on or after the third-latest appointment date;
      -- coalesce covers patients with fewer than three distinct dates
      p.appt_time >= coalesce((select distinct convert(date, p2.appt_time) as appt_date
                               from pat_enc p2
                               where p2.pat_id = p.pat_id
                               order by appt_date desc
                               offset 2 rows fetch first 1 row only
                              ), '19000101') and
      -- keep only the last reading of each day
      p.appt_time = (select max(p2.appt_time)
                     from pat_enc p2
                     where p2.pat_id = p.pat_id and
                           convert(date, p2.appt_time) = convert(date, p.appt_time)
                    )
group by p.pat_id;
This wants an index on pat_enc(pat_id, appt_time).
In fact, there are a variety of ways to write this logic, with different mixes of subqueries and window functions (this is one extreme).
Which performs the best will depend on the nature of your data. In particular:
The number of appointments on the same day -- is this normally 1 or a large number?
The overall number of days with appointments -- is this right around three or are there hundreds?
You need to test on your data, but I think window functions will work best when relatively few rows are filtered out (~1 appointment/day, ~3 days with appointments). Subqueries will be helpful when more rows are being filtered.

Related

How to conditional SQL select

My table consists of user_id, revenue, publish_month columns.
Right now I use group_by user_id and sum(revenue) to get revenue for all individual users.
Is there a single SQL query I can use to query for user revenue across a time period conditionally? If for a specific user, there is a row for this month, I want to query for this month, last month and the month before. If there is not yet a row for this month, I want to query for last month and the two months before.
Any advice with which approach to take would be helpful. If I should be using cases, if-elses with exists or if this is do-able with a single SQL query?
UPDATE---since I did a bad job of describing the question, I've come to include some example data and expected results
Where current month is not present for user 33
Where current month is present

Assuming publish_month is a DATE datatype, this should get the most recent three months of data per user...
SELECT
user_id, SUM(revenue) as s_revenue
FROM
(
SELECT
user_id, revenue, publish_month,
MAX(publish_month) OVER (PARTITION BY user_id) AS user_latest_publish_month
FROM
yourtableyoudidnotname
)
summarised
WHERE
publish_month >= DATEADD(month, -2, user_latest_publish_month)
GROUP BY
user_id
If you want to limit that to the most recent 3 months out of the last 4 calendar months, just add AND publish_month >= DATEADD(month, -3, DATE_TRUNC(month, GETDATE()))
The ambiguity here is why it is important to include a Minimal Reproducible Example.
With input data and required results, we could test our code against your requirements.
If you're using strings for the publish_month, you shouldn't be, and should fix that with utmost urgency.

You can use a windowing function to "number" the months. In this way the most recent one will have a value of 1, the prior 2, and the one before 3. Then you can only select the items with a number of 3 or less.
Here is how:
SELECT user_id, revenue, publish_month,
ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY publish_month DESC) as RN
FROM yourtableyoudidnotname
Now you just select the items with an RN of 3 or less and do your sum:
SELECT user_id, SUM(revenue) as s_revenue
FROM (
SELECT user_id, revenue, publish_month,
ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY publish_month DESC) as RN
FROM yourtableyoudidnotname
) X
WHERE RN <= 3
GROUP BY user_id
You could also do this without a sub query if you use the windowing function for SUM and a range, but I think this is easier to understand.
From the comment -- there could be an issue if you have months from more than one year. To solve this, make sure the largest value in the ORDER BY is always the most recent month. So instead of
ORDER BY publish_month DESC
you would have
ORDER BY (100*publish_year)+publish_month DESC
This means more recent years will always have a higher number, so January 2023 will be 202301 while December 2022 will be 202212. Since January has the bigger number, it will get a row number of 1 and December will get a row number of 2.

SQL - calculating hours since the earliest date in a partition

I have the following SQL code:
select
survey.ContactId,
survey.CommId,
survey.CommCreatedDate,
survey.CommIdStatus,
br.[Value],
null as HoursPastSinceFirstActiveSurvey,
row_number() over (partition by survey.ContactId order by survey.CommCreatedDate desc) as [row]
from
Survey_Completed survey
inner join
Business_Rules br on br.Name = 'OPT_OUT_TIME'
where
survey.CommIdStatus = 'Active'
Which produces the following result set:
What I need help with is filling out HoursPastSinceFirstActiveSurvey. The logic here should be as follows:
Calculate the total number of hours that has passed since the earliest (by CommCreatedDate) record in the partition for consecutive (by day) records. In order to address the "consecutive" part, I was thinking perhaps it might be possible to add to the partitioning logic to only partition if the days are consecutive. I'm not entirely sure if that's possible though. So for example, look at the last two records. They are grouped as a partition and the dates are consecutive and the earliest date/time on this partition is Nov 11 2020 12:00 AM. So I would want to perform the following in order to populate HoursPastSinceFirstActiveSurvey for these two records:
Today's date minus Nov 11 2020 12:00 AM.
This would be the value for those two records in the partition for HoursPastSinceFirstActiveSurvey. I am not sure where to even start with this!! Thank you all.

I was able to solve for this by the following query. Feedback is entirely WELCOME!
select
Q2.ContactId,
min(Q2.CommCreatedDate) as MinDate,
max(Q2.CommCreatedDate) as MaxDate,
Q2.Consecutive,
datediff(hour, min(Q2.CommCreatedDate), max(Q2.CommCreatedDate)) AS HoursPassed
from
(select
Q1.ContactId,
Q1.CommId,
Q1.CommCreatedDate,
Q1.CommIdStatus,
Q1.[Value],
Q1.Consecutive,
Q1.[row],
Q1.countOfPartition
from
(select
survey.ContactId,
survey.CommId,
survey.CommCreatedDate,
survey.CommIdStatus,
br.[Value],
CAST(dateadd(day,-row_number() over (partition by survey.ContactId order by survey.CommCreatedDate), survey.CommCreatedDate) as Date) as Consecutive,
row_number() over (partition by survey.ContactId order by survey.CommCreatedDate desc) as [row],
count(*) over (partition by survey.ContactId) as countOfPartition
from
Survey_Completed survey
inner join
Business_Rules br on br.Name = 'OPT_OUT_TIME'
where
survey.CommIdStatus = 'Active') Q1
where
Q1.countOfPartition <> 1) Q2
group by
Q2.ContactId, Q2.Consecutive, Q2.[Value]
having
datediff(hour, min(Q2.CommCreatedDate), max(Q2.CommCreatedDate)) > Q2.[Value]
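For readers wondering why the Consecutive column in the query above groups consecutive days together, here is a small illustration. The sample rows and the CTE name demo are made up for the sketch; only the expression itself comes from the query above.

```sql
-- Hypothetical sample: one contact with surveys on Nov 1-3 and Nov 10-11.
with demo (ContactId, CommCreatedDate) as (
    select v.ContactId, cast(v.CommCreatedDate as datetime)
    from (values
        (1, '2020-11-01T08:00:00'),
        (1, '2020-11-02T09:30:00'),
        (1, '2020-11-03T07:15:00'),
        (1, '2020-11-10T12:00:00'),
        (1, '2020-11-11T00:00:00')
    ) v (ContactId, CommCreatedDate)
)
select ContactId,
       CommCreatedDate,
       -- Subtracting the row number (in days) from each date gives the same
       -- anchor date for every row in a run of consecutive days.
       cast(dateadd(day,
                    -row_number() over (partition by ContactId order by CommCreatedDate),
                    CommCreatedDate) as date) as Consecutive
from demo
order by ContactId, CommCreatedDate;
```

Here the first three rows all map to 2020-10-31 and the last two to 2020-11-06, so grouping by Consecutive separates the two runs. One caveat: if a contact can have two records on the same day, row_number() keeps increasing while the date does not, which splits the run; ranking with dense_rank() over the date part would probably be more robust in that case.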

How to capture first row in a grouping and subsequent rows that are each a minimum of 15 days apart?

Assume a given insurance will only pay for the same patient visiting the same doctor once in 15 days. If the patient comes once, twice, or twenty times within those 15 days to the doctor, the doctor will get only one payment. If the patient comes again on Day 16 or Day 18 or Day 29 (or all three!), the doctor will get a second payment. The first visit (or first after the 15 day interval) is always the one that must be billed, along with its complaint.
The SQL for all visits can be loosely expressed as follows:
SELECT VisitID
,PatientID
,VisitDtm
,DoctorID
,ComplaintCode
FROM Visits
The goal is to query the Visits table in a way that would capture only billable incidents.
I have been trying to work through this question which is in essence quite similar to Group rows with that are less than 15 days apart and assign min/max date. However, the reason this won't work for me is that, as the accepted answerer (Salman A) points out, Note that this could group much longer date ranges together e.g. 01-01, 01-11, 01-21, 02-01 and 02-11 will be grouped together although the first and last dates are more than 15 days apart. This presents a problem for me as it is a requirement to always capture the next incident after 15 days have passed from the first incident.
I have spent quite a few hours thinking this through and poring over like problems, and am looking for help in understanding the path to a solution, not necessarily an actual code solution. If it's easier to answer in the context of a code solution, that is fine. Any and all guidance is very much appreciated!

This type of task requires an iterative process so you can keep track of the last billable visit. One approach is a recursive CTE.
You would typically enumerate the visits of each patient using row_number(), then traverse the dataset starting from the first visit, while keeping track of the last "billable" visit. Once a visit is met that is at least 15 days later than the last billable visit, the value resets.
with
data as (
select visitid, patientid, visitdtm, doctorid,
row_number() over(partition by patientid order by visitdtm) rn
from visits
),
cte as (
select d.*, visitdtm as billabledtm from data d where rn = 1
union all
select d.*,
case when d.visitdtm >= dateadd(day, 15, c.billabledtm)
then d.visitdtm
else c.billabledtm
end
from cte c
inner join data d
on d.patientid = c.patientid and d.rn = c.rn + 1
)
select * from cte where visitdtm = billabledtm order by patientid, rn
If a patient may have more than 100 visits, then you need to add option (maxrecursion 0) at the very end of the query.
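As a quick check of the recursive approach, here is the same logic run against the dates quoted in the question (01-01, 01-11, 01-21, 02-01, 02-11). The inline visits data, the year 2020, and the patient and visit ids are assumptions made for the sketch.

```sql
-- Hypothetical inline data standing in for the Visits table.
with visits (visitid, patientid, visitdtm) as (
    select v.visitid, v.patientid, cast(v.visitdtm as date)
    from (values
        (1, 100, '2020-01-01'),
        (2, 100, '2020-01-11'),
        (3, 100, '2020-01-21'),
        (4, 100, '2020-02-01'),
        (5, 100, '2020-02-11')
    ) v (visitid, patientid, visitdtm)
),
data as (
    -- number each patient's visits in chronological order
    select visitid, patientid, visitdtm,
           row_number() over (partition by patientid order by visitdtm) as rn
    from visits
),
cte as (
    -- anchor: the first visit is always billable
    select d.visitid, d.patientid, d.visitdtm, d.rn, d.visitdtm as billabledtm
    from data d
    where d.rn = 1
    union all
    -- step: carry the last billable date forward, or reset it
    -- when the next visit is at least 15 days after it
    select d.visitid, d.patientid, d.visitdtm, d.rn,
           case when d.visitdtm >= dateadd(day, 15, c.billabledtm)
                then d.visitdtm
                else c.billabledtm
           end
    from cte c
    inner join data d
        on d.patientid = c.patientid and d.rn = c.rn + 1
)
select visitid, patientid, visitdtm
from cte
where visitdtm = billabledtm
order by patientid, rn;
```

This should return visits 1, 3 and 5: the 15-day clock restarts at 01-21 rather than chaining all five visits into one group, which is the behaviour the question asks for.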

Here's another approach. Similar to GMB's this adds a row_number to the Visits table in a CTE but it also adds the lead date difference between VisitDtm's. Then it takes cumulative "sum over" of the date difference and divides by 15. When that quotient increases by a full integer, it represents a billable event in the data.
Something like this
;with lead_cte as (
select v.*, row_number() over (partition by PatientId order by VisitDtm) rn,
datediff(d, VisitDtm, lead(VisitDtm) over (partition by PatientId order by VisitDtm)) lead_dt_diff
from Visits v),
cum_sum_cte as (
select lc.*, sum(lead_dt_diff) over (partition by PatientId order by VisitDtm)/15 cum_dt_diff
from lead_cte lc),
min_billable_cte as (
select PatientId, cum_dt_diff, min(rn) min_rn
from cum_sum_cte
group by PatientId, cum_dt_diff)
select lc.*
from lead_cte lc
join min_billable_cte mbc on lc.PatientId=mbc.PatientId
and lc.rn=mbc.min_rn;

Redshift Alternative for Correlated Sub-Query

I am using Redshift and need an alternative for a correlated subquery. I am getting the correlated subquery not supported error. However, for this particular exercise of trying to identify all sales transactions made by the same customer within a given hour from the originating transaction, I am not sure a traditional left join would work either. I.e., the query is dependent on the context or current value from the parent select. I have also tried something similar using row_number() window function but again, need a way to window / partition on a date range - not just customer_id.
The overall goal is to find the first sales transaction for a given customer id, then find all subsequent transactions made within 60 minutes of the first transaction. This logic will continue on for the remainder of the transactions for the same customer (and ultimately all customers in the database). That is, once the initial 60 minute window has been established from the time of the first transaction, a second 60 minute window would begin at the end of the first 60 minute window, and all transactions within the second window would also be identified and combined and then repeat for the remainder of transactions.
The output would list the first transaction id that started the 60 minute window, then the other subsequent transaction ids that were made within the 60 minute window. The 2nd row would display the first transaction id made by the same customer in the next 60 minute window (again, the first transaction post the first 60 minute window would be the start of the second 60 minute window) and then the subsequent transactions also made within the second 60 minute window.
The query example in its most basic form looks like the query below:
select
s1.customer_id,
s1.transaction_id,
s1.order_time,
(
select
s2.transaction_id
from
sales s2
where
s2.order_time > s1.order_time and
s2.order_time <= dateadd(m,60,s1.order_time) and
s2.customer_id = s1.customer_id
order by
s2.order_time asc
limit 1
) as sales_transaction_id_1,
(
select
s3.transaction_id
from
sales s3
where
s3.order_time > s1.order_time and
s3.order_time <= dateadd(m,60,s1.order_time) and
s3.customer_id = s1.customer_id
order by
s3.order_time asc
limit 1 offset 1
) as sales_transaction_id_2,
(
select
s4.transaction_id
from
sales s4
where
s4.order_time > s1.order_time and
s4.order_time <= dateadd(m,60,s1.order_time) and
s4.customer_id = s1.customer_id
order by
s4.order_time asc
limit 1 offset 2
) as sales_transaction_id_3
from
(
select
sales.customer_id,
sales.transaction_id,
sales.order_time
from
sales
order by
sales.order_time desc
) s1;
For example, if a customer made the following transactions:
customer_id transaction_id order_time
1234 33453 2017-06-05 13:30
1234 88472 2017-06-05 13:45
1234 88477 2017-06-05 14:10
1234 99321 2017-06-07 8:30
1234 99345 2017-06-07 8:45
The expected output would be as:
customer_id transaction_id sales_transaction_id_1 sales_transaction_id_2 sales_transaction_id_3
1234 33453 88472 88477 NULL
1234 99321 99345 NULL NULL
Also, it appears Redshift does not support lateral joins which seems to further restrict the options at my disposal. Any help would be greatly appreciated.

You can use window functions to get the subsequent transactions for every transaction. The window will be customer / hour and you can rank records to get the first "anchor" transaction and get all subsequent transactions that you need:
with
transaction_chains as (
select
customer_id
,transaction_id
,order_time
-- rank transactions within the window to find the first "anchor" transaction
,row_number() over (partition by customer_id,date_trunc('hour',order_time) order by order_time)
-- 1st next order
,lead(transaction_id,1) over (partition by customer_id,date_trunc('hour',order_time) order by order_time) as transaction_id_1
,lead(order_time,1) over (partition by customer_id,date_trunc('hour',order_time) order by order_time) as order_time_1
-- 2nd next order
,lead(transaction_id,2) over (partition by customer_id,date_trunc('hour',order_time) order by order_time) as transaction_id_2
,lead(order_time,2) over (partition by customer_id,date_trunc('hour',order_time) order by order_time) as order_time_2
-- 3rd next order
,lead(transaction_id,3) over (partition by customer_id,date_trunc('hour',order_time) order by order_time) as transaction_id_3
,lead(order_time,3) over (partition by customer_id,date_trunc('hour',order_time) order by order_time) as order_time_3
from sales
)
select
customer_id
,transaction_id
,transaction_id_1
,transaction_id_2
,transaction_id_3
from transaction_chains
where row_number=1;

From your description, you just want group by and some sort of date difference. I'm not sure how you want to combine the rows, but here is the basic idea:
select s.customer_id,
min(order_time) as first_order_in_hour,
max(order_time) as last_order_in_hour,
count(*) as num_orders
from (select s.*,
min(order_time) over (partition by customer_id) as min_ot
from sales s
) s
group by customer_id, floor(datediff(second, min_ot, order_time) / (60 * 60));
This formulation (or something similar because Postgres does not have datediff()) would also be much faster in Postgres.

T-SQL aggregate window functions over specific time interval

Here's a SQL 2012 table:
CREATE TABLE [dbo].[TBL_BID]
(
[ID] [varchar](max) NULL,
[VALUE] [smallint] NULL,
[DT_START] [date] NULL,
[DT_FIN] [date] NULL
)
I can easily get last event's value, time since last event (or any specific lags) by LAG window function, as well as total number of events (or over specific number of past events), total average per user, etc
SELECT
ID,
[VALUE],
[DT_START], [DT_FIN],
-- days since the end of last event
DATEDIFF(d, LAG([DT_FIN], 1) OVER (PARTITION BY ID ORDER BY [DT_FIN]),
[DT_START]) AS LAG1_DT,
-- value of the last event
LAG([VALUE], 1) OVER (PARTITION BY ID ORDER BY [DT_FIN]) AS LAG1_VALUE,
-- number of events per id
COUNT(ID) OVER (PARTITION BY ID) AS N,
-- average [value] per id
ROUND(AVG(CAST([VALUE] as float)) OVER (PARTITION BY ID), 1) AS VAL_AVG
FROM
TBL_BID
For the events that happened within a specified time interval (e.g. 10 days, 30 days, 180 days) before the start date of each event, I am trying to get:
count of events
average of [VALUE]
average time in days between the end of event and start of the next one
Something along the lines of:
COUNT(ID) OVER (PARTITION BY ID ORDER BY DT_FIN
RANGE BETWEEN DATEDIFF(d,-30,[DT_START]) AND [DT_START] )
UPDATE 4/19/2017:
Some statistics
About 20MM IDs, the time interval is 5 years,
mean number of events per ID is 3.0. It could be 100+ events per ID, but majority has only handful of events, the distribution is very right skewed
Events_per_ID Number_IDs
1 18676221
2 11254167
3 6992200
4 4487664
5 2933183
6 1957433
7 1330040
8 918873
9 644229
10 457858
........

The simplest approach is outer apply:
select . . .,
b30.cnt_30
from TBL_BID b outer apply
(select count(*) as cnt_30
from TBL_BID b2
where b2.id = b.id and
b2.dt_start >= dateadd(day, -30, b.dt_start) and
b2.dt_start <= b.dt_start
) b30;
This is not necessarily efficient, but you can readily extend it by adding more outer apply subqueries for the other intervals.

I need some more information, but the basic idea is to change the window frame type from RANGE to ROWS by generating the full range of dates for each ID:
For each ID, generate the relevant range of days (min(dt_start) - 180 to max(dt_start)).
Use the above row set as a base and LEFT JOIN TBL_BID on id and dt_fin (if (id, dt_fin) is not unique, aggregate first).
Use window functions with partition by id order by date rows between 180/30/10 preceding and current row.
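The steps above could be sketched roughly as follows for a 30-day window. The inline bids data is invented and stands in for TBL_BID, the recursive CTE is just one of several ways to generate the days (a calendar or numbers table would likely scale better), and the final join back to the events on dt_start is my reading of what the question wants.

```sql
-- Hypothetical sample data; TBL_BID reduced to the columns the sketch needs.
with bids (id, dt_start, dt_fin) as (
    select v.id, cast(v.dt_start as date), cast(v.dt_fin as date)
    from (values
        (1, '2017-01-01', '2017-01-05'),
        (1, '2017-01-20', '2017-01-22'),
        (1, '2017-02-10', '2017-02-12')
    ) v (id, dt_start, dt_fin)
),
-- Step 1: one row per id and calendar day,
-- from 30 days before the first start date up to the last start date.
spine as (
    select id, dateadd(day, -30, min(dt_start)) as d, max(dt_start) as d_to
    from bids
    group by id
    union all
    select id, dateadd(day, 1, d), d_to
    from spine
    where d < d_to
),
-- Step 2: number of events finishing on each day
-- (aggregated, since (id, dt_fin) may not be unique).
daily as (
    select s.id, s.d, count(b.id) as n_fin
    from spine s
    left join bids b on b.id = s.id and b.dt_fin = s.d
    group by s.id, s.d
),
-- Step 3: with exactly one row per day, a ROWS frame spans a fixed number of days.
windowed as (
    select id, d,
           sum(n_fin) over (partition by id order by d
                            rows between 30 preceding and current row) as cnt_30
    from daily
)
-- Pick the window value on each event's start date.
select b.id, b.dt_start, w.cnt_30
from bids b
join windowed w on w.id = b.id and w.d = b.dt_start
order by b.id, b.dt_start
option (maxrecursion 0);  -- real date ranges will exceed the default limit of 100
```

The trade-off is the size of the intermediate row set: one row per ID per day over several years can be very large, so this is worth benchmarking against the outer apply version on real data.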
