T-SQL aggregate window functions over a specific time interval

Here's a SQL Server 2012 table:
CREATE TABLE [dbo].[TBL_BID]
(
[ID] [varchar](max) NULL,
[VALUE] [smallint] NULL,
[DT_START] [date] NULL,
[DT_FIN] [date] NULL
)
Using the LAG window function, I can easily get the last event's value and the time since the last event (or any specific lag), as well as the total number of events (or a count over a specific number of past events), the average per user, etc.:
SELECT
ID,
[VALUE],
[DT_START], [DT_FIN],
-- days since the end of last event
DATEDIFF(d, LAG([DT_FIN], 1) OVER (PARTITION BY ID ORDER BY [DT_FIN]),
[DT_START]) AS LAG1_DT,
-- value of the last event
LAG([VALUE], 1) OVER (PARTITION BY ID ORDER BY [DT_FIN]) AS LAG1_VALUE,
-- number of events per id
COUNT(ID) OVER (PARTITION BY ID) AS N,
-- average [value] per id
ROUND(AVG(CAST([VALUE] as float)) OVER (PARTITION BY ID), 1) AS VAL_AVG
FROM
TBL_BID
I am trying to get, for events that happened over a specified time interval (e.g. 10 days, 30 days, 180 days, etc.) before the start date of each event:
the count of events
the average of [VALUE]
the average time in days between the end of one event and the start of the next
Something along the lines of:
COUNT(ID) OVER (PARTITION BY ID ORDER BY DT_FIN
RANGE BETWEEN DATEADD(d, -30, [DT_START]) AND [DT_START] )
UPDATE 4/19/2017:
Some statistics
About 20MM IDs over a 5-year time interval; the mean number of events per ID is 3.0.
There can be 100+ events per ID, but the majority has only a handful of events; the distribution is very right-skewed:
Events_per_ID Number_IDs
1 18676221
2 11254167
3 6992200
4 4487664
5 2933183
6 1957433
7 1330040
8 918873
9 644229
10 457858
........

The simplest approach is outer apply:
select . . .,
       x.cnt_30
from TBL_BID b outer apply
     (select count(*) as cnt_30
      from TBL_BID b2
      where b2.id = b.id and
            b2.dt_start >= dateadd(day, -30, b.dt_start) and
            b2.dt_start <= b.dt_start
     ) x;
This is not necessarily very efficient, but you can readily extend it by adding more outer apply subqueries.
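SQL Server's OUTER APPLY has no direct equivalent in SQLite, but the same per-row count can be sketched with a correlated scalar subquery. Here is a runnable sketch using Python's sqlite3, with made-up sample rows (table and column names follow the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_bid (id TEXT, value INTEGER, dt_start TEXT, dt_fin TEXT);
INSERT INTO tbl_bid VALUES
  ('A', 10, '2017-01-01', '2017-01-05'),
  ('A', 20, '2017-01-20', '2017-01-25'),
  ('A', 30, '2017-03-01', '2017-03-02');
""")

# Correlated scalar subquery in place of OUTER APPLY: count this ID's
# events starting in the 30 days up to and including each event's start.
rows = conn.execute("""
SELECT b.id, b.dt_start,
       (SELECT COUNT(*)
        FROM tbl_bid b2
        WHERE b2.id = b.id
          AND b2.dt_start >= date(b.dt_start, '-30 days')
          AND b2.dt_start <= b.dt_start) AS cnt_30
FROM tbl_bid b
ORDER BY b.id, b.dt_start
""").fetchall()
print(rows)  # [('A', '2017-01-01', 1), ('A', '2017-01-20', 2), ('A', '2017-03-01', 1)]
```

ISO-8601 date strings compare correctly as text, which is why the plain `>=`/`<=` comparisons work here.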

Need some more information, but the basic idea is to transform the window frame from RANGE to ROWS by generating the full range of dates for each ID:
For each ID, generate the relevant range of days (min(dt_start) - 180 to max(dt_start)).
Use the above row set as a base and LEFT JOIN TBL_BID on (id, dt_fin); if (id, dt_fin) is not unique, aggregate first.
Use window functions with PARTITION BY id ORDER BY date ROWS BETWEEN 180/30/10 PRECEDING AND CURRENT ROW.
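The steps above can be sketched in SQLite (via Python's sqlite3): a recursive CTE generates one row per ID per calendar day, a correlated count attaches the events, and a ROWS BETWEEN 29 PRECEDING AND CURRENT ROW frame then covers exactly 30 calendar days. Sample data is invented; a real 20MM-ID table would need the indexing discussed elsewhere in this thread:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_bid (id TEXT, dt_start TEXT);
INSERT INTO tbl_bid VALUES
  ('A', '2017-01-01'), ('A', '2017-01-20'), ('A', '2017-03-01');
""")

rows = conn.execute("""
WITH RECURSIVE span AS (      -- one row per ID per calendar day
  SELECT id, MIN(dt_start) AS d, MAX(dt_start) AS last_d
  FROM tbl_bid GROUP BY id
  UNION ALL
  SELECT id, date(d, '+1 day'), last_d FROM span WHERE d < last_d
),
daily AS (                    -- attach per-day event counts
  SELECT s.id, s.d,
         (SELECT COUNT(*) FROM tbl_bid b
          WHERE b.id = s.id AND b.dt_start = s.d) AS n_events
  FROM span s
)
SELECT id, d, cnt_30
FROM (
  SELECT id, d, n_events,
         SUM(n_events) OVER (PARTITION BY id ORDER BY d
                             ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS cnt_30
  FROM daily
)
WHERE n_events > 0            -- keep only real event days in the output
ORDER BY id, d
""").fetchall()
print(rows)
```

Because every calendar day now has exactly one row per ID, a ROWS frame of 29 preceding rows is equivalent to a 30-day RANGE frame.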

Related

Select rows for last n days after event occurs

I have the following table and data:
PatientID  PatientName  Diagnosed  ReportDate  ...
1          ...          0          ...
1          ...          0          ...
1          ...          0          ...
1          ...          1          ...
So there are multiple rows for each patient, as the reports come few times a day.
Whenever the diagnosed field changes to 1 for a patient, I'd like to get the past 3 days of data. So when Diagnosed == 1, get ReportDate - 3 days of data for that patient.
SELECT Patients.ReportDate
FROM Patients
WHERE Diagnosed = 1 and date > ReportDate - interval '3' day;
So getting the past 3 days of data can be done with ReportDate - interval, but how do I specify that for every patient (since there are multiple rows per patient) based on the diagnosed field?
I usually do this filtering after getting csvs in python, but the data set is too large, so I'd like to filter before I convert them to dataframes.
You can look at this another way, which is whether diagnosed = 1 in the next three days -- and take all rows where that is true:
select p.*
from (select p.*,
count(*) filter (where diagnosed = 1) over (partition by patientId order by reportDate range between interval '0 day' following and interval '3 day' following) as cnt_diagnosed_3
from patients p
) p
where cnt_diagnosed_3 > 0
order by patientId, reportDate;
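The FILTER clause and interval-based RANGE frames above are Postgres syntax. The same "diagnosed within the next three days" logic can be sketched in SQLite, which needs a numeric ORDER BY key for RANGE offsets (hence julianday); the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (patientid INTEGER, reportdate TEXT, diagnosed INTEGER);
INSERT INTO patients VALUES
  (1, '2021-06-01', 0),
  (1, '2021-06-02', 0),
  (1, '2021-06-03', 0),
  (1, '2021-06-04', 1),
  (1, '2021-06-20', 0);
""")

# SUM(CASE ...) over a 3-day RANGE frame plays the role of
# count(*) FILTER (WHERE diagnosed = 1).
rows = conn.execute("""
SELECT patientid, reportdate
FROM (
  SELECT p.*,
         SUM(CASE WHEN diagnosed = 1 THEN 1 ELSE 0 END) OVER (
             PARTITION BY patientid
             ORDER BY julianday(reportdate)
             RANGE BETWEEN CURRENT ROW AND 3.0 FOLLOWING
         ) AS cnt_diagnosed_3
  FROM patients p
)
WHERE cnt_diagnosed_3 > 0
ORDER BY patientid, reportdate
""").fetchall()
print(rows)
```

Only the row from 2021-06-20 is excluded: no diagnosis falls within its following 3 days.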
Whenever the diagnosed field is changed to 1, for that patient, I'd like to get the past 3 days of data.
SELECT (p).*
FROM (
SELECT p
, diagnosed
, bool_or(diagnosed = 1) OVER (w RANGE BETWEEN CURRENT ROW AND '3 days' FOLLOWING) AS in_range
, lag(diagnosed) OVER w AS last_diagnosed
FROM patients p
WINDOW w AS (PARTITION BY patientid ORDER BY reportdate)
) sub
WHERE diagnosed = 0 AND in_range
OR diagnosed = 1 AND last_diagnosed = 0
ORDER BY patientid, reportdate;
Returns the "past 3 days of data" where the "field is changed to 1" (previous row had "0").
The WINDOW clause is just syntactic sugar to avoid spelling out the same window definition repeatedly. (No additional benefit for performance.)
SELECT p in the innermost subquery is a neat way to get the whole row. The outer SELECT (p).* returns complete rows without auxiliary columns added in the subquery. This way we get whole rows without spelling out all columns (or even needing to know all of them).
RANGE distance PRECEDING/FOLLOWING requires Postgres 11 or later.
Here is a slower alternative that also works for older versions:
SELECT p.*
FROM (
SELECT patientid, reportdate
FROM (
SELECT patientid, reportdate, diagnosed
, lag(diagnosed) OVER (PARTITION BY patientid ORDER BY reportdate) AS last_diagnosed
FROM patients
) p0
WHERE diagnosed = 1
AND last_diagnosed = 0
) d
JOIN patients p USING (patientid)
WHERE p.reportdate BETWEEN d.reportdate - interval '3 days' AND d.reportdate
ORDER BY p.patientid, p.reportdate;
Subquery d selects rows where diagnosed just switched to 1. Then self-join to patients to select your time frame.
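A minimal runnable sketch of this switch-then-self-join approach, using sqlite3 with invented sample data (date(..., '-3 days') standing in for Postgres interval arithmetic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (patientid INTEGER, reportdate TEXT, diagnosed INTEGER);
INSERT INTO patients VALUES
  (1, '2021-06-01', 0), (1, '2021-06-02', 0),
  (1, '2021-06-04', 1), (1, '2021-06-20', 0);
""")

rows = conn.execute("""
WITH d AS (   -- rows where diagnosed just switched from 0 to 1
  SELECT patientid, reportdate
  FROM (
    SELECT patientid, reportdate, diagnosed,
           LAG(diagnosed) OVER (PARTITION BY patientid
                                ORDER BY reportdate) AS last_diagnosed
    FROM patients
  )
  WHERE diagnosed = 1 AND last_diagnosed = 0
)
SELECT p.patientid, p.reportdate
FROM d JOIN patients p USING (patientid)
WHERE p.reportdate BETWEEN date(d.reportdate, '-3 days') AND d.reportdate
ORDER BY p.patientid, p.reportdate
""").fetchall()
print(rows)  # the 0->1 switch on 06-04 pulls in the prior 3 days of rows
```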
For gaps-and-islands basics, see:
Select longest continuous sequence
You also added:
So when Diagnosed ==1, get report time -3 days of data for each patient.
That's a wider definition, and that's what Gordon's query does. Goes to show the importance of an exact definition of requirements.

Window functions and calculating averages with tricky data manipulation

I have a SQL Server programming challenge involving some manipulations of healthcare patient pulse readings.
The goal is to do an average of readings within a certain time period and to only include the latest pulse reading of the day.
As an example, times are appt_time:
PATIENT 1                     PATIENT 2
1/1/2019        80            1/3/2019   90
1/2/2019 10 am  78            1/4/2019   85
1/2/2019 1 pm   85
1/3/2019        90
A patient may or may not have a second reading in a day. Only the latest 3 chronological readings are used for the average. If fewer than 3 readings are available, the average is computed over 2 readings, or the single reading is used as the average.
Can this be done with SQL window functions? That would be a little more efficient than using a subquery.
I have successfully used FIRST_VALUE with a descending sort to pick the last pulse in a day. I then tried various ROW_NUMBER approaches to exclude the marked-off row (the first pulse of the day when 2 readings are present), but I cannot seem to correctly calculate the average. I have used ROW_NUMBER in SELECT and FROM clauses.
with CTEBPI3
AS (
SELECT pat_id
,appt_time
,bp_pulse
,first_VALUE(bp_pulse) over (partition by pat_id, cast(appt_time as date) order by appt_time desc) fv
,ROW_NUMBER() OVER (PARTITION BY pat_id, CAST(appt_time AS date) ORDER BY appt_time DESC) RN1
,Round(Sum(bp_pulse) OVER (PARTITION BY Pat_id) / COUNT(appt_time) OVER (PARTITION BY Pat_id), 0) AS adJAVGSYS3
FROM
pat_enc
WHERE appt_time > '07/15/2018'
)
select *
from CTEBPI3
where RN1 = 1
Average for pat1 should be 85
Average for pat2 should be 87.5
You can do this with two window functions:
MAX(appt_time) OVER ... to get the latest time per day
DENSE_RANK() OVER ... to get the last three days
You get the date part from your datetime with CONVERT(DATE, appt_time). The average function AVG is already built in :-)
The complete query:
select pat_id, avg(bp_pulse) as average_pulse
from
(
select
pat_id, appt_time, bp_pulse,
max(appt_time) over (partition by pat_id, convert(date, appt_time)) as max_time,
dense_rank() over (partition by pat_id order by convert(date, appt_time) desc) as rn
from pat_enc
) evaluated
where appt_time = max_time -- last row per day
and rn <= 3 -- last three days
group by pat_id
order by pat_id;
If the column bp_pulse is defined as an integer, you must convert it to a decimal to avoid integer arithmetic:
select pat_id, avg(convert(decimal, bp_pulse)) as average_pulse
Demo: https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=3df744fcf2af89cdfd8b3cd8b6546d89
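A runnable sketch of the same approach using Python's sqlite3 (date() in place of CONVERT(DATE, ...)); the sample rows reproduce the question's example, so the averages come out to 85 and 87.5:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pat_enc (pat_id INTEGER, appt_time TEXT, bp_pulse REAL);
INSERT INTO pat_enc VALUES
  (1, '2019-01-01 09:00', 80),
  (1, '2019-01-02 10:00', 78),
  (1, '2019-01-02 13:00', 85),
  (1, '2019-01-03 09:00', 90),
  (2, '2019-01-03 09:00', 90),
  (2, '2019-01-04 09:00', 85);
""")

rows = conn.execute("""
SELECT pat_id, AVG(bp_pulse) AS average_pulse
FROM (
  SELECT pat_id, appt_time, bp_pulse,
         MAX(appt_time) OVER (PARTITION BY pat_id, date(appt_time)) AS max_time,
         DENSE_RANK() OVER (PARTITION BY pat_id
                            ORDER BY date(appt_time) DESC) AS rn
  FROM pat_enc
)
WHERE appt_time = max_time   -- last reading per day
  AND rn <= 3                -- last three days
GROUP BY pat_id
ORDER BY pat_id
""").fetchall()
print(rows)  # [(1, 85.0), (2, 87.5)]
```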
Actually, window functions are not necessarily more efficient. It is worth comparing:
select p.pat_id, avg(p.bp_pulse)
from pat_enc p
where -- appt_time > '2018-07-15' and -- don't know if this is necessary
      p.appt_time >= (select distinct convert(date, appt_time)
                      from pat_enc p2
                      where p2.pat_id = p.pat_id
                      order by convert(date, appt_time) desc
                      offset 2 rows fetch next 1 row only
                     ) and
      p.appt_time = (select max(p2.appt_time)
                     from pat_enc p2
                     where p2.pat_id = p.pat_id and
                           convert(date, p2.appt_time) = convert(date, p.appt_time)
                    )
group by p.pat_id;
This wants an index on pat_enc(pat_id, appt_time).
In fact, there are a variety of ways to write this logic, with different mixes of subqueries and window functions (this is one extreme).
Which performs the best will depend on the nature of your data. In particular:
The number of appointments on the same day -- is this normally 1 or a large number?
The overall number of days with appointments -- is this right around three or are there hundreds?
You need to test on your data, but I think window functions will work best when relatively few rows are filtered out (~1 appointment/day, ~3 days with appointments). Subqueries will be helpful when more rows are being filtered.

Need to count unique transactions by month but ignore records that occur 3 days after 1st entry for that ID

I have a table with just two columns: User_ID and fail_date. Each time somebody's card is rejected they are logged in the table, their card is automatically tried again 3 days later, and if they fail again, another entry is added to the table. I am trying to write a query that counts unique failures by month so I only want to count the first entry, not the 3 day retries, if they exist. My data set looks like this
user_id fail_date
222 01/01
222 01/04
555 02/15
777 03/31
777 04/02
222 10/11
so my desired output would be something like this:
month unique_fails
jan 1
feb 1
march 1
april 0
oct 1
I'll be running this in Vertica, but I'm not so much looking for perfect syntax in replies. Just help around how to approach this problem as I can't really think of a way to make it work. Thanks!
You could use lag() to get the previous timestamp per user. If the current and the previous timestamp are less than or exactly three days apart, it's a follow up. Mark the row as such. Then you can filter to exclude the follow ups.
It might look something like:
SELECT month,
count(*) unique_fails
FROM (SELECT month(fail_date) month,
CASE
WHEN datediff(day,
              lag(fail_date) OVER (PARTITION BY user_id
                                   ORDER BY fail_date),
              fail_date) <= 3 THEN
1
ELSE
0
END follow_up
FROM elbat) x
WHERE follow_up = 0
GROUP BY month;
I'm not so sure about the exact syntax in Vertica, so it might need some adaptations. I also don't know if fail_date is actually a date/time type or just a string. If it's just a string, the date/time functions may not work on it and would have to be replaced, or the string would have to be converted before passing it to them.
If the data spans several years you might also want to include the year additionally to the month to keep months from different years apart. In the inner SELECT add a column year(fail_date) year and add year to the list of columns and the GROUP BY of the outer SELECT.
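The lag-and-flag idea above can be sketched in SQLite, where julianday differences stand in for datediff(day, ...); the sample rows follow the question's data (the year 2018 is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fails (user_id INTEGER, fail_date TEXT);
INSERT INTO fails VALUES
  (222, '2018-01-01'), (222, '2018-01-04'), (555, '2018-02-15'),
  (777, '2018-03-31'), (777, '2018-04-02'), (222, '2018-10-11');
""")

# A retry within 3 days of the previous failure gets flag 0, otherwise 1
# (the CASE falls through to ELSE 1 when LAG() is NULL, i.e. a first failure).
rows = conn.execute("""
SELECT strftime('%m', fail_date) AS month, SUM(first_fail) AS unique_fails
FROM (
  SELECT fail_date,
         CASE WHEN julianday(fail_date)
                   - julianday(LAG(fail_date) OVER (PARTITION BY user_id
                                                    ORDER BY fail_date)) <= 3
              THEN 0 ELSE 1 END AS first_fail
  FROM fails
)
GROUP BY month
ORDER BY month
""").fetchall()
print(rows)  # April shows 0: its only entry is a 2-day retry of 03-31
```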
You can add a flag about whether this is a "unique_fail" by doing:
select t.*,
(case when lag(fail_date) over (partition by user_id order by fail_date) > fail_date - 3
then 0 else 1
end) as first_failure_flag
from t;
Then, you want to count this flag by month:
select to_char(fail_date, 'Mon'), -- should always include the year
sum(first_failure_flag)
from (select t.*,
(case when lag(fail_date) over (partition by user_id order by fail_date) > fail_date - 3
then 0 else 1
end) as first_failure_flag
from t
) t
group by to_char(fail_date, 'Mon')
order by min(fail_date)
In a derived table, determine the previous fail_date (prev_fail_date) for each user_id and fail_date, using a correlated subquery.
Using the derived table dt, count the failure if the difference in days between the current fail_date and prev_fail_date is greater than 3.
The DATEDIFF() function along with IF() is used to pick out the cases that are not repeated tries.
To Group By this result on Month, you can use MONTH function.
But then, the data can be from multiple years, so you need to separate them out yearwise as well, so you can do a multi-level group by, using YEAR function as well.
Try the following (in MySQL) - you can get idea for other RDBMS as well:
SELECT YEAR(dt.fail_date) AS year_fail_date,
MONTH(dt.fail_date) AS month_fail_date,
COUNT( IF(dt.prev_fail_date IS NULL OR DATEDIFF(dt.fail_date, dt.prev_fail_date) > 3, dt.user_id, NULL) ) AS unique_fails
FROM (
SELECT
t1.user_id,
t1.fail_date,
(
SELECT t2.fail_date
FROM your_table AS t2
WHERE t2.user_id = t1.user_id
AND t2.fail_date < t1.fail_date
ORDER BY t2.fail_date DESC
LIMIT 1
) AS prev_fail_date
FROM your_table AS t1
) AS dt
GROUP BY
year_fail_date,
month_fail_date
ORDER BY
year_fail_date ASC,
month_fail_date ASC
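A sketch of the same correlated-subquery approach in sqlite3, with prev_fail_date IS NULL handled explicitly so that first-ever failures are counted (the sample year 2018 is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fails (user_id INTEGER, fail_date TEXT);
INSERT INTO fails VALUES
  (222, '2018-01-01'), (222, '2018-01-04'), (555, '2018-02-15'),
  (777, '2018-03-31'), (777, '2018-04-02'), (222, '2018-10-11');
""")

rows = conn.execute("""
SELECT strftime('%Y', fail_date) AS year_fail_date,
       strftime('%m', fail_date) AS month_fail_date,
       COUNT(CASE WHEN prev_fail_date IS NULL
                    OR julianday(fail_date) - julianday(prev_fail_date) > 3
                  THEN 1 END) AS unique_fails
FROM (
  SELECT t1.user_id, t1.fail_date,
         (SELECT t2.fail_date          -- most recent earlier failure, same user
          FROM fails t2
          WHERE t2.user_id = t1.user_id
            AND t2.fail_date < t1.fail_date
          ORDER BY t2.fail_date DESC
          LIMIT 1) AS prev_fail_date
  FROM fails t1
)
GROUP BY year_fail_date, month_fail_date
ORDER BY year_fail_date, month_fail_date
""").fetchall()
print(rows)
```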

SQL Rolling Summary Statistics For Set Timeframe

I have a table that contains information about log-in events. Every time a user logs in, a record is added containing the user and the date. I want to calculate a new column in that table that holds the number of times that user has logged in in the past 31 days (including the current attempt). This is a simplified version of what my table looks like, including the column I want to add:
UserID Date LoginsInPast31Days
-------- ------------- --------------------
1 01-01-2012 1
2 02-01-2012 1
2 10-01-2012 2
1 25-01-2012 2
2 03-02-2012 2
2 22-03-2012 1
I know how to calculate a total amount of login attempts: I'd use COUNT(*) OVER (PARTITION BY UserId ORDER BY Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). However, I want to limit the timeframe to the last 31 days. My guess is that I have to change the UNBOUNDED PRECEDING, but how do I alter it in such a way that it select the right amount of rows?
One pretty efficient way is to add a cancelling record 31 days after each login date. It looks like this:
select userid, dte,
       sum(inc) over (partition by userid order by dte) as LoginsInPast31Days
from ((select distinct userid, logindate as dte, 1 as inc from logins) union all
      (select distinct userid, dateadd(day, 31, logindate) as dte, -1 as inc from logins)
     ) l;
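The +31-day cancelling-record trick can be checked with sqlite3: each login contributes +1 on its date and -1 thirty-one days later, so a running SUM(inc) at any login date equals the logins in the past 31 days. The sample data mirrors the question's table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (userid INTEGER, logindate TEXT);
INSERT INTO logins VALUES
  (1, '2012-01-01'), (1, '2012-01-25'),
  (2, '2012-01-02'), (2, '2012-01-10'), (2, '2012-02-03'), (2, '2012-03-22');
""")

rows = conn.execute("""
SELECT userid, dte, running
FROM (
  SELECT userid, dte, inc,
         SUM(inc) OVER (PARTITION BY userid ORDER BY dte, inc) AS running
  FROM (
    SELECT userid, logindate AS dte, 1 AS inc FROM logins
    UNION ALL
    SELECT userid, date(logindate, '+31 days'), -1 FROM logins
  )
)
WHERE inc = 1          -- show only the real login rows, not the sentinels
ORDER BY dte, userid
""").fetchall()
print(rows)
```

Ordering ties by inc puts a -1 sentinel before a +1 login on the same date, so a login exactly 31 days earlier is treated as outside the window.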
You're almost there; two adjustments:
First, make sure to group by user and date so you know how many rows to select.
Secondly, you'll need to use ROWS BETWEEN CURRENT ROW AND 31 FOLLOWING, since you cannot limit the number of preceding records to use. By using descending sort order, you'll get the required result.
Combine these tips and you'll get:
SELECT t.userid, CAST(t.login_ts AS DATE) AS login_date,
       SUM(COUNT(*)) OVER (
           PARTITION BY t.userid
           ORDER BY CAST(t.login_ts AS DATE) DESC
           ROWS BETWEEN CURRENT ROW AND 31 FOLLOWING
       ) AS LoginsInPast31Days
FROM table AS t
GROUP BY t.userid, CAST(t.login_ts AS DATE)

Count over rows in previous time range partitioned by a specific column

My dataset consists of daily (actually business days) timeseries for different companies from different industries and I work with PostgreSQL. I have an indicator variable in my dataset taking values 1, -1 and most of the times 0. For better readability of the question I refer to days where the indicator variable is unequal to zero as indicator event.
So for all indicator events that are preceded by another indicator event for the same industry in the previous three business days, the indicator variable shall be updated to zero.
We can think of the following example dataset:
day company industry indicator
2012-01-12 A financial 1
2012-01-12 B consumer 0
2012-01-13 A financial 1
2012-01-13 B consumer -1
2012-01-16 A financial 0
2012-01-16 B consumer 0
2012-01-17 A financial 0
2012-01-17 B consumer 0
2012-01-17 C consumer 0
2012-01-18 A financial 0
2012-01-18 B consumer 0
2012-01-18 C consumer 1
So the indicator values that shall be updated to zero are on 2012-01-13 the entry for company A, and on 2012-01-18 the entry for company C, because they are preceded by another indicator event in the same industry within 3 business days.
I tried to accomplish it in the following way:
UPDATE test SET indicator = 0
WHERE (day, industry) IN (
SELECT day, industry
FROM (
SELECT industry, day,
COUNT(CASE WHEN indicator <> 0 THEN 1 END)
OVER (PARTITION BY industry ORDER BY day
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) As cnt
FROM test
) alias
WHERE cnt >= 2)
My idea was to count the indicator events for the current day and the 3 preceding days partitioned by industry. If it counts more than 1, it updates the indicator value to zero.
The weak spot is, that so far it counts over the three preceding rows (partitioned by industry) instead of the three preceding business days. So in the example data, it is not able to update company C on 2012-01-18, because it counts over the last three rows where industry = consumer instead of counting over all rows where industry=consumer for the last three business days.
I tried different methods, like adding another subquery in the third-to-last line of the code or adding a WHERE EXISTS clause after it, to ensure that the code counts over the three preceding dates. But nothing worked. I really don't know how to do that (I'm just learning to work with PostgreSQL).
Do you have any ideas how to fix it?
Or maybe I am thinking in a completely wrong direction and you know another approach how to solve my problem?
DB design
First off, your table should be normalized. industry should be a small foreign key column (typically integer) referencing industry_id of an industry table. Maybe you have that already and only simplified for the sake of the question. Your actual table definition would go a long way.
Since rows with an indicator are rare but highly interesting, create a (possibly "covering") partial index to make any solution faster:
CREATE INDEX tbl_indicator_idx ON tbl (industry, day)
WHERE indicator <> 0;
Equality first, range last.
Assuming that indicator is defined NOT NULL. If industry were an integer, this index would be perfectly efficient.
Query
This query identifies rows to be reset:
WITH x AS ( -- only with indicator
SELECT DISTINCT industry, day
FROM tbl t
WHERE indicator <> 0
)
SELECT industry, day
FROM (
SELECT i.industry, d.day, x.day IS NOT NULL AS incident
, count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
FROM (
SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
FROM (
SELECT generate_series(min(day), max(day), interval '1d')::date AS day
FROM x
) d
WHERE extract('ISODOW' FROM d.day) < 6
) d
CROSS JOIN (SELECT DISTINCT industry FROM x) i
LEFT JOIN x USING (industry, day)
) sub
WHERE incident
AND ct > 1
ORDER BY 1, 2;
ISODOW as extract() parameter is convenient to truncate weekends.
Integrate this in your UPDATE:
WITH x AS ( -- only with indicator
SELECT DISTINCT industry, day
FROM tbl t
WHERE indicator <> 0
)
UPDATE tbl t
SET indicator = 0
FROM (
SELECT i.industry, d.day, x.day IS NOT NULL AS incident
, count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
FROM (
SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
FROM (
SELECT generate_series(min(day), max(day), interval '1d')::date AS day
FROM x
) d
WHERE extract('isodow' FROM d.day) < 6
) d
CROSS JOIN (SELECT DISTINCT industry FROM x) i
LEFT JOIN x USING (industry, day)
) u
WHERE u.incident
AND u.ct > 1
AND t.industry = u.industry
AND t.day = u.day;
This should be substantially faster than your solution with correlated subqueries and a function call for every row. Even though it builds on my own previous answer, that one is not a perfect fit for this case.
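The core idea (densify to business days, number them consecutively, then look back a fixed number of business-day rows) can be sketched in sqlite3 against the question's example data; here an EXISTS over the business-day numbers stands in for the window frame:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (day TEXT, company TEXT, industry TEXT, indicator INTEGER);
INSERT INTO test VALUES
  ('2012-01-12','A','financial',1), ('2012-01-12','B','consumer',0),
  ('2012-01-13','A','financial',1), ('2012-01-13','B','consumer',-1),
  ('2012-01-16','A','financial',0), ('2012-01-16','B','consumer',0),
  ('2012-01-17','A','financial',0), ('2012-01-17','B','consumer',0),
  ('2012-01-17','C','consumer',0),  ('2012-01-18','A','financial',0),
  ('2012-01-18','B','consumer',0),  ('2012-01-18','C','consumer',1);
""")

rows = conn.execute("""
WITH RECURSIVE cal(day) AS (
  SELECT MIN(day) FROM test
  UNION ALL
  SELECT date(day, '+1 day') FROM cal
  WHERE day < (SELECT MAX(day) FROM test)
),
bdays AS (          -- business days only, numbered consecutively
  SELECT day, ROW_NUMBER() OVER (ORDER BY day) AS day_nr
  FROM cal
  WHERE strftime('%w', day) NOT IN ('0', '6')
),
ev AS (             -- indicator events mapped onto business-day numbers
  SELECT t.day, t.company, t.industry, b.day_nr
  FROM test t JOIN bdays b USING (day)
  WHERE t.indicator <> 0
)
SELECT e.day, e.company
FROM ev e
WHERE EXISTS (      -- another event, same industry, previous 3 business days
  SELECT 1 FROM ev p
  WHERE p.industry = e.industry
    AND p.day_nr BETWEEN e.day_nr - 3 AND e.day_nr - 1
)
ORDER BY e.day, e.company
""").fetchall()
print(rows)  # [('2012-01-13', 'A'), ('2012-01-18', 'C')]
```

These are exactly the two rows the question says should be reset to zero.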
In the meantime I found one possible solution myself (I hope this isn't against the etiquette of the forum).
Please note that this is only one possible solution. You are very welcome to comment on it or to suggest improvements.
For the first part, the function addbusinessdays, which can add (or subtract) business days to a given date, I am referring to:
http://osssmb.wordpress.com/2009/12/02/business-days-working-days-sql-for-postgres-2/
(I just slightly modified it because I don't care about holidays, only weekends.)
CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
RETURNS date AS
$BODY$
with alldates as (
SELECT i,
$1 + (i * case when $2 < 0 then -1 else 1 end) AS date
FROM generate_series(0,(abs($2) + 5)*2) i
),
days as (
select i, date, extract('dow' from date) as dow
from alldates
),
businessdays as (
select i, date, d.dow from days d
where d.dow between 1 and 5
order by i
)
select date from businessdays where
case when $2 > 0 then date >=$1 when $2 < 0 then date <=$1 else date =$1 end
limit 1
offset abs($2)
$BODY$
LANGUAGE 'sql' VOLATILE
COST 100;
ALTER FUNCTION addbusinessdays(date, integer) OWNER TO postgres;
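For intuition, here is a plain-Python sketch of what addbusinessdays computes for nonzero offsets (weekends only, like the modified function; the n = 0 case is not reproduced):

```python
from datetime import date, timedelta

def add_business_days(d: date, n: int) -> date:
    """Step |n| weekdays forward (n > 0) or backward (n < 0), skipping Sat/Sun."""
    step = timedelta(days=1 if n >= 0 else -1)
    remaining = abs(n)
    while remaining > 0:
        d += step
        if d.weekday() < 5:          # Mon=0 .. Fri=4 are business days
            remaining -= 1
    return d

print(add_business_days(date(2012, 1, 13), -3))  # 2012-01-10 (three weekdays back)
print(add_business_days(date(2012, 1, 16), -1))  # 2012-01-13 (skips the weekend)
```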
For the second part, I am referring to this related question, where I am applying Erwin Brandstetter's correlated subquery approach: Window Functions or Common Table Expressions: count previous rows within range
UPDATE test SET indicator = 0
WHERE (day, industry) IN (
SELECT day, industry
FROM (
SELECT industry, day,
(SELECT COUNT(CASE WHEN indicator <> 0 THEN 1 END)
FROM test t1
WHERE t1.industry = t.industry
AND t1.day between addbusinessdays(t.day,-3) and t.day) As cnt
FROM test t
) alias
WHERE cnt >= 2)