Select rows for last n days after event occurs - sql

I have the following table and data:
PatientID  PatientName  Diagnosed  ReportDate  ...
1          ...          0          ...
1          ...          0          ...
1          ...          0          ...
1          ...          1          ...
So there are multiple rows for each patient, as reports come in a few times a day.
Whenever the Diagnosed field changes to 1 for a patient, I'd like to get that patient's past 3 days of data. So when Diagnosed == 1, get the report time minus 3 days of data for each patient.
SELECT Patients.ReportDate
FROM Patients
WHERE Diagnosed = 1 and date > ReportDate - interval '3' day;
So getting the past 3 days of data can be done with ReportDate minus an interval, but how do I specify that per patient (since there are multiple rows per patient ID) based on the Diagnosed field?
I usually do this filtering after getting the CSVs into Python, but the data set is too large, so I'd like to filter before I convert them to dataframes.

You can look at this another way, which is whether diagnosed = 1 in the next three days -- and take all rows where that is true:
select p.*
from (select p.*,
             count(*) filter (where diagnosed = 1) over (
                 partition by patientId
                 order by reportDate
                 range between interval '0 day' following and interval '3 day' following
             ) as cnt_diagnosed_3
      from patients p
     ) p
where cnt_diagnosed_3 > 0
order by patientId, reportDate;
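Side note: the count(*) FILTER (WHERE ...) used above is Postgres shorthand (9.4+) for a conditional aggregate. A minimal sketch of the equivalence, assuming the same patients table:
SELECT count(*) FILTER (WHERE diagnosed = 1)     AS cnt_filter  -- Postgres 9.4+
     , count(CASE WHEN diagnosed = 1 THEN 1 END) AS cnt_case    -- portable equivalent
FROM patients;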

Whenever the diagnosed field is changed to 1, for that patient, I'd like to get the past 3 days of data.
SELECT (p).*
FROM  (
   SELECT p
        , diagnosed
        , bool_or(diagnosed = 1) OVER (w RANGE BETWEEN CURRENT ROW AND '3 days' FOLLOWING) AS in_range
        , lag(diagnosed) OVER w AS last_diagnosed
   FROM   patients p
   WINDOW w AS (PARTITION BY patientid ORDER BY reportdate)
   ) sub
WHERE  diagnosed = 0 AND in_range
   OR  diagnosed = 1 AND last_diagnosed = 0
ORDER  BY patientid, reportdate;
db<>fiddle here
Returns the "past 3 days of data" where the "field is changed to 1" (previous row had "0").
The WINDOW clause is just syntactic sugar to avoid spelling out the same window definition repeatedly. (No additional benefit for performance.)
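For illustration, a minimal sketch of the same window spelled out per function versus named once (same patients table assumed):
-- repeated inline window definitions
SELECT lag(diagnosed)         OVER (PARTITION BY patientid ORDER BY reportdate) AS last_diagnosed
     , bool_or(diagnosed = 1) OVER (PARTITION BY patientid ORDER BY reportdate) AS any_diagnosed
FROM patients;
-- same thing with a named window
SELECT lag(diagnosed)         OVER w AS last_diagnosed
     , bool_or(diagnosed = 1) OVER w AS any_diagnosed
FROM patients
WINDOW w AS (PARTITION BY patientid ORDER BY reportdate);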
SELECT p in the innermost subquery is a neat way to get the whole row as a single composite value. The outer SELECT (p).* then expands it back into complete rows, without the auxiliary columns added in the subquery and without having to spell out all columns (or even know all of them).
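A minimal sketch of that composite-row technique in isolation (same table assumed):
SELECT p FROM patients p;                          -- one column holding the whole row (composite type)
SELECT (p).* FROM (SELECT p FROM patients p) sub;  -- the parenthesized reference expands it back into columns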
RANGE distance PRECEDING/FOLLOWING requires Postgres 11 or later.
Here is a slower alternative that also works for older versions:
SELECT p.*
FROM  (
   SELECT patientid, reportdate
   FROM  (
      SELECT patientid, reportdate, diagnosed
           , lag(diagnosed) OVER (PARTITION BY patientid ORDER BY reportdate) AS last_diagnosed
      FROM   patients
      ) p0
   WHERE  diagnosed = 1
   AND    last_diagnosed = 0
   ) d
JOIN   patients p USING (patientid)
WHERE  p.reportdate BETWEEN d.reportdate - interval '3 days' AND d.reportdate
ORDER  BY p.patientid, p.reportdate;
Subquery d selects rows where Diagnosed just switched to 1. Then self-join to patients to select your time frame.
For gaps-and-islands basics, see:
Select longest continuous sequence
You also added:
So when Diagnosed ==1, get report time -3 days of data for each patient.
That's a wider definition, and that's what Gordon's query does. Goes to show the importance of an exact definition of requirements.

Related

SQL - Get historic count of rows collected within a certain period by date

For many years I've been collecting data and I'm interested in knowing the historic counts of IDs that appeared in the last 30 days. The source looks like this
id   dates
1    2002-01-01
2    2002-01-01
3    2002-01-01
...  ...
3    2023-01-10
If I wanted to know the historic count of ids that appeared in the last 30 days I would do something like this
with total_counter as (
    select id, count(id) counts
    from source
    group by id
),
unique_obs as (
    select id
    from source
    where dates >= DATEADD(Day, -30, current_date)
    group by id
)
select count(distinct(id))
from unique_obs
left join total_counter
    on total_counter.id = unique_obs.id;
The problem is that this would return a single result: today's count, as determined by current_date.
I would like to see a table with such counts as if, for example, I had run this analysis yesterday, and the day before, and so on. So the expected result would be something like
counts  date
1235    2023-01-10
1234    2023-01-09
1265    2023-01-08
...     ...
7383    2022-12-11
So, for example, if the current_date had been 2023-01-10, my query would have returned 1235.
If you need a distinct count of Ids for the 30 days up to and including each date, the below should work:
WITH CTE_DATES AS
(
    --Create a list of anchor dates
    SELECT DISTINCT dates
    FROM source
)
SELECT COUNT(DISTINCT s.id) AS "counts"
     , D.dates AS "date"
FROM CTE_DATES D
LEFT JOIN source S ON S.dates BETWEEN DATEADD(DAY, -29, D.dates) AND D.dates --30 DAYS INCLUSIVE
GROUP BY D.dates
ORDER BY D.dates DESC;
If the distinct count didn't matter, you could likely simplify with a rolling sum, only hitting the source table once:
SELECT S.dates AS "date"
     , COUNT(1) AS "count_daily"
     , SUM("count_daily") OVER (ORDER BY S.dates DESC ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING) AS "count_rolling" --assumes there is at least one row for every day
FROM source S
GROUP BY S.dates
ORDER BY S.dates DESC;
This won't work, though, if you have gaps in your list of dates, as the frame would just cover the latest 30 dates available rather than a true 30-day span. In that case the first example (dropping the DISTINCT from the count) will do the trick.
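If the date list does have gaps, another workaround is to join the daily counts to a generated, gap-free date spine first. A sketch, assuming Snowflake syntax and a generously sized spine:
WITH spine AS (
    SELECT DATEADD(DAY, -(ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1), CURRENT_DATE) AS dates
    FROM TABLE(GENERATOR(ROWCOUNT => 10000))  -- roughly 27 years back; adjust as needed
), daily AS (
    SELECT dates, COUNT(*) AS count_daily
    FROM source
    GROUP BY dates
)
SELECT sp.dates AS "date"
     , SUM(COALESCE(d.count_daily, 0)) OVER (ORDER BY sp.dates DESC
                                             ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING) AS "count_rolling"
FROM spine sp
LEFT JOIN daily d ON d.dates = sp.dates
ORDER BY sp.dates DESC;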
SELECT count(*) AS Counts
     , dates AS Date
FROM source
WHERE dates >= DATEADD(DAY, -30, CURRENT_DATE)
GROUP BY dates
ORDER BY dates DESC

How to Average Number of Chats per Day on LEFT JOIN table in Snowflake SQL?

In Snowflake SQL, how do I average the number of video chats per day using a field from a table that I LEFT JOINed to the rest of the query?
I'm thinking I have to do a SUM function to total the number of video chats and then aggregate by # of video chats for each date and then divide by 30 days (the rolling date range I specified throughout my entire query).
Any help would be appreciated as deadlines are approaching. Thank you.
SELECT DISTINCT
       t1."pid",
       IFNULL(t2."VideoChats", 0),
       t3."SFUser",
       t3."TotalProviders",
       t4."dimaccount.practice_specialty",
       t5."Account: CMRR",
       t6."CreatedDate",
       t7."stg_sf_case.Date_Time_Resolved__c",
       t8."stg_sf_case.Closed_Date",
       t9."pid"
FROM (SELECT "pid"
      FROM "EDW_PROD"."PUBLIC"."STG_MYSQL_PROVIDERMODULES" AS a
      WHERE a."active"
        AND a."status" = 'PURCHASED'
        AND a."module_id" = '14'
      GROUP BY a."pid"
     ) t1
LEFT JOIN (SELECT "started_at",
                  "pid",
                  COUNT(*) AS "VideoChats"
           FROM "EDW_PROD"."PUBLIC"."STG_MYSQL_VIDEOCHATROOM" AS b
           LEFT JOIN "EDW_PROD"."PUBLIC"."DIMACCOUNT" AS dimaccount
                  ON b."pid" = dimaccount."PID"
           WHERE b."started_at" >= DATE_TRUNC('month', CURRENT_DATE())
             AND b."started_at" < DATEADD('month', 1, DATE_TRUNC('month', CURRENT_DATE()))
             AND dimaccount."CurrentRow" = 'Y'
           GROUP BY b."pid", b."started_at"
          ) t2 ON t1."pid" = t2."pid"
For a rolling average you probably want to use a window function. Something along these lines.
SELECT AVG(VideoChats) over (partition by pid order by started_at rows between 30 preceding and current row) as AvgVideoChats
--I saw a post about AVG not allowing a sliding window, so you may have to do this instead
SELECT SUM(VideoChats) over (partition by pid order by started_at rows between 30 preceding and current row) / 30. as AvgVideoChats
You may need to do this in a wrapper around your t2 query and adjust your date filters so that there are values available for averaging, but I'm not quite clear enough on what your query is doing with dates, or what results you are looking for, to be sure.
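A minimal sketch of such a wrapper around a stripped-down t2 (the DIMACCOUNT join and date filters are dropped for brevity; the daily grain via the ::date cast, the chat_date alias, and the 30-row frame are assumptions, not taken from the question):
SELECT t2."pid"
     , t2."chat_date"
     , AVG(t2."VideoChats") OVER (PARTITION BY t2."pid"
                                  ORDER BY t2."chat_date"
                                  ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AS "AvgVideoChats"
       -- or use SUM(...) / 30. as the fallback mentioned above
FROM (
    SELECT b."pid", b."started_at"::date AS "chat_date", COUNT(*) AS "VideoChats"
    FROM "EDW_PROD"."PUBLIC"."STG_MYSQL_VIDEOCHATROOM" AS b
    GROUP BY b."pid", b."started_at"::date
) t2;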

Running total with Over

I'm trying to create a running total of the number of files opened per day, so I can use the data for a graph showing cumulative results.
The data is basically the file opening date, a calculated field showing 'This month' or 'Last Month' depending on the date and the running total field that I'm trying to figure out.
Date Month Count
==== ===== =====
2019-08-01 Last Month 6
2019-08-02 Last Month 2
2019-08-03 Last Month 5
I want to have a running total...so 6, 8, 13 etc
But all I'm getting is a row count (1,2,3 etc) for my count field.
select
    FileDate,
    Month,
    sum(Count) OVER(PARTITION BY month order by Filedate) as 'Count'
from (
    select
        1 as 'Count',
        Case
            When month(cast(concat(right(d.var_val,4),substring(d.var_val,4,2),left(d.var_val,2)) as DATE)) = Month(getdate()) then 'This Month'
            else 'Last Month'
        end as 'Month'
    FROM data d
    left join otherdata m on d.VAR_FileID = m.MAT_FileID
    left join otherdata u on m.MAT_Fee_Earner = u.User_ID
    left join otherdata br on m.MAT_BranchID = br.BR_ID
    WHERE d.var_no IN ( '1628' )
      and Len(var_val) = 10
) files
where Month(FileDate) in (MONTH(FileDate()), MONTH(getDate())-1)
  and Year(Filedate) = Year(Getdate())
  and Dept = 'Peterborough Property'
group by Month, FileDate, count
GO
I'm assuming I've not quite grasped the proper usage of 'OVER' - any pointers would be great!
The PARTITION clause indicates when to reset the count, so by partitioning by month you are only counting records within each discrete month. To get a running total over the whole dataset, you don't want the partition clause at all, just the ORDER BY clause.
Hopefully you are clear on the OVER clause now (see Sentinel's answer). In that case, replace the desired column as follows, so that the count increases continuously across all rows from the sub-query based on the ORDER BY clause (see the documentation for more details on the OVER clause):
sum(Count) OVER (ORDER BY Filedate) as [Count]
-- or
sum(Count) OVER (ORDER BY Filedate desc) as [Count]

Count occurrences of combinations of columns

I have daily time series (actually business days) for different companies and I work with PostgreSQL. There is also an indicator variable (called flag) taking the value 0 most of the time, and 1 on some rare event days. If the indicator variable takes the value 1 for a company, I want to further investigate the entries from two days before to one day after that event for the corresponding company. Let me refer to that as [-2,1] window with the event day being day 0.
I am using the following query
CREATE TABLE test AS
WITH cte AS (
    SELECT *
         , MAX(flag) OVER (PARTITION BY company ORDER BY day
                           ROWS BETWEEN 1 preceding AND 2 following) Lead1
    FROM mytable)
SELECT *
FROM cte
WHERE Lead1 = 1
ORDER BY day, company
The query takes the entries ranging from 2 days before the event to one day after the event, for the company experiencing the event.
The query does that for all events.
This is a small section of the resulting table.
day company flag
2012-01-23 A 0
2012-01-24 A 0
2012-01-25 A 1
2012-01-25 B 0
2012-01-26 A 0
2012-01-26 B 0
2012-01-27 B 1
2012-01-30 B 0
2013-01-10 A 0
2013-01-11 A 0
2013-01-14 A 1
Now I want to do further calculations for every [-2,1] window separately. So I need a variable that allows me to identify each [-2,1] window. The idea is that I count the number of windows for every company with the variable "occur", so that in further calculations I can use the clause
GROUP BY company, occur
Therefore my desired output looks like that:
day company flag occur
2012-01-23 A 0 1
2012-01-24 A 0 1
2012-01-25 A 1 1
2012-01-25 B 0 1
2012-01-26 A 0 1
2012-01-26 B 0 1
2012-01-27 B 1 1
2012-01-30 B 0 1
2013-01-10 A 0 2
2013-01-11 A 0 2
2013-01-14 A 1 2
In the example, company B occurs only once (occur = 1), but company A occurs twice: the first time from 2012-01-23 to 2012-01-26, and the second time from 2013-01-10 to 2013-01-14. The second time range of company A does not consist of all four days surrounding the event day (-2,-1,0,1), since the company leaves the dataset before the end of that time range.
As I said, I am working with business days. I don't care about holidays; I have data from Monday to Friday. Earlier I wrote the following function:
CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
  RETURNS date AS
$BODY$
WITH alldates AS (
    SELECT i,
           $1 + (i * CASE WHEN $2 < 0 THEN -1 ELSE 1 END) AS date
    FROM generate_series(0, (ABS($2) + 5)*2) i
),
days AS (
    SELECT i, date, EXTRACT('dow' FROM date) AS dow
    FROM alldates
),
businessdays AS (
    SELECT i, date, d.dow
    FROM days d
    WHERE d.dow BETWEEN 1 AND 5
    ORDER BY i
)
-- adding business days to a date --
SELECT date
FROM businessdays
WHERE CASE WHEN $2 > 0 THEN date >= $1
           WHEN $2 < 0 THEN date <= $1
           ELSE date = $1 END
LIMIT 1
OFFSET ABS($2)
$BODY$
LANGUAGE 'sql' VOLATILE;
It can add/subtract business days to/from a given date and works like this:
select * from addbusinessdays('2013-01-14',-2)
delivers the result 2013-01-10. So in Jakub's approach we can change the second- and third-to-last lines to
w.day BETWEEN addbusinessdays(t1.day, -2) AND addbusinessdays(t1.day, 1)
and can deal with the business days.
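Putting the pieces together, a sketch of Jakub's query (quoted further down) with that condition swapped in, exactly as described above:
WITH windows AS (
   SELECT t1.day
        , t1.company
        , rank() OVER (PARTITION BY company ORDER BY day) as rank
   FROM   table1 t1
   WHERE  flag = 1)
SELECT t1.day, t1.company, t1.flag, w.rank
FROM   table1 AS t1
JOIN   windows AS w
  ON   t1.company = w.company
 AND   w.day BETWEEN addbusinessdays(t1.day, -2) AND addbusinessdays(t1.day, 1)
ORDER  BY t1.day, t1.company;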
Function
If you are going to use the function addbusinessdays() at all, consider this version instead:
CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
  RETURNS date AS
$func$
SELECT day
FROM  (
   SELECT i, $1 + i * sign($2)::int AS day
   FROM   generate_series(0, ((abs($2) * 7) / 5) + 3) i
   ) sub
WHERE  EXTRACT(ISODOW FROM day) < 6  -- truncate weekend
ORDER  BY i
OFFSET abs($2)
LIMIT  1
$func$ LANGUAGE sql IMMUTABLE;
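A quick sanity check that the rewrite behaves like the original (2013-01-14 is a Monday):
SELECT addbusinessdays(date '2013-01-14', -2);  -- 2013-01-10, same result as before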
Major points
Never quote the language name sql. It's an identifier, not a string.
Why was the function VOLATILE? Make it IMMUTABLE for better performance in repeated use and more options (like using it in a functional index).
(abs($2) + 5) * 2 is way too much padding. Replace with (abs($2) * 7) / 5 + 3.
Multiple levels of CTEs were useless cruft.
ORDER BY in last CTE was useless, too.
As mentioned in my previous answer, extract(ISODOW FROM ...) is more convenient to truncate weekends.
Query
That said, I wouldn't use the above function for this query at all. Build a complete grid of relevant days once instead of calculating the range of days for every single row.
Based on this assertion in a comment (should be in the question, really!):
two subsequent windows of the same firm can never overlap.
WITH range AS ( -- only with flag
   SELECT company
        , min(day) - 2 AS r_start
        , max(day) + 1 AS r_stop
   FROM   tbl t
   WHERE  flag <> 0
   GROUP  BY 1
   )
, grid AS (
   SELECT company, day::date
   FROM   range r
        , generate_series(r.r_start, r.r_stop, interval '1d') d(day)
   WHERE  extract('ISODOW' FROM d.day) < 6
   )
SELECT *, sum(flag) OVER (PARTITION BY company ORDER BY day
                          ROWS BETWEEN UNBOUNDED PRECEDING
                               AND 2 following) AS window_nr
FROM  (
   SELECT t.*, max(t.flag) OVER (PARTITION BY g.company ORDER BY g.day
                                 ROWS BETWEEN 1 preceding
                                      AND 2 following) in_window
   FROM   grid g
   LEFT   JOIN tbl t USING (company, day)
   ) sub
WHERE  in_window > 0    -- only rows in [-2,1] window
AND    day IS NOT NULL  -- exclude missing days in [-2,1] window
ORDER  BY company, day;
How?
Build a grid of all business days: CTE grid.
To keep the grid to its smallest possible size, extract minimum and maximum (plus buffer) day per company: CTE range.
LEFT JOIN actual rows to it. Now the frames for ensuing window functions work with static numbers.
To get distinct numbers per flag and company (window_nr), just count flags from the start of the grid (taking buffers into account).
Only keep days inside your [-2,1] windows (in_window > 0).
Only keep days with actual rows in the table.
Voilà.
SQL Fiddle.
Basically the strategy is to first enumerate the flag days and then join the other rows to them:
WITH windows AS (
   SELECT t1.day
        , t1.company
        , rank() OVER (PARTITION BY company ORDER BY day) as rank
   FROM   table1 t1
   WHERE  flag = 1)
SELECT t1.day
     , t1.company
     , t1.flag
     , w.rank
FROM   table1 AS t1
JOIN   windows AS w
  ON   t1.company = w.company
 AND   w.day BETWEEN t1.day - interval '2 day' AND t1.day + interval '1 day'
ORDER  BY t1.day, t1.company;
Fiddle.
However, there is a problem with work days, as their definition can vary (do holidays count?).

Count over rows in previous time range partitioned by a specific column

My dataset consists of daily (actually business days) timeseries for different companies from different industries and I work with PostgreSQL. I have an indicator variable in my dataset taking values 1, -1 and most of the times 0. For better readability of the question I refer to days where the indicator variable is unequal to zero as indicator event.
So for all indicator events that are preceded by another indicator event for the same industry in the previous three business days, the indicator variable shall be updated to zero.
We can think of the following example dataset:
day company industry indicator
2012-01-12 A financial 1
2012-01-12 B consumer 0
2012-01-13 A financial 1
2012-01-13 B consumer -1
2012-01-16 A financial 0
2012-01-16 B consumer 0
2012-01-17 A financial 0
2012-01-17 B consumer 0
2012-01-17 C consumer 0
2012-01-18 A financial 0
2012-01-18 B consumer 0
2012-01-18 C consumer 1
So the indicator values that shall be updated to zero are on 2012-01-13 the entry for company A, and on 2012-01-18 the entry for company C, because they are preceded by another indicator event in the same industry within 3 business days.
I tried to accomplish it in the following way:
UPDATE test SET indicator = 0
WHERE (day, industry) IN (
    SELECT day, industry
    FROM (
        SELECT industry, day,
               COUNT(CASE WHEN indicator <> 0 THEN 1 END)
                   OVER (PARTITION BY industry ORDER BY day
                         ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) As cnt
        FROM test
        ) alias
    WHERE cnt >= 2)
My idea was to count the indicator events for the current day and the 3 preceding days partitioned by industry. If it counts more than 1, it updates the indicator value to zero.
The weak spot is, that so far it counts over the three preceding rows (partitioned by industry) instead of the three preceding business days. So in the example data, it is not able to update company C on 2012-01-18, because it counts over the last three rows where industry = consumer instead of counting over all rows where industry=consumer for the last three business days.
I tried different methods, like adding another subquery in the third-to-last line of the code or adding a WHERE EXISTS clause after the third-to-last line, to ensure that the code counts over the three preceding dates. But nothing worked. I really don't know how to do that (I am just learning to work with PostgreSQL).
Do you have any ideas how to fix it?
Or maybe I am thinking in a completely wrong direction and you know another approach how to solve my problem?
DB design
First off, your table should be normalized. industry should be a small foreign key column (typically integer) referencing industry_id of an industry table. Maybe you have that already and only simplified for the sake of the question. Your actual table definition would go a long way.
Since rows with an indicator are rare but highly interesting, create a (possibly "covering") partial index to make any solution faster:
CREATE INDEX tbl_indicator_idx ON tbl (industry, day)
WHERE indicator <> 0;
Equality first, range last.
Assuming that indicator is defined NOT NULL. If industry was an integer, this index would be perfectly efficient.
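For illustration, a sketch of what that normalized layout might look like (all names here are assumptions, not the actual schema):
CREATE TABLE industry (
    industry_id serial PRIMARY KEY
  , industry    text NOT NULL UNIQUE
);
CREATE TABLE tbl (
    company     text    NOT NULL
  , day         date    NOT NULL
  , industry_id integer NOT NULL REFERENCES industry
  , indicator   integer NOT NULL DEFAULT 0
);
CREATE INDEX tbl_indicator_idx ON tbl (industry_id, day)
WHERE indicator <> 0;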
Query
This query identifies rows to be reset:
WITH x AS ( -- only with indicator
   SELECT DISTINCT industry, day
   FROM   tbl t
   WHERE  indicator <> 0
   )
SELECT industry, day
FROM  (
   SELECT i.industry, d.day, x.day IS NOT NULL AS incident
        , count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
                             ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
   FROM  (
      SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
      FROM  (
         SELECT generate_series(min(day), max(day), interval '1d')::date AS day
         FROM   x
         ) d
      WHERE  extract('ISODOW' FROM d.day) < 6
      ) d
   CROSS  JOIN (SELECT DISTINCT industry FROM x) i
   LEFT   JOIN x USING (industry, day)
   ) sub
WHERE  incident
AND    ct > 1
ORDER  BY 1, 2;
SQL Fiddle.
ISODOW as extract() parameter is convenient to truncate weekends.
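For reference, ISODOW numbers Monday as 1 through Sunday as 7, so the filter < 6 keeps Monday through Friday:
SELECT extract(ISODOW FROM date '2012-01-14') AS dow;  -- a Saturday, returns 6, so it is excluded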
Integrate this in your UPDATE:
WITH x AS ( -- only with indicator
   SELECT DISTINCT industry, day
   FROM   tbl t
   WHERE  indicator <> 0
   )
UPDATE tbl t
SET    indicator = 0
FROM  (
   SELECT i.industry, d.day, x.day IS NOT NULL AS incident
        , count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
                             ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
   FROM  (
      SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
      FROM  (
         SELECT generate_series(min(day), max(day), interval '1d')::date AS day
         FROM   x
         ) d
      WHERE  extract('isodow' FROM d.day) < 6
      ) d
   CROSS  JOIN (SELECT DISTINCT industry FROM x) i
   LEFT   JOIN x USING (industry, day)
   ) u
WHERE  u.incident
AND    u.ct > 1
AND    t.industry = u.industry
AND    t.day = u.day;
This should be substantially faster than your solution with correlated subqueries and a function call for every row. Even if that's based on my own previous answer, it's not perfect for this case.
In the meantime I found one possible solution myself (I hope that this isn't against the etiquette of the forum).
Please note that this is only one possible solution. You are very welcome to comment on it or to develop improvements if you want to.
For the first part, the function addbusinessdays, which can add (or subtract) business days to a given date, I am referring to:
http://osssmb.wordpress.com/2009/12/02/business-days-working-days-sql-for-postgres-2/
(I just slightly modified it because I don't care for holidays, just for weekends)
CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
  RETURNS date AS
$BODY$
with alldates as (
    SELECT i,
           $1 + (i * case when $2 < 0 then -1 else 1 end) AS date
    FROM generate_series(0, (abs($2) + 5)*2) i
),
days as (
    select i, date, extract('dow' from date) as dow
    from alldates
),
businessdays as (
    select i, date, d.dow
    from days d
    where d.dow between 1 and 5
    order by i
)
select date
from businessdays
where case when $2 > 0 then date >= $1
           when $2 < 0 then date <= $1
           else date = $1 end
limit 1
offset abs($2)
$BODY$
LANGUAGE 'sql' VOLATILE
COST 100;
ALTER FUNCTION addbusinessdays(date, integer) OWNER TO postgres;
For the second part, I am referring to this related question, where I am applying Erwin Brandstetter's correlated subquery approach: Window Functions or Common Table Expressions: count previous rows within range
UPDATE test SET indicator = 0
WHERE (day, industry) IN (
    SELECT day, industry
    FROM (
        SELECT industry, day,
               (SELECT COUNT(CASE WHEN indicator <> 0 THEN 1 END)
                FROM test t1
                WHERE t1.industry = t.industry
                  AND t1.day between addbusinessdays(t.day, -3) and t.day) As cnt
        FROM test t
        ) alias
    WHERE cnt >= 2)