Get moving average over time frame in PostgreSQL with inconsistent data - sql

I have a table called answers with columns created_at and response, response being an integer 0 (for 'no'), 1 (for 'yes'), or 2 (for 'don't know'). I want to get a moving average for the response values, filtering out 2s for each day, only taking in to account the previous 30 days. I know you can do ROWS BETWEEN 29 AND PRECEDING AND CURRENT ROW but that only works if you have data for each day, and in my case there might be no data for a week or more.
My current query is this:
SELECT answers.created_at, answers.response,
AVG(answers.response)
OVER(ORDER BY answers.created_at::date ROWS
BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_average
FROM answers
WHERE answers.user_id = 'insert_user_id''
AND (answers.response = 0 OR answers.response = 1)
GROUP BY answers.created_at, answers.response
ORDER BY answers.created_at::date
But this will return an average based on the previous rows, if a user responded with a 1 on 2018-3-30 and a 0 on 2018-5-15, the rolling average on 2018-5-15 would be 0.5 instead of 0 as I want. How can I create a query that will only take in to account the responses that were created within the last 30 days for the rolling average?

Since Postgres 11 you can do this:
SELECT created_at,
response,
AVG(response) OVER (ORDER BY created_at
RANGE BETWEEN '29 day' PRECEDING AND current row) AS rolling_average
FROM answers
WHERE user_id = 1
AND response in (0,1)
ORDER BY created_at;

Try something like this:
SELECT * FROM (
SELECT
d.created_at, d.response,
Avg(d.response) OVER(ORDER BY d.created_at::date rows BETWEEN 29 PRECEDING AND CURRENT row) AS rolling_average
FROM (
SELECT
COALESCE(a.created_at, d.dates) AS created_at, response, a.user_id
FROM
(SELECT generate_series('2018-01-01'::date, '2018-05-31'::date, '1day'::interval)::date AS dates) d
LEFT JOIN
(SELECT * FROM answers WHERE answers.user_id = 'insert_user_id' AND ( answers.response = 0 OR answers.response = 1)) a
ON d.dates = a.created_at::date
) d
GROUP BY d.created_at, d.response
) agg WHERE agg.response IS NOT NULL
ORDER BY agg.created_at::date
generate_series creates list of days - you have to set reasonable boundaries
this list of days is LEFT JOINed with preselected answers
this result is used for rolling average calculation
after it I select only records with response and I get:
created_at | response | rolling_averagte
2018-03-30 | 1 | 1.00000000000000000000
2018-05-15 | 0 | 0.00000000000000000000

Related

How do I use SQL to perform a cumulative sum where the increments have an expiration?

Say the scenario is this:
I have a database of student infractions. When a student is late to class, or misses a homework assignment they get an infraction.
student_id
infraction_type
day
1
tardy
0
2
missed_assignment
0
1
tardy
29
2
missed_assignment
15
1
tardy
99
2
missed_assignment
29
The school has three strike system, at each infraction disciplinary action is taken. Call them D0,D1,D2.
Infractions expire after 30 days.
I want to be able to perform a query to calculate the total counts of disciplinary actions taken in a given time period.
So the number of disciplinary actions taken in the last 100 days (at day 99) would be
disciplinary_action
count
D0
3
D1
2
D2
1
A table generated showing the disciplinary actions taken would look like:
student_id
infraction_type
day
disciplinary_action_gen
1
tardy
0
D0
2
missed_assignment
0
D0
1
tardy
29
D1
2
missed_assignment
15
D1
1
tardy
99
D0
2
missed_assignment
29
D2
What SQL query could I use to do such a cumulative sum?
You can solve your problem by checking in the following order:
if <30 days have passed from the last two infractions, assign D2
if <30 days have passed from last infraction, assign D1
assign D0 (given its the first infraction)
This will work assuming your DBMS supports the tools used for this solution, namely:
the CASE expression, to conditionally assign infraction values
the LAG window function, to retrieve the previous "day" values
SELECT *,
CASE WHEN day - LAG(day,2) OVER(PARTITION BY student_id
ORDER BY day ) < 30 THEN 'D2'
WHEN day - LAG(day,1) OVER(PARTITION BY student_id
ORDER BY day ) < 30 THEN 'D1'
ELSE 'D0'
END AS disciplinary_action_gen
FROM tab
Check a MySQL demo here.
A similar approach using COUNT() as a window function and a frame definition -
SELECT
*,
CONCAT(
'D',
LEAST(
3,
COUNT(*) OVER (
PARTITION BY student_id
ORDER BY day ASC
RANGE BETWEEN 30 PRECEDING AND CURRENT ROW
)
) - 1
) AS disciplinary_action_gen
FROM infractions;
The frame definition (RANGE BETWEEN 30 PRECEDING AND CURRENT ROW) tells the server that we want to include all rows with a day value between (current row's value of day - 30) and (the current row's value of day). So, if the current row has a day value of 99, the count will be for all rows in the partition with a day value between 69 and 99.
To get the disciplinary counts, we can simply wrap this in a normal GROUP BY -
SELECT disciplinary_action, COUNT(*) AS count
FROM (
SELECT
CONCAT(
'D',
LEAST(
3,
COUNT(*) OVER (
PARTITION BY student_id
ORDER BY day ASC
RANGE BETWEEN 30 PRECEDING AND CURRENT ROW
)
) - 1
) AS disciplinary_action
FROM infractions
) t
GROUP BY disciplinary_action;
If your infractions are stored with a date, as opposed to the days in your example, this can be easily updated to use a date interval in the frame definition. And, if looking at counts of disciplinary actions in the last 100 days we need to include the previous 30 days, as these could impact the action (D0, D1 or D2) on the first day we are interested in.
SELECT disciplinary_action, COUNT(*) AS count
FROM (
SELECT
`date`,
CONCAT(
'D',
LEAST(
3,
COUNT(*) OVER (
PARTITION BY student_id
ORDER BY `date` ASC
RANGE BETWEEN INTERVAL 30 DAY PRECEDING AND CURRENT ROW
)
) - 1
) AS disciplinary_action
FROM infractions
WHERE `date` >= CURRENT_DATE - INTERVAL 130 DAY
) t
WHERE `date` >= CURRENT_DATE - INTERVAL 100 DAY
GROUP BY disciplinary_action;
Here's a db<>fiddle

Extract previous row calculated value for use in current row calculations - Postgres

Have a requirement where I would need to rope the calculated value of the previous row for calculation in the current row.
The following is a sample of how the data currently looks :-
ID
Date
Days
1
2022-01-15
30
2
2022-02-18
30
3
2022-03-15
90
4
2022-05-15
30
The following is the output What I am expecting :-
ID
Date
Days
CalVal
1
2022-01-15
30
2022-02-14
2
2022-02-18
30
2022-03-16
3
2022-03-15
90
2022-06-14
4
2022-05-15
30
2022-07-14
The value of CalVal for the first row is Date + Days
From the second row onwards it should take the CalVal value of the previous row and add it with the current row Days
Essentially, what I am looking for is means to access the previous rows calculated value for use in the current row.
Is there anyway we can achieve the above via Postgres SQL? I have been tinkering with window functions and even recursive CTEs but have had no luck :(
Would appreciate any direction!
Thanks in advance!
select
id,
date,
coalesce(
days - (lag(days, 1) over (order by date, days))
, days) as days,
first_date + cast(days as integer) as newdate
from
(
select
-- get a running sum of days
id,
first_date,
date,
sum(days) over (order by date, days) as days
from
(
select
-- get the first date
id,
(select min(date) from table1) as first_date,
date,
days
from
table1
) A
) B
This query get the exact output you described. I'm not at all ready to say it is the best solution but the strategy employed is to essential create a running total of the "days" ... this means that we can just add this running total to the first date and that will always be the next date in the desired sequence. One finesse: to put the "days" back into the result, we calculated the current running total less the previous running total to arrive at the original amount.
assuming that table name is table1
select
id,
date,
days,
first_value(date) over (order by id) +
(sum(days) over (order by id rows between unbounded preceding and current row))
*interval '1 day' calval
from table1;
We just add cumulative sum of days to first date in table. It's not really what you want to do (we don't need date from previous row, just cumulative days sum)
Solution with recursion
with recursive prev_row as (
select id, date, days, date+ days*interval '1 day' calval
from table1
where id = 1
union all
select t.id, t.date, t.days, p.calval + t.days*interval '1 day' calval
from prev_row p
join table1 t on t.id = p.id+ 1
)
select *
from prev_row

Google Big Query SQL - Get most recent unique value by date

#EDIT - Following the comments, I rephrase my question
I have a BigQuery table that i want to use to get some KPI of my application.
In this table, I save each create or update as a new line in order to keep a better history.
So I have several times the same data with a different state.
Example of the table :
uuid |status |date
––––––|–––––––––––|––––––––––
3 |'inactive' |2018-05-12
1 |'active' |2018-05-10
1 |'inactive' |2018-05-08
2 |'active' |2018-05-08
3 |'active' |2018-05-04
2 |'inactive' |2018-04-22
3 |'inactive' |2018-04-18
We can see that we have multiple value of each data.
What I would like to get:
I would like to have the number of current 'active' entry (So there must be no 'inactive' entry with the same uuid after). And to complicate everything, I need this total per day.
So for each day, the amount of 'active' entries, including those from previous days.
So with this example I should have this result :
date | actives
____________|_________
2018-05-02 | 0
2018-05-03 | 0
2018-05-04 | 1
2018-05-05 | 1
2018-05-06 | 1
2018-05-07 | 1
2018-05-08 | 2
2018-05-09 | 2
2018-05-10 | 3
2018-05-11 | 3
2018-05-12 | 2
Actually i've managed to get the good amount of actives for one day. But my problem is when i want the results for each days.
What I've tried:
I'm stuck with two solutions that each return a different error.
First solution :
WITH
dates AS(
SELECT GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)
arr_dates )
SELECT
i_date date,
(
SELECT COUNT(uuid)
FROM (
SELECT
uuid, status, date,
RANK() OVER(PARTITION BY uuid ORDER BY date DESC) rank
FROM users
WHERE
PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d",date)) <= i_date
)
WHERE
status = 'active'
and rank = 1
## rank is the condition which causes the error
) users
FROM
dates, UNNEST(arr_dates) i_date
ORDER BY i_date;
The SELECT with the RANK() OVER correctly returns the users with a rank column that allow me to know which entry is the last for each uuid.
But when I try this, I got a :
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN. because of the rank = 1 condition.
Second solution :
WITH
dates AS(
SELECT GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)
arr_dates )
SELECT
i_date date,
(
SELECT
COUNT(t1.uuid)
FROM
users t1
WHERE
t1.date = (
SELECT MAX(t2.date)
FROM users t2
WHERE
t2.uuid = t1.uuid
## Here that's the i_date condition which causes problem
AND PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d", t2.date)) <= i_date
)
AND status='active' ) users
FROM
dates,
UNNEST(arr_dates) i_date
ORDER BY i_date;
Here, the second select is working too and correctly returning the number of active user for a current day.
But the problem is when i try to use i_date to retrieve datas among the multiple days.
And Here i got a LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join. error...
Which solution is more able to succeed ? What should i change ?
And, if my way of storing the data isn't good, how should i proceed in order to keep a precise history ?
Below is for BigQuery Standard SQL
#standardSQL
SELECT date, COUNT(DISTINCT uuid) total_active
FROM `project.dataset.table`
WHERE status = 'active'
GROUP BY date
-- ORDER BY date
Update to address your "rephrased" question :o)
Below example is using dummy data from your question
#standardSQL
WITH `project.dataset.users` AS (
SELECT 3 uuid, 'inactive' status, DATE '2018-05-12' date UNION ALL
SELECT 1, 'active', '2018-05-10' UNION ALL
SELECT 1, 'inactive', '2018-05-08' UNION ALL
SELECT 2, 'active', '2018-05-08' UNION ALL
SELECT 3, 'active', '2018-05-04' UNION ALL
SELECT 2, 'inactive', '2018-04-22' UNION ALL
SELECT 3, 'inactive', '2018-04-18'
), dates AS (
SELECT day FROM UNNEST((
SELECT GENERATE_DATE_ARRAY(MIN(date), MAX(date))
FROM `project.dataset.users`
)) day
), active_users AS (
SELECT uuid, status, date first, DATE_SUB(next_status.date, INTERVAL 1 DAY) last FROM (
SELECT uuid, date, status, LEAD(STRUCT(status, date)) OVER(PARTITION BY uuid ORDER BY date ) next_status
FROM `project.dataset.users` u
)
WHERE status = 'active'
)
SELECT day, COUNT(DISTINCT uuid) actives
FROM dates d JOIN active_users u
ON day BETWEEN first AND IFNULL(last, day)
GROUP BY day
-- ORDER BY day
with result
Row day actives
1 2018-05-04 1
2 2018-05-05 1
3 2018-05-06 1
4 2018-05-07 1
5 2018-05-08 2
6 2018-05-09 2
7 2018-05-10 3
8 2018-05-11 3
9 2018-05-12 2
I think this -- or something similar -- will do what you want:
SELECT day,
coalesce(running_actives, 0) - coalesce(running_inactives, 0)
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2015-05-11'), DATE('2018-06-29'), INTERVAL 1 DAY)
) AS day left join
(select date, sum(countif(status = 'active')) over (order by date) as running_actives,
sum(countif(status = 'active')) over (order by date) as running_inactives
from t
group by date
) a
on a.date = day
order by day;
The exact solution depends on whether the "inactive" is inclusive of the day (as above) or takes effect the next day. Either is handled the same way, by using cumulative sums of actives and inactives and then taking the difference.
In order to get data on all days, this generates the days using arrays and unnest(). If you have data on all days, that step may be unnecessary

Cumulative Additions in SQL Server 2008

I am trying to add the previous row value to the current row and keep this adding until I reach the total.
I have tried the below query and I have set row number to the rows as per my needs, the additions should take place in the same manner.
select *
from #Finalle
order by rownum ;
Output:
Type Count_DB Rownum
------------------------------------------
Within 30 days 399480 1
Within 60 days 30536 2
Within 90 days 10432 3
Within 120 days 11777 4
Greater than 120 days 13091 5
Blank 29297 6
Total 494613 7
When I try the below query, it works fine until the 6th row, but fails for the last row:
select
f1.[type],
(select
Sum(f.[Count of ED_DB_category]) as [Cummumative]
from
#Finalle f
where
f1.rownum >= f.rownum)
from
#Finalle f1
order by
rownum
Output:
Type No column name
--------------------------------------
Within 30 days 399480
Within 60 days 430016
Within 90 days 440448
Within 120 days 452225
Greater than 120 days 465316
Blank 494613 <---Add only until here
Total 989226
Here the total should return the same value as in the first table.
How do I achieve this?
Try this:
select f1.[type],
CASE
WHEN type = 'Total' THEN f1.[Count_DB]
ELSE SUM(f1.[Count_DB]) OVER (ORDER BY rownum)
END
from #Finalle f1
order by rownum
You may looking for this
select f1.[type],
(select Sum(f.[Count of ED_DB_category])
from #Finalle f where f1.rownum >= f.rownum AND
f.Type <> 'Total'
)as [Cummumative]
from #Finalle f1
order by rownum
Maybe you could use windowed function....
Here's a SWAG but I am not sure of your table structures.....or what to order by
SELECT
f1.[type],
SUM(f.[Count of ED_DB_category]) OVER(ORDER BY f1.[type] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS [Cummumative]
FROM
FROM #Finalle f1
WHERE f.Type <> 'Total'

Count occurrences of combinations of columns

I have daily time series (actually business days) for different companies and I work with PostgreSQL. There is also an indicator variable (called flag) taking the value 0 most of the time, and 1 on some rare event days. If the indicator variable takes the value 1 for a company, I want to further investigate the entries from two days before to one day after that event for the corresponding company. Let me refer to that as [-2,1] window with the event day being day 0.
I am using the following query
CREATE TABLE test AS
WITH cte AS (
SELECT *
, MAX(flag) OVER(PARTITION BY company ORDER BY day
ROWS BETWEEN 1 preceding AND 2 following) Lead1
FROM mytable)
SELECT *
FROM cte
WHERE Lead1 = 1
ORDER BY day,company
The query takes the entries ranging from 2 days before the event to one day after the event, for the company experiencing the event.
The query does that for all events.
This is a small section of the resulting table.
day company flag
2012-01-23 A 0
2012-01-24 A 0
2012-01-25 A 1
2012-01-25 B 0
2012-01-26 A 0
2012-01-26 B 0
2012-01-27 B 1
2012-01-30 B 0
2013-01-10 A 0
2013-01-11 A 0
2013-01-14 A 1
Now I want to do further calculations for every [-2,1] window separately. So I need a variable that allows me to identify each [-2,1] window. The idea is that I count the number of windows for every company with the variable "occur", so that in further calculations I can use the clause
GROUP BY company, occur
Therefore my desired output looks like that:
day company flag occur
2012-01-23 A 0 1
2012-01-24 A 0 1
2012-01-25 A 1 1
2012-01-25 B 0 1
2012-01-26 A 0 1
2012-01-26 B 0 1
2012-01-27 B 1 1
2012-01-30 B 0 1
2013-01-10 A 0 2
2013-01-11 A 0 2
2013-01-14 A 1 2
In the example, the company B only occurs once (occur = 1). But the company A occurs two times. For the first time from 2012-01-23 to 2012-01-26. And for the second time from 2013-01-10 to 2013-01-14. The second time range of company A does not consist of all four days surrounding the event day (-2,-1,0,1) since the company leaves the dataset before the end of that time range.
As I said I am working with business days. I don't care for holidays, I have data from monday to friday. Earlier I wrote the following function:
CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
RETURNS date AS
$BODY$
WITH alldates AS (
SELECT i,
$1 + (i * CASE WHEN $2 < 0 THEN -1 ELSE 1 END) AS date
FROM generate_series(0,(ABS($2) + 5)*2) i
),
days AS (
SELECT i, date, EXTRACT('dow' FROM date) AS dow
FROM alldates
),
businessdays AS (
SELECT i, date, d.dow FROM days d
WHERE d.dow BETWEEN 1 AND 5
ORDER BY i
)
-- adding business days to a date --
SELECT date FROM businessdays WHERE
CASE WHEN $2 > 0 THEN date >=$1 WHEN $2 < 0
THEN date <=$1 ELSE date =$1 END
LIMIT 1
offset ABS($2)
$BODY$
LANGUAGE 'sql' VOLATILE;
It can add/substract business days from a given date and works like that:
select * from addbusinessdays('2013-01-14',-2)
delivers the result 2013-01-10. So in Jakub's approach we can change the second and third last line to
w.day BETWEEN addbusinessdays(t1.day, -2) AND addbusinessdays(t1.day, 1)
and can deal with the business days.
Function
While using the function addbusinessdays(), consider this instead:
CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
RETURNS date AS
$func$
SELECT day
FROM (
SELECT i, $1 + i * sign($2)::int AS day
FROM generate_series(0, ((abs($2) * 7) / 5) + 3) i
) sub
WHERE EXTRACT(ISODOW FROM day) < 6 -- truncate weekend
ORDER BY i
OFFSET abs($2)
LIMIT 1
$func$ LANGUAGE sql IMMUTABLE;
Major points
Never quote the language name sql. It's an identifier, not a string.
Why was the function VOLATILE? Make it IMMUTABLE for better performance in repeated use and more options (like using it in a functional index).
(ABS($2) + 5)*2) is way too much padding. Replace with ((abs($2) * 7) / 5) + 3).
Multiple levels of CTEs were useless cruft.
ORDER BY in last CTE was useless, too.
As mentioned in my previous answer, extract(ISODOW FROM ...) is more convenient to truncate weekends.
Query
That said, I wouldn't use above function for this query at all. Build a complete grid of relevant days once instead of calculating the range of days for every single row.
Based on this assertion in a comment (should be in the question, really!):
two subsequent windows of the same firm can never overlap.
WITH range AS ( -- only with flag
SELECT company
, min(day) - 2 AS r_start
, max(day) + 1 AS r_stop
FROM tbl t
WHERE flag <> 0
GROUP BY 1
)
, grid AS (
SELECT company, day::date
FROM range r
,generate_series(r.r_start, r.r_stop, interval '1d') d(day)
WHERE extract('ISODOW' FROM d.day) < 6
)
SELECT *, sum(flag) OVER(PARTITION BY company ORDER BY day
ROWS BETWEEN UNBOUNDED PRECEDING
AND 2 following) AS window_nr
FROM (
SELECT t.*, max(t.flag) OVER(PARTITION BY g.company ORDER BY g.day
ROWS BETWEEN 1 preceding
AND 2 following) in_window
FROM grid g
LEFT JOIN tbl t USING (company, day)
) sub
WHERE in_window > 0 -- only rows in [-2,1] window
AND day IS NOT NULL -- exclude missing days in [-2,1] window
ORDER BY company, day;
How?
Build a grid of all business days: CTE grid.
To keep the grid to its smallest possible size, extract minimum and maximum (plus buffer) day per company: CTE range.
LEFT JOIN actual rows to it. Now the frames for ensuing window functions works with static numbers.
To get distinct numbers per flag and company (window_nr), just count flags from the start of the grid (taking buffers into account).
Only keep days inside your [-2,1] windows (in_window > 0).
Only keep days with actual rows in the table.
Voilá.
SQL Fiddle.
Basically the strategy is to first enumarate the flag days and then join others with them:
WITH windows AS(
SELECT t1.day
,t1.company
,rank() OVER (PARTITION BY company ORDER BY day) as rank
FROM table1 t1
WHERE flag =1)
SELECT t1.day
,t1.company
,t1.flag
,w.rank
FROM table1 AS t1
JOIN windows AS w
ON
t1.company = w.company
AND
w.day BETWEEN
t1.day - interval '2 day' AND t1.day + interval '1 day'
ORDER BY t1.day, t1.company;
Fiddle.
However there is a problem with work days as those can mean whatever (do holidays count?).