How do I calculate a rolling average over a specific range timeframe in BigQuery? - sql

I have a BigQuery table like the one below, where data wasn't necessarily recorded at a consistent rate:
| timestamp | value |
|-------------------------|-------|
| 2022-10-01 00:03:00 UTC | 2.43 |
| 2022-10-01 00:17:00 UTC | 4.56 |
| 2022-10-01 00:36:00 UTC | 3.64 |
| 2022-10-01 00:58:00 UTC | 2.15 |
| 2022-10-01 01:04:00 UTC | 2.90 |
| 2022-10-01 01:13:00 UTC | 5.88 |
... ...
I want to calculate a rolling average (as a new column) on value over a certain timeframe, e.g. the previous 12 hours. I know it's relatively simple to do over a fixed number of rows, and I've tried using LAG and TIMESTAMP_SUB functions to select the right values to average over, but I'm quite new to SQL so I'm not even sure if this is the right approach.
Does anyone know how to go about this? Thanks!

Use a window function.
You need to turn each timestamp into an hour counter, i.e. an integer: take the Unix date, multiply it by 24, and add the hour of the day. Daylight saving time is ignored here.
WITH tbl AS (
  SELECT
    10 * RAND() AS val,
    TIMESTAMP_ADD(snapshot_date, INTERVAL CAST(RAND() * 5000 AS INT64) MINUTE) AS timestamps
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY("2021-01-01 00:00:00", "2023-01-01 00:00:00", INTERVAL 1 HOUR)) AS snapshot_date
)
SELECT
  *,
  UNIX_DATE(DATE(timestamps)) * 24 + EXTRACT(HOUR FROM timestamps) AS dummy_time,
  AVG(val) OVER WIN1_range AS rolling_avg,
  SUM(1) OVER WIN1_range AS values_in_avg
FROM tbl
WINDOW WIN1_range AS (
  ORDER BY UNIX_DATE(DATE(timestamps)) * 24 + EXTRACT(HOUR FROM timestamps)
  RANGE BETWEEN 12 PRECEDING AND CURRENT ROW
)
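Applied to a table shaped like the one in the question, the same idea looks roughly like this (a sketch assuming the table is called mytable and keeps the question's timestamp and value columns; adjust the names to your real table):
SELECT
  `timestamp`,
  value,
  AVG(value) OVER (
    -- hour-granularity bucket: Unix date * 24 + hour of day
    ORDER BY UNIX_DATE(DATE(`timestamp`)) * 24 + EXTRACT(HOUR FROM `timestamp`)
    -- current hour bucket plus the 12 buckets before it
    RANGE BETWEEN 12 PRECEDING AND CURRENT ROW
  ) AS rolling_avg_12h
FROM mytable
Note that this buckets by hour, so the window is aligned to hour boundaries rather than reaching back exactly 12 hours from each row.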

BigQuery has simplified specifications for the range frame of window functions:
Tip: If you want to use a range with a date, use ORDER BY with the UNIX_DATE() function. If you want to use a range with a timestamp, use the UNIX_SECONDS(), UNIX_MILLIS(), or UNIX_MICROS() function.
Here, we can simply use unix_seconds() when ordering the records in the partition, and accordingly specify an interval of 12 hours as seconds:
select ts, val,
  avg(val) over(
    order by unix_seconds(ts)
    range between 12 * 60 * 60 preceding and current row
  ) as avg_last_12_hours
from mytable
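If you also want to see how many rows each 12-hour window actually covers (similar to the values_in_avg column in the first answer), you can name the window once and reuse it. A sketch with the same assumed ts and val columns:
select ts, val,
  avg(val) over win_12h as avg_last_12_hours,
  count(*) over win_12h as rows_in_window
from mytable
window win_12h as (
  order by unix_seconds(ts)
  range between 12 * 60 * 60 preceding and current row
)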
Now say we wanted the average over the last 2 days: we would use unix_date() instead. Note that unix_date() takes a DATE, so the timestamp has to be converted first, and the frame 2 PRECEDING covers the current day plus the two previous calendar days:
select ts, val,
  avg(val) over(
    order by unix_date(date(ts))
    range between 2 preceding and current row
  ) as avg_last_2_days
from mytable

Related

Get max value of binned time-interval

I have a 'requests' table with a 'time_request' column which has a timestamp for each request. I want to know the maximum number of requests that I had in a single minute.
So I'm guessing I need to somehow 'group by' a 1-minute interval and then do some sort of MAX(COUNT(request_id)), although nested aggregations are not allowed.
Will appreciate any help.
Table example:
request_id | time_request
------------------+---------------------
ab1 | 2021-03-29 16:20:05
ab2 | 2021-03-29 16:20:20
bc3 | 2021-03-31 20:34:07
fw3 | 2021-03-31 20:38:53
fe4 | 2021-03-31 20:39:53
Expected result: 2 (There were a maximum of 2 requests in a single minute)
Thanks!
You may use the window function count and specify a logical interval of one minute as the window boundary. It calculates the count for each row, taking into account all the rows within the one minute before it.
Code for Postgres is below:
with a as (
  select
    id
    , cast(ts as timestamp) as ts
  from (values
    ('ab1', '2021-03-29 16:20:05'),
    ('ab2', '2021-03-29 16:20:20'),
    ('bc3', '2021-03-31 20:34:07'),
    ('fw3', '2021-03-31 20:38:53'),
    ('fe4', '2021-03-31 20:39:53')
  ) as t(id, ts)
)
, count_per_interval as (
  select
    a.*
    , count(id) over (
      order by ts asc
      range between interval '1' minute preceding and current row
    ) as cnt_per_min
  from a
)
select max(cnt_per_min)
from count_per_interval
| max |
| --: |
| 2 |
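Since the main question in this thread is about BigQuery: the same idea translates there by ordering on UNIX_SECONDS() and using a seconds-based range. A sketch assuming a requests table where time_request is a TIMESTAMP:
with count_per_interval as (
  select
    request_id,
    count(*) over (
      order by unix_seconds(time_request)
      -- mirrors the Postgres frame: everything from 60 seconds before up to the current row
      range between 60 preceding and current row
    ) as cnt_per_min
  from requests
)
select max(cnt_per_min) as max_requests_per_minute
from count_per_interval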

PSQL Recursive Adding Query

I have a table called "deaths" with two columns. One is a date, and the second is the number of people who died on that specific date.
I need a query that gives me the total number of people who died between each date and 90 days prior. For example, if a row's date is 30/09/2021, I would need to add up the deaths since 02/07/2021.
Can I get any guidance as to how I can do this?
"deaths" Table example below.
Date | Deaths |
------------+--------+
2021-08-19 | 21 |
2021-08-18 | 96 |
2021-08-17 | 100 |
2021-08-16 | 64 |
2021-08-15 | 107 |
2021-08-14 | 93 |
So, if this was all my data, the first row (2021-08-19) of my result should be (21 + 96 + 100 + 64 + 107 + 93).
Hope I was clear enough.
You don't need a recursive query for this, you can use window functions instead. As others have mentioned, "date" is not a good name for a column and it would have been better to give sample data with dates more than 90 days apart, but I believe this query should work for you:
SELECT "date",
deaths,
sum(deaths) OVER (ORDER BY "date" RANGE BETWEEN interval '90 days' preceding and current row)
FROM deaths;
The clause RANGE BETWEEN interval '90 days' preceding and current row, called the frame, limits the rows that will be part of the sum.
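Since the main thread here is about BigQuery: the same rolling 90-day sum can be written there with UNIX_DATE(), because BigQuery's RANGE frames need a numeric ordering key. A sketch assuming the same deaths table with a DATE column:
SELECT `date`,
       deaths,
       SUM(deaths) OVER (
         ORDER BY UNIX_DATE(`date`)
         RANGE BETWEEN 90 PRECEDING AND CURRENT ROW
       ) AS deaths_last_90_days
FROM deaths;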
You can also do it with a correlated subquery, which is typically slower than the window function on large tables:
SELECT DISTINCT d."date",
       (SELECT sum(deaths)
        FROM deaths
        WHERE "date" <= d."date"
          AND "date" >= d."date" - 90) AS tot_deaths
FROM deaths d;

how to query time-series data in postgresql to find spikes

I have a table called cpu_usages and I'm trying to find spikes of cpu usage. My table stores 4 columns:
id serial
at timestamp
cpu_usage float
cpu_core int
The at column stores a timestamp for every minute of every day.
I want to select every row where, within the 3 minutes after its timestamp, at least one row has a cpu_usage at least 3% higher than the starting row's value.
So for example if I have these rows:
id | at               | cpu_usage | cpu_core
 1 | 2019-01-01 00:00 | 1         | 0
 2 | 2019-01-01 00:01 | 1         | 0
 3 | 2019-01-01 00:02 | 4         | 0
 4 | 2019-01-01 00:03 | 1         | 0
 5 | 2019-01-01 00:04 | 1         | 0
 6 | 2019-01-01 00:05 | 1         | 0
 7 | 2019-01-01 00:06 | 1         | 0
 8 | 2019-01-01 00:07 | 1         | 0
 9 | 2019-01-01 00:08 | 6         | 0
10 | 2019-01-01 00:00 | 1         | 1
11 | 2019-01-01 00:01 | 1         | 1
12 | 2019-01-01 00:02 | 4         | 1
13 | 2019-01-01 00:03 | 1         | 1
14 | 2019-01-01 00:04 | 1         | 1
15 | 2019-01-01 00:05 | 1         | 1
16 | 2019-01-01 00:06 | 1         | 1
17 | 2019-01-01 00:07 | 1         | 1
18 | 2019-01-01 00:08 | 6         | 1
It would return rows:
1,2,6,7,8
I am not sure how to do this because it sounds like it needs some sort of nested joins.
Can anyone assist me with this?
This answers the original version of the question.
Just use window functions. Assuming you want to return the row with the larger value, you look back rather than forward:
select t.*
from (select t.*,
             max(cpu_usage) over (
               order by at
               range between interval '3 minute' preceding and interval '1 second' preceding
             ) as previous_max
      from cpu_usages t
     ) t
where previous_max * 1.03 < cpu_usage;
EDIT:
Looking forward instead, to return the starting rows that are followed within 3 minutes by a value at least 3% higher:
select t.*
from (select t.*,
             max(cpu_usage) over (
               order by at
               range between interval '1 second' following and interval '3 minute' following
             ) as next_max
      from cpu_usages t
     ) t
where next_max >= cpu_usage * 1.03;
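Because the sample data contains more than one cpu_core, you probably also want to keep the cores separate so the window never compares values across cores. A sketch of the forward-looking query with a per-core partition, against the same assumed cpu_usages table:
select t.*
from (select t.*,
             max(cpu_usage) over (
               partition by cpu_core   -- keep each core's timeline separate
               order by at
               range between interval '1 second' following and interval '3 minute' following
             ) as next_max
      from cpu_usages t
     ) t
where next_max >= cpu_usage * 1.03
order by cpu_core, at;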

How to make query that selects based on 1 day interval?

How can I get all IDs that have more than 10 entries on one day?
Here is the sample data:
ID | Time
---+---------------------
 4 | 2019-02-14 17:22:43
 2 | 2019-04-27 07:51:09
83 | 2018-01-07 08:38:37
I am having a hard time using count to find all of the entries that fall on the same day; the hour:minute:second part is what is causing problems for me.
For MySql it would be:
select distinct id from tablename
group by id, date(time)
having count(*) > 10
The date() function discards the time part of the column, so the grouping is done only by the date part.
For SQL Server you would use convert(date, time) instead of date(time):
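Putting that together, the full SQL Server query would presumably look like this (same assumed table and column names as above):
select distinct id
from tablename
group by id, convert(date, time)
having count(*) > 10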

Moving average last 30 days

I want to find the number of unique users active in the last 30 days. I want to calculate this for today, but also for days in the past. The dataset contains user ids, dates and events triggered by the user, saved in BigQuery. A user counts as active when they open the mobile app, which triggers the event session_start. Example of the unnested dataset:
| resettable_device_id | date | event |
------------------------------------------------------
| xx | 2017-06-09 | session_start |
| yy | 2017-06-09 | session_start |
| xx | 2017-06-11 | session_start |
| zz | 2017-06-11 | session_start |
I found a solution which suits my problem:
BigQuery: how to group and count rows within rolling timestamp window?
My BigQuery script so far:
#standardSQL
WITH daily_aggregation AS (
  SELECT
    PARSE_DATE("%Y%m%d", event_dim.date) AS day,
    COUNT(DISTINCT user_dim.device_info.resettable_device_id) AS unique_resettable_device_ids
  FROM `ANDROID.app_events_*`,
    UNNEST(event_dim) AS event_dim
  WHERE event_dim.name = "session_start"
  GROUP BY day
)
SELECT
  day,
  unique_resettable_device_ids,
  SUM(unique_resettable_device_ids)
    OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
FROM daily_aggregation
ORDER BY day
This script results in the following table:
| day | unique_resettable_device_ids | unique_ids_rolling_30_days |
------------------------------------------------------------------------
| 2018-06-05 | 1807 | 2614 |
| 2018-06-06 | 711 | 807 |
| 2018-06-07 | 96 | 96 |
The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids. How can I fix the rolling window function in my script?
"The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids."
Of course, as that's exactly what the code
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
is asking for.
Check out https://stackoverflow.com/a/49866033/132438, where the question specifically asks about counting uniques in a rolling window: it turns out to be a very slow operation given how much memory it requires.
The solution when you want a rolling count of uniques: go for approximate results.
From the linked answer:
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
, COUNT(*) window_days
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days=90
ORDER BY date_grp
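If you want the approximate approach against your own schema, here is a sketch that keeps the structure of the quoted query, with a single 30-day window, and only swaps in the table, UNNEST and field names from your daily_aggregation CTE; treat it as a starting point and double-check the window semantics against what you need:
#standardSQL
SELECT DATE_SUB(day, INTERVAL i DAY) AS date_grp,
       HLL_COUNT.MERGE(sketch) AS approx_unique_30_day_users,
       COUNT(*) AS window_days
FROM (
  -- one HLL sketch of active device ids per day (field names taken from the question's query)
  SELECT PARSE_DATE("%Y%m%d", event_dim.date) AS day,
         HLL_COUNT.INIT(user_dim.device_info.resettable_device_id) AS sketch
  FROM `ANDROID.app_events_*`,
       UNNEST(event_dim) AS event_dim
  WHERE event_dim.name = "session_start"
  GROUP BY day
), UNNEST(GENERATE_ARRAY(1, 30)) AS i
GROUP BY date_grp
HAVING window_days = 30
ORDER BY date_grp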
Working solution for a weekly calculation of the number of active users in the last 30 days.
#standardSQL
WITH days AS (
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY('2018-01-01', CURRENT_DATE(), INTERVAL 1 WEEK)) AS day
), periods AS (
  SELECT
    DATE_SUB(days.day, INTERVAL 30 DAY) AS StartDate,
    days.day AS EndDate
  FROM days
)
SELECT
  periods.EndDate AS Day,
  COUNT(DISTINCT user_dim.device_info.resettable_device_id) AS resettable_device_ids
FROM `ANDROID.app_events_*`,
  UNNEST(event_dim) AS event_dim
CROSS JOIN periods
WHERE
  PARSE_DATE("%Y%m%d", event_dim.date) BETWEEN periods.StartDate AND periods.EndDate
  AND event_dim.name = "session_start"
GROUP BY Day
ORDER BY Day DESC
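If you need the metric for every day instead of once a week, the same pattern should work with a daily grid; only the days CTE changes, for example:
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2018-01-01', CURRENT_DATE(), INTERVAL 1 DAY)) AS day
Expect it to process roughly seven times as many period rows as the weekly version.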