Get a rolling count of timestamps in SQL

I have a table (in an Oracle DB) that looks something like the one shown below, with about 4,000 records. This is just an example of how the table is designed. The timestamps span several years.
| Time | Action |
| --- | --- |
| 9/25/2019 4:24:32 PM | Yes |
| 9/25/2019 4:28:56 PM | No |
| 9/28/2019 7:48:16 PM | Yes |
| .... | .... |
I want to be able to get a count of timestamps that occur on a rolling 15 minute interval. My main goal is to identify the maximum number of timestamps that appear for any 15 minute interval. I would like this done by looking at each timestamp and getting a count of timestamps that appear within 15 minutes of that timestamp.
My goal would be to have something like:
| Interval | Count |
| --- | --- |
| 9/25/2019 4:24:00 PM - 9/25/2019 4:39:00 PM | 2 |
| 9/25/2019 4:25:00 PM - 9/25/2019 4:40:00 PM | 2 |
| ..... | ..... |
| 9/25/2019 4:39:00 PM - 9/25/2019 4:54:00 PM | 0 |
I am not sure how I would be able to do this, if at all. Any ideas or advice would be much appreciated.

If you want a rolling count over any 15-minute window in the data, then you can use:
select t.*,
       count(*) over (order by timestamp
                      range between interval '15' minute preceding and current row
       ) as cnt_15
from t;
If you want the maximum, then use rank() on this:
select t.*
from (select t.*, rank() over (order by cnt_15 desc) as seqnum
      from (select t.*,
                   count(*) over (order by timestamp
                                  range between interval '15' minute preceding and current row
                   ) as cnt_15
            from t
           ) t
     ) t
where seqnum = 1;
This doesn't produce exactly the results you specify in the question. But it does answer the question:
I want to be able to get a count of timestamps that occur on a rolling 15 minute interval. My main goal is to identify the maximum number of timestamps that appear for any 15 minute interval.
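If you do want output shaped more like the interval/count layout in the question, a rough forward-looking sketch (using the same generic table t and timestamp column as above, and assuming Oracle 12c+ for FETCH FIRST) might be:
-- Sketch only: for each row, count the rows that fall within the 15 minutes
-- starting at that row's timestamp, then keep the busiest window.
select timestamp                        as interval_start,
       timestamp + interval '15' minute as interval_end,
       count(*) over (order by timestamp
                      range between current row and interval '15' minute following
       ) as cnt_15
from t
order by cnt_15 desc
fetch first 1 row only;
Drop the FETCH FIRST line if you want the count for every starting timestamp rather than only the maximum.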

You could enumerate the minutes with a recursive query, then bring in the table with a left join (note that Oracle's recursive subquery factoring does not use the RECURSIVE keyword):
with cte (start_dt, max_dt) as (
    select trunc(min(time), 'mi'), max(time) from mytable
    union all
    select start_dt + interval '1' minute, max_dt from cte where start_dt < max_dt
)
select
    c.start_dt,
    c.start_dt + interval '15' minute end_dt,
    count(t.time) cnt
from cte c
left join mytable t
    on  t.time >= c.start_dt
    and t.time <  c.start_dt + interval '15' minute
group by c.start_dt
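If the end goal is only the busiest 15-minute window, you could wrap that enumeration and keep the top row. A sketch, again assuming Oracle 12c+ for FETCH FIRST:
-- Sketch: same enumeration as above, keeping only the interval with the highest count.
with cte (start_dt, max_dt) as (
    select trunc(min(time), 'mi'), max(time) from mytable
    union all
    select start_dt + interval '1' minute, max_dt from cte where start_dt < max_dt
)
select
    c.start_dt,
    c.start_dt + interval '15' minute end_dt,
    count(t.time) cnt
from cte c
left join mytable t
    on  t.time >= c.start_dt
    and t.time <  c.start_dt + interval '15' minute
group by c.start_dt
order by cnt desc
fetch first 1 row only;
Be aware that enumerating one row per minute over several years generates on the order of half a million rows per year, so the window-function approach above will usually be cheaper.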

Related

Get max value of binned time-interval

I have a 'requests' table with a 'time_request' column which has a timestamp for each request. I want to know the maximum number of requests that I had in a single minute.
So I'm guessing I need to somehow GROUP BY a 1-minute time interval and then do some sort of MAX(COUNT(request_id))? Although nested aggregations are not allowed.
Will appreciate any help.
Table example:
request_id | time_request
------------------+---------------------
ab1 | 2021-03-29 16:20:05
ab2 | 2021-03-29 16:20:20
bc3 | 2021-03-31 20:34:07
fw3 | 2021-03-31 20:38:53
fe4 | 2021-03-31 20:39:53
Expected result: 2 (There were a maximum of 2 requests in a single minute)
Thanks!
You may use the window function count and specify a logical interval of one minute as the window boundary. It calculates the count for each row, taking into account all the rows within the preceding minute.
Code for Postgres is below:
with a as (
    select
        id
      , cast(ts as timestamp) as ts
    from (values
        ('ab1', '2021-03-29 16:20:05'),
        ('ab2', '2021-03-29 16:20:20'),
        ('bc3', '2021-03-31 20:34:07'),
        ('fw3', '2021-03-31 20:38:53'),
        ('fe4', '2021-03-31 20:39:53')
    ) as t(id, ts)
)
, count_per_interval as (
    select
        a.*
      , count(id) over (
            order by ts asc
            range between
                interval '1' minute preceding
                and current row
        ) as cnt_per_min
    from a
)
select max(cnt_per_min)
from count_per_interval
| max |
| --: |
| 2 |
db<>fiddle here
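Applied to the table described in the question, the same pattern would look roughly like this (a sketch, assuming the table is requests with columns request_id and time_request):
-- Sketch: per-request rolling one-minute count, then take the maximum.
with count_per_interval as (
    select
        request_id
      , count(*) over (
            order by time_request asc
            range between interval '1' minute preceding and current row
        ) as cnt_per_min
    from requests
)
select max(cnt_per_min) as max_requests_per_minute
from count_per_interval;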

Postgres query for difference between latest and first record of the day

Postgres data like this:
| id | read_at | value_1 |
| ------|------------------------|---------|
| 16239 | 2021-11-28 16:13:00+00 | 1509 |
| 16238 | 2021-11-28 16:12:00+00 | 1506 |
| 16237 | 2021-11-28 16:11:00+00 | 1505 |
| 16236 | 2021-11-28 16:10:00+00 | 1501 |
| 16235 | 2021-11-28 16:09:00+00 | 1501 |
| ..... | .......................| .... |
| 15266 | 2021-11-28 00:00:00+00 | 1288 |
A value is added every minute and increases over time.
I would like to get the current total for the day and have this in a Grafana stat panel. Above it would be: 221 (1509-1288). Latest record minus first record of today.
SELECT id,read_at,value_1
FROM xyz
ORDER BY id DESC
LIMIT 1;
With this the latest record is given (A).
SELECT id,read_at,value_1
FROM xyz
WHERE read_at = CURRENT_DATE
ORDER BY id DESC
LIMIT 1;
With this the first record of the day is given (B).
Grafana cannot do math on this (A-B). Single query would be best.
Sadly my database knowledge is low and attempts at building queries have not succeeded, and have taken all afternoon now.
Theoretical ideas to solve this:
1. Subtract the min from the max value where the time frame is today.
2. Using a lag, lag it by the number of records recorded today, then subtract the lagged value from the latest value.
3. Window function.
What is the best way (performance wise) forward and how would such query be written?
Calculate the cumulative total last_value - first_value for each record for the current day using window functions (this is the t subquery) and then pick the latest one.
select current_total, read_at::date as read_at_date
from
(
    select last_value(value_1) over w - first_value(value_1) over w as current_total,
           read_at
    from the_table
    where read_at >= current_date and read_at < current_date + 1
    window w as (partition by read_at::date order by read_at)
) as t
order by read_at desc limit 1;
However, if it is certain that value_1 only "increases over time", then simple grouping will do, and that is by far the best way performance-wise:
select max(value_1) - min(value_1) as current_total,
read_at::date as read_at_date
from the_table
where read_at >= current_date and read_at < current_date + 1
group by read_at::date;
Please check if it works.
Since you intend to publish it in Grafana, the query does not impose a period filter.
https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/3080
create table g (id int, read_at timestamp, value_1 int);
insert into g
values
(16239, '2021-11-28 16:13:00+00', 1509),
(16238, '2021-11-28 16:12:00+00', 1506),
(16237, '2021-11-28 16:11:00+00', 1505),
(16236, '2021-11-28 16:10:00+00', 1501),
(16235, '2021-11-28 16:09:00+00', 1501),
(15266, '2021-11-28 00:00:00+00', 1288);
select date(read_at), max(value_1) - min(value_1)
from g
group by date(read_at);
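If you later want to restrict the result to the current day only (as in the first answer), a minimal sketch against the same sample table g would be:
-- Sketch: same grouping, limited to today's rows only.
select date(read_at) as read_at_date,
       max(value_1) - min(value_1) as current_total
from g
where read_at >= current_date
  and read_at < current_date + 1
group by date(read_at);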
Since your data contains the same value for 2 distinct times (16:09 and 16:10), the values may not always increase within the time interval, leaving open the possibility of a decrease. So do you want the max - min reading, or the difference between the readings at the min/max times? The following gets the difference between the first and latest reading of the day, as indicated in the title.
with parm(dt) as
( values (date '2021-11-28') )
, first_read (f_read,f_value) as
( select read_at, value_1
from test_tbl
where read_at at time zone 'UTC'=
( select min(read_at at time zone 'UTC')
from test_tbl
join parm
on ((read_at at time zone 'UTC')::date = dt)
)
)
, last_read (l_read, l_value) as
( select read_at,value_1
from test_tbl
where read_at at time zone 'UTC'=
( select max(read_at at time zone 'UTC')
from test_tbl
join parm
on ((read_at at time zone 'UTC')::date = dt)
)
)
select l_read, f_read, l_value, f_value, l_value - f_value as "Day Difference"
from last_read
join first_read on true;

How to query time-series data in PostgreSQL to find spikes

I have a table called cpu_usages and I'm trying to find spikes of cpu usage. My table stores 4 columns:
id serial
at timestamp
cpu_usage float
cpu_core int
The at column stores a timestamp for every minute of every day.
For each row, I want to look at the next 3 minutes of timestamps; if any of them has a cpu_value at least 3% higher than that row's starting value, return the starting row.
So for example if I have these rows:
id | at | cpu_value | cpu_core
1 | 2019-01-01-00:00|1|0
2 | 2019-01-01-00:01|1|0
3 | 2019-01-01-00:02|4|0
4 | 2019-01-01-00:03|1|0
5 | 2019-01-01-00:04|1|0
6 | 2019-01-01-00:05|1|0
7 | 2019-01-01-00:06|1|0
8 | 2019-01-01-00:07|1|0
9 | 2019-01-01-00:08|6|0
10 | 2019-01-01-00:00|1|1
11 | 2019-01-01-00:01|1|1
12| 2019-01-01-00:02|4|1
13 | 2019-01-01-00:03|1|1
14 | 2019-01-01-00:04|1|1
15 | 2019-01-01-00:05|1|1
16 | 2019-01-01-00:06|1|1
17 | 2019-01-01-00:07|1|1
18 | 2019-01-01-00:08|6|1
It would return rows:
1,2,6,7,8
I am not sure how to do this because it sounds like it needs some sort of nested joins.
Can anyone assist me with this?
This answers the original version of the question.
Just use window functions. Assuming you want the row with the larger value returned, you want to look back, not forward:
select t.*
from (select t.*,
             min(cpu_value) over (order by timestamp
                                  range between interval '3 minute' preceding and interval '1 second' preceding
             ) as previous_min
      from t
     ) t
where previous_min * 1.03 < cpu_value;
EDIT:
Looking forward instead (so that the starting row is returned), this would be:
select t.*
from (select t.*,
             max(cpu_value) over (order by timestamp
                                  range between interval '1 second' following and interval '3 minute' following
             ) as next_max
      from t
     ) t
where cpu_value * 1.03 < next_max;
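Since the sample data interleaves two cpu_core values, you would presumably also want the window evaluated per core; that is my assumption, not part of the original answer. A sketch using the same generic names as above plus cpu_core from the question:
-- Sketch: the forward-looking check, evaluated independently for each core.
select t.*
from (select t.*,
             max(cpu_value) over (partition by cpu_core
                                  order by timestamp
                                  range between interval '1 second' following and interval '3 minute' following
             ) as next_max
      from t
     ) t
where cpu_value * 1.03 < next_max;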

Oracle partition with Monthly Interval

I am performing a query with a partition window of 1 calendar month. The data I'm working with is collected at regular intervals eg. every fifteen minutes.
Here is the code:
SELECT AVG(data_value) OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW)
This query works well, and collects the monthly average. The only problem is that the start and end of the interval are exactly a month apart, so the boundaries of the interval window are inclusive, eg. the start would be Nov-01-2019 00:00 and the End would be Dec-01-2019 00:00.
I need to make it so that the starting boundary is not included, because it's not considered part of the data set, eg. Start at Nov-01-2019 00:15 (the next row) and the End would still be Dec-01-2019 00:00.
I'm wondering if there's something that Oracle can do that would achieve this.
I imagine the code looking something like this:
SELECT AVG(data_value) OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN INTERVAL '1' MONTH (+ 1 ROW) PRECEDING AND CURRENT ROW)
I've tried several variants of this but Oracle does not like them. Any help would be appreciated.
Work out how many days there were in the previous month using:
EXTRACT( DAY FROM TRUNC( time_stamp, 'MM' ) - 1 )
Then use the NUMTODSINTERVAL function to create an interval one day shorter, so you exclude the extra day that is being counted:
SELECT id,
data_value,
time_stamp,
AVG(data_value)
OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN NUMTODSINTERVAL(
EXTRACT( DAY FROM TRUNC( time_stamp, 'MM' ) - 2 ),
'DAY'
) PRECEDING
AND CURRENT ROW
) AS avg_value_month_minus_1_day
FROM table_name;
So, if your data is:
CREATE TABLE table_name ( id, data_value, time_stamp ) AS
SELECT 1,
LEVEL,
DATE '2020-01-01' + LEVEL - 1
FROM DUAL
CONNECT BY LEVEL <= 50;
Then compare the above query to the output of your original query with:
SELECT id,
data_value,
time_stamp,
AVG(data_value)
OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN NUMTODSINTERVAL(
EXTRACT( DAY FROM TRUNC( time_stamp, 'MM' ) - 2 ),
'DAY'
) PRECEDING
AND CURRENT ROW
) AS avg_value_month_minus_1_day,
AVG(data_value)
OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING
AND CURRENT ROW
) AS avg_value_month
FROM table_name;
Outputs (for February, when there is a full month of preceding data):
ID | DATA_VALUE | TIME_STAMP | AVG_VALUE_MONTH_MINUS_1_DAY | AVG_VALUE_MONTH
-: | ---------: | :------------------ | --------------------------: | --------------:
1 | 32 | 2020-02-01 00:00:00 | 17 | 16.5
1 | 33 | 2020-02-02 00:00:00 | 18 | 17.5
1 | 34 | 2020-02-03 00:00:00 | 19 | 18.5
1 | 35 | 2020-02-04 00:00:00 | 20 | 19.5
1 | 36 | 2020-02-05 00:00:00 | 21 | 20.5
1 | 37 | 2020-02-06 00:00:00 | 22 | 21.5
1 | 38 | 2020-02-07 00:00:00 | 23 | 22.5
1 | 39 | 2020-02-08 00:00:00 | 24 | 23.5
1 | 40 | 2020-02-09 00:00:00 | 25 | 24.5
1 | 41 | 2020-02-10 00:00:00 | 26 | 25.5
1 | 42 | 2020-02-11 00:00:00 | 27 | 26.5
1 | 43 | 2020-02-12 00:00:00 | 28 | 27.5
1 | 44 | 2020-02-13 00:00:00 | 29 | 28.5
1 | 45 | 2020-02-14 00:00:00 | 30 | 29.5
1 | 46 | 2020-02-15 00:00:00 | 31 | 30.5
1 | 47 | 2020-02-16 00:00:00 | 32 | 31.5
1 | 48 | 2020-02-17 00:00:00 | 33 | 32.5
1 | 49 | 2020-02-18 00:00:00 | 34 | 33.5
1 | 50 | 2020-02-19 00:00:00 | 35 | 34.5
db<>fiddle here
Alas, Oracle doesn't support intervals with both months and smaller units.
One method is to subtract it out:
select (sum(data_value) over (partition by id
                              order by time_stamp
                              range between interval '3' month preceding and current row
        ) -
        -- nvl() guards against an empty frame (no row exactly 3 months back),
        -- where sum() would otherwise return null
        nvl(sum(data_value) over (partition by id
                                  order by time_stamp
                                  range between interval '3' month preceding and interval '3' month preceding
            ), 0)
       ) /
       (count(data_value) over (partition by id
                                order by time_stamp
                                range between interval '3' month preceding and current row
        ) -
        count(data_value) over (partition by id
                                order by time_stamp
                                range between interval '3' month preceding and interval '3' month preceding
        )
       ) as avg_value
from table_name
Admittedly, this is cumbersome for an average, but it might be just fine for a sum() or count().
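For example, a minimal count-only sketch under the same assumptions (a hypothetical table_name and a 3-month window):
-- Sketch: rolling count over the last 3 months, excluding rows that sit
-- exactly on the lower boundary (an "open" lower bound).
select id,
       time_stamp,
       count(*) over (partition by id
                      order by time_stamp
                      range between interval '3' month preceding and current row)
     - count(*) over (partition by id
                      order by time_stamp
                      range between interval '3' month preceding and interval '3' month preceding)
       as cnt_open_lower_bound
from table_name;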
To shift the window of time that you are looking at, you can shift the value you are sorting on by an appropriate interval of time:
SELECT AVG(data_value)
OVER (PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW
) Current_Calc
, AVG(data_value)
OVER (PARTITION BY id
ORDER BY time_stamp - interval '15' minute
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW
) Shift_Back
, AVG(data_value)
OVER (PARTITION BY id
ORDER BY time_stamp + interval '15' minute
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW
) shift_forward
FROM Your_Data
Based on the description of your problem I believe you want to shift it back by 15 minutes, but I could be misreading the problem statement; without appropriate data and expected results to test against, I can't be certain </shrugs>.
These are sliding windows that always contain one month's worth of data relative to the current time_stamp. This means that for each time_stamp you will get anywhere from 29 to 32 days' worth of data, with some of that data being counted in both the current and preceding months' averages.
On the other hand, if what you are interested in is averages for the discrete months, then you should be partitioning by month rather than creating a sliding window. If you want running averages per month you can add the sort, but you won't need the windowing clause:
SELECT TRUNC(time_stamp, 'MM') MON
, AVG(data_value)
OVER (PARTITION BY id, TRUNC(time_stamp, 'MM')) MON_AVG
, AVG(data_value)
OVER (PARTITION BY id, TRUNC(time_stamp, 'MM')
ORDER BY time_stamp) RUN_MON_AVG
, TRUNC(time_stamp - INTERVAL '15' MINUTE, 'MM') MON_2
, AVG(data_value)
OVER (PARTITION BY id, TRUNC(time_stamp - INTERVAL '15' MINUTE, 'MM')
) MON_AVG_2
, AVG(data_value)
OVER (PARTITION BY id, TRUNC(time_stamp - INTERVAL '15' MINUTE, 'MM')
ORDER BY time_stamp) RUN_MON_AVG_2
FROM Your_Data
Thanks for the feedback! I was able to assemble the answer I needed based on the answers above. Here is the code that I went with:
SELECT AVG(data_value) OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN (NUMTODSINTERVAL(EXTRACT(DAY FROM (TRUNC(time_stamp, 'MM') - 1)), 'DAY')
               - NUMTODSINTERVAL(1, 'SECOND')) PRECEDING
      AND CURRENT ROW)
Because my interval is exactly one month, and I want to remove the first entry, I first convert the length of the previous month into a day-to-second interval, as recommended above, and then subtract one second from the lower bound of the window. This has the effect of making the lower bound of the interval an "open" bound and the upper bound a "closed" bound.
As a side note, I used one second because the periodicity of my dataset is not consistent, but its minimum is three minutes, so anything less than that will work.

Moving average last 30 days

I want to find the number of unique users active in the last 30 days. I want to calculate this for today, but also for days in the past. The dataset, saved in BigQuery, contains user ids, dates and events triggered by the user. A user is active by opening the mobile app, which triggers the event session_start. Example of the unnested dataset:
| resettable_device_id | date | event |
------------------------------------------------------
| xx | 2017-06-09 | session_start |
| yy | 2017-06-09 | session_start |
| xx | 2017-06-11 | session_start |
| zz | 2017-06-11 | session_start |
I found a solution which suits my problem:
BigQuery: how to group and count rows within rolling timestamp window?
My BigQuery script so far:
#standardSQL
WITH daily_aggregation AS (
SELECT
PARSE_DATE("%Y%m%d", event_dim.date) AS day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) AS unique_resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
WHERE event_dim.name = "session_start"
GROUP BY day
)
SELECT
day,
unique_resettable_device_ids,
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
FROM daily_aggregation
ORDER BY day
This script results in the following table:
| day | unique_resettable_device_ids | unique_ids_rolling_30_days |
------------------------------------------------------------------------
| 2018-06-05 | 1807 | 2614 |
| 2018-06-06 | 711 | 807 |
| 2018-06-07 | 96 | 96 |
The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids. How can I fix the rolling window function in my script?
"The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids."
Of course, as that's exactly what the code
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
is asking for.
Check out https://stackoverflow.com/a/49866033/132438, where the question asks specifically about counting uniques in a rolling window: it turns out to be a very slow operation given how much memory it requires.
The solution for this when you want a rolling count of uniques: Go for approximate results.
From the linked answer:
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
, COUNT(*) window_days
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days=90
ORDER BY date_grp
Working solution for a weekly calculation of the number of active users in the last 30 days.
#standardSQL
WITH days AS (
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2018-01-01', CURRENT_DATE(), INTERVAL 1 WEEK)) AS day
), periods AS (
SELECT
DATE_SUB(days.day, INTERVAL 30 DAY) AS StartDate,
days.day AS EndDate FROM days
)
SELECT
periods.EndDate AS Day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) as resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
CROSS JOIN periods
WHERE
PARSE_DATE("%Y%m%d", event_dim.date) BETWEEN periods.StartDate AND periods.EndDate
AND event_dim.name = "session_start"
GROUP BY Day
ORDER BY Day DESC