How to query time-series data in PostgreSQL to find spikes - SQL

I have a table called cpu_usages and I'm trying to find spikes of cpu usage. My table stores 4 columns:
id serial
at timestamp
cpu_usage float
cpu_core int
The at column stores a timestamp for every minute of every day.
For each row, I want to look at the next 3 minutes; if any of those timestamps has a cpu_usage at least 3% higher than the starting row's value, then return the starting row.
So for example if I have these rows:
id | at | cpu_usage | cpu_core
1 | 2019-01-01-00:00|1|0
2 | 2019-01-01-00:01|1|0
3 | 2019-01-01-00:02|4|0
4 | 2019-01-01-00:03|1|0
5 | 2019-01-01-00:04|1|0
6 | 2019-01-01-00:05|1|0
7 | 2019-01-01-00:06|1|0
8 | 2019-01-01-00:07|1|0
9 | 2019-01-01-00:08|6|0
10 | 2019-01-01-00:00|1|1
11 | 2019-01-01-00:01|1|1
12| 2019-01-01-00:02|4|1
13 | 2019-01-01-00:03|1|1
14 | 2019-01-01-00:04|1|1
15 | 2019-01-01-00:05|1|1
16 | 2019-01-01-00:06|1|1
17 | 2019-01-01-00:07|1|1
18 | 2019-01-01-00:08|6|1
It would return rows:
1,2,6,7,8
I am not sure how to do this because it sounds like it needs some sort of nested joins.
Can anyone assist me with this?

This answers the original version of the question.
Just use window functions. Assuming you want the larger value, you want to look back, not forward:
select t.*
from (select t.*,
             max(cpu_value) over (order by timestamp
                                  range between interval '3 minute' preceding
                                            and interval '1 second' preceding
                                 ) as previous_max
      from t
     ) t
where previous_max * 1.03 < cpu_value;
EDIT:
Looking forward instead, this would be:
select t.*
from (select t.*,
             min(cpu_value) over (order by timestamp
                                  range between interval '1 second' following
                                            and interval '3 minute' following
                                 ) as next_min
      from t
     ) t
where cpu_value * 1.03 > next_min;
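Applied to the table described in the question (spikes at least 3% higher somewhere in the next 3 minutes), a hedged sketch might look like the following. It assumes the cpu_usages schema above, partitions by cpu_core so each core is treated as its own series, and requires PostgreSQL 11 or later for RANGE frames with an interval offset; the 1-minute lower offset assumes one sample per minute per core, as described.
select u.*
from (select u.*,
             max(u.cpu_usage) over (partition by u.cpu_core
                                    order by u.at
                                    range between interval '1 minute' following
                                              and interval '3 minutes' following
                                   ) as next_max
      from cpu_usages u
     ) u
where next_max >= cpu_usage * 1.03;
For the sample data above, each starting row whose following 3 minutes contain a higher reading (rows 1, 2, 6, 7, 8 on core 0, and their counterparts on core 1) would be returned.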

Related

Get max value of binned time-interval

I have a 'requests' table with a 'time_request' column which has a timestamp for each request. I want to know the maximum number of requests that I had in a single minute.
So I'm guessing I need to somehow GROUP BY a 1-minute interval and then do some sort of MAX(COUNT(request_id))? Although nested aggregations are not allowed.
Will appreciate any help.
Table example:
request_id | time_request
------------------+---------------------
ab1 | 2021-03-29 16:20:05
ab2 | 2021-03-29 16:20:20
bc3 | 2021-03-31 20:34:07
fw3 | 2021-03-31 20:38:53
fe4 | 2021-03-31 20:39:53
Expected result: 2 (There were a maximum of 2 requests in a single minute)
Thanks!
You may use the window function count and specify a logical interval of one minute as the window boundary. It calculates the count for each row, taking into account all the rows within the preceding minute.
Code for Postgres is below:
with a as (
    select
        id,
        cast(ts as timestamp) as ts
    from (values
        ('ab1', '2021-03-29 16:20:05'),
        ('ab2', '2021-03-29 16:20:20'),
        ('bc3', '2021-03-31 20:34:07'),
        ('fw3', '2021-03-31 20:38:53'),
        ('fe4', '2021-03-31 20:39:53')
    ) as t(id, ts)
),
count_per_interval as (
    select
        a.*,
        count(id) over (
            order by ts asc
            range between
                interval '1' minute preceding
                and current row
        ) as cnt_per_min
    from a
)
select max(cnt_per_min)
from count_per_interval
| max |
| --: |
| 2 |
db<>fiddle here
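Note that the window above is a sliding one-minute window ending at each request. If "a single minute" should instead mean fixed clock minutes (16:20:00 through 16:20:59), a hedged alternative is to bucket with date_trunc and take the max of the per-bucket counts; this sketch assumes the requests table and time_request column from the question:
select max(cnt) as max_requests_per_minute
from (select date_trunc('minute', time_request) as minute_bucket,
             count(*) as cnt
      from requests
      group by date_trunc('minute', time_request)
     ) per_minute;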

Get a rolling count of timestamps in SQL

I have a table (in an Oracle DB) that looks something like what is shown below with about 4000 records. This is just an example of how the table is designed. The timestamps range for several years.
| Time | Action |
| 9/25/2019 4:24:32 PM | Yes |
| 9/25/2019 4:28:56 PM | No |
| 9/28/2019 7:48:16 PM | Yes |
| .... | .... |
I want to be able to get a count of timestamps that occur on a rolling 15 minute interval. My main goal is to identify the maximum number of timestamps that appear for any 15 minute interval. I would like this done by looking at each timestamp and getting a count of timestamps that appear within 15 minutes of that timestamp.
My goal would to have something like
| Interval | Count |
| 9/25/2019 4:24:00 PM - 9/25/2019 4:39:00 | 2 |
| 9/25/2019 4:25:00 PM - 9/25/2019 4:40:00 | 2 |
| ..... | ..... |
| 9/25/2019 4:39:00 PM - 9/25/2019 4:54:00 | 0 |
I am not sure how I would be able to do this, if at all. Any ideas or advice would be much appreciated.
If you want any 15 minute interval in the data, then you can use:
select t.*,
count(*) over (order by timestamp
range between interval '15' minute preceding and current row
) as cnt_15
from t;
If you want the maximum, then use rank() on this:
select t.*
from (select t.*, rank() over (order by cnt_15 desc) as seqnum
      from (select t.*,
                   count(*) over (order by timestamp
                                  range between interval '15' minute preceding and current row
                                 ) as cnt_15
            from t
           ) t
     ) t
where seqnum = 1;
This doesn't produce exactly the results you specify in the query. But it does answer the question:
I want to be able to get a count of timestamps that occur on a rolling 15 minute interval. My main goal is to identify the maximum number of timestamps that appear for any 15 minute interval.
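If you only need the number itself rather than the rows that achieve it, a hedged variant (using the same placeholder table t and column timestamp as above) is to aggregate over the windowed count:
select max(cnt_15) as max_cnt_15
from (select count(*) over (order by timestamp
                            range between interval '15' minute preceding and current row
                           ) as cnt_15
      from t
     );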
You could enumerate the minutes with a recursive query, then bring in the table with a left join:
with cte (start_dt, max_dt) as (
    select trunc(min(time), 'mi'), max(time) from mytable
    union all
    select start_dt + interval '1' minute, max_dt from cte where start_dt < max_dt
)
select
    c.start_dt,
    c.start_dt + interval '15' minute end_dt,
    count(t.time) cnt
from cte c
left join mytable t
    on t.time >= c.start_dt
    and t.time < c.start_dt + interval '15' minute
group by c.start_dt
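If the end goal is just the busiest 15-minute window, a hedged wrapper around the query above (same mytable/time names; FETCH FIRST requires Oracle 12c or later) could be:
with cte (start_dt, max_dt) as (
    select trunc(min(time), 'mi'), max(time) from mytable
    union all
    select start_dt + interval '1' minute, max_dt from cte where start_dt < max_dt
),
per_interval as (
    select c.start_dt,
           c.start_dt + interval '15' minute as end_dt,
           count(t.time) as cnt
    from cte c
    left join mytable t
      on t.time >= c.start_dt
     and t.time < c.start_dt + interval '15' minute
    group by c.start_dt
)
select start_dt, end_dt, cnt
from per_interval
order by cnt desc
fetch first 1 row with ties;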

Oracle partition with Monthly Interval

I am performing a query with a partition window of 1 calendar month. The data I'm working with is collected at regular intervals, e.g. every fifteen minutes.
Here is the code:
SELECT AVG(data_value) OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW)
This query works well, and collects the monthly average. The only problem is that the start and end of the interval are exactly a month apart, so both boundaries of the window are inclusive; e.g. the start would be Nov-01-2019 00:00 and the end would be Dec-01-2019 00:00.
I need to make it so that the starting boundary is not included, because it's not considered part of the data set; e.g. start at Nov-01-2019 00:15 (the next row) while the end would still be Dec-01-2019 00:00.
I'm wondering if there's something that Oracle can do that would achieve this.
I imagine the code looking something like this:
SELECT AVG(data_value) OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN INTERVAL '1' MONTH (+ 1 ROW) PRECEDING AND CURRENT ROW)
I've tried several variants of this but Oracle does not like them. Any help would be appreciated.
Work out how many days there were in the previous month using:
EXTRACT( DAY FROM TRUNC( time_stamp, 'MM' ) - 1 )
Then use the NUMTODSINTERVAL function to create an interval of one day fewer, so you exclude the extra day that would otherwise be counted:
SELECT id,
data_value,
time_stamp,
AVG(data_value)
OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN NUMTODSINTERVAL(
EXTRACT( DAY FROM TRUNC( time_stamp, 'MM' ) - 2 ),
'DAY'
) PRECEDING
AND CURRENT ROW
) AS avg_value_month_minus_1_day
FROM table_name;
So, if your data is:
CREATE TABLE table_name ( id, data_value, time_stamp ) AS
SELECT 1,
LEVEL,
DATE '2020-01-01' + LEVEL - 1
FROM DUAL
CONNECT BY LEVEL <= 50;
Then you can compare the above query against the output of your original query with:
SELECT id,
data_value,
time_stamp,
AVG(data_value)
OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN NUMTODSINTERVAL(
EXTRACT( DAY FROM TRUNC( time_stamp, 'MM' ) - 2 ),
'DAY'
) PRECEDING
AND CURRENT ROW
) AS avg_value_month_minus_1_day,
AVG(data_value)
OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING
AND CURRENT ROW
) AS avg_value_month
FROM table_name;
Outputs (for February, when there is a full month of preceding data):
ID | DATA_VALUE | TIME_STAMP | AVG_VALUE_MONTH_MINUS_1_DAY | AVG_VALUE_MONTH
-: | ---------: | :------------------ | --------------------------: | --------------:
1 | 32 | 2020-02-01 00:00:00 | 17 | 16.5
1 | 33 | 2020-02-02 00:00:00 | 18 | 17.5
1 | 34 | 2020-02-03 00:00:00 | 19 | 18.5
1 | 35 | 2020-02-04 00:00:00 | 20 | 19.5
1 | 36 | 2020-02-05 00:00:00 | 21 | 20.5
1 | 37 | 2020-02-06 00:00:00 | 22 | 21.5
1 | 38 | 2020-02-07 00:00:00 | 23 | 22.5
1 | 39 | 2020-02-08 00:00:00 | 24 | 23.5
1 | 40 | 2020-02-09 00:00:00 | 25 | 24.5
1 | 41 | 2020-02-10 00:00:00 | 26 | 25.5
1 | 42 | 2020-02-11 00:00:00 | 27 | 26.5
1 | 43 | 2020-02-12 00:00:00 | 28 | 27.5
1 | 44 | 2020-02-13 00:00:00 | 29 | 28.5
1 | 45 | 2020-02-14 00:00:00 | 30 | 29.5
1 | 46 | 2020-02-15 00:00:00 | 31 | 30.5
1 | 47 | 2020-02-16 00:00:00 | 32 | 31.5
1 | 48 | 2020-02-17 00:00:00 | 33 | 32.5
1 | 49 | 2020-02-18 00:00:00 | 34 | 33.5
1 | 50 | 2020-02-19 00:00:00 | 35 | 34.5
db<>fiddle here
Alas, Oracle doesn't support intervals with both months and smaller units.
One method is to subtract it out:
select (sum(data_value) over (partition by id
                              order by time_stamp
                              range between interval '3' month preceding and current row
                             ) -
        sum(data_value) over (partition by id
                              order by time_stamp
                              range between interval '3' month preceding and interval '3' month preceding
                             )
       ) /
       (count(data_value) over (partition by id
                                order by time_stamp
                                range between interval '3' month preceding and current row
                               ) -
        count(data_value) over (partition by id
                                order by time_stamp
                                range between interval '3' month preceding and interval '3' month preceding
                               )
       ) as avg_excluding_boundary
from table_name;
Admittedly, this is cumbersome for an average, but it might be just fine for a sum() or count().
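For instance, a hedged sketch of the count() version of this subtract-it-out trick, reusing the table_name/id/time_stamp names from the answer above and a one-month window to match the question:
select id,
       time_stamp,
       count(*) over (partition by id
                      order by time_stamp
                      range between interval '1' month preceding and current row
                     ) -
       count(*) over (partition by id
                      order by time_stamp
                      range between interval '1' month preceding and interval '1' month preceding
                     ) as cnt_excluding_boundary
from table_name;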
To shift the window of time that you are looking at you can shift the value you are sorting on by an appropriate interval of time:
SELECT AVG(data_value)
OVER (PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW
) Current_Calc
, AVG(data_value)
OVER (PARTITION BY id
ORDER BY time_stamp - interval '15' minute
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW
) Shift_Back
, AVG(data_value)
OVER (PARTITION BY id
ORDER BY time_stamp + interval '15' minute
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW
) shift_forward
FROM Your_Data
Based on the description of your problem, I believe you want to shift it back by 15 minutes, but I could be misreading the problem statement; without appropriate data to test against and expected results, I can't be certain.
These are sliding windows that always contain one month's worth of data relative to the current time_stamp. This means that for each time_stamp you will get anywhere from 29 to 32 days' worth of data, with some of that data being counted in both the current and preceding months' averages.
On the other hand, if what you are interested in is averages for the discrete months, then you should be partitioning by month rather than creating a sliding window; if you want running averages per month you can add the sort, but you won't need the windowing clause:
SELECT TRUNC(time_stamp, 'MM') MON
, AVG(data_value)
OVER (PARTITION BY id, TRUNC(time_stamp, 'MM')) MON_AVG
, AVG(data_value)
OVER (PARTITION BY id, TRUNC(time_stamp, 'MM')
ORDER BY time_stamp) RUN_MON_AVG
, TRUNC(time_stamp - INTERVAL '15' MINUTE, 'MM') MON_2
, AVG(data_value)
OVER (PARTITION BY id, TRUNC(time_stamp - INTERVAL '15' MINUTE, 'MM')
) MON_AVG_2
, AVG(data_value)
OVER (PARTITION BY id, TRUNC(time_stamp - INTERVAL '15' MINUTE, 'MM')
ORDER BY time_stamp) RUN_MON_AVG_2
FROM Your_Data
Thanks for the feedback! I was able to assemble the answer I needed based on the answers above. Here is the code that I went with:
SELECT AVG(data_value) OVER (
    PARTITION BY id
    ORDER BY time_stamp
    RANGE BETWEEN (NUMTODSINTERVAL(EXTRACT(DAY FROM (TRUNC(time_stamp, 'MM') - 1)), 'DAY')
                   - NUMTODSINTERVAL(1, 'SECOND')) PRECEDING
          AND CURRENT ROW)
Because my interval is exactly one month and I want to exclude the first entry, I first convert the length of the previous month into a day-to-second interval, as recommended above. Then I subtract one second from the lower bound of the interval. This has the effect of making the lower bound of the interval an "open" bound and the upper bound a "closed" bound.
As a side note, I used one second because the periodicity of my dataset is not consistent, but its minimum is three minutes, so anything less than that will work.

Redshift querying data on dates

I am trying to query data from a Redshift table.
I have a users table with columns such as name, age, gender and created_at for example:
| name | age | gender | created_at |
|------|-----|--------|------------|
| X    | 24  | F      | some_date  |
I need to query the above table in such a way that I get additional columns such as created_this_week, created_last_week, created_last_4_week, current_month, last_month etc.
The additional flag columns should be 'Y' when the row satisfies a condition such as created in the current week, last week, current month, last month, or the last 4 weeks (excluding this week, so the 4 weeks starting last week), something like below.
| name | age | gender | created_at  | current_week | last_week | last_4_week | current_mnth | last_mnth |
|------|-----|--------|-------------|--------------|-----------|-------------|--------------|-----------|
| X    | 24  | F      | CURRENTDATE | Y            | N         | N           | Y            | N         |
| F    | 21  | M      | lst_wk_dt   | N            | Y         | Y           | Depends      | depends   |
I am new to PostgreSQL and Redshift and still in my learning phase. I spent the past few hours trying to do this myself but was unsuccessful. I'd really appreciate it if someone could help me out with this one.
You would use case expressions:
select t.*,
(case when created_at >= now() - interval '1 week' then 'Y' else 'N' end) as week1,
(case when created_at >= now() - interval '4 week' then 'Y' else 'N' end) as week4,
. . .
from t;
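If the flags need to follow calendar weeks and months (current week vs. last week, current month vs. last month) rather than rolling intervals, a hedged sketch using date_trunc might look like the following. It assumes a users table with the created_at column described in the question and Redshift's getdate(); adjust the week boundary to your own definition of a week:
select u.*,
       case when date_trunc('week', created_at) = date_trunc('week', getdate())
            then 'Y' else 'N' end as current_week,
       case when date_trunc('week', created_at) = date_trunc('week', getdate()) - interval '1 week'
            then 'Y' else 'N' end as last_week,
       case when created_at >= date_trunc('week', getdate()) - interval '4 weeks'
             and created_at <  date_trunc('week', getdate())
            then 'Y' else 'N' end as last_4_week,
       case when date_trunc('month', created_at) = date_trunc('month', getdate())
            then 'Y' else 'N' end as current_mnth,
       case when date_trunc('month', created_at) = date_trunc('month', getdate()) - interval '1 month'
            then 'Y' else 'N' end as last_mnth
from users u;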

Poor Performance on Outer Join Timestamp Range Comparisons (Gap-Filling Time Series Data)

I have some time-series data (1.5 million rows currently). I am filling in some time gaps in my query using the generate_series method.
Imagine the following data that has a gap between 10 AM and 1 PM....
+-------+----------+-------+
| time | category | value |
+-------+----------+-------+
| 8 AM | 1 | 100 |
| 9 AM | 1 | 200 |
| 10 AM | 1 | 300 |
| 1 PM | 1 | 100 |
| 2 PM | 1 | 500 |
+-------+----------+-------+
I need my query results to fill in any gaps with the last known value for the series. Such as the following....
+-------+----------+-------+
| time | category | value |
+-------+----------+-------+
| 8 AM | 1 | 100 |
| 9 AM | 1 | 200 |
| 10 AM | 1 | 300 |
| 11 AM | 1 | 300 | (Gap filled with last known value)
| 12 PM | 1 | 300 | (Gap filled with last known value)
| 1 PM | 1 | 100 |
| 2 PM | 1 | 500 |
+-------+----------+-------+
I have a query that does this, but it's really slow (~5 seconds in the simplified example below). I'm hoping someone can show me a better/faster way?
In my case, my data is by the minute. So I fill in the gaps on 1-minute increments. I use the lead/window function to determine what the NEXT timestamp is for each row so I know which generated gap fillers will use that value.
Please see example below....
Generate test data
(create data for every minute for a year, with a 1 hour gap two hours ago)
create table mydata as
with a as
(
select
date_time
from
generate_series(date_trunc('minute', now())::timestamp - '1 year':: interval, date_trunc('minute', now()::timestamp - '2 hours'::interval), interval '1 minute') as date_time
union
select
date_time
from
generate_series(date_trunc('minute', now())::timestamp - '1 hour':: interval, date_trunc('minute', now()::timestamp ), interval '1 minute') as date_time
),
b as
(
select category from generate_series(1,10,1) as category
)
select
a.*,
b.*,
round(random() * 100)::integer as value
from
a
cross join
b
;
create index myindex1 on mydata (category, date_time);
create index myindex2 on mydata (date_time);
Query the data to get all category=5 data for the last 5 days (with gaps filled)
with a as
(
select
mydata.*,
lead(mydata.date_time) over (PARTITION BY category ORDER BY date_time asc) as next_date_time
from
mydata
where
category = 5
and
date_time between now() - '5 days'::interval and now()
),
b as
(
SELECT generated_time::timestamp without time zone FROM generate_series(date_trunc('minute', now()) - '5 days'::interval, date_trunc('minute', now()), interval '1 minute') as generated_time
)
select
b.generated_time as date_time,
a.category,
a.value
from
b
left join
a
on
b.generated_time >= a.date_time and b.generated_time < a.next_date_time
order by
b.generated_time desc
;
This query functions perfectly. Sample results...
+---------------------+----------+-------+
| date_time | category | value |
+---------------------+----------+-------+
| 2018-07-06 12:17:00 | 5 | 13 |
| 2018-07-06 12:16:00 | 5 | 17 | (gap filled)
| 2018-07-06 12:15:00 | 5 | 17 | (gap filled)
| ... | ... | ... | (gap filled)
| 2018-07-06 11:18:00 | 5 | 17 | (gap filled)
| 2018-07-06 11:17:00 | 5 | 17 |
| 2018-07-06 11:16:00 | 5 | 62 |
+---------------------+----------+-------+
However, this part kills performance...
b.generated_time >= a.date_time and b.generated_time < a.next_date_time
If I just do something like:
b.generated_time = a.next_date_time
then it's very fast but, of course, gives incorrect results. It really doesn't like me using AND/OR or greater-than/less-than comparisons. I thought that maybe it was because I was comparing against next_date_time, which is generated on the fly and not indexed, but even after materializing that data into a table with an index first, performance was roughly the same.
I added the timescaledb extension tag to this post in case they have some built-in functionality to assist with this.
The 'explain' results
Sort (cost=268537.46..270431.35 rows=757556 width=16)
Sort Key: b.generated_time DESC
CTE a
-> WindowAgg (cost=0.44..11057.66 rows=6818 width=24)
-> Index Scan using myindex1 on mydata (cost=0.44..10938.35 rows=6818 width=16)
Index Cond: ((category = 5) AND (date_time >= (now() - '5 days'::interval)) AND (date_time <= now()))
CTE b
-> Function Scan on generate_series generated_time (cost=0.02..12.52 rows=1000 width=8)
-> Nested Loop Left Join (cost=0.00..170538.18 rows=757556 width=16)
Join Filter: ((b.generated_time >= a.date_time) AND (b.generated_time < a.next_date_time))
-> CTE Scan on b (cost=0.00..20.00 rows=1000 width=8)
-> CTE Scan on a (cost=0.00..136.36 rows=6818 width=24)
I'm using Postgres 10.4. Any suggestions on how to make this faster?
Thanks!!
So, I'm going to 'partially' answer my own question. I did find a way to accomplish what I want that performs MUCH better (sub-second). However, it is not as intuitive/readable, and I would really like to know how to make my first method faster. Just for the sake of knowledge, I really want to know what I was doing wrong.
Anyway, the following method seems to work. I calculate the number of minutes between each row and the next, then just generate a series of rows with the same data in 1-minute increments, that many times.
I'll give this a few days. If nobody comes up with a fix (or a better way) for the first method, then I'll mark this as the accepted answer.
select
generate_series(
    date_time,
    date_time
        + ((((EXTRACT(EPOCH FROM (lead(mydata.date_time) over w - date_time)) / 60) - 1)
            || ' minutes')::interval,
    interval '1 minute'
) as date_time,
category,
value
from
mydata
where
category = 5
and
date_time between now() - '5 days'::interval and now()
window w as (PARTITION BY category ORDER BY date_time asc)
order by
mydata.date_time desc
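One caveat worth noting with this generate_series approach: on the last row of the range, lead() returns NULL, so generate_series gets a NULL upper bound and that row disappears from the output. A hedged tweak (a sketch, not tested against your data) is to coalesce the computed offset to zero so the final row still emits itself:
select
    generate_series(
        date_time,
        date_time
            + ((coalesce(EXTRACT(EPOCH FROM (lead(mydata.date_time) over w - date_time)) / 60 - 1, 0))
               || ' minutes')::interval,
        interval '1 minute'
    ) as date_time,
    category,
    value
from
    mydata
where
    category = 5
    and date_time between now() - '5 days'::interval and now()
window w as (partition by category order by date_time asc)
order by
    mydata.date_time desc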