Sampling SQL timeseries - sql

I have a timeseries of (datetime, double) columns stored in MySQL and would like to sample the timeseries every minute (i.e. pull out the last value at one-minute intervals). Is there an efficient way of doing this in one select statement?
The brute-force way would involve either selecting the whole series and doing the sampling on the client side, or sending one select for each point (e.g. select * from data where timestamp < xxxxxxxxx order by timestamp desc limit 1).

You could convert your timestamps to UNIX timestamps, group by unix_timestamp DIV 60, and pull the maximum timestamp from each group. Afterwards, join the resulting list back to the original table to pull the data for those timestamps.
Basically it might look something like this:
SELECT
    t.* /* you might want to be more specific here */
FROM atable t
INNER JOIN (
    SELECT
        MAX(timestamp) AS timestamp
    FROM atable
    GROUP BY UNIX_TIMESTAMP(timestamp) DIV 60
) m ON t.timestamp = m.timestamp

You can use DATE_FORMAT to keep just the parts of the datetime that you want: format the datetime down to the minute, and then, for each group with that "rounded-off" time, pick the row with the maximum time.
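A minimal sketch of that approach, assuming the same atable/timestamp names as in the previous answer (DATE_FORMAT's '%Y-%m-%d %H:%i' keeps everything down to the minute):
SELECT
    t.*
FROM atable t
INNER JOIN (
    SELECT
        MAX(timestamp) AS timestamp
    FROM atable
    GROUP BY DATE_FORMAT(timestamp, '%Y-%m-%d %H:%i') /* one group per minute */
) m ON t.timestamp = m.timestamp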

Related

How to compute window function for each nth row in Presto?

I am working with a table that contains timeseries data, with a row for each minute for each user.
I want to compute some aggregate functions on a rolling window of N calendar days.
This is achieved via
SELECT
    SOME_AGGREGATE_FUN(col) OVER (
        PARTITION BY user_id
        ORDER BY timestamp
        ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
    ) as my_col
FROM my_table
However, I am only interested in the result of this at a daily scale.
i.e. I want the window to be computed only at 00:00:00, but I want the window itself to contain all the minute-by-minute data to be passed into my aggregate function.
Right now I am doing this:
WITH agg_results AS (
    SELECT
        timestamp_col,
        SOME_AGGREGATE_FUN(col) OVER (
            PARTITION BY user_id
            ORDER BY timestamp_col
            ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
        ) as my_col
    FROM my_table
)
SELECT * FROM agg_results
WHERE
    timestamp_col = DATE_TRUNC('day', timestamp_col)
This works in theory, but it does 60 * 24 times more computations than necessary, resulting in the query being super slow.
Essentially, I am trying to find a way to make the right window bound skip rows based on a condition, or, if it is simpler to implement, to compute the window only for every nth row (as I have a constant number of rows for each day).
I don't think that's possible with window functions. You could switch to a subquery instead, assuming that your aggregate function works as a regular aggregate function too (that is, without an OVER() clause):
select
    timestamp_col,
    (
        select some_aggregate_fun(t1.col)
        from my_table t1
        where
            t1.user_id = t.user_id
            and t1.timestamp_col >= t.timestamp_col - interval '1' day
            and t1.timestamp_col <= t.timestamp_col
    )
from my_table t
where timestamp_col = date_trunc('day', timestamp_col)
I am unsure that this would perform better than your original query though; you might need to assess that against your actual dataset.
You can change interval '1' day to the actual interval you want to use.
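For example, for a hypothetical N of 7 days, the condition inside the subquery would become:
and t1.timestamp_col >= t.timestamp_col - interval '7' day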

Saving only unique datapoints in SQL

For simplicity: We have a table with 2 columns, value and date.
Every second a new data point is received and we want to save it with its timestamp. Since the data can be similar, to lower storage usage we don't save it if the value is the same as the previous entry.
Question: given that the same value was received for 24 hours, only the first value & date pair is saved. If we want to query 'average value in the last hour', is there a way to have the DB (PostgreSQL) see that no values were saved in the last hour and fall back to the last existing value entry?
It is not as easy as it may seem, and it is not just about retrieving the latest data point when there is none available within the last hour. You want to calculate an average, so you need to rebuild the time-series data of the period, second by second, filling the gaps with the latest available data point.
I think the simplest approach is generate_series() to build the rows, and then a lateral join to recover the data:
select avg(d.value) as avg_last_hour
from generate_series(now() - interval '1 hour', now(), interval '1 second') t(ts)
cross join lateral (
    select data.*
    from data
    where data.date <= t.ts
    order by data.date desc
    limit 1
) d
Hmmm . . . if you simply want the average of values in the most recent hour in the data, you can use:
select date_trunc('hour', date) as ddhh, avg(value)
from t
group by ddhh
order by ddhh desc
limit 1;
If you have a lot of data being collected, it might be faster to add an index on date and use:
select avg(value)
from t
where date >= date_trunc('hour', (select max(t2.date) from t t2));
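For reference, the index mentioned above could be created like this (a sketch, assuming the table and column names used in the query):
create index on t (date);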

Obtain latest record for a given second Postgres

I have data with millisecond-precision timestamps. I want to keep only the most recent timestamp within a given second. I.e. of the records (2020-07-13 5:05:38.009, event1) and (2020-07-13 5:05:38.012, event2), only the latter should be retrieved.
I've tried the following:
SELECT
timestamp as time, event as value, event_type as metric
FROM
table
GROUP BY
date_trunc('second', time)
But then I'm asked to group by event as well and I see all the data (as if no group by was provided)
In Postgres, you can use distinct on:
select distinct on (date_trunc('second', time)) t.*
from t
order by date_trunc('second', time), time desc;
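If the final result should come back in chronological order rather than grouped-by-second order, one option is to wrap the query (a sketch using the same table and column names):
select *
from (
    select distinct on (date_trunc('second', time)) t.*
    from t
    order by date_trunc('second', time), time desc
) x
order by time;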

Resample on time series data

I have a table with a time series column at millisecond precision. I want to resample the time series and apply a mean to each group. How can I implement this in Postgres?
"Resample" means aggregate all time stamps within one second or one minute. All rows within one second or one minute form a group.
Table structure: date, x, y, z
Use date_trunc() to truncate timestamps to a given unit of time, and GROUP BY that expression:
SELECT date_trunc('minute', date) AS date_truncated_to_minute
, avg(x) AS avg_x
, avg(y) AS avg_y
, avg(z) AS avg_z
FROM tbl
GROUP BY 1;
Assuming your misleadingly named date column is actually of type timestamp or timestamptz.
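Since the question mentions both one-second and one-minute groups, the same pattern covers one-second buckets by changing the truncation unit (a sketch with the same column names):
SELECT date_trunc('second', date) AS date_truncated_to_second
     , avg(x) AS avg_x
     , avg(y) AS avg_y
     , avg(z) AS avg_z
FROM tbl
GROUP BY 1;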
Related answer with more details and links:
PostgreSQL: running count of rows for a query 'by minute'

Postgres SQL select a range of records spaced out by a given interval

I am trying to determine if it is possible, using only SQL in Postgres, to select a range of time-ordered records at a given interval.
Let's say I have 60 records, one record for each minute in a given hour. I want to select records at 5-minute intervals for that hour. The resulting rows should be 12 records, each one 5 minutes apart.
This is currently accomplished by selecting the full range of records and then looping through the results and pulling out the records at the given interval. I am trying to see if I can do this purely in SQL, as our DB is large and we may be dealing with tens of thousands of records.
Any thoughts?
Yes you can. It's really easy once you get the hang of it. I think it's one of the jewels of SQL, and it's especially easy in PostgreSQL because of its excellent temporal support. Often, complex functions can turn into very simple queries in SQL that can scale and be indexed properly.
This uses generate_series to draw up sample time stamps that are spaced 1 minute apart. The outer query then extracts the minute and uses modulo to find the values that are 5 minutes apart.
select
    ts,
    extract(minute from ts)::integer as minute
from
( -- generate some time stamps - one minute apart
    select
        current_time + (n || ' minute')::interval as ts
    from generate_series(1, 30) as n
) as timestamps
-- extract the minute and check if it's on a 5 minute interval
where extract(minute from ts)::integer % 5 = 0
-- only pick this hour
and extract(hour from ts) = extract(hour from current_time)
;
ts | minute
--------------------+--------
19:40:53.508836-07 | 40
19:45:53.508836-07 | 45
19:50:53.508836-07 | 50
19:55:53.508836-07 | 55
Note that adding a computed (expression) index matching the where clause, where the value of the expression makes up the index, could lead to major speed improvements. Maybe not very selective in this case, but good to be aware of.
I wrote a reservation system once in PostgreSQL (which had lots of temporal logic where date intervals could not overlap) and never had to resort to iterative methods.
http://www.amazon.com/SQL-Design-Patterns-Programming-Focus/dp/0977671542 is an excellent book that has lots of interval examples. Hard to find in bookstores now, but well worth it.
Extract the minutes, convert to int4, and check whether the remainder from dividing by 5 is 0:
select *
from TABLE
where int4 (date_part ('minute', COLUMN)) % 5 = 0;
If the intervals are not time based and you just want every 5th row, or
if the times are regular and you always have one record per minute,
the query below gives you one record for every 5.
select *
from
(
    select *, row_number() over (order by timecolumn) as rown
    from tbl
) X
where mod(rown, 5) = 1
If your time records are not regular, then you need to generate a time series (given in another answer) and left join that into your table, group by the time column (from the series) and pick the MAX time from your table that is less than the time column.
Pseudo
select thetimeinterval, max(timecolumn)
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
And further join it back to the table for the full record (assuming unique times)
select tbl.*
from tbl
inner join (
    select thetimeinterval, max(timecolumn) timecolumn
    from ( < the time series subquery > ) X
    left join tbl on tbl.timecolumn <= thetimeinterval
    group by thetimeinterval
) y on tbl.timecolumn = y.timecolumn
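For illustration, the < the time series subquery > placeholder could be built with generate_series; e.g. a hypothetical 5-minute grid over the last hour:
select g.ts as thetimeinterval
from generate_series(now() - interval '1 hour', now(), interval '5 minute') as g(ts)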
How about this:
select min(ts), extract(minute from ts)::integer / 5 as bucket
from tbl -- your table of readings
group by bucket
order by bucket;
This has the advantage of doing the right thing if you have two readings for the same minute, or if your readings skip a minute. Instead of using min, even better would be to use one of the first() aggregate functions -- code for which you can find here:
http://wiki.postgresql.org/wiki/First_%28aggregate%29
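A minimal sketch of such a first() aggregate, along the lines of the wiki recipe (treat the definition on that page as authoritative):
create or replace function first_agg(anyelement, anyelement)
returns anyelement language sql immutable strict as
$$ select $1 $$;

create aggregate first (anyelement) (
    sfunc = first_agg,
    stype = anyelement
);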
This assumes that your five minute intervals are "on the fives", so to speak. That is, that you want 07:00, 07:05, 07:10, not 07:02, 07:07, 07:12. It also assumes you don't have two rows within the same minute, which might not be a safe assumption.
select your_timestamp
from your_table
where cast(extract(minute from your_timestamp) as integer) in (0,5);
If you might have two rows with timestamps within the same minute, like
2011-01-01 07:00:02
2011-01-01 07:00:59
then this version is safer.
select min(your_timestamp)
from your_table
group by (cast(extract(minute from your_timestamp) as integer) / 5)
Wrap either of those in a view, and you can join it to your base table.
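A minimal sketch of the view-and-join idea, using the second query above (the view and column names here are hypothetical):
create view five_minute_samples as
select min(your_timestamp) as sampled_ts
from your_table
group by cast(extract(minute from your_timestamp) as integer) / 5;

select t.*
from your_table t
join five_minute_samples s on t.your_timestamp = s.sampled_ts;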