Resample on time series data - sql

I have a table with a time series column at millisecond resolution. I want to resample the time series and apply a mean to each group. How can I implement this in Postgres?
"Resample" means aggregate all time stamps within one second or one minute. All rows within one second or one minute form a group.
table structure
date x y z

Use date_trunc() to truncate timestamps to a given unit of time, and GROUP BY that expression:
SELECT date_trunc('minute', date) AS date_truncated_to_minute
, avg(x) AS avg_x
, avg(y) AS avg_y
, avg(z) AS avg_z
FROM tbl
GROUP BY 1;
This assumes your misleadingly named date column is actually of type timestamp or timestamptz.
Related answer with more details and links:
PostgreSQL: running count of rows for a query 'by minute'
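To make the grouping concrete, here is a minimal client-side sketch in Python of what the query above computes for a single column: truncate each timestamp to the minute, then average within each group. The function name resample_mean and the sample rows are made up for illustration; in practice you would let Postgres do this.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def resample_mean(rows):
    """Group (timestamp, x) pairs by minute and average x within each group."""
    buckets = defaultdict(list)
    for ts, x in rows:
        # Same idea as date_trunc('minute', ts): zero out seconds and below.
        buckets[ts.replace(second=0, microsecond=0)].append(x)
    return {minute: mean(values) for minute, values in sorted(buckets.items())}

# Hypothetical sample data with millisecond-level timestamps.
rows = [
    (datetime(2024, 1, 1, 12, 0, 0, 250000), 1.0),
    (datetime(2024, 1, 1, 12, 0, 30, 500000), 3.0),
    (datetime(2024, 1, 1, 12, 1, 5), 10.0),
]
result = resample_mean(rows)
```

The first two rows fall into the same minute and are averaged together; the third forms its own group.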

Saving only unique datapoints in SQL

For simplicity: We have a table with 2 columns, value and date.
Every second, new data is received and we want to save it with its timestamp. Since consecutive values are often identical, to reduce storage we don't save a value if it is the same as the previous entry.
Question: Given that same value was received during 24 hours, only the first value & date pair is saved. If we want to query 'Average value in last 1 hour', is there a way to have the db (PostgreSQL) see that no values are saved in last hour and search for last existing value entry?
It is not as easy as it may seem, and it is not just about retrieving the latest data point when there is none available within the last hour. You want to calculate an average, so you need to rebuild the time-series data of the period, second per second, filling the gaps with the latest available data point.
I think the simplest approach is generate_series() to build the rows, and then a lateral join to recover the data:
select avg(d.value) avg_last_hour
from generate_series(now() - interval '1 hour', now(), '1 second') t(ts)
cross join lateral (
    -- latest stored data point at or before this second
    select d.*
    from data d
    where d.date <= t.ts
    order by d.date desc
    limit 1
) d
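If it helps to see the gap-filling logic outside SQL, here is a rough Python sketch of the same idea: walk the window one tick at a time and, for each tick, use the latest stored point at or before it. The function name gap_filled_average and the sample points are assumptions for illustration only.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def gap_filled_average(points, start, end, step=timedelta(seconds=1)):
    """Average over [start, end], carrying the last stored value forward.

    points: list of (timestamp, value) sorted by timestamp, stored only on change.
    """
    times = [ts for ts, _ in points]
    total, count = 0.0, 0
    t = start
    while t <= end:
        i = bisect_right(times, t)  # number of stored points with timestamp <= t
        if i:  # ticks before the first stored point contribute nothing
            total += points[i - 1][1]
            count += 1
        t += step
    return total / count if count else None

# Hypothetical data: the value changed only twice.
points = [
    (datetime(2024, 1, 1, 10, 0, 0), 10.0),
    (datetime(2024, 1, 1, 11, 59, 58), 20.0),
]
avg = gap_filled_average(
    points,
    start=datetime(2024, 1, 1, 11, 59, 56),
    end=datetime(2024, 1, 1, 12, 0, 0),
)
```

Here the five ticks see 10, 10, 20, 20, 20, so the value stored at 10:00:00 still counts toward the average even though it lies outside the window.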
Hmmm . . . if you simply want the average of values in the most recent hour in the data, you can use:
select date_trunc('hour', date) as ddhh, avg(value)
from t
group by ddhh
order by ddhh desc
limit 1;
If you have a lot of data being collected, it might be faster to add an index on date and use:
select avg(value)
from t
where date >= date_trunc('hour', (select max(t2.date) from t t2));

How to select rows until the sum of a column reaches N, where the column is of type TIME

I would like to select enough audio calls to have 00:10:00 minutes of audio. I have tried to achieve this by writing the following SQL (postgres) statement
SELECT file_name, audio_duration
FROM (
SELECT distinct file_name, audio_duration, SUM(audio_duration)
OVER (ORDER BY audio_duration) AS total_duration
FROM data
) AS t
WHERE
t.total_duration <='00:10:00'
GROUP BY file_name, audio_duration
My problem is that it doesn't seem to be calculating the total duration correctly.
I suspect this is due to the audio_duration column being of type TIME.
If anyone has any hints or suggestions on how to write this query, it would be greatly appreciated.
You should really define that column to be an interval. A time column stores a moment in time, e.g. "3 in the afternoon".
However you can cast a single time value to an interval. You also don't need the window function to first calculate the "running total" if you want the total duration per file:
SELECT file_name, sum(audio_duration::interval) as total_duration
FROM data
GROUP BY file_name
HAVING sum(audio_duration::interval) <= interval '10 minute';
To permanently change the column type to an interval you can use:
alter table data
    alter column audio_duration type interval;
I fully agree with @a_horse_with_no_name that interval is the better data type, but I must admit that the time data type is not incorrect. While you cannot add (+) time values, you can SUM them. Summing time values results in an interval and produces the same result as summing the corresponding intervals. Besides being a moment, a time is also the interval from the beginning of the day to that moment. Demo (fiddle):
with as_time (dur) as ( values ('10:34:45 AM'::time), ('03:14:50 PM'::time), ('11:15:25 PM'::time))
, as_intv (dur) as ( values ('10:34:45'::interval), ('15:14:50'::interval),('23:15:25'::interval))
select *
from (select sum(dur) sum_time from as_time) st
, (select sum(dur) sum_intv from as_intv) si;
BTW: The answer to the rhetorical question "What is the sum of '8 in the morning' and '3 in the afternoon'?" is 23:00:00.
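The same "time of day as an offset from midnight" reading can be sketched in Python, using the three values from the demo above. The helper as_duration is a made-up name; this only illustrates the arithmetic, not how Postgres implements it.

```python
from datetime import time, timedelta

def as_duration(t):
    """Interpret a time of day as the duration elapsed since midnight."""
    return timedelta(
        hours=t.hour, minutes=t.minute, seconds=t.second, microseconds=t.microsecond
    )

# The same three values as in the demo query.
durations = [time(10, 34, 45), time(15, 14, 50), time(23, 15, 25)]
total = sum((as_duration(t) for t in durations), timedelta())
```

The total comes to 49 hours and 5 minutes, which a time of day cannot represent but a duration can.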

examine if one time series column of table has two adjacent time points which have interval larger than certain length

I am dealing with data preprocessing on a table containing time series column
toy example Table A
timestamp value
12:30:24 1
12:32:21 3
12:33:21 4
The timestamp column is ordered and always increases.
Is it possible to define a function or something else that returns true when the table has two adjacent time points whose interval is larger than a certain length, and false otherwise?
I am using postgresql, thank you
SQL Fiddle
select bool_or(bigger_than) as bigger_than
from (
select
time - lag(time) over (order by time)
>
interval '1 minute' as bigger_than
from table_a
) s;
bigger_than
-------------
t
bool_or returns true if at least one input value is true. Note that, as an aggregate, it still reads every row produced by the subquery.
http://www.postgresql.org/docs/current/static/functions-aggregate.html
Your sample data shows a time value. But it works the same for a timestamp
Something like this:
select count(*) > 0
from (
    select timestamp,
           lag(timestamp) over (order by timestamp) as prev_ts
    from table_a
) t
where timestamp - prev_ts > interval '1' minute;
It calculates the difference between a timestamp and its "previous" timestamp, with rows ordered by the timestamp column. The outer query then counts the number of rows where the difference is larger than 1 minute. The first row has no previous timestamp, so its difference is NULL and it is filtered out.
lag() is a window function. More details on those can be found in the manual:
http://www.postgresql.org/docs/current/static/tutorial-window.html

Query - find empty interval in series of timestamps

I have a table that stores historical data. A row is inserted into this table every 30 seconds from different types of sources, and each row has a timestamp associated with it.
Let's set my disservice parameter to 1 hour.
Since I charge for my services based on time, I need to know, for example, whether within a specific month there is a gap between rows that equals or exceeds my 1 hour interval.
A simplified structure of the table would be like:
tid serial primary key,
tunitd id int,
tts timestamp default now(),
tdescr text
I don't want to write a function that loops through all the records comparing them one by one as I suppose it is time and memory consuming.
Is there any way to do this directly from SQL maybe using the interval type in PostgreSQL?
Thanks.
This small SQL query will display all gaps with a duration of more than one hour:
select tts, next_tts, next_tts-tts as diff from
(select a.tts, min(b.tts) as next_tts
from test1 a
inner join test1 b ON a.tts < b.tts
GROUP BY a.tts) as c
where next_tts - tts > INTERVAL '1 hour'
order by tts;
SQL Fiddle
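As a cross-check of what the query returns, here is a small Python sketch that finds the same gaps by comparing each timestamp with the next one after sorting. The name find_gaps and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, threshold):
    """Return (start, end, duration) for consecutive timestamps further apart than threshold."""
    ordered = sorted(timestamps)
    return [
        (a, b, b - a)
        for a, b in zip(ordered, ordered[1:])  # each timestamp paired with the next one
        if b - a > threshold
    ]

# Hypothetical readings with one outage between 00:30 and 02:00.
readings = [
    datetime(2024, 3, 1, 0, 0),
    datetime(2024, 3, 1, 0, 30),
    datetime(2024, 3, 1, 2, 0),
    datetime(2024, 3, 1, 2, 30),
]
gaps = find_gaps(readings, timedelta(hours=1))
```

Like the query, this uses a strict comparison, so a gap of exactly one hour is not reported; change > to >= if "equal or exceeds" is what you need.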

Sampling SQL timeseries

I have a timeseries of (datetime, double) columns stored in MySQL and would like to sample the timeseries every minute (i.e. pull out the last value at one minute intervals). Is there an efficient way of doing this in one select statement?
The brute force way would involve either selecting the whole series and doing the sampling on the client side or sending one select for each point (e.g. select * from data where timestamp<xxxxxxxxx order by timestamp desc limit 1).
You could convert your timestamps to UNIX timestamps, group by unix_timestamp DIV 60 and pull the maximum timestamps from each group. Afterwards join the obtained list back to the original table to pull the data for the obtained timestamps.
Basically it might look something like this:
SELECT
t.* /* you might want to be more specific here */
FROM atable t
INNER JOIN (
SELECT
MAX(timestamp) AS timestamp
FROM atable
GROUP BY UNIX_TIMESTAMP(timestamp) DIV 60
) m ON t.timestamp = m.timestamp
You can use DATE_FORMAT to get just the parts of the datetime that you want. You want to get the datetime down to the minute, and then for each group with that "rounded-off" time, get the row with the maximum time.
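Both answers boil down to "bucket by minute, keep the row with the latest timestamp in each bucket". Purely to illustrate that logic (not as a substitute for doing it in the database), a minimal Python sketch with a made-up function name and sample rows might look like this:

```python
from datetime import datetime

def last_per_minute(rows):
    """For each minute, keep the (timestamp, value) pair with the latest timestamp."""
    latest = {}
    for ts, value in rows:
        minute = ts.replace(second=0, microsecond=0)  # the "rounded-off" time
        if minute not in latest or ts > latest[minute][0]:
            latest[minute] = (ts, value)
    return [latest[m] for m in sorted(latest)]

# Hypothetical samples: two in the first minute, one in the second.
rows = [
    (datetime(2024, 1, 1, 12, 0, 10), 1.5),
    (datetime(2024, 1, 1, 12, 0, 55), 2.5),
    (datetime(2024, 1, 1, 12, 1, 20), 3.5),
]
sampled = last_per_minute(rows)
```

Only the 12:00:55 and 12:01:20 rows survive, one per minute.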