calculating average with grouping based on time intervals - sql

In a Postgres table I store the speed of an object at 10-second intervals. The values are not available for every 10 seconds during the day, so there might be no row for today 16:39:40.
What would the query look like to get a relation containing the average speed for 1-minute (or 30-second, or n-second) intervals for a given day, assuming that non-existing rows mean a speed of 0?
speed_table
id (int, pk)
ts (timestamp)
speed (numeric)
I've built this query but am getting stuck on some important parts:
SELECT
date_trunc('minute', ts) AS truncated,
avg(speed)
FROM speed_table AS t
WHERE ts >= '2014-06-21 00:00:00'
AND ts <= '2014-06-21 23:59:59'
AND condition2 = 'something'
GROUP BY date_trunc('minute', ts)
ORDER BY truncated
How can I change the interval to something other than what date_trunc offers, e.g. 5 minutes or 30 seconds?
How can I add the missing rows for the remainder of the day?

Simple and fast solution for this particular example:
SELECT date_trunc('minute', ts) AS minute
, sum(speed)/6 AS avg_speed
FROM speed_table AS t
WHERE ts >= '2014-06-21 0:0'
AND ts < '2014-06-22 0:0' -- exclude dangling corner case
AND condition2 = 'something'
GROUP BY 1
ORDER BY 1;
You need to factor in missing rows as "0 speed". Since a minute has 6 samples, just sum and divide by 6. Missing rows evaluate to 0 implicitly.
This returns no row for minutes with no rows at all. The avg_speed for such missing result rows is 0.
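To see why dividing the sum by 6 differs from a plain avg() when samples are missing, here is a small Python model of the arithmetic. It is only a sketch, not PostgreSQL code, and the sample values are made up for illustration:

```python
# Hypothetical speeds recorded in one minute; 2 of the 6 ten-second samples are missing.
present = [12.0, 18.0, 6.0, 24.0]
SAMPLES_PER_MINUTE = 6  # 60 s / 10 s sampling interval

# avg(speed) only sees existing rows, so missing samples are ignored.
avg_existing = sum(present) / len(present)

# sum(speed) / 6 treats each missing sample as speed 0.
avg_with_zeros = sum(present) / SAMPLES_PER_MINUTE

print(avg_existing)    # 15.0
print(avg_with_zeros)  # 10.0
```

The second figure is what the question asks for, since non-existing rows are supposed to count as a speed of 0.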
General query for arbitrary intervals
Works for any interval listed in the manual for date_trunc():
SELECT date_trunc('minute', g.ts) AS ts_start
, avg(COALESCE(speed, 0)) AS avg_speed
FROM (SELECT generate_series('2014-06-21 0:0'::timestamp
, '2014-06-22 0:0'::timestamp
, '10 sec'::interval) AS ts) g
LEFT JOIN speed_table t USING (ts)
WHERE (t.condition2 = 'something' OR
t.condition2 IS NULL) -- depends on actual condition!
AND g.ts <> '2014-06-22 0:0'::timestamp -- exclude dangling corner case
GROUP BY 1
ORDER BY 1;
The problematic part is the additional unknown condition. You would need to define that. And decide whether missing rows supplied by generate_series should pass the test or not (which can be tricky!).
I let them pass in my example (and all other rows with a NULL value).
Compare:
PostgreSQL: running count of rows for a query 'by minute'
Arbitrary intervals:
Truncate timestamp to arbitrary intervals
For completely arbitrary intervals consider @Clodoaldo's math based on epoch values or use the often overlooked function width_bucket(). Example:
Aggregating (x,y) coordinate point clouds in PostgreSQL

If you had posted some sample data it would be possible to test this, so it may contain errors. Point them out, including the error message, so I can fix them.
select
to_timestamp(
(extract(epoch from ts.ts)::integer / (60 * 2)) * (60 * 2) -- qualified: both the series and speed_table have a ts column
) as truncated,
avg(coalesce(speed, 0)) as avg_speed
from
generate_series (
'2014-06-21 00:00:00'::timestamp,
'2014-06-22'::timestamp - interval '1 second',
'10 seconds'
) ts (ts)
left join
speed_table t on ts.ts = t.ts and condition2 = 'something'
group by 1
order by 1
The example groups by 2 minutes: the number of seconds since 1970-01-01 00:00:00 (the epoch) is integer-divided by 120 (60 * 2) and multiplied by 120 again. To group by 5 minutes, use 300 (60 * 5) instead; to group by 30 seconds, use 30.
The generate_series in the example generates timestamps at 10-second intervals. It is left outer joined to the speed table, so it fills the gaps. When the speed is null, coalesce returns 0.
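The epoch arithmetic can be checked outside the database. The following is a minimal Python sketch of the same floor-to-bucket calculation; the helper name bucket_start is made up for illustration, and UTC is assumed throughout:

```python
from datetime import datetime, timezone

def bucket_start(ts: datetime, width_seconds: int) -> datetime:
    """Floor ts to the start of its width_seconds-wide bucket, counted from the epoch."""
    epoch = int(ts.timestamp())
    floored = epoch // width_seconds * width_seconds  # integer division, then multiply back
    return datetime.fromtimestamp(floored, tz=timezone.utc)

ts = datetime(2014, 6, 21, 16, 39, 40, tzinfo=timezone.utc)
print(bucket_start(ts, 120))  # 2014-06-21 16:38:00+00:00
print(bucket_start(ts, 300))  # 2014-06-21 16:35:00+00:00
print(bucket_start(ts, 30))   # 2014-06-21 16:39:30+00:00
```

Only the bucket width in seconds changes between the three calls, which is the same knob as the (60 * 2) factor in the query.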

Related

examine if one time series column of table has two adjacent time points which have interval larger than certain length

I am dealing with data preprocessing on a table containing time series column
toy example Table A
timestamp value
12:30:24 1
12:32:21 3
12:33:21 4
The timestamp column is ordered and always increases.
Is it possible to define a function or something else that returns true when the table has two adjacent time points more than a certain interval apart, and false otherwise?
I am using postgresql, thank you
SQL Fiddle
select bool_or(bigger_than) as bigger_than
from (
select
time - lag(time) over (order by time)
>
interval '1 minute' as bigger_than
from table_a
) s;
bigger_than
-------------
t
bool_or will stop searching as soon as it finds the first true value.
http://www.postgresql.org/docs/current/static/functions-aggregate.html
Your sample data shows a time value. But it works the same for a timestamp
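As a cross-check of the logic (compare each row with its predecessor, then OR the results), here is a small Python sketch using the three sample rows from the question. The function name has_gap and the arbitrary date are assumptions for illustration only:

```python
from datetime import datetime, timedelta

def has_gap(timestamps, max_gap):
    """Return True if any two adjacent (sorted) timestamps are more than max_gap apart."""
    ordered = sorted(timestamps)
    return any(later - earlier > max_gap
               for earlier, later in zip(ordered, ordered[1:]))

day = datetime(2000, 1, 1)  # arbitrary date; only the time of day matters here
rows = [day.replace(hour=12, minute=30, second=24),
        day.replace(hour=12, minute=32, second=21),
        day.replace(hour=12, minute=33, second=21)]

print(has_gap(rows, timedelta(minutes=1)))  # True: 12:30:24 -> 12:32:21 is 1 min 57 s
print(has_gap(rows, timedelta(minutes=2)))  # False: no gap exceeds 2 minutes
```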
Something like this:
select count(*) > 0
from (
select timestamp,
lag(timestamp) over (order by value) as prev_ts
from table_a
) t
where timestamp - prev_ts < interval '1' minute;
It calculates the difference between a timestamp and its "previous" timestamp. The order of the timestamps is defined by the value column. The outer query then counts the number of rows where the difference is smaller than 1 minute.
lag() is called a window function. More details on those can be found in the manual:
http://www.postgresql.org/docs/current/static/tutorial-window.html

Query with Interval in Oracle (using JFreeChart)

I am assembling a query to show an experiment in JFreeChart. The query works fine, but the display in JFreeChart does not. It sorts the intervals as strings (for example, the interval 60<TA<=120 comes last in the chart but should be second). Here is an example using five intervals of 60 minutes each (TA is a numeric field and means Time Average):
SELECT INTERVAL, COUNT(*) TOTAL FROM (SELECT CASE WHEN TA>0 AND TA<=60.00 THEN '0<TA<=60.00' WHEN TA>60.00 AND TA<=120.00 THEN '60.00<TA<=120.00' WHEN TA>120.00 AND TA<=180.00 THEN '120.00<TA<=180.00' WHEN TA>180.00 AND TA<=240.00 THEN '180.00<TA<=240.00' WHEN TA>240.00 THEN '240.00<TA' END INTERVAL, TA FROM MP) GROUP BY INTERVAL HAVING INTERVAL IS NOT NULL ORDER BY INTERVAL
How can I display the intervals in the correct order without changing my query too much? It will be assembled on the fly depending on the user's choice.
If the INTERVAL column will always start with a valid number followed by <, you can convert its first value to a numeric for sorting:
SELECT INTERVAL, COUNT(*) TOTAL
FROM (
SELECT
CASE
WHEN TA>0 AND TA<=60.00 THEN '0<TA<=60.00'
WHEN TA>60.00 AND TA<=120.00 THEN '60.00<TA<=120.00'
WHEN TA>120.00 AND TA<=180.00 THEN '120.00<TA<=180.00'
WHEN TA>180.00 AND TA<=240.00 THEN '180.00<TA<=240.00'
WHEN TA>240.00 THEN '240.00<TA'
END INTERVAL,
TA
FROM MP
)
WHERE INTERVAL IS NOT NULL
GROUP BY INTERVAL
ORDER BY TO_NUMBER(SUBSTR(INTERVAL, 1, INSTR(INTERVAL, '<') - 1));
Also, I've changed the HAVING Interval IS NOT NULL to WHERE Interval IS NOT NULL because HAVING is for aggregated values such as COUNT(*), not for grouping values like INTERVAL.
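The ordering problem is easy to reproduce outside Oracle. This Python sketch (an illustration only; the helper lower_bound is a made-up name) sorts the five labels first as plain strings and then by the number in front of the first '<':

```python
labels = ['0<TA<=60.00', '60.00<TA<=120.00', '120.00<TA<=180.00',
          '180.00<TA<=240.00', '240.00<TA']

# Plain string ordering, as ORDER BY INTERVAL does: compared character by character,
# so '60.00...' sorts after '120.00...', '180.00...' and '240.00...'.
by_string = sorted(labels)
print(by_string[-1])  # 60.00<TA<=120.00

def lower_bound(label: str) -> float:
    # Mirrors TO_NUMBER(SUBSTR(label, 1, INSTR(label, '<') - 1))
    return float(label[:label.index('<')])

by_number = sorted(labels, key=lower_bound)
print(by_number == labels)  # True: numeric order matches the intended order
```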
Addendum If the number of intervals will vary, the query below may work out better for you. It calculates the interval text just like the query above, but it can handle any number of intervals. The first CASE condition handles values outside the range (n<TA); the second condition handles values within the range (m<TA<=n).
I've pointed out the values you'll need to set for each query.
SELECT Interval, COUNT(*) FROM (
SELECT
TA,
CASE
WHEN TA > Parm_IntSize * Parm_IntCount THEN
TO_CHAR(Parm_IntSize * Parm_IntCount) || '<TA'
ELSE
TO_CHAR(TRUNC(TA / Parm_IntSize) * Parm_IntSize)
|| '<='
|| TO_CHAR((TRUNC(TA / Parm_IntSize) + 1) * Parm_IntSize)
END AS Interval
FROM MP
CROSS JOIN (
SELECT
60 AS Parm_IntSize, -- Specify interval size here
4 AS Parm_IntCount -- Specify number of intervals here
FROM DUAL
) Parms
)
WHERE Interval IS NOT NULL
GROUP BY Interval
ORDER BY TO_NUMBER(SUBSTR(Interval, 1, INSTR(Interval, '<') - 1))

Min, Max, Avg for 2-hour intervals

In below query, I am calculating the minimum, the maximum, and the average for a two hour interval using PostgreSQL.
The query works fine for even start hours (..04:00:00+05:30), but for odd start hours (..05:00:00+05:30) it gives the same result as for even start hours.
The multiplication by 2 always returns even hours, which is the problem.
SELECT tagid, CAST(sample_time_stamp as Date) AS stat_date,
(floor(extract(hour from sample_time_stamp)/2) * 2)::int AS hrs,
min(sensor_reading) AS theMin,
max(sensor_reading) AS theMax,
avg(sensor_reading) AS theAvg
FROM sensor_readings WHERE tagid =1 AND
sample_time_stamp BETWEEN '2012-10-23 01:00:00+05:30'
AND '2012-10-23 05:59:00+05:30'
GROUP BY tagid,CAST(sample_time_stamp as Date),
floor(extract(hour from sample_time_stamp)/2) * 2
ORDER BY tagid,stat_date, hrs
Output for odd start hour ('2012-10-23 01:00:00+05:30')
tagid  date        hrs  theMin  theMax  theAvg
1      2012-10-23  0    6       58      30.95
1      2012-10-23  2    2       59      29.6916666666667
1      2012-10-23  4    3       89      31.7666666666667
Output for even start hour ('2012-10-23 02:00:00+05:30')
tagid  date        hrs  theMin  theMax  theAvg
1      2012-10-23  2    2       59      29.6916666666667
1      2012-10-23  4    3       89      31.7666666666667
To get constant time-frames starting with your minimum timestamp:
WITH params AS (
SELECT '2012-10-23 01:00:00+05:30'::timestamptz AS _min -- input params
,'2012-10-23 05:59:00+05:30'::timestamptz AS _max
,'2 hours'::interval AS _interval
)
,ts AS (SELECT generate_series(_min, _max, _interval) AS t_min FROM params)
,timeframe AS (
SELECT t_min
,lead(t_min, 1, _max) OVER (ORDER BY t_min) AS t_max
FROM ts, params
)
SELECT s.tagid
,t.t_min
,t.t_max -- mildly redundant except for last row
,min(s.sensor_reading) AS the_min
,max(s.sensor_reading) AS the_max
,avg(s.sensor_reading) AS the_avg
FROM timeframe t
LEFT JOIN sensor_readings s ON s.tagid = 1
AND s.sample_time_stamp >= t.t_min
AND s.sample_time_stamp < t.t_max
GROUP BY 1,2,3
ORDER BY 1,2;
Can be used for any time frame and any interval length. Requires PostgreSQL 8.4 or later.
If the maximum timestamp _max does not fall on _min + n * _interval the last time-frame is truncated. The last row can therefore represent a shorter time-frame than your desired _interval.
Key elements
Common Table Expressions (CTE) for easier handling. Input parameter values once in the top CTE params.
generate_series() for intervals to create the time raster.
Window function lead(...) with 3 parameters (including default) - to cover the special case of last row.
LEFT JOIN between raster and actual data, so that time frames without matching data will still show in the result (with NULL values as data).
That's also the reason for a later edit: WHERE condition had to move to the LEFT JOIN condition, to achieve that.
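To make the raster concrete, here is a rough Python sketch of what generate_series() combined with lead(t_min, 1, _max) produces for the parameters above. The generator name timeframes is hypothetical, and naive datetimes are used, so the +05:30 offset is ignored:

```python
from datetime import datetime, timedelta

def timeframes(t_min, t_max, step):
    """Yield (start, end) pairs from t_min in steps of `step`; the last end is capped at t_max."""
    start = t_min
    while start <= t_max:              # generate_series includes values up to _max
        yield start, min(start + step, t_max)
        start += step

frames = list(timeframes(datetime(2012, 10, 23, 1, 0),
                         datetime(2012, 10, 23, 5, 59),
                         timedelta(hours=2)))
for start, end in frames:
    print(start.time(), end.time())
# 01:00:00 03:00:00
# 03:00:00 05:00:00
# 05:00:00 05:59:00
```

The last frame covers only 59 minutes, which is the truncation described above.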
Alternative time frame generation with recursive CTE:
WITH RECURSIVE params AS (
SELECT '2012-10-23 01:00:00+05:30'::timestamptz AS _min -- input params
,'2012-10-23 05:59:00+05:30'::timestamptz AS _max
,'2 hours'::interval AS _interval
)
, timeframe AS (
SELECT _min AS t_min, LEAST(_min + _interval, _max) AS t_max
FROM params
UNION ALL
SELECT t_max, LEAST(t_max + _interval, _max)
FROM timeframe t, params p
WHERE t_max < _max
)
SELECT ...
Slightly faster ... take your pick.
-> sqlfiddle displaying both.
Note that you can have non-recursive CTEs (additionally) even when declared WITH RECURSIVE.
Performance & Index
Should be faster than your original query. Half the code deals with generating the time raster, which concerns few rows and is very fast. Handling actual table rows (the expensive part) gets cheaper, because we don't calculate a new value from every sample_time_stamp any more.
You should definitely have a multi-column index of the form:
CREATE INDEX foo_idx ON sensor_readings (tagid, sample_time_stamp DESC);
I use DESC on the assumption that you more often query recent entries (later timestamps). Remove the modifier if that's not the case. Doesn't make a big difference either way.

Group by data intervals

I have a single table which stores bandwidth usage on the network over a period of time. One column will contain the date time (primary key) and another column will record the bandwidth. Data is recorded every minute. We will have other columns recording other data at that moment in time.
If the user requests the data on 15 minute intervals (within a 24 hour period given start and end date), is it possible with a single query to get the data I require or would I have to write a stored procedure/cursor to do this? Users may then request 5 minute intervals data etc.
I will most likely be using Postgres but are there other NOSQL options which would be better?
Any ideas?
WITH t AS (
SELECT ts, (random()*100)::int AS bandwidth
FROM generate_series('2012-09-01', '2012-09-04', '1 minute'::interval) ts
)
SELECT date_trunc('hour', ts) AS hour_stump
,(extract(minute FROM ts)::int / 15) AS min15_slot
,count(*) AS rows_in_timeslice -- optional
,sum(bandwidth) AS sum_bandwidth
FROM t
WHERE ts >= '2012-09-02 00:00:00+02'::timestamptz -- user's time range
AND ts < '2012-09-03 00:00:00+02'::timestamptz -- careful with borders
GROUP BY 1, 2
ORDER BY 1, 2;
The CTE t provides data like your table might hold: one timestamp ts per minute with a bandwidth number. (You don't need that part, you work with your table instead.)
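The grouping key is just the hour plus the integer-divided minute. A minimal Python sketch of the same bucketing follows; the helper slot_key and the constant bandwidth of 2 are assumptions chosen to keep the totals easy to verify:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def slot_key(ts: datetime, slot_minutes: int = 15):
    """Mirror (date_trunc('hour', ts), extract(minute FROM ts)::int / 15)."""
    hour_stump = ts.replace(minute=0, second=0, microsecond=0)
    return hour_stump, ts.minute // slot_minutes

# Made-up data: one row per minute for one hour, constant bandwidth of 2.
start = datetime(2012, 9, 2, 0, 0)
rows = [(start + timedelta(minutes=i), 2) for i in range(60)]

sums = defaultdict(int)
for ts, bandwidth in rows:
    sums[slot_key(ts)] += bandwidth

for (hour_stump, slot), total in sorted(sums.items()):
    print(hour_stump, slot, total)  # four slots (0..3), each summing to 30
```

Passing slot_minutes=5 gives the 5-minute variant, the same change as replacing 15 with 5 in the query.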
Here is a very similar solution for a very similar question - with detailed explanation how this particular aggregation works:
date_trunc 5 minute interval in PostgreSQL
Here is a similar solution for a similar question concerning running sums - with detailed explanation and links for the various functions used:
PostgreSQL: running count of rows for a query 'by minute'
Additional question in comment
WITH -- same as above ...
SELECT DISTINCT ON (1,2)
date_trunc('hour', ts) AS hour_stump
,(extract(minute FROM ts)::int / 15) AS min15_slot
,bandwidth AS bandwith_sample_at_min15
FROM t
WHERE ts >= '2012-09-02 00:00:00+02'::timestamptz
AND ts < '2012-09-03 00:00:00+02'::timestamptz
ORDER BY 1, 2, ts DESC;
Retrieves one un-aggregated sample per 15 minute interval - from the last available row in the window. This will be the 15th minute if the row is not missing. Crucial parts are DISTINCT ON and ORDER BY.
More information about the used technique here:
Select first row in each GROUP BY group?
select
date_trunc('hour', d) +
(((extract(minute from d)::integer / 5 * 5)::text) || ' minute')::interval
as "from",
date_trunc('hour', d) +
((((extract(minute from d)::integer / 5 + 1) * 5)::text) || ' minute')::interval
- '1 second'::interval
as "to",
sum(random() * 1000) as bandwidth
from
generate_series('2012-01-01', '2012-01-31', '1 minute'::interval) s(d)
group by 1, 2
order by 1, 2
;
That is for 5-minute ranges. For 15-minute ranges, replace each 5 with 15.

Postgres SQL select a range of records spaced out by a given interval

I am trying to determine whether it is possible, using only SQL in Postgres, to select a range of time-ordered records at a given interval.
Let's say I have 60 records, one record for each minute in a given hour. I want to select records at 5-minute intervals for that hour. The result should be 12 records, each one 5 minutes apart.
This is currently accomplished by selecting the full range of records, looping through the results, and pulling out the records at the given interval. I am trying to see if I can do this purely in SQL, as our db is large and we may be dealing with tens of thousands of records.
Any thoughts?
Yes, you can. It's really easy once you get the hang of it. I think it's one of the jewels of SQL, and it's especially easy in PostgreSQL because of its excellent temporal support. Often, complex functions can turn into very simple queries in SQL that can scale and be indexed properly.
This uses generate_series to draw up sample time stamps that are spaced 1 minute apart. The outer query then extracts the minute and uses modulo to find the values that are 5 minutes apart.
select
ts,
extract(minute from ts)::integer as minute
from
( -- generate some time stamps - one minute apart
select
current_time + (n || ' minute')::interval as ts
from generate_series(1, 30) as n
) as timestamps
-- extract the minute and check if it's on a 5-minute interval
where extract(minute from ts)::integer % 5 = 0
-- only pick this hour
and extract(hour from ts) = extract(hour from current_time)
;
ts | minute
--------------------+--------
19:40:53.508836-07 | 40
19:45:53.508836-07 | 45
19:50:53.508836-07 | 50
19:55:53.508836-07 | 55
Note that adding a computed (expression) index on the expression in the where clause could lead to major speed improvements. It may not be very selective in this case, but it is good to be aware of.
I wrote a reservation system once in PostgreSQL (which had lots of temporal logic where date intervals could not overlap) and never had to resort to iterative methods.
http://www.amazon.com/SQL-Design-Patterns-Programming-Focus/dp/0977671542 is an excellent book that has lots of interval examples. Hard to find in book stores now but well worth it.
Extract the minutes, convert to int4, and see, if the remainder from dividing by 5 is 0:
select *
from TABLE
where int4 (date_part ('minute', COLUMN)) % 5 = 0;
If the intervals are not time-based and you just want every 5th row, or if the times are regular and you always have one record per minute, the query below gives you one record out of every 5:
select *
from
(
select *, row_number() over (order by timecolumn) as rown
from tbl
) X
where mod(rown, 5) = 1
If your time records are not regular, then you need to generate a time series (given in another answer) and left join that into your table, group by the time column (from the series) and pick the MAX time from your table that is less than the time column.
Pseudo
select thetimeinterval, max(timecolumn)
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
And further join it back to the table for the full record (assuming unique times)
select t.* from
tbl inner join
(
select thetimeinterval, max(timecolumn) timecolumn
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
) y on tbl.timecolumn = y.timecolumn
How about this:
select min(ts), extract(minute from ts)::integer / 5 as bucket
from your_table -- placeholder name; the original omitted the FROM clause
group by bucket order by bucket;
This has the advantage of doing the right thing if you have two readings for the same minute, or if your readings skip a minute. Instead of using min, it would be even better to use one of the first() aggregate functions, code for which you can find here:
http://wiki.postgresql.org/wiki/First_%28aggregate%29
This assumes that your five minute intervals are "on the fives", so to speak. That is, that you want 07:00, 07:05, 07:10, not 07:02, 07:07, 07:12. It also assumes you don't have two rows within the same minute, which might not be a safe assumption.
select your_timestamp
from your_table
where cast(extract(minute from your_timestamp) as integer) % 5 = 0;
If you might have two rows with timestamps within the same minute, like
2011-01-01 07:00:02
2011-01-01 07:00:59
then this version is safer.
select min(your_timestamp)
from your_table
group by (cast(extract(minute from your_timestamp) as integer) / 5)
Wrap either of those in a view, and you can join it to your base table.