Min, Max, Avg for 2-hour intervals - SQL

In the query below, I am calculating the minimum, the maximum, and the average for two-hour intervals using PostgreSQL.
The query works fine for even start hours (..04:00:00+05:30), but for odd start hours (..05:00:00+05:30) it returns the same result as the corresponding even start time.
Flooring the hour divided by 2 and multiplying by 2 always returns an even hour, which is the problem.
SELECT tagid, CAST(sample_time_stamp AS date) AS stat_date,
       (floor(extract(hour from sample_time_stamp)/2) * 2)::int AS hrs,
       min(sensor_reading) AS theMin,
       max(sensor_reading) AS theMax,
       avg(sensor_reading) AS theAvg
FROM sensor_readings
WHERE tagid = 1
  AND sample_time_stamp BETWEEN '2012-10-23 01:00:00+05:30'
                            AND '2012-10-23 05:59:00+05:30'
GROUP BY tagid, CAST(sample_time_stamp AS date),
         floor(extract(hour from sample_time_stamp)/2) * 2
ORDER BY tagid, stat_date, hrs
Output for odd start hour ('2012-10-23 01:00:00+05:30'):
tagid  date        hrs  theMin  theMax  theAvg
1      2012-10-23  0    6       58      30.95
1      2012-10-23  2    2       59      29.6916666666667
1      2012-10-23  4    3       89      31.7666666666667
Output for even start hour ('2012-10-23 02:00:00+05:30'):
tagid  date        hrs  theMin  theMax  theAvg
1      2012-10-23  2    2       59      29.6916666666667
1      2012-10-23  4    3       89      31.7666666666667
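A quick way to see the problem (a runnable sketch): evaluating the bucket expression for hours 1 through 5 shows that it always floors to an even hour, no matter where the requested range starts.
SELECT h AS hour, (floor(h / 2.0) * 2)::int AS bucket
FROM generate_series(1, 5) AS h;
-- hour 1 -> 0, 2 -> 2, 3 -> 2, 4 -> 4, 5 -> 4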

To get constant time-frames starting with your minimum timestamp:
WITH params AS (
   SELECT '2012-10-23 01:00:00+05:30'::timestamptz AS _min  -- input params
        , '2012-10-23 05:59:00+05:30'::timestamptz AS _max
        , '2 hours'::interval AS _interval
   )
, ts AS (SELECT generate_series(_min, _max, _interval) AS t_min FROM params)
, timeframe AS (
   SELECT t_min
        , lead(t_min, 1, _max) OVER (ORDER BY t_min) AS t_max
   FROM   ts, params
   )
SELECT s.tagid
     , t.t_min
     , t.t_max  -- mildly redundant, except for the last row
     , min(s.sensor_reading) AS the_min
     , max(s.sensor_reading) AS the_max
     , avg(s.sensor_reading) AS the_avg
FROM   timeframe t
LEFT   JOIN sensor_readings s ON s.tagid = 1
                             AND s.sample_time_stamp >= t.t_min
                             AND s.sample_time_stamp <  t.t_max
GROUP  BY 1, 2, 3
ORDER  BY 1, 2;
Can be used for any time frame and any interval length. Requires PostgreSQL 8.4 or later.
If the maximum timestamp _max does not fall on _min + n * _interval the last time-frame is truncated. The last row can therefore represent a shorter time-frame than your desired _interval.
Key elements
Common Table Expressions (CTE) for easier handling. Input parameter values once in the top CTE params.
generate_series() for intervals to create the time raster.
Window function lead(...) with 3 parameters (including the default) - to cover the special case of the last row.
LEFT JOIN between raster and actual data, so that time frames without matching data will still show in the result (with NULL values as data).
That's also the reason for a later edit: the WHERE condition had to move into the LEFT JOIN condition, because a WHERE filter on the right table would remove the NULL-extended rows and defeat the LEFT JOIN.
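To see the difference in miniature, here is a runnable sketch with toy data (hypothetical names):
WITH frames(f) AS (VALUES (1), (2))
   , samples(f, v) AS (VALUES (1, 10))
SELECT frames.f, samples.v
FROM   frames
LEFT   JOIN samples ON samples.f = frames.f
                   AND samples.v > 5;  -- filter in ON: frame 2 survives with v = NULL
-- Move "samples.v > 5" into a WHERE clause instead, and frame 2 disappears:
-- its NULL-extended row fails the filter, turning the LEFT JOIN into an inner join.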
Alternative time frame generation with recursive CTE:
WITH RECURSIVE params AS (
   SELECT '2012-10-23 01:00:00+05:30'::timestamptz AS _min  -- input params
        , '2012-10-23 05:59:00+05:30'::timestamptz AS _max
        , '2 hours'::interval AS _interval
   )
, timeframe AS (
   SELECT _min AS t_min, LEAST(_min + _interval, _max) AS t_max
   FROM   params
   UNION  ALL
   SELECT t_max, LEAST(t_max + _interval, _max)
   FROM   timeframe t, params p
   WHERE  t_max < _max
   )
SELECT ...
Slightly faster ... take your pick.
-> sqlfiddle displaying both.
Note that you can have non-recursive CTEs (additionally) even when declared WITH RECURSIVE.
Performance & Index
Should be faster than your original query. Half the code deals with generating the time raster, which concerns few rows and is very fast. Handling actual table rows (the expensive part) gets cheaper, because we don't calculate a new value from every sample_time_stamp any more.
You should definitely have a multi-column index of the form:
CREATE INDEX foo_idx ON sensor_readings (tagid, sample_time_stamp DESC);
I use DESC on the assumption that you more often query recent entries (later timestamps). Remove the modifier if that's not the case. Doesn't make a big difference either way.


Multiple aggregate sums from different conditions in one sql query

Whereas I believe this is a fairly general SQL question, I am working in PostgreSQL 9.4 without an option to use other database software, and thus request that any answer be compatible with its capabilities.
I need to be able to return multiple aggregate totals from one query, such that each sum is in a new row and each grouping is determined by a unique span of time, e.g. WHERE time_stamp BETWEEN '2016-02-07' AND '2016-02-14'. The number of records that satisfy the WHERE clause is unknown and may be zero, in which case ideally the result is "0". This is what I have worked out so far:
(
SELECT SUM(minutes) AS min
FROM downtime
WHERE time_stamp BETWEEN '2016-02-07' AND '2016-02-14'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-02-14' AND '2016-02-21'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-02-28' AND '2016-03-06'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-06' AND '2016-03-13'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-13' AND '2016-03-20'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-20' AND '2016-03-27'
)
Result:
min
---+-----
1 | 119
2 | 4
3 | 30
4 |
5 | 62
6 | 350
That query gets me almost the exact result that I want; certainly good enough in that I can do exactly what I need with the results. Time spans with no records are blank but that was predictable, and whereas I would prefer "0" I can account for the blank rows in software.
But, while it isn't terrible for the 6 weeks it represents, I want to be flexible and able to do the same thing for different time spans and for a different number of data points, such as each day in a week, each week in 3 months, 6 months, each month in 1 year, 2 years, etc. As written above, it feels as if it is going to get tedious fast: for instance, 1-week spans over a 2-year period would require 104 sub-queries.
What I'm after is a more elegant way to get the same (or similar) result.
I also don't know if doing 104 iterations of a similar query to the above (vs. the 6 that it does now) is a particularly efficient usage.
Ultimately I am going to write some code which will help me build (and thus abstract away) the long, ugly query--but it would still be great to have a more concise and scale-able query.
In Postgres, you can generate a series of times and then use these for the aggregation:
select g.dte, coalesce(sum(dt.minutes), 0) as minutes
from generate_series('2016-02-07'::timestamp, '2016-03-20'::timestamp, interval '7 day') g(dte)
left join downtime dt
       on dt.time_stamp >= g.dte
      and dt.time_stamp < g.dte + interval '7 day'
group by g.dte
order by g.dte;
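Note that coalesce(sum(...), 0) also gives you the 0 you wanted for empty weeks. If you'd rather not hard-code the bounds, a variant (an untested sketch against the same downtime table; LATERAL needs 9.3+, so it's fine on your 9.4) can derive the series from the data itself:
select g.dte, coalesce(sum(dt.minutes), 0) as minutes
from (select min(time_stamp) as lo, max(time_stamp) as hi from downtime) b
cross join lateral generate_series(b.lo, b.hi, interval '7 day') g(dte)
left join downtime dt
       on dt.time_stamp >= g.dte
      and dt.time_stamp < g.dte + interval '7 day'
group by g.dte
order by g.dte;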

calculating average with grouping based on time intervals

In a Postgres table I store the speed of an object at a 10-second interval. Values are not available for every 10 seconds throughout the day, so there may be no row for, say, today 16:39:40.
What would a query look like that returns the average speed over 1-minute (or 30-second, or n-second) intervals for a given day, treating non-existing rows as a speed of 0?
speed_table
id (int, pk)
ts (timestamp)
speed (numeric)
I've built this query but am getting stuck on some important parts:
SELECT
date_trunc('minute', ts) AS truncated,
avg(speed)
FROM speed_table AS t
WHERE ts >= '2014-06-21 00:00:00'
AND ts <= '2014-06-21 23:59:59'
AND condition2 = 'something'
GROUP BY date_trunc('minute', ts)
ORDER BY truncated
How can I change the interval to something other than what the date_trunc function offers, e.g. 5 minutes or 30 seconds?
How can I add the missing rows for the remainder of the day?
Simple and fast solution for this particular example:
SELECT date_trunc('minute', ts) AS minute
, sum(speed)/6 AS avg_speed
FROM speed_table AS t
WHERE ts >= '2014-06-21 0:0'
AND ts < '2014-06-22 0:0' -- exclude dangling corner case
AND condition2 = 'something'
GROUP BY 1
ORDER BY 1;
You need to factor in missing rows as "0 speed". Since a minute has 6 samples, just sum and divide by 6; missing rows then implicitly count as 0. For example, a minute with only four samples of 10, 20, 30 and 40 yields (10+20+30+40)/6 ≈ 16.67, the same as averaging 10, 20, 30, 40, 0, 0.
This returns no row at all for minutes without any samples; the effective avg_speed for those missing result rows is 0.
General query for arbitrary intervals
Works for any interval listed in the manual for date_trunc():
SELECT date_trunc('minute', g.ts) AS ts_start
     , avg(COALESCE(speed, 0)) AS avg_speed
FROM  (SELECT generate_series('2014-06-21 0:0'::timestamp
                            , '2014-06-22 0:0'::timestamp
                            , '10 sec'::interval) AS ts) g
LEFT   JOIN speed_table t USING (ts)
WHERE (t.condition2 = 'something' OR t.condition2 IS NULL)  -- depends on actual condition!
AND    g.ts <> '2014-06-22 0:0'::timestamp  -- exclude dangling corner case
GROUP  BY 1
ORDER  BY 1;
The problematic part is the additional, unknown condition. You would need to define it, and decide whether missing rows supplied by generate_series should pass the test or not (which can be tricky!).
I let them pass in my example (along with all other rows that have NULL values).
Compare:
PostgreSQL: running count of rows for a query 'by minute'
Arbitrary intervals:
Truncate timestamp to arbitrary intervals
For completely arbitrary intervals consider @Clodoaldo's math based on epoch values, or use the often overlooked function width_bucket(). Example:
Aggregating (x,y) coordinate point clouds in PostgreSQL
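For instance, a minimal width_bucket() sketch (untested) against the speed_table above, slicing one day into 5-minute buckets. Like the simple solution above, it returns no row for empty buckets:
SELECT width_bucket(extract(epoch FROM ts)
                  , extract(epoch FROM timestamp '2014-06-21 0:0')
                  , extract(epoch FROM timestamp '2014-06-22 0:0')
                  , 288) AS bucket  -- 288 buckets of 5 minutes in 24 hours
     , avg(speed) AS avg_speed
FROM   speed_table
WHERE  ts >= '2014-06-21 0:0'
AND    ts <  '2014-06-22 0:0'
GROUP  BY 1
ORDER  BY 1;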
If you had provided some data it would have been possible to test, so this may contain errors. Point them out, including the error message, so I can fix them.
select
to_timestamp(
(extract(epoch from ts)::integer / (60 * 2)) * (60 * 2)
) as truncated,
avg(coalesce(speed, 0)) as avg_speed
from
generate_series (
'2014-06-21 00:00:00'::timestamp,
'2014-06-22'::timestamp - interval '1 second',
'10 seconds'
) ts (ts)
left join
speed_table t on ts.ts = t.ts and condition2 = 'something'
group by 1
order by 1
The example is grouped by 2 minutes: the number of seconds since 1970-01-01 00:00:00 (the epoch) is integer-divided by 120 (60 * 2) and multiplied back. To group by 5 minutes, use 300 (60 * 5) instead.
The generate_series in the example generates timestamps at 10-second intervals. It is left outer joined to the speed table, so it fills the gaps. When the speed is null, coalesce returns 0.
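For example, to group by 30 seconds instead, divide (and multiply back) by 30. A runnable one-liner to try the expression:
select to_timestamp((extract(epoch from now())::integer / 30) * 30) as truncated;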

db2 suppress recursive warning

I have a recursive SQL statement that works but gives me the following warning.
SQL0347W The recursive common table expression "DT_LAST_YEAR" may
contain an infinite loop. SQLSTATE=01605
How can I get rid of the warning?
INSERT INTO REP_MAN_TRAN_COUNTS (SITEDIRECTORYID, BUSINESSDATE, TRANCOUNT)
WITH dt_this_year (level, seqdate) AS
(
SELECT 1, date(current timestamp) -7 DAYS FROM sysibm.sysdummy1
UNION ALL
SELECT level, seqdate + level days FROM dt_this_year WHERE level < 1000 AND seqdate + 1 days < date(current timestamp)
)
,dt_last_year (level, seqdate) AS
(
SELECT 1, date(current timestamp) -7 DAYS - 1 year FROM sysibm.sysdummy1
UNION ALL
SELECT level, seqdate + level days FROM dt_last_year WHERE level < 1000 AND seqdate + 1 days < date(current timestamp) -1 year
)
select 10049, date(dts.calendarday), count(*) trancount
from (
SELECT seqdate AS calendarday FROM dt_this_year
UNION
SELECT seqdate AS calendarday FROM dt_last_year
) dts LEFT JOIN ccftrxheader ccf
ON date(dts.calendarday) = date(ccf.businessdate)
WHERE ccf.sitedirectoryid=10049
GROUP BY ccf.sitedirectoryid,dts.calendarday
How do you get rid of warnings?
By changing the code so that it no longer generates the warning in the first place. Hiding warnings is problematic, because it often disguises a potentially larger problem. I'm fairly certain it's complaining here because the termination clause you provide for level can't ever be reached (because you never manipulate it).
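The minimal change (a sketch, untested on DB2) is to make level actually advance in the recursive member, so the guard level < 1000 is provably reachable; the same applies to dt_last_year. Since level was previously stuck at 1, seqdate + level days was equivalent to seqdate + 1 days, so the generated dates don't change:
SELECT level + 1, seqdate + 1 days
FROM dt_this_year
WHERE level < 1000 AND seqdate + 1 days < date(current timestamp)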
Personally, I'd probably re-write your query into something like this:
INSERT INTO Rep_Man_Tran_Counts (siteDirectoryId, businessDate, tranCount)
WITH dt_Calendar_Data (level, calendarDay) AS
     (SELECT l, c
      FROM (VALUES (1, CURRENT_DATE - 7 DAYS),
                   (1, CURRENT_DATE - 7 DAYS - 1 YEAR)) t(l, c)
      UNION ALL
      SELECT level + 1, calendarDay + 1 DAYS
      FROM dt_Calendar_Data
      WHERE level < 7)
SELECT 10049, dtCal.calendarDay, COUNT(ccf.businessDate) AS tranCount
FROM dt_Calendar_Data dtCal
LEFT JOIN ccftrxHeader ccf
       ON ccf.businessDate = dtCal.calendarDay
      AND ccf.siteDirectoryId = 10049
GROUP BY dtCal.calendarDay
(untested, as you've provided no sample data, and I don't have a DB2 instance)
I've assumed you actually wanted a LEFT JOIN, as opposed to the regular INNER JOIN you were actually getting (due to the condition in the WHERE clause, and probably the GROUP BY as well). To get 0 for days with no transactions, the count is taken over a column of ccf (COUNT(ccf.businessDate) ignores the NULL-extended rows and yields 0) rather than COUNT(*), which would count those rows as 1.
I've also assumed that businessDate is a DATE type, and not a timestamp. If it is a timestamp, this query needs to be adjusted (note that the function you were using would force the optimizer to ignore indices).
Note that order of operations with dates matters! Thankfully, when dealing with year ranges you only have one day to worry about in the Gregorian calendar (February 29th). Your current ordering compares identical calendar days at the start of the range (which one has the "gap" depends on whether this year or last year is a leap year).
EDIT:
Sure, let's look at that CTE:
FROM (VALUES (1, CURRENT_DATE - 7 DAYS),
             (1, CURRENT_DATE - 7 DAYS - 1 YEAR)) t(l, c)
This is just a standard VALUES clause used as a table reference. This is the SQL Standard way to construct a small temp table (Rather than referencing the dummy tables, which tend to be vendor-specific). If the statement is run on 2014-02-26 then the resulting table will be:
t
l c
===============
1 "2014-02-19"
1 "2013-02-19"
These columns get renamed by the column listing of the CTE, which are then referenced in the join (and in the case of a recursive CTE, by the recursive portion).
This then forms the starting data for the rest of the recursive query:
UNION ALL
SELECT level + 1, calendarDay + 1 DAYS
FROM dt_Calendar_Data
WHERE level < 7
In DB2 (and some other RDBMSs), recursive CTEs essentially execute iteratively, acting off the results of the "previous" invocation. Every time around, we increment level, and add another day to calendarDay. The "next" rows are then:
level calendarDay
======================
2 "2014-02-20"
2 "2013-02-20"
This continues until the "previous" row has level = 7, which means a new row is not generated (check the WHERE clause). In general, it's best to only have one termination condition (and make progress every iteration), to make it easier for the optimizer to spot. The resulting data is then in the ranges:
level calendarDay
=====================
1 "2014-02-19"
. .....
7 "2014-02-26"
1 "2013-02-19"
. .....
7 "2013-02-26"
... as a side note, I generated the this-year and last-year data together to keep the number of table references down. If you only needed the one year, level is unnecessary.

SQL Average Inter-arrival Time, Time Between Dates

I have a table with sequential timestamps:
2011-03-17 10:31:19
2011-03-17 10:45:49
2011-03-17 10:47:49
...
I need to find the average time difference between each of these (there could be dozens), in seconds or whatever is easiest; I can work with it from there. So for example, the inter-arrival time for only the first two times above would be 870 (14m 30s). For all three times it would be (870 + 120)/2 = 495 (8m 15s).
A note: I am using PostgreSQL 8.1.22.
EDIT: The table I mention above is from a different query that is literally just a one-column list of timestamps
Not sure I understood your question completely, but this might be what you are looking for:
SELECT avg(difference)
FROM (
SELECT timestamp_col - lag(timestamp_col) over (order by timestamp_col) as difference
FROM your_table
) t
The inner query calculates the distance between each row and the preceding row. The result is an interval for each row in the table.
The outer query simply does an average over all differences.
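If you want the average in seconds rather than as an interval, you can wrap it in extract(epoch from ...), with the same placeholder names as above:
SELECT extract(epoch from avg(difference)) AS avg_seconds
FROM (
  SELECT timestamp_col - lag(timestamp_col) over (order by timestamp_col) as difference
  FROM your_table
) t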
I think you want to find avg(timestamptz).
My solution is avg(current - min value), but since the result is an interval, add it to the min value again:
SELECT avg(target_col - (select min(target_col) from your_table))
+ (select min(target_col) from your_table)
FROM your_table
If you cannot upgrade to a version of PG that supports window functions, you
may compute your table's sequential steps "the slow way."
Assuming your table is "tbl" and your timestamp column is "ts":
SELECT AVG(t1 - t0)
FROM (
-- All this silliness would be moot if we could use
-- `` lead(ts) over (order by ts) ''
SELECT tbl.ts AS t0,
next.ts AS t1
FROM tbl
CROSS JOIN
tbl next
WHERE next.ts = (
SELECT MIN(ts)
FROM tbl subquery
WHERE subquery.ts > tbl.ts
)
) derived;
But don't do that. Its performance will be terrible. Please do what
a_horse_with_no_name suggests, and use window functions.
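As an aside: for a plain average of consecutive gaps, the differences telescope, so the average equals (last - first) / (n - 1). That identity yields a sketch that works even on 8.1 (placeholder names tbl/ts as above; assumes at least two rows):
SELECT (max(ts) - min(ts)) / (count(*) - 1) AS avg_gap
FROM tbl;
-- with the sample data: ('10:47:49' - '10:31:19') / 2 = 990 s / 2 = 495 s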

Postgres SQL select a range of records spaced out by a given interval

I am trying to determine if it is possible, using only SQL for Postgres, to select a range of time-ordered records at a given interval.
Let's say I have 60 records, one record for each minute in a given hour. I want to select records at 5-minute intervals for that hour. The result should be 12 records, each one 5 minutes apart.
This is currently accomplished by selecting the full range of records and then looping through the results and pulling out the records at the given interval. I am trying to see if I can do this purely in SQL, as our db is large and we may be dealing with tens of thousands of records.
Any thoughts?
Yes you can. It's really easy once you get the hang of it. I think it's one of the jewels of SQL, and it's especially easy in PostgreSQL because of its excellent temporal support. Often, complex functions can turn into very simple queries in SQL that can scale and be indexed properly.
This uses generate_series to draw up sample time stamps that are spaced 1 minute apart. The outer query then extracts the minute and uses modulo to find the values that are 5 minutes apart.
select
ts,
extract(minute from ts)::integer as minute
from
( -- generate some time stamps - one minute apart
select
current_time + (n || ' minute')::interval as ts
from generate_series(1, 30) as n
) as timestamps
-- extract the minute and check if it's on a 5-minute interval
where extract(minute from ts)::integer % 5 = 0
-- only pick this hour
and extract(hour from ts) = extract(hour from current_time)
;
ts | minute
--------------------+--------
19:40:53.508836-07 | 40
19:45:53.508836-07 | 45
19:50:53.508836-07 | 50
19:55:53.508836-07 | 55
Notice that a computed (expression) index matching the WHERE clause, where the value of the expression makes up the index, could lead to major speed improvements. Maybe not very selective in this case, but good to be aware of.
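For instance, such an expression index could look like this (hypothetical names; note it requires a plain timestamp column, since extract() on timestamptz depends on the session time zone and is therefore not IMMUTABLE):
CREATE INDEX tbl_minute_mod5_idx
    ON tbl ((extract(minute from ts)::integer % 5));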
I wrote a reservation system once in PostgreSQL (which had lots of temporal logic where date intervals could not overlap) and never had to resort to iterative methods.
http://www.amazon.com/SQL-Design-Patterns-Programming-Focus/dp/0977671542 is an excellent book with lots of interval examples. Hard to find in book stores now, but well worth it.
Extract the minutes, convert to int4, and see if the remainder from dividing by 5 is 0:
select *
from TABLE
where int4 (date_part ('minute', COLUMN)) % 5 = 0;
If the intervals are not time-based and you just want every 5th row, or
if the times are regular and you always have one record per minute,
then the below gives you one record for every 5 rows:
select *
from
(
select *, row_number() over (order by timecolumn) as rown
from tbl
) X
where mod(rown, 5) = 1
If your time records are not regular, then you need to generate a time series (given in another answer) and left join that into your table, group by the time column (from the series) and pick the MAX time from your table that is less than the time column.
Pseudo
select thetimeinterval, max(timecolumn)
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
And further join it back to the table for the full record (assuming unique times)
select t.* from
tbl inner join
(
select thetimeinterval, max(timecolumn) timecolumn
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
) y on tbl.timecolumn = y.timecolumn
How about this:
select min(ts), extract(minute from ts)::integer / 5 as bucket
from your_table
group by bucket
order by bucket;
This has the advantage of doing the right thing if you have two readings for the same minute, or if your readings skip a minute. Instead of using min, even better would be to use one of the first() aggregate functions, code for which you can find here:
http://wiki.postgresql.org/wiki/First_%28aggregate%29
This assumes that your five minute intervals are "on the fives", so to speak. That is, that you want 07:00, 07:05, 07:10, not 07:02, 07:07, 07:12. It also assumes you don't have two rows within the same minute, which might not be a safe assumption.
select your_timestamp
from your_table
where cast(extract(minute from your_timestamp) as integer) % 5 = 0;
If you might have two rows with timestamps within the same minute, like
2011-01-01 07:00:02
2011-01-01 07:00:59
then this version is safer.
select min(your_timestamp)
from your_table
group by (cast(extract(minute from your_timestamp) as integer) / 5)
Wrap either of those in a view, and you can join it to your base table.
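For instance, a sketch of that wrapping (hypothetical view name; like the queries above, it assumes the data spans a single hour):
CREATE VIEW five_minute_marks AS
SELECT min(your_timestamp) AS your_timestamp
FROM your_table
GROUP BY cast(extract(minute from your_timestamp) as integer) / 5;

SELECT t.*
FROM your_table t
JOIN five_minute_marks m USING (your_timestamp);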