Saving only unique datapoints in SQL

For simplicity: we have a table with two columns, value and date.
Every second a new data point is received, and we want to save it with its timestamp. Since consecutive values are often identical, to reduce storage we do not save a value if it is the same as the previous entry.
Question: Suppose the same value was received for 24 hours, so only the first value and date pair was saved. If we want to query 'average value in the last hour', is there a way to have the database (PostgreSQL) notice that no values were saved in the last hour and fall back to the last existing entry?

It is not as easy as it may seem, and it is not just about retrieving the latest data point when there is none available within the last hour. You want to calculate an average, so you need to rebuild the time-series data of the period, second by second, filling the gaps with the latest available data point.
I think the simplest approach is to use generate_series() to build the rows, and then a lateral join to recover the data:
select avg(d.value) as avg_last_hour
from generate_series(now() - interval '1 hour', now(), interval '1 second') as t(ts)
cross join lateral (
    -- latest stored data point at or before this second
    select d.*
    from data d
    where d.date <= t.ts
    order by d.date desc
    limit 1
) d
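If generating one row per second for every query becomes a concern, a similar result can be obtained by weighting each stored value by how long it stayed in effect inside the window. This is only a sketch under the question's assumptions (a table data with columns value and date); the CTE names bounds, pts and spans are made up for illustration, and the result can differ slightly from per-second sampling because it weights by continuous duration.

```sql
with bounds as (
    select now() - interval '1 hour' as lo, now() as hi
),
pts as (
    -- the last stored point at or before the window start ...
    (select d.value, d.date
     from data d cross join bounds b
     where d.date <= b.lo
     order by d.date desc
     limit 1)
    union all
    -- ... plus every point stored inside the window
    select d.value, d.date
    from data d cross join bounds b
    where d.date > b.lo and d.date <= b.hi
),
spans as (
    -- each value holds from its own timestamp (clamped to the window start)
    -- until the next stored point, or until the window end for the last one
    select p.value,
           greatest(p.date, b.lo) as span_start,
           coalesce(lead(p.date) over (order by p.date), b.hi) as span_end
    from pts p cross join bounds b
)
select sum(value * extract(epoch from span_end - span_start))
       / nullif(sum(extract(epoch from span_end - span_start)), 0) as avg_last_hour
from spans;
```

If nothing was stored before the window start, the average only covers the part of the hour after the first stored point.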

If you simply want the average of the values in the most recent hour that is present in the data, you can use:
select date_trunc('hour', date) as ddhh, avg(value)
from t
group by ddhh
order by ddhh desc
limit 1;
If you have a lot of data being collected, it might be faster to add an index on date and use:
select avg(value)
from t
where date >= date_trunc('hour', (select max(t2.date) from t t2));
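A minimal sketch of that index, assuming the table and column names t and date used in this answer; the index name t_date_idx is made up. Running explain on the query is one way to check whether the planner actually picks the index up.

```sql
-- t_date_idx is a hypothetical index name
create index if not exists t_date_idx on t (date);

-- inspect the plan for the range query on date
explain
select avg(value)
from t
where date >= date_trunc('hour', (select max(t2.date) from t t2));
```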

Related

How to compare time stamps from consecutive rows

I have a table that I would like to sort by a timestamp desc and then compare all consecutive rows to determine the difference between each row. From there, I would like to find all the rows whose difference is greater than ~2hours.
I'm stuck on how to actually compare consecutive rows in a table. Any help would be much appreciated.
I'm using Oracle SQL Developer 3.2
You didn't show us your table definition, but something like this:
select *
from (
    select t.*,
           t.timestamp_column - lag(t.timestamp_column) over (order by t.timestamp_column) as diff
    from the_table t
) x
where diff > interval '2' hour;
This assumes that timestamp_column is defined as timestamp, not date; otherwise the result of the difference would be a number of days rather than an interval.
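If the column turns out to be an Oracle DATE, subtracting two values yields a number of days, so the comparison can be written against a fraction of a day. This is a sketch only; date_column is a hypothetical column name and the_table is the placeholder used above.

```sql
select *
from (
    select t.*,
           -- DATE minus DATE is a number of days in Oracle
           t.date_column - lag(t.date_column) over (order by t.date_column) as diff_days
    from the_table t
) x
where diff_days > 2/24;  -- 2 hours expressed as a fraction of a day
```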

examine if one time series column of table has two adjacent time points which have interval larger than certain length

I am preprocessing data in a table that contains a time series column.
Toy example, table A:
timestamp value
12:30:24 1
12:32:21 3
12:33:21 4
The timestamps are ordered and always increasing.
Is it possible to define a function or something else that returns true when the table has two adjacent time points whose interval is larger than a certain length, and false otherwise?
I am using PostgreSQL, thank you.
select bool_or(bigger_than) as bigger_than
from (
select
time - lag(time) over (order by time)
>
interval '1 minute' as bigger_than
from table_a
) s;
bigger_than
-------------
t
bool_or returns true if at least one input value is true. It is an aggregate, so it still reads every row of the subquery.
http://www.postgresql.org/docs/current/static/functions-aggregate.html
Your sample data shows a time value, but it works the same for a timestamp.
Something like this:
select count(*) > 0
from (
    select timestamp,
           lag(timestamp) over (order by timestamp) as prev_ts
    from table_a
) t
where timestamp - prev_ts > interval '1' minute;
It calculates the difference between a timestamp and its "previous" timestamp, where "previous" is defined by ordering the rows by the timestamp column. The outer query then checks whether there is any row where the difference is larger than 1 minute.
lag() is called a window function. More details on those can be found in the manual:
http://www.postgresql.org/docs/current/static/tutorial-window.html
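Since the question asks for a function, the check can be wrapped in a SQL function. The sketch below assumes the table and column names from the toy example (table_a, timestamp); the function name has_gap_longer_than and its parameter max_gap are made up for illustration.

```sql
create or replace function has_gap_longer_than(max_gap interval)
returns boolean
language sql
stable
as $$
    -- coalesce covers tables with fewer than two rows, where no gap exists
    select coalesce(bool_or(gap > max_gap), false)
    from (
        select "timestamp" - lag("timestamp") over (order by "timestamp") as gap
        from table_a
    ) s;
$$;

-- usage: is any pair of adjacent rows more than one minute apart?
select has_gap_longer_than(interval '1 minute');
```

For the toy data this should be true, because 12:30:24 and 12:32:21 are 1 minute 57 seconds apart.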

Group by data intervals

I have a single table which stores bandwidth usage on the network over a period of time. One column will contain the date time (primary key) and another column will record the bandwidth. Data is recorded every minute. We will have other columns recording other data at that moment in time.
If the user requests the data on 15 minute intervals (within a 24 hour period given start and end date), is it possible with a single query to get the data I require or would I have to write a stored procedure/cursor to do this? Users may then request 5 minute intervals data etc.
I will most likely be using Postgres but are there other NOSQL options which would be better?
Any ideas?
WITH t AS (
SELECT ts, (random()*100)::int AS bandwidth
FROM generate_series('2012-09-01', '2012-09-04', '1 minute'::interval) ts
)
SELECT date_trunc('hour', ts) AS hour_stump
,(extract(minute FROM ts)::int / 15) AS min15_slot
,count(*) AS rows_in_timeslice -- optional
,sum(bandwidth) AS sum_bandwidth
FROM t
WHERE ts >= '2012-09-02 00:00:00+02'::timestamptz -- user's time range
AND ts < '2012-09-03 00:00:00+02'::timestamptz -- careful with borders
GROUP BY 1, 2
ORDER BY 1, 2;
The CTE t provides data like your table might hold: one timestamp ts per minute with a bandwidth number. (You don't need that part, you work with your table instead.)
Here is a very similar solution for a very similar question - with detailed explanation how this particular aggregation works:
date_trunc 5 minute interval in PostgreSQL
Here is a similar solution for a similar question concerning running sums - with detailed explanation and links for the various functions used:
PostgreSQL: running count of rows for a query 'by minute'
Additional question in comment
WITH -- same as above ...
SELECT DISTINCT ON (1,2)
date_trunc('hour', ts) AS hour_stump
,(extract(minute FROM ts)::int / 15) AS min15_slot
,bandwidth AS bandwidth_sample_at_min15
FROM t
WHERE ts >= '2012-09-02 00:00:00+02'::timestamptz
AND ts < '2012-09-03 00:00:00+02'::timestamptz
ORDER BY 1, 2, ts DESC;
Retrieves one un-aggregated sample per 15 minute interval - from the last available row in the window. This will be the 15th minute if the row is not missing. Crucial parts are DISTINCT ON and ORDER BY.
More information about the used technique here:
Select first row in each GROUP BY group?
select
date_trunc('hour', d) +
(((extract(minute from d)::integer / 5 * 5)::text) || ' minute')::interval
as "from",
date_trunc('hour', d) +
((((extract(minute from d)::integer / 5 + 1) * 5)::text) || ' minute')::interval
- '1 second'::interval
as "to",
sum(random() * 1000) as bandwidth
from
generate_series('2012-01-01', '2012-01-31', '1 minute'::interval) s(d)
group by 1, 2
order by 1, 2
;
That is for 5-minute ranges. For 15-minute ranges, replace each 5 with 15.
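Because users may ask for different interval widths, one option is to bucket on the epoch value so that only a single number changes. The following is a sketch, not a drop-in replacement: it reuses generated sample data in place of the real table, and 300 is the bucket width in seconds (900 would give 15-minute buckets). Buckets computed this way are aligned to UTC.

```sql
with t as (
    -- sample data: one row per minute with a random bandwidth value
    select ts, (random() * 100)::int as bandwidth
    from generate_series('2012-09-01'::timestamptz,
                         '2012-09-04'::timestamptz,
                         interval '1 minute') as ts
)
select to_timestamp(floor(extract(epoch from ts) / 300) * 300) as slot_start
     , count(*)       as rows_in_timeslice
     , sum(bandwidth) as sum_bandwidth
from t
where ts >= '2012-09-02 00:00:00+02'::timestamptz
and   ts <  '2012-09-03 00:00:00+02'::timestamptz
group by 1
order by 1;
```

On PostgreSQL 14 or later, date_bin() is another way to get arbitrary bucket widths.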

PostgreSQL "nested"? distincts and count

I need to get the count of the distinct names per hour in one query in PostgreSQL 9.1
The relevant columns(generalized for question) in my table are:
occurred timestamp with time zone and
name character varying(250)
And the table name for the sake of the question is just table
The occurred timestamps will all be within a midnight to midnight(exclusive) range for one day. So far my query looks like:
'SELECT COUNT(DISTINCT ON (name)) FROM table'
It would be nice if I could get the output formatted as a list of 24 integers(one for each hour of the day), the names aren't required to be returned.
If I understand correctly what you want, you can write:
SELECT EXTRACT(HOUR FROM occurred),
COUNT(DISTINCT name)
FROM ...
WHERE ...
GROUP
BY EXTRACT(HOUR FROM occurred)
ORDER
BY EXTRACT(HOUR FROM occurred)
;
SELECT date_trunc('hour', occurred) AS hour_slice
,count(DISTINCT name) AS name_ct
FROM mytable
GROUP BY 1
ORDER BY 1;
DISTINCT ON is a different feature.
date_trunc() gives you a count for every distinct hour, while EXTRACT counts per hour-of-day across longer periods of time. The two results do not add up, because the sum of several count(DISTINCT x) values is equal to or greater than a single count(DISTINCT x) over the combined rows.
You want this by hour:
select extract(hour from occurred) as hr, count(distinct name)
from "table" t  -- "table" is a reserved word, so the name must be quoted
group by extract(hour from occurred)
order by 1
This assumes there is data for only one day. Otherwise, hours from different days would be combined. To get around this, you would need to include date information as well.
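The question asks for a list of 24 integers, but the queries above skip hours that have no rows. One way to get a row for every hour, sketched here under the assumption that the table is called mytable and holds a single day of data, is to left join against generate_series(0, 23):

```sql
select h.hr
     , count(distinct t.name) as name_ct   -- 0 for hours without any rows
from generate_series(0, 23) as h(hr)
left join mytable t
       on extract(hour from t.occurred) = h.hr
group by h.hr
order by h.hr;
```

Note that extract(hour from ...) on a timestamp with time zone uses the session's time zone setting.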

Postgres SQL select a range of records spaced out by a given interval

I am trying to determine if it is possible, using only sql for postgres, to select a range of time ordered records at a given interval.
Let's say I have 60 records, one record for each minute in a given hour. I want to select records at 5 minute intervals for that hour. The result should be 12 records, each one 5 minutes apart.
This is currently accomplished by selecting the full range of records and then looping through the results and pulling out the records at the given interval. I am trying to see if I can do this purely in SQL, as our db is large and we may be dealing with tens of thousands of records.
Any thoughts?
Yes you can. It's really easy once you get the hang of it. I think it's one of the jewels of SQL, and it's especially easy in PostgreSQL because of its excellent temporal support. Often, complex functions can turn into very simple queries in SQL that can scale and be indexed properly.
This uses generate_series to draw up sample time stamps that are spaced 1 minute apart. The outer query then extracts the minute and uses modulo to find the values that are 5 minutes apart.
select
ts,
extract(minute from ts)::integer as minute
from
( -- generate some time stamps - one minute apart
select
current_time + (n || ' minute')::interval as ts
from generate_series(1, 30) as n
) as timestamps
-- extract the minute check if its on a 5 minute interval
where extract(minute from ts)::integer % 5 = 0
-- only pick this hour
and extract(hour from ts) = extract(hour from current_time)
;
ts | minute
--------------------+--------
19:40:53.508836-07 | 40
19:45:53.508836-07 | 45
19:50:53.508836-07 | 50
19:55:53.508836-07 | 55
Note that an index on the expression used in the where clause (where the value of the expression makes up the index) could lead to major speed improvements. It may not be very selective in this case, but it is good to be aware of.
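As a rough sketch of such an index: the table readings and the index name are hypothetical, and ts is assumed to be a timestamp without time zone, since the indexed expression has to be immutable.

```sql
-- hypothetical table; ts is timestamp WITHOUT time zone
create table readings (
    ts    timestamp not null,
    value numeric
);

-- index on the same expression the where clause uses
-- (an expression in an index definition needs its own parentheses)
create index readings_minute_mod5_idx
    on readings ((extract(minute from ts)::integer % 5));

select *
from readings
where extract(minute from ts)::integer % 5 = 0;
```

Whether the planner uses the index depends on how selective the condition is; here it matches about one row in five.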
I wrote a reservation system once in PostgreSQL (which had lots of temporal logic where date intervals could not overlap) and never had to resort to iterative methods.
http://www.amazon.com/SQL-Design-Patterns-Programming-Focus/dp/0977671542 is an excellent book that has lots of interval examples. It is hard to find in book stores now, but well worth it.
Extract the minutes, convert to int4, and see, if the remainder from dividing by 5 is 0:
select *
from TABLE
where int4 (date_part ('minute', COLUMN)) % 5 = 0;
If the intervals are not time based and you just want every 5th row, or if the times are regular and you always have one record per minute, the query below gives you one record out of every 5:
select *
from
(
select *, row_number() over (order by timecolumn) as rown
from tbl
) X
where mod(rown, 5) = 1
If your time records are not regular, then you need to generate a time series (given in another answer) and left join that into your table, group by the time column (from the series) and pick the MAX time from your table that is less than the time column.
Pseudo
select thetimeinterval, max(timecolumn)
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
And further join it back to the table for the full record (assuming unique times)
select tbl.* from
tbl inner join
(
    select thetimeinterval, max(timecolumn) timecolumn
    from ( < the time series subquery > ) X
    left join tbl on tbl.timecolumn <= thetimeinterval
    group by thetimeinterval
) y on tbl.timecolumn = y.timecolumn
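For reference, a concrete version of the same idea might look like the sketch below. It assumes PostgreSQL 9.3 or later (for lateral), the placeholder names tbl and timecolumn from this answer, and an arbitrary example hour; the alias names s, slot and r are made up.

```sql
select s.slot, r.*
from generate_series(timestamp '2011-01-01 07:00',
                     timestamp '2011-01-01 07:55',
                     interval '5 minutes') as s(slot)
left join lateral (
    -- latest record at or before this 5-minute mark
    select t.*
    from tbl t
    where t.timecolumn <= s.slot
    order by t.timecolumn desc
    limit 1
) r on true
order by s.slot;
```

The left join keeps a row for every slot, with nulls where no earlier record exists.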
How about this:
select min(ts), extract(minute from ts)::integer / 5 as bucket
from your_table  -- placeholder table name
group by bucket order by bucket;
This has the advantage of doing the right thing if you have two readings for the same minute, or your readings skip a minute. Instead of using min, even better would be to use one of the first() aggregate functions, code for which you can find here:
http://wiki.postgresql.org/wiki/First_%28aggregate%29
This assumes that your five minute intervals are "on the fives", so to speak. That is, that you want 07:00, 07:05, 07:10, not 07:02, 07:07, 07:12. It also assumes you don't have two rows within the same minute, which might not be a safe assumption.
select your_timestamp
from your_table
where cast(extract(minute from your_timestamp) as integer) % 5 = 0;
If you might have two rows with timestamps within the same minute, like
2011-01-01 07:00:02
2011-01-01 07:00:59
then this version is safer.
select min(your_timestamp)
from your_table
group by (cast(extract(minute from your_timestamp) as integer) / 5)
Wrap either of those in a view, and you can join it to your base table.
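A sketch of the view approach, using the placeholder names your_table and your_timestamp from above; the view name five_minute_marks is made up. The grouping here also includes the hour, an addition to the query above so that rows from different hours are not merged into one bucket.

```sql
create view five_minute_marks as
select min(your_timestamp) as your_timestamp
from your_table
group by date_trunc('hour', your_timestamp),
         cast(extract(minute from your_timestamp) as integer) / 5;

-- join back to the base table to get the full rows
select t.*
from your_table t
join five_minute_marks m
  on m.your_timestamp = t.your_timestamp;
```

This assumes timestamps are unique; duplicates would return more than one row per bucket.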