Aggregating (x,y) coordinate point clouds in PostgreSQL

I have a PostgreSQL database table with the following simplified structure:
device_id (varchar)
pos_x (int)
pos_y (int)
Basically, this table contains a lot of two-dimensional waypoint data for devices. Now I want to design a query that reduces the number of coordinates in the output. It should aggregate nearby coordinates (within a certain x,y threshold).
An example:
row 1: DEVICE1;603;1205
row 2: DEVICE1;604;1204
If the threshold is 5, these two rows should be aggregated, since they differ by less than 5 in both x and y.
Any idea how to do this in PostgreSQL or SQL in general?

Use the often overlooked built-in function width_bucket() in combination with your aggregation:
If your coordinates run from, say, 0 to 2000 and you want to consolidate everything within squares of 5 to single points, I would lay out a grid of 10 (5*2) like this:
SELECT device_id
, width_bucket(pos_x, 0, 2000, 2000/10) * 10 AS pos_x
, width_bucket(pos_y, 0, 2000, 2000/10) * 10 AS pos_y
, count(*) AS ct -- or any other aggregate
FROM tbl
GROUP BY 1,2,3
ORDER BY 1,2,3;
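To see why the question's two sample rows collapse to a single grid point, width_bucket() can be checked standalone (using the grid of 10 from above, i.e. 2000/10 = 200 buckets):
SELECT width_bucket(603, 0, 2000, 200) * 10 AS x1  -- bucket 61 -> 610
     , width_bucket(604, 0, 2000, 200) * 10 AS x2; -- bucket 61 -> 610, same grid point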
To minimize the error you could GROUP BY the grid as demonstrated, but save actual average coordinates:
SELECT device_id
, avg(pos_x)::int AS pos_x -- save actual averages to minimize error
, avg(pos_y)::int AS pos_y -- cast if you need to
, count(*) AS ct -- or any other aggregate
FROM tbl
GROUP BY
device_id
, width_bucket(pos_x, 0, 2000, 2000/10) * 10 -- aggregate by grid
, width_bucket(pos_y, 0, 2000, 2000/10) * 10
ORDER BY 1,2,3;
sqlfiddle demonstrating both alongside.
Well, this particular case could be simpler:
...
GROUP BY
device_id
, (pos_x / 10) * 10 -- truncates last digit of an integer
, (pos_y / 10) * 10
...
But that's just because the demo grid size of 10 conveniently matches the decimal system. Try the same with a grid size of 17 or something ...
Expand to timestamps
You can expand this approach to cover date and timestamp values by converting them to the unix epoch (the number of seconds since '1970-01-01') with extract().
SELECT extract(epoch FROM '2012-10-01 21:06:38+02'::timestamptz);
When you are done, convert the result back to timestamp with time zone:
SELECT timestamptz 'epoch' + 1349118398 * interval '1s';
Or simply to_timestamp():
SELECT to_timestamp(1349118398);
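Putting those pieces together, here is a minimal sketch (the events table and ts column are hypothetical) that groups rows into 15-minute slots with plain integer math on the epoch:
SELECT to_timestamp(floor(extract(epoch FROM ts) / 900) * 900) AS slot  -- 900 s = 15 min
     , count(*) AS ct
FROM events
GROUP BY 1
ORDER BY 1;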

SELECT [some aggregates]
FROM tbl
GROUP BY pos_x / 5, pos_y / 5;
Where instead of 5 you can use any number, depending on how much aggregation you need.
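Spelled out against the question's table (assuming the columns are integers, so / 5 truncates), that would be:
SELECT device_id
     , (pos_x / 5) * 5 AS pos_x  -- integer division snaps to a 5-wide grid
     , (pos_y / 5) * 5 AS pos_y
     , count(*) AS ct
FROM tbl
GROUP BY 1, 2, 3;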

Related

How do I calculate spikes in data over the last x amount of minutes in QuestDB

I have a table CPU with some metrics coming in over Influx line protocol. How can I run a query that tells me if data has exceeded a certain threshold in the last x minutes?
I am trying to do something like:
select usage_system, timestamp
from cpu
where usage_system > 20
But this returns all records. For the time range, I don't want to hardcode timestamps, or have to keep replacing them with the current time, to get a relative query.
usage              timestamp
27.399999999906    2021-04-14T12:02:30.000000Z
26.400000000139    2021-04-14T12:02:30.000000Z
25.666666666899    2021-04-14T12:02:30.000000Z
...
Ideally I would like to be able to check if usage is above an average, but above hardcoded value is fine for now.
The following query will dynamically filter the last 5 minutes on the timestamp field and will return rows based on a hardcoded value (when usage_system is above 20):
SELECT *
FROM cpu
WHERE usage_system > 20
AND timestamp > dateadd('m', -5, now());
If you want to add some more details to the query, you can cross join an aggregate value:
WITH avg_usage AS (SELECT avg(usage_system) average FROM cpu)
SELECT timestamp, cpu.usage_system usage, average, cpu.usage_system > avg_usage.average above_average
FROM cpu CROSS JOIN avg_usage
WHERE timestamp > dateadd('m', -5, now());
This adds a boolean column above_average, which is true or false depending on whether the row's usage_system is above the aggregate:
timestamp                      usage   average   above_average
2021-04-14T13:30:00.000000Z    20      10        true
2021-04-14T13:30:00.000000Z    5       10        false
If you want all columns, it might be useful to move this filter down into the WHERE clause:
WITH avg_usage AS (SELECT avg(usage_system) average FROM cpu)
SELECT *
FROM cpu CROSS JOIN avg_usage
WHERE timestamp > dateadd('m', -5, now())
AND cpu.usage_system > avg_usage.average;
This then allows more complex filtering, such as returning all rows in a certain percentile. For example, the following returns rows where usage_system is above 80% of the maximum recorded in the last 5 minutes (i.e. the highest CPU usage):
WITH max_usage AS (select max(usage_system) maximum FROM cpu)
SELECT *
FROM cpu CROSS JOIN max_usage
WHERE timestamp > dateadd('m', -5, now())
AND cpu.usage_system > (max_usage.maximum / 100 * 80);
edit: the last query was based on the example in the QuestDB WITH keyword documentation

REGR_SLOPE in Teradata SQL Query Returning 0 Slope

I am a relative newbie with Teradata SQL and have run into this strange (I think strange) situation. I am trying to run a regression (REGR_SLOPE) on sensor data. I am gathering sensor readings for a single day; each day has 80 observations, which is confirmed by the COUNT in the outer SELECT. My query is:
SELECT
    d.meter_id,
    REGR_SLOPE(d.reading_measure, d.x_axis) AS slope,
    COUNT(d.x_axis) AS xcount,
    COUNT(d.reading_measure) AS read_count
FROM
(
    SELECT
        meter_id,
        reading_measure,
        row_number() OVER (ORDER BY Reading_Dttm) AS x_axis
    FROM data_mart.v_meter_reading
    WHERE Reading_Start_Dt = '2017-12-12'
    AND Meter_Id IN (11932101, 11419827, 11385229, 11643466)
    AND Channel_Num = 5
) d
GROUP BY 1
When I use the "IN" clause in the subquery to specify Meter_Id, I get slope values, but when I take it out (to run over all meters), all the slopes are 0 (zero). I would simply like to fit a line through a day's worth of observations (80).
I'm using Teradata v15.0.
What am I missing / doing wrong?
I would bet a Pepperoni Pizza that it's the x_axis value.
Instead try ROW_NUMBER() OVER (PARTITION BY meter_id ORDER BY reading_dttm)
This will ensure that the x_axis starts again from 1 for each meter, and each reading will always be 1 away from the previous reading on the x_axis.
This makes me think you should probably just use reading_dttm as the x_axis value, rather than fabricating one with ROW_NUMBER(). That way, readings with a 5-hour gap between them get a different slope than readings with a 10-day gap. You may need to convert reading_dttm's data type, with a function like TO_UNIXTIME(reading_dttm) or something similar.
I'll message you my address for the Pizza Delivery. (Joking.)
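For reference, here is the question's query with just that one change applied (a sketch; the IN filter is dropped so it runs over all meters):
SELECT
    d.meter_id,
    REGR_SLOPE(d.reading_measure, d.x_axis) AS slope,
    COUNT(d.x_axis) AS xcount,
    COUNT(d.reading_measure) AS read_count
FROM
(
    SELECT
        meter_id,
        reading_measure,
        row_number() OVER (PARTITION BY meter_id ORDER BY Reading_Dttm) AS x_axis
    FROM data_mart.v_meter_reading
    WHERE Reading_Start_Dt = '2017-12-12'
    AND Channel_Num = 5
) d
GROUP BY 1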
In addition to @MatBailie's answer:
You probably know that you should order by the timestamp instead of the ROW_NUMBER, but you can't, because Teradata doesn't allow timestamps in this place (strange).
There's no built-in TO_UNIXTIME function in Teradata, but you can use this instead:
REPLACE FUNCTION TimeStamp_to_UnixTime (ts TIMESTAMP(6))
RETURNS decimal(18,6)
LANGUAGE SQL
CONTAINS SQL
DETERMINISTIC
SQL SECURITY DEFINER
COLLATION INVOKER
INLINE TYPE 1
RETURN
(Cast(ts AS DATE) - DATE '1970-01-01') * 86400
+ (Extract(HOUR From ts) * 3600)
+ (Extract(MINUTE From ts) * 60)
+ (Extract(SECOND From ts));
If you're not allowed to create UDFs, simply cut & paste the calculation.
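If the UDF is available, a sketch of how it would slot into the regression (columns as in the question's query):
SELECT
    meter_id,
    REGR_SLOPE(reading_measure, TimeStamp_to_UnixTime(Reading_Dttm)) AS slope
FROM data_mart.v_meter_reading
WHERE Reading_Start_Dt = '2017-12-12'
AND Channel_Num = 5
GROUP BY 1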

Every 10th row based on timestamp

I have a table with signal name, value and timestamp. These signals were recorded at a sampling rate of 1 sample/sec. Now I want to plot a graph over months of values, and it is becoming very heavy for the system to perform within seconds. So my question is: is there any way to view 1 value per minute, in other words, every 60th row?
You can use the row_number() function to enumerate the rows, and then use modulo arithmetic to get the rows:
select signalname, value, timestamp
from (select t.*,
             row_number() over (order by timestamp) as seqnum
      from table t
     ) t
where seqnum % 60 = 0;
If your data really is regular, you can also extract the seconds value and check when that is 0:
select signalname, value, timestamp
from table t
where datepart(second, timestamp) = 0
This assumes that timestamp is stored in an appropriate date/time format.
Instead of sampling, you could use the one-minute average for your plot:
select name
     , min(timestamp)
     , avg(value)
from Yourtable
group by name
       , datediff(minute, '2013-01-01', timestamp)
If you are charting months, even the hourly average might be detailed enough.
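For months of data, that would be the same query with only the datediff unit changed (a sketch, same assumed columns as above):
select name
     , min(timestamp)
     , avg(value)
from Yourtable
group by name
       , datediff(hour, '2013-01-01', timestamp)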

Min, Max, Avg for 2-hour intervals

In the query below, I am calculating the minimum, the maximum, and the average for two-hour intervals using PostgreSQL.
The query works fine for even start hours (..04:00:00+05:30), but for odd start hours (..05:00:00+05:30) it gives the same result as the even start time.
The multiply by 2 returns even hours, which is the problem.
SELECT tagid, CAST(sample_time_stamp AS Date) AS stat_date,
       (floor(extract(hour from sample_time_stamp)/2) * 2)::int AS hrs,
       min(sensor_reading) AS theMin,
       max(sensor_reading) AS theMax,
       avg(sensor_reading) AS theAvg
FROM sensor_readings
WHERE tagid = 1
AND sample_time_stamp BETWEEN '2012-10-23 01:00:00+05:30'
                          AND '2012-10-23 05:59:00+05:30'
GROUP BY tagid, CAST(sample_time_stamp AS Date),
         floor(extract(hour from sample_time_stamp)/2) * 2
ORDER BY tagid, stat_date, hrs
Output for odd start hour ('2012-10-23 01:00:00+05:30'):
tagid   date         hrs   theMin   theMax   theAvg
1       2012-10-23   0     6        58       30.95
1       2012-10-23   2     2        59       29.6916666666667
1       2012-10-23   4     3        89       31.7666666666667
Output for even start hour ('2012-10-23 02:00:00+05:30'):
tagid   date         hrs   theMin   theMax   theAvg
1       2012-10-23   2     2        59       29.6916666666667
1       2012-10-23   4     3        89       31.7666666666667
To get constant time-frames starting with your minimum timestamp:
WITH params AS (
   SELECT '2012-10-23 01:00:00+05:30'::timestamptz AS _min  -- input params
        , '2012-10-23 05:59:00+05:30'::timestamptz AS _max
        , '2 hours'::interval AS _interval
   )
, ts AS (SELECT generate_series(_min, _max, _interval) AS t_min FROM params)
, timeframe AS (
   SELECT t_min
        , lead(t_min, 1, _max) OVER (ORDER BY t_min) AS t_max
   FROM ts, params
   )
SELECT s.tagid
     , t.t_min
     , t.t_max  -- mildly redundant, except for the last row
     , min(s.sensor_reading) AS the_min
     , max(s.sensor_reading) AS the_max
     , avg(s.sensor_reading) AS the_avg
FROM timeframe t
LEFT JOIN sensor_readings s ON  s.tagid = 1
                            AND s.sample_time_stamp >= t.t_min
                            AND s.sample_time_stamp <  t.t_max
GROUP BY 1,2,3
ORDER BY 1,2;
Can be used for any time frame and any interval length. Requires PostgreSQL 8.4 or later.
If the maximum timestamp _max does not fall on _min + n * _interval the last time-frame is truncated. The last row can therefore represent a shorter time-frame than your desired _interval.
Key elements
Common Table Expressions (CTE) for easier handling. Input parameter values once in the top CTE params.
generate_series() for intervals to create the time raster.
Window function lead(...) with 3 parameters (including default) - to cover the special case of last row.
LEFT JOIN between raster and actual data, so that time frames without matching data will still show in the result (with NULL values as data).
That's also the reason for a later edit: the WHERE condition had to move into the LEFT JOIN condition to achieve that.
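Condensed to its core, the raster generation (with the same literals as in params) looks like this; note how the third argument to lead() supplies t_max for the last row:
SELECT g AS t_min
     , lead(g, 1, timestamptz '2012-10-23 05:59:00+05:30') OVER (ORDER BY g) AS t_max
FROM generate_series(timestamptz '2012-10-23 01:00:00+05:30'
                   , timestamptz '2012-10-23 05:59:00+05:30'
                   , interval '2 hours') g;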
Alternative time frame generation with recursive CTE:
WITH RECURSIVE params AS (
   SELECT '2012-10-23 01:00:00+05:30'::timestamptz AS _min  -- input params
        , '2012-10-23 05:59:00+05:30'::timestamptz AS _max
        , '2 hours'::interval AS _interval
   )
, timeframe AS (
   SELECT _min AS t_min, LEAST(_min + _interval, _max) AS t_max
   FROM params
   UNION ALL
   SELECT t_max, LEAST(t_max + _interval, _max)
   FROM timeframe t, params p
   WHERE t_max < _max
   )
SELECT ...
Slightly faster ... take your pick.
-> sqlfiddle displaying both.
Note that you can have non-recursive CTEs (additionally) even when declared WITH RECURSIVE.
Performance & Index
Should be faster than your original query. Half the code deals with generating the time raster, which concerns few rows and is very fast. Handling actual table rows (the expensive part) gets cheaper, because we don't calculate a new value from every sample_time_stamp any more.
You should definitely have a multi-column index of the form:
CREATE INDEX foo_idx ON sensor_readings (tagid, sample_time_stamp DESC);
I use DESC on the assumption that you more often query recent entries (later timestamps). Remove the modifier if that's not the case. Doesn't make a big difference either way.

Postgres SQL select a range of records spaced out by a given interval

I am trying to determine if it is possible, using only SQL in Postgres, to select a range of time-ordered records at a given interval.
Let's say I have 60 records, one record for each minute in a given hour. I want to select records at 5-minute intervals for that hour. The resulting rows should be 12 records, each one 5 minutes apart.
This is currently accomplished by selecting the full range of records and then looping through the results and pulling out the records at the given interval. I am trying to see if I can do this purely in SQL, as our db is large and we may be dealing with tens of thousands of records.
Any thoughts?
Yes you can. It's really easy once you get the hang of it. I think it's one of the jewels of SQL, and it's especially easy in PostgreSQL because of its excellent temporal support. Often, complex functions turn into very simple queries in SQL that can scale and be indexed properly.
This uses generate_series to draw up sample time stamps that are spaced 1 minute apart. The outer query then extracts the minute and uses modulo to find the values that are 5 minutes apart.
select
    ts,
    extract(minute from ts)::integer as minute
from
    ( -- generate some time stamps - one minute apart
      select current_time + (n || ' minute')::interval as ts
      from generate_series(1, 30) as n
    ) as timestamps
-- extract the minute and check if it's on a 5 minute interval
where extract(minute from ts)::integer % 5 = 0
-- only pick this hour
and extract(hour from ts) = extract(hour from current_time)
;
ts | minute
--------------------+--------
19:40:53.508836-07 | 40
19:45:53.508836-07 | 45
19:50:53.508836-07 | 50
19:55:53.508836-07 | 55
Notice that a computed (expression) index matching the WHERE clause, where the value of the expression makes up the index, could lead to major speed improvements. Maybe not very selective in this case, but good to be aware of.
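For example, such an expression index might look like this (a sketch; it assumes a real table with a timestamp without time zone column ts - with timestamptz the expression is not immutable and Postgres will reject it):
create index tbl_minute_mod5_idx on tbl ((extract(minute from ts)::integer % 5));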
I wrote a reservation system once in PostgreSQL (which had lots of temporal logic where date intervals could not overlap) and never had to resort to iterative methods.
http://www.amazon.com/SQL-Design-Patterns-Programming-Focus/dp/0977671542 is an excellent book that has lots of interval examples. Hard to find in book stores now, but well worth it.
Extract the minutes, convert to int4, and see, if the remainder from dividing by 5 is 0:
select *
from TABLE
where int4 (date_part ('minute', COLUMN)) % 5 = 0;
If the intervals are not time based and you just want every 5th row, or if the times are regular and you always have one record per minute, then the query below gives you one record out of every 5:
select *
from
(
    select *, row_number() over (order by timecolumn) as rown
    from tbl
) X
where mod(rown, 5) = 1
If your time records are not regular, then you need to generate a time series (given in another answer), left join it to your table, group by the time column (from the series), and pick the MAX time from your table that is no greater than the time column.
Pseudo
select thetimeinterval, max(timecolumn)
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
And further join it back to the table for the full record (assuming unique times)
select t.*
from tbl t
inner join
(
    select thetimeinterval, max(timecolumn) timecolumn
    from ( < the time series subquery > ) X
    left join tbl on tbl.timecolumn <= thetimeinterval
    group by thetimeinterval
) y on t.timecolumn = y.timecolumn
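In PostgreSQL, the < the time series subquery > placeholder could be filled with generate_series, e.g. for one hour at 5-minute steps (the literals here are purely illustrative):
select g.thetimeinterval
from generate_series(timestamp '2013-01-01 00:00'
                   , timestamp '2013-01-01 00:55'
                   , interval '5 minutes') as g(thetimeinterval)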
How about this:
select min(ts), extract(minute from ts)::integer / 5 as bucket
from your_table
group by bucket
order by bucket;
This has the advantage of doing the right thing if you have two readings for the same minute, or if your readings skip a minute. Instead of using min, even better would be to use one of the first() aggregate functions, code for which you can find here:
http://wiki.postgresql.org/wiki/First_%28aggregate%29
This assumes that your five minute intervals are "on the fives", so to speak. That is, that you want 07:00, 07:05, 07:10, not 07:02, 07:07, 07:12. It also assumes you don't have two rows within the same minute, which might not be a safe assumption.
select your_timestamp
from your_table
where cast(extract(minute from your_timestamp) as integer) % 5 = 0;
If you might have two rows with timestamps within the same minute, like
2011-01-01 07:00:02
2011-01-01 07:00:59
then this version is safer.
select min(your_timestamp)
from your_table
group by (cast(extract(minute from your_timestamp) as integer) / 5)
Wrap either of those in a view, and you can join it to your base table.
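A sketch of that wrapping (the view name is hypothetical; as noted above, it assumes unique timestamps, and the minute-based grouping merges buckets across hours unless you scope the query to a single hour):
create view five_minute_samples as
select min(your_timestamp) as your_timestamp
from your_table
group by (cast(extract(minute from your_timestamp) as integer) / 5);

-- join back to the base table for the full records
select t.*
from your_table t
join five_minute_samples s on s.your_timestamp = t.your_timestamp;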