Identify groups of rows in close proximity - sql

I am doing a project for school and have been given a database of GPS recordings for three people over the course of a week. I am trying to group these recordings into trips based on the time between them. If a recording is within 300 seconds of the recording before it, they are considered to be part of the same trip; otherwise, they are considered part of different trips.
So far I have managed to calculate the time difference between a recording on row n and the one on row n-1, and I am now trying to create a function for merging the recordings into trips. This would have been really easy in another programming language, but in this course we are using PostgreSQL, which I am not that well versed in.
To solve this, I am trying to create a function with a variable that increases every time the time difference between two recordings is greater than 300 seconds and assigns each row to a trip based on the variable. This is as far as I have currently gotten, although at the moment the variable X resets all the time, thus assigning all rows to trip 1...
CREATE OR REPLACE FUNCTION tripmerge(time_diff double precision)
RETURNS integer AS $$
DECLARE
  X integer := 1;
  ID integer;
BEGIN
  IF time_diff < 300 THEN
    ID := X;
  ELSE
    ID := X;
    X := X + 1;
  END IF;
  RETURN ID;
END; $$
LANGUAGE plpgsql;
How do I change so X does not reset all the time? I am using PostgreSQL 9.1.
EDIT:
This is the table I am working with:
curr_rec (timestamp), prev_rec (timestamp), time_diff (double precision)
With this being a sample of the dataset:
'2013-11-14 05:22:33.991+01',null ,null
'2013-11-14 09:15:40.485+01','2013-11-14 05:22:33.991+01',13986.494
'2013-11-14 09:17:04.837+01','2013-11-14 09:15:40.485+01',84.352
'2013-11-14 09:17:43.055+01','2013-11-14 09:17:04.837+01',38.218
'2013-11-14 09:23:24.205+01','2013-11-14 09:17:43.055+01',341.15
The expected result would add a column:
tripID
1
2
2
2
3
And I think this fiddle should be working: http://sqlfiddle.com/#!1/4e3e5/1/0

This query uses only curr_rec, not the other redundant, precomputed columns:
SELECT 1 + count(step OR NULL) OVER (ORDER BY curr_rec) AS trip_id
FROM  (
   SELECT curr_rec
        , lag(curr_rec) OVER (ORDER BY curr_rec) AS prev_rec
        , curr_rec - lag(curr_rec) OVER (ORDER BY curr_rec)
          > interval '5 min' AS step
   FROM   timestamps
   ) x;
Key features are:
The window function lag(), which I use to see if the previous row is more than 5 minutes ago. (Just using an interval for the comparison, no need to extract seconds)
The window aggregate function count() - that's just the basic aggregate function with an OVER clause.
The expression step OR NULL only leaves TRUE or NULL, where only TRUE is counted in a running count, thereby arriving at your desired result.
SQL Fiddle (building on the one you provided).
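To see how this plays out on the sample data above, here are the intermediate step flags and the resulting trip numbers (a sketch, assuming the fiddle's timestamps table holds exactly those five rows):

curr_rec                     | step | trip_id
-----------------------------+------+--------
2013-11-14 05:22:33.991+01   |      | 1
2013-11-14 09:15:40.485+01   | t    | 2
2013-11-14 09:17:04.837+01   | f    | 2
2013-11-14 09:17:43.055+01   | f    | 2
2013-11-14 09:23:24.205+01   | t    | 3

Only the two rows where the gap to the previous recording exceeds 5 minutes are counted, which is what increments the trip number.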


Limit result rows for minimal time intervals for PostgreSQL

Background: I am running TeslaMate/Grafana for monitoring my car status; one of the gauges plots the battery level fetched from the database. My server is located remotely and runs in Docker on an old NAS, so both query performance and network overhead matter.
I found the kiosk page frequently hangs, and on investigation it might be caused by the query -- two of the plots return 10~100k rows of results from the database. I want to limit the number of rows returned by the SQL queries, as the plots certainly don't have that much precision for drawing such detailed intervals.
I tried to follow this answer and use row_number() to keep only every 100th row of the results, but a more complicated issue turned up: the time intervals between rows are not consistent.
The car has 4 statuses: driving / online / asleep / offline.
If the car is in driving status, the time interval could be less than 200ms, as the car pushes the status whenever it has new data.
If the car is in online status, the time interval could be several minutes, as the system actively fetches the status from the car.
Even worse, if the system thinks the car is going to sleep and needs to stop fetching status (to avoid preventing the car from sleeping), the interval could be up to 40 minutes, depending on settings.
If the car is in asleep/offline status, no data is recorded at all.
This obviously makes skipping every n-th row a bad idea, as in cases 2-4 above lots of data points might be missing, so Grafana cannot plot a correct graph representing the battery level at satisfactory precision.
I wonder if it's possible to skip rows by time interval from a datetime field rather than by row_number(), without much overhead from the query? I.e., fetch only rows that are at least 1000 ms after the previously returned row.
E.g., given the following data in the table, I want the rows returned to be rows 1, 4 and 5.
row date
[1] 1610000001000
[2] 1610000001100
[3] 1610000001200
[4] 1610000002000
[5] 1610000005000
The current (problematic) method I am using is as follows:
SELECT $__time(t.date), t.battery_level AS "SOC [%]"
FROM (
SELECT date, battery_level, row_number() OVER(ORDER BY date ASC) AS row
FROM (
SELECT battery_level, date
FROM positions
WHERE car_id = $car_id AND $__timeFilter(date)
UNION ALL
SELECT battery_level, date
FROM charges c
JOIN charging_processes p ON p.id = c.charging_process_id
WHERE $__timeFilter(date) AND p.car_id = $car_id) AS data
ORDER BY date ASC) as t
WHERE t.row % 100 = 0;
This method clearly has the problem that it simply returns every n-th row rather than what I want (with the last line reading t.row % 2 = 0 it just returns alternate rows).
PS: please ignore the table structures and UNION in the sample code; I haven't dug deep enough into the tables, and there could be other tweaks, but they are irrelevant to this question anyway.
Thanks in advance!
You can use a recursive CTE:
WITH RECURSIVE rec(cur_row, cur_date) AS (
   (
      SELECT row, date
      FROM t
      ORDER BY date
      LIMIT 1
   )
   UNION ALL
   (
      SELECT row, date
      FROM t
      JOIN rec ON t.date >= cur_date + 1000
      ORDER BY t.date
      LIMIT 1
   )
)
SELECT *
FROM rec;
cur_row | cur_date
--------+---------------
      1 | 1610000001000
      4 | 1610000002000
      5 | 1610000005000
View on DB Fiddle
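To fold this into the original dashboard query, the recursive part would run over the combined positions/charges subquery instead of a plain table. A rough sketch, keeping the $car_id / $__timeFilter macros from the question and assuming date is a timestamp in the real tables (the sample data here uses epoch milliseconds, hence the + 1000 above; with a timestamp the minimum gap becomes an interval):

WITH RECURSIVE data AS (
   SELECT battery_level, date
   FROM positions
   WHERE car_id = $car_id AND $__timeFilter(date)
   UNION ALL
   SELECT battery_level, date
   FROM charges c
   JOIN charging_processes p ON p.id = c.charging_process_id
   WHERE $__timeFilter(date) AND p.car_id = $car_id
), rec(cur_date, battery_level) AS (
   (
      SELECT date, battery_level
      FROM data
      ORDER BY date
      LIMIT 1
   )
   UNION ALL
   (
      SELECT d.date, d.battery_level
      FROM data d
      JOIN rec ON d.date >= rec.cur_date + interval '1 second'
      ORDER BY d.date
      LIMIT 1
   )
)
SELECT $__time(cur_date), battery_level AS "SOC [%]"
FROM rec
ORDER BY cur_date;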
Using a function instead would probably be faster:
CREATE OR REPLACE FUNCTION f() RETURNS SETOF t AS
$$
DECLARE
   row t%ROWTYPE;
   cur_date BIGINT;
BEGIN
   FOR row IN
      SELECT *
      FROM t
      ORDER BY date
   LOOP
      IF row.date >= cur_date + 1000 OR cur_date IS NULL THEN
         cur_date := row.date;
         RETURN NEXT row;
      END IF;
   END LOOP;
END;
$$ LANGUAGE plpgsql;
SELECT *
FROM f();
row | date
----+---------------
  1 | 1610000001000
  4 | 1610000002000
  5 | 1610000005000

Generate a random value in a row based on a value from another table

I want to create a large amount of mock data in a table (in Postgresql). The schema of the table looks like this
price float,
id id,
period timestamptz
For price, this will be a random float number between 1-5
For id, this will be a value from another table that contains all values in the id column (which may have a lot of ids)
For period, this will generate a random datetime value in a specific range of time.
Here, I want to create a single query that can generate rows equal in number to the ids I have, with periods in a specific range of time that I select.
E.g.
Let's say I have 3 ids (a, b, c) in another table and I want to generate a time series between 2019-08-01 00:00:00+00 and 2019-08-05 00:00:00+00
The result from this query will generate value like this:
price id period
3.4 b 2019-08-03 10:01:22+00
2.5 a 2019-08-04 05:44:31+00
4.8 c 2019-08-04 14:51:10+00
The price and id are random, and so is period, but within the specific range. The key thing is that all ids need to be present in the output.
Generating a random number and datetime is not hard, but how can I create a query that generates rows based on all the ids gathered from another table?
PS: I have edited the example, which might have made my question misleading.
This answers a reasonable interpretation of the original question.
Getting a random value from a second table can be a little tricky. If the second table is not too big, then this works:
select distinct on (gs.ts) gs.ts, ids.id, cast(random() * 4.1 + 1 as numeric(2, 1))
from generate_series('2019-08-01 00:00:00+00'::timestamp, '2019-08-05 00:00:00+00'::timestamp, interval '30 minute') gs(ts)
cross join ids
order by gs.ts, random()
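Note that this picks one random id per generated timestamp, so some ids may not appear at all. If every id must show up (the edited requirement), a simple variant is to drive the generation off the id table itself and give each id one random timestamp inside the range -- a sketch, with the second table called ids as in the query above:

select cast(random() * 4 + 1 as numeric(2, 1)) as price
     , ids.id
     , '2019-08-01 00:00:00+00'::timestamptz + random() * interval '4 days' as period
from ids;

Cross join this with generate_series() instead if several rows per id are wanted.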
Use the function make_timestamptz, generating a random integer for each part except year and month. This will create random timestamps. As for getting the id from another table, just select from that table.
/*
function to generate random integers. (Lots of them needed.)
*/
create or replace function utl_gen_random_integer(
    int1_in integer,
    int2_in integer)
  returns integer
  language sql volatile strict
as
$$
/* Return a random integer between, inclusively, two integers; the relative order of the integers does not matter. */
with ord as ( select greatest(int1_in, int2_in) as hi
                   , least(int1_in, int2_in) as low
            )
select floor(random()*(hi-low+1)+low)::integer from ord;
$$;
-- create the id source table and populate
create table id_source( id text) ;
insert into id_source( id)
with id_range as ( select 'abcdefgh'::text idl)
select substring(idl,utl_gen_random_integer(1,length(idl)), 1)
from id_range, generate_series(1,20) ;
And the generation query:
select trunc((utl_gen_random_integer(1,4) + (utl_gen_random_integer(0,100))/100.0),2) Price
, id
, make_timestamptz ( 2019 -- year
, 08 -- month
, utl_gen_random_integer(1,5) -- day
, utl_gen_random_integer(1,24)-1 -- hours
, utl_gen_random_integer(1,60)-1 -- min
, (utl_gen_random_integer(1,60)-1)::float -- sec
, '+00'
)
from id_source;
The result generates the time at UTC (+00). However, any subsequent Postgres query will display the result converted to local time with an offset. To view it in UTC, append "at time zone 'UTC'" to the query.
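For example, a sketch of the same generation query with the period displayed in UTC:

select trunc((utl_gen_random_integer(1,4) + (utl_gen_random_integer(0,100))/100.0),2) Price
     , id
     , make_timestamptz ( 2019
                        , 08
                        , utl_gen_random_integer(1,5)
                        , utl_gen_random_integer(1,24)-1
                        , utl_gen_random_integer(1,60)-1
                        , (utl_gen_random_integer(1,60)-1)::float
                        , '+00'
                        ) at time zone 'UTC' as period_utc
from id_source;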

SQL script to find previous value, not necessarily previous row

Is there a way in SQL to find a previous value, not necessarily in the previous row, within the same SELECT statement?
See picture below. I'd like to add another column, ELAPSED, that calculates the time difference between TIMERSTART values, but only when DEVICEID is the same and I_TYPE is viewDisplayed. E.g. subtract 1 from 2, store the difference in 3, store 0 in 4 because i_type is not viewDisplayed, subtract 2 from 5, store the difference in 6, and so on.
It has to be a statement, I can't use a stored procedure in this case.
SELECT DEVICEID, I_TYPE, TIMERSTART,
0 AS ELAPSED -- CASE WHEN <CONDITION> THEN TIMEDIFF() ELSE 0 END AS ELAPSED
FROM CLIENT_USAGE
ORDER BY TIMERSTART ASC
I'm using SAP HANA DB, but it works pretty much like the latest version of MS-SQL. So, if you know how to make it work in SQL, I can make it work in HANA.
You can make a subquery to find the last time entered previous to the row in question.
select deviceid, i_type, timerstart, (timerstart - timerlast) as elapsed
from CLIENT_USAGE CU
join ( select top 1 timerstart as timerlast
       from CLIENT_USAGE C
       where (C.i_type = CU.i_type) and
             (C.deviceid = CU.deviceid) and (C.timerstart < CU.timerstart)
       order by C.timerstart desc
     ) as temp1
  on temp1.i_type = CU.i_type
order by timerstart asc
This is a rough sketch of what the SQL should look like. I do not know what your primary key is on this table, whether it is i_type or i_type and deviceid, but this should at least help with how to calculate the field. I do not think it would be necessary to store the value unless this table is very large or the hardware being used is very slow; it can be calculated rather easily each time this query is run.
SAP HANA supports window functions:
select DEVICEID,
TIMERSTART,
lag(TIMERSTART) over (partition by DEVICEID order by TIMERSTART) as previous_start
from CLIENT_USAGE
Then you can wrap this in parentheses and manipulate the data to your heart's content.
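For example, a sketch of the ELAPSED calculation from the question built on top of lag() -- assuming HANA's SECONDS_BETWEEN() function and using the 'viewDisplayed' literal from the question:

select deviceid, i_type, timerstart,
       case
          when i_type = 'viewDisplayed' then
             coalesce(seconds_between(
                lag(timerstart) over (partition by deviceid, i_type
                                      order by timerstart),
                timerstart), 0)
          else 0
       end as elapsed
from CLIENT_USAGE
order by timerstart asc

Partitioning by both DEVICEID and I_TYPE means the lag only looks at earlier viewDisplayed rows of the same device; the first such row (and every non-viewDisplayed row) gets 0.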

SQL workaround to substitute FOLLOWING / PRECEDING in PostgreSQL 8.4

I have a query that does a basic moving average using the FOLLOWING / PRECEDING syntax of PostgreSQL 9.0. To my horror I discovered our pg server runs on 8.4 and there is no scope to get an upgrade in the near future.
I am therefore looking for the simplest way to make a backwards compatible query of the following:
SELECT time_series,
avg_price AS daily_price,
CASE WHEN row_number() OVER (ORDER BY time_series) > 7
THEN avg(avg_price) OVER (ORDER BY time_series DESC ROWS BETWEEN 0 FOLLOWING
AND 6 FOLLOWING)
ELSE NULL
END AS avg_price
FROM (
SELECT to_char(closing_date, 'YYYY/MM/DD') AS time_series,
SUM(price) / COUNT(itemname) AS avg_price
FROM auction_prices
WHERE itemname = 'iphone6_16gb' AND price < 1000
GROUP BY time_series
) sub
It is a basic 7-day moving average for a table containing price and timestamp columns:
closing_date timestamp
price numeric
itemname text
The requirement for basic is due to my basic knowledge of SQL.
Postgres 8.4 already has CTEs.
I suggest using them: calculate the daily average in a CTE and then self-join to all days (existing or not) in the past week. Finally, aggregate once more for the weekly average:
WITH cte AS (
SELECT closing_date::date AS closing_day
, sum(price) AS day_sum
, count(price) AS day_ct
FROM auction_prices
WHERE itemname = 'iphone6_16gb'
AND price <= 1000 -- including upper border
GROUP BY 1
)
SELECT d.closing_day
, CASE WHEN d.day_ct > 1
THEN d.day_sum / d.day_ct
ELSE d.day_sum
END AS avg_day -- also avoids division-by-zero
, CASE WHEN sum(w.day_ct) > 1
THEN sum(w.day_sum) / sum(w.day_ct)
ELSE sum(w.day_sum)
END AS week_avg_proper -- also avoids division-by-zero
FROM cte d
JOIN cte w ON w.closing_day BETWEEN d.closing_day - 6 AND d.closing_day
GROUP BY d.closing_day, d.day_sum, d.day_ct
ORDER BY 1;
SQL Fiddle. (Running on Postgres 9.3, but should work in 8.4, too.)
Notes
I used a different (correct) algorithm to calculate the weekly average. See considerations in my comment to the question.
This calculates averages for every day in the base table, including corner cases. But there is no row for days without any data (see the sketch after these notes for a generated calendar).
One can subtract integer from date: d.closing_day - 6. (But not from varchar or timestamp!)
It's rather confusing that you call a timestamp column closing_date - it's not a date, it's a timestamp.
And time_series for the resulting column with a date value? I use closing_day instead ...
Note how I count prices count(price), not items COUNT(itemname) - which would be an entry point for a sneaky error if either of the columns can be NULL. If neither can be NULL count(*) would be superior.
The CASE construct avoids division-by-zero errors, which can occur as long as the column you are counting can be NULL. I could use COALESCE for the purpose, but while being at it I simplified the case for exactly 1 price as well.
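If rows for trade-free days are needed as well, one way is to generate a calendar with generate_series() and left join the daily CTE to it. A sketch, with a hypothetical date range and the same table and filters as above:

WITH cte AS (
   SELECT closing_date::date AS closing_day
        , sum(price) AS day_sum
        , count(price) AS day_ct
   FROM   auction_prices
   WHERE  itemname = 'iphone6_16gb'
   AND    price <= 1000
   GROUP  BY 1
)
SELECT cal.day::date AS closing_day
     , sum(w.day_sum) / NULLIF(sum(w.day_ct), 0) AS week_avg
FROM   generate_series('2014-01-01'::date, '2014-01-31'::date, interval '1 day') cal(day)
LEFT   JOIN cte w ON w.closing_day BETWEEN cal.day::date - 6 AND cal.day::date
GROUP  BY 1
ORDER  BY 1;

Days with no trades in the preceding week come out with a NULL average.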
-- make a subset and rank it on date
WITH xxx AS (
SELECT
rank() OVER(ORDER BY closing_date) AS rnk
, closing_date
, price
FROM auction_prices
WHERE itemname = 'iphone6_16gb' AND price < 1000
)
-- select subset, + aggregate on self-join
SELECT this.*
, (SELECT AVG(price) AS mean
FROM xxx that
WHERE that.rnk > this.rnk + 0 -- <<-- adjust window
AND that.rnk < this.rnk + 7 -- <<-- here
)
FROM xxx this
ORDER BY this.rnk
;
Note: the CTE is for convenience (Postgres 8.4 does have CTEs), but the CTE could be replaced by a subquery or, more elegantly, by a view (a sketch of the view variant follows below).
The code assumes that the time series has no gaps (one observation for every {product * day}). If not: join with a calendar table (which could also contain the rank).
(also note that I did not cover the corner cases.)
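A sketch of the view variant, with a hypothetical view name ranked_prices:

-- same ranked subset as the CTE above, as a reusable view
create view ranked_prices as
select rank() over (order by closing_date) as rnk
     , closing_date
     , price
from auction_prices
where itemname = 'iphone6_16gb' and price < 1000;

select this.*
     , (select avg(price)
        from ranked_prices that
        where that.rnk > this.rnk          -- <<-- same adjustable window
          and that.rnk < this.rnk + 7
       ) as mean
from ranked_prices this
order by this.rnk;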
PostgreSQL 8.4.... wasn't that in the day when everybody thought Windows 95 was great? Anyway...
The only option I can think of is to use a stored procedure with a scrollable cursor and do the math manually:
CREATE FUNCTION auction_prices(item text, price_limit real)
RETURNS TABLE (closing_date timestamp, avg_day real, avg_7_day real) AS $$
DECLARE
last_date date;
first_date date;
cur refcursor;
rec record;
dt date;
today date;
today_avg real;
p real;
sum_p real;
n integer;
BEGIN
-- There may be days when an item was not traded under the price limit, so need a
-- series of consecutive days to find all days. Find the end-points of that
-- interval.
SELECT max(closing_date), min(closing_date) INTO last_date, first_date
FROM auction_prices
WHERE itemname = item AND price < price_limit;
-- Need at least some data, so quit if item was never traded under the price limit.
IF NOT FOUND THEN
RETURN;
END IF;
-- Create a scrollable cursor over the auction_prices daily average and the
-- series of consecutive days. The LEFT JOIN means that you will get a NULL
-- for avg_price on days without trading.
OPEN cur SCROLL FOR
SELECT days.dt, sub.avg_price
FROM generate_series(last_date, first_date, interval '-1 day') AS days(dt)
LEFT JOIN (
SELECT closing_date, sum(price) / count(itemname) AS avg_price
FROM auction_prices
WHERE itemname = item AND price < price_limit
GROUP BY closing_date
) sub ON sub.closing_date::date = days.dt::date;
<<all_recs>>
LOOP -- over the entire date series
-- Get today's data (today = first day of 7-day period)
FETCH cur INTO today, today_avg;
EXIT all_recs WHEN NOT FOUND; -- No more data, so exit the loop
IF today_avg IS NULL THEN
n := 0;
sum_p := 0.0;
ELSE
n := 1;
sum_p := today_avg;
END IF;
-- Loop over the remaining 6 days
FOR i IN 2 .. 7 LOOP
FETCH cur INTO dt, p;
EXIT all_recs WHEN NOT FOUND; -- No more data, so exit the loop
IF p IS NOT NULL THEN
sum_p := sum_p + p;
n := n + 1;
END IF;
END LOOP;
-- Save the data to the result set. (RETURN NEXT in a RETURNS TABLE
-- function takes no expressions: assign the output columns, then emit the row.)
closing_date := today;
avg_day := today_avg;
IF n > 0 THEN
avg_7_day := sum_p / n;
ELSE
avg_7_day := NULL;
END IF;
RETURN NEXT;
-- Move the cursor back to the starting row of the next 7-day period
MOVE RELATIVE -6 FROM cur;
END LOOP all_recs;
CLOSE cur;
RETURN;
END; $$ LANGUAGE plpgsql STRICT;
A few notes:
There may be dates when an item is not traded under the limit price. In order to get accurate moving averages, you need to include those days, so generate a series of consecutive dates spanning the period during which the item was traded under the limit price and you will get accurate results.
The cursor needs to be scrollable so that you can read 6 rows ahead (to earlier dates, since the series runs backwards) to get the data needed for the calculation, and then move back 6 rows to calculate the average for the next day.
You cannot calculate a moving average on the last 6 days. The simple reason is that the MOVE command needs a constant number of records to move. Parameter substitution is not supported. On the up side, your moving average will always be for 7 days (of which not all may have seen trading).
This function will by no means be fast, but it should work. No guarantees though, I have not worked on an 8.4 box for years.
Use of this function is rather straightforward. Since it is returning a table you can use it in a FROM clause like any other table (and even JOIN to other relations):
SELECT to_char(closing_date, 'YYYY/MM/DD') AS time_series, avg_day, avg_7_day
FROM auction_prices('iphone6_16gb', 1000);

Postgres SQL select a range of records spaced out by a given interval

I am trying to determine if it is possible, using only sql for postgres, to select a range of time ordered records at a given interval.
Lets say I have 60 records, one record for each minute in a given hour. I want to select records at 5 minute intervals for that hour. The resulting rows should be 12 records each one 5 minutes apart.
This is currently accomplished by selecting the full range of records and then looping through the results and pulling out the records at the given interval. I am trying to see if I can do this purely in SQL, as our db is large and we may be dealing with tens of thousands of records.
Any thoughts?
Yes you can. It's really easy once you get the hang of it. I think it's one of the jewels of SQL, and it's especially easy in PostgreSQL because of its excellent temporal support. Often, complex functions turn into very simple queries in SQL that can scale and be indexed properly.
This uses generate_series to draw up sample time stamps that are spaced 1 minute apart. The outer query then extracts the minute and uses modulo to find the values that are 5 minutes apart.
select
ts,
extract(minute from ts)::integer as minute
from
( -- generate some time stamps - one minute apart
select
current_time + (n || ' minute')::interval as ts
from generate_series(1, 30) as n
) as timestamps
-- extract the minute and check if it's on a 5 minute interval
where extract(minute from ts)::integer % 5 = 0
-- only pick this hour
and extract(hour from ts) = extract(hour from current_time)
;
ts | minute
--------------------+--------
19:40:53.508836-07 | 40
19:45:53.508836-07 | 45
19:50:53.508836-07 | 50
19:55:53.508836-07 | 55
Notice how a computed (expression) index on the WHERE-clause expression (where the value of the expression would make up the index) could lead to major speed improvements. Maybe not very selective in this case, but good to be aware of.
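A sketch of such an expression index, with hypothetical table/column names (readings, ts) and assuming ts is a plain timestamp column (extract on timestamptz is not immutable, so it cannot be indexed directly):

create index readings_minute_mod5_idx
on readings ((extract(minute from ts)::integer % 5));

The planner can then match the indexed expression against the WHERE clause used above.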
I wrote a reservation system once in PostgreSQL (which had lots of temporal logic where date intervals could not overlap) and never had to resort to iterative methods.
http://www.amazon.com/SQL-Design-Patterns-Programming-Focus/dp/0977671542 is an excellent book that has lots of interval examples. Hard to find in book stores now but well worth it.
Extract the minutes, convert to int4, and see if the remainder from dividing by 5 is 0:
select *
from TABLE
where int4 (date_part ('minute', COLUMN)) % 5 = 0;
If the intervals are not time based and you just want every 5th row, or if the times are regular and you always have one record per minute, the below gives you one record per every 5:
select *
from
(
select *, row_number() over (order by timecolumn) as rown
from tbl
) X
where mod(rown, 5) = 1
If your time records are not regular, then you need to generate a time series (given in another answer) and left join that into your table, group by the time column (from the series) and pick the MAX time from your table that is less than the time column.
Pseudo
select thetimeinterval, max(timecolumn)
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
And further join it back to the table for the full record (assuming unique times)
select t.* from
tbl inner join
(
select thetimeinterval, max(timecolumn) timecolumn
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
) y on tbl.timecolumn = y.timecolumn
How about this:
select min(ts), extract(minute from ts)::integer / 5 as bucket
from your_table
group by bucket
order by bucket;
This has the advantage of doing the right thing if you have two readings for the same minute, or if your readings skip a minute. Instead of using min, even better would be to use one of the first() aggregate functions -- code for which you can find here:
http://wiki.postgresql.org/wiki/First_%28aggregate%29
This assumes that your five minute intervals are "on the fives", so to speak. That is, that you want 07:00, 07:05, 07:10, not 07:02, 07:07, 07:12. It also assumes you don't have two rows within the same minute, which might not be a safe assumption.
select your_timestamp
from your_table
where cast(extract(minute from your_timestamp) as integer) in (0,5);
If you might have two rows with timestamps within the same minute, like
2011-01-01 07:00:02
2011-01-01 07:00:59
then this version is safer.
select min(your_timestamp)
from your_table
group by (cast(extract(minute from your_timestamp) as integer) / 5)
Wrap either of those in a view, and you can join it to your base table.
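A sketch of that last step, with a hypothetical view name five_minute_rows wrapping the grouped query above:

create view five_minute_rows as
select min(your_timestamp) as your_timestamp
from your_table
group by (cast(extract(minute from your_timestamp) as integer) / 5);

select t.*
from your_table t
join five_minute_rows f on f.your_timestamp = t.your_timestamp;

As with the earlier approach, joining back on the timestamp assumes the timestamps are unique.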