Finding groups within time series in Postgre - sql

I have a list of timestamps and i want to tag them as a group when they are close enough (less than 15 sec intervall). This is what I want to have eventually :
time
group number
18:01:00
1
18:01:06
1
18:10:00
/
18:20:30
2
18:20:40
2
18:20:50
2
18:25:02
/

Use lag() and date comparisons to determine where a group begins. Then use a cumulative sum. You actually only want to include rows that have multiple rows in the group, so this is a little tricker than the simple gaps-and-islands problem:
select t.*,
(case when prev_time > time - interval '15 second' or
next_time < time + interval '15 second'
then sum(case when prev_time > time - interval '15 sec' then 0 else 1 end) over (order by time)
end) as group_number
from (select t.*,
lag(time) over (order by time) as prev_time,
lead(time) over (order by time) as next_time
from t
) t

Related

How can I aggregate time series data in postgres from a specific timestamp & fixed intervals (e.g. 1 hour , 1 day, 7 day ) without using date_trunc()?

I have a postgres table "Generation" with half-hourly timestamps spanning 2009 - present with energy data:
I need to aggregate (average) the data across different intervals from specific timepoints, for example data from 2021-01-07T00:00:00.000Z for one year at 7 day intervals, or 3 months at 1 day interval or 7 days at 1h interval etc. date_trunc() partly solves this, but rounds the weeks to the nearest monday e.g.
SELECT date_trunc('week', "DATETIME") AS week,
count(*),
AVG("GAS") AS gas,
AVG("COAL") AS coal
FROM "Generation"
WHERE "DATETIME" >= '2021-01-07T00:00:00.000Z' AND "DATETIME" <= '2022-01-06T23:59:59.999Z'
GROUP BY week
ORDER BY week ASC
;
returns the first time series interval as 2021-01-04 with an incorrect count:
week count gas coal
"2021-01-04 00:00:00" 192 18291.34375 2321.4427083333335
"2021-01-11 00:00:00" 336 14477.407738095239 2027.547619047619
"2021-01-18 00:00:00" 336 13947.044642857143 1152.047619047619
****EDIT: the following will return the correct weekly intervals by checking the start date relative to the nearest monday / start of week, and adjusts the results accordingly:
WITH vars1 AS (
SELECT '2021-01-07T00:00:00.000Z'::timestamp as start_time,
'2021-01-28T00:00:00.000Z'::timestamp as end_time
),
vars2 AS (
SELECT
((select start_time from vars1)::date - (date_trunc('week', (select start_time from vars1)::timestamp))::date) as diff
)
SELECT date_trunc('week', "DATETIME" - ((select diff from vars2) || ' day')::interval)::date + ((select diff from vars2) || ' day')::interval AS week,
count(*),
AVG("GAS") AS gas,
AVG("COAL") AS coal
FROM "Generation"
WHERE "DATETIME" >= (select start_time from vars1) AND "DATETIME" < (select end_time from vars1)
GROUP BY week
ORDER BY week ASC
returns..
week count gas coal
"2021-01-07 00:00:00" 336 17242.752976190477 2293.8541666666665
"2021-01-14 00:00:00" 336 13481.497023809523 1483.0565476190477
"2021-01-21 00:00:00" 336 15278.854166666666 1592.7916666666667
And then for any daily or hourly (swap out day with hour) intervals you can use the following:
SELECT date_trunc('day', "DATETIME") AS day,
count(*),
AVG("GAS") AS gas,
AVG("COAL") AS coal
FROM "Generation"
WHERE "DATETIME" >= '2022-01-07T00:00:00.000Z' AND "DATETIME" < '2022-01-10T23:59:59.999Z'
GROUP BY day
ORDER BY day ASC
;
In order to select the complete week, you should change the WHERe-clause to something like:
WHERE "DATETIME" >= date_trunc('week','2021-01-07T00:00:00.000Z'::timestamp)
AND "DATETIME" < (date_trunc('week','2022-01-06T23:59:59.999Z'::timestamp) + interval '7' day)::date
This will effectively get the records from January 4,2021 until (and including ) January 9,2022
Note: I changed <= to < to stop the end-date being included!
EDIT:
when you want your weeks to start on January 7, you can always group by:
(date_part('day',(d-'2021-01-07'))::int-(date_part('day',(d-'2021-01-07'))::int % 7))/7
(where d is the column containing the datetime-value.)
see: dbfiddle
EDIT:
This will get the list from a given date, and a specified interval.
see DBFIFFLE
WITH vars AS (
SELECT
'2021-01-07T00:00:00.000Z'::timestamp AS qstart,
'2022-01-06T23:59:59.999Z'::timestamp AS qend,
7 as qint,
INTERVAL '1 DAY' as qinterval
)
SELECT
(select date(qstart) FROM vars) + (SELECT qinterval from vars) * ((date_part('day',("DATETIME"-(select date(qstart) FROM vars)))::int-(date_part('day',("DATETIME"-(select date(qstart) FROM vars)))::int % (SELECT qint FROM vars)))::int) AS week,
count(*),
AVG("GAS") AS gas,
AVG("COAL") AS coal
FROM "Generation"
WHERE "DATETIME" >= (SELECT qstart FROM vars) AND "DATETIME" <= (SELECT qend FROM vars)
GROUP BY week
ORDER BY week
;
I added the WITH vars to do the variable stuff on top and no need to mess with the rest of the query. (Idea borrowed here)
I only tested with qint=7,qinterval='1 DAY' and qint=14,qinterval='1 DAY' (but others values should work too...)
Using the function EXTRACT you may calculate the difference in days, weeks and hours between your timestamp ts and the start_date as follows
Difference in Days
extract (day from ts - start_date)
Difference in Weeks
Is the difference in day divided by 7 and truncated
trunc(extract (day from ts - start_date)/7)
Difference in Hours
Is the difference in day times 24 + the difference in hours of the day
extract (day from ts - start_date)*24 + extract (hour from ts - start_date)
The difference can be used in GROUP BY directly. E.g. for week grouping the first group is difference 0, i.e. same week, the next group with difference 1, the next week, etc.
Sample Example
I'm using a CTE for the start date to avoid multpile copies of the paramater
with start_time as
(select DATE'2021-01-07' as start_ts),
prep as (
select
ts,
extract (day from ts - (select start_ts from start_time)) day_diff,
trunc(extract (day from ts - (select start_ts from start_time))/7) week_diff,
extract (day from ts - (select start_ts from start_time)) *24 + extract (hour from ts - (select start_ts from start_time)) hour_diff,
value
from test_table
where ts >= (select start_ts from start_time)
)
select week_diff, avg(value)
from prep
group by week_diff order by 1

How to get the percentage change from same time 7 days ago?

I have a big PostgreSQL database with time series data.
I query the data with a resample to one hour. What I want is to compare the the mean value from the last hour to the value 7 days ago at the same time and I don't know how to do it.
This is what I use to get the latest value.
SELECT DATE_TRUNC('hour', datetime) AS time, AVG(value) as value, id FROM database
GROUP BY id, time
WHERE datetime > now()- '01:00:00'::interval
You can use a CTE to calculate last week's average in the same time period, then join on id and hour.
with last_week as
(
SELECT
id,
extract(hour from datetime) as time,
avg(value) as avg_value
FROM my_table
where DATE_TRUNC('hour', datetime) =
(date_trunc('hour', now() - interval '7 DAYS'))
group by 1,2
)
select n.id,
DATE_TRUNC('hour', n.datetime) AS time_now,
avg(n.value) as avg_now,
t.avg_value as avg_last_week
from my_table n
left join last_week t
on t.id = n.id
and t.time = extract(hour from n.datetime)
where datetime > now()- '01:00:00'::interval
group by 1,2,4
order by 1
I'm making a few assumptions on how your data appear.
**EDIT - JUST NOTICED YOU ASKED FOR PERCENT CHANGE
Showing change as decimal...
select id,
extract(hour from time_now) as hour_now,
avg_now,
avg_last_week,
coalesce(((avg_now - avg_last_week) / avg_last_week), 0) AS CHANGE
from (
with last_week as
(
SELECT
id,
extract(hour from datetime) as time,
avg(value) as avg_value
FROM my_table
where DATE_TRUNC('hour', datetime) =
(date_trunc('hour', now() - interval '7 DAYS'))
group by 1,2
)
select n.id,
DATE_TRUNC('hour', n.datetime) AS time_now,
avg(n.value) as avg_now,
t.avg_value as avg_last_week
from my_table n
left join last_week t
on t.id = n.id
and t.time = extract(hour from n.datetime)
where datetime > now()- '01:00:00'::interval
group by 1,2,4
)z
group by 1,2,3,4
order by 1,2
db-fiddle found here: https://www.db-fiddle.com/f/rWJATypGzHPZ8sG2vXAGXC/4

Syntax error in redshift sql query with subqueries

I'm quite new in SQL in general and haven't deal with redshift before. I'm trying to make one query, which works perfectly in postgresql. But I get syntax error in redshift. The query is:
SELECT
test.table_1.user_id as user_id,
test.table_1.timestamp as start_session,
test.table_1.step_3 :: timestamp + interval '1 hour' as end_session,
test.table_1.step_3 :: timestamp + interval '1 hour' - test.table_1.timestamp :: timestamp as session_duration
FROM (SELECT *,
min(case when page = 'second_page' then timestamp end) OVER (partition by user_id order by timestamp desc rows between unbounded preceding and unbounded following) as step_2,
min(case when page = 'third_page' then timestamp end) OVER (partition by user_id order by timestamp desc rows between unbounded preceding and unbounded following) as step_3
FROM test.table_1) test.table_1
WHERE
test.table_1.page = 'first_page' AND
step_2 > test.table_1.timestamp AND
step_3 > step_2 AND
step_3 :: timestamp - step_2 :: timestamp < '1 hour' AND
step_2 :: timestamp - test.table_1.timestamp :: timestamp < '1 hour'
ORDER BY
user_id,start_session
The error is Error running query: syntax error at or near "." LINE 11: FROM test.vimbox_pages) test.vimbox_pages ^ in line FROM test.table_1) test.table_1
I don't understand what's wrong there.
By this query I'm trying to get session list of users actions during reading pages in some order.
Will be thankful for any help!
Aliases are identifiers and need to follow the rules for identifiers. You can also simplify your query in other ways:
SELECT t.user_id, t.timestamp as start_session,
(t.step_3::timestamp + interval '1 hour' as end_session),
(t.step_3::timestamp + interval '1 hour' - t.timestamp::timestamp) as session_duration
FROM (SELECT t.*,
MIN(CASE WHEN page = 'second_page' THEN timestamp END) OVER (PARTITION BY user_id) as step_2,
MIN(CASE WHEN page = 'third_page' THEN timestamp END) OVER (partition by user_id) as step_3
FROM test.table_1 t
) t
WHERE t.page = 'first_page' AND
step_2 > t.timestamp AND
step_3 > step_2 AND
step_3::timestamp < step_2::timestamp + interval '1 hour' AND
step_2::timestamp < timestamp + interval '1 hour'
ORDER BY user_id, start_session;
Notes:
Your windowing clause is unnecessarily complex. No ORDER BY is necessary if you want the entire window range.
The conversions to timestamp should be unnecessary, given the names of the columns. But I have left them in.
t.user_id as user_id is redundant. The column name is going to be user_id anyway.
I don't ever see spaces around ::. Of course they are allowed, but the type conversion has very high precedence and is typically written without spaces.
I prefer timestamp comparisons to timestamps, rather than converting to intervals. Strange things can happen with intervals.

How to get average of groups by making two groups per day (daytime/nighttime)?

I have a table named thermo storing two different temperatures per 30 minutes
(dt=time | Ti=internal temperature | To=external temperature)
I would like to get alternately the daytime and nighttime average.
This could be by grouping hours 06:00-17:59 and 18:00-05:59
The best I could do is to have grouped 00:00-11:59 and 12:00-23:59 by the following code:
SELECT CAST(strftime('%m%d', dt) AS TIME) || CAST(strftime('%H', dt)/12 AS TIME) AS time,
round(avg(Ti), 1) AS Ti,
round(avg(To), 1) AS To,
FROM thermo WHERE dt > datetime(CURRENT_TIMESTAMP, 'localtime', '-10 days')
GROUP BY time ORDER BY time;
Is there a way to have shifted the time groups?
You can use a CASE to determine the period of interest:
SELECT CAST(strftime('%m%d', dt) AS TIME) as dt,
(CASE WHEN strftime('%H', dt) BETWEEN 6 AND 17 THEN 'daytime'
ELSE 'nightime'
END) as period,
round(avg(Ti), 1) as Ti,
round(avg(To), 1) as To
FROM thermo
WHERE dt > datetime(CURRENT_TIMESTAMP, 'localtime', '-10 days')
GROUP BY dt, period
ORDER BY time;

Difference of datetime column in SQL

I have a table of 20000 records. each Record has a datetime field. I want to select all records where gap between one record and subsequent record is more than one hour [condition to be applied on datetime field].
can any one give me the SQL command code for this purpose.
regards
KAM
ANSI SQL supports the lead() function. However, date/time functions vary by database. The following is the logic you want, although the exact syntax varies, depending on the database:
select t.*
from (select t.*,
lead(datetimefield) over (order by datetimefield) as next_datetimefield
from t
) t
where datetimefield + interval '1 hour' < next_datetimefield;
Note: In Teradata, the where would be:
where datetimefield + interval '1' hour < next_datetimefield;
This can also be done with a sub query, which should work on all DBMS. As gordon said, date/time functions are different in every one.
SELECT t.* FROM YourTable t
WHERE t.DateCol + interval '1 hour' < (SELECT min(s.DateCol) FROM YourTable s
WHERE t.ID = s.ID AND s.DateCol > t.DateCol)
You can replace this:
t.DateCol + interval '1 hour'
With one of this so it will work on almost every DBMS:
DATE_ADD( t.DateCol, INTERVAL 1 hour)
DATEADD(hour,1,t.DateCol)
Although Teradata doesn't support Standard SQL's LEAD it's easy to rewrite:
select tab.*,
min(ts) over (order by ts rows between 1 following and 1 following) as next_ts
from tab
qualify
ts < next_ts - interval '1' hour
If you don't need to show the next timestamp:
select *
from tab
qualify
ts < min(ts) over (order by ts rows between 1 following and 1 following) - interval '1' hour
QUALIFY is a Teradata extension, but really nice to have, similar to HAVING after GROUP BY