How to group timestamps into islands (based on arbitrary gap)? - sql

Consider this list of dates as timestamptz:
I grouped the dates by hand using colors: every group is separated from the next by a gap of at least 2 minutes.
I'm trying to measure how much a given user studied by looking at when they performed an action (the data records when they finished studying a sentence). For example, in the yellow block I'd consider the user studied in one sitting, from 14:24 till 14:27, or roughly 3 minutes in a row.
I see how I could group these dates with a programming language by going through all of the dates and looking for the gap between two rows.
My question is: how would one go about grouping dates in this way with Postgres?
(Looking for 'gaps' on Google or SO brings too many irrelevant results; I think I'm missing the vocabulary for what I'm trying to do here.)

SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS grp
FROM  (
   SELECT done
        , lag(done) OVER (ORDER BY done) <= done - interval '2 min' AS step
   FROM   tbl
   ) sub
ORDER  BY done;
The subquery sub returns step = true if the previous row is at least 2 min away - sorted by the timestamp column done itself in this case.
The outer query adds a rolling count of steps, effectively the group number (grp) - combining the aggregate FILTER clause with another window function.
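To report one row per island instead of one row per timestamp, the numbered result can then be collapsed with a plain GROUP BY. A minimal sketch, assuming the same tbl and done column as above:

SELECT grp
     , min(done) AS island_start
     , max(done) AS island_end
     , max(done) - min(done) AS duration   -- elapsed time within the island
FROM  (
   SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS grp
   FROM  (
      SELECT done
           , lag(done) OVER (ORDER BY done) <= done - interval '2 min' AS step
      FROM   tbl
      ) sub
   ) numbered
GROUP  BY grp
ORDER  BY grp;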
Related:
Query to find all timestamps more than a certain interval apart
How to label groups in postgresql when group belonging depends on the preceding line?
Select longest continuous sequence
Grouping or Window
About the aggregate FILTER clause:
Aggregate columns with additional (distinct) filters
Conditional lead/lag function PostgreSQL?

Building on Erwin's answer, here is the full query for tallying up the amount of time people spent on those sessions/islands:
My data only shows when people finished reviewing something, not when they started, which means we don't know when a session truly started; and some islands contain only one timestamp (leading to a 0-duration estimate). I'm accounting for both by calculating the average review time and adding it to the total duration of the islands (an island with n reviews only spans n - 1 measured gaps, hence the reviews-minus-islands denominator in the query below).
This is likely very idiosyncratic to my use case, but I learned a thing or two in the process, so maybe this will help someone down the line.
-- Returns estimated total study time and average time per review, both in seconds
SELECT (EXTRACT(EPOCH FROM logged) + countofislands * avgreviewtime) AS totalstudytime  -- add total logged time to estimate for first-review-in-island and 1-review islands
     , avgreviewtime
FROM (
   SELECT  -- get the three key values that will let us calculate total time spent
          sum(duration) AS logged
        , count(island) AS countofislands
        , EXTRACT(EPOCH FROM sum(duration) FILTER (WHERE duration != '00:00:00'::interval))
          / (sum(reviews)   FILTER (WHERE duration != '00:00:00'::interval)
           - count(reviews) FILTER (WHERE duration != '00:00:00'::interval)) AS avgreviewtime
   FROM (
      SELECT island
           , age(max(done), min(done)) AS duration  -- calculate the duration of islands
           , count(island) AS reviews
      FROM (
         SELECT done
              , count(*) FILTER (WHERE step) OVER (ORDER BY done) AS island  -- give a unique number to each island
         FROM (
            SELECT done  -- detect the beginning of islands
                 , lag(done) OVER (ORDER BY done) <= done - interval '2 min' AS step
            FROM   review
            WHERE  clicker_id = 71
            AND    "done" > '2015-05-13' AND "done" < '2015-05-13 15:00:00'  -- keep the queries small and fast for now
            ) sub
         ORDER BY done
         ) grouped
      GROUP BY island
      ) sessions
   ) summary

Related

SQL: Apply an aggregate result per day using window functions

Consider a time-series table that contains three fields: time of type timestamptz, balance of type numeric, and is_spent_column of type text.
The following query generates a valid result for the last day of the given interval.
SELECT
MAX(DATE_TRUNC('DAY', (time))) as last_day,
SUM(balance) FILTER ( WHERE is_spent_column is NULL ) AS value_at_last_day
FROM tbl
2010-07-12 18681.800775017498741407984000
However, I am in need of an equivalent query, based on window functions, that reports the total value of the balance column for all the days up to and including the given date.
Here is what I've tried so far, but without any valid result:
SELECT
DATE_TRUNC('DAY', (time)) AS daily,
SUM(sum(balance) FILTER ( WHERE is_spent_column is NULL ) ) OVER ( ORDER BY DATE_TRUNC('DAY', (time)) ) AS total_value_per_day
FROM tbl
group by 1
order by 1 desc
2010-07-12 16050.496339044977568391974000
2010-07-11 13103.159119670350269890284000
2010-07-10 12594.525752964512456914454000
2010-07-09 12380.159588711091681327014000
2010-07-08 12178.119542536668113577014000
2010-07-07 11995.943973804127033140014000
EDIT:
Here is a sample dataset:
LINK REMOVED
The running total can be computed by applying the first query above on the entire dataset up to and including the desired day. For example, for day 2009-01-31, the result is 97.13522530000000000000, or for day 2009-01-15 when we filter time as time < '2009-01-16 00:00:00' it returns 24.446144000000000000.
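For the 2009-01-15 example, that means restricting the first query with a time filter, e.g.:

SELECT MAX(DATE_TRUNC('DAY', time)) AS last_day,
       SUM(balance) FILTER (WHERE is_spent_column IS NULL) AS value_at_last_day
FROM   tbl
WHERE  time < '2009-01-16 00:00:00';   -- running total as of 2009-01-15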
What I need is an alternative query that computes the running total for each day in a single query.
EDIT 2:
Thank you all so very much for your participation and support.
The differences between the queries' result sets were caused by the preceding ETL pipelines. Sorry for my ignorance!
Below I've provided a sample schema to test the queries.
https://www.db-fiddle.com/f/veUiRauLs23s3WUfXQu3WE/2
Now both queries given above and the query given in the answer below return the same result.
Consider calculating the running total via a window function after aggregating the data to day level. Since you aggregate with a single condition, the FILTER clause can be converted to a plain WHERE:
SELECT daily,
SUM(total_balance) OVER (ORDER BY daily) AS total_value_per_day
FROM (
SELECT
DATE_TRUNC('DAY', (time)) AS daily,
SUM(balance) AS total_balance
FROM tbl
WHERE is_spent_column IS NULL
GROUP BY 1
) AS daily_agg
ORDER BY daily
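If other per-day aggregates without that condition are ever needed alongside, the FILTER form could equally be kept inside the subquery; note this variant also keeps days that have no matching rows (with a NULL day total, which the running SUM simply skips). A sketch:

SELECT daily,
       SUM(total_balance) OVER (ORDER BY daily) AS total_value_per_day
FROM (
    SELECT DATE_TRUNC('DAY', time) AS daily,
           SUM(balance) FILTER (WHERE is_spent_column IS NULL) AS total_balance
    FROM   tbl
    GROUP  BY 1
) AS daily_agg
ORDER BY daily;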

Grafana, postgresql: aggregate function calls cannot contain window function calls

In Grafana, we want to show bars indicating the maximum of 15-minute averages in the chosen time interval. Our data has regular 1-minute intervals. The database is PostgreSQL.
To show the 15-minute averages, we use the following query:
SELECT
timestamp AS time,
AVG(rawvalue) OVER(ORDER BY timestamp ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING) AS value,
'15-min Average' AS metric
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
ORDER BY time
To show bars indicating the maximum of raw values in the chosen time interval, we use the following query:
SELECT
$__timeGroup(timestamp,'$INTERVAL') AS time,
MAX(rawvalue) AS value,
'Interval Max' AS metric
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
GROUP BY $__timeGroup(timestamp,'$INTERVAL')
ORDER BY time
A naive combination of both solutions does not work:
SELECT
$__timeGroup(timestamp,'$INTERVAL') AS time,
MAX(AVG(rawvalue) OVER(ORDER BY timestamp ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING)) AS value,
'Interval Max 15-min Average' AS metric
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
GROUP BY $__timeGroup(timestamp,'$INTERVAL')
ORDER BY time
We get error: "pq: aggregate function calls cannot contain window function calls".
There is a suggestion on SO to use "with" (Count by criteria over partition), but I do not know how to use it in our case.
Use the first query as a CTE (or WITH) for the second one. The ORDER BY clause of the CTE, the WHERE clause of the second query, and the metric column of the CTE are no longer needed. Alternatively, you can use the first query as a derived table in the FROM clause of the second one.
with t as
(
SELECT
timestamp AS time,
AVG(rawvalue) OVER(ORDER BY timestamp ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING) AS value
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
)
SELECT
$__timeGroup(time,'$INTERVAL') AS time,
MAX(value) AS value,
'Interval Max 15-min Average' AS metric
FROM t
GROUP BY 1 ORDER BY 1;
Unrelated, but what are $__timeFilter and $__timeGroup? Their semantics are clear, but where do they come from? BTW you may find this function useful.

SQL question: count of occurrence greater than N in any given hour

I'm looking through login logs (in Netezza) and trying to find users who have greater than a certain number of logins in any 1 hour time period (any consecutive 60 minute period, as opposed to strictly a clock hour) since December 1st. I've viewed the following posts, but most seem to address searching within a specific time range, not ANY given time period. Thanks.
https://dba.stackexchange.com/questions/137660/counting-number-of-occurences-in-a-time-period
https://dba.stackexchange.com/questions/67881/calculating-the-maximum-seen-so-far-for-each-point-in-time
Count records per hour within a time span
You could use the analytic function lag to look back in a sorted sequence of time stamps to see whether the record that came 19 entries earlier is within an hour difference:
with cte as (
select user_id,
login_time,
lag(login_time, 19) over (partition by user_id order by login_time) as lag_time
from userlog
order by user_id,
login_time
)
select user_id,
min(login_time) as login_time
from cte
where extract(epoch from (login_time - lag_time)) < 3600
group by user_id
The output will show the matching users with the first occurrence when they logged a twentieth time within an hour.
I think you could do something like this (for the sake of simplicity, I'll use a login table with just user and datetime columns):
with connections as (
select ua.user
, ua.datetime
from user_logons ua
where ua.datetime >= timestamp'2018-12-01 00:00:00'
)
select ua.user
, ua.datetime
, (select count(*)
from connections ut
where ut.user = ua.user
and ut.datetime between ua.datetime and (ua.datetime + 1 hour)
) as consecutive_logons
from connections ua
It is up to you to substitute your actual columns (user, datetime).
It is up to you to find the date-add facilities (ua.datetime + 1 hour won't work); this is more or less dependent on the DB implementation, for example it is DATE_ADD in MySQL (https://www.w3schools.com/SQl/func_mysql_date_add.asp); a PostgreSQL-flavoured version is sketched after these notes.
Due to the subquery (select count(*) ...), the whole query will not be the fastest, because it is a correlated subquery - it needs to be re-evaluated for each row.
The WITH is simply there to compute a subset of user_logons to minimize its cost. It might not be necessary, but it lessens the complexity of the query.
You might get better performance using a stored function or an application-side (e.g. Java, PHP, ...) function.
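For instance, a minimal PostgreSQL-flavoured sketch of the same idea (user is a reserved word there, so a user_id column is assumed; Netezza's date arithmetic may differ):

with connections as (
    select ua.user_id,
           ua.datetime
    from   user_logons ua
    where  ua.datetime >= timestamp '2018-12-01 00:00:00'
)
select ua.user_id,
       ua.datetime,
       (select count(*)
        from   connections ut
        where  ut.user_id = ua.user_id
          and  ut.datetime between ua.datetime and ua.datetime + interval '1 hour'
       ) as consecutive_logons
from   connections ua;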

Can I speed up this subquery nested PostgreSQL Query

I have the following PostgreSQL code (which works, but slowly) that I'm using to create a materialized view. It is quite slow, and the length of the code seems cumbersome with the multiple sub-queries. Is there any way I can improve the speed at which this code executes, or rewrite it so it's shorter and easier to maintain?
CREATE MATERIALIZED VIEW station_views.obs_10_min_avg_ffdi_powerbi AS
SELECT t.station_num,
initcap(t.station_name) AS station_name,
t.day,
t.month_int,
to_char(to_timestamp(t.month_int::text, 'MM'), 'TMMonth') AS Month,
round(((date_part('year', age(t2.dmax, t2.dmin)) * 12 + date_part('month', age(t2.dmax, t2.dmin))) / 12)::numeric, 1) AS record_years,
round((t2.count_all_vals / t2.max_10_periods * 100)::numeric, 1) AS per_datset,
max(t.avg_bom_fdi) AS max,
avg(t.avg_bom_fdi) AS avg,
percentile_cont(0.95) WITHIN GROUP (ORDER BY t.avg_bom_fdi) AS percentile_cont_95,
percentile_cont(0.99) WITHIN GROUP (ORDER BY t.avg_bom_fdi) AS percentile_cont_99
FROM ( SELECT a.station_num,
d.station_name,
a.ten_minute_intervals_utc,
date_part('day', a.ten_minute_intervals_utc) AS day,
date_part('month', a.ten_minute_intervals_utc) AS month_int,
a.avg_bom_fdi
FROM analysis.obs_10_min_avg_ffdi_bom a,
obs_minute_stn_det d
WHERE d.station_num = a.station_num) t,
( SELECT obs_10_min_avg_ffdi_bom_view.station_num,
obs_10_min_avg_ffdi_bom_view.station_name,
min(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc) AS dmin,
max(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc) AS dmax,
date_part('epoch', max(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc) - min(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc)) / 600 AS max_10_periods,
count(*) AS count_all_vals
FROM analysis.obs_10_min_avg_ffdi_bom_view
GROUP BY obs_10_min_avg_ffdi_bom_view.station_num, obs_10_min_avg_ffdi_bom_view.station_name) t2
WHERE t.station_num = t2.station_num
GROUP BY t.station_num, t.station_name, Month, t.month_int, t.day, record_years, per_datset
ORDER BY t.month_int, t.day
WITH DATA;
The output I get is a row for each weather station (station_num & station_name) along with the day & month that a weather variable is recorded (avg_bom_fdi). The month value is retained and converted to a name for purposes of plotting values averaged per month on the chart. I also pull in the total number of years that recordings exist for that station (record_years) and a percentage of how complete that dataset is (per_datset). These both come from the second subquery (t2). The first subquery (t) is used to average the data per day and return the daily max, average and 95/99th percentiles.
I agree with running the explain plan / execution plan on this query. Also, if it is not needed, remove the ORDER BY. If you see a lot of time being spent fetching a particular value while reviewing the execution plan, try creating an index on that particular column. Depending on the column's cardinality (high or low), you can choose between a B-Tree and a bitmap index.
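In PostgreSQL terms, that first step could look roughly like the sketch below (the index on obs_minute_stn_det.station_num is only an assumption based on the join above; target whichever column the plan shows as expensive):

-- Inspect where the time goes in the expensive join:
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.station_num, d.station_name, a.ten_minute_intervals_utc, a.avg_bom_fdi
FROM   analysis.obs_10_min_avg_ffdi_bom a
JOIN   obs_minute_stn_det d ON d.station_num = a.station_num;

-- If the join column dominates the plan, a plain B-tree index is the usual first step:
CREATE INDEX IF NOT EXISTS obs_minute_stn_det_station_num_idx
    ON obs_minute_stn_det (station_num);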
I think you need to read something about execution plans. It's a good way to understand what is happening with your query.
I recommend the documentation on this problem - LINK

postgres select aggregate timespans

I have a table with the following structure:
timestamp-start, timestamp-stop
1,5
6,10
25,30
31,35
...
I am only interested in continuous timespans, e.g. where the break between a timestamp-stop and the following timestamp-start is less than 3.
How could I get the aggregated covered timespans as a result:
timestamp-start,timestamp-stop
1,10
25,35
The reason I am considering this is that a user may request a timespan that would need to return several thousand rows. However, most records are continuous, and using the above method could potentially reduce many thousands of rows down to just a dozen. Or is the added computation not worth the savings in bandwidth and latency?
You can group the time stamps in three steps:
Add a flag to determine where a new period starts (that is, a gap greater than 3).
Cumulatively sum the flag to assign groupings.
Re-aggregate with the new groupings.
The code looks like:
select min(ts_start) as ts_start, max(ts_end) as ts_end
from (select t.*,
sum(flag) over (order by ts_start) as grouping
from (select t.*,
(coalesce(ts_start - lag(ts_end) over (order by ts_start),0) > 3)::int as flag
from t
) t
) t
group by grouping;
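A quick self-contained check of the above with the sample values from the question (assuming integer start/stop values in a table t with columns ts_start and ts_end):

with t (ts_start, ts_end) as (
    values (1, 5), (6, 10), (25, 30), (31, 35)   -- sample rows from the question
)
select min(ts_start) as ts_start, max(ts_end) as ts_end
from (select t.*,
             sum(flag) over (order by ts_start) as grouping
      from (select t.*,
                   (coalesce(ts_start - lag(ts_end) over (order by ts_start), 0) > 3)::int as flag
            from t
           ) t
     ) t
group by grouping
order by ts_start;
-- expected result: (1, 10) and (25, 35)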