postgres select aggregate timespans - sql

I have a table with the following structure:
timstamp-start, timestamp-stop
1,5
6,10
25,30
31,35
...
i am only interested in continuous timespans e.g. the break between a timestamp-end and the following timestamp-start is less than 3.
How could I get the aggregated covered timespans as a result:
timestamp-start,timestamp-stop
1,10
25,35
The reason I am considering this is because a user may request a timespan that would need to return several thousand rows. However, most records are continous and using above method could potentially reduce many thousand of rows down to just a dozen. Or is the added computation not worth the savings in bandwith and latency?

You can group the time stamps in three steps:
Add a flag to determine where a new period starts (that is, a gap greater than 3).
Cumulatively sum the flag to assign groupings.
Re-aggregate with the new groupings.
The code looks like:
select min(ts_start) as ts_start, max(ts_end) as ts_end
from (select t.*,
sum(flag) over (order by ts_start) as grouping
from (select t.*,
(coalesce(ts_start - lag(ts_end) over (order by ts_start),0) > 3)::int as flag
from t
) t
) t
group by grouping;

Related

SQL: Apply an aggregate result per day using window functions

Consider a time-series table that contains three fields time of type timestamptz, balance of type numeric, and is_spent_column of type text.
The following query generates a valid result for the last day of the given interval.
SELECT
MAX(DATE_TRUNC('DAY', (time))) as last_day,
SUM(balance) FILTER ( WHERE is_spent_column is NULL ) AS value_at_last_day
FROM tbl
2010-07-12 18681.800775017498741407984000
However, I am in need of an equivalent query based on window functions to report the total value of the column named balance for all the days up to and including the given date .
Here is what I've tried so far, but without any valid result:
SELECT
DATE_TRUNC('DAY', (time)) AS daily,
SUM(sum(balance) FILTER ( WHERE is_spent_column is NULL ) ) OVER ( ORDER BY DATE_TRUNC('DAY', (time)) ) AS total_value_per_day
FROM tbl
group by 1
order by 1 desc
2010-07-12 16050.496339044977568391974000
2010-07-11 13103.159119670350269890284000
2010-07-10 12594.525752964512456914454000
2010-07-09 12380.159588711091681327014000
2010-07-08 12178.119542536668113577014000
2010-07-07 11995.943973804127033140014000
EDIT:
Here is a sample dataset:
LINK REMOVED
The running total can be computed by applying the first query above on the entire dataset up to and including the desired day. For example, for day 2009-01-31, the result is 97.13522530000000000000, or for day 2009-01-15 when we filter time as time < '2009-01-16 00:00:00' it returns 24.446144000000000000.
What I need is an alternative query that computes the running total for each day in a single query.
EDIT 2:
Thank you all so very much for your participation and support.
The reason for differences in result sets of the queries was on the preceding ETL pipelines. Sorry for my ignorance!
Below I've provided a sample schema to test the queries.
https://www.db-fiddle.com/f/veUiRauLs23s3WUfXQu3WE/2
Now both queries given above and the query given in the answer below return the same result.
Consider calculating running total via window function after aggregating data to day level. And since you aggregate with a single condition, FILTER condition can be converted to basic WHERE:
SELECT daily,
SUM(total_balance) OVER (ORDER BY daily) AS total_value_per_day
FROM (
SELECT
DATE_TRUNC('DAY', (time)) AS daily,
SUM(balance) AS total_balance
FROM tbl
WHERE is_spent_column IS NULL
GROUP BY 1
) AS daily_agg
ORDER BY daily

Azure Stream ANalytics - Find Most Recent `n` Events Within Time Interval

I am working with Azure Stream Analytics and, to illustrate my situation, I have streaming events corresponding to buy(+)/sell(-) orders from users of a certain amount. So, key fields in an individual event look like: {UserId: 'u12345', Type: 'Buy', Amt: 14.0}.
I want to write a query which outputs UserId's and the sum of Amt for the most recent (up to) 5 events within a sliding 24 hr period partitioned by UserId.
To clarify:
If there are more than 5 events for a given UserId in the last 24 hours, I only want the sum of Amt for the most recent 5.
If there are fewer than 5 events, I either want the UserId to be omitted or the sum of the Amt of the events that do exist.
I've tried looking at LIMIT DURATION predicates, but there doesn't seem to be a way to limit the number of events as well as filter on time while PARTITION'ing by UserId. Has anyone done something like this?
Considering the comments, I think this should work:
WITH Last5 AS (
SELECT
UserId,
System.Timestamp() AS windowEnd,
COLLECTTOP(5) OVER (ORDER BY CAST(EventEnqueuedUtcTime AS DATETIME) DESC) AS Top5
FROM input1
TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY
SlidingWindow(hour,24),
UserId
HAVING COUNT(*) >= 5 --We want at least 5
)
SELECT
L.UserId,
System.Timestamp() AS ts,
SUM(C.ArrayValue.value.Amt) AS sumAmt
INTO myOutput
FROM Last5 AS L
CROSS APPLY GetArrayElements(L.Top5) AS C
GROUP BY
System.Timestamp(), --Snapshot window
L.UserId
We use a CTE to first get the sliding window of 24h. In there we both filter to only retain windows of more than 5 records (HAVING COUNT(*) > 5), and collect only the last 5 of them (COLLECTOP(5) OVER...). Note that I had to TIMESTAMP BY and CAST on my own timestamp when testing the query, you may not need that in your case.
Next we need to unpack the collected records, that's done via CROSS APPLY GetArrayElements, and sum them. I use a snapshot window for that, as I don't need time grouping on that one.
Please let me know if you need more details.

How to compare the value of one row with the upper row in one column of an ordered table?

I have a table in PostgreSQL that contains the GPS points from cell phones. It has an integer column that stores epoch (the number of seconds from 1960). I want to order the table based on time (epoch column), then, break the trips to sub trips when there is no GPS record for more than 2 minutes.
I did it with GeoPandas. However, it is too slow. I want to do it inside the PostgreSQL. How can I compare each row of the ordered table with the previous row (to see if the epoch has a difference of 2 minutes or more)?
In fact, I do not know how to compare each row with the upper row.
You can use lag():
select t.*
from (select t.*,
lag(timestamp_epoch) over (partition by trip order by timestamp_epoch) as last_timestamp_epoch
from t
) t
where last_timestamp_epoch < timestamp_epoch - 120
I want to order the table based on time (epoch column), then, break the trips to sub trips when there is no GPS record for more than 2 minutes.
After comparing to the previous (or next) row, with the window function lag() (or lead()), form groups based on the gaps to get sub trip numbers:
SELECT *, count(*) FILTER (WHERE step) OVER (PARTITION BY trip ORDER BY timestamp_epoch) AS sub_trip
FROM (
SELECT *
, (timestamp_epoch - lag(timestamp_epoch) OVER (PARTITION BY trip ORDER BY timestamp_epoch)) > 120 AS step
FROM tbl
) sub;
Further reading:
Select longest continuous sequence

How to group timestamps into islands (based on arbitrary gap)?

Consider this list of dates as timestamptz:
I grouped the dates by hand using colors: every group is separated from the next by a gap of at least 2 minutes.
I'm trying to measure how much a given user studied, by looking at when they performed an action (the data is when they finished studying a sentence.) e.g.: on the yellow block, I'd consider the user studied in one sitting, from 14:24 till 14:27, or roughly 3 minutes in a row.
I see how I could group these dates with a programming language by going through all of the dates and looking for the gap between two rows.
My question is: how would go about grouping dates in this way with Postgres?
(Looking for 'gaps' on Google or SO brings too many irrelevant results; I think I'm missing the vocabulary for what I'm trying to do here.)
SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS grp
FROM (
SELECT done
, lag(done) OVER (ORDER BY done) <= done - interval '2 min' AS step
FROM tbl
) sub
ORDER BY done;
The subquery sub returns step = true if the previous row is at least 2 min away - sorted by the timestamp column done itself in this case.
The outer query adds a rolling count of steps, effectively the group number (grp) - combining the aggregate FILTER clause with another window function.
fiddle
Related:
Query to find all timestamps more than a certain interval apart
How to label groups in postgresql when group belonging depends on the preceding line?
Select longest continuous sequence
Grouping or Window
About the aggregate FILTER clause:
Aggregate columns with additional (distinct) filters
Conditional lead/lag function PostgreSQL?
Building up on Erwin's answer, here is the full query for tallying up the amount of time people spent on those sessions/islands:
My data only shows when people finished reviewing something, not when they started, which means we don't know when a session truly started; and some islands only have one timestamp in them (leading to a 0-duration estimate.) I'm accounting for both by calculating the average review time and adding it to the total duration of islands.
This is likely very idiosyncratic to my use case, but I learned a thing or two in the process, so maybe this will help someone down the line.
-- Returns estimated total study time and average time per review, both in seconds
SELECT (EXTRACT( EPOCH FROM logged) + countofislands * avgreviewtime) as totalstudytime, avgreviewtime -- add total logged time to estimate for first-review-in-island and 1-review islands
FROM
(
SELECT -- get the three key values that will let us calculate total time spent
sum(duration) as logged
, count(island) as countofislands
, EXTRACT( EPOCH FROM sum(duration) FILTER (WHERE duration != '00:00:00'::interval) )/( sum(reviews) FILTER (WHERE duration != '00:00:00'::interval) - count(reviews) FILTER (WHERE duration != '00:00:00'::interval)) as avgreviewtime
FROM
(
SELECT island, age( max(done), min(done) ) as duration, count(island) as reviews -- calculate the duration of islands
FROM
(
SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS island -- give a unique number to each island
FROM (
SELECT -- detect the beginning of islands
done,
(
lag(done) OVER (ORDER BY done) <= done - interval '2 min'
) AS step
FROM review
WHERE clicker_id = 71 AND "done" > '2015-05-13' AND "done" < '2015-05-13 15:00:00' -- keep the queries small and fast for now
) sub
ORDER BY done
) grouped
GROUP BY island
) sessions
) summary

BigQuery - counting number of events within a sliding time frame

I would like to count the number of events within a sliding time frame.
For example, say I would like to know how many bids were in the last 1000 seconds for the Google stock (GOOG).
I'm trying the following query:
SELECT
symbol,
start_date,
start_time,
bid_price,
count(if(max(start_time)-start_time<1000,1,null)) over (partition by symbol order by start_time asc) cnt
FROM [bigquery-samples:nasdaq_stock_quotes.quotes]
where symbol = 'GOOG'
The logic is as follow: the partition window (by symbol) is ordered with the bid time (leaving alone the bid date for sake of simplicity).
For each window (defined by the row at the "head" of the window) I would like to count the number of rows which have start_time that is less than 1000 seconds than the "head" row time.
I'm trying to use max(start_time) to get the top row in the window. This doesn't seem to work and I get an error:
Error: MAX is an analytic function and must be accompanied by an OVER clause.
Is it possible to have two an analytic functions in one column (both count and max in this case)? Is there a different solution to the problem presented?
Try using the range function.
SELECT
symbol,
start_date,
start_time,
bid_price,
count(market_center) over (partition by symbol order by start_time RANGE 1000 PRECEDING) cnt
FROM [bigquery-samples:nasdaq_stock_quotes.quotes]
where symbol = 'GOOG'
order by 2, 3
I used market_center just as a counter, additional fields can be used as well.
Note: the RANGE function is not documented in BigQuery Query Reference, however it's a standard SQL function which appears to work in this case