How to compute window function for each nth row in Presto?

I am working with a table that contains timeseries data, with a row for each minute for each user.
I want to compute some aggregate functions on a rolling window of N calendar days.
This is achieved via
SELECT
SOME_AGGREGATE_FUN(col) OVER (
PARTITION BY user_id
ORDER BY timestamp_col
ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
) as my_col
FROM my_table
However, I am only interested in the result of this at a daily scale.
i.e. I want the window to be computed only at 00:00:00, but I want the window itself to contain all the minute-by-minute data to be passed into my aggregate function.
Right now I am doing this:
WITH agg_results AS (
SELECT
user_id,
timestamp_col,
SOME_AGGREGATE_FUN(col) OVER (
PARTITION BY user_id
ORDER BY timestamp_col
ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
) AS my_col
FROM my_table
)
SELECT * FROM agg_results
WHERE
timestamp_col = DATE_TRUNC('day', "timestamp_col")
This works in theory, but it does 60 * 24 times more computations than necessary, which makes the query very slow.
Essentially, I am trying to find a way to make the right window bound skip rows based on a condition. If it is simpler to implement, computing the window only for every nth row would also work, as I have a constant number of rows for each day.

I don't think that's possible with window functions. You could switch to a subquery instead, assuming that your aggregate function works as a regular aggregate function too (that is, without an OVER() clause):
select
timestamp_col,
(
select some_aggregate_fun(t1.col)
from my_table t1
where
t1.user_id = t.user_id
and t1.timestamp_col >= t.timestamp_col - interval '1' day
and t1.timestamp_col <= t.timestamp_col
)
from my_table t
where timestamp_col = date_trunc('day', timestamp_col)
I am not sure that this would perform better than your original query, though; you would need to assess that against your actual dataset.
You can change interval '1' day to the actual interval you want to use.
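To make the subquery approach concrete, here is a minimal sketch that runs it against SQLite from Python rather than Presto. Everything in it is an assumption for illustration: the sample data, SUM standing in for some_aggregate_fun, the index, and SQLite's datetime() modifiers replacing interval arithmetic and date_trunc. Note that the time-range window only matches the ROWS frame when there is exactly one row per minute with no gaps.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (user_id INTEGER, timestamp_col TEXT, col REAL)"
)

# Hypothetical data: 2 users, one row per minute for 3 days, col = 1.0,
# so SUM(col) over a window equals the number of rows in that window.
start = datetime(2024, 1, 1)
rows = [
    (user_id, (start + timedelta(minutes=m)).strftime("%Y-%m-%d %H:%M:%S"), 1.0)
    for user_id in (1, 2)
    for m in range(3 * 24 * 60)
]
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)

# Assumption: an index like this is not part of the answer, but it keeps
# the correlated subquery from scanning the whole table per midnight row.
conn.execute("CREATE INDEX idx_user_ts ON my_table (user_id, timestamp_col)")

# Only midnight rows reach the outer query; each one aggregates the
# minute-level rows of the preceding day (both bounds inclusive).
daily = conn.execute(
    """
    SELECT t.user_id,
           t.timestamp_col,
           (SELECT SUM(t1.col)
              FROM my_table t1
             WHERE t1.user_id = t.user_id
               AND t1.timestamp_col >= datetime(t.timestamp_col, '-1 day')
               AND t1.timestamp_col <= t.timestamp_col) AS my_col
      FROM my_table t
     WHERE t.timestamp_col = datetime(t.timestamp_col, 'start of day')
     ORDER BY t.user_id, t.timestamp_col
    """
).fetchall()

for row in daily:
    print(row)
```

With this data the first midnight row of each user sees only itself, and every later midnight row sees 1441 rows (1440 preceding plus the current one), which is what ROWS BETWEEN (60 * 24) PRECEDING AND CURRENT ROW would cover.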

Related

Grafana, postgresql: aggregate function calls cannot contain window function calls

In Grafana, we want to show bars indicating the maximum of the 15-minute averages in the chosen time interval. Our data has regular 1-minute intervals. The database is PostgreSQL.
To show the 15-minute averages, we use the following query:
SELECT
timestamp AS time,
AVG(rawvalue) OVER(ORDER BY timestamp ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING) AS value,
'15-min Average' AS metric
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
ORDER BY time
To show bars indicating the maximum of the raw values in the chosen time interval, we use the following query:
SELECT
$__timeGroup(timestamp,'$INTERVAL') AS time,
MAX(rawvalue) AS value,
'Interval Max' AS metric
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
GROUP BY $__timeGroup(timestamp,'$INTERVAL')
ORDER BY time
A naive combination of both solutions does not work:
SELECT
$__timeGroup(timestamp,'$INTERVAL') AS time,
MAX(AVG(rawvalue) OVER(ORDER BY timestamp ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING)) AS value,
'Interval Max 15-min Average' AS metric
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
GROUP BY $__timeGroup(timestamp,'$INTERVAL')
ORDER BY time
We get the error: "pq: aggregate function calls cannot contain window function calls".
There is a suggestion on SO to use "with" (Count by criteria over partition), but I do not know how to use it in our case.

Use the first query as a CTE (or with) for the second one. The order by clause of the CTE and the where clause of the second query as well as the metric column of the CTE are no longer needed. Alternatively you can use the first query as a derived table in the from clause of the second one.
with t as
(
SELECT
timestamp AS time,
AVG(rawvalue) OVER(ORDER BY timestamp ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING) AS value
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
)
SELECT
$__timeGroup(time,'$INTERVAL') AS time,
MAX(value) AS value,
'Interval Max 15-min Average' AS metric
FROM t
GROUP BY 1 ORDER BY 1;
Unrelated, but what are $__timeFilter and $__timeGroup? Their semantics are clear, but where do they come from? BTW you may find this function useful.
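For anyone who wants to try the CTE pattern without Grafana, here is a minimal sketch using SQLite from Python. The table name readings, the integer epoch-second column ts, and the 30-minute bucket expression are all assumptions; the bucket expression is only a stand-in for $__timeGroup, and the $__timeFilter condition is left out entirely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts INTEGER, rawvalue REAL)")

# Hypothetical data: one reading per minute for an hour; ts is seconds
# since an arbitrary origin and rawvalue simply equals the minute index.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(i * 60, float(i)) for i in range(60)],
)

# Step 1 (CTE): the centered 15-row moving average, a window function.
# Step 2 (outer query): a plain aggregate over the CTE's output.
# Nesting the window call directly inside MAX() is what raises the error.
result = conn.execute(
    """
    WITH t AS (
        SELECT ts AS time_sec,
               AVG(rawvalue) OVER (
                   ORDER BY ts
                   ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING
               ) AS value
          FROM readings
    )
    SELECT (time_sec / 1800) * 1800 AS bucket,  -- 30-minute buckets
           MAX(value) AS value
      FROM t
     GROUP BY bucket
     ORDER BY bucket
    """
).fetchall()

print(result)  # [(0, 29.0), (1800, 55.5)]
```

The second bucket peaks at 55.5 rather than 59 because the window shrinks at the end of the data, so the last row averages only the values 52 through 59.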

How to compare time stamps from consecutive rows

I have a table that I would like to sort by a timestamp desc and then compare all consecutive rows to determine the difference between each row. From there, I would like to find all the rows whose difference is greater than ~2hours.
I'm stuck on how to actually compare consecutive rows in a table. Any help would be much appreciated.
I'm using Oracle SQL Developer 3.2

You didn't show us your table definition, but something like this:
select *
from (
select t.*,
t.timestamp_column - lag(timestamp_column) over (order by timestamp_column) as diff
from the_table t
) x
where diff > interval '2' hour;
This assumes that timestamp_column is defined as timestamp not date (otherwise the result of the difference wouldn't be an interval)
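Here is a small runnable sketch of the same lag() idea, using SQLite from Python instead of Oracle. The sample rows are made up, and because SQLite has no interval type the difference is computed in seconds via strftime('%s', ...), which is an assumption of this sketch and not Oracle syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (timestamp_column TEXT)")

# Hypothetical sample rows; the gaps are 1.5h, 2.75h, 0.5h and 3.25h.
conn.executemany(
    "INSERT INTO the_table VALUES (?)",
    [
        ("2024-01-01 08:00:00",),
        ("2024-01-01 09:30:00",),
        ("2024-01-01 12:15:00",),
        ("2024-01-01 12:45:00",),
        ("2024-01-01 16:00:00",),
    ],
)

# lag() fetches the previous row's timestamp in sort order. The first row
# has no predecessor, so its diff is NULL and the WHERE clause drops it.
gaps = conn.execute(
    """
    SELECT timestamp_column, diff_seconds
      FROM (
            SELECT timestamp_column,
                   CAST(strftime('%s', timestamp_column) AS INTEGER)
                   - CAST(strftime('%s',
                          LAG(timestamp_column) OVER (ORDER BY timestamp_column)
                     ) AS INTEGER) AS diff_seconds
              FROM the_table
           ) x
     WHERE diff_seconds > 2 * 60 * 60
     ORDER BY timestamp_column
    """
).fetchall()

print(gaps)  # [('2024-01-01 12:15:00', 9900), ('2024-01-01 16:00:00', 11700)]
```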

examine if one time series column of table has two adjacent time points which have interval larger than certain length

I am preprocessing data in a table that contains a time series column.
Toy example, Table A:
timestamp value
12:30:24 1
12:32:21 3
12:33:21 4
The timestamp column is ordered and always increases.
Is it possible to define a function (or something else) that returns true when the table has two adjacent time points whose interval is larger than a certain length, and false otherwise?
I am using PostgreSQL, thank you.

SQL Fiddle
select bool_or(bigger_than) as bigger_than
from (
select
time - lag(time) over (order by time)
>
interval '1 minute' as bigger_than
from table_a
) s;
bigger_than
-------------
t
bool_or returns true if at least one of the input values is true.
http://www.postgresql.org/docs/current/static/functions-aggregate.html
Your sample data shows a time value, but it works the same for a timestamp.

Something like this:
select count(*) > 0
from (
select timestamp,
lag(timestamp) over (order by timestamp) as prev_ts
from table_a
) t
where timestamp - prev_ts > interval '1' minute;
It calculates the difference between each timestamp and its "previous" timestamp, with the order defined by the timestamp column itself. The outer query then checks whether there is at least one row where the difference is larger than 1 minute.
lag() is a window function. More details on those can be found in the manual:
http://www.postgresql.org/docs/current/static/tutorial-window.html

SQL Average Inter-arrival Time, Time Between Dates

I have a table with sequential timestamps:
2011-03-17 10:31:19
2011-03-17 10:45:49
2011-03-17 10:47:49
...
I need to find the average time difference between each of these (there could be dozens) in seconds or whatever is easiest; I can work with it from there. So for example the above inter-arrival time for only the first two times would be 870 (14m 30s). For all three times it would be: (870 + 120)/2 = 495 (8m 15s).
A note: I am using PostgreSQL 8.1.22.
EDIT: The table I mention above is from a different query that is literally just a one-column list of timestamps

Not sure I understood your question completely, but this might be what you are looking for:
SELECT avg(difference)
FROM (
SELECT timestamp_col - lag(timestamp_col) over (order by timestamp_col) as difference
FROM your_table
) t
The inner query calculates the distance between each row and the preceding row. The result is an interval for each row in the table.
The outer query simply does an average over all differences.
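As a quick check of the lag() approach, here is a minimal sketch that runs it on the three sample timestamps from the question, using SQLite from Python. The table and column names are the placeholders from the answer; converting to seconds with strftime('%s', ...) is an assumption made because SQLite has no interval type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (timestamp_col TEXT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?)",
    [
        ("2011-03-17 10:31:19",),
        ("2011-03-17 10:45:49",),
        ("2011-03-17 10:47:49",),
    ],
)

# The inner query yields NULL, 870 and 120 seconds. AVG() skips the NULL
# produced for the first row, so only the two real gaps are averaged.
avg_seconds = conn.execute(
    """
    SELECT AVG(difference)
      FROM (
            SELECT CAST(strftime('%s', timestamp_col) AS INTEGER)
                   - CAST(strftime('%s',
                          LAG(timestamp_col) OVER (ORDER BY timestamp_col)
                     ) AS INTEGER) AS difference
              FROM your_table
           ) t
    """
).fetchone()[0]

print(avg_seconds)  # 495.0
```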

I think you want to find avg(timestamptz).
My solution is avg(current - min value). Since the result is an interval, add it to the min value again.
SELECT avg(target_col - (select min(target_col) from your_table))
+ (select min(target_col) from your_table)
FROM your_table

If you cannot upgrade to a version of PG that supports window functions, you
may compute your table's sequential steps "the slow way."
Assuming your table is "tbl" and your timestamp column is "ts":
SELECT AVG(t1 - t0)
FROM (
-- All this silliness would be moot if we could use
-- lead(ts) over (order by ts)
SELECT tbl.ts AS t0,
next.ts AS t1
FROM tbl
CROSS JOIN
tbl next
WHERE next.ts = (
SELECT MIN(ts)
FROM tbl subquery
WHERE subquery.ts > tbl.ts
)
) derived;
But don't do that. Its performance will be terrible. Please do what
a_horse_with_no_name suggests, and use window functions.

Postgres SQL select a range of records spaced out by a given interval

I am trying to determine whether it is possible, using only SQL for Postgres, to select a range of time-ordered records at a given interval.
Let's say I have 60 records, one record for each minute in a given hour. I want to select records at 5-minute intervals for that hour. The result should be 12 records, each one 5 minutes apart.
This is currently accomplished by selecting the full range of records, then looping through the results and pulling out the records at the given interval. I am trying to see if I can do this purely in SQL, as our db is large and we may be dealing with tens of thousands of records.
Any thoughts?

Yes you can. It's really easy once you get the hang of it. I think it's one of the jewels of SQL, and it's especially easy in PostgreSQL because of its excellent temporal support. Often, complex functions can turn into very simple queries in SQL that can scale and be indexed properly.
This uses generate_series to draw up sample time stamps that are spaced 1 minute apart. The outer query then extracts the minute and uses modulo to find the values that are 5 minutes apart.
select
ts,
extract(minute from ts)::integer as minute
from
( -- generate some time stamps - one minute apart
select
current_time + (n || ' minute')::interval as ts
from generate_series(1, 30) as n
) as timestamps
-- extract the minute and check whether it is on a 5-minute boundary
where extract(minute from ts)::integer % 5 = 0
-- only pick this hour
and extract(hour from ts) = extract(hour from current_time)
;
ts | minute
--------------------+--------
19:40:53.508836-07 | 40
19:45:53.508836-07 | 45
19:50:53.508836-07 | 50
19:55:53.508836-07 | 55
Note that adding an expression index on the expression used in the where clause (where the value of the expression would make up the index) could lead to major speed improvements. It may not be very selective in this case, but it is good to be aware of.
I wrote a reservation system once in PostgreSQL (which had lots of temporal logic where date intervals could not overlap) and never had to resort to iterative methods.
http://www.amazon.com/SQL-Design-Patterns-Programming-Focus/dp/0977671542 is an excellent book that has lots of interval examples. It is hard to find in bookstores now, but well worth it.

Extract the minutes, convert to int4, and see, if the remainder from dividing by 5 is 0:
select *
from TABLE
where int4 (date_part ('minute', COLUMN)) % 5 = 0;

This works if the intervals are not time based and you just want every 5th row, or if the times are regular and you always have one record per minute.
The query below gives you one record out of every 5:
select *
from
(
select *, row_number() over (order by timecolumn) as rown
from tbl
) X
where mod(rown, 5) = 1
If your time records are not regular, then you need to generate a time series (as shown in another answer), left join it to your table, group by the time column from the series, and pick the MAX time from your table that is less than or equal to that time column.
Pseudocode:
select thetimeinterval, max(timecolumn)
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
And further join it back to the table for the full record (assuming unique times):
select tbl.* from
tbl inner join
(
select thetimeinterval, max(timecolumn) timecolumn
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
) y on tbl.timecolumn = y.timecolumn
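The row_number() variant from the start of this answer can be tried out with a minimal sketch like the following, which uses SQLite from Python. The sample data (one row per minute for an hour) is made up, and the % operator is used in place of mod() as an assumption about what the bundled SQLite build supports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (timecolumn TEXT)")

# Hypothetical data: exactly one record per minute from 07:00 to 07:59.
conn.executemany(
    "INSERT INTO tbl VALUES (?)",
    [(f"2024-01-01 07:{m:02d}:00",) for m in range(60)],
)

# Number the rows in time order, then keep rows 1, 6, 11, ...
# This depends only on row position, not on the timestamp values.
every_fifth = [
    r[0]
    for r in conn.execute(
        """
        SELECT timecolumn
          FROM (
                SELECT timecolumn,
                       ROW_NUMBER() OVER (ORDER BY timecolumn) AS rown
                  FROM tbl
               ) x
         WHERE rown % 5 = 1
         ORDER BY timecolumn
        """
    )
]

print(len(every_fifth), every_fifth[:3])
```

Because the filter is positional, a missing or duplicated minute would shift every later pick, which is why the answer falls back to a generated time series for irregular data.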

How about this:
select min(ts), extract(minute from ts)::integer / 5 as bucket
from your_table
group by bucket order by bucket;
This has the advantage of doing the right thing if you have two readings for the same minute, or if your readings skip a minute. Instead of using min, even better would be to use one of the first() aggregate functions, code for which you can find here:
http://wiki.postgresql.org/wiki/First_%28aggregate%29

This assumes that your five minute intervals are "on the fives", so to speak. That is, that you want 07:00, 07:05, 07:10, not 07:02, 07:07, 07:12. It also assumes you don't have two rows within the same minute, which might not be a safe assumption.
select your_timestamp
from your_table
where cast(extract(minute from your_timestamp) as integer) % 5 = 0;
If you might have two rows with timestamps within the same minute, like
2011-01-01 07:00:02
2011-01-01 07:00:59
then this version is safer.
select min(your_timestamp)
from your_table
group by (cast(extract(minute from your_timestamp) as integer) / 5)
Wrap either of those in a view, and you can join it to your base table.
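To see how a plain minute filter and the grouped version differ when two rows fall within the same minute, here is a minimal sketch using SQLite from Python. The sample rows are made up, and strftime('%M', ...) stands in for extract(minute from ...), which SQLite does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (your_timestamp TEXT)")

# Hypothetical rows within one hour: two readings in minute 07:00,
# plus some readings that are off the five-minute marks.
conn.executemany(
    "INSERT INTO your_table VALUES (?)",
    [
        ("2011-01-01 07:00:02",),
        ("2011-01-01 07:00:59",),
        ("2011-01-01 07:01:00",),
        ("2011-01-01 07:05:10",),
        ("2011-01-01 07:07:00",),
        ("2011-01-01 07:10:30",),
    ],
)

# Filter version: keeps every row whose minute is a multiple of 5,
# so both 07:00 readings come back.
filtered = [
    r[0]
    for r in conn.execute(
        """
        SELECT your_timestamp
          FROM your_table
         WHERE CAST(strftime('%M', your_timestamp) AS INTEGER) % 5 = 0
         ORDER BY your_timestamp
        """
    )
]

# Grouped version: one row (the earliest) per five-minute bucket.
grouped = [
    r[0]
    for r in conn.execute(
        """
        SELECT MIN(your_timestamp)
          FROM your_table
         GROUP BY CAST(strftime('%M', your_timestamp) AS INTEGER) / 5
         ORDER BY 1
        """
    )
]

print(filtered)
print(grouped)
```

Note that both versions look only at the minute part, so across several hours the grouped query would need the hour (or the full truncated timestamp) in its group by as well.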
