Grafana, PostgreSQL: aggregate function calls cannot contain window function calls

In Grafana, we want to show bars indicating the maximum of 15-minute averages in the chosen time interval. Our data comes at regular 1-minute intervals. The database is PostgreSQL.
To show the 15-minute averages, we use the following query:
SELECT
timestamp AS time,
AVG(rawvalue) OVER(ORDER BY timestamp ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING) AS value,
'15-min Average' AS metric
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
ORDER BY time
To show bars indicating the maximum of raw values in the chosen time interval, we use the following query:
SELECT
$__timeGroup(timestamp,'$INTERVAL') AS time,
MAX(rawvalue) AS value,
'Interval Max' AS metric
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
GROUP BY $__timeGroup(timestamp,'$INTERVAL')
ORDER BY time
A naive combination of both solutions does not work:
SELECT
$__timeGroup(timestamp,'$INTERVAL') AS time,
MAX(AVG(rawvalue) OVER(ORDER BY timestamp ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING)) AS value,
'Interval Max 15-min Average' AS metric
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
GROUP BY $__timeGroup(timestamp,'$INTERVAL')
ORDER BY time
We get the error: "pq: aggregate function calls cannot contain window function calls".
There is a suggestion on SO to use WITH (Count by criteria over partition), but I do not know how to use it in our case.

Use the first query as a CTE (a WITH query) feeding the second one. The CTE's ORDER BY clause and metric column are no longer needed, and neither is the outer query's WHERE clause. Alternatively, you can use the first query as a derived table in the FROM clause of the second one; a sketch of that variant follows the CTE version below.
with t as
(
SELECT
timestamp AS time,
AVG(rawvalue) OVER(ORDER BY timestamp ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING) AS value
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
)
SELECT
$__timeGroup(time,'$INTERVAL') AS time,
MAX(value) AS value,
'Interval Max 15-min Average' AS metric
FROM t
GROUP BY 1 ORDER BY 1;
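For completeness, the derived-table variant mentioned above would look roughly like this (same macros and placeholders as before; note that PostgreSQL requires an alias on the derived table):
SELECT
$__timeGroup(time,'$INTERVAL') AS time,
MAX(value) AS value,
'Interval Max 15-min Average' AS metric
FROM
(
SELECT
timestamp AS time,
AVG(rawvalue) OVER(ORDER BY timestamp ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING) AS value
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
) t
GROUP BY 1 ORDER BY 1;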
Unrelated, but what are $__timeFilter and $__timeGroup? Their semantics are clear, but where do they come from? BTW you may find this function useful.
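(Aside, since the answer above raises the question: $__timeFilter and $__timeGroup are Grafana macros for its SQL data sources; Grafana expands them before the query is sent to PostgreSQL. The exact expansion depends on the Grafana version, but it looks roughly like this, with made-up range literals:)
-- $__timeFilter(timestamp), with the dashboard's time range substituted:
timestamp BETWEEN '2017-04-21T05:01:17Z' AND '2017-04-21T07:01:17Z'
-- $__timeGroup(timestamp,'15m'), bucketing into 15-minute epoch buckets:
floor(extract(epoch from timestamp)/900)*900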

Related

postgresql window function minimum period to calculate average

I want to calculate a rolling average over a rolling period of 252 days, but only if 252 days of data are available in the table; otherwise the rows should get a null value.
Currently I am using this query:
SELECT datestamp, symbol, avg(close) OVER (PARTITION BY symbol ORDER BY datestamp ROWS BETWEEN 251 PRECEDING AND CURRENT ROW) FROM daily_prices;
It gives an average even when 252 days of data are not available.
I want to achieve the result we get with the pandas rolling function by defining a min_periods value.
"i want to calculate rolling average over a rolling period of 252 days"
The clause ROWS BETWEEN 251 PRECEDING AND CURRENT ROW doesn't refer to a period of time but to the number of rows in the window that precede the current row according to the ORDER BY datestamp clause.
I would suggest a slightly different window frame in order to implement an actual period of time:
SELECT datestamp, symbol, avg(close) OVER (PARTITION BY symbol ORDER BY datestamp RANGE BETWEEN interval '251 days' PRECEDING AND CURRENT ROW) FROM daily_prices;
Then I don't understand in which case you want a null value. In the window of a current row, you will have at least the current row, so that the avg() can't be null.
Just do a count over the same window and use it to modify your results.
I used a named window to avoid specifying the same window repeatedly.
with daily_prices as (
select 1 as symbol, 5 as close, t as datestamp
from generate_series(now() - interval '1 year', now(), interval '1 day') f(t)
)
SELECT
datestamp,
symbol,
case when count(close) OVER w = 252 then
avg(close) OVER w
end
FROM daily_prices
window w as (PARTITION BY symbol ORDER BY datestamp ROWS BETWEEN 251 PRECEDING AND CURRENT ROW);
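Combining the two answers — the time-based RANGE frame from the first with the count check from the second — would look roughly like this (a sketch, assuming at most one row per symbol per day, so that a full window holds exactly 252 rows):
SELECT
datestamp,
symbol,
case when count(close) OVER w = 252 then
avg(close) OVER w
end as avg_252
FROM daily_prices
window w as (PARTITION BY symbol ORDER BY datestamp RANGE BETWEEN interval '251 days' PRECEDING AND CURRENT ROW);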

How to compute window function for each nth row in Presto?

I am working with a table that contains timeseries data, with a row for each minute for each user.
I want to compute some aggregate functions on a rolling window of N calendar days.
This is achieved via
SELECT
SOME_AGGREGATE_FUN(col) OVER (
PARTITION BY user_id
ORDER BY timestamp
ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
) as my_col
FROM my_table
However, I am only interested in the result of this at a daily scale.
i.e. I want the window to be computed only at 00:00:00, but I want the window itself to contain all the minute-by-minute data to be passed into my aggregate function.
Right now I am doing this:
WITH agg_results AS (
SELECT
SOME_AGGREGATE_FUN(col) OVER (
PARTITION BY user_id
ORDER BY timestamp_col
ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
)
FROM my_table
)
SELECT * FROM agg_results
WHERE
timestamp_col = DATE_TRUNC('day', timestamp_col)
This works in theory, but it does 60 * 24 times more computations than necessary, making the query extremely slow.
Essentially, I am trying to find a way to make the right window bound skip rows based on a condition. Or, if it is simpler to implement, for every nth row (as I have a constant number of rows for each day).
I don't think that's possible with window functions. You could switch to a subquery instead, assuming that your aggregate function works as a regular aggregate function too (that is, without an OVER() clause):
select
timestamp_col,
(
select some_aggregate_fun(t1.col)
from my_table t1
where
t1.user_id = t.user_id
and t1.timestamp_col >= t.timestamp_col - interval '1' day
and t1.timestamp_col <= t.timestamp_col
) as my_col
from my_table t
where timestamp_col = date_trunc('day', timestamp_col)
I am unsure that this would perform better than your original query though; you might need to assess that against your actual dataset.
You can change interval '1' day to the actual interval you want to use.
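If the aggregate happens to be decomposable (SUM, COUNT, MAX, ...), another workaround — not part of the answer above, just a common pattern sketched under that assumption, here with SUM and N = 7 — is to pre-aggregate the minute rows into daily partials and run the window over days instead of minutes:
WITH daily AS (
SELECT
user_id,
DATE_TRUNC('day', timestamp_col) AS day,
SUM(col) AS day_sum -- one partial per user per day
FROM my_table
GROUP BY user_id, DATE_TRUNC('day', timestamp_col)
)
SELECT
user_id,
day,
SUM(day_sum) OVER (
PARTITION BY user_id
ORDER BY day
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW -- 7 calendar days, assuming no missing days
) AS my_col
FROM daily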

How to group timestamps into islands (based on arbitrary gap)?

Consider a list of dates as timestamptz (shown as an image in the original question).
I grouped the dates by hand using colors: every group is separated from the next by a gap of at least 2 minutes.
I'm trying to measure how much a given user studied by looking at when they performed an action (the data records when they finished studying a sentence). E.g., in the yellow block I'd consider that the user studied in one sitting, from 14:24 till 14:27, or roughly 3 minutes in a row.
I see how I could group these dates with a programming language, by iterating over the dates and checking the gap between consecutive rows.
My question is: how would I go about grouping dates in this way with Postgres?
(Looking for 'gaps' on Google or SO brings too many irrelevant results; I think I'm missing the vocabulary for what I'm trying to do here.)
SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS grp
FROM (
SELECT done
, lag(done) OVER (ORDER BY done) <= done - interval '2 min' AS step
FROM tbl
) sub
ORDER BY done;
The subquery sub returns step = true if the previous row is at least 2 min away - sorted by the timestamp column done itself in this case.
The outer query adds a rolling count of steps, effectively the group number (grp) - combining the aggregate FILTER clause with another window function.
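A self-contained toy run (just substituting a four-row tbl for the real table) shows how grp only increments after a gap of at least 2 minutes:
WITH tbl(done) AS (
VALUES
(timestamptz '2015-05-13 14:24:00')
, (timestamptz '2015-05-13 14:25:10')
, (timestamptz '2015-05-13 14:27:00')
, (timestamptz '2015-05-13 14:31:00') -- more than 2 min after the previous row: new island
)
SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS grp
FROM (
SELECT done
, lag(done) OVER (ORDER BY done) <= done - interval '2 min' AS step
FROM tbl
) sub
ORDER BY done; -- grp: 0, 0, 0, 1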
Related:
Query to find all timestamps more than a certain interval apart
How to label groups in postgresql when group belonging depends on the preceding line?
Select longest continuous sequence
Grouping or Window
About the aggregate FILTER clause:
Aggregate columns with additional (distinct) filters
Conditional lead/lag function PostgreSQL?
Building on Erwin's answer, here is the full query for tallying up the amount of time people spent on those sessions/islands:
My data only shows when people finished reviewing something, not when they started, which means we don't know when a session truly started; and some islands contain only one timestamp (leading to a 0-duration estimate). I account for both by calculating the average review time and adding it to the total duration of the islands.
This is likely very idiosyncratic to my use case, but I learned a thing or two in the process, so maybe this will help someone down the line.
-- Returns estimated total study time and average time per review, both in seconds
SELECT (EXTRACT( EPOCH FROM logged) + countofislands * avgreviewtime) as totalstudytime, avgreviewtime -- add total logged time to estimate for first-review-in-island and 1-review islands
FROM
(
SELECT -- get the three key values that will let us calculate total time spent
sum(duration) as logged
, count(island) as countofislands
, EXTRACT( EPOCH FROM sum(duration) FILTER (WHERE duration != '00:00:00'::interval) )
  / ( sum(reviews) FILTER (WHERE duration != '00:00:00'::interval)
    - count(reviews) FILTER (WHERE duration != '00:00:00'::interval) ) as avgreviewtime
FROM
(
SELECT island, age( max(done), min(done) ) as duration, count(island) as reviews -- calculate the duration of islands
FROM
(
SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS island -- give a unique number to each island
FROM (
SELECT -- detect the beginning of islands
done,
(
lag(done) OVER (ORDER BY done) <= done - interval '2 min'
) AS step
FROM review
WHERE clicker_id = 71 AND "done" > '2015-05-13' AND "done" < '2015-05-13 15:00:00' -- keep the queries small and fast for now
) sub
ORDER BY done
) grouped
GROUP BY island
) sessions
) summary

BigQuery - counting number of events within a sliding time frame

I would like to count the number of events within a sliding time frame.
For example, say I would like to know how many bids were in the last 1000 seconds for the Google stock (GOOG).
I'm trying the following query:
SELECT
symbol,
start_date,
start_time,
bid_price,
count(if(max(start_time)-start_time<1000,1,null)) over (partition by symbol order by start_time asc) cnt
FROM [bigquery-samples:nasdaq_stock_quotes.quotes]
where symbol = 'GOOG'
The logic is as follows: the partition window (by symbol) is ordered by bid time (leaving the bid date out of it, for the sake of simplicity).
For each window (defined by the row at the "head" of the window), I would like to count the number of rows whose start_time is less than 1000 seconds before the "head" row's time.
I'm trying to use max(start_time) to get the top row in the window. This doesn't seem to work and I get an error:
Error: MAX is an analytic function and must be accompanied by an OVER clause.
Is it possible to have two analytic functions in one column (both count and max in this case)? Is there a different solution to the problem presented?
Try using a RANGE window frame.
SELECT
symbol,
start_date,
start_time,
bid_price,
count(market_center) over (partition by symbol order by start_time RANGE 1000 PRECEDING) cnt
FROM [bigquery-samples:nasdaq_stock_quotes.quotes]
where symbol = 'GOOG'
order by 2, 3
I used market_center just as a counter, additional fields can be used as well.
Note: this RANGE frame syntax is not documented in the BigQuery Query Reference; however, it's a standard SQL construct and appears to work in this case.
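In standard SQL — including BigQuery's standard SQL dialect — the same frame would be spelled out explicitly, roughly like this (assuming start_time is a numeric count of seconds, since RANGE offsets require a numeric ORDER BY key):
SELECT
symbol,
start_date,
start_time,
bid_price,
count(market_center) OVER (
PARTITION BY symbol
ORDER BY start_time
RANGE BETWEEN 1000 PRECEDING AND CURRENT ROW
) AS cnt
FROM `bigquery-samples.nasdaq_stock_quotes.quotes`
WHERE symbol = 'GOOG'
ORDER BY start_date, start_time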

How can I make this query run efficiently?

In BigQuery, we're trying to run:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [Datastore.PerformanceDatum]
WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
which returns a relatively small amount of data. But we're getting the message:
Error: Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby
What is making this query fail, the size of the subquery? Is there some equivalent query we can do which avoids the problem?
Edit in response to comments: if I add GROUP EACH BY (and drop the outer ORDER BY), the query fails, claiming that GROUP EACH BY is not parallelizable here.
I wrote an equivalent query that works for me:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, UTC_USEC_TO_DAY(dtimestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
If I run only the inner query, I get 3,660,624 results. Is your dataset bigger than that?
The outer select gives me only 4 results when grouped by day. I'll try a different grouping to see if I can hit a limit there:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, dtimestamp / 1000 as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
Runs too, now with 57,862 different groups.
I tried different combinations to get to the same error, and was able to reproduce it by doubling the amount of initial data. An easy "hack" to double the amount of data is changing:
FROM [io_sensor_data.moscone_io13]
To:
FROM [io_sensor_data.moscone_io13], [io_sensor_data.moscone_io13]
Then I get the same error. How much data do you have? Can you apply an additional filter? As you are already partitioning the percentile_rank by day, can you add an additional query to only analyze a fraction of the days (for example, only last month)?
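As a sketch of that last suggestion (the cutoff literal below is made up — substitute your own): filtering in the subquery shrinks the data before PERCENTILE_RANK() ever runs:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [Datastore.PerformanceDatum]
WHERE type = "MemoryPerf"
AND timestamp >= PARSE_UTC_USEC('2014-06-01 00:00:00') -- analyze only recent days
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;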