Resources Exceeded in BigQuery - google-bigquery

I keep getting "resources exceeded" for the following query. I've tried running it in batch mode and from the command line, and nothing seems to work. Anyone have ideas?
SELECT num, extid, amount, note, balance
FROM
(
SELECT row_number() over(partition by extid order by stamp) as num
, extid, stamp, ds, amount, note, balance
FROM monte.ledger2_trailing_21d
WHERE ds >= '2015-02-09'
ORDER BY extid, stamp
)
WHERE num <= 10
limit 300

This is a deceptively expensive query; time-series analysis is always hard in SQL-like environments. The PARTITION BY clause you have written requires all of the data for a single extid to be present in memory on a single machine, which overloads it and causes your resources-exceeded error.
You can mitigate this RAM requirement by adding a ROWS clause to limit the scope of your window frame. Here is an example:
SELECT extid, stamp, ds, amount, note, balance
FROM (
SELECT
extid, stamp, ds, amount, note, balance,
MAX(tenth_stamp) OVER(PARTITION BY extid) AS target_stamp
FROM (
SELECT extid, stamp, ds, amount, note, balance,
MIN(stamp) OVER (
PARTITION BY extid
ORDER BY stamp DESC
ROWS BETWEEN CURRENT ROW AND 9 FOLLOWING
) AS tenth_stamp
FROM
[monte.ledger2_trailing_21d]
WHERE ds >= '2015-02-09'
)
)
WHERE stamp >= target_stamp
ORDER BY extid, stamp
LIMIT 300
The innermost sub-select extracts your data plus a field tenth_stamp that holds the smallest stamp among the 10 rows in each frame. Using MIN() makes this work even when there are fewer than 10 rows for a given extid.
The middle sub-select finds the largest tenth_stamp for each extid, which is the tenth most recent stamp overall for that extid. The outer SELECT can then restrict the result to rows whose stamp falls within the ten most recent stamps for their respective extid, giving you the desired result.
When executed, this takes a total of 4 stages. It will not run fast, but never requires large amounts of data in a single location. Hope that helps!
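As an aside, and only on the assumption that your table is also readable under today's Standard SQL with the same column names, the same "first 10 rows per extid" filter can be written far more compactly with QUALIFY. Whether it stays within memory limits still depends on how many rows each extid has, since the window still has to sort every partition:
SELECT extid, stamp, ds, amount, note, balance
FROM `monte.ledger2_trailing_21d`
WHERE ds >= '2015-02-09'
-- QUALIFY filters on the window function without needing a wrapping sub-select
QUALIFY ROW_NUMBER() OVER (PARTITION BY extid ORDER BY stamp) <= 10
ORDER BY extid, stamp
LIMIT 300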

Related

Google BigQuery - why does window function order by cause memory error although used together with partition by

I get a memory error in google BigQuery that I don't understand:
My base table (> 1 billion rows) consists of a user ID, a balance increment per day and the day.
From the balance_increment per day I want to return the total balance each time there is a new increment. For the next step I also need the next day on which there is a new balance increment. So I do:
select
userID
, date
, sum(balance_increment) over (partition by userID order by date) as balance
, lead(date, 1, current_date()) over (partition by userID order by date) as next_date
from my_base_table
Although I used partition by in the over clause, I get a memory error with this query, caused by the sort operation (the order by, if I understood correctly?):
BadRequest: 400 Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 135% of limit.
Top memory consumer(s):
sort operations used for analytic OVER() clauses: 98%
other/unattributed: 2%
But when I check how often a unique userID appears, the most frequent one occurs fewer than 4,000 times. I do have a lot of userIDs (apparently > 31 million, as the count query below suggests), but I thought that with partition by the query would be split across different slots if necessary?
Here is how I check how often a single userID occurs. This query, by the way, works just fine:
SELECT
userID
, count(*) as userID_count
FROM my_base_table
GROUP BY userID
ORDER BY userID_count DESC
So my questions are:
Did I understand it correctly that the memory error comes from the order by date?
Why is that a big issue when I have fewer than 4,000 occurrences that have to be ordered within each partition?
Why does my second query run through although at the end I have to order > 31 million rows?
How can I solve this issue?
I solved the memory issue by pre-ordering the base table by userID and date, as suggested by @Samuel, who pointed out that pre-ordering should reduce the key exchange across the nodes. It worked!
with ordered_base_table as (
  select * from my_base_table order by userID, date
)
select
userID
, date
, sum(balance_increment) over (partition by userID order by date) as balance
, lead(date, 1, current_date()) over (partition by userID order by date) as next_date
from ordered_base_table
Thanks!
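For completeness, one more knob that can matter for analytic OVER() memory is the window frame itself. The sketch below spells out an explicit ROWS frame instead of the default RANGE frame; whether that actually reduces memory here is an assumption I have not verified, and note that if a userID can have two balance increments on the same date, the running balance within those tied dates will differ from the RANGE version:
select
userID
, date
, sum(balance_increment) over (
    partition by userID
    order by date
    rows between unbounded preceding and current row  -- explicit ROWS frame; the default with order by is RANGE
  ) as balance
, lead(date, 1, current_date()) over (partition by userID order by date) as next_date
from ordered_base_table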

In a table of timed observations from multiple sensors, how to optimally retrieve the last observation for each sensor

My table looks as follows:
sensor  time                 value
AAA     2021-01-05 04:10:14  3.14159
AAA     2021-01-05 05:08:07  3.94756
ABC     2021-01-05 03:40:54  4.32543
I'm looking for a query that retrieves the rows corresponding to the last observation for each sensor, i.e.:
sensor  time                 value
AAA     2021-01-05 05:08:07  3.94756
ABC     2021-01-05 03:40:54  4.32543
After doing some research I came across this solution:
SELECT DISTINCT ON (sensor) sensor, time, value
FROM observations
ORDER BY sensor, time DESC
The problem with the above is that it's rather costly for large tables.
A possible solution would be to have another table holding only the last observation for each sensor, separate from the one holding all the historical ones. While that would work, I was wondering if there is something more elegant, i.e. something that lets me keep a single table while getting better performance.
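For concreteness, the kind of companion table I have in mind would be something like the sketch below (assuming PostgreSQL, which the DISTINCT ON syntax above implies; the names are placeholders, and the upsert would have to run for every new observation):
CREATE TABLE latest_observations (
    sensor text PRIMARY KEY,
    time   timestamp NOT NULL,
    value  numeric NOT NULL
);

-- keep only the newest row per sensor; ignore readings that arrive out of order
INSERT INTO latest_observations (sensor, time, value)
VALUES ('AAA', '2021-01-05 05:08:07', 3.94756)
ON CONFLICT (sensor) DO UPDATE
SET time = EXCLUDED.time, value = EXCLUDED.value
WHERE EXCLUDED.time > latest_observations.time;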
Thanks in advance.
You can use a window function such as ROW_NUMBER():
SELECT sensor, time, value
FROM
(
SELECT o.*,
ROW_NUMBER() OVER (PARTITION BY sensor ORDER BY time DESC) AS rn
FROM observations o ) oo
WHERE rn = 1
If ties can occur for time values, the DENSE_RANK() function might be used instead in order to include all rows that share the latest time in the result set, as in the sketch below.
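For example (same assumed table and column names as above, only the ranking function swapped):
SELECT sensor, time, value
FROM
(
SELECT o.*,
       DENSE_RANK() OVER (PARTITION BY sensor ORDER BY time DESC) AS rn
FROM observations o ) oo
WHERE rn = 1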
You could try to use a recursive CTE:
with recursive t as (
(select sensor, time, value
from observations
order by sensor, time desc limit 1)
union all
(select o.sensor, o.time, o.value
from observations as o join t on (o.sensor > t.sensor)
order by o.sensor, o.time desc limit 1))
select * from t;
An index on observations(sensor, time desc) could help a lot.
For this query:
SELECT DISTINCT ON (sensor) sensor, time, value
FROM observations
ORDER BY sensor, time DESC
You want an index on (sensor, time desc).
With such an index, this is probably the fastest method for doing what you want.
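If it does not exist yet, that index would be created roughly like this (assuming PostgreSQL, which the DISTINCT ON syntax implies; the index name is just a placeholder):
CREATE INDEX observations_sensor_time_idx
    ON observations (sensor, time DESC);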

SQL calculate time difference with previous row

I have code that provides the order number, the estimated time of delivery, the actual time of delivery and the difference between the two times.
If the order is late, I need to take that difference and add it to the next order to display the new estimated time of delivery.
How can I have SQL reach back to the previous row and get the calculated difference to add to the estimated time of delivery? LAG is not available since we are using 2012 SQL Shell.
This gets the datediff between the current record's estimated time and the previous record's actual time:
WITH orders AS
(SELECT *, ROW_NUMBER() OVER (ORDER BY datetimecolumn) AS rownum
FROM mytable
)
SELECT DATEDIFF(second, curr.est_tod, prev.act_tod)
FROM orders curr
INNER JOIN orders prev
ON curr.rownum = prev.rownum + 1
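Building on that, and purely as a sketch (est_tod, act_tod, mytable, and datetimecolumn are the names assumed in the snippet above), the previous order's lateness can be added onto the current order's estimate with DATEADD, with no need for LAG:
WITH orders AS
(SELECT *, ROW_NUMBER() OVER (ORDER BY datetimecolumn) AS rownum
 FROM mytable
)
SELECT curr.*,
       DATEADD(second,
               CASE WHEN prev.act_tod > prev.est_tod
                    THEN DATEDIFF(second, prev.est_tod, prev.act_tod)  -- previous order's lateness in seconds
                    ELSE 0                                             -- on time, early, or no previous order
               END,
               curr.est_tod) AS new_est_tod
FROM orders curr
LEFT JOIN orders prev
  ON curr.rownum = prev.rownum + 1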

postgres select aggregate timespans

I have a table with the following structure:
timestamp-start, timestamp-stop
1,5
6,10
25,30
31,35
...
I am only interested in continuous timespans, i.e. where the break between a timestamp-stop and the following timestamp-start is less than 3.
How could I get the aggregated covered timespans as a result:
timestamp-start,timestamp-stop
1,10
25,35
The reason I am considering this is that a user may request a timespan that would need to return several thousand rows. However, most records are continuous, and using the above method could potentially reduce many thousands of rows down to just a dozen. Or is the added computation not worth the savings in bandwidth and latency?
You can group the time stamps in three steps:
Add a flag to determine where a new period starts (that is, a gap greater than 3).
Cumulatively sum the flag to assign groupings.
Re-aggregate with the new groupings.
The code looks like:
select min(ts_start) as ts_start, max(ts_end) as ts_end
from (select t.*,
sum(flag) over (order by ts_start) as grouping
from (select t.*,
(coalesce(ts_start - lag(ts_end) over (order by ts_start),0) > 3)::int as flag
from t
) t
) t
group by grouping;

How can I make this query run efficiently?

In BigQuery, we're trying to run:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [Datastore.PerformanceDatum]
WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
which returns a relatively small amount of data. But we're getting the message:
Error: Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby
What is making this query fail, the size of the subquery? Is there some equivalent query we can do which avoids the problem?
Edit in response to comments: If I add GROUP EACH BY (and drop the outer ORDER BY), the query fails, claiming that GROUP EACH BY cannot be parallelized here.
I wrote an equivalent query that works for me:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, UTC_USEC_TO_DAY(dtimestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
If I run only the inner query, I get 3,660,624 results. Is your dataset bigger than that?
The outer select gives me only 4 results when grouped by day. I'll try a different grouping to see if I can hit a limit there:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, dtimestamp / 1000 as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
That runs too, now with 57,862 different groups.
I tried different combinations to reproduce the error. I was able to get the same error as you by doubling the amount of initial data. An easy "hack" to double the amount of data is changing:
FROM [io_sensor_data.moscone_io13]
To:
FROM [io_sensor_data.moscone_io13], [io_sensor_data.moscone_io13]
Then I get the same error. How much data do you have? Can you apply an additional filter? Since you are already partitioning the percentile_rank by day, could you add an additional filter to analyze only a fraction of the days (for example, only the last month)?
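For example (sketch only, against my sample table; it assumes dtimestamp is in microseconds, which the UTC_USEC_TO_DAY call above implies, so adjust the cutoff and column names to your schema), restricting the inner query to roughly the last month would look like:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, UTC_USEC_TO_DAY(dtimestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
-- assumption: dtimestamp is in microseconds, as it is passed to UTC_USEC_TO_DAY above
AND dtimestamp >= TIMESTAMP_TO_USEC(DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -1, "MONTH"))
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;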