Google BigQuery - why does window function order by cause memory error although used together with partition by - sql

I get a memory error in google BigQuery that I don't understand:
My base table (> 1 billion rows) consists of a user ID, a balance increment per day and the day.
From the daily balance_increment I want to return the running total balance each time there is a new increment. For a later step I also need the next day on which there is a new balance increment. So I do:
select
userID
, date
, sum(balance_increment) over (partition by userID order by date) as balance
, lead(date, 1, current_date()) over (partition by userID order by date) as next_date
from my_base_table
Although I used partition by in the over clause, I get a memory error with this query, caused by the sort operation (the order by, if I understood correctly?):
BadRequest: 400 Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 135% of limit.
Top memory consumer(s):
sort operations used for analytic OVER() clauses: 98%
other/unattributed: 2%
But when I check how often each unique user ID appears, the maximum is not even 4,000 occurrences. I know that I have a lot of userIDs (apparently > 31 million, as the image below suggests), but I thought that with partition by the query would be split into separate slots if necessary?
Here I check how often a single userID occurs. This query, by the way, works just fine:
SELECT
userID
, count(*) as userID_count
FROM my_base_table
GROUP BY userID
ORDER BY userID_count DESC
(sorry, in the image I called it entity instead of userID)
So my questions are:
Did I understand it correctly that the memory error comes from the order by date?
Why is that such a big issue when I have fewer than 4,000 occurrences per partition that have to be ordered when I use partition by?
Why does my second query run through although at the end I have to order > 31 million rows?
How can I solve this issue?

I solved the memory issue by pre-ordering the base table by userID and date, as suggested by @Samuel, who pointed out that pre-ordering should reduce the key exchange across the nodes. It worked!
With ordered_base_table as (
Select * from my_base_table order by userID, date
)
select
userID
, date
, sum(balance_increment) over (partition by userID order by date) as balance
, lead(date, 1, current_date()) over (partition by userID order by date) as next_date
from ordered_base_table
Thanks!

Related

partition big query LIMIT over date range

I'm quite new to SQL & BigQuery so this might be simple. I'm running some queries on the public GDELT dataset in BQ and have a question regarding LIMIT. GDELT is massive (14.4 TB) and when I query for something, in this case a person, I can get up to 100k rows of results or more, which in this case is too much. But when I use LIMIT it seems like it does not really partition the results evenly over the dates, causing me to get very random timelines. How does LIMIT work, and is there a way to get the results spread more evenly across the days?
SELECT DATE,V2Tone,DocumentIdentifier as URL, Themes, Persons, Locations
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE DATE>=20210610000000 and _PARTITIONTIME >= TIMESTAMP(@start_date)
AND DATE<=20210818999999 and _PARTITIONTIME <= TIMESTAMP(@end_date)
AND LOWER(DocumentIdentifier) like @url_topic
LIMIT @limit
When running this query and doing some preproc, I get the following time series:
It's based on 15k results, but they are distributed very unevenly/randomly across the days (since there are over 500k results in total if I don't use limit). I would like to make a query that limits my output to 15k but partitions the data somewhat equally over the days.
You need to use an ORDER BY; when you are not sorting your result, the order of the returned rows is not guaranteed.
But if you are looking to get the same number of rows per day, you can use window functions:
select * from (
SELECT
DATE,
V2Tone,
DocumentIdentifier as URL,
Themes,
Persons,
Locations,
row_number() over (partition by DATE) rn
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE
DATE >= 20210610000000 AND DATE <= 20210818999999
and _PARTITIONDATE >= @start_date and _PARTITIONDATE <= @end_date
AND LOWER(DocumentIdentifier) like @url_topic
) t where rn <= @numberofrowsperday
If you are passing dates only, you can use _PARTITIONDATE to filter the partitions.
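For ad-hoc testing in the console without bound parameters, here is a minimal sketch using BigQuery scripting variables instead; the variable values and the DIV() day-truncation are my own assumptions (GDELT's DATE column looks like YYYYMMDDHHMMSS, so truncating it makes the partition one per calendar day rather than one per update batch):
DECLARE start_date DATE DEFAULT DATE '2021-06-10';
DECLARE end_date DATE DEFAULT DATE '2021-08-18';
DECLARE url_topic STRING DEFAULT '%example%';  -- hypothetical LIKE pattern
DECLARE rows_per_day INT64 DEFAULT 150;        -- hypothetical per-day cap
select * from (
SELECT
DATE,
V2Tone,
DocumentIdentifier as URL,
Themes,
Persons,
Locations,
-- DIV(DATE, 1000000) truncates YYYYMMDDHHMMSS to YYYYMMDD, i.e. one partition per calendar day
row_number() over (partition by DIV(DATE, 1000000)) rn
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE
DATE >= 20210610000000 AND DATE <= 20210818999999
and _PARTITIONDATE >= start_date and _PARTITIONDATE <= end_date
AND LOWER(DocumentIdentifier) like url_topic
) t where rn <= rows_per_day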

Grouping by last day of each month—inefficient running

I am attempting to pull month end balances from all accounts a customer has for every month. Here is what I've written. This runs correctly and gives me what I want—but it also runs extremely slowly. How would you recommend speeding it up?
SELECT DISTINCT
[AccountNo]
,SourceDate
,[AccountType]
,[AccountSub]
,[Balance]
FROM [DW].[dbo].[dwFactAccount] AS fact
WHERE SourceDate IN (
SELECT MAX(SOURCEDATE)
FROM [DW].[dbo].[dwFactAccount]
GROUP BY MONTH(SOURCEDATE),
YEAR (SOURCEDATE)
)
ORDER BY SourceDate DESC
I'd try a window function:
SELECT * FROM (
SELECT
[AccountNo]
,[SourceDate]
,[AccountType]
,[AccountSub]
,[Balance]
,ROW_NUMBER() OVER(
PARTITION BY accountno, EOMONTH(sourcedate)
ORDER BY sourcedate DESC
) as rn
FROM [DW].[dbo].[dwFactAccount]
)x
WHERE x.rn = 1
The row number establishes an incrementing counter in order of sourcedate descending. The counter restarts from 1 when the month in sourcedate changes (or the account number changes), thanks to the EOMONTH function quantising any date in a given month to the last date of that month (2020-03-09 12:34:56 becomes 2020-03-31, as do all other datetimes in March). Any similar tactic that quantises to a fixed date in the month would also work, such as using YEAR(sourcedate), MONTH(sourcedate).
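For illustration, here is the YEAR()/MONTH() variant of the same idea sketched out (same assumed table and columns as above); either way the partition key quantises every date in a month to a single value:
SELECT * FROM (
SELECT
[AccountNo]
,[SourceDate]
,[AccountType]
,[AccountSub]
,[Balance]
,ROW_NUMBER() OVER(
-- the YEAR/MONTH pair plays the same quantising role as EOMONTH
PARTITION BY accountno, YEAR(sourcedate), MONTH(sourcedate)
ORDER BY sourcedate DESC
) as rn
FROM [DW].[dbo].[dwFactAccount]
)x
WHERE x.rn = 1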
You could also build a date dimension table with the date as its primary key, and have SourceDate in the fact table reference that dimension table.
The date dimension table can have month, year, week, is_weekend, is_holiday, etc. columns. You join your fact table with the date dimension table and can then group the data by any column in the date table you want.
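A rough sketch of that idea (the dimension table and its column names are hypothetical, not part of the poster's schema):
-- Hypothetical date dimension keyed on the calendar date
CREATE TABLE [DW].[dbo].[dwDimDate] (
[DateKey] date NOT NULL PRIMARY KEY,
[Year] int NOT NULL,
[Month] int NOT NULL,
[IsMonthEnd] bit NOT NULL, -- 1 on the last calendar day of each month
[IsWeekend] bit NOT NULL,
[IsHoliday] bit NOT NULL
);
-- Month-end balances then become a join plus a filter
SELECT f.[AccountNo], f.[SourceDate], f.[AccountType], f.[AccountSub], f.[Balance]
FROM [DW].[dbo].[dwFactAccount] AS f
JOIN [DW].[dbo].[dwDimDate] AS d
ON d.[DateKey] = CAST(f.[SourceDate] AS date)
WHERE d.[IsMonthEnd] = 1;
Note that an IsMonthEnd flag marks the calendar month-end, so a month whose data stops at the last business day would return nothing, unlike the MAX(SourceDate) approach in the question.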
Your absolute first step should be to view the execution plan for the query and determine why the query is slow.
The following explains how to see a graphical execution plan:
Display an Actual Execution Plan
The steps to interpreting the plan and optimizing the query are too much for an SO answer, but you should be able to find some good articles on the topic by Googling. You could also post the plan in an edit to your question and get some real feedback on what steps to take to improve query performance.
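As a related, low-effort first measurement while testing (standard SQL Server session options, shown here only as a sketch; the answer above is about the graphical plan itself):
-- Print I/O and CPU/elapsed-time statistics for whatever runs next in this session
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- ... run the month-end balance query from above here ...
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;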

SQL question: count of occurrence greater than N in any given hour

I'm looking through login logs (in Netezza) and trying to find users who have greater than a certain number of logins in any 1 hour time period (any consecutive 60 minute period, as opposed to strictly a clock hour) since December 1st. I've viewed the following posts, but most seem to address searching within a specific time range, not ANY given time period. Thanks.
https://dba.stackexchange.com/questions/137660/counting-number-of-occurences-in-a-time-period
https://dba.stackexchange.com/questions/67881/calculating-the-maximum-seen-so-far-for-each-point-in-time
Count records per hour within a time span
You could use the analytic function lag to look back in a sorted sequence of timestamps and check whether the record that came 19 entries earlier falls within one hour of the current one:
with cte as (
select user_id,
login_time,
lag(login_time, 19) over (partition by user_id order by login_time) as lag_time
from userlog
order by user_id,
login_time
)
select user_id,
min(login_time) as login_time
from cte
where extract(epoch from (login_time - lag_time)) < 3600
group by user_id
The output will show the matching users with the first occurrence at which they logged in for the twentieth time within an hour.
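As a usage note, the offset passed to lag is always one less than the login-count threshold; for a hypothetical threshold of 10 logins within any 60 minutes, the same pattern becomes:
with cte as (
select user_id,
login_time,
-- offset 9 = threshold of 10 logins (N - 1 rows back)
lag(login_time, 9) over (partition by user_id order by login_time) as lag_time
from userlog
)
select user_id,
min(login_time) as login_time
from cte
where extract(epoch from (login_time - lag_time)) < 3600
group by user_id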
I think you might do something like this (I'll use a login table with user and datetime as its only columns, for the sake of simplicity):
with connections as (
select ua.user
, ua.datetime
from user_logons ua
where ua.datetime >= timestamp'2018-12-01 00:00:00'
)
select ua.user
, ua.datetime
, (select count(*)
from connections ut
where ut.user = ua.user
and ut.datetime between ua.datetime and (ua.datetime + 1 hour)
) as consecutive_logons
from connections ua
It is up to you to complete this with your own columns (user, datetime).
It is up to you to find the date-arithmetic facilities (ua.datetime + 1 hour won't work as written); this is more or less dependent on the DB implementation; for example it is DATE_ADD in MySQL (https://www.w3schools.com/SQl/func_mysql_date_add.asp).
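For instance, a minimal sketch of that correlated count assuming MySQL's DATE_ADD (the question itself targets Netezza, where an interval expression such as ua.datetime + interval '1 hour' may work instead):
with connections as (
select ua.user
, ua.datetime
from user_logons ua
where ua.datetime >= timestamp'2018-12-01 00:00:00'
)
select ua.user
, ua.datetime
, (select count(*)
from connections ut
where ut.user = ua.user
-- DATE_ADD stands in for the pseudo-code "ua.datetime + 1 hour"
and ut.datetime between ua.datetime and date_add(ua.datetime, interval 1 hour)
) as consecutive_logons
from connections ua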
Due to the subquery (select count(*) ...), the whole query will not be the fastest, because it is a correlated subquery: it needs to be re-evaluated for each row.
The WITH clause simply computes a subset of user_logons to minimize its cost. It might not be strictly necessary, but it lessens the complexity of the query.
You might get better performance using a stored function or a language-driven (e.g. Java, PHP, ...) function.

Resources Exceeded in BigQuery

I keep getting resources exceeded for the following query. I've tried running it in batch mode and from the command line, and nothing seems to be working. Does anyone have ideas?
SELECT num, extid, amount, note, balance
FROM
(
SELECT row_number() over(partition by extid order by stamp) as num
, extid, stamp, ds, amount, note, balance
FROM monte.ledger2_trailing_21d
WHERE ds >= '2015-02-09'
ORDER BY extid, stamp
)
WHERE num <= 10
limit 300
This is a deceptively expensive query; time-series analysis is always hard in SQL-like environments. The PARTITION BY clause you have written requires all of the data for a single extid to be present in memory on a single machine, which is overloading it and causing your resources-exceeded error.
You can mitigate this RAM requirement by adding a ROWS clause to limit the scope of your partition. Here is an example:
SELECT extid, stamp, ds, amount, note, balance
FROM (
SELECT
extid, stamp, ds, amount, note, balance,
MAX(tenth_stamp) OVER(PARTITION BY extid) AS target_stamp
FROM (
SELECT extid, stamp, ds, amount, note, balance,
MIN(stamp) OVER (
PARTITION BY extid
ORDER BY stamp DESC
ROWS BETWEEN CURRENT ROW AND 9 FOLLOWING
) AS tenth_stamp
FROM
[monte.ledger2_trailing_21d]
WHERE ds >= '2015-02-09'
)
)
WHERE stamp >= target_stamp
ORDER BY extid, stamp
LIMIT 300
The innermost sub-select extracts your data plus a field tenth_stamp that holds the smallest stamp of the 10 rows examined. Using MIN() makes this work even when there are fewer than 10 rows for a given extid.
The middle sub-select finds the largest tenth_stamp for each extid; this is the tenth-most-recent stamp for that extid. The outer SELECT can then restrict the result to rows whose stamp is within the ten most recent stamps for their respective extid, giving you the desired result.
When executed, this takes a total of 4 stages. It will not run fast, but never requires large amounts of data in a single location. Hope that helps!

How can I make this query run efficiently?

In BigQuery, we're trying to run:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [Datastore.PerformanceDatum]
WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
which returns a relatively small amount of data. But we're getting the message:
Error: Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby
What is making this query fail, the size of the subquery? Is there some equivalent query we can do which avoids the problem?
Edit in response to comments: If I add GROUP EACH BY (and drop the outer ORDER BY), the query fails, claiming that GROUP EACH BY is not parallelizable here.
I wrote an equivalent query that works for me:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, UTC_USEC_TO_DAY(dtimestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
If I run only the inner query, I get 3,660,624 results. Is your dataset bigger than that?
The outer select gives me only 4 results when grouped by day. I'll try a different grouping to see if I can hit a limit there:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, dtimestamp / 1000 as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
That runs too, now with 57,862 different groups.
I tried different combinations to get to the same error. I was able to reproduce your error by doubling the amount of initial data. An easy "hack" to double the amount of data is changing:
FROM [io_sensor_data.moscone_io13]
To:
FROM [io_sensor_data.moscone_io13], [io_sensor_data.moscone_io13]
Then I get the same error. How much data do you have? Can you apply an additional filter? As you are already partitioning percentile_rank by day, can you add an additional filter to only analyze a fraction of the days (for example, only the last month)?
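For example, a hedged sketch of such an extra filter on the original query (the cutoff is a hypothetical literal, and it assumes the timestamp field holds microseconds since the epoch, as the use of UTC_USEC_TO_DAY suggests):
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [Datastore.PerformanceDatum]
WHERE type = "MemoryPerf"
-- hypothetical cutoff: only analyze data from June 2014 onwards
AND timestamp >= TIMESTAMP_TO_USEC(TIMESTAMP('2014-06-01'))
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;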