PERCENT_RANK() in BigQuery returns Resources exceeded

When I try to use PERCENT_RANK() over a large dataset, it gives me an error:
SELECT
  a2_lngram,
  a2_decade,
  a2_totalfreq,
  a2_totalbooks,
  a2_freq,
  a2_bfreq,
  a2_arf,
  c_avgarf,
  d_arf,
  oi,
  PERCENT_RANK() OVER (ORDER BY d_arf DESC) plarf
FROM [trigram.trigrams8]
Running this with a destination table and allowLargeResults enabled returns:
"Resources exceeded during query execution."
When I limit the results to a few hundred rows, it runs fine.
JobID: otichyproject1:job_PpTpmMXYETUMiM_2scGgc997JVg
The dataset is public.

This is expected: The input for an analytic/window function needs to fit in one node for it to run successfully.
PERCENT_RANK() OVER (ORDER BY d_arf DESC) plarf
will only run if all the rows fit in one node. If they don't, you'll see the "Resources exceeded during query execution" error.
There's a way to scale up with analytic functions: Partition your data.
PERCENT_RANK() OVER (PARTITION BY country ORDER BY d_arf DESC) plarf
... then the function can be run over multiple nodes, as long as the rows for each 'country' fit in one VM.
Not your case though - the fix I would apply here is to calculate the total in a separate subquery, join, and divide.
In summary, analytic functions are cool, but they have scalability limits on the size of each partition - luckily there are other ways to get the same results.
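For example, here is a minimal sketch of that workaround in today's Standard SQL, using the column names from the question (an illustration, not a drop-in replacement): rank the distinct d_arf values instead of every row, so the window sorts far fewer rows, then join the ranks back and divide by a total computed in a separate subquery.
WITH totals AS (
  SELECT COUNT(*) AS n FROM `trigram.trigrams8`
),
value_ranks AS (
  -- one row per distinct d_arf, so the window input is much smaller
  SELECT
    d_arf,
    SUM(cnt) OVER (ORDER BY d_arf DESC) - cnt AS rows_above
  FROM (
    SELECT d_arf, COUNT(*) AS cnt
    FROM `trigram.trigrams8`
    GROUP BY d_arf
  )
)
SELECT
  t.*,
  v.rows_above / (tot.n - 1) AS plarf  -- PERCENT_RANK() = (rank - 1) / (n - 1)
FROM `trigram.trigrams8` AS t
JOIN value_ranks AS v USING (d_arf)
CROSS JOIN totals AS tot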

Related

Google BigQuery memory error when using ROW_NUMBER() on a large table - ways to replace a long hash with a short unique identifier

For a query in Google BigQuery I want to replace a long hash with a shorter numeric unique identifier to save some memory afterwards, so I do:
SELECT
my_hash
, ROW_NUMBER() OVER (ORDER BY null) AS id_numeric
FROM hash_table_raw
GROUP BY my_hash
I don't even need an order in the id, but ROW_NUMBER() requires an ORDER BY.
When I try this on my dataset (> 1 billion rows) I get a memory error:
400 Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 126% of limit.
Top memory consumer(s):
sort operations used for analytic OVER() clauses: 99%
other/unattributed: 1%
Is there another way to replace a hash with a shorter identifier?
Thanks!
You don't really need a populated OVER() clause for this.
E.g. the following will work:
select col, row_number() over() as row_num from (select 'A' as col)
So that should be your first try.
Now, with the billion-plus rows you have, if the above fails, you can do something like this (given that order is not at all important to you), but you have to do it in parts:
SELECT
my_hash
, ROW_NUMBER() OVER () AS id_numeric
FROM hash_table_raw
where MOD(my_hash, 5) = 0
And in subsequent queries, you can take MAX(id_numeric) from the previous run and add it as an offset to the next:
SELECT
my_hash
, previous_max_id_numeric_val + ROW_NUMBER() OVER () AS id_numeric
FROM hash_table_raw
where MOD(my_hash, 5) = 1
And keep appending the outputs of these mod queries (0-4) into a single new table.
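If you'd rather avoid the manual bookkeeping between runs, here is a hedged single-statement sketch of the same idea in Standard SQL: number the rows within each shard (five small sorts instead of one huge one), then add per-shard offsets computed from the shard sizes. It assumes my_hash is a STRING, so FARM_FINGERPRINT stands in for the MOD(my_hash, 5) bucketing above; adjust the bucketing to your actual hash type.
WITH hashed AS (
  SELECT
    my_hash,
    MOD(ABS(FARM_FINGERPRINT(my_hash)), 5) AS shard
  FROM hash_table_raw
  GROUP BY my_hash
),
numbered AS (
  SELECT
    my_hash,
    shard,
    ROW_NUMBER() OVER (PARTITION BY shard) AS rn  -- small per-shard sort
  FROM hashed
),
offsets AS (
  SELECT
    shard,
    SUM(COUNT(*)) OVER (ORDER BY shard) - COUNT(*) AS shard_offset  -- rows in earlier shards
  FROM hashed
  GROUP BY shard
)
SELECT
  n.my_hash,
  o.shard_offset + n.rn AS id_numeric
FROM numbered AS n
JOIN offsets AS o USING (shard)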

Issues with analytic functions in BigQuery

Since June 2nd we have been having issues with analytic functions: when the query (not the partitions) passes a certain size, it fails with the following error:
Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 125% of limit. Top memory consumer(s): analytic OVER() clauses: 97%; other/unattributed: 3%. at [....]
Has anyone encountered the same problem?
BigQuery chooses a number of parallel workers for OVER() clauses based on the size of the tables being queried. The resources exceeded error appears when too much data is processed by the workers that BigQuery assigned to your query.
I assume this issue comes from the OVER() clause and the amount of data involved. You'll need to tune your query a bit (especially the OVER() clauses), as the error message says.
To read more about the error, take a look at the official documentation.
What would help here is slots - a unit of computational capacity required to execute SQL queries:
When you enroll in a flat-rate pricing plan, you purchase a dedicated
number of slots to use for query processing. You can specify the
allocation of slots by location for all projects attached to that
billing account. Contact your sales representative if you are
interested in flat-rate pricing.
I hope you find the above pieces of information useful.
We were able to overcome this limitation by splitting the original data into several shards and applying the analytic function to each shard.
In essence (for 8 shards):
WITH t AS (
  SELECT
    RAND() AS __aux,
    *
  FROM <original table>
)
SELECT
  * EXCEPT (__aux),
  F() OVER (...) AS ...
FROM t
WHERE MOD(CAST(__aux * POW(10, 9) AS INT64), 8) = 0
UNION ALL
....
SELECT
  * EXCEPT (__aux),
  F() OVER (...) AS ...
FROM t
WHERE MOD(CAST(__aux * POW(10, 9) AS INT64), 8) = 7
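One design note on the sketch above, offered as an assumption rather than part of the original answer: when the OVER() clause contains a PARTITION BY key, sharding on that key instead of on RAND() keeps every partition intact within a single shard, so each slice computes exactly what the unsharded query would. A concrete instance with a hypothetical events(user_id, ts) table:
WITH t AS (
  SELECT * FROM events  -- hypothetical table: events(user_id STRING, ts TIMESTAMP)
)
SELECT
  *,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts) AS step
FROM t
WHERE MOD(ABS(FARM_FINGERPRINT(user_id)), 8) = 0
UNION ALL
-- ... repeat for buckets 1 through 7 ...
SELECT
  *,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts) AS step
FROM t
WHERE MOD(ABS(FARM_FINGERPRINT(user_id)), 8) = 7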

Optimize query to avoid "Resources exceeded during query execution"

I've saved this query as a view:
SELECT
  NTH(1, CodAlm) AS FirstCode,
  NTH(1, DesAlm) AS FirstDescription,
  LAST(CodAlm) AS LastCode,
  LAST(DesAlm) AS LastDescription,
  MAX(DATE(DataTic)) AS LastVisit,
  MIN(DATE(DataTic)) AS FirstVisit,
  DATEDIFF(CURRENT_TIMESTAMP(), TIMESTAMP(MAX(DATE(DataTic)))) AS Diffdays,
  COUNT(DISTINCT DATE(DataTic)) AS countVisits,
  COUNT(DISTINCT CodAlm) AS NumberCodes,
  SUM(SubTot) AS Totalimport,
  TarCli,
  LAST(NomCli) AS Name,
  LAST(CogCli) AS LastName,
  LAST(EmailCli) AS email,
  LAST(SexCli) AS gender
FROM (
  SELECT CodAlm, DesAlm, DataTic, SubTot, TarCli, NomCli, CogCli, EmailCli, SexCli
  FROM [bime.Sales]
  WHERE YEAR(DataTic) > 2012 AND IsFirstLine = "1"
  ORDER BY TarCli, DataTic)
GROUP EACH BY TarCli
But when I run any query over this view, BigQuery returns Resources exceeded during query execution. I think the ORDER BY is the cause of my problem, but I need it to show my results correctly. How could I rewrite this query? The bime.Sales table has 18 million rows.
See this question for more info:
What causes "resources exceeded" in BigQuery?
The error is likely caused by the GROUP EACH BY clause, and the most likely reason is that you have a skewed distribution of keys (i.e., one key with a disproportionate number of records). Can you look at your data distribution, and perhaps filter out any skewed keys?
Also note that the ORDER BY is not guaranteed to be preserved by GROUP EACH BY, so you need to apply the ordering after the GROUP EACH BY. You may find it useful to use analytic functions like FIRST_VALUE, NTH_VALUE, and LAST_VALUE with OVER(PARTITION BY tarcli ORDER BY DataTic) instead of GROUP EACH BY if you want to get reliable ordering.
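A minimal sketch of that rewrite, shown in today's Standard SQL with just a few of the view's columns (the remaining ones follow the same pattern). Note that LAST_VALUE needs an explicit frame to see the whole partition:
SELECT
  TarCli,
  ANY_VALUE(FirstCode) AS FirstCode,
  ANY_VALUE(LastCode) AS LastCode,
  MIN(DATE(DataTic)) AS FirstVisit,
  MAX(DATE(DataTic)) AS LastVisit
FROM (
  SELECT
    TarCli,
    DataTic,
    FIRST_VALUE(CodAlm) OVER (PARTITION BY TarCli ORDER BY DataTic) AS FirstCode,
    LAST_VALUE(CodAlm) OVER (
      PARTITION BY TarCli ORDER BY DataTic
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS LastCode
  FROM `bime.Sales`
  WHERE EXTRACT(YEAR FROM DataTic) > 2012 AND IsFirstLine = "1"
)
GROUP BY TarCli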
Some things you should consider and try (if you haven't done that so far):
1) Do you really need the "group each by"? Have you tried with just "group by"?
2) Have you tried to use a table instead of a view? You could try to "materialize" the view to check if the resource consumption decreases.
3) Can you shard the data? Perhaps putting each year or month in a different table (using DataTic). That would decrease the size of each table and, therefore, the resource usage.
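For suggestion 3, a quick sketch in the thread's legacy SQL (the yearly destination table name, e.g. bime.Sales2013, is just illustrative): run one query per year into its own destination table, then build the view on the shard(s) you need.
SELECT *
FROM [bime.Sales]
WHERE YEAR(DataTic) = 2013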
Cheers!

JOIN EACH and GROUP EACH BY clauses can't be used on the output of window functions

How would you overcome the above restriction?
I am trying to find flows based on sequences of 3 records using the LEAD and LAG window functions, and then calculate some aggregations (count, sum, etc.) of their attributes.
When I run my queries on a small sample of data, everything is fine and the GROUP BY runs OK. But when running on a larger data set, I get: "Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead."
In many other cases switching to GROUP EACH BY does the trick...
However, as I use window functions, I cannot use EACH...
Any suggestions? Best practices?
Here is a sample query based on the Wikipedia sample data. It shows the frequency of title edits by different contributors. The WHERE condition is just there to limit the response size: if you remove the "B" we get results; if we add it, we get the "use EACH" recommendation.
SELECT
  title,
  COUNT(CASE WHEN contributor_id <> LeadContributor THEN 1 ELSE NULL END) AS different,
  COUNT(CASE WHEN contributor_id = LeadContributor THEN 1 ELSE NULL END) AS same,
  COUNT(*) AS total
FROM (
  SELECT
    title,
    contributor_id,
    LEAD(contributor_id) OVER (PARTITION BY title ORDER BY timestamp) AS LeadContributor
  FROM [publicdata:samples.wikipedia]
  WHERE REGEXP_MATCH(title, r'^[A,B]') = true)
GROUP BY title
Thanks
I guess your particular use case is different to the sample query, but let me comment on what I'm able to see:
You found a way to make GROUP EACH and OVER possible: Surrounding the OVER() query with another one allows you to change the GROUP BY to GROUP EACH BY. However, this query's problem is not there.
Let's forget about GROUP and GROUP EACH. Let's look at the core query:
SELECT title, contributor_id, LEAD(contributor_id)
OVER(PARTITION BY title ORDER BY timestamp) AS LeadContributor
FROM [publicdata:samples.wikipedia]
WHERE REGEXP_MATCH(title, r'^[A,B]')
This query fails with r'^[A,B]' and works with r'^[A]', and it highlights an OVER() limitation: like GROUP BY and ORDER BY, it only works when the data fits in one machine, as they are not parallelizable. As the success with r'^[A]' shows, that can be a lot of data - though sometimes not enough. That's why BigQuery offers the parallelizable GROUP EACH BY. However, there is no parallelizable OVER EACH BY we can use here.
The workaround I would apply here is exactly what you are doing: Do the OVER() with just a fraction of the data.
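Concretely (a hedged illustration, not the answer's own code), that means running the sample query's OVER() once per title prefix and recombining the slices - in legacy SQL, a comma between tables in FROM means UNION ALL. This preserves the original semantics because the window is partitioned by title, and titles starting with A and with B are disjoint:
SELECT
  title,
  COUNT(CASE WHEN contributor_id <> LeadContributor THEN 1 ELSE NULL END) AS different,
  COUNT(CASE WHEN contributor_id = LeadContributor THEN 1 ELSE NULL END) AS same,
  COUNT(*) AS total
FROM (
  SELECT
    title,
    contributor_id,
    LEAD(contributor_id) OVER (PARTITION BY title ORDER BY timestamp) AS LeadContributor
  FROM [publicdata:samples.wikipedia]
  WHERE REGEXP_MATCH(title, r'^[A]')), (
  SELECT
    title,
    contributor_id,
    LEAD(contributor_id) OVER (PARTITION BY title ORDER BY timestamp) AS LeadContributor
  FROM [publicdata:samples.wikipedia]
  WHERE REGEXP_MATCH(title, r'^[B]'))
GROUP EACH BY title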
(btw, let me say I love the sample query... it's an interesting question with an interesting answer!)

Divide does nothing

I'm attempting to create percentile scores. My query generates the ranks correctly, but the divide does nothing (the ranks are displayed in the columns rather than the scores).
"/"(RANK() OVER(ORDER BY "Disk IO"),Count(*)) "Disk IO Score"
I've also tried generating the rank then selecting that and dividing, but it has the same result.
SELECT ..."/"("Disk IO Score",Count(*)) "Score"...
FROM(....RANK() OVER(ORDER BY "Disk IO") "Disk IO Score"...)
Thanks,
Buzkie
SELECT "System_Name", "/"(RANK() OVER(ORDER BY "Disk IO"),Count(*)) "Disk IO Score"
FROM (Select...)
GROUP BY "System_Name", "Disk IO"
It seems you are using the aggregate COUNT(*) rather than the analytic one.
Try this:
SELECT RANK() OVER (...) / COUNT(*) OVER (...)
And could you please post the whole query (including GROUP BY clauses)?
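For illustration, a hedged sketch of that fix in the question's quoted-identifier style (the empty-window syntax may need adjusting for your SQL dialect): make the denominator analytic as well, so it counts all rows in the result instead of the rows of a one-row group.
SELECT "System_Name",
       RANK() OVER (ORDER BY "Disk IO") * 1.0
         / COUNT(*) OVER () AS "Disk IO Score"  -- * 1.0 guards against integer division
FROM (SELECT ...)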
I guess it's answered. With the GROUP BY on "System_Name" and "Disk IO", each group had a single row, so the COUNT(*) was returning 1 and I was just dividing by 1.