With Google BigQuery, I'm running a query with a group by and receive the error, "resources exceeded during query execution".
Would an increased quota allow the query to run?
Any other suggestions?
SELECT
ProductId,
StoreId,
ProductSizeId,
InventoryDate as InventoryDate,
avg(InventoryQuantity) as InventoryQuantity
FROM BigDataTest.denorm
GROUP EACH BY
ProductSizeId,
InventoryDate,
ProductId,
StoreId;
The table is around 250GB, project # is 883604934239.
Thanks to a combination of reducing the data involved and recent updates to BigQuery, this query now runs. The filter
where ABS(HASH(ProductId) % 4) = 0
was used to cut down the 1.3 billion rows in the table (% 3 still failed).
With the test data set it gives "Error: Response too large to return in big query", which can be handled by writing the results out to a table: click Enable Options, use 'Select Table' (and enter a table name), then check 'Allow Large Results'.
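Putting those pieces together, the reduced query looked roughly like this (a sketch combining the original query with the hash filter, which keeps roughly a quarter of the 1.3 billion rows; run it with a destination table and 'Allow Large Results' checked):
SELECT
ProductId,
StoreId,
ProductSizeId,
InventoryDate,
avg(InventoryQuantity) as InventoryQuantity
FROM BigDataTest.denorm
WHERE ABS(HASH(ProductId) % 4) = 0
GROUP EACH BY
ProductSizeId,
InventoryDate,
ProductId,
StoreId;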
I was thinking of adding a LIMIT 0 to BigQuery queries and using this together with dbt to check that the whole DAG, dependencies and so on are correct without incurring costs.
I'm not finding any official doc stating whether this is the case.
Are those queries not billed?
Correct, this will not bill any data. You can run a dry run to verify:
dzagales@cloudshell:~ (elzagales)$ bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `bigquery-public-data.austin_311.311_service_requests` LIMIT 0'
Query successfully validated. Assuming the tables are not modified, running this query will process 0 bytes of data.
dzagales@cloudshell:~ (elzagales)$ bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `bigquery-public-data.austin_311.311_service_requests` LIMIT 1'
Query successfully validated. Assuming the tables are not modified, running this query will process 254787 bytes of data.
Above you can see that LIMIT 0 bills 0 bytes, while LIMIT 1 will scan the whole table.
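If you want to double-check after the fact, the INFORMATION_SCHEMA jobs view exposes the bytes processed and billed per job. A minimal sketch, assuming the jobs ran in the US region (adjust the region qualifier to match your project):
SELECT
job_id,
query,
total_bytes_processed,
total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC
LIMIT 10;
The LIMIT 0 jobs should show up with 0 bytes billed.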
I want to schedule one Google Cloud query to run every day, and I also want to receive email alerts whenever any table's size exceeds 1 TB. Is that possible?
With INFORMATION_SCHEMA.TABLE_STORAGE you can obtain the size of all tables in a project and region. The ERROR() call raises an alert, and for scheduled queries an email notification can be set up.
You need to set up one scheduled query for each region the project is using.
SELECT
STRING_AGG(summary),
IF(COUNT(1) > 0,
  ERROR(CONCAT(COUNT(1), " tables too large, total: ", SUM(total_logical_bytes), " list: ", STRING_AGG(summary))),
  "")
FROM
(
SELECT
project_id,
table_name,
SUM(total_logical_bytes) AS total_logical_bytes,
CONCAT(project_id, '.', table_name, '=', SUM(total_logical_bytes)) AS summary
FROM
`region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE
GROUP BY
1,
2
HAVING
total_logical_bytes > 1024*1024 # 1 MB limit
ORDER BY
total_logical_bytes DESC
)
The inner query obtains all tables in the EU region and keeps those above 1 MB (raise the threshold to 1024*1024*1024*1024 for the 1 TB limit from the question). The outer query checks in the IF statement whether any such tables were found and, if so, raises an alert with ERROR().
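The alerting mechanism itself is the ERROR() function: when the condition in the IF is met, the query fails with a custom message, the scheduled query run is marked as failed, and that failure triggers the email notification. A minimal illustration of the mechanism:
-- Fails with "2 tables too large" instead of returning a row.
SELECT IF(2 > 0, ERROR(CONCAT(CAST(2 AS STRING), " tables too large")), "ok");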
The other possibility is that the "Processed Data Amount" reflects the size of the top-level, enclosing STRUCT/RECORD type even though only one subfield of the STRUCT/RECORD column is selected.
The online doc says "0 bytes + the size of the contained fields", which is not explicit to me. Can someone help clarify? Thanks.
Think of a record as a storage mechanism. When you query against it (like you would a regular table), you are still only charged for the columns you use (select, filter, join, etc).
Check out the following query estimates for these similar queries.
-- This query would process 5.4GB
select
* -- everything
from `bigquery-public-data.google_analytics_sample.ga_sessions_*`
-- This query would process 33.7MB
select
visitorId, -- integer
totals -- record
from `bigquery-public-data.google_analytics_sample.ga_sessions_*`
-- This query would process 6.9MB
select
visitorId, -- integer
totals.hits -- specific column from record
from `bigquery-public-data.google_analytics_sample.ga_sessions_*`
On-demand pricing is based on the number of bytes processed by a query; with the STRUCT/RECORD data type you are charged according to the columns you select within the record.
The expression "0 bytes + the size of the contained fields" means that the size depends on the data types of the columns within the record.
Moreover, you can estimate costs before running your query using the query validator.
Since June 2nd we have been having issues with analytic functions. When the query (not the partitions) passes a certain size, the query fails with the following error:
Resources exceeded during query execution: The query could not be
executed in the allotted memory. Peak usage: 125% of limit. Top memory
consumer(s): analytic OVER() clauses: 97% other/unattributed: 3% . at
[....]
Has anyone encountered the same problem?
BigQuery chooses the number of parallel workers for OVER() clauses based on the size of the tables being queried. The resources exceeded error appears when too much data has to be processed by the workers that BigQuery assigned to your query.
I assume this issue comes from the OVER() clause and the amount of data involved. You'll need to tune your query a bit (especially the OVER() clauses), as the error message suggests.
To read more about the error, take a look at the official documentation.
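One common way to tune this, when the logic allows it, is to add (or narrow) a PARTITION BY inside the OVER() clause so that each worker only has to hold a single partition in memory instead of the whole input. A minimal sketch with hypothetical table and column names:
-- Without PARTITION BY, the whole input must fit on a single worker:
SELECT
*,
ROW_NUMBER() OVER (ORDER BY event_time) AS rn
FROM `my_project.my_dataset.events`;

-- With PARTITION BY, each user_id partition is processed independently,
-- which keeps the per-worker memory footprint much smaller:
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) AS rn
FROM `my_project.my_dataset.events`;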
What would also help is slots, BigQuery's unit of computational capacity required to execute SQL queries:
When you enroll in a flat-rate pricing plan, you purchase a dedicated
number of slots to use for query processing. You can specify the
allocation of slots by location for all projects attached to that
billing account. Contact your sales representative if you are
interested in flat-rate pricing.
I hope you find the above pieces of information useful.
We were able to overcome this limitation by splitting the original data into several shards and applying the analytic function to each shard.
In essence (for 8 shards):
WITH
t AS (
SELECT
RAND () AS __aux,
*
FROM <original table>
)
SELECT
* EXCEPT (__aux),
F () OVER (...) AS ...
FROM t
WHERE MOD (CAST (__aux*POW(10,9) AS INT64), 8) = 0
UNION ALL
....
SELECT
* EXCEPT (__aux),
F () OVER (...) AS ...
FROM t
WHERE MOD (CAST (__aux*POW(10,9) AS INT64), 8) = 7
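As a concrete sketch of the same idea with two shards and hypothetical table/column names: here the shard key is derived from the analytic function's PARTITION BY column, so every row of a given customer lands in the same shard and the running total is unchanged by the split:
WITH t AS (
SELECT
ABS(MOD(FARM_FINGERPRINT(CAST(customer_id AS STRING)), 2)) AS __shard,
*
FROM `my_project.my_dataset.orders`
)
SELECT
* EXCEPT (__shard),
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
FROM t
WHERE __shard = 0
UNION ALL
SELECT
* EXCEPT (__shard),
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
FROM t
WHERE __shard = 1;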
I've been trying to run this query:
SELECT
created
FROM
TABLE_DATE_RANGE(
program1_insights.insights_,
TIMESTAMP('2016-01-01'),
TIMESTAMP('2016-02-09')
)
LIMIT
10
And BigQuery complains that the query is too large.
I've experimented with writing the table names out manually:
SELECT
created
FROM program1_insights.insights_20160101,
program1_insights.insights_20160102,
program1_insights.insights_20160103,
program1_insights.insights_20160104,
program1_insights.insights_20160105,
program1_insights.insights_20160106,
program1_insights.insights_20160107,
program1_insights.insights_20160108,
program1_insights.insights_20160109,
program1_insights.insights_20160110,
program1_insights.insights_20160111,
program1_insights.insights_20160112,
program1_insights.insights_20160113,
program1_insights.insights_20160114,
program1_insights.insights_20160115,
program1_insights.insights_20160116,
program1_insights.insights_20160117,
program1_insights.insights_20160118,
program1_insights.insights_20160119,
program1_insights.insights_20160120,
program1_insights.insights_20160121,
program1_insights.insights_20160122,
program1_insights.insights_20160123,
program1_insights.insights_20160124,
program1_insights.insights_20160125,
program1_insights.insights_20160126,
program1_insights.insights_20160127,
program1_insights.insights_20160128,
program1_insights.insights_20160129,
program1_insights.insights_20160130,
program1_insights.insights_20160131,
program1_insights.insights_20160201,
program1_insights.insights_20160202,
program1_insights.insights_20160203,
program1_insights.insights_20160204,
program1_insights.insights_20160205,
program1_insights.insights_20160206,
program1_insights.insights_20160207,
program1_insights.insights_20160208,
program1_insights.insights_20160209
LIMIT
10
And not surprisingly, BigQuery returns the same error.
This Q&A says that "query too large" means that BigQuery is generating an internal query that's too large to be processed. But in the past, I've run queries over way more than 40 tables with no problem.
My question is: what is it about this query in particular that's causing this error, when other, larger-seeming queries run fine? Is it that doing a single union over this number of tables is not supported?
Answering the question: what is it about this query in particular that's causing this error?
The problem is not in the query itself.
The query looks good.
I just ran a similar query against ~400 daily tables with a total of 5.8 billion rows and a total size of 5.7 TB, with:
Query complete (150.0s elapsed, 21.7 GB processed)
SELECT
Timestamp
FROM
TABLE_DATE_RANGE(
MyEvents.Events_,
TIMESTAMP('2015-01-01'),
TIMESTAMP('2016-02-12')
)
LIMIT
10
You should look elsewhere for the cause. By the way, are you sure you are not over-simplifying the query in your question?