I have a destination table (created as the output of another query), and a simple ORDER BY on one of its columns results in a "resources exceeded" error message.
The destination table has 8.5 million rows and 6 columns (approx. 567 MB).
select col1, col2, ..., col6 from desttable order by col5 desc
results in the "resources exceeded" error message.
Remove the ORDER BY and see if the error disappears!
ORDER BY moves the WHOLE data set onto one worker - thus "resources exceeded".
If I add a "LIMIT" and "OFFSET" clause to the query after the ORDER BY, it works, even though the LIMIT clause is the last to be evaluated. How does it work there?
When you add LIMIT N, the query runs on multiple workers. Each worker gets only part of the data to process and outputs only its own N rows. Those N rows from all workers then get "delivered" to one worker, where the final ORDER BY and LIMIT occur, and the "winning" N rows become the output of the whole query.
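For example, a sketch of the query from the question with a LIMIT added (the explicit column list and the row count of 1000 are placeholders, not values from the original post):
select col1, col2, col3, col4, col5, col6
from desttable
order by col5 desc
limit 1000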
Related
I want to schedule one Google Cloud query to run every day, and I also want to receive email alerts whenever any table's size exceeds 1 TB. Is that possible?
With INFORMATION_SCHEMA.TABLE_STORAGE the size of all tables in a project and a region can be obtained. Raising an error triggers an alert, and for scheduled queries an email notification can be configured.
For each region the project uses, you need to set up a scheduled query.
SELECT
STRING_AGG(summary),if(count(1)>0,error(concat(count(1)," tables too large, total: ",sum(total_logical_bytes)," list: " ,STRING_AGG(summary) )),"")
FROM
(
SELECT
project_id,
table_name,
SUM(total_logical_bytes) AS total_logical_bytes,
CONCAT(project_id,'.',table_name,'=',SUM(total_logical_bytes) ) AS summary
FROM
`region-eu`.INFORMATION_SCHEMA.TABLE_STORAGE
GROUP BY
1,
2
HAVING
total_logical_bytes> 1024*1024 # 1MB Limit
ORDER BY
total_logical_bytes DESC
)
The inner query obtains all tables in the EU region and keeps those above 1 MB. The outer query checks in the IF statement whether any such tables exist and, if so, raises an alert by calling error().
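For the 1 TB threshold asked about in the question, the HAVING clause would presumably become something like:
HAVING
  total_logical_bytes > 1024*1024*1024*1024 # 1 TB Limit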
select *
, row_number() OVER(PARTITION BY user_id,event_datetime_start,event_datetime_end ORDER BY user_id, event_datetime_start, event_datetime_end,dt_watched) rk
from `blackout_tv_july` a
cross join unnest(GENERATE_TIMESTAMP_ARRAY(event_datetime_start,(datetime_add(event_datetime_end, interval 1 MINUTE)), interval 1 MINUTE)) dt_watched
The error shown is:
GENERATE_ARRAY(1658598950677000, 3317172759513000, 1) produced too many elements
There is no predefined limit for GENERATE_ARRAY, but from the error it seems this is occurring because of the large size of the array, which causes an out-of-memory error at runtime. The larger the array, the more likely you are to hit the runtime error. The exact number varies depending on the query you are running. There is a 100 MB limit on the size of the array, put in place to prevent accidentally writing very heavy CPU-bound queries.
For your requirement, you can split the array. For example:
SELECT ARRAY_CONCAT(GENERATE_TIMESTAMP_ARRAY(parameters), GENERATE_TIMESTAMP_ARRAY(parameters))...
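A minimal sketch of that idea, assuming the overall time window is split into two halves at some midpoint (the timestamps below are invented purely for illustration):
SELECT ARRAY_CONCAT(
  GENERATE_TIMESTAMP_ARRAY(TIMESTAMP '2022-07-23 00:00:00', TIMESTAMP '2022-07-23 12:00:00', INTERVAL 1 MINUTE),
  GENERATE_TIMESTAMP_ARRAY(TIMESTAMP '2022-07-23 12:01:00', TIMESTAMP '2022-07-24 00:00:00', INTERVAL 1 MINUTE)
) AS dt_watched_array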
I'm using Splunk with PostgreSQL.
I have 2.1M rows to pull up in ASC order.
This is my standard query that is working:
WHERE we_student_id = 5678
ORDER BY audit.b_created_date DESC NULLS LAST
then I usually use (if the data is more than 1-2M rows, I split it into batches):
FETCH FIRST 500000 ROWS ONLY
OFFSET 500000 ROWS FETCH NEXT 500000 ROWS ONLY
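Put together, the batching pattern described above looks roughly like this (assuming the table is named audit and selecting all columns; adjust to the real schema):
SELECT *
FROM audit
WHERE we_student_id = 5678
ORDER BY audit.b_created_date DESC NULLS LAST
OFFSET 500000 ROWS FETCH NEXT 500000 ROWS ONLY;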
this time, my client requested that I extract them in ASC order based on the audited id, not audit_created_date.
I used:
WHERE student_id = 5678
ORDER BY audit.audited_id ASC NULLS LAST
==========
I tried to pull up the first 500k.
I used:
ORDER BY audit.audited_id ASC NULLS LAST
LIMIT 500000 OFFSET 0
The result is just 100k.
I tried to put maxrows=0 before my select statement with the same query
ORDER BY audit.audited_id ASC NULLS LAST
LIMIT 500000 OFFSET 0
but I'm getting an error: canceling statement due to user request.
I tried this query to get the first 400k instead of 500k and removed the OFFSET 0, still using maxrows=0 before my select statement:
ORDER BY audit.audited_id ASC NULLS LAST
LIMIT 400000
That returns a 400k result.
When I tried to extract the next 400k, I queried
LIMIT 400000 OFFSET 400000
I encountered the error again: canceling statement due to user request.
Usually, I can pull up 2M rows from the database. I normally use "FETCH FIRST 1000000" and then offset the other batches. My usual query on the DB is
ORDER BY ... DESC NULLS LAST with FETCH FIRST and OFFSET.
But this time, my client wants to get the data by ASC order.
I tried the FETCH FIRST 400000 ROWS ONLY query and got a 400k result, but whenever I increase the number to 500000, I get this error: canceling statement due to user request.
I usually use maxrows=0 because Splunk only shows the first 100k rows. Most of my data sets are 1-2 million rows.
This error only happens when the client requests the reports in ASC order.
I just want to pull up the 2.1M rows from the database, and I don't know how to do it in ASC order. I don't know if I'm using OFFSET and LIMIT correctly.
For example:
SELECT company_ID, totalRevenue
FROM `BigQuery.BQdataset.companyperformance`
ORDER BY totalRevenue LIMIT 10
The only difference I can see between using and not using LIMIT 10 is just the amount of data returned for display to the user.
The system still orders all the data first before applying the LIMIT.
The below is applicable to BigQuery.
It is not necessarily 100% technically correct - but close enough that I hope it will give you an idea of why LIMIT N is extremely important to consider in BigQuery.
Assume you have 1,000,000 rows of data and 8 workers to process a query like the one below:
SELECT * FROM table_with_1000000_rows ORDER BY some_field
Round 1: To sort this data, each worker gets 125,000 rows - so now you have 8 sorted sets of 125,000 rows each
Round 2: Worker #1 sends its sorted data (125,000 rows) to worker #2, worker #3 sends to worker #4, and so on. So now we have 4 workers, each producing an ordered set of 250,000 rows
Round 3: The above logic is repeated, and now we have just 2 workers, each producing an ordered list of 500,000 rows
Round 4: And finally, just one worker produces the final ordered set of 1,000,000 rows
Of course, depending on the number of rows and the number of available workers, the number of rounds can differ from the above example
In summary, what we have here:
a. Quite a huge amount of data is being transferred between workers - this can be quite a factor in performance going down
b. And there is a chance of one of the workers not being able to process the amount of data distributed to it. This can happen earlier or later and usually manifests as a "Resources exceeded ..." type of error
So, now, if you have LIMIT as a part of the query, as below:
SELECT * FROM table_with_1000000_rows ORDER BY some_field LIMIT 10
So, now, Round 1 is going to be the same. But starting with Round 2, ONLY the top 10 rows will be sent to another worker - thus in each round after the first one, only 20 rows will be processed and only the top 10 will be sent on for further processing
Hopefully you can see how different these two processes are in terms of the volume of data being sent between workers and how much work each worker needs to do to sort its data
To Summarize:
Without LIMIT 10:
• Initial rows moved (Round 1): 1,000,000;
• Initial rows ordered (Round 1): 1,000,000;
• Intermediate rows moved (Round 2 - 4): 1,500,000
• Overall merged ordered rows (Round 2 - 4): 1,500,000;
• Final result: 1,000,000 rows
With LIMIT 10:
• Initial rows moved (Round 1): 1,000,000;
• Initial rows ordered (Round 1): 1,000,000;
• Intermediate rows moved (Round 2 - 4): 70
• Overall merged ordered rows (Round 2 - 4): 140;
• Final result: 10 rows
Hopefully the above numbers clearly show the difference in performance you gain by using LIMIT N and, in some cases, even the ability to successfully run the query at all without the "Resources exceeded ..." error
This answer assumes you are asking about the difference between the following two variants:
ORDER BY totalRevenue
ORDER BY totalRevenue LIMIT 10
In many databases, if a suitable index existed involving totalRevenue, the LIMIT query could stop sorting after finding the top 10 records.
In the absence of any index, as you pointed out, both versions would have to do a full sort, and therefore should perform the same.
Also, there is a potentially major performance difference between the two if the table is large. In the LIMIT version, BigQuery only has to send across 10 records, while in the non-LIMIT version, potentially much more data has to be sent.
There is no performance gain. BigQuery still has to go through all the records in the table.
You can partition your data in order to cut the number of records that BigQuery has to read; that will increase performance (see the sketch after the link below). You can read more information here:
https://cloud.google.com/bigquery/docs/partitioned-tables
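For illustration, a hypothetical date-partitioned table definition (the project, dataset, table, and column names here are made up, not taken from the question):
CREATE TABLE `myproject.mydataset.company_performance`
(
  company_ID STRING,
  totalRevenue NUMERIC,
  report_date DATE
)
PARTITION BY report_date;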
See the difference in statistics in the BigQuery UI between the 2 queries below:
SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 1000
SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 10000
As you can see, BQ returns to the UI immediately after the limit criterion is reached; this results in better performance and less traffic on the network.
With Google BigQuery, I'm running a query with a group by and receive the error, "resources exceeded during query execution".
Would an increased quota allow the query to run?
Any other suggestions?
SELECT
ProductId,
StoreId,
ProductSizeId,
InventoryDate as InventoryDate,
avg(InventoryQuantity) as InventoryQuantity
FROM BigDataTest.denorm
GROUP EACH BY
ProductSizeId,
InventoryDate,
ProductId,
StoreId;
The table is around 250GB, project # is 883604934239.
With a combination of reducing the data involved and recent updates to BigQuery, this query now runs.
where ABS(HASH(ProductId) % 4) = 0
was used to reduce the 1.3 billion rows in the table (% 3 still failed).
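For example, the original query with that filter applied might look like this (same legacy-SQL GROUP EACH BY as in the question):
SELECT
  ProductId,
  StoreId,
  ProductSizeId,
  InventoryDate as InventoryDate,
  avg(InventoryQuantity) as InventoryQuantity
FROM BigDataTest.denorm
WHERE ABS(HASH(ProductId) % 4) = 0
GROUP EACH BY
  ProductSizeId,
  InventoryDate,
  ProductId,
  StoreId;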
With the test data set it gives "Error: Response too large to return in big query", which can be handled by writing the results out to a table: click Enable Options, choose 'Select Table' (and enter a table name), then check 'Allow Large Results'.