BigQuery returning "query too large" on short query

I've been trying to run this query:
SELECT
created
FROM
TABLE_DATE_RANGE(
program1_insights.insights_,
TIMESTAMP('2016-01-01'),
TIMESTAMP('2016-02-09')
)
LIMIT
10
And BigQuery complains that the query is too large.
I've experimented with writing the table names out manually:
SELECT
created
FROM program1_insights.insights_20160101,
program1_insights.insights_20160102,
program1_insights.insights_20160103,
program1_insights.insights_20160104,
program1_insights.insights_20160105,
program1_insights.insights_20160106,
program1_insights.insights_20160107,
program1_insights.insights_20160108,
program1_insights.insights_20160109,
program1_insights.insights_20160110,
program1_insights.insights_20160111,
program1_insights.insights_20160112,
program1_insights.insights_20160113,
program1_insights.insights_20160114,
program1_insights.insights_20160115,
program1_insights.insights_20160116,
program1_insights.insights_20160117,
program1_insights.insights_20160118,
program1_insights.insights_20160119,
program1_insights.insights_20160120,
program1_insights.insights_20160121,
program1_insights.insights_20160122,
program1_insights.insights_20160123,
program1_insights.insights_20160124,
program1_insights.insights_20160125,
program1_insights.insights_20160126,
program1_insights.insights_20160127,
program1_insights.insights_20160128,
program1_insights.insights_20160129,
program1_insights.insights_20160130,
program1_insights.insights_20160131,
program1_insights.insights_20160201,
program1_insights.insights_20160202,
program1_insights.insights_20160203,
program1_insights.insights_20160204,
program1_insights.insights_20160205,
program1_insights.insights_20160206,
program1_insights.insights_20160207,
program1_insights.insights_20160208,
program1_insights.insights_20160209
LIMIT
10
And not surprisingly, BigQuery returns the same error.
This Q&A says that "query too large" means that BigQuery is generating an internal query that's too large to be processed. But in the past, I've run queries over way more than 40 tables with no problem.
My question is: what is it about this query in particular that's causing this error, when other, larger-seeming queries run fine? Is it that doing a single union over this number of tables is not supported?

Answering the question: what is it about this query in particular that's causing this error?
The problem is not in the query itself; the query looks fine.
I just ran a similar query against ~400 daily tables totaling 5.8 billion rows (5.7 TB), and it finished with:
Query complete (150.0s elapsed, 21.7 GB processed)
SELECT
Timestamp
FROM
TABLE_DATE_RANGE(
MyEvents.Events_,
TIMESTAMP('2015-01-01'),
TIMESTAMP('2016-02-12')
)
LIMIT
10
You should look elsewhere for the cause. By the way, are you sure you are not over-simplifying the query in your question?
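As an aside, if standard SQL is an option for you, wildcard tables are the replacement for TABLE_DATE_RANGE. A minimal sketch of the equivalent query, assuming the same dataset and table-name prefix as in the question:
#standardSQL
SELECT created
FROM `program1_insights.insights_*`
WHERE _TABLE_SUFFIX BETWEEN '20160101' AND '20160209'  -- prunes to the same date range
LIMIT 10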

Related

Get the first row of a nested field in BigQuery

I have been struggling with a question that seems simple, yet eludes me.
I am dealing with the public BigQuery table on bitcoin, and I would like to extract the first transaction of each block that was mined. In other words, I want to replace a nested field by its first row, as it appears in the table preview. There is no field that can identify it, only the order in which it was stored in the table.
I ran the following query:
#standardSQL
SELECT timestamp,
block_id,
FIRST_VALUE(transactions) OVER (ORDER BY (SELECT 1))
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
But it processes 492 GB when run and throws the following error:
Error: Resources exceeded during query execution: The query could not be executed in the allotted memory. Sort operator used for OVER(ORDER BY) used too much memory..
It seems so simple, I must be missing something. Do you have an idea about how to handle such task?
Instead of sorting the whole table with a window function, you can pull the first element of each row's array with a correlated subquery, which avoids the global sort entirely:
#standardSQL
SELECT * EXCEPT(transactions),
(SELECT transaction FROM UNNEST(transactions) transaction LIMIT 1) transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
Recommendation: while playing with a large table like this one, I would recommend creating a smaller version of it, so your dev/test work incurs less cost. The query below can help with this: run it in the BigQuery UI with a destination table set, and then use that table for your development. Make sure you set Allow Large Results and unset Flatten Results so you preserve the original schema.
#legacySQL
SELECT *
FROM [bigquery-public-data:bitcoin_blockchain.blocks#1529518619028]
The value of 1529518619028 was taken from the query below (at the time of running). The reason I went four days back is that I know the number of rows in this table at that time was just 912, vs. the current 528,858:
#legacySQL
SELECT INTEGER(DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -24*4, 'HOUR')/1000)
An alternative approach to Mikhail's: just ask for the first row of the array with [OFFSET(0)]:
#standardSQL
SELECT timestamp,
block_id,
transactions[OFFSET(0)] first_transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 10
That first row from the array still has some nested data that you might want to flatten to its first row too:
#standardSQL
SELECT timestamp
, block_id
, transactions[OFFSET(0)].transaction_id first_transaction_id
, transactions[OFFSET(0)].inputs[OFFSET(0)] first_transaction_first_input
, transactions[OFFSET(0)].outputs[OFFSET(0)] first_transaction_first_output
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 1000
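One caveat worth adding: OFFSET(0) raises an error if a row's array is empty. If empty transactions arrays are a possibility in your data (an assumption, not something the question states), SAFE_OFFSET(0) returns NULL instead:
#standardSQL
SELECT timestamp,
block_id,
transactions[SAFE_OFFSET(0)] first_transaction  -- NULL, not an error, when the array is empty
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 10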

MS-Access SQL Query Performance - Slow

I have a query that pulls data and joins from other queries. The final query is taking about 8 minutes to run. Is there a more efficient way to query the information from queries built on queries (such as creating a table with the results of the first queries OR building an index - OR ??).
My final query is as simple as:
SELECT Filtered_ZFEWN.[Base 8], Filtered_ZFEWN.Notification,
       Filtered_ZFEWN.[Service Product], Filtered_ZFEWN.[Product Hierarchy]
FROM Filtered_ZFEWN
RIGHT JOIN Notifications_by_Base_8
  ON Filtered_ZFEWN.[Base 8] = Notifications_by_Base_8.[ZFEWN Base 8]
WHERE Notifications_by_Base_8.[Product Hierarchy] IN
  (SELECT Notifications_by_Base_8.[Product Hierarchy]
   FROM Notifications_by_Base_8
   WHERE Notifications_by_Base_8.[Product Hierarchy] NOT LIKE "*MISC*");
This query is pulling data from 6 other queries (you can see it explicitly pulls from two queries, but the other queries are also built upon 4 queries). I am looking for performance improvement.
By adding the wildcard at the beginning of
Like "*MISC*",
you are stopping Access from using any indexes on the subquery.
This will slow it down significantly on a larger dataset.
Can you move that filtering earlier in the query chain, or remove the leading wildcard? Or build an IN clause without the NOT?
Isn't the subquery (the query after the IN keyword) redundant, always returning the same result? – Peeyush
@Peeyush You are correct - that is the answer. I removed the SELECT subquery and kept only the NOT LIKE "misc" portion, and this ran in 8 seconds. Thank you!!! – Analyst123456789
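For concreteness, the rewrite described in that comment presumably looks like this (a sketch reconstructed from the original query, not the asker's verbatim final SQL):
SELECT Filtered_ZFEWN.[Base 8], Filtered_ZFEWN.Notification,
       Filtered_ZFEWN.[Service Product], Filtered_ZFEWN.[Product Hierarchy]
FROM Filtered_ZFEWN
RIGHT JOIN Notifications_by_Base_8
  ON Filtered_ZFEWN.[Base 8] = Notifications_by_Base_8.[ZFEWN Base 8]
WHERE Notifications_by_Base_8.[Product Hierarchy] NOT LIKE "*MISC*";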

Access - Match Column Range / Complex Join

How do we match columns based on a condition of closeness to a value?
This requires a complex query with range comparisons and multiple join conditions.
I am getting a "Query size exceeded 2GB" error.
Tables:
InvDetails1 / InvDetails2 / INVDL / ExpectedResult
Field relations:
InvDetails1.F1 = InvDetails2.F3
InvDetails2.F5 = INVDL.F1
INVDL.DLID = ExpectedResult.DLID
ExpectedResult.Total - 1 < InvDetails1.F6 < ExpectedResult.Total + 1
left(InvDetails1.F21,10) = '2013-03-07'
Return results where the number of matching records from ExpectedResult is exactly 1. Grouping by InvDetails1.F1 with count(ExpectedResult.DLID) works for that.
From this result, the final output should be:
InvDetails1.F1, InvDetails1.F16, ExpectedResult.DLID, ExpectedResult.NMR
ExpectedResult has millions of rows; the InvDetails tables have a few hundred thousand.
If I were in that situation and found that my query was "hitting a wall" at 2GB, one thing I would try would be to create a separate saved Select query to isolate the InvDetails1 records for just the specific date in question. Then I would use that query instead of the full InvDetails1 table when joining to the other tables.
My reasoning is that the query optimizer may not be able to use your left(InvDetails1.F21,10) = '2013-03-07' condition to exclude InvDetails1 records early in the execution plan, possibly causing the query to grow much larger than it really needs to (internally, while it is being processed). Forcing the date selection to the beginning of the process by putting it in a separate (prerequisite) query may keep the size of the "main" query down to a more feasible size.
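A minimal sketch of that prerequisite query (the saved query name is my invention; the table, field, and date come from the question):
-- Save as, say, qryInvDetails1_OneDay, then join it in place of InvDetails1
SELECT InvDetails1.*
FROM InvDetails1
WHERE left(InvDetails1.F21, 10) = '2013-03-07';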
Also, if I found myself in the situation where my queries were getting that big I would also keep a watchful eye on the size of my .accdb (or .mdb) file to ensure that it does not get too close to 2GB. I've never had it happen myself, but I've heard that database files that hit the 2GB barrier can result in some nasty errors and be rather "challenging" to recover.

Bigquery resources exceeded during query execution, quota?

With Google BigQuery, I'm running a query with a group by and receive the error, "resources exceeded during query execution".
Would an increased quota allow the query to run?
Any other suggestions?
SELECT
ProductId,
StoreId,
ProductSizeId,
InventoryDate as InventoryDate,
avg(InventoryQuantity) as InventoryQuantity
FROM BigDataTest.denorm
GROUP EACH BY
ProductSizeId,
InventoryDate,
ProductId,
StoreId;
The table is around 250GB, project # is 883604934239.
With a combination of reducing the data involved and recent updates to BigQuery, this query now runs. A filter of
where ABS(HASH(ProductId) % 4) = 0
was used to reduce the 1.3 billion rows in the table (% 3 still failed).
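Putting those pieces together, the reduced query presumably looked something like this (a sketch; the exact placement of the filter is my assumption):
SELECT
ProductId,
StoreId,
ProductSizeId,
InventoryDate as InventoryDate,
avg(InventoryQuantity) as InventoryQuantity
FROM BigDataTest.denorm
WHERE ABS(HASH(ProductId) % 4) = 0  -- keeps roughly a quarter of the ProductIds
GROUP EACH BY
ProductSizeId,
InventoryDate,
ProductId,
StoreId;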
With the test data set it gives "Error: Response too large to return in big query", which can be handled by writing the results out to a table: click Enable Options, choose 'Select Table' (and enter a table name), then check 'Allow Large Results'.

Vague count in sql select statements

I guess this has been asked on the site before, but I can't find it.
I've seen on some sites that there is a vague count over the results of a search. For example, here on Stack Overflow, when you search for a question, it sometimes says 5000+ results; in Gmail, when you search by keywords, it says "hundreds"; and Google says approx. X results. Is this just a way to show the user an easy-to-understand version of a huge number, or is it actually a fast way to count results that can be used in a database [I'm learning Oracle at the moment, 10g version]? Something like "hey, if you get more than 1k results, just stop and tell me there are more than 1k".
Thanks
PS. I'm new to databases.
Usually this is just a nice way to display a number.
I don't believe there is a way to do what you are asking for in SQL - count does not have an option for counting up until some number.
I also would not assume this is coming from SQL in either Gmail or Stack Overflow.
Most search engines will return a total number of matches to a search, and then let you page through results.
As for making an exact number more human readable, here is an example from Rails:
http://api.rubyonrails.org/classes/ActionView/Helpers/NumberHelper.html#method-i-number_to_human
With Oracle, you can always resort to analytical functions in order to calculate the exact number of rows about to be returned. This is an example of such a query:
SELECT inner.*, MAX(ROWNUM) OVER(PARTITION BY 1) AS TOTAL_ROWS
FROM (
  [... your own, sorted search query ...]
) inner
This will give you the total number of rows for your specific subquery. When you want to apply paging as well, you can further wrap these SQL parts as such:
SELECT outer.* FROM (
  SELECT * FROM (
    SELECT inner.*, ROWNUM AS RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) AS TOTAL_ROWS
    FROM (
      [... your own, sorted search query ...]
    ) inner
  )
  WHERE ROWNUM < :max_row
) outer
WHERE outer.RNUM > :min_row
Replace min_row and max_row by meaningful values. But beware that calculating the exact number of rows can be expensive when you're not filtering using UNIQUE SCAN or relatively narrow RANGE SCAN operations on indexes. Read more about this here: Speed of paged queries in Oracle
As others have said, you can always put an absolute upper limit, such as 5000, on your query using a ROWNUM <= 5000 filter and then just indicate that there are 5000+ results. Note that Oracle can be very good at optimising queries when you apply ROWNUM filtering. Find some info on that subject here:
http://www.dba-oracle.com/t_sql_tuning_rownum_equals_one.htm
A vague count is a buffer that can be displayed promptly; if the user wants to see more results, they can request more.
It's a performance facility: after displaying the first results, sites like Google keep searching for more in the background.
I don't know how fast this will run, but you can try:
SELECT NULL FROM your_tables WHERE your_condition AND ROWNUM <= 1001
If the count of rows in the result equals 1001, then the total count of records is greater than 1000.
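Wrapped in a COUNT, that becomes a capped count you can test directly (same hypothetical your_tables / your_condition placeholders as above):
SELECT COUNT(*) AS capped_count  -- display "1000+ results" when this comes back as 1001
FROM (SELECT NULL
      FROM your_tables
      WHERE your_condition
      AND ROWNUM <= 1001)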
This question gives some pretty good information.
When you run an SQL query you can set, for example,
LIMIT 0, 100
(MySQL-style syntax), and you will only get the first hundred rows, so you can then tell your viewer that there are 100+ answers to their request.
As for Google, I couldn't say whether they really know there are more than 27'000'000'000 answers to a request, but I believe they really do. There are some standard requests whose results are stored and where the update is done in the background.