Does BigQuery charge for querying only the streaming buffer?

I have a day-partitioned table with approximately 300k rows in the streaming buffer. When I run an interactive, non-cached, standard SQL query of the form
SELECT .. FROM .. WHERE _PARTITIONTIME IS NULL
the query validator says:
Valid: This query will process 0 B when run.
And after executing, the job information tab says:
Bytes Processed 0 B
Bytes Billed 0 B
The query is certainly returning real-time results each time I run it. Is this actually a free operation?
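One way to double-check what the validator shows, without executing anything, is a dry run from the command line. A minimal sketch, assuming a hypothetical day-partitioned table myproject.mydataset.mytable:
bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `myproject.mydataset.mytable` WHERE _PARTITIONTIME IS NULL'
A dry run only reports the estimated bytes processed; it does not execute or bill the query.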

Related

Are LIMIT 0 queries billed in BigQuery?

I was thinking of adding LIMIT 0 to BigQuery queries and using this together with DBT to check that the whole DAG, dependencies and so on are correct without incurring costs.
I can't find any official documentation stating whether such queries are free.
Are those queries not billed?
Correct, this will not bill any data. You can run a dry run to verify:
dzagales@cloudshell:~ (elzagales)$ bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `bigquery-public-data.austin_311.311_service_requests` LIMIT 0'
Query successfully validated. Assuming the tables are not modified, running this query will process 0 bytes of data.
dzagales@cloudshell:~ (elzagales)$ bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `bigquery-public-data.austin_311.311_service_requests` LIMIT 1'
Query successfully validated. Assuming the tables are not modified, running this query will process 254787 bytes of data.
Above you can see that a LIMIT 0 query processes 0 bytes, while a LIMIT 1 still scans the whole table.
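For the dbt use case mentioned in the question, one option is to dry-run the compiled SQL rather than executing LIMIT 0 queries. A sketch, assuming dbt's default compiled output directory and a hypothetical project and model name:
dbt compile
bq query --use_legacy_sql=false --dry_run "$(cat target/compiled/my_project/models/my_model.sql)"
The dry run validates the statement against the current table schemas and reports estimated bytes without running or billing anything.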

Why does selecting more result fields double the data scanned in BigQuery?

I have a table with two integer fields, x and y, and a few million rows.
The fields are created with the following code:
Field.newBuilder("x", LegacySQLTypeName.INTEGER).setMode(Field.Mode.NULLABLE).build();
If I run the following from the web UI:
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 64.9 MB when run."
compared to:
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: " Valid: This query will process 32.4 MB when run."
It scans more than double of the original data scanned.
I would expect it to first find the relevant rows based on the WHERE clause and then fetch the extra field only for those rows, without scanning the entire second column.
Any input on why it doubles the data scanned, and how to avoid it, would be appreciated.
In my application I have hundreds of possible fields which I need to fetch for a very small number of rows (50) that satisfy the query.
Does this mean I will need to process the data of all of those fields?
* I'm aware of how a columnar database works, but I wasn't aware of the high price of bringing back lots of fields based on a very selective WHERE clause.
The following link provides a very clear answer:
best-practices-performance-input
BigQuery does not have a concept of an index or anything like that. When you query a column, BigQuery scans through all the values of that column and then performs the operations you want (for a deeper understanding, they have some pretty cool posts about the inner workings of BQ).
That means that when you select x and y where x = 1, BQ will read through all values of x and y and then find the rows where x = 1.
This ends up being an amazing feature of BQ: you just load your data there and it just works. It does force you to be aware of how much data each query retrieves. Queries of the type select * from table should be used only if you really need all columns.
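You can see this behaviour directly with dry runs. A sketch using the table names from the question (legacy SQL, as in the original queries; only the estimates matter here):
bq query --dry_run 'SELECT x FROM [myproject:Test.Test] WHERE x = 1 LIMIT 50'
bq query --dry_run 'SELECT x, y FROM [myproject:Test.Test] WHERE x = 1 LIMIT 50'
The second estimate comes out roughly double the first, because both integer columns are read in full; the WHERE clause and the LIMIT do not reduce the bytes scanned.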

BigQuery returning "query too large" on short query

I've been trying to run this query:
SELECT
created
FROM
TABLE_DATE_RANGE(
program1_insights.insights_,
TIMESTAMP('2016-01-01'),
TIMESTAMP('2016-02-09')
)
LIMIT
10
And BigQuery complains that the query is too large.
I've experimented with writing the table names out manually:
SELECT
created
FROM program1_insights.insights_20160101,
program1_insights.insights_20160102,
program1_insights.insights_20160103,
program1_insights.insights_20160104,
program1_insights.insights_20160105,
program1_insights.insights_20160106,
program1_insights.insights_20160107,
program1_insights.insights_20160108,
program1_insights.insights_20160109,
program1_insights.insights_20160110,
program1_insights.insights_20160111,
program1_insights.insights_20160112,
program1_insights.insights_20160113,
program1_insights.insights_20160114,
program1_insights.insights_20160115,
program1_insights.insights_20160116,
program1_insights.insights_20160117,
program1_insights.insights_20160118,
program1_insights.insights_20160119,
program1_insights.insights_20160120,
program1_insights.insights_20160121,
program1_insights.insights_20160122,
program1_insights.insights_20160123,
program1_insights.insights_20160124,
program1_insights.insights_20160125,
program1_insights.insights_20160126,
program1_insights.insights_20160127,
program1_insights.insights_20160128,
program1_insights.insights_20160129,
program1_insights.insights_20160130,
program1_insights.insights_20160131,
program1_insights.insights_20160201,
program1_insights.insights_20160202,
program1_insights.insights_20160203,
program1_insights.insights_20160204,
program1_insights.insights_20160205,
program1_insights.insights_20160206,
program1_insights.insights_20160207,
program1_insights.insights_20160208,
program1_insights.insights_20160209
LIMIT
10
And not surprisingly, BigQuery returns the same error.
This Q&A says that "query too large" means that BigQuery is generating an internal query that's too large to be processed. But in the past, I've run queries over way more than 40 tables with no problem.
My question is: what is it about this query in particular that's causing this error, when other, larger-seeming queries run fine? Is it that doing a single union over this number of tables is not supported?
To answer the question of what it is about this query in particular that's causing the error:
The problem is not in the query itself. The query looks good.
I just ran a similar query against ~400 daily tables, 5.8 billion rows and 5.7 TB in total, with this result:
Query complete (150.0s elapsed, 21.7 GB processed)
SELECT
Timestamp
FROM
TABLE_DATE_RANGE(
MyEvents.Events_,
TIMESTAMP('2015-01-01'),
TIMESTAMP('2016-02-12')
)
LIMIT
10
You should look elsewhere for the cause. By the way, are you sure you are not over-simplifying the query in your question?
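As an aside, if standard SQL is an option, the same date range can be written with a wildcard table instead of TABLE_DATE_RANGE or a long comma-separated list of tables. A sketch, assuming the daily tables share the insights_ prefix as in the question:
SELECT
created
FROM
`program1_insights.insights_*`
WHERE
_TABLE_SUFFIX BETWEEN '20160101' AND '20160209'
LIMIT
10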

BigQuery resources exceeded during query execution: is it a quota issue?

With Google BigQuery, I'm running a query with a group by and receive the error, "resources exceeded during query execution".
Would an increased quota allow the query to run?
Any other suggestions?
SELECT
ProductId,
StoreId,
ProductSizeId,
InventoryDate as InventoryDate,
avg(InventoryQuantity) as InventoryQuantity
FROM BigDataTest.denorm
GROUP EACH BY
ProductSizeId,
InventoryDate,
ProductId,
StoreId;
The table is around 250GB, project # is 883604934239.
Through a combination of reducing the data involved and recent updates to BigQuery, this query now runs.
where ABS(HASH(ProductId) % 4) = 0
was used to reduce the 1.3 billion rows in the table (% 3 still failed).
With the test data set it gives "Error: Response too large to return in big query", which can be handled by writing the results out to a table: click Enable Options, choose 'Select Table' (and enter a table name), then check 'Allow Large Results'.
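Putting those two workarounds together on the command line, a sketch (the destination table name here is made up; the sampling filter is the one quoted above):
bq query --allow_large_results --destination_table BigDataTest.denorm_avg \
'SELECT ProductId, StoreId, ProductSizeId, InventoryDate, AVG(InventoryQuantity) AS InventoryQuantity
FROM BigDataTest.denorm
WHERE ABS(HASH(ProductId) % 4) = 0
GROUP EACH BY ProductSizeId, InventoryDate, ProductId, StoreId'
Writing to a destination table with Allow Large Results lifts the response-size limit, and the hash filter keeps the data grouped in one pass to roughly a quarter of the table.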

Oracle & WCF Data Service: No query results returned to DS until entire query completed

I've been troubleshooting a major performance issue in an SL4 application that uses the Entity Framework and a WCF Data Service to query some simple, mid-sized (~10M records) tables, and I've finally made some progress.
A query sent from the data service for either 100 rows or a simple filter that returns >100 rows for a simple table takes 5 minutes to return anything at all. The same query run in SQL Developer returns the first 50 rows virtually instantaneously.
Examining the trace logs of these queries reveals the exact same execution plan and overall elapsed times. The difference is that nothing is returned to the data service until the entire query has executed, whereas SQL Developer gets the first 100 right away.
I thought paging might be the solution: config.SetEntitySetPageSize("*", 5);
But the trace log shows the first batch of rows being grabbed immediately before Oracle moves on to the next statement:
.... SIMPLE SQL QUERY ...
PARSE #2:c=78000,e=100606,p=0,cr=3,cu=0,mis=1,r=0,dep=0,og=1,tim=205746975549
EXEC #2:c=0,e=226,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=205747109728
FETCH #2:c=46800,e=42299,p=0,cr=7134,cu=0,mis=0,r=20,dep=0,og=1,tim=205747229491
=====================
PARSING IN CURSOR #3 len=402 dep=1 uid=0 oct=3 lid=0 tim=205747256638 hv=3607805727 ad='18f64810'
select parttype, partcnt, partkeycols, flags, defts#, defpctfree, defpctused, definitrans, defmaxtrans, deftiniexts, defextsize, defminexts, defmaxexts, defextpct, deflists, defgroups, deflogging, spare1, mod(spare2, 256) subparttype, mod(trunc(spare2/256), 256) subpartkeycols, mod(trunc(spare2/65536), 65536) defsubpartcnt, mod(trunc(spare2/4294967296), 256) defhscflags from partobj$ where obj# = :1
END OF STMT
At the end of the trace log, the rest of the rows are fetched:
....
END OF STMT
EXEC #3:c=0,e=15,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=205747262472
FETCH #3:c=0,e=24,p=0,cr=3,cu=0,mis=0,r=1,dep=1,og=4,tim=205747263090
FETCH #2:c=78001,e=81705,p=0,cr=14014,cu=0,mis=0,r=40,dep=0,og=1,tim=205747515725
FETCH #2:c=171601,e=162799,p=0,cr=27712,cu=0,mis=0,r=80,dep=0,og=1,tim=205747887738
FETCH #2:c=343202,e=328508,p=0,cr=55584,cu=0,mis=0,r=160,dep=0,og=1,tim=205748620538
...
*** 2012-03-21 19:34:10 2012
XCTEND rlbk=0, rd_only=1
as before, and still nothing is returned to the data service until they are all FETCH'ed.
The question is: how do you make Oracle send the first page of results back to the data service immediately?
a) Your query does not ask for only 100 rows!
b) You have no ORDER BY, so how do you define which rows come first anyway?
Some Oracle options:
A /*+ FIRST_ROWS */ hint in your query will return the first rows as early as it can, but will still return the entire set.
WHERE ROWNUM < 100
will stop the query after 100 rows are emitted. You need an ORDER BY for this to make sense.
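Combining the two ideas, the usual Oracle top-N pattern is to order the rows in an inline view and filter on ROWNUM outside it. A sketch with made-up table and column names:
SELECT *
FROM (
  SELECT /*+ FIRST_ROWS(100) */ t.*
  FROM my_table t
  ORDER BY t.created_date
)
WHERE ROWNUM <= 100;
The ORDER BY inside the view defines which 100 rows count as the first, and the outer ROWNUM filter lets Oracle stop fetching once they have been produced.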
Ask Tom