Google BigQuery charges for querying the full table if a subquery is used - google-bigquery

I have a partitioned table and am trying to limit my search to a few partitions. To do this I am running a query (using legacy SQL) that looks like the following:
SELECT
  *
FROM
  [project:dataset.table]
WHERE
  _PARTITIONTIME >= "2018-07-10 00:00:00"
  AND _PARTITIONTIME < "2018-07-11 00:00:00"
  AND col IN (
    SELECT
      col
    FROM
      [project:dataset.table]
    WHERE
      _PARTITIONTIME >= "2018-07-10 00:00:00"
      AND _PARTITIONTIME < "2018-07-11 00:00:00"
      AND col2 > 0)
I limit both the main query and the subquery using _PARTITIONTIME, so BigQuery should only need to scan those partitions. When I run this query, though, I get billed as if I had queried the entire table without using _PARTITIONTIME. Why does this happen?
UPDATE
The equivalent query using standard SQL does not have this problem, so use that as a workaround. I'd still like to know why this happens, though: is it just a bug, or does legacy SQL actually attempt to access all the data in the table for a query like this?
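For reference, the Standard SQL rewrite that avoided the problem looks roughly like this (a sketch using the same placeholder project, dataset, table, and column names as above):
#standardSQL
SELECT
  *
FROM
  `project.dataset.table`
WHERE
  _PARTITIONTIME >= TIMESTAMP("2018-07-10")
  AND _PARTITIONTIME < TIMESTAMP("2018-07-11")
  AND col IN (
    SELECT
      col
    FROM
      `project.dataset.table`
    WHERE
      _PARTITIONTIME >= TIMESTAMP("2018-07-10")
      AND _PARTITIONTIME < TIMESTAMP("2018-07-11")
      AND col2 > 0)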

As noted in the question, switching to #standardSQL is the right solution. You shouldn't expect any big updates to the legacy SQL dialect - while #standardSQL will keep getting some substantial ones.
Also note that there are two types of partitioned tables today (a DDL sketch of both follows the list):
Tables partitioned by ingestion time
Tables that are partitioned based on a TIMESTAMP or DATE column
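To make the distinction concrete, here is a rough standard SQL DDL sketch of both kinds; the dataset, table, and column names are made up for illustration:
-- 1. Ingestion-time partitioned: rows land in partitions by load time, and you
--    filter with the _PARTITIONTIME / _PARTITIONDATE pseudo columns.
CREATE TABLE mydataset.events_by_ingestion (
  user_id STRING,
  payload STRING
)
PARTITION BY _PARTITIONDATE;

-- 2. Column-partitioned: partitions come from a TIMESTAMP or DATE column, and
--    you filter directly on that column (e.g. datehour below).
CREATE TABLE mydataset.events_by_column (
  user_id STRING,
  payload STRING,
  datehour TIMESTAMP
)
PARTITION BY DATE(datehour);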
If you try to query the second type with legacy SQL:
SELECT COUNT(*)
FROM [fh-bigquery:wikipedia_v2.pageviews_2018]
WHERE datehour BETWEEN "2018-01-01 00:00:00" AND "2018-01-02 00:00:00"
you get the error "Querying tables partitioned on a field is not supported in Legacy SQL".
Meanwhile this works:
#standardSQL
SELECT COUNT(*)
FROM `fh-bigquery.wikipedia_v2.pageviews_2018`
WHERE datehour BETWEEN "2018-01-01 00:00:00" AND "2018-01-02 00:00:00"
I'm adding these points to enhance the message "it's time to switch to #standardSQL to get the best out of BigQuery".

I think this is a BigQuery Legacy SQL specific issue.
There is a documented list of cases in which pseudo column queries scan all partitions, and it explicitly mentions legacy SQL: "In legacy SQL, the _PARTITIONTIME filter works only when ..."
I don't see your exact case in that list, but the best approach here is simply to use Standard SQL.

Related

Querying a partitioned BigQuery table across multiple far-apart _PARTITIONDATE days?

I have a number of very big tables partitioned by _PARTITIONDATE that I'd like to query regularly and efficiently. Each time I run the query, I only need to search across a small number of dates, but these dates change every run and may be months or years apart from one another.
To capture these dates, I could do _PARTITIONDATE >= '2015-01-01', but this makes the queries run very slowly, as there are millions of rows in each partition. I could also do _PARTITIONDATE BETWEEN '2015-01-01' AND '2017-01-01', but the exact date range will change every run. What I'd like to do is something like _PARTITIONDATE IN ("2015-03-10", "2016-01-24", "2016-03-22", "2017-06-14") so that the query only needs to run on the dates provided, which from my testing appears to work.
The problem I'm running into is that the list of dates changes every run, which requires me to first join in the list of dates from a temp table. When I do that, e.g. source._PARTITIONDATE IN (datelist.date), it does not work: it errors out if that is the only WHERE condition when querying a table that requires a partition filter.
Any advice on how to get this to work, or another approach to querying specific partitions that aren't back to back, without having to scan the whole table?
I've been reading through the BigQuery documentation but I don't see an answer to this question. I do see it says that the following "doesn't limit the scanned partitions, because it uses table values, which are dynamic." So possibly what I'm trying to do is impossible with the current BQ limitations?
_PARTITIONTIME = (SELECT MAX(timestamp) from dataset.table1)
A script (BigQuery scripting) is a possible solution:
DECLARE max_date DEFAULT (SELECT MAX(...) FROM ...);
SELECT .... FROM ... WHERE _PARTITIONDATE = max_date;
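Fleshed out, a minimal version of that sketch might look like the following; the dataset, table, and column names are hypothetical, and the point is that the variable is resolved before the main query runs, so the partition filter can actually prune:
-- Resolve the date of interest into a scripting variable first.
DECLARE max_date DATE DEFAULT (
  SELECT MAX(event_date) FROM mydataset.staging_dates
);

-- By the time this runs, max_date is a constant, so only the matching
-- partition of the ingestion-time partitioned table is scanned.
SELECT *
FROM mydataset.big_table
WHERE _PARTITIONDATE = max_date;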

Optimize SELECT MAX(timestamp) query

I would like to run this query about once every 5 minutes so that I can run an incremental MERGE into another table.
SELECT MAX(timestamp) FROM dataset.myTable
-- timestamp is of type TIMESTAMP
My concern is that this will do a full scan of myTable on a regular basis.
What are the best practices for optimizing this query? Will partitioning help even though the SELECT MAX query has no date filter, or is it just the columnar nature of BigQuery that makes this optimal?
Thank you.
Instead of querying your table directly, you can query the INFORMATION_SCHEMA.PARTITIONS view within your dataset (see the BigQuery documentation for that view).
You can for instance go for:
SELECT LAST_MODIFIED_TIME
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE TABLE_NAME = "myTable"
The PARTITIONS view holds one metadata record per partition. It is therefore much smaller than your table, which makes it an easy way to cut query costs (it is also much faster to query).
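If what you ultimately want is the freshest partition rather than the modification time, one possible sketch (assuming the table is partitioned by day on the timestamp column; the project, dataset, and table names are placeholders) is to read the newest partition_id from the metadata and then scan only that partition:
-- Newest daily partition, taken from metadata (partition_id is 'YYYYMMDD').
DECLARE latest DATE DEFAULT (
  SELECT PARSE_DATE('%Y%m%d', MAX(partition_id))
  FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
  WHERE table_name = 'myTable'
    AND partition_id NOT IN ('__NULL__', '__UNPARTITIONED__')
);

-- Only the newest partition is scanned for the real MAX(timestamp).
SELECT MAX(timestamp) AS max_ts
FROM `project.dataset.myTable`
WHERE DATE(timestamp) = latest;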

Do I need to use the pseudo column _PARTITIONTIME when querying from a column partitioned table?

I created a time-partitioned table on BigQuery by using a date column from the table itself:
new_table.time_partitioning = bigquery.TimePartitioning(field='date')
I query the data by a simple request as follows:
SELECT * FROM t where date="2020-04-08"
My question is whether this is sufficient to take advantage of the partitioning, and thus reduce costs, or whether I also need to use the _PARTITIONTIME pseudo column as outlined in the section on Querying Partitioned Tables?
SELECT * FROM t where _PARTITIONTIME = TIMESTAMP("2020-04-08")
The quick answer is that SELECT * FROM t WHERE date="2020-04-08" is enough to engage partition pruning and reduce cost.
The longer answer is to always check the UI (the bytes-processed estimate) to confirm that the partition filter is actually engaged for a given query, for example:
SELECT * FROM `bigquery-public-data.crypto_bitcoin.transactions`
WHERE block_timestamp_month >= "2020-01-01"
(The original answer included screenshots of the bytes-processed estimate for a "This month" filter and for a "Year to date" filter on this query.)

BigQuery, date partitioned tables and decorator

I am familiar with using table decorators to query a table, for example, as it was a week ago or for data inserted over a certain date range.
The introduction of date-partitioned tables revealed a pseudo column called _PARTITIONTIME. Using the date decorator syntax, you can add records to a specific partition of the table.
I was wondering whether the pseudo column _PARTITIONTIME is also used, behind the scenes, to support table decorators, or something similarly straightforward.
If yes, can it be accessed/changed, as we do with the pseudo column of partitioned tables?
Is it called _PARTITIONTIME or _INSERTIONTIME? Of course, both didn't work. :)
First, check whether the table is indeed partitioned by reading out its partitions:
SELECT TIMESTAMP(partition_id)
FROM [dataset.partitioned_table$__PARTITIONS_SUMMARY__]
If it is not, you will get the error "Cannot read partition information from a table that is not partitioned".
Another important point: to select the value of _PARTITIONTIME, you must use an alias.
SELECT
_PARTITIONTIME AS pt,
field1
FROM
mydataset.table1
The alias is not mandatory when _PARTITIONTIME is used in the WHERE clause, only when it appears in the SELECT list:
#legacySQL
SELECT
field1
FROM
mydataset.table1
WHERE
_PARTITIONTIME > DATE_ADD(TIMESTAMP('2016-04-15'), -5, "DAY")
You can also always reference a single partition of the table with a decorator: mydataset.table$20160519
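For instance, a short legacy SQL sketch using the same example names:
#legacySQL
SELECT
  field1
FROM
  [mydataset.table$20160519]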

How do I use the TABLE_QUERY() function in BigQuery?

A couple of questions about the TABLE_QUERY function:
The examples show using table_id in the query string, are there other fields available?
It seems difficult to debug. I'm getting "error evaluating subsidiary query" when I try to use it.
How does TABLE_QUERY() work?
The TABLE_QUERY() function allows you to write a SQL WHERE clause that is evaluated to find which tables to run the query over. For instance, you can run the following query to count the rows in all tables in the publicdata:samples dataset that are older than 7 days:
SELECT count(*)
FROM TABLE_QUERY(publicdata:samples,
  "MSEC_TO_TIMESTAMP(creation_time) < DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY')")
Or you can run this to query over all tables that have ‘git’ in the name (which are the github_timeline and the github_nested sample tables) and find the most common urls:
SELECT url, COUNT(*) AS cnt
FROM TABLE_QUERY(publicdata:samples, "table_id CONTAINS 'git'")
GROUP EACH BY url
ORDER BY cnt DESC
LIMIT 100
Despite being very powerful, TABLE_QUERY() can be difficult to use. The WHERE clause must be specified as a string, which can be a little bit awkward. Moreover, it can be difficult to debug, since when there is a problem, you only get the error “Error evaluating subsidiary query”, which isn’t always helpful.
How it works:
TABLE_QUERY() essentially executes two queries. When you run TABLE_QUERY(<dataset>, <table_query>), BigQuery executes SELECT table_id FROM <dataset>.__TABLES_SUMMARY__ WHERE <table_query> to get the list of table IDs to run the query on, then it executes your actual query over those tables.
The __TABLES_SUMMARY__ portion of that query may look unfamiliar. __TABLES_SUMMARY__ is a meta-table containing information about the tables in a dataset. You can use this meta-table yourself. For example, the query SELECT * FROM publicdata:samples.__TABLES_SUMMARY__ will return metadata about the tables in the publicdata:samples dataset.
Available Fields:
The fields of the __TABLES_SUMMARY__ meta-table (that are all available in the TABLE_QUERY query) include:
table_id: name of the table.
creation_time: time, in milliseconds since 1/1/1970 UTC, that the table was created. This is the same as the creation_time field on the table.
type: whether it is a view (2) or regular table (1).
The following fields are not available in TABLE_QUERY() since they are members of __TABLES__ but not __TABLES_SUMMARY__. They're kept here for historical interest and to partially document the __TABLES__ metatable:
last_modified_time: time, in milliseconds since 1/1/1970 UTC, that the table was last updated (either metadata or table contents). Note that if you use tabledata.insertAll() to stream records to your table, this might be a few minutes out of date.
row_count: number of rows in the table.
size_bytes: total size in bytes of the table.
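For the historical record, the __TABLES__ metatable can still be queried directly; a hedged legacy SQL sketch against the same public dataset:
#legacySQL
SELECT
  table_id,
  row_count,
  size_bytes,
  MSEC_TO_TIMESTAMP(last_modified_time) AS last_modified
FROM
  [publicdata:samples.__TABLES__]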
How to debug
In order to debug your TABLE_QUERY() queries, you can do the same thing that BigQuery does; that is, you can run the metatable query yourself. For example:
SELECT * FROM publicdata:samples.__TABLES_SUMMARY__
WHERE MSEC_TO_TIMESTAMP(creation_time) <
DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY')
lets you not only debug your query but also see what tables would be returned when you run the TABLE_QUERY function. Once you have debugged the inner query, you can put it together with your full query over those tables.
Alternative answer, for those moving forward to Standard SQL:
BigQuery Standard SQL doesn't support TABLE_QUERY, but it supports * expansion for table names (wildcard tables).
When expanding table names with *, you can use the meta-column _TABLE_SUFFIX to narrow the selection.
Table expansion with * only works when all the matched tables have compatible schemas.
For example, to get the average worldwide NOAA GSOD temperature between 2010 and 2014:
#standardSQL
SELECT AVG(temp) avg_temp, _TABLE_SUFFIX y
FROM `bigquery-public-data.noaa_gsod.gsod20*` #every year that starts with "20"
WHERE _TABLE_SUFFIX BETWEEN "10" AND "14" #only years between 2010 and 2014
GROUP BY y
ORDER BY y