Google BigQuery optimization with subquery in WHERE clause - google-bigquery

I am attempting to set up a query that selects a subset of data from a range of daily partitions of Google Analytics session data and writes the data to a Google BigQuery staging table. The challenge for me is to reduce the processing cost when using a subquery in the WHERE clause.
Google Analytics data from the query are to be appended to a staging table before being processed and loaded into the target data table (my-data-table). The main query is given in two forms below. The first is hard-coded. The second reflects the preferred form. The upper bound on _TABLE_SUFFIX is hard-coded for both to simplify the query. The objective is to use MAX(date), where date has the form YYYYMMDD, from my-data-table as a lower bound on the ga_sessions_* daily partitions. The query has been simplified for presentation here but is believed to contain all necessary elements.
The aggregate query (SELECT MAX(date) FROM my-project-12345.dataset.my-data-table) returns the value '20201015' and processes 202 KB. Depending upon whether I use the returned value explicitly (as '20201015') in the WHERE clause of the main query or use the SELECT MAX() query in the WHERE clause, there is a significant difference in data processed between the two queries (2.3 GB for the explicit value vs 138.1 GB for the SELECT MAX() expression).
Is there an optimization, plan, or directive that can be applied to the preferred form of the main query that will reduce the data processing cost? Thank you for any assistance that can be provided.
Main Query (hard-coded version, processes 2.3 GB)
SELECT
GA.date,
GA.field1,
hits.field2,
hits.field3
FROM
`my-project-12345.dataset.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE
hits.type IN ('PAGE', 'EVENT')
AND hits.field0 = 'some value'
AND _TABLE_SUFFIX > '20201015'
AND _TABLE_SUFFIX < '20201025'
Main Query (preferred form, processes 138.1 GB without optimization)
SELECT
GA.date,
GA.field1,
hits.field2,
hits.field3
FROM
`my-project-12345.dataset.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE
hits.type IN ('PAGE', 'EVENT')
AND hits.field0 = 'some value'
AND _TABLE_SUFFIX > (SELECT MAX(date) FROM `my-project-12345.dataset.my-data-table`)
AND _TABLE_SUFFIX < '20201025'

You can use BigQuery scripting for this.
The "trick" is in pre-computing the lower bound:
DECLARE start_date STRING;
SET start_date = (SELECT MAX(date) FROM `my-project-12345.dataset.my-data-table`);
Assign the result to a variable, then use that variable in the WHERE clause of the main query; because the variable is a constant by the time the main query runs, BigQuery uses the cost-effective version:
AND _TABLE_SUFFIX > start_date
AND _TABLE_SUFFIX < '20201025'
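Putting the snippets together, the complete script (using the simplified names from the question) would look like this:

```sql
-- Pre-compute the lower bound once; this small aggregate scans only my-data-table.
DECLARE start_date STRING;
SET start_date = (SELECT MAX(date) FROM `my-project-12345.dataset.my-data-table`);

-- start_date is now a constant as far as partition pruning is concerned,
-- so only the matching ga_sessions_* daily tables are scanned.
SELECT
  GA.date,
  GA.field1,
  hits.field2,
  hits.field3
FROM
  `my-project-12345.dataset.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE
  hits.type IN ('PAGE', 'EVENT')
  AND hits.field0 = 'some value'
  AND _TABLE_SUFFIX > start_date
  AND _TABLE_SUFFIX < '20201025';
```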

Related

BigQuery - Same query, REGEXP_CONTAINS processed 3X less data than IN Operator? Details inside

I'm trying to extract data where the SKU matches either of two values: "GGOEGGCX056299|GGOEGAAX0104".
When I run the REGEXP_CONTAINS version, it uses 3X less of my query quota (17.6 MB vs 51.5 MB with the IN operator). The regex version searches for the same two specific SKUs via the pipe symbol, so I'm wondering what caused the regex version to process less data than the IN operator.
Any help understanding the difference, and how I can make my queries more efficient?
Thanks.
SELECT
date,
prod.productSKU AS SKU,
SUM(prod.productQuantity) AS qty_purchased
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST (hits) hit, UNNEST(product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170131'
AND
REGEXP_CONTAINS (prod.productSKU,"GGOEGGCX056299|GGOEGAAX0104")
GROUP BY date, SKU
ORDER BY date ASC
When I run the IN version to pull the same data, it reports 51.5 MB processed:
SELECT
date,
prod.productSKU AS SKU,
SUM(prod.productQuantity) AS qty_purchased
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST (hits) hit, UNNEST(product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170331'
AND
prod.productSKU IN ("GGOEGGCX056299", "GGOEGAAX0104")
GROUP BY date, SKU
ORDER BY date ASC
it uses 3X less space from my query quota [17.6 MB vs 51.5 MB]
Below is why.
in first query you have
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170131'
while in second
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170331'
Obviously, the second query covers more tables, hence the difference in bytes: one month vs. three months, thus the roughly 3x difference.
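With the date ranges aligned, both queries filter the same tables; the IN version over the same one-month window would be:

```sql
SELECT
  date,
  prod.productSKU AS SKU,
  SUM(prod.productQuantity) AS qty_purchased
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) hit, UNNEST(product) prod
-- Same January-only range as the REGEXP_CONTAINS version
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170131'
  AND prod.productSKU IN ("GGOEGGCX056299", "GGOEGAAX0104")
GROUP BY date, SKU
ORDER BY date ASC
```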

_TABLE_SUFFIX BETWEEN syntax is not selecting any tables

I'm looking at the public GitHub events dataset githubarchive.day.YYYYMMDD to pull public events that belong to me.
For this I use a simple query like:
SELECT id, actor.login, type
FROM `githubarchive.day.2*`
WHERE
_TABLE_SUFFIX BETWEEN '20200520' AND '20200528'
AND actor.login='ahmetb'
This BETWEEN clause doesn't seem to be matching any tables according to this message
Query complete (0.4 sec elapsed, 0 B processed)
If I use a simpler syntax like this it works:
SELECT id, actor.login, type
FROM `githubarchive.day.202005*`
WHERE actor.login='ahmetb'
Query complete (2.2 sec elapsed, 2.4 GB processed)
However using the wildcard syntax directly in FROM is not an option for me as I determine the table suffix dynamically through a query parameter.
Below is the corrected version:
SELECT id, actor.login, type
FROM `githubarchive.day.2*`
WHERE
_TABLE_SUFFIX BETWEEN '0200520' AND '0200528'
AND actor.login='ahmetb'
Note: you need to remove the leading 2 from the dates in the line below, because that digit is already consumed by the table prefix `githubarchive.day.2*`:
_TABLE_SUFFIX BETWEEN '0200520' AND '0200528'
Or you might have wanted the version below:
SELECT id, actor.login, type
FROM `githubarchive.day.*`
WHERE
_TABLE_SUFFIX BETWEEN '20200520' AND '20200528'
AND actor.login='ahmetb'
which makes more sense to me
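Since the suffix is determined dynamically through a query parameter, the corrected form can be parameterized (the parameter names @start_day and @end_day here are assumptions, not from the original post):

```sql
SELECT id, actor.login, type
FROM `githubarchive.day.*`
WHERE
  -- @start_day / @end_day would be bound to e.g. '20200520' and '20200528'
  _TABLE_SUFFIX BETWEEN @start_day AND @end_day
  AND actor.login = 'ahmetb'
```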

Chrome UX Report: improve query performance

I am querying the Chrome UX Report public dataset using the following query to get values for the indicated metrics over time for a set of country-specific tables. The query runs for a very long time (I stopped it at 180 seconds, since I don't know what the timeout for a query is or how to tell whether the query has hung).
I'm trying to get aggregate data year over year for average_fcp, average_fp and average_dcl. I'm not sure whether I'm using BigQuery correctly or whether there are ways to optimize the query to make it run faster.
This is the query I'm using.
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fcp.density) AS average_fcp,
AVG(fp.density) as average_fp,
AVG(dcl.density) as average_dcl
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_paint.histogram.bin) as fp,
UNNEST(dom_content_loaded.histogram.bin) as dcl,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE
form_factor.name = 'desktop' AND
fcp.start > 20000
GROUP BY
yyyymm
ORDER BY
yyyymm
I'm not sure if it makes mathematical sense to get the AVG() of all the densities - but let's do it anyways.
The bigger problem in the query is this:
UNNEST(first_paint.histogram.bin) as fp,
UNNEST(dom_content_loaded.histogram.bin) as dcl,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
-- that's an explosive join: It transforms one row with 3 arrays with ~500 elements each, into 125 million rows!!! That's why the query isn't running.
A similar query that gives you similar results:
SELECT yyyymm,
AVG(average_fcp) average_fcp,
AVG(average_fp) average_fp,
AVG(average_dcl) average_dcl
FROM (
SELECT
_TABLE_SUFFIX AS yyyymm,
(SELECT AVG(fcp.density) FROM UNNEST(first_contentful_paint.histogram.bin) fcp WHERE fcp.start > 20000) AS average_fcp,
(SELECT AVG(fp.density) FROM UNNEST(first_paint.histogram.bin) fp) AS average_fp,
(SELECT AVG(dcl.density) FROM UNNEST(dom_content_loaded.histogram.bin) dcl) AS average_dcl
FROM `chrome-ux-report.country_cl.*`
WHERE form_factor.name = 'desktop'
)
GROUP BY yyyymm
ORDER BY yyyymm
The good news: This query runs in 3.3 seconds.
Now that the query runs in 3 seconds, the most important question is: Does it make sense mathematically?
Bonus: This query makes more sense to me mathematically speaking, but I'm not 100% sure about it:
SELECT yyyymm,
AVG(average_fcp) average_fcp,
AVG(average_fp) average_fp,
AVG(average_dcl) average_dcl
FROM (
SELECT yyyymm, origin, SUM(weighted_fcp) average_fcp, SUM(weighted_fp) average_fp, SUM(weighted_dcl) average_dcl
FROM (
SELECT
_TABLE_SUFFIX AS yyyymm,
(SELECT SUM(start*density) FROM UNNEST(first_contentful_paint.histogram.bin)) AS weighted_fcp,
(SELECT SUM(start*density) FROM UNNEST(first_paint.histogram.bin)) AS weighted_fp,
(SELECT SUM(start*density) FROM UNNEST(dom_content_loaded.histogram.bin)) AS weighted_dcl,
origin
FROM `chrome-ux-report.country_cl.*`
)
GROUP BY origin, yyyymm
)
GROUP BY yyyymm
ORDER BY yyyymm
After carefully reviewing your query, I concluded that the processing time for each of the actions you are performing is around 6 seconds or less. Therefore, I decided to execute each UNNEST as a separate task and then append the results together using the UNION ALL method.
The query ran within 4 seconds. The syntax is:
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fcp.density) AS average_fcp
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE
form_factor.name = 'desktop' AND
fcp.start > 20000
GROUP BY
yyyymm
UNION ALL
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fp.density) as average_fp
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_paint.histogram.bin) as fp
WHERE
form_factor.name = 'desktop'
GROUP BY
yyyymm
UNION ALL
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(dcl.density) as average_dcl
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(dom_content_loaded.histogram.bin) as dcl
WHERE
form_factor.name = 'desktop'
GROUP BY
yyyymm
ORDER BY
yyyymm
In addition, I would like to point out that, according to the documentation, it is advisable to avoid excessive use of wildcards, opting instead for date ranges, and to materialize the results of large queries. Also note that BigQuery limits cached results to 10 GB.
I hope it helps.
Let me start by saying that the BigQuery query timeout is very long (6 hours), so you should not have a problem on this front, but you might encounter other errors.
We had the same issue internally: we have datasets with data divided into country tables, and even though the tables are partitioned on timestamp, when running queries over hundreds of tables, not only does the query take a long time, it sometimes fails with a resources exceeded error.
Our solution was to aggregate all these tables into a single one, adding a 'country' column and using it as a clustering column. This not only made our queries execute, it made them even faster than our interim solution of running the same query on subsets of the country tables and then combining the results. It is now faster, easier and cleaner.
Coming back to your specific question, I suggest creating a new table (which you will need to host, $$$) that combines all the tables inside the dataset into a single partitioned table.
The quickest way, unfortunately also the most expensive one (you will pay for the query scan), is to use a CREATE TABLE statement:
create table `project_id.dataset_id.table_id`
partition by date_month
cluster by origin
as (
select
date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
*
from `chrome-ux-report.country_cl.*`
);
If this query fails, you can run it on a subset of the tables, e.g. WHERE STARTS_WITH(_table_suffix, '2018'), and then run the following query with the 'write append' disposition against the table you created before:
select
date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
*
from `chrome-ux-report.country_cl.*`
where starts_with(_table_suffix, '2019')
As you may have noticed, I also used a clustering column, which I think is a best practice.
A note for whoever curates the Google public datasets: it would be nice to have a public chrome_ux_report dataset with just a single table, partitioned by date and clustered by country.

Use DataStudio to specify the date range for a custom query in BigQuery, where the date range influences operators in the query

I currently have a DataStudio dashboard connected to a BigQuery custom query.
That BQ query has a hardcoded date range and the status of one of the columns (New_or_Relicensed) can change dynamically for a row, based on the dates specified in the range. I would like to be able to alter that range from DataStudio.
I have tried:
simply connecting the DS dashboard to the custom query in BQ and then introducing a date range filter, but as you can imagine - that does not work because it's operating on an already hard-coded date range.
reviewing similar answers, but their problem doesn't appear to be quite the same E.g. BigQuery Data Studio Custom Query
Here is the query I have in BQ:
SELECT t0.New_Or_Relicensed, t0.Title_Category FROM (WITH
report_range AS
(
SELECT
TIMESTAMP '2019-06-24 00:00:00' AS start_date,
TIMESTAMP '2019-06-30 00:00:00' AS end_date
)
SELECT
schedules.schedule_entry_id AS Schedule_Entry_ID,
schedules.schedule_entry_starts_at AS Put_Up,
schedules.schedule_entry_ends_at AS Take_Down,
schedule_entries_metadata.contract AS Schedule_Entry_Contract,
schedules.platform_id AS Platform_ID,
platforms.platform_name AS Platform_Name,
titles_metadata.title_id AS Title_ID,
titles_metadata.name AS Title_Name,
titles_metadata.category AS Title_Category,
IF (other_schedules.schedule_entry_id IS NULL, "new", "relicensed") AS New_Or_Relicensed
FROM
report_range, client.schedule_entries AS schedules
JOIN client.schedule_entries_metadata
ON schedule_entries_metadata.schedule_entry_id = schedules.schedule_entry_id
JOIN
client.platforms
ON schedules.platform_id = platforms.platform_id
JOIN
client.titles_metadata
ON schedules.title_id = titles_metadata.title_id
LEFT OUTER JOIN
client.schedule_entries AS other_schedules
ON schedules.platform_id = other_schedules.platform_id
AND other_schedules.schedule_entry_ends_at < report_range.start_date
AND schedules.title_id = other_schedules.title_id
WHERE
((schedules.schedule_entry_starts_at >= report_range.start_date AND
schedules.schedule_entry_starts_at <= report_range.end_date) OR
(schedules.schedule_entry_ends_at >= report_range.start_date AND
schedules.schedule_entry_ends_at <= report_range.end_date))
) AS t0 LIMIT 100;
Essentially - I would like to be able to set the start_date and end_date from google data studio, and have those dates incorporated into the report_range that then influences the operations in the rest of the query (that assign a schedule entry as new or relicensed).
Have you looked at using the Custom Query interface of the BigQuery connector in Data Studio to define start_date and end_date as parameters as part of a filter?
Your query would need a little re-work...
The following example custom query uses the @DS_START_DATE and @DS_END_DATE parameters as part of a filter on the creation date column of a table. The records produced by the query will be limited to the date range selected by the report user, reducing the number of records returned and resulting in a faster query:
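A minimal sketch of such a custom query (the table and its creation_date column are hypothetical; @DS_START_DATE and @DS_END_DATE arrive as YYYYMMDD strings):

```sql
SELECT *
FROM `my_project.my_dataset.my_table`  -- hypothetical table
-- Data Studio substitutes the report's date-range control into these parameters
WHERE creation_date BETWEEN PARSE_DATE('%Y%m%d', @DS_START_DATE)
                        AND PARSE_DATE('%Y%m%d', @DS_END_DATE)
```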
Resources:
Introducing BigQuery parameters in Data Studio
https://www.blog.google/products/marketingplatform/analytics/introducing-bigquery-parameters-data-studio/
Running parameterized queries
https://cloud.google.com/bigquery/docs/parameterized-queries
I had a similar issue where I wanted to incorporate a 30-day look-back before the start date (@DS_START_DATE). In this case I was using Google Analytics UA session data and filtering on the table suffix in my WHERE clause. I was able to calculate a date relative to the built-in Data Studio "string" dates as follows:
...
WHERE
_table_suffix BETWEEN
CAST(FORMAT_DATE('%Y%m%d', DATE_SUB(PARSE_DATE('%Y%m%d', @DS_START_DATE), INTERVAL 30 DAY)) AS STRING)
AND
CAST(FORMAT_DATE('%Y%m%d', DATE_SUB(PARSE_DATE('%Y%m%d', @DS_END_DATE), INTERVAL 0 DAY)) AS STRING)

How to choose the latest partition in BigQuery table?

I am trying to select data from the latest partition in a date-partitioned BigQuery table, but the query still reads data from the whole table.
I've tried (as far as I know, BigQuery does not support QUALIFY):
SELECT col FROM table WHERE _PARTITIONTIME = (
SELECT pt FROM (
SELECT pt, RANK() OVER(ORDER BY pt DESC) AS rnk FROM (
SELECT _PARTITIONTIME AS pt FROM table GROUP BY 1)
)
WHERE rnk = 1
);
But this does not work and reads all rows.
SELECT col from table WHERE _PARTITIONTIME = TIMESTAMP('YYYY-MM-DD')
(where 'YYYY-MM-DD' is a specific date) does work.
However, I need to run this script in the future, but the table update (and the _PARTITIONTIME) is irregular. Is there a way I can pull data only from the latest partition in BigQuery?
October 2019 Update
Support for Scripting and Stored Procedures is now in beta (as of October 2019)
You can submit multiple statements separated with semi-colons and BigQuery is able to run them now
See example below
DECLARE max_date TIMESTAMP;
SET max_date = (
SELECT MAX(_PARTITIONTIME) FROM `project.dataset.partitioned_table`);
SELECT * FROM `project.dataset.partitioned_table`
WHERE _PARTITIONTIME = max_date;
Update, for those who like downvoting without checking the context:
I think this answer was accepted because it addressed the OP's main question, Is there a way I can pull data only from the latest partition in BigQuery? It was acknowledged in the comments that the BQ engine still scans ALL rows but returns results based on ONLY the most recent partition. As already noted in a comment on the question, this is easily addressed by scripting that logic: first get the result of the subquery, then use it in the final query.
Try
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
SELECT MAX(TIMESTAMP(partition_id))
FROM [dataset.partitioned_table$__PARTITIONS_SUMMARY__]
)
or
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
SELECT MAX(_PARTITIONTIME)
FROM [dataset.partitioned_table]
)
Sorry for digging up this old question, but it came up in a Google search and I think the accepted answer is misleading.
As far as I can tell from the documentation and running tests, the accepted answer will not prune partitions because a subquery is used to determine the most recent partition:
Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.
So, although the suggested answer will deliver the results you expect, it will still query all partitions. It will not ignore all older partitions and only query the latest.
The trick is to compare against a more-or-less-constant value instead of a subquery. For example, if _PARTITIONTIME isn't irregular but daily, try pruning partitions by selecting yesterday's partition like so:
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
Sure, this isn't always the latest data, but in my case it happens to be close enough. Use INTERVAL 0 DAY if you want today's data, and don't mind that the query will return 0 results for the part of the day before the partition has been created.
I'm happy to learn if there is a better workaround to get the latest partition!
List all partitions with:
#standardSQL
SELECT
_PARTITIONTIME as pt
FROM
`[DATASET].[TABLE]`
GROUP BY 1
And then choose the latest timestamp.
Good luck :)
https://cloud.google.com/bigquery/docs/querying-partitioned-tables
I found a workaround for this issue: use a WITH statement to select only the last few partitions, then filter the result. I think this is the better approach because:
You are not limited to a fixed partition date (like today - 1 day); it will always take the latest partition in the given range.
It will only scan the last few partitions, not the whole table.
Example with last 3 partitions scan:
WITH last_three_partitions as (select *, _PARTITIONTIME as PARTITIONTIME
FROM dataset.partitioned_table
WHERE _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 DAY))
SELECT col1, PARTITIONTIME from last_three_partitions
WHERE PARTITIONTIME = (SELECT max(PARTITIONTIME) from last_three_partitions)
A compromise that manages to query only a few partitions without resorting to scripting or failing with missing partitions for fixed dates.
WITH latest_partitions AS (
SELECT *, _PARTITIONDATE AS date
FROM `myproject.mydataset.mytable`
WHERE _PARTITIONDATE > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
)
SELECT
*
FROM
latest_partitions
WHERE
date = (SELECT MAX(date) FROM latest_partitions)
You can leverage the __TABLES__ list of tables to avoid re-scanning everything, and to avoid having to hope the latest partition is about 3 days old. I did the SPLIT and ORDINAL handling to guard against the table prefix appearing more than once in the table name for some reason.
This should work for either _PARTITIONTIME or _TABLE_SUFFIX.
select * from `project.dataset.tablePrefix*`
where _PARTITIONTIME = (
SELECT split(table_id,'tablePrefix')[ordinal(2)] FROM `project.dataset.__TABLES__`
where table_id like 'tablePrefix%'
order by table_id desc limit 1)
I had this answer in a less popular question, so copying it here as it's relevant (and this question is getting more pageviews):
Mikhail's answer looks like this (working on public data):
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
AND wiki='es'
# 122.2 MB processed
But it seems the question wants something like this:
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')
AND wiki='es'
# 50.6 GB processed
... but for way less than 50.6GB
What you need now is some sort of scripting, to perform this in 2 steps:
max_date = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')
;
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = {{max_date}}
AND wiki='es'
# 115.2 MB processed
You will have to script this outside BigQuery - or wait for news on https://issuetracker.google.com/issues/36955074.
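For the record, now that scripting is available (see the October 2019 update above), the two steps can be written directly in BigQuery, as a sketch:

```sql
-- Step 1: find the latest date present for the 'es' wiki (small scan).
DECLARE max_date DATE;
SET max_date = (
  SELECT DATE(MAX(datehour))
  FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
  WHERE wiki = 'es');

-- Step 2: the variable is a constant here, so only that date's partition is scanned.
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = max_date
  AND wiki = 'es';
```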
Building on the answer from Chase: if you have a table that requires a filter over a column, and you're receiving the error:
Cannot query over table 'myproject.mydataset.mytable' without a filter over column(s) '_PARTITION_LOAD_TIME', '_PARTITIONDATE', '_PARTITIONTIME' that can be used for partition elimination
Then you can use:
SELECT
MAX(_PARTITIONTIME) AS pt
FROM
`myproject.mydataset.mytable`
WHERE _PARTITIONTIME IS NOT NULL
Instead of the latest partition, I've used this to get the earliest partition in a dataset by simply changing max to min.
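For reference, that earliest-partition variant of the query above is simply:

```sql
SELECT
  MIN(_PARTITIONTIME) AS pt
FROM
  `myproject.mydataset.mytable`
WHERE _PARTITIONTIME IS NOT NULL
```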