Chrome UX Report: improve query performance - google-bigquery

I am querying the Chrome UX Report public dataset using the following query to get values for the indicated metrics over time for a set of country-specific tables. The query runs for a very long time; I stopped it at 180 seconds because I don't know what the timeout is for a query or how to tell whether the query has hung.
I'm trying to get aggregate data year over year for average_fcp, average_fp and average_dcl. I'm not sure whether I'm using BigQuery correctly or whether there are ways to optimize the query to make it run faster.
This is the query I'm using.
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fcp.density) AS average_fcp,
AVG(fp.density) as average_fp,
AVG(dcl.density) as average_dcl
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_paint.histogram.bin) as fp,
UNNEST(dom_content_loaded.histogram.bin) as dcl,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE
form_factor.name = 'desktop' AND
fcp.start > 20000
GROUP BY
yyyymm
ORDER BY
yyyymm

I'm not sure if it makes mathematical sense to take the AVG() of all the densities - but let's do it anyway.
The bigger problem in the query is this:
UNNEST(first_paint.histogram.bin) as fp,
UNNEST(dom_content_loaded.histogram.bin) as dcl,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
-- that's an explosive join: it transforms one row with 3 arrays of ~500 elements each into 500 × 500 × 500 ≈ 125 million rows! That's why the query isn't finishing.
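To see that multiplication in isolation, here is a toy query (a sketch using GENERATE_ARRAY rather than the CrUX schema) showing how multiple UNNESTs in the same FROM clause cross-join with each other:
-- 3 arrays of 4 elements each produce 4 * 4 * 4 = 64 rows
SELECT COUNT(*) AS row_count
FROM (SELECT GENERATE_ARRAY(1, 4) AS a, GENERATE_ARRAY(1, 4) AS b, GENERATE_ARRAY(1, 4) AS c),
UNNEST(a), UNNEST(b), UNNEST(c)
-- row_count = 64; with ~500-element histograms the same pattern yields ~500 × 500 × 500 ≈ 125 million rows per source row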
A similar query that gives you similar results:
SELECT yyyymm,
AVG(average_fcp) average_fcp,
AVG(average_fp) average_fp,
AVG(average_dcl) average_dcl
FROM (
SELECT
_TABLE_SUFFIX AS yyyymm,
(SELECT AVG(fcp.density) FROM UNNEST(first_contentful_paint.histogram.bin) fcp WHERE fcp.start > 20000) AS average_fcp,
(SELECT AVG(fp.density) FROM UNNEST(first_paint.histogram.bin) fp) AS average_fp,
(SELECT AVG(dcl.density) FROM UNNEST(dom_content_loaded.histogram.bin) dcl) AS average_dcl
FROM `chrome-ux-report.country_cl.*`
WHERE form_factor.name = 'desktop'
)
GROUP BY yyyymm
ORDER BY yyyymm
The good news: This query runs in 3.3 seconds.
Now that the query runs in 3 seconds, the most important question is: Does it make sense mathematically?
Bonus: This query makes more sense to me mathematically speaking, but I'm not 100% sure about it:
SELECT yyyymm,
AVG(average_fcp) average_fcp,
AVG(average_fp) average_fp,
AVG(average_dcl) average_dcl
FROM (
SELECT yyyymm, origin, SUM(weighted_fcp) average_fcp, SUM(weighted_fp) average_fp, SUM(weighted_dcl) average_dcl
FROM (
SELECT
_TABLE_SUFFIX AS yyyymm,
(SELECT SUM(start*density) FROM UNNEST(first_contentful_paint.histogram.bin)) AS weighted_fcp,
(SELECT SUM(start*density) FROM UNNEST(first_paint.histogram.bin)) AS weighted_fp,
(SELECT SUM(start*density) FROM UNNEST(dom_content_loaded.histogram.bin)) AS weighted_dcl,
origin
FROM `chrome-ux-report.country_cl.*`
)
GROUP BY origin, yyyymm
)
GROUP BY yyyymm
ORDER BY yyyymm

After carefully reviewing your query, I concluded that the processing time for each of the individual aggregations is around 6 seconds or less. Therefore, I decided to run each UNNEST as its own query and then append the results together using the UNION ALL method.
The query ran in under 4 seconds. The syntax is:
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fcp.density) AS average_fcp
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE
form_factor.name = 'desktop' AND
fcp.start > 20000
GROUP BY
yyyymm
UNION ALL
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fp.density) as average_fp
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_paint.histogram.bin) as fp
WHERE
form_factor.name = 'desktop'
GROUP BY
yyyymm
UNION ALL
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(dcl.density) as average_dcl
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(dom_content_loaded.histogram.bin) as dcl
WHERE
form_factor.name = 'desktop'
GROUP BY
yyyymm
ORDER BY
yyyymm
In addition, I would like to point out that, according to the documentation, it is advisable to avoid excessive use of wildcards, opting instead for date ranges, and to materialize the results of large queries. Also note that BigQuery limits cached results to 10 GB.
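For example, a sketch (assuming you only need 2019) that narrows the wildcard with a _TABLE_SUFFIX range instead of scanning every monthly table:
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fcp.density) AS average_fcp
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE
_TABLE_SUFFIX BETWEEN '201901' AND '201912' AND
form_factor.name = 'desktop' AND
fcp.start > 20000
GROUP BY
yyyymm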
I hope it helps.

Let me start by saying that the BigQuery query timeout is very long (6 hours), so you should not have a problem on that front, but you might encounter other errors.
We had the same issue internally: we have datasets with data divided into country tables, and even though the tables are partitioned on a timestamp, when running queries over hundreds of tables, not only does the query take a long time, it sometimes fails with a "resources exceeded" error.
Our solution was to aggregate all these tables into a single one, adding a 'country' column and using it as the clustering column. This not only allowed our queries to execute, it made them even faster than our temporary workaround of running the same query on subsets of the country tables as intermediate steps and then combining the results. It is now faster, easier and cleaner.
Coming back to your specific question, I suggest creating a new table (which you will need to pay to host) that combines all the tables inside a dataset into a single partitioned table.
The quickest way, unfortunately also the most expensive one (you will pay for the query scan), is to use a CREATE TABLE ... AS SELECT statement.
create table `project_id.dataset_id.table_id`
partition by date_month
cluster by origin
as (
select
date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
*
from `chrome-ux-report.country_cl.*`
);
If this query fails, you can run it on a subset of the tables, e.g. where starts_with(_table_suffix, '2018'), and then run the following query with the WRITE_APPEND disposition against the table you created before.
select
date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
*
from `chrome-ux-report.country_cl.*`
where starts_with(_table_suffix, '2019')
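If you prefer to do the append in SQL rather than through the job's write disposition, an INSERT statement works too (a sketch, assuming the table created above exists and its column order is date_month followed by the remaining CrUX columns):
insert into `project_id.dataset_id.table_id`
select
date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
*
from `chrome-ux-report.country_cl.*`
where starts_with(_table_suffix, '2019')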
As you may have noticed, I also used a clustering column, which I think is a best practice.
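Once the combined table exists, queries can prune on the date_month partition (and benefit from the origin clustering when you filter on origin). A sketch against the hypothetical project_id.dataset_id.table_id created above:
select
format_date('%Y%m', date_month) as yyyymm,
avg(fcp.density) as average_fcp
from `project_id.dataset_id.table_id`,
unnest(first_contentful_paint.histogram.bin) as fcp
where date_month between date '2018-01-01' and date '2019-12-01'
and form_factor.name = 'desktop'
and fcp.start > 20000
group by yyyymm
order by yyyymm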
Note for whoever is curating the Google public datasets:
It would be nice to have a public "chrome_ux_report" dataset with just a single table partitioned by date and clustered by country.

Related

Google BigQuery optimization with subquery in WHERE clause

I am attempting to set up a query that selects a subset of data from a range of daily partitions of Google Analytics session data and writes the data to a Google BigQuery staging table. The challenge for me is to reduce the processing cost when using a subquery in the WHERE clause.
Google Analytics data from the query are to be appended to a staging table before being processed and loaded into the target data table (my-data-table). The main query is given in two forms below. The first is hard-coded. The second reflects the preferred form. The upper bound on _TABLE_SUFFIX is hard-coded for both to simplify the query. The objective is to use MAX(date), where date has the form YYYYMMDD, from my-data-table as a lower bound on the ga_sessions_* daily partitions. The query has been simplified for presentation here but is believed to contain all necessary elements.
The aggregate query (SELECT MAX(date) FROM my-project-12345.dataset.my-data-table) returns the value '20201015' and processes 202 KB. Depending upon whether I use the returned value explicitly (as '20201015') in the WHERE clause of the main query or use the SELECT MAX() query in the WHERE clause, there is a significant difference in data processed between the two queries (2.3 GB for the explicit value vs 138.1 GB for the SELECT MAX() expression).
Is there an optimization, plan, or directive that can be applied to the preferred form of the main query that will reduce the data processing cost? Thank you for any assistance that can be provided.
Main Query (hard-coded version, processes 2.3 GB)
SELECT
GA.date,
GA.field1,
hits.field2,
hits.field3
FROM
`my-project-12345.dataset.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE
hits.type IN ('PAGE', 'EVENT')
AND hits.field0 = 'some value'
AND _TABLE_SUFFIX > '20201015'
AND _TABLE_SUFFIX < '20201025'
Main Query (preferred form, processes 138.1 GB without optimization)
SELECT
GA.date,
GA.field1,
hits.field2,
hits.field3
FROM
`my-project-12345.dataset.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE
hits.type IN ('PAGE', 'EVENT')
AND hits.field0 = 'some value'
AND _TABLE_SUFFIX > (SELECT MAX(date) FROM `my-project-12345.dataset.my-data-table`)
AND _TABLE_SUFFIX < '20201025'
You can use scripting for this
The "trick" is in pre-computing
DECLARE start_date STRING;
SET start_date = (SELECT MAX(date) FROM `my-project-12345.dataset.my-data-table`);
and assigning it to a variable, then using that variable in the WHERE clause of the main query - in this case it will use the cost-effective version:
AND _TABLE_SUFFIX > start_date
AND _TABLE_SUFFIX < '20201025'
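Putting it together with the table names from the question (still a sketch, since the schema is simplified):
DECLARE start_date STRING;
SET start_date = (SELECT MAX(date) FROM `my-project-12345.dataset.my-data-table`);

SELECT
GA.date,
GA.field1,
hits.field2,
hits.field3
FROM
`my-project-12345.dataset.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE
hits.type IN ('PAGE', 'EVENT')
AND hits.field0 = 'some value'
AND _TABLE_SUFFIX > start_date
AND _TABLE_SUFFIX < '20201025';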

BigQuery - Same query, REGEXP_CONTAINS processed 3X less data than IN Operator? Details inside

I'm trying to extract data where the SKU matches either of two values: "GGOEGGCX056299|GGOEGAAX0104".
When I run the REGEXP_CONTAINS version, it uses 3X less of my query quota [17.6 MB vs 51.5 MB using the IN operator]. My regex version is also set to search for the specific SKUs via the pipe symbol, so I'm wondering what caused the REGEXP version to process less data than the IN operator that also searched for the same two SKUs.
Any help with understanding the difference, and with making my queries more efficient, would be appreciated.
Thanks.
SELECT
date,
prod.productSKU AS SKU,
SUM(prod.productQuantity) AS qty_purchased
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST (hits) hit, UNNEST(product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170131'
AND
REGEXP_CONTAINS (prod.productSKU,"GGOEGGCX056299|GGOEGAAX0104")
GROUP BY date, SKU
ORDER BY date ASC
When I run the IN version to pull the same data, it says it used 51.5 MB:
SELECT
date,
prod.productSKU AS SKU,
SUM(prod.productQuantity) AS qty_purchased
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST (hits) hit, UNNEST(product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170331'
AND
prod.productSKU IN ("GGOEGGCX056299", "GGOEGAAX0104")
GROUP BY date, SKU
ORDER BY date ASC
it uses 3X less space from my query quota [17.6 MB vs 51.5 MB]
Below is why!
In the first query you have:
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170131'
while in the second you have:
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170331'
Obviously, the second query covers more tables, hence the difference in bytes - one month vs. three months, thus the ~3x difference.
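To confirm it, run the IN version over the same one-month suffix range; it should process roughly the same amount of data as the REGEXP_CONTAINS version (a quick check, not a different technique):
SELECT
date,
prod.productSKU AS SKU,
SUM(prod.productQuantity) AS qty_purchased
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST (hits) hit, UNNEST(product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170131'
AND
prod.productSKU IN ("GGOEGGCX056299", "GGOEGAAX0104")
GROUP BY date, SKU
ORDER BY date ASC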

Can I speed up this subquery nested PostgreSQL Query

I have the following PostgreSQL code, which I'm using to create a materialized view. It works, but it is quite slow, and the code seems cumbersome with the multiple sub-queries. Is there any way I can improve the speed at which this code executes, or rewrite it so it's shorter and easier to maintain?
CREATE MATERIALIZED VIEW station_views.obs_10_min_avg_ffdi_powerbi AS
SELECT t.station_num,
initcap(t.station_name) AS station_name,
t.day,
t.month_int,
to_char(to_timestamp(t.month_int::text, 'MM'), 'TMMonth') AS Month,
round(((date_part('year', age(t2.dmax, t2.dmin)) * 12 + date_part('month', age(t2.dmax, t2.dmin))) / 12)::numeric, 1) AS record_years,
round((t2.count_all_vals / t2.max_10_periods * 100)::numeric, 1) AS per_datset,
max(t.avg_bom_fdi) AS max,
avg(t.avg_bom_fdi) AS avg,
percentile_cont(0.95) WITHIN GROUP (ORDER BY t.avg_bom_fdi) AS percentile_cont_95,
percentile_cont(0.99) WITHIN GROUP (ORDER BY t.avg_bom_fdi) AS percentile_cont_99
FROM ( SELECT a.station_num,
d.station_name,
a.ten_minute_intervals_utc,
date_part('day', a.ten_minute_intervals_utc) AS day,
date_part('month', a.ten_minute_intervals_utc) AS month_int,
a.avg_bom_fdi
FROM analysis.obs_10_min_avg_ffdi_bom a,
obs_minute_stn_det d
WHERE d.station_num = a.station_num) t,
( SELECT obs_10_min_avg_ffdi_bom_view.station_num,
obs_10_min_avg_ffdi_bom_view.station_name,
min(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc) AS dmin,
max(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc) AS dmax,
date_part('epoch', max(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc) - min(obs_10_min_avg_ffdi_bom_view.ten_minute_intervals_utc)) / 600 AS max_10_periods,
count(*) AS count_all_vals
FROM analysis.obs_10_min_avg_ffdi_bom_view
GROUP BY obs_10_min_avg_ffdi_bom_view.station_num, obs_10_min_avg_ffdi_bom_view.station_name) t2
WHERE t.station_num = t2.station_num
GROUP BY t.station_num, t.station_name, Month, t.month_int, t.day, record_years, per_datset
ORDER BY t.month_int, t.day
WITH DATA;
The output I get is a row for each weather station (station_num & station_name) along with the day & month that a weather variable is recorded (avg_bom_fdi). The month value is retained and converted to a name for purposes of plotting values averaged per month on the chart. I also pull in the total number of years that recordings exist for that station (record_years) and a percentage of how complete that dataset is (per_datset). These both come from the second subquery (t2). The first subquery (t) is used to average the data per day and return the daily max, average and 95/99th percentiles.
I agree with running the explain plan / execution plan on this query.
Also, if it's not needed, remove the ORDER BY.
If, while reviewing the execution plan, you see a lot of time spent fetching a particular value, try creating an index on that particular column.
Depending on whether the cardinality is high or low, you can choose a B-tree or bitmap index, if you decide an index is warranted.
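A minimal sketch of both suggestions (the table and column names come from the query above; the index name is made up, and whether an index actually helps depends on what the plan shows):
-- See where the time goes in the first subquery's join
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.station_num, a.ten_minute_intervals_utc, a.avg_bom_fdi
FROM analysis.obs_10_min_avg_ffdi_bom a
JOIN obs_minute_stn_det d ON d.station_num = a.station_num;

-- If the plan shows a slow scan on the join column, an index may help
CREATE INDEX IF NOT EXISTS obs_10_min_ffdi_station_num_idx
ON analysis.obs_10_min_avg_ffdi_bom (station_num);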
I think you need to read something about execution plans. It's a good way to understand what is happening with your query.
I recommend the documentation about this problem - LINK

How to choose the latest partition in BigQuery table?

I am trying to select data from the latest partition in a date-partitioned BigQuery table, but the query still reads data from the whole table.
I've tried (as far as I know, BigQuery does not support QUALIFY):
SELECT col FROM table WHERE _PARTITIONTIME = (
  SELECT pt FROM (
    SELECT pt, RANK() OVER(ORDER BY pt DESC) AS rnk FROM (
      SELECT _PARTITIONTIME AS pt FROM table GROUP BY 1)
  )
  WHERE rnk = 1
);
But this does not work and reads all rows.
SELECT col from table WHERE _PARTITIONTIME = TIMESTAMP('YYYY-MM-DD')
where 'YYYY-MM-DD' is a specific date does work.
However, I need to run this script in the future, but the table update (and the _PARTITIONTIME) is irregular. Is there a way I can pull data only from the latest partition in BigQuery?
October 2019 Update
Support for Scripting and Stored Procedures is now in beta (as of October 2019)
You can submit multiple statements separated by semicolons, and BigQuery is able to run them now.
See example below
DECLARE max_date TIMESTAMP;
SET max_date = (
SELECT MAX(_PARTITIONTIME) FROM `project.dataset.partitioned_table`);
SELECT * FROM `project.dataset.partitioned_table`
WHERE _PARTITIONTIME = max_date;
Update for those who like downvoting without checking the context, etc.:
I think this answer was accepted because it addressed the OP's main question, "Is there a way I can pull data only from the latest partition in BigQuery?", and it was mentioned in the comments that, obviously, the BQ engine still scans ALL rows but returns the result based on ONLY the most recent partition. As was already mentioned in a comment on the question, this is still something that can easily be addressed by scripting that logic - first getting the result of the subquery and then using it in the final query.
Try
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
SELECT MAX(TIMESTAMP(partition_id))
FROM [dataset.partitioned_table$__PARTITIONS_SUMMARY__]
)
or
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
SELECT MAX(_PARTITIONTIME)
FROM [dataset.partitioned_table]
)
Sorry for digging up this old question, but it came up in a Google search and I think the accepted answer is misleading.
As far as I can tell from the documentation and running tests, the accepted answer will not prune partitions because a subquery is used to determine the most recent partition:
Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.
So, although the suggested answer will deliver the results you expect, it will still query all partitions. It will not ignore all older partitions and only query the latest.
The trick is to compare against a more-or-less constant value instead of a subquery. For example, if _PARTITIONTIME isn't irregular but daily, try pruning partitions by getting yesterday's partition like so:
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
Sure, this isn't always the latest data, but in my case it happens to be close enough. Use INTERVAL 0 DAY if you want today's data and don't care that the query will return 0 results for the part of the day where the partition hasn't been created yet.
I'm happy to learn if there is a better workaround to get the latest partition!
List all partitions with:
#standardSQL
SELECT
_PARTITIONTIME as pt
FROM
`[DATASET].[TABLE]`
GROUP BY 1
And then choose the latest timestamp.
Good luck :)
https://cloud.google.com/bigquery/docs/querying-partitioned-tables
I found a workaround for this issue. You can use a WITH statement to select the last few partitions and then filter the result. I think this is a better approach because:
You are not limited to a fixed partition date (like today - 1 day). It will always take the latest partition from the given range.
It will only scan the last few partitions, not the whole table.
Example scanning only the last 3 partitions:
WITH last_three_partitions as (select *, _PARTITIONTIME as PARTITIONTIME
FROM dataset.partitioned_table
WHERE _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 DAY))
SELECT col1, PARTITIONTIME from last_three_partitions
WHERE PARTITIONTIME = (SELECT max(PARTITIONTIME) from last_three_partitions)
A compromise that queries only a few partitions, without resorting to scripting or failing when the fixed date's partition is missing:
WITH latest_partitions AS (
SELECT *, _PARTITIONDATE AS date
FROM `myproject.mydataset.mytable`
WHERE _PARTITIONDATE > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
)
SELECT
*
FROM
latest_partitions
WHERE
date = (SELECT MAX(date) FROM latest_partitions)
You can leverage the __TABLES__ metadata to avoid re-scanning everything or having to hope that the latest partition is ~3 days ago. I did the split and ordinal stuff to guard against the case where my table prefix appears more than once in the table name for some reason.
This should work for either _PARTITIONTIME or _TABLE_SUFFIX.
select * from `project.dataset.tablePrefix*`
where _PARTITIONTIME = (
SELECT split(table_id,'tablePrefix')[ordinal(2)] FROM `project.dataset.__TABLES__`
where table_id like 'tablePrefix%'
order by table_id desc limit 1)
I had this answer in a less popular question, so copying it here as it's relevant (and this question is getting more pageviews):
Mikhail's answer looks like this (working on public data):
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
AND wiki='es'
# 122.2 MB processed
But it seems the question wants something like this:
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')
AND wiki='es'
# 50.6 GB processed
... but for way less than 50.6GB
What you need now is some sort of scripting, to perform this in 2 steps:
max_date = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')
;
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = {{max_date}}
AND wiki='es'
# 115.2 MB processed
You will have to script this outside BigQuery - or wait for news on https://issuetracker.google.com/issues/36955074.
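Now that scripting is available (see the October 2019 update in the accepted answer above), the two-step approach can be run entirely inside BigQuery. A sketch against the same public table; I haven't re-measured the bytes, but the variable is resolved before the main query runs, so it should behave like the hard-coded version:
DECLARE max_date DATE;
SET max_date = (
SELECT DATE(MAX(datehour))
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE wiki = 'es');

SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = max_date
AND wiki = 'es';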
Building on the answer from Chase: if you have a table that requires you to filter over a column, and you're receiving the error:
Cannot query over table 'myproject.mydataset.mytable' without a filter over column(s) '_PARTITION_LOAD_TIME', '_PARTITIONDATE', '_PARTITIONTIME' that can be used for partition elimination
Then you can use:
SELECT
MAX(_PARTITIONTIME) AS pt
FROM
`myproject.mydataset.mytable`
WHERE _PARTITIONTIME IS NOT NULL
Instead of the latest partition, I've used this to get the earliest partition in a dataset by simply changing MAX to MIN.
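For example, the earliest-partition variant is just the aggregate swapped:
SELECT
MIN(_PARTITIONTIME) AS pt
FROM
`myproject.mydataset.mytable`
WHERE _PARTITIONTIME IS NOT NULL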

20 Day moving average with joins alone

There are questions like this all over the place so let me specify where I specifically need help.
I have seen moving averages in SQL with Oracle Analytic functions, MSSQL apply, or a variety of other methods. I have also seen this done with self joins (one join for each day of the average, such as here How do you create a Moving Average Method in SQL? ).
I am curious as to whether there is a way (only using self joins) to do this in SQL (preferably Oracle, but since my question is geared towards joins alone it should be possible in any RDBMS). The approach would have to be scalable (to a 20 or 100 day moving average, in contrast to the link I researched above, which required a join for each day in the moving average).
My thoughts are
select customer, a.tradedate, a.shares, avg(b.shares)
from trades a, trades b
where b.tradedate between a.tradedate-20 and a.tradedate
group by customer, a.tradedate
But when I tried it in the past it didn't work. To be more specific, I am trying a smaller but similar example (5 day avg instead of 20 day) with this fiddle demo and can't find out where I am going wrong. http://sqlfiddle.com/#!6/ed008/41
select a.ticker, a.dt_date, a.volume, avg(b.volume)
from yourtable a, yourtable b
where b.dt_date between a.dt_date-5 and a.dt_date
and a.ticker=b.ticker
group by a.ticker, a.dt_date, a.volume
I don't see anything wrong with your second query. I think the only reason it's not what you're expecting is that the volume field is an integer data type, so when you calculate the average the resulting output will also be an integer. For an average you have to cast it, because the result won't necessarily be a whole number:
select a.ticker, a.dt_date, a.volume, avg(cast(b.volume as float))
from yourtable a
join yourtable b
on a.ticker = b.ticker
where b.dt_date between a.dt_date - 5 and a.dt_date
group by a.ticker, a.dt_date, a.volume
Fiddle:
http://sqlfiddle.com/#!6/ed008/48/0 (thanks to @DaleM for the DDL)
I don't know why you would ever do this vs. an analytic function though, especially since you mention wanting to do this in Oracle (which has analytic functions). It would be different if your preferred database were MySQL or a database without analytic functions.
Just to add to the answer, this is how you would achieve the same result in Oracle using analytic functions. Notice how the PARTITION BY acts as the join on ticker that you're using: it splits up the results so that the same date shared across multiple tickers doesn't interfere.
select ticker,
dt_date,
volume,
avg(cast(volume as decimal)) over( partition by ticker
order by dt_date
rows between 5 preceding
and current row ) as mov_avg
from yourtable
order by ticker, dt_date, volume
Fiddle:
http://sqlfiddle.com/#!4/0d06b/4/0
Analytic functions will likely run much faster.
http://sqlfiddle.com/#!6/ed008/45 would appear to be what you need.
select a.ticker,
a.dt_date,
a.volume,
(select avg(cast(b.volume as float))
from yourtable b
where b.dt_date between a.dt_date-5 and a.dt_date
and a.ticker=b.ticker)
from yourtable a
order by a.ticker, a.dt_date
Not a join, but a correlated subquery.