_TABLE_SUFFIX BETWEEN syntax is not selecting any tables - google-bigquery

I'm looking at the public GitHub events dataset githubarchive.day.YYYYMMDD to pull public events that belong to me.
For this I use a simple query like:
SELECT id, actor.login, type
FROM `githubarchive.day.2*`
WHERE
_TABLE_SUFFIX BETWEEN '20200520' AND '20200528'
AND actor.login='ahmetb'
This BETWEEN clause doesn't seem to be matching any tables, according to this message:
Query complete (0.4 sec elapsed, 0 B processed)
If I use a simpler syntax like this it works:
SELECT id, actor.login, type
FROM `githubarchive.day.202005*`
WHERE actor.login='ahmetb'
Query complete (2.2 sec elapsed, 2.4 GB processed)
However using the wildcard syntax directly in FROM is not an option for me as I determine the table suffix dynamically through a query parameter.

Below is the correct version:
SELECT id, actor.login, type
FROM `githubarchive.day.2*`
WHERE
_TABLE_SUFFIX BETWEEN '0200520' AND '0200528'
AND actor.login='ahmetb'
Note: you need to remove the leading 2 from the dates in the line below, because the 2 in the wildcard githubarchive.day.2* is part of the table-name prefix and is therefore not included in _TABLE_SUFFIX:
_TABLE_SUFFIX BETWEEN '0200520' AND '0200528'
Or you might want the one below:
SELECT id, actor.login, type
FROM `githubarchive.day.*`
WHERE
_TABLE_SUFFIX BETWEEN '20200520' AND '20200528'
AND actor.login='ahmetb'
which makes more sense to me.
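If you still need to determine the suffix range dynamically, the last form also works with named query parameters. A minimal sketch, assuming parameters named @start_day and @end_day supplied as STRING values (for example via bq query --parameter); note that whether BigQuery can prune tables on a parameterized _TABLE_SUFFIX filter has varied over time, so check the bytes processed:
SELECT id, actor.login, type
FROM `githubarchive.day.*`
WHERE
-- @start_day / @end_day are hypothetical parameters, e.g. '20200520' and '20200528'
_TABLE_SUFFIX BETWEEN @start_day AND @end_day
AND actor.login='ahmetb'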

Related

Google BigQuery optimization with subquery in WHERE clause

I am attempting to set up a query that selects a subset of data from a range of daily partitions of Google Analytics session data and writes the data to a Google BigQuery staging table. The challenge for me is to reduce the processing cost when using a subquery in the WHERE clause.
Google Analytics data from the query are to be appended to a staging table before being processed and loaded into the target data table (my-data-table). The main query is given in two forms below. The first is hard-coded. The second reflects the preferred form. The upper bound on _TABLE_SUFFIX is hard-coded for both to simplify the query. The objective is to use MAX(date), where date has the form YYYYMMDD, from my-data-table as a lower bound on the ga_sessions_* daily partitions. The query has been simplified for presentation here but is believed to contain all necessary elements.
The aggregate query (SELECT MAX(date) FROM my-project-12345.dataset.my-data-table) returns the value '20201015' and processes 202 KB. Depending upon whether I use the returned value explicitly (as '20201015') in the WHERE clause of the main query or use the SELECT MAX() query in the WHERE clause, there is a significant difference in data processed between the two queries (2.3 GB for the explicit value vs 138.1 GB for the SELECT MAX() expression).
Is there an optimization, plan, or directive that can be applied to the preferred form of the main query that will reduce the data processing cost? Thank you for any assistance that can be provided.
Main Query (hard-coded version, processes 2.3 GB)
SELECT
GA.date,
GA.field1,
hits.field2,
hits.field3
FROM
`my-project-12345.dataset.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE
hits.type IN ('PAGE', 'EVENT')
AND hits.field0 = 'some value'
AND _TABLE_SUFFIX > '20201015'
AND _TABLE_SUFFIX < '20201025'
Main Query (preferred form, processes 138.1 GB without optimization)
SELECT
GA.date,
GA.field1,
hits.field2,
hits.field3
FROM
`my-project-12345.dataset.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE
hits.type IN ('PAGE', 'EVENT')
AND hits.field0 = 'some value'
AND _TABLE_SUFFIX > (SELECT MAX(date) FROM `my-project-12345.dataset.my-data-table`)
AND _TABLE_SUFFIX < '20201025'
You can use scripting for this
The "trick" is in pre-computing
DECLARE start_date STRING;
SET start_date = (SELECT MAX(date) FROM `my-project-12345.dataset.my-data-table`);
and assigning it to a variable, then using that variable in the WHERE clause of the main query; in this case the cost-effective version is used:
AND _TABLE_SUFFIX > start_date
AND _TABLE_SUFFIX < '20201025'
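Putting the pieces together, a minimal sketch of the full script using the field names from the question:
DECLARE start_date STRING;
SET start_date = (SELECT MAX(date) FROM `my-project-12345.dataset.my-data-table`);

SELECT
GA.date,
GA.field1,
hits.field2,
hits.field3
FROM
`my-project-12345.dataset.ga_sessions_*` AS GA, UNNEST(GA.hits) AS hits
WHERE
hits.type IN ('PAGE', 'EVENT')
AND hits.field0 = 'some value'
AND _TABLE_SUFFIX > start_date -- the pre-computed variable keeps the scan cost-effective
AND _TABLE_SUFFIX < '20201025';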

Chrome UX Report: improve query performance

I am querying the Chrome UX Report public dataset, using the following query to get values for the indicated metrics over time for a set of country-specific tables. The query runs for a very long time; I stopped it at 180 seconds because I don't know what the timeout is for a query or how to tell whether it has hung.
I'm trying to get aggregate data year over year for average_fcp, average_fp and average_dcl. I'm not sure whether I'm using BigQuery correctly or whether there are ways to optimize the query to make it run faster.
This is the query I'm using.
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fcp.density) AS average_fcp,
AVG(fp.density) as average_fp,
AVG(dcl.density) as average_dcl
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_paint.histogram.bin) as fp,
UNNEST(dom_content_loaded.histogram.bin) as dcl,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE
form_factor.name = 'desktop' AND
fcp.start > 20000
GROUP BY
yyyymm
ORDER BY
yyyymm
I'm not sure if it makes mathematical sense to get the AVG() of all the densities, but let's do it anyway.
The bigger problem in the query is this:
UNNEST(first_paint.histogram.bin) as fp,
UNNEST(dom_content_loaded.histogram.bin) as dcl,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
That's an explosive join: it transforms one row holding 3 arrays of ~500 elements each into roughly 500 × 500 × 500 = 125 million rows. That's why the query isn't running.
A similar query that gives you similar results:
SELECT yyyymm,
AVG(average_fcp) average_fcp,
AVG(average_fp) average_fp,
AVG(average_dcl) average_dcl
FROM (
SELECT
_TABLE_SUFFIX AS yyyymm,
(SELECT AVG(fcp.density) FROM UNNEST(first_contentful_paint.histogram.bin) fcp WHERE fcp.start > 20000) AS average_fcp,
(SELECT AVG(fp.density) FROM UNNEST(first_paint.histogram.bin) fp) AS average_fp,
(SELECT AVG(dcl.density) FROM UNNEST(dom_content_loaded.histogram.bin) dcl) AS average_dcl
FROM `chrome-ux-report.country_cl.*`
WHERE form_factor.name = 'desktop'
)
GROUP BY yyyymm
ORDER BY yyyymm
The good news: This query runs in 3.3 seconds.
Now that the query runs in 3 seconds, the most important question is: Does it make sense mathematically?
Bonus: This query makes more sense to me mathematically speaking, but I'm not 100% sure about it:
SELECT yyyymm,
AVG(average_fcp) average_fcp,
AVG(average_fp) average_fp,
AVG(average_dcl) average_dcl
FROM (
SELECT yyyymm, origin, SUM(weighted_fcp) average_fcp, SUM(weighted_fp) average_fp, SUM(weighted_dcl) average_dcl
FROM (
SELECT
_TABLE_SUFFIX AS yyyymm,
(SELECT SUM(start*density) FROM UNNEST(first_contentful_paint.histogram.bin)) AS weighted_fcp,
(SELECT SUM(start*density) FROM UNNEST(first_paint.histogram.bin)) AS weighted_fp,
(SELECT SUM(start*density) FROM UNNEST(dom_content_loaded.histogram.bin)) AS weighted_dcl,
origin
FROM `chrome-ux-report.country_cl.*`
)
GROUP BY origin, yyyymm
)
GROUP BY yyyymm
ORDER BY yyyymm
After carefully reviewing your query, I concluded that the processing time for each of the actions you are performing is around 6 seconds or less. Therefore, I decided to execute the aggregation for each UNNEST separately and then append the results together using the UNION ALL method.
The query ran within 4 seconds. The syntax is:
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fcp.density) AS average_fcp
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE
form_factor.name = 'desktop' AND
fcp.start > 20000
GROUP BY
yyyymm
UNION ALL
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fp.density) as average_fp
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_paint.histogram.bin) as fp
WHERE
form_factor.name = 'desktop'
GROUP BY
yyyymm
UNION ALL
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(dcl.density) as average_dcl
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(dom_content_loaded.histogram.bin) as dcl
WHERE
form_factor.name = 'desktop'
GROUP BY
yyyymm
ORDER BY
yyyymm
In addition, I would like to point out that, according to the documentation, it is advisable to avoid excessive use of wildcards, opting instead for date ranges and materializing the results of large queries. Also, note that BigQuery limits cached results to 10 GB.
I hope it helps.
Let me start by saying that the BigQuery query timeout is very long (6 hours), so you should not have a problem on that front, but you might encounter other errors.
We had the same issue internally: we have datasets with data divided into country tables, and even though the tables are partitioned on timestamp, when running queries over hundreds of tables, not only does the query take a long time, it sometimes fails with a resources-exceeded error.
Our solution was to aggregate all these tables into a single one, adding a 'country' column and using it as a clustering column. This not only made our queries execute, it made them even faster than our temporary solution of running the same query on a subset of the country tables as intermediate steps and then combining the results. It is now faster, easier and cleaner.
Coming back to your specific question, I suggest creating a new table (which you will need to pay to host) that combines all the tables inside the dataset into a single partitioned table.
The quickest way, which is unfortunately also the most expensive one (you will pay for the query scan), is to use a CREATE TABLE statement.
create table `project_id.dataset_id.table_id`
partition by date_month
cluster by origin
as (
select
date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
*
from `chrome-ux-report.country_cl.*`
);
If this query fails, you can run it on a subset of the tables, e.g. WHERE starts_with(_table_suffix, '2018'), and then run the following query with the 'write append' disposition against the table you created before.
select
date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
*
from `chrome-ux-report.country_cl.*`
where starts_with(_table_suffix, '2019')
As you may have noticed, I also used a clustering column, which I think is a best practice.
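If you prefer plain SQL over configuring a write-append job, the append step can presumably also be expressed as a DML INSERT; a minimal sketch against the table created above (the query scan is billed the same way):
-- append the 2019 shards to the partitioned and clustered table created earlier
INSERT INTO `project_id.dataset_id.table_id`
SELECT
date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
*
FROM `chrome-ux-report.country_cl.*`
WHERE starts_with(_table_suffix, '2019');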
A note for whoever curates the Google public datasets: it would be nice to have a public "chrome_ux_report" dataset with just a single table, partitioned by date and clustered by country.

how to get the most recent dataset using a wildcard in google bigquery

If I have a series of fact tables such as:
fact-01012001
fact-01022001
fact-01032001
dim001
dim002
A wildcard will allow me to search all three, for example:
select * from fact-*
Is there a way, using wildcards or otherwise, to get the most recent fact table? Say, only 01032001?
Until the relevant feature request is implemented, you will need to use a query to determine the most recent date, then another query to select from that table. For example:
#standardSQL
SELECT _TABLE_SUFFIX AS latest_date
FROM `fact-*`
ORDER BY PARSE_DATE('%m%d%Y', _TABLE_SUFFIX) DESC LIMIT 1;
After retrieving the latest date, query it:
#standardSQL
SELECT *
FROM `fact-01032001`;
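Since then, BigQuery scripting has become available, so the two steps can also be combined into a single script. A minimal sketch, keeping the table naming used above (finding the latest suffix still inspects all of the fact tables):
DECLARE latest_date STRING;

-- step 1: find the most recent suffix among the fact shards
SET latest_date = (
SELECT _TABLE_SUFFIX
FROM `fact-*`
ORDER BY PARSE_DATE('%m%d%Y', _TABLE_SUFFIX) DESC
LIMIT 1);

-- step 2: query only that shard
SELECT *
FROM `fact-*`
WHERE _TABLE_SUFFIX = latest_date;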
Below is a one-step approach for BigQuery Standard SQL:
#standardSQL
SELECT *
FROM `yourProject.yourDataset.fact_*`
WHERE _TABLE_SUFFIX IN (
SELECT
FORMAT_DATE('%m%d%Y', MAX(PARSE_DATE('%m%d%Y', SUBSTR(table_id, - 8)))) AS d
FROM `yourProject.yourDataset.__TABLES_SUMMARY__`
WHERE SUBSTR(table_id, 1, LENGTH('fact_')) = 'fact_'
AND LENGTH(table_id) = LENGTH('fact_') + 8
GROUP BY SUBSTR(table_id, 1, LENGTH(table_id) - 8)
)
Of course, you can replace LENGTH('fact_') with 5; I just wrote it this way so it is easier to understand.
And 8 is the length of the expected suffix, so you are catching only the expected tables from a list such as:
fact_01012001
fact_01022001
fact_01032001
I would like to improve the solution given by Mikhail Berlyant one step further.
As the number of shards grows, you will see a couple of problems:
You are always querying all the shards, increasing the billing (BigQuery bills by the bytes processed by the query: more shards, more bytes).
There is a limit of 1000 shards that you can query in a single query (one shard per day is a bit under 3 years' worth of data).
With the solution below, you would only be querying 1 or 2 shards, depending on whether the daily data has already been loaded or not.
SELECT *
FROM `yourProject.yourDataset.fact_*`
WHERE
PARSE_DATE('%m%d%Y',
_TABLE_SUFFIX) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
AND CURRENT_DATE()
AND
_TABLE_SUFFIX IN (
SELECT
FORMAT_DATE('%m%d%Y', MAX(PARSE_DATE('%m%d%Y', SUBSTR(table_id, - 8)))) AS d
FROM `yourProject.yourDataset.__TABLES_SUMMARY__`
WHERE STARTS_WITH(table_id, 'fact_')
AND LENGTH(table_id) = LENGTH('fact_') + 8
)
All the notes from the original answer apply.

Returning SQL data from multiple tables, limited by date range in 1 table using Athena / Presto

I'm SLOWLY making my way through a project leveraging AWS Athena to process various log files. My goal is to use the log files for event correlation, so I need to find some way to select and display data from multiple tables, within a given time range, from a single SQL statement. Here is an example of what I am trying to achieve:
scada.timestamp process.eventid scada.srcaddr process.requestid scada.action
2017-03-16T07:25:46.000Z c148e2ce-8500-467a-a970-ef1d43dd4aea 172.31.25.225 032bfafb-e8a3-4c06-a2dc-fa740abc135 ACCEPT
2017-03-16T07:25:46.000Z 8cc8143a-cf55-4db3-b112-0ff7f268edd0 172.31.25.225 f413e138-9445-408f-8124-ee6c33229889 ACCEPT
Here is a sample of data from the 2 tables:
Table 1:
SELECT eventtime, requestid, eventid FROM process_native limit 10;
eventtime requestid eventid
2016-05-07T08:57:37Z 032bfafb-e8a3-4c06-a2dc-fa740abc135c c148e2ce-8500-467a-a970-ef1d43dd4aea
2016-05-07T08:57:37Z f413e138-9445-408f-8124-ee6c33229889 8cc8143a-cf55-4db3-b112-0ff7f268edd0
Table 2:
SELECT tstart, srcaddr, action FROM scada_raw limit 10;
tstart srcaddr action
1489509010 139.59.39.211 REJECT
1489509010 172.31.20.111 ACCEPT
As table 2 stores the time as Unix time, which complicates things a little, I need to convert it so that I have a common time format to work with:
Table 2 with updated time:
SELECT to_iso8601(from_unixtime(tstart)) as timestamp, srcaddr, action FROM scada_raw limit 10;
timestamp srcaddr action
2017-03-16T07:25:46.000Z 172.31.25.225 ACCEPT
2017-03-16T07:25:46.000Z 172.31.25.225 ACCEPT
Frankly, I have no idea how to go about this :)
Here is a query I thought up; it just times out:
SELECT process_native.eventid,
process_native.requestid,
scada_raw.srcaddr,
scada_raw.action
FROM process_native, scada_raw
WHERE scada_raw.eventtime >= '2017-02-17T00:00:00Z'
AND scada_raw.eventtime < '2017-03-20T00:00:00Z'
I really don't know where to go next; I've been working with SQL for all of 3 days now, and this is WAY beyond me. Is my goal even achievable?
Thank you!
You might be able to get the records near each other, even if you can't guarantee the dates will match for a proper join. For example:
eventtime requestid eventid srcaddr action
2017-03-14 16:30:10.000 139.59.39.211 REJECT
2017-03-14 16:30:10.000 172.31.20.111 ACCEPT
2017-03-14 16:30:11.000 032bfafb-e8... c148e2ce-85...
2017-03-14 16:30:11.000 f413e138-94... 8cc8143a-cf...
From a query like this:
WITH TimelineRecords AS (
SELECT
eventtime,
requestid,
eventid,
NULL srcaddr,
NULL action
FROM
process_native
WHERE
eventtime BETWEEN timestamp '2017-03-14 16:30:00' AND timestamp '2017-03-14 16:35:00'
UNION ALL
SELECT
from_unixtime(tstart) eventtime,
NULL requestid,
NULL eventid,
srcaddr,
action
FROM
scada_raw
WHERE
from_unixtime(tstart) BETWEEN timestamp '2017-03-14 16:30:00' AND timestamp '2017-03-14 16:35:00'
)
SELECT
*
FROM
TimelineRecords
ORDER BY
eventtime;
Sorry about the two WHERE clauses; Athena didn't like it when I put the filter on the last SELECT statement.

How to choose the latest partition in BigQuery table?

I am trying to select data from the latest partition in a date-partitioned BigQuery table, but the query still reads data from the whole table.
I've tried (as far as I know, BigQuery does not support QUALIFY):
SELECT col FROM table WHERE _PARTITIONTIME = (
  SELECT pt FROM (
    SELECT pt, RANK() OVER(ORDER BY pt DESC) AS rnk FROM (
      SELECT _PARTITIONTIME AS pt FROM table GROUP BY 1
    )
  )
  WHERE rnk = 1
);
But this does not work and reads all rows.
SELECT col from table WHERE _PARTITIONTIME = TIMESTAMP('YYYY-MM-DD')
where 'YYYY-MM-DD' is a specific date does work.
However, I need to run this script in the future, but the table update (and the _PARTITIONTIME) is irregular. Is there a way I can pull data only from the latest partition in BigQuery?
October 2019 Update
Support for Scripting and Stored Procedures is now in beta (as of October 2019)
You can submit multiple statements separated with semi-colons and BigQuery is able to run them now
See example below
DECLARE max_date TIMESTAMP;
SET max_date = (
SELECT MAX(_PARTITIONTIME) FROM `project.dataset.partitioned_table`);
SELECT * FROM `project.dataset.partitioned_table`
WHERE _PARTITIONTIME = max_date;
Update for those who like downvoting without checking the context:
I think this answer was accepted because it addressed the OP's main question, "Is there a way I can pull data only from the latest partition in BigQuery?". It was mentioned in the comments that, obviously, the BQ engine still scans ALL rows but returns the result based ONLY on the most recent partition. As was already mentioned in a comment on the question, this is easily addressed by scripting that logic: first get the result of the subquery, and then use it in the final query.
Try
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
SELECT MAX(TIMESTAMP(partition_id))
FROM [dataset.partitioned_table$__PARTITIONS_SUMMARY__]
)
or
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
SELECT MAX(_PARTITIONTIME)
FROM [dataset.partitioned_table]
)
Sorry for digging up this old question, but it came up in a Google search and I think the accepted answer is misleading.
As far as I can tell from the documentation and running tests, the accepted answer will not prune partitions because a subquery is used to determine the most recent partition:
Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.
So, although the suggested answer will deliver the results you expect, it will still query all partitions. It will not ignore all older partitions and only query the latest.
The trick is to use a more-or-less constant value to compare to, instead of a subquery. For example, if _PARTITIONTIME isn't irregular but daily, try pruning partitions by getting yesterday's partition like so:
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
Sure, this isn't always the latest data, but in my case this happens to be close enough. Use INTERVAL 0 DAY if you want today's data, and don't care that the query will return 0 results for the part of the day where the partition hasn't been created yet.
I'm happy to learn if there is a better workaround to get the latest partition!
List all partitions with:
#standardSQL
SELECT
_PARTITIONTIME as pt
FROM
`[DATASET].[TABLE]`
GROUP BY 1
And then choose the latest timestamp.
Good luck :)
https://cloud.google.com/bigquery/docs/querying-partitioned-tables
I found a workaround for this issue. You can use a WITH statement, select the last few partitions, and filter the result. I think this is a better approach because:
You are not limited to a fixed partition date (like today - 1 day). It will always take the latest partition from the given range.
It will only scan the last few partitions and not the whole table.
Example with last 3 partitions scan:
WITH last_three_partitions as (select *, _PARTITIONTIME as PARTITIONTIME
FROM dataset.partitioned_table
WHERE _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 DAY))
SELECT col1, PARTITIONTIME from last_three_partitions
WHERE PARTITIONTIME = (SELECT max(PARTITIONTIME) from last_three_partitions)
A compromise that manages to query only a few partitions without resorting to scripting or failing with missing partitions for fixed dates.
WITH latest_partitions AS (
SELECT *, _PARTITIONDATE AS date
FROM `myproject.mydataset.mytable`
WHERE _PARTITIONDATE > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
)
SELECT
*
FROM
latest_partitions
WHERE
date = (SELECT MAX(date) FROM latest_partitions)
You can leverage the __TABLES__ list of tables to avoid re-scanning everything or having to hope the latest partition is within ~3 days. I did the split and ordinal stuff to guard against the case where my table prefix appears more than once in the table name for some reason.
This should work for either _PARTITIONTIME or _TABLE_SUFFIX.
select * from `project.dataset.tablePrefix*`
where _PARTITIONTIME = (
SELECT split(table_id,'tablePrefix')[ordinal(2)] FROM `project.dataset.__TABLES__`
where table_id like 'tablePrefix%'
order by table_id desc limit 1)
I had this answer in a less popular question, so copying it here as it's relevant (and this question is getting more pageviews):
Mikhail's answer looks like this (working on public data):
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
AND wiki='es'
# 122.2 MB processed
But it seems the question wants something like this:
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')
AND wiki='es'
# 50.6 GB processed
... but for way less than 50.6GB
What you need now is some sort of scripting, to perform this in 2 steps:
max_date = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')
;
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = {{max_date}}
AND wiki='es'
# 115.2 MB processed
You will have to script this outside BigQuery - or wait for news on https://issuetracker.google.com/issues/36955074.
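Scripting has since landed in BigQuery (see the October 2019 update above), so the two-step flow can now run as a single script; a minimal sketch (the exact bytes processed may differ from the figures quoted):
DECLARE max_date DATE;

-- step 1: aggregate to find the latest partition date
SET max_date = (
SELECT DATE(MAX(datehour))
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE wiki='es');

-- step 2: use the variable as a constant-like filter
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = max_date
AND wiki='es';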
Building on the answer from Chase: if you have a table that requires you to filter over a column, and you're receiving the error:
Cannot query over table 'myproject.mydataset.mytable' without a filter over column(s) '_PARTITION_LOAD_TIME', '_PARTITIONDATE', '_PARTITIONTIME' that can be used for partition elimination
Then you can use:
SELECT
MAX(_PARTITIONTIME) AS pt
FROM
`myproject.mydataset.mytable`
WHERE _PARTITIONTIME IS NOT NULL
Instead of the latest partition, I've used this to get the earliest partition in a dataset by simply changing max to min.
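For reference, the earliest-partition variant is the same query with MIN:
SELECT
MIN(_PARTITIONTIME) AS pt
FROM
`myproject.mydataset.mytable`
WHERE _PARTITIONTIME IS NOT NULL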