BigQuery: copy partitions of a partitioned table within a certain range - google-bigquery

I have a Production dataset and a Test dataset in BigQuery. Both are partitioned by day (both have a _PARTITIONTIME column).
Is there a way for me to copy partitions of the Production tables (using bq cp) within a specific range to the Test dataset? For example: the last month.
If I just wanted to copy a single partition, I would use the $[yyyymmdd] decorator to select it, but I am trying to avoid listing 30 comma-separated partitions to copy a month's worth.
I know querying is possible with something like _PARTITIONTIME >= "2018-01-01 00:00:00" AND _PARTITIONTIME < "2018-01-30 00:00:00".
Is it also possible to do a similar thing for copy?
Thanks

The answer is no.
You can write a simple script that uses the bq command to copy the partitions one by one.
Or you can generate the bq commands with a query like this:
#standardSQL
SELECT
CONCAT('bq cp <srcproj>:<dataset>.<table>$', partname, ' <testproj>:<dataset>.<table>$', partname)
FROM (
SELECT
DISTINCT FORMAT_DATETIME('%Y%m%d',
CAST(_PARTITIONDATE AS datetime)) partname
FROM
`<srcproj>.<dataset>.<table>`
WHERE
_PARTITIONTIME >= "2018-05-10 00:00:00"
AND _PARTITIONTIME < "2018-05-13 00:00:00")

Related

Automatically add partition conditions to WHERE clause

I have a columnar table that is partitioned by day and hour. It is stored on S3 in parquet files to be queried by Athena. Here is the CREATE TABLE:
CREATE EXTERNAL TABLE foo (
-- other columns here
dt timestamp
)
PARTITIONED BY (day string, hour string)
STORED AS parquet
LOCATION 's3://foo/foo'
And the layout on S3 is like:
s3://foo/foo/day=2021-10-10/hour=00/*.parquet
s3://foo/foo/day=2021-10-10/hour=01/*.parquet
...etc
s3://foo/foo/day=2021-10-10/hour=23/*.parquet
So a query like the following will be fast, because the partition columns are used to filter it and only one hour's worth of Parquet files is scanned:
-- fast, easy to write
SELECT * FROM foo WHERE day = '2021-10-10' AND hour = '00'
However, the table also includes the full datetime dt. Usually we want to write queries for ranges that don't align to a day/hour boundary, and/or are in a different timezone.
For example, this will scan ALL parquet files and be very slow:
-- slow, easy to write
SELECT * FROM foo WHERE dt > '2021-10-09 23:05:00' AND dt < '2021-10-11 01:00:00'
It can be improved by manually calculating the day and hour that minimally enclose the time period:
-- fast, painful to write
SELECT * FROM foo
WHERE
((day, hour) IN (('2021-10-09', '23'), ('2021-10-11', '00')) OR day = '2021-10-10')
AND
dt > '2021-10-09 23:05:00' AND dt < '2021-10-11 01:00:00'
Ideally this extra condition could be added transparently by the database, so as to avoid having to manually add the ((day, hour) IN (...)) clause.
Is this possible somehow with Athena?
I've wished for this feature many times, but unfortunately Athena does not support it. You have to include both the predicate for the dt column and the day and hour partition keys.

Partition BigQuery LIMIT over date range

I'm quite new to SQL and BigQuery, so this might be simple. I'm running some queries on the public GDELT dataset in BQ and have a question regarding LIMIT. GDELT is massive (14.4 TB), and when I query for something, in this case a person, I can get 100k rows of results or more, which in this case is too much. But when I use LIMIT, it does not seem to spread the results evenly over the dates, so I get very random timelines. How does LIMIT work, and is there a way to get the results distributed more evenly across days?
SELECT DATE,V2Tone,DocumentIdentifier as URL, Themes, Persons, Locations
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE DATE>=20210610000000 and _PARTITIONTIME >= TIMESTAMP(#start_date)
AND DATE<=20210818999999 and _PARTITIONTIME <= TIMESTAMP(#end_date)
AND LOWER(DocumentIdentifier) like #url_topic
LIMIT #limit
When running this query and doing some preprocessing, I get a time series based on 15k results, but they are distributed very unevenly/randomly across the days (there are over 500k results in total if I don't use LIMIT). I would like to write a query that limits my output to 15k rows but distributes them roughly equally over the days.
You need to ORDER BY; when you do not sort your result, the order of the returned rows is not guaranteed.
But if you are looking to get the same number of rows per day, you can use window functions:
select * from (
SELECT
DATE,
V2Tone,
DocumentIdentifier as URL,
Themes,
Persons,
Locations,
row_number() over (partition by DATE) rn
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE
DATE >= 20210610000000 AND DATE <= 20210818999999
and _PARTITIONDATE >= #start_date and _PARTITIONDATE <= #end_date
AND LOWER(DocumentIdentifier) like #url_topic
) t where rn <= #numberofrowsperday
If you are passing dates only, you can use _PARTITIONDATE to filter the partitions.

Athena queries change on the fly according to the partitions introduced in S3

I have partitioned my DATA_BUCKET in S3 with the structure
S3/DATA_BUCKET/table_1/YYYY/MM/DD/files.parquet
Now I have three additional columns in table_1 which are visible in Athena as "partition_0", "partition_1" and "partition_2" (for year, month and day respectively).
Until now my apps have been making time-related queries based on the "time_stamp" column in the table:
select * from table_1 where time_stamp like '2023-01-17%'
Now, to leverage the partitions for better performance, the corresponding new query is:
select * from table_1 where partition_0 = '2023' and partition_1 = '01' and partition_2 = '17'
Problem:
Since my apps already make many queries on time_stamp, I do not want to change them, but I would still like those queries to be transformed somehow into the "partition-style" queries shown above.
Is there any way to do this internally in Athena, or some other approach?
TIA
You can create a view over the original table with a new "time_stamp" column.
This column calculates a date from the date parts:
CREATE OR REPLACE VIEW my_view AS
SELECT mytable.col1,
mytable.col2,
cast(date_add('day', trans_day - 1, date_add('month', trans_month - 1, date_add('year', trans_year - 1970, from_unixtime(0)))) as Date) as time_stamp
FROM my_db.my_table mytable
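For the schema in the question, the same idea adapted to the "partition_0"/"partition_1"/"partition_2" string columns might look roughly like this (a sketch only: the view name, the placeholder column list and the date_parse format are assumptions, and the original time_stamp column should not be repeated in the select list):
CREATE OR REPLACE VIEW my_db.table_1_view AS
SELECT
mytable.col1,
mytable.col2,
-- rebuild the timestamp from the year/month/day partition strings
date_parse(concat(mytable.partition_0, '-', mytable.partition_1, '-', mytable.partition_2), '%Y-%m-%d') AS time_stamp
FROM my_db.table_1 mytable
Queries can then filter on the derived time_stamp; since it is computed purely from partition columns, check the amount of data scanned to confirm whether Athena can still prune partitions for such filters.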

Chrome UX Report: improve query performance

I am querying the Chrome UX Report public dataset using the following query to get values for the indicated metrics over time for a set of country-specific tables. The query runs for a very long time (I stopped it at 180 seconds), and I don't know what the timeout is for a query or how to tell whether the query hung.
I'm trying to get aggregate data year over year for average_fcp, average_fp and average_dcl. I'm not sure if I'm using BigQuery correctly or whether there are ways to optimize the query to make it run faster.
This is the query I'm using.
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fcp.density) AS average_fcp,
AVG(fp.density) as average_fp,
AVG(dcl.density) as average_dcl
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_paint.histogram.bin) as fp,
UNNEST(dom_content_loaded.histogram.bin) as dcl,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE
form_factor.name = 'desktop' AND
fcp.start > 20000
GROUP BY
yyyymm
ORDER BY
yyyymm
I'm not sure if it makes mathematical sense to get the AVG() of all the densities - but let's do it anyways.
The bigger problem in the query is this:
UNNEST(first_paint.histogram.bin) as fp,
UNNEST(dom_content_loaded.histogram.bin) as dcl,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
-- that's an explosive join: It transforms one row with 3 arrays with ~500 elements each, into 125 million rows!!! That's why the query isn't running.
A similar query that gives you similar results:
SELECT yyyymm,
AVG(average_fcp) average_fcp,
AVG(average_fp) average_fp,
AVG(average_dcl) average_dcl
FROM (
SELECT
_TABLE_SUFFIX AS yyyymm,
(SELECT AVG(fcp.density) FROM UNNEST(first_contentful_paint.histogram.bin) fcp WHERE fcp.start > 20000) AS average_fcp,
(SELECT AVG(fp.density) FROM UNNEST(first_paint.histogram.bin) fp) AS average_fp,
(SELECT AVG(dcl.density) FROM UNNEST(dom_content_loaded.histogram.bin) dcl) AS average_dcl
FROM `chrome-ux-report.country_cl.*`
WHERE form_factor.name = 'desktop'
)
GROUP BY yyyymm
ORDER BY yyyymm
The good news: This query runs in 3.3 seconds.
Now that the query runs in 3 seconds, the most important question is: Does it make sense mathematically?
Bonus: This query makes more sense to me mathematically speaking, but I'm not 100% sure about it:
SELECT yyyymm,
AVG(average_fcp) average_fcp,
AVG(average_fp) average_fp,
AVG(average_dcl) average_dcl
FROM (
SELECT yyyymm, origin, SUM(weighted_fcp) average_fcp, SUM(weighted_fp) average_fp, SUM(weighted_dcl) average_dcl
FROM (
SELECT
_TABLE_SUFFIX AS yyyymm,
(SELECT SUM(start*density) FROM UNNEST(first_contentful_paint.histogram.bin)) AS weighted_fcp,
(SELECT SUM(start*density) FROM UNNEST(first_paint.histogram.bin)) AS weighted_fp,
(SELECT SUM(start*density) FROM UNNEST(dom_content_loaded.histogram.bin)) AS weighted_dcl,
origin
FROM `chrome-ux-report.country_cl.*`
)
GROUP BY origin, yyyymm
)
GROUP BY yyyymm
ORDER BY yyyymm
After carefully reviewing your query, I concluded that the processing time for each of the actions you are performing is around 6 seconds or less. Therefore, I decided to run each UNNEST as its own query and then append the results together using the UNION ALL method.
The query ran within 4 seconds. The syntax is:
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fcp.density) AS average_fcp
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE
form_factor.name = 'desktop' AND
fcp.start > 20000
GROUP BY
yyyymm
UNION ALL
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(fp.density) as average_fp
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(first_paint.histogram.bin) as fp
WHERE
form_factor.name = 'desktop'
GROUP BY
yyyymm
UNION ALL
SELECT
_TABLE_SUFFIX AS yyyymm,
AVG(dcl.density) as average_dcl
FROM
`chrome-ux-report.country_cl.*`,
UNNEST(dom_content_loaded.histogram.bin) as dcl
WHERE
form_factor.name = 'desktop'
GROUP BY
yyyymm
ORDER BY
yyyymm
In addition, I would like to point out that, according to the documentation, it is advisable to avoid excessive use of wildcards, opting instead for date ranges and materializing the results of large queries. Also note that BigQuery limits cached results to 10 GB.
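For example, when you do need the wildcard, the scan can be narrowed to a range of the monthly tables with _TABLE_SUFFIX; a sketch on the same dataset (the suffix range is only an illustration):
SELECT
_TABLE_SUFFIX AS yyyymm,
COUNT(*) AS row_count
FROM
`chrome-ux-report.country_cl.*`
WHERE
_TABLE_SUFFIX BETWEEN '201801' AND '201812' -- scan only the 2018 monthly tables
AND form_factor.name = 'desktop'
GROUP BY
yyyymm
ORDER BY
yyyymm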
I hope it helps.
Let me start by saying that the BigQuery query timeout is very long (6 hours), so you should not have a problem on this front, but you might encounter other errors.
We had the same issue internally: we have datasets with data divided into country tables, and even though the tables are partitioned on timestamp, when running queries over hundreds of tables not only does the query take a long time, but sometimes it fails with a resources exceeded error.
Our solution was to aggregate all these tables into a single one, adding a 'country' column and using it as a clustering column. This not only let our queries execute, it made them even faster than our temporary solution of running the same query on a subset of the country tables as intermediate steps and then combining the results. It is now faster, easier and cleaner.
Coming back to your specific question, I suggest creating a new table (which you will need to pay to store) that is the combination of all the tables inside a dataset, as a partitioned table.
The quickest way, unfortunately also the most expensive one (you will pay for the query scan), is to use a CREATE TABLE statement.
create table `project_id.dataset_id.table_id`
partition by date_month
cluster by origin
as (
select
date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
*
from `chrome-ux-report.country_cl.*`
);
If this query fails, you can run it on a subset of the tables, e.g. WHERE starts_with(_table_suffix, '2018'), and then run the following query with the 'write append' disposition against the table you created before.
select
date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
*
from `chrome-ux-report.country_cl.*`
where starts_with(_table_suffix, '2019')
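Alternatively, the append step can be written as a standard SQL INSERT ... SELECT against the table created above (a sketch; the column order must match the CREATE TABLE statement, and the table name is the placeholder used earlier):
INSERT INTO `project_id.dataset_id.table_id`
SELECT
date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
*
FROM `chrome-ux-report.country_cl.*`
WHERE starts_with(_table_suffix, '2019');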
As you may have noticed, I also used a clustering column, which I think is a best practice.
A note for whoever is curating Google public datasets: it would be nice to have a public "chrome_ux_report" dataset with just a single table partitioned by date and clustered by country.
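For reference, building such a combined table yourself (with the 'country' clustering column described above) could look along these lines; a sketch showing only two of the country datasets, with an assumed destination table name:
create table `project_id.dataset_id.all_countries`
partition by date_month
cluster by country, origin
as (
select
'cl' as country,
date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
*
from `chrome-ux-report.country_cl.*`
union all
select
'us' as country,
date(PARSE_TIMESTAMP("%Y%m%d", concat(_table_suffix, "01"), "UTC")) as date_month,
*
from `chrome-ux-report.country_us.*`
);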

How to choose the latest partition in a BigQuery table?

I am trying to select data from the latest partition in a date-partitioned BigQuery table, but the query still reads data from the whole table.
I've tried (as far as I know, BigQuery does not support QUALIFY):
SELECT col FROM table WHERE _PARTITIONTIME = (
SELECT pt FROM (
SELECT pt, RANK() OVER(ORDER BY pt DESC) AS rnk FROM (
SELECT _PARTITIONTIME AS pt FROM table GROUP BY 1)
)
WHERE rnk = 1
);
But this does not work and reads all rows.
SELECT col from table WHERE _PARTITIONTIME = TIMESTAMP('YYYY-MM-DD')
where 'YYYY-MM-DD' is a specific date does work.
However, I need to run this script in the future, but the table update (and the _PARTITIONTIME) is irregular. Is there a way I can pull data only from the latest partition in BigQuery?
October 2019 Update
Support for Scripting and Stored Procedures is now in beta (as of October 2019)
You can submit multiple statements separated by semicolons and BigQuery is now able to run them.
See example below
DECLARE max_date TIMESTAMP;
SET max_date = (
SELECT MAX(_PARTITIONTIME) FROM `project.dataset.partitioned_table`);
SELECT * FROM `project.dataset.partitioned_table`
WHERE _PARTITIONTIME = max_date;
Update for those who like downvoting without checking the context, etc.
I think this answer was accepted because it addressed the OP's main question, "Is there a way I can pull data only from the latest partition in BigQuery?", and it was noted in the comments that, obviously, the BQ engine still scans ALL rows but returns the result based ONLY on the most recent partition. As already mentioned in a comment on the question, this is easily addressed by scripting that logic: first get the result of the subquery, then use it in the final query.
Try
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
SELECT MAX(TIMESTAMP(partition_id))
FROM [dataset.partitioned_table$__PARTITIONS_SUMMARY__]
)
or
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
SELECT MAX(_PARTITIONTIME)
FROM [dataset.partitioned_table]
)
Sorry for digging up this old question, but it came up in a Google search and I think the accepted answer is misleading.
As far as I can tell from the documentation and running tests, the accepted answer will not prune partitions because a subquery is used to determine the most recent partition:
Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.
So, although the suggested answer will deliver the results you expect, it will still query all partitions. It will not ignore all older partitions and only query the latest.
The trick is to compare against a more-or-less constant value instead of a subquery. For example, if _PARTITIONTIME isn't irregular but daily, try pruning partitions by selecting yesterday's partition like so:
SELECT * FROM `dataset.partitioned_table`
WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
Sure, this isn't always the latest data, but in my case it happens to be close enough. Use INTERVAL 0 DAY if you want today's data, and don't mind that the query will return 0 results for the part of the day before the partition has been created.
I'm happy to learn if there is a better workaround to get the latest partition!
List all partitions with:
#standardSQL
SELECT
_PARTITIONTIME as pt
FROM
`[DATASET].[TABLE]`
GROUP BY 1
And then choose the latest timestamp.
Good luck :)
https://cloud.google.com/bigquery/docs/querying-partitioned-tables
I found a workaround for this issue. You can use a WITH statement, select the last few partitions, and filter the result. I think this is a better approach because:
You are not limited to a fixed partition date (like today - 1 day). It will always take the latest partition in the given range.
It will only scan the last few partitions, not the whole table.
Example that scans only the last 3 partitions:
WITH last_three_partitions as (select *, _PARTITIONTIME as PARTITIONTIME
FROM dataset.partitioned_table
WHERE _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 DAY))
SELECT col1, PARTITIONTIME from last_three_partitions
WHERE PARTITIONTIME = (SELECT max(PARTITIONTIME) from last_three_partitions)
A compromise that manages to query only a few partitions without resorting to scripting or failing with missing partitions for fixed dates.
WITH latest_partitions AS (
SELECT *, _PARTITIONDATE AS date
FROM `myproject.mydataset.mytable`
WHERE _PARTITIONDATE > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
)
SELECT
*
FROM
latest_partitions
WHERE
date = (SELECT MAX(date) FROM latest_partitions)
You can leverage the __TABLES__ list of tables to avoid re-scanning everything or having to hope the latest partition is within ~3 days. I did the split and ordinal stuff to guard against the case where my table prefix appears more than once in a table name for some reason.
This should work for either _PARTITIONTIME or _TABLE_SUFFIX.
select * from `project.dataset.tablePrefix*`
where _PARTITIONTIME = (
SELECT split(table_id,'tablePrefix')[ordinal(2)] FROM `project.dataset.__TABLES__`
where table_id like 'tablePrefix%'
order by table_id desc limit 1)
I had this answer in a less popular question, so copying it here as it's relevant (and this question is getting more pageviews):
Mikhail's answer looks like this (working on public data):
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
AND wiki='es'
# 122.2 MB processed
But it seems the question wants something like this:
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')
AND wiki='es'
# 50.6 GB processed
... but for way less than 50.6GB
What you need now is some sort of scripting, to perform this in 2 steps:
max_date = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')
;
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = {{max_date}}
AND wiki='es'
# 115.2 MB processed
You will have to script this outside BigQuery - or wait for news on https://issuetracker.google.com/issues/36955074.
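With scripting now available (see the October 2019 update earlier on this page), the same two-step approach can be expressed directly in BigQuery; a rough sketch:
DECLARE max_date DATE;
SET max_date = (
SELECT DATE(MAX(datehour))
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE wiki = 'es');
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = max_date
AND wiki = 'es';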
Building on the answer from Chase: if you have a table that requires you to filter over a column, and you're receiving the error:
Cannot query over table 'myproject.mydataset.mytable' without a filter over column(s) '_PARTITION_LOAD_TIME', '_PARTITIONDATE', '_PARTITIONTIME' that can be used for partition elimination
Then you can use:
SELECT
MAX(_PARTITIONTIME) AS pt
FROM
`myproject.mydataset.mytable`
WHERE _PARTITIONTIME IS NOT NULL
Instead of the latest partition, I've used this to get the earliest partition in a dataset by simply changing max to min.