Bigquery Select all latest partitions from a wildcard set of tables - sql

We have a set of Google BigQuery tables which are all distinguished by a wildcard for technical reasons, for example content_owner_asset_metadata_*. These tables are updated daily, but at different times.
We need to select the latest partition from each table in the wildcard.
Right now we are using this query to build our derived tables:
SELECT
*
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME = (
SELECT
MIN(time)
FROM (
SELECT
MAX(_PARTITIONTIME) as time
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)
)
This statement finds out the date that all the up-to-date tables are guarenteed to have and selects that date's data, however I need a filter that selects the data from the maximum partition time of each table. I know that I'd need to use _TABLE_SUFFIX with _PARTITIONTIME, but cannot quite work out how to make a select work without just loading all our data (very costly) and using a standard greatest-n-per-group solution.
We cannot just union a bunch of static tables, as our dataset ingestion is liable to change and the scripts we build need to be able to accomodate.

With BigQuery scripting (Beta now), there is a way to prune the partitions.
Basically, a scripting variable is defined to capture the dynamic part of a subquery. Then in subsequent query, scripting variable is used as a filter to prune the partitions to be scanned.
Below example uses BigQuery public dataset to demonstrate how to prune partition to only query and scan on latest day of data.
DECLARE max_date TIMESTAMP
DEFAULT (SELECT MAX(_PARTITIONTIME) FROM `bigquery-public-data.sec_quarterly_financials.numbers`);
SELECT * FROM `bigquery-public-data.sec_quarterly_financials.numbers`
WHERE _PARTITIONTIME = max_date;

With INFORMATION_SCHEMA.PARTITIONS (preview) as of posting, this can be achieved by joining to the PARTITIONS table as follows (e.g. with HOUR partitioning):
SELECT i.*
FROM `project.dataset.prefix_*` i
JOIN (
SELECT * EXCEPT (r)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY table_name ORDER BY partition_id DESC) AS r
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name LIKE "%prefix%"
AND partition_id NOT IN ("__NULL__", "__UNPARTITIONED__"))
WHERE r = 1) p
ON (FORMAT_TIMESTAMP("%Y%m%d%H", i._PARTITIONTIME) = p.partition_id
AND CONCAT("prefix_", i._TABLE_SUFFIX) = p.table_name)

Related

BigQuery partition pruning when making union and joining partitioned tables

I have the following pattern for my data.
Ingestion time partitioned (daily) raw data tables - contain changes of a particular entity.
Ingestion time partitioned snapshot tables - contain the whole state of a particular entity for a given day. The current partition is populated once a day by adding the previous partition and data from a raw data table (current day).
Partitioned view - exposes data for a given day. The view is required to expose also new data (current date) from the raw data table which won't be available until the snapshot partition is populated.
Now I have two views for two entities: One and Two. I want to create another view which will join data from two views based on partition.
Unfortunately, BigQuery cannot handle partition pruning correctly and does not perform partition pruning when joining entities.
My query (not completed yet, lack of deduplication etc. but it's not the point):
WITH
one_raw_data AS (
SELECT _pt, * EXCEPT(_pt),
FROM `MyDataset.OneRawData`
WHERE _pt = TIMESTAMP(CURRENT_DATE())
),
one_snapshot_table AS (
SELECT _PARTITIONTIME AS _pt, *
FROM `MyDataset.OneSnapshot`
WHERE _PARTITIONTIME <= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
),
two_raw_data AS (
SELECT _pt, * EXCEPT(_pt),
FROM `MyDataset.TwoRawData`
WHERE _pt = TIMESTAMP(CURRENT_DATE())
),
two_snapshot_table AS (
SELECT _PARTITIONTIME AS _pt, *
FROM `MyDataset.TwoSnapshot`
WHERE _PARTITIONTIME <= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
),
one_full_view AS (
SELECT * FROM one_snapshot_table
UNION ALL
SELECT * FROM one_raw_data
),
two_full_view AS (
SELECT * FROM two_snapshot_table
UNION ALL
SELECT * FROM two_raw_data
),
finalView AS (
SELECT one._pt
FROM one_full_view one
INNER JOIN two_full_view two
ON one._pt = two._pt
AND one.myKey = two.myKey
)
SELECT *
FROM finalView
WHERE _pt = '2022-10-23'
When I try to query: e.g. one_full_view partition pruning works correctly. Seems like a combination of unions + join does not work. Is it expected behaviour for BigQuery? Can this be achieved without making a sharded view for each day and hardcoding partitions?

Querying all partition table

I have around 600 partitioned tables called table.ga_session. Each table is separated by 1 day, and for each table it has its own unique name, for example, table for date (30/12/2021) has its name as table.ga_session_20211230. The same goes for other table, the naming format would be like this table.ga_session_YYYYMMDD.
Now, when I try to call all partitioned table, I cannot use command like this:. The error showed that _PARTITIONTIME is unrecognized.
SELECT
*,
_PARTITIONTIME pt
FROM `table.ga_sessions_20211228`
where _PARTITIONTIME
BETWEEN TIMESTAMP('2019-01-01')
AND TIMESTAMP('2020-01-02')
I also tried this and does not work
select *
from between `table.ga_sessions_20211228`
and
`table.ga_sessions_20211229`
I also cannot use FROM 'table.ga_sessions' to apply WHERE clause to take out range of time as the table does not exist. How do I call all of these partitioned table? Thank you in advance!
You can query using wildcard tables. For example:
SELECT max
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX = '1929'
This will specifically query the gsod1929 table, but the table_suffix clause can be excluded if desired.
In your scenario you could do:
select *
from table.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20190101' and '20200102'
For more information see the documentation here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/wildcard-table-reference

Getting peak value of a column in table till this date

I have an oracle table having columns {date, id, profit, max_profit}.
I have data in date and profit, and I want highest value of profit till date in max_profit, I am using query below
UPDATE MY_TABLE a SET a.MAX_PROFIT = (SELECT MAX(b.PROFIT)
FROM MY_TABLE b WHERE b.DATE <= a.DATE
AND a.id = b.id)
This is giving me correct result, but I have millions of rows for which query is taking considerable time, any faster way of doing it ?
You can use a MERGE statement with an analytic function:
MERGE INTO my_table dst
USING (
SELECT ROWID rid,
MAX( profit ) OVER ( PARTITION BY id ORDER BY "DATE" ) AS max_profit
FROM my_table
) src
ON ( src.rid = dst.ROWID )
WHEN MATCHED THEN
UPDATE SET max_profit = src.max_profit;
When you do something like "SELECT MAX(...)" you're going to scan all the records implicated in the 'WHERE" part of the query, so you want to make getting all those records as easy on the database as possible.
Do you have an index on the table that includes the id and date columns?
Depending on the behavior of this application, if you're doing a lot fewer updates/inserts (as opposed to doing a ton of reads during reporting or some other process), a possible performance enhancement might be to keep the value you're storing in the max_profit column up to date somewhere while you're changing the data. Have you considered a separate table that just stores the profit calculation for each possible date?

How to extract record's table name when using Table wildcard functions [duplicate]

I have a set of day-sharded data where individual entries do not contain the day. I would like to use table wildcards to select all available data and get back data that is grouped by both the column I am interested in and the day that it was captured. Something, in other words, like this:
SELECT table_id, identifier, Sum(AppAnalytic) as AppAnalyticCount
FROM (TABLE_QUERY(database_main,'table_id CONTAINS "Title_" AND length(table_id) >= 4'))
GROUP BY identifier, table_id order by AppAnalyticCount DESC LIMIT 10
Of course, this does not actually work because table_id is not visible in the table aggregation resulting from the TABLE_QUERY function. Is there any way to accomplish this? Some sort of join on table metadata perhaps?
This functionality is available now in BigQuery through _TABLE_SUFFIX pseudocolumn. Full documentation is at https://cloud.google.com/bigquery/docs/querying-wildcard-tables.
Couple of things to note:
You will need to use Standard SQL to enable table wildcards
You will have to rename _TABLE_SUFFIX into something else in your SELECT list, i.e. following example illustrates it
SELECT _TABLE_SUFFIX as table_id, ... FROM `MyDataset.MyTablePrefix_*`
Not available today, but something I'd love to have too. The team takes feature requests seriously, so thanks for adding support for this one :).
In the meantime, a workaround is doing a manual union of a SELECT of each table, plus an additional column with the date data.
For example, instead of:
SELECT x, #TABLE_ID
FROM table201401, table201402, table201303
You could do:
SELECT x, month
FROM
(SELECT x, '201401' AS month FROM table201401),
(SELECT x, '201402' AS month FROM table201402),
(SELECT x, '201403' AS month FROM table201403)

Is there a way to select table_id in a Bigquery Table Wildcard Query

I have a set of day-sharded data where individual entries do not contain the day. I would like to use table wildcards to select all available data and get back data that is grouped by both the column I am interested in and the day that it was captured. Something, in other words, like this:
SELECT table_id, identifier, Sum(AppAnalytic) as AppAnalyticCount
FROM (TABLE_QUERY(database_main,'table_id CONTAINS "Title_" AND length(table_id) >= 4'))
GROUP BY identifier, table_id order by AppAnalyticCount DESC LIMIT 10
Of course, this does not actually work because table_id is not visible in the table aggregation resulting from the TABLE_QUERY function. Is there any way to accomplish this? Some sort of join on table metadata perhaps?
This functionality is available now in BigQuery through _TABLE_SUFFIX pseudocolumn. Full documentation is at https://cloud.google.com/bigquery/docs/querying-wildcard-tables.
Couple of things to note:
You will need to use Standard SQL to enable table wildcards
You will have to rename _TABLE_SUFFIX into something else in your SELECT list, i.e. following example illustrates it
SELECT _TABLE_SUFFIX as table_id, ... FROM `MyDataset.MyTablePrefix_*`
Not available today, but something I'd love to have too. The team takes feature requests seriously, so thanks for adding support for this one :).
In the meantime, a workaround is doing a manual union of a SELECT of each table, plus an additional column with the date data.
For example, instead of:
SELECT x, #TABLE_ID
FROM table201401, table201402, table201303
You could do:
SELECT x, month
FROM
(SELECT x, '201401' AS month FROM table201401),
(SELECT x, '201402' AS month FROM table201402),
(SELECT x, '201403' AS month FROM table201403)