BigQuery partition pruning when making union and joining partitioned tables - sql

I have the following pattern for my data.
Ingestion time partitioned (daily) raw data tables - contain changes of a particular entity.
Ingestion time partitioned (daily) snapshot tables - contain the whole state of a particular entity for a given day. The current partition is populated once a day by combining the previous partition with the current day's data from the raw data table.
Partitioned view - exposes data for a given day. The view must also expose new data (current date) from the raw data table, which won't be available until the snapshot partition is populated.
Now I have two views for two entities: One and Two. I want to create another view which will join data from two views based on partition.
Unfortunately, BigQuery does not perform partition pruning when joining these entities.
My query (not complete yet; deduplication etc. is still missing, but that's not the point):
WITH
one_raw_data AS (
SELECT _pt, * EXCEPT(_pt)
FROM `MyDataset.OneRawData`
WHERE _pt = TIMESTAMP(CURRENT_DATE())
),
one_snapshot_table AS (
SELECT _PARTITIONTIME AS _pt, *
FROM `MyDataset.OneSnapshot`
WHERE _PARTITIONTIME <= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
),
two_raw_data AS (
SELECT _pt, * EXCEPT(_pt)
FROM `MyDataset.TwoRawData`
WHERE _pt = TIMESTAMP(CURRENT_DATE())
),
two_snapshot_table AS (
SELECT _PARTITIONTIME AS _pt, *
FROM `MyDataset.TwoSnapshot`
WHERE _PARTITIONTIME <= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
),
one_full_view AS (
SELECT * FROM one_snapshot_table
UNION ALL
SELECT * FROM one_raw_data
),
two_full_view AS (
SELECT * FROM two_snapshot_table
UNION ALL
SELECT * FROM two_raw_data
),
finalView AS (
SELECT one._pt
FROM one_full_view one
INNER JOIN two_full_view two
ON one._pt = two._pt
AND one.myKey = two.myKey
)
SELECT *
FROM finalView
WHERE _pt = '2022-10-23'
When I query e.g. one_full_view on its own, partition pruning works correctly. It seems the combination of UNION ALL and JOIN is what breaks it. Is this expected behaviour for BigQuery? Can this be achieved without creating a sharded view for each day and hardcoding the partitions?

Related

BigQuery: Create View using ROW_NUMBER function breaks partition filter policy

We have a table created in BQ with a 'TS' column used as the partitioning column when creating the table ("PARTITION BY DATE(TS)"), and we set "require_partition_filter=true".
When we create a view like the one below, querying the view works:
CREATE OR REPLACE VIEW mydataset.test_view AS select * from
mydataset.test_table
--query based on view
select * from mydataset.test_view where TS > TIMESTAMP("2021-09-05 08:30:00")
However, if we add the ROW_NUMBER() function to the view creation statement, the same query on the view raises an error:
CREATE OR REPLACE VIEW mydataset.test_view AS
select *, ROW_NUMBER() over (
partition by ID
order by ID, TS
) as row_number from mydataset.test_table
--query based on view return error
select * from mydataset.test_view where TS > TIMESTAMP("2021-09-05 08:30:00")
--error msg: Cannot query over table without a filter over column(s) 'TS'
that can be used for partition elimination
What's the reason for it and what's available solution? Appreciate any ideas/thoughts. Thanks.
Update:
We also tried adding a WHERE clause in the view creation. However, in our case we cannot limit the actual time range exposed through the view; users want the capability to query all the data from the view. That means we can only filter TS with an always-true condition such as 'TS IS NOT NULL' or 'TS > TIMESTAMP("1970-01-01 00:00:00")'. With that, queries on the view no longer throw an error, but performance is very poor: no partition pruning actually happens at execution time, even when we add an additional TS filter when querying the view.
As per this doc, when require_partition_filter=true is enabled and you attempt to query the table without a WHERE clause on the partitioning column, it throws the error you are getting.
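For reference, a table matching the question's description could be created like this; the column list is an assumption for illustration, and only the partitioning clause and the option come from the question:
```sql
-- Assumed minimal schema; PARTITION BY DATE(TS) and
-- require_partition_filter=true come from the question.
CREATE TABLE mydataset.test_table (
  ID INT64,
  TS TIMESTAMP
)
PARTITION BY DATE(TS)
OPTIONS (require_partition_filter = TRUE);
```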
For the first view you created, the query you run passes a WHERE clause on TS, so you get the output.
In the second view, you have added the ROW_NUMBER() function, and querying that view produces the error. The error happens because the filter you pass in the query cannot be pushed down to the partitioned table.
I tried replicating your issue with a sample set of data. The workaround is to create the view with a WHERE condition (exposing a subset of the source table's data) and then query the view as required. You can refer to the query below and let me know if this workaround helps you:
Creating a view :
CREATE OR REPLACE VIEW dataset2.view3 AS
select *, ROW_NUMBER() over (
partition by id
order by id, ts
) as rownum from `myproject.dataset2.part4` where ts > TIMESTAMP("2021-09-05 08:30:00")
Here, along with ROW_NUMBER(), I have added the WHERE clause in the view's query.
Query :
Select * from `myproject.dataset2.view3`
If you add another where clause when querying over the view, it will work and it will not throw any error.
Query :
SELECT * FROM `myproject.dataset2.view3` where ts > TIMESTAMP("2021-10-30 10:20:02")
As per your reply, since you cannot limit the real data time range exposed through the view, you can refer to the cases below:
Case 1 (partition by id):
CREATE OR REPLACE VIEW dataset2.view8 AS
select *, ROW_NUMBER() over (
partition by id
order by id,ts
) as rownum from `myproject.dataset2.part4` where ts IS NOT NULL
Query 1 (on the view) :
SELECT * FROM `myproject.dataset2.view8`
It processed 160B of data.
Query 2 (on the view) :
SELECT * FROM `myproject.dataset2.view8` where ts > TIMESTAMP("2021-09-05 08:30:00")
It also processed 160B of data, which means that if you partition by id, partition pruning is not happening.
Case 2 (Partition by ts):
CREATE OR REPLACE VIEW dataset2.view8 AS
select *, ROW_NUMBER() over (
partition by ts
order by id,ts
) as rownum from `myproject.dataset2.part4` where ts IS NOT NULL
Query1 (on the view) :
SELECT * FROM `myproject.dataset2.view8`
It processed 160B of data.
Query2 (on the view ) :
SELECT * FROM `myproject.dataset2.view8` where ts > TIMESTAMP("2021-09-05 08:30:00")
It processed 128B of data, which means that partitioning by ts does allow partition pruning.
Since your requirement is to attain partition pruning, I tried PARTITION BY ts because, once the view is created, you will query the view, not the table. So to attain partition pruning, use the same column in the window's PARTITION BY when creating the view (i.e. ts, the column the table is partitioned on), and use a WHERE clause on ts when querying the view.
If this does not fulfill your requirement, provide your sample data with output and explain the use case of why you want to partition by id in the view.
A view is really just a subquery of the original partitioned table. Your first statement works because the query on the view filters on the partitioned field ts, and that filter gets passed through to the partitioned table. BigQuery apparently recognizes that the view's SELECT * adds nothing on top of the base table, so it does not need to materialize the entire partitioned table first.
The reason the second one does not work is that the view's
ROW_NUMBER() over (partition by ID order by ID, TS)
must be computed over the entire partitioned table: the outer WHERE cannot be applied before the window function builds its own partitions, so the underlying table sees no filter on TS, which require_partition_filter forbids.
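To illustrate, pruning comes back as soon as the filter is applied before the window function runs. A minimal sketch, reusing the question's table and column names (note that row numbers are then computed only over the filtered rows, which changes the semantics):
```sql
-- Filter the partitioned table first, then apply the window function.
-- The inner WHERE reaches the partitioned table, so pruning can happen.
SELECT *, ROW_NUMBER() OVER (
  PARTITION BY ID
  ORDER BY ID, TS
) AS row_number
FROM (
  SELECT *
  FROM mydataset.test_table
  WHERE TS > TIMESTAMP("2021-09-05 08:30:00")
);
```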
You can use a table function instead of a view.
Documentation link
CREATE OR REPLACE TABLE FUNCTION mydataset.test_tablefunction (fromTime timestamp)
AS
select *, ROW_NUMBER() over (
partition by ID
order by ID, TS
) as row_number from mydataset.test_table
where TS > fromTime
usage example:
select * from mydataset.test_tablefunction(TIMESTAMP("2021-09-05 08:30:00"))

Using "match_recognize" in a Common Table Expression in Snowflake

Update: This was answered here.
I am putting together a somewhat complex query to do event detection, join(s), and time-based binning with a large time-series dataset in Snowflake. I recently noticed that match_recognize lets me elegantly detect time-series events, but whenever I try to use a match_recognize expression within a Common Table Expression (with .. as ..), I receive the following error:
SQL compilation error: MATCH_RECOGNIZE not supported in this context.
I've done a lot of searching/reading, but haven't found any documented limitations on match_recognize in CTEs. Here's my query:
with clean_data as (
-- Remove duplicate entries
select distinct id, timestamp, measurement
from dataset
),
label_events as (
select *
from clean_data
match_recognize (
partition by id
order by timestamp
measures
match_number() as event_number
all rows per match
after match skip past last row
pattern(any_row row_between_gaps+)
define
-- Classify contiguous sections of datapoints with < 20min between adjacent points.
row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
)
)
-- Do binning with width_bucket/etc. here
select id, timestamp, measurement, event_number
from label_events;
And I get the same error as above with this.
Is this a limitation that I'm not seeing, or am I doing something wrong?
A non-recursive CTE can always be rewritten as an inline view:
--select ...
--from (
select id, timestamp, measurement, event_number
from (select distinct id, timestamp, measurement
from dataset) clean_data
match_recognize (
partition by id
order by timestamp
measures
match_number() as event_number
all rows per match
after match skip past last row
pattern(any_row row_between_gaps+)
define
-- Classify contiguous sections of datapoints with < 20min between adjacent points.
row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
)mr
-- ) -- if other transformations are required
It is not ideal, but at least it will allow the query to run.
Per this thread, from a comment by Felipe Hoffa: MATCH_RECOGNIZE with CTE in Snowflake
This seemed to be an undocumented limitation of Snowflake at the time. A two- or three-step solution has worked well for me:
with clean_data as (
-- Remove duplicate entries
select distinct id, timestamp, measurement
from dataset
)
select *
from clean_data
match_recognize (
partition by id
order by timestamp
measures
match_number() as event_number
all rows per match
after match skip past last row
pattern(any_row row_between_gaps+)
define
-- Classify contiguous sections of datapoints with < 20min between adjacent points.
row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
);
set quid=last_query_id();
with label_events as (
select *
from table(result_scan($quid))
)
-- Do binning with width_bucket/etc. here
select id, timestamp, measurement, event_number
from label_events;
I prefer to use a variable here, because I can re-run the second query multiple times during development/debugging without having to re-run the first query.
It is also important to note that cached GEOGRAPHY objects in Snowflake are converted to GeoJSON, so when retrieving them with result_scan, you must cast them back to the GEOGRAPHY type.
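For example, assuming a hypothetical GEOGRAPHY column named geom in the cached result, the cast back would look like:
```sql
-- 'geom' is a hypothetical GEOGRAPHY column from the first query.
-- result_scan returns it as GeoJSON, so cast it back explicitly.
select id, to_geography(geom) as geom
from table(result_scan($quid));
```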

Bigquery Select all latest partitions from a wildcard set of tables

We have a set of Google BigQuery tables which are all distinguished by a wildcard for technical reasons, for example content_owner_asset_metadata_*. These tables are updated daily, but at different times.
We need to select the latest partition from each table in the wildcard.
Right now we are using this query to build our derived tables:
SELECT
*
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME = (
SELECT
MIN(time)
FROM (
SELECT
MAX(_PARTITIONTIME) as time
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)
)
This statement finds the date that all the up-to-date tables are guaranteed to have, and selects that date's data. However, I need a filter that selects the data from the maximum partition time of each table. I know that I'd need to use _TABLE_SUFFIX with _PARTITIONTIME, but cannot quite work out how to make a select work without just loading all our data (very costly) and using a standard greatest-n-per-group solution.
We cannot just union a bunch of static tables, as our dataset ingestion is liable to change, and the scripts we build need to be able to accommodate that.
With BigQuery scripting (in beta at the time of writing), there is a way to prune the partitions.
Basically, a scripting variable is defined to capture the dynamic part of a subquery. Then, in a subsequent query, the scripting variable is used as a filter to prune the partitions to be scanned.
The example below uses a BigQuery public dataset to demonstrate how to prune partitions so that only the latest day of data is queried and scanned.
DECLARE max_date TIMESTAMP
DEFAULT (SELECT MAX(_PARTITIONTIME) FROM `bigquery-public-data.sec_quarterly_financials.numbers`);
SELECT * FROM `bigquery-public-data.sec_quarterly_financials.numbers`
WHERE _PARTITIONTIME = max_date;
With INFORMATION_SCHEMA.PARTITIONS (preview) as of posting, this can be achieved by joining to the PARTITIONS table as follows (e.g. with HOUR partitioning):
SELECT i.*
FROM `project.dataset.prefix_*` i
JOIN (
SELECT * EXCEPT (r)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY table_name ORDER BY partition_id DESC) AS r
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name LIKE "%prefix%"
AND partition_id NOT IN ("__NULL__", "__UNPARTITIONED__"))
WHERE r = 1) p
ON (FORMAT_TIMESTAMP("%Y%m%d%H", i._PARTITIONTIME) = p.partition_id
AND CONCAT("prefix_", i._TABLE_SUFFIX) = p.table_name)

I want "live materialized views", with the latest info for any row

I saw this solution as an alternative to materialized views:
I want a "materialized view" of the latest records
But it's using the scheduled queries that run at most every 3 hours. My users are expecting live data, what can I do?
2018-10: BigQuery doesn't support materialized views, but you can use this approach:
Use the previous solution to "materialize" a summary of the latest data, until the time that scheduled query ran.
Create a view that combines the materialized data, with a live view of the latest data on the append-only table.
Code would look like this:
CREATE OR REPLACE VIEW `wikipedia_vt.just_latest_rows_live` AS
SELECT latest_row.*
FROM (
SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] latest_row
FROM (
SELECT * FROM `fh-bigquery.wikipedia_vt.just_latest_rows`
# previously "materialized" results
UNION ALL
SELECT * FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
# append-only table, source of truth
WHERE datehour > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY )
) a
GROUP BY title
)
Note that BigQuery is able to use TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY ) to prune partitions effectively.

How does one do a SQL select over multiple partitions?

Is there a more efficient way than:
select * from transactions partition( partition1 )
union all
select * from transactions partition( partition2 )
union all
select * from transactions partition( partition3 );
It should be exceptionally rare that you use the PARTITION( partitionN ) syntax in a query.
You would normally just want to specify values for the partition key and allow Oracle to perform partition elimination. If your table is partitioned daily based on TRANSACTION_DATE, for example
SELECT *
FROM transactions
WHERE transaction_date IN (date '2010-11-22',
date '2010-11-23',
date '2010-11-24')
would select all the data from today's partition, yesterday's partition, and the day before's partition.
Can you provide additional context? What are your predicates? What makes you think that you need to explicitly tell the optimizer to go against multiple partitions? You may have the wrong partition key in use, for example.