Bigquery apply subquery to partition time - sql

I have two queries which work correct separately but together there is an error:
WITH minimum_time AS
(
SELECT DATE (min(_PARTITIONTIME)) AS minimums
FROM `Day`
WHERE DATE (_PARTITIONTIME) = "2020-11-20"
)
SELECT *
FROM `Day`
WHERE DATE (_PARTITIONTIME) > (SELECT minimums
FROM minimum_time)
and I get this error:
Cannot query over table 'Day' without a filter over column(s) '_PARTITION_LOAD_TIME', '_PARTITIONDATE', '_PARTITIONTIME' that can be used for partition elimination
I do not quite understand why this is happening, first query returns a date.

You were getting the error because:
the table has the option set: require_partition_filter=true, a query on the table should fail if no partition filter is specified.
There is limitation on using subquery as the partition filter, the limitation is documented here.
In general, partition pruning will reduce query cost when the filters can be evaluated at the outset of the query without requiring any subquery evaluations or data scans.
The workaround is using BigQuery Scripting to pre-determine the partition filter, like:
DECLARE minimums DATE DEFAULT ((SELECT minimums FROM `Day` WHERE ...));
SELECT *
FROM `Day`
WHERE DATE (_PARTITIONTIME) > minimums; -- minimums is a constant to the second query

Related

BigQuery date partition condition from subquery

I have a date partitioned table, however costs and speed does not improve when the date condition is fetched from a subquery. The subquery fetches a single value of type DATE, however it is not used to run a partitioned query, instead the whole table is fetched. If I enter the date as a string, it works perfectly, just not from the subquery.
(
SELECT
*
FROM
`mydataset.mydataset.mytable`
WHERE
`datetime` > (
SELECT
DISTINCT updated_at_datetime
FROM
`mydataset.mydataset.my_other_table`
LIMIT
1)
AND `date` >= DATE(DATETIME_TRUNC((
SELECT
DISTINCT updated_at_datetime
FROM
`mydataset.mydataset.my_other_table`
LIMIT
1), DAY)))
From the docs:
To limit the partitions that are scanned in a query, use a constant expression in your filter. If you use dynamic expressions in your query filter, BigQuery must scan all of the partitions.
If you can run your query as a script, an approach is split in two statements:
DECLARE LAST_PARTITION DEFAULT (SELECT MAX(updated_at_datetime) FROM `mydataset.mydataset.my_other_table`);
(
SELECT
*
FROM
`mydataset.mydataset.mytable`
WHERE
`datetime` > LAST_PARTITION
AND `date` >= DATE(DATETIME_TRUNC(LAST_PARTITION));

Using "match_recognize" in a Common Table Expression in Snowflake

Update: This was answered here.
I am putting together a somewhat complex query to do event detection, join(s), and time-based binning with a large time-series dataset in Snowflake. I recently noticed that match_recognize lets me eloquently detect time-series events, but whenever I try to use a match_recognize expression within a Common Table Expression (with .. as ..), I receive the following error:
SQL compilation error: MATCH_RECOGNIZE not supported in this context.
I've done a lot of searching/reading, but haven't found any documented limitations on match_recognize in CTEs. Here's my query:
with clean_data as (
-- Remove duplicate entries
select distinct id, timestamp, measurement
from dataset
),
label_events as (
select *
from clean_data
match_recognize (
partition by id
order by timestamp
measures
match_number() as event_number
all rows per match
after match skip past last row
pattern(any_row row_between_gaps+)
define
-- Classify contiguous sections of datapoints with < 20min between adjacent points.
row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
)
)
-- Do binning with width_bucket/etc. here
select id, timestamp, measurement, event_number
from label_events;
And I get the same error as above with this.
Is this a limitation that I'm not seeing, or am I doing something wrong?
Non-recursive cte could be always rewritten as inline view:
--select ...
--from (
select id, timestamp, measurement, event_number
from (select distinct id, timestamp, measurement
from dataset) clean_data
match_recognize (
partition by id
order by timestamp
measures
match_number() as event_number
all rows per match
after match skip past last row
pattern(any_row row_between_gaps+)
define
-- Classify contiguous sections of datapoints with < 20min between adjacent points.
row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
)mr
-- ) -- if other transformations are required
It is not ideal, but at least it will allow query to run.
Per this thread from a comment by Filipe Hoffa: MATCH_RECOGNIZE with CTE in Snowflake
This seemed to be a non-documented limitation of Snowflake at the time. A two or three step solution has worked well for me:
with clean_data as (
-- Remove duplicate entries
select distinct id, timestamp, measurement
from dataset
)
select *
from clean_data
match_recognize (
partition by id
order by timestamp
measures
match_number() as event_number
all rows per match
after match skip past last row
pattern(any_row row_between_gaps+)
define
-- Classify contiguous sections of datapoints with < 20min between adjacent points.
row_between_gaps as datediff(minute, lag(timestamp), timestamp) < 20
);
set quid=last_query_id();
with label_events as (
select *
from table(result_scan($quid))
)
-- Do binning with width_bucket/etc. here
select id, timestamp, measurement, event_number
from label_events;
I prefer to use a variable here, because I can re-run the second query multiple times during development/debugging without having to re-run the first query.
It is also important to note that cached GEOGRAPHY objects in Snowflake are converted to GEOJSON, so when retrieving these with result_scan, you must typecast them back to the GEOGRAPHY type.

Bigquery Select all latest partitions from a wildcard set of tables

We have a set of Google BigQuery tables which are all distinguished by a wildcard for technical reasons, for example content_owner_asset_metadata_*. These tables are updated daily, but at different times.
We need to select the latest partition from each table in the wildcard.
Right now we are using this query to build our derived tables:
SELECT
*
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME = (
SELECT
MIN(time)
FROM (
SELECT
MAX(_PARTITIONTIME) as time
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)
)
This statement finds out the date that all the up-to-date tables are guarenteed to have and selects that date's data, however I need a filter that selects the data from the maximum partition time of each table. I know that I'd need to use _TABLE_SUFFIX with _PARTITIONTIME, but cannot quite work out how to make a select work without just loading all our data (very costly) and using a standard greatest-n-per-group solution.
We cannot just union a bunch of static tables, as our dataset ingestion is liable to change and the scripts we build need to be able to accomodate.
With BigQuery scripting (Beta now), there is a way to prune the partitions.
Basically, a scripting variable is defined to capture the dynamic part of a subquery. Then in subsequent query, scripting variable is used as a filter to prune the partitions to be scanned.
Below example uses BigQuery public dataset to demonstrate how to prune partition to only query and scan on latest day of data.
DECLARE max_date TIMESTAMP
DEFAULT (SELECT MAX(_PARTITIONTIME) FROM `bigquery-public-data.sec_quarterly_financials.numbers`);
SELECT * FROM `bigquery-public-data.sec_quarterly_financials.numbers`
WHERE _PARTITIONTIME = max_date;
With INFORMATION_SCHEMA.PARTITIONS (preview) as of posting, this can be achieved by joining to the PARTITIONS table as follows (e.g. with HOUR partitioning):
SELECT i.*
FROM `project.dataset.prefix_*` i
JOIN (
SELECT * EXCEPT (r)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY table_name ORDER BY partition_id DESC) AS r
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name LIKE "%prefix%"
AND partition_id NOT IN ("__NULL__", "__UNPARTITIONED__"))
WHERE r = 1) p
ON (FORMAT_TIMESTAMP("%Y%m%d%H", i._PARTITIONTIME) = p.partition_id
AND CONCAT("prefix_", i._TABLE_SUFFIX) = p.table_name)

What is the execution order of the PARTITION BY clause compared to other SQL clauses?

I cannot find any source mentioning execution order for Partition By window functions in SQL.
Is it in the same order as Group By?
For example table like:
Select *, row_number() over (Partition by Name)
from NPtable
Where Name = 'Peter'
I understand if Where gets executed first, it will only look at Name = 'Peter', then execute window function that just aggregates this particular person instead of entire table aggregation, which is much more efficient.
But when the query is:
Select top 1 *, row_number() over (Partition by Name order by Date)
from NPtable
Where Date > '2018-01-02 00:00:00'
Doesn't the window function need to be executed against the entire table first then applies the Date> condition otherwise the result is wrong?
Window functions are executed/calculated at the same stage as SELECT, stage 5 in your table. In other words, window functions are applied to all rows that are "visible" in the SELECT stage.
In your second example
Select top 1 *,
row_number() over (Partition by Name order by Date)
from NPtable
Where Date > '2018-01-02 00:00:00'
WHERE is logically applied before Partition by Name of the row_number() function.
Note, that this is logical order of processing the query, not necessarily how the engine physically processes the data.
If query optimiser decides that it is cheaper to scan the whole table and later discard dates according to the WHERE filter, it can do it. But, any kind of these transformations must be performed in such a way that the final result is consistent with the order of the logical steps outlined in the table you showed.
It is part of the SELECT phase of the query execution. There are different types of SELECT clauses, based on the query.
SELECT FOR
SELECT GROUP BY
SELECT ORDER BY
SELECT OVER
SELECT INTO
SELECT HAVING
PARTITION BY comes in the SELECT OVER clause. Here, a window of the result set is generated out of the result set generated in the previous stages: FROM, WHERE, GROUP BY etc.
The OVER clause defines a window or user-specified set of rows within
a query result set. A window function then computes a value for each
row in the window. You can use the OVER clause with functions to
compute aggregated values such as moving averages, cumulative
aggregates, running totals, or a top N per group results.
OVER ( [ PARTITION BY value_expression ] [ order_by_clause ] )
Arguments
PARTITION BY Divides the query result set into partitions. The window
function is applied to each partition separately and computation
restarts for each partition.
value_expression Specifies the column by which the rowset is
partitioned. value_expression can only refer to columns made available
by the FROM clause. value_expression cannot refer to expressions or
aliases in the select list. value_expression can be a column
expression, scalar subquery, scalar function, or user-defined
variable.
Defines the logical order of the rows within each
partition of the result set. That is, it specifies the logical order
in which the window functioncalculation is performed.
order_by_expression Specifies a column or expression on which to sort.
order_by_expression can only refer to columns made available by the
FROM clause. An integer cannot be specified to represent a column name
or alias.
You can read more about it SELECT-OVER
row_number() (and other window functions) are allowed in two clauses:
SELECT
ORDER BY
The function is parsed along with the rest of the clause. After all, it is a function present in the clause. In both cases, the WHERE clause would be -- logically -- applied first, so the results would be after filtering.
Do note that this is a logical parsing of the query. The actual execution may have little to do with the structure of the query.

How does one do a SQL select over multiple partitions?

Is there a more efficient way than:
select * from transactions partition( partition1 )
union all
select * from transactions partition( partition2 )
union all
select * from transactions partition( partition3 );
It should be exceptionally rare that you use the PARTITION( partitionN ) syntax in a query.
You would normally just want to specify values for the partition key and allow Oracle to perform partition elimination. If your table is partitioned daily based on TRANSACTION_DATE, for example
SELECT *
FROM transactions
WHERE transaction_date IN (date '2010-11-22',
date '2010-11-23',
date '2010-11-24')
would select all the data from today's partition, yesterday's partition, and the day before's partition.
Can you provide additional context? What are your predicates? What makes you think that you need to explicitly tell the optimizer to go against multiple partitions. You may have the wrong partition key in use, for example.