BigQuery date partition condition from subquery - google-bigquery

I have a date-partitioned table, but cost and speed do not improve when the date condition is fetched from a subquery. The subquery fetches a single value of type DATE; however, it is not used to prune partitions, and instead the whole table is scanned. If I enter the date as a string, it works perfectly, just not from the subquery.
(
  SELECT *
  FROM `mydataset.mydataset.mytable`
  WHERE `datetime` > (
      SELECT DISTINCT updated_at_datetime
      FROM `mydataset.mydataset.my_other_table`
      LIMIT 1)
    AND `date` >= DATE(DATETIME_TRUNC((
      SELECT DISTINCT updated_at_datetime
      FROM `mydataset.mydataset.my_other_table`
      LIMIT 1), DAY)))

From the docs:
To limit the partitions that are scanned in a query, use a constant expression in your filter. If you use dynamic expressions in your query filter, BigQuery must scan all of the partitions.
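For example, a filter with a literal date is a constant expression and prunes partitions, while the same filter driven by a subquery is a dynamic expression and scans everything. A sketch using the question's table and column names (the literal date is just an illustration):

-- Prunes partitions: the filter is a constant expression
SELECT *
FROM `mydataset.mydataset.mytable`
WHERE `date` >= "2021-01-01";

-- Scans all partitions: the filter value comes from a subquery (dynamic expression)
SELECT *
FROM `mydataset.mydataset.mytable`
WHERE `date` >= (SELECT DATE(MAX(updated_at_datetime)) FROM `mydataset.mydataset.my_other_table`);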
If you can run your query as a script, one approach is to split it into two statements:
DECLARE LAST_PARTITION DEFAULT (SELECT MAX(updated_at_datetime) FROM `mydataset.mydataset.my_other_table`);

SELECT *
FROM `mydataset.mydataset.mytable`
WHERE `datetime` > LAST_PARTITION
  AND `date` >= DATE(DATETIME_TRUNC(LAST_PARTITION, DAY));

Related

How to convert a query on multiple table rows into using a single array?

I previously had this table:
CREATE TABLE traces_v0
( canvas_id UUID NOT NULL
, tlid BIGINT NOT NULL
, trace_id UUID NOT NULL
, timestamp TIMESTAMP WITH TIME ZONE NOT NULL
, PRIMARY KEY (canvas_id, tlid, trace_id)
);
which I'm trying to change into this table:
CREATE TABLE traces_v0
( canvas_id UUID NOT NULL
, root_tlid BIGINT NOT NULL
, trace_id UUID NOT NULL
, callgraph_tlids BIGINT[] NOT NULL
, timestamp TIMESTAMP WITH TIME ZONE NOT NULL
, PRIMARY KEY (canvas_id, root_tlid, trace_id)
);
Which is to say, where previously there was one row per (tlid, trace_id), there is now a single row with a trace_id and an array of callgraph_tlids.
I have a query which worked well on the old table:
SELECT tlid, trace_id
FROM (
  SELECT
    tlid, trace_id,
    ROW_NUMBER() OVER (PARTITION BY tlid ORDER BY timestamp DESC) AS row_num
  FROM traces_v0
  WHERE tlid = ANY(@tlids::bigint[])
    AND canvas_id = @canvasID
) t
WHERE row_num <= 10
This fetches the last 10 (tlid, trace_id) for each of the tlids (a bigint array), ordered by timestamp. This is exactly what I need and was very effective.
(FYI: the "at" (@tlids) syntax is just a fancy way of writing $1, supported by my Postgres driver.)
I'm struggling to port this to the new table layout. I came up with the following, which works except that it doesn't limit to 10 per tlid ordered by timestamp:
SELECT callgraph_tlids, trace_id
FROM traces_v0
WHERE @tlids && callgraph_tlids -- '&&' is the array overlap operator
  AND canvas_id = @canvasID
ORDER BY timestamp DESC
How can I do this query where I limit the results to 10 rows per tlid, ordered by timestamp?
I'm using Postgres 9.6 if that matters.
How can I do this query where I limit the results to 10 rows per tlid, ordered by timestamp?
If timestamps for all rows in the old design that were aggregated into the same array in the new design have been the same all along, then this query for the new design is logically equivalent:
SELECT trace_id, tlid
FROM (
  SELECT t.trace_id, c.tlid
       , row_number() OVER (PARTITION BY c.tlid ORDER BY t.timestamp DESC) AS rn
  FROM traces_v0 t
  JOIN LATERAL unnest(t.callgraph_tlids) c(tlid) ON c.tlid = ANY(@tlids)
  WHERE t.canvas_id = @canvasid
    AND t.callgraph_tlids && @tlids
) sub
WHERE rn <= 10;
But that means ORDER BY timestamp DESC has been a non-deterministic sort order all along, and your new query is just as unreliable as the old one. The top 10 may change from one invocation of the query to the next. If you want deterministic results, add more expressions as tiebreaker(s) to the ORDER BY list until the sort order is unambiguous - probably a good idea in any case.
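For example, appending a unique column such as trace_id as a tiebreaker makes the window ordering reproducible; a sketch of the single changed line in the query above:

row_number() OVER (PARTITION BY c.tlid ORDER BY t.timestamp DESC, t.trace_id) AS rn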
The WHERE condition t.callgraph_tlids && #tlids on top of the join condition ON c.tlid = ANY(#tlids) is logically redundant, but typically helps to make your query much faster, especially with a GIN index on callgraph_tlids. See:
Can PostgreSQL index array columns?
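A minimal sketch of such an index (the index name is made up):

-- GIN index so the && (array overlap) filter can be supported by an index
CREATE INDEX traces_v0_callgraph_tlids_gin ON traces_v0 USING gin (callgraph_tlids);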
About the LATERAL join:
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
This even works in your outdated, unsupported Postgres 9.6. But upgrade to a current version in any case.
If timestamps of aggregated rows were not the same, then the answer is: You cannot.
The new design removes required information. The old design has a separate timestamp for each tlid, while the new design only has a single timestamp for a whole array (callgraph_tlids).

Bigquery apply subquery to partition time

I have two queries which work correctly when run separately, but together there is an error:
WITH minimum_time AS (
  SELECT DATE(MIN(_PARTITIONTIME)) AS minimums
  FROM `Day`
  WHERE DATE(_PARTITIONTIME) = "2020-11-20"
)
SELECT *
FROM `Day`
WHERE DATE(_PARTITIONTIME) > (SELECT minimums FROM minimum_time)
and I get this error:
Cannot query over table 'Day' without a filter over column(s) '_PARTITION_LOAD_TIME', '_PARTITIONDATE', '_PARTITIONTIME' that can be used for partition elimination
I do not quite understand why this is happening; the first query returns a date.
You were getting the error because the table has the option require_partition_filter=true set: a query on the table fails if no partition filter is specified.
There is a limitation on using a subquery as the partition filter; the limitation is documented here:
In general, partition pruning will reduce query cost when the filters can be evaluated at the outset of the query without requiring any subquery evaluations or data scans.
The workaround is to use BigQuery scripting to pre-determine the partition filter, like:
DECLARE minimums DATE DEFAULT ((SELECT minimums FROM `Day` WHERE ...));
SELECT *
FROM `Day`
WHERE DATE (_PARTITIONTIME) > minimums; -- minimums is a constant to the second query
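For completeness, a sketch that fills the schematic parts with the CTE logic from the question (adjust the inner filter to whatever your real lower bound is):

DECLARE minimums DATE DEFAULT (
  SELECT DATE(MIN(_PARTITIONTIME))
  FROM `Day`
  WHERE DATE(_PARTITIONTIME) = "2020-11-20"
);

SELECT *
FROM `Day`
WHERE DATE(_PARTITIONTIME) > minimums;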

Bigquery Select all latest partitions from a wildcard set of tables

We have a set of Google BigQuery tables which are all distinguished by a wildcard for technical reasons, for example content_owner_asset_metadata_*. These tables are updated daily, but at different times.
We need to select the latest partition from each table in the wildcard.
Right now we are using this query to build our derived tables:
SELECT *
FROM `project.content_owner_asset_metadata_*`
WHERE _PARTITIONTIME = (
  SELECT MIN(time)
  FROM (
    SELECT MAX(_PARTITIONTIME) AS time
    FROM `project.content_owner_asset_metadata_*`
    WHERE _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  )
)
This statement finds the date that all the up-to-date tables are guaranteed to have and selects that date's data; however, I need a filter that selects the data from the maximum partition time of each table. I know that I'd need to use _TABLE_SUFFIX with _PARTITIONTIME, but cannot quite work out how to make a select work without just loading all our data (very costly) and using a standard greatest-n-per-group solution.
We cannot just union a bunch of static tables, as our dataset ingestion is liable to change and the scripts we build need to be able to accommodate that.
With BigQuery scripting (in beta at the time of writing), there is a way to prune the partitions.
Basically, a scripting variable is defined to capture the dynamic part of the subquery. Then, in the subsequent query, the scripting variable is used as a filter to prune the partitions to be scanned.
The example below uses a BigQuery public dataset to demonstrate how to prune partitions so that only the latest day of data is queried and scanned.
DECLARE max_date TIMESTAMP
DEFAULT (SELECT MAX(_PARTITIONTIME) FROM `bigquery-public-data.sec_quarterly_financials.numbers`);
SELECT * FROM `bigquery-public-data.sec_quarterly_financials.numbers`
WHERE _PARTITIONTIME = max_date;
With INFORMATION_SCHEMA.PARTITIONS (in preview as of this posting), this can be achieved by joining to the PARTITIONS view as follows (e.g. with HOUR partitioning):
SELECT i.*
FROM `project.dataset.prefix_*` i
JOIN (
SELECT * EXCEPT (r)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY table_name ORDER BY partition_id DESC) AS r
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name LIKE "%prefix%"
AND partition_id NOT IN ("__NULL__", "__UNPARTITIONED__"))
WHERE r = 1) p
ON (FORMAT_TIMESTAMP("%Y%m%d%H", i._PARTITIONTIME) = p.partition_id
AND CONCAT("prefix_", i._TABLE_SUFFIX) = p.table_name)
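If the tables are partitioned by DAY rather than HOUR, only the format string in the join condition should need to change, since daily partition_ids have the form YYYYMMDD; a sketch of the adjusted ON clause:

ON (FORMAT_TIMESTAMP("%Y%m%d", i._PARTITIONTIME) = p.partition_id
  AND CONCAT("prefix_", i._TABLE_SUFFIX) = p.table_name)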

Will the partition be hit in an inner Union?

I have the following SQL statement:
SELECT *
FROM (
SELECT eu_dupcheck AS dupcheck
, eu_date AS threshold
FROM WF_EU_EVENT_UNPROCESSED
WHERE eu_dupcheck IS NOT NULL
UNION
SELECT he_dupcheck AS dupcheck
, he_date AS threshold
FROM WF_HE_HISTORY_EVENT
WHERE he_dupcheck IS NOT NULL
)
WHERE threshold > sysdate - 30
The second table is partitioned by date but the first isn't. I need to know if the partition of the second table will be hit in this query, or will it do a full table scan?
I would be surprised if Oracle were smart enough to avoid a full table scan. Remember that UNION processes the data by removing duplicates. So, Oracle would have to recognize that:
The where clause is appropriate for the partitioning (this is actually easy).
That the partitioning does not affect the duplicate removal (this is a bit harder, but true because the date is in the SELECT).
Oracle has a smart optimizer, so perhaps it can recognize this situation (and it would probably avoid the full table scan for a UNION ALL). However, you are safer moving the condition into the subqueries:
SELECT *
FROM ((SELECT eu_dupcheck AS dupcheck, eu_date AS threshold
FROM WF_EU_EVENT_UNPROCESSED
WHERE eu_dupcheck IS NOT NULL AND eu_date > sysdate - 30
) UNION
(SELECT he_dupcheck AS dupcheck, he_date AS threshold
FROM WF_HE_HISTORY_EVENT
WHERE he_dupcheck IS NOT NULL AND he_date > sysdate - 30
)
) eh;
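If duplicates between (or within) the two branches are impossible or acceptable, a UNION ALL variant skips the duplicate-removal step entirely and gives the optimizer an easier job; a sketch of the same rewrite with UNION ALL:

SELECT *
FROM ((SELECT eu_dupcheck AS dupcheck, eu_date AS threshold
       FROM WF_EU_EVENT_UNPROCESSED
       WHERE eu_dupcheck IS NOT NULL AND eu_date > sysdate - 30
      ) UNION ALL
      (SELECT he_dupcheck AS dupcheck, he_date AS threshold
       FROM WF_HE_HISTORY_EVENT
       WHERE he_dupcheck IS NOT NULL AND he_date > sysdate - 30
      )
     ) eh;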

T-SQL dates comparison

Is the result set of the following query:
SELECT * FROM Table
WHERE Date >= '20130101'
equal to the result set of the following query:
SELECT * FROM Table
WHERE Date = '20130101'
UNION ALL
SELECT * FROM Table
WHERE Date > '20130101'
?
Date is a DATETIME field.
In terms of the result, YES, but in terms of performance, NO.
There may be a performance issue: the first query scans the table only once, while the second scans it twice because of the UNION ALL (one SELECT statement is faster than two combined SELECT statements).
So I'd rather go with the first one.
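If you want to verify the difference yourself, one way is to compare the logical reads reported by STATISTICS IO (a sketch; [Table] and [Date] are the question's placeholder names, bracketed since TABLE is a reserved word):

SET STATISTICS IO ON;

-- One pass over the table
SELECT * FROM [Table] WHERE [Date] >= '20130101';

-- Two passes over the table
SELECT * FROM [Table] WHERE [Date] = '20130101'
UNION ALL
SELECT * FROM [Table] WHERE [Date] > '20130101';

SET STATISTICS IO OFF;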