BigQuery: querying the latest partition, bytes to be processed vs. actually processed - google-bigquery

I'm struggling with querying efficiently the last partition of a table, using a date or datetime field. The first approach was to filter like this:
SELECT *
FROM my_table
WHERE observation_date = (SELECT MAX(observation_date) FROM my_table)
But that, according to BigQuery's processing estimation, scans the entire table and does not use the partitions. Even Google states this happens in their documentation. It does work if I use an exact value for the partition:
SELECT *
FROM my_table
WHERE observation_date = CURRENT_DATE
But if the table is not up to date then the query will not get any results and my automatic procesess will fail. If I include an offset like observation_date = DATE_SUB(CURRENT_DATE, INTERVAL 2 DAY), I will likely miss the latest partition.
What is the best practice to get the latest partition efficiently?
What makes this worse is that BigQuery's estimation of the bytes to be processed with the active query does not match what was actually processed, unless I'm not interpreting those numbers correctly. Find below the screenshot of the mismatching values.
BigQuery screen with aparrently mistmatching processed bytes
Finally a couple of scenarios that I also tested:
If I store a max_date with a DECLARE statement first, as suggested in this post, the estimation seems to work, but it is not clear why. However, the actual processed bytes after running the query is not different than the case that filters the latest partition in the WHERE clause.
Using the same declared max_date in a table that is both partitioned and clustered, the estimation works only when using a filter for the partition, but fails if I include a filter for the cluster.

After some iterations I got an answer from Google and although it doesn't resolve the issue, it acknowledges it happens.
Tables partitioned with DATE or DATETIME fields cannot be efficiently queried for their latest partition. The best practice remains to filter with something like WHERE observation_date = (SELECT MAX(observation_date) FROM my_table) and that will scan the whole table.
They made notes to try and improve this in the future but we have to deal with this for now. I hope this helps somebody that was trying to do the same thing.

Related

Querying a partitioned BigQuery table across multiple far-apart _PARTITIONDATE days?

I have a number of very big tables that are partitioned by _PARTITIONDATE which I'd like to query off of regularly in an efficient way. Each time I run the query, I only need to search across a small number of dates, but these dates will change every run and may be months/years apart from one-another.
To capture these dates, I could do _PARTITIONDATE >= '2015-01-01' but this is making the queries run very slow as there are millions of rows on each partition. I could also do _PARTITIONDATE BETWEEN '2015-01-01' AND '2017-01-01', but the exact date range will change every run. What I'd like to do is something like _PARTITIONDATE IN ("2015-03-10", "2016-01-24", "2016-22-03", "2017-06-14") so that the query only needs to run on the dates provided, which from my testing appears to work.
The problem I'm running into is that the list of dates will change every time, requiring me to join in the list of dates in a temp table first. When doing that like this source._PARTITIONDATE IN (datelist.date), it does not work and runs into an error if that's the only WHERE condition when querying a partition-required table.
Any advice for ways I might get this to work or other approach to querying off specific partitions that aren't back to back without having to process querying the whole thing?
I've been reading through the BigQuery documentation but I don't see an answer to this question. I do see it says that the following "doesn't limit the scanned partitions, because it uses table values, which are dynamic." So possibly what I'm trying to do is impossible with the current BQ limitations?
_PARTITIONTIME = (SELECT MAX(timestamp) from dataset.table1)
Script is a possible solution.
DECLARE max_date DEFAULT (SELECT MAX(...) FROM ...);
SELECT .... FROM ... WHERE _PARTITIONDATE = max_date;

BigQuery: cost of querying tables partitioned by ingestion time vs date/timestamp partitioned

We are trying to build (or better say rebuild) our DWH in the cloud based on BigQuery. We decided to use 'partitioned by date field' tables (like a 'created_date' field) for our raw data instead of ingestion time partitions because with this feature we can load data easely and then query it with "group by" partition date column, build datamarts bla bla bla. We supposed that this partition method will increase queries speed and reduce it cost (versus non-partitioned tables - yes), BUT we've discovered than when you querying table with WHERE by partition field (like 'select count(*) from table where created_date=current_date'), it will cost money.
Our old-style ingestion time partitioned table queries with WHERE _PARTITIONTIME ='' were FREE! (like 'select count(*) from table where _PARTITIONTIME=current_date')
For example:
1) select value1 from table1 where _PARTITIONTIME = current_date
2) select value1 from table1 where created_date = current_date
3) select count(*) from table1 where _PARTITIONTIME = current_date
The second query costs more, because it will scan 2 columns. Its logical. But not fair((( The 3rd query is absolutely free btw!
This is very sad situation, because there is NO ANY WARNING about this 'side effect' in the documentation. This feature designed to make DB developers life easier (i guess), and it positioned as best practice feature and highly recommended by Google. But nobody said that it will cost you additional money also!
So the question is can we somehow query date-field partitioned tables using partition key for free? Is there any other pseudocolumn or method of filtering by partition key available if you use date/timestamp field based partitioning?
(ps: you guys from google must add some pseudocolumn for the date/timestamp partition method if it does not exist).
Thnx!
So the question is can we somehow query date-field partitioned tables
using partition key for free?
The answer is No, querying the partition will not be free.
Is there any other pseudocolumn or method of filtering by partition
key available if you use date/timestamp field based partitioning?
If you want partitioning by date, this can only be achieved using ingestion-time partitioning with the _PARTITIONTIME pseudocolumn or using dates value in a selected date/timestamp value columns. Currently there is no alternative option available. Keep in mind that one of the main goals of partitioning is reducing the amount of data being scanned mainly by reducing the number of rows that are scanned.
You guys from google must add some pseudocolumn for the date/timestamp partition method if it does not exist
I understand that you would like to have some pseudocolumn for the data column partitioned method, but could you please elaborate a bit more what values you would like to see in this partition in your original post?
Edit: A feature request has been opened on your behalf. You can follow it here

Hive Not Utilizing Partitions in Query

I have a view that works to pull the most recent data for a Hive history table. The history table is partitioned by day. The way that the view works is very straightforward—it has a subquery that does a max date on the date field (the one that is used as the partition) then filters the table based upon that value. The table contains hundreds of days (partitions), each with many millions of rows. In order to speed up the subquery, I am attempting to limit the partitions that are scanned to the last one created. To account for holiday weekends, I'm going back four days to ensure that the query returns data.
If I hard code the values with dates, the subquery runs very fast, and limits to the partitions correctly.
However, if I attempt to limit the partitions with a subquery to calculate the last partition, it doesn’t recognize the partitions and does a full table scan. The query will return correct results, as the filter works, but it takes a long time because it is not limiting the partitions scanned.
I tried doing the subquery as a WITH statement, then using an INNER JOIN on bus_date, but got the same results—partitions were not utilized.
The behavior is repeatable via a query, so I’ll use that rather than the view to demonstrate:
SELECT *
FROM a.transactions
WHERE bus_date IN (SELECT MAX (bus_date)
FROM a.transactions maxtrans
WHERE bus_date >= date_sub (CURRENT_DATE, 4));
There are no error messages, and the query actually works (filters to pull the correct data), but it scans all partitions so it is extremely slow. How can I limit the query to utilize the partitions identified in the subquery?
I'm still hopeful that someone will have an answer for this, but I did want to post the workaround that I've come up with in case it is useful for someone else.
SELECT *
FROM a.transactions
WHERE bus_date >= date_sub (CURRENT_DATE, 4)
AND bus_date IN (SELECT MAX (bus_date)
FROM a.transactions maxtrans
WHERE bus_date >= date_sub (CURRENT_DATE, 4));
The query is a little clumsy, as it is filtering on the business date twice. The first time it limits the main set of data to the last four days (which limits to those partitions and avoids a scan of all partitions) and the second pins it down to the last day for which data has been loaded (via the MAX bus_date). This is far from perfect, but performs CONSIDERABLY better than the query scanning all partitions. Thanks.

reduce the amount of data scanned by Athena when using aggregate functions

The below query scans 100 mb of data.
select * from table where column1 = 'val' and partition_id = '20190309';
However the below query scans 15 GB of data (there are over 90 partitions)
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
How can I optimize the second query to scan the same amount of data as the first?
There are two problems here. The efficiency of the the scalar subquery above select max(partition_id) from table, and the one #PiotrFindeisen pointed out around dynamic filtering.
The the first problem is that queries over the partition keys of a Hive table are a lot more complex than they appear. Most folks would think that if you want the max value of a partition key, you can simply execute a query over the partition keys, but that doesn't work because Hive allows partitions to be empty (and it also allows non-empty files that contain no rows). Specifically, the scalar subquery above select max(partition_id) from table requires Trino (formerly PrestoSQL) to find the max partition containing at least one row. The ideal solution would be to have perfect stats in Hive, but short of that the engine would need to have custom logic for hive that open files of the partitions until it found a non empty one.
If you are are sure that your warehouse does not contain empty partitions (or if you are ok with the implications of that), you can replace the scalar sub query with one over the hidden $partitions table"
select *
from table
where column1 = 'val' and
partition_id = (select max(partition_id) from "table$partitions");
The second problem is the one #PiotrFindeisen pointed out, and has to do with the way that queries are planned an executed. Most people would look at the above query, see that the engine should obviously figure out the value of select max(partition_id) from "table$partitions" during planning, inline that into the plan, and then continue with optimization. Unfortunately, that is a pretty complex decision to make generically, so the engine instead simply models this as a broadcast join, where one part of the execution figures out that value, and broadcasts the value to the rest of the workers. The problem is the rest of the execution has no way to add this new information into the existing processing, so it simply scans all of the data and then filters out the values you are trying to skip. There is a project in progress to add this dynamic filtering, but it is not complete yet.
This means the best you can do today, is to run two separate queries: one to get the max partition_id and a second one with the inlined value.
BTW, the hidden "$partitions" table was added in Presto 0.199, and we fixed some minor bugs in 0.201. I'm not sure which version Athena is based on, but I believe it is is pretty far out of date (the current release at the time I'm writing this answer is 309.
EDIT: Presto removed the __internal_partitions__ table in their 0.193 release so I'd suggest not using the solution defined in the Slow aggregation queries for partition keys section below in any production systems since Athena 'transparently' updates presto versions. I ended up just going with the naive SELECT max(partition_date) ... query but also using the same lookback trick outlined in the Lack of Dynamic Filtering section. It's about 3x slower than using the __internal_partitions__ table, but at least it won't break when Athena decides to update their presto version.
----- Original Post -----
So I've come up with a fairly hacky way to accomplish this for date-based partitions on large datasets for when you only need to look back over a few partitions'-worth of data for a match on the max, however, please note that I'm not 100% sure how brittle the usage of the information_schema.__internal_partitions__ table is.
As #Dain noted above, there are really two issues. The first being how slow an aggregation of the max(partition_date) query is, and the second being Presto's lack of support for dynamic filtering.
Slow aggregation queries for partition keys
To solve the first issue, I'm using the information_schema.__internal_partitions__ table which allows me to get quick aggregations on the partitions of a table without scanning the data inside the files. (Note that partition_value, partition_key, and partition_number in the below queries are all column names of the __internal_partitions__ table and not related to your table's columns)
If you only have a single partition key for your table, you can do something like:
SELECT max(partition_value) FROM information_schema.__internal_partitions__
WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
But if you have multiple partition keys, you'll need something more like this:
SELECT max(partition_date) as latest_partition_date from (
SELECT max(case when partition_key = 'partition_date' then partition_value end) as partition_date, max(case when partition_key = 'another_partition_key' then partition_value end) as another_partition_key
FROM information_schema.__internal_partitions__
WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
GROUP BY partition_number
)
WHERE
-- ... Filter down by values for e.g. another_partition_key
)
These queries should run fairly quickly (mine run in about 1-2 seconds) without scanning through the actual data in the files, but again, I'm not sure if there are any gotchas with using this approach.
Lack of Dynamic Filtering
I'm able to mitigate the worst effects of the second problem for my specific use-case because I expect there to always be a partition within a finite amount of time back from the current date (e.g. I can guarantee any data-production or partition-loading issues will be remedied within 3 days). It turns out that Athena does do some pre-processing when using presto's datetime functions, so this does not have the same types of issues with Dynamic Filtering as using a sub-query.
So you can change your query to limit how far it will look back for the actual max using the datetime functions so that the amount of data scanned will be limited.
SELECT * FROM "DATABASE_NAME"."TABLE_NAME"
WHERE partition_date >= cast(date '2019-06-25' - interval '3' day as varchar) -- Will only scan partitions from 3 days before '2019-06-25'
AND partition_date = (
-- Insert the partition aggregation query from above here
)
I don't know if it is still relevant, but just found out:
Instead of:
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
Use:
select a.* from table a
inner join (select max(partition_id) max_id from table) b on a.partition_id=b.max_id
where column1 = 'val';
I think it has something to do with optimizations of joins to use partitions.

Oracle join of two tables where joined to table returns highest of a result set restricted by the joining (parent?) table?

Sorry for the long title, but its a difficult dilemma to distill down to a short title.. :-)
I have two tables, an auto maintenance log table and an auto trip log table, something like the following:
auto_maint_log:
auto_id
maint_datetime
maint_description
auto_trip_log:
auto_id
trip_datetime
ending_odometer
I need to select all maintenance events for each auto, and for each event, lookup the most recent ending_odometer value at the time of the maintenance from the trip log table. The only way I have successfully accomplished this task is using a function (i.e. get_odometer(auto_id, maint_datetime) as a column in my query) in which the odometer reading from the trip log is evaluated to pull the most recent trip prior to the passed maint_datetime, thus returning the most recent ending_odometer.
While the function call does work, in theory, in practice it does not. My end goal is to create a view of the maintenance data that includes the (then) current odometer value, and I have millions of maintenance rows across hundreds of vehicles. The performance of the function call makes its use as a solution impractical, nay impossible.. :-)
I've tried all variations of MAX, rownum, subselects, analytics (over/partition, etc.) I've been able to scour from google'ing this, and have not been able to code a working query.
Any suggestions, or brilliant solutions are welcome!
Thanks,
You can do this with a correlated subquery:
select aml.*,
(select max(atl.ending_odometer)
from auto_trip_log atl
where alt.auto_id = aml.auto_id and
atl.trip_datetime <= aml.maint_datetime
) as ending_odometer
from auto_maint_log aml;
This query uses max() because (presumably) the odometer readings are steadily increasing.
EDIT:
select aml.*,
(select max(atl.ending_odometer) over (dense_rank first order by atl.trip_datetime desc)
from auto_trip_log atl
where alt.auto_id = aml.auto_id and
atl.trip_datetime <= aml.maint_datetime
) as ending_odometer
from auto_maint_log aml;