I have a partitioned table, with partitions based on timestamp column, like this:
...
partition by date_trunc(update_date, month)
...
Now, how do I query partitions in this table?
Afaik, I can't use _PARTITIONDATE pseudo column, since this is not an ingestion-time partitioned table. Do I just filter my query using date_trunc(update_date, month)?
select * from my_project.my_dataset.information_schema.partitions
where table_name = 'partitioned_table'
Here I can get partition_ids, but I can't address them in my query either.
I'm not sure, but whenever you want to to test if a partition is working: try typing out a SELECT * statement without a filter, and don't run it but check the estimated query size. Then try filtering on what you believe is the partition, also without running, and see if the estimated query size is reduced.
In this case try filtering on date_trunc(update_date, month) and see if it reduces the query size
Related
I would like to run this query about once every 5 minutes to be able to run an incremental query to MERGE to another table.
SELECT MAX(timestamp) FROM dataset.myTable
-- timestamp is of type TIMESTAMP
My concern is that will do a full scan of myTable on a regular basis.
What are the best practices for optimizing this query? Will partitioning help even if the SELECT MAX doesn't extract the date from the query? Or is it just the columnar nature of BigQuery will make this optimal?
Thank you.
What you can do is, instead of querying your table directly, query the INFORMATION_SCHEMA.PARTITIONS table within your dataset. Doc here.
You can for instance go for:
SELECT LAST_MODIFIED_TIME
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE TABLE_NAME = "myTable"
The PARTITIONS table hold metadata at the rate of one record for each of your partitions. It is therefore greatly smaller than your table and it's an easy way to cut your query costs. (it is also much faster to query).
I have a view that works to pull the most recent data for a Hive history table. The history table is partitioned by day. The way that the view works is very straightforward—it has a subquery that does a max date on the date field (the one that is used as the partition) then filters the table based upon that value. The table contains hundreds of days (partitions), each with many millions of rows. In order to speed up the subquery, I am attempting to limit the partitions that are scanned to the last one created. To account for holiday weekends, I'm going back four days to ensure that the query returns data.
If I hard code the values with dates, the subquery runs very fast, and limits to the partitions correctly.
However, if I attempt to limit the partitions with a subquery to calculate the last partition, it doesn’t recognize the partitions and does a full table scan. The query will return correct results, as the filter works, but it takes a long time because it is not limiting the partitions scanned.
I tried doing the subquery as a WITH statement, then using an INNER JOIN on bus_date, but got the same results—partitions were not utilized.
The behavior is repeatable via a query, so I’ll use that rather than the view to demonstrate:
SELECT *
FROM a.transactions
WHERE bus_date IN (SELECT MAX (bus_date)
FROM a.transactions maxtrans
WHERE bus_date >= date_sub (CURRENT_DATE, 4));
There are no error messages, and the query actually works (filters to pull the correct data), but it scans all partitions so it is extremely slow. How can I limit the query to utilize the partitions identified in the subquery?
I'm still hopeful that someone will have an answer for this, but I did want to post the workaround that I've come up with in case it is useful for someone else.
SELECT *
FROM a.transactions
WHERE bus_date >= date_sub (CURRENT_DATE, 4)
AND bus_date IN (SELECT MAX (bus_date)
FROM a.transactions maxtrans
WHERE bus_date >= date_sub (CURRENT_DATE, 4));
The query is a little clumsy, as it is filtering on the business date twice. The first time it limits the main set of data to the last four days (which limits to those partitions and avoids a scan of all partitions) and the second pins it down to the last day for which data has been loaded (via the MAX bus_date). This is far from perfect, but performs CONSIDERABLY better than the query scanning all partitions. Thanks.
The below query scans 100 mb of data.
select * from table where column1 = 'val' and partition_id = '20190309';
However the below query scans 15 GB of data (there are over 90 partitions)
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
How can I optimize the second query to scan the same amount of data as the first?
There are two problems here. The efficiency of the the scalar subquery above select max(partition_id) from table, and the one #PiotrFindeisen pointed out around dynamic filtering.
The the first problem is that queries over the partition keys of a Hive table are a lot more complex than they appear. Most folks would think that if you want the max value of a partition key, you can simply execute a query over the partition keys, but that doesn't work because Hive allows partitions to be empty (and it also allows non-empty files that contain no rows). Specifically, the scalar subquery above select max(partition_id) from table requires Trino (formerly PrestoSQL) to find the max partition containing at least one row. The ideal solution would be to have perfect stats in Hive, but short of that the engine would need to have custom logic for hive that open files of the partitions until it found a non empty one.
If you are are sure that your warehouse does not contain empty partitions (or if you are ok with the implications of that), you can replace the scalar sub query with one over the hidden $partitions table"
select *
from table
where column1 = 'val' and
partition_id = (select max(partition_id) from "table$partitions");
The second problem is the one #PiotrFindeisen pointed out, and has to do with the way that queries are planned an executed. Most people would look at the above query, see that the engine should obviously figure out the value of select max(partition_id) from "table$partitions" during planning, inline that into the plan, and then continue with optimization. Unfortunately, that is a pretty complex decision to make generically, so the engine instead simply models this as a broadcast join, where one part of the execution figures out that value, and broadcasts the value to the rest of the workers. The problem is the rest of the execution has no way to add this new information into the existing processing, so it simply scans all of the data and then filters out the values you are trying to skip. There is a project in progress to add this dynamic filtering, but it is not complete yet.
This means the best you can do today, is to run two separate queries: one to get the max partition_id and a second one with the inlined value.
BTW, the hidden "$partitions" table was added in Presto 0.199, and we fixed some minor bugs in 0.201. I'm not sure which version Athena is based on, but I believe it is is pretty far out of date (the current release at the time I'm writing this answer is 309.
EDIT: Presto removed the __internal_partitions__ table in their 0.193 release so I'd suggest not using the solution defined in the Slow aggregation queries for partition keys section below in any production systems since Athena 'transparently' updates presto versions. I ended up just going with the naive SELECT max(partition_date) ... query but also using the same lookback trick outlined in the Lack of Dynamic Filtering section. It's about 3x slower than using the __internal_partitions__ table, but at least it won't break when Athena decides to update their presto version.
----- Original Post -----
So I've come up with a fairly hacky way to accomplish this for date-based partitions on large datasets for when you only need to look back over a few partitions'-worth of data for a match on the max, however, please note that I'm not 100% sure how brittle the usage of the information_schema.__internal_partitions__ table is.
As #Dain noted above, there are really two issues. The first being how slow an aggregation of the max(partition_date) query is, and the second being Presto's lack of support for dynamic filtering.
Slow aggregation queries for partition keys
To solve the first issue, I'm using the information_schema.__internal_partitions__ table which allows me to get quick aggregations on the partitions of a table without scanning the data inside the files. (Note that partition_value, partition_key, and partition_number in the below queries are all column names of the __internal_partitions__ table and not related to your table's columns)
If you only have a single partition key for your table, you can do something like:
SELECT max(partition_value) FROM information_schema.__internal_partitions__
WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
But if you have multiple partition keys, you'll need something more like this:
SELECT max(partition_date) as latest_partition_date from (
SELECT max(case when partition_key = 'partition_date' then partition_value end) as partition_date, max(case when partition_key = 'another_partition_key' then partition_value end) as another_partition_key
FROM information_schema.__internal_partitions__
WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
GROUP BY partition_number
)
WHERE
-- ... Filter down by values for e.g. another_partition_key
)
These queries should run fairly quickly (mine run in about 1-2 seconds) without scanning through the actual data in the files, but again, I'm not sure if there are any gotchas with using this approach.
Lack of Dynamic Filtering
I'm able to mitigate the worst effects of the second problem for my specific use-case because I expect there to always be a partition within a finite amount of time back from the current date (e.g. I can guarantee any data-production or partition-loading issues will be remedied within 3 days). It turns out that Athena does do some pre-processing when using presto's datetime functions, so this does not have the same types of issues with Dynamic Filtering as using a sub-query.
So you can change your query to limit how far it will look back for the actual max using the datetime functions so that the amount of data scanned will be limited.
SELECT * FROM "DATABASE_NAME"."TABLE_NAME"
WHERE partition_date >= cast(date '2019-06-25' - interval '3' day as varchar) -- Will only scan partitions from 3 days before '2019-06-25'
AND partition_date = (
-- Insert the partition aggregation query from above here
)
I don't know if it is still relevant, but just found out:
Instead of:
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
Use:
select a.* from table a
inner join (select max(partition_id) max_id from table) b on a.partition_id=b.max_id
where column1 = 'val';
I think it has something to do with optimizations of joins to use partitions.
I want to query a massive table, and need to get my query runtime down.
I'm trying to breakup my target query, into many steps, by running my summarizing query against each individual table partition (I will then aggregate the outputs). All the columns in my where clauses are indexed (nonclustered) -- all the columns I'm pulling in my query are indexed. The "Month" column is our partition index.
How do I write my query, so that I'm explicitly telling SQL Server to only use one "Month" partition?
edit to include Execution plan:
Per the comment, used this site: https://www.brentozar.com/pastetheplan/?id=SJRAIUD3V
Assuming that you have read the suggestions and still want to use partitions the query would be as follows once you have portioned the table with Month as the key_column.
SELECT <Your_select_list>
FROM <dbo.partitioned_table>
WHERE key_column = <target_month>
I am running a query which is selecting data on the basis of joins between 6-7 tables. When I execute the query it is taking 3-4 seconds to complete. But when I put a where clause on the fetched data it's taking more than one minute to execute. My query is fetching large amounts of data so I can't write it here but the situation I faced is explained below:
Select Category,x,y,z
from
(
---Sample Query
) as a
it's only taking 3-4 seconds to execute. But
Select Category,x,y,z
from
(
---Sample Query
) as a
where category Like 'Spart%'
is taking more than 2-3 minutes to execute.
Why is it taking more time to execute when I use the where clause?
It's impossible to say exactly what the issue is without seeing the full query. It is likely that the optimiser is pushing the WHERE into the "Sample query" in a way that is not performant. Possibly could be resolved by updating statistics on the table, but an easier option would be to insert the whole query into a temporary table, and filter from there.
Select Category,x,y,z
INTO #temp
from
(
---Sample Query
) as a
SELECT * FROM #temp WHERE category Like 'Spart%'
This will force the optimiser to tackle it in the logical order of pulling your data together before applying the WHERE to the end result. You might like to consider indexing the temp table's category field also.
If you're using MS SQL by checking the management studio actual execution plan it may already suggest an index creation
In any case, you should add to the index used by the query the column "Category"
If you don't have an index on that table create it composed by column "Category" and all the other columns used in join or where
bear in mind by using like 'text%' clause you could end in index scan and not index seek