Subquery in partitioned Athena tables - sql

I am using partitions in Athena. I have a partition called snapshot, and when I call a query as such:
select * from mytable where snapshot = '2020-06-25'
Then, as expected, only the specified partition is scanned and my query is fast. However, if I use a subquery which returns a single date, it is slooow:
select * from mytable where snapshot = (select '2020-06-25')
The above actually scans all partitions and not only the specified date, and results in very low performance.
My question is: can I use a subquery to specify partitions and still get partition pruning? I need to use a subquery to add some custom logic which returns a date based on some criteria.

Edit:
Trino 356 is able to inline such queries, see https://github.com/trinodb/trino/issues/4231#issuecomment-845733371
Older answer:
Presto still does not inline a trivial subquery such as (select '2020-06-25').
This is tracked by https://github.com/trinodb/trino/issues/4231.
Thus, you should not expect Athena to inline it, as it is based on Presto 0.172.
"I need to use a subquery to add some custom logic which returns a date based on some criteria."
If your query is going to be more sophisticated, not a constant expression, it will not be inlined anyway. If snapshot is a partition key, then you could leverage a recently added feature -- dynamic partition pruning. Read more at https://trino.io/blog/2020/06/14/dynamic-partition-pruning.html.
This of course assumes you can choose Presto version.
If you are constrained to Athena, your only option is to evaluate the subquery outside of the main query (separately), and pass its result back to the main query as a constant (e.g. a literal).
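The two-step approach for Athena can be sketched on the client side as follows. This is a minimal sketch, not an Athena client: `run_query` is a hypothetical callable standing in for whatever executes SQL and returns rows, and `build_partition_query` is a name invented here. The value is validated before it is inlined, because it is pasted into the SQL text as a literal.

```python
import datetime as dt
import re
from typing import Callable, List, Tuple

# Hypothetical executor type: takes SQL text, returns rows as tuples of strings.
RunQuery = Callable[[str], List[Tuple[str, ...]]]


def build_partition_query(run_query: RunQuery, date_sql: str) -> str:
    """Run the date-producing query first, then inline its result as a literal."""
    rows = run_query(date_sql)
    if len(rows) != 1 or len(rows[0]) != 1:
        raise ValueError("date query must return exactly one row and one column")
    snapshot = rows[0][0]
    # Accept only a plain YYYY-MM-DD string that is also a real calendar date.
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", snapshot):
        raise ValueError(f"not a YYYY-MM-DD date: {snapshot!r}")
    dt.date.fromisoformat(snapshot)  # raises ValueError for e.g. 2020-13-40
    return f"select * from mytable where snapshot = '{snapshot}'"
```

The returned string contains a constant predicate on the partition key, which is the form that was pruned correctly in the question.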

Athena engine version 2, released in late 2020, seems to have improved its push_down_predicate handling to support subqueries.
Here is their related statement from https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference.html#engine-versions-reference-0002
Predicate inference and pushdown – Predicate inference and pushdown extended for queries that use a <symbol> IN <subquery> predicate.
My test with our own Athena table indicates this is indeed the case. My test query is roughly as below
SELECT *
FROM table_partitioned_by_scrape_date
WHERE scrape_date = (
SELECT max(scrape_date)
FROM table_partitioned_by_scrape_date
)
From the bytes scanned by the query, I can tell Athena indeed only scanned the partition with the latest scrape_date.
Moreover, I also tested push_down_predicate support in a JOIN clause where the joined value is the result of another query. Even though it is not mentioned in the release notes, Athena 2.0 is apparently smart enough to support this scenario as well and scans only the latest scrape_date partition. When I tested a similar query in Athena 1.0, it scanned all the partitions instead. My test query is as below
WITH l as (
SELECT max(scrape_date) as latest_scrape_date
FROM table_partitioned_by_scrape_date
)
SELECT deckard_id
FROM table_partitioned_by_scrape_date as t
JOIN l ON t.scrape_date = l.latest_scrape_date

Related

SQL Server performance on concat method

I need to write a query to fetch some data from table and need to use concat method. I end up preparing two different queries and I'm not sure which one will have better performance when we have huge amounts of data.
Please help me understand which query will perform better, and why.
I need to concat twice, one for displaying and one for where condition:
select
id, concat(boolean_value, long_value, string_value, double_value) as Value
from
table
where
concat(boolean_value, long_value, string_value, double_value) = 'XX'
Write the above query as a subquery and add the where condition
Select *
from
(select
id, concat(boolean_value, long_value, string_value, double_value) as Value
from
table) as output
where
Value = 'XX'
Note: this is sample query and the actual query will have multiple joins and the need to concat from multiple columns of different tables
SQL queries describe the result set being produced, not the steps used to compute it. You seem to be asking whether there is common subexpression elimination, that is, whether the concat() will be evaluated exactly once.
You have a better chance of this with the subquery than with the version that repeats the expression. But SQL Server could be smart enough to evaluate it only once.
If you want to guarantee single evaluation, then I think cross apply does that:
select t.id, v.value
from table t cross apply
(values (concat(boolean_value, long_value, string_value, double_value))
) as v(value)
where v.value = 'XX';
I would, however, question why you are comparing the result of the concat() rather than the base columns. Comparing the base columns would allow the optimizer to take advantage of indexes, partitions, and statistics.
SQL Server has no EXPLAIN statement, but you can compare the execution plans of both queries (for example with SET SHOWPLAN_ALL ON) to see what exactly is on the mind of SQL Server. I would expect the first version to be preferable because, unlike the second version, it does not force SQL Server to materialize an intermediate table. As for whether SQL Server actually has to evaluate CONCAT twice in the first version, it may not even matter if the cost of the subquery is very high.

Query the exact partition when table is hash-partitioned

When I want to query a single partition I usually use something like that:
Select * from t (partition p1)
But when you have to query it in your pl/sql code it comes to using execute immediate and hard-parse of the statement.
Okay, for a RANGE-partitioned table (say, partitioned on SOME_DATE of date type) I can work around it like
Select * from t where some_date <= :1 and some_date > :2
Assuming :1 and :2 stand for the partition bounds.
Well, as for a LIST-partitioned table, I can easily specify the exact value of my partition key field like
Select * from t where part_key = 'X'
And what about HASH-partitioning? For example, I have a table partitioned by hash(id) in 16 partitions. And I have 16 jobs each handling its own partition. So I have to use it like that
Select * from t (partition p<n>)
Question is: can I do it like this for example
Select * from t where hash(id) = :1
to force partition pruning and read the whole n-th partition?
It's okay when you have just 16 partitions, but in my case I have composite partitioning (date + hash(id)), so every time a job handles a partition it gets a new sql_id, and this ends up in quick shared pool growth.
It appears Oracle internally uses the ora_hash function (at least since 10g) to assign a value to a partition. So you could use that to read all the data from a single partition. Unfortunately, though, since you'd be running a query like
select *
from t
where ora_hash( id, 9 ) = 6
to get all the data in the 6th of 8 hash partitions, I'd expect Oracle to have to read every partition in the table (and compute the hash on every id) because the optimizer isn't going to be smart enough to recognize that your expression happens to map exactly to its internal partitioning strategy. So I don't think you'd want to do this to split data up to be processed by different threads.
Depending on what those threads are doing, would it be possible to use Oracle's built-in parallelism instead (potentially incorporating things like parallelizable pipelined table functions if you're doing ETL processing)? If you tell Oracle to use 16 parallel threads and your table has 16 partitions, Oracle will almost certainly do the right thing internally.

reduce the amount of data scanned by Athena when using aggregate functions

The below query scans 100 MB of data.
select * from table where column1 = 'val' and partition_id = '20190309';
However the below query scans 15 GB of data (there are over 90 partitions)
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
How can I optimize the second query to scan the same amount of data as the first?
There are two problems here: the efficiency of the scalar subquery above, select max(partition_id) from table, and the one @PiotrFindeisen pointed out around dynamic filtering.
The first problem is that queries over the partition keys of a Hive table are a lot more complex than they appear. Most folks would think that if you want the max value of a partition key, you can simply execute a query over the partition keys, but that doesn't work because Hive allows partitions to be empty (and it also allows non-empty files that contain no rows). Specifically, the scalar subquery above, select max(partition_id) from table, requires Trino (formerly PrestoSQL) to find the max partition containing at least one row. The ideal solution would be to have perfect stats in Hive, but short of that the engine would need custom logic for Hive that opens the files of the partitions until it finds a non-empty one.
If you are sure that your warehouse does not contain empty partitions (or if you are OK with the implications of that), you can replace the scalar subquery with one over the hidden "$partitions" table:
select *
from table
where column1 = 'val' and
partition_id = (select max(partition_id) from "table$partitions");
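The caveat about empty partitions can be illustrated with a small simulation. This is a sketch in plain Python, not engine code: the dictionary of partition ids to row counts and both function names are made up for illustration.

```python
from typing import Dict


def max_listed_partition(row_counts: Dict[str, int]) -> str:
    """What a query over partition metadata sees: every registered partition."""
    return max(row_counts)


def max_partition_with_rows(row_counts: Dict[str, int]) -> str:
    """What max(partition_id) over the data must return: only partitions with rows."""
    return max(pid for pid, rows in row_counts.items() if rows > 0)


# Hypothetical table: the newest partition is registered but still empty.
row_counts = {"20190307": 120, "20190308": 95, "20190309": 0}

print(max_listed_partition(row_counts))     # 20190309
print(max_partition_with_rows(row_counts))  # 20190308
```

When the two answers differ, the rewritten query filters on an empty partition and returns no rows, which is the implication you have to be OK with.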
The second problem is the one @PiotrFindeisen pointed out, and has to do with the way that queries are planned and executed. Most people would look at the above query and conclude that the engine should obviously figure out the value of select max(partition_id) from "table$partitions" during planning, inline that into the plan, and then continue with optimization. Unfortunately, that is a pretty complex decision to make generically, so the engine instead simply models this as a broadcast join, where one part of the execution figures out that value and broadcasts it to the rest of the workers. The problem is that the rest of the execution has no way to add this new information into the existing processing, so it simply scans all of the data and then filters out the values you are trying to skip. There is a project in progress to add this dynamic filtering, but it is not complete yet.
This means the best you can do today, is to run two separate queries: one to get the max partition_id and a second one with the inlined value.
BTW, the hidden "$partitions" table was added in Presto 0.199, and we fixed some minor bugs in 0.201. I'm not sure which version Athena is based on, but I believe it is pretty far out of date (the current release at the time I'm writing this answer is 309).
EDIT: Presto removed the __internal_partitions__ table in their 0.193 release so I'd suggest not using the solution defined in the Slow aggregation queries for partition keys section below in any production systems since Athena 'transparently' updates presto versions. I ended up just going with the naive SELECT max(partition_date) ... query but also using the same lookback trick outlined in the Lack of Dynamic Filtering section. It's about 3x slower than using the __internal_partitions__ table, but at least it won't break when Athena decides to update their presto version.
----- Original Post -----
So I've come up with a fairly hacky way to accomplish this for date-based partitions on large datasets, for when you only need to look back over a few partitions' worth of data for a match on the max. However, please note that I'm not 100% sure how brittle the usage of the information_schema.__internal_partitions__ table is.
As @Dain noted above, there are really two issues: the first is how slow an aggregation of the max(partition_date) query is, and the second is Presto's lack of support for dynamic filtering.
Slow aggregation queries for partition keys
To solve the first issue, I'm using the information_schema.__internal_partitions__ table which allows me to get quick aggregations on the partitions of a table without scanning the data inside the files. (Note that partition_value, partition_key, and partition_number in the below queries are all column names of the __internal_partitions__ table and not related to your table's columns)
If you only have a single partition key for your table, you can do something like:
SELECT max(partition_value) FROM information_schema.__internal_partitions__
WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
But if you have multiple partition keys, you'll need something more like this:
SELECT max(partition_date) as latest_partition_date from (
SELECT max(case when partition_key = 'partition_date' then partition_value end) as partition_date, max(case when partition_key = 'another_partition_key' then partition_value end) as another_partition_key
FROM information_schema.__internal_partitions__
WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
GROUP BY partition_number
)
WHERE
-- ... Filter down by values for e.g. another_partition_key
These queries should run fairly quickly (mine run in about 1-2 seconds) without scanning through the actual data in the files, but again, I'm not sure if there are any gotchas with using this approach.
Lack of Dynamic Filtering
I'm able to mitigate the worst effects of the second problem for my specific use-case because I expect there to always be a partition within a finite amount of time back from the current date (e.g. I can guarantee any data-production or partition-loading issues will be remedied within 3 days). It turns out that Athena does do some pre-processing when using presto's datetime functions, so this does not have the same types of issues with Dynamic Filtering as using a sub-query.
So you can change your query to limit how far it will look back for the actual max using the datetime functions so that the amount of data scanned will be limited.
SELECT * FROM "DATABASE_NAME"."TABLE_NAME"
WHERE partition_date >= cast(date '2019-06-25' - interval '3' day as varchar) -- Will only scan partitions from 3 days before '2019-06-25'
AND partition_date = (
-- Insert the partition aggregation query from above here
)
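The effect of the lookback bound can be checked with a little date arithmetic. This is a minimal sketch, assuming partition values are ISO-formatted date strings (so string comparison matches date order); the partition list and the helper name are invented for illustration.

```python
import datetime as dt
from typing import List


def partitions_in_lookback(partitions: List[str], today: dt.date, days: int) -> List[str]:
    """Keep partitions on or after today - days, mirroring
    partition_date >= cast(date '...' - interval 'N' day as varchar)."""
    lower_bound = (today - dt.timedelta(days=days)).isoformat()
    return [p for p in partitions if p >= lower_bound]


# Hypothetical partition values for the table.
partitions = ["2019-06-20", "2019-06-21", "2019-06-22", "2019-06-23", "2019-06-24"]
print(partitions_in_lookback(partitions, dt.date(2019, 6, 25), 3))
# ['2019-06-22', '2019-06-23', '2019-06-24']
```

Only the partitions that survive this constant bound are candidates for the scan; the subquery predicate then narrows them down to the max.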
I don't know if it is still relevant, but just found out:
Instead of:
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
Use:
select a.* from table a
inner join (select max(partition_id) max_id from table) b on a.partition_id=b.max_id
where column1 = 'val';
I think it has something to do with optimizations of joins to use partitions.

Subquery is faster using a function

I have a long query (~200 lines) that I have embedded in a function:
CREATE FUNCTION spot_rate(base_currency character(3),
contra_currency character(3),
pricing_date date) RETURNS numeric(20,8)
Whether I run the query directly or the function I get similar results and similar performance. So far so good.
Now I have another long query that looks like:
SELECT x, sum(y * spot_rates.spot)
FROM (SELECT a, b, sum(c) FROM t1 JOIN t2 etc. (6 joins here)) AS table_1,
(SELECT
currency,
spot_rate(currency, 'USD', current_date) AS "spot"
FROM (SELECT DISTINCT currency FROM table_2) AS "currencies"
) AS "spot_rates"
WHERE
table_1.currency = spot_rates.currency
GROUP BY / ORDER BY
This query runs in 300 ms, which is slowish but fast enough at this stage (and probably makes sense given the number of rows and aggregation operations).
If however I replace spot_rate(currency, 'USD', current_date) by its equivalent query, it runs in 5+ seconds.
Running the subquery alone returns in ~200ms whether I use the function or the equivalent query.
Why would the query run more slowly than the function when used as a subquery?
ps: I hope there is a generic answer to this generic problem - if not I'll post more details but creating a contrived example is not straightforward.
EDIT: EXPLAIN ANALYZE run on the 2 subqueries and whole queries
subquery with function: http://explain.depesz.com/s/UHCF
subquery with direct query: http://explain.depesz.com/s/q5Q
whole query with function: http://explain.depesz.com/s/ZDt
whole query with direct query: http://explain.depesz.com/s/R2f
just the function body, using one set of arguments: http://explain.depesz.com/s/mEp
Just a wild guess: your query's range-table is exceeding the join_collapse_limit, causing a suboptimal plan to be used.
Try moving the subquery body (the equivalent of the function) into a CTE, to keep it intact. (Before PostgreSQL 12, CTEs are always executed as written and never broken up by the planner; from version 12 on, declare the CTE AS MATERIALIZED to get the same behaviour.)
Pre-calculating parts of the query into (TEMP) tables or materialised views can also help to reduce the number of RTEs.
You could (temporarily) increase join_collapse_limit, but this will cost more planning time, and there certainly is a limit to this (the number of possible plans grows exponentially with the size of the range table.)
Normally, you can detect this behaviour by the bad query plan (like here: fewer index scans), but you'll need knowledge of the schema, and there must be some kind of reasonable plan possible (read: PK/FK and indices must be correct, too)
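To get a feel for the growth in the number of possible plans mentioned for join_collapse_limit, the sketch below counts left-deep join orders, which is n! for n relations. This is only a rough illustration and not how PostgreSQL's planner enumerates plans; the real search space also includes bushy trees and different join methods. The function name is made up here.

```python
import math


def left_deep_join_orders(n_relations: int) -> int:
    """Number of left-deep join orders for n relations (n factorial)."""
    return math.factorial(n_relations)


for n in (4, 8, 12):
    print(n, left_deep_join_orders(n))
# 4 24
# 8 40320
# 12 479001600
```

Eight is PostgreSQL's default join_collapse_limit, so raising it by only a few relations already multiplies the number of orderings by several orders of magnitude.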

Oracle: why doesn't it use parallel execution?

Look at the following query:
If I comment out the subquery, it uses parallel execution; otherwise it doesn't.
After the query has been
SELECT /*+ parallel(c, 20) */
1, (SELECT 2 FROM DUAL)
FROM DUAL c;
You could have found the answer in the documentation:
A SELECT statement can be parallelized only if the following conditions are satisfied:
The query includes a parallel hint specification (PARALLEL or PARALLEL_INDEX) or the schema objects referred to in the query have a PARALLEL declaration associated with them.
At least one of the tables specified in the query requires one of the following: a full table scan, or an index range scan spanning multiple partitions.
No scalar subqueries are in the SELECT list.
Your query falls at the final hurdle: it has a scalar subquery in its projection. If you want to parallelize the query you need to find another way to write it.
One idea: instead of a subquery, you could try a join. Your subquery seems fairly simple (no grouping, etc.), so it should not be an issue to translate it into a join.
Maybe the optimizer is not capable of parallel execution when there are subqueries.