Reduce the amount of data scanned by Athena when using aggregate functions - SQL

The query below scans 100 MB of data:
select * from table where column1 = 'val' and partition_id = '20190309';
However, the query below scans 15 GB of data (there are over 90 partitions):
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
How can I optimize the second query to scan the same amount of data as the first?

There are two problems here: the efficiency of the scalar subquery above, select max(partition_id) from table, and the one @PiotrFindeisen pointed out around dynamic filtering.
The first problem is that queries over the partition keys of a Hive table are a lot more complex than they appear. Most folks would think that if you want the max value of a partition key, you can simply execute a query over the partition keys, but that doesn't work because Hive allows partitions to be empty (and it also allows non-empty files that contain no rows). Specifically, the scalar subquery above, select max(partition_id) from table, requires Trino (formerly PrestoSQL) to find the max partition containing at least one row. The ideal solution would be to have perfect stats in Hive, but short of that the engine would need custom logic for Hive that opens files in the partitions until it finds a non-empty one.
If you are sure that your warehouse does not contain empty partitions (or if you are ok with the implications of that), you can replace the scalar subquery with one over the hidden "$partitions" table:
select *
from table
where column1 = 'val' and
partition_id = (select max(partition_id) from "table$partitions");
The second problem is the one @PiotrFindeisen pointed out, and has to do with the way that queries are planned and executed. Most people would look at the above query, see that the engine should obviously figure out the value of select max(partition_id) from "table$partitions" during planning, inline that into the plan, and then continue with optimization. Unfortunately, that is a pretty complex decision to make generically, so the engine instead simply models this as a broadcast join, where one part of the execution figures out that value and broadcasts it to the rest of the workers. The problem is that the rest of the execution has no way to add this new information into the existing processing, so it simply scans all of the data and then filters out the values you are trying to skip. There is a project in progress to add this dynamic filtering, but it is not complete yet.
This means the best you can do today is to run two separate queries: one to get the max partition_id and a second one with the inlined value.
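A hedged sketch of that two-step approach, using the names from the question (the second query simply inlines whatever value the first one returned; fall back to max(partition_id) over the table itself if the hidden table is not available to you):

-- Query 1: find the latest partition (cheap when run against "table$partitions";
-- see the caveat below about Athena's Presto version)
select max(partition_id) from "table$partitions";

-- Query 2: inline the value returned above as a literal
select * from table where column1 = 'val' and partition_id = '20190309';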
BTW, the hidden "$partitions" table was added in Presto 0.199, and we fixed some minor bugs in 0.201. I'm not sure which version Athena is based on, but I believe it is pretty far out of date (the current release at the time I'm writing this answer is 309).

EDIT: Presto removed the __internal_partitions__ table in their 0.193 release, so I'd suggest not using the solution defined in the Slow aggregation queries for partition keys section below in any production systems, since Athena 'transparently' updates Presto versions. I ended up just going with the naive SELECT max(partition_date) ... query but also using the same lookback trick outlined in the Lack of Dynamic Filtering section. It's about 3x slower than using the __internal_partitions__ table, but at least it won't break when Athena decides to update their Presto version.
----- Original Post -----
So I've come up with a fairly hacky way to accomplish this for date-based partitions on large datasets, for when you only need to look back over a few partitions' worth of data for a match on the max. However, please note that I'm not 100% sure how brittle the usage of the information_schema.__internal_partitions__ table is.
As @Dain noted above, there are really two issues: the first being how slow an aggregation of the max(partition_date) query is, and the second being Presto's lack of support for dynamic filtering.
Slow aggregation queries for partition keys
To solve the first issue, I'm using the information_schema.__internal_partitions__ table which allows me to get quick aggregations on the partitions of a table without scanning the data inside the files. (Note that partition_value, partition_key, and partition_number in the below queries are all column names of the __internal_partitions__ table and not related to your table's columns)
If you only have a single partition key for your table, you can do something like:
SELECT max(partition_value) FROM information_schema.__internal_partitions__
WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
But if you have multiple partition keys, you'll need something more like this:
SELECT max(partition_date) AS latest_partition_date FROM (
    SELECT
        max(case when partition_key = 'partition_date' then partition_value end) AS partition_date,
        max(case when partition_key = 'another_partition_key' then partition_value end) AS another_partition_key
    FROM information_schema.__internal_partitions__
    WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
    GROUP BY partition_number
)
WHERE
    -- ... Filter down by values for e.g. another_partition_key
These queries should run fairly quickly (mine run in about 1-2 seconds) without scanning through the actual data in the files, but again, I'm not sure if there are any gotchas with using this approach.
Lack of Dynamic Filtering
I'm able to mitigate the worst effects of the second problem for my specific use case because I expect there to always be a partition within a finite amount of time back from the current date (e.g. I can guarantee any data-production or partition-loading issues will be remedied within 3 days). It turns out that Athena does do some pre-processing when using Presto's datetime functions, so this does not have the same types of issues with dynamic filtering as using a sub-query.
So you can change your query to limit how far back it will look for the actual max using the datetime functions, so that the amount of data scanned is limited.
SELECT * FROM "DATABASE_NAME"."TABLE_NAME"
WHERE partition_date >= cast(date '2019-06-25' - interval '3' day as varchar) -- Will only scan partitions from 3 days before '2019-06-25'
AND partition_date = (
-- Insert the partition aggregation query from above here
)
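For completeness, a hedged sketch of what this looks like with the naive max() aggregation inlined instead of the __internal_partitions__ query (the approach mentioned in the EDIT at the top), assuming a single partition_date key:

SELECT * FROM "DATABASE_NAME"."TABLE_NAME"
WHERE partition_date >= cast(date '2019-06-25' - interval '3' day as varchar)
AND partition_date = (
    SELECT max(partition_date)
    FROM "DATABASE_NAME"."TABLE_NAME"
    -- Apply the same lookback here so the aggregation also only scans 3 days of partitions
    WHERE partition_date >= cast(date '2019-06-25' - interval '3' day as varchar)
)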

I don't know if it is still relevant, but I just found out:
Instead of:
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
Use:
select a.* from table a
inner join (select max(partition_id) max_id from table) b on a.partition_id=b.max_id
where column1 = 'val';
I think it has something to do with optimizations of joins to use partitions.

Related

BigQuery: querying the latest partition, bytes to be processed vs. actually processed

I'm struggling to efficiently query the last partition of a table, using a date or datetime field. The first approach was to filter like this:
SELECT *
FROM my_table
WHERE observation_date = (SELECT MAX(observation_date) FROM my_table)
But that, according to BigQuery's processing estimation, scans the entire table and does not use the partitions. Even Google states this happens in their documentation. It does work if I use an exact value for the partition:
SELECT *
FROM my_table
WHERE observation_date = CURRENT_DATE
But if the table is not up to date then the query will not get any results and my automatic processes will fail. If I include an offset like observation_date = DATE_SUB(CURRENT_DATE, INTERVAL 2 DAY), I will likely miss the latest partition.
What is the best practice to get the latest partition efficiently?
What makes this worse is that BigQuery's estimation of the bytes to be processed by the active query does not match what was actually processed, unless I'm not interpreting those numbers correctly. Find below the screenshot of the mismatching values.
[Screenshot: BigQuery console showing apparently mismatching processed bytes]
Finally a couple of scenarios that I also tested:
If I store a max_date with a DECLARE statement first, as suggested in this post (sketched below), the estimation seems to work, but it is not clear why. However, the actual bytes processed after running the query are no different from the case that filters the latest partition in the WHERE clause.
Using the same declared max_date in a table that is both partitioned and clustered, the estimation works only when using a filter for the partition, but fails if I include a filter for the cluster.
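For reference, a hedged sketch of the DECLARE-based variant mentioned in the first scenario (BigQuery scripting, reusing my_table and observation_date from above); as noted, it only changed the estimate for me, not the bytes actually processed:

-- Store the latest partition date in a scripting variable first
DECLARE max_date DATE DEFAULT (SELECT MAX(observation_date) FROM my_table);

-- Then filter the partitioned table with the variable
SELECT *
FROM my_table
WHERE observation_date = max_date;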
After some iterations I got an answer from Google and although it doesn't resolve the issue, it acknowledges it happens.
Tables partitioned with DATE or DATETIME fields cannot be efficiently queried for their latest partition. The best practice remains to filter with something like WHERE observation_date = (SELECT MAX(observation_date) FROM my_table) and that will scan the whole table.
They made notes to try and improve this in the future but we have to deal with this for now. I hope this helps somebody that was trying to do the same thing.

Subquery in partitioned Athena tables

I am using partitions in Athena. I have a partition called snapshot, and when I call a query as such:
select * from mytable where snapshot = '2020-06-25'
Then, as expected, only the specified partition is scanned and my query is fast. However, if I use a subquery which returns a single date, it is slooow:
select * from mytable where snapshot = (select '2020-06-25')
The above actually scans all partitions and not only the specified date, and results in very low performance.
My question is: can I use a subquery to specify partitions and increase performance? I need to use a subquery to add some custom logic which returns a date based on some criteria.
Edit:
Trino 356 is able to inline such queries, see https://github.com/trinodb/trino/issues/4231#issuecomment-845733371
Older answer:
Presto still does not inline a trivial subquery like (select '2020-06-25').
This is tracked by https://github.com/trinodb/trino/issues/4231.
Thus, you should not expect Athena to inline it, as it's based on Presto 0.172.
I need to use a subquery to add some custom logic which returns a date based on some criteria.
If your query is going to be more sophisticated than a constant expression, it will not be inlined anyway. If snapshot is a partition key, then you could leverage a recently added feature -- dynamic partition pruning. Read more at https://trino.io/blog/2020/06/14/dynamic-partition-pruning.html.
This of course assumes you can choose your Presto version.
If you are constrained to Athena, your only option is to evaluate the subquery outside of the main query (separately), and pass it back to the main query as a constant (e.g. a literal).
Athena 2.0, released in late 2020, seems to have improved its predicate pushdown handling to support subqueries.
Here is the related statement from https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference.html#engine-versions-reference-0002:
Predicate inference and pushdown – Predicate inference and pushdown extended for queries that use a <symbol> IN <subquery> predicate.
My test with our own Athena table indicates this is indeed the case. My test query is roughly as below
SELECT *
FROM table_partitioned_by_scrape_date
WHERE scrape_date = (
SELECT max(scrape_date)
FROM table_partitioned_by_scrape_date
)
From the bytes scanned by the query, I can tell Athena indeed only scanned the partition with the latest scrape_date.
Moreover, I also tested predicate pushdown in a JOIN clause where the joined value is the result of another query. Even though it is not mentioned in the release notes, Athena 2.0 is apparently now smart enough to support this scenario as well and only scans the latest scrape_date partition. I had tested a similar query in Athena 1.0 before, and it scanned all the partitions instead. My test query is as below:
WITH l as (
SELECT max(scrape_date) as latest_scrape_date
FROM table_partitioned_by_scrape_date
)
SELECT deckard_id
FROM table_partitioned_by_scrape_date as t
JOIN l ON t.scrape_date = l.latest_scrape_date

Query the exact partition when table is hash-partitioned

When I want to query a single partition I usually use something like that:
Select * from t partition (p1)
But when you have to query it in your PL/SQL code, it comes down to using EXECUTE IMMEDIATE and a hard parse of the statement.
Okay, for a RANGE-partitioned table (say, partitioned on SOME_DATE of DATE type) I can work around it like:
Select * from t where some_date <= :1 and some_date > :2
Assuming :1 and :2 stand for the partition bounds.
Well, as for a LIST-partitioned table, I can easily specify the exact value of my partition key field like:
Select * from t where part_key = 'X'
And what about HASH partitioning? For example, I have a table partitioned by hash(id) into 16 partitions, and I have 16 jobs, each handling its own partition. So I have to use it like this:
Select * from t partition (p<n>)
The question is: can I do something like this, for example
Select * from t where hash(id) = :1
to enforce partition pruning and take the whole n-th partition?
It's okay when you have just 16 partitions, but in my case I have composite partitioning (date + hash(id)), so every time a job handles a partition there is a new sql_id, and it ends up in rapid shared pool growth.
It appears Oracle internally uses the ora_hash function (at least since 10g) to map a partition key value to a partition. So you could use that to read all the data from a single partition. Unfortunately, though, since you'd be running a query like
select *
from t
where ora_hash( id, 7 ) = 6
to get all the data in one of the 8 hash partitions, I'd expect Oracle to have to read every partition in the table (and compute the hash on every id) because the optimizer isn't going to be smart enough to recognize that your expression happens to map exactly to its internal partitioning strategy. So I don't think you'd want to do this to split data up to be processed by different threads.
Depending on what those threads are doing, it might be possible to use Oracle's built-in parallelism instead (potentially incorporating things like parallelizable pipelined table functions if you're doing ETL processing). If you tell Oracle to use 16 parallel threads and your table has 16 partitions, Oracle will internally almost certainly do the right thing.
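A minimal sketch of that alternative, assuming the table t from the question and a degree of parallelism matching the 16 partitions (the hint values are placeholders):

-- Let Oracle parallelize the scan itself instead of running one job per hash partition
SELECT /*+ PARALLEL(t, 16) */ *
FROM t;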

5+ Intermediate SQL Tables to Arrive at Desired Table, Postgres

I am generating reports on electoral data that group voters into their age groups, and then assign those age groups a quartile, before finally returning the table of age groups and quartiles.
By the time I arrive at the table with the schema and data that I want, I have created 7 intermediate tables that might as well be deleted at this point.
My question is: is it plausible that so many intermediate tables are necessary? Or is this a sign that I am "doing it wrong"?
Technical Specifics:
Postgres 9.4
I am chaining tables, starting with the raw database tables and successively transforming the table closer to what I want. For instance, I do something like:
CREATE TABLE gm.race_code_and_turnout_count AS
SELECT race_code, count(*)
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
And then I do
CREATE TABLE gm.race_code_and_percent_of_total_turnout AS
SELECT race_code, count, round((count::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.race_code_and_turnout_count
And that first table goes off in a second branch:
CREATE TABLE gm.race_code_and_turnout_percentage AS
SELECT t1.race_code, round((t1.count::numeric / t2.count)*100,2) as turnout_percentage
FROM gm.race_code_and_turnout_count AS t1
JOIN gm.race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
So each table is building on the one before it.
While temporary tables are used a lot in SQL Server (mainly to overcome the peculiar locking behaviour that it has), they are far less common in Postgres (and your example uses regular tables, not temporary tables).
Usually the overhead of creating a new table is higher than letting the system store intermediate results on disk.
From my experience, creating intermediate tables usually only helps if:
you have a lot of data that is aggregated and can't be aggregated in memory
the aggregation drastically reduces the data volume to be processed so that the next step (or one of the next steps) can handle the data in memory
you can efficiently index the intermediate tables so that the next step can make use of those indexes to improve performance.
you re-use a pre-computed result several times in different steps
The above list is not complete, and using this approach can also be beneficial if only some of these conditions are true.
If you keep creating those tables, create them at least as temporary or unlogged tables to minimize the I/O overhead that comes with writing that data, and thus keep as much data in memory as possible.
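For instance, the first intermediate table from your question could be created like this (a sketch; note that temporary tables live in the session's temporary schema, so the gm prefix is dropped for that variant):

-- Temporary: visible only to this session, dropped automatically, not WAL-logged
CREATE TEMPORARY TABLE race_code_and_turnout_count AS
SELECT race_code, count(*)
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code;

-- Unlogged: persists like a normal table but skips WAL (contents are lost after a crash)
CREATE UNLOGGED TABLE gm.race_code_and_turnout_count AS
SELECT race_code, count(*)
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code;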
However, I would always start with a single query instead of maintaining many different tables (that all need to be changed if you have to change the structure of the report).
For example, the first two queries from your question can easily be combined into a single query with no performance loss:
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code;
This is going to be faster than writing the data twice to disk (including all transactional overhead).
If you stack your queries using common table expressions, Postgres will automatically store the data on disk if it gets too big; if not, it will process it in memory. When you manually create the tables, you force Postgres to write everything to disk.
So you might want to try something like this:
with race_code_and_turnout_count as (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
), race_code_and_total_count as (
select ....
from ....
), race_code_and_turnout_percentage as (
SELECT t1.race_code,
round((t1.cnt::numeric / t2.count)*100,2) as turnout_percentage
FROM race_code_and_turnout_count AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
)
select *
from ....;
and see how that performs.
If you don't re-use the intermediate steps more than once, writing them as a derived table instead of a CTE might be faster in Postgres due to the way the optimizer works, e.g.:
SELECT t1.race_code,
round((t1.cnt::numeric / t2.count)*100,2) as turnout_percentage
FROM (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
) AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
If it performs well and produces the right output, I see nothing wrong with it. I do, however, suggest using (local) temporary tables if you need intermediate tables.
Your series of queries can always be optimized to use fewer intermediate steps. Do that if you feel your reports start performing poorly.

Is there efficient SQL to query a portion of a large table

The typical way of selecting data is:
select * from my_table
But what if the table contains 10 million records and you only want records 300,010 to 300,020?
Is there a way to create a SQL statement on Microsoft SQL that only gets 10 records at once?
E.g.
select * from my_table from records 300,010 to 300,020
This would be way more efficient than retrieving 10 million records across the network, storing them on the IIS server, and then counting to the records you want.
SELECT * FROM my_table is just the tip of the iceberg. Assuming you're talking a table with an identity field for the primary key, you can just say:
SELECT * FROM my_table WHERE ID >= 300010 AND ID <= 300020
You should also know that selecting * is considered poor practice in many circles. They want you to specify the exact column list.
Try looking at info about pagination. Here's a short summary of it for SQL Server.
Absolutely. On MySQL and PostgreSQL (the two databases I've used), the syntax would be
SELECT [columns] FROM table LIMIT 10 OFFSET 300010;
On MS SQL, it's something like SELECT TOP 10 ...; I don't know the syntax for offsetting the record list.
Note that you never want to use SELECT *; it's a maintenance nightmare if anything ever changes. This query, though, is going to be incredibly slow since your database will have to scan through and throw away the first 300,010 records to get to the 10 you want. It'll also be unpredictable, since you haven't told the database which order you want the records in.
This is the core of SQL: tell it which 10 records you want, identified by a key in a specific range, and the database will do its best to grab and return those records with minimal work. Look up any tutorial on SQL for more information on how it works.
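For what it's worth, newer SQL Server versions (2012 and later) do support a standard offset syntax as well; a hedged sketch, assuming an indexed id column to order by (the column names are placeholders):

SELECT id, col1, col2
FROM my_table
ORDER BY id
OFFSET 300010 ROWS
FETCH NEXT 10 ROWS ONLY;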
When working with large tables, it is often a good idea to make use of Partitioning techniques available in SQL Server.
The rules of your partition function typically dictate that only a range of data can reside within a given partition. You could split your partitions by date range or ID, for example.
In order to select from a particular partition you would use a query similar to the following.
SELECT <Column Name1>, ..., <Column NameN>
FROM <Table Name>
WHERE $PARTITION.<Partition Function Name>(<Column Name>) = <Partition Number>
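For instance, with a hypothetical Orders table partitioned by a range function named pf_OrderDate on the OrderDate column, reading only partition 3 would look roughly like this:

SELECT OrderID, OrderDate, Amount
FROM Orders
WHERE $PARTITION.pf_OrderDate(OrderDate) = 3;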
Take a look at the following white paper for more detailed information on partitioning in SQL Server 2005:
http://msdn.microsoft.com/en-us/library/ms345146.aspx
I hope this helps, but please feel free to pose further questions.
Cheers, John
I use wrapper queries around the core query and then just isolate the row numbers that I wish to take from it - this allows SQL Server to do all the heavy lifting inside the core query and pass out only the small portion of the table that I have requested. All you need to do is pass the [start_row_variable] and the [end_row_variable] into the SQL query.
NOTE: The order clause is specified OUTSIDE the core query, as [sql_order_clause].
w1 and w2 are the aliases of the wrapper (derived) tables.
SELECT
w1.*
FROM(
SELECT w2.*,
ROW_NUMBER() OVER ([sql_order_clause]) AS ROW
FROM (
-- CORE QUERY START
SELECT [columns]
FROM [table_name]
WHERE [sql_string]
-- CORE QUERY END
) AS w2
) AS w1
WHERE ROW BETWEEN [start_row_variable] AND [end_row_variable]
This method has hugely optimized my database systems. It works very well.
IMPORTANT: Be sure to always explicitly specify only the exact columns you wish to retrieve in the core query, as fetching unnecessary data in these core queries can cost you serious overhead.
Use TOP to select only a limited amount of rows, like:
SELECT TOP 10 * FROM my_table WHERE ID >= 300010
Add an ORDER BY if you want the results in a particular order.
To be efficient there has to be an index on the ID column.