AWS Athena SQL view to conditionally query tables based on where condition - sql

I tried to create an Athena table with 1700+ partitions where each partition has 100 buckets (not S3 buckets but Athena buckets, which are like hashes on top of high cardinality columns to speed up queries). Unfortunately I learned that Athena doesn't support more than 100 partitions, and if you use their workaround to support more than 100 partitions, you can't use any buckets. So then I thought, what if I made each partition its own separate table? Then I could give each table its own 100 buckets. The idea is that I would hide all of these tables under a SQL view. When the user specifies a partition, the view would pick the correct table, and query only that table. That led me to post this question.
How can I make a SQL view in Athena that conditionally queries tables? At first I thought it would be simple, but now I'm realizing it's not. I could naively union all the tables, and then run the condition on the unioned result, but that would defeat the purpose of the partitions because I would end up reading all of the data all the time. Is what I'm describing possible?
I want to do this but without the union:
CREATE VIEW example AS (
SELECT col1, col2, 1 as partition FROM partition1
UNION ALL
SELECT col1, col2, 2 as partition FROM partition2)
Where the user would specify something like:
SELECT col1, col2
FROM example
WHERE partition = 1

Related

UNION in AWS athena on 2 tables having billions of records

I have two tables in Athena, which has md5 as one of the columns, and both the tables have around a billion entries. I need to find md5 that is common in both tables. I was trying with the following.
WITH t1 as (
SELECT table1.md5
from table1
),
t2 as (
SELECT table2.md5
from table2
)
SELECT md5 from t1 union all select md5 from t2 group by md5 having count(*)> 1
The query works when I limit the records but when run it with no limit... Athena query fails after running 20-30 mins. seems like too big of o/p that it can handle. There are no partitions etc. that I can use to load fewer data. I know the file it will generate in S3 for query o/p will be in GBs. Is there a way I can make this query better that Athena likes it, or do you have any other thoughts on how to achieve this?
I also tried to create a new table from the query o/p, but it seems eventual; first, the query needs to be completed before it will create a new table.
CREATE TABLE common_md5 (above query)

reduce the amount of data scanned by Athena when using aggregate functions

The below query scans 100 mb of data.
select * from table where column1 = 'val' and partition_id = '20190309';
However the below query scans 15 GB of data (there are over 90 partitions)
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
How can I optimize the second query to scan the same amount of data as the first?
There are two problems here. The efficiency of the the scalar subquery above select max(partition_id) from table, and the one #PiotrFindeisen pointed out around dynamic filtering.
The the first problem is that queries over the partition keys of a Hive table are a lot more complex than they appear. Most folks would think that if you want the max value of a partition key, you can simply execute a query over the partition keys, but that doesn't work because Hive allows partitions to be empty (and it also allows non-empty files that contain no rows). Specifically, the scalar subquery above select max(partition_id) from table requires Trino (formerly PrestoSQL) to find the max partition containing at least one row. The ideal solution would be to have perfect stats in Hive, but short of that the engine would need to have custom logic for hive that open files of the partitions until it found a non empty one.
If you are are sure that your warehouse does not contain empty partitions (or if you are ok with the implications of that), you can replace the scalar sub query with one over the hidden $partitions table"
select *
from table
where column1 = 'val' and
partition_id = (select max(partition_id) from "table$partitions");
The second problem is the one #PiotrFindeisen pointed out, and has to do with the way that queries are planned an executed. Most people would look at the above query, see that the engine should obviously figure out the value of select max(partition_id) from "table$partitions" during planning, inline that into the plan, and then continue with optimization. Unfortunately, that is a pretty complex decision to make generically, so the engine instead simply models this as a broadcast join, where one part of the execution figures out that value, and broadcasts the value to the rest of the workers. The problem is the rest of the execution has no way to add this new information into the existing processing, so it simply scans all of the data and then filters out the values you are trying to skip. There is a project in progress to add this dynamic filtering, but it is not complete yet.
This means the best you can do today, is to run two separate queries: one to get the max partition_id and a second one with the inlined value.
BTW, the hidden "$partitions" table was added in Presto 0.199, and we fixed some minor bugs in 0.201. I'm not sure which version Athena is based on, but I believe it is is pretty far out of date (the current release at the time I'm writing this answer is 309.
EDIT: Presto removed the __internal_partitions__ table in their 0.193 release so I'd suggest not using the solution defined in the Slow aggregation queries for partition keys section below in any production systems since Athena 'transparently' updates presto versions. I ended up just going with the naive SELECT max(partition_date) ... query but also using the same lookback trick outlined in the Lack of Dynamic Filtering section. It's about 3x slower than using the __internal_partitions__ table, but at least it won't break when Athena decides to update their presto version.
----- Original Post -----
So I've come up with a fairly hacky way to accomplish this for date-based partitions on large datasets for when you only need to look back over a few partitions'-worth of data for a match on the max, however, please note that I'm not 100% sure how brittle the usage of the information_schema.__internal_partitions__ table is.
As #Dain noted above, there are really two issues. The first being how slow an aggregation of the max(partition_date) query is, and the second being Presto's lack of support for dynamic filtering.
Slow aggregation queries for partition keys
To solve the first issue, I'm using the information_schema.__internal_partitions__ table which allows me to get quick aggregations on the partitions of a table without scanning the data inside the files. (Note that partition_value, partition_key, and partition_number in the below queries are all column names of the __internal_partitions__ table and not related to your table's columns)
If you only have a single partition key for your table, you can do something like:
SELECT max(partition_value) FROM information_schema.__internal_partitions__
WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
But if you have multiple partition keys, you'll need something more like this:
SELECT max(partition_date) as latest_partition_date from (
SELECT max(case when partition_key = 'partition_date' then partition_value end) as partition_date, max(case when partition_key = 'another_partition_key' then partition_value end) as another_partition_key
FROM information_schema.__internal_partitions__
WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
GROUP BY partition_number
)
WHERE
-- ... Filter down by values for e.g. another_partition_key
)
These queries should run fairly quickly (mine run in about 1-2 seconds) without scanning through the actual data in the files, but again, I'm not sure if there are any gotchas with using this approach.
Lack of Dynamic Filtering
I'm able to mitigate the worst effects of the second problem for my specific use-case because I expect there to always be a partition within a finite amount of time back from the current date (e.g. I can guarantee any data-production or partition-loading issues will be remedied within 3 days). It turns out that Athena does do some pre-processing when using presto's datetime functions, so this does not have the same types of issues with Dynamic Filtering as using a sub-query.
So you can change your query to limit how far it will look back for the actual max using the datetime functions so that the amount of data scanned will be limited.
SELECT * FROM "DATABASE_NAME"."TABLE_NAME"
WHERE partition_date >= cast(date '2019-06-25' - interval '3' day as varchar) -- Will only scan partitions from 3 days before '2019-06-25'
AND partition_date = (
-- Insert the partition aggregation query from above here
)
I don't know if it is still relevant, but just found out:
Instead of:
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
Use:
select a.* from table a
inner join (select max(partition_id) max_id from table) b on a.partition_id=b.max_id
where column1 = 'val';
I think it has something to do with optimizations of joins to use partitions.

How to avoid evaluating the same calculated column in Hive query repetedly

Lets say I have a calculated column:-
select str_to_map("k1:1,k2:2,k3:3")["k1"] as col1,
str_to_map("k1:1,k2:2,k3:3")["k2"] as col2,
str_to_map("k1:1,k2:2,k3:3")["k3"] as col3;
How do I 'fix' the column calculation only once and access its value multiple times in the query? The map being calculated is the same, only different keys are being accessed for different columns. Performing the same calculation repeatedly is a waste of resources. This example is purposely made too simple, but the point is I want to know how to avoid this kind of redundancy in Hive in general.
In general use subqueries, they are calculated once.
select map_col.["k1"] as col1,
map_col.["k2"] as col2,
map_col.["k3"] as col3
from
(
select str_to_map("k1:1,k2:2,k3:3") as map_col from table...
)s;
Also you can materialize some query into table to reuse the dataset across different queries or workflows.

What's the advised way to do groupby 2 attributes from a Hive table with 1000 columns?

Give a Hive table with 1000 columns:
col1, col2, ..., col1000
The source table contains billions of rows, and the size is about 1PB.
I only need to query 3 columns,
select col1, col2, sum(col3) as col3
from myTable
group by
col1, col2
Will it be advised to do a subquery first, and then send it to the aggregation of group by, so that we have much smaller files sending to groupby? Not sure it Hive automatically takes care of this.
select col1, col2, sum(col3) as col3
from
(select col1, col2, col3
from myTable
) a
group by
col1, col2
Behind the scenes it shouldn't really matter if you do a sub-query or not, but you can look at the explain plan of each query to see if you notice any differences between them.
The ideal situation would be for your table to be stored in a columnar format, so if a lot of queries like this will be used in the future then I would ensure that your table is stored as parquet files which use columnar storage and will give you excellent query performance.
If it isn't in this format then you can create a new table using a create as select statement.
create table yourNewParquetTable stored as parquet as select * from yourOldTable;
In general, there is no reason to use a subquery in this situation. You basically have two situations:
First, Hive could store/fetch all the columns together. In that case, Hive needs to read all the data in all columns either for the subquery or for the aggregation.
Otherwise, Hive could store/fetch only the columns you need. In that case, Hive would do so for either version.
That said, there is a reason to avoid the subquery in some databases. MySQL materializes subqueries -- meaning they are stored as if they were temporary tables. This is unnecessary overhead and a good reason to avoid unnecessary subqueries with MySQL. Hive does not do that. It compiles the query in a data flow and executes the data flow.

5+ Intermediate SQL Tables to Arrive at Desired Table, Postgres

I am generating reports on electoral data that group voters into their age groups, and then assign those age groups a quartile, before finally returning the table of age groups and quartiles.
By the time I arrive at the table with the schema and data that I want, I have created 7 intermediate tables that might as well be deleted at this point.
My question is, is it plausible that so many intermediate tables are necessary? Or this a sign that I am "doing it wrong?"
Technical Specifics:
Postgres 9.4
I am chaining tables, starting with the raw database tables and successively transforming the table closer to what I want. For instance, I do something like:
CREATE TABLE gm.race_code_and_turnout_count AS
SELECT race_code, count(*)
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
And then I do
CREATE TABLE gm.race_code_and_percent_of_total_turnout AS
SELECT race_code, count, round((count::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.race_code_and_turnout_count
And that first table goes off in a second branch:
CREATE TABLE gm.race_code_and_turnout_percentage AS
SELECT t1.race_code, round((t1.count::numeric / t2.count)*100,2) as turnout_percentage
FROM gm.race_code_and_turnout_count AS t1
JOIN gm.race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
So each table is building on the one before it.
While temporary tables are used a lot in SQL Server (mainly to overcome the peculiar locking behaviour that it has) it is far less common in Postgres (and your example uses regular tables, not temporary tables).
Usually the overhead of creating a new table is higher than letting the system store intermediate on disk.
From my experience, creating intermediate tables usually only helps if:
you have a lot of data that is aggregated and can't be aggregated in memory
the aggregation drastically reduces the data volume to be processed so that the next step (or one of the next steps) can handle the data in memory
you can efficiently index the intermediate tables so that the next step can make use of those indexes to improve performance.
you re-use a pre-computed result several times in different steps
The above list is not completely and using this approach can also be beneficial if only some of these conditions are true.
If you keep creating those tables create them at least as temporary or unlogged tables to minimized the IO overhead that comes with writing that data and thus keep as much data in memory as possible.
However I would always start with a single query instead of maintaining many different tables (that all need to be changed if you have to change the structure of the report).
For example your first two queries from your question can easily be combined into a single query with no performance loss:
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code;
This is going to be faster than writing the data twice to disk (including all transactional overhead).
If you stack your queries using common table expressions Postgres will automatically store the data on disk if it gets too big, if not it will process it in-memory. When manually creating the tables you force Postgres to write everything to disk.
So you might want to try something like this:
with race_code_and_turnout_count as (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
), race_code_and_total_count as (
select ....
from ....
), race_code_and_turnout_percentage as (
SELECT t1.race_code,
round((t1.count::numeric / t2.count)*100,2) as turnout_percentage
FROM ace_code_and_turnout_count AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
)
select *
from ....;
and see how that performs.
If you don't re-use the intermediate steps more than once, writing them as a derived table instead of a CTE might be faster in Postgres due to the way the optimizer works, e.g.:
SELECT t1.race_code,
round((t1.count::numeric / t2.count)*100,2) as turnout_percentage
FROM (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
) AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
If it performs well and results in the right output, I see nothing wrong with it. I do however suggest to use (local) temporary tables if you need intermediate tables.
Your series of queries can always be optimized to use fewer intermediate steps. Do that if you feel your reports start performing poorly.