Athena (Hive/Presto) query partitioned table IN statement - hive

I have the following partitioned table in Athena (HIVE/Presto):
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
id STRING,
data STRING
)
PARTITIONED BY (
year string,
month string,
day string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3://mybucket';
Data is stored in s3 organized in a path structure like s3://mybucket/year=2020/month=01/day=30/.
I would like to know if the following query would leverage partitioning optimization:
SELECT
*
FROM
mydb.mytable
WHERE
(year='2020' AND month='08' AND day IN ('10', '11', '12')) OR
(year='2020' AND month='07' AND day IN ('29', '30', '31'));
I am assuming that, since the IN operator is transformed into a series of OR conditions, this query will still benefit from partitioning. Am I correct?

Unfortunately Athena does not expose information that would make it easier to understand how to optimise queries. Currently the only thing you can do is to run different variations of queries and look at the statistics returned in the GetQueryExecution API call.
One way to figure out if Athena will make use of partitioning in a query is to run the query with different values for the partition column and make sure that the amount of data scanned is different. If the amount of data is different Athena was able to prune partitions during query planning.
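The comparison described above can be sketched in Python. This is only a sketch: it assumes you have already fetched two GetQueryExecution responses (for example with boto3's `get_query_execution`) and reads the `DataScannedInBytes` statistic from each. The helper names and the sample numbers are hypothetical.

```python
def data_scanned_bytes(response: dict) -> int:
    """Read DataScannedInBytes from a GetQueryExecution response dict."""
    return response["QueryExecution"]["Statistics"]["DataScannedInBytes"]


def looks_pruned(narrow_response: dict, wide_response: dict) -> bool:
    """True if the query with the narrower partition filter scanned less data."""
    return data_scanned_bytes(narrow_response) < data_scanned_bytes(wide_response)


# Hypothetical responses, trimmed to the only fields this sketch uses.
narrow = {"QueryExecution": {"Statistics": {"DataScannedInBytes": 1_200_000}}}
wide = {"QueryExecution": {"Statistics": {"DataScannedInBytes": 48_000_000}}}

print(looks_pruned(narrow, wide))
```

If both queries report the same number of bytes even though the partition filter differs, that suggests the filter was not used for pruning.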

Yes, it's also mentioned in the documentation:
When Athena runs a query on a partitioned table, it checks to see if any partitioned columns are used in the WHERE clause of the query. If partitioned columns are used, Athena requests the AWS Glue Data Catalog to return the partition specification matching the specified partition columns. The partition specification includes the LOCATION property that tells Athena which Amazon S3 prefix to use when reading data. In this case, only data stored in this prefix is scanned. If you do not use partitioned columns in the WHERE clause, Athena scans all the files that belong to the table's partitions.
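To make the quoted behaviour concrete, here is a small Python model of it applied to the query from the question. It is not how Athena is implemented; it only mimics the idea of matching the WHERE clause against partition values and reading just the matching S3 prefixes. The catalog contents (every day from 01 to 31 for July and August 2020) are a made-up assumption.

```python
from itertools import product


def matches(p: dict) -> bool:
    """The WHERE clause from the question, evaluated on partition values only."""
    return (
        (p["year"] == "2020" and p["month"] == "08" and p["day"] in ("10", "11", "12"))
        or (p["year"] == "2020" and p["month"] == "07" and p["day"] in ("29", "30", "31"))
    )


def prefix(p: dict) -> str:
    """S3 prefix for a partition, following the layout given in the question."""
    return f"s3://mybucket/year={p['year']}/month={p['month']}/day={p['day']}/"


# Hypothetical catalog: one partition per day for July and August 2020.
catalog = [
    {"year": "2020", "month": month, "day": f"{day:02d}"}
    for month, day in product(("07", "08"), range(1, 32))
]

selected = [prefix(p) for p in catalog if matches(p)]
print(len(catalog), len(selected))
```

In this model only the six matching prefixes out of 62 would be read, and IN behaves exactly like the equivalent chain of OR conditions.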

Related

BigQuery - Max Partiton Date for a Custom Partitioned Table

Is there a metadata operation that returns the maximum partition date/timestamp in use (for a custom partitioned table, not ingestion-time partitioning), so that I do not need to scan the whole table with the MAX function? Or is there some other clever SQL way? Our source table is very large and receives a fresh snapshot of data most days, generally for current_date()-1, but I cannot rely on that. I need a query that tells me the max partition in use without costing the earth on a large table. Thoughts?
SELECT MAX(custom_partition_field) FROM Y
#legacySQL
SELECT MAX(partition_id)
FROM [project:dataset.your_table$__PARTITIONS_SUMMARY__]
It is documented at Listing partitions in partitioned tables
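One thing to watch when taking MAX over partition IDs as strings: if the summary also contains non-date IDs (this sketch assumes special values such as `__NULL__` or `__UNPARTITIONED__` may appear), they sort after any digit string and would win the comparison. A minimal Python sketch of picking the latest real date from a list of IDs, assuming daily partitions with IDs in YYYYMMDD form; the function name and sample IDs are hypothetical:

```python
from datetime import date, datetime
from typing import Iterable, Optional


def max_partition_date(partition_ids: Iterable[Optional[str]]) -> Optional[date]:
    """Return the latest YYYYMMDD partition ID as a date, skipping anything else."""
    dates = []
    for pid in partition_ids:
        try:
            dates.append(datetime.strptime(pid, "%Y%m%d").date())
        except (ValueError, TypeError):
            # Not a daily partition ID (special marker or missing value): ignore it.
            continue
    return max(dates) if dates else None


ids = ["20200129", "__NULL__", "20200130", "__UNPARTITIONED__"]
print(max(ids))                 # plain string max picks a marker, not a date
print(max_partition_date(ids))  # latest real partition date
```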

Redshift Spectrum Query - Request ran out of memory in the S3 query layer

I am trying to execute a query with grouping on 26 columns. Data is stored in S3 in parquet format partitioned by day. Redshift Spectrum query is returning below error. I am not able to find any relevant documentation in aws regarding this.
Request ran out of memory in the S3 query layer
Total Number of rows in table : 770 Million
Total size of table in Parquet format : 45 GB
Number of records in each partition : 4.2 Million
Redshift configuration : Single node dc2.xlarge
Attached is the table ddl
Try declaring the text columns in this table as VARCHAR rather than STRING. Also make sure to use the minimum possible VARCHAR size for the column to reduce the memory required by the GROUP BY.
Also, two further suggestions:
Recommend always using at least 2 nodes of Redshift. This gives you a free leader node and allows your compute nodes to use all their RAM for query processing.
Grouping by so many columns is an unusual query pattern. If you are looking for duplicates in the table consider hashing the columns into a single value and grouping on that. Here's an example:
SELECT MD5(ws_sold_date_sk
||ws_sold_time_sk
||ws_ship_date_sk
||ws_item_sk
||ws_bill_customer_sk
||ws_bill_cdemo_sk
||ws_bill_hdemo_sk
||ws_bill_addr_sk
||ws_ship_customer_sk
||ws_ship_cdemo_sk
||ws_ship_hdemo_sk
||ws_ship_addr_sk
||ws_web_page_sk
||ws_web_site_sk
||ws_ship_mode_sk)
, COUNT(*)
FROM spectrum.web_sales
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
;
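A caveat on hashing concatenated columns: without a separator between values, different rows can produce the same concatenated string and therefore the same hash. In Redshift, concatenating a NULL with || also yields NULL, so rows with NULLs need separate handling. The Python sketch below only illustrates the separator issue; the rows and the `|` separator are made-up examples, not part of the original query.

```python
import hashlib


def naive_key(row: tuple) -> str:
    """MD5 of the values concatenated with no separator (as in the query above)."""
    return hashlib.md5("".join(str(v) for v in row).encode()).hexdigest()


def delimited_key(row: tuple, sep: str = "|") -> str:
    """MD5 of the values joined with a separator that cannot occur in the values."""
    return hashlib.md5(sep.join(str(v) for v in row).encode()).hexdigest()


row_a = (1, 23, 4)   # concatenates to "1234"
row_b = (12, 3, 4)   # also concatenates to "1234"

print(naive_key(row_a) == naive_key(row_b))          # distinct rows, same hash
print(delimited_key(row_a) == delimited_key(row_b))  # separator keeps them apart
```

In SQL the same idea would mean adding a delimiter between the columns inside the MD5 call.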

BigQuery - What is the difference between Wildcard and partitioned tables [duplicate]

I am trying to understand whether there is a difference in BigQuery (in cost or in query capabilities, for example) between:
Create one table per day (like my_table_2018_02_06)
Create a time partitioned table (my-table with time partition by day).
Thanks !
Short explanation: querying multiple tables using Wildcard Tables was the proposed alternative when BigQuery did not have a partitioning mechanism available. The natural evolution was to add the Partitioned Table feature, and there is currently an alpha release of column-based time partitioning, i.e. letting the user define which column (of DATE or TIMESTAMP data type) is used for the partitioning.
So BigQuery engineers are currently working on adding new features to table partitioning rather than to the legacy Wildcard Tables approach, and I'd suggest that you work with Partitioned Tables.
Long explanation: you are comparing two approaches that in fact are used with the same purpose, but which have different implications:
Wildcard Tables: some time ago, when table partitioning was not a feature supported by Big Query, Wildcard Tables was the way to query multiple tables using concise SQL queries. A Wildcard Table represents the union of all the tables that match the wildcard expression specified in the SQL statement. However, Wildcard Tables have some limitations, such as:
Do not support views.
Do not support cached results (queries containing wildcard tables are billed every time they are run, even if the "cached results" option is checked).
Only work with native BigQuery storage (cannot work with external tables [Bigtable, Storage or Drive]).
Only available in standard SQL.
Partitioned Tables: these are unique tables that are divided into segments, split by date. There is a lot of documentation regarding how to work with Partitioned Tables, and regarding the pricing, each partition in a Partitioned Table is considered an independent entity, so if a partition was not updated for the last 90 days, this data will be considered long-term and therefore will be billed with the appropriate discount (as would happen with a normal table). Finally, Partitioned Tables are here to stay, so there are more incoming features to them, such as column-based partitioning, which is currently in alpha, and you can follow its status in this Public Issue Tracker post. On the other hand, there are also some current limitations to be considered:
Maximum of 2500 partitions per Partitioned Table.
Maximum of 2000 partition updates per table per day.
Maximum of 50 partition updates every 10 seconds.
So in general, it would be advisable to work with Partitioned Tables over multiple tables using Wildcard Tables. However, you should always consider your use case and see which one of the possibilities meets your requirements better.
One thing to add to your decision criteria here is caching and usage of legacy vs standard SQL.
Since the syntax in standard SQL for selecting multiple tables uses a wild card there is no way for the query result to be cached.
Interestingly, the query result would have been cached if legacy SQL was used. Just converting the query to standard SQL would disable caching.
This may be important to consider, at least in some cases more than others.
Thank you,
Hazem
Not exactly a time partition, but one can get the best of both worlds: wildcard "partitions" and real partitions, to slice the data even further. Below is an example where we first use the date suffix to select only the table holding data for a particular date, then use actual partitioning within the table to limit the amount of data scanned even further.
Create the first partitioned table with a date suffix
CREATE TABLE `test_2021-01-05` (x INT64, y INT64)
PARTITION BY RANGE_BUCKET(y, GENERATE_ARRAY(0, 500, 1));
insert `test_2021-01-05` (x,y) values (5,1);
insert `test_2021-01-05` (x,y) values (5,2);
insert `test_2021-01-05` (x,y) values (5,3);
Create the second partitioned table with a date suffix
CREATE TABLE `test_2021-01-04` (x INT64, y INT64)
PARTITION BY RANGE_BUCKET(y, GENERATE_ARRAY(0, 500, 1));
insert `test_2021-01-04` (x,y) values (4,1);
insert `test_2021-01-04` (x,y) values (4,2);
Select all the data from both tables using wildcard notation; 80B is the whole test set
select * from `test_*`
-- 80B, all the data
Just select data from one table, which is like partitioning on date
select * from `test_*`
where _TABLE_SUFFIX = "2021-01-05"
-- 48B
Select data from one table (the date I am interested in) and from only one partition
select * from `test_*`
where _TABLE_SUFFIX = "2021-01-05"
and y = 1
-- 16B, that was the goal
Select data just from one partition from all the tables
select * from `test_*`
where y = 1
-- 32B, only one partition from both tables
The ultimate goal was to limit the data scanned when reading, thus reducing the cost and increasing performance.
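The byte counts in the comments above can be reproduced with simple arithmetic. The Python sketch below assumes each INT64 value counts as 8 bytes and that only rows in the selected tables and partitions are counted; it ignores any minimum charge per query or table. It models the example data, not BigQuery itself.

```python
INT64_BYTES = 8  # assumed size of one INT64 value

# Rows (x, y) per date-suffixed table, copied from the INSERT statements above.
tables = {
    "2021-01-05": [(5, 1), (5, 2), (5, 3)],
    "2021-01-04": [(4, 1), (4, 2)],
}


def scanned_bytes(suffix=None, y=None) -> int:
    """Bytes read when filtering by table suffix and/or by partition column y."""
    total = 0
    for table_suffix, rows in tables.items():
        if suffix is not None and table_suffix != suffix:
            continue  # whole table skipped via _TABLE_SUFFIX
        for row in rows:
            if y is not None and row[1] != y:
                continue  # partition skipped via the filter on y
            total += len(row) * INT64_BYTES
    return total


print(scanned_bytes())                        # all tables, all partitions
print(scanned_bytes(suffix="2021-01-05"))     # one table
print(scanned_bytes(suffix="2021-01-05", y=1))  # one table, one partition
print(scanned_bytes(y=1))                     # one partition in every table
```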

How to efficiently get the partitions of a BigQuery table

Is there a way of getting a list of the partitions in a BigQuery date-partitioned table? Right now the best way I have found of doing this is using the _PARTITIONTIME meta-column, but this needs to scan all the rows in all the partitions. Is there an equivalent to a show partitions call or maybe something in the bq command-line tool?
To list partitions in a table, query the table's summary partition by using the partition decorator separator ($) followed by __PARTITIONS_SUMMARY__. For example, the following legacy SQL query retrieves the partition IDs for table1:
SELECT partition_id from [mydataset.table1$__PARTITIONS_SUMMARY__];