Performing Date Math on Hive Partition Columns - hive

My data is partitioned by day in the standard Hive format:
/year=2020/month=10/day=01
/year=2020/month=10/day=02
/year=2020/month=10/day=03
/year=2020/month=10/day=04
...
I want to query all data from the last 60 days, using Amazon Athena (i.e. Presto). I want this query to use the partition columns (year, month, day) so that only the necessary partition files are scanned. Assuming I can't change the file partition format, what is the best approach to this problem?

You don't have to use year, month, day as the partition keys for the table. You can have a single partition key called date and add the partitions like this:
ALTER TABLE the_table ADD
PARTITION (`date` = '2020-10-01') LOCATION 's3://the-bucket/data/year=2020/month=10/day=01'
PARTITION (`date` = '2020-10-02') LOCATION 's3://the-bucket/data/year=2020/month=10/day=02'
...
With this setup you can even set the type of the partition key to date:
PARTITIONED BY (`date` date)
Now you have a table with a date column typed as a DATE, and you can use any of the date and time functions to do calculations on it.
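For example, the original last-60-days query can then be written directly against the partition key. This is only a sketch, assuming the table is named the_table; the column is double-quoted because date can clash with the keyword in queries:
-- only the partitions from the last 60 days should need to be scanned
SELECT *
FROM the_table
WHERE "date" >= date_add('day', -60, current_date);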
What you won't be able to do with this setup is use MSCK REPAIR TABLE to load partitions, but you really shouldn't do that anyway – it's extremely slow and inefficient, and really something you should only do when you have a couple of partitions to load into a new table.

An alternative to the approach proposed by Theo is to use the following syntax, e.g.:
select ... from my_table where year||month||day between '20200630' and '20201010'
This works when the columns year, month, and day are stored as zero-padded strings, as in the layout above. It's particularly useful for querying across months.

Related

Creating a daily Oracle partition

Creating an Oracle partition for a table for every day.
ALTER TABLE TAB_123 ADD PARTITION PART_9999 VALUES LESS THAN ('0001') TABLESPACE TS_1
Here I am getting an error because the value '0001' is lower than the boundary of the last existing partition.
You can have Oracle automatically create partitions by using the PARTITION BY RANGE option.
Sample DDL, assuming that the partition key is column my_date_column :
create table TAB_123
( ... )
partition by range(my_date_column) interval(NUMTODSINTERVAL(1, 'day')) /* use NUMTOYMINTERVAL for month/year intervals */
( partition p_first values less than (to_date('2010-01-01', 'yyyy-mm-dd')) tablespace ts_1)
;
With this setup in place, Oracle will, if needed, create a partition on the fly when you insert data into the table. Note that interval partitioning still requires at least one initial partition, as shown above (p_first), which anchors the interval boundaries.
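As a rough sketch (the id column here is hypothetical; only my_date_column comes from the DDL above), an ordinary insert with a new date is enough to get the daily partition created automatically:
-- Oracle creates the partition covering 2020-10-05 on the fly
INSERT INTO TAB_123 (id, my_date_column)
VALUES (1, DATE '2020-10-05');
COMMIT;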
This naming convention (last digit of year plus day number) won't support holding more than ten years' worth of data. Maybe you think that doesn't matter, but I know databases which are well into their second decade. Be optimistic!
Also, that key is pretty much useless for querying. Most queries against partitioned tables want to get the benefit of partition elimination. But that only works if the query uses the same value as the partition key. Developers really won't want to be casting a date to YDDD format every time they write a select on the table.
So. Use an actual date for defining the partition key and hence range. Also for naming the partition if it matters that much.
ALTER TABLE TAB_123
ADD PARTITION P20200101 VALUES LESS THAN (date '2020-01-02') TABLESPACE TS_1
/
Note that the range is defined by less than the next day. Otherwise the date of the partition name won't align with the date of the records in the actual partition.
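Queries then prune with plain date predicates and no casting. A sketch, assuming the partition key column is my_date_column as in the answer above:
-- partition elimination: only partition P20200101 needs to be read
SELECT *
FROM TAB_123
WHERE my_date_column >= DATE '2020-01-01'
AND my_date_column < DATE '2020-01-02';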

BigQuery - Max Partiton Date for a Custom Partitioned Table

Is there a metadata operation that can give me the max partitioned date/timestamp in use (for a custom partitioned table, not ingest-time partitioning), such that I do not need to scan the whole table using the MAX function? Or some other clever SQL way? Our source table is very large, and it gets a fresh snapshot of data most days - but then that data is generally for current_date()-1... All in all, I can't rely on much except for a query that tells me the max partition in use and doesn't cost the earth for a large table. Thoughts? That is, I want to avoid a full scan like:
SELECT MAX(custom_partition_field) FROM Y
#legacySQL
SELECT MAX(partition_id)
FROM [project:dataset.your_table$__PARTITIONS_SUMMARY__]
It is documented at Listing partitions in partitioned tables
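If you prefer standard SQL, the same metadata can be read from the INFORMATION_SCHEMA.PARTITIONS view; a sketch, with project, dataset and your_table as placeholders:
#standardSQL
SELECT MAX(partition_id) AS max_partition
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'your_table'
AND partition_id NOT IN ('__NULL__', '__UNPARTITIONED__');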

Insert to clustered hive table from spark

I'm trying to do some performance optimization on the data storage. The idea is to use Hive's bucketing/clustering to bucket the available devices (based on the column id). My current approach is to insert the data from an external table based on Parquet files into the bucketed table. As a result the bucketing is applied.
INSERT INTO TABLE bucketed_table PARTITION (year, month, day)
SELECT id, feature, value, year, month, day
FROM parquet_table ;
I would like to get rid of this intermediate step by ingesting the data into that table directly from PySpark 2.1.
Executing the same statement using Spark SQL leads to different results. Adding the CLUSTER BY clause
INSERT INTO TABLE bucketed_table PARTITION (year, month, day)
SELECT id, feature, value, year, month, day
FROM parquet_table CLUSTER BY id;
still leads to different output files.
This leads to two questions:
1) What is the right way to insert into a clustered Hive table from Spark?
2) Does writing with a CLUSTER BY statement enable the benefits of the Hive metastore on the data?
I don't believe that it's supported as of yet. I'm currently using Spark 2.3 and it fails, as opposed to succeeding and corrupting your data store.
Check out the Jira ticket here if you want to track its progress.

BigQuery table partitioning by month

I can't find any documentation relating to this. Is time_partitioning_type=DAY the only way to partition a table in BigQuery? Can this parameter take any other values besides a date?
Note that even if you partition on day granularity, you can still write your queries to operate at the level of months using an appropriate filter on _PARTITIONTIME. For example,
#standardSQL
SELECT * FROM MyDatePartitionedTable
WHERE DATE_TRUNC(EXTRACT(DATE FROM _PARTITIONTIME), MONTH) = '2017-01-01';
This selects all rows from January 2017.
Unfortunately not. BigQuery currently only supports date-partitioned tables.
https://cloud.google.com/bigquery/docs/partitioned-tables
BigQuery offers date-partitioned tables, which means that the table is divided into a separate partition for each date
It seems like this would work:
#standardSQL
CREATE OR REPLACE TABLE `My_Partition_Table`
PARTITION BY event_month
OPTIONS (
description="this is a table partitioned by month"
) AS
SELECT
DATE_TRUNC(DATE(some_event_timestamp), month) as event_month,
*
FROM `TableThatNeedsPartitioning`
For those who run into the error "Too many partitions produced by query, allowed 4000, query produces at least X partitions", caused by BigQuery's 4,000-partition limit (as of 2023-02), you can do the following:
CREATE OR REPLACE TABLE `My_Partition_Table`
PARTITION BY DATE_TRUNC(date_column, MONTH)
OPTIONS (
description="This is a table partitioned by month"
) AS
-- Your query
Basically, take david-salmela's answer above, but move the DATE_TRUNC part to the PARTITION BY clause.
It seems to work exactly like PARTITION BY date_column in terms of querying the table (e.g. WHERE date_column = "2023-02-20"), but my understanding is that you always retrieve data for a whole month in terms of cost.
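As a usage sketch against a table partitioned this way, an ordinary filter on date_column still prunes to the relevant monthly partition:
#standardSQL
-- only the 2023-02 partition should be scanned
SELECT *
FROM `My_Partition_Table`
WHERE date_column BETWEEN '2023-02-01' AND '2023-02-28';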

Is hive partitioning hierarchical in nature?

Say we have a table partitioned as:-
CREATE EXTERNAL TABLE MyTable (
col1 string,
col2 string,
col3 string
)
PARTITIONED BY(year INT, month INT, day INT, hour INT, combination_id BIGINT);
Now obviously year is going to store the year value (e.g. 2016), month will store the month value (e.g. 7), day will store the day (e.g. 18), and hour will store the hour in 24-hour format (e.g. 13). And combination_id is going to be a combination of the padded values of all these (single-digit values are padded with 0 on the left). So in this case, for example, the combination_id is 2016071813.
So we fire a query (let's call it Query A):-
select * from mytable where combination_id = 2016071813
Now Hive doesn't know that combination_id is actually a combination of year, month, day, and hour. So will this query not take proper advantage of partitioning?
In other words, if I have another query, call it Query B, will this be more optimal than Query A, or is there no difference?:-
select * from mytable where year=2016 and month=7 and day=18 and hour=13
If the Hive partitioning scheme is really hierarchical in nature, then Query B should be better from a performance point of view, is what I am thinking. Actually I want to decide whether to get rid of combination_id altogether from the partitioning scheme if it is not contributing to better performance at all.
The only real advantage of using combination_id is being able to use the BETWEEN operator in a select:-
select * from mytable where combination_id between 2016071813 and 2016071823
But if this is not going to take advantage of the partitioning scheme, it is going to hamper performance.
Yes. Hive partitioning is hierarchical.
You can simply check this by printing the partitions of the table using below query.
show partitions MyTable;
Output:
year=2016/month=5/day=5/hour=5/combination_id=2016050505
year=2016/month=5/day=5/hour=6/combination_id=2016050506
year=2016/month=5/day=5/hour=7/combination_id=2016050507
In your scenario, you don't need to specify combination_id as a partition column if you are not using it for querying.
You can partition either by
Year, month, day, hour columns
or
combination_id only
Partitioning by multiple columns helps performance in grouping operations.
Say you want to find the maximum of col1 for the month of March in the years 2015 and 2016.
Hive can easily fetch the records by going to the specific year partitions (year=2015, year=2016) and the month partition (month=3), as in the sketch below.
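A sketch of that query against the table above:
-- only the month=3 partitions under year=2015 and year=2016 are scanned
SELECT MAX(col1)
FROM MyTable
WHERE year IN (2015, 2016)
AND month = 3;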