Create hive partition based on time zone - hive

I'm trying to materialize a hive table based on files that are stored as parquet in GCS, with a path like gs://abc/dt=02-02-2019/hr=02 (physical partitioning based on UTC).
Now I want to create two hive tables where the logical partition is based on a time zone, say one for UTC and one for CET. How can I partition such that the date- and hour-based partitions pick the dt and hr values according to the time zone? It would also be great if it could accommodate daylight saving time, etc.
I am using airflow to create the external hive tables.

There is a blog post that explains this well: https://medium.com/udemy-engineering/supporting-multiple-time-zones-on-hive-with-single-data-source-b884cba46451
The basic idea is to store the data in UTC and partition it by UTC hour. That way you can have two hive tables: one points at the data as-is, which is UTC.
But for, say, a PT hive table, you would point the partition labelled hour 11 at the UTC hour-18 data (when PDT, UTC-7, is in effect), so a mapping conversion happens for each partition.
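A minimal sketch of that mapping in Hive DDL, under the assumptions that the bucket layout is the one from the question and that the table and column names (`events_utc`, `events_pt`, `user_id`, `payload`) are placeholders. Both tables read the same parquet files; only the partition labels differ.

```sql
-- UTC table: partitions line up 1:1 with the physical layout.
CREATE EXTERNAL TABLE events_utc (user_id STRING, payload STRING)
PARTITIONED BY (dt STRING, hr STRING)
STORED AS PARQUET
LOCATION 'gs://abc/';

ALTER TABLE events_utc ADD IF NOT EXISTS
  PARTITION (dt='02-02-2019', hr='18')
  LOCATION 'gs://abc/dt=02-02-2019/hr=18';

-- PT table: same files, shifted logical partition.
-- On 02-02-2019 PST (UTC-8) applies, so UTC hour 18 is PT hour 10.
CREATE EXTERNAL TABLE events_pt (user_id STRING, payload STRING)
PARTITIONED BY (dt STRING, hr STRING)
STORED AS PARQUET
LOCATION 'gs://abc/';

ALTER TABLE events_pt ADD IF NOT EXISTS
  PARTITION (dt='02-02-2019', hr='10')
  LOCATION 'gs://abc/dt=02-02-2019/hr=18';
```

Daylight saving is handled by whatever computes the offset per partition; since the tables are created from Airflow, the DAG can derive each partition's offset with a time-zone-aware library (for example Python's zoneinfo) before emitting the ALTER TABLE statements.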

Related

Drop part of partitions in Hive SQL

I have an external hive table partitioned on the date column. Each date has data for multiple AB-test experiments.
I need to create a job that drops experiments which ended more than 6 months ago.
Dropping data in an external, partitioned hive table drops the entire partition, in this case the data for one whole date. Is there a way to drop only part of a date?
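You cannot delete individual rows in place, but you can rewrite a partition without the ended experiments. A sketch, assuming a hypothetical schema with columns `experiment_id`, `metric`, and `end_date` on a table `ab_results`:

```sql
-- Rewrite one date partition, keeping only experiments that
-- ended within the last 6 months (hypothetical schema).
INSERT OVERWRITE TABLE ab_results PARTITION (dt='2019-02-02')
SELECT experiment_id, metric, end_date
FROM ab_results
WHERE dt = '2019-02-02'
  AND end_date >= add_months(current_date, -6);
```

A scheduled job (e.g. in Airflow) would loop this over each affected date partition; Hive stages the SELECT output before overwriting, so reading and overwriting the same partition works.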

Schedule Query Partition Table Hourly

I have a scheduled query which runs hourly. I want to partition the table hourly, so in the destination I provided mytable_{run_time|"%Y%m%d%H"}, but this creates a new table for every run in my BigQuery dataset. When I change the destination to mytable_{run_time|"%Y%m%d"}, it partitions the data correctly based on date.
How do I enable hourly partitioning in BigQuery?
What you are doing is table sharding, which works but is less performant and involves more management. In theory it acts similarly to partitioning, but it is not the same. With the format mytable_{run_time|"%Y%m%d"} you are likely inserting multiple hours into the same daily table, which, depending on the table definition, may itself be partitioned within a single day.
You will want to define the partition in the creation of the table see below:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables#create_a_time-unit_column-partitioned_table
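Concretely, instead of the sharded mytable_YYYYMMDDHH names, you create one table with hourly time-unit partitions and point the scheduled query at it. A sketch (the column name `ts` and the dataset name are assumptions):

```sql
-- One table with hourly partitions, not one table per hour.
CREATE TABLE IF NOT EXISTS mydataset.mytable
(
  ts      TIMESTAMP NOT NULL,
  payload STRING
)
PARTITION BY TIMESTAMP_TRUNC(ts, HOUR);
```

The scheduled query then uses a plain destination of mytable (no run_time suffix) with append disposition, and BigQuery routes each row to the right hourly partition based on its ts value.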

Partition expiry countdown for date/time field based partition

As of May 2021, the Google BigQuery documentation does not clearly state when the partition-expiry countdown for a time/date-column-partitioned table starts. Is the date/time of the partition itself the start of the expiry countdown, or does the countdown start when the partition is created?
For example, if a table like following is created
CREATE TABLE IF NOT EXISTS `project_id.dataset_name.table_name`
(
dateTime TIMESTAMP NOT NULL
, trainName STRING
, fleet STRING
, customer STRING
)
PARTITION BY DATE(dateTime)
OPTIONS (
partition_expiration_days = 3
)
So, if the table is created on, say, the 5th of the month, but data for the 1st of that month (in the dateTime field) is inserted, will that data already be expired upon insertion? Or will it expire on the 8th of the same month?
For ingestion based partitioning this confusion does not arise as the ingestion timestamp itself is a partition timestamp.
References:
Create a time-unit column-partitioned table
Updating default partition expiration times
Use the expiration settings to remove unneeded tables and partitions
That data will expire on insertion. The date/time of the partition itself is the start of the expiry countdown, so the data for the 1st of the month will not be present in the table.
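One way to see this behaviour is to list the surviving partitions via INFORMATION_SCHEMA, using the project/dataset/table names from the example above:

```sql
-- Partitions that still exist for the example table. Rows whose
-- DATE(dateTime) is more than 3 days in the past never appear here,
-- because their partition has already expired.
SELECT partition_id, total_rows
FROM `project_id.dataset_name.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'table_name'
ORDER BY partition_id;
```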

Split hive partition to create multiple partition

I have an external hive table which is partitioned on load_date (DD-MM-YYYY). However, the very first period, say 01-01-2000, has all the data from 1980 to 2000. How can I further partition the old data by year while keeping the existing data (load dates after 01-01-2000) available as-is?
First load the data for '01-01-2000' into a staging table, then use a dynamic-partition insert to write it back partitioned by year. This should solve your problem.
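A sketch of that backfill with Hive dynamic partitioning; the staging table `staging_2000`, the target table `sales_by_year`, and the column names are all assumptions:

```sql
-- Allow fully dynamic partition values for the backfill.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Re-insert the oversized 01-01-2000 load, partitioned by year.
-- The dynamic partition column (yr) must come last in the SELECT.
INSERT OVERWRITE TABLE sales_by_year PARTITION (yr)
SELECT s.id, s.amount, year(s.event_date) AS yr
FROM staging_2000 s;
```

The partitions for load dates after 01-01-2000 are untouched; only the first period is rewritten into the yearly layout.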

Sql Server 2008 partition table based on insert date

My question is about table partitioning in SQL Server 2008.
I have a program that loads data into a table every 10 mins or so. Approx 40 million rows per day.
The data is bcp'ed into the table and needs to be able to be loaded very quickly.
I would like to partition this table based on the date the data is inserted into the table. Each partition would contain the data loaded in one particular day.
The table should hold the last 50 days of data, so every night I need to drop any partitions older than 50 days.
I would like to have a process that aggregates data loaded into the current partition every hour into some aggregation tables. The summary will only ever run on the latest partition (since all other partitions will already be summarised) so it is important it is partitioned on insert_date.
Generally when querying the data, the insert date is specified (or multiple insert dates). The detailed data is queried by drilling down from the summarised data and as this is summarised based on insert date, the insert date is always specified when querying the detailed data in the partitioned table.
Can I create a column "insert_date" in the table that defaults to GETDATE() and then partition on it somehow?
OR
I can create a column "insert_date" in the table and put a hard-coded value of today's date.
What would the partition function look like?
Would separate tables and a partitioned view be better suited?
I have tried both, and even though I think partitioned tables are cooler, after trying to teach others how to maintain the code afterwards it just wasn't justified. In that scenario we used a hard-coded date field that was set in the insert statement.
Now I use separate tables (31 days / 31 tables) plus an aggregation table, and there is an ugly UNION ALL query that joins together the monthly data.
Advantage: super simple SQL, simple C# code for bcp, and nobody has complained about complexity.
But if you have the infrastructure and a gaggle of .NET/SQL gurus, I would choose the partitioning strategy.
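For completeness, the partition-function approach the question asks about would look roughly like this. It is a sketch: the boundary dates, table name, and constraint names are placeholders, and everything is mapped to PRIMARY rather than dedicated filegroups.

```sql
-- A DATE column defaulting to the load date drives the partitioning.
CREATE PARTITION FUNCTION pf_insert_date (DATE)
AS RANGE RIGHT FOR VALUES ('2019-02-01', '2019-02-02', '2019-02-03');

CREATE PARTITION SCHEME ps_insert_date
AS PARTITION pf_insert_date ALL TO ([PRIMARY]);

CREATE TABLE dbo.Loads
(
    load_id     BIGINT IDENTITY NOT NULL,
    insert_date DATE NOT NULL
        CONSTRAINT DF_Loads_insert_date DEFAULT (CAST(GETDATE() AS DATE)),
    payload     VARCHAR(100)
) ON ps_insert_date (insert_date);

-- Nightly maintenance: add tomorrow's boundary, and merge out
-- the oldest boundary once its data is older than 50 days.
ALTER PARTITION FUNCTION pf_insert_date() SPLIT RANGE ('2019-02-04');
ALTER PARTITION FUNCTION pf_insert_date() MERGE RANGE ('2019-02-01');
```

The DEFAULT constraint supplies the value when the bulk load omits the column, so the bcp side stays simple; the hourly aggregation job only ever has to scan the newest partition.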