I have an external table mytable. We have a job scheduled in Airflow that picks up an SQL file and executes it once a day.
On a daily basis, I need to add a partition to the table corresponding to that day.
So for 2018-09-27, I would need to execute
ALTER TABLE MYTABLE ADD PARTITION(year=2018,month=9,day=27,ts=1538006400) location '/path/to/data/20180927/'
I can get the year, month, and day using SELECT year(current_date), month(current_date), and day(current_date), and the timestamp using SELECT unix_timestamp(current_date, 'yyyy-MM-dd'), but how would I write an SQL query that generates the entire ALTER TABLE... ADD PARTITION statement above?
Scripting it would be easy, but I need this done in SQL alone.
Hive has a hiveconf variable for the current date, current_date, which you can use as follows:
ALTER TABLE MYTABLE ADD PARTITION (ts= '${hiveconf:current_date}')
The current_date function is available in Hive 1.2.0 and higher.
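To assemble the full statement, one option is Hive variable substitution. A minimal sketch, assuming the Airflow job computes the date parts and passes them in when it invokes hive on the SQL file (the variable names here are hypothetical):
-- Hypothetical variables supplied at invocation time, e.g.:
--   hive --hivevar year=2018 --hivevar month=9 --hivevar day=27 \
--        --hivevar ts=1538006400 --hivevar dt=20180927 -f add_partition.sql
ALTER TABLE MYTABLE ADD IF NOT EXISTS
PARTITION (year=${hivevar:year}, month=${hivevar:month}, day=${hivevar:day}, ts=${hivevar:ts})
LOCATION '/path/to/data/${hivevar:dt}/';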
If the execution date is today and the current month is July, I want to append the month to the table name, e.g. category_table202107. Is it possible to include the execution month in the table name when running CREATE TABLE with Athena's Presto SQL?
No, it is not possible to 'calculate' the name of a table in the CREATE TABLE command.
Instead, you should do that in whatever program sends the CREATE TABLE command to Amazon Athena.
I'm working on BigQuery and have created a view using multiple tables. Each day, the data needs to be synced to multiple platforms. I need to add a date or some other field via SQL through which I can identify which rows were added to the view each day, or which rows were updated, so that I only carry that data forward instead of syncing everything every day. The best approach I can think of is to record the current date wherever a row is updated, but that date needs to stay constant until a further update happens for that record.
Example (sample data): say we get the view T1 on 1st September and T2 on the 2nd. I need to spot only ID 2 for 1st September and IDs 3, 4, and 5 for 2nd September. Note: no such date column exists. I need help creating such a column, or any other approach, to identify which rows get updated or added daily.
You can create a BigQuery scheduled query with a daily frequency (every 24 hours) using the INSERT statement below:
INSERT INTO dataset.T1
SELECT *
FROM dataset.T2
WHERE date > (SELECT MAX(date) FROM dataset.T1);
The table your data is streamed into (in your case: the sample data table) needs to be configured as a partitioned table. Use "Partition by ingestion time" so that you don't need to handle the date yourself.
(Screenshot: partitioning configuration in the BigQuery UI)
After you have recreated the table, append your existing data to it using the write preference options in BigQuery (append) and run the job.
Then you create a view based on that table with:
SELECT * EXCEPT (rank)
FROM (
  SELECT
    *,
    -- PARTITION BY, not GROUP BY: keep the most recently ingested row per invoice
    ROW_NUMBER() OVER (PARTITION BY invoice_id ORDER BY _PARTITIONTIME DESC) AS rank
  FROM `your_dataset.your_sample_data_table`
)
WHERE rank = 1
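Wrapped in DDL, a minimal sketch (the view name here is a placeholder):
-- Hypothetical view name; the table name is from the query above.
CREATE OR REPLACE VIEW `your_dataset.your_deduplicated_view` AS
SELECT * EXCEPT (rank)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY invoice_id ORDER BY _PARTITIONTIME DESC) AS rank
  FROM `your_dataset.your_sample_data_table`
)
WHERE rank = 1;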
From then on, always query through the view.
This is an extension of a previous question I asked: How to compare two columns with different data type groups
We are exploring the idea of changing the metadata on the table as opposed to performing a CAST operation on the data in SELECT statements. Changing the metadata in the MySQL metastore is easy enough. But, is it possible to have that metadata change applied to partitions (they are daily)? Otherwise, we might be stuck with current and future data being of type BIGINT while the historical is STRING.
Question: Is it possible to change partition meta data in HIVE? If yes, how?
You can change the partition column type using this statement:
alter table {table_name} partition column ({column_name} {column_type});
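For example, a sketch with hypothetical names, changing a partition column declared as STRING to BIGINT (this changes only the metadata, not the data files):
-- Hypothetical: table `mytable` is partitioned by `id`, originally STRING.
ALTER TABLE mytable PARTITION COLUMN (id BIGINT);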
Alternatively, you can re-create the table definition and change all column types using these steps:
Make your table external, so it can be dropped without dropping the data:
ALTER TABLE abc SET TBLPROPERTIES('EXTERNAL'='TRUE');
Drop the table (only the metadata will be removed).
Create EXTERNAL table using updated DDL with types changed and with the same LOCATION.
Recover partitions:
MSCK [REPAIR] TABLE tablename;
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE tablename RECOVER PARTITIONS;
This adds the partition metadata to Hive. See the manual here: RECOVER PARTITIONS
And finally, you can make your table MANAGED again if necessary:
ALTER TABLE tablename SET TBLPROPERTIES('EXTERNAL'='FALSE');
Note: all the commands above should be run through Hive (e.g. in Hue), not directly against the MySQL metastore. The whole sequence is sketched below.
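A minimal sketch putting the steps together, with hypothetical names (table mytable, changing the partition column id from STRING to BIGINT; the columns and location are placeholders):
-- Step 1: make the table external so dropping it keeps the data files.
ALTER TABLE mytable SET TBLPROPERTIES('EXTERNAL'='TRUE');
-- Step 2: drop the table (metadata only; the files stay in place).
DROP TABLE mytable;
-- Step 3: re-create it with the updated type, pointing at the same location.
CREATE EXTERNAL TABLE mytable (col1 STRING)
PARTITIONED BY (id BIGINT)
LOCATION '/path/to/existing/data';
-- Step 4: re-register the existing partition directories.
MSCK REPAIR TABLE mytable;
-- Step 5 (optional): make the table managed again.
ALTER TABLE mytable SET TBLPROPERTIES('EXTERNAL'='FALSE');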
You cannot change the partition column in Hive; in fact, Hive does not support altering partitioning columns.
Refer to: altering partition column type in Hive
You can think of it this way:
- Hive stores the data by creating a folder in HDFS for each partition column value
- So if you try to alter a Hive partition column, you are trying to change the whole directory structure and data of the Hive table, which is not possible
For example, if you have partitioned on year, this is how the directory structure looks:
tab1/clientdata/2009/file2
tab1/clientdata/2010/file3
If you want to change the partition column, you can perform the steps below.
Create another Hive table with the required change to the partition column:
CREATE TABLE new_table (A INT, ...) PARTITIONED BY (B STRING);
Load the data from the previous table:
INSERT INTO new_table PARTITION (B) SELECT A, B FROM prev_table;
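This insert relies on dynamic partitioning; a minimal sketch of the full load, assuming it is not already enabled in your session:
-- Dynamic partition inserts typically require these settings.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- The partition column (B) must come last in the SELECT list.
INSERT INTO new_table PARTITION (B)
SELECT A, B FROM prev_table;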
I am trying to change an existing partition column to another column.
The current workflow I'm using:
Backup the existing data
Create a new table with new partition column
Reload the data into new partitions
My problem:
Since our existing partitioned tables hold a huge amount of data, this approach is costly.
Is there a way to ALTER TABLE and change the partition column to another one?
You cannot avoid the one-time cost of scanning the table, as you can see from the error message generated by this CREATE OR REPLACE TABLE DDL statement:
#standardSQL
CREATE OR REPLACE TABLE `project.dataset.table`
PARTITION BY DATE(ts)
AS
SELECT * FROM `project.dataset.table`
Cannot replace a table with a different partitioning spec. Instead, DROP the table, and then recreate it. New partitioning spec is interval(type:day,field:ts) and existing spec is none
What you can do to save cost is use a WHERE clause to limit the number of partitions you move from the existing table to the new one:
CREATE TABLE project.mydataset.newPartitionTable
PARTITION BY date
OPTIONS (
  partition_expiration_days=365,
  description="Table with a new partition"
) AS
SELECT * FROM `project.dataset.table`
WHERE _PARTITIONTIME >= '2019-01-23 00:00:00'
  AND _PARTITIONTIME <= '2019-01-23 00:00:00'
For example, you could consider not moving your long-term storage, i.e. data that has not been modified in the last 90 days (see this link for more details).
If you want to keep your original table name, you can drop and recreate it with the new partitioning field after the copy, and then use the copy option from the web UI, which is free of charge. The recreate step is sketched below.
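A sketch of that recreate step, with hypothetical column names standing in for the original schema (the data copy itself is then done with the free table-copy operation in the web UI or the bq CLI, not with SQL):
-- Hypothetical schema; replace with the original table's columns.
DROP TABLE `project.dataset.table`;
CREATE TABLE `project.dataset.table` (
  ts TIMESTAMP,        -- the new partitioning source column
  invoice_id STRING    -- placeholder for the remaining columns
)
PARTITION BY DATE(ts);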
I need to prepare a script to extend the partition range if the existing partitions will be exhausted in the next 2-3 months. How do I find the existing table partitions, and can we alter the existing table, or do we need to create a new script?
Any response is appreciated.
How to find the existing table partition
You could either generate the complete table DDL using the DBMS_METADATA package,
or query the USER_TAB_PARTITIONS view to get the table partition information.
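For example (with a hypothetical table name SALES):
-- Full DDL, including the partitioning clause:
SELECT DBMS_METADATA.GET_DDL('TABLE', 'SALES') FROM dual;
-- Or just the partitions and their upper bounds:
SELECT partition_name, high_value, tablespace_name
FROM user_tab_partitions
WHERE table_name = 'SALES'
ORDER BY partition_position;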
To add new partitions to a range-partitioned table, use the ADD PARTITION clause:
ALTER TABLE <table_name>
ADD PARTITION <new_partition>
VALUES LESS THAN (<upper_bound>)
TABLESPACE <tablespace_name>;
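A minimal sketch with hypothetical names, extending a table range-partitioned on a DATE column:
-- Hypothetical: SALES is range-partitioned on sale_date; add a partition for Q3 2020.
ALTER TABLE sales
ADD PARTITION sales_q3_2020
VALUES LESS THAN (DATE '2020-10-01')
TABLESPACE users;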