Hive Update partition vs MSCK Repair - hive

I have a table with thousands of partitions. I want to change all the partition locations to a different cluster.
Ex:
for table test_table and partition day=2021041600
Old location: hdfs://cluster1/dir1/dir2/day=2021041600/<files>
New location: hdfs://cluster2/dir1/dir2/day=2021041600/<files>
I can achieve this in two ways.
We can fetch the list of all the partitions and update the location of every partition one by one.
We can change the base location of the table and run the MSCK REPAIR command on the table.
My question is: which option would be the better approach to take?

The 1st approach will work.
2nd approach (MSCK REPAIR):
MSCK REPAIR will not work if you only change the table location, because the existing partitions stay registered in the metastore with their old locations, which are outside the new table location.
Instead, make the table EXTERNAL, DROP it, CREATE it again with the new location, and run MSCK REPAIR:
alter table test_table SET TBLPROPERTIES('EXTERNAL'='TRUE');
drop table test_table;
create table test_table ... location 'hdfs://cluster2/dir1/dir2';
MSCK REPAIR TABLE test_table;
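For the first approach, the per-partition statements do not have to be written by hand. Below is a minimal Python sketch that turns partition names, in the key=value form printed by SHOW PARTITIONS, into ALTER TABLE ... PARTITION ... SET LOCATION statements. The helper names (partition_spec, relocation_statements) are hypothetical. It assumes that every partition folder sits under the table's base location in the default key=value layout, and that partition values contain no characters that need escaping.

```python
def partition_spec(partition_name):
    """Turn 'day=2021041600' or 'y=2021/m=04' into "day='2021041600'" style."""
    parts = []
    for segment in partition_name.strip("/").split("/"):
        key, _, value = segment.partition("=")
        # Assumption: values are quoted as strings and need no escaping.
        parts.append(f"{key}='{value}'")
    return ", ".join(parts)


def relocation_statements(table, new_base, partition_names):
    """Build one ALTER TABLE ... SET LOCATION statement per partition."""
    base = new_base.rstrip("/")
    return [
        f"ALTER TABLE {table} PARTITION ({partition_spec(name)}) "
        f"SET LOCATION '{base}/{name}';"
        for name in partition_names
    ]


# Partition names as they would appear in SHOW PARTITIONS test_table output.
stmts = relocation_statements(
    "test_table", "hdfs://cluster2/dir1/dir2", ["day=2021041600"]
)
print("\n".join(stmts))
# ALTER TABLE test_table PARTITION (day='2021041600') SET LOCATION 'hdfs://cluster2/dir1/dir2/day=2021041600';
```

The generated statements can be saved to a file and run with whichever Hive client you use.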

Related

Do I need to do msck repair table after alter table?

I have a table partitioned by the column 'date', to which I added a column like so:
ALTER TABLE table_name ADD COLUMNS (message_id_external string) CASCADE;
Do I need to run msck repair table after this, and also analyze table table_name partition(date) compute statistics noscan; ?
No. REPAIR TABLE does not care about columns. It checks that all partitions which are in the metadata exist in HDFS and vice versa; it does not refresh any metadata for existing partitions. You do not need to run it if no partition folders were added to or removed from HDFS. If no partition folders were created or removed, repair will do nothing.
The second command, analyze table table_name partition(date) compute statistics noscan;, will also give you nothing new if you already executed it before adding the column.
ANALYZE with NOSCAN gathers only the number of files and their size.

Is it possible to change partition metadata in HIVE?

This is an extension of a previous question I asked: How to compare two columns with different data type groups
We are exploring the idea of changing the metadata on the table as opposed to performing a CAST operation on the data in SELECT statements. Changing the metadata in the MySQL metastore is easy enough. But, is it possible to have that metadata change applied to partitions (they are daily)? Otherwise, we might be stuck with current and future data being of type BIGINT while the historical is STRING.
Question: Is it possible to change partition meta data in HIVE? If yes, how?
You can change partition column type using this statement:
alter table {table_name} partition column ({column_name} {column_type});
You can also re-create the table definition and change all column types using these steps:
Make your table external, so it can be dropped without dropping the data:
ALTER TABLE abc SET TBLPROPERTIES('EXTERNAL'='TRUE');
Drop the table (only metadata will be removed).
Create an EXTERNAL table using the updated DDL with the types changed and with the same LOCATION.
Recover partitions:
MSCK [REPAIR] TABLE tablename;
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE tablename RECOVER PARTITIONS;
This will add Hive partitions metadata. See manual here: RECOVER PARTITIONS
And finally you can make your table MANAGED again if necessary:
ALTER TABLE tablename SET TBLPROPERTIES('EXTERNAL'='FALSE');
Note: All commands above should be run in HUE, not MySQL.
You cannot change the partition column in Hive; in fact, Hive does not support altering of partitioning columns.
Refer : altering partition column type in Hive
You can think of it this way
- Hive stores the data by creating a folder in hdfs with partition column values
- If you try to alter the Hive partition column, you are trying to change the whole directory structure and data of the Hive table, which is not possible
For example, if you have partitioned on year, this is how the directory structure looks:
tab1/clientdata/2009/file2
tab1/clientdata/2010/file3
If you want to change the partition column you can perform below steps
Create another hive table with required changes in partition column
Create table new_table ( A int, B String.....)
Load data from previous table
Insert into new_table partition ( B ) select A,B from Prev_table

drop column from a partition in hive external table

I have a hive external table with 3 partition columns (A,B,C) and now I want to drop B and C columns from the partition. Is it possible to do so?
I have tried with Alter table tab_name drop column col_name; --- but it throws an error stating partitioned columns cannot be dropped.
To drop partition columns the table should be recreated. The steps are:
Drop table, dropping external table will not drop data files.
Reorganize data folders to reflect the new partition structure. Partitions are folders on the physical level, hierarchically organized. If you delete an upper-level partition column, then all its sub-folders should be moved up one level, and so on. If you are deleting two upper partition columns and only one is left, then there should be only one level of sub-folders under the table location.
Create table with new partitioning schema on top of old location.
Run MSCK repair table. It will create partition metadata for all found partitions folders.
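The folder reorganization in the second step can be planned before anything is moved. Below is a minimal Python sketch that maps each old leaf partition folder to its folder in the new layout by keeping only the key=value segments of the remaining partition columns. The function names (new_partition_path, plan_moves) and the sample paths are hypothetical. The sketch only computes the mapping; the actual move on HDFS is left to your tooling.

```python
def new_partition_path(old_path, keep):
    """Keep only the key=value segments whose key is in `keep`."""
    segments = old_path.strip("/").split("/")
    return "/".join(s for s in segments if s.split("=", 1)[0] in keep)


def plan_moves(old_paths, keep):
    """Map every old leaf partition folder to its folder in the new layout."""
    return {old: new_partition_path(old, keep) for old in old_paths}


# Hypothetical example: partition columns (A, B, C), dropping B and C.
moves = plan_moves(
    ["A=1/B=2/C=3", "A=1/B=5/C=6", "A=2/B=2/C=3"], keep={"A"}
)
for old, new in moves.items():
    print(f"{old}/* -> {new}/")
```

Several old folders can map to the same new folder, so files with identical names have to be renamed or otherwise handled during the actual move.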
If all of these steps seem too complex or too difficult to do, then simply create a new table and load the data:
Create a new table with the new partitioning schema.
Load data into the new table.
Drop the old table and rename the new one.
Like this:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table new_table partition(C)
select --list columns without deleted
from old_table;
And finally, after dropping old table, you can rename new one using ALTER TABLE table_name RENAME TO new_table_name.

drop table command with partitions column in hive

Does drop table command in hive also drop partitions?
I just want to know whether that is the case, or whether we have to use the alter table table_name drop partition() command for this.
DROP TABLE statement always drops partitions metadata for both MANAGED and EXTERNAL tables because partitions can not exist without table. But for EXTERNAL tables it does not drop data in the filesystem.
If the table is MANAGED, then DROP TABLE will delete the table and partition metadata as well as the data: the whole table location, including partition sub-folders, is removed.
If the table is EXTERNAL, it will drop only the table definition and partition definitions in the metadata. The table location with the data, including all partition folders, will remain as is, and you can again create a table on top of the same location and recover the partitions.
The same is applicable for DROP PARTITION: if table is MANAGED, it will remove partition metadata along with partition sub-folder. And if table is EXTERNAL, partition sub-folder with data will remain, only partition metadata will be deleted.
So, for MANAGED tables, you do not need to delete data after dropping table or partition.
See also DROP TABLE and DROP PARTITION manual for more details.

Delete partition directories from HDFS, would it reflect in hive table?

Let's say I created a Hive table with year, month and day as partition columns. If I delete a partition directory from HDFS, does the result get reflected in the Hive table or not?
Yes. The partition data will be gone.
The metastore will still hold the partition information (metadata), and you can see it using show partitions mytable.
You can find the partitions that need to be dropped using msck repair table mytable.
You can drop the partitions using alter table mytable drop partition (...)
The Hive table will still show the partitions. You will have to either drop the partitions deleted from HDFS manually, or drop and re-create the table and run MSCK.
Commands:
If you intend to alter the table and drop all deleted partitions:
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]
[IGNORE PROTECTION] [PURGE]; -- (Note: PURGE is available in Hive 1.2.0 and later; IGNORE PROTECTION is no longer available in 2.0.0 and later)
I would go with dropping and re-creating the table, then running MSCK.
To add all existing partitions to the table:
msck repair table <table_name>
Alternatively, you could drop all partitions using ALTER TABLE and then run the MSCK command.
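To drop only the partitions whose folders were deleted, the list from show partitions can be compared against the folders that still exist on HDFS. Below is a minimal Python sketch of that comparison. The function names and sample partition names are hypothetical, and both lists are assumed to be collected beforehand as key=value paths relative to the table location.

```python
def partition_spec(name):
    """Turn 'year=2021/month=04' into "year='2021', month='04'"."""
    pairs = (seg.partition("=") for seg in name.strip("/").split("/"))
    # Assumption: values are quoted as strings and need no escaping.
    return ", ".join(f"{key}='{value}'" for key, _, value in pairs)


def stale_partitions(metastore_partitions, hdfs_dirs):
    """Partitions registered in the metastore whose folder is gone from HDFS."""
    existing = {d.strip("/") for d in hdfs_dirs}
    return [p for p in metastore_partitions if p.strip("/") not in existing]


def drop_statements(table, names):
    """Build one ALTER TABLE ... DROP PARTITION statement per stale partition."""
    return [
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({partition_spec(n)});"
        for n in names
    ]


in_metastore = ["year=2021/month=04/day=15", "year=2021/month=04/day=16"]
on_hdfs = ["year=2021/month=04/day=16"]
stale = stale_partitions(in_metastore, on_hdfs)
print("\n".join(drop_statements("mytable", stale)))
# ALTER TABLE mytable DROP IF EXISTS PARTITION (year='2021', month='04', day='15');
```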