How to update partition metadata in Hive , when partition data is manualy deleted from HDFS - hive

What is the way to automatically update the metadata of Hive partitioned tables?
If new partition data's were added to HDFS (without alter table add partition command execution) . then we can sync up the metadata by executing the command 'msck repair'.
What to be done if a lot of partitioned data were deleted from HDFS (without the execution of alter table drop partition commad execution).
What is the way to syncup the Hive metatdata?

EDIT : Starting with Hive 3.0.0 MSCK can now discover new partitions or remove missing partitions (or both) using the following syntax :
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS]
This was implemented in HIVE-17824
As correctly stated by HakkiBuyukcengiz, MSCK REPAIR doesn't remove partitions if the corresponding folder on HDFS was manually deleted, it only adds partitions if new folders are created.
Extract from offical documentation :
In other words, it will add any partitions that exist on HDFS but not in metastore to the metastore.
This is what I usually do in the presence of external tables if multiple partitions folders are manually deleted on HDFS and I want to quickly refresh the partitions :
Drop the table (DROP TABLE table_name)
(dropping an external table does not delete the underlying partition files)
Recreate the table (CREATE EXTERNAL TABLE table_name ...)
Repair it (MSCK REPAIR TABLE table_name)
Depending on the number of partitions this can take a long time. The other solution is to use ALTER TABLE DROP PARTITION (...) for each deleted partition folder but this can be tedious if multiple partitions were deleted.

Try using
MSCK REPAIR TABLE <tablename>;

Ensure the table is set to external, drop all partitions then run the table repair:
alter table mytable_name set TBLPROPERTIES('EXTERNAL'='TRUE')
alter table mytable_name drop if exists partition (`mypart_name` <> 'null');
msck repair table mytable_name;
If msck repair throws an error, then run hive from the terminal as:
hive --hiveconf hive.msck.path.validation=ignore
or set hive.msck.path.validation=ignore;

Related

Old records appear in the Hadoop table after drop and creating new table with the same old name

I have a question regarding creating tables in Hadoop.
I create external table the following way:
CREATE EXTERNAL HADOOP TABLE SCHEMA.TABLENAME (
ID BIGINT NOT NULL,
CODE INTEGER,
"VALUE" DOUBLE
STORED AS ORC
TBLPROPERTIES ('bigsql.table.io.doAs'='false',
'bucketing_version'='2',
'orc.compress'='ZLIB',
'orc.create.index'='true')
After I created this table I run Jenkins job (with sqoop process) which loads 70.000.000 records to this table.
Then I needed to remove this table, so I run:
DROP TABLE SCHEMA.TABLENAME
Later on I want to create a table with the same name as the previous one, but I need it to be empty. I make the same query as earlier, I do:
CREATE EXTERNAL HADOOP TABLE SCHEMA.TABLENAME (
ID BIGINT NOT NULL,
CODE INTEGER,
"VALUE" DOUBLE
STORED AS ORC
TBLPROPERTIES ('bigsql.table.io.doAs'='false',
'bucketing_version'='2',
'orc.compress'='ZLIB',
'orc.create.index'='true')
But when I create table this way, it has 70.000.000 records inside it again, although I didn't run any job to populate it.
This is why I have two questions:
When I drop and create table with old name, then is it recovering records from the old table?
How can I drop (or truncate) table in bigsql/hive so that I have an empty table with the old name.
I am using bigsql and hive.
Dropping an external table doesn't remove the stored data, only the metadata from the Hive Metastore.
Refer Managed vs External Tables
Key points...
Use external tables when files are already present or in remote locations
files should remain even if the table is dropped
Create a managed table (remove EXTERNAL from your query), if you want to be able to DROP and/or TRUNCATE.
Or have your Jenkins job run hadoop fs -rm -skipTrash before the import.

Drop empty Impala partitions

Impala external table partitions still show up in stats with row count 0 after deleting the data in HDFS and altering (like ALTER TABLE table RECOVER PARTITIONS) refreshing (REFRESH table) and invalidation of metadata.
Trying to drop partitions one by one works, but there are tens of partitions which should be removed and it would be quite tedious.
Dropping and recreating the table would also be an option but that way all the statistics would be dropped together with the table.
Is there any kind of other options in impala to get this done?
Found a workaround through HIVE.
By issuing MSCK REPAIR TABLE tablename SYNC PARTITIONS then refreshing the table in impala, the empty partitions disappear.

LOCATION in Hive

While creating hive tables, Can I point the 'LOCATION' to a place in hdfs where data is present. Do I still need to load data or Will the data be available on hive directly?
You can specify any location while creating table and the data will be accessible. If table is partitioned, then use ALTER TABLE ADD PARTITION or MSCK REPAIR TABLE table_name or Amazon version ALTER TABLE table_name RECOVER PARTITIONS , this will add any partitions that exist on HDFS but not in metastore to the metastore. See docs here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
If table is not partitioned, you can simply specify the location with data while creating table or change table location using ALTER TABLE SET LOCATION.

Hive Partition recovery

How to recover partitions in easy fashion. Here is the scenario :
Have 'n' partitions on existing external table 't'
Dropped table 't'
Recreated table 't' // Note : same table but with excluding some column
How to recover the 'n' partitions that existed for table 't' in step #1 ?
I can manually alter table to add 'n' partition by writing some script. But that's very tedious. Is there something built-in to recover these partitions ?
When the partitions directories still exist in the HDFS, simply run this command:
MSCK REPAIR TABLE table_name;
It adds the partitions definitions to the metastore based on what exists in the table directory.
Metadata is not saved in the trash and is removed permanently; you will not be able to restore the metadata of dropped tables, partitions, etc. Reference: http://www.cloudera.com/documentation/archive/cdh/4-x/4-7-1/CDH4-Installation-Guide/cdh4ig_hive_trash.html

Hive: DROP TABLE IF EXISTS <Table Name> does not free memory

When I am using DROP TABLE IF EXISTS <Table Name> in hive, it is not freeing the memory. The files are created as 0000_n.bz2 and they are still on disk.
I have two questions here:
1) Will these files keep on growing for each and every insert?
2) Is there any DROP equivalent to remove the files as well on the disk?
Couple of things you can do:
Check if the table is an external table and in that case you need to drop the files manually on HDFS as dropping tables won't drop the files:
hadoop fs -rm /HDFS_location/filename
Secondly check if you are in the right database. You need to issue use database command before dropping the tables. The database should be same as the one in which tables were created.
There are two types of tables in hive.
Hive managed table: If you drop a hive managed table the data in HDFS are automatically deleted.
External Table: If you drop an external table, hive doesnt delete the underlying data.
I believe yours is an external table.
Drop table if exists table_name purge;
This command will also remove data files from trash folder and cannot be recovered after table drop