We are thinking of creating a housekeeping program to delete old partitions from Hive tables. After reading the documentation, I found that you can establish a retention period using SET TBLPROPERTIES:
ALTER TABLE employees SET TBLPROPERTIES ('discover.partitions'='true');
ALTER TABLE employees SET TBLPROPERTIES ('partition.retention.period'='7d');
My questions are the following:
After executing this command, will Hive delete the old partitions? In other systems, setting properties normally doesn't affect existing elements.
When does this clean-up take place? I did not see a way to configure it. What happens if someone is accessing a partition at the moment the process is triggered?
Or is it best to just create a script that reads the metadata dictionary in Hive and deletes old partitions?
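For what it's worth, a minimal sketch of what such a deletion could look like in HiveQL, assuming (my assumption, not stated in the question) that employees is partitioned by a string-typed date column named ds and that a wrapping script computes the cutoff date:

-- Hypothetical: `ds` is assumed to be the date partition column.
-- Hive accepts comparison operators in the partition spec, so one
-- statement can drop every partition older than the cutoff.
ALTER TABLE employees DROP IF EXISTS PARTITION (ds < '2023-01-01');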
Thanks for your help!
Related
Could you please let me know how to add a retention period to Hive tables?
In the URL below, I could see that partition discovery and retention is not recommended for use on managed tables. I don't understand why it is not recommended.
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/using-hiveql/content/hive-manage-partitions.html
I have created a table and added the properties below to the table schema:
'auto.purge'='true', 'discover.partitions'='true', 'partition.retention.period'='30m'
Just to be sure, I have run the command MSCK REPAIR TABLE table_name SYNC PARTITIONS.
I have inserted data into the table. As per the retention period, the partitions should have been dropped after 30 minutes, but nothing was dropped.
Am I missing something here? Thank you in advance for your help.
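For reference, those properties would normally be applied to an existing table with the same ALTER TABLE ... SET TBLPROPERTIES form as in the first question (table name as in the post):

ALTER TABLE table_name SET TBLPROPERTIES (
  'auto.purge'='true',
  'discover.partitions'='true',
  'partition.retention.period'='30m'
);

One thing worth checking, and this is an assumption on my part rather than something stated in the post: retention is enforced by a background metastore task, so the metastore housekeeping threads (metastore.housekeeping.threads.on) need to be enabled for anything to actually be dropped.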
I am creating a managed table via Impala as follows:
CREATE TABLE IF NOT EXISTS table_name
STORED AS parquet
TBLPROPERTIES ('transactional'='false', 'insert_only'='false')
AS ...
This should result in a managed table which does not support Hive ACID.
However, when I run the command I still end up with an external table.
Why is this?
I found out in the Cloudera documentation that omitting the EXTERNAL keyword when creating the table does not mean that the table will definitely be managed:
When you use EXTERNAL keyword in the CREATE TABLE statement, HMS stores the table as an external table. When you omit the EXTERNAL keyword and create a managed table, or ingest a managed table, HMS might translate the table into an external table or the table creation can fail, depending on the table properties.
Thus, setting transactional=false and insert_only=false leads to an external table in the interpretation of the Hive Metastore.
Interestingly, setting only TBLPROPERTIES ('transactional'='false') is completely ignored and will still result in a managed table having transactional=true.
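For comparison, here is a sketch of property combinations that should map to a managed table under Hive 3 / CDP semantics (my understanding of the HMS translation rules, not something the quoted docs spell out). The full-ACID variant would have to be created from Hive rather than Impala, since as far as I know Impala can only create and write insert-only transactional tables:

-- Full-ACID managed table (full ACID requires ORC); create via Hive:
CREATE TABLE t_acid (id INT)
STORED AS orc
TBLPROPERTIES ('transactional'='true');

-- Insert-only managed table; this flavor allows Parquet and is
-- the transactional variant Impala can work with:
CREATE TABLE t_insert_only (id INT)
STORED AS parquet
TBLPROPERTIES ('transactional'='true',
               'transactional_properties'='insert_only');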
I created two external tables in Hive. For the first table, I specified the data location in the CREATE statement. For the second table, I loaded data after creating it.
I can see the data file created for the second table in the /hive/warehouse/ directory. Then I set "external.table.purge"="true" for both tables and dropped both tables. But the data files of both tables remain as they are.
What is the behaviour of 'external.table.purge'='true'? Shouldn't it delete the data files as well when the DROP command is issued?
If Hive does not take any ownership over the data files of an external table, why is there even an option such as 'external.table.purge'='true'?
I read in one of the threads that it is possible to delete the data for external tables as well via ALTER TABLE ... SET TBLPROPERTIES('external.table.purge'='true'), but I am unable to find that post again.
You cannot drop the data of an external table, but you can do it for internal (managed) tables. So convert the table to internal and then drop it.
First, change the external property to false:
hive> ALTER TABLE nyse_external SET TBLPROPERTIES('EXTERNAL'='False');
and then you can easily drop it.
hive> drop table nyse_external;
TBLPROPERTIES ("external.table.purge"="true") should work for hive version 4.x+.
Answer to point 1:
Table property "external.table.purge", which if true (and if the table is an external table), will let Hive know to delete the table data when the table is dropped. This feature is introduced in this apache jira.
https://issues.apache.org/jira/browse/HIVE-19981 .
For reference on how to set the property, take a look at this example:
https://docs.cloudera.com/runtime/7.2.7/using-hiveql/topics/hive_drop_external_table_data.html
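Spelled out, the pattern those links describe would be (table name hypothetical):

-- Mark the external table so Hive deletes its data on drop:
ALTER TABLE my_ext_table SET TBLPROPERTIES ('external.table.purge'='true');
-- With the purge flag set, this should also remove the data files:
DROP TABLE my_ext_table;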
Impala external table partitions still show up in the stats with row count 0 after deleting the data in HDFS, altering (e.g. ALTER TABLE table RECOVER PARTITIONS), refreshing (REFRESH table), and invalidating the metadata.
Trying to drop partitions one by one works, but there are tens of partitions which should be removed and it would be quite tedious.
Dropping and recreating the table would also be an option but that way all the statistics would be dropped together with the table.
Are there any other options in Impala to get this done?
Found a workaround through Hive.
By issuing MSCK REPAIR TABLE tablename SYNC PARTITIONS and then refreshing the table in Impala, the empty partitions disappear.
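Spelled out as two steps (tablename as in the question):

-- Step 1, in Hive: sync the metastore with HDFS, dropping partition
-- entries whose directories no longer exist.
MSCK REPAIR TABLE tablename SYNC PARTITIONS;
-- Step 2, in Impala: reload the table metadata.
REFRESH tablename;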
While running an INSERT command (INSERT INTO TABLE xyz PARTITION(partition_date='2020-02-28') VALUES ('A', 123, 'C', 45), ...) or an ALTER TABLE ... DROP PARTITION command (ALTER TABLE xyz DROP IF EXISTS PARTITION(partition_date='2020-02-28');) in Hive, if the Hive services get restarted in between (through Ambari or due to some unwanted scenario), the exclusive lock acquired on that partition remains even after the restart. For that kind of job, sometimes no YARN application ID is generated; when one is generated, it even succeeds, yet the exclusive lock remains on the table or partition, and we later have to release it manually.
So why do these locks remain on the partition or table, and how can this kind of scenario be handled on our end?
Is there any workaround for these kinds of scenarios?
I met a similar problem and resolved it; there are two ways:
(1) The Hive lock information is stored in a MySQL table called hive.hive_locks, so you can delete the relevant rows for your table, or truncate that table. But this does not fix the problem permanently.
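As a sketch of way (1), run against the metastore backing database (MySQL is assumed here, and the exact table and column names can vary by Hive version and metastore schema, so verify your schema first; SHOW LOCKS xyz; on the Hive side shows the same information):

-- Inspect the stale locks first:
SELECT hl_lock_ext_id, hl_db, hl_table, hl_partition, hl_lock_state
FROM hive.hive_locks
WHERE hl_table = 'xyz';
-- Then remove them:
DELETE FROM hive.hive_locks WHERE hl_table = 'xyz';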
(2) Add a configuration in hive-site.xml, like this:
<property>
  <name>metastore.task.threads.always</name>
  <value>org.apache.hadoop.hive.metastore.events.EventCleanerTask,org.apache.hadoop.hive.metastore.RuntimeStatsCleanerTask,org.apache.hadoop.hive.metastore.repl.DumpDirCleanerTask,org.apache.hadoop.hive.metastore.txn.AcidHouseKeeperService</value>
</property>
You can also refer to my answer to this question, where I give a detailed explanation of the second way:
https://stackoverflow.com/a/73771475/9120595