Add retention period to Hive tables - hive

Could you please let me know how to add a retention period to Hive tables?
In the URL below I can see that partition discovery and retention is not recommended for use on managed tables, and I don't understand why.
I created a table and added the properties below to the table schema:
'auto.purge'='true', 'discover.partitions'='true',
'partition.retention.period'='30m'
Just to be sure, I ran the command MSCK REPAIR TABLE table_name SYNC PARTITIONS and then inserted data into the table. Per the retention period, the partitions should be dropped after 30 minutes, but nothing was dropped.
Am I missing something here? Thank you in advance for your help.
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/using-hiveql/content/hive-manage-partitions.html
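For reference, a minimal sketch of the setup described above, assuming a hypothetical external table named sales (the linked page documents retention for external tables with partition discovery enabled, which may be why it is discouraged for managed ones):
ALTER TABLE sales SET TBLPROPERTIES (
  'discover.partitions'='true',
  'partition.retention.period'='30m'
);
-- sync the metastore with the partition directories on the filesystem
MSCK REPAIR TABLE sales SYNC PARTITIONS;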

Is there a way to find out who dropped a table?

I ran into a situation where a table suddenly disappeared.
I checked the project's query history, but there is no DROP TABLE query.
I want to find out who (or which service account) dropped the table.
Is there a way to identify who dropped the table, other than the project query history?
Added:
I have already checked the table expiration and partition expiration settings.
Yes, there are other options as well.
You can check the Activity logs.
For more detail, you can check the logs in Cloud Logging.
Use this filter in the log query:
resource.type="bigquery_resource"
protoPayload.methodName = "tableservice.delete"
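If it helps, here is a hedged variant of that filter which also requires the caller field to be present; the resource type and method name are copied from the answer above, and protoPayload.authenticationInfo.principalEmail is the standard Cloud Audit Logs caller-identity field:
resource.type="bigquery_resource"
protoPayload.methodName = "tableservice.delete"
protoPayload.authenticationInfo.principalEmail:*
In each matching entry, the principalEmail value names the user or service account that issued the delete.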

Specify Retention Period for a HIVE Partitioned Table

We are thinking of creating a housekeeping program to delete old partitions from Hive tables. After reading the documentation, I found that you can establish a retention period using SET TBLPROPERTIES.
ALTER TABLE employees SET TBLPROPERTIES ('discover.partitions'='true');
ALTER TABLE employees SET TBLPROPERTIES ('partition.retention.period'='7d');
My questions are the following:
After executing this command, will Hive delete old partitions? In other systems, setting properties normally doesn't affect existing elements.
When does this clean-up take place? I did not see a way to configure it. What happens if someone is accessing a partition at the moment the process is triggered?
Or is it best to just create a script that reads the Hive metadata and drops old partitions?
Thanks for your help!
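For the manual approach mentioned in the last question above, a minimal sketch, assuming hypothetically that employees is partitioned by a string column load_date in yyyy-MM-dd format (Hive accepts comparison operators in a partition spec when dropping):
-- drop every partition older than the chosen cut-off date
ALTER TABLE employees DROP IF EXISTS PARTITION (load_date < '2020-01-01');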

Drop empty Impala partitions

Impala external table partitions still show up in the stats with a row count of 0 after deleting the data in HDFS, altering the table (e.g. ALTER TABLE table RECOVER PARTITIONS), refreshing it (REFRESH table), and invalidating the metadata.
Dropping the partitions one by one works, but there are tens of partitions to remove and it would be quite tedious.
Dropping and recreating the table would also be an option, but that way all the statistics would be dropped together with the table.
Are there any other options in Impala to get this done?
Found a workaround through Hive.
By issuing MSCK REPAIR TABLE tablename SYNC PARTITIONS and then refreshing the table in Impala, the empty partitions disappear.
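In case it is useful, a minimal sketch of that workaround, with mydb.events standing in as a hypothetical table name:
-- in Hive (e.g. via beeline): drop metastore partitions whose directories no longer exist in HDFS
MSCK REPAIR TABLE mydb.events SYNC PARTITIONS;
-- then in impala-shell: pick up the metastore change
REFRESH mydb.events;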

Exclusive lock on Hive table is not released automatically when Hive services restart in between

While running an INSERT command (INSERT INTO TABLE xyz PARTITION(partition_date='2020-02-28') VALUES('A',123,'C',45)...) or an ALTER TABLE ... DROP PARTITION command (ALTER TABLE xyz DROP IF EXISTS PARTITION(partition_date='2020-02-28')) in Hive, if the Hive services are restarted in between (through Ambari or due to any unwanted scenario), the exclusive lock acquired on that partition remains after the restart. For such a job, sometimes no YARN application ID is generated, and when one is generated the job even succeeds, yet the exclusive lock remains on the table or partition and we later have to release it manually.
Why do these locks remain on the partition or table, and how can these kinds of scenarios be handled on our end?
Is there any workaround for these kinds of scenarios?
I met a similar problem and resolved it; there are two ways:
(1) The Hive lock information is stored in a MySQL table called hive.hive_locks, so you can delete the rows relevant to your table, or truncate that table (see the sketch at the end of this answer). However, this does not fix the problem permanently.
(2) Add a configuration in hive-site.xml, just like this:
<property>
<name>metastore.task.threads.always</name>
<value>org.apache.hadoop.hive.metastore.events.EventCleanerTask,org.apache.hadoop.hive.metastore.RuntimeStatsCleanerTask,org.apache.hadoop.hive.metastore.repl.DumpDirCleanerTask,org.apache.hadoop.hive.metastore.txn.AcidHouseKeeperService</value>
</property>
You can also refer to my answer to this question, where I explain the second approach in more detail:
https://stackoverflow.com/a/73771475/9120595
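For option (1), a hedged sketch of the metastore clean-up, assuming the default metastore schema where lock rows live in the HIVE_LOCKS table (column names can differ between Hive versions, so verify them first and back up the metastore database before deleting anything):
-- inspect the stale locks for the affected table before removing anything
SELECT HL_LOCK_EXT_ID, HL_DB, HL_TABLE, HL_PARTITION, HL_LOCK_STATE, HL_LOCK_TYPE
FROM HIVE_LOCKS
WHERE HL_TABLE = 'xyz';
-- remove only the rows belonging to that table
DELETE FROM HIVE_LOCKS WHERE HL_TABLE = 'xyz';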

How to take daily snapshots of a table

I am building a sales database. One of the tables has to be a hierarchy of sales reps and their assigned territories. These reps and their territories change every day, and I need to keep track of exactly what that table looks like every day. I will need to take daily snapshots of the table.
I would like to know what I have to do or how I have to store the data in the table, to be able to know exactly what the data in the table was at a certain point in time.
Is this possible?
Please keep in mind that the table will not be more than one megabyte or so.
I suggest using Paul Nielsen's AutoAudit:
AutoAudit is a SQL Server (2005, 2008) Code-Gen utility that creates
Audit Trail Triggers with:
Created, CreatedBy, Modified, ModifiedBy, and RowVersion (incrementing INT) columns added to the table
Insert events logged to the Audit table
Updates: old and new values logged to the Audit table
Deletes: all final values logged to the Audit table
View to reconstruct deleted rows
UDF to reconstruct row history
Schema Audit Trigger to track schema changes
Re-code-gens triggers when ALTER TABLE changes the table
His original blog post: CodeGen to Create Fixed Audit Trail Triggers
Before you implement this in production, I suggest you restore a backup of your database into development and work on that.
This is for MS SQL.
Seeing as the table is so small, you are best off using the database snapshot functionality provided by MS SQL Server.
To make a snapshot of a database:
CREATE DATABASE YourDB_Snapshot_DateStamp ON
( NAME = YourDB_Data, FILENAME =
'C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Data\YourDB_Snapshot_DateStamp.ss' )
AS SNAPSHOT OF YourDB;
GO
See this page for reference: http://msdn.microsoft.com/en-us/library/ms175876.aspx
You can make as many snapshots as you want. So my advice is to create a script or task that creates a daily snapshot and appends the date to the snapshot name. This way you will have all your snapshots visible on your server.
Important to note: snapshots are read-only.
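Building on the suggestion above, a hedged sketch of a daily job step that appends the date to the snapshot name; YourDB and the logical file name YourDB_Data are taken from the example, while the target folder is a placeholder:
-- build today's snapshot name (yyyymmdd) and create the snapshot dynamically
DECLARE @stamp char(8);
SET @stamp = CONVERT(char(8), GETDATE(), 112);
DECLARE @sql nvarchar(max);
SET @sql = N'
CREATE DATABASE YourDB_Snapshot_' + @stamp + N' ON
( NAME = YourDB_Data,
  FILENAME = ''C:\Snapshots\YourDB_Snapshot_' + @stamp + N'.ss'' )
AS SNAPSHOT OF YourDB;';
EXEC sys.sp_executesql @sql;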