Hive data retention issue

Hive data retention issue - hive

I am using hive version 3, have created a few manage tables, external tables to my database and added retention policy as per https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/using-hiveql/content/hive-set-partition-retention.html guidelines. However, I do not see any partitions getting deleted per retention.
I could see 'discover.partitions'='true' is enabled by default.
And I set 'partition.retention.period'='7d', added a couple of dummy partitions but I don't see anything getting deleted.
I even tried MSCK repair to sync, but didn't see any results.
How to set data retention to hive to delete partitions from selected table which are older than n days?

Related

msck repair a big table take very long time

I have a daily ingestion of data into HDFS . From data into HDFS I generate Hive tables partitioned by date and another column. One day has 130G data. After generate the data, I run msck repair. Now every msck tasks more than 2 hours. In my mind, msck will scan the whole table data (we have about 200 days data) and then update metadata. My question is: is there a way let the msck only scan the last day data and then update the metadata to speed up the whole process? by the way there is no small files issue, I already merge the small files before msck.

When you creating external table or doing repair/recover partitions with this configuration:
set hive.stats.autogather=true;
Hive scans each file in the table location to get statistics and it can take too much time.
The solution is to switch it off before create/alter table/recover partitions
set hive.stats.autogather=false;
See these related tickets: HIVE-18743, HIVE-19489, HIVE-17478
If you need statistics, you can gather statistics only for new partitions if necessary using
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
COMPUTE STATISTICS
See details here: ANALYZE TABLE
Also if you know which partitions should be added, use ALTER TABLE ADD PARTITION - you can add many partitions in single command.

Query delta changes to Hive metadata from last month

We have a hive metastore in mysql db. Use case is to query all the updates alters or deletes done to the hive metadata from past one month. We need to query this information to update an existing data catalog about this metadata.
We need only the delta changes happened from past query pull. How to query changes from hive metastore tables to check changes happened on a metadata during a time interval?
Tried few query sample to pull metadata.
We need to query this information to update an existing data catalog about this metadata.
We need only the delta changes happened from past query pull. How to query changes from hive metastore tables to check changes happened on a metadata during a time interval?

AWS Athena MSCK REPAIR TABLE tablename command

Is there any number of partitions we would expect this command
MSCK REPAIR TABLE tablename;
to fail on?
I have a system that currently has over 27k partitions and the schema changes for the Athena table we drop the table, recreate the table with say the new column(s) tacked to the end and then run
MSCK REPAIR TABLE tablename;
We had no luck with this command doing any work what so every after we let it run for 5 hours. Not a single partition was added. Wondering if anyone has information about a partition limit we may have hit but can't find documented anywhere.

MSCK REPAIR TABLE is an extremely inefficient command. I really wish the documentation didn't encourage people to use it.
What to do instead depends on a number of things that are unique to your situation.
In the general case I would recommend writing a script that performed S3 listings and constructed a list of partitions with their locations, and used the Glue API BatchCreatePartition to add the partitions to your table.
When your S3 location contains lots of files, like it sounds yours does, I would either use S3 Inventory to avoid listing everything, or list objects with a delimiter of / so that I could list only the directory/partition structure part of the bucket and skip listing all files. 27K partitions can be listed fairly quickly if you avoid listing everything.
Glue's BatchCreatePartitions is a bit annoying to use since you have to specify all columns, the serde, and everything for each partition, but it's faster than running ALTER TABLE … ADD PARTION … and waiting for query execution to finish – and ridiculously faster than MSCK REPAIR TABLE ….
When it comes to adding new partitions to an existing table you should also never use MSCK REPAIR TABLE, for mostly the same reasons. Almost always when you add new partitions to a table you know the location of the new partitions, and ALTER TABLE … ADD PARTION … or Glue's BatchCreatePartitions can be used directly with no scripting necessary.
If the process that adds new data is separate from the process that adds new partitions, I would recommend setting up S3 notifications to an SQS queue and periodically reading the messages, aggregating the locations of new files and constructing the list of new partitions from that.

HIVE 3.1 - Automatic Major compaction triggered only once per partition

I have an acid enabled, partitioned, bucketed hive table to which I am writing using a streaming client. I see that several delta files are created as the records are written into partitions. I wanted to enable auto-compaction and tried the following base and specific params:
hive.support.concurrency=true
hive.enforce.bucketing=true
hive.exec.dynamic.partition.mode=nonstrict
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on=true
hive.compactor.worker.threads=1
with,
hive.compactor.initiator.on=true
hive.compactor.cleaner.run.interval=5000ms
hive.compactor.delta.num.threshold=10
hive.compactor.delta.pct.threshold=0.1f
hive.compactor.abortedtxn.threshold=1000
hive.compactor.initiator.failed.compacts.threshold=2
hive.compactor.abortedtxn.threshold=1000
I did the above with hopes of enabling major compaction. However I see that major compaction is triggered automatically only once. i.e, Major compaction runs once and creates a base file. Once a base file is created for a number of delta files within that partition, Major compaction is not scheduled further, despite more delta files streamed into the partition since. How do I enable auto-Major compaction for a table? Has anyone faced similar issues before?

I have the same issue and the only solution I found is run manual compaction for each partition.
ALTER TABLE myTable PARTITION (myPartitionColumn='myPartitionValue') COMPACT 'major';
I still trying to figure out why happens.

Difference between invalidate metadata and refresh commands in Impala?

I saw at this link which affects Impala version 1.1:
Since Impala 1.1, REFRESH statement only works for existing tables. For new tables you need to issue "INVALIDATE METADATA" statement.
Does this still hold true for later versions of Impala?

According to Cloudera's Impala guide (Cloudera Enterprise 5.8) but stayed the same for 5.9:
INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA
waits to reload the metadata when needed for a subsequent query, but
reloads all the metadata for the table, which can be an expensive
operation, especially for large tables with many partitions. REFRESH
reloads the metadata immediately, but only loads the block location
data for newly added data files, making it a less expensive operation
overall. If data was altered in some more extensive way, such as being
reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a
performance penalty from reduced local reads. If you used Impala
version 1.0, the INVALIDATE METADATA statement works just like the
Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is
optimized for the common use case of adding new data files to an
existing table, thus the table name argument is now required.
and related to working on existing tables:
The table name is a required parameter [for REFRESH]. To flush the metadata for all
tables, use the INVALIDATE METADATA command.
Because REFRESH table_name only works for tables that the current
Impala node is already aware of, when you create a new table in the
Hive shell, enter INVALIDATE METADATA new_table before you can see the
new table in impala-shell. Once the table is known by Impala, you can
issue REFRESH table_name after you add data files for that table.
So it seems like it indeed stayed the same. I believe CDH 5.9 comes with Impala 2.7.

As per Impala document Invalidate Metada and Refresh
INVALIDATE METADATA Statement
The INVALIDATE METADATA statement marks the metadata for one or all tables as stale. The next time the Impala service performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. As this is a very expensive operation compared to the incremental metadata update done by the REFRESH statement, when possible, prefer REFRESH rather than INVALIDATE METADATA.
INVALIDATE METADATA is required when the following changes are made outside of Impala, in Hive and other Hive client, such as SparkSQL:
Metadata of existing tables changes.
New tables are added, and Impala will use the tables.
The SERVER or DATABASE level Sentry privileges are changed.
Block metadata changes, but the files remain the same (HDFS rebalance).
UDF jars change.
Some tables are no longer queried, and you want to remove their metadata from the catalog and coordinator caches to reduce memory requirements.
No INVALIDATE METADATA is needed when the changes are made by impalad.
REFRESH Statement
The REFRESH statement reloads the metadata for the table from the metastore database and does an incremental reload of the file and block metadata from the HDFS NameNode. REFRESH is used to avoid inconsistencies between Impala and external metadata sources, namely Hive Metastore (HMS) and NameNodes.
Usage notes:
The table name is a required parameter, and the table must already exist and be known to Impala.
Only the metadata for the specified table is reloaded.
Use the REFRESH statement to load the latest metastore metadata for a particular table after one of the following scenarios happens outside of Impala:
Deleting, adding, or modifying files.
For example, after loading new data files into the HDFS data directory for the table, appending to an existing HDFS file, inserting data from Hive via INSERT or LOAD DATA.
Deleting, adding, or modifying partitions.
For example, after issuing ALTER TABLE or other table-modifying SQL statement in Hive

Invalidate metadata is used to refresh the metastore and the data (structure & data)==complete flush
Refresh is used to update only the data = lightweight flush

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas