Difference between INVALIDATE METADATA and REFRESH commands in Impala?

I saw the following at a link that refers to Impala 1.1:
Since Impala 1.1, REFRESH statement only works for existing tables. For new tables you need to issue "INVALIDATE METADATA" statement.
Does this still hold true for later versions of Impala?

According to Cloudera's Impala guide (Cloudera Enterprise 5.8; the wording is the same for 5.9):
INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA waits to reload the metadata when needed for a subsequent query, but reloads all the metadata for the table, which can be an expensive operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only loads the block location data for newly added data files, making it a less expensive operation overall. If data was altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, the INVALIDATE METADATA statement works just like the Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is optimized for the common use case of adding new data files to an existing table, thus the table name argument is now required.
And, regarding REFRESH only working on existing tables:
The table name is a required parameter [for REFRESH]. To flush the metadata for all tables, use the INVALIDATE METADATA command.
Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell, enter INVALIDATE METADATA new_table before you can see the new table in impala-shell. Once the table is known by Impala, you can issue REFRESH table_name after you add data files for that table.
So it seems the behavior has indeed stayed the same. I believe CDH 5.9 ships with Impala 2.7.
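As a minimal sketch of that workflow (the database, table, and column names below are made up for illustration):

-- In the Hive shell: create a table that Impala does not know about yet.
CREATE TABLE sales_db.web_logs (ts STRING, url STRING) STORED AS PARQUET;
-- In impala-shell: make the new table visible to Impala.
INVALIDATE METADATA sales_db.web_logs;
-- Later, after new data files are added to the table's directory from Hive
-- (INSERT, LOAD DATA, ...), pick up the new files in Impala.
REFRESH sales_db.web_logs;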

As per the Impala documentation on INVALIDATE METADATA and REFRESH:
INVALIDATE METADATA Statement
The INVALIDATE METADATA statement marks the metadata for one or all tables as stale. The next time the Impala service performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. As this is a very expensive operation compared to the incremental metadata update done by the REFRESH statement, when possible, prefer REFRESH rather than INVALIDATE METADATA.
INVALIDATE METADATA is required when the following changes are made outside of Impala, in Hive or another Hive client such as SparkSQL:
Metadata of existing tables changes.
New tables are added, and Impala will use the tables.
The SERVER or DATABASE level Sentry privileges are changed.
Block metadata changes, but the files remain the same (HDFS rebalance).
UDF jars change.
Some tables are no longer queried, and you want to remove their metadata from the catalog and coordinator caches to reduce memory requirements.
No INVALIDATE METADATA is needed when the changes are made by impalad.
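For example (table name hypothetical), either of these can be issued from impala-shell after a table was created in Hive:

-- Mark the metadata of a single table as stale (cheaper, preferred).
INVALIDATE METADATA sales_db.new_table;
-- Mark the metadata of all tables as stale (expensive on large catalogs).
INVALIDATE METADATA;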
REFRESH Statement
The REFRESH statement reloads the metadata for the table from the metastore database and does an incremental reload of the file and block metadata from the HDFS NameNode. REFRESH is used to avoid inconsistencies between Impala and external metadata sources, namely Hive Metastore (HMS) and NameNodes.
Usage notes:
The table name is a required parameter, and the table must already exist and be known to Impala.
Only the metadata for the specified table is reloaded.
Use the REFRESH statement to load the latest metastore metadata for a particular table after one of the following scenarios happens outside of Impala:
Deleting, adding, or modifying files.
For example, after loading new data files into the HDFS data directory for the table, appending to an existing HDFS file, or inserting data from Hive via INSERT or LOAD DATA.
Deleting, adding, or modifying partitions.
For example, after issuing ALTER TABLE or another table-modifying SQL statement in Hive.
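A small sketch of the typical sequence (names are invented; the per-partition form of REFRESH only exists on sufficiently recent Impala versions):

-- In Hive: load new files into an existing, partitioned table.
LOAD DATA INPATH '/staging/2017-01-15'
  INTO TABLE sales_db.transactions PARTITION (ds='2017-01-15');
-- In impala-shell: pick up the new file and block metadata.
REFRESH sales_db.transactions;
-- On newer Impala releases the reload can be limited to one partition.
REFRESH sales_db.transactions PARTITION (ds='2017-01-15');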

INVALIDATE METADATA flushes all of a table's cached metadata (both the table structure and the file/block metadata) and reloads it on the next query = complete flush.
REFRESH incrementally reloads only the file and block metadata = lightweight flush.

Related

databricks error IllegalStateException: The transaction log has failed integrity checks

I have a table that I need to drop, delete the transaction log for, and recreate, but while I am trying to drop it I get the following error.
I have run a REPAIR TABLE statement on this table, which could be responsible for the error, but I am not sure.
IllegalStateException: The transaction log has failed integrity checks. We recommend you contact Databricks support for assistance. To disable this check, set spark.databricks.delta.state.corruptionIsFatal to false. Failed verification of:
Table size (bytes) - Expected: 0 Computed: 63233
Number of files - Expected: 0 Computed: 1
We think this may just be related to s3 eventual consistency. Please try waiting a few extra minutes after deleting the Delta directory before writing new data to it. Also, normal MSCK REPAIR TABLE doesn't do anything for Delta, as Delta doesn't use the Hive Metastore to store the partitions. There is an FSCK REPAIR TABLE, but that is for removing the file entries from the transaction log of a Databricks Delta table that can no longer be found in the underlying file system.
We don't recommend overwriting a Delta table in place, like you might with a normal Spark table. Delta is not like a normal table - it's a table, plus a transaction log, and many versions of your data (unless fully vacuumed). If you want to overwrite parts of the table, or even the whole table, you should use Delta's delete functionality. If you want to completely change the table, consider writing to an entirely new directory, such as /table/v2/... and separately deleting the other table.
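As a rough illustration of those two approaches in Spark SQL (the paths, table names, and predicate are placeholders, not from the original question):

-- Delete only the rows being replaced, instead of overwriting in place.
DELETE FROM delta.`/mnt/tables/events` WHERE event_date = '2019-06-01';
-- Or build the replacement data in a completely new directory, repoint
-- consumers to it, and delete the old directory separately.
CREATE TABLE events_v2
USING DELTA
LOCATION '/mnt/tables/events_v2'
AS SELECT * FROM staging_events;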
To prevent this check from blocking you, you can disable it with the command below (in a PySpark notebook):
spark.conf.set("spark.databricks.delta.state.corruptionIsFatal", False)

How to handle hive locking across hive and presto

I have a few Hive tables that are insert-overwritten from Spark and Hive. Those tables are also accessed by analysts on Presto. Naturally, we're running into windows of time in which users hit an incomplete data set, because Presto ignores the locks.
The options I can think of:
Fork the presto-hive connector to support Hive S and X locks appropriately. This isn't too bad, but it is time-consuming to do properly.
Swap the table location on the hive metastore once an insert overwrite is complete. This is OK, but a little messy because we like to store explicit locations at the database level and let the tables inherit location.
Stop doing insert-overwrite on these tables and instead just add a new partition for the things that have changed; once a new partition is written, alter the Hive table to see it. Then we can just have views on top of the data that properly reconcile the latest version of each row (see the sketch after this list).
Stop doing insert-overwrite on S3, which has a long window while data is copied from the Hive staging directory to the target table. If we move to HDFS for all insert-overwrites, we still have the issue, but it's only over the span of time it takes to do an hdfs mv, which is significantly faster. (Probably bad: there's still a window during which we can get incomplete data.)
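A rough sketch of option 3 in Hive SQL (table, column, and partition names are invented for illustration): each refresh writes a new partition, and readers query a view that reconciles to the latest version of each row:

-- Write each batch into its own partition instead of overwriting in place.
INSERT INTO TABLE orders_versioned PARTITION (batch_id = 20200115)
SELECT order_id, status, updated_at FROM staging_orders;
-- Readers query a view that keeps only the newest version of each row.
CREATE OR REPLACE VIEW orders AS
SELECT order_id, status, updated_at
FROM (
  SELECT o.*, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY batch_id DESC) AS rn
  FROM orders_versioned o
) t
WHERE rn = 1;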
My question is how do people generally handle that? It seems like a common scenario that would have an explicit solution, but I seem to be missing it. This can be asked in general for any third party tool that can query the hive metastore and interact with the hdfs/s3 directly while not respecting hive locks.

Impact of "INVALIDATE METADATA" on "COMPUTE STATS" in Impala

I understand that running the INVALIDATE METADATA statement on a table flushes its metadata. Will it also invalidate any metadata created by the COMPUTE STATS statement?
No, INVALIDATE METADATA just clears the cached metadata in the Impala Catalog. Table and column statistics are persisted in the Hive Metastore.
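A quick way to check this from impala-shell (table name hypothetical):

COMPUTE STATS sales_db.transactions;
INVALIDATE METADATA sales_db.transactions;
-- The row counts and column stats are still shown after invalidation,
-- because they are persisted in the Hive Metastore and simply reloaded.
SHOW TABLE STATS sales_db.transactions;
SHOW COLUMN STATS sales_db.transactions;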

Update Hive metadata location for many tables

I would like to change the bucket name in the location of many Hive tables. Is it possible for us to connect to the MySQL database and update it? I think it is possible, but I would like to know if it is safe to do on a production database.
Yes, it is possible, and I have seen it done; but
(a) the Metastore schema is not documented, and each Hive version brings some minor changes, so you have to do your own exploration to find where/how the StorageDescriptor objects are persisted -- then some unit tests / non-regression tests on a Dev system -- plus, don't forget to run a full DB backup before tinkering with your Prod system (and to rehearse an emergency restoration on your Dev system, too!)
(b) you have to update the StorageDescriptor for tables, but also for partitions -- remember that for partitioned tables, the table-level LOCATION is just used as default root dir for future partitions; once created, a partition retains its location until it is ALTERed explicitly.
For the record, the preferred method for bulk updates is (in theory) the Hive MetaTool, but unfortunately it does not support the kind of updates that you need. Right now it's only good for changing the NameNode alias in all HDFS paths, because that was a real pain point...
A valid alternative to brutal SQL Updates would be to develop a custom Java program, using the Hive MetaStore API, to scan all tables & partitions then read their StorageDescriptor then run RegEx changes on their Location then write back the changes (which is exactly what the MetaTool does, only at a lower level). But that would be overkill.
Finally, a possible compromise would be a SQL Select on the appropriate MySQL table, to generate (with regexp_replace()) a chain of ALTER Table/Partition LOCATION commands to run later in the Hive CLI. Plus a chain of ALTERs to revert to the original locations, in case you have to do an emergency rollback :-/
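For illustration, a query along these lines against the MySQL metastore could generate the table-level ALTER statements (the TBLS/DBS/SDS table and column names follow a typical metastore schema but vary across Hive versions, and the bucket names are placeholders, so verify everything against your own metastore first; partitions would need a similar query over the PARTITIONS table):

SELECT CONCAT('ALTER TABLE `', d.NAME, '`.`', t.TBL_NAME, '` SET LOCATION ''',
              REPLACE(s.LOCATION, 's3a://old-bucket/', 's3a://new-bucket/'),
              ''';') AS alter_stmt
FROM TBLS t
JOIN DBS d ON d.DB_ID = t.DB_ID
JOIN SDS s ON s.SD_ID = t.SD_ID
WHERE s.LOCATION LIKE 's3a://old-bucket/%';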

Spark HiveContext does not retrieve newly inserted records from Hive Table

I am using Spark 1.4, and HiveContext is used to connect to Hive. I did the following:
import org.apache.spark.sql.hive.HiveContext
val hx = new HiveContext(sc)
import hx.implicits._
hx.sql("select * from tab").show
// it is fine, result was shown as expected
Then I inserted a few records into tab from the Beeline console and ran:
hx.refreshTable("tab")
hx.sql("select * from tab").show
// still old records, no newly inserted records
My question is: why didn't the HiveContext retrieve the newly inserted records?
hiveContext.refreshTable(tableName: String) - this refreshes only the metadata of the table (not the actual data).
Notes from the official documentation (credits: https://spark.apache.org):
refreshTable(tableName: String): Unit
Invalidate and refresh all the cached metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache.
To retrieve the newly inserted records: uncache the table first and cache it again, using uncacheTable(String tableName) and cacheTable(String tableName).
If the target table is partitioned, you need to insert with the 'partition' option. If you leave out the partition, the data will not be visible.
INSERT OVERWRITE TABLE tablename1 PARTITION (partcol1=val1, partcol2=val2...) SELECT col1,col2,.... FROM tablename2
In a slightly different case, I had an RDD coming from a Spark SQL statement via HiveContext. The solution that worked for me after some experiments was to actually regenerate the RDD itself.
It does not matter whether you are using the Spark SQL DDL or sending SQL statements directly via hiveContext.sql.
I have seen people use a "count trick" to force recomputation of a dataset, but at least in my attempts I couldn't get to see the new data this way.
Anyway, caching, refreshing and friends did not work for me; if somebody has a proper pattern here, please share.