What is hive.metastore.warehouse.dir for? - hive

I am new to HIVE, I am trying to setup a hive metastore service with standalone MySQL DB, and I realized that I need to config hive.metastore.warehouse.dir in the hive-site.xml, but I am having a hard time to understand what it is for?
1, None of the metadata will be stored in this location, because all of the metadata will be stored in the MySQL db.
2, None of the data files will be stored in this location, because I am not setting up a Hive data service, it is just a metastore service. And when creating hive tables, I will specify the location of the table.
Why do I still need to set this configuration?

spark.sql.warehouse.dir is a static configuration property that sets Hive’s hive.metastore.warehouse.dir property, i.e. the location of default database for the Hive warehouse
That is correct. This directory indicates where the actual data in the tables will reside.
It sounds like in most of your situations, the data will reside outside of what you set for this directory. However, if a user were to forget to set the location or if there are any internal/automated calls that use the "default" database. This is where your "default" data will reside.

Related

SQL Polybase External Table - Dealing with file metadata

I cannot find any reference for dealing with file metadata when creating an External Table starting from a partitioned source of files. More precisely: I have a set of partitioned parquet files. The partition strategy is in the form:
{YEAR}/{MONTH}/{filename}.parquet
Now I can create an external table referencing the whole set using the LOCATION pointing at the root of the partition and using s recursive strategy.
LOCATION = 'folder_or_filepath' Specifies the folder or the file path
and file name for the actual data in Hadoop or Azure blob storage. The
location starts from the root folder. The root folder is the data
location specified in the external data source.
In this context, it would be crucial to be able to access partitioning metadata like {YEAR}, {MONTH} or {filename} and store them as columns into the newly created external table for further usages.
By my researches, access file metadata seems to be a missing feature right now. But I'm not sure.
For sure, it is not possible leverages on PARTITION BY functionality as evidenced here:
https://feedback.azure.com/forums/307516-azure-synapse-analytics/suggestions/19520860-polybase-partitioned-by-functionality-when-creati
Is there some mitigation strategy? I'm about to set up a Data Factory Mapping Dataflow which will do the dirty job. But I'm still unsure about these two options:
Reducing the partitioned set into a single file adding metadata columns on each row;
Just adding metadata columns on each file and leave the partitioned hierarchy;
Bonus: any suggestion?

Deploying to multiple schemas using Flyway

I have a question regarding Flyway and managing multiple schemas. I have multiple schemas (schema1, schema2, schema3) with different deployment schedules and different folder locations (sql/schema1, sql/schema2, sql/schema3) with different code.
I want to Flyway to create the schemas before the code deployment but how do I set this up in a single config file? I read the Flyway doc (https://flywaydb.org/documentation/faq#multiple-schemas) but is the example using a single config file? or do i need to create multiple config files (one per schema)?
Can i achieve the same setting comma delimited schema list? will "Schema1" only look in the "sql/Schema1" location? I really dont want Schema1 pulling code from a different folder i.e. sql/Schema2, etc.
Thanks in advance!
When using Flyway with multiple schemas, you need to explicitly say in the sql statements which schema the sql is going to change. You can do this by putting an ALTER SESSION SET CURRENT_SCHEMA=schema1 at the top of each migration file, or prefixing all your statements like CREATE TABLE schema1.bananas.
If this is not practical, it would be best to create a number of config files, each with a single schema specified, and a single location specified. e.g.
flyway.schemas=schema1
flyway.locations=filesystem:sql/schema1
Then you can run Flyway with each config file individually to migrate that particular schema.

Update Hive metadata location for many tables

I would like to change the bucket name in location of many Hive tables. Is it possible for us to connect to mySQL database and update it? I think it is possible.But I would like to know if it is safe to do it in production database.
Yes, it is possible, and I have seen it done; but
(a) the Metastore schema is not documented, and each Hive version brings some minor changes, so you have to do your own exploration to find where/how the StorageDescriptor objects are persisted -- then some unit tests / non-regression tests on a Dev system -- plus, don't forget to run a full DB backup before tinkering with your Prod system (and to rehearse an emergency restoration on your Dev system, too!)
(b) you have to update the StorageDescriptor for tables, but also for partitions -- remember that for partitioned tables, the table-level LOCATION is just used as default root dir for future partitions; once created, a partition retains its location until it is ALTERed explicitly.
For the record, the preferred method for bulk updates is (in theory) the Hive MetaTool but unfortunately, it does not support the kind of updates that you need.Right now it's only good for changing the NameNode alias in all HDFS paths, because that was a real pain point...
A valid alternative to brutal SQL Updates would be to develop a custom Java program, using the Hive MetaStore API, to scan all tables & partitions then read their StorageDescriptor then run RegEx changes on their Location then write back the changes (which is exactly what the MetaTool does, only at a lower level). But that would be overkill.
Finally, a possible compromise would be a SQL Select on the appropriate MySQL table, to generate (with regexp_replace()) a chain of ALTER Table/Partition LOCATION commands to run later in the Hive CLI.Plus a chain of ALTER to revert to the original locations, in case you have to do an emergency rollback :-/

when should we go for a external table and internal table in Hive?

I understand the difference between Internal tables and external tables in hive as below
1) if we drop the internal Table File and metadata will be deleted, however , in case of External only metadata will be
deleted
2) if the file data need to be shared by other tools/applications then we go for external table if not
internal table, so that if we drop the table(external) data will still be available for other tools/applications
I have gone through the answers for question "Difference between Hive internal tables and external tables? "
but still I am not clear about the proper uses cases for Internal Table
so my question is why is that I need to make an Internal table ? why cant I make everything as External table?
Use EXTERNAL tables when:
The data is also used outside of Hive.
For example, the data files are read and processed by an existing program that doesn't lock the files.
The data is permanent i.e used when needed.
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the lifecycle of the table and data.
Let's understand it with two simple scenarios:
Suppose you have a data set, and you have to perform some analytics/problem statements on it. Because of the nature of problem statements, few of them can be done by HiveQL, few of them need Pig Latin and few of them need Map Reduce etc., to get the job done. In this situation External Table comes into picture- the same data set can be used to solve entire analytics instead of having different different copies of same data set for the different different tools. Here Hive don't need authority on the data set because several tools are going to use it.
There can be a scenario, where entire analytics/problem statements can be solved by only HiveQL. In such situation Internal Table comes into picture- Means you can put the entire data set into Hive's Warehouse and Hive is going to have complete authority on the data set.

Where does hive stores its table?

I am new to Hadoop and I just started working on Hive, I my understanding it provides a query language to process data in HDFS. With HiveQl we can create tables and load data into it from HDFS.
So my question is: where are those tables stored? Specifically if we have 100 GB file in our HDFS and we want to make a hive table out of that data what will be the size of that table and where is it stored?
If my understanding about this concept is wrong please correct me ..
If the table is 100GB you should consider an Hive External Table (as opposed to a "managed table", for the difference, see this).
With an external table the data itself will be still stored on the HDFS in the file path that you specify (note that you may specify a directory of files as long as they all have the same structure), but Hive will create a map of it in the meta-store whereas the managed table will store the data "in Hive".
When you drop a managed table, it drops the underlying data as opposed to dropping a hive external table which only drops the meta-data from the meta-store referencing that data.
Either way you are using only 100GB as viewed by the user and are taking advantage of the HDFS' robustness though duplication of the data.
Hive will create a directory on HDFS. If you didn't specify any location it will create a directory at /user/hive/warehouse on HDFS. After load command the files are moved to the /warehouse/tablename. You can also point to the HDFS directory if it contains partitions (if the files are partitioned), or use external table concept.