I am new to Hive, and some questions are confusing me very much.
First, after installing Hive, I just run hive, and then I can create and select tables. Where is the Hive server, and what is it used for?
Second, what is the metastore server used for? I know we need the metastore to access the metadata about Hive tables. Does that mean that if I start a metastore server, I can query it from another app and get that information?
The metastore server talks to a backend such as Derby/MySQL to store and retrieve table metadata. If any Hive component wants to get or set metadata, it calls the metastore APIs, such as getTable(tableName), createDatabase(dbName), etc. Basically, the metastore abstracts the backend (Derby/MySQL/Postgres) and provides a backend-independent API layer. Similar to HiveServer, it can also run as a server. If no metastore server is running, the Driver will load the metastore in its own process. If the metastore is running as a separate server, the Driver object communicates with the metastore over the network.
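And yes, if you start a metastore server, other applications can call those same APIs over Thrift. A minimal sketch, assuming the third-party hmsclient Python package and a metastore server listening at the hypothetical host metastore-host on the default port 9083:

    # Minimal sketch: calling the Hive metastore API over Thrift.
    # Assumes `pip install hmsclient`; host and table names are hypothetical.
    from hmsclient import hmsclient

    client = hmsclient.HMSClient(host="metastore-host", port=9083)
    with client as c:
        # These calls mirror the Thrift metastore API described above.
        print(c.get_all_databases())
        table = c.get_table("default", "my_table")
        print(table.sd.location)  # where the table stores its data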
Since I started using Azure Synapse Analytics, I created a Spark pool cluster, and then on the Spark pool cluster I created databases and tables using PySpark on top of Parquet files in Azure Data Lake Storage Gen2.
I used to be able to access my Spark databases and Parquet tables through SSMS using the serverless SQL endpoint, but now I can no longer see my Spark databases through the serverless SQL endpoint in SSMS. My Spark databases are still accessible through Azure Data Studio, but not through SSMS. Nothing has been deployed or altered on my side. Can you help resolve the issue? I would like to be able to access my Spark databases through SSMS.
(Screenshots: SQL serverless endpoint; Azure Synapse database)
If your Spark DB is built on top of Parquet files, as you said, the databases should sync to external tables in the serverless SQL pool just fine, and you should be able to see the synced SQL external tables in SSMS as well. Check this link for more info about metadata synchronization.
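For reference, a minimal PySpark sketch of the kind of table that gets synced (names are hypothetical, and spark is the session predefined in a Synapse notebook); only Parquet-backed (and, per the docs, CSV-backed) tables are shared to the serverless SQL pool:

    # Minimal sketch: a Parquet-backed Spark table that should sync to the
    # serverless SQL pool and show up in SSMS. Names are hypothetical.
    spark.sql("CREATE DATABASE IF NOT EXISTS mydb")
    df = spark.range(10)
    df.write.mode("overwrite").format("parquet").saveAsTable("mydb.mytable")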
If everything mentioned above checks out, then I'd suggest you navigate to Help + Support in the Azure Portal and file a support ticket with the details of your problem, so the engineering team can take a look and see whether there is an issue with your workspace.
I'm using Hive as my metastore database, together with the Hive standalone metastore for dealing with DDLs, via this Thrift client that implements the server's Thrift mapping.
I want to perform an MSCK (or some other method like it) to bulk-add partitions to new Hive tables.
But as far as I know, this Thrift mapping file doesn't expose an msck method.
However, I see that there is an Msck implementation inside the standalone server (I think it was implemented in Jira HIVE-17824), but it isn't in the HiveMetastore class (which, as I understand it, is the mapping of the Thrift server methods).
Does anyone know whether I can run MSCK against the standalone Hive metastore server via a Thrift client?
With Python I am currently using this client with success: PyHive.
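For instance, a minimal sketch with PyHive (host, port, username, and table name are hypothetical); note that PyHive talks to HiveServer2, not to the metastore directly:

    # Minimal sketch: running MSCK through HiveServer2 with PyHive.
    # Connection details and table name are hypothetical.
    from pyhive import hive

    conn = hive.Connection(host="hiveserver2-host", port=10000, username="hive")
    cursor = conn.cursor()
    cursor.execute("MSCK REPAIR TABLE mydb.mytable")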
And from DBeaver you can also do it (if the command must be run by a human): DBeaver.
EDIT (I did not realize that the question was about sending the command directly to the Hive metastore):
The interface called IMetaStoreClient (the protocol between the Hive client and the Hive metastore server) does not implement the MSCK command, because it does not need to. Let me explain the logic behind the MSCK command (a sketch of this logic follows after the steps):
Check that the table exists in the Hive metastore.
Scan for new partitions in the physical file system where the table stores its data. See the code in checkMetastore.
Create/add those new partitions. See the code in createPartitionsInBatches. This code ends up using the add_partitions method of the Hive metastore client.
See add_partitions. At this point, and not before, the client application sends data to the Hive metastore server.
Drop partitions which are no longer in the file system. See the code in dropPartitionsInBatches, which ends up using the dropPartitions method of the Hive metastore client.
See dropPartitions. Again, it is at this point, and not before, that the client application sends data to the Hive metastore server.
MSCK is not really a Hive metastore command. It requires logic implemented by the client running the MSCK command, so in your case you should add that logic to the client that you want to use.
For example, Spark already implements that logic for MSCK.
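A minimal sketch of that client-side logic, assuming the third-party hmsclient Python package, a single hypothetical dt= partition column, and a stub discover_partition_values helper that you would replace with a real HDFS/S3 directory listing:

    # Minimal sketch of MSCK-like logic implemented on the client side.
    # Assumes `pip install hmsclient`; host, database, table, and the dt=
    # partition layout are hypothetical.
    import copy

    from hmsclient import hmsclient
    from hmsclient.genthrift.hive_metastore import ttypes

    def discover_partition_values(table_location):
        """Stub: list the dt= values found under the table location."""
        return {"2023-01-01", "2023-01-02"}

    client = hmsclient.HMSClient(host="metastore-host", port=9083)
    with client as c:
        table = c.get_table("mydb", "mytable")                      # step 1: table must exist
        existing = {tuple(p.values)
                    for p in c.get_partitions("mydb", "mytable", -1)}
        new_parts = []
        for value in discover_partition_values(table.sd.location):  # step 2: scan the file system
            if (value,) in existing:
                continue
            sd = copy.deepcopy(table.sd)  # reuse the table's storage descriptor
            sd.location = "{}/dt={}".format(table.sd.location, value)
            new_parts.append(ttypes.Partition(values=[value],
                                              dbName="mydb",
                                              tableName="mytable",
                                              sd=sd,
                                              parameters={}))
        if new_parts:
            c.add_partitions(new_parts)                              # step 3: add in one call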
I am trying to connect to Hive databases with a client. I have tried using DBeaver with the downloaded Hive driver, but then I noticed that there is a Kerberos instance in the middle, and it seems that the DBeaver driver doesn't support Kerberos.
Is there a Windows client for querying Hive databases that is easy to plug in, considering the Kerberos instance?
Thanks in advance.
I want to have a central Hive metastore to consume from Databricks, Spectrum, etc.
Is it possible to set it up without installing Hadoop?
Yes, Hive metastore installation does not require Hadoop.
Querying data from the Hive metastore requires a Hive client (within Spark) and a Hadoop-compatible filesystem (such as S3).
AWS Glue Data Catalog is the recommended system nowadays, not RDS
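If you do go the metastore route, here is a minimal PySpark sketch of attaching a Spark session to such a central metastore (the Thrift URI is hypothetical; Databricks exposes equivalent settings through its cluster Spark config):

    # Minimal sketch: pointing Spark at a central Hive metastore over Thrift.
    # The metastore URI is hypothetical; no local Hadoop install is needed,
    # only a Hadoop-compatible filesystem (such as S3) for the table data.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("central-metastore")
             .config("hive.metastore.uris", "thrift://metastore-host:9083")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("SHOW DATABASES").show()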
I installed Hive 1.2.1 and configured it to work with Hadoop 2.7.
But I didn't set up a metastore for Hive with Derby or MySQL.
And I also don't have a copy of hive-site.xml under $HIVE_HOME/conf.
My question is: how am I still able to create databases and tables in Hive? Where is all this metadata stored?
Appreciate your insight.
Thanks in advance.
By default, Hive uses Derby and starts the metastore (backed by Derby) in embedded mode; the metastore and HiveServer run in the same process. I believe Hive initializes the metastore for you in embedded mode, as sketched in the configuration below.
http://www.cloudera.com/documentation/archive/cdh/4-x/4-2-0/CDH4-Installation-Guide/cdh4ig_topic_18_4.html
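Concretely, with no hive-site.xml present, Hive falls back to its built-in defaults, which are equivalent to a configuration like the following (the embedded Derby URL creates a metastore_db directory in whichever directory you launch Hive from):

    <!-- Effective defaults when no hive-site.xml is present (Hive 1.x). -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <!-- creates ./metastore_db relative to where you start Hive -->
      <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    </property>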