Getting the Hive metastore URL to use in other systems - hive

When I go into Hive on the command line, is there a way to get the Hive metastore URL that is being used?
I'm trying to connect another system to Hive but can't figure out what the metastore URL is.

Here is the command:
hive> set hive.metastore.uris;
Here is the output:
hive.metastore.uris=thrift://sandbox.hortonworks.com:9083
Using set, you can see all Hadoop and Hive parameters that are in effect when the Hive CLI is launched.
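The same URI can then be handed to whatever system you are connecting. As a minimal sketch, assuming a Spark build with Hive support and reusing the value from the output above, you could point a Spark shell at that metastore like this:

spark-shell --conf spark.hadoop.hive.metastore.uris=thrift://sandbox.hortonworks.com:9083

Alternatively, the same hive.metastore.uris property can go into the hive-site.xml that the other system reads.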

Related

Access Glue Catalog from Dev endpoint and local Zeppelin with spark sql

I have set up a local Zeppelin notebook to access a Glue dev endpoint. I'm able to run Spark and PySpark code and access the Glue catalog, but when I try spark.sql("show databases").show() or %sql show databases, only default is returned.
When spinning up an EMR cluster we have to choose "Use for Hive table metadata" to enable this, but I had hoped it would be the default for a Glue development endpoint, which does not seem to be the case. Is there any workaround for this?
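One thing to try, as an untested sketch, is configuring the Spark session explicitly to use the Glue Data Catalog client factory (the class EMR wires in when "Use for Hive table metadata" is selected), assuming the Glue catalog client jar is on the dev endpoint's classpath:

# PySpark sketch; untested on a Glue dev endpoint
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("hive.metastore.client.factory.class",
                 "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("show databases").show()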

Setting Remote hive metastore on postgresql for EMR

I am trying to set up a PostgreSQL DB as an external Hive metastore for AWS EMR.
I have tried hosting it on both EC2 and RDS.
I have already tried the steps given here.
But it doesn't go through; EMR fails at the provisioning step with the message
On the master instance (instance-id), application provisioning failed
I could not decipher anything from the failure log.
I also copied the PostgreSQL JDBC jar into
/usr/lib/hive/lib/ and /usr/lib/hive/jdbc/
in case EMR doesn't already have it, but still no luck!
Then I set up the system by manually editing hive-site.xml and setting these properties (a sketch of the resulting fragment follows the list):
javax.jdo.option.ConnectionURL
javax.jdo.option.ConnectionDriverName
javax.jdo.option.ConnectionUserName
javax.jdo.option.ConnectionPassword
datanucleus.fixedDatastore
datanucleus.schema.autoCreateTables
and had to run hive --service metatool -listFSRoot.
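For reference, the hand-edited hive-site.xml fragment might look like the following; every value here is a placeholder, since the post does not give the actual values:

<!-- placeholder values; adjust host, port, dbname, credentials and the datanucleus flags -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://[host]:5432/[dbname]</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>[user]</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>[pass]</value>
</property>
<property>
  <name>datanucleus.fixedDatastore</name>
  <value>[true|false]</value>  <!-- value not given in the post -->
</property>
<property>
  <name>datanucleus.schema.autoCreateTables</name>
  <value>[true|false]</value>  <!-- value not given in the post -->
</property>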
After these manual settings I was able to get EMR to use the postgres DB as a remote metastore.
Is there any way I can make it work using the configuration file as mentioned in official documentation?
Edit:
Configuration setting I am using for the remote MySQL metastore:
classification=hive-site,properties=[javax.jdo.option.ConnectionURL=jdbc:mysql://[host]:3306/[dbname]?createDatabaseIfNotExist=true,javax.jdo.option.ConnectionDriverName=org.mariadb.jdbc.Driver,javax.jdo.option.ConnectionUserName=[user],javax.jdo.option.ConnectionPassword=[pass]]
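For readability, the same settings expressed as an EMR configurations JSON file (host, dbname, user and pass remain placeholders):

[
  {
    "Classification": "hive-site",
    "Properties": {
      "javax.jdo.option.ConnectionURL": "jdbc:mysql://[host]:3306/[dbname]?createDatabaseIfNotExist=true",
      "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
      "javax.jdo.option.ConnectionUserName": "[user]",
      "javax.jdo.option.ConnectionPassword": "[pass]"
    }
  }
]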
I could never find a clean approach to configure this at the time of EMR startup itself.
The main problem is that EMR initializes the schema with MySQL using the command:
/usr/lib/hive/bin/schematool -initSchema -dbType MySQL
which should use postgres in our case.
The following manual steps allow you to configure postgres as the external metastore (a combined shell sketch follows these steps):
1) Start an EMR cluster with the Hive application and default configurations.
2) Stop Hive using the command:
sudo stop hive-server2
3) Copy the postgresql JDBC jar (stored in some S3 location) to /usr/lib/hive/lib/ on EMR.
4) Overwrite the default hive-site.xml in /usr/lib/hive/conf/ with a custom one containing the JDO configuration for the PostgreSQL instance running on the EC2 node.
5) Execute the command:
sudo /usr/lib/hive/bin/schematool -upgradeSchema -dbType postgres
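Put together, steps 2) through 5) might look like the following shell sketch run on the master node; the S3 paths and file names are placeholders for wherever you keep the jar and the custom hive-site.xml:

# 2) stop HiveServer2
sudo stop hive-server2

# 3) pull the PostgreSQL JDBC driver onto the Hive classpath (S3 path is a placeholder)
aws s3 cp s3://my-bucket/jars/postgresql-jdbc.jar /tmp/
sudo cp /tmp/postgresql-jdbc.jar /usr/lib/hive/lib/

# 4) replace the default hive-site.xml with one carrying the Postgres JDO settings
aws s3 cp s3://my-bucket/conf/hive-site.xml /tmp/
sudo cp /tmp/hive-site.xml /usr/lib/hive/conf/hive-site.xml

# 5) upgrade the metastore schema against Postgres
sudo /usr/lib/hive/bin/schematool -upgradeSchema -dbType postgres

# assumption: start HiveServer2 again afterwards
sudo start hive-server2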

Redshift External tables via Hive metastore

I have a Redshift DB set up and we periodically archive data into S3. I would like to create Redshift external tables on top of these archived files. AWS documentation suggests that this can be done either via Athena or via a Hive metastore. Since Athena is quite expensive, I would like to do this via the Hive metastore, but I'm struggling with the connectivity.
Below are the links of the steps that I followed:
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html
Creating the external schema works out fine, but while creating the table I get the following error:
Invalid operation: Hive Metastore error. HOST: XX.XXX.XXX.XX PORT: 9083 ERROR: Default TException.
Any idea what can be done here?
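For reference, the general shape of the DDL from those two pages, with hypothetical names throughout and assuming the metastore on port 9083 is reachable from the Redshift cluster's VPC:

-- external schema backed by the Hive metastore (host, database and role are placeholders)
create external schema archive_schema
from hive metastore
database 'archive_db'
uri 'XX.XXX.XXX.XX' port 9083
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole';

-- external table over the archived files in S3 (columns, format and location are placeholders)
create external table archive_schema.archived_events (
  event_id bigint,
  event_ts timestamp
)
stored as parquet
location 's3://my-archive-bucket/events/';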

Hive LLAP doesn't work with Parquet format

After finding out about Hive LLAP, I really want to use it.
I started an Azure HDInsight cluster with LLAP enabled. However, it doesn't seem to work any better than normal Hive. I have data stored in Parquet files, and I only see ORC files mentioned in LLAP-related docs and talks.
Does it also support Parquet format?
Answering my own question.
We reached out to Azure support. Hive LLAP only works with the ORC file format (as of 05.2017).
So with Parquet we either have to use Apache Impala for fast interactive queries (https://impala.incubator.apache.org) as an alternative to LLAP, or change the stored file format to ORC.
Update: This is work in progress and will no longer be the case with the next release of HDP. As of HDP 3.0, LLAP will support caching for the Parquet file format. This update should flow into HDInsight shortly after the 3.0 release.
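On releases where only ORC is cached, one way to take the conversion route is a simple CTAS in Hive; the table names here are hypothetical:

-- rewrite a Parquet-backed table into a new ORC-backed table
CREATE TABLE events_orc STORED AS ORC AS SELECT * FROM events_parquet;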

How to copy a Hive external table to Redshift without using Data Pipeline

I'd like to upload a Hive external table to AWS Redshift directly via the command line. I don't want to use the Data Pipeline. Do I have to upload the table to S3 first and then copy it to Redshift? Is there any way to do it directly?
You can load Redshift directly from remote hosts using SSH or, if you're using their EMR version of Hadoop, you can load directly from the HDFS file system.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
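For the EMR case, COPY reads straight from the cluster's HDFS via an emr:// path; a minimal sketch with hypothetical names (cluster id, HDFS path, table and IAM role), assuming the table data is stored as delimited text:

copy my_hive_table
from 'emr://j-XXXXXXXXXXXX/user/hive/warehouse/my_table/part-*'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
delimiter '\t';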