Hive metastore in HDFS

Usually you store the metadata in a JDBC-compliant database like MySQL. Is it possible to keep it in HDFS somehow, for example in HBase? I did not find anything useful in the wikis. Thanks.

No, this is not possible, as neither HDFS nor HBase supports a proper JDBC interface yet, which is what Hive's current metastore mechanism demands.


How to access Metastore from beeline?

I need to run some SQL queries (as here) directly against the Metastore. PS: the SHOW/DESCRIBE commands are not enough.
How can I access it as a database, and what is the database name of the Metastore? ... Is it possible nowadays (2019)?
NOTES
What is the Metastore? For me it is a very important element of the Hive architecture, and the final user needs some access to it... "All Hive implementations need a metastore service, where it stores metadata. It is implemented using tables in a relational database. By default, Hive uses a built-in Derby SQL server" [1]. Of course, in your context you need a "standard" Metastore. On my corporation's Hadoop cluster we are planning to standardize the Metastore (as a local, long-term standard), perhaps on PostgreSQL, together with a PostgREST API for external consumption of some SQL views of it.
The SQL definitions (table names, etc.) will be stable, and Metastore queries will be reliable, once the Metastore is a long-term local standard.
The Metastore is closely connected to Hive, where it is exposed as a Java API, but the Metastore is also a standard RDBMS and offers a standard (SQL) connection to the external universe. PS: my interest in the Metastore is in this external context.
Spark-shell solution
Spark accesses the Metastore under the hood, and it has first-class metadata access: a DataFrame's schema property exposes column names, types, etc., and each field offers a getComment method.
See https://stackoverflow.com/a/57857021/287948
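If you do have SQL access to the database backing the Metastore, the metadata the question asks about lives in a handful of standard tables of the Hive metastore schema. A minimal sketch, assuming a MySQL- or PostgreSQL-backed Metastore (connection details are your own, and on PostgreSQL the identifiers usually need double-quoted uppercase names):

```sql
-- DBS, TBLS, SDS and COLUMNS_V2 are standard tables of the Hive
-- metastore schema: databases, tables, storage descriptors, and columns.
SELECT d.NAME     AS db_name,
       t.TBL_NAME AS table_name,
       c.COLUMN_NAME,
       c.TYPE_NAME,
       c.COMMENT
FROM DBS d
JOIN TBLS t       ON t.DB_ID = d.DB_ID
JOIN SDS s        ON s.SD_ID = t.SD_ID
JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID
ORDER BY d.NAME, t.TBL_NAME, c.INTEGER_IDX;
```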

Audit Hive table

I have a Hive table, let's call it table A. My requirement is to capture all the DML and DDL operations on table A in table B. Is there any way to capture this?
Thanks in advance.
I have not come across any such tool; however, Cloudera Navigator helps to manage this. Refer to the detailed documentation.
Cloudera Navigator
Cloudera Navigator auditing supports tracking access to:
HDFS entities accessed by HDFS, Hive, HBase, Impala, and Solr services
HBase and Impala
Hive metadata
Sentry
Solr
Cloudera Navigator Metadata Server
Alternatively, if you are not using the Cloudera distribution, you can still access the Hive metastore log file under /var/log/hive/hadoop-cmf-hive-HIVEMETASTORE.log.out and check the changes applied to the different tables.
I haven't used Apache Atlas yet, but from the documentation it looks like it has an audit store and a Hive bridge. That works for operational events as well.
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/atlas-overview/content/apache_atlas_features.html

Presto and Hive

I'm trying to enable basic SQL querying of CSV files located in an S3 directory. Presto seemed like a natural fit (the files are tens of GB). As I went through the Presto setup, I tried creating a table using the Hive connector. It was not clear to me whether I only needed the Hive metastore to save my table configurations in Presto, or whether I had to create the tables in Hive first.
The documentation makes it seem that you can use Presto without having to configure Hive, while still using Hive syntax. Is that accurate? In my experience, I have not been able to connect to AWS S3.
Presto syntax is similar to Hive syntax, and for most simple queries the identical syntax works in both. However, there are some key differences that mean Presto and Hive are not entirely interchangeable. For example, in Hive you might use LATERAL VIEW EXPLODE, whereas in Presto you'd use CROSS JOIN UNNEST. There are many such nuanced syntactic differences between the two.
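As an illustration of that particular difference (the table and column names here are made up), the same array-flattening query in each dialect:

```sql
-- Hive: flatten an array column with LATERAL VIEW EXPLODE
SELECT t.id, x.item
FROM my_table t
LATERAL VIEW EXPLODE(t.items) x AS item;

-- Presto: the equivalent uses CROSS JOIN UNNEST
SELECT t.id, item
FROM my_table t
CROSS JOIN UNNEST(t.items) AS x (item);
```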
It is not possible to use vanilla Presto to analyze data on S3 without Hive. Presto provides only a distributed execution engine; it lacks metadata information about tables. Thus, the Presto coordinator needs Hive to retrieve table metadata in order to parse and execute a query.
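Concretely, that means first registering the S3 files as a table in the Hive metastore, after which Presto can query them through its Hive connector. A sketch, in which the bucket, columns, and delimiter are all assumptions:

```sql
-- Hive DDL: register the CSV files on S3 as an external table
CREATE EXTERNAL TABLE sales (
  id     INT,
  ts     STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://my-bucket/sales/';

-- Once the table exists in the metastore, Presto can query it, e.g.:
-- SELECT count(*) FROM hive.default.sales;
```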
However, you can use AWS Athena, which is managed Presto, to run queries on top of S3.
Another option: the recent 0.198 release of Presto adds the capability to connect to AWS Glue and retrieve table metadata for files on S3.
I know it's been a while, but if this question is still outstanding, have you considered using Spark? Spark connects easily with out-of-the-box methods and can query/process data living in S3 in CSV format.
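For instance, in Spark SQL (the path and CSV options below are placeholders, and an s3a connector is assumed to be configured):

```sql
-- Register the CSV files as a Spark SQL datasource table and query them
CREATE TABLE sales_csv
USING csv
OPTIONS (path 's3a://my-bucket/sales/', header 'true', inferSchema 'true');

SELECT count(*) FROM sales_csv;
```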
Also, I'm curious: what solution did you end up implementing to resolve your issue?

Use Hive metastore service WITHOUT Hadoop/HDFS

I know the question is a little bit strange. I love Hadoop and HDFS, but recently I have been working on SparkSQL with the Hive Metastore.
I want to use SparkSQL as a vertical SQL engine to run OLAP queries across different data sources like RDBs, Mongo, Elastic... without an ETL process. I then register the different schemas as external tables in the Metastore with the corresponding Hive storage handlers.
Moreover, HDFS is not used as a data source in my work, and MapReduce is already replaced by the Spark engine. So it sounds to me like Hadoop/HDFS is useless except as a base for the Hive installation, and I don't want to buy into all of it.
I wonder: if I only start the Hive metastore service, without Hadoop/HDFS, to support SparkSQL, what kind of issues will happen? Would I be putting myself into the jungle?
What you need is "Hive Local Mode" (search for "Hive, Map-Reduce and Local-Mode" on that page).
Also this may help.
This configuration is only suggested if you are experimenting locally. But in this case you only need the metastore.
Also, from here:
"Spark SQL uses a Hive Metastore, even when we don't configure it to. When not configured, it uses a default Derby DB as the metastore."
So this seems to be quite legal:
Arrange your metastore in Hive
Start Hive in local mode
Make Spark use the Hive metastore
Use Spark as an SQL engine for all data sources supported by Hive (registered, for example, via a storage handler, as sketched below)
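For reference, registering an external data source in the Metastore through a Hive storage handler might look like the following. This is only an illustration: the Elasticsearch storage handler from the es-hadoop project is used as an example, and the index, fields, and host are made up:

```sql
-- Register an Elasticsearch index as a Hive external table via the
-- es-hadoop storage handler (illustrative names and host)
CREATE EXTERNAL TABLE es_logs (
  ts      TIMESTAMP,
  level   STRING,
  message STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'logs/doc',
  'es.nodes'    = 'es-host:9200'
);
```

SparkSQL, pointed at the same Metastore, can then query es_logs alongside tables from the other registered sources.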

How to set up Hive metastore off Redshift

I couldn't find a way to set up a Hive metastore on Redshift. I am wondering if anyone has tried this. Also, since Redshift is PostgreSQL-compatible, maybe it is possible. Please share if you have any experience.
I am new to Hive and am using CDH 5.4.
Redshift is a DSS (decision support) database and by definition isn't suitable for storing the Hive metastore. Use the RDS service for that purpose.