How to enable metastore caching in hive? - hive

I am facing a bottleneck at the metastore (MariaDB) level. While analyzing the queries generated by the Hive metastore service, I found that the same get-table queries are fired multiple times.
Is there any caching feature available at the metastore layer?

Related

How to access Metastore from beeline?

I need to run some SQL queries (as here) directly against the Metastore. PS: the SHOW/DESCRIBE commands are not enough.
How can I access it as a database, or what is the database name of the Metastore? ... Is this possible nowadays (2019)?
NOTES
What is the Metastore? For me it is a very important element of the Hive architecture, and end users need some access to it... "All Hive implementation need a metastore service, where it stores metadata. It is implemented using tables in relational database. By default, Hive uses built-in Derby SQL server", 1. Of course, in your context you need a "standard" Metastore. On my corporation's Hadoop cluster we are planning to standardize the Metastore (as a local, long-term standard), perhaps on PostgreSQL, along with a PostgREST API for external consumption of some SQL views from it.
The SQL definitions (table names, etc.) will be stable, and Metastore queries will be reliable, once the Metastore is a long-term local standard.
The Metastore is closely tied to Hive, where it is exposed as a Java API, but it is also a standard RDBMS and offers a standard connection (via SQL) to the outside world. PS: my interest in the Metastore is in this external context.
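Because the Metastore is an ordinary relational schema, a direct query is a plain SQL join. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for the real backing RDBMS, and a heavily simplified subset of the Metastore tables (DBS, TBLS, SDS, COLUMNS_V2). The sample rows and the helper names make_demo_metastore and describe_table are hypothetical, and the real schema has many more columns and varies by Hive version, so check your own deployment before relying on these names.

```python
import sqlite3

# Simplified stand-in for a few Metastore tables (assumption: the real
# schema has many more columns and differs between Hive versions).
SCHEMA = """
CREATE TABLE DBS (DB_ID INTEGER PRIMARY KEY, NAME TEXT);
CREATE TABLE SDS (SD_ID INTEGER PRIMARY KEY, CD_ID INTEGER, LOCATION TEXT);
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER, SD_ID INTEGER,
                   TBL_NAME TEXT, TBL_TYPE TEXT);
CREATE TABLE COLUMNS_V2 (CD_ID INTEGER, COLUMN_NAME TEXT, TYPE_NAME TEXT,
                         COMMENT TEXT, INTEGER_IDX INTEGER);
"""


def make_demo_metastore():
    """Build an in-memory database with one hypothetical table, 'orders'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO DBS VALUES (1, 'default')")
    conn.execute("INSERT INTO SDS VALUES (10, 100, 'hdfs:///warehouse/orders')")
    conn.execute("INSERT INTO TBLS VALUES (5, 1, 10, 'orders', 'MANAGED_TABLE')")
    conn.executemany(
        "INSERT INTO COLUMNS_V2 VALUES (?, ?, ?, ?, ?)",
        [(100, "order_id", "bigint", "primary key", 0),
         (100, "amount", "double", None, 1)],
    )
    return conn


def describe_table(conn, db_name, table_name):
    """Return (column, type, comment) rows in column order."""
    sql = """
        SELECT c.COLUMN_NAME, c.TYPE_NAME, c.COMMENT
        FROM DBS d
        JOIN TBLS t ON t.DB_ID = d.DB_ID
        JOIN SDS s ON s.SD_ID = t.SD_ID
        JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID
        WHERE d.NAME = ? AND t.TBL_NAME = ?
        ORDER BY c.INTEGER_IDX
    """
    return conn.execute(sql, (db_name, table_name)).fetchall()


if __name__ == "__main__":
    for row in describe_table(make_demo_metastore(), "default", "orders"):
        print(row)
```

Against a real Metastore the same join would run through that database's own driver, ideally with a read-only account, since writing to these tables directly can corrupt Hive's metadata.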
Spark-shell solution
Spark accesses the Metastore under the hood and has a first-class metadata method: a table is returned as a DataFrame whose schema property exposes names, types, etc., and offers a getComment method.
See https://stackoverflow.com/a/57857021/287948

Accessing Spark Tables (Parquet on ADLS) using Presto from a Local Linux Machine

We would like to know whether we can access Spark external tables, with MS SQL as the metastore and the external files on Azure Data Lake, using the Hive Metastore service (Presto) from a Linux machine.
We are trying to access Spark Delta tables, whose Parquet files are on ADLS, through Presto. The scenario is described below, and I would like to know whether there is a way to achieve this. We are doing this only as a POC, and we believe the answer will take us to the next step.
Our central data repository consists entirely of Spark Delta tables created by many pipelines. The data is stored in Parquet format, and MS SQL is the external metastore. The data in these Spark tables is used by other teams and applications, which would like to access it through Presto.
We learned that Presto uses the Hive metastore service to access Hive table details. We tried accessing the tables from Hive (thinking that if this works, Presto will also work), but we ran into problems with the different file systems. We have set up Hadoop and Hive on a single Linux machine; the versions are 3.1.2 and 3.1.1, respectively. The Hive service connects to the SQL metastore and returns results for a few basic commands. However, when it comes to accessing the actual data stored as Parquet at an ADLS path, it fails with a file system exception. I understand that the problem involves the interaction of several file systems (ADLS, HDFS, Linux), but I cannot find any blog that guides us. Kindly help.
Hive Show Database command:
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> SHOW DATABASES;
OK
7nowrtpocqa
c360
default
digital
Hive Listing tables:
hive> SHOW TABLES;
OK
amzn_order_details
amzn_order_items
amzn_product_details
Query data from Orders table:
hive> select * from dlvry_orders limit 3;
OK
Failed with exception java.io.IOException:org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "dbfs"
Time taken: 3.37 seconds
How can I make my setup access the Datalake files and bring in the data?
I believe my metastore should hold the exact full ADLS path where the files are stored. If so, how will my Hive/Hadoop on Linux understand that path?
If it can recognize the path, in which configuration file (any .xml) should I put the credentials for accessing the data lake?
How can the different file systems interact?
Kindly help. Thanks for all the inputs.
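The exception above names the scheme "dbfs": Hadoop chooses a file system implementation from the URI scheme of each table location stored in the metastore, and the local installation has none registered for that scheme. As a rough diagnostic sketch (not a fix), the snippet below lists which table locations use a scheme the local setup is not configured for. The sample locations, the set of configured schemes, and the helper names are all hypothetical; in practice the locations would come from the metastore itself.

```python
from urllib.parse import urlparse


def location_scheme(location):
    """Return the URI scheme of a table location, or '' if it has none."""
    return urlparse(location).scheme


def unsupported_locations(locations, configured_schemes):
    """List locations whose scheme has no file system connector configured.

    Locations without a scheme are skipped, because Hadoop resolves them
    against its default file system.
    """
    result = []
    for loc in locations:
        scheme = location_scheme(loc)
        if scheme and scheme not in configured_schemes:
            result.append(loc)
    return result


if __name__ == "__main__":
    # Hypothetical table locations, as they might be stored in the metastore.
    sample = [
        "dbfs:/mnt/delta/dlvry_orders",
        "hdfs://namenode:8020/warehouse/c360",
        "/tmp/local_table",
    ]
    # Assumption: only these schemes are usable on the local Linux machine.
    for loc in unsupported_locations(sample, {"hdfs", "file"}):
        print("no connector for:", loc)
```

A location flagged this way has to be either re-registered under a scheme the cluster supports or served by a connector whose jars and credentials are configured in the Hadoop configuration.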

Audit hive table

I have a Hive table; let's call it table A. My requirement is to capture all DML and DDL operations on table A in table B. Is there any way to do this?
Thanks in advance..
I have not come across any such tool; however, Cloudera Navigator helps to manage this. Refer to the detailed documentation.
Cloudera Navigator
Cloudera Navigator auditing supports tracking access to:
HDFS entities accessed by HDFS, Hive, HBase, Impala, and Solr services
HBase and Impala
Hive metadata
Sentry
Solr
Cloudera Navigator Metadata Server
Alternatively, if you are not using the Cloudera distribution, you can still read the Hive metastore log file at /var/log/hive/hadoop-cmf-hive-HIVEMETASTORE.log.out and check the changes applied to each table.
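As a rough sketch of the log-based approach, the snippet below counts metastore operations per table from log lines. It assumes audit entries of the form "cmd=<operation> : db=<database> tbl=<table>"; that format, the sample lines, and the helper name count_table_operations are assumptions for illustration, so compare them with your own log before using this.

```python
import re
from collections import Counter

# Assumed audit-line shape: "... cmd=<operation> : db=<db> tbl=<table> ..."
AUDIT_RE = re.compile(
    r"cmd=(?P<cmd>\w+)\s*:\s*db=(?P<db>\S+)\s+tbl=(?P<tbl>\S+)"
)


def count_table_operations(lines, db, table):
    """Count metastore operations recorded for one table."""
    counts = Counter()
    for line in lines:
        match = AUDIT_RE.search(line)
        if match and match.group("db") == db and match.group("tbl") == table:
            counts[match.group("cmd")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines; a real run would iterate over the log file.
    sample = [
        "INFO HiveMetaStore.audit: ugi=etl ip=10.0.0.1 cmd=get_table : db=default tbl=a",
        "INFO HiveMetaStore.audit: ugi=etl ip=10.0.0.1 cmd=alter_table: db=default tbl=a newtbl=a",
        "INFO HiveMetaStore.audit: ugi=etl ip=10.0.0.1 cmd=get_table : db=default tbl=a",
        "INFO HiveMetaStore.audit: ugi=bi ip=10.0.0.2 cmd=get_table : db=default tbl=b",
    ]
    print(count_table_operations(sample, "default", "a"))
```

To scan a real log, pass an open file object as lines, since the function only needs an iterable of strings. Note that this captures metadata operations seen by the metastore, not every row-level DML statement.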
I haven't used Apache Atlas yet, but from the documentation it looks like it has an audit store and a Hive bridge, which work for operational events as well.
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/atlas-overview/content/apache_atlas_features.html

Accessing Hive Data on HAWQ/PXF with HCatalog

I've configured Hortonworks HDP with Ambari services, and later added HAWQ and PXF. From some research I've seen that it is possible to query data stored in Hive through HCatalog, and since I have already loaded the dataset into Hive, that would make the work easier. However, I am doing some benchmarking: can someone tell me whether using HCatalog affects HAWQ's performance?
When HAWQ accesses PXF tables using HCatalog integration, it determines the format of the underlying table (and even of each partition) and uses the profile optimized for that particular format, so there should be no performance degradation.
To add to Oleksandr's point: when HAWQ queries HCatalog, the Hive catalog data is held only in memory, not on disk, within HAWQ, so there is no contention with native HAWQ tables. That said, external HCatalog queries won't be as performant as native HAWQ queries.

Use hive metastore service WITHOUT Hadoop/HDFS

I know the question is a little strange. I love Hadoop and HDFS, but I have recently been working on SparkSQL with the Hive Metastore.
I want to use SparkSQL as a vertical SQL engine to run OLAP queries across different data sources such as RDBs, Mongo, Elastic ... without an ETL process. To do that, I register the different schemas as external tables in the Metastore with the corresponding Hive storage handlers.
Moreover, HDFS is not used as a data source in my work, and MapReduce has already been replaced by the Spark engine. It seems to me that Hadoop/HDFS is useless here except as a base for installing Hive. I don't want to buy into all of it.
I wonder what kind of issues will arise if I start only the Hive metastore service, without Hadoop/HDFS, to support SparkSQL. Would I be putting myself in the jungle?
What you need is "Hive Local Mode" (search for "Hive, Map-Reduce and Local-Mode" in the page).
Also this may help.
This configuration is only suggested if you are experimenting locally. But in this case you only need the metastore.
Also from here;
Spark SQL uses the Hive Metastore even when we don't configure it to. When not configured, it uses a default Derby DB as the metastore.
So this seems to be quite legitimate:
Arrange your metastore in Hive
Start Hive in local mode
And make Spark use Hive metastore
Use Spark as an SQL engine for all datasources supported by Hive.
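The "make Spark use Hive metastore" step could look roughly like the PySpark sketch below. It assumes a metastore service reachable over Thrift; the host name, the application name, and the helper names metastore_conf and build_session are hypothetical, and 9083 is only the conventional metastore port, so adjust both to your setup.

```python
def metastore_conf(host, port=9083):
    """Spark settings pointing at a remote Hive metastore (Thrift URI)."""
    return {"hive.metastore.uris": f"thrift://{host}:{port}"}


def build_session(conf, app_name="metastore-only"):
    """Create a Hive-enabled SparkSession from the given settings."""
    # Imported lazily so metastore_conf works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()


if __name__ == "__main__":
    # Hypothetical host; requires PySpark and a running metastore service.
    spark = build_session(metastore_conf("metastore.example.internal"))
    spark.sql("SHOW DATABASES").show()
```

With a session built this way, tables registered in the metastore become visible to spark.sql. Whether a given storage handler works without any Hadoop libraries present is a separate question that this sketch does not settle.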