Audit Hive table

I have a Hive table, let's call it table A. My requirement is to capture all DML and DDL operations on table A in table B. Is there any way to capture this?
Thanks in advance.

I have not come across a dedicated tool for this; however, Cloudera Navigator helps manage it. Refer to the detailed documentation.
Cloudera Navigator
Cloudera Navigator auditing supports tracking access to:
- HDFS entities accessed through the HDFS, Hive, HBase, Impala, and Solr services
- HBase and Impala operations
- Hive metadata
- Sentry
- Solr
- Cloudera Navigator Metadata Server
Alternatively, if you are not using the Cloudera distribution, you can still inspect the Hive metastore log file under /var/log/hive/hadoop-cmf-hive-HIVEMETASTORE.log.out and check the changes applied to the different tables.
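If you need the audit trail in a table of your own rather than in Navigator, one option is to scrape the metastore log yourself and load the extracted events into table B. The sketch below is a minimal, hypothetical example: the timestamp prefix and keyword matching are assumptions about the log layout, not a fixed Hive format, so adapt them to your deployment.

```python
# Hypothetical log scraper: assumes each line starts with a
# "YYYY-MM-DD HH:MM:SS" timestamp and mentions the table by name.
DDL_DML_OPS = ("CREATE", "ALTER", "DROP", "TRUNCATE", "INSERT", "UPDATE", "DELETE")

def extract_audit_events(log_lines, table_name):
    """Collect (timestamp, operation) pairs for lines mentioning table_name."""
    events = []
    target = table_name.upper()
    for line in log_lines:
        upper = line.upper()
        if target not in upper:
            continue
        for op in DDL_DML_OPS:
            if op in upper:
                # First 19 characters: the assumed "YYYY-MM-DD HH:MM:SS" prefix.
                events.append((line[:19], op))
                break
    return events
```

The resulting (timestamp, operation) pairs could then be inserted into the audit table B by any loading mechanism you already use.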

I haven't used Apache Atlas yet, but from the documentation it looks like it provides an audit store and a Hive bridge, which work for operational events as well.
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/atlas-overview/content/apache_atlas_features.html

Related

Hive ORC ACID table on Azure Blob Storage possible for MERGE

On HDFS, a Hive ORC ACID table for Hive MERGE is no issue. On S3 it is not possible.
For Azure HDInsight, I am not clear from the docs whether such a table on Azure Blob Storage is possible. Seeking confirmation or otherwise.
I am pretty sure it is a no-go. See the update I added to the answer, however.
According to the official Azure HDInsight documentation (the Azure HDInsight 4.0 overview, which includes a feature table), as far as I know Hive MERGE requires MapReduce, but HDInsight does not support MapReduce for Hive, so it is also not possible.
UPDATE by question poster:
HDInsight 4.0 doesn't support MapReduce for Apache Hive; use Apache Tez instead. So, with Tez it will still work, and according to https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-version-release, Spark with Hive 3 and the Hive Warehouse Connector are also options.
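For reference, a MERGE on such a setup would look roughly like the sketch below, run with Tez as the execution engine. The table and column names here are hypothetical, and the target must be a transactional (ACID) ORC table.

```sql
-- Hypothetical tables; the target must be a transactional ORC table.
SET hive.execution.engine=tez;

MERGE INTO orders AS t
USING staging_orders AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET amount = s.amount
WHEN NOT MATCHED THEN INSERT VALUES (s.order_id, s.amount);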

Accessing Spark Tables (Parquet on ADLS) using Presto from a Local Linux Machine

We would like to know whether we can access Spark external tables (with MS SQL as the metastore and external files on Azure Data Lake) through the Hive metastore service (Presto) from a Linux machine.
We are trying to access Spark Delta tables, whose Parquet files are on ADLS, through Presto. Below is the scenario; I would like to know if there is a possible way to achieve this. We are doing this as a POC only, and we believe knowing the answer will take us to the next step.
Our central data repository consists entirely of Spark Delta tables created by many pipelines, with the data stored in Parquet format and MS SQL as the external metastore. Data in these Spark tables is used by other teams/applications, and they would like to access it through Presto.
We learned that Presto uses the Hive metastore service to access Hive table details. We tried accessing the tables from Hive first (reasoning that if this works, Presto will too), but ran into problems with the different filesystems. We have set up Hadoop and Hive on a single Linux machine (versions 3.1.2 and 3.1.1 respectively). The Hive service connects to the SQL metastore and returns results for a few basic commands. However, when it comes to accessing the actual data stored as Parquet on an ADLS path, it fails with a filesystem exception. I understand the problem is the interaction of several filesystems (ADLS, HDFS, the local Linux filesystem), but I am not finding any blogs to guide us. Kindly help.
Hive Show Database command:
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> SHOW DATABASES;
OK
7nowrtpocqa
c360
default
digital
Hive Listing tables:
hive> SHOW TABLES;
OK
amzn_order_details
amzn_order_items
amzn_product_details
Query data from Orders table:
hive> select * from dlvry_orders limit 3;
OK
Failed with exception java.io.IOException:org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "dbfs"
Time taken: 3.37 seconds
How can I make my setup access the Data Lake files and bring in the data?
I believe my metastore stores the exact full ADLS path where the files reside. If so, how will Hive/Hadoop on my Linux machine resolve that path?
Even if it can recognize the path, in which configuration file (.xml) should I provide the credentials for accessing the Data Lake?
How can the different filesystems interact?
Kindly help. Thanks for all the inputs.
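One observation, hedged as an assumption since the full setup isn't shown: the "No FileSystem for scheme \"dbfs\"" error suggests the metastore records Databricks dbfs:/ table locations, which plain Hadoop cannot resolve outside Databricks; only locations with a scheme your Hadoop knows (such as abfss:// for ADLS Gen2, via the hadoop-azure jars) can work. If the locations were abfss:// paths, the credentials would go into core-site.xml, roughly like this sketch (STORAGEACCOUNT and the key value are placeholders):

```xml
<!-- core-site.xml sketch: ADLS Gen2 access with an account key.
     Requires the hadoop-azure jars on the classpath. -->
<configuration>
  <property>
    <name>fs.abfss.impl</name>
    <value>org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem</value>
  </property>
  <property>
    <name>fs.azure.account.key.STORAGEACCOUNT.dfs.core.windows.net</name>
    <value>YOUR_ACCOUNT_KEY</value>
  </property>
</configuration>
```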

What does a catalog mean in Presto?

I understand the concept of a DB, schemas within a DB, and tables in a schema. Where does a catalog fit into this whole space? Does this extend to Hive as well?
Presto accesses data via connectors, which are mounted in catalogs. The connector provides all of the schemas and tables inside the catalog.
For example, the Hive connector maps each Hive database to a schema: if the Hive connector is mounted as the hive catalog, and Hive contains a table clicks in database web, that table can be queried as hive.web.clicks.
In short, a catalog is a mount point.
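As a concrete illustration (the file name and metastore URI below are placeholders): in Presto, each catalog is defined by a properties file under etc/catalog/, and the file name becomes the catalog name.

```
# etc/catalog/hive.properties -- the file name ("hive") becomes the catalog name
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083
```

A query then addresses tables as catalog.schema.table, e.g. SELECT * FROM hive.web.clicks;.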

Use hive metastore service WITHOUT Hadoop/HDFS

I know the question is a little bit strange. I love Hadoop and HDFS, but recently I have been working on SparkSQL with the Hive metastore.
I want to use SparkSQL as a vertical SQL engine to run OLAP queries across different datasources such as RDBMSs, Mongo, and Elasticsearch, without an ETL process. I then register the different schemas as external tables in the metastore with the corresponding Hive storage handlers.
Moreover, HDFS is not used as a datasource in my work, and MapReduce is already replaced by the Spark engine. It sounds to me that Hadoop/HDFS is useless except as a base for installing Hive, and I don't want to buy into all of it.
I wonder: if I start only the Hive metastore service, without Hadoop/HDFS, to support SparkSQL, what kinds of issues will happen? Would I be putting myself into the jungle?
What you need is "Hive Local Mode" (search for "Hive, Map-Reduce and Local-Mode" on that page).
Also this may help.
This configuration is only suggested if you are experimenting locally, but in that case you only need the metastore.
Also from here;
Spark SQL uses a Hive metastore even when we don't configure it to; when not configured, it uses a default Derby DB as the metastore.
So this seems to be quite legitimate:
1. Arrange your metastore in Hive.
2. Start Hive in local mode.
3. Make Spark use the Hive metastore.
4. Use Spark as an SQL engine for all datasources supported by Hive.
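The steps above can be sketched as the following wiring; the hostnames, JDBC URL, and warehouse path are placeholders, and the property names come from the standard Hive/Spark configuration:

```
# hive-site.xml (on Spark's classpath, e.g. $SPARK_HOME/conf), sketch:
#   javax.jdo.option.ConnectionURL = jdbc:mysql://db-host/metastore
#   hive.metastore.uris            = thrift://metastore-host:9083
#   hive.metastore.warehouse.dir   = file:///tmp/warehouse   (local FS, no HDFS)

# Start only the metastore service -- no HiveServer2, no HDFS daemons:
hive --service metastore

# Then enable Hive support in Spark:
#   spark.sql.catalogImplementation = hive
```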

Can hive process table on different hdfs cluster?

Can one Hive instance store different tables across HDFS clusters and then run Hive QL on those tables?
My use case is that I have one Hive table on one HDFS cluster. I want to process it with Hive QL and have the output written to another HDFS cluster. I wish to achieve this directly with Hive alone, without a dump/copy/import step. So is that possible? I don't really think it is; however, I noticed a design page at:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27837073
in which it says:
"Note that, even today, different partitions/tables can span multiple dfs's, and hive does not enforce any restrictions. Those dfs's can be in different data centers also"
Apart from this, I failed to Google anything related.
Does anyone have any ideas on this? Thanks.
There are multiple ways to handle this. You can go with mirroring (using a tool like Apache Falcon); in that case you have the data stored in both clusters. If you want to query a different table across clusters without mirroring, use a tool like Apache Drill, which can join data from different datasources; it currently supports Hive, MongoDB, JSON, Kudu, etc.
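Since Hive table locations are full URIs, the quoted design note also suggests a direct approach: an external table can point at another cluster's namenode, assuming network connectivity between the clusters and compatible HDFS versions. A sketch with placeholder hosts and paths:

```sql
-- Hypothetical hosts/paths: an external table whose data lives on a
-- second cluster's HDFS, written to directly from a query on cluster 1.
CREATE EXTERNAL TABLE results_on_cluster2 (id INT, score DOUBLE)
LOCATION 'hdfs://namenode2:8020/data/results';

INSERT OVERWRITE TABLE results_on_cluster2
SELECT id, score FROM local_table WHERE score > 0.5;
```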