I want to know how many times my hive tables are accessed.
For each table I would like the table name and the number of times it was accessed, e.g.:
tableName    No. of accesses
Table1       100
Table2       80
....         ....
Table n      n
Is there a Hive or Linux command (or any code) to do this? I also tried to find the last access time of my table using
describe formatted database.table
But it shows me
Name               type
LastAccessTime:    UNKNOWN
Any suggestions/help is greatly appreciated.
Check out your hive audit log.
Audit Logs
Audit logs are logged from the Hive metastore server for
every metastore API invocation.
An audit log has the function and some of the relevant function
arguments logged in the metastore log file. It is logged at the INFO
level of log4j, so you need to make sure that the logging at the INFO
level is enabled (see HIVE-3505). The name of the log entry is
"HiveMetaStore.audit".
Audit logs were added in Hive 0.7 for secure client connections
(HIVE-1948) and in Hive 0.10 for non-secure connections (HIVE-3277;
also see HIVE-2797).
There are also HDFS audit logs that you could use to derive access to Hive tables.
If you have Ranger enabled, that is your best bet for seeing who is accessing what.
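As a rough sketch of how the audit log could be turned into per-table counts: the snippet below assumes audit lines carry the logger name "HiveMetaStore.audit" followed by fields of the form cmd=<call> : db=<db> tbl=<table>. The sample lines, user names and IP addresses are made up for illustration, and the exact layout of your log may differ, so adjust the regular expression to match it.

```python
import re
from collections import Counter

# Assumed line shape: "... HiveMetaStore.audit ... cmd=<call> : db=<db> tbl=<table>"
AUDIT_RE = re.compile(
    r"HiveMetaStore\.audit.*?cmd=(\w+)\s*:\s*db=(\S+)\s+tbl=([A-Za-z0-9_]+)"
)

def count_table_accesses(lines):
    """Count metastore API calls per db.table found in audit log lines."""
    counts = Counter()
    for line in lines:
        match = AUDIT_RE.search(line)
        if match:
            _cmd, db, tbl = match.groups()
            counts[f"{db}.{tbl}"] += 1
    return counts

# Hypothetical sample lines, not taken from a real cluster.
sample_log = [
    "2020-01-01 10:00:00 INFO HiveMetaStore.audit: ugi=alice ip=10.0.0.1 cmd=get_table : db=default tbl=table1",
    "2020-01-01 10:00:05 INFO HiveMetaStore.audit: ugi=bob ip=10.0.0.2 cmd=get_table : db=default tbl=table1",
    "2020-01-01 10:00:09 INFO HiveMetaStore.audit: ugi=bob ip=10.0.0.2 cmd=get_table : db=default tbl=table2",
    "2020-01-01 10:00:10 INFO metastore.HiveMetaStore: 0: get_all_databases",
]

for table, n in count_table_accesses(sample_log).most_common():
    print(table, n)
```

Note that this counts metastore API invocations, not queries: a single query can trigger several calls for the same table, so treat the numbers as a relative measure of activity.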
I am using Hive based on HDInsight Hadoop cluster -- Hadoop 2.7 (HDI 3.6).
We have some old Hive tables that point to storage accounts that no longer exist. The Hive Metastore still contains references to the deleted storage accounts, so if I try to drop such a Hive table, I get an error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: org.apache.hadoop.fs.azure.AzureException org.apache.hadoop.fs.azure.AzureException: No credentials found for account <deletedstorage>.blob.core.windows.net in the configuration, and its container data is not accessible using anonymous credentials. Please check if the container exists first. If it is not publicly available, you have to provide account credentials.)
Manipulating the Hive Metastore directly is risky, as it could leave the Metastore in an invalid state.
Is there any way to get rid of these orphan tables?
I am getting an "Error in acquiring locks" when running count(*) on partitioned tables.
The table has 365 partitions. When the query is filtered to 350 partitions or fewer, it works fine.
When I include more partitions in the query, it fails with the error.
I am working on Hive-managed ACID tables, with the following default values:
hive.support.concurrency=true //cannot be set to false: it throws "<table> is missing from the ValidWriteIdList config: null"; it must be true for ACID reads and writes.
hive.lock.manager=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.txn.strict.locking.mode=false
hive.exec.dynamic.partition.mode=nonstrict
I tried increasing and decreasing the values of the following settings in a Beeline session:
hive.lock.numretries
hive.unlock.numretries
hive.lock.sleep.between.retries
hive.metastore.batch.retrieve.max={default 300} //changed to 10000
hive.metastore.server.max.message.size={default 104857600} // changed to 10485760000
hive.metastore.limit.partition.request={default -1} //did not change as -1 is unlimited
hive.metastore.batch.retrieve.max={default 300} //changed to 10000.
hive.lock.query.string.max.length={default 10000} //changed to higher value
I am using an HDI 4.0 Interactive Query (LLAP) cluster; the metastore is backed by the default SQL Server database provisioned with the cluster.
The problem is NOT due to service tier of the hive metastore database.
Based on the symptom, it is most probably due to too many partitions in one query.
I have hit the same issue several times.
In hivemetastore.log, you should be able to see an error like this:
metastore.RetryingHMSHandler: MetaException(message:Unable to update transaction database com.microsoft.sqlserver.jdbc.SQLServerException: The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:578)
This happens because, in the Hive metastore, each partition involved in the Hive query requires at most 8 parameters to acquire a lock.
Some possible workarounds:
Decompose the query into multiple sub-queries that each read from fewer partitions.
Reduce the number of partitions by setting different partition keys.
Remove partitioning if partition keys don't have any filters.
The following parameters control the batch size of the INSERT queries generated by direct SQL. Their default value is 1000. Set both of them to 100 (as a good starting point) in the Custom hive-site section of the Hive configs via Ambari, and restart ALL Hive-related components (including the Hive metastore).
hive.direct.sql.max.elements.values.clause=100
hive.direct.sql.max.elements.in.clause=100
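The first workaround above (splitting the query so that each piece reads fewer partitions) could be scripted roughly as follows. This is only a sketch: the table name "sales", the partition column "ds" and the partition values are hypothetical, and the batch size of 200 is an arbitrary choice below the roughly 350 partitions at which the question reports failures.

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_count_queries(table, partition_col, partition_values, max_partitions_per_query):
    """Build one COUNT(*) query per batch of partitions."""
    queries = []
    for batch in chunked(partition_values, max_partitions_per_query):
        in_list = ", ".join(f"'{value}'" for value in batch)
        queries.append(
            f"SELECT COUNT(*) FROM {table} WHERE {partition_col} IN ({in_list})"
        )
    return queries

# Hypothetical table with 365 daily partitions.
partition_values = [f"day_{i:03d}" for i in range(1, 366)]
queries = build_count_queries("sales", "ds", partition_values, 200)

print(len(queries))          # number of sub-queries to run
print(queries[-1][:80])      # start of the last sub-query
```

Each generated statement can then be run separately (for example from Beeline) and the partial counts added together on the client side.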
We also faced the same error in HDInsight and after doing many configuration changes similar to what you have done, the only thing that worked is scaling our Hive Metastore SQL DB server.
We had to scale it all the way to a P2 tier with 250 DTUs for our workloads to run without these lock exceptions. As you may know, the SQL server's IOPS and response time improve with the tier and DTU count, so we suspected that Metastore performance was the root cause of these lock exceptions as our workloads increased.
The following link provides information about DTU-based performance variation in Azure SQL servers.
https://learn.microsoft.com/en-us/azure/sql-database/sql-database-service-tiers-dtu
Additionally, as far as I know, the default Hive metastore that gets provisioned when you do not provide an external DB at cluster creation is just an S1 tier DB. This is not suitable for high-capacity workloads. As a best practice, always provision your metastore external to the cluster and attach it at cluster provisioning time. This gives you the flexibility to connect the same Metastore to multiple clusters (so that your Hive schema can be shared across clusters, e.g. Hadoop for ETL and Spark for processing / machine learning), and gives you full control to scale your metastore up or down at any time.
The only way to scale the default metastore is by engaging Microsoft support.
We faced the same issue in HDInsight. We solved it by upgrading the metastore.
The default metastore had only 5 DTUs, which is not recommended for production environments. So we migrated to a custom metastore, spun up an Azure SQL Server database (P2 tier or above, 250 DTUs), and set the properties below:
hive.direct.sql.max.elements.values.clause=200
hive.direct.sql.max.elements.in.clause=200
These values are set because SQL Server cannot process more than 2100 parameters in one request. When a query touches more than 348 partitions you hit this issue, because each partition creates 8 parameters in the metastore (8 x 348).
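The reasoning behind these batch sizes can be checked with a little arithmetic. The sketch below assumes, as a simplification, that every element in a batch contributes up to 8 parameters (the upper bound quoted in the answers) and that the only constraint is SQL Server's 2100-parameter limit from the error message. The function names are illustrative, not part of Hive.

```python
SQL_SERVER_MAX_PARAMS = 2100  # limit quoted in the SQLServerException message
PARAMS_PER_ELEMENT = 8        # assumed upper bound per partition, as stated in the answers

def params_needed(batch_size, params_per_element=PARAMS_PER_ELEMENT):
    """Worst-case number of parameters for one batched statement."""
    return batch_size * params_per_element

def max_safe_batch(limit=SQL_SERVER_MAX_PARAMS, params_per_element=PARAMS_PER_ELEMENT):
    """Largest batch size whose worst-case parameter count stays within the limit."""
    return limit // params_per_element

for batch_size in (100, 200, 1000):
    needed = params_needed(batch_size)
    verdict = "fits" if needed <= SQL_SERVER_MAX_PARAMS else "exceeds the limit"
    print(f"batch size {batch_size}: up to {needed} parameters, {verdict}")

print("largest safe batch size under these assumptions:", max_safe_batch())
```

Under these assumptions the default of 1000 can need up to 8000 parameters, while 100 and 200 stay at 800 and 1600, which is consistent with both suggested settings being below the limit.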
I am creating tables in Hive and inserting into them; the files are created on HDFS, and some on external storage (S3).
Suppose I have created 10 tables. Is there a system table in Hive where I can find information about the tables created by the user? (For example, Teradata has DBC.tablesv, which holds information on all user-defined tables.)
You can find where your metastore is configured in the hive-site.xml file.
Its usual location is under /etc/hive/{$hadoop_version}/ or /etc/hive/conf/.
grep for "hive.metastore.uris" or "javax.jdo.option.ConnectionURL" to see which db you are using for the metastore. The credentials should also be there.
If, for example, your metastore is on a MySQL server, you can run queries like
SELECT * FROM TBLS;
SELECT * FROM PARTITIONS;
etc
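As an alternative to grep, the hive-site.xml lookup described above can be done with a few lines of Python. This is a minimal sketch that assumes the standard Hadoop-style layout of <property> elements with <name> and <value> children; the host names and URLs in the sample are placeholders, not real settings.

```python
import xml.etree.ElementTree as ET

def read_hive_properties(xml_text, names):
    """Return {name: value} for the requested property names in a hive-site.xml document."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name in names:
            found[name] = prop.findtext("value")
    return found

# Placeholder configuration for illustration only.
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://db-host:3306/hive</value>
  </property>
  <property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
  </property>
</configuration>"""

wanted = {"hive.metastore.uris", "javax.jdo.option.ConnectionURL"}
settings = read_hive_properties(SAMPLE_HIVE_SITE, wanted)
for name, value in sorted(settings.items()):
    print(name, "=", value)
```

To use it on a real cluster, read the file contents (for example from /etc/hive/conf/hive-site.xml) and pass the text to read_hive_properties.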
You can't query (as in SELECT ... FROM...) the metadata from within Hive.
You do, however, have commands that display that information, e.g. show databases, show tables, desc MyTable, etc.
I'm not sure I fully understood your question. If you mean information about how the table was created, such as the query itself, the location on HDFS, table properties, etc., you can try:
SHOW CREATE TABLE <table>;
If you need to retrieve a list of the column names and data types, try:
DESCRIBE <table>;
I am working with pgsql. I want to save the audit records to a file (spreadsheet, Word document, ...). I have a web application, and any change (insert, delete, update) made in the app is recorded in an audit log table. But the database has many tables, and each table has more than 5000 rows, so it is difficult to keep such a bulk of audit data as a table. I therefore want to save the audit log as a file from pgsql. How can I implement this?
Thank you.
Veena, I worked with pgsql about two years ago. To my knowledge, to configure a PostgreSQL database as a standalone audit log database, or to save an audit file, follow these steps: first gather the database information, then create the audit store schema, configure a PostgreSQL Server data source for CA SiteMinder, point the Policy Server to the database, and finally restart the Policy Server.
You can create the logging schema so that the pgsql server database can store audit logs.
To create the audit logs, open sm_postgresql_logs.sql in a text editor and copy the contents of the entire file. Start a SQL client, such as psql, and log in as the user who administers the Policy Server database. Select the database instance from the database list, paste the schema from sm_postgresql_logs.sql into the query, and execute the query.
The audit log store schema is created in the database.
Hope this will help you.
Summary of the issue:
Whenever I insert data into a dynamically partitioned table, far too much time is being spent updating the partition statistics in the metastore.
More details:
I have several queries that select data from one hive table and insert it into another table that is dynamically partitioned into about 8000 partitions. The queries complete quickly and correctly. The output files are copied into the partition directories very quickly. But then this happens for every partition:
INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(253)) - ugi=hive ip=unknown-ip-addr cmd=append_partition : db=default tbl=some_table[14463,1410]
WARN hive.log (MetaStoreUtils.java:updatePartitionStatsFast(284)) - Updating partition stats fast for: some_table
WARN hive.log (MetaStoreUtils.java:updatePartitionStatsFast(292)) - Updated size to 1042
Each such partition update is taking about 500 milliseconds. But Hive puts an exclusive lock on the entire table while these updates are happening, and with 8000 such partitions this means that my table is locked for an unacceptably long time.
It seems to me that there must be some way to disable these partition statistics without affecting the performance of Hive too terribly; after all, I could just manually copy files to these partitions without involving Hive at all.
I've tried setting some of the "hive.stats" settings, but there's very little documentation on them, so I don't know exactly what they're supposed to do. Specifically, I've tried setting:
set hive.stats.autogather=false;
set hive.stats.collect.rawdatasize=false;
Any suggestions on how to prevent Hive from trying to keep track of partition statistics would be greatly appreciated!
Using set hive.stats.autogather=false will not take effect within the application. The reason is that when the Hive connection is created, the Hive configuration is passed to the metastore, and once it has been configured it can no longer be modified.
You can disable the statistics in two ways:
1. Via the Hive shell
Using the Hive shell, type hive --hiveconf hive.stats.autogather=false.
2. Updating hive-site.xml
Update the following in hive-site.xml and restart the Hive session.
<property>
<name>hive.stats.autogather</name>
<value>false</value>
</property>
https://cwiki.apache.org/confluence/display/Hive/StatsDev
According to the Hive documentation, this should be able to disable the statistics on partitions:
set hive.stats.autogather=false;