Where does the Hive metastore store lock info? - hive

I am trying to create indexes on one Hive table and I'm getting this error:
FAILED: Error in acquiring locks: Lock acquisition for
LockRequest(component:[LockComponent(type:EXCLUSIVE, level:PARTITION,
dbname:,
tablename:jobs_indx_jobs_title,
partitionname:year=2016/month=1/sourcecd=BYD),
LockComponent(type:SHARED_READ, level:TABLE, dbname:,
tablename:jobs), LockComponent(type:SHARED_READ, level:PARTITION,
dbname:, tablename:jobs,
partitionname:year=2016/month=1/sourcecd=BD)], txnid:0, user:hadoop,
hostname:Hortorn-NN-2.b2vheq12ivkfdsjdskdf3nba.dx.internal.cloudapp.net)
timed out after 5504043ms. LockResponse(lockid:58318, state:WAITING)
I want to know: in which table does the Hive metastore store the lock info that it shows while executing the "show locks" command?

It's not in the Metastore, it's in ZooKeeper znodes...
Just read the documentation and the design decisions from back in 2010 in HIVE-1293.

If the table is non-transactional, try setting hive.support.concurrency=false.
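If you want to look at those locks directly, here is a minimal sketch using the kazoo Python client. It assumes the ZooKeeper-based lock manager is in use and the default hive.zookeeper.namespace; the quorum address is a placeholder for your cluster.
from kazoo.client import KazooClient  # third-party ZooKeeper client (pip install kazoo)

zk = KazooClient(hosts="zk-host-1:2181")  # placeholder ZooKeeper quorum address
zk.start()
# With the ZooKeeper lock manager, lock znodes live under hive.zookeeper.namespace
# (default "hive_zookeeper_namespace"); each locked table/partition gets its own children
for znode in zk.get_children("/hive_zookeeper_namespace"):
    print(znode)
zk.stop()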

Related

Databricks error IllegalStateException: The transaction log has failed integrity checks

I have a table that I need to drop, delete the transaction log for, and recreate, but while I am trying to drop it I get the following error.
I have run a REPAIR TABLE statement on this one, which could be responsible for the error, but I'm not sure.
IllegalStateException: The transaction log has failed integrity checks. We recommend you contact Databricks support for assistance. To disable this check, set spark.databricks.delta.state.corruptionIsFatal to false. Failed verification of:
Table size (bytes) - Expected: 0 Computed: 63233
Number of files - Expected: 0 Computed: 1
We think this may just be related to s3 eventual consistency. Please try waiting a few extra minutes after deleting the Delta directory before writing new data to it. Also, normal MSCK REPAIR TABLE doesn't do anything for Delta, as Delta doesn't use the Hive Metastore to store the partitions. There is an FSCK REPAIR TABLE, but that is for removing the file entries from the transaction log of a Databricks Delta table that can no longer be found in the underlying file system.
We don't recommend overwriting a Delta table in place, like you might with a normal Spark table. Delta is not like a normal table - it's a table, plus a transaction log, and many versions of your data (unless fully vacuumed). If you want to overwrite parts of the table, or even the whole table, you should use Delta's delete functionality. If you want to completely change the table, consider writing to an entirely new directory, such as /table/v2/... and separately deleting the other table.
To stop the issue from occurring, you can use the command below (in a PySpark notebook):
spark.conf.set("spark.databricks.delta.state.corruptionIsFatal", False)
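As a rough sketch of the "write to an entirely new directory" suggestion above (the DataFrame name and the paths are placeholders, not taken from your setup):
# Write the replacement data to a fresh location instead of overwriting the old directory in place
df.write.format("delta").save("/table/v2/")
# Once readers and the metastore have been repointed at /table/v2/, delete the old directory separately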

SparkSQL JDBC writer fails with "Cannot acquire locks error"

I'm trying to insert 50 million rows from a Hive table into a SQL Server table using the SparkSQL JDBC writer. Below is the line of code that I'm using to insert the data:
mdf1.coalesce(4).write.mode(SaveMode.Append).jdbc(connectionString, "dbo.TEST_TABLE", connectionProperties)
The Spark job is failing after processing 10 million rows with the error below:
java.sql.BatchUpdateException: The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.
But the same job succeeds if I use the line of code below:
mdf1.coalesce(1).write.mode(SaveMode.Append).jdbc(connectionString, "dbo.TEST_TABLE", connectionProperties)
I'm trying to open 4 parallel connections to SQL Server to optimize performance, but the job keeps failing with the "cannot acquire locks" error after processing 10 million rows. Also, if I limit the dataframe to just a few million rows (less than 10 million), the job succeeds even with four parallel connections.
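For reference, here is a hedged PySpark sketch of the same write with the JDBC batchsize and isolationLevel options set explicitly (the values are only illustrative and I have not confirmed them against this workload):
# PySpark equivalent of the write above, with an explicit JDBC batch size and isolation level
connection_properties = {
    "user": "placeholder_user",
    "password": "placeholder_password",
    "batchsize": "10000",                # rows sent per JDBC batch (Spark's default is 1000)
    "isolationLevel": "READ_COMMITTED",  # transaction isolation used by the JDBC writer
}
mdf1.coalesce(2).write.mode("append").jdbc(connectionString, "dbo.TEST_TABLE", properties=connection_properties)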
Can anybody tell me whether SparkSQL can be used to export huge volumes of data into an RDBMS, and whether I need to make any configuration changes on the SQL Server table?
Thanks in Advance.

Can't create ORC external tables on HAWQ PXF

I'm using Pivotal HAWQ with Ambari and now I'm trying to run some queries over ORC Hive tables with HAWQ.
Previously I was able to run the external queries in psql using SELECT * FROM hcatalog.hive-db-name.hive-table-name distributed randomly;
But now I get this error every time:
Exception report message java.lang.Exception: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.
Can you provide some help on how to get past this?
I believe you have missed a step to update your pxf-profiles.xml file that's required after upgrading to HDB 2.2. Please see the instructions listed here:
http://hdb.docs.pivotal.io/220/hdb/install/install-ambari.html#post-install-212-req

SonarQube 5.1.1 analysis results for different branches giving timeouts with MySQL DB

We are using SonarQube 5.1.1 with a MySQL database. We are facing timeout issues with the database. We ran the MySQL tuning primer script and made some changes to the InnoDB timeout (increased it in /etc/my.cnf), but it made no difference. One of the suggestions from the MySQL tuner output is:
"of 7943 temp tables, 40% were created on disk"
Note: BLOB and TEXT columns are not allowed in memory tables.
Are there any suggestions for dealing with Sonar analysis results for a bunch of different branches?
Perhaps using Postgres instead of MySQL?
We get errors as shown below:
Failed to process analysis report 8 of project "X"
org.apache.ibatis.exceptions.PersistenceException:
Error committing transaction. Cause: org.apache.ibatis.executor.BatchExecutorException:
org.sonar.core.issue.db.IssueMapper.insert (batch index #1) failed.
Cause: java.sql.BatchUpdateException: Lock wait timeout exceeded; try
restarting transaction
Cause: org.apache.ibatis.executor.BatchExecutorException: org.sonar.core.issue.db.IssueMapper.insert (batch index #1) failed.
Cause: java.sql.BatchUpdateException: Lock wait timeout exceeded; try
restarting transaction
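For reference, the "Lock wait timeout exceeded" message maps to MySQL's innodb_lock_wait_timeout. A minimal sketch for checking and raising it, assuming the mysql-connector-python package and placeholder credentials:
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(host="localhost", user="sonar", password="sonar", database="sonar")
cur = conn.cursor()
cur.execute("SHOW VARIABLES LIKE 'innodb_lock_wait_timeout'")
print(cur.fetchone())  # MySQL's default is 50 seconds
cur.execute("SET GLOBAL innodb_lock_wait_timeout = 300")  # needs SUPER privilege; reverts on server restart
conn.close()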

Hive: Any way to disable partition statistics?

Summary of the issue:
Whenever I insert data into a dynamically partitioned table, far too much time is being spent updating the partition statistics in the metastore.
More details:
I have several queries that select data from one hive table and insert it into another table that is dynamically partitioned into about 8000 partitions. The queries complete quickly and correctly. The output files are copied into the partition directories very quickly. But then this happens for every partition:
INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(253)) - ugi=hive ip=unknown-ip-addr cmd=append_partition : db=default tbl=some_table[14463,1410]
WARN hive.log (MetaStoreUtils.java:updatePartitionStatsFast(284)) - Updating partition stats fast for: some_table
WARN hive.log (MetaStoreUtils.java:updatePartitionStatsFast(292)) - Updated size to 1042
Each such partition update is taking about 500 milliseconds. But Hive puts an exclusive lock on the entire table while these updates are happening, and with 8000 such partitions this means that my table is locked for an unacceptably long time.
It seems to me that there must be some way to disable these partition statistics without affecting the performance of Hive too terribly; after all, I could just manually copy files to these partitions without involving Hive at all.
I've tried setting some of the "hive.stats" settings, but there's very little documentation on these settings so I don't know exactly what they're supposed to do. Specifically, I've tried setting:
set hive.stats.autogather=false;
set hive.stats.collect.rawdatasize=false;
Any suggestions on how to prevent Hive from trying to keep track of partition statistics would be greatly appreciated!
Setting hive.stats.autogather=false from within the application will not take effect. The reason is that when the Hive connection is created, it passes the Hive configs to the metastore, and once it is configured it cannot be modified anymore.
You can disable the statistics in two ways:
1. Via the Hive shell
Using the Hive shell, type hive --hiveconf hive.stats.autogather=false.
2. Updating hive-site.xml
Update the following in hive-site.xml and restart the Hive session.
<property>
  <name>hive.stats.autogather</name>
  <value>false</value>
</property>
https://cwiki.apache.org/confluence/display/Hive/StatsDev
According to the Hive documentation, this should disable the statistics on partitions:
set hive.stats.autogather=false;
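If you connect from Python rather than the Hive shell, a minimal sketch of the same idea is to pass the override when the connection is opened, so it is in place before the session is configured. This assumes the PyHive package; the host and credentials are placeholders.
from pyhive import hive  # third-party HiveServer2 client (pip install pyhive)

# The configuration dict is applied when the session is opened, before any statements run
conn = hive.connect(
    host="hiveserver2-host",  # placeholder HiveServer2 host
    port=10000,
    username="hive",
    configuration={"hive.stats.autogather": "false"},
)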