ERROR : FAILED: Error in acquiring locks: Error communicating with the metastore org.apache.hadoop.hive.ql.lockmgr.LockException - hive

I am getting "Error in acquiring locks" when running count(*) on a partitioned table.
The table has 365 partitions. When the query is filtered to 350 or fewer partitions, it works fine.
When I include more partitions in the query, it fails with the error above.
I am working with Hive-managed ACID tables, with the following default values:
hive.support.concurrency=true // cannot set this to false: it throws "<table> is missing from the ValidWriteIdList config: null"; it must be true for ACID reads and writes.
hive.lock.manager=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.txn.strict.locking.mode=false
hive.exec.dynamic.partition.mode=nonstrict
I tried increasing and decreasing the following values in a beeline session:
hive.lock.numretries
hive.unlock.numretries
hive.lock.sleep.between.retries
hive.metastore.batch.retrieve.max={default 300} //changed to 10000
hive.metastore.server.max.message.size={default 104857600} // changed to 10485760000
hive.metastore.limit.partition.request={default -1} //did not change as -1 is unlimited
hive.lock.query.string.max.length={default 10000} //changed to higher value
I am using an HDI 4.0 interactive-query (LLAP) cluster; the metastore is backed by the default SQL Server database provisioned with it.

The problem is NOT due to the service tier of the Hive metastore database.
Based on the symptom, it is most probably caused by too many partitions in one query.
I have met the same issue several times.
In hivemetastore.log, you should be able to see an error like this:
metastore.RetryingHMSHandler: MetaException(message:Unable to update transaction database com.microsoft.sqlserver.jdbc.SQLServerException: The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:578)
This happens because, in the Hive metastore, each partition involved in the query requires up to 8 parameters to acquire a lock.
Some possible workarounds:
Decompose the query into multiple sub-queries that each read from fewer partitions.
Reduce the number of partitions by choosing different partition keys.
Remove partitioning if the partition keys are never used in filters.
Lower the batch size of the INSERT queries generated by direct SQL. The two parameters below control it, and their default value is 1000. Set both to 100 (a good starting point) in the Custom hive-site section of the Hive configs via Ambari, then restart ALL Hive-related components (including the Hive metastore).
hive.direct.sql.max.elements.values.clause=100
hive.direct.sql.max.elements.in.clause=100
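The first workaround (splitting one query into several that each touch fewer partitions) could be sketched roughly as follows. This is only an illustration: the table name, the partition column dt, the partition values, and the batch size of 200 are all made up, and the helper names are not part of Hive.

```python
from typing import Iterator, List, Sequence


def chunk_partitions(partitions: Sequence[str], max_per_query: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most max_per_query partition values."""
    if max_per_query < 1:
        raise ValueError("max_per_query must be at least 1")
    for start in range(0, len(partitions), max_per_query):
        yield list(partitions[start:start + max_per_query])


def build_count_queries(table: str, partition_col: str,
                        partitions: Sequence[str], max_per_query: int) -> List[str]:
    """Build one COUNT(*) statement per batch of partition values.

    Values are interpolated directly, so only use this with trusted input.
    """
    queries = []
    for batch in chunk_partitions(partitions, max_per_query):
        values = ", ".join("'{}'".format(v) for v in batch)
        queries.append(
            "SELECT COUNT(*) FROM {} WHERE {} IN ({})".format(table, partition_col, values)
        )
    return queries


# Hypothetical example: 365 partitions, at most 200 per query.
days = ["day_{:03d}".format(i) for i in range(1, 366)]
queries = build_count_queries("my_db.my_table", "dt", days, 200)
```

Each generated statement would then be run on its own, with the per-batch counts added together on the client side.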

We also faced the same error in HDInsight and after doing many configuration changes similar to what you have done, the only thing that worked is scaling our Hive Metastore SQL DB server.
We had to scale it all the way to a P2 tier with 250 DTUs for our workloads to run without these Lock Exceptions. As you may know, the SQL server's IOPS and response time improve with the tier and DTU count, so we suspected that metastore performance was the root cause of these Lock Exceptions as the workload increased.
The following link provides information about DTU-based performance variation in SQL servers in Azure.
https://learn.microsoft.com/en-us/azure/sql-database/sql-database-service-tiers-dtu
Additionally, as far as I know, the default Hive metastore that gets provisioned when you do not provide an external DB at cluster creation is just an S1 tier DB. This would not be suitable for any high-capacity workload. As a best practice, always provision your metastore external to the cluster and attach it at cluster provisioning time. This gives you the flexibility to connect the same metastore to multiple clusters (so that your Hive schema can be shared across them, e.g. Hadoop for ETL and Spark for processing / machine learning), and you have full control to scale your metastore up or down at any time.
The only way to scale the default metastore is by engaging the Microsoft support.

We faced the same issue in HDInsight. We solved it by upgrading the metastore.
The default metastore had only 5 DTUs, which is not recommended for production environments. So we migrated to a custom metastore, spun up an Azure SQL Server database (P2 or higher, 250 DTUs and up), and set the properties below:
hive.direct.sql.max.elements.values.clause=200
hive.direct.sql.max.elements.in.clause=200
These values are set because SQL Server cannot process more than 2100 parameters in one request. Once a query touches more than about 348 partitions you hit this issue, because each partition creates 8 parameters for the metastore (8 x 348 = 2784, which is above the limit).
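The arithmetic behind that limit can be sketched as below. It takes the figure of 8 parameters per partition from the answers at face value, and it assumes that a batch-size setting of 200 caps the number of elements per generated statement; neither is verified against the Hive source here, and the function names are invented for the example.

```python
SQL_SERVER_MAX_PARAMETERS = 2100  # limit quoted in the SQLServerException message


def max_partitions_per_statement(params_per_partition: int,
                                 limit: int = SQL_SERVER_MAX_PARAMETERS) -> int:
    """Largest number of partitions whose parameters still fit in one request."""
    return limit // params_per_partition


def exceeds_limit(partitions: int, params_per_partition: int,
                  limit: int = SQL_SERVER_MAX_PARAMETERS) -> bool:
    """True if one statement covering all partitions would exceed the limit."""
    return partitions * params_per_partition > limit


# With 8 parameters per partition (assumption taken from the answers):
fits = max_partitions_per_statement(8)   # 2100 // 8
too_many = exceeds_limit(348, 8)         # 348 * 8 = 2784 parameters
batched_ok = not exceeds_limit(200, 8)   # 200 * 8 = 1600 parameters
```

Under these assumptions a batch of 200 elements stays at 1600 parameters, which is why lowering the batch size keeps each request under the limit.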

Related

CETAS times out for large tables in Synapse Serverless SQL

I'm trying to create a new external table using CETAS (CREATE EXTERNAL TABLE AS SELECT * FROM <table>) statement from an already existing external table in Azure Synapse Serverless SQL Pool. The table I'm selecting from is a very large external table built on around 30 GB of data in parquet format stored in ADLS Gen 2 storage but the query always times out after about 30 minutes. I've tried using premium storage and also tried out most if not all the suggestions made here as well but it didn't help and the query still times out.
The error I get in Synapse Studio is:
Statement ID: {550AF4B4-0F2F-474C-A502-6D29BAC1C558} | Query hash: 0x2FA8C2EFADC713D | Distributed request ID: {CC78C7FD-ED10-4CEF-ABB6-56A3D4212A5E}. Total size of data scanned is 0 megabytes, total size of data moved is 0 megabytes, total size of data written is 0 megabytes. Query timeout expired.
The core use case is that assuming I only have the external table name, I want to create a copy of the data over which that external table is created in Azure storage itself.
Is there a way to resolve this timeout issue or a better way to solve the problem?
This is a limitation of Serverless.
Query timeout expired
The error Query timeout expired is returned if the query executed more
than 30 minutes on serverless SQL pool. This is a limit of serverless
SQL pool that cannot be changed. Try to optimize your query by
applying best practices, or try to materialize parts of your queries
using CETAS. Check is there a concurrent workload running on the
serverless pool because the other queries might take the resources. In
that case you might split the workload on multiple workspaces.
Self-help for serverless SQL pool - Query Timeout Expired
The core use case is that assuming I only have the external table name, I want to create a copy of the data over which that external table is created in Azure storage itself.
It's simple to do in a Data Factory copy job, a Spark job, or AzCopy.

Is it possible to reduce the number of MetaStore checks when querying a Hive table with lots of columns?

I am using spark sql on databricks, which uses a Hive metastore, and I am trying to set up a job/query that uses quite a few columns (20+).
The amount of time it takes to run the metastore validation checks is scaling linearly with the number of columns included in my query - is there any way to skip this step? Or pre-compute the checks? Or to at least make the metastore only check once per table rather than once per column?
A small example is that when I run the below, even before calling display or collect, the metastore checker happens once:
new_table = table.withColumn("new_col1", F.col("col1"))
and when I run the below, the metastore checker happens multiple times, and therefore takes longer:
new_table = (table
.withColumn("new_col1", F.col("col1"))
.withColumn("new_col2", F.col("col2"))
.withColumn("new_col3", F.col("col3"))
.withColumn("new_col4", F.col("col4"))
.withColumn("new_col5", F.col("col5"))
)
The metastore checks it's doing look like this in the driver node:
20/01/09 11:29:24 INFO HiveMetaStore: 6: get_database: xxx
20/01/09 11:29:24 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: xxx
The view to the user on databricks is:
Performing Hive catalog operation: databaseExists
Performing Hive catalog operation: tableExists
Performing Hive catalog operation: getRawTable
Running command...
I would be interested to know if anyone can confirm that this is just the way it works (a metastore check per column), and if I have to just plan for the overhead of the metastore checks.
I am surprised by this behavior as it does not fit with the Spark processing model and I cannot replicate it in Scala. It is possible that it is somehow specific to PySpark but I doubt that since PySpark is just an API for creating Spark plans.
What is happening, however, is that after every withColumn(...) the plan is analyzed. If the plan is large, this can take a while. There is a simple optimization, however. Replace multiple withColumn(...) calls for independent columns with a single df.select(F.col("*"), F.col("col2").alias("new_col2"), ...) (in Scala, use .as("new_col2") instead of .alias). In this case, only a single analysis will be performed.
In some cases of extremely large plans, we've saved 10+ minutes of analysis for a single notebook cell.
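As a rough sketch of the single-projection idea, the helper below builds the expression list for one call. Note that it uses DataFrame.selectExpr with SQL strings rather than the select with Column objects described above, purely so the helper itself runs without a Spark session; build_select_exprs is a made-up name, and the column names are the ones from the question.

```python
from typing import Dict, List


def build_select_exprs(new_columns: Dict[str, str]) -> List[str]:
    """Return expressions for one selectExpr call: all existing columns plus aliases.

    new_columns maps each new column name to the existing column it copies.
    """
    exprs = ["*"]
    for new_name, source_col in new_columns.items():
        # Backticks quote identifiers in Spark SQL.
        exprs.append("`{}` AS `{}`".format(source_col, new_name))
    return exprs


exprs = build_select_exprs(
    {"new_col{}".format(i): "col{}".format(i) for i in range(1, 6)}
)

# With a PySpark DataFrame named `table` (as in the question), the five
# withColumn calls would collapse into a single projection:
#   new_table = table.selectExpr(*exprs)
```

Because all five columns are added in one projection, the plan is only analyzed once for the whole set instead of once per added column.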

Is temp DB scalable in azure SQL DB

I’m trying to bulk import a very large table (75G) into the azure SQL DB (pricing tier P6 premium 1000 DTUs), it failed with the following error message
“Msg 40544, Level 17, State 2, Line 179
The database ‘tempdb’ has reached its size quota. Partition or delete data, drop indexes, or consult the documentation for possible resolutions.”
I looked up a few blogs where they are suggesting an increase in the tier. I was wondering if I could just scale up tempdb without having to increase the tier itself.
Right now I’m chunking up the file into smaller volumes to load it but if I have to build index on this table after loading, I’m pretty sure it would fail again with the same error message.
Any thoughts??
No. You have no direct control over TempDB and its behavior. However, as you scale up your service tier, it's my understanding that your thresholds within TempDB also go up.

How to duplicate a Postgres database on the same RDS instance faster?

thank you guys in advance.
I have a 60GB Postgres RDS instance on AWS with a database called databaseA in it, and I want to make a duplicate of databaseA called databaseB on the same RDS server.
What I tried is to run CREATE DATABASE databaseB WITH TEMPLATE databaseA OWNER postgres; This single query took 6 hours to complete, which is too slow. The max IOPS I see during the process is 120, not even close to the 10,000 IOPS limit of AWS general-purpose SSD. I have also tried tuning work_mem, shared_buffers, and effective_cache_size in the parameter group, with no improvement at all.
My last option is to just create two separate RDS instances, but it would be much easier if I could do this in one instance. I'd appreciate any suggestions.
(The instance class is db.m4.xlarge)
As mentioned by Matt, you have two options:
1. Increase your server size, which will give you more IOPS.
2. Switch to provisioned IOPS.
As this is a temporary requirement, I would go with option 1: you can upgrade to the largest available server, do the database copy, and then downgrade the server again, which is seamless and won't take much time. Switching from SSD to provisioned IOPS will take a lot of time because your data has to be converted, which means more downtime. Switching back from provisioned IOPS to SSD later will take time again.
Note that both 1 and 2 are expensive if used long term (if you don't really need them), so you can't leave the instance as is.

Hive Update and Delete

I am using Hive 1.0.0, Hadoop 2.6.0, and the Cloudera ODBC driver. When I try to update or delete data in the Hive database through the Cloudera Hive ODBC driver, it throws an error. Details are below.
What have I done?
CREATE:
create database geometry;
create table odbctest (EmployeeID Int,FirstName String,Designation String, Salary Int,Department String)
clustered by (department)
into 3 buckets
stored as orcfile
TBLPROPERTIES ('transactional'='true');
Table created.
INSERT:
insert into table geometry.odbctest values(10,'Hive','Hive',0,'B');
With the above query, the data is inserted into the table.
UPDATE:
When I try to update, I get the following error:
update geometry.odbctest set salary = 50000 where employeeid = 10;
SQL> update geometry.odbctest set salary = 50000 where employeeid = 10;
[S1000][Cloudera][HiveODBC] (55) Insert operation is not support for table: HIVE.geometry.odbctest
[ISQL]ERROR: Could not SQLPrepare
DELETE:
When I try to delete, I get the following error:
delete from geometry.odbctest where employeeid=10;
SQL> delete from geometry.odbctest where employeeid=10;
[S1000][Cloudera][HiveODBC] (55) Insert operation is not support for table: HIVE.geometry.odbctest
[ISQL]ERROR: Could not SQLPrepare
Can anyone help me out?
You have done a couple of required steps properly:
ORC format
Bucketed table
A likely cause would be: one or more of the following hive settings were not included:
These configuration parameters must be set appropriately to turn on
transaction support in Hive:
hive.support.concurrency – true
hive.enforce.bucketing – true
hive.exec.dynamic.partition.mode – nonstrict
hive.txn.manager – org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on – true (for exactly one instance of the Thrift metastore service)
hive.compactor.worker.threads – a positive number on at least one instance of the Thrift metastore service
The full requirements for transaction support are here: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
If you have verified the above settings are in place then do a
describe extended odbctest;
To evaluate its transaction related characteristics.
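A quick way to compare a cluster's effective values against the list above might look like the sketch below. It only checks a plain dictionary; how the values are collected (for example from the output of SET in a Hive session) is left out, and check_transaction_settings is a hypothetical helper, not something Hive provides.

```python
from typing import Dict, List

# Expected values, copied from the settings listed above.
REQUIRED_SETTINGS = {
    "hive.support.concurrency": "true",
    "hive.enforce.bucketing": "true",
    "hive.exec.dynamic.partition.mode": "nonstrict",
    "hive.txn.manager": "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager",
    "hive.compactor.initiator.on": "true",
}


def check_transaction_settings(actual: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; empty if everything matches."""
    problems = []
    for key, expected in REQUIRED_SETTINGS.items():
        value = actual.get(key)
        if value is None:
            problems.append("{} is not set (expected {})".format(key, expected))
        elif value.strip().lower() != expected.lower():
            problems.append("{} is {} (expected {})".format(key, value, expected))
    # The worker thread count only needs to be a positive number.
    try:
        threads_ok = int(actual.get("hive.compactor.worker.threads", "0")) > 0
    except ValueError:
        threads_ok = False
    if not threads_ok:
        problems.append("hive.compactor.worker.threads must be a positive number")
    return problems


# Hypothetical, partially configured cluster:
example = {
    "hive.support.concurrency": "true",
    "hive.exec.dynamic.partition.mode": "strict",
    "hive.txn.manager": "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager",
}
problems = check_transaction_settings(example)
```

Note that the compactor settings apply to the metastore service instances, so their values would need to be read from there rather than from a client session.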
I stumbled across the same issue when connecting to Hive 1.2 using the Simba ODBC driver distributed by Cloudera (v 2.5.12.1005 64-bit). After verifying everything in javadba's post, I did some additional digging and found the problem to be a bug in the ODBC driver.
I was able to resolve the issue by using the Progress DataDirect driver, and it looks like the version of the driver distributed by hortonworks works as well (links to both solutions below).
https://www.progress.com/data-sources/apache-hive-drivers-for-odbc-and-jdbc
http://hortonworks.com/hdp/addons/
Hope that helps anyone who may still be struggling!
You should not think about Hive as a regular RDBMS, Hive is better suited for batch processing over very large sets of immutable data.
Here is what you can find:
Hadoop is a batch processing system and Hadoop jobs tend to have high
latency and incur substantial overheads in job submission and
scheduling. As a result - latency for Hive queries is generally very
high (minutes) even when data sets involved are very small (say a few
hundred megabytes). As a result it cannot be compared with systems
such as Oracle where analyses are conducted on a significantly smaller
amount of data but the analyses proceed much more iteratively with the
response times between iterations being less than a few minutes. Hive
aims to provide acceptable (but not optimal) latency for interactive
data browsing, queries over small data sets or test queries.
Hive is not designed for online transaction processing and does not
offer real-time queries and row level updates. It is best used for
batch jobs over large sets of immutable data (like web logs).
As of now Hive does not support Update and Delete Operations on the data in HDFS.
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions