Hive Update and Delete

I am using Hive 1.0.0, Hadoop 2.6.0, and the Cloudera ODBC driver. When I try to update or delete data in a Hive database through the Cloudera Hive ODBC driver, it throws an error. The steps and the errors are below.
What I have done:
CREATE:
create database geometry;
create table odbctest (EmployeeID Int,FirstName String,Designation String, Salary Int,Department String)
clustered by (department)
into 3 buckets
stored as orcfile
TBLPROPERTIES ('transactional'='true');
Table created.
INSERT:
insert into table geometry.odbctest values(10,'Hive','Hive',0,'B');
The above query inserts the data into the database successfully.
UPDATE:
When I try to update, I get the following error:
update geometry.odbctest set salary = 50000 where employeeid = 10;
SQL> update geometry.odbctest set salary = 50000 where employeeid = 10;
[S1000][Cloudera][HiveODBC] (55) Insert operation is not support for
table: HIVE.geometry.odbctest
[ISQL]ERROR: Could not SQLPrepare
DELETE:
When I try to delete, I get the following error:
delete from geometry.odbctest where employeeid=10;
SQL> delete from geometry.odbctest where employeeid=10;
[S1000][Cloudera][HiveODBC] (55) Insert operation is not support for table: HIVE.geometry.odbctest
[ISQL]ERROR: Could not SQLPrepare
Can anyone help me out?

You have done a couple of required steps properly:
ORC format
Bucketed table
A likely cause is that one or more of the following Hive settings were not set:
These configuration parameters must be set appropriately to turn on
transaction support in Hive:
hive.support.concurrency – true
hive.enforce.bucketing – true
hive.exec.dynamic.partition.mode – nonstrict
hive.txn.manager – org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on – true (for exactly one instance of the Thrift metastore service)
hive.compactor.worker.threads – a positive number on at least one instance of the Thrift metastore service
The full requirements for transaction support are here: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
If you have verified that the above settings are in place, then run
describe extended odbctest;
to evaluate its transaction-related characteristics.
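As a rough sketch, the settings listed above could be rendered as hive-site.xml property entries with a few lines of Python. The property names come from the list above; the value 1 for hive.compactor.worker.threads is only an assumption (the list just asks for a positive number), and where each property should live (client vs. metastore configuration) depends on your deployment.

```python
from xml.sax.saxutils import escape

# Property names are taken from the list above.
TXN_SETTINGS = {
    "hive.support.concurrency": "true",
    "hive.enforce.bucketing": "true",
    "hive.exec.dynamic.partition.mode": "nonstrict",
    "hive.txn.manager": "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager",
    "hive.compactor.initiator.on": "true",
    # Assumption: any positive number satisfies the requirement; 1 is a placeholder.
    "hive.compactor.worker.threads": "1",
}


def render_properties(settings):
    """Render a dict of settings as hive-site.xml <property> fragments."""
    lines = []
    for name, value in settings.items():
        lines.append("<property>")
        lines.append(f"  <name>{escape(name)}</name>")
        lines.append(f"  <value>{escape(value)}</value>")
        lines.append("</property>")
    return "\n".join(lines)


print(render_properties(TXN_SETTINGS))
```

The output is meant to be pasted inside the existing <configuration> element of hive-site.xml, followed by a restart of the affected Hive services.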

I stumbled across the same issue when connecting to Hive 1.2 using the Simba ODBC driver distributed by Cloudera (v 2.5.12.1005 64-bit). After verifying everything in javadba's post, I did some additional digging and found the problem to be a bug in the ODBC driver.
I was able to resolve the issue by using the Progress DataDirect driver, and it looks like the version of the driver distributed by Hortonworks works as well (links to both solutions below).
https://www.progress.com/data-sources/apache-hive-drivers-for-odbc-and-jdbc
http://hortonworks.com/hdp/addons/
Hope that helps anyone who may still be struggling!

You should not think of Hive as a regular RDBMS; Hive is better suited for batch processing over very large sets of immutable data.
Here is what you can find:
Hadoop is a batch processing system and Hadoop jobs tend to have high
latency and incur substantial overheads in job submission and
scheduling. As a result - latency for Hive queries is generally very
high (minutes) even when data sets involved are very small (say a few
hundred megabytes). As a result it cannot be compared with systems
such as Oracle where analyses are conducted on a significantly smaller
amount of data but the analyses proceed much more iteratively with the
response times between iterations being less than a few minutes. Hive
aims to provide acceptable (but not optimal) latency for interactive
data browsing, queries over small data sets or test queries.
Hive is not designed for online transaction processing and does not
offer real-time queries and row level updates. It is best used for
batch jobs over large sets of immutable data (like web logs).

As of now Hive does not support Update and Delete Operations on the data in HDFS.
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

Related

ERROR : FAILED: Error in acquiring locks: Error communicating with the metastore org.apache.hadoop.hive.ql.lockmgr.LockException

I get the "Error in acquiring locks" error when running count(*) on partitioned tables.
The table has 365 partitions. When the filter covers <= 350 partitions, the queries work fine.
When I include more partitions in the query, it fails with this error.
I am working on Hive-managed ACID tables, with the following default values:
hive.support.concurrency=true //cannot make it as false, it's throwing <table> is missing from the ValidWriteIdList config: null, should be true for ACID read and write.
hive.lock.manager=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.txn.strict.locking.mode=false
hive.exec.dynamic.partition.mode=nonstrict
I tried increasing and decreasing the values of the following settings in a beeline session:
hive.lock.numretries
hive.unlock.numretries
hive.lock.sleep.between.retries
hive.metastore.batch.retrieve.max={default 300} //changed to 10000
hive.metastore.server.max.message.size={default 104857600} // changed to 10485760000
hive.metastore.limit.partition.request={default -1} //did not change as -1 is unlimited
hive.lock.query.string.max.length={default 10000} //changed to higher value
I am using an HDI 4.0 interactive-query-llap cluster; the metastore is backed by the default SQL Server database provisioned with it.

The problem is NOT due to the service tier of the Hive metastore database.
Based on the symptom, it is most probably due to too many partitions in one query.
I have hit the same issue several times.
In hivemetastore.log, you should be able to see an error like this:
metastore.RetryingHMSHandler: MetaException(message:Unable to update transaction database com.microsoft.sqlserver.jdbc.SQLServerException: The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:578)
This happens because, in the Hive metastore, each partition involved in the query requires up to 8 parameters to acquire a lock.
Some possible workarounds:
Decompose the query into multiple sub-queries to read from fewer
partitions.
Reduce the number of partitions by setting different partition keys.
Remove partitioning if partition keys don't have any filters.
The following parameters control the batch size for INSERT queries generated by direct SQL. Their default value is 1000. Set both of them to 100 (as a good starting point) in the Custom hive-site section of the Hive configs via Ambari, and restart ALL Hive-related components (including the Hive metastore).
hive.direct.sql.max.elements.values.clause=100
hive.direct.sql.max.elements.in.clause=100
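To see why a smaller batch size helps, here is a back-of-the-envelope sketch in Python. It assumes the worst case quoted above (8 parameters per partition) and that one batch is sent as one parameterized request; both are simplifications, not a description of the metastore's actual SQL.

```python
SQL_SERVER_MAX_PARAMS = 2100  # limit quoted in the SQLServerException above
PARAMS_PER_PARTITION = 8      # "at most 8" per the answer; assumed worst case


def params_for_batch(batch_size, params_per_row=PARAMS_PER_PARTITION):
    """Number of bind parameters one batch would need under the assumption above."""
    return batch_size * params_per_row


def fits(batch_size):
    """True if a batch of this size stays within the SQL Server limit."""
    return params_for_batch(batch_size) <= SQL_SERVER_MAX_PARAMS


def max_safe_batch():
    """Largest batch size that still fits under the limit."""
    return SQL_SERVER_MAX_PARAMS // PARAMS_PER_PARTITION


for batch_size in (1000, 200, 100):
    print(batch_size, params_for_batch(batch_size), fits(batch_size))
print("largest safe batch:", max_safe_batch())
```

Under these assumptions the default of 1000 needs 8000 parameters and fails, while 100 or 200 stays well below 2100, which is consistent with the values suggested in this thread.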

We also faced the same error in HDInsight and after doing many configuration changes similar to what you have done, the only thing that worked is scaling our Hive Metastore SQL DB server.
We had to scale it all the way to a P2 tier with 250 DTUs for our workloads to run without these lock exceptions. As you may know, the SQL server's IOPS and response time improve with the tier and DTU count, so we suspected that metastore performance was the root cause of these lock exceptions as the workloads increased.
The following link provides information about DTU-based performance variation in SQL servers in Azure.
https://learn.microsoft.com/en-us/azure/sql-database/sql-database-service-tiers-dtu
Additionally, as far as I know, the default Hive metastore that gets provisioned when you opt not to provide an external DB at cluster creation is just an S1 tier DB. This would not be suitable for any high-capacity workloads. As a best practice, always provision your metastores external to the cluster and attach them at cluster provisioning time. This gives you the flexibility to connect the same metastore to multiple clusters (so that your Hive schema can be shared across them, e.g. Hadoop for ETL and Spark for processing / machine learning), and gives you full control to scale your metastore up or down at any time.
The only way to scale the default metastore is by engaging Microsoft support.

We faced the same issue in HDInsight. We solved it by upgrading the metastore.
The default metastore had only 5 DTUs, which is not recommended for production environments. So we migrated to a custom metastore, spun up an Azure SQL Server instance (P2 or above, 250+ DTUs), and set the properties below:
hive.direct.sql.max.elements.values.clause=200
hive.direct.sql.max.elements.in.clause=200
These values are set because SQL Server cannot process more than 2100 parameters in one request. Once a query touches more than 348 partitions you hit this issue, because each partition creates 8 parameters for the metastore (8 x 348 = 2784 parameters).

Is it possible to reduce the number of MetaStore checks when querying a Hive table with lots of columns?

I am using spark sql on databricks, which uses a Hive metastore, and I am trying to set up a job/query that uses quite a few columns (20+).
The amount of time it takes to run the metastore validation checks is scaling linearly with the number of columns included in my query - is there any way to skip this step? Or pre-compute the checks? Or to at least make the metastore only check once per table rather than once per column?
A small example is that when I run the below, even before calling display or collect, the metastore checker happens once:
new_table = table.withColumn("new_col1", F.col("col1"))
and when I run the below, the metastore checker happens multiple times, and therefore takes longer:
new_table = (table
.withColumn("new_col1", F.col("col1"))
.withColumn("new_col2", F.col("col2"))
.withColumn("new_col3", F.col("col3"))
.withColumn("new_col4", F.col("col4"))
.withColumn("new_col5", F.col("col5"))
)
The metastore checks it's doing look like this in the driver node:
20/01/09 11:29:24 INFO HiveMetaStore: 6: get_database: xxx
20/01/09 11:29:24 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: xxx
The view to the user on databricks is:
Performing Hive catalog operation: databaseExists
Performing Hive catalog operation: tableExists
Performing Hive catalog operation: getRawTable
Running command...
I would be interested to know if anyone can confirm that this is just the way it works (a metastore check per column), and if I have to just plan for the overhead of the metastore checks.

I am surprised by this behavior as it does not fit with the Spark processing model and I cannot replicate it in Scala. It is possible that it is somehow specific to PySpark but I doubt that since PySpark is just an API for creating Spark plans.
What is happening, however, is that after every withColumn(...) the plan is analyzed. If the plan is large, this can take a while. There is a simple optimization, however. Replace multiple withColumn(...) calls for independent columns with df.select(F.col("*"), F.col("col2").alias("new_col2"), ...) (in Scala, use .as(...) instead of .alias(...)). In this case, only a single analysis will be performed.
In some cases of extremely large plans, we've saved 10+ minutes of analysis for a single notebook cell.
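A minimal PySpark-flavoured sketch of that idea, using selectExpr so the whole projection is built in one call. The helper names (build_select_exprs, add_columns_in_one_select) are hypothetical, and the column names simply mirror the question; df is assumed to be an existing Spark DataFrame.

```python
def build_select_exprs(new_to_source):
    """Return selectExpr arguments: every existing column plus the aliased copies."""
    exprs = ["*"]
    for new_name, source_name in new_to_source.items():
        # Backticks guard against column names that need quoting in Spark SQL.
        exprs.append(f"`{source_name}` AS `{new_name}`")
    return exprs


def add_columns_in_one_select(df, new_to_source):
    """Add all derived columns with a single selectExpr call instead of chained withColumn calls."""
    return df.selectExpr(*build_select_exprs(new_to_source))


# Mirrors the five columns from the question.
new_to_source = {f"new_col{i}": f"col{i}" for i in range(1, 6)}
print(build_select_exprs(new_to_source))
```

With a real DataFrame this would be called as new_table = add_columns_in_one_select(table, new_to_source). Whether it removes the repeated metastore calls seen in the question is not something this sketch can confirm; it only avoids the per-withColumn analysis step described above.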

Hive Insert-Only Transactional Tables

What are the specific benefits of using a Hive Insert-Only Transactional Table? Most of the documentation just indicates that if you don't need Delete or Alter functionality, then create this table. Does this speed up processing? Reduce Overhead?

Currently, full ACID tables are only supported in the ORC file format. Micromanaged (a.k.a. insert-only) transactional tables support any other storage format.
So, if you have all your tables stored in ORC format, you can go ahead with full ACID. If you have other storage types, and you need to be able to do INSERT statements, micromanaged tables can help you there.
Also: for full ACID tables, compaction is done by a MapReduce job. You can configure Hive to use the query based compactor for major compactions (as in creating a new base), but minor compactions (as in merging delta files) are still done with MR, and MR only.
For micromanaged tables, the compaction is query based. So, if you are using Hive on Tez, or Hive on Spark, and you do not want to have MR at all, that is fine. But for full ACID tables, if you want minor compactions, you'll need MapReduce.
Insider note: query based minor compaction for full ACID tables will be supported really soon, and I am pretty sure Parquet is going to support ACID tables very soon.

What you read everywhere is that unlike full transactional tables, "insert-only transactional tables" support data insert operations only.
But that doesn't say much. What one wants to know is:
What is a transactional insert operation ?
To say that an operation is a transaction means basically that it follows the ACID principles and especially its most important property: atomicity (the A from ACID).
In his great book Designing Data-Intensive Applications, Martin Kleppmann explains the atomicity property well:
"if a transaction was aborted, the application can be sure that it
didn’t change anything, so it can safely be retried.
The ability to abort a transaction on error and have all writes from that transaction discarded is the defining feature of ACID atomicity."
Designing Data-Intensive Applications, March 2017, 1st edition, chapter 7, p. 234
In Hive, this is done by creating a delta directory for each insert transaction, which keeps the new data isolated until the transaction is completed. If there is an error, the directory is deleted; otherwise, it is appended to the table.
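The behaviour described above can be mimicked with a toy model in Python. This is not Hive's actual implementation: the directory names, the staging-then-rename step, and the file layout are all assumptions made for illustration. It only shows the property that matters, namely that a failed insert leaves nothing behind and can be retried safely.

```python
import os
import shutil
import tempfile


def insert_rows(table_dir, txn_id, rows, fail=False):
    """Write rows for one insert transaction; leave no trace on failure."""
    # Hypothetical names: a hidden staging directory and a visible delta directory.
    staging = os.path.join(table_dir, f".staging_delta_{txn_id:07d}")
    final = os.path.join(table_dir, f"delta_{txn_id:07d}")
    os.makedirs(staging)
    try:
        with open(os.path.join(staging, "part-0.txt"), "w") as f:
            for row in rows:
                f.write(row + "\n")
            if fail:
                raise RuntimeError("simulated failure mid-write")
        os.rename(staging, final)  # commit: readers can now see the delta
    except Exception:
        shutil.rmtree(staging, ignore_errors=True)  # abort: discard all writes
        raise


def read_table(table_dir):
    """Read only committed deltas; staging directories are ignored."""
    rows = []
    for name in sorted(os.listdir(table_dir)):
        if name.startswith("delta_"):
            with open(os.path.join(table_dir, name, "part-0.txt")) as f:
                rows.extend(line.rstrip("\n") for line in f)
    return rows


def demo():
    table_dir = tempfile.mkdtemp()
    try:
        insert_rows(table_dir, 1, ["a", "b"])
        try:
            insert_rows(table_dir, 2, ["c"], fail=True)  # aborted insert
        except RuntimeError:
            pass
        insert_rows(table_dir, 2, ["c"])  # the retry is safe
        return read_table(table_dir)
    finally:
        shutil.rmtree(table_dir)


print(demo())
```

The aborted insert contributes no rows, so the retry does not create duplicates.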

How to handle hive locking across hive and presto

I have a few Hive tables that are written with insert-overwrite from Spark and Hive. Those tables are also accessed by analysts on Presto. Naturally, we're running into windows of time in which users hit an incomplete data set because Presto ignores locks.
The options I can think of:
Fork the presto-hive connector to support hive S and X locks appropriately. This isn't too bad, but time consuming to do properly.
Swap the table location on the hive metastore once an insert overwrite is complete. This is OK, but a little messy because we like to store explicit locations at the database level and let the tables inherit location.
Stop doing insert-overwrite on these tables and instead just add a new partition for the things that have changed, then once a new partition is written, alter the hive table to see it. Then we can just have views on top of the data that will properly reconcile the latest version of each row.
Stop doing insert-overwrite on s3 which has a long window of copy from hive staging to the target table. If we move to hdfs for all insert-overwrite, we still have the issue, but it's over the span of time that it takes to do a hdfs mv which is significantly faster. (probably bad: there's still a window that we can get incomplete data)
My question is how do people generally handle that? It seems like a common scenario that would have an explicit solution, but I seem to be missing it. This can be asked in general for any third party tool that can query the hive metastore and interact with the hdfs/s3 directly while not respecting hive locks.

Hive analyze compute stats query failing

I'm running Hive 1.0, trying to compute column statistics using the built-in analyze command. HQL script looks like:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
use db;
analyze table tbl compute statistics for columns;
Which kicks off a map-only MR task as expected. The job runs to 100% for both map and reduce, then reports:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.ColumnStatsTask
But the job is registered as a SUCCESS.
Googling led me to this JIRA ticket, but the resolution says the problem is resolved in Hive 0.14. Is there something simple I'm missing in the query?
EDIT: Five and a half years later, I've changed jobs and industries twice, picked up Spark and then abandoned Hadoop altogether in all my workflows, and the world aligned around efficient cloud data lakes that don't require a new query language. Hive is a distant memory for me, but I hope the other answer seekers found sufficient workarounds. I don't think I ever did.
