Hive analyze compute stats query failing - hive

I'm running Hive 1.0, trying to compute column statistics using the built-in analyze command. HQL script looks like:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
use db;
analyze table tbl compute statistics for columns;
Which kicks off a map-only MR task as expected. The job runs to 100% for both map and reduce, then reports:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.ColumnStatsTask
But the job is registered as a SUCCESS.
Googling led me to this JIRA ticket, but the resolution says the problem is resolved in Hive 0.14. Is there something simple I'm missing in the query?
EDIT: Five and a half years later, I've changed jobs and industries twice, picked up Spark and then abandoned Hadoop altogether in all my workflows, and the world aligned around efficient cloud data lakes that don't require a new query language. Hive is a distant memory for me, but I hope the other answer seekers found sufficient workarounds. I don't think I ever did.

Related

How to force MR execution when running simple Hive query?

There is Hive 2.1.1 over MR, table test_table stored as sequencefile and the following ad-hoc query:
select t.*
from test_table t
where t.test_column = 100
Although this query can be executed without starting MR (fetch task), sometimes it takes longer to scan HDFS files rather than triggering a single map job.
When I want to enforce MR execution, I make the query more complex: e.g., using distinct. The significant drawbacks of this approach are:
Query results may differ from the original query's
Brings meaningless calculation load on the cluster
Is there a recommended way to force MR execution when using Hive-on-MR?
The hive executor decides either to execute map task or fetch task depending on the following settings (with defaults):
hive.fetch.task.conversion ("more") — the strategy for converting MR tasks into fetch tasks
hive.fetch.task.conversion.threshold (1 GB) — max size of input data that can be fed to a fetch task
hive.fetch.task.aggr (false) — when set to true, queries like select count(*) from src also can be executed in a fetch task
It prompts me the following two options:
set hive.fetch.task.conversion.threshold to a lower value, e.g. 512 Mb
set hive.fetch.task.conversion to "none"
For some reason lowering the threshold did not change anything in my case, so I stood with the second option: seems fine for ad-hoc queries.
More details regarding these settings can be found in Cloudera forum and Hive wiki.
Just add set hive.execution.engine=mr; before your query and it will enforce Hive to use MR.

Is it possible to reduce the number of MetaStore checks when querying a Hive table with lots of columns?

I am using spark sql on databricks, which uses a Hive metastore, and I am trying to set up a job/query that uses quite a few columns (20+).
The amount of time it takes to run the metastore validation checks is scaling linearly with the number of columns included in my query - is there any way to skip this step? Or pre-compute the checks? Or to at least make the metastore only check once per table rather than once per column?
A small example is that when I run the below, even before calling display or collect, the metastore checker happens once:
new_table = table.withColumn("new_col1", F.col("col1")
and when I run the below, the metastore checker happens multiple times, and therefore takes longer:
new_table = (table
.withColumn("new_col1", F.col("col1")
.withColumn("new_col2", F.col("col2")
.withColumn("new_col3", F.col("col3")
.withColumn("new_col4", F.col("col4")
.withColumn("new_col5", F.col("col5")
)
The metastore checks it's doing look like this in the driver node:
20/01/09 11:29:24 INFO HiveMetaStore: 6: get_database: xxx
20/01/09 11:29:24 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: xxx
The view to the user on databricks is:
Performing Hive catalog operation: databaseExists
Performing Hive catalog operation: tableExists
Performing Hive catalog operation: getRawTable
Running command...
I would be interested to know if anyone can confirm that this is just the way it works (a metastore check per column), and if I have to just plan for the overhead of the metastore checks.
I am surprised by this behavior as it does not fit with the Spark processing model and I cannot replicate it in Scala. It is possible that it is somehow specific to PySpark but I doubt that since PySpark is just an API for creating Spark plans.
What is happening, however, is that after every withColumn(...) the plan is analyzed. If the plan is large, this can take a while. There is a simple optimization, however. Replace multiple withColumn(...) calls for independent columns with df.select(F.col("*"), F.col("col2").as("new_col2"), ...). In this case, only a single analysis will be performed.
In some cases of extremely large plans, we've saved 10+ minutes of analysis for a single notebook cell.

Mapreduce job not launching when running hive query with where clause

I am using apache-hive-1.2.2 on Hadoop 2.6.0. When am running a hive query with where clause it is giving results immediately without launching any MapReduce job. I'm not sure what is happening. Table has over 100k records.
I am quoting this from Hive Documentation
hive.fetch.task.conversion
Some select queries can be converted to a single FETCH task,
minimizing latency. Currently the query should be single sourced not
having any subquery and should not have any aggregations or distincts
(which incur RS – ReduceSinkOperator, requiring a MapReduce task),
lateral views and joins.
Any type of the sort of aggregation like max or min or count is going to require a MapReduce job. So it depends on your data-set you have.
select * from tablename;
It just reads raw data from files in HDFS, so it is much faster without MapReduce and it doesn't need MR.
This is due to the the property "hive.fetch.task.conversion". The default value is set to "more" (Hive 2.1.0) and results in Hive trying to go straight at the data by launching a single Fetch task instead of a Map Reduce job wherever possible.
This behaviour however might not be desirable in case you have a huge table (say 500 GB+) as it would cause a single thread to be launched instead of multiple threads as happens in the case of a Map Reduce job.
You can set this property to "minimal" or "none" in hive-site.xml to bypass the behaviour.

Hive Update and Delete

I am using Hive 1.0.0 Version and Hadoop 2.6.0 and Cloudera ODBC Driver. I am trying to Update and Delete the data in the hive database from Cloudera HiveOdbc Driver it throws an error. Here is my error.
What i have done ?
CREATE:
create database geometry;
create table odbctest (EmployeeID Int,FirstName String,Designation String, Salary Int,Department String)
clustered by (department)
into 3 buckets
stored as orcfile
TBLPROPERTIES ('transactional'='true');
Table created.
INSERT:
insert into table geometry.odbctest values(10,'Hive','Hive',0,'B');
By passing the above query the data is inserting into database.
UPDATE:
When i am trying to Update the following error is getting
update geometry.odbctest set salary = 50000 where employeeid = 10;
SQL> update geometry.odbctest set salary = 50000 where employeeid = 10;
[S1000][Cloudera][HiveODBC] (55) Insert operation is not support for
table: HIVE.geometry.odbctest
[ISQL]ERROR: Could not SQLPrepare
DELETE:
When i am trying to Delete the following error is getting
delete from geometry.odbctest where employeeid=10;
SQL> delete from geometry.odbctest where employeeid=10;
[S1000][Cloudera][HiveODBC] (55) Insert operation is not support for table: HIVE.geometry.odbctest
[ISQL]ERROR: Could not SQLPrepare
Can anyone help me out,
You have done a couple of required steps properly:
ORC format
Bucketed table
A likely cause would be: one or more of the following hive settings were not included:
These configuration parameters must be set appropriately to turn on
transaction support in Hive:
hive.support.concurrency – true
hive.enforce.bucketing – true
hive.exec.dynamic.partition.mode – nonstrict
hive.txn.manager – org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on – true (for exactly one instance of the Thrift metastore service)
hive.compactor.worker.threads – a positive number on at least one instance of the Thrift metastore service
The full requirements for transaction support are here: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
If you have verified the above settings are in place then do a
describe extended odbctest;
To evaluate its transaction related characteristics.
I stumbled across the same issue when connecting to Hive 1.2 using the Simba ODBC driver distributed by Cloudera (v 2.5.12.1005 64-bit). After verifying everything in javadba's post, I did some additional digging and found the problem to be a bug in the ODBC driver.
I was able to resolve the issue by using the Progress DataDirect driver, and it looks like the version of the driver distributed by hortonworks works as well (links to both solutions below).
https://www.progress.com/data-sources/apache-hive-drivers-for-odbc-and-jdbc
http://hortonworks.com/hdp/addons/
Hope that helps anyone who may still be struggling!
You should not think about Hive as a regular RDBMS, Hive is better suited for batch processing over very large sets of immutable data.
Here is what you can findenter link description here
Hadoop is a batch processing system and Hadoop jobs tend to have high
latency and incur substantial overheads in job submission and
scheduling. As a result - latency for Hive queries is generally very
high (minutes) even when data sets involved are very small (say a few
hundred megabytes). As a result it cannot be compared with systems
such as Oracle where analyses are conducted on a significantly smaller
amount of data but the analyses proceed much more iteratively with the
response times between iterations being less than a few minutes. Hive
aims to provide acceptable (but not optimal) latency for interactive
data browsing, queries over small data sets or test queries.
Hive is not designed for online transaction processing and does not
offer real-time queries and row level updates. It is best used for
batch jobs over large sets of immutable data (like web logs).
As of now Hive does not support Update and Delete Operations on the data in HDFS.
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

Pig step execution details

I am newbee to pig .
I have written a small script in pig , where in i first load the data from two different tables and further right outer join the two tables ,later also i have next join of tables for two different st of data .It works fine .But i want to see
the steps of execution , like in which step my data is loaded that way i can note the time
needed for loading later details of step for data joining like how much time it is
taking for these much records to be joined .
Basically i want to know which part of my pig script is taking longer time to run so
that way i can further optimize my pig script .
Anyway we could println within the script and find which steps got executed which has started to execute .
Through jobtracker details link i could not get much info , just could see mapper is running & reducer is running , but idealy mapper for which part of script is running could not find that.
For example for a hive job run we can see in the jobtracker details link which step is currently getting executed.
Any information will be really helpfull.
Thanks in advance .
I'd suggest you to have a look at the followings:
Pig's Progress Notification Listener
Penny : this is a monitoring tool but I'm afraid that it hasn't been updated in the recent past (e.g: it won't compile for Pig 0.12.0 unless you do some code changes)
Twitter's Ambrose project. https://github.com/twitter/ambrose
On the other, after executing the script you can see a detailed statistics about the execution time of each alias (see: Job Stats (time in seconds)).
Have a look at the EXPLAIN operator. This doesn't give you real-time stats as your code is executing, but it should give you enough information about the MapReduce plan your script generates that you'll be able to match up the MR jobs with the steps in your script.
Also, while your script is running you can inspect the configuration of the Hadoop job. Look at the variables "pig.alias" and "pig.job.feature". These tell you, respectively, which of your aliases (tables/relations) is involved in that job and what Pig operations are being used (e.g., HASH_JOIN for a JOIN step, SAMPLER or ORDER BY for an ORDER BY step, and so on). This information is also available in the job stats that are output to the console upon completion.